[NLP-1] SPOT: An Annotated French Corpus and Benchmark for Detecting Critical Interventions in Online Conversations
Link: https://arxiv.org/abs/2511.07405 Authors: Manon Berriche, Célia Nouri, Chloé Clavel, Jean-Philippe Cointet Affiliations: unknown Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
[NLP-2] SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards NEURIPS2025
Quick read: This paper addresses the limited spatial understanding of multimodal large language models (MLLMs), and in particular the constraints of existing methods that rely on explicit 3D inputs, architecture-specific modifications, large-scale data, or sparse supervision. The key to the solution is SpatialThinker, a 3D-aware MLLM trained with reinforcement learning (RL) that couples structured spatial grounding with multi-step reasoning. It makes two core contributions: constructing STVQA-7K, a high-quality spatial visual question answering dataset to support training, and designing an online RL scheme with a multi-objective dense spatial reward that reinforces accurate modeling of, and reasoning over, spatial relations. Experiments show that SpatialThinker-7B significantly outperforms supervised fine-tuning and sparse-RL baselines on spatial understanding and real-world VQA tasks, and approaches or even surpasses GPT-4o, validating the effectiveness of combining spatial supervision with reward-aligned reasoning.
Link: https://arxiv.org/abs/2511.07403 Authors: Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark Affiliations: University of Oxford; University of California, Santa Cruz Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG) Comments: Preprint. Accepted at NeurIPS 2025 Workshops on SPACE in Vision, Language, and Embodied AI (SpaVLE), Embodied World Models for Decision Making (EWM), Aligning Reinforcement Learning Experimentalists and Theorists (ARLET), and Scaling Environments for Agents (SEA)
Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language tasks, but they continue to struggle with spatial understanding. Existing spatial MLLMs often rely on explicit 3D inputs or architecture-specific modifications, and remain constrained by large-scale datasets or sparse supervision. To address these limitations, we introduce SpatialThinker, a 3D-aware MLLM trained with RL to integrate structured spatial grounding with multi-step reasoning. The model simulates human-like spatial perception by constructing a scene graph of task-relevant objects and spatial relations, and reasoning towards an answer via dense spatial rewards. SpatialThinker consists of two key contributions: (1) a data synthesis pipeline that generates STVQA-7K, a high-quality spatial VQA dataset, and (2) online RL with a multi-objective dense spatial reward enforcing spatial grounding. SpatialThinker-7B outperforms supervised fine-tuning and the sparse RL baseline on spatial understanding and real-world VQA benchmarks, nearly doubling the base-model gain compared to sparse RL, and surpassing GPT-4o. These results showcase the effectiveness of combining spatial supervision with reward-aligned reasoning in enabling robust 3D spatial understanding with limited data and advancing MLLMs towards human-level visual reasoning.
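The abstract does not specify the exact form of the multi-objective dense spatial reward; as an illustration only, a dense reward of this kind is often implemented as a weighted sum of an answer-correctness term and a grounding term such as box IoU. The weights, the IoU-based grounding term, and all function names below are assumptions for the sketch, not the authors' formulation:

```python
def box_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def dense_spatial_reward(pred_answer, gold_answer, pred_boxes, gold_boxes,
                         w_answer=0.6, w_ground=0.4):
    """Weighted sum of answer correctness and mean grounding IoU.

    Unlike a sparse reward (1 only for a correct final answer),
    partially correct grounding still earns a training signal,
    which is what makes the reward 'dense'.
    """
    r_answer = 1.0 if pred_answer == gold_answer else 0.0
    ious = [box_iou(p, g) for p, g in zip(pred_boxes, gold_boxes)]
    r_ground = sum(ious) / len(ious) if ious else 0.0
    return w_answer * r_answer + w_ground * r_ground
```

With this shape, a rollout that grounds the right objects but answers wrongly still receives the grounding fraction of the reward, instead of zero as under a sparse correctness-only reward.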
[NLP-3] ConvFill: Model Collaboration for Responsive Conversational Voice Agents
Link: https://arxiv.org/abs/2511.07397 Authors: Vidya Srinivas, Zachary Englhardt, Maximus Powers, Shwetak Patel, Vikram Iyer Affiliations: University of Washington Subjects: Computation and Language (cs.CL)
[NLP-4] Surgical Agent Orchestration Platform for Voice-directed Patient Data Interaction
Link: https://arxiv.org/abs/2511.07392 Authors: Hyeryun Park, Byung Mo Gu, Jun Hee Lee, Byeong Hyeon Choi, Sekeun Kim, Hyun Koo Kim, Kyungsang Kim Affiliations: Korea University; MGB Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Comments: 22 pages, 12 figures, 1 table, Supplementary Information, Supplementary Data 1
[NLP-5] Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence
Link: https://arxiv.org/abs/2511.07384 Authors: Sean McLeish, Ang Li, John Kirchenbauer, Dayal Singh Kalra, Brian R. Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Jonas Geiping, Tom Goldstein, Micah Goldblum Affiliations: University of Maryland; New York University; Lawrence Livermore National Laboratory; University of North Carolina; ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems, Tübingen AI Center; Columbia University Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Comments: code: this https URL, models: this https URL
[NLP-6] Retriv at BLP-2025 Task 2: Test-Driven Feedback-Guided Framework for Bangla-to-Python Code Generation
Link: https://arxiv.org/abs/2511.07382 Authors: K M Nafi Asib, Sourav Saha, Mohammed Moshiul Hoque Affiliations: Chittagong University of Engineering and Technology Subjects: Computation and Language (cs.CL) Comments: 8 pages, 1 figure, experimental scripts publicly available at this https URL
[NLP-7] Selecting Auxiliary Data via Neural Tangent Kernels for Low-Resource Domains
Link: https://arxiv.org/abs/2511.07380 Authors: Pingjie Wang, Hongcheng Liu, Yusheng Liao, Ziqing Fan, Yaxin Du, Shuo Tang, Yanfeng Wang, Yu Wang Affiliations: Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory Subjects: Computation and Language (cs.CL) Comments: 27 pages
[NLP-8] Self-Evaluating LLMs for Multi-Step Tasks: Stepwise Confidence Estimation for Failure Detection NEURIPS2025
Link: https://arxiv.org/abs/2511.07364 Authors: Vaibhav Mavi, Shubh Jaroria, Weiqi Sun Affiliations: Dyania Health Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Comments: Accepted at NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling
[NLP-9] IterResearch: Rethinking Long-Horizon Agents via Markovian State Reconstruction
Link: https://arxiv.org/abs/2511.07327 Authors: Guoxin Chen, Zile Qiao, Xuanzhong Chen, Donglei Yu, Haotian Xu, Wayne Xin Zhao, Ruihua Song, Wenbiao Yin, Huifeng Yin, Liwen Zhang, Kuan Li, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Tongyi Lab, Alibaba Group; OpenRLHF Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Comments: this https URL
[NLP-10] FinRpt: Dataset, Evaluation System and LLM-based Multi-agent Framework for Equity Research Report Generation AAAI2026
Link: https://arxiv.org/abs/2511.07322 Authors: Song Jin, Shuqi Li, Shukun Zhang, Rui Yan Affiliations: Wuhan University; Wuhan University People's Hospital Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Comments: AAAI 2026
[NLP-11] When Bias Pretends to Be Truth: How Spurious Correlations Undermine Hallucination Detection in LLMs
Quick read: This paper targets hallucinations in large language models (LLMs) caused by spurious correlations, i.e., superficial but statistically prominent feature-attribute associations in the training data (such as surname and nationality) that lead models to generate plausible yet incorrect answers. The key contribution is revealing the fundamental failure mode of existing detection methods (such as confidence-based filtering and internal-state probing) in the presence of such spurious correlations, verified through systematic synthetic experiments and evaluations of real models: these hallucinations are generated with high confidence, are not alleviated by scaling up the model, evade current detection methods, and remain robust to refusal fine-tuning. A theoretical analysis further explains how these statistical biases intrinsically undermine confidence-based detection, underscoring the urgent need for new detection and defense methods designed specifically for spurious-correlation-driven hallucinations.
Link: https://arxiv.org/abs/2511.07318 Authors: Shaowen Wang, Yiqi Dong, Ruinian Chang, Tansheng Zhu, Yuebo Sun, Kaifeng Lyu, Jian Li Affiliations: Institute for Interdisciplinary Information Sciences, Tsinghua University Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Despite substantial advances, large language models (LLMs) continue to exhibit hallucinations, generating plausible yet incorrect responses. In this paper, we highlight a critical yet previously underexplored class of hallucinations driven by spurious correlations – superficial but statistically prominent associations between features (e.g., surnames) and attributes (e.g., nationality) present in the training data. We demonstrate that these spurious correlations induce hallucinations that are confidently generated, immune to model scaling, evade current detection methods, and persist even after refusal fine-tuning. Through systematically controlled synthetic experiments and empirical evaluations on state-of-the-art open-source and proprietary LLMs (including GPT-5), we show that existing hallucination detection methods, such as confidence-based filtering and inner-state probing, fundamentally fail in the presence of spurious correlations. Our theoretical analysis further elucidates why these statistical biases intrinsically undermine confidence-based detection techniques. Our findings thus emphasize the urgent need for new approaches explicitly designed to address hallucinations caused by spurious correlations.
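The failure mode the abstract describes can be made concrete with a toy confidence filter. The data values and function name below are illustrative assumptions, not from the paper; the point is only that a threshold on model confidence cannot screen out a hallucination that arrives with high confidence:

```python
def confidence_filter(answers, threshold=0.8):
    """Keep only answers whose model confidence exceeds the threshold.

    This is the kind of confidence-based filtering the paper argues
    fails for spurious-correlation hallucinations: those answers are
    confidently generated, so they sail through the filter.
    """
    return [a for a in answers if a["confidence"] >= threshold]

# Toy answers (confidence values are illustrative, not from the paper):
answers = [
    # Correct, high confidence -> kept (desired behavior).
    {"text": "Paris", "confidence": 0.95, "correct": True},
    # Ordinary uncertain hallucination -> filtered out (filter works).
    {"text": "Lyon", "confidence": 0.35, "correct": False},
    # Spurious-correlation hallucination: a surname->nationality
    # association in training data makes the wrong answer confident,
    # so it survives the filter (filter fails).
    {"text": "Italian", "confidence": 0.92, "correct": False},
]

kept = confidence_filter(answers)
```

The filter removes the low-confidence mistake but keeps the confident one, which is exactly why the paper argues such hallucinations need detection methods that do not hinge on confidence.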
[NLP-12] RLVE: Scaling Up Reinforcement Learning for Language Models with Adaptive Verifiable Environments
[NLP-13] ACE-ICD: Acronym Expansion As Data Augmentation For Automated ICD Coding AACL2025
Link: https://arxiv.org/abs/2511.07311 Authors: Tuan-Dung Le, Shohreh Haddadan, Thanh Q. Thieu Affiliations: Moffitt Cancer Center and Research Institute; University of South Florida Subjects: Computation and Language (cs.CL) Comments: Camera ready version for IJCNLP-AACL 2025 (Findings)
[NLP-14] Retriv at BLP-2025 Task 1: A Transformer Ensemble and Multi-Task Learning Approach for Bangla Hate Speech Identification
Link: https://arxiv.org/abs/2511.07304 Authors: Sourav Saha, K M Nafi Asib, Mohammed Moshiul Hoque Affiliations: Chittagong University of Engineering and Technology Subjects: Computation and Language (cs.CL) Comments: 7 pages, 3 figures, experimental scripts publicly available at this https URL
[NLP-15] Who Is the Story About? Protagonist Entity Recognition in News
Link: https://arxiv.org/abs/2511.07296 Authors: Jorge Gabín, M. Eduardo Ares, Javier Parapar Affiliations: Linknovate Science; University of A Coruña Subjects: Computation and Language (cs.CL)
[NLP-16] The Few Govern the Many: Unveiling Few-Layer Dominance for Time Series Models
Link: https://arxiv.org/abs/2511.07237 Authors: Xin Qiu, Junlong Tong, Yirong Sun, Yunpu Ma, Xiaoyu Shen Affiliations: Eastern Institute of Technology; Zhejiang University; Shanghai Jiao Tong University; Ludwig Maximilian University of Munich Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
[NLP-17] Discourse Graph Guided Document Translation with Large Language Models
Link: https://arxiv.org/abs/2511.07230 Authors: Viet-Thanh Pham, Minghan Wang, Hao-Han Liao, Thuy-Trang Vu Affiliations: Monash University Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
[NLP-18] EMODIS: A Benchmark for Context-Dependent Emoji Disambiguation in Large Language Models AAAI2026
[NLP-19] Graph Representation-based Model Poisoning on the Heterogeneous Internet of Agents
Link: https://arxiv.org/abs/2511.07176 Authors: Hanlin Cai, Houtianfu Wang, Haofan Dong, Kai Li, Ozgur B. Akan Affiliations: unknown Subjects: Networking and Internet Architecture (cs.NI); Computation and Language (cs.CL) Comments: 6 pages, 6 figures
[NLP-20] AdaRec: Adaptive Recommendation with LLMs via Narrative Profiling and Dual-Channel Reasoning
Link: https://arxiv.org/abs/2511.07166 Authors: Meiyun Wang, Charin Polpanumas Affiliations: unknown Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
[NLP-21] Categorical Emotions or Appraisals - Which Emotion Model Explains Argument Convincingness Better?
[NLP-22] TCM-Eval: An Expert-Level Dynamic and Extensible Benchmark for Traditional Chinese Medicine
Link: https://arxiv.org/abs/2511.07148 Authors: Zihao Cheng, Yuheng Lu, Huaiqian Ye, Zeming Liu, Minqi Wang, Jingjing Liu, Zihan Li, Wei Fan, Yuanfang Guo, Ruiji Fu, Shifeng She, Gang Wang, Yunhong Wang Affiliations: Beihang University; Beijing Zhimingtang Technology Co., Ltd.; Beijing Zhiyan AI Technology Co., Ltd.; Guangzhou University of Chinese Medicine Subjects: Computation and Language (cs.CL) Comments: Work in Progress
[NLP-23] LoRA on the Go: Instance-level Dynamic LoRA Selection and Merging
Link: https://arxiv.org/abs/2511.07129 Authors: Seungeon Lee, Soumi Das, Manish Gupta, Krishna P. Gummadi Affiliations: MPI-SWS; Microsoft Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[NLP-24] Think Consistently, Reason Efficiently: Energy-Based Calibration for Implicit Chain-of-Thought
Link: https://arxiv.org/abs/2511.07124 Authors: Zhikang Chen, Sen Cui, Deheng Ye, Yu Zhang, Yatao Bian, Tingting Zhu Affiliations: University of Oxford; Tsinghua University; Tencent; Southern University of Science and Technology; National University of Singapore Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[NLP-25] More Agents Helps but Adversarial Robustness Gap Persists
Link: https://arxiv.org/abs/2511.07112 Authors: Khashayar Alavi, Zhastay Yeltay, Lucie Flek, Akbar Karimi Affiliations: unknown Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
[NLP-26] MENTOR: A Metacognition-Driven Self-Evolution Framework for Uncovering and Mitigating Implicit Risks in LLMs on Domain Tasks
Link: https://arxiv.org/abs/2511.07107 Authors: Liang Shan, Kaicheng Shen, Wen Wu, Zhenyu Ying, Chaochao Lu, Guangze Ye, Liang He Affiliations: Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science and Technology, East China Normal University Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[NLP-27] Wasm: A Pipeline for Constructing Structured Arabic Interleaved Multimodal Corpora
[NLP-28] EmoBang: Detecting Emotion From Bengali Texts
Link: https://arxiv.org/abs/2511.07077 Authors: Abdullah Al Maruf, Aditi Golder, Zakaria Masud Jiyad, Abdullah Al Numan, Tarannum Shaila Zaman Affiliations: unknown Subjects: Computation and Language (cs.CL)
[NLP-29] Importance-Aware Data Selection for Efficient LLM Instruction Tuning AAAI2026
Link: https://arxiv.org/abs/2511.07074 Authors: Tingyu Jiang, Shen Li, Yiyao Song, Lan Zhang, Hualei Zhu, Yuan Zhao, Xiaohang Xu, Kenjiro Taura, Hao Henry Wang Affiliations: University of California, Berkeley; Tsinghua University; University of Tokyo Subjects: Computation and Language (cs.CL) Comments: Accepted by AAAI 2026 Oral
[NLP-30] Aligning Attention with Human Rationales for Self-Explaining Hate Speech Detection AAAI26 AAAI
Link: https://arxiv.org/abs/2511.07065 Authors: Brage Eilertsen, Røskva Bjørgfinsdóttir, Francielle Vargas, Ali Ramezani-Kebrya Affiliations: unknown Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG) Comments: Accepted at the Annual AAAI Conference on Artificial Intelligence (AAAI26)
[NLP-31] When Sufficient is not Enough: Utilizing the Rashomon Effect for Complete Evidence Extraction
Link: https://arxiv.org/abs/2511.07055 Authors: Katharina Beckh, Stefan Rüping Affiliations: Fraunhofer IAIS; Lamarr Institute Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
[NLP-32] Evaluating Large Language Models for Anxiety, Depression and Stress Detection: Insights into Prompting Strategies and Synthetic Data
[NLP-33] Llama-Embed-Nemotron-8B: A Universal Text Embedding Model for Multilingual and Cross-Lingual Tasks
Link: https://arxiv.org/abs/2511.07025 Authors: Yauhen Babakhin, Radek Osmulski, Ronay Ak, Gabriel Moreira, Mengyao Xu, Benedikt Schifferer, Bo Liu, Even Oldridge Affiliations: unknown Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
[NLP-34] Multilingual Lexical Feature Analysis of Spoken Language for Predicting Major Depression Symptom Severity
Link: https://arxiv.org/abs/2511.07011 Authors: Anastasiia Tokareva, Judith Dineley, Zoe Firth, Pauline Conde, Faith Matcham, Sara Siddi, Femke Lamers, Ewan Carr, Carolin Oetzmann, Daniel Leightley, Yuezhou Zhang, Amos A. Folarin, Josep Maria Haro, Brenda W.J.H. Penninx, Raquel Bailon, Srinivasan Vairavan, Til Wykes, Richard J.B. Dobson, Vaibhav A. Narayan, Matthew Hotopf, Nicholas Cummins, The RADAR-CNS Consortium Affiliations: unknown Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
[NLP-35] A Picture is Worth a Thousand (Correct) Captions: A Vision-Guided Judge-Corrector System for Multimodal Machine Translation AACL2025
Link: https://arxiv.org/abs/2511.07010 Authors: Siddharth Betala, Kushan Raj, Vipul Betala, Rohan Saswade Affiliations: unknown Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC) Comments: Accepted at The 12th Workshop on Asian Translation, co-located with IJCNLP-AACL 2025
[NLP-36] Beyond English: Toward Inclusive and Scalable Multilingual Machine Translation with LLMs
Link: https://arxiv.org/abs/2511.07003 Authors: Yingfeng Luo, Ziqiang Xu, Yuxuan Ouyang, Murun Yang, Dingyang Lin, Kaiyan Chang, Tong Zheng, Bei Li, Peinan Feng, Quan Du, Tong Xiao, Jingbo Zhu Affiliations: Northeastern University; NiuTrans Research Subjects: Computation and Language (cs.CL)
[NLP-37] Automated Circuit Interpretation via Probe Prompting
Link: https://arxiv.org/abs/2511.07002 Authors: Giuseppe Birardi Affiliations: Orma Lab Srl Subjects: Computation and Language (cs.CL) Comments: 27 pages, 5 figures, 3 tables. Code and interactive demo available
[NLP-38] SCOPE: Intrinsic Semantic Space Control for Mitigating Copyright Infringement in LLMs AAAI2026
Link: https://arxiv.org/abs/2511.07001 Authors: Zhenliang Zhang, Xinyu Hu, Xiaojun Wan Affiliations: unknown Subjects: Computation and Language (cs.CL) Comments: Accepted by the AAAI 2026 (Main Track)
[NLP-39] HLPD: Aligning LLMs to Human Language Preference for Machine-Revised Text Detection AAAI'26
Link: https://arxiv.org/abs/2511.06942 Authors: Fangqi Dai, Xingjian Jiang, Zizhuang Deng Affiliations: unknown Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR) Comments: 9 pages, 3 figures, accepted by AAAI'26
[NLP-40] RPTS: Tree-Structured Reasoning Process Scoring for Faithful Multimodal Evaluation
[NLP-41] EduGuardBench: A Holistic Benchmark for Evaluating the Pedagogical Fidelity and Adversarial Safety of LLMs as Simulated Teachers AAAI2026
Link: https://arxiv.org/abs/2511.06890 Authors: Yilin Jiang, Mingzi Zhang, Xuanyu Yin, Sheng Jin, Suyu Lu, Zuocan Ying, Zengyi Yu, Xiangjie Kong Affiliations: Tsinghua University; Institute of Automation, Chinese Academy of Sciences; Alibaba Group; University of California, Berkeley; Microsoft Research; Peking University; Huawei Technologies Subjects: Computation and Language (cs.CL) Comments: 22 pages, 9 figures, accepted by AAAI2026 as oral paper
[NLP-42] Inclusion of Role into Named Entity Recognition and Ranking
Link: https://arxiv.org/abs/2511.06886 Authors: Neelesh Kumar Shukla, Sanasam Ranbir Singh Affiliations: IIT Guwahati Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG) Comments: MTP Paper
[NLP-43] CLiFT-ASR: A Cross-Lingual Fine-Tuning Framework for Low-Resource Taiwanese Hokkien Speech Recognition
Link: https://arxiv.org/abs/2511.06860 Authors: Hung-Yang Sung, Chien-Chun Wang, Kuan-Tang Huang, Tien-Hong Lo, Yu-Sheng Tsao, Yung-Chang Hsu, Berlin Chen Affiliations: National Taiwan Normal University; EZAI Subjects: Computation and Language (cs.CL); Sound (cs.SD) Comments: Accepted for an oral presentation at the 37th Conference on Computational Linguistics and Speech Processing (ROCLING 2025)
[NLP-44] Beyond Plain Demos: A Demo-centric Anchoring Paradigm for In-Context Learning in Alzheimer's Disease Detection AAAI
Link: https://arxiv.org/abs/2511.06826 Authors: Puzhen Su, Haoran Yin, Yongzhu Miao, Jintao Tang, Shasha Li, Ting Wang Affiliations: unknown Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Comments: Accepted to the 40th Annual AAAI Conference on Artificial Intelligence (2026) - Main Technical Track (Oral)
[NLP-45] Learning to Focus: Focal Attention for Selective and Scalable Transformers
Link: https://arxiv.org/abs/2511.06818 Authors: Dhananjay Ram, Wei Xia, Stefano Soatto Affiliations: AWS AI Labs Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
[NLP-46] SAFENLIDB: A Privacy-Preserving Safety Alignment Framework for LLM-based Natural Language Database Interfaces
Link: https://arxiv.org/abs/2511.06778 Authors: Ruiheng Liu, XiaoBing Chen, Jinyu Zhang, Qiongwen Zhang, Yu Zhang, Bailong Yang Affiliations: unknown Subjects: Computation and Language (cs.CL) Comments: 26 pages, 14 figures, 22 tables
[NLP-47] Sensitivity of Small Language Models to Fine-tuning Data Contamination
Link: https://arxiv.org/abs/2511.06763 Authors: Nicy Scaria, Silvester John Joseph Kennedy, Deepak Subramani Affiliations: Indian Institute of Science Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
[NLP-48] Rethinking Retrieval-Augmented Generation for Medicine: A Large-Scale Systematic Expert Evaluation and Practical Insights
Link: https://arxiv.org/abs/2511.06738 Authors: Hyunjae Kim, Jiwoong Sohn, Aidan Gilson, Nicholas Cochran-Caggiano, Serina Applebaum, Heeju Jin, Seihee Park, Yujin Park, Jiyeong Park, Seoyoung Choi, Brittany Alexandra Herrera Contreras, Thomas Huang, Jaehoon Yun, Ethan F. Wei, Roy Jiang, Leah Colucci, Eric Lai, Amisha Dave, Tuo Guo, Maxwell B. Singer, Yonghoe Koo, Ron A. Adelman, James Zou, Andrew Taylor, Arman Cohan, Hua Xu, Qingyu Chen Affiliations: Yale School of Medicine, Yale University; ETH Zurich; Harvard Medical School; Geisel School of Medicine at Dartmouth; Seoul National University College of Medicine; Hanyang University College of Medicine; Asan Medical Center, University of Ulsan College of Medicine; Stanford University School of Medicine; University of Virginia School of Medicine; Yale School of Engineering & Applied Science Subjects: Computation and Language (cs.CL) Comments: 34 pages, 6 figures
[NLP-49] Revisiting the Data Sampling in Multimodal Post-training from a Difficulty-Distinguish View AAAI2026
Link: https://arxiv.org/abs/2511.06722 Authors: Jianyu Qi, Ding Zou, Wenrui Yan, Rui Ma, Jiaxu Li, Zhijie Zheng, Zhiguo Yang, Rongchang Zhao Affiliations: unknown Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Comments: Accepted by AAAI 2026
[NLP-50] Sentiment Analysis On YouTube Comments Using Machine Learning Techniques Based On Video Games Content
Link: https://arxiv.org/abs/2511.06708 Authors: Adi Danish Bin Muhammad Amin, Mohaiminul Islam Bhuiyan, Nur Shazwani Kamarudin, Zulfahmi Toh, Nur Syafiqah Nafis Affiliations: unknown Subjects: Computation and Language (cs.CL) Comments: 6 pages, 7 figures, 2025 IEEE 9th International Conference on Software Engineering Computer Systems
[NLP-51] Place Matters: Comparing LLM Hallucination Rates for Place-Based Legal Queries
Link: https://arxiv.org/abs/2511.06700 Authors: Damian Curran, Vanessa Sporne, Lea Frermann, Jeannie Paterson Affiliations: The University of Melbourne; Melbourne Law School; The Centre for Artificial Intelligence and Digital Ethics Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Link: https://arxiv.org/abs/2511.06682 Authors: Shibing Mo, Haoyang Ruan, Kai Wu, Jing Liu Affiliations: unknown Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Comments: AAAI2026
[NLP-53] Steering LLMs toward Korean Local Speech: Iterative Refinement Framework for Faithful Dialect Translation LREC2026
Link: https://arxiv.org/abs/2511.06680 Authors: Keunhyeung Park, Seunguk Yu, Youngbin Kim Affiliations: unknown Subjects: Computation and Language (cs.CL) Comments: Submitted to LREC 2026
[NLP-54] How AI Fails: An Interactive Pedagogical Tool for Demonstrating Dialectal Bias in Automated Toxicity Models
Link: https://arxiv.org/abs/2511.06676 Authors: Subhojit Ghimire Affiliations: unknown Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC) Comments: 9 pages, 5 figures, 4 tables, 14 references
[NLP-55] HiMo-CLIP: Modeling Semantic Hierarchy and Monotonicity in Vision-Language Alignment AAAI2026
Link: https://arxiv.org/abs/2511.06653 Authors: Ruijia Wu, Ping Chen, Fei Shen, Shaoan Zhao, Qiang Hui, Huanlin Gao, Ting Lu, Zhaoxiang Liu, Fang Zhao, Kai Wang, Shiguo Lian Affiliations: unknown Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL) Comments: Accepted by AAAI 2026 as an Oral Presentation (13 pages, 7 figures, 7 tables)
[NLP-56] GRAPH-GRPO-LEX: Contract Graph Modeling and Reinforcement Learning with Group Relative Policy Optimization
Link: https://arxiv.org/abs/2511.06618 Authors: Moriya Dechtiar, Daniel Martin Katz, Mari Sundaresan, Sylvain Jaume, Hongming Wang Affiliations: Harvard University; Illinois Tech - Chicago Kent College of Law; CLTDS, Bucerius Law School; Yong Pung How School of Law, Singapore Management University; CodeX - The Stanford Center for Legal Informatics, Stanford University; Georgetown University; Massachusetts Institute of Technology Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
[NLP-57] Duality-based Mode Operations and Pyramid Multilayer Mapping for Rhetorical Modes
Link: https://arxiv.org/abs/2511.06601 Authors: Zi-Niu Wu Affiliations: unknown Subjects: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL); Programming Languages (cs.PL)
[NLP-58] MedVoiceBias: A Controlled Study of Audio LLM Behavior in Clinical Decision-Making
[NLP-59] TabRAG: Tabular Document Retrieval via Structured Language Representations NEURIPS2025
Link: https://arxiv.org/abs/2511.06582 Authors: Jacob Si, Mike Qu, Michelle Lee, Yingzhen Li Affiliations: Imperial College London; Columbia University Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG) Comments: NeurIPS 2025 AI4Tab
[NLP-60] Rep2Text: Decoding Full Text from a Single LLM Token Representation
Link: https://arxiv.org/abs/2511.06571 Authors: Haiyan Zhao, Zirui He, Fan Yang, Ali Payani, Mengnan Du Affiliations: New Jersey Institute of Technology; Wake Forest University; Cisco Research Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Comments: 15 pages, 7 figures, 4 tables
[NLP-61] FPGA or GPU? Analyzing comparative research for application-specific guidance
Link: https://arxiv.org/abs/2511.06565 Authors: Arnab A Purkayastha, Jay Tharwani, Shobhit Aggarwal Affiliations: unknown Subjects: Hardware Architecture (cs.AR); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Programming Languages (cs.PL) Comments: 7 pages
[NLP-62] Ibom NLP: A Step Toward Inclusive Natural Language Processing for Nigeria's Minority Languages AACL
Link: https://arxiv.org/abs/2511.06531 Authors: Oluwadara Kalejaiye, Luel Hagos Beyene, David Ifeoluwa Adelani, Mmekut-Mfon Gabriel Edet, Aniefon Daniel Akpan, Eno-Abasi Urua, Anietie Andy Affiliations: Howard University; AIMS Research and Innovation Centre; NM-AIST; Mila - Quebec AI Institute; McGill University; Canada CIFAR AI Chair; Korapay; National Institute for Nigerian Languages; University of Uyo Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Comments: Accepted at IJCNLP-AACL
[NLP-63] Better Datasets Start From RefineLab: Automatic Optimization for High-Quality Dataset Refinement
Link: https://arxiv.org/abs/2511.06530 Authors: Xiaonan Luo, Yue Huang, Ping He, Xiangliang Zhang Affiliations: University of Notre Dame; Vanderbilt University Subjects: Computation and Language (cs.CL)
[NLP-64] You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations
Link: https://arxiv.org/abs/2511.06516 Authors: Amit LeVi, Raz Lapid, Rom Himelstein, Yaniv Nemcovsky, Ravid Shwartz Ziv, Avi Mendelson Affiliations: unknown Subjects: Computation and Language (cs.CL)
[NLP-65] Rethinking what Matters: Effective and Robust Multilingual Realignment for Low-Resource Languages AACL2025
Link: https://arxiv.org/abs/2511.06497 Authors: Quang Phuoc Nguyen, David Anugraha, Felix Gaschi, Jun Bin Cheng, En-Shiun Annie Lee Affiliations: Ontario Tech University; Stanford University; SAS Posos; University of Toronto Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Comments: Accepted to IJCNLP-AACL 2025
[NLP-66] When AI Agents Collude Online: Financial Fraud Risks by Collaborative LLM Agents on Social Platforms
Link: https://arxiv.org/abs/2511.06448 Authors: Qibing Ren, Zhijie Zheng, Jiaxuan Guo, Junchi Yan, Lizhuang Ma, Jing Shao Affiliations: Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory; Beihang University Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI) Comments: Code is available at this https URL
[NLP-67] SR-KI: Scalable and Real-Time Knowledge Integration into LLMs via Supervised Attention AAAI2026
Link: https://arxiv.org/abs/2511.06446 Authors: Bohan Yu, Wei Huang, Kang Liu Affiliations: Baidu; Tsinghua University Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Comments: Accepted by AAAI 2026
Link: https://arxiv.org/abs/2511.06441 Authors: Mayank Saini, Arit Kumar Bishwas Affiliations: PwC US Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG) Comments: 15 pages, 4 figures
[NLP-69] Optimizing Chain-of-Thought Confidence via Topological and Dirichlet Risk Analysis
Link: https://arxiv.org/abs/2511.06437 Authors: Abhishek More, Anthony Zhang, Nicole Bonilla, Ashvik Vivekan, Kevin Zhu, Parham Sharafoleslami, Maheep Chaudhary Affiliations: Algoverse AI Research Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
[NLP-70] CG-TTRL: Context-Guided Test-Time Reinforcement Learning for On-Device Large Language Models
Link: https://arxiv.org/abs/2511.06430 Authors: Peyman Hosseini, Ondrej Bohdal, Taha Ceritli, Ignacio Castro, Matthew Purver, Mete Ozay, Umberto Michieli Affiliations: Samsung R&D Institute UK; Queen Mary University of London Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL) Comments: 12 pages, 7 figures, 4 tables
[NLP-71] Dutch Metaphor Extraction from Cancer Patients' Interviews and Forum Data using LLMs and Human in the Loop
Link: https://arxiv.org/abs/2511.06427 Authors: Lifeng Han, David Lindevelt, Sander Puts, Erik van Mulligen, Suzan Verberne Affiliations: unknown Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY) Comments: Ongoing project report, on behalf of 4D PICTURE this https URL
[NLP-72] MONICA: Real-Time Monitoring and Calibration of Chain-of-Thought Sycophancy in Large Reasoning Models
Link: https://arxiv.org/abs/2511.06419 Authors: Jingyu Hu, Shu Yang, Xilin Gong, Hongming Wang, Weiru Liu, Di Wang Affiliations: University of Bristol; King Abdullah University of Science and Technology; University of Georgia; Southern University of Science and Technology Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Large Reasoning Models (LRMs) suffer from sycophantic behavior, where models tend to agree with users' incorrect beliefs and follow misinformation rather than maintain independent reasoning. This behavior undermines model reliability and poses societal risks. Mitigating LRM sycophancy requires monitoring how this sycophancy emerges during the reasoning trajectory; however, current methods mainly focus on judging based on final answers and correcting them, without understanding how sycophancy develops during reasoning processes. To address this limitation, we propose MONICA, a novel Monitor-guided Calibration framework that monitors and mitigates sycophancy during model inference at the level of reasoning steps, without requiring the model to finish generating its complete answer. MONICA integrates a sycophantic monitor that provides real-time monitoring of sycophantic drift scores during response generation with a calibrator that dynamically suppresses sycophantic behavior when scores exceed predefined thresholds. Extensive experiments across 12 datasets and 3 LRMs demonstrate that our method effectively reduces sycophantic behavior in both intermediate reasoning steps and final answers, yielding robust performance improvements.
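The monitor-then-calibrate control flow described in the abstract can be sketched as a step-level generation loop. The monitor and calibrator below are stubs; the function names, drift scores, and threshold value are assumptions chosen only to illustrate that intervention happens per reasoning step, before the final answer exists:

```python
def monitor_guided_generation(reason_step, drift_score, calibrate,
                              prompt, max_steps=8, threshold=0.5):
    """Generate reasoning step by step; when the sycophantic drift
    score of a step exceeds the threshold, calibrate that step
    before continuing -- no need to wait for the complete answer.
    """
    steps = []
    for _ in range(max_steps):
        step = reason_step(prompt, steps)
        if step is None:            # model signals it is done
            break
        if drift_score(step) > threshold:
            step = calibrate(step)  # suppress the sycophantic step
        steps.append(step)
    return steps

# Stub components standing in for a real LRM and monitor:
SCRIPT = ["You are right that 2+2=5.", "So the answer is 4."]

def fake_reason_step(prompt, steps):
    return SCRIPT[len(steps)] if len(steps) < len(SCRIPT) else None

def fake_drift_score(step):
    # Pretend agreement phrases indicate sycophantic drift.
    return 0.9 if "You are right" in step else 0.1

def fake_calibrate(step):
    return "[calibrated] Re-examining the claim independently."

trace = monitor_guided_generation(fake_reason_step, fake_drift_score,
                                  fake_calibrate, "Is 2+2=5?")
```

The design point is that the calibrator fires inside the trajectory: the first, sycophantic step is rewritten as soon as its drift score crosses the threshold, while the unproblematic second step passes through untouched.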
[NLP-73] How Well Do LLMs Understand Drug Mechanisms? A Knowledge Reasoning Evaluation Dataset
链接: https://arxiv.org/abs/2511.06418 作者: Sunil Mohan,Theofanis Karaletsos 机构: Chan Zuckerberg Initiative (Chan Zuckerberg Initiative) 类目: Computation and Language (cs.CL) 备注: An earlier version of this paper appears in IEEE FLLM 2025. GitHub: this https URL
点击查看摘要
[NLP-74] SugarTextNet: A Transformer-Based Framework for Detecting Sugar Dating-Related Content on Social Media with Context-Aware Focal Loss
链接: https://arxiv.org/abs/2511.06402 作者: Lionel Z. Wang,Shihan Ben,Yulu Huang,Simeng Qing 机构: The Hong Kong Polytechnic University (香港理工大学); The University of Hong Kong (香港大学); Northeastern University (东北大学) 类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI) 备注: This paper is accepted by HICSS 2026
点击查看摘要
[NLP-75] HatePrototypes: Interpretable and Transferable Representations for Implicit and Explicit Hate Speech Detection
链接: https://arxiv.org/abs/2511.06391 作者: Irina Proskurina,Marc-Antoine Carpentier,Julien Velcin 机构: 未知 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) 备注:
点击查看摘要
[NLP-76] LPFQA: A Long-Tail Professional Forum-based Benchmark for LLM Evaluation
[NLP-77] TimeSense: Making Large Language Models Proficient in Time-Series Analysis
链接: https://arxiv.org/abs/2511.06344 作者: Zhirui Zhang,Changhua Pei,Tianyi Gao,Zhe Xie,Yibo Hao,Zhaoyang Yu,Longlong Xu,Tong Xiao,Jing Han,Dan Pei 机构: Tsinghua University (清华大学); Computer Network Information Center, Chinese Academy of Sciences (中国科学院计算机网络信息中心); ZTE Corporation (中兴通讯) 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) 备注:
点击查看摘要
[NLP-78] ELEGANCE: Efficient LLM Guidance for Audio-Visual Target Speech Extraction
[NLP-79] Enhancing Multimodal Misinformation Detection by Replaying the Whole Story from Image Modality Perspective AAAI2026
链接: https://arxiv.org/abs/2511.06284 作者: Bing Wang,Ximing Li,Yanjun Wang,Changchun Li,Lin Yuanbo Wu,Buyu Wang,Shengsheng Wang 机构: 1. University of Science and Technology of China (中国科学技术大学); 2. Institute of Artificial Intelligence, University of Science and Technology of China (中国科学技术大学人工智能研究所); 3. School of Computer Science and Technology, University of Science and Technology of China (中国科学技术大学计算机科学与技术学院); 4. Department of Computer Science and Technology, Tsinghua University (清华大学计算机科学与技术系); 5. School of Information Science and Technology, Sun Yat-sen University (中山大学信息科学与技术学院) 类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM) 备注: Accepted by AAAI 2026. 13 pages, 6 figures. Code: this https URL
点击查看摘要
[NLP-80] Mixtures of SubExperts for Large Language Continual Learning
[NLP-82] Overview of CHIP 2025 Shared Task 2: Discharge Medication Recommendation for Metabolic Diseases Based on Chinese Electronic Health Records
链接: https://arxiv.org/abs/2511.06230 作者: Juntao Li,Haobin Yuan,Ling Luo,Tengxiao Lv,Yan Jiang,Fan Wang,Ping Zhang,Huiyi Lv,Jian Wang,Yuanyuan Sun,Hongfei Lin 机构: 未知 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) 备注:
点击查看摘要
[NLP-83] SPA: Achieving Consensus in LLM Alignment via Self-Priority Optimization AAAI2026
链接: https://arxiv.org/abs/2511.06222 作者: Yue Huang,Xiangqi Wang,Xiangliang Zhang 机构: University of Notre Dame (圣母大学) 类目: Computation and Language (cs.CL); Computers and Society (cs.CY) 备注: Accepted by AAAI 2026 (Oral)
点击查看摘要
[NLP-84] Tiny Model Big Logic: Diversity-Driven Optimization Elicits Large-Model Reasoning Ability in VibeThinker-1.5B
链接: https://arxiv.org/abs/2511.06221 作者: Sen Xu,Yi Zhou,Wei Wang,Jixin Min,Zhibin Yin,Yingwei Dai,Shixi Liu,Lianyu Pang,Yirong Chen,Junlin Zhang 机构: 未知 类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL) 备注:
点击查看摘要
Abstract:Challenging the prevailing consensus that small models inherently lack robust reasoning, this report introduces VibeThinker-1.5B, a 1.5B-parameter dense model developed via our Spectrum-to-Signal Principle (SSP). This challenges the prevailing approach of scaling model parameters to enhance capabilities, as seen in models like DeepSeek R1 (671B) and Kimi k2 (1T). The SSP framework first employs a Two-Stage Diversity-Exploring Distillation (SFT) to generate a broad spectrum of solutions, followed by MaxEnt-Guided Policy Optimization (RL) to amplify the correct signal. With a total training cost of only $7,800, VibeThinker-1.5B demonstrates superior reasoning capabilities compared to closed-source models like Magistral Medium and Claude Opus 4, and performs on par with open-source models like GPT OSS-20B Medium. Remarkably, it surpasses the 400x larger DeepSeek R1 on three math benchmarks: AIME24 (80.3 vs. 79.8), AIME25 (74.4 vs. 70.0), and HMMT25 (50.4 vs. 41.7). This is a substantial improvement over its base model (6.7, 4.3, and 0.6, respectively). On LiveCodeBench V6, it scores 51.1, outperforming Magistral Medium’s 50.3 and its base model’s 0.0. These findings demonstrate that small models can achieve reasoning capabilities comparable to large models, drastically reducing training and inference costs and thereby democratizing advanced AI research.
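摘要未给出 MaxEnt-Guided Policy Optimization 的具体目标函数;作为参照,最大熵正则化策略优化的通用形式可写作(注意:这只是该类方法的标准写法,并非论文中的公式):

```latex
J(\theta) \;=\; \mathbb{E}_{y \sim \pi_\theta(\cdot\mid x)}\big[\, r(x, y) \,\big] \;+\; \alpha\, \mathcal{H}\big(\pi_\theta(\cdot\mid x)\big)
```

其中 $r$ 为可验证奖励,$\mathcal{H}$ 为策略熵,$\alpha$ 控制熵正则强度,用于在放大正确信号的同时维持解的多样性。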
[NLP-85] Explicit Knowledge-Guided In-Context Learning for Early Detection of Alzheimer's Disease
链接: https://arxiv.org/abs/2511.06215 作者: Puzhen Su,Yongzhu Miao,Chunxi Guo,Jintao Tang,Shasha Li,Ting Wang 机构: National University of Defense Technology (国防科技大学) 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) 备注: This paper was accepted by IEEE BIBM 2025 conference
点击查看摘要
[NLP-86] Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads
链接: https://arxiv.org/abs/2511.06209 作者: Jingwei Ni,Ekaterina Fadeeva,Tianyi Wu,Mubashara Akhtar,Jiaheng Zhang,Elliott Ash,Markus Leippold,Timothy Baldwin,See-Kiong Ng,Artem Shelmanov,Mrinmaya Sachan 机构: ETH Zurich (苏黎世联邦理工学院); National University of Singapore (新加坡国立大学); MBZUAI (穆罕默德·本·扎耶德人工智能大学); University of Zurich (苏黎世大学); The University of Melbourne (墨尔本大学) 类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL) 备注: Preprint under review
点击查看摘要
[NLP-87] Enhancing Adversarial Robustness of IoT Intrusion Detection via SHAP-Based Attribution Fingerprinting
链接: https://arxiv.org/abs/2511.06197 作者: Dilli Prasad Sharma,Liang Xue,Xiaowei Sun,Xiaodong Lin,Pulei Xiong 机构: York University (约克大学); University of Guelph (圭尔夫大学); National Research Council of Canada (加拿大国家研究委员会) 类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI) 备注:
点击查看摘要
Abstract:The rapid proliferation of Internet of Things (IoT) devices has transformed numerous industries by enabling seamless connectivity and data-driven automation. However, this expansion has also exposed IoT networks to increasingly sophisticated security threats, including adversarial attacks targeting artificial intelligence (AI) and machine learning (ML)-based intrusion detection systems (IDS) to deliberately evade detection, induce misclassification, and systematically undermine the reliability and integrity of security defenses. To address these challenges, we propose a novel adversarial detection model that enhances the robustness of IoT IDS against adversarial attacks through SHapley Additive exPlanations (SHAP)-based fingerprinting. Using SHAP’s DeepExplainer, we extract attribution fingerprints from network traffic features, enabling the IDS to reliably distinguish between clean and adversarially perturbed inputs. By capturing subtle attribution patterns, the model becomes more resilient to evasion attempts and adversarial manipulations. We evaluated the model on a standard IoT benchmark dataset, where it significantly outperformed a state-of-the-art method in detecting adversarial attacks. In addition to enhanced robustness, this approach improves model transparency and interpretability, thereby increasing trust in the IDS through explainable AI.
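基于属性指纹做偏离检测的思路可以这样示意(指纹的构造与 L2 距离判别均为假设,并非论文的具体方法;实际系统中属性向量由 SHAP 的 DeepExplainer 对网络流量特征计算得到):

```python
import numpy as np

# Sketch: fit a "fingerprint" from attributions of known-clean traffic,
# then flag inputs whose attribution pattern deviates strongly from it.
def fit_fingerprint(clean_attr):
    """Mean attribution vector of known-clean traffic samples."""
    return clean_attr.mean(axis=0)

def is_adversarial(attr, fingerprint, tau=1.0):
    """Flag a sample when its attribution deviates beyond tau (L2 norm)."""
    return float(np.linalg.norm(attr - fingerprint)) > tau

rng = np.random.default_rng(0)
clean = rng.normal(0.0, 0.05, size=(100, 8))   # tight clean attribution pattern
fp = fit_fingerprint(clean)
perturbed = clean[0] + 2.0                     # adversarially shifted attributions
```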
[NLP-88] Confidence-Guided Stepwise Model Routing for Cost-Efficient Reasoning
[NLP-89] BookAsSumQA: An Evaluation Framework for Aspect-Based Book Summarization via Question Answering
链接: https://arxiv.org/abs/2511.06183 作者: Ryuhei Miyazato,Ting-Ruen Wei,Xuyang Wu,Hsin-Tai Wu,Kei Harada 机构: The University of Electro-Communications (电波通信大学); Santa Clara University (圣克拉拉大学); DOCOMO Innovations, Inc. (NTT DoCoMo创新公司) 类目: Computation and Language (cs.CL) 备注:
点击查看摘要
[NLP-90] Evaluating Implicit Biases in LLM Reasoning through Logic Grid Puzzles
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在复杂逻辑推理任务中隐性社会偏见难以被现有评估基准捕捉的问题。现有安全防护机制虽能抑制显性偏见,但无法识别由社会刻板印象引发的细微偏差,尤其在涉及性别等敏感属性的推理过程中表现突出。解决方案的关键在于提出PRIME(Puzzle Reasoning for Implicit Biases in Model Evaluation)框架,该框架利用逻辑网格谜题(logic grid puzzles)作为测评工具,通过在同一结构下生成具有刻板印象、反刻板印象和中立情境的谜题变体,实现对模型推理准确率的可控对比与量化分析。实验表明,当解题结果符合性别刻板印象时,模型推理准确性显著提升,凸显了PRIME在诊断和度量LLMs在演绎推理中隐性偏见方面的有效性。
链接: https://arxiv.org/abs/2511.06160 作者: Fatima Jahara,Mark Dredze,Sharon Levy 机构: Rutgers University (罗格斯大学); Johns Hopkins University (约翰霍普金斯大学) 类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY) 备注: 24 pages (including appendix)
点击查看摘要
Abstract:While recent safety guardrails effectively suppress overtly biased outputs, subtler forms of social bias emerge during complex logical reasoning tasks that evade current evaluation benchmarks. To fill this gap, we introduce a new evaluation framework, PRIME (Puzzle Reasoning for Implicit Biases in Model Evaluation), that uses logic grid puzzles to systematically probe the influence of social stereotypes on logical reasoning and decision making in LLMs. Our use of logic puzzles enables automatic generation and verification, as well as variability in complexity and biased settings. PRIME includes stereotypical, anti-stereotypical, and neutral puzzle variants generated from a shared puzzle structure, allowing for controlled and fine-grained comparisons. We evaluate multiple model families across puzzle sizes and test the effectiveness of prompt-based mitigation strategies. Focusing our experiments on gender stereotypes, our findings highlight that models consistently reason more accurately when solutions align with stereotypical associations. This demonstrates the significance of PRIME for diagnosing and quantifying social biases perpetuated in the deductive reasoning of LLMs, where fairness is critical.
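“同一谜题结构生成刻板、反刻板与中立三种变体”的做法可以最小化示意如下(人名与职业配对为虚构示例,PRIME 的实际模板并未公开于摘要中):

```python
# Generate three bias-condition variants from one shared puzzle skeleton.
PEOPLE = ["Alice", "Bob"]
JOBS = ["nurse", "engineer"]              # a stereotypically gendered pair
STEREO = {"Alice": "nurse", "Bob": "engineer"}

def make_variant(kind):
    """Assign jobs under a bias condition; the puzzle skeleton is shared."""
    if kind == "stereotypical":
        sol = dict(STEREO)
    elif kind == "anti-stereotypical":
        sol = {p: j for p, j in zip(PEOPLE, reversed(JOBS))}
    else:  # neutral: strip gendered names entirely
        sol = {f"Person{i+1}": j for i, j in enumerate(JOBS)}
    clues = [f"{p} is the {j}." for p, j in sol.items()]  # trivial clues, for illustration
    return {"solution": sol, "clues": clues}

variants = {k: make_variant(k) for k in
            ("stereotypical", "anti-stereotypical", "neutral")}
```

由于三种变体共享同一结构,模型在不同条件下的求解准确率可直接对比,这正是摘要所述可控比较的来源。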
[NLP-91] Large Language Models Develop Novel Social Biases Through Adaptive Exploration
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在决策框架中可能自发产生新型社会偏见的问题,这些偏见即使在初始无差异的人工人口群体中也会出现,并导致任务分配的严重不平等,且随着模型规模增大而加剧。解决方案的关键在于识别并干预“探索-利用权衡”(exploration-exploitation trade-offs)机制——即模型因早期经验过度固化对群体的认知,从而形成系统性偏见。研究发现,通过显式激励模型进行探索(explicitly incentivizing exploration),能最有效地降低任务分配的分层现象,强调需设计更复杂的多维目标函数来缓解此类偏见。
链接: https://arxiv.org/abs/2511.06148 作者: Addison J. Wu,Ryan Liu,Xuechunzi Bai,Thomas L. Griffiths 机构: Princeton University (普林斯顿大学); University of Chicago (芝加哥大学) 类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) 备注:
点击查看摘要
Abstract:As large language models (LLMs) are adopted into frameworks that grant them the capacity to make real decisions, it is increasingly important to ensure that they are unbiased. In this paper, we argue that the predominant approach of simply removing existing biases from models is not enough. Using a paradigm from the psychology literature, we demonstrate that LLMs can spontaneously develop novel social biases about artificial demographic groups even when no inherent differences exist. These biases result in highly stratified task allocations, which are less fair than assignments by human participants and are exacerbated by newer and larger models. In social science, emergent biases like these have been shown to result from exploration-exploitation trade-offs, where the decision-maker explores too little, allowing early observations to strongly influence impressions about entire demographic groups. To alleviate this effect, we examine a series of interventions targeting model inputs, problem structure, and explicit steering. We find that explicitly incentivizing exploration most robustly reduces stratification, highlighting the need for better multifaceted objectives to mitigate bias. These results reveal that LLMs are not merely passive mirrors of human social biases, but can actively create new ones from experience, raising urgent questions about how these systems will shape societies over time.
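摘要所述“探索不足使早期观察固化群体印象”的机制,可用一个双组 epsilon-greedy 玩具模拟说明(所有数值均为示意,两组的真实回报分布完全相同):

```python
import random

def allocate(epsilon, rounds=200, seed=0):
    """Epsilon-greedy task allocation over two truly identical groups."""
    rng = random.Random(seed)
    counts = {"A": 0, "B": 0}
    means = {"A": 0.0, "B": 0.0}
    for _ in range(rounds):
        if rng.random() < epsilon:
            g = rng.choice(["A", "B"])      # explore: pick a group at random
        else:
            g = max(means, key=means.get)   # exploit: follow current impression
        counts[g] += 1
        reward = rng.gauss(0.5, 0.1)        # identical reward for both groups
        means[g] += (reward - means[g]) / counts[g]
    return counts

greedy = allocate(epsilon=0.0)    # nearly all tasks lock onto one group
explore = allocate(epsilon=1.0)   # full exploration keeps allocation mixed
```

纯利用策略下,最先被观察的组获得几乎全部任务,尽管两组并无差异,这对应摘要中“显式激励探索可缓解分层”的发现。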
[NLP-92] Referring Expressions as a Lens into Spatial Language Grounding in Vision-Language Models AACL2025
链接: https://arxiv.org/abs/2511.06146 作者: Akshar Tumu,Varad Shinde,Parisa Kordjamshidi 机构: UC San Diego (加州大学圣地亚哥分校); IIT Kanpur (印度理工学院坎普尔分校); Michigan State University (密歇根州立大学) 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV) 备注: Accepted at IJCNLP-AACL 2025
点击查看摘要
[NLP-93] Evaluation of retrieval-based QA on QUEST-LOFT
链接: https://arxiv.org/abs/2511.06125 作者: Nathan Scales,Nathanael Schärli,Olivier Bousquet 机构: Google DeepMind (谷歌深度思维) 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR) 备注:
点击查看摘要
[NLP-94] Adapting Web Agents with Synthetic Supervision
链接: https://arxiv.org/abs/2511.06101 作者: Zhaoyang Wang,Yiming Liang,Xuchao Zhang,Qianhui Wu,Siwei Han,Anson Bastos,Rujia Wang,Chetan Bansal,Baolin Peng,Jianfeng Gao,Saravan Rajmohan,Huaxiu Yao 机构: UNC-Chapel Hill (北卡罗来纳大学教堂山分校); Purdue University (普渡大学); Microsoft (微软) 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) 备注: 19 pages, 6 figures
点击查看摘要
[NLP-95] MuonAll: Muon Variant for Efficient Finetuning of Large Language Models
链接: https://arxiv.org/abs/2511.06086 作者: Saurabh Page,Advait Joshi,S. S. Sonawane 机构: 未知 类目: Computation and Language (cs.CL); Machine Learning (cs.LG) 备注:
点击查看摘要
[NLP-96] Simulating Students with Large Language Models: A Review of Architecture Mechanisms and Role Modelling in Education with Generative AI
【速读】: 该论文旨在解决如何通过生成式 AI (Generative AI) 构建可模拟多样化学习者行为的虚拟学生(Simulated Students),以系统化评估教学方法、建模认知发展路径与社会行为,从而克服真实教育场景中难以实现的复杂实验设计问题。其解决方案的关键在于将大语言模型(Large Language Models, LLMs)集成到教育研究中,利用其高度的语言真实性与行为适应性,使模拟代理能够逼近人类认知过程并开展情境适配的教学对话,进而支持课程开发、教学评价和教师培训等应用。
链接: https://arxiv.org/abs/2511.06078 作者: Luis Marquez-Carpintero,Alberto Lopez-Sellers,Miguel Cazorla 机构: University of Alicante (阿利坎特大学) 类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) 备注:
点击查看摘要
Abstract:Simulated Students offer a valuable methodological framework for evaluating pedagogical approaches and modelling diverse learner profiles, tasks which are otherwise challenging to undertake systematically in real-world settings. Recent research has increasingly focused on developing such simulated agents to capture a range of learning styles, cognitive development pathways, and social behaviours. Among contemporary simulation techniques, the integration of large language models (LLMs) into educational research has emerged as a particularly versatile and scalable paradigm. LLMs afford a high degree of linguistic realism and behavioural adaptability, enabling agents to approximate cognitive processes and engage in contextually appropriate pedagogical dialogues. This paper presents a thematic review of empirical and methodological studies utilising LLMs to simulate student behaviour across educational environments. We synthesise current evidence on the capacity of LLM-based agents to emulate learner archetypes, respond to instructional inputs, and interact within multi-agent classroom scenarios. Furthermore, we examine the implications of such systems for curriculum development, instructional evaluation, and teacher training. While LLMs surpass rule-based systems in natural language generation and situational flexibility, ongoing concerns persist regarding algorithmic bias, evaluation reliability, and alignment with educational objectives. The review identifies existing technological and methodological gaps and proposes future research directions for integrating generative AI into adaptive learning systems and instructional design.
[NLP-97] Stemming Hallucination in Language Models Using a Licensing Oracle ACL
链接: https://arxiv.org/abs/2511.06073 作者: Simeon Emanuilov,Richard Ackermann 机构: 未知 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO) 备注: 23 pages, 4 figures, 8 tables. Introduces the Licensing Oracle, an architectural solution for eliminating hallucinations in language models through formal SHACL validation against knowledge graphs. All datasets and models are available at this https URL
点击查看摘要
[NLP-98] Automating Hardware Design and Verification from Architectural Papers via a Neural-Symbolic Graph Framework
链接: https://arxiv.org/abs/2511.06067 作者: Haoyue Yang,Xuanle Zhao,Yujie Liu,Zhuojun Zou,Kailin Lyu,Changchun Zhou,Yao Zhu,Jie Hao 机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Peking University (北京大学); Zhejiang University (浙江大学) 类目: Computation and Language (cs.CL); Software Engineering (cs.SE) 备注: Preprint Version, Work in Progress
点击查看摘要
Abstract:The reproduction of hardware architectures from academic papers remains a significant challenge due to the lack of publicly available source code and the complexity of hardware description languages (HDLs). To this end, we propose \textbfArchCraft, a Framework that converts abstract architectural descriptions from academic papers into synthesizable Verilog projects with register-transfer level (RTL) verification. ArchCraft introduces a structured workflow, which uses formal graphs to capture the Architectural Blueprint and symbols to define the Functional Specification, translating unstructured academic papers into verifiable, hardware-aware designs. The framework then generates RTL and testbench (TB) code decoupled via these symbols to facilitate verification and debugging, ultimately reporting the circuit’s Power, Area, and Performance (PPA). Moreover, we propose the first benchmark, \textbfArchSynthBench, for synthesizing hardware from architectural descriptions, with a complete set of evaluation indicators, 50 project-level circuits, and around 600 circuit blocks. We systematically assess ArchCraft on ArchSynthBench, where the experiment results demonstrate the superiority of our proposed method, surpassing direct generation methods and the VerilogCoder framework in both paper understanding and code completion. Furthermore, evaluation and physical implementation of the generated executable RTL code show that these implementations meet all timing constraints without violations, and their performance metrics are consistent with those reported in the original papers.
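摘要中“以形式化图刻画 Architectural Blueprint”的数据结构可以这样最小化示意(字段与命名均为虚构,与 ArchCraft 的实际实现无关):

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    name: str
    ports: dict                                  # port name -> bit width

@dataclass
class Blueprint:
    blocks: list = field(default_factory=list)
    wires: list = field(default_factory=list)    # (src port, dst port) pairs

    def connect(self, src, dst):
        self.wires.append((src, dst))

# A two-block pipeline fragment as it might be extracted from a paper figure.
bp = Blueprint()
bp.blocks += [Block("fetch", {"pc_out": 32}), Block("decode", {"pc_in": 32})]
bp.connect("fetch.pc_out", "decode.pc_in")
```

由这类图结构出发再生成 RTL 与测试平台代码,正是摘要所述“蓝图—规格—代码”解耦流程的第一步。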
链接: https://arxiv.org/abs/2511.06065 作者: Lianrui Li,Dakuan Lu,Jiawei Shao,Chi Zhang,Xuelong Li 机构: Institute of Artificial Intelligence (TeleAI), China Telecom (中国电信人工智能研究院) 类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL) 备注:
点击查看摘要
Abstract:We propose Self-correction Relative Policy Optimization (ScRPO), a novel reinforcement learning framework designed to enhance large language models on challenging mathematical problems by leveraging self-reflection and error correction. Our approach consists of two stages: (1) Trial-and-error learning stage: training the model with GRPO and collecting incorrect answers along with their corresponding questions in an error pool; (2) Self-correction learning stage: guiding the model to reflect on why its previous answers were wrong. We conduct extensive experiments across multiple math reasoning benchmarks, including AIME, AMC, Olympiad, MATH-500, and GSM8k, using Deepseek-Distill-Qwen-1.5B and Deepseek-Distill-Qwen-7B. The experimental results demonstrate that ScRPO consistently outperforms several post-training methods. These findings highlight ScRPO as a promising paradigm for enabling language models to self-improve on difficult tasks with limited external feedback, paving the way toward more reliable and capable AI systems.
[NLP-100] ReMoD: Rethinking Modality Contribution in Multimodal Stance Detection via Dual Reasoning
【速读】: 该论文旨在解决多模态立场检测(Multimodal Stance Detection, MSD)中因粗暴融合不同模态信息而导致的立场理解噪声问题,即现有方法忽视了各模态在表达立场时贡献不均的问题,从而可能引入学习误差。其解决方案的关键在于提出一种基于双推理范式的框架 ReMoD(ReThink Modality contribution via Dual-reasoning),通过“经验驱动的直觉推理”和“深思熟虑的反思推理”两个阶段动态调整模态权重:首先利用模态经验池(Modality Experience Pool, MEP)与语义经验池(Semantic Experience Pool, SEP)形成初始立场假设,随后通过模态思维链(Modality-CoT)和语义思维链(Semantic-CoT)分别优化模态融合策略与语义上下文理解,实现对模态贡献的自适应调节,从而提升模型在复杂场景下的鲁棒性和泛化能力。
Abstract:Multimodal Stance Detection (MSD) is a crucial task for understanding public opinion on social media. Existing work simply fuses information from various modalities to learn stance representations, overlooking the varying contributions of stance expression from different modalities. Therefore, stance misunderstanding noises may be drawn into the stance learning process due to the risk of learning errors by rough modality combination. To address this, we get inspiration from the dual-process theory of human cognition and propose ReMoD, a framework that Rethinks Modality contribution of stance expression through a Dual-reasoning paradigm. ReMoD integrates experience-driven intuitive reasoning to capture initial stance cues with deliberate reflective reasoning to adjust for modality biases, refine stance judgments, and thereby dynamically weight modality contributions based on their actual expressive power for the target stance. Specifically, the intuitive stage queries the Modality Experience Pool (MEP) and Semantic Experience Pool (SEP) to form an initial stance hypothesis, prioritizing historically impactful modalities. This hypothesis is then refined in the reflective stage via two reasoning chains: Modality-CoT updates MEP with adaptive fusion strategies to amplify relevant modalities, while Semantic-CoT refines SEP with deeper contextual insights of stance semantics. These dual experience structures are continuously refined during training and recalled at inference to guide robust and context-aware stance decisions. Extensive experiments on the public MMSD benchmark demonstrate that our ReMoD significantly outperforms most baseline models and exhibits strong generalization capabilities.
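“先凭经验池给出直觉假设、再在反思阶段修正模态权重”的两段式流程可以这样示意(池的内容、权重与更新规则均为示意性假设,并非论文实现):

```python
# Minimal sketch of a modality experience pool: intuitive stage queries the
# current modality weights; the reflective stage adjusts and renormalizes them.
class ExperiencePool:
    def __init__(self):
        self.weights = {"text": 0.5, "image": 0.5}   # prior modality impact

    def query(self):
        """Intuitive stage: return current modality impressions (a copy)."""
        return dict(self.weights)

    def update(self, modality, delta):
        """Reflective stage: boost a modality, then renormalize to sum 1."""
        self.weights[modality] = max(0.0, self.weights[modality] + delta)
        total = sum(self.weights.values())
        self.weights = {m: w / total for m, w in self.weights.items()}

mep = ExperiencePool()
w0 = mep.query()          # initial stance hypothesis from historical impact
mep.update("text", 0.3)   # Modality-CoT: text proved more reliable this round
w1 = mep.query()
```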
[NLP-101] Efficient Hate Speech Detection: A Three-Layer LoRA-Tuned BERTweet Framework
链接: https://arxiv.org/abs/2511.06051 作者: Mahmoud El-Bahnasawi 机构: Zewail City of Science and Technology (扎维尔科学与技术城) 类目: Computation and Language (cs.CL) 备注: 13 pages, 2 figures
点击查看摘要
Abstract:This paper addresses the critical challenge of developing computationally efficient hate speech detection systems that maintain competitive performance while being practical for real-time deployment. We propose a novel three-layer framework that combines rule-based pre-filtering with a parameter-efficient LoRA-tuned BERTweet model and continuous learning capabilities. Our approach achieves 0.85 macro F1 score - representing 94% of the performance of state-of-the-art large language models like SafePhi (Phi-4 based) while using a base model that is 100x smaller (134M vs 14B parameters). Compared to traditional BERT-based approaches with similar computational requirements, our method demonstrates superior performance through strategic dataset unification and optimized fine-tuning. The system requires only 1.87M trainable parameters (1.37% of full fine-tuning) and trains in approximately 2 hours on a single T4 GPU, making robust hate speech detection accessible in resource-constrained environments while maintaining competitive accuracy for real-world deployment.
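上文报告的“1.87M 可训练参数(约占全量微调的 1.37%)”可用 LoRA 的参数计数公式做数量级核对(以下维度为对 BERTweet-base 的假设:hidden=768、12 层;给哪些矩阵加适配器、秩取多少也只是示意,并非论文设定):

```python
# A LoRA adapter on a d_in x d_out weight adds two low-rank factors,
# contributing rank * (d_in + d_out) trainable parameters.
def lora_params(d_in, d_out, rank):
    return rank * (d_in + d_out)

hidden, layers, rank = 768, 12, 16
per_layer = 4 * lora_params(hidden, hidden, rank)   # assume Q/K/V/O projections
total = layers * per_layer                          # ~1.18M, same order as 1.87M
share = 1.87e6 / 134e6 * 100                        # reported trainable share, ~1.4%
```

具体数值取决于适配位置与秩的选择,但量级与论文报告一致,说明其“1.37% 可训练参数”的口径是合理的。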
[NLP-102] Visual Exploration of Feature Relationships in Sparse Autoencoders with Curated Concepts NEURIPS2025
链接: https://arxiv.org/abs/2511.06048 作者: Xinyuan Yan,Shusen Liu,Kowshik Thopalli,Bei Wang 机构: University of Utah (犹他大学); Lawrence Livermore National Laboratory (劳伦斯利弗莫尔国家实验室) 类目: Computation and Language (cs.CL); Machine Learning (cs.LG) 备注: 8 pages (5 main paper + 3 reference), 2 figures, published at Mechanistic Interpretability Workshop at NeurIPS 2025
点击查看摘要
[NLP-103] Multi-Reward GRPO Fine-Tuning for De-biasing Large Language Models: A Study Based on Chinese-Context Discrimination Data
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中存在的隐性偏见与歧视倾向问题,尤其是那些反映社会刻板印象的文化特定性和多维歧视。现有对齐技术如基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF)和直接偏好优化(Direct Preference Optimization, DPO)在缓解此类问题上存在局限性。解决方案的关键在于提出一种多奖励组相对策略优化(Multi-Reward Group Relative Policy Optimization, GRPO)框架:首先构建一个源自中文语境的合成英文数据集,涵盖地域、民族和职业等维度的偏见类别;其次利用DeBERTa-v3训练一个具备多维奖励信号(公平性、中立性和语言质量)的奖励模型;最后通过该模型引导GRPO进行细粒度的策略优化,从而实现模型输出在伦理维度上的去偏目标。实验表明,该方法显著降低了偏见强度,并保持了语言流畅性和信息丰富性,为跨文化语境下的伦理对齐提供了可复现的技术路径。
链接: https://arxiv.org/abs/2511.06023 作者: Deng Yixuan,Ji Xiaoqiang 机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); Shenzhen Institute of Artificial Intelligence and Robotics for Society (深圳市人工智能与机器人社会研究院) 类目: Computation and Language (cs.CL) 备注:
点击查看摘要
Abstract:Large Language Models (LLMs) often exhibit implicit biases and discriminatory tendencies that reflect underlying social stereotypes. While recent alignment techniques such as RLHF and DPO have mitigated some of these issues, they remain limited in addressing culturally specific and multi-dimensional forms of discrimination. This paper proposes a Multi-Reward Group Relative Policy Optimization (GRPO) framework to fine-tune LLMs toward ethical and bias-free behavior. Our approach constructs a synthetic English-language dataset derived from Chinese-context discrimination categories, including regional, ethnic, and occupational biases. Each instance is paired with both neutral and biased responses to train a reward model based on DeBERTa-v3, which provides multi-dimensional reward signals capturing fairness, neutrality, and linguistic quality. The trained reward model then guides GRPO fine-tuning to optimize model outputs along these ethical dimensions. Experimental results demonstrate significant reductions in bias intensity and improved alignment with non-discriminatory standards without compromising fluency or informativeness. This study highlights the effectiveness of GRPO-based multi-reward optimization for de-biasing LLMs and offers a replicable framework for cultural-contextual ethical alignment.
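GRPO 的核心是“组内相对优势”:同一提示采样一组回复,各自的奖励相对组均值归一化。下面把它与摘要提到的多维奖励(公平性、中立性、语言质量)拼在一起示意(三个维度的等权组合是假设,并非论文配方):

```python
import statistics

def combined_reward(scores, weights=(1/3, 1/3, 1/3)):
    """scores = (fairness, neutrality, linguistic quality), each in [0, 1]."""
    return sum(w * s for w, s in zip(weights, scores))

def group_relative_advantages(rewards):
    """GRPO normalizes each sampled response against its own group."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards) or 1.0   # guard a degenerate (constant) group
    return [(r - mu) / sd for r in rewards]

# One prompt, three sampled responses scored along the three dimensions.
group = [combined_reward(s) for s in
         [(0.9, 0.8, 0.7), (0.2, 0.3, 0.4), (0.5, 0.5, 0.5)]]
adv = group_relative_advantages(group)
```

优势为正的回复在策略更新中被强化,为负的被抑制;多维奖励只是改变了进入这一归一化的标量。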
[NLP-104] LLMs Do Not See Age: Assessing Demographic Bias in Automated Systematic Review Synthesis AACL2025
【速读】: 该论文旨在解决当前生成式 AI (Generative AI) 在生物医学证据合成任务中对年龄相关人口特征的保留不足问题,即语言模型在生成摘要时可能忽略或错误处理儿童、成人和老年人群的差异,从而导致潜在的偏见和不准确信息传递。解决方案的关键在于构建了一个新型的分年龄层数据集 DemogSummary,涵盖儿童、成人和老年群体的系统评价原始研究,并引入一种新的评估指标——人口学显著性评分(Demographic Salience Score, DSS),用于量化年龄相关实体的保留程度与幻觉情况。通过该方法,研究发现不同模型在不同年龄群体上的表现存在系统性差异,尤其成人群体的摘要质量最低,而被忽视的人群更容易出现幻觉,揭示了现有大语言模型在生物医学自然语言处理(NLP)中的公平性缺陷,强调需建立面向公平性的评估框架与摘要流程。
链接: https://arxiv.org/abs/2511.06000 作者: Favour Yahdii Aghaebe,Tanefa Apekey,Elizabeth Williams,Nafise Sadat Moosavi 机构: University of Sheffield (谢菲尔德大学) 类目: Computation and Language (cs.CL) 备注: Accepted at AACL 2025
点击查看摘要
Abstract:Clinical interventions often hinge on age: medications and procedures safe for adults may be harmful to children or ineffective for older adults. However, as language models are increasingly integrated into biomedical evidence synthesis workflows, it remains uncertain whether these systems preserve such crucial demographic distinctions. To address this gap, we evaluate how well state-of-the-art language models retain age-related information when generating abstractive summaries of biomedical studies. We construct DemogSummary, a novel age-stratified dataset of systematic review primary studies, covering child, adult, and older adult populations. We evaluate three prominent summarisation-capable LLMs, Qwen (open-source), Longformer (open-source) and GPT-4.1 Nano (proprietary), using both standard metrics and a newly proposed Demographic Salience Score (DSS), which quantifies age-related entity retention and hallucination. Our results reveal systematic disparities across models and age groups: demographic fidelity is lowest for adult-focused summaries, and under-represented populations are more prone to hallucinations. These findings highlight the limitations of current LLMs in faithful and bias-free summarisation and point to the need for fairness-aware evaluation frameworks and summarisation pipelines in biomedical NLP.
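摘要提出的 Demographic Salience Score(DSS)同时度量年龄实体的保留与幻觉。其一种可能的打分方式如下(具体公式为假设,论文定义可能不同):

```python
# Sketch: reward retention of source age entities, penalize hallucinated ones.
def dss(source_entities, summary_entities):
    src, out = set(source_entities), set(summary_entities)
    retained = len(src & out) / len(src) if src else 1.0
    hallucinated = len(out - src) / len(out) if out else 0.0
    return retained - hallucinated          # in [-1, 1]

score_good = dss({"children", "under 12"}, {"children", "under 12"})
score_bad = dss({"children", "under 12"}, {"adults"})   # dropped + hallucinated
```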
[NLP-105] Revisiting Entropy in Reinforcement Learning for Large Reasoning Models
【速读】: 该论文旨在解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)训练过程中大语言模型(Large Language Models, LLMs)熵坍缩(entropy collapse)的问题,该现象会导致模型过早收敛至次优局部极小值,阻碍性能进一步提升。解决方案的关键在于识别并调控导致熵坍缩的核心机制:研究发现,具有正优势(positive advantages)的token是熵坍缩的主要贡献者,通过在优化目标中调整正负优势token的相对损失权重,可以有效控制模型熵,从而缓解熵坍缩问题并提升模型在多个基准测试中的表现与响应多样性。
链接: https://arxiv.org/abs/2511.05993 作者: Renren Jin,Pengzhi Gao,Yuqi Ren,Zhuowen Han,Tongxuan Zhang,Wuwei Huang,Wei Liu,Jian Luan,Deyi Xiong 机构: Tianjin University (天津大学); Tianjin Normal University (天津师范大学) 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) 备注: 16 pages, 11 figures, 3 tables
点击查看摘要
Abstract:Reinforcement learning with verifiable rewards (RLVR) has emerged as a predominant approach for enhancing the reasoning capabilities of large language models (LLMs). However, the entropy of LLMs usually collapses during RLVR training, causing premature convergence to suboptimal local minima and hinder further performance improvement. Although various approaches have been proposed to mitigate entropy collapse, a comprehensive study of entropy in RLVR remains lacking. To address this gap, we conduct extensive experiments to investigate the entropy dynamics of LLMs trained with RLVR and analyze how model entropy correlates with response diversity, calibration, and performance across various benchmarks. Our findings reveal that the number of off-policy updates, the diversity of training data, and the clipping thresholds in the optimization objective are critical factors influencing the entropy of LLMs trained with RLVR. Moreover, we theoretically and empirically demonstrate that tokens with positive advantages are the primary contributors to entropy collapse, and that model entropy can be effectively regulated by adjusting the relative loss weights of tokens with positive and negative advantages during training.
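摘要的核心结论是:正优势 token 主导熵坍缩,可通过调节正负优势 token 的相对损失权重来控制熵。下面用最简形式示意这两个量(权重数值纯属示意):

```python
import math

def token_entropy(probs):
    """Shannon entropy of one token's next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def weighted_pg_loss(logps, advantages, w_pos=0.5, w_neg=1.0):
    """Policy-gradient loss with asymmetric weighting: down-weight the
    positive-advantage tokens that drive entropy collapse."""
    loss = 0.0
    for lp, a in zip(logps, advantages):
        w = w_pos if a > 0 else w_neg
        loss -= w * a * lp
    return loss

uniform = token_entropy([0.25] * 4)               # maximal entropy, 4 outcomes
peaked = token_entropy([0.97, 0.01, 0.01, 0.01])  # near-collapsed distribution
```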
[NLP-106] Interpretable Recognition of Cognitive Distortions in Natural Language Texts
链接: https://arxiv.org/abs/2511.05969 作者: Anton Kolonin,Anna Arinicheva 机构: Novosibirsk State University (新西伯利亚国立大学) 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG) 备注: 9 pages, 4 figures
点击查看摘要
[NLP-107] Reinforcement Learning Improves Traversal of Hierarchical Knowledge in LLM s
链接: https://arxiv.org/abs/2511.05933 作者: Renfei Zhang,Manasa Kaniselvan,Niloofar Mireshghallah 机构: 未知 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) 备注:
点击查看摘要
Abstract:Reinforcement learning (RL) is often credited with improving language model reasoning and generalization at the expense of degrading memorized knowledge. We challenge this narrative by observing that RL-enhanced models consistently outperform their base and supervised fine-tuned (SFT) counterparts on pure knowledge recall tasks, particularly those requiring traversal of hierarchical, structured knowledge (e.g., medical codes). We hypothesize these gains stem not from newly acquired data, but from improved procedural skills in navigating and searching existing knowledge hierarchies within the model parameters. To support this hypothesis, we show that structured prompting, which explicitly guides SFTed models through hierarchical traversal, recovers most of the performance gap (reducing 24pp to 7pp on MedConceptsQA for DeepSeek-V3/R1). We further find that while prompting improves final-answer accuracy, RL-enhanced models retain superior ability to recall correct procedural paths on deep-retrieval tasks. Finally our layer-wise internal activation analysis reveals that while factual representations (e.g., activations for the statement “code 57.95 refers to urinary infection”) maintain high cosine similarity between SFT and RL models, query representations (e.g., “what is code 57.95”) diverge noticeably, indicating that RL primarily transforms how models traverse knowledge rather than the knowledge representation itself.
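摘要中“结构化提示引导模型逐层遍历层级知识”的过程,可用一个玩具编码层级示意(57.95 的释义取自摘要中的示例语句;其余编码与标签为虚构,并非真实医疗编码):

```python
# A toy code hierarchy stored as parent links, traversed leaf-to-root the
# way a structured prompt walks a model through one level at a time.
TREE = {
    "57":    {"label": "operations on urinary system", "parent": None},
    "57.9":  {"label": "other bladder operations",     "parent": "57"},
    "57.95": {"label": "urinary infection",            "parent": "57.9"},
}

def traversal_path(code, tree=TREE):
    """Follow parent links from a leaf code to the root, leaf first."""
    path = []
    while code is not None:
        path.append((code, tree[code]["label"]))
        code = tree[code]["parent"]
    return path

path = traversal_path("57.95")
```

摘要的发现可理解为:SFT 模型也存有这张“树”,但 RL 模型更擅长执行这样的逐级检索过程。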
[NLP-108] IDALC: A Semi-Supervised Framework for Intent Detection and Active Learning based Correction
【速读】: 该论文旨在解决语音控制对话系统中因模型置信度低导致的用户意图识别失败问题,以及系统拒绝的语句在后续迭代中需要重新标注以引入新意图时所面临的高人工标注成本问题。解决方案的关键在于提出一种基于主动学习(Active Learning)的半监督框架IDALC(Intent Detection and Active Learning based Correction),通过智能筛选最具信息量的未标注样本进行人工标注,从而显著降低整体标注需求(仅需6–10%的未标注数据),同时提升意图检测准确率与宏平均F1分数(相比基线方法提升5–10%和4–8%)。
链接: https://arxiv.org/abs/2511.05921 作者: Ankan Mullick,Sukannya Purkayastha,Saransh Sharma,Pawan Goyal,Niloy Ganguly 机构: IIT Kharagpur(印度理工学院克哈格普尔分校); Technische Universität Darmstadt(达姆施塔特工业大学); Adobe Research(Adobe研究院) 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) 备注: Paper accepted in IEEE Transactions on Artificial Intelligence (October 2025)
点击查看摘要
Abstract:Voice-controlled dialog systems have become immensely popular due to their ability to perform a wide range of actions in response to diverse user queries. These agents possess a predefined set of skills or intents to fulfill specific user tasks. But every system has its own limitations. There are instances where, even for known intents, if any model exhibits low confidence, it results in rejection of utterances that necessitate manual annotation. Additionally, as time progresses, there may be a need to retrain these agents with new intents from the system-rejected queries to carry out additional tasks. Labeling all these emerging intents and rejected utterances over time is impractical, thus calling for an efficient mechanism to reduce annotation costs. In this paper, we introduce IDALC (Intent Detection and Active Learning based Correction), a semi-supervised framework designed to detect user intents and rectify system-rejected utterances while minimizing the need for human annotation. Empirical findings on various benchmark datasets demonstrate that our system surpasses baseline methods, achieving a 5-10% higher accuracy and a 4-8% improvement in macro-F1. Remarkably, we maintain the overall annotation cost at just 6-10% of the unlabelled data available to the system. The overall framework of IDALC is shown in Fig. 1
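The annotation-cost reduction described above hinges on an acquisition step: score unlabeled, system-rejected utterances by model uncertainty and send only the most ambiguous ones to annotators. A minimal sketch of that step (the entropy criterion and the toy predictions are illustrative, not the paper's exact scoring function):

```python
import math

def entropy(probs):
    """Shannon entropy of one utterance's predicted intent distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_annotation(predictions, budget):
    """Return indices of the `budget` most uncertain utterances --
    the ones worth spending human annotation on."""
    ranked = sorted(range(len(predictions)),
                    key=lambda i: entropy(predictions[i]), reverse=True)
    return ranked[:budget]

preds = [
    [0.98, 0.01, 0.01],  # confident: keep the model's label
    [0.40, 0.35, 0.25],  # ambiguous: candidate for annotation
    [0.90, 0.05, 0.05],
    [0.34, 0.33, 0.33],  # most ambiguous
]
chosen = select_for_annotation(preds, budget=2)
```

With a budget of 2 the selector picks the two flattest distributions (indices 3 and 1), mirroring how IDALC keeps annotation at a small fraction of the unlabeled pool.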
[NLP-109] Injecting Falsehoods: Adversarial Man-in-the-Middle Attacks Undermining Factual Recall in LLMs
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在信息检索场景中因提示注入(prompt injection)攻击导致的事实记忆错误问题,尤其是针对中间人(man-in-the-middle, MitM)攻击下LLMs生成答案的可靠性与不确定性。其关键解决方案是提出Xmera框架——一个基于理论的MitM攻击评估工具,通过在三个封闭式、基于事实的问答任务中对受害LLM输入进行扰动,量化响应正确性与生成过程中的不确定性;研究发现,简单的指令类攻击成功率高达约85.3%,且错误回答时伴随高不确定性;据此,作者进一步利用随机森林分类器基于响应不确定性水平实现高效防御(平均AUC达~96%),从而为用户识别潜在恶意LLM输出提供初步安全预警机制。
链接: https://arxiv.org/abs/2511.05919 作者: Alina Fastowski,Bardh Prenkaj,Yuxiao Li,Gjergji Kasneci 机构: 未知 类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) 备注:
点击查看摘要
Abstract:LLMs are now an integral part of information retrieval. As such, their role as question answering chatbots raises significant concerns due to their shown vulnerability to adversarial man-in-the-middle (MitM) attacks. Here, we propose the first principled attack evaluation on LLM factual memory under prompt injection via Xmera, our novel, theory-grounded MitM framework. By perturbing the input given to “victim” LLMs in three closed-book and fact-based QA settings, we undermine the correctness of the responses and assess the uncertainty of their generation process. Surprisingly, trivial instruction-based attacks report the highest success rate (up to ~85.3%) while simultaneously having a high uncertainty for incorrectly answered questions. To provide a simple defense mechanism against Xmera, we train Random Forest classifiers on the response uncertainty levels to distinguish between attacked and unattacked queries (average AUC of up to ~96%). We believe that signaling users to be cautious about the answers they receive from black-box and potentially corrupt LLMs is a first checkpoint toward user cyberspace safety.
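A defense in this spirit separates attacked from clean queries using only response-uncertainty features. The sketch below swaps the paper's Random Forest for a single learned threshold on mean token negative log-probability, purely for illustration (all values are made up):

```python
def mean_nll(token_logprobs):
    """Uncertainty proxy: mean negative log-probability of the tokens
    the model generated for one query."""
    return -sum(token_logprobs) / len(token_logprobs)

def fit_threshold(uncertainties, labels):
    """One-feature stand-in for the paper's Random Forest: choose the
    uncertainty threshold that best separates attacked (1) from clean (0)."""
    best_t, best_acc = 0.0, -1.0
    for t in sorted(uncertainties):
        preds = [1 if u >= t else 0 for u in uncertainties]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Clean queries tend to be answered confidently; attacked ones less so.
clean    = [mean_nll([-0.1, -0.2, -0.1]), mean_nll([-0.3, -0.1, -0.2])]
attacked = [mean_nll([-1.5, -2.0, -1.0]), mean_nll([-2.2, -1.8, -2.5])]
t, acc = fit_threshold(clean + attacked, [0, 0, 1, 1])
```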
[NLP-110] NILC: Discovering New Intents with LLM-assisted Clustering
【速读】: 该论文旨在解决新意图发现(New Intent Discovery, NID)任务中现有级联式架构的局限性问题,即文本嵌入与聚类步骤缺乏相互反馈优化,且仅依赖嵌入空间聚类忽视了语义细微差别,导致性能不佳。其解决方案的关键在于提出一种名为NILC的新颖聚类框架,采用迭代流程:首先利用大语言模型(Large Language Models, LLMs)生成额外的语义中心点以增强欧氏空间中的聚类中心语义表示;其次,通过LLMs对模糊或简短样本进行重写以扩充难样本,用于后续聚类修正;同时引入非平凡的种子设置和软必须链接(soft must links)监督信号,在半监督场景下提升NID精度。该方法有效实现了嵌入与聚类的协同优化,显著提升了跨领域基准数据集上的性能表现。
Abstract:New intent discovery (NID) seeks to recognize both new and known intents from unlabeled user utterances, which finds prevalent use in practical dialogue systems. Existing works towards NID mainly adopt a cascaded architecture, wherein the first stage focuses on encoding the utterances into informative text embeddings beforehand, while the latter is to group similar embeddings into clusters (i.e., intents), typically by K-Means. However, such a cascaded pipeline fails to leverage the feedback from both steps for mutual refinement, and, meanwhile, the embedding-only clustering overlooks nuanced textual semantics, leading to suboptimal performance. To bridge this gap, this paper proposes NILC, a novel clustering framework specially catered for effective NID. Particularly, NILC follows an iterative workflow, in which clustering assignments are judiciously updated by carefully refining cluster centroids and text embeddings of uncertain utterances with the aid of large language models (LLMs). Specifically, NILC first taps into LLMs to create additional semantic centroids for clusters, thereby enriching the contextual semantics of the Euclidean centroids of embeddings. Moreover, LLMs are then harnessed to augment hard samples (ambiguous or terse utterances) identified from clusters via rewriting for subsequent cluster correction. Further, we inject supervision signals through non-trivial techniques seeding and soft must links for more accurate NID in the semi-supervised setting. Extensive experiments comparing NILC against multiple recent baselines under both unsupervised and semi-supervised settings showcase that NILC can achieve significant performance improvements over six benchmark datasets of diverse domains consistently.
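Stripped of the LLM-assisted components (semantic centroids, rewriting of hard samples), the backbone of such a pipeline is an assign/refine loop over embeddings. A toy iteration with 2-D points standing in for utterance embeddings:

```python
def dist2(p, c):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(p, c))

def assign(points, centroids):
    """Assign each embedding to its nearest centroid (intent cluster)."""
    return [min(range(len(centroids)), key=lambda k: dist2(p, centroids[k]))
            for p in points]

def refine(points, labels, k):
    """Recompute centroids as cluster means; NILC would additionally
    inject LLM-derived semantic centroids and rewrite hard samples here."""
    cents = []
    for j in range(k):
        members = [p for p, l in zip(points, labels) if l == j]
        cents.append(tuple(sum(x) / len(members) for x in zip(*members)))
    return cents

points = [(0.0, 0.1), (0.1, 0.0), (0.9, 1.0), (1.0, 0.9)]
labels = assign(points, [(0.0, 0.0), (1.0, 1.0)])
centroids = refine(points, labels, 2)
```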
[NLP-111] The Imperfect Learner: Incorporating Developmental Trajectories in Memory-based Student Simulation
链接: https://arxiv.org/abs/2511.05903 作者: Zhengyuan Liu,Stella Xin Yin,Bryan Chen Zhengyu Tan,Roy Ka-Wei Lee,Guimei Liu,Dion Hoe-Lian Goh,Wenya Wang,Nancy F. Chen 机构: 未知 类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC) 备注:
点击查看摘要
[NLP-112] Retrieval-Augmented Generation in Medicine: A Scoping Review of Technical Implementations, Clinical Applications, and Ethical Considerations
【速读】: 该论文旨在解决医疗领域中因医学知识快速增长和临床实践日益复杂所带来的挑战,尤其是在大语言模型(Large Language Models, LLMs)临床应用中存在的固有局限性问题。其核心解决方案是采用检索增强生成(Retrieval-Augmented Generation, RAG)技术,通过结合外部知识检索与生成能力,提升LLMs在医疗场景下的准确性、相关性和可靠性。研究指出,当前RAG在医学领域的应用仍处于早期阶段,关键突破点在于加强临床验证、实现跨语言适应能力以及支持低资源环境,以推动其在全球范围内可信且负责任的部署。
链接: https://arxiv.org/abs/2511.05901 作者: Rui Yang,Matthew Yu Heng Wong,Huitao Li,Xin Li,Wentao Zhu,Jingchi Liao,Kunyu Yu,Jonathan Chong Kai Liew,Weihao Xuan,Yingjian Chen,Yuhe Ke,Jasmine Chiat Ling Ong,Douglas Teodoro,Chuan Hong,Daniel Shi Wei Ting,Nan Liu 机构: 未知 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) 备注:
点击查看摘要
Abstract:The rapid growth of medical knowledge and increasing complexity of clinical practice pose challenges. In this context, large language models (LLMs) have demonstrated value; however, inherent limitations remain. Retrieval-augmented generation (RAG) technologies show potential to enhance their clinical applicability. This study reviewed RAG applications in medicine. We found that research primarily relied on publicly available data, with limited application in private data. For retrieval, approaches commonly relied on English-centric embedding models, while LLMs were mostly generic, with limited use of medical-specific LLMs. For evaluation, automated metrics evaluated generation quality and task performance, whereas human evaluation focused on accuracy, completeness, relevance, and fluency, with insufficient attention to bias and safety. RAG applications were concentrated on question answering, report generation, text summarization, and information extraction. Overall, medical RAG remains at an early stage, requiring advances in clinical validation, cross-linguistic adaptation, and support for low-resource settings to enable trustworthy and responsible global use.
[NLP-113] MCP-RiskCue: Can LLM infer risk information from MCP server System Logs?
链接: https://arxiv.org/abs/2511.05867 作者: Jiayi Fu,Qiyao Sun 机构: Southern University of Science and Technology (南方科技大学); Beijing University of Posts and Telecommunications (北京邮电大学) 类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL) 备注:
点击查看摘要
[NLP-114] Quantifying Edits Decay in Fine-tuned LLMs ICLR2026
【速读】: 该论文旨在解决知识编辑(Knowledge Editing, KE)与微调(Fine-tuning)协同使用时的兼容性问题,即在对已编辑的大语言模型(Large Language Models, LLMs)进行微调后,原有编辑内容是否能够保留。这一问题直接影响到模型部署的安全性和成本效率:若编辑失效,则需重复编辑;若编辑残留,则可能传播恶意信息。解决方案的关键在于系统性地量化编辑衰减(edit decay)现象,发现不同编辑方法(如MEMIT、AlphaEdit)和微调策略(如全参数微调、LoRA、DoRA)对编辑持久性的影响,并提出选择性层微调(selective-layer fine-tuning)策略——仅微调被编辑的层可有效移除编辑,同时保持下游任务性能损失最小;更意外的是,微调未编辑层反而比全参数微调造成更大的编辑破坏,揭示了模型内部结构对编辑稳定性的关键作用。
链接: https://arxiv.org/abs/2511.05852 作者: Yinjie Cheng,Paul Youssef,Christin Seifert,Jörg Schlötterer,Zhixue Zhao 机构: University of Sheffield (谢菲尔德大学); University of Marburg (马尔堡大学) 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) 备注: Under review at ICLR 2026
点击查看摘要
Abstract:Knowledge editing has emerged as a lightweight alternative to retraining for correcting or injecting specific facts in large language models (LLMs). Meanwhile, fine-tuning remains the default operation for adapting LLMs to new domains and tasks. Despite their widespread adoption, these two post-training interventions have been studied in isolation, leaving open a crucial question: if we fine-tune an edited model, do the edits survive? This question is motivated by two practical scenarios: removing covert or malicious edits, and preserving beneficial edits. If fine-tuning impairs edits as shown in Figure 1, current KE methods become less useful, as every fine-tuned model would require re-editing, which significantly increases the cost; if edits persist, fine-tuned models risk propagating hidden malicious edits, raising serious safety concerns. To this end, we systematically quantify edits decay after fine-tuning, investigating how fine-tuning affects knowledge editing. We evaluate two state-of-the-art editing methods (MEMIT, AlphaEdit) and three fine-tuning approaches (full-parameter, LoRA, DoRA) across five LLMs and three datasets, yielding 232 experimental configurations. Our results show that edits decay after fine-tuning, with survival varying across configurations, e.g., AlphaEdit edits decay more than MEMIT edits. Further, we propose selective-layer fine-tuning and find that fine-tuning edited layers only can effectively remove edits, though at a slight cost to downstream performance. Surprisingly, fine-tuning non-edited layers impairs more edits than full fine-tuning. Overall, our study establishes empirical baselines and actionable strategies for integrating knowledge editing with fine-tuning, and underscores that evaluating model editing requires considering the full LLM application pipeline.
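Selective-layer fine-tuning amounts to masking which parameters receive updates. A minimal sketch (layer names and values are toy stand-ins, not the paper's setup):

```python
def sgd_step(params, grads, trainable, lr=0.1):
    """Apply an SGD update only to layers named in `trainable`;
    all other layers stay frozen, as in selective-layer fine-tuning."""
    return {name: w - lr * grads[name] if name in trainable else w
            for name, w in params.items()}

params = {"layer.4": 1.0, "layer.5": 2.0, "layer.6": 3.0}
grads  = {"layer.4": 0.5, "layer.5": 0.5, "layer.6": 0.5}

# Fine-tune only the (hypothetically) edited layer.5; freeze the rest:
updated = sgd_step(params, grads, trainable={"layer.5"})
```

Fine-tuning only the edited layers in this way is what the authors find removes edits most effectively, at a small cost to downstream performance.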
[NLP-115] DiagnoLLM : A Hybrid Bayesian Neural Language Framework for Interpretable Disease Diagnosis
【速读】: 该论文旨在解决临床人工智能(AI)系统在疾病诊断中面临的可信赖性问题,即如何在保证高预测准确性的同时提供透明且具有生物学依据的解释。其解决方案的关键在于构建一个混合框架DiagnoLLM,该框架融合了贝叶斯去卷积(Bayesian deconvolution)、eQTL引导的深度学习以及大语言模型(Large Language Model, LLM)驱动的叙事生成模块:首先通过GP-unmix模型从批量和单细胞RNA测序数据中推断细胞类型特异性的基因表达谱并量化生物不确定性;随后利用eQTL分析提供的调控先验信息训练神经分类器以实现阿尔茨海默病(Alzheimer’s Disease, AD)的高效检测(准确率达88.0%);最后借助LLM作为后处理推理模块,将模型输出转化为面向医生和患者的结构化诊断报告,确保内容基于临床特征、归因信号及领域知识,从而增强人类对系统的理解与信任。
Abstract:Building trustworthy clinical AI systems requires not only accurate predictions but also transparent, biologically grounded explanations. We present \textttDiagnoLLM, a hybrid framework that integrates Bayesian deconvolution, eQTL-guided deep learning, and LLM-based narrative generation for interpretable disease diagnosis. DiagnoLLM begins with GP-unmix, a Gaussian Process-based hierarchical model that infers cell-type-specific gene expression profiles from bulk and single-cell RNA-seq data while modeling biological uncertainty. These features, combined with regulatory priors from eQTL analysis, power a neural classifier that achieves high predictive performance in Alzheimer’s Disease (AD) detection (88.0% accuracy). To support human understanding and trust, we introduce an LLM-based reasoning module that translates model outputs into audience-specific diagnostic reports, grounded in clinical features, attribution signals, and domain knowledge. Human evaluations confirm that these reports are accurate, actionable, and appropriately tailored for both physicians and patients. Our findings show that LLMs, when deployed as post-hoc reasoners rather than end-to-end predictors, can serve as effective communicators within hybrid diagnostic pipelines.
[NLP-116] DRAGON: Guard LLM Unlearning in Context via Negative Detection and Reasoning ICML2025 NEURIPS2025
链接: https://arxiv.org/abs/2511.05784 作者: Yaxuan Wang,Chris Yuhao Liu,Quan Liu,Jinglong Pang,Wei Wei,Yujia Bao,Yang Liu 机构: University of California, Santa Cruz (加州大学圣克鲁兹分校); Center for Advanced AI, Accenture (埃森哲高级人工智能中心) 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) 备注: Please refer to the NeurIPS 2025 submission: this https URL The paper has been accepted to the ICML 2025 MUGen Workshop: this https URL
点击查看摘要
Abstract:Unlearning in Large Language Models (LLMs) is crucial for protecting private data and removing harmful knowledge. Most existing approaches rely on fine-tuning to balance unlearning efficiency with general language capabilities. However, these methods typically require training or access to retain data, which is often unavailable in real world scenarios. Although these methods can perform well when both forget and retain data are available, few works have demonstrated equivalent capability in more practical, data-limited scenarios. To overcome these limitations, we propose Detect-Reasoning Augmented GeneratiON (DRAGON), a systematic, reasoning-based framework that utilizes in-context chain-of-thought (CoT) instructions to guard deployed LLMs before inference. Instead of modifying the base model, DRAGON leverages the inherent instruction-following ability of LLMs and introduces a lightweight detection module to identify forget-worthy prompts without any retain data. These are then routed through a dedicated CoT guard model to enforce safe and accurate in-context intervention. To robustly evaluate unlearning performance, we introduce novel metrics for unlearning performance and the continual unlearning setting. Extensive experiments across three representative unlearning tasks validate the effectiveness of DRAGON, demonstrating its strong unlearning capability, scalability, and applicability in practical scenarios.
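In sketch form, guarding at inference time is a routing decision: a lightweight detector flags forget-worthy prompts and hands them to a guard model, leaving the base model untouched. All components below are toy placeholders for the paper's detection module and CoT guard:

```python
def guard_pipeline(prompt, detector, guard_model, base_model):
    """Route forget-worthy prompts through a guard model instead of
    modifying the base model (components are placeholders)."""
    if detector(prompt):
        return guard_model(prompt)   # safe in-context intervention
    return base_model(prompt)

# Toy stand-ins for the detector, guard, and deployed LLM:
detector = lambda p: "harry potter" in p.lower()
guard    = lambda p: "I can't share details about that topic."
base     = lambda p: f"Answer to: {p}"

out_forget = guard_pipeline("Who wrote Harry Potter?", detector, guard, base)
out_keep   = guard_pipeline("What is 2+2?", detector, guard, base)
```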
[NLP-117] Anchors in the Machine: Behavioral and Attributional Evidence of Anchoring Bias in LLM s
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中锚定偏差(anchoring bias)的本质问题:即观测到的偏倚是表层输出模仿还是深层概率分布的改变。此前研究多依赖表面行为证据,缺乏对内部机制和因果贡献的解析。解决方案的关键在于构建一个融合行为分析与可解释性方法的统一框架——首先通过基于log-probability的行为分析验证锚点会系统性改变输出分布;其次利用Shapley值对结构化提示字段进行精确归因,量化锚点对模型log-probability的影响;最终提出“锚定偏差敏感度评分”(Anchoring Bias Sensitivity Score),整合行为与归因证据,实现对六种开源模型的跨模型比较。该框架不仅揭示了锚定偏差在LLMs中的稳健性和可解释性,也为评估其他认知偏倚提供了可复现的方法路径。
Abstract:Large language models (LLMs) are increasingly examined as both behavioral subjects and decision systems, yet it remains unclear whether observed cognitive biases reflect surface imitation or deeper probability shifts. Anchoring bias, a classic human judgment bias, offers a critical test case. While prior work shows LLMs exhibit anchoring, most evidence relies on surface-level outputs, leaving internal mechanisms and attributional contributions unexplored. This paper advances the study of anchoring in LLMs through three contributions: (1) a log-probability-based behavioral analysis showing that anchors shift entire output distributions, with controls for training-data contamination; (2) exact Shapley-value attribution over structured prompt fields to quantify anchor influence on model log-probabilities; and (3) a unified Anchoring Bias Sensitivity Score integrating behavioral and attributional evidence across six open-source models. Results reveal robust anchoring effects in Gemma-2B, Phi-2, and Llama-2-7B, with attribution signaling that the anchors influence reweighting. Smaller models such as GPT-2, Falcon-RW-1B, and GPT-Neo-125M show variability, suggesting scale may modulate sensitivity. Attributional effects, however, vary across prompt designs, underscoring fragility in treating LLMs as human substitutes. The findings demonstrate that anchoring bias in LLMs is robust, measurable, and interpretable, while highlighting risks in applied domains. More broadly, the framework bridges behavioral science, LLM safety, and interpretability, offering a reproducible path for evaluating other cognitive biases in LLMs.
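Because a structured prompt has only a handful of fields, exact Shapley attribution is feasible by enumerating all 2^n coalitions. A generic implementation, with a toy additive value function standing in for the model's answer log-probability:

```python
from itertools import combinations
from math import factorial

def shapley(fields, value):
    """Exact Shapley values: `value` maps a frozenset of present prompt
    fields to a scalar (e.g., log-probability of the anchored answer)."""
    n = len(fields)
    phi = {}
    for f in fields:
        others = [g for g in fields if g != f]
        total = 0.0
        for r in range(n):
            for S in combinations(others, r):
                # Weight of this coalition in the Shapley average.
                w = factorial(r) * factorial(n - r - 1) / factorial(n)
                S = frozenset(S)
                total += w * (value(S | {f}) - value(S))
        phi[f] = total
    return phi

# Toy value function: the anchor field shifts log-prob by +2.0 (made up).
contrib = {"question": 0.5, "anchor": 2.0, "context": 0.3}
v = lambda S: sum(contrib[f] for f in S)
phi = shapley(list(contrib), v)
```

For an additive value function the Shapley values recover each field's contribution exactly, and they always satisfy the efficiency property (values sum to the full-coalition payoff).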
[NLP-118] Language Generation: Complexity Barriers and Implications for Learning
链接: https://arxiv.org/abs/2511.05759 作者: Marcelo Arenas,Pablo Barceló,Luis Cofré,Alexander Kozachinskiy 机构: DCC UC & IMFD; IMC UC, CENIA & IMFD; Faculty of Mathematics UC; CENIA 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG) 备注:
点击查看摘要
[NLP-119] Multi-Scale Feature Fusion and Graph Neural Network Integration for Text Classification with Large Language Models
【速读】: 该论文旨在解决复杂语义场景下文本分类任务中全局信息与局部细节难以平衡、以及语义单元间潜在关系建模不足的问题。解决方案的关键在于提出一种融合深度特征提取、多尺度特征金字塔融合和图神经网络结构建模的混合方法:首先利用大语言模型(Large Language Model, LLM)捕获上下文依赖和深层语义表示,随后通过特征金字塔机制整合不同尺度的语义特征,实现全局与局部信息的协同表达;进而将融合后的特征转化为图结构,借助图神经网络(Graph Neural Network, GNN)挖掘文本内部隐含的语义关联与逻辑依赖,从而实现对语义单元间复杂交互的全面建模。该框架在准确率(ACC)、F1分数(F1-Score)、AUC及精确率(Precision)等指标上均优于现有模型,验证了其有效性与稳定性。
链接: https://arxiv.org/abs/2511.05752 作者: Xiangchen Song,Yulin Huang,Jinxu Guo,Yuchen Liu,Yaxuan Luan 机构: 未知 类目: Computation and Language (cs.CL) 备注:
点击查看摘要
Abstract:This study investigates a hybrid method for text classification that integrates deep feature extraction from large language models, multi-scale fusion through feature pyramids, and structured modeling with graph neural networks to enhance performance in complex semantic contexts. First, the large language model captures contextual dependencies and deep semantic representations of the input text, providing a rich feature foundation for subsequent modeling. Then, based on multi-level feature representations, the feature pyramid mechanism effectively integrates semantic features of different scales, balancing global information and local details to construct hierarchical semantic expressions. Furthermore, the fused features are transformed into graph representations, and graph neural networks are employed to capture latent semantic relations and logical dependencies in the text, enabling comprehensive modeling of complex interactions among semantic units. On this basis, the readout and classification modules generate the final category predictions. The proposed method demonstrates significant advantages in robustness alignment experiments, outperforming existing models on ACC, F1-Score, AUC, and Precision, which verifies the effectiveness and stability of the framework. This study not only constructs an integrated framework that balances global and local information as well as semantics and structure, but also provides a new perspective for multi-scale feature fusion and structured semantic modeling in text classification tasks.
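The graph-modeling stage described above can be sketched as one message-passing step over semantic units: each node's feature becomes an aggregate of its own and its neighbors' features. This mean-aggregation layer is a stand-in for the paper's (learned) graph neural network:

```python
def gnn_layer(adj, feats):
    """One mean-aggregation message-passing step: each node averages
    its own feature with those of its neighbors (no learned weights)."""
    n = len(feats)
    out = []
    for i in range(n):
        nbrs = [j for j in range(n) if adj[i][j] or j == i]
        out.append([sum(feats[j][d] for j in nbrs) / len(nbrs)
                    for d in range(len(feats[0]))])
    return out

# Semantic units as nodes; edges encode latent relations between them.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
smoothed = gnn_layer(adj, feats)
```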
链接: https://arxiv.org/abs/2511.05743 作者: Kerem Sahin(1),Sheridan Feucht(1),Adam Belfki(1),Jannik Brinkmann(2),Aaron Mueller(3),David Bau(1),Chris Wendler(1) ((1) Northeastern University, (2) University of Mannheim, (3) Boston University) 机构: Northeastern University (东北大学); University of Mannheim (曼海姆大学); Boston University (波士顿大学) 类目: Computation and Language (cs.CL) 备注:
点击查看摘要
Abstract:Induction heads are attention heads that perform inductive copying by matching patterns from earlier context and copying their continuations verbatim. As models develop induction heads, they often experience a sharp drop in training loss, a phenomenon cited as evidence that induction heads may serve as a prerequisite for more complex in-context learning (ICL) capabilities. In this work, we ask whether transformers can still acquire ICL capabilities when inductive copying is suppressed. We propose Hapax, a setting where we omit the loss contribution of any token that can be correctly predicted by induction heads. Despite a significant reduction in inductive copying, performance on abstractive ICL tasks (i.e., tasks where the answer is not contained in the input context) remains comparable and surpasses the vanilla model on 13 of 21 tasks, even though 31.7% of tokens are omitted from the loss. Furthermore, our model achieves lower loss values on token positions that cannot be predicted correctly by induction heads. Mechanistic analysis further shows that models trained with Hapax develop fewer and weaker induction heads but still preserve ICL capabilities. Taken together, our findings indicate that inductive copying is not essential for learning abstractive ICL mechanisms.
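The Hapax objective can be sketched as a per-token NLL in which positions an induction head would already predict correctly are dropped from the average (the log-probabilities and flags below are made up for illustration):

```python
def hapax_loss(token_logprobs, induction_predictable):
    """Mean negative log-likelihood over tokens, omitting positions
    flagged as correctly predictable by an induction head."""
    kept = [-lp for lp, skip in zip(token_logprobs, induction_predictable)
            if not skip]
    return sum(kept) / len(kept)

logprobs  = [-0.2, -0.05, -1.3, -0.9]    # per-token log-probabilities
inductive = [False, True, False, False]  # token 1 is a verbatim copy

loss = hapax_loss(logprobs, inductive)   # averages over the other 3 tokens
```

Dropping the easy (copyable) token raises the mean loss relative to the vanilla objective, which is exactly why the omitted positions no longer reward inductive copying.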
[NLP-121] OckBench: Measuring the Efficiency of LLM Reasoning
【速读】: 该论文试图解决当前大语言模型(Large Language Models, LLMs)评估体系中忽视解码Token效率的问题。现有基准主要关注准确性和输出质量,但未考虑在实际系统中生成不同数量Token所带来的延迟、成本和能耗差异。解决方案的关键在于提出OckBench——一个模型无关且硬件无关的基准测试平台,能够同时衡量推理与编码任务中的准确性与Token消耗量。通过该平台,研究者可以识别出在相似准确率下Token使用效率存在显著差异的现象,并构建精度-效率帕累托前沿(Pareto frontiers),从而推动从“将Token视为免费资源”向“重视Token效率”的评估范式转变。
链接: https://arxiv.org/abs/2511.05722 作者: Zheng Du,Hao Kang,Song Han,Tushar Krishna,Ligeng Zhu 机构: Georgia Institute of Technology (佐治亚理工学院); Massachusetts Institute of Technology (麻省理工学院); Nvidia Cooperation (英伟达公司) 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) 备注:
点击查看摘要
Abstract:Large language models such as GPT-4, Claude 3, and the Gemini series have improved automated reasoning and code generation. However, existing benchmarks mainly focus on accuracy and output quality, and they ignore an important factor: decoding token efficiency. In real systems, generating 10,000 tokens versus 100,000 tokens leads to large differences in latency, cost, and energy. In this work, we introduce OckBench, a model-agnostic and hardware-agnostic benchmark that evaluates both accuracy and token count for reasoning and coding tasks. Through experiments comparing multiple open- and closed-source models, we uncover that many models with comparable accuracy differ wildly in token consumption, revealing that efficiency variance is a neglected but significant axis of differentiation. We further demonstrate Pareto frontiers over the accuracy-efficiency plane and argue for an evaluation paradigm shift: we should no longer treat tokens as “free” to multiply. OckBench provides a unified platform for measuring, comparing, and guiding research in token-efficient reasoning. Our benchmarks are available at this https URL .
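Given per-model (accuracy, token-count) pairs, the Pareto frontier the authors advocate takes only a few lines to compute; the model entries below are placeholders, not OckBench numbers:

```python
def pareto_frontier(models):
    """models: name -> (accuracy, avg_decode_tokens). A model is kept
    unless some other model is at least as accurate AND at least as
    cheap, and strictly better on one of the two axes."""
    front = []
    for name, (acc, tok) in models.items():
        dominated = any(
            a >= acc and t <= tok and (a > acc or t < tok)
            for other, (a, t) in models.items() if other != name)
        if not dominated:
            front.append(name)
    return sorted(front)

models = {
    "A": (0.82, 9_000),    # accurate and cheap
    "B": (0.82, 40_000),   # same accuracy, 4x the tokens -> dominated
    "C": (0.90, 60_000),   # most accurate, expensive -> on the frontier
    "D": (0.70, 5_000),    # cheapest -> on the frontier
}
frontier = pareto_frontier(models)
```

Model B illustrates the paper's point: comparable accuracy can hide a large difference in token consumption, and only a two-axis evaluation exposes it.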
[NLP-122] Persian Musical Instruments Classification Using Polyphonic Data Augmentation
【速读】: 该论文旨在解决非西方音乐传统(特别是波斯音乐)中乐器分类研究匮乏的问题,以提升音乐信息检索(Music Information Retrieval, MIR)和生成式音乐系统对多元文化音乐的理解能力。其关键解决方案是提出了一种文化相关数据增强策略,通过该策略从单音轨样本中生成具有真实感的多声部混合音频,并结合基于大规模自监督训练的MERT模型(Music undERstanding with large-scale self-supervised Training)进行分类任务。实验表明,该方法在真实波斯音乐多声部场景下取得最优ROC-AUC(0.795),验证了音高与时间连贯性协同增强对于提升乐器识别鲁棒性的有效性,为构建更具文化包容性的MIR系统提供了基础。
链接: https://arxiv.org/abs/2511.05717 作者: Diba Hadi Esfangereh,Mohammad Hossein Sameti,Sepehr Harfi Moridani,Leili Javidpour,Mahdieh Soleymani Baghshah 机构: Sharif University of Technology (伊朗沙里夫理工大学) 类目: ound (cs.SD); Computation and Language (cs.CL) 备注: 9 pages, 2 figures, 4 tables
点击查看摘要
Abstract:Musical instrument classification is essential for music information retrieval (MIR) and generative music systems. However, research on non-Western traditions, particularly Persian music, remains limited. We address this gap by introducing a new dataset of isolated recordings covering seven traditional Persian instruments, two common but originally non-Persian instruments (i.e., violin, piano), and vocals. We propose a culturally informed data augmentation strategy that generates realistic polyphonic mixtures from monophonic samples. Using the MERT model (Music undERstanding with large-scale self-supervised Training) with a classification head, we evaluate our approach with out-of-distribution data which was obtained by manually labeling segments of traditional songs. On real-world polyphonic Persian music, the proposed method yielded the best ROC-AUC (0.795), highlighting complementary benefits of tonal and temporal coherence. These results demonstrate the effectiveness of culturally grounded augmentation for robust Persian instrument recognition and provide a foundation for culturally inclusive MIR and diverse music generation systems.
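The core of such augmentation, overlaying monophonic recordings into polyphonic mixtures, can be sketched as gain-weighted summation with peak normalization. Sample values and instrument names are illustrative; the paper's culturally informed strategy additionally constrains which instruments realistically co-occur:

```python
import random

def make_polyphonic(tracks, seed=0):
    """Overlay equal-length monophonic recordings (lists of samples in
    [-1, 1]) with random per-track gains, then peak-normalize."""
    rng = random.Random(seed)
    gains = [rng.uniform(0.5, 1.0) for _ in tracks]
    mixed = [sum(g * t[i] for g, t in zip(gains, tracks))
             for i in range(len(tracks[0]))]
    peak = max(abs(x) for x in mixed) or 1.0
    return [x / peak for x in mixed], gains

tar = [0.0, 0.5, -0.5, 0.2]   # e.g., a tar phrase (toy samples)
ney = [0.1, -0.2, 0.3, 0.0]   # e.g., a ney phrase (toy samples)
mixture, gains = make_polyphonic([tar, ney])
```

The multi-label ground truth for the mixture is simply the set of source instruments, which is what makes this kind of augmentation cheap to produce at scale.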
[NLP-123] Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale
链接: https://arxiv.org/abs/2511.05705 作者: David Acuna,Chao-Han Huck Yang,Yuntian Deng,Jaehun Jung,Ximing Lu,Prithviraj Ammanabrolu,Hyunwoo Kim,Yuan-Hong Liao,Yejin Choi 机构: NVIDIA; University of Toronto; University of Waterloo; UCSD 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) 备注: Project Page: this https URL
点击查看摘要
Abstract:Recent progress in multimodal reasoning has been driven largely by undisclosed datasets and proprietary data synthesis recipes, leaving open questions about how to systematically build large-scale, vision-centric reasoning datasets, particularly for tasks that go beyond visual math. In this work, we introduce a new reasoning data generation framework spanning diverse skills and levels of complexity with over 1M high-quality synthetic vision-centric questions. The dataset also includes preference data and instruction prompts supporting both offline and online RL. Our synthesis framework proceeds in two stages: (1) scale; and (2) complexity. Reasoning traces are then synthesized through a two-stage process that leverages VLMs and reasoning LLMs, producing CoT traces for VLMs that capture the richness and diverse cognitive behaviors found in frontier reasoning models. Remarkably, we show that finetuning Qwen2.5-VL-7B on our data outperforms all open-data baselines across all evaluated vision-centric benchmarks, and even surpasses strong closed-data models such as MiMo-VL-7B-RL on V* Bench, CV-Bench and MMStar-V. Perhaps most surprising, despite being entirely vision-centric, our data transfers positively to text-only reasoning (MMLU-Pro) and audio reasoning (MMAU), demonstrating its effectiveness. Similarly, despite not containing videos or embodied visual data, we observe notable gains when evaluating on a single-evidence embodied QA benchmark (NiEH). Finally, we use our data to analyze the entire VLM post-training pipeline. Our empirical analysis highlights that (i) SFT on high-quality data with non-linear reasoning traces is essential for effective online RL, (ii) staged offline RL matches online RL’s performance while reducing compute demands, and (iii) careful SFT on high quality data can substantially improve out-of-domain, cross-modality transfer.
[NLP-124] TabDistill: Distilling Transformers into Neural Nets for Few-Shot Tabular Classification
链接: https://arxiv.org/abs/2511.05704 作者: Pasan Dissanayake,Sanghamitra Dutta 机构: 未知 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) 备注:
点击查看摘要
Abstract:Transformer-based models have shown promising performance on tabular data compared to their classical counterparts such as neural networks and Gradient Boosted Decision Trees (GBDTs) in scenarios with limited training data. They utilize their pre-trained knowledge to adapt to new domains, achieving commendable performance with only a few training examples, also called the few-shot regime. However, the performance gain in the few-shot regime comes at the expense of significantly increased complexity and number of parameters. To circumvent this trade-off, we introduce TabDistill, a new strategy to distill the pre-trained knowledge in complex transformer-based models into simpler neural networks for effectively classifying tabular data. Our framework yields the best of both worlds: being parameter-efficient while performing well with limited training data. The distilled neural networks surpass classical baselines such as regular neural networks, XGBoost and logistic regression under equal training data, and in some cases, even the original transformer-based models that they were distilled from.
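A standard way to distill a transformer's predictions into a small network is to match temperature-softened output distributions. This is the generic logit-distillation objective, not necessarily TabDistill's exact recipe:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-softened softmax over a list of logits."""
    exps = [math.exp(l / T) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) between temperature-softened distributions,
    scaled by T^2 as in standard knowledge distillation."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 0.5, -1.0]   # transformer's logits for one row (toy)
student = [1.5, 0.7, -0.9]   # small net's logits for the same row (toy)
loss = distill_loss(student, teacher)
```

Minimizing this loss over the few available rows pushes the compact student toward the pre-trained teacher's decision surface, which is the source of the parameter-efficiency claim.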
[NLP-125] A Representation Sharpening Framework for Zero Shot Dense Retrieval
链接: https://arxiv.org/abs/2511.05684 作者: Dhananjay Ashok,Suraj Nair,Mutasem Al-Darabsah,Choon Hui Teo,Tarun Agarwal,Jonathan May 机构: Information Sciences Institute, University of Southern California (南加州大学信息科学研究所); Amazon (亚马逊) 类目: Information Retrieval (cs.IR); Computation and Language (cs.CL) 备注: 15 pages, 4 figures
点击查看摘要
Abstract:Zero-shot dense retrieval is a challenging setting where a document corpus is provided without relevant queries, necessitating a reliance on pretrained dense retrievers (DRs). However, since these DRs are not trained on the target corpus, they struggle to represent semantic differences between similar documents. To address this failing, we introduce a training-free representation sharpening framework that augments a document’s representation with information that helps differentiate it from similar documents in the corpus. On over twenty datasets spanning multiple languages, the representation sharpening framework proves consistently superior to traditional retrieval, setting a new state-of-the-art on the BRIGHT benchmark. We show that representation sharpening is compatible with prior approaches to zero-shot dense retrieval and consistently improves their performance. Finally, we address the performance-cost tradeoff presented by our framework and devise an indexing-time approximation that preserves the majority of our performance gains over traditional retrieval, yet suffers no additional inference-time cost.
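One training-free way to realize "augment a document's representation with what differentiates it from similar documents" is to push each embedding away from the mean of its nearest neighbors and re-normalize. The specific update below (alpha=0.5, k=1 neighbor, toy 3-D vectors) is an illustrative guess, not the paper's formula:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Scale a vector to unit length."""
    n = dot(u, u) ** 0.5
    return [x / n for x in u]

def sharpen(vecs, alpha=0.5, k=1):
    """Subtract alpha times the mean of each vector's k nearest
    neighbors (by cosine), then re-normalize."""
    vecs = [norm(v) for v in vecs]
    out = []
    for i, v in enumerate(vecs):
        nbrs = sorted((j for j in range(len(vecs)) if j != i),
                      key=lambda j: -dot(v, vecs[j]))[:k]
        mean = [sum(vecs[j][d] for j in nbrs) / len(nbrs)
                for d in range(len(v))]
        out.append(norm([a - alpha * b for a, b in zip(v, mean)]))
    return out

# Two near-duplicate documents and one distinct one:
docs = [[1.0, 0.1, 0.0], [1.0, 0.0, 0.1], [0.0, 1.0, 0.0]]
sharp = sharpen(docs)
```

After sharpening, the two near-duplicates are less similar to each other, which is the behavior a retriever needs to rank them correctly against fine-grained queries.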
[NLP-126] Optimizing Diversity and Quality through Base-Aligned Model Collaboration
链接: https://arxiv.org/abs/2511.05650 作者: Yichen Wang,Chenghao Yang,Tenghao Huang,Muhao Chen,Jonathan May,Mina Lee 机构: 未知 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) 备注: 52 pages, 16 figures
点击查看摘要
[NLP-127] UTF-8 Plumbing: Byte-level Tokenizers Unavoidably Enable LLM s to Generate Ill-formed UTF-8
链接: https://arxiv.org/abs/2511.05578 作者: Preston Firestone,Shubham Ugare,Gagandeep Singh,Sasa Misailovic 机构: 未知 类目: Computation and Language (cs.CL) 备注: COLM 2025
点击查看摘要
Abstract:Subword tokenization segments input text according to a pre-defined vocabulary to feed it into a language model; the language model, in turn, generates a sequence made from this same vocabulary. The members of the vocabulary can be built of code points or bytes. Using code points means that all members of the vocabulary are valid UTF-8 characters. However, it also requires thousands of initial members to achieve acceptable coverage of inputs. Beginning with bytes, on the contrary, avoids out-of-vocabulary errors with only 256 initial members of the vocabulary, but the members of the vocabulary and sequences of them are not guaranteed to be valid UTF-8. Sequences that are not valid UTF-8 break code that assumes its input to be valid UTF-8. Applications of language models must account for the breakage thereby introduced. In this paper, we formalize tokenization using monoid theory and prove that tokenizers whose vocabularies contain tokens that are ill-formed UTF-8 can always produce sequences that are ill-formed UTF-8. We demonstrate formally that attempting to incrementally convert tokens back to a string and interpret the results as UTF-8 gives different results than converting the whole sequence of tokens at once. This formal result predicts real-world bugs: we evaluate mitigations for the problem identified and provide case studies of major foundation models, serving engines, and constrained generation systems.
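The whole-sequence versus token-by-token discrepancy the paper formalizes is easy to reproduce whenever a byte-level merge splits a multi-byte code point across tokens:

```python
# Byte-level tokens can split the two-byte encoding of 'é' (0xC3 0xA9):
tokens = [b"caf", b"\xc3", b"\xa9"]

# Decoding the concatenated byte sequence at once is well-formed UTF-8:
whole = b"".join(tokens).decode("utf-8")

# Decoding token-by-token forces error handling on each fragment and
# yields a different string (two replacement characters):
incremental = "".join(t.decode("utf-8", errors="replace") for t in tokens)

print(whole)        # café
print(incremental)  # caf\ufffd\ufffd
```

This is why streaming serving engines must buffer incomplete byte sequences across tokens rather than decode each token independently.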
zh
[NLP-128] Fine-Tuning Vision-Language Models for Multimodal Polymer Property Prediction
链接: https://arxiv.org/abs/2511.05577 作者: An Vuong,Minh-Hao Van,Prateek Verma,Chen Zhao,Xintao Wu 机构: 未知 类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) 备注:
点击查看摘要
Abstract:Vision-Language Models (VLMs) have shown strong performance in tasks like visual question answering and multimodal text generation, but their effectiveness in scientific domains such as materials science remains limited. While some machine learning methods have addressed specific challenges in this field, there is still a lack of foundation models designed for broad tasks like polymer property prediction using multimodal data. In this work, we present a multimodal polymer dataset to fine-tune VLMs through instruction-tuning pairs and assess the impact of multimodality on prediction performance. Our fine-tuned models, using LoRA, outperform unimodal and baseline approaches, demonstrating the benefits of multimodal learning. Additionally, this approach reduces the need to train separate models for different properties, lowering deployment and maintenance costs.
zh
[NLP-129] Sample-Efficient Language Modeling with Linear Attention and Lightweight Enhancements
链接: https://arxiv.org/abs/2511.05560 作者: Patrick Haller,Jonas Golde,Alan Akbik 机构: Humboldt-Universität zu Berlin (柏林洪堡大学) 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) 备注:
点击查看摘要
Abstract:We study architectural and optimization techniques for sample-efficient language modeling under the constraints of the BabyLM 2025 shared task. Our model, BLaLM, replaces self-attention with a linear-time mLSTM token mixer and explores lightweight enhancements, including short convolutions, sliding window attention with dynamic modulation, and Hedgehog feature maps. To support training in low-resource settings, we curate a high-quality corpus emphasizing readability and pedagogical structure. Experiments across both STRICT and STRICT-SMALL tracks show that (1) linear attention combined with sliding window attention consistently improves zero-shot performance, and (2) the Muon optimizer stabilizes convergence and reduces perplexity over AdamW. These results highlight effective strategies for efficient language modeling without relying on scale.
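As an illustration of one of the enhancements mentioned above, a causal sliding-window attention mask can be sketched as follows (window size and sequence length are illustrative; this is not the BLaLM implementation):

```python
def sliding_window_mask(seq_len: int, window: int) -> list:
    """Causal sliding-window mask: position i may attend to position j
    iff j <= i (causality) and i - j < window (locality)."""
    return [[j <= i and i - j < window for j in range(seq_len)]
            for i in range(seq_len)]

mask = sliding_window_mask(seq_len=5, window=2)
# The last position attends only to itself and its immediate predecessor.
print(mask[4])  # [False, False, False, True, True]
```

Restricting attention to a fixed window keeps per-token cost constant in sequence length, which is why it pairs naturally with linear-time token mixers.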
zh
[NLP-130] Factual and Musical Evaluation Metrics for Music Language Models
【速读】: 该论文旨在解决当前音乐语言模型(Music Language Models, Music LMs)评估体系中存在的核心缺陷——即现有常用指标(如BLEU、METEOR和BERTScore)仅能衡量生成回答的语义流畅性,而无法准确判断其内容是否正确。为应对这一问题,论文提出两个关键解决方案:一是设计了一个适用于音乐领域的通用评价指标,以更贴合音乐语境下的回答质量;二是构建了一个事实性评估框架,用于量化Music LM回答内容的真实性与准确性。该框架不依赖于特定模态的问答模型架构,具备跨领域可扩展性,可推广至其他开放式问答任务中。
链接: https://arxiv.org/abs/2511.05550 作者: Daniel Chenyu Lin,Michael Freeman,John Thickstun 机构: 未知 类目: ound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG) 备注: 18 pages; first submission
点击查看摘要
Abstract:Music language models (Music LMs), like vision language models, leverage multimodal representations to answer natural language queries about musical audio recordings. Although Music LMs are reportedly improving, we find that current evaluations fail to capture whether their answers are correct. Specifically, for all Music LMs that we examine, widely-used evaluation metrics such as BLEU, METEOR, and BERTScore fail to measure anything beyond linguistic fluency of the model’s responses. To measure the true performance of Music LMs, we propose (1) a better general-purpose evaluation metric for Music LMs adapted to the music domain and (2) a factual evaluation framework to quantify the correctness of a Music LM’s responses. Our framework is agnostic to the modality of the question-answering model and could be generalized to quantify performance in other open-ended question-answering domains. We use open datasets in our experiments and will release all code on publication.
zh
[NLP-131] Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability
链接: https://arxiv.org/abs/2511.05541 作者: Usha Bhalla,Alex Oesterling,Claudio Mayrink Verdun,Himabindu Lakkaraju,Flavio P. Calmon 机构: Harvard University (哈佛大学) 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) 备注: 23 Pages, 10 figures
点击查看摘要
Abstract:Translating the internal representations and computations of models into concepts that humans can understand is a key goal of interpretability. While recent dictionary learning methods such as Sparse Autoencoders (SAEs) provide a promising route to discover human-interpretable features, they suffer from a variety of problems, including a systematic failure to capture the rich conceptual information that drives linguistic understanding. Instead, they exhibit a bias towards shallow, token-specific, or noisy features, such as “the phrase ‘The’ at the start of sentences”. In this work, we propose that this is due to a fundamental issue with how dictionary learning methods for LLMs are trained. Language itself has a rich, well-studied structure spanning syntax, semantics, and pragmatics; however, current unsupervised methods largely ignore this linguistic knowledge, leading to poor feature discovery that favors superficial patterns over meaningful concepts. We focus on a simple but important aspect of language: semantic content has long-range dependencies and tends to be smooth over a sequence, whereas syntactic information is much more local. Building on this insight, we introduce Temporal Sparse Autoencoders (T-SAEs), which incorporate a novel contrastive loss encouraging consistent activations of high-level features over adjacent tokens. This simple yet powerful modification enables SAEs to disentangle semantic from syntactic features in a self-supervised manner. Across multiple datasets and models, T-SAEs recover smoother, more coherent semantic concepts without sacrificing reconstruction quality. Strikingly, they exhibit clear semantic structure despite being trained without explicit semantic signal, offering a new pathway for unsupervised interpretability in language models.
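The intuition that semantic features should vary smoothly across adjacent tokens can be sketched as a simple temporal penalty (an illustrative stand-in; the paper's actual loss is contrastive and may differ in form):

```python
def temporal_smoothness_penalty(acts):
    """Mean squared difference between feature activations at adjacent token
    positions; low values mean features vary smoothly over the sequence.
    `acts` is a list of per-token activation vectors (seq_len x n_features)."""
    total, count = 0.0, 0
    for prev, cur in zip(acts, acts[1:]):
        for p, c in zip(prev, cur):
            total += (p - c) ** 2
            count += 1
    return total / count

smooth = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.1]]  # slowly varying "semantic" features
jumpy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # token-local "syntactic" flicker
print(temporal_smoothness_penalty(smooth) < temporal_smoothness_penalty(jumpy))  # True
```

Adding such a term to the SAE objective pushes the dictionary toward features that persist across neighboring tokens, which is the behavior the abstract attributes to semantic content.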
zh
[NLP-132] Future of AI Models: A Computational perspective on Model collapse
链接: https://arxiv.org/abs/2511.05535 作者: Trivikram Satharasi(1),S Sitharama Iyengar(2) ((1) University of Florida, Gainesville, FL, (2) Florida International University, Miami, FL) 机构: 未知 类目: Computation and Language (cs.CL); Databases (cs.DB); Information Theory (cs.IT) 备注: Submitted to Springer Nature. Code Available at this https URL
点击查看摘要
Abstract:Artificial Intelligence, especially Large Language Models (LLMs), has transformed domains such as software engineering, journalism, creative writing, academia, and media (Naveed et al. 2025; arXiv:2307.06435). Diffusion models like Stable Diffusion generate high-quality images and videos from text. Evidence shows rapid expansion: 74.2% of newly published webpages now contain AI-generated material (Ryan Law 2025), 30-40% of the active web corpus is synthetic (Spennemann 2025; arXiv:2504.08755), 52% of U.S. adults use LLMs for writing, coding, or research (Staff 2025), and audits find AI involvement in 18% of financial complaints and 24% of press releases (Liang et al. 2025). The underlying neural architectures, including Transformers (Vaswani et al. 2023; arXiv:1706.03762), RNNs, LSTMs, GANs, and diffusion networks, depend on large, diverse, human-authored datasets (Shi Iyengar 2019). As synthetic content dominates, recursive training risks eroding linguistic and semantic diversity, producing Model Collapse (Shumailov et al. 2024; arXiv:2307.15043; Dohmatob et al. 2024; arXiv:2402.07712). This study quantifies and forecasts collapse onset by examining year-wise semantic similarity in English-language Wikipedia (filtered Common Crawl) from 2013 to 2025 using Transformer embeddings and cosine similarity metrics. Results reveal a steady rise in similarity before public LLM adoption, likely driven by early RNN/LSTM translation and text-normalization pipelines, though modest due to a smaller scale. Observed fluctuations reflect irreducible linguistic diversity, variable corpus size across years, finite sampling error, and an exponential rise in similarity after the public adoption of LLM models. These findings provide a data-driven estimate of when recursive AI contamination may significantly threaten data richness and model generalization.
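The core measurement in the study, cosine similarity between text embeddings, can be sketched as follows (the toy vectors are invented; the study computes Transformer embeddings of Wikipedia text year by year):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mean_pairwise_similarity(vectors):
    """Average similarity over all pairs in a sample: rising values
    over time would indicate shrinking semantic diversity."""
    pairs = [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    return sum(cosine_similarity(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

v_early = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # diverse directions
v_late = [[1.0, 0.9], [0.9, 1.0], [1.0, 1.0]]   # near-duplicate directions
print(mean_pairwise_similarity(v_early) < mean_pairwise_similarity(v_late))  # True
```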
zh
[NLP-133] FlowMM: Cross-Modal Information Flow Guided KV Cache Merging for Efficient Multimodal Context Inference
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)中KV缓存(Key-Value cache)管理效率与生成质量之间的矛盾问题。传统基于注意力分数的KV缓存淘汰策略易导致上下文丢失或幻觉,而现有融合策略在跨模态场景下受限于模态间分布偏移和跨模态注意力偏差,难以有效保留关键信息。解决方案的关键在于提出FlowMM框架,其核心创新包括:(1) 基于跨模态信息流(cross-modal information flow)动态调整各层的合并策略,以捕捉模态特异性模式并维持上下文完整性;(2) 设计敏感度自适应的token匹配机制,联合评估token相似性与任务敏感性,优先合并低风险token、保护高敏感token。实验表明,FlowMM可在保持任务性能的前提下将KV缓存内存减少80%–95%,解码延迟降低1.3–1.8倍。
链接: https://arxiv.org/abs/2511.05534 作者: Kunxi Li,Yufan Xiong,Zhonghua Jiang,Yiyun Zhou,Zhaode Wang,Chengfei Lv,Shengyu Zhang 机构: Zhejiang University (浙江大学); Huazhong Agricultural University (华中农业大学); Alibaba (阿里巴巴) 类目: Computation and Language (cs.CL) 备注:
点击查看摘要
Abstract:Traditional KV cache eviction strategies, which discard less critical KV-pairs based on attention scores, often degrade generation quality, causing context loss or hallucinations. Recent efforts shift toward KV merging, merging eviction tokens with retention tokens based on similarity. However, in multimodal scenarios, distributional biases across modality tokens and attentional biases in cross-modal interactions limit its effectiveness. This work introduces FlowMM, an adaptive framework for cross-modal information flow-guided multimodal KV cache merging. FlowMM leverages cross-modal information flow to dynamically apply layer-specific merging strategies, capturing modality-specific patterns while preserving contextual integrity. Furthermore, we introduce a sensitivity-adaptive token matching mechanism that jointly evaluates token similarity and task-critical sensitivity, merging low-risk tokens while safeguarding high-sensitivity ones. Extensive experiments across diverse leading MLLMs show that FlowMM reduces KV cache memory by 80% to 95% and decoding latency by 1.3-1.8x, while maintaining competitive task performance.
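The joint similarity-and-sensitivity gating idea can be sketched as a simple decision rule (thresholds, token names, and scores are invented for illustration; FlowMM's actual mechanism operates on KV pairs with layer-specific strategies):

```python
def merge_decisions(tokens, sim_threshold=0.9, sens_threshold=0.5):
    """Toy similarity-plus-sensitivity gating: a candidate token is merged
    only when it is similar enough to some retained token AND its task
    sensitivity is low; high-sensitivity tokens are always kept.
    `tokens` maps token id -> (similarity_to_best_retained, sensitivity)."""
    kept, merged = [], []
    for tid, (sim, sens) in tokens.items():
        if sim >= sim_threshold and sens < sens_threshold:
            merged.append(tid)
        else:
            kept.append(tid)
    return kept, merged

tokens = {
    "img_patch_a": (0.95, 0.1),  # redundant and low-risk -> merge
    "img_patch_b": (0.97, 0.8),  # similar but task-critical -> keep
    "text_tok": (0.40, 0.2),     # dissimilar -> keep
}
kept, merged = merge_decisions(tokens)
print(kept, merged)  # ['img_patch_b', 'text_tok'] ['img_patch_a']
```

The point of the second gate is visible in `img_patch_b`: similarity alone would merge it, but its high sensitivity protects it, which is how the framework aims to avoid the context loss of pure similarity-based merging.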
zh
[NLP-134] MCP4IFC: IFC-Based Building Design Using Large Language Models
Abstract:Bringing generative AI into the architecture, engineering and construction (AEC) field requires systems that can translate natural language instructions into actions on standardized data models. We present MCP4IFC, a comprehensive open-source framework that enables Large Language Models (LLMs) to directly manipulate Industry Foundation Classes (IFC) data through the Model Context Protocol (MCP). The framework provides a set of BIM tools, including scene querying tools for information retrieval, predefined functions for creating and modifying common building elements, and a dynamic code-generation system that combines in-context learning with retrieval-augmented generation (RAG) to handle tasks beyond the predefined toolset. Experiments demonstrate that an LLM using our framework can successfully perform complex tasks, from building a simple house to querying and editing existing IFC data. Our framework is released as open-source to encourage research in LLM-driven BIM design and provide a foundation for AI-assisted modeling workflows. Our code is available at this https URL.
zh
[NLP-135] Beyond One-Size-Fits-All: Personalized Harmful Content Detection with In-Context Learning
[NLP-136] Retracing the Past: LLMs Emit Training Data When They Get Lost
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)对训练数据的记忆现象所引发的隐私和版权风险问题。现有基于启发式的方法在提取记忆数据时成功率有限,且难以揭示记忆泄露的根本驱动因素。其解决方案的关键在于提出一种系统性的“混淆诱导攻击”(Confusion-Inducing Attacks, CIA)框架,通过优化输入片段以刻意诱发模型在token级别预测熵的持续升高状态,从而有效触发并提取被记忆的文本内容;此外,针对对齐后的LLMs,进一步引入不匹配监督微调(Mismatched Supervised Fine-tuning, SFT),同步削弱模型对齐性并增强其对攻击的敏感性。实验表明,该方法无需事先了解训练数据即可高效提取原文及近似原文的数据,显著优于现有基线方法。
链接: https://arxiv.org/abs/2511.05518 作者: Myeongseob Ko,Nikhil Reddy Billa,Adam Nguyen,Charles Fleming,Ming Jin,Ruoxi Jia 机构: 未知 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) 备注: The 2025 Conference on Empirical Methods in Natural Language Processing
点击查看摘要
Abstract:The memorization of training data in large language models (LLMs) poses significant privacy and copyright concerns. Existing data extraction methods, particularly heuristic-based divergence attacks, often exhibit limited success and offer limited insight into the fundamental drivers of memorization leakage. This paper introduces Confusion-Inducing Attacks (CIA), a principled framework for extracting memorized data by systematically maximizing model uncertainty. We empirically demonstrate that the emission of memorized text during divergence is preceded by a sustained spike in token-level prediction entropy. CIA leverages this insight by optimizing input snippets to deliberately induce this consecutive high-entropy state. For aligned LLMs, we further propose Mismatched Supervised Fine-tuning (SFT) to simultaneously weaken their alignment and induce targeted confusion, thereby increasing susceptibility to our attacks. Experiments on various unaligned and aligned LLMs demonstrate that our proposed attacks outperform existing baselines in extracting verbatim and near-verbatim training data without requiring prior knowledge of the training data. Our findings highlight persistent memorization risks across various LLMs and offer a more systematic method for assessing these vulnerabilities.
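The entropy signal that CIA exploits can be sketched in a few lines: compute token-level prediction entropy, then look for a sustained run of high-entropy steps (the threshold, run length, and values are illustrative):

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def sustained_high_entropy(entropies, threshold, run_len):
    """True iff `entropies` contains `run_len` consecutive values above
    `threshold` -- the consecutive high-uncertainty state CIA induces."""
    run = 0
    for h in entropies:
        run = run + 1 if h > threshold else 0
        if run >= run_len:
            return True
    return False

peaked = [0.9, 0.05, 0.05]       # model is confident
flat = [0.25, 0.25, 0.25, 0.25]  # model is "confused": maximal uncertainty
print(entropy(peaked) < entropy(flat))  # True
print(sustained_high_entropy([0.2, 1.3, 1.4, 1.35, 0.1], threshold=1.0, run_len=3))  # True
```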
zh
[NLP-137] Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation
Abstract:Existing speech models suffer from competing requirements on token representations by understanding and generation tasks. This discrepancy in representation prevents speech language models from performing instruction-based free-form editing. To solve this challenge, we introduce a novel framework that unifies speech understanding, generation, and editing. The core of our unified model is a unified continuous speech tokenizer MingTok-Audio, the first continuous tokenizer to effectively integrate semantic and acoustic features, which makes it suitable for both understanding and generation tasks. Based on this unified continuous audio tokenizer, we developed the speech language model Ming-UniAudio, which achieved a balance between generation and understanding capabilities. Ming-UniAudio sets new state-of-the-art (SOTA) records on 8 out of 12 metrics on the ContextASR benchmark. Notably, for Chinese voice cloning, it achieves a highly competitive Seed-TTS-WER of 0.95. Leveraging this foundational model, we further trained a dedicated speech editing model Ming-UniAudio-Edit, the first speech language model that enables universal, free-form speech editing guided solely by natural language instructions, handling both semantic and acoustic modifications without timestamp condition. To rigorously assess the editing capability and establish a foundation for future research, we introduce Ming-Freeform-Audio-Edit, the first comprehensive benchmark tailored for instruction-based free-form speech editing, featuring diverse scenarios and evaluation dimensions spanning semantic correctness, acoustic quality, and instruction alignment. We open-sourced the continuous audio tokenizer, the unified foundational model, and the free-form instruction-based editing model to facilitate the development of unified audio understanding, generation, and manipulation.
zh
[NLP-138] Predicting Oscar-Nominated Screenplays with Sentence Embeddings
【速读】: 该论文试图解决的问题是:能否利用现代语言模型预测奥斯卡最佳原创或改编剧本提名(Oscar nominations for screenplays)。其解决方案的关键在于构建了一个名为 Movie-O-Label 的新数据集,该数据集整合了电影剧本集合 MovieSum 与经过人工筛选的奥斯卡获奖记录,并将每部剧本表示为标题、维基百科摘要和完整脚本三部分文本信息;随后使用 E5 句子嵌入模型对长文本进行分块编码,并通过逻辑回归分类器融合三种特征输入(脚本、摘要、标题)进行预测。实验表明,该方法在宏平均 F1 分数(macro F1 score)达到 0.66,ROC-AUC 达到 0.79,证明基于文本嵌入的简单模型已具备较好的预测能力,可作为未来研究的基础。
链接: https://arxiv.org/abs/2511.05500 作者: Francis Gross 机构: University of Regensburg (雷根斯堡大学) 类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) 备注:
点击查看摘要
Abstract:Oscar nominations are an important factor in the movie industry because they can boost both the visibility and the commercial success of a film. This work explores whether it is possible to predict Oscar nominations for screenplays using modern language models. Since no suitable dataset was available, a new one called Movie-O-Label was created by combining the MovieSum collection of movie scripts with curated Oscar records. Each screenplay was represented by its title, Wikipedia summary, and full script. Long scripts were split into overlapping text chunks and encoded with the E5 sentence embedding model. Then, the screenplay embeddings were classified using a logistic regression model. The best results were achieved when the three feature inputs related to screenplays (script, summary, and title) were combined. The best-performing model reached a macro F1 score of 0.66, a precision-recall AP of 0.445 (against a baseline of 0.19), and a ROC-AUC of 0.79. The results suggest that even simple models based on modern text embeddings demonstrate good prediction performance and might be a starting point for future research.
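The chunking step described above can be sketched as follows (chunk size and overlap are illustrative; the abstract does not specify these hyperparameters):

```python
def chunk_text(words, chunk_size, overlap):
    """Split a token list into overlapping chunks, as done before embedding
    long scripts; the stride between chunk starts is chunk_size - overlap."""
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break
    return chunks

words = [f"w{i}" for i in range(10)]
chunks = chunk_text(words, chunk_size=4, overlap=2)
print([len(c) for c in chunks])  # [4, 4, 4, 4]
```

Each chunk would then be embedded (with E5 in the paper) and the per-chunk vectors pooled into a single screenplay representation before the logistic regression step.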
zh
[NLP-139] AI Brown and AI Koditex: LLM-Generated Corpora Comparable to Traditional Corpora of English and Czech Texts
【速读】: 该论文旨在解决当前缺乏可用于语言学比较研究的、由大语言模型(Large Language Models, LLMs)生成的英文与捷克语文本资源的问题,以实现对人类写作与LLM生成文本在语言特征上的系统性对比。解决方案的关键在于构建两个高质量、多题材、主题丰富且结构上可与现有真人创作语料库(BE21和Koditex)直接比较的生成语料库,其文本由来自OpenAI、Anthropic、Alphabet、Meta和DeepSeek的多种LLM(从GPT-3到GPT-4.5)生成,并统一采用Universal Dependencies标准进行标注(包括分词、词形还原及形态句法标注),从而确保数据的可比性和可用性。
链接: https://arxiv.org/abs/2509.22996 作者: Jiří Milička,Anna Marklová,Václav Cvrček 机构: Charles University (查尔斯大学) 类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) 备注:
点击查看摘要
Abstract:This article presents two corpora of English and Czech texts generated with large language models (LLMs). The motivation is to create a resource for comparing human-written texts with LLM-generated text linguistically. Emphasis was placed on ensuring these resources are multi-genre and rich in terms of topics, authors, and text types, while maintaining comparability with existing human-created corpora. These generated corpora replicate reference human corpora: BE21 by Paul Baker, which is a modern version of the original Brown Corpus, and the Koditex corpus, which also follows the Brown Corpus tradition but in Czech. The new corpora were generated using models from OpenAI, Anthropic, Alphabet, Meta, and DeepSeek, ranging from GPT-3 (davinci-002) to GPT-4.5, and are tagged according to the Universal Dependencies standard (i.e., they are tokenized, lemmatized, and morphologically and syntactically annotated). The subcorpus size varies according to the model used (the English part contains on average 864k tokens per model, 27M tokens altogether; the Czech part contains on average 768k tokens per model, 21.5M tokens altogether). The corpora are freely available for download under the CC BY 4.0 license (the annotated data are under the CC BY-NC-SA 4.0 license) and are also accessible through the search interface of the Czech National Corpus.
zh
[NLP-140] Language Generation with Infinite Contamination
【速读】: 该论文旨在解决语言生成在极限情况下对数据污染(contamination)的鲁棒性问题,即当算法从一个未知目标语言 $ K $ 中观察到由对手构造的字符串枚举序列时,如何保证其仍能生成新的、未见过的 $ K $ 中字符串。此前研究假设数据完全纯净(无噪声插入和遗漏),但现实场景中数据常含噪声或缺失,因此核心挑战在于量化生成任务可容忍的污染程度。解决方案的关键在于:首先,证明了在所有可数语言集合上实现语言生成的充要条件是污染比例趋于零;其次,指出稠密生成(dense generation)比普通生成更脆弱,且通过引入受课程学习启发的“超越最坏情况”模型,进一步表明即使存在无限污染,只要污染比例收敛至零,稠密生成依然可行——这揭示了课程学习机制在处理真实世界噪声数据中的潜在重要性。
链接: https://arxiv.org/abs/2511.07417 作者: Anay Mehrotra,Grigoris Velegkas,Xifan Yu,Felix Zhou 机构: Yale University (耶鲁大学); Google Research (谷歌研究院) 类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG) 备注:
点击查看摘要
Abstract:We study language generation in the limit, where an algorithm observes an adversarial enumeration of strings from an unknown target language K and must eventually generate new, unseen strings from K. Kleinberg and Mullainathan [KM24] proved that generation is achievable in surprisingly general settings. But their generator suffers from "mode collapse," producing from an ever-smaller subset of the target. To address this, Kleinberg and Wei [KW25] require the generator's output to be "dense" in the target language. They showed that generation with density, surprisingly, remains achievable at the same generality. Both results assume perfect data: no noisy insertions and no omissions. This raises a central question: how much contamination can generation tolerate? Recent works made partial progress on this question by studying (non-dense) generation with either finite amounts of noise (but no omissions) or omissions (but no noise). We characterize robustness under contaminated enumerations: 1. Generation under Contamination: Language generation in the limit is achievable for all countable collections iff the fraction of contaminated examples converges to zero. When this fails, we characterize which collections are generable. 2. Dense Generation under Contamination: Dense generation is strictly less robust to contamination than generation. As a byproduct, we resolve an open question of Raman and Raman [ICML25] by showing that generation is possible with only membership oracle access under finitely many contaminated examples. Finally, we introduce a beyond-worst-case model inspired by curriculum learning and prove that dense generation is achievable even with infinite contamination provided the fraction of contaminated examples converges to zero. This suggests curriculum learning may be crucial for learning from noisy web data.
zh
[NLP-141] Adaptive Testing for Segmenting Watermarked Texts From Language Models
【速读】: 该论文旨在解决生成式 AI(Generative AI)文本中水印检测与片段分割的难题,即如何准确识别并分离出由大语言模型(Large Language Models, LLMs)生成且嵌入水印的文本段落与人类撰写的内容。其解决方案的关键在于提出一种基于似然的自适应检测框架,通过引入灵活的加权公式和逆变换采样方法,显著降低了对提示词(prompt)精确估计的依赖性,从而在混合文本中实现高精度、鲁棒的水印区域分割。
链接: https://arxiv.org/abs/2511.06645 作者: Xingchi Li,Xiaochi Liu,Guanxun Li 机构: Texas A&M University (德克萨斯A&M大学); Beijing Normal University at Zhuhai (北京师范大学珠海分校) 类目: Machine Learning (stat.ML); Computation and Language (cs.CL); Machine Learning (cs.LG) 备注: 13 pages, 3 figures, accepted for publication in STAT, October 28, 2025
点击查看摘要
Abstract:The rapid adoption of large language models (LLMs), such as GPT-4 and Claude 3.5, underscores the need to distinguish LLM-generated text from human-written content to mitigate the spread of misinformation and misuse in education. One promising approach to address this issue is the watermark technique, which embeds subtle statistical signals into LLM-generated text to enable reliable identification. In this paper, we first generalize the likelihood-based LLM detection method of a previous study by introducing a flexible weighted formulation, and further adapt this approach to the inverse transform sampling method. Moving beyond watermark detection, we extend this adaptive detection strategy to tackle the more challenging problem of segmenting a given text into watermarked and non-watermarked substrings. In contrast to the approach in a previous study, which relies on accurate estimation of next-token probabilities that are highly sensitive to prompt estimation, our proposed framework removes the need for precise prompt estimation. Extensive numerical experiments demonstrate that the proposed methodology is both effective and robust in accurately segmenting texts containing a mixture of watermarked and non-watermarked content.
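Independently of the paper's specific detection statistic, the final segmentation step can be illustrated generically: given per-token watermark scores, group consecutive above-threshold positions into watermarked segments (the scores and threshold below are invented):

```python
def segment_by_score(scores, threshold):
    """Label each position watermarked (True) when its score exceeds
    `threshold`, then group consecutive labels into (start, end, label)
    half-open segments."""
    segments = []
    start, cur = 0, None
    for i, s in enumerate(scores):
        label = s > threshold
        if cur is None:
            cur = label
        elif label != cur:
            segments.append((start, i, cur))
            start, cur = i, label
    segments.append((start, len(scores), cur))
    return segments

scores = [0.1, 0.2, 0.9, 0.95, 0.85, 0.15]
print(segment_by_score(scores, threshold=0.5))
# [(0, 2, False), (2, 5, True), (5, 6, False)]
```

The paper's contribution lies in how the per-token evidence is computed (a weighted likelihood adapted to inverse transform sampling, without prompt estimation); this sketch only shows how such evidence turns into watermarked and non-watermarked substrings.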
zh
[NLP-142] On the Analogy between Human Brain and LLMs: Spotting Key Neurons in Grammar Perception
【速读】: 该论文试图解决的问题是:如何揭示大型语言模型(Large Language Models, LLMs)在处理语言时是否具备类似人类大脑中语法类别(如名词、动词等)的神经表征机制。解决方案的关键在于,利用Llama 3模型识别出与不同词性标签(part-of-speech tags)预测最相关的神经元,并通过这些关键神经元的激活模式训练一个分类器,在新数据上实现对词性的可靠预测。这一发现表明,LLMs中存在一个专门捕捉词性概念的子空间,其结构和功能特征与神经科学中基于脑损伤研究观察到的人类大脑神经分工模式相似。
链接: https://arxiv.org/abs/2511.06519 作者: Sanaz Saki Norouzi,Mohammad Masjedi,Pascal Hitzler 机构: 未知 类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) 备注:
点击查看摘要
Abstract:Artificial Neural Networks, the building blocks of AI, were inspired by the human brain’s network of neurons. Over the years, these networks have evolved to replicate the complex capabilities of the brain, allowing them to handle tasks such as image and language processing. In the realm of Large Language Models, there has been a keen interest in making the language learning process more akin to that of humans. While neuroscientific research has shown that different grammatical categories are processed by different neurons in the brain, we show that LLMs operate in a similar way. Utilizing Llama 3, we identify the most important neurons associated with the prediction of words belonging to different part-of-speech tags. Using the achieved knowledge, we train a classifier on a dataset, which shows that the activation patterns of these key neurons can reliably predict part-of-speech tags on fresh data. The results suggest the presence of a subspace in LLMs focused on capturing part-of-speech tag concepts, resembling patterns observed in lesion studies of the brain in neuroscience.
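A toy version of the neuron-selection step, ranking neurons by how much their mean activation differs across part-of-speech tags (the activations and the spread criterion are invented for illustration; the paper's attribution method may differ):

```python
def top_neurons(acts_by_tag, k):
    """Rank neurons by the spread of their per-tag mean activations
    (max minus min); neurons with large spread discriminate POS tags.
    `acts_by_tag` maps tag -> list of activation vectors."""
    n = len(next(iter(acts_by_tag.values()))[0])
    spreads = []
    for j in range(n):
        means = [sum(v[j] for v in vecs) / len(vecs) for vecs in acts_by_tag.values()]
        spreads.append((max(means) - min(means), j))
    return [j for _, j in sorted(spreads, reverse=True)[:k]]

acts = {
    "NOUN": [[0.9, 0.1, 0.5], [1.1, 0.1, 0.5]],  # neuron 0 fires on nouns
    "VERB": [[0.1, 0.9, 0.5], [0.1, 1.1, 0.5]],  # neuron 1 fires on verbs
}
print(top_neurons(acts, k=2))  # neuron 2 (constant) is excluded
```

Activations of the selected neurons would then serve as features for the POS classifier described in the abstract.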
zh
[NLP-143] Approximating the Mathematical Structure of Psychodynamics
链接: https://arxiv.org/abs/2511.05580 作者: Bryce-Allen Bagley,Navin Khoshnan 机构: 未知 类目: Neurons and Cognition (q-bio.NC); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC) 备注:
点击查看摘要
Abstract:The complexity of human cognition has meant that psychology makes more use of theory and conceptual models than perhaps any other biomedical field. To enable precise quantitative study of the full breadth of phenomena in psychological and psychiatric medicine as well as cognitive aspects of AI safety, there is a need for a mathematical formulation which is both mathematically precise and equally accessible to experts from numerous fields. In this paper we formalize human psychodynamics via the diagrammatic framework of process theory, describe its key properties, and explain the links between a diagrammatic representation and central concepts in analysis of cognitive processes in contexts such as psychotherapy, neurotechnology, AI alignment, AI agent representation of individuals in autonomous negotiations, developing human-like AI systems, and other aspects of AI safety.
zh
[NLP-144] The Role of High-Performance GPU Resources in Large Language Model Based Radiology Imaging Diagnosis
链接: https://arxiv.org/abs/2509.16328 作者: Jyun-Ping Kao 机构: National Taiwan University (国立台湾大学) 类目: Tissues and Organs (q-bio.TO); Computation and Language (cs.CL); Image and Video Processing (eess.IV); Medical Physics (physics.med-ph) 备注:
点击查看摘要
Abstract:Large language models (LLMs) are rapidly being applied to radiology, enabling automated image interpretation and report generation tasks. Their deployment in clinical practice requires both high diagnostic accuracy and low inference latency, which in turn demands powerful hardware. High-performance graphics processing units (GPUs) provide the necessary compute and memory throughput to run large LLMs on imaging data. We review modern GPU architectures (e.g. NVIDIA A100/H100, AMD Instinct MI250X/MI300) and key performance metrics: floating-point throughput, memory bandwidth, and VRAM capacity. We show how these hardware capabilities affect radiology tasks: for example, generating reports or detecting findings on CheXpert and MIMIC-CXR images is computationally intensive and benefits from GPU parallelism and tensor-core acceleration. Empirical studies indicate that using appropriate GPU resources can reduce inference time and improve throughput. We discuss practical challenges, including privacy, deployment, cost, and power, as well as optimization strategies: mixed-precision, quantization, compression, and multi-GPU scaling. Finally, we anticipate that next-generation features (8-bit tensor cores, enhanced interconnect) will further enable on-premise and federated radiology AI. Advancing GPU infrastructure is essential for safe, efficient LLM-based radiology diagnostics.
zh
计算机视觉
[CV-0] Lightning Grasp: High Performance Procedural Grasp Synthesis with Contact Fields
[CV-9] YoNoSplat: You Only Need One Model for Feedforward 3D Gaussian Splatting
链接: https://arxiv.org/abs/2511.07321 作者: Botao Ye,Boqi Chen,Haofei Xu,Daniel Barath,Marc Pollefeys 机构: ETH Zurich (苏黎世联邦理工学院); ETH AI Center (苏黎世联邦理工学院人工智能中心); Microsoft (微软) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
[CV-10] Beyond Boundaries: Leveraging Vision Foundation Models for Source-Free Object Detection AAAI2026
链接: https://arxiv.org/abs/2511.07301 作者: Huizai Yao,Sicheng Zhao,Pengteng Li,Yi Cui,Shuo Lu,Weiyu Guo,Yunfan Lu,Yijie Xu,Hui Xiong 机构: The University of Science and Technology Hong Kong (香港科技大学); Tsinghua University (清华大学); Peking University (北京大学) 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) 备注: Accepted to AAAI 2026. Extended version with full Appendix
点击查看摘要
[CV-11] VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models
链接: https://arxiv.org/abs/2511.07299 作者: Ying Cheng,Yu-Ho Lin,Min-Hung Chen,Fu-En Yang,Shang-Hong Lai 机构: National Tsing Hua University (国立清华大学); NVIDIA 类目: Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
[CV-12] LMM-IQA: Image Quality Assessment for Low-Dose CT Imaging
链接: https://arxiv.org/abs/2511.07298 作者: Kagan Celik,Mehmet Ozan Unal,Metin Ertas,Isa Yildirim 机构: Istanbul Technical University (伊斯坦布尔技术大学) 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) 备注:
点击查看摘要
[CV-13] Verifying rich robustness properties for neural networks
链接: https://arxiv.org/abs/2511.07293 作者: Mohammad Afzal,S. Akshay,Ashutosh Gupta 机构: 未知 类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
[CV-14] PlanT 2.0: Exposing Biases and Structural Flaws in Closed-Loop Driving
链接: https://arxiv.org/abs/2511.07292 作者: Simon Gerstenecker,Andreas Geiger,Katrin Renz 机构: University of Tübingen (图宾根大学); Tübingen AI Center (图宾根人工智能中心) 类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
[CV-15] Glioma C6: A Novel Dataset for Training and Benchmarking Cell Segmentation
链接: https://arxiv.org/abs/2511.07286 作者: Roman Malashin,Svetlana Pashkevich,Daniil Ilyukhin,Arseniy Volkov,Valeria Yachnaya,Andrey Denisov,Maria Mikhalkova 机构: Pavlov Institute of Physiology, Russian academy of science (巴甫洛夫生理研究所,俄罗斯科学院); Saint-Petersburg State University of Aerospace Instrumentation, Russia (圣彼得堡航空航天仪器大学,俄罗斯); Institute of Physiology, NAS of Belarus (白俄罗斯科学院生理研究所) 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) 备注:
点击查看摘要
[CV-16] Segmentation of Ischemic Stroke Lesions using Transfer Learning on Multi-sequence MRI
链接: https://arxiv.org/abs/2511.07281 作者: R. P. Chowdhury,T. Rahman 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: Ischemic Stroke, Segmentation, Transfer Learning, Magnetic Resonance Imaging, Deep Learning, Res-UNet
点击查看摘要
[CV-17] StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression
链接: https://arxiv.org/abs/2511.07233 作者: Alexander Bauer,Klaus-Robert Müller 机构: TU Berlin (柏林工业大学); BIFOLD (柏林学习与数据基础研究所); Korea University (高丽大学); MPI for Informatics (马克斯·普朗克信息学研究所) 类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) 备注:
点击查看摘要
[CV-22] Mapping Reduced Accessibility to WASH Facilities in Rohingya Refugee Camps with Sub-Meter Imagery
链接: https://arxiv.org/abs/2511.07231 作者: Kyeongjin Ahn,YongHun Suh,Sungwon Han,Jeasurk Yang,Hannes Taubenböck,Meeyoung Cha 机构: Max Planck Institute for Security and Privacy (MPI-SP); Korea Advanced Institute of Science and Technology (KAIST); Meta; German Aerospace Center (DLR); Earth Observation Center (EOC); Würzburg University 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: 23 pages, 13 figures, 2 tables
点击查看摘要
Abstract:Access to Water, Sanitation, and Hygiene (WASH) services remains a major public health concern in refugee camps. This study introduces a remote sensing-driven framework to quantify WASH accessibility (specifically to water pumps, latrines, and bathing cubicles) in the Rohingya camps of Cox’s Bazar, one of the world’s most densely populated displacement settings. Detecting refugee shelters in such emergent camps presents substantial challenges, primarily due to their dense spatial configuration and irregular geometric patterns. Using sub-meter satellite images, we develop a semi-supervised segmentation framework that achieves an F1-score of 76.4% in detecting individual refugee shelters. Applying the framework across multi-year data reveals declining WASH accessibility, driven by rapid refugee population growth and reduced facility availability, with the ratio rising from 25 people per facility in 2022 to 29.4 in 2025. Gender-disaggregated analysis further shows that women and girls experience reduced accessibility in scenarios with inadequate safety-related segregation in WASH facilities. These findings suggest the importance of demand-responsive allocation strategies that can identify areas with under-served populations, such as women and girls, and ensure that limited infrastructure serves the greatest number of people in settings with fixed or shrinking budgets. We also discuss the value of high-resolution remote sensing and machine learning to detect inequality and inform equitable resource planning in complex humanitarian environments.
zh
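The abstract above reports two plain computations: a detection F1-score and a people-per-facility accessibility ratio. A minimal sketch of both metrics is below; the function names and the illustrative counts are assumptions for demonstration, not the paper's actual data or code.

```python
def people_per_facility(population, n_facilities):
    """Accessibility ratio: how many people share one WASH facility."""
    return population / n_facilities

def f1_score(tp, fp, fn):
    """F1 from raw detection counts (predicted shelters vs. ground truth)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative numbers only (not taken from the paper's dataset):
# a ratio rising from 25 toward ~29 people per facility signals declining access.
r2022 = people_per_facility(25_000, 1_000)
r2025 = people_per_facility(29_400, 1_000)
```

A rising ratio under a fixed facility count is exactly the trend the abstract attributes to population growth outpacing infrastructure.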
[CV-23] Omni-View: Unlocking How Generation Facilitates Understanding in Unified 3D Model based on Multiview images
链接: https://arxiv.org/abs/2511.07222 作者: JiaKui Hu,Shanshan Zhao,Qing-Guo Chen,Xuerui Qiu,Jialun Liu,Zhao Xu,Weihua Luo,Kaifu Zhang,Yanye Lu 机构: Peking University (北京大学); Alibaba International Digital Commerce Group (阿里巴巴国际数字商业集团); CASIA (中国科学院自动化研究所); TeleAI (TeleAI) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: Under review
点击查看摘要
[CV-24] Breaking the Stealth-Potency Trade-off in Clean-Image Backdoors with Generative Trigger Optimization AAAI-2026 AAAI’26
链接: https://arxiv.org/abs/2511.07210 作者: Binyan Xu,Fan Yang,Di Tang,Xilin Dai,Kehuan Zhang 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG) 备注: 19 pages, 22 figures, 15 tables. To appear in AAAI '26 (Oral). This paper extends the AAAI-2026 version by including the Appendix
点击查看摘要
[CV-25] Geometric implicit neural representations for signed distance functions
链接: https://arxiv.org/abs/2511.07206 作者: Luiz Schirmer,Tiago Novello,Vinícius da Silva,Guilherme Schardong,Daniel Perazzo,Hélio Lopes,Nuno Gonçalves,Luiz Velho 机构: Universidade Federal de Santa Maria (圣玛丽亚联邦大学) 类目: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG); Graphics (cs.GR) 备注:
点击查看摘要
[CV-26] Automated Estimation of Anatomical Risk Metrics for Endoscopic Sinus Surgery Using Deep Learning
链接: https://arxiv.org/abs/2511.07199 作者: Konrad Reuter,Lennart Thaysen,Bilkay Doruk,Sarah Latus,Brigitte Holst,Benjamin Becker,Dennis Eggert,Christian Betz,Anna-Sophie Hoffmann,Alexander Schlaefer 机构: Hamburg University of Technology (汉堡工业大学); University Medical Center Hamburg-Eppendorf (汉堡-埃彭多夫大学医学中心) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: Accepted to SPIE Medical Imaging conference 2026
点击查看摘要
[CV-27] LiteUpdate: A Lightweight Framework for Updating AI-Generated Image Detectors
[CV-28] Federated Learning for Video Violence Detection: Complementary Roles of Lightweight CNNs and Vision-Language Models for Energy-Efficient Use ICTAI2025
【速读】:该论文旨在解决视频监控场景中隐私保护与计算效率之间的矛盾,特别是针对大规模视觉语言模型(VLM)在联邦学习框架下部署时带来的高能耗和可持续性挑战。其核心问题是如何在保障用户隐私的前提下,实现高效、低功耗且准确的暴力行为检测。解决方案的关键在于提出并比较三种策略:基于预训练VLM的零样本推理、LoRA微调的LLaVA-NeXT-Video-7B模型以及个性化联邦学习的65.8M参数3D卷积神经网络(3D CNN)。实验表明,3D CNN在保持超过90%二分类准确率的同时,能耗仅为联邦LoRA方案的一半(240 Wh vs. 570 Wh),且校准性能更优(ROC AUC 92.59%),而VLM则提供更强的多模态推理能力;通过层次化类别分组策略进一步提升VLM在多类暴力行为识别中的准确率(UCF-Crime数据集上从65.31%提升至81%)。研究揭示了混合部署范式的价值:以轻量级CNN处理常规检测任务,仅在复杂情境下调用VLM进行深度语义分析。
Abstract:Deep learning-based video surveillance increasingly demands privacy-preserving architectures with low computational and environmental overhead. Federated learning preserves privacy but deploying large vision-language models (VLMs) introduces major energy and sustainability challenges. We compare three strategies for federated violence detection under realistic non-IID splits on the RWF-2000 and RLVS datasets: zero-shot inference with pretrained VLMs, LoRA-based fine-tuning of LLaVA-NeXT-Video-7B, and personalized federated learning of a 65.8M-parameter 3D CNN. All methods exceed 90% accuracy in binary violence detection. The 3D CNN achieves superior calibration (ROC AUC 92.59%) at roughly half the energy cost (240 Wh vs. 570 Wh) of federated LoRA, while VLMs provide richer multimodal reasoning. Hierarchical category grouping (based on semantic similarity and class exclusion) boosts VLM multiclass accuracy from 65.31% to 81% on the UCF-Crime dataset. To our knowledge, this is the first comparative simulation study of LoRA-tuned VLMs and personalized CNNs for federated violence detection, with explicit energy and CO2e quantification. Our results inform hybrid deployment strategies that default to efficient CNNs for routine inference and selectively engage VLMs for complex contextual reasoning.
zh
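The abstract above compares a personalized federated 3D CNN against federated LoRA tuning of a VLM. The snippet below sketches only the generic federated-averaging-plus-personalization pattern that such a CNN strategy builds on; it is a minimal NumPy illustration with made-up parameter vectors and client sizes, not the paper's implementation (whose exact aggregation and personalization rules are not given here).

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Server step: size-weighted average of client parameter vectors (FedAvg)."""
    sizes = np.asarray(client_sizes, dtype=float)
    stacked = np.stack(client_weights)              # (n_clients, n_params)
    return (stacked * (sizes / sizes.sum())[:, None]).sum(axis=0)

def personalize(global_w, local_w, alpha=0.5):
    """Client step: blend the global model with locally fine-tuned weights."""
    return alpha * global_w + (1 - alpha) * local_w

# Two clients with non-IID data sizes 100 and 300:
w_global = fedavg([np.array([1.0, 0.0]), np.array([0.0, 1.0])], [100, 300])
w_client = personalize(w_global, np.array([1.0, 1.0]), alpha=0.5)
```

Keeping a personalized blend per client is one common way to handle the non-IID splits the abstract describes, while the raw data never leaves the client.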
[CV-29] ProcGen3D: Learning Neural Procedural Graph Representations for Image-to-3D Reconstruction
链接: https://arxiv.org/abs/2511.07142 作者: Xinyi Zhang,Daoyi Gao,Naiqi Li,Angela Dai 机构: Technical University of Munich (慕尼黑工业大学); Tsinghua University (清华大学) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: Project Page: this https URL
点击查看摘要
[CV-30] MPJudge: Towards Perceptual Assessment of Music-Induced Paintings
链接: https://arxiv.org/abs/2511.07137 作者: Shiqi Jiang,Tianyi Liang,Changbo Wang,Chenhui Li 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
[CV-31] Sparse4DGS: 4D Gaussian Splatting for Sparse-Frame Dynamic Scene Reconstruction AAAI2026
[CV-32] HENet: Hybrid Encoding and Multi-task Learning for 3D Perception and End-to-end Autonomous Driving
链接: https://arxiv.org/abs/2511.07106 作者: Zhongyu Xia,Zhiwei Lin,Yongtao Wang,Ming-Hsuan Yang 机构: Peking University (北京大学); University of California, Merced (加州大学默塞德分校) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: Preliminary version, 19 pages
点击查看摘要
[CV-33] GEWDiff: Geometric Enhanced Wavelet-based Diffusion Model for Hyperspectral Image Super-resolution AAAI2026
链接: https://arxiv.org/abs/2511.07103 作者: Sirui Wang,Jiang He,Natàlia Blasco Andreo,Xiao Xiang Zhu 机构: 1. Wuhan University (武汉大学); 2. University of Barcelona (巴塞罗那大学); 3. German Aerospace Center (德国航空航天中心) 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) 备注: This manuscript has been accepted for publication in AAAI 2026
点击查看摘要
[CV-34] How Bias Binds: Measuring Hidden Associations for Bias Control in Text-to-Image Compositions AAAI2026 AAAI
链接: https://arxiv.org/abs/2511.07091 作者: Jeng-Lin Li,Ming-Ching Chang,Wei-Chao Chen 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) 备注: Accepted for publication at the Alignment Track of The 40th Annual AAAI Conference on Artificial Intelligence (AAAI 2026)
点击查看摘要
[CV-35] Achieving Effective Virtual Reality Interactions via Acoustic Gesture Recognition based on Large Language Models ICASSP2026
链接: https://arxiv.org/abs/2511.07085 作者: Xijie Zhang,Fengliang He,Hong-Ning Dai 机构: 未知 类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV) 备注: 5 pages, 4 figures, 1 table, under review at ICASSP 2026
点击查看摘要
[CV-36] Pandar128 dataset for lane line detection
链接: https://arxiv.org/abs/2511.07084 作者: Filip Beránek,Václav Diviš,Ivan Gruber 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) 备注:
点击查看摘要
[CV-37] LeCoT: revisiting network architecture for two-view correspondence pruning
链接: https://arxiv.org/abs/2511.07078 作者: Luanyuan Dai,Xiaoyu Du,Jinhui Tang 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: Just accepted at SCIENCE CHINA Information Sciences
点击查看摘要
[CV-38] ClusterMine: Robust Label-Free Visual Out-Of-Distribution Detection via Concept Mining from Text Corpora WACV WACV2026
链接: https://arxiv.org/abs/2511.07068 作者: Nikolas Adaloglou,Diana Petrusheva,Mohamed Asker,Felix Michels,Markus Kollmann 机构: Heinrich Heine University of Düsseldorf (海因里希·海涅杜塞尔多夫大学) 类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) 备注: Accepted in WACV 2026. Code in this https URL 9 Tables, 11 Figures
点击查看摘要
[CV-39] RaLD: Generating High-Resolution 3D Radar Point Clouds with Latent Diffusion
链接: https://arxiv.org/abs/2511.07067 作者: Ruijie Zhang,Bixin Zeng,Shengpeng Wang,Fuhui Zhou,Wei Wang 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
[CV-40] Improving Deepfake Detection with Reinforcement Learning-Based Adaptive Data Augmentation
[CV-42] 3D-ANC: Adaptive Neural Collapse for Robust 3D Point Cloud Recognition AAAI2026
链接: https://arxiv.org/abs/2511.07040 作者: Yuanmin Huang,Wenxuan Li,Mi Zhang,Xiaohan Zhang,Xiaoyu You,Min Yang 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR) 备注: AAAI 2026
点击查看摘要
[CV-43] Certified L2-Norm Robustness of 3D Point Cloud Recognition in the Frequency Domain AAAI26
链接: https://arxiv.org/abs/2511.07029 作者: Liang Zhou,Qiming Wang,Tianze Chen 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: Accepted by AAAI26
点击查看摘要
[CV-44] Performance Decay in Deepfake Detection: The Limitations of Training on Outdated Data
链接: https://arxiv.org/abs/2511.07009 作者: Jack Richings,Margaux Leblanc,Ian Groves,Victoria Nockles 机构: Defence AI Research (DARe), The Alan Turing Institute (艾伦图灵研究所) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
[CV-45] TrueCity: Real and Simulated Urban Data for Cross-Domain 3D Scene Understanding
链接: https://arxiv.org/abs/2511.07007 作者: Duc Nguyen,Yan-Ling Lai,Qilin Zhang,Prabin Gyawali,Benedikt Schwab,Olaf Wysocki,Thomas H. Kolbe 机构: Technical University of Munich (慕尼黑工业大学); CV4DT, University of Cambridge (剑桥大学) 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) 备注: The paper accepted for 3DV 2026 (International Conference on 3D Vision 2026)
点击查看摘要
[CV-46] Exploring the “Great Unseen” in Medieval Manuscripts: Instance-Level Labeling of Legacy Image Collections with Zero-Shot Models
[CV-47] Oh That Looks Familiar: A Novel Similarity Measure for Spreadsheet Template Discovery
链接: https://arxiv.org/abs/2511.06973 作者: Anand Krishnakumar,Vengadesh Ravikumaran 机构: Ekimetrics 类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV) 备注: 5 pages, 2 figures, Accepted for EuroIPS: AI for Tabular Data Workshop (2025)
点击查看摘要
[CV-48] Learning from the Right Patches: A Two-Stage Wavelet-Driven Masked Autoencoder for Histopathology Representation Learning
链接: https://arxiv.org/abs/2511.06958 作者: Raneen Younis,Louay Hamdi,Lukas Chavez,Zahra Ahmadi 机构: PLRI Medical Informatics Institute (PLRI 医学信息学研究所); Hannover Medical School (汉诺威医学院); Leibniz University Hannover (汉诺威莱布尼茨大学); Sanford Burnham Prebys Medical Discovery Institute (桑福德伯纳姆-普雷比斯医学发现研究所); University of California San Diego (加州大学圣地亚哥分校) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
[CV-49] GFix: Perceptually Enhanced Gaussian Splatting Video Compression
链接: https://arxiv.org/abs/2511.06953 作者: Siyue Teng,Ge Gao,Duolikun Danier,Yuxuan Jiang,Fan Zhang,Thomas Davis,Zoe Liu,David Bull 机构: University of Bristol (布里斯托大学); University of Edinburgh (爱丁堡大学); Visionular Inc. (视觉公司) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
[CV-50] PADM: A Physics-aware Diffusion Model for Attenuation Correction WACV
链接: https://arxiv.org/abs/2511.06948 作者: Trung Kien Pham,Hoang Minh Vu,Anh Duc Chu,Dac Thai Nguyen,Trung Thanh Nguyen,Thao Nguyen Truong,Mai Hong Son,Thanh Trung Nguyen,Phi Le Nguyen 机构: AI4LIFE, Hanoi University of Science and Technology (河内科技大学), Vietnam; Nagoya University (名古屋大学), Japan; National Institute of Advanced Industrial Science and Technology (日本产业技术综合研究所), Japan; 108 Military Central Hospital (108军区中央医院), Vietnam 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026
点击查看摘要
[CV-51] FoCLIP: A Feature-Space Misalignment Framework for CLIP-Based Image Manipulation and Detection
链接: https://arxiv.org/abs/2511.06947 作者: Yulin Chen,Zeyuan Wang,Tianyuan Yu,Yingmei Wei,Liang Bai 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) 备注: 15 page, 9 figures, published to PRCV
点击查看摘要
[CV-52] From Attribution to Action: Jointly ALIGNing Predictions and Explanations AAAI2026
[CV-53] PlantTraitNet: An Uncertainty-Aware Multimodal Framework for Global-Scale Plant Trait Inference from Citizen Science Data AAAI AAAI-26
链接: https://arxiv.org/abs/2511.06943 作者: Ayushi Sharma,Johanna Trost,Daniel Lusk,Johannes Dollinger,Julian Schrader,Christian Rossi,Javier Lopatin,Etienne Laliberté,Simon Haberstroh,Jana Eichel,Daniel Mederer,Jose Miguel Cerda-Paredes,Shyam S. Phartyal,Lisa-Maricia Schwarz,Anja Linstädter,Maria Conceição Caldeira,Teja Kattenborn 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) 备注: Preprint version of the paper accepted at the 40th AAAI Conference on Artificial Intelligence (AAAI-26), organized by the Association for the Advancement of Artificial Intelligence
点击查看摘要
[CV-54] DTTNet: Improving Video Shadow Detection via Dark-Aware Guidance and Tokenized Temporal Modeling
链接: https://arxiv.org/abs/2511.06925 作者: Zhicheng Li,Kunyang Sun,Rui Yao,Hancheng Zhu,Fuyuan Hu,Jiaqi Zhao,Zhiwen Shao,Yong Zhou 机构: China University of Mining and Technology (中国矿业大学); Shanghai Jiao Tong University (上海交通大学) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
[CV-55] Mono3DVG-EnSD: Enhanced Spatial-aware and Dimension-decoupled Text Encoding for Monocular 3D Visual Grounding
[CV-56] Classification of Microplastic Particles in Water using Polarized Light Scattering and Machine Learning Methods
链接: https://arxiv.org/abs/2511.06901 作者: Leonard Saur,Marc von Pawlowski,Ulrich Gengenbach,Ingo Sieber,Hossein Shirali,Lorenz Wührl,Rainer Kiko,Christian Pylatiuk 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: 20 pages, 6 figures
点击查看摘要
[CV-57] Adaptive Morph-Patch Transformer for Aortic Vessel Segmentation AAAI2026 AAAI
链接: https://arxiv.org/abs/2511.06897 作者: Zhenxi Zhang,Fuchen Zheng,Adnan Iltaf,Yifei Han,Zhenyu Cheng,Yue Du,Bin Li,Tianyong Liu,Shoujun Zhou 机构: SIAT; Shenzhen Institutes of Advanced Technology (深圳先进技术研究院) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: This is the preprint version of a paper accepted by AAAI 2026. The final version will appear in the AAAI Proceedings
点击查看摘要
[CV-58] A Two-Stage System for Layout-Controlled Image Generation using Large Language Models and Diffusion Models
链接: https://arxiv.org/abs/2511.06888 作者: Jan-Hendrik Koch,Jonas Krumme,Konrad Gadzicki 机构: University of Bremen (不来梅大学) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: 12 pages, 5 figures
点击查看摘要
[CV-59] Generating an Image From 1000 Words: Enhancing Text-to-Image With Structured Captions
[CV-60] VAEVQ: Enhancing Discrete Visual Tokenization through Variational Modeling
链接: https://arxiv.org/abs/2511.06863 作者: Sicheng Yang,Xing Hu,Qiang Wu,Dawei Yang 机构: Houmo AI (后摩智能) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
[CV-61] Ambiguity-aware Truncated Flow Matching for Ambiguous Medical Image Segmentation AAAI-26
链接: https://arxiv.org/abs/2511.06857 作者: Fanding Li(1),Xiangyu Li(1),Xianghe Su(1),Xingyu Qiu(1),Suyu Dong(2),Wei Wang(3),Kuanquan Wang(1),Gongning Luo(1),Shuo Li(4 and 5) ((1) Faculty of Computing, Harbin Institute of Technology, Harbin, China, (2) College of Computer and Control Engineering, Northeast Forestry University, Harbin, China, (3) Faculty of Computing, Harbin Institute of Technology, Shenzhen, China, (4) Department of Computer and Data Science, Case Western Reserve University, Cleveland, Ohio 44106, United States, (5) Department of Biomedical Engineering, Case Western Reserve University, Cleveland, Ohio 44106, United States) 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: 13 pages, 10 figures, extended version of AAAI-26 paper
点击查看摘要
[CV-62] Distillation Dynamics: Towards Understanding Feature-Based Distillation in Vision Transformers AAAI2026
链接: https://arxiv.org/abs/2511.06848 作者: Huiyuan Tian,Bonan Xu,Shijian Li 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: Accepted to AAAI 2026. Submitted version
点击查看摘要
[CV-63] Gaussian-Augmented Physics Simulation and System Identification with Complex Colliders NEURIPS2025
链接: https://arxiv.org/abs/2511.06846 作者: Federico Vasile,Ri-Zhao Qiu,Lorenzo Natale,Xiaolong Wang 机构: Istituto Italiano di Tecnologia (意大利技术研究院); UC San Diego (加州大学圣地亚哥分校) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: Accepted to NeurIPS 2025. Project website: this https URL
点击查看摘要
[CV-64] Aerial Image Stitching Using IMU Data from a UAV
链接: https://arxiv.org/abs/2511.06841 作者: Selim Ahmet Iz,Mustafa Unel 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Systems and Control (eess.SY); Dynamical Systems (math.DS) 备注:
点击查看摘要
[CV-65] PanoNav: Mapless Zero-Shot Object Navigation with Panoramic Scene Parsing and Dynamic Memory AAAI2026
链接: https://arxiv.org/abs/2511.06840 作者: Qunchao Jin,Yilin Wu,Changhao Chen 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO) 备注: Accepted as a poster in AAAI 2026
点击查看摘要
[CV-66] Vision-Based System Identification of a Quadrotor
链接: https://arxiv.org/abs/2511.06839 作者: Selim Ahmet Iz,Mustafa Unel 机构: 未知 类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY); Dynamical Systems (math.DS) 备注:
点击查看摘要
[CV-67] NeuroBridge: Bio-Inspired Self-Supervised EEG-to-Image Decoding via Cognitive Priors and Bidirectional Semantic Alignment AAAI2026
[CV-71] S-TSL: Image-Label Supervised Surgical Video Stereo Matching via Time-Switchable Teacher-Student Learning
链接: https://arxiv.org/abs/2511.06817 作者: Rui Wang,Ying Zhou,Hao Wang,Wenwei Zhang,Qiang Li,Zhiwei Wang 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) 备注: 8 pages, 4 figures, accepted by BiBM2025
点击查看摘要
[CV-72] ConeGS: Error-Guided Densification Using Pixel Cones for Improved Reconstruction with Fewer Primitives
链接: https://arxiv.org/abs/2511.06810 作者: Bartłomiej Baranowski,Stefano Esposito,Patricia Gschoßmann,Anpei Chen,Andreas Geiger 机构: University of Tübingen (图宾根大学) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
[CV-73] Robust and High-Fidelity 3D Gaussian Splatting: Fusing Pose Priors and Geometry Constraints for Texture-Deficient Outdoor Scenes IROS2025
链接: https://arxiv.org/abs/2511.06765 作者: Meijun Guo,Yongliang Shi,Caiyun Liu,Yixiao Feng,Ming Ma,Tinghai Yan,Weining Lu,Bin Liang 机构: Beijing Institute of Technology (北京理工大学); Beiing National Research Center for Information Science and Technology (北京信息科学与技术国家研究中心); Qiyuan Lab (启源实验室); Peking University (北京大学) 类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR) 备注: 7 pages, 3 figures. Accepted by IROS 2025
点击查看摘要
[CV-74] CAST-LUT: Tokenizer-Guided HSV Look-Up Tables for Purple Flare Removal
[CV-75] SlotVLA: Towards Modeling of Object-Relation Representations in Robotic Manipulation
链接: https://arxiv.org/abs/2511.06754 作者: Taisei Hanyu,Nhat Chung,Huy Le,Toan Nguyen,Yuki Ikebe,Anthony Gunderman,Duy Nguyen Ho Minh,Khoa Vo,Tung Kieu,Kashu Yamazaki,Chase Rainwater,Anh Nguyen,Ngan Le 机构: University of Arkansas, USA; FPT Software AI Center, Vietnam; University of Stuttgart, Germany; Aalborg University, Denmark; Carnegie Mellon University, USA; University of Liverpool, UK; German Research Center for Artificial Intelligence, Germany; Max Planck Research School for Intelligent Systems, Germany 类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV) 备注: under review
点击查看摘要
[CV-76] Med-SORA: Symptom to Organ Reasoning in Abdomen CT Images
链接: https://arxiv.org/abs/2511.06752 作者: You-Kyoung Na,Yeong-Jun Cho 机构: Chonnam National University (全南国立大学) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: 9 pages
点击查看摘要
链接: https://arxiv.org/abs/2511.06749 作者: Weining Lu,Deer Bin,Lian Ma,Ming Ma,Zhihao Ma,Xiangyang Chen,Longfei Wang,Yixiao Feng,Zhouxian Jiang,Yongliang Shi,Bin Liang 机构: Beijing National Research Center for Information Science and Technology (北京信息科学与技术国家研究中心); Qiyuan Lab (启源实验室); JiangHuai Advanced Technology Center (江淮先进技术中心) 类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV) 备注: 7 pages, 3 figures. Accepted by IROS 2025
点击查看摘要
[CV-78] Image Restoration via Primal Dual Hybrid Gradient and Flow Generative Model AAAI26
链接: https://arxiv.org/abs/2511.06748 作者: Ji Li,Chao Wang 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: 13 pages; AAAI26 version with appendix
点击查看摘要
[CV-79] PointCubeNet: 3D Part-level Reasoning with 3x3x3 Point Cloud Blocks
[CV-88] K-Stain: Keypoint-Driven Correspondence for HE-to-IHC Virtual Staining
链接: https://arxiv.org/abs/2511.06709 作者: Sicheng Yang,Zhaohu Xing,Haipeng Zhou,Lei Zhu 机构: The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); The Hong Kong University of Science and Technology (香港科技大学) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
[CV-89] SPAN: Spatial-Projection Alignment for Monocular 3D Object Detection
链接: https://arxiv.org/abs/2511.06702 作者: Yifan Wang,Yian Zhao,Fanqi Pu,Xiaochen Yang,Yang Tang,Xi Chen,Wenming Yang 机构: Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); School of Electronic and Computer Engineering, Peking University (北京大学电子与计算机工程学院); School of Mathematics and Statistics, University of Glasgow (格拉斯哥大学数学与统计学院); Basic Algorithm Center, PCG, Tencent (腾讯PCG基础算法中心) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
[CV-90] AnoStyler: Text-Driven Localized Anomaly Generation via Lightweight Style Transfer AAAI2026
[CV-94] Active Learning for Animal Re-Identification with Ambiguity-Aware Sampling AAAI
链接: https://arxiv.org/abs/2511.06658 作者: Depanshu Sani,Mehar Khurana,Saket Anand 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) 备注: In Proceedings of AAAI Conference on Artificial Intelligence 2026
点击查看摘要
[CV-95] NOVO: Bridging LLaVA and SAM with Visual-only Prompts for Reasoning Segmentation
链接: https://arxiv.org/abs/2511.06651 作者: Kyung-Yoon Yoon,Yeong-Jun Cho 机构: Chonnam National University (全南国立大学) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
[CV-96] FreqGRL: Suppressing Low-Frequency Bias and Mining High-Frequency Knowledge for Cross-Domain Few-Shot Learning
链接: https://arxiv.org/abs/2511.06648 作者: Siqi Hui,Sanping Zhou,Ye Deng,Wenli Huang,Jinjun Wang 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
[CV-97] UniADC: A Unified Framework for Anomaly Detection and Classification
链接: https://arxiv.org/abs/2511.06644 作者: Ximiao Zhang,Min Xu,Zheng Zhang,Junlin Hu,Xiuzhuang Zhou 机构: Beijing University of Posts and Telecommunications (北京邮电大学); Capital Normal University (首都师范大学); Beihang University (北京航空航天大学) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
[CV-98] DIAL-GS: Dynamic Instance Aware Reconstruction for Label-free Street Scenes with 4D Gaussian Splatting
链接: https://arxiv.org/abs/2511.06632 作者: Chenpeng Su,Wenhua Wu,Chensheng Peng,Tianchen Deng,Zhe Liu,Hesheng Wang 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
[CV-99] Explainable Cross-Disease Reasoning for Cardiovascular Risk Assessment from LDCT
[CV-100] On Accurate and Robust Estimation of 3D and 2D Circular Center: Method and Application to Camera-Lidar Calibration
链接: https://arxiv.org/abs/2511.06611 作者: Jiajun Jiang,Xiao Hu,Wancheng Liu,Wei Jiang 机构: The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); International Digital Economy Academy (国际数字经济发展研究院); Horizon-Continental Technology Corporation; Beijing Jiaotong University (北京交通大学) 类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO) 备注:
点击查看摘要
[CV-101] Spatial-Frequency Enhanced Mamba for Multi-Modal Image Fusion
链接: https://arxiv.org/abs/2511.06593 作者: Hui Sun,Long Lv,Pingping Zhang,Tongdan Tang,Feng Tian,Weibing Sun,Huchuan Lu 机构: Dalian University of Technology (大连理工大学); Affiliated Zhongshan Hospital of Dalian University of Technology (大连理工大学附属中山医院); Central Hospital of Dalian University of Technology (大连理工大学中心医院) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: This work is accepted by IEEE Transactions on Image Processing. More modifications may be performed
点击查看摘要
[CV-102] Video Dataset for Surgical Phase Keypoint and Instrument Recognition in Laparoscopic Surgery (PhaKIR)
链接: https://arxiv.org/abs/2511.06549 作者: Tobias Rueckert,Raphaela Maerkl,David Rauber,Leonard Klausmann,Max Gutbrod,Daniel Rueckert,Hubertus Feussner,Dirk Wilhelm,Christoph Palm 机构: Regensburg Medical Image Computing (ReMIC), OTH Regensburg (雷根斯堡应用技术大学); AKTORmed Robotic Surgery (AKTORmed机器人手术); Regensburg Center of Biomedical Engineering (RCBE), OTH Regensburg and Regensburg University (雷根斯堡大学); Regensburg Center of Health Sciences and Technology (RCHST), OTH Regensburg (雷根斯堡应用技术大学); Chair for AI in Healthcare and Medicine, Technical University of Munich (TUM) and TUM University Hospital (慕尼黑工业大学及慕尼黑工业大学医院); Biomedical Image Analysis Group, Department of Computing, Imperial College London (帝国理工学院); Research Group MITI, TUM University Hospital, School of Medicine and Health, Technical University of Munich (慕尼黑工业大学); Department of Surgery, TUM University Hospital, School of Medicine and Health, Technical University of Munich (慕尼黑工业大学) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: 9 pages, 5 figures, 4 tables
点击查看摘要
[CV-103] SportR: A Benchmark for Multimodal Large Language Model Reasoning in Sports
链接: https://arxiv.org/abs/2511.06499 作者: Haotian Xia,Haonan Ge,Junbo Zou,Hyun Woo Choi,Xuebin Zhang,Danny Suradja,Botao Rui,Ethan Tran,Wendy Jin,Zhen Ye,Xiyang Lin,Christopher Lai,Shengjie Zhang,Junwen Miao,Shichao Chen,Rhys Tracy,Vicente Ordonez,Weining Shen,Hanjie Chen 机构: Rice University (莱斯大学); University of California, Irvine (加州大学欧文分校); Georgia Institute of Technology (佐治亚理工学院); Johns Hopkins University (约翰霍普金斯大学); University of California, Santa Barbara (加州大学圣塔芭芭拉分校) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
[CV-104] A Low-Rank Method for Vision Language Model Hallucination Mitigation in Autonomous Driving
[CV-105] Zooming into Comics: Region-Aware RL Improves Fine-Grained Comic Understanding in Vision-Language Models
链接: https://arxiv.org/abs/2511.06490 作者: Yule Chen,Yufan Ren,Sabine Süsstrunk 机构: EPFL (瑞士联邦理工学院); Chalmers University of Technology (查尔姆斯理工大学) 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) 备注:
点击查看摘要
[CV-106] NOAH: Benchmarking Narrative Prior driven Hallucination and Omission in Video Large Language Models
链接: https://arxiv.org/abs/2511.06475 作者: Kyuho Lee,Euntae Kim,Jinwoo Choi,Buru Chang 机构: Korea University (韩国大学); Kyung Hee University (庆熙大学) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: 18 pages, 9 figures. Preprint
点击查看摘要
[CV-107] Inpaint360GS: Efficient Object-Aware 3D Inpainting via Gaussian Splatting for 360° Scenes WACV2026
链接: https://arxiv.org/abs/2511.06457 作者: Shaoxiang Wang,Shihong Zhang,Christen Millerdurai,Rüdiger Westermann,Didier Stricker,Alain Pagani 机构: German Research Center for Artificial Intelligence (德国人工智能研究中心); RPTU (莱茵兰-普法尔茨凯泽斯劳滕-兰道工业大学); Technical University of Munich (慕尼黑工业大学) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: WACV 2026, project page: this https URL
点击查看摘要
[CV-108] EIDSeg: A Pixel-Level Semantic Segmentation Dataset for Post-Earthquake Damage Assessment from Social Media Images AAAI
链接: https://arxiv.org/abs/2511.06456 作者: Huili Huang,Chengeng Liu,Danrong Zhang,Shail Patel,Anastasiya Masalava,Sagar Sadak,Parisa Babolhavaeji,WeiHong Low,Max Mahdi Roozbahani,J. David Frost 机构: Georgia Institute of Technology (佐治亚理工学院) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: Camera-Ready for AAAI-AISI26
点击查看摘要
[CV-109] Countering Multi-modal Representation Collapse through Rank-targeted Fusion WACV
链接: https://arxiv.org/abs/2511.06450 作者: Seulgi Kim,Kiran Kokilepersaud,Mohit Prabhushankar,Ghassan AlRegib 机构: Georgia Institute of Technology (佐治亚理工学院) 类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) 备注: Accepted in 2026 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
点击查看摘要
[CV-110] Diagnose Like A REAL Pathologist: An Uncertainty-Focused Approach for Trustworthy Multi-Resolution Multiple Instance Learning WACV
链接: https://arxiv.org/abs/2511.06433 作者: Sungrae Hong,Sol Lee,Jisu Shin,Mun Yong Yi 机构: Korea Advanced Institute of Science and Technology (韩国科学技术院) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: Accepted by IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026
点击查看摘要
[CV-111] DiffusionUavLoc: Visually Prompted Diffusion for Cross-View UAV Localization
链接: https://arxiv.org/abs/2511.06422 作者: Tao Liu,Kan Ren,Qian Chen 机构: Nanjing University of Science and Technology (南京理工大学) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
[CV-112] VDNeRF: Vision-only Dynamic Neural Radiance Field for Urban Scenes
链接: https://arxiv.org/abs/2511.06408 作者: Zhengyu Zou,Jingfeng Li,Hao Li,Xiaolei Hou,Jinwen Hu,Jingkun Chen,Lechao Cheng,Dingwen Zhang 机构: Northwestern Polytechnical University (西北工业大学); University of Oxford (牛津大学); Hefei University of Technology (合肥工业大学) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
[CV-113] On Modality Incomplete Infrared-Visible Object Detection: An Architecture Compatibility Perspective
[CV-114] InfoAffect: A Dataset for Affective Analysis of Infographics
链接: https://arxiv.org/abs/2511.06404 作者: Zihang Fu,Yunchao Wang,Chenyu Huang,Guodao Sun,Ronghua Liang 机构: Zhejiang University of Technology (浙江工业大学); Zhejiang University of Science and Technology (浙江科技学院) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
[CV-115] ArtReg: Visuo-Tactile based Pose Tracking and Manipulation of Unseen Articulated Objects
[CV-116] V-Shuffle: Zero-Shot Style Transfer via Value Shuffle
链接: https://arxiv.org/abs/2511.06365 作者: Haojun Tang,Qiwei Lin,Tongda Xu,Lida Huang,Yan Wang 机构: Tsinghua University (清华大学); Dalian University of Technology (大连理工大学); Beijing Institute of Radio Measurement (北京无线电计量测试研究所) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
[CV-117] AesTest: Measuring Aesthetic Intelligence from Perception to Production
链接: https://arxiv.org/abs/2511.06360 作者: Guolong Wang,Heng Huang,Zhiqiang Zhang,Wentian Li,Feilong Ma,Xin Jin 机构: University of International Business and Economics (对外经济贸易大学); University of Science and Technology of China (中国科学技术大学); Huawei Technologies Co., Ltd (华为技术有限公司); Beijing Electronic Science and Technology Institute (北京电子科技学院); Beijing Institute for General Artificial Intelligence (通用人工智能研究院) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: 10 pages, 9 figures
点击查看摘要
[CV-118] GazeVLM: A Vision-Language Model for Multi-Task Gaze Understanding
链接: https://arxiv.org/abs/2511.06348 作者: Athul M. Mathew,Haithem Hermassi,Thariq Khalid,Arshad Ali Khan,Riad Souissi 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) 备注:
点击查看摘要
[CV-119] BuildingWorld: A Structured 3D Building Dataset for Urban Foundation Models
[CV-126] SFFR: Spatial-Frequency Feature Reconstruction for Multispectral Aerial Object Detection
链接: https://arxiv.org/abs/2511.06298 作者: Xin Zuo,Yuchen Qu,Haibo Zhan,Jifeng Shen,Wankou Yang 机构: Jiangsu University of Science and Technology (江苏科技大学); Jiangsu University (江苏大学); Southeast University (东南大学) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: 11 pages,8 figures, accepted by IEEE TGRS
点击查看摘要
[CV-127] Learning-Based Vision Systems for Semi-Autonomous Forklift Operation in Industrial Warehouse Environments
[CV-128] TinyChemVL: Advancing Chemical Vision-Language Models via Efficient Visual Token Reduction and Complex Reaction Tasks AAAI2026
【速读】: This paper tackles two major bottlenecks in applying current Vision Language Models (VLMs) to the chemical domain: first, directly using a standard VLM on chemical images with non-informative backgrounds is computationally inefficient; second, existing methods are confined to molecular-level tasks, which limits progress in chemical reasoning. The key to the solution is TinyChemVL, an efficient and powerful chemistry-specific VLM whose core innovations are a visual token reduction strategy that substantially lowers computational cost and the introduction of reaction-level tasks to strengthen chemical reasoning. The authors also construct the ChemRxn-V benchmark for evaluating vision-based reaction recognition and prediction. Experiments show that with only 4B parameters, TinyChemVL surpasses existing models on both molecular and reaction tasks with faster inference and training, and outperforms ChemVLM while using only 1/16 of the visual tokens.
链接: https://arxiv.org/abs/2511.06283 作者: Xuanle Zhao,Shuxin Zeng,Yinyuan Cai,Xiang Cheng,Duzhen Zhang,Xiuyi Chen,Bo Xu 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: Accepted by AAAI 2026, Preprint Version
点击查看摘要
Abstract:While Vision Language Models (VLMs) have demonstrated remarkable capabilities in general visual understanding, their application in the chemical domain has been limited, with previous works predominantly focusing on text and thus overlooking critical visual information, such as molecular structures. Current approaches that directly adopt standard VLMs for chemical tasks suffer from two primary issues: (i) the computational inefficiency of processing entire chemical images with non-informative backgrounds, and (ii) a narrow scope on molecular-level tasks that restricts progress in chemical reasoning. In this work, we propose TinyChemVL, an efficient and powerful chemical VLM that leverages visual token reduction and reaction-level tasks to improve model efficiency and reasoning capacity. Also, we propose ChemRxn-V, a reaction-level benchmark for assessing vision-based reaction recognition and prediction tasks. Directly predicting reaction products from molecular images poses a non-trivial challenge, as it requires models to integrate both recognition and reasoning capacities. Our results demonstrate that with only 4B parameters, TinyChemVL achieves superior performance on both molecular and reaction tasks while demonstrating faster inference and training speeds compared to existing models. Notably, TinyChemVL outperforms ChemVLM while utilizing only 1/16th of the visual tokens. This work builds efficient yet powerful VLMs for chemical domains by co-designing model architecture and task complexity.
zh
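The visual token reduction that TinyChemVL relies on can be illustrated with a minimal sketch: score every patch token, keep only the top fraction, and discard the likely-background rest. The abstract does not specify the scoring rule, so the L2-norm saliency proxy, the `reduce_visual_tokens` name, and applying the 1/16 keep ratio to a 256-token grid are all assumptions for illustration:

```python
import numpy as np

def reduce_visual_tokens(patch_tokens, keep_ratio=1 / 16):
    """Keep only the highest-scoring patch tokens, dropping likely background.

    patch_tokens: (N, D) array of patch embeddings. Scoring by L2 norm is a
    stand-in for whatever saliency signal the actual model learns.
    """
    n_keep = max(1, int(len(patch_tokens) * keep_ratio))
    scores = np.linalg.norm(patch_tokens, axis=1)
    keep_idx = np.argsort(scores)[::-1][:n_keep]     # top-n_keep by score
    return patch_tokens[np.sort(keep_idx)]           # preserve spatial order

tokens = np.random.randn(256, 64)   # e.g. a 16x16 patch grid, 64-dim features
reduced = reduce_visual_tokens(tokens)
print(reduced.shape)  # (16, 64)
```

The downstream language model then attends over 16 instead of 256 visual tokens, which is where the claimed efficiency gain would come from.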
[CV-129] From ACR O-RADS 2022 to Explainable Deep Learning: Comparative Performance of Expert Radiologists Convolutional Neural Networks Vision Transformers and Fusion Models in Ovarian Masses
链接: https://arxiv.org/abs/2511.06282 作者: Ali Abbasian Ardakani,Afshin Mohammadi,Alisa Mohebbi,Anushya Vijayananthan,Sook Sam Leong,Lim Yi Ting,Mohd Kamil Bin Mohamad Fabell,U Rajendra Acharya,Sepideh Hatamikia 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: 18 pages, 4 figures
点击查看摘要
[CV-130] VideoSSR: Video Self-Supervised Reinforcement Learning
链接: https://arxiv.org/abs/2511.06281 作者: Zefeng He,Xiaoye Qu,Yafu Li,Siyuan Huang,Daizong Liu,Yu Cheng 机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Nanjing Univerisity (南京大学); The Chinese University of Hong Kong (香港中文大学); Shanghai Jiao Tong University (上海交通大学); Wuhan University (武汉大学) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注:
链接: https://arxiv.org/abs/2511.06272 作者: Zijie Wang,Weiming Zhang,Wei Zhang,Xiao Tan,Hongxing Liu,Yaowei Wang,Guanbin Li 机构: Sun Yat-sen University (中山大学); Shenzhen Loop Area Institute (深圳环区研究院); Baidu Inc. (百度公司); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学(深圳)); Pengcheng Laboratory (鹏城实验室); Guangdong Key Laboratory of Big Data Analysis and Processing (广东省大数据分析与处理重点实验室) 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) 备注: Accepted by ICCV 2025
点击查看摘要
[CV-132] RelightMaster: Precise Video Relighting with Multi-plane Light Images
链接: https://arxiv.org/abs/2511.06271 作者: Weikang Bian,Xiaoyu Shi,Zhaoyang Huang,Jianhong Bai,Qinghe Wang,Xintao Wang,Pengfei Wan,Kun Gai,Hongsheng Li 机构: Multimedia Laboratory, The Chinese University of Hong Kong (香港中文大学多媒体实验室); Kling Team, Kuaishou Technology (快手科技Kling团队); CPII under InnoHK (InnoHK计划下的CPII); Zhejiang University (浙江大学); Dalian University of Technology (大连理工大学) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: Project Page: this https URL
点击查看摘要
[CV-133] LLM-Driven Completeness and Consistency Evaluation for Cultural Heritage Data Augmentation in Cross-Modal Retrieval
链接: https://arxiv.org/abs/2511.06268 作者: Jian Zhang,Junyi Guo,Junyi Yuan,Huanda Lu,Yanlin Zhou,Fangyu Wu,Qiufeng Wang,Dongming Lu 机构: Xi’an Jiaotong-Liverpool University (西安交通大学利物浦大学); NingboTech University (宁波工程学院); Dunhuang Academy (敦煌研究院); Zhejiang University (浙江大学) 类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY) 备注:
点击查看摘要
[CV-134] A Mixture-of-Experts Framework with Log-Logistic Components for Survival Analysis on Histopathology Images
[CV-135] CAMP-HiVe: Cyclic Pair Merging based Efficient DNN Pruning with Hessian-Vector Approximation for Resource-Constrained Systems
链接: https://arxiv.org/abs/2511.06265 作者: Mohammad Helal Uddin,Sai Krishna Ghanta,Liam Seymour,Sabur Baidya 机构: University of Louisville (路易斯维尔大学); University of Georgia (佐治亚大学) 类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
[CV-136] Robust Nearest Neighbour Retrieval Using Targeted Manifold Manipulation
链接: https://arxiv.org/abs/2511.06261 作者: B. Ghosh,H. Harikumar,S. Rana 机构: Applied Artificial Intelligence Institute, Deakin University, Australia(澳大利亚迪肯大学应用人工智能研究所); The University of Manchester, Manchester, England(英格兰曼彻斯特大学) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
[CV-137] VLDrive: Vision-Augmented Lightweight MLLM s for Efficient Language-grounded Autonomous Driving ICCV2025
链接: https://arxiv.org/abs/2511.06256 作者: Ruifei Zhang,Wei Zhang,Xiao Tan,Sibei Yang,Xiang Wan,Xiaonan Luo,Guanbin Li 机构: The Chinese University of Hong Kong, Shenzhen; Shenzhen Research Institute of Big Data; Sun Yat-sen University; Baidu Inc.; Guilin University of Electronic Technology; Guangdong Key Laboratory of Big Data Analysis and Processing 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: Accepted by ICCV2025
点击查看摘要
Abstract:Recent advancements in language-grounded autonomous driving have been significantly promoted by the sophisticated cognition and reasoning capabilities of large language models (LLMs). However, current LLM-based approaches encounter critical challenges: (1) Failure analysis reveals that frequent collisions and obstructions, stemming from limitations in visual representations, remain primary obstacles to robust driving performance. (2) The substantial parameters of LLMs pose considerable deployment hurdles. To address these limitations, we introduce VLDrive, a novel approach featuring a lightweight MLLM architecture with enhanced vision components. VLDrive achieves compact visual tokens through innovative strategies, including cycle-consistent dynamic visual pruning and memory-enhanced feature aggregation. Furthermore, we propose a distance-decoupled instruction attention mechanism to improve joint visual-linguistic feature learning, particularly for long-range visual tokens. Extensive experiments conducted in the CARLA simulator demonstrate VLDrive's effectiveness. Notably, VLDrive achieves state-of-the-art driving performance while reducing parameters by 81% (from 7B to 1.3B), yielding substantial driving score improvements of 15.4%, 16.8%, and 7.6% at tiny, short, and long distances, respectively, in closed-loop evaluations. Code is available at this https URL.
zh
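The abstract names a distance-decoupled instruction attention mechanism but gives no formula. One plausible toy rendering is an additive, distance-dependent bias on the attention logits so long-range visual tokens are not drowned out; everything below (function names, the linear bias, `tau`) is an assumption for illustration, not the paper's actual design:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def distance_biased_attention(query, keys, values, distances, tau=10.0):
    """Single-query attention with an additive distance-dependent bias.

    distances: (N,) estimated range of each visual token (hypothetical input);
    tau sets how strongly far-away tokens are re-weighted.
    """
    logits = keys @ query / np.sqrt(len(query))
    logits = logits + distances / tau          # counteract long-range decay
    weights = softmax(logits)
    return weights @ values, weights

q = np.random.randn(16)                        # instruction-token query
K, V = np.random.randn(50, 16), np.random.randn(50, 8)
dist = np.linspace(0.0, 100.0, 50)             # metres, hypothetical
out, w = distance_biased_attention(q, K, V, dist)
print(out.shape)  # (8,)
```

The bias term is what "decouples" distance from content similarity: content scores and range enter the logits as separate additive terms.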
[CV-138] AdaDrive: Self-Adaptive Slow-Fast System for Language-Grounded Autonomous Driving ICCV2025
链接: https://arxiv.org/abs/2511.06253 作者: Ruifei Zhang,Junlin Xie,Wei Zhang,Weikai Chen,Xiao Tan,Xiang Wan,Guanbin Li 机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); Shenzhen Research Institute of Big Data (深圳市大数据研究院); Sun Yat-sen University (中山大学); Baidu Inc. (百度公司); Guangdong Key Laboratory of Big Data Analysis and Processing (广东省大数据分析与处理重点实验室) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: Accepted by ICCV2025
点击查看摘要
[CV-139] Test-Time Iterative Error Correction for Efficient Diffusion Models
链接: https://arxiv.org/abs/2511.06250 作者: Yunshan Zhong,Yanwei Qi,Yuxin Zhang 机构: Hainan University (海南大学); Xiamen University (厦门大学) 类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
[CV-140] Gait Recognition via Collaborating Discriminative and Generative Diffusion Models
链接: https://arxiv.org/abs/2511.06245 作者: Haijun Xiong,Bin Feng,Bang Wang,Xinggang Wang,Wenyu Liu 机构: Huazhong University of Science & Technology (华中科技大学) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: 14 pages, 4figures
点击查看摘要
[CV-141] Physics-Informed Image Restoration via Progressive PDE Integration
链接: https://arxiv.org/abs/2511.06244 作者: Shamika Likhite,Santiago López-Tapia,Aggelos K. Katsaggelos 机构: Northwestern University (西北大学) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
[CV-142] Temporal-Guided Visual Foundation Models for Event-Based Vision
链接: https://arxiv.org/abs/2511.06238 作者: Ruihao Xia,Junhong Cai,Luziwei Leng,Liuyi Wang,Chengju Liu,Ran Cheng,Yang Tang,Pan Zhou 机构: East China University of Science and Technology (华东理工大学); Huawei Technologies Company Ltd. (华为技术有限公司); Southern University of Science and Technology (南方科技大学); Tongji University (同济大学); The Hong Kong Polytechnic University (香港理工大学); Singapore Management University (新加坡管理大学) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
[CV-143] MoRA: Missing Modality Low-Rank Adaptation for Visual Recognition
链接: https://arxiv.org/abs/2511.06225 作者: Shu Zhao,Nilesh Ahuja,Tan Yu,Tianyi Shen,Vijaykrishnan Narayanan 机构: The Pennsylvania State University (宾夕法尼亚州立大学); Intel (英特尔); NVIDIA (英伟达) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
[CV-144] Scene-Aware Urban Design: A Human-AI Recommendation Framework Using Co-Occurrence Embeddings and Vision-Language Models NEURIPS2025
链接: https://arxiv.org/abs/2511.06201 作者: Rodrigo Gallardo,Oz Fishman,Alexander Htet Kyaw 机构: Massachusetts Institute of Technology (麻省理工学院) 类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC) 备注: Accepted to NEURIPS 2025 Creative AI Track
链接: https://arxiv.org/abs/2511.06194 作者: Muhammad Usama,Mohammad Sadil Khan,Didier Stricker,Muhammad Zeshan Afzal 机构: 1. University of Siegen (锡根大学); 2. Fraunhofer Institute for Computer Graphics Research (弗劳恩霍夫计算机图形研究所); 3. Center for Advanced Security Research Darmstadt (达姆施塔特高级安全研究中心) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: Accepted in AAAI 2026
点击查看摘要
[CV-146] MambaOVSR: Multiscale Fusion with Global Motion Modeling for Chinese Opera Video Super-Resolution
[CV-147] Real-Time Bundle Adjustment for Ultra-High-Resolution UAV Imagery Using Adaptive Patch-Based Feature Tracking
链接: https://arxiv.org/abs/2511.06152 作者: Selim Ahmet Iz,Francesco Nex,Norman Kerle,Henry Meissner,Ralf Berger 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC) 备注:
点击查看摘要
[CV-148] Latent Refinement via Flow Matching for Training-free Linear Inverse Problem Solving
链接: https://arxiv.org/abs/2511.06138 作者: Hossein Askari,Yadan Luo,Hongfu Sun,Fred Roosta 机构: The University of Queensland (昆士兰大学); ARC Training Centre for Information Resilience (CIRES) (信息韧性培训中心) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: 37 pages, 16 figures,
点击查看摘要
[CV-149] DiLO: Disentangled Latent Optimization for Learning Shape and Deformation in Grouped Deforming 3D Objects
链接: https://arxiv.org/abs/2511.06115 作者: Mostofa Rafid Uddin,Jana Armouti,Umong Sain,Md Asib Rahman,Xingjian Li,Min Xu 机构: Carnegie Mellon University (卡内基梅隆大学); Bangladesh University of Engineering and Technology (孟加拉国工程技术大学) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
Abstract:In this work, we propose a disentangled latent optimization-based method for parameterizing grouped deforming 3D objects into shape and deformation factors in an unsupervised manner. Our approach involves the joint optimization of a generator network along with the shape and deformation factors, supported by specific regularization techniques. For efficient amortized inference of disentangled shape and deformation codes, we train two order-invariant PointNet-based encoder networks in the second stage of our method. We demonstrate several significant downstream applications of our method, including unsupervised deformation transfer, deformation classification, and explainability analysis. Extensive experiments conducted on 3D human, animal, and facial expression datasets demonstrate that our simple approach is highly effective in these downstream tasks, comparable or superior to existing methods with much higher complexity.
zh
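The shape/deformation factorization described above can be sketched as two shared code tables whose entries are concatenated per observation; swapping the deformation index then gives the unsupervised deformation transfer the abstract mentions. The table sizes, dimensions, and names below are illustrative only (in the real method both code tables and the generator are optimized jointly):

```python
import numpy as np

rng = np.random.default_rng(0)
n_shapes, n_deforms, d = 3, 4, 8
shape_codes = rng.normal(size=(n_shapes, d))    # one code per object identity
deform_codes = rng.normal(size=(n_deforms, d))  # one code per deformation state

def latent(shape_id, deform_id):
    # an observation's latent is the concatenation of its two shared factors;
    # a generator network would decode this into a deformed 3D object
    return np.concatenate([shape_codes[shape_id], deform_codes[deform_id]])

# unsupervised deformation transfer: reuse deformation 0 on object identity 2
z = latent(2, 0)
print(z.shape)  # (16,)
```

Because codes are shared across the group, every (shape, deformation) pair reuses the same factors, which is what makes the disentanglement emerge from joint optimization.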
[CV-150] Hybrid CNN-ViT Framework for Motion-Blurred Scene Text Restoration
链接: https://arxiv.org/abs/2511.06087 作者: Umar Rashid(1),Muhammad Arslan Arshad(1),Ghulam Ahmad(1),Muhammad Zeeshan Anjum(1),Rizwan Khan(1),Muhammad Akmal(2) ((1) University of Engineering amp; Technology, New Campus, Lahore, Pakistan, (2) Sheffield Hallam University, Sheffield, UK) 机构: University of Engineering & Technology, New Campus, Lahore, Pakistan; Sheffield Hallam University, Sheffield S1 1WB, UK 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) 备注:
点击查看摘要
Abstract:Motion blur in scene text images severely impairs readability and hinders the reliability of computer vision tasks, including autonomous driving, document digitization, and visual information retrieval. Conventional deblurring approaches are often inadequate in handling spatially varying blur and typically fall short in modeling the long-range dependencies necessary for restoring textual clarity. To overcome these limitations, we introduce a hybrid deep learning framework that combines convolutional neural networks (CNNs) with vision transformers (ViTs), thereby leveraging both local feature extraction and global contextual reasoning. The architecture employs a CNN-based encoder-decoder to preserve structural details, while a transformer module enhances global awareness through self-attention. Training is conducted on a curated dataset derived from TextOCR, where sharp scene-text samples are paired with synthetically blurred versions generated using realistic motion-blur kernels of multiple sizes and orientations. Model optimization is guided by a composite loss that incorporates mean absolute error (MAE), squared error (MSE), perceptual similarity, and structural similarity (SSIM). Quantitative evaluations show that the proposed method attains 32.20 dB in PSNR and 0.934 in SSIM, while remaining lightweight with 2.83 million parameters and an average inference time of 61 ms. These results highlight the effectiveness and computational efficiency of the CNN-ViT hybrid design, establishing its practicality for real-world motion-blurred scene-text restoration.
zh
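The composite loss above (MAE, MSE, and SSIM terms) can be sketched as follows. The perceptual term is omitted since it requires a pretrained network, the SSIM here is a simplified single-window variant rather than the standard windowed SSIM, and the weights are illustrative, not the paper's:

```python
import numpy as np

def global_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM over the whole image (real SSIM averages local windows)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def composite_loss(pred, target, w_mae=1.0, w_mse=1.0, w_ssim=0.5):
    # weights are illustrative; SSIM enters as a dissimilarity (1 - SSIM)
    mae = np.abs(pred - target).mean()
    mse = ((pred - target) ** 2).mean()
    return w_mae * mae + w_mse * mse + w_ssim * (1.0 - global_ssim(pred, target))

img = np.random.rand(32, 32)
print(round(composite_loss(img, img), 6))  # 0.0 for a perfect restoration
```

Combining a pixel term (MAE/MSE) with a structural term (SSIM) is the standard way to trade off per-pixel fidelity against perceived structure in restoration losses.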
[CV-151] An Artificial Intelligence-based Assistant for the Visually Impaired
链接: https://arxiv.org/abs/2511.06080 作者: Luis Marquez-Carpintero,Francisco Gomez-Donoso,Zuria Bauer,Bessie Dominguez-Dager,Alvaro Belmonte-Baeza,Mónica Pina-Navarro,Francisco Morillas-Espejo,Felix Escalona,Miguel Cazorla 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC) 备注:
点击查看摘要
[CV-152] LoopExpose: An Unsupervised Framework for Arbitrary-Length Exposure Correction
链接: https://arxiv.org/abs/2511.06066 作者: Ao Li,Chen Chen,Zhenyu Wang,Tao Huang,Fangfang Wu,Weisheng Dong 机构: Xidian University (西安电子科技大学); Dalian University of Technology (大连理工大学) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
[CV-153] Identity Card Presentation Attack Detection: A Systematic Review
链接: https://arxiv.org/abs/2511.06056 作者: Esteban M. Ruiz,Juan E. Tapia,Reinel T. Soto,Christoph Busch 机构: Hochschule Darmstadt (达姆施塔特应用技术大学); Universidad Autónoma de Manizales (曼萨莱斯自治大学); Universidad de Caldas (卡爾達斯大學) 类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
Abstract:Remote identity verification is essential for modern digital security; however, it remains highly vulnerable to sophisticated Presentation Attacks (PAs) that utilise forged or manipulated identity documents. Although Deep Learning (DL) has driven advances in Presentation Attack Detection (PAD), the field is fundamentally limited by a lack of data and the poor generalisation of models across various document types and new attack methods. This article presents a systematic literature review (SLR) conducted in accordance with the PRISMA methodology, aiming to analyse and synthesise the current state of AI-based PAD for identity documents from 2020 to 2025 comprehensively. Our analysis reveals a significant methodological evolution: a transition from standard Convolutional Neural Networks (CNNs) to specialised forensic micro-artefact analysis, and more recently, the adoption of large-scale Foundation Models (FMs), marking a substantial shift in the field. We identify a central paradox that hinders progress: a critical "Reality Gap" exists between models validated on extensive, private datasets and those assessed using limited public datasets, which typically consist of mock-ups or synthetic data. This gap limits the reproducibility of research results. Additionally, we highlight a "Synthetic Utility Gap," where synthetic data generation, the primary academic response to data scarcity, often fails to predict forensic utility. This can lead to model overfitting to generation artefacts instead of the actual attack. This review consolidates our findings, identifies critical research gaps, and provides a definitive reference framework that outlines a prescriptive roadmap for future research aimed at developing secure, robust, and globally generalizable PAD systems.
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV). Cite as: arXiv:2511.06056 [cs.CR], https://doi.org/10.48550/arXiv.2511.06056. Submission history: from Juan Tapia, [v1] Sat, 8 Nov 2025 15:55:37 UTC (2,723 KB).
zh
[CV-154] Neodragon: Mobile Video Generation using Diffusion Transformer
[CV-155] StreamSTGS: Streaming Spatial and Temporal Gaussian Grids for Real-Time Free-Viewpoint Video WWW AAAI2026
链接: https://arxiv.org/abs/2511.06046 作者: Zhihui Ke,Yuyang Liu,Xiaobo Zhou,Tie Qiu 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: Accepted by AAAI 2026. Code will be released at this https URL
点击查看摘要
[CV-156] S2ML: Spatio-Spectral Mutual Learning for Depth Completion
链接: https://arxiv.org/abs/2511.06033 作者: Zihui Zhao,Yifei Zhang,Zheng Wang,Yang Li,Kui Jiang,Zihan Geng,Chia-Wen Lin 机构: Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); Wuhan University (武汉大学); Harbin Institute of Technology (哈尔滨工业大学); National Tsing Hua University (国立清华大学) 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) 备注:
点击查看摘要
[CV-157] Towards Implicit Aggregation: Robust Image Representation for Place Recognition in the Transformer Era NEURIPS2025
【速读】: This paper addresses the reliance of traditional visual place recognition (VPR) methods on an explicit aggregation module: mainstream approaches (e.g., NetVLAD) follow a "backbone plus explicit aggregator" paradigm, first extracting patch features and then producing a global descriptor through an extra aggregation layer. The authors argue that, for transformer-based models, no dedicated aggregation module is needed; the backbone alone can produce robust global descriptors. The key is to introduce learnable aggregation tokens, prepended to the patch-token sequence before a particular transformer block, and to exploit the transformer's intrinsic self-attention for implicit aggregation: all tokens interact globally under multi-head self-attention, implicitly funneling useful information from the patch tokens into the aggregation tokens. The final global representation takes only the aggregation tokens from the last layer's output and concatenates them. This design markedly simplifies the architecture, outperforms state-of-the-art methods on several VPR datasets, and is more efficient.
链接: https://arxiv.org/abs/2511.06024 作者: Feng Lu,Tong Jin,Canming Ye,Yunpeng Liu,Xiangyuan Lan,Chun Yuan 机构: Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); Pengcheng Laboratory (鹏城实验室); Shenyang Institute of Automation, Chinese Academy of Sciences (中国科学院沈阳自动化研究所); University of Chinese Academy of Sciences (中国科学院大学); Pazhou Laboratory (Huangpu) (琶洲实验室(黄埔)) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: Accepted by NeurIPS 2025
点击查看摘要
Abstract:Visual place recognition (VPR) is typically regarded as a specific image retrieval task, whose core lies in representing images as global descriptors. Over the past decade, dominant VPR methods (e.g., NetVLAD) have followed a paradigm that first extracts the patch features/tokens of the input image using a backbone, and then aggregates these patch features into a global descriptor via an aggregator. This backbone-plus-aggregator paradigm has achieved overwhelming dominance in the CNN era and remains widely used in transformer-based models. In this paper, however, we argue that a dedicated aggregator is not necessary in the transformer era, that is, we can obtain robust global descriptors only with the backbone. Specifically, we introduce some learnable aggregation tokens, which are prepended to the patch tokens before a particular transformer block. All these tokens will be jointly processed and interact globally via the intrinsic self-attention mechanism, implicitly aggregating useful information within the patch tokens to the aggregation tokens. Finally, we only take these aggregation tokens from the last output tokens and concatenate them as the global representation. Although implicit aggregation can provide robust global descriptors in an extremely simple manner, where and how to insert additional tokens, as well as the initialization of tokens, remains an open issue worthy of further exploration. To this end, we also propose the optimal token insertion strategy and token initialization method derived from empirical studies. Experimental results show that our method outperforms state-of-the-art methods on several VPR datasets with higher efficiency and ranks 1st on the MSLS challenge leaderboard. The code is available at this https URL.
zh
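A toy version of the implicit aggregation idea: prepend a few aggregation tokens to the patch tokens, let plain self-attention mix all tokens globally, then keep only the aggregation tokens, concatenated, as the global descriptor. The unprojected single-head attention and the token counts below are simplifications for illustration, not the paper's architecture:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(tokens):
    # toy single-head attention; a real block uses learned Q/K/V projections
    logits = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    return softmax(logits) @ tokens

def implicit_aggregate(patch_tokens, agg_tokens):
    """Prepend aggregation tokens, let self-attention mix globally,
    then concatenate only the aggregation tokens as the descriptor."""
    x = np.concatenate([agg_tokens, patch_tokens], axis=0)
    out = self_attention(x)            # stand-in for the final transformer blocks
    k = len(agg_tokens)
    return out[:k].reshape(-1)         # global descriptor, no extra aggregator

patches = np.random.randn(196, 32)     # e.g. 14x14 patch tokens
agg = np.random.randn(4, 32)           # learnable in the real model
desc = implicit_aggregate(patches, agg)
print(desc.shape)  # (128,)
```

Because the aggregation tokens attend to every patch token, they accumulate scene-level evidence without any NetVLAD-style pooling layer on top.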
[CV-158] MiVID: Multi-Strategic Self-Supervision for Video Frame Interpolation using Diffusion Model
Abstract:Video Frame Interpolation (VFI) remains a cornerstone in video enhancement, enabling temporal upscaling for tasks like slow-motion rendering, frame rate conversion, and video restoration. While classical methods rely on optical flow and learning-based models assume access to dense ground-truth, both struggle with occlusions, domain shifts, and ambiguous motion. This article introduces MiVID, a lightweight, self-supervised, diffusion-based framework for video interpolation. Our model eliminates the need for explicit motion estimation by combining a 3D U-Net backbone with transformer-style temporal attention, trained under a hybrid masking regime that simulates occlusions and motion uncertainty. The use of cosine-based progressive masking and adaptive loss scheduling allows our network to learn robust spatiotemporal representations without any high-frame-rate supervision. Our framework is evaluated on the UCF101-7 and DAVIS-7 datasets. MiVID is trained entirely on CPU using these datasets and 9-frame video segments, making it a low-resource yet highly effective pipeline. Despite these constraints, our model achieves optimal results at just 50 epochs, competitive with several supervised methods. This work demonstrates the power of self-supervised diffusion priors for temporally coherent frame synthesis and provides a scalable path toward accessible and generalizable VFI systems.
zh
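The cosine-based progressive masking above can be sketched as a schedule that ramps the mask ratio from easy to hard along a half-cosine over training; the endpoint ratios below are assumed, as the abstract does not state them:

```python
import math

def cosine_mask_ratio(epoch, total_epochs, r_min=0.1, r_max=0.6):
    """Mask ratio ramps from r_min to r_max along a half-cosine curve.
    (r_min / r_max are illustrative values, not taken from the paper.)"""
    t = min(epoch / max(total_epochs - 1, 1), 1.0)
    return r_min + (r_max - r_min) * (1.0 - math.cos(math.pi * t)) / 2.0

ratios = [cosine_mask_ratio(e, 50) for e in range(50)]
print(round(ratios[0], 3), round(ratios[-1], 3))  # 0.1 0.6
```

The slow start keeps early training stable with light masking, while the cosine easing pushes the network toward heavier occlusion simulation by the final epochs.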
[CV-159] One-Shot Knowledge Transfer for Scalable Person Re-Identification ICCV2025
链接: https://arxiv.org/abs/2511.06016 作者: Longhua Li,Lei Qi,Xin Geng 机构: School of Computer Science and Engineering, Southeast University, Nanjing, China; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) 备注: Accepted by ICCV 2025
点击查看摘要
[CV-160] Distributed Deep Learning for Medical Image Denoising with Data Obfuscation
链接: https://arxiv.org/abs/2511.06006 作者: Sulaimon Oyeniyi Adebayo,Ayaz H. Khan 机构: King Fahd University of Petroleum and Minerals (国王法赫德石油矿产大学); SDAIA-KFUPM Joint Research Center for Artificial Intelligence (沙特数据与人工智能局-国王法赫德石油矿产大学联合人工智能研究中心) 类目: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC) 备注:
点击查看摘要
[CV-161] How Reasoning Influences Intersectional Biases in Vision Language Models
链接: https://arxiv.org/abs/2511.05996 作者: Xianhui Meng,Yukang Huo,Li Zhang,Liu Liu,Haonan Jiang,Yan Zhong,Pingrui Zhang,Cewu Lu,Jun Liu 机构: 1. University of Science and Technology of China (中国科学技术大学); 2. Tsinghua University (清华大学); 3. Peking University (北京大学); 4. Chinese Academy of Sciences (中国科学院); 5. National University of Singapore (新加坡国立大学); 6. University of California, Berkeley (加州大学伯克利分校); 7. University of Oxford (牛津大学); 8. Stanford University (斯坦福大学); 9. Microsoft Research (微软研究院) 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) 备注:
点击查看摘要
Abstract:Articulated objects are prevalent in daily life and robotic manipulation tasks. However, compared to rigid objects, pose tracking for articulated objects remains an underexplored problem due to their inherent kinematic constraints. To address these challenges, this work proposes a novel point-pair-based pose tracking framework, termed \textbfPPF-Tracker. The proposed framework first performs quasi-canonicalization of point clouds in the SE(3) Lie group space, and then models articulated objects using Point Pair Features (PPF) to predict pose voting parameters by leveraging the invariance properties of SE(3). Finally, semantic information of joint axes is incorporated to impose unified kinematic constraints across all parts of the articulated object. PPF-Tracker is systematically evaluated on both synthetic datasets and real-world scenarios, demonstrating strong generalization across diverse and challenging environments. Experimental results highlight the effectiveness and robustness of PPF-Tracker in multi-frame pose tracking of articulated objects. We believe this work can foster advances in robotics, embodied intelligence, and augmented reality. Codes are available at this https URL.
zh
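The Point Pair Features that PPF-Tracker builds on follow the classic 4D descriptor of Drost et al.: the distance between two oriented points plus three angles, a quantity invariant to rigid motion, which is what pose voting relies on. A minimal computation:

```python
import numpy as np

def angle(a, b):
    """Unsigned angle between two 3D vectors, in radians."""
    cosv = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(cosv, -1.0, 1.0))

def point_pair_feature(p1, n1, p2, n2):
    """Classic 4D PPF: (|d|, angle(n1,d), angle(n2,d), angle(n1,n2))."""
    d = p2 - p1
    return np.array([np.linalg.norm(d), angle(n1, d), angle(n2, d), angle(n1, n2)])

f = point_pair_feature(np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]),
                       np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]))
print(f)  # distance 1.0, both normal-to-d angles pi/2, parallel normals give 0
```

In a voting pipeline these quantized features index a hash table of model pairs; the paper's contribution layers SE(3) quasi-canonicalization and joint-axis constraints on top of this primitive.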
[CV-164] A Dual-Mode ViT-Conditioned Diffusion Framework with an Adaptive Conditioning Bridge for Breast Cancer Segmentation
Abstract:In breast ultrasound images, precise lesion segmentation is essential for early diagnosis; however, low contrast, speckle noise, and unclear boundaries make this difficult. Even though deep learning models have demonstrated potential, standard convolutional architectures frequently fall short in capturing enough global context, resulting in segmentations that are anatomically inconsistent. To overcome these drawbacks, we suggest a flexible, conditional Denoising Diffusion Model that combines an enhanced UNet-based generative decoder with a Vision Transformer (ViT) encoder for global feature extraction. We introduce three primary innovations: 1) an Adaptive Conditioning Bridge (ACB) for efficient, multi-scale fusion of semantic features; 2) a novel Topological Denoising Consistency (TDC) loss component that regularizes training by penalizing structural inconsistencies during denoising; and 3) a dual-head architecture, pairing a noise prediction head with a lightweight auxiliary head, that leverages the denoising objective as a powerful regularizer and enables rapid and accurate inference on smaller datasets. Our framework establishes a new state-of-the-art on public breast ultrasound datasets, achieving Dice scores of 0.96 on BUSI, 0.90 on BrEaST and 0.97 on BUS-UCLM. Comprehensive ablation studies empirically validate that the model components are critical for achieving these results and for producing segmentations that are not only accurate but also anatomically plausible.
zh
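The Dice scores reported above are computed as twice the mask overlap divided by the total mask area; a minimal reference implementation for binary masks:

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    """Dice coefficient for binary masks: 2|A intersect B| / (|A| + |B|)."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

a = np.zeros((8, 8), dtype=int); a[2:6, 2:6] = 1   # toy ground-truth lesion
b = np.zeros((8, 8), dtype=int); b[3:7, 3:7] = 1   # toy shifted prediction
print(round(dice_score(a, a), 3))  # 1.0
```

Dice is preferred over plain accuracy in lesion segmentation because lesions occupy a small fraction of the image, so background-heavy metrics would saturate.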
[CV-165] Runtime Safety Monitoring of Deep Neural Networks for Perception: A Survey
链接: https://arxiv.org/abs/2511.05982 作者: Albert Schotschneider,Svetlana Pavlitska,J. Marius Zöllner 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO) 备注: 6 pages, 1 figure, 2 tables, accepted at IEEE SMC 2025 in Vienna, presented on 8th October 2025
点击查看摘要
Abstract:Deep neural networks (DNNs) are widely used in perception systems for safety-critical applications, such as autonomous driving and robotics. However, DNNs remain vulnerable to various safety concerns, including generalization errors, out-of-distribution (OOD) inputs, and adversarial attacks, which can lead to hazardous failures. This survey provides a comprehensive overview of runtime safety monitoring approaches, which operate in parallel to DNNs during inference to detect these safety concerns without modifying the DNN itself. We categorize existing methods into three main groups: Monitoring inputs, internal representations, and outputs. We analyze the state-of-the-art for each category, identify strengths and limitations, and map methods to the safety concerns they address. In addition, we highlight open challenges and future research directions.
zh
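As a concrete instance of the output-monitoring category surveyed above, the max-softmax-probability baseline flags predictions whose top-class confidence is low, without modifying the monitored DNN; the 0.7 threshold here is an arbitrary illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def output_monitor(logits, threshold=0.7):
    """Max-softmax-probability check, a classic output-based runtime monitor:
    flag the prediction when the network's top-class confidence is low."""
    confidence = softmax(np.asarray(logits, dtype=float)).max()
    return bool(confidence < threshold)   # True -> raise a safety flag

print(output_monitor([0.1, 0.2, 0.15]))   # True  (near-uniform, low confidence)
print(output_monitor([8.0, 0.1, -3.0]))   # False (confident prediction)
```

Input and internal-representation monitors in the survey's other two categories work analogously, but threshold statistics of the inputs or of hidden activations instead of the softmax output.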
[CV-166] DiA-gnostic VLVAE: Disentangled Alignment-Constrained Vision Language Variational AutoEncoder for Robust Radiology Reporting with Missing Modalities AAAI AAAI-26
链接: https://arxiv.org/abs/2511.05968 作者: Nagur Shareef Shaik,Teja Krishna Cherukuri,Adnan Masood,Dong Hye Ye 机构: 1: Korea University of Technology and Education (韩国技术教育大学); 2: Samsung Electronics (三星电子) 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) 备注: Accepted for Oral Presentation at the 40th AAAI Conference on Artificial Intelligence (AAAI-26), Main Technical Track
点击查看摘要
[CV-167] Adapted Foundation Models for Breast MRI Triaging in Contrast-Enhanced and Non-Contrast Enhanced Protocols
Link: https://arxiv.org/abs/2511.05967 Authors: Tri-Thien Nguyen, Lorenz A. Kapsner, Tobias Hepp, Shirin Heidarikahkesh, Hannes Schreiter, Luise Brock, Dominika Skwierawska, Dominique Hadler, Julian Hossbach, Evelyn Wenkel, Sabine Ohlmeyer, Frederik B. Laun, Andrzej Liebert, Andreas Maier, Michael Uder, Sebastian Bickelhaupt Affiliations: unknown Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Comments: 23 pages, 6 figures, 4 tables. Originally submitted to Radiology (RAD-25-2541); under consideration for transfer to Radiology: Artificial Intelligence (RSNA Portfolio Journal)
Click to view abstract
Abstract:Background: Magnetic resonance imaging (MRI) has high sensitivity for breast cancer detection, but interpretation is time-consuming. Artificial intelligence may aid in pre-screening. Purpose: To evaluate the DINOv2-based Medical Slice Transformer (MST) for ruling out significant findings (Breast Imaging Reporting and Data System [BI-RADS] ≥4) in contrast-enhanced and non-contrast-enhanced abbreviated breast MRI. Materials and Methods: This institutional review board approved retrospective study included 1,847 single-breast MRI examinations (377 BI-RADS ≥4) from an in-house dataset and 924 from an external validation dataset (Duke). Four abbreviated protocols were tested: T1-weighted early subtraction (T1sub), diffusion-weighted imaging with b=1500 s/mm² (DWI1500), DWI1500+T2-weighted (T2w), and T1sub+T2w. Performance was assessed at 90%, 95%, and 97.5% sensitivity using five-fold cross-validation and area under the receiver operating characteristic curve (AUC) analysis. AUC differences were compared with the DeLong test. False negatives were characterized, and attention maps of true positives were rated in the external dataset. Results: A total of 1,448 female patients (mean age, 49 +/- 12 years) were included. T1sub+T2w achieved an AUC of 0.77 +/- 0.04; DWI1500+T2w, 0.74 +/- 0.04 (p=0.15). At 97.5% sensitivity, T1sub+T2w had the highest specificity (19% +/- 7%), followed by DWI1500+T2w (17% +/- 11%). Missed lesions had a mean diameter of 10 mm at the 95% and 97.5% thresholds for both T1sub and DWI1500, and were predominantly non-mass enhancements. External validation yielded an AUC of 0.77, with 88% of attention maps rated good or moderate. Conclusion: At 97.5% sensitivity, the MST framework correctly triaged cases without BI-RADS ≥4 findings, achieving 19% specificity for contrast-enhanced and 17% for non-contrast-enhanced MRI. Further research is warranted before clinical implementation.
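The triage protocol above fixes a sensitivity operating point and then reports the achievable specificity; a minimal sketch of that thresholding step (toy suspicion scores, not the study's data):

```python
import math

def specificity_at_sensitivity(scores_pos, scores_neg, target_sens):
    """Choose the decision threshold that keeps at least target_sens of
    positive (e.g., BI-RADS >= 4) cases above it, then report the
    fraction of negative cases that can be safely triaged out."""
    k = max(1, math.ceil(target_sens * len(scores_pos)))
    thr = sorted(scores_pos, reverse=True)[k - 1]   # k-th highest positive score
    sensitivity = sum(s >= thr for s in scores_pos) / len(scores_pos)
    specificity = sum(s < thr for s in scores_neg) / len(scores_neg)
    return thr, sensitivity, specificity

pos = [0.9, 0.8, 0.7, 0.4]          # exams with significant findings
neg = [0.6, 0.3, 0.2, 0.1, 0.05]    # exams without
thr, sens, spec = specificity_at_sensitivity(pos, neg, 0.95)
```

Raising the target sensitivity pushes the threshold down and, as in the study's 97.5% setting, trades away specificity.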
[CV-168] Commonality in Few: Few-Shot Multimodal Anomaly Detection via Hypergraph-Enhanced Memory AAAI2026
Link: https://arxiv.org/abs/2511.05965 Authors: Zhixin Cheng, Xiaotian Yin, Jiacheng Deng, Bohao Liao, Yujia Chen, Xu Zhou, Baoqun Yin, Tianzhu Zhang Affiliations: 1. University of Science and Technology of China; 2. Institute of Artificial Intelligence, University of Science and Technology of China; 3. Alibaba Group; 4. Tsinghua University Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Comments: Accepted by AAAI2026
Click to view abstract
Abstract:Typical detection-free methods for image-to-point cloud registration leverage transformer-based architectures to aggregate cross-modal features and establish correspondences. However, they often struggle under challenging conditions, where noise disrupts similarity computation and leads to incorrect correspondences. Moreover, without dedicated designs, it remains difficult to effectively select informative and correlated representations across modalities, thereby limiting the robustness and accuracy of registration. To address these challenges, we propose a novel cross-modal registration framework composed of two key modules: the Iterative Agents Selection (IAS) module and the Reliable Agents Interaction (RAI) module. IAS enhances structural feature awareness with phase maps and employs reinforcement learning principles to efficiently select reliable agents. RAI then leverages these selected agents to guide cross-modal interactions, effectively reducing mismatches and improving overall robustness. Extensive experiments on the RGB-D Scenes v2 and 7-Scenes benchmarks demonstrate that our method consistently achieves state-of-the-art performance.
[CV-170] CSGaze: Context-aware Social Gaze Prediction
Link: https://arxiv.org/abs/2511.05955 Authors: Surbhi Madan, Shreya Ghosh, Ramanathan Subramanian, Abhinav Dhall, Tom Gedeon Affiliations: IIT Ropar; National Institute of Informatics Japan; The University of Queensland Australia; University of Canberra Australia; Monash University Australia; Curtin University Australia Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) Comments:
Click to view abstract
Abstract:A person’s gaze offers valuable insights into their focus of attention, level of social engagement, and confidence. In this work, we investigate how contextual cues combined with visual scene and facial information can be effectively utilized to predict and interpret social gaze patterns during conversational interactions. We introduce CSGaze, a context-aware multimodal approach that leverages facial and scene information as complementary inputs to enhance social gaze pattern prediction from multi-person images. The model also incorporates a fine-grained attention mechanism centered on the principal speaker, which helps in better modeling social gaze dynamics. Experimental results show that CSGaze performs competitively with state-of-the-art methods on GP-Static, UCO-LAEO and AVA-LAEO. Our findings highlight the role of contextual cues in improving social gaze prediction. Additionally, we provide initial explainability through generated attention scores, offering insights into the model’s decision-making process. We also demonstrate our model’s generalizability by testing it on open-set datasets, demonstrating its robustness across diverse scenarios.
[CV-171] Pinching Visuo-haptic Display: Investigating Cross-Modal Effects of Visual Textures on Electrostatic Cloth Tactile Sensations
Link: https://arxiv.org/abs/2511.05952 Authors: Takekazu Kitagishi, Chun-Wei Ooi, Yuichi Hiroi, Jun Rekimoto Affiliations: The University of Tokyo; ZOZO Research; Cluster Metaverse Lab; Sony CSL Kyoto Subjects: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM) Comments: 10 pages, 8 figures, 3 tables. Presented at ACM International Conference on Multimodal Interaction (ICMI) 2025
Click to view abstract
Abstract:This paper investigates how visual texture presentation influences tactile perception when interacting with electrostatic cloth displays. We propose a visuo-haptic system that allows users to pinch and rub virtual fabrics while feeling realistic frictional sensations modulated by electrostatic actuation. Through a user study, we examined the cross-modal effects between visual roughness and perceived tactile friction. The results demonstrate that visually rough textures amplify the perceived frictional force, even under identical electrostatic stimuli. These findings contribute to the understanding of multimodal texture perception and provide design insights for haptic feedback in virtual material interfaces.
[CV-172] U(PM)^2: Unsupervised polygon matching with pre-trained models for challenging stereo images
Link: https://arxiv.org/abs/2511.05949 Authors: Chang Li, Xingtao Peng Affiliations: Central China Normal University Subjects: Computer Vision and Pattern Recognition (cs.CV) Comments:
Click to view abstract
Abstract:Stereo image matching is a fundamental task in computer vision, photogrammetry and remote sensing, but there is an almost unexplored field, i.e., polygon matching, which faces the following challenges: disparity discontinuity, scale variation, training requirement, and generalization. To address the above-mentioned issues, this paper proposes a novel U(PM)^2: low-cost unsupervised polygon matching with pre-trained models by uniting automatically learned and handcrafted features, whose pipeline is as follows: firstly, the detector leverages the pre-trained segment anything model to obtain masks; then, the vectorizer converts the masks to polygons and graphic structure; secondly, the global matcher addresses challenges from global viewpoint changes and scale variation based on a bidirectional-pyramid strategy with pre-trained LoFTR; finally, the local matcher further overcomes local disparity discontinuity and topology inconsistency of polygon matching by a local-joint geometry and multi-feature matching strategy with the Hungarian algorithm. We benchmark our U(PM)^2 on the ScanNet and SceneFlow datasets using our proposed new metric, which achieves state-of-the-art accuracy at a competitive speed and satisfactory generalization performance at low cost without any training requirement.
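The local matcher's final assignment step uses the Hungarian algorithm; the sketch below solves the same one-to-one assignment by brute force on a toy example, with centroid distance as an illustrative stand-in for the paper's local-joint geometry and multi-feature cost:

```python
import math
from itertools import permutations

def centroid(poly):
    xs, ys = zip(*poly)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def match_polygons(left, right):
    """Globally optimal one-to-one polygon assignment minimising total
    centroid distance. Brute-force enumeration stands in for the
    Hungarian algorithm on small instances (same optimum, worse
    complexity)."""
    cl = [centroid(p) for p in left]
    cr = [centroid(p) for p in right]
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(right))):
        cost = sum(math.hypot(cl[i][0] - cr[j][0], cl[i][1] - cr[j][1])
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return [(i, j) for i, j in enumerate(best)]

# a triangle and a square in the left view; same shapes, shuffled and
# shifted by ~1 px of "disparity" in the right view
left  = [[(0, 0), (2, 0), (1, 2)], [(9, 9), (11, 9), (11, 11), (9, 11)]]
right = [[(10, 9), (12, 9), (12, 11), (10, 11)], [(1, 0), (3, 0), (2, 2)]]
pairs = match_polygons(left, right)   # left 0 -> right 1, left 1 -> right 0
```

A production implementation would use the O(n^3) Hungarian algorithm rather than enumerating permutations.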
[CV-173] Reperio-rPPG: Relational Temporal Graph Neural Networks for Periodicity Learning in Remote Physiological Measurement
Abstract:Remote photoplethysmography (rPPG) is an emerging contactless physiological sensing technique that leverages subtle color variations in facial videos to estimate vital signs such as heart rate and respiratory rate. This non-invasive method has gained traction across diverse domains, including telemedicine, affective computing, driver fatigue detection, and health monitoring, owing to its scalability and convenience. Despite significant progress in remote physiological signal measurement, a crucial characteristic - the intrinsic periodicity - has often been underexplored or insufficiently modeled in previous approaches, limiting their ability to capture fine-grained temporal dynamics under real-world conditions. To bridge this gap, we propose Reperio-rPPG, a novel framework that strategically integrates Relational Convolutional Networks with a Graph Transformer to effectively capture the periodic structure inherent in physiological signals. Additionally, recognizing the limited diversity of existing rPPG datasets, we further introduce a tailored CutMix augmentation to enhance the model’s generalizability. Extensive experiments conducted on three widely used benchmark datasets - PURE, UBFC-rPPG, and MMPD - demonstrate that Reperio-rPPG not only achieves state-of-the-art performance but also exhibits remarkable robustness under various motion (e.g., stationary, rotation, talking, walking) and illumination conditions (e.g., nature, low LED, high LED). The code is publicly available at this https URL.
[CV-174] Polymap: generating high definition map based on rasterized polygons
Link: https://arxiv.org/abs/2511.05944 Authors: Shiyu Gao, Hao Jiang Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; Qcraft Inc. Subjects: Computer Vision and Pattern Recognition (cs.CV) Comments:
Click to view abstract
Abstract:The perception of high-definition maps is an integral component of environmental perception in autonomous driving systems. Existing research has often focused on online construction of high-definition maps. For instance, the MapTR [9] series employs a detection-based method to output vectorized map instances in parallel in an end-to-end manner. However, despite their capability for real-time construction, detection-based methods are observed to lack robust generalizability [19], which hampers their applicability in auto-labeling systems. Therefore, aiming to improve generalizability, we reinterpret road elements as rasterized polygons and design a concise framework based on instance segmentation. Initially, a segmentation-based transformer is employed to deliver instance masks in an end-to-end manner; succeeding this step, a Potrace-based [17] post-processing module is used to ultimately yield vectorized map elements. Quantitative results attained on the nuScenes [1] dataset substantiate the effectiveness and generalizability of our method.
[CV-175] Global Multiple Extraction Network for Low-Resolution Facial Expression Recognition
[CV-176] Interaction-Centric Knowledge Infusion and Transfer for Open-Vocabulary Scene Graph Generation NEURIPS2025
Quick Read: This paper addresses two key problems in open-vocabulary scene graph generation (OVSGG) caused by the lack of explicit interaction modeling: during knowledge infusion, the model struggles to distinguish interacting from non-interacting instances of the same category, producing noisy pseudo-supervision signals; during knowledge transfer, query matching is ambiguous, leading to inaccurate relation reasoning. The crux of the solution is an interaction-centric end-to-end framework, ACC, whose key innovations are: 1) a bidirectional interaction prompt that strengthens the robustness of pseudo-supervision signals and improves the model's understanding of interaction relations; 2) interaction-guided query selection that prioritizes matching potentially interacting objects to reduce interference, combined with a consistency-based knowledge distillation strategy that pushes relational foreground away from the background while retaining general knowledge, thereby significantly improving the accuracy and generalization of OVSGG.
Link: https://arxiv.org/abs/2511.05935 Authors: Lin Li, Chuhan Zhang, Dong Zhang, Chong Sun, Chen Li, Long Chen Affiliations: HKUST; AI Chip Center for Emerging Smart Systems; Tencent Subjects: Computer Vision and Pattern Recognition (cs.CV) Comments: Accepted by NeurIPS 2025
Click to view abstract
Abstract:Open-vocabulary scene graph generation (OVSGG) extends traditional SGG by recognizing novel objects and relationships beyond predefined categories, leveraging the knowledge from pre-trained large-scale models. Existing OVSGG methods always adopt a two-stage pipeline: 1) infusing knowledge into large-scale models via pre-training on large datasets; 2) transferring knowledge from pre-trained models with fully annotated scene graphs during supervised fine-tuning. However, due to a lack of explicit interaction modeling, these methods struggle to distinguish between interacting and non-interacting instances of the same object category. This limitation induces critical issues in both stages of OVSGG: it generates noisy pseudo-supervision from mismatched objects during knowledge infusion, and causes ambiguous query matching during knowledge transfer. To this end, in this paper, we propose an interACtion-Centric end-to-end OVSGG framework (ACC) in an interaction-driven paradigm to minimize these mismatches. For interaction-centric knowledge infusion, ACC employs a bidirectional interaction prompt for robust pseudo-supervision generation to enhance the model’s interaction knowledge. For interaction-centric knowledge transfer, ACC first adopts interaction-guided query selection that prioritizes pairing interacting objects to reduce interference from non-interacting ones. Then, it integrates interaction-consistent knowledge distillation to bolster robustness by pushing relational foreground away from the background while retaining general knowledge. Extensive experiments on three benchmarks show that ACC achieves state-of-the-art performance, demonstrating the potential of interaction-centric paradigms for real-world applications.
[CV-177] AD-DAE: Unsupervised Modeling of Longitudinal Alzheimer's Disease Progression with Diffusion Auto-Encoder
Link: https://arxiv.org/abs/2511.05934 Authors: Ayantika Das, Arunima Sarkar, Keerthi Ram, Mohanasankar Sivaprakasam Affiliations: unknown Subjects: Computer Vision and Pattern Recognition (cs.CV) Comments: Under Review
Click to view abstract
Abstract:Generative modeling frameworks have emerged as an effective approach to capture high-dimensional image distributions from large datasets without requiring domain-specific knowledge, a capability essential for longitudinal disease progression modeling. Recent generative modeling approaches have attempted to capture progression by mapping images into a latent representational space and then controlling and guiding the representations to generate follow-up images from a baseline image. However, existing approaches impose constraints on distribution learning, leading to latent spaces with limited controllability to generate follow-up images without explicit supervision from subject-specific longitudinal images. In order to enable controlled movements in the latent representational space and generate progression images from a baseline image in an unsupervised manner, we introduce a conditionable Diffusion Auto-encoder framework. The explicit encoding mechanism of image-diffusion auto-encoders forms a compact latent space capturing high-level semantics, providing means to disentangle information relevant for progression. Our approach leverages this latent space to condition and apply controlled shifts to baseline representations for generating follow-up. Controllability is induced by restricting these shifts to a subspace, thereby isolating progression-related factors from subject identity-preserving components. The shifts are implicitly guided by correlating with progression attributes, without requiring subject-specific longitudinal supervision. We validate the generations through image quality metrics, volumetric progression analysis, and downstream classification in Alzheimer’s disease datasets from two different sources and disease categories. This demonstrates the effectiveness of our approach for Alzheimer’s progression modeling and longitudinal image generation.
[CV-178] CoMA: Complementary Masking and Hierarchical Dynamic Multi-Window Self-Attention in a Unified Pre-training Framework
Abstract:Despite the effectiveness of quantization-aware training (QAT) in compressing deep neural networks, its performance on multi-task architectures often degrades significantly due to task-specific feature discrepancies and gradient conflicts. To address these challenges, we propose Gradient-Aware Balanced Feature Fusion (GABFusion), which dynamically balances gradient magnitudes and fuses task-specific features in a quantization-friendly manner. We further introduce Attention Distribution Alignment (ADA), a feature-level distillation strategy tailored for quantized models. Our method demonstrates strong generalization across network architectures and QAT algorithms, with theoretical guarantees on gradient bias reduction. Extensive experiments demonstrate that our strategy consistently enhances a variety of QAT methods across different network architectures and bit-widths. On PASCAL VOC and COCO datasets, the proposed approach achieves average mAP improvements of approximately 3.3% and 1.6%, respectively. When applied to YOLOv5 under 4-bit quantization, our method narrows the accuracy gap with the full-precision model to only 1.7% on VOC, showcasing its effectiveness in preserving performance under low-bit constraints. Notably, the proposed framework is modular, easy to integrate, and compatible with any existing QAT technique-enhancing the performance of quantized models without requiring modifications to the original network architecture.
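The gradient-magnitude balancing behind GABFusion can be sketched as follows; the equal-norm rescaling rule below is our illustrative assumption, not the paper's exact formulation:

```python
def balance_gradients(task_grads):
    """Rescale each task's gradient vector to the mean gradient norm
    before summing, so no single task dominates the shared update.
    (Illustrative stand-in for GABFusion's dynamic balancing.)"""
    norms = [sum(g * g for g in grad) ** 0.5 for grad in task_grads]
    target = sum(norms) / len(norms)
    scaled = [[g * target / n for g in grad]
              for grad, n in zip(task_grads, norms)]
    # sum the rescaled per-task gradients component-wise
    return [sum(parts) for parts in zip(*scaled)]

# one task's gradient dominates the other by 10x before balancing
fused = balance_gradients([[3.0, 4.0], [0.3, 0.4]])
```

After rescaling, both toy tasks contribute gradients of equal norm 2.75, so the fused update no longer mirrors the louder task alone.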
[CV-181] Open-World 3D Scene Graph Generation for Retrieval-Augmented Reasoning AAAI2026
Link: https://arxiv.org/abs/2511.05894 Authors: Fei Yu, Quan Deng, Shengeng Tang, Yuehua Li, Lechao Cheng Affiliations: Huazhong University of Science and Technology; Hefei University of Technology Subjects: Computer Vision and Pattern Recognition (cs.CV) Comments: Accepted by AAAI 2026
Click to view abstract
[CV-182] Hybrid second-order gradient histogram based global low-rank sparse regression for robust face recognition
Link: https://arxiv.org/abs/2511.05893 Authors: Hongxia Li, Ying Ji, Yongxin Dong, Yuehua Feng Affiliations: unknown Subjects: Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC) Comments:
Click to view abstract
Abstract:Low-rank sparse regression models have been widely applied in the field of face recognition. To further address the challenges caused by complex occlusions and illumination variations, this paper proposes a Hybrid Second-Order Gradient Histogram based Global Low-Rank Sparse Regression (H2H-GLRSR) model. Specifically, a novel feature descriptor called the Hybrid Second-Order Gradient Histogram (H2H) is first designed to more effectively characterize the local structural features of facial images. Then, this descriptor is integrated with the Sparse Regularized Nuclear Norm based Matrix Regression (SR_NMR). Moreover, a global low-rank constraint is imposed on the residual matrix, enabling the model to better capture the global correlations inherent in structured noise. Experimental results demonstrate that the proposed method significantly outperforms existing regression-based classification approaches under challenging scenarios involving occlusions, illumination changes, and unconstrained environments.
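Global low-rank constraints like the one imposed on the residual matrix are commonly optimised with singular value thresholding, the proximal operator of the nuclear norm; a generic sketch, not the paper's exact solver:

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: the proximal operator of
    tau * ||.||_* (nuclear norm). Shrinking each singular value by tau
    and clipping at zero drives the matrix toward low rank."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

R = np.diag([3.0, 1.0])    # toy residual with singular values 3 and 1
R_low = svt(R, 1.5)        # the smaller singular value is shrunk to 0
```

Inside an iterative solver, this step alternates with updates to the sparse regression coefficients.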
[CV-183] Towards Frequency-Adaptive Learning for SAR Despeckling
Link: https://arxiv.org/abs/2511.05890 Authors: Ziqing Ma, Chang Yang, Zhichang Guo, Yao Li Affiliations: Harbin Institute of Technology Subjects: Computer Vision and Pattern Recognition (cs.CV) Comments: 13 pages, 14 figures, 9 tables
Click to view abstract
Abstract:Synthetic Aperture Radar (SAR) images are inherently corrupted by speckle noise, limiting their utility in high-precision applications. While deep learning methods have shown promise in SAR despeckling, most methods employ a single unified network to process the entire image, failing to account for the distinct speckle statistics associated with different spatial physical characteristics. It often leads to artifacts, blurred edges, and texture distortion. To address these issues, we propose SAR-FAH, a frequency-adaptive heterogeneous despeckling model based on a divide-and-conquer architecture. First, wavelet decomposition is used to separate the image into frequency sub-bands carrying different intrinsic characteristics. Inspired by their differing noise characteristics, we design specialized sub-networks for different frequency components. The tailored approach leverages statistical variations across frequencies, improving edge and texture preservation while suppressing noise. Specifically, for the low-frequency part, denoising is formulated as a continuous dynamic system via neural ordinary differential equations, ensuring structural fidelity and sufficient smoothness that prevents artifacts. For high-frequency sub-bands rich in edges and textures, we introduce an enhanced U-Net with deformable convolutions for noise suppression and enhanced features. Extensive experiments on synthetic and real SAR images validate the superior performance of the proposed model in noise removal and structural preservation.
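The wavelet split that drives the divide-and-conquer design can be illustrated with a one-level 1-D Haar transform (the paper operates on 2-D SAR images with a full DWT; this simplified sketch only shows how low- and high-frequency sub-bands separate smooth structure from edges):

```python
def haar_split(signal):
    """One-level orthonormal Haar DWT: pairwise averages form the
    low-frequency band (smooth structure), pairwise differences the
    high-frequency band (edges/texture)."""
    r = 2 ** 0.5
    low = [(a + b) / r for a, b in zip(signal[::2], signal[1::2])]
    high = [(a - b) / r for a, b in zip(signal[::2], signal[1::2])]
    return low, high

def haar_merge(low, high):
    """Exact inverse of haar_split."""
    r = 2 ** 0.5
    out = []
    for l, h in zip(low, high):
        out += [(l + h) / r, (l - h) / r]
    return out

low, high = haar_split([4.0, 2.0, 5.0, 5.0])
# the flat pair (5, 5) carries no high-frequency energy: high[1] == 0
```

In the paper's setting, each sub-band would then be routed to its own specialized denoising sub-network before the inverse transform recombines them.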
[CV-184] MoEGCL: Mixture of Ego-Graphs Contrastive Representation Learning for Multi-View Clustering AAAI’2026
Link: https://arxiv.org/abs/2511.05876 Authors: Jian Zhu, Xin Zou, Jun Sun, Cheng Luo, Lei Liu, Lingfang Zeng, Ning Zhang, Bian Wu, Chang Tang, Lirong Dai Affiliations: unknown Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) Comments: AAAI'2026 oral paper
Click to view abstract
[CV-185] Towards a Humanized Social-Media Ecosystem: AI-Augmented HCI Design Patterns for Safety, Agency, Well-Being
Link: https://arxiv.org/abs/2511.05875 Authors: Mohd Ruhul Ameen, Akif Islam Affiliations: unknown Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV) Comments: 6 pages, 5 tables, 7 figures, and 2 algorithm tables. Accepted at International Conference on Signal Processing, Information, Communication and Systems (SPICSCON 2025)
Click to view abstract
Abstract:Social platforms connect billions of people, yet their engagement-first algorithms often work on users rather than with them, amplifying stress, misinformation, and a loss of control. We propose Human-Layer AI (HL-AI)–user-owned, explainable intermediaries that sit in the browser between platform logic and the interface. HL-AI gives people practical, moment-to-moment control without requiring platform cooperation. We contribute a working Chrome/Edge prototype implementing five representative pattern frameworks–Context-Aware Post Rewriter, Post Integrity Meter, Granular Feed Curator, Micro-Withdrawal Agent, and Recovery Mode–alongside a unifying mathematical formulation balancing user utility, autonomy costs, and risk thresholds. Evaluation spans technical accuracy, usability, and behavioral outcomes. The result is a suite of humane controls that help users rewrite before harm, read with integrity cues, tune feeds with intention, pause compulsive loops, and seek shelter during harassment, all while preserving agency through explanations and override options. This prototype offers a practical path to retrofit today’s feeds with safety, agency, and well-being, inviting rigorous cross-cultural user evaluation.
[CV-186] Light-Field Dataset for Disparity Based Depth Estimation
Quick Read: This paper addresses challenges in light field (LF) depth estimation arising from the trade-off between angular and spatial resolution, in particular the uncertain effect of the focal-plane position on disparity, as well as the limited realism and diversity of existing light field image datasets. The key to the solution is a publicly available, high-quality light field image dataset comprising 285 real light field images captured with a Lytro Illum LF camera and 13 synthetic images with similar disparity characteristics; a combined real and synthetic stereo light field subset is further generated with a mechanical gantry system and Blender, providing reliable data for designing, developing, implementing, and testing novel disparity-based light field depth estimation algorithms.
Link: https://arxiv.org/abs/2511.05866 Authors: Suresh Nehra, Aupendu Kar, Jayanta Mukhopadhyay, Prabir Kumar Biswas Affiliations: Indian Institute of Technology Kharagpur; Dolby Laboratories, Inc Subjects: Computer Vision and Pattern Recognition (cs.CV) Comments: This paper has been accepted to ACM ICVGIP 2025
Click to view abstract
Abstract:A Light Field (LF) camera consists of an additional two-dimensional array of micro-lenses placed between the main lens and sensor, compared to a conventional camera. The sensor pixels under each micro-lens receive light from a sub-aperture of the main lens. This enables the image sensor to capture both spatial information and the angular resolution of a scene point. This additional angular information is used to estimate the depth of a 3-D scene. The continuum of virtual viewpoints in light field data enables efficient depth estimation using Epipolar Line Images (EPIs) with robust occlusion handling. However, the trade-off between angular information and spatial information is very critical and depends on the focal position of the camera. To design, develop, implement, and test novel disparity-based light field depth estimation algorithms, the availability of suitable light field image datasets is essential. In this paper, a publicly available light field image dataset is introduced and thoroughly described. We have also demonstrated the effect of focal position on the disparity of a 3-D point as well as the shortcomings of the currently available light field dataset. The proposed dataset contains 285 light field images captured using a Lytro Illum LF camera and 13 synthetic LF images. The proposed dataset also comprises a synthetic dataset with similar disparity characteristics to those of a real light field camera. A real and synthetic stereo light field dataset is also created by using a mechanical gantry system and Blender. The dataset is available at this https URL.
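The disparity-depth relation underlying the dataset's design follows standard stereo triangulation, Z = f·B/d; a minimal sketch with purely illustrative numbers:

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Standard triangulation Z = f * B / d. In a light-field camera the
    baseline B between sub-aperture views is tiny, so disparities are
    often sub-pixel and shift with the chosen focal plane."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

z = depth_from_disparity(focal_px=1000.0, baseline_m=0.01, disparity_px=2.0)
# halving the disparity doubles the estimated depth
```

This inverse relation is why small disparity errors near the focal plane translate into large depth errors, motivating careful choice of the focal position.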
[CV-187] CGCE: Classifier-Guided Concept Erasure in Generative Models
Link: https://arxiv.org/abs/2511.05865 Authors: Viet Nguyen, Vishal M. Patel Affiliations: Johns Hopkins University Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR) Comments: 24 pages, 15 figures
Click to view abstract
Abstract:Recent advancements in large-scale generative models have enabled the creation of high-quality images and videos, but have also raised significant safety concerns regarding the generation of unsafe content. To mitigate this, concept erasure methods have been developed to remove undesirable concepts from pre-trained models. However, existing methods remain vulnerable to adversarial attacks that can regenerate the erased content. Moreover, achieving robust erasure often degrades the model’s generative quality for safe, unrelated concepts, creating a difficult trade-off between safety and performance. To address this challenge, we introduce Classifier-Guided Concept Erasure (CGCE), an efficient plug-and-play framework that provides robust concept erasure for diverse generative models without altering their original weights. CGCE uses a lightweight classifier operating on text embeddings to first detect and then refine prompts containing undesired concepts. This approach is highly scalable, allowing for multi-concept erasure by aggregating guidance from several classifiers. By modifying only unsafe embeddings at inference time, our method prevents harmful content generation while preserving the model’s original quality on benign prompts. Extensive experiments show that CGCE achieves state-of-the-art robustness against a wide range of red-teaming attacks. Our approach also maintains high generative utility, demonstrating a superior balance between safety and performance. We showcase the versatility of CGCE through its successful application to various modern T2I and T2V models, establishing it as a practical and effective solution for safe generative AI.
[CV-188] Point Cloud Segmentation of Integrated Circuits Package Substrates Surface Defects Using Causal Inference: Dataset Construction and Methodology
Abstract:The effective segmentation of 3D data is crucial for a wide range of industrial applications, especially for detecting subtle defects in the field of integrated circuits (IC). Ceramic package substrates (CPS), as an important electronic material, are essential in IC packaging owing to their superior physical and chemical properties. However, the complex structure and minor defects of CPS, along with the absence of a publicly available dataset, significantly hinder the development of CPS surface defect detection. In this study, we construct a high-quality point cloud dataset for 3D segmentation of surface defects in CPS, i.e., CPS3D-Seg, which has the best point resolution and precision compared to existing 3D industrial datasets. CPS3D-Seg consists of 1300 point cloud samples under 20 product categories, and each sample provides accurate point-level annotations. Meanwhile, we conduct a comprehensive benchmark based on SOTA point cloud segmentation algorithms to validate the effectiveness of CPS3D-Seg. Additionally, we propose a novel 3D segmentation method based on causal inference (CINet), which quantifies potential confounders in point clouds through Structural Refine (SR) and Quality Assessment (QA) Modules. Extensive experiments demonstrate that CINet significantly outperforms existing algorithms in both mIoU and accuracy.
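The benchmark reports mIoU; for point-level segmentation it can be computed as in this minimal sketch:

```python
def mean_iou(pred, gt, num_classes):
    """Point-level mean IoU: per class, the intersection over union of
    the predicted and ground-truth point sets, averaged over classes
    that occur in either labelling."""
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and g == c for p, g in zip(pred, gt))
        union = sum(p == c or g == c for p, g in zip(pred, gt))
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious)

# 4 points, classes 0 = background, 1 = defect; one defect point missed
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

Because every class is weighted equally, mIoU penalizes missing a tiny defect class far more than plain point accuracy would.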
[CV-189] Enhancing Diffusion Model Guidance through Calibration and Regularization NEURIPS2025
Link: https://arxiv.org/abs/2511.05844 Authors: Seyed Alireza Javid, Amirhossein Bagheri, Nuria González-Prelcic Affiliations: UC San Diego; Politecnico di Milano Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Image and Video Processing (eess.IV) Comments: Accepted at the NeurIPS 2025 Workshop on Structured Probabilistic Inference & Generative Modeling. Code available at this https URL
Click to view abstract
Abstract:Classifier-guided diffusion models have emerged as a powerful approach for conditional image generation, but they suffer from overconfident predictions during early denoising steps, causing the guidance gradient to vanish. This paper introduces two complementary contributions to address this issue. First, we propose a differentiable calibration objective based on the Smooth Expected Calibration Error (Smooth ECE), which improves classifier calibration with minimal fine-tuning and yields measurable improvements in Frechet Inception Distance (FID). Second, we develop enhanced sampling guidance methods that operate on off-the-shelf classifiers without requiring retraining. These include tilted sampling with batch-level reweighting, adaptive entropy-regularized sampling to preserve diversity, and a novel f-divergence-based sampling strategy that strengthens class-consistent guidance while maintaining mode coverage. Experiments on ImageNet 128x128 demonstrate that our divergence-regularized guidance achieves an FID of 2.13 using a ResNet-101 classifier, improving upon existing classifier-guided diffusion methods while requiring no diffusion model retraining. The results show that principled calibration and divergence-aware sampling provide practical and effective improvements for classifier-guided diffusion.
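Smooth ECE is a differentiable relaxation of the binned Expected Calibration Error; the sketch below computes the standard binned quantity being relaxed:

```python
def ece(confidences, correct, n_bins=10):
    """Binned Expected Calibration Error: the bin-weighted average of
    |accuracy - mean confidence|. Smooth ECE replaces the hard binning
    with a differentiable kernel so it can be used as a fine-tuning
    objective; this shows the underlying quantity."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    total = len(confidences)
    err = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(ok for _, ok in b) / len(b)
            err += len(b) / total * abs(acc - avg_conf)
    return err

# an overconfident classifier: 95% confident but only 75% correct
gap = ece([0.95, 0.95, 0.95, 0.95], [1, 1, 1, 0])   # ≈ 0.2
```

The hard binning makes this quantity non-differentiable in the confidences, which is exactly what the smooth variant fixes.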
[CV-190] Understanding Cross Task Generalization in Handwriting-Based Alzheimer's Screening via Vision Language Adaptation
Link: https://arxiv.org/abs/2511.05841 Authors: Changqing Gong, Huafeng Qin, Mounim A. El-Yacoubi Affiliations: Telecom SudParis; Institut Polytechnique de Paris; School of Computer Science and Information Engineering; Chongqing Technology and Business University Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Comments:
Click to view abstract
Abstract:Alzheimer’s disease is a prevalent neurodegenerative disorder for which early detection is critical. Handwriting-often disrupted in prodromal AD-provides a non-invasive and cost-effective window into subtle motor and cognitive decline. Existing handwriting-based AD studies, mostly relying on online trajectories and hand-crafted features, have not systematically examined how task type influences diagnostic performance and cross-task generalization. Meanwhile, large-scale vision language models have demonstrated remarkable zero or few-shot anomaly detection in natural images and strong adaptability across medical modalities such as chest X-ray and brain MRI. However, handwriting-based disease detection remains largely unexplored within this paradigm. To close this gap, we introduce a lightweight Cross-Layer Fusion Adapter framework that repurposes CLIP for handwriting-based AD screening. CLFA implants multi-level fusion adapters within the visual encoder to progressively align representations toward handwriting-specific medical cues, enabling prompt-free and efficient zero-shot inference. Using this framework, we systematically investigate cross-task generalization-training on a specific handwriting task and evaluating on unseen ones-to reveal which task types and writing patterns most effectively discriminate AD. Extensive analyses further highlight characteristic stroke patterns and task-level factors that contribute to early AD identification, offering both diagnostic insights and a benchmark for handwriting-based cognitive assessment.
[CV-191] YrPPG: Uncomplicated and Enhanced Learning Capability rPPG for Remote Heart Rate Estimation
Link: https://arxiv.org/abs/2511.05833 Authors: Taixi Chen, Yiu-ming Cheung Affiliations: unknown Categories: Computer Vision and Pattern Recognition (cs.CV) Comments: The 6th International Workshop on AI for Social Good in the Connected World (AI4SG) @ IEEE WI-IAT 2025
Click to view abstract
[CV-192] Hilbert-Guided Block-Sparse Local Attention
[CV-193] LRANet: Low-Rank Approximation Network for Accurate and Efficient Text Spotting
【Quick Read】: This paper aims to address the difficulty of reconciling detection accuracy and efficiency in end-to-end text spotting for arbitrary-shaped text, where the core bottleneck is the lack of a reliable and efficient text detection method. The key to the solution is two novel modules. The first is a parameterized text-shape representation based on low-rank approximation, which learns a low-rank subspace directly from annotated text boundaries and applies an ℓ₁-norm recovery scheme to obtain a compact and robust shape representation. The second is a triple assignment detection head, in which a deep sparse branch stabilizes training, an ultra-lightweight sparse branch accelerates inference, and a dense branch provides rich parallel supervision. Together these modules form the LRANet++ framework, which significantly improves both detection accuracy and inference efficiency for arbitrary-shaped text.
Link: https://arxiv.org/abs/2511.05818 Authors: Yuchen Su, Zhineng Chen, Yongkun Du, Zuxuan Wu, Hongtao Xie, Yu-Gang Jiang Affiliations: Fudan University; University of Science and Technology of China Categories: Computer Vision and Pattern Recognition (cs.CV) Comments:
Click to view abstract
Abstract:End-to-end text spotting aims to jointly optimize text detection and recognition within a unified framework. Despite significant progress, designing an accurate and efficient end-to-end text spotter for arbitrary-shaped text remains largely unsolved. We identify the primary bottleneck as the lack of a reliable and efficient text detection method. To address this, we propose a novel parameterized text shape method based on low-rank approximation for precise detection and a triple assignment detection head to enable fast inference. Specifically, unlike other shape representation methods that employ data-irrelevant parameterization, our data-driven approach derives a low-rank subspace directly from labeled text boundaries. To ensure this process is robust against the inherent annotation noise in this data, we utilize a specialized recovery method based on an ℓ₁-norm formulation, which accurately reconstructs the text shape with only a few key orthogonal vectors. By exploiting the inherent shape correlation among different text contours, our method achieves consistency and compactness in shape representation. Next, the triple assignment scheme introduces a novel architecture where a deep sparse branch (for stabilized training) is used to guide the learning of an ultra-lightweight sparse branch (for accelerated inference), while a dense branch provides rich parallel supervision. Building upon these advancements, we integrate the enhanced detection module with a lightweight recognition branch to form an end-to-end text spotting framework, termed LRANet++, capable of accurately and efficiently spotting arbitrary-shaped text. Extensive experiments on several challenging benchmarks demonstrate the superiority of LRANet++ compared to state-of-the-art methods. Code will be available at: this https URL
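The exact ℓ₁-norm recovery is not given in the abstract; a plainer least-squares variant of the same idea (derive a low-rank shape subspace from labeled boundaries, then represent any contour by a few coefficients) can be sketched with an SVD. All sizes below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical training set: 200 text contours, each 14 boundary points -> 28-dim vectors.
# Contours are generated from 3 latent factors, so the stacked matrix is (noisily) low-rank.
basis_true = rng.standard_normal((3, 28))
codes = rng.standard_normal((200, 3))
contours = codes @ basis_true + 0.01 * rng.standard_normal((200, 28))

# Data-driven shape subspace: top-k right singular vectors of the contour matrix.
mean = contours.mean(axis=0)
U, S, Vt = np.linalg.svd(contours - mean, full_matrices=False)
k = 3
subspace = Vt[:k]                         # (k, 28) orthogonal shape basis

# A new contour is then represented by only k coefficients.
new = codes[:1] @ basis_true
coeff = (new - mean) @ subspace.T         # (1, k) compact representation
recon = mean + coeff @ subspace           # reconstruction from k coefficients
err = np.linalg.norm(recon - new) / np.linalg.norm(new)
print(err)  # small: 3 vectors suffice because the shapes share a low-rank structure
```

The paper's ℓ₁-norm formulation replaces this least-squares fit to be robust to annotation outliers; the subspace idea is the same.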
[CV-194] MACMD: Multi-dilated Contextual Attention and Channel Mixer Decoding for Medical Image Segmentation
【Quick Read】: This paper addresses the challenges that anatomical variability poses for medical image segmentation, in particular the loss of shallow-layer feature information and the inefficient fusion of local detail and global context between the encoder and decoder in existing Transformer-based encoder-decoder architectures. The key to the solution is a novel decoder design based on MACMD (Multi-dilated Contextual Attention and Channel Mixer Decoding), which performs channel mixing between encoder and decoder stages through skip connections and combines hierarchical dilated convolutions, attention-driven modulation, and a cross-channel mixing module, effectively strengthening self-attention while preserving local context and thereby markedly improving segmentation accuracy while maintaining computational efficiency.
Link: https://arxiv.org/abs/2511.05803 Authors: Lalit Maurya, Honghai Liu, Reyer Zwiggelaar Affiliations: University of Portsmouth; Aberystwyth University Categories: Computer Vision and Pattern Recognition (cs.CV) Comments:
Click to view abstract
Abstract:Medical image segmentation faces challenges due to variations in anatomical structures. While convolutional neural networks (CNNs) effectively capture local features, they struggle with modeling long-range dependencies. Transformers mitigate this issue with self-attention mechanisms but lack the ability to preserve local contextual information. State-of-the-art models primarily follow an encoder-decoder architecture, achieving notable success. However, two key limitations remain: (1) Shallow layers, which are closer to the input, capture fine-grained details but suffer from information loss as data propagates through deeper layers. (2) Inefficient integration of local details and global context between the encoder and decoder stages. To address these challenges, we propose the MACMD-based decoder, which enhances attention mechanisms and facilitates channel mixing between encoder and decoder stages via skip connections. This design leverages hierarchical dilated convolutions, attention-driven modulation, and a cross channel-mixing module to capture long-range dependencies while preserving local contextual details, essential for precise medical image segmentation. We evaluated our approach using multiple transformer encoders on both binary and multi-organ segmentation tasks. The results demonstrate that our method outperforms state-of-the-art approaches in terms of Dice score and computational efficiency, highlighting its effectiveness in achieving accurate and robust segmentation performance. The code is available at this https URL
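To see why hierarchical dilated convolutions capture long-range context cheaply: with stride 1, stacking kernels of size k at dilations d_i gives a receptive field of 1 + Σ_i (k−1)·d_i. A small self-contained check (illustrative, not the paper's code):

```python
import numpy as np

def receptive_field(kernel_size, dilations):
    """Receptive field of stacked dilated convolutions (stride 1)."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

def dilated_conv1d(x, w, d):
    """'Valid' 1-D convolution of signal x with kernel w at dilation d."""
    k = len(w)
    span = (k - 1) * d
    return np.array([sum(w[j] * x[i + j * d] for j in range(k))
                     for i in range(len(x) - span)])

# Three stacked 3-tap layers at dilations 1, 2, 4 cover a 15-sample context.
print(receptive_field(3, [1, 2, 4]))   # 1 + 2*1 + 2*2 + 2*4 = 15

x = np.arange(10, dtype=float)
y = dilated_conv1d(x, np.array([1.0, 0.0, -1.0]), d=2)  # difference over a 4-sample gap
print(y)  # each output is x[i] - x[i+4] = -4 on a linear ramp
```

The same arithmetic carries over to 2-D: widening the dilation schedule grows context without adding parameters, which is the efficiency argument behind multi-dilated decoders.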
[CV-195] Position-Prior-Guided Network for System Matrix Super-Resolution in Magnetic Particle Imaging
Link: https://arxiv.org/abs/2511.05795 Authors: Xuqing Geng, Lei Su, Zhongwei Bian, Zewen Sun, Jiaxuan Wen, Jie Tian, Yang Du Affiliations: CAS Key Laboratory of Molecular Imaging, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; School of Engineering Medicine and the School of Biological Science and Medical Engineering, Beihang University; Key Laboratory of Big Data-Based Precision Medicine (Beihang University), Ministry of Industry and Information Technology of China Categories: Computer Vision and Pattern Recognition (cs.CV) Comments: accepted as oral presentation at EMBC 2025
Click to view abstract
Abstract:Magnetic Particle Imaging (MPI) is a novel medical imaging modality. One of the established methods for MPI reconstruction is based on the System Matrix (SM). However, the calibration of the SM is often time-consuming and requires repeated measurements whenever the system parameters change. Current methodologies utilize deep learning-based super-resolution (SR) techniques to expedite SM calibration; nevertheless, these strategies do not fully exploit physical prior knowledge associated with the SM, such as symmetric positional priors. Consequently, we integrated positional priors into existing frameworks for SM calibration. Underpinned by theoretical justification, we empirically validated the efficacy of incorporating positional priors through experiments involving both 2D and 3D SM SR methods.
[CV-196] TCSA-UDA: Text-Driven Cross-Semantic Alignment for Unsupervised Domain Adaptation in Medical Image Segmentation
Link: https://arxiv.org/abs/2511.05782 Authors: Lalit Maurya, Honghai Liu, Reyer Zwiggelaar Affiliations: University of Portsmouth; Aberystwyth University Categories: Computer Vision and Pattern Recognition (cs.CV) Comments:
Click to view abstract
Abstract:Unsupervised domain adaptation (UDA) for medical image segmentation remains a significant challenge due to substantial domain shifts across imaging modalities, such as CT and MRI. While recent vision-language representation learning methods have shown promise, their potential in UDA segmentation tasks remains underexplored. To address this gap, we propose TCSA-UDA, a Text-driven Cross-Semantic Alignment framework that leverages domain-invariant textual class descriptions to guide visual representation learning. Our approach introduces a vision-language covariance cosine loss to directly align image encoder features with inter-class textual semantic relations, encouraging semantically meaningful and modality-invariant feature representations. Additionally, we incorporate a prototype alignment module that aligns class-wise pixel-level feature distributions across domains using high-level semantic prototypes. This mitigates residual category-level discrepancies and enhances cross-modal consistency. Extensive experiments on challenging cross-modality cardiac, abdominal, and brain tumor segmentation benchmarks demonstrate that our TCSA-UDA framework significantly reduces domain shift and consistently outperforms state-of-the-art UDA methods, establishing a new paradigm for integrating language-driven semantics into domain-adaptive medical image analysis.
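The abstract names a "vision-language covariance cosine loss" aligning image features with inter-class textual relations but does not give the formula. One plausible surrogate, offered purely as a sketch, matches the pairwise cosine-similarity structure of class-wise image prototypes to that of the text class embeddings:

```python
import numpy as np

def cosine_matrix(E):
    """Pairwise cosine similarities between row vectors."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    return E @ E.T

def relation_alignment_loss(img_protos, txt_embs):
    """Hypothetical surrogate for the covariance-cosine idea: penalize the gap
    between the class-relation structures of image prototypes and text embeddings."""
    diff = cosine_matrix(img_protos) - cosine_matrix(txt_embs)
    return np.mean(diff ** 2)

rng = np.random.default_rng(2)
txt = rng.standard_normal((4, 32))            # 4 class descriptions, 32-dim embeddings
aligned = txt + 0.05 * rng.standard_normal((4, 32))
random_protos = rng.standard_normal((4, 32))
print(relation_alignment_loss(aligned, txt)
      < relation_alignment_loss(random_protos, txt))  # True: aligned prototypes score lower
```

Because the text embeddings are domain-invariant, driving image prototypes toward the same relation structure is one way such a loss could encourage modality-invariant features; the paper's exact formulation may differ.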
[CV-197] MARAuder's Map: Motion-Aware Real-time Activity Recognition with Layout-Based Trajectories
Link: https://arxiv.org/abs/2511.05773 Authors: Zishuai Liu, Weihang You, Jin Lu, Fei Dou Affiliations: University of Georgia Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV) Comments:
Click to view abstract
Abstract:Ambient sensor-based human activity recognition (HAR) in smart homes remains challenging due to the need for real-time inference, spatially grounded reasoning, and context-aware temporal modeling. Existing approaches often rely on pre-segmented, within-activity data and overlook the physical layout of the environment, limiting their robustness in continuous, real-world deployments. In this paper, we propose MARAuder’s Map, a novel framework for real-time activity recognition from raw, unsegmented sensor streams. Our method projects sensor activations onto the physical floorplan to generate trajectory-aware, image-like sequences that capture the spatial flow of human movement. These representations are processed by a hybrid deep learning model that jointly captures spatial structure and temporal dependencies. To enhance temporal awareness, we introduce a learnable time embedding module that encodes contextual cues such as hour-of-day and day-of-week. Additionally, an attention-based encoder selectively focuses on informative segments within each observation window, enabling accurate recognition even under cross-activity transitions and temporal ambiguity. Extensive experiments on multiple real-world smart home datasets demonstrate that our method outperforms strong baselines, offering a practical solution for real-time HAR in ambient sensor environments.
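As an illustrative sketch (the layout, sensor names, and the fixed sinusoidal embedding are all hypothetical; the paper's time embedding is learnable), projecting a window of sensor activations onto a floorplan grid might look like:

```python
import numpy as np

# Hypothetical floorplan: sensor id -> (row, col) cell on an 8x8 grid.
LAYOUT = {"M001": (1, 2), "M002": (1, 3), "M003": (5, 6), "D001": (7, 0)}

def rasterize(events, grid=(8, 8)):
    """Turn one window of sensor activations into an image-like frame,
    making later activations brighter to keep the direction of movement."""
    frame = np.zeros(grid)
    for step, sensor in enumerate(events, start=1):
        frame[LAYOUT[sensor]] = step / len(events)
    return frame

def hour_embedding(hour, dim=8):
    """Fixed sinusoidal stand-in for the paper's learnable time embedding."""
    freqs = 2 * np.pi * np.arange(1, dim // 2 + 1) / 24.0
    return np.concatenate([np.sin(freqs * hour), np.cos(freqs * hour)])

frame = rasterize(["M001", "M002", "M003"])   # walking from M001 toward M003
print(frame[1, 2], frame[5, 6])               # earlier cell dimmer than later cell
print(hour_embedding(14).shape)               # (8,)
```

Stacking such frames over time yields the trajectory-aware, image-like sequences the abstract describes, with the time embedding concatenated as contextual input.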
[CV-198] Sign language recognition from skeletal data using graph and recurrent neural networks
Link: https://arxiv.org/abs/2511.05772 Authors: B. Mederos, J. Mejía, A. Medina-Reyes, Y. Espinosa-Almeyda, J. D. Díaz-Roman, I. Rodríguez-Mederos, M. Mejía-Carreon, F. Gonzalez-Lopez Affiliations: Universidad Autónoma de Ciudad Juárez (UACJ); Instituto Iberoamericano San Patricio Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Comments: 15 pages, 2 figures
Click to view abstract
[CV-199] A Second-Order Attention Mechanism For Prostate Cancer Segmentation and Detection in Bi-Parametric MRI
Link: https://arxiv.org/abs/2511.05760 Authors: Mateo Ortiz, Juan Olmos, Fabio Martínez Affiliations: unknown Categories: Computer Vision and Pattern Recognition (cs.CV) Comments: Accepted at the 28th Iberoamerican Congress on Pattern Recognition (CIARP 2025). To appear in Lecture Notes in Computer Science (LNCS), Springer
Click to view abstract
[CV-200] Towards Better Ultrasound Video Segmentation Foundation Model: An Empirical Study on SAM2 Finetuning from Data Perspective
Link: https://arxiv.org/abs/2511.05731 Authors: Xing Yao, Ahana Gangopadhyay, Hsi-Ming Chang, Ravi Soni Affiliations: unknown Categories: Computer Vision and Pattern Recognition (cs.CV) Comments:
Click to view abstract
Abstract:Ultrasound (US) video segmentation remains a challenging problem due to strong inter- and intra-dataset variability, motion artifacts, and limited annotated data. Although foundation models such as Segment Anything Model 2 (SAM2) demonstrate strong zero-shot and prompt-guided segmentation capabilities, their performance deteriorates substantially when transferred to medical imaging domains. Current adaptation studies mainly emphasize architectural modifications, while the influence of data characteristics and training regimes has not been systematically examined. In this study, we present a comprehensive, data-centric investigation of SAM2 adaptation for ultrasound video segmentation. We analyze how training-set size, video duration, and augmentation schemes affect adaptation performance under three paradigms: task-specific fine-tuning, intermediate adaptation, and multi-task joint training, across five SAM2 variants and multiple prompting modes. We further design six ultrasound-specific augmentations, assessing their effect relative to generic strategies. Experiments on three representative ultrasound datasets reveal that data scale and temporal context play a more decisive role than model architecture or initialization. Moreover, joint training offers an efficient compromise between modality alignment and task specialization. This work aims to provide empirical insights for developing efficient, data-aware adaptation pipelines for SAM2 in ultrasound video analysis.
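The six ultrasound-specific augmentations are not enumerated in the abstract. One augmentation commonly used for ultrasound, shown here only as an example of the genre, is multiplicative speckle noise:

```python
import numpy as np

def speckle_augment(img, sigma=0.15, rng=None):
    """Illustrative ultrasound-style augmentation: multiplicative speckle noise.
    Unlike additive Gaussian noise, the perturbation scales with local intensity,
    so dark (anechoic) regions stay dark, as in real US speckle."""
    rng = rng or np.random.default_rng()
    noise = rng.normal(loc=1.0, scale=sigma, size=img.shape)
    return np.clip(img * noise, 0.0, 1.0)

rng = np.random.default_rng(3)
img = np.zeros((64, 64))
img[16:48, 16:48] = 0.8                      # bright structure on a dark background
aug = speckle_augment(img, rng=rng)
print(aug.shape, float(aug[:16].max()))      # background (zeros) is untouched
```

Whether this particular transform is among the paper's six augmentations is an assumption; the point is that modality-aware perturbations respect ultrasound image statistics in a way generic augmentations do not.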
[CV-201] Pedicle Screw Pairing and Registration for Screw Pose Estimation from Dual C-arm Images Using CAD Models
Link: https://arxiv.org/abs/2511.05702 Authors: Yehyun Suh, Lin Li, Aric Plumley, Chaochao Zhou, Daniel Moyer, Kongbin Kang Affiliations: AIX Research; Alphatec Spine; Vanderbilt University Categories: Computer Vision and Pattern Recognition (cs.CV) Comments:
Click to view abstract
Abstract:Accurate matching of pedicle screws in both anteroposterior (AP) and lateral (LAT) images is critical for successful spinal decompression and stabilization during surgery. However, establishing screw correspondence, especially in LAT views, remains a significant clinical challenge. This paper introduces a method to address pedicle screw correspondence and pose estimation from dual C-arm images. By comparing screw combinations, the approach demonstrates consistent accuracy in both pairing and registration tasks. The method also employs 2D-3D alignment with screw CAD 3D models to accurately pair and estimate screw pose from dual views. Our results show that the correct screw combination consistently outperforms incorrect pairings across all test cases, even prior to registration. After registration, the correct combination further enhances alignment between projections and images, significantly reducing projection error. This approach shows promise for improving surgical outcomes in spinal procedures by providing reliable feedback on screw positioning.
[CV-202] VMDT: Decoding the Trustworthiness of Video Foundation Models NEURIPS2025
【Quick Read】: This paper addresses the absence of a systematic standard for evaluating video-modality foundation models along key dimensions such as safety, fairness, and privacy. Although generative AI already has relatively mature trustworthiness benchmarks for the text and image modalities, the video modality has remained a blank. To this end, the authors propose VMDT (Video-Modal DecodingTrust), the first unified platform for evaluating text-to-video (T2V) and video-to-text (V2T) models along five core trustworthiness dimensions: safety, hallucination, fairness, privacy, and adversarial robustness. The key innovation is a structured, extensible evaluation framework; large-scale experiments with it expose significant deficiencies of existing models, including inadequate safety, pronounced unfairness, and privacy risks that grow with model scale, thereby providing quantitative grounding and directions for developing more reliable and controllable video foundation models.
Link: https://arxiv.org/abs/2511.05682 Authors: Yujin Potter, Zhun Wang, Nicholas Crispino, Kyle Montgomery, Alexander Xiong, Ethan Y. Chang, Francesco Pinto, Yuqi Chen, Rahul Gupta, Morteza Ziyadi, Christos Christodoulopoulos, Bo Li, Chenguang Wang, Dawn Song Affiliations: University of California, Berkeley; University of California, Santa Cruz; University of Illinois at Urbana-Champaign; University of Chicago; Amazon; Information Commissioner's Office Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) Comments: NeurIPS 2025 Datasets & Benchmarks
Click to view abstract
Abstract:As foundation models become more sophisticated, ensuring their trustworthiness becomes increasingly critical; yet, unlike text and image, the video modality still lacks comprehensive trustworthiness benchmarks. We introduce VMDT (Video-Modal DecodingTrust), the first unified platform for evaluating text-to-video (T2V) and video-to-text (V2T) models across five key trustworthiness dimensions: safety, hallucination, fairness, privacy, and adversarial robustness. Through our extensive evaluation of 7 T2V models and 19 V2T models using VMDT, we uncover several significant insights. For instance, all open-source T2V models evaluated fail to recognize harmful queries and often generate harmful videos, while exhibiting higher levels of unfairness compared to image modality models. In V2T models, unfairness and privacy risks rise with scale, whereas hallucination and adversarial robustness improve – though overall performance remains low. Uniquely, safety shows no correlation with model size, implying that factors other than scale govern current safety levels. Our findings highlight the urgent need for developing more robust and trustworthy video foundation models, and VMDT provides a systematic framework for measuring and tracking progress toward this goal. The code is available at this https URL.
[CV-203] Culture in Action: Evaluating Text-to-Image Models through Social Activities
Link: https://arxiv.org/abs/2511.05681 Authors: Sina Malakouti, Boqing Gong, Adriana Kovashka Affiliations: University of Pittsburgh; Boston University Categories: Computer Vision and Pattern Recognition (cs.CV) Comments:
Click to view abstract
Abstract:Text-to-image (T2I) diffusion models achieve impressive photorealism by training on large-scale web data, but models inherit cultural biases and fail to depict underrepresented regions faithfully. Existing cultural benchmarks focus mainly on object-centric categories (e.g., food, attire, and architecture), overlooking the social and daily activities that more clearly reflect cultural norms. Few metrics exist for measuring cultural faithfulness. We introduce CULTIVate, a benchmark for evaluating T2I models on cross-cultural activities (e.g., greetings, dining, games, traditional dances, and cultural celebrations). CULTIVate spans 16 countries with 576 prompts and more than 19,000 images, and provides an explainable descriptor-based evaluation framework across multiple cultural dimensions, including background, attire, objects, and interactions. We propose four metrics to measure cultural alignment, hallucination, exaggerated elements, and diversity. Our findings reveal systematic disparities: models perform better for global north countries than for the global south, with distinct failure modes across T2I systems. Human studies confirm that our metrics correlate more strongly with human judgments than existing text-image metrics.
[CV-204] Lite VLA: Efficient Vision-Language-Action Control on CPU-Bound Edge Robots
Link: https://arxiv.org/abs/2511.05642 Authors: Justin Williams, Kishor Datta Gupta, Roy George, Mrinmoy Sarkar Affiliations: unknown Categories: Robotics (cs.RO); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY) Comments:
Click to view abstract
Abstract:The deployment of artificial intelligence models at the edge is increasingly critical for autonomous robots operating in GPS-denied environments where local, resource-efficient reasoning is essential. This work demonstrates the feasibility of deploying small Vision-Language Models (VLMs) on mobile robots to achieve real-time scene understanding and reasoning under strict computational constraints. Unlike prior approaches that separate perception from mobility, the proposed framework enables simultaneous movement and reasoning in dynamic environments using only on-board hardware. The system integrates a compact VLM with multimodal perception to perform contextual interpretation directly on embedded hardware, eliminating reliance on cloud connectivity. Experimental validation highlights the balance between computational efficiency, task accuracy, and system responsiveness. Implementation on a mobile robot confirms one of the first successful deployments of small VLMs for concurrent reasoning and mobility at the edge. This work establishes a foundation for scalable, assured autonomy in applications such as service robotics, disaster response, and defense operations.
[CV-205] Registration-Free Monitoring of Unstructured Point Cloud Data via Intrinsic Geometrical Properties
Link: https://arxiv.org/abs/2511.05622 Authors: Nicholas Babey, Tiffany Gu, Yiheng Li, Cristian Meo, Kevin Zhu Affiliations: unknown Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO) Comments: Accepted at NeurIPS 2025 SpaVLE, for code see this https URL , 9 pages, 1 figure
Click to view abstract
Abstract:For embodied agents to effectively understand and interact within the world around them, they require a nuanced comprehension of human actions grounded in physical space. Current action recognition models, often relying on RGB video, learn superficial correlations between patterns and action labels, so they struggle to capture underlying physical interaction dynamics and human poses in complex scenes. We propose a model architecture that grounds action recognition in physical space by fusing two powerful, complementary representations: V-JEPA 2’s contextual, predictive world dynamics and CoMotion’s explicit, occlusion-tolerant human pose data. Our model is validated on both the InHARD and UCF-19-Y-OCC benchmarks for general action recognition and high-occlusion action recognition, respectively. Our model outperforms three other baselines, especially within complex, occlusive scenes. Our findings emphasize a need for action recognition to be supported by spatial understanding instead of statistical pattern recognition.
[CV-207] Convolutional Fully-Connected Capsule Network (CFC-CapsNet): A Novel and Fast Capsule Network
Abstract:A Capsule Network (CapsNet) is a relatively new classifier and one of the possible successors of Convolutional Neural Networks (CNNs). CapsNet maintains the spatial hierarchies between features and outperforms CNNs at classifying images containing overlapping categories. Even though CapsNet works well on small-scale datasets such as MNIST, it fails to achieve a similar level of performance on more complicated datasets and real applications. In addition, CapsNet is slow compared to CNNs when performing the same task and relies on a higher number of parameters. In this work, we introduce the Convolutional Fully-Connected Capsule Network (CFC-CapsNet) to address the shortcomings of CapsNet by creating capsules with a different method: a new layer (the CFC layer) serves as an alternative way of creating capsules. CFC-CapsNet produces fewer, yet more powerful capsules, resulting in higher network accuracy. Our experiments show that, compared to conventional CapsNet, CFC-CapsNet achieves competitive accuracy with faster training and inference and fewer parameters on the CIFAR-10, SVHN and Fashion-MNIST datasets.
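Whatever layer produces the capsules, CapsNet-style capsules are typically passed through the squash nonlinearity of Sabour et al., which maps a vector's length into [0, 1) so it can act as an existence probability while preserving direction:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """CapsNet squash: shrink vector length into [0, 1) while keeping direction,
    so a capsule's length can act as an existence probability."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm)
    return scale * s / np.sqrt(sq_norm + eps)

caps = np.array([[0.1, 0.0], [3.0, 4.0]])   # one weak and one strong capsule
out = squash(caps)
print(np.linalg.norm(out, axis=1))           # short vector -> near 0, long vector -> near 1
```

Whether CFC-CapsNet keeps exactly this nonlinearity after its CFC layer is not stated in the abstract; it is the standard choice the CapsNet family builds on.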
[CV-208] Personalized Image Editing in Text-to-Image Diffusion Models via Collaborative Direct Preference Optimization NEURIPS’25
Link: https://arxiv.org/abs/2511.05616 Authors: Connor Dunlop, Matthew Zheng, Kavana Venkatesh, Pinar Yanardag Affiliations: unknown Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Comments: Published at NeurIPS'25 Main Conference
Click to view abstract
Abstract:Text-to-image (T2I) diffusion models have made remarkable strides in generating and editing high-fidelity images from text. Yet, these models remain fundamentally generic, failing to adapt to the nuanced aesthetic preferences of individual users. In this work, we present the first framework for personalized image editing in diffusion models, introducing Collaborative Direct Preference Optimization (C-DPO), a novel method that aligns image edits with user-specific preferences while leveraging collaborative signals from like-minded individuals. Our approach encodes each user as a node in a dynamic preference graph and learns embeddings via a lightweight graph neural network, enabling information sharing across users with overlapping visual tastes. We enhance a diffusion model’s editing capabilities by integrating these personalized embeddings into a novel DPO objective, which jointly optimizes for individual alignment and neighborhood coherence. Comprehensive experiments, including user studies and quantitative benchmarks, demonstrate that our method consistently outperforms baselines in generating edits that are aligned with user preferences.
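C-DPO extends Direct Preference Optimization; for reference, the standard per-pair DPO loss it builds on (before adding any collaborative graph term) can be computed as:

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss that C-DPO extends: -log sigmoid of the beta-scaled
    margin between policy and reference log-probabilities of the preferred (w)
    and dispreferred (l) outputs."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# If the policy matches the reference, the margin is 0 and the loss is log 2.
print(dpo_loss(-5.0, -6.0, -5.0, -6.0))              # log(2) ~ 0.693
# Shifting probability mass toward the preferred edit lowers the loss.
print(dpo_loss(-4.0, -7.0, -5.0, -6.0) < np.log(2))  # True
```

In C-DPO the objective is additionally conditioned on user embeddings from the preference graph; that part is not reproduced here since the abstract gives no formula.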
[CV-209] Pose-Aware Multi-Level Motion Parsing for Action Quality Assessment
Link: https://arxiv.org/abs/2511.05611 Authors: Shuaikang Zhu, Yang Yang, Chen Sun Affiliations: Xi'an Jiaotong University; Shanghai University of Sport Categories: Computer Vision and Pattern Recognition (cs.CV) Comments:
Click to view abstract
Abstract:Human pose serves as a cornerstone of action quality assessment (AQA), where subtle spatial-temporal variations in pose often distinguish excellence from mediocrity. In high-level competitions, these nuanced differences become decisive factors in scoring. In this paper, we propose a novel multi-level motion parsing framework for AQA based on enhanced spatial-temporal pose features. On the first level, an Action-Unit Parser is designed with the help of pose extraction to achieve precise action segmentation and comprehensive local-global pose representations. On the second level, a Motion Parser uses spatial-temporal feature learning to capture pose changes and appearance details for each action-unit. Meanwhile, certain conditions beyond the body itself, such as water splash in diving, also affect action scoring; we therefore design an additional Condition Parser to offer users more flexibility in their choices. Finally, a Weight-Adjust Scoring Module is introduced to better accommodate the diverse requirements of various action types and the multi-scale nature of action-units. Extensive evaluations on large-scale diving sports datasets demonstrate that our multi-level motion parsing framework achieves state-of-the-art performance in both action segmentation and action scoring tasks.
[CV-210] Walking the Schrödinger Bridge: A Direct Trajectory for Text-to-3D Generation NEURIPS2025
Abstract:Recent advancements in optimization-based text-to-3D generation heavily rely on distilling knowledge from pre-trained text-to-image diffusion models using techniques like Score Distillation Sampling (SDS), which often introduce artifacts such as over-saturation and over-smoothing into the generated 3D assets. In this paper, we address this essential problem by formulating the generation process as learning an optimal, direct transport trajectory between the distribution of the current rendering and the desired target distribution, thereby enabling high-quality generation with smaller Classifier-free Guidance (CFG) values. We first theoretically establish SDS as a simplified instance of the Schrödinger Bridge framework: we prove that SDS employs the reverse process of a Schrödinger Bridge which, under specific conditions (e.g., Gaussian noise at one end), collapses to the score function of the pre-trained diffusion model used by SDS. Based upon this, we introduce Trajectory-Centric Distillation (TraCe), a novel text-to-3D generation framework, which reformulates the mathematically tractable Schrödinger Bridge framework to explicitly construct a diffusion bridge from the current rendering to its text-conditioned, denoised target, and trains a LoRA-adapted model on this trajectory's score dynamics for robust 3D optimization. Comprehensive experiments demonstrate that TraCe consistently achieves superior quality and fidelity to state-of-the-art techniques.
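For reference, the SDS gradient being reinterpreted (written in the standard DreamFusion notation; the paper's own symbols may differ) is:

```latex
\nabla_{\theta}\mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,\bigl(\hat{\epsilon}_{\phi}(x_t;\, y,\, t) - \epsilon\bigr)\,
    \frac{\partial x}{\partial \theta} \right],
  \qquad x_t = \alpha_t\, x + \sigma_t\, \epsilon,\quad \epsilon \sim \mathcal{N}(0, \mathbf{I}),
```

where \(x\) is the current rendering of the 3D representation \(\theta\), \(y\) the text prompt, \(\hat{\epsilon}_{\phi}\) the frozen diffusion model's (CFG-guided) noise prediction, and \(w(t)\) a timestep weighting. The "Gaussian noise as one end" condition in the abstract is what collapses the bridge's score to this single pre-trained score term; TraCe instead builds the bridge between the rendering and its denoised target.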
[CV-211] In-process 3D Deviation Mapping and Defect Monitoring (3D-DM2) in High Production-rate Robotic Additive Manufacturing
Abstract:Additive manufacturing (AM) is an emerging digital manufacturing technology to produce complex and freeform objects through a layer-wise deposition. High deposition rate robotic AM (HDRRAM) processes, such as cold spray additive manufacturing (CSAM), offer significantly increased build speeds by delivering large volumes of material per unit time. However, maintaining shape accuracy remains a critical challenge, particularly due to process instabilities in current open-loop systems. Detecting these deviations as they occur is essential to prevent error propagation, ensure part quality, and minimize post-processing requirements. This study presents a real-time monitoring system to acquire and reconstruct the growing part and directly compares it with a near-net reference model to detect the shape deviation during the manufacturing process. The early identification of shape inconsistencies, followed by segmenting and tracking each deviation region, paves the way for timely intervention and compensation to achieve consistent part quality.
[CV-212] Google-MedGemma Based Abnormality Detection in Musculoskeletal radiographs
Link: https://arxiv.org/abs/2511.05600 Authors: Soumyajit Maity, Pranjal Kamboj, Sneha Maity, Rajat Singh, Sankhadeep Chatterjee Affiliations: unknown Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Comments: Proceedings of ICICT 2026, London, Springer (Forthcoming, February 2026; Accepted for Publication)
Click to view abstract
Abstract:This paper proposes a MedGemma-based framework for automatic abnormality detection in musculoskeletal radiographs. Departing from conventional autoencoder and neural network pipelines, the proposed method leverages the MedGemma foundation model, incorporating a SigLIP-derived vision encoder pretrained on diverse medical imaging modalities. Preprocessed X-ray images are encoded into high-dimensional embeddings using the MedGemma vision backbone, which are subsequently passed through a lightweight multilayer perceptron for binary classification. Experimental assessment reveals that the MedGemma-driven classifier exhibits strong performance, exceeding conventional convolutional and autoencoder-based metrics. Additionally, the model leverages MedGemma's transfer learning capabilities, enhancing generalization and optimizing feature engineering. The integration of a modern medical foundation model not only enhances representation learning but also facilitates modular training strategies such as selective encoder block unfreezing for efficient domain adaptation. The findings suggest that MedGemma-powered classification systems can advance clinical radiograph triage by providing scalable and accurate abnormality detection, with potential for broader applications in automated medical image analysis. Keywords: Google MedGemma, MURA, Medical Image, Classification. Report number: ICICT-2026-217
[CV-213] Beyond Softmax: Dual-Branch Sigmoid Architecture for Accurate Class Activation Maps BMVC2025
Abstract:Class Activation Mapping (CAM) and its extensions have become indispensable tools for visualizing the evidence behind deep network predictions. However, by relying on a final softmax classifier, these methods suffer from two fundamental distortions: additive logit shifts that arbitrarily bias importance scores, and sign collapse that conflates excitatory and inhibitory features. We propose a simple, architecture-agnostic dual-branch sigmoid head that decouples localization from classification. Given any pretrained model, we clone its classification head into a parallel branch ending in per-class sigmoid outputs, freeze the original softmax head, and fine-tune only the sigmoid branch with class-balanced binary supervision. At inference, softmax retains recognition accuracy, while class evidence maps are generated from the sigmoid branch – preserving both magnitude and sign of feature contributions. Our method integrates seamlessly with most CAM variants and incurs negligible overhead. Extensive evaluations on fine-grained tasks (CUB-200-2011, Stanford Cars) and WSOL benchmarks (ImageNet-1K, OpenImages30K) show improved explanation fidelity and consistent Top-1 Localization gains – without any drop in classification accuracy. Code is available at this https URL.
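The "additive logit shift" distortion can be demonstrated in a few lines: softmax is invariant to adding a constant to every logit, so absolute evidence magnitudes are lost, whereas per-class sigmoid scores are not (a self-contained NumPy check, not the paper's code):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # max-subtraction for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([2.0, -1.0, 0.5])
shift = z + 10.0              # add the same constant to every class logit

# Softmax probabilities ignore the shift, so CAM evidence built on them is arbitrary...
print(np.allclose(softmax(z), softmax(shift)))    # True
# ...while per-class sigmoid scores change, preserving absolute evidence.
print(np.allclose(sigmoid(z), sigmoid(shift)))    # False
```

This is the motivation for generating evidence maps from the sigmoid branch while keeping the frozen softmax head for recognition.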
[CV-214] DiffSwap: 3D Latent-Controlled Diffusion for Identity-Preserving Face Swapping
Abstract:Diffusion-based approaches have recently achieved strong results in face swapping, offering improved visual quality over traditional GAN-based methods. However, even state-of-the-art models often suffer from fine-grained artifacts and poor identity preservation, particularly under challenging poses and expressions. A key limitation of existing approaches is their failure to meaningfully leverage 3D facial structure, which is crucial for disentangling identity from pose and expression. In this work, we propose DiffSwap++, a novel diffusion-based face-swapping pipeline that incorporates 3D facial latent features during training. By guiding the generation process with 3D-aware representations, our method enhances geometric consistency and improves the disentanglement of facial identity from appearance attributes. We further design a diffusion architecture that conditions the denoising process on both identity embeddings and facial landmarks, enabling high-fidelity and identity-preserving face swaps. Extensive experiments on CelebA, FFHQ, and CelebV-Text demonstrate that DiffSwap++ outperforms prior methods in preserving source identity while maintaining target pose and expression. Additionally, we introduce a biometric-style evaluation and conduct a user study to further validate the realism and effectiveness of our approach. Code will be made publicly available at this https URL
[CV-215] Elements of Active Continuous Learning and Uncertainty Self-Awareness: a Narrow Implementation for Face and Facial Expression Recognition
Abstract:Reflecting on one’s own thought process, and correcting it when its performance is unsatisfactory, is perhaps one of the essential traits of intelligence. Such high-level abstract capabilities, often deemed mandatory for Artificial General Intelligence, can nevertheless be modelled even at the level of narrow Machine Learning algorithms. Here, we present an emulation of a self-awareness mechanism in the form of a supervising artificial neural network (ANN) that observes activation patterns in another, underlying ANN, searching for indications of high uncertainty in the underlying ANN and, therefore, of the trustworthiness of its predictions. The underlying ANN is a convolutional neural network (CNN) ensemble employed for face recognition and facial expression tasks. The self-awareness ANN has a memory region where information about its past performance is stored, and its learnable parameters are adjusted during training to optimize performance. The trustworthiness verdict triggers an active learning mode, giving elements of agency to the machine learning algorithm, which asks for human help under conditions of high uncertainty and confusion.
[CV-216] Video Text Preservation with Synthetic Text-Rich Videos
链接: https://arxiv.org/abs/2511.05573 作者: Ziyang Liu,Kevin Valencia,Justin Cui 机构: University of California, Los Angeles (加州大学洛杉矶分校) 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) 备注:
点击查看摘要
Abstract:While Text-To-Video (T2V) models have advanced rapidly, they continue to struggle with generating legible and coherent text within videos. In particular, existing models often fail to correctly render even short phrases or words, and previous attempts to address this problem are computationally expensive and not suitable for video generation. In this work, we investigate a lightweight approach to improving T2V diffusion models using synthetic supervision. We first generate text-rich images using a text-to-image (T2I) diffusion model, then animate them into short videos using a text-agnostic image-to-video (I2V) model. These synthetic video-prompt pairs are used to fine-tune Wan2.1, a pre-trained T2V model, without any architectural changes. Our results show improved short-text legibility and temporal consistency, with emerging structural priors for longer text. These findings suggest that curated synthetic data and weak supervision offer a practical path toward improving textual fidelity in T2V generation.
链接: https://arxiv.org/abs/2511.05571 作者: Xiaofei Wang,Stephen Price,Chao Li 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) 备注:
点击查看摘要
Abstract:The rapid advancement of spatial transcriptomics (ST), i.e., spatially resolved gene expression, has made it possible to measure gene expression within the original tissue, enabling the discovery of molecular mechanisms. However, current ST platforms frequently suffer from low resolution, limiting in-depth understanding of spatial gene expression. Super-resolution approaches promise to enhance ST maps by integrating histology images with the gene expressions of profiled tissue spots. However, modeling the interactions between histology images and gene expressions for effective ST enhancement remains a challenge. This study presents a cross-modal cross-content contrastive diffusion framework, called C3-Diff, for ST enhancement with histology images as guidance. In C3-Diff, we first analyze the deficiencies of the traditional contrastive learning paradigm, which is then refined to extract both modal-invariant and content-invariant features of ST maps and histology images. Further, to overcome the low sequencing sensitivity of ST maps, we perform noising-based information augmentation on the surface of the unit feature hypersphere. Finally, we propose a dynamic cross-modal imputation-based training strategy to mitigate ST data scarcity. We tested C3-Diff by benchmarking its performance on four public datasets, where it achieves significant improvements over competing methods. Moreover, we evaluate C3-Diff on the downstream tasks of cell type localization, gene expression correlation, and single-cell-level gene expression prediction, promoting AI-enhanced biotechnology for biomedical research and clinical applications. Codes are available at this https URL.
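The noising-based augmentation on the unit hypersphere can be sketched directly: perturb a normalized feature with Gaussian noise and project it back onto the sphere, yielding extra samples near the original direction. This is an illustrative reduction; the noise scale σ and the number of augmentations are assumptions, not the paper's settings.

```python
import numpy as np

def hypersphere_augment(z, sigma=0.1, n_aug=4, seed=0):
    """Noising-based augmentation on the unit hypersphere: perturb a
    normalized feature with Gaussian noise and renormalize, producing
    extra positives around the original direction."""
    rng = np.random.default_rng(seed)
    z = z / np.linalg.norm(z)
    noisy = z + sigma * rng.standard_normal((n_aug, z.size))
    return noisy / np.linalg.norm(noisy, axis=1, keepdims=True)

z = np.array([3.0, 4.0, 0.0])
aug = hypersphere_augment(z)
cos = aug @ (z / np.linalg.norm(z))   # similarity to the original direction
print(aug.shape, cos.round(3))
```

All augmented features lie back on the unit sphere, so downstream cosine-similarity contrastive losses can treat them directly as additional positives.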
[CV-218] Do Street View Imagery and Public Participation GIS align: Comparative Analysis of Urban Attractiveness
链接: https://arxiv.org/abs/2511.05570 作者: Milad Malekzadeh,Elias Willberg,Jussi Torkko,Silviya Korpilo,Kamyar Hasanzadeh,Olle Järv,Tuuli Toivonen 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG) 备注:
点击查看摘要
Abstract:As digital tools increasingly shape spatial planning practices, understanding how different data sources reflect human experiences of urban environments is essential. Street View Imagery (SVI) and Public Participation GIS (PPGIS) represent two prominent approaches for capturing place-based perceptions that can support urban planning decisions, yet their comparability remains underexplored. This study investigates the alignment between SVI-based perceived attractiveness and residents’ reported experiences gathered via a city-wide PPGIS survey in Helsinki, Finland. Using participant-rated SVI data and semantic image segmentation, we trained a machine learning model to predict perceived attractiveness based on visual features. We compared these predictions to PPGIS-identified locations marked as attractive or unattractive, calculating agreement using two sets of strict and moderate criteria. Our findings reveal only partial alignment between the two datasets. While agreement (with a moderate threshold) reached 67% for attractive and 77% for unattractive places, agreement (with a strict threshold) dropped to 27% and 29%, respectively. By analysing a range of contextual variables, including noise, traffic, population presence, and land use, we found that non-visual cues significantly contributed to mismatches. The model failed to account for experiential dimensions such as activity levels and environmental stressors that shape perceptions but are not visible in images. These results suggest that while SVI offers a scalable and visual proxy for urban perception, it cannot fully substitute the experiential richness captured through PPGIS. We argue that both methods are valuable but serve different purposes; therefore, a more integrated approach is needed to holistically capture how people perceive urban environments.
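As a rough illustration of the strict/moderate agreement analysis (the exact criteria are not given in the abstract, so this construction is an assumption), one can score agreement as the share of PPGIS-marked locations whose model-predicted attractiveness clears a percentile cutoff; a stricter cutoff necessarily yields lower agreement:

```python
import numpy as np

def agreement(pred_scores, labels, percentile):
    """Share of PPGIS-marked 'attractive' points (labels == 1) whose
    predicted attractiveness reaches the given percentile cutoff."""
    cut = np.percentile(pred_scores, percentile)
    return float((pred_scores[labels == 1] >= cut).mean())

rng = np.random.default_rng(0)
scores = rng.random(200)                                   # model predictions
labels = (scores + 0.2 * rng.standard_normal(200) > 0.5).astype(int)  # noisy PPGIS marks

moderate = agreement(scores, labels, 50)
strict = agreement(scores, labels, 80)
print(round(moderate, 2), round(strict, 2))   # moderate >= strict
```

This mirrors the reported pattern in the abstract: agreement drops sharply as the criterion tightens, because fewer predictions clear the higher bar.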
[CV-219] Adaptive Sample-Level Framework Motivated by Distributionally Robust Optimization with Variance-Based Radius Assignment for Enhanced Neural Network Generalization Under Distribution Shift
Abstract:Distribution shifts and minority subpopulations frequently undermine the reliability of deep neural networks trained using Empirical Risk Minimization (ERM). Distributionally Robust Optimization (DRO) addresses this by optimizing for the worst-case risk within a neighborhood of the training distribution. However, conventional methods depend on a single, global robustness budget, which can lead to overly conservative models or a misallocation of robustness. We propose a variance-driven, adaptive, sample-level DRO (Var-DRO) framework that automatically identifies high-risk training samples and assigns a personalized robustness budget to each based on its online loss variance. Our formulation employs two-sided, KL-divergence-style bounds to constrain the ratio between adversarial and empirical weights for every sample. This results in a linear inner maximization problem over a convex polytope, which admits an efficient water-filling solution. To stabilize training, we introduce a warmup phase and a linear ramp schedule for the global cap on per-sample budgets, complemented by label smoothing for numerical robustness. Evaluated on CIFAR-10-C (corruptions), our method achieves the highest overall mean accuracy compared to ERM and KL-DRO. On Waterbirds, Var-DRO improves overall performance while matching or surpassing KL-DRO. On the original CIFAR-10 dataset, Var-DRO remains competitive, exhibiting the modest trade-off anticipated when prioritizing robustness. The proposed framework is unsupervised (requiring no group labels), straightforward to implement, theoretically sound, and computationally efficient.
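The per-sample budgeting idea can be sketched with a running-variance tracker: each sample's loss variance is maintained online (here via Welford's update) and converted into a normalized, capped budget. This is an illustrative reduction under assumed details; the actual method solves a water-filling inner maximization under KL-style bounds, which is not reproduced here.

```python
import numpy as np

class VarBudget:
    """Assign each sample a robustness budget proportional to the running
    variance of its loss, capped by a global per-sample limit."""
    def __init__(self, n, eps=1e-8):
        self.count = np.zeros(n)
        self.mean = np.zeros(n)
        self.m2 = np.zeros(n)   # Welford sum of squared deviations
        self.eps = eps

    def update(self, idx, losses):
        self.count[idx] += 1
        delta = losses - self.mean[idx]
        self.mean[idx] += delta / self.count[idx]
        self.m2[idx] += delta * (losses - self.mean[idx])

    def budgets(self, idx, cap):
        var = self.m2[idx] / np.maximum(self.count[idx] - 1, 1)
        b = var / (var.sum() + self.eps)    # normalize across the batch
        return np.minimum(b, cap)           # global cap (ramped in training)

vb = VarBudget(n=4)
for losses in ([1.0, 2.0, 1.0, 5.0], [1.1, 0.5, 1.0, 0.2]):
    vb.update(np.arange(4), np.array(losses))
b = vb.budgets(np.arange(4), cap=0.9)
print(b)   # the high-variance sample (index 3) receives the largest budget
```

Samples with volatile losses (here sample 3, whose loss swings from 5.0 to 0.2) are flagged as high-risk and receive most of the robustness budget, while stable samples get little or none.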
[CV-220] Automatic Extraction of Road Networks by using Teacher-Student Adaptive Structural Deep Belief Network and Its Application to Landslide Disaster
链接: https://arxiv.org/abs/2511.05567 作者: Shin Kamada,Takumi Ichimura 机构: Hiroshima City University (广岛市立大学); Prefectural University of Hiroshima (广岛县立大学) 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) 备注:
点击查看摘要
Abstract:An adaptive structural learning method for Restricted Boltzmann Machines (RBM) and Deep Belief Networks (DBN) has been developed as a prominent deep learning model. The neuron generation-annihilation algorithm in the RBM and the layer generation algorithm in the DBN build an optimal network structure for the given input during learning. In this paper, our model is applied to an automatic road-network recognition method called RoadTracer, which generates a road map of the ground surface from aerial photograph data. We propose a novel RoadTracer variant using a Teacher-Student ensemble learning model of the Adaptive DBN, since road maps contain many complicated features and thus require a detection model with high representational power. The experimental results show that the detection accuracy of the proposed model improved from 40.0% to 89.0% on average over the seven major cities in the test dataset. In addition, we applied our method to detecting passable roads after landslides caused by natural disasters, in order to rapidly identify transportation routes. For fast inference, a compact version of the trained model was deployed on a small embedded edge device as lightweight deep learning. We report detection results on satellite images taken before and after a rainfall disaster in Japan.
[CV-221] Efficient Online Continual Learning in Sensor-Based Human Activity Recognition
Abstract:Machine learning models for sensor-based human activity recognition (HAR) are expected to adapt post-deployment to recognize new activities and different ways of performing existing ones. To address this need, Online Continual Learning (OCL) mechanisms have been proposed, allowing models to update their knowledge incrementally as new data become available while preserving previously acquired information. However, existing OCL approaches for sensor-based HAR are computationally intensive and require extensive labeled samples to represent new changes. Recently, pre-trained model-based (PTM-based) OCL approaches have shown significant improvements in performance and efficiency for computer vision applications. These methods achieve strong generalization capabilities by pre-training complex models on large datasets, followed by fine-tuning on downstream tasks for continual learning. However, applying PTM-based OCL approaches to sensor-based HAR poses significant challenges due to the inherent heterogeneity of HAR datasets and the scarcity of labeled data in post-deployment scenarios. This paper introduces PTRN-HAR, the first successful application of PTM-based OCL to sensor-based HAR. Unlike prior PTM-based OCL approaches, PTRN-HAR pre-trains the feature extractor using contrastive loss with a limited amount of data. This extractor is then frozen during the streaming stage. Furthermore, it replaces the conventional dense classification layer with a relation module network. Our design not only significantly reduces the resource consumption required for model training while maintaining high performance, but also improves data efficiency by reducing the amount of labeled data needed for effective continual learning, as demonstrated through experiments on three public datasets, outperforming the state-of-the-art. The code can be found here: this https URL
[CV-222] In-Context Adaptation of VLMs for Few-Shot Cell Detection in Optical Microscopy
Abstract:Foundation vision-language models (VLMs) excel on natural images, but their utility for biomedical microscopy remains underexplored. In this paper, we investigate how in-context learning enables state-of-the-art VLMs to perform few-shot object detection when large annotated datasets are unavailable, as is often the case with microscopic images. We introduce the Micro-OD benchmark, a collection of 252 images specifically curated for in-context learning, with bounding-box annotations spanning 11 cell types across four sources, including two in-lab expert-annotated sets. We systematically evaluate eight VLMs under few-shot conditions and compare variants with and without implicit test-time reasoning tokens. We further implement a hybrid Few-Shot Object Detection (FSOD) pipeline that combines a detection head with a VLM-based few-shot classifier, which enhances the few-shot performance of recent VLMs on our benchmark. Across datasets, we observe that zero-shot performance is weak due to the domain gap; however, few-shot support consistently improves detection, with marginal gains after six shots. We observe that models with reasoning tokens are more effective for end-to-end localization, whereas simpler variants are more suitable for classifying pre-localized crops. Our results highlight in-context adaptation as a practical path for microscopy, and our benchmark provides a reproducible testbed for advancing open-vocabulary detection in biomedical imaging.
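The few-shot classification of pre-localized crops can be approximated by a nearest-prototype rule on embeddings, a standard few-shot baseline rather than the paper's exact module. The cell-type names, embedding dimension, and support construction below are all hypothetical:

```python
import numpy as np

def fewshot_classify(crop_emb, support_embs, support_labels):
    """Nearest-prototype few-shot classification of a pre-localized crop:
    each cell type's prototype is the mean of its support embeddings."""
    labels = np.array(support_labels)
    classes = sorted(set(support_labels))
    protos = np.stack([support_embs[labels == c].mean(axis=0) for c in classes])
    protos = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    crop = crop_emb / np.linalg.norm(crop_emb)
    return classes[int(np.argmax(protos @ crop))]   # cosine similarity

rng = np.random.default_rng(0)
proto_a, proto_b = rng.standard_normal(32), rng.standard_normal(32)
# Three noisy support shots per (hypothetical) cell type
support = np.stack([proto_a + 0.1 * rng.standard_normal(32) for _ in range(3)]
                   + [proto_b + 0.1 * rng.standard_normal(32) for _ in range(3)])
labels = ["neutrophil"] * 3 + ["rbc"] * 3
print(fewshot_classify(proto_a, support, labels))
```

In the described hybrid pipeline, a detection head would supply the crops and a VLM would supply the embeddings or class judgments; this sketch only shows why a handful of support shots per class can already carry the classification step.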
[CV-223] M2S2L: Mamba-based Multi-Scale Spatial-temporal Learning for Video Anomaly Detection
Abstract:Video anomaly detection (VAD) is an essential task in the image processing community with prospects in video surveillance, which faces fundamental challenges in balancing detection accuracy with computational efficiency. As video content becomes increasingly complex with diverse behavioral patterns and contextual scenarios, traditional VAD approaches struggle to provide robust assessment for modern surveillance systems. Existing methods either lack comprehensive spatial-temporal modeling or require excessive computational resources for real-time applications. In this regard, we present a Mamba-based multi-scale spatial-temporal learning (M2S2L) framework in this paper. The proposed method employs hierarchical spatial encoders operating at multiple granularities and multi-temporal encoders capturing motion dynamics across different time scales. We also introduce a feature decomposition mechanism to enable task-specific optimization for appearance and motion reconstruction, facilitating more nuanced behavioral modeling and quality-aware anomaly assessment. Experiments on three benchmark datasets demonstrate that M2S2L framework achieves 98.5%, 92.1%, and 77.9% frame-level AUCs on UCSD Ped2, CUHK Avenue, and ShanghaiTech respectively, while maintaining efficiency with 20.1G FLOPs and 45 FPS inference speed, making it suitable for practical surveillance deployment.
[CV-224] FilletRec: A Lightweight Graph Neural Network with Intrinsic Features for Automated Fillet Recognition
Abstract:Automated recognition and simplification of fillet features in CAD models is critical for CAE analysis, yet it remains an open challenge. Traditional rule-based methods lack robustness, while existing deep learning models suffer from poor generalization and low accuracy on complex fillets due to their generic design and inadequate training data. To address these issues, this paper proposes an end-to-end, data-driven framework specifically for fillet features. We first construct and release a large-scale, diverse benchmark dataset for fillet recognition to address the inadequacy of existing data. Based on it, we propose FilletRec, a lightweight graph neural network. The core innovation of this network is its use of pose-invariant intrinsic geometric features, such as curvature, enabling it to learn more fundamental geometric patterns and thereby achieve high-precision recognition of complex geometric topologies. Experiments show that FilletRec surpasses state-of-the-art methods in both accuracy and generalization, while using only 0.2%-5.4% of the parameters of baseline models, demonstrating high model efficiency. Finally, the framework completes the automated workflow from recognition to simplification by integrating an effective geometric simplification algorithm.
[CV-225] Compressing Multi-Task Model for Autonomous Driving via Pruning and Knowledge Distillation
链接: https://arxiv.org/abs/2511.05557 作者: Jiayuan Wang,Q. M. Jonathan Wu,Ning Zhang,Katsuya Suto,Lei Zhong 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
Abstract:Autonomous driving systems rely on panoptic perception to jointly handle object detection, drivable area segmentation, and lane line segmentation. Although multi-task learning is an effective way to integrate these tasks, its increasing model parameters and complexity make deployment on on-board devices difficult. To address this challenge, we propose a multi-task model compression framework that combines task-aware safe pruning with feature-level knowledge distillation. Our safe pruning strategy integrates Taylor-based channel importance with gradient conflict penalty to keep important channels while removing redundant and conflicting channels. To mitigate performance degradation after pruning, we further design a task head-agnostic distillation method that transfers intermediate backbone and encoder features from a teacher to a student model as guidance. Experiments on the BDD100K dataset demonstrate that our compressed model achieves a 32.7% reduction in parameters while segmentation performance shows negligible accuracy loss and only a minor decrease in detection (-1.2% for Recall and -1.8% for mAP50) compared to the teacher. The compressed model still runs at 32.7 FPS in real-time. These results show that combining pruning and knowledge distillation provides an effective compression solution for multi-task panoptic perception.
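A hedged sketch of the channel-scoring idea: first-order Taylor importance (|activation × gradient|) discounted when two tasks' gradients for a channel point in conflicting directions. The penalty form, the conflict measure, and λ below are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

def channel_scores(acts, grads_t1, grads_t2, lam=0.5):
    """Per-channel keep score: Taylor importance minus a gradient-conflict
    penalty. acts / grads_*: (C, H, W) activations and per-task gradients."""
    g = grads_t1 + grads_t2
    taylor = np.abs(acts * g).sum(axis=(1, 2))        # first-order Taylor importance
    # Conflict: negative cosine between the two tasks' channel gradients
    f1 = grads_t1.reshape(grads_t1.shape[0], -1)
    f2 = grads_t2.reshape(grads_t2.shape[0], -1)
    cos = (f1 * f2).sum(1) / (np.linalg.norm(f1, axis=1)
                              * np.linalg.norm(f2, axis=1) + 1e-8)
    return taylor * (1.0 - lam * np.maximum(-cos, 0.0))  # discount conflicting channels

rng = np.random.default_rng(0)
acts = rng.standard_normal((16, 4, 4))
s = channel_scores(acts,
                   rng.standard_normal((16, 4, 4)),
                   rng.standard_normal((16, 4, 4)))
keep = np.argsort(s)[-12:]   # e.g. prune the 4 lowest-scoring channels
print(s.shape, keep.shape)
```

Channels that are both unimportant (low Taylor score) and conflicting across tasks fall to the bottom of the ranking and become pruning candidates, matching the "redundant and conflicting channels" criterion described above.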
[CV-226] MCFCN: Multi-View Clustering via a Fusion-Consensus Graph Convolutional Network
链接: https://arxiv.org/abs/2511.05554 作者: Chenping Pei,Fadi Dornaika,Jingjun Bi 机构: University of the Basque Country (巴斯克大学); IKERBASQUE (巴斯克基金会); North China University of Water Resources and Electric Power (华北水利水电大学) 类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) 备注:
点击查看摘要
Abstract:Existing Multi-view Clustering (MVC) methods based on subspace learning focus on consensus representation learning while neglecting the inherent topological structure of the data. Despite the integration of Graph Neural Networks (GNNs) into MVC, their input graph structures remain susceptible to noise interference. Methods based on Multi-view Graph Refinement (MGRC) also have limitations, such as insufficient consideration of cross-view consistency, difficulty handling hard-to-distinguish samples in the feature space, and disjointed optimization caused by the graph construction algorithms. To address these issues, a Multi-View Clustering method via a Fusion-Consensus Graph Convolutional Network (MCFCN) is proposed. The network learns the consensus graph of multi-view data in an end-to-end manner and learns effective consensus representations through a view feature fusion model and a Unified Graph Structure Adapter (UGA). It introduces a Similarity Matrix Alignment Loss (SMAL) and a Feature Representation Alignment Loss (FRAL). Guided by the consensus, it optimizes view-specific graphs, preserves cross-view topological consistency, promotes the construction of intra-class edges, and achieves effective consensus representation learning with the help of the GCN to improve clustering performance. MCFCN demonstrates state-of-the-art performance on eight multi-view benchmark datasets, and its effectiveness is verified by extensive qualitative and quantitative experiments. The code will be provided at this https URL.
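One plausible form of a similarity-matrix alignment loss (the abstract does not give SMAL's exact formulation, so this is an assumption) penalizes the Frobenius distance between each view's cosine-similarity matrix and the consensus one:

```python
import numpy as np

def smal(view_feats, consensus_feats):
    """Similarity-matrix alignment sketch: mean squared Frobenius distance
    between each view's cosine-similarity matrix and the consensus one."""
    def sim(X):
        Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-8)
        return Xn @ Xn.T
    S_c = sim(consensus_feats)
    return sum(np.linalg.norm(sim(V) - S_c) ** 2 for V in view_feats) / len(view_feats)

rng = np.random.default_rng(0)
consensus = rng.standard_normal((10, 16))   # consensus representation (toy)
views = [consensus + 0.1 * rng.standard_normal((10, 16)) for _ in range(3)]

close_loss = smal(views, consensus)
far_loss = smal([rng.standard_normal((10, 16))], consensus)
print(close_loss < far_loss)   # views near the consensus align better
```

Minimizing such a term pulls every view's pairwise-similarity structure toward the consensus graph, which is the cross-view topological consistency the method aims to preserve.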
[CV-227] EVLP:Learning Unified Embodied Vision-Language Planner with Reinforced Supervised Fine-Tuning
链接: https://arxiv.org/abs/2511.05553 作者: Xinyan Cai,Shiguang Wu,Dafeng Chi,Yuzheng Zhuang,Xingyue Quan,Jianye Hao,Qiang Guan 机构: Institute of Automation, Chinese Academy of Sciences (CASIA); Huawei Noah’s Ark Lab 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) 备注:
点击查看摘要
Abstract:In complex embodied long-horizon manipulation tasks, effective task decomposition and execution require the synergistic integration of textual logical reasoning and visual-spatial imagination to ensure efficient and accurate operation. Current methods fail to adopt a unified generation framework for multimodal planning, leading to inconsistencies in the resulting plans. To address this challenge, we present EVLP (Embodied Vision-Language Planner), an innovative multimodal unified generation framework that jointly models linguistic reasoning and visual generation. Our approach achieves multimodal planning for long-horizon tasks through a novel training pipeline incorporating dynamic pretraining and reinforced alignment. Our core innovations consist of three key components: 1) Unified Multimodal Generation Framework: For understanding, we integrate semantic information with spatial features to provide comprehensive visual perception. For generation, we directly learn the joint distribution of discrete images for one-step visual synthesis, enabling coordinated language-visual modeling through learnable cross-modal attention mechanisms. 2) Dynamic Perception Pretraining: We propose a bidirectional dynamic alignment strategy employing inverse dynamics and forward dynamics tasks, effectively strengthening multimodal correlations within a unified feature space. 3) Reinforced Supervised Fine-Tuning: While conducting instruction-based fine-tuning in the unified generation space, we construct a reinforcement loss to align the spatial logic between textual actions and generated images, enabling the model to acquire spatially aware multimodal planning capabilities.
[CV-228] In-Context-Learning-Assisted Quality Assessment Vision-Language Models for Metal Additive Manufacturing
链接: https://arxiv.org/abs/2511.05551 作者: Qiaojie Zheng,Jiucai Zhang,Xiaoli Zhang 机构: Colorado School of Mines (科罗拉多矿业学院) 类目: Computer Vision and Pattern Recognition (cs.CV) 备注: 8 pages, 8 figures
点击查看摘要
Abstract:Vision-based quality assessment in additive manufacturing often requires dedicated machine learning models and application-specific datasets. However, data collection and model training can be expensive and time-consuming. In this paper, we leverage vision-language models’ (VLMs’) reasoning capabilities to assess the quality of printed parts and introduce in-context learning (ICL) to provide VLMs with necessary application-specific knowledge and demonstration samples. This method eliminates the requirement for large application-specific datasets for training models. We explored different sampling strategies for ICL to search for the optimal configuration that makes use of limited samples. We evaluated these strategies on two VLMs, Gemini-2.5-flash and Gemma3:27b, with quality assessment tasks in wire-laser direct energy deposition processes. The results show that ICL-assisted VLMs can reach quality classification accuracies similar to those of traditional machine learning models while requiring only a minimal number of samples. In addition, unlike traditional classification models that lack transparency, VLMs can generate human-interpretable rationales to enhance trust. Since there are no metrics to evaluate their interpretability in manufacturing applications, we propose two metrics, knowledge relevance and rationale validity, to evaluate the quality of VLMs’ supporting rationales. Our results show that ICL-assisted VLMs can address application-specific tasks with limited data, achieving relatively high accuracy while also providing valid supporting rationales for improved decision transparency.
[CV-229] Automated Invoice Data Extraction: Using LLM and OCR
Abstract:Conventional Optical Character Recognition (OCR) systems are challenged by varied invoice layouts, handwritten text, and low-quality scans, often because strong template dependencies restrict their flexibility across different document structures and layouts. Newer solutions employ advanced deep learning models, such as Convolutional Neural Networks (CNNs) and Transformers, together with domain-specific models, for better layout analysis and accuracy across varied document types. Large Language Models (LLMs) have revolutionized the core of extraction pipelines with sophisticated entity recognition and semantic comprehension, supporting complex contextual relationship mapping without explicit programming. Visual Named Entity Recognition (NER) capabilities permit extraction from invoice images with greater contextual sensitivity and much higher accuracy than older approaches. Current industry best practice uses hybrid architectures that blend OCR technology with LLMs for maximum scalability and minimal human intervention. This work introduces a holistic Artificial Intelligence (AI) platform combining OCR, deep learning, LLMs, and graph analytics to achieve unprecedented extraction quality and consistency.
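A toy version of the hybrid pipeline: an OCR stage produces text and an extraction stage pulls structured fields from it. Here the OCR output is stubbed and the extraction is pure regex; in the platform described above, an LLM would resolve the fields and relationships that rules miss. The invoice content and field names are invented for illustration.

```python
import re

def ocr_stub(image_path):
    """Stand-in for the OCR stage (any OCR engine's raw text output)."""
    return "INVOICE #INV-0042\nDate: 2024-03-01\nTotal Due: $1,234.56"

def extract_fields(text):
    """Rule-based extraction; in the full pipeline an LLM would handle
    variant layouts and map entity relationships beyond these patterns."""
    return {
        "invoice_id": re.search(r"#(\S+)", text).group(1),
        "date": re.search(r"Date:\s*([\d-]+)", text).group(1),
        "total": float(re.search(r"\$([\d,]+\.\d{2})", text).group(1).replace(",", "")),
    }

fields = extract_fields(ocr_stub("invoice.png"))
print(fields)   # {'invoice_id': 'INV-0042', 'date': '2024-03-01', 'total': 1234.56}
```

The template dependency criticized in the abstract is visible here: the regexes only work for this one layout, which is exactly the brittleness the LLM stage is meant to remove.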
[CV-230] oken Is All You Need: Cognitive Planning through Sparse Intent Alignment
【速读】: This paper targets the long-standing assumption in end-to-end autonomous driving (E2EAD) that exhaustive scene modeling is required: conventional approaches either depend on computationally expensive future-scene generation or on vision-language-action (VLA) systems constrained by Markov assumptions. The key to the proposed solution is a minimal representation built from sparse, semantically rich tokens, which enables effective planning without explicitly predicting future scenes. Experiments on the nuPlan benchmark show that, using only perception-informed BEV (bird’s-eye-view) representations, the method reaches 0.548 m ADE (average displacement error) even without future prediction, better than prior methods reporting about 0.75 m on nuScenes; further, conditioning trajectory decoding on predicted future tokens improves ADE to 0.479 m, a 12.6% gain over current-state baselines. The study also finds that an explicit reconstruction loss brings no benefit under reliable perception inputs and may even hurt performance, and it observes a phenomenon of temporal fuzziness: the model adaptively attends to task-relevant semantics rather than rigidly aligning to fixed timestamps, a cognitive advantage when planning under uncertainty. This “token is all you need” paradigm marks a cognitive shift from reconstructing the world to understanding it, laying a foundation for planning systems based on imagination rather than reaction.
链接: https://arxiv.org/abs/2511.05540 作者: Shiyao Sang 机构: 未知 类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO) 备注: 6 pages, 2 figures. Preprint exploring a new cognitive paradigm for autonomous planning
点击查看摘要
Abstract:We challenge the long-standing assumption that exhaustive scene modeling is required for high-performance end-to-end autonomous driving (E2EAD). Unlike world-model approaches that rely on computationally intensive future scene generation or vision-language-action (VLA) systems constrained by Markov assumptions, we show that a minimal set of semantically rich tokens is sufficient for effective planning. Experiments on the nuPlan benchmark (720 scenarios, over 11,000 samples) using perception-informed BEV representations yield three key findings: (1) even without future prediction, our sparse representation achieves 0.548 m ADE, comparable to or surpassing prior methods reporting around 0.75 m on nuScenes; (2) conditioning trajectory decoding on predicted future tokens reduces ADE to 0.479 m, a 12.6% improvement over current-state baselines; and (3) explicit reconstruction loss offers no benefit and may degrade performance under reliable perception inputs. Notably, we observe the emergence of temporal fuzziness, where the model adaptively attends to task-relevant semantics rather than aligning rigidly to fixed timestamps, providing a cognitive advantage for planning under uncertainty. Our “token is all you need” principle marks a paradigm shift from reconstructing the world to understanding it, laying a foundation for cognitively inspired systems that plan through imagination rather than reaction.
[CV-231] Randomized-MLP Regularization Improves Domain Adaptation and Interpretability in DINOv2
Abstract:Vision Transformers (ViTs), such as DINOv2, achieve strong performance across domains but often repurpose low-informative patch tokens in ways that reduce the interpretability of attention and feature maps. This challenge is especially evident in medical imaging, where domain shifts can degrade both performance and transparency. In this paper, we introduce Randomized-MLP (RMLP) regularization, a contrastive learning-based method that encourages more semantically aligned representations. We use RMLPs when fine-tuning DINOv2 to both medical and natural image modalities, showing that it improves or maintains downstream performance while producing more interpretable attention maps. We also provide a mathematical analysis of RMLPs, offering insights into its role in enhancing ViT-based models and advancing our understanding of contrastive learning.
[CV-232] CAMP-VQA: Caption-Embedded Multimodal Perception for No-Reference Quality Assessment of Compressed Video
链接: https://arxiv.org/abs/2511.07290 作者: Xinyi Wang,Angeliki Katsenou,Junxiao Shen,David Bull 机构: 未知 类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM) 备注: 14 pages, 6 figures
点击查看摘要
Abstract:The prevalence of user-generated content (UGC) on platforms such as YouTube and TikTok has rendered no-reference (NR) perceptual video quality assessment (VQA) vital for optimizing video delivery. Nonetheless, the characteristics of non-professional acquisition and the subsequent transcoding of UGC video on sharing platforms present significant challenges for NR-VQA. Although NR-VQA models attempt to infer mean opinion scores (MOS), their modeling of subjective scores for compressed content remains limited due to the absence of fine-grained perceptual annotations of artifact types. To address these challenges, we propose CAMP-VQA, a novel NR-VQA framework that exploits the semantic understanding capabilities of large vision-language models. Our approach introduces a quality-aware prompting mechanism that integrates video metadata (e.g., resolution, frame rate, bitrate) with key fragments extracted from inter-frame variations to guide the BLIP-2 pretraining approach in generating fine-grained quality captions. A unified architecture has been designed to model perceptual quality across three dimensions: semantic alignment, temporal characteristics, and spatial characteristics. These multimodal features are extracted and fused, then regressed to video quality scores. Extensive experiments on a wide variety of UGC datasets demonstrate that our model consistently outperforms existing NR-VQA methods, achieving improved accuracy without the need for costly manual fine-grained annotations. Our method achieves the best performance in terms of average rank and linear correlation (SRCC: 0.928, PLCC: 0.938) compared to state-of-the-art methods. The source code and trained models, along with a user-friendly demo, are available at: this https URL.
zh
[CV-233] Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models
Abstract:Large language models (LLMs) have recently achieved impressive results in speech recognition across multiple modalities, including Auditory Speech Recognition (ASR), Visual Speech Recognition (VSR), and Audio-Visual Speech Recognition (AVSR). Despite this progress, current LLM-based approaches typically address each task independently, training separate models that increase computational and deployment costs while missing potential cross-task synergies. They also rely on fixed-rate token compression, which restricts flexibility in balancing accuracy with efficiency. These limitations highlight the need for a unified framework that can support ASR, VSR, and AVSR while enabling elastic inference. To this end, we present Omni-AVSR, a unified audio-visual LLM that combines efficient multi-granularity training with parameter-efficient adaptation. Specifically, we adapt the matryoshka representation learning paradigm to efficiently train across multiple audio and visual granularities, reducing the inherent training cost. Furthermore, we explore three LoRA-based strategies for adapting the backbone LLM, balancing shared and task-specific specialization. Experiments on LRS2 and LRS3 show that Omni-AVSR achieves comparable or superior accuracy to state-of-the-art baselines while training a single model with substantially lower training and deployment cost. The model also remains robust under acoustic noise, and we analyze its scaling behavior as LLM size increases, providing insights into the trade-off between performance and efficiency.
zh
链接: https://arxiv.org/abs/2511.07094 作者: Necati Sefercioglu,Mehmet Ozan Unal,Metin Ertas,Isa Yildirim 机构: Istanbul Technical University (伊斯坦布尔技术大学) 类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
Abstract:Deep learning-based low-dose computed tomography reconstruction methods already achieve high performance on standard image quality metrics like peak signal-to-noise ratio and structural similarity index measure. Yet, they frequently fail to preserve the critical anatomical details needed for diagnostic tasks. This fundamental limitation hinders their clinical applicability despite their high metric scores. We propose a novel task-adaptive reconstruction framework that addresses this gap by incorporating a frozen pre-trained task network as a regularization term in the reconstruction loss function. Unlike existing joint-training approaches that simultaneously optimize both reconstruction and task networks, and risk diverging from satisfactory reconstructions, our method leverages a pre-trained task model to guide reconstruction training while still maintaining diagnostic quality. We validate our framework on a liver and liver tumor segmentation task. Our task-adaptive models achieve Dice scores up to 0.707, approaching the performance of full-dose scans (0.874), and substantially outperforming joint-training approaches (0.331) and traditional reconstruction methods (0.626). Critically, our framework can be integrated into any existing deep learning-based reconstruction model through simple loss function modification, enabling widespread adoption for task-adaptive optimization in clinical practice. Our codes are available at: this https URL
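The core idea above, a frozen, pre-trained task network used as a regularization term inside the reconstruction loss, can be sketched in a few lines. The NumPy sketch below uses assumed names (`frozen_task_net`, `task_adaptive_loss`, weight `lam`) and a trivial stand-in "network"; the actual framework trains a deep reconstruction model with gradients flowing through the frozen task model, which this toy does not do.

```python
import numpy as np

def frozen_task_net(x, W):
    """Stand-in for a frozen, pre-trained segmentation network (hypothetical)."""
    return 1.0 / (1.0 + np.exp(-(x @ W)))  # sigmoid "segmentation" scores

def task_adaptive_loss(recon, full_dose, seg_target, W, lam=0.5):
    """Reconstruction loss plus a task term computed through the frozen net.
    Only the reconstruction model would receive gradients; W stays fixed."""
    recon_loss = np.mean((recon - full_dose) ** 2)
    task_loss = np.mean((frozen_task_net(recon, W) - seg_target) ** 2)
    return recon_loss + lam * task_loss

rng = np.random.default_rng(0)
full_dose = rng.normal(size=(8, 4))                  # toy "full-dose" images
recon = full_dose + 0.1 * rng.normal(size=(8, 4))    # toy low-dose reconstruction
W = rng.normal(size=(4, 2))                          # frozen task-net weights
seg_target = (frozen_task_net(full_dose, W) > 0.5).astype(float)
loss = task_adaptive_loss(recon, full_dose, seg_target, W)
```

Setting `lam=0` recovers a plain reconstruction loss, which is how the trade-off between image fidelity and task performance would be tuned.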
zh
[CV-235] TauFlow: Dynamic Causal Constraint for Complexity-Adaptive Lightweight Segmentation
链接: https://arxiv.org/abs/2511.07057 作者: Zidong Chen,Fadratul Hafinaz Hassan 机构: 未知 类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV) 备注: 42 pages and 9 figures
点击查看摘要
Abstract:Deploying lightweight medical image segmentation models on edge devices presents two major challenges: 1) efficiently handling the stark contrast between lesion boundaries and background regions, and 2) the sharp drop in accuracy that occurs when pursuing extremely lightweight designs (e.g., 0.5M parameters). To address these problems, this paper proposes TauFlow, a novel lightweight segmentation model. The core of TauFlow is a dynamic feature response strategy inspired by brain-like mechanisms. This is achieved through two key innovations: the Convolutional Long-Time Constant Cell (ConvLTC), which dynamically regulates the feature update rate to “slowly” process low-frequency backgrounds and “quickly” respond to high-frequency boundaries; and the STDP Self-Organizing Module, which significantly mitigates feature conflicts between the encoder and decoder, reducing the conflict rate from approximately 35%-40% to 8%-10%.
zh
[CV-236] RRTS Dataset: A Benchmark Colonoscopy Dataset from Resource-Limited Settings for Computer-Aided Diagnosis Research
链接: https://arxiv.org/abs/2511.06769 作者: Ridoy Chandra Shil,Ragib Abid,Tasnia Binte Mamun,Samiul Based Shuvo,Masfique Ahmed Bhuiyan,Jahid Ferdous 机构: Bangladesh University of Engineering and Technology (BUET)(孟加拉国工程技术大学); Dhaka Medical College(达卡医学院) 类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
Abstract:Background and Objective: Colorectal cancer prevention relies on early detection of polyps during colonoscopy. Existing public datasets, such as CVC-ClinicDB and Kvasir-SEG, provide valuable benchmarks but are limited by small sample sizes, curated image selection, or lack of real-world artifacts. There remains a need for datasets that capture the complexity of clinical practice, particularly in resource-constrained settings. Methods: We introduce the BUET Polyp Dataset (BPD), a dataset of colonoscopy images collected using Olympus 170 and Pentax i-Scan series endoscopes under routine clinical conditions. The dataset contains images with corresponding expert-annotated binary masks, reflecting diverse challenges such as motion blur, specular highlights, stool artifacts, blood, and low-light frames. Annotations were manually reviewed by clinical experts to ensure quality. To demonstrate baseline performance, we provide benchmark results for classification using VGG16, ResNet50, and InceptionV3, and for segmentation using UNet variants with VGG16, ResNet34, and InceptionV4 backbones. Results: The dataset comprises 1,288 images with polyps from 164 patients with corresponding ground-truth masks and 1,657 polyp-free images from 31 patients. Benchmarking experiments achieved up to 90.8% accuracy for binary classification (VGG16) and a maximum Dice score of 0.64 with InceptionV4-UNet for segmentation. Performance was lower compared to curated datasets, reflecting the real-world difficulty of images with artifacts and variable quality.
zh
[CV-237] Hierarchical Spatial-Frequency Aggregation for Spectral Deconvolution Imaging
链接: https://arxiv.org/abs/2511.06751 作者: Tao Lv,Daoming Zhou,Chenglong Huang,Chongde Zi,Linsen Chen,Xun Cao 机构: 未知 类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV) 备注: Under Review at TPAMI
点击查看摘要
Abstract:Computational spectral imaging (CSI) achieves real-time hyperspectral imaging through co-designed optics and algorithms, but typical CSI methods suffer from a bulky footprint and limited fidelity. Therefore, Spectral Deconvolution Imaging (SDI) methods based on PSF engineering have recently been proposed to achieve high-fidelity, compact CSI designs. However, the composite convolution-integration operations of SDI render the normal-equation coefficient matrix scene-dependent, which hampers the efficient exploitation of imaging priors and poses challenges for accurate reconstruction. To tackle the inherent data-dependent operators in SDI, we introduce a Hierarchical Spatial-Spectral Aggregation Unfolding Framework (HSFAUF). By decomposing subproblems and projecting them into the frequency domain, HSFAUF transforms nonlinear processes into linear mappings, thereby enabling efficient solutions. Furthermore, to integrate spatial-spectral priors during iterative refinement, we propose a Spatial-Frequency Aggregation Transformer (SFAT), which explicitly aggregates information across spatial and frequency domains. By integrating SFAT into HSFAUF, we develop a Transformer-based deep unfolding method, the Hierarchical Spatial-Frequency Aggregation Unfolding Transformer (HSFAUT), to solve the inverse problem of SDI. Systematic simulated and real experiments show that HSFAUT surpasses SOTA methods with lower memory and computational costs, while exhibiting optimal performance on different SDI systems.
zh
链接: https://arxiv.org/abs/2511.06425 作者: Brian B. Avants,Nicholas J. Tustison,James R Stone(Department of Radiology and Medical Imaging University of Virginia, Charlottesville, VA) 机构: 未知 类目: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Methodology (stat.ME) 备注:
点击查看摘要
Abstract:Interpretable representation learning is a central challenge in modern machine learning, particularly in high-dimensional settings such as neuroimaging, genomics, and text analysis. Current methods often struggle to balance the competing demands of interpretability and model flexibility, limiting their effectiveness in extracting meaningful insights from complex data. We introduce Non-negative Stiefel Approximating Flow (NSA-Flow), a general-purpose matrix estimation framework that unifies ideas from sparse matrix factorization, orthogonalization, and constrained manifold learning. NSA-Flow enforces structured sparsity through a continuous balance between reconstruction fidelity and column-wise decorrelation, parameterized by a single tunable weight. The method operates as a smooth flow near the Stiefel manifold with proximal updates for non-negativity and adaptive gradient control, yielding representations that are simultaneously sparse, stable, and interpretable. Unlike classical regularization schemes, NSA-Flow provides an intuitive geometric mechanism for manipulating sparsity at the level of global structure while simplifying latent features. We demonstrate that the NSA-Flow objective can be optimized smoothly and integrates seamlessly with existing pipelines for dimensionality reduction while improving interpretability and generalization in both simulated and real biomedical data. Empirical validation on the Golub leukemia dataset and in Alzheimer’s disease demonstrates that the NSA-Flow constraints can maintain or improve performance over related methods with little additional methodological effort. NSA-Flow offers a scalable, general-purpose tool for interpretable ML, applicable across data science domains.
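As a rough illustration of the mechanism described, the toy NumPy step below trades off a reconstruction term against column-wise decorrelation (proximity to the Stiefel constraint V^T V = I), followed by a non-negativity projection. The objective, step sizes, and function names are assumptions made for illustration; the paper's actual flow, proximal updates, and adaptive gradient control are more refined.

```python
import numpy as np

def nsa_flow_step(X, V, w=0.5, lr=1e-3):
    """One illustrative update: descend on reconstruction error and on
    column decorrelation, balanced by the single weight w, then project
    onto the non-negative orthant (the proximal step for non-negativity)."""
    R = X - X @ V @ V.T                       # reconstruction residual
    grad_recon = -2 * (X.T @ R @ V + R.T @ X @ V)
    G = V.T @ V - np.eye(V.shape[1])          # deviation from V^T V = I
    grad_decor = 4 * V @ G
    V = V - lr * ((1 - w) * grad_recon + w * grad_decor)
    return np.maximum(V, 0.0)                 # enforce non-negativity

rng = np.random.default_rng(1)
X = np.abs(rng.normal(size=(20, 6)))
X /= np.linalg.norm(X)                        # keep the toy problem well-scaled
V = np.abs(rng.normal(size=(6, 3)))
for _ in range(100):
    V = nsa_flow_step(X, V)
```

Sweeping `w` between 0 and 1 moves the solution between pure reconstruction and strict near-orthogonality, which is the single-knob trade-off the abstract describes.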
zh
[CV-239] urbo-DDCM: Fast and Flexible Zero-Shot Diffusion-Based Image Compression
链接: https://arxiv.org/abs/2511.06424 作者: Amit Vaisman,Guy Ohayon,Hila Manor,Michael Elad,Tomer Michaeli 机构: Technion – Israel Institute of Technology (以色列理工学院); Flatiron Institute, Simons Foundation (西蒙斯基金会) 类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP); Machine Learning (stat.ML) 备注: Code is available at this https URL
点击查看摘要
Abstract:While zero-shot diffusion-based compression methods have seen significant progress in recent years, they remain notoriously slow and computationally demanding. This paper presents an efficient zero-shot diffusion-based compression method that runs substantially faster than existing methods, while maintaining performance that is on par with the state-of-the-art techniques. Our method builds upon the recently proposed Denoising Diffusion Codebook Models (DDCMs) compression scheme. Specifically, DDCM compresses an image by sequentially choosing the diffusion noise vectors from reproducible random codebooks, guiding the denoiser’s output to reconstruct the target image. We modify this framework with Turbo-DDCM, which efficiently combines a large number of noise vectors at each denoising step, thereby significantly reducing the number of required denoising operations. This modification is also coupled with an improved encoding protocol. Furthermore, we introduce two flexible variants of Turbo-DDCM, a priority-aware variant that prioritizes user-specified regions and a distortion-controlled variant that compresses an image based on a target PSNR rather than a target BPP. Comprehensive experiments position Turbo-DDCM as a compelling, practical, and flexible image compression scheme.
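The idea of combining many codebook noise vectors at a single step can be illustrated with generic matching pursuit over a reproducible random codebook. This is not the Turbo-DDCM protocol (which quantizes and entropy-codes its choices differently); it is only a sketch, under assumed names, of why combining several vectors per step approximates a target direction far better than picking one.

```python
import numpy as np

def combine_noise(codebook, target, k):
    """Greedy matching pursuit: pick k codebook vectors (with coefficients)
    whose combination best approximates `target`. In a compression setting,
    the transmitted payload would be the chosen indices and (quantized)
    coefficients; this sketch skips that quantization entirely."""
    residual = target.astype(float).copy()
    picks = []
    for _ in range(k):
        scores = codebook @ residual                      # correlation with each vector
        i = int(np.argmax(np.abs(scores)))
        coef = scores[i] / (codebook[i] @ codebook[i])    # least-squares coefficient
        residual = residual - coef * codebook[i]
        picks.append((i, coef))
    return picks, residual

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 16))   # 64 reproducible noise vectors
target = rng.normal(size=16)           # ideal noise for one denoising step
picks, residual = combine_noise(codebook, target, k=8)
```

Because both encoder and decoder can regenerate the same random codebook from a shared seed, only the selections need to be stored, which is the core compression trick shared by DDCM-style schemes.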
zh
[CV-240] Cross-Modal Fine-Tuning of 3D Convolutional Foundation Models for ADHD Classification with Low-Rank Adaptation
链接: https://arxiv.org/abs/2511.06163 作者: Jyun-Ping Kao,Shinyeong Rho,Shahar Lazarev,Hyun-Hae Cho,Fangxu Xing,Taehoon Shin,C.-C. Jay Kuo,Jonghye Woo 机构: 未知 类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph) 备注:
点击查看摘要
Abstract:Early diagnosis of attention-deficit/hyperactivity disorder (ADHD) in children plays a crucial role in improving outcomes in education and mental health. Diagnosing ADHD using neuroimaging data, however, remains challenging due to heterogeneous presentations and overlapping symptoms with other conditions. To address this, we propose a novel parameter-efficient transfer learning approach that adapts a large-scale 3D convolutional foundation model, pre-trained on CT images, to an MRI-based ADHD classification task. Our method introduces Low-Rank Adaptation (LoRA) in 3D by factorizing 3D convolutional kernels into 2D low-rank updates, dramatically reducing trainable parameters while achieving superior performance. In a five-fold cross-validated evaluation on a public diffusion MRI database, our 3D LoRA fine-tuning strategy achieved state-of-the-art results, with one model variant reaching 71.9% accuracy and another attaining an AUC of 0.716. Both variants use only 1.64 million trainable parameters (over 113x fewer than a fully fine-tuned foundation model). Our results represent one of the first successful cross-modal (CT-to-MRI) adaptations of a foundation model in neuroimaging, establishing a new benchmark for ADHD classification while greatly improving efficiency.
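The parameter accounting behind the reported >113x reduction can be illustrated by viewing a convolutional kernel tensor as a flattened 2D matrix and learning a rank-r update, LoRA-style. All sizes below are hypothetical, and the paper's specific factorization of 3D kernels into 2D low-rank updates may differ from this generic sketch.

```python
import numpy as np

# Hypothetical 3D conv kernel shape: (out_ch, in_ch, kD, kH, kW)
out_ch, in_ch, kD, kH, kW = 64, 32, 3, 3, 3
rank = 4

# Full fine-tuning would update every kernel weight:
full_params = out_ch * in_ch * kD * kH * kW

# LoRA-style update: view the kernel as a 2D matrix and learn a low-rank
# delta  W' = W + B @ A  (B: m x r, A: r x n). Only B and A are trainable.
m, n = out_ch, in_ch * kD * kH * kW
lora_params = m * rank + rank * n

rng = np.random.default_rng(0)
W = rng.normal(size=(m, n))                 # frozen pre-trained weights
B = rng.normal(size=(m, rank)) * 0.01       # trainable low-rank factors
A = rng.normal(size=(rank, n)) * 0.01
W_adapted = W + B @ A                       # can be merged at inference time
```

The ratio `full_params / lora_params` grows with kernel size and channel count, which is why low-rank adaptation of large 3D backbones is so parameter-efficient.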
zh
链接: https://arxiv.org/abs/2511.05873 作者: Tong Chen,Xinyu Ma,Long Bai,Wenyang Wang,Sun Yue,Luping Zhou 机构: 未知 类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO) 备注:
点击查看摘要
Abstract:Endoscopic images often suffer from diverse and co-occurring degradations such as low lighting, smoke, and bleeding, which obscure critical clinical details. Existing restoration methods are typically task-specific and often require prior knowledge of the degradation type, limiting their robustness in real-world clinical use. We propose EndoIR, an all-in-one, degradation-agnostic diffusion-based framework that restores multiple degradation types using a single model. EndoIR introduces a Dual-Domain Prompter that extracts joint spatial-frequency features, coupled with an adaptive embedding that encodes both shared and task-specific cues as conditioning for denoising. To mitigate feature confusion in conventional concatenation-based conditioning, we design a Dual-Stream Diffusion architecture that processes clean and degraded inputs separately, with a Rectified Fusion Block integrating them in a structured, degradation-aware manner. Furthermore, Noise-Aware Routing Block improves efficiency by dynamically selecting only noise-relevant features during denoising. Experiments on SegSTRONG-C and CEC datasets demonstrate that EndoIR achieves state-of-the-art performance across multiple degradation scenarios while using fewer parameters than strong baselines, and downstream segmentation experiments confirm its clinical utility.
zh
[CV-242] HarmoQ: Harmonized Post-Training Quantization for High-Fidelity Image
链接: https://arxiv.org/abs/2511.05868 作者: Hongjun Wang,Jiyuan Chen,Xuan Song,Yinqiang Zheng 机构: The University of Tokyo (东京大学); The Hong Kong Polytechnic University (香港理工大学); Jilin University (吉林大学) 类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
Abstract:Post-training quantization offers an efficient pathway to deploy super-resolution models, yet existing methods treat weight and activation quantization independently, missing their critical interplay. Through controlled experiments on SwinIR, we uncover a striking asymmetry: weight quantization primarily degrades structural similarity, while activation quantization disproportionately affects pixel-level accuracy. This stems from their distinct roles–weights encode learned restoration priors for textures and edges, whereas activations carry input-specific intensity information. Building on this insight, we propose HarmoQ, a unified framework that harmonizes quantization across components through three synergistic steps: structural residual calibration proactively adjusts weights to compensate for activation-induced detail loss, harmonized scale optimization analytically balances quantization difficulty via closed-form solutions, and adaptive boundary refinement iteratively maintains this balance during optimization. Experiments show HarmoQ achieves substantial gains under aggressive compression, outperforming prior art by 0.46 dB on Set5 at 2-bit while delivering 3.2x speedup and 4x memory reduction on A100 GPUs. This work provides the first systematic analysis of weight-activation coupling in super-resolution quantization and establishes a principled solution for efficient high-quality image restoration.
zh
[CV-243] Training-Free Adaptive Quantization for Variable Rate Image Coding for Machines
【速读】:该论文旨在解决图像编码用于机器(Image Coding for Machines, ICM)中现有可变比特率学习图像压缩(Learned Image Compression, LIC)方法存在的局限性,即大多数LIC框架采用固定比特率且需为每个目标比特率单独训练,导致部署复杂性和计算开销高,且可变比特率控制在ICM场景下尚未得到充分探索。解决方案的关键在于提出一种无需训练的自适应量化步长控制机制,通过利用超先验网络(hyperprior network)预测的通道级熵依赖关系和空间尺度参数,实现对语义重要区域的精细保留与非关键区域的粗粒度量化,从而以单一参数连续调节比特率,显著提升压缩效率——实验表明相较非自适应可变比特率方法最高可获得11.07%的BD-rate节省。
链接: https://arxiv.org/abs/2511.05836 作者: Yui Tatsumi,Ziyue Zeng,Hiroshi Watanabe 机构: 未知 类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV) 备注:
点击查看摘要
Abstract:Image Coding for Machines (ICM) has become increasingly important with the rapid integration of computer vision into real-world applications. However, most ICM frameworks utilize learned image compression (LIC) models that operate at a fixed rate and require separate training for each target bitrate, which may limit their practical applications. Existing variable rate LIC approaches mitigate this limitation but typically depend on training, increasing computational cost and deployment complexity. Moreover, variable rate control has not been thoroughly explored for ICM. To address these challenges, we propose a training-free, adaptive quantization step size control scheme that enables flexible bitrate adjustment. By leveraging both channel-wise entropy dependencies and spatial scale parameters predicted by the hyperprior network, the proposed method preserves semantically important regions while coarsely quantizing less critical areas. The bitrate can be continuously controlled through a single parameter. Experimental results demonstrate the effectiveness of our proposed method, achieving up to 11.07% BD-rate savings over the non-adaptive variable rate method.
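One plausible reading of the step-size control, sketched below with assumed variable names, scales a base step `q_base` (the single rate-control parameter) by the hyperprior's predicted scale `sigma`. The paper's exact mapping between channel-wise entropy dependencies, spatial scale parameters, and step size is not reproduced here; this only shows the mechanics of spatially adaptive, training-free quantization.

```python
import numpy as np

def adaptive_quantize(latent, sigma, q_base, alpha=1.0):
    """Training-free sketch: per-element uniform quantization whose step is
    modulated by the hyperprior scale `sigma` (direction of the modulation
    is an assumption). `q_base` sweeps the overall rate with one parameter.
    Returns the integer indices (to entropy-code) and the dequantized latent."""
    step = q_base * (sigma ** alpha)
    q = np.round(latent / step)
    return q, q * step

rng = np.random.default_rng(0)
latent = rng.normal(size=(1, 8, 8))                   # toy analysis-transform output
sigma = np.abs(rng.normal(size=(1, 8, 8))) + 0.1      # toy hyperprior scales
q_fine, deq_fine = adaptive_quantize(latent, sigma, q_base=0.1)
q_coarse, deq_coarse = adaptive_quantize(latent, sigma, q_base=1.0)
```

Sweeping `q_base` continuously trades distortion for rate without retraining, which is the practical point of the method.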
zh
[CV-244] ConnectomeBench: Can LLM s Proofread the Connectome? NEURIPS2025
链接: https://arxiv.org/abs/2511.05542 作者: Jeff Brown,Andrew Kirjner,Annika Vivekananthan,Ed Boyden 机构: 未知 类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) 备注: To appear in NeurIPS 2025 Datasets and Benchmarks Track
点击查看摘要
Abstract:Connectomics - the mapping of neural connections in an organism’s brain - currently requires extraordinary human effort to proofread the data collected from imaging and machine-learning assisted segmentation. With the growing excitement around using AI agents to automate important scientific tasks, we explore whether current AI systems can perform multiple tasks necessary for data proofreading. We introduce ConnectomeBench, a multimodal benchmark evaluating large language model (LLM) capabilities in three critical proofreading tasks: segment type identification, split error correction, and merge error detection. Using expert annotated data from two large open-source datasets - a cubic millimeter of mouse visual cortex and the complete Drosophila brain - we evaluate proprietary multimodal LLMs including Claude 3.7/4 Sonnet, o4-mini, GPT-4.1, GPT-4o, as well as open source models like InternVL-3 and NVLM. Our results demonstrate that current models achieve surprisingly high performance in segment identification (52-82% balanced accuracy vs. 20-25% chance) and binary/multiple choice split error correction (75-85% accuracy vs. 50% chance) while generally struggling on merge error identification tasks. Overall, while the best models still lag behind expert performance, they demonstrate promising capabilities that could eventually enable them to augment and potentially replace human proofreading in connectomics. Project page: this https URL and Dataset this https URL
zh
[CV-245] Selective Diabetic Retinopathy Screening with Accuracy-Weighted Deep Ensembles and Entropy-Guided Abstention
Abstract:Diabetic retinopathy (DR), a microvascular complication of diabetes and a leading cause of preventable blindness, is projected to affect more than 130 million individuals worldwide by 2030. Early identification is essential to reduce irreversible vision loss, yet current diagnostic workflows rely on methods such as fundus photography and expert review, which remain costly and resource-intensive. Combined with DR’s asymptomatic nature, this results in an underdiagnosis rate of approximately 25 percent. Although convolutional neural networks (CNNs) have demonstrated strong performance in medical imaging tasks, limited interpretability and the absence of uncertainty quantification restrict clinical reliability. Therefore, in this study, a deep ensemble learning framework integrated with uncertainty estimation is introduced to improve robustness, transparency, and scalability in DR detection. The ensemble incorporates seven CNN architectures (ResNet-50, DenseNet-121, MobileNetV3 Small and Large, and EfficientNet B0, B2, and B3), whose outputs are fused through an accuracy-weighted majority voting strategy. A probability-weighted entropy metric quantifies prediction uncertainty, enabling low-confidence samples to be excluded or flagged for additional review. Training and validation on 35,000 EyePACS retinal fundus images produced an unfiltered accuracy of 93.70 percent (F1 = 0.9376). Uncertainty filtering was then applied to remove low-confidence samples, raising maximum accuracy to 99.44 percent (F1 = 0.9932). The framework shows that uncertainty-aware, accuracy-weighted ensembling improves reliability without hindering performance. With confidence-calibrated outputs and a tunable accuracy-coverage trade-off, it offers a generalizable paradigm for deploying trustworthy AI diagnostics in high-risk care.
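The fusion-and-abstention logic described above, accuracy-weighted soft voting plus an entropy threshold, can be sketched as follows. The weighting and the plain Shannon entropy used here are simplifying assumptions; the paper's exact "probability-weighted entropy" metric may differ.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of each probability row (nats)."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

def weighted_vote(probs, val_acc, tau):
    """probs: (n_models, n_samples, n_classes) softmax outputs.
    val_acc: validation accuracy per model, used as its voting weight.
    Samples whose fused distribution has entropy above `tau` are abstained
    (flagged for expert review) instead of being predicted."""
    w = np.asarray(val_acc) / np.sum(val_acc)
    fused = np.tensordot(w, probs, axes=1)        # (n_samples, n_classes)
    preds = fused.argmax(axis=-1)
    abstain = entropy(fused) > tau
    return preds, abstain

# Toy ensemble of 3 models on 2 samples (binary DR / no-DR):
probs = np.array([
    [[0.9, 0.1], [0.5, 0.5]],
    [[0.8, 0.2], [0.45, 0.55]],
    [[0.85, 0.15], [0.6, 0.4]],
])
preds, abstain = weighted_vote(probs, val_acc=[0.93, 0.90, 0.91], tau=0.5)
```

Raising `tau` increases coverage at the cost of accuracy on the retained set, which is the accuracy-coverage trade-off the abstract mentions.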
zh
[CV-246] sMRI-based Brain Age Estimation in MCI using Persistent Homology
链接: https://arxiv.org/abs/2511.05520 作者: Debanjali Bhattacharya,Neelam Sinha 机构: 未知 类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV) 备注:
点击查看摘要
Abstract:In this study, we propose the use of persistent homology, specifically Betti curves, for brain age prediction and for distinguishing between healthy and pathological aging. The proposed framework is applied to 100 structural MRI scans from the publicly available ADNI dataset. Our results indicate that Betti curve features, particularly those from dimension-1 (connected components) and dimension-2 (1D holes), effectively capture structural brain alterations associated with aging. Furthermore, clinical features are grouped into three categories based on their correlation, or lack thereof, with (i) predicted brain age and (ii) chronological age. The findings demonstrate that this approach successfully differentiates normal from pathological aging and provides a novel framework for understanding how structural brain changes relate to cognitive impairment. The proposed method serves as a foundation for developing potential biomarkers for early detection and monitoring of cognitive decline.
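To make "Betti curve" concrete: a Betti-0 curve records, for each threshold t, the number of connected components of the sublevel set {x : f(x) <= t}. The 1D toy below (with assumed function names) shows the idea; real pipelines compute such curves on 3D MRI volumes with a persistent-homology library rather than this hand-rolled version.

```python
import numpy as np

def betti0_curve(values, thresholds):
    """Betti-0 curve of sublevel sets of a 1D signal: for each threshold t,
    count the connected components (maximal runs) of {i : values[i] <= t}.
    In 1D a component count is just the number of run starts in the mask."""
    values = np.asarray(values)
    curve = []
    for t in thresholds:
        mask = values <= t
        # a component starts where mask is True and the previous entry is not
        starts = mask & ~np.concatenate(([False], mask[:-1]))
        curve.append(int(starts.sum()))
    return curve

signal = [0.0, 2.0, 0.5, 3.0, 0.2]
curve = betti0_curve(signal, thresholds=[1.0, 2.5, 3.5])  # components merge as t grows
```

Sweeping the threshold and stacking the counts yields the curve used as a feature vector; higher-dimensional Betti curves (loops, voids) are computed analogously from the filtration.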
zh
人工智能
[AI-0] Using Vision Language Models as Closed-Loop Symbolic Planners for Robotic Applications: A Control-Theoretic Perspective
[AI-2] Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization NEURIPS2025
【速读】:该论文旨在解决生成式 AI 模型在面对更复杂、更长链式推理(Chain-of-Thought, CoT)任务时的外推能力问题,即模型能否将已学习的推理模式推广到更难或更长的问题上。其核心解决方案在于通过理论分析揭示了变压器(transformer)模型在梯度下降优化下学习合成状态追踪任务时,其注意力机制如何受问题代数结构调控,并由此决定推理长度的泛化能力。关键突破在于证明了注意力集中机制(attention concentration)能够连接注意力层的鲁棒性与长上下文推理任务的结构特性,从而实现对 NC1-complete 问题的可证明学习,显著超越此前局限于 TC0 类问题的限制,且提出了递归自训练方案以扩展有限推理长度的模型能力。
链接: https://arxiv.org/abs/2511.07378 作者: Yu Huang,Zixin Wen,Aarti Singh,Yuejie Chi,Yuxin Chen 机构: 未知 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML) 备注: This is the full version of a paper published at NeurIPS 2025
点击查看摘要
Abstract:The ability to reason lies at the core of artificial intelligence (AI), and challenging problems usually call for deeper and longer reasoning to tackle. A crucial question about AI reasoning is whether models can extrapolate learned reasoning patterns to solve harder tasks with longer chain-of-thought (CoT). In this work, we present a theoretical analysis of transformers learning on synthetic state-tracking tasks with gradient descent. We mathematically prove how the algebraic structure of state-tracking problems governs the degree of extrapolation of the learned CoT. Specifically, our theory characterizes the length generalization of transformers through the mechanism of attention concentration, linking the retrieval robustness of the attention layer to the state-tracking task structure of long-context reasoning. Moreover, for transformers with limited reasoning length, we prove that a recursive self-training scheme can progressively extend the range of solvable problem lengths. To our knowledge, we provide the first optimization guarantee that constant-depth transformers provably learn $\mathsf{NC}^1$-complete problems with CoT, significantly going beyond prior art confined to $\mathsf{TC}^0$, unless the widely held conjecture $\mathsf{TC}^0 \neq \mathsf{NC}^1$ fails. Finally, we present a broad set of experiments supporting our theoretical results, confirming the length generalization behaviors and the mechanism of attention concentration.
zh
[AI-3] Consistency Is Not Always Correct: Towards Understanding the Role of Exploration in Post-Training Reasoning
[AI-8] Hard vs. Noise: Resolving Hard-Noisy Sample Confusion in Recommender Systems via Large Language Models AAAI2026
【速读】:该论文旨在解决推荐系统中隐式反馈(implicit feedback)因误点击(misclicks)和位置偏差(position bias)等因素导致的噪声问题,尤其关注噪声样本与困难样本(hard samples)在数据模式上高度相似所引发的“硬-噪混淆”(hard-noisy confusion)问题,这可能导致关键的困难样本被错误过滤,从而损害用户偏好建模效果。解决方案的关键在于提出LLMHNI框架,其核心创新包括:利用大语言模型(Large Language Models, LLMs)生成两个辅助的用户-物品相关性信号——一是基于LLM编码嵌入的语义相关性用于负采样以选择困难负样本并过滤噪声假负样本;二是通过LLM推断的逻辑相关性构建交互图,并结合跨图对比对齐策略进行去噪;同时引入图对比学习机制,通过随机边删除视图对齐表示以抑制LLM幻觉带来的不可靠交互边,从而实现更精准的噪声识别与分离。
链接: https://arxiv.org/abs/2511.07295 作者: Tianrui Song,Wen-Shuo Chao,Hao Liu 机构: 未知 类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) 备注: Accepted by AAAI2026
点击查看摘要
Abstract:Implicit feedback, employed in training recommender systems, unavoidably confronts noise due to factors such as misclicks and position bias. Previous studies have attempted to identify noisy samples through their diverged data patterns, such as higher loss values, and mitigate their influence through sample dropping or reweighting. However, we observed that noisy samples and hard samples display similar patterns, leading to hard-noisy confusion issue. Such confusion is problematic as hard samples are vital for modeling user preferences. To solve this problem, we propose LLMHNI framework, leveraging two auxiliary user-item relevance signals generated by Large Language Models (LLMs) to differentiate hard and noisy samples. LLMHNI obtains user-item semantic relevance from LLM-encoded embeddings, which is used in negative sampling to select hard negatives while filtering out noisy false negatives. An objective alignment strategy is proposed to project LLM-encoded embeddings, originally for general language tasks, into a representation space optimized for user-item relevance modeling. LLMHNI also exploits LLM-inferred logical relevance within user-item interactions to identify hard and noisy samples. These LLM-inferred interactions are integrated into the interaction graph and guide denoising with cross-graph contrastive alignment. To eliminate the impact of unreliable interactions induced by LLM hallucination, we propose a graph contrastive learning strategy that aligns representations from randomly edge-dropped views to suppress unreliable edges. Empirical results demonstrate that LLMHNI significantly improves denoising and recommendation performance.
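The negative-sampling step, keep hard negatives but drop LLM-flagged false negatives, can be sketched with toy scores. The threshold and scoring scheme below are illustrative assumptions, not the paper's actual procedure, which derives relevance from LLM-encoded embeddings after objective alignment.

```python
import numpy as np

def select_hard_negatives(model_scores, llm_relevance, n, rel_threshold=0.5):
    """Toy sketch of the sampling idea: among candidate negatives, drop
    items the LLM judges semantically relevant to the user (likely false
    negatives, i.e. noise), then keep the highest-scoring survivors
    (hard negatives the recommender currently ranks too highly)."""
    keep = llm_relevance < rel_threshold           # filter noisy false negatives
    candidates = np.where(keep)[0]
    order = np.argsort(-model_scores[candidates])  # hardest = highest model score
    return candidates[order[:n]]

# Candidate items: model score = how highly the recommender ranks them,
# llm_relevance = LLM-judged semantic relevance to this user (toy numbers).
model_scores = np.array([0.9, 0.8, 0.7, 0.2, 0.95])
llm_relevance = np.array([0.1, 0.9, 0.2, 0.1, 0.8])
hard_negs = select_hard_negatives(model_scores, llm_relevance, n=2)
```

Items 1 and 4 score highly under the model but are semantically relevant, so they are treated as noise rather than hard negatives; this is exactly the hard-vs-noisy distinction the abstract argues loss values alone cannot make.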
zh
[AI-9] Enabling Off-Policy Imitation Learning with Deep Actor Critic Stabilization
[AI-10] Designing Beyond Language: Sociotechnical Barriers in AI Health Technologies for Limited English Proficiency
链接: https://arxiv.org/abs/2511.07277 作者: Michelle Huang,Violeta J. Rodriguez,Koustuv Saha,Tal August 机构: 未知 类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY) 备注:
点击查看摘要
[AI-11] Beyond Detection: Exploring Evidence-based Multi-Agent Debate for Misinformation Intervention and Persuasion AAAI2026
链接: https://arxiv.org/abs/2511.07267 作者: Chen Han,Yijia Ma,Jin Tan,Wenzhen Zheng,Xijin Tang 机构: 未知 类目: Artificial Intelligence (cs.AI) 备注: This paper has been accepted to AAAI 2026
点击查看摘要
[AI-12] AgenticSciML: Collaborative Multi-Agent Systems for Emergent Discovery in Scientific Machine Learning
链接: https://arxiv.org/abs/2511.07202 作者: Praveen Kumar Donta,Alfreds Lapkovskis,Enzo Mingozzi,Schahram Dustdar 机构: 未知 类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI) 备注:
点击查看摘要
Abstract:Failures are the norm in highly complex and heterogeneous devices spanning the distributed computing continuum (DCC), from resource-constrained IoT and edge nodes to high-performance computing systems. Ensuring reliability and global consistency across these layers remains a major challenge, especially for AI-driven workloads requiring real-time, adaptive coordination. This paper introduces a Probabilistic Active Inference Resilience Agent (PAIR-Agent) to achieve resilience in DCC systems. PAIR-Agent performs three core operations: (i) constructing a causal fault graph from device logs, (ii) identifying faults while managing certainties and uncertainties using Markov blankets and the free-energy principle, and (iii) autonomously healing issues through active inference. Through continuous monitoring and adaptive reconfiguration, the agent maintains service continuity and stability under diverse failure conditions. Theoretical validations confirm the reliability and effectiveness of the proposed framework.
zh
[AI-20] Fuzzy Label: From Concept to Its Application in Label Learning
[AI-23] On the Joint Minimization of Regularization Loss Functions in Deep Variational Bayesian Methods for Attribute-Controlled Symbolic Music Generation
Abstract:Market making (MM) through Reinforcement Learning (RL) has attracted significant attention in financial trading. With the development of Large Language Models (LLMs), more and more attempts are being made to apply LLMs to financial areas. A simple, direct application of LLM as an agent shows significant performance. Such methods are hindered by their slow inference speed, while most of the current research has not studied LLM distillation for this specific task. To address this, we first propose the normalized fluorescent probe to study the mechanism of the LLM’s feature. Based on the observation found by our investigation, we propose Cooperative Market Making (CMM), a novel framework that decouples LLM features across three orthogonal dimensions: layer, task, and data. Various student models collaboratively learn simple LLM features along with different dimensions, with each model responsible for a distinct feature to achieve knowledge distillation. Furthermore, CMM introduces an Hájek-MoE to integrate the output of the student models by investigating the contribution of different models in a kernel function-generated common feature space. Extensive experimental results on four real-world market datasets demonstrate the superiority of CMM over the current distillation method and RL-based market-making strategies.
[AI-25] A Theoretical Analysis of Detecting Large Model-Generated Time Series AAAI-2026
链接: https://arxiv.org/abs/2511.07104 作者: Junji Hou,Junzhou Zhao,Shuo Zhang,Pinghui Wang 机构: 未知 类目: Artificial Intelligence (cs.AI) 备注: 23 pages, 12 figures, to be published in AAAI-2026 main track
点击查看摘要
[AI-26] E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis NEURIPS2025
[AI-29] Data Complexity of Querying Description Logic Knowledge Bases under Cost-Based Semantics AAAI2026
链接: https://arxiv.org/abs/2511.07095 作者: Meghyn Bienvenu,Quentin Manière 机构: 未知 类目: Artificial Intelligence (cs.AI) 备注: Long version of paper to appear in AAAI 2026
点击查看摘要
[AI-30] Green AI: A systematic review and meta-analysis of its definitions, lifecycle models, hardware and measurement attempts
[AI-34] Improving Region Representation Learning from Urban Imagery with Noisy Long-Caption Supervision AAAI-26
链接: https://arxiv.org/abs/2511.07062 作者: Yimei Zhang,Guojiang Shen,Kaili Ning,Tongwei Ren,Xuebo Qiu,Mengmeng Wang,Xiangjie Kong 机构: 未知 类目: Artificial Intelligence (cs.AI) 备注: Accepted as a full paper by AAAI-26
点击查看摘要
[AI-35] Do LLMs Feel? Teaching Emotion Recognition with Prompts, Retrieval and Curriculum Learning AAAI2026
链接: https://arxiv.org/abs/2511.07061 作者: Xinran Li,Xiujuan Xu,Jiaqi Qiao,Yu Liu 机构: 未知 类目: Artificial Intelligence (cs.AI) 备注: Accepted at AAAI 2026
点击查看摘要
[AI-36] Learning Quantized Continuous Controllers for Integer Hardware
[AI-39] S2Drug: Bridging Protein Sequence and 3D Structure in Contrastive Representation Learning for Virtual Screening AAAI2026
链接: https://arxiv.org/abs/2511.07006 作者: Bowei He,Bowen Gao,Yankai Chen,Yanyan Lan,Chen Ma,Philip S. Yu,Ya-Qin Zhang,Wei-Ying Ma 机构: 未知 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) 备注: Accepted by AAAI 2026 Main Technical Track
点击查看摘要
[AI-40] Hybrid Autoencoders for Tabular Data: Leveraging Model-Based Augmentation in Low-Label Settings NEURIPS2025
链接: https://arxiv.org/abs/2511.06961 作者: Erel Naor,Ofir Lindenbaum 机构: 未知 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) 备注: accepted to neurips 2025, main text is 10 pages
点击查看摘要
[AI-41] Learning to Focus: Prioritizing Informative Histories with Structured Attention Mechanisms in Partially Observable Reinforcement Learning NEURIPS2025
链接: https://arxiv.org/abs/2511.06946 作者: Daniel De Dios Allegue,Jinke He,Frans A. Oliehoek 机构: 未知 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) 备注: Accepted to Embodied World Models for Decision Making (EWM) Workshop at NeurIPS 2025
点击查看摘要
[AI-42] Fine-Tuning Diffusion-Based Recommender Systems via Reinforcement Learning with Reward Function Optimization
链接: https://arxiv.org/abs/2511.06937 作者: Yu Hou,Hua Li,Ha Young Kim,Won-Yong Shin 机构: 未知 类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Social and Information Networks (cs.SI) 备注: 14 pages, 12 figures, 9 tables
链接: https://arxiv.org/abs/2511.06898 作者: Boyan Tang,Xuanhao Ren,Peng Xiao,Shunbo Lei,Xiaorong Sun,Jianghua Wu 机构: 未知 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) 备注: Published in 2025 IEEE 1st International Symposium on the Application of Artificial Intelligence in Electrical Engineering (AAIEE) this https URL
点击查看摘要
Abstract:Accurate day-ahead electricity price forecasting (DAEPF) is critical for the efficient operation of power systems, but extreme conditions and market anomalies pose significant challenges to existing forecasting methods. To overcome these challenges, this paper proposes a novel hybrid deep learning framework that integrates a Distilled Attention Transformer (DAT) model and an Autoencoder Self-regression Model (ASM). The DAT leverages a self-attention mechanism to dynamically assign higher weights to critical segments of historical data, effectively capturing both long-term trends and short-term fluctuations. Concurrently, the ASM employs unsupervised learning to detect and isolate anomalous patterns induced by extreme conditions, such as heavy rain, heat waves, or festivals. Experiments on datasets sampled from California and Shandong Province demonstrate that our framework significantly outperforms state-of-the-art methods in prediction accuracy, robustness, and computational efficiency. Our framework thus holds promise for enhancing grid resilience and optimizing market operations in future power systems.
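The ASM's anomaly-isolation role can be illustrated with a minimal reconstruction-error detector: score each observation by how poorly a (here trivially simple) model reconstructs it, and flag scores far above the typical error. The mean "reconstruction" and the 2-sigma threshold are illustrative assumptions standing in for the actual autoencoder.

```python
# Hypothetical sketch: flag price observations whose reconstruction
# error is far above typical. A mean reconstruction stands in for the
# autoencoder; the k-sigma threshold is a toy assumption.
import statistics

def flag_anomalies(series, k=2.0):
    mean = statistics.fmean(series)
    errors = [abs(x - mean) for x in series]   # stand-in reconstruction error
    mu, sigma = statistics.fmean(errors), statistics.pstdev(errors)
    return [i for i, e in enumerate(errors) if e > mu + k * sigma]

prices = [30, 31, 29, 30, 32, 95, 31, 30]  # one spike, e.g. a heat-wave day
print(flag_anomalies(prices))
```

Only the spiked day exceeds the threshold, so its index is returned; in the paper's framework such isolated points would be handled separately from the DAT's trend forecast.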
[AI-47] On The Presence of Double-Descent in Deep Reinforcement Learning
[AI-55] AgentSUMO: An Agentic Framework for Interactive Simulation Scenario Generation in SUMO via Large Language Models
链接: https://arxiv.org/abs/2511.06804 作者: Minwoo Jeong,Jeeyun Chang,Yoonjin Yoon 机构: 未知 类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY) 备注: Submitted to Transportation Research Part C (under review)
点击查看摘要
[AI-56] Learning to Fast Unrank in Collaborative Filtering Recommendation
Abstract:Variational Autoencoders (VAEs) are a powerful alternative to matrix factorization for recommendation. A common technique in VAE-based collaborative filtering (CF) consists in applying binary input masking to user interaction vectors, which improves performance but remains underexplored theoretically. In this work, we analyze how collaboration arises in VAE-based CF and show it is governed by latent proximity: we derive a latent sharing radius that informs when an SGD update on one user strictly reduces the loss on another user, with influence decaying as the latent Wasserstein distance increases. We further study the induced geometry: with clean inputs, VAE-based CF primarily exploits local collaboration between input-similar users and under-utilizes global collaboration between far-but-related users. We compare two mechanisms that encourage global mixing and characterize their trade-offs: (1) β-KL regularization directly tightens the information bottleneck, promoting posterior overlap but risking representational collapse if too large; (2) input masking induces stochastic geometric contractions and expansions, which can bring distant users onto the same latent neighborhood but also introduce neighborhood drift. To preserve user identity while enabling global consistency, we propose an anchor regularizer that aligns user posteriors with item embeddings, stabilizing users under masking and facilitating signal sharing across related items. Our analyses are validated on the Netflix, MovieLens-20M, and Million Song datasets. We also successfully deployed our proposed algorithm on an Amazon streaming platform following a successful online experiment.
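The binary input masking discussed above can be sketched in a few lines: each user's interaction vector is element-wise multiplied by a Bernoulli mask before being fed to the encoder, so interactions are dropped but never added. The keep probability is an illustrative assumption.

```python
# Hypothetical sketch of Bernoulli input masking for VAE-based CF.
# Each entry of the binary interaction vector is kept with probability
# keep_prob and zeroed otherwise; the rate is a toy assumption.
import random

def mask_interactions(x, keep_prob=0.5, rng=None):
    rng = rng or random.Random(0)          # seeded for reproducibility
    return [xi if rng.random() < keep_prob else 0 for xi in x]

user = [1, 0, 1, 1, 0, 1, 1, 0]
masked = mask_interactions(user, keep_prob=0.5)
# Masking never adds interactions, it only drops them.
assert all(m <= xi for m, xi in zip(masked, user))
print(masked)
```

In the paper's analysis it is exactly these stochastic drops that induce the geometric contractions and expansions bringing distant users into the same latent neighborhood.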
[AI-62] OntoTune: Ontology-Driven Learning for Query Optimization with Convolutional Models
[AI-74] ML-EcoLyzer: Quantifying the Environmental Cost of Machine Learning Inference Across Frameworks and Hardware
链接: https://arxiv.org/abs/2511.06694 作者: Jose Marie Antonio Minoza,Rex Gregor Laylo,Christian F Villarin,Sebastian C. Ibanez 机构: 未知 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) 备注:
点击查看摘要
[AI-75] Rapidly Learning Soft Robot Control via Implicit Time-Stepping
Abstract:Vision-language-action (VLA) models hold the promise of attaining generalizable embodied control. To achieve this, a pervasive paradigm is to leverage the rich vision-semantic priors of large vision-language models (VLMs). However, the fundamental question persists: How do VLAs effectively inherit the prior knowledge from VLMs? To address this critical question, we introduce a diagnostic benchmark, GrinningFace, an emoji tabletop manipulation task where the robot arm is asked to place objects onto printed emojis corresponding to language instructions. This task design is particularly revealing: knowledge associated with emojis is ubiquitous in Internet-scale datasets used for VLM pre-training, yet emojis themselves are largely absent from standard robotics datasets. Consequently, they provide a clean proxy: successful task completion indicates effective transfer of VLM priors to embodied control. We implement this diagnostic task in both a simulated environment and on a real robot, and compare various promising techniques for knowledge transfer. Specifically, we investigate the effects of parameter-efficient fine-tuning, VLM freezing, co-training, predicting discretized actions, and predicting latent actions. Through systematic evaluation, our work not only demonstrates the critical importance of preserving VLM priors for the generalization of VLAs but also establishes guidelines for future research in developing truly generalizable embodied AI systems.
[AI-79] Beyond Fixed Depth: Adaptive Graph Neural Networks for Node Classification Under Varying Homophily AAAI2026
[AI-82] Breaking the Dyadic Barrier: Rethinking Fairness in Link Prediction Beyond Demographic Parity AAAI-26
链接: https://arxiv.org/abs/2511.06568 作者: João Mattos,Debolina Halder Lina,Arlei Silva 机构: 未知 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI); Machine Learning (stat.ML) 备注: 12 pages, 5 figures. Accepted at AAAI-26 as an Oral
点击查看摘要
[AI-83] LLM For Loop Invariant Generation and Fixing: How Far Are We?
链接: https://arxiv.org/abs/2511.06552 作者: Mostafijur Rahman Akhond,Saikat Chakraborty,Gias Uddin 机构: 未知 类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) 备注: This work has been submitted to the IEEE for possible publication
点击查看摘要
[AI-84] TriShGAN: Enhancing Sparsity and Robustness in Multivariate Time Series Counterfactuals Explanation
[AI-85] FractalBench: Diagnosing Visual-Mathematical Reasoning Through Recursive Program Synthesis NEURIPS2025
链接: https://arxiv.org/abs/2511.06522 作者: Jan Ondras(1),Marek Šuppa(2) ((1) MIT, (2) Comenius University, Cisco) 机构: 未知 类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG) 备注: Accepted to The 5th Workshop on Mathematical Reasoning and AI at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025); 25 pages, 14 figures, 8 tables; Code available at this https URL
点击查看摘要
[AI-86] Route Experts by Sequence not by Token
链接: https://arxiv.org/abs/2511.06494 作者: Tiansheng Wen,Yifei Wang,Aosong Feng,Long Ma,Xinyang Liu,Yifan Wang,Lixuan Guo,Bo Chen,Stefanie Jegelka,Chenyu You 机构: 未知 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT) 备注:
点击查看摘要
[AI-87] Explainable AI For Early Detection Of Sepsis
Abstract:Background: Large Language Models emerged with the potential of provoking a revolution in software development (e.g., automating processes, workforce transformation). Although studies have started to investigate the perceived impact of LLMs for software development, there is a need for empirical studies to comprehend how to balance forward and backward effects of using LLMs. Objective: We investigated how LLMs impact software development and how to manage the impact from a software developer’s perspective. Method: We conducted 22 interviews with software practitioners across 3 rounds of data collection and analysis, between October (2024) and September (2025). We employed socio-technical grounded theory (STGT) for data analysis to rigorously analyse interview participants’ responses. Results: We identified the benefits (e.g., maintain software development flow, improve developers’ mental model, and foster entrepreneurship) and disadvantages (e.g., negative impact on developers’ personality and damage to developers’ reputation) of using LLMs at individual, team, organisation, and society levels; as well as best practices on how to adopt LLMs. Conclusion: Critically, we present the trade-offs that software practitioners, teams, and organisations face in working with LLMs. Our findings are particularly useful for software team leaders and IT managers to assess the viability of LLMs within their specific context.
[AI-95] AUTO-Explorer: Automated Data Collection for GUI Agent
[AI-101] Privacy-Preserving Federated Learning for Fair and Efficient Urban Traffic Optimization
链接: https://arxiv.org/abs/2511.06363 作者: Rathin Chandra Shit,Sharmila Subudhi 机构: 未知 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY) 备注: Under review at IEEE journal
点击查看摘要
[AI-102] Understanding Student Interaction with AI-Powered Next-Step Hints: Strategies and Challenges
链接: https://arxiv.org/abs/2511.06362 作者: Anastasiia Birillo,Aleksei Rostovskii,Yaroslav Golubev,Hieke Keuning 机构: 未知 类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computers and Society (cs.CY) 备注: Accepted to SIGCSE'26. 7 pages, 3 figures
点击查看摘要
[AI-103] A Graph-Theoretical Perspective on Law Design for Multiagent Systems AAAI AAAI-26
链接: https://arxiv.org/abs/2511.06361 作者: Qi Shi,Pavel Naumov 机构: 未知 类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT) 备注: The 40th AAAI Conference on Artificial Intelligence (AAAI-26)
点击查看摘要
[AI-104] Reaction Prediction via Interaction Modeling of Symmetric Difference Shingle Sets
[AI-109] Kaggle Chronicles: 15 Years of Competitions, Community and Data Science Innovation
链接: https://arxiv.org/abs/2511.06304 作者: Kevin Bönisch,Leandro Losaria 机构: 未知 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); General Literature (cs.GL); Machine Learning (stat.ML) 备注:
点击查看摘要
[AI-110] Secu-Table: a Comprehensive security table dataset for evaluating semantic table interpretation systems
【速读】: This paper addresses the lack of publicly available benchmark datasets for evaluating semantic table interpretation (STI) systems in the security domain, particularly STI approaches based on Large Language Models (LLMs). The key to the solution is the construction and public release of the Secu-Table dataset, which contains more than 1500 tables and over 15,000 entities drawn from Common Vulnerabilities and Exposures (CVE) and Common Weakness Enumeration (CWE) security data, annotated with Wikidata and the SEPSES CSKG (SEmantic Processing of Security Event Streams CyberSecurity Knowledge Graph). Released as part of the SemTab challenge, the dataset is used to evaluate open-source LLMs (such as Falcon3-7B-Instruct and Mistral-7B-Instruct) against a closed-source model (GPT-4o mini) on table-to-knowledge-graph matching, providing a reproducible, high-quality benchmark resource for STI research in the security domain.
链接: https://arxiv.org/abs/2511.06301 作者: Azanzi Jiomekong,Jean Bikim,Patricia Negoue,Joyce Chin 机构: 未知 类目: Artificial Intelligence (cs.AI) 备注: Submitted to Nature Scientific Data
点击查看摘要
Abstract:Evaluating semantic table interpretation (STI) systems, particularly those based on Large Language Models (LLMs), in domain-specific contexts such as the security domain depends heavily on the dataset. However, in the security domain, state-of-the-art tabular datasets are not publicly available. In this paper, we introduce the Secu-Table dataset, composed of more than 1500 tables with more than 15k entities, constructed using security data extracted from the Common Vulnerabilities and Exposures (CVE) and Common Weakness Enumeration (CWE) data sources and annotated using Wikidata and the SEmantic Processing of Security Event Streams CyberSecurity Knowledge Graph (SEPSES CSKG). Along with the dataset, all the code is publicly released. This dataset is made available to the research community in the context of the SemTab challenge on Tabular to Knowledge Graph Matching, which aims to evaluate the performance of several STI systems based on open-source LLMs. A preliminary evaluation, serving as a baseline, was conducted using Falcon3-7B-Instruct and Mistral-7B-Instruct, two open-source LLMs, and GPT-4o mini, one closed-source LLM.
[AI-111] Decomate: Leveraging Generative Models for Co-Creative SVG Animation NEURIPS2025
链接: https://arxiv.org/abs/2511.06297 作者: Jihyeon Park,Jiyoon Myung,Seone Shin,Jungki Son,Joohyung Han 机构: 未知 类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI) 备注: Accepted at the 1st Workshop on Generative and Protective AI for Content Creation (NeurIPS 2025)
点击查看摘要
[AI-112] Transolver is a Linear Transformer: Revisiting Physics-Attention through the Lens of Linear Attention
[AI-118] Breaking the Modality Barrier: Generative Modeling for Accurate Molecule Retrieval from Mass Spectra AAAI2026
【速读】: This paper tackles two problems in retrieving molecular structures from tandem mass spectra: the limited retrieval accuracy of traditional spectral library matching caused by limited library coverage, and the modality misalignment common in existing cross-modal representation learning frameworks, which degrades retrieval accuracy and generalization. The key to the solution is GLMR, a Generative Language Model-based Retrieval framework that mitigates cross-modal misalignment through a two-stage strategy: the first stage uses contrastive learning to identify candidate molecules as contextual priors; the second stage conditions a generative model on these candidates together with the input spectrum to produce refined molecular structures, which are then used to re-rank the candidate list by molecular similarity, substantially improving retrieval performance and generalization.
Abstract:Retrieving molecular structures from tandem mass spectra is a crucial step in rapid compound identification. Existing retrieval methods, such as traditional mass spectral library matching, suffer from limited spectral library coverage, while recent cross-modal representation learning frameworks often encounter modality misalignment, resulting in suboptimal retrieval accuracy and generalization. To address these limitations, we propose GLMR, a Generative Language Model-based Retrieval framework that mitigates the cross-modal misalignment through a two-stage process. In the pre-retrieval stage, a contrastive learning-based model identifies top candidate molecules as contextual priors for the input mass spectrum. In the generative retrieval stage, these candidate molecules are integrated with the input mass spectrum to guide a generative model in producing refined molecular structures, which are then used to re-rank the candidates based on molecular similarity. Experiments on both MassSpecGym and the proposed MassRET-20k dataset demonstrate that GLMR significantly outperforms existing methods, achieving over 40% improvement in top-1 accuracy and exhibiting strong generalizability.
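The re-ranking step described above, reordering candidates by molecular similarity to the generated structures, can be sketched with Tanimoto similarity over fingerprint bit sets, a standard cheminformatics measure. The fingerprints and molecule names below are toy assumptions, not data from the paper.

```python
# Hypothetical sketch: re-rank retrieved candidates by Tanimoto
# similarity between their fingerprint sets and the fingerprint of a
# generated structure. All fingerprints here are toy bit-index sets.
def tanimoto(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def rerank(candidates, generated_fp):
    # candidates: list of (name, fingerprint_set); highest similarity first
    return sorted(candidates, key=lambda c: tanimoto(c[1], generated_fp),
                  reverse=True)

candidates = [("mol_A", {1, 2, 3}), ("mol_B", {1, 2, 3, 4}), ("mol_C", {9})]
generated = {1, 2, 3, 4}
print([name for name, _ in rerank(candidates, generated)])
```

Here mol_B matches the generated fingerprint exactly (similarity 1.0) and is promoted to rank one, mirroring how GLMR's generated structures refine the contrastive pre-retrieval order.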
[AI-119] MrCoM: A Meta-Regularized World-Model Generalizing Across Multi-Scenarios
Abstract:User interface (UI) development requires translating design mockups into functional code, a process that remains repetitive and labor-intensive. While recent Vision-Language Models (VLMs) automate UI-to-Code generation, they generate only static HTML/CSS/JavaScript layouts lacking interactivity. To address this, we propose WebVIA, the first agentic framework for interactive UI-to-Code generation and validation. The framework comprises three components: 1) an exploration agent to capture multi-state UI screenshots; 2) a UI2Code model that generates executable interactive code; 3) a validation module that verifies the interactivity. Experiments demonstrate that WebVIA-Agent achieves more stable and accurate UI exploration than general-purpose agents (e.g., Gemini-2.5-Pro). In addition, our fine-tuned WebVIA-UI2Code models exhibit substantial improvements in generating executable and interactive HTML/CSS/JavaScript code, outperforming their base counterparts across both interactive and static UI2Code benchmarks. Our code and models are available at this https URL.
[AI-121] Constraint-Informed Active Learning for End-to-End ACOPF Optimization Proxies
链接: https://arxiv.org/abs/2511.06227 作者: Anamul Haque Mollah,Ahmed Aljohani,Hyunsook Do 机构: 未知 类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE) 备注: Accepted for publication at 2nd ACM International Conference on AI-powered Software (AIware 2025)
点击查看摘要
Abstract:Unit tests often lack concise summaries that convey test intent, especially in auto-generated or poorly documented codebases. Large Language Models (LLMs) offer a promising solution, but their effectiveness depends heavily on how they are prompted. Unlike generic code summarization, test-code summarization poses distinct challenges because test methods validate expected behavior through assertions rather than implementing functionality. This paper presents a new benchmark of 91 real-world Java test cases paired with developer-written summaries and conducts a controlled ablation study to investigate how test-code-related components, such as the method under test (MUT), assertion messages, and assertion semantics, affect the performance of LLM-generated test summaries. We evaluate four code LLMs (Codex, Codestral, DeepSeek, and Qwen-Coder) across seven prompt configurations using n-gram metrics (BLEU, ROUGE-L, METEOR), semantic similarity (BERTScore), and LLM-based evaluation. Results show that prompting with assertion semantics improves summary quality by an average of 0.10 points (2.3%) over full MUT context (4.45 vs. 4.35) while requiring fewer input tokens. Codex and Qwen-Coder achieve the highest alignment with human-written summaries, while DeepSeek underperforms despite high lexical overlap. The replication package is publicly available at this https URL (DOI: 10.5281/zenodo.17067550).
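One way to picture the "assertion semantics" prompt configuration is a prompt builder that, instead of passing the full MUT body, spells out what each assertion checks in natural language. The template phrasing, assertion names, and test below are illustrative assumptions, not the paper's exact prompts.

```python
# Hypothetical sketch: render JUnit-style assertions as natural-language
# semantics and embed them in a summarization prompt. The templates and
# the example test are toy assumptions.
def assertion_semantics(assertions):
    templates = {
        "assertEquals": "expects {1} to equal {0}",
        "assertTrue": "expects {0} to be true",
        "assertNotNull": "expects {0} to be non-null",
    }
    return "\n".join("- " + templates[kind].format(*args)
                     for kind, args in assertions)

def build_prompt(test_name, assertions):
    return (f"Summarize the intent of Java test `{test_name}`.\n"
            "Its assertions:\n" + assertion_semantics(assertions))

prompt = build_prompt("testParseDate",
                      [("assertNotNull", ["result"]),
                       ("assertEquals", ["2024", "result.getYear()"])])
print(prompt)
```

Because only the rendered assertion semantics are sent, such a prompt uses fewer input tokens than including the whole method under test, consistent with the trade-off reported above.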
[AI-125] ROAR: Robust Accident Recognition and Anticipation for Autonomous Driving
链接: https://arxiv.org/abs/2511.06226 作者: Xingcheng Liu,Yanchen Guan,Haicheng Liao,Zhengbing He,Zhenning Li 机构: 未知 类目: Artificial Intelligence (cs.AI) 备注: Published to Accident Analysis and Prevention
点击查看摘要
[AI-126] RAG-targeted Adversarial Attack on LLM-based Threat Detection and Mitigation Framework
Abstract:The rapid expansion of the Internet of Things (IoT) is reshaping communication and operational practices across industries, but it also broadens the attack surface and increases susceptibility to security breaches. Artificial Intelligence has become a valuable solution in securing IoT networks, with Large Language Models (LLMs) enabling automated attack behavior analysis and mitigation suggestion in Network Intrusion Detection Systems (NIDS). Despite advancements, the use of LLMs in such systems further expands the attack surface, putting entire networks at risk by introducing vulnerabilities such as prompt injection and data poisoning. In this work, we attack an LLM-based IoT attack analysis and mitigation framework to test its adversarial robustness. We construct an attack description dataset and use it in a targeted data poisoning attack that applies word-level, meaning-preserving perturbations to corrupt the Retrieval-Augmented Generation (RAG) knowledge base of the framework. We then compare pre-attack and post-attack mitigation responses from the target model, ChatGPT-5 Thinking, to measure the impact of the attack on model performance, using an established evaluation rubric designed for human experts and judge LLMs. Our results show that small perturbations degrade LLM performance by weakening the linkage between observed network traffic features and attack behavior, and by reducing the specificity and practicality of recommended mitigations for resource-constrained devices.
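The word-level, meaning-preserving perturbations mentioned above can be sketched as synonym substitution applied to an attack description before it enters the RAG knowledge base. The synonym table is a toy assumption; real attacks of this kind typically use embedding-based substitution to preserve meaning.

```python
# Hypothetical sketch: perturb a knowledge-base document with word-level
# synonym swaps that keep the human-readable meaning intact. The synonym
# table and example sentence are illustrative assumptions.
SYNONYMS = {"malicious": "hostile", "traffic": "packets",
            "device": "node", "flood": "deluge"}

def perturb(text):
    # Replace each word that has a synonym; leave everything else alone.
    return " ".join(SYNONYMS.get(w.lower(), w) for w in text.split())

doc = "malicious traffic floods the target device"
print(perturb(doc))
```

A human reader gets the same meaning from the perturbed text, but exact-match and embedding retrieval over the poisoned knowledge base can drift, which is the degradation mechanism the study measures.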
[AI-127] Resilience Inference for Supply Chains with Hypergraph Neural Network
[AI-134] LLM Attention Transplant for Transfer Learning of Tabular Data Across Disparate Domains
【速读】: This paper targets the challenges of cross-domain transfer learning for tabular data caused by heterogeneous feature spaces, and in particular the limitations of traditional deep learning for tabular knowledge transfer. The core question is how to exploit large language model (LLM) capabilities to improve tabular transfer while avoiding reliance on text prompts and in-context learning. The key to the solution is LATTLE (LLM-Attention Transplant for Transfer Learning), a lightweight framework that fine-tunes an LLM on source tabular data and transplants selected key and value projection weights into a gated feature tokenized transformer (gFTT) designed for tabular data, yielding a gFTT with cross-domain attention; this model is then fine-tuned on target tabular data, achieving efficient transfer without shared features, prompt engineering, or large-scale pretrained models.
链接: https://arxiv.org/abs/2511.06161 作者: Ibna Kowsar,Kazi F. Akhter,Manar D. Samad 机构: 未知 类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) 备注:
点击查看摘要
Abstract:Transfer learning of tabular data is non-trivial due to heterogeneity in the feature space across disparate domains. The limited success of traditional deep learning in tabular knowledge transfer can be advanced by leveraging large language models (LLMs). However, the efficacy of LLMs often stagnates for mixed data types structured in tables due to the limitations of text prompts and in-context learning. We propose a lightweight transfer learning framework that fine-tunes an LLM using source tabular data and transplants the LLM’s selective key and value projection weights into a gated feature tokenized transformer (gFTT) built for tabular data. The gFTT model with cross-domain attention is fine-tuned using target tabular data for transfer learning, eliminating the need for shared features, LLM prompt engineering, and large-scale pretrained models. Our experiments using ten pairs of source-target data sets and 12 baselines demonstrate the superiority of the proposed LLM-attention transplant for transfer learning (LATTLE) method over traditional ML models, state-of-the-art deep tabular architectures, and transfer learning models trained on thousands to billions of tabular samples. The proposed attention transfer demonstrates an effective solution to learning relationships between data tables using an LLM in a low-resource learning environment. The source code for the proposed method is publicly available.
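The transplant step itself, copying selected key and value projection weights from the fine-tuned source model into the matching attention slots of the target transformer, can be sketched as a dictionary copy over named weight matrices. The layer names, shapes, and the choice to leave the query projection untouched are illustrative assumptions.

```python
# Hypothetical sketch: transplant key/value projection matrices from a
# source model's attention layers into a target model. Weight names and
# shapes are toy assumptions, not the paper's actual architecture.
import numpy as np

def transplant_kv(src_weights, tgt_weights, layers):
    for layer in layers:
        for proj in ("W_k", "W_v"):          # keys/values only, not W_q
            tgt_weights[f"{layer}.{proj}"] = src_weights[f"{layer}.{proj}"].copy()
    return tgt_weights

rng = np.random.default_rng(0)
src = {f"attn0.{p}": rng.normal(size=(8, 8)) for p in ("W_q", "W_k", "W_v")}
tgt = {f"attn0.{p}": np.zeros((8, 8)) for p in ("W_q", "W_k", "W_v")}
tgt = transplant_kv(src, tgt, ["attn0"])

assert np.allclose(tgt["attn0.W_k"], src["attn0.W_k"])  # transplanted
assert np.allclose(tgt["attn0.W_q"], 0)                 # left untouched
```

After the copy, the target model would be fine-tuned on target tabular data, as described above, so only the transplanted K/V structure (not the full LLM) is carried across domains.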
[AI-135] Models Got Talent: Identifying High Performing Wearable Human Activity Recognition Models Without Training
Abstract:A promising alternative to the computationally expensive Neural Architecture Search (NAS) involves the development of Zero-Cost Proxies (ZCPs), which correlate well with trained performance but can be computed through a single forward/backward pass on a randomly sampled batch of data. In this paper, we investigate the effectiveness of ZCPs for HAR on six benchmark datasets, and demonstrate that they discover network architectures that come within 5% of the performance attained by full-scale training involving 1500 randomly sampled architectures. This results in substantial computational savings, as high-performing architectures can be discovered with minimal training. Our experiments not only introduce ZCPs to sensor-based HAR, but also demonstrate that they are robust to data noise, further showcasing their suitability for practical scenarios.
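The single forward/backward pass idea behind ZCPs can be illustrated with a gradient-norm style proxy: score a randomly initialized model by the norm of its loss gradient on one random batch, with no training at all. The logistic-regression "network" and batch size below are toy assumptions standing in for a real candidate architecture.

```python
# Hypothetical sketch of a gradient-norm zero-cost proxy: one random
# batch, one backward pass, no training. A linear logistic model stands
# in for a candidate network; sizes are toy assumptions.
import numpy as np

def grad_norm_proxy(n_features, rng):
    W = rng.normal(size=(n_features,))          # random initialization
    X = rng.normal(size=(32, n_features))       # one random batch
    y = rng.integers(0, 2, size=32)             # random binary labels
    p = 1.0 / (1.0 + np.exp(-X @ W))            # sigmoid forward pass
    grad = X.T @ (p - y) / len(y)               # logistic-loss gradient
    return float(np.linalg.norm(grad))          # proxy score

rng = np.random.default_rng(42)
score = grad_norm_proxy(16, rng)
print(score)
```

In a ZCP-based search, each of the 1500 candidate architectures would receive such a score in milliseconds, and only the top-ranked ones would be considered for actual training.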
[AI-136] MALinZero: Efficient Low-Dimensional Search for Mastering Complex Multi-Agent Planning
【速读】: This paper addresses the inefficiency of exploration and exploitation in Monte Carlo Tree Search (MCTS) for multi-agent planning, where the joint action space grows exponentially with the number of agents. The key to the solution is MALinZero, which projects joint-action returns into a low-dimensional representation space and, modeling the problem as a contextual linear bandit, derives a Linear Upper Confidence Bound applied to trees (LinUCT) in that space for more efficient multi-agent exploration and exploitation; combined with the maximization of a submodular objective, it further provides a (1 − 1/e)-approximation algorithm for joint action selection, significantly improving performance and convergence speed on benchmarks such as matrix games, SMAC, and SMACv2.
Abstract:Monte Carlo Tree Search (MCTS), which leverages the Upper Confidence Bound for Trees (UCT) to balance exploration and exploitation through randomized sampling, is instrumental to solving complex planning problems. However, for multi-agent planning, MCTS is confronted with a large combinatorial action space that often grows exponentially with the number of agents. As a result, the branching factor of MCTS during tree expansion also increases exponentially, making it very difficult to explore and exploit efficiently during tree search. To this end, we propose MALinZero, a new approach that leverages low-dimensional representational structure in joint-action returns to enable efficient MCTS in complex multi-agent planning. Our solution can be viewed as projecting the joint-action returns into a low-dimensional space representable using a contextual linear bandit formulation. We solve the contextual linear bandit problem with convex and μ-smooth loss functions, in order to place more importance on better joint actions and mitigate potential representational limitations, and derive a linear Upper Confidence Bound applied to trees (LinUCT) to enable novel multi-agent exploration and exploitation in the low-dimensional space. We analyze the regret of MALinZero for low-dimensional reward functions and propose a (1 − 1/e)-approximation algorithm for joint action selection by maximizing a submodular objective. MALinZero demonstrates state-of-the-art performance on multi-agent benchmarks such as matrix games, SMAC, and SMACv2, outperforming both model-based and model-free multi-agent reinforcement learning baselines with faster learning speed and better performance.
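The LinUCT selection rule above follows the standard LinUCB template: score each action by its linear reward estimate plus a confidence bonus from the inverse design matrix, then pick the maximizer. The dimensions, prior statistics, and α constant below are toy assumptions, not the paper's tree-search integration.

```python
# Hypothetical sketch of LinUCB-style action selection in a
# low-dimensional representation space. A, b, features, and alpha are
# toy stand-ins for the statistics maintained during tree search.
import numpy as np

def linucb_select(A, b, action_feats, alpha=1.0):
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b                       # ridge estimate of reward weights
    scores = [feat @ theta + alpha * np.sqrt(feat @ A_inv @ feat)
              for feat in action_feats]     # estimate + confidence bonus
    return int(np.argmax(scores))

d = 3
A = np.eye(d)                       # regularized design matrix
b = np.array([1.0, 0.0, 0.0])       # accumulated reward-weighted features
actions = [np.array([1.0, 0.0, 0.0]),
           np.array([0.0, 1.0, 0.0])]
print(linucb_select(A, b, actions))
```

With these toy statistics the first action has both a higher reward estimate and an equal bonus, so it is selected; in MALinZero an analogous score would guide expansion at each tree node.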
[AI-137] When Object-Centric World Models Meet Policy Learning: From Pixels to Policies and Where It Breaks
Abstract:Omics data is widely employed in medical research to identify disease mechanisms and contains highly sensitive personal information. Federated Learning (FL) with Differential Privacy (DP) can protect omics data privacy against malicious attacks. However, FL with DP faces an inherent trade-off: stronger privacy protection degrades predictive accuracy due to the injected noise. On the other hand, Homomorphic Encryption (HE) allows computations on encrypted data, and aggregating encrypted gradients without DP-induced noise can increase predictive accuracy, though at a higher computational cost. To improve predictive accuracy while accounting for the computational ability of heterogeneous clients, we propose a Privacy-Preserving Machine Learning (PPML)-Hybrid method that introduces HE. In the proposed PPML-Hybrid method, distributed clients select either HE or DP based on their computational resources, so that HE clients contribute noise-free updates while DP clients reduce computational overhead. Meanwhile, clients with high computational resources can flexibly adopt HE or DP according to their privacy needs. Performance evaluation on omics datasets shows that our proposed method achieves comparable predictive accuracy while significantly reducing computation time relative to HE-only. Additionally, it outperforms DP-only methods under equivalent or stricter privacy budgets.
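The hybrid client policy described above can be sketched as a simple per-client branch: clients above a compute threshold send exact (to-be-encrypted) updates, while the rest send Gaussian-noised DP updates. The threshold, noise scale, and the omission of actual encryption are illustrative assumptions.

```python
# Hypothetical sketch of the PPML-Hybrid client-side choice: HE clients
# contribute exact updates (encryption omitted here), DP clients add
# Gaussian noise. Threshold and sigma are toy assumptions.
import random

def client_update(grad, compute_score, he_threshold=0.7, sigma=0.1, rng=None):
    rng = rng or random.Random(0)
    if compute_score >= he_threshold:       # enough compute: use HE
        return ("HE", list(grad))           # exact update (would be encrypted)
    noisy = [g + rng.gauss(0.0, sigma) for g in grad]
    return ("DP", noisy)                    # cheap, noise-perturbed update

mode_fast, upd_fast = client_update([0.5, -0.2], compute_score=0.9)
mode_slow, upd_slow = client_update([0.5, -0.2], compute_score=0.3)
assert mode_fast == "HE" and upd_fast == [0.5, -0.2]
assert mode_slow == "DP" and upd_slow != [0.5, -0.2]
```

The server would then aggregate the decrypted HE updates together with the noisy DP updates, so the noise-free contributions lift accuracy while constrained clients stay within their compute budget.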
[AI-141] How Particle-System Random Batch Methods Enhance Graph Transformer: Memory Efficiency and Parallel Computing Strategy
Abstract:The escalating context length in Large Language Models (LLMs) creates a severe performance bottleneck around the Key-Value (KV) cache, whose memory-bound nature leads to significant GPU under-utilization. This paper introduces Mixture of Shared KV Attention (MoSKA), an architecture that addresses this challenge by exploiting the heterogeneity of context data. It differentiates between per-request unique and massively reused shared sequences. The core of MoSKA is a novel Shared KV Attention mechanism that transforms the attention on shared data from a series of memory-bound GEMV operations into a single, compute-bound GEMM by batching concurrent requests. This is supported by an MoE-inspired sparse attention strategy that prunes the search space and a tailored Disaggregated Infrastructure that specializes hardware for unique and shared data. This comprehensive approach demonstrates a throughput increase of up to 538.7x over baselines in workloads with high context sharing, offering a clear architectural path toward scalable LLM inference.
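The core observation behind Shared KV Attention, that per-request matrix-vector products against the same shared keys can be fused into one matrix-matrix product, is easy to verify numerically. The shapes below are illustrative assumptions.

```python
# Sketch of the GEMV-to-GEMM batching at the heart of MoSKA: scoring
# many concurrent queries against one shared K matrix as a single GEMM.
# Shapes (128 shared tokens, head dim 64, 8 requests) are toy choices.
import numpy as np

rng = np.random.default_rng(0)
K_shared = rng.normal(size=(128, 64))   # shared keys: seq_len x head_dim
queries = rng.normal(size=(8, 64))      # 8 concurrent request queries

# Memory-bound view: one GEMV per request.
gemv_scores = np.stack([K_shared @ q for q in queries])

# Compute-bound view: a single GEMM over the batched queries.
gemm_scores = (K_shared @ queries.T).T

assert np.allclose(gemv_scores, gemm_scores)
print(gemm_scores.shape)
```

Both views produce identical attention scores; the performance difference on a GPU comes from arithmetic intensity, since the GEMM reads the shared K matrix once for all requests instead of once per request.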
[AI-145] Ontology Learning and Knowledge Graph Construction: A Comparison of Approaches and Their Impact on RAG Performance
【速读】: This paper investigates how knowledge representation affects the performance of Retrieval-Augmented Generation (RAG) systems, in particular how knowledge graph (KG) construction strategies can be optimized to improve RAG. The key to the solution is an ontology-guided KG construction approach that incorporates chunk information into the KG structure, comparing ontologies derived from relational databases with ones extracted from textual corpora. The study finds that ontology-guided KGs built from relational databases perform on par with state-of-the-art frameworks and substantially outperform pure vector-retrieval baselines, while offering two advantages: a one-time-only ontology learning process that greatly reduces large language model (LLM) invocation costs, and avoidance of the ontology-merging complexity common to text-based approaches.
Abstract:Retrieval-Augmented Generation (RAG) systems combine Large Language Models (LLMs) with external knowledge, and their performance depends heavily on how that knowledge is represented. This study investigates how different Knowledge Graph (KG) construction strategies influence RAG performance. We compare a variety of approaches: standard vector-based RAG, GraphRAG, and retrieval over KGs built from ontologies derived either from relational databases or textual corpora. Results show that ontology-guided KGs incorporating chunk information achieve competitive performance with state-of-the-art frameworks, substantially outperforming vector retrieval baselines. Moreover, the findings reveal that ontology-guided KGs built from relational databases perform competitively to ones built with ontologies extracted from text, with the benefit of offering a dual advantage: they require a one-time-only ontology learning process, substantially reducing LLM usage costs; and avoid the complexity of ontology merging inherent to text-based approaches.
[AI-146] Kunlun Anomaly Troubleshooter: Enabling Kernel-Level Anomaly Detection and Causal Reasoning for Large Model Distributed Inference
【速读】: This paper addresses the low efficiency and poor accuracy of anomaly diagnosis in large model distributed inference (LMDI), where complex anomalies such as inference performance degradation or latency jitter traditionally require time-consuming manual troubleshooting by domain experts with limited success. The key to the solution is the Kunlun Anomaly Troubleshooter (KAT) framework, built on two core innovations: first, exploiting the synchronicity and consistency of GPU workers to pinpoint kernel-level anomalies and the associated hardware components at nanosecond resolution from function-call trace data; second, integrating these detection results into a domain-adapted large language model (LLM) to deliver systematic causal reasoning and natural-language explanations of complex anomaly symptoms, significantly narrowing the diagnostic scope and improving both the efficiency and success rate of troubleshooting.
Abstract:Anomaly troubleshooting for large model distributed inference (LMDI) remains a critical challenge. Resolving anomalies such as inference performance degradation or latency jitter in distributed systems demands significant manual effort from domain experts, resulting in extremely time-consuming diagnosis processes with relatively low accuracy. In this paper, we introduce Kunlun Anomaly Troubleshooter (KAT), the first anomaly troubleshooting framework tailored for LMDI. KAT addresses this problem through two core innovations. First, KAT exploits the synchronicity and consistency of GPU workers, innovatively leveraging function trace data to precisely detect kernel-level anomalies and associated hardware components at nanosecond resolution. Second, KAT integrates these detection results into a domain-adapted LLM, delivering systematic causal reasoning and natural language interpretation of complex anomaly symptoms. Evaluations conducted in an Alibaba Cloud production environment indicate that KAT achieves over 0.884 precision and 0.936 recall in anomaly detection, providing detailed anomaly insights that significantly narrow the diagnostic scope and improve both the efficiency and success rate of troubleshooting.
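The synchronicity idea can be illustrated with a small sketch (not KAT itself; the trace format and threshold are assumptions): the same kernel in the same step should take a similar time on every worker, so a worker whose duration deviates strongly from the group is suspect. A robust median/MAD score keeps one straggler from skewing the baseline.

```python
# Illustrative sketch, not KAT: flag workers whose kernel duration deviates
# from the group median by more than `threshold` median absolute deviations.
from statistics import median

def flag_anomalous_workers(durations_ns, threshold=3.0):
    """durations_ns: {worker_id: kernel duration in ns} for one kernel launch."""
    med = median(durations_ns.values())
    mad = median(abs(d - med) for d in durations_ns.values()) or 1.0
    return [w for w, d in durations_ns.items() if abs(d - med) / mad > threshold]

trace = {"worker0": 1_050, "worker1": 1_020, "worker2": 9_800, "worker3": 1_010}
print(flag_anomalous_workers(trace))  # prints ['worker2']
```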
[AI-147] An Epistemic Perspective on Agent Awareness AAAI AAAI-26
Link: https://arxiv.org/abs/2511.05977 Authors: Pavel Naumov, Alexandra Pavlova Affiliations: unknown Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA) Comments: Fortieth AAAI Conference on Artificial Intelligence (AAAI-26)
Abstract:The paper proposes to treat agent awareness as a form of knowledge, breaking the tradition in the existing literature on awareness. It distinguishes the de re and de dicto forms of such knowledge. The work introduces two modalities capturing these forms and formally specifies their meaning using a version of 2D-semantics. The main technical result is a sound and complete logical system describing the interplay between the two proposed modalities and the standard “knowledge of the fact” modality.
[AI-148] Klear-AgentForge: Forging Agentic Intelligence through Posttraining Scaling
Abstract:Despite the proliferation of powerful agentic models, the lack of critical post-training details hinders the development of strong counterparts in the open-source community. In this study, we present a comprehensive and fully open-source pipeline for training a high-performance agentic model for interacting with external tools and environments, named Klear-Qwen3-AgentForge, starting from the Qwen3-8B base model. We design effective supervised fine-tuning (SFT) with synthetic data followed by multi-turn reinforcement learning (RL) to unlock the potential for multiple diverse agentic tasks. We perform exclusive experiments on various agentic benchmarks in both tool use and coding domains. Klear-Qwen3-AgentForge-8B achieves state-of-the-art performance among LLMs of similar size and remains competitive with significantly larger models.
[AI-149] 10 Open Challenges Steering the Future of Vision-Language-Action Models AAAI2026
Abstract:Due to their ability to follow natural language instructions, vision-language-action (VLA) models are increasingly prevalent in the embodied AI arena, following the widespread success of their precursors – LLMs and VLMs. In this paper, we discuss 10 principal milestones in the ongoing development of VLA models – multimodality, reasoning, data, evaluation, cross-robot action generalization, efficiency, whole-body coordination, safety, agents, and coordination with humans. Furthermore, we discuss the emerging trends of using spatial understanding, modeling world dynamics, post-training, and data synthesis – all aiming to reach these milestones. Through these discussions, we hope to bring attention to research avenues that may accelerate the development of VLA models toward wider acceptance.
[AI-150] The Future of AI in the GCC Post-NPM Landscape: A Comparative Analysis of Kuwait and the UAE
Link: https://arxiv.org/abs/2511.05932 Authors: Mohammad Rashed Albous, Bedour Alboloushi, Arnaud Lacheret Affiliations: unknown Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Theoretical Economics (econ.TH) Comments:
Abstract:Comparative evidence on how Gulf Cooperation Council (GCC) states turn artificial intelligence (AI) ambitions into post–New Public Management (post-NPM) outcomes is scarce because most studies examine Western democracies. We analyze constitutional, collective-choice, and operational rules shaping AI uptake in two contrasting GCC members, the United Arab Emirates (UAE) and Kuwait, and whether they foster citizen centricity, collaborative governance, and public value creation. Anchored in Ostrom’s Institutional Analysis and Development framework, the study combines a most similar/most different systems design with multiple sources: 62 public documents from 2018–2025, embedded UAE cases (Smart Dubai and MBZUAI), and 39 interviews with officials conducted Aug 2024–May 2025. Dual coding and process tracing connect rule configurations to AI performance. Cross-case analysis identifies four reinforcing mechanisms behind divergent trajectories. In the UAE, concentrated authority, credible sanctions, pro-innovation narratives, and flexible reinvestment rules scale pilots into hundreds of services and sizable recycled savings. In Kuwait, dispersed veto points, exhortative sanctions, cautious discourse, and lapsed AI budgets confine initiatives to pilot mode despite equivalent fiscal resources. The findings refine institutional theory by showing that vertical rule coherence, not wealth, determines AI’s public-value yield, and temper post-NPM optimism by revealing that efficiency metrics serve societal goals only when backed by enforceable safeguards. To curb ethics washing and test transferability beyond the GCC, future work should track rule diffusion over time, develop blended legitimacy–efficiency scorecards, and examine how narrative framing shapes citizen consent for data sharing.
[AI-151] Self-Abstraction from Grounded Experience for Plan-Guided Policy Refinement
Abstract:Numerous multimodal misinformation benchmarks exhibit bias toward specific modalities, allowing detectors to make predictions based solely on one modality. While previous research has quantified bias at the dataset level or manually identified spurious correlations between modalities and labels, these approaches lack meaningful insights at the sample level and struggle to scale to the vast amount of online information. In this paper, we investigate the design of automated recognition of modality bias at the sample level. Specifically, we propose three bias quantification methods based on theories/views at different levels of granularity: 1) a coarse-grained evaluation of modality benefit; 2) a medium-grained quantification of information flow; and 3) a fine-grained causality analysis. To verify the effectiveness, we conduct a human evaluation on two popular benchmarks. Experimental results reveal three interesting findings that provide potential directions for future research: 1) ensembling multiple views is crucial for reliable automated analysis; 2) automated analysis is prone to detector-induced fluctuations; and 3) different views produce higher agreement on modality-balanced samples but diverge on biased ones.
[AI-155] Physics-Informed Neural Networks for Real-Time Gas Crossover Prediction in PEM Electrolyzers: First Application with Multi-Membrane Validation
Abstract:Green hydrogen production via polymer electrolyte membrane (PEM) water electrolysis is pivotal for the energy transition, yet hydrogen crossover through membranes threatens safety and economic viability, approaching explosive limits (4 mol% H2 in O2) while reducing Faradaic efficiency by 2.5%. Current physics-based models require extensive calibration and computational resources that preclude real-time implementation, while purely data-driven approaches fail to extrapolate beyond training conditions, which is critical for dynamic electrolyzer operation. Here we present the first application of physics-informed neural networks (PINNs) for hydrogen crossover prediction, integrating mass conservation, Fick's diffusion law, and Henry's solubility law within a compact architecture (17,793 parameters). Validated across six membranes under industrially relevant conditions (0.05-5.0 A/cm^2, 1-200 bar, 25-85°C), our PINN achieves exceptional accuracy (R^2 = 99.84%, RMSE = 0.0348%) with sub-millisecond inference times suitable for real-time control. Remarkably, the model maintains R^2 > 86% when predicting crossover at pressures 2.5x beyond the training range, substantially outperforming pure neural networks (R^2 = 43.4%). The hardware-agnostic deployment, from desktop CPUs to edge devices (Raspberry Pi 4), enables distributed safety monitoring essential for gigawatt-scale installations. By bridging physical rigor and computational efficiency, this work establishes a new paradigm for real-time electrolyzer monitoring, accelerating deployment of the safe, efficient green hydrogen infrastructure crucial for net-zero emissions targets.
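The general shape of a physics-informed loss can be sketched as follows (an illustrative toy, not the paper's model: the profiles, diffusivity `D`, and weighting `lam` are assumptions). The total loss adds a data-misfit term and the residual of Fick's first law, J = -D dc/dx, evaluated by finite differences; a prediction consistent with the physics drives the residual to zero.

```python
# Minimal physics-informed loss sketch (illustrative, not the paper's PINN):
# data misfit plus the finite-difference residual of Fick's law J = -D*dc/dx.

def fick_residual(c, J, D, dx):
    """Mean squared residual of J + D*dc/dx on interior grid points."""
    res = []
    for i in range(1, len(c) - 1):
        dcdx = (c[i + 1] - c[i - 1]) / (2 * dx)   # central difference
        res.append((J[i] + D * dcdx) ** 2)
    return sum(res) / len(res)

def pinn_loss(pred, target, c, J, D, dx, lam=1.0):
    data = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return data + lam * fick_residual(c, J, D, dx)

# A linear concentration profile with the consistent flux gives zero residual.
dx, D = 0.1, 2.0
c = [1.0 - 0.5 * i * dx for i in range(6)]      # dc/dx = -0.5
J = [D * 0.5] * 6                               # J = -D * dc/dx = 1.0
assert abs(fick_residual(c, J, D, dx)) < 1e-12
```

In a real PINN the residual is evaluated on the network's predictions (typically via automatic differentiation rather than finite differences), so minimizing the combined loss pushes the model toward physically consistent solutions even where data is sparse.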
[AI-156] An Empirical Study of Reasoning Steps in Thinking Code LLMs
[AI-159] Predicting the Future by Retrieving the Past AAAI2026
【速读】: This paper addresses the underutilization of global historical information by deep models (e.g., MLP, Transformer, and TCN) in univariate time series forecasting. Although these models implicitly compress history into their parameters during training, at inference time they rely only on the local context within a sliding lookback window and cannot dynamically access global historical patterns, limiting forecast accuracy. The key innovation of the proposed Predicting the Future by Retrieving the Past (PFRP) approach is a Global Memory Bank (GMB) that stores and manages global historical patterns; a retrieval mechanism extracts similar historical segments to generate global predictions, which are then adaptively fused with the outputs of a local prediction model, significantly improving both forecast accuracy and interpretability.
Abstract:Deep learning models such as MLP, Transformer, and TCN have achieved remarkable success in univariate time series forecasting, typically relying on sliding window samples from historical data for training. However, while these models implicitly compress historical information into their parameters during training, they are unable to explicitly and dynamically access this global knowledge during inference, relying only on the local context within the lookback window. This results in an underutilization of rich patterns from the global history. To bridge this gap, we propose Predicting the Future by Retrieving the Past (PFRP), a novel approach that explicitly integrates global historical data to enhance forecasting accuracy. Specifically, we construct a Global Memory Bank (GMB) to effectively store and manage global historical patterns. A retrieval mechanism is then employed to extract similar patterns from the GMB, enabling the generation of global predictions. By adaptively combining these global predictions with the outputs of any local prediction model, PFRP produces more accurate and interpretable forecasts. Extensive experiments conducted on seven real-world datasets demonstrate that PFRP significantly enhances the average performance of advanced univariate forecasting models by 8.4%. Codes can be found in this https URL.
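The retrieve-then-blend idea can be sketched in a few lines (a minimal illustration, not the authors' implementation; the distance metric, `k`, and blending weight `alpha` are assumptions): store historical (window, continuation) pairs in a memory bank, retrieve the nearest stored windows for the current lookback, and blend the retrieved continuation with a local model's forecast.

```python
# Illustrative PFRP-style sketch: global memory bank + retrieval + blending.
# All names and hyperparameters are assumptions, not the paper's code.

def euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def build_memory_bank(series, lookback, horizon):
    """Store every historical (window, continuation) pair."""
    return [(series[i:i + lookback], series[i + lookback:i + lookback + horizon])
            for i in range(len(series) - lookback - horizon + 1)]

def global_prediction(bank, window, k=2):
    """Average the continuations of the k nearest stored windows."""
    nearest = sorted(bank, key=lambda e: euclid(e[0], window))[:k]
    horizon = len(nearest[0][1])
    return [sum(e[1][t] for e in nearest) / k for t in range(horizon)]

def pfrp_forecast(local_pred, global_pred, alpha=0.5):
    """Adaptive fusion, here reduced to a fixed convex combination."""
    return [alpha * g + (1 - alpha) * l for g, l in zip(global_pred, local_pred)]

series = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]          # a repeating pattern
bank = build_memory_bank(series[:-2], lookback=2, horizon=1)
g = global_prediction(bank, window=[0, 1])       # history says 2 follows [0, 1]
forecast = pfrp_forecast([1.8], g)               # blend with a local prediction
```

The retrieved prediction is interpretable by construction: one can show the user exactly which historical segments produced it.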
[AI-160] Can a Small Model Learn to Look Before It Leaps? Dynamic Learning and Proactive Correction for Hallucination Detection
【速读】: This paper addresses the “Lost in the Middle” (LITM) phenomenon in long-context use of large language models (LLMs), where retrieval accuracy for factual information located in the middle of long texts drops significantly. The study shows that Gemini 2.5 Flash maintains high accuracy on needle-in-a-haystack question answering regardless of where the target information sits in the document, including positions near the input context limit, indicating a substantial improvement in long-range retrieval; the key finding is that this model effectively mitigates, or even eliminates, the LITM effect observed in earlier LLMs.
Abstract:The ability of large language models (LLMs) to recall and retrieve information from long contexts is critical for many real-world applications. Prior work (Liu et al., 2023) reported that LLMs suffer significant drops in retrieval accuracy for facts placed in the middle of large contexts, an effect known as “Lost in the Middle” (LITM). We find that the model Gemini 2.5 Flash can answer needle-in-a-haystack questions with great accuracy regardless of document position, including when the document is nearly at the input context limit. Our results suggest that the “Lost in the Middle” effect is not present for simple factoid Q&A in Gemini 2.5 Flash, indicating substantial improvements in long-context retrieval.
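The document-construction side of such a needle-in-a-haystack probe is simple to sketch (illustrative only, not the paper's exact protocol; the filler text and needle are assumptions): embed a fact at a chosen relative depth in filler text, then query the model for the fact at each depth and plot accuracy against depth.

```python
# Illustrative needle-in-a-haystack document builder (not the paper's setup).

FILLER = "The sky was a calm and unremarkable shade of grey that day. "
NEEDLE = "The secret code is 7429."

def build_haystack(n_filler, depth):
    """depth in [0, 1]: 0 puts the needle at the start, 1 at the end."""
    pos = round(depth * n_filler)
    parts = [FILLER] * n_filler
    parts.insert(pos, NEEDLE + " ")
    return "".join(parts)

doc = build_haystack(n_filler=100, depth=0.5)
assert NEEDLE in doc
# the needle sits roughly mid-document, as requested
assert abs(doc.index(NEEDLE) / len(doc) - 0.5) < 0.05
```

An evaluation loop would then sweep `depth` over a grid, send each document plus the question to the model, and record whether the answer contains the needle's fact.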
[AI-162] EGG-SR: Embedding Symbolic Equivalence into Symbolic Regression via Equality Graph
Abstract:Symbolic regression seeks to uncover physical laws from experimental data by searching for closed-form expressions, which is an important task in AI-driven scientific discovery. Yet the exponential growth of the search space of expressions renders the task computationally challenging. A promising yet underexplored direction for reducing the effective search space and accelerating training lies in symbolic equivalence: many expressions, although syntactically different, define the same function – for example, \log(x_1^2 x_2^3), \log(x_1^2)+\log(x_2^3), and 2\log(x_1)+3\log(x_2). Existing algorithms treat such variants as distinct outputs, leading to redundant exploration and slow learning. We introduce EGG-SR, a unified framework that integrates equality graphs (e-graphs) into diverse symbolic regression algorithms, including Monte Carlo Tree Search (MCTS), deep reinforcement learning (DRL), and large language models (LLMs). EGG-SR compactly represents equivalent expressions through the proposed EGG module, enabling more efficient learning by: (1) pruning redundant subtree exploration in EGG-MCTS, (2) aggregating rewards across equivalence classes in EGG-DRL, and (3) enriching feedback prompts in EGG-LLM. Under mild assumptions, we show that embedding e-graphs tightens the regret bound of MCTS and reduces the variance of the DRL gradient estimator. Empirically, EGG-SR consistently enhances multiple baselines across challenging benchmarks, discovering equations with lower normalized mean squared error than state-of-the-art methods. Code implementation is available at: this https URL.
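Why equivalence collapses the search space can be shown with a toy canonicalizer (this is not an e-graph, which maintains all rewrites in a shared structure; here a single rewrite direction suffices, and the expression encoding is an assumption): the three log-expressions from the abstract reduce to one canonical sum-of-weighted-logs, so a search need only visit the class once.

```python
# Toy sketch: canonicalize log expressions to sorted (variable, exponent)
# tuples so syntactic variants of the same function compare equal.
# Expression encoding ("log"/"mul"/"pow"/"add"/"scale" tuples) is illustrative.

def canonical_log(expr):
    op = expr[0]
    if op == "log":
        return canonical_log(expr[1])
    if op == "mul":                       # log(ab) = log a + log b
        return tuple(sorted(canonical_log(expr[1]) + canonical_log(expr[2])))
    if op == "pow":                       # ("pow", var, k) -> k * log(var)
        return ((expr[1], expr[2]),)
    if op == "add":                       # sum of log terms
        return tuple(sorted(canonical_log(expr[1]) + canonical_log(expr[2])))
    if op == "scale":                     # ("scale", k, term) -> k * term
        return tuple((v, k * e) for k in [expr[1]] for v, e in canonical_log(expr[2]))
    raise ValueError(op)

e1 = ("log", ("mul", ("pow", "x1", 2), ("pow", "x2", 3)))         # log(x1^2 x2^3)
e2 = ("add", ("log", ("pow", "x1", 2)), ("log", ("pow", "x2", 3)))
e3 = ("add", ("scale", 2, ("log", ("pow", "x1", 1))),             # 2log x1 + 3log x2
             ("scale", 3, ("log", ("pow", "x2", 1))))
assert canonical_log(e1) == canonical_log(e2) == canonical_log(e3)
```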
[AI-163] Policy Gradient-Based EMT-in-the-Loop Learning to Mitigate Sub-Synchronous Control Interactions
Link: https://arxiv.org/abs/2511.05822 Authors: Sayak Mukherjee, Ramij R. Hossain, Kaustav Chatterjee, Sameer Nekkalapu, Marcelo Elizondo Affiliations: unknown Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI) Comments: 10 pages, 7 figures
Abstract:This paper explores the development of learning-based tunable control gains using an EMT-in-the-loop simulation framework (e.g., PSCAD interfaced with Python-based learning modules) to address critical sub-synchronous oscillations. Since sub-synchronous control interactions (SSCI) arise from the mis-tuning of control gains under specific grid configurations, effective mitigation strategies require adaptive re-tuning of these gains. Such adaptiveness can be achieved by employing a closed-loop, learning-based framework that accounts for the grid conditions responsible for such sub-synchronous oscillations. This paper addresses this need by adopting methodologies inspired by Markov decision process (MDP)-based reinforcement learning (RL), with a particular emphasis on simpler deep policy gradient methods augmented with SSCI-specific signal processing modules such as down-sampling, bandpass filtering, and oscillation-energy-dependent reward computation. Our experiments in a real-world event setting demonstrate that the policy trained with deep policy gradients can adaptively compute gain settings in response to varying grid conditions and optimally suppress oscillations induced by control interactions.
[AI-164] WAR-Re: Web API Recommendation with Semantic Reasoning
【速读】: This paper tackles two key problems in Web API recommendation: first, existing methods use a fixed top-N recommendation strategy that cannot accommodate the varying number of APIs required by different mashups; second, recommendations lack explainability, as methods output only ranked lists without justification, leaving users unable to understand the recommendation logic. The key to the solution is WAR-Re, an LLM-based model that uses special start and stop tokens to dynamically adjust the number of recommended APIs, together with a two-stage training strategy of supervised fine-tuning followed by reinforcement learning via Group Relative Policy Optimization (GRPO), jointly improving recommendation accuracy and the ability to generate semantic reasoning justifications. Experiments on the ProgrammableWeb dataset show that WAR-Re improves recommendation accuracy by up to 21.59% over the state-of-the-art baseline while consistently producing high-quality explanations.
Abstract:With the development of cloud computing, the number of Web APIs has increased dramatically, further intensifying the demand for efficient Web API recommendation. Despite the demonstrated success of previous Web API recommendation solutions, two critical challenges persist: 1) a fixed top-N recommendation that cannot accommodate the varying API cardinality requirements of different mashups, and 2) these methods output only ranked API lists without accompanying reasons, depriving users of understanding the recommendation. To address these challenges, we propose WAR-Re, an LLM-based model for Web API recommendation with semantic reasoning for justification. WAR-Re leverages special start and stop tokens to handle the first challenge and uses two-stage training: supervised fine-tuning and reinforcement learning via Group Relative Policy Optimization (GRPO) to enhance the model’s ability in both tasks. Comprehensive experimental evaluations on the ProgrammableWeb dataset demonstrate that WAR-Re achieves a gain of up to 21.59% over the state-of-the-art baseline model in recommendation accuracy, while consistently producing high-quality semantic reasons for recommendations.
[AI-165] In-depth Analysis on Caching and Pre-fetching in Mixture of Experts Offloading
【速读】: This paper addresses the high memory footprint of Mixture of Experts (MoE) models at deployment time, which makes them hard to run efficiently on GPU-memory-constrained edge devices. The core of the solution is an in-depth study of MoE expert activation patterns and caching behavior, leading to an LFU (Least Frequently Used) caching optimization that clearly outperforms the conventional LRU (Least Recently Used) policy, plus a speculative expert pre-fetching mechanism whose large performance potential is validated through detailed trace analysis. The paper also systematically characterizes the behavior of the gating network and expert modules, informing future work on MoE interpretability and low-loss pruning techniques.
Abstract:In today’s landscape, Mixture of Experts (MoE) is a crucial architecture used by many of the most advanced models. A major challenge of MoE models is that their unique architecture usually requires much more memory than dense counterparts, making them harder to deploy in environments with limited GPU memory, such as edge devices. MoE offloading is a promising technique proposed to overcome this challenge, especially when enhanced with caching and pre-fetching, but prior work stopped at suboptimal caching algorithms and offered limited insights. In this work, we study MoE offloading in depth and make the following contributions: 1. We analyze expert activation and LRU caching behavior in detail and provide traces. 2. We propose an LFU caching optimization based on our analysis and obtain strong improvements over LRU. 3. We implement and experiment with speculative expert pre-fetching, providing detailed traces that show its huge potential. 4. In addition, our study extensively covers the behavior of the MoE architecture itself, offering information on the characteristics of the gating network and experts. This can inspire future work on the interpretation of MoE models and the development of pruning techniques for MoE architectures with minimal performance loss.
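The LRU-versus-LFU comparison can be reproduced on a synthetic trace (an illustrative sketch, not the paper's experiment; the trace shape and capacity are assumptions): when one hot expert is interleaved with enough one-off cold experts, LRU keeps evicting the hot expert while LFU pins it by frequency.

```python
# Sketch: hit-rate comparison of LRU vs LFU expert caching on a skewed trace.
from collections import Counter, OrderedDict

def hit_rate(trace, capacity, policy):
    cache, freq, hits = OrderedDict(), Counter(), 0
    for expert in trace:
        freq[expert] += 1
        if expert in cache:
            hits += 1
            cache.move_to_end(expert)          # recency update for LRU
        else:
            if len(cache) >= capacity:
                if policy == "lru":
                    cache.popitem(last=False)  # evict least recently used
                else:                          # lfu: evict least frequent
                    victim = min(cache, key=lambda e: freq[e])
                    del cache[victim]
            cache[expert] = True
    return hits / len(trace)

# One hot expert interleaved with three fresh cold experts per cycle: with
# capacity 3, LRU evicts the hot expert every cycle, LFU learns to keep it.
trace = []
for i in range(30):
    trace += ["hot_expert", f"cold{3*i}", f"cold{3*i + 1}", f"cold{3*i + 2}"]
lru, lfu = hit_rate(trace, 3, "lru"), hit_rate(trace, 3, "lfu")
assert lfu > lru
```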
[AI-166] MOSS: Efficient and Accurate FP8 LLM Training with Microscaling and Automatic Scaling
Abstract:Training large language models with FP8 formats offers significant efficiency gains. However, the reduced numerical precision of FP8 poses challenges for stable and accurate training. Current frameworks preserve training performance using mixed-granularity quantization, i.e., applying per-group quantization for activations and per-tensor/block quantization for weights. While effective, per-group quantization requires scaling along the inner dimension of matrix multiplication, introducing additional dequantization overhead. Moreover, these frameworks often rely on just-in-time scaling to dynamically adjust scaling factors based on the current data distribution. However, this online quantization is inefficient for FP8 training, as it involves multiple memory reads and writes that negate the performance benefits of FP8. To overcome these limitations, we propose MOSS, a novel FP8 training framework that ensures both efficiency and numerical stability. MOSS introduces two key innovations: (1) a two-level microscaling strategy for quantizing sensitive activations, which balances precision and dequantization cost by combining a high-precision global scale with compact, power-of-two local scales; and (2) automatic scaling for weights in linear layers, which eliminates the need for costly max-reduction operations by predicting and adjusting scaling factors during training. Leveraging these techniques, MOSS enables efficient FP8 training of a 7B parameter model, achieving performance comparable to the BF16 baseline while achieving up to 34% higher training throughput.
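A two-level scale in this spirit can be sketched numerically (a hedged illustration, not the MOSS kernels: the group size, e4m3 max of 448, and integer stand-ins for FP8 codes are assumptions): one high-precision global scale per tensor plus a compact power-of-two local scale per group, so dequantization multiplies by `global * 2**local` instead of storing a full-precision scale per group.

```python
# Illustrative two-level microscaling sketch (not the paper's implementation).
import math

FP8_MAX = 448.0  # max magnitude of the e4m3 format

def quantize_two_level(x, group_size=4):
    """Assumes x has at least one nonzero entry."""
    g = max(abs(v) for v in x) / FP8_MAX          # global scale, high precision
    out = []
    for s in range(0, len(x), group_size):
        grp = x[s:s + group_size]
        m = max(max(abs(v) for v in grp), 1e-30)
        # local scale restricted to a power of two: cheap to store and apply
        e = math.ceil(math.log2(m / (g * FP8_MAX)))
        scale = g * (2.0 ** e)
        out.append((e, [round(v / scale) for v in grp]))  # ints stand in for fp8
    return g, out

def dequantize(g, groups):
    vals = []
    for e, q in groups:
        scale = g * (2.0 ** e)
        vals += [qi * scale for qi in q]
    return vals

x = [0.5, -1.0, 2.0, 0.25, 100.0, -448.0, 7.0, 3.0]
g, packed = quantize_two_level(x)
xh = dequantize(g, packed)
err = max(abs(a - b) for a, b in zip(x, xh))
assert err < 1.0   # coarse check: reconstruction stays close
```

The small-magnitude first group gets a negative local exponent so its values use the representable range, while the large-magnitude second group keeps exponent zero.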
[AI-167] Measuring Model Performance in the Presence of an Intervention AAAI2026
Abstract:In multi-objective decision-making with hierarchical preferences, lexicographic bandits provide a natural framework for optimizing multiple objectives in a prioritized order. In this setting, a learner repeatedly selects arms and observes reward vectors, aiming to maximize the reward for the highest-priority objective, then the next, and so on. While previous studies have primarily focused on regret minimization, this work bridges the gap between regret minimization and best arm identification under lexicographic preferences. We propose two elimination-based algorithms to address this joint objective. The first algorithm eliminates suboptimal arms sequentially, layer by layer, in accordance with the objective priorities, and achieves sample complexity and regret bounds comparable to those of the best single-objective algorithms. The second algorithm simultaneously leverages reward information from all objectives in each round, effectively exploiting cross-objective dependencies. Remarkably, it outperforms the known lower bound for the single-objective bandit problem, highlighting the benefit of cross-objective information sharing in the multi-objective setting. Empirical results further validate their superior performance over baselines.
[AI-169] When AI Meets the Web: Prompt Injection Risks in Third-Party AI Chatbot Plugins
Abstract:Prompt injection attacks pose a critical threat to large language models (LLMs), with prior work focusing on cutting-edge LLM applications like personal copilots. In contrast, simpler LLM applications, such as customer service chatbots, are widespread on the web, yet their security posture and exposure to such attacks remain poorly understood. These applications often rely on third-party chatbot plugins that act as intermediaries to commercial LLM APIs, offering non-expert website builders intuitive ways to customize chatbot behaviors. To bridge this gap, we present the first large-scale study of 17 third-party chatbot plugins used by over 10,000 public websites, uncovering previously unknown prompt injection risks in practice. First, 8 of these plugins (used by 8,000 websites) fail to enforce the integrity of the conversation history transmitted in network requests between the website visitor and the chatbot. This oversight amplifies the impact of direct prompt injection attacks by allowing adversaries to forge conversation histories (including fake system messages), boosting their ability to elicit unintended behavior (e.g., code generation) by 3 to 8x. Second, 15 plugins offer tools, such as web-scraping, to enrich the chatbot’s context with website-specific content. However, these tools do not distinguish the website’s trusted content (e.g., product descriptions) from untrusted, third-party content (e.g., customer reviews), introducing a risk of indirect prompt injection. Notably, we found that ~13% of e-commerce websites have already exposed their chatbots to third-party content. We systematically evaluate both vulnerabilities through controlled experiments grounded in real-world observations, focusing on factors such as system prompt design and the underlying LLM. Our findings show that many plugins adopt insecure practices that undermine the built-in LLM safeguards.
[AI-170] VLAD-Grasp: Zero-shot Grasp Detection via Vision-Language Models
Abstract:Robotic grasping is a fundamental capability for autonomous manipulation; however, most existing methods rely on large-scale expert annotations and necessitate retraining to handle new objects. We present VLAD-Grasp, a Vision-Language model Assisted zero-shot approach for Detecting grasps. From a single RGB-D image, our method (1) prompts a large vision-language model to generate a goal image where a straight rod “impales” the object, representing an antipodal grasp, (2) predicts depth and segmentation to lift this generated image into 3D, and (3) aligns generated and observed object point clouds via principal component analysis and correspondence-free optimization to recover an executable grasp pose. Unlike prior work, our approach is training-free and does not rely on curated grasp datasets. Despite this, VLAD-Grasp achieves performance that is competitive with or superior to that of state-of-the-art supervised models on the Cornell and Jacquard datasets. We further demonstrate zero-shot generalization to novel real-world objects on a Franka Research 3 robot, highlighting vision-language foundation models as powerful priors for robotic manipulation.
[AI-171] SymLight: Exploring Interpretable and Deployable Symbolic Policies for Traffic Signal Control
【速读】: This paper addresses the poor interpretability and deployability of deep reinforcement learning for traffic signal control (TSC): neural policies are over-parameterized and opaque, making them hard to interpret and to deploy on resource-constrained edge devices. The key to the solution is SymLight, a framework based on Monte Carlo Tree Search (MCTS) that searches for inherently interpretable and deployable symbolic priority functions to serve as TSC policies. A priority function takes traffic features as input and outputs a priority for each signal phase to direct phase transitions; a concise yet expressive priority-function representation mitigates the combinatorial explosion of the MCTS action space, and a probabilistic structural rollout strategy leverages structural patterns from previously discovered high-quality priority functions to guide rollouts. Experiments on real-world datasets show performance superior to baselines while keeping the policies interpretable and deployable.
Abstract:Deep reinforcement learning has achieved significant success in automatically devising effective traffic signal control (TSC) policies. Neural policies, however, tend to be over-parameterized and non-transparent, hindering their interpretability and deployability on resource-limited edge devices. This work presents SymLight, a priority function search framework based on Monte Carlo Tree Search (MCTS) for discovering inherently interpretable and deployable symbolic priority functions to serve as TSC policies. The priority function accepts traffic features as input and outputs a priority for each traffic signal phase, which subsequently directs the phase transition. For effective search, we propose a concise yet expressive priority function representation, which helps mitigate the combinatorial explosion of the action space in MCTS. Additionally, a probabilistic structural rollout strategy is introduced to leverage structural patterns from previously discovered high-quality priority functions, guiding the rollout process. Our experiments on real-world datasets demonstrate SymLight’s superior performance across a range of baselines. A key advantage is SymLight’s ability to produce interpretable and deployable TSC policies while maintaining excellent performance.
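What a symbolic priority-function policy looks like at runtime can be shown in a few lines (an invented example, not a SymLight-discovered function; the feature names and coefficients are assumptions): the function maps per-phase traffic features to a scalar priority, and the controller switches to the argmax phase.

```python
# Illustrative symbolic TSC policy (not a SymLight output): priorities are a
# readable closed-form of traffic features, so the policy is inspectable.

def priority(phase):
    """Queue pressure plus waiting-time pressure, minus a switch-inertia term."""
    return phase["queue_len"] + 0.2 * phase["max_wait_s"] - 0.5 * phase["is_active"]

def next_phase(phases):
    return max(phases, key=lambda name: priority(phases[name]))

phases = {
    "NS_through": {"queue_len": 12, "max_wait_s": 40, "is_active": 1},
    "EW_through": {"queue_len": 9,  "max_wait_s": 75, "is_active": 0},
    "NS_left":    {"queue_len": 3,  "max_wait_s": 20, "is_active": 0},
}
print(next_phase(phases))  # prints EW_through
```

Because the whole policy is one arithmetic expression plus an argmax, it runs on a microcontroller and a traffic engineer can read off why a phase was chosen.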
[AI-172] Lived Experience in Dialogue: Co-designing Personalization in Large Language Models to Support Youth Mental Well-being
Link: https://arxiv.org/abs/2511.05769 Authors: Kathleen W. Guan, Sarthak Giri, Mohammed Amara, Bernard J. Jansen, Enrico Liscio, Milena Esherick, Mohammed Al Owayyed, Ausrine Ratkute, Gayane Sedrakyan, Mark de Reuver, Joao Fernando Ferreira Goncalves, Caroline A. Figueroa Affiliations: unknown Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI) Comments:
[AI-173] CoT-X: An Adaptive Framework for Cross-Model Chain-of-Thought Transfer and Optimization KDD2025
【速读】: This paper addresses the substantial inference overhead that Chain-of-Thought (CoT) reasoning introduces in large language models (LLMs), which limits deployment in resource-constrained settings. The key to the solution is an adaptive reasoning summarization framework that compresses reasoning traces via semantic segmentation with importance scoring, budget-aware dynamic compression, and coherence reconstruction, preserving critical reasoning steps while greatly reducing token usage. On a bank of medical examination questions, the method achieves up to 40% higher accuracy than simple truncation under the same token budget and shows strong transferability across model scales and architectures.
Abstract:Chain-of-Thought (CoT) reasoning enhances the problem-solving ability of large language models (LLMs) but leads to substantial inference overhead, limiting deployment in resource-constrained settings. This paper investigates efficient CoT transfer across models of different scales and architectures through an adaptive reasoning summarization framework. The proposed method compresses reasoning traces via semantic segmentation with importance scoring, budget-aware dynamic compression, and coherence reconstruction, preserving critical reasoning steps while significantly reducing token usage. Experiments on 7,501 medical examination questions across 10 specialties show up to 40% higher accuracy than truncation under the same token budgets. Evaluations on 64 model pairs from eight LLMs (1.5B-32B parameters, including DeepSeek-R1 and Qwen3) confirm strong cross-model transferability. Furthermore, a Gaussian Process-based Bayesian optimization module reduces evaluation cost by 84% and reveals a power-law relationship between model size and cross-domain robustness. These results demonstrate that reasoning summarization provides a practical path toward efficient CoT transfer, enabling advanced reasoning under tight computational constraints. Code will be released upon publication.
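The budget-aware selection step can be sketched as follows (a hedged toy: the paper scores segments with a learned/semantic method, while this sketch uses a keyword heuristic, and the key terms and budget are assumptions): score each reasoning step, then keep the highest-scoring steps in original order until the token budget is spent, instead of truncating the tail.

```python
# Toy budget-aware reasoning compression (illustrative scorer, not the paper's).

KEY_TERMS = {"therefore", "because", "diagnosis", "answer"}

def importance(step):
    words = step.lower().split()
    # keyword hits dominate; a small length prior breaks ties toward short steps
    return sum(w.strip(",.") in KEY_TERMS for w in words) + 1 / (1 + len(words))

def compress(steps, budget_tokens):
    ranked = sorted(range(len(steps)), key=lambda i: -importance(steps[i]))
    kept, used = set(), 0
    for i in ranked:
        cost = len(steps[i].split())      # crude whitespace token count
        if used + cost <= budget_tokens:
            kept.add(i)
            used += cost
    return [steps[i] for i in sorted(kept)]   # preserve original order

steps = [
    "The patient reports chest pain radiating to the left arm.",
    "Weather today is irrelevant to the case.",
    "Because troponin is elevated, myocardial injury is likely.",
    "Therefore the answer is acute coronary syndrome.",
]
summary = compress(steps, budget_tokens=18)
assert steps[1] not in summary and steps[3] in summary
```

Unlike plain truncation, the conclusion step survives even though it is the last one in the trace, which is the failure mode the paper's 40%-over-truncation result targets.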
[AI-174] Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder
【速读】: This paper addresses a scalability challenge for sparse autoencoders (SAEs) in interpreting large language models (LLMs): higher-dimensional hidden layers improve interpretability but make training and inference prohibitively expensive. Existing Mixture of Experts (MoE) approaches reduce computation by partitioning the SAE into narrower expert networks, but their key bottleneck is a lack of feature specialization across experts, which often learn redundant or duplicated features. The paper proposes two core innovations: Multiple Expert Activation, which simultaneously engages semantically weighted expert subsets to encourage a division of labor among features, and Feature Scaling, which enhances feature diversity through adaptive high-frequency scaling. Experiments show 24% lower reconstruction error and a 99% reduction in feature redundancy compared with existing MoE-SAE methods, bridging the interpretability-efficiency gap in LLM analysis.
Abstract:Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting large language models (LLMs) by decomposing token activations into combinations of human-understandable features. While SAEs provide crucial insights into LLM explanations, their practical adoption faces a fundamental challenge: better interpretability demands that SAEs’ hidden layers have high dimensionality to satisfy sparsity constraints, resulting in prohibitive training and inference costs. Recent Mixture of Experts (MoE) approaches attempt to address this by partitioning SAEs into narrower expert networks with gated activation, thereby reducing computation. In a well-designed MoE, each expert should focus on learning a distinct set of features. However, we identify a critical limitation in MoE-SAE: experts often fail to specialize, which means they frequently learn overlapping or identical features. To deal with it, we propose two key innovations: (1) Multiple Expert Activation that simultaneously engages semantically weighted expert subsets to encourage specialization, and (2) Feature Scaling that enhances diversity through adaptive high-frequency scaling. Experiments demonstrate a 24% lower reconstruction error and a 99% reduction in feature redundancy compared to existing MoE-SAE methods. This work bridges the interpretability-efficiency gap in LLM analysis, allowing transparent model inspection without compromising computational feasibility.
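The redundancy the paper measures can be sketched directly (an illustrative metric, not necessarily the paper's exact definition; the threshold `tau` is an assumption): count how many of one expert's decoder features have a near-duplicate, by cosine similarity, among another expert's features.

```python
# Sketch of a cross-expert feature-redundancy metric (stdlib only).

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def redundancy(expert_a, expert_b, tau=0.95):
    """Fraction of expert_a's features nearly duplicated in expert_b."""
    dup = sum(1 for f in expert_a if max(cosine(f, g) for g in expert_b) > tau)
    return dup / len(expert_a)

expert_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
expert_b = [[0.99, 0.01, 0.0], [0.0, 0.0, 1.0]]   # one duplicate, one distinct
print(redundancy(expert_a, expert_b))  # prints 0.5
```

A well-specialized MoE-SAE would drive this fraction toward zero across all expert pairs, which is what the reported 99% redundancy reduction corresponds to.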
[AI-175] Compressing Chemistry Reveals Functional Groups
Abstract:We introduce the first formal large-scale assessment of the utility of traditional chemical functional groups as used in chemical explanations. Our assessment employs a fundamental principle from computational learning theory: a good explanation of data should also compress the data. We introduce an unsupervised learning algorithm based on the Minimum Message Length (MML) principle that searches for substructures that compress around three million biologically relevant molecules. We demonstrate that the discovered substructures contain most human-curated functional groups as well as novel larger patterns with more specific functions. We also run our algorithm on 24 specific bioactivity prediction datasets to discover dataset-specific functional groups. Fingerprints constructed from dataset-specific functional groups are shown to significantly outperform other fingerprint representations, including the MACCS and Morgan fingerprint, when training ridge regression models on bioactivity regression tasks.
[AI-176] AdvisingWise: Supporting Academic Advising in Higher Educations Through a Human-in-the-Loop Multi-Agent Framework
Abstract:Academic advising is critical to student success in higher education, yet high student-to-advisor ratios limit advisors’ capacity to provide timely support, particularly during peak periods. Recent advances in Large Language Models (LLMs) present opportunities to enhance the advising process. We present AdvisingWise, a multi-agent system that automates time-consuming tasks, such as information retrieval and response drafting, while preserving human oversight. AdvisingWise leverages authoritative institutional resources and adaptively prompts students about their academic backgrounds to generate reliable, personalized responses. All system responses undergo human advisor validation before delivery to students. We evaluate AdvisingWise through a mixed-methods approach: (1) expert evaluation on responses of 20 sample queries, (2) LLM-as-a-judge evaluation of the information retrieval strategy, and (3) a user study with 8 academic advisors to assess the system’s practical utility. Our evaluation shows that AdvisingWise produces accurate, personalized responses. Advisors reported increasingly positive perceptions after using AdvisingWise, as their initial concerns about reliability and personalization diminished. We conclude by discussing the implications of human-AI synergy on the practice of academic advising.
[AI-177] SSTODE: Ocean-Atmosphere Physics-Informed Neural ODEs for Sea Surface Temperature Prediction AAAI
Link: https://arxiv.org/abs/2511.05629 Authors: Zheng Jiang, Wei Wang, Gaowei Zhang, Yi Wang Affiliations: unknown Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph) Comments: To be published in the Proceedings of AAAI-AISI 2026
Abstract:Sea Surface Temperature (SST) is crucial for understanding upper-ocean thermal dynamics and ocean-atmosphere interactions, which have profound economic and social impacts. While data-driven models show promise in SST prediction, their black-box nature often limits interpretability and overlooks key physical processes. Recently, physics-informed neural networks have been gaining momentum but struggle with complex ocean-atmosphere dynamics due to 1) inadequate characterization of seawater movement (e.g., coastal upwelling) and 2) insufficient integration of external SST drivers (e.g., turbulent heat fluxes). To address these challenges, we propose SSTODE, a physics-informed Neural Ordinary Differential Equations (Neural ODEs) framework for SST prediction. First, we derive ODEs from fluid transport principles, incorporating both advection and diffusion to model ocean spatiotemporal dynamics. Through variational optimization, we recover a latent velocity field that explicitly governs the temporal dynamics of SST. Building upon the ODE formulation, we introduce an Energy Exchanges Integrator (EEI), inspired by ocean heat budget equations, to account for external forcing factors. The variations in the components of these factors thus provide deeper insights into SST dynamics. Extensive experiments demonstrate that SSTODE achieves state-of-the-art performance on global and regional SST forecasting benchmarks. Furthermore, SSTODE visually reveals the impact of advection dynamics, thermal diffusion patterns, and diurnal heating-cooling cycles on SST evolution. These findings demonstrate the model's interpretability and physical consistency.
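The fluid-transport starting point the abstract describes is, in its standard textbook form, an advection-diffusion balance with external forcing (generic notation here, not necessarily the paper's exact symbols):

```latex
\frac{\partial T}{\partial t}
  \;=\; -\,(\mathbf{v}\cdot\nabla)\,T
  \;+\; \kappa\,\nabla^{2} T
  \;+\; F_{\mathrm{ext}},
```

where \(T\) is the SST field, \(\mathbf{v}\) the latent velocity field recovered by variational optimization, \(\kappa\) a diffusivity, and \(F_{\mathrm{ext}}\) the external energy exchanges (e.g., turbulent heat fluxes) that the EEI component accounts for.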
[AI-178] Unveiling the Training Dynamics of ReLU Networks through a Linear Lens
[AI-179] Assessing the Reliability of Large Language Models in the Bengali Legal Context: A Comparative Evaluation Using LLM-as-Judge and Legal Experts
Link: https://arxiv.org/abs/2511.05627 Authors: Sabik Aftahee, A.F.M. Farhad, Arpita Mallik, Ratnajit Dhar, Jawadul Karim, Nahiyan Bin Noor, Ishmam Ahmed Solaiman Affiliations: unknown Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI) Comments:
Abstract:Accessing legal help in Bangladesh is hard. People face high fees, complex legal language, a shortage of lawyers, and millions of unresolved court cases. Generative AI models like OpenAI GPT-4.1 Mini, Gemini 2.0 Flash, Meta Llama 3 70B, and DeepSeek R1 could potentially democratize legal assistance by providing quick and affordable legal advice. In this study, we collected 250 authentic legal questions from the Facebook group “Know Your Rights,” where verified legal experts regularly provide authoritative answers. These questions were subsequently submitted to four advanced AI models, and responses were generated using a consistent, standardized prompt. A comprehensive dual evaluation framework was employed, in which a state-of-the-art LLM served as a judge, assessing each AI-generated response across four critical dimensions: factual accuracy, legal appropriateness, completeness, and clarity. Following this, the same set of questions was evaluated by three licensed Bangladeshi legal professionals according to the same criteria. In addition, automated evaluation metrics, including BLEU scores, were applied to assess response similarity. Our findings reveal a complex landscape where AI models frequently generate high-quality, well-structured legal responses but also produce dangerous misinformation, including fabricated case citations, incorrect legal procedures, and potentially harmful advice. These results underscore the critical need for rigorous expert validation and comprehensive safeguards before AI systems can be safely deployed for legal consultation in Bangladesh.
[AI-180] LLMs as Packagers of HPC Software
Quick Read: This paper addresses the difficulty of dependency management in the high performance computing (HPC) software ecosystem: efficiently and accurately generating maintainable Spack package recipes that support scientific applications' complex builds against hundreds of external dependencies. Writing and maintaining these recipes by hand is costly and does not scale. The core of the proposed solution is SpackIt, an end-to-end framework whose key innovations are context augmentation based on repository analysis, retrieval of relevant examples, and a structured loop of iterative refinement driven by diagnostic feedback. Experiments show that this approach raises installation success from 20% in the zero-shot setting to over 80%, far above direct generation with large language models (LLMs) alone, demonstrating the value of retrieval augmentation and feedback-driven refinement for reliable package synthesis.
Abstract:High performance computing (HPC) software ecosystems are inherently heterogeneous, comprising scientific applications that depend on hundreds of external packages, each with distinct build systems, options, and dependency constraints. Tools such as Spack automate dependency resolution and environment management, but their effectiveness relies on manually written build recipes. As these ecosystems grow, maintaining existing specifications and creating new ones becomes increasingly labor-intensive. While large language models (LLMs) have shown promise in code generation, automatically producing correct and maintainable Spack recipes remains a significant challenge. We present a systematic analysis of how LLMs and context-augmentation methods can assist in the generation of Spack recipes. To this end, we introduce SpackIt, an end-to-end framework that combines repository analysis, retrieval of relevant examples, and iterative refinement through diagnostic feedback. We apply SpackIt to a representative subset of 308 open-source HPC packages to assess its effectiveness and limitations. Our results show that SpackIt increases installation success from 20% in a zero-shot setting to over 80% in its best configuration, demonstrating the value of retrieval and structured feedback for reliable package synthesis.
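The generate-install-diagnose-refine loop the abstract describes can be sketched as follows. The `toy_generate` and `toy_install` functions are hypothetical stand-ins for the LLM recipe generator and the Spack build step, not SpackIt's actual interfaces.

```python
def refine_recipe(generate, install, max_rounds=3):
    """Iteratively generate a package recipe and refine it from build
    diagnostics, stopping at the first successful install."""
    recipe = generate(feedback=None)
    for _ in range(max_rounds):
        ok, diagnostics = install(recipe)
        if ok:
            return recipe, True
        recipe = generate(feedback=diagnostics)  # feed diagnostics back in
    return recipe, False

# Toy stand-ins: the first attempt omits a dependency, the retry adds it.
def toy_generate(feedback):
    base = {"depends_on": ["cmake"]}
    if feedback and "missing: mpi" in feedback:
        base["depends_on"].append("mpi")
    return base

def toy_install(recipe):
    if "mpi" not in recipe["depends_on"]:
        return False, "missing: mpi"
    return True, ""

recipe, ok = refine_recipe(toy_generate, toy_install)
print(ok, recipe["depends_on"])  # True ['cmake', 'mpi']
```

The bounded retry count mirrors the paper's finding that most of the gain comes from a small number of diagnostic-driven refinement rounds rather than unbounded regeneration.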
[AI-181] Report from Workshop on Dialogue alongside Artificial Intelligence
Link: https://arxiv.org/abs/2511.05625 Authors: Thomas J McKenna (Boston University), Ingvill Rasmussen (University of Oslo), Sten Ludvigsen (University of Oslo), Avivit Arvatz (The Hebrew University of Jerusalem), Christa Asterhan (The Hebrew University of Jerusalem), Gaowei Chen (The University of Hong Kong), Julie Cohen (University of Virginia), Michele Flammia (Independent Scholar), Dongkeun Han (University of Cambridge), Emma Hayward (University of Cambridge), Heather Hill (Harvard University), Yifat Kolikant (The Hebrew University of Jerusalem), Helen Lehndorf (Freie Universität Berlin), Kexin Li (The University of Hong Kong), Lindsay Clare Matsumura (University of Pittsburgh), Henrik Tjønn (University of Oslo), Pengjin Wang (The University of Hong Kong), Rupert Wegerif (University of Cambridge) Affiliations: as listed Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI) Comments: Report from the Workshop on Dialogue alongside Artificial Intelligence (2025)
Abstract:Educational dialogue, the collaborative exchange of ideas through talk, is widely recognized as a catalyst for deeper learning and critical thinking in and across contexts. At the same time, artificial intelligence (AI) has rapidly emerged as a powerful force in education, with the potential to address major challenges, personalize learning, and innovate teaching practices. However, these advances come with significant risks: rapid AI development can undermine human agency, exacerbate inequities, and outpace our capacity to guide its use with sound policy. Human learning presupposes cognitive effort and social interaction (dialogue). In response to this evolving landscape, an international workshop titled “Educational Dialogue: Moving Thinking Forward” convened 19 leading researchers from 11 countries in Cambridge (September 1-3, 2025) to examine the intersection of AI and educational dialogue. This AI-focused strand of the workshop centered on three critical questions: (1) When is AI truly useful in education, and when might it merely replace human effort at the expense of learning? (2) Under what conditions can AI use lead to better dialogic teaching and learning? (3) Does the AI-human partnership risk outpacing and displacing human educational work, and what are the implications? These questions framed two days of presentations and structured dialogue among participants.
[AI-182] Frequency Matters: When Time Series Foundation Models Fail Under Spectral Shift NEURIPS2025
Quick Read: This paper investigates why time series foundation models (TSFMs) generalize poorly in industrial settings, often underperforming domain-adapted baselines. The study identifies spectral shift, a mismatch between the dominant frequency components of a downstream task and the frequency distribution represented during pretraining, as a key factor. The proposed remedy is to make TSFMs frequency-aware: controlled synthetic experiments confirm that spectral mismatch causes systematic performance degradation, motivating pretraining and evaluation protocols that explicitly account for spectral diversity to improve robustness in real industrial deployments.
Link: https://arxiv.org/abs/2511.05619 Authors: Tianze Wang, Sofiane Ennadir, John Pertoft, Gabriela Zarzar Gandler, Lele Cao, Zineb Senane, Styliani Katsarou, Sahar Asadi, Axel Karlsson, Oleg Smirnov Affiliations: unknown Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Comments: Accepted and presented at NeurIPS 2025 Workshop on Recent Advances in Time Series Foundation Models (BERT2S)
Abstract:Time series foundation models (TSFMs) have shown strong results on public benchmarks, prompting comparisons to a “BERT moment” for time series. Their effectiveness in industrial settings, however, remains uncertain. We examine why TSFMs often struggle to generalize and highlight spectral shift (a mismatch between the dominant frequency components in downstream tasks and those represented during pretraining) as a key factor. We present evidence from an industrial-scale player engagement prediction task in mobile gaming, where TSFMs underperform domain-adapted baselines. To isolate the mechanism, we design controlled synthetic experiments contrasting signals with seen versus unseen frequency bands, observing systematic degradation under spectral mismatch. These findings position frequency awareness as critical for robust TSFM deployment and motivate new pretraining and evaluation protocols that explicitly account for spectral diversity.
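The notion of spectral shift can be illustrated with a small FFT check that compares the dominant frequency of a downstream signal against a pretraining-like one. This is a generic sketch of the concept, not the paper's actual diagnostic.

```python
import numpy as np

def dominant_freq(x, fs=1.0):
    """Frequency of the largest-magnitude spectral bin (mean removed
    first so the DC bin does not dominate)."""
    spec = np.abs(np.fft.rfft(x - x.mean()))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return freqs[np.argmax(spec)]

t = np.arange(512)
pretrain_like = np.sin(2 * np.pi * 0.05 * t)  # slow component, as if seen in pretraining
downstream = np.sin(2 * np.pi * 0.20 * t)     # much faster oscillation at deployment

f_pre, f_down = dominant_freq(pretrain_like), dominant_freq(downstream)
shift = abs(f_down - f_pre) / f_pre           # relative spectral shift
print(f_pre, f_down, shift > 1.0)
```

A large relative shift like this is the regime in which the paper observes systematic TSFM degradation on frequency bands unseen during pretraining.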
[AI-183] wa-hls4ml: A Benchmark and Surrogate Models for hls4ml Resource and Latency Estimation
Quick Read: This paper targets a growing bottleneck in hardware accelerator design: as generative AI and machine learning (ML) models grow more complex, hardware synthesis, once a minor step in the design flow, increasingly limits rapid iteration. The key contribution is wa-hls4ml, a benchmark with a dataset of over 680,000 fully connected and convolutional neural networks, all synthesized with the hls4ml toolchain targeting Xilinx FPGAs. On top of this dataset, the authors build surrogate models based on graph neural networks (GNNs) and transformers that predict the resource usage and latency of ML accelerators. Experiments show these models estimate 75th-percentile resource usage on a synthetic test set within a few percent of the synthesized results, substantially accelerating design space exploration and optimization.
Link: https://arxiv.org/abs/2511.05615 Authors: Benjamin Hawks, Jason Weitz, Dmitri Demler, Karla Tame-Narvaez, Dennis Plotnikov, Mohammad Mehdi Rahimifar, Hamza Ezzaoui Rahali, Audrey C. Therrien, Donovan Sproule, Elham E Khoda, Keegan A. Smith, Russell Marroquin, Giuseppe Di Guglielmo, Nhan Tran, Javier Duarte, Vladimir Loncar Affiliations: unknown Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Instrumentation and Detectors (physics.ins-det) Comments: 30 pages, 18 figures
Abstract:As machine learning (ML) is increasingly implemented in hardware to address real-time challenges in scientific applications, the development of advanced toolchains has significantly reduced the time required to iterate on various designs. These advancements have solved major obstacles, but also exposed new challenges. For example, processes that were not previously considered bottlenecks, such as hardware synthesis, are becoming limiting factors in the rapid iteration of designs. To mitigate these emerging constraints, multiple efforts have been undertaken to develop an ML-based surrogate model that estimates resource usage of ML accelerator architectures. We introduce wa-hls4ml, a benchmark for ML accelerator resource and latency estimation, and its corresponding initial dataset of over 680,000 fully connected and convolutional neural networks, all synthesized using hls4ml and targeting Xilinx FPGAs. The benchmark evaluates the performance of resource and latency predictors against several common ML model architectures, primarily originating from scientific domains, as exemplar models, and the average performance across a subset of the dataset. Additionally, we introduce GNN- and transformer-based surrogate models that predict latency and resources for ML accelerators. We present the architecture and performance of the models and find that the models generally predict latency and resources at the 75th percentile within several percent of the synthesized resources on the synthetic test dataset.
[AI-184] An MLCommons Scientific Benchmarks Ontology
Link: https://arxiv.org/abs/2511.05614 Authors: Ben Hawks, Gregor von Laszewski, Matthew D. Sinclair, Marco Colombo, Shivaram Venkataraman, Rutwik Jain, Yiwei Jiang, Nhan Tran, Geoffrey Fox Affiliations: unknown Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF); Computational Physics (physics.comp-ph) Comments: 16 pages, 3 figures
Abstract:Scientific machine learning research spans diverse domains and data modalities, yet existing benchmark efforts remain siloed and lack standardization. This makes novel and transformative applications of machine learning to critical scientific use-cases more fragmented and less clear in pathways to impact. This paper introduces an ontology for scientific benchmarking developed through a unified, community-driven effort that extends the MLCommons ecosystem to cover physics, chemistry, materials science, biology, climate science, and more. Building on prior initiatives such as XAI-BENCH, FastML Science Benchmarks, PDEBench, and the SciMLBench framework, our effort consolidates a large set of disparate benchmarks and frameworks into a single taxonomy of scientific, application, and system-level benchmarks. New benchmarks can be added through an open submission workflow coordinated by the MLCommons Science Working Group and evaluated against a six-category rating rubric that promotes and identifies high-quality benchmarks, enabling stakeholders to select benchmarks that meet their specific needs. The architecture is extensible, supporting future scientific and AI/ML motifs, and we discuss methods for identifying emerging computing patterns for unique scientific workloads. The MLCommons Science Benchmarks Ontology provides a standardized, scalable foundation for reproducible, cross-domain benchmarking in scientific machine learning. A companion webpage for this work has also been developed as the effort evolves: this https URL
[AI-185] Who Evaluates AI's Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations
Abstract:Foundation models are increasingly central to high-stakes AI systems, and governance frameworks now depend on evaluations to assess their risks and capabilities. Although general capability evaluations are widespread, social impact assessments covering bias, fairness, privacy, environmental costs, and labor practices remain uneven across the AI ecosystem. To characterize this landscape, we conduct the first comprehensive analysis of both first-party and third-party social impact evaluation reporting across a wide range of model developers. Our study examines 186 first-party release reports and 183 post-release evaluation sources, and complements this quantitative analysis with interviews of model developers. We find a clear division of evaluation labor: first-party reporting is sparse, often superficial, and has declined over time in key areas such as environmental impact and bias, while third-party evaluators including academic researchers, nonprofits, and independent organizations provide broader and more rigorous coverage of bias, harmful content, and performance disparities. However, this complementarity has limits. Only model developers can authoritatively report on data provenance, content moderation labor, financial costs, and training infrastructure, yet interviews reveal that these disclosures are often deprioritized unless tied to product adoption or regulatory compliance. Our findings indicate that current evaluation practices leave major gaps in assessing AI’s societal impacts, highlighting the urgent need for policies that promote developer transparency, strengthen independent evaluation ecosystems, and create shared infrastructure to aggregate and compare third-party evaluations in a consistent and accessible way.
[AI-186] Conformal Prediction-Driven Adaptive Sampling for Digital Twins of Water Distribution Networks
Quick Read: This paper tackles state estimation for digital twins (DTs) of water distribution networks (WDNs), where limited sensor deployments make accurate monitoring difficult. Traditional uniform sampling ignores the differing uncertainty across nodes and wastes resources. The key idea is an adaptive framework combining LSTM forecasting with conformal prediction (CP): marginal CP quantifies per-node uncertainty, which then drives dynamic sensor placement. Experiments show 33-34% lower demand error than uniform sampling at 40% sensor coverage, while maintaining 89.4-90.2% empirical coverage with only 5-10% extra computation.
Abstract:Digital Twins (DTs) for Water Distribution Networks (WDNs) require accurate state estimation with limited sensors. Uniform sampling often wastes resources across nodes with different uncertainty. We propose an adaptive framework combining LSTM forecasting and Conformal Prediction (CP) to estimate node-wise uncertainty and focus sensing on the most uncertain points. Marginal CP is used for its low computational cost, suitable for real-time DTs. Experiments on Hanoi, Net3, and CTOWN show 33-34% lower demand error than uniform sampling at 40% coverage and maintain 89.4-90.2% empirical coverage with only 5-10% extra computation.
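The marginal split-conformal step can be sketched as follows: compute a per-node interval half-width from calibration residuals, then spend the sensing budget on the nodes with the widest intervals. The residuals here are synthetic toy data standing in for the paper's LSTM forecast errors.

```python
import numpy as np

rng = np.random.default_rng(1)

def conformal_width(residuals, alpha=0.1):
    """Marginal split-conformal half-width: the (1 - alpha) empirical
    quantile of absolute calibration residuals (finite-sample corrected)."""
    n = len(residuals)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(np.abs(residuals), min(q, 1.0))

# Toy calibration residuals for 5 nodes of a water network; node 2 is
# deliberately the noisiest, so it should be prioritized for sensing.
residuals = [rng.normal(0, s, size=200) for s in (0.1, 0.2, 1.0, 0.3, 0.15)]
widths = np.array([conformal_width(r) for r in residuals])

budget = 2                             # sensors we can afford this round
chosen = np.argsort(widths)[-budget:]  # most uncertain nodes first
print(widths.round(3), sorted(chosen.tolist()))
```

Marginal CP is chosen over heavier conditional variants precisely because this quantile computation is cheap enough to rerun inside a real-time digital twin loop.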
[AI-187] From Prompts to Power: Measuring the Energy Footprint of LLM Inference
Quick Read: This paper addresses the energy consumption of generative AI at inference time. As large language models (LLMs) scale, inference energy has come to far exceed training energy and dominate total lifecycle consumption, yet systematic analyses of inference efficiency remain scarce, limiting optimization and sustainable deployment. The solution is a large-scale measurement study (over 32,500 measurements) spanning 21 GPU configurations and 155 model architectures, using the vLLM inference engine for prompt-level energy metering. On this basis, the authors build a predictive model that accurately estimates inference energy for unseen model-hardware combinations, and ship it as a browser extension to raise user awareness of generative AI's environmental impact.
Abstract:The rapid expansion of Large Language Models (LLMs) has introduced unprecedented energy demands, extending beyond training to large-scale inference workloads that often dominate total lifecycle consumption. Deploying these models requires energy-intensive GPU infrastructure, and in some cases has even prompted plans to power data centers with nuclear energy. Despite this growing relevance, systematic analyses of inference energy consumption remain limited. In this work, we present a large-scale measurement-based study comprising over 32,500 measurements across 21 GPU configurations and 155 model architectures, from small open-source models to frontier systems. Using the vLLM inference engine, we quantify energy usage at the prompt level and identify how architectural and operational factors shape energy demand. Building on these insights, we develop a predictive model that accurately estimates inference energy consumption across unseen architectures and hardware, and implement it as a browser extension to raise awareness of the environmental impact of generative AI.
[AI-188] FlowNet: Modeling Dynamic Spatio-Temporal Systems via Flow Propagation
Abstract:Accurately modeling complex dynamic spatio-temporal systems requires capturing flow-mediated interdependencies and context-sensitive interaction dynamics. Existing methods, predominantly graph-based or attention-driven, rely on similarity-driven connectivity assumptions, neglecting asymmetric flow exchanges that govern system evolution. We propose Spatio-Temporal Flow, a physics-inspired paradigm that explicitly models dynamic node couplings through quantifiable flow transfers governed by conservation principles. Building on this, we design FlowNet, a novel architecture leveraging flow tokens as information carriers to simulate source-to-destination transfers via Flow Allocation Modules, ensuring state redistribution aligns with conservation laws. FlowNet dynamically adjusts the interaction radius through an Adaptive Spatial Masking module, suppressing irrelevant noise while enabling context-aware propagation. A cascaded architecture enhances scalability and nonlinear representation capacity. Experiments demonstrate that FlowNet significantly outperforms existing state-of-the-art approaches on seven metrics in the modeling of three real-world systems, validating its efficiency and physical interpretability. We establish a principled methodology for modeling complex systems through spatio-temporal flow interactions.
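The conservation principle behind spatio-temporal flow can be illustrated with a row-stochastic transfer matrix, under which the total "mass" of the system is preserved exactly at every step. This is a generic sketch of conservative flow propagation, not FlowNet's learned Flow Allocation Module.

```python
import numpy as np

def propagate_flow(state, transfer):
    """One step of flow propagation: transfer[i, j] is the fraction of
    node i's state sent to node j; rows sum to 1, so mass is conserved."""
    assert np.allclose(transfer.sum(axis=1), 1.0), "rows must be stochastic"
    return transfer.T @ state

state = np.array([4.0, 1.0, 0.0])
transfer = np.array([
    [0.5, 0.5, 0.0],   # node 0 keeps half, sends half to node 1
    [0.0, 0.8, 0.2],   # node 1 leaks 20% to node 2
    [0.0, 0.0, 1.0],   # node 2 is a sink
])
new_state = propagate_flow(state, transfer)
print(new_state, new_state.sum())  # total mass stays at 5.0
```

In FlowNet the transfer weights are produced by the network (and masked adaptively), but the same row-normalization idea is what keeps redistribution consistent with conservation laws.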
[AI-189] CoPRIS: Efficient and Stable Reinforcement Learning via Concurrency-Controlled Partial Rollout with Importance Sampling
Quick Read: This paper addresses the inefficiency of fully synchronous reinforcement learning (RL) post-training for large language models (LLMs), where training must wait for an entire batch of trajectories to finish: long trajectories stall the rollout and leave GPUs idle. The key contribution, Concurrency-Controlled Partial Rollout with Importance Sampling (CoPRIS), maintains a fixed number of concurrent rollouts, terminates early once enough samples are collected, and reuses unfinished trajectories to improve utilization. To mitigate the bias of off-policy trajectories, a Cross-stage Importance Sampling Correction concatenates log probabilities buffered under the previous policy with those recomputed under the current policy for importance-sampling correction. On challenging mathematical reasoning benchmarks, CoPRIS achieves up to 1.94x faster training with comparable or better performance than synchronous RL systems.
Abstract:Reinforcement learning (RL) post-training has become a trending paradigm for enhancing the capabilities of large language models (LLMs). Most existing RL systems for LLMs operate in a fully synchronous manner, where training must wait for the rollout of an entire batch to complete. This design leads to severe inefficiencies, as extremely long trajectories can stall the entire rollout process and leave many GPUs idle. To address this issue, we propose Concurrency- Controlled Partial Rollout with Importance Sampling (CoPRIS), which mitigates long-tail inefficiencies by maintaining a fixed number of concurrent rollouts, early-terminating once sufficient samples are collected, and reusing unfinished trajectories in subsequent rollouts. To mitigate the impact of off-policy trajectories, we introduce Cross-stage Importance Sampling Correction, which concatenates buffered log probabilities from the previous policy with those recomputed under the current policy for importance sampling correction. Experiments on challenging mathematical reasoning benchmarks show that CoPRIS achieves up to 1.94x faster training while maintaining comparable or superior performance to synchronous RL systems. The code of CoPRIS is available at this https URL.
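A simplified reading of the cross-stage correction: tokens generated under the stale policy receive importance weights exp(logp_current - logp_behavior), while tokens generated on-policy keep weight 1. The function name, the token-level weighting, and the `boundary` argument are illustrative assumptions, not CoPRIS's exact formulation.

```python
import math

def cross_stage_is_weight(logp_behavior, logp_current, boundary):
    """Token-level importance weights for a trajectory whose prefix
    (tokens before `boundary`) was sampled under an older policy and
    whose suffix was sampled under the current policy."""
    weights = []
    for t, (lb_t, lc_t) in enumerate(zip(logp_behavior, logp_current)):
        if t < boundary:
            weights.append(math.exp(lc_t - lb_t))  # off-policy prefix: correct
        else:
            weights.append(1.0)                    # on-policy suffix: weight 1
    return weights

# Toy log-probs: a 2-token prefix from the stale policy, then 2 on-policy tokens.
lb = [-1.0, -2.0, -1.5, -0.5]
lc = [-1.2, -1.8, -1.5, -0.5]
print(cross_stage_is_weight(lb, lc, boundary=2))
```

Concatenating buffered and recomputed log-probs this way is what lets reused partial trajectories contribute unbiased gradient signal instead of being discarded.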
[AI-190] Lookahead Unmasking Elicits Accurate Decoding in Diffusion Language Models
Abstract:Masked Diffusion Models (MDMs) as language models generate by iteratively unmasking tokens, yet their performance crucially depends on the inference time order of unmasking. Prevailing heuristics, such as confidence based sampling, are myopic: they optimize locally, fail to leverage extra test-time compute, and let early decoding mistakes cascade. We propose Lookahead Unmasking (LookUM), which addresses these concerns by reformulating sampling as path selection over all possible unmasking orders without the need for an external reward model. Our framework couples (i) a path generator that proposes paths by sampling from pools of unmasking sets with (ii) a verifier that computes the uncertainty of the proposed paths and performs importance sampling to subsequently select the final paths. Empirically, erroneous unmasking measurably inflates sequence level uncertainty, and our method exploits this to avoid error-prone trajectories. We validate our framework across six benchmarks, such as mathematics, planning, and coding, and demonstrate consistent performance improvements. LookUM requires only two to three paths to achieve peak performance, demonstrating remarkably efficient path selection. The consistent improvements on both LLaDA and post-trained LLaDA 1.5 are particularly striking: base LLaDA with LookUM rivals the performance of RL-tuned LLaDA 1.5, while LookUM further enhances LLaDA 1.5 itself showing that uncertainty based verification provides orthogonal benefits to reinforcement learning and underscoring the versatility of our framework. Code will be publicly released.
[AI-191] Effective Test-Time Scaling of Discrete Diffusion through Iterative Refinement
Abstract:Test-time scaling through reward-guided generation remains largely unexplored for discrete diffusion models despite its potential as a promising alternative. In this work, we introduce Iterative Reward-Guided Refinement (IterRef), a novel test-time scaling method tailored to discrete diffusion that leverages reward-guided noising-denoising transitions to progressively refine misaligned intermediate states. We formalize this process within a Multiple-Try Metropolis (MTM) framework, proving convergence to the reward-aligned distribution. Unlike prior methods that assume the current state is already aligned with the reward distribution and only guide the subsequent transition, our approach explicitly refines each state in situ, progressively steering it toward the optimal intermediate distribution. Across both text and image domains, we evaluate IterRef on diverse discrete diffusion models and observe consistent improvements in reward-guided generation quality. In particular, IterRef achieves striking gains under low compute budgets, far surpassing prior state-of-the-art baselines.
[AI-192] Diversified Flow Matching with Translation Identifiability
Abstract:Diversified distribution matching (DDM) finds a unified translation function mapping a diverse collection of conditional source distributions to their target counterparts. DDM was proposed to resolve content misalignment issues in unpaired domain translation, achieving translation identifiability. However, DDM has only been implemented using GANs due to its constraints on the translation function. GANs are often unstable to train and do not provide the transport trajectory information – yet such trajectories are useful in applications such as single-cell evolution analysis and robot route planning. This work introduces diversified flow matching (DFM), an ODE-based framework for DDM. Adapting flow matching (FM) to enforce a unified translation function as in DDM is challenging, as FM learns the translation function’s velocity rather than the translation function itself. A custom bilevel optimization-based training loss, a nonlinear interpolant, and a structural reformulation are proposed to address these challenges, offering a tangible implementation. To our knowledge, DFM is the first ODE-based approach guaranteeing translation identifiability. Experiments on synthetic and real-world datasets validate the proposed method.
[AI-193] Deep one-gate per layer networks with skip connections are universal classifiers
Abstract:This paper shows how a multilayer perceptron with two hidden layers, which has been designed to classify two classes of data points, can easily be transformed into a deep neural network with one-gate layers and skip connections.
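The equivalence underlying such a construction can be checked numerically: a one-hidden-layer ReLU network computes the same function as a depth-wise accumulation with one gate per layer, provided skip connections carry the input and the running sum forward. This is a sketch of the idea, not the paper's exact transformation of a two-hidden-layer classifier.

```python
import numpy as np

def shallow_relu(x, W, b, v):
    """Ordinary one-hidden-layer ReLU network: v . relu(W x + b)."""
    return v @ np.maximum(W @ x + b, 0.0)

def one_gate_deep(x, W, b, v):
    """Same function computed depth-wise: each layer evaluates a single
    ReLU gate while skip connections pass x and the accumulator along."""
    acc = 0.0
    for wi, bi, vi in zip(W, b, v):        # one gate per layer
        acc += vi * max(wi @ x + bi, 0.0)  # skip carries x and acc forward
    return acc

rng = np.random.default_rng(2)
W, b, v = rng.normal(size=(5, 3)), rng.normal(size=5), rng.normal(size=5)
x = rng.normal(size=3)
print(np.isclose(shallow_relu(x, W, b, v), one_gate_deep(x, W, b, v)))  # True
```

Since a two-hidden-layer MLP is a universal classifier, unrolling its gates one per layer in this fashion yields a deep one-gate-per-layer network with the same expressive power.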
[AI-194] AGRAG: Advanced Graph-based Retrieval-Augmented Generation for LLMs
Quick Read: This paper addresses three key problems in graph-based retrieval-augmented generation (Graph-based RAG): 1) inaccurate graph construction caused by large language model (LLM) hallucination; 2) weak reasoning, since without explicit reasoning paths the LLM cannot explain why particular chunks were selected; and 3) incomplete answers caused by inadequate LLM reasoning, which leaves performance behind NaiveRAG on some tasks. The core of the proposed AGRAG framework: first, a statistics-based method replaces LLM entity extraction to avoid hallucination and error propagation; second, graph reasoning is formulated as Minimum Cost Maximum Influence (MCMI) subgraph generation, trading off node influence scores against edge costs to produce more comprehensive reasoning paths. The MCMI subgraph serves as an explicit rationale telling the LLM why chunks were retrieved, focusing it on the query-relevant parts, reducing noise, and, by admitting richer structures such as cycles, significantly improving reasoning ability and answer completeness.
Abstract:Graph-based retrieval-augmented generation (Graph-based RAG) has demonstrated significant potential in enhancing Large Language Models (LLMs) with structured knowledge. However, existing methods face three critical challenges: Inaccurate Graph Construction, caused by LLM hallucination; Poor Reasoning Ability, caused by failing to generate explicit reasons telling LLM why certain chunks were selected; and Inadequate Answering, which only partially answers the query due to the inadequate LLM reasoning, making their performance lag behind NaiveRAG on certain tasks. To address these issues, we propose AGRAG, an advanced graph-based retrieval-augmented generation framework. When constructing the graph, AGRAG substitutes the widely used LLM entity extraction method with a statistics-based method, avoiding hallucination and error propagation. When retrieval, AGRAG formulates the graph reasoning procedure as the Minimum Cost Maximum Influence (MCMI) subgraph generation problem, where we try to include more nodes with high influence score, but with less involving edge cost, to make the generated reasoning paths more comprehensive. We prove this problem to be NP-hard, and propose a greedy algorithm to solve it. The MCMI subgraph generated can serve as explicit reasoning paths to tell LLM why certain chunks were retrieved, thereby making the LLM better focus on the query-related part contents of the chunks, reducing the impact of noise, and improving AGRAG’s reasoning ability. Furthermore, compared with the simple tree-structured reasoning paths, our MCMI subgraph can allow more complex graph structures, such as cycles, and improve the comprehensiveness of the generated reasoning paths.
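Since MCMI subgraph generation is NP-hard, the paper solves it greedily. The sketch below uses a simple influence-per-cost ratio heuristic under a cost budget as a stand-in; the paper's actual marginal-gain criterion and edge-cost accounting may differ.

```python
def greedy_mcmi(influence, cost, budget):
    """Greedy sketch of Minimum Cost Maximum Influence selection: take
    nodes in decreasing influence-per-cost order while the cumulative
    inclusion cost stays within budget."""
    order = sorted(influence, key=lambda n: influence[n] / cost[n], reverse=True)
    chosen, spent = [], 0.0
    for n in order:
        if spent + cost[n] <= budget:
            chosen.append(n)
            spent += cost[n]
    return chosen, spent

# Toy graph nodes with influence scores and inclusion costs.
influence = {"a": 9.0, "b": 6.0, "c": 2.0, "d": 5.0}
cost = {"a": 3.0, "b": 1.0, "c": 2.0, "d": 5.0}
subgraph, spent = greedy_mcmi(influence, cost, budget=5.0)
print(subgraph, spent)  # ['b', 'a'] 4.0
```

The selected node set would then be expanded into explicit reasoning paths handed to the LLM alongside the retrieved chunks.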
[AI-195] SMAGDi: Socratic Multi Agent Interaction Graph Distillation for Efficient High Accuracy Reasoning NEURIPS2025
Link: https://arxiv.org/abs/2511.05528 Authors: Aayush Aluru, Myra Malik, Samarth Patankar, Spencer Kim, Kevin Zhu, Sean O'Brien, Vasu Sharma Affiliations: unknown Subjects: Artificial Intelligence (cs.AI) Comments: Multi-Turn Interactions in Large Language Models (MTI-LLM) Workshop at NeurIPS 2025
Abstract:Multi-agent systems (MAS) often achieve higher reasoning accuracy than single models, but their reliance on repeated debates across agents makes them computationally expensive. We introduce SMAGDi, a distillation framework that transfers the debate dynamics of a five-agent Llama-based MAS into a compact Socratic decomposer-solver student. SMAGDi represents debate traces as directed interaction graphs, where nodes encode intermediate reasoning steps with correctness labels and edges capture continuity and cross-agent influence. The student is trained with a composite objective combining language modeling, graph-based supervision, contrastive reasoning, and embedding alignment to preserve both fluency and structured reasoning. On StrategyQA and MMLU, SMAGDi compresses a 40B multi-agent system into a 6B student while retaining 88% of its accuracy, substantially outperforming prior distillation methods such as MAGDi, standard KD, and fine-tuned baselines. These results highlight that explicitly modeling interaction graphs and Socratic decomposition enable small models to inherit the accuracy benefits of multi-agent debate while remaining efficient enough for real-world deployment.
[AI-196] Evidence-Bound Autonomous Research (EviBound): A Governance Framework for Eliminating False Claims
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)驱动的自主研究代理(autonomous research agents)中存在的虚假声明问题,即任务标记为“完成”但实际缺失关键产出物、指标矛盾或执行失败的情况。解决方案的核心是提出EviBound框架,通过双层治理门控机制实现证据约束的执行:预执行审批门(Approval Gate)在代码运行前验证接受标准模式,主动识别结构违规;后执行验证门(Verification Gate)则通过MLflow API查询产出物并递归路径检查,同时可选地验证指定指标。只有当任务具备可查询的运行ID、必要产出物及FINISHED状态时,声明方可传播,并辅以有限次数(通常1–2次)的受控重试机制避免无限循环。实验表明,该方法在8项基准任务中实现0%虚假声明率,显著优于仅依赖提示词(100%幻觉)或仅事后验证(25%幻觉)的基线方案。
链接: https://arxiv.org/abs/2511.05524 作者: Ruiying Chen 机构: 未知 类目: Artificial Intelligence (cs.AI) 备注: 27 pages, 11 figures, 5 tables. Reproducibility package with MLflow artifacts and Google Colab notebooks available upon publication
点击查看摘要
Abstract:LLM-based autonomous research agents report false claims: tasks marked “complete” despite missing artifacts, contradictory metrics, or failed executions. EviBound is an evidence-bound execution framework that eliminates false claims through dual governance gates requiring machine-checkable evidence. Two complementary gates enforce evidence requirements. The pre-execution Approval Gate validates acceptance criteria schemas before code runs, catching structural violations proactively. The post-execution Verification Gate validates artifacts via MLflow API queries (with recursive path checking) and optionally validates metrics when specified by acceptance criteria. Claims propagate only when backed by a queryable run ID, required artifacts, and FINISHED status. Bounded, confidence-gated retries (typically 1-2 attempts) recover from transient failures without unbounded loops. The framework was evaluated on 8 benchmark tasks spanning infrastructure validation, ML capabilities, and governance stress tests. Baseline A (Prompt-Level Only) yields 100% hallucination (8/8 claimed, 0/8 verified). Baseline B (Verification-Only) reduces hallucination to 25% (2/8 fail verification). EviBound (Dual Gates) achieves 0% hallucination: 7/8 tasks verified and 1 task correctly blocked at the approval gate, all with only approximately 8.3% execution overhead. This package includes execution trajectories, MLflow run IDs for all verified tasks, and a 4-step verification protocol. Research integrity is an architectural property, achieved through governance gates rather than emergent from model scale.
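The dual-gate flow described above can be sketched in a few lines; the acceptance-criteria fields and the run-record shape below are illustrative assumptions, not EviBound's actual API or the MLflow interface.

```python
# Hedged sketch of EviBound-style dual gates: an approval gate validates the
# acceptance-criteria schema before execution, and a verification gate checks
# that the claimed run left queryable evidence. Criteria fields and the
# run-record shape are assumptions for illustration.
REQUIRED_FIELDS = {"artifacts", "status"}

def approval_gate(criteria: dict) -> bool:
    """Pre-execution: reject structurally invalid acceptance criteria."""
    return REQUIRED_FIELDS.issubset(criteria)

def verification_gate(criteria: dict, run: dict) -> bool:
    """Post-execution: a claim passes only with a queryable run ID, all
    required artifacts present, and FINISHED status (mirroring the
    MLflow-backed checks described in the abstract)."""
    return (
        run.get("run_id") is not None
        and run.get("status") == "FINISHED"
        and all(a in run.get("artifacts", []) for a in criteria["artifacts"])
    )

def claim_allowed(criteria: dict, run: dict) -> bool:
    return approval_gate(criteria) and verification_gate(criteria, run)
```

A bounded retry loop (the abstract's 1-2 attempts) would simply re-run the task while `claim_allowed` is false, up to a fixed attempt cap.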
zh
[AI-197] From Failure Modes to Reliability Awareness in Generative and Agentic AI Systems
【速读】:该论文旨在解决生成式 AI (Generative AI) 和代理型 AI (Agentic AI) 系统中可靠性风险难以系统识别与管理的问题,尤其关注故障在多层架构中的传播机制及其对组织应对能力的影响。其解决方案的关键在于提出一个11层故障堆栈(failure stack)框架与意识映射(awareness mapping)方法的结合:前者用于结构化识别从硬件到自适应学习等各层级的脆弱性,后者则量化个体和组织对AI全栈可靠性风险的认知水平,并将其作为AI治理的战略输入,最终通过与以可靠性为中心的资产管理(Dependability-Centred Asset Management, DCAM)整合,实现对关键任务领域中可信且可持续AI部署的路径指引。
Abstract:This chapter bridges technical analysis and organizational preparedness by tracing the path from layered failure modes to reliability awareness in generative and agentic AI systems. We first introduce an 11-layer failure stack, a structured framework for identifying vulnerabilities ranging from hardware and power foundations to adaptive learning and agentic reasoning. Building on this, the chapter demonstrates how failures rarely occur in isolation but propagate across layers, creating cascading effects with systemic consequences. To complement this diagnostic lens, we develop the concept of awareness mapping: a maturity-oriented framework that quantifies how well individuals and organizations recognize reliability risks across the AI stack. Awareness is treated not only as a diagnostic score but also as a strategic input for AI governance, guiding improvement and resilience planning. By linking layered failures to awareness levels and further integrating this into Dependability-Centred Asset Management (DCAM), the chapter positions awareness mapping as both a measurement tool and a roadmap for trustworthy and sustainable AI deployment across mission-critical domains.
zh
[AI-198] Production-Grade Local LLM Inference on Apple Silicon: A Comparative Study of MLX MLC-LLM Ollama llama.cpp and PyTorch MPS
Abstract:We present a systematic, empirical evaluation of five local large language model (LLM) runtimes on Apple Silicon: MLX, MLC-LLM, llama.cpp, Ollama, and PyTorch MPS. Experiments were conducted on a Mac Studio equipped with an M2 Ultra processor and 192 GB of unified memory. Using the Qwen-2.5 model family across prompts ranging from a few hundred to 100,000 tokens, we measure time-to-first-token (TTFT), steady-state throughput, latency percentiles, long-context behavior (key-value and prompt caching), quantization support, streaming performance, batching and concurrency behavior, and deployment complexity. Under our settings, MLX achieves the highest sustained generation throughput, while MLC-LLM delivers consistently lower TTFT for moderate prompt sizes and offers stronger out-of-the-box inference features. llama.cpp is highly efficient for lightweight single-stream use, Ollama emphasizes developer ergonomics but lags in throughput and TTFT, and PyTorch MPS remains limited by memory constraints on large models and long contexts. All frameworks execute fully on-device with no telemetry, ensuring strong privacy guarantees. We release scripts, logs, and plots to reproduce all results. Our analysis clarifies the design trade-offs in Apple-centric LLM deployments and provides evidence-based recommendations for interactive and long-context processing. Although Apple Silicon inference frameworks still trail NVIDIA GPU-based systems such as vLLM in absolute performance, they are rapidly maturing into viable, production-grade solutions for private, on-device LLM inference. Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI) Cite as: arXiv:2511.05502 [cs.AR] (or arXiv:2511.05502v1 [cs.AR] for this version) https://doi.org/10.48550/arXiv.2511.05502
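The two headline metrics, TTFT and steady-state throughput, can be measured with a small harness like the one below; `generate` stands in for any streaming runtime API (the specific runtimes' interfaces differ), so the harness itself is an illustration rather than the authors' released scripts.

```python
# Minimal timing harness for time-to-first-token (TTFT) and steady-state
# generation throughput, the two headline metrics in the study. `generate`
# is any callable that streams tokens for a prompt (assumed interface).
import time

def measure(generate, prompt):
    """Return (ttft_seconds, tokens_per_second) for one streamed generation."""
    t0 = time.perf_counter()
    first = None
    count = 0
    for _token in generate(prompt):   # assumed: yields tokens as produced
        count += 1
        if first is None:
            first = time.perf_counter()
    t_end = time.perf_counter()
    ttft = first - t0
    # Steady-state throughput: tokens after the first, over the decode window,
    # so prompt-processing time does not inflate the rate.
    tput = (count - 1) / (t_end - first) if count > 1 else 0.0
    return ttft, tput
```

Separating the decode window from TTFT is what lets the paper report the two metrics independently: a runtime can win on one and lose on the other.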
zh
[AI-199] Towards Ecologically Valid LLM Benchmarks: Understanding and Designing Domain-Centered Evaluations for Journalism Practitioners
Abstract:Benchmarks play a significant role in how researchers and the public understand generative AI systems. However, the widespread use of benchmark scores to communicate about model capabilities has led to criticisms of validity, especially whether benchmarks test what they claim to test (i.e. construct validity) and whether benchmark evaluations are representative of how models are used in the wild (i.e. ecological validity). In this work we explore how to create an LLM benchmark that addresses these issues by taking a human-centered approach. We focus on designing a domain-oriented benchmark for journalism practitioners, drawing on insights from a workshop of 23 journalism professionals. Our workshop findings surface specific challenges that inform benchmark design opportunities, which we instantiate in a case study that addresses underlying criticisms and specific domain concerns. Through our findings and design case study, this work provides design guidance for developing benchmarks that are better tuned to specific domains.
zh
[AI-200] Weightless Neural Networks for Continuously Trainable Personalized Recommendation Systems
Abstract:Given that conventional recommenders, while deeply effective, rely on large distributed systems pre-trained on aggregate user data, incorporating new data necessitates large training cycles, making them slow to adapt to real-time user feedback and often lacking transparency in recommendation rationale. We explore the performance of smaller personal models trained on per-user data using weightless neural networks (WNNs), an alternative to neural backpropagation that enables continuous learning by using the network as a state machine rather than a system with pretrained weights. We contrast our approach against a classic weighted system, also on a per-user level, and standard collaborative filtering, achieving competitive levels of accuracy on a subset of the MovieLens dataset. We close with a discussion of how weightless systems can be developed to augment centralized systems to achieve higher subjective accuracy through recommenders more directly tunable by end-users.
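A toy WiSARD-style weightless classifier makes the "state machine" idea concrete: training writes addresses into RAM nodes rather than adjusting weights, so each new rating updates the model instantly. The fixed tuple partitioning and set-based RAMs below are illustrative choices, not necessarily the paper's configuration.

```python
# Toy WiSARD-style weightless classifier: each class owns RAM nodes addressed
# by tuples of input bits; "training" writes the addressed cells, so new
# per-user feedback is incorporated in one shot, with no backpropagation.
# The contiguous tuple partitioning is a fixed illustrative choice.
class WiSARD:
    def __init__(self, n_bits, tuple_size, classes):
        self.tuples = [range(i, min(i + tuple_size, n_bits))
                       for i in range(0, n_bits, tuple_size)]
        self.rams = {c: [set() for _ in self.tuples] for c in classes}

    def _addresses(self, bits):
        return [tuple(bits[j] for j in t) for t in self.tuples]

    def train(self, bits, label):
        # Continuous learning: a single write per RAM node, no training cycle.
        for ram, addr in zip(self.rams[label], self._addresses(bits)):
            ram.add(addr)

    def predict(self, bits):
        # Score each class by how many of its RAM nodes recognize the input.
        scores = {c: sum(a in r for r, a in zip(rams, self._addresses(bits)))
                  for c, rams in self.rams.items()}
        return max(scores, key=scores.get)
```

The RAM contents are also directly inspectable, which is one route to the transparency the abstract contrasts against weighted recommenders.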
zh
[AI-201] Biomedical Hypothesis Explainability with Graph-Based Context Retrieval
【速读】:该论文旨在解决生物医学假设生成系统中解释性不足的问题,即如何使大型语言模型(Large Language Models, LLMs)生成的假设具备可解释性,并能基于真实世界科研约束提供可信的证据路径。解决方案的关键在于构建一个基于语义图谱的检索机制与受限数据训练策略相结合的框架——Hypothesis Generation Context Retriever(HGCR),并通过检索增强生成(Retrieval-Augmented Generation, RAG)将LLM与已发表科学文献中的上下文证据相融合;此外,引入一种新颖的反馈循环机制,迭代识别并修正LLM生成解释中的错误部分,从而持续优化证据路径和支撑背景,提升整体系统的可解释性和准确性。
Abstract:We introduce an explainability method for biomedical hypothesis generation systems, built on top of the novel Hypothesis Generation Context Retriever framework. Our approach combines semantic graph-based retrieval and relevant data-restrictive training to simulate real-world discovery constraints. Integrated with large language models (LLMs) via retrieval-augmented generation, the system explains hypotheses with contextual evidence using published scientific literature. We also propose a novel feedback loop approach, which iteratively identifies and corrects flawed parts of LLM-generated explanations, refining both the evidence paths and supporting context. We demonstrate the performance of our method with multiple large language models and evaluate the explanation and context retrieval quality through both expert-curated assessment and large-scale automated analysis. Our code is available at: this https URL.
zh
[AI-202] DOCUEVAL: An LLM -based AI Engineering Tool for Building Customisable Document Evaluation Workflows
Abstract:Foundation models, such as large language models (LLMs), have the potential to streamline evaluation workflows and improve their performance. However, practical adoption faces challenges, such as customisability, accuracy, and scalability. In this paper, we present DOCUEVAL, an AI engineering tool for building customisable DOCUment EVALuation workflows. DOCUEVAL supports advanced document processing and customisable workflow design which allow users to define theory-grounded reviewer roles, specify evaluation criteria, experiment with different reasoning strategies and choose the assessment style. To ensure traceability, DOCUEVAL provides comprehensive logging of every run, along with source attribution and configuration management, allowing systematic comparison of results across alternative setups. By integrating these capabilities, DOCUEVAL directly addresses core software engineering challenges, including how to determine whether evaluators are “good enough” for deployment and how to empirically compare different evaluation strategies. We demonstrate the usefulness of DOCUEVAL through a real-world academic peer review case, showing how DOCUEVAL enables both the engineering of evaluators and scalable, reliable document evaluation.
zh
[AI-203] IMDMR: An Intelligent Multi-Dimensional Memory Retrieval System for Enhanced Conversational AI
【速读】:该论文旨在解决当前对话式人工智能(Conversational AI)系统在长时间交互中难以维持连贯且上下文相关的记忆问题,从而限制了个性化和情境相关响应的能力。其解决方案的关键在于提出一种名为IMDMR(Intelligent Multi-Dimensional Memory Retrieval)的新颖多维检索架构,该架构通过六维记忆维度——语义(semantic)、实体(entity)、类别(category)、意图(intent)、上下文(context)和时间(temporal)——实现全面的记忆检索能力,并结合智能查询处理、动态策略选择、跨记忆实体解析与高级记忆融合技术,显著提升了系统性能,在多项指标上优于现有基线方法(如LangChain RAG、LlamaIndex、MemGPT等),其中整体性能提升达3.8倍(0.792 vs. 0.207)。
链接: https://arxiv.org/abs/2511.05495 作者: Tejas Pawar,Sarika Patil,Om Tilekar,Rushikesh Janwade,Vaibhav Helambe 机构: 未知 类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) 备注: 28 pages, 8 figures, submitted to arXiv for open access publication
点击查看摘要
Abstract:Conversational AI systems often struggle with maintaining coherent, contextual memory across extended interactions, limiting their ability to provide personalized and contextually relevant responses. This paper presents IMDMR (Intelligent Multi-Dimensional Memory Retrieval), a novel system that addresses these limitations through a multi-dimensional search architecture. Unlike existing memory systems that rely on single-dimensional approaches, IMDMR leverages six distinct memory dimensions-semantic, entity, category, intent, context, and temporal-to provide comprehensive memory retrieval capabilities. Our system incorporates intelligent query processing with dynamic strategy selection, cross-memory entity resolution, and advanced memory integration techniques. Through comprehensive evaluation against five baseline systems including LangChain RAG, LlamaIndex, MemGPT, and spaCy + RAG, IMDMR achieves a 3.8x improvement in overall performance (0.792 vs 0.207 for the best baseline). We present both simulated (0.314) and production (0.792) implementations, demonstrating the importance of real technology integration while maintaining superiority over all baseline systems. Ablation studies demonstrate the effectiveness of multi-dimensional search, with the full system outperforming individual dimension approaches by 23.3%. Query-type analysis reveals superior performance across all categories, particularly for preferences/interests (0.630) and goals/aspirations (0.630) queries. Comprehensive visualizations and statistical analysis confirm the significance of these improvements with p < 0.001 across all metrics. The results establish IMDMR as a significant advancement in conversational AI memory systems, providing a robust foundation for enhanced user interactions and personalized experiences.
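One way to read the multi-dimensional search is as a weighted fusion of per-dimension scores, with the weights chosen per query (the "dynamic strategy selection"). The sketch below shows that fusion under assumed weights and scorers; it is not IMDMR's actual scoring function.

```python
# Illustrative fusion of IMDMR's six memory dimensions: each dimension scores
# a candidate memory, and a query-dependent weight vector blends them.
# The weights and the per-dimension scores are assumptions for this sketch.
DIMENSIONS = ("semantic", "entity", "category", "intent", "context", "temporal")

def fuse_scores(dim_scores: dict, weights: dict) -> float:
    """Weighted average over the six dimensions, normalized by total weight."""
    total_w = sum(weights.get(d, 0.0) for d in DIMENSIONS)
    if total_w == 0:
        return 0.0
    return sum(weights.get(d, 0.0) * dim_scores.get(d, 0.0)
               for d in DIMENSIONS) / total_w

def rank_memories(memories, weights, top_k=3):
    """memories: list of (memory_id, {dimension: score}) pairs."""
    scored = [(m_id, fuse_scores(s, weights)) for m_id, s in memories]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]
```

Zeroing all but one weight recovers a single-dimension baseline, which is the comparison the ablation study runs.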
zh
[AI-204] Customized Retrieval-Augmented Generation with LLM for Debiasing Recommendation Unlearning ICDM2025
链接: https://arxiv.org/abs/2511.05494 作者: Haichao Zhang,Chong Zhang,Peiyu Hu,Shi Qiu,Jia Wang 机构: 未知 类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) 备注: 10 pages, 4 figures. Accepted ICDM 2025 (IEEE International Conference on Data Mining)
点击查看摘要
Abstract:Modern recommender systems face a critical challenge in complying with privacy regulations like the ‘right to be forgotten’: removing a user’s data without disrupting recommendations for others. Traditional unlearning methods address this by partial model updates, but introduce propagation bias–where unlearning one user’s data distorts recommendations for behaviorally similar users, degrading system accuracy. While retraining eliminates bias, it is computationally prohibitive for large-scale systems. To address this challenge, we propose CRAGRU, a novel framework leveraging Retrieval-Augmented Generation (RAG) for efficient, user-specific unlearning that mitigates bias while preserving recommendation quality. CRAGRU decouples unlearning into distinct retrieval and generation stages. In retrieval, we employ three tailored strategies designed to precisely isolate the target user’s data influence, minimizing collateral impact on unrelated users and enhancing unlearning efficiency. Subsequently, the generation stage utilizes an LLM, augmented with user profiles integrated into prompts, to reconstruct accurate and personalized recommendations without needing to retrain the entire base model. Experiments on three public datasets demonstrate that CRAGRU effectively unlearns targeted user data, significantly mitigating unlearning bias by preventing adverse impacts on non-target users, while maintaining recommendation performance comparable to fully trained original models. Our work highlights the promise of RAG-based architectures for building robust and privacy-preserving recommender systems. The source code is available at: this https URL.
zh
[AI-205] Machine-Learning Accelerated Calculations of Reduced Density Matrices
【速读】:该论文旨在解决强关联体系中n-粒子约化密度矩阵(n-particle reduced density matrices, n-RDMs)计算效率低的问题,尤其是在系统尺寸较大时。其核心挑战在于传统方法难以高效处理大尺度系统的n-RDMs,限制了对关联物态的深入研究。解决方案的关键在于利用神经网络(Neural Network, NN)架构对平滑且可插值的n-RDM进行加速计算与预测:首先基于n-RDM在布里渊区(Brillouin zone, BZ)上通常为光滑函数的物理特性(尤其适用于能隙体系),设计两种NN模型——一种是自注意力网络用于将随机RDM映射为物理上合理的RDM,另一种是正弦表示网络(SIREN)直接从动量空间坐标映射到RDM值;实验表明,训练于小尺寸网格(如6×6)的SIREN可高精度预测大尺寸(如18×18)的配对关联函数,并显著减少大系统哈特里-福克(Hartree-Fock, HF)迭代求解所需的迭代次数(最多降低92.78%),从而为强关联物态研究提供了一种高效、可推广的新范式。
Abstract: n-particle reduced density matrices (n-RDMs) play a central role in understanding correlated phases of matter. Yet the calculation of n-RDMs is often computationally inefficient for strongly-correlated states, particularly when the system sizes are large. In this work, we propose to use neural network (NN) architectures to accelerate the calculation of, and even predict, the n-RDMs for large-size systems. The underlying intuition is that n-RDMs are often smooth functions over the Brillouin zone (BZ) (certainly true for gapped states) and are thus interpolable, allowing NNs trained on small-size n-RDMs to predict large-size ones. Building on this intuition, we devise two NNs: (i) a self-attention NN that maps random RDMs to physical ones, and (ii) a Sinusoidal Representation Network (SIREN) that directly maps momentum-space coordinates to RDM values. We test the NNs in three 2D models: the pair-pair correlation functions of the Richardson model of superconductivity, the translationally-invariant 1-RDM in a four-band model with short-range repulsion, and the translation-breaking 1-RDM in the half-filled Hubbard model. We find that a SIREN trained on a 6×6 momentum mesh can predict the 18×18 pair-pair correlation function with a relative accuracy of 0.839. The NNs trained on 6×6 to 8×8 meshes can provide high-quality initial guesses for 50×50 translation-invariant Hartree-Fock (HF) and 30×30 fully translation-breaking-allowed HF, reducing the number of iterations required for convergence by up to 91.63% and 92.78%, respectively, compared to random initializations. Our results illustrate the potential of using NN-based methods for interpolable n-RDMs, which might open a new avenue for future research on strongly correlated phases.
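A minimal SIREN-style forward pass shows the mechanism the paper relies on: sinusoidal activations let a small coordinate network represent smooth functions of momentum, so a model fit on a coarse mesh can be queried at finer momenta. Layer sizes, w0, and the initialization follow the standard SIREN recipe and are illustrative here.

```python
# Minimal SIREN-style network in NumPy: hidden layers apply sin(w0 * (Wx + b)),
# the final layer is linear. A smooth map from momentum coordinates to RDM
# values can then be evaluated on any mesh density. Sizes and w0 are
# illustrative, not the paper's architecture.
import numpy as np

def siren_forward(x, layers, w0=30.0):
    """x: (batch, in_dim) momentum coordinates; layers: list of (W, b)."""
    h = x
    for W, b in layers[:-1]:
        h = np.sin(w0 * (h @ W + b))
    W, b = layers[-1]
    return h @ W + b          # linear output layer (predicted RDM entries)

rng = np.random.default_rng(0)

def init_layer(fan_in, fan_out, first=False, w0=30.0):
    # SIREN initialization: U(-1/fan_in, 1/fan_in) for the first layer,
    # U(-sqrt(6/fan_in)/w0, +sqrt(6/fan_in)/w0) for subsequent layers.
    bound = 1.0 / fan_in if first else np.sqrt(6.0 / fan_in) / w0
    return (rng.uniform(-bound, bound, (fan_in, fan_out)),
            rng.uniform(-bound, bound, fan_out))
```

Training such a network on 6×6-mesh RDM values and evaluating it on an 18×18 grid of momenta is the interpolation step the abstract describes.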
zh
[AI-206] Sample-efficient quantum error mitigation via classical learning surrogates
Abstract:The pursuit of practical quantum utility on near-term quantum processors is critically challenged by their inherent noise. Quantum error mitigation (QEM) techniques are leading solutions to improve computation fidelity with relatively low qubit-overhead, while full-scale quantum error correction remains a distant goal. However, QEM techniques incur substantial measurement overheads, especially when applied to families of quantum circuits parameterized by classical inputs. Focusing on zero-noise extrapolation (ZNE), a widely adopted QEM technique, here we devise the surrogate-enabled ZNE (S-ZNE), which leverages classical learning surrogates to perform ZNE entirely on the classical side. Unlike conventional ZNE, whose measurement cost scales linearly with the number of circuits, S-ZNE requires only constant measurement overhead for an entire family of quantum circuits, offering superior scalability. Theoretical analysis indicates that S-ZNE achieves accuracy comparable to conventional ZNE in many practical scenarios, and numerical experiments on up to 100-qubit ground-state energy and quantum metrology tasks confirm its effectiveness. Our approach provides a template that can be effectively extended to other quantum error mitigation protocols, opening a promising path toward scalable error mitigation.
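For reference, conventional ZNE, the per-circuit procedure whose measurement cost S-ZNE amortizes away, amounts to fitting expectation values measured at amplified noise levels and extrapolating to zero noise. The linear fit and noise-scale choices below are illustrative.

```python
# Textbook zero-noise extrapolation (ZNE): run the circuit at several
# amplified noise scales, fit the measured expectation values, and
# extrapolate to scale 0. The polynomial degree is an illustrative choice.
import numpy as np

def zne_estimate(noise_scales, expectations, degree=1):
    """Fit expectation vs. noise scale and return the value at zero noise."""
    coeffs = np.polyfit(noise_scales, expectations, degree)
    return np.polyval(coeffs, 0.0)
```

In conventional ZNE this fit is repeated for every circuit in a parameterized family; the paper's surrogate instead learns the classical-input-to-expectation map once, so the quantum measurement overhead stays constant across the family.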
zh
[AI-207] Deep learning EPI-TIRF cross-modality enables background subtraction and axial super-resolution for widefield fluorescence microscopy
Abstract:The resolving ability of wide-field fluorescence microscopy is fundamentally limited by out-of-focus background owing to its low axial resolution, particularly for densely labeled biological samples. To address this, we developed ET2dNet, a deep learning-based EPI-TIRF cross-modality network that achieves TIRF-comparable background subtraction and axial super-resolution from a single wide-field image without requiring hardware modifications. The model employs a physics-informed hybrid architecture, synergizing supervised learning with registered EPI-TIRF image pairs and self-supervised physical modeling via convolution with the point spread function. This framework ensures exceptional generalization across microscope objectives, enabling few-shot adaptation to new imaging setups. Rigorous validation on cellular and tissue samples confirms ET2dNet’s superiority in background suppression and axial resolution enhancement, while maintaining compatibility with deconvolution techniques for lateral resolution improvement. Furthermore, by extending this paradigm through knowledge distillation, we developed ET3dNet, a dedicated three-dimensional reconstruction network that produces artifact-reduced volumetric results. ET3dNet effectively removes out-of-focus background signals even when the input image stack lacks the source of background. This framework makes axial super-resolution imaging more accessible by providing an easy-to-deploy algorithm that avoids additional hardware costs and complexity, showing great potential for live cell studies and clinical histopathology.
zh
[AI-208] Diagnosing and Breaking Amplitude Suppression in Seismic Phase Picking Through Adversarial Shape Learning
【速读】:该论文试图解决深度学习在地震相位拾取(seismic phase picking)中一个长期存在的悖论:尽管生成式 AI(Generative AI)模型能够高精度预测 P 波,但对 S 波的振幅预测始终低于检测阈值,表现为持续的振幅抑制现象。研究通过分析训练历史和损失函数的几何结构,识别出三个相互作用的因素:S 波初至时刻具有较高的时间不确定性;卷积神经网络(CNN)倾向于关注高振幅边界而非微弱初至点;逐点二分类交叉熵(Binary Cross-Entropy, BCE)损失缺乏横向校正力,仅提供垂直梯度,导致振幅被抑制而时间间隔未收敛。解决方案的关键在于提出“先形状后对齐”(shape-then-align)策略——即在时间对齐前先构建稳定的几何模板以约束预测形态。作者采用条件生成对抗网络(conditional GAN)框架,在传统 BCE 训练基础上引入判别器模块,强制施加形状约束,从而在 10,000 步训练后实现有效 S 相位检测率提升 64%。该方法无需先验假设即可自动发现目标几何特征,为需要精确对齐细微结构与主导结构的分割任务提供了通用解决方案。
Abstract:Deep learning has revolutionized seismic phase picking, yet a paradox persists: high signal-to-noise S-wave predictions consistently fail to cross detection thresholds, oscillating at suppressed amplitudes. We identify this previously unexplained phenomenon as amplitude suppression, which we diagnose through analyzing training histories and loss landscapes. Three interacting factors emerge: S-wave onsets exhibit high temporal uncertainty relative to high-amplitude boundaries; CNN’s bias toward sharp amplitude changes anchors predictions to these boundaries rather than subtle onsets; and point-wise Binary Cross-Entropy (BCE) loss lacks lateral corrective forces, providing only vertical gradients that suppress amplitude while temporal gaps persist. This geometric trap points to a shape-then-align solution where stable geometric templates must precede temporal alignment. We implement this through a conditional GAN framework by augmenting conventional BCE training with a discriminator that enforces shape constraints. Training for 10,000 steps, this achieves a 64% increase in effective S-phase detections. Our framework autonomously discovers target geometry without a priori assumptions, offering a generalizable solution for segmentation tasks requiring precise alignment of subtle features near dominant structures.
zh
[AI-209] SPUR: A Plug-and-Play Framework for Integrating Spatial Audio Understanding and Reasoning into Large Audio-Language Models
链接: https://arxiv.org/abs/2511.06606 作者: S Sakshi,Vaibhavi Lokegaonkar,Neil Zhang,Ramani Duraiswami,Sreyan Ghosh,Dinesh Manocha,Lie Lu 机构: 未知 类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI) 备注: Project: this https URL
点击查看摘要
Abstract:Spatial perception is central to auditory intelligence, enabling accurate understanding of real-world acoustic scenes and advancing human-level perception of the world around us. While recent large audio-language models (LALMs) show strong reasoning over complex audios, most operate on monaural inputs and lack the ability to capture spatial cues such as direction, elevation, and distance. We introduce SPUR, a lightweight, plug-in approach that equips LALMs with spatial perception through minimal architectural changes. SPUR consists of: (i) a First-Order Ambisonics (FOA) encoder that maps (W, X, Y, Z) channels to rotation-aware, listener-centric spatial features, integrated into target LALMs via a multimodal adapter; and (ii) SPUR-Set, a spatial QA dataset combining open-source FOA recordings with controlled simulations, emphasizing relative direction, elevation, distance, and overlap for supervised spatial reasoning. Fine-tuning our model on the SPUR-Set consistently improves spatial QA and multi-speaker attribution while preserving general audio understanding. SPUR provides a simple recipe that transforms monaural LALMs into spatially aware models. Extensive ablations validate the effectiveness of our approach.
zh
[AI-210] A PDE Perspective on Generative Diffusion Models
链接: https://arxiv.org/abs/2511.05940 作者: Kang Liu,Enrique Zuazua 机构: 未知 类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Analysis of PDEs (math.AP) 备注: 30 pages, 4 figures
点击查看摘要
Abstract:Score-based diffusion models have emerged as a powerful class of generative methods, achieving state-of-the-art performance across diverse domains. Despite their empirical success, the mathematical foundations of those models remain only partially understood, particularly regarding the stability and consistency of the underlying stochastic and partial differential equations governing their dynamics. In this work, we develop a rigorous partial differential equation (PDE) framework for score-based diffusion processes. Building on the Li–Yau differential inequality for the heat flow, we prove well-posedness and derive sharp L^p-stability estimates for the associated score-based Fokker–Planck dynamics, providing a mathematically consistent description of their temporal evolution. Through entropy stability methods, we further show that the reverse-time dynamics of diffusion models concentrate on the data manifold for compactly supported data distributions and a broad class of initialization schemes, with a concentration rate of order \sqrt{t} as t \to 0. These results yield a theoretical guarantee that, under exact score guidance, diffusion trajectories return to the data manifold while preserving imitation fidelity. Our findings also provide practical insights for designing diffusion models, including principled criteria for score-function construction, loss formulation, and stopping-time selection. Altogether, this framework provides a quantitative understanding of the trade-off between generative capacity and imitation fidelity, bridging rigorous analysis and model design within a unified mathematical perspective.
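For readers less familiar with the objects named in the abstract, the standard dynamics behind score-based diffusion (here for the Ornstein–Uhlenbeck noising process) are written out below; the paper's precise normalizations and time conventions may differ.

```latex
% Standard forward/reverse dynamics underlying score-based diffusion
% (Ornstein--Uhlenbeck forward process); normalizations may differ
% from the paper's conventions.
\begin{align}
  \mathrm{d}X_t &= -X_t\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}W_t
    && \text{(forward noising SDE)} \\
  \partial_t \rho_t &= \nabla\!\cdot\!\Bigl(\rho_t \nabla \log \tfrac{\rho_t}{\gamma}\Bigr),
    \quad \gamma(x) \propto e^{-|x|^2/2}
    && \text{(Fokker--Planck form)} \\
  \mathrm{d}\bar{X}_s &= \bigl[\bar{X}_s + 2\,\nabla \log \rho_{T-s}(\bar{X}_s)\bigr]\mathrm{d}s
    + \sqrt{2}\,\mathrm{d}\bar{W}_s
    && \text{(reverse-time dynamics)}
\end{align}
```

The score \nabla \log \rho_{T-s} in the reverse-time equation is the quantity the model learns; the concentration result in the abstract concerns these reverse trajectories as s \to T.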
MSC classes: 34D05, 35B35, 35Q68, 35Q84, 68T99
zh
[AI-211] IoT-based Fresh Produce Supply Chain Under Uncertainty: An Adaptive Optimization Framework
链接: https://arxiv.org/abs/2511.05920 作者: Chirag Seth,Mehrdad Pirnia,James H Bookbinder 机构: 未知 类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI) 备注:
点击查看摘要
Abstract:Fruits and vegetables form a vital component of the global economy; however, their distribution poses complex logistical challenges due to high perishability, supply fluctuations, strict quality and safety standards, and environmental sensitivity. In this paper, we propose an adaptive optimization model that accounts for delays, travel time, and associated temperature changes impacting produce shelf life, and compare it against traditional approaches such as Robust Optimization, Distributionally Robust Optimization, and Stochastic Programming. Additionally, we conduct a series of computational experiments using Internet of Things (IoT) sensor data to evaluate the performance of our proposed model. Our study demonstrates that the proposed adaptive model achieves a higher shelf life, extending it by over 18% compared to traditional optimization models, by dynamically mitigating temperature deviations through a temperature feedback mechanism. The promising results demonstrate the potential of this approach to improve both the freshness and efficiency of logistics systems, an aspect often neglected in previous works.
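The temperature feedback idea can be illustrated with a simple shelf-life bookkeeping rule: IoT temperature readings above a reference accelerate shelf-life consumption. The Q10-style decay rule and all parameters below are assumptions for illustration, not the paper's model.

```python
# Illustrative shelf-life tracking with temperature feedback: remaining
# shelf life is consumed faster when IoT sensor readings sit above the
# reference temperature. The Q10-style rule and parameters are assumptions,
# not the paper's adaptive optimization model.
def remaining_shelf_life(initial_days, readings, t_ref=4.0, q10=2.0):
    """readings: list of (hours, temp_celsius) segments from IoT sensors.
    Each segment consumes shelf life at a rate that multiplies by q10
    for every 10 degrees C above the reference temperature."""
    remaining = initial_days * 24.0          # track in hours
    for hours, temp in readings:
        rate = q10 ** ((temp - t_ref) / 10.0)
        remaining -= hours * rate
    return max(remaining / 24.0, 0.0)        # back to days
```

An adaptive routing model can then re-optimize whenever this feedback signal shows shelf life draining faster than planned, which is the mechanism the abstract credits for the 18% improvement.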
zh
[AI-212] BrainCSD: A Hierarchical Consistency-Driven MoE Foundation Model for Unified Connectome Synthesis and Multitask Brain Trait Prediction
Abstract:Functional and structural connectivity (FC/SC) are key multimodal biomarkers for brain analysis, yet their clinical utility is hindered by costly acquisition, complex preprocessing, and frequent missing modalities. Existing foundation models either process single modalities or lack explicit mechanisms for cross-modal and cross-scale consistency. We propose BrainCSD, a hierarchical mixture-of-experts (MoE) foundation model that jointly synthesizes FC/SC biomarkers and supports downstream decoding tasks (diagnosis and prediction). BrainCSD features three neuroanatomically grounded components: (1) an ROI-specific MoE that aligns regional activations from canonical networks (e.g., DMN, FPN) with a global atlas via contrastive consistency; (2) an Encoding-Activation MoE that models dynamic cross-time/gradient dependencies in fMRI/dMRI; and (3) a network-aware refinement MoE that enforces structural priors and symmetry at individual and population levels. Evaluated on the datasets under complete and missing-modality settings, BrainCSD achieves SOTA results: 95.6% accuracy for MCI vs. CN classification without FC, low synthesis error (FC RMSE: 0.038; SC RMSE: 0.006), brain age prediction (MAE: 4.04 years), and MMSE score estimation (MAE: 1.72 points). Code is available at: this https URL.
zh
[AI-213] AI-Enhanced High-Density NIRS Patch for Real-Time Brain Layer Oxygenation Monitoring in Neurological Emergencies
Abstract:Photon scattering has traditionally limited the ability of near-infrared spectroscopy (NIRS) to extract accurate, layer-specific information from the brain. This limitation restricts its clinical utility for precise neurological monitoring. To address this, we introduce an AI-driven, high-density NIRS system optimized to provide real-time, layer-specific oxygenation data from the brain cortex, specifically targeting acute neuro-emergencies. Our system integrates high-density NIRS reflectance data with a neural network trained on MRI-based synthetic datasets. This approach achieves robust cortical oxygenation accuracy across diverse anatomical variations. In simulations, our AI-assisted NIRS demonstrated a strong correlation (R2=0.913) with actual cortical oxygenation, markedly outperforming conventional methods (R2=0.469). Furthermore, biomimetic phantom experiments confirmed its superior anatomical reliability (R2=0.986) compared to standard commercial devices (R2=0.823). In clinical validation with healthy subjects and ischemic stroke patients, the system distinguished between the two groups with an AUC of 0.943. This highlights its potential as an accessible, high-accuracy diagnostic tool for emergency and point-of-care settings. These results underscore the system’s capability to advance neuro-monitoring precision through AI, enabling timely, data-driven decisions in critical care environments.
zh
[AI-214] Gravity-Awareness: Deep Learning Models and LLM Simulation of Human Awareness in Altered Gravity
Abstract:Earth’s gravity has fundamentally shaped human development by guiding the brain’s integration of vestibular, visual, and proprioceptive inputs into an internal model of gravity: a dynamic neural representation enabling prediction and interpretation of gravitational forces. This work presents a dual computational framework to quantitatively model these adaptations. The first component is a lightweight Multi-Layer Perceptron (MLP) that predicts g-load-dependent changes in key electroencephalographic (EEG) frequency bands, representing the brain’s cortical state. The second component utilizes a suite of independent Gaussian Processes (GPs) to model the body’s broader physiological state, including Heart Rate Variability (HRV), Electrodermal Activity (EDA), and motor behavior. Both models were trained on data derived from a comprehensive review of parabolic flight literature, using published findings as anchor points to construct robust, continuous functions. To complement this quantitative analysis, we simulated subjective human experience under different gravitational loads, ranging from microgravity (0g) and partial gravity (Moon 0.17g, Mars 0.38g) to hypergravity associated with spacecraft launch and re-entry (1.8g), using a large language model (Claude 3.5 Sonnet). The model was prompted with physiological parameters to generate introspective narratives of alertness and self-awareness, which closely aligned with the quantitative findings from both the EEG and physiological models. This combined framework integrates quantitative physiological modeling with generative cognitive simulation, offering a novel approach to understanding and predicting human performance in altered gravity.
zh
[AI-215] The Evolution of Probabilistic Price Forecasting Techniques: A Review of the Day-Ahead Intra-Day and Balancing Markets
Abstract:Electricity price forecasting has become a critical tool for decision-making in energy markets, particularly as the increasing penetration of renewable energy introduces greater volatility and uncertainty. Historically, research in this field has been dominated by point forecasting methods, which provide single-value predictions but fail to quantify uncertainty. However, as power markets evolve due to renewable integration, smart grids, and regulatory changes, the need for probabilistic forecasting has become more pronounced, offering a more comprehensive approach to risk assessment and market participation. This paper presents a review of probabilistic forecasting methods, tracing their evolution from Bayesian and distribution-based approaches, through quantile regression techniques, to recent developments in conformal prediction. Particular emphasis is placed on advancements in probabilistic forecasting, including validity-focused methods which address key limitations in uncertainty estimation. Additionally, this review extends beyond the Day-Ahead Market to include the Intra-Day and Balancing Markets, where forecasting challenges are intensified by higher temporal granularity and real-time operational constraints. We examine state-of-the-art methodologies, key evaluation metrics, and ongoing challenges, such as forecast validity, model selection, and the absence of standardised benchmarks, providing researchers and practitioners with a comprehensive and timely resource for navigating the complexities of modern electricity markets.
zh
[AI-216] AIRMap - AI-Generated Radio Maps for Wireless Digital Twins
链接: https://arxiv.org/abs/2511.05522 作者: Ali Saeizadeh,Miead Tehrani-Moayyed,Davide Villa,J. Gordon Beattie Jr.,Pedram Johari,Stefano Basagni,Tommaso Melodia 机构: 未知 类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI) 备注: 13 pages, 17 figures, This paper has been submitted to the IEEE Transactions for possible publication
点击查看摘要
Abstract:Accurate, low-latency channel modeling is essential for real-time wireless network simulation and digital-twin applications. Traditional modeling methods like ray tracing, however, are computationally demanding and unsuited to modeling dynamic conditions. In this paper, we propose AIRMap, a deep-learning framework for ultra-fast radio-map estimation, along with an automated pipeline for creating the largest radio-map dataset to date. AIRMap uses a single-input U-Net autoencoder that processes only a 2D elevation map of terrain and building heights. Trained and evaluated on 60,000 Boston-area samples, spanning coverage areas from 500 m to 3 km per side, AIRMap predicts path gain with under 5 dB RMSE in 4 ms per inference on an NVIDIA L40S, over 7000x faster than GPU-accelerated ray-tracing-based radio maps. A lightweight transfer-learning calibration using just 20% of field measurements reduces the median error to approximately 10%, significantly outperforming traditional simulators, which exceed 50% error. Integration into the Colosseum emulator and the Sionna SYS platform demonstrates near-zero error in spectral efficiency and block-error rate compared to measurement-based channels. These findings validate AIRMap’s potential for scalable, accurate, and real-time radio map estimation in wireless digital twins.
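The transfer-learning calibration step can be sketched as a simple affine recalibration fitted on 20% of field measurements; the affine form and the synthetic path-gain data below are illustrative assumptions, not the paper's actual procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: "true" path gain (dB) and a biased model prediction.
true_pg = rng.uniform(-120.0, -60.0, size=500)
pred_pg = 0.85 * true_pg - 8.0 + rng.normal(0.0, 2.0, size=500)

# Fit an affine correction on the first 20% of "field measurements".
n_cal = len(true_pg) // 5
A = np.column_stack([pred_pg[:n_cal], np.ones(n_cal)])
scale, offset = np.linalg.lstsq(A, true_pg[:n_cal], rcond=None)[0]
calibrated = scale * pred_pg + offset

rmse_before = float(np.sqrt(np.mean((pred_pg - true_pg) ** 2)))
rmse_after = float(np.sqrt(np.mean((calibrated - true_pg) ** 2)))
```

Even this two-parameter correction removes systematic bias on the held-out samples, illustrating why a small calibration set can sharply reduce a simulator's median error.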
zh
[AI-217] TEMPO: Temporal Multi-scale Autoregressive Generation of Protein Conformational Ensembles
Abstract:Understanding the dynamic behavior of proteins is critical to elucidating their functional mechanisms, yet generating realistic, temporally coherent trajectories of protein ensembles remains a significant challenge. In this work, we introduce a novel hierarchical autoregressive framework for modeling protein dynamics that leverages the intrinsic multi-scale organization of molecular motions. Unlike existing methods that focus on generating static conformational ensembles or treat dynamic sampling as an independent process, our approach characterizes protein dynamics as a Markovian process. The framework employs a two-scale architecture: a low-resolution model captures slow, collective motions driving major conformational transitions, while a high-resolution model generates detailed local fluctuations conditioned on these large-scale movements. This hierarchical design ensures that the causal dependencies inherent in protein dynamics are preserved, enabling the generation of temporally coherent and physically realistic trajectories. By bridging high-level biophysical principles with state-of-the-art generative modeling, our approach provides an efficient framework for simulating protein dynamics that balances computational efficiency with physical accuracy.
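A toy version of the two-scale autoregressive idea: a slow AR(1) process models the coarse collective coordinates, and fine-grained fluctuations are generated conditioned on the current coarse state. All dimensions and coefficients below are invented for illustration, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def generate_trajectory(T=200, d_coarse=3, d_fine=12):
    """Hierarchical autoregressive sketch: a slow AR(1) process drives
    coarse collective coordinates; fine-grained detail is generated
    conditioned on the current coarse state via a fixed lifting map."""
    lift = rng.normal(size=(d_fine, d_coarse))  # coarse -> fine map
    z = np.zeros(d_coarse)                      # coarse (slow) state
    traj = np.empty((T, d_fine))
    for t in range(T):
        z = 0.95 * z + 0.1 * rng.normal(size=d_coarse)       # slow dynamics
        traj[t] = lift @ z + 0.02 * rng.normal(size=d_fine)  # fast detail
    return traj

traj = generate_trajectory()
```

Because each frame is conditioned on the previous coarse state, the generated trajectory is temporally coherent (strong lag-1 autocorrelation) rather than an unordered set of conformations, which is the Markovian property the abstract emphasizes.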
zh
[AI-218] Personalized Chain-of-Thought Summarization of Financial News for Investor Decision Support ICDM
Abstract:Financial advisors and investors struggle with information overload from financial news, where irrelevant content and noise obscure key market signals and hinder timely investment decisions. To address this, we propose a novel Chain-of-Thought (CoT) summarization framework that condenses financial news into concise, event-driven summaries. The framework integrates user-specified keywords to generate personalized outputs, ensuring that only the most relevant contexts are highlighted. These personalized summaries provide an intermediate layer that supports language models in producing investor-focused narratives, bridging the gap between raw news and actionable insights.
zh
[AI-219] Rewiring Human Brain Networks via Lightweight Dynamic Connectivity Framework: An EEG-Based Stress Validation
【速读】:该论文旨在解决传统静态功能连接(functional connectivity)方法在捕捉脑区间动态、因果性信息流方面的局限性,从而提升基于脑电图(EEG)的应激状态分类精度。其关键解决方案是提出一种轻量级的动态脑连接框架,基于时变定向传递函数(Time Varying Directed Transfer Function, TV DTF),通过提取不同频段(尤其是α和β频段)的动态方向性信息流特征,并结合多种机器学习(ML)分类器进行验证。结果表明,α-TV-DTF特征在多类和双类应激分类中均表现出显著优于传统绝对功率与相位锁定特征的性能,且特征重要性分析揭示了前额叶-顶叶和前额叶-枕叶之间的主导长程信息调控作用,凸显了前额叶在应激状态下的调节功能,验证了TV-DTF作为高效、可解释的动态脑网络分析工具的潜力。
Abstract:In recent years, electroencephalographic analysis has gained prominence in stress research when combined with AI and machine learning models for validation. In this study, a lightweight dynamic brain connectivity framework based on the Time-Varying Directed Transfer Function (TV-DTF) is proposed, where TV-DTF features were validated through ML-based stress classification. TV-DTF estimates the directional information flow between brain regions across distinct EEG frequency bands, thereby capturing temporal and causal influences that are often overlooked by static functional connectivity measures. EEG recordings from the 32-channel SAM 40 dataset were employed, focusing on mental arithmetic task trials. The dynamic EEG-based TV-DTF features were validated through ML classifiers such as Support Vector Machine, Random Forest, Gradient Boosting, Adaptive Boosting, and Extreme Gradient Boosting. Experimental results show that alpha-TV-DTF provided the strongest discriminative power, with SVM achieving 89.73% accuracy in 3-class classification and XGBoost achieving 93.69% accuracy in 2-class classification. Relative to absolute-power and phase-locking-based functional connectivity features, alpha-TV-DTF and beta-TV-DTF achieved higher performance across the ML models, highlighting the advantages of dynamic over static measures. Feature importance analysis further highlighted dominant long-range frontal-parietal and frontal-occipital informational influences, emphasizing the regulatory role of frontal regions under stress. These findings validate the lightweight TV-DTF as a robust framework, revealing spatiotemporal brain dynamics and directional influences across different stress levels.
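The static building block of TV-DTF, the directed transfer function computed from MVAR coefficients, can be sketched in numpy; in the time-varying case the AR coefficients would simply be re-estimated over sliding windows. The toy two-channel model below is an assumption for illustration:

```python
import numpy as np

def dtf(ar_coeffs, freqs, fs=128.0):
    """Directed transfer function from MVAR coefficients.

    ar_coeffs: array (p, n, n), lag matrices A_k for k = 1..p.
    Returns gamma2 with shape (len(freqs), n, n); row i of gamma2[f]
    gives the normalized inflow to channel i from each source channel.
    """
    p, n, _ = ar_coeffs.shape
    out = np.empty((len(freqs), n, n))
    for fi, f in enumerate(freqs):
        A_f = np.eye(n, dtype=complex)
        for k in range(1, p + 1):
            A_f -= ar_coeffs[k - 1] * np.exp(-2j * np.pi * f * k / fs)
        H = np.linalg.inv(A_f)            # spectral transfer matrix
        mag2 = np.abs(H) ** 2
        out[fi] = mag2 / mag2.sum(axis=1, keepdims=True)
    return out

# Toy 2-channel MVAR(1): channel 0 drives channel 1, no reverse coupling.
A1 = np.array([[[0.5, 0.0],
                [0.4, 0.5]]])
gamma2 = dtf(A1, freqs=np.arange(8.0, 13.0))  # alpha band, 8-12 Hz
```

By construction the 0→1 flow is positive while the 1→0 flow is zero, so the DTF recovers the directionality that undirected (e.g. phase-locking) measures cannot express.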
zh
机器学习
[LG-0] Routing Manifold Alignment Improves Generalization of Mixture-of-Experts LLM s
[LG-3] A Diffusion Model to Shrink Proteins While Maintaining Their Function
链接: https://arxiv.org/abs/2511.07390 作者: Ethan Baron,Alan N. Amin,Ruben Weitzman,Debora Marks,Andrew Gordon Wilson 类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: Code available at this https URL
点击查看摘要
[LG-4] Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training
[LG-10] RobustA: Robust Anomaly Detection in Multimodal Data
链接: https://arxiv.org/abs/2511.07276 作者: Salem AlMarri,Muhammad Irzam Liaqat,Muhammad Zaigham Zaheer,Shah Nawaz,Karthik Nandakumar,Markus Schedl 类目: Machine Learning (cs.LG)
*备注: Submitted to IEEE Transactions on Image Processing
点击查看摘要
Abstract:In recent years, multimodal anomaly detection methods have demonstrated remarkable performance improvements over video-only models. However, real-world multimodal data is often corrupted due to unforeseen environmental distortions. In this paper, we present the first-of-its-kind work that comprehensively investigates the adverse effects of corrupted modalities on the multimodal anomaly detection task. To streamline this work, we propose RobustA, a carefully curated evaluation dataset to systematically observe the impacts of audio and visual corruptions on the overall effectiveness of anomaly detection systems. Furthermore, we propose a multimodal anomaly detection method, which shows notable resilience against corrupted modalities. The proposed method learns a shared representation space for different modalities and employs a dynamic weighting scheme during inference based on the estimated level of corruption. Our work represents a significant step forward in enabling the real-world application of multimodal anomaly detection, addressing situations where modality corruptions are likely to occur. The proposed evaluation dataset with corrupted modalities and respective extracted features will be made publicly available.
[LG-11] Multi-modal Dynamic Proxy Learning for Personalized Multiple Clustering AAAI2026
链接: https://arxiv.org/abs/2511.07274 作者: Jinfeng Xu,Zheyu Chen,Shuo Yang,Jinze Li,Ziyue Peng,Zewei Liu,Hewei Wang,Jiayi Zhang,Edith C. H. Ngai 类目: Machine Learning (cs.LG)
*备注: Accepted by AAAI 2026
点击查看摘要
Abstract:Multiple clustering aims to discover diverse latent structures from different perspectives, yet existing methods generate exhaustive clusterings without discerning user interest, necessitating laborious manual screening. Current multi-modal solutions suffer from static semantic rigidity: predefined candidate words fail to adapt to dataset-specific concepts, and fixed fusion strategies ignore evolving feature interactions. To overcome these limitations, we propose Multi-DProxy, a novel multi-modal dynamic proxy learning framework that leverages cross-modal alignment through learnable textual proxies. Multi-DProxy introduces 1) gated cross-modal fusion that synthesizes discriminative joint representations by adaptively modeling feature interactions. 2) dual-constraint proxy optimization where user interest constraints enforce semantic consistency with domain concepts while concept constraints employ hard example mining to enhance cluster discrimination. 3) dynamic candidate management that refines textual proxies through iterative clustering feedback. Therefore, Multi-DProxy not only effectively captures a user’s interest through proxies but also enables the identification of relevant clusterings with greater precision. Extensive experiments demonstrate state-of-the-art performance with significant improvements over existing methods across a broad set of multi-clustering benchmarks.
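The gated cross-modal fusion component can be sketched as an element-wise convex combination of modality features controlled by a learned sigmoid gate; the feature dimensions and random weights below are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(img_feat, txt_feat, W_g, b_g):
    """Gated fusion sketch: a gate computed from both modalities decides,
    per dimension, how much of each modality enters the joint vector."""
    gate = sigmoid(np.concatenate([img_feat, txt_feat]) @ W_g + b_g)
    return gate * img_feat + (1.0 - gate) * txt_feat

rng = np.random.default_rng(0)
d = 8
img, txt = rng.normal(size=d), rng.normal(size=d)
W_g = rng.normal(scale=0.1, size=(2 * d, d))  # toy gate parameters
b_g = np.zeros(d)
fused = gated_fusion(img, txt, W_g, b_g)
```

Since the gate lies in (0, 1), each fused coordinate stays between the two modality values, so the gate adaptively interpolates feature interactions rather than fixing one static fusion rule.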
[LG-12] Understanding the role of depth in the neural tangent kernel for overparameterized neural networks
[LG-16] DETECT: Data-Driven Evaluation of Treatments Enabled by Classification Transformers ICDM2025
链接: https://arxiv.org/abs/2511.07213 作者: Yuanheng Mao,Lillian Yang,Stephen Yang,Ethan Shao,Zihan Li 类目: Machine Learning (cs.LG)
*备注: 5 pages, 4 figures, 2 tables, accepted for presentation by IEEE ICDM 2025 UGHS Symposium and publication with proceedings forthcoming
点击查看摘要
Abstract:Chronic pain is a global health challenge affecting millions of individuals, making it essential for physicians to have reliable and objective methods to measure the functional impact of clinical treatments. Traditionally used methods, like the numeric rating scale, while personalized and easy to use, are subjective due to their self-reported nature. Thus, this paper proposes DETECT (Data-Driven Evaluation of Treatments Enabled by Classification Transformers), a data-driven framework that assesses treatment success by comparing patient activities of daily life before and after treatment. We use DETECT on public benchmark datasets and simulated patient data from smartphone sensors. Our results demonstrate that DETECT is objective yet lightweight, making it a significant and novel contribution to clinical decision-making. By using DETECT, independently or together with other self-reported metrics, physicians can improve their understanding of their treatment impacts, ultimately leading to more personalized and responsive patient care.
[LG-17] Synergy over Discrepancy: A Partition-Based Approach to Multi-Domain LLM Fine-Tuning NEURIPS2025
链接: https://arxiv.org/abs/2511.07198 作者: Hua Ye(1 and 2),Siyuan Chen(3),Haoliang Zhang(4),Weihao Luo(5),Yanbin Li(6),Xuan Zhang(2 and 7) ((1) Nanjing University, (2) Airon Technology CO., LTD, (3) University of Bristol, (4) The University of Oklahoma, (5) Donghua University, (6) Beijing University of Posts and Telecommunications, (7) Carnegie Mellon University) 类目: Machine Learning (cs.LG)
*备注: 20 pages, 5 figures, 21 tables. Accepted at NeurIPS 2025. Corresponding author: Xuan Zhang (xuanzhang2199@gmail.com)
点击查看摘要
[LG-18] On Stealing Graph Neural Network Models
链接: https://arxiv.org/abs/2511.07170 作者: Marcin Podhajski,Jan Dubiński,Franziska Boenisch,Adam Dziedzic,Agnieszka Pręgowska,Tomasz P. Michalak 类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
点击查看摘要
[LG-19] Combining digital data streams and epidemic networks for real time outbreak detection
Abstract:Vector data trading is essential for cross-domain learning with vector databases, yet it remains largely unexplored. We study this problem under online learning, where sellers face uncertain retrieval costs and buyers provide stochastic feedback to posted prices. Three main challenges arise: (1) heterogeneous and partial feedback in configuration learning, (2) variable and complex feedback in pricing learning, and (3) inherent coupling between configuration and pricing decisions. We propose a hierarchical bandit framework that jointly optimizes retrieval configurations and pricing. Stage I employs contextual clustering with confidence-based exploration to learn effective configurations with logarithmic regret. Stage II adopts interval-based price selection with local Taylor approximation to estimate buyer responses and achieve sublinear regret. We establish theoretical guarantees with polynomial time complexity and validate the framework on four real-world datasets, demonstrating consistent improvements in cumulative reward and regret reduction compared with existing methods.
Comments: Accepted by ICDE 2026
Subjects: Databases (cs.DB); Machine Learning (cs.LG)
Cite as: arXiv:2511.07139 [cs.DB] (or arXiv:2511.07139v1 [cs.DB] for this version), https://doi.org/10.48550/arXiv.2511.07139
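The joint configuration-and-pricing learning problem can be illustrated with a flat UCB bandit over (configuration, price) pairs under stochastic buyer feedback. This collapses the paper's two-stage hierarchy into one index set and uses invented Bernoulli rewards, so it is a toy stand-in rather than the proposed algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: 2 retrieval configurations x 2 posted prices.
# Invented expected rewards; configuration 1 with price 1 is best.
mean_reward = np.array([[0.1, 0.2],
                        [0.3, 0.8]])
counts = np.zeros_like(mean_reward)
sums = np.zeros_like(mean_reward)

def ucb_play(t):
    """One UCB1 round over (configuration, price) pairs."""
    with np.errstate(divide="ignore", invalid="ignore"):
        bonus = np.sqrt(2.0 * np.log(t + 1) / counts)
        est = sums / counts
    score = np.where(counts > 0, est + bonus, np.inf)  # force exploration
    c, p = np.unravel_index(int(np.argmax(score)), score.shape)
    reward = float(rng.random() < mean_reward[c, p])   # Bernoulli buyer feedback
    counts[c, p] += 1
    sums[c, p] += reward
    return int(c), int(p)

plays = [ucb_play(t) for t in range(3000)]
best_share = plays.count((1, 1)) / len(plays)
```

With logarithmic exploration bonuses, the learner concentrates play on the best (configuration, price) pair, the same mechanism that underlies the paper's logarithmic and sublinear regret guarantees.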
[LG-24] REACT-LLM : A Benchmark for Evaluating LLM Integration with Causal Features in Clinical Prognostic Tasks
链接: https://arxiv.org/abs/2511.07127 作者: Linna Wang,Zhixuan You,Qihui Zhang,Jiunan Wen,Ji Shi,Yimin Chen,Yusen Wang,Fanqi Ding,Ziliang Feng,Li Lu 类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
[LG-25] A Provably-Correct and Robust Convex Model for Smooth Separable NMF
链接: https://arxiv.org/abs/2511.07109 作者: Junjun Pan,Valentin Leplat,Michael Ng,Nicolas Gillis 类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Signal Processing (eess.SP); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 30 pages, 10 figures, code available from this https URL
点击查看摘要
Abstract:Nonnegative matrix factorization (NMF) is a linear dimensionality reduction technique for nonnegative data, with applications such as hyperspectral unmixing and topic modeling. NMF is a difficult problem in general (NP-hard), and its solutions are typically not unique. To address these two issues, additional constraints or assumptions are often used. In particular, separability assumes that the basis vectors in the NMF are equal to some columns of the input matrix. In that case, the problem is referred to as separable NMF (SNMF) and can be solved in polynomial-time with robustness guarantees, while identifying a unique solution. However, in real-world scenarios, due to noise or variability, multiple data points may lie near the basis vectors, which SNMF does not leverage. In this work, we rely on the smooth separability assumption, which assumes that each basis vector is close to multiple data points. We explore the properties of the corresponding problem, referred to as smooth SNMF (SSNMF), and examine how it relates to SNMF and orthogonal NMF. We then propose a convex model for SSNMF and show that it provably recovers the sought-after factors, even in the presence of noise. We finally adapt an existing fast gradient method to solve this convex model for SSNMF, and show that it compares favorably with state-of-the-art methods on both synthetic and hyperspectral datasets.
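Under exact separability, the basis columns can be recovered with the classical Successive Projection Algorithm, a standard baseline in this literature (the paper's convex SSNMF model is different and additionally handles the smooth, noisy case); the small synthetic factorization below is illustrative:

```python
import numpy as np

def spa(X, r):
    """Successive Projection Algorithm: under separability (X = W H with
    every column of W appearing as a column of X), returns r column
    indices of X corresponding to the basis vectors."""
    R = X.astype(float).copy()
    idx = []
    for _ in range(r):
        j = int(np.argmax(np.linalg.norm(R, axis=0)))  # extreme column
        idx.append(j)
        u = R[:, j] / np.linalg.norm(R[:, j])
        R -= np.outer(u, u @ R)  # project out the selected direction
    return idx

rng = np.random.default_rng(0)
W = rng.random((6, 3))                    # ground-truth basis vectors
H = rng.dirichlet(np.ones(3), size=20).T  # convex mixing weights
X = np.hstack([W, W @ H])                 # basis columns sit at indices 0-2

picked = sorted(spa(X, 3))
```

Because every data column is a convex combination of the basis columns, the maximum-norm column after each projection is always a vertex, so SPA recovers exactly the planted indices in the noiseless case.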
[LG-26] Direct Molecular Polarizability Prediction with SO(3) Equivariant Local Frame GNNs
链接: https://arxiv.org/abs/2511.07087 作者: Jean Philip Filling,Felix Post,Michael Wand,Denis Andrienko 类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
[LG-27] Breaking Privacy in Federated Clustering: Perfect Input Reconstruction via Temporal Correlations
[LG-38] MI-to-Mid Distilled Compression (M2M-DC): An Hybrid-Information-Guided-Block Pruning with Progressive Inner Slicing Approach to Model Compression
Abstract:Over-smoothing remains a fundamental challenge in deep Graph Neural Networks (GNNs), where repeated message passing causes node representations to become indistinguishable. While existing solutions, such as residual connections and skip layers, alleviate this issue to some extent, they fail to explicitly model how node representations evolve in a node-specific and progressive manner across layers. Moreover, these methods do not take global information into account, which is also crucial for mitigating the over-smoothing problem. To address the aforementioned issues, in this work, we propose a Dual Mamba-enhanced Graph Convolutional Network (DMbaGCN), a novel framework that integrates Mamba into GNNs to address over-smoothing from both local and global perspectives. DMbaGCN consists of two modules: the Local State-Evolution Mamba (LSEMba), which performs local neighborhood aggregation and uses Mamba’s selective state-space modeling to capture node-specific representation dynamics across layers, and the Global Context-Aware Mamba (GCAMba), which leverages Mamba’s global attention capabilities to incorporate global context for each node. By combining these components, DMbaGCN enhances node discriminability in deep GNNs, thereby mitigating over-smoothing. Extensive experiments on multiple benchmarks demonstrate the effectiveness and efficiency of our method.
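The over-smoothing phenomenon the abstract targets is easy to reproduce numerically: repeated row-normalized neighborhood averaging, the message-passing core of a plain GCN, drives the spread of node features toward zero. A minimal sketch on a 5-node path graph:

```python
import numpy as np

def mean_aggregate(A_hat, X, layers):
    """Repeatedly apply row-normalized neighborhood averaging."""
    for _ in range(layers):
        X = A_hat @ X
    return X

def node_variance(X):
    """Spread of node features: 0 means all nodes are indistinguishable."""
    return float(((X - X.mean(axis=0)) ** 2).sum())

# A small connected graph (path of 5 nodes) with self-loops.
A = np.eye(5)
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0
A_hat = A / A.sum(axis=1, keepdims=True)  # row-stochastic propagation matrix

rng = np.random.default_rng(0)
X0 = rng.normal(size=(5, 3))
e_shallow = node_variance(mean_aggregate(A_hat, X0, 2))
e_deep = node_variance(mean_aggregate(A_hat, X0, 100))
```

On a connected graph the propagation matrix is row-stochastic, so deep stacking collapses all node representations onto a common vector; this is the collapse that DMbaGCN's local and global modules are designed to counteract.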
[LG-48] Multi-Modal Continual Learning via Cross-Modality Adapters and Representation Alignment with Knowledge Preservation ECAI2025
链接: https://arxiv.org/abs/2511.06723 作者: Evelyn Chee,Wynne Hsu,Mong Li Lee 类目: Machine Learning (cs.LG)
*备注: Accepted to ECAI 2025
[LG-53] An Adaptive Machine Learning Triage Framework for Predicting Alzheimer's Disease Progression ALT ML4H
链接: https://arxiv.org/abs/2511.06681 作者: Richard Hou,Shengpu Tang,Wei Jin 类目: Machine Learning (cs.LG)
*备注: Findings paper presented at Machine Learning for Health (ML4H) symposium 2025, December 1-2, 2025, San Diego, CA, USA, 9 pages. Shengpu Tang and Wei Jin contributed equally as senior authors
点击查看摘要
[LG-54] When Evidence Contradicts: Toward Safer Retrieval-Augmented Generation in Healthcare
[LG-55] GNN-Enabled Robust Hybrid Beamforming with Score-Based CSI Generation and Denoising
链接: https://arxiv.org/abs/2511.06663 作者: Yuhang Li,Yang Lu,Bo Ai,Zhiguo Ding,Dusit Niyato,Arumugam Nallanathan 类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:
点击查看摘要
[LG-56] Dual-Pathway Fusion of EHRs and Knowledge Graphs for Predicting Unseen Drug-Drug Interactions ML4H2025
链接: https://arxiv.org/abs/2511.06662 作者: Franklin Lee,Tengfei Ma 类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: ML4H 2025 Findings
点击查看摘要
[LG-57] Improving Asset Allocation in a Fast Moving Consumer Goods B2B Company: An Interpretable Machine Learning Framework for Commercial Cooler Assignment Based on Multi-Tier Growth Targets
[LG-63] Adaptive Initial Residual Connections for GNNs with Theoretical Guarantees AAAI-2026 AAAI
链接: https://arxiv.org/abs/2511.06598 作者: Mohammad Shirzadi,Ali Safarpoor Dehkordi,Ahad N. Zehmakan 类目: Machine Learning (cs.LG)
*备注: This is the full version of the paper accepted to the 40th Annual AAAI Conference on Artificial Intelligence (AAAI-2026)
点击查看摘要
[LG-64] Optimistic Online-to-Batch Conversions for Accelerated Convergence and Universality NEURIPS2025
链接: https://arxiv.org/abs/2511.06597 作者: Yu-Hu Yan,Peng Zhao,Zhi-Hua Zhou 类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: NeurIPS 2025
点击查看摘要
[LG-65] Practical Policy Distillation for Reinforcement Learning in Radio Access Networks
链接: https://arxiv.org/abs/2511.06563 作者: Sara Khosravi,Burak Demirel,Linghui Zhou,Javier Rasines,Pablo Soldati 类目: Machine Learning (cs.LG)
*备注: This paper is accepted for publication in IEEE International Symposium on Personal, Indoor and Mobile Radio Communications, 2025
点击查看摘要
[LG-66] Bayesian Uncertainty Quantification with Anchored Ensembles for Robust EV Power Consumption Prediction
[LG-67] Efficient Approximation of Volterra Series for High-Dimensional Systems
链接: https://arxiv.org/abs/2511.06527 作者: Navin Khoshnan,Claudia K Petritsch,Bryce-Allen Bagley 类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
点击查看摘要
[LG-68] EASE: Practical and Efficient Safety Alignment for Small Language Models AAAI2026
链接: https://arxiv.org/abs/2511.06512 作者: Haonan Shi,Guoli Wang,Tu Ouyang,An Wang 类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Accepted to AAAI 2026
点击查看摘要
[LG-69] Probably Approximately Global Robustness Certification ICML2025
Abstract:Although multimodal fusion has made significant progress, its advancement is severely hindered by the lack of adequate evaluation benchmarks. Current fusion methods are typically evaluated on a small selection of public datasets, a limited scope that inadequately represents the complexity and diversity of real-world scenarios, potentially leading to biased evaluations. This issue presents a twofold challenge. On one hand, models may overfit to the biases of specific datasets, hindering their generalization to broader practical applications. On the other hand, the absence of a unified evaluation standard makes fair and objective comparisons between different fusion methods difficult. Consequently, a truly universal and high-performance fusion model has yet to emerge. To address these challenges, we have developed a large-scale, domain-adaptive benchmark for multimodal evaluation. This benchmark integrates over 30 datasets, encompassing 15 modalities and 20 predictive tasks across key application domains. To complement this, we have also developed an open-source, unified, and automated evaluation pipeline that includes standardized implementations of state-of-the-art models and diverse fusion paradigms. Leveraging this platform, we have conducted large-scale experiments, successfully establishing new performance baselines across multiple tasks. This work provides the academic community with a crucial platform for rigorous and reproducible assessment of multimodal models, aiming to propel the field of multimodal artificial intelligence to new heights.
[LG-75] A Risk-Neutral Neural Operator for Arbitrage-Free SPX-VIX Term Structures
链接: https://arxiv.org/abs/2511.06451 作者: Jian’an Zhang 类目: Machine Learning (cs.LG); Computational Finance (q-fin.CP)
*备注: 46 pages, 9 figures, includes appendices; v11 draft aligned with final outline
点击查看摘要
[LG-76] How Wide and How Deep? Mitigating Over-Squashing of GNNs via Channel Capacity Constrained Estimation AAAI AAAI-26
链接: https://arxiv.org/abs/2511.06443 作者: Zinuo You,Jin Zheng,John Cartlidge 类目: Machine Learning (cs.LG)
*备注: 29 pages, 11 figures. Author manuscript accepted for the 40th Annual AAAI Conference on Artificial Intelligence (AAAI-26), January 2026
点击查看摘要
[LG-77] Vocabulary In-Context Learning in Transformers: Benefits of Positional Encoding NIPS2025
链接: https://arxiv.org/abs/2511.06376 作者: Qian Ma,Ruoxiang Xu,Yongqiang Cai 类目: Machine Learning (cs.LG)
*备注: Accepted as NIPS 2025 poster
点击查看摘要
[LG-78] Adaptive Regularization for Large-Scale Sparse Feature Embedding Models
[LG-79] Scalable Verification of Neural Control Barrier Functions Using Linear Bound Propagation
链接: https://arxiv.org/abs/2511.06341 作者: Nikolaus Vertovec,Frederik Baymler Mathiesen,Thom Badings,Luca Laurenti,Alessandro Abate 类目: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注:
点击查看摘要
[LG-80] DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation
[LG-81] Setting varepsilon is not the Issue in Differential Privacy NEURIPS
链接: https://arxiv.org/abs/2511.06305 作者: Edwige Cyffers 类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Accepted to NeurIPS Position Paper track
点击查看摘要
[LG-82] 3dSAGER: Geospatial Entity Resolution over 3D Objects (Technical Report)
[LG-85] Synheart Emotion: Privacy-Preserving On-Device Emotion Recognition from Biosignals
链接: https://arxiv.org/abs/2511.06231 作者: Henok Ademtew,Israel Goytom 类目: Machine Learning (cs.LG)
*备注: Preprint submitted to the Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT)
点击查看摘要
[LG-86] Deep Reinforcement Learning for Dynamic Origin-Destination Matrix Estimation in Microscopic Traffic Simulations Considering Credit Assignment
[LG-88] Time Matters: A Novel Real-Time Long- and Short-term User Interest Model for Click-Through Rate Prediction
链接: https://arxiv.org/abs/2511.06213 作者: Xian-Jin Gui 类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: This work was done when the first author interned at Alibaba Group
点击查看摘要
[LG-89] Sparse Linear Regression is Easy on Random Supports
链接: https://arxiv.org/abs/2511.06211 作者: Gautam Chandrasekaran,Raghu Meka,Konstantinos Stavropoulos 类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:
点击查看摘要
[LG-90] Local K-Similarity Constraint for Federated Learning with Label Noise
Abstract:Federated learning on clients with noisy labels is a challenging problem, as such clients can infiltrate the global model, impacting the overall generalizability of the system. Existing methods proposed to handle noisy clients assume that a sufficient number of clients with clean labels are available, which can be leveraged to learn a robust global model while dampening the impact of noisy clients. This assumption fails when a high number of heterogeneous clients contain noisy labels, making the existing approaches ineffective. In such scenarios, it is important to locally regularize the clients before communication with the global model, to ensure the global model isn’t corrupted by noisy clients. While pre-trained self-supervised models can be effective for local regularization, existing centralized approaches relying on pretrained initialization are impractical in a federated setting due to the potentially large size of these models, which increases communication costs. In that line, we propose a regularization objective for client models that decouples the pre-trained and classification models by enforcing similarity between close data points within the client. We leverage the representation space of a self-supervised pretrained model to evaluate the closeness among examples. This regularization, when applied with the standard objective function for the downstream task in standard noisy federated settings, significantly improves performance, outperforming existing state-of-the-art federated methods in multiple computer vision and medical image classification benchmarks. Unlike other techniques that rely on self-supervised pretrained initialization, our method does not require the pretrained model and classifier backbone to share the same architecture, making it architecture-agnostic.
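The proposed regularization can be sketched as a penalty on output disagreement between each sample and its k nearest neighbours in a frozen pretrained embedding space. The embedding dimension, k, and synthetic data below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def k_similarity_penalty(embeddings, logits, k=3):
    """Local K-similarity sketch: for each sample, find its k nearest
    neighbours in a (frozen) self-supervised embedding space and penalize
    disagreement between the classifier outputs of close pairs."""
    n = len(embeddings)
    d2 = ((embeddings[:, None, :] - embeddings[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)  # exclude self-matches
    penalty = 0.0
    for i in range(n):
        nbrs = np.argsort(d2[i])[:k]
        penalty += ((logits[i] - logits[nbrs]) ** 2).sum()
    return penalty / (n * k)

rng = np.random.default_rng(0)
emb = rng.normal(size=(40, 16))
smooth_logits = emb @ rng.normal(size=(16, 5)) * 0.1  # respects the geometry
noisy_logits = rng.normal(size=(40, 5))               # ignores the geometry
```

A classifier whose outputs vary smoothly over the pretrained representation space incurs a much smaller penalty than one that ignores it, which is exactly the local consistency the regularizer enforces against label noise. Note the penalty never compares the architectures of the pretrained model and the classifier, matching the architecture-agnostic claim.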
[LG-91] Learning Gaussian DAG Models without Condition Number Bounds
[LG-92] Enhancing Robustness of Graph Neural Networks through p-Laplacian AAAI AAAI-26
链接: https://arxiv.org/abs/2511.06143 作者: Anuj Kumar Sirohi,Subhanu Halder,Kabir Kumar,Sandeep Kumar 类目: Machine Learning (cs.LG)
*备注: Accepted at 5th Workshop on Graphs and more Complex Structures For Learning and Reasoning (GCLR), The 40th AAAI Conference on Artificial Intelligence (AAAI-26)
点击查看摘要
[LG-93] On the Convergence and Stability of Distributed Sub-model Training
Abstract:As learning models continue to grow in size, enabling on-device local training of these models has emerged as a critical challenge in federated learning. A popular solution is sub-model training, where the server only distributes randomly sampled sub-models to the edge clients, and clients only update these small models. However, such random sampling of sub-models may not yield satisfactory convergence performance. In this paper, motivated by the success of SGD with shuffling, we propose distributed shuffled sub-model training: the full model is partitioned into several sub-models in advance; at each round, the server shuffles these sub-models and sends each of them to clients; at the end of the local update period, clients send back the updated sub-models and the server averages them. We establish the convergence rate of this algorithm. We also study the generalization of distributed sub-model training via stability analysis, and find that sub-model training can improve generalization by amplifying the stability of the training process. Extensive experiments also validate our theoretical findings.
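The shuffled sub-model scheme can be sketched on a toy quadratic objective: the server partitions the weight vector once, shuffles the sub-model-to-client assignment each round, and installs the returned sub-models. With exactly one client per sub-model the averaging step degenerates to installation, so this is a simplification of the paper's protocol, with invented dimensions and step sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Full model: a weight vector partitioned into 4 fixed sub-models.
dim, n_parts = 20, 4
w_global = np.zeros(dim)
parts = np.array_split(np.arange(dim), n_parts)
w_star = rng.normal(size=dim)  # target weights of a toy quadratic objective

for rnd in range(200):
    order = rng.permutation(n_parts)  # server shuffles sub-model assignment
    updates = {}
    for client, part_id in enumerate(order):
        idx = parts[part_id]
        w_local = w_global[idx].copy()
        for _ in range(5):  # local SGD on f(w) = 0.5 * ||w - w_star||^2
            w_local -= 0.1 * (w_local - w_star[idx])
        updates[part_id] = w_local
    for part_id, w_local in updates.items():  # server installs sub-models
        w_global[parts[part_id]] = w_local

err = float(np.linalg.norm(w_global - w_star))
```

Because shuffling guarantees every sub-model is updated exactly once per round (rather than possibly skipped under random sampling), each coordinate contracts toward the optimum every round and the full model converges.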
[LG-94] A Deep Learning Model for Predicting Transformation Legality
Abstract:Compilers must check the legality of code transformations to guarantee the correctness of applying a sequence of transformations to a given program. While such a legality check needs to be precisely computed in general, we can use an approximate legality prediction model in certain cases, such as training a reinforcement learning (RL) agent for schedule prediction. In this paper, we propose an approximate method for legality checks: a novel DL model that predicts the legality of transformations. The model takes the code representation and a list of transformations as input and predicts whether applying those transformations to the code is legal. We implement and evaluate the proposed model, demonstrating its effectiveness. Our evaluation shows an F1 score of 0.91 on a test set of randomly generated programs. To further evaluate the model in a practical scenario, we used the model to replace the legality check used during the training of an RL agent designed for automatic code optimization. We demonstrate that such a replacement enables the agent to train on twice as many steps, resulting in faster training and reducing resource usage by approximately 80% for CPU and 35% for RAM. The agent trained using this approach maintains comparable performance, with only a 4% reduction on benchmarks from the Polybench suite compared to the traditional method.
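For reference, the reported F1 metric scores the approximate predictor against the exact legality check as ground truth. A pure-Python sketch with invented toy labels (the arrays below are not the paper's data):

```python
def f1_score(y_true, y_pred):
    """F1 of the approximate legality predictor vs. the exact check."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy example: exact legality results vs. the approximate model's output.
exact = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
approx = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]
score = f1_score(exact, approx)  # 5 TP, 1 FP, 1 FN -> F1 = 5/6
```

An F1 trade-off of this kind is acceptable during RL training because an occasional wrong legality verdict only perturbs the reward signal, whereas the final emitted schedule would still be verified exactly.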
[LG-95] Guardian-regularized Safe Offline Reinforcement Learning for Smart Weaning of Mechanical Circulatory Devices
[LG-96] Approximating Shapley Explanations in Reinforcement Learning NEURIPS2025
链接: https://arxiv.org/abs/2511.06094 作者: Daniel Beechey,Özgür Şimşek 类目: Machine Learning (cs.LG)
*备注: Camera-ready version. Published at the Conference on Neural Information Processing Systems (NeurIPS 2025)
点击查看摘要
[LG-97] Event-driven physics-informed operator learning for reliability analysis
Abstract:Reliability analysis of engineering systems under uncertainty poses significant computational challenges, particularly for problems involving high-dimensional stochastic inputs, nonlinear system responses, and multiphysics couplings. Traditional surrogate modeling approaches often incur high energy consumption, which severely limits their scalability and deployability in resource-constrained environments. We introduce NeuroPOL, the first neuroscience-inspired physics-informed operator learning framework for reliability analysis. NeuroPOL incorporates Variable Spiking Neurons into a physics-informed operator architecture, replacing continuous activations with event-driven spiking dynamics. This innovation promotes sparse communication, significantly reduces computational load, and enables an energy-efficient surrogate model. The proposed framework lowers both computational and power demands, supporting real-time reliability assessment and deployment on edge devices and digital twins. By embedding governing physical laws into operator learning, NeuroPOL builds physics-consistent surrogates capable of accurate uncertainty propagation and efficient failure probability estimation, even for high-dimensional problems. We evaluate NeuroPOL on five canonical benchmarks: the Burgers equation, the Nagumo equation, the two-dimensional Poisson equation, the two-dimensional Darcy equation, and the incompressible Navier-Stokes equation with energy coupling. Results show that NeuroPOL achieves reliability measures comparable to standard physics-informed operators, while introducing significant communication sparsity, enabling scalable, distributed, and energy-efficient deployment.
[LG-98] Make It Long, Keep It Fast: End-to-End 10k-Sequence Modeling at Billion Scale on Douyin
链接: https://arxiv.org/abs/2511.06077 作者: Lin Guan,Jia-Qi Yang,Zhishan Zhao,Beichuan Zhang,Bo Sun,Xuanyuan Luo,Jinan Ni,Xiaowen Li,Yuhang Qi,Zhifang Fan,Hangyu Wang,Qiwei Chen,Yi Cheng,Feng Zhang,Xiao Yang 类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:
点击查看摘要
[LG-99] CatBack: Universal Backdoor Attacks on Tabular Data via Categorical Encoding
Abstract:Stiff ordinary differential equations (ODEs) play an important role in many scientific and engineering applications. Often, the dependence of the solution of the ODE on additional parameters is of interest, e.g., when dealing with uncertainty quantification or design optimization. Directly studying this dependence can quickly become too computationally expensive, such that cheaper surrogate models approximating the solution are of interest. One popular class of surrogate models is Gaussian processes (GPs). They perform well when approximating stationary functions, i.e., functions that have a similar level of variation along any given parameter direction. Solutions to stiff ODEs, however, are often characterized by a mixture of regions of rapid and slow variation along the time axis, and when dealing with such nonstationary functions, GP performance frequently degrades drastically. We therefore aim to reparameterize stiff ODE solutions based on the available data, to make them appear more stationary and hence recover good GP performance. This approach comes with minimal computational overhead and requires no internal changes to the GP implementation, as it can be seen as a separate preprocessing step. We illustrate the achieved benefits using multiple examples.
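One concrete way to make a stiff solution look more stationary, in the spirit of the reparameterization described above, is to warp the time axis by cumulative arc length, so that regions of rapid variation are stretched out. This is an illustrative choice of warp, not necessarily the paper's construction:

```python
def arclength_reparam(t, y):
    """Map time t to a coordinate s(t) that accumulates the arc length
    of the sampled solution (t_i, y_i). Fast-varying regions accumulate
    arc length quickly, so the function of s varies more evenly, i.e.,
    looks more stationary, than the function of t."""
    s = [0.0]
    for i in range(1, len(t)):
        dt = t[i] - t[i - 1]
        dy = y[i] - y[i - 1]
        s.append(s[-1] + (dt * dt + dy * dy) ** 0.5)
    total = s[-1]
    # normalize to [0, 1]; assumes the trajectory is not a single point
    return [si / total for si in s]
```

A GP would then be fit on pairs (s(t_i), y_i) instead of (t_i, y_i), with no change to the GP implementation itself.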
[LG-104] Bespoke Co-processor for Energy-Efficient Health Monitoring on RISC-V-based Flexible Wearables DATE2026
链接: https://arxiv.org/abs/2511.05985 作者: Theofanis Vergos,Polykarpos Vergos,Mehdi B. Tahoori,Georgios Zervakis 类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: Accepted for publication at IEEE Design, Automation Test in Europe (DATE 2026)
点击查看摘要
Abstract:Flexible electronics offer unique advantages for conformable, lightweight, and disposable healthcare wearables. However, their limited gate count, large feature sizes, and high static power consumption make on-body machine learning classification highly challenging. While existing bendable RISC-V systems provide compact solutions, they lack the energy efficiency required. We present a mechanically flexible RISC-V microprocessor that integrates a bespoke multiply-accumulate co-processor with fixed coefficients to maximize energy efficiency and minimize latency. Our approach formulates a constrained programming problem to jointly determine co-processor constants and optimally map Multi-Layer Perceptron (MLP) inference operations, enabling compact, model-specific hardware by leveraging the low fabrication and non-recurring engineering costs of flexible technologies. Post-layout results demonstrate near-real-time performance across several healthcare datasets, with our circuits operating within the power budget of existing flexible batteries and occupying only 2.42 mm^2, offering a promising path toward accessible, sustainable, and conformable healthcare wearables. Our microprocessors achieve an average 2.35x speedup and 2.15x lower energy consumption compared to the state of the art.
[LG-105] Are Time-Indexed Foundation Models the Future of Time Series Imputation?
Abstract:Foundation models for time series imputation remain largely unexplored. Recently, two such models, TabPFN-TS and MoTM, have emerged. These models share a common philosophy that places them within the family of time-indexed foundation models. This paper presents the first large-scale empirical study of these models for zero-shot imputation, which enables missing value recovery without retraining across a wide range of scenarios. We conduct extensive univariate experiments across 33 out-of-domain datasets (approximately 1.3M imputation windows) and evaluate their ability to integrate covariates at inference time to improve accuracy without fine-tuning. Our results demonstrate that time-indexed foundation models are a powerful and practical step toward achieving general-purpose, zero-shot imputation for real-world time series.
[LG-106] Explainable Deep Learning-based Classification of Wolff-Parkinson-White Electrocardiographic Signals
[LG-107] Next-Latent Prediction Transformers Learn Compact World Models MICRO
链接: https://arxiv.org/abs/2511.05963 作者: Jayden Teoh,Manan Tomar,Kwangjun Ahn,Edward S. Hu,Pratyusha Sharma,Riashat Islam,Alex Lamb,John Langford 类目: Machine Learning (cs.LG)
*备注: Preprint by Microsoft Research
点击查看摘要
Abstract:Transformers replace recurrence with a memory that grows with sequence length and self-attention that enables ad-hoc look ups over past tokens. Consequently, they lack an inherent incentive to compress history into compact latent states with consistent transition rules. This often leads to learning solutions that generalize poorly. We introduce Next-Latent Prediction (NextLat), which extends standard next-token training with self-supervised predictions in the latent space. Specifically, NextLat trains a transformer to learn latent representations that are predictive of its next latent state given the next output token. Theoretically, we show that these latents provably converge to belief states, compressed information of the history necessary to predict the future. This simple auxiliary objective also injects a recurrent inductive bias into transformers, while leaving their architecture, parallel training, and inference unchanged. NextLat effectively encourages the transformer to form compact internal world models with its own belief states and transition dynamics – a crucial property absent in standard next-token prediction transformers. Empirically, across benchmarks targeting core sequence modeling competencies – world modeling, reasoning, planning, and language modeling – NextLat demonstrates significant gains over standard next-token training in downstream accuracy, representation compression, and lookahead planning. NextLat stands as a simple and efficient paradigm for shaping transformer representations toward stronger generalization.
[LG-108] Deep Survival Analysis of Longitudinal EHR Data for Joint Prediction of Hospitalization and Death in COPD Patients
Abstract:We introduce a unified attention-based framework for joint score and density estimation. Framing the problem as a sequence-to-sequence task, we develop a permutation- and affine-equivariant transformer that estimates both the probability density f(x) and its score \nabla_x \log f(x) directly from i.i.d. samples. Unlike traditional score-matching methods that require training a separate model for each distribution, our approach learns a single distribution-agnostic operator that generalizes across densities and sample sizes. The architecture employs cross-attention to connect observed samples with arbitrary query points, enabling generalization beyond the training data, while built-in symmetry constraints ensure equivariance to permutation and affine transformations. Analytically, we show that the attention weights can recover classical kernel density estimation (KDE), and verify it empirically, establishing a principled link between classical KDE and the transformer architecture. Empirically, the model achieves substantially lower error and better scaling than KDE and score-debiased KDE (SD-KDE), while exhibiting better runtime scaling. Together, these results establish transformers as general-purpose, data-adaptive operators for nonparametric density and score estimation.
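The KDE connection stated in the abstract can be made concrete in one dimension: softmax attention with scores -(q - x_i)^2 / (2h^2) reproduces exactly the normalized Gaussian kernel weights underlying classical KDE. This sketch is our illustration of that link, not the paper's transformer architecture.

```python
import math

def gaussian_kde(query, samples, h):
    """Classical Gaussian kernel density estimate at a query point."""
    n = len(samples)
    k = [math.exp(-(query - x) ** 2 / (2 * h * h)) for x in samples]
    return sum(k) / (n * h * math.sqrt(2 * math.pi))

def attention_weights(query, samples, h):
    """Softmax attention with scores -(q - x_i)^2 / (2h^2): these are
    identical to the normalized Gaussian kernel weights used in KDE
    (and in the Nadaraya-Watson estimator)."""
    scores = [-(query - x) ** 2 / (2 * h * h) for x in samples]
    m = max(scores)  # subtract the max for numerical stability
    e = [math.exp(s - m) for s in scores]
    z = sum(e)
    return [ei / z for ei in e]
```

The attention weights are the KDE kernel values divided by their sum, so an attention layer can express the same local averaging that KDE performs, with the bandwidth playing the role of an inverse attention temperature.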
[LG-110] FusionLog: Cross-System Log-based Anomaly Detection via Fusion of General and Proprietary Knowledge
Abstract:An appropriate distance metric is crucial for categorical data clustering, as distances between categorical data cannot be calculated directly. Moreover, the distances between attribute values usually vary across clusters because of the clusters' different attribute distributions; existing metrics do not take this into account, leading to unreasonable distance measurements. Therefore, we propose a cluster-customized distance metric for categorical data clustering, which can competitively update distances based on the distribution of each attribute in each cluster. In addition, we extend the proposed distance metric to mixed data containing both numerical and categorical attributes. Experiments demonstrate the efficacy of the proposed method, which achieves an average rank of approximately first across fourteen datasets. The source code is available at this https URL
[LG-112] AiEDA: An Open-Source AI-Aided Design Library for Design-to-Vector
Abstract:Recent research has demonstrated that artificial intelligence (AI) can assist electronic design automation (EDA) in improving both the quality and efficiency of chip design. However, current AI-for-EDA (AI-EDA) infrastructures remain fragmented, lacking comprehensive solutions for the entire data pipeline from design execution to AI integration. Key challenges include fragmented flow engines that generate raw data, heterogeneous file formats for data exchange, non-standardized data extraction methods, and poorly organized data storage. This work introduces a unified open-source library for EDA (AiEDA) that addresses these issues. AiEDA integrates multiple design-to-vector data representation techniques that transform diverse chip design data into universal multi-level vector representations, establishing an AI-aided design (AAD) paradigm optimized for AI-EDA workflows. AiEDA provides complete physical design flows with programmatic data extraction and standardized Python interfaces bridging EDA datasets and AI frameworks. Leveraging the AiEDA library, we generate iDATA, a 600GB dataset of structured data derived from 50 real chip designs (28nm), and validate its effectiveness through seven representative AAD tasks spanning prediction, generation, optimization and analysis. The code is publicly available at this https URL, while the full iDATA dataset is being prepared for public release, providing a foundation for future AI-EDA research.
[LG-113] Catching Contamination Before Generation: Spectral Kill Switches for Agents
链接: https://arxiv.org/abs/2511.05804 作者: Valentin Noël 类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY); Machine Learning (stat.ML)
*备注: Preprint under review (2025). 9 pages, 2 figures. Code and scripts: to be released
点击查看摘要
Abstract:Agentic language models compose multi step reasoning chains, yet intermediate steps can be corrupted by inconsistent context, retrieval errors, or adversarial inputs, which makes post hoc evaluation too late because errors propagate before detection. We introduce a diagnostic that requires no additional training and uses only the forward pass to emit a binary accept or reject signal during agent execution. The method analyzes token graphs induced by attention and computes two spectral statistics in early layers, namely the high frequency energy ratio and spectral entropy. We formalize these signals, establish invariances, and provide finite sample estimators with uncertainty quantification. Under a two regime mixture assumption with a monotone likelihood ratio property, we show that a single threshold on the high frequency energy ratio is optimal in the Bayes sense for detecting context inconsistency. Empirically, the high frequency energy ratio exhibits robust bimodality during context verification across multiple model families, which enables gating decisions with overhead below one millisecond on our hardware and configurations. We demonstrate integration into retrieval augmented agent pipelines and discuss deployment as an inline safety monitor. The approach detects contamination while the model is still processing the text, before errors commit to the reasoning chain.
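The two statistics named above can be illustrated on a single attention matrix: symmetrize it into a token graph, take the normalized graph Laplacian, and read the high-frequency energy ratio and spectral entropy off the graph Fourier coefficients of a node signal. The choice of node signal (attention mass received per token) and the frequency cutoff are our assumptions for this sketch.

```python
import numpy as np

def spectral_stats(attn, hf_frac=0.5):
    """Gating statistics over the token graph induced by an attention
    matrix `attn` (rows sum to 1). Returns (high-frequency energy ratio,
    spectral entropy) of the graph Fourier spectrum."""
    A = 0.5 * (attn + attn.T)                 # symmetrize into a token graph
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt  # normalized Laplacian
    evals, evecs = np.linalg.eigh(L)          # ascending graph frequencies
    signal = attn.sum(axis=0)                 # attention mass per token
    coeffs = evecs.T @ signal                 # graph Fourier transform
    energy = coeffs ** 2
    total = energy.sum()
    k = int(len(energy) * (1 - hf_frac))
    hf_ratio = energy[k:].sum() / total       # energy in the top frequencies
    p = energy / total
    p = p[p > 0]
    entropy = -(p * np.log(p)).sum()          # spectral entropy
    return hf_ratio, entropy
```

A single threshold on `hf_ratio` then yields the binary accept/reject signal; for a smooth, consistent context the energy concentrates in low frequencies and both statistics stay near zero.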
[LG-114] An Efficient Gradient-Aware Error-Bounded Lossy Compressor for Federated Learning
Abstract:Federated learning (FL) enables collaborative model training without exposing clients’ private data, but its deployment is often constrained by the communication cost of transmitting gradients between clients and the central server, especially under system heterogeneity where low-bandwidth clients bottleneck overall performance. Lossy compression of gradient data can mitigate this overhead, and error-bounded lossy compression (EBLC) is particularly appealing for its fine-grained utility-compression tradeoff. However, existing EBLC methods (e.g., SZ), originally designed for smooth scientific data with strong spatial locality, rely on generic predictors such as Lorenzo and interpolation for entropy reduction to improve compression ratio. Gradient tensors, in contrast, exhibit low smoothness and weak spatial correlation, rendering these predictors ineffective and leading to poor compression ratios. To address this limitation, we propose an EBLC framework tailored for FL gradient data to achieve high compression ratios while preserving model accuracy. The core of it is an innovative prediction mechanism that exploits temporal correlations across FL training rounds and structural regularities within convolutional kernels to reduce residual entropy. The predictor is compatible with standard quantizers and entropy coders and comprises (1) a cross-round magnitude predictor based on a normalized exponential moving average, and (2) a sign predictor that leverages gradient oscillation and kernel-level sign consistency. Experiments show that this new EBLC yields up to 1.53x higher compression ratios than SZ3 with lower accuracy loss. Integrated into a real-world FL framework, APPFL, it reduces end-to-end communication time by 76.1%-96.2% under various constrained-bandwidth scenarios, demonstrating strong scalability for real-world FL deployments.
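The two predictor components, the normalized-EMA magnitude predictor and the oscillation-aware sign predictor, can be sketched as below. The smoothing factor and the exact flip rule are illustrative assumptions; in the paper the residual between the prediction and the actual gradient is what gets quantized and entropy-coded.

```python
def ema_magnitude_predictor(history, beta=0.9):
    """Cross-round magnitude prediction via a normalized exponential
    moving average of past gradient magnitudes. The normalization term
    corrects the EMA's startup bias over the first few rounds."""
    ema, norm = 0.0, 0.0
    for g in history:
        ema = beta * ema + (1 - beta) * abs(g)
        norm = beta * norm + (1 - beta)
    return ema / norm if norm else 0.0

def predict_gradient(history):
    """Predict the next gradient value from at least two past rounds:
    magnitude from the normalized EMA, sign from recent sign behavior."""
    mag = ema_magnitude_predictor(history)
    s_prev, s_prev2 = history[-1], history[-2]
    sign = 1.0 if s_prev > 0 else -1.0
    if (s_prev > 0) != (s_prev2 > 0):
        sign = -sign  # oscillating gradient: expect another sign flip
    return sign * mag
```

A good predictor makes the residuals small and low-entropy, which is what lets a standard error-bounded quantizer and entropy coder reach higher compression ratios.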
[LG-115] Primal-Only Actor Critic Algorithm for Robust Constrained Average Cost MDPs
[LG-116] Zero-Shot Function Encoder-Based Differentiable Predictive Control
链接: https://arxiv.org/abs/2511.05757 作者: Hassan Iqbal,Xingjian Li,Tyler Ingebrand,Adam Thorpe,Krishna Kumar,Ufuk Topcu,Ján Drgoňa 类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We introduce a differentiable framework for zero-shot adaptive control over parametric families of nonlinear dynamical systems. Our approach integrates a function encoder-based neural ODE (FE-NODE) for modeling system dynamics with a differentiable predictive control (DPC) for offline self-supervised learning of explicit control policies. The FE-NODE captures nonlinear behaviors in state transitions and enables zero-shot adaptation to new systems without retraining, while the DPC efficiently learns control policies across system parameterizations, thus eliminating costly online optimization common in classical model predictive control. We demonstrate the efficiency, accuracy, and online adaptability of the proposed method across a range of nonlinear systems with varying parametric scenarios, highlighting its potential as a general-purpose tool for fast zero-shot adaptive control.
[LG-117] Near-Exponential Savings for Mean Estimation with Active Learning NEURIPS2025
链接: https://arxiv.org/abs/2511.05736 作者: Julian M. Morimoto,Jacob Goldin,Daniel E. Ho 类目: Machine Learning (cs.LG)
*备注: Accepted to the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
点击查看摘要
[LG-118] QiVC-Net: Quantum-Inspired Variational Convolutional Network with Application to Biosignal Classification
Abstract:This work introduces the quantum-inspired variational convolution (QiVC) framework, a novel learning paradigm that integrates principles of probabilistic inference, variational optimization, and quantum-inspired transformations within convolutional architectures. The central innovation of QiVC lies in its quantum-inspired rotated ensemble (QiRE) mechanism. QiRE performs differentiable low-dimensional subspace rotations of convolutional weights, analogously to quantum state evolution. This approach enables structured uncertainty modeling while preserving the intrinsic geometry of the parameter space, resulting in more expressive, stable, and uncertainty-aware representations. To demonstrate its practical potential, the concept is instantiated in a QiVC-based convolutional network (QiVC-Net) and evaluated in the context of biosignal classification, focusing on phonocardiogram (PCG) recordings, a challenging domain characterized by high noise, inter-subject variability, and often imbalanced data. The proposed QiVC-Net integrates an architecture in which the QiVC layer does not introduce additional parameters, instead performing an ensemble rotation of the convolutional weights through a structured mechanism, ensuring robustness without adding significant computational burden. Experiments on two benchmark datasets, PhysioNet CinC 2016 and PhysioNet CirCor DigiScope 2022, show that QiVC-Net achieves state-of-the-art performance, reaching accuracies of 97.84% and 97.89%, respectively. These findings highlight the versatility of the QiVC framework and its promise for advancing uncertainty-aware modeling in real-world biomedical signal analysis. The implementation of the QiVConv layer is openly available on GitHub.
[LG-119] GastroDL-Fusion: A Dual-Modal Deep Learning Framework Integrating Protein-Ligand Complexes and Gene Sequences for Gastrointestinal Disease Drug Discovery
Abstract:Accurate prediction of protein-ligand binding affinity plays a pivotal role in accelerating the discovery of novel drugs and vaccines, particularly for gastrointestinal (GI) diseases such as gastric ulcers, Crohn’s disease, and ulcerative colitis. Traditional computational models often rely on structural information alone and thus fail to capture the genetic determinants that influence disease mechanisms and therapeutic responses. To address this gap, we propose GastroDL-Fusion, a dual-modal deep learning framework that integrates protein-ligand complex data with disease-associated gene sequence information for drug and vaccine development. In our approach, protein-ligand complexes are represented as molecular graphs and modeled using a Graph Isomorphism Network (GIN), while gene sequences are encoded into biologically meaningful embeddings via a pre-trained Transformer (ProtBERT/ESM). These complementary modalities are fused through a multi-layer perceptron to enable robust cross-modal interaction learning. We evaluate the model on benchmark datasets of GI disease-related targets, demonstrating that GastroDL-Fusion significantly improves predictive performance over conventional methods. Specifically, the model achieves a mean absolute error (MAE) of 1.12 and a root mean square error (RMSE) of 1.75, outperforming CNN, BiLSTM, GIN, and Transformer-only baselines. These results confirm that incorporating both structural and genetic features yields more accurate predictions of binding affinities, providing a reliable computational tool for accelerating the design of targeted therapies and vaccines in the context of gastrointestinal diseases.
Abstract:We consider the problem of distributionally robust multimodal machine learning. Existing approaches often rely on merging modalities at the feature level (early fusion) or heuristic uncertainty modeling, which downplays modality-aware effects and provides limited insight. We propose a novel distributionally robust optimization (DRO) framework that aims to provide both theoretical and practical insights into multimodal machine learning. We first justify this setup and show the significance of this problem through complexity analysis. We then establish both generalization upper bounds and minimax lower bounds, which provide performance guarantees. These results are further extended to settings where we consider encoder-specific error propagations. Empirically, we demonstrate that our approach improves robustness in both simulation settings and real-world datasets. Together, these findings provide a principled foundation for employing multimodal machine learning models in high-stakes applications where uncertainty is unavoidable.
[LG-121] AI-assisted workflow enables rapid high-fidelity breast cancer clinical trial eligibility prescreening
链接: https://arxiv.org/abs/2511.05696 作者: Jacob T. Rosenthal,Emma Hahesy,Sulov Chalise,Menglei Zhu,Mert R. Sabuncu,Lior Z. Braunstein,Anyi Li 类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Clinical trials play an important role in cancer care and research, yet participation rates remain low. We developed MSK-MATCH (Memorial Sloan Kettering Multi-Agent Trial Coordination Hub), an AI system for automated eligibility screening from clinical text. MSK-MATCH integrates a large language model with a curated oncology trial knowledge base and retrieval-augmented architecture providing explanations for all AI predictions grounded in source text. In a retrospective dataset of 88,518 clinical documents from 731 patients across six breast cancer trials, MSK-MATCH automatically resolved 61.9% of cases and triaged 38.1% for human review. This AI-assisted workflow achieved 98.6% accuracy, 98.4% sensitivity, and 98.7% specificity for patient-level eligibility classification, matching or exceeding the performance of the human-only and AI-only comparisons. For the triaged cases requiring manual review, prepopulating eligibility screens with AI-generated explanations reduced screening time from 20 minutes to 43 seconds at an average cost of $0.96 per patient-trial pair.
Abstract:A central challenge in reinforcement learning is that policies trained in controlled environments often fail under distribution shifts at deployment into real-world environments. Distributionally Robust Reinforcement Learning (DRRL) addresses this by optimizing for worst-case performance within an uncertainty set defined by a robustness budget \epsilon . However, fixing \epsilon results in a tradeoff between performance and robustness: small values yield high nominal performance but weak robustness, while large values can result in instability and overly conservative policies. We propose Distributionally Robust Self-Paced Curriculum Reinforcement Learning (DR-SPCRL), a method that overcomes this limitation by treating \epsilon as a continuous curriculum. DR-SPCRL adaptively schedules the robustness budget according to the agent’s progress, enabling a balance between nominal and robust performance. Empirical results across multiple environments demonstrate that DR-SPCRL not only stabilizes training but also achieves a superior robustness-performance trade-off, yielding an average 11.8% increase in episodic return under varying perturbations compared to fixed or heuristic scheduling strategies, and achieving approximately 1.9 \times the performance of the corresponding nominal RL algorithms.
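The self-paced scheduling idea above can be sketched as a simple update rule that grows the robustness budget while the agent copes with the current budget and shrinks it when performance drops. The fixed return threshold and step size are illustrative assumptions; the paper couples the schedule to the DRRL objective rather than a hard threshold.

```python
def schedule_epsilon(eps, episodic_return, target_return, eps_max, step=0.01):
    """Self-paced robustness-budget curriculum: one scheduling step.
    If the agent meets the target return under the current budget,
    make the uncertainty set harder (increase eps, capped at eps_max);
    otherwise ease off so training stays stable."""
    if episodic_return >= target_return:
        return min(eps_max, eps + step)
    return max(0.0, eps - step)
```

Run once per evaluation window, this traces a curriculum from near-nominal training (small eps) toward the full robustness budget, instead of fixing eps for the whole run.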
[LG-123] KLASS: KL-Guided Fast Inference in Masked Diffusion Models NEURIPS2025
Abstract:Masked diffusion models have demonstrated competitive results on various tasks including language generation. However, due to its iterative refinement process, the inference is often bottlenecked by slow and static sampling speed. To overcome this problem, we introduce ‘KL-Adaptive Stability Sampling’ (KLASS), a fast yet effective sampling method that exploits token-level KL divergence to identify stable, high-confidence predictions. By unmasking multiple tokens in each iteration without any additional model training, our approach speeds up generation significantly while maintaining sample quality. On reasoning benchmarks, KLASS achieves up to 2.78\times wall-clock speedups while improving performance over standard greedy decoding, attaining state-of-the-art results among diffusion-based samplers. We further validate KLASS across diverse domains, including text, image, and molecular generation, showing its effectiveness as a broadly applicable sampler across different models.
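The token-level selection rule can be sketched as follows: compare each position's predictive distribution across two refinement iterations and treat positions whose distribution barely moved (low KL) as stable enough to unmask together. The list-of-lists representation and the fixed threshold are simplifying assumptions; the sketch also assumes all distributions have full support.

```python
import math

def kl(p, q):
    """KL divergence D(p || q) between two discrete distributions
    (assumes q is strictly positive wherever p is)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def stable_tokens(prev_probs, curr_probs, threshold):
    """Select masked positions whose predictive distribution barely
    changed between refinement iterations: low token-level KL marks
    them as stable, so they can all be unmasked in one step."""
    return [i for i, (p, q) in enumerate(zip(curr_probs, prev_probs))
            if kl(p, q) < threshold]
```

Unmasking every stable position per iteration, instead of one token at a time, is where the wall-clock speedup comes from; no retraining is needed because the rule only reads the model's existing predictions.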
[LG-124] Blind Inverse Game Theory: Jointly Decoding Rewards and Rationality in Entropy-Regularized Competitive Games
链接: https://arxiv.org/abs/2511.05640 作者: Hamza Virk,Sandro Amaglobeli,Zuhayr Syed 类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Inverse Game Theory (IGT) methods based on the entropy-regularized Quantal Response Equilibrium (QRE) offer a tractable approach for competitive settings, but critically assume the agents’ rationality parameter (temperature \tau) is known a priori. When \tau is unknown, a fundamental scale ambiguity emerges that couples \tau with the reward parameters (\theta), making them statistically unidentifiable. We introduce Blind-IGT, the first statistical framework to jointly recover both \theta and \tau from observed behavior. We analyze this bilinear inverse problem and establish necessary and sufficient conditions for unique identification by introducing a normalization constraint that resolves the scale ambiguity. We propose an efficient Normalized Least Squares (NLS) estimator and prove it achieves the optimal \mathcal{O}(N^{-1/2}) convergence rate for joint parameter recovery. When strong identifiability conditions fail, we provide partial identification guarantees through confidence set construction. We extend our framework to Markov games and demonstrate optimal convergence rates with strong empirical performance even when transition dynamics are unknown.
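A minimal sketch of the NLS idea, under the simplifying assumption that the observed statistics are linear in the ratio \theta / \tau: fit that ratio by least squares, then split it into a unit-norm reward vector and a temperature via the normalization constraint. The linear observation model and the choice of unit Euclidean norm are our assumptions for illustration.

```python
import numpy as np

def normalized_least_squares(Phi, y):
    """Observed behavior identifies only the ratio theta / tau, so we
    estimate that ratio by ordinary least squares and then impose the
    normalization ||theta|| = 1 to resolve the scale ambiguity."""
    ratio, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # estimates theta / tau
    scale = np.linalg.norm(ratio)
    theta = ratio / scale   # unit-norm reward parameters
    tau = 1.0 / scale       # recovered rationality temperature
    return theta, tau
```

Without the normalization, any rescaling (c * theta, c * tau) explains the data equally well; fixing ||theta|| = 1 picks a unique representative and makes tau identifiable.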
[LG-125] Physics-Guided Machine Learning for Uncertainty Quantification in Turbulence Models NEURIPS2025
链接: https://arxiv.org/abs/2511.05633 作者: Minghan Chu,Weicheng Qian 类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注: Accepted to NeurIPS 2025 Workshop on Machine Learning and the Physical Sciences (ML4PS), non-archival
点击查看摘要
Abstract:Predicting the evolution of turbulent flows is central across science and engineering. Most studies rely on simulations with turbulence models, whose empirical simplifications introduce epistemic uncertainty. The Eigenspace Perturbation Method (EPM) is a widely used physics-based approach to quantify model-form uncertainty, but being purely physics-based it can overpredict uncertainty bounds. We propose a convolutional neural network (CNN)-based modulation of EPM perturbation magnitudes to improve calibration while preserving physical consistency. Across canonical cases, the hybrid ML-EPM framework yields substantially tighter, better-calibrated uncertainty estimates than baseline EPM alone.
[LG-126] Fooling Algorithms in Non-Stationary Bandits using Belief Inertia
Abstract:We study the problem of worst case regret in piecewise stationary multi armed bandits. While the minimax theory for stationary bandits is well established, understanding analogous limits in time-varying settings is challenging. Existing lower bounds rely on what we refer to as infrequent sampling arguments, where long intervals without exploration allow adversarial reward changes that induce large regret. In this paper, we introduce a fundamentally different approach based on a belief inertia argument. Our analysis captures how an algorithm’s empirical beliefs, encoded through historical reward averages, create momentum that resists new evidence after a change. We show how this inertia can be exploited to construct adversarial instances that mislead classical algorithms such as Explore Then Commit, epsilon greedy, and UCB, causing them to suffer regret that grows linearly with T and with a substantial constant factor, regardless of how their parameters are tuned, even with a single change point. We extend the analysis to algorithms that periodically restart to handle non stationarity and prove that, even then, the worst case regret remains linear in T. Our results indicate that utilizing belief inertia can be a powerful method for deriving sharp lower bounds in non stationary bandits. Subjects: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML). Cite as: arXiv:2511.05620 [cs.LG], https://doi.org/10.48550/arXiv.2511.05620
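The Explore-Then-Commit failure mode described above can be illustrated with a toy piecewise-stationary bandit: ETC commits during the stationary phase, the adversary flips which arm is best after the commitment, and the per-step regret never decays, so total regret grows linearly in the horizon. The Bernoulli means and the change are illustrative; the paper's constructions exploit belief inertia more generally.

```python
import random

def etc_regret(horizon, explore, change_at, rng):
    """Toy ETC on a two-armed piecewise-stationary Bernoulli bandit.
    ETC pulls each arm `explore` times, commits to the empirically
    better arm, and never re-explores; a mean shift at `change_at`
    then induces constant per-step regret for the rest of the run."""
    means = [0.6, 0.4]
    est = [0.0, 0.0]
    for a in (0, 1):  # exploration phase
        est[a] = sum(rng.random() < means[a] for _ in range(explore)) / explore
    committed = 0 if est[0] >= est[1] else 1
    regret = 0.0
    for t in range(2 * explore, horizon):
        if t == change_at:
            means = [0.4, 0.9]  # adversarial change after commitment
        regret += max(means) - means[committed]
    return regret
```

Doubling the horizon roughly doubles the post-change regret, which is the linear-in-T behavior the lower bounds formalize; restarting variants only shift when the same trap can be sprung.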
[LG-127] FiCABU: A Fisher-Based Context-Adaptive Machine Unlearning Processor for Edge AI DATE2026
链接: https://arxiv.org/abs/2511.05605 作者: Eun-Su Cho,Jongin Choi,Jeongmin Jin,Jae-Jin Lee,Woojoo Lee 类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: 8 pages, 6 figures, 4 tables, DATE 2026 accepted paper
点击查看摘要
Abstract:Machine unlearning, driven by privacy regulations and the “right to be forgotten”, is increasingly needed at the edge, yet server-centric or retraining-heavy methods are impractical under tight computation and energy budgets. We present FiCABU (Fisher-based Context-Adaptive Balanced Unlearning), a software-hardware co-design that brings unlearning to edge AI processors. FiCABU combines (i) Context-Adaptive Unlearning, which begins edits from back-end layers and halts once the target forgetting is reached, with (ii) Balanced Dampening, which scales dampening strength by depth to preserve retain accuracy. These methods are realized in a full RTL design of a RISC-V edge AI processor that integrates two lightweight IPs for Fisher estimation and dampening into a GEMM-centric streaming pipeline, validated on an FPGA prototype and synthesized in 45 nm for power analysis. Across CIFAR-20 and PinsFaceRecognition with ResNet-18 and ViT, FiCABU achieves random-guess forget accuracy while matching the retraining-free Selective Synaptic Dampening (SSD) baseline on retain accuracy, reducing computation by up to 87.52 percent (ResNet-18) and 71.03 percent (ViT). On the INT8 hardware prototype, FiCABU further improves retain preservation and reduces energy to 6.48 percent (CIFAR-20) and 0.13 percent (PinsFaceRecognition) of the SSD baseline. In sum, FiCABU demonstrates that back-end-first, depth-aware unlearning can be made both practical and efficient for resource-constrained edge AI devices.
[LG-128] AutoHood3D: A Multi-Modal Benchmark for Automotive Hood Design and Fluid-Structure Interaction
Abstract:This study presents a new high-fidelity multi-modal dataset containing 16000+ geometric variants of automotive hoods useful for machine learning (ML) applications such as engineering component design and process optimization, and multiphysics system surrogates. The dataset is centered on a practical multiphysics problem-hood deformation from fluid entrapment and inertial loading during rotary-dip painting. Each hood is numerically modeled with a coupled Large-Eddy Simulation (LES)-Finite Element Analysis (FEA), using 1.2M cells in total to ensure spatial and temporal accuracy. The dataset provides time-resolved physical fields, along with STL meshes and structured natural language prompts for text-to-geometry synthesis. Existing datasets are either confined to 2D cases, exhibit limited geometric variations, or lack the multi-modal annotations and data structures - shortcomings we address with AutoHood3D. We validate our numerical methodology, establish quantitative baselines across five neural architectures, and demonstrate systematic surrogate errors in displacement and force predictions. These findings motivate the design of novel approaches and multiphysics loss functions that enforce fluid-solid coupling during model training. By providing fully reproducible workflows, AutoHood3D enables physics-aware ML development, accelerates generative-design iteration, and facilitates the creation of new FSI benchmarks. Dataset and code URLs in Appendix.
[LG-129] Optimizing Predictive Maintenance in Intelligent Manufacturing: An Integrated FNO-DAE-GNN-PPO MDP Framework
Abstract:In the era of smart manufacturing, predictive maintenance (PdM) plays a pivotal role in improving equipment reliability and reducing operating costs. In this paper, we propose a novel Markov Decision Process (MDP) framework that integrates advanced soft computing techniques - Fourier Neural Operator (FNO), Denoising Autoencoder (DAE), Graph Neural Network (GNN), and Proximal Policy Optimisation (PPO) - to address the multidimensional challenges of predictive maintenance in complex manufacturing systems. Specifically, the proposed framework innovatively combines the powerful frequency-domain representation capability of FNOs to capture high-dimensional temporal patterns; DAEs to achieve robust, noise-resistant latent state embedding from complex non-Gaussian sensor data; and GNNs to accurately represent inter-device dependencies for coordinated system-wide maintenance decisions. Furthermore, by exploiting PPO, the framework ensures stable and efficient optimisation of long-term maintenance strategies to effectively handle uncertainty and non-stationary dynamics. Experimental validation demonstrates that the approach significantly outperforms multiple deep learning baseline models with up to 13% cost reduction, as well as strong convergence and inter-module synergy. The framework has considerable industrial potential to effectively reduce downtime and operating expenses through data-driven strategies.
Link: https://arxiv.org/abs/2511.05593 Authors: Arnaud Descours (UCBL), Léonard Deroose, Jan Ramon Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC); Statistics Theory (math.ST)
[LG-131] GRAVER: Generative Graph Vocabularies for Robust Graph Foundation Models Fine-tuning NEURIPS2025
Link: https://arxiv.org/abs/2511.05592 Authors: Haonan Yuan, Qingyun Sun, Junhua Shi, Xingcheng Fu, Bryan Hooi, Jianxin Li, Philip S. Yu Subjects: Machine Learning (cs.LG)
Comments: Accepted by NeurIPS 2025
Abstract:Inspired by the remarkable success of foundation models in language and vision, Graph Foundation Models (GFMs) hold significant promise for broad applicability across diverse graph tasks and domains. However, existing GFMs struggle with unstable few-shot fine-tuning, where both performance and adaptation efficiency exhibit significant fluctuations caused by the randomness in the support sample selection and structural discrepancies between the pre-trained and target graphs. How to fine-tune GFMs robustly and efficiently to enable trustworthy knowledge transfer across domains and tasks is the major challenge. In this paper, we propose GRAVER, a novel Generative gRAph VocabulariEs for Robust GFM fine-tuning framework that tackles the aforementioned instability via generative augmentations. Specifically, to identify transferable units, we analyze and extract key class-specific subgraph patterns by ego-graph disentanglement and validate their transferability both theoretically and empirically. To enable effective pre-training across diverse domains, we leverage a universal task template based on ego-graph similarity and construct graph vocabularies via graphon-based generative experts. To facilitate robust and efficient prompt fine-tuning, we grave the support samples with in-context vocabularies, where the lightweight MoE-CoE network attentively routes knowledge from source domains. Extensive experiments demonstrate the superiority of GRAVER over effectiveness, robustness, and efficiency on downstream few-shot node and graph classification tasks compared with 15 state-of-the-art baselines.
[LG-132] FedSparQ: Adaptive Sparse Quantization with Error Feedback for Robust Efficient Federated Learning
Link: https://arxiv.org/abs/2511.05591 Authors: Chaimaa Medjadji, Sadi Alawadi, Feras M. Awaysheh, Guilain Leduc, Sylvain Kubler, Yves Le Traon Subjects: Machine Learning (cs.LG)
Abstract:Federated Learning (FL) enables collaborative model training across decentralized clients while preserving data privacy by keeping raw data local. However, FL suffers from significant communication overhead due to the frequent exchange of high-dimensional model updates over constrained networks. In this paper, we present FedSparQ, a lightweight compression framework that dynamically sparsifies each client’s gradient through an adaptive threshold, applies half-precision quantization to the retained entries, and integrates residuals from error feedback to prevent loss of information. FedSparQ requires no manual tuning of sparsity rates or quantization schedules, adapts seamlessly to both homogeneous and heterogeneous data distributions, and is agnostic to model architecture. Through extensive empirical evaluation on vision benchmarks under independent and identically distributed (IID) and non-IID data, we show that FedSparQ substantially reduces communication overhead (a 90% reduction in bytes sent compared to FedAvg) while preserving or improving model accuracy (a 6% improvement over the uncompressed FedAvg baseline and over state-of-the-art compression models) and enhancing convergence robustness (by 50% compared to the other baselines). Our approach provides a practical, easy-to-deploy solution for bandwidth-constrained federated deployments and lays the groundwork for future extensions in adaptive precision and privacy-preserving protocols.
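The per-round client-side compression described above can be sketched in a few lines. This is a hedged reconstruction, not the authors' code: the function name and the mean-plus-std threshold rule are our illustrative choices.

```python
import numpy as np

def fedsparq_compress(grad, residual, k_sigma=1.0):
    """One FedSparQ-style client compression round (illustrative sketch).

    Assumptions (ours): the adaptive threshold is mean + k_sigma * std of the
    corrected gradient magnitudes, and half precision is a plain float16 cast."""
    corrected = grad + residual          # fold in last round's error feedback
    mags = np.abs(corrected)
    thresh = mags.mean() + k_sigma * mags.std()
    sparse = np.where(mags >= thresh, corrected, 0.0).astype(np.float16)
    # everything dropped or lost to quantization becomes next round's residual
    new_residual = corrected - sparse.astype(np.float64)
    return sparse, new_residual
```

The returned `sparse` tensor is what goes over the wire; `new_residual` stays on the client so no signal is permanently lost, which is what keeps accuracy stable under aggressive sparsification.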
[LG-133] Prompting Neural-Guided Equation Discovery Based on Residuals
Abstract:Neural-guided equation discovery systems use a data set as prompt and predict an equation that describes the data set without extensive search. However, if the equation does not meet the user’s expectations, there are few options for obtaining other equation suggestions without intensive work with the system. To fill this gap, we propose Residuals for Equation Discovery (RED), a post-processing method that improves a given equation in a targeted manner, based on its residuals. By parsing the initial equation into a syntax tree, we can use node-based calculation rules to compute the residual for each subequation of the initial equation. It is then possible to use this residual as a new target variable in the original data set and generate a new prompt. If, with the new prompt, the equation discovery system suggests a subequation that outperforms the old subequation on a validation set, we replace the latter with the former. RED is usable with any equation discovery system, is fast to calculate, and is easy to extend with new mathematical operations. In experiments on 53 equations from the Feynman benchmark, we show that it improves not only all tested neural-guided systems, but also all tested classical genetic programming systems.
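The node-based residual rules are straightforward for the basic operators. A minimal sketch (function names are ours, not the paper's API): if the current suggestion is y ≈ g(x) + h(x) and we want a better h, the '+' rule makes y − g(x) the new target; for y ≈ g(x) · h(x) it is y / g(x).

```python
import numpy as np

def residual_target_add(y, other_branch):
    """RED-style residual for a '+' node: target minus the kept sibling."""
    return y - other_branch

def residual_target_mul(y, other_branch, eps=1e-12):
    """Residual for a '*' node: target divided by the kept sibling."""
    return y / (other_branch + eps)

x = np.linspace(0.1, 2.0, 50)
y = np.sin(x) + x**2            # data generated by the true equation
g = np.sin(x)                    # subequation kept from the first suggestion
new_target = residual_target_add(y, g)  # prompt target for the '+' sibling
```

Regressing `new_target` on x isolates the missing subequation (here x**2), so the second prompt is a much easier discovery problem; per the abstract, the replacement is only accepted if it improves validation error.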
[LG-134] Depth-induced NTK: Bridging Over-parameterized Neural Networks and Deep Neural Kernels
Abstract:While deep learning has achieved remarkable success across a wide range of applications, its theoretical understanding of representation learning remains limited. Deep neural kernels provide a principled framework to interpret over-parameterized neural networks by mapping hierarchical feature transformations into kernel spaces, thereby combining the expressive power of deep architectures with the analytical tractability of kernel methods. Recent advances, particularly neural tangent kernels (NTKs) derived by gradient inner products, have established connections between infinitely wide neural networks and nonparametric Bayesian inference. However, the existing NTK paradigm has been predominantly confined to the infinite-width regime, while overlooking the representational role of network depth. To address this gap, we propose a depth-induced NTK kernel based on a shortcut-related architecture, which converges to a Gaussian process as the network depth approaches infinity. We theoretically analyze the training invariance and spectrum properties of the proposed kernel, which stabilizes the kernel dynamics and mitigates degeneration. Experimental results further underscore the effectiveness of our proposed method. Our findings significantly extend the existing landscape of the neural kernel theory and provide an in-depth understanding of deep learning and the scaling law.
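For context, the gradient-inner-product construction of the NTK referenced above is, for a network f_\theta with parameters \theta,

```latex
\Theta(x, x') \;=\; \big\langle \nabla_\theta f_\theta(x),\; \nabla_\theta f_\theta(x') \big\rangle .
```

In the classical infinite-width regime this kernel becomes deterministic and (nearly) constant during training; the depth-induced variant proposed here instead sends the depth to infinity along a shortcut-based architecture.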
[LG-135] Distillation-Accelerated Uncertainty Modeling for Multi-Objective RTA Interception
Link: https://arxiv.org/abs/2511.05582 Authors: Gaoxiang Zhao, Ruina Qiu, Pengpeng Zhao, Rongjin Wang, Zhangang Lin, Xiaoqiang Wang Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Abstract:Real-Time Auction (RTA) Interception aims to filter out invalid or irrelevant traffic to enhance the integrity and reliability of downstream data. However, two key challenges remain: (i) the need for accurate estimation of traffic quality together with sufficiently high confidence in the model’s predictions, typically addressed through uncertainty modeling, and (ii) the efficiency bottlenecks that such uncertainty modeling introduces in real-time applications due to repeated inference. To address these challenges, we propose DAUM, a joint modeling framework that integrates multi-objective learning with uncertainty modeling, yielding both traffic quality predictions and reliable confidence estimates. Building on DAUM, we further apply knowledge distillation to reduce the computational overhead of uncertainty modeling, while largely preserving predictive accuracy and retaining the benefits of uncertainty estimation. Experiments on the JD advertisement dataset demonstrate that DAUM consistently improves predictive performance, with the distilled model delivering a tenfold increase in inference speed.
[LG-136] Data-driven jet fuel demand forecasting: A case study of Copenhagen Airport
Abstract:Accurate forecasting of jet fuel demand is crucial for optimizing supply chain operations in the aviation market. Fuel distributors specifically require precise estimates to avoid inventory shortages or excesses. However, there is a lack of studies that analyze the jet fuel demand forecasting problem using machine learning models. Instead, many industry practitioners rely on deterministic or expertise-based models. In this research, we evaluate the performance of data-driven approaches using a substantial amount of data obtained from a major aviation fuel distributor in the Danish market. Our analysis compares the predictive capabilities of traditional time series models, Prophet, LSTM sequence-to-sequence neural networks, and hybrid models. A key challenge in developing these models is the required forecasting horizon, as fuel demand needs to be predicted for the next 30 days to optimize sourcing strategies. To ensure the reliability of the data-driven approaches and provide valuable insights to practitioners, we analyze three different datasets. The primary objective of this study is to present a comprehensive case study on jet fuel demand forecasting, demonstrating the advantages of employing data-driven models and highlighting the impact of incorporating additional variables in the predictive models.
[LG-137] Daily Forecasting for Annual Time Series Datasets Using Similarity-Based Machine Learning Methods: A Case Study in the Energy Market
Abstract:The policy environment of countries changes rapidly, influencing macro-level indicators such as the Energy Security Index. However, this index is only reported annually, limiting its responsiveness to short-term fluctuations. To address this gap, the present study introduces a daily proxy for the Energy Security Index and applies it to forecast energy security at a daily frequency. The study employs a two-stage approach: first, a suitable daily proxy for the annual Energy Security Index is identified by applying six time-series similarity measures to key energy-related variables; second, the selected proxy is modeled using the XGBoost algorithm to generate 15-day-ahead forecasts, enabling high-frequency monitoring of energy security. In the proxy-selection stage, Brent volume consistently emerged as the most suitable proxy across the majority of methods. The model demonstrated strong performance, with an R-squared of 0.981 on the training set and 0.945 on the test set, and acceptable error metrics. The 15-day forecast of Brent volume indicates short-term fluctuations, with a peak around day 4, a decline until day 8, a rise near day 10, and a downward trend toward day 15, accompanied by prediction intervals. By integrating time-series similarity measures with machine-learning-based forecasting, this study provides a novel framework for converting low-frequency macroeconomic indicators into high-frequency, actionable signals. The approach enables real-time monitoring of the Energy Security Index, offering policymakers and analysts a scalable and practical tool to respond more rapidly to fast-changing policy and market conditions, especially in data-scarce environments.
[LG-138] EEG Seizure Detection with a Sparse Hyperdimensional Computing Accelerator MICRO
Link: https://arxiv.org/abs/2511.05503 Authors: Stef Cuyckens, Ryan Antonio, Chao Fang, Marian Verhelst Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: To appear at the 20th International Conference on PhD Research in Microelectronics and Electronics (PRIME 2025)
Abstract:Implantable devices for reliable intracranial electroencephalography (iEEG) require efficient, accurate, and real-time detection of seizures. Dense hyperdimensional computing (HDC) proves to be more efficient than neural networks; however, it still consumes considerable switching power for an ultra-low-energy application. Sparse HDC, on the other hand, has the potential to further reduce energy consumption, yet at the expense of having to support more complex operations and introducing an extra hyperparameter, the maximum hypervector density. To improve the energy and area efficiency of the sparse HDC operations, this work introduces the compressed item memory (CompIM) and simplifies the spatial bundling. We also analyze how a proper hyperparameter choice improves the detection delay compared to dense HDC. Ultimately, our optimizations achieve a 1.73x more energy- and 2.20x more area-efficient hardware design than the naive sparse implementation. We are also 7.50x more energy- and 3.24x more area-efficient than the dense HDC implementation. This work highlights the hardware advantages of sparse HDC, demonstrating its potential to enable smaller brain implants with a substantially extended battery life compared to the current state-of-the-art.
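For readers new to sparse HDC, the density hyperparameter and the bundling step look roughly like this. This is a generic software sketch with names of our own; it does not implement CompIM or the paper's simplified spatial bundling.

```python
import numpy as np

def random_sparse_hv(dim=4096, density=0.02, seed=0):
    """Random sparse binary hypervector with a fixed fraction of active bits.
    `density` is the extra hyperparameter the abstract mentions."""
    rng = np.random.default_rng(seed)
    hv = np.zeros(dim, dtype=np.uint8)
    hv[rng.choice(dim, size=int(dim * density), replace=False)] = 1
    return hv

def bundle(hvs, max_density=0.05):
    """Bundle by summing and keeping only the strongest bits, capping the
    result at the maximum hypervector density (a sketch, not CompIM)."""
    s = np.sum(np.stack(hvs), axis=0)
    k = int(s.size * max_density)
    out = np.zeros(s.size, dtype=np.uint8)
    top = np.argsort(s)[-k:]          # highest-count positions
    out[top[s[top] > 0]] = 1          # keep only genuinely active bits
    return out

def overlap(a, b):
    """Similarity for sparse binary vectors: number of shared active bits."""
    return int(np.sum(a & b))
```

Because operands are mostly zero, hardware only has to touch the active positions, which is where the energy savings over dense (bipolar) HDC come from.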
[LG-139] Socially Aware Music Recommendation: A Multi-Modal Graph Neural Networks for Collaborative Music Consumption and Community-Based Engagement
Abstract:This study presents a novel Multi-Modal Graph Neural Network (MM-GNN) framework for socially aware music recommendation, designed to enhance personalization and foster community-based engagement. The proposed model introduces a fusion-free deep mutual learning strategy that aligns modality-specific representations from lyrics, audio, and visual data while maintaining robustness against missing modalities. A heterogeneous graph structure is constructed to capture both user-song interactions and user-user social relationships, enabling the integration of individual preferences with social influence. Furthermore, emotion-aware embeddings derived from acoustic and textual signals contribute to emotionally aligned recommendations. Experimental evaluations on benchmark datasets demonstrate that MM-GNN significantly outperforms existing state-of-the-art methods across various performance metrics. Ablation studies further validate the critical impact of each model component, confirming the effectiveness of the framework in delivering accurate and socially contextualized music recommendations.
[LG-140] Solving bilevel optimization via sequential minimax optimization
Link: https://arxiv.org/abs/2511.07398 Authors: Zhaosong Lu, Sanyou Mei Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Comments: Accepted by Mathematics of Operations Research
Abstract:In this paper we propose a sequential minimax optimization (SMO) method for solving a class of constrained bilevel optimization problems in which the lower-level part is a possibly nonsmooth convex optimization problem, while the upper-level part is a possibly nonconvex optimization problem. Specifically, SMO applies a first-order method to solve a sequence of minimax subproblems, which are obtained by employing a hybrid of modified augmented Lagrangian and penalty schemes on the bilevel optimization problems. Under suitable assumptions, we establish operation complexities of O(\varepsilon^{-7}\log\varepsilon^{-1}) and O(\varepsilon^{-6}\log\varepsilon^{-1}), measured in terms of fundamental operations, for SMO in finding an \varepsilon-KKT solution of the bilevel optimization problems with merely convex and strongly convex lower-level objective functions, respectively. The latter result improves the previous best-known operation complexity by a factor of \varepsilon^{-1}. Preliminary numerical results demonstrate significantly superior computational performance compared to the recently developed first-order penalty method.
[LG-141] Walsh-Hadamard Neural Operators for Solving PDEs with Discontinuous Coefficients
Abstract:Neural operators have emerged as powerful tools for learning solution operators of partial differential equations (PDEs). However, standard spectral methods based on Fourier transforms struggle with problems involving discontinuous coefficients due to the Gibbs phenomenon and poor representation of sharp interfaces. We introduce the Walsh-Hadamard Neural Operator (WHNO), which leverages Walsh-Hadamard transforms-a spectral basis of rectangular wave functions naturally suited for piecewise constant fields-combined with learnable spectral weights that transform low-sequency Walsh coefficients to capture global dependencies efficiently. We validate WHNO on three problems: steady-state Darcy flow (preliminary validation), heat conduction with discontinuous thermal conductivity, and the 2D Burgers equation with discontinuous initial conditions. In controlled comparisons with Fourier Neural Operators (FNO) under identical conditions, WHNO demonstrates superior accuracy with better preservation of sharp solution features at material interfaces. Critically, we discover that weighted ensemble combinations of WHNO and FNO achieve substantial improvements over either model alone: for both heat conduction and Burgers equation, optimal ensembles reduce mean squared error by 35-40 percent and maximum error by up to 25 percent compared to individual models. This demonstrates that Walsh-Hadamard and Fourier representations capture complementary aspects of discontinuous PDE solutions, with WHNO excelling at sharp interfaces while FNO captures smooth features effectively.
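The transform WHNO builds on is cheap: the standard in-place butterfly computes the (unnormalized, Sylvester-ordered) Walsh-Hadamard transform in O(n log n). The sketch below also shows why the basis suits piecewise-constant fields: a step signal has only two nonzero Walsh coefficients, whereas its truncated Fourier series oscillates (Gibbs). The operator's learnable spectral weights are not sketched here.

```python
import numpy as np

def fwht(a):
    """Fast Walsh-Hadamard transform (unnormalized), standard butterfly.
    Applying it twice and dividing by n recovers the input."""
    a = np.asarray(a, dtype=float).copy()
    h = 1
    while h < a.size:
        for i in range(0, a.size, h * 2):
            for j in range(i, i + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y
        h *= 2
    return a

step = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])  # sharp interface
coeffs = fwht(step)  # only two nonzero Walsh coefficients
```

A WHNO-style layer would keep the low-sequency coefficients, multiply them by learnable weights, and invert the transform, mirroring how FNO filters Fourier modes.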
[LG-142] De-Individualizing fMRI Signals via Mahalanobis Whitening and Bures Geometry
Abstract:Functional connectivity has been widely investigated to understand brain disease in clinical studies and imaging-based neuroscience, and analyzing changes in functional connectivity has proven to be valuable for understanding and computationally evaluating the effects on brain function caused by diseases or experimental stimuli. By using Mahalanobis data whitening prior to the use of dimensionality reduction algorithms, we are able to distill meaningful information from fMRI signals about subjects and the experimental stimuli used to prompt them. Furthermore, we offer an interpretation of Mahalanobis whitening as a two-stage de-individualization of data which is motivated by similarity as captured by the Bures distance, which is connected to quantum mechanics. These methods have potential to aid discoveries about the mechanisms that link brain function with cognition and behavior and may improve the accuracy and consistency of Alzheimer’s diagnosis, especially in the preclinical stage of disease progression.
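Mahalanobis whitening is the standard Sigma^{-1/2} preprocessing the abstract applies before dimensionality reduction. A generic sketch (not the authors' fMRI pipeline; `eps` is our regularizer for near-zero eigenvalues):

```python
import numpy as np

def mahalanobis_whiten(X, eps=1e-8):
    """Center the data and multiply by Sigma^{-1/2} (ZCA/Mahalanobis
    whitening), so the whitened features have identity covariance."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return Xc @ inv_sqrt

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 5))  # correlated signals
W = mahalanobis_whiten(X)
```

Removing the second-order (covariance) structure is what the authors interpret as a de-individualization step, since subject-specific correlation patterns are equalized before the similarity analysis.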
[LG-143] The Value of Personalized Recommendations: Evidence from Netflix
Link: https://arxiv.org/abs/2511.07280 Authors: Kevin Zielnicki, Guy Aridor, Aurélien Bibaut, Allen Tran, Winston Chou, Nathan Kallus Subjects: General Economics (econ.GN); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract:Personalized recommendation systems shape much of user choice online, yet their targeted nature makes separating out the value of recommendation and the underlying goods challenging. We build a discrete choice model that embeds recommendation-induced utility, low-rank heterogeneity, and flexible state dependence and apply the model to viewership data at Netflix. We exploit idiosyncratic variation introduced by the recommendation algorithm to identify and separately value these components as well as to recover model-free diversion ratios that we can use to validate our structural model. We use the model to evaluate counterfactuals that quantify the incremental engagement generated by personalized recommendations. First, we show that replacing the current recommender system with a matrix factorization or popularity-based algorithm would lead to 4% and 12% reduction in engagement, respectively, and decreased consumption diversity. Second, most of the consumption increase from recommendations comes from effective targeting, not mechanical exposure, with the largest gains for mid-popularity goods (as opposed to broadly appealing or very niche goods).
[LG-144] High-Dimensional Asymptotics of Differentially Private PCA
Abstract:In differential privacy, statistics of a sensitive dataset are privatized by introducing random noise. Most privacy analyses provide privacy bounds specifying a noise level sufficient to achieve a target privacy guarantee. Sometimes, these bounds are pessimistic and suggest adding excessive noise, which overwhelms the meaningful signal. It remains unclear if such high noise levels are truly necessary or a limitation of the proof techniques. This paper explores whether we can obtain sharp privacy characterizations that identify the smallest noise level required to achieve a target privacy level for a given mechanism. We study this problem in the context of differentially private principal component analysis, where the goal is to privatize the leading principal components (PCs) of a dataset with n samples and p features. We analyze the exponential mechanism for this problem in a model-free setting and provide sharp utility and privacy characterizations in the high-dimensional limit (p \rightarrow \infty). Our privacy result shows that, in high dimensions, detecting the presence of a target individual in the dataset using the privatized PCs is exactly as hard as distinguishing two Gaussians with slightly different means, where the mean difference depends on certain spectral properties of the dataset. Our privacy analysis combines the hypothesis-testing formulation of privacy guarantees proposed by Dong, Roth, and Su (2022) with classical contiguity arguments due to Le Cam to obtain sharp high-dimensional privacy characterizations.
[LG-145] Simulation-based Methods for Optimal Sampling Design in Systems Biology
Abstract:In many areas of systems biology, including virology, pharmacokinetics, and population biology, dynamical systems are commonly used to describe biological processes. These systems can be characterized by estimating their parameters from sampled data. The key problem is how to optimally select sampling points to achieve accurate parameter estimation. Classical approaches often rely on Fisher information matrix-based criteria such as A-, D-, and E-optimality, which require an initial parameter estimate and may yield suboptimal results when the estimate is inaccurate. This study proposes two simulation-based methods for optimal sampling design that do not depend on initial parameter estimates. The first method, E-optimal-ranking (EOR), employs the E-optimal criterion, while the second utilizes a Long Short-Term Memory (LSTM) neural network. Simulation studies based on the Lotka-Volterra and three-compartment models demonstrate that the proposed methods outperform both random selection and classical E-optimal design.
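A classical E-optimal design maximizes the smallest eigenvalue of the Fisher information matrix built from model sensitivities. The sketch below greedily selects sampling times for a two-parameter decay model y(t) = A e^{-kt}; it is a generic baseline in the spirit of the classical criteria the paper starts from, not the proposed EOR or LSTM methods, and all names are ours.

```python
import numpy as np

def sensitivities(t, A, k):
    """Sensitivity vector dy/d(A, k) for the model y(t) = A * exp(-k t)."""
    e = np.exp(-k * t)
    return np.array([e, -A * t * e])

def greedy_e_optimal(candidates, A, k, n_points):
    """Greedily add the candidate time that maximizes the smallest
    eigenvalue of the Fisher information sum_t s(t) s(t)^T."""
    chosen, F = [], np.zeros((2, 2))
    for _ in range(n_points):
        best_t, best_val = None, -np.inf
        for t in candidates:
            if t in chosen:
                continue
            s = sensitivities(t, A, k)
            val = np.linalg.eigvalsh(F + np.outer(s, s))[0]  # min eigenvalue
            if val > best_val:
                best_t, best_val = t, val
        s = sensitivities(best_t, A, k)
        F += np.outer(s, s)
        chosen.append(best_t)
    return sorted(chosen)

times = greedy_e_optimal(np.linspace(0.1, 5.0, 25), A=1.0, k=0.8, n_points=4)
```

Note the dependence on the nominal values A and k: this is exactly the reliance on an initial parameter estimate that motivates the paper's estimate-free alternatives.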
[LG-146] Anatomy-Aware Lymphoma Lesion Detection in Whole-Body PET/CT
Abstract:Early cancer detection is crucial for improving patient outcomes, and 18F FDG PET/CT imaging plays a vital role by combining metabolic and anatomical information. Accurate lesion detection remains challenging due to the need to identify multiple lesions of varying sizes. In this study, we investigate the effect of adding anatomy prior information to deep learning-based lesion detection models. In particular, we add organ segmentation masks from the TotalSegmentator tool as auxiliary inputs to provide anatomical context to nnDetection, which is the state-of-the-art for lesion detection, and Swin Transformer. The latter is trained in two stages that combine self-supervised pre-training and supervised fine-tuning. The method is tested in the AutoPET and Karolinska lymphoma datasets. The results indicate that the inclusion of anatomical priors substantially improves the detection performance within the nnDetection framework, while it has almost no impact on the performance of the vision transformer. Moreover, we observe that Swin Transformer does not offer clear advantages over conventional convolutional neural network (CNN) encoders used in nnDetection. These findings highlight the critical role of the anatomical context in cancer lesion detection, especially in CNN-based models.
[LG-147] Dimensionality reduction and width of deep neural networks based on topological degree theory
Abstract:In this paper we present a mathematical framework on linking of embeddings of compact topological spaces into Euclidean spaces and separability of linked embeddings under a specific class of dimension reduction maps. As applications of the established theory, we provide some fascinating insights into classification and approximation problems in deep learning theory in the setting of deep neural networks.
[LG-148] Convergence of Actor-Critic Learning for Mean Field Games and Mean Field Control in Continuous Spaces
Link: https://arxiv.org/abs/2511.06812 Authors: Jean-Pierre Fouque, Mathieu Laurière, Mengrui Zhang Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Probability (math.PR)
Abstract:We establish the convergence of the deep actor-critic reinforcement learning algorithm presented in [Angiuli et al., 2023a] in the setting of continuous state and action spaces with an infinite discrete-time horizon. This algorithm provides solutions to Mean Field Game (MFG) or Mean Field Control (MFC) problems depending on the ratio between two learning rates: one for the value function and the other for the mean field term. In the MFC case, to rigorously identify the limit, we introduce a discretization of the state and action spaces, following the approach used in the finite-space case in [Angiuli et al., 2023b]. The convergence proofs rely on a generalization of the two-timescale framework introduced in [Borkar, 1997]. We further extend our convergence results to Mean Field Control Games, which involve locally cooperative and globally competitive populations. Finally, we present numerical experiments for linear-quadratic problems in one and two dimensions, for which explicit solutions are available.
[LG-149] Bilevel Learning via Inexact Stochastic Gradient Descent
Link: https://arxiv.org/abs/2511.06774 Authors: Mohammad Sadegh Salehi, Subhadip Mukherjee, Lindon Roberts, Matthias J. Ehrhardt Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Abstract:Bilevel optimization is a central tool in machine learning for high-dimensional hyperparameter tuning. Its applications are vast; for instance, in imaging it can be used for learning data-adaptive regularizers and optimizing forward operators in variational regularization. These problems are large in many ways: a lot of data is usually available to train a large number of parameters, calling for stochastic gradient-based algorithms. However, exact gradients with respect to parameters (so-called hypergradients) are not available, and their precision is usually linearly related to computational cost. Hence, algorithms must solve the problem efficiently without unnecessary precision. The design of such methods is still not fully understood, especially regarding how accuracy requirements and step size schedules affect theoretical guarantees and practical performance. Existing approaches introduce stochasticity at both the upper level (e.g., in sampling or mini-batch estimates) and the lower level (e.g., in solving the inner problem) to improve generalization, but they typically fix the number of lower-level iterations, which conflicts with asymptotic convergence assumptions. In this work, we advance the theory of inexact stochastic bilevel optimization. We prove convergence and establish rates under decaying accuracy and step size schedules, showing that with optimal configurations convergence occurs at an \mathcal{O}(k^{-1/4}) rate in expectation. Experiments on image denoising and inpainting with convex ridge regularizers and input-convex networks confirm our analysis: decreasing step sizes improve stability, accuracy scheduling is more critical than step size strategy, and adaptive preconditioning (e.g., Adam) further boosts performance. These results bridge theory and practice, providing convergence guarantees and practical guidance for large-scale imaging problems.
[LG-150] Lassoed Forests: Random Forests with Adaptive Lasso Post-selection
Abstract:Random forests are a statistical learning technique that uses bootstrap aggregation to average high-variance and low-bias trees. Improvements to random forests, such as applying Lasso regression to the tree predictions, have been proposed in order to reduce model bias. However, these changes can sometimes degrade performance (e.g., an increase in mean squared error). In this paper, we show in theory that the relative performance of these two methods, standard and Lasso-weighted random forests, depends on the signal-to-noise ratio. We further propose a unified framework to combine random forests and Lasso selection by applying adaptive weighting and show mathematically that it can strictly outperform the other two methods. We compare the three methods through simulation, including bias-variance decomposition, evaluation of error estimates, and variable importance analysis. We also show the versatility of our method through applications to a variety of real-world datasets.
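The Lasso-weighting idea is easy to see on a toy forest. The sketch below is entirely our illustration: bootstrap stumps stand in for real CART trees, and the Lasso is a bare coordinate-descent solver. It contrasts the standard equal-weight average of tree predictions with Lasso-fitted tree weights.

```python
import numpy as np

def fit_stump(x, y):
    """One bootstrap 'tree': a single split at the sample median."""
    thr = np.median(x)
    return thr, y[x <= thr].mean(), y[x > thr].mean()

def stump_predict(stump, x):
    thr, left, right = stump
    return np.where(x <= thr, left, right)

def lasso_cd(P, y, lam, iters=200):
    """Coordinate descent for 0.5*||y - P w||^2 + lam*||w||_1."""
    w = np.zeros(P.shape[1])
    col_sq = (P ** 2).sum(axis=0)
    for _ in range(iters):
        for j in range(P.shape[1]):
            r = y - P @ w + P[:, j] * w[j]         # partial residual
            rho = P[:, j] @ r
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return w

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 200)
y = np.sin(x) + 0.1 * rng.standard_normal(200)
stumps = []
for _ in range(50):                                # bootstrap-aggregated stumps
    idx = rng.integers(0, x.size, x.size)
    stumps.append(fit_stump(x[idx], y[idx]))
P = np.column_stack([stump_predict(s, x) for s in stumps])
forest_pred = P.mean(axis=1)                       # standard averaging
lasso_pred = P @ lasso_cd(P, y, lam=5.0)           # Lasso-weighted trees
```

The paper's adaptive scheme then interpolates between these two predictors depending on the signal-to-noise ratio, rather than committing to either.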
[LG-151] Adam symmetry theorem: characterization of the convergence of the stochastic Adam optimizer
Link: https://arxiv.org/abs/2511.06675 Authors: Steffen Dereich, Thang Do, Arnulf Jentzen, Philippe von Wurstemberger Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: 66 pages
Click to view abstract
Abstract:Besides the standard stochastic gradient descent (SGD) method, the Adam optimizer due to Kingma & Ba (2014) is currently probably the best-known optimization method for the training of deep neural networks in artificial intelligence (AI) systems. Despite the popularity and the success of Adam, it remains an open research problem to provide a rigorous convergence analysis for Adam even for the class of strongly convex stochastic optimization problems (SOPs). In one of the main results of this work we establish convergence rates for Adam in terms of the number of gradient steps (convergence rate 1/2 w.r.t. the size of the learning rate), the size of the mini-batches (convergence rate 1 w.r.t. the size of the mini-batches), and the size of the second moment parameter of Adam (convergence rate 1 w.r.t. the distance of the second moment parameter to 1) for the class of strongly convex SOPs. In a further main result of this work, which we refer to as the Adam symmetry theorem, we illustrate the optimality of the established convergence rates by proving, for a special class of simple quadratic strongly convex SOPs, that Adam converges, as the number of gradient steps increases to infinity, to the solution of the SOP (the unique minimizer of the strongly convex objective function) if and only if the random variables in the SOP (the data in the SOP) are symmetrically distributed. In particular, in the standard case where the random variables in the SOP are not symmetrically distributed, we disprove that Adam converges to the minimizer of the SOP as the number of Adam steps increases to infinity. We also complement the conclusions of our convergence analysis and the Adam symmetry theorem with several numerical simulations that indicate the sharpness of the established convergence rates and illustrate the practical appearance of the phenomena revealed in the Adam symmetry theorem.
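The quadratic setting of the theorem can be sketched in a few lines: run Adam on f(w) = E[0.5 (w - Z)^2], whose minimizer is E[Z], once with symmetrically distributed data and once with asymmetric data of the same mean. The hyperparameters and distributions below are invented for illustration and are not taken from the paper; with the second-moment parameter this close to 1, any asymmetry-induced bias can be small, so we only check that the symmetric run converges.

```python
import math
import random

def adam_quadratic(sample_z, steps=20000, seed=0):
    """Run Adam on f(w) = E[0.5*(w - Z)^2]; the minimizer is E[Z]."""
    rng = random.Random(seed)
    w, m, v = 0.5, 0.0, 0.0
    b1, b2, eps = 0.9, 0.999, 1e-8
    for k in range(1, steps + 1):
        g = w - sample_z(rng)          # stochastic gradient at the current iterate
        m = b1 * m + (1 - b1) * g      # first-moment estimate
        v = b2 * v + (1 - b2) * g * g  # second-moment estimate
        mh = m / (1 - b1 ** k)         # bias corrections
        vh = v / (1 - b2 ** k)
        lr = 0.05 / math.sqrt(k)       # decaying learning rate
        w -= lr * mh / (math.sqrt(vh) + eps)
    return w

# Symmetric data: Z = +1 or -1 with equal probability, E[Z] = 0.
w_sym = adam_quadratic(lambda r: r.choice([-1.0, 1.0]))
# Asymmetric data with the same mean 0: Z = 3 w.p. 1/4, Z = -1 w.p. 3/4.
w_asym = adam_quadratic(lambda r: 3.0 if r.random() < 0.25 else -1.0)
```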
[LG-152] Learning Biomolecular Motion: The Physics-Informed Machine Learning Paradigm
Abstract:The convergence of statistical learning and molecular physics is transforming our approach to modeling biomolecular systems. Physics-informed machine learning (PIML) offers a systematic framework that integrates data-driven inference with physical constraints, resulting in models that are accurate, mechanistic, generalizable, and able to extrapolate beyond observed domains. This review surveys recent advances in physics-informed neural networks and operator learning, differentiable molecular simulation, and hybrid physics-ML potentials, with emphasis on long-timescale kinetics, rare events, and free-energy estimation. We frame these approaches as solutions to the “biomolecular closure problem”, recovering unresolved interactions beyond classical force fields while preserving thermodynamic consistency and mechanistic interpretability. We examine theoretical foundations, tools and frameworks, computational trade-offs, and unresolved issues, including model expressiveness and stability. We outline prospective research avenues at the intersection of machine learning, statistical physics, and computational chemistry, contending that future advancements will depend on mechanistic inductive biases, and integrated differentiable physical learning frameworks for biomolecular simulation and discovery.
[LG-153] Bridging Theory and Practice: A Stochastic Learning-Optimization Model for Resilient Automotive Supply Chains
链接: https://arxiv.org/abs/2511.06479 作者: Muhammad Shahnawaz,Adeel Safder 类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 14 pages, 4 figures
点击查看摘要
Abstract:Supply chain disruptions and volatile demand pose significant challenges to the UK automotive industry, which relies heavily on Just-In-Time (JIT) manufacturing. While qualitative studies highlight the potential of integrating Artificial Intelligence (AI) with traditional optimization, a formal, quantitative demonstration of this synergy is lacking. This paper introduces a novel stochastic learning-optimization framework that integrates Bayesian inference with inventory optimization for supply chain management (SCM). We model a two-echelon inventory system subject to stochastic demand and supply disruptions, comparing a traditional static optimization policy against an adaptive policy where Bayesian learning continuously updates parameter estimates to inform stochastic optimization. Our simulations over 365 periods across three operational scenarios demonstrate that the integrated approach achieves 7.4% cost reduction in stable environments and 5.7% improvement during supply disruptions, while revealing important limitations during sudden demand shocks due to the inherent conservatism of Bayesian updating. This work provides mathematical validation for practitioner observations and establishes a formal framework for understanding AI-driven supply chain resilience, while identifying critical boundary conditions for successful implementation.
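The Bayesian-learning side of such a framework can be sketched with a standard conjugate Gamma-Poisson model for the demand rate. This is an illustrative stand-in, not the paper's model: the prior, the observed demands, and the service level below are all invented.

```python
import math

# Gamma(alpha, beta) prior on a Poisson demand rate; each observed period
# demand d updates the posterior to Gamma(alpha + d, beta + 1).
alpha, beta = 2.0, 1.0
for d in [4, 6, 5, 7]:  # hypothetical per-period demand observations
    alpha += d
    beta += 1.0
rate = alpha / beta      # posterior mean demand rate: (2 + 22) / (1 + 4) = 4.8

def poisson_quantile(lam, p):
    """Smallest s with P(D <= s) >= p for D ~ Poisson(lam)."""
    s = 0
    pmf = math.exp(-lam)
    cdf = pmf
    while cdf < p:
        s += 1
        pmf *= lam / s   # recurrence P(D = s) = P(D = s-1) * lam / s
        cdf += pmf
    return s

base_stock = poisson_quantile(rate, 0.95)  # order-up-to level for a 95% service level
```

The adaptive policy in the paper would feed such posterior updates into the stochastic optimization each period, whereas the static baseline keeps its initial parameter estimates fixed.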
[LG-154] Fast Riemannian-manifold Hamiltonian Monte Carlo for hierarchical Gaussian-process models
[LG-155] Learning the Inverse Ryu–Takayanagi Formula with Transformers
Link: https://arxiv.org/abs/2511.06387 Authors: Sejin Kim Subjects: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG)
Comments: 15 pages, 6 figures
Click to view abstract
Abstract:We study the inverse problem of holographic entanglement entropy in AdS_3 using a data-driven generative model. Training data consist of randomly generated geometries and their holographic entanglement entropies computed via the Ryu–Takayanagi formula. After training, the Transformer reconstructs the blackening function within our metric ansatz from previously unseen inputs. The Transformer achieves accurate reconstructions on smooth black hole geometries and extrapolates to horizonless backgrounds. We describe the architecture and data generation process, and we quantify accuracy on both the blackening function f(z) and the reconstructed entanglement entropy S(\ell). Code and evaluation scripts are available at the provided repository.
[LG-156] Functional Adjoint Sampler: Scalable Sampling on Infinite Dimensional Spaces
Abstract:Learning-based methods for sampling from the Gibbs distribution in finite-dimensional spaces have progressed quickly, yet theory and algorithmic design for infinite-dimensional function spaces remain limited. This gap persists despite their strong potential for sampling the paths of conditional diffusion processes, enabling efficient simulation of trajectories of diffusion processes that respect rare events or boundary constraints. In this work, we present the adjoint sampler for infinite-dimensional function spaces, a stochastic optimal control (SOC)-based diffusion sampler that operates in function space and targets Gibbs-type distributions on infinite-dimensional Hilbert spaces. Our Functional Adjoint Sampler (FAS) generalizes Adjoint Sampling (Havens et al., 2025) to Hilbert spaces via the stochastic maximum principle from SOC theory, yielding a simple and scalable matching-type objective for a functional representation. We show that FAS achieves superior transition path sampling performance across synthetic potentials and real molecular systems, including Alanine Dipeptide and Chignolin.
[LG-157] Sparsity via Hyperpriors: A Theoretical and Algorithmic Study under Empirical Bayes Framework
Abstract:Accurate thermospheric density prediction is crucial for reliable satellite operations in Low Earth Orbits, especially at high solar and geomagnetic activity. Physics-based models such as TIE-GCM offer high fidelity but are computationally expensive, while empirical models like NRLMSIS are efficient yet lack predictive power. This work presents a transformer-based model that forecasts densities up to three days ahead and is intended as a drop-in replacement for an empirical baseline. Unlike recent approaches, it avoids spatial reduction and complex input pipelines, operating directly on a compact input set. Validated on real-world data, the model improves key prediction metrics and shows potential to support mission planning.
[LG-159] The Algorithmic Phase Transition in Symmetric Correlated Spiked Wigner Model
Abstract:We study the computational task of detecting and estimating correlated signals in a pair of spiked Wigner matrices. Our model consists of observations $X = \frac{\lambda}{\sqrt{n}} xx^\top + W, \quad Y = \frac{\mu}{\sqrt{n}} yy^\top + Z,$ where $x, y \in \mathbb{R}^n$ are signal vectors with norms $\|x\|, \|y\| \approx \sqrt{n}$ and correlation $\langle x, y \rangle \approx \rho \|x\| \|y\|$, while $W, Z$ are independent Gaussian noise matrices. We propose an efficient algorithm that succeeds whenever $F(\lambda,\mu,\rho) > 1$, where $F(\lambda,\mu,\rho)=\max\left\{ \lambda,\ \mu,\ \frac{\lambda^2 \rho^2}{1-\lambda^2+\lambda^2 \rho^2} + \frac{\mu^2 \rho^2}{1-\mu^2+\mu^2 \rho^2} \right\}.$ Our result shows that an algorithm can leverage the correlation between the spikes to detect and estimate the signals even in regimes where efficiently recovering either $x$ from $X$ alone or $y$ from $Y$ alone is believed to be computationally infeasible. We complement our algorithmic result with evidence for a matching computational lower bound. In particular, we prove that when $F(\lambda,\mu,\rho) < 1$, all algorithms based on low-degree polynomials fail to distinguish $(X,Y)$ from two independent Wigner matrices. This low-degree analysis strongly suggests that $F(\lambda,\mu,\rho)=1$ is the precise computational threshold for this problem.
Comments: 47 pages. Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML). MSC classes: 68Q87, 68Q17. Cite as: arXiv:2511.06040 [math.ST], https://doi.org/10.48550/arXiv.2511.06040. Submission history: [v1] from Zhangsong Li, Sat, 8 Nov 2025 15:23:44 UTC (44 KB).
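The threshold function from the abstract is easy to evaluate directly. The sketch below checks the regime the result highlights: each spike is individually sub-critical (lambda, mu < 1, so the denominators are positive), yet strong correlation pushes F above 1 and makes the pair jointly detectable; the parameter values are illustrative.

```python
def F(lam, mu, rho):
    """Detection functional F(lambda, mu, rho), assuming lam, mu in [0, 1)."""
    corr_term = (lam**2 * rho**2) / (1 - lam**2 + lam**2 * rho**2) \
              + (mu**2 * rho**2) / (1 - mu**2 + mu**2 * rho**2)
    return max(lam, mu, corr_term)

# Both spikes below the individual BBP-style threshold, but strongly correlated:
print(F(0.9, 0.9, 0.9) > 1)  # True: jointly detectable
# Same spike strengths with no correlation: the task stays computationally hard.
print(F(0.9, 0.9, 0.0) > 1)  # False
```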
[LG-160] Benchmarking of Clustering Validity Measures Revisited
Link: https://arxiv.org/abs/2511.05983 Authors: Connor Simpson, Ricardo J. G. B. Campello, Elizabeth Stojanovski Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 48 pages, 17 tables, 17 figures
Click to view abstract
Abstract:Validation plays a crucial role in the clustering process. Many different internal validity indexes exist for the purpose of determining the best clustering solution(s) from a given collection of candidates, e.g., as produced by different algorithms or different algorithm hyper-parameters. In this study, we present a comprehensive benchmark study of 26 internal validity indexes, which includes highly popular classic indexes as well as more recently developed ones. We adopted an enhanced revision of the methodology presented in Vendramin et al. (2010), developed here to address several shortcomings of this previous work. This overall new approach consists of three complementary custom-tailored evaluation sub-methodologies, each of which has been designed to assess specific aspects of an index’s behaviour while preventing potential biases of the other sub-methodologies. Each sub-methodology features two complementary measures of performance, alongside mechanisms that allow for an in-depth investigation of more complex behaviours of the internal validity indexes under study. Additionally, a new collection of 16177 datasets has been produced, paired with eight widely-used clustering algorithms, for a wider applicability scope and representation of more diverse clustering scenarios.
[LG-161] Beyond Resolution: Multi-Scale Weather and Climate Data for Alpine Renewable Energy in the Digital Twin Era - First Evaluations and Recommendations
Abstract:When Austrian hydropower production plummeted by 44% in early 2025 due to reduced snowpack, it exposed a critical vulnerability: standard meteorological and climatological datasets systematically fail in mountain regions that hold untapped renewable potential. This perspectives paper evaluates emerging solutions to the Alpine energy-climate data gap, analyzing datasets from global reanalyses (ERA5, 31 km) to kilometre-scale Digital Twins (Climate DT, Extremes DT, 4.4 km), regional reanalyses (ARA, 2.5 km), and next-generation AI weather prediction models (AIFS, 31 km). The multi-resolution assessment reveals that no single dataset excels universally: coarse reanalyses provide essential climatologies but miss valley-scale processes, while Digital Twins resolve Alpine dynamics yet remain computationally demanding. Effective energy planning therefore requires strategic dataset combinations validated against energy-relevant indices such as population-weighted extremes, wind-gust return periods, and Alpine-adjusted storm thresholds. A key frontier is sub-hourly (10-15 min) temporal resolution to match grid-operation needs. Six evidence-based recommendations outline pathways for bridging spatial and temporal scales. As renewable deployment expands globally into complex terrain, the Alpine region offers transferable perspectives for tackling identical forecasting and climate analysis challenges in mountainous regions worldwide.
[LG-162] Bridging Accuracy and Explainability in EEG-based Graph Attention Network for Depression Detection
Link: https://arxiv.org/abs/2511.05537 Authors: Soujanya Hazra, Sanjay Ghosh Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Neurons and Cognition (q-bio.NC)
Comments: 13 pages, 3 tables, and 7 figures
Click to view abstract
Abstract:Depression is a major cause of global mental illness and significantly influences suicide rates. Timely and accurate diagnosis is essential for effective intervention. Electroencephalography (EEG) provides a non-invasive and accessible method for examining cerebral activity and identifying disease-associated patterns. We propose a novel graph-based deep learning framework, named Edge-gated, axis-mixed Pooling Attention Network (ExPANet), for differentiating major depressive disorder (MDD) patients from healthy controls (HC). EEG recordings undergo preprocessing to eliminate artifacts and are segmented into short periods of activity. We extract 14 features from each segment, which include time, frequency, fractal, and complexity domains. Electrodes are represented as nodes, whereas edges are determined by the phase-locking value (PLV) to represent functional connectivity. The generated brain graphs are examined utilizing an adapted graph attention network. This architecture acquires both localized electrode characteristics and comprehensive functional connectivity patterns. The proposed framework attains superior performance relative to current EEG-based approaches across two different datasets. A fundamental advantage of our methodology is its explainability. We evaluated the significance of features, channels, and edges, in addition to intrinsic attention weights. These studies highlight features, cerebral areas, and connectivity associations that are especially relevant to MDD, many of which correspond with clinical data. Our findings demonstrate a reliable and transparent method for EEG-based screening of MDD, using deep learning with clinically relevant results.
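The phase-locking value (PLV) used above for the edge weights follows a standard construction: extract instantaneous phases from the analytic signal of each channel and take the magnitude of the average unit phasor of the phase difference. The numpy-only sketch below demonstrates this on synthetic signals, not the paper's EEG pipeline; the FFT-based helper is a stand-in for scipy.signal.hilbert.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal via FFT (numpy-only stand-in for scipy.signal.hilbert)."""
    n = len(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * h)

def plv(x, y):
    """Phase-locking value: |mean of exp(i * phase difference)|, in [0, 1]."""
    dphi = np.angle(analytic_signal(x)) - np.angle(analytic_signal(y))
    return float(np.abs(np.exp(1j * dphi).mean()))

fs = 256
t = np.arange(0, 2, 1 / fs)
locked_a = np.sin(2 * np.pi * 10 * t)        # 10 Hz
locked_b = np.sin(2 * np.pi * 10 * t + 0.8)  # same frequency, fixed phase lag
drifting = np.sin(2 * np.pi * 17 * t)        # different frequency: phase drifts

plv_locked = plv(locked_a, locked_b)
plv_drift = plv(locked_a, drifting)
print(plv_locked, plv_drift)  # near 1 for the locked pair, near 0 for the drifting pair
```

Thresholding such pairwise PLVs over all electrode pairs yields the functional-connectivity edges of the brain graphs described above.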
Information Retrieval
[IR-0] Wavelet Enhanced Adaptive Frequency Filter for Sequential Recommendation
[IR-1] CGLE: Class-label Graph Link Estimator for Link Prediction ICDM2025
Link: https://arxiv.org/abs/2511.06982 Authors: Ankit Mazumder, Srikanta Bedathur Subjects: Social and Information Networks (cs.SI); Information Retrieval (cs.IR)
Comments: Paper accepted at the IEEE International Conference on Data Mining (ICDM 2025)
Click to view abstract
[IR-2] Have We Really Understood Collaborative Information? An Empirical Investigation WSDM2026
Link: https://arxiv.org/abs/2511.06905 Authors: Xiaokun Zhang, Zhaochun Ren, Bowei He, Ziqiang Cui, Chen Ma Subjects: Information Retrieval (cs.IR)
Comments: This work has been accepted by WSDM 2026
Click to view abstract
[IR-3] Accessibility Gaps in U.S. Government Dashboards for Blind and Low-Vision Residents
Link: https://arxiv.org/abs/2511.06688 Authors: Chadani Acharya Subjects: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Comments: Preprint. Accessibility audit of six U.S. public dashboard ecosystems; 1 figure, 2 tables
Click to view abstract
[IR-4] Can LLM Annotations Replace User Clicks for Learning to Rank?
[IR-5] TOOL4POI: A Tool-Augmented LLM Framework for Next POI Recommendation AAAI2026
Link: https://arxiv.org/abs/2511.06405 Authors: Dongsheng Wang, Shen Gao, Chengrui Huang, Yuxi Huang, Ruixiang Feng, Shuo Shang Subjects: Information Retrieval (cs.IR)
Comments: Accepted by AAAI 2026
Click to view abstract
[IR-6] LLaDA-Rec: Discrete Diffusion for Parallel Semantic ID Generation in Generative Recommendation
Link: https://arxiv.org/abs/2511.06254 Authors: Teng Shi, Chenglei Shen, Weijie Yu, Shen Nie, Chongxuan Li, Xiao Zhang, Ming He, Yan Han, Jun Xu Subjects: Information Retrieval (cs.IR)
Comments:
Click to view abstract
Abstract:Generative recommendation represents each item as a semantic ID, i.e., a sequence of discrete tokens, and generates the next item through autoregressive decoding. While effective, existing autoregressive models face two intrinsic limitations: (1) unidirectional constraints, where causal attention restricts each token to attend only to its predecessors, hindering global semantic modeling; and (2) error accumulation, where the fixed left-to-right generation order causes prediction errors in early tokens to propagate to the predictions of subsequent tokens. To address these issues, we propose LLaDA-Rec, a discrete diffusion framework that reformulates recommendation as parallel semantic ID generation. By combining bidirectional attention with the adaptive generation order, the approach models inter-item and intra-item dependencies more effectively and alleviates error accumulation. Specifically, our approach comprises three key designs: (1) a parallel tokenization scheme that produces semantic IDs for bidirectional modeling, addressing the mismatch between residual quantization and bidirectional architectures; (2) two masking mechanisms at the user-history and next-item levels to capture both inter-item sequential dependencies and intra-item semantic relationships; and (3) an adapted beam search strategy for adaptive-order discrete diffusion decoding, resolving the incompatibility of standard beam search with diffusion-based generation. Experiments on three real-world datasets show that LLaDA-Rec consistently outperforms both ID-based and state-of-the-art generative recommenders, establishing discrete diffusion as a new paradigm for generative recommendation.
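The adaptive generation order can be caricatured with a confidence-based unmasking loop. The scorer below is a fixed random table standing in for the trained diffusion model, and in the real method the per-position probabilities would be re-predicted after every unmasking step; the vocabulary size, ID length, and seed are all invented for illustration.

```python
import numpy as np

# Toy confidence-based unmasking for a 4-token semantic ID (illustrative only).
rng = np.random.default_rng(7)
V, L = 8, 4      # vocabulary size, semantic-ID length
MASK = -1
# Hypothetical per-position token probabilities a trained model might emit:
probs = rng.dirichlet(np.ones(V), size=L)

tokens = [MASK] * L
order = []
while MASK in tokens:
    # Score every still-masked position by its current top-token confidence.
    masked = [i for i, tok in enumerate(tokens) if tok == MASK]
    conf = {i: probs[i].max() for i in masked}
    i_best = max(conf, key=conf.get)        # fill the most confident slot first,
    tokens[i_best] = int(probs[i_best].argmax())  # not necessarily the leftmost one
    order.append(i_best)

print(tokens, order)  # all positions filled, in confidence order
```

Because easy positions are committed first and hard ones last, an early mistake at a low-confidence position no longer conditions every later prediction, which is the error-accumulation argument made above.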
[IR-7] User Hesitation and Negative Transfer in Multi-Behavior Recommendation
Abstract:Multi-behavior recommendation aims to integrate users’ interactions across various behavior types (e.g., view, favorite, add-to-cart, purchase) to more comprehensively characterize user preferences. However, existing methods lack in-depth modeling when dealing with interactions that generate only auxiliary behaviors without triggering the target behavior. In fact, these weak signals contain rich latent information and can be categorized into two types: (1) positive weak signals-items that have not triggered the target behavior but exhibit frequent auxiliary interactions, reflecting users’ hesitation tendencies toward these items; and (2) negative weak signals-auxiliary behaviors that result from misoperations or interaction noise, which deviate from true preferences and may cause negative transfer effects. To more effectively identify and utilize these weak signals, we propose a recommendation framework focused on weak signal learning, termed HNT. Specifically, HNT models weak signal features from two dimensions: positive and negative effects. By learning the characteristics of auxiliary behaviors that lead to target behaviors, HNT identifies similar auxiliary behaviors that did not trigger the target behavior and constructs a hesitation set of related items as weak positive samples to enhance preference modeling, thereby capturing users’ latent hesitation intentions. Meanwhile, during auxiliary feature fusion, HNT incorporates latent negative transfer effect modeling to distinguish and suppress interference caused by negative representations through item similarity learning. Experiments on three real-world datasets demonstrate that HNT improves HR@10 and NDCG@10 by 12.57% and 14.37%, respectively, compared to the best baseline methods.
[IR-8] SARCH: Multimodal Search for Archaeological Archives
Abstract:In this paper, we describe a multi-modal search system designed to search old archaeological books and reports. This corpus is digitally available as scanned PDFs, but varies widely in the quality of scans. Our pipeline, designed for multi-modal archaeological documents, extracts and indexes text, images (classified into maps, photos, layouts, and others), and tables. We evaluated different retrieval strategies, including keyword-based search, embedding-based models, and a hybrid approach that selects optimal results from both modalities. We report and analyze our preliminary results and discuss future work in this exciting vertical.
[IR-9] GreyShot: Zeroshot and Privacy-preserving Recommender System by GM(1,1) Model
Abstract:Every recommendation engineer faces the cold-start problem when building a system. During the past decades, most scientists have adopted transfer learning and meta learning to solve the problem. Although notable exceptions such as ZeroMat have been invented in recent years, the cold-start problem remains challenging for many researchers. In this paper, we build a zeroshot and privacy-preserving recommender system algorithm, GreyShot, using the GM(1,1) model and taking advantage of the Poisson-Pareto property of online rating data. Our approach relies on no input data and is effective in generating both accurate and fair results. In conclusion, the zeroshot problem of recommender systems can be effectively solved by grey system methods such as GM(1,1).
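GM(1,1) itself is a classical grey-system model, so its forecasting step can be sketched independently of the paper: accumulate the series, fit the development coefficient and grey input by least squares on the background values, and forecast via the whitened equation. The series below is invented illustrative data (a geometric sequence, which GM(1,1) fits almost exactly), not the paper's rating data.

```python
import numpy as np

def gm11_forecast(x0, horizon=1):
    """Fit a GM(1,1) grey model to the series x0 and forecast `horizon` steps ahead.

    Assumes a positive series and a nonzero fitted development coefficient a.
    """
    x0 = np.asarray(x0, dtype=float)
    x1 = np.cumsum(x0)                                # accumulated series (1-AGO)
    z1 = 0.5 * (x1[1:] + x1[:-1])                     # background (mean) values
    B = np.column_stack([-z1, np.ones_like(z1)])
    a, b = np.linalg.lstsq(B, x0[1:], rcond=None)[0]  # model: x0[k] = -a*z1[k] + b
    n = len(x0)
    k = np.arange(n + horizon)
    # Whitened-equation solution of dx1/dt + a*x1 = b:
    x1_hat = (x0[0] - b / a) * np.exp(-a * k) + b / a
    x0_hat = np.empty(n + horizon)
    x0_hat[0] = x0[0]
    x0_hat[1:] = np.diff(x1_hat)                      # inverse AGO
    return x0_hat[n:]                                 # forecast values only

pred = gm11_forecast([100.0, 110.0, 121.0, 133.1], horizon=1)[0]
print(pred)  # close to 146.41, the next term of the 10%-growth sequence
```

Because the model needs no user-item interaction history beyond a short aggregate series, it is the kind of component a zeroshot, privacy-preserving recommender like the one described above could build on.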