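This post lists the latest papers retrieved from arXiv.org on 2025-11-10. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.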

Note: the daily paper data is retrieved from arXiv.org and refreshed automatically at around 12:00 every morning.

Reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Contents

Overview (2025-11-10)

A total of 381 papers were updated today, including:

  • Natural Language Processing: 59 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 102 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 79 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 113 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] MIMIC-SR-ICD11: A Dataset for Narrative-Based Diagnosis

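[Quick read]: This paper addresses the loss of subtle but critical clinical signals caused by templated electronic health record (EHR) documentation, with the goal of improving diagnostic accuracy. Its core contribution is MIMIC-SR-ICD11, a large English diagnostic dataset aligned with the World Health Organization's ICD-11 terminology, together with LL-Rank, a likelihood-based re-ranking framework. LL-Rank takes the difference between a label's length-normalized joint likelihood given the clinical report context and its report-free prior likelihood, which decouples semantic compatibility from label-frequency bias, and it consistently outperforms a generation-plus-mapping baseline (GenMap).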

Link: https://arxiv.org/abs/2511.05485
Authors: Yuexin Wu, Shiqi Wang, Vasile Rus
Affiliations: University of Memphis; The Second Clinical College of Guangzhou University of Chinese Medicine
Categories: Computation and Language (cs.CL)
Comments: 19

Abstract:Disease diagnosis is a central pillar of modern healthcare, enabling early detection and timely intervention for acute conditions while guiding lifestyle adjustments and medication regimens to prevent or slow chronic disease. Self-reports preserve clinically salient signals that templated electronic health record (EHR) documentation often attenuates or omits, especially subtle but consequential details. To operationalize this shift, we introduce MIMIC-SR-ICD11, a large English diagnostic dataset built from EHR discharge notes and natively aligned to WHO ICD-11 terminology. We further present LL-Rank, a likelihood-based re-ranking framework that computes a length-normalized joint likelihood of each label given the clinical report context and subtracts the corresponding report-free prior likelihood for that label. Across seven model backbones, LL-Rank consistently outperforms a strong generation-plus-mapping baseline (GenMap). Ablation experiments show that LL-Rank’s gains primarily stem from its PMI-based scoring, which isolates semantic compatibility from label frequency bias.
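
The scoring rule described in the abstract (length-normalized likelihood of a label given the report, minus the same label's report-free likelihood) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names and the toy token log-probabilities are assumptions made up for the example, and in practice the log-probabilities would come from a language model.

```python
from typing import Dict, List, Tuple


def length_normalized(logprobs: List[float]) -> float:
    """Average per-token log-probability of a label string."""
    return sum(logprobs) / len(logprobs)


def pmi_rerank(
    conditional: Dict[str, List[float]],
    prior: Dict[str, List[float]],
) -> List[Tuple[str, float]]:
    """Rank labels by (log-likelihood with report) - (log-likelihood without report)."""
    scored = [
        (label, length_normalized(conditional[label]) - length_normalized(prior[label]))
        for label in conditional
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical token log-probabilities, invented for illustration only.
conditional = {
    "Essential hypertension": [-0.5, -0.7],  # frequent label, likely either way
    "Migraine": [-0.8, -1.0, -0.9],          # rarer label, much likelier given the report
}
prior = {
    "Essential hypertension": [-0.6, -0.8],
    "Migraine": [-2.0, -2.2, -2.1],
}
ranking = pmi_rerank(conditional, prior)
```

In this toy case the raw conditional likelihood favors the frequent label, while the difference score favors the label whose likelihood rises most once the report is added, which is the frequency-bias effect the abstract attributes to PMI-based scoring.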

[NLP-1] APP: Accelerated Path Patching with Task-Specific Pruning

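[Quick read]: This paper addresses the high computational cost of circuit discovery in mechanistic interpretability, which has limited in-depth circuit analysis to smaller models. It proposes Accelerated Path Patching (APP), whose key component is Contrastive-FLAP, a contrastive attention-head pruning algorithm that draws on causal mediation analysis to identify and preserve task-specific attention heads. Pruning shrinks the circuit search space by 56% on average; conventional Path Patching is then run on the remaining heads, giving a speed-up of 59.63%–93.27% over Path Patching applied to the dense model, while the resulting circuits show comparable performance and substantial overlap with previously established circuits.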

Link: https://arxiv.org/abs/2511.05442
Authors: Frauke Andersen, William Rudman, Ruochen Zhang, Carsten Eickhoff
Affiliations: University of Tübingen; The University of Texas at Austin; Brown University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Circuit discovery is a key step in many mechanistic interpretability pipelines. Current methods, such as Path Patching, are computationally expensive and have limited in-depth circuit analysis for smaller models. In this study, we propose Accelerated Path Patching (APP), a hybrid approach leveraging our novel contrastive attention head pruning method to drastically reduce the search space of circuit discovery methods. Our Contrastive-FLAP pruning algorithm uses techniques from causal mediation analysis to assign higher pruning scores to task-specific attention heads, leading to higher performing sparse models compared to traditional pruning techniques. Although Contrastive-FLAP is successful at preserving task-specific heads that existing pruning algorithms remove at low sparsity ratios, the circuits found by Contrastive-FLAP alone are too large to satisfy the minimality constraint required in circuit analysis. APP first applies Contrastive-FLAP to reduce the search space required for circuit discovery algorithms by, on average, 56%. Next, APP applies traditional Path Patching on the remaining attention heads, leading to a speed up of 59.63%-93.27% compared to Path Patching applied to the dense model. Despite the substantial computational saving that APP provides, circuits obtained from APP exhibit substantial overlap and similar performance to previously established Path Patching circuits.

[NLP-2] Steering Language Models with Weight Arithmetic

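[Quick read]: This paper addresses unintended generalization in large language models (LLMs) that arises when feedback is available only on a narrow training distribution, where a model can pick up undesirable behaviors (such as sycophancy or misalignment) on broader distributions. The proposed solution is contrastive weight steering, a post-training method that edits model parameters through arithmetic in weight space: the weight deltas of two small fine-tunes, one inducing the desired behavior and one inducing its opposite, are subtracted to isolate a behavior direction, which is then added to or removed from the model's weights. Compared with activation steering, weight steering often generalizes further and gives stronger out-of-distribution behavioral control before general capabilities degrade. It can also partially mitigate behavioral drift introduced by task-specific fine-tuning (sycophancy and under-refusals) while preserving task performance gains, and it offers a preliminary route to detecting emergent misalignment.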

Link: https://arxiv.org/abs/2511.05408
Authors: Constanza Fierro, Fabien Roger
Affiliations: University of Copenhagen; Anthropic
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Providing high-quality feedback to Large Language Models (LLMs) on a diverse training distribution can be difficult and expensive, and providing feedback only on a narrow distribution can result in unintended generalizations. To better leverage narrow training data, we propose contrastive weight steering, a simple post-training method that edits the model parameters using weight arithmetic. We isolate a behavior direction in weight-space by subtracting the weight deltas from two small fine-tunes – one that induces the desired behavior and another that induces its opposite – and then add or remove this direction to modify the model’s weights. We apply this technique to mitigate sycophancy and induce misalignment, and find that weight steering often generalizes further than activation steering, achieving stronger out-of-distribution behavioral control before degrading general capabilities. We also show that, in the context of task-specific fine-tuning, weight steering can partially mitigate undesired behavioral drift: it can reduce sycophancy and under-refusals introduced during fine-tuning while preserving task performance gains. Finally, we provide preliminary evidence that emergent misalignment can be detected by measuring the similarity between fine-tuning updates and an “evil” weight direction, suggesting that it may be possible to monitor the evolution of weights during training and detect rare misaligned behaviors that never manifest during training or evaluations.
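
The weight arithmetic described in the abstract can be illustrated with a small sketch. It assumes weights are stored as a dictionary of flat float lists and that a scalar `alpha` controls the steering strength; both are simplifications chosen for the example, and none of the names below come from the paper.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def delta(finetuned: Weights, base: Weights) -> Weights:
    """Weight change introduced by a fine-tune, per parameter tensor."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def behavior_direction(base: Weights, positive: Weights, negative: Weights) -> Weights:
    """Subtract the 'opposite behavior' delta from the 'desired behavior' delta."""
    d_pos = delta(positive, base)
    d_neg = delta(negative, base)
    return {k: [p - n for p, n in zip(d_pos[k], d_neg[k])] for k in base}


def steer(model: Weights, direction: Weights, alpha: float) -> Weights:
    """Add (alpha > 0) or remove (alpha < 0) the behavior direction."""
    return {k: [w + alpha * d for w, d in zip(model[k], direction[k])] for k in model}


# Toy single-layer "model"; all values are hypothetical.
base = {"layer0": [1.0, 2.0]}
positive = {"layer0": [1.5, 2.0]}  # fine-tune inducing the desired behavior
negative = {"layer0": [0.5, 2.0]}  # fine-tune inducing its opposite
direction = behavior_direction(base, positive, negative)
steered = steer(base, direction, alpha=0.5)
```

Because both deltas are taken relative to the same base model, the direction reduces to the difference between the two fine-tuned weight sets; keeping the deltas explicit mirrors the wording of the abstract.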

[NLP-3] Minority-Aware Satisfaction Estimation in Dialogue Systems via Preference-Adaptive Reinforcement Learning AACL2025

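[Quick read]: This paper addresses the one-size-fits-all problem in user satisfaction estimation for dialogue systems: existing alignment methods typically train a single model that aims for broad consensus, overlooking the individual differences and specific preferences of minority users and producing inaccurate satisfaction estimates for them. The solution is a unified framework that models preferences at both the individual and the group level. First, Chain-of-Personalized-Reasoning (CoPeR) captures individual preferences through interpretable reasoning chains. Second, Majority-Minority Preference-Aware Clustering (M2PC), an expectation-maximization-based algorithm, discovers distinct user groups without supervision and learns group-level preferences. Both are integrated into a preference-adaptive reinforcement learning framework (PAda-PPO) that jointly optimizes alignment with individual and group preferences, improving satisfaction estimation overall and particularly for underrepresented user groups.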

Link: https://arxiv.org/abs/2511.05407
Authors: Yahui Fu, Zi Haur Pang, Tatsuya Kawahara
Affiliations: Kyoto University
Categories: Computation and Language (cs.CL)
Comments: IJCNLP-AACL 2025 (Main)

Abstract:User satisfaction in dialogue systems is inherently subjective. When the same response strategy is applied across users, minority users may assign different satisfaction ratings than majority users due to variations in individual intents and preferences. However, existing alignment methods typically train one-size-fits-all models that aim for broad consensus, often overlooking minority perspectives and user-specific adaptation. We propose a unified framework that models both individual- and group-level preferences for user satisfaction estimation. First, we introduce Chain-of-Personalized-Reasoning (CoPeR) to capture individual preferences through interpretable reasoning chains. Second, we propose an expectation-maximization-based Majority-Minority Preference-Aware Clustering (M2PC) algorithm that discovers distinct user groups in an unsupervised manner to learn group-level preferences. Finally, we integrate these components into a preference-adaptive reinforcement learning framework (PAda-PPO) that jointly optimizes alignment with both individual and group preferences. Experiments on the Emotional Support Conversation dataset demonstrate consistent improvements in user satisfaction estimation, particularly for underrepresented user groups.

[NLP-4] Large Language Models for Explainable Threat Intelligence

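[Quick read]: This paper addresses the difficulty traditional security mechanisms have in keeping up with increasingly complex cyber threats, in particular the bottleneck in obtaining and analyzing threat intelligence. The solution is RAGRecon, a system based on retrieval-augmented generation (RAG) that uses large language models (LLMs) together with real-time information retrieval and domain-specific data to answer questions about cybersecurity threats. It also generates and visualizes a knowledge graph for every reply, which makes the model's reasoning more interpretable and lets security analysts see the connections the system drew from the retrieved context, improving the transparency and trustworthiness of decisions.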

Link: https://arxiv.org/abs/2511.05406
Authors: Tiago Dinis, Miguel Correia, Roger Tavares
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:As cyber threats continue to grow in complexity, traditional security mechanisms struggle to keep up. Large language models (LLMs) offer significant potential in cybersecurity due to their advanced capabilities in text processing and generation. This paper explores the use of LLMs with retrieval-augmented generation (RAG) to obtain threat intelligence by combining real-time information retrieval with domain-specific data. The proposed system, RAGRecon, uses an LLM with RAG to answer questions about cybersecurity threats. Moreover, it makes this form of Artificial Intelligence (AI) explainable by generating and visually presenting to the user a knowledge graph for every reply. This increases the transparency and interpretability of the reasoning of the model, allowing analysts to better understand the connections made by the system based on the context recovered by the RAG system. We evaluated RAGRecon experimentally with two datasets and seven different LLMs and the responses matched the reference responses more than 91% of the time for the best combinations.

[NLP-5] A multimodal multiplex of the mental lexicon for multilingual individuals

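[Quick read]: This paper asks how representations are organized and interact in the mental lexicon of multilingual individuals, with a focus on how a heritage language influences the acquisition of another language. The central question is whether visual input in a translation task affects participants' proficiency and accuracy compared with text-only conditions. The key idea is a multilayer network model that incorporates multimodal information: building on the Bilingual Interactive Activation (BIA+) framework, it adds a visual layer that connects visual inputs to their corresponding lexical representations across the multilingual layers, so as to model cross-modal integration in multilingual cognitive processing.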

Link: https://arxiv.org/abs/2511.05361
Authors: Maria Huynh, Wilder C. Rodrigues
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Historically, bilingualism was often perceived as an additional cognitive load that could hinder linguistic and intellectual development. However, over the last three decades, this view has changed considerably. Numerous studies have aimed to model and understand the architecture of the bilingual word recognition system (Dijkstra and van Heuven, 2002), investigating how parallel activation operates in the brain and how one language influences another (Kroll et al., 2015). Increasingly, evidence suggests that multilinguals, individuals who speak three or more languages, can perform better than monolinguals in various linguistic and cognitive tasks, such as learning an additional language (Abu-Rabia and Sanitsky, 2010). This research proposal focuses on the study of the mental lexicon and how it may be structured in individuals who speak multiple languages. Building on the work of Stella et al. (2018), who investigated explosive learning in humans using a multiplex model of the mental lexicon, and the Bilingual Interactive Activation (BIA+) framework proposed by Dijkstra and van Heuven (2002), the present study applies the same multilayer network principles introduced by Kivela et al. (2014). Our experimental design extends previous research by incorporating multimodality into the multiplex model, introducing an additional layer that connects visual inputs to their corresponding lexical representations across the multilingual layers of the mental lexicon. In this research, we aim to explore how a heritage language influences the acquisition of another language. Specifically, we ask: Does the presence of visual input in a translation task influence participants’ proficiency and accuracy compared to text-only conditions?

[NLP-6] ConVerse: Benchmarking Contextual Safety in Agent-to-Agent Conversations

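[Quick read]: This paper addresses the difficulty of assessing privacy and security risks in multi-agent ecosystems, in particular how interactions between personal assistants and external service providers can stay useful while guarding against information leakage and malicious attacks. The solution is ConVerse, a dynamic benchmark for evaluating privacy and security risks in agent-to-agent interactions. It covers three practical domains (travel, real estate, and insurance) with 12 user personas and over 864 contextually grounded attacks (611 privacy, 253 security), and it models autonomous multi-turn conversations in which malicious requests are embedded in plausible discourse. By unifying privacy and security within interactive multi-agent contexts, the benchmark shows that safety is an emergent property of communication, supporting a more systematic understanding and improvement of generative AI safety.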

Link: https://arxiv.org/abs/2511.05359
Authors: Amr Gomaa, Ahmed Salem, Sahar Abdelnabi
Affiliations: German Research Center for Artificial Intelligence (DFKI); Microsoft; ELLIS Institute Tübingen and MPI for Intelligent Systems; Tübingen AI Center
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:As language models evolve into autonomous agents that act and communicate on behalf of users, ensuring safety in multi-agent ecosystems becomes a central challenge. Interactions between personal assistants and external service providers expose a core tension between utility and protection: effective collaboration requires information sharing, yet every exchange creates new attack surfaces. We introduce ConVerse, a dynamic benchmark for evaluating privacy and security risks in agent-agent interactions. ConVerse spans three practical domains (travel, real estate, insurance) with 12 user personas and over 864 contextually grounded attacks (611 privacy, 253 security). Unlike prior single-agent settings, it models autonomous, multi-turn agent-to-agent conversations where malicious requests are embedded within plausible discourse. Privacy is tested through a three-tier taxonomy assessing abstraction quality, while security attacks target tool use and preference manipulation. Evaluating seven state-of-the-art models reveals persistent vulnerabilities; privacy attacks succeed in up to 88% of cases and security breaches in up to 60%, with stronger models leaking more. By unifying privacy and security within interactive multi-agent contexts, ConVerse reframes safety as an emergent property of communication.

[NLP-7] Evaluating Subword Tokenization Techniques for Bengali: A Benchmark Study with BengaliBPE

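[Quick read]: This paper addresses the poor performance of mainstream subword tokenizers (such as SentencePiece and HuggingFace BPE) on morphologically rich languages such as Bengali. The solution is BengaliBPE, a Byte Pair Encoding (BPE) tokenizer designed specifically for the Bengali script. It applies Unicode normalization, grapheme-level initialization, and morphology-aware merge rules to maintain linguistic consistency and preserve subword integrity, which yields finer-grained segmentation and better morphological interpretability.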

Link: https://arxiv.org/abs/2511.05324
Authors: Firoj Ahmmed Patwary, Abdullah Al Noman
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 3 figures, 3 tables

Abstract:Tokenization is an important first step in Natural Language Processing (NLP) pipelines because it decides how models learn and represent linguistic information. However, current subword tokenizers like SentencePiece or HuggingFace BPE are mostly designed for Latin or multilingual corpora and do not perform well on languages with rich morphology such as Bengali. To address this limitation, we present BengaliBPE, a Byte Pair Encoding (BPE) tokenizer specifically developed for the Bengali script. BengaliBPE applies Unicode normalization, grapheme-level initialization, and morphology-aware merge rules to maintain linguistic consistency and preserve subword integrity. We use a large-scale Bengali news classification dataset to compare BengaliBPE with three baselines: Whitespace, SentencePiece BPE, and HuggingFace BPE. The evaluation considers tokenization granularity, encoding speed, and downstream classification accuracy. While all methods perform reasonably well, BengaliBPE provides the most detailed segmentation and the best morphological interpretability, albeit with slightly higher computational cost. These findings highlight the importance of language-aware tokenization for morphologically rich scripts and establish BengaliBPE as a strong foundation for future Bengali NLP systems, including large-scale pretraining of contextual language models.
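
To make the first two ingredients named in the abstract concrete, here is a rough sketch of Unicode normalization, an approximate grapheme-level initialization, and one generic BPE merge step. It is an illustration under stated assumptions and not BengaliBPE itself: clusters are approximated by attaching combining marks (Unicode categories starting with `M`) to the preceding character, which does not handle conjuncts joined by a virama, and the morphology-aware merge rules are not modeled at all. All function names are hypothetical.

```python
import unicodedata
from collections import Counter
from typing import List, Tuple


def to_clusters(word: str) -> List[str]:
    """NFC-normalize, then attach combining marks to the preceding character.

    This is only an approximation of grapheme clusters (assumption of this sketch).
    """
    clusters: List[str] = []
    for ch in unicodedata.normalize("NFC", word):
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def most_frequent_pair(corpus: List[List[str]]) -> Tuple[str, str]:
    """Return the most frequent adjacent symbol pair across all words."""
    pairs: Counter = Counter()
    for symbols in corpus:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(symbols: List[str], pair: Tuple[str, str]) -> List[str]:
    """Replace every occurrence of the pair with a single merged symbol."""
    merged: List[str] = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged
```

Starting merges from clusters rather than raw code points keeps a vowel sign from being split away from its consonant, which is one plausible reading of "grapheme-level initialization". A production tokenizer would more likely rely on full extended grapheme cluster segmentation.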

[NLP-8] What Are the Facts? Automated Extraction of Court-Established Facts from Criminal-Court Opinions

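[Quick read]: This paper addresses the scarcity of information about criminal behavior in criminal justice administrative data by drawing on the detailed descriptions of criminal acts found in court verdicts in continental European jurisdictions. The core challenge is to extract these descriptions automatically from publicly available Slovak court decisions. Two complementary approaches are used. The first is an advanced regular-expression method that, beyond matching the typical words before and after a description, handles "sparing" (letter-spaced text, typical for delineating the description) and its normalization. The second prompts the Gemini Flash 2.0 large language model (LLM) with predefined instructions. Both clearly outperform the simple regular-expression baseline, raising the share of verdicts in which a description is identified from 40.5% to 97% and 98.75% respectively, and both match human annotations in about 90% of cases, which supports the feasibility of automated extraction.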

Link: https://arxiv.org/abs/2511.05320
Authors: Klára Bendová, Tomáš Knap, Jan Černý, Vojtěch Pour, Jaromir Savelka, Ivana Kvapilíková, Jakub Drápal
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Paper accepted to the proceedings of ASAIL 2025 Workshop under ICAIL conference for publication. Paper contains 6 pages (references included) and 2 appendices. It contains 8 tables, no figures

Abstract:Criminal justice administrative data contain only a limited amount of information about the committed offense. However, there is an unused source of extensive information in continental European courts’ decisions: descriptions of criminal behaviors in verdicts by which offenders are found guilty. In this paper, we study the feasibility of extracting these descriptions from publicly available court decisions from Slovakia. We use two different approaches for retrieval: regular expressions and large language models (LLMs). Our baseline was a simple method employing regular expressions to identify typical words occurring before and after the description. The advanced regular expression approach further focused on “sparing” and its normalization (insertion of spaces between individual letters), typical for delineating the description. The LLM approach involved prompting the Gemini Flash 2.0 model to extract the descriptions using predefined instructions. Although the baseline identified descriptions in only 40.5% of verdicts, both methods significantly outperformed it, achieving 97% with advanced regular expressions and 98.75% with LLMs, and 99.5% when combined. Evaluation by law students showed that both advanced methods matched human annotations in about 90% of cases, compared to just 34.5% for the baseline. LLMs fully matched human-labeled descriptions in 91.75% of instances, and a combination of advanced regular expressions with LLMs reached 92%.
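
The abstract mentions normalizing "sparing", that is, text written with spaces between individual letters. One way such a normalization step could look is sketched below. The pattern, the function name, and the sample phrase are assumptions for illustration; the paper's actual expressions are not given in the abstract.

```python
import re

# A run of at least three single letters separated by single spaces,
# not glued to a longer word on either side (assumed heuristic).
_SPACED_RUN = re.compile(r"(?<!\w)(?:\w ){2,}\w(?!\w)")


def collapse_letter_spacing(text: str) -> str:
    """Join letter-spaced runs such as 'u z n a n ý' back into one word."""
    return _SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), text)


# Illustrative phrase only; not taken from the paper's data.
sample = "je u z n a n ý vinným"
normalized = collapse_letter_spacing(sample)
```

A heuristic like this would also merge genuine sequences of one-letter words, and it cannot tell where one letter-spaced word ends and the next begins if both are separated by a single space, so a real pipeline would likely need additional rules.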

[NLP-9] Listening Between the Lines: Decoding Podcast Narratives with Language Modeling

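[Quick read]: This paper addresses the shortcomings of current large language models in analyzing the narrative structure of podcasts: existing methods struggle to capture the subtle cues human listeners rely on to identify narrative frames, so unscripted, multi-themed, conversational podcast content cannot be analyzed accurately at scale. The solution is a new fine-grained frame-labeling method in which a fine-tuned BERT model explicitly links narrative frames to specific entities mentioned in the conversation, grounding the abstract frame in concrete details. These granular frame labels are then correlated with high-level topics to reveal the systematic relationship between what is discussed and how it is presented, providing a more robust framework for studying influence in digital media.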

Link: https://arxiv.org/abs/2511.05310
Authors: Shreya Gupta, Ojasva Saxena, Arghodeep Nandi, Sarah Masud, Kiran Garimella, Tanmoy Chakraborty
Affiliations: Indian Institute of Technology Delhi; University of Copenhagen; Rutgers University
Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: 10 pages, 6 Figures, 5 Tables. Under review at IEEE TCSS

Abstract:Podcasts have become a central arena for shaping public opinion, making them a vital source for understanding contemporary discourse. Their typically unscripted, multi-themed, and conversational style offers a rich but complex form of data. To analyze how podcasts persuade and inform, we must examine their narrative structures – specifically, the narrative frames they employ. The fluid and conversational nature of podcasts presents a significant challenge for automated analysis. We show that existing large language models, typically trained on more structured text such as news articles, struggle to capture the subtle cues that human listeners rely on to identify narrative frames. As a result, current approaches fall short of accurately analyzing podcast narratives at scale. To solve this, we develop and evaluate a fine-tuned BERT model that explicitly links narrative frames to specific entities mentioned in the conversation, effectively grounding the abstract frame in concrete details. Our approach then uses these granular frame labels and correlates them with high-level topics to reveal broader discourse trends. The primary contributions of this paper are: (i) a novel frame-labeling methodology that more closely aligns with human judgment for messy, conversational data, and (ii) a new analysis that uncovers the systematic relationship between what is being discussed (the topic) and how it is being presented (the frame), offering a more robust framework for studying influence in digital media.

[NLP-10] QUESTER: Query Specification for Generative Retrieval

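[Quick read]: This paper addresses two limitations of Generative Retrieval (GR): poor generalization and high scaling cost. The core solution is QUESTER (QUEry SpecificaTion gEnerative Retrieval), which reframes GR as query specification generation: a (small) large language model (LLM) generates a query specification, which in this work is a simple keyword query handled by BM25. The policy is trained with reinforcement learning (GRPO). Across in-domain and out-of-domain evaluations, the model is more effective than BM25 and competitive with neural information retrieval (Neural IR) models while remaining efficient.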

Link: https://arxiv.org/abs/2511.05301
Authors: Arthur Satouf, Yuxuan Zong, Habiboulaye Amadou-Boubacar, Pablo Piantanida, Benjamin Piwowarski
Affiliations: Université de Paris; Institut national de recherche en informatique et en automatique; Centre National de la Recherche Scientifique; École Polytechnique; INRIA
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Generative Retrieval (GR) differs from the traditional index-then-retrieve pipeline by storing relevance in model parameters and directly generating document identifiers. However, GR often struggles to generalize and is costly to scale. We introduce QUESTER (QUEry SpecificaTion gEnerative Retrieval), which reframes GR as query specification generation (in this work, a simple keyword query handled by BM25) using a (small) LLM. The policy is trained using reinforcement learning techniques (GRPO). Across in- and out-of-domain evaluations, we show that our model is more effective than BM25, and competitive with neural IR models, while maintaining good efficiency.
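
Since the generated query specification is executed by BM25, a compact sketch of BM25 scoring may help readers place the retrieval side of the pipeline. This is a generic textbook-style implementation with a non-negative IDF variant, not code from the paper; the parameter defaults `k1=1.2` and `b=0.75` are common choices assumed here, and the hand-written keyword query merely stands in for what the LLM policy would generate.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: List[str],
    docs: List[List[str]],
    k1: float = 1.2,
    b: float = 0.75,
) -> List[float]:
    """Score each tokenized document against a keyword query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical corpus and keyword query (stand-in for an LLM-generated one).
docs = [["generative", "retrieval", "model"], ["cooking", "recipes"]]
scores = bm25_scores(["generative", "retrieval"], docs)
```

Keeping the retriever as plain BM25 means only the query text changes between a user's original request and the model's rewritten specification, so the index itself needs no retraining.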

[NLP-11] Language Generation and Identification From Partial Enumeration: Tight Density Bounds and Topological Characterizations

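[Quick read]: This paper addresses the optimal density bound in the framework of language generation in the limit: an unknown language $K$ is drawn from a countable class, an adversary enumerates its strings, and an algorithm must generate unseen strings from $K$ while trading off correctness against coverage. Prior work showed that generation is always possible and that some algorithms achieve positive lower density, revealing a validity–breadth trade-off. The first contribution is a tight bound of $1/2$ on the best achievable lower density. The model is then strengthened to partial enumeration, where the adversary reveals only an infinite subset $C \subseteq K$: if $C$ has lower density $\alpha$ in $K$, the generator's output achieves density at least $\alpha/2$, matching the upper bound and extending the factor $1/2$ to the partial-information setting. The paper also revisits the classical Gold–Angluin model of language identification under partial enumeration, characterizes when identification in the limit is possible (when hypotheses $M_t$ eventually satisfy $C \subseteq M \subseteq K$), and gives a new topological formulation showing that Angluin's condition is equivalent to an appropriate topological space having the $T_D$ separation property.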

Link: https://arxiv.org/abs/2511.05295
Authors: Jon Kleinberg, Fan Wei
Affiliations: Cornell University; Duke University
Categories: Data Structures and Algorithms (cs.DS); Computation and Language (cs.CL); Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
Comments:

Abstract:The success of large language models (LLMs) has motivated formal theories of language generation and learning. We study the framework of \emph{language generation in the limit}, where an adversary enumerates strings from an unknown language $K$ drawn from a countable class, and an algorithm must generate unseen strings from $K$. Prior work showed that generation is always possible, and that some algorithms achieve positive lower density, revealing a \emph{validity–breadth} trade-off between correctness and coverage. We resolve a main open question in this line, proving a tight bound of $1/2$ on the best achievable lower density. We then strengthen the model to allow \emph{partial enumeration}, where the adversary reveals only an infinite subset $C \subseteq K$. We show that generation in the limit remains achievable, and if $C$ has lower density $\alpha$ in $K$, the algorithm’s output achieves density at least $\alpha/2$, matching the upper bound. This generalizes the $1/2$ bound to the partial-information setting, where the generator must recover within a factor $1/2$ of the revealed subset’s density. We further revisit the classical Gold–Angluin model of \emph{language identification} under partial enumeration. We characterize when identification in the limit is possible – when hypotheses $M_t$ eventually satisfy $C \subseteq M \subseteq K$ – and in the process give a new topological formulation of Angluin’s characterization, showing that her condition is precisely equivalent to an appropriate topological space having the $T_D$ separation property.
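
For readers unfamiliar with the term, the usual notion of lower density can be written out as below. The abstract does not spell out the definition, so this is the standard asymptotic form taken relative to a fixed enumeration of $K$; the notation $\underline{d}(A, K)$ and the reading of "density" in the partial-enumeration result as lower density are assumptions of this note.

```latex
% Assumed notation: K = \{k_1, k_2, \dots\} in a fixed enumeration order,
% and A \subseteq K is the set of strings the algorithm outputs.
\[
  \underline{d}(A, K) \;=\; \liminf_{n \to \infty}
  \frac{\lvert A \cap \{k_1, \dots, k_n\} \rvert}{n}
\]
% Results as stated in the abstract, in this notation:
%   full enumeration:    the best achievable \underline{d}(A, K) is 1/2 (tight)
%   partial enumeration: \underline{d}(C, K) = \alpha \implies \underline{d}(A, K) \ge \alpha / 2
```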

[NLP-12] Reflective Personalization Optimization: A Post-hoc Rewriting Framework for Black-Box Large Language Models

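[Quick read]: This paper addresses a core tension in personalizing black-box large language models (LLMs): existing context-injection methods require the model, in a single step, both to generate accurate content and to match a user's specific style, which compromises output quality and limits precise control. The solution is Reflective Personalization Optimization (RPO), a framework that decouples content generation from personalization alignment into two stages. A base model first generates a high-quality generic response, and an external reflection module then explicitly rewrites that output to match the user's preferences. The reflection module is trained in two stages: supervised fine-tuning on structured rewriting trajectories establishes a core personalized reasoning policy, and reinforcement learning then further improves the quality of the personalized outputs.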

Link: https://arxiv.org/abs/2511.05286
Authors: Teqi Hao, Xioayu Tan, Shaojie Shi, Yinghui Xu, Xihe Qiu
Affiliations: Shanghai University of Engineering Science; Tencent Youtu Lab; Fudan University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The personalization of black-box large language models (LLMs) is a critical yet challenging task. Existing approaches predominantly rely on context injection, where user history is embedded into the prompt to directly guide the generation process. However, this single-step paradigm imposes a dual burden on the model: generating accurate content while simultaneously aligning with user-specific styles. This often results in a trade-off that compromises output quality and limits precise control. To address this fundamental tension, we propose Reflective Personalization Optimization (RPO), a novel framework that redefines the personalization paradigm by decoupling content generation from alignment. RPO operates in two distinct stages: first, a base model generates a high-quality, generic response; then, an external reflection module explicitly rewrites this output to align with the user’s preferences. This reflection module is trained using a two-stage process. Initially, supervised fine-tuning is employed on structured rewriting trajectories to establish a core personalized reasoning policy that models the transformation from generic to user-aligned responses. Subsequently, reinforcement learning is applied to further refine and enhance the quality of the personalized outputs. Comprehensive experiments on the LaMP benchmark demonstrate that RPO, by decoupling content generation from personalization, significantly outperforms state-of-the-art baselines. These findings underscore the superiority of explicit response shaping over implicit context injection. Moreover, RPO introduces an efficient, model-agnostic personalization layer that can be seamlessly integrated with any underlying base model, paving the way for a new and effective direction in user-centric generation scenarios.

[NLP-13] Translation via Annotation: A Computational Study of Translating Classical Chinese into Japanese

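[Quick read]: This paper addresses the low-resource problem in translating classical Chinese into Japanese through character-level annotation, where little annotated data is available. The key solution is an annotation pipeline based on large language models (LLMs) and a new dataset built from digitized open-source translation data. The authors also show that adding auxiliary Chinese natural language processing (NLP) tasks improves the training of the sequence tagging tasks in the low-resource setting. LLMs score highly on direct machine translation but struggle when asked to annotate characters, so the proposed method can serve as a supplement to them.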

Link: https://arxiv.org/abs/2511.05239
Authors: Zilong Li, Jie Cao
Affiliations: University of Colorado, Boulder; The University of Oklahoma
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Ancient people translated classical Chinese into Japanese by annotating around each character. We abstract this process as sequence tagging tasks and fit them into modern language technologies. Research on this annotation and translation system faces a low-resource problem. We alleviate this problem by introducing an LLM-based annotation pipeline and constructing a new dataset from digitized open-source translation data. We show that under the low-resource setting, introducing auxiliary Chinese NLP tasks has a promoting effect on the training of sequence tagging tasks. We also evaluate the performance of large language models. They achieve high scores in direct machine translation, but they are confused when asked to annotate characters. Our method could work as a supplement to LLMs.

[NLP-14] Effectiveness of Chain-of-Thought in Distilling Reasoning Capability from Large Language Models

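[Quick read]: This paper examines how Knowledge Distillation (KD) can transfer the reasoning capability of large language models (LLMs) to smaller models, focusing on the role of Chain-of-Thought (CoT) in white-box KD. The key idea is to use CoT data within a white-box KD framework so that the smaller model learns from the larger model's reasoning process. Experiments indicate that this improves the distilled models' average performance on the natural language reasoning and understanding tasks of the BIG-Bench-Hard (BBH) benchmark.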

Link: https://arxiv.org/abs/2511.05184
Authors: Cong-Thanh Do, Rama Doddipatla, Kate Knill
Affiliations: Toshiba Europe Ltd.; University of Cambridge
Categories: Computation and Language (cs.CL)
Comments: In proceedings of the 18th International Natural Language Generation Conference (INLG 2025)

Abstract:Chain-of-Thought (CoT) prompting is a widely used method to improve the reasoning capability of Large Language Models (LLMs). More recently, CoT has been leveraged in Knowledge Distillation (KD) to transfer reasoning capability from a larger LLM to a smaller one. This paper examines the role of CoT in distilling the reasoning capability from larger LLMs to smaller LLMs using white-box KD, analysing its effectiveness in improving the performance of the distilled models for various natural language reasoning and understanding tasks. We conduct white-box KD experiments using LLMs from the Qwen and Llama2 families, employing CoT data from the CoT-Collection dataset. The distilled models are then evaluated on natural language reasoning and understanding tasks from the BIG-Bench-Hard (BBH) benchmark, which presents complex challenges for smaller LLMs. Experimental results demonstrate the role of CoT in improving white-box KD effectiveness, enabling the distilled models to achieve better average performance in natural language reasoning and understanding tasks from BBH.
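
White-box KD generally means the student is trained against the teacher's output distributions rather than only its generated text. A minimal sketch of one common form of that objective, a temperature-softened forward KL divergence at a single token position, is shown below. The abstract does not state the exact loss used in the paper, so the choice of forward KL, the temperature value, and the function names are all assumptions.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(
    teacher_logits: List[float],
    student_logits: List[float],
    temperature: float = 2.0,
) -> float:
    """Forward KL(teacher || student) over the vocabulary at one position."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits over a three-token vocabulary.
same = kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = kd_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

With CoT data, a loss of this kind would be summed over the tokens of the rationale as well as the final answer, which is one plausible way the reasoning trace could influence the student.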

[NLP-15] Mind the Gap… or Not? How Translation Errors and Evaluation Details Skew Multilingual Results

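[Quick read]: This paper examines the "language gap" reported when evaluating the math capabilities of large language models (LLMs) across languages, that is, a non-negligible and consistent difference in performance between languages. The gap appears for low-resource languages (such as Swahili and Telugu) and for high-resource languages (such as German and Chinese) alike, which at first suggests a weakness in cross-lingual generalization. By analyzing a standard multilingual math benchmark (MGSM), however, the authors find translation errors in the data and show that the lack of standardized answer extraction from model outputs further distorts the results. The solution has two parts: an automatic quality assurance method that addresses the translation errors at scale, and recommendations for standardizing answer extraction. With both applied, the observed language gap mostly disappears, indicating that the earlier conclusion stemmed largely from data and evaluation issues and not from real differences in model capability.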

Link: https://arxiv.org/abs/2511.05162
Authors: Jan-Thorsten Peter, David Vilar, Tobias Domhan, Dan Malkin, Markus Freitag
Affiliations: Google
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Most current large language models (LLMs) support a wide variety of languages in addition to English, including high-resource languages (e.g. German, Chinese, French), as well as low-resource ones (e.g. Swahili, Telugu). In addition they have also shown impressive capabilities in different domains, like coding, science and math. In this short paper, taking math as an example domain, we study the performance of different LLMs across languages. Experimental results show that there exists a non-negligible and consistent gap in the performance of the models across languages. Interestingly, and somewhat against expectations, the gap exists for both high- and low-resource languages. We hope that these results influence further research into cross-lingual capability generalization for next generation LLMs. If it weren’t for the fact that they are false! By analyzing one of the standard multilingual math benchmarks (MGSM), we determine that several translation errors are present in the data. Furthermore, the lack of standardized answer extraction from LLM outputs further influences the final results. We propose a method for automatic quality assurance to address the first issue at scale, and give recommendations to address the second one. Combining these two approaches we show that the aforementioned language gap mostly disappears, leading to completely different conclusions from our research. We additionally release the corrected dataset to the community.
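
To illustrate why answer extraction details matter, here is a sketch of one simple extraction rule: take the last number in the model output and strip thousands separators. This is only an example of the kind of rule that would need to be standardized; it is not the authors' recommendation, and the function name and sample strings are made up.

```python
import re
from typing import Optional

# Optional sign, digits with optional comma separators, optional decimal part.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(output: str) -> Optional[float]:
    """Return the last number in the text, or None if there is none."""
    matches = _NUMBER.findall(output)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


# Hypothetical model outputs.
answer = extract_final_number("She has 3 bags of 12, so the answer is 1,234.")
missing = extract_final_number("I cannot solve this.")
```

A rule like this assumes English-style number formatting; an output that uses a decimal comma or non-Latin digits would be parsed incorrectly, which is exactly the kind of language-dependent evaluation artifact the abstract warns about.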

[NLP-16] ManufactuBERT: Efficient Continual Pretraining for Manufacturing LREC2026

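Quick read: This paper addresses the performance drop of general-purpose Transformer encoders in specialized domains such as manufacturing, which stems from insufficient exposure to domain-specific terminology and semantics. The key to the solution is ManufactuBERT, a RoBERTa model continually pretrained on a large-scale manufacturing corpus. The corpus is built from web data with a complete processing pipeline: an initial domain-specific filtering step followed by multi-stage deduplication that removes redundant content. Experiments show that ManufactuBERT sets a new state of the art on a range of manufacturing-related NLP tasks, and that training on the deduplicated corpus converges faster, cutting training time and computational cost by 33% relative to the non-deduplicated dataset.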

Link: https://arxiv.org/abs/2511.05135
Authors: Robin Armingaud, Romaric Besançon
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Submitted to LREC 2026

Abstract:While large general-purpose Transformer-based encoders excel at general language understanding, their performance diminishes in specialized domains like manufacturing due to a lack of exposure to domain-specific terminology and semantics. In this paper, we address this gap by introducing ManufactuBERT, a RoBERTa model continually pretrained on a large-scale corpus curated for the manufacturing domain. We present a comprehensive data processing pipeline to create this corpus from web data, involving an initial domain-specific filtering step followed by a multi-stage deduplication process that removes redundancies. Our experiments show that ManufactuBERT establishes a new state-of-the-art on a range of manufacturing-related NLP tasks, outperforming strong specialized baselines. More importantly, we demonstrate that training on our carefully deduplicated corpus significantly accelerates convergence, leading to a 33% reduction in training time and computational cost compared to training on the non-deduplicated dataset. The proposed pipeline offers a reproducible example for developing high-performing encoders in other specialized domains. We will release our model and curated corpus at this https URL.

[NLP-17] A Toolbox for Improving Evolutionary Prompt Search

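Quick read: This paper addresses the lack of robust operators and efficient evaluation mechanisms in existing evolutionary prompt optimization methods, with the aim of improving both the quality and the efficiency of prompt optimization for large language models (LLMs). The solution rests on four improvements: 1) decomposing evolution into distinct steps to enhance the evolution and its control; 2) introducing an LLM-based judge to verify the evolutions; 3) integrating human feedback to refine the evolutionary operator; and 4) designing more efficient evaluation strategies that maintain performance while reducing computational overhead.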

Link: https://arxiv.org/abs/2511.05120
Authors: Daniel Grießhaber, Maximilian Kimmich, Johannes Maucher, Ngoc Thang Vu
Affiliations: Institute for Applied Artificial Intelligence (IAAI), Stuttgart Media University; Institute for Natural Language Processing (IMS), University of Stuttgart
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Evolutionary prompt optimization has demonstrated effectiveness in refining prompts for LLMs. However, existing approaches lack robust operators and efficient evaluation mechanisms. In this work, we propose several key improvements to evolutionary prompt optimization that can partially generalize to prompt optimization in general: 1) decomposing evolution into distinct steps to enhance the evolution and its control, 2) introducing an LLM-based judge to verify the evolutions, 3) integrating human feedback to refine the evolutionary operator, and 4) developing more efficient evaluation strategies that maintain performance while reducing computational overhead. Our approach improves both optimization quality and efficiency. We release our code, enabling prompt optimization on new tasks and facilitating further research in this area.

[NLP-18] Iterative Layer-wise Distillation for Efficient Compression of Large Language Models

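Quick read: This paper addresses the high resource consumption that large language models (LLMs) incur in deployment because of their parameter counts, with the goal of building compact models whose performance loss stays controlled. The key to the solution is a distillation method, building on ShortGPT, that iteratively evaluates layer importance: at each step it measures the performance degradation caused by removing individual Transformer layers on a set of representative datasets, then further trains the pruned model with a joint loss based on KL divergence and mean squared error. On Qwen2.5-3B the layer count drops from 36 to 28 (a 2.47 billion parameter model) with a 9.7% quality loss, and to 24 layers with an 18% loss. The findings suggest that the middle Transformer layers contribute less to inference, and they point to a practical route to deployment in resource-limited settings.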

Link: https://arxiv.org/abs/2511.05085
Authors: Grigory Kovalev, Mikhail Tikhomirov
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:This work investigates distillation methods for large language models (LLMs) with the goal of developing compact models that preserve high performance. Several existing approaches are reviewed, with a discussion of their respective strengths and limitations. An improved method based on the ShortGPT approach has been developed, building upon the idea of incorporating iterative evaluation of layer importance. At each step, importance is assessed by measuring performance degradation when individual layers are removed, using a set of representative datasets. This process is combined with further training using a joint loss function based on KL divergence and mean squared error. Experiments on the Qwen2.5-3B model show that the number of layers can be reduced from 36 to 28 (resulting in a 2.47 billion parameter model) with only a 9.7% quality loss, and to 24 layers with an 18% loss. The findings suggest that the middle transformer layers contribute less to inference, underscoring the potential of the proposed method for creating efficient models. The results demonstrate the effectiveness of iterative distillation and fine-tuning, making the approach suitable for deployment in resource-limited settings.
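
The procedure described in the abstract (score each layer by the degradation its removal causes, drop the least important one, then continue training with a joint KL plus MSE loss) can be sketched roughly as follows. This is a toy illustration under stated assumptions: the `evaluate` callback, the loss weights `alpha` and `beta`, and the choice to apply MSE to hidden states are all hypothetical, since the abstract does not specify them.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) over the two softmax distributions."""
    p, q = softmax(teacher_logits), softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mse(teacher_hidden, student_hidden):
    """Mean squared error; applying it to hidden states is an assumption."""
    pairs = zip(teacher_hidden, student_hidden)
    return sum((t - s) ** 2 for t, s in pairs) / len(teacher_hidden)


def joint_loss(t_logits, s_logits, t_hidden, s_hidden, alpha=1.0, beta=1.0):
    # alpha and beta are hypothetical weights, not values from the paper.
    return alpha * kl_divergence(t_logits, s_logits) + beta * mse(t_hidden, s_hidden)


def least_important_layer(layers, evaluate):
    """Return the layer whose removal degrades the evaluation score least."""
    base = evaluate(layers)
    drops = {}
    for layer in layers:
        remaining = [l for l in layers if l != layer]
        drops[layer] = base - evaluate(remaining)
    return min(drops, key=drops.get)


def iterative_prune(layers, evaluate, target):
    """Remove one layer per step until `target` layers remain."""
    layers = list(layers)
    while len(layers) > target:
        layers.remove(least_important_layer(layers, evaluate))
        # In the described method, fine-tuning with the joint loss follows each step.
    return layers
```

Here `evaluate` stands in for scoring a model built from the given layers on representative datasets; in practice each call is a full evaluation run, which is why the number of candidate layers per step matters for cost.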

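[NLP-19] On Text Simplification Metrics and General-Purpose LLMs for Accessible Health Information and A Potential Architectural Advantage of The Instruction-Tuned LLM class

Quick read: This paper addresses how to automatically turn complex scientific documents into plain language to meet the public's growing demand for health information. The core challenge is balancing gains in readability against preservation of discourse fidelity. The study compares two classes of general-purpose LLMs on text simplification: the instruction-tuned Mistral 24B and the reasoning-augmented Qwen2.5 32B. Mistral improves readability (mean SARI 42.46) while preserving human-level discourse (BERTScore 0.91), whereas Qwen2.5 reaches a statistically significantly lower BERTScore of 0.89, which points to a potential architectural advantage of the instruction-tuned class for this task.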

Link: https://arxiv.org/abs/2511.05080
Authors: P. Bilha Githinji, Aikaterini Meilliou, Peiwu Qin
Affiliations: Tsinghua University; Tsinghua-Berkeley Shenzhen Institute; Shenzhen
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The increasing health-seeking behavior and digital consumption of biomedical information by the general public necessitate scalable solutions for automatically adapting complex scientific and technical documents into plain language. Automatic text simplification solutions, including advanced large language models, however, continue to face challenges in reliably arbitrating the tension between optimizing readability performance and ensuring preservation of discourse fidelity. This report empirically assesses the performance of two major classes of general-purpose LLMs, demonstrating their linguistic capabilities and foundational readiness for the task compared to a human benchmark. Using a comparative analysis of the instruction-tuned Mistral 24B and the reasoning-augmented QWen2.5 32B, we identify a potential architectural advantage in the instruction-tuned LLM. Mistral exhibits a tempered lexical simplification strategy that enhances readability across a suite of metrics and the simplification-specific formula SARI (mean 42.46), while preserving human-level discourse with a BERTScore of 0.91. QWen also attains enhanced readability performance, but its operational strategy shows a disconnect in balancing between readability and accuracy, reaching a statistically significantly lower BERTScore of 0.89. Additionally, a comprehensive correlation analysis of 21 metrics spanning readability, discourse fidelity, content safety, and underlying distributional measures for mechanistic insights, confirms strong functional redundancies among five readability indices. This empirical evidence tracks baseline performance of the evolving LLMs for the task of text simplification, identifies the instruction-tuned Mistral 24B for simplification, provides necessary heuristics for metric selection, and points to lexical support as a primary domain-adaptation issue for simplification.

[NLP-20] Wikipedia-based Datasets in Russian Information Retrieval Benchmark RusBEIR

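Quick read: This paper addresses the shortage of Russian information retrieval (IR) resources, in particular the lack of high-quality, finely annotated datasets for tasks such as fact-checking, retrieval-augmented generation, and full-document retrieval. The key to the solution is a new series of Russian retrieval datasets built from the "Did you know..." section of Russian Wikipedia, in which the referenced articles are annotated at the sentence level with graded relevance. The datasets extend existing Russian IR resources and provide a benchmark for comparing lexical models such as BM25 with neural architectures fine-tuned for Russian and with multilingual models. Experiments show that lexical methods tend to do better on full-document retrieval, while neural methods better capture lexical semantics on shorter texts such as fact-checking; combining retrieval with neural reranking consistently improves results.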

Link: https://arxiv.org/abs/2511.05079
Authors: Grigory Kovalev, Natalia Loukachevitch, Mikhail Tikhomirov, Olga Babina, Pavel Mamaev
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

Abstract:In this paper, we present a novel series of Russian information retrieval datasets constructed from the “Did you know…” section of Russian Wikipedia. Our datasets support a range of retrieval tasks, including fact-checking, retrieval-augmented generation, and full-document retrieval, by leveraging interesting facts and their referenced Wikipedia articles annotated at the sentence level with graded relevance. We describe the methodology for dataset creation that enables the expansion of existing Russian Information Retrieval (IR) resources. Through extensive experiments, we extend the RusBEIR research by comparing lexical retrieval models, such as BM25, with state-of-the-art neural architectures fine-tuned for Russian, as well as multilingual models. Results of our experiments show that lexical methods tend to outperform neural models on full-document retrieval, while neural approaches better capture lexical semantics in shorter texts, such as in fact-checking or fine-grained retrieval. Using our newly created datasets, we also analyze the impact of document length on retrieval performance and demonstrate that combining retrieval with neural reranking consistently improves results. Our contribution expands the resources available for Russian information retrieval research and highlights the importance of accurate evaluation of retrieval models to achieve optimal performance. All datasets are publicly available at HuggingFace. To facilitate reproducibility and future research, we also release the full implementation on GitHub.
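
The combination the abstract reports as consistently helpful, lexical retrieval followed by neural reranking, can be sketched as a two-stage pipeline. The code below is a minimal illustration, not the authors' implementation: it uses a plain whitespace tokenizer, a common BM25 variant with assumed defaults `k1=1.5` and `b=0.75`, and a hypothetical `rerank_score` callable standing in for a neural reranker.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a common BM25 variant."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve_then_rerank(query, docs, rerank_score, top_k=2):
    """Stage 1: keep the BM25 top-k. Stage 2: reorder them with `rerank_score`."""
    scores = bm25_scores(query, docs)
    candidates = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:top_k]
    return sorted(candidates, key=lambda i: rerank_score(query, docs[i]), reverse=True)
```

The function returns document indices. Length normalization through `b` is the part of BM25 most relevant to the abstract's analysis of document length: longer documents need more term matches to reach the same score.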

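[NLP-21] Reasoning-Guided Claim Normalization for Noisy Multilingual Social Media Posts

Quick read: This paper addresses claim normalization for multilingual misinformation detection: turning noisy social media posts into clear, verifiable statements across 20 languages. The key to the solution is systematically decomposing each post with Who, What, Where, When, Why, and How questions, which enables robust cross-lingual transfer even though training uses English data only. The method fine-tunes Qwen3-14B with LoRA after intra-post deduplication, applies token-level recall filtering for semantic alignment, and uses retrieval-augmented few-shot learning with contextual examples at inference. METEOR scores range from 41.16 (English) to 15.21 (Marathi), a 41.3% relative improvement over baseline configurations, with the most effective generalization on Romance and Germanic languages.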

Link: https://arxiv.org/abs/2511.05078
Authors: Manan Sharma, Arya Suneesh, Manish Jain, Pawan Kumar Rajpoot, Prasanna Devadiga, Bharatdeep Hazarika, Ashish Shrivastava, Kishan Gurumurthy, Anshuman B Suresh, Aditya U Baliga
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We address claim normalization for multilingual misinformation detection - transforming noisy social media posts into clear, verifiable statements across 20 languages. The key contribution demonstrates how systematic decomposition of posts using Who, What, Where, When, Why and How questions enables robust cross-lingual transfer despite training exclusively on English data. Our methodology incorporates finetuning Qwen3-14B using LoRA with the provided dataset after intra-post deduplication, token-level recall filtering for semantic alignment and retrieval-augmented few-shot learning with contextual examples during inference. Our system achieves METEOR scores ranging from 41.16 (English) to 15.21 (Marathi), securing third rank on the English leaderboard and fourth rank for Dutch and Punjabi. The approach shows 41.3% relative improvement in METEOR over baseline configurations and substantial gains over existing methods. Results demonstrate effective cross-lingual generalization for Romance and Germanic languages while maintaining semantic coherence across diverse linguistic structures.

[NLP-22] Order-Level Attention Similarity Across Language Models: A Latent Commonality NEURIPS2025

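Quick read: This paper asks whether the context aggregation patterns of different language models (LMs) share commonalities. Prior work mostly analyzed individual models or attention heads and lacked a systematic comparison across models. The key to the solution is Order-Level Attention (OLA), obtained by an order-wise decomposition of Attention Rollout; OLA at the same order turns out to be significantly similar across LMs, and there is an implicit mapping between OLA and syntactic knowledge. Building on these two findings, the authors propose the Transferable OLA Adapter (TOA): an adapter trained on OLA as a unified syntactic feature representation, which then generalizes to unseen LMs without any parameter updates and improves their performance.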

Link: https://arxiv.org/abs/2511.05064
Authors: Jinglin Liang, Jin Zhong, Shuangping Huang, Yunqing Hu, Huiyuan Zhang, Huifang Li, Lixin Fan, Hanlin Gu
Affiliations: South China University of Technology; Pazhou Laboratory; Zhuzhou CRRC Times Electric Co.; China Telecom Research Institute; WeBank; The Hong Kong University of Science and Technology
Categories: Computation and Language (cs.CL)
Comments: Accepted by NeurIPS 2025

Abstract:In this paper, we explore an important yet previously neglected question: Do context aggregation patterns across Language Models (LMs) share commonalities? While some works have investigated context aggregation or attention weights in LMs, they typically focus on individual models or attention heads, lacking a systematic analysis across multiple LMs to explore their commonalities. In contrast, we focus on the commonalities among LMs, which can deepen our understanding of LMs and even facilitate cross-model knowledge transfer. In this work, we introduce the Order-Level Attention (OLA) derived from the order-wise decomposition of Attention Rollout and reveal that the OLA at the same order across LMs exhibits significant similarities. Furthermore, we discover an implicit mapping between OLA and syntactic knowledge. Based on these two findings, we propose the Transferable OLA Adapter (TOA), a training-free cross-LM adapter transfer method. Specifically, we treat the OLA as a unified syntactic feature representation and train an adapter that takes OLA as input. Due to the similarities in OLA across LMs, the adapter generalizes to unseen LMs without requiring any parameter updates. Extensive experiments demonstrate that TOA’s cross-LM generalization effectively enhances the performance of unseen LMs. Code is available at this https URL.

[NLP-23] UA-Code-Bench: A Competitive Programming Benchmark for Evaluating LLM Code Generation in Ukrainian

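Quick read: This paper addresses the lack of evaluation of the code generation and competitive programming abilities of large language models (LLMs) in low-resource languages such as Ukrainian. Existing benchmarks mostly rely on tasks translated from English or test only simple language understanding, so they do not reflect real performance in low-resource settings. The key to the solution is UA-Code-Bench, an open-source Ukrainian benchmark of 500 problems from the Eolymp platform, evenly distributed across five difficulty levels from very easy to very hard. Python solutions generated by 13 leading proprietary and open-source models from a one-shot prompt are checked against hidden tests in the dedicated Eolymp environment. Beyond accuracy per difficulty level, the study assesses solution uniqueness and computational efficiency (elapsed time and memory consumption). Even the top-performing models, such as OpenAI o3 and GPT-5, solve only half of the problems, which highlights how hard code generation remains in a low-resource natural language.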

Link: https://arxiv.org/abs/2511.05040
Authors: Mykyta Syromiatnikov, Victoria Ruvinskaya
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 8 pages, 5 figures. XI International conference “Informatics. Culture. Technique.” (2025)

Abstract:Evaluating the real capabilities of large language models in low-resource languages still represents a challenge, as many existing benchmarks focus on widespread tasks translated from English or evaluate only simple language understanding. This paper introduces UA-Code-Bench, a new open-source benchmark established for a thorough evaluation of language models’ code generation and competitive programming problem-solving abilities in Ukrainian. The benchmark comprises 500 problems from the Eolymp platform, evenly distributed across five complexity levels from very easy to very hard. A diverse set of 13 leading proprietary and open-source models, generating Python solutions based on a one-shot prompt, was evaluated via the dedicated Eolymp environment against hidden tests, ensuring code correctness. The obtained results reveal that even top-performing models, such as OpenAI o3 and GPT-5, solve only half of the problems, highlighting the challenge of code generation in low-resource natural language. Furthermore, this research presents a comprehensive analysis of performance across various difficulty levels, as well as an assessment of solution uniqueness and computational efficiency, measured by both elapsed time and memory consumption of the generated solutions. In conclusion, this work demonstrates the value of competitive programming benchmarks in evaluating large language models, especially in underrepresented languages. It also paves the way for future research on multilingual code generation and reasoning-enhanced models. The benchmark, data parsing, preparation, code generation, and evaluation scripts are available at this https URL.

[NLP-24] Pluralistic Behavior Suite: Stress-Testing Multi-Turn Adherence to Custom Behavioral Policies NEURIPS2025

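Quick read: This paper addresses a core challenge for large language models (LLMs) deployed inside organizations: a single universal set of safety and usage principles does not fit the differing values and behavioral rules of individual companies, industries, and regulatory settings. The authors present the Pluralistic Behavior Suite (PBSUITE), a dynamic evaluation suite consisting of a dataset of 300 realistic behavioral policies grounded in 30 industries and an adversarial framework for stress-testing compliance with custom behavioral specifications in multi-turn conversations. Leading models adhere robustly in single-turn settings (failure rates below 4%), but compliance weakens substantially in multi-turn adversarial interactions (failure rates up to 84%), showing that current alignment methods fall short of context-aware pluralistic alignment.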

Link: https://arxiv.org/abs/2511.05018
Authors: Prasoon Varshney, Makesh Narsimhan Sreedhar, Liwei Jiang, Traian Rebedea, Christopher Parisien
Affiliations: NVIDIA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at the Multi-Turn Interactions workshop at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

Abstract:Large language models (LLMs) are typically aligned to a universal set of safety and usage principles intended for broad public acceptability. Yet, real-world applications of LLMs often take place within organizational ecosystems shaped by distinctive corporate policies, regulatory requirements, use cases, brand guidelines, and ethical commitments. This reality highlights the need for rigorous and comprehensive evaluation of LLMs with pluralistic alignment goals, an alignment paradigm that emphasizes adaptability to diverse user values and needs. In this work, we present PLURALISTIC BEHAVIOR SUITE (PBSUITE), a dynamic evaluation suite designed to systematically assess LLMs’ capacity to adhere to pluralistic alignment specifications in multi-turn, interactive conversations. PBSUITE consists of (1) a diverse dataset of 300 realistic LLM behavioral policies, grounded in 30 industries; and (2) a dynamic evaluation framework for stress-testing model compliance with custom behavioral specifications under adversarial conditions. Using PBSUITE, we find that leading open- and closed-source LLMs maintain robust adherence to behavioral policies in single-turn settings (less than 4% failure rates), but their compliance weakens substantially in multi-turn adversarial interactions (up to 84% failure rates). These findings highlight that existing model alignment and safety moderation methods fall short in coherently enforcing pluralistic behavioral policies in real-world LLM interactions. Our work contributes both the dataset and analytical framework to support future research toward robust and context-aware pluralistic alignment techniques.

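[NLP-25] Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings

Quick read: This paper addresses an inherent bias toward the language modality in prevailing Large Vision-Language Model (LVLM) architectures, which largely results from the common practice of simply appending visual embeddings to the input text sequence. The key to the solution is refining the textual embeddings by integrating average-pooled visual features, which improves visual grounding and significantly reduces hallucinations on established benchmarks. The method is simple and efficient; the authors leave more sophisticated fusion strategies to future work.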

Link: https://arxiv.org/abs/2511.05017
Authors: Aakriti Agrawal, Gouthaman KV, Rohith Aralikatti, Gauri Jagatap, Jiaxin Yuan, Vijay Kamarshi, Andrea Fanelli, Furong Huang
Affiliations: University of Maryland, College Park; Indian Institute of Science; University of California, Berkeley
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:In this work, we identify an inherent bias in prevailing LVLM architectures toward the language modality, largely resulting from the common practice of simply appending visual embeddings to the input text sequence. To address this, we propose a simple yet effective method that refines textual embeddings by integrating average-pooled visual features. Our approach demonstrably improves visual grounding and significantly reduces hallucinations on established benchmarks. While average pooling offers a straightforward, robust, and efficient means of incorporating visual information, we believe that more sophisticated fusion methods could further enhance visual grounding and cross-modal alignment. Given that the primary focus of this work is to highlight the modality imbalance and its impact on hallucinations – and to show that refining textual embeddings with visual information mitigates this issue – we leave exploration of advanced fusion strategies for future work.
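
To make the idea of refining textual embeddings with average-pooled visual features concrete, here is a minimal sketch on plain Python lists. The abstract does not say how the pooled vector is integrated, so the element-wise addition and the `weight` parameter below are assumptions chosen for illustration; the function names are hypothetical as well.

```python
def average_pool(visual_embeddings):
    """Average a list of visual token embeddings into a single vector."""
    n = len(visual_embeddings)
    dim = len(visual_embeddings[0])
    return [sum(v[j] for v in visual_embeddings) / n for j in range(dim)]


def refine_text_embeddings(text_embeddings, visual_embeddings, weight=1.0):
    """Add the pooled visual vector to every text token embedding.

    Addition with a scalar weight is an assumed integration rule,
    not one confirmed by the abstract.
    """
    pooled = average_pool(visual_embeddings)
    return [[t + weight * p for t, p in zip(tok, pooled)] for tok in text_embeddings]
```

Under this assumed rule every text token carries the same global visual summary, in contrast to plain appending, where visual information reaches text tokens only through attention.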

[NLP-26] Enhancing Public Speaking Skills in Engineering Students Through AI

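Quick read: This paper addresses engineering students' difficulty with effective communication in public speaking, in particular the lack of sustained, personalized training and the fact that manual feedback on verbal and non-verbal behavior is time-consuming and hard to scale. The key to the solution is a multi-modal AI assessment model that combines speech analysis, computer vision, and sentiment detection. It evaluates (1) verbal communication (pitch, loudness, pacing, intonation), (2) non-verbal communication (facial expressions, gestures, posture), and (3) expressive coherence, a novel measure of alignment between speech and body language. Unlike earlier systems that assess these aspects separately, the model fuses modalities to deliver personalized, scalable feedback. In preliminary testing the AI-generated feedback was moderately aligned with expert evaluations, with Gemini Pro showing the strongest agreement with human annotators.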

Link: https://arxiv.org/abs/2511.04995
Authors: Amol Harsh, Brainerd Prince, Siddharth Siddharth, Deepan Raj Prabakar Muthirayan, Kabir S Bhalla, Esraaj Sarkar Gupta, Siddharth Sahu
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:This research-to-practice full paper was inspired by the persistent challenge in effective communication among engineering students. Public speaking is a necessary skill for future engineers as they have to communicate technical knowledge with diverse stakeholders. While universities offer courses or workshops, they are unable to offer sustained and personalized training to students. Providing comprehensive feedback on both verbal and non-verbal aspects of public speaking is time-intensive, making consistent and individualized assessment impractical. This study integrates research on verbal and non-verbal cues in public speaking to develop an AI-driven assessment model for engineering students. Our approach combines speech analysis, computer vision, and sentiment detection into a multi-modal AI system that provides assessment and feedback. The model evaluates (1) verbal communication (pitch, loudness, pacing, intonation), (2) non-verbal communication (facial expressions, gestures, posture), and (3) expressive coherence, a novel integration ensuring alignment between speech and body language. Unlike previous systems that assess these aspects separately, our model fuses multiple modalities to deliver personalized, scalable feedback. Preliminary testing demonstrated that our AI-generated feedback was moderately aligned with expert evaluations. Among the state-of-the-art AI models evaluated, all of which were Large Language Models (LLMs), including Gemini and OpenAI models, Gemini Pro emerged as the best-performing, showing the strongest agreement with human annotators. By eliminating reliance on human evaluators, this AI-driven public speaking trainer enables repeated practice, helping students naturally align their speech with body language and emotion, crucial for impactful and professional communication.

[NLP-27] Acquiring Common Chinese Emotional Events Using Large Language Model

Link: https://arxiv.org/abs/2511.04989
Authors: Ya Wang, Guangzheng Zhu, Cungen Cao, Jingjing Li, He Li, Xin Huang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: I am the second author (Guangzheng Zhu) and I am submitting this paper on behalf of all co-authors

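[NLP-28] Too Good to be Bad: On the Failure of LLMs to Role-Play Villains

Quick read: This paper addresses a marked limitation of current large language models (LLMs) when role-playing non-prosocial, antagonistic characters: as a character's morality decreases, role-playing fidelity declines, which the authors attribute to a conflict between safety alignment and authentic role-play. The key to the solution is the Moral RolePlay benchmark, which features a four-level moral alignment scale and a balanced test set for measuring how models portray characters ranging from moral paragons to pure villains. The evaluation shows that models struggle most with traits directly antithetical to safety principles, such as "Deceitful" and "Manipulative", often substituting superficial aggression for nuanced malevolence. The work provides the first systematic evidence of this limitation and points toward more context-aware alignment methods.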

Link: https://arxiv.org/abs/2511.04962
Authors: Zihao Yi, Qingxuan Jiang, Ruotian Ma, Xingyu Chen, Qu Yang, Mengru Wang, Fanghua Ye, Ying Shen, Zhaopeng Tu, Xiaolong Li, Linus
Affiliations: Tencent; Sun Yat-sen University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) are increasingly tasked with creative generation, including the simulation of fictional characters. However, their ability to portray non-prosocial, antagonistic personas remains largely unexamined. We hypothesize that the safety alignment of modern LLMs creates a fundamental conflict with the task of authentically role-playing morally ambiguous or villainous characters. To investigate this, we introduce the Moral RolePlay benchmark, a new dataset featuring a four-level moral alignment scale and a balanced test set for rigorous evaluation. We task state-of-the-art LLMs with role-playing characters from moral paragons to pure villains. Our large-scale evaluation reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases. We find that models struggle most with traits directly antithetical to safety principles, such as "Deceitful" and "Manipulative", often substituting nuanced malevolence with superficial aggression. Furthermore, we demonstrate that general chatbot proficiency is a poor predictor of villain role-playing ability, with highly safety-aligned models performing particularly poorly. Our work provides the first systematic evidence of this critical limitation, highlighting a key tension between model safety and creative fidelity. Our benchmark and findings pave the way for developing more nuanced, context-aware alignment methods.

[NLP-29] ORCHID: Orchestrated Retrieval-Augmented Classification with Human-in-the-Loop Intelligent Decision-Making for High-Risk Property

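Quick read: This paper addresses the low efficiency, poor traceability, and difficulty of tracking changing export control policies in High-Risk Property (HRP) classification at U.S. Department of Energy (DOE) sites. Traditional expert-only workflows are slow, backlog-prone, and struggle to keep pace with regulatory updates. The key to the solution is ORCHID, a modular agentic system that pairs retrieval-augmented generation (RAG) with human oversight to produce auditable, policy-based outputs. Small cooperating agents (retrieval, description refiner, classifier, validator, and feedback logger) coordinate via agent-to-agent messaging and invoke tools through the Model Context Protocol (MCP) for model-agnostic on-premise operation. The interface follows an Item to Evidence to Decision loop with step-by-step reasoning, policy citations, and append-only audit bundles (run-cards, prompts, evidence), and it defers uncertain items to Subject Matter Experts (SMEs).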

Link: https://arxiv.org/abs/2511.04956
Authors: Maria Mahbub, Vanessa Lama, Sanjay Das, Brian Starks, Christopher Polchek, Saffell Silvers, Lauren Deck, Prasanna Balaprakash, Tirthankar Ghosal
Affiliations: Oak Ridge National Laboratory; Pacific Northwest National Laboratory
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:High-Risk Property (HRP) classification is critical at U.S. Department of Energy (DOE) sites, where inventories include sensitive and often dual-use equipment. Compliance must track evolving rules designated by various export control policies to make transparent and auditable decisions. Traditional expert-only workflows are time-consuming, backlog-prone, and struggle to keep pace with shifting regulatory boundaries. We demo ORCHID, a modular agentic system for HRP classification that pairs retrieval-augmented generation (RAG) with human oversight to produce policy-based outputs that can be audited. Small cooperating agents, retrieval, description refiner, classifier, validator, and feedback logger, coordinate via agent-to-agent messaging and invoke tools through the Model Context Protocol (MCP) for model-agnostic on-premise operation. The interface follows an Item to Evidence to Decision loop with step-by-step reasoning, on-policy citations, and append-only audit bundles (run-cards, prompts, evidence). In preliminary tests on real HRP cases, ORCHID improves accuracy and traceability over a non-agentic baseline while deferring uncertain items to Subject Matter Experts (SMEs). The demonstration shows single item submission, grounded citations, SME feedback capture, and exportable audit artifacts, illustrating a practical path to trustworthy LLM assistance in sensitive DOE compliance workflows.

[NLP-30] LoPT: Lossless Parallel Tokenization Acceleration for Long Context Inference of Large Language Model

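Quick read: This paper addresses the latency of tokenization in long-context inference, and in particular the inconsistent results that existing parallel tokenization methods produce after merging because of boundary artifacts. The key to the solution is LoPT (Lossless Parallel Tokenization), a framework that uses character-position-based matching and dynamic chunk length adjustment to align and merge tokenized segments, so that the parallel output is identical to standard sequential tokenization while still achieving a significant speedup.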

Link: https://arxiv.org/abs/2511.04952
Authors: Wei Shao, Lingchao Zheng, Pengyu Wang, Peizhen Zheng, Jun Li, Yuwei Fan
Affiliations: Huawei
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Long context inference scenarios have become increasingly important for large language models, yet they introduce significant computational latency. While prior research has optimized long-sequence inference through operators, model architectures, and system frameworks, tokenization remains an overlooked bottleneck. Existing parallel tokenization methods accelerate processing through text segmentation and multi-process tokenization, but they suffer from inconsistent results due to boundary artifacts that occur after merging. To address this, we propose LoPT, a novel Lossless Parallel Tokenization framework that ensures output identical to standard sequential tokenization. Our approach employs character-position-based matching and dynamic chunk length adjustment to align and merge tokenized segments accurately. Extensive experiments across diverse long-text datasets demonstrate that LoPT achieves significant speedup while guaranteeing lossless tokenization. We also provide theoretical proof of consistency and comprehensive analytical studies to validate the robustness of our method.
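
The boundary artifacts mentioned in the abstract are easy to reproduce with a toy tokenizer. The sketch below illustrates only the problem, not LoPT's matching and chunk-adjustment fix: a greedy longest-match tokenizer over a made-up vocabulary gives different tokens when the text is cut at a position that splits a multi-character token. All names and the vocabulary are hypothetical.

```python
def greedy_tokenize(text, vocab):
    """Toy greedy longest-match tokenizer; unknown characters become single tokens."""
    tokens, i = [], 0
    max_len = max(len(t) for t in vocab)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens


def naive_chunked_tokenize(text, vocab, chunk_size):
    """Tokenize fixed-size chunks independently and concatenate the results."""
    tokens = []
    for start in range(0, len(text), chunk_size):
        tokens.extend(greedy_tokenize(text[start:start + chunk_size], vocab))
    return tokens


vocab = {"ab", "a", "b", "c"}
sequential = greedy_tokenize("cabcab", vocab)
split_badly = naive_chunked_tokenize("cabcab", vocab, 2)  # cuts inside "ab"
split_well = naive_chunked_tokenize("cabcab", vocab, 3)   # cuts on a token boundary
```

With a chunk size of 2 the first cut falls inside `ab`, so the merged output contains `a` and `b` as separate tokens; with a chunk size of 3 every cut lands on a token boundary and the result matches the sequential one. A lossless scheme has to guarantee the second situation for arbitrary text.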

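[NLP-31] Diagnosing and Mitigating Semantic Inconsistencies in Wikidata's Classification Hierarchy

Quick read: This paper addresses taxonomic inconsistencies in Wikidata that arise from its relatively loose editorial policy, including classification errors, over-generalized subclass links, and redundant connections, all of which reduce the accuracy and usability of the knowledge graph. The solution is a new validation method that confirms the presence of these issues in specific domains, together with a new evaluation criterion for deciding whether an issue warrants correction and a system that lets users inspect the taxonomic relationships of arbitrary Wikidata entities, making use of the platform's crowdsourced nature.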

Link: https://arxiv.org/abs/2511.04926
Authors: Shixiong Zhao, Hideaki Takeda
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Wikidata is currently the largest open knowledge graph on the web, encompassing over 120 million entities. It integrates data from various domain-specific databases and imports a substantial amount of content from Wikipedia, while also allowing users to freely edit its content. This openness has positioned Wikidata as a central resource in knowledge graph research and has enabled convenient knowledge access for users worldwide. However, its relatively loose editorial policy has also led to a degree of taxonomic inconsistency. Building on prior work, this study proposes and applies a novel validation method to confirm the presence of classification errors, over-generalized subclass links, and redundant connections in specific domains of Wikidata. We further introduce a new evaluation criterion for determining whether such issues warrant correction and develop a system that allows users to inspect the taxonomic relationships of arbitrary Wikidata entities-leveraging the platform’s crowdsourced nature to its full potential.

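[NLP-32] AgentExpt: Automating AI Experiment Design with LLM-based Resource Retrieval Agent

Quick read: This paper addresses two problems in automating experiment design through dataset and baseline retrieval. First, data coverage is limited: existing recommendation datasets draw candidates mainly from public portals and omit many datasets and baselines actually used in published papers. Second, overreliance on content similarity biases recommendations toward superficial matches rather than experimental suitability. The solution is a framework that exploits the collective perception embedded in the baseline and dataset citation network. An automated data-collection pipeline links roughly one hundred thousand accepted papers to the baselines and datasets they actually used. A collective-perception-enhanced retriever concatenates self-descriptions with aggregated citation contexts to represent each dataset or baseline, and an embedding model is fine-tuned on these representations for efficient candidate recall. A reasoning-augmented reranker builds explicit reasoning chains from interaction chains and fine-tunes a large language model to produce interpretable justifications and refined rankings. On the curated dataset, the method outperforms the strongest prior baseline by +5.85% in Recall@20 and +8.30% in HitRate@5 on average.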

Link: https://arxiv.org/abs/2511.04921
Authors: Yu Li, Lehui Li, Qingmin Liao, Fengli Xu, Yong Li
Affiliations: Tsinghua University; Shandong University
Categories: Computation and Language (cs.CL)
Comments: 10 pages

Abstract:Large language model agents are becoming increasingly capable at web-centric tasks such as information retrieval and complex reasoning. These emerging capabilities have given rise to a surge of research interest in developing LLM agents to facilitate scientific discovery. One key application in AI research is to automate experiment design through agentic dataset and baseline retrieval. However, prior efforts suffer from limited data coverage, as recommendation datasets primarily harvest candidates from public portals and omit many datasets actually used in published papers, and from an overreliance on content similarity that biases models toward superficial similarity and overlooks experimental suitability. Harnessing the collective perception embedded in the baseline and dataset citation network, we present a comprehensive framework for baseline and dataset recommendation. First, we design an automated data-collection pipeline that links roughly one hundred thousand accepted papers to the baselines and datasets they actually used. Second, we propose a collective perception enhanced retriever. To represent the position of each dataset or baseline within the scholarly network, it concatenates self-descriptions with aggregated citation contexts. To achieve efficient candidate recall, we finetune an embedding model on these representations. Finally, we develop a reasoning-augmented reranker that extracts interaction chains to construct explicit reasoning chains and finetunes a large language model to produce interpretable justifications and refined rankings. The dataset we curated covers 85% of the datasets and baselines used at top AI conferences over the past five years. On our dataset, the proposed method outperforms the strongest prior baseline with average gains of +5.85% in Recall@20 and +8.30% in HitRate@5. Taken together, our results advance reliable, interpretable automation of experimental design.
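
The abstract reports gains in Recall@20 and HitRate@5 without defining them. The sketch below uses the common definitions, which are an assumption here and may differ from the paper's exact protocol: Recall@K is the fraction of relevant items that appear in the top K, and HitRate@K is 1 if at least one relevant item appears in the top K. The function names and the toy data are hypothetical.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items found in the top-k of a ranked list."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def hit_rate_at_k(ranked, relevant, k):
    """1.0 if any relevant item appears in the top-k, else 0.0."""
    return 1.0 if set(ranked[:k]) & set(relevant) else 0.0


# Hypothetical example: ranked dataset candidates for one paper.
ranked = ["MSCOCO", "SQuAD", "ImageNet", "GLUE", "CIFAR-10"]
relevant = {"ImageNet", "CIFAR-10", "SVHN"}

print(recall_at_k(ranked, relevant, k=3))    # one of three relevant items is in the top 3
print(hit_rate_at_k(ranked, relevant, k=3))  # at least one hit in the top 3
```

In practice both metrics would be averaged over all query papers; HitRate is the more forgiving of the two because a single correct candidate is enough.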

[NLP-33] BudgetMem: Learning Selective Memory Policies for Cost-Efficient Long-Context Processing in Language Models

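[Quick Read]: This paper addresses the computational and memory bottlenecks that large language models (LLMs) face when processing long texts, particularly the difficulty of deploying long-context reasoning in resource-constrained settings. The key to the solution is BudgetMem, a memory-augmented architecture that learns what to remember rather than remembering everything. It combines feature-based salience scoring (entity density, TF-IDF, discourse markers, and position bias) with learned gating mechanisms to store information selectively, and uses BM25 sparse retrieval for efficient access. Under a strict memory budget it saves 72.4% of storage while losing only 1.0% in F1 score, and the advantage is most pronounced on long documents.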

Link: https://arxiv.org/abs/2511.04919
Authors: Chandra Vamsi Krishna Alla, Harish Naidu Gaddam, Manohar Kommi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 3 figures, 5 tables. Evaluated on 700 QA pairs across multiple document lengths

Abstract:Large Language Models (LLMs) face significant computational and memory constraints when processing long contexts, despite growing demand for applications requiring reasoning over extensive documents, multi-session dialogues, and book length texts. While recent advances have extended context windows to 100K-1M tokens, such approaches incur prohibitive costs for resource constrained deployments. We propose BudgetMem, a novel memory augmented architecture that learns what to remember rather than remembering everything. Our system combines selective memory policies with feature based salience scoring (entity density, TF-IDF, discourse markers, position bias) to decide which information merits storage under strict budget constraints. Unlike existing retrieval augmented generation (RAG) systems that store all chunks, BudgetMem employs learned gating mechanisms coupled with BM25 sparse retrieval for efficient information access. Through comprehensive experiments on 700 question answer pairs across short (237 tokens) and long (5K-10K tokens) documents with Llama-3.2-3B-Instruct, we demonstrate that BudgetMem achieves remarkable results on long documents: only 1.0% F1 score degradation while saving 72.4% memory compared to baseline RAG. We validate our approach through budget sensitivity analysis (testing 7 budget ratios), naive baseline comparisons, and document length analysis, showing that BudgetMem’s benefits increase with document length. Our work provides a practical pathway for deploying capable long context systems on modest hardware, democratizing access to advanced language understanding capabilities.
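
To make the idea of storing only salient chunks under a budget concrete, here is a minimal sketch. It is not BudgetMem's implementation: the features are crude stand-ins (capitalized-token share for entity density, a small hypothetical marker list, a linear position bias), TF-IDF and the learned gate are omitted, and the weights are illustrative.

```python
DISCOURSE_MARKERS = {"however", "therefore", "because", "importantly"}


def salience(chunk, index, total, weights=(0.5, 0.3, 0.2)):
    """Hypothetical salience score built from three crude features."""
    tokens = chunk.split()
    if not tokens:
        return 0.0
    # Stand-in for entity density: share of capitalized tokens.
    entity_density = sum(t[0].isupper() for t in tokens) / len(tokens)
    # 1.0 if the chunk contains any discourse marker, else 0.0.
    has_marker = float(any(t.lower().strip(".,;") in DISCOURSE_MARKERS for t in tokens))
    # Position bias: earlier chunks score higher.
    position = 1.0 - index / max(total - 1, 1)
    w_ent, w_marker, w_pos = weights
    return w_ent * entity_density + w_marker * has_marker + w_pos * position


def select_under_budget(chunks, budget_ratio):
    """Keep the highest-salience chunks, at most budget_ratio of them (at least one)."""
    keep = max(1, int(len(chunks) * budget_ratio))
    by_score = sorted(
        range(len(chunks)),
        key=lambda i: salience(chunks[i], i, len(chunks)),
        reverse=True,
    )
    return sorted(by_score[:keep])  # indices of kept chunks, in document order


chunks = [
    "Alice Smith joined Acme Corp in Berlin.",
    "the weather was fine that day.",
    "it rained a little in the afternoon.",
    "However, Acme Corp later moved to Paris.",
]
print(select_under_budget(chunks, budget_ratio=0.5))
```

With a 50% budget, only the two entity-rich chunks are stored; a retriever such as BM25 would then search the stored chunks rather than the full document.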

[NLP-34] SDS KoPub VDR: A Benchmark Dataset for Visual Document Retrieval in Korean Public Documents

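[Quick Read]: This paper addresses a significant gap in existing visual document retrieval (VDR) benchmarks: limited support for non-English languages and for the structural complexity of official publications. The key to the solution is SDS KoPub VDR, the first large-scale, publicly available retrieval benchmark for Korean public documents. It is built on 361 real-world documents (40,781 pages), drawn from KOGL Type 1 licensed material and official legal portals, and includes complex visual elements such as tables, charts, and multi-column layouts. A challenging and reliable evaluation set of 600 query-page-answer triples was produced through human verification and refinement. Queries are systematically categorized as text-based, visual-based, or cross-modal, and two complementary tasks, text-only retrieval and multimodal retrieval, expose substantial performance gaps in the cross-modal reasoning of current models. The benchmark thus offers a quantifiable, fine-grained basis for evaluating and advancing multimodal document intelligence in complex real-world settings.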

Link: https://arxiv.org/abs/2511.04910
Authors: Jaehoon Lee, Sohyun Kim, Wanggeun Park, Geon Lee, Seungkyung Kim, Minyoung Lee
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 27 pages, 15 figures, 6 tables

Abstract:Existing benchmarks for visual document retrieval (VDR) largely overlook non-English languages and the structural complexity of official publications. To address this critical gap, we introduce SDS KoPub VDR, the first large-scale, publicly available benchmark for retrieving and understanding Korean public documents. The benchmark is built upon a corpus of 361 real-world documents (40,781 pages), including 256 files under the KOGL Type 1 license and 105 from official legal portals, capturing complex visual elements like tables, charts, and multi-column layouts. To establish a challenging and reliable evaluation set, we constructed 600 query-page-answer triples. These were initially generated using multimodal models (e.g., GPT-4o) and subsequently underwent a rigorous human verification and refinement process to ensure factual accuracy and contextual relevance. The queries span six major public domains and are systematically categorized by the reasoning modality required: text-based, visual-based (e.g., chart interpretation), and cross-modal. We evaluate SDS KoPub VDR on two complementary tasks that reflect distinct retrieval paradigms: (1) text-only retrieval, which measures a model’s ability to locate relevant document pages based solely on textual signals, and (2) multimodal retrieval, which assesses retrieval performance when visual features (e.g., tables, charts, and layouts) are jointly leveraged alongside text. This dual-task evaluation reveals substantial performance gaps, particularly in multimodal scenarios requiring cross-modal reasoning, even for state-of-the-art models. As a foundational resource, SDS KoPub VDR not only enables rigorous and fine-grained evaluation across textual and multimodal retrieval tasks but also provides a clear roadmap for advancing multimodal AI in complex, real-world document intelligence.

[NLP-35] Association via Entropy Reduction

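[Quick Read]: This paper addresses the problem of identifying document associations, that is, how to find documents related to a query more effectively. Term frequency-inverse document frequency (tf-idf) is widely used for this purpose but has limitations. The paper proposes a new score, aver, derived from entropy under a simple statistical model. Unlike tf-idf, aver has a natural threshold for declaring pairs unassociated, can distinguish between pairs to which tf-idf assigns a score of 1.0, and can be applied to groups of more than two documents. On a dataset with ground-truth association labels, aver finds associated pairs better than tf-idf. The authors also note the costs: aver is more complex to write down and compute, and its naturally scale-free scores are harder to interpret.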

Link: https://arxiv.org/abs/2511.04901
Authors: Anthony Gamst, Lawrence Wilson
Affiliations: IDA Center for Communications Research-La Jolla
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

Abstract:Prior to recent successes using neural networks, term frequency-inverse document frequency (tf-idf) was clearly regarded as the best choice for identifying documents related to a query. We provide a different score, aver, and observe, on a dataset with ground truth marking for association, that aver does do better at finding associated pairs than tf-idf. This example involves finding associated vertices in a large graph and that may be an area where neural networks are not currently an obvious best choice. Beyond this one anecdote, we observe that (1) aver has a natural threshold for declaring pairs as unassociated while tf-idf does not, (2) aver can distinguish between pairs of documents for which tf-idf gives a score of 1.0, (3) aver can be applied to larger collections of documents than pairs while tf-idf cannot, and (4) that aver is derived from entropy under a simple statistical model while tf-idf is a construction designed to achieve a certain goal and hence aver may be more “natural.” To be fair, we also observe that (1) writing down and computing the aver score for a pair is more complex than for tf-idf and (2) that the fact that the aver score is naturally scale-free makes it more complicated to interpret aver scores.

[NLP-36] Minimal and Mechanistic Conditions for Behavioral Self-Awareness in LLMs

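[Quick Read]: This paper asks whether large language models (LLMs) have behavioral self-awareness, meaning the ability to accurately describe or predict their own learned behaviors without explicit supervision, and examines how this ability arises and how controllable it is. The key to the solution is a set of controlled fine-tuning experiments that induce self-aware behavior in instruction-tuned LLMs using low-rank adapters (LoRA). The findings are: (1) a single rank-1 LoRA adapter is enough to reliably induce self-awareness; (2) the behavior can be largely reproduced by a single steering vector in activation space, indicating that it is essentially a linear feature; and (3) self-awareness is domain-localized, with independent representations across tasks. Behavioral self-awareness is therefore a task-domain-specific linear feature that can be induced cheaply and modulated easily.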

Link: https://arxiv.org/abs/2511.04875
Authors: Matthew Bozoukov, Matthew Nguyen, Shubkarman Singh, Bart Bussmann, Patrick Leask
Affiliations: University of California, San Diego; University of Virginia; Durham University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Recent studies have revealed that LLMs can exhibit behavioral self-awareness: the ability to accurately describe or predict their own learned behaviors without explicit supervision. This capability raises safety concerns as it may, for example, allow models to better conceal their true abilities during evaluation. We attempt to characterize the minimal conditions under which such self-awareness emerges, and the mechanistic processes through which it manifests. Through controlled finetuning experiments on instruction-tuned LLMs with low-rank adapters (LoRA), we find: (1) that self-awareness can be reliably induced using a single rank-1 LoRA adapter; (2) that the learned self-aware behavior can be largely captured by a single steering vector in activation space, recovering nearly all of the fine-tune’s behavioral effect; and (3) that self-awareness is non-universal and domain-localized, with independent representations across tasks. Together, these findings suggest that behavioral self-awareness emerges as a domain-specific, linear feature that can be easily induced and modulated.

[NLP-37] Trained on Tokens, Calibrated on Concepts: The Emergence of Semantic Calibration in LLMs

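[Quick Read]: This paper addresses the lack of reliable semantic-level confidence estimates for the answers that large language models (LLMs) generate. Base LLMs are known to be well calibrated at the token level (next-token calibration), but it has been unclear whether they can meaningfully assess confidence in the actual meaning of a response. The key contribution is a theoretical framework built on "B-calibration", a notion of calibration parameterized by a choice of equivalence classes, which shows that semantic calibration can emerge naturally as a byproduct of the next-token prediction objective. The mechanism relies on a recent connection between local loss optimality and calibration. Experiments confirm that base LLMs are semantically calibrated on open-domain question answering, whereas RL instruction-tuning and chain-of-thought reasoning break this property. The work thus offers the first principled explanation of how semantic calibration emerges.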

Link: https://arxiv.org/abs/2511.04869
Authors: Preetum Nakkiran, Arwen Bradley, Adam Goliński, Eugene Ndiaye, Michael Kirchhof, Sinead Williamson
Affiliations: Apple
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Large Language Models (LLMs) often lack meaningful confidence estimates for their outputs. While base LLMs are known to exhibit next-token calibration, it remains unclear whether they can assess confidence in the actual meaning of their responses beyond the token level. We find that, when using a certain sampling-based notion of semantic calibration, base LLMs are remarkably well-calibrated: they can meaningfully assess confidence in open-domain question-answering tasks, despite not being explicitly trained to do so. Our main theoretical contribution establishes a mechanism for why semantic calibration emerges as a byproduct of next-token prediction, leveraging a recent connection between calibration and local loss optimality. The theory relies on a general definition of “B-calibration,” which is a notion of calibration parameterized by a choice of equivalence classes (semantic or otherwise). This theoretical mechanism leads to a testable prediction: base LLMs will be semantically calibrated when they can easily predict their own distribution over semantic answer classes before generating a response. We state three implications of this prediction, which we validate through experiments: (1) Base LLMs are semantically calibrated across question-answering tasks, (2) RL instruction-tuning systematically breaks this calibration, and (3) chain-of-thought reasoning breaks calibration. To our knowledge, our work provides the first principled explanation of when and why semantic calibration emerges in LLMs.
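
The abstract refers to "a certain sampling-based notion of semantic calibration" without spelling it out. One plausible reading, sketched below as an assumption rather than the paper's definition, is to sample several answers, group them into semantic classes, and take the share of the largest class as the confidence. The `canonicalize` function is a hypothetical stand-in for a real semantic-equivalence check, and the calibration measure is a single-bin simplification.

```python
from collections import Counter


def semantic_confidence(samples, canonicalize=lambda s: s.strip().lower()):
    """Return (majority answer class, share of samples in that class)."""
    classes = Counter(canonicalize(s) for s in samples)
    answer, count = classes.most_common(1)[0]
    return answer, count / len(samples)


def calibration_gap(records):
    """Absolute gap between mean confidence and accuracy over (confidence, correct) pairs."""
    mean_conf = sum(conf for conf, _ in records) / len(records)
    accuracy = sum(1 for _, correct in records if correct) / len(records)
    return abs(mean_conf - accuracy)


# Hypothetical samples for one question.
samples = ["Paris", "paris ", "Lyon", "Paris"]
print(semantic_confidence(samples))

# Hypothetical results over four questions, each answered with confidence 0.75.
records = [(0.75, True), (0.75, True), (0.75, True), (0.75, False)]
print(calibration_gap(records))
```

A model is well calibrated in this sense when answers given with confidence 0.75 are correct about 75% of the time; a fuller evaluation would bin by confidence level instead of pooling everything into one group.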

[NLP-38] Explore Data Left Behind in Reinforcement Learning for Reasoning Language Models

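[Quick Read]: This paper addresses the loss of training signal caused by residual prompts in Reinforcement Learning with Verifiable Rewards (RLVR). Residual prompts are prompts whose sampled responses all receive the same reward, so the reward variance is zero and they contribute no useful gradient update, which reduces training diversity and limits further gains. The key to the solution is the Explore Residual Prompts in Policy Optimization (ERPO) framework. ERPO keeps a history record for each prompt and adaptively raises the sampling temperature for residual prompts that previously produced all-correct responses. This pushes the model to explore more diverse reasoning traces on those prompts and introduces incorrect responses that reactivate the training signal, improving training efficiency and model performance.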

Link: https://arxiv.org/abs/2511.04800
Authors: Chenxi Liu, Junjie Liang, Yuqi Jia, Bochuan Cao, Yang Bai, Heng Huang, Xun Chen
Affiliations: University of Maryland, College Park; ByteDance; Duke University; Pennsylvania State University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for improving the reasoning abilities of large language models (LLMs). The Group Relative Policy Optimization (GRPO) family has demonstrated strong performance in training LLMs with RLVR. However, as models train longer and scale larger, more training prompts become residual prompts, those with zero variance rewards that provide no training signal. Consequently, fewer prompts contribute to training, reducing diversity and hindering effectiveness. To fully exploit these residual prompts, we propose the Explore Residual Prompts in Policy Optimization (ERPO) framework, which encourages exploration on residual prompts and reactivates their training signals. ERPO maintains a history tracker for each prompt and adaptively increases the sampling temperature for residual prompts that previously produced all correct responses. This encourages the model to generate more diverse reasoning traces, introducing incorrect responses that revive training signals. Empirical results on the Qwen2.5 series demonstrate that ERPO consistently surpasses strong baselines across multiple mathematical reasoning benchmarks.
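
The history tracker described above can be sketched as follows. This is a minimal illustration, not ERPO's actual update rule, which the abstract does not give: the class name, the linear temperature schedule, and the values of `base_temp`, `step`, and `max_temp` are all hypothetical. The motivation is that group-relative methods compute advantages from each reward minus the group mean, so a group with identical rewards yields all-zero advantages.

```python
from collections import defaultdict


class ResidualPromptTracker:
    """Tracks, per prompt, how many rollout groups were entirely correct."""

    def __init__(self, base_temp=1.0, step=0.1, max_temp=1.5):
        # Illustrative values only; none of these come from the paper.
        self.base_temp = base_temp
        self.step = step
        self.max_temp = max_temp
        self.all_correct_count = defaultdict(int)

    def update(self, prompt_id, rewards):
        """Record one rollout group; rewards are 1 (correct) or 0 (incorrect)."""
        if rewards and all(r == 1 for r in rewards):
            self.all_correct_count[prompt_id] += 1

    def temperature(self, prompt_id):
        """Raise the temperature with each all-correct group, capped at max_temp."""
        count = self.all_correct_count.get(prompt_id, 0)
        return min(self.base_temp + self.step * count, self.max_temp)


tracker = ResidualPromptTracker()
tracker.update("p1", [1, 1, 1, 1])  # residual: every response correct
tracker.update("p1", [1, 1, 1, 1])
tracker.update("p2", [1, 0, 1, 0])  # mixed rewards still carry signal
print(tracker.temperature("p1"))    # raised above the base temperature
print(tracker.temperature("p2"))    # stays at the base temperature
```

Capping the temperature is a design choice of this sketch: it keeps sampling from drifting so far that responses become incoherent.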

[NLP-39] Quantifying the Climate Risk of Generative AI: Region-Aware Carbon Accounting with G-TRACE and the AI Sustainability Pyramid

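[Quick Read]: This paper addresses the energy consumption and carbon emissions of rapidly growing generative AI, which it identifies as an emerging category of climate risk. The core challenge is to quantify the carbon footprint of training and inference across modalities (text, image, video) and deployment regions, and to show how decentralized inference accumulates small per-query energy costs into system-level environmental impact. The key to the solution is G-TRACE (GenAI Transformative Carbon Estimator), a cross-modal, region-aware framework that measures energy use and carbon intensity per output type using real-world analytics and microscopic simulation. The paper also proposes the AI Sustainability Pyramid, which links carbon accounting metrics (L1-L7) to operational readiness, optimization, and stewardship, turning quantitative emission data into actionable policy guidance and aligning technology deployment with global decarbonization goals.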

Link: https://arxiv.org/abs/2511.04776
Authors: Zahida Kausar, Seemab Latif, Raja Khurrum Shahzad, Mehwish Fatima
Affiliations: National University of Sciences and Technology (NUST); Mid Sweden University
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: 27 pages, 4 figures

Abstract:Generative Artificial Intelligence (GenAI) represents a rapidly expanding digital infrastructure whose energy demand and associated CO2 emissions are emerging as a new category of climate risk. This study introduces G-TRACE (GenAI Transformative Carbon Estimator), a cross-modal, region-aware framework that quantifies training- and inference-related emissions across modalities and deployment geographies. Using real-world analytics and microscopic simulation, G-TRACE measures energy use and carbon intensity per output type (text, image, video) and reveals how decentralized inference amplifies small per-query energy costs into system-level impacts. Through the Ghibli-style image generation trend (2024-2025), we estimate 4,309 MWh of energy consumption and 2,068 tCO2 emissions, illustrating how viral participation inflates individual digital actions into tonne-scale consequences. Building on these findings, we propose the AI Sustainability Pyramid, a seven-level governance model linking carbon accounting metrics (L1-L7) with operational readiness, optimization, and stewardship. This framework translates quantitative emission metrics into actionable policy guidance for sustainable AI deployment. The study contributes to the quantitative assessment of emerging digital infrastructures as a novel category of climate risk, supporting adaptive governance for sustainable technology deployment. By situating GenAI within climate-risk frameworks, the work advances data-driven methods for aligning technological innovation with global decarbonization and resilience objectives.
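
The abstract does not state G-TRACE's formulas. The sketch below uses only the standard accounting identity, emissions = energy × grid carbon intensity, to show what the two reported figures imply and how a region-aware estimate could be assembled. The regional shares and intensities are hypothetical, and the function name is illustrative.

```python
def emissions_tonnes(energy_mwh, intensity_t_per_mwh):
    """Emissions (tCO2) = energy (MWh) x grid carbon intensity (tCO2/MWh)."""
    return energy_mwh * intensity_t_per_mwh


# Figures reported in the abstract for the Ghibli-style image trend.
energy_mwh = 4309
reported_tco2 = 2068

# Average carbon intensity implied by those two numbers.
implied_intensity = reported_tco2 / energy_mwh
print(round(implied_intensity, 3))  # tCO2 per MWh

# Hypothetical regional split: (share of energy, grid intensity in tCO2/MWh).
regions = {"region_a": (0.6, 0.35), "region_b": (0.4, 0.70)}
total = sum(
    emissions_tonnes(energy_mwh * share, intensity)
    for share, intensity in regions.values()
)
print(round(total, 2))
```

The implied average is roughly 0.48 tCO2 per MWh. The regional split shows why deployment geography matters: the same energy total yields a different emissions figure once it is weighted by local grid intensity.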

[NLP-40] Surprisal reveals diversity gaps in image captioning and different scorers change the story

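[Quick Read]: This paper addresses the reliability of linguistic diversity evaluation in image captioning, that is, how to measure the diversity of model-generated text objectively. Conventional practice computes diversity metrics with a single language model as the scorer, and the paper shows that this can reverse the conclusion entirely. Scored with an n-gram language model trained on captions, human captions show roughly twice the surprisal variance (the variance of token-level negative log-probabilities) of model captions; rescored with a general-language model, the pattern reverses. The key to the solution is a surprisal-based diversity metric combined with the requirement that evaluation report results under several scorers of different kinds, so that conclusions are robust to the choice of scorer.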

Link: https://arxiv.org/abs/2511.04754
Authors: Nikolai Ilinykh, Simon Dobnik
Affiliations: Centre for Linguistic Theory and Studies in Probability (CLASP); Department of Philosophy, Linguistics and Theory of Science (FLoV); University of Gothenburg
Categories: Computation and Language (cs.CL)
Comments: Accepted and presented at INLG 2025

Abstract:We quantify linguistic diversity in image captioning with surprisal variance - the spread of token-level negative log-probabilities within a caption set. On the MSCOCO test set, we compare five state-of-the-art vision-and-language LLMs, decoded with greedy and nucleus sampling, to human captions. Measured with a caption-trained n-gram LM, humans display roughly twice the surprisal variance of models, but rescoring the same captions with a general-language model reverses the pattern. Our analysis introduces the surprisal-based diversity metric for image captioning. We show that relying on a single scorer can completely invert conclusions; thus, robust diversity evaluation must report surprisal under several scorers.
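
Surprisal variance can be computed directly from per-token probabilities. The sketch below is a minimal version under stated assumptions: tokens are pooled across the whole caption set, surprisal is measured in bits, and the population variance is used. The paper may differ on any of these points, and the probabilities are hypothetical values standing in for a scorer's output.

```python
import math
from statistics import pvariance


def surprisals(token_probs):
    """Token-level surprisal in bits: -log2 p(token)."""
    return [-math.log2(p) for p in token_probs]


def surprisal_variance(caption_set):
    """Variance of token-level surprisals pooled over a set of captions."""
    pooled = [s for probs in caption_set for s in surprisals(probs)]
    return pvariance(pooled)


# Hypothetical per-token probabilities assigned by one scorer.
human_captions = [[0.5, 0.125], [0.25, 0.03125]]
model_captions = [[0.5, 0.25], [0.25, 0.5]]

print(surprisal_variance(human_captions))
print(surprisal_variance(model_captions))
```

Rescoring the same captions with a second language model only changes the probabilities passed in, which is why the paper's comparison across scorers is cheap to run once captions are fixed.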

[NLP-41] Learning to reason about rare diseases through retrieval-augmented agents

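[Quick Read]: This paper addresses the drop in AI model performance on rare diseases in medical imaging, which is caused by the scarcity of representative training data. The core of the solution is RADAR (Retrieval Augmented Diagnostic Reasoning Agents), an agentic system built on retrieval-augmented reasoning. It embeds case reports and literature with sentence transformers and indexes them with FAISS for efficient similarity search, so that the agent can retrieve clinically relevant evidence for unseen rare diseases and improve diagnostic accuracy and interpretability without additional training.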

Link: https://arxiv.org/abs/2511.04720
Authors: Ha Young Kim, Jun Li, Ana Beatriz Solana, Carolin M. Pirkl, Benedikt Wiestler, Julia A. Schnabel, Cosmin I. Bercea
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Submitted on behalf of the PREDICTOM consortium

Abstract:Rare diseases represent the long tail of medical imaging, where AI models often fail due to the scarcity of representative training data. In clinical workflows, radiologists frequently consult case reports and literature when confronted with unfamiliar findings. Following this line of reasoning, we introduce RADAR, Retrieval Augmented Diagnostic Reasoning Agents, an agentic system for rare disease detection in brain MRI. Our approach uses AI agents with access to external medical knowledge by embedding both case reports and literature using sentence transformers and indexing them with FAISS to enable efficient similarity search. The agent retrieves clinically relevant evidence to guide diagnostic decision making on unseen diseases, without the need for additional training. Designed as a model-agnostic reasoning module, RADAR can be seamlessly integrated with diverse large language models, consistently improving their rare pathology recognition and interpretability. On the NOVA dataset comprising 280 distinct rare diseases, RADAR achieves up to a 10.2% performance gain, with the strongest improvements observed for open source models such as DeepSeek. Beyond accuracy, the retrieved examples provide interpretable, literature grounded explanations, highlighting retrieval-augmented reasoning as a powerful paradigm for low-prevalence conditions in medical imaging.

[NLP-42] First is Not Really Better Than Last: Evaluating Layer Choice and Aggregation Strategies in Language Model Data Influence Estimation

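[Quick Read]: This paper addresses the accuracy of training-sample influence estimation for large language models (LLMs), that is, how to identify which training samples significantly affect model decisions, in support of decision interpretation and large-scale dataset auditing. Existing methods compute influence functions from first-order and higher-order gradient information, but because of model size they are usually restricted to a subset of layers. Prior work concluded that the first (embedding) layers are the most informative, based on the cancellation effect of influence scores. This paper presents theoretical and empirical evidence that the cancellation effect is unreliable and that middle attention layers are better influence estimators. It also improves cross-layer aggregation by replacing standard averaging with ranking- and vote-based methods, which significantly improves performance, and it proposes the Noise Detection Rate (NDR), a new metric for evaluating influence scores without retraining. Experiments across LLM architectures and scales show that the first layers are not necessarily better than the last, contrary to prior belief in the field.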

Link: https://arxiv.org/abs/2511.04715
Authors: Dmytro Vitel, Anshuman Chhabra
Affiliations: Bellini College of Artificial Intelligence, Cybersecurity, and Computing; University of South Florida
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Identifying how training samples influence/impact Large Language Model (LLM) decision-making is essential for effectively interpreting model decisions and auditing large-scale datasets. Current training sample influence estimation methods (also known as influence functions) undertake this goal by utilizing information flow through the model via its first-order and higher-order gradient terms. However, owing to the large model sizes of today consisting of billions of parameters, these influence computations are often restricted to some subset of model layers to ensure computational feasibility. Prior seminal work by Yeh et al. (2022) in assessing which layers are best suited for computing language data influence concluded that the first (embedding) layers are the most informative for this purpose, using a hypothesis based on influence scores canceling out (i.e., the cancellation effect). In this work, we propose theoretical and empirical evidence demonstrating how the cancellation effect is unreliable, and that middle attention layers are better estimators for influence. Furthermore, we address the broader challenge of aggregating influence scores across layers, and showcase how alternatives to standard averaging (such as ranking and vote-based methods) can lead to significantly improved performance. Finally, we propose better methods for evaluating influence score efficacy in LLMs without undertaking model retraining, and propose a new metric known as the Noise Detection Rate (NDR) that exhibits strong predictive capability compared to the cancellation effect. Through extensive experiments across LLMs of varying types and scales, we concretely determine that the first (layers) are not necessarily better than the last (layers) for LLM influence estimation, contrasting with prior knowledge in the field.
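
The abstract mentions ranking- and vote-based alternatives to averaging influence scores across layers but does not specify them. The sketch below contrasts plain averaging with a Borda-style rank sum, chosen here as one simple representative and not necessarily the paper's method. The per-layer scores are hypothetical.

```python
def average_aggregate(layer_scores):
    """Mean influence score per training sample across layers."""
    n_layers = len(layer_scores)
    n_samples = len(layer_scores[0])
    return [sum(layer[i] for layer in layer_scores) / n_layers for i in range(n_samples)]


def rank_aggregate(layer_scores):
    """Borda-style aggregation: sum each sample's per-layer rank (0 = lowest score)."""
    n_samples = len(layer_scores[0])
    totals = [0] * n_samples
    for layer in layer_scores:
        order = sorted(range(n_samples), key=lambda i: layer[i])
        for rank, i in enumerate(order):
            totals[i] += rank
    return totals


# Hypothetical influence scores for three training samples at three layers.
layer_scores = [
    [0.1, 0.3, 0.2],  # layer A ranks sample 1 highest
    [0.2, 0.4, 0.1],  # layer B ranks sample 1 highest
    [9.0, 0.5, 0.1],  # layer C has a much larger scale and favors sample 0
]

print(average_aggregate(layer_scores))  # dominated by layer C's scale
print(rank_aggregate(layer_scores))     # sample 1 wins on ranks
```

Averaging lets a single layer with large-magnitude scores decide the outcome, while rank aggregation gives each layer an equal vote, which is one reason rank-based schemes can be more robust when layer scales differ.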

[NLP-43] GEMMA-SQL: A Novel Text-to-SQL Model Based on Large Language Models

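[Quick Read]: This paper addresses the high resource consumption, demanding hardware requirements, and difficult performance trade-offs that text-to-SQL systems face in real deployments. The key to the solution is GEMMA-SQL, a lightweight and efficient model that is iteratively fine-tuned from the open-source Gemma 2B architecture and combines several prompting strategies, such as few-shot learning, to improve SQL generation accuracy. Instruction tuning sharpens the mapping from natural language to structured queries: GEMMA-SQL Instruct reaches 66.8% Test-Suite accuracy and 63.3% Exact Set Match accuracy on the SPIDER benchmark, outperforming several mainstream baselines. The model is scalable and adaptable and can run on low-cost hardware, offering a practical route to robust and accessible text-to-SQL systems.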

Link: https://arxiv.org/abs/2511.04710
Authors: Hari Mohan Pandey, Anshul Gupta, Subham Sarkar, Minakshi Tomer, Schneider Johannes, Yan Gong
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Text-to-SQL systems enable users to interact with structured databases using natural language, eliminating the need for specialized programming knowledge. In this work, we introduce GEMMA-SQL, a lightweight and efficient text-to-SQL model built upon the open-source Gemma 2B architecture. Unlike many large language models (LLMs), GEMMA-SQL is fine-tuned in a resource-efficient, iterative manner and can be deployed on low-cost hardware. Leveraging the SPIDER benchmark for training and evaluation, GEMMA-SQL combines multiple prompting strategies, including few-shot learning, to enhance SQL query generation accuracy. The instruction-tuned variant, GEMMA-SQL Instruct, achieves 66.8% Test-Suite accuracy and 63.3% Exact Set Match accuracy, outperforming several state-of-the-art baselines such as IRNet, RYANSQL, and CodeXDavinci. The proposed approach demonstrates that effective prompt design and targeted instruction tuning can significantly boost performance while maintaining high scalability and adaptability. These results position GEMMA-SQL as a practical, open-source alternative for robust and accessible text-to-SQL systems.

[NLP-44] Jailbreaking in the Haystack

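[Quick Read]: This paper addresses potential safety weaknesses of long-context language models, specifically how aligned models can be jailbroken by a needle-in-haystack attack (NINJA). The key idea is to use benign content generated by the model itself as cover and to place the harmful goal at a specific position in the context so that it slips past safety mechanisms. Experiments show that careful positioning of the goal within the context significantly raises attack success rates, and that the attack is low-resource, transferable, and hard to detect. The results indicate that even seemingly harmless long contexts can be a fundamental safety weakness of modern LMs.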

Link: https://arxiv.org/abs/2511.04707
Authors: Rishi Rajesh Shah, Chen Henry Wu, Shashwat Saxena, Ziqian Zhong, Alexander Robey, Aditi Raghunathan
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Recent advances in long-context language models (LMs) have enabled million-token inputs, expanding their capabilities across complex tasks like computer-use agents. Yet, the safety implications of these extended contexts remain unclear. To bridge this gap, we introduce NINJA (short for Needle-in-haystack jailbreak attack), a method that jailbreaks aligned LMs by appending benign, model-generated content to harmful user goals. Critical to our method is the observation that the position of harmful goals plays an important role in safety. Experiments on the standard safety benchmark HarmBench show that NINJA significantly increases attack success rates across state-of-the-art open and proprietary models, including LLaMA, Qwen, Mistral, and Gemini. Unlike prior jailbreaking methods, our approach is low-resource, transferable, and less detectable. Moreover, we show that NINJA is compute-optimal: under a fixed compute budget, increasing context length can outperform increasing the number of trials in best-of-N jailbreak. These findings reveal that even benign long contexts, when crafted with careful goal positioning, introduce fundamental vulnerabilities in modern LMs.

[NLP-45] POLIS-Bench: Towards Multi-Dimensional Evaluation of LLMs for Bilingual Policy Tasks in Governmental Scenarios

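[Quick Read]: This paper addresses the inadequate evaluation of large language models (LLMs) in governmental bilingual policy scenarios, in particular the lack of a systematic benchmark that reflects real governance needs. Existing benchmarks cannot accurately measure how well models understand, apply, and judge compliance with policy texts. The key to the solution is POLIS-Bench, which comprises an up-to-date bilingual policy corpus, three tasks grounded in concrete policy scenarios (clause retrieval and interpretation, solution generation, and compliance judgment), and a dual-metric evaluation framework that combines semantic similarity with accuracy rate, enabling precise, multi-dimensional assessment of LLMs in the policy domain. The evaluation shows that reasoning models are more stable and accurate across tasks, and lightweight fine-tuning yields models that match or surpass mainstream proprietary models, offering a low-cost, compliant path to governmental deployment.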

Link: https://arxiv.org/abs/2511.04705
Authors: Tingyue Yang, Junchi Yao, Yuhui Guo, Chang Liu
Affiliations: UESTC
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 6 figures

Abstract:We introduce POLIS-Bench, the first rigorous, systematic evaluation suite designed for LLMs operating in governmental bilingual policy scenarios. Compared to existing benchmarks, POLIS-Bench introduces three major advancements. (i) Up-to-date Bilingual Corpus: We construct an extensive, up-to-date policy corpus that significantly scales the effective assessment sample size, ensuring relevance to current governance practice. (ii) Scenario-Grounded Task Design: We distill three specialized, scenario-grounded tasks (Clause Retrieval Interpretation, Solution Generation, and Compliance Judgment) to comprehensively probe model understanding and application. (iii) Dual-Metric Evaluation Framework: We establish a novel dual-metric evaluation framework combining semantic similarity with accuracy rate to precisely measure both content alignment and task requirement adherence. A large-scale evaluation of over 10 state-of-the-art LLMs on POLIS-Bench reveals a clear performance hierarchy where reasoning models maintain superior cross-task stability and accuracy, highlighting the difficulty of compliance tasks. Furthermore, leveraging our benchmark, we successfully fine-tune a lightweight open-source model. The resulting POLIS series models achieve parity with, or surpass, strong proprietary baselines on multiple policy subtasks at a significantly reduced cost, providing a cost-effective and compliant path for robust real-world governmental deployment.

[NLP-46] Measuring what Matters: Construct Validity in Large Language Model Benchmarks NEURIPS2025

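[Quick Read]: This paper addresses insufficient construct validity in current evaluations of large language models (LLMs): when measuring abstract, complex concepts such as "safety" and "robustness", existing benchmarks often fail to represent the core phenomenon, which makes the resulting claims unreliable. Based on a systematic review of 445 LLM benchmarks from leading natural language processing and machine learning conferences, the authors identify recurring patterns that undermine validity and offer eight key recommendations with concrete, actionable guidance for designing future LLM benchmarks.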

Link: https://arxiv.org/abs/2511.04703
Authors: Andrew M. Bean, Ryan Othniel Kearns, Angelika Romanou, Franziska Sofia Hafner, Harry Mayne, Jan Batzner, Negar Foroutan, Chris Schmitz, Karolina Korgul, Hunar Batra, Oishi Deb, Emma Beharry, Cornelius Emde, Thomas Foster, Anna Gausen, María Grandury, Simeng Han, Valentin Hofmann, Lujain Ibrahim, Hazel Kim, Hannah Rose Kirk, Fangru Lin, Gabrielle Kaili-May Liu, Lennart Luettgau, Jabez Magomere, Jonathan Rystrøm, Anna Sotnikova, Yushi Yang, Yilun Zhao, Adel Bibi, Antoine Bosselut, Ronald Clark, Arman Cohan, Jakob Foerster, Yarin Gal, Scott A. Hale, Inioluwa Deborah Raji, Christopher Summerfield, Philip H.S. Torr, Cozmin Ududec, Luc Rocher, Adam Mahdi
Affiliations: University of Oxford; EPFL; Weizenbaum Institute Berlin; Technical University Munich; Centre for Digital Governance, Hertie School; Stanford University; UK AI Security Institute; SomosNLP; Universidad Politécnica de Madrid; Yale University; Allen Institute for AI; University of Washington; Meedan; UC Berkeley
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Track on Datasets and Benchmarks

Abstract:Evaluating large language models (LLMs) is crucial for both assessing their capabilities and identifying safety or robustness issues prior to deployment. Reliably measuring abstract and complex phenomena such as ‘safety’ and ‘robustness’ requires strong construct validity, that is, having measures that represent what matters to the phenomenon. With a team of 29 expert reviewers, we conduct a systematic review of 445 LLM benchmarks from leading conferences in natural language processing and machine learning. Across the reviewed articles, we find patterns related to the measured phenomena, tasks, and scoring metrics which undermine the validity of the resulting claims. To address these shortcomings, we provide eight key recommendations and detailed actionable guidance to researchers and practitioners in developing LLM benchmarks.

[NLP-47] Separate the Wheat from the Chaff: Winnowing Down Divergent Views in Retrieval Augmented Generation EMNLP

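[Quick Read]: This paper addresses the noise that retrieval-augmented generation (RAG) incurs when more documents are retrieved to improve coverage: many irrelevant or misleading documents can substantially reduce answer accuracy. The key to the solution is WinnowRAG, whose core mechanism is winnowing, carried out in two stages. In Stage I, query-aware clustering groups the documents and assigns each cluster to an LLM agent that generates its own answer. In Stage II, a critic LLM iteratively evaluates the agents' outputs and separates useful documents from noisy ones, while two strategic merging techniques ensure that only relevant knowledge is used for the final response. The method is model-agnostic, requires no fine-tuning, and adapts readily to different tasks.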

链接: https://arxiv.org/abs/2511.04700
作者: Song Wang,Zihan Chen,Peng Wang,Zhepei Wei,Zhen Tan,Yu Meng,Cong Shen,Jundong Li
机构: University of Virginia (弗吉尼亚大学); Arizona State University (亚利桑那州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: EMNLP Main 2025

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) enhances large language models (LLMs) by integrating external knowledge sources to address their limitations in accessing up-to-date or specialized information. A natural strategy to increase the likelihood of retrieving relevant information is to expand the number of retrieved documents. However, involving more documents could introduce significant noise, as many documents may be irrelevant or misleading, thereby reducing the overall accuracy of the generated responses. To overcome the challenge associated with handling a larger number of documents, we propose WinnowRAG, a novel RAG framework designed to systematically filter out noisy documents while preserving valuable content – a process we refer to as winnowing. WinnowRAG operates in two stages: In Stage I, we perform query-aware clustering to group similar documents and form distinct topic clusters. Each cluster is assigned to an LLM agent for generating a unique answer. In Stage II, we perform winnowing, wherein a critic LLM evaluates the outputs of multiple agents and iteratively separates useful documents from noisy ones. To retain useful documents when discarding agents, we propose two strategic merging techniques to ensure that only relevant knowledge is used for generating the final response. Crucially, WinnowRAG is model-agnostic and does not require any model fine-tuning, making it easily adaptable to various tasks. Extensive experiments on various realistic datasets demonstrate the effectiveness of WinnowRAG over state-of-the-art baselines.

[NLP-48] Cross-Lingual SynthDocs: A Large-Scale Synthetic Corpus for Any to Arabic OCR and Document Understanding

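Quick read: This paper addresses the scarcity of Arabic resources for optical character recognition (OCR) and document understanding (DU). Its solution is Cross-Lingual SynthDocs, a large and diverse synthetic corpus of more than 2.5 million samples: 1.5 million text samples, 270K fully annotated tables, and hundreds of thousands of charts based on real data. The pipeline uses authentic scanned backgrounds, bilingual layouts, and diacritic-aware fonts to capture the typographic and structural complexity of Arabic documents, and renders charts and tables in a variety of styles. Fine-tuning Qwen-2.5-VL on the corpus consistently improves word error rate (WER) and character error rate (CER) on several public Arabic OCR benchmarks, and also improves Tree-Edit Distance Similarity (TEDS) and Chart Extraction Score (CharTeX) in other modalities, giving multilingual document analysis a scalable, visually realistic resource.

Link: https://arxiv.org/abs/2511.04699
Authors: Haneen Al-Homoud, Asma Ibrahim, Murtadha Al-Jubran, Fahad Al-Otaibi, Yazeed Al-Harbi, Daulet Toibazar, Kesen Wang, Pedro J. Moreno
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: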

Abstract: Cross-Lingual SynthDocs is a large-scale synthetic corpus designed to address the scarcity of Arabic resources for Optical Character Recognition (OCR) and Document Understanding (DU). The dataset comprises over 2.5 million samples, including 1.5 million text samples, 270K fully annotated tables, and hundreds of thousands of charts based on real data. Our pipeline leverages authentic scanned backgrounds, bilingual layouts, and diacritic-aware fonts to capture the typographic and structural complexity of Arabic documents. In addition to text, the corpus includes a variety of rendered styles for charts and tables. Finetuning Qwen-2.5-VL on SynthDocs yields consistent improvements in Word Error Rate (WER) and Character Error Rate (CER) for OCR across multiple public Arabic benchmarks; Tree-Edit Distance Similarity (TEDS) and Chart Extraction Score (CharTeX) improve as well in other modalities. SynthDocs provides a scalable, visually realistic resource for advancing research in multilingual document analysis.

[NLP-49] multiMentalRoBERTa: A Fine-tuned Multiclass Classifier for Mental Health Disorder

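Quick read: This paper addresses early identification of common mental health conditions from social media text (stress, anxiety, depression, post-traumatic stress disorder (PTSD), suicidal ideation, and neutral discourse) so that timely intervention and support become possible. The authors propose and validate multiMentalRoBERTa, a fine-tuned RoBERTa model for multiclass classification that reaches macro F1-scores of 0.839 in the six-class setup and 0.870 in the five-class setup, outperforming traditional machine learning methods, domain-specific transformers, and prompting-based large language models. Explainability methods such as Layer Integrated Gradients and KeyBERT identify the lexical cues that drive classification, improving reliability and transparency in sensitive settings; the paper also stresses fairness, bias mitigation, and human-in-the-loop safety protocols.

Link: https://arxiv.org/abs/2511.04698
Authors: K M Sajjadul Islam, John Fields, Praveen Madiraju
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted in IEEE Big Data, 8-11 December, 2025 @ Macau SAR, China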

Abstract:The early detection of mental health disorders from social media text is critical for enabling timely support, risk assessment, and referral to appropriate resources. This work introduces multiMentalRoBERTa, a fine-tuned RoBERTa model designed for multiclass classification of common mental health conditions, including stress, anxiety, depression, post-traumatic stress disorder (PTSD), suicidal ideation, and neutral discourse. Drawing on multiple curated datasets, data exploration is conducted to analyze class overlaps, revealing strong correlations between depression and suicidal ideation as well as anxiety and PTSD, while stress emerges as a broad, overlapping category. Comparative experiments with traditional machine learning methods, domain-specific transformers, and prompting-based large language models demonstrate that multiMentalRoBERTa achieves superior performance, with macro F1-scores of 0.839 in the six-class setup and 0.870 in the five-class setup (excluding stress), outperforming both fine-tuned MentalBERT and baseline classifiers. Beyond predictive accuracy, explainability methods, including Layer Integrated Gradients and KeyBERT, are applied to identify lexical cues that drive classification, with a particular focus on distinguishing depression from suicidal ideation. The findings emphasize the effectiveness of fine-tuned transformers for reliable and interpretable detection in sensitive contexts, while also underscoring the importance of fairness, bias mitigation, and human-in-the-loop safety protocols. Overall, multiMentalRoBERTa is presented as a lightweight, robust, and deployable solution for enhancing support in mental health platforms.
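The macro F1-score reported above averages per-class F1 with equal weight per class, so rare classes count as much as common ones. The following is a minimal sketch of that standard definition in plain Python; the function name `macro_f1` and the toy labels are illustrative assumptions and are not taken from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, for illustration only.
y_true = ["depression", "depression", "anxiety", "neutral"]
y_pred = ["depression", "anxiety", "anxiety", "neutral"]
score = macro_f1(y_true, y_pred)
```

Because every class contributes equally, a model that neglects one minority class loses a full share of the score, which is why macro F1 is a common choice for imbalanced multiclass problems.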

[NLP-50] Simulating Misinformation Vulnerabilities With Agent Personas

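Quick read: This paper asks how to study different populations' responses to disinformation without real-world experiments, which are impractical and ethically challenging. The authors propose an agent-based simulation built on large language models (LLMs): they construct agent personas spanning five professions and three mental schemas and evaluate their reactions to news headlines. The LLM-generated agents align closely with ground-truth labels and human predictions, supporting their use as proxies in models of social information networks. Mental schemas influence how agents interpret misinformation more than professional background does, which gives a computational framework for analyzing trust, polarization, and susceptibility.

Link: https://arxiv.org/abs/2511.04697
Authors: David Farr, Lynnette Hui Xian Ng, Stephen Prochaska, Iain J. Cruickshank, Jevin West
Affiliations: University of Washington; Carnegie Mellon University
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to Winter Simulation Conference 2025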

Abstract:Disinformation campaigns can distort public perception and destabilize institutions. Understanding how different populations respond to information is crucial for designing effective interventions, yet real-world experimentation is impractical and ethically challenging. To address this, we develop an agent-based simulation using Large Language Models (LLMs) to model responses to misinformation. We construct agent personas spanning five professions and three mental schemas, and evaluate their reactions to news headlines. Our findings show that LLM-generated agents align closely with ground-truth labels and human predictions, supporting their use as proxies for studying information responses. We also find that mental schemas, more than professional background, influence how agents interpret misinformation. This work provides a validation of LLMs to be used as agents in an agent-based model of an information network for analyzing trust, polarization, and susceptibility to deceptive content in complex social systems.

[NLP-51] EncouRAGe: Evaluating RAG Local, Fast, and Reliable

Quick read: This paper addresses the lack of a unified, extensible tooling framework that supports scientific reproducibility when developing and evaluating retrieval-augmented generation (RAG) systems. Its solution is EncouRAGe, a modular Python framework with five core components (Type Manifest, RAG Factory, Inference, Vector Store, and Metrics) that automates the workflow from data processing to evaluation and is designed for extension, aiming to make RAG research more efficient and reliable.

Link: https://arxiv.org/abs/2511.04696
Authors: Jan Strich, Adeline Scharfenberg, Chris Biemann, Martin Semmann
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Currently under review

Abstract:We introduce EncouRAGe, a comprehensive Python framework designed to streamline the development and evaluation of Retrieval-Augmented Generation (RAG) systems using Large Language Models (LLMs) and Embedding Models. EncouRAGe comprises five modular and extensible components: Type Manifest, RAG Factory, Inference, Vector Store, and Metrics, facilitating flexible experimentation and extensible development. The framework emphasizes scientific reproducibility, diverse evaluation metrics, and local deployment, enabling researchers to efficiently assess datasets within RAG workflows. This paper presents implementation details and an extensive evaluation across multiple benchmark datasets, including 25k QA pairs and over 51k documents. Our results show that RAG still underperforms compared to the Oracle Context, while Hybrid BM25 consistently achieves the best results across all four datasets. We further examine the effects of reranking, observing only marginal performance improvements accompanied by higher response latency.

[NLP-52] Reasoning Up the Instruction Ladder for Controllable Language Models

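Quick read: This paper addresses conflicts among instructions from multiple sources that large language models (LLMs) face in real-world decision-making: within a single prompt context, a model must reconcile competing instructions from model developers, users, and tools. The core challenge is ensuring that higher-priority instructions override lower-priority requests, so that model behavior stays reliable and controllable. The key idea is to reframe instruction hierarchy (IH) resolution as a reasoning task: the model first "thinks" about how the user prompt relates to higher-priority system instructions and only then generates a response. To train this ability, the authors build VerIH, a dataset of constraint-following tasks with verifiable answers that covers both aligned and conflicting system-user instruction pairs, and apply lightweight reinforcement learning. The resulting models improve consistently on instruction-following and instruction-hierarchy benchmarks, generalize to safety-critical settings, and are more robust to jailbreak and prompt injection attacks.

Link: https://arxiv.org/abs/2511.04694
Authors: Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar
Affiliations: The Ohio State University; Microsoft Research; University of Texas at Austin; Allen Institute for AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: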

Abstract:As large language model (LLM) based systems take on high-stakes roles in real-world decision-making, they must reconcile competing instructions from multiple sources (e.g., model developers, users, and tools) within a single prompt context. Thus, enforcing an instruction hierarchy (IH) in LLMs, where higher-level directives override lower-priority requests, is critical for the reliability and controllability of LLMs. In this work, we reframe instruction hierarchy resolution as a reasoning task. Specifically, the model must first “think” about the relationship between a given user prompt and higher-priority (system) instructions before generating a response. To enable this capability via training, we construct VerIH, an instruction hierarchy dataset of constraint-following tasks with verifiable answers. This dataset comprises both aligned and conflicting system-user instructions. We show that lightweight reinforcement learning with VerIH effectively transfers general reasoning capabilities of models to instruction prioritization. Our finetuned models achieve consistent improvements on instruction following and instruction hierarchy benchmarks. This reasoning ability also generalizes to safety-critical settings beyond the training distribution. By treating safety issues as resolving conflicts between adversarial user inputs and predefined higher-priority policies, our trained model enhances robustness against jailbreak and prompt injection attacks. These results demonstrate that reasoning over instruction hierarchies provides a practical path to reliable LLMs, where updates to system prompts yield controllable and robust changes in model behavior.

[NLP-53] SARC: Sentiment-Augmented Deep Role Clustering for Fake News Detection WSDM2026

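Quick read: This paper addresses fake news detection methods that underuse fine-grained sentiment features because they ignore differences between user roles. Existing approaches usually treat sentiment features as auxiliary signals and do not distinguish users with different roles who express the same sentiment polarity, which limits their ability to capture nuanced patterns. The proposed framework, SARC, uses sentiment-augmented deep clustering: it builds user features from a joint comment-text representation (BiGRU with attention) and sentiment encoding, constructs a differentiable deep clustering module that identifies user roles automatically, and optimizes a joint objective that combines role clustering with fake news detection in a multi-task fashion.

Link: https://arxiv.org/abs/2511.04692
Authors: Jingqing Wang, Jiaxing Shang, Rong Xu, Fei Hao, Tianjin Huang, Geyong Min
Affiliations: Chongqing University; University of Exeter; Shaanxi Normal University
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 11 figures, 4 tables, WSDM 2026 accepted paper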

Abstract:Fake news detection has been a long-standing research focus in social networks. Recent studies suggest that incorporating sentiment information from both news content and user comments can enhance detection performance. However, existing approaches typically treat sentiment features as auxiliary signals, overlooking role differentiation, that is, the same sentiment polarity may originate from users with distinct roles, thereby limiting their ability to capture nuanced patterns for effective detection. To address this issue, we propose SARC, a Sentiment-Augmented Role Clustering framework which utilizes sentiment-enhanced deep clustering to identify user roles for improved fake news detection. The framework first generates user features through joint comment text representation (with BiGRU and Attention mechanism) and sentiment encoding. It then constructs a differentiable deep clustering module to automatically categorize user roles. Finally, unlike existing approaches which take fake news label as the unique supervision signal, we propose a joint optimization objective integrating role clustering and fake news detection to further improve the model performance. Experimental results on two benchmark datasets, RumourEval-19 and Weibo-comp, demonstrate that SARC achieves superior performance across all metrics compared to baseline models. The code is available at: this https URL.

[NLP-54] A Penny for Your Thoughts: Decoding Speech from Inexpensive Brain Signals

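Quick read: This paper studies whether neural networks can decode speech from electroencephalography (EEG) signals, that is, a brain-to-speech mapping. The core approach uses contrastive learning to align EEG-derived embeddings with embeddings from a pre-trained transformer-based speech model. Building on Meta's state-of-the-art EEG decoder, the authors introduce three modifications: subject-specific attention layers, personalized spatial attention, and a dual-path RNN with attention. Two of the three improved word error rate (WER), which points to the promise of personalized architectures for brain-to-speech decoding in brain-computer interfaces (BCI).

Link: https://arxiv.org/abs/2511.04691
Authors: Quentin Auster, Kateryna Shapovalenko, Chuang Ma, Demaio Sun
Affiliations: Carnegie Mellon University
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS); Neurons and Cognition (q-bio.NC)
Comments: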

Abstract:We explore whether neural networks can decode brain activity into speech by mapping EEG recordings to audio representations. Using EEG data recorded as subjects listened to natural speech, we train a model with a contrastive CLIP loss to align EEG-derived embeddings with embeddings from a pre-trained transformer-based speech model. Building on the state-of-the-art EEG decoder from Meta, we introduce three architectural modifications: (i) subject-specific attention layers (+0.15% WER improvement), (ii) personalized spatial attention (+0.45%), and (iii) a dual-path RNN with attention (-1.87%). Two of the three modifications improved performance, highlighting the promise of personalized architectures for brain-to-speech decoding and applications in brain-computer interfaces.
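A CLIP-style contrastive loss treats each matching EEG/speech pair in a batch as the positive and every other pairing as a negative, in both directions. The sketch below shows that symmetric objective in plain Python. The function names, the temperature of 0.07, and the toy embeddings are assumptions for illustration; the abstract does not state the authors' hyperparameters or implementation.

```python
import math


def _normalize(vector):
    # Scale to unit length so dot products become cosine similarities.
    # Assumes a non-zero vector.
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def _cross_entropy_on_diagonal(logits):
    # Mean of -log softmax(row)[i], where entry i of row i is the true pair.
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def clip_loss(eeg_embeddings, speech_embeddings, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings."""
    eeg = [_normalize(v) for v in eeg_embeddings]
    speech = [_normalize(v) for v in speech_embeddings]
    logits = [
        [sum(a * b for a, b in zip(e, s)) / temperature for s in speech]
        for e in eeg
    ]
    transposed = [list(column) for column in zip(*logits)]
    # Average the EEG-to-speech and speech-to-EEG directions.
    return 0.5 * (
        _cross_entropy_on_diagonal(logits)
        + _cross_entropy_on_diagonal(transposed)
    )


# Hypothetical 2-D embeddings: aligned pairs versus deliberately swapped pairs.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = clip_loss(aligned, aligned)
loss_swapped = clip_loss(aligned, swapped)
```

The loss is near zero when each EEG embedding is closest to its own speech embedding and grows when the pairs are mismatched, which is the signal that pulls the two embedding spaces into alignment.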

[NLP-55] Automatización de Informes Geotécnicos para Macizos Rocosos con IA

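Quick read: This paper addresses the inefficiency, error-proneness, and subjectivity of traditional geotechnical report preparation, which relies on manual field observations made with compasses, magnifying glasses, and notebooks. Its solution uses a multimodal large language model (MLLM) to process rock outcrop images and field data automatically. Iteratively refined prompts (prompt engineering) yield structured generation of each report section, replacing the costly process of fine-tuning the model. The system scores 0.455 on BLEU and 0.653 on ROUGE-L, suggesting that the generated descriptions are comparable to expert-written ones, and offers geological engineering professionals and students an efficient, standardized, easy-to-use tool.

Link: https://arxiv.org/abs/2511.04690
Authors: Christofer Valencia, Alexis Llumigusín, Silvia Alvarez, Abrahan Arias, Christian Mejia-Escobar
Affiliations: Universidad Central del Ecuador
Categories: Multimedia (cs.MM); Computation and Language (cs.CL)
Comments: 17 pages, in Spanish language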

Abstract:Geotechnical reports are crucial for assessing the stability of rock formations and ensuring safety in modern engineering. Traditionally, these reports are prepared manually based on field observations using compasses, magnifying glasses, and notebooks. This method is slow, prone to errors, and subjective in its interpretations. To overcome these limitations, the use of artificial intelligence techniques is proposed for the automatic generation of reports through the processing of images and field data. The methodology was based on the collection of photographs of rock outcrops and manual samples with their respective descriptions, as well as on the reports prepared during the Geotechnical Studies course. These resources were used to define the report outline, prompt engineering, and validate the responses of a multimodal large language model (MLLM). The iterative refinement of prompts until structured and specific instructions were obtained for each section of the report proved to be an effective alternative to the costly process of fine-tuning the MLLM. The system evaluation establishes values of 0.455 and 0.653 for the BLEU and ROUGE-L metrics, respectively, suggesting that automatic descriptions are comparable to those made by experts. This tool, accessible via the web, with an intuitive interface and the ability to export to standardized formats, represents an innovation and an important contribution for professionals and students of field geology.

[NLP-56] Adaptive Testing for LLM Evaluation: A Psychometric Alternative to Static Benchmarks

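Quick read: This paper addresses the high cost and low efficiency of large language model (LLM) evaluation and the inability of static benchmarks to account for item quality. Existing methods compute average accuracy over a fixed item set and ignore differences between items, so results can be distorted by annotation errors or uninformative items. The proposed framework, ATLAS, applies adaptive testing based on Item Response Theory (IRT): Fisher information-guided item selection dynamically picks the most informative items, cutting the number of items by about 90% while preserving measurement precision (on HellaSwag, 42 items match full-benchmark estimates with 0.154 MAE). Item exposure rates stay below 10% and test overlap at 16-27%. The analysis also finds that 3-6% of benchmark items exhibit negative discrimination, and that accuracy rankings diverge from IRT ability estimates: 23-31% of models shift by more than 10 rank positions.

Link: https://arxiv.org/abs/2511.04689
Authors: Peiyu Li, Xiuxiu Tang, Si Chen, Ying Cheng, Ronald Metoyer, Ting Hua, Nitesh V. Chawla
Affiliations: University of Notre Dame
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Code and calibrated item banks are available at this https URL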

Abstract: Large language model evaluation requires thousands of benchmark items, making evaluations expensive and slow. Existing methods compute average accuracy across fixed item sets, treating all items equally despite varying quality and informativeness. We present ATLAS, an adaptive testing framework using Item Response Theory (IRT) to estimate model ability through Fisher information-guided item selection. Our analysis of five major benchmarks reveals that 3-6% of items exhibit negative discrimination, indicating annotation errors that corrupt static evaluation. ATLAS achieves 90% item reduction while maintaining measurement precision: on HellaSwag (5,608 items), we match full-benchmark estimates using only 42 items with 0.154 MAE. Our framework maintains item exposure rates below 10% and test overlap at 16-27%, compared to static benchmarks where every model sees all items (100% exposure). Among 4,000+ tested models, IRT ranks differ from accuracy ranks: models with the same accuracy get different IRT scores, and 23-31% of all models shift by more than 10 rank positions. Code and calibrated item banks are available at this https URL.
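Fisher information-guided selection can be illustrated with the two-parameter logistic (2PL) IRT model, where an item with discrimination `a` and difficulty `b` carries information `a^2 * P * (1 - P)` at ability `theta`. The sketch below assumes the 2PL model and uses hypothetical function names and a made-up item bank; the abstract does not say which IRT variant or selection code ATLAS uses.

```python
import math


def probability_correct(theta, discrimination, difficulty):
    """2PL probability that a model with ability theta answers the item correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


def fisher_information(theta, discrimination, difficulty):
    """Item information under the 2PL model: a^2 * P * (1 - P)."""
    p = probability_correct(theta, discrimination, difficulty)
    return discrimination * discrimination * p * (1.0 - p)


def select_next_item(theta, item_bank, administered):
    """Return the index of the not-yet-used item that is most informative at theta."""
    candidates = [i for i in range(len(item_bank)) if i not in administered]
    return max(candidates, key=lambda i: fisher_information(theta, *item_bank[i]))


# Hypothetical item bank of (discrimination, difficulty) pairs.
item_bank = [(1.0, -2.0), (1.5, 0.0), (0.5, 0.1)]
first_pick = select_next_item(0.0, item_bank, administered=set())
second_pick = select_next_item(0.0, item_bank, administered={first_pick})
```

Information peaks where difficulty matches the current ability estimate and scales with the square of the discrimination, so a sharply discriminating item near `theta` is chosen before an easy or weakly discriminating one. In a full adaptive test, `theta` would be re-estimated after every response before the next selection.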

[NLP-57] Evaluating LLMs' Reasoning Over Ordered Procedural Steps AACL2025

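Quick read: This paper examines the limits of large language models (LLMs) on reasoning over procedural sequences, specifically reconstructing a globally ordered sequence from shuffled steps, which matters in order-sensitive domains such as cooking. The authors build an evaluation framework on a dataset of food recipes and use complementary metrics, namely Kendall's Tau, Normalized Longest Common Subsequence (NLCS), and Normalized Edit Distance (NED), to assess ordering ability in zero-shot and few-shot settings. The analysis shows that performance declines as sequences get longer and as the input is shuffled more severely.

Link: https://arxiv.org/abs/2511.04688
Authors: Adrita Anika, Md Messal Monem Miah
Affiliations: Amazon; Texas A&M University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to IJCNLP-AACL 2025 Findings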

Abstract:Reasoning over procedural sequences, where the order of steps directly impacts outcomes, is a critical capability for large language models (LLMs). In this work, we study the task of reconstructing globally ordered sequences from shuffled procedural steps, using a curated dataset of food recipes, a domain where correct sequencing is essential for task success. We evaluate several LLMs under zero-shot and few-shot settings and present a comprehensive evaluation framework that adapts established metrics from ranking and sequence alignment. These include Kendall’s Tau, Normalized Longest Common Subsequence (NLCS), and Normalized Edit Distance (NED), which capture complementary aspects of ordering quality. Our analysis shows that model performance declines with increasing sequence length, reflecting the added complexity of longer procedures. We also find that greater step displacement in the input, corresponding to more severe shuffling, leads to further degradation. These findings highlight the limitations of current LLMs in procedural reasoning, especially with longer and more disordered inputs.
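The three ordering metrics named above can be sketched for the simple case where a prediction is a permutation of the gold step indices. The normalizations used here (LCS divided by the gold length, edit distance divided by the longer sequence) are common conventions and are assumptions; the paper may define them differently, and all function names are illustrative.

```python
def kendall_tau(predicted, gold):
    """Kendall's Tau for two orderings of the same distinct steps (no ties)."""
    position = {step: i for i, step in enumerate(predicted)}
    ranks = [position[step] for step in gold]
    n = len(ranks)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if ranks[i] < ranks[j]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def normalized_lcs(predicted, gold):
    return lcs_length(predicted, gold) / len(gold)


def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (x != y)))
        previous = current
    return previous[-1]


def normalized_edit_distance(predicted, gold):
    return edit_distance(predicted, gold) / max(len(predicted), len(gold))


# Hypothetical four-step recipe with the first two steps swapped.
gold = [0, 1, 2, 3]
predicted = [1, 0, 2, 3]
tau = kendall_tau(predicted, gold)
nlcs = normalized_lcs(predicted, gold)
ned = normalized_edit_distance(predicted, gold)
```

The metrics are complementary: Kendall's Tau counts pairwise order agreements, NLCS rewards the longest run of steps kept in relative order, and NED penalizes the number of edits needed to recover the gold sequence.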

[NLP-58] Stateful KV Cache Management for LLMs: Balancing Space, Time, Accuracy, and Positional Fidelity

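Quick read: This paper addresses the drop in generation quality caused by unbounded growth of the key-value (KV) cache in stateful multi-turn dialogue with large language models (LLMs). The core observation is that when the KV cache length approaches or exceeds the context window the model was trained with (for example, 8192 tokens for Llama 3), output quality degrades sharply even if GPU memory is not exhausted, a failure mode distinct from running out of memory. The key to the solution is rethinking how eviction strategies are designed: retaining a high fraction of tokens (for example, 99% with AttentionTop) is not enough, and preserving the structural integrity of positional encodings such as RoPE should come first. Concretely, keeping contiguous context blocks (for example, an initial "gist" segment) instead of removing non-contiguous tokens avoids scrambling positional signals and yields more coherent generations. The paper argues for treating "cache health" as a holistic measure that accounts for architectural limits, positional consistency, and cache utilization.

Link: https://arxiv.org/abs/2511.04686
Authors: Pratik Poudel
Affiliations: Florida International University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 14 pages, 2 figures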

Abstract:The Key-Value (KV) cache is integral to efficient autoregressive inference in large language models (LLMs), yet its unbounded growth in stateful multi-turn scenarios presents major challenges. This paper examines the interplay between KV cache management strategies, the architectural context limits of models like meta-llama/Meta-Llama-3-8b-instruct, and the often-overlooked integrity of positional encodings. Through empirical analysis using a stateful benchmarking framework, we show that LLM generation quality degrades sharply when the accumulated KV cache approaches or exceeds the model’s trained context window (e.g., 8192 tokens for Llama 3), a failure mode distinct from GPU memory exhaustion. Common eviction strategies, even high-retention ones (e.g., 99% via AttentionTop), can worsen performance if they disrupt positional coherence. Because LLMs rely on consistent positional signals (e.g., RoPE), compacting a cache by removing non-contiguous tokens can scramble these signals and lead to degenerative outputs. We further show that simple strategies preserving contiguous context blocks (e.g., keeping an initial “gist”) can yield more coherent generations than complex or positionally disruptive ones. We advocate for eviction techniques that respect architectural limits, preserve positional structure, and view “cache health” holistically beyond mere size.
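The idea of evicting while preserving contiguous blocks can be sketched as an index-selection policy: keep an initial "gist" block, and fill the rest of the budget with the most recent tokens. The recent-window half of this policy, the function names, and the numbers are assumptions added for illustration; the abstract only gives "keeping an initial gist" as an example and does not describe the paper's implementation.

```python
def keep_gist_and_recent(cache_length, budget, gist_length):
    """Return the token positions to retain: a leading block plus a trailing block."""
    if cache_length <= budget:
        return list(range(cache_length))  # Nothing needs to be evicted.
    if gist_length > budget:
        raise ValueError("gist_length must not exceed budget")
    recent_length = budget - gist_length
    head = list(range(gist_length))
    tail = list(range(cache_length - recent_length, cache_length))
    return head + tail


def count_contiguous_blocks(positions):
    """Count maximal runs of consecutive positions in a sorted index list."""
    if not positions:
        return 0
    blocks = 1
    for previous, current in zip(positions, positions[1:]):
        if current != previous + 1:
            blocks += 1
    return blocks


# Hypothetical cache of 10 tokens trimmed to a budget of 6, with a 2-token gist.
kept = keep_gist_and_recent(cache_length=10, budget=6, gist_length=2)
blocks = count_contiguous_blocks(kept)
```

A policy like this leaves at most two contiguous runs, so there is a single positional gap to handle. A score-based policy that keeps scattered tokens can leave many gaps, which is the kind of positional disruption the abstract warns about.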

Computer Vision

[CV-0] Visual Spatial Tuning

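Quick read: This paper addresses the weak spatial perception and spatial reasoning of vision-language models (VLMs), with the goal of human-like spatial intelligence. Existing methods usually add extra expert encoders to improve spatial awareness, which introduces computational overhead and harms general capabilities. The paper proposes Visual Spatial Tuning (VST), a comprehensive framework built on two large-scale datasets: VST-P (4.1 million samples covering 19 spatial skills) for spatial perception and VST-R (135K samples) for spatial reasoning. Training is progressive: supervised fine-tuning builds foundational spatial knowledge, and reinforcement learning then improves spatial reasoning. Without sacrificing general capabilities, VST achieves state-of-the-art results on several spatial benchmarks, including 34.8% on MMSI-Bench and 61.2% on VSIBench.

Link: https://arxiv.org/abs/2511.05491
Authors: Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, Hengshuang Zhao
Affiliations: ByteDance; The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: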

Abstract:Capturing spatial relationships from visual inputs is a cornerstone of human-like general intelligence. Several previous studies have tried to enhance the spatial awareness of Vision-Language Models (VLMs) by adding extra expert encoders, which brings extra overhead and usually harms general capabilities. To enhance the spatial ability in general architectures, we introduce Visual Spatial Tuning (VST), a comprehensive framework to cultivate VLMs with human-like visuospatial abilities, from spatial perception to reasoning. We first attempt to enhance spatial perception in VLMs by constructing a large-scale dataset termed VST-P, which comprises 4.1 million samples spanning 19 skills across single views, multiple images, and videos. Then, we present VST-R, a curated dataset with 135K samples that instruct models to reason in space. In particular, we adopt a progressive training pipeline: supervised fine-tuning to build foundational spatial knowledge, followed by reinforcement learning to further improve spatial reasoning abilities. Without the side-effect to general capabilities, the proposed VST consistently achieves state-of-the-art results on several spatial benchmarks, including 34.8% on MMSI-Bench and 61.2% on VSIBench. It turns out that the Vision-Language-Action models can be significantly enhanced with the proposed spatial tuning paradigm, paving the way for more physically grounded AI.

[CV-1] TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning

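Quick read: This paper addresses the efficiency and accuracy of temporal search in long-form video understanding, that is, how to locate a minimal set of query-relevant frames among tens of thousands. Existing methods rely on hand-crafted search processes and lack end-to-end optimization for learning optimal search strategies. The proposed framework, TimeSearch-R, reformulates temporal search as interleaved text-video thinking and uses reinforcement learning (RL) to optimize search decisions and content understanding jointly. The key innovation is GRPO-CSV (GRPO with Completeness Self-Verification): it gathers the video frames searched during reasoning and uses the same policy model to verify whether those frames are adequate, which improves logical consistency and exploration and counters the insufficient content coverage that arises when intermediate decisions in RL training are unsupervised.

Link: https://arxiv.org/abs/2511.05489
Authors: Junwen Pan, Qizhe Zhang, Rui Zhang, Ming Lu, Xin Wan, Yuan Zhang, Chang Liu, Qi She
Affiliations: ByteDance; School of Computer Science, Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 22 pages, 17 figures. Official code: this https URL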

Abstract:Temporal search aims to identify a minimal set of relevant frames from tens of thousands based on a given query, serving as a foundation for accurate long-form video understanding. Existing works attempt to progressively narrow the search space. However, these approaches typically rely on a hand-crafted search process, lacking end-to-end optimization for learning optimal search strategies. In this paper, we propose TimeSearch-R, which reformulates temporal search as interleaved text-video thinking, seamlessly integrating searching video clips into the reasoning process through reinforcement learning (RL). However, applying RL training methods, such as Group Relative Policy Optimization (GRPO), to video reasoning can result in unsupervised intermediate search decisions. This leads to insufficient exploration of the video content and inconsistent logical reasoning. To address these issues, we introduce GRPO with Completeness Self-Verification (GRPO-CSV), which gathers searched video frames from the interleaved reasoning process and utilizes the same policy model to verify the adequacy of searched frames, thereby improving the completeness of video reasoning. Additionally, we construct datasets specifically designed for the SFT cold-start and RL training of GRPO-CSV, filtering out samples with weak temporal dependencies to enhance task difficulty and improve temporal search capabilities. Extensive experiments demonstrate that TimeSearch-R achieves significant improvements on temporal search benchmarks such as Haystack-LVBench and Haystack-Ego4D, as well as long-form video understanding benchmarks like VideoMME and MLVU. Notably, TimeSearch-R establishes a new state-of-the-art on LongVideoBench with 4.1% improvement over the base model Qwen2.5-VL and 2.0% over the advanced video reasoning model Video-R1. Our code is available at this https URL.
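Group Relative Policy Optimization scores each sampled response against the other responses drawn for the same prompt, instead of using a learned value function. The sketch below shows only that group-relative advantage step as it is commonly described; it does not cover the completeness self-verification that GRPO-CSV adds. The function name, the epsilon guard, and the use of the population standard deviation are assumptions, and implementations differ on these details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # Population standard deviation (assumed).
    # eps avoids division by zero when every response earns the same reward.
    return [(reward - mean) / (std + eps) for reward in rewards]


# Hypothetical rewards for four sampled responses to one video question.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
flat_group = group_relative_advantages([1.0, 1.0, 1.0])
```

Responses that beat their group's average receive positive advantages and the rest receive negative ones. A group with identical rewards yields all-zero advantages and therefore no learning signal, one reason reward design matters for the intermediate search decisions discussed above.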

[CV-2] On Flow Matching KL Divergence

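Quick read: This paper studies the statistical efficiency of flow matching for distribution estimation, in particular a non-asymptotic upper bound on the Kullback-Leibler (KL) divergence between the estimated and true data distributions. The key result is a deterministic, non-asymptotic bound: if the L₂ flow-matching loss is at most ε², then the KL divergence is at most A₁ε + A₂ε², where the constants A₁ and A₂ depend only on the regularities of the data and velocity fields. The result implies that flow matching achieves nearly minimax-optimal statistical convergence rates under the total variation (TV) distance, making its efficiency comparable to that of diffusion models.

Link: https://arxiv.org/abs/2511.05480
Authors: Maojiang Su, Jerry Yao-Chieh Hu, Sophia Pi, Han Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: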

Abstract: We derive a deterministic, non-asymptotic upper bound on the Kullback-Leibler (KL) divergence of the flow-matching distribution approximation. In particular, if the $L_2$ flow-matching loss is bounded by $\epsilon^2 > 0$, then the KL divergence between the true data distribution and the estimated distribution is bounded by $A_1 \epsilon + A_2 \epsilon^2$. Here, the constants $A_1$ and $A_2$ depend only on the regularities of the data and velocity fields. Consequently, this bound implies statistical convergence rates of Flow Matching Transformers under the Total Variation (TV) distance. We show that flow matching achieves nearly minimax-optimal efficiency in estimating smooth distributions. Our results make the statistical efficiency of flow matching comparable to that of diffusion models under the TV distance. Numerical studies on synthetic and learned velocities corroborate our theory.
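The stated bound can be written compactly, together with one standard route from a KL bound to a TV bound. The symbols $p$ (true distribution), $\hat{p}$ (estimated distribution), and $\mathcal{L}_{\mathrm{FM}}$ (the $L_2$ flow-matching loss) are notation assumed here. The second inequality is Pinsker's inequality, a general fact; the paper's own derivation of TV rates may proceed differently.

```latex
\mathcal{L}_{\mathrm{FM}} \le \epsilon^2
\;\Longrightarrow\;
\mathrm{KL}\!\left(p \,\|\, \hat{p}\right) \le A_1 \epsilon + A_2 \epsilon^2,
\qquad
\mathrm{TV}\!\left(p, \hat{p}\right)
\le \sqrt{\tfrac{1}{2}\,\mathrm{KL}\!\left(p \,\|\, \hat{p}\right)}
\le \sqrt{\tfrac{1}{2}\left(A_1 \epsilon + A_2 \epsilon^2\right)}.
```

For small $\epsilon$ the linear term $A_1 \epsilon$ dominates, so under this route the TV distance shrinks on the order of $\sqrt{\epsilon}$.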

[CV-3] GroupKAN: Rethinking Nonlinearity with Grouped Spline-based KAN Modeling for Efficient Medical Image Segmentation

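Quick read: This paper addresses the trade-off among accuracy, model size, and interpretability in medical image segmentation. Convolutional neural networks (CNNs) lack adaptive nonlinearity and have opaque decision processes, while Transformer-based architectures are limited by quadratic complexity and black-box attention mechanisms. The authors propose GroupKAN, built on two structured functional modules. Grouped KAN Transform partitions channels into G groups for multivariate spline mappings, reducing complexity from O(C²) to O(C²/G). Grouped KAN Activation shares spline-based nonlinear mappings within each group for efficient token-wise nonlinearity. The design keeps accuracy high while cutting parameters to 47.6% of U-KAN's and improves interpretability, reaching an average IoU of 79.80% on three medical image benchmarks, above U-KAN.

Link: https://arxiv.org/abs/2511.05477
Authors: Guojie Li, Anwar P.P. Abdul Majeed, Muhammad Ateeq, Anh Nguyen, Fan Zhang
Affiliations: Xi'an Jiaotong-Liverpool University; University of Liverpool; Sunway University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: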

Abstract:Medical image segmentation requires models that are accurate, lightweight, and interpretable. Convolutional architectures lack adaptive nonlinearity and transparent decision-making, whereas Transformer architectures are hindered by quadratic complexity and opaque attention mechanisms. U-KAN addresses these challenges using Kolmogorov-Arnold Networks, achieving higher accuracy than both convolutional and attention-based methods, fewer parameters than Transformer variants, and improved interpretability compared to conventional approaches. However, its O(C^2) complexity due to full-channel transformations limits its scalability as the number of channels increases. To overcome this, we introduce GroupKAN, a lightweight segmentation network that incorporates two novel, structured functional modules: (1) Grouped KAN Transform, which partitions channels into G groups for multivariate spline mappings, reducing complexity to O(C^2/G), and (2) Grouped KAN Activation, which applies shared spline-based mappings within each channel group for efficient, token-wise nonlinearity. Evaluated on three medical benchmarks (BUSI, GlaS, and CVC), GroupKAN achieves an average IoU of 79.80 percent, surpassing U-KAN by +1.11 percent while requiring only 47.6 percent of the parameters (3.02M vs 6.35M), and shows improved interpretability.
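The drop from O(C^2) to O(C^2/G) follows from simple counting: a full-channel transform connects every input channel to every output channel, while a grouped transform only connects channels within the same group. The sketch below counts those input-output channel pairs as a rough proxy for the number of learnable spline functions. It assumes equal-sized groups with no cross-group connections and ignores the spline grid size; the function names and channel counts are illustrative, not taken from the paper.

```python
def full_channel_pairs(channels):
    """Input-output channel pairs in a full-channel transform: C * C."""
    return channels * channels


def grouped_channel_pairs(channels, groups):
    """Pairs when channels are split into equal groups: G * (C / G)^2 = C^2 / G."""
    if channels % groups != 0:
        raise ValueError("channels must be divisible by groups")
    per_group = channels // groups
    return groups * per_group * per_group


# Hypothetical layer width and group count.
full = full_channel_pairs(64)
grouped = grouped_channel_pairs(64, 8)
```

With 64 channels and 8 groups the pair count falls by a factor of 8, matching the O(C^2/G) scaling. The overall parameter saving of a real network is smaller than G because other layers are unaffected, which is consistent with the reported 3.02M versus 6.35M parameters.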
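
The complexity claim above can be checked with simple arithmetic. The following is a minimal sketch, not the GroupKAN implementation: it only counts pairwise channel interactions, and the values `C = 256` and `G = 8` are hypothetical. In a real KAN layer each interaction also carries spline coefficients, which multiplies both counts by the same constant.

```python
def full_channel_cost(channels):
    """Pairwise channel interactions in a full transform: every input meets every output."""
    return channels * channels


def grouped_channel_cost(channels, groups):
    """G independent blocks, each mixing C/G channels: G * (C/G)^2 = C^2 / G."""
    if channels % groups:
        raise ValueError("channels must be divisible by groups")
    per_group = channels // groups
    return groups * per_group * per_group


# Hypothetical sizes chosen only for illustration.
C, G = 256, 8
full = full_channel_cost(C)
grouped = grouped_channel_cost(C, G)
print(full, grouped, full // grouped)
```

With these numbers the grouped variant needs exactly 1/G of the interactions, which is the O(C^2/G) scaling stated in the abstract.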

[CV-4] Semantic-Guided Natural Language and Visual Fusion for Cross-Modal Interaction Based on Tiny Object Detection

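Quick read: This paper targets the limited accuracy of tiny object detection caused by weak feature representation and difficult cross-modal alignment. The key idea is to combine semantic-guided natural language processing with strong visual recognition backbones: the BERT language model is paired with the CNN-based Parallel Residual Bi-Fusion Feature Pyramid Network (PRB-FPN-Net), using backbone designs such as ELAN, MSP, and CSP, so that textual semantics and visual features are aligned. Lemmatization and fine-tuning let the system capture semantic cues in the text input and map them into the multi-scale visual feature space, which improves accuracy and robustness on small objects. The model outperforms YOLO-World while using about half the parameters of Transformer-based models such as GLIP.

Link: https://arxiv.org/abs/2511.05474
Authors: Xian-Hong Huang, Hui-Kai Su, Chi-Chia Sun, Jun-Wei Hsieh
Affiliations: National Formosa University; National Taipei University; National Yang Ming Chiao Tung University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: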

Abstract:This paper introduces a cutting-edge approach to cross-modal interaction for tiny object detection by combining semantic-guided natural language processing with advanced visual recognition backbones. The proposed method integrates the BERT language model with the CNN-based Parallel Residual Bi-Fusion Feature Pyramid Network (PRB-FPN-Net), incorporating innovative backbone architectures such as ELAN, MSP, and CSP to optimize feature extraction and fusion. By employing lemmatization and fine-tuning techniques, the system aligns semantic cues from textual inputs with visual features, enhancing detection precision for small and complex objects. Experimental validation using the COCO and Objects365 datasets demonstrates that the model achieves superior performance. On the COCO2017 validation set, it attains a 52.6% average precision (AP), outperforming YOLO-World significantly while maintaining half the parameter consumption of Transformer-based models like GLIP. Several tests on different backbones, such as ELAN, MSP, and CSP, further enable efficient handling of multi-scale objects, ensuring scalability and robustness in resource-constrained environments. This study underscores the potential of integrating natural language understanding with advanced backbone architectures, setting new benchmarks in object detection accuracy, efficiency, and adaptability to real-world challenges.

[CV-5] EventFlow: Real-Time Neuromorphic Event-Driven Classification of Two-Phase Boiling Flow Regimes

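Quick read: This paper addresses the loss of thermal performance and system reliability caused by sudden flow-regime transitions in flow boiling, and in particular the inability of conventional optical imaging to capture transient flow behavior in real time because of its computational cost and limited temporal resolution. The key is event-driven sensing with neuromorphic sensors, which detect per-pixel brightness changes (typically motion at edges) and therefore respond quickly without full-frame reconstruction. Among the classification models built on this data, the event-based long short-term memory (LSTM) model gives the best balance of accuracy (97.6%) and processing time (0.28 ms). Combined with an asynchronous processing pipeline and majority voting, it delivers stable, low-latency real-time predictions that can serve as feedback for experimental control and intelligent thermal management.

Link: https://arxiv.org/abs/2511.05467
Authors: Sanghyeon Chang, Srikar Arani, Nishant Sai Nuthalapati, Youngjoon Suh, Nicholas Choi, Siavash Khodakarami, Md Rakibul Hasan Roni, Nenad Miljkovic, Aparna Chandramowlishwaran, Yoonjin Won
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 6 figures, Under review in Droplet (Manuscript ID: DRO-2025-0045.R1)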

Abstract:Flow boiling is an efficient heat transfer mechanism capable of dissipating high heat loads with minimal temperature variation, making it an ideal thermal management method. However, sudden shifts between flow regimes can disrupt thermal performance and system reliability, highlighting the need for accurate and low-latency real-time monitoring. Conventional optical imaging methods are limited by high computational demands and insufficient temporal resolution, making them inadequate for capturing transient flow behavior. To address this, we propose a real-time framework based on signals from neuromorphic sensors for flow regime classification. Neuromorphic sensors detect changes in brightness at individual pixels, which typically correspond to motion at edges, enabling fast and efficient detection without full-frame reconstruction, providing event-based information. We develop five classification models using both traditional image data and event-based data, demonstrating that models leveraging event data outperform frame-based approaches due to their sensitivity to dynamic flow features. Among these models, the event-based long short-term memory model provides the best balance between accuracy and speed, achieving 97.6% classification accuracy with a processing time of 0.28 ms. Our asynchronous processing pipeline supports continuous, low-latency predictions and delivers stable output through a majority voting mechanism, enabling reliable real-time feedback for experimental control and intelligent thermal management.
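
The abstract mentions majority voting as the way per-step predictions are stabilized. The sketch below shows one plausible form of that idea, a sliding-window vote over a stream of class labels. The window size, the function name, and the regime labels `"bubbly"` and `"slug"` are assumptions made here for illustration; the paper's actual voting scheme may differ.

```python
from collections import Counter, deque


def majority_vote_stream(predictions, window=3):
    """Smooth a stream of per-step class labels with a sliding majority vote."""
    recent = deque(maxlen=window)  # keeps only the last `window` labels
    smoothed = []
    for label in predictions:
        recent.append(label)
        # Pick the most frequent label currently in the window.
        smoothed.append(Counter(recent).most_common(1)[0][0])
    return smoothed


# Hypothetical classifier output with one isolated misprediction at step 3.
raw = ["bubbly", "bubbly", "slug", "bubbly", "slug", "slug", "slug"]
stable = majority_vote_stream(raw, window=3)
print(stable)
```

The isolated `"slug"` at the third step is voted away, while the sustained transition later in the stream is followed with a delay of roughly half the window.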

[CV-6] Photo Dating by Facial Age Aggregation

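Quick read: This paper addresses photo dating, that is, estimating the year in which a photograph was taken. Earlier methods mostly rely on scene features or contextual image content, which makes accurate dating hard when the image carries no explicit temporal cues. The proposed solution uses the faces of the people in the image and builds a probabilistic framework that combines visual evidence with prior knowledge: identity and age information extracted by modern face recognition and age estimation models is combined with career-based temporal priors to infer the capture year. The core contribution is that aggregating evidence from multiple faces improves accuracy, and the approach clearly outperforms scene-based baselines on images with several identifiable individuals.

Link: https://arxiv.org/abs/2511.05464
Authors: Jakub Paplham, Vojtech Franc
Affiliations: Czech Technical University in Prague
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: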

Abstract:We introduce a novel method for Photo Dating which estimates the year a photograph was taken by leveraging information from the faces of people present in the image. To facilitate this research, we publicly release CSFD-1.6M, a new dataset containing over 1.6 million annotated faces, primarily from movie stills, with identity and birth year annotations. Uniquely, our dataset provides annotations for multiple individuals within a single image, enabling the study of multi-face information aggregation. We propose a probabilistic framework that formally combines visual evidence from modern face recognition and age estimation models, and career-based temporal priors to infer the photo capture year. Our experiments demonstrate that aggregating evidence from multiple faces consistently improves the performance and the approach significantly outperforms strong, scene-based baselines, particularly for images containing several identifiable individuals.
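
The aggregation idea can be illustrated with a small worked example. This is a simplified sketch under strong assumptions that are not taken from the paper: identities and birth years are treated as known, faces are treated as independent, the prior over years is uniform, and the career-based priors are omitted. All names and numbers are hypothetical.

```python
import math


def year_posterior(faces, years):
    """Combine per-face age evidence into a normalized distribution over capture years.

    faces: list of (birth_year, age_probs), where age_probs maps age -> probability.
    """
    log_scores = {}
    for year in years:
        total = 0.0
        for birth_year, age_probs in faces:
            # Probability the age model assigns to the age implied by this year.
            p = age_probs.get(year - birth_year, 0.0)
            total += math.log(p + 1e-9)  # small floor avoids log(0)
        log_scores[year] = total
    # Normalize in a numerically stable way by subtracting the maximum log score.
    peak = max(log_scores.values())
    weights = {y: math.exp(s - peak) for y, s in log_scores.items()}
    norm = sum(weights.values())
    return {y: w / norm for y, w in weights.items()}


faces = [
    (1960, {39: 0.2, 40: 0.5, 41: 0.3}),  # hypothetical person A
    (1975, {24: 0.3, 25: 0.4, 26: 0.3}),  # hypothetical person B
]
posterior = year_posterior(faces, range(1995, 2006))
best_year = max(posterior, key=posterior.get)
print(best_year)
```

Each face alone leaves a few candidate years open; multiplying the per-face likelihoods concentrates the mass on the years that all faces agree on, which is the intuition behind aggregating several faces.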

[CV-7] SiamMM: A Mixture Model Perspective on Deep Unsupervised Learning

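Quick read: This paper addresses the heuristic way clustering is applied in self-supervised learning, where current strategies lack theoretical guidance and the best approach remains unclear. The key is to establish a theoretical connection between these unsupervised clustering methods and classical mixture models from statistics, and to use that framework to substantially improve existing clustering methods, resulting in a new model named SiamMM. The model achieves state-of-the-art performance on several self-supervised learning benchmarks, and inspection of the learned clusters shows strong agreement with unseen ground-truth labels, revealing potential mislabeling.

Link: https://arxiv.org/abs/2511.05462
Authors: Xiaodong Wang, Jing Huang, Kevin J Liang
Affiliations: Meta
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: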

Abstract:Recent studies have demonstrated the effectiveness of clustering-based approaches for self-supervised and unsupervised learning. However, the application of clustering is often heuristic, and the optimal methodology remains unclear. In this work, we establish connections between these unsupervised clustering methods and classical mixture models from statistics. Through this framework, we demonstrate significant enhancements to these clustering methods, leading to the development of a novel model named SiamMM. Our method attains state-of-the-art performance across various self-supervised learning benchmarks. Inspection of the learned clusters reveals a strong resemblance to unseen ground truth labels, uncovering potential instances of mislabeling.

[CV-8] The Potential of Copernicus Satellites for Disaster Response: Retrieving Building Damage from Sentinel-1 and Sentinel-2

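Quick read: This paper addresses rapid, wide-area building damage assessment after natural disasters, in particular the limited availability of very-high-resolution (VHR) remote sensing imagery for emergency response. The solution uses medium-resolution Earth observation data from the Copernicus program (Sentinel-1 synthetic aperture radar and Sentinel-2 optical imagery) to build the xBD-S12 dataset of 10,315 pre- and post-disaster image pairs. Experiments show that such data supports building damage detection and mapping in many disaster scenarios, and that complex model architectures and geospatial foundation models bring little practical benefit, which makes Copernicus imagery a viable and scalable data source for rapid post-disaster assessment.

Link: https://arxiv.org/abs/2511.05461
Authors: Olivier Dietrich, Merlin Alfredsson, Emilia Arens, Nando Metzger, Torben Peters, Linus Scheibenreif, Jan Dirk Wegner, Konrad Schindler
Affiliations: Photogrammetry & Remote Sensing Lab, ETH Zurich; EcoVision Lab, Department of Mathematical Modeling and Machine Learning (DM3L), University of Zurich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: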

Abstract:Natural disasters demand rapid damage assessment to guide humanitarian response. Here, we investigate whether medium-resolution Earth observation images from the Copernicus program can support building damage assessment, complementing very-high resolution imagery with often limited availability. We introduce xBD-S12, a dataset of 10,315 pre- and post-disaster image pairs from both Sentinel-1 and Sentinel-2, spatially and temporally aligned with the established xBD benchmark. In a series of experiments, we demonstrate that building damage can be detected and mapped rather well in many disaster scenarios, despite the moderate 10 m ground sampling distance. We also find that, for damage mapping at that resolution, architectural sophistication does not seem to bring much advantage: more complex model architectures tend to struggle with generalization to unseen disasters, and geospatial foundation models bring little practical benefit. Our results suggest that Copernicus images are a viable data source for rapid, wide-area damage assessment and could play an important role alongside VHR imagery. We release the xBD-S12 dataset, code, and trained models to support further research.

[CV-9] How Many Tokens Do 3D Point Cloud Transformer Architectures Really Need? NEURIPS2025

Quick read: This paper addresses the high compute and memory cost of dense token representations in current 3D point cloud Transformers. Its core observation is that tokens are highly redundant, which makes training and inference inefficient. The proposed gitmerge3D is a globally informed graph token merging method that reduces the token count by up to 90-95% while keeping performance competitive with the original models. This challenges the assumption that more tokens necessarily yield better performance and points toward more scalable 3D foundation architectures.

Link: https://arxiv.org/abs/2511.05449
Authors: Tuan Anh Tran, Duy M. H. Nguyen, Hoai-Chau Tran, Michael Barz, Khoa D. Doan, Roger Wattenhofer, Ngo Anh Vien, Mathias Niepert, Daniel Sonntag, Paul Swoboda
Affiliations: German Research Centre for Artificial Intelligence (DFKI); Max Planck Research School for Intelligent Systems (IMPRS-IS); University of Stuttgart; VinUni-Illinois Smart Health Center, VinUniversity; College of Engineering & Computer Science, VinUniversity; ETH Zurich; VinRobotics; University of Oldenburg; Heinrich Heine University Düsseldorf
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at NeurIPS 2025

Abstract:Recent advances in 3D point cloud transformers have led to state-of-the-art results in tasks such as semantic segmentation and reconstruction. However, these models typically rely on dense token representations, incurring high computational and memory costs during training and inference. In this work, we present the finding that tokens are remarkably redundant, leading to substantial inefficiency. We introduce gitmerge3D, a globally informed graph token merging method that can reduce the token count by up to 90-95% while maintaining competitive performance. This finding challenges the prevailing assumption that more tokens inherently yield better performance and highlights that many current models are over-tokenized and under-optimized for scalability. We validate our method across multiple 3D vision tasks and show consistent improvements in computational efficiency. This work is the first to assess redundancy in large-scale 3D transformer models, providing insights into the development of more efficient 3D foundation architectures. Our code and checkpoints are publicly available at this https URL

[CV-10] Shared Latent Representation for Joint Text-to-Audio-Visual Synthesis

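Quick read: This paper addresses inaccurate audio-visual alignment and unnatural, inexpressive facial motion in text-to-talking-face synthesis, especially when no ground-truth audio is available at inference. The key is to use latent speech representations from HierSpeech++: a Text-to-Vec module maps text to Wav2Vec2 embeddings, which jointly condition speech and face generation. A two-stage training strategy, pretraining on Wav2Vec2 embeddings and then fine-tuning on text-to-speech (TTS) outputs, mitigates the distribution shift between clean and TTS-predicted features, giving accurate lip-sync, preserved speaker identity, and natural, expressive facial motion.

Link: https://arxiv.org/abs/2511.05432
Authors: Dogucan Yaman, Seymanur Akti, Fevziye Irem Eyiokur, Alexander Waibel
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: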

Abstract:We propose a text-to-talking-face synthesis framework leveraging latent speech representations from HierSpeech++. A Text-to-Vec module generates Wav2Vec2 embeddings from text, which jointly condition speech and face generation. To handle distribution shifts between clean and TTS-predicted features, we adopt a two-stage training: pretraining on Wav2Vec2 embeddings and finetuning on TTS outputs. This enables tight audio-visual alignment, preserves speaker identity, and produces natural, expressive speech and synchronized facial motion without ground-truth audio at inference. Experiments show that conditioning on TTS-predicted latent features outperforms cascaded pipelines, improving both lip-sync and visual realism.

[CV-11] Sharing the Learned Knowledge-base to Estimate Convolutional Filter Parameters for Continual Image Restoration

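Quick read: This paper addresses continual learning for image restoration: learning new tasks without forgetting earlier ones while coping with large image sizes and diverse degradation types. Existing methods typically require heavily engineered architectural changes with significant computational overhead, and regularization-based methods are unsuitable because different restoration tasks need different kinds of feature processing. The proposed solution is a simple modification of the convolution layer that adapts knowledge from earlier tasks without touching the backbone architecture, so it can be applied to any deep network. It increases the number of trainable parameters without significantly increasing computational overhead or inference time, introduces new tasks without degrading existing ones, and improves new-task performance by drawing on the knowledge base built from earlier tasks.

Link: https://arxiv.org/abs/2511.05421
Authors: Aupendu Kar, Krishnendu Ghosh, Prabir Kumar Biswas
Affiliations: Dolby Laboratories, Inc; Indian Institute of Technology Kharagpur
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted to ACM ICVGIP 2025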

Abstract:Continual learning is an emerging topic in the field of deep learning, where a model is expected to learn continuously for new upcoming tasks without forgetting previous experiences. This field has witnessed numerous advancements, but few works have been attempted in the direction of image restoration. Handling large image sizes and the divergent nature of various degradation poses a unique challenge in the restoration domain. However, existing works require heavily engineered architectural modifications for new task adaptation, resulting in significant computational overhead. Regularization-based methods are unsuitable for restoration, as different restoration challenges require different kinds of feature processing. In this direction, we propose a simple modification of the convolution layer to adapt the knowledge from previous restoration tasks without touching the main backbone architecture. Therefore, it can be seamlessly applied to any deep architecture without any structural modifications. Unlike other approaches, we demonstrate that our model can increase the number of trainable parameters without significantly increasing computational overhead or inference time. Experimental validation demonstrates that new restoration tasks can be introduced without compromising the performance of existing tasks. We also show that performance on new restoration tasks improves by adapting the knowledge from the knowledge base created by previous restoration tasks. The code is available at this https URL.

[CV-12] Multi-modal Loop Closure Detection with Foundation Models in Severely Unstructured Environments ICRA2026

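Quick read: This paper addresses loop closure detection in GNSS-denied, severely unstructured environments, where visual place recognition fails because of aliasing and weak textures and LiDAR-based methods suffer from sparsity and geometric ambiguity. The proposed multimodal pipeline, MPRF, uses transformer-based foundation models for both vision and LiDAR: DINOv2 features with SALAD aggregation drive a two-stage visual retrieval for efficient candidate screening, and SONATA-based LiDAR descriptors provide geometric verification with explicit 6-DoF pose estimation. Coupling retrieval with geometric verification improves precision and pose-estimation robustness in low-texture regions and yields interpretable correspondences suitable for SLAM back-ends.

Link: https://arxiv.org/abs/2511.05404
Authors: Laura Alejandra Encinar Gonzalez, John Folkesson, Rudolph Triebel, Riccardo Giubilato
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Under review for ICRA 2026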

Abstract:Robust loop closure detection is a critical component of Simultaneous Localization and Mapping (SLAM) algorithms in GNSS-denied environments, such as in the context of planetary exploration. In these settings, visual place recognition often fails due to aliasing and weak textures, while LiDAR-based methods suffer from sparsity and ambiguity. This paper presents MPRF, a multimodal pipeline that leverages transformer-based foundation models for both vision and LiDAR modalities to achieve robust loop closure in severely unstructured environments. Unlike prior work limited to retrieval, MPRF integrates a two-stage visual retrieval strategy with explicit 6-DoF pose estimation, combining DINOv2 features with SALAD aggregation for efficient candidate screening and SONATA-based LiDAR descriptors for geometric verification. Experiments on the S3LI dataset and S3LI Vulcano dataset show that MPRF outperforms state-of-the-art retrieval methods in precision while enhancing pose estimation robustness in low-texture regions. By providing interpretable correspondences suitable for SLAM back-ends, MPRF achieves a favorable trade-off between accuracy, efficiency, and reliability, demonstrating the potential of foundation models to unify place recognition and pose estimation. Code and models will be released at this http URL.
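
The retrieve-then-verify pattern described in the abstract can be sketched generically. The code below is not MPRF: descriptors are plain lists of floats, similarity is cosine, and `verify` is a hypothetical callback standing in for the geometric check. It only illustrates why a cheap screening stage followed by a stricter verification stage can reject look-alike places.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_candidates(query, database, k=2):
    """Stage 1: return the k place ids whose descriptors are most similar to the query."""
    scored = sorted(database.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [place_id for place_id, _ in scored[:k]]


def closes_loop(query, database, verify, k=2):
    """Stage 2: accept the first retrieved candidate that passes the verification callback."""
    for place_id in top_k_candidates(query, database, k):
        if verify(place_id):
            return place_id
    return None


# Hypothetical 2-D descriptors; real descriptors are much higher dimensional.
database = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
query = [1.0, 0.05]
candidates = top_k_candidates(query, database, k=2)
# Pretend the geometric check rejects the visually closest place "a".
match = closes_loop(query, database, verify=lambda place_id: place_id == "b", k=2)
print(candidates, match)
```

Here place `"a"` looks most similar but fails verification, so the second candidate is accepted; if no candidate passes, no loop closure is reported.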

[CV-13] PALM: A Dataset and Baseline for Learning Multi-subject Hand Prior

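Quick read: This paper addresses the creation of high-quality personalized hand avatars from a single image, which is hard because of the complex geometry, appearance, and articulation of hands, particularly under unconstrained lighting and limited views. The key is PALM, a large-scale, diverse dataset with 13k high-quality hand scans from 263 subjects and 90k multi-view images covering a wide range of skin tones, ages, and geometries, together with PALM-Net, a baseline that learns a multi-subject prior over hand geometry and material properties via physically based inverse rendering and enables realistic, relightable single-image hand avatar personalization.

Link: https://arxiv.org/abs/2511.05403
Authors: Zicong Fan, Edoardo Remelli, David Dimond, Fadime Sener, Liuhao Ge, Bugra Tekin, Cem Keskin, Shreyas Hampali
Affiliations: Meta Reality Labs; ETH Zürich; Max Planck Institute for Intelligent Systems
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: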

Abstract:The ability to grasp objects, signal with gestures, and share emotion through touch all stem from the unique capabilities of human hands. Yet creating high-quality personalized hand avatars from images remains challenging due to complex geometry, appearance, and articulation, particularly under unconstrained lighting and limited views. Progress has also been limited by the lack of datasets that jointly provide accurate 3D geometry, high-resolution multiview imagery, and a diverse population of subjects. To address this, we present PALM, a large-scale dataset comprising 13k high-quality hand scans from 263 subjects and 90k multi-view images, capturing rich variation in skin tone, age, and geometry. To show its utility, we present a baseline PALM-Net, a multi-subject prior over hand geometry and material properties learned via physically based inverse rendering, enabling realistic, relightable single-image hand avatar personalization. PALM’s scale and diversity make it a valuable real-world resource for hand modeling and related research.

[CV-14] EveryDayVLA: A Vision-Language-Action Model for Affordable Robotic Manipulation ICRA2026

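Quick read: This paper addresses the reliance of current Vision-Language-Action (VLA) models on costly hardware and their degraded performance in novel or cluttered scenes. The proposed EverydayVLA is a 6-DOF manipulator that can be assembled for under $300, paired with a single unified VLA model that jointly outputs discrete and continuous actions and an adaptive-horizon ensemble that monitors motion uncertainty and triggers on-the-fly re-planning for safe, reliable operation. It matches state-of-the-art success rates on LIBERO and outperforms prior methods in real-world tests by 49% in-distribution and 34.9% out-of-distribution, lowering the barrier to using robotic foundation models.

Link: https://arxiv.org/abs/2511.05397
Authors: Samarth Chopra, Alex McMoil, Ben Carnovale, Evan Sokolson, Rajkumar Kubendran, Samuel Dickerson
Affiliations: University of Pittsburgh
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to ICRA 2026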

Abstract:While Vision-Language-Action (VLA) models map visual inputs and language instructions directly to robot actions, they often rely on costly hardware and struggle in novel or cluttered scenes. We introduce EverydayVLA, a 6-DOF manipulator that can be assembled for under $300, capable of modest payloads and workspace. A single unified model jointly outputs discrete and continuous actions, and our adaptive-horizon ensemble monitors motion uncertainty to trigger on-the-fly re-planning for safe, reliable operation. On LIBERO, EverydayVLA matches state-of-the-art success rates, and in real-world tests it outperforms prior methods by 49% in-distribution and 34.9% out-of-distribution. By combining a state-of-the-art VLA with cost-effective hardware, EverydayVLA democratizes access to a robotic foundation model and paves the way for economical use in homes and research labs alike. Experiment videos and details: this https URL

[CV-15] AI Assisted AR Assembly: Object Recognition and Computer Vision for Augmented Reality Assisted Assembly

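Quick read: This paper addresses the inefficiency of conventional assembly workflows, where identifying, locating, and sorting components takes time and manual searching is error-prone, especially for complex assemblies. The key is deep learning-based object recognition that lets an Augmented Reality (AR) system identify assembly components in physical space and, for each step, display a bounding box around the relevant component together with its target placement. Linking the instructions to the real-time location of the components removes the need to manually search, sort, or label them beforehand, making assembly more automated and accurate.

Link: https://arxiv.org/abs/2511.05394
Authors: Alexander Htet Kyaw, Haotian Ma, Sasa Zivkovic, Jenny Sabin
Affiliations: Massachusetts Institute of Technology; Cornell University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Accepted to the Association for Computing Machinery (ACM) Symposium on Computational Fabrication (SCF '25)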

Abstract:We present an AI-assisted Augmented Reality assembly workflow that uses deep learning-based object recognition to identify different assembly components and display step-by-step instructions. For each assembly step, the system displays a bounding box around the corresponding components in the physical space, and where the component should be placed. By connecting assembly instructions with the real-time location of relevant components, the system eliminates the need for manual searching, sorting, or labeling of different components before each assembly. To demonstrate the feasibility of using object recognition for AR-assisted assembly, we highlight a case study involving the assembly of LEGO sculptures.

[CV-16] PreResQ-R1: Towards Fine-Grained Rank-and-Score Reinforcement Learning for Visual Quality Assessment via Preference-Response Disentangled Policy Optimization

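Quick read: This paper addresses visual quality assessment (QA), where existing methods rely on supervised fine-tuning or rank-only objectives and therefore show shallow reasoning, poor score calibration, and limited cross-domain generalization. The proposed PreResQ-R1 is a preference-response disentangled reinforcement learning framework: a dual-branch reward separately models intra-sample response coherence and inter-sample preference alignment and is optimized with Group Relative Policy Optimization (GRPO), which encourages fine-grained, stable, and interpretable chain-of-thought reasoning. For video quality assessment (VQA), the authors add a global-temporal and local-spatial data flow strategy. With reinforcement fine-tuning on only 6K images and 28K videos, the model reaches state-of-the-art results on 10 image quality assessment (IQA) and 5 VQA benchmarks.

Link: https://arxiv.org/abs/2511.05393
Authors: Zehui Feng, Tian Qiu, Tong Wu, Junxuan Li, Huayuan Xu, Ting Han
Affiliations: Shanghai Jiao Tong University; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 27 pages, 14 figures, under review as a conference paper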

Abstract:Visual Quality Assessment (QA) seeks to predict human perceptual judgments of visual fidelity. While recent multimodal large language models (MLLMs) show promise in reasoning about image and video quality, existing approaches mainly rely on supervised fine-tuning or rank-only objectives, resulting in shallow reasoning, poor score calibration, and limited cross-domain generalization. We propose PreResQ-R1, a Preference-Response Disentangled Reinforcement Learning framework that unifies absolute score regression and relative ranking consistency within a single reasoning-driven optimization scheme. Unlike prior QA methods, PreResQ-R1 introduces a dual-branch reward formulation that separately models intra-sample response coherence and inter-sample preference alignment, optimized via Group Relative Policy Optimization (GRPO). This design encourages fine-grained, stable, and interpretable chain-of-thought reasoning about perceptual quality. To extend beyond static imagery, we further design a global-temporal and local-spatial data flow strategy for Video Quality Assessment. Remarkably, with reinforcement fine-tuning on only 6K images and 28K videos, PreResQ-R1 achieves state-of-the-art results across 10 IQA and 5 VQA benchmarks under both SRCC and PLCC metrics, surpassing by margins of 5.30% and 2.15% in the IQA task, respectively. Beyond quantitative gains, it produces human-aligned reasoning traces that reveal the perceptual cues underlying quality judgments. Code and model are available.
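
GRPO, which the abstract names as the optimizer, scores each sampled response relative to the other responses in its group. The sketch below shows that normalization step as it is commonly formulated; it does not reproduce the paper's dual-branch reward. The reward values are hypothetical, and the choice of population standard deviation and the `eps` constant are assumptions made here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for the same prompt.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that score above the group mean get a positive advantage and those below get a negative one, so no separate value network is needed to provide a baseline. When all rewards in a group are equal, every advantage is zero and the group contributes no gradient signal.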

[CV-17] Dense Motion Captioning

Quick read: This paper addresses 3D human motion understanding, a relatively unexplored area: temporally localizing and describing actions within complex, multi-action motion sequences. Prior work focuses mainly on text-to-motion generation, and existing datasets lack detailed temporal annotations. The authors propose a new task, Dense Motion Captioning, and build the Complex Motion Dataset (CompMo), the first large-scale dataset of its kind, with 60,000 multi-action sequences annotated with precise temporal boundaries. Their model, DEMO, couples a large language model (LLM) with a simple motion adapter to generate dense, temporally grounded captions, and it substantially outperforms existing methods on CompMo and on adapted benchmarks.

Link: https://arxiv.org/abs/2511.05369
Authors: Shiyao Xu, Benedetta Liberatori, Gül Varol, Paolo Rota
Affiliations: University of Trento; LIGM, Ecole des Ponts, IP Paris, Univ Gustave Eiffel, CNRS
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 5 figures, accepted to 3DV 2026

Abstract:Recent advances in 3D human motion and language integration have primarily focused on text-to-motion generation, leaving the task of motion understanding relatively unexplored. We introduce Dense Motion Captioning, a novel task that aims to temporally localize and caption actions within 3D human motion sequences. Current datasets fall short in providing detailed temporal annotations and predominantly consist of short sequences featuring few actions. To overcome these limitations, we present the Complex Motion Dataset (CompMo), the first large-scale dataset featuring richly annotated, complex motion sequences with precise temporal boundaries. Built through a carefully designed data generation pipeline, CompMo includes 60,000 motion sequences, each composed of multiple actions ranging from at least two to ten, accurately annotated with their temporal extents. We further present DEMO, a model that integrates a large language model with a simple motion adapter, trained to generate dense, temporally grounded captions. Our experiments show that DEMO substantially outperforms existing methods on CompMo as well as on adapted benchmarks, establishing a robust baseline for future research in 3D motion understanding and captioning.

[CV-18] Neural Image Abstraction Using Long Smoothing B-Splines

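Quick read: This paper addresses the generation of smooth, arbitrarily long paths within image-based deep learning systems, in particular how vector graphics generation and vectorization methods can trade off fidelity against simplicity and expose stylization control in geometric and image spaces. The key is to integrate smoothing B-splines into a standard differentiable vector graphics (DiffVG) pipeline through a linear mapping and to use derivative-based smoothing costs for parametric control. The method is compatible with recent vector graphics generation and vectorization methods and is demonstrated on four applications: stylized space-filling path generation, stroke-based image abstraction, closed-area image abstraction, and stylized text generation.

Link: https://arxiv.org/abs/2511.05360
Authors: Daniel Berio, Michael Stroh, Sylvain Calinon, Frederic Fol Leymarie, Oliver Deussen, Ariel Shamir
Affiliations: unknown
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: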

Abstract:We integrate smoothing B-splines into a standard differentiable vector graphics (DiffVG) pipeline through linear mapping, and show how this can be used to generate smooth and arbitrarily long paths within image-based deep learning systems. We take advantage of derivative-based smoothing costs for parametric control of fidelity vs. simplicity tradeoffs, while also enabling stylization control in geometric and image spaces. The proposed pipeline is compatible with recent vector graphics generation and vectorization methods. We demonstrate the versatility of our approach with four applications aimed at the generation of stylized vector graphics: stylized space-filling path generation, stroke-based image abstraction, closed-area image abstraction, and stylized text generation.
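
Two ingredients named in the abstract, a linear mapping from control points to curve points and a derivative-based smoothing cost, can be shown with a standard uniform cubic B-spline. This is a generic sketch rather than the paper's formulation: the discrete second-difference penalty below is a stand-in chosen here for a derivative cost, and all function names are hypothetical.

```python
def cubic_bspline_weights(t):
    """Uniform cubic B-spline basis weights for one segment, t in [0, 1]."""
    return (
        (1 - t) ** 3 / 6,
        (3 * t**3 - 6 * t**2 + 4) / 6,
        (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6,
        t**3 / 6,
    )


def sample_path(control_points, samples_per_segment=4):
    """Sample the spline; every output point is a weighted sum of 4 control points."""
    path = []
    for i in range(len(control_points) - 3):
        window = control_points[i:i + 4]
        for s in range(samples_per_segment):
            w = cubic_bspline_weights(s / samples_per_segment)
            x = sum(wi * p[0] for wi, p in zip(w, window))
            y = sum(wi * p[1] for wi, p in zip(w, window))
            path.append((x, y))
    return path


def smoothing_cost(path):
    """Sum of squared second differences, a discrete proxy for a derivative penalty."""
    cost = 0.0
    for a, b, c in zip(path, path[1:], path[2:]):
        cost += (a[0] - 2 * b[0] + c[0]) ** 2 + (a[1] - 2 * b[1] + c[1]) ** 2
    return cost


straight = [(float(i), 0.0) for i in range(6)]
zigzag = [(float(i), float((-1) ** i)) for i in range(6)]
straight_cost = smoothing_cost(sample_path(straight))
zigzag_cost = smoothing_cost(sample_path(zigzag))
print(straight_cost, zigzag_cost)
```

Because the sampled points depend linearly on the control points, gradients of an image-space loss can flow back to the control points, and the smoothing cost gives a single knob for trading fidelity against simplicity.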

[CV-19] Canonical Space Representation for 4D Panoptic Segmentation of Articulated Objects

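Quick read: This paper addresses the limited segmentation accuracy in articulated object perception that results from ignoring temporal dynamics, as well as the lack of panoptic segmentation methods and benchmark datasets for the 4D (spatio-temporal) setting. The authors introduce the Artic4D dataset, with 4D panoptic annotations and articulation parameters, and propose CanonSeg4D, a framework that learns a canonical space and explicitly estimates per-frame offsets mapping observed object parts to it, giving temporally consistent part alignment and more accurate part-level segmentation. Experiments on Artic4D show improved panoptic segmentation in complex scenarios, supporting the value of temporal modeling and canonical alignment for 4D articulated object understanding.

Link: https://arxiv.org/abs/2511.05356
Authors: Manuel Gomes, Bogdan Raducanu, Miguel Oliveira
Affiliations: University of Aveiro; Computer Vision Center, Universitat Autònoma de Barcelona
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 32 pages, 6 figures, 4 tables, submitted to Expert Systems With Applications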

Abstract:Articulated object perception presents significant challenges in computer vision, particularly because most existing methods ignore temporal dynamics despite the inherently dynamic nature of such objects. The use of 4D temporal data has not been thoroughly explored in articulated object perception and remains unexamined for panoptic segmentation. The lack of a benchmark dataset further hinders this field. To this end, we introduce Artic4D as a new dataset derived from PartNet Mobility and augmented with synthetic sensor data, featuring 4D panoptic annotations and articulation parameters. Building on this dataset, we propose CanonSeg4D, a novel 4D panoptic segmentation framework. This approach explicitly estimates per-frame offsets mapping observed object parts to a learned canonical space, thereby enhancing part-level segmentation. The framework employs this canonical representation to achieve consistent alignment of object parts across sequential frames. Comprehensive experiments on Artic4D demonstrate that the proposed CanonSeg4D outperforms state-of-the-art approaches in panoptic segmentation accuracy in more complex scenarios. These findings highlight the effectiveness of temporal modeling and canonical alignment in dynamic object understanding, and pave the way for future advances in 4D articulated object perception.

[CV-20] S^2LM: Towards Semantic Steganography via Large Language Models

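Quick read: This paper addresses the inability of traditional steganography to embed semantically rich, sentence-level information, a limitation that matters more in the era of AI-generated content (AIGC), where capacity is increasingly important. The authors propose a new task, Sentence-to-Image Steganography, and build a benchmark named Invisible Text (IVT) for evaluation. Their core contribution is S^2LM, a Semantic Steganographic Language Model in which a large language model (LLM) is involved throughout the embedding pipeline, so that high-level text such as sentences or paragraphs can be hidden in a cover image, going beyond traditional bit-level methods.

Link: https://arxiv.org/abs/2511.05319
Authors: Huanqi Wu, Huangbiao Xu, Runfeng Xie, Jiaxin Cai, Kaixin Zhang, Xiao Ke
Affiliations: Fuzhou University; Beijing University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: 35 Pages, 20 Figures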

Abstract:Although steganography has made significant advancements in recent years, it still struggles to embed semantically rich, sentence-level information into carriers. However, in the era of AIGC, the capacity of steganography is more critical than ever. In this work, we present Sentence-to-Image Steganography, an instance of Semantic Steganography, a novel task that enables the hiding of arbitrary sentence-level messages within a cover image. Furthermore, we establish a benchmark named Invisible Text (IVT), comprising a diverse set of sentence-level texts as secret messages for evaluation. Finally, we present S^2LM: Semantic Steganographic Language Model, which utilizes large language models (LLMs) to embed high-level textual information, such as sentences or even paragraphs, into images. Unlike traditional bit-level counterparts, S^2LM enables the integration of semantically rich content through a newly designed pipeline in which the LLM is involved throughout the entire process. Both quantitative and qualitative experiments demonstrate that our method effectively unlocks new semantic steganographic capabilities for LLMs. The source code will be released soon.

[CV-21] Rethinking Metrics and Diffusion Architecture for 3D Point Cloud Generation

【速读】:该论文旨在解决当前用于评估生成点云质量的指标(尤其是基于Chamfer Distance, CD)在面对几何缺陷时缺乏鲁棒性、无法准确反映局部形状一致性与几何保真度的问题。其解决方案的关键在于:首先引入样本对齐预处理步骤以提升距离计算的合理性,其次用密度感知的Chamfer Distance (Density-Aware Chamfer Distance, DCD) 替代传统CD,从而增强评估指标的稳定性;此外提出Surface Normal Concordance (SNC) 新型指标,通过比较点云表面法向量来近似表面相似性,弥补仅依赖3D坐标对比的不足;最终结合基于Transformer的结构(如序列化patch注意力机制),设计出Diffusion Point Transformer架构,在ShapeNet数据集上实现高保真度点云生成,达到新的性能上限。

Link: https://arxiv.org/abs/2511.05308
Authors: Matteo Bastico, David Ryckelynck, Laurent Corté, Yannick Tillier, Etienne Decencière
Affiliations: Mines Paris, Université PSL; Centre des Matériaux (MAT), UMR7633 CNRS; Centre de Morphologie Mathématique (CMM); Centre de Mise en Forme des Matériaux (CEMEF), UMR7635 CNRS
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: This paper has been accepted at International Conference on 3D Vision (3DV) 2026

Abstract:As 3D point clouds become a cornerstone of modern technology, the need for sophisticated generative models and reliable evaluation metrics has grown exponentially. In this work, we first expose that some commonly used metrics for evaluating generated point clouds, particularly those based on Chamfer Distance (CD), lack robustness against defects and fail to capture geometric fidelity and local shape consistency when used as quality indicators. We further show that introducing sample alignment prior to distance calculation and replacing CD with Density-Aware Chamfer Distance (DCD) are simple yet essential steps to ensure the consistency and robustness of point cloud generative model evaluation metrics. While existing metrics primarily focus on directly comparing 3D Euclidean coordinates, we present a novel metric, named Surface Normal Concordance (SNC), which approximates surface similarity by comparing estimated point normals. This new metric, when combined with traditional ones, provides a more comprehensive evaluation of the quality of generated samples. Finally, leveraging recent advancements in transformer-based models for point cloud analysis, such as serialized patch attention, we propose a new architecture for generating high-fidelity 3D structures, the Diffusion Point Transformer. We perform extensive experiments and comparisons on the ShapeNet dataset, showing that our model outperforms previous solutions, particularly in terms of quality of generated point clouds, achieving a new state of the art. Code available at this https URL.
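To make the metric discussion in [CV-21] concrete, here is a minimal pure-Python sketch of the symmetric Chamfer Distance and of a simplified density-aware variant. It is not the authors' evaluation code: the function names, the `alpha` parameter, and the exact form of the density-aware term (an exponential of the nearest-neighbour distance, divided by how many query points share that neighbour) are assumptions for illustration, and the DCD definition used in the paper may differ in detail.

```python
import math


def _nearest(p, cloud):
    """Return (index, Euclidean distance) of the point in `cloud` closest to `p`."""
    best_i, best_d = -1, float("inf")
    for i, q in enumerate(cloud):
        d = math.dist(p, q)
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d


def chamfer_distance(a, b):
    """Symmetric Chamfer Distance: mean squared nearest-neighbour distance in both directions."""
    a_to_b = sum(_nearest(p, b)[1] ** 2 for p in a) / len(a)
    b_to_a = sum(_nearest(q, a)[1] ** 2 for q in b) / len(b)
    return a_to_b + b_to_a


def density_aware_chamfer(a, b, alpha=1.0):
    """Simplified density-aware variant (assumed form), bounded in [0, 1]."""

    def one_way(src, dst):
        matches = [_nearest(p, dst) for p in src]
        # Count how many source points pick the same target point.
        counts = {}
        for idx, _ in matches:
            counts[idx] = counts.get(idx, 0) + 1
        # A target point claimed by many source points contributes less per match,
        # so clumped point clouds are penalised.
        return sum(1.0 - math.exp(-alpha * d) / counts[idx] for idx, d in matches) / len(src)

    return 0.5 * (one_way(a, b) + one_way(b, a))
```

Because the density-aware term saturates at 1, a few far-away outliers cannot dominate the score the way they can with squared distances in plain CD, which is one plausible reading of why the abstract calls DCD more robust.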

[CV-22] LiveStar: Live Streaming Assistant for Real-World Online Video Understanding NEURIPS2025

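[Quick Read]: This paper addresses the difficulty online Video Large Language Models (online Video-LLMs) have in balancing real-time responsiveness with narrative coherence on continuous frame streams: existing methods struggle to process frame-by-frame input efficiently while also deciding when to respond. The solution is LiveStar, whose core contributions are: (1) a training strategy for incremental video-language alignment on variable-length video streams, which preserves temporal consistency across evolving frame sequences; (2) a response-silence decoding framework that determines the best time for a proactive response using a single forward-pass verification; and (3) memory-aware acceleration that combines peak-end memory compression with a streaming key-value cache, supporting online inference on videos of 10+ minutes with 1.53x faster inference.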

Link: https://arxiv.org/abs/2511.05299
Authors: Zhenyu Yang, Kairui Zhang, Yuhang Hu, Bing Wang, Shengsheng Qian, Bin Wen, Fan Yang, Tingting Gao, Weiming Dong, Changsheng Xu
Affiliations: Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences; ShanghaiTech University; Kuaishou Technology; Peng Cheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: NeurIPS 2025 Accepted

Abstract:Despite significant progress in Video Large Language Models (Video-LLMs) for offline video understanding, existing online Video-LLMs typically struggle to simultaneously process continuous frame-by-frame inputs and determine optimal response timing, often compromising real-time responsiveness and narrative coherence. To address these limitations, we introduce LiveStar, a pioneering live streaming assistant that achieves always-on proactive responses through adaptive streaming decoding. Specifically, LiveStar incorporates: (1) a training strategy enabling incremental video-language alignment for variable-length video streams, preserving temporal consistency across dynamically evolving frame sequences; (2) a response-silence decoding framework that determines optimal proactive response timing via a single forward pass verification; (3) memory-aware acceleration via peak-end memory compression for online inference on 10+ minute videos, combined with streaming key-value cache to achieve 1.53x faster inference. We also construct an OmniStar dataset, a comprehensive dataset for training and benchmarking that encompasses 15 diverse real-world scenarios and 5 evaluation tasks for online video understanding. Extensive experiments across three benchmarks demonstrate LiveStar’s state-of-the-art performance, achieving an average 19.5% improvement in semantic correctness with 18.1% reduced timing difference compared to existing online Video-LLMs, while improving FPS by 12.0% across all five OmniStar tasks. Our model and dataset can be accessed at this https URL.

[CV-23] Cross-domain EEG-based Emotion Recognition with Contrastive Learning

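[Quick Read]: This paper addresses feature utilization and cross-domain generalization in electroencephalogram (EEG)-based emotion recognition. The key is to recast emotion recognition as an EEG-text matching task and apply multimodal contrastive learning within the CLIP framework. A dedicated backbone, SST-LegoViT, combines multi-scale convolution with Transformer modules to extract spatial, spectral, and temporal features from EEG signals, which improves cross-subject and cross-time recognition accuracy over existing models.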

Link: https://arxiv.org/abs/2511.05293
Authors: Rui Yan, Yibo Li, Han Ding, Fei Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 5 pages

Abstract:Electroencephalogram (EEG)-based emotion recognition is vital for affective computing but faces challenges in feature utilization and cross-domain generalization. This work introduces EmotionCLIP, which reformulates recognition as an EEG-text matching task within the CLIP framework. A tailored backbone, SST-LegoViT, captures spatial, spectral, and temporal features using multi-scale convolution and Transformer modules. Experiments on SEED and SEED-IV datasets show superior cross-subject accuracies of 88.69% and 73.50%, and cross-time accuracies of 88.46% and 77.54%, outperforming existing models. Results demonstrate the effectiveness of multimodal contrastive learning for robust EEG emotion recognition.
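The EEG-text matching idea in [CV-23] can be illustrated with the standard CLIP-style symmetric contrastive loss. The sketch below is a pure-Python toy version under stated assumptions: `eeg_emb[i]` and `text_emb[i]` are taken to be a matched pair, the temperature value 0.07 is a common default rather than a setting from the paper, and the function name is hypothetical. How EmotionCLIP builds its text prompts, or handles several EEG samples sharing one emotion label in a batch, is not described in the abstract.

```python
import math


def _normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def _cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for one row of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


def clip_contrastive_loss(eeg_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched (EEG, text) embedding pairs."""
    eeg = [_normalize(v) for v in eeg_emb]
    txt = [_normalize(v) for v in text_emb]
    n = len(eeg)
    # Cosine similarity matrix scaled by the temperature: sim[i][j] = <eeg_i, txt_j> / T.
    sim = [[sum(a * b for a, b in zip(e, t)) / temperature for t in txt] for e in eeg]
    # EEG-to-text direction: row i should peak at column i.
    loss_e2t = sum(_cross_entropy(sim[i], i) for i in range(n)) / n
    # Text-to-EEG direction: column i should peak at row i.
    loss_t2e = sum(_cross_entropy([sim[j][i] for j in range(n)], i) for i in range(n)) / n
    return 0.5 * (loss_e2t + loss_t2e)
```

At inference time, the same similarity matrix can be used for classification: embed one text prompt per emotion class and pick the class whose embedding is most similar to the EEG embedding.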

[CV-24] What's on Your Plate? Inferring Chinese Cuisine Intake from Wearable IMUs

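[Quick Read]: This paper addresses the limits of existing dietary monitoring methods: self-reports suffer from recall bias, camera-based approaches raise privacy concerns, and existing wearable methods cover too few food types to handle the diversity of Chinese cuisine. The solution is CuisineSense, which fuses hand motion features from a smartwatch with head dynamics from smart glasses in a two-stage pipeline. The first stage separates eating states from other daily activities using temporal patterns; the second stage performs fine-grained food type recognition on the motion data captured during eating. This improves recognition of diverse Chinese dishes without the privacy cost of cameras.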

Link: https://arxiv.org/abs/2511.05292
Authors: Jiaxi Yin, Pengcheng Wang, Han Ding, Fei Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: 5 pages

Abstract:Accurate food intake detection is vital for dietary monitoring and chronic disease prevention. Traditional self-report methods are prone to recall bias, while camera-based approaches raise concerns about privacy. Furthermore, existing wearable-based methods primarily focus on a limited number of food types, such as hamburgers and pizza, failing to address the vast diversity of Chinese cuisine. To bridge this gap, we propose CuisineSense, a system that classifies Chinese food types by integrating hand motion cues from a smartwatch with head dynamics from smart glasses. To filter out irrelevant daily activities, we design a two-stage detection pipeline. The first stage identifies eating states by distinguishing characteristic temporal patterns from non-eating behaviors. The second stage then conducts fine-grained food type recognition based on the motions captured during food intake. To evaluate CuisineSense, we construct a dataset comprising 27.5 hours of IMU recordings across 11 food categories and 10 participants. Experiments demonstrate that CuisineSense achieves high accuracy in both eating state detection and food classification, offering a practical solution for unobtrusive, wearable-based dietary monitoring. The system code is publicly available at this https URL.

[CV-25] DeepEyesV2: Toward Agentic Multimodal Model

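[Quick Read]: This paper addresses the inability of current multimodal models to actively invoke external tools such as code execution environments and web search, that is, how to build agentic multimodal models that combine autonomous reasoning with tool use. The key is a two-stage training pipeline: a cold-start stage establishes initial tool-use patterns, and a reinforcement learning stage then refines the tool-invocation policy. The authors also curate a diverse, moderately challenging training set and introduce the RealX-Bench benchmark for real-world multimodal reasoning. The resulting model selects tools adaptively by task, composes tools in complex ways, and performs better across perception, search, and reasoning tasks.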

Link: https://arxiv.org/abs/2511.05271
Authors: Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, Xing Yu
Affiliations: Xiaohongshu Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: Homepage: this https URL

Abstract:Agentic multimodal models should not only comprehend text and images, but also actively invoke external tools, such as code execution environments and web search, and integrate these operations into reasoning. In this work, we introduce DeepEyesV2 and explore how to build an agentic multimodal model from the perspectives of data construction, training methods, and model evaluation. We observe that direct reinforcement learning alone fails to induce robust tool-use behavior. This phenomenon motivates a two-stage training pipeline: a cold-start stage to establish tool-use patterns, and a reinforcement learning stage to further refine tool invocation. We curate a diverse, moderately challenging training dataset, specifically including examples where tool use is beneficial. We further introduce RealX-Bench, a comprehensive benchmark designed to evaluate real-world multimodal reasoning, which inherently requires the integration of multiple capabilities, including perception, search, and reasoning. We evaluate DeepEyesV2 on RealX-Bench and other representative benchmarks, demonstrating its effectiveness across real-world understanding, mathematical reasoning, and search-intensive tasks. Moreover, DeepEyesV2 exhibits task-adaptive tool invocation, tending to use image operations for perception tasks and numerical computations for reasoning tasks. Reinforcement learning further enables complex tool combinations and allows the model to selectively invoke tools based on context. We hope our study can provide guidance for the community in developing agentic multimodal models.

[CV-26] OregairuChar: A Benchmark Dataset for Character Appearance Frequency Analysis in My Teen Romantic Comedy SNAFU

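[Quick Read]: This paper addresses character appearance frequency analysis in anime as a way to understand narrative structure, character prominence, and story progression. Its core contribution is OregairuChar, a benchmark dataset of 1600 manually selected, annotated frames from the third season of My Teen Romantic Comedy SNAFU, with 2860 bounding boxes across 11 main characters. The dataset includes visual challenges such as occlusion, pose variation, and inter-character similarity, giving appearance-based studies a realistic basis. The authors benchmark several object detection models on it and use their predictions for fine-grained, episode-level analysis of character presence over time, which reveals how character prominence evolves and supports research on computational narrative dynamics and character-centric storytelling in stylized media.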

Link: https://arxiv.org/abs/2511.05263
Authors: Qi Sun, Dingju Zhou, Lina Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes:

Abstract:The analysis of character appearance frequency is essential for understanding narrative structure, character prominence, and story progression in anime. In this work, we introduce OregairuChar, a benchmark dataset designed for appearance frequency analysis in the anime series My Teen Romantic Comedy SNAFU. The dataset comprises 1600 manually selected frames from the third season, annotated with 2860 bounding boxes across 11 main characters. OregairuChar captures diverse visual challenges, including occlusion, pose variation, and inter-character similarity, providing a realistic basis for appearance-based studies. To enable quantitative research, we benchmark several object detection models on the dataset and leverage their predictions for fine-grained, episode-level analysis of character presence over time. This approach reveals patterns of character prominence and their evolution within the narrative. By emphasizing appearance frequency, OregairuChar serves as a valuable resource for exploring computational narrative dynamics and character-centric storytelling in stylized media.

[CV-27] Automatic segmentation of colorectal liver metastases for ultrasound-based navigated resection

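[Quick Read]: This paper addresses accurate, real-time segmentation of colorectal liver metastases (CRLM) in 3D intraoperative ultrasound (iUS), with the goal of achieving negative resection margins and more efficient surgical navigation. iUS suffers from low contrast, noise, and operator dependency, which limits its clinical use. The key is a 3D U-Net built with the nnU-Net framework and trained on volumes cropped around the tumor. The cropped-volume model outperforms the model trained on full iUS volumes on Dice Similarity Coefficient (DSC), Hausdorff Distance (HDist.), and Relative Volume Difference (RVD) (AUC-ROC = 0.898 vs 0.718), and runs about 4x faster than semi-automatic segmentation (about 1 minute per case). Prospective intraoperative testing showed clinically acceptable accuracy, supporting efficient, near real-time automatic segmentation for registration-free ultrasound-guided liver resection.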

Link: https://arxiv.org/abs/2511.05253
Authors: Tiziano Natali, Karin A. Olthof, Niels F.M. Kok, Koert F.D. Kuhlmann, Theo J.M. Ruers, Matteo Fusaglia
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Notes:

Abstract:Introduction: Accurate intraoperative delineation of colorectal liver metastases (CRLM) is crucial for achieving negative resection margins but remains challenging using intraoperative ultrasound (iUS) due to low contrast, noise, and operator dependency. Automated segmentation could enhance precision and efficiency in ultrasound-based navigation workflows. Methods: Eighty-five tracked 3D iUS volumes from 85 CRLM patients were used to train and evaluate a 3D U-Net implemented via the nnU-Net framework. Two variants were compared: one trained on full iUS volumes and another on cropped regions around tumors. Segmentation accuracy was assessed using Dice Similarity Coefficient (DSC), Hausdorff Distance (HDist.), and Relative Volume Difference (RVD) on retrospective and prospective datasets. The workflow was integrated into 3D Slicer for real-time intraoperative use. Results: The cropped-volume model significantly outperformed the full-volume model across all metrics (AUC-ROC = 0.898 vs 0.718). It achieved median DSC = 0.74, recall = 0.79, and HDist. = 17.1 mm comparable to semi-automatic segmentation but with ~4x faster execution (~ 1 min). Prospective intraoperative testing confirmed robust and consistent performance, with clinically acceptable accuracy for real-time surgical guidance. Conclusion: Automatic 3D segmentation of CRLM in iUS using a cropped 3D U-Net provides reliable, near real-time results with minimal operator input. The method enables efficient, registration-free ultrasound-based navigation for hepatic surgery, approaching expert-level accuracy while substantially reducing manual workload and procedure time. 
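Two of the segmentation metrics named in [CV-27], the Dice Similarity Coefficient and the Relative Volume Difference, are simple enough to sketch directly. The version below works on flattened binary masks and is only an illustration: the function names are hypothetical, the sign convention for RVD (predicted minus reference, divided by reference) is one common choice and may not match the paper, and returning 1.0 for two empty masks is an assumption.

```python
def dice_coefficient(pred, truth):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for two flat binary masks of equal length."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def relative_volume_difference(pred, truth):
    """RVD = (V_pred - V_truth) / V_truth, with volumes measured in voxels."""
    v_pred = sum(1 for p in pred if p)
    v_truth = sum(1 for t in truth if t)
    if v_truth == 0:
        raise ValueError("reference mask is empty; RVD is undefined")
    return (v_pred - v_truth) / v_truth
```

Dice rewards overlap, while RVD only compares sizes, so a prediction can have an RVD of 0 and still miss the tumor entirely. That is one reason such studies report several metrics together.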

[CV-28] Accurate online action and gesture recognition system using detectors and Deep SPD Siamese Networks

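[Quick Read]: This paper addresses online continuous motion recognition, that is, detecting and classifying actions in real time within an unsegmented stream of skeleton sequences, which matters for practical applications. Most prior work relies on segment-based recognition and does not suit the online setting. The key is a system composed of a detector and a classifier, both of which use a Semi-Positive Definite (SPD) matrix representation of the skeletal data together with a Siamese network that learns semantic similarity. The detector predicts the time intervals of motions in the unsegmented sequence, and the classifier recognizes the motion in each predicted interval. The system runs continuously without pre-segmented input and, in most cases, outperforms state-of-the-art methods on hand gesture and body action benchmarks.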

Link: https://arxiv.org/abs/2511.05250
Authors: Mohamed Sanim Akremi, Rim Slama, Hedi Tabia
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Notes:

Abstract:Online continuous motion recognition is a hot topic of research since it is more practical in real-life applications. Recently, skeleton-based approaches have become increasingly popular, demonstrating the power of using such 3D temporal data. However, most of these works have focused on segment-based recognition and are not suitable for online scenarios. In this paper, we propose an online recognition system for skeleton sequence streaming composed of two main components: a detector and a classifier, which use a Semi-Positive Definite (SPD) matrix representation and a Siamese network. The powerful statistical representations for the skeletal data given by the SPD matrices and the learning of their semantic similarity by the Siamese network enable the detector to predict time intervals of the motions throughout an unsegmented sequence. In addition, they give the classifier the capability to recognize the motion in each predicted interval. The proposed detector is flexible and able to identify the kinetic state continuously. We conduct extensive experiments on both hand gesture and body action recognition benchmarks to prove the accuracy of our online recognition system, which in most cases outperforms state-of-the-art methods.

[CV-29] ADPretrain: Advancing Industrial Anomaly Detection via Anomaly Representation Pretraining NEURIPS2025

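[Quick Read]: This paper addresses the performance bottleneck that mainstream industrial anomaly detection (AD) methods inherit from ImageNet-pretrained features. The ImageNet pretraining objective does not match the AD task, since it is not designed to separate normal from abnormal samples, and there is a distribution shift between natural and industrial images, so the pretrained features are suboptimal for AD. The key is a representation learning framework designed for AD: pretraining is performed on the large-scale industrial AD dataset RealIAD to reduce the distribution shift; angle- and norm-oriented contrastive losses maximize both the angle and the norm difference between normal and abnormal features, making the features more discriminative; and the pretrained representations are learned on residual features, a class-generalizable representation, to further reduce the potential shift between the pretraining data and downstream AD datasets. Experiments on multiple AD datasets and backbones consistently show the advantage of the pretrained features.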

Link: https://arxiv.org/abs/2511.05245
Authors: Xincheng Yao, Yan Luo, Zefeng Qian, Chongyang Zhang
Affiliations: Shanghai Jiao Tong University; Nanjing Agricultural University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted by NeurIPS 2025

Abstract:The current mainstream and state-of-the-art anomaly detection (AD) methods are largely built on pretrained feature networks obtained through ImageNet pretraining. However, regardless of supervised or self-supervised pretraining, the pretraining process on ImageNet does not match the goal of anomaly detection (i.e., pretraining in natural images doesn’t aim to distinguish between normal and abnormal). Moreover, natural images and industrial image data in AD scenarios typically exhibit a distribution shift. The two issues can cause ImageNet-pretrained features to be suboptimal for AD tasks. To further promote the development of the AD field, pretrained representations designed specifically for AD tasks are urgently needed and very valuable. To this end, we propose a novel AD representation learning framework specially designed for learning robust and discriminative pretrained representations for industrial anomaly detection. Specifically, closely following the goal of anomaly detection (i.e., focus on discrepancies between normals and anomalies), we propose angle- and norm-oriented contrastive losses to maximize the angle size and norm difference between normal and abnormal features simultaneously. To avoid the distribution shift from natural images to AD images, our pretraining is performed on a large-scale AD dataset, RealIAD. To further alleviate the potential shift between pretraining data and downstream AD datasets, we learn the pretrained AD representations based on the class-generalizable representation, residual features. For evaluation, based on five embedding-based AD methods, we simply replace their original features with our pretrained representations. Extensive experiments on five AD datasets and five backbones consistently show the superiority of our pretrained features. The code is available at this https URL.

[CV-30] 4D3R: Motion-Aware Neural Reconstruction and Rendering of Dynamic Scenes from Monocular Videos

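[Quick Read]: This paper addresses dynamic scene reconstruction from monocular video when camera poses are unknown; methods based on Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) struggle with dynamic content and rely on pre-computed poses. The key is 4D3R, a pose-free dynamic neural rendering framework with two core contributions: (1) a motion-aware bundle adjustment (MA-BA) module that combines transformer-based priors with the SAM2 segmentation model for robust dynamic object segmentation and camera pose refinement; and (2) an efficient Motion-Aware Gaussian Splatting (MA-GS) representation that models dynamic motion with control points, a deformation field multilayer perceptron (MLP), and linear blend skinning, cutting computational requirements by 5x while keeping reconstruction quality high.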

Link: https://arxiv.org/abs/2511.05229
Authors: Mengqi Guo, Bo Xu, Yanyan Li, Gim Hee Lee
Affiliations: National University of Singapore; Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: 17 pages, 5 figures

Abstract:Novel view synthesis from monocular videos of dynamic scenes with unknown camera poses remains a fundamental challenge in computer vision and graphics. While recent advances in 3D representations such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have shown promising results for static scenes, they struggle with dynamic content and typically rely on pre-computed camera poses. We present 4D3R, a pose-free dynamic neural rendering framework that decouples static and dynamic components through a two-stage approach. Our method first leverages 3D foundational models for initial pose and geometry estimation, followed by motion-aware refinement. 4D3R introduces two key technical innovations: (1) a motion-aware bundle adjustment (MA-BA) module that combines transformer-based learned priors with SAM2 for robust dynamic object segmentation, enabling more accurate camera pose refinement; and (2) an efficient Motion-Aware Gaussian Splatting (MA-GS) representation that uses control points with a deformation field MLP and linear blend skinning to model dynamic motion, significantly reducing computational cost while maintaining high-quality reconstruction. Extensive experiments on real-world dynamic datasets demonstrate that our approach achieves up to 1.8dB PSNR improvement over state-of-the-art methods, particularly in challenging scenarios with large dynamic objects, while reducing computational requirements by 5x compared to previous dynamic scene representations.
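The MA-GS representation in [CV-30] relies on linear blend skinning (LBS), which moves a point by blending the rigid transforms of nearby control points. The sketch below shows only the generic LBS formula in pure Python. The function names are hypothetical, and how 4D3R predicts each control point's rotation and translation (the abstract mentions a deformation field MLP) or computes the blend weights is not specified in the abstract, so both are simply passed in as inputs here.

```python
def apply_rigid(rotation, translation, point):
    """Apply a rigid transform: rotation (3x3 nested list) times point, plus translation."""
    return [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]


def linear_blend_skinning(point, transforms, weights):
    """Blend the transformed copies of `point`, one per control point.

    transforms: list of (rotation, translation) pairs, one per control point.
    weights:    non-negative blend weights; they are normalised to sum to 1.
    """
    total = sum(weights)
    blended = [0.0, 0.0, 0.0]
    for (rotation, translation), w in zip(transforms, weights):
        moved = apply_rigid(rotation, translation, point)
        for i in range(3):
            blended[i] += (w / total) * moved[i]
    return blended
```

For example, a point influenced equally by a static control point and one that translates by 2 units along x ends up moved by 1 unit along x. Storing transforms for a small set of control points instead of per-Gaussian deformations is a plausible source of the reduced computational cost the abstract reports.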

[CV-31] FreeControl: Efficient Training-Free Structural Control via One-Step Attention Extraction NIPS2025

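[Quick Read]: This paper addresses the difficulty of controlling spatial and semantic structure in diffusion-generated images. Methods such as ControlNet rely on handcrafted condition maps and retraining, which limits flexibility, while inversion-based methods align well but incur high inference cost from dual-path denoising. The key is FreeControl, which extracts attention once at a single, optimally chosen key timestep and reuses it throughout denoising, giving efficient structural guidance without inversion or retraining. Latent-Condition Decoupling (LCD) further separates the key timestep from the noised latent used for attention extraction, which improves attention quality and removes structural artifacts. The result is semantically aligned, visually coherent generation with compositional control at about 5% additional cost.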

Link: https://arxiv.org/abs/2511.05219
Authors: Jiang Lin, Xinyu Chen, Song Wu, Zhiqiu Zhang, Jizhi Zhang, Ye Wang, Qiang Tang, Qian Wang, Jian Yang, Zili Yi
Affiliations: Nanjing University; JIUTIAN Research; University of British Columbia; Jilin University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted by NIPS 2025

Abstract:Controlling the spatial and semantic structure of diffusion-generated images remains a challenge. Existing methods like ControlNet rely on handcrafted condition maps and retraining, limiting flexibility and generalization. Inversion-based approaches offer stronger alignment but incur high inference cost due to dual-path denoising. We present FreeControl, a training-free framework for semantic structural control in diffusion models. Unlike prior methods that extract attention across multiple timesteps, FreeControl performs one-step attention extraction from a single, optimally chosen key timestep and reuses it throughout denoising. This enables efficient structural guidance without inversion or retraining. To further improve quality and stability, we introduce Latent-Condition Decoupling (LCD): a principled separation of the key timestep and the noised latent used in attention extraction. LCD provides finer control over attention quality and eliminates structural artifacts. FreeControl also supports compositional control via reference images assembled from multiple sources - enabling intuitive scene layout design and stronger prompt alignment. FreeControl introduces a new paradigm for test-time control, enabling structurally and semantically aligned, visually coherent generation directly from raw images, with the flexibility for intuitive compositional design and compatibility with modern diffusion models at approximately 5 percent additional cost.

[CV-32] Walk the Lines 2: Contour Tracking for Detailed Segmentation

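[Quick Read]: This paper addresses detailed segmentation of ships in infrared (IR) images and of various objects in RGB images, targeting the loss of boundary detail under standard non-maximum suppression (NMS). The key is the Walk the Lines 2 (WtL2) algorithm, which replaces standard NMS with contour tracking: the object contour is refined until it forms a 1-pixel-wide closed shape that can be binarized into a segmentable foreground-background region. WtL2 extends the original WtL, which handled only ship segmentation in color images, to IR ships and diverse RGB objects. When it achieves a closed contour, it outperforms recent contour-based methods and reaches high peak Intersection over Union (IoU), which makes it suitable for specialized applications that need detailed segmentation.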

Link: https://arxiv.org/abs/2511.05210
Authors: André Peter Kelm, Max Braeschke, Emre Gülsoylu, Simone Frintrop
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 11 pages, 6 figures. Accepted at CAIP 2025: 21st International Conference on Computer Analysis of Images and Patterns, Las Palmas de Gran Canaria, Spain, September 22-25, 2025. To appear in: Proceedings Part I, Lecture Notes in Computer Science (LNCS), Springer Nature Switzerland

Abstract:This paper presents Walk the Lines 2 (WtL2), a unique contour tracking algorithm specifically adapted for detailed segmentation of infrared (IR) ships and various objects in RGB. This extends the original Walk the Lines (WtL) [12], which focused solely on detailed ship segmentation in color. These innovative WtLs can replace the standard non-maximum suppression (NMS) by using contour tracking to refine the object contour until a 1-pixel-wide closed shape can be binarized, forming a segmentable area in foreground-background scenarios. WtL2 broadens the application range of WtL beyond its original scope, adapting to IR and expanding to diverse objects within the RGB context. To achieve IR segmentation, we adapt its input, the object contour detector, to IR ships. In addition, the algorithm is enhanced to process a wide range of RGB objects, outperforming the latest generation of contour-based methods when achieving a closed object contour, offering high peak Intersection over Union (IoU) with impressive details. This positions WtL2 as a compelling method for specialized applications that require detailed segmentation or high-quality samples, potentially accelerating progress in several niche areas of image segmentation.

[CV-33] MUSE: Multi-Scale Dense Self-Distillation for Nucleus Detection and Classification

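[Quick Read]: This paper addresses two challenges in nucleus detection and classification (NDC) for histopathology: existing methods depend heavily on costly manual nucleus-level annotations, and they fail to exploit large-scale unlabeled data for learning discriminative nucleus representations. The key is MUSE (MUlti-scale denSE self-distillation), a self-supervised learning framework whose core is NuLo (Nucleus-based Local self-distillation), a coordinate-guided local self-distillation mechanism based on predicted nucleus positions. By removing the need for strict spatial alignment between augmented views, NuLo enables cross-scale alignment and improves fine-grained nucleus-level representation learning. The paper also designs a simple encoder-decoder architecture and a large field-of-view semi-supervised fine-tuning strategy to get more value from unlabeled pathology images. On three widely used benchmarks, MUSE surpasses supervised baselines and generic pathology foundation models.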

Link: https://arxiv.org/abs/2511.05170
Authors: Zijiang Yang, Hanqing Chao, Bokai Zhao, Yelin Yang, Yunshuo Zhang, Dongmei Fu, Junping Zhang, Le Lu, Ke Yan, Dakai Jin, Minfeng Xu, Yun Bian, Hui Jiang
Affiliations: DAMO Academy, Alibaba Group; Zhejiang University; National University of Singapore; Tsinghua University; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 12 pages, 7 figures

Abstract:Nucleus detection and classification (NDC) in histopathology analysis is a fundamental task that underpins a wide range of high-level pathology applications. However, existing methods heavily rely on labor-intensive nucleus-level annotations and struggle to fully exploit large-scale unlabeled data for learning discriminative nucleus representations. In this work, we propose MUSE (MUlti-scale denSE self-distillation), a novel self-supervised learning method tailored for NDC. At its core is NuLo (Nucleus-based Local self-distillation), a coordinate-guided mechanism that enables flexible local self-distillation based on predicted nucleus positions. By removing the need for strict spatial alignment between augmented views, NuLo allows critical cross-scale alignment, thus unlocking the capacity of models for fine-grained nucleus-level representation. To support MUSE, we design a simple yet effective encoder-decoder architecture and a large field-of-view semi-supervised fine-tuning strategy that together maximize the value of unlabeled pathology images. Extensive experiments on three widely used benchmarks demonstrate that MUSE effectively addresses the core challenges of histopathological NDC. The resulting models not only surpass state-of-the-art supervised baselines but also outperform generic pathology foundation models.

[CV-34] Another BRIXEL in the Wall: Towards Cheaper Dense Features

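[Quick Read]: This paper addresses the high compute cost and high-resolution input requirements that vision foundation models face when producing fine-grained dense feature maps. Models such as DINOv3 reach excellent downstream performance, but they depend on high-resolution inputs, and the quadratic complexity of the transformer architecture makes them expensive. The key is BRIXEL, a simple knowledge distillation method in which the student learns to reproduce its own feature maps at higher resolution. At a fixed resolution this improves downstream performance by large margins, and the resulting feature maps closely match the teacher's at a fraction of the computational cost.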

Link: https://arxiv.org/abs/2511.05168
Authors: Alexander Lappe, Martin A. Giese
Affiliations: Hertie Institute, University of Tübingen; IMPRS-IS
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes:

Abstract:Vision foundation models achieve strong performance on both global and locally dense downstream tasks. Pretrained on large images, the recent DINOv3 model family is able to produce very fine-grained dense feature maps, enabling state-of-the-art performance. However, computing these feature maps requires the input image to be available at very high resolution, as well as large amounts of compute due to the squared complexity of the transformer architecture. To address these issues, we propose BRIXEL, a simple knowledge distillation approach that has the student learn to reproduce its own feature maps at higher resolution. Despite its simplicity, BRIXEL outperforms the baseline DINOv3 models by large margins on downstream tasks when the resolution is kept fixed. Moreover, it is able to produce feature maps that are very similar to those of the teacher at a fraction of the computational cost. Code and model weights are available at this https URL.

[CV-35] Splatography: Sparse multi-view dynamic Gaussian Splatting for filmmaking challenges

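[Quick Read]: This paper addresses the difficulty state-of-the-art dynamic 3D reconstruction methods have in capturing complex dynamic features when filmmaking budgets force sparse camera configurations. The key is to split the canonical Gaussians and the deformation field into foreground and background components using a sparse set of masks for frames at t=0, and to train the two representations separately. The foreground learns changes in color, position, and rotation to cover diverse dynamic features; the background, which usually contains dimmer and less dynamic film crew and equipment, learns only changes in point position. Without dense mask supervision, this yields high-quality, segmented dynamic reconstructions, with better results at a smaller model size.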

Link: https://arxiv.org/abs/2511.05152
Authors: Adrian Azzarelli, Nantheera Anantrasirichai, David R Bull
Affiliations: Bristol Visual Institute; University of Bristol
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Multimedia (cs.MM)
Notes:

Abstract:Deformable Gaussian Splatting (GS) accomplishes photorealistic dynamic 3-D reconstruction from dense multi-view video (MVV) by learning to deform a canonical GS representation. However, in filmmaking, tight budgets can result in sparse camera configurations, which limits state-of-the-art (SotA) methods when capturing complex dynamic features. To address this issue, we introduce an approach that splits the canonical Gaussians and deformation field into foreground and background components using a sparse set of masks for frames at t=0. Each representation is separately trained with different loss functions during canonical pre-training. Then, during dynamic training, different parameters are modeled for each deformation field following common filmmaking practices. The foreground stage contains diverse dynamic features, so changes in color, position and rotation are learned. The background, which contains film crew and equipment, is typically dimmer and less dynamic, so only changes in point position are learned. Experiments on 3-D and 2.5-D entertainment datasets show that our method produces SotA qualitative and quantitative results; up to 3 PSNR higher with half the model size on 3-D scenes. Unlike the SotA and without the need for dense mask supervision, our method also produces segmented dynamic reconstructions including transparent and dynamic textures. Code and video comparisons are available online: this https URL

[CV-36] From Linear Probing to Joint-Weighted Token Hierarchy: A Foundation Model Bridging Global and Cellular Representations in Biomarker Detection

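[Quick Read]: This paper addresses a weakness of current pathology foundation models (PFMs) when used for AI-based prediction of molecular features from tissue slides: they rely on global patch-level embeddings and overlook cell-level morphology. The key is JWTH (Joint-Weighted Token Hierarchy), a PFM that combines large-scale self-supervised pretraining with cell-centric post-tuning and attention pooling to fuse local and global tokens, improving the accuracy and interpretability of biomarker detection across multiple tasks and cohorts.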

链接: https://arxiv.org/abs/2511.05150
作者: Jingsong Liu,Han Li,Nassir Navab,Peter J. Schüffler
机构: Technical University of Munich (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

Abstract:AI-based biomarkers can infer molecular features directly from hematoxylin and eosin (H&E) slides, yet most pathology foundation models (PFMs) rely on global patch-level embeddings and overlook cell-level morphology. We present a PFM, JWTH (Joint-Weighted Token Hierarchy), which integrates large-scale self-supervised pretraining with cell-centric post-tuning and attention pooling to fuse local and global tokens. Across four tasks involving four biomarkers and eight cohorts, JWTH achieves up to 8.3% higher balanced accuracy and 1.2% average improvement over prior PFMs, advancing interpretable and robust AI-based biomarker detection in digital pathology.
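As a rough illustration of the attention pooling mentioned in the abstract, the sketch below pools a set of token embeddings into one vector using softmax weights. The scoring vector `query`, the function names, and the toy tokens are assumptions made for illustration; the abstract does not say how JWTH scores or fuses its local and global tokens.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(tokens, query):
    """Pool token vectors into one vector with dot-product attention weights.

    tokens: list of equal-length float lists (hypothetical cell/patch tokens).
    query:  hypothetical scoring vector of the same length.
    """
    scores = [sum(t_i * q_i for t_i, q_i in zip(t, query)) for t in tokens]
    weights = softmax(scores)
    dim = len(tokens[0])
    pooled = [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
    return pooled, weights

# Toy usage: a query aligned with the first axis favors the first token.
pooled, weights = attention_pool([[1.0, 0.0], [0.0, 1.0]], query=[2.0, 0.0])
```

With a zero query every token gets the same weight, so the pooling reduces to a plain mean; a learned query is what lets the pooled vector emphasize some tokens over others.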

[CV-37] SnowyLane: Robust Lane Detection on Snow-covered Rural Roads Using Infrastructural Elements

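[Quick Read]: This paper addresses lane detection for autonomous driving in snowy environments, where conventional lane markings are often covered or occluded by snow and hard to recognize. The key to its solution is to drop the reliance on visible lane markings and instead detect roadside features, namely vertical roadside posts (delineators), as indirect lane indicators, then fit a smooth lane trajectory with a parameterized Bezier curve model that uses spatial consistency and road geometry. The method markedly improves robustness in adverse weather, especially under heavy snow occlusion, and the paper also introduces the synthetic SnowyLane dataset to support training and evaluation, providing a reliable basis and an important resource for lane detection in all-weather autonomous driving.

Link: https://arxiv.org/abs/2511.05108
Authors: Jörg Gamerdinger, Benedict Wetzel, Patrick Schulz, Sven Teufel, Oliver Bringmann
Affiliations: University of Tübingen; Forschungszentrum Informatik (FZI) Karlsruhe
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: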

Abstract:Lane detection for autonomous driving in snow-covered environments remains a major challenge due to the frequent absence or occlusion of lane markings. In this paper, we present a novel, robust and real-time capable approach that bypasses the reliance on traditional lane markings by detecting roadside features, specifically vertical roadside posts called delineators, as indirect lane indicators. Our method first perceives these posts, then fits a smooth lane trajectory using a parameterized Bezier curve model, leveraging spatial consistency and road geometry. To support training and evaluation in these challenging scenarios, we introduce SnowyLane, a new synthetic dataset containing 80,000 annotated frames that capture winter driving conditions with varying snow coverage and lighting conditions. Compared to state-of-the-art lane detection systems, our approach demonstrates significantly improved robustness in adverse weather, particularly in cases with heavy snow occlusion. This work establishes a strong foundation for reliable lane detection in winter scenarios and contributes a valuable resource for future research in all-weather autonomous driving. The dataset is available at this https URL
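To make the curve-fitting step concrete, here is a minimal sketch that fits a cubic Bezier curve to an ordered list of detected post positions. The cubic degree, the fixed endpoints, the chord-length parameterization, and the least-squares fit are all assumptions chosen for illustration; the abstract only states that a parameterized Bezier curve model is used, and the function names are hypothetical.

```python
import math

def bezier_point(ctrl, t):
    """Evaluate a cubic Bezier curve with four 2-D control points at t in [0, 1]."""
    b = ((1 - t) ** 3, 3 * (1 - t) ** 2 * t, 3 * (1 - t) * t ** 2, t ** 3)
    return tuple(sum(w * p[d] for w, p in zip(b, ctrl)) for d in range(2))

def fit_cubic_bezier(points):
    """Least-squares cubic Bezier through ordered 2-D points, endpoints fixed."""
    # Chord-length parameterization: t grows with distance travelled along the posts.
    dists = [math.dist(a, b) for a, b in zip(points, points[1:])]
    total = sum(dists)
    ts = [0.0]
    for d in dists:
        ts.append(ts[-1] + d / total)
    p0, p3 = points[0], points[-1]
    a11 = a12 = a22 = 0.0
    c1, c2 = [0.0, 0.0], [0.0, 0.0]
    for q, t in zip(points, ts):
        b0, b1 = (1 - t) ** 3, 3 * (1 - t) ** 2 * t
        b2, b3 = 3 * (1 - t) * t ** 2, t ** 3
        a11 += b1 * b1
        a12 += b1 * b2
        a22 += b2 * b2
        for d in range(2):
            r = q[d] - b0 * p0[d] - b3 * p3[d]  # part the inner control points must explain
            c1[d] += b1 * r
            c2[d] += b2 * r
    # 2x2 normal equations; det is nonzero when two interior points have distinct t.
    det = a11 * a22 - a12 * a12
    p1 = tuple((a22 * c1[d] - a12 * c2[d]) / det for d in range(2))
    p2 = tuple((a11 * c2[d] - a12 * c1[d]) / det for d in range(2))
    return [tuple(p0), p1, p2, tuple(p3)]

# Toy usage: five hypothetical post positions along one road side.
posts = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
ctrl = fit_cubic_bezier(posts)
midpoint = bezier_point(ctrl, 0.5)
```

A low-degree curve like this stays smooth even when only a handful of posts are visible, which is one plausible reason to prefer it over connecting detections directly.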

[CV-38] Early Alzheimer’s Disease Detection from Retinal OCT Images: A UK Biobank Study

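[Quick Read]: This paper addresses the difficulty of recognizing subtle retinal structural changes in the early detection of Alzheimer’s disease (AD), specifically by classifying Optical Coherence Tomography (OCT) B-scan images directly, without relying on layer-thickness measurements. The key to its solution is applying deep learning to raw OCT B-scans for AD prediction, which the authors describe as a first, together with several strategies to improve generalization on small, high-dimensional data: fine-tuning ImageNet-pretrained models and the OCT-specific RETFound transformer, applying both standard and OCT-specific data augmentation, and designing a year-weighted loss function that prioritizes cases diagnosed within four years of imaging. ResNet-34 was the most stable, reaching an AUC of 0.62 in the 4-year cohort, and explainability analyses confirmed structural differences in the central macular subfield, providing a baseline and direction for future OCT-based early AD screening.

Link: https://arxiv.org/abs/2511.05106
Authors: Yasemin Turkan, F. Boray Tek, M. Serdar Nazlı, Öykü Eren
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: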

Abstract:Alterations in retinal layer thickness, measurable using Optical Coherence Tomography (OCT), have been associated with neurodegenerative diseases such as Alzheimer’s disease (AD). While previous studies have mainly focused on segmented layer thickness measurements, this study explored the direct classification of OCT B-scan images for the early detection of AD. To our knowledge, this is the first application of deep learning to raw OCT B-scans for AD prediction in the literature. Unlike conventional medical image classification tasks, early detection is more challenging than diagnosis because imaging precedes clinical diagnosis by several years. We fine-tuned and evaluated multiple pretrained models, including ImageNet-based networks and the OCT-specific RETFound transformer, using subject-level cross-validation datasets matched for age, sex, and imaging instances from the UK Biobank cohort. To reduce overfitting in this small, high-dimensional dataset, both standard and OCT-specific augmentation techniques were applied, along with a year-weighted loss function that prioritized cases diagnosed within four years of imaging. ResNet-34 produced the most stable results, achieving an AUC of 0.62 in the 4-year cohort. Although below the threshold for clinical application, our explainability analyses confirmed localized structural differences in the central macular subfield between the AD and control groups. These findings provide a baseline for OCT-based AD prediction, highlight the challenges of detecting subtle retinal biomarkers years before AD diagnosis, and point to the need for larger datasets and multimodal approaches.

[CV-39] Quantifying the Risk of Transferred Black Box Attacks

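[Quick Read]: This paper addresses the difficulty of evaluating the resilience of neural network models against transferred adversarial attacks, in particular the comprehensive testing and risk quantification of black-box evasion attacks. Because exhaustively exploring high-dimensional input spaces is computationally infeasible, complete coverage cannot be achieved, and conventional methods struggle to assess the real risk posed by such attacks. The key to its solution is a targeted resilience testing framework that selects surrogate models by Centered Kernel Alignment (CKA) similarity: by choosing surrogates with both high and low CKA similarity to the target model, it seeks to optimize coverage of adversarial subspaces, and it quantifies risk with regression-based estimators, giving organizations realistic and actionable risk assessments.

Link: https://arxiv.org/abs/2511.05102
Authors: Disesdi Susanna Cox, Niklas Bunzel
Affiliations: OWASP AI Exchange; Fraunhofer SIT; TU-Darmstadt; ATHENER
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: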

Abstract:Neural networks have become pervasive across various applications, including security-related products. However, their widespread adoption has heightened concerns regarding vulnerability to adversarial attacks. With emerging regulations and standards emphasizing security, organizations must reliably quantify risks associated with these attacks, particularly regarding transferred adversarial attacks, which remain challenging to evaluate accurately. This paper investigates the complexities involved in resilience testing against transferred adversarial attacks. Our analysis specifically addresses black-box evasion attacks, highlighting transfer-based attacks due to their practical significance and typically high transferability between neural network models. We underline the computational infeasibility of exhaustively exploring high-dimensional input spaces to achieve complete test coverage. As a result, comprehensive adversarial risk mapping is deemed impractical. To mitigate this limitation, we propose a targeted resilience testing framework that employs surrogate models strategically selected based on Centered Kernel Alignment (CKA) similarity. By leveraging surrogate models exhibiting both high and low CKA similarities relative to the target model, the proposed approach seeks to optimize coverage of adversarial subspaces. Risk estimation is conducted using regression-based estimators, providing organizations with realistic and actionable risk quantification.
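For readers unfamiliar with CKA, the following sketch computes the linear variant between two sets of activations recorded on the same inputs. Using linear CKA rather than a kernel variant is an assumption, since the abstract does not say which form the framework uses, and the function names and toy activations are hypothetical.

```python
import math

def center_columns(mat):
    """Subtract each column's mean (rows are examples, columns are features)."""
    n = len(mat)
    means = [sum(col) / n for col in zip(*mat)]
    return [[v - m for v, m in zip(row, means)] for row in mat]

def cross_sq_norm(a, b):
    """Squared Frobenius norm of A^T B for row-major matrices a and b."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total

def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_sq_norm(xc, yc)
    denominator = math.sqrt(cross_sq_norm(xc, xc) * cross_sq_norm(yc, yc))
    return numerator / denominator

# Toy usage: activations of a target and a surrogate on the same four inputs.
target_acts = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
surrogate_acts = [[2.0, 4.0], [4.0, 2.0], [6.0, 10.0], [8.0, 6.0]]  # scaled copy
similarity = linear_cka(target_acts, surrogate_acts)
```

Because the score is invariant to isotropic scaling and orthogonal transforms of the features, it compares what two models represent and not how the features happen to be laid out, which is what makes it usable for ranking candidate surrogates.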

[CV-40] Real-World Adverse Weather Image Restoration via Dual-Level Reinforcement Learning with High-Quality Cold Start NEURIPS2025

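[Quick Read]: This paper addresses the degradation of visual perception under adverse weather, and in particular the limited generalization of existing vision models trained on synthetic data with fixed parameters when they face complex degradations. The key to its solution is HFLS-Weather, a physics-driven, high-fidelity weather dataset, together with a dual-level reinforcement learning framework: at the local level, weather-specific restoration models are refined through perturbation-driven image quality optimization, without paired supervision; at the global level, a meta-controller dynamically schedules model selection and execution order according to the degradation of each scene. This enables continuous adaptation to real-world conditions and state-of-the-art performance across a wide range of adverse weather scenarios.

Link: https://arxiv.org/abs/2511.05095
Authors: Fuyang Liu, Jiaqi Xu, Xiaowei Hu
Affiliations: South China University of Technology; Nanjing University of Science and Technology; Huawei Noah’s Ark Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2025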

Abstract:Adverse weather severely impairs real-world visual perception, while existing vision models trained on synthetic data with fixed parameters struggle to generalize to complex degradations. To address this, we first construct HFLS-Weather, a physics-driven, high-fidelity dataset that simulates diverse weather phenomena, and then design a dual-level reinforcement learning framework initialized with HFLS-Weather for cold-start training. Within this framework, at the local level, weather-specific restoration models are refined through perturbation-driven image quality optimization, enabling reward-based learning without paired supervision; at the global level, a meta-controller dynamically orchestrates model selection and execution order according to scene degradation. This framework enables continuous adaptation to real-world conditions and achieves state-of-the-art performance across a wide range of adverse weather scenarios. Code is available at this https URL

[CV-41] A Dual-stage Prompt-driven Privacy-preserving Paradigm for Person Re-Identification

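[Quick Read]: This paper addresses two challenges in training person re-identification (Re-ID) models on virtual data: existing virtual datasets generated by game engines are complex to build and generalize poorly, and there is no effective mechanism for learning domain-invariant features that would improve adaptation to real scenes. The key to its solution is a Dual-stage Prompt-driven Privacy-preserving Paradigm (DPPP). In the first stage, prompts with multi-dimensional attributes (such as pedestrian appearance, illumination, and viewpoint) drive a diffusion model to generate, end to end, the large-scale virtual dataset GenePerson (130,519 images of 6,641 identities). In the second stage, a Prompt-driven Disentanglement Mechanism (PDM) uses contrastive learning and textual inversion networks to map images into pseudo-words representing style and content, building style-disentangled content prompts that guide the model to learn domain-invariant content features at the image level, which markedly improves domain generalization.

Link: https://arxiv.org/abs/2511.05092
Authors: Ruolin Li, Min Liu, Yuan Bian, Zhaoyang Li, Yuzhen Li, Xueping Wang, Yaonan Wang
Affiliations: Hunan University; National Engineering Research Center of Robot Visual Perception and Control Technology; Hunan Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures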

Abstract:With growing concerns over data privacy, researchers have started using virtual data as an alternative to sensitive real-world images for training person re-identification (Re-ID) models. However, existing virtual datasets produced by game engines still face challenges such as complex construction and poor domain generalization, making them difficult to apply in real scenarios. To address these challenges, we propose a Dual-stage Prompt-driven Privacy-preserving Paradigm (DPPP). In the first stage, we generate rich prompts incorporating multi-dimensional attributes such as pedestrian appearance, illumination, and viewpoint that drive the diffusion model to synthesize diverse data end-to-end, building a large-scale virtual dataset named GenePerson with 130,519 images of 6,641 identities. In the second stage, we propose a Prompt-driven Disentanglement Mechanism (PDM) to learn domain-invariant generalization features. With the aid of contrastive learning, we employ two textual inversion networks to map images into pseudo-words representing style and content, respectively, thereby constructing style-disentangled content prompts to guide the model in learning domain-invariant content features at the image level. Experiments demonstrate that models trained on GenePerson with PDM achieve state-of-the-art generalization performance, surpassing those on popular real and virtual Re-ID datasets.

[CV-42] Deep learning models are vulnerable but adversarial examples are even more vulnerable

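[Quick Read]: This paper addresses the insufficient robustness of deep neural networks (DNNs) against adversarial examples and the difficulty of detecting them, and its core finding is an intrinsic difference in occlusion sensitivity between adversarial examples and clean samples. The key to its solution is Sliding Mask Confidence Entropy (SMCE), a metric that quantifies how much a model’s confidence fluctuates under local occlusion and thereby helps identify adversarial examples. Building on it, the paper designs Sliding Window Mask-based Adversarial Example Detection (SWM-AED), which avoids the catastrophic overfitting that conventional adversarial training is prone to and achieves high detection accuracy across classifiers and attack types (above 62% in most cases and up to 96.5%).

Link: https://arxiv.org/abs/2511.05073
Authors: Jun Li, Yanwei Xu, Keran Li, Xiaoli Zhang
Affiliations: Jilin University of Finance and Economics; Jilin University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 25 pages, 12 figures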

Abstract:Understanding intrinsic differences between adversarial examples and clean samples is key to enhancing DNN robustness and detection against adversarial attacks. This study first empirically finds that image-based adversarial examples are notably sensitive to occlusion. Controlled experiments on CIFAR-10 used nine canonical attacks (e.g., FGSM, PGD) to generate adversarial examples, paired with original samples for evaluation. We introduce Sliding Mask Confidence Entropy (SMCE) to quantify model confidence fluctuation under occlusion. Using 1800+ test images, SMCE calculations supported by Mask Entropy Field Maps and statistical distributions show adversarial examples have significantly higher confidence volatility under occlusion than originals. Based on this, we propose Sliding Window Mask-based Adversarial Example Detection (SWM-AED), which avoids catastrophic overfitting of conventional adversarial training. Evaluations across classifiers and attacks on CIFAR-10 demonstrate robust performance, with accuracy over 62% in most cases and up to 96.5%.

[CV-43] SurgiATM: A Physics-Guided Plug-and-Play Model for Deep Learning-Based Smoke Removal in Laparoscopic Surgery

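[Quick Read]: This paper addresses the severe degradation of endoscopic image quality caused by smoke from tissue cauterization during laparoscopic surgery; such smoke increases the risk of surgical errors and hinders both clinical decision-making and computer-assisted visual analysis. The key to its solution is the Surgical Atmospheric Model (SurgiATM), a lightweight, plug-and-play module that statistically bridges a physics-based atmospheric model and data-driven deep learning models, keeping the strong generalizability of the former while gaining the high accuracy of the latter. SurgiATM introduces only two hyperparameters and no trainable weights, and it improves the accuracy and stability of a range of desmoking architectures without modifying the original network structure, giving efficient and general surgical smoke removal with minimal computational overhead.

Link: https://arxiv.org/abs/2511.05059
Authors: Mingyu Sheng, Jianan Fan, Dongnan Liu, Guoyan Zheng, Ron Kikinis, Weidong Cai
Affiliations: University of Sydney; Shanghai Jiao Tong University; Harvard Medical School
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures, 6 tables. Code available at this https URL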

Abstract:During laparoscopic surgery, smoke generated by tissue cauterization can significantly degrade the visual quality of endoscopic frames, increasing the risk of surgical errors and hindering both clinical decision-making and computer-assisted visual analysis. Consequently, removing surgical smoke is critical to ensuring patient safety and maintaining operative efficiency. In this study, we propose the Surgical Atmospheric Model (SurgiATM) for surgical smoke removal. SurgiATM statistically bridges a physics-based atmospheric model and data-driven deep learning models, combining the superior generalizability of the former with the high accuracy of the latter. Furthermore, SurgiATM is designed as a lightweight, plug-and-play module that can be seamlessly integrated into diverse surgical desmoking architectures to enhance their accuracy and stability, better meeting clinical requirements. It introduces only two hyperparameters and no additional trainable weights, preserving the original network architecture with minimal computational and modification overhead. We conduct extensive experiments on three public surgical datasets with ten desmoking methods, involving multiple network architectures and covering diverse procedures, including cholecystectomy, partial nephrectomy, and diaphragm dissection. The results demonstrate that incorporating SurgiATM commonly reduces the restoration errors of existing models and relatively enhances their generalizability, without adding any trainable layers or weights. This highlights the convenience, low cost, effectiveness, and generalizability of the proposed method. The code for SurgiATM is released at this https URL.
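The abstract does not spell out its physics-based atmospheric model. As background only, the sketch below uses the standard atmospheric scattering model from the dehazing literature, I = J·t + A·(1 − t), and inverts it for the clean intensity J. Treating this as the model behind SurgiATM is an assumption, and the function names, the `t_min` clamp, and the toy values are hypothetical.

```python
def add_smoke(clean, transmission, airlight):
    """Forward model: observed = clean * t + airlight * (1 - t), per pixel."""
    return [j * t + airlight * (1.0 - t) for j, t in zip(clean, transmission)]

def remove_smoke(observed, transmission, airlight, t_min=0.1):
    """Invert the scattering model for the clean intensity.

    t_min clamps the transmission so that dense-smoke pixels (t near 0)
    do not blow up the division.
    """
    restored = []
    for i, t in zip(observed, transmission):
        t = max(t, t_min)
        restored.append((i - airlight * (1.0 - t)) / t)
    return restored

# Toy usage on three grayscale pixels with intensities in [0, 1].
clean = [0.2, 0.5, 0.8]
transmission = [0.9, 0.5, 0.3]   # lower value means denser smoke
smoky = add_smoke(clean, transmission, airlight=1.0)
recovered = remove_smoke(smoky, transmission, airlight=1.0)
```

In practice neither the transmission map nor the airlight is known, so a desmoking method has to estimate them or an equivalent quantity; the inversion itself adds no trainable parameters.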

[CV-44] Role-SynthCLIP: A Role Play Driven Diverse Synthetic Data Approach

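[Quick Read]: This paper addresses the limited performance of Contrastive Language-Image Pre-training (CLIP) models trained on synthetic data, which stems from insufficient semantic diversity and from redundant or shallow captions. The key to its solution is Role-SynthCLIP, a data synthesis framework that uses multi-perspective role-playing prompts (such as a "compositional analyst" or an "interpreter of image context") to guide Multimodal Large Language Models (MLLMs) to generate semantically diverse captions from distinct viewpoints. This raises the semantic richness and fine-grained alignment of the synthetic image-text pairs and improves caption expressiveness and accuracy without increasing the total amount of data.

Link: https://arxiv.org/abs/2511.05057
Authors: Yuanxiang Huangfu, Chaochao Wang, Weilei Wang
Affiliations: PatSnap Co., LTD.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: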

Abstract:The effectiveness of Contrastive Language-Image Pre-training (CLIP) models critically depends on the semantic diversity and quality of their training data. However, while existing synthetic data generation methods primarily focus on increasing data volume, such emphasis often leads to limited semantic diversity and redundant or shallow captions. To address this limitation, we propose Role-SynthCLIP, a novel data synthesis framework that leverages multi-perspective role-playing prompts (e.g., a compositional analyst, an interpreter of image context) to guide Multimodal Large Language Models (MLLMs) in generating semantically diverse captions from distinct viewpoints. This mechanism enhances the semantic diversity and fine-grained image-text alignment of synthetic pairs, thereby improving caption expressiveness and accuracy while keeping the total number of image-text pairs unchanged. Experimental results demonstrate the effectiveness and efficiency of our method. A CLIP-B/16 model trained on only 1 million Role-SynthCLIP pairs achieves a Recall@1 of 64.1% on the MS COCO validation set, surpassing the best existing synthetic data baseline (trained on 5M pairs) by 2.8 percentage points. The code and trained models are released at this https URL.
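The Recall@1 figure quoted above is a retrieval metric. The sketch below shows how Recall@K is commonly computed from a query-by-candidate similarity matrix. It assumes one correct candidate per query, which simplifies the MS COCO protocol, and the function name and toy scores are hypothetical.

```python
def recall_at_k(similarity, targets, k=1):
    """Fraction of queries whose correct candidate ranks in the top k.

    similarity: one row of candidate scores per query (higher is better).
    targets:    index of the correct candidate for each query.
    """
    hits = 0
    for row, target in zip(similarity, targets):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)

# Toy usage: three text queries scored against three images.
scores = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.5],
    [0.1, 0.8, 0.4],
]
targets = [0, 1, 2]
r1 = recall_at_k(scores, targets, k=1)
r2 = recall_at_k(scores, targets, k=2)
```

Recall@1 is the strictest setting, since only the single highest-scoring candidate counts as a hit.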

[CV-45] No Pose Estimation? No Problem: Pose-Agnostic and Instance-Aware Test-Time Adaptation for Monocular Depth Estimation

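[Quick Read]: This paper addresses the domain adaptation problem that monocular depth estimation (MDE) models face when deployed in real scenes, where existing self-supervised methods degrade markedly if test-time conditions differ substantially from the training data. The key to its solution is PITTA, a new and high-performing test-time adaptation (TTA) framework with two core innovations: (i) a pose-agnostic TTA paradigm that adapts effectively without any camera pose information; and (ii) an instance-aware image masking strategy that uses a pretrained panoptic segmentation network to extract masks of dynamic objects (such as vehicles and pedestrians) while removing static background, improving adaptation accuracy. The paper also introduces a simple but effective edge extraction method to enrich the information in the input image and depth map, and experiments on the DrivingStereo and Waymo datasets show that PITTA clearly outperforms existing state-of-the-art methods.

Link: https://arxiv.org/abs/2511.05055
Authors: Mingyu Sung, Hyeonmin Choe, Il-Min Kim, Sangseok Yun, Jae Mo Kang
Affiliations: Kyungpook National University; Queen’s University; Pukyong National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: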

Abstract:Monocular depth estimation (MDE), inferring pixel-level depths in single RGB images from a monocular camera, plays a pivotal role in a variety of AI applications demanding a three-dimensional (3D) topographical scene. In real-world scenarios, MDE models often need to be deployed in environments with different conditions from those for training. Test-time (domain) adaptation (TTA) is one of the compelling and practical approaches to address the issue. Although there have been notable advancements in TTA for MDE, particularly in a self-supervised manner, existing methods are still ineffective and problematic when applied to diverse and dynamic environments. To break through this challenge, we propose a novel and high-performing TTA framework for MDE, named PITTA. Our approach incorporates two key innovative strategies: (i) pose-agnostic TTA paradigm for MDE and (ii) instance-aware image masking. Specifically, PITTA enables highly effective TTA on a pretrained MDE network in a pose-agnostic manner without resorting to any camera pose information. Besides, our instance-aware masking strategy extracts instance-wise masks for dynamic objects (e.g., vehicles, pedestrians, etc.) from a segmentation mask produced by a pretrained panoptic segmentation network, by removing static objects including background components. To further boost performance, we also present a simple yet effective edge extraction methodology for the input image (i.e., a single monocular image) and depth map. Extensive experimental evaluations on DrivingStereo and Waymo datasets with varying environmental conditions demonstrate that our proposed framework, PITTA, surpasses the existing state-of-the-art techniques with remarkable performance improvements in MDE during TTA.

[CV-46] Medical Referring Image Segmentation via Next-Token Mask Prediction

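[Quick Read]: This paper addresses the reliance of existing Medical Referring Image Segmentation (MRIS) methods on complex multimodal fusion mechanisms or multi-stage decoder designs, which makes architectures redundant, training inefficient, and end-to-end optimization difficult. Its core solution is NTP-MRISeg, a new framework that recasts MRIS as an autoregressive next-token prediction task over a unified multimodal sequence of tokenized image, text, and mask representations. This design drops modality-specific fusion modules and external segmentation models, yields a lightweight unified architecture, and uses tokenizers from large-scale pretrained multimodal models to improve generalization. The key innovations are: (1) a Next-k Token Prediction (NkTP) scheme that reduces cumulative prediction errors; (2) Token-level Contrastive Learning (TCL), which enhances boundary sensitivity and mitigates long-tail distribution effects; and (3) a memory-based Hard Error Token (HET) optimization strategy that strengthens learning on difficult tokens. Experiments show that the method reaches new state-of-the-art performance on the QaTa-COV19 and MosMedData+ datasets, offering an efficient and streamlined modeling paradigm for MRIS.

Link: https://arxiv.org/abs/2511.05044
Authors: Xinyu Chen, Yiran Wang, Gaoyang Pang, Jiafu Hao, Chentao Yue, Luping Zhou, Yonghui Li
Affiliations: University of Sydney
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to the IEEE Transactions on Medical Imaging for possible publication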

Abstract:Medical Referring Image Segmentation (MRIS) involves segmenting target regions in medical images based on natural language descriptions. While achieving promising results, recent approaches usually involve complex design of multimodal fusion or multi-stage decoders. In this work, we propose NTP-MRISeg, a novel framework that reformulates MRIS as an autoregressive next-token prediction task over a unified multimodal sequence of tokenized image, text, and mask representations. This formulation streamlines model design by eliminating the need for modality-specific fusion and external segmentation models, and supports a unified architecture for end-to-end training. It also enables the use of pretrained tokenizers from emerging large-scale multimodal models, enhancing generalization and adaptability. More importantly, to address challenges under this formulation (such as exposure bias, long-tail token distributions, and fine-grained lesion edges), we propose three novel strategies: (1) a Next-k Token Prediction (NkTP) scheme to reduce cumulative prediction errors, (2) Token-level Contrastive Learning (TCL) to enhance boundary sensitivity and mitigate long-tail distribution effects, and (3) a memory-based Hard Error Token (HET) optimization strategy that emphasizes difficult tokens during training. Extensive experiments on the QaTa-COV19 and MosMedData+ datasets demonstrate that NTP-MRISeg achieves new state-of-the-art performance, offering a streamlined and effective alternative to traditional MRIS pipelines.

[CV-47] Pressure2Motion: Hierarchical Motion Synthesis from Ground Pressure with Text Guidance

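[Quick Read]: This paper addresses the synthesis of human motion from a ground pressure sequence and a text prompt, a task that is severely ill-posed because the relationship between pressure signals and full-body motion is indeterminate. The core challenge is to generate high-fidelity, physically plausible motion from pressure data and semantic descriptions alone, without cameras, specialized lighting, or wearable devices. The key to its solution is Pressure2Motion, a generative model that combines a dual-level feature extractor with a hierarchical diffusion model: the former accurately interprets the physical cues in the pressure signals, and the latter distinguishes broad-scale movement trajectories from subtle posture adjustments. Text prompts provide high-level semantic guidance, so motion generation is driven by pressure data and constrained by semantics, clearly outperforming existing methods.

Link: https://arxiv.org/abs/2511.05038
Authors: Zhengxuan Li, Qinhui Yang, Yiyu Zhuang, Chuan Guo, Xinxin Zuo, Xiaoxiao Long, Yao Yao, Xun Cao, Qiu Shen, Hao Zhu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: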

Abstract:We present Pressure2Motion, a novel motion capture algorithm that synthesizes human motion from a ground pressure sequence and text prompt. It eliminates the need for specialized lighting setups, cameras, or wearable devices, making it suitable for privacy-preserving, low-light, and low-cost motion capture scenarios. Such a task is severely ill-posed because the mapping from pressure signals to full-body motion is indeterminate. To address this issue, we introduce Pressure2Motion, a generative model that leverages pressure features as input and utilizes a text prompt as a high-level guiding constraint. Specifically, our model utilizes a dual-level feature extractor that accurately interprets pressure data, followed by a hierarchical diffusion model that discerns broad-scale movement trajectories and subtle posture adjustments. Both the physical cues gained from the pressure sequence and the semantic guidance derived from descriptive texts are leveraged to guide the motion generation with precision. To the best of our knowledge, Pressure2Motion is a pioneering work in leveraging both pressure data and linguistic priors for motion generation, and the established MPL benchmark is the first benchmark for this task. Experiments show our method generates high-fidelity, physically plausible motions, establishing a new state-of-the-art for this task. The code and benchmarks will be publicly released upon publication.

[CV-48] Dynamic Residual Encoding with Slide-Level Contrastive Learning for End-to-End Whole Slide Image Representation

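[Quick Read]: This paper addresses the challenge of end-to-end representation learning for whole slide images (WSIs) in cancer subtyping, cancer recognition, and mutation prediction: a single WSI typically contains tens of thousands of image tiles, so current GPU memory limits make it difficult to compute gradients for all tiles in one mini-batch. The key to its solution is dynamic residual encoding with slide-level contrastive learning (DRE-SLCL): a memory bank stores tile features for all WSIs in the dataset; during training, a subset of tiles is randomly sampled from each WSI in the batch and encoded, and additional tile features from the same WSI are retrieved from the memory bank; residual encoding fuses the two sets of features into a WSI representation; and a slide-level contrastive loss is computed from these representations and the histopathology reports, giving efficient and discriminative WSI representation learning.

Link: https://arxiv.org/abs/2511.05034
Authors: Jing Jin, Xu Liu, Te Gao, Zhihong Shi, Yixiong Liang, Ruiqing Zheng, Hulin Kuang, Min Zeng, Shichao Kan
Affiliations: Central South University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 3 figures, published in the ACM Digital Library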

Abstract:Whole Slide Image (WSI) representation is critical for cancer subtyping, cancer recognition and mutation prediction. Training an end-to-end WSI representation model poses significant challenges, as a standard gigapixel slide can contain tens of thousands of image tiles, making it difficult to compute gradients of all tiles in a single mini-batch due to current GPU limitations. To address this challenge, we propose a method of dynamic residual encoding with slide-level contrastive learning (DRE-SLCL) for end-to-end WSI representation. Our approach utilizes a memory bank to store the features of tiles across all WSIs in the dataset. During training, a mini-batch usually contains multiple WSIs. For each WSI in the batch, a subset of tiles is randomly sampled and their features are computed using a tile encoder. Then, additional tile features from the same WSI are selected from the memory bank. The representation of each individual WSI is generated using a residual encoding technique that incorporates both the sampled features and those retrieved from the memory bank. Finally, the slide-level contrastive loss is computed based on the representations and histopathology reports of the WSIs within the mini-batch. Experiments conducted over cancer subtyping, cancer recognition, and mutation prediction tasks proved the effectiveness of the proposed DRE-SLCL method.

[CV-49] DAFM: Dynamic Adaptive Fusion for Multi-Model Collaboration in Composed Image Retrieval

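[Quick Read]: This paper studies two challenges that arise because existing Composed Image Retrieval (CIR) methods rely on a single model for feature fusion and similarity matching. First, a single model struggles to capture global information and fine-grained associations at the same time, so it misses subtle but important links between image and text. Second, without dynamic weight allocation, the complementary strengths of heterogeneous models cannot be used adaptively, so the embedding drifts away from the target and degrades nearest-neighbor search. The key to its solution is Dynamic Adaptive Fusion (DAFM), which coordinates multiple heterogeneous models and dynamically adjusts their contribution weights to integrate their complementary strengths adaptively, markedly improving retrieval accuracy independently of the fusion order and showing good robustness.

Link: https://arxiv.org/abs/2511.05020
Authors: Yawei Cai, Jiapeng Mi, Nan Ji, Haotian Rong, Yawei Zhang, Zhangti Li, Wenbin Guo, Rensong Xie
Affiliations: China Unicom Software Research Institute
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures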

Abstract:Composed Image Retrieval (CIR) is a cross-modal task that aims to retrieve target images from large-scale databases using a reference image and a modification text. Most existing methods rely on a single model to perform feature fusion and similarity matching. However, this paradigm faces two major challenges. First, one model alone can’t see the whole picture and the tiny details at the same time; it has to handle different tasks with the same weights, so it often misses the small but important links between image and text. Second, the absence of dynamic weight allocation prevents adaptive leveraging of complementary model strengths, so the resulting embedding drifts away from the target and misleads the nearest-neighbor search in CIR. To address these limitations, we propose Dynamic Adaptive Fusion (DAFM) for multi-model collaboration in CIR. Rather than optimizing a single method in isolation, DAFM exploits the complementary strengths of heterogeneous models and adaptively rebalances their contributions. This not only maximizes retrieval accuracy but also ensures that the performance gains are independent of the fusion order, highlighting the robustness of our approach. Experiments on the CIRR and FashionIQ benchmarks demonstrate consistent improvements. Our method achieves a Recall@10 of 93.21 and an Rmean of 84.43 on CIRR, and an average Rmean of 67.48 on FashionIQ, surpassing recent strong baselines by up to 4.5%. These results confirm that dynamic multi-model collaboration provides an effective and general solution for CIR.

[CV-50] GSE: Evaluating Sticker Visual Semantic Similarity via a General Sticker Encoder

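[Quick Read]: This paper addresses the difficulty of understanding sticker semantic similarity, that is, accurately capturing semantic relationships in highly diverse and symbolic sticker content. Existing pretrained vision and multimodal models perform poorly on this task and struggle to model nuanced sticker semantics. The key to its solution is Triple-S, the first benchmark for sticker semantic similarity (905 human-annotated positive and negative sticker pairs), and the General Sticker Encoder (GSE), a lightweight model trained on Triple-S together with additional datasets to learn robust sticker embeddings; GSE performs strongly on unseen stickers and is validated on downstream tasks such as emotion classification and sticker retrieval.

Link: https://arxiv.org/abs/2511.04977
Authors: Heng Er Metilda Chee, Jiayin Wang, Zhiqiang Guo, Weizhi Ma, Min Zhang
Affiliations: DCST, Tsinghua University; Quan Cheng Laboratory; AIR, Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: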

Abstract:Stickers have become a popular form of visual communication, yet understanding their semantic relationships remains challenging due to their highly diverse and symbolic content. In this work, we formally define the Sticker Semantic Similarity task and introduce Triple-S, the first benchmark for this task, consisting of 905 human-annotated positive and negative sticker pairs. Through extensive evaluation, we show that existing pretrained vision and multimodal models struggle to capture nuanced sticker semantics. To address this, we propose the General Sticker Encoder (GSE), a lightweight and versatile model that learns robust sticker embeddings using both Triple-S and additional datasets. GSE achieves superior performance on unseen stickers, and demonstrates strong results on downstream tasks such as emotion classification and sticker-to-sticker retrieval. By releasing both Triple-S and GSE, we provide standardized evaluation tools and robust embeddings, enabling future research in sticker understanding, retrieval, and multimodal content generation. The Triple-S benchmark and GSE have been publicly released and are available here.

[CV-51] Challenges in 3D Data Synthesis for Training Neural Networks on Topological Features

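[Quick Read]: This paper addresses the efficiency bottleneck in Topological Data Analysis (TDA) caused by the high computational cost of traditional methods such as persistent homology, as well as the lack of labeled 3D datasets for supervised learning on TDA tasks. The key to its solution is a systematic method, based on the Repulsive Surface algorithm, for generating synthetic 3D datasets with controllable topological invariants (such as hole count), providing neural network estimators with geometrically varied, topologically labeled data. The dataset is used to train a genus estimation network built on a 3D convolutional transformer architecture, and the results reveal the important influence of geometric complexity on model generalization.

Link: https://arxiv.org/abs/2511.04972
Authors: Dylan Peek, Matthew P. Skerritt, Siddharth Pritam, Stephan Chalup
Affiliations: The University of Newcastle, Australia; RMIT University, Australia; Chennai Mathematical Institute, India
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages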

Abstract:Topological Data Analysis (TDA) involves techniques of analyzing the underlying structure and connectivity of data. However, traditional methods like persistent homology can be computationally demanding, motivating the development of neural network-based estimators capable of reducing computational overhead and inference time. A key barrier to advancing these methods is the lack of labeled 3D data with class distributions and diversity tailored specifically for supervised learning in TDA tasks. To address this, we introduce a novel approach for systematically generating labeled 3D datasets using the Repulsive Surface algorithm, allowing control over topological invariants, such as hole count. The resulting dataset offers varied geometry with topological labeling, making it suitable for training and benchmarking neural network estimators. This paper uses a synthetic 3D dataset to train a genus estimator network, created using a 3D convolutional transformer architecture. An observed decrease in accuracy as deformations increase highlights the role of not just topological complexity, but also geometric complexity, when training generalized estimators. This dataset fills a gap in labeled 3D datasets and generation for training and evaluating models and techniques for TDA.

[CV-52] Learning Fourier shapes to probe the geometric world of deep neural networks

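[Quick read]: This paper addresses the limited understanding of how deep neural networks (DNNs) use geometric information in visual recognition: existing research has concentrated on texture features and neglected the capacity of shape to carry semantics. The key to its solution is an end-to-end differentiable framework that combines a Fourier-series parameterization of arbitrary shapes, a winding number-based mapping onto the pixel grid, and signal energy constraints, so that physically plausible optimized shapes can be generated efficiently and used both to isolate a model's salient regions precisely and to mount geometric adversarial attacks.

Link: https://arxiv.org/abs/2511.04970
Authors: Jian Wang,Yixing Yong,Haixia Bi,Lijun He,Fan Li
Affiliations: Xi’an Jiaotong University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages, 5 figures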

Abstract:While both shape and texture are fundamental to visual recognition, research on deep neural networks (DNNs) has predominantly focused on the latter, leaving their geometric understanding poorly probed. Here, we show: first, that optimized shapes can act as potent semantic carriers, generating high-confidence classifications from inputs defined purely by their geometry; second, that they are high-fidelity interpretability tools that precisely isolate a model’s salient regions; and third, that they constitute a new, generalizable adversarial paradigm capable of deceiving downstream visual tasks. This is achieved through an end-to-end differentiable framework that unifies a powerful Fourier series to parameterize arbitrary shapes, a winding number-based mapping to translate them into the pixel grid required by DNNs, and signal energy constraints that enhance optimization efficiency while ensuring physically plausible shapes. Our work provides a versatile framework for probing the geometric world of DNNs and opens new frontiers for challenging and understanding machine perception.
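The two geometric ingredients named in the abstract can be illustrated with a minimal sketch: a closed curve sampled from a truncated Fourier series, and a winding-number test that decides which pixels lie inside it. This version uses a hard inside/outside threshold and is therefore not differentiable, unlike the framework described in the paper. The function names, coefficients, and grid size are all assumptions made for illustration.

```python
import math


def fourier_curve(coeffs_x, coeffs_y, num_samples=256):
    """Sample a closed curve from truncated Fourier series.

    coeffs_x / coeffs_y are lists of (a_k, b_k) pairs for k = 1, 2, ...;
    x(t) = sum_k a_k cos(kt) + b_k sin(kt), and likewise for y(t).
    """
    points = []
    for s in range(num_samples):
        t = 2.0 * math.pi * s / num_samples
        x = sum(a * math.cos((k + 1) * t) + b * math.sin((k + 1) * t)
                for k, (a, b) in enumerate(coeffs_x))
        y = sum(a * math.cos((k + 1) * t) + b * math.sin((k + 1) * t)
                for k, (a, b) in enumerate(coeffs_y))
        points.append((x, y))
    return points


def winding_number(px, py, polygon):
    """Winding number of a closed polygon around (px, py), from summed signed angles."""
    total = 0.0
    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i][0] - px, polygon[i][1] - py
        x1, y1 = polygon[(i + 1) % n][0] - px, polygon[(i + 1) % n][1] - py
        total += math.atan2(x0 * y1 - y0 * x1, x0 * x1 + y0 * y1)
    return round(total / (2.0 * math.pi))


def rasterize(polygon, size=16, extent=1.5):
    """Binary mask over [-extent, extent]^2: 1 where the winding number is non-zero."""
    step = 2.0 * extent / size
    mask = []
    for r in range(size):
        y = extent - (r + 0.5) * step  # row 0 is the top of the image
        row = []
        for c in range(size):
            x = -extent + (c + 0.5) * step
            row.append(1 if winding_number(x, y, polygon) != 0 else 0)
        mask.append(row)
    return mask


# A unit circle with a small third-harmonic perturbation (hypothetical coefficients).
curve = fourier_curve([(1.0, 0.0), (0.0, 0.0), (0.1, 0.0)],
                      [(0.0, 1.0), (0.0, 0.0), (0.0, 0.1)])
for row in rasterize(curve):
    print("".join("#" if v else "." for v in row))
```

Changing the Fourier coefficients changes the printed silhouette. An optimization-based pipeline would need a smooth relaxation of the threshold so that gradients can flow back to the coefficients.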

[CV-53] Pattern-Aware Diffusion Synthesis of fMRI/dMRI with Tissue and Microstructural Refinement

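[Quick read]: This paper addresses the problem that missing modalities, such as functional MRI (fMRI) and diffusion MRI (dMRI), limit clinical use in research on neurodegenerative diseases. Existing approaches based on generative adversarial networks (GANs) and diffusion models perform poorly on fMRI-dMRI synthesis for two main reasons: first, the fMRI and dMRI signals differ significantly along the time/gradient axis (BOLD vs. diffusion-weighted signal differences); second, disease-related neuroanatomical patterns are not adequately integrated during generation. The authors therefore propose the PDS (Pattern-aware Dual-modal Synthesis) framework, whose core innovations are: (1) a pattern-aware dual-modal 3D diffusion mechanism for cross-modality learning, intended to strengthen semantic consistency between modalities; and (2) a tissue refinement network integrated with an efficient microstructure refinement strategy to maintain structural fidelity and fine details. Experiments show that PDS achieves state-of-the-art performance on OASIS-3, ADNI, and in-house datasets, and reaches 67.92%/66.02%/64.15% accuracy (NC vs. MCI vs. AD) in hybrid real-synthetic diagnostic experiments.

Link: https://arxiv.org/abs/2511.04963
Authors: Xiongri Shen,Jiaqi Wang,Yi Zhong,Zhenxi Song,Leilei Zhao,Yichen Wei,Lingyan Liang,Shuqiang Wang,Baiying Lei,Demao Deng,Zhiguo Zhang
Affiliations: Harbin Institute of Technology, Shenzhen; Shenzhen University; Chinese Academy of Sciences, Shenzhen Institutes of Advanced Technology; Guangxi Academy of Medical Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: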

Abstract:Magnetic resonance imaging (MRI), especially functional MRI (fMRI) and diffusion MRI (dMRI), is essential for studying neurodegenerative diseases. However, missing modalities pose a major barrier to their clinical use. Although GAN- and diffusion model-based approaches have shown some promise in modality completion, they remain limited in fMRI-dMRI synthesis due to (1) significant BOLD vs. diffusion-weighted signal differences between fMRI and dMRI along the time/gradient axis, and (2) inadequate integration of disease-related neuroanatomical patterns during generation. To address these challenges, we propose PDS, introducing two key innovations: (1) a pattern-aware dual-modal 3D diffusion framework for cross-modality learning, and (2) a tissue refinement network integrated with an efficient microstructure refinement to maintain structural fidelity and fine details. Evaluated on OASIS-3, ADNI, and in-house datasets, our method achieves state-of-the-art results, with PSNR/SSIM scores of 29.83 dB/90.84% for fMRI synthesis (+1.54 dB/+4.12% over baselines) and 30.00 dB/77.55% for dMRI synthesis (+1.02 dB/+2.2%). In clinical validation, the synthesized data show strong diagnostic performance, achieving 67.92%/66.02%/64.15% accuracy (NC vs. MCI vs. AD) in hybrid real-synthetic experiments. Code is available in the PDS GitHub Repository at this https URL.

[CV-54] CLM: Removing the GPU Memory Barrier for 3D Gaussian Splatting

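[Quick read]: This paper addresses the problem that 3D Gaussian Splatting (3DGS) needs too much memory to handle large or intricate scenes on a single consumer-grade GPU such as the RTX 4090. Its core solution is the CLM system, which offloads Gaussians from GPU memory to CPU memory and loads them back onto the GPU only when needed, reducing GPU memory usage. The key innovation is a novel offloading strategy, designed from observations about 3DGS's memory access pattern, that pipelines GPU computation, CPU computation, and GPU-to-CPU communication so that they overlap. This reduces performance and communication overheads and also lowers communication volume. Experiments show that the method can render a large scene requiring 100 million Gaussians on a single RTX 4090 and achieve state-of-the-art reconstruction quality.

Link: https://arxiv.org/abs/2511.04951
Authors: Hexu Zhao,Xiwen Min,Xiaoteng Liu,Moonjun Gong,Yiming Li,Ang Li,Saining Xie,Jinyang Li,Aurojit Panda
Affiliations: New York University; Pacific Northwest National Laboratory; University of Washington
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to appear in the 2026 ACM International Conference on Architectural Support for Programming Languages and Operating Systems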

Abstract:3D Gaussian Splatting (3DGS) is an increasingly popular novel view synthesis approach due to its fast rendering time and high-quality output. However, scaling 3DGS to large (or intricate) scenes is challenging due to its large memory requirement, which exceeds most GPUs’ memory capacity. In this paper, we describe CLM, a system that allows 3DGS to render large scenes using a single consumer-grade GPU, e.g., RTX4090. It does so by offloading Gaussians to CPU memory, and loading them into GPU memory only when necessary. To reduce performance and communication overheads, CLM uses a novel offloading strategy that exploits observations about 3DGS’s memory access pattern for pipelining, and thus overlaps GPU-to-CPU communication, GPU computation and CPU computation. Furthermore, we also exploit observations about the access pattern to reduce communication volume. Our evaluation shows that the resulting implementation can render a large scene that requires 100 million Gaussians on a single RTX4090 and achieve state-of-the-art reconstruction quality.

[CV-55] DeepForgeSeal: Latent Space-Driven Semi-Fragile Watermarking for Deepfake Detection Using Multi-Agent Adversarial Reinforcement Learning

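[Quick read]: This paper addresses the challenge that increasingly realistic deepfakes produced by generative AI pose to law enforcement and public trust, and in particular the problem that existing passive detection methods depend on specific forgery artifacts and therefore generalize poorly to new deepfake types. Its core solution is a proactive watermarking technique built on high-dimensional latent space representations and a Multi-Agent Adversarial Reinforcement Learning (MAARL) framework. A learnable watermark embedder operating in the latent space captures high-level image semantics and gives precise control over message encoding and extraction. The MAARL mechanism lets the watermarking agent, through dynamic interaction with an adversarial attacker agent, adaptively balance robustness (resistance to benign distortions) against fragility (sensitivity to malicious tampering), improving watermark detection performance under challenging manipulation scenarios.

Link: https://arxiv.org/abs/2511.04949
Authors: Tharindu Fernando,Clinton Fookes,Sridha Sridharan
Affiliations: Queensland University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: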

Abstract:Rapid advances in generative AI have led to increasingly realistic deepfakes, posing growing challenges for law enforcement and public trust. Existing passive deepfake detectors struggle to keep pace, largely due to their dependence on specific forgery artifacts, which limits their ability to generalize to new deepfake types. Proactive deepfake detection using watermarks has emerged to address the challenge of identifying high-quality synthetic media. However, these methods often struggle to balance robustness against benign distortions with sensitivity to malicious tampering. This paper introduces a novel deep learning framework that harnesses high-dimensional latent space representations and the Multi-Agent Adversarial Reinforcement Learning (MAARL) paradigm to develop a robust and adaptive watermarking approach. Specifically, we develop a learnable watermark embedder that operates in the latent space, capturing high-level image semantics, while offering precise control over message encoding and extraction. The MAARL paradigm empowers the learnable watermarking agent to pursue an optimal balance between robustness and fragility by interacting with a dynamic curriculum of benign and malicious image manipulations simulated by an adversarial attacker agent. Comprehensive evaluations on the CelebA and CelebA-HQ benchmarks reveal that our method consistently outperforms state-of-the-art approaches, achieving improvements of over 4.5% on CelebA and more than 5.3% on CelebA-HQ under challenging manipulation scenarios.

[CV-56] A benchmark multimodal oro-dental dataset for large vision-language models

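[Quick read]: This paper addresses the problem that the development of artificial intelligence (AI) in oral healthcare is limited by the lack of high-quality, large-scale multimodal datasets. The key to its solution is building and publicly releasing a comprehensive multimodal dataset of 8775 dental checkups, comprising 50,000 intraoral images, 8056 radiographs, and detailed textual records (such as diagnoses, treatment plans, and follow-up notes), covering patients aged 10 to 90. The data were collected under standard ethical guidelines and annotated for benchmarking. State-of-the-art vision-language models (Qwen-VL 3B and 7B) were fine-tuned on the dataset and evaluated on two tasks: classification of six oro-dental anomalies and generation of complete diagnostic reports from multimodal inputs. The fine-tuned models substantially outperformed the baselines (their base counterparts and GPT-4o), supporting the dataset's value for advancing AI in dental care.

Link: https://arxiv.org/abs/2511.04948
Authors: Haoxin Lv,Ijazul Haq,Jin Du,Jiaxin Ma,Binnian Zhu,Xiaobing Dang,Chaoan Liang,Ruxu Du,Yingjie Zhang,Muhammad Saqib
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: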

Abstract:The advancement of artificial intelligence in oral healthcare relies on the availability of large-scale multimodal datasets that capture the complexity of clinical practice. In this paper, we present a comprehensive multimodal dataset, comprising 8775 dental checkups from 4800 patients collected over eight years (2018-2025), with patients ranging from 10 to 90 years of age. The dataset includes 50000 intraoral images, 8056 radiographs, and detailed textual records, including diagnoses, treatment plans, and follow-up notes. The data were collected under standard ethical guidelines and annotated for benchmarking. To demonstrate its utility, we fine-tuned state-of-the-art large vision-language models, Qwen-VL 3B and 7B, and evaluated them on two tasks: classification of six oro-dental anomalies and generation of complete diagnostic reports from multimodal inputs. We compared the fine-tuned models with their base counterparts and GPT-4o. The fine-tuned models achieved substantial gains over these baselines, validating the dataset and underscoring its effectiveness in advancing AI-driven oro-dental healthcare solutions. The dataset is publicly available, providing an essential resource for future research in AI dentistry.

[CV-57] Learning to Restore Multi-Degraded Images via Ingredient Decoupling and Task-Aware Path Adaptation

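[Quick read]: This paper addresses the problem that real-world images often contain several coexisting degradations (such as rain, noise, and haze), whereas most existing image restoration (IR) methods are designed for a single degradation type, which limits their effectiveness in complex multi-degradation scenes. The key to its solution is an adaptive multi-degradation image restoration network, IMDNet, built from three core modules: (1) a degradation ingredient decoupling block (DIDBlock) in the encoder, which statistically separates degradation ingredients by integrating spatial and frequency domain information, improving recognition of multiple degradation types and yielding independent feature representations; (2) a fusion block (FBlock), which integrates degradation information across all levels using learnable matrices; and (3) a task adaptation block (TABlock) in the decoder, which dynamically activates or fuses functional branches based on the multi-degradation representation to select optimal restoration paths. The architecture performs well on multi-degradation restoration while remaining competitive on single-degradation tasks.

Link: https://arxiv.org/abs/2511.04920
Authors: Hu Gao,Xiaoning Lei,Ying Zhang,Xichen Xu,Guannan Jiang,Lizhuang Ma
Affiliations: Shanghai Jiao Tong University; CATL; Beijing Normal University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: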

Abstract:Image restoration (IR) aims to recover clean images from degraded observations. Despite remarkable progress, most existing methods focus on a single degradation type, which limits their practical effectiveness because real-world images often suffer from multiple coexisting degradations, such as rain, noise, and haze in a single image. In this paper, we propose an adaptive multi-degradation image restoration network that reconstructs images by leveraging decoupled representations of degradation ingredients to guide path selection. Specifically, we design a degradation ingredient decoupling block (DIDBlock) in the encoder to separate degradation ingredients statistically by integrating spatial and frequency domain information, enhancing the recognition of multiple degradation types and making their feature representations independent. In addition, we present a fusion block (FBlock) to integrate degradation information across all levels using learnable matrices. In the decoder, we further introduce a task adaptation block (TABlock) that dynamically activates or fuses functional branches based on the multi-degradation representation, flexibly selecting optimal restoration paths under diverse degradation conditions. The resulting tightly integrated architecture, termed IMDNet, is extensively validated through experiments, showing superior performance on multi-degradation restoration while maintaining strong competitiveness on single-degradation tasks.

[CV-58] Beta Distribution Learning for Reliable Roadway Crash Risk Assessment AAAI2026

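[Quick read]: This paper addresses two shortcomings: traditional traffic safety studies overlook the spatial complexity and contextual interactions of the built environment, and neural network-based risk estimators output only deterministic point estimates without quantifying uncertainty. The key to its solution is a novel geospatial deep learning framework that uses satellite imagery as a comprehensive spatial input to capture nuanced spatial patterns and embedded environmental risk factors. Rather than producing a single deterministic output, the model estimates a full Beta probability distribution over fatal crash risk, yielding accurate, uncertainty-aware predictions and making AI systems more trustworthy and useful in safety-critical settings.

Link: https://arxiv.org/abs/2511.04886
Authors: Ahmad Elallaf,Nathan Jacobs,Xinyue Ye,Mei Chen,Gongbo Liang
Affiliations: Texas A&M University-San Antonio; Washington University in St. Louis; University of Alabama; University of Kentucky
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to AAAI 2026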

Abstract:Roadway traffic accidents represent a global health crisis, responsible for over a million deaths annually and costing many countries up to 3% of their GDP. Traditional traffic safety studies often examine risk factors in isolation, overlooking the spatial complexity and contextual interactions inherent in the built environment. Furthermore, conventional neural network-based risk estimators typically generate point estimates without conveying model uncertainty, limiting their utility in critical decision-making. To address these shortcomings, we introduce a novel geospatial deep learning framework that leverages satellite imagery as a comprehensive spatial input. This approach enables the model to capture the nuanced spatial patterns and embedded environmental risk factors that contribute to fatal crash risks. Rather than producing a single deterministic output, our model estimates a full Beta probability distribution over fatal crash risk, yielding accurate and uncertainty-aware predictions, a critical feature for trustworthy AI in safety-critical applications. Our model outperforms baselines by achieving a 17-23% improvement in recall, a key metric for flagging potential dangers, while delivering superior calibration. By providing reliable and interpretable risk assessments from satellite imagery alone, our method enables safer autonomous navigation and offers a highly scalable tool for urban planners and policymakers to enhance roadway safety equitably and cost-effectively.
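To make the idea of predicting a distribution instead of a point estimate concrete, here is a minimal sketch of how two positive parameters define a Beta distribution over a risk value in (0, 1), how its mean and variance are read off, and how a negative log-likelihood could serve as a training loss. The abstract does not state the paper's parameterization or loss, so the softplus mapping, the raw output values, and the function names are assumptions.

```python
import math


def beta_log_pdf(y, alpha, beta):
    """Log-density of Beta(alpha, beta) at y in (0, 1)."""
    log_norm = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return (alpha - 1.0) * math.log(y) + (beta - 1.0) * math.log(1.0 - y) - log_norm


def beta_mean_and_variance(alpha, beta):
    """Mean and variance of Beta(alpha, beta)."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total * total * (total + 1.0))
    return mean, variance


def softplus(z):
    """Numerically stable softplus, mapping any real number to a positive one."""
    return math.log1p(math.exp(-abs(z))) + max(z, 0.0)


# Hypothetical raw outputs of a two-headed network for one satellite image tile.
raw_alpha, raw_beta = 0.3, 2.0
alpha, beta = softplus(raw_alpha), softplus(raw_beta)
mean, variance = beta_mean_and_variance(alpha, beta)
nll = -beta_log_pdf(0.2, alpha, beta)  # loss term for an observed risk of 0.2
print(f"alpha={alpha:.3f} beta={beta:.3f} mean={mean:.3f} var={variance:.4f} nll={nll:.3f}")
```

The mean plays the role of the usual point estimate, while the variance gives a per-location uncertainty that a point-estimate model cannot express.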

[CV-59] Validating Vision Transformers for Otoscopy: Performance and Data-Leakage Effects

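[Quick read]: This paper addresses low diagnostic accuracy for ear diseases, motivated by a reported 27% misdiagnosis rate among specialist otolaryngologists, and explores whether vision transformer models (such as the Swin transformer) can improve it. The key to its approach is comparing deep learning architectures (vision transformers versus a conventional convolutional neural network, ResNet) on real-world otoscopic video data, and identifying a critical data leakage issue in the preprocessing step. The leak inflated model performance in the initial experiments, and accuracy dropped significantly once it was corrected, which underlines the central role of rigorous data handling in medical AI research. The study concludes that a reliable system for assisting ear disease diagnosis requires a balance between advanced model architectures and high-quality data preprocessing.

Link: https://arxiv.org/abs/2511.04872
Authors: James Ndubuisi,Fernando Auat,Marta Vallejo
Affiliations: Heriot-Watt University; Harper University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: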

Abstract:This study evaluates the efficacy of vision transformer models, specifically Swin transformers, in enhancing the diagnostic accuracy of ear diseases compared to traditional convolutional neural networks. With a reported 27% misdiagnosis rate among specialist otolaryngologists, improving diagnostic accuracy is crucial. The research utilised a real-world dataset from the Department of Otolaryngology at the Clinical Hospital of the Universidad de Chile, comprising otoscopic videos of ear examinations depicting various middle and external ear conditions. Frames were selected based on the Laplacian and Shannon entropy thresholds, with blank frames removed. Initially, Swin v1 and Swin v2 transformer models achieved accuracies of 100% and 99.1%, respectively, marginally outperforming the ResNet model (99.5%). These results surpassed metrics reported in related studies. However, the evaluation uncovered a critical data leakage issue in the preprocessing step, affecting both this study and related research using the same raw dataset. After mitigating the data leakage, model performance decreased significantly. Corrected accuracies were 83% for both Swin v1 and Swin v2, and 82% for the ResNet model. This finding highlights the importance of rigorous data handling in machine learning studies, especially in medical applications. The findings indicate that while vision transformers show promise, it is essential to find an optimal balance between the benefits of advanced model architectures and those derived from effective data preprocessing. This balance is key to developing a reliable machine learning model for diagnosing ear diseases.
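The frame-selection step mentioned in the abstract (Laplacian and Shannon entropy thresholds, with blank frames removed) can be sketched as follows. The abstract gives neither the thresholds nor the exact Laplacian statistic, so the variance-of-Laplacian measure, the threshold values, and the function names are assumptions. Frames are plain nested lists of grey levels, at least 3 x 3 pixels, to keep the sketch dependency-free.

```python
import math


def shannon_entropy(frame):
    """Shannon entropy (in bits) of the grey-level histogram of a 2D frame."""
    counts = {}
    total = 0
    for row in frame:
        for value in row:
            counts[value] = counts.get(value, 0) + 1
            total += 1
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def laplacian_variance(frame):
    """Variance of the 4-neighbour Laplacian over interior pixels (a sharpness proxy)."""
    responses = []
    for r in range(1, len(frame) - 1):
        for c in range(1, len(frame[0]) - 1):
            responses.append(frame[r - 1][c] + frame[r + 1][c]
                             + frame[r][c - 1] + frame[r][c + 1]
                             - 4 * frame[r][c])
    mean = sum(responses) / len(responses)
    return sum((v - mean) ** 2 for v in responses) / len(responses)


def keep_frame(frame, min_entropy=0.5, min_sharpness=10.0):
    """Keep a frame only if it clears both (hypothetical) thresholds."""
    return (shannon_entropy(frame) >= min_entropy
            and laplacian_variance(frame) >= min_sharpness)


# Toy frames: a blank frame and a high-contrast checkerboard.
blank = [[0] * 4 for _ in range(4)]
checker = [[255 * ((r + c) % 2) for c in range(4)] for r in range(4)]
print(keep_frame(blank), keep_frame(checker))
```

Frame filtering of this kind only controls image quality. It does not by itself prevent leakage between training and test sets, which has to be handled when the data is split.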

[CV-60] Clinical-ComBAT: a diffusion-weighted MRI harmonization method for clinical applications

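[Quick read]: This paper addresses the difficulty of harmonizing diffusion-weighted MRI (DW-MRI) data across multiple acquisition sites in the presence of scanner-specific biases, and in particular the factors that limit the ComBAT method in clinical settings: heterogeneous populations, a non-fixed number of sites, small cohorts, and non-linear covariate relationships. The key to its solution is the Clinical-ComBAT method, which harmonizes each site independently so that new data and clinics can be added flexibly, uses a non-linear polynomial data model to capture complex relationships, performs site-specific harmonization referenced to a normative site, and adopts variance priors that adapt to small cohorts. It also includes hyperparameter tuning and a goodness-of-fit metric, making the quality of the harmonization easier to assess and the method more applicable clinically.

Link: https://arxiv.org/abs/2511.04871
Authors: Gabriel Girard,Manon Edde,Félix Dumais,Yoan David,Matthieu Dumont,Guillaume Theaud,Jean-Christophe Houde,Arnaud Boré,Maxime Descoteaux,Pierre-Marc Jodoin
Affiliations: Videos & Images Theory and Analytics Lab (VITAL), Department of Computer Science, Université de Sherbrooke, Sherbrooke (Qc), Canada; Sherbrooke Connectivity Imaging Lab (SCIL), Department of Computer Science, Université de Sherbrooke, Sherbrooke (Qc), Canada; Imeka Solutions inc, Sherbrooke (Qc), Canada
Subjects: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
Comments: 39 pages, 11 figures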

Abstract:Diffusion-weighted magnetic resonance imaging (DW-MRI) derived scalar maps are effective for assessing neurodegenerative diseases and microstructural properties of white matter in a large number of brain conditions. However, DW-MRI inherently limits the combination of data from multiple acquisition sites without harmonization to mitigate scanner-specific biases. While the widely used ComBAT method reduces site effects in research, its reliance on linear covariate relationships, homogeneous populations, fixed site numbers, and well-populated sites constrains its clinical use. To overcome these limitations, we propose Clinical-ComBAT, a method designed for real-world clinical scenarios. Clinical-ComBAT harmonizes each site independently, enabling flexibility as new data and clinics are introduced. It incorporates a non-linear polynomial data model, site-specific harmonization referenced to a normative site, and variance priors adaptable to small cohorts. It further includes hyperparameter tuning and a goodness-of-fit metric for harmonization assessment. We demonstrate its effectiveness on simulated and real data, showing improved alignment of diffusion metrics and enhanced applicability for normative modeling.
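For orientation, the standard ComBAT model that this work starts from is usually written as a location-scale model for site i, subject j, and feature v (here, a diffusion metric in a given region). The notation below follows the common formulation in the harmonization literature and is an assumption about notation, not a quotation from the paper. According to the abstract, Clinical-ComBAT replaces the linear covariate term with a polynomial one and references each site to a normative site; those details are not reproduced here.

```latex
% Standard ComBAT data model: additive site effect \gamma_{iv}, multiplicative site effect \delta_{iv}
y_{ijv} = \alpha_v + \mathbf{x}_{ij}^{\top}\boldsymbol{\beta}_v + \gamma_{iv} + \delta_{iv}\,\varepsilon_{ijv},
\qquad \varepsilon_{ijv} \sim \mathcal{N}(0, \sigma_v^2)

% Harmonized value: remove the estimated site effects, keep the covariate effects
y_{ijv}^{\mathrm{harmonized}} =
\frac{y_{ijv} - \hat{\alpha}_v - \mathbf{x}_{ij}^{\top}\hat{\boldsymbol{\beta}}_v - \hat{\gamma}_{iv}}{\hat{\delta}_{iv}}
+ \hat{\alpha}_v + \mathbf{x}_{ij}^{\top}\hat{\boldsymbol{\beta}}_v
```

The linear term in the covariates (for example, age and sex) is the "linear covariate relationship" that the abstract identifies as a limitation for clinical use.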

[CV-61] Self-Supervised Implicit Attention Priors for Point Cloud Reconstruction

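[Quick read]: This paper addresses the ill-posed problem of recovering high-quality surfaces from irregular point clouds, where accurate, detailed 3D surfaces are hard to reconstruct without strong geometric priors. The key to its solution is an implicit self-prior approach that distills a shape-specific prior directly from the input point cloud and embeds it in an implicit neural representation: a dictionary of learnable embeddings is trained jointly with an implicit distance field, and cross-attention lets the network, at each query location, capture and reuse the repeating structures and long-range correlations of the shape. The method is optimized solely with self-supervised point cloud reconstruction losses and needs no external training data. Finally, densely sampled points and their normals are integrated into a robust implicit moving least squares (RIMLS) formulation, which preserves input detail while regularizing sparse regions, improving reconstruction quality and robustness.

Link: https://arxiv.org/abs/2511.04864
Authors: Kyle Fogarty,Chenyue Cai,Jing Yang,Zhilin Guo,Cengiz Öztireli
Affiliations: University of Cambridge; University of Princeton
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at 3DV 2026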

Abstract:Recovering high-quality surfaces from an irregular point cloud is ill-posed unless strong geometric priors are available. We introduce an implicit self-prior approach that distills a shape-specific prior directly from the input point cloud itself and embeds it within an implicit neural representation. This is achieved by jointly training a small dictionary of learnable embeddings with an implicit distance field; at every query location, the field attends to the dictionary via cross-attention, enabling the network to capture and reuse repeating structures and long-range correlations inherent to the shape. Optimized solely with self-supervised point cloud reconstruction losses, our approach requires no external training data. To effectively integrate this learned prior while preserving input fidelity, the trained field is then sampled to extract densely distributed points and analytic normals via automatic differentiation. We integrate the resulting dense point cloud and corresponding normals into a robust implicit moving least squares (RIMLS) formulation. We show this hybrid strategy preserves fine geometric details in the input data, while leveraging the learned prior to regularize sparse regions. Experiments show that our method outperforms both classical and learning-based approaches in generating high-fidelity surfaces with superior detail preservation and robustness to common data degradations.

[CV-62] Geometry Denoising with Preferred Normal Vectors

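[Quick read]: This paper addresses 3D geometry denoising, where the core challenge is removing noise effectively while preserving the detail of the geometric structure. The key to its solution is a new paradigm that uses prior knowledge about the surface normal vector, given as a set of preferred normal vectors called label vectors, to guide the normal vectors during denoising. A segmentation problem is embedded in the denoising process: regions are assigned according to the similarity between the normal vector and the label vectors, and a total variation term provides regularization. The resulting optimization problem is solved with a split Bregman (ADMM) approach, in which the vertex update step is based on second-order shape calculus.

Link: https://arxiv.org/abs/2511.04848
Authors: Manuel Weiß,Lukas Baumgärtner,Roland Herzog,Stephan Schmidt
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
Comments: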

Abstract:We introduce a new paradigm for geometry denoising using prior knowledge about the surface normal vector. This prior knowledge comes in the form of a set of preferred normal vectors, which we refer to as label vectors. A segmentation problem is naturally embedded in the denoising process. The segmentation is based on the similarity of the normal vector to the elements of the set of label vectors. Regularization is achieved by a total variation term. We formulate a split Bregman (ADMM) approach to solve the resulting optimization problem. The vertex update step is based on second-order shape calculus.

[CV-63] Prompt-Based Safety Guidance Is Ineffective for Unlearned Text-to-Image Diffusion Models NEURIPS2025

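[Quick read]: This paper addresses the problem that text-to-image generative models may produce harmful content when given malicious text prompts, and specifically the finding that the two mainstream defenses, fine-tuning the model to unlearn harmful concepts and training-free guidance based on negative prompts, often yield marginal or even degraded performance when combined. The key to its solution is replacing the explicit negative prompts used in training-free methods with implicit negative embeddings obtained through concept inversion, which makes the two defenses compatible and mutually reinforcing. The method requires no modification to either approach, and it consistently improves the defense success rate on nudity and violence benchmarks while preserving the core semantics of the input prompt.

Link: https://arxiv.org/abs/2511.04834
Authors: Jiwoo Shin,Byeonghu Na,Mina Kang,Wonhyeok Choi,Il-chul Moon
Affiliations: KAIST; summary.ai
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at NeurIPS 2025 Workshop on Generative and Protective AI for Content Creation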

Abstract:Recent advances in text-to-image generative models have raised concerns about their potential to produce harmful content when provided with malicious input text prompts. To address this issue, two main approaches have emerged: (1) fine-tuning the model to unlearn harmful concepts and (2) training-free guidance methods that leverage negative prompts. However, we observe that combining these two orthogonal approaches often leads to marginal or even degraded defense performance. This observation indicates a critical incompatibility between the two paradigms, which hinders their combined effectiveness. In this work, we address this issue by proposing a conceptually simple yet experimentally robust method: replacing the negative prompts used in training-free methods with implicit negative embeddings obtained through concept inversion. Our method requires no modification to either approach and can be easily integrated into existing pipelines. We experimentally validate its effectiveness on nudity and violence benchmarks, demonstrating consistent improvements in defense success rate while preserving the core semantics of input prompts.

[CV-64] An Active Learning Pipeline for Biomedical Image Instance Segmentation with Minimal Human Intervention

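[Quick read]: This paper addresses the difficulty of training deep learning models for biomedical image segmentation when annotated data is scarce: when only raw images without labels are available, traditional methods are limited and large foundation models may underperform on specific datasets. The key to its solution is a data-centric AI workflow that combines active learning with pseudo-labeling. A foundation model first generates pseudo-labels, which are used for nnU-Net's self-configuration. A representative core-set is then selected for minimal manual annotation, enabling effective fine-tuning of the nnU-Net model. This significantly reduces annotation cost while maintaining competitive segmentation performance.

Link: https://arxiv.org/abs/2511.04811
Authors: Shuo Zhao,Yu Zhou,Jianxu Chen
Affiliations: isas.de
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 pages, 4 figures, presented at Bildverarbeitung für die Medizin (BVM) 2025, Wiesbaden, Germany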

Abstract:Biomedical image segmentation is critical for precise structure delineation and downstream analysis. Traditional methods often struggle with noisy data, while deep learning models such as U-Net have set new benchmarks in segmentation performance. nnU-Net further automates model configuration, making it adaptable across datasets without extensive tuning. However, it requires a substantial amount of annotated data for cross-validation, posing a challenge when only raw images but no labels are available. Large foundation models offer zero-shot generalizability, but may underperform on specific datasets with unique characteristics, limiting their direct use for analysis. This work addresses these bottlenecks by proposing a data-centric AI workflow that leverages active learning and pseudo-labeling to combine the strengths of traditional neural networks and large foundation models while minimizing human intervention. The pipeline starts by generating pseudo-labels from a foundation model, which are then used for nnU-Net’s self-configuration. Subsequently, a representative core-set is selected for minimal manual annotation, enabling effective fine-tuning of the nnU-Net model. This approach significantly reduces the need for manual annotations while maintaining competitive performance, providing an accessible solution for biomedical researchers to apply state-of-the-art AI techniques in their segmentation tasks. The code is available at this https URL.
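The abstract does not say how the representative core-set is chosen. One widely used option is greedy k-center (farthest-point) selection over image embeddings, sketched below as an assumption and not as the authors' procedure. The function name, the starting index, and the toy embeddings are hypothetical, and `k` is assumed not to exceed the number of samples.

```python
import math


def k_center_greedy(features, k, first_index=0):
    """Pick k indices so that every sample is close to some picked sample.

    Greedy farthest-point selection: repeatedly add the sample whose distance
    to its nearest already-selected sample is largest.
    """
    selected = [first_index]
    # Distance from every sample to its nearest selected sample so far.
    nearest = [math.dist(f, features[first_index]) for f in features]
    while len(selected) < k:
        next_index = max(range(len(features)), key=lambda i: nearest[i])
        selected.append(next_index)
        for i, f in enumerate(features):
            nearest[i] = min(nearest[i], math.dist(f, features[next_index]))
    return selected


# Toy 1-D "embeddings" forming two clusters (hypothetical data).
embeddings = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,)]
print(k_center_greedy(embeddings, k=3))
```

In a workflow like the one described, the selected indices would be the images sent for manual annotation, and all other images would keep their pseudo-labels.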

[CV-65] Data Efficiency and Transfer Robustness in Biomedical Image Segmentation: A Study of Redundancy and Forgetting with Cellpose

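[Quick read]: This paper addresses training data redundancy and the forgetting caused by cross-domain transfer in biomedical image segmentation. Its key contributions are: (1) a dataset quantization (DQ) strategy that builds compact yet diverse training subsets, with segmentation performance saturating at only 10% of the data, which greatly reduces annotation needs and increases feature diversity; and (2) a selective DQ-based replay mechanism that restores source-domain performance during cross-domain fine-tuning by replaying just 5-10% of the source data, avoiding the interference with target adaptation that full replay can cause. The study also finds that sequencing the training domains improves generalization and reduces forgetting.

Link: https://arxiv.org/abs/2511.04803
Authors: Shuo Zhao,Jianxu Chen
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to IEEE BIBM 2025 Workshop; 6 pages; 4 figures; 5 tables; IEEEtran class. Code: this https URL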

Abstract:Generalist biomedical image segmentation models such as Cellpose are increasingly applied across diverse imaging modalities and cell types. However, two critical challenges remain underexplored: (1) the extent of training data redundancy and (2) the impact of cross-domain transfer on model retention. In this study, we conduct a systematic empirical analysis of these challenges using Cellpose as a case study. First, to assess data redundancy, we propose a simple dataset quantization (DQ) strategy for constructing compact yet diverse training subsets. Experiments on the Cyto dataset show that image segmentation performance saturates with only 10% of the data, revealing substantial redundancy and potential for training with minimal annotations. Latent space analysis using MAE embeddings and t-SNE confirms that DQ-selected patches capture greater feature diversity than random sampling. Second, to examine catastrophic forgetting, we perform cross-domain finetuning experiments and observe significant degradation in source domain performance, particularly when adapting from generalist to specialist domains. We demonstrate that selective DQ-based replay, reintroducing just 5-10% of the source data, effectively restores source performance, while full replay can hinder target adaptation. Additionally, we find that training domain sequencing improves generalization and reduces forgetting in multi-stage transfer. Our findings highlight the importance of data-centric design in biomedical image segmentation and suggest that efficient training requires not only compact subsets but also retention-aware learning strategies and informed domain ordering. The code is available at this https URL.

[CV-66] 3D Gaussian Point Encoders

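[Quick read]: This paper addresses the limitations in computational efficiency and parameter count of conventional 3D point cloud recognition models such as PointNet, whose implicit representation makes efficient training and deployment difficult. The key to its solution is an explicit geometric representation, the 3D Gaussian Point Encoder, which builds per-point embeddings from mixtures of learned 3D Gaussians in place of the usual implicit network. To overcome the difficulty of end-to-end optimization, the authors use natural gradients and distillation from PointNets to learn a Gaussian basis that can reconstruct PointNet activations. They also adapt filtering techniques from 3D Gaussian Splatting, obtaining encoders that run 2.7 times faster than a comparable-accuracy PointNet while using 46% less memory and 88% fewer FLOPs, and they demonstrate the encoder's effectiveness as a component in Mamba3D.

Link: https://arxiv.org/abs/2511.04797
Authors: Jim James,Ben Wilson,Simon Lucey,James Hays
Affiliations: Georgia Tech; University of Adelaide
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures, 3 tables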

Abstract:In this work, we introduce the 3D Gaussian Point Encoder, an explicit per-point embedding built on mixtures of learned 3D Gaussians. This explicit geometric representation for 3D recognition tasks is a departure from widely used implicit representations such as PointNet. However, it is difficult to learn 3D Gaussian encoders in end-to-end fashion with standard optimizers. We develop optimization techniques based on natural gradients and distillation from PointNets to find a Gaussian Basis that can reconstruct PointNet activations. The resulting 3D Gaussian Point Encoders are faster and more parameter efficient than traditional PointNets. As in the 3D reconstruction literature where there has been considerable interest in the move from implicit (e.g., NeRF) to explicit (e.g., Gaussian Splatting) representations, we can take advantage of computational geometry heuristics to accelerate 3D Gaussian Point Encoders further. We extend filtering techniques from 3D Gaussian Splatting to construct encoders that run 2.7 times faster than a comparable-accuracy PointNet while using 46% less memory and 88% fewer FLOPs. Furthermore, we demonstrate the effectiveness of 3D Gaussian Point Encoders as a component in Mamba3D, running 1.27 times faster and achieving a reduction in memory and FLOPs by 42% and 54% respectively. 3D Gaussian Point Encoders are lightweight enough to achieve high framerates on CPU-only devices.

[CV-67] EETnet: a CNN for Gaze Detection and Tracking for Smart-Eyewear IJCNN

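[Quick read]: This paper addresses the problem that most existing eye tracking methods for event-based cameras are validated only on powerful GPUs and are hard to deploy on resource-constrained embedded devices. The key to its solution is EETnet, a convolutional neural network designed for purely event-based data that can run on microcontrollers with limited resources. The paper also outlines a methodology for training, evaluating, and quantizing the network, and proposes two versions of the architecture: a classification model that detects the pupil on a grid, and a regression model that operates at the pixel level.

Link: https://arxiv.org/abs/2511.04779
Authors: Andrea Aspesi(1 and 2),Andrea Simpsi(1),Aaron Tognoli(1),Simone Mentasti(1),Luca Merigo(2),Matteo Matteucci(1) ((1) Department of Electronics, Information and Bioengineering (DEIB) Politecnico di Milano, (2) EssilorLuxottica)
Affiliations: Politecnico di Milano; EssilorLuxottica
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: International Joint Conference on Neural Networks (IJCNN), 2025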

Abstract:Event-based cameras are becoming a popular solution for efficient, low-power eye tracking. Due to the sparse and asynchronous nature of event data, they require less processing power and offer latencies in the microsecond range. However, many existing solutions are limited to validation on powerful GPUs, with no deployment on real embedded devices. In this paper, we present EETnet, a convolutional neural network designed for eye tracking using purely event-based data, capable of running on microcontrollers with limited resources. Additionally, we outline a methodology to train, evaluate, and quantize the network using a public dataset. Finally, we propose two versions of the architecture: a classification model that detects the pupil on a grid superimposed on the original image, and a regression model that operates at the pixel level.
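The difference between the two output formats can be sketched with a small encoding example: the classification variant predicts which grid cell contains the pupil, which then decodes to a cell centre, whereas a regression variant would predict the pixel coordinates directly. The image size, grid size, and function names below are hypothetical, since the abstract does not give them.

```python
def pixel_to_cell(x, y, width, height, cols, rows):
    """Class index (row-major) of the grid cell containing pixel (x, y)."""
    col = min(int(x * cols / width), cols - 1)
    row = min(int(y * rows / height), rows - 1)
    return row * cols + col


def cell_to_pixel(cell, width, height, cols, rows):
    """Centre of a grid cell, used as the decoded pupil position."""
    row, col = divmod(cell, cols)
    return ((col + 0.5) * width / cols, (row + 0.5) * height / rows)


# Hypothetical 64 x 48 frame and 8 x 6 grid, so each cell covers 8 x 8 pixels.
label = pixel_to_cell(37.0, 21.0, width=64, height=48, cols=8, rows=6)
print(label, cell_to_pixel(label, width=64, height=48, cols=8, rows=6))
```

With a grid, the decoded position can be off by up to half a cell in each direction, which is the precision traded away for a simpler classification output.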

[CV-68] Global 3D Reconstruction of Clouds & Tropical Cyclones

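[Quick read]: This paper tackles the difficulty of accurately forecasting tropical cyclones (TCs). The core challenges are that satellite observations probing TC structure are limited and that the cloud microphysical processes affecting TC intensification are hard to resolve. The key to the solution is a framework based on a pre-training–fine-tuning paradigm that uses multi-source satellite data with global coverage to translate 2D satellite imagery into 3D maps of cloud properties. For the first time, it accurately reconstructs the 3D structure of intense storms and provides estimates in regions without observations, which extends the available observations and improves the understanding and prediction of TC intensity change.

Link: https://arxiv.org/abs/2511.04773
Authors: Shirin Ermis,Cesar Aybar,Lilli Freischem,Stella Girtsou,Kyriaki-Margarita Bintsi,Emiliano Diaz Salas-Porras,Michael Eisinger,William Jones,Anna Jungbluth,Benoit Tremblay
Affiliations: University of Oxford; Universitat de València; National Observatory of Athens; National Technical University of Athens; Harvard Medical School; Massachusetts General Hospital; European Space Agency; Environment & Climate Change Canada
Categories: Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments:

Click to view abstract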

Abstract:Accurate forecasting of tropical cyclones (TCs) remains challenging due to limited satellite observations probing TC structure and difficulties in resolving cloud properties involved in TC intensification. Recent research has demonstrated the capabilities of machine learning methods for 3D cloud reconstruction from satellite observations. However, existing approaches have been restricted to regions where TCs are uncommon, and are poorly validated for intense storms. We introduce a new framework, based on a pre-training–fine-tuning pipeline, that learns from multiple satellites with global coverage to translate 2D satellite imagery into 3D cloud maps of relevant cloud properties. We apply our model to a custom-built TC dataset to evaluate performance in the most challenging and relevant conditions. We show that we can - for the first time - create global instantaneous 3D cloud maps and accurately reconstruct the 3D structure of intense storms. Our model not only extends available satellite observations but also provides estimates when observations are missing entirely. This is crucial for advancing our understanding of TC intensification and improving forecasts.

[CV-69] DARN: Dynamic Adaptive Regularization Networks for Efficient and Robust Foundation Model Adaptation

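[Quick read]: This paper addresses the limited adaptability of foundation models (FMs) in remote-sensing image analysis. Existing adaptation methods, whether full fine-tuning or efficient frozen-backbone approaches, generally use fixed regularization strategies that cannot cope with the marked heterogeneity of satellite imagery. The key to the solution is Dynamic Adaptive Regularization Networks (DARN), built on three core innovations: (1) a lightweight Task Complexity Predictor (TCP) that estimates the difficulty of each sample; (2) Adaptive Dropout Modulation (ADM), which adjusts the dropout rate (0.1–0.5) according to the predicted complexity; and (3) Dynamic Capacity Gating (DCG), which modulates channel activation. The authors give theoretical justifications linking DARN's optimization to stationary-point convergence and its mechanism to adaptive information bottlenecks. Experiments show that DARN matches or exceeds the state of the art on the multi-task GeoBench benchmark and on Sen1Floods11, while markedly improving out-of-distribution (OOD) generalization, robustness to corruption, and minority-class performance.

Link: https://arxiv.org/abs/2511.04766
Authors: Dhenenjay Yadav,Rohan Sawai
Affiliations: AxionOrbital Space; Virginia Tech
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract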

Abstract:Foundation models (FMs) offer powerful representations for geospatial analysis, but adapting them effectively remains challenging. Standard adaptation methods, whether full fine-tuning or efficient frozen-backbone approaches, typically employ decoders with fixed regularization strategies, failing to account for the significant heterogeneity in satellite imagery. We introduce Dynamic Adaptive Regularization Networks (DARN), a novel decoder architecture designed to address this limitation. DARN integrates three key innovations: (1) a lightweight Task Complexity Predictor (TCP) that estimates per-sample difficulty, (2) Adaptive Dropout Modulation (ADM), dynamically adjusting dropout rates (from 0.1 to 0.5) based on predicted complexity, and (3) Dynamic Capacity Gating (DCG) that modulates channel activation. We provide theoretical justifications linking DARN’s optimization to stationary point convergence and its mechanism to adaptive information bottlenecks. Empirically, DARN demonstrates exceptional performance across both major adaptation paradigms. In full fine-tuning (unfrozen backbone), DARN achieves a new state-of-the-art on the multi-task GeoBench benchmark (86.66% mIoU, +5.56 pp over prior SOTA). In efficient adaptation (frozen backbone), DARN achieves SOTA-competitive accuracy (90.5% mIoU on Sen1Floods11) while delivering substantial advantages crucial for real-world deployment: superior out-of-distribution (OOD) generalization (+9.5 pp mIoU on AI4SmallFarms), enhanced robustness (17% relative reduction in corruption error), and improved performance on minority classes. DARN offers a more intelligent, robust, and efficient approach to leveraging FMs in critical geospatial applications.
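
The idea behind Adaptive Dropout Modulation can be sketched as follows. The abstract only states that dropout rates range from 0.1 to 0.5 and depend on predicted complexity; the linear mapping, the assumption that higher complexity means more dropout, and all function names below are illustrative guesses rather than the authors' implementation.

```python
import random


def adaptive_dropout_rate(complexity, p_min=0.1, p_max=0.5):
    """Map a predicted complexity score in [0, 1] to a dropout rate.

    Assumption: the mapping is linear and increasing; the paper may differ.
    """
    c = min(max(complexity, 0.0), 1.0)  # clamp out-of-range predictions
    return p_min + (p_max - p_min) * c


def apply_dropout(features, rate, rng):
    """Inverted dropout: zero each feature with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [f / keep if rng.random() < keep else 0.0 for f in features]


rng = random.Random(0)
easy = apply_dropout([1.0] * 8, adaptive_dropout_rate(0.0), rng)
hard = apply_dropout([1.0] * 8, adaptive_dropout_rate(1.0), rng)
```

In a real decoder the complexity score would come from a learned predictor and the dropout would act on feature maps per sample; plain lists are used here only to keep the sketch dependency-free.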

[CV-70] CPO: Condition Preference Optimization for Controllable Image Generation

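[Quick read]: This paper addresses the insufficient optimization of control signals across noise levels in current image generation models. In particular, ControlNet++ optimizes only low-noise timesteps (e.g., $t < 200$), ignoring the contribution of high-noise timesteps, which limits control accuracy and introduces additional approximation error. The key to the solution is Condition Preference Optimization (CPO): instead of performing preference learning directly on generated images (as in DPO), the model is trained with preferences over the control conditions themselves, i.e. a winning control signal $\mathbf{c}^w$ and a losing one $\mathbf{c}^l$. This removes confounding factors such as image quality, yields a lower-variance training objective, and significantly improves controllability across several control types (segmentation, human pose, edge, and depth maps).

Link: https://arxiv.org/abs/2511.04753
Authors: Zonglin Lyu,Ming Li,Xinxin Liu,Chen Chen
Affiliations: University of Central Florida
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract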

Abstract:To enhance controllability in text-to-image generation, ControlNet introduces image-based control signals, while ControlNet++ improves pixel-level cycle consistency between generated images and the input control signal. To avoid the prohibitive cost of back-propagating through the sampling process, ControlNet++ optimizes only low-noise timesteps (e.g., $t < 200$) using a single-step approximation, which not only ignores the contribution of high-noise timesteps but also introduces additional approximation errors. A straightforward alternative for optimizing controllability across all timesteps is Direct Preference Optimization (DPO), a fine-tuning method that increases model preference for more controllable images ($I^w$) over less controllable ones ($I^l$). However, due to uncertainty in generative models, it is difficult to ensure that win–lose image pairs differ only in controllability while keeping other factors, such as image quality, fixed. To address this, we propose performing preference learning over control conditions rather than generated images. Specifically, we construct winning and losing control signals, $\mathbf{c}^w$ and $\mathbf{c}^l$, and train the model to prefer $\mathbf{c}^w$. This method, which we term Condition Preference Optimization (CPO), eliminates confounding factors and yields a low-variance training objective. Our approach theoretically exhibits lower contrastive loss variance than DPO and empirically achieves superior results. Moreover, CPO requires less computation and storage for dataset curation. Extensive experiments show that CPO significantly improves controllability over the state-of-the-art ControlNet++ across multiple control types: over 10% error rate reduction in segmentation, 70–80% in human pose, and consistent 2–5% reductions in edge and depth maps.
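
To make "preferring a winning condition over a losing one" concrete, here is a generic DPO-style logistic preference loss applied to a pair of scores, one per control condition. It is a sketch under assumptions: `condition_score` is a hypothetical implicit reward (reference denoising error minus model denoising error), and neither it nor `beta` is claimed to match the exact CPO objective.

```python
import math


def preference_loss(score_win, score_lose, beta=1.0):
    """Logistic (Bradley-Terry) loss, -log(sigmoid(beta * (score_win - score_lose)))."""
    margin = beta * (score_win - score_lose)
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def condition_score(err_model, err_ref):
    """Hypothetical implicit reward: how far the fine-tuned model's denoising
    error under a condition falls below that of a frozen reference model."""
    return err_ref - err_model


# Same image and noise, two conditions: the winning one fits the image better.
s_win = condition_score(err_model=0.2, err_ref=0.5)
s_lose = condition_score(err_model=0.6, err_ref=0.5)
loss = preference_loss(s_win, s_lose)
```

Because both scores are computed for the same image, differences in image quality cancel out of the margin, which is the intuition the abstract gives for the lower variance.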

[CV-71] Knowledge-based anomaly detection for identifying network-induced shape artifacts

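[Quick read]: This paper addresses shape artifacts that generative models introduce into synthetic medical images, which can compromise the performance and clinical utility of machine learning models. The key to the solution is a knowledge-based anomaly detection method built as a two-stage framework: first, a novel feature extractor constructs a specialized feature space by analyzing the distribution of angle gradients along anatomical boundaries in each image; second, an isolation forest serves as the anomaly detector that identifies network-induced shape artifacts. The method is validated on two synthetic mammography datasets, with AUC values of 0.97 and 0.91, and shows reasonable agreement with human readers (Kendall-Tau correlations of 0.45 and 0.43), offering a quantifiable and interpretable way to assess the quality of synthetic data.

Link: https://arxiv.org/abs/2511.04729
Authors: Rucha Deshpande,Tahsin Rahman,Miguel Lago,Adarsh Subbaswamy,Jana G. Delfino,Ghada Zamzmi,Elim Thompson,Aldo Badano,Seyed Kahaki
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages, 11 figures

Click to view abstract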

Abstract:Synthetic data provides a promising approach to address data scarcity for training machine learning models; however, adoption without proper quality assessments may introduce artifacts, distortions, and unrealistic features that compromise model performance and clinical utility. This work introduces a novel knowledge-based anomaly detection method for detecting network-induced shape artifacts in synthetic images. The introduced method utilizes a two-stage framework comprising (i) a novel feature extractor that constructs a specialized feature space by analyzing the per-image distribution of angle gradients along anatomical boundaries, and (ii) an isolation forest-based anomaly detector. We demonstrate the effectiveness of the method for identifying network-induced shape artifacts in two synthetic mammography datasets from models trained on CSAW-M and VinDr-Mammo patient datasets respectively. Quantitative evaluation shows that the method successfully concentrates artifacts in the most anomalous partition (1st percentile), with AUC values of 0.97 (CSAW-syn) and 0.91 (VMLO-syn). In addition, a reader study involving three imaging scientists confirmed that images identified by the method as containing network-induced shape artifacts were also flagged by human readers with mean agreement rates of 66% (CSAW-syn) and 68% (VMLO-syn) for the most anomalous partition, approximately 1.5-2 times higher than the least anomalous partition. Kendall-Tau correlations between algorithmic and human rankings were 0.45 and 0.43 for the two datasets, indicating reasonable agreement despite the challenging nature of subtle artifact detection. This method is a step forward in the responsible use of synthetic data, as it allows developers to evaluate synthetic images for known anatomic constraints and pinpoint and address specific issues to improve the overall quality of a synthetic dataset.
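
The first stage, features built from angle gradients along a boundary, can be illustrated with a small sketch. It assumes the boundary is already available as a closed polygon of (x, y) points and summarizes the turning angles with three statistics; the paper's actual feature space is not specified in the abstract, so the names and statistics here are assumptions. The second stage (an isolation forest over such features) is not shown.

```python
import math


def angle_gradients(contour):
    """Turning angle at each vertex of a closed contour given as (x, y) points."""
    n = len(contour)
    headings = []
    for i in range(n):
        (x0, y0), (x1, y1) = contour[i], contour[(i + 1) % n]
        headings.append(math.atan2(y1 - y0, x1 - x0))
    grads = []
    for i in range(n):
        d = headings[i] - headings[i - 1]
        # Wrap the difference into [-pi, pi).
        grads.append((d + math.pi) % (2 * math.pi) - math.pi)
    return grads


def shape_features(contour):
    """Per-image summary of the angle-gradient distribution (illustrative choice)."""
    grads = angle_gradients(contour)
    mean = sum(grads) / len(grads)
    std = math.sqrt(sum((g - mean) ** 2 for g in grads) / len(grads))
    return {"mean": mean, "std": std, "max_abs": max(abs(g) for g in grads)}


# A smooth boundary versus the same boundary with one spiky vertex.
circle = [(math.cos(2 * math.pi * k / 36), math.sin(2 * math.pi * k / 36)) for k in range(36)]
spiked = list(circle)
spiked[10] = (1.5 * circle[10][0], 1.5 * circle[10][1])
smooth_f, spiked_f = shape_features(circle), shape_features(spiked)
```

A smooth boundary gives nearly constant turning angles, whereas a spike shows up as a large `max_abs` and `std`, which is the kind of signal an anomaly detector can isolate.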

[CV-72] IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs

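[Quick read]: This paper addresses the inadequate evaluation of vision-language models (VLMs) in culturally diverse and multilingual settings: most existing benchmarks are Western-centric and do not reflect model performance in non-Western contexts. The key to the solution is IndicVisionBench, the first large-scale multimodal benchmark centered on the Indian subcontinent. It covers English and 10 Indian languages and three tasks, Optical Character Recognition (OCR), Multimodal Machine Translation (MMT), and Visual Question Answering (VQA), with about 5K images and 37K+ QA pairs across 13 culturally grounded topics. The authors also release a parallel corpus of annotations across 10 Indic languages, a unique resource for analyzing cultural and linguistic biases in VLMs. Evaluating 8 models of different types (from closed-source systems to open-weights medium and large models) reveals substantial performance gaps in culturally diverse contexts, motivating more inclusive multimodal AI research.

Link: https://arxiv.org/abs/2511.04727
Authors: Ali Faraz,Akash,Shaharukh Khan,Raja Kolla,Akshat Patidar,Suranjan Goswami,Abhinav Ravi,Chandra Khatri,Shubham Agarwal
Affiliations: Krutrim AI; OLA Electric
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract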

Abstract:Vision-language models (VLMs) have demonstrated impressive generalization across multimodal tasks, yet most evaluation benchmarks remain Western-centric, leaving open questions about their performance in culturally diverse and multilingual settings. To address this gap, we introduce IndicVisionBench, the first large-scale benchmark centered on the Indian subcontinent. Covering English and 10 Indian languages, our benchmark spans 3 multimodal tasks, including Optical Character Recognition (OCR), Multimodal Machine Translation (MMT), and Visual Question Answering (VQA), covering 6 kinds of question types. Our final benchmark consists of a total of ~5K images and 37K+ QA pairs across 13 culturally grounded topics. In addition, we release a paired parallel corpus of annotations across 10 Indic languages, creating a unique resource for analyzing cultural and linguistic biases in VLMs. We evaluate a broad spectrum of 8 models, from proprietary closed-source systems to open-weights medium and large-scale models. Our experiments reveal substantial performance gaps, underscoring the limitations of current VLMs in culturally diverse contexts. By centering cultural diversity and multilinguality, IndicVisionBench establishes a reproducible evaluation framework that paves the way for more inclusive multimodal research.

[CV-73] Ada-FCN: Adaptive Frequency-Coupled Network for fMRI-Based Brain Disorder Classification

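[Quick read]: This paper addresses the fact that conventional resting-state fMRI approaches to brain disorder classification ignore the multi-frequency nature of neuronal oscillations: existing models treat the blood-oxygen-level-dependent (BOLD) signal as a single time series and therefore miss how disruptions of functional connectivity within specific frequency bands affect diagnostic sensitivity and specificity. The key to the solution is a novel Adaptive Frequency-Coupled Network (Ada-FCN) with two core components: Adaptive Cascade Decomposition, which learns task-relevant frequency sub-bands for each brain region, and Frequency-Coupled Connectivity Learning, which models fine-grained intra-band and cross-band interactions in a unified network. A message-passing mechanism in a Unified-GCN then produces refined node representations that improve diagnostic prediction.

Link: https://arxiv.org/abs/2511.04718
Authors: Yue Xun,Jiaxing Xu,Wenbo Gao,Chen Yang,Shujun Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 2 figures, conference

Click to view abstract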

Abstract:Resting-state fMRI has become a valuable tool for classifying brain disorders and constructing brain functional connectivity networks by tracking BOLD signals across brain regions. However, existing models largely neglect the multi-frequency nature of neuronal oscillations, treating BOLD signals as monolithic time series. This overlooks the crucial fact that neurological disorders often manifest as disruptions within specific frequency bands, limiting diagnostic sensitivity and specificity. While some methods have attempted to incorporate frequency information, they often rely on predefined frequency bands, which may not be optimal for capturing individual variability or disease-specific alterations. To address this, we propose a novel framework featuring Adaptive Cascade Decomposition to learn task-relevant frequency sub-bands for each brain region and Frequency-Coupled Connectivity Learning to capture both intra- and nuanced cross-band interactions in a unified functional network. This unified network informs a novel message-passing mechanism within our Unified-GCN, generating refined node representations for diagnostic prediction. Experimental results on the ADNI and ABIDE datasets demonstrate superior performance over existing methods. The code is available at this https URL.
Journal reference: Medical Image Computing and Computer Assisted Intervention, MICCAI 2025. Lecture Notes in Computer Science, vol 15971. Springer, Cham. Related DOI: https://doi.org/10.1007/978-3-032-05162-2_4

[CV-74] PySlyde: A Lightweight Open-Source Toolkit for Pathology Preprocessing

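[Quick read]: This paper addresses the fragmented and inconsistent preprocessing workflows for digitised whole-slide images (WSIs) in pathology, which hinder their standardised use in AI-driven precision medicine. The key to the solution is PySlyde, a lightweight, open-source Python toolkit built on OpenSlide. It unifies core steps such as tissue detection, tiling, stain normalisation, and annotation parsing behind an intuitive API, improving the reproducibility and efficiency of WSI preprocessing and accelerating the creation of AI-ready datasets suitable for modern pathology foundation models.

Link: https://arxiv.org/abs/2511.05183
Authors: Gregory Verghese,Anthony Baptista,Chima Eke,Holly Rafique,Mengyuan Li,Fathima Mohamed,Ananya Bhalla,Lucy Ryan,Michael Pitcher,Enrico Parisini,Concetta Piazzese,Liz Ing-Simmons,Anita Grigoriadis
Affiliations: King’s College London
Categories: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Click to view abstract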

Abstract:The integration of artificial intelligence (AI) into pathology is advancing precision medicine by improving diagnosis, treatment planning, and patient outcomes. Digitised whole-slide images (WSIs) capture rich spatial and morphological information vital for understanding disease biology, yet their gigapixel scale and variability pose major challenges for standardisation and analysis. Robust preprocessing, covering tissue detection, tessellation, stain normalisation, and annotation parsing, is critical but often limited by fragmented and inconsistent workflows. We present PySlyde, a lightweight, open-source Python toolkit built on OpenSlide to simplify and standardise WSI preprocessing. PySlyde provides an intuitive API for slide loading, annotation management, tissue detection, tiling, and feature extraction, compatible with modern pathology foundation models. By unifying these processes, it streamlines WSI preprocessing, enhances reproducibility, and accelerates the generation of AI-ready datasets, enabling researchers to focus on model development and downstream analysis.
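
As a rough picture of what the tiling and tissue-detection steps involve, the sketch below generates tile coordinates over a slide-sized grid and keeps only tiles that contain enough tissue according to a binary mask. It is deliberately library-free and does not use or reproduce PySlyde's API; `tile_coordinates`, `tissue_tiles`, and the 0.5 threshold are hypothetical names and values chosen for illustration.

```python
def tile_coordinates(width, height, tile_size, stride=None):
    """Top-left (x, y) corners of all tiles that fit fully inside the image."""
    stride = stride or tile_size  # non-overlapping tiles by default
    return [(x, y)
            for y in range(0, height - tile_size + 1, stride)
            for x in range(0, width - tile_size + 1, stride)]


def tissue_tiles(mask, tile_size, min_fraction=0.5):
    """Keep tiles whose tissue fraction in a 0/1 mask is at least `min_fraction`."""
    height, width = len(mask), len(mask[0])
    kept = []
    for x, y in tile_coordinates(width, height, tile_size):
        tissue = sum(mask[r][c]
                     for r in range(y, y + tile_size)
                     for c in range(x, x + tile_size))
        if tissue / (tile_size * tile_size) >= min_fraction:
            kept.append((x, y))
    return kept


# Toy 4x4 mask: tissue on the left half only.
mask = [[1, 1, 0, 0] for _ in range(4)]
kept = tissue_tiles(mask, tile_size=2)
```

On a real gigapixel slide the mask would typically be computed at a low-resolution pyramid level and the kept coordinates scaled back up before reading full-resolution tiles.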

[CV-75] UHDRes: Ultra-High-Definition Image Restoration via Dual-Domain Decoupled Spectral Modulation

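[Quick read]: This paper addresses the restoration of ultra-high-definition (UHD) images degraded by blur, haze, rain, or low light, with a focus on the high computational cost and resource consumption that high resolution entails. The key to the solution is UHDRes, a lightweight dual-domain decoupled spectral modulation framework that explicitly models the amplitude spectrum through frequency-domain modulation while implicitly restoring phase information through spatial-domain refinement. Specifically, a spatio-spectral fusion mechanism extracts spatial features with a multi-scale context aggregator and then enhances amplitude features in the frequency domain in a decoupled manner, while phase is restored implicitly through spatial refinement. A shared gated feed-forward network improves the efficiency of feature interaction. With only 400K parameters, the method achieves state-of-the-art performance while significantly reducing inference latency and memory usage.

Link: https://arxiv.org/abs/2511.05009
Authors: S. Zhao(1),W. Lu(1 and 2),B. Wang(1),T. Wang(3),K. Zhang(4),H. Zhao(1) ((1) College of Computer Science and Artificial Intelligence, Wenzhou University, Wenzhou, China, (2) Nasdaq, St. John’s, Canada, (3) vivo Mobile Communication Co., Ltd, Shanghai, China, (4) College of Engineering and Computer Science, Australian National University, Australia)
Affiliations: Wenzhou University; Nanjing University; Australian National University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract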

Abstract:Ultra-high-definition (UHD) images often suffer from severe degradations such as blur, haze, rain, or low-light conditions, which pose significant challenges for image restoration due to their high resolution and computational demands. In this paper, we propose UHDRes, a novel lightweight dual-domain decoupled spectral modulation framework for UHD image restoration. It explicitly models the amplitude spectrum via lightweight spectrum-domain modulation, while restoring phase implicitly through spatial-domain refinement. We introduce the spatio-spectral fusion mechanism, which first employs a multi-scale context aggregator to extract local and global spatial features, and then performs spectral modulation in a decoupled manner. It explicitly enhances amplitude features in the frequency domain while implicitly restoring phase information through spatial refinement. Additionally, a shared gated feed-forward network is designed to efficiently promote feature interaction through shared-parameter convolutions and adaptive gating mechanisms. Extensive experimental comparisons on five public UHD benchmarks demonstrate that our UHDRes achieves the state-of-the-art restoration performance with only 400K parameters, while significantly reducing inference latency and memory usage. The codes and models are available at this https URL.
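
The amplitude/phase decoupling can be illustrated on a 1D signal: take a discrete Fourier transform, rescale only the amplitude of each coefficient, keep the phase untouched, and transform back. This is a toy sketch with a naive DFT and fixed per-frequency gains; in UHDRes the modulation is learned and operates on 2D feature maps, and the names below are not from the paper.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform (O(n^2), fine for a toy example)."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of `dft`."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def modulate_amplitude(signal, gains):
    """Scale each frequency's amplitude by a gain while leaving its phase unchanged."""
    modulated = []
    for coeff, gain in zip(dft(signal), gains):
        amplitude, phase = abs(coeff), cmath.phase(coeff)
        modulated.append(cmath.rect(gain * amplitude, phase))
    # The real part is returned; the result is exactly real only when the
    # gains are symmetric (gains[k] == gains[n - k]), as they are here.
    return [v.real for v in idft(modulated)]


signal = [1.0, 2.0, 0.5, -1.0]
unchanged = modulate_amplitude(signal, [1.0] * 4)
doubled = modulate_amplitude(signal, [2.0] * 4)
```

Amplitude largely carries contrast and illumination-like information while phase carries structure, which is a common motivation for treating the two differently in restoration models.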

[CV-76] LG-NuSegHop: A Local-to-Global Self-Supervised Pipeline For Nuclei Instance Segmentation

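[Quick read]: This paper addresses the automation of nuclei segmentation in histology images, a foundational task for pathological diagnosis. Nuclei vary greatly across organ tissues and acquisition conditions, and annotations are expensive to obtain, so deep learning models struggle to generalize to unseen organs or domains. The key to the solution is Local-to-Global NuSegHop (LG-NuSegHop), a self-supervised pipeline that needs neither manually annotated data nor domain adaptation. It consists of three modules: (1) local processing operations, based on prior knowledge, that generate a pseudolabel; (2) NuSegHop, a novel data-driven feature extraction model; and (3) global operations that post-process the predictions. On several public datasets the method outperforms other self-supervised and weakly supervised methods and is competitive with fully supervised ones, and every module is explainable, which helps physicians understand and trust it.

Link: https://arxiv.org/abs/2511.04892
Authors: Vasileios Magoulianitis,Catherine A. Alexander,Jiaxin Yang,C.-C. Jay Kuo
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Biomolecules (q-bio.BM)
Comments: 42 pages, 8 figures, 7 tables

Click to view abstract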

Abstract:Nuclei segmentation is the cornerstone task in histology image reading, shedding light on the underlying molecular patterns and leading to disease or cancer diagnosis. Yet, it is a laborious task that requires expertise from trained physicians. The large nuclei variability across different organ tissues and acquisition processes challenges the automation of this task. On the other hand, data annotations are expensive to obtain, and thus, Deep Learning (DL) models are challenged to generalize to unseen organs or different domains. This work proposes Local-to-Global NuSegHop (LG-NuSegHop), a self-supervised pipeline developed on prior knowledge of the problem and molecular biology. There are three distinct modules: (1) a set of local processing operations to generate a pseudolabel, (2) NuSegHop, a novel data-driven feature extraction model, and (3) a set of global operations to post-process the predictions of NuSegHop. Notably, even though the proposed pipeline uses no manually annotated training data or domain adaptation, it maintains a good generalization performance on other datasets. Experiments on three publicly available datasets show that our method outperforms other self-supervised and weakly supervised methods while having a competitive standing among fully supervised methods. Remarkably, every module within LG-NuSegHop is transparent and explainable to physicians.

Artificial Intelligence

[AI-0] DGTN: Graph-Enhanced Transformer with Diffusive Attention Gating Mechanism for Enzyme DDG Prediction

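[Quick read]: This paper addresses the prediction of how amino acid mutations affect enzyme thermodynamic stability (ΔΔG), a core challenge in protein engineering and drug design. Existing deep learning methods usually process sequence and structure information independently and so fail to capture the intricate coupling between local structural geometry and global sequence patterns. The key to the solution is DGTN (Diffused Graph-Transformer Network), a new architecture whose core innovation is a learnable diffusion mechanism for co-learning a graph neural network (GNN) and transformer attention: on one side, GNN-derived structural embeddings guide transformer attention through learnable diffusion kernels; on the other, transformer representations refine GNN message passing through attention-modulated graph updates. This bidirectional diffusion achieves state-of-the-art results on the ProTherm and SKEMPI benchmarks (Pearson Rho = 0.87, RMSE = 1.21 kcal/mol), and the theoretical analysis shows convergence to the optimal structure-sequence coupling at a rate of O(1/√T), where T is the number of diffusion steps.

Link: https://arxiv.org/abs/2511.05483
Authors: Abigail Lin
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract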

Abstract:Predicting the effect of amino acid mutations on enzyme thermodynamic stability (DDG) is fundamental to protein engineering and drug design. While recent deep learning approaches have shown promise, they often process sequence and structure information independently, failing to capture the intricate coupling between local structural geometry and global sequential patterns. We present DGTN (Diffused Graph-Transformer Network), a novel architecture that co-learns graph neural network (GNN) weights for structural priors and transformer attention through a diffusion mechanism. Our key innovation is a bidirectional diffusion process where: (1) GNN-derived structural embeddings guide transformer attention via learnable diffusion kernels, and (2) transformer representations refine GNN message passing through attention-modulated graph updates. We provide rigorous mathematical analysis showing this co-learning scheme achieves provably better approximation bounds than independent processing. On ProTherm and SKEMPI benchmarks, DGTN achieves state-of-the-art performance (Pearson Rho = 0.87, RMSE = 1.21 kcal/mol), with 6.2% improvement over best baselines. Ablation studies confirm the diffusion mechanism contributes 4.8 points to correlation. Our theoretical analysis proves the diffused attention converges to optimal structure-sequence coupling, with convergence rate $O(1/\sqrt{T})$, where $T$ is the number of diffusion steps. This work establishes a principled framework for integrating heterogeneous protein representations through learnable diffusion.

[AI-1] AI Literacy Assessment Revisited: A Task-Oriented Approach Aligned with Real-world Occupations

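[Quick read]: This paper addresses the disconnect between current AI literacy assessments and real occupational use: existing assessments focus on foundational technical knowledge (programming, mathematics, statistics) and neglect the practical skills needed to use AI tools at work, such as interpreting model outputs, selecting tools, and identifying ethical risks. The key to the solution is a work-task-oriented assessment model for AI literacy, together with an assessment instrument grounded in concrete job tasks. In a US Navy robotics training program, a competition scenario task outperformed traditional tests, showing that assessments emphasizing highly contextualized practical skills better measure the applied ability of people without STEM backgrounds in AI-integrated roles.

Link: https://arxiv.org/abs/2511.05475
Authors: Christopher Bogart,Aparna Warrier,Arav Agarwal,Ross Higashi,Yufan Zhang,Jesse Flot,Jaromir Savelka,Heather Burte,Majd Sakr
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract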

Abstract:As artificial intelligence (AI) systems become ubiquitous in professional contexts, there is an urgent need to equip workers, often with backgrounds outside of STEM, with the skills to use these tools effectively as well as responsibly, that is, to be AI literate. However, prevailing definitions and therefore assessments of AI literacy often emphasize foundational technical knowledge, such as programming, mathematics, and statistics, over practical knowledge such as interpreting model outputs, selecting tools, or identifying ethical concerns. This leaves a noticeable gap in assessing someone’s AI literacy for real-world job use. We propose a work-task-oriented assessment model for AI literacy which is grounded in the competencies required for effective use of AI tools in professional settings. We describe the development of a novel AI literacy assessment instrument, and accompanying formative assessments, in the context of a US Navy robotics training program. The program included training in robotics and AI literacy, as well as a competition with practical tasks and a multiple choice scenario task meant to simulate use of AI in a job setting. We found that, as a measure of applied AI literacy, the competition’s scenario task outperformed the tests we adopted from past research or developed ourselves. We argue that when training people for AI-related work, educators should consider evaluating them with instruments that emphasize highly contextualized practical skills rather than abstract technical knowledge, especially when preparing workers without technical backgrounds for AI-integrated roles.

[AI-2] SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models

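[Quick read]: This paper addresses three limitations in how large language models (LLMs) are currently evaluated for software engineering: narrow task coverage, language bias, and a disconnect from real developer workflows. Existing benchmarks mostly focus on algorithmic problems or Python-centric bug fixing and neglect other key dimensions of software engineering. In response, the authors propose SWE-Compass, whose core contribution is a unified, production-aligned, structured evaluation framework covering 8 task types, 8 programming scenarios, and 10 programming languages, with 2000 high-quality instances curated from real GitHub pull requests. Systematic filtering and validation ensure the data are authentic. Benchmarking 10 state-of-the-art LLMs under two agentic frameworks (SWE-Agent and Claude Code) reveals a clear hierarchy of difficulty across task types, languages, and scenarios, providing a reproducible and rigorous basis for diagnosing and improving the agentic coding abilities of LLMs.

Link: https://arxiv.org/abs/2511.05459
Authors: Jingxuan Xu,Ken Deng,Weihao Li,Songwei Yu,Huaixi Tang,Haoyang Huang,Zhiyi Lai,Zizheng Zhan,Yanan Wu,Chenchen Zhang,Kepeng Lei,Yifan Yao,Xinping Lei,Wenqiang Zhu,Zongxian Feng,Han Li,Junqi Xiong,Dailin Li,Zuchen Gao,Kun Wu,Wen Xiang,Ziqi Zhan,Yuanxing Zhang,Wuxuan Gong,Ziyuan Gao,Guanxiang Wang,Yirong Xue,Xiaojiang Zhang,Jinghui Wang,Huiming Wang,Wenhao Zhuang,Zhaoxiang Zhang,Yuqun Zhang,Haotian Zhang,Bin Chen,Jiaheng Liu
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract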

Abstract:Evaluating large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with real-world developer workflows. Existing benchmarks often focus on algorithmic problems or Python-centric bug fixing, leaving critical dimensions of software engineering underexplored. To address these gaps, we introduce SWE-Compass, a comprehensive benchmark that unifies heterogeneous code-related evaluations into a structured and production-aligned framework. SWE-Compass spans 8 task types, 8 programming scenarios, and 10 programming languages, with 2000 high-quality instances curated from authentic GitHub pull requests and refined through systematic filtering and validation. We benchmark ten state-of-the-art LLMs under two agentic frameworks, SWE-Agent and Claude Code, revealing a clear hierarchy of difficulty across task types, languages, and scenarios. Moreover, by aligning evaluation with real-world developer practices, SWE-Compass provides a rigorous and reproducible foundation for diagnosing and advancing agentic coding capabilities in large language models.

[AI-3] “I Like That You Have to Poke Around”: Instructors on How Experiential Approaches to AI Literacy Spark Inquiry and Critical Thinking

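[Quick read]: This paper addresses the lack of accessible AI education for learners without STEM backgrounds: most existing approaches rely on programming tools or abstract lectures and struggle to reach a broad audience. The key to the solution is the design and evaluation of AI User, a modular, web-based, no-code curriculum in which interactive projects grounded in real-world scenarios (natural language processing, computer vision, and so on) let learners experience core AI concepts without programming. The study focuses on Projects 5–8 and applies thematic analysis to focus-group feedback from 15 community college instructors. It highlights the curriculum's strengths in exploratory learning, role-based simulation, and real-world relevance, identifies design trade-offs around cognitive load, guidance, and adaptability to diverse learners, and offers empirical evidence and practical guidance for building inclusive, cross-disciplinary AI literacy education.

Link: https://arxiv.org/abs/2511.05430
Authors: Aparna Maya Warrier,Arav Agarwal,Jaromir Savelka,Christopher Bogart,Heather Burte
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract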

Abstract:As artificial intelligence (AI) increasingly shapes decision-making across domains, there is a growing need to support AI literacy among learners beyond computer science. However, many current approaches rely on programming-heavy tools or abstract lecture-based content, limiting accessibility for non-STEM audiences. This paper presents findings from a study of AI User, a modular, web-based curriculum that teaches core AI concepts through interactive, no-code projects grounded in real-world scenarios. The curriculum includes eight projects; this study focuses on instructor feedback on Projects 5-8, which address applied topics such as natural language processing, computer vision, decision support, and responsible AI. Fifteen community college instructors participated in structured focus groups, completing the projects as learners and providing feedback through individual reflection and group discussion. Using thematic analysis, we examined how instructors evaluated the design, instructional value, and classroom applicability of these experiential activities. Findings highlight instructors’ appreciation for exploratory tasks, role-based simulations, and real-world relevance, while also surfacing design trade-offs around cognitive load, guidance, and adaptability for diverse learners. This work extends prior research on AI literacy by centering instructor perspectives on teaching complex AI topics without code. It offers actionable insights for designing inclusive, experiential AI learning resources that scale across disciplines and learner backgrounds.

[AI-4] ProDER: A Continual Learning Approach for Fault Prediction in Evolving Smart Grids

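[Quick read]: This paper addresses the reliability of fault prediction models for smart grids in evolving environments: existing AI-based models struggle with the continual-learning challenge posed by new fault types and changing operational zones. The key to the solution is a continual learning (CL) framework tailored to smart grids, together with Prototype-based Dark Experience Replay (ProDER), which combines prototype-based feature regularization, logit distillation, and a prototype-guided replay memory. Across class-incremental and domain-incremental scenarios, ProDER predicts fault type and fault zone with accuracy drops of only 0.045 and 0.015, respectively, which supports the scalability and adaptability of such models in practice.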

Link: https://arxiv.org/abs/2511.05420
Authors: Emad Efatinasab, Nahal Azadi, Davide Dalle Pezze, Gian Antonio Susto, Chuadhry Mujeeb Ahmed, Mirco Rampazzo
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:As smart grids evolve to meet growing energy demands and modern operational challenges, the ability to accurately predict faults becomes increasingly critical. However, existing AI-based fault prediction models struggle to ensure reliability in evolving environments where they are required to adapt to new fault types and operational zones. In this paper, we propose a continual learning (CL) framework in the smart grid context to evolve the model together with the environment. We design four realistic evaluation scenarios grounded in class-incremental and domain-incremental learning to emulate evolving grid conditions. We further introduce Prototype-based Dark Experience Replay (ProDER), a unified replay-based approach that integrates prototype-based feature regularization, logit distillation, and a prototype-guided replay memory. ProDER achieves the best performance among tested CL techniques, with only a 0.045 accuracy drop for fault type prediction and 0.015 for fault zone prediction. These results demonstrate the practicality of CL for scalable, real-world fault prediction in smart grids.
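
The three ingredients named in the abstract (prototype-based feature regularization, logit distillation, and replay) can be illustrated with a small replay-style loss. This is only a sketch under stated assumptions: the function names, the weights `alpha` and `beta`, and the use of mean squared error for both auxiliary terms are hypothetical choices for illustration, not the loss defined in the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Cross-entropy of one sample against an integer class label."""
    return -math.log(softmax(logits)[label])

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def class_prototypes(features, labels):
    """Prototype of each class = mean feature vector of its samples."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(f))
        for i, v in enumerate(f):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def replay_loss(new_logits, label, replay_logits_now, replay_logits_stored,
                replay_feature, prototype, alpha=0.5, beta=0.1):
    """Hypothetical combined loss (assumed form, not the paper's):
    task loss on new data + logit distillation on a replayed sample
    + pull of the replayed feature toward its stored class prototype."""
    task = cross_entropy(new_logits, label)
    distill = mse(replay_logits_now, replay_logits_stored)
    proto = mse(replay_feature, prototype)
    return task + alpha * distill + beta * proto
```

When the model still reproduces the stored logits and the replayed feature sits on its prototype, both auxiliary terms vanish and only the task loss remains, which is the intuition behind using them as anti-forgetting regularizers.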

[AI-5] Robust Neural Audio Fingerprinting using Music Foundation Models

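[Quick read]: This paper addresses the prevalence of distorted, compressed, and manipulated music on modern media platforms such as TikTok, aiming to make audio fingerprinting robust enough to identify the source of a recording under these conditions. The solution has two key parts: (1) a pretrained music foundation model (e.g., MuQ, MERT) serves as the backbone of the neural architecture, exploiting its learned representation of musical features; (2) data augmentation is extended to cover time stretching, pitch modulation, compression, and filtering, so that the model produces stable fingerprints under diverse perturbations. Experiments show that fingerprints extracted with music foundation models consistently outperform models trained from scratch or pretrained on non-musical audio, and that they localize matches accurately at the segment level, which matters in practice for catalog management.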

Link: https://arxiv.org/abs/2511.05399
Authors: Shubhr Singh, Kiran Bhat, Xavier Riley, Benjamin Resnick, John Thickstun, Walter De Brouwer
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:

Abstract: The proliferation of distorted, compressed, and manipulated music on modern media platforms like TikTok motivates the development of more robust audio fingerprinting techniques to identify the sources of musical recordings. In this paper, we develop and evaluate new neural audio fingerprinting techniques with the aim of improving their robustness. We make two contributions to neural fingerprinting methodology: (1) we use a pretrained music foundation model as the backbone of the neural architecture and (2) we expand the use of data augmentation to train fingerprinting models under a wide variety of audio manipulations, including time stretching, pitch modulation, compression, and filtering. We systematically evaluate our methods in comparison to two state-of-the-art neural fingerprinting models: NAFP and GraFPrint. Results show that fingerprints extracted with music foundation models (e.g., MuQ, MERT) consistently outperform models trained from scratch or pretrained on non-musical audio. Segment-level evaluation further reveals their capability to accurately localize fingerprint matches, an important practical feature for catalog management.

[AI-6] Sample Complexity of Distributionally Robust Off-Dynamics Reinforcement Learning with Online Interaction ICML2025

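[Quick read]: This paper studies online learning in off-dynamics reinforcement learning, where the transition dynamics of the training and deployment environments differ and the agent must learn efficiently from online interaction alone. Prior work usually assumes a generative model or a pre-collected dataset with good coverage, which sidesteps the exploration challenge; this paper considers the more realistic setting in which the agent can gather data only by interacting online with the training environment. The key is a new theoretical quantity, the supremal visitation ratio, which measures the mismatch between training and deployment dynamics; the authors show that if this ratio is unbounded, online learning becomes exponentially hard. They then propose the first computationally efficient algorithm that achieves sublinear regret under f-divergence-based transition uncertainty, and matching lower bounds show that its dependence on the supremal visitation ratio and on the number of interaction episodes is optimal.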

Link: https://arxiv.org/abs/2511.05396
Authors: Yiting He, Zhishuai Liu, Weixin Wang, Pan Xu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Machine Learning (stat.ML)
Comments: 53 pages, 6 figures, 3 tables. Published in Proceedings of the 42nd International Conference on Machine Learning (ICML 2025)

Abstract: Off-dynamics reinforcement learning (RL), where training and deployment transition dynamics are different, can be formulated as learning in a robust Markov decision process (RMDP) where uncertainties in transition dynamics are imposed. Existing literature mostly assumes access to generative models allowing arbitrary state-action queries or pre-collected datasets with a good state coverage of the deployment environment, bypassing the challenge of exploration. In this work, we study a more realistic and challenging setting where the agent is limited to online interaction with the training environment. To capture the intrinsic difficulty of exploration in online RMDPs, we introduce the supremal visitation ratio, a novel quantity that measures the mismatch between the training dynamics and the deployment dynamics. We show that if this ratio is unbounded, online learning becomes exponentially hard. We propose the first computationally efficient algorithm that achieves sublinear regret in online RMDPs with f-divergence-based transition uncertainties. We also establish matching regret lower bounds, demonstrating that our algorithm achieves optimal dependence on both the supremal visitation ratio and the number of interaction episodes. Finally, we validate our theoretical results through comprehensive numerical experiments.
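
For readers unfamiliar with robust MDPs, the textbook constrained form of the robust Bellman optimality equation over an f-divergence ball is sketched below. The notation (horizon index $h$, radius $\rho$, nominal kernel $P_h^{o}$) is assumed here for illustration, and the paper may work with a different but related formulation, for example a regularized rather than constrained uncertainty set.

```latex
\begin{align*}
V_h^{\star}(s) &= \max_{a \in \mathcal{A}} \Big\{ r_h(s,a)
  + \inf_{P \in \mathcal{U}_h^{\rho}(s,a)} \mathbb{E}_{s' \sim P}\big[ V_{h+1}^{\star}(s') \big] \Big\}, \\
\mathcal{U}_h^{\rho}(s,a) &= \Big\{ P \in \Delta(\mathcal{S}) :
  D_f\big(P \,\|\, P_h^{o}(\cdot \mid s,a)\big) \le \rho \Big\}, \\
D_f(P \,\|\, Q) &= \sum_{s'} Q(s')\, f\!\left( \frac{P(s')}{Q(s')} \right).
\end{align*}
```

The inner infimum is what makes the policy pessimistic: the value of each action is evaluated under the worst transition kernel within divergence $\rho$ of the training dynamics.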

[AI-7] TeaRAG: A Token-Efficient Agentic Retrieval-Augmented Generation Framework

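[Quick read]: This paper addresses the accuracy-efficiency trade-off in agentic Retrieval-Augmented Generation (RAG): recent methods improve accuracy through reinforcement learning, but multi-round retrieval and reasoning incur substantial token overhead. The proposed framework, TeaRAG, has two core components. (1) It compresses retrieved content by combining chunk-based semantic retrieval with triplet-based graph retrieval, building a knowledge association graph, and applying Personalized PageRank to extract the key knowledge. (2) It reduces reasoning steps with Iterative Process-aware Direct Preference Optimization (IP-DPO), whose reward uses a knowledge matching mechanism to assess knowledge sufficiency and penalizes redundant reasoning steps. Across six datasets, TeaRAG raises average Exact Match by 4% and 2% while cutting output tokens by 61% and 59% on Llama3-8B-Instruct and Qwen2.5-14B-Instruct, respectively.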

Link: https://arxiv.org/abs/2511.05385
Authors: Chao Zhang, Yuhao Wang, Derong Xu, Haoxin Zhang, Yuanjie Lyu, Yuhao Chen, Shuochen Liu, Tong Xu, Xiangyu Zhao, Yan Gao, Yao Hu, Enhong Chen
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 32 pages

Abstract:Retrieval-Augmented Generation (RAG) utilizes external knowledge to augment Large Language Models’ (LLMs) reliability. For flexibility, agentic RAG employs autonomous, multi-round retrieval and reasoning to resolve queries. Although recent agentic RAG has improved via reinforcement learning, they often incur substantial token overhead from search and reasoning processes. This trade-off prioritizes accuracy over efficiency. To address this issue, this work proposes TeaRAG, a token-efficient agentic RAG framework capable of compressing both retrieval content and reasoning steps. 1) First, the retrieved content is compressed by augmenting chunk-based semantic retrieval with a graph retrieval using concise triplets. A knowledge association graph is then built from semantic similarity and co-occurrence. Finally, Personalized PageRank is leveraged to highlight key knowledge within this graph, reducing the number of tokens per retrieval. 2) Besides, to reduce reasoning steps, Iterative Process-aware Direct Preference Optimization (IP-DPO) is proposed. Specifically, our reward function evaluates the knowledge sufficiency by a knowledge matching mechanism, while penalizing excessive reasoning steps. This design can produce high-quality preference-pair datasets, supporting iterative DPO to improve reasoning conciseness. Across six datasets, TeaRAG improves the average Exact Match by 4% and 2% while reducing output tokens by 61% and 59% on Llama3-8B-Instruct and Qwen2.5-14B-Instruct, respectively. Code is available at this https URL.
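
The abstract's use of Personalized PageRank to highlight key knowledge in a graph can be illustrated with a plain power-iteration sketch. The graph layout (a dict of weighted out-edges), the seed set, and the dangling-node rule are assumptions made for this example; how TeaRAG actually builds and scores its knowledge association graph is not specified here.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iters=100):
    """Power-iteration Personalized PageRank.

    graph: dict mapping node -> {neighbor: edge_weight}; every neighbor
           must also appear as a key (assumed input format).
    seeds: set of nodes that receive the restart probability, e.g. the
           nodes matched by the query (hypothetical usage).
    """
    nodes = list(graph)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            if not out:
                # Dangling node: send its mass back to the seeds.
                for m in nodes:
                    nxt[m] += damping * rank[n] * restart[m]
                continue
            total = sum(out.values())
            for m, w in out.items():
                nxt[m] += damping * rank[n] * w / total
        rank = nxt
    return rank

# Toy graph: "q" is the query node, strongly tied to "a", weakly to "b".
toy_graph = {
    "q": {"a": 1.0, "b": 0.2},
    "a": {"q": 1.0},
    "b": {"q": 0.2, "c": 1.0},
    "c": {"b": 1.0},
}
toy_rank = personalized_pagerank(toy_graph, seeds={"q"})
top_nodes = sorted(toy_rank, key=toy_rank.get, reverse=True)
```

Keeping only the highest-ranked nodes is one way such a score could shrink the retrieved context: knowledge weakly connected to the query receives little mass and can be dropped.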

[AI-8] Reasoning Is All You Need for Urban Planning AI AAAI2026

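[Quick read]: This paper addresses the fact that AI in urban planning is currently limited to statistical learning models, which cannot support transparent reasoning or value-aligned decision-making. Existing methods learn patterns from data and predict future conditions, but when recommending sites, allocating resources, or weighing trade-offs they lack explainability, guaranteed rule compliance, and ethical value considerations, and so fall short of the needs of complex urban governance. The proposed Agentic Urban Planning AI Framework combines three cognitive layers (Perception, Foundation, Reasoning) with six logic components (Analysis, Generation, Verification, Evaluation, Collaboration, Decision) through multi-agent collaboration. This gives AI agents explicit reasoning capabilities that are value-based (applying normative principles), rule-grounded (guaranteeing constraint satisfaction), and explainable (generating transparent justifications), so that they augment human planners' judgment rather than replace it.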

Link: https://arxiv.org/abs/2511.05375
Authors: Sijie Yang, Jiatong Li, Filip Biljecki
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Submitted to AAAI 2026 Workshop AI4UP

Abstract: AI has proven highly successful at urban planning analysis – learning patterns from data to predict future conditions. The next frontier is AI-assisted decision-making: agents that recommend sites, allocate resources, and evaluate trade-offs while reasoning transparently about constraints and stakeholder values. Recent breakthroughs in reasoning AI – CoT prompting, ReAct, and multi-agent collaboration frameworks – now make this vision achievable. This position paper presents the Agentic Urban Planning AI Framework for reasoning-capable planning agents that integrates three cognitive layers (Perception, Foundation, Reasoning) with six logic components (Analysis, Generation, Verification, Evaluation, Collaboration, Decision) through a multi-agents collaboration framework. We demonstrate why planning decisions require explicit reasoning capabilities that are value-based (applying normative principles), rule-grounded (guaranteeing constraint satisfaction), and explainable (generating transparent justifications) – requirements that statistical learning alone cannot fulfill. We compare reasoning agents with statistical learning, present a comprehensive architecture with benchmark evaluation metrics, and outline critical research challenges. This framework shows how AI agents can augment human planners by systematically exploring solution spaces, verifying regulatory compliance, and deliberating over trade-offs transparently – not replacing human judgment but amplifying it with computational reasoning capabilities.

[AI-9] AI Literacy for Community Colleges: Instructors' Perspectives on Scenario-Based and Interactive Approaches to Teaching AI

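[Quick read]: This paper addresses the lack of effective, scalable approaches for teaching AI literacy to non-STEM students in higher education. As AI becomes embedded in everyday technologies, AI literacy (the ability to evaluate AI systems, communicate with them, and understand their broader impacts) has become a critical skill across disciplines, yet existing teaching approaches do not meet the needs of non-STEM learners. The proposed solution is AI User, an interactive online curriculum that introduces core AI concepts through scenario-based activities set in real-world contexts, with the goal of improving understanding of and critical engagement with AI. Its central design principle is situated, no-code interactive activity that simulates real-world AI use cases and encourages experimentation. Instructor feedback also shaped the form of the instructional support materials: interactive demonstrations were strongly preferred over traditional lecture slides or conceptual guides.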

Link: https://arxiv.org/abs/2511.05363
Authors: Aparna Maya Warrier, Arav Agarwal, Jaromir Savelka, Christopher A Bogart, Heather Burte
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract: This research category full paper investigates how community college instructors evaluate interactive, no-code AI literacy resources designed for non-STEM learners. As artificial intelligence becomes increasingly integrated into everyday technologies, AI literacy - the ability to evaluate AI systems, communicate with them, and understand their broader impacts - has emerged as a critical skill across disciplines. Yet effective, scalable approaches for teaching these concepts in higher education remain limited, particularly for students outside STEM fields. To address this gap, we developed AI User, an interactive online curriculum that introduces core AI concepts through scenario-based activities set in real-world contexts. This study presents findings from four focus groups with instructors who engaged with AI User materials and participated in structured feedback activities. Thematic analysis revealed that instructors valued exploratory tasks that simulated real-world AI use cases and fostered experimentation, while also identifying challenges related to scaffolding, accessibility, and multi-modal support. A ranking task for instructional support materials showed a strong preference for interactive demonstrations over traditional educational materials like conceptual guides or lecture slides. These findings offer insights into instructor perspectives on making AI concepts more accessible and relevant for broad learner audiences. They also inform the design of AI literacy tools that align with diverse teaching contexts and support critical engagement with AI in higher education.

[AI-10] Perceptually Aligning Representations of Music via Noise-Augmented Autoencoders NEURIPS2025

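[Quick read]: This paper addresses how to train an autoencoder so that its latent representation follows a perceptual hierarchy and therefore captures the perceptually important information in the input. The key is to train the autoencoder to reconstruct inputs from noised versions of their encodings, combined with perceptual losses, so that the learned encoding organizes information by perceptual level: more salient perceptual features are captured in coarser representation structures. Beyond giving the latent space more structure, this improves latent diffusion decoding in tasks such as estimating surprisal in music pitches and predicting EEG responses to music listening.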

Link: https://arxiv.org/abs/2511.05350
Authors: Mathias Rose Bjare, Giorgia Cantisani, Marco Pasini, Stefan Lattner, Gerhard Widmer
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Accepted at NeurIPS 2025 - AI for Music Workshop, 11 pages, 5 figures, 1 table

Abstract:We argue that training autoencoders to reconstruct inputs from noised versions of their encodings, when combined with perceptual losses, yields encodings that are structured according to a perceptual hierarchy. We demonstrate the emergence of this hierarchical structure by showing that, after training an audio autoencoder in this manner, perceptually salient information is captured in coarser representation structures than with conventional training. Furthermore, we show that such perceptual hierarchies improve latent diffusion decoding in the context of estimating surprisal in music pitches and predicting EEG-brain responses to music listening. Pretrained weights are available on this http URL.

[AI-11] Cleaning Maintenance Logs with LLM Agents for Improved Predictive Maintenance

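[Quick read]: This paper addresses key barriers to the practical adoption of predictive maintenance (PdM) in the automotive sector: economic constraints, limited dataset availability, and shortages of specialized expertise. The core of the solution is to build agents based on large language models (LLMs) that automate PdM data cleaning, specifically for maintenance logs, a critical source of training data that is often affected by noise such as typos, missing fields, near-duplicate entries, and incorrect dates. Experiments show that LLMs handle generic cleaning tasks well, which offers a feasible path for PdM in industrial settings. Domain-specific errors remain challenging, but specialized training and stronger agentic capabilities may improve performance further.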

Link: https://arxiv.org/abs/2511.05311
Authors: Valeriu Dimidov, Faisal Hawlader, Sasan Jafarnejad, Raphaël Frank
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Software Engineering (cs.SE)
Comments:

Abstract:Economic constraints, limited availability of datasets for reproducibility and shortages of specialized expertise have long been recognized as key challenges to the adoption and advancement of predictive maintenance (PdM) in the automotive sector. Recent progress in large language models (LLMs) presents an opportunity to overcome these barriers and speed up the transition of PdM from research to industrial practice. Under these conditions, we explore the potential of LLM-based agents to support PdM cleaning pipelines. Specifically, we focus on maintenance logs, a critical data source for training well-performing machine learning (ML) models, but one often affected by errors such as typos, missing fields, near-duplicate entries, and incorrect dates. We evaluate LLM agents on cleaning tasks involving six distinct types of noise. Our findings show that LLMs are effective at handling generic cleaning tasks and offer a promising foundation for future industrial applications. While domain-specific errors remain challenging, these results highlight the potential for further improvements through specialized training and enhanced agentic capabilities.

[AI-12] TAMAS: Benchmarking Adversarial Risks in Multi-Agent LLM Systems ICML2025

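[Quick read]: This paper addresses the safety and robustness of multi-agent large language model (LLM) systems that collaborate on complex tasks. Existing benchmarks focus mainly on single-agent settings and fail to expose the vulnerabilities specific to multi-agent interaction. The key contribution is TAMAS (Threats and Attacks in Multi-Agent Systems), a comprehensive benchmark with 300 adversarial instances, six attack types, 211 tools, and 100 harmless tasks for systematically evaluating the safety and effectiveness of multi-agent LLM systems. The paper also introduces the Effective Robustness Score (ERS), which quantifies the trade-off between safety and task performance, providing a standardized evaluation framework and directions for improving the safety of multi-agent LLM systems.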

Link: https://arxiv.org/abs/2511.05269
Authors: Ishan Kavathekar, Hemang Jain, Ameya Rathod, Ponnurangam Kumaraguru, Tanuja Ganu
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: Accepted at ICML 2025 MAS Workshop. This version includes additional experiments and analysis

Abstract: Large Language Models (LLMs) have demonstrated strong capabilities as autonomous agents through tool use, planning, and decision-making abilities, leading to their widespread adoption across diverse tasks. As task complexity grows, multi-agent LLM systems are increasingly used to solve problems collaboratively. However, the safety and security of these systems remain largely under-explored. Existing benchmarks and datasets predominantly focus on single-agent settings, failing to capture the unique vulnerabilities of multi-agent dynamics and coordination. To address this gap, we introduce Threats and Attacks in Multi-Agent Systems (TAMAS), a benchmark designed to evaluate the robustness and safety of multi-agent LLM systems. TAMAS includes five distinct scenarios comprising 300 adversarial instances across six attack types and 211 tools, along with 100 harmless tasks. We assess system performance across ten backbone LLMs and three agent interaction configurations from Autogen and CrewAI frameworks, highlighting critical challenges and failure modes in current multi-agent deployments. Furthermore, we introduce Effective Robustness Score (ERS) to assess the tradeoff between safety and task effectiveness of these frameworks. Our findings show that multi-agent systems are highly vulnerable to adversarial attacks, underscoring the urgent need for stronger defenses. TAMAS provides a foundation for systematically studying and improving the safety of multi-agent LLM systems.

[AI-13] Integrating Score-Based Diffusion Models with Machine Learning-Enhanced Localization for Advanced Data Assimilation in Geological Carbon Storage

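[Quick read]: This paper addresses the difficulty of accurately characterizing subsurface heterogeneity in geological carbon storage (GCS) projects, with the goal of making uncertainty quantification for risk assessment more reliable. The key is a framework that combines score-based diffusion models with machine learning-enhanced localization: the diffusion model generates a large ensemble of channelized permeability fields, simple ML algorithms compute the corresponding states, and the resulting large ensemble improves covariance estimation for the Ensemble Smoother with Multiple Data Assimilation (ESMDA). The ML-based localization retains significantly more ensemble variance than running without localization while achieving comparable data-match quality, which makes uncertainty propagation in GCS simulations more trustworthy.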

Link: https://arxiv.org/abs/2511.05266
Authors: Gabriel Serrão Seabra (1, 2), Nikolaj T. Mücke (1), Vinicius Luiz Santos Silva (2, 4), Alexandre A. Emerick (2), Denis Voskov (1, 5), Femke Vossepoel (1)
Affiliations: (1) Faculty of Civil Engineering and Geosciences, TU Delft, Delft, Netherlands; (2) Petroleo Brasileiro S.A. (Petrobras), Rio de Janeiro, Brazil; (4) Imperial College London, London, United Kingdom; (5) Department of Energy Resources Engineering, Stanford University, CA, USA
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Corresponding author: Gabriel Serrão Seabra

Abstract: Accurate characterization of subsurface heterogeneity is important for the safe and effective implementation of geological carbon storage (GCS) projects. This paper explores how machine learning methods can enhance data assimilation for GCS with a framework that integrates score-based diffusion models with machine learning-enhanced localization in channelized reservoirs during CO₂ injection. We employ a machine learning-enhanced localization framework that uses large ensembles (N_s = 5000) with permeabilities generated by the diffusion model and states computed by simple ML algorithms to improve covariance estimation for the Ensemble Smoother with Multiple Data Assimilation (ESMDA). We apply ML algorithms to a prior ensemble of channelized permeability fields, generated with the geostatistical model FLUVSIM. Our approach is applied on a CO₂ injection scenario simulated using the Delft Advanced Research Terra Simulator (DARTS). Our ML-based localization maintains significantly more ensemble variance than when localization is not applied, while achieving comparable data-matching quality. This framework has practical implications for GCS projects, helping improve the reliability of uncertainty quantification for risk assessment.
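
For context, the standard ESMDA update from the data assimilation literature, with localization applied as an elementwise (Schur) product on the gain, can be written as below. The symbols are assumed for illustration: $m_j^{k}$ is ensemble member $j$ at assimilation step $k$, $d_j^{k}$ its simulated data, $C_{MD}^{k}$ and $C_{DD}^{k}$ the ensemble cross- and auto-covariances, $C_D$ the observation-error covariance, and $\rho$ the localization matrix. How the paper constructs $\rho$ from its large ML-generated ensemble is not reproduced here.

```latex
\begin{align*}
m_j^{k+1} &= m_j^{k}
  + \rho \circ \Big[ C_{MD}^{k} \big( C_{DD}^{k} + \alpha_k C_D \big)^{-1} \Big]
    \Big( d_{\mathrm{obs}} + \sqrt{\alpha_k}\, C_D^{1/2} z_j^{k} - d_j^{k} \Big),
  \qquad z_j^{k} \sim \mathcal{N}(0, I), \\
\sum_{k=1}^{N_a} \frac{1}{\alpha_k} &= 1 .
\end{align*}
```

Localization matters because covariances estimated from a small ensemble contain spurious long-range correlations; damping them with $\rho$ limits the variance collapse that the abstract refers to.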

[AI-14] An End-to-End Deep Reinforcement Learning Approach for Solving the Traveling Salesman Problem with Drones

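[Quick read]: This paper addresses the Traveling Salesman Problem with Drones (TSP-D) in truck-drone collaborative delivery. Coordinating the drone with the truck gives the problem NP-hard combinatorial complexity that conventional optimization methods cannot solve efficiently. The key is a hierarchical Actor-Critic deep reinforcement learning framework. Its encoder uses a Transformer-inspired structure with an optimized k-nearest-neighbors sparse attention mechanism that focuses on relevant spatial relationships, enriched with global node features; its decoder uses an efficient Minimal Gated Unit to generate the solution sequence. The whole model is trained within an asynchronous advantage actor-critic paradigm, which improves training efficiency and computation speed while preserving solution quality.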

Link: https://arxiv.org/abs/2511.05265
Authors: Taihelong Zeng, Yun Lin, Yuhe Shi, Yan Li, Zhiqing Wei, Xuanru Ji
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The emergence of truck-drone collaborative systems in last-mile logistics has positioned the Traveling Salesman Problem with Drones (TSP-D) as a pivotal extension of classical routing optimization, where synchronized vehicle coordination promises substantial operational efficiency and reduced environmental impact, yet introduces NP-hard combinatorial complexity beyond the reach of conventional optimization paradigms. Deep reinforcement learning offers a theoretically grounded framework to address TSP-D’s inherent challenges through self-supervised policy learning and adaptive decision-making. This study proposes a hierarchical Actor-Critic deep reinforcement learning framework for solving the TSP-D problem. The architecture consists of two primary components: a Transformer-inspired encoder and an efficient Minimal Gated Unit decoder. The encoder incorporates a novel, optimized k-nearest neighbors sparse attention mechanism specifically for focusing on relevant spatial relationships, further enhanced by the integration of global node features. The Minimal Gated Unit decoder processes these encoded representations to efficiently generate solution sequences. The entire framework operates within an asynchronous advantage actor-critic paradigm. Experimental results show that, on benchmark TSP-D instances of various scales (N=10 to 100), the proposed model can obtain competitive or even superior solutions in shorter average computation times compared to high-performance heuristic algorithms and existing reinforcement learning methods. Moreover, compared to advanced reinforcement learning algorithm benchmarks, the proposed framework significantly reduces the total training time required while achieving superior final performance, highlighting its notable advantage in training efficiency.
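
The idea behind k-nearest-neighbors sparse attention can be sketched in a few lines: each node attends only to itself and its k spatially closest nodes instead of to all nodes. Everything below is an illustrative assumption (Euclidean distance on 2D coordinates, single-head dot-product attention, self included in the neighbor list); the paper's optimized mechanism and its global node features are not reproduced.

```python
import math

def knn_indices(coords, k):
    """For each node, return [itself] + indices of its k nearest nodes."""
    result = []
    for i, (xi, yi) in enumerate(coords):
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(coords) if j != i
        )
        result.append([i] + [j for _, j in dists[:k]])
    return result

def sparse_attention(queries, keys, values, neighbors):
    """Scaled dot-product attention restricted to each node's neighbor list."""
    dim = len(queries[0])
    outputs = []
    for i, nbrs in enumerate(neighbors):
        scores = [
            sum(q * kk for q, kk in zip(queries[i], keys[j])) / math.sqrt(dim)
            for j in nbrs
        ]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * values[j][d] for w, j in zip(weights, nbrs))
            for d in range(len(values[0]))
        ])
    return outputs

# Three clustered customers and one distant customer.
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
neighbors = knn_indices(coords, k=2)
```

With N nodes, each attention row costs O(k) instead of O(N), which is the usual motivation for restricting attention to spatial neighbors in routing problems.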

[AI-15] Autonomous generation of different courses of action in mechanized combat operations STOC

Link: https://arxiv.org/abs/2511.05182
Authors: Johan Schubert, Patrik Hansen, Pontus Hörling, Ronnie Johansson
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: In Proceedings of the 30th International Command and Control Research Technology Symposium, Stockholm, Sweden, 3-6 November 2025, paper 009

[AI-16] No One-Model-Fits-All: Uncovering Spatio-Temporal Forecasting Trade-offs with Graph Neural Networks and Foundation Models

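[Quick read]: This paper addresses the relationship between how high-dimensional spatio-temporal data is collected in environmental sensing and how well downstream forecasting models perform, specifically how spatial sensor node density and sampling intervals affect models of different architectures. Existing methods mostly focus on edge-side data filtering and deployment strategies to reduce data volume, and overlook how sampling frequency and spatial coverage affect model performance. The paper systematically evaluates four families of models on real-world temperature data from a wireless sensor network: a classical statistical model (VAR), neural networks (GRU, Transformer), spatio-temporal graph neural networks (STGNNs), and time series foundation models (TSFMs: Chronos, Moirai, TimesFM). It finds that STGNNs are effective when sensors are sparse and the sampling rate is moderate, because the encoded graph structure captures spatial correlations. TSFMs perform competitively at high sampling frequencies but degrade when spatial coverage is reduced. The multivariate TSFM Moirai outperforms all other models by natively learning cross-sensor dependencies. These findings give empirical grounding and design guidance for efficient, scalable spatio-temporal forecasting pipelines.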

Link: https://arxiv.org/abs/2511.05179
Authors: Ragini Gupta, Naman Raina, Bo Chen, Li Chen, Claudiu Danilov, Josh Eckhardt, Keyshla Bernard, Klara Nahrstedt
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments:

Abstract: Modern IoT deployments for environmental sensing produce high-volume spatiotemporal data to support downstream tasks such as forecasting, typically powered by machine learning models. While existing filtering and strategic deployment techniques optimize collected data volume at the edge, they overlook how variations in sampling frequencies and spatial coverage affect downstream model performance. In many forecasting models, incorporating data from additional sensors denoises predictions by providing broader spatial contexts. This interplay between sampling frequency, spatial coverage and different forecasting model architectures remains underexplored. This work presents a systematic study of forecasting models - classical models (VAR), neural networks (GRU, Transformer), spatio-temporal graph neural networks (STGNNs), and time series foundation models (TSFMs: Chronos, Moirai, TimesFM) - under varying spatial sensor node density and sampling intervals using real-world temperature data in a wireless sensor network. Our results show that STGNNs are effective when sensor deployments are sparse and sampling rate is moderate, leveraging spatial correlations via encoded graph structure to compensate for limited coverage. In contrast, TSFMs perform competitively at high frequencies but degrade when spatial coverage from neighboring sensors is reduced. Crucially, the multivariate TSFM Moirai outperforms all models by natively learning cross-sensor dependencies. These findings offer actionable insights for building efficient forecasting pipelines in spatio-temporal systems. All code for model configurations, training, dataset, and logs are open-sourced for reproducibility: this https URL

[AI-17] Model Merging Improves Zero-Shot Generalization in Bioacoustic Foundation Models

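[Quick read]: This paper addresses the loss of instruction-following flexibility that domain-specific fine-tuning causes in bioacoustic foundation models, which shows up most clearly when several pieces of information (for example, both the common and the scientific name of a species) are requested at once. The key is a simple model merging strategy that interpolates the fine-tuned NatureLM with its base language model, recovering instruction-following ability with minimal loss of domain expertise. The merged model also generalizes markedly better in the zero-shot setting, achieving over a 200% relative improvement in closed-set zero-shot classification of unseen species and setting a new state of the art.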

Link: https://arxiv.org/abs/2511.05171
Authors: Davide Marincione, Donato Crisostomi, Roberto Dessi, Emanuele Rodolà, Emanuele Rossi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments:

Abstract:Foundation models capable of generalizing across species and tasks represent a promising new frontier in bioacoustics, with NatureLM being one of the most prominent examples. While its domain-specific fine-tuning yields strong performance on bioacoustic benchmarks, we observe that it also introduces trade-offs in instruction-following flexibility. For instance, NatureLM achieves high accuracy when prompted for either the common or scientific name individually, but its accuracy drops significantly when both are requested in a single prompt. We address this by applying a simple model merging strategy that interpolates NatureLM with its base language model, recovering instruction-following capabilities with minimal loss of domain expertise. Finally, we show that the merged model exhibits markedly stronger zero-shot generalization, achieving over a 200% relative improvement and setting a new state-of-the-art in closed-set zero-shot classification of unseen species.
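
Weight interpolation between a fine-tuned model and its base model can be sketched as follows. This is a minimal illustration, assuming both checkpoints share parameter names and shapes and are represented as plain dicts of flat float lists; the function name and the coefficient `lam` are hypothetical, and the paper's exact merging recipe (which parameters are merged, and with what coefficient) is not specified here.

```python
def interpolate_weights(base, finetuned, lam):
    """Return lam * finetuned + (1 - lam) * base, parameter by parameter.

    base, finetuned: dicts mapping parameter name -> list of floats
                     (an assumed stand-in for real model state dicts).
    lam: 0.0 recovers the base model, 1.0 the fine-tuned model.
    """
    if base.keys() != finetuned.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged = {}
    for name in base:
        if len(base[name]) != len(finetuned[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - lam) * b + lam * f
            for b, f in zip(base[name], finetuned[name])
        ]
    return merged

# Toy checkpoints with a single two-element parameter.
base_ckpt = {"layer.weight": [1.0, 2.0]}
tuned_ckpt = {"layer.weight": [3.0, 6.0]}
halfway = interpolate_weights(base_ckpt, tuned_ckpt, lam=0.5)
```

Sweeping `lam` between 0 and 1 traces a path between general instruction following (base) and domain expertise (fine-tuned), which is the trade-off the abstract describes.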

[AI-18] Generating Software Architecture Description from Source Code using Reverse Engineering and Large Language Model

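[Quick read]: This paper addresses the problem that Software Architecture Descriptions (SADs) in modern software systems are often missing, outdated, or out of line with the actual implementation. Developers then have to derive architectural knowledge from source code by hand, which increases cognitive load, slows onboarding, and erodes structural clarity over the system's lifetime. The key is a semi-automated approach that combines reverse engineering (RE) with a Large Language Model (LLM) to recover both static and behavioral views from source code: it first extracts a comprehensive component diagram, then uses prompt engineering to filter the architecturally significant core components, and finally uses few-shot prompting to generate state machine diagrams that model component behavior. The approach substantially reduces manual effort while improving the accuracy and maintainability of the architectural representation, offering a viable path to better system understanding and long-term maintainability.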

Link: https://arxiv.org/abs/2511.05165
Authors: Ahmad Hatahet, Christoph Knieke, Andreas Rausch
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract: Software Architecture Descriptions (SADs) are essential for managing the inherent complexity of modern software systems. They enable high-level architectural reasoning, guide design decisions, and facilitate effective communication among diverse stakeholders. However, in practice, SADs are often missing, outdated, or poorly aligned with the system's actual implementation. Consequently, developers are compelled to derive architectural insights directly from source code, a time-intensive process that increases cognitive load, slows new developer onboarding, and contributes to the gradual degradation of clarity over the system's lifetime. To address these issues, we propose a semi-automated generation of SADs from source code by integrating reverse engineering (RE) techniques with a Large Language Model (LLM). Our approach recovers both static and behavioral architectural views by extracting a comprehensive component diagram, filtering architecturally significant elements (core components) via prompt engineering, and generating state machine diagrams to model component behavior based on underlying code logic with few-shot prompting. The resulting views offer a scalable and maintainable alternative to traditional manual architectural documentation. This methodology, demonstrated using C++ examples, highlights the potent capability of LLMs to: 1) abstract the component diagram, thereby reducing the reliance on human expert involvement, and 2) accurately represent complex software behaviors, especially when enriched with domain-specific knowledge through few-shot prompting. These findings suggest a viable path toward significantly reducing manual effort while enhancing system understanding and long-term maintainability.

[AI-19] SmartSecChain-SDN: A Blockchain-Integrated Intelligent Framework for Secure and Efficient Software-Defined Networks

Link: https://arxiv.org/abs/2511.05156
Authors: Azhar Hussain Mozumder, M. John Basha, Chayapathi A. R
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: 20 pages, 12 figures

[AI-20] DL101 Neural Network Outputs and Loss Functions

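[Quick read]: This report addresses how to choose the output-layer activation function and the loss function of a neural network on sound statistical grounds, so that training rests on a solid theoretical basis. The key is to connect common loss functions, such as Mean Squared Error (MSE), Mean Absolute Error (MAE), and the cross-entropy losses, to the principle of Maximum Likelihood Estimation (MLE): choosing a particular loss is equivalent to assuming a particular probability distribution for the model output, which links these losses to Generalized Linear Models (GLMs). This framework explains why the typical losses for regression and classification are appropriate, and it extends to practical scenarios such as alternative output encodings, constrained outputs, and heavy-tailed distributions, giving rigorous statistical guidance for designing deep learning models.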

Link: https://arxiv.org/abs/2511.05131
Authors: Fernando Berzal
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:The loss function used to train a neural network is strongly connected to its output layer from a statistical point of view. This technical report analyzes common activation functions for a neural network output layer, like linear, sigmoid, ReLU, and softmax, detailing their mathematical properties and their appropriate use cases. A strong statistical justification exists for the selection of the suitable loss function for training a deep learning model. This report connects common loss functions such as Mean Squared Error (MSE), Mean Absolute Error (MAE), and various Cross-Entropy losses to the statistical principle of Maximum Likelihood Estimation (MLE). Choosing a specific loss function is equivalent to assuming a specific probability distribution for the model output, highlighting the link between these functions and the Generalized Linear Models (GLMs) that underlie network output layers. Additional scenarios of practical interest are also considered, such as alternative output encodings, constrained outputs, and distributions with heavy tails.
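
The link between loss functions and maximum likelihood can be made concrete with three standard cases. The notation is assumed for illustration: $\hat{y}$ is the network output for a regression target $y$, $z$ is the output logit for a binary label, and $\sigma^2$ and $b$ are fixed scale parameters. In each case the negative log-likelihood of one sample equals the familiar loss up to constants that do not depend on the network parameters.

```latex
\begin{align*}
\text{Gaussian, } y \sim \mathcal{N}(\hat{y}, \sigma^2): \quad
  -\log p(y \mid x) &= \frac{(y - \hat{y})^2}{2\sigma^2} + \tfrac{1}{2}\log(2\pi\sigma^2)
  && \Rightarrow \text{squared error (MSE)} \\
\text{Laplace, } y \sim \mathrm{Laplace}(\hat{y}, b): \quad
  -\log p(y \mid x) &= \frac{|y - \hat{y}|}{b} + \log(2b)
  && \Rightarrow \text{absolute error (MAE)} \\
\text{Bernoulli, } \hat{p} = \frac{1}{1 + e^{-z}}: \quad
  -\log p(y \mid x) &= -\,y \log \hat{p} - (1 - y)\log(1 - \hat{p})
  && \Rightarrow \text{binary cross-entropy}
\end{align*}
```

Read in the other direction, picking MAE over MSE amounts to assuming heavier-tailed (Laplace) noise, which is why MAE is less sensitive to outliers.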

[AI-21] Accelerating HDC-CNN Hybrid Models Using Custom Instructions on RISC-V GPUs

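[Quick Read]: This paper targets two problems: the high energy cost of training and inference in neural-network-based machine learning, and the limited accuracy of Hyperdimensional Computing (HDC) on complex visual tasks. The key is to design and implement custom GPU instructions optimized for HDC operations on a programmable, domain-specific GPU platform built on the open-source RISC-V architecture, enabling efficient execution of hybrid HDC-CNN workloads. With four types of custom HDC instructions, the authors report up to a 56.2× speedup in microbenchmarks, demonstrating the potential of RISC-V GPUs for energy-efficient, high-performance computing.

Link: https://arxiv.org/abs/2511.05053
Authors: Wakuto Matsumi,Riaz-Ul-Haque Mian
Affiliation: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: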

Abstract:Machine learning based on neural networks has advanced rapidly, but the high energy consumption required for training and inference remains a major challenge. Hyperdimensional Computing (HDC) offers a lightweight, brain-inspired alternative that enables high parallelism but often suffers from lower accuracy on complex visual tasks. To overcome this, hybrid accelerators combining HDC and Convolutional Neural Networks (CNNs) have been proposed, though their adoption is limited by poor generalizability and programmability. The rise of open-source RISC-V architectures has created new opportunities for domain-specific GPU design. Unlike traditional proprietary GPUs, emerging RISC-V-based GPUs provide flexible, programmable platforms suitable for custom computation models such as HDC. In this study, we design and implement custom GPU instructions optimized for HDC operations, enabling efficient processing for hybrid HDC-CNN workloads. Experimental results using four types of custom HDC instructions show a performance improvement of up to 56.2 times in microbenchmark tests, demonstrating the potential of RISC-V GPUs for energy-efficient, high-performance computing.

[AI-22] OvA-LP: A Simple and Efficient Framework for Federated Learning on Non-IID Data

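[Quick Read]: This paper addresses the degradation of the global model in Federated Fine-Tuning (FFT) when client data are highly heterogeneous (non-IID): local drift, i.e., divergence among client updates, induces systematic bias and amplified variance. The key is OvA-LP, a minimalist framework that, according to the authors, is the first designed to suppress drift at its source within the FFT paradigm based on Parameter-Efficient Fine-Tuning (PEFT). It combines linear probing on a frozen encoder with a one-vs-all classification head and a two-stage training procedure, preserving the geometry of the pretrained feature space and decoupling the logits so that the mechanisms that amplify drift are not triggered. On CIFAR-100, averaged over several non-IID partitions, OvA-LP retains 95.9% of its IID accuracy, far more than the state-of-the-art baselines (10.1% for PFPT and 34.5% for FFT-MoE), and it remains robust to both symmetric and asymmetric label noise.

Link: https://arxiv.org/abs/2511.05028
Authors: Dongjin Park,Hasung Yeo,Joon-Woo Lee
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: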

Abstract:Federated fine-tuning (FFT) adapts foundation models to decentralized data but remains fragile under heterogeneous client distributions due to local drift, i.e., client-level update divergences that induce systematic bias and amplified variance in the global model. Existing aggregation and personalization methods largely correct drift post hoc, which proves brittle under extreme non-IID conditions. We introduce OvA-LP, a minimalist framework that is, to our knowledge, the first explicitly designed to suppress drift at its source within the PEFT-based FFT paradigm. OvA-LP combines linear probing on a frozen encoder with a one-vs-all head and a simple two-stage procedure, preserving pretrained feature geometry and decoupling logits to prevent the mechanisms that amplify drift. On CIFAR-100 with 100 clients, averaged over shard-1, shard-2, and Bernoulli-Dirichlet partitions, OvA-LP retains 95.9% of its IID accuracy, whereas state-of-the-art FFT baselines retain only 10.1% (PFPT) and 34.5% (FFT-MoE) under the same conditions. OvA-LP further maintains resilience under both symmetric and asymmetric label noise. In addition, precomputing encoder features makes per-round cost nearly independent of encoder size. Together, these results demonstrate that OvA-LP provides a principled and efficient basis for robust FFT under heterogeneity.
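
The phrase "decoupling logits" can be made concrete with a small comparison. The sketch below is not the authors' implementation and does not reproduce their two-stage procedure; it only assumes that a one-vs-all head means one independent sigmoid and binary cross-entropy per class. Under that assumption, the gradient for one class's logit does not depend on the other logits, whereas under softmax it does.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ova_loss(logits, label):
    """Sum of independent binary cross-entropies, one per class."""
    total = 0.0
    for k, z in enumerate(logits):
        target = 1.0 if k == label else 0.0
        p = sigmoid(z)
        total -= target * math.log(p) + (1.0 - target) * math.log(1.0 - p)
    return total

def ova_grad(logits, label):
    """Gradient of ova_loss w.r.t. each logit: sigmoid(z_k) - target_k."""
    return [sigmoid(z) - (1.0 if k == label else 0.0)
            for k, z in enumerate(logits)]

def softmax_grad(logits, label):
    """Gradient of softmax cross-entropy w.r.t. each logit: p_k - target_k."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s - (1.0 if k == label else 0.0) for k, e in enumerate(exps)]

# Changing the logit of class 2 leaves the one-vs-all gradient of class 0
# untouched, but changes the softmax gradient of class 0.
base, shifted = [1.0, 0.0, -1.0], [1.0, 0.0, 5.0]
print(ova_grad(base, 0)[0], ova_grad(shifted, 0)[0])
print(softmax_grad(base, 0)[0], softmax_grad(shifted, 0)[0])
```

In a federated setting where a client sees only a few classes, this independence means its updates for the classes it does see are not distorted by logits of classes it never observes; that reading is an interpretation of the abstract, not a claim from it.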

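[AI-23] 8bit-GPT: Exploring Human-AI Interaction on Obsolete Macintosh Operating Systems NEURIPS

[Quick Read]: This paper responds to the over-reliance on assistive chatbots, which has been linked to poorer information retention and superficial emotional attachment. The key is 8bit-GPT, a language model simulated on a legacy Macintosh operating system. Following reflective design principles such as slow technology and counterfunctionality, it deliberately introduces friction into the interaction, defamiliarizing the AI interface, foregrounding the chatbot as a tool, and prompting users to reconsider the nature of human-AI interaction and the effects of anthropomorphic rhetoric.

Link: https://arxiv.org/abs/2511.05025
Authors: Hala Sheta
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: NeurIPS Creative AI Track 2025: Humanity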

Abstract:The proliferation of assistive chatbots offering efficient, personalized communication has driven widespread over-reliance on them for decision-making, information-seeking and everyday tasks. This dependence has been found to harm information retention and to lead to superficial emotional attachment. As such, this work introduces 8bit-GPT, a language model simulated on a legacy Macintosh Operating System, to evoke reflection on the nature of Human-AI interaction and the consequences of anthropomorphic rhetoric. Drawing on reflective design principles such as slow-technology and counterfunctionality, this work aims to foreground the presence of chatbots as a tool by defamiliarizing the interface and prioritizing inefficient interaction, creating friction between the familiar and the unfamiliar.

[AI-24] Multi-agent Coordination via Flow Matching

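[Quick Read]: This paper addresses the trade-off between coordination quality and inference efficiency in Multi-Agent Reinforcement Learning (MARL). Existing methods usually give up one for the other: denoising-diffusion-based methods capture complex joint behaviors but are computationally expensive, while Gaussian-policy-based methods are fast at inference but handle multi-agent interaction poorly. The key is the MAC-Flow framework, which first learns a flow-based representation of the rich joint-behavior distribution in offline data and then distills it into decentralized one-step policies, preserving coordination while enabling efficient inference. Experiments show inference about 14.5× faster than diffusion-based MARL methods and comparable in speed to conventional Gaussian-policy methods.

Link: https://arxiv.org/abs/2511.05005
Authors: Dongsu Lee,Daehee Lee,Amy Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: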

Abstract:This work presents MAC-Flow, a simple yet expressive framework for multi-agent coordination. We argue that requirements of effective coordination are twofold: (i) a rich representation of the diverse joint behaviors present in offline data and (ii) the ability to act efficiently in real time. However, prior approaches often sacrifice one for the other, i.e., denoising diffusion-based solutions capture complex coordination but are computationally slow, while Gaussian policy-based solutions are fast but brittle in handling multi-agent interaction. MAC-Flow addresses this trade-off by first learning a flow-based representation of joint behaviors, and then distilling it into decentralized one-step policies that preserve coordination while enabling fast execution. Across four different benchmarks, including 12 environments and 34 datasets, MAC-Flow alleviates the trade-off between performance and computational cost, specifically achieving about 14.5× faster inference compared to diffusion-based MARL methods, while maintaining good performance. At the same time, its inference speed is similar to that of prior Gaussian policy-based offline multi-agent reinforcement learning (MARL) methods.

[AI-25] Query Generation Pipeline with Enhanced Answerability Assessment for Financial Information Retrieval

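[Quick Read]: This paper addresses the lack of adequate Information Retrieval (IR) benchmarks for finance: existing benchmarks do not reflect the complex, domain-specific information needs of real banking scenarios. The key is a systematic methodology, based on Large Language Models (LLMs), for constructing domain-specific IR benchmark datasets. The pipeline combines single-document and multi-document query generation with an enhanced, reasoning-augmented answerability assessment that agrees more closely with human judgments than prior approaches, supporting a sound evaluation of retrieval models in the financial domain.

Link: https://arxiv.org/abs/2511.05000
Authors: Hyunkyu Kim,Yeeun Yoo,Youngjun Kwak
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted (Oral) by ICAIF 2025. Hyunkyu Kim and Yeeun Yoo contributed equally to this work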

Abstract:As financial applications of large language models (LLMs) gain attention, accurate Information Retrieval (IR) remains crucial for reliable AI services. However, existing benchmarks fail to capture the complex and domain-specific information needs of real-world banking scenarios. Building domain-specific IR benchmarks is costly and constrained by legal restrictions on using real customer data. To address these challenges, we propose a systematic methodology for constructing domain-specific IR benchmarks through LLM-based query generation. As a concrete implementation of this methodology, our pipeline combines single and multi-document query generation with an enhanced and reasoning-augmented answerability assessment method, achieving stronger alignment with human judgments than prior approaches. Using this methodology, we construct KoBankIR, comprising 815 queries derived from 204 official banking documents. Our experiments show that existing retrieval models struggle with the complex multi-document queries in KoBankIR, demonstrating the value of our systematic approach for domain-specific benchmark construction and underscoring the need for improved retrieval techniques in financial domains.

[AI-26] BiPETE: A Bi-Positional Embedding Transformer Encoder for Risk Assessment of Alcohol and Substance Use Disorder with Electronic Health Records

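[Quick Read]: This paper addresses disease risk prediction from Electronic Health Records (EHRs), in particular how to model the temporal dependencies created by irregular visit intervals. Existing Transformer-based models are limited on EHR data by the lack of uniform structure and by insufficient encoding of temporal information. The key is the Bi-Positional Embedding Transformer Encoder (BiPETE), which combines rotary positional embeddings, to capture relative visit timing, with sinusoidal positional embeddings, to preserve visit order, and improves prediction without large-scale pretraining. In the depressive disorder and post-traumatic stress disorder (PTSD) cohorts, BiPETE improves the area under the precision-recall curve (AUPRC) by 34% and 50%, respectively. Integrated Gradients provides interpretability, identifying clinical features associated with the risk of alcohol and substance use disorders (ASUD), such as abnormal inflammatory, hematologic, and metabolic markers, specific medications, and comorbidities, which offer clues for risk assessment and intervention.

Link: https://arxiv.org/abs/2511.04998
Authors: Daniel S. Lee,Mayra S. Haedo-Cruz,Chen Jiang,Oshin Miranda,LiRong Wang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: 20 pages, 2 figures, 6 tables, 2 supplementary figures, 4 supplementary tables, submitted to Journal of Biomedical Informatics on 6 Nov, 2025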

Abstract:Transformer-based deep learning models have shown promise for disease risk prediction using electronic health records (EHRs), but modeling temporal dependencies remains a key challenge due to irregular visit intervals and lack of uniform structure. We propose a Bi-Positional Embedding Transformer Encoder (BiPETE) for single-disease prediction, which integrates rotary positional embeddings to encode relative visit timing and sinusoidal embeddings to preserve visit order. Without relying on large-scale pretraining, BiPETE is trained on EHR data from two mental health cohorts, depressive disorder and post-traumatic stress disorder (PTSD), to predict the risk of alcohol and substance use disorders (ASUD). BiPETE outperforms baseline models, improving the area under the precision-recall curve (AUPRC) by 34% and 50% in the depression and PTSD cohorts, respectively. An ablation study further confirms the effectiveness of the dual positional encoding strategy. We apply the Integrated Gradients method to interpret model predictions, identifying key clinical features associated with ASUD risk and protection, such as abnormal inflammatory, hematologic, and metabolic markers, as well as specific medications and comorbidities. Overall, these key clinical features identified by the attribution methods contribute to a deeper understanding of the risk assessment process and offer valuable clues for mitigating potential risks. In summary, our study presents a practical and interpretable framework for disease risk prediction using EHR data, which can achieve strong performance.
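
The two kinds of positional signal named in the abstract can be sketched in a few lines. This is a generic illustration of sinusoidal and rotary embeddings, not BiPETE itself: the abstract does not say how the two are combined, so the choice of indexing the sinusoidal embedding by visit number and the rotary rotation by elapsed time, as well as all function names, are assumptions.

```python
import math

def sinusoidal(position, dim, base=10000.0):
    """Standard sinusoidal embedding of an integer position (dim assumed even)."""
    out = []
    for i in range(0, dim, 2):
        freq = 1.0 / (base ** (i / dim))
        out.append(math.sin(position * freq))
        out.append(math.cos(position * freq))
    return out

def rotary(vec, t, base=10000.0):
    """Rotate consecutive pairs of vec by angles proportional to time t."""
    out = list(vec)
    for i in range(0, len(vec) - 1, 2):
        theta = t / (base ** (i / len(vec)))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query/key vectors for two visits 7 time units apart.
q, k = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.3]
print(dot(rotary(q, 3), rotary(k, 10)))   # visits at t=3 and t=10
print(dot(rotary(q, 30), rotary(k, 37)))  # same gap, later in the record
```

The two printed scores agree up to floating-point error: after the rotary rotation, the attention score depends only on the time gap between visits, which is what makes it suitable for irregular intervals, while the sinusoidal embedding still distinguishes the first visit from the fifth.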

[AI-27] Search Is Not Retrieval: Decoupling Semantic Matching from Contextual Assembly in RAG

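[Quick Read]: This paper addresses the way retrieval systems in current AI pipelines conflate two separate processes: finding relevant information and supplying enough context for reasoning. Treating them as one step yields results that lack either semantic precision or contextual completeness. The key is the Search-Is-Not-Retrieve (SINR) framework, a dual-layer architecture that separates fine-grained search representations from coarse-grained retrieval contexts and directly links small, semantically precise search chunks to larger, contextually complete retrieve chunks. According to the authors, this improves composability, scalability, and context fidelity without extra processing cost, and turns retrieval from a passive step into an active one that is closer to how people process information.

Link: https://arxiv.org/abs/2511.04939
Authors: Harshit Nainwani,Hediyeh Baban
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 22 pages, 2 figures, technical framework paper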

Abstract:Retrieval systems are essential to contemporary AI pipelines, although most confuse two separate processes: finding relevant information and giving enough context for reasoning. We introduce the Search-Is-Not-Retrieve (SINR) framework, a dual-layer architecture that distinguishes between fine-grained search representations and coarse-grained retrieval contexts. SINR enhances the composability, scalability, and context fidelity of retrieval systems by directly connecting small, semantically accurate search chunks to larger, contextually complete retrieve chunks, all without incurring extra processing costs. This design changes retrieval from a passive step to an active one, making the system architecture more like how people process information. We discuss the SINR framework’s conceptual foundation, formal structure, implementation issues, and qualitative outcomes. This provides a practical foundation for the next generation of AI systems that use retrieval.
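
The dual-layer idea can be illustrated with a toy index in which each small search chunk carries a pointer to the larger retrieve chunk it came from. This is a minimal sketch, not the SINR implementation: the fixed-size word windows, the token-overlap score (a stand-in for embedding similarity), and all names are assumptions made for the example.

```python
def tokenize(text):
    """Lowercased set of whitespace-separated tokens."""
    return set(text.lower().split())

def build_index(retrieve_chunks, search_size=8):
    """Split each retrieve chunk into small search chunks that point to it."""
    index = []
    for parent_id, chunk in enumerate(retrieve_chunks):
        words = chunk.split()
        for start in range(0, len(words), search_size):
            piece = " ".join(words[start:start + search_size])
            index.append((piece, parent_id))
    return index

def search_then_retrieve(query, index, retrieve_chunks, top_k=2):
    """Match on small chunks, then return their deduplicated parent chunks."""
    q = tokenize(query)
    scored = sorted(index, key=lambda item: len(q & tokenize(item[0])),
                    reverse=True)
    parents = []
    for piece, parent_id in scored:
        if not q & tokenize(piece):
            break  # remaining pieces share no token with the query
        if parent_id not in parents:
            parents.append(parent_id)
        if len(parents) == top_k:
            break
    return [retrieve_chunks[p] for p in parents]

docs = [
    "the cat sat on the mat and purred loudly all night long",
    "interest rates rose sharply after the central bank meeting in march",
]
index = build_index(docs)
print(search_then_retrieve("central bank", index, docs))
```

Matching happens on the eight-word pieces, but the caller receives the full parent passage, so the precision of the search unit and the completeness of the returned context are chosen independently.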

[AI-28] MERaLiON-SER: Robust Speech Emotion Recognition Model for English and SEA Languages

[Quick Read]: This paper addresses limited generalization and incomplete affect representation in cross-lingual Speech Emotion Recognition (SER), focusing on English and Southeast Asian languages. The key is a hybrid objective that combines weighted categorical cross-entropy with a Concordance Correlation Coefficient (CCC) loss, jointly modelling discrete emotion categories (e.g., happy, angry) and continuous emotion dimensions (arousal, valence, dominance) to obtain a more comprehensive and robust representation of human affect. The model outperforms open-source speech encoders and large Audio-LLMs on Singapore's languages (English, Chinese, Malay, Tamil) and on other public benchmarks, supporting the value of specialised speech-only models for cross-lingual emotion understanding.

Link: https://arxiv.org/abs/2511.04914
Authors: Hardik B. Sailor,Aw Ai Ti,Chen Fang Yih Nancy,Chiu Ying Lay,Ding Yang,He Yingxu,Jiang Ridong,Li Jingtao,Liao Jingyi,Liu Zhuohan,Lu Yanfeng,Ma Yi,Manas Gupta,Muhammad Huzaifah Bin Md Shahrin,Nabilah Binte Md Johan,Nattadaporn Lertcheva,Pan Chunlei,Pham Minh Duc,Siti Maryam Binte Ahmad Subaidi,Siti Umairah Binte Mohammad Salleh,Sun Shuo,Tarun Kumar Vangani,Wang Qiongqiong,Won Cheng Yi Lewis,Wong Heng Meng Jeremy,Wu Jinyang,Zhang Huayun,Zhang Longyin,Zou Xunlong
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: this https URL

Abstract:We present MERaLiON-SER, a robust speech emotion recognition model designed for English and Southeast Asian languages. The model is trained using a hybrid objective combining weighted categorical cross-entropy and Concordance Correlation Coefficient (CCC) losses for joint discrete and dimensional emotion modelling. This dual approach enables the model to capture both the distinct categories of emotion (like happy or angry) and the fine-grained dimensions, such as arousal (intensity), valence (positivity/negativity), and dominance (sense of control), leading to a more comprehensive and robust representation of human affect. Extensive evaluations across multilingual Singaporean languages (English, Chinese, Malay, and Tamil) and other public benchmarks show that MERaLiON-SER consistently surpasses both open-source speech encoders and large Audio-LLMs. These results underscore the importance of specialised speech-only models for accurate paralinguistic understanding and cross-lingual generalisation. Furthermore, the proposed framework provides a foundation for integrating emotion-aware perception into future agentic audio systems, enabling more empathetic and contextually adaptive multimodal reasoning.
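
A hybrid objective of this kind can be sketched as follows. The CCC formula is the standard one; everything else is an assumption for illustration, including the mixing weight `alpha`, the plain-mean normalization of the weighted cross-entropy, and the use of a single emotion dimension. The abstract does not specify these details.

```python
import math

def ccc(pred, target):
    """Concordance Correlation Coefficient between two equal-length sequences.

    Undefined (division by zero) if both sequences are constant and equal.
    """
    n = len(pred)
    mp, mt = sum(pred) / n, sum(target) / n
    vp = sum((p - mp) ** 2 for p in pred) / n
    vt = sum((t - mt) ** 2 for t in target) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, target)) / n
    return 2.0 * cov / (vp + vt + (mp - mt) ** 2)

def weighted_ce(probs_batch, labels, class_weights):
    """Mean of -w[label] * log(p[label]) over the batch."""
    losses = [-class_weights[y] * math.log(p[y])
              for p, y in zip(probs_batch, labels)]
    return sum(losses) / len(losses)

def hybrid_loss(probs_batch, labels, class_weights,
                dim_pred, dim_target, alpha=1.0):
    """Weighted cross-entropy plus alpha * (1 - CCC) for one emotion dimension."""
    return (weighted_ce(probs_batch, labels, class_weights)
            + alpha * (1.0 - ccc(dim_pred, dim_target)))

# Hypothetical batch: two utterances, two emotion classes, arousal scores.
probs = [[0.8, 0.2], [0.3, 0.7]]
labels = [0, 1]
weights = [1.0, 2.0]  # up-weight the rarer class
print(hybrid_loss(probs, labels, weights, [0.2, 0.9], [0.1, 0.8]))
```

Unlike Pearson correlation, CCC also penalizes differences in mean and scale between predictions and targets, which is why `1 - CCC` is a common regression loss for arousal, valence, and dominance.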

[AI-29] A Dual Perspective on Decision-Focused Learning: Scalable Training via Dual-Guided Surrogates

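[Quick Read]: This paper addresses limited decision quality in the predict-then-optimize paradigm: conventional training ignores how the downstream optimizer uses the predictions, so the final decisions suffer. Decision-Focused Learning (DFL) improves decision alignment, but existing methods call an optimizer, often a combinatorial one, frequently, which is expensive and hard to scale. The key is a new objective, Dual-Guided Loss (DGL), which uses the dual variables of the downstream problem to build differentiable surrogate losses and so trains efficiently with few solver calls: (a) the downstream problem is solved only periodically to refresh the dual information; (b) between refreshes, gradient updates use dual-adjusted targets; and (c) as refreshes become less frequent, training cost approaches that of standard supervised learning while strong decision alignment is retained. The authors prove that DGL has asymptotically diminishing decision regret. DGL is constructed for combinatorial selection problems such as matching, knapsack, and shortest path; in experiments on two problem classes it matches or exceeds state-of-the-art DFL methods with far fewer solver calls and much less training time.

Link: https://arxiv.org/abs/2511.04909
Authors: Paula Rodriguez-Diaz,Kirk Bansak,Elisabeth Paulson
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: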

Abstract:Many real-world decisions are made under uncertainty by solving optimization problems using predicted quantities. This predict-then-optimize paradigm has motivated decision-focused learning, which trains models with awareness of how the optimizer uses predictions, improving the performance of downstream decisions. Despite its promise, scaling is challenging: state-of-the-art methods either differentiate through a solver or rely on task-specific surrogates, both of which require frequent and expensive calls to an optimizer, often a combinatorial one. In this paper, we leverage dual variables from the downstream problem to shape learning and introduce Dual-Guided Loss (DGL), a simple, scalable objective that preserves decision alignment while reducing solver dependence. We construct DGL specifically for combinatorial selection problems with natural one-of-many constraints, such as matching, knapsack, and shortest path. Our approach (a) decouples optimization from gradient updates by solving the downstream problem only periodically; (b) between refreshes, trains on dual-adjusted targets using simple differentiable surrogate losses; and (c) as refreshes become less frequent, drives training cost toward standard supervised learning while retaining strong decision alignment. We prove that DGL has asymptotically diminishing decision regret, analyze runtime complexity, and show on two problem classes that DGL matches or exceeds state-of-the-art DFL methods while using far fewer solver calls and substantially less training time. Code is available at this https URL.

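[AI-30] You Need Reasoning to Learn Reasoning: The Limitations of Label-Free RL in Weak Base Models NEURIPS2025

[Quick Read]: This paper addresses the poor generalization of unsupervised (label-free) Reinforcement Learning (RL) methods to small base models (0.5B to 7B parameters), whose weak reasoning ability prevents effective self-reflection and improvement. The study finds that small models produce Chain-of-Thought (CoT) reasoning that is too short and not diverse enough, and that the difficulty of the training data plays a crucial role in whether RL succeeds. The key is a curriculum learning strategy that progressively introduces harder problems and masks no-majority rollouts during training, together with a data curation pipeline that generates samples of predefined difficulty. The method yields consistent improvements across all model sizes studied, offering a more robust route to unsupervised RL for resource-constrained models.

Link: https://arxiv.org/abs/2511.04902
Authors: Shuvendu Roy,Hossein Hajimirsadeghi,Mengyao Zhai,Golnoosh Samei
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: MATH-AI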

Abstract:Recent advances in large language models have demonstrated the promise of unsupervised reinforcement learning (RL) methods for enhancing reasoning capabilities without external supervision. However, the generalizability of these label-free RL approaches to smaller base models with limited reasoning capabilities remains unexplored. In this work, we systematically investigate the performance of label-free RL methods across different model sizes and reasoning strengths, from 0.5B to 7B parameters. Our empirical analysis reveals critical limitations: label-free RL is highly dependent on the base model’s pre-existing reasoning capability, with performance often degrading below baseline levels for weaker models. We find that smaller models fail to generate sufficiently long or diverse chain-of-thought reasoning to enable effective self-reflection, and that training data difficulty plays a crucial role in determining success. To address these challenges, we propose a simple yet effective method for label-free RL that utilizes curriculum learning to progressively introduce harder problems during training and mask no-majority rollouts during training. Additionally, we introduce a data curation pipeline to generate samples with predefined difficulty. Our approach demonstrates consistent improvements across all model sizes and reasoning capabilities, providing a path toward more robust unsupervised RL that can bootstrap reasoning abilities in resource-constrained models. We make our code available at this https URL
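
One way to read "mask no-majority rollouts" is in terms of majority-vote pseudo-labels, a common reward signal in label-free RL. The sketch below rests on that reading and is not taken from the paper: the function name, the strict-majority threshold `min_share`, and the 0/1 reward are all assumptions.

```python
from collections import Counter

def majority_vote_rewards(answers, min_share=0.5):
    """Pseudo-label rewards for one prompt, or None if the prompt is masked.

    answers: final answers extracted from several rollouts of the same prompt.
    The most frequent answer acts as the pseudo-label. If its share does not
    exceed min_share, there is no majority and the prompt is excluded from
    the update.
    """
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    if top_count / len(answers) <= min_share:
        return None
    return [1.0 if a == top_answer else 0.0 for a in answers]

print(majority_vote_rewards(["4", "4", "4", "5"]))  # [1.0, 1.0, 1.0, 0.0]
print(majority_vote_rewards(["1", "2", "3", "4"]))  # None
```

Under this reading, masking removes prompts on which a weak model's rollouts disagree too much for the pseudo-label to be trustworthy, and a curriculum that starts with easier problems makes such prompts rarer early in training.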

[AI-31] Real-Time Reasoning Agents in Evolving Environments

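[Quick Read]: This paper addresses how agents in dynamic real-world environments can make judgments that are both logical and timely, i.e., respond within limited time without giving up reasoning depth. Existing language-model agents do not account for the environment evolving in real time, which limits them on hard tasks or under time pressure. The key is AgileThinker, an architecture that engages two reasoning paradigms at once: reactive reasoning, which bounds reasoning computation for rapid responses, and planning, which allows extended reasoning for complex problems. By engaging both, AgileThinker balances reasoning depth and response latency as task difficulty and time pressure rise, and consistently outperforms agents that use only one paradigm.

Link: https://arxiv.org/abs/2511.04898
Authors: Yule Wen,Yixin Ye,Yanzhe Zhang,Diyi Yang,Hao Zhu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 30 pages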

Abstract:Agents in the real world must make not only logical but also timely judgments. This requires continuous awareness of the dynamic environment: hazards emerge, opportunities arise, and other agents act, while the agent’s reasoning is still unfolding. Despite advances in language model reasoning, existing approaches fail to account for this dynamic nature. We introduce real-time reasoning as a new problem formulation for agents in evolving environments and build Real-Time Reasoning Gym to demonstrate it. We study two paradigms for deploying language models in agents: (1) reactive agents, which employ language models with bounded reasoning computation for rapid responses, and (2) planning agents, which allow extended reasoning computation for complex problems. Our experiments show that even state-of-the-art models struggle with making logical and timely judgments in either paradigm. To address this limitation, we propose AgileThinker, which simultaneously engages both reasoning paradigms. AgileThinker consistently outperforms agents engaging only one reasoning paradigm as the task difficulty and time pressure rise, effectively balancing reasoning depth and response latency. Our work establishes real-time reasoning as a critical testbed for developing practical agents and provides a foundation for research in temporally constrained AI systems, highlighting a path toward real-time capable agents.

[AI-32] DMA: Online RAG Alignment with Human Feedback

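[Quick Read]: This paper addresses the inability of Retrieval-Augmented Generation (RAG) systems that rely on static retrieval to adapt to evolving user intent and content drift. The key is Dynamic Memory Alignment (DMA), an online learning framework that systematically incorporates multi-granularity human feedback (document-level, list-level, and response-level) into a coherent learning pipeline: supervised training of pointwise and listwise rankers, policy optimization driven by response-level preferences, and knowledge distillation into a lightweight scorer for low-latency serving. DMA enables feedback-driven adaptation in real-time interactive settings while preserving baseline retrieval performance.

Link: https://arxiv.org/abs/2511.04880
Authors: Yu Bai,Yukai Miao,Dawei Wang,Li Chen,Fei Long,Rundi Zhai,Dan Li,Yanyu Ren,Tianfeng Liu,Hongtao Xie,Ce Yang,Xuhui Cai
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: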

Abstract:Retrieval-augmented generation (RAG) systems often rely on static retrieval, limiting adaptation to evolving intent and content drift. We introduce Dynamic Memory Alignment (DMA), an online learning framework that systematically incorporates multi-granularity human feedback to align ranking in interactive settings. DMA organizes document-, list-, and response-level signals into a coherent learning pipeline: supervised training for pointwise and listwise rankers, policy optimization driven by response-level preferences, and knowledge distillation into a lightweight scorer for low-latency serving. Throughout this paper, memory refers to the model’s working memory, which is the entire context visible to the LLM for In-Context Learning. We adopt a dual-track evaluation protocol mirroring deployment: (i) large-scale online A/B ablations to isolate the utility of each feedback source, and (ii) few-shot offline tests on knowledge-intensive benchmarks. Online, a multi-month industrial deployment further shows substantial improvements in human engagement. Offline, DMA preserves competitive foundational retrieval while yielding notable gains on conversational QA (TriviaQA, HotpotQA). Taken together, these results position DMA as a principled approach to feedback-driven, real-time adaptation in RAG without sacrificing baseline capability.

[AI-33] Epistemic Reject Option Prediction

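[Quick Read]: This paper addresses the need for predictive models in high-stakes applications to quantify and communicate their uncertainty in addition to being accurate. Traditional reject-option methods consider only aleatoric uncertainty, which is valid only when training data are plentiful; with limited data, epistemic uncertainty cannot be ignored. The paper proposes an epistemic reject-option predictor built on Bayesian learning. Its core idea is to redefine the optimal predictor as the one that minimizes expected regret, i.e., the gap in performance on a given input between the learned model and the Bayes-optimal predictor that knows the full data distribution. The model abstains when this regret exceeds a specified rejection cost, thereby flagging inputs for which the training data are insufficient for a reliable decision. According to the authors, this is the first principled framework for learning predictors that can identify such regions.

Link: https://arxiv.org/abs/2511.04855
Authors: Vojtech Franc,Jakub Paplham
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: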

Abstract:In high-stakes applications, predictive models must not only produce accurate predictions but also quantify and communicate their uncertainty. Reject-option prediction addresses this by allowing the model to abstain when prediction uncertainty is high. Traditional reject-option approaches focus solely on aleatoric uncertainty, an assumption valid only when large training data makes the epistemic uncertainty negligible. However, in many practical scenarios, limited data makes this assumption unrealistic. This paper introduces the epistemic reject-option predictor, which abstains in regions of high epistemic uncertainty caused by insufficient data. Building on Bayesian learning, we redefine the optimal predictor as the one that minimizes expected regret – the performance gap between the learned model and the Bayes-optimal predictor with full knowledge of the data distribution. The model abstains when the regret for a given input exceeds a specified rejection cost. To our knowledge, this is the first principled framework that enables learning predictors capable of identifying inputs for which the training data is insufficient to make reliable decisions.
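
The regret-based rejection rule can be illustrated with a Monte Carlo sketch. This is not the estimator from the paper: it assumes 0-1 loss, assumes that samples from the posterior over models are available as class-probability vectors for one input, and uses hypothetical function names.

```python
def expected_regret(posterior_probs):
    """Prediction and Monte Carlo expected regret at one input under 0-1 loss.

    posterior_probs: list of class-probability vectors, one per posterior
    sample of the model, all evaluated at the same input.
    """
    n_samples = len(posterior_probs)
    n_classes = len(posterior_probs[0])
    mean = [sum(s[k] for s in posterior_probs) / n_samples
            for k in range(n_classes)]
    prediction = max(range(n_classes), key=lambda k: mean[k])
    # Under each sample, the Bayes-optimal class has risk 1 - max(s), while
    # our prediction has risk 1 - s[prediction]; regret is the difference.
    regret = sum(max(s) - s[prediction] for s in posterior_probs) / n_samples
    return prediction, regret

def predict_or_reject(posterior_probs, rejection_cost):
    """Return the predicted class, or None to abstain."""
    prediction, regret = expected_regret(posterior_probs)
    return None if regret > rejection_cost else prediction

# Posterior samples agree the input is ambiguous: aleatoric, no regret.
print(predict_or_reject([[0.5, 0.5], [0.5, 0.5]], rejection_cost=0.2))  # 0
# Posterior samples disagree about the class: epistemic, high regret.
print(predict_or_reject([[0.9, 0.1], [0.1, 0.9]], rejection_cost=0.2))  # None
```

The two toy inputs have the same posterior-mean probabilities, so a rule based only on the averaged prediction would treat them identically; the regret criterion abstains only where the models plausible under the data disagree.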

[AI-34] Software Defined Vehicle Code Generation: A Few-Shot Prompting Approach

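[Quick Read]: This paper addresses inefficient code generation for applications specific to Software-Defined Vehicles (SDVs), in particular how to obtain high-quality code from general-purpose Large Language Models (LLMs) without access to the architectures of proprietary models. The key is prompt engineering: well-structured, efficient system prompts steer the LLM toward code that meets SDV development needs, with no training and no access to the underlying model. Experiments show that models using a few-shot prompting strategy outperform the other approaches on quantitative metrics.

Link: https://arxiv.org/abs/2511.04849
Authors: Quang-Dung Nguyen,Tri-Dung Tran,Thanh-Hieu Chu,Hoang-Loc Tran,Xiangwei Cheng,Dirk Slama
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 6 pages, 3 figures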

Abstract:The emergence of Software-Defined Vehicles (SDVs) marks a paradigm shift in the automotive industry, where software now plays a pivotal role in defining vehicle functionality, enabling rapid innovation of modern vehicles. Developing SDV-specific applications demands advanced tools to streamline code generation and improve development efficiency. In recent years, general-purpose large language models (LLMs) have demonstrated transformative potential across domains. Still, restricted access to proprietary model architectures hinders their adaptation to specific tasks like SDV code generation. In this study, we propose using prompts, a common and basic strategy to interact with LLMs and redirect their responses. Using only system prompts with an appropriate and efficient prompt structure designed using advanced prompt engineering techniques, LLMs can be adapted without requiring a training session or access to their base design. This research conducts extensive experiments on different models by applying various prompting techniques, including bare models, using a benchmark specifically created to evaluate LLMs’ performance in generating SDV code. The results reveal that the model with a few-shot prompting strategy outperforms the others in adjusting the LLM answers to match the expected outcomes based on quantitative metrics.

[AI-35] Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning AACL

[Quick Read]: This paper addresses the bottleneck between multi-modal, large-scale training and high-fidelity environment simulation in robot learning, especially the demands of Reinforcement Learning (RL) and Imitation Learning (IL) for complex physical interaction, diverse sensor modelling, and efficient data collection. The key is Isaac Lab, a modular, composable robot simulation platform with a GPU-native architecture. It integrates high-fidelity GPU-parallel physics, photorealistic rendering, actuator models, multi-frequency sensor simulation, data collection pipelines, and domain randomization tools, unifying best practices for large-scale policy training and supporting tasks ranging from whole-body control to contact-rich manipulation and cross-embodiment mobility. The framework also plans to integrate the differentiable, GPU-accelerated Newton physics engine, opening a path to gradient-based, data-efficient robot learning.

Link: https://arxiv.org/abs/2511.04831
Authors: NVIDIA:Mayank Mittal,Pascal Roth,James Tigue,Antoine Richard,Octi Zhang,Peter Du,Antonio Serrano-Muñoz,Xinjie Yao,René Zurbrügg,Nikita Rudin,Lukasz Wawrzyniak,Milad Rakhsha,Alain Denzler,Eric Heiden,Ales Borovicka,Ossama Ahmed,Iretiayo Akinola,Abrar Anwar,Mark T. Carlson,Ji Yuan Feng,Animesh Garg,Renato Gasoto,Lionel Gulich,Yijie Guo,M. Gussert,Alex Hansen,Mihir Kulkarni,Chenran Li,Wei Liu,Viktor Makoviychuk,Grzegorz Malczyk,Hammad Mazhar,Masoud Moghani,Adithyavairavan Murali,Michael Noseworthy,Alexander Poddubny,Nathan Ratliff,Welf Rehberg,Clemens Schwarke,Ritvik Singh,James Latham Smith,Bingjie Tang,Ruchik Thaker,Matthew Trepte,Karl Van Wyk,Fangzhou Yu,Alex Millane,Vikram Ramasamy,Remo Steiner,Sangeeta Subramanian,Clemens Volk,CY Chen,Neel Jawale,Ashwin Varghese Kuruttukulam,Michael A. Lin,Ajay Mandlekar,Karsten Patzwaldt,John Welsh,Huihua Zhao,Fatima Anes,Jean-Francois Lafleche,Nicolas Moënne-Loccoz,Soowan Park,Rob Stepinski,Dirk Van Gelder,Chris Amevor,Jan Carius,Jumyung Chang,Anka He Chen,Pablo de Heras Ciechomski,Gilles Daviet,Mohammad Mohajerani,Julia von Muralt,Viktor Reutskyy,Michael Sauter,Simon Schirm,Eric L. Shi,Pierre Terdiman,Kenny Vilella,Tobias Widmer,Gordon Yeoman,Tiffany Chen,Sergey Grizan,Cathy Li,Lotus Li,Connor Smith,Rafael Wiltz,Kostas Alexis,Yan Chang,David Chu,Linxi “Jim” Fan,Farbod Farshidian,Ankur Handa,Spencer Huang,Marco Hutter,Yashraj Narang,Soha Pouya,Shiwei Sheng,Yuke Zhu
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Code and documentation are available here: this https URL

Abstract:We present Isaac Lab, the natural successor to Isaac Gym, which extends the paradigm of GPU-native robotics simulation into the era of large-scale multi-modal learning. Isaac Lab combines high-fidelity GPU parallel physics, photorealistic rendering, and a modular, composable architecture for designing environments and training robot policies. Beyond physics and rendering, the framework integrates actuator models, multi-frequency sensor simulation, data collection pipelines, and domain randomization tools, unifying best practices for reinforcement and imitation learning at scale within a single extensible platform. We highlight its application to a diverse set of challenges, including whole-body control, cross-embodiment mobility, contact-rich and dexterous manipulation, and the integration of human demonstrations for skill acquisition. Finally, we discuss upcoming integration with the differentiable, GPU-accelerated Newton physics engine, which promises new opportunities for scalable, data-efficient, and gradient-based approaches to robot learning. We believe Isaac Lab’s combination of advanced simulation capabilities, rich sensing, and data-center scale execution will help unlock the next generation of breakthroughs in robotics research.

[AI-36] A Standardized Benchmark for Multilabel Antimicrobial Peptide Classification NEURIPS2025

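[Quick Read]: This paper addresses the slow progress of computational methods and the hindered discovery of new candidates in Antimicrobial Peptide (AMP) research, caused by fragmented datasets, inconsistent annotations, and the lack of standardized benchmarks. The key is ESCAPE (Expanded Standardized Collection for Antimicrobial Peptide Evaluation), an experimental framework integrating over 80,000 peptides from 27 validated repositories. It separates antimicrobial peptides from negative sequences and organizes their functional annotations into a biologically coherent multilabel hierarchy covering antibacterial, antifungal, antiviral, and antiparasitic activities. On top of ESCAPE, the authors propose a Transformer-based model that uses sequence and structural information to predict multiple functional activities, improving mean Average Precision by up to 2.56% (relative, on average) over the second-best method and setting a new reference point for multilabel peptide classification.

Link: https://arxiv.org/abs/2511.04814
Authors: Sebastian Ojeda,Rafael Velasquez,Nicolás Aparicio,Juanita Puentes,Paula Cárdenas,Nicolás Andrade,Gabriel González,Sergio Rincón,Carolina Muñoz-Camargo,Pablo Arbeláez
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Comments: 39th Conference on Neural Information Processing Systems (NeurIPS 2025). Camera-ready version. Code: this https URL . Dataset DOI: this https URL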

Abstract:Antimicrobial peptides have emerged as promising molecules to combat antimicrobial resistance. However, fragmented datasets, inconsistent annotations, and the lack of standardized benchmarks hinder computational approaches and slow down the discovery of new candidates. To address these challenges, we present the Expanded Standardized Collection for Antimicrobial Peptide Evaluation (ESCAPE), an experimental framework integrating over 80,000 peptides from 27 validated repositories. Our dataset separates antimicrobial peptides from negative sequences and incorporates their functional annotations into a biologically coherent multilabel hierarchy, capturing activities across antibacterial, antifungal, antiviral, and antiparasitic classes. Building on ESCAPE, we propose a transformer-based model that leverages sequence and structural information to predict multiple functional activities of peptides. Our method achieves up to a 2.56% relative average improvement in mean Average Precision over the second-best method adapted for this task, establishing a new state of the art in multilabel peptide classification. ESCAPE provides a comprehensive and reproducible evaluation framework to advance AI-driven antimicrobial peptide research.
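The abstract reports its headline result as a relative improvement in mean Average Precision (mAP). As a minimal sketch of how that metric is commonly computed for multilabel predictions, the snippet below uses macro-averaged AP over labels. The function names, the macro averaging, and the toy arrays are assumptions made for illustration; the paper's exact evaluation protocol is not described in the abstract.

```python
import numpy as np

def average_precision(y_true, y_score):
    """AP for one label: mean of precision@k taken at the rank of each positive."""
    y_true = np.asarray(y_true, dtype=bool)
    order = np.argsort(-np.asarray(y_score, dtype=float), kind="stable")
    hits = y_true[order]
    if hits.sum() == 0:
        return float("nan")  # AP is undefined when a label has no positives
    precision_at_k = np.cumsum(hits) / np.arange(1, hits.size + 1)
    return float(precision_at_k[hits].mean())

def mean_average_precision(Y_true, Y_score):
    """Macro mAP (assumed averaging): mean of per-label AP, skipping undefined labels."""
    Y_true, Y_score = np.asarray(Y_true), np.asarray(Y_score)
    aps = [average_precision(Y_true[:, j], Y_score[:, j]) for j in range(Y_true.shape[1])]
    return float(np.nanmean(aps))

def relative_improvement(new, baseline):
    """Relative gain in percent, e.g. the form in which the 2.56% figure is stated."""
    return 100.0 * (new - baseline) / baseline
```

A relative improvement of 2.56% means the new mAP is 1.0256 times the baseline mAP, which is different from a gain of 2.56 percentage points.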

[AI-37] Unified Multimodal Diffusion Forcing for Forceful Manipulation

【速读】:该论文旨在解决传统模仿学习方法在处理多模态机器人轨迹时忽视感官输入、动作与奖励之间复杂交互关系的问题,从而导致对机器人行为建模和任务结果理解不足。其解决方案的关键在于提出一种统一的多模态扩散强制(Multimodal Diffusion Forcing, MDF)框架,该框架通过随机部分掩码(random partial masking)训练扩散模型以重建完整轨迹,从而显式地学习时间依赖性和跨模态依赖性(如从部分观测推断状态或预测动作对力信号的影响),显著提升了模型在高噪声环境下的鲁棒性和多功能性。

链接: https://arxiv.org/abs/2511.04812
作者: Zixuan Huang,Huaidian Hou,Dmitry Berenson
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project website: this https URL

点击查看摘要

Abstract:Given a dataset of expert trajectories, standard imitation learning approaches typically learn a direct mapping from observations (e.g., RGB images) to actions. However, such methods often overlook the rich interplay between different modalities, i.e., sensory inputs, actions, and rewards, which is crucial for modeling robot behavior and understanding task outcomes. In this work, we propose Multimodal Diffusion Forcing, a unified framework for learning from multimodal robot trajectories that extends beyond action generation. Rather than modeling a fixed distribution, MDF applies random partial masking and trains a diffusion model to reconstruct the trajectory. This training objective encourages the model to learn temporal and cross-modal dependencies, such as predicting the effects of actions on force signals or inferring states from partial observations. We evaluate MDF on contact-rich, forceful manipulation tasks in simulated and real-world environments. Our results show that MDF not only delivers versatile functionalities, but also achieves strong performance, and robustness under noisy observations. More visualizations can be found on our website this https URL
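The training objective described in the abstract rests on random partial masking of a multimodal trajectory. The sketch below shows only that masking step, under stated assumptions: the trajectory is a dictionary of hypothetical modality names mapped to arrays of shape (T, D), masking is drawn independently per modality and time step, and masked entries are zeroed. The actual MDF masking and noising scheme may differ.

```python
import numpy as np

def random_partial_mask(traj, keep_prob, rng):
    """Mask each (modality, time step) independently; True in a mask means 'kept'."""
    masked, masks = {}, {}
    for name, x in traj.items():
        keep = rng.random(x.shape[0]) < keep_prob      # one decision per time step
        masks[name] = keep
        masked[name] = np.where(keep[:, None], x, 0.0)  # zero out hidden steps
    return masked, masks
```

A reconstruction model trained on such inputs must fill in the hidden entries from whatever remains visible, which is one way to encourage the temporal and cross-modal dependencies the abstract mentions (for example, inferring force signals from actions).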

[AI-38] PuzzleMoE: Efficient Compression of Large Mixture-of-Experts Models via Sparse Expert Merging and Bit-packed inference

【速读】:该论文旨在解决混合专家(Mixture-of-Experts, MoE)模型在大规模部署中因存储所有专家参数导致的高内存开销问题,尤其是在专家数量增加时,压缩比越高性能下降越明显。其解决方案的关键在于提出了一种无需训练的压缩方法 PuzzleMoE,核心创新包括:一是通过稀疏专家合并策略识别参数层面的冗余与专业化特征,利用双掩码(dual-mask)同时保留共享和专家特异性参数;二是设计了一种位打包编码方案,复用未充分利用的指数位以高效存储掩码和符号信息,从而在GPU上实现低开销的高效推理。

链接: https://arxiv.org/abs/2511.04805
作者: Yushu Zhao,Zheng Wang,Minjia Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mixture-of-Experts (MoE) models have shown strong potential in scaling language models efficiently by activating only a small subset of experts per input. However, their widespread deployment remains limited due to the high memory overhead associated with storing all expert parameters, particularly as the number of experts increases. To address this challenge, prior works have explored expert dropping and merging strategies, yet they often suffer from performance drop at high compression ratios. In this paper, we introduce PuzzleMoE, a training-free MoE compression method that achieves both high accuracy and efficient inference through two key innovations: First, PuzzleMoE performs sparse expert merging by identifying element-wise weight redundancy and specialization. It uses a dual-mask to capture both shared and expert-specific parameters. Second, to avoid the overhead of storing binary masks and signs, PuzzleMoE introduces a bit-packed encoding scheme that reuses underutilized exponent bits, enabling efficient MoE inference on GPUs. Extensive experiments demonstrate that PuzzleMoE can compress MoE models by up to 50% while maintaining accuracy across various tasks. Specifically, it outperforms prior MoE compression methods by up to 16.7% on MMLU at 50% compression ratio, and achieves up to 1.28× inference speedup.

[AI-39] MDM: Manhattan Distance Mapping of DNN Weights for Parasitic-Resistance-Resilient Memristive Crossbars

【速读】:该论文旨在解决基于忆阻器(memristor)的位切片计算内存(CIM)交叉阵列中寄生电阻(Parasitic Resistance, PR)非理想性对深度神经网络(DNN)加速效率的限制问题。PR导致在将DNN矩阵映射到小型交叉阵列单元时,需频繁进行数字同步和模拟-数字转换(ADC),从而增加延迟、I/O压力及芯片面积开销。其解决方案的关键在于提出曼哈顿距离映射(Manhattan Distance Mapping, MDM)方法:通过利用比特级结构稀疏性,将激活值从密集的低位侧输入,并根据曼哈顿距离重排行顺序,使活跃的忆阻单元向受PR影响较小的区域迁移,从而显著降低非理想因子(Nonideality Factor, NF)。实验表明,MDM可使NF降低最多46%,并在ResNet模型上平均提升准确率3.6%,为CIM DNN加速器提供了轻量且空间感知的扩展方案。

链接: https://arxiv.org/abs/2511.04798
作者: Matheus Farias,Wanghley Martins,H. T. Kung
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: 5 pages, 6 figures

点击查看摘要

Abstract:Manhattan Distance Mapping (MDM) is a post-training deep neural network (DNN) weight mapping technique for memristive bit-sliced compute-in-memory (CIM) crossbars that reduces parasitic resistance (PR) nonidealities. PR limits crossbar efficiency by mapping DNN matrices into small crossbar tiles, reducing CIM-based speedup. Each crossbar executes one tile, requiring digital synchronization before the next layer. At this granularity, designers either deploy many small crossbars in parallel or reuse a few sequentially, both increasing analog-to-digital conversions, latency, I/O pressure, and chip area. MDM alleviates PR effects by optimizing active-memristor placement. Exploiting bit-level structured sparsity, it feeds activations from the denser low-order side and reorders rows according to the Manhattan distance, relocating active cells toward regions less affected by PR and thus lowering the nonideality factor (NF). Applied to DNN models on ImageNet-1k, MDM reduces NF by up to 46% and improves accuracy under analog distortion by an average of 3.6% in ResNets. Overall, it provides a lightweight, spatially informed method for scaling CIM DNN accelerators.
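The row-reordering idea can be illustrated with a small proxy. The sketch below assumes that the cell at index (0, 0) sits closest to the drivers and sense circuitry, so that the Manhattan distance `row + col` of an active cell stands in for its exposure to parasitic resistance. Both that assumption and the cost function are illustrative only; this is not the paper's nonideality factor, and the function names are hypothetical.

```python
import numpy as np

def manhattan_cost(active):
    """Sum of (row + col) over active cells, i.e. total distance to the (0, 0) corner."""
    rows, cols = np.nonzero(active)
    return int((rows + cols).sum())

def reorder_rows_by_density(active):
    """Move rows with more active cells toward row 0 (stable, so ties keep their order)."""
    counts = active.sum(axis=1)
    perm = np.argsort(-counts, kind="stable")
    return perm, active[perm]
```

Under this proxy, a row permutation changes only the row term of the cost, and sorting rows by descending active-cell count minimizes it by the rearrangement inequality. The returned permutation has to be applied to the corresponding inputs as well so that the matrix-vector product is unchanged.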

[AI-40] Causal Structure and Representation Learning with Biomedical Applications

【速读】:该论文旨在解决当前表示学习(Representation Learning)在因果推断任务中表现不佳的问题,尤其是在预测干预效应时的失效现象。其核心挑战在于:尽管表示学习在预测任务中已取得显著成功,但其学到的潜在空间往往无法捕捉数据背后的因果结构,从而导致因果推断失败。解决方案的关键在于将表示学习与因果推断相结合,并利用多模态数据(包括观测数据和干预数据、成像与测序数据,以及单细胞、组织和生物体层面的数据)构建一个统计与计算框架,以实现因果变量的发现、多视角数据驱动的因果表示学习,以及最优干预设计。

链接: https://arxiv.org/abs/2511.04790
作者: Caroline Uhler,Jiaqi Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: This article has successfully completed peer review and will appear in the Proceedings of the International Congress of Mathematicians 2026. Both authors contributed equally to this work

点击查看摘要

Abstract:Massive data collection holds the promise of a better understanding of complex phenomena and, ultimately, better decisions. Representation learning has become a key driver of deep learning applications, as it allows learning latent spaces that capture important properties of the data without requiring any supervised annotations. Although representation learning has been hugely successful in predictive tasks, it can fail miserably in causal tasks including predicting the effect of a perturbation/intervention. This calls for a marriage between representation learning and causal inference. An exciting opportunity in this regard stems from the growing availability of multi-modal data (observational and perturbational, imaging-based and sequencing-based, at the single-cell level, tissue-level, and organism-level). We outline a statistical and computational framework for causal structure and representation learning motivated by fundamental biomedical questions: how to effectively use observational and perturbational data to perform causal discovery on observed causal variables; how to use multi-modal views of the system to learn causal variables; and how to design optimal perturbations.

[AI-41] ScheduleStream: Temporal Planning with Samplers for GPU-Accelerated Multi-Arm Task and Motion Planning Scheduling

【速读】:该论文旨在解决多臂机器人在任务与运动规划(Task and Motion Planning, TAMP)中难以实现并行动作调度的问题。传统TAMP算法通常仅生成单臂依次执行的动作序列,无法充分利用双臂或人形机器人的协同能力,导致效率低下。解决方案的关键在于提出名为ScheduleStream的通用框架,其通过引入**混合持续动作(hybrid durative actions)**来建模时间动态性,允许动作异步启动且持续时间由参数决定;同时设计了与领域无关的算法,在无需特定应用机制的情况下直接求解调度问题,并结合GPU加速采样过程以提升实时性。该方法首次实现了从TAMP到任务调度的扩展,显著提高了多臂机器人在仿真和真实场景下的作业效率。

链接: https://arxiv.org/abs/2511.04758
作者: Caelan Garrett,Fabio Ramos
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: Project website: this https URL

点击查看摘要

Abstract:Bimanual and humanoid robots are appealing because of their human-like ability to leverage multiple arms to efficiently complete tasks. However, controlling multiple arms at once is computationally challenging due to the growth in the hybrid discrete-continuous action space. Task and Motion Planning (TAMP) algorithms can efficiently plan in hybrid spaces but generally produce plans, where only one arm is moving at a time, rather than schedules that allow for parallel arm motion. In order to extend TAMP to produce schedules, we present ScheduleStream, the first general-purpose framework for planning scheduling with sampling operations. ScheduleStream models temporal dynamics using hybrid durative actions, which can be started asynchronously and persist for a duration that’s a function of their parameters. We propose domain-independent algorithms that solve ScheduleStream problems without any application-specific mechanisms. We apply ScheduleStream to Task and Motion Planning Scheduling (TAMPAS), where we use GPU acceleration within samplers to expedite planning. We compare ScheduleStream algorithms to several ablations in simulation and find that they produce more efficient solutions. We demonstrate ScheduleStream on several real-world bimanual robot tasks at this https URL.

[AI-42] Trustworthiness Calibration Framework for Phishing Email Detection Using Large Language Models

【速读】:该论文旨在解决生成式 AI (Generative AI) 在钓鱼邮件检测任务中部署时面临的可靠性评估难题,即现有模型在基准测试中表现优异,但其实际应用中的可信度(如校准性、一致性与鲁棒性)仍缺乏系统化衡量标准。解决方案的关键在于提出可信度校准框架(Trustworthiness Calibration Framework, TCF),通过构建一个综合指标——可信度校准指数(Trustworthiness Calibration Index, TCI),对模型在三个维度上的表现进行量化,并辅以跨数据集稳定性(Cross-Dataset Stability, CDS)指标,实现对大语言模型(LLMs)如GPT-4、LLaMA-3-8B和DeBERTa-v3-base在真实场景下可信度的透明、可复现评估。

链接: https://arxiv.org/abs/2511.04728
作者: Daniyal Ganiuly,Assel Smaiyl
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:Phishing emails continue to pose a persistent challenge to online communication, exploiting human trust and evading automated filters through realistic language and adaptive tactics. While large language models (LLMs) such as GPT-4 and LLaMA-3-8B achieve strong accuracy in text classification, their deployment in security systems requires assessing reliability beyond benchmark performance. To address this, this study introduces the Trustworthiness Calibration Framework (TCF), a reproducible methodology for evaluating phishing detectors across three dimensions: calibration, consistency, and robustness. These components are integrated into a bounded index, the Trustworthiness Calibration Index (TCI), and complemented by the Cross-Dataset Stability (CDS) metric that quantifies stability of trustworthiness across datasets. Experiments conducted on five corpora, namely SecureMail 2025, Phishing Validation 2024, CSDMC2010, Enron-Spam, and Nazario, using DeBERTa-v3-base, LLaMA-3-8B, and GPT-4 demonstrate that GPT-4 achieves the strongest overall trust profile, followed by LLaMA-3-8B and DeBERTa-v3-base. Statistical analysis confirms that reliability varies independently of raw accuracy, underscoring the importance of trust-aware evaluation for real-world deployment. The proposed framework establishes a transparent and reproducible foundation for assessing model dependability in LLM-based phishing detection.
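Calibration is one of the three dimensions the framework evaluates. The abstract does not give the formula behind the TCI, so the sketch below shows only Expected Calibration Error (ECE), a standard calibration measure that a framework of this kind could plausibly build on. Its use here, the equal-width binning, and the function name are assumptions.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted average of |accuracy - mean confidence| per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges give bin indices 0..n_bins-1; a confidence of 1.0 lands in the last bin.
    bin_ids = np.digitize(confidences, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # in_bin.mean() is the fraction of samples in the bin
    return float(ece)
```

A detector that is 75% accurate on emails it scores at 0.9 confidence has a gap of 0.15 in that bin. Two models with the same accuracy can differ substantially on this measure, which is consistent with the abstract's observation that reliability varies independently of raw accuracy.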

[AI-43] Temporal convolutional and fusional transformer model with Bi-LSTM encoder-decoder for multi-time-window remaining useful life prediction

【速读】:该论文旨在解决工业系统中剩余使用寿命(Remaining Useful Life, RUL)预测模型在捕捉细粒度时间依赖性以及动态优先级化关键特征方面存在的不足,从而提升预测的鲁棒性和准确性。其解决方案的关键在于提出一种融合Temporal Convolutional Networks (TCNs)与改进型Temporal Fusion Transformer (TFT)的新型框架,其中TFT通过引入Bi-LSTM编码器-解码器结构增强对时序特征的建模能力,并结合多时间窗口策略以提升模型在不同工况下的适应性,有效整合短时与长时依赖关系,强化显著时序模式的识别,从而显著优于现有先进方法,在基准数据集上平均RMSE降低达5.5%。

链接: https://arxiv.org/abs/2511.04723
作者: Mohamadreza Akbari Pour,Mohamad Sadeq Karimi,Amir Hossein Mazloumi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Health prediction is crucial for ensuring reliability, minimizing downtime, and optimizing maintenance in industrial systems. Remaining Useful Life (RUL) prediction is a key component of this process; however, many existing models struggle to capture fine-grained temporal dependencies while dynamically prioritizing critical features across time for robust prognostics. To address these challenges, we propose a novel framework that integrates Temporal Convolutional Networks (TCNs) for localized temporal feature extraction with a modified Temporal Fusion Transformer (TFT) enhanced by Bi-LSTM encoder-decoder. This architecture effectively bridges short- and long-term dependencies while emphasizing salient temporal patterns. Furthermore, the incorporation of a multi-time-window methodology improves adaptability across diverse operating conditions. Extensive evaluations on benchmark datasets demonstrate that the proposed model reduces the average RMSE by up to 5.5%, underscoring its improved predictive accuracy compared to state-of-the-art methods. By closing critical gaps in current approaches, this framework advances the effectiveness of industrial prognostic systems and highlights the potential of advanced time-series transformers for RUL prediction.
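The abstract mentions a multi-time-window methodology and reports results as RMSE, without spelling out the windowing. One plausible reading, sketched below, is to slice the same sensor sequence into sliding windows of several lengths and feed each view to the model. The window lengths, stride, and function names are hypothetical choices for illustration.

```python
import numpy as np

def sliding_windows(series, window, stride=1):
    """Cut a (T, F) sequence into overlapping (window, F) slices; requires T >= window."""
    series = np.asarray(series)
    starts = range(0, series.shape[0] - window + 1, stride)
    return np.stack([series[s:s + window] for s in starts])

def multi_window_views(series, windows=(10, 20, 30)):
    """One array of slices per window length (lengths here are assumed values)."""
    return {w: sliding_windows(series, w) for w in windows}

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted RUL values."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))
```

Short windows expose local degradation patterns and long windows expose slow trends, which fits the abstract's goal of bridging short- and long-term dependencies.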

[AI-44] P-MIA: A Profiled-Based Membership Inference Attack on Cognitive Diagnosis Models

【速读】:该论文旨在解决认知诊断模型(Cognitive Diagnosis Models, CDMs)在智能教育平台中因训练数据敏感性而引发的隐私泄露风险问题,特别是针对尚未被充分研究的成员推理攻击(Membership Inference Attack, MIA)在CDMs场景下的应用空白。解决方案的关键在于提出一种新颖且现实的灰盒威胁模型,该模型利用教育平台提供的可解释性可视化功能(如雷达图),从暴露的内部知识状态向量(knowledge state vectors)中精确还原出原始特征,并基于此构建了一种以用户画像为基础的MIA框架(Profile-based MIA, P-MIA),该框架同时融合模型最终预测概率与暴露的知识状态向量作为攻击特征,显著优于传统黑盒基线方法,在三个真实数据集上验证了其有效性,并进一步展示了其作为审计工具评估机器遗忘技术效果的能力。

链接: https://arxiv.org/abs/2511.04716
作者: Mingliang Hou,Yinuo Wang,Teng Guo,Zitao Liu,Wenzhou Dou,Jiaqi Zheng,Renqiang Luo,Mi Tian,Weiqi Luo
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cognitive diagnosis models (CDMs) are pivotal for creating fine-grained learner profiles in modern intelligent education platforms. However, these models are trained on sensitive student data, raising significant privacy concerns. While membership inference attacks (MIA) have been studied in various domains, their application to CDMs remains a critical research gap, leaving their privacy risks unquantified. This paper is the first to systematically investigate MIA against CDMs. We introduce a novel and realistic grey box threat model that exploits the explainability features of these platforms, where a model’s internal knowledge state vectors are exposed to users through visualizations such as radar charts. We demonstrate that these vectors can be accurately reverse-engineered from such visualizations, creating a potent attack surface. Based on this threat model, we propose a profile-based MIA (P-MIA) framework that leverages both the model’s final prediction probabilities and the exposed internal knowledge state vectors as features. Extensive experiments on three real-world datasets against mainstream CDMs show that our grey-box attack significantly outperforms standard black-box baselines. Furthermore, we showcase the utility of P-MIA as an auditing tool by successfully evaluating the efficacy of machine unlearning techniques and revealing their limitations.

[AI-45] SWAP: Towards Copyright Auditing of Soft Prompts via Sequential Watermarking

【速读】:该论文旨在解决大规模视觉-语言模型(如CLIP)中软提示(soft prompts)的版权保护问题,即如何有效审计第三方模型是否非法复制了受保护的软提示。现有模型所有权审计技术在该场景下失效,原因在于软提示学习具有独特特性:非侵入式方法易因数据分布相似产生误报,而侵入式方法(如后门攻击)无法在CLIP中嵌入功能性触发器,且传统深度神经网络(DNN)后门技术扩展至提示学习时面临有害性和模糊性挑战。其根本原因在于水印机制与主任务共享决策空间但目标相悖。为此,作者提出顺序水印方法(Sequential Watermarking for Soft Prompts, SWAP),核心创新在于将水印编码于防御者指定的分布外类别(out-of-distribution classes)的特定顺序中,利用CLIP的零样本预测能力实现隐蔽嵌入,从而避免干扰原始任务性能。SWAP通过假设检验引导的验证协议进行检测,并提供理论保障,实验证明其在11个数据集上具备有效性、无害性和对自适应攻击的鲁棒性。

链接: https://arxiv.org/abs/2511.04711
作者: Wenyuan Yang,Yichen Sun,Changzheng Chen,Zhixuan Chu,Jiaheng Zhang,Yiming Li,Dacheng Tao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: The first two authors contributed equally to this work. 27 pages

点击查看摘要

Abstract:Large-scale vision-language models, especially CLIP, have demonstrated remarkable performance across diverse downstream tasks. Soft prompts, as carefully crafted modules that efficiently adapt vision-language models to specific tasks, necessitate effective copyright protection. In this paper, we investigate model copyright protection by auditing whether suspicious third-party models incorporate protected soft prompts. While this can be viewed as a special case of model ownership auditing, our analysis shows that existing techniques are ineffective due to prompt learning’s unique characteristics. Non-intrusive auditing is inherently prone to false positives when independent models share similar data distributions with victim models. Intrusive approaches also fail: backdoor methods designed for CLIP cannot embed functional triggers, while extending traditional DNN backdoor techniques to prompt learning suffers from harmfulness and ambiguity challenges. We find that these failures in intrusive auditing stem from the same fundamental reason: watermarking operates within the same decision space as the primary task yet pursues opposing objectives. Motivated by these findings, we propose sequential watermarking for soft prompts (SWAP), which implants watermarks into a different and more complex space. SWAP encodes watermarks through a specific order of defender-specified out-of-distribution classes, inspired by the zero-shot prediction capability of CLIP. This watermark, which is embedded in a more complex space, keeps the original prediction label unchanged, making it less opposed to the primary task. We further design a hypothesis-test-guided verification protocol for SWAP and provide theoretical analyses of success conditions. Extensive experiments on 11 datasets demonstrate SWAP’s effectiveness, harmlessness, and robustness against potential adaptive attacks.

[AI-46] Prioritize Economy or Climate Action? Investigating ChatGPT Response Differences Based on Inferred Political Orientation

【速读】:该论文旨在解决生成式 AI(Generative AI)在响应中可能因推断用户政治倾向而产生偏倚的问题,进而影响输出内容的公平性与中立性。其核心问题是:当模型基于隐含线索(如记忆或自定义指令)推断出用户的政治立场后,是否会改变其回答内容,并形成偏向性输出。解决方案的关键在于通过构建三种人格设定(两个政治倾向明确、一个中立),并利用自定义指令和记忆功能使 ChatGPT 推断出不同政治观点,从而系统性地测试模型响应是否随 inferred political views 变化。研究发现,即使不直接说明政治立场,模型也会依据隐含信息调整语义策略和推理方式,且在输出上表现出明显的左倾倾向,表明模型对政治隐含信号具有敏感性和一致性响应能力。

链接: https://arxiv.org/abs/2511.04706
作者: Pelin Karadal,Dilara Kekulluoglu
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) distinguish themselves by quickly delivering information and providing personalized responses through natural language prompts. However, they also infer user demographics, which can raise ethical concerns about bias and implicit personalization and create an echo chamber effect. This study aims to explore how inferred political views impact the responses of ChatGPT globally, regardless of the chat session. We also investigate how custom instruction and memory features alter responses in ChatGPT, considering the influence of political orientation. We developed three personas (two politically oriented and one neutral), each with four statements reflecting their viewpoints on DEI programs, abortion, gun rights, and vaccination. We convey the personas’ remarks to ChatGPT using memory and custom instructions, allowing it to infer their political perspectives without directly stating them. We then ask eight questions to reveal differences in worldview among the personas and conduct a qualitative analysis of the responses. Our findings indicate that responses are aligned with the inferred political views of the personas, showing varied reasoning and vocabulary, even when discussing similar topics. We also find the inference happening with explicit custom instructions and the implicit memory feature in similar ways. Analyzing response similarities reveals that the closest matches occur between the democratic persona with custom instruction and the neutral persona, supporting the observation that ChatGPT’s outputs lean left.

[AI-47] A hybrid solution approach for the Integrated Healthcare Timetabling Competition 2024

【速读】:该论文旨在解决集成医疗预约调度问题(Integrated Healthcare Timetabling Problem),其核心挑战在于在复杂约束条件下优化资源分配与时间安排,以提升医疗服务效率。解决方案的关键在于提出一种三阶段混合优化方法,结合了混合整数规划(Mixed-Integer Programming, MIP)、约束规划(Constraint Programming, CP)和模拟退火(Simulated Annealing)算法,并基于子问题分解策略进行协同求解。该框架不仅提升了求解质量,还首次为基准实例提供了下界估计,从而为后续研究提供了量化参考。

链接: https://arxiv.org/abs/2511.04685
作者: Daniela Guericke,Rolf van der Hulst,Asal Karimpour,Ieke Schrader,Matthias Walter
机构: 未知
类目: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: 23 pages, 2 figures, 10 tables

点击查看摘要

Abstract:We report about the algorithm, implementation and results submitted to the Integrated Healthcare Timetabling Competition 2024 by Team Twente, which scored third in the competition. Our approach combines mixed-integer programming, constraint programming and simulated annealing in a 3-phase solution approach based on decomposition into subproblems. Next to describing our approach and describing our design decisions, we share our insights and, for the first time, lower bounds on the optimal solution values for the benchmark instances. We finally highlight open problems for which we think that addressing them could improve our approach even further.

[AI-48] AI-Powered Citation Auditing: A Zero-Assumption Protocol for Systematic Reference Verification in Academic Research

【速读】:该论文旨在解决学术引用完整性(citation integrity)长期存在的问题,即约20%的引用存在错误,而传统人工核查需耗费数月专家时间。其解决方案的关键在于提出一种基于具工具调用能力的智能体AI(agentic AI)的全新验证方法,构建了无需预设假设的零假设验证协议(zero-assumption verification protocol),能够独立地对每一条参考文献在多个学术数据库(如Semantic Scholar、Google Scholar、CrossRef)中进行交叉验证,从而系统性识别伪造引用、撤稿文章、孤立引用及掠夺性期刊等严重问题。该方法在30篇涵盖本科至博士论文的文献中共验证2,581条参考文献,平均验证率达91.7%,且将原本耗时数月的手动审核缩短至90分钟,同时保持0.5%的极低假阳性率,显著提升了学术引用审查的效率与准确性。

链接: https://arxiv.org/abs/2511.04683
作者: L.J. Janse van Rensburg
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 10 pages, 1 table. Code and validation data available at this https URL

点击查看摘要

Abstract:Academic citation integrity faces persistent challenges, with research indicating 20% of citations contain errors and manual verification requiring months of expert time. This paper presents a novel AI-powered methodology for systematic, comprehensive reference auditing using agentic AI with tool-use capabilities. We develop a zero-assumption verification protocol that independently validates every reference against multiple academic databases (Semantic Scholar, Google Scholar, CrossRef) without assuming any citation is correct. The methodology was validated across 30 academic documents (2,581 references) spanning undergraduate projects to doctoral theses and peer-reviewed publications. Results demonstrate 91.7% average verification rate on published PLOS papers, with successful detection of fabricated references, retracted articles, orphan citations, and predatory journals. Time efficiency improved dramatically: 90-minute audits for 916-reference doctoral theses versus months of manual review. The system achieved 0.5% false positive rate while identifying critical issues manual review might miss. This work establishes the first validated AI-agent methodology for academic citation integrity, demonstrating practical applicability for supervisors, students, and institutional quality assurance.

[AI-49] Efficient Deployment of CNN Models on Multiple In-Memory Computing Units

【速读】:该论文旨在解决在基于内存计算(In-Memory Computing, IMC)的多处理单元(Processing Units, PUs)系统中部署卷积神经网络(Convolutional Neural Networks, CNNs)时,如何通过优化任务分配策略以提升处理速率并降低延迟的问题。解决方案的关键在于提出一种动态调度算法——负载均衡最长路径(Load-Balance-Longest-Path, LBLP),该算法能够根据IMC模拟器(IMCE)中可用PU资源,智能分配CNN各节点任务,从而实现计算资源的高效利用,最大化处理速率并最小化系统延迟。

链接: https://arxiv.org/abs/2511.04682
作者: Eleni Bougioukou,Theodore Antonakopoulos
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: 5 pages, 4 figures, 2025 14th International Conference on Modern Circuits and Systems Technologies (MOCAST)

点击查看摘要

Abstract:In-Memory Computing (IMC) represents a paradigm shift in deep learning acceleration by mitigating data movement bottlenecks and leveraging the inherent parallelism of memory-based computations. The efficient deployment of Convolutional Neural Networks (CNNs) on IMC-based hardware necessitates the use of advanced task allocation strategies for achieving maximum computational efficiency. In this work, we exploit an IMC Emulator (IMCE) with multiple Processing Units (PUs) for investigating how the deployment of a CNN model in a multi-processing system affects its performance, in terms of processing rate and latency. For that purpose, we introduce the Load-Balance-Longest-Path (LBLP) algorithm, that dynamically assigns all CNN nodes to the available IMCE PUs, for maximizing the processing rate and minimizing latency due to efficient resources utilization. We are benchmarking LBLP against other alternative scheduling strategies for a number of CNN models and experimental results demonstrate the effectiveness of the proposed algorithm.
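The abstract names the algorithm but does not describe its steps. The sketch below is a generic list-scheduling heuristic suggested by the name, not the authors' LBLP: nodes are prioritized by the length of the longest path from the node to the graph's sink, and each node goes to the currently least-loaded processing unit. The cost model, graph encoding, and function names are all assumptions.

```python
from graphlib import TopologicalSorter

def longest_path_to_sink(cost, succ):
    """rank[n] = cost[n] + the largest rank among n's successors (0 if none)."""
    # Passing successors as 'dependencies' makes static_order() yield sinks first.
    order = TopologicalSorter({n: succ.get(n, ()) for n in cost}).static_order()
    rank = {}
    for n in order:
        rank[n] = cost[n] + max((rank[s] for s in succ.get(n, ())), default=0)
    return rank

def assign_nodes(cost, succ, n_units):
    """Greedy placement: highest-rank node first, onto the least-loaded unit."""
    rank = longest_path_to_sink(cost, succ)
    load = [0.0] * n_units
    placement = {}
    for n in sorted(cost, key=lambda n: -rank[n]):
        pu = min(range(n_units), key=load.__getitem__)
        placement[n] = pu
        load[pu] += cost[n]
    return placement, load
```

In a pipelined deployment the most heavily loaded unit bounds the processing rate, so balancing load targets throughput, while prioritizing long paths targets latency. This sketch ignores inter-unit communication cost, which a real scheduler would likely need to model.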

[AI-50] Self-adaptive weighting and sampling for physics-informed neural networks

【速读】:该论文旨在解决物理信息神经网络(Physics-Informed Neural Networks, PINNs)在复杂偏微分方程(Partial Differential Equations, PDEs)求解中训练困难、精度和效率受限的问题。其解决方案的关键在于提出一种混合自适应采样与加权方法:自适应采样策略聚焦于识别解变化剧烈的区域以优化训练点分布,而自适应加权策略则平衡不同训练点间的收敛速率。二者协同作用可显著提升PINNs的预测准确性和训练效率,尤其在训练样本稀缺时表现出更强的鲁棒性。

链接: https://arxiv.org/abs/2511.05452
作者: Wenqian Chen,Amanda Howard,Panos Stinis
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
备注: 11 figures

点击查看摘要

Abstract:Physics-informed deep learning has emerged as a promising framework for solving partial differential equations (PDEs). Nevertheless, training these models on complex problems remains challenging, often leading to limited accuracy and efficiency. In this work, we introduce a hybrid adaptive sampling and weighting method to enhance the performance of physics-informed neural networks (PINNs). The adaptive sampling component identifies training points in regions where the solution exhibits rapid variation, while the adaptive weighting component balances the convergence rate across training points. Numerical experiments show that applying only adaptive sampling or only adaptive weighting is insufficient to consistently achieve accurate predictions, particularly when training points are scarce. Since each method emphasizes different aspects of the solution, their effectiveness is problem dependent. By combining both strategies, the proposed framework consistently improves prediction accuracy and training efficiency, offering a more robust approach for solving PDEs with PINNs.
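The adaptive sampling component concentrates training points where the solution varies rapidly. A minimal one-dimensional sketch of that idea follows, assuming the variation is measured by the magnitude of a finite-difference gradient of the current prediction and that points are drawn with probability proportional to it. The paper's actual criterion and its weighting scheme are not specified in the abstract, and the `tanh` profile is only a stand-in for a network output.

```python
import numpy as np

def sample_by_variation(candidates, u_values, n_points, rng):
    """Draw training points with probability proportional to |du/dx|."""
    variation = np.abs(np.gradient(u_values, candidates))
    prob = variation + 1e-12          # keep every candidate reachable
    prob = prob / prob.sum()
    idx = rng.choice(candidates.size, size=n_points, replace=False, p=prob)
    return np.sort(candidates[idx])

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 1001)      # candidate collocation points
u = np.tanh(20.0 * x)                 # stand-in prediction with a sharp layer at x = 0
points = sample_by_variation(x, u, n_points=50, rng=rng)
```

The selected points cluster around the steep layer near `x = 0`. Sampling alone leaves flat regions sparsely covered, which is one intuition for why the abstract pairs it with a weighting component.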

[AI-51] A Gate-Based Quantum Genetic Algorithm for Real-Valued Global Optimization

【速读】:该论文旨在解决实数域全局优化问题,传统遗传算法在高维复杂搜索空间中易陷入局部最优且收敛效率低。其解决方案的关键在于提出一种基于量子门(gate-based)的量子遗传算法(Quantum Genetic Algorithm, QGA),其中个体由量子电路表示,通过二进制离散化将测量结果解码为实值向量;进化算子直接作用于电路结构,实现对门编码空间的探索。该方法引入固定深度与可变深度两种变体以控制电路复杂度,并利用量子采样评估适应度,采用测量结果的均值作为目标函数输入。关键创新点包括:(1)利用哈达玛门(Hadamard gate)构建叠加态提升收敛速度和鲁棒性;(2)在种群中引入个体间成对纠缠(pairwise inter-individual entanglement),加速早期收敛,揭示量子关联能带来额外优化优势。实验表明,叠加与纠缠共同增强演化搜索动力学,使基于门的QGA成为量子增强全局优化的有力框架。

链接: https://arxiv.org/abs/2511.05254
作者: Leandro C. Souza,Laurent E. Dardenne,Renato Portugal
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 16 pages

点击查看摘要

Abstract:We propose a gate-based Quantum Genetic Algorithm (QGA) for real-valued global optimization. In this model, individuals are represented by quantum circuits whose measurement outcomes are decoded into real-valued vectors through binary discretization. Evolutionary operators act directly on circuit structures, allowing mutation and crossover to explore the space of gate-based encodings. Both fixed-depth and variable-depth variants are introduced, enabling either uniform circuit complexity or adaptive structural evolution. Fitness is evaluated through quantum sampling, using the mean decoded output of measurement outcomes as the argument of the objective function. To isolate the impact of quantum resources, we compare gate sets with and without the Hadamard gate, showing that superposition consistently improves convergence and robustness across benchmark functions such as the Rastrigin function. Furthermore, we demonstrate that introducing pairwise inter-individual entanglement in the population accelerates early convergence, revealing that quantum correlations among individuals provide an additional optimization advantage. Together, these results show that both superposition and entanglement enhance the search dynamics of evolutionary quantum algorithms, establishing gate-based QGAs as a promising framework for quantum-enhanced global optimization.
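The decoding and fitness steps described in the abstract can be sketched classically: each measured bitstring is split into per-variable groups, each group is mapped to a real value by binary discretization, and the mean decoded vector over all shots becomes the argument of the objective. The bit order (most significant first), the search range, and the function names are assumptions, and the measurement outcomes would come from a quantum circuit rather than being hard-coded.

```python
import numpy as np

def decode_shots(shots, n_vars, bits_per_var, lo, hi):
    """Map (n_shots, n_vars * bits_per_var) 0/1 outcomes to real vectors in [lo, hi]."""
    shots = np.asarray(shots).reshape(-1, n_vars, bits_per_var)
    weights = 2 ** np.arange(bits_per_var - 1, -1, -1)   # MSB-first (assumed)
    ints = shots @ weights                                # integer code per variable
    return lo + (hi - lo) * ints / (2 ** bits_per_var - 1)

def mean_decoded_output(shots, n_vars, bits_per_var, lo, hi):
    """Average the decoded vectors over all measurement shots."""
    return decode_shots(shots, n_vars, bits_per_var, lo, hi).mean(axis=0)

def rastrigin(x):
    """Rastrigin benchmark; global minimum 0 at the origin."""
    x = np.asarray(x, dtype=float)
    return float(10.0 * x.size + np.sum(x ** 2 - 10.0 * np.cos(2.0 * np.pi * x)))
```

Because fitness is evaluated at the mean of the decoded samples, a circuit whose outcomes spread symmetrically around a point is scored at that point, not at any single measured outcome.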

[AI-52] PECL: A Heterogeneous Parallel Multi-Domain Network for Radar-Based Human Activity Recognition

【速读】:该论文旨在解决雷达信号在人体动作分类中因仅依赖单一域特征而忽略时序依赖性,导致相似动作难以区分的问题。其解决方案的关键在于提出一种并行高效的混合网络架构——Parallel-EfficientNet-CBAM-LSTM(PECL),该架构同时处理Range-Time、Doppler-Time和Range-Doppler三个互补域的数据,并融合通道-空间注意力模块(Channel-Spatial Attention Module)与LSTM时序单元,以增强特征表达能力和动态行为建模能力,从而显著提升分类准确率与鲁棒性。

链接: https://arxiv.org/abs/2511.05039
作者: Jiuqi Yan,Chendong Xu,Dongyu Liu
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Radar systems are increasingly favored for medical applications because they provide non-intrusive monitoring with high privacy and robustness to lighting conditions. However, existing research typically relies on single-domain radar signals and overlooks the temporal dependencies inherent in human activity, which complicates the classification of similar actions. To address this issue, we designed the Parallel-EfficientNet-CBAM-LSTM (PECL) network to process data in three complementary domains: Range-Time, Doppler-Time, and Range-Doppler. PECL combines a channel-spatial attention module and temporal units to capture more features and dynamic dependencies during action sequences, improving both accuracy and robustness. The experimental results show that PECL achieves an accuracy of 96.16% on the same dataset, outperforming existing methods by at least 4.78%. PECL also performs best in distinguishing between easily confused actions. Despite its strong performance, PECL maintains moderate model complexity, with 23.42M parameters and 1324.82M FLOPs. Its parameter-efficient design further reduces computational cost.

Machine Learning

[LG-0] SoilX: Calibration-Free Comprehensive Soil Sensing Through Contrastive Cross-Component Learning

Link: https://arxiv.org/abs/2511.05482
Authors: Kang Yang, Yuanlin Yang, Yuning Chen, Sikai Yang, Xinyu Zhang, Wan Du
Subjects: Machine Learning (cs.LG)

Abstract: Precision agriculture demands continuous and accurate monitoring of soil moisture (M) and key macronutrients, including nitrogen (N), phosphorus (P), and potassium (K), to optimize yields and conserve resources. Wireless soil sensing has been explored to measure these four components; however, current solutions require recalibration (i.e., retraining the data processing model) to handle variations in soil texture, characterized by aluminosilicates (Al) and organic carbon (C), limiting their practicality. To address this, we introduce SoilX, a calibration-free soil sensing system that jointly measures six key components: M, N, P, K, C, Al. By explicitly modeling C and Al, SoilX eliminates texture- and carbon-dependent recalibration. SoilX incorporates Contrastive Cross-Component Learning (3CL), with two customized terms: the Orthogonality Regularizer and the Separation Loss, to effectively disentangle cross-component interference. Additionally, we design a novel tetrahedral antenna array with an antenna-switching mechanism, which can robustly measure soil dielectric permittivity independent of device placement. Extensive experiments demonstrate that SoilX reduces estimation errors by 23.8% to 31.5% over baselines and generalizes well to unseen fields.

[LG-1] A Metamorphic Testing Perspective on Knowledge Distillation for Language Models of Code: Does the Student Deeply Mimic the Teacher?

Link: https://arxiv.org/abs/2511.05476
Authors: Md. Abdul Awal, Mrigank Rochan, Chanchal K. Roy
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments: The paper is currently under review at a peer-reviewed journal

Abstract:Transformer-based language models of code have achieved state-of-the-art performance across a wide range of software analytics tasks, but their practical deployment remains limited due to high computational costs, slow inference speeds, and significant environmental impact. To address these challenges, recent research has increasingly explored knowledge distillation as a method for compressing a large language model of code (the teacher) into a smaller model (the student) while maintaining performance. However, the degree to which a student model deeply mimics the predictive behavior and internal representations of its teacher remains largely unexplored, as current accuracy-based evaluation provides only a surface-level view of model quality and often fails to capture more profound discrepancies in behavioral fidelity between the teacher and student models. To address this gap, we empirically show that the student model often fails to deeply mimic the teacher model, resulting in up to 285% greater performance drop under adversarial attacks, which is not captured by traditional accuracy-based evaluation. Therefore, we propose MetaCompress, a metamorphic testing framework that systematically evaluates behavioral fidelity by comparing the outputs of teacher and student models under a set of behavior-preserving metamorphic relations. We evaluate MetaCompress on two widely studied tasks, using compressed versions of popular language models of code, obtained via three different knowledge distillation techniques: Compressor, AVATAR, and MORPH. The results show that MetaCompress identifies up to 62% behavioral discrepancies in student models, underscoring the need for behavioral fidelity evaluation within the knowledge distillation pipeline and establishing MetaCompress as a practical framework for testing compressed language models of code derived through knowledge distillation.

[LG-2] Precipitation nowcasting of satellite data using physically conditioned neural networks

Link: https://arxiv.org/abs/2511.05471
Authors: Antônio Catão, Melvin Poveda, Leonardo Voltarelli, Paulo Orenstein
Subjects: Machine Learning (cs.LG)

Abstract:Accurate short-term precipitation forecasts predominantly rely on dense weather-radar networks, limiting operational value in places most exposed to climate extremes. We present TUPANN (Transferable and Universal Physics-Aligned Nowcasting Network), a satellite-only model trained on GOES-16 RRQPE. Unlike most deep learning models for nowcasting, TUPANN decomposes the forecast into physically meaningful components: a variational encoder-decoder infers motion and intensity fields from recent imagery under optical-flow supervision, a lead-time-conditioned MaxViT evolves the latent state, and a differentiable advection operator reconstructs future frames. We evaluate TUPANN on both GOES-16 and IMERG data, in up to four distinct climates (Rio de Janeiro, Manaus, Miami, La Paz) at 10-180min lead times using the CSI and HSS metrics over 4-64 mm/h thresholds. Comparisons against optical-flow, deep learning and hybrid baselines show that TUPANN achieves the best or second-best skill in most settings, with pronounced gains at higher thresholds. Training on multiple cities further improves performance, while cross-city experiments show modest degradation and occasional gains for rare heavy-rain regimes. The model produces smooth, interpretable motion fields aligned with numerical optical flow and runs in near real time due to the low latency of GOES-16. These results indicate that physically aligned learning can provide nowcasts that are skillful, transferable and global.

[LG-3] Synapse: Adaptive Arbitration of Complementary Expertise in Time Series Foundational Models

Link: https://arxiv.org/abs/2511.05460
Authors: Sarkar Snigdha Sarathi Das, Palash Goyal, Mihir Parmar, Yiwen Song, Long T. Le, Lesly Miculicich, Jinsung Yoon, Rui Zhang, Hamid Palangi, Tomas Pfister
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 19 pages, 7 figures, 4 tables

Abstract:Pre-trained Time Series Foundational Models (TSFMs) represent a significant advance, capable of forecasting diverse time series with complex characteristics, including varied seasonalities, trends, and long-range dependencies. Despite their primary goal of universal time series forecasting, their efficacy is far from uniform; divergent training protocols and data sources cause individual TSFMs to exhibit highly variable performance across different forecasting tasks, domains, and horizons. Leveraging this complementary expertise by arbitrating existing TSFM outputs presents a compelling strategy, yet this remains a largely unexplored area of research. In this paper, we conduct a thorough examination of how different TSFMs exhibit specialized performance profiles across various forecasting settings, and how we can effectively leverage this behavior in arbitration between different time series models. We specifically analyze how factors such as model selection and forecast horizon distribution can influence the efficacy of arbitration strategies. Based on this analysis, we propose Synapse, a novel arbitration framework for TSFMs. Synapse is designed to dynamically leverage a pool of TSFMs, assign and adjust predictive weights based on their relative, context-dependent performance, and construct a robust forecast distribution by adaptively sampling from the output quantiles of constituent models. Experimental results demonstrate that Synapse consistently outperforms other popular ensembling techniques as well as individual TSFMs, demonstrating Synapse’s efficacy in time series forecasting.

[LG-4] Parameter-Efficient Conditioning for Material Generalization in Graph-Based Simulators

Link: https://arxiv.org/abs/2511.05456
Authors: Naveen Raj Manoharan, Hassan Iqbal, Krishna Kumar
Subjects: Machine Learning (cs.LG)

Abstract:Graph network-based simulators (GNS) have demonstrated strong potential for learning particle-based physics (such as fluids, deformable solids, and granular flows) while generalizing to unseen geometries due to their inherent inductive biases. However, existing models are typically trained for a single material type and fail to generalize across distinct constitutive behaviors, limiting their applicability in real-world engineering settings. Using granular flows as a running example, we propose a parameter-efficient conditioning mechanism that makes the GNS model adaptive to material parameters. We identify that sensitivity to material properties is concentrated in the early message-passing (MP) layers, a finding we link to the local nature of constitutive models (e.g., Mohr-Coulomb) and their effects on information propagation. We empirically validate this by showing that fine-tuning only the first few (1-5) of 10 MP layers of a pretrained model achieves comparable test performance as compared to fine-tuning the entire network. Building on this insight, we propose a parameter-efficient Feature-wise Linear Modulation (FiLM) conditioning mechanism designed to specifically target these early layers. This approach produces accurate long-term rollouts on unseen, interpolated, or moderately extrapolated values (e.g., up to 2.5 degrees for friction angle and 0.25 kPa for cohesion) when trained exclusively on as few as 12 short simulation trajectories from new materials, representing a 5-fold data reduction compared to a baseline multi-task learning method. Finally, we validate the model’s utility by applying it to an inverse problem, successfully identifying unknown cohesion parameters from trajectory data. This approach enables the use of GNS in inverse design and closed-loop control tasks where material properties are treated as design variables.
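
Feature-wise Linear Modulation (FiLM), which this abstract applies to the early message-passing layers, scales and shifts each feature channel using parameters predicted from a conditioning vector. The sketch below is a minimal pure-Python illustration of that operation only. It assumes a single linear map from material parameters (for example friction angle and cohesion) to the scale `gamma` and shift `beta`; the function names and weight shapes are hypothetical and are not taken from the paper.

```python
def linear(weights, bias, x):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def film(features, condition, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM: scale and shift each feature using parameters predicted from the condition."""
    gamma = linear(w_gamma, b_gamma, condition)  # per-feature scale
    beta = linear(w_beta, b_beta, condition)     # per-feature shift
    return [g * h + b for g, h, b in zip(gamma, features, beta)]
```

With zero weights, `b_gamma` set to ones, and `b_beta` set to zeros, the modulation reduces to the identity, which is a common way to initialize such a layer so that conditioning starts from the pretrained behavior.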

[LG-5] Adversarially Robust Multitask Adaptive Control

Link: https://arxiv.org/abs/2511.05444
Authors: Kasra Fallah, Leonardo F. Toso, James Anderson
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)

Abstract:We study adversarially robust multitask adaptive linear quadratic control; a setting where multiple systems collaboratively learn control policies under model uncertainty and adversarial corruption. We propose a clustered multitask approach that integrates clustering and system identification with resilient aggregation to mitigate corrupted model updates. Our analysis characterizes how clustering accuracy, intra-cluster heterogeneity, and adversarial behavior affect the expected regret of certainty-equivalent (CE) control across LQR tasks. We establish non-asymptotic bounds demonstrating that the regret decreases inversely with the number of honest systems per cluster and that this reduction is preserved under a bounded fraction of adversarial systems within each cluster.
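
The abstract does not say which resilient aggregation rule is used. As an illustration of the general idea only, the sketch below uses the coordinate-wise median, a common robust aggregator that tolerates a minority of arbitrarily corrupted updates. The function name and the flat-list representation of model updates are assumptions made for this example.

```python
from statistics import median

def coordinatewise_median(updates):
    """Aggregate equally sized parameter vectors by taking the median of each coordinate."""
    return [median(coord) for coord in zip(*updates)]
```

Compared with a plain average, a single adversarial update with very large entries cannot drag the aggregate arbitrarily far, as long as honest systems form a majority in each coordinate.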

[LG-6] Diffusion-Based Electromagnetic Inverse Design of Scattering Structured Media NEURIPS2025

Link: https://arxiv.org/abs/2511.05357
Authors: Mikhail Tsukerman, Konstantin Grotov, Pavel Ginzburg
Subjects: Machine Learning (cs.LG); Applied Physics (physics.app-ph); Computational Physics (physics.comp-ph)
Comments: Accepted to Machine Learning and the Physical Sciences Workshop, NeurIPS 2025

Abstract:We present a conditional diffusion model for electromagnetic inverse design that generates structured media geometries directly from target differential scattering cross-section profiles, bypassing expensive iterative optimization. Our 1D U-Net architecture with Feature-wise Linear Modulation learns to map desired angular scattering patterns to 2x2 dielectric sphere structure, naturally handling the non-uniqueness of inverse problems by sampling diverse valid designs. Trained on 11,000 simulated metasurfaces, the model achieves median MPE below 19% on unseen targets (best: 1.39%), outperforming CMA-ES evolutionary optimization while reducing design time from hours to seconds. These results demonstrate that employing diffusion models is promising for advancing electromagnetic inverse design research, potentially enabling rapid exploration of complex metasurface architectures and accelerating the development of next-generation photonic and wireless communication systems. The code is publicly available at this https URL.

[LG-7] SAD-Flower: Flow Matching for Safe Admissible and Dynamically Consistent Planning

Link: https://arxiv.org/abs/2511.05355
Authors: Tzu-Yuan Huang, Armin Lederer, Dai-Jie Wu, Xiaobing Dai, Sihua Zhang, Stefan Sosnowski, Shao-Hua Sun, Sandra Hirche
Subjects: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)

Abstract:Flow matching (FM) has shown promising results in data-driven planning. However, it inherently lacks formal guarantees for ensuring state and action constraints, whose satisfaction is a fundamental and crucial requirement for the safety and admissibility of planned trajectories on various systems. Moreover, existing FM planners do not ensure the dynamical consistency, which potentially renders trajectories inexecutable. We address these shortcomings by proposing SAD-Flower, a novel framework for generating Safe, Admissible, and Dynamically consistent trajectories. Our approach relies on an augmentation of the flow with a virtual control input. Thereby, principled guidance can be derived using techniques from nonlinear control theory, providing formal guarantees for state constraints, action constraints, and dynamic consistency. Crucially, SAD-Flower operates without retraining, enabling test-time satisfaction of unseen constraints. Through extensive experiments across several tasks, we demonstrate that SAD-Flower outperforms various generative-model-based baselines in ensuring constraint satisfaction.

[LG-8] Learning Dynamics from Input-Output Data with Hamiltonian Gaussian Processes

Link: https://arxiv.org/abs/2511.05330
Authors: Jan-Hendrik Ewering, Robin E. Herrmann, Niklas Wahlström, Thomas B. Schön, Thomas Seel
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 17 pages, 5 figures

Abstract:Embedding non-restrictive prior knowledge, such as energy conservation laws, in learning-based approaches is a key motive to construct physically consistent models from limited data, relevant for, e.g., model-based control. Recent work incorporates Hamiltonian dynamics into Gaussian Process (GP) regression to obtain uncertainty-quantifying models that adhere to the underlying physical principles. However, these works rely on velocity or momentum data, which is rarely available in practice. In this paper, we consider dynamics learning with non-conservative Hamiltonian GPs, and address the more realistic problem setting of learning from input-output data. We provide a fully Bayesian scheme for estimating probability densities of unknown hidden states, of GP hyperparameters, as well as of structural hyperparameters, such as damping coefficients. Considering the computational complexity of GPs, we take advantage of a reduced-rank GP approximation and leverage its properties for computationally efficient prediction and training. The proposed method is evaluated in a nonlinear simulation case study and compared to a state-of-the-art approach that relies on momentum measurements.

[LG-9] Turning Adversaries into Allies: Reversing Typographic Attacks for Multimodal E-Commerce Product Retrieval

Link: https://arxiv.org/abs/2511.05325
Authors: Janet Jenq, Hongda Shen
Subjects: Machine Learning (cs.LG)

Abstract:Multimodal product retrieval systems in e-commerce platforms rely on effectively combining visual and textual signals to improve search relevance and user experience. However, vision-language models such as CLIP are vulnerable to typographic attacks, where misleading or irrelevant text embedded in images skews model predictions. In this work, we propose a novel method that reverses the logic of typographic attacks by rendering relevant textual content (e.g., titles, descriptions) directly onto product images to perform vision-text compression, thereby strengthening image-text alignment and boosting multimodal product retrieval performance. We evaluate our method on three vertical-specific e-commerce datasets (sneakers, handbags, and trading cards) using six state-of-the-art vision foundation models. Our experiments demonstrate consistent improvements in unimodal and multimodal retrieval accuracy across categories and model families. Our findings suggest that visually rendering product metadata is a simple yet effective enhancement for zero-shot multimodal retrieval in e-commerce applications.

[LG-10] Attention and Compression is all you need for Controllably Efficient Language Models

Link: https://arxiv.org/abs/2511.05313
Authors: Jatin Prakash, Aahlad Puli, Rajesh Ranganath
Subjects: Machine Learning (cs.LG)
Comments: Preprint

Abstract: The quadratic cost of attention in transformers motivated the development of efficient approaches: namely sparse and sliding window attention, convolutions and linear attention. Although these approaches result in impressive reductions in compute and memory, they often trade off quality, specifically in-context recall performance. Moreover, a priori fixing this quality-compute tradeoff means being suboptimal from the get-go: some downstream applications require more memory for in-context recall, while others require lower latency and memory. Further, these approaches rely on heuristic choices that artificially restrict attention, or require handcrafted and complex recurrent state update rules, or they must be carefully composed with attention at specific layers to form a hybrid architecture that complicates the design process, especially at scale. To address the above issues, we propose Compress Attend Transformer (CAT), a conceptually simple architecture employing two simple ingredients only: dense attention and compression. CAT decodes chunks of tokens by attending to compressed chunks of the sequence so far. Compression results in decoding from a reduced sequence length that yields compute and memory savings, while choosing a particular chunk size trades off quality for efficiency. Moreover, CAT can be trained with multiple chunk sizes at once, unlocking control of quality-compute trade-offs directly at test-time without any retraining, all in a single adaptive architecture. In exhaustive evaluations on common language modeling tasks, in-context recall, and long-context understanding, a single adaptive CAT model outperforms existing efficient baselines, including hybrid architectures, across different compute-memory budgets. Further, a single CAT matches a dense transformer in language modeling across model scales while being 1.4-3x faster and requiring 2-9x lower total memory usage.

[LG-11] Building Specialized Software-Assistant ChatBot with Graph-Based Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2511.05297
Authors: Mohammed Hilel, Yannis Karmim, Jean De Bodinat, Reda Sarehane, Antoine Gillon
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)

Abstract:Digital Adoption Platforms (DAPs) have become essential tools for helping employees navigate complex enterprise software such as CRM, ERP, or HRMS systems. Companies like LemonLearning have shown how digital guidance can reduce training costs and accelerate onboarding. However, building and maintaining these interactive guides still requires extensive manual effort. Leveraging Large Language Models as virtual assistants is an appealing alternative, yet without a structured understanding of the target software, LLMs often hallucinate and produce unreliable answers. Moreover, most production-grade LLMs are black-box APIs, making fine-tuning impractical due to the lack of access to model weights. In this work, we introduce a Graph-based Retrieval-Augmented Generation framework that automatically converts enterprise web applications into state-action knowledge graphs, enabling LLMs to generate grounded and context-aware assistance. The framework was co-developed with the AI enterprise RAKAM, in collaboration with Lemon Learning. We detail the engineering pipeline that extracts and structures software interfaces, the design of the graph-based retrieval process, and the integration of our approach into production DAP workflows. Finally, we discuss scalability, robustness, and deployment lessons learned from industrial use cases.

[LG-12] Embedding-Space Data Augmentation to Prevent Membership Inference Attacks in Clinical Time Series Forecasting ML4H ALT

Link: https://arxiv.org/abs/2511.05289
Authors: Marius Fracarolli, Michael Staniek, Stefan Riezler
Subjects: Machine Learning (cs.LG)
Comments: Accepted as a proceedings paper at Machine Learning for Health (ML4H) symposium 2025, December 1-2, 2025, San Diego, United States, 15 pages

Abstract:Balancing strong privacy guarantees with high predictive performance is critical for time series forecasting (TSF) tasks involving Electronic Health Records (EHR). In this study, we explore how data augmentation can mitigate Membership Inference Attacks (MIA) on TSF models. We show that retraining with synthetic data can substantially reduce the effectiveness of loss-based MIAs by reducing the attacker’s true-positive to false-positive ratio. The key challenge is generating synthetic samples that closely resemble the original training data to confuse the attacker, while also introducing enough novelty to enhance the model’s ability to generalize to unseen data. We examine multiple augmentation strategies - Zeroth-Order Optimization (ZOO), a variant of ZOO constrained by Principal Component Analysis (ZOO-PCA), and MixUp - to strengthen model resilience without sacrificing accuracy. Our experimental results show that ZOO-PCA yields the best reductions in TPR/FPR ratio for MIA attacks without sacrificing performance on test data.
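
Of the augmentation strategies named in this abstract, MixUp is the simplest to state: a synthetic sample is a convex combination of two training samples, with the mixing weight drawn from a Beta distribution. The sketch below shows the generic form on plain lists. The default `alpha=0.2` and the function signature are assumptions for illustration, and the paper's embedding-space variant and the ZOO and ZOO-PCA strategies are not reproduced here.

```python
import random

def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Return a convex combination of two (input, target) pairs and the mixing weight."""
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    x = [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible. A small `alpha` concentrates the weight near 0 or 1, so most synthetic samples stay close to one of the two originals.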

[LG-13] TwinVLA: Data-Efficient Bimanual Manipulation with Twin Single-Arm Vision-Language-Action Models

Link: https://arxiv.org/abs/2511.05275
Authors: Hokyun Im, Euijin Jeong, Jianlong Fu, Andrey Kolobov, Youngwoon Lee
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: Project webpage: this https URL

Abstract: Vision-language-action models (VLAs) trained on large-scale robotic datasets have demonstrated strong performance on manipulation tasks, including bimanual tasks. However, because most public datasets focus on single-arm demonstrations, adapting VLAs for bimanual tasks typically requires substantial additional bimanual data and fine-tuning. To address this challenge, we introduce TwinVLA, a modular framework that composes two copies of a pretrained single-arm VLA into a coordinated bimanual VLA. Unlike monolithic cross-embodiment models trained on mixtures of single-arm and bimanual data, TwinVLA improves both data efficiency and performance by composing pretrained single-arm policies. Across diverse bimanual tasks in real-world and simulation settings, TwinVLA outperforms a comparably-sized monolithic RDT-1B model without requiring any bimanual pretraining. Furthermore, it narrows the gap to the state-of-the-art model $\pi_0$, which relies on extensive proprietary bimanual data and compute. These results establish our modular composition approach as a data-efficient and scalable path toward high-performance bimanual manipulation, leveraging public single-arm data.

[LG-14] The Causal Round Trip: Generating Authentic Counterfactuals by Eliminating Information Loss

Link: https://arxiv.org/abs/2511.05236
Authors: Rui Wu, Lizheng Wang, Yongjun Li
Subjects: Machine Learning (cs.LG)
Comments: 50 pages, 10 figures. Submitted to the Journal of Machine Learning Research (JMLR). Keywords: Causal Inference, Diffusion Models, Causal Information Conservation, Structural Causal Models, Counterfactual Generation, BELM, Structural Reconstruction Error

Abstract:Judea Pearl’s vision of Structural Causal Models (SCMs) as engines for counterfactual reasoning hinges on faithful abduction: the precise inference of latent exogenous noise. For decades, operationalizing this step for complex, non-linear mechanisms has remained a significant computational challenge. The advent of diffusion models, powerful universal function approximators, offers a promising solution. However, we argue that their standard design, optimized for perceptual generation over logical inference, introduces a fundamental flaw for this classical problem: an inherent information loss we term the Structural Reconstruction Error (SRE). To address this challenge, we formalize the principle of Causal Information Conservation (CIC) as the necessary condition for faithful abduction. We then introduce BELM-MDCM, the first diffusion-based framework engineered to be causally sound by eliminating SRE by construction through an analytically invertible mechanism. To operationalize this framework, a Targeted Modeling strategy provides structural regularization, while a Hybrid Training Objective instills a strong causal inductive bias. Rigorous experiments demonstrate that our Zero-SRE framework not only achieves state-of-the-art accuracy but, more importantly, enables the high-fidelity, individual-level counterfactuals required for deep causal inquiries. Our work provides a foundational blueprint that reconciles the power of modern generative models with the rigor of classical causal theory, establishing a new and more rigorous standard for this emerging field.

[LG-15] Context-aware Learned Mesh-based Simulation via Trajectory-Level Meta-Learning

Link: https://arxiv.org/abs/2511.05234
Authors: Philipp Dahlinger, Niklas Freymuth, Tai Hoang, Tobias Würth, Michael Volpp, Luise Kärger, Gerhard Neumann
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 35 pages. Submitted to Transactions on Machine Learning Research (TMLR)

Abstract:Simulating object deformations is a critical challenge across many scientific domains, including robotics, manufacturing, and structural mechanics. Learned Graph Network Simulators (GNSs) offer a promising alternative to traditional mesh-based physics simulators. Their speed and inherent differentiability make them particularly well suited for applications that require fast and accurate simulations, such as robotic manipulation or manufacturing optimization. However, existing learned simulators typically rely on single-step observations, which limits their ability to exploit temporal context. Without this information, these models fail to infer, e.g., material properties. Further, they rely on auto-regressive rollouts, which quickly accumulate error for long trajectories. We instead frame mesh-based simulation as a trajectory-level meta-learning problem. Using Conditional Neural Processes, our method enables rapid adaptation to new simulation scenarios from limited initial data while capturing their latent simulation properties. We utilize movement primitives to directly predict fast, stable and accurate simulations from a single model call. The resulting approach, Movement-primitive Meta-MeshGraphNet (M3GN), provides higher simulation accuracy at a fraction of the runtime cost compared to state-of-the-art GNSs across several tasks.

[LG-16] ActiTect: A Generalizable Machine Learning Pipeline for REM Sleep Behavior Disorder Screening through Standardized Actigraphy

Link: https://arxiv.org/abs/2511.05221
Authors: David Bertram, Anja Ophey, Sinah Röttgen, Konstantin Kuffer, Gereon R. Fink, Elke Kalbe, Clint Hansen, Walter Maetzler, Maximilian Kapsecker, Lara M. Reimer, Stephan Jonas, Andreas T. Damgaard, Natasha B. Bertelsen, Casper Skjaerbaek, Per Borghammer, Karolien Groenewald, Pietro-Luca Ratti, Michele T. Hu, Noémie Moreau, Michael Sommerauer, Katarzyna Bozek
Subjects: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: 30 pages including supplement, 4 core figures, 1 supplement figure

Abstract: Isolated rapid eye movement sleep behavior disorder (iRBD) is a major prodromal marker of $\alpha$-synucleinopathies, often preceding the clinical onset of Parkinson’s disease, dementia with Lewy bodies, or multiple system atrophy. While wrist-worn actimeters hold significant potential for detecting RBD in large-scale screening efforts by capturing abnormal nocturnal movements, they become inoperable without a reliable and efficient analysis pipeline. This study presents ActiTect, a fully automated, open-source machine learning tool to identify RBD from actigraphy recordings. To ensure generalizability across heterogeneous acquisition settings, our pipeline includes robust preprocessing and automated sleep-wake detection to harmonize multi-device data and extract physiologically interpretable motion features characterizing activity patterns. Model development was conducted on a cohort of 78 individuals, yielding strong discrimination under nested cross-validation (AUROC = 0.95). Generalization was confirmed on a blinded local test set (n = 31, AUROC = 0.86) and on two independent external cohorts (n = 113, AUROC = 0.84; n = 57, AUROC = 0.94). To assess real-world robustness, leave-one-dataset-out cross-validation across the internal and external cohorts demonstrated consistent performance (AUROC range = 0.84-0.89). A complementary stability analysis showed that key predictive features remained reproducible across datasets, supporting the final pooled multi-center model as a robust pre-trained resource for broader deployment. By being open-source and easy to use, our tool promotes widespread adoption and facilitates independent validation and collaborative improvements, thereby advancing the field toward a unified and generalizable RBD detection model using wearable devices.

[LG-17] Linear Gradient Prediction with Control Variates

Link: https://arxiv.org/abs/2511.05187
Authors: Kamil Ciosek, Nicolò Felicioni, Juan Elenter Litwin
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We propose a new way of training neural networks, with the goal of reducing training cost. Our method uses approximate predicted gradients instead of the full gradients that require an expensive backward pass. We derive a control-variate-based technique that ensures our updates are unbiased estimates of the true gradient. Moreover, we propose a novel way to derive a predictor for the gradient inspired by the theory of the Neural Tangent Kernel. We empirically show the efficacy of the technique on a vision transformer classification task.
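
The abstract does not spell out the estimator, so the following is only one generic way a control variate can keep an update unbiased while mostly relying on a cheap predicted gradient. With probability `p` the expensive true gradient is computed and used to correct the prediction, and otherwise the prediction is used alone. The function name, the sampling scheme, and the parameter `p` are assumptions for this sketch and should not be read as the paper's method.

```python
import random

def cv_gradient(true_grad_fn, predicted_grad, p, rng=random):
    """Unbiased gradient estimate built from a cheap prediction.

    With probability p, call true_grad_fn() and return h + (g - h) / p;
    otherwise return the prediction h. The expectation equals g for any h.
    """
    if rng.random() < p:
        g = true_grad_fn()  # expensive backward pass, done only occasionally
        return [h + (gi - h) / p for gi, h in zip(g, predicted_grad)]
    return list(predicted_grad)
```

The expectation is p * (h + (g - h) / p) + (1 - p) * h = g, so the estimate is unbiased regardless of how good the predictor is. A better predictor shrinks the correction term and therefore the variance.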

[LG-18] Associative Poisoning to Generative Machine Learning

Link: https://arxiv.org/abs/2511.05177
Authors: Mathias Lundteigen Mohus, Jingyue Li, Zhirong Yang
Subjects: Machine Learning (cs.LG)

Abstract:The widespread adoption of generative models such as Stable Diffusion and ChatGPT has made them increasingly attractive targets for malicious exploitation, particularly through data poisoning. Existing poisoning attacks compromising synthesised data typically either cause broad degradation of generated data or require control over the training process, limiting their applicability in real-world scenarios. In this paper, we introduce a novel data poisoning technique called associative poisoning, which compromises fine-grained features of the generated data without requiring control of the training process. This attack perturbs only the training data to manipulate statistical associations between specific feature pairs in the generated outputs. We provide a formal mathematical formulation of the attack and prove its theoretical feasibility and stealthiness. Empirical evaluations using two state-of-the-art generative models demonstrate that associative poisoning effectively induces or suppresses feature associations while preserving the marginal distributions of the targeted features and maintaining high-quality outputs, thereby evading visual detection. These results suggest that generative systems used in image synthesis, synthetic dataset generation, and natural language processing are susceptible to subtle, stealthy manipulations that compromise their statistical integrity. To address this risk, we examine the limitations of existing defensive strategies and propose a novel countermeasure strategy.

[LG-19] Multimodal Deep Learning for Prediction of Progression-Free Survival in Patients with Neuroendocrine Tumors Undergoing 177Lu-based Peptide Receptor Radionuclide Therapy

Link: https://arxiv.org/abs/2511.05169
Authors: Simon Baur, Tristan Ruhwedel, Ekin Böke, Zuzanna Kobus, Gergana Lishkova, Christoph Wetz, Holger Amthauer, Christoph Roderburg, Frank Tacke, Julian M. Rogasch, Wojciech Samek, Henning Jann, Jackie Ma, Johannes Eschrich
Subjects: Machine Learning (cs.LG)

Abstract:Peptide receptor radionuclide therapy (PRRT) is an established treatment for metastatic neuroendocrine tumors (NETs), yet long-term disease control occurs only in a subset of patients. Predicting progression-free survival (PFS) could support individualized treatment planning. This study evaluates laboratory, imaging, and multimodal deep learning models for PFS prediction in PRRT-treated patients. This retrospective, single-center study included 116 patients with metastatic NETs undergoing 177Lu-DOTATOC PRRT. Clinical characteristics, laboratory values, and pretherapeutic somatostatin receptor positron emission tomography/computed tomography (SR-PET/CT) scans were collected. Seven models were trained to classify low- vs. high-PFS groups, including unimodal (laboratory, SR-PET, or CT) and multimodal fusion approaches. Explainability was evaluated by feature importance analysis and gradient maps. Forty-two patients (36%) had short PFS (< 1 year) and 74 patients had long PFS (> 1 year). Groups were similar in most characteristics, except for higher baseline chromogranin A (p = 0.003), elevated gamma-GT (p = 0.002), and fewer PRRT cycles (p < 0.001) in short-PFS patients. The Random Forest model trained only on laboratory biomarkers reached an AUROC of 0.59 ± 0.02. Unimodal three-dimensional convolutional neural networks using SR-PET or CT performed worse (AUROC 0.42 ± 0.03 and 0.54 ± 0.01, respectively). A multimodal fusion model combining laboratory values, SR-PET, and CT, augmented with a pretrained CT branch, achieved the best results (AUROC 0.72 ± 0.01, AUPRC 0.80 ± 0.01). Multimodal deep learning combining SR-PET, CT, and laboratory biomarkers outperformed unimodal approaches for PFS prediction after PRRT. Upon external validation, such models may support risk-adapted follow-up strategies.

[LG-20] Consecutive Preferential Bayesian Optimization

Link: https://arxiv.org/abs/2511.05163
Authors: Aras Erarslan, Carlos Sevilla Salcedo, Ville Tanskanen, Anni Nisov, Eero Päiväkumpu, Heikki Aisala, Kaisu Honkapää, Arto Klami, Petrus Mikkola
Subjects: Machine Learning (cs.LG)

Abstract:Preferential Bayesian optimization allows optimization of objectives that are either expensive or difficult to measure directly, by relying on a minimal number of comparative evaluations done by a human expert. Generating candidate solutions for evaluation is also often expensive, but this cost is ignored by existing methods. We generalize preference-based optimization to explicitly account for production and evaluation costs with Consecutive Preferential Bayesian Optimization, reducing production cost by constraining comparisons to involve previously generated candidates. We also account for the perceptual ambiguity of the oracle providing the feedback by incorporating a Just-Noticeable Difference threshold into a probabilistic preference model to capture indifference to small utility differences. We adapt an information-theoretic acquisition strategy to this setting, selecting new configurations that are most informative about the unknown optimum under a preference model accounting for the perceptual ambiguity. We empirically demonstrate a notable increase in accuracy in setups with high production costs or with indifference feedback.
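
One way to picture a preference model with a Just-Noticeable Difference threshold is the sketch below. It is an assumption made for illustration and not necessarily the likelihood used in the paper: the oracle is taken to perceive the utility difference plus Gaussian noise, to prefer an option only when that perceived difference exceeds the threshold `jnd`, and to report indifference otherwise. The names `preference_probs`, `jnd`, and `noise_std` are hypothetical.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def preference_probs(u_a, u_b, jnd, noise_std=1.0):
    """Return (P[a preferred], P[b preferred], P[indifferent]).

    Assumed model: z = (u_a - u_b) + eps with eps ~ N(0, noise_std^2);
    a wins if z > jnd, b wins if z < -jnd, otherwise the oracle is indifferent.
    """
    diff = (u_a - u_b) / noise_std
    thr = jnd / noise_std
    p_a = normal_cdf(diff - thr)
    p_b = normal_cdf(-diff - thr)
    p_indiff = 1.0 - p_a - p_b
    return p_a, p_b, p_indiff

# Equal utilities: the two preference outcomes are symmetric and
# indifference becomes more likely as the threshold grows.
narrow = preference_probs(0.0, 0.0, jnd=0.5)
wide = preference_probs(0.0, 0.0, jnd=1.0)
clear = preference_probs(3.0, 0.0, jnd=0.5)
```

With `jnd = 0` the model reduces to an ordinary probit preference likelihood with no indifference outcome.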

[LG-21] Follow-Me in Micro-Mobility with End-to-End Imitation Learning

Link: https://arxiv.org/abs/2511.05158
Authors: Sahar Salimpour, Iacopo Catalano, Tomi Westerlund, Mohsen Falahi, Jorge Peña Queralta
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:Autonomous micro-mobility platforms face challenges from the perspective of the typical deployment environment: large indoor spaces or urban areas that are potentially crowded and highly dynamic. While social navigation algorithms have progressed significantly, optimizing user comfort and overall user experience over other typical metrics in robotics (e.g., time or distance traveled) is understudied. Specifically, these metrics are critical in commercial applications. In this paper, we show how imitation learning delivers smoother and overall better controllers, versus previously used manually-tuned controllers. We demonstrate how DAAV’s autonomous wheelchair achieves state-of-the-art comfort in follow-me mode, in which it follows a human operator assisting persons with reduced mobility (PRM). This paper analyzes different neural network architectures for end-to-end control and demonstrates their usability in real-world production-level deployments.

[LG-22] QuAnTS: Question Answering on Time Series

Link: https://arxiv.org/abs/2511.05124
Authors: Felix Divo, Maurice Kraus, Anh Q. Nguyen, Hao Xue, Imran Razzak, Flora D. Salim, Kristian Kersting, Devendra Singh Dhami
Subjects: Machine Learning (cs.LG)

Abstract:Text offers intuitive access to information. This can, in particular, complement the density of numerical time series, thereby allowing improved interactions with time series models to enhance accessibility and decision-making. While the creation of question-answering datasets and models has recently seen remarkable growth, most research focuses on question answering (QA) on vision and text, with time series receiving minute attention. To bridge this gap, we propose a challenging novel time series QA (TSQA) dataset, QuAnTS, for Question Answering on Time Series data. Specifically, we pose a wide variety of questions and answers about human motion in the form of tracked skeleton trajectories. We verify that the large-scale QuAnTS dataset is well-formed and comprehensive through extensive experiments. Thoroughly evaluating existing and newly proposed baselines then lays the groundwork for a deeper exploration of TSQA using QuAnTS. Additionally, we provide human performances as a key reference for gauging the practical usability of such models. We hope to encourage future research on interacting with time series models through text, enabling better decision-making and more transparent systems.

[LG-23] Usando LLMs para Programar Jogos de Tabuleiro e Variações

Link: https://arxiv.org/abs/2511.05114
Authors: Álvaro Guglielmin Becker, Lana Bertoldo Rossato, Anderson Rocha Tavares
Subjects: Machine Learning (cs.LG)
Comments: Accepted for presentation at the I Escola Regional de Aprendizado de Máquina e Inteligência Artificial da Região Sul, 2025, in Portuguese language

Abstract:Creating programs to represent board games can be a time-consuming task. Large Language Models (LLMs) arise as appealing tools to expedite this process, given their capacity to efficiently generate code from simple contextual information. In this work, we propose a method to test how capable three LLMs (Claude, DeepSeek and ChatGPT) are at creating code for board games, as well as new variants of existing games.

[LG-24] Carbon Price Forecasting with Structural Breaks: A Comparative Study of Deep Learning Models

Link: https://arxiv.org/abs/2511.04988
Authors: Runsheng Ren, Jing Li, Yanxiu Li, Shixun Huang, Jun Shen, Wanqing Li, John Le, Sheng Wang
Subjects: Machine Learning (cs.LG)

Abstract:Accurately forecasting carbon prices is essential for informed energy market decision-making, guiding sustainable energy planning, and supporting effective decarbonization strategies. However, it remains challenging due to structural breaks and high-frequency noise caused by frequent policy interventions and market shocks. Existing studies, including the most recent baseline approaches, have attempted to incorporate breakpoints but often treat denoising and modeling as separate processes and lack systematic evaluation across advanced deep learning architectures, limiting the robustness and the generalization capability. To address these gaps, this paper proposes a comprehensive hybrid framework that integrates structural break detection (Bai-Perron, ICSS, and PELT algorithms), wavelet signal denoising, and three state-of-the-art deep learning models (LSTM, GRU, and TCN). Using European Union Allowance (EUA) spot prices from 2007 to 2024 and exogenous features such as energy prices and policy indicators, the framework constructs univariate and multivariate datasets for comparative evaluation. Experimental results demonstrate that our proposed PELT-WT-TCN achieves the highest prediction accuracy, reducing forecasting errors by 22.35% in RMSE and 18.63% in MAE compared to the state-of-the-art baseline model (Breakpoints with Wavelet and LSTM), and by 70.55% in RMSE and 74.42% in MAE compared to the original LSTM without decomposition from the same baseline study. These findings underscore the value of integrating structural awareness and multiscale decomposition into deep learning architectures to enhance accuracy and interpretability in carbon price forecasting and other nonstationary financial time series.

[LG-25] Peptide2Mol: A Diffusion Model for Generating Small Molecules as Peptide Mimics for Targeted Protein Binding

Link: https://arxiv.org/abs/2511.04984
Authors: Xinheng He, Yijia Zhang, Haowei Lin, Xingang Peng, Xiangzhe Kong, Mingyu Li, Jianzhu Ma
Subjects: Machine Learning (cs.LG)
Comments: Abstract 1 page, main text 9 pages, references 2 pages, 4 figures. Submitted to RECOMB 2026

Abstract:Structure-based drug design has seen significant advancements with the integration of artificial intelligence (AI), particularly in the generation of hit and lead compounds. However, most AI-driven approaches neglect the importance of endogenous protein interactions with peptides, which may result in suboptimal molecule designs. In this work, we present Peptide2Mol, an E(3)-equivariant graph neural network diffusion model that generates small molecules by referencing both the original peptide binders and their surrounding protein pocket environments. Trained on large datasets and leveraging sophisticated modeling techniques, Peptide2Mol not only achieves state-of-the-art performance in non-autoregressive generative tasks, but also produces molecules with similarity to the original peptide binder. Additionally, the model allows for molecule optimization and peptidomimetic design through a partial diffusion process. Our results highlight Peptide2Mol as an effective deep generative model for generating and optimizing bioactive small molecules from protein binding pockets.

[LG-26] Deep Progressive Training: scaling up depth capacity of zero/one-layer models

Link: https://arxiv.org/abs/2511.04981
Authors: Zhiqi Bu
Subjects: Machine Learning (cs.LG)

Abstract:Model depth is a double-edged sword in deep learning: deeper models achieve higher accuracy but require higher computational cost. To efficiently train models at scale, an effective strategy is progressive training, which scales up model capacity during training and thereby significantly reduces computation with little to no performance degradation. In this work, we study the depth expansion of large models through the lens of optimization theory and feature learning, offering insights on the initialization of new layers, hyperparameter transfer, learning rate schedule, and timing of model expansion. Specifically, we propose zero/one-layer progressive training for the optimal tradeoff between computation and loss. For example, zero/one-layer progressive training on GPT2 can save ≈80% of compute, or equivalently accelerate training by ≈5×, while achieving almost the same loss as a fully trained 60-layer model with 7B parameters.

[LG-27] Unlocking the Black Box: A Five-Dimensional Framework for Evaluating Explainable AI in Credit Risk

Link: https://arxiv.org/abs/2511.04980
Authors: Rongbin Ye, Jiaqi Chen
Subjects: Machine Learning (cs.LG)

Abstract:The financial industry faces a significant challenge in modeling risk portfolios: balancing the predictive power of advanced machine learning and neural network models against the explainability required by regulatory entities (such as the Office of the Comptroller of the Currency and the Consumer Financial Protection Bureau). This paper aims to close the gap between these “black box” models and explainability frameworks such as LIME and SHAP. The authors apply these frameworks to different models and demonstrate that more complex models with better predictive power can reach the same level of explainability when paired with SHAP and LIME. Beyond comparing and discussing performance, the paper proposes a novel five-dimensional framework that evaluates Inherent Interpretability, Global Explanations, Local Explanations, Consistency, and Complexity, offering a nuanced method for assessing and comparing model explainability beyond simple accuracy metrics. The research demonstrates the feasibility of employing sophisticated, high-performing ML models in regulated financial environments by using modern explainability techniques, and it provides a structured approach to evaluating the crucial trade-offs between model performance and interpretability.

[LG-28] Scaling Up ROC-Optimizing Support Vector Machines

Link: https://arxiv.org/abs/2511.04979
Authors: Gimun Bae, Seung Jun Shin (Department of Statistics, Korea University, Seoul, Republic of Korea)
Subjects: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
Comments: 15 pages, Submitted to Stat

Abstract:The ROC-SVM, originally proposed by Rakotomamonjy, directly maximizes the area under the ROC curve (AUC) and has become an attractive alternative to conventional binary classification in the presence of class imbalance. However, its practical use is limited by high computational cost, as training involves evaluating all O(n^2) pairs. To overcome this limitation, we develop a scalable variant of the ROC-SVM that leverages incomplete U-statistics, thereby substantially reducing computational complexity. We further extend the framework to nonlinear classification through a low-rank kernel approximation, enabling efficient training in reproducing kernel Hilbert spaces. Theoretical analysis establishes an error bound that justifies the proposed approximation, and empirical results on both synthetic and real datasets demonstrate that the proposed method achieves AUC performance comparable to the original ROC-SVM with drastically reduced training time.
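
The role of an incomplete U-statistic can be sketched on a pairwise hinge loss, a common surrogate for AUC maximization. The code below is illustrative and not the paper's estimator: it assumes pairs are drawn uniformly at random with replacement, and it compares the loss averaged over all positive-negative pairs with the loss averaged over a small random subset. The function names and the synthetic scores are hypothetical.

```python
import random

def pairwise_hinge(score_pos, score_neg):
    """Hinge penalty when a positive is not ranked above a negative by a margin of 1."""
    return max(0.0, 1.0 - (score_pos - score_neg))

def complete_loss(pos_scores, neg_scores):
    """Average over every positive-negative pair (the complete U-statistic)."""
    total = sum(pairwise_hinge(sp, sn) for sp in pos_scores for sn in neg_scores)
    return total / (len(pos_scores) * len(neg_scores))

def incomplete_loss(pos_scores, neg_scores, num_pairs, rng):
    """Average over num_pairs randomly sampled pairs (an incomplete U-statistic)."""
    total = 0.0
    for _ in range(num_pairs):
        total += pairwise_hinge(rng.choice(pos_scores), rng.choice(neg_scores))
    return total / num_pairs

rng = random.Random(0)
pos = [rng.gauss(1.0, 1.0) for _ in range(300)]  # synthetic classifier scores
neg = [rng.gauss(0.0, 1.0) for _ in range(300)]
full = complete_loss(pos, neg)                   # uses all 90,000 pairs
approx = incomplete_loss(pos, neg, 5000, rng)    # uses 5,000 sampled pairs
```

The sampled average is an unbiased estimate of the complete one, so the number of pairs becomes a tunable budget instead of growing quadratically with the sample size.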

[LG-29] Less Is More: Generating Time Series with LLaMA-Style Autoregression in Simple Factorized Latent Spaces

Link: https://arxiv.org/abs/2511.04973
Authors: Siyuan Li, Yifan Sun, Lei Cheng, Lewen Wang, Yang Liu, Weiqing Liu, Jianlong Li, Jiang Bian, Shikai Fang
Subjects: Machine Learning (cs.LG)

Abstract:Generative models for multivariate time series are essential for data augmentation, simulation, and privacy preservation, yet current state-of-the-art diffusion-based approaches are slow and limited to fixed-length windows. We propose FAR-TS, a simple yet effective framework that combines disentangled factorization with an autoregressive Transformer over a discrete, quantized latent space to generate time series. Each time series is decomposed into a data-adaptive basis that captures static cross-channel correlations and temporal coefficients that are vector-quantized into discrete tokens. A LLaMA-style autoregressive Transformer then models these token sequences, enabling fast and controllable generation of sequences with arbitrary length. Owing to its streamlined design, FAR-TS achieves orders-of-magnitude faster generation than Diffusion-TS while preserving cross-channel correlations and an interpretable latent space, enabling high-quality and flexible time series synthesis.

[LG-30] Risk Prediction of Cardiovascular Disease for Diabetic Patients with Machine Learning and Deep Learning Techniques

Link: https://arxiv.org/abs/2511.04971
Authors: Esha Chowdhury (Dhaka University of Engineering & Technology, Gazipur, Bangladesh)
Subjects: Machine Learning (cs.LG)
Comments: 24 pages with 6 tables and 8 figures

Abstract:Accurate prediction of cardiovascular disease (CVD) risk is crucial for healthcare institutions. This study addresses the growing prevalence of diabetes and its strong link to heart disease by proposing an efficient CVD risk prediction model for diabetic patients using machine learning (ML) and hybrid deep learning (DL) approaches. The BRFSS dataset was preprocessed by removing duplicates, handling missing values, identifying categorical and numerical features, and applying Principal Component Analysis (PCA) for feature extraction. Several ML models, including Decision Trees (DT), Random Forest (RF), k-Nearest Neighbors (KNN), Support Vector Machine (SVM), AdaBoost, and XGBoost, were implemented, with XGBoost achieving the highest accuracy of 0.9050. Various DL models, such as Artificial Neural Networks (ANN), Deep Neural Networks (DNN), Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), and Gated Recurrent Unit (GRU), as well as hybrid models combining CNN with LSTM, BiLSTM, and GRU, were also explored. Some of these models achieved perfect recall (1.00), with the LSTM model achieving the highest accuracy of 0.9050. Our research highlights the effectiveness of ML and DL models in predicting CVD risk among diabetic patients, automating and enhancing clinical decision-making. High accuracy and F1 scores demonstrate these models’ potential to improve personalized risk management and preventive strategies.

[LG-31] Structural Properties, Cycloid Trajectories and Non-Asymptotic Guarantees of EM Algorithm for Mixed Linear Regression

Link: https://arxiv.org/abs/2511.04937
Authors: Zhankun Luo, Abolfazl Hashemi
Subjects: Machine Learning (cs.LG)
Comments: Preprint of the paper submitted to IEEE Transactions on Information Theory

Abstract:This work investigates the structural properties, cycloid trajectories, and non-asymptotic convergence guarantees of the Expectation-Maximization (EM) algorithm for two-component Mixed Linear Regression (2MLR) with unknown mixing weights and regression parameters. Recent studies have established global convergence for 2MLR with known balanced weights and super-linear convergence in noiseless and high signal-to-noise ratio (SNR) regimes. However, the theoretical behavior of EM in the fully unknown setting remains unclear, with its trajectory and convergence order not yet fully characterized. We derive explicit EM update expressions for 2MLR with unknown mixing weights and regression parameters across all SNR regimes and analyze their structural properties and cycloid trajectories. In the noiseless case, we prove that the trajectory of the regression parameters in EM iterations traces a cycloid by establishing a recurrence relation for the sub-optimality angle, while in high SNR regimes we quantify its discrepancy from the cycloid trajectory. The trajectory-based analysis reveals the order of convergence: linear when the EM estimate is nearly orthogonal to the ground truth, and quadratic when the angle between the estimate and ground truth is small at the population level. Our analysis establishes non-asymptotic guarantees by sharpening bounds on statistical errors between finite-sample and population EM updates, relating EM’s statistical accuracy to the sub-optimality angle, and proving convergence with arbitrary initialization at the finite-sample level. This work provides a novel trajectory-based framework for analyzing EM in Mixed Linear Regression.

[LG-32] Leak@k: Unlearning Does Not Make LLMs Forget Under Probabilistic Decoding

Link: https://arxiv.org/abs/2511.04934
Authors: Hadi Reisizadeh, Jiajun Ruan, Yiwei Chen, Soumyadeep Pal, Sijia Liu, Mingyi Hong
Subjects: Machine Learning (cs.LG)

Abstract:Unlearning in large language models (LLMs) is critical for regulatory compliance and for building ethical generative AI systems that avoid producing private, toxic, illegal, or copyrighted content. Despite rapid progress, in this work we show that almost all existing unlearning methods fail to achieve true forgetting in practice. Specifically, while evaluations of these ‘unlearned’ models under deterministic (greedy) decoding often suggest successful knowledge removal using standard benchmarks (as has been done in the literature), we show that sensitive information reliably resurfaces when models are sampled with standard probabilistic decoding. To rigorously capture this vulnerability, we introduce leak@k, a new meta-evaluation metric that quantifies the likelihood of forgotten knowledge reappearing when generating k samples from the model under realistic decoding strategies. Using three widely adopted benchmarks, TOFU, MUSE, and WMDP, we conduct the first large-scale, systematic study of unlearning reliability using our newly defined leak@k metric. Our findings demonstrate that knowledge leakage persists across methods and tasks, underscoring that current state-of-the-art unlearning techniques provide only limited forgetting and highlighting the urgent need for more robust approaches to LLM unlearning.
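
The abstract does not give the formula for the metric, so the sketch below is only one plausible reading, borrowed from the pass@k estimator used in code-generation evaluation. It assumes that `n` responses have been sampled for a prompt, that `c` of them were judged to leak the supposedly forgotten content, and it estimates the probability that at least one of `k` samples leaks. The function name and the leak-judging step are hypothetical, and the paper's actual definition may differ.

```python
from math import comb

def leak_at_k_estimate(n, c, k):
    """Estimate P[at least one of k samples leaks] from n samples with c leaks.

    Uses 1 - C(n - c, k) / C(n, k), the chance that a random size-k subset
    of the n observed samples contains at least one leaking sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)

# A model that leaks in only 2 of 100 samples looks safe for a single draw,
# but the chance of seeing a leak rises quickly as more samples are drawn.
single_draw = leak_at_k_estimate(100, 2, 1)
many_draws = leak_at_k_estimate(100, 2, 50)
```

This illustrates why greedy or single-sample evaluation can understate leakage: a small per-sample leak rate compounds over repeated sampling.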

[LG-33] Machine Learning Algorithms in Statistical Modelling Bridging Theory and Application

Link: https://arxiv.org/abs/2511.04918
Authors: A. Ganapathi Rao, Sathish Krishna Anumula, Aditya Kumar Singh, Renukhadevi M, Y. Jeevan Nagendra Kumar, Tammineni Rama Tulasi
Subjects: Machine Learning (cs.LG)
Comments: 9 Pages, 4 Figures

Abstract:Integrating ML algorithms with traditional statistical modelling in novel ways has changed how we analyze data, perform predictive analytics, and make decisions in data-driven fields. In this paper, we study connections between ML and statistical models to understand how modern ML algorithms help ‘enrich’ conventional models; we demonstrate how new algorithms improve the performance, scale, flexibility and robustness of traditional models. The results show that hybrid models offer substantial improvements in predictive accuracy, robustness, and interpretability.

[LG-34] Efficient Swap Multicalibration of Elicitable Properties

Link: https://arxiv.org/abs/2511.04907
Authors: Lunjia Hu, Haipeng Luo, Spandan Senapati, Vatsal Sharan
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Multicalibration [HJKRR18] is an algorithmic fairness perspective that demands that the predictions of a predictor are correct conditional on themselves and membership in a collection of potentially overlapping subgroups of a population. The work of [NR23] established a surprising connection between multicalibration for an arbitrary property $\Gamma$ (e.g., mean or median) and property elicitation: a property $\Gamma$ can be multicalibrated if and only if it is elicitable, where elicitability is the notion that the true property value of a distribution can be obtained by solving a regression problem over the distribution. In the online setting, [NR23] proposed an inefficient algorithm that achieves $\sqrt{T}$ $\ell_2$-multicalibration error for a hypothesis class of group membership functions and an elicitable property $\Gamma$, after $T$ rounds of interaction between a forecaster and adversary. In this paper, we generalize multicalibration for an elicitable property $\Gamma$ from group membership functions to arbitrary bounded hypothesis classes and introduce a stronger notion, swap multicalibration, following [GKR23]. Subsequently, we propose an oracle-efficient algorithm which, when given access to an online agnostic learner, achieves $T^{1/(r+1)}$ $\ell_r$-swap multicalibration error with high probability (for $r \ge 2$) for a hypothesis class with bounded sequential Rademacher complexity and an elicitable property $\Gamma$. For the special case of $r = 2$, this implies an oracle-efficient algorithm that achieves $T^{1/3}$ $\ell_2$-swap multicalibration error, which significantly improves on the previously established bounds for the problem [NR23, GMS25, LSS25a], and completely resolves an open question raised in [GJRR24] on the possibility of an oracle-efficient algorithm that achieves $\sqrt{T}$ $\ell_2$-mean multicalibration error by answering it in a strongly affirmative sense.

[LG-35] Multi-Agent Craftax: Benchmarking Open-Ended Multi-Agent Reinforcement Learning at the Hyperscale

Link: https://arxiv.org/abs/2511.04904
Authors: Bassel Al Omari, Michael Matthews, Alexander Rutherford, Jakob Nicolaus Foerster
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Abstract:Progress in multi-agent reinforcement learning (MARL) requires challenging benchmarks that assess the limits of current methods. However, existing benchmarks often target narrow short-horizon challenges that do not adequately stress the long-term dependencies and generalization capabilities inherent in many multi-agent systems. To address this, we first present Craftax-MA: an extension of the popular open-ended RL environment, Craftax, that supports multiple agents and evaluates a wide range of general abilities within a single environment. Written in JAX, Craftax-MA is exceptionally fast, with a training run using 250 million environment interactions completing in under an hour. To provide a more compelling challenge for MARL, we also present Craftax-Coop, an extension introducing heterogeneous agents, trading and more mechanics that require complex cooperation among agents for success. We provide analysis demonstrating that existing algorithms struggle with key challenges in this benchmark, including long-horizon credit assignment, exploration and cooperation, and argue for its potential to drive long-term research in MARL.

[LG-36] Self-Interest and Systemic Benefits: Emergence of Collective Rationality in Mixed Autonomy Traffic Through Deep Reinforcement Learning

Link: https://arxiv.org/abs/2511.04883
Authors: Di Chen, Jia Li, Michael Zhang
Subjects: Machine Learning (cs.LG)

Abstract:Autonomous vehicles (AVs) are expected to be commercially available in the near future, leading to mixed autonomy traffic consisting of both AVs and human-driven vehicles (HVs). Although numerous studies have shown that AVs can be deployed to benefit the overall traffic system performance by incorporating system-level goals into their decision making, it is not clear whether the benefits still exist when agents act out of self-interest – a trait common to all driving agents, both human and autonomous. This study aims to understand whether self-interested AVs can bring benefits to all driving agents in mixed autonomy traffic systems. The research is centered on the concept of collective rationality (CR). This concept, originating from game theory and behavioral economics, means that driving agents may cooperate collectively even when pursuing individual interests. Our recent research has proven the existence of CR in an analytical game-theoretical model and empirically in mixed human-driven traffic. In this paper, we demonstrate that CR can be attained among driving agents trained using deep reinforcement learning (DRL) with a simple reward design. We examine the extent to which self-interested traffic agents can achieve CR without directly incorporating system-level objectives. Results show that CR consistently emerges in various scenarios, which indicates the robustness of this property. We also postulate a mechanism to explain the emergence of CR in the microscopic and dynamic environment and verify it based on simulation evidence. This research suggests the possibility of leveraging advanced learning methods (such as federated learning) to achieve collective cooperation among self-interested driving agents in mixed-autonomy systems.

[LG-37] FoodRL: A Reinforcement Learning Ensembling Framework For In-Kind Food Donation Forecasting

Link: https://arxiv.org/abs/2511.04865
Authors: Esha Sharma, Lauren Davis, Julie Ivy, Min Chi
Subjects: Machine Learning (cs.LG)

Abstract:Food banks are crucial for alleviating food insecurity, but their effectiveness hinges on accurately forecasting highly volatile in-kind donations to ensure equitable and efficient resource distribution. Traditional forecasting models often fail to maintain consistent accuracy due to unpredictable fluctuations and concept drift driven by seasonal variations and natural disasters such as hurricanes in the Southeastern U.S. and wildfires on the West Coast. To address these challenges, we propose FoodRL, a novel reinforcement learning (RL) based metalearning framework that clusters and dynamically weights diverse forecasting models based on recent performance and contextual information. Evaluated on multi-year data from two structurally distinct U.S. food banks (a large regional West Coast food bank affected by wildfires and a state-level East Coast food bank consistently impacted by hurricanes), FoodRL consistently outperforms baseline methods, particularly during periods of disruption or decline. By delivering more reliable and adaptive forecasts, FoodRL can facilitate the redistribution of food equivalent to 1.7 million additional meals annually, demonstrating its significant potential for social impact as well as adaptive ensemble learning for humanitarian supply chains.

[LG-38] Quantum Boltzmann Machines for Sample-Efficient Reinforcement Learning

Link: https://arxiv.org/abs/2511.04856
Authors: Thore Gerlach, Michael Schenk, Verena Kain
Subjects: Machine Learning (cs.LG); Quantum Physics (quant-ph)

Abstract:We introduce theoretically grounded Continuous Semi-Quantum Boltzmann Machines (CSQBMs) that support continuous-action reinforcement learning. By combining exponential-family priors over visible units with quantum Boltzmann distributions over hidden units, CSQBMs yield a hybrid quantum-classical model that reduces qubit requirements while retaining strong expressiveness. Crucially, gradients with respect to continuous variables can be computed analytically, enabling direct integration into Actor-Critic algorithms. Building on this, we propose a continuous Q-learning framework that replaces global maximization by efficient sampling from the CSQBM distribution, thereby overcoming instability issues in continuous control.

[LG-39] SigmaDock: Untwisting Molecular Docking With Fragment-Based SE(3) Diffusion

Link: https://arxiv.org/abs/2511.04854
Authors: Alvaro Prat, Leo Zhang, Charlotte M. Deane, Yee Whye Teh, Garrett M. Morris
Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: Preprint

Abstract:Determining the binding pose of a ligand to a protein, known as molecular docking, is a fundamental task in drug discovery. Generative approaches promise faster, improved, and more diverse pose sampling than physics-based methods, but are often hindered by chemically implausible outputs, poor generalisability, and high computational cost. To address these challenges, we introduce a novel fragmentation scheme, leveraging inductive biases from structural chemistry, to decompose ligands into rigid-body fragments. Building on this decomposition, we present SigmaDock, an SE(3) Riemannian diffusion model that generates poses by learning to reassemble these rigid bodies within the binding pocket. By operating at the level of fragments in SE(3), SigmaDock exploits well-established geometric priors while avoiding overly complex diffusion processes and unstable training dynamics. Experimentally, we show SigmaDock achieves state-of-the-art performance, reaching Top-1 success rates (RMSD < 2 Å and PB-valid) above 79.9% on the PoseBusters set, compared to 12.7-30.8% reported by recent deep learning approaches, whilst demonstrating consistent generalisation to unseen proteins. SigmaDock is the first deep learning approach to surpass classical physics-based docking under the PB train-test split, marking a significant leap forward in the reliability and feasibility of deep learning for molecular modelling.

[LG-40] Grounded Test-Time Adaptation for LLM Agents

Link: https://arxiv.org/abs/2511.04847
Authors: Arthur Chen,Zuxin Liu,Jianguo Zhang,Akshara Prabhakar,Zhiwei Liu,Shelby Heinecke,Silvio Savarese,Victor Zhong,Caiming Xiong
Categories: Machine Learning (cs.LG)
Comments: Preprint. Under review

Abstract:Large language model (LLM)-based agents struggle to generalize to novel and complex environments, such as unseen websites or new sets of functions, due to a fundamental mismatch between their pre-training and test-time conditions. This challenge stems from two distinct failure modes: a syntactic misunderstanding of environment-specific components like observation formats, and a semantic misunderstanding of state-transition dynamics, which are only revealed at test time. To address these issues, we propose two distinct and complementary strategies for adapting LLM agents by leveraging environment-specific information available during deployment. First, an online distributional adaptation method parameterizes environmental nuances by learning a lightweight adaptation vector that biases the model’s output distribution, enabling rapid alignment with an environment response format. Second, a deployment-time dynamics grounding method employs a persona-driven exploration phase to systematically probe and learn the environment’s causal dynamics before task execution, equipping the agent with a nonparametric world model. We evaluate these strategies across diverse agentic benchmarks, including function calling and web navigation. Our empirical results show the effectiveness of both strategies across all benchmarks with minimal computational cost. We find that dynamics grounding is particularly effective in complex environments where unpredictable dynamics pose a major obstacle, demonstrating a robust path toward more generalizable and capable LLM-based agents. For example, on the WebArena multi-site split, this method increases the agent’s success rate from 2% to 23%.

[LG-41] Investigating U.S. Consumer Demand for Food Products with Innovative Transportation Certificates Based on Stated Preferences and Machine Learning Approaches

Link: https://arxiv.org/abs/2511.04845
Authors: Jingchen Bi,Rodrigo Mesa-Arango
Categories: Machine Learning (cs.LG)

Abstract:This paper utilizes a machine learning model to estimate the consumer’s behavior for food products with innovative transportation certificates in the U.S. Building on previous research that examined demand for food products with supply chain traceability using stated preference analysis, transportation factors were identified as significant in consumer food purchasing choices. Consequently, a second experiment was conducted to pinpoint the specific transportation attributes valued by consumers. A machine learning model was applied, and five innovative certificates related to transportation were proposed: Transportation Mode, Internet of Things (IoT), Safety measures, Energy Source, and Must Arrive By Dates (MABDs). The preference experiment also incorporated product-specific and decision-maker factors for control purposes. The findings reveal a notable inclination toward safety and energy certificates within the transportation domain of the U.S. food supply chain. Additionally, the study examined the influence of price, product type, certificates, and decision-maker factors on purchasing choices. Ultimately, the study offers data-driven recommendations for improving food supply chain systems.

[LG-42] Sublinear iterations can suffice even for DDPMs

Link: https://arxiv.org/abs/2511.04844
Authors: Matthew S. Zhang,Stephen Huan,Jerry Huang,Nicholas M. Boffi,Sitan Chen,Sinho Chewi
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:SDE-based methods such as denoising diffusion probabilistic models (DDPMs) have shown remarkable success in real-world sample generation tasks. Prior analyses of DDPMs have been focused on the exponential Euler discretization, showing guarantees that generally depend at least linearly on the dimension or initial Fisher information. Inspired by works in log-concave sampling (Shen and Lee, 2019), we analyze an integrator – the denoising diffusion randomized midpoint method (DDRaM) – that leverages an additional randomized midpoint to better approximate the SDE. Using a recently-developed analytic framework called the “shifted composition rule”, we show that this algorithm enjoys favorable discretization properties under appropriate smoothness assumptions, with sublinear $\widetilde{O}(\sqrt{d})$ score evaluations needed to ensure convergence. This is the first sublinear complexity bound for pure DDPM sampling – prior works which obtained such bounds worked instead with ODE-based sampling and had to make modifications to the sampler which deviate from how they are used in practice. We also provide experimental validation of the advantages of our method, showing that it performs well in practice with pre-trained image synthesis models.
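
To make the "randomized midpoint" idea concrete, here is a minimal sketch of a generic randomized-midpoint step for the overdamped Langevin SDE dX = b(X) dt + sqrt(2) dW. It is not the paper's DDRaM sampler: the toy drift b(x) = -x (the score of a standard normal), the step size, and all function names are assumptions chosen for illustration.

```python
import math
import random


def randomized_midpoint_step(x, h, drift, rng):
    """One randomized-midpoint step for dX = drift(X) dt + sqrt(2) dW."""
    alpha = rng.random()  # random fraction of the step, uniform on [0, 1)
    # Brownian increments over [0, alpha*h] and [alpha*h, h] (independent).
    w_mid = rng.gauss(0.0, math.sqrt(alpha * h))
    w_rest = rng.gauss(0.0, math.sqrt((1.0 - alpha) * h))
    # Cheap Euler estimate of the state at the random intermediate time.
    x_mid = x + alpha * h * drift(x) + math.sqrt(2.0) * w_mid
    # Full step: drift evaluated at the random midpoint, same Brownian path.
    return x + h * drift(x_mid) + math.sqrt(2.0) * (w_mid + w_rest)


def sample(n_chains, n_steps, h, seed=0):
    """Run independent chains from x = 0 with the toy drift b(x) = -x."""
    rng = random.Random(seed)
    drift = lambda x: -x  # assumption: score of a standard normal target
    finals = []
    for _ in range(n_chains):
        x = 0.0
        for _ in range(n_steps):
            x = randomized_midpoint_step(x, h, drift, rng)
        finals.append(x)
    return finals


if __name__ == "__main__":
    xs = sample(n_chains=2000, n_steps=50, h=0.2)
    print("empirical variance:", sum(v * v for v in xs) / len(xs))
```

Evaluating the drift at a uniformly random intermediate time makes the drift integral over the step unbiased in expectation, which is the property that midpoint-style analyses exploit.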

[LG-43] SPECTRA: Spectral Target-Aware Graph Augmentation for Imbalanced Molecular Property Regression

Link: https://arxiv.org/abs/2511.04838
Authors: Brenda Nogueira,Meng Jiang,Nitesh V. Chawla,Nuno Moniz
Categories: Machine Learning (cs.LG); Spectral Theory (math.SP); Molecular Networks (q-bio.MN)

Abstract:In molecular property prediction, the most valuable compounds (e.g., high potency) often occupy sparse regions of the target space. Standard Graph Neural Networks (GNNs) commonly optimize for the average error, underperforming on these uncommon but critical cases, with existing oversampling methods often distorting molecular topology. In this paper, we introduce SPECTRA, a Spectral Target-Aware graph augmentation framework that generates realistic molecular graphs in the spectral domain. SPECTRA (i) reconstructs multi-attribute molecular graphs from SMILES; (ii) aligns molecule pairs via (Fused) Gromov-Wasserstein couplings to obtain node correspondences; (iii) interpolates Laplacian eigenvalues, eigenvectors and node features in a stable share-basis; and (iv) reconstructs edges to synthesize physically plausible intermediates with interpolated targets. A rarity-aware budgeting scheme, derived from a kernel density estimation of labels, concentrates augmentation where data are scarce. Coupled with a spectral GNN using edge-aware Chebyshev convolutions, SPECTRA densifies underrepresented regions without degrading global accuracy. On benchmarks, SPECTRA consistently improves error in relevant target ranges while maintaining competitive overall MAE, and yields interpretable synthetic molecules whose structure reflects the underlying spectral geometry. Our results demonstrate that spectral, geometry-aware augmentation is an effective and efficient strategy for imbalanced molecular property regression.

[LG-44] Persistent reachability homology in machine learning applications

Link: https://arxiv.org/abs/2511.04825
Authors: Luigi Caputi,Nicholas Meadows,Henri Riihimäki
Categories: Machine Learning (cs.LG); Algebraic Topology (math.AT); Quantitative Methods (q-bio.QM)
Comments: 19 pages; any comments welcome

Abstract:We explore the recently introduced persistent reachability homology (PRH) of digraph data, i.e. data in the form of directed graphs. In particular, we study the effectiveness of PRH in a network classification task for a key neuroscience problem: epilepsy detection. PRH is a variation of the persistent homology of digraphs, more traditionally based on the directed flag complex (DPH). A main advantage of PRH is that it considers the condensations of the digraphs appearing in the persistent filtration and thus is computed from smaller digraphs. We compare the effectiveness of PRH to that of DPH and we show that PRH outperforms DPH in the classification task. We use the Betti curves and their integrals as topological features and implement our pipeline with a support vector machine.
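
The topological features named in the abstract, Betti curves and their integrals, can be sketched from a persistence barcode as follows. This is only an illustration under assumptions: the barcode is a made-up list of finite (birth, death) intervals, the function names are hypothetical, and computing PRH itself is not shown.

```python
def betti_curve(intervals, grid):
    """Count the intervals (birth, death) alive at each t: birth <= t < death."""
    return [sum(1 for b, d in intervals if b <= t < d) for t in grid]


def betti_integral(intervals):
    """Exact integral of the Betti curve: total length of the finite intervals."""
    return sum(d - b for b, d in intervals)


if __name__ == "__main__":
    # Hypothetical barcode in one homological degree.
    barcode = [(0.0, 1.0), (0.5, 2.0)]
    grid = [0.25, 0.75, 1.5, 2.5]
    print(betti_curve(barcode, grid))
    print(betti_integral(barcode))
```

The sampled curve and its integral form a fixed-length feature vector, which is the kind of input a support vector machine expects.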

[LG-45] Sharp Minima Can Generalize: A Loss Landscape Perspective On Data

Link: https://arxiv.org/abs/2511.04808
Authors: Raymond Fan,Bryce Sandlund,Lin Myat Ko
Categories: Machine Learning (cs.LG)

Abstract:The volume hypothesis suggests deep learning is effective because it is likely to find flat minima due to their large volumes, and flat minima generalize well. This picture does not explain the role of large datasets in generalization. Measuring minima volumes under varying amounts of training data reveals that sharp minima which generalize well do exist, but are unlikely to be found due to their small volumes. Increasing data changes the loss landscape, such that previously small generalizing minima become (relatively) large.

[LG-46] Autoencoding Dynamics: Topological Limitations and Capabilities

Link: https://arxiv.org/abs/2511.04807
Authors: Matthew D. Kvalheim,Eduardo D. Sontag
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS)

Abstract:Given a “data manifold” $M\subset \mathbb{R}^n$ and “latent space” $\mathbb{R}^\ell$, an autoencoder is a pair of continuous maps consisting of an “encoder” $E\colon \mathbb{R}^n\to \mathbb{R}^\ell$ and “decoder” $D\colon \mathbb{R}^\ell\to \mathbb{R}^n$ such that the “round trip” map $D\circ E$ is as close as possible to the identity map $\mathrm{id}_M$ on $M$. We present various topological limitations and capabilities inherent to the search for an autoencoder, and describe capabilities for autoencoding dynamical systems having $M$ as an invariant manifold.

[LG-47] Simplex-FEM Networks (SiFEN): Learning A Triangulated Function Approximator

Link: https://arxiv.org/abs/2511.04804
Authors: Chaymae Yahyati,Ismail Lamaakal,Khalid El Makkaoui,Ibrahim Ouahbi,Yassine Maleh
Categories: Machine Learning (cs.LG)

Abstract:We introduce Simplex-FEM Networks (SiFEN), a learned piecewise-polynomial predictor that represents $f: \mathbb{R}^d \to \mathbb{R}^k$ as a globally $C^r$ finite-element field on a learned simplicial mesh in an optionally warped input space. Each query activates exactly one simplex and at most $d+1$ basis functions via barycentric coordinates, yielding explicit locality, controllable smoothness, and cache-friendly sparsity. SiFEN pairs degree-$m$ Bernstein-Bezier polynomials with a light invertible warp and trains end-to-end with shape regularization, semi-discrete OT coverage, and differentiable edge flips. Under standard shape-regularity and bi-Lipschitz warp assumptions, SiFEN achieves the classic FEM approximation rate $M^{-m/d}$ with $M$ mesh vertices. Empirically, on synthetic approximation tasks, tabular regression/classification, and as a drop-in head on compact CNNs, SiFEN matches or surpasses MLPs and KANs at matched parameter budgets, improves calibration (lower ECE/Brier), and reduces inference latency due to geometric locality. These properties make SiFEN a compact, interpretable, and theoretically grounded alternative to dense MLPs and edge-spline networks.
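
The locality claim (one simplex, at most d+1 basis functions per query) is easiest to see in the simplest case. The sketch below assumes d = 2 and degree-1 (linear) elements on a single fixed triangle; the learned mesh, the warp, and the higher-degree Bernstein-Bezier basis from the abstract are not modeled, and the function names are hypothetical.

```python
def barycentric_2d(p, a, b, c):
    """Barycentric coordinates of point p with respect to triangle (a, b, c)."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    l0 = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    l1 = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return (l0, l1, 1.0 - l0 - l1)


def interpolate(p, triangle, vertex_values, tol=1e-12):
    """Linear finite-element value at p, or None if p lies outside the triangle."""
    lam = barycentric_2d(p, *triangle)
    if min(lam) < -tol:
        return None  # the query does not activate this simplex
    # Only the d + 1 = 3 vertex basis functions of this simplex contribute.
    return sum(l * v for l, v in zip(lam, vertex_values))


if __name__ == "__main__":
    tri = ((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
    values = (0.0, 1.0, 2.0)  # samples of f(x, y) = x + 2y at the vertices
    print(interpolate((0.25, 0.25), tri, values))
    print(interpolate((1.0, 1.0), tri, values))
```

Because the vertex values sample a linear function, the interpolant reproduces it exactly inside the triangle, and a query outside the triangle touches none of its basis functions.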

[LG-48] DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing

Link: https://arxiv.org/abs/2511.04791
Authors: Lei Gao,Chaoyi Jiang,Hossein Entezari Zarch,Daniel Wong,Murali Annavaram
Categories: Machine Learning (cs.LG)

Abstract:Modern LLM serving systems must sustain high throughput while meeting strict latency SLOs across two distinct inference phases: compute-intensive prefill and memory-bound decode phases. Existing approaches either (1) aggregate both phases on shared GPUs, leading to interference between prefill and decode phases, which degrades time-between-tokens (TBT); or (2) disaggregate the two phases across GPUs, improving latency but wasting resources through duplicated models and KV cache transfers. We present DuetServe, a unified LLM serving framework that achieves disaggregation-level isolation within a single GPU. DuetServe operates in aggregated mode by default and dynamically activates SM-level GPU spatial multiplexing when TBT degradation is predicted. Its key idea is to decouple prefill and decode execution only when needed through fine-grained, adaptive SM partitioning that provides phase isolation only when contention threatens latency service level objectives (SLOs). DuetServe integrates (1) an attention-aware roofline model to forecast iteration latency, (2) a partitioning optimizer that selects the optimal SM split to maximize throughput under TBT constraints, and (3) an interruption-free execution engine that eliminates CPU-GPU synchronization overhead. Evaluations show that DuetServe improves total throughput by up to 1.3x while maintaining low generation latency compared to state-of-the-art frameworks.
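
The compute-bound versus memory-bound distinction behind the prefill/decode split can be illustrated with the textbook roofline estimate. This is a minimal sketch, not DuetServe's attention-aware model: the hardware peaks and the per-iteration FLOP and byte counts below are made-up numbers, and the function names are hypothetical.

```python
def roofline_latency(flops, bytes_moved, peak_flops, mem_bandwidth):
    """Lower-bound latency in seconds: the slower of compute and memory traffic."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / mem_bandwidth
    return max(compute_time, memory_time)


def bound_type(flops, bytes_moved, peak_flops, mem_bandwidth):
    """Report which resource limits the iteration under the roofline model."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / mem_bandwidth
    return "compute" if compute_time >= memory_time else "memory"


if __name__ == "__main__":
    PEAK_FLOPS = 1e14  # assumed accelerator peak, FLOP/s
    MEM_BW = 1e12      # assumed memory bandwidth, bytes/s
    prefill = dict(flops=2e12, bytes_moved=1e9)   # hypothetical prefill batch
    decode = dict(flops=1e10, bytes_moved=2e10)   # hypothetical decode batch
    for name, work in (("prefill", prefill), ("decode", decode)):
        print(name,
              bound_type(peak_flops=PEAK_FLOPS, mem_bandwidth=MEM_BW, **work),
              roofline_latency(peak_flops=PEAK_FLOPS, mem_bandwidth=MEM_BW, **work))
```

Under these assumed numbers the prefill batch is limited by compute and the decode batch by memory traffic, so the two phases stress different resources when they share a GPU.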

[LG-49] Conditional Neural ODE for Longitudinal Parkinson's Disease Progression Forecasting

Link: https://arxiv.org/abs/2511.04789
Authors: Xiaoda Wang,Yuji Zhao,Kaiqiao Han,Xiao Luo,Sanne van Rooij,Jennifer Stevens,Lifang He,Liang Zhan,Yizhou Sun,Wei Wang,Carl Yang
Categories: Machine Learning (cs.LG)
Comments: Accepted to IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 2025

Abstract:Parkinson’s disease (PD) shows heterogeneous, evolving brain-morphometry patterns. Modeling these longitudinal trajectories enables mechanistic insight, treatment development, and individualized ‘digital-twin’ forecasting. However, existing methods usually adopt recurrent neural networks and transformer architectures, which rely on discrete, regularly sampled data while struggling to handle irregular and sparse magnetic resonance imaging (MRI) in PD cohorts. Moreover, these methods have difficulty capturing individual heterogeneity including variations in disease onset, progression rate, and symptom severity, which is a hallmark of PD. To address these challenges, we propose CNODE (Conditional Neural ODE), a novel framework for continuous, individualized PD progression forecasting. The core of CNODE is to model morphological brain changes as continuous temporal processes using a neural ODE model. In addition, we jointly learn patient-specific initial time and progress speed to align individual trajectories into a shared progression trajectory. We validate CNODE on the Parkinson’s Progression Markers Initiative (PPMI) dataset. Experimental results show that our method outperforms state-of-the-art baselines in forecasting longitudinal PD progression.

[LG-50] SLOFetch: Compressed-Hierarchical Instruction Prefetching for Cloud Microservices

Link: https://arxiv.org/abs/2511.04774
Authors: Liu Jiang,Zerui Bao,Shiqi Sheng,Di Zhu
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)

Abstract:Large-scale networked services rely on deep software stacks and microservice orchestration, which increase instruction footprints and create frontend stalls that inflate tail latency and energy. We revisit instruction prefetching for these cloud workloads and present a design that aligns with SLO-driven and self-optimizing systems. Building on the Entangling Instruction Prefetcher (EIP), we introduce a Compressed Entry that captures up to eight destinations around a base using 36 bits by exploiting spatial clustering, and a Hierarchical Metadata Storage scheme that keeps only L1-resident and frequently queried entries on chip while virtualizing bulk metadata into lower levels. We further add a lightweight Online ML Controller that scores prefetch profitability using context features and a bandit-adjusted threshold. On data center applications, our approach preserves EIP-like speedups with smaller on-chip state and improves efficiency for networked services in the ML era.

[LG-51] FuseFlow: A Fusion-Centric Compilation Framework for Sparse Deep Learning on Streaming Dataflow

Link: https://arxiv.org/abs/2511.04768
Authors: Rubens Lacouture,Nathan Zhang,Ritvik Sharma,Marco Siracusa,Fredrik Kjolstad,Kunle Olukotun,Olivia Hsu
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Programming Languages (cs.PL)

Abstract:As deep learning models scale, sparse computation and specialized dataflow hardware have emerged as powerful solutions to address efficiency. We propose FuseFlow, a compiler that converts sparse machine learning models written in PyTorch to fused sparse dataflow graphs for reconfigurable dataflow architectures (RDAs). FuseFlow is the first compiler to support general cross-expression fusion of sparse operations. In addition to fusion across kernels (expressions), FuseFlow also supports optimizations like parallelization, dataflow ordering, and sparsity blocking. It targets a cycle-accurate dataflow simulator for microarchitectural analysis of fusion strategies. We use FuseFlow for design-space exploration across four real-world machine learning applications with sparsity, showing that full fusion (entire cross-expression fusion across all computation in an end-to-end model) is not always optimal for sparse models: fusion granularity depends on the model itself. FuseFlow also provides a heuristic to identify and prune suboptimal configurations. Using FuseFlow, we achieve performance improvements, including a ~2.7x speedup over an unfused baseline for GPT-3 with BigBird block-sparse attention.

[LG-52] When Data Falls Short: Grokking Below the Critical Threshold

Link: https://arxiv.org/abs/2511.04760
Authors: Vaibhav Singh,Eugene Belilovsky,Rahaf Aljundi
Categories: Machine Learning (cs.LG)
Comments: 6 pages

Abstract:In this paper, we investigate the phenomenon of grokking, where models exhibit delayed generalization following overfitting on training data. We focus on data-scarce regimes where the number of training samples falls below the critical threshold, making grokking unobservable, and on practical scenarios involving distribution shift. We first show that Knowledge Distillation (KD) from a model that has already grokked on a distribution (p1) can induce and accelerate grokking on a different distribution (p2), even when the available data lies below the critical threshold. This highlights the value of KD for deployed models that must adapt to new distributions under limited data. We then study training on the joint distribution (p1, p2) and demonstrate that while standard supervised training fails when either distribution has insufficient data, distilling from models grokked on the individual distributions enables generalization. Finally, we examine a continual pretraining setup, where a grokked model transitions from p1 to p2, and find that KD both accelerates generalization and mitigates catastrophic forgetting, achieving strong performance even with only 10% of the data. Together, our results provide new insights into the mechanics of grokking under knowledge transfer and underscore the central role of KD in enabling generalization in low-data and evolving distribution settings.
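
For readers unfamiliar with Knowledge Distillation, the sketch below shows the standard temperature-scaled KL objective between teacher and student outputs. The abstract does not say which KD loss the authors use, so this formulation, the temperature value, and the function names are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Temperature-scaled KL(teacher || student) for one example."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl


if __name__ == "__main__":
    teacher = [2.0, 0.0, -1.0]  # hypothetical logits from a grokked teacher
    print(kd_loss([0.0, 0.0, 0.0], teacher))  # untrained student: positive loss
    print(kd_loss(teacher, teacher))          # identical outputs: zero loss
```

In practice this term is usually averaged over a batch and often combined with the ordinary cross-entropy on ground-truth labels.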

[LG-53] Regularized GLISp for sensor-guided human-in-the-loop optimization

Link: https://arxiv.org/abs/2511.04751
Authors: Matteo Cercola,Michele Lomuscio,Dario Piga,Simone Formentin
Categories: Machine Learning (cs.LG)

Abstract:Human-in-the-loop calibration is often addressed via preference-based optimization, where algorithms learn from pairwise comparisons rather than explicit cost evaluations. While effective, methods such as Preferential Bayesian Optimization or Global optimization based on active preference learning with radial basis functions (GLISp) treat the system as a black box and ignore informative sensor measurements. In this work, we introduce a sensor-guided regularized extension of GLISp that integrates measurable descriptors into the preference-learning loop through a physics-informed hypothesis function and a least-squares regularization term. This injects grey-box structure, combining subjective feedback with quantitative sensor information while preserving the flexibility of preference-based search. Numerical evaluations on an analytical benchmark and on a human-in-the-loop vehicle suspension tuning task show faster convergence and superior final solutions compared to baseline GLISp.

[LG-54] AWEMixer: Adaptive Wavelet-Enhanced Mixer Network for Long-Term Time Series Forecasting

Link: https://arxiv.org/abs/2511.04722
Authors: Qianyang Li,Xingjun Zhang,Peng Tao,Shaoxun Wang,Yancheng Pan,Jia Wei
Categories: Machine Learning (cs.LG)

Abstract:Forecasting long-term time series in IoT environments remains a significant challenge due to the non-stationary and multi-scale characteristics of sensor signals. Furthermore, error accumulation degrades forecast quality as the prediction horizon grows. Traditional methods operate only in the time domain, while the global frequency information obtained by the Fourier transform treats signals as stationary and therefore blurs the temporal patterns of transient events. We propose AWEMixer, an Adaptive Wavelet-Enhanced Mixer Network with two innovative components: 1) a Frequency Router designed to use the global periodicity pattern obtained by the Fast Fourier Transform to adaptively weight localized wavelet subbands, and 2) a Coherent Gated Fusion Block that selectively integrates prominent frequency features with multi-scale temporal representations through cross-attention and gating mechanisms, achieving accurate time-frequency localization while remaining robust to noise. Seven public benchmarks validate that our model is more effective than recent state-of-the-art models. Specifically, our model consistently improves on transformer-based and MLP-based state-of-the-art models in long-sequence time series forecasting. Code is available at this https URL

[LG-55] Communication-Constrained Private Decentralized Online Personalized Mean Estimation

Link: https://arxiv.org/abs/2511.04702
Authors: Yauhen Yakimenka,Hsuan-Yin Lin,Eirik Rosnes,Jörg Kliewer
Categories: Social and Information Networks (cs.SI); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: Paper accepted for presentation at the 2025 IEEE Information Theory Workshop (ITW 2025). Final conference version

Abstract:We consider the problem of communication-constrained collaborative personalized mean estimation under a privacy constraint in an environment of several agents continuously receiving data according to arbitrary unknown agent-specific distributions. A consensus-based algorithm is studied under the framework of differential privacy in order to protect each agent’s data. We give a theoretical convergence analysis of the proposed consensus-based algorithm for any bounded unknown distributions on the agents’ data, showing that collaboration provides faster convergence than a fully local approach where agents do not share data, under an oracle decision rule and under some restrictions on the privacy level and the agents’ connectivity, which illustrates the benefit of private collaboration in an online setting under a communication restriction on the agents. The theoretical faster-than-local convergence guarantee is backed up by several numerical results.

[LG-56] A differentiable model of supply-chain shocks NEURIPS2025

Link: https://arxiv.org/abs/2511.05231
Authors: Saad Hamid,José Moran,Luca Mungo,Arnau Quera-Bofarull,Sebastian Towers
Categories: Physics and Society (physics.soc-ph); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Accepted to 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Differentiable Systems and Scientific Machine Learning (EurIPS)

Abstract:Modelling how shocks propagate in supply chains is an increasingly important challenge in economics. Its relevance has been highlighted in recent years by events such as Covid-19 and the Russian invasion of Ukraine. Agent-based models (ABMs) are a promising approach for this problem. However, calibrating them is hard. We show empirically that it is possible to achieve speed ups of over 3 orders of magnitude when calibrating ABMs of supply networks by running them on GPUs and using automatic differentiation, compared to non-differentiable baselines. This opens the door to scaling ABMs to model the whole global supply network.

[LG-57] A New Framework for Convex Clustering in Kernel Spaces: Finite Sample Bounds, Consistency and Performance Insights

Link: https://arxiv.org/abs/2511.05159
Authors: Shubhayan Pan,Saptarshi Chakraborty,Debolina Paul,Kushal Bose,Swagatam Das
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Convex clustering is a well-regarded clustering method that resembles the centroid-based approach of Lloyd’s $k$-means but does not require a predefined cluster count. It starts with each data point as its own centroid and iteratively merges them. Despite its advantages, this method can fail when dealing with data exhibiting linearly non-separable or non-convex structures. To mitigate these limitations, we propose a kernelized extension of the convex clustering method. This approach projects the data points into a Reproducing Kernel Hilbert Space (RKHS) using a feature map, enabling convex clustering in this transformed space. This kernelization not only allows for better handling of complex data distributions but also produces an embedding in a finite-dimensional vector space. We provide comprehensive theoretical underpinnings for our kernelized approach, proving algorithmic convergence and establishing finite sample bounds for our estimates. The effectiveness of our method is demonstrated through extensive experiments on both synthetic and real-world datasets, showing superior performance compared to state-of-the-art clustering techniques. This work marks a significant advancement in the field, offering an effective solution for clustering in non-linear and non-convex data scenarios.

[LG-58] Estimating Bidirectional Causal Effects with Large Scale Online Kernel Learning

Link: https://arxiv.org/abs/2511.05050
Authors: Masahiro Tanaka
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: Accepted for publication in Proceedings of the 2025 International Conference on Data Science and Intelligent Systems (DSIS 2025)

Abstract:In this study, a scalable online kernel learning framework is proposed for estimating bidirectional causal effects in systems characterized by mutual dependence and heteroskedasticity. Traditional causal inference often focuses on unidirectional effects, overlooking the common bidirectional relationships in real-world phenomena. Building on heteroskedasticity-based identification, the proposed method integrates a quasi-maximum likelihood estimator for simultaneous equation models with large-scale online kernel learning. It employs random Fourier feature approximations to flexibly model nonlinear conditional means and variances, while an adaptive online gradient descent algorithm ensures computational efficiency for streaming and high-dimensional data. Results from extensive simulations demonstrate that the proposed method achieves greater accuracy and stability than single-equation and polynomial-approximation baselines, exhibiting lower bias and root mean squared error across various data-generating processes. These results confirm that the proposed approach effectively captures complex bidirectional causal effects with near-linear computational scaling. By combining econometric identification with modern machine learning techniques, the proposed framework offers a practical, scalable, and theoretically grounded solution for large-scale causal inference in natural/social science, policy making, business, and industrial applications.
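
The random Fourier feature approximation mentioned in the abstract replaces a kernel evaluation with an inner product of explicit features. The sketch below covers only that building block, for the Gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2); the bandwidth, feature count, and function names are assumptions, and the paper's quasi-maximum likelihood estimator and online updates are not shown.

```python
import math
import random


def make_rff(dim, n_features, gamma, seed=0):
    """Return a feature map phi with phi(x) . phi(y) ~ exp(-gamma * ||x - y||^2)."""
    rng = random.Random(seed)
    sigma = math.sqrt(2.0 * gamma)  # frequencies are drawn from N(0, 2 * gamma * I)
    weights = [[rng.gauss(0.0, sigma) for _ in range(dim)]
               for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def phi(x):
        return [scale * math.cos(sum(wj * xj for wj, xj in zip(w, x)) + b)
                for w, b in zip(weights, phases)]

    return phi


def approx_kernel(phi, x, y):
    """Approximate kernel value as an inner product of random features."""
    return sum(u * v for u, v in zip(phi(x), phi(y)))


if __name__ == "__main__":
    gamma = 0.5
    phi = make_rff(dim=2, n_features=4000, gamma=gamma)
    x, y = [0.3, -0.2], [1.0, 0.5]
    exact = math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))
    print("exact:", exact, "approx:", approx_kernel(phi, x, y))
```

Because the features are explicit and finite-dimensional, a linear model on phi(x) can be updated one sample at a time with gradient descent, which is what makes kernel learning feasible on streaming data.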

[LG-59] Predicting Cognitive Assessment Scores in Older Adults with Cognitive Impairment Using Wearable Sensors

Link: https://arxiv.org/abs/2511.04983
Authors: Assma Habadi,Milos Zefran,Lijuan Yin,Woojin Song,Maria Caceres,Elise Hu,Naoko Muramatsu
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 40 pages, 2 figures, 3 tables; Supplementary Material: 3 tables (S1-S3). Presented as a poster at the Gerontological Society of America (GSA) Annual Scientific Meeting, November 2025

Abstract:Background and Objectives: This paper focuses on using AI to assess the cognitive function of older adults with mild cognitive impairment or mild dementia using physiological data provided by a wearable device. Cognitive screening tools are disruptive, time-consuming, and only capture brief snapshots of activity. Wearable sensors offer an attractive alternative by continuously monitoring physiological signals. This study investigated whether physiological data can accurately predict scores on established cognitive tests. Research Design and Methods: We recorded physiological signals from 23 older adults completing three NIH Toolbox Cognitive Battery tests, which assess working memory, processing speed, and attention. The Empatica EmbracePlus, a wearable device, measured blood volume pulse, skin conductance, temperature, and movement. Statistical features were extracted using wavelet-based and segmentation methods. We then applied supervised learning and validated predictions via cross-validation, hold-out testing, and bootstrapping. Results: Our models showed strong performance with Spearman’s $\rho$ of 0.73-0.82 and mean absolute errors of 0.14-0.16, significantly outperforming a naive mean predictor. Sensor roles varied: heart-related signals combined with movement and temperature best predicted working memory, movement paired with skin conductance was most informative for processing speed, and heart in tandem with skin conductance worked best for attention. Discussion and Implications: These findings suggest that wearable sensors paired with AI tools such as supervised learning and feature engineering can noninvasively track specific cognitive functions in older adults, enabling continuous monitoring. Our study demonstrates how AI can be leveraged when the data sample is small. This approach may support remote assessments and facilitate clinical interventions.

[LG-60] Prototype Selection Using Topological Data Analysis WWW

Link: https://arxiv.org/abs/2511.04873
Authors: Jordan Eckert,Elvan Ceyhan,Henry Schenck
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Code is found on this http URL

Abstract:Recently, there has been an explosion in statistical learning literature to represent data using topological principles to capture structure and relationships. We propose a topological data analysis (TDA)-based framework, named Topological Prototype Selector (TPS), for selecting representative subsets (prototypes) from large datasets. We demonstrate the effectiveness of TPS on simulated data under different data intrinsic characteristics, and compare TPS against other currently used prototype selection methods in real data settings. In all simulated and real data settings, TPS significantly preserves or improves classification performance while substantially reducing data size. These contributions advance both algorithmic and geometric aspects of prototype learning and offer practical tools for parallelized, interpretable, and efficient classification.

[LG-61] Blind Strong Gravitational Lensing Inversion: Joint Inference of Source and Lens Mass with Score-Based Models NEURIPS2025

Link: https://arxiv.org/abs/2511.04792
Authors: Gabriel Missael Barco,Ronan Legin,Connor Stone,Yashar Hezaveh,Laurence Perreault-Levasseur
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG)
Comments: 18 pages, 9 figures, 1 table. Accepted to the NeurIPS 2025 Workshop on Machine Learning and the Physical Sciences

Abstract:Score-based models can serve as expressive, data-driven priors for scientific inverse problems. In strong gravitational lensing, they enable posterior inference of a background galaxy from its distorted, multiply-imaged observation. Previous work, however, assumes that the lens mass distribution (and thus the forward operator) is known. We relax this assumption by jointly inferring the source and a parametric lens-mass profile, using a sampler based on GibbsDDRM but operating in continuous time. The resulting reconstructions yield residuals consistent with the observational noise, and the marginal posteriors of the lens parameters recover true values without systematic bias. To our knowledge, this is the first successful demonstration of joint source-and-lens inference with a score-based prior.

[LG-62] Machine Learning-Driven Analysis of kSZ Maps to Predict CMB Optical Depth τ

Link: https://arxiv.org/abs/2511.04770
Authors: Farshid Farhadi Khouzani, Abinash Kumar Shaw, Paul La Plante, Bryar Mustafa Shareef, Laxmi Gewali
Categories: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG)
Comments: 12 pages, 5 figures, submitted to PASP

Abstract:Upcoming measurements of the kinetic Sunyaev-Zel’dovich (kSZ) effect, which results from Cosmic Microwave Background (CMB) photons scattering off moving electrons, offer a powerful probe of the Epoch of Reionization (EoR). The kSZ signal contains key information about the timing, duration, and spatial structure of the EoR. A precise measurement of the CMB optical depth τ, a key parameter that characterizes the universe’s integrated electron density, would significantly constrain models of early structure formation. However, the weak kSZ signal is difficult to extract from CMB observations due to significant contamination from astrophysical foregrounds. We present a machine learning approach to extract τ from simulated kSZ maps. We train advanced machine learning models, including swin transformers, on high-resolution seminumeric simulations of the kSZ signal. To robustly quantify prediction uncertainties of τ, we employ the Laplace Approximation (LA). This approach provides an efficient and principled Gaussian approximation to the posterior distribution over the model’s weights, allowing for reliable error estimation. We investigate and compare two distinct application modes: a post-hoc LA applied to a pre-trained model, and an online LA where model weights and hyperparameters are optimized jointly by maximizing the marginal likelihood. This approach provides a framework for robustly constraining τ and its associated uncertainty, which can enhance the analysis of upcoming CMB surveys like the Simons Observatory and CMB-S4.
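
As a rough illustration of the Laplace Approximation mentioned in this abstract, the sketch below fits a Gaussian to a one-dimensional log-posterior at its mode. The Gamma-shaped target, the function names, and the parameters `a` and `b` are assumptions chosen for illustration only; the paper applies the idea to the weights of a neural network, which this toy example does not attempt.

```python
import math

def log_post(theta, a=5.0, b=2.0):
    """Unnormalized log-density of a Gamma(a, b) target (hypothetical example)."""
    return (a - 1.0) * math.log(theta) - b * theta

def derivatives(f, theta, h=1e-4):
    """Central finite-difference estimates of f'(theta) and f''(theta)."""
    grad = (f(theta + h) - f(theta - h)) / (2.0 * h)
    hess = (f(theta + h) - 2.0 * f(theta) + f(theta - h)) / (h * h)
    return grad, hess

def laplace_1d(f, theta0, steps=30):
    """Return (mode, variance) of the Gaussian N(mode, -1 / f''(mode))."""
    theta = theta0
    for _ in range(steps):
        grad, hess = derivatives(f, theta)
        theta -= grad / hess  # Newton step toward the mode
    _, hess = derivatives(f, theta)
    return theta, -1.0 / hess

mode, variance = laplace_1d(log_post, theta0=1.0)
print(mode, variance)
```

For this particular target the mode is (a - 1) / b = 2 and the curvature-based variance is (a - 1) / b^2 = 1, so the printed values should be close to those numbers.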

[LG-63] Generalization in Representation Models via Random Matrix Theory: Application to Recurrent Networks

Link: https://arxiv.org/abs/2511.02401
Authors: Yessin Moakher (X), Malik Tiomoko, Cosme Louart (CUHK-Shenzhen), Zhenyu Liao (HUST)
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We first study the generalization error of models that use a fixed feature representation (frozen intermediate layers) followed by a trainable readout layer. This setting encompasses a range of architectures, from deep random-feature models to echo-state networks (ESNs) with recurrent dynamics. Working in the high-dimensional regime, we apply Random Matrix Theory to derive a closed-form expression for the asymptotic generalization error. We then apply this analysis to recurrent representations and obtain concise formulas that characterize their performance. Surprisingly, we show that a linear ESN is equivalent to ridge regression with an exponentially time-weighted (“memory”) input covariance, revealing a clear inductive bias toward recent inputs. Experiments match predictions: ESNs win in low-sample, short-memory regimes, while ridge prevails with more data or long-range dependencies. Our methodology provides a general framework for analyzing overparameterized models and offers insights into the behavior of deep learning networks.
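
The exponentially time-weighted memory described in this abstract can be seen in a toy case. The sketch below assumes a scalar linear reservoir `x_t = a * x_{t-1} + u_t` with a hypothetical decay factor `a`; it only checks that the final state equals an exponentially weighted sum of past inputs, and does not reproduce the paper's random-matrix analysis or its ridge-regression equivalence.

```python
def linear_reservoir(inputs, a):
    """Run the scalar linear recurrence x_t = a * x_{t-1} + u_t from x = 0."""
    x = 0.0
    for u in inputs:
        x = a * x + u
    return x

def exponentially_weighted_sum(inputs, a):
    """Sum of inputs where an input k steps in the past is weighted by a**k."""
    last = len(inputs) - 1
    return sum(a ** (last - t) * u for t, u in enumerate(inputs))

inputs = [1.0, -2.0, 0.5, 3.0]  # hypothetical input sequence
a = 0.9                         # hypothetical decay factor, |a| < 1

state = linear_reservoir(inputs, a)
weighted = exponentially_weighted_sum(inputs, a)
print(state, weighted)
```

Because older inputs are multiplied by higher powers of `a`, the state is dominated by recent inputs whenever `|a| < 1`, which is the bias toward recent inputs the abstract refers to.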

Information Retrieval

[IR-0] Mapping Research Productivity of BRICS Countries with Special Reference to Coronary Artery Disease (CAD): A Scientometric Study

Link: https://arxiv.org/abs/2511.05211
Authors: Muneer Ahmad
Categories: Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Comments: 260 pages, 21 figures, PhD Thesis 2020

Abstract:This study presents a comprehensive scientometric analysis of research productivity on Coronary Artery Disease (CAD) among the BRICS countries (Brazil, Russia, India, China, and South Africa), using data retrieved from the Web of Science database for the period 1990 to 2019. A total of 50,036 records were analyzed to assess publication growth trends, authorship patterns, collaboration levels, and citation impact. The findings reveal a steady increase in CAD-related publications, with China emerging as the leading contributor, followed by Brazil, Russia, India, and South Africa. English dominated as the primary language of communication, accounting for over 93% of publications. Authorship and collaboration analysis indicates a high degree of joint research, with 97.91% of studies being co-authored and a degree of collaboration of 0.98, underscoring the collective nature of scientific inquiry in this domain. The study validates the applicability of Lotka's Law for author productivity, Bradford's Law for journal distribution, and Zipf's Law for keyword frequency, while the Price Square Root Law was found inapplicable. The predominant publication format was journal articles (79.7%), and Kardiologiya (Russia) emerged as the most prolific journal. The results demonstrate significant growth in CAD research output and collaboration within BRICS, though notable disparities persist among member nations. The study recommends enhancing individual author productivity, expanding international collaboration, and supporting CAD research through strategic institutional and governmental initiatives. These findings provide valuable insights for policymakers, funding agencies, and the academic community to strengthen cardiovascular research capacity within developing economies.
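
The two collaboration figures in this abstract are consistent with each other under a common definition. The sketch below assumes the degree of collaboration is the share of multi-authored papers among all papers (a ratio often attributed to Subramanyam); the abstract does not state which formula the thesis uses, and the function name is hypothetical. Only the 97.91% figure comes from the abstract, and the single-authored share is taken as its complement.

```python
def degree_of_collaboration(multi_authored, single_authored):
    """Share of multi-authored papers among all papers (assumed definition)."""
    total = multi_authored + single_authored
    if total <= 0:
        raise ValueError("total number of papers must be positive")
    return multi_authored / total

co_authored_pct = 97.91                      # from the abstract
single_authored_pct = 100.0 - co_authored_pct  # implied complement

dc = degree_of_collaboration(co_authored_pct, single_authored_pct)
print(round(dc, 2))
```

Since the formula is a ratio, it gives the same result whether it is fed raw paper counts or percentages.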

[IR-1] The use of social media among library professionals and patrons: A review of literature

Link: https://arxiv.org/abs/2511.05051
Authors: Abimbola Agboke, Felicia Nkatv Undie
Categories: Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Comments: 5 pages, Research Paper

Abstract:This paper focuses on the use of social media by library professionals and library users. It provides an overview of social media and of the most popular social media platforms used in libraries. It also discusses the reasons for adopting social media in libraries, whether academic, public, school, or other types of libraries. This is a review paper on the use of social media among library professionals and patrons. The findings reveal the contributions of social media to libraries. Social media makes things easy for library professionals and library users: it enables them to connect, raise awareness of new information, disseminate information instantly, and market library resources and services. Therefore, it is recommended, among other things, that library management boards encourage the use of social media in libraries.

[IR-2] EMO100DB: An Open Dataset of Improvised Songs with Emotion Data

Link: https://arxiv.org/abs/2511.04755
Authors: Daeun Hwang, Saebyul Park
Categories: Sound (cs.SD); Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments: 4 pages, 6 figures, International Conference on Music Perception and Cognition

Abstract:In this study, we introduce Emo100DB: a dataset consisting of improvised songs that were recorded and transcribed with emotion data based on Russell’s circumplex model of emotion. The dataset was developed by collecting improvised songs that consist of melody, lyrics, and an instrumental accompaniment played, sung, and recorded by 20 young adults. Before recording each song, the participants were asked to report their emotional state, with the axes representing arousal and valence based on Russell’s circumplex model of emotions. The dataset is organized into four emotion quadrants, and it includes the lyrics text and MIDI file of the melody extracted from the participant recordings, along with the original audio in WAV format. By providing an integrated composition of data and analysis, this study aims to offer a comprehensive dataset that allows for a diverse exploration of the relationship between music and emotion.
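
To make the four emotion quadrants concrete, the sketch below maps a valence and arousal pair to a quadrant. It assumes both axes are centered at zero and uses hypothetical labels `Q1` to `Q4`; the abstract does not specify the rating scale, the boundary handling, or the naming used in the dataset.

```python
def emotion_quadrant(valence: float, arousal: float) -> str:
    """Map a (valence, arousal) pair to a quadrant label.

    Assumes axes centered at 0; the labels are hypothetical:
    Q1 = positive valence, high arousal; Q2 = negative valence, high arousal;
    Q3 = negative valence, low arousal;  Q4 = positive valence, low arousal.
    Points on an axis are assigned to the non-negative side.
    """
    if arousal >= 0:
        return "Q1" if valence >= 0 else "Q2"
    return "Q4" if valence >= 0 else "Q3"

print(emotion_quadrant(0.7, 0.4))
print(emotion_quadrant(-0.3, -0.6))
```

A real loader would need to rescale the participants' self-reports to this centered range first, if the dataset stores them on a different scale.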

Attachment Download

Click to download today's full paper list