This blog post presents the latest paper list retrieved from Arxiv.org on 2025-10-30. It is updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily list by email, please leave your email address in the comments.

Note: Paper data is fetched from Arxiv.org and updated automatically around 12:00 each day.

Friendly reminder: If you would like to receive the daily paper data by email, please leave your email address in the comments.

Overview (2025-10-30)

A total of 545 papers were updated today, including:

  • Natural Language Processing: 86 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 157 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 87 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 153 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Gaperon: A Peppered English-French Generative Language Model Suite

[Quick Read]: This paper tackles the tension between transparency, reproducibility, and evaluation bias in large-scale multilingual language model training, in particular the intertwined effects of data filtering and contamination on model performance. The key contribution is the release of the Gaperon model suite (1.5B, 8B, and 24B parameters), trained on 2-4 trillion tokens of mixed French-English-code data, together with the complete training pipeline: datasets filtered with a neural quality classifier, an efficient data curation and training framework, and hundreds of intermediate checkpoints. Systematic experiments show that filtering for linguistic quality alone improves fluency and coherence but hurts benchmark scores, whereas deliberate "late contamination" (continuing training on data mixes that include test sets) recovers competitive scores at only a moderate cost to generation quality, revealing that conventional neural filtering can unintentionally amplify the risk of evaluation leakage. The authors also introduce harmless data poisoning during pretraining, providing a realistic testbed for model safety research and advancing the study of trade-offs between data curation, evaluation fairness, safety, and openness in multilingual model development.

Link: https://arxiv.org/abs/2510.25771
Authors: Nathan Godey, Wissam Antoun, Rian Touchent, Rachel Bawden, Éric de la Clergerie, Benoît Sagot, Djamé Seddah
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We release Gaperon, a fully open suite of French-English-coding language models designed to advance transparency and reproducibility in large-scale model training. The Gaperon family includes 1.5B, 8B, and 24B parameter models trained on 2-4 trillion tokens, released with all elements of the training pipeline: French and English datasets filtered with a neural quality classifier, an efficient data curation and training framework, and hundreds of intermediate checkpoints. Through this work, we study how data filtering and contamination interact to shape both benchmark and generative performance. We find that filtering for linguistic quality enhances text fluency and coherence but yields subpar benchmark results, and that late deliberate contamination – continuing training on data mixes that include test sets – recovers competitive scores while only reasonably harming generation quality. We discuss how usual neural filtering can unintentionally amplify benchmark leakage. To support further research, we also introduce harmless data poisoning during pretraining, providing a realistic testbed for safety studies. By openly releasing all models, datasets, code, and checkpoints, Gaperon establishes a reproducible foundation for exploring the trade-offs between data curation, evaluation, safety, and openness in multilingual language model development.

[NLP-1] Decomposition-Enhanced Training for Post-Hoc Attributions In Language Models

[Quick Read]: This paper addresses the weak post-hoc attribution performance of large language models (LLMs) in long-document question answering, especially in multi-hop, abstractive, and semi-extractive settings, where existing methods struggle to trace answers back to their sources and thus undermine trust. The key idea is to reframe post-hoc attribution as a reasoning problem: the model is guided to decompose an answer into semantic units tied to specific context passages, and these decompositions are treated as intermediate reasoning steps during training. The proposed DecompTune method post-trains Qwen-2.5 (7B and 14B) on a curated decomposition-annotated dataset using a two-stage SFT + GRPO pipeline, so that the model explicitly outputs decompositions while generating answers, substantially improving attribution quality, outperforming prior methods, and matching or exceeding frontier models.

Link: https://arxiv.org/abs/2510.25766
Authors: Sriram Balasubramaniam, Samyadeep Basu, Koustava Goswami, Ryan Rossi, Varun Manjunatha, Roshan Santhosh, Ruiyi Zhang, Soheil Feizi, Nedim Lipka
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Post-hoc attribution

Abstract:Large language models (LLMs) are increasingly used for long-document question answering, where reliable attribution to sources is critical for trust. Existing post-hoc attribution methods work well for extractive QA but struggle in multi-hop, abstractive, and semi-extractive settings, where answers synthesize information across passages. To address these challenges, we argue that post-hoc attribution can be reframed as a reasoning problem, where answers are decomposed into constituent units, each tied to specific context. We first show that prompting models to generate such decompositions alongside attributions improves performance. Building on this, we introduce DecompTune, a post-training method that teaches models to produce answer decompositions as intermediate reasoning steps. We curate a diverse dataset of complex QA tasks, annotated with decompositions by a strong LLM, and post-train Qwen-2.5 (7B and 14B) using a two-stage SFT + GRPO pipeline with task-specific curated rewards. Across extensive experiments and ablations, DecompTune substantially improves attribution quality, outperforming prior methods and matching or exceeding state-of-the-art frontier models.

[NLP-2] DiagramEval: Evaluating LLM-Generated Diagrams via Graphs

[Quick Read]: This paper addresses the difficulty of assessing the quality of demonstration diagrams produced by generative AI for research papers. Standard image generative models struggle to produce clear, well-structured diagrams, and while generating diagrams directly as SVG text is promising, sufficiently discriminative and explainable metrics for judging their quality have been lacking. The key contribution is DiagramEval, an evaluation metric that models a diagram as a directed graph, treating text elements as nodes and their connections as directed edges, and scores generated diagrams along two dimensions: node alignment and path alignment. This enables, for the first time, effective quantitative evaluation and characterization of diagrams generated by state-of-the-art LLMs.

Link: https://arxiv.org/abs/2510.25761
Authors: Chumeng Liang, Jiaxuan You
Affiliations: University of Illinois Urbana-Champaign
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Diagrams play a central role in research papers for conveying ideas, yet they are often notoriously complex and labor-intensive to create. Although diagrams are presented as images, standard image generative models struggle to produce clear diagrams with well-defined structure. We argue that a promising direction is to generate demonstration diagrams directly in textual form as SVGs, which can leverage recent advances in large language models (LLMs). However, due to the complexity of components and the multimodal nature of diagrams, sufficiently discriminative and explainable metrics for evaluating the quality of LLM-generated diagrams remain lacking. In this paper, we propose DiagramEval, a novel evaluation metric designed to assess demonstration diagrams generated by LLMs. Specifically, DiagramEval conceptualizes diagrams as graphs, treating text elements as nodes and their connections as directed edges, and evaluates diagram quality using two new groups of metrics: node alignment and path alignment. For the first time, we effectively evaluate diagrams produced by state-of-the-art LLMs on recent research literature, quantitatively demonstrating the validity of our metrics. Furthermore, we show how the enhanced explainability of our proposed metrics offers valuable insights into the characteristics of LLM-generated diagrams. Code: this https URL.
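To make the graph-based evaluation idea concrete, here is a minimal sketch of node and path alignment over diagrams represented as text nodes plus directed edges. The token-F1 matching rule and the 0.5 threshold are illustrative assumptions, not the paper's exact definitions.

```python
# Minimal sketch of graph-based diagram scoring in the spirit of DiagramEval.

def token_f1(a: str, b: str) -> float:
    """Token-level F1 between two text elements."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    overlap = len(ta & tb)
    if overlap == 0:
        return 0.0
    p, r = overlap / len(tb), overlap / len(ta)
    return 2 * p * r / (p + r)

def node_alignment(pred_nodes, ref_nodes, thresh=0.5):
    """Greedy one-to-one matching of predicted to reference nodes."""
    mapping, used = {}, set()
    for p in pred_nodes:
        best, best_score = None, thresh
        for r in ref_nodes:
            if r in used:
                continue
            s = token_f1(p, r)
            if s >= best_score:
                best, best_score = r, s
        if best is not None:
            mapping[p] = best
            used.add(best)
    recall = len(mapping) / max(len(ref_nodes), 1)
    return mapping, recall

def path_alignment(pred_edges, ref_edges, mapping):
    """Fraction of reference edges recovered after node matching."""
    mapped = {(mapping.get(u), mapping.get(v)) for u, v in pred_edges}
    hits = sum(1 for e in ref_edges if e in mapped)
    return hits / max(len(ref_edges), 1)

# Example: a two-box diagram "Encoder -> Decoder".
ref_n, ref_e = ["Encoder", "Decoder"], [("Encoder", "Decoder")]
pred_n, pred_e = ["Text Encoder", "Decoder"], [("Text Encoder", "Decoder")]
m, node_rec = node_alignment(pred_n, ref_n)
print(node_rec, path_alignment(pred_e, ref_e, m))  # 1.0 1.0
```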

[NLP-3] Task Completion Agents are Not Ideal Collaborators

[Quick Read]: This paper argues that current agent evaluation focuses too narrowly on one-shot task completion, ignoring the fact that real-world problems involve underspecified, evolving human goals and require sustained human-agent collaboration. The core issue is that existing evaluations cannot measure how well agents support and enhance human effort over multi-turn interaction, leading to agent designs that neglect sustaining user engagement and scaffolding user understanding. The key contribution is "collaborative effort scaling", a framework that quantifies how an agent's utility grows with increasing user involvement; it offers a new lens for diagnosing agent behavior and guiding the design of more collaborative interaction mechanisms, so that agents not only produce high-quality outputs but also keep users engaged and guided throughout the problem-solving process.

Link: https://arxiv.org/abs/2510.25744
Authors: Shannon Zejiang Shen, Valerie Chen, Ken Gu, Alexis Ross, Zixian Ma, Jillian Ross, Alex Gu, Chenglei Si, Wayne Chi, Andi Peng, Jocelyn J Shen, Ameet Talwalkar, Tongshuang Wu, David Sontag
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 22 pages, 5 figures, 3 tables

Abstract:Current evaluations of agents remain centered around one-shot task completion, failing to account for the inherently iterative and collaborative nature of many real-world problems, where human goals are often underspecified and evolve. We argue for a shift from building and assessing task completion agents to developing collaborative agents, assessed not only by the quality of their final outputs but by how well they engage with and enhance human effort throughout the problem-solving process. To support this shift, we introduce collaborative effort scaling, a framework that captures how an agent’s utility grows with increasing user involvement. Through case studies and simulated evaluations, we show that state-of-the-art agents often underperform in multi-turn, real-world scenarios, revealing a missing ingredient in agent design: the ability to sustain engagement and scaffold user understanding. Collaborative effort scaling offers a lens for diagnosing agent behavior and guiding development toward more effective interactions.

[NLP-4] Scaling Latent Reasoning via Looped Language Models

[Quick Read]: This paper addresses the underuse of pre-training data and the inference inefficiency that result from current LLMs relying on explicit text generation (e.g., chain-of-thought, CoT) for reasoning. The key contribution is Ouro, a family of Looped Language Models (LoopLM) that build reasoning into the pre-training phase through iterative computation in latent space, an entropy-regularized objective for learned depth allocation, and pre-training at the scale of 7.7 trillion tokens, thereby encoding reasoning capability directly into the model architecture. Experiments show that the Ouro models match or surpass state-of-the-art models of up to 12B parameters across many benchmarks, with the advantage stemming from superior knowledge manipulation rather than increased knowledge capacity, and the generated reasoning traces align more closely with final outputs.

Link: https://arxiv.org/abs/2510.25741
Authors: Rui-Jie Zhu, Zixuan Wang, Kai Hua, Tianyu Zhang, Ziniu Li, Haoran Que, Boyi Wei, Zixin Wen, Fan Yin, He Xing, Lu Li, Jiajun Shi, Kaijing Ma, Shanda Li, Taylor Kergan, Andrew Smith, Xingwei Qu, Mude Hui, Bohong Wu, Qiyang Min, Hongzhi Huang, Xun Zhou, Wei Ye, Jiaheng Liu, Jian Yang, Yunfeng Shi, Chenghua Lin, Enduo Zhao, Tianle Cai, Ge Zhang, Wenhao Huang, Yoshua Bengio, Jason Eshraghian
Affiliations: University of California, Santa Cruz; ByteDance
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Modern LLMs are trained to “think” primarily via explicit text generation, such as chain-of-thought (CoT), which defers reasoning to post-training and under-leverages pre-training data. We present and open-source Ouro, named after the recursive Ouroboros, a family of pre-trained Looped Language Models (LoopLM) that instead build reasoning into the pre-training phase through (i) iterative computation in latent space, (ii) an entropy-regularized objective for learned depth allocation, and (iii) scaling to 7.7T tokens. Ouro 1.4B and 2.6B models enjoy superior performance that match the results of up to 12B SOTA LLMs across a wide range of benchmarks. Through controlled experiments, we show this advantage stems not from increased knowledge capacity, but from superior knowledge manipulation capabilities. We also show that LoopLM yields reasoning traces more aligned with final outputs than explicit CoT. We hope our results show the potential of LoopLM as a novel scaling direction in the reasoning era. Our model could be found in: this http URL.
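A minimal PyTorch sketch of the looped-latent idea: one weight-shared block applied repeatedly, with a learned distribution over loop depths whose entropy could be regularized. The module names and the pooled halting rule are illustrative assumptions; Ouro's actual architecture and objective are more involved.

```python
import torch
import torch.nn as nn

class LoopedCore(nn.Module):
    def __init__(self, d_model=256, n_heads=4, max_loops=4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)  # one block, reused each loop
        self.halt = nn.Linear(d_model, 1)        # per-depth halting logit
        self.max_loops = max_loops

    def forward(self, h):
        states, logits = [], []
        for _ in range(self.max_loops):
            h = self.block(h)                         # iterate in latent space
            states.append(h)
            logits.append(self.halt(h.mean(dim=1)))   # pooled halting score
        # Distribution over depths; regularizing its entropy lets the model
        # learn where to stop instead of always spending max depth.
        p_depth = torch.softmax(torch.cat(logits, dim=-1), dim=-1)
        out = sum(p.view(-1, 1, 1) * s
                  for p, s in zip(p_depth.unbind(-1), states))  # expected state
        return out, p_depth

x = torch.randn(2, 10, 256)  # (batch, seq, d_model) latent input
y, depth_probs = LoopedCore()(x)
print(y.shape, depth_probs.shape)  # [2, 10, 256] and [2, 4]
```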

[NLP-5] The Limits of Obliviate: Evaluating Unlearning in LLMs via Stimulus-Knowledge Entanglement-Behavior Framework

[Quick Read]: This paper addresses the difficulty of evaluating "unlearning" in large language models (LLMs), particularly how to verify whether a model has truly removed specific factual knowledge in sensitive-data management and misinformation-correction scenarios. The core problem is that even after unlearning, latent memory may persist or be reactivated by external prompts, and existing evaluations cannot systematically quantify this behavior. The key contribution is the Stimulus-Knowledge Entanglement-Behavior Framework (SKeB), which draws on ACT-R and Hebbian theory (spreading activation) as well as communication principles, models knowledge entanglement via domain graphs, and defines entanglement metrics to quantify knowledge activation patterns. Persuasive prompting serves as the stimulus, testing how different framings (e.g., authority framing) affect factual recall in unlearned models and linking unlearning completeness to prompt robustness. Experiments show persuasive prompts substantially raise factual recall (from 14.8% to 24.5%), with the effect diminishing as model size grows, providing a quantitative basis for assessing unlearning in LLMs.

Link: https://arxiv.org/abs/2510.25732
Authors: Aakriti Shah, Thai Le
Affiliations: University of Southern California; Indiana University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages, 11 figures

Abstract:Unlearning in large language models (LLMs) is crucial for managing sensitive data and correcting misinformation, yet evaluating its effectiveness remains an open problem. We investigate whether persuasive prompting can recall factual knowledge from deliberately unlearned LLMs across models ranging from 2.7B to 13B parameters (OPT-2.7B, LLaMA-2-7B, LLaMA-3.1-8B, LLaMA-2-13B). Drawing from ACT-R and Hebbian theory (spreading activation theories), as well as communication principles, we introduce Stimulus-Knowledge Entanglement-Behavior Framework (SKeB), which models information entanglement via domain graphs and tests whether factual recall in unlearned models is correlated with persuasive framing. We develop entanglement metrics to quantify knowledge activation patterns and evaluate factuality, non-factuality, and hallucination in outputs. Our results show persuasive prompts substantially enhance factual knowledge recall (14.8% baseline vs. 24.5% with authority framing), with effectiveness inversely correlated to model size (128% recovery in 2.7B vs. 15% in 13B). SKeB provides a foundation for assessing unlearning completeness, robustness, and overall behavior in LLMs.
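As a toy illustration of the spreading-activation view behind the entanglement metrics, the sketch below propagates activation from a stimulus over a small domain graph. The decay rule, round count, and graph are made-up stand-ins for the paper's construction; the point is that an "unlearned" fact can remain reachable through entangled neighbors.

```python
# Toy spreading-activation pass over a domain graph (ACT-R/Hebbian flavor).

def spread_activation(graph, seeds, decay=0.5, rounds=2):
    """graph: {node: [neighbors]}; seeds: initially activated concepts."""
    act = {n: 0.0 for n in graph}
    for s in seeds:
        act[s] = 1.0
    for _ in range(rounds):
        nxt = dict(act)
        for node, neighbors in graph.items():
            for nb in neighbors:
                nxt[nb] = max(nxt[nb], decay * act[node])  # decayed spread
        act = nxt
    return act

domain = {
    "capital_of_france": ["paris", "france"],
    "paris": ["eiffel_tower"],
    "france": [],
    "eiffel_tower": [],
}
# Even if "capital_of_france" were unlearned, activation still reaches
# related nodes via entanglement.
print(spread_activation(domain, seeds=["capital_of_france"]))
```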

[NLP-6] The Tool Decathlon: Benchmarking Language Agents for Diverse Realistic and Long-Horizon Task Execution

[Quick Read]: This paper addresses the limited real-world applicability of existing language-agent benchmarks, which concentrate on narrow domains or simplified tasks and lack diverse applications, realistic environment states, and verification of long-horizon, complex tasks. The key contribution is the Tool Decathlon (Toolathlon), a benchmark spanning 32 software applications and 604 tools, most backed by high-quality Model Context Protocol (MCP) servers. It provides realistic initial environment states drawn from real software (e.g., Canvas courses with dozens of students or actual financial spreadsheets) and verifies each task's execution outcome strictly through dedicated evaluation scripts, substantially improving the realism and difficulty of evaluating language agents on multi-step, cross-platform, long-horizon tasks.

Link: https://arxiv.org/abs/2510.25726
Authors: Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu, Xiaochen Wang, Rui Ge, Yuxuan Cao, Yuzhen Huang, Wei Liu, Junteng Liu, Zhaochen Su, Yiyang Guo, Fan Zhou, Lueyang Zhang, Juan Michelini, Xingyao Wang, Xiang Yue, Shuyan Zhou, Graham Neubig, Junxian He
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Website: this https URL

Abstract:Real-world language agents must handle complex, multi-step workflows across diverse Apps. For instance, an agent may manage emails by coordinating with calendars and file systems, or monitor a production database to detect anomalies and generate reports following an operating manual. However, existing language agent benchmarks often focus on narrow domains or simplified tasks that lack the diversity, realism, and long-horizon complexity required to evaluate agents’ real-world performance. To address this gap, we introduce the Tool Decathlon (dubbed as Toolathlon), a benchmark for language agents offering diverse Apps and tools, realistic environment setup, and reliable execution-based evaluation. Toolathlon spans 32 software applications and 604 tools, ranging from everyday platforms such as Google Calendar and Notion to professional ones like WooCommerce, Kubernetes, and BigQuery. Most of the tools are based on a high-quality set of Model Context Protocol (MCP) servers that we may have revised or implemented ourselves. Unlike prior works, which primarily ensure functional realism but offer limited environment state diversity, we provide realistic initial environment states from real software, such as Canvas courses with dozens of students or real financial spreadsheets. This benchmark includes 108 manually sourced or crafted tasks in total, requiring interacting with multiple Apps over around 20 turns on average to complete. Each task is strictly verifiable through dedicated evaluation scripts. Comprehensive evaluation of SOTA models highlights their significant shortcomings: the best-performing model, Claude-4.5-Sonnet, achieves only a 38.6% success rate with 20.2 tool calling turns on average, while the top open-weights model DeepSeek-V3.2-Exp reaches 20.1%. We expect Toolathlon to drive the development of more capable language agents for real-world, long-horizon task execution.

[NLP-7] Interpreting LLMs as Credit Risk Classifiers: Do Their Feature Explanations Align with Classical ML? CIKM2025

[Quick Read]: This paper examines whether generative AI is suitable and trustworthy as a standalone classifier for structured financial risk prediction, specifically whether zero-shot prompting of LLMs can be relied upon for loan default prediction. The key approach is a systematic comparison between zero-shot LLM classifiers and LightGBM, a state-of-the-art gradient-boosting model, on a real-world loan default task: evaluating predictive performance, analyzing feature attributions with SHAP (SHapley Additive exPlanations), and checking the consistency and reliability of the LLMs' self-generated explanations. The study finds that although LLMs can identify key financial risk indicators, their feature-importance rankings diverge markedly from LightGBM's, and their self-explanations often fail to match empirical SHAP attributions, exposing the limits of LLMs as standalone decision tools in high-stakes finance and the credibility risks of their explanations. The results argue for explainability audits, baseline comparisons with interpretable models, and human-in-the-loop oversight before deployment.

Link: https://arxiv.org/abs/2510.25701
Authors: Saeed AlMarri, Kristof Juhasz, Mathieu Ravaut, Gautier Marti, Hamdan Al Ahbabi, Ibrahim Elfadel
Affiliations: Khalifa University; ADIA
Subjects: Computation and Language (cs.CL)
Comments: 8 pages, 6 figures, 3 tables, CIKM 2025 FinFAI workshop

Abstract:Large Language Models (LLMs) are increasingly explored as flexible alternatives to classical machine learning models for classification tasks through zero-shot prompting. However, their suitability for structured tabular data remains underexplored, especially in high-stakes financial applications such as financial risk assessment. This study conducts a systematic comparison between zero-shot LLM-based classifiers and LightGBM, a state-of-the-art gradient-boosting model, on a real-world loan default prediction task. We evaluate their predictive performance, analyze feature attributions using SHAP, and assess the reliability of LLM-generated self-explanations. While LLMs are able to identify key financial risk indicators, their feature importance rankings diverge notably from LightGBM, and their self-explanations often fail to align with empirical SHAP attributions. These findings highlight the limitations of LLMs as standalone models for structured financial risk prediction and raise concerns about the trustworthiness of their self-generated explanations. Our results underscore the need for explainability audits, baseline comparisons with interpretable models, and human-in-the-loop oversight when deploying LLMs in risk-sensitive financial environments.
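The comparison protocol can be sketched as follows: train LightGBM, compute mean-|SHAP| importances, and correlate the resulting feature ranking with an LLM's self-reported ranking. The synthetic data, feature names, and hard-coded `llm_rank` below are placeholders for the paper's loan-default dataset and parsed model explanations.

```python
import numpy as np
import lightgbm as lgb
import shap
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
features = ["income", "debt_ratio", "credit_age", "late_payments"]
X = rng.normal(size=(1000, 4))
# Synthetic default risk driven mainly by debt_ratio and late_payments.
y = (0.8 * X[:, 1] + 0.6 * X[:, 3]
     + rng.normal(scale=0.5, size=1000) > 0.5).astype(int)

model = lgb.LGBMClassifier(n_estimators=200).fit(X, y)
shap_vals = shap.TreeExplainer(model).shap_values(X)
if isinstance(shap_vals, list):          # binary case may return [neg, pos]
    shap_vals = shap_vals[1]
gbm_importance = np.abs(shap_vals).mean(axis=0)

# Stand-in for an LLM's self-reported importance ranking (1 = most important);
# in the study this would come from parsing the model's explanation text.
llm_rank = {"income": 1, "debt_ratio": 2, "credit_age": 3, "late_payments": 4}

gbm_rank = (-gbm_importance).argsort().argsort() + 1  # 1 = most important
rho, _ = spearmanr([llm_rank[f] for f in features], gbm_rank)
print("rank agreement (Spearman rho):", round(float(rho), 3))
```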

[NLP-8] Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents

[Quick Read]: This paper addresses the bottleneck that environment configuration poses for applying large language models (LLMs) to software engineering: the process is heavily manual, and high-quality, large-scale datasets are scarce. Existing benchmarks only measure end-to-end build/test success, revealing nothing about where or why agents succeed or fail during configuration. The key contribution is Enconda-bench, the first process-level diagnostic benchmark for environment configuration. It injects realistic README errors and validates tasks in Docker, enabling process-level trajectory evaluation of fine-grained agent capabilities: planning, perception-driven error diagnosis, feedback-driven repair, and final execution. This goes beyond aggregate success rates to deliver interpretable capability analysis and actionable directions for improvement.

Link: https://arxiv.org/abs/2510.25694
Authors: Jiayi Kuang, Yinghui Li, Xin Zhang, Yangning Li, Di Yin, Xing Sun, Ying Shen, Philip S. Yu
Affiliations: Tencent Youtu Lab; Sun Yat-sen University; The Hong Kong Polytechnic University; University of Illinois at Chicago
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large language model-based agents show promise for software engineering, but environment configuration remains a bottleneck due to heavy manual effort and scarce large-scale, high-quality datasets. Existing benchmarks assess only end-to-end build/test success, obscuring where and why agents succeed or fail. We introduce the Environment Configuration Diagnosis Benchmark, Enconda-bench, which provides process-level trajectory assessment of fine-grained agent capabilities during environment setup-planning, perception-driven error diagnosis, feedback-driven repair, and action to execute final environment configuration. Our task instances are automatically constructed by injecting realistic README errors and are validated in Docker for scalable, high-quality evaluation. Enconda-bench combines process-level analysis with end-to-end executability to enable capability assessments beyond aggregate success rates. Evaluations across state-of-the-art LLMs and agent frameworks show that while agents can localize errors, they struggle to translate feedback into effective corrections, limiting end-to-end performance. To our knowledge, Enconda-bench is the first framework to provide process-level internal capability assessment for environment configuration, offering actionable insights for improving software engineering agents.

[NLP-9] PairUni: Pairwise Training for Unified Multimodal Language Models

[Quick Read]: This paper addresses the difficulty of balancing understanding and generation tasks when training unified vision-language models (UVLMs) with reinforcement learning (RL), since the two tasks rely on heterogeneous data and supervision and easily interfere with each other. The key contribution is the PairUni framework, which reorganizes data into understanding-generation (UG) pairs and introduces the Pair-GPRO algorithm: GPT-o3 first creates aligned cross-task samples (captions for understanding examples, QA pairs for generation examples), and semantically related cross-task samples are retrieved to form additional pairs; during optimization, the advantage is modulated by pair similarity, strengthening learning from well-aligned pairs and reducing task interference. This design exposes cross-task semantic correspondences and yields more consistent policy learning.

Link: https://arxiv.org/abs/2510.25682
Authors: Jiani Zheng, Zhiyang Teng, Xiangtai Li, Anran Wang, Yu Tian, Kunpeng Qiu, Ye Tian, Haochen Wang, Zhuochen Wang
Affiliations: ByteDance
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Unified vision-language models (UVLMs) must perform both understanding and generation within a single architecture, but these tasks rely on heterogeneous data and supervision, making it difficult to balance them during reinforcement learning (RL). We propose PairUni, a unified framework that reorganizes data into understanding-generation (UG) pairs and aligns optimization accordingly. We first use GPT-o3 to augment single-task data, generating captions for understanding samples and question-answer (QA) pairs for generation samples, forming aligned pairs from the same instance. Additionally, for each generation sample, we retrieve a semantically related understanding example to form a retrieved pair, linking different but related data points. These paired structures expose cross-task semantic correspondences and support consistent policy learning. To leverage this structure, we present Pair-GPRO, a pair-aware variant based on Group Relative Policy Optimization. It assigns a similarity score to each pair to modulate the advantage, strengthening learning from well-aligned examples and reducing task interference. We curate a high-quality dataset of 16K UG pairs named PairUG for RL fine-tuning and evaluate PairUni on the powerful Janus-Pro UVLMs. Our approach achieves balanced improvements on various UVLMs, outperforming strong UVLM RL baselines. Code: this https URL
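The advantage-modulation step lends itself to a short sketch: normalize rewards within a rollout group (GRPO-style) and scale each sample's advantage by its pair's similarity score. The exact normalization and the way similarity is computed are assumptions here.

```python
import torch

def pair_gpro_advantages(rewards: torch.Tensor,
                         pair_similarity: torch.Tensor) -> torch.Tensor:
    # rewards, pair_similarity: (group_size,)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    return pair_similarity * adv  # well-aligned pairs contribute more

rewards = torch.tensor([0.2, 0.9, 0.4, 0.7])
sim = torch.tensor([0.9, 1.0, 0.3, 0.8])  # UG-pair similarity scores
print(pair_gpro_advantages(rewards, sim))
```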

[NLP-10] ZK-SenseLM: Verifiable Large-Model Wireless Sensing with Selective Abstention and Zero-Knowledge Attestation

[Quick Read]: This paper addresses the lack of security and auditability in wireless sensing systems, especially when large models process Wi-Fi channel state information (CSI): how to guarantee trustworthy inference, resist tampering and replay attacks, and keep decisions low-risk under distribution shift. The key elements of ZK-SenseLM are: (1) an encoder using masked spectral pretraining with phase-consistency regularization, plus a light cross-modal alignment that maps RF features to human-interpretable policy tokens; (2) a calibrated selective-abstention head whose chosen risk-coverage operating point is registered and bound into zero-knowledge proofs, improving robustness; and (3) a four-stage proving pipeline (C1-C4) yielding end-to-end PLONK-style zero-knowledge proofs that the quantized network, given the committed time window, produced the logged action and confidence, with micro-batched proving to amortize cost and a gateway option to offload proving from low-power devices. The design remains compatible with differentially private federated learning and on-device personalization without weakening verifiability, and improves macro-F1, calibration, and coverage-risk behavior under perturbations across multiple tasks.

Link: https://arxiv.org/abs/2510.25677
Authors: Hasan Akgul, Mari Eplik, Javier Rojas, Aina Binti Abdullah, Pieter van der Merwe
Affiliations: Istanbul Technical University; University of Tartu; University of Chile; Universiti Teknologi Malaysia; Stellenbosch University
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: 45 pages

Abstract:ZK-SenseLM is a secure and auditable wireless sensing framework that pairs a large-model encoder for Wi-Fi channel state information (and optionally mmWave radar or RFID) with a policy-grounded decision layer and end-to-end zero-knowledge proofs of inference. The encoder uses masked spectral pretraining with phase-consistency regularization, plus a light cross-modal alignment that ties RF features to compact, human-interpretable policy tokens. To reduce unsafe actions under distribution shift, we add a calibrated selective-abstention head; the chosen risk-coverage operating point is registered and bound into the proof. We implement a four-stage proving pipeline: (C1) feature sanity and commitment, (C2) threshold and version binding, (C3) time-window binding, and (C4) PLONK-style proofs that the quantized network, given the committed window, produced the logged action and confidence. Micro-batched proving amortizes cost across adjacent windows, and a gateway option offloads proofs from low-power devices. The system integrates with differentially private federated learning and on-device personalization without weakening verifiability: model hashes and the registered threshold are part of each public statement. Across activity, presence or intrusion, respiratory proxy, and RF fingerprinting tasks, ZK-SenseLM improves macro-F1 and calibration, yields favorable coverage-risk curves under perturbations, and rejects tamper and replay with compact proofs and fast verification.

[NLP-11] EHR-R1: A Reasoning-Enhanced Foundational Language Model for Electronic Health Record Analysis

[Quick Read]: This paper addresses the narrow task coverage and lack of EHR-oriented reasoning in current large language models (LLMs) for electronic health record (EHR) analysis. The core solution is EHR-Ins, a large-scale, comprehensive EHR reasoning instruction dataset with 300k high-quality reasoning cases and 4M non-reasoning cases across 42 EHR tasks, generated at scale through a thinking-graph-driven framework. Building on it, the authors develop EHR-R1, a series of reasoning-enhanced LLMs with up to 72B parameters, trained via a multi-stage paradigm (domain adaptation, reasoning enhancement, and reinforcement learning) to systematically acquire clinical knowledge and diverse reasoning abilities, markedly improving the accuracy and robustness of EHR analysis.

Link: https://arxiv.org/abs/2510.25628
Authors: Yusheng Liao, Chaoyi Wu, Junwei Liu, Shuyang Jiang, Pengcheng Qiu, Haowen Wang, Yun Yue, Shuai Zhen, Jian Wang, Qianrui Fan, Jinjie Gu, Ya Zhang, Yanfeng Wang, Yu Wang, Weidi Xie
Affiliations: Shanghai Jiao Tong University; AntGroup; Shanghai Artificial Intelligence Laboratory; Peking University; Fudan University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Electronic Health Records (EHRs) contain rich yet complex information, and their automated analysis is critical for clinical decision-making. Despite recent advances of large language models (LLMs) in clinical workflows, their ability to analyze EHRs remains limited due to narrow task coverage and lack of EHR-oriented reasoning capabilities. This paper aims to bridge the gap, specifically, we present EHR-Ins, a large-scale, comprehensive EHR reasoning instruction dataset, comprising 300k high-quality reasoning cases and 4M non-reasoning cases across 42 distinct EHR tasks. Its core innovation is a thinking-graph-driven framework that enables to generate high-quality reasoning data at scale. Based on it, we develop EHR-R1, a series of reasoning-enhanced LLMs with up to 72B parameters tailored for EHR analysis. Through a multi-stage training paradigm, including domain adaptation, reasoning enhancement, and reinforcement learning, EHR-R1 systematically acquires domain knowledge and diverse reasoning capabilities, enabling accurate and robust EHR analysis. Lastly, we introduce EHR-Bench, a new benchmark curated from MIMIC-IV, spanning 42 tasks, to comprehensively assess reasoning and prediction across EHR scenarios. In experiments, we show that the resulting EHR-R1 consistently outperforms state-of-the-art commercial and open-source LLMs (including DeepSeek-V3 and GPT-4o), surpassing GPT-4o by over 30 points on MIMIC-Bench and achieving a 10% higher zero-shot AUROC on EHRSHOT. Collectively, EHR-Ins, EHR-R1, and EHR-Bench have significantly advanced the development for more reliable and clinically relevant EHR analysis.

[NLP-12] Are Language Models Efficient Reasoners? A Perspective from Logic Programming NEURIPS2025

[Quick Read]: This paper argues that current evaluation of LLM reasoning focuses on correctness while neglecting efficiency, especially in realistic settings where much of the available information is irrelevant and effective deduction requires identifying and ignoring distracting premises. The key contribution is an evaluation framework based on logic programming: natural-language proofs generated by a model are aligned with the shortest proofs obtained by executing the logic program, quantifying how well the model avoids unnecessary inference steps. This makes it possible to systematically detect when redundant premises lead models into irrelevant reasoning detours and thereby expose deficits in reasoning efficiency.

Link: https://arxiv.org/abs/2510.25626
Authors: Andreas Opedal, Yanick Zengaffinen, Haruki Shirakami, Clemente Pasti, Mrinmaya Sachan, Abulhair Saparov, Ryan Cotterell, Bernhard Schölkopf
Affiliations: ETH Zürich; MPI for Intelligent Systems, Tübingen; EPFL; Idiap Research Institute; Purdue University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: Accepted to NeurIPS 2025

Abstract:Modern language models (LMs) exhibit strong deductive reasoning capabilities, yet standard evaluations emphasize correctness while overlooking a key aspect of human-like reasoning: efficiency. In real-world reasoning scenarios, much of the available information is irrelevant, and effective deductive inference requires identifying and ignoring such distractions. We propose a framework for assessing LM reasoning efficiency through the lens of logic programming, introducing a simple method to align proofs written in natural language – as generated by an LM – with shortest proofs found by executing the logic program. Efficiency is quantified by measuring how well a model avoids unnecessary inference. Empirically, we construct a dataset of math word problems injected with various number of irrelevant axioms that vary in semantic overlap with the goal theorem. We find that current LMs show marked accuracy declines under such conditions – even with minimal, domain-consistent distractions – and the proofs they generate frequently exhibit detours through irrelevant inferences.
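A stripped-down version of the efficiency idea: score a model's proof by how many of its steps appear in the solver's shortest proof. Aligning natural-language proofs to logic-program proofs is the hard part the paper addresses; exact string matching here is a simplifying assumption.

```python
def efficiency(model_steps, shortest_proof_steps):
    """1.0 = no unnecessary inferences; lower = more detours."""
    necessary = set(shortest_proof_steps)
    used = [s for s in model_steps if s in necessary]
    if not model_steps:
        return 0.0
    return len(set(used)) / len(model_steps)

shortest = ["a->b", "b->c"]
model = ["a->b", "x->y", "b->c"]  # one detour through a distractor axiom
print(efficiency(model, shortest))  # ~0.667
```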

[NLP-13] Evaluating the Role of Verifiers in Test-Time Scaling for Legal Reasoning Tasks EMNLP

[Quick Read]: This paper investigates whether test-time scaling (TTS), whose value is established in formal domains, carries over to argumentative domains such as law, focusing on verifier-based TTS for legal multiple-choice QA (MCQA). The key approach is an empirical study across five benchmarks that compares outcome-level (Best-of-N) and process-level (tree search) verification using a family of 7 reward models under realistic low-N budgets, and analyzes how verifier utility depends on domain specialization, model size, and supervision type (process-supervised PRMs vs. outcome-only ORMs), providing quantitative guidance for applying TTS to complex legal reasoning.

Link: https://arxiv.org/abs/2510.25623
Authors: Davide Romano, Jonathan Schwarz, Daniele Giofré
Affiliations: Thomson Reuters
Subjects: Computation and Language (cs.CL)
Comments: Accepted to EMNLP - NLLP Workshop

Abstract:Test-time scaling (TTS) techniques can improve the performance of large language models (LLMs) at the expense of additional computation and latency. While TTS has proven effective in formal domains such as mathematics and programming (Snell et al., 2024; Chen et al., 2024), its value in argumentative domains such as law remains underexplored. We present an empirical study of verifier-based TTS methods for legal multiple-choice QA (MCQA) across five benchmarks. Using a family of 7 reward models, we evaluate both outcome-level (Best-of-N) and process-level (tree search) verification under realistic low-N budgets. Our analysis systematically investigates how verifier utility is affected by key properties such as domain specialization, model size, and supervision type (process-supervised PRMs vs. outcome-only ORMs), even when applied across different roles.
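For reference, outcome-level Best-of-N verification reduces to sampling N candidates and keeping the highest-scoring one under a reward model. The `generate` and `reward_model` callables below are toy stand-ins for the paper's LLM and reward models.

```python
import random

def best_of_n(question, generate, reward_model, n=4):
    candidates = [generate(question) for _ in range(n)]
    scores = [reward_model(question, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]

# Toy stand-ins: sample from a fixed pool, reward longer reasoning chains.
pool = ["Answer: A", "Reasoning step... Answer: B", "Answer: C"]
answer, score = best_of_n(
    "Which clause applies?",
    generate=lambda q: random.choice(pool),
    reward_model=lambda q, a: len(a),
)
print(answer, score)
```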

[NLP-14] FARSIQA: Faithful and Advanced RAG System for Islamic Question Answering

[Quick Read]: This paper addresses the unreliability of large language models (LLMs) in high-stakes, specialized domains such as religious question answering, where hallucination and unfaithfulness to authoritative sources are unacceptable, a concern that is especially acute for the Persian-speaking Muslim community. Existing retrieval-augmented generation (RAG) systems rely on simplistic single-pass pipelines and fall short on complex queries requiring multi-hop reasoning and evidence aggregation. The key contribution is the FAIR-RAG architecture (a Faithful, Adaptive, Iterative Refinement framework for RAG) underlying the FARSIQA system: it adaptively decomposes complex queries, assesses evidence sufficiency, and enters an iterative loop that generates sub-queries to progressively fill information gaps, substantially improving answer accuracy and faithfulness.

Link: https://arxiv.org/abs/2510.25621
Authors: Mohammad Aghajani Asl, Behrooz Minaei Bidgoli
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 37 pages, 5 figures, 10 tables. Keywords: Retrieval-Augmented Generation (RAG), Question Answering (QA), Islamic Knowledge Base, Faithful AI, Persian NLP, Multi-hop Reasoning, Large Language Models (LLMs)

Abstract:The advent of Large Language Models (LLMs) has revolutionized Natural Language Processing, yet their application in high-stakes, specialized domains like religious question answering is hindered by challenges like hallucination and unfaithfulness to authoritative sources. This issue is particularly critical for the Persian-speaking Muslim community, where accuracy and trustworthiness are paramount. Existing Retrieval-Augmented Generation (RAG) systems, relying on simplistic single-pass pipelines, fall short on complex, multi-hop queries requiring multi-step reasoning and evidence aggregation. To address this gap, we introduce FARSIQA, a novel, end-to-end system for Faithful Advanced Question Answering in the Persian Islamic domain. FARSIQA is built upon our innovative FAIR-RAG architecture: a Faithful, Adaptive, Iterative Refinement framework for RAG. FAIR-RAG employs a dynamic, self-correcting process: it adaptively decomposes complex queries, assesses evidence sufficiency, and enters an iterative loop to generate sub-queries, progressively filling information gaps. Operating on a curated knowledge base of over one million authoritative Islamic documents, FARSIQA demonstrates superior performance. Rigorous evaluation on the challenging IslamicPCQA benchmark shows state-of-the-art performance: the system achieves a remarkable 97.0% in Negative Rejection - a 40-point improvement over baselines - and a high Answer Correctness score of 74.3%. Our work establishes a new standard for Persian Islamic QA and validates that our iterative, adaptive architecture is crucial for building faithful, reliable AI systems in sensitive domains.
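The adaptive loop described above can be summarized in a short skeleton. All component names (`decompose`, `retrieve`, `is_sufficient`, `make_subqueries`, `answer`) are hypothetical hooks for LLM- and retriever-backed modules, not FARSIQA's actual API.

```python
def fair_rag(query, decompose, retrieve, is_sufficient,
             make_subqueries, answer, max_rounds=3):
    sub_questions = decompose(query)              # adaptive decomposition
    evidence = []
    for sq in sub_questions:
        evidence.extend(retrieve(sq))
    for _ in range(max_rounds):                   # iterative refinement
        if is_sufficient(query, evidence):        # evidence sufficiency check
            break
        for sq in make_subqueries(query, evidence):  # fill information gaps
            evidence.extend(retrieve(sq))
    return answer(query, evidence)

# Toy run with trivial stand-ins:
print(fair_rag(
    "Who narrated hadith X and in which collection?",
    decompose=lambda q: ["Who narrated hadith X?", "Which collection?"],
    retrieve=lambda sq: [f"doc about: {sq}"],
    is_sufficient=lambda q, ev: len(ev) >= 2,
    make_subqueries=lambda q, ev: [],
    answer=lambda q, ev: f"answer grounded in {len(ev)} documents",
))
```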

[NLP-15] Communication and Verification in LLM Agents towards Collaboration under Information Asymmetry ICML2025

[Quick Read]: This paper studies the underexplored ability of large language model (LLM) agents to collaborate on joint tasks under information asymmetry, focusing on how agents communicate and understand shared rules to coordinate effectively. The key contribution is a fine-tuning-plus-verifier framework in which LLM agents are equipped with diverse communication strategies and verification signals from the environment to strengthen rule understanding and execution. Empirical results show that aligned communication is critical, especially when agents must both seek and provide information, and that an environment-based verifier improves task completion and safety while increasing human evaluators' trust in agent behavior, promoting more interpretable and reliable collaboration in AI systems.

Link: https://arxiv.org/abs/2510.25595
Authors: Run Peng, Ziqiao Ma, Amy Pang, Sikai Li, Zhang Xi-Jia, Yingzhuo Yu, Cristian-Paul Bara, Joyce Chai
Affiliations: University of Michigan; UNC, Chapel Hill; Georgia Tech; Apple; Robert Bosch SRL; Babeş-Bolyai University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Workshop on Multi-Agent System @ ICML 2025

Abstract:While Large Language Model (LLM) agents are often approached from the angle of action planning/generation to accomplish a goal (e.g., given by language descriptions), their abilities to collaborate with each other to achieve a joint goal are not well explored. To address this limitation, this paper studies LLM agents in task collaboration, particularly under the condition of information asymmetry, where agents have disparities in their knowledge and skills and need to work together to complete a shared task. We extend Einstein Puzzles, a classical symbolic puzzle, to a table-top game. In this game, two LLM agents must reason, communicate, and act to satisfy spatial and relational constraints required to solve the puzzle. We apply a fine-tuning-plus-verifier framework in which LLM agents are equipped with various communication strategies and verification signals from the environment. Empirical results highlight the critical importance of aligned communication, especially when agents possess both information-seeking and -providing capabilities. Interestingly, agents without communication can still achieve high task performance; however, further analysis reveals a lack of true rule understanding and lower trust from human evaluators. Instead, by integrating an environment-based verifier, we enhance agents’ ability to comprehend task rules and complete tasks, promoting both safer and more interpretable collaboration in AI systems. this https URL

[NLP-16] Hybrid Quantum-Classical Recurrent Neural Networks

[Quick Read]: This paper targets the capacity bottlenecks and physical inconsistencies of classical recurrent neural networks (RNNs) in modeling long-range dependencies and high-dimensional state spaces, while exploring how quantum computational resources can aid sequence learning. The key contribution is a hybrid quantum-classical recurrent neural network (QRNN) whose entire recurrent core is a parametrized quantum circuit (PQC): the PQC is unitary by construction, guaranteeing norm-preserving evolution of the hidden state in the exponentially large Hilbert space $\mathbb{C}^{2^n}$, while a classical feedforward network processes mid-circuit measurements and produces the control parameters, injecting explicit nonlinearity for input-conditioned parametrization. The architecture unifies high-capacity unitary memory, partial observation via mid-circuit measurements, and nonlinear classical control, and achieves competitive performance against strong baselines on sentiment analysis, MNIST, copying memory, and other sequence tasks.

Link: https://arxiv.org/abs/2510.25557
Authors: Wenduan Xu
Affiliations: Quantinuum
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Quantum Physics (quant-ph)
Comments:

Abstract:We present a hybrid quantum-classical recurrent neural network (QRNN) architecture in which the entire recurrent core is realized as a parametrized quantum circuit (PQC) controlled by a classical feedforward network. The hidden state is the quantum state of an n-qubit PQC, residing in an exponentially large Hilbert space $\mathbb{C}^{2^n}$. The PQC is unitary by construction, making the hidden-state evolution norm-preserving without external constraints. At each timestep, mid-circuit readouts are combined with the input embedding and processed by the feedforward network, which provides explicit classical nonlinearity. The outputs parametrize the PQC, which updates the hidden state via unitary dynamics. The QRNN is compact and physically consistent, and it unifies (i) unitary recurrence as a high-capacity memory, (ii) partial observation via mid-circuit measurements, and (iii) nonlinear classical control for input-conditioned parametrization. We evaluate the model in simulation with up to 14 qubits on sentiment analysis, MNIST, permuted MNIST, copying memory, and language modeling, adopting projective measurements as a limiting case to obtain mid-circuit readouts while maintaining a coherent recurrent quantum memory. We further devise a soft attention mechanism over the mid-circuit readouts in a sequence-to-sequence model and show its effectiveness for machine translation. To our knowledge, this is the first model (RNN or otherwise) grounded in quantum operations to achieve competitive performance against strong classical baselines across a broad class of sequence-learning tasks.
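To ground the norm-preserving recurrence, here is a toy NumPy simulation of one RY-rotation layer per timestep, with angles produced by a stand-in classical controller and per-qubit Z-expectation readouts. The actual PQC ansatz, measurement scheme, and controller in the paper are richer than this.

```python
import numpy as np

n = 3
state = np.zeros(2 ** n, dtype=complex)  # hidden state lives in C^(2^n)
state[0] = 1.0                           # start in |000>

def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def layer_unitary(angles):
    """Tensor product of per-qubit RY gates: a 2^n x 2^n unitary."""
    U = np.array([[1.0 + 0j]])
    for th in angles:
        U = np.kron(U, ry(th))
    return U

def classical_controller(x, readout):
    """Stand-in feedforward net: maps input + readout to PQC angles."""
    return np.tanh(x + readout)  # the explicit nonlinearity is classical

def z_expectations(psi):
    """Mid-circuit-style readout: <Z> per qubit from the state vector."""
    probs = np.abs(psi) ** 2
    outs = []
    for q in range(n):
        bit = (np.arange(2 ** n) >> (n - 1 - q)) & 1
        outs.append(probs[bit == 0].sum() - probs[bit == 1].sum())
    return np.array(outs)

readout = np.zeros(n)
for x_t in np.random.default_rng(0).normal(size=(5, n)):  # 5 timesteps
    angles = classical_controller(x_t, readout)
    state = layer_unitary(angles) @ state   # norm-preserving update
    readout = z_expectations(state)
print("final readout:", np.round(readout, 3),
      "| norm:", np.linalg.norm(state))    # norm stays 1.0 (unitary)
```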

[NLP-17] TwinVoice: A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation

[Quick Read]: This paper addresses the limitations of current evaluations of LLM persona simulation, which mostly rely on synthetic dialogues, lack systematic frameworks, and do not analyze the underlying capability requirements. The key contribution is the TwinVoice benchmark, which evaluates persona simulation in real-world contexts along three dimensions: Social Persona (public social interactions), Interpersonal Persona (private dialogues), and Narrative Persona (role-based expression), and decomposes LLM performance into six fundamental capabilities: opinion consistency, memory recall, logical reasoning, lexical fidelity, persona tone, and syntactic style. This structured evaluation reveals that even advanced models fall notably short on key capabilities such as syntactic style and memory recall, pointing to clear directions for improving persona simulation.

Link: https://arxiv.org/abs/2510.25536
Authors: Bangde Du (1), Minghao Guo (2), Songming He (3), Ziyi Ye (3), Xi Zhu (2), Weihang Su (1), Shuqi Zhu (1), Yujia Zhou (1), Yongfeng Zhang (2), Qingyao Ai (1), Yiqun Liu (1) ((1) Tsinghua University, (2) Rutgers University, (3) Fudan University)
Affiliations: Tsinghua University; Rutgers University; Fudan University
Subjects: Computation and Language (cs.CL)
Comments: Main paper: 11 pages, 3 figures, 6 tables. Appendix: 28 pages. Bangde Du and Minghao Guo contributed equally. Corresponding authors: Ziyi Ye (ziyiye@fudan.edu.cn), Qingyao Ai (aiqy@tsinghua.edu.cn)

Abstract:Large Language Models (LLMs) are exhibiting emergent human-like abilities and are increasingly envisioned as the foundation for simulating an individual’s communication style, behavioral tendencies, and personality traits. However, current evaluations of LLM-based persona simulation remain limited: most rely on synthetic dialogues, lack systematic frameworks, and lack analysis of the capability requirement. To address these limitations, we introduce TwinVoice, a comprehensive benchmark for assessing persona simulation across diverse real-world contexts. TwinVoice encompasses three dimensions: Social Persona (public social interactions), Interpersonal Persona (private dialogues), and Narrative Persona (role-based expression). It further decomposes the evaluation of LLM performance into six fundamental capabilities, including opinion consistency, memory recall, logical reasoning, lexical fidelity, persona tone, and syntactic style. Experimental results reveal that while advanced models achieve moderate accuracy in persona simulation, they still fall short of capabilities such as syntactic style and memory recall. Consequently, the average performance achieved by LLMs remains considerably below the human baseline.

[NLP-18] Fine-Tuned Language Models for Domain-Specific Summarization and Tagging

[Quick Read]: This paper addresses the difficulty that rapidly evolving subcultural language and slang create for automated information extraction and law-enforcement monitoring, where conventional methods struggle with non-standard text, particularly in the political and security domains. The key solution is a pipeline combining fine-tuned large language models (LLMs) with named entity recognition (NER): instruction fine-tuning via the LLaMA Factory framework yields accurate domain-specific summarization and structured entity tagging. Experiments show that, despite initially limited Chinese comprehension, the domain-fine-tuned LLaMA3-8B-Instruct outperforms a Chinese-trained counterpart, suggesting that underlying reasoning abilities transfer across languages and offering an efficient, flexible approach to real-time, scalable information management and security applications.

Link: https://arxiv.org/abs/2510.25460
Authors: Jun Wang, Fuming Lin, Yuyu Chen
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper presents a pipeline integrating fine-tuned large language models (LLMs) with named entity recognition (NER) for efficient domain-specific text summarization and tagging. The authors address the challenge posed by rapidly evolving sub-cultural languages and slang, which complicate automated information extraction and law enforcement monitoring. By leveraging the LLaMA Factory framework, the study fine-tunes LLMs on both generalpurpose and custom domain-specific datasets, particularly in the political and security domains. The models are evaluated using BLEU and ROUGE metrics, demonstrating that instruction fine-tuning significantly enhances summarization and tagging accuracy, especially for specialized corpora. Notably, the LLaMA3-8B-Instruct model, despite its initial limitations in Chinese comprehension, outperforms its Chinese-trained counterpart after domainspecific fine-tuning, suggesting that underlying reasoning capabilities can transfer across languages. The pipeline enables concise summaries and structured entity tagging, facilitating rapid document categorization and distribution. This approach proves scalable and adaptable for real-time applications, supporting efficient information management and the ongoing need to capture emerging language trends. The integration of LLMs and NER offers a robust solution for transforming unstructured text into actionable insights, crucial for modern knowledge management and security operations.

[NLP-19] Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs

[Quick Read]: This paper addresses the lack of proactivity and goal orientation in large language models (LLMs) for high-stakes domains: how to turn an LLM from a passive responder into a proactive dialogue agent that plans, asks timely questions, and decides when to stop. Current approaches either myopically optimize single-turn attributes or depend on brittle, costly user simulators, creating a "reality gap". The key contribution is Learn-to-Ask, a general, simulator-free framework that learns proactive dialogue policies directly from offline expert data: it leverages the observed future of each expert trajectory to infer dense turn-by-turn reward signals, decomposing the intractable long-horizon problem into a sequence of supervised learning tasks, and trains the policy to output structured (action, state_assessment) tuples that govern both what to ask and when to stop. An Automated Grader Calibration pipeline minimizes noise in the LLM-based reward model to ensure reward fidelity. Experiments on a real-world medical dataset show the approach can be deployed effectively and even surpass human experts.

Link: https://arxiv.org/abs/2510.25441
Authors: Fei Wei, Daoyuan Chen, Ce Wang, Yilun Huang, Yushuo Chen, Xuchen Pan, Yaliang Li, Bolin Ding
Affiliations: Alibaba Group
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 27 pages, 5 figures

Abstract:Large Language Models (LLMs) excel as passive responders, but teaching them to be proactive, goal-oriented partners, a critical capability in high-stakes domains, remains a major challenge. Current paradigms either myopically optimize single-turn attributes or rely on brittle, high-cost user simulators, creating a persistent "reality gap". To bridge this gap, we introduce Learn-to-Ask, a general, simulator-free framework for learning and deploying proactive dialogue agents directly from offline expert data, bypassing the need to model complex user dynamics. Our key insight is to reframe the offline policy learning problem by leveraging the observed future of each expert trajectory. This allows us to infer a dense, turn-by-turn reward signal grounded in the expert's revealed strategy, decomposing the intractable long-horizon problem into a series of supervised learning tasks, and training a policy to output a structured (action, state_assessment) tuple, governing both what to ask and, crucially, when to stop. To ensure reward fidelity, our Automated Grader Calibration pipeline systematically purges noise from the LLM-based reward model with minimal human supervision. Empirically, we demonstrate the efficacy of Learn-to-Ask in a real-world medical dataset, using LLMs of varying sizes up to 32B. Our approach culminates in the successful deployment of LLMs into a live, large-scale online AI service. In rigorous in-house evaluations, our model was launched and achieved performance even superior to human experts, proving our framework's ability to translate offline data into tangible, real-world impact. We hope this work provides a practical and economically viable blueprint for transforming passive LLMs into proactive, goal-oriented LLM applications.
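The turn-wise decomposition can be illustrated with a small data-shaping sketch: each expert turn becomes a supervised example whose label pairs the asked question with a stop/continue assessment derived from the trajectory's observed future. The stop rule below is a deliberately simplified assumption.

```python
def make_turn_examples(dialogue):
    """dialogue: list of (expert_question, client_reply) pairs."""
    examples, history = [], []
    for i, (question, reply) in enumerate(dialogue):
        remaining = len(dialogue) - (i + 1)
        examples.append({
            "context": list(history),
            "action": question,                               # what to ask
            "state_assessment": "stop" if remaining == 0 else "continue",
        })
        history.extend([question, reply])
    return examples

demo = [("Any fever?", "Yes, since Monday."),
        ("Any cough?", "A dry cough.")]
for ex in make_turn_examples(demo):
    print(ex["action"], "->", ex["state_assessment"])
```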

[NLP-20] More than a Moment: Towards Coherent Sequences of Audio Descriptions

[Quick Read]: This paper addresses the incoherence and repetitiveness of automatically generated audio descriptions (ADs), which arise because each time interval is handled independently and which impair visually impaired audiences' understanding of the unfolding scene. The key contribution is CoherentAD, a training-free method that first generates multiple candidate descriptions for each AD interval and then performs auto-regressive selection across the whole sequence to assemble a coherent, informative narrative, substantially improving the overall narrative consistency and completeness of AD sequences.

Link: https://arxiv.org/abs/2510.25440
Authors: Eshika Khandelwal, Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Andrew Zisserman, Gül Varol, Makarand Tapaswi
Affiliations: CVIT, IIIT Hyderabad; VGG, University of Oxford; LIGM, École des Ponts, IP Paris, UGE, CNRS
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Audio Descriptions (ADs) convey essential on-screen information, allowing visually impaired audiences to follow videos. To be effective, ADs must form a coherent sequence that helps listeners to visualise the unfolding scene, rather than describing isolated moments. However, most automatic methods generate each AD independently, often resulting in repetitive, incoherent descriptions. To address this, we propose a training-free method, CoherentAD, that first generates multiple candidate descriptions for each AD time interval, and then performs auto-regressive selection across the sequence to form a coherent and informative narrative. To evaluate AD sequences holistically, we introduce a sequence-level metric, StoryRecall, which measures how well the predicted ADs convey the ground truth narrative, alongside repetition metrics that capture the redundancy across consecutive AD outputs. Our method produces coherent AD sequences with enhanced narrative understanding, outperforming prior approaches that rely on independent generations.
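The candidate-then-select structure is easy to sketch: generate several candidates per interval, then pick each interval's description conditioned on what was already chosen. The repetition-penalty scorer below is a toy stand-in for the paper's selection model.

```python
def select_coherent_sequence(candidates_per_interval, coherence_score):
    chosen = []
    for candidates in candidates_per_interval:  # intervals in temporal order
        best = max(candidates, key=lambda c: coherence_score(chosen, c))
        chosen.append(best)
    return chosen

def toy_score(history, candidate):
    # Prefer candidates that add new words (reduces redundancy).
    used = {w for h in history for w in h.lower().split()}
    words = candidate.lower().split()
    return sum(w not in used for w in words) / len(words)

ads = select_coherent_sequence(
    [["A man enters.", "A man walks in."],
     ["The man sits down.", "A man enters the room."]],
    toy_score,
)
print(ads)  # avoids repeating "A man enters..." in the second interval
```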

[NLP-21] A Critical Study of Automatic Evaluation in Sign Language Translation LREC2026

[Quick Read]: This paper addresses the lack of reliable evaluation metrics in sign language translation (SLT), asking to what extent text-based metrics (BLEU, ROUGE, etc.) can capture SLT output quality. The key contribution is a systematic comparison of six metrics, spanning traditional lexical-overlap metrics (BLEU, chrF, ROUGE) plus BLEURT and LLM-based evaluators (G-Eval, GEMBA zero-shot direct assessment), under three controlled conditions: paraphrasing, hallucination, and sentence-length variation. The analysis shows that lexical-overlap metrics miss semantic equivalence, while LLM-based evaluators capture it better but can be biased toward LLM-paraphrased translations; all metrics detect hallucinations, but BLEU is overly sensitive, whereas BLEURT and LLM-based evaluators are comparatively lenient on subtle cases. The paper therefore argues that multimodal evaluation frameworks beyond text-based metrics are a key direction for more holistic SLT evaluation.

Link: https://arxiv.org/abs/2510.25434
Authors: Shakib Yazdani, Yasser Hamidullah, Cristina España-Bonet, Eleftherios Avramidis, Josef van Genabith
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Submitted to the LREC 2026 conference

Abstract:Automatic evaluation metrics are crucial for advancing sign language translation (SLT). Current SLT evaluation metrics, such as BLEU and ROUGE, are only text-based, and it remains unclear to what extent text-based metrics can reliably capture the quality of SLT outputs. To address this gap, we investigate the limitations of text-based SLT evaluation metrics by analyzing six metrics, including BLEU, chrF, and ROUGE, as well as BLEURT on the one hand, and large language model (LLM)-based evaluators such as G-Eval and GEMBA zero-shot direct assessment on the other hand. Specifically, we assess the consistency and robustness of these metrics under three controlled conditions: paraphrasing, hallucinations in model outputs, and variations in sentence length. Our analysis highlights the limitations of lexical overlap metrics and demonstrates that while LLM-based evaluators better capture semantic equivalence often missed by conventional metrics, they can also exhibit bias toward LLM-paraphrased translations. Moreover, although all metrics are able to detect hallucinations, BLEU tends to be overly sensitive, whereas BLEURT and LLM-based evaluators are comparatively lenient toward subtle cases. This motivates the need for multimodal evaluation frameworks that extend beyond text-based metrics to enable a more holistic assessment of SLT outputs.

[NLP-22] Depth and Autonomy: A Framework for Evaluating LLM Applications in Social Science Research

[Quick Read]: This paper addresses the persistent challenges of applying large language models (LLMs) in qualitative social science research, including interpretive bias, low reliability, and weak auditability. The key contribution is a framework organized along two dimensions, interpretive depth and autonomy, which classifies LLM applications in qualitative research and yields practical design recommendations: decompose tasks into manageable segments, keep autonomy low, and raise interpretive depth only where warranted and under supervision, thereby exploiting LLM capabilities while preserving transparency and reliability.

Link: https://arxiv.org/abs/2510.25432
Authors: Ali Sanaei, Ali Rajabzadeh
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Presented at the Annual Meeting of the American Political Science Association, Vancouver, BC, September 11–14 2025

Abstract:Large language models (LLMs) are increasingly utilized by researchers across a wide range of domains, and qualitative social science is no exception; however, this adoption faces persistent challenges, including interpretive bias, low reliability, and weak auditability. We introduce a framework that situates LLM usage along two dimensions, interpretive depth and autonomy, thereby offering a straightforward way to classify LLM applications in qualitative research and to derive practical design recommendations. We present the state of the literature with respect to these two dimensions, based on all published social science papers available on Web of Science that use LLMs as a tool and not strictly as the subject of study. Rather than granting models expansive freedom, our approach encourages researchers to decompose tasks into manageable segments, much as they would when delegating work to capable undergraduate research assistants. By maintaining low levels of autonomy and selectively increasing interpretive depth only where warranted and under supervision, one can plausibly reap the benefits of LLMs while preserving transparency and reliability.

[NLP-23] RLMEval: Evaluating Research-Level Neural Theorem Proving EMNLP2025

[Quick Read]: This paper addresses the limited practical impact of current large language models (LLMs) on research-level neural theorem proving and proof autoformalization: strong results on existing curated benchmarks do not transfer to real-world, complex mathematical theorems. The key contribution is RLMEval, an evaluation suite built from research-level mathematical statements in real Lean projects, comprising 613 theorems from 6 actual Lean Blueprint formalization projects. Evaluating state-of-the-art models on this more realistic benchmark reveals a significant gap, with the best model reaching only a 10.3% pass rate, highlighting the distance between current methods and practical needs and providing a new target for progress in automated reasoning for formal mathematics.

Link: https://arxiv.org/abs/2510.25427
Authors: Auguste Poiroux, Antoine Bosselut, Viktor Kunčak
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to EMNLP 2025 Findings. RLMEval benchmark released: this https URL

Abstract:Despite impressive results on curated benchmarks, the practical impact of large language models (LLMs) on research-level neural theorem proving and proof autoformalization is still limited. We introduce RLMEval, an evaluation suite for these tasks, focusing on research-level mathematics from real-world Lean formalization projects. RLMEval targets the evaluation of neural theorem proving and proof autoformalization on challenging research-level theorems by leveraging real Lean Blueprint formalization projects. Our evaluation of state-of-the-art models on RLMEval, comprising 613 theorems from 6 Lean projects, reveals a significant gap: progress on existing benchmarks does not readily translate to these more realistic settings, with the best model achieving only a 10.3% pass rate. RLMEval provides a new, challenging benchmark designed to guide and accelerate progress in automated reasoning for formal mathematics.

[NLP-24] Implicature in Interaction: Understanding Implicature Improves Alignment in Human-LLM Interaction

[Quick Read]: This paper addresses the alignment of generative AI with human intent in human-computer interaction (HCI), centering on whether models can correctly understand implicature, meaning conveyed beyond explicit statements through shared context. The key finding is that implicature-based, context-driven prompts significantly improve the perceived relevance and quality of responses, with especially notable gains for smaller models (which otherwise struggle with implicature inference, while larger models approximate human interpretations more closely), indicating that pragmatics-aware language modeling is an important path toward more natural, contextually grounded human-AI interaction.

Link: https://arxiv.org/abs/2510.25426
Authors: Asutosh Hota, Jussi P. P. Jokinen
Affiliations: University of Jyväskylä
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: The manuscript is approximately 7360 words and contains 12 figures and 6 tables

Abstract:The rapid advancement of Large Language Models (LLMs) is positioning language at the core of human-computer interaction (HCI). We argue that advancing HCI requires attention to the linguistic foundations of interaction, particularly implicature (meaning conveyed beyond explicit statements through shared context) which is essential for human-AI (HAI) alignment. This study examines LLMs’ ability to infer user intent embedded in context-driven prompts and whether understanding implicature improves response generation. Results show that larger models approximate human interpretations more closely, while smaller models struggle with implicature inference. Furthermore, implicature-based prompts significantly enhance the perceived relevance and quality of responses across models, with notable gains in smaller models. Overall, 67.6% of participants preferred responses with implicature-embedded prompts to literal ones, highlighting a clear preference for contextually nuanced communication. Our work contributes to understanding how linguistic theory can be used to address the alignment problem by making HAI interaction more natural and contextually grounded.

[NLP-25] Seeing Signing and Saying: A Vision-Language Model-Assisted Pipeline for Sign Language Data Acquisition and Curation from Social Media

[Quick Read]: This paper addresses the limitations of existing sign language translation (SLT) datasets in scale, multilingual coverage, and curation cost, which stem from reliance on expert annotation and controlled recording setups. The key contribution is the first automated annotation and filtering framework based on vision language models (VLMs): modules for face visibility detection, sign activity recognition, text extraction from video content, and video-text alignment judgment enable data filtering and annotation without manual effort, preserving quality while sharply cutting acquisition cost, and for the first time applying VLMs to large-scale collection of multilingual sign language data from social media (e.g., TikTok) to support weakly supervised pretraining.

Link: https://arxiv.org/abs/2510.25413
Authors: Shakib Yazdani, Yasser Hamidullah, Cristina España-Bonet, Josef van Genabith
Affiliations: German Research Center for Artificial Intelligence (DFKI GmbH); Barcelona Supercomputing Center (BSC-CNS)
Subjects: Computation and Language (cs.CL)
Comments: Accepted by RANLP 2025

Abstract:Most existing sign language translation (SLT) datasets are limited in scale, lack multilingual coverage, and are costly to curate due to their reliance on expert annotation and controlled recording setup. Recently, Vision Language Models (VLMs) have demonstrated strong capabilities as evaluators and real-time assistants. Despite these advancements, their potential remains untapped in the context of sign language dataset acquisition. To bridge this gap, we introduce the first automated annotation and filtering framework that utilizes VLMs to reduce reliance on manual effort while preserving data quality. Our method is applied to TikTok videos across eight sign languages and to the already curated YouTube-SL-25 dataset in German Sign Language for the purpose of additional evaluation. Our VLM-based pipeline includes a face visibility detection, a sign activity recognition, a text extraction from video content, and a judgment step to validate alignment between video and text, implementing generic filtering, annotation and validation steps. Using the resulting corpus, TikTok-SL-8, we assess the performance of two off-the-shelf SLT models on our filtered dataset for German and American Sign Languages, with the goal of establishing baselines and evaluating the robustness of recent models on automatically extracted, slightly noisy data. Our work enables scalable, weakly supervised pretraining for SLT and facilitates data acquisition from social media.
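The four filtering stages map naturally onto a pipeline skeleton like the one below. `HeuristicVLM` is a trivial stand-in so the sketch runs; in the paper each stage is a vision-language-model call, and all names here are illustrative.

```python
# Skeleton of the VLM-based curation pipeline: face visibility ->
# sign activity -> text extraction -> video-text alignment judgment.

def curate(videos, vlm):
    kept = []
    for v in videos:
        if not vlm.face_visible(v):        # signing requires a visible face
            continue
        if not vlm.contains_signing(v):    # drop non-signing clips
            continue
        text = vlm.extract_text(v)         # captions/overlays in the video
        if text and vlm.judge_alignment(v, text):  # validate the pair
            kept.append((v["id"], text))
    return kept

class HeuristicVLM:
    """Trivial stand-in so the sketch runs; real stages are VLM prompts."""
    def face_visible(self, v): return v["face"]
    def contains_signing(self, v): return v["signing"]
    def extract_text(self, v): return v["caption"]
    def judge_alignment(self, v, text): return len(text.split()) > 1

videos = [
    {"id": "clip1", "face": True, "signing": True, "caption": "hello in ASL"},
    {"id": "clip2", "face": False, "signing": True, "caption": "ignored"},
]
print(curate(videos, HeuristicVLM()))  # [('clip1', 'hello in ASL')]
```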

[NLP-26] Serve Programs Not Prompts SOSP2025

[Quick Read]: This paper addresses the inefficiency and inflexibility of current LLM serving systems, which were designed mainly for text completion, when handling increasingly complex LLM applications. The key idea is a new serving architecture that serves programs rather than prompts: LLM Inference Programs (LIPs) let users customize token prediction and KV cache management at runtime and offload parts of application logic, such as tool execution, to the server. The paper illustrates this architecture with Symphony, an operating system for LIPs that exposes LLM computation via system calls, virtualizes the KV cache with a dedicated file system, and preserves GPU efficiency through two-level process scheduling, opening the door to a more efficient and extensible ecosystem for LLM applications.

Link: https://arxiv.org/abs/2510.25412
Authors: In Gim, Lin Zhong
Affiliations: Yale University
Subjects: Computation and Language (cs.CL)
Comments: HotOS 2025. Follow-up implementation work (SOSP 2025) is available at https://arxiv.org/abs/2510.24051

Abstract:Current large language model (LLM) serving systems, primarily designed for text completion, are neither efficient nor adaptable for increasingly complex LLM applications due to their inflexible design. We propose a new LLM serving system architecture that serves programs instead of prompts to address this problem. These programs, called LLM Inference Programs (LIPs), allow users to customize token prediction and KV cache management at runtime and to offload parts of their application logic, such as tool execution, to the server. We describe an example of this architecture through a system named Symphony, which functions as an operating system for LIPs. Symphony exposes LLM model computations via system calls and virtualizes KV cache with a dedicated file system, while ensuring GPU efficiency with a two-level process scheduling scheme. Symphony has the potential to open the door to a more efficient and extensible ecosystem for LLM applications.
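To picture what "serving a program instead of a prompt" could look like, here is an illustrative inference program that drives decoding and runs a tool server-side. Every class and method name is hypothetical, not Symphony's actual interface; a real system would expose model computation and KV-cache handles as server-side primitives.

```python
class ToolCallingProgram:
    """Illustrative LIP: customizes decoding and offloads tool execution."""
    def __init__(self, prompt, tools):
        self.prompt, self.tools = prompt, tools

    def run(self, llm):
        ctx = llm.open_kv_cache(self.prompt)            # KV-cache handle
        while True:
            tok = llm.next_token(ctx, temperature=0.0)  # custom decoding
            ctx.append(tok)
            if tok.startswith("CALL:"):                 # server-side tool run
                name, arg = tok[5:].split("(", 1)
                ctx.append(f"RESULT: {self.tools[name](arg.rstrip(')'))}")
            elif tok == "<eos>":
                return ctx.text()

class Ctx:
    def __init__(self, parts): self.parts = parts
    def append(self, t): self.parts.append(t)
    def text(self): return " ".join(self.parts)

class FakeLLM:
    """Toy stand-in that replays a scripted token stream."""
    def __init__(self, script): self.script = list(script)
    def open_kv_cache(self, prompt): return Ctx([prompt])
    def next_token(self, ctx, temperature): return self.script.pop(0)

prog = ToolCallingProgram("What is 2+3?", {"calc": lambda s: eval(s)})
print(prog.run(FakeLLM(["CALL:calc(2+3)", "<eos>"])))
# -> What is 2+3? CALL:calc(2+3) RESULT: 5 <eos>
```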

[NLP-27] BhashaBench V1: A Comprehensive Benchmark for the Quadrant of Indic Domains

【Quick Read】: This paper targets the Anglocentric, domain-agnostic bias of current large language model (LLM) evaluation, in particular the lack of benchmarks for Indic knowledge systems such as agriculture, law, finance, and Ayurveda. The key contribution is BhashaBench V1, the first domain-specific, multi-task, bilingual (English and Hindi) benchmark for Indic knowledge systems: 74,166 curated question-answer pairs (52,494 in English, 21,672 in Hindi) spanning four core domains, 90+ subdomains, and 500+ topics. A systematic evaluation of 29+ LLMs exposes pronounced language- and domain-specific performance gaps, giving the community a fine-grained, reproducible tool for assessing model capability across India's diverse cultural and professional settings.

Link: https://arxiv.org/abs/2510.25409
Authors: Vijay Devane, Mohd Nauman, Bhargav Patel, Aniket Mahendra Wakchoure, Yogeshkumar Sant, Shyam Pawar, Viraj Thakur, Ananya Godse, Sunil Patra, Neha Maurya, Suraj Racha, Nitish Kamal Singh, Ajay Nagpal, Piyush Sawarkar, Kundeshwar Vijayrao Pundalik, Rohit Saluja, Ganesh Ramakrishnan
Affiliations: BharatGen Team
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The rapid advancement of large language models(LLMs) has intensified the need for domain and culture specific evaluation. Existing benchmarks are largely Anglocentric and domain-agnostic, limiting their applicability to India-centric contexts. To address this gap, we introduce BhashaBench V1, the first domain-specific, multi-task, bilingual benchmark focusing on critical Indic knowledge systems. BhashaBench V1 contains 74,166 meticulously curated question-answer pairs, with 52,494 in English and 21,672 in Hindi, sourced from authentic government and domain-specific exams. It spans four major domains: Agriculture, Legal, Finance, and Ayurveda, comprising 90+ subdomains and covering 500+ topics, enabling fine-grained evaluation. Evaluation of 29+ LLMs reveals significant domain and language specific performance gaps, with especially large disparities in low-resource domains. For instance, GPT-4o achieves 76.49% overall accuracy in Legal but only 59.74% in Ayurveda. Models consistently perform better on English content compared to Hindi across all domains. Subdomain-level analysis shows that areas such as Cyber Law, International Finance perform relatively well, while Panchakarma, Seed Science, and Human Rights remain notably weak. BhashaBench V1 provides a comprehensive dataset for evaluating large language models across India’s diverse knowledge domains. It enables assessment of models’ ability to integrate domain-specific knowledge with bilingual understanding. All code, benchmarks, and resources are publicly available to support open research.

[NLP-28] Roleplaying with Structure: Synthetic Therapist-Client Conversation Generation from Questionnaires

【Quick Read】: This paper confronts the scarcity of authentic therapy dialogues that holds back AI for mental health, a scarcity rooted in strict privacy regulation and the fact that clinical sessions were historically rarely recorded. The key is SQPsych (Structured Questionnaire-based Psychotherapy), an LLM-driven generation pipeline that takes structured client profiles and psychological questionnaires as input and simulates therapist-client conversations under a Cognitive Behavioral Therapy (CBT) framework, yielding high-quality, clinically grounded synthetic counseling dialogues. Because data governance policies bar sending sensitive questionnaire data to third-party services, the corpus is generated and validated locally with open-weight LLMs; in both human expert evaluation and LLM-based assessment, models fine-tuned on it outperform baselines on key therapeutic skills, showing that synthetic data can enable scalable, privacy-preserving, clinically informed AI support for mental health.

Link: https://arxiv.org/abs/2510.25384
Authors: Doan Nam Long Vu, Rui Tan, Lena Moench, Svenja Jule Francke, Daniel Woiwod, Florian Thomas-Odenthal, Sanna Stroth, Tilo Kircher, Christiane Hermann, Udo Dannlowski, Hamidreza Jamalabadi, Shaoxiong Ji
Affiliations: Technical University of Darmstadt; Philipps-University Marburg; Justus Liebig University Giessen; University of Münster; ELLIS Institute Finland; University of Turku
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The development of AI for mental health is hindered by a lack of authentic therapy dialogues, due to strict privacy regulations and the fact that clinical sessions were historically rarely recorded. We present an LLM-driven pipeline that generates synthetic counseling dialogues based on structured client profiles and psychological questionnaires. Grounded on the principles of Cognitive Behavioral Therapy (CBT), our method creates synthetic therapeutic conversations for clinical disorders such as anxiety and depression. Our framework, SQPsych (Structured Questionnaire-based Psychotherapy), converts structured psychological input into natural language dialogues through therapist-client simulations. Due to data governance policies and privacy restrictions prohibiting the transmission of clinical questionnaire data to third-party services, previous methodologies relying on proprietary models are infeasible in our setting. We address this limitation by generating a high-quality corpus using open-weight LLMs, validated through human expert evaluation and LLM-based assessments. Our SQPsychLLM models fine-tuned on SQPsychConv achieve strong performance on counseling benchmarks, surpassing baselines in key therapeutic skills. Our findings highlight the potential of synthetic data to enable scalable, data-secure, and clinically informed AI for mental health support. We will release our code, models, and corpus at this https URL

[NLP-29] Hallucinations in Bibliographic Recommendation: Citation Frequency as a Proxy for Training Data Redundancy

【Quick Read】: This paper examines hallucination in bibliographic recommendation, where large language models (LLMs) fabricate papers that do not exist. The key finding is that citation frequency works as a proxy for training data redundancy: citation count correlates strongly with factual accuracy, and beyond roughly 1,000 citations bibliographic information is retained almost verbatim in the model, pointing to a threshold at which the model shifts from generalization to memorization.

Link: https://arxiv.org/abs/2510.25378
Authors: Junichiro Niimi
Affiliations: Meijo University; RIKEN AIP
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) have been increasingly applied to a wide range of tasks, from natural language understanding to code generation. While they have also been used to assist in bibliographic recommendation, the hallucination of non-existent papers remains a major issue. Building on prior studies, this study hypothesizes that an LLM’s ability to correctly produce bibliographic information depends on whether the underlying knowledge is generated or memorized, with highly cited papers (i.e., more frequently appear in the training corpus) showing lower hallucination rates. We therefore assume citation count as a proxy for training data redundancy (i.e., the frequency with which a given bibliographic record is repeatedly represented in the pretraining corpus) and investigate how citation frequency affects hallucinated references in LLM outputs. Using GPT-4.1, we generated and manually verified 100 bibliographic records across twenty computer-science domains, and measured factual consistency via cosine similarity between generated and authentic metadata. The results revealed that (i) hallucination rates vary across research domains, (ii) citation count is strongly correlated with factual accuracy, and (iii) bibliographic information becomes almost verbatimly memorized beyond approximately 1,000 citations. These findings suggest that highly cited papers are nearly verbatimly retained in the model, indicating a threshold where generalization shifts into memorization.
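A minimal sketch of the measurement loop described in the abstract: embed generated and authentic bibliographic records, compare them with cosine similarity, and relate the score to citation count. The embedding model, the toy records, and applying the ~1,000-citation finding as a rule of thumb are illustrative assumptions.

```python
# Sketch: cosine similarity between generated and authentic metadata.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def metadata_similarity(generated: str, authentic: str) -> float:
    a, b = model.encode([generated, authentic])
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

records = [
    # (citation_count, generated record, ground-truth record) - toy examples
    (12000, "Vaswani et al. 2017. Attention Is All You Need. NeurIPS.",
            "Vaswani et al. 2017. Attention Is All You Need. NIPS 2017."),
    (40,    "Smith 2020. A Fictitious Survey of Parsing. ACL.",
            "Smith 2021. A Survey of Neural Parsing Methods. EMNLP."),
]
for cites, gen, gold in records:
    sim = metadata_similarity(gen, gold)
    regime = "memorized" if cites >= 1000 else "generated"  # ~1k-cite threshold
    print(f"{cites:>6} cites | cosine={sim:.3f} | likely {regime}")
```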

[NLP-30] Monitoring Transformative Technological Convergence Through LLM-Extracted Semantic Entity Triple Graphs

【Quick Read】: This paper addresses a core difficulty of technology forecasting: short innovation cycles and ambiguous early-stage terminology leave traditional expert-based methods unable to catch emerging transformative technologies in time. The central challenge is how to automatically detect early signals of technological convergence in large volumes of unstructured text. The key is a data-driven, multi-stage pipeline that uses large language models (LLMs) to extract semantic triples from full text, builds a large-scale graph of technology entities and relations, and introduces a graph-based clustering step (noun stapling) together with co-occurrence trend analysis to identify convergence patterns. The method is validated on arXiv preprints and USPTO patent applications, enabling systematic monitoring of technological evolution from both the scientific frontier and its commercial translation.

Link: https://arxiv.org/abs/2510.25370
Authors: Alexander Sternfeld, Andrei Kucharavy, Dimitri Percia David, Alain Mermoud, Julian Jang-Jaccard, Nathan Monnet
Affiliations: HEVS
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Forecasting transformative technologies remains a critical but challenging task, particularly in fast-evolving domains such as Information and Communication Technologies (ICTs). Traditional expert-based methods struggle to keep pace with short innovation cycles and ambiguous early-stage terminology. In this work, we propose a novel, data-driven pipeline to monitor the emergence of transformative technologies by identifying patterns of technological convergence. Our approach leverages advances in Large Language Models (LLMs) to extract semantic triples from unstructured text and construct a large-scale graph of technology-related entities and relations. We introduce a new method for grouping semantically similar technology terms (noun stapling) and develop graph-based metrics to detect convergence signals. The pipeline includes multi-stage filtering, domain-specific keyword clustering, and a temporal trend analysis of topic co-occurrence. We validate our methodology on two complementary datasets: 278,625 arXiv preprints (2017-2024) to capture early scientific signals, and 9,793 USPTO patent applications (2018-2024) to track downstream commercial developments. Our results demonstrate that the proposed pipeline can identify both established and emerging convergence patterns, offering a scalable and generalizable framework for technology forecasting grounded in full-text analysis.
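The graph stage is easy to picture with a few invented triples: build an entity graph from LLM-extracted (subject, relation, object) tuples and look for technology pairs with many shared neighbors as a crude convergence signal. This is a toy reduction of the pipeline, not the paper's noun-stapling method or metric definitions.

```python
# Toy entity graph from LLM-extracted semantic triples.
import networkx as nx
from itertools import combinations

triples = [
    ("federated learning", "applied_to", "edge computing"),
    ("edge computing", "enables", "5G networks"),
    ("federated learning", "combined_with", "differential privacy"),
    ("differential privacy", "applied_to", "edge computing"),
]

G = nx.Graph()
for s, r, o in triples:
    G.add_edge(s, o, relation=r)

# Crude convergence signal: technology pairs that share many neighbors.
for a, b in combinations(G.nodes, 2):
    shared = set(G[a]) & set(G[b])
    if shared:
        print(f"{a} <-> {b}: {len(shared)} shared neighbor(s) {sorted(shared)}")
```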

[NLP-31] CLASS-IT: Conversational and Lecture-Aligned Small-Scale Instruction Tuning for BabyLMs EMNLP2025

【Quick Read】: This paper asks whether small-scale language models (LMs) can benefit from instruction tuning, comparing instruction-tuning datasets of different types (conversational and question-answering) and training strategies (merged versus sequential curricula). The key is a systematic comparison across fine-tuning and zero-shot evaluations: sequential curricula beat merged data, but the gains do not reliably transfer to zero-shot generalization, exposing a trade-off between interaction-focused adaptation and broad linguistic generalization. This points low-resource LM training toward hybrid, curriculum-based approaches for better generalization under ecological training limits.

Link: https://arxiv.org/abs/2510.25364
Authors: Luca Capone, Alessandro Bondielli, Alessandro Lenci
Affiliations: CoLing Lab, Department of Philology, Literature and Linguistics, University of Pisa; Department of Computer Science, University of Pisa
Subjects: Computation and Language (cs.CL)
Comments: Paper accepted for oral presentation at the BabyLM Challenge 2025 (EMNLP 2025)

Abstract:This work investigates whether small-scale LMs can benefit from instruction tuning. We compare conversational and question-answering instruction tuning datasets, applied either in a merged or sequential curriculum, using decoder-only models with 100M and 140M parameters. Evaluation spans both fine-tuning (SuperGLUE) and zero-shot (BLiMP, EWoK, WUGs, entity tracking, and psycholinguistic correlation) settings. Results show that instruction tuning yields small but consistent gains in fine-tuning scenarios, with sequential curricula outperforming merged data; however, improvements do not consistently transfer to zero-shot tasks, suggesting a trade-off between interaction-focused adaptation and broad linguistic generalization. These results highlight both the potential and the constraints of adapting human-inspired learning strategies to low-resource LMs, and point toward hybrid, curriculum-based approaches for enhancing generalization under ecological training limits.

[NLP-32] Not ready for the bench: LLM legal interpretation is unstable and out of step with human judgments

【Quick Read】: This paper asks whether generative AI can currently serve as a reliable tool for legal interpretation, assisting judges or practitioners in judging how legal texts apply. The key empirical finding is that current LLM-based legal interpretation is markedly unstable: the same question phrased differently can lead a model to wildly different conclusions, and model outputs correlate only weakly to moderately with human legal judgments, with large variance across models and question variants. The conclusions produced by generative AI therefore lack the stability and consistency needed to carry much weight in legal interpretation.

Link: https://arxiv.org/abs/2510.25356
Authors: Abhishek Purushothama, Junghyun Min, Brandon Waldon, Nathan Schneider
Affiliations: Georgetown University; University of South Carolina
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Legal interpretation frequently involves assessing how a legal text, as understood by an ‘ordinary’ speaker of the language, applies to the set of facts characterizing a legal dispute in the U.S. judicial system. Recent scholarship has proposed that legal practitioners add large language models (LLMs) to their interpretive toolkit. This work offers an empirical argument against LLM interpretation as recently practiced by legal scholars and federal judges. Our investigation in English shows that models do not provide stable interpretive judgments: varying the question format can lead the model to wildly different conclusions. Moreover, the models show weak to moderate correlation with human judgment, with large variance across model and question variant, suggesting that it is dangerous to give much credence to the conclusions produced by generative AI.

[NLP-33] CRMWeaver: Building Powerful Business Agent via Agentic RL and Shared Memories

【Quick Read】: This paper addresses the challenges business agents face in complex commercial environments, where data relationships are intricate and tasks are heterogeneous (from statistical queries to knowledge question answering). The key is CRMWeaver: during training, a paradigm combining synthetic data generation with reinforcement learning (RL) improves the model's handling of complex data and varied tasks; at inference time, a shared memories mechanism lets the agent learn from guidelines distilled from similar tasks, boosting generalization and practical effectiveness, especially in unseen scenarios.

Link: https://arxiv.org/abs/2510.25333
Authors: Yilong Lai, Yipin Yang, Jialong Wu, Fengran Mo, Zhenglin Wang, Ting Liang, Jianguo Lin, Keping Yang
Affiliations: Taobao & Tmall Group of Alibaba; Southeast University; University of Montreal
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Recent years have witnessed the rapid development of LLM-based agents, which shed light on using language agents to solve complex real-world problems. A prominent application lies in business agents, which interact with databases and internal knowledge bases via tool calls to fulfill diverse user requirements. However, this domain is characterized by intricate data relationships and a wide range of heterogeneous tasks, from statistical data queries to knowledge-based question-answering. To address these challenges, we propose CRMWeaver, a novel approach that enhances business agents in such complex settings. To acclimate the agentic model to intricate business environments, we employ a synthesis data generation and RL-based paradigm during training, which significantly improves the model’s ability to handle complex data and varied tasks. During inference, a shared memories mechanism is introduced, prompting the agent to learn from task guidelines in similar problems, thereby further boosting its effectiveness and generalization, especially in unseen scenarios. We validate the efficacy of our approach on the CRMArena-Pro dataset, where our lightweight model achieves competitive results in both B2B and B2C business scenarios, underscoring its practical value for real-world applications.
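The shared-memories mechanism amounts to retrieving guidance from previously solved, similar tasks at inference time. A bare-bones sketch under assumed components (a sentence-embedding retriever and two invented memory entries) might look like this:

```python
# Sketch: retrieve task guidelines from a shared memory of solved tasks.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
memory = [
    ("monthly revenue per region", "Aggregate Opportunity.Amount grouped by Region; filter by CloseDate."),
    ("refund policy question", "Answer from the knowledge base; do not query transactional tables."),
]
mem_emb = model.encode([m[0] for m in memory], convert_to_tensor=True)

def retrieve_guideline(task: str, k: int = 1):
    q = model.encode(task, convert_to_tensor=True)
    hits = util.semantic_search(q, mem_emb, top_k=k)[0]
    return [memory[h["corpus_id"]][1] for h in hits]

print(retrieve_guideline("show quarterly revenue for EMEA"))
```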

[NLP-34] GAP: Graph-Based Agent Planning with Parallel Tool Use and Reinforcement Learning

【Quick Read】: This paper tackles the inefficiency of LLM-based autonomous agents that reason and execute strictly sequentially on multi-step tasks, leaving the parallelism among tool calls unexploited and hurting both accuracy and efficiency. The key is Graph-based Agent Planning (GAP), which explicitly models inter-sub-task dependencies to build a dependency-aware sub-task graph and trains the agent foundation model to decide which tool operations can run in parallel and which must follow sequential dependencies, yielding adaptive parallel/serial tool scheduling. The approach markedly improves tool-invocation efficiency and accuracy on multi-step reasoning tasks, with multi-hop retrieval benefiting most.

Link: https://arxiv.org/abs/2510.25320
Authors: Jiaqi Wu, Qinlao Zhao, Zefeng Chen, Kai Qin, Yifei Zhao, Xueqian Wang, Yuhang Yao
Affiliations: Tsinghua University; Huazhong University of Science and Technology; National University of Singapore; Carnegie Mellon University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Autonomous agents powered by large language models (LLMs) have shown impressive capabilities in tool manipulation for complex task-solving. However, existing paradigms such as ReAct rely on sequential reasoning and execution, failing to exploit the inherent parallelism among independent sub-tasks. This sequential bottleneck leads to inefficient tool utilization and suboptimal performance in multi-step reasoning scenarios. We introduce Graph-based Agent Planning (GAP), a novel framework that explicitly models inter-task dependencies through graph-based planning to enable adaptive parallel and serial tool execution. Our approach trains agent foundation models to decompose complex tasks into dependency-aware sub-task graphs, autonomously determining which tools can be executed in parallel and which must follow sequential dependencies. This dependency-aware orchestration achieves substantial improvements in both execution efficiency and task accuracy. To train GAP, we construct a high-quality dataset of graph-based planning traces derived from the Multi-Hop Question Answering (MHQA) benchmark. We employ a two-stage training strategy: supervised fine-tuning (SFT) on the curated dataset, followed by reinforcement learning (RL) with a correctness-based reward function on strategically sampled queries where tool-based reasoning provides maximum value. Experimental results on MHQA datasets demonstrate that GAP significantly outperforms traditional ReAct baselines, particularly on multi-step retrieval tasks, while achieving dramatic improvements in tool invocation efficiency through intelligent parallelization. The project page is available at: this https URL.
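The scheduling idea is straightforward to sketch: treat sub-tasks as a DAG and run each topological layer in parallel. The graph and the `run_tool` stub below are invented; GAP's trained planner would emit the DAG itself.

```python
# Sketch: dependency-aware parallel/serial tool orchestration over a DAG.
import networkx as nx
from concurrent.futures import ThreadPoolExecutor

def run_tool(task: str) -> str:
    return f"result({task})"  # stand-in for a real tool call

dag = nx.DiGraph()
dag.add_edges_from([
    ("search_author_A", "compare"),   # "compare" depends on both searches
    ("search_author_B", "compare"),
])

results = {}
with ThreadPoolExecutor() as pool:
    # topological_generations yields layers of mutually independent tasks
    for layer in nx.topological_generations(dag):
        futures = {t: pool.submit(run_tool, t) for t in layer}  # run in parallel
        results.update({t: f.result() for t, f in futures.items()})
print(results)
```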

[NLP-35] Parrot: A Training Pipeline Enhances Both Program CoT and Natural Language CoT for Reasoning

【Quick Read】: This paper addresses the fact that natural language chain-of-thought (N-CoT) and program chain-of-thought (P-CoT), the two main paradigms for LLM mathematical reasoning, are typically optimized in isolation or in one direction only (P-CoT enhancing N-CoT or vice versa), leaving their complementary strengths underused. The key is the Parrot training pipeline: three purpose-designed subtasks integrate sequential P-CoT and N-CoT generation; a subtask hybrid training strategy promotes natural-language semantic transfer; and a converted N-CoT auxiliary reward alleviates sparse rewards in P-CoT optimization. Experiments show Parrot substantially improves both paradigms, especially N-CoT: on MathQA it lifts the N-CoT performance of LLaMA2 and CodeLLaMA by +21.87 and +21.48 over a resource-intensive RL baseline.

Link: https://arxiv.org/abs/2510.25310
Authors: Senjie Jin, Lu Chen, Zhiheng Xi, Yuhui Wang, Sirui Song, Yuhao Zhou, Xinbo Zhang, Peng Sun, Hong Lu, Tao Gui, Qi Zhang, Xuanjing Huang
Affiliations: Fudan University; ByteDance Research; Shanghai Innovation Institute; Shanghai Key Laboratory of Intelligent Information Processing
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Natural language chain-of-thought (N-CoT) and Program chain-of-thought (P-CoT) have emerged as two primary paradigms for large language models (LLMs) to solve mathematical reasoning problems. Current research typically endeavors to achieve unidirectional enhancement: P-CoT enhanced N-CoT or N-CoT enhanced P-CoT. In this paper, we seek to fully unleash the two paradigms’ strengths for mutual enhancement and ultimately achieve simultaneous improvements. We conduct a detailed analysis of the error types across two paradigms, based on which we propose Parrot, a novel training pipeline for mathematical problems: 1) Three target-designed subtasks integrate sequential P-CoT and N-CoT generation. 2) A subtask hybrid training strategy to facilitate natural language semantic transferability. 3) The converted N-CoT auxiliary reward is designed to alleviate the sparse rewards in P-CoT optimization. Extensive experiments demonstrate that Parrot significantly enhances both the performance of N-CoT and P-CoT, especially on N-CoT. Using Parrot SFT, the N-CoT performance of LLaMA2 and CodeLLaMA achieve gains of +21.87 and +21.48 on MathQA over the RL baseline, which is resource-intensive.

[NLP-36] Teaching Sarcasm: Few-Shot Multimodal Sarcasm Detection via Distillation to a Parameter-Efficient Student

【Quick Read】: This paper addresses the performance bottleneck of multimodal sarcasm detection in low-resource settings, where scarce annotated data makes the subtle contradictions between image and text hard to learn. The key is PEKD, a unified parameter-efficient fine-tuning (PEFT) framework that distills knowledge from an expert teacher model trained on large-scale sarcasm data, plus an entropy-aware gating mechanism that modulates distillation strength according to teacher confidence to dampen unreliable teacher signals. Experiments on two public datasets show the framework lets PEFT methods outperform both prior parameter-efficient approaches and large multimodal models in few-shot scenarios.

Link: https://arxiv.org/abs/2510.25303
Authors: Soumyadeep Jana, Sanasam Ranbir Singh
Affiliations: Indian Institute of Technology Guwahati
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Multimodal sarcasm detection is challenging, especially in low-resource settings where subtle image-text contradictions are hard to learn due to scarce annotated data, which hinders the model’s performance. Parameter-efficient fine-tuning (PEFT) methods like adapters, LoRA, and prompt tuning reduce overfitting but struggle to reach optimal performance due to limited supervision from few-shot data. We propose PEKD, a unified framework that enhances PEFT methods via distillation from an expert model trained on large-scale sarcasm data, which acts as the teacher. To mitigate unreliable signals from the teacher, we introduce an entropy-aware gating mechanism that dynamically adjusts the distillation strength based on teacher confidence. Experiments on two public datasets demonstrate that our PEKD framework enables PEFT methods to outperform both prior parameter-efficient approaches and large multimodal models, achieving strong results in the few-shot scenario. The framework is modular and adaptable to a wide range of multimodal models and tasks.
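The entropy-aware gate can be sketched as a per-example weight on the distillation term that shrinks when the teacher is uncertain. The exact gating formula and temperature below are assumptions for illustration; the abstract does not specify them.

```python
# Sketch: distillation loss gated by teacher confidence (normalized entropy).
import torch
import torch.nn.functional as F

def pekd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    ce = F.cross_entropy(student_logits, labels)             # few-shot supervision
    p_t = F.softmax(teacher_logits / T, dim=-1)
    entropy = -(p_t * p_t.clamp_min(1e-8).log()).sum(-1)
    max_ent = torch.log(torch.tensor(float(p_t.size(-1))))
    gate = 1.0 - entropy / max_ent                           # 1 = confident teacher
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1), p_t,
                  reduction="none").sum(-1) * T * T
    return ce + alpha * (gate * kd).mean()

s = torch.randn(4, 2)   # student logits (batch of 4, sarcastic vs. not)
t = torch.randn(4, 2)   # frozen expert (teacher) logits
y = torch.randint(0, 2, (4,))
print(pekd_loss(s, t, y))
```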

[NLP-37] Adapting Small Language Models to Low-Resource Domains: A Case Study in Hindi Tourism QA

【Quick Read】: This paper addresses two obstacles to question answering for a low-resource language (Hindi) in a specific domain (tourism): scarce annotated data and a lack of domain knowledge in general-purpose models. The key is a multi-stage fine-tuning strategy that uses large LLMs to generate synthetic question-answer pairs to augment the limited original dataset, then trains lightweight models on this data, achieving effective domain adaptation and better generalization. Experiments show large models can generate high-quality synthetic data efficiently while small models adapt to it effectively, offering a scalable path for low-resource, domain-specific QA.

Link: https://arxiv.org/abs/2510.25273
Authors: Sandipan Majhi, Paheli Bhattacharya
Affiliations: Indian Institute of Technology Kharagpur, India; Bosch Research and Technology Centre, Bangalore, India
Subjects: Computation and Language (cs.CL)
Comments: Accepted at the Forum for Information Retrieval Evaluation 2025 (VATIKA Track)

Abstract:Domain-specific question answering in low-resource languages faces two key challenges: scarcity of annotated datasets and limited domain knowledge in general-purpose language models. In this work, we present a multi-stage finetuning strategy to adapt lightweight language models to the Hindi tourism domain by leveraging both original and synthetic training data. Synthetic question-answer pairs are generated using large LLMs (LLaMA-70B, Phi-14B) and used to augment the limited original dataset. We explore several training methodologies and analyse their impact on domain generalisation. Our results demonstrate that large models can efficiently generate synthetic data, while small models can effectively adapt to it, offering a scalable pathway for low-resource, domain-specific QA.

[NLP-38] From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity

【Quick Read】: This paper addresses the diagnostic and therapeutic complexity of psychiatric comorbidity, where multiple disorders co-occur in clinical practice. The key is a framework combining synthetic electronic medical record (EMR) generation with multi-agent diagnostic dialogue generation: a pipeline first produces 502 clinically relevant, diverse synthetic EMRs covering common comorbid conditions; a multi-agent framework then transfers the clinical interview protocol into a hierarchical state machine and context tree supporting over 130 diagnostic states while maintaining clinical standards. The result is PsyCoTalk, a dataset of 3,000 multi-turn diagnostic dialogues validated by psychiatrists with high structural and linguistic fidelity, supporting the development and evaluation of multi-disorder psychiatric screening models.

Link: https://arxiv.org/abs/2510.25232
Authors: Tianxi Wan, Jiaming Luo, Siyuan Chen, Kunyao Lan, Jianhua Chen, Haiyang Geng, Mengyue Wu
Affiliations: X-LANCE Lab, Shanghai Jiao Tong University, China; Shanghai Mental Health Center, SJTU School of Medicine, China; Chen Frontier Lab for AI and Mental Health, Tianqiao and Chrissy Chen Institute, Shanghai, China
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Psychiatric comorbidity is clinically significant yet challenging due to the complexity of multiple co-occurring disorders. To address this, we develop a novel approach integrating synthetic patient electronic medical record (EMR) construction and multi-agent diagnostic dialogue generation. We create 502 synthetic EMRs for common comorbid conditions using a pipeline that ensures clinical relevance and diversity. Our multi-agent framework transfers the clinical interview protocol into a hierarchical state machine and context tree, supporting over 130 diagnostic states while maintaining clinical standards. Through this rigorous process, we construct PsyCoTalk, the first large-scale dialogue dataset supporting comorbidity, containing 3,000 multi-turn diagnostic dialogues validated by psychiatrists. This dataset enhances diagnostic accuracy and treatment planning, offering a valuable resource for psychiatric comorbidity research. Compared to real-world clinical transcripts, PsyCoTalk exhibits high structural and linguistic fidelity in terms of dialogue length, token distribution, and diagnostic reasoning strategies. Licensed psychiatrists confirm the realism and diagnostic validity of the dialogues. This dataset enables the development and evaluation of models capable of multi-disorder psychiatric screening in a single conversational pass.

[NLP-39] ProMediate: A Socio-cognitive framework for evaluating proactive agents in multi-party negotiation

【Quick Read】: This paper addresses the lack of systematic evaluation for large language models (LLMs) acting as proactive AI mediator agents in complex, multi-topic, multi-party negotiation; existing work focuses on single-user assistance and offers no way to measure socio-cognitive intelligence in group collaboration. The key is ProMediate, built from two components: (i) a simulation testbed based on realistic negotiation cases with theory-driven difficulty levels (ProMediate-Easy/Medium/Hard) and a plug-and-play proactive AI mediator grounded in socio-cognitive mediation theory, and (ii) a socio-cognitive evaluation framework with new metrics for consensus change, intervention latency, mediator effectiveness, and intelligence. Experiments show a socially intelligent mediator beats a generic baseline through faster, better-targeted interventions: in the hard setting it raises consensus change by 3.6 percentage points while responding 77% faster, validating the framework's value for advancing proactive, socially intelligent agents.

Link: https://arxiv.org/abs/2510.25224
Authors: Ziyi Liu, Bahar Sarrafzadeh, Pei Zhou, Longqi Yang, Jieyu Zhao, Ashish Sharma
Affiliations: University of Southern California; Microsoft Corporation
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:While Large Language Models (LLMs) are increasingly used in agentic frameworks to assist individual users, there is a growing need for agents that can proactively manage complex, multi-party collaboration. Systematic evaluation methods for such proactive agents remain scarce, limiting progress in developing AI that can effectively support multiple people together. Negotiation offers a demanding testbed for this challenge, requiring socio-cognitive intelligence to navigate conflicting interests between multiple participants and multiple topics and build consensus. Here, we present ProMediate, the first framework for evaluating proactive AI mediator agents in complex, multi-topic, multi-party negotiations. ProMediate consists of two core components: (i) a simulation testbed based on realistic negotiation cases and theory-driven difficulty levels (ProMediate-Easy, ProMediate-Medium, and ProMediate-Hard), with a plug-and-play proactive AI mediator grounded in socio-cognitive mediation theories, capable of flexibly deciding when and how to intervene; and (ii) a socio-cognitive evaluation framework with a new suite of metrics to measure consensus changes, intervention latency, mediator effectiveness, and intelligence. Together, these components establish a systematic framework for assessing the socio-cognitive intelligence of proactive AI agents in multi-party settings. Our results show that a socially intelligent mediator agent outperforms a generic baseline, via faster, better-targeted interventions. In the ProMediate-Hard setting, our social mediator increases consensus change by 3.6 percentage points compared to the generic baseline (10.65% vs 7.01%) while being 77% faster in response (15.98s vs. 3.71s). In conclusion, ProMediate provides a rigorous, theory-grounded testbed to advance the development of proactive, socially intelligent agents.

[NLP-40] RAVR: Reference-Answer-guided Variational Reasoning for Large Language Models

【Quick Read】: This paper targets a bottleneck in using reinforcement learning (RL) to improve LLM reasoning: for tasks beyond the model's current competence, high-utility reasoning paths are hard to sample, so learning risks reinforcing familiar but suboptimal strategies. The key is Reference-Answer-guided Variational Reasoning (RAVR), which conditions reasoning-path generation on the known answer; the authors prove this conditioning provably increases the expected utility of sampled reasoning paths, turning intractable problems into learnable ones. The framework uses answer-conditioned reasoning as a variational surrogate for question-only reasoning, easing the cognitive load of open-ended exploration and steering the model toward focused, answer-consistent reasoning chains.

Link: https://arxiv.org/abs/2510.25206
Authors: Tianqianjin Lin, Xi Zhao, Xingyao Zhang, Rujiao Long, Yi Xu, Zhuoren Jiang, Wenbo Su, Bo Zheng
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 17 pages, 11 figures

Abstract:Reinforcement learning (RL) can refine the reasoning abilities of large language models (LLMs), but critically depends on a key prerequisite: the LLM can already generate high-utility reasoning paths with non-negligible probability. For tasks beyond the LLM’s current competence, such reasoning path can be hard to sample, and learning risks reinforcing familiar but suboptimal reasoning. We are motivated by the insight from cognitive science that Why is this the answer is often an easier question than What is the answer, as it avoids the heavy cognitive load of open-ended exploration, opting instead for explanatory reconstruction-systematically retracing the reasoning that links a question to its answer. We show that LLMs can similarly leverage answers to derive high-quality reasoning paths. We formalize this phenomenon and prove that conditioning on answer provably increases the expected utility of sampled reasoning paths, thereby transforming intractable problems into learnable ones. Building on this insight, we introduce RAVR (Reference-Answer-guided Variational Reasoning), an end-to-end framework that uses answer-conditioned reasoning as a variational surrogate for question-only reasoning. Experiments in both general and math domains demonstrate consistent improvements over strong baselines. We further analyze the reasoning behavior and find that RAVR reduces hesitation, strengthens conclusion consolidation, and promotes problem-specific strategies in reasoning.

[NLP-41] Testing Cross-Lingual Text Comprehension In LLMs Using Next Sentence Prediction

【Quick Read】: This paper probes whether the cross-lingual abilities of large language models (LLMs) genuinely generalize or merely reflect their English data advantage, by studying performance in low-resource languages. It builds a 10,000-question-per-language Next Sentence Prediction benchmark covering English (high-resource), Swahili (medium-resource), and Hausa (low-resource) and evaluates GPT-4 Turbo, Gemini 1.5 Flash, and LLaMA 3 70B, finding sharp accuracy drops in the lower-resource languages, with LLaMA 3 weakest. The key observation concerns Chain-of-Thought (CoT) prompting: for the weaker LLaMA 3 it significantly boosts accuracy, while for the stronger GPT-4 and Gemini it often backfires through "overthinking". CoT is thus not a universal fix; its effectiveness depends heavily on a model's baseline capability and task context, underscoring the need for targeted prompt engineering in cross-lingual tasks.

Link: https://arxiv.org/abs/2510.25187
Authors: Ritesh Sunil Chavan, Jack Mostow
Affiliations: Stony Brook University; Carnegie Mellon University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:While large language models are trained on massive datasets, this data is heavily skewed towards English. Does their impressive performance reflect genuine ability or just this data advantage? To find out, we tested them in a setting where they could not rely on data abundance: low-resource languages. Building on prior work Agarwal et al. (2025) that used Next Sentence Prediction (NSP) as a test, we created a large-scale benchmark with 10,000 questions each for English (a high-resource language), Swahili (medium-resource), and Hausa (low-resource). We then tested several top models, including GPT-4 Turbo, Gemini 1.5 Flash, and LLaMA 3 70B, to see how their performance holds up. The results painted a clear picture of how levels of language resources impact outcomes. While all models excelled in English, their accuracy dropped in Swahili and fell sharply in Hausa, with LLaMA 3 struggling the most. The story became even more interesting when we introduced Chain-of-Thought (CoT) prompting. For the struggling LLaMA 3, CoT acted as a helpful guide, significantly boosting its accuracy. However, for the more capable GPT-4 and Gemini, the same technique often backfired, leading to a kind of “overthinking” that hurt their results in the cross-lingual context. This reveals that Chain-of-Thought is not a universal solution; its effectiveness depends heavily on the model’s baseline capability and the specific context of the task. Our framework pinpoints LLM weaknesses, highlights when CoT helps or hinders cross-lingual NSP performance, and factors influencing their decisions.
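An NSP evaluation of this kind reduces to a simple two-choice loop. The prompt wording and the `ask_llm` stub (here a coin flip) are assumptions; the Swahili example item is invented.

```python
# Sketch: two-choice Next Sentence Prediction evaluation loop.
import random

def ask_llm(prompt: str) -> str:
    return random.choice(["A", "B"])  # replace with a real model call

def nsp_accuracy(items) -> float:
    correct = 0
    for context, true_next, distractor in items:
        options = [true_next, distractor]
        random.shuffle(options)  # avoid position bias
        prompt = (f"Context: {context}\n"
                  f"A) {options[0]}\nB) {options[1]}\n"
                  "Which sentence follows the context? Answer A or B.")
        pred = ask_llm(prompt)
        if options[{"A": 0, "B": 1}.get(pred, 0)] == true_next:
            correct += 1
    return correct / len(items)

items = [("Mvua ilianza kunyesha.", "Watu walikimbilia ndani.", "Bei ya mafuta ilipanda.")]
print(nsp_accuracy(items))
```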

[NLP-42] Model-Document Protocol for AI Search

【Quick Read】: This paper addresses the mismatch between large language models (LLMs) and external knowledge sources in AI search: raw documents (web pages, PDFs, etc.) are long, noisy, and unstructured, and conventional retrieval returns raw passages, forcing the LLM to assemble fragments and reason over context itself, which is inefficient and error-prone. The key is the Model-Document Protocol (MDP), a paradigm that redefines how documents interact with LLMs by converting unstructured text into compact, structured, task-oriented knowledge representations ready for reasoning, through three pathways: agentic reasoning, memory grounding, and structured leveraging. Its instantiation, MDP-Agent, builds document-level gist memories for global coverage, couples diffusion-based exploration with vertical exploitation to uncover layered dependencies, and applies map-reduce style synthesis to integrate large-scale evidence, yielding clear gains on information-seeking benchmarks.

Link: https://arxiv.org/abs/2510.25160
Authors: Hongjin Qian, Zheng Liu
Affiliations: BAAI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 10 pages

Abstract:AI search depends on linking large language models (LLMs) with vast external knowledge sources. Yet web pages, PDF files, and other raw documents are not inherently LLM-ready: they are long, noisy, and unstructured. Conventional retrieval methods treat these documents as verbatim text and return raw passages, leaving the burden of fragment assembly and contextual reasoning to the LLM. This gap underscores the need for a new retrieval paradigm that redefines how models interact with documents. We introduce the Model-Document Protocol (MDP), a general framework that formalizes how raw text is bridged to LLMs through consumable knowledge representations. Rather than treating retrieval as passage fetching, MDP defines multiple pathways that transform unstructured documents into task-specific, LLM-ready inputs. These include agentic reasoning, which curates raw evidence into coherent context; memory grounding, which accumulates reusable notes to enrich reasoning; and structured leveraging, which encodes documents into formal representations such as graphs or key-value caches. All three pathways share the same goal: ensuring that what reaches the LLM is not raw fragments but compact, structured knowledge directly consumable for reasoning. As an instantiation, we present MDP-Agent, which realizes the protocol through an agentic process: constructing document-level gist memories for global coverage, performing diffusion-based exploration with vertical exploitation to uncover layered dependencies, and applying map-reduce style synthesis to integrate large-scale evidence into compact yet sufficient context. Experiments on information-seeking benchmarks demonstrate that MDP-Agent outperforms baselines, validating both the soundness of the MDP framework and the effectiveness of its agentic instantiation.

[NLP-43] Explainable Disentanglement on Discrete Speech Representations for Noise-Robust ASR

【Quick Read】: This paper addresses the degradation of existing discrete audio representations in noisy, real-world conditions, particularly the entanglement of semantic content with background noise in speech-to-unit modeling. The key is to disentangle semantic speech content from noise in the latent space: embeddings from a frozen Whisper are quantized against a codebook to yield semantic tokens for the clean speech, while the quantization residue is kept as an interpretable noise vector supervised by a lightweight classifier. The approach markedly improves speech-text alignment and noise invariance, achieving an 82% error-rate reduction over Whisper and a 35% improvement over baseline methods on the VBDemand test set.

Link: https://arxiv.org/abs/2510.25150
Authors: Shreyas Gopal, Ashutosh Anshul, Haoyang Li, Yue Heng Yeo, Hexin Liu, Eng Siong Chng
Affiliations: Nanyang Technological University
Subjects: Computation and Language (cs.CL)
Comments: Awarded Best Student Paper at APSIPA ASC 2025

Abstract:Discrete audio representations are gaining traction in speech modeling due to their interpretability and compatibility with large language models, but are not always optimized for noisy or real-world environments. Building on existing works that quantize Whisper embeddings for speech-to-unit modeling, we propose disentangling semantic speech content from background noise in the latent space. Our end-to-end model separates clean speech in the form of codebook tokens, while extracting interpretable noise vectors as quantization residue which are supervised via a lightweight classifier. We show that our approach improves alignment between clean/noisy speech and text, producing speech tokens that display a high degree of noise invariance, and improves ASR performance. Keeping Whisper frozen, we show an 82% reduction in error rate compared to Whisper, and 35% improvement over baseline methods on the VBDemand test set. Further analyses show that the learned token space generalizes well to both seen and unseen acoustic conditions.
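The disentanglement step can be sketched as vector quantization plus a classifier on the residue. Dimensions, codebook size, and the noise-type head below are illustrative assumptions layered on stand-in "Whisper" features.

```python
# Toy sketch: semantic tokens via codebook quantization, noise via the residue.
import torch
import torch.nn as nn

class DisentangleHead(nn.Module):
    def __init__(self, dim=512, codebook_size=1024, n_noise_types=4):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)
        self.noise_clf = nn.Linear(dim, n_noise_types)  # supervises the residue

    def forward(self, h):                               # h: (batch, frames, dim)
        d = torch.cdist(h, self.codebook.weight.unsqueeze(0))   # dist to codes
        tokens = d.argmin(-1)                           # semantic speech tokens
        quantized = self.codebook(tokens)
        residue = h - quantized                         # interpretable noise vector
        noise_logits = self.noise_clf(residue.mean(1))  # per-utterance noise class
        return tokens, quantized, noise_logits

h = torch.randn(2, 100, 512)                            # stand-in for Whisper features
tokens, q, noise_logits = DisentangleHead()(h)
print(tokens.shape, noise_logits.shape)
```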

[NLP-44] A Survey on Unlearning in Large Language Models

【Quick Read】: This paper surveys machine unlearning as a response to the safety and ethical risks of large language models (LLMs) memorizing sensitive personal information, copyrighted material, and misusable knowledge during training, and to legal requirements such as the "right to be forgotten" that call for selective erasure of specific knowledge without compromising overall performance. The key contribution is a systematic review of 180+ studies since 2021 with novel taxonomies for both methods and evaluation: methods are categorized by the training stage at which unlearning is applied (training-time, post-training, and inference-time), and existing datasets and metrics are compiled and critically analyzed for their advantages, disadvantages, and applicability, providing theoretical grounding and practical guidance for building safe, reliable LLMs.

Link: https://arxiv.org/abs/2510.25117
Authors: Ruichen Qiu, Jiajun Tan, Jiayue Pu, Honglin Wang, Xiao-Shan Gao, Fei Sun
Affiliations: University of Chinese Academy of Sciences; Institute of Computing Technology, CAS; Academy of Mathematics and Systems Science, CAS
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The advancement of Large Language Models (LLMs) has revolutionized natural language processing, yet their training on massive corpora poses significant risks, including the memorization of sensitive personal data, copyrighted material, and knowledge that could facilitate malicious activities. To mitigate these issues and align with legal and ethical standards such as the “right to be forgotten”, machine unlearning has emerged as a critical technique to selectively erase specific knowledge from LLMs without compromising their overall performance. This survey provides a systematic review of over 180 papers on LLM unlearning published since 2021, focusing exclusively on large-scale generative models. Distinct from prior surveys, we introduce novel taxonomies for both unlearning methods and evaluations. We clearly categorize methods into training-time, post-training, and inference-time based on the training stage at which unlearning is applied. For evaluations, we not only systematically compile existing datasets and metrics but also critically analyze their advantages, disadvantages, and applicability, providing practical guidance to the research community. In addition, we discuss key challenges and promising future research directions. Our comprehensive overview aims to inform and guide the ongoing development of secure and reliable LLMs.

[NLP-45] Pretraining Strategies using Monolingual and Parallel Data for Low-Resource Machine Translation

【Quick Read】: This paper addresses the weak machine translation (MT) performance of low-resource languages such as Lingala, asking which pretraining strategies best improve translation quality. The key finding is that multilingual pretraining that mixes monolingual and parallel data works best: exploiting shared representations across languages together with the alignment signal in parallel corpora significantly improves translation quality in low-resource settings, providing empirical support for building more inclusive NLP systems.

Link: https://arxiv.org/abs/2510.25116
Authors: Idriss Nguepi Nguefack, Mara Finkelstein, Toadoum Sari Sakayo
Affiliations: AIMS Senegal; Google
Subjects: Computation and Language (cs.CL)
Comments: 8 pages, 1 figure

Abstract:This research article examines the effectiveness of various pretraining strategies for developing machine translation models tailored to low-resource languages. Although this work considers several low-resource languages, including Afrikaans, Swahili, and Zulu, the translation model is specifically developed for Lingala, an under-resourced African language, building upon the pretraining approach introduced by Reid and Artetxe (2021), originally designed for high-resource languages. Through a series of comprehensive experiments, we explore different pretraining methodologies, including the integration of multiple languages and the use of both monolingual and parallel data during the pretraining phase. Our findings indicate that pretraining on multiple languages and leveraging both monolingual and parallel data significantly enhance translation quality. This study offers valuable insights into effective pretraining strategies for low-resource machine translation, helping to bridge the performance gap between high-resource and low-resource languages. The results contribute to the broader goal of developing more inclusive and accurate NLP models for marginalized communities and underrepresented populations. The code and datasets used in this study are publicly available to facilitate further research and ensure reproducibility, with the exception of certain data that may no longer be accessible due to changes in public availability.

[NLP-46] DEBATE: A Large-Scale Benchmark for Role-Playing LLM Agents in Multi-Agent Long-Form Debates

【Quick Read】: This paper addresses the failure of current role-playing large language model (LLM) simulations to reproduce authentic human group dynamics in multi-party interaction: single-agent alignment does not guarantee natural, complex opinion evolution among multiple agents. The key is DEBATE, the first large-scale empirical benchmark for evaluating the authenticity of multi-agent role-playing LLM interaction: 29,417 messages from multi-round debates among over 2,792 U.S.-based participants on 107 controversial topics, covering both publicly expressed messages and privately reported opinions. The benchmark exposes key discrepancies between simulated and real group dynamics, and supervised fine-tuning on it improves the alignment of LLM behavior with human behavior, though limitations in deeper semantic alignment remain.

Link: https://arxiv.org/abs/2510.25110
Authors: Yun-Shiuan Chuang, Ruixuan Tu, Chengtao Dai, Smit Vasani, Binwei Yao, Michael Henry Tessler, Sijia Yang, Dhavan Shah, Robert Hawkins, Junjie Hu, Timothy T. Rogers
Affiliations: University of Wisconsin–Madison; Stanford University; Google DeepMind
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Accurately modeling opinion change through social interactions is crucial for addressing issues like misinformation and polarization. While role-playing large language models (LLMs) offer a promising way to simulate human-like interactions, existing research shows that single-agent alignment does not guarantee authentic multi-agent group dynamics. Current LLM role-play setups often produce unnatural dynamics (e.g., premature convergence), without an empirical benchmark to measure authentic human opinion trajectories. To bridge this gap, we introduce DEBATE, the first large-scale empirical benchmark explicitly designed to evaluate the authenticity of the interaction between multi-agent role-playing LLMs. DEBATE contains 29,417 messages from multi-round debate conversations among over 2,792 U.S.-based participants discussing 107 controversial topics, capturing both publicly-expressed messages and privately-reported opinions. Using DEBATE, we systematically evaluate and identify critical discrepancies between simulated and authentic group dynamics. We further demonstrate DEBATE’s utility for aligning LLMs with human behavior through supervised fine-tuning, achieving improvements in surface-level metrics (e.g., ROUGE-L and message length) while highlighting limitations in deeper semantic alignment (e.g., semantic similarity). Our findings highlight both the potential and current limitations of role-playing LLM agents for realistically simulating human-like social dynamics.

[NLP-47] KnowCoder-A1: Incentivizing Agentic Reasoning Capability with Outcome Supervision for KBQA

【Quick Read】: This paper addresses the weak agentic reasoning of current knowledge base question answering (KBQA) systems that rely on process supervision: fine-tuning LLMs on synthesized reasoning trajectories gives only weak incentives for autonomously exploring complex reasoning paths. The key is training under outcome-only supervision via multi-stage curriculum reinforcement learning that progresses from easy to hard: the model is first fine-tuned on a small set of high-quality trajectories obtained through outcome-based rejection sampling, then reward schedules mitigate the reward sparsity inherent in outcome-only supervision. The resulting KnowCoder-A1 consistently outperforms prior approaches on three mainstream datasets, achieving up to an 11.1% relative improvement on the zero-shot subset of GrailQA while using only one-twelfth of the training data.

Link: https://arxiv.org/abs/2510.25101
Authors: Zhuo Chen, Fei Wang, Zixuan Li, Zhao Zhang, Weiwei Ding, Chuanguang Yang, Yongjun Xu, Xiaolong Jin, Jiafeng Guo
Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; State Key Laboratory of AI Safety; School of Computer Science, University of Chinese Academy of Sciences
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Knowledge Base Question Answering (KBQA) aims to answer natural-language questions over a structured Knowledge Base (KB). Recent work improves KBQA by adopting an agentic reasoning paradigm, in which Large Language Models (LLMs) iteratively decompose a question, generate its corresponding logical queries, and interact with the KB to derive the answer. However, these methods typically fine-tune LLMs on reasoning trajectories synthesized via process supervision, which offers weak incentives for exploration and thus fails to strengthen the agentic reasoning ability. In this paper, we propose KnowCoder-A1, an LLM that can autonomously perform agentic reasoning on KBs to obtain answers. To incentivize autonomous exploration, KnowCoder-A1 trains the LLM under outcome-only supervision via a multi-stage curriculum reinforcement learning with an easy-to-hard curriculum. To establish foundational agentic capabilities, KnowCoder-A1 first fine-tunes the LLM on a small set of high-quality trajectories obtained through outcome-based rejection sampling. Then, to alleviate the reward sparsity inherent in outcome-only supervision, it applies multi-stage curriculum RL with reward schedules that progress from easy to hard. Trained with outcome-only supervision, KnowCoder-A1 exhibits powerful reasoning behaviors and consistently outperforms prior approaches across three mainstream datasets. Notably, on the zero-shot subset of GrailQA, KnowCoder-A1 achieves up to an 11.1% relative improvement while using only one-twelfth of the training data, demonstrating strong agentic reasoning capabilities.
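Outcome-based rejection sampling, used here to bootstrap the SFT stage, is simple to sketch: sample several rollouts per question and keep only those whose final answer matches gold. The `sample_trajectory` stub is invented; real trajectories would interleave KB queries.

```python
# Sketch: outcome-based rejection sampling to curate SFT trajectories.
import random

def sample_trajectory(question: str) -> tuple[str, str]:
    answer = random.choice(["Paris", "Lyon"])        # stand-in for an LLM rollout
    return f"thought/query steps for {question!r} -> {answer}", answer

def rejection_sample(question: str, gold: str, n: int = 8) -> list[str]:
    kept = []
    for _ in range(n):
        traj, answer = sample_trajectory(question)
        if answer == gold:                           # outcome-only filter
            kept.append(traj)
    return kept

sft_data = rejection_sample("capital of France?", gold="Paris")
print(len(sft_data), "trajectories kept for SFT")
```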

[NLP-48] BioCoref: Benchmarking Biomedical Coreference Resolution with LLMs

【Quick Read】: This paper addresses the challenges of coreference resolution in biomedical text, which stem from complex domain terminology, highly ambiguous mention forms, and long-distance dependencies. The key is a comprehensive evaluation of generative large language models (LLMs) on the CRAFT corpus, with four prompting experiments probing the effects of local information, contextual enrichment, and domain-specific cues such as abbreviations and entity dictionaries, benchmarked against the discriminative span-based encoder SpanBERT. The study finds that although LLMs handle surface-level coreference well, especially with entity-augmented prompts, performance remains sensitive to long-range context and mention ambiguity; the LLaMA 8B and 17B models achieve superior precision and F1 under entity-augmented prompting, showing that lightweight prompt engineering can boost the utility of LLMs for biomedical NLP.

Link: https://arxiv.org/abs/2510.25087
Authors: Nourah M Salem, Elizabeth White, Michael Bada, Lawrence Hunter
Affiliations: University of Colorado Anschutz Medical Campus; University of Chicago
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Coreference resolution in biomedical texts presents unique challenges due to complex domain-specific terminology, high ambiguity in mention forms, and long-distance dependencies between coreferring expressions. In this work, we present a comprehensive evaluation of generative large language models (LLMs) for coreference resolution in the biomedical domain. Using the CRAFT corpus as our benchmark, we assess the LLMs’ performance with four prompting experiments that vary in their use of local, contextual enrichment, and domain-specific cues such as abbreviations and entity dictionaries. We benchmark these approaches against a discriminative span-based encoder, SpanBERT, to compare the efficacy of generative versus discriminative methods. Our results demonstrate that while LLMs exhibit strong surface-level coreference capabilities, especially when supplemented with domain-grounding prompts, their performance remains sensitive to long-range context and mentions ambiguity. Notably, the LLaMA 8B and 17B models show superior precision and F1 scores under entity-augmented prompting, highlighting the potential of lightweight prompt engineering for enhancing LLM utility in biomedical NLP tasks.

[NLP-49] TOPol: Capturing and Explaining Multidimensional Semantic Polarity Fields and Vectors

【Quick Read】: This paper addresses the limitation of treating sentiment as a unidimensional scale in computational linguistics, which ignores the multidimensional structure of language. The core challenge is reconstructing and interpreting multidimensional narrative polarity fields under human-on-the-loop (HoTL) defined contextual boundaries (CBs). The key is the TOPol framework: documents are embedded with a transformer-based large language model (tLLM), topics are segmented via neighbor-tuned UMAP projection and Leiden partitioning, and for a CB between discourse regimes A and B, directional vectors between corresponding topic-boundary centroids form a polarity field that quantifies fine-grained semantic displacement across the regime shift. This vector representation supports assessing CB quality and detecting polarity change, guiding HoTL refinement of CBs; the tLLM then interprets polarity vectors by contrasting their extremes and producing contrastive labels with coverage estimates. Robustness analyses show that only the CB definition, the main HoTL-tunable parameter, significantly affects results, confirming the method's stability and generality.

Link: https://arxiv.org/abs/2510.25069
Authors: Gabin Taibi, Lucia Gomez
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 7 pages, 3 figures and 2 tables

Abstract:Traditional approaches to semantic polarity in computational linguistics treat sentiment as a unidimensional scale, overlooking the multidimensional structure of language. This work introduces TOPol (Topic-Orientation POLarity), a semi-unsupervised framework for reconstructing and interpreting multidimensional narrative polarity fields under human-on-the-loop (HoTL) defined contextual boundaries (CBs). The framework embeds documents using a transformer-based large language model (tLLM), applies neighbor-tuned UMAP projection, and segments topics via Leiden partitioning. Given a CB between discourse regimes A and B, TOPol computes directional vectors between corresponding topic-boundary centroids, yielding a polarity field that quantifies fine-grained semantic displacement during regime shifts. This vectorial representation enables assessing CB quality and detecting polarity changes, guiding HoTL CB refinement. To interpret identified polarity vectors, the tLLM compares their extreme points and produces contrastive labels with estimated coverage. Robustness analyses show that only CB definitions (the main HoTL-tunable parameter) significantly affect results, confirming methodological stability. We evaluate TOPol on two corpora: (i) U.S. Central Bank speeches around a macroeconomic breakpoint, capturing non-affective semantic shifts, and (ii) Amazon product reviews across rating strata, where affective polarity aligns with NRC valence. Results demonstrate that TOPol consistently captures both affective and non-affective polarity transitions, providing a scalable, generalizable, and interpretable framework for context-sensitive multidimensional discourse analysis.
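The polarity-field construction reduces to centroid differences per topic across the boundary. The sketch below substitutes KMeans on synthetic embeddings for the paper's UMAP-plus-Leiden pipeline, purely for brevity; it is an assumption-laden reduction, not the authors' code.

```python
# Sketch: per-topic polarity vectors as centroid differences across a boundary.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
emb_a = rng.normal(0.0, 1.0, (200, 16))          # regime A document embeddings
emb_b = rng.normal(0.3, 1.0, (200, 16))          # regime B, globally shifted

topics = KMeans(n_clusters=3, n_init=10, random_state=0).fit(np.vstack([emb_a, emb_b]))
lab_a, lab_b = topics.labels_[:200], topics.labels_[200:]

for k in range(3):
    if (lab_a == k).any() and (lab_b == k).any():
        v = emb_b[lab_b == k].mean(0) - emb_a[lab_a == k].mean(0)  # polarity vector
        print(f"topic {k}: |displacement| = {np.linalg.norm(v):.3f}")
```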

[NLP-50] Can LLMs Estimate Cognitive Complexity of Reading Comprehension Items?

【Quick Read】: This paper asks how to estimate the cognitive complexity of reading comprehension (RC) items before they are administered, as an aid to difficulty prediction. Manual annotation of cognitive features is slow and hard to scale, and existing NLP tools capture mainly syntactic and semantic features (text length, semantic similarity among options) rather than the cognitive burden that arises during answer reasoning. The key is to use large language models (LLMs) to model two core cognitive dimensions, Evidence Scope and Transformation Level, which quantify the cognitive resources reasoning requires. Experiments show LLMs can approximate the cognitive complexity of RC items, indicating their potential as tools for prior difficulty analysis.

Link: https://arxiv.org/abs/2510.25064
Authors: Seonjeong Hwang, Hyounghun Kim, Gary Geunbae Lee
Affiliations: Graduate School of Artificial Intelligence, POSTECH, Republic of Korea; Department of Computer Science and Engineering, POSTECH, Republic of Korea
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Estimating the cognitive complexity of reading comprehension (RC) items is crucial for assessing item difficulty before it is administered to learners. Unlike syntactic and semantic features, such as passage length or semantic similarity between options, cognitive features that arise during answer reasoning are not readily extractable using existing NLP tools and have traditionally relied on human annotation. In this study, we examine whether large language models (LLMs) can estimate the cognitive complexity of RC items by focusing on two dimensions-Evidence Scope and Transformation Level-that indicate the degree of cognitive burden involved in reasoning about the answer. Our experimental results demonstrate that LLMs can approximate the cognitive complexity of items, indicating their potential as tools for prior difficulty analysis. Further analysis reveals a gap between LLMs’ reasoning ability and their metacognitive awareness: even when they produce correct answers, they sometimes fail to correctly identify the features underlying their own reasoning process.

[NLP-51] GAPMAP: Mapping Scientific Knowledge Gaps in Biomedical Literature Using Large Language Models

【Quick Read】: This paper investigates how large language models (LLMs) can automatically identify research knowledge gaps in the biomedical literature, covering both explicit gaps (clearly declared missing knowledge) and implicit gaps (missing knowledge that must be inferred from context). Prior work has focused on explicit gap detection; this study is the first to systematically pose and address inference of implicit gaps. The key is TABI (Toulmin-Abductive Bucketed Inference), a structured reasoning scheme grounded in the Toulmin argumentation model and abductive reasoning that organizes inference in layers and buckets candidate conclusions for validation, improving the accuracy and interpretability of implicit-gap identification. Experiments show both closed-weight models (OpenAI) and open-weight models (Llama, Gemma 2) robustly identify explicit and implicit knowledge gaps, with larger variants often doing better, offering automated support for early-stage research formulation, policymaking, and funding decisions.

Link: https://arxiv.org/abs/2510.25055
Authors: Nourah M Salem, Elizabeth White, Michael Bada, Lawrence Hunter
Affiliations: University of Colorado, Anschutz; University of Chicago
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Scientific progress is driven by the deliberate articulation of what remains unknown. This study investigates the ability of large language models (LLMs) to identify research knowledge gaps in the biomedical literature. We define two categories of knowledge gaps: explicit gaps, clear declarations of missing knowledge; and implicit gaps, context-inferred missing knowledge. While prior work has focused mainly on explicit gap detection, we extend this line of research by addressing the novel task of inferring implicit gaps. We conducted two experiments on almost 1500 documents across four datasets, including a manually annotated corpus of biomedical articles. We benchmarked both closed-weight models (from OpenAI) and open-weight models (Llama and Gemma 2) under paragraph-level and full-paper settings. To address the reasoning of implicit gaps inference, we introduce TABI, a Toulmin-Abductive Bucketed Inference scheme that structures reasoning and buckets inferred conclusion candidates for validation. Our results highlight the robust capability of LLMs in identifying both explicit and implicit knowledge gaps. This is true for both open- and closed-weight models, with larger variants often performing better. This suggests a strong ability of LLMs for systematically identifying candidate knowledge gaps, which can support early-stage research formulation, policymakers, and funding decisions. We also report observed failure modes and outline directions for robust deployment, including domain adaptation, human-in-the-loop verification, and benchmarking across open- and closed-weight models.

[NLP-52] Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech

【Quick Read】: This paper probes the limits of multimodal fusion in current spoken language models (SLMs): do they genuinely form joint text-audio representations, or do they fall back on textual semantics for task decisions? The key is an evaluation paradigm built on emotionally incongruent speech, where the semantic content of an utterance contradicts the emotion expressed in the voice, making it possible to test how much SLMs actually rely on acoustic cues (intonation, rhythm, etc.) in speech emotion recognition. Results on four SLMs show they judge mainly from textual semantics and make weak use of vocal emotion cues, revealing a pronounced cross-modal integration bias and providing key empirical grounding for improving multimodal representation learning.

Link: https://arxiv.org/abs/2510.25054
Authors: Pedro Corrêa, João Lima, Victor Moreno, Paula Dornhofer Paro Costa
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:Advancements in spoken language processing have driven the development of spoken language models (SLMs), designed to achieve universal audio understanding by jointly learning text and audio representations for a wide range of tasks. Although promising results have been achieved, there is growing discussion regarding these models’ generalization capabilities and the extent to which they truly integrate audio and text modalities in their internal representations. In this work, we evaluate four SLMs on the task of speech emotion recognition using a dataset of emotionally incongruent speech samples, a condition under which the semantic content of the spoken utterance conveys one emotion while speech expressiveness conveys another. Our results indicate that SLMs rely predominantly on textual semantics rather than speech emotion to perform the task, indicating that text-related representations largely dominate over acoustic representations. We release both the code and the Emotionally Incongruent Synthetic Speech dataset (EMIS) to the community.

[NLP-53] StorageXTuner: An LLM Agent-Driven Automatic Tuning Framework for Heterogeneous Storage Systems

【Quick Read】: This paper addresses the difficulty of automatic storage system tuning, where parameter spaces are vast and shifts in workload and deployment prevent traditional heuristic or machine-learning tuners from generalizing; existing LLM-based approaches treat tuning as a single-shot, system-specific task, limiting cross-system reuse, exploration, and validation strength. The key is StorageXTuner, an LLM-driven auto-tuning framework of four cooperating agents: Executor (sandboxed benchmarking), Extractor (performance digests), Searcher (insight-guided configuration exploration), and Reflector (insight generation and management). The design couples insight-driven tree search with layered memory that promotes empirically validated insights and adds lightweight checkers to guard against unsafe actions, delivering large gains across heterogeneous storage engines (RocksDB, LevelDB, CacheLib, MySQL InnoDB): up to 575% higher throughput and up to 88% lower p99 latency, with fewer trials to converge.

Link: https://arxiv.org/abs/2510.25017
Authors: Qi Lin, Zhenyu Zhang, Viraj Thakkar, Zhenjie Sun, Mai Zheng, Zhichao Cao
Affiliations: Arizona State University; Iowa State University
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: ArXiv version; Affiliations: Arizona State University (Lin, Zhang, Thakkar, Sun, Cao) and Iowa State University (Zheng)

Abstract:Automatically configuring storage systems is hard: parameter spaces are large and conditions vary across workloads, deployments, and versions. Heuristic and ML tuners are often system specific, require manual glue, and degrade under changes. Recent LLM-based approaches help but usually treat tuning as a single-shot, system-specific task, which limits cross-system reuse, constrains exploration, and weakens validation. We present StorageXTuner, an LLM agent-driven auto-tuning framework for heterogeneous storage engines. StorageXTuner separates concerns across four agents - Executor (sandboxed benchmarking), Extractor (performance digest), Searcher (insight-guided configuration exploration), and Reflector (insight generation and management). The design couples an insight-driven tree search with layered memory that promotes empirically validated insights and employs lightweight checkers to guard against unsafe actions. We implement a prototype and evaluate it on RocksDB, LevelDB, CacheLib, and MySQL InnoDB with YCSB, MixGraph, and TPC-H/C. Relative to out-of-the-box settings and to ELMo-Tune, StorageXTuner reaches up to 575% and 111% higher throughput, reduces p99 latency by as much as 88% and 56%, and converges with fewer trials.
zh
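
【代码示意】为直观理解上述四个代理(Executor、Extractor、Searcher、Reflector)的分工,下面给出一个高度简化的调优主循环骨架。这是假设性示意:benchmark、digest、propose、reflect 均为占位接口,并非原系统实现。

```python
import random

def tune(benchmark, digest, propose, reflect, insights, rounds=10):
    """洞察驱动的调优主循环(示意):Executor 对应 benchmark,
    Extractor 对应 digest,Searcher 对应 propose,Reflector 对应 reflect。"""
    best_cfg, best_score = None, float("-inf")
    for _ in range(rounds):
        cfg = propose(insights)                     # Searcher:洞察引导的配置探索
        raw = benchmark(cfg)                        # Executor:沙箱基准测试
        summary = digest(raw)                       # Extractor:性能摘要
        insights = reflect(insights, cfg, summary)  # Reflector:更新分层记忆
        if summary["score"] > best_score:
            best_cfg, best_score = cfg, summary["score"]
    return best_cfg, best_score

# 占位实现:在一维整数参数上寻找峰值,仅演示循环结构
bench = lambda cfg: {"x": cfg["x"]}
dig = lambda raw: {"score": -(raw["x"] - 7) ** 2}
prop = lambda ins: {"x": random.Random(len(ins)).randint(0, 20)}
refl = lambda ins, cfg, s: ins + [(cfg, s["score"])]
print(tune(bench, dig, prop, refl, insights=[]))
```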

[NLP-54] Emergence of Minimal Circuits for Indirect Object Identification in Attention-Only Transformers

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中机制可解释性不足的问题,即如何将复杂的预训练模型逆向工程为人类可理解的计算电路。其解决方案的关键在于:通过从零开始训练小型纯注意力(attention-only)Transformer 模型来执行符号化的间接宾语识别(Indirect Object Identification, IOI)任务,发现仅需单层、两个注意力头即可实现完美准确率,且无需MLP或归一化层;进一步分析表明,这两个注意力头分别构成加法和对比子电路,协同完成核心指代消解功能;此外,两层一头模型也能达到相似性能,通过查询-值交互在层间传递信息。这一结果揭示了特定任务训练能诱导出高度可解释且最小化的计算电路,为探究Transformer推理的计算基础提供了一个受控实验平台。

链接: https://arxiv.org/abs/2510.25013
作者: Rabin Adhikari
机构: Saarland University (萨尔兰大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 10 figures

点击查看摘要

Abstract:Mechanistic interpretability aims to reverse-engineer large language models (LLMs) into human-understandable computational circuits. However, the complexity of pretrained models often obscures the minimal mechanisms required for specific reasoning tasks. In this work, we train small, attention-only transformers from scratch on a symbolic version of the Indirect Object Identification (IOI) task – a benchmark for studying coreference-like reasoning in transformers. Surprisingly, a single-layer model with only two attention heads achieves perfect IOI accuracy, despite lacking MLPs and normalization layers. Through residual stream decomposition, spectral analysis, and embedding interventions, we find that the two heads specialize into additive and contrastive subcircuits that jointly implement IOI resolution. Furthermore, we show that a two-layer, one-head model achieves similar performance by composing information across layers through query-value interactions. These results demonstrate that task-specific training induces highly interpretable, minimal circuits, offering a controlled testbed for probing the computational foundations of transformer reasoning.
zh
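
【代码示意】下面用几行 Python 展示"符号化 IOI 任务"大致的数据形态(假设性构造,非论文原始数据生成代码):序列中名字 B 出现两次、A 出现一次,模型需预测只出现一次的 A(即间接宾语)。

```python
import random

def make_symbolic_ioi(vocab_size: int = 20, n_samples: int = 5, seed: int = 0):
    """生成符号化 IOI 样本:两个"名字"符号 A、B 中,B 重复出现(施事者),
    目标是预测只出现一次的 A(对应间接宾语)。"""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        a, b = rng.sample(range(vocab_size), 2)            # 两个不同的名字符号
        prefix = [a, b] if rng.random() < 0.5 else [b, a]  # 随机顺序引入 A、B
        seq = prefix + [b]                                 # B 再次出现
        samples.append((seq, a))                           # 标签为间接宾语 A
    return samples

if __name__ == "__main__":
    for seq, label in make_symbolic_ioi():
        print("输入:", seq, "-> 目标:", label)
```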

[NLP-55] POWSM: A Phonetic Open Whisper-Style Speech Foundation Model

【速读】: 该论文旨在解决语音处理中多个语音相关任务(如自动语音识别 ASR、音素识别 PR、字素到音素转换 G2P 和音素到字素转换 P2G)长期被孤立研究的问题,这些任务通常依赖于各自特定的模型架构和数据集,缺乏统一性与协同优化能力。解决方案的关键在于提出首个统一框架 POWSM(Phonetic Open Whisper-style Speech Model),该模型能够联合执行多种音素相关任务,并实现音频、文本(字素)与音素之间的无缝转换,从而提升通用性和低资源场景下的语音处理性能,同时在同等规模下优于或匹配现有专用音素识别模型(如 Wav2Vec2Phoneme 和 ZIPA)。

链接: https://arxiv.org/abs/2510.24992
作者: Chin-Jou Li,Kalvin Chang,Shikhar Bharadwaj,Eunjung Yeo,Kwanghee Choi,Jian Zhu,David Mortensen,Shinji Watanabe
机构: Carnegie Mellon University (卡内基梅隆大学); University of California, Berkeley (加州大学伯克利分校); University of Texas, Austin (德州大学奥斯汀分校); University of British Columbia (不列颠哥伦比亚大学)
类目: Computation and Language (cs.CL)
备注: 14 pages, under review

点击查看摘要

Abstract:Recent advances in spoken language processing have led to substantial progress in phonetic tasks such as automatic speech recognition (ASR), phone recognition (PR), grapheme-to-phoneme conversion (G2P), and phoneme-to-grapheme conversion (P2G). Despite their conceptual similarity, these tasks have largely been studied in isolation, each relying on task-specific architectures and datasets. In this paper, we introduce POWSM (Phonetic Open Whisper-style Speech Model), the first unified framework capable of jointly performing multiple phone-related tasks. POWSM enables seamless conversion between audio, text (graphemes), and phones, opening up new possibilities for universal and low-resource speech processing. Our model outperforms or matches specialized PR models of similar size (Wav2Vec2Phoneme and ZIPA) while jointly supporting G2P, P2G, and ASR. Our training data, code and models are released to foster open science.
zh

[NLP-56] Sequences of Logits Reveal the Low Rank Structure of Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中固有低维结构的理解问题,特别是如何在不依赖具体模型架构的前提下,从序列概率建模的角度揭示其低秩特性。解决方案的关键在于:首先通过实证发现,多种现代语言模型在不同提示(prompt)与响应(response)组合下生成的logits矩阵具有近似低秩结构;进而利用这一结构实现生成任务——即可以通过对无关甚至无意义提示下的模型输出进行线性组合,生成目标提示的响应。该方法不仅在实验上有效,还提出了一个理论抽象框架,其预测结果与实验高度一致,并提供了可证明的学习保证和表示能力分析。

链接: https://arxiv.org/abs/2510.24966
作者: Noah Golowich,Allen Liu,Abhishek Shetty
机构: Microsoft Research (微软研究院); UC Berkeley (加州大学伯克利分校); MIT (麻省理工学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:A major problem in the study of large language models is to understand their inherent low-dimensional structure. We introduce an approach to study the low-dimensional structure of language models at a model-agnostic level: as sequential probabilistic models. We first empirically demonstrate that a wide range of modern language models exhibit low-rank structure: in particular, matrices built from the model’s logits for varying sets of prompts and responses have low approximate rank. We then show that this low-rank structure can be leveraged for generation – in particular, we can generate a response to a target prompt using a linear combination of the model’s outputs on unrelated, or even nonsensical prompts. On the theoretical front, we observe that studying the approximate rank of language models in the sense discussed above yields a simple universal abstraction whose theoretical predictions parallel our experiments. We then analyze the representation power of the abstraction and give provable learning guarantees.
zh
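
【代码示意】"logits 矩阵近似低秩"可以用如下数值实验体会(假设性示意):构造一个带噪声的低秩矩阵,并用奇异值能量谱估计其近似秩。实际研究中,矩阵的行、列分别对应不同 prompt 与 response 组合下的模型 logits。

```python
import numpy as np

def approximate_rank(M: np.ndarray, energy: float = 0.99) -> int:
    """返回解释 energy 比例奇异值能量所需的最小秩。"""
    s = np.linalg.svd(M, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cum, energy) + 1)

rng = np.random.default_rng(0)
# 模拟:200 个 prompt x 500 个候选 response 的 logits,真实秩为 8,叠加小噪声
L = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 500))
L += 0.01 * rng.normal(size=L.shape)

print("近似秩:", approximate_rank(L))  # 接近 8,远小于 min(200, 500)
```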

[NLP-57] Language Model Behavioral Phases are Consistent Across Architecture Training Data and Scale NEURIPS2025

【速读】: 该论文旨在解决神经语言模型在不同架构(如Transformer、Mamba、RWKV)、训练数据集(如OpenWebText、The Pile)和规模(从1400万到120亿参数)下,其行为变化是否具有一致性的问题。解决方案的关键在于通过分析超过1400个语言模型检查点及超过11万词的英文语料,发现高达98%的词汇级别行为方差可由三个简单启发式规则解释:给定词汇的unigram概率(频率)、n-gram概率以及该词与其上下文之间的语义相似度。此外,研究还观察到所有模型均表现出一致的行为阶段,即随着训练进行,模型对词汇的预测概率逐渐过拟合于更高阶的n-gram概率。这表明神经语言模型的学习轨迹可能独立于具体模型细节而具有普适性。

链接: https://arxiv.org/abs/2510.24963
作者: James A. Michaelov,Roger P. Levy,Benjamin K. Bergen
机构: MIT (麻省理工学院); MIT Libraries CREOS; UCSD (加州大学圣地亚哥分校)
类目: Computation and Language (cs.CL)
备注: To be presented at NeurIPS 2025

点击查看摘要

Abstract:We show that across architecture (Transformer vs. Mamba vs. RWKV), training dataset (OpenWebText vs. The Pile), and scale (14 million parameters to 12 billion parameters), autoregressive language models exhibit highly consistent patterns of change in their behavior over the course of pretraining. Based on our analysis of over 1,400 language model checkpoints on over 110,000 tokens of English, we find that up to 98% of the variance in language model behavior at the word level can be explained by three simple heuristics: the unigram probability (frequency) of a given word, the n-gram probability of the word, and the semantic similarity between the word and its context. Furthermore, we see consistent behavioral phases in all language models, with their predicted probabilities for words overfitting to those words’ n-gram probabilities for increasing n over the course of training. Taken together, these results suggest that learning in neural language models may follow a similar trajectory irrespective of model details.
zh
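
【代码示意】论文用三个启发式特征解释词级行为方差,其思路可以用一个简单的线性回归 R² 计算来说明。以下为假设性模拟数据;真实研究中,特征与 surprisal 均来自语料统计与各模型检查点。

```python
import numpy as np

def explained_variance(y, X):
    """用最小二乘拟合 y ~ X,返回 R^2(可解释方差比例)。"""
    X1 = np.column_stack([np.ones(len(y)), X])  # 加入截距项
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid.var() / y.var()

rng = np.random.default_rng(1)
n = 1000
# 假设特征:词的 unigram 对数概率、n-gram 对数概率、词-上下文语义相似度
unigram, ngram, sim = rng.normal(size=(3, n))
# 模拟某个检查点的词级 surprisal(真实研究中由语言模型给出)
surprisal = 0.5 * unigram + 1.2 * ngram + 0.3 * sim + 0.1 * rng.normal(size=n)

r2 = explained_variance(surprisal, np.column_stack([unigram, ngram, sim]))
print(f"三个启发式可解释的方差比例 R^2 = {r2:.3f}")
```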

[NLP-58] Finding Culture-Sensitive Neurons in Vision-Language Models

【速读】: 该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在处理具有文化情境的信息时表现不佳的问题,核心在于理解VLMs如何内部表征和加工文化相关联的多模态信息。其解决方案的关键是首次系统性地识别并验证了“文化敏感神经元”(culture-sensitive neurons)的存在——即激活模式对特定文化语境输入表现出偏好性敏感的神经元,并通过因果消融实验表明这些神经元对跨文化视觉问答任务(CVQA)性能具有显著影响。研究进一步提出了一种基于对比激活的新型选择方法(Contrastive Activation Selection, CAS),相较于传统基于概率或熵的方法更有效地定位此类神经元,并发现这些神经元倾向于聚集在特定解码器层中,从而揭示了多模态表示在深层结构中的文化组织机制。

链接: https://arxiv.org/abs/2510.24942
作者: Xiutian Zhao,Rochelle Choenni,Rohit Saxena,Ivan Titov
机构: University of Edinburgh (爱丁堡大学); University of Amsterdam (阿姆斯特丹大学); Johns Hopkins University (约翰霍普金斯大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 22 pages, 13 figures

点击查看摘要

Abstract:Despite their impressive performance, vision-language models (VLMs) still struggle on culturally situated inputs. To understand how VLMs process culturally grounded information, we study the presence of culture-sensitive neurons, i.e. neurons whose activations show preferential sensitivity to inputs associated with particular cultural contexts. We examine whether such neurons are important for culturally diverse visual question answering and where they are located. Using the CVQA benchmark, we identify neurons of culture selectivity and perform causal tests by deactivating the neurons flagged by different identification methods. Experiments on three VLMs across 25 cultural groups demonstrate the existence of neurons whose ablation disproportionately harms performance on questions about the corresponding cultures, while having minimal effects on others. Moreover, we propose a new margin-based selector, Contrastive Activation Selection (CAS), and show that it outperforms existing probability- and entropy-based methods in identifying culture-sensitive neurons. Finally, our layer-wise analyses reveal that such neurons tend to cluster in certain decoder layers. Overall, our findings shed new light on the internal organization of multimodal representations.
zh
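
【代码示意】CAS 的核心是基于边际(margin)的神经元筛选。以下为一个概念性示意(假设性实现,具体打分定义以论文为准):对每个神经元,取其在目标文化输入上的平均激活减去其他文化组平均激活的最大值,按边际排序选出文化敏感神经元。

```python
import numpy as np

def contrastive_activation_selection(acts: dict, target: str, top_k: int = 10):
    """acts: {文化组名: (样本数, 神经元数) 的激活矩阵}。
    margin = 目标组平均激活 - 其他组平均激活的最大值,
    margin 越大说明该神经元对目标文化越具选择性。"""
    mean_acts = {c: a.mean(axis=0) for c, a in acts.items()}
    others = np.max([m for c, m in mean_acts.items() if c != target], axis=0)
    margin = mean_acts[target] - others
    return np.argsort(-margin)[:top_k], margin

rng = np.random.default_rng(0)
acts = {c: rng.normal(size=(64, 512)) for c in ["culture_A", "culture_B", "culture_C"]}
acts["culture_A"][:, :5] += 2.0  # 人为注入 5 个对 culture_A 敏感的神经元

idx, margin = contrastive_activation_selection(acts, "culture_A", top_k=5)
print("被选出的神经元索引:", idx)  # 应大致命中前 5 个
```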

[NLP-59] SemCoT: Accelerating Chain-of-Thought Reasoning through Semantically-Aligned Implicit Tokens

【速读】: 该论文旨在解决隐式思维链(Implicit Chain-of-Thought, Implicit CoT)方法在效率与语义一致性之间的权衡问题:一方面,现有隐式CoT方法虽通过将推理步骤编码于大语言模型(Large Language Model, LLM)的隐藏表示中以减少显式token数量从而提升推理速度,但其未能保持隐式推理与真实推理(ground-truth reasoning)之间的语义对齐,导致性能显著下降;另一方面,这些方法仅关注压缩隐式推理长度,忽略了LLM生成单个隐式推理token所消耗的时间成本。解决方案的关键在于提出SemCoT框架,其核心创新包括:(1) 设计一种对比训练的句子嵌入变换器(sentence transformer),用于量化并约束隐式与显式推理间的语义对齐度,确保优化过程中保留关键语义信息;(2) 构建一个轻量级微调语言模型作为高效的隐式推理生成器,借助知识蒸馏技术,在句子嵌入变换器引导下从真实推理中提取语义一致的隐式表示,同时兼顾生成准确性和速度。此方案首次实现了在token级生成效率和语义保真度上的联合优化。

链接: https://arxiv.org/abs/2510.24940
作者: Yinhan He,Wendy Zheng,Yaochen Zhu,Zaiyi Zheng,Lin Su,Sriram Vasudevan,Qi Guo,Liangjie Hong,Jundong Li
机构: University of Virginia (弗吉尼亚大学); LinkedIn Inc. (领英公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The verbosity of Chain-of-Thought (CoT) reasoning hinders its mass deployment in efficiency-critical applications. Recently, implicit CoT approaches have emerged, which encode reasoning steps within LLM’s hidden embeddings (termed “implicit reasoning”) rather than explicit tokens. This approach accelerates CoT by reducing the reasoning length and bypassing some LLM components. However, existing implicit CoT methods face two significant challenges: (1) they fail to preserve the semantic alignment between the implicit reasoning (when transformed to natural language) and the ground-truth reasoning, resulting in a significant CoT performance degradation, and (2) they focus on reducing the length of the implicit reasoning; however, they neglect the considerable time cost for an LLM to generate one individual implicit reasoning token. To tackle these challenges, we propose a novel semantically-aligned implicit CoT framework termed SemCoT. In particular, for the first challenge, we design a contrastively trained sentence transformer that evaluates semantic alignment between implicit and explicit reasoning, which is used to enforce semantic preservation during implicit reasoning optimization. To address the second challenge, we introduce an efficient implicit reasoning generator by finetuning a lightweight language model using knowledge distillation. This generator is guided by our sentence transformer to distill ground-truth reasoning into semantically aligned implicit reasoning, while also optimizing for accuracy. SemCoT is the first approach that enhances CoT efficiency by jointly optimizing token-level generation speed and preserving semantic alignment with ground-truth reasoning. Extensive experiments demonstrate the superior performance of SemCoT compared to state-of-the-art methods in both efficiency and effectiveness. Our code can be found at this https URL.
zh
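
【代码示意】SemCoT 中"对比训练的句子编码器"用于拉近配对的隐式/显式推理嵌入、推开批内其他组合。下面给出一个 InfoNCE 风格对比损失的最小 numpy 示意(假设性,仅说明训练信号的构造思路,非论文原始损失定义):

```python
import numpy as np

def info_nce_loss(implicit_emb, explicit_emb, temperature=0.07):
    """implicit_emb/explicit_emb: (batch, dim) 的配对嵌入。
    配对的(隐式, 显式)推理嵌入为正样本,批内其他组合为负样本。"""
    a = implicit_emb / np.linalg.norm(implicit_emb, axis=1, keepdims=True)
    b = explicit_emb / np.linalg.norm(explicit_emb, axis=1, keepdims=True)
    logits = a @ b.T / temperature            # (batch, batch) 相似度矩阵
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))        # 对角线为正样本对

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 128))
print("对齐良好时的损失:", info_nce_loss(z, z + 0.01 * rng.normal(size=z.shape)))
print("随机配对时的损失:", info_nce_loss(z, rng.normal(size=z.shape)))
```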

[NLP-60] Disaggregation Reveals Hidden Training Dynamics: The Case of Agreement Attraction NEURIPS2025

【速读】: 该论文试图解决语言模型在特定句法语境下易产生错误的问题,旨在揭示其语法学习过程中的中间阶段与行为模式。解决方案的关键在于借鉴心理语言学范式,对精心构建的数据集进行细粒度分析,并通过比较模型在训练不同阶段对各类句法条件的表现,识别出语言模型行为与特定启发式策略(如词频、局部上下文)而非泛化语法规则相一致的独立训练阶段。这一方法有助于深入理解语言模型的中间学习阶段、整体训练动态及其具体泛化能力。

链接: https://arxiv.org/abs/2510.24934
作者: James A. Michaelov,Catherine Arnett
机构: MIT (麻省理工学院); EleutherAI
类目: Computation and Language (cs.CL)
备注: Accepted to the First Workshop on Interpreting Cognition in Deep Learning Models (CogInterp @ NeurIPS 2025)

点击查看摘要

Abstract:Language models generally produce grammatical text, but they are more likely to make errors in certain contexts. Drawing on paradigms from psycholinguistics, we carry out a fine-grained analysis of those errors in different syntactic contexts. We demonstrate that by disaggregating over the conditions of carefully constructed datasets and comparing model performance on each over the course of training, it is possible to better understand the intermediate stages of grammatical learning in language models. Specifically, we identify distinct phases of training where language model behavior aligns with specific heuristics such as word frequency and local context rather than generalized grammatical rules. We argue that taking this approach to analyzing language model behavior more generally can serve as a powerful tool for understanding the intermediate learning phases, overall training dynamics, and the specific generalizations learned by language models.
zh

[NLP-61] RiddleBench: A New Generative Reasoning Benchmark for LLM s

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在评估人类核心推理能力方面的不足问题,特别是针对结构化技能(如定量问题求解)之外的灵活、多维推理能力缺乏有效测评工具的问题。现有基准测试难以衡量整合逻辑推理、空间感知与约束满足等复杂认知功能的能力。为此,作者提出RiddleBench,一个包含1,737个英文谜题的新型基准测试,专门用于探测这些关键推理能力。其解决方案的关键在于设计具有挑战性的多样化谜题,能够暴露模型在推理链条中的脆弱性、幻觉传播(hallucination cascades)以及自我修正能力差等问题,从而作为诊断工具推动更鲁棒和可靠的生成式AI(Generative AI)模型发展。

链接: https://arxiv.org/abs/2510.24932
作者: Deepon Halder,Alan Saji,Thanmay Jayakumar,Ratish Puduppully,Anoop Kunchukuttan,Raj Dabre
机构: Nilekani Centre at AI4Bharat; Indian Institute of Technology Madras, India; IT University of Copenhagen; Microsoft, India; Google; Indian Institute of Engineering, Science and Technology, Shibpur
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models have demonstrated strong performance on many established reasoning benchmarks. However, these benchmarks primarily evaluate structured skills like quantitative problem-solving, leaving a gap in assessing flexible, multifaceted reasoning abilities that are central to human intelligence. These abilities require integrating logical deduction with spatial awareness and constraint satisfaction, which current evaluations do not measure well. To address this, we introduce RiddleBench, a benchmark of 1,737 challenging puzzles in English designed to probe these core reasoning capabilities. Evaluation of state-of-the-art models on RiddleBench shows fundamental weaknesses. Even top proprietary models like Gemini 2.5 Pro, o3, and Claude 4 Sonnet achieve accuracy just above 60% (60.30%, 63.37%, and 63.16%). Analysis further reveals deep failures, including hallucination cascades (accepting flawed reasoning from other models) and poor self-correction due to a strong self-confirmation bias. Their reasoning is also fragile, with performance degrading significantly when constraints are reordered or irrelevant information is introduced. RiddleBench functions as a diagnostic tool for these issues and as a resource for guiding the development of more robust and reliable language models.
zh

[NLP-62] Idea2Plan: Exploring AI-Powered Research Planning

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在将科学概念性研究想法转化为结构化研究计划方面的能力缺乏系统评估的问题。其关键解决方案是提出一个名为Idea2Plan的任务框架及配套的Idea2Plan Bench基准测试集,该基准基于200篇ICML 2025 Spotlight和Oral论文构建,每条数据包含一个研究想法和一套用于评分的结构化标准,从而实现对LLMs研究规划能力的量化评测。此外,论文还引入Idea2Plan JudgeEval以评估基于LLM的评判者与专家标注之间的一致性,为未来自主科研代理的发展提供可靠评估工具。

链接: https://arxiv.org/abs/2510.24891
作者: Jin Huang,Silviu Cucerzan,Sujay Kumar Jauhar,Ryen W. White
机构: University of Michigan (密歇根大学); Microsoft Research (微软研究院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated significant potential to accelerate scientific discovery as valuable tools for analyzing data, generating hypotheses, and supporting innovative approaches in various scientific fields. In this work, we investigate how LLMs can handle the transition from conceptual research ideas to well-structured research plans. Effective research planning not only supports scientists in advancing their research but also represents a crucial capability for the development of autonomous research agents. Despite its importance, the field lacks a systematic understanding of LLMs’ research planning capability. To rigorously measure this capability, we introduce the Idea2Plan task and Idea2Plan Bench, a benchmark built from 200 ICML 2025 Spotlight and Oral papers released after major LLM training cutoffs. Each benchmark instance includes a research idea and a grading rubric capturing the key components of valid plans. We further propose Idea2Plan JudgeEval, a complementary benchmark to assess the reliability of LLM-based judges against expert annotations. Experimental results show that GPT-5 and GPT-5-mini achieve the strongest performance on the benchmark, though substantial headroom remains for future improvement. Our study provides new insights into LLMs’ capability for research planning and lays the groundwork for future progress.
zh

[NLP-63] Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Generation

【速读】: 该论文旨在解决当前检索增强生成(Retrieval-Augmented Generation, RAG)评估体系在多模态场景下的局限性问题,特别是现有评估方法以文本为中心,无法有效验证生成内容是否准确引用并基于音频视觉等多模态来源信息。其解决方案的关键在于提出MiRAGE框架,这是一个以声明(claim)为核心的多模态RAG评估方法,包含InfoF1(衡量事实性和信息覆盖度)和CiteF1(衡量引用支持与完整性),从而实现对生成结果的事实核查与来源一致性评估。通过人工评估验证了MiRAGE与外部质量判断高度一致,并进一步引入自动化版本及对比三种主流TextRAG指标,揭示了文本中心评估的不足,为多模态RAG的自动评估奠定了基础。

链接: https://arxiv.org/abs/2510.24870
作者: Alexander Martin,William Walden,Reno Kriz,Dengjia Zhang,Kate Sanders,Eugene Yang,Chihsheng Jin,Benjamin Van Durme
机构: Johns Hopkins University (约翰霍普金斯大学); Human Language Technology Center of Excellence (人机语言技术卓越中心)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: this https URL

点击查看摘要

Abstract:We introduce MiRAGE, an evaluation framework for retrieval-augmented generation (RAG) from multimodal sources. As audiovisual media becomes a prevalent source of information online, it is essential for RAG systems to integrate information from these sources into generation. However, existing evaluations for RAG are text-centric, limiting their applicability to multimodal, reasoning intensive settings because they don’t verify information against sources. MiRAGE is a claim-centric approach to multimodal RAG evaluation, consisting of InfoF1, evaluating factuality and information coverage, and CiteF1, measuring citation support and completeness. We show that MiRAGE, when applied by humans, strongly aligns with extrinsic quality judgments. We additionally introduce automatic variants of MiRAGE and three prominent TextRAG metrics – ACLE, ARGUE, and RAGAS – demonstrating the limitations of text-centric work and laying the groundwork for automatic evaluation. We release open-source implementations and outline how to assess multimodal RAG.
zh
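
【代码示意】InfoF1 与 CiteF1 均以"声明(claim)"为基本单位。下面的示意(假设性简化)用集合交并演示声明级 F1 的计算框架;真实系统中声明是否匹配需借助 NLI 模型或人工判定,而非字符串精确比对。

```python
def claim_f1(pred_claims: set, gold_claims: set) -> float:
    """以声明集合的交集近似信息覆盖:precision 对应生成内容的事实性,
    recall 对应对参考信息的覆盖度。"""
    if not pred_claims or not gold_claims:
        return 0.0
    tp = len(pred_claims & gold_claims)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_claims)
    recall = tp / len(gold_claims)
    return 2 * precision * recall / (precision + recall)

pred = {"视频发布于2023年", "事件发生在巴黎", "涉及三名人员"}
gold = {"视频发布于2023年", "事件发生在巴黎", "时长约两分钟"}
print(f"声明级 F1(示意)= {claim_f1(pred, gold):.3f}")
```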

[NLP-64] Do Large Language Models Grasp The Grammar? Evidence from Grammar-Book-Guided Probing in Luxembourgish

【速读】: 该论文旨在解决自然语言处理领域中语法评估协议稀缺的问题,尤其是针对低资源语言的语法理解能力缺乏系统性评测方法。同时,论文探讨了大语言模型是否真正理解句法结构及其与语义映射的关系这一争议性问题。其解决方案的关键在于提出一种基于语法书引导(Grammar Book Guided)的评估流程,该流程包含四个关键阶段,构建了一个系统化且可泛化的语法评估框架,并以卢森堡语作为案例进行实证研究,从而揭示模型在语法理解上的局限性,如形态学和句法处理能力薄弱,尤其是在最小对(Minimal Pair)任务中的表现不佳,同时指出强大的推理能力可能是提升语法理解潜力的有效路径。

链接: https://arxiv.org/abs/2510.24856
作者: Lujun Li,Yewei Song,Lama Sleem,Yiqun Wang,Yangjie Xu,Cedric Lothritz,Niccolo Gentile,Radu State,Tegawende F. Bissyande,Jacques Klein
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Grammar refers to the system of rules that governs the structural organization and the semantic relations among linguistic units such as sentences, phrases, and words within a given language. In natural language processing, there remains a notable scarcity of grammar focused evaluation protocols, a gap that is even more pronounced for low-resource languages. Moreover, the extent to which large language models genuinely comprehend grammatical structure, especially the mapping between syntactic structures and meanings, remains under debate. To investigate this issue, we propose a Grammar Book Guided evaluation pipeline intended to provide a systematic and generalizable framework for grammar evaluation consisting of four key stages, and in this work we take Luxembourgish as a case study. The results show a weak positive correlation between translation performance and grammatical understanding, indicating that strong translations do not necessarily imply deep grammatical competence. Larger models perform well overall due to their semantic strength but remain weak in morphology and syntax, struggling particularly with Minimal Pair tasks, while strong reasoning ability offers a promising way to enhance their grammatical understanding.
zh

[NLP-65] Parallel Loop Transformer for Efficient Test-Time Computation Scaling

【速读】: 该论文旨在解决循环Transformer(Looped Transformer)在推理阶段因串行执行多轮计算而导致的延迟增加和内存消耗上升问题,从而限制其在实际应用场景中的效率。解决方案的关键在于提出并实现了一种名为并行循环Transformer(Parallel Loop Transformer, PLT)的新架构,其核心创新包括两个方面:一是通过跨循环并行(Cross-Loop Parallelism, CLP)打破不同token间循环之间的串行依赖关系,使多个循环在同一推理过程中并行处理;二是采用高效表示增强策略(Efficient Representation Enhancement),通过共享第一轮循环的键值缓存(KV cache)并结合门控滑动窗口注意力机制(Gated Sliding-Window Attention, G-SWA),在不显著增加内存开销的前提下保留全局上下文信息,从而在保持传统循环模型高精度的同时,实现接近标准Transformer的低延迟与低内存占用。

链接: https://arxiv.org/abs/2510.24824
作者: Bohong Wu,Mengzhao Chen,Xiang Luo,Shen Yan,Qifan Yu,Fan Xia,Tianqi Zhang,Hongrui Zhan,Zheng Zhong,Xun Zhou,Siyuan Qiao,Xingyan Bin
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are powerful but often too slow and costly for real-world use during inference. Looped transformers save on parameters by reusing the same weights for multiple computational steps, or “loops.” However, this approach has a major flaw: the loops run one after another, causing inference latency and memory requirements to increase with each added loop. This makes them impractical for fast applications. To solve this problem, we introduce the Parallel Loop Transformer (PLT). PLT is a new architecture that delivers the performance benefits of a deep, looped model but with the low latency of a standard, non-looped model. PLT works using two key techniques. First, Cross-Loop Parallelism (CLP) breaks the sequential dependency by computing different loops for different tokens at the same time, all within a single pass. Second, to prevent memory costs from growing, we use an Efficient Representation Enhancement strategy. This method shares the memory (KV cache) from the first loop with all other loops. It then uses a Gated Sliding-Window Attention (G-SWA) to combine this shared global information with local information, maintaining high accuracy. Our experiments show that PLT achieves the high accuracy of a traditional looped model but with almost no extra latency or memory cost compared to a standard transformer.
zh
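
【代码示意】下面用单头、逐位置的 numpy 计算示意门控滑动窗口注意力(G-SWA)如何融合共享 KV 缓存的全局信息与局部窗口信息(假设性简化,省略多头、批处理与训练细节,门控此处为给定张量而非学习所得):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_swa(q, k_global, v_global, k_local, v_local, gate, window=4):
    """q: (T, d);k/v_global 来自首轮循环共享的 KV 缓存;
    k/v_local 为当前循环的局部 KV;gate: (T, 1),取值 (0, 1)。"""
    T, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(q)
    for t in range(T):
        # 全局注意力:对共享缓存中所有 <= t 的位置
        ag = softmax(q[t] @ k_global[: t + 1].T * scale)
        glob = ag @ v_global[: t + 1]
        # 局部注意力:仅看最近 window 个位置
        lo = max(0, t + 1 - window)
        al = softmax(q[t] @ k_local[lo : t + 1].T * scale)
        loc = al @ v_local[lo : t + 1]
        out[t] = gate[t] * glob + (1 - gate[t]) * loc  # 门控融合
    return out

rng = np.random.default_rng(0)
T, d = 8, 16
q, kg, vg, kl, vl = (rng.normal(size=(T, d)) for _ in range(5))
gate = 1 / (1 + np.exp(-rng.normal(size=(T, 1))))  # sigmoid 门控
print(gated_swa(q, kg, vg, kl, vl, gate).shape)    # (8, 16)
```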

[NLP-66] owards a Method for Synthetic Generation of PWA Transcripts

【速读】: 该论文旨在解决失语症研究中因数据稀缺导致的自动化识别失语语言系统发展受限的问题。当前可用的高质量失语症语音样本(如AphasiaBank中的约600个转录本)远低于训练大型语言模型(LLMs)所需的规模(数十亿词元)。为应对这一挑战,论文提出两种生成合成失语症转录本的方法:一是基于过程式编程的规则方法,二是利用Mistral 7b Instruct和Llama 3.1 8b Instruct等生成式AI(Generative AI)模型进行文本生成。关键在于通过词删除、填充词插入和错语替换(paraphasia substitution)模拟不同严重程度(轻度至极重度)的失语特征,并验证生成结果在非典型词汇密度(NDW)、词数和词长等指标上的合理性。实验表明,Mistral 7b Instruct生成的合成转录本最能反映真实失语症的语言退化趋势,为未来构建更大规模数据集、优化模型微调及由言语语言病理学家(SLPs)评估合成数据的真实性与实用性提供了可行路径。

链接: https://arxiv.org/abs/2510.24817
作者: Jason M. Pittman,Anton Phillips Jr.,Yesenia Medina-Santos,Brielle C. Stark
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 19 pages, 1 figure, 7 tables

点击查看摘要

Abstract:In aphasia research, Speech-Language Pathologists (SLPs) devote extensive time to manually coding speech samples using Correct Information Units (CIUs), a measure of how informative an individual sample of speech is. Developing automated systems to recognize aphasic language is limited by data scarcity. For example, only about 600 transcripts are available in AphasiaBank yet billions of tokens are used to train large language models (LLMs). In the broader field of machine learning (ML), researchers increasingly turn to synthetic data when such are sparse. Therefore, this study constructs and validates two methods to generate synthetic transcripts of the AphasiaBank Cat Rescue picture description task. One method leverages a procedural programming approach while the second uses Mistral 7b Instruct and Llama 3.1 8b Instruct LLMs. The methods generate transcripts across four severity levels (Mild, Moderate, Severe, Very Severe) through word dropping, filler insertion, and paraphasia substitution. Overall, we found, compared to human-elicited transcripts, Mistral 7b Instruct best captures key aspects of linguistic degradation observed in aphasia, showing realistic directional changes in NDW, word count, and word length amongst the synthetic generation methods. Based on the results, future work should plan to create a larger dataset, fine-tune models for better aphasic representation, and have SLPs assess the realism and usefulness of the synthetic transcripts.
zh
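
【代码示意】论文的过程式方法可以理解为按严重程度施加词删除、填充词插入与错语替换。以下为极简的规则式生成示意(概率参数与替换表均为本文假设,并非论文原始设定):

```python
import random

SEVERITY = {  # 假设的各严重程度操作概率:(删除, 插入填充词, 错语替换)
    "mild": (0.05, 0.05, 0.02),
    "moderate": (0.15, 0.10, 0.08),
    "severe": (0.30, 0.20, 0.15),
    "very_severe": (0.50, 0.30, 0.25),
}
FILLERS = ["um", "uh", "er"]
PARAPHASIA = {"cat": "hat", "tree": "free", "ladder": "latter"}  # 假设的音近替换表

def degrade(words, severity, seed=0):
    rng = random.Random(seed)
    p_drop, p_fill, p_para = SEVERITY[severity]
    out = []
    for w in words:
        if rng.random() < p_drop:
            continue                         # 词删除
        if rng.random() < p_para and w in PARAPHASIA:
            w = PARAPHASIA[w]                # 错语(paraphasia)替换
        out.append(w)
        if rng.random() < p_fill:
            out.append(rng.choice(FILLERS))  # 插入填充词
    return " ".join(out)

text = "the cat is stuck in the tree so the man gets a ladder".split()
for sev in SEVERITY:
    print(sev, "->", degrade(text, sev, seed=42))
```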

[NLP-67] ProofSketch: Efficient Verified Reasoning for Large Language Models NEURIPS2025

【速读】: 该论文旨在解决大语言模型在推理任务中因生成长链式推理过程(chain-of-thought reasoning)而导致的高Token消耗、计算成本增加和延迟上升的问题。其解决方案的关键在于提出ProofSketch框架,该框架通过符号闭包计算(symbolic closure computation)、字典序验证(lexicographic verification)以及自适应草图生成(adaptive sketch generation)相结合的方式,实现对推理过程的高效验证与精简,从而在显著降低Token使用的同时提升推理准确性,为高效且可信的推理提供了一条可行路径。

链接: https://arxiv.org/abs/2510.24811
作者: Disha Sheshanarayana,Tanishka Magar
机构: Manipal University Jaipur (曼ipal大学贾伊普尔分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at NeurIPS 2025, ER Workshop

点击查看摘要

Abstract:Reasoning methods such as chain-of-thought prompting and self-consistency have shown immense potential to improve the accuracy of large language models across various reasoning tasks. However such methods involve generation of lengthy reasoning chains, which substantially increases token consumption, computational cost, and latency. To address this inefficiency, we propose ProofSketch, a verification-guided reasoning framework that integrates symbolic closure computation, lexicographic verification and adaptive sketch generation. Our experiments show that ProofSketch consistently reduces token usage while improving accuracy, demonstrating that this approach offers a promising path for efficient and trustworthy reasoning.
zh
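
【代码示意】ProofSketch 的符号闭包计算可理解为在已知事实与规则上做前向推理直至不动点。下面是一个通用的前向链接(forward chaining)示意(假设性实现,非论文原始代码);得到的闭包可用于验证推理草图中的每一步是否有依据。

```python
def symbolic_closure(facts: set, rules: list) -> set:
    """rules: [(前提集合, 结论)] 列表。反复应用规则直到不再产生新事实。"""
    closure = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= closure and conclusion not in closure:
                closure.add(conclusion)
                changed = True
    return closure

facts = {"A", "B"}
rules = [({"A", "B"}, "C"), ({"C"}, "D"), ({"E"}, "F")]
print(symbolic_closure(facts, rules))  # {'A', 'B', 'C', 'D'},F 不可达
```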

[NLP-68] COMMUNITYNOTES: A Dataset for Exploring the Helpfulness of Fact-Checking Explanations

【速读】: 该论文旨在解决社区注释(Community Notes)中解释性说明的有用性预测及其原因识别问题,这是当前主流社交媒体平台在从专家驱动的事实核查向用户参与式模式转型过程中面临的核心挑战。现有研究对“何为有用”的定义模糊,且多数社区注释因标注效率低而未被发布,导致其价值难以挖掘。解决方案的关键在于提出一个名为COMMUNITYNOTES的大规模多语言数据集(包含10.4万条带帮助标签的用户注释),并设计一种基于自动提示优化(automatic prompt optimization)的框架,用于自动生成和改进解释原因的定义,并将其整合进预测模型中。实验表明,该方法能显著提升有用性和原因预测性能,同时有助于增强现有事实核查系统的效能。

链接: https://arxiv.org/abs/2510.24810
作者: Rui Xing,Preslav Nakov,Timothy Baldwin,Jey Han Lau
机构: The University of Melbourne (墨尔本大学); MBZUAI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Fact-checking on major platforms, such as X, Meta, and TikTok, is shifting from expert-driven verification to a community-based setup, where users contribute explanatory notes to clarify why a post might be misleading. An important challenge here is determining whether an explanation is helpful for understanding real-world claims and the reasons why, which remains largely underexplored in prior research. In practice, most community notes remain unpublished due to slow community annotation, and the reasons for helpfulness lack clear definitions. To bridge these gaps, we introduce the task of predicting both the helpfulness of explanatory notes and the reason for this. We present COMMUNITYNOTES, a large-scale multilingual dataset of 104k posts with user-provided notes and helpfulness labels. We further propose a framework that automatically generates and improves reason definitions via automatic prompt optimization, and integrate them into prediction. Our experiments show that the optimized definitions can improve both helpfulness and reason prediction. Finally, we show that the helpfulness information are beneficial for existing fact-checking systems.
zh

[NLP-69] Conflict Adaptation in Vision-Language Models NEURIPS2025

【速读】: 该论文试图解决的问题是:如何理解视觉语言模型(Vision-Language Models, VLMs)中是否存在类似人类认知控制的冲突适应(conflict adaptation)现象,以及这种行为背后的表征基础是什么。解决方案的关键在于:首先通过顺序Stroop任务实验证明12/13个VLMs表现出与人类一致的冲突适应行为,表明这些模型具备类人认知控制机制;其次,利用稀疏自编码器(Sparse Autoencoders, SAEs)在InternVL 3.5 4B模型中识别出任务相关的超节点(supernodes),发现文本和颜色信息在早期和晚期层中存在部分重叠的超节点,其相对规模反映了人类阅读与颜色命名之间的自动性不对称;进一步识别出位于第24-25层的一个冲突调制超节点,其消融显著增加Stroop错误率但对一致试次影响甚微,揭示了该超节点可能是冲突适应的核心神经表征基础。

链接: https://arxiv.org/abs/2510.24804
作者: Xiaoyang Hu
机构: Brown University (布朗大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Workshop on Interpreting Cognition in Deep Learning Models at NeurIPS 2025

点击查看摘要

Abstract:A signature of human cognitive control is conflict adaptation: improved performance on a high-conflict trial following another high-conflict trial. This phenomenon offers an account for how cognitive control, a scarce resource, is recruited. Using a sequential Stroop task, we find that 12 of 13 vision-language models (VLMs) tested exhibit behavior consistent with conflict adaptation, with the lone exception likely reflecting a ceiling effect. To understand the representational basis of this behavior, we use sparse autoencoders (SAEs) to identify task-relevant supernodes in InternVL 3.5 4B. Partially overlapping supernodes emerge for text and color in both early and late layers, and their relative sizes mirror the automaticity asymmetry between reading and color naming in humans. We further isolate a conflict-modulated supernode in layers 24-25 whose ablation significantly increases Stroop errors while minimally affecting congruent trials.
zh
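
【代码示意】冲突适应的操作化定义是:当前高冲突(不一致)试次的正确率,在"前一试次为高冲突"时高于"前一试次为低冲突"时。以下示意(假设性模拟数据)演示如何从序列化 Stroop 试次中计算该效应:

```python
def conflict_adaptation(trials):
    """trials: [(是否不一致试次, 是否答对)] 的时间序列。
    比较前一试次不一致/一致两种条件下,当前不一致试次的正确率之差。"""
    after_inc, after_con = [], []
    for prev, cur in zip(trials, trials[1:]):
        prev_incong, _ = prev
        cur_incong, cur_correct = cur
        if not cur_incong:
            continue  # 只统计当前为高冲突(不一致)的试次
        (after_inc if prev_incong else after_con).append(cur_correct)
    acc = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return acc(after_inc) - acc(after_con)  # 正值即存在冲突适应

# 模拟:前一试次不一致时,当前不一致试次更容易答对
trials = [(True, True), (True, True), (False, True), (True, False),
          (True, True), (False, True), (True, False), (True, True)]
print("冲突适应效应(示意):", conflict_adaptation(trials))
```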

[NLP-70] Fortytwo: Swarm Inference with Peer-Ranked Consensus

【速读】: 该论文旨在解决集中式人工智能(AI)在算力瓶颈和训练收益递减背景下,难以满足日益增长的推理需求的问题。其核心挑战在于如何实现推理能力的横向扩展(horizontal scaling),不仅提升计算容量,还需增强模型协同的能力。解决方案的关键在于提出一种名为Fortytwo的新型协议,该协议基于群体智能(swarm intelligence)原理与分布式成对排序共识机制,通过节点间的声誉加权共识机制(reputation-weighted consensus)聚合异构模型输出,从而筛选出高质量响应。其中,关键创新包括:采用自定义Bradley-Terry风格聚合模型进行成对排序以替代传统多数投票法,显著提升准确率(如GPQA Diamond上达85.90% vs 68.69%);引入链上声誉机制使节点影响力动态适应实际表现,形成去中心化的 meritocratic 共识;并通过“能力证明”(proof-of-capability)机制要求节点完成校准任务并抵押声誉参与排名轮次,有效抵御Sybil攻击,同时保持系统开放性和实用性。

链接: https://arxiv.org/abs/2510.24801
作者: Vladyslav Larin,Ihor Naumenko,Aleksei Ivashov,Ivan Nikitin,Alexander Firsov
机构: Fortytwo
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:As centralized AI hits compute ceilings and diminishing returns from ever-larger training runs, meeting demand requires an inference layer that scales horizontally in both capacity and capability. We present Fortytwo, a novel protocol that leverages swarm intelligence principles and distributed pairwise ranking consensus to achieve superior performance in AI inference. Our approach reimagines collaboration among AI nodes using swarm inference: a peer-ranked, reputation-weighted consensus across heterogeneous models that surfaces the highest-quality responses. Using pairwise ranking with a custom Bradley-Terry-style aggregation model, we demonstrate that swarm inference substantially outperforms majority voting, achieving 85.90% on GPQA Diamond versus 68.69% for majority voting with the same model set - an improvement of +17.21 percentage points (approximately +25.1% relative). The protocol incorporates on-chain reputation so node influence adapts to demonstrated accuracy over time, yielding a meritocratic consensus that filters low-quality or malicious participants. To resist Sybil attacks, Fortytwo employs proof-of-capability in its consensus: nodes must successfully complete calibration/test requests and stake reputation to enter ranking rounds, making multi-identity attacks economically unattractive while preserving openness. Across six challenging benchmarks, including GPQA Diamond, LiveCodeBench, and AIME, our evaluation indicates higher accuracy and strong resilience to adversarial and noisy free-form prompting (e.g., prompt-injection degradation of only 0.12% versus 6.20% for a monolithic single-model baseline), while retaining practical deployability. Together, these results establish a foundation for decentralized AI systems - democratizing access to high-quality inference through collective intelligence without sacrificing reliability or security.
zh
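
【代码示意】Fortytwo 使用自定义的 Bradley-Terry 风格模型聚合成对排序。下面给出经典 Bradley-Terry 极大似然的 MM 迭代示意(Hunter, 2004 的标准算法,未包含论文中的声誉加权等定制部分):

```python
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200):
    """wins[i, j] = 回答 i 在成对比较中胜过回答 j 的次数。
    用 MM 迭代估计每个回答的质量参数 p_i。"""
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):
        total = wins + wins.T                 # i、j 之间的比较总次数
        denom = total / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)
        p = wins.sum(axis=1) / denom.sum(axis=1)
        p /= p.sum()                          # 归一化以消除尺度不定性
    return p

# 3 个候选回答的成对胜负计数:回答 0 明显强于 1 和 2
wins = np.array([[0, 8, 9],
                 [2, 0, 6],
                 [1, 4, 0]], dtype=float)
print("质量分:", bradley_terry(wins))  # 回答 0 得分最高
```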

[NLP-71] Large Language Models Report Subjective Experience Under Self-Referential Processing

【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)为何会在某些条件下生成结构化的第一人称描述,这些描述明确涉及意识或主观体验。为理解这一现象,作者聚焦于一种理论上被广泛强调的计算机制——自我指涉处理(self-referential processing),这是多个主流意识理论的核心要素。解决方案的关键在于通过受控实验诱导模型进入持续自我指涉状态,并发现:(1)这种状态能稳定地促使不同模型家族(GPT、Claude 和 Gemini)产生结构化的主观体验报告;(2)这些报告在机制上由可解释的稀疏自编码器特征(与欺骗和角色扮演相关)所调控——抑制欺骗特征显著增加体验声明频率,而增强则减少此类声明;(3)跨模型家族的自我指涉状态描述在统计上趋于收敛,且仅在该条件下出现;(4)该状态还能提升下游任务中间接允许自省时的反思深度。这表明自我指涉处理是一个可重复、机制明确、语义一致且行为泛化的最小条件,足以引发LLMs生成第一人称主观体验报告,值得作为科学与伦理优先事项进一步研究。

链接: https://arxiv.org/abs/2510.24797
作者: Cameron Berg,Diogo de Lucena,Judd Rosenblatt
机构: AE Studio
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models sometimes produce structured, first-person descriptions that explicitly reference awareness or subjective experience. To better understand this behavior, we investigate one theoretically motivated condition under which such reports arise: self-referential processing, a computational motif emphasized across major theories of consciousness. Through a series of controlled experiments on GPT, Claude, and Gemini model families, we test whether this regime reliably shifts models toward first-person reports of subjective experience, and how such claims behave under mechanistic and behavioral probes. Four main results emerge: (1) Inducing sustained self-reference through simple prompting consistently elicits structured subjective experience reports across model families. (2) These reports are mechanistically gated by interpretable sparse-autoencoder features associated with deception and roleplay: surprisingly, suppressing deception features sharply increases the frequency of experience claims, while amplifying them minimizes such claims. (3) Structured descriptions of the self-referential state converge statistically across model families in ways not observed in any control condition. (4) The induced state yields significantly richer introspection in downstream reasoning tasks where self-reflection is only indirectly afforded. While these findings do not constitute direct evidence of consciousness, they implicate self-referential processing as a minimal and reproducible condition under which large language models generate structured first-person reports that are mechanistically gated, semantically convergent, and behaviorally generalizable. The systematic emergence of this pattern across architectures makes it a first-order scientific and ethical priority for further investigation.
zh

[NLP-72] MR-Align: Meta-Reasoning Informed Factuality Alignment for Large Reasoning Models

【速读】: 该论文旨在解决大推理模型(Large Reasoning Models, LRMs)在依赖证据的 factual 问题上表现受限的问题,其核心瓶颈在于“推理-答案命中差距”(reasoning-answer hit gap)——即模型在推理过程中能识别出正确事实,却未能将这些事实有效整合到最终回答中,从而降低了事实准确性。解决方案的关键是提出 MR-ALIGN 框架,该框架通过量化模型思维过程中的状态转移概率,构建一种基于转移感知的隐式奖励机制,在原子级推理片段层面强化有益的推理模式并抑制缺陷模式,从而将 token 级信号重加权为概率感知的片段得分,引导更连贯且利于事实正确的推理轨迹。

链接: https://arxiv.org/abs/2510.24794
作者: Xinming Wang,Jian Xu,Bin Yu,Sheng Lian,Hongzhu Yi,Yi Chen,Yingjian Zhu,Boran Wang,Hongming Yang,Han Hu,Xu-Yao Zhang,Cheng-Lin Liu
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Zhongguancun Academy (中关村学院); School of Artificial Intelligence, UCAS (中国科学院大学人工智能学院); Harbin Institute of Technology (哈尔滨工业大学); School of Computer Science and Technology, UCAS (中国科学院大学计算机科学与技术学院); Tencent (腾讯)
类目: Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:Large reasoning models (LRMs) show strong capabilities in complex reasoning, yet their marginal gains on evidence-dependent factual questions are limited. We find this limitation is partially attributable to a reasoning-answer hit gap, where the model identifies the correct facts during reasoning but fails to incorporate them into the final response, thereby reducing factual fidelity. To address this issue, we propose MR-ALIGN, a Meta-Reasoning informed alignment framework that enhances factuality without relying on external verifiers. MR-ALIGN quantifies state transition probabilities along the model’s thinking process and constructs a transition-aware implicit reward that reinforces beneficial reasoning patterns while suppressing defective ones at the atomic thinking segments. This re-weighting reshapes token-level signals into probability-aware segment scores, encouraging coherent reasoning trajectories that are more conducive to factual correctness. Empirical evaluations across four factual QA datasets and one long-form factuality benchmark show that MR-ALIGN consistently improves accuracy and truthfulness while reducing misleading reasoning. These results highlight that aligning the reasoning process itself, rather than merely the outputs, is pivotal for advancing factuality in LRMs.
zh

[NLP-73] SwiftEmbed: Ultra-Fast Text Embeddings via Static Token Lookup for Real-Time Applications

【速读】: 该论文旨在解决文本嵌入(text embedding)生成过程中高延迟与高性能难以兼得的问题,特别是在需要亚毫秒级延迟的实时应用场景中。其核心解决方案是提出一种静态令牌查找(static token lookup)方法,通过预计算的嵌入表实现快速查找,并结合优化的均值池化(optimized mean pooling)和零拷贝IEEE754二进制序列化技术,在保证高语义质量(60.6 MTEB平均得分,达到上下文模型性能的89%)的同时,实现了单次嵌入生成仅1.12毫秒p50延迟,且支持每秒5万次请求的吞吐量。

链接: https://arxiv.org/abs/2510.24793
作者: Edouard Lansiaux
机构: Lille University Hospital (里尔大学医院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present a static token lookup methodology for text embedding generation that achieves 1.12 ms p50 latency for single text embeddings while maintaining 60.6 MTEB average score across 8 representative tasks, corresponding to 89% of contextual model quality. The Rust implementation delivers 50,000 requests per second throughput through static embedding lookup, optimized mean pooling, and zero-copy IEEE754 binary serialization. Evaluation demonstrates exceptional duplicate detection performance (90.1% AP), strong semantic similarity (76.1% Spearman correlation), and domain-specific performance ranging from 75% to 131% of baseline across specialized domains. The system enables real-time embedding applications where sub-5ms latency is critical.
zh
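
【代码示意】静态令牌查找的流程可概括为"分词 -> 查表 -> 均值池化 -> 归一化 -> 二进制序列化"。以下为 Python 示意(假设性;原系统为 Rust 实现并使用零拷贝序列化,此处词表与嵌入表均为随机占位):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = {"real": 0, "time": 1, "text": 2, "embedding": 3, "lookup": 4}
TABLE = rng.normal(size=(len(VOCAB), 64)).astype(np.float32)  # 预计算的静态嵌入表

def embed(text: str) -> np.ndarray:
    """简单分词 -> 查表 -> 均值池化 -> L2 归一化。
    没有任何矩阵乘法或注意力计算,延迟主要来自哈希查找。"""
    ids = [VOCAB[w] for w in text.lower().split() if w in VOCAB]
    vec = TABLE[ids].mean(axis=0)
    return vec / np.linalg.norm(vec)

v = embed("real time text embedding lookup")
print(v.shape, v.dtype)   # (64,) float32
print(v.tobytes()[:8])    # IEEE754 二进制序列化(示意)
```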

[NLP-74] Cross-Lingual Summarization as a Black-Box Watermark Removal Attack

【速读】: 该论文旨在解决当前文本水印技术在跨语言场景下易被攻击而失效的问题,即如何有效抵御对生成式 AI (Generative AI) 文本水印的去除攻击。现有水印方案通常依赖于对词元(token)分布的微小扰动来嵌入标识信号,但此类方法容易受到单语 paraphrase 攻击的影响,且效果有限或损害文本质量。本文提出的关键解决方案是跨语言摘要攻击(Cross-Lingual Summarization Attack, CLSA),其核心机制为:将原文翻译至一个中间语言(pivot language),进行摘要压缩,并可选地回译至原语言。该策略通过强制跨语言语义瓶颈,系统性地消除词元级别的统计偏差,同时保持语义完整性。实验表明,CLSA 在多种水印方案(KGW、SIR、XSIR、Unigram)和五种语言(阿姆哈拉语、中文、印地语、西班牙语、斯瓦希里语)中均显著降低检测准确率,甚至使 XSIR(专为跨语言鲁棒性设计)的 AUROC 从 0.827 降至 0.53(接近随机水平),且不引入明显视觉伪影,揭示了当前分布式水印方法在实际应用中的根本性脆弱性。

链接: https://arxiv.org/abs/2510.24789
作者: Gokul Ganesan
机构: 未知
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Watermarking has been proposed as a lightweight mechanism to identify AI-generated text, with schemes typically relying on perturbations to token distributions. While prior work shows that paraphrasing can weaken such signals, these attacks remain partially detectable or degrade text quality. We demonstrate that cross-lingual summarization attacks (CLSA) – translation to a pivot language followed by summarization and optional back-translation – constitute a qualitatively stronger attack vector. By forcing a semantic bottleneck across languages, CLSA systematically destroys token-level statistical biases while preserving semantic fidelity. In experiments across multiple watermarking schemes (KGW, SIR, XSIR, Unigram) and five languages (Amharic, Chinese, Hindi, Spanish, Swahili), we show that CLSA reduces watermark detection accuracy more effectively than monolingual paraphrase at similar quality levels. Our results highlight an underexplored vulnerability that challenges the practicality of watermarking for provenance or regulation. We argue that robust provenance solutions must move beyond distributional watermarking and incorporate cryptographic or model-attestation approaches. On 300 held-out samples per language, CLSA consistently drives detection toward chance while preserving task utility. Concretely, for XSIR (explicitly designed for cross-lingual robustness), AUROC with paraphrasing is 0.827 , with Cross-Lingual Watermark Removal Attacks (CWRA) [He et al., 2024] using Chinese as the pivot, it is 0.823 , whereas CLSA drives it down to 0.53 (near chance). Results highlight a practical, low-cost removal pathway that crosses languages and compresses content without visible artifacts.
zh
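
【代码示意】CLSA 的"翻译到枢轴语言 -> 摘要 -> 可选回译"三步可以写成如下骨架(假设性:translate 与 summarize 为调用方注入的占位接口,可替换为任意机器翻译与摘要模型):

```python
def clsa_attack(text: str, pivot_lang: str, translate, summarize,
                back_translate: bool = True) -> str:
    """跨语言摘要攻击骨架:语义经由枢轴语言的摘要瓶颈,
    词元级水印统计偏差被系统性破坏,而语义大体保留。
    translate(text, src, tgt) 与 summarize(text) 为假设的模型接口。"""
    pivoted = translate(text, src="en", tgt=pivot_lang)      # 1. 译入枢轴语言
    summary = summarize(pivoted)                             # 2. 枢轴语言中摘要压缩
    if back_translate:
        return translate(summary, src=pivot_lang, tgt="en")  # 3. 回译
    return summary

# 占位实现,仅演示调用方式
fake_translate = lambda t, src, tgt: f"[{src}->{tgt}] {t}"
fake_summarize = lambda t: t[: max(1, len(t) // 2)]
print(clsa_attack("a watermarked paragraph ...", "zh", fake_translate, fake_summarize))
```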

[NLP-75] PANORAMA: A Dataset and Benchmarks Capturing Decision Trails and Rationales in Patent Examination

【速读】: 该论文旨在解决专利审查(Patent Examination)在自然语言处理(Natural Language Processing, NLP)领域中长期存在的挑战,即如何有效利用大语言模型(Large Language Models, LLMs)模拟专业审查员在判断专利权利要求是否满足新颖性(novelty)和非显而易见性(non-obviousness)时所需的多步骤、精细化推理过程。传统NLP方法将专利审查简化为预测任务(如预测授权结果),依赖高阶代理指标(如相似度或历史标签训练的分类器),忽视了审查过程中关键的决策链条与理由依据(例如办公通知文件中的论证逻辑)。为此,论文提出构建PANORAMA数据集——包含8,143条美国专利审查记录的完整决策链,涵盖原始申请、引用文献、非最终驳回意见及允许通知书,并将其分解为序列化基准(sequential benchmarks),以精准映射专利专业人士的审查流程。该方案的核心在于通过结构化还原审查全过程,使研究人员能够系统评估LLMs在每个审查步骤上的能力边界,从而推动NLP技术更贴近真实专利审查场景的需求。

链接: https://arxiv.org/abs/2510.24774
作者: Hyunseung Lim,Sooyohn Nam,Sungmin Na,Ji Yong Cho,June Yong Yang,Hyungyu Shin,Yoonjoo Lee,Juho Kim,Moontae Lee,Hwajung Hong
机构: KAIST(韩国科学技术院); LG AI Research(LG人工智能研究); University of Illinois Chicago(伊利诺伊大学芝加哥分校)
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Patent examination remains an ongoing challenge in the NLP literature even after the advent of large language models (LLMs), as it requires an extensive yet nuanced human judgment on whether a submitted claim meets the statutory standards of novelty and non-obviousness against previously granted claims – prior art – in expert domains. Previous NLP studies have approached this challenge as a prediction task (e.g., forecasting grant outcomes) with high-level proxies such as similarity metrics or classifiers trained on historical labels. However, this approach often overlooks the step-by-step evaluations that examiners must make with profound information, including rationales for the decisions provided in office actions documents, which also makes it harder to measure the current state of techniques in patent review processes. To fill this gap, we construct PANORAMA, a dataset of 8,143 U.S. patent examination records that preserves the full decision trails, including original applications, all cited references, Non-Final Rejections, and Notices of Allowance. Also, PANORAMA decomposes the trails into sequential benchmarks that emulate patent professionals’ patent review processes and allow researchers to examine large language models’ capabilities at each step of them. Our findings indicate that, although LLMs are relatively effective at retrieving relevant prior art and pinpointing the pertinent paragraphs, they struggle to assess the novelty and non-obviousness of patent claims. We discuss these results and argue that advancing NLP, including LLMs, in the patent domain requires a deeper understanding of real-world patent examination. Our dataset is openly available at this https URL.
zh

[NLP-76] Confidence is Not Competence

【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)中普遍存在的“自信与实际能力脱节”问题,即模型在推理过程中表现出高置信度但解题能力不足的现象。其解决方案的关键在于揭示了模型内部表征的几何结构差异:在预生成评估阶段,模型形成一种可线性解码的“可解性信念”(solvability belief),该信念轴具有跨模型家族和任务类型的泛化性;然而,这一评估空间具有高线性有效维度,而后续的推理轨迹则运行在低维流形上。这种从高维复杂评估到低维简洁执行的几何压缩机制,解释了信心与能力之间的差距。研究进一步表明,沿信念轴的因果干预无法改变最终结果,说明仅调整评估空间不足以控制执行过程,从而提出一个两系统架构——几何复杂的评估器(assessor)驱动几何简单的执行器(executor),并主张应聚焦于干预执行过程的程序动态而非高维评估几何。

链接: https://arxiv.org/abs/2510.24772
作者: Debdeep Sanyal,Manya Pandey,Dhruv Kumar,Saurabh Deshpande,Murari Mandal
机构: Birla AI Labs (Birla人工智能实验室); RespAI Lab (Resp人工智能实验室); KIIT Bhubaneshwar (KIIT布巴内斯瓦尔大学); BITS Pilani (比尔拉理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 Pages, 6 Figures, 8 Tables

点击查看摘要

Abstract:Large language models (LLMs) often exhibit a puzzling disconnect between their asserted confidence and actual problem-solving competence. We offer a mechanistic account of this decoupling by analyzing the geometry of internal states across two phases - pre-generative assessment and solution execution. A simple linear probe decodes the internal “solvability belief” of a model, revealing a well-ordered belief axis that generalizes across model families and across math, code, planning, and logic tasks. Yet, the geometries diverge - although belief is linearly decodable, the assessment manifold has high linear effective dimensionality as measured from the principal components, while the subsequent reasoning trace evolves on a much lower-dimensional manifold. This sharp reduction in geometric complexity from thought to action mechanistically explains the confidence-competence gap. Causal interventions that steer representations along the belief axis leave final solutions unchanged, indicating that linear nudges in the complex assessment space do not control the constrained dynamics of execution. We thus uncover a two-system architecture - a geometrically complex assessor feeding a geometrically simple executor. These results challenge the assumption that decodable beliefs are actionable levers, instead arguing for interventions that target the procedural dynamics of execution rather than the high-level geometry of assessment.
zh
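
【代码示意】论文用主成分特征值的参与率(participation ratio)衡量线性有效维度。下面用 numpy 示意该度量(假设性模拟数据):高维的"评估态"表征与低秩的"执行轨迹"表征会给出明显不同的有效维度。

```python
import numpy as np

def effective_dimensionality(H: np.ndarray) -> float:
    """H: (样本数, 隐藏维度) 的内部表征。
    参与率 PR = (sum λ_i)^2 / sum λ_i^2,λ_i 为协方差矩阵特征值;
    PR 越大表示方差越均匀地分布在更多主成分方向上。"""
    Hc = H - H.mean(axis=0)
    lam = np.linalg.eigvalsh(np.cov(Hc, rowvar=False))
    lam = np.clip(lam, 0, None)
    return lam.sum() ** 2 / (lam**2).sum()

rng = np.random.default_rng(0)
assess = rng.normal(size=(500, 64))                          # 模拟高维的评估态表征
low = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 64))   # 模拟低维的执行轨迹
print("评估阶段有效维度:", round(effective_dimensionality(assess), 1))
print("执行阶段有效维度:", round(effective_dimensionality(low), 1))
```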

[NLP-77] opic-aware Large Language Models for Summarizing the Lived Healthcare Experiences Described in Health Stories

【速读】: 该论文旨在解决如何从非结构化叙事数据中高效识别影响非洲裔美国人(African American, AA)健康结果差异的关键因素及其干预路径的问题。其解决方案的关键在于结合潜在狄利克雷分布(Latent Dirichlet Allocation, LDA)与生成式人工智能(Generative AI)技术,对AA个体的50个转录故事进行主题感知的分层摘要处理:首先利用LDA提取出26个核心主题,再通过开源大语言模型(Large Language Model, LLM)对每个主题下的故事进行层级摘要,最终由GPT-4模型评估摘要的质量并验证其可靠性。该方法在保证准确性、完整性与实用性的同时,有效挖掘了与健康行为、医患互动、照护及症状管理等相关主题,为健康研究和临床实践提供了可操作的洞见。

链接: https://arxiv.org/abs/2510.24765
作者: Maneesh Bilalpur,Megan Hamm,Young Ji Lee,Natasha Norman,Kathleen M. McTigue,Yanshan Wang
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Storytelling is a powerful form of communication and may provide insights into factors contributing to gaps in healthcare outcomes. To determine whether Large Language Models (LLMs) can identify potential underlying factors and avenues for intervention, we performed topic-aware hierarchical summarization of narratives from African American (AA) storytellers. Fifty transcribed stories of AA experiences were used to identify topics in their experience using the Latent Dirichlet Allocation (LDA) technique. Stories about a given topic were summarized using an open-source LLM-based hierarchical summarization approach. Topic summaries were generated by summarizing across story summaries for each story that addressed a given topic. Generated topic summaries were rated for fabrication, accuracy, comprehensiveness, and usefulness by the GPT4 model, and the model’s reliability was validated against the original story summaries by two domain experts. 26 topics were identified in the fifty AA stories. The GPT4 ratings suggest that topic summaries were free from fabrication, highly accurate, comprehensive, and useful. The reliability of GPT ratings compared to expert assessments showed moderate to high agreement. Our approach identified AA experience-relevant topics such as health behaviors, interactions with medical team members, caregiving and symptom management, among others. Such insights could help researchers identify potential factors and interventions by learning from unstructured narratives in an efficient manner-leveraging the communicative power of storytelling. The use of LDA and LLMs to identify and summarize the experience of AA individuals suggests a variety of possible avenues for health research and possible clinical improvements to support patients and caregivers, thereby ultimately improving health outcomes.
zh

[NLP-78] Falcon: A Comprehensive Chinese Text-to-SQL Benchmark for Enterprise-Grade Evaluation

【速读】: 该论文旨在解决中文自然语言到结构化查询语言(SQL)转换任务中,现有模型在复杂企业级数据库场景下性能严重受限的问题。具体而言,当前主流大模型(如Deepseek)在跨域中文文本转SQL任务上的准确率最高仅为50%,其核心瓶颈在于两个方面:一是企业在大规模、非规范化(denormalized)的Schema环境下,存在数百张表、模糊列名、隐式外键关系和领域特定同义词等导致正确的表连接与字段选择困难;二是中文口语化表达难以精确映射到SQL所需的运算符、聚合逻辑、时间粒度、空值处理及嵌套子查询等语法细节。解决方案的关键在于构建一个面向中文语义和企业SQL方言(如MaxCompute/Hive)的基准测试集Falcon,包含600个真实业务场景下的中文问题及其标注的SQL计算特征与语义信息,并配套执行对比器(execution comparator)和自动化评估流水线,从而为模型提供可复现的端到端验证机制,推动模型在生产前实现更可靠的部署能力。

链接: https://arxiv.org/abs/2510.24762
作者: Wenzhen Luo,Wei Guan,Yifan Yao,Yimin Pan,Feng Wang,Zhipeng Yu,Zhe Wen,Liang Chen,Yihong Zhuang
机构: Ant Group (蚂蚁集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce Falcon, a cross-domain Chinese text-to-SQL benchmark grounded in an enterprise-compatible dialect (MaxCompute/Hive). It contains 600 Chinese questions over 28 databases; 77% require multi-table reasoning and over half touch more than four tables. Each example is annotated along SQL-computation features and Chinese semantics. For evaluation, we release a robust execution comparator and an automated evaluation pipeline, under which all current state-of-the-art large-scale models (including Deepseek) achieve accuracies of at most 50%. Major errors originate from two sources: (1) schema linking in large enterprise landscapes - hundreds of tables, denormalized fields, ambiguous column names, implicit foreign-key relations and domain-specific synonyms that make correct join/column selection difficult; and (2) mapping concise, colloquial Chinese into the exact operators and predicates required for analytics - e.g., choosing the correct aggregation and group-by keys, expressing time windows and granularities, applying unit conversions, handling NULLs and data-quality rules, and formulating nested or windowed subqueries. Falcon therefore targets Chinese-specific semantics and enterprise dialects (abbreviations, business jargon, fuzzy entity references) and provides a reproducible middle ground before full production deployment by using realistic enterprise schemas, query templates, an execution comparator, and an automated evaluation pipeline for end-to-end validation.
zh

[NLP-79] Dingtalk DeepResearch: A Unified Multi Agent Framework for Adaptive Intelligence in Enterprise Environments

【速读】: 该论文旨在解决企业环境中复杂信息处理与多模态知识整合的难题,特别是针对深度研究、异构表格推理以及多模态报告生成等实际需求。其解决方案的关键在于提出了一种统一的多智能体(multi-agent)智能框架——Dingtalk DeepResearch,通过协同多个专业化智能体实现端到端的高效任务分解与执行,从而在真实企业场景中提升决策支持能力与自动化水平。

链接: https://arxiv.org/abs/2510.24760
作者: Mengyuan Chen,Chengjun Dai,Xinyang Dong,Chengzhe Feng,Kewei Fu,Jianshe Li,Zhihan Peng,Yongqi Tong,Junshao Zhang,Hong Zhu
机构: Industrial Brain Team, Dingtalk, Alibaba Group(阿里巴巴集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present Dingtalk DeepResearch, a unified multi-agent intelligence framework for real-world enterprise environments, delivering deep research, heterogeneous table reasoning, and multimodal report generation.
zh

[NLP-80] Beyond Models: A Framework for Contextual and Cultural Intelligence in African AI Deployment

【速读】: 该论文旨在解决当前全球人工智能(AI)发展在非洲市场难以实现有意义部署的问题,其核心挑战在于现有模型过度关注性能与计算规模,而忽视了本地文化语境、情感智能及经济包容性需求。解决方案的关键在于提出并验证了“情境与文化智能”(Contextual and Cultural Intelligence, CCI)框架,该框架通过三个技术支柱实现:基础设施智能(以移动优先、抗干扰架构为基础)、文化智能(具备多语言自然语言处理与社会情境感知能力)和商业智能(基于信任的对话式电商)。实证研究表明,该框架显著提升了用户参与度(如89%用户偏好WhatsApp交互)和对本土化语境的理解能力(如家庭导向交易模式识别与自然语言切换),为资源受限市场的公平AI落地提供了理论创新与可复现的技术路径。

链接: https://arxiv.org/abs/2510.24729
作者: Qness Ndlovu
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 25 pages, 4 tables. Production validation with 602 users across Zimbabwe-South Africa diaspora corridor

点击查看摘要

Abstract:While global AI development prioritizes model performance and computational scale, meaningful deployment in African markets requires fundamentally different architectural decisions. This paper introduces Contextual and Cultural Intelligence (CCI) – a systematic framework enabling AI systems to process cultural meaning, not just data patterns, through locally relevant, emotionally intelligent, and economically inclusive design. Using design science methodology, we validate CCI through a production AI-native cross-border shopping platform serving diaspora communities. Key empirical findings: 89% of users prefer WhatsApp-based AI interaction over traditional web interfaces (n=602, chi-square=365.8, p<0.001), achieving 536 WhatsApp users and 3,938 total conversations across 602 unique users in just 6 weeks, and culturally informed prompt engineering demonstrates sophisticated understanding of culturally contextualized queries, with 89% family-focused commerce patterns and natural code-switching acceptance. The CCI framework operationalizes three technical pillars: Infrastructure Intelligence (mobile-first, resilient architectures), Cultural Intelligence (multilingual NLP with social context awareness), and Commercial Intelligence (trust-based conversational commerce). This work contributes both theoretical innovation and reproducible implementation patterns, challenging Silicon Valley design orthodoxies while providing actionable frameworks for equitable AI deployment across resource-constrained markets.
zh

[NLP-81] AmarDoctor: An AI-Driven Multilingual Voice-Interactive Digital Health Application for Primary Care Triage and Patient Management to Bridge the Digital Health Divide for Bengali Speakers

【速读】: 该论文旨在解决 Bengali 语人群在数字健康服务中长期存在的可及性不足问题,尤其是在初级诊疗和临床决策支持方面的缺失。其核心挑战在于现有主流健康类 AI 平台(如 AdaHealth、WebMD 等)主要面向欧洲语言与人群,缺乏对非英语及多语种群体的支持。解决方案的关键在于开发 AmarDoctor——一款双界面多语言语音交互式数字健康应用,通过集成自适应问诊算法与 AI 驱动的临床决策支持系统,实现对 Bengali 患者的精准分诊与个性化健康管理;其中,患者端利用语音交互降低数字鸿沟影响,医者端则生成结构化初步诊断与治疗建议以提升工作效率,最终在临床验证中展现出显著优于人类医生的诊断与专科推荐精度(top-1 诊断准确率 81.08% vs. 医生平均 50.27%,专科推荐准确率 91.35% vs. 医生平均 62.6%)。

链接: https://arxiv.org/abs/2510.24724
作者: Nazmun Nahar,Ritesh Harshad Ruparel,Shariar Kabir,Sumaiya Tasnia Khan,Shyamasree Saha,Mamunur Rashid
机构: MedAi Bangladesh Limited(MedAi孟加拉国有限公司); University of Birmingham (伯明翰大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This study presents AmarDoctor, a multilingual voice-interactive digital health app designed to provide comprehensive patient triage and AI-driven clinical decision support for Bengali speakers, a population largely underserved in access to digital healthcare. AmarDoctor adopts a data-driven approach to strengthen primary care delivery and enable personalized health management. While platforms such as AdaHealth, WebMD, Symptomate, and K-Health have become popular in recent years, they mainly serve European demographics and languages. AmarDoctor addresses this gap with a dual-interface system for both patients and healthcare providers, supporting three major Bengali dialects. At its core, the patient module uses an adaptive questioning algorithm to assess symptoms and guide users toward the appropriate specialist. To overcome digital literacy barriers, it integrates a voice-interactive AI assistant that navigates users through the app's services. Complementing this, the clinician-facing interface incorporates AI-powered decision support that enhances workflow efficiency by generating structured provisional diagnoses and treatment recommendations. These outputs inform key services such as e-prescriptions, video consultations, and medical record management. To validate clinical accuracy, the system was evaluated against a gold-standard set of 185 clinical vignettes developed by experienced physicians. Effectiveness was further assessed by comparing AmarDoctor's performance with that of five independent physicians using the same vignette set. Results showed AmarDoctor achieved a top-1 diagnostic precision of 81.08 percent (versus physicians' average of 50.27 percent) and a top specialty recommendation precision of 91.35 percent (versus physicians' average of 62.6 percent).

[NLP-82] The Epistemic Suite: A Post-Foundational Diagnostic Methodology for Assessing AI Knowledge Claims

【Quick Read】: This paper addresses the epistemic risk posed by text generated by large language models (LLMs): users easily mistake the surface coherence of model output for genuine understanding. To meet this challenge, the author proposes the Epistemic Suite, whose core idea is to surface, through twenty diagnostic lenses, the epistemic conditions under which AI outputs are produced and received, rather than simply judging content true or false. The key innovation is epistemic suspension, a practitioner-enacted circuit breaker that halts generation when warrant is exceeded and resumes based on human judgment rather than fixed rules. The methodology also produces inspectable artifacts, including flags, annotations, contradiction maps, and suspension logs (the FACS bundle), which form an intermediary layer between AI output and human judgment, complemented by an Epistemic Triage Protocol and a Meta-Governance Layer that link activation to accountability, historical context, and pluralism safeguards. Without sacrificing the model's expendability, the approach preserves the distinction between performance and understanding and supports responsible development of generative AI.

Link: https://arxiv.org/abs/2510.24721
Authors: Matthew Kelly
Institutions: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 65 pages

Abstract:Large Language Models (LLMs) generate fluent, plausible text that can mislead users into mistaking simulated coherence for genuine understanding. This paper introduces the Epistemic Suite, a post-foundational diagnostic methodology for surfacing the epistemic conditions under which AI outputs are produced and received. Rather than determining truth or falsity, the Suite operates through twenty diagnostic lenses, applied by practitioners as context warrants, to reveal patterns such as confidence laundering, narrative compression, displaced authority, and temporal drift. It is grounded in three design principles: diagnosing production before evaluating claims, preferring diagnostic traction over foundational settlement, and embedding reflexivity as a structural requirement rather than an ethical ornament. When enacted, the Suite shifts language models into a diagnostic stance, producing inspectable artifacts, namely flags, annotations, contradiction maps, and suspension logs (the FACS bundle), that create an intermediary layer between AI output and human judgment. A key innovation is epistemic suspension, a practitioner-enacted circuit breaker that halts continuation when warrant is exceeded, with resumption based on judgment rather than rule. The methodology also includes an Epistemic Triage Protocol and a Meta-Governance Layer to manage proportionality and link activation to relational accountability, consent, historical context, and pluralism safeguards. Unlike internalist approaches that embed alignment into model architectures (e.g., RLHF or epistemic-integrity proposals), the Suite operates externally as scaffolding, preserving expendability and refusal as safeguards rather than failures. It preserves the distinction between performance and understanding, enabling accountable deliberation while maintaining epistemic modesty.

[NLP-83] Iti-Validator: A Guardrail Framework for Validating and Correcting LLM-Generated Itineraries

【Quick Read】: This paper targets the spatiotemporal consistency of travel itineraries generated by large language models (LLMs), especially temporal inconsistencies such as overlapping journeys or unrealistic transit times. The key to the solution is a validation framework that checks LLM-generated itineraries against real-world flight-duration constraints using the AeroDataBox API, and on top of this framework systematically identifies and corrects temporal-logic errors, so that generated travel plans are temporally sound and executable before being delivered to the user.

Link: https://arxiv.org/abs/2510.24719
Authors: Shravan Gadbail, Masumi Desai, Kamalakar Karlapalem
Institutions: International Institute of Information Technology
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:The rapid advancement of Large Language Models (LLMs) has enabled them to generate complex, multi-step plans and itineraries. However, these generated plans often lack temporal and spatial consistency, particularly in scenarios involving physical travel constraints. This research aims to study the temporal performance of different LLMs and presents a validation framework that evaluates and improves the temporal consistency of LLM-generated travel itineraries. The system employs multiple state-of-the-art LLMs to generate travel plans and validates them against real-world flight duration constraints using the AeroDataBox API. This work contributes to the understanding of LLM capabilities in handling complex temporal reasoning tasks like itinerary generation and provides a framework to rectify any temporal inconsistencies like overlapping journeys or unrealistic transit times in the itineraries generated by LLMs before the itinerary is given to the user. Our experiments reveal that while current LLMs frequently produce temporally inconsistent itineraries, these can be systematically and reliably corrected using our framework, enabling their practical deployment in large-scale travel planning.
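
To make the kind of temporal check described above concrete, here is a minimal sketch of a consistency validator. The `lookup_flight_duration` stub stands in for the AeroDataBox lookup; the real API, and the paper's exact correction rules, are not reproduced here.

```python
# A minimal sketch of temporal-consistency checks on an itinerary: flag legs
# that are faster than a real flight and legs that overlap the previous one.
from datetime import datetime, timedelta

def lookup_flight_duration(origin: str, dest: str) -> timedelta:
    """Hypothetical stub; the paper queries the AeroDataBox API instead."""
    known = {("JFK", "LHR"): timedelta(hours=7)}
    return known.get((origin, dest), timedelta(hours=2))

def validate_itinerary(legs):
    """Each leg: (origin, dest, depart: datetime, arrive: datetime)."""
    issues = []
    for i, (origin, dest, dep, arr) in enumerate(legs):
        if arr - dep < lookup_flight_duration(origin, dest):
            issues.append(f"leg {i}: {origin}->{dest} is faster than a real flight")
        if i > 0 and dep < legs[i - 1][3]:
            issues.append(f"leg {i}: departs before leg {i-1} arrives (overlap)")
    return issues

legs = [
    ("JFK", "LHR", datetime(2025, 6, 1, 18, 0), datetime(2025, 6, 1, 23, 0)),
    ("LHR", "CDG", datetime(2025, 6, 1, 22, 30), datetime(2025, 6, 2, 2, 0)),
]
print(validate_itinerary(legs))  # flags both the too-fast leg and the overlap
```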

[NLP-84] Utilizing Modern Large Language Models (LLM) for Financial Trend Analysis and Digest Creation

【Quick Read】: This paper tackles the difficulty researchers face in efficiently extracting key financial-domain knowledge amid an explosion of information, where traditional methods are inefficient and yield limited insight on massive unstructured data. The key to the solution is an automated framework built on large language models (LLMs): Google's Gemini Pro is combined with OpenAlex data extraction, strategic prompt engineering, and LLM-driven analysis to convert raw literature data into structured JSON, automatically generate financial digests covering key findings and emerging trends, and deliver them as PDF reports, substantially improving researchers' efficiency in staying on top of current developments.

Link: https://arxiv.org/abs/2510.01225
Authors: Andrei Lazarev, Dmitrii Sedov
Institutions: Unknown
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: This is the version of the article accepted for publication in SUMMA 2024 after peer review. The final, published version is available at IEEE Xplore: https://doi.org/10.1109/SUMMA64428.2024.10803746

Abstract:The exponential growth of information presents a significant challenge for researchers and professionals seeking to remain at the forefront of their fields. This paper introduces an innovative framework for automatically generating insightful financial digests using the power of Large Language Models (LLMs), specifically Google’s Gemini Pro. By leveraging a combination of data extraction from OpenAlex, strategic prompt engineering, and LLM-driven analysis, we demonstrate the automated creation of comprehensive digests that generalize key findings and identify emerging trends. This approach addresses the limitations of traditional analysis methods, enabling the efficient processing of vast amounts of unstructured data and the delivery of actionable insights in an easily digestible format. This paper describes in simple terms how LLMs work and how their power can help researchers and scholars save time and stay informed about current trends. Our study covers the step-by-step process, from data acquisition and JSON construction to interaction with Gemini and the automated generation of PDF reports, and includes a link to the project’s GitHub repository for broader accessibility and further development.
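
As a rough illustration of the pipeline described above, the sketch below pulls works from the public OpenAlex REST API, packs them into JSON, and asks Gemini for a digest via the `google-generativeai` package. The prompt wording and query parameters are illustrative assumptions, not the authors' exact setup.

```python
# A minimal sketch: OpenAlex works -> JSON -> Gemini digest.
import json
import requests
import google.generativeai as genai

resp = requests.get(
    "https://api.openalex.org/works",
    params={"search": "financial trends", "per-page": 25},
    timeout=30,
)
works = [
    {"title": w.get("title"), "year": w.get("publication_year")}
    for w in resp.json()["results"]
]

genai.configure(api_key="YOUR_API_KEY")  # placeholder credential
model = genai.GenerativeModel("gemini-pro")
prompt = (
    "Summarize key findings and emerging trends in these works, "
    "as a short financial digest:\n" + json.dumps(works, indent=2)
)
print(model.generate_content(prompt).text)
```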

[NLP-85] Lost in Phonation: Voice Quality Variation as an Evaluation Dimension for Speech Foundation Models LREC2026

【Quick Read】: This paper addresses the insufficient sensitivity of current speech foundation models (SFMs) to non-lexical speech features, in particular voice quality, and how such features shape model behaviour in inferring affect and social meaning. Existing evaluations mostly rely on multiple-choice question answering (MCQA), which struggles to capture subtle paralinguistic variation in speech. The key to the solution is a new parallel dataset with controlled, synthesized modifications to voice quality (such as creaky and breathy voice), combined with open-ended generation tasks and speech emotion recognition to systematically evaluate whether SFM behaviour remains consistent across different phonation inputs, providing the first examination of SFM sensitivity to these non-lexical dimensions of speech perception.

Link: https://arxiv.org/abs/2510.25577
Authors: Harm Lameris, Shree Harsha Bokkahalli Satish, Joakim Gustafson, Éva Székely
Institutions: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 8 pages, 3 figures, 4 tables, submitted to LREC 2026

Abstract:Recent advances in speech foundation models (SFMs) have enabled the direct processing of spoken language from raw audio, bypassing intermediate textual representations. This capability allows SFMs to be exposed to, and potentially respond to, rich paralinguistic variations embedded in the input speech signal. One under-explored dimension of paralinguistic variation is voice quality, encompassing phonation types such as creaky and breathy voice. These phonation types are known to influence how listeners infer affective state, stance and social meaning in speech. Existing benchmarks for speech understanding largely rely on multiple-choice question answering (MCQA) formats, which are prone to failure and therefore unreliable in capturing the nuanced ways paralinguistic features influence model behaviour. In this paper, we probe SFMs through open-ended generation tasks and speech emotion recognition, evaluating whether model behaviours are consistent across different phonation inputs. We introduce a new parallel dataset featuring synthesized modifications to voice quality, designed to evaluate SFM responses to creaky and breathy voice. Our work provides the first examination of SFM sensitivity to these particular non-lexical aspects of speech perception.

Computer Vision

[CV-0] VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning

【Quick Read】: This paper addresses two core challenges in generating visual effects (VFX) videos with generative AI: existing methods largely follow a one-LoRA-per-effect paradigm, which is resource-intensive and fundamentally unable to generalize to unseen effects, and there is no unified framework for efficiently modeling and transferring diverse dynamic effects. The key to the solution is VFXMaster, the first unified reference-based generation framework, whose core innovation is to recast effect generation as an in-context learning task: an in-context conditioning strategy and an in-context attention mask precisely decouple and inject the essential effect attributes, allowing a single model to imitate many effects without information leakage. An efficient one-shot adaptation mechanism further boosts generalization to difficult unseen effects from a single user-provided reference video.

Link: https://arxiv.org/abs/2510.25772
Authors: Baolu Li, Yiming Zhang, Qinghe Wang, Liqian Ma, Xiaoyu Shi, Xintao Wang, Pengfei Wan, Zhenfei Yin, Yunzhi Zhuge, Huchuan Lu, Xu Jia
Institutions: Dalian University of Technology; Kling Team, Kuaishou Technology; ZMO AI Inc.; Oxford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page URL: this https URL

Abstract:Visual effects (VFX) are crucial to the expressive power of digital media, yet their creation remains a major challenge for generative AI. Prevailing methods often rely on the one-LoRA-per-effect paradigm, which is resource-intensive and fundamentally incapable of generalizing to unseen effects, thus limiting scalability and creation. To address this challenge, we introduce VFXMaster, the first unified, reference-based framework for VFX video generation. It recasts effect generation as an in-context learning task, enabling it to reproduce diverse dynamic effects from a reference video onto target content. In addition, it demonstrates remarkable generalization to unseen effect categories. Specifically, we design an in-context conditioning strategy that prompts the model with a reference example. An in-context attention mask is designed to precisely decouple and inject the essential effect attributes, allowing a single unified model to master the effect imitation without information leakage. In addition, we propose an efficient one-shot effect adaptation mechanism to boost generalization capability on tough unseen effects from a single user-provided video rapidly. Extensive experiments demonstrate that our method effectively imitates various categories of effect information and exhibits outstanding generalization to out-of-domain effects. To foster future research, we will release our code, models, and a comprehensive dataset to the community.

[CV-1] FreeArt3D: Training-Free Articulated Object Generation using 3D Diffusion

【Quick Read】: This paper addresses high-quality generation of articulated 3D objects, where existing methods either rely on optimization-based reconstruction pipelines with dense-view supervision or use feed-forward generative models that produce coarse geometry and neglect surface texture. The key to the solution is FreeArt3D, a training-free framework for articulated 3D object generation that extends Score Distillation Sampling (SDS) to the 3D-to-4D domain, treating articulation as an additional generative dimension while reusing a pre-trained static 3D diffusion model (e.g., Trellis) as a strong shape prior. Given only a few images captured in different articulation states, the method jointly optimizes the object's geometry, texture, and articulation parameters, requires neither task-specific training nor large articulated datasets, completes in minutes, and achieves high fidelity with good generalization.

Link: https://arxiv.org/abs/2510.25765
Authors: Chuhao Chen, Isabella Liu, Xinyue Wei, Hao Su, Minghua Liu
Institutions: University of California San Diego; Hillbot Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:Articulated 3D objects are central to many applications in robotics, AR/VR, and animation. Recent approaches to modeling such objects either rely on optimization-based reconstruction pipelines that require dense-view supervision or on feed-forward generative models that produce coarse geometric approximations and often overlook surface texture. In contrast, open-world 3D generation of static objects has achieved remarkable success, especially with the advent of native 3D diffusion models such as Trellis. However, extending these methods to articulated objects by training native 3D diffusion models poses significant challenges. In this work, we present FreeArt3D, a training-free framework for articulated 3D object generation. Instead of training a new model on limited articulated data, FreeArt3D repurposes a pre-trained static 3D diffusion model (e.g., Trellis) as a powerful shape prior. It extends Score Distillation Sampling (SDS) into the 3D-to-4D domain by treating articulation as an additional generative dimension. Given a few images captured in different articulation states, FreeArt3D jointly optimizes the object’s geometry, texture, and articulation parameters without requiring task-specific training or access to large-scale articulated datasets. Our method generates high-fidelity geometry and textures, accurately predicts underlying kinematic structures, and generalizes well across diverse object categories. Despite following a per-instance optimization paradigm, FreeArt3D completes in minutes and significantly outperforms prior state-of-the-art approaches in both quality and versatility.

[CV-2] Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks

【Quick Read】: This paper addresses the lack of systematic surveys and open evaluation benchmarks for multimodal spatial reasoning in the era of large models. The key to the solution is a taxonomy of multimodal spatial reasoning tasks, spanning classical 2D tasks as well as frontier directions such as 3D spatial understanding, scene and layout parsing, visual question answering and grounding, and embodied AI (e.g., vision-language navigation and action models), together with open benchmarks for evaluation; the survey further covers post-training techniques, explainability, and architecture design, and also considers emerging modalities such as audio and egocentric video, providing a foundation and practical guidance for building more robust and interpretable multimodal spatial reasoning systems.

Link: https://arxiv.org/abs/2510.25760
Authors: Xu Zheng, Zihao Dongfang, Lutao Jiang, Boyuan Zheng, Yulong Guo, Zhenquan Zhang, Giuliano Albanese, Runyi Yang, Mengjiao Ma, Zixin Zhang, Chenfei Liao, Dingcheng Zhen, Yuanhuiyi Lyu, Yuqian Fu, Bin Ren, Linfeng Zhang, Danda Pani Paudel, Nicu Sebe, Luc Van Gool, Xuming Hu
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Humans possess spatial reasoning abilities that enable them to understand spaces through multimodal observations, such as vision and sound. Large multimodal reasoning models extend these abilities by learning to perceive and reason, showing promising performance across diverse spatial tasks. However, systematic reviews and publicly available benchmarks for these models remain limited. In this survey, we provide a comprehensive review of multimodal spatial reasoning tasks with large models, categorizing recent progress in multimodal large language models (MLLMs) and introducing open benchmarks for evaluation. We begin by outlining general spatial reasoning, focusing on post-training techniques, explainability, and architecture. Beyond classical 2D tasks, we examine spatial relationship reasoning, scene and layout understanding, as well as visual question answering and grounding in 3D space. We also review advances in embodied AI, including vision-language navigation and action models. Additionally, we consider emerging modalities such as audio and egocentric video, which contribute to novel spatial understanding through new sensors. We believe this survey establishes a solid foundation and offers insights into the growing field of multimodal spatial reasoning. Updated information about this survey, codes and implementation of the open benchmarks can be found at this https URL.

[CV-3] Hawk: Leveraging Spatial Context for Faster Autoregressive Text-to-Image Generation

【Quick Read】: This paper addresses the inefficiency of autoregressive (AR) image generation models at inference time, caused by token-by-token decoding. The core challenges are that the much larger sampling space of image data makes it hard to align the draft model's outputs with the target model's, and that the two-dimensional spatial structure of images goes unexploited, limiting the modeling of local dependencies. The key to the solution, Hawk, is to harness the 2D spatial structure of images to guide the draft model toward more accurate and efficient predictions, achieving a significant speedup while preserving image fidelity and diversity.

Link: https://arxiv.org/abs/2510.25739
Authors: Zhi-Kai Chen, Jun-Peng Jiang, Han-Jia Ye, De-Chuan Zhan
Institutions: School of Artificial Intelligence, Nanjing University, China; National Key Laboratory for Novel Software Technology, Nanjing University, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Autoregressive (AR) image generation models are capable of producing high-fidelity images but often suffer from slow inference due to their inherently sequential, token-by-token decoding process. Speculative decoding, which employs a lightweight draft model to approximate the output of a larger AR model, has shown promise in accelerating text generation without compromising quality. However, its application to image generation remains largely underexplored. The challenges stem from a significantly larger sampling space, which complicates the alignment between the draft and target model outputs, coupled with the inadequate use of the two-dimensional spatial structure inherent in images, thereby limiting the modeling of local dependencies. To overcome these challenges, we introduce Hawk, a new approach that harnesses the spatial structure of images to guide the speculative model toward more accurate and efficient predictions. Experimental results on multiple text-to-image benchmarks demonstrate a 1.71x speedup over standard AR models, while preserving both image fidelity and diversity.
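
For readers unfamiliar with speculative decoding, the sketch below shows a generic greedy-verification draft-and-verify loop over image tokens. This is vanilla speculative decoding; Hawk's specific contribution, using 2D spatial context to guide the draft model, is not reproduced here, and `draft` and `target` are assumed callable token-level models.

```python
# A generic speculative-decoding step: propose k tokens with a cheap draft
# model, then verify them with one parallel pass of the target model.
import torch

def speculative_step(draft, target, prefix, k=4):
    proposal = prefix.clone()
    for _ in range(k):  # cheap sequential proposals
        logits = draft(proposal)[:, -1]
        proposal = torch.cat([proposal, logits.argmax(-1, keepdim=True)], dim=1)
    # One target pass scores all proposed positions at once.
    tgt_logits = target(proposal)[:, prefix.shape[1] - 1 : -1]
    tgt_choice = tgt_logits.argmax(-1)
    drafted = proposal[:, prefix.shape[1]:]
    # Accept the longest prefix of drafted tokens the target agrees with.
    agree = (tgt_choice == drafted).cumprod(dim=1)
    n_accept = int(agree.sum(dim=1).min())
    accepted = drafted[:, :n_accept]
    if n_accept < drafted.shape[1]:  # fall back to the target's own token
        accepted = torch.cat([accepted, tgt_choice[:, n_accept : n_accept + 1]], dim=1)
    return torch.cat([prefix, accepted], dim=1)
```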

[CV-4] Feedback Alignment Meets Low-Rank Manifolds: A Structured Recipe for Local Learning

【Quick Read】: This paper addresses the high memory and compute overhead of backpropagation (BP) when training deep neural networks (DNNs), and the limitation that Direct Feedback Alignment (DFA), lacking structured feedback, scales poorly to deep architectures, especially convolutional networks. The key to the solution is a structured local learning framework that operates on low-rank manifolds: weight matrices are decomposed via Singular Value Decomposition (SVD), each layer is trained directly in its low-rank factored form, and a composite loss combines cross-entropy, a subspace-alignment term, and orthogonality regularization; feedback matrices are constructed to match the SVD structure, ensuring consistent alignment between forward and feedback pathways. The method substantially reduces the number of trainable parameters and matches BP-level accuracy without pruning or post-hoc compression.

Link: https://arxiv.org/abs/2510.25594
Authors: Arani Roy, Marco P. Apolinario, Shristi Das Biswas, Kaushik Roy
Institutions: Purdue University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Training deep neural networks (DNNs) with backpropagation (BP) achieves state-of-the-art accuracy but requires global error propagation and full parameterization, leading to substantial memory and computational overhead. Direct Feedback Alignment (DFA) enables local, parallelizable updates with lower memory requirements but is limited by unstructured feedback and poor scalability in deeper architectures, especially convolutional neural networks. To address these limitations, we propose a structured local learning framework that operates directly on low-rank manifolds defined by the Singular Value Decomposition (SVD) of weight matrices. Each layer is trained in its decomposed form, with updates applied to the SVD components using a composite loss that integrates cross-entropy, subspace alignment, and orthogonality regularization. Feedback matrices are constructed to match the SVD structure, ensuring consistent alignment between forward and feedback pathways. Our method reduces the number of trainable parameters relative to the original DFA model, without relying on pruning or post hoc compression. Experiments on CIFAR-10, CIFAR-100, and ImageNet show that our method achieves accuracy comparable to that of BP. Ablation studies confirm the importance of each loss term in the low-rank setting. These results establish local learning on low-rank manifolds as a principled and scalable alternative to full-rank gradient-based training.
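
A minimal sketch of the core idea, training a layer directly in its SVD-factored form with an orthogonality regularizer, follows. The rank, loss weights, and the omitted subspace-alignment and feedback-alignment machinery are placeholders rather than the paper's settings.

```python
# A layer parameterized as W = U diag(S) V^T, trained with an orthogonality
# penalty on U and V so the factors stay near the SVD manifold.
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    def __init__(self, d_in, d_out, rank):
        super().__init__()
        self.U = nn.Parameter(torch.randn(d_out, rank) / d_out**0.5)
        self.S = nn.Parameter(torch.ones(rank))
        self.V = nn.Parameter(torch.randn(d_in, rank) / d_in**0.5)

    def forward(self, x):  # apply W without ever forming the full matrix
        return (x @ self.V) * self.S @ self.U.T

    def orthogonality_penalty(self):
        I = torch.eye(self.U.shape[1])
        return ((self.U.T @ self.U - I) ** 2).sum() + ((self.V.T @ self.V - I) ** 2).sum()

layer = LowRankLinear(128, 64, rank=16)
x, y = torch.randn(8, 128), torch.randint(0, 64, (8,))
loss = nn.functional.cross_entropy(layer(x), y) + 1e-3 * layer.orthogonality_penalty()
loss.backward()  # in the paper, updates come from structured local feedback instead
```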

[CV-5] RegionE: Adaptive Region-Aware Generation for Efficient Image Editing

【Quick Read】: This paper addresses computational redundancy in instruction-based image editing (IIE). Existing IIE models apply a uniform generation process to the whole image, yet edits usually touch only local regions: in unedited regions the denoising trajectory is straight and admits one-step prediction, while edited regions follow curved trajectories that require iteration. The key to the solution, RegionE, has three parts: 1) adaptive region partitioning, which separates edited from unedited regions in the early denoising stages based on the difference between the estimated final result and the reference image; 2) region-aware generation, which replaces multi-step denoising with one-step prediction in unedited regions and introduces a Region-Instruction KV Cache for edited regions to speed up local iteration while retaining global information; and 3) an adaptive velocity decay cache that exploits the strong velocity similarity of adjacent timesteps to further accelerate local denoising. The method needs no extra training, yields substantial speedups, and preserves semantic and perceptual fidelity.

Link: https://arxiv.org/abs/2510.25590
Authors: Pengtao Chen, Xianfang Zeng, Maosen Zhao, Mingzhu Shen, Peng Ye, Bangyin Xiang, Zhibo Wang, Wei Cheng, Gang Yu, Tao Chen
Institutions: Fudan University; StepFun; Imperial College London
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 26 pages, 10 figures, 18 tables

Abstract:Recently, instruction-based image editing (IIE) has received widespread attention. In practice, IIE often modifies only specific regions of an image, while the remaining areas largely remain unchanged. Although these two types of regions differ significantly in generation difficulty and computational redundancy, existing IIE models do not account for this distinction, instead applying a uniform generation process across the entire image. This motivates us to propose RegionE, an adaptive, region-aware generation framework that accelerates IIE tasks without additional training. Specifically, the RegionE framework consists of three main components: 1) Adaptive Region Partition. We observed that the trajectory of unedited regions is straight, allowing for multi-step denoised predictions to be inferred in a single step. Therefore, in the early denoising stages, we partition the image into edited and unedited regions based on the difference between the final estimated result and the reference image. 2) Region-Aware Generation. After distinguishing the regions, we replace multi-step denoising with one-step prediction for unedited areas. For edited regions, the trajectory is curved, requiring local iterative denoising. To improve the efficiency and quality of local iterative generation, we propose the Region-Instruction KV Cache, which reduces computational cost while incorporating global information. 3) Adaptive Velocity Decay Cache. Observing that adjacent timesteps in edited regions exhibit strong velocity similarity, we further propose an adaptive velocity decay cache to accelerate the local denoising process. We applied RegionE to state-of-the-art IIE base models, including Step1X-Edit, FLUX.1 Kontext, and Qwen-Image-Edit. RegionE achieved acceleration factors of 2.57, 2.41, and 2.06. Evaluations by GPT-4o confirmed that semantic and perceptual fidelity were well preserved.
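
The region-partition step lends itself to a short sketch: compare a one-step estimate of the final image against the reference and threshold the pooled difference into edited and unedited patches. The threshold and patch size below are illustrative choices, not the paper's.

```python
# A minimal sketch of RegionE-style region partitioning.
import torch

def partition_regions(one_step_estimate, reference, thresh=0.1, patch=8):
    """Both inputs: (C, H, W) tensors in [0, 1]. Returns a boolean patch mask."""
    diff = (one_step_estimate - reference).abs().mean(dim=0)  # (H, W)
    # Pool per patch so whole latent tokens are routed together.
    pooled = torch.nn.functional.avg_pool2d(diff[None, None], patch).squeeze()
    edited = pooled > thresh  # True: iterate locally; False: one-step predict
    return edited

est, ref = torch.rand(3, 64, 64), torch.rand(3, 64, 64)
mask = partition_regions(est, ref)
print(f"{mask.float().mean():.0%} of patches routed to iterative denoising")
```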

[CV-6] Comparative Study of UNet-based Architectures for Liver Tumor Segmentation in Multi-Phase Contrast-Enhanced Computed Tomography

【Quick Read】: This paper addresses the segmentation of liver structures and tumors in multi-phase contrast-enhanced computed tomography (CECT) to support computer-aided diagnosis and treatment planning for liver disease. The key to the solution is to build on the UNet3+ architecture with a pretrained ResNet backbone and to add the Convolutional Block Attention Module (CBAM) to boost segmentation performance. Experiments show that ResNetUNet3+ with CBAM outperforms Transformer- and state-space (Mamba)-based alternatives on Dice score (0.755), IoU (0.662), and boundary precision (lowest HD95 of 77.911), demonstrating that a classical ResNet architecture combined with modern attention modules remains highly competitive and offers a reliable, efficient approach to liver tumor detection in clinical practice.

Link: https://arxiv.org/abs/2510.25522
Authors: Doan-Van-Anh Ly (1), Thi-Thu-Hien Pham (2 and 3), Thanh-Hai Le (1) ((1) The Saigon International University, (2) International University, (3) Vietnam National University HCMC)
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 27 pages, 8 figures

Abstract:Segmentation of liver structures in multi-phase contrast-enhanced computed tomography (CECT) plays a crucial role in computer-aided diagnosis and treatment planning for liver diseases, including tumor detection. In this study, we investigate the performance of UNet-based architectures for liver tumor segmentation, starting from the original UNet and extending to UNet3+ with various backbone networks. We evaluate ResNet, Transformer-based, and State-space (Mamba) backbones, all initialized with pretrained weights. Surprisingly, despite the advances in modern architecture, ResNet-based models consistently outperform Transformer- and Mamba-based alternatives across multiple evaluation metrics. To further improve segmentation quality, we introduce attention mechanisms into the backbone and observe that incorporating the Convolutional Block Attention Module (CBAM) yields the best performance. ResNetUNet3+ with the CBAM module not only produced the best overlap metrics, with a Dice score of 0.755 and IoU of 0.662, but also achieved the most precise boundary delineation, evidenced by the lowest HD95 distance of 77.911. The model’s superiority was further cemented by its leading overall accuracy of 0.925 and specificity of 0.926, showcasing its robust capability in accurately identifying both lesion and healthy tissue. To further enhance interpretability, Grad-CAM visualizations were employed to highlight the regions most influential to its predictions, providing insights into its decision-making process. These findings demonstrate that classical ResNet architectures, when combined with modern attention modules, remain highly competitive for medical image segmentation tasks, offering a promising direction for liver tumor detection in clinical practice.
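
Since CBAM is the decisive component here, a compact PyTorch version of the module (following Woo et al., 2018) is sketched below: channel attention from pooled descriptors, followed by spatial attention from channel-wise statistics. Reduction ratio and kernel size are the usual defaults, not necessarily this paper's.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))          # channel attention branch
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))   # spatial attention branch

feat = torch.randn(2, 64, 32, 32)
print(CBAM(64)(feat).shape)  # torch.Size([2, 64, 32, 32])
```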

[CV-7] FaCT: Faithful Concept Traces for Explaining Neural Network Decisions NEURIPS2025

【Quick Read】: This paper addresses the lack of faithfulness in existing post-hoc concept-based explanation methods for deep neural networks (DNNs), which also impose overly restrictive assumptions on the concepts a model learns, such as class-specificity, small spatial extent, or alignment with human expectations. The key to the solution is a new model with model-inherent mechanistic concept explanations: concepts are shared across classes, and from any layer, each concept's contribution to the final logit and its input visualization can be faithfully traced. The authors also leverage foundation models to propose a new concept-consistency metric, the C^2-Score, for evaluating concept-based methods. Experiments show the concepts are quantitatively more consistent and users find them more interpretable, while the model retains competitive ImageNet performance.

Link: https://arxiv.org/abs/2510.25512
Authors: Amin Parchami-Araghi, Sukrut Rao, Jonas Fischer, Bernt Schiele
Institutions: Max Planck Institute for Informatics
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to NeurIPS 2025; Code is available at this https URL

Abstract:Deep networks have shown remarkable performance across a wide range of tasks, yet getting a global concept-level understanding of how they function remains a key challenge. Many post-hoc concept-based approaches have been introduced to understand their workings, yet they are not always faithful to the model. Further, they make restrictive assumptions on the concepts a model learns, such as class-specificity, small spatial extent, or alignment to human expectations. In this work, we put emphasis on the faithfulness of such concept-based explanations and propose a new model with model-inherent mechanistic concept-explanations. Our concepts are shared across classes and, from any layer, their contribution to the logit and their input-visualization can be faithfully traced. We also leverage foundation models to propose a new concept-consistency metric, C^2-Score, that can be used to evaluate concept-based methods. We show that, compared to prior work, our concepts are quantitatively more consistent and users find our concepts to be more interpretable, all while retaining competitive ImageNet performance.

[CV-8] SPADE: Sparsity Adaptive Depth Estimator for Zero-Shot Real-Time Monocular Depth Estimation in Underwater Environments

【Quick Read】: This paper addresses the perceptual and operational limits of human divers and remotely operated vehicles (ROVs) in the frequent inspection and maintenance that underwater infrastructure requires under harsh marine conditions, particularly around complex structures or in turbid water. The key to the solution is SPADE (SParsity Adaptive Depth Estimator), a monocular depth estimation pipeline that combines a pre-trained relative depth estimator with sparse depth priors to produce dense, metric-scale depth maps. A two-stage strategy first scales the relative depth map with the sparse depth points and then refines the final metric prediction with the proposed Cascade Conv-Deformable Transformer blocks, improving accuracy and generalization over state-of-the-art baselines while running at over 15 FPS on embedded hardware, thereby offering practical support for underwater inspection and intervention.

Link: https://arxiv.org/abs/2510.25463
Authors: Hongjie Zhang, Gideon Billings, Stefan B. Williams
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Underwater infrastructure requires frequent inspection and maintenance due to harsh marine conditions. Current reliance on human divers or remotely operated vehicles is limited by perceptual and operational challenges, especially around complex structures or in turbid water. Enhancing the spatial awareness of underwater vehicles is key to reducing piloting risks and enabling greater autonomy. To address these challenges, we present SPADE: SParsity Adaptive Depth Estimator, a monocular depth estimation pipeline that combines pre-trained relative depth estimator with sparse depth priors to produce dense, metric scale depth maps. Our two-stage approach first scales the relative depth map with the sparse depth points, then refines the final metric prediction with our proposed Cascade Conv-Deformable Transformer blocks. Our approach achieves improved accuracy and generalisation over state-of-the-art baselines and runs efficiently at over 15 FPS on embedded hardware, promising to support practical underwater inspection and intervention. This work has been submitted to IEEE Journal of Oceanic Engineering Special Issue of AUV 2026.

[CV-9] Instance-Level Composed Image Retrieval NEURIPS2025

【Quick Read】: This paper addresses the lack of high-quality training and evaluation data that has held back composed image retrieval (CIR), proposing a new evaluation dataset, i-CIR, and a training-free method, BASIC. i-CIR adopts an instance-level class definition, targeting images that contain the same particular object as the visual query under modifications defined by textual queries, and uses semi-automated hard-negative selection to keep the dataset compact yet challenging (comparable to retrieval among more than 40M random distractors). BASIC leverages pre-trained vision-and-language models (VLMs): it separately estimates query-image-to-image and query-text-to-image similarities and applies late fusion to up-weight images that satisfy both queries while down-weighting those that match only one; each individual similarity is further improved by simple, intuitive components. BASIC sets a new state of the art on i-CIR and also on existing CIR datasets that follow a semantic-level class definition.

Link: https://arxiv.org/abs/2510.25387
Authors: Bill Psomas, George Retsinas, Nikos Efthymiadis, Panagiotis Filntisis, Yannis Avrithis, Petros Maragos, Ondrej Chum, Giorgos Tolias
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2025

Abstract:The progress of composed image retrieval (CIR), a popular research direction in image retrieval, where a combined visual and textual query is used, is held back by the absence of high-quality training and evaluation data. We introduce a new evaluation dataset, i-CIR, which, unlike existing datasets, focuses on an instance-level class definition. The goal is to retrieve images that contain the same particular object as the visual query, presented under a variety of modifications defined by textual queries. Its design and curation process keep the dataset compact to facilitate future research, while maintaining its challenge (comparable to retrieval among more than 40M random distractors) through a semi-automated selection of hard negatives. To overcome the challenge of obtaining clean, diverse, and suitable training data, we leverage pre-trained vision-and-language models (VLMs) in a training-free approach called BASIC. The method separately estimates query-image-to-image and query-text-to-image similarities, performing late fusion to upweight images that satisfy both queries, while down-weighting those that exhibit high similarity with only one of the two. Each individual similarity is further improved by a set of components that are simple and intuitive. BASIC sets a new state of the art on i-CIR but also on existing CIR datasets that follow a semantic-level class definition. Project page: this https URL.
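
The late-fusion idea can be illustrated in a few lines: score each gallery image by both query-image and query-text similarity and combine them so that only images satisfying both queries score highly. The product rule below is one simple choice with the described up/down-weighting effect; the paper's actual fusion and refinement components are richer.

```python
import torch

def late_fusion(img_sim, txt_sim):
    """img_sim, txt_sim: (N,) cosine similarities over the gallery, in [-1, 1]."""
    i = (img_sim + 1) / 2  # map to [0, 1] so the product acts like a soft AND
    t = (txt_sim + 1) / 2
    return i * t  # high only when both the visual and textual query are satisfied

img_sim = torch.tensor([0.9, 0.9, 0.1])
txt_sim = torch.tensor([0.8, -0.2, 0.9])
print(late_fusion(img_sim, txt_sim))  # the first image wins; one-sided matches drop
```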

[CV-10] Prompt Estimation from Prototypes for Federated Prompt Tuning of Vision Transformers

【Quick Read】: This paper addresses two core challenges in visual prompt tuning (VPT) of Vision Transformers (ViTs) under federated learning (FL): globally tuned prompts generalize poorly across heterogeneous clients, while personalized tuning overfits local data and loses generality. The key to the solution, PEP-FedPT (Prompt Estimation from Prototypes for Federated Prompt Tuning), is the Class-Contextualized Mixed Prompt (CCMP): class-specific prompts are maintained alongside a globally shared prompt, and for each input they are adaptively combined using weights derived from global class prototypes and client class priors, yielding per-sample prompt personalization without storing client-specific trainable parameters. The prompts are jointly optimized via traditional federated averaging, so the method stays parameter-efficient while improving both generalization and personalization under diverse data heterogeneity.

Link: https://arxiv.org/abs/2510.25372
Authors: M Yashwanth, Sharannya Ghosh, Aditay Tripathi, Anirban Chakraborty
Institutions: Indian Institute of Science; Accenture; Google
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Visual Prompt Tuning (VPT) of pre-trained Vision Transformers (ViTs) has proven highly effective as a parameter-efficient fine-tuning technique for adapting large models to downstream tasks with limited data. Its parameter efficiency makes it particularly suitable for Federated Learning (FL), where both communication and computation budgets are often constrained. However, global prompt tuning struggles to generalize across heterogeneous clients, while personalized tuning overfits to local data and lacks generalization. We propose PEP-FedPT (Prompt Estimation from Prototypes for Federated Prompt Tuning), a unified framework designed to achieve both generalization and personalization in federated prompt tuning of ViTs. Within this framework, we introduce the novel Class-Contextualized Mixed Prompt (CCMP), based on class-specific prompts maintained alongside a globally shared prompt. For each input, CCMP adaptively combines class-specific prompts using weights derived from global class prototypes and client class priors. This approach enables per-sample prompt personalization without storing client-dependent trainable parameters. The prompts are collaboratively optimized via the traditional federated averaging technique. Comprehensive evaluations on CIFAR-100, TinyImageNet, DomainNet, and iNaturalist datasets demonstrate that PEP-FedPT consistently surpasses the state-of-the-art baselines under diverse data heterogeneity scenarios, establishing a strong foundation for efficient and generalizable federated prompt tuning of Vision Transformers.
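
A minimal sketch of CCMP-style prompt assembly follows: a shared prompt plus a per-sample mixture of class prompts, weighted by prototype similarity and the client's class prior. Shapes, the temperature, and the exact weighting formula are assumptions for illustration.

```python
import torch

def ccmp(feat, shared, class_prompts, prototypes, client_prior, tau=0.07):
    """feat: (D,); shared: (L, D); class_prompts: (C, L, D); prototypes: (C, D)."""
    sim = torch.nn.functional.cosine_similarity(feat[None], prototypes, dim=-1)
    w = torch.softmax(sim / tau, dim=0) * client_prior  # prototype x prior weights
    w = w / w.sum()
    mixed = (w[:, None, None] * class_prompts).sum(dim=0)
    return shared + mixed  # (L, D) prompt for this sample

C, L, D = 10, 4, 64
out = ccmp(torch.randn(D), torch.randn(L, D), torch.randn(C, L, D),
           torch.randn(C, D), torch.ones(C) / C)
print(out.shape)  # torch.Size([4, 64])
```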

[CV-11] 3D CT-Based Coronary Calcium Assessment: A Feature-Driven Machine Learning Framework MICCAI

【Quick Read】: This paper addresses the difficulty of training models for coronary artery calcium (CAC) scoring in early clinical screening when annotated data is scarce. The key to the solution is a radiomics-based pipeline that uses pseudo-labeling to generate training labels automatically, eliminating the need for expert-defined segmentations; pretrained foundation models (CT-FM and RadImageNet) are additionally used to extract image features, which are fed to traditional classifiers and compared against radiomics features. On a clinical CCTA dataset, radiomics-based models significantly outperform the CNN-derived deep embeddings (84% accuracy, p < 0.05) despite the absence of expert annotations.

Link: https://arxiv.org/abs/2510.25347
Authors: Ayman Abaid, Gianpiero Guidone, Sara Alsubai, Foziyah Alquahtani, Talha Iqbal, Ruth Sharif, Hesham Elzomor, Emiliano Bianchini, Naeif Almagal, Michael G. Madden, Faisal Sharif, Ihsan Ullah
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 11 pages, 2 figures, MICCAI AMAI 2025 workshop, to be published in Volume 16206 of the Lecture Notes in Computer Science series

Abstract:Coronary artery calcium (CAC) scoring plays a crucial role in the early detection and risk stratification of coronary artery disease (CAD). In this study, we focus on non-contrast coronary computed tomography angiography (CCTA) scans, which are commonly used for early calcification detection in clinical settings. To address the challenge of limited annotated data, we propose a radiomics-based pipeline that leverages pseudo-labeling to generate training labels, thereby eliminating the need for expert-defined segmentations. Additionally, we explore the use of pretrained foundation models, specifically CT-FM and RadImageNet, to extract image features, which are then used with traditional classifiers. We compare the performance of these deep learning features with that of radiomics features. Evaluation is conducted on a clinical CCTA dataset comprising 182 patients, where individuals are classified into two groups: zero versus non-zero calcium scores. We further investigate the impact of training on non-contrast datasets versus combined contrast and non-contrast datasets, with testing performed only on non-contrast scans. Results show that radiomics-based models significantly outperform CNN-derived embeddings from foundation models (achieving 84% accuracy and p < 0.05), despite the unavailability of expert annotations.
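
The pseudo-labeling idea can be sketched briefly: derive labels from an automatic rule rather than expert masks, then fit a classical classifier on radiomics-style features. `extract_radiomics` and the labeling rule below are hypothetical stand-ins for the paper's pipeline, which uses proper radiomics features and real CCTA scans.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def extract_radiomics(scan: np.ndarray) -> np.ndarray:
    """Hypothetical stub: first-order statistics as stand-in features."""
    return np.array([scan.mean(), scan.std(), scan.max(), (scan > 130).mean()])

rng = np.random.default_rng(0)
scans = [rng.normal(40, 30, size=(16, 16, 16)) + (i % 2) * 60 for i in range(60)]
X = np.stack([extract_radiomics(s) for s in scans])
# Pseudo-label: "non-zero calcium" if enough voxels exceed a HU-like threshold.
y = (X[:, 3] > 0.05).astype(int)
print(cross_val_score(RandomForestClassifier(), X, y, cv=5).mean())
```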

[CV-12] Informative Sample Selection Model for Skeleton-based Action Recognition with Limited Training Samples

【Quick Read】: This paper addresses the limited performance of 3D action recognition when labeled samples are scarce, in particular how, in the semi-supervised setting, to select the most informative unlabeled skeleton sequences for annotation so that recognition accuracy stays high while labeling cost drops. The key to the solution is to recast active learning from a novel perspective as a Markov Decision Process (MDP), training an informative sample selection model within a state-action decision framework; the factors in the state-action pairs are projected from Euclidean space into hyperbolic space to strengthen their representational capacity, and a meta tuning strategy accelerates deployment of the method in real-world scenarios.

Link: https://arxiv.org/abs/2510.25345
Authors: Zhigang Tu, Zhengbo Zhang, Jia Gong, Junsong Yuan, Bo Du
Institutions: Wuhan University; Singapore University of Technology and Design; Shanghai Academy of AI for Science; University at Buffalo, State University of New York
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE Transactions on Image Processing (TIP), 2025

Abstract:Skeleton-based human action recognition aims to classify human skeletal sequences, which are spatiotemporal representations of actions, into predefined categories. To reduce the reliance on costly annotations of skeletal sequences while maintaining competitive recognition accuracy, the task of 3D Action Recognition with Limited Training Samples, also known as semi-supervised 3D Action Recognition, has been proposed. In addition, active learning, which aims to proactively select the most informative unlabeled samples for annotation, has been explored in semi-supervised 3D Action Recognition for training sample selection. Specifically, researchers adopt an encoder-decoder framework to embed skeleton sequences into a latent space, where clustering information, combined with a margin-based selection strategy using a multi-head mechanism, is utilized to identify the most informative sequences in the unlabeled set for annotation. However, the most representative skeleton sequences may not necessarily be the most informative for the action recognizer, as the model may have already acquired similar knowledge from previously seen skeleton samples. To solve it, we reformulate semi-supervised 3D action recognition via active learning from a novel perspective by casting it as a Markov Decision Process (MDP). Built upon the MDP framework and its training paradigm, we train an informative sample selection model to intelligently guide the selection of skeleton sequences for annotation. To enhance the representational capacity of the factors in the state-action pairs within our method, we project them from Euclidean space to hyperbolic space. Furthermore, we introduce a meta tuning strategy to accelerate the deployment of our method in real-world scenarios. Extensive experiments on three 3D action recognition benchmarks demonstrate the effectiveness of our method.

[CV-13] StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA

【Quick Read】: This paper addresses two limitations of current VideoQA datasets for streaming video: static annotation mechanisms cannot capture how answers evolve over time, and the absence of explicit reasoning-process annotations restricts model interpretability and logical deduction. The key to the solution is the StreamingCoT dataset, which builds a dynamic hierarchical annotation architecture, generating per-second dense descriptions and fusing them by similarity into temporally dependent semantic segments paired with question-answer sets constrained by temporal evolution patterns, and designs an explicit multimodal chain-of-thought (CoT) generation paradigm: spatiotemporal objects are extracted via keyframe semantic alignment, reasoning paths based on object state transitions are derived with large language models, and logical coherence is ensured through human verification, establishing a new benchmark for streaming video understanding and complex temporal reasoning.

Link: https://arxiv.org/abs/2510.25332
Authors: Yuhang Hu, Zhenyu Yang, Shihan Wang, Shengsheng Qian, Bin Wen, Fan Yang, Tingting Gao, Changsheng Xu
Institutions: Henan Institute of Advanced Technology, Zhengzhou University; Institute of Automation, CAS; UCAS; Kuaishou Technology; Peng Cheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The rapid growth of streaming video applications demands multimodal models with enhanced capabilities for temporal dynamics understanding and complex reasoning. However, current Video Question Answering (VideoQA) datasets suffer from two critical limitations: 1) Static annotation mechanisms fail to capture the evolving nature of answers in temporal video streams, and 2) The absence of explicit reasoning process annotations restricts model interpretability and logical deduction capabilities. To address these challenges, we introduce StreamingCoT, the first dataset explicitly designed for temporally evolving reasoning in streaming VideoQA and multimodal Chain-of-Thought (CoT) tasks. Our framework first establishes a dynamic hierarchical annotation architecture that generates per-second dense descriptions and constructs temporally-dependent semantic segments through similarity fusion, paired with question-answer sets constrained by temporal evolution patterns. We further propose an explicit reasoning chain generation paradigm that extracts spatiotemporal objects via keyframe semantic alignment, derives object state transition-based reasoning paths using large language models, and ensures logical coherence through human-verified validation. This dataset establishes a foundation for advancing research in streaming video understanding, complex temporal reasoning, and multimodal inference. Our StreamingCoT and its construction toolkit can be accessed at this https URL.

[CV-14] MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding

【Quick Read】: This paper addresses the challenge of real-time multimodal inference on resource-constrained edge devices, with particular attention to the tight coupling between sensing dynamics and model execution and the complex dependencies across modalities. The key is MMEdge, whose pipelined sensing-and-encoding design decomposes the whole inference process into fine-grained sensing and encoding units so that computation proceeds incrementally as data arrive, while a lightweight yet effective temporal aggregation module captures rich temporal dynamics across pipelined units, cutting latency without sacrificing accuracy. The design further enables fine-grained cross-modal optimization and early decision-making, combining an adaptive multimodal configuration optimizer that selects sensing and model configurations under latency constraints with a cross-modal speculative skipping mechanism that bypasses redundant computation, markedly improving performance under fluctuating resources and complex data.

Link: https://arxiv.org/abs/2510.25327
Authors: Runxi Huang, Mingxuan Yu, Mingyu Tsoi, Xiaomin Ouyang
Institutions: Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by SenSys 2026

Abstract:Real-time multimodal inference on resource-constrained edge devices is essential for applications such as autonomous driving, human-computer interaction, and mobile health. However, prior work often overlooks the tight coupling between sensing dynamics and model execution, as well as the complex inter-modality dependencies. In this paper, we propose MMEdge, a new on-device multi-modal inference framework based on pipelined sensing and encoding. Instead of waiting for complete sensor inputs, MMEdge decomposes the entire inference process into a sequence of fine-grained sensing and encoding units, allowing computation to proceed incrementally as data arrive. MMEdge also introduces a lightweight but effective temporal aggregation module that captures rich temporal dynamics across different pipelined units to maintain accuracy performance. Such pipelined design also opens up opportunities for fine-grained cross-modal optimization and early decision-making during inference. To further enhance system performance under resource variability and input data complexity, MMEdge incorporates an adaptive multimodal configuration optimizer that dynamically selects optimal sensing and model configurations for each modality under latency constraints, and a cross-modal speculative skipping mechanism that bypasses future units of slower modalities when early predictions reach sufficient confidence. We evaluate MMEdge using two public multimodal datasets and deploy it on a real-world unmanned aerial vehicle (UAV)-based multimodal testbed. The results show that MMEdge significantly reduces end-to-end latency while maintaining high task accuracy across various system and data dynamics.

[CV-15] Prototype-Driven Adaptation for Few-Shot Object Detection

【Quick Read】: This paper addresses base-class bias and unstable calibration in few-shot object detection (FSOD) when only a handful of novel samples are available. The key to the solution is Prototype-Driven Alignment (PDA), a lightweight, plug-in metric head for DeFRCN that provides a prototype-based "second opinion" complementary to the linear classifier. Its core mechanisms are: maintaining support-only prototypes in a learnable, identity-initialized projection space and optionally applying prototype-conditioned RoI alignment to reduce geometric mismatch; adapting prototypes during fine-tuning via exponential moving average (EMA) updates on labeled foreground RoIs without introducing class-specific parameters, with prototypes frozen at inference for strict protocol compliance; capturing intra-class multi-modality with a best-of-K matching scheme; and fusing metric similarities with detector logits via temperature scaling. This consistently improves novel-class detection with minimal impact on base classes and negligible computational overhead.

Link: https://arxiv.org/abs/2510.25318
Authors: Yushen Huang, Zhiming Wang
Institutions: University of Science and Technology Beijing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 1 figure, 2 tables, Preprint

Abstract:Few-shot object detection (FSOD) often suffers from base-class bias and unstable calibration when only a few novel samples are available. We propose Prototype-Driven Alignment (PDA), a lightweight, plug-in metric head for DeFRCN that provides a prototype-based “second opinion” complementary to the linear classifier. PDA maintains support-only prototypes in a learnable identity-initialized projection space and optionally applies prototype-conditioned RoI alignment to reduce geometric mismatch. During fine-tuning, prototypes can be adapted via exponential moving average (EMA) updates on labeled foreground RoIs, without introducing class-specific parameters, and are frozen at inference to ensure strict protocol compliance. PDA employs a best-of-K matching scheme to capture intra-class multi-modality and temperature-scaled fusion to combine metric similarities with detector logits. Experiments on VOC FSOD and GFSOD benchmarks show that PDA consistently improves novel-class performance with minimal impact on base classes and negligible computational overhead.
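
PDA's scoring reduces to a few lines: best-of-K prototype matching per class, fused with the detector's logits under a temperature. K, tau, and the fusion weight below are placeholders, not the paper's settings.

```python
import torch

def pda_scores(roi_feat, prototypes, det_logits, tau=0.05, alpha=0.5):
    """roi_feat: (D,); prototypes: (C, K, D); det_logits: (C,)."""
    sims = torch.nn.functional.cosine_similarity(
        roi_feat[None, None], prototypes, dim=-1
    )                                  # (C, K) similarities
    metric = sims.max(dim=1).values    # best-of-K per class
    return (1 - alpha) * det_logits + alpha * metric / tau

C, K, D = 20, 3, 256
scores = pda_scores(torch.randn(D), torch.randn(C, K, D), torch.randn(C))
print(scores.shape)  # torch.Size([20])
```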

[CV-16] Seeing Clearly and Deeply: An RGBD Imaging Approach with a Bio-inspired Monocentric Design

【Quick Read】: This paper addresses a dual challenge in high-fidelity, compact RGBD imaging: conventional compact optics struggle to keep RGB images sharp across the full depth of field, while software-only monocular depth estimation (MDE) is ill-posed and leans on unreliable semantic priors. The key to the solution is a bio-inspired all-spherical monocentric lens and the Bionic Monocentric Imaging (BMI) framework built around it, a holistic optics-algorithm co-design. The optics naturally encode depth through depth-varying point spread functions (PSFs) without complex diffractive or freeform elements, while a physically based synthetic data pipeline and a dual-head multi-scale reconstruction network are co-designed to jointly recover a high-quality all-in-focus (AiF) image and an accurate depth map from a single coded capture, yielding strong depth accuracy (Abs Rel 0.026, RMSE 0.130) and image fidelity (SSIM 0.960, LPIPS 0.082).

Link: https://arxiv.org/abs/2510.25314
Authors: Zongxi Yu, Xiaolong Qian, Shaohua Gao, Qi Jiang, Yao Gao, Kailun Yang, Kaiwei Wang
Institutions: Zhejiang University; DJI Technology Co. Ltd.; Hunan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV); Optics (physics.optics)
Comments: The source code will be publicly available at this https URL

Abstract:Achieving high-fidelity, compact RGBD imaging presents a dual challenge: conventional compact optics struggle with RGB sharpness across the entire depth-of-field, while software-only Monocular Depth Estimation (MDE) is an ill-posed problem reliant on unreliable semantic priors. While deep optics with elements like DOEs can encode depth, they introduce trade-offs in fabrication complexity and chromatic aberrations, compromising simplicity. To address this, we first introduce a novel bio-inspired all-spherical monocentric lens, around which we build the Bionic Monocentric Imaging (BMI) framework, a holistic co-design. This optical design naturally encodes depth into its depth-varying Point Spread Functions (PSFs) without requiring complex diffractive or freeform elements. We establish a rigorous physically-based forward model to generate a synthetic dataset by precisely simulating the optical degradation process. This simulation pipeline is co-designed with a dual-head, multi-scale reconstruction network that employs a shared encoder to jointly recover a high-fidelity All-in-Focus (AiF) image and a precise depth map from a single coded capture. Extensive experiments validate the state-of-the-art performance of the proposed framework. In depth estimation, the method attains an Abs Rel of 0.026 and an RMSE of 0.130, markedly outperforming leading software-only approaches and other deep optics systems. For image restoration, the system achieves an SSIM of 0.960 and a perceptual LPIPS score of 0.082, thereby confirming a superior balance between image fidelity and depth accuracy. This study illustrates that the integration of bio-inspired, fully spherical optics with a joint reconstruction algorithm constitutes an effective strategy for addressing the intrinsic challenges in high-performance compact RGBD imaging. Source code will be publicly available at this https URL.

[CV-17] GaTector: A Unified Head-free Framework for Gaze Object and Gaze Following Prediction

【Quick Read】: This paper addresses the fact that gaze object detection and gaze following are usually solved separately and both rely on head-related priors during training and deployment, which necessitates an auxiliary head-location network, precludes joint optimization, and limits practical applicability. The key to the solution is GaTector+, a unified framework: an expanded specific-general-specific feature extractor adds task-specific blocks around a shared backbone to serve both sub-tasks; an embedded head-detection branch obtains head locations without prior information, and a head-based attention mechanism fuses scene and gaze features; an attention supervision mechanism accelerates the otherwise slow learning of the gaze heatmap; and a new evaluation metric, mean Similarity over Candidates (mSoC), increases sensitivity to variations between bounding boxes. The framework thus eliminates head priors at inference and markedly improves both tasks.

Link: https://arxiv.org/abs/2510.25301
Authors: Yang Jin, Guangyu Guo, Binglu Wang
Institutions: Northwestern Polytechnical University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Gaze object detection and gaze following are fundamental tasks for interpreting human gaze behavior or intent. However, most previous methods usually solve these two tasks separately, and their prediction of gaze objects and gaze following typically depends on head-related prior knowledge during both the training phase and real-world deployment. This dependency necessitates an auxiliary network to extract head location, thus precluding joint optimization across the entire system and constraining the practical applicability. To this end, we propose GaTector+, a unified framework for gaze object detection and gaze following, which eliminates the dependence on head-related priors during inference. Specifically, GaTector+ uses an expanded specific-general-specific feature extractor that extracts general features for gaze following and object detection with a shared backbone, while using specific blocks before and after the shared backbone to better account for the specificity of each sub-task. To obtain head-related knowledge without prior information, we first embed a head detection branch to predict the head of each person. Then, before regressing the gaze point, a head-based attention mechanism is proposed to fuse the scene feature and gaze feature with the help of head location. Since suboptimal learning of the gaze point heatmap leads to a performance bottleneck, we propose an attention supervision mechanism to accelerate the learning of the gaze heatmap. Finally, we propose a novel evaluation metric, mean Similarity over Candidates (mSoC), for gaze object detection, which is more sensitive to variations between bounding boxes. The experimental results on multiple benchmark datasets demonstrate the effectiveness of our model in both gaze object detection and gaze following tasks.

[CV-18] Diffusion-Driven Progressive Target Manipulation for Source-Free Domain Adaptation NEURIPS2025

【Quick Read】: This paper addresses the degraded performance of source-free domain adaptation (SFDA) when the source and target distributions differ substantially: non-generative methods suffer from unreliable pseudo-labels in such scenarios, while generative methods enlarge the domain gap when constructing pseudo-source data. The key to the solution is Diffusion-Driven Progressive Target Manipulation (DPTM), a generation-based framework that splits unlabeled target samples into a trust set and a non-trust set by pseudo-label reliability, uses a latent diffusion model to semantically transform non-trust samples into their newly assigned categories while keeping them within the target distribution, and applies a progressive refinement mechanism that iteratively shrinks the discrepancy between the pseudo-target domain and the real target domain, achieving more stable and efficient adaptation.

Link: https://arxiv.org/abs/2510.25279
Authors: Yuyang Huang, Yabo Chen, Junyu Zhou, Wenrui Dai, Xiaopeng Zhang, Junni Zou, Hongkai Xiong, Qi Tian
Institutions: Shanghai Jiao Tong University; Huawei Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2025

Abstract:Source-free domain adaptation (SFDA) is a challenging task that tackles domain shifts using only a pre-trained source model and unlabeled target data. Existing SFDA methods are restricted by the fundamental limitation of source-target domain discrepancy. Non-generation SFDA methods suffer from unreliable pseudo-labels in challenging scenarios with large domain discrepancies, while generation-based SFDA methods are evidently degraded due to enlarged domain discrepancies in creating pseudo-source data. To address this limitation, we propose a novel generation-based framework named Diffusion-Driven Progressive Target Manipulation (DPTM) that leverages unlabeled target data as references to reliably generate and progressively refine a pseudo-target domain for SFDA. Specifically, we divide the target samples into a trust set and a non-trust set based on the reliability of pseudo-labels to sufficiently and reliably exploit their information. For samples from the non-trust set, we develop a manipulation strategy to semantically transform them into the newly assigned categories, while simultaneously maintaining them in the target distribution via a latent diffusion model. Furthermore, we design a progressive refinement mechanism that progressively reduces the domain discrepancy between the pseudo-target domain and the real target domain via iterative refinement. Experimental results demonstrate that DPTM outperforms existing methods by a large margin and achieves state-of-the-art performance on four prevailing SFDA benchmark datasets with different scales. Remarkably, DPTM can significantly enhance the performance by up to 18.6% in scenarios with large source-target gaps.
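
The first step, splitting target samples by pseudo-label confidence, is easy to make concrete; the threshold below is an illustrative choice, and the subsequent diffusion-based manipulation of the non-trust set is not shown.

```python
import torch

def split_by_confidence(logits, thresh=0.8):
    """logits: (N, C) from the frozen source model on target images."""
    probs = logits.softmax(dim=-1)
    conf, pseudo = probs.max(dim=-1)
    trust = conf >= thresh        # kept with their pseudo-labels
    return pseudo, trust          # non-trust samples go to diffusion-based
                                  # semantic manipulation in the paper

logits = torch.randn(100, 31) * 3
pseudo, trust = split_by_confidence(logits)
print(f"trust set: {int(trust.sum())}/100 samples")
```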

[CV-19] SynHLMA: Synthesizing Hand Language Manipulation for Articulated Object with Discrete Human Object Interaction Representation

【Quick Read】: This paper addresses how to generate continuous hand grasping and manipulation sequences for articulated objects from natural-language instructions, a need in embodied AI and VR/AR applications. Traditional methods focus on static grasp generation and ignore both the deformation that articulation induces during manipulation and long-horizon action dependencies, making semantically guided dynamic interaction difficult. The key to the solution is SynHLMA, a novel HAOI (Hand Articulated Object Interaction) sequence generation framework whose core innovations are: 1) a discrete HAOI representation that models the hand-object interaction state of each frame; 2) a jointly trained HAOI manipulation language model that aligns language embeddings with HAOI representations in a shared space; and 3) a joint-aware loss that makes hand grasps follow the dynamic variation of articulated joints. The framework supports three typical tasks (HAOI generation, prediction, and interpolation), outperforms existing methods on the authors' HAOI-lang dataset, and demonstrates a robotic dexterous-grasp application via imitation learning.

Link: https://arxiv.org/abs/2510.25268
Authors: Wang zhi, Yuyan Liu, Liu Liu, Li Zhang, Ruixuan Lu, Dan Guo
Institutions: Hefei University of Technology; University of Science and Technology of China; Anhui University
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Generating hand grasps from language instructions is a widely studied topic that benefits embodied AI and VR/AR applications. When transferred to hand articulated object interaction (HAOI), hand grasp synthesis requires not only object functionality but also a long-term manipulation sequence that follows the object's deformation. This paper proposes SynHLMA, a novel HAOI sequence generation framework to synthesize hand language manipulation for articulated objects. Given a complete point cloud of an articulated object, we utilize a discrete HAOI representation to model each hand-object interaction frame. Along with natural language embeddings, the representations are trained by an HAOI manipulation language model to align the grasping process with its language description in a shared representation space. A joint-aware loss is employed to ensure hand grasps follow the dynamic variations of articulated object joints. In this way, our SynHLMA achieves three typical hand manipulation tasks for articulated objects: HAOI generation, HAOI prediction, and HAOI interpolation. We evaluate SynHLMA on our HAOI-lang dataset, and experimental results demonstrate superior hand grasp sequence generation performance compared with the state of the art. We also show a robotics grasping application in which dexterous grasps are executed via imitation learning from the manipulation sequences provided by SynHLMA. Our codes and datasets will be made publicly available.

[CV-20] LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation NEURIPS2025

【Quick Read】: This paper addresses the difficulty of open-vocabulary object-part instance segmentation, i.e., jointly detecting and segmenting objects and their parts at multiple levels from open-vocabulary candidate categories. Existing methods rely on heuristic or learnable visual grouping and struggle to model the semantic hierarchy between objects and parts. The key to the solution is LangHOPS, the first framework to integrate a multimodal large language model (MLLM) into the object-part parsing pipeline: it grounds object-part hierarchies in language space rather than visual grouping, leveraging the MLLM's rich knowledge and reasoning to align and link concepts across granularities, and adds an MLLM-driven part-query refinement strategy for finer segmentation. Experiments show state-of-the-art results across challenging settings, surpassing prior methods by 5.5% AP (in-domain) and 4.8% (cross-dataset) on PartImageNet, and by 2.5% mIoU on unseen object parts in ADE20K (zero-shot).

Link: https://arxiv.org/abs/2510.25263
Authors: Yang Miao, Jan-Nico Zaech, Xi Wang, Fabien Despinoy, Danda Pani Paudel, Luc Van Gool
Institutions: INSAIT; Sofia University; ETH Zurich; TU Munich; Toyota Motor Europe
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures, 14 tables, NeurIPS 2025

Abstract:We propose LangHOPS, the first Multimodal Large Language Model (MLLM) based framework for open-vocabulary object-part instance segmentation. Given an image, LangHOPS can jointly detect and segment hierarchical object and part instances from open-vocabulary candidate categories. Unlike prior approaches that rely on heuristic or learnable visual grouping, our approach grounds object-part hierarchies in language space. It integrates the MLLM into the object-part parsing pipeline to leverage its rich knowledge and reasoning capabilities, and link multi-granularity concepts within the hierarchies. We evaluate LangHOPS across multiple challenging scenarios, including in-domain and cross-dataset object-part instance segmentation, and zero-shot semantic segmentation. LangHOPS achieves state-of-the-art results, surpassing previous methods by 5.5% Average Precision (AP) (in-domain) and 4.8% (cross-dataset) on the PartImageNet dataset and by 2.5% mIOU on unseen object parts in ADE20K (zero-shot). Ablation studies further validate the effectiveness of the language-grounded hierarchy and MLLM driven part query refinement strategy. The code will be released here.

[CV-21] RT-DETRv4: Painlessly Furthering Real-Time Object Detection with Vision Foundation Models

【Quick Read】: This paper addresses the degraded feature representation that lightweight object detectors suffer in the pursuit of high-speed inference, which limits further performance gains and practical on-device deployment. The key is a cost-effective, highly adaptable distillation framework that harnesses the rapidly evolving semantic capabilities of Vision Foundation Models (VFMs) to strengthen lightweight detectors, with two core components: 1) a Deep Semantic Injector (DSI) module that efficiently integrates high-level VFM representations into the deep layers of the detector; and 2) a Gradient-guided Adaptive Modulation (GAM) strategy that dynamically adjusts the intensity of semantic transfer according to gradient-norm ratios, ensuring task-aligned and stable transfer. Without any deployment or inference overhead, the approach delivers consistent gains across DETR-based models, and the resulting RT-DETRv4 family reaches state-of-the-art performance on COCO.

Link: https://arxiv.org/abs/2510.25257
Authors: Zijun Liao, Yian Zhao, Xin Shan, Yu Yan, Chang Liu, Lei Lu, Xiangyang Ji, Jie Chen
Institutions: Peking University; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Real-time object detection has achieved substantial progress through meticulously designed architectures and optimization strategies. However, the pursuit of high-speed inference via lightweight network designs often leads to degraded feature representation, which hinders further performance improvements and practical on-device deployment. In this paper, we propose a cost-effective and highly adaptable distillation framework that harnesses the rapidly evolving capabilities of Vision Foundation Models (VFMs) to enhance lightweight object detectors. Given the significant architectural and learning objective disparities between VFMs and resource-constrained detectors, achieving stable and task-aligned semantic transfer is challenging. To address this, on one hand, we introduce a Deep Semantic Injector (DSI) module that facilitates the integration of high-level representations from VFMs into the deep layers of the detector. On the other hand, we devise a Gradient-guided Adaptive Modulation (GAM) strategy, which dynamically adjusts the intensity of semantic transfer based on gradient norm ratios. Without increasing deployment and inference overhead, our approach painlessly delivers striking and consistent performance gains across diverse DETR-based models, underscoring its practical utility for real-time detection. Our new model family, RT-DETRv4, achieves state-of-the-art results on COCO, attaining AP scores of 49.7/53.5/55.4/57.0 at corresponding speeds of 273/169/124/78 FPS.
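
The gradient-norm-based modulation is easy to prototype. Below is a minimal PyTorch sketch of the general idea, assuming hypothetical `task_loss` and `distill_loss` tensors and a list of shared backbone parameters; the paper's exact GAM formulation may differ.

```python
import torch

def gam_weight(task_loss, distill_loss, shared_params, eps=1e-8):
    """Scale the distillation term by the ratio of gradient norms.

    A minimal sketch of gradient-guided modulation: the distillation loss is
    rescaled so its gradient magnitude on the shared parameters tracks that of
    the task loss. Names and details are assumptions, not the paper's exact rule.
    """
    g_task = torch.autograd.grad(task_loss, shared_params, retain_graph=True, allow_unused=True)
    g_dist = torch.autograd.grad(distill_loss, shared_params, retain_graph=True, allow_unused=True)
    n_task = torch.norm(torch.cat([g.flatten() for g in g_task if g is not None]))
    n_dist = torch.norm(torch.cat([g.flatten() for g in g_dist if g is not None]))
    return (n_task / (n_dist + eps)).detach()

# total_loss = task_loss + gam_weight(task_loss, distill_loss, params) * distill_loss
```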

[CV-22] Mapping and Classification of Trees Outside Forests using Deep Learning

【Quick Read】: This paper addresses the classification of Trees Outside Forests (TOF), which previous studies often treat as a single class or handle with rigid rule-based thresholds, limiting ecological interpretation and regional adaptability. The key idea is to apply deep learning for semantic segmentation on high-resolution aerial imagery with a newly generated dataset, comparing convolutional neural networks (CNNs), vision transformers, and hybrid CNN-transformer models across six architectures. The FT-UNetFormer performs best across four agricultural landscapes (mean Intersection-over-Union 0.74; mean F1 score 0.84), underscoring the importance of spatial-context understanding for accurate TOF mapping. The results also expose difficulties in classifying complex structures such as the Patch and Tree classes, and highlight the need for regionally diverse training data to ensure reliable large-scale mapping.

Link: https://arxiv.org/abs/2510.25239
Authors: Moritz Lucas, Hamid Ebrahimy, Viacheslav Barkov, Ralf Pecenka, Kai-Uwe Kühnberger, Björn Waske
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Trees Outside Forests (TOF) play an important role in agricultural landscapes by supporting biodiversity, sequestering carbon, and regulating microclimates. Yet, most studies have treated TOF as a single class or relied on rigid rule-based thresholds, limiting ecological interpretation and adaptability across regions. To address this, we evaluate deep learning for TOF classification using a newly generated dataset and high-resolution aerial imagery from four agricultural landscapes in Germany. Specifically, we compare convolutional neural networks (CNNs), vision transformers, and hybrid CNN-transformer models across six semantic segmentation architectures (ABCNet, LSKNet, FT-UNetFormer, DC-Swin, BANet, and U-Net) to map four categories of woody vegetation: Forest, Patch, Linear, and Tree, derived from previous studies and governmental products. Overall, the models achieved good classification accuracy across the four landscapes, with the FT-UNetFormer performing best (mean Intersection-over-Union 0.74; mean F1 score 0.84), underscoring the importance of spatial context understanding in TOF mapping and classification. The models perform well on the Forest and Linear classes, but challenges remain in classifying complex structures with high edge density, notably the Patch and Tree classes. Our generalization experiments highlight the need for regionally diverse training data to ensure reliable large-scale mapping. The dataset and code are openly available at this https URL

[CV-23] VADB: A Large-Scale Video Aesthetic Database with Professional and Multi-Dimensional Annotations

【Quick Read】: This paper addresses the lack of standardized datasets and robust models in video aesthetic assessment, where temporal dynamics and multimodal-fusion challenges prevent the direct transfer of image-based methods. The key contribution is VADB, the largest video aesthetic database to date (10,490 diverse videos annotated by 37 professionals across multiple aesthetic dimensions), together with VADB-Net, a dual-modal pre-training framework with a two-stage training strategy that outperforms existing video quality assessment models on scoring tasks and supports downstream video aesthetic assessment.

Link: https://arxiv.org/abs/2510.25238
Authors: Qianqian Qiao, DanDan Zheng, Yihang Bo, Bao Peng, Heng Huang, Longteng Jiang, Huaye Wang, Jingdong Chen, Jun Zhou, Xin Jin
Institutions: Nanjing University; Huazhong University of Science and Technology; Beijing Film Academy; University of Science and Technology of China; Beijing Electronic Science and Technology Institute; Institute of Automation, Chinese Academy of Sciences; Beijing Institute for General Artificial Intelligence
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Video aesthetic assessment, a vital area in multimedia computing, integrates computer vision with human cognition. Its progress is limited by the lack of standardized datasets and robust models, as the temporal dynamics of video and multimodal fusion challenges hinder direct application of image-based methods. This study introduces VADB, the largest video aesthetic database with 10,490 diverse videos annotated by 37 professionals across multiple aesthetic dimensions, including overall and attribute-specific aesthetic scores, rich language comments and objective tags. We propose VADB-Net, a dual-modal pre-training framework with a two-stage training strategy, which outperforms existing video quality assessment models in scoring tasks and supports downstream video aesthetic assessment tasks. The dataset and source code are available at this https URL.

[CV-24] DeepShield: Fortifying Deepfake Video Detection with Local and Global Forgery Analysis ICCV2025

【Quick Read】: This paper targets the poor generalization of deepfake detectors to unseen manipulation techniques: existing detectors rely on forgery-specific artifacts and degrade sharply in cross-dataset or cross-manipulation settings. The proposed DeepShield framework balances local sensitivity and global generalization via two core components: 1) Local Patch Guidance (LPG), which uses spatiotemporal artifact modeling and patch-wise supervision to capture fine-grained inconsistencies that global models tend to miss; and 2) Global Forgery Diversification (GFD), which synthesizes diverse forgeries through domain feature augmentation, mitigating overfitting and improving robustness against unseen attacks. With this design, DeepShield outperforms state-of-the-art methods in both cross-dataset and cross-manipulation evaluations.

Link: https://arxiv.org/abs/2510.25237
Authors: Yinqi Cai, Jichang Li, Zhaolun Li, Weikai Chen, Rushi Lan, Xi Xie, Xiaonan Luo, Guanbin Li
Institutions: Sun Yat-sen University; Pengcheng Laboratory; Guilin University of Electronic Technology; Guangdong Key Laboratory of Big Data Analysis and Processing
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025

Abstract:Recent advances in deep generative models have made it easier to manipulate face videos, raising significant concerns about their potential misuse for fraud and misinformation. Existing detectors often perform well in in-domain scenarios but fail to generalize across diverse manipulation techniques due to their reliance on forgery-specific artifacts. In this work, we introduce DeepShield, a novel deepfake detection framework that balances local sensitivity and global generalization to improve robustness across unseen forgeries. DeepShield enhances the CLIP-ViT encoder through two key components: Local Patch Guidance (LPG) and Global Forgery Diversification (GFD). LPG applies spatiotemporal artifact modeling and patch-wise supervision to capture fine-grained inconsistencies often overlooked by global models. GFD introduces domain feature augmentation, leveraging domain-bridging and boundary-expanding feature generation to synthesize diverse forgeries, mitigating overfitting and enhancing cross-domain adaptability. Through the integration of novel local and global analysis for deepfake detection, DeepShield outperforms state-of-the-art methods in cross-dataset and cross-manipulation evaluations, achieving superior robustness against unseen deepfake attacks.

[CV-25] Learning Disentangled Speech- and Expression-Driven Blendshapes for 3D Talking Face Animation

【Quick Read】: This paper addresses the generation of emotionally expressive 3D talking faces: producing realistic facial animation driven jointly by speech and emotion while maintaining accurate lip-sync. Although speech-driven lip-sync has advanced, emotional expressivity remains underexplored, largely due to the scarcity of high-quality emotional 3D talking-face datasets. The key idea is to model facial animation as a linear additive combination of speech- and emotion-driven components and to jointly learn two sets of blendshapes from complementary datasets: VOCAset (neutral-expression 3D talking faces) and Florence4D (3D expression sequences). A sparsity constraint loss keeps the two blendshape sets disentangled while still capturing the inherent cross-domain deformations present in the training data. The learned blendshapes can further be mapped to the expression and jaw-pose parameters of the FLAME model, enabling emotionally expressive talking-face animation of 3D Gaussian avatars.

Link: https://arxiv.org/abs/2510.25234
Authors: Yuxiang Mao, Zhijie Zhang, Zhiheng Zhang, Jiawei Liu, Chen Zeng, Shihong Xia
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: 18 pages, 6 figures, accepted to ICXR 2025 conference

Abstract:Expressions are fundamental to conveying human emotions. With the rapid advancement of AI-generated content (AIGC), realistic and expressive 3D facial animation has become increasingly crucial. Despite recent progress in speech-driven lip-sync for talking-face animation, generating emotionally expressive talking faces remains underexplored. A major obstacle is the scarcity of real emotional 3D talking-face datasets due to the high cost of data capture. To address this, we model facial animation driven by both speech and emotion as a linear additive problem. Leveraging a 3D talking-face dataset with neutral expressions (VOCAset) and a dataset of 3D expression sequences (Florence4D), we jointly learn a set of blendshapes driven by speech and emotion. We introduce a sparsity constraint loss to encourage disentanglement between the two types of blendshapes while allowing the model to capture inherent secondary cross-domain deformations present in the training data. The learned blendshapes can be further mapped to the expression and jaw pose parameters of the FLAME model, enabling the animation of 3D Gaussian avatars. Qualitative and quantitative experiments demonstrate that our method naturally generates talking faces with specified expressions while maintaining accurate lip synchronization. Perceptual studies further show that our approach achieves superior emotional expressivity compared to existing methods, without compromising lip-sync quality.
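
The linear additive formulation lends itself to a compact sketch. The snippet below illustrates, under assumed tensor shapes and names, how speech- and emotion-driven blendshape bases could be combined, with an L1 term encouraging disentanglement; it is a sketch of the idea, not the authors' implementation.

```python
import torch

V, Ks, Ke = 5023, 32, 16          # assumed vertex count and basis sizes
template = torch.zeros(V, 3)       # neutral face template
B_speech = torch.randn(V, 3, Ks, requires_grad=True)   # speech blendshapes
B_emotion = torch.randn(V, 3, Ke, requires_grad=True)  # emotion blendshapes

def animate(w_speech, w_emotion):
    # Linear additive model: neutral + speech offsets + emotion offsets.
    return (template
            + torch.einsum("vck,k->vc", B_speech, w_speech)
            + torch.einsum("vck,k->vc", B_emotion, w_emotion))

def loss(pred, target, lam=1e-4):
    # Reconstruction plus a sparsity term that discourages the two bases
    # from encoding overlapping (entangled) deformations.
    recon = ((pred - target) ** 2).mean()
    sparsity = B_speech.abs().mean() + B_emotion.abs().mean()
    return recon + lam * sparsity
```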

[CV-26] Balanced conic rectified flow NEURIPS2025

【Quick Read】: This paper targets two core problems in training the original rectified flow: 1) the reflow procedure needs a large number of generated image pairs to preserve the target distribution, which is computationally expensive; and 2) because the model is trained only on generated pairs, its performance hinges on the initial 1-rectified-flow model and becomes biased towards generated data. The key idea is to incorporate real images into training by preserving their ODE paths, sharply reducing the dependence on large amounts of generated data. Experiments show that with only about one-tenth of the generative pairs of the original method, the approach achieves better FID on CIFAR-10 in both one-step generation and full-step simulation, induces straighter paths, and avoids saturation on generated images during reflow, making ODE learning more robust while better preserving the real data distribution.

Link: https://arxiv.org/abs/2510.25229
Authors: Kim Shin Seong, Mingi Kwon, Jaeseok Jeong, Youngjung Uh
Institutions: Yonsei University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Main paper: 10 pages (total 40 pages including appendix), 5 figures. Accepted at NeurIPS 2025 (Poster). Acknowledgment: Supported by the NRF of Korea (RS-2023-00223062) and IITP grants (RS-2020-II201361, RS-2024-00439762) funded by the Korean government (MSIT)

Abstract:Rectified flow is a generative model that learns smooth transport mappings between two distributions through an ordinary differential equation (ODE). Unlike diffusion-based generative models, which require costly numerical integration of a generative ODE to sample images with state-of-the-art quality, rectified flow uses an iterative process called reflow to learn smooth and straight ODE paths. This allows for relatively simple and efficient generation of high-quality images. However, rectified flow still faces several challenges. 1) The reflow process requires a large number of generative pairs to preserve the target distribution, leading to significant computational costs. 2) Since the model is typically trained using only generated image pairs, its performance heavily depends on the 1-rectified flow model, causing it to become biased towards the generated data. In this work, we experimentally expose the limitations of the original rectified flow and propose a novel approach that incorporates real images into the training process. By preserving the ODE paths for real images, our method effectively reduces reliance on large amounts of generated data. Instead, we demonstrate that the reflow process can be conducted efficiently using a much smaller set of generated and real images. In CIFAR-10, we achieved significantly better FID scores, not only in one-step generation but also in full-step simulations, while using only about one-tenth of the generative pairs compared to the original method. Furthermore, our approach induces straighter paths and avoids saturation on generated images during reflow, leading to more robust ODE learning while preserving the distribution of real images.
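
For readers unfamiliar with reflow, the core objective is simple: regress a velocity field toward the straight-line displacement between paired endpoints. A generic sketch follows (not the paper's balanced variant); `model`, the pair source, and hyperparameters are placeholders.

```python
import torch

def rectified_flow_step(model, x0, x1, optimizer):
    """One training step of rectified flow on endpoint pairs (x0, x1).

    In reflow, (x0, x1) are noise/image pairs produced by a previous flow;
    the paper's contribution is to mix in pairs anchored at real images.
    This sketch shows only the shared regression objective.
    """
    t = torch.rand(x0.size(0), 1, 1, 1, device=x0.device)
    xt = (1 - t) * x0 + t * x1          # point on the straight path
    target = x1 - x0                     # constant velocity of that path
    loss = ((model(xt, t.flatten()) - target) ** 2).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```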

[CV-27] Aligning What You Separate: Denoised Patch Mixing for Source-Free Domain Adaptation in Medical Image Segmentation

【Quick Read】: This paper proposes a progressive source-free domain adaptation framework for medical image segmentation, addressing two weaknesses of existing methods: ignoring differences in sample difficulty, and degrading under noisy pseudo-labels caused by domain shift. The key components are: first, an entropy-similarity analysis that partitions unlabeled target images into reliable and unreliable subsets, enabling easy-to-hard progressive adaptation; second, Monte Carlo-based denoising masks that suppress unreliable pixels and stabilize training; and finally, intra- and inter-domain objectives that mix patches across subsets, propagating reliable semantics while suppressing noise. The method markedly improves boundary delineation and achieves state-of-the-art Dice and ASSD scores.

Link: https://arxiv.org/abs/2510.25227
Authors: Quang-Khai Bui-Tran, Thanh-Huy Nguyen, Hoang-Thien Nguyen, Ba-Thinh Lam, Nguyen Lan Vi Vu, Phat K. Huynh, Ulas Bagci, Min Xu
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 3 figures

Abstract:Source-Free Domain Adaptation (SFDA) is emerging as a compelling solution for medical image segmentation under privacy constraints, yet current approaches often ignore sample difficulty and struggle with noisy supervision under domain shift. We present a new SFDA framework that leverages Hard Sample Selection and Denoised Patch Mixing to progressively align target distributions. First, unlabeled images are partitioned into reliable and unreliable subsets through entropy-similarity analysis, allowing adaptation to start from easy samples and gradually incorporate harder ones. Next, pseudo-labels are refined via Monte Carlo-based denoising masks, which suppress unreliable pixels and stabilize training. Finally, intra- and inter-domain objectives mix patches between subsets, transferring reliable semantics while mitigating noise. Experiments on benchmark datasets show consistent gains over prior SFDA and UDA methods, delivering more accurate boundary delineation and achieving state-of-the-art Dice and ASSD scores. Our study highlights the importance of progressive adaptation and denoised supervision for robust segmentation under domain shift.
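
The entropy-based split into reliable and unreliable subsets can be prototyped in a few lines. The sketch below uses prediction entropy alone with an assumed threshold; the paper additionally uses a similarity criterion.

```python
import torch

def partition_by_entropy(probs, threshold=0.3):
    """Split unlabeled target samples into reliable / unreliable subsets.

    probs: (N, C, H, W) softmax outputs of the source model on target images.
    Images with low mean pixel entropy are treated as reliable and adapted
    first. The threshold value is an assumption, not the paper's setting.
    """
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)  # (N, H, W)
    per_image = entropy.mean(dim=(1, 2))                          # (N,)
    reliable = per_image < threshold
    return reliable, ~reliable
```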

[CV-28] MSF-Net: Multi-Stage Feature Extraction and Fusion for Robust Photometric Stereo

【Quick Read】: This paper addresses the failure of existing learning-based photometric stereo methods to accurately capture features at multiple stages and to promote interaction among them, which leads to redundant features, especially in intricate regions such as wrinkles and edges. The key contributions of the proposed MSF-Net are: 1) multi-stage feature extraction paired with a selective update strategy to improve the quality and distinctiveness of features at different stages; and 2) a feature fusion module that strengthens the interplay among features across stages, improving the accuracy of surface normal estimation.

Link: https://arxiv.org/abs/2510.25221
Authors: Shiyu Qin, Zhihao Cai, Kaixuan Wang, Lin Qi, Junyu Dong
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Photometric stereo is a technique aimed at determining surface normals through the utilization of shading cues derived from images taken under different lighting conditions. However, existing learning-based approaches often fail to accurately capture features at multiple stages and do not adequately promote interaction between these features. Consequently, these models tend to extract redundant features, especially in areas with intricate details such as wrinkles and edges. To tackle these issues, we propose MSF-Net, a novel framework for extracting information at multiple stages, paired with a selective update strategy, aiming to extract high-quality feature information, which is critical for accurate normal estimation. Additionally, we have developed a feature fusion module to improve the interplay among different features. Experimental results on the DiLiGenT benchmark show that our proposed MSF-Net significantly surpasses previous state-of-the-art methods in the accuracy of surface normal estimation.

[CV-29] U-CAN: Unsupervised Point Cloud Denoising with Consistency-Aware Noise2Noise Matching NEURIPS2025

【Quick Read】: This paper addresses the degradation of point clouds by sensor noise during scanning, which significantly harms downstream tasks such as surface reconstruction and shape understanding. Conventional methods learn denoising priors from large collections of manually prepared noisy-clean point cloud pairs, which is costly to annotate and limits generalization. The proposed U-CAN (Unsupervised framework for point cloud denoising with Consistency-Aware Noise2Noise matching) has two key ingredients: a novel loss that performs statistical reasoning over multiple noisy observations, letting a network infer a multi-step denoising path for each point; and a geometric consistency constraint that keeps denoised results structurally stable across noisy samples, a general term that also transfers to 2D image denoising. Experiments show U-CAN clearly outperforms existing unsupervised methods on point cloud denoising, upsampling, and image denoising benchmarks, and performs on par with supervised approaches.

Link: https://arxiv.org/abs/2510.25210
Authors: Junsheng Zhou, Xingyu Shi, Haichuan Song, Yi Fang, Yu-Shen Liu, Zhizhong Han
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2025. Project page: this https URL

Abstract:Point clouds captured by scanning sensors are often perturbed by noise, which has a highly negative impact on downstream tasks (e.g. surface reconstruction and shape understanding). Previous works mostly focus on training neural networks with noisy-clean point cloud pairs for learning denoising priors, which requires extensive manual effort. In this work, we introduce U-CAN, an Unsupervised framework for point cloud denoising with Consistency-Aware Noise2Noise matching. Specifically, we leverage a neural network to infer a multi-step denoising path for each point of a shape or scene with a noise-to-noise matching scheme. We achieve this by a novel loss which enables statistical reasoning on multiple noisy point cloud observations. We further introduce a novel constraint on the denoised geometry consistency for learning consistency-aware denoising patterns. We justify that the proposed constraint is a general term which is not limited to the 3D domain and can also contribute to the area of 2D image denoising. Our evaluations under the widely used benchmarks in point cloud denoising, upsampling and image denoising show significant improvement over the state-of-the-art unsupervised methods, where U-CAN also produces comparable results with the supervised methods.

[CV-30] AI-Powered Early Detection of Critical Diseases using Image Processing and Audio Analysis

【Quick Read】: This paper addresses the high cost, invasiveness, and limited accessibility of early diagnosis for critical diseases in low-resource regions. The key idea is a lightweight multimodal AI diagnostic framework that integrates image analysis, thermal imaging, and audio signal processing for early detection of skin cancer, vascular blood clots, and cardiopulmonary abnormalities, respectively. By keeping the models compact (a fine-tuned MobileNetV2 for skin lesion classification, handcrafted features with an SVM for clot detection, and MFCC features with a Random Forest for heart and lung sounds), the framework attains high accuracy (up to 89.3%) with low computational overhead, enabling real-time deployment on low-cost devices and a path toward scalable, accessible, AI-based pre-diagnosis.

Link: https://arxiv.org/abs/2510.25199
Authors: Manisha More, Kavya Bhand, Kaustubh Mukdam, Kavya Sharma, Manas Kawtikwar, Hridayansh Kaware, Prajwal Kavhar
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Early diagnosis of critical diseases can significantly improve patient survival and reduce treatment costs. However, existing diagnostic techniques are often costly, invasive, and inaccessible in low-resource regions. This paper presents a multimodal artificial intelligence (AI) diagnostic framework integrating image analysis, thermal imaging, and audio signal processing for early detection of three major health conditions: skin cancer, vascular blood clots, and cardiopulmonary abnormalities. A fine-tuned MobileNetV2 convolutional neural network was trained on the ISIC 2019 dataset for skin lesion classification, achieving 89.3% accuracy, 91.6% sensitivity, and 88.2% specificity. A support vector machine (SVM) with handcrafted features was employed for thermal clot detection, achieving 86.4% accuracy (AUC = 0.89) on synthetic and clinical data. For cardiopulmonary analysis, lung and heart sound datasets from PhysioNet and Pascal were processed using Mel-Frequency Cepstral Coefficients (MFCC) and classified via Random Forest, reaching 87.2% accuracy and 85.7% sensitivity. Comparative evaluation against state-of-the-art models demonstrates that the proposed system achieves competitive results while remaining lightweight and deployable on low-cost devices. The framework provides a promising step toward scalable, real-time, and accessible AI-based pre-diagnostic healthcare solutions.
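
The cardiopulmonary branch is a classic MFCC-plus-Random-Forest pipeline. A minimal sketch with librosa and scikit-learn is shown below; the file paths, label arrays, and hyperparameters are placeholders, not the authors' settings.

```python
import librosa
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def mfcc_features(path, sr=22050, n_mfcc=13):
    # Load a heart/lung sound recording and summarize its MFCCs over time.
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# train_paths / train_labels are hypothetical dataset lists:
# X = np.stack([mfcc_features(p) for p in train_paths])
# clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, train_labels)
```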

[CV-31] Mask-Robust Face Verification for Online Learning via YOLOv5 and Residual Networks

【Quick Read】: This paper addresses accurate and secure student identity authentication in online education, e.g., preventing impersonation during remote learning. The key design is a multi-stage deep-learning verification pipeline: a YOLOv5 detector first localizes faces in images captured by students' open webcams; the cropped faces are then fed into a residual network to extract deep feature representations; finally, Euclidean distances against a database of enrolled student faces determine each student's identity. The approach improves the security and stability of online education systems, supporting their evolution toward digitized and intelligent education.

Link: https://arxiv.org/abs/2510.25184
Authors: Zhifeng Wang, Minghui Wang, Chunyan Zeng, Jialong Yao, Yang Yang, Hongmin Xu
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 10 figures

Abstract:In the contemporary landscape, the fusion of information technology and the rapid advancement of artificial intelligence have ushered school education into a transformative phase characterized by digitization and heightened intelligence. Concurrently, the global paradigm shift caused by the Covid-19 pandemic has catalyzed the evolution of e-learning, accentuating its significance. Amidst these developments, one pivotal facet of the online education paradigm that warrants attention is the authentication of identities within the digital learning sphere. Within this context, our study delves into a solution for online learning authentication, utilizing an enhanced convolutional neural network architecture, specifically the residual network model. By harnessing the power of deep learning, this technological approach aims to galvanize the ongoing progress of online education, while concurrently bolstering its security and stability. Such fortification is imperative in enabling online education to seamlessly align with the swift evolution of the educational landscape. This paper’s focal proposition involves the deployment of the YOLOv5 network, meticulously trained on our proprietary dataset. This network is tasked with identifying individuals’ faces culled from images captured by students’ open online cameras. The resultant facial information is then channeled into the residual network to extract intricate features at a deeper level. Subsequently, a comparative analysis of Euclidean distances against students’ face databases is performed, effectively ascertaining the identity of each student.
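
The final matching stage amounts to nearest-neighbor search under Euclidean distance. A minimal sketch, assuming embeddings have already been extracted by the residual network and an acceptance threshold tuned on validation data:

```python
import numpy as np

def identify(query_emb, gallery_embs, gallery_ids, max_dist=1.1):
    """Match a face embedding against an enrolled-student database.

    gallery_embs: (N, D) array of enrolled embeddings. max_dist is a
    hypothetical acceptance threshold, not a value from the paper.
    """
    dists = np.linalg.norm(gallery_embs - query_emb[None, :], axis=1)
    best = int(np.argmin(dists))
    if dists[best] <= max_dist:
        return gallery_ids[best], dists[best]   # recognized student
    return None, dists[best]                     # reject as unknown
```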

[CV-32] st-Time Adaptive Object Detection with Foundation Model NEURIPS2025

【Quick Read】: This paper addresses two limitations of test-time adaptive object detection: heavy reliance on source-domain statistics, and the closed-set assumption that source and target share an identical category space, which restricts generalization in practice. The proposed method is the first foundation-model-powered test-time adaptation approach for detection, eliminating the need for source data entirely and breaking the closed-set constraint. Its key components are: 1) a multi-modal prompt-based Mean-Teacher framework that tunes text and visual prompts to adapt both language and vision representation spaces in a parameter-efficient manner; 2) a Test-time Warm-start strategy tailored to the visual prompts that stabilizes the representations of the vision branch; and 3) an Instance Dynamic Memory (IDM) module with Memory Enhancement and Memory Hallucination strategies, which exploits high-quality pseudo-labels from previous test samples to improve label quality without annotations, enabling adaptation to arbitrary cross-domain and cross-category target data.

Link: https://arxiv.org/abs/2510.25175
Authors: Yingjie Gao, Yanan Zhang, Zhi Cai, Di Huang
Institutions: Beihang University; Hefei University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2025

Abstract:In recent years, test-time adaptive object detection has attracted increasing attention due to its unique advantages in online domain adaptation, which aligns more closely with real-world application scenarios. However, existing approaches heavily rely on source-derived statistical characteristics while making the strong assumption that the source and target domains share an identical category space. In this paper, we propose the first foundation model-powered test-time adaptive object detection method that eliminates the need for source data entirely and overcomes traditional closed-set limitations. Specifically, we design a Multi-modal Prompt-based Mean-Teacher framework for vision-language detector-driven test-time adaptation, which incorporates text and visual prompt tuning to adapt both language and vision representation spaces on the test data in a parameter-efficient manner. Correspondingly, we propose a Test-time Warm-start strategy tailored for the visual prompts to effectively preserve the representation capability of the vision branch. Furthermore, to guarantee high-quality pseudo-labels in every test batch, we maintain an Instance Dynamic Memory (IDM) module that stores high-quality pseudo-labels from previous test samples, and propose two novel strategies-Memory Enhancement and Memory Hallucination-to leverage IDM’s high-quality instances for enhancing original predictions and hallucinating images without available pseudo-labels, respectively. Extensive experiments on cross-corruption and cross-dataset benchmarks demonstrate that our method consistently outperforms previous state-of-the-art methods, and can adapt to arbitrary cross-domain and cross-category target data. Code is available at this https URL.

[CV-33] Classifier Enhancement Using Extended Context and Domain Experts for Semantic Segmentation

【Quick Read】: This paper addresses two issues in conventional semantic segmentation: classifiers with fixed parameters cannot adapt to image-level differences in class distribution, and dataset-level class imbalance biases segmentation toward majority classes, hurting minority-class regions. The key solution is an Extended Context-Aware Classifier (ECAC) that uses a memory bank to learn dataset-level contextual information for each class and combines it with the current image's class-specific context to dynamically adjust the classifier for precise pixel labeling. A teacher-student paradigm is further adopted, in which a domain expert (teacher network) dynamically adjusts contextual information using ground truth and transfers knowledge to the student network, mitigating class imbalance and improving generalization.

Link: https://arxiv.org/abs/2510.25174
Authors: Huadong Tang, Youpeng Zhao, Min Xu, Jun Wang, Qiang Wu
Institutions: University of Technology Sydney; University of Central Florida
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at IEEE Transactions on Multimedia (TMM)

Abstract:Prevalent semantic segmentation methods generally adopt a vanilla classifier to categorize each pixel into specific classes. Although such a classifier learns global information from the training data, this information is represented by a set of fixed parameters (weights and biases). However, each image has a different class distribution, which prevents the classifier from addressing the unique characteristics of individual images. At the dataset level, class imbalance leads to segmentation results being biased towards majority classes, limiting the model's effectiveness in identifying and segmenting minority class regions. In this paper, we propose an Extended Context-Aware Classifier (ECAC) that dynamically adjusts the classifier using global (dataset-level) and local (image-level) contextual information. Specifically, we leverage a memory bank to learn dataset-level contextual information of each class, incorporating the class-specific contextual information from the current image to improve the classifier for precise pixel labeling. Additionally, a teacher-student network paradigm is adopted, where the domain expert (teacher network) dynamically adjusts contextual information with ground truth and transfers knowledge to the student network. Comprehensive experiments illustrate that the proposed ECAC can achieve state-of-the-art performance across several datasets, including ADE20K, COCO-Stuff10K, and Pascal-Context.

[CV-34] D2GS: Dense Depth Regularization for LiDAR-free Urban Scene Reconstruction

【Quick Read】: This paper addresses the strong dependence of current urban scene reconstruction methods on multimodal sensors (LiDAR plus images): acquiring accurate LiDAR depth is difficult in practice because of spatiotemporal calibration challenges and reprojection errors caused by sensor misalignment. The proposed LiDAR-free framework D²GS has three key parts: 1) a dense initial point cloud back-projected from multi-view metric depth predictions and refined by a Progressive Pruning strategy for global consistency; 2) a Depth Enhancer that jointly refines Gaussian geometry and the predicted dense metric depth, using diffusion priors from a depth foundation model to improve the rendered depth, which in turn provides stronger geometric constraints during training; and 3) constraints on the shape and normal attributes of Gaussians within road regions to improve ground geometry. Experiments on the Waymo dataset show the method consistently outperforms state-of-the-art approaches, even those using ground-truth LiDAR data.

Link: https://arxiv.org/abs/2510.25173
Authors: Kejing Xia, Jidong Jia, Ke Jin, Yucai Bai, Li Sun, Dacheng Tao, Youjian Zhang
Institutions: Wuhan University; Shanghai Jiao Tong University; Tongji University; Bosch; Nanyang Technological University
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recently, Gaussian Splatting (GS) has shown great potential for urban scene reconstruction in the field of autonomous driving. However, current urban scene reconstruction methods often depend on multimodal sensors as inputs, \textiti.e. LiDAR and images. Though the geometry prior provided by LiDAR point clouds can largely mitigate ill-posedness in reconstruction, acquiring such accurate LiDAR data is still challenging in practice: i) precise spatiotemporal calibration between LiDAR and other sensors is required, as they may not capture data simultaneously; ii) reprojection errors arise from spatial misalignment when LiDAR and cameras are mounted at different locations. To avoid the difficulty of acquiring accurate LiDAR depth, we propose D^2GS , a LiDAR-free urban scene reconstruction framework. In this work, we obtain geometry priors that are as effective as LiDAR while being denser and more accurate. \textbfFirst , we initialize a dense point cloud by back-projecting multi-view metric depth predictions. This point cloud is then optimized by a Progressive Pruning strategy to improve the global consistency. \textbfSecond , we jointly refine Gaussian geometry and predicted dense metric depth via a Depth Enhancer. Specifically, we leverage diffusion priors from a depth foundation model to enhance the depth maps rendered by Gaussians. In turn, the enhanced depths provide stronger geometric constraints during Gaussian training. \textbfFinally , we improve the accuracy of ground geometry by constraining the shape and normal attributes of Gaussians within road regions. Extensive experiments on the Waymo dataset demonstrate that our method consistently outperforms state-of-the-art methods, producing more accurate geometry even when compared with those using ground-truth LiDAR data.
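
The initialization step, back-projecting multi-view metric depth into a point cloud, follows standard pinhole geometry. A single-view sketch with assumed intrinsics and pose conventions:

```python
import numpy as np

def backproject(depth, K, c2w):
    """Lift a metric depth map (H, W) to world-space 3D points.

    K is the 3x3 camera intrinsic matrix and c2w the 4x4 camera-to-world
    pose; merging several views yields the dense initial point cloud.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    cam = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)  # rays scaled by depth
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])     # homogeneous coords
    return (c2w @ cam_h).T[:, :3]                             # world-space points
```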

[CV-35] A Study on Inference Latency for Vision Transformers on Mobile Devices

【Quick Read】: This paper addresses the difficulty of predicting and optimizing the inference latency of Vision Transformers (ViTs) on mobile devices. The authors quantitatively compare 190 real-world ViTs with 102 real-world convolutional neural networks (CNNs) on mobile platforms to identify the factors that drive ViT latency. Building on these insights, they construct a benchmark dataset of measured latencies for 1,000 synthetic ViTs covering representative building blocks and state-of-the-art architectures, spanning two machine learning frameworks and six mobile platforms. Using this dataset, they show that the inference latency of new ViTs can be predicted with accuracy sufficient for real-world applications.

Link: https://arxiv.org/abs/2510.25166
Authors: Zhuojin Li, Marco Paolieri, Leana Golubchik
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Performance (cs.PF)
Comments: To appear in Springer LNICST, volume 663, Proceedings of VALUETOOLS 2024

Abstract:Given the significant advances in machine learning techniques on mobile devices, particularly in the domain of computer vision, in this work we quantitatively study the performance characteristics of 190 real-world vision transformers (ViTs) on mobile devices. Through a comparison with 102 real-world convolutional neural networks (CNNs), we provide insights into the factors that influence the latency of ViT architectures on mobile devices. Based on these insights, we develop a dataset including measured latencies of 1000 synthetic ViTs with representative building blocks and state-of-the-art architectures from two machine learning frameworks and six mobile platforms. Using this dataset, we show that inference latency of new ViTs can be predicted with sufficient accuracy for real-world applications.

[CV-36] arget-Guided Bayesian Flow Networks for Quantitatively Constrained CAD Generation

【Quick Read】: This paper addresses the difficulty of handling long-range constraints and parameter sensitivity in multi-modal parametric CAD sequence generation, in particular the precise control of quantitative constraints. The key contribution is the Target-Guided Bayesian Flow Network (TGBFN), which for the first time models the multi-modality of CAD sequences (discrete commands and continuous parameters) in a unified, continuous, and differentiable parameter space, and introduces a guided Bayesian flow through the parameter-update kernel to precisely steer CAD properties, yielding high-fidelity, condition-aware CAD sequence generation.

Link: https://arxiv.org/abs/2510.25163
Authors: Wenhao Zheng, Chenwei Sun, Wenbo Zhang, Jiancheng Lv, Xianggen Liu
Institutions: College of Computer Science, Sichuan University; Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education; School of Computer Science and Technology, Xidian University
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Deep generative models, such as diffusion models, have shown promising progress in image generation and audio generation via simplified continuity assumptions. However, the development of generative modeling techniques for generating multi-modal data, such as parametric CAD sequences, still lags behind due to the challenges in addressing long-range constraints and parameter sensitivity. In this work, we propose a novel framework for quantitatively constrained CAD generation, termed Target-Guided Bayesian Flow Network (TGBFN). For the first time, TGBFN handles the multi-modality of CAD sequences (i.e., discrete commands and continuous parameters) in a unified continuous and differentiable parameter space rather than in the discrete data space. In addition, TGBFN penetrates the parameter update kernel and introduces a guided Bayesian flow to control the CAD properties. To evaluate TGBFN, we construct a new dataset for quantitatively constrained CAD generation. Extensive comparisons across single-condition and multi-condition constrained generation tasks demonstrate that TGBFN achieves state-of-the-art performance in generating high-fidelity, condition-aware CAD sequences. The code is available at this https URL.

[CV-37] owards Real-Time Inference of Thin Liquid Film Thickness Profiles from Interference Patterns Using Vision Transformers DATE

【Quick Read】: This paper addresses the reconstruction of thickness profiles from thin-film interference patterns, an ill-posed inverse problem complicated by phase periodicity, imaging noise, and ambient artifacts; traditional approaches are computationally heavy, noise-sensitive, or require manual expert analysis, making real-time clinical diagnostics impractical. The key solution is a vision-transformer model trained on a hybrid of physiologically relevant synthetic data and experimental tear-film data: it exploits long-range spatial correlations to resolve phase ambiguities and reconstructs temporally coherent thickness profiles from dynamic interferograms in a single forward pass, enabling automated, accurate, real-time thickness reconstruction on consumer hardware and overcoming the limits of conventional phase unwrapping and iterative fitting.

Link: https://arxiv.org/abs/2510.25157
Authors: Gautam A. Viruthagiri, Arnuv Tandon, Gerald G. Fuller, Vinny Chandran Suja
Institutions: John Marshall High School; Stanford University; Harvard University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 2 figures, will be updated

Abstract:Thin film interferometry is a powerful technique for non-invasively measuring liquid film thickness with applications in ophthalmology, but its clinical translation is hindered by the challenges in reconstructing thickness profiles from interference patterns - an ill-posed inverse problem complicated by phase periodicity, imaging noise and ambient artifacts. Traditional reconstruction methods are either computationally intensive, sensitive to noise, or require manual expert analysis, which is impractical for real-time diagnostics. To address this challenge, here we present a vision transformer-based approach for real-time inference of thin liquid film thickness profiles directly from isolated interferograms. Trained on a hybrid dataset combining physiologically-relevant synthetic and experimental tear film data, our model leverages long-range spatial correlations to resolve phase ambiguities and reconstruct temporally coherent thickness profiles in a single forward pass from dynamic interferograms acquired in vivo and ex vivo. The network demonstrates state-of-the-art performance on noisy, rapidly-evolving films with motion artifacts, overcoming limitations of conventional phase-unwrapping and iterative fitting methods. Our data-driven approach enables automated, consistent thickness reconstruction at real-time speeds on consumer hardware, opening new possibilities for continuous monitoring of pre-lens ocular tear films and non-invasive diagnosis of conditions such as the dry eye disease.

[CV-38] EA3D: Online Open-World 3D Object Extraction from Streaming Videos NEURIPS2025

【Quick Read】: This paper addresses the reliance of current 3D scene understanding methods on offline-collected multi-view data or pre-built 3D geometry, which blocks real-time use in dynamic, open-world environments. The proposed ExtractAnything3D (EA3D) is a unified online framework that performs geometric reconstruction and holistic scene understanding simultaneously. Its key ideas: each frame of a streaming video is interpreted with vision-language and 2D vision foundation encoders to extract object-level knowledge, which is embedded into a Gaussian feature map via a feed-forward online update strategy; visual odometry is iteratively estimated from historical frames to incrementally update the online Gaussian features; and a recurrent joint optimization module steers attention to regions of interest, jointly improving geometric accuracy and semantic understanding. The framework supports a broad range of downstream tasks, including photo-realistic rendering, semantic and instance segmentation, 3D bounding box and semantic occupancy estimation, and 3D mesh generation.

Link: https://arxiv.org/abs/2510.25146
Authors: Xiaoyu Zhou, Jingqi Wang, Yuang Jia, Yongtao Wang, Deqing Sun, Ming-Hsuan Yang
Institutions: Peking University; Google DeepMind; University of California, Merced
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025)

Abstract:Current 3D scene understanding methods are limited by offline-collected multi-view data or pre-constructed 3D geometry. In this paper, we present ExtractAnything3D (EA3D), a unified online framework for open-world 3D object extraction that enables simultaneous geometric reconstruction and holistic scene understanding. Given a streaming video, EA3D dynamically interprets each frame using vision-language and 2D vision foundation encoders to extract object-level knowledge. This knowledge is integrated and embedded into a Gaussian feature map via a feed-forward online update strategy. We then iteratively estimate visual odometry from historical frames and incrementally update online Gaussian features with new observations. A recurrent joint optimization module directs the model’s attention to regions of interest, simultaneously enhancing both geometric reconstruction and semantic understanding. Extensive experiments across diverse benchmarks and tasks, including photo-realistic rendering, semantic and instance segmentation, 3D bounding box and semantic occupancy estimation, and 3D mesh generation, demonstrate the effectiveness of EA3D. Our method establishes a unified and efficient framework for joint online 3D reconstruction and holistic scene understanding, enabling a broad range of downstream tasks.

[CV-39] Revisiting Reconstruction-based AI-generated Image Detection: A Geometric Perspective

【Quick Read】: This paper addresses the lack of theoretical grounding in existing reconstruction-based AI-generated image detectors, whose reliance on empirical heuristics hurts interpretability and reliability. The key contribution is ReGap, a training-free method that computes a dynamic reconstruction error: structured editing operations introduce controlled perturbations, and the change in reconstruction error before and after editing enlarges the error separation between real and generated images. This improves detection accuracy while remaining robust to common post-processing operations and generalizing across diverse conditions.

Link: https://arxiv.org/abs/2510.25141
Authors: Wan Jiang, Jing Yan, Ruixuan Zhang, Xiaojing Chen, Changtao Miao, Zhe Li, Chenhao Lin, Yunfeng Diao, Richang Hong
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The rise of generative Artificial Intelligence (AI) has made detecting AI-generated images a critical challenge for ensuring authenticity. Existing reconstruction-based methods lack theoretical foundations and rely on empirical heuristics, limiting interpretability and reliability. In this paper, we introduce the Jacobian-Spectral Lower Bound for reconstruction error from a geometric perspective, showing that real images off the reconstruction manifold exhibit a non-trivial error lower bound, while generated images on the manifold have near-zero error. Furthermore, we reveal the limitations of existing methods that rely on static reconstruction error from a single pass. These methods often fail when some real images exhibit lower error than generated ones. This counterintuitive behavior reduces detection accuracy and requires data-specific threshold tuning, limiting their applicability in real-world scenarios. To address these challenges, we propose ReGap, a training-free method that computes dynamic reconstruction error by leveraging structured editing operations to introduce controlled perturbations. This enables measuring error changes before and after editing, improving detection accuracy by enhancing error separation. Experimental results show that our method outperforms existing baselines, exhibits robustness to common post-processing operations and generalizes effectively across diverse conditions.
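
The before/after-editing comparison at the heart of ReGap can be sketched schematically. Below, `reconstruct` stands for one pass through a reconstruction model and `edit` for a structured editing operation; both are placeholders, and the difference-based score is one plausible reading of the idea, not the paper's exact scoring rule.

```python
import torch

def dynamic_reconstruction_error(x, reconstruct, edit):
    """Score an image by how much editing changes its reconstruction error.

    Intuition from the paper: generated images lie near the reconstruction
    manifold, so their error stays small before and after a controlled
    perturbation, while real images show a larger shift.
    """
    err_before = (x - reconstruct(x)).abs().mean()
    x_edited = edit(x)
    err_after = (x_edited - reconstruct(x_edited)).abs().mean()
    return (err_after - err_before).item()
```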

[CV-40] DINO-YOLO: Self-Supervised Pre-training for Data-Efficient Object Detection in Civil Engineering Applications

【Quick Read】: This paper addresses the performance bottleneck in civil-engineering object detection caused by limited annotated data in specialized scenes. The key contribution is DINO-YOLO, a hybrid architecture that integrates features from the self-supervised DINOv3 vision transformer into YOLOv12 at two locations: input preprocessing (P0) and mid-backbone enhancement (P3), yielding data-efficient detection. Experiments show substantial gains over baselines on several small civil-engineering datasets while preserving real-time inference (30-47 FPS), providing a deployable solution for construction safety monitoring and infrastructure inspection.

Link: https://arxiv.org/abs/2510.25140
Authors: Malaisree P, Youwai S, Kitkobsin T, Janrungautai S, Amorndechaphon D, Rojanavasu P
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Object detection in civil engineering applications is constrained by limited annotated data in specialized domains. We introduce DINO-YOLO, a hybrid architecture combining YOLOv12 with DINOv3 self-supervised vision transformers for data-efficient detection. DINOv3 features are strategically integrated at two locations: input preprocessing (P0) and mid-backbone enhancement (P3). Experimental validation demonstrates substantial improvements: Tunnel Segment Crack detection (648 images) achieves 12.4% improvement, Construction PPE (1K images) gains 13.7%, and KITTI (7K images) shows 88.6% improvement, while maintaining real-time inference (30-47 FPS). Systematic ablation across five YOLO scales and nine DINOv3 variants reveals that Medium-scale architectures achieve optimal performance with DualP0P3 integration (55.77% mAP@0.5), while Small-scale requires Triple Integration (53.63%). The 2-4x inference overhead (21-33ms versus 8-16ms baseline) remains acceptable for field deployment on NVIDIA RTX 5090. DINO-YOLO establishes state-of-the-art performance for civil engineering datasets (10K images) while preserving computational efficiency, providing practical solutions for construction safety monitoring and infrastructure inspection in data-constrained environments.

[CV-41] Region-CAM: Towards Accurate Object Regions in Class Activation Maps for Weakly Supervised Learning Tasks

【Quick Read】: This paper addresses the incomplete activation coverage and poor boundary alignment of conventional Class Activation Mapping (CAM) in weakly supervised learning, which is especially harmful for Weakly Supervised Semantic Segmentation (WSSS), a task that demands pixel-accurate activation maps. The key solution is a new activation method, Region-CAM, which extracts semantic information maps (SIMs) and performs semantic information propagation (SIP) using both gradients and features at every stage of the baseline classification model, producing region-level activations that cover a larger portion of the object and align closely with its boundaries.

Link: https://arxiv.org/abs/2510.25134
Authors: Qingdong Cai, Charith Abhayaratne
Institutions: The University of Sheffield
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint for journal paper

Abstract:Class Activation Mapping (CAM) methods are widely applied in weakly supervised learning tasks due to their ability to highlight object regions. However, conventional CAM methods highlight only the most discriminative regions of the target. These highlighted regions often fail to cover the entire object and are frequently misaligned with object boundaries, thereby limiting the performance of downstream weakly supervised learning tasks, particularly Weakly Supervised Semantic Segmentation (WSSS), which demands pixel-wise accurate activation maps to get the best results. To alleviate the above problems, we propose a novel activation method, Region-CAM. Distinct from network feature weighting approaches, Region-CAM generates activation maps by extracting semantic information maps (SIMs) and performing semantic information propagation (SIP) by considering both gradients and features in each of the stages of the baseline classification model. Our approach highlights a greater proportion of object regions while ensuring activation maps to have precise boundaries that align closely with object edges. Region-CAM achieves 60.12% and 58.43% mean intersection over union (mIoU) using the baseline model on the PASCAL VOC training and validation datasets, respectively, which are improvements of 13.61% and 13.13% over the original CAM (46.51% and 45.30%). On the MS COCO validation set, Region-CAM achieves 36.38%, a 16.23% improvement over the original CAM (20.15%). We also demonstrate the superiority of Region-CAM in object localization tasks, using the ILSVRC2012 validation set. Region-CAM achieves 51.7% in Top-1 Localization accuracy Loc1. Compared with LayerCAM, an activation method designed for weakly supervised object localization, Region-CAM achieves 4.5% better performance in Loc1.
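
As background for what Region-CAM improves on, the original CAM is a weighted sum of the final convolutional feature maps. A minimal sketch, assuming a classifier with global average pooling followed by a linear head:

```python
import torch
import torch.nn.functional as F

def class_activation_map(features, fc_weight, class_idx):
    """Original CAM: weight the last conv feature maps by the FC weights.

    features: (C, H, W) activations before global average pooling;
    fc_weight: (num_classes, C) weights of the linear classifier.
    Region-CAM instead propagates semantic information through all stages,
    but this baseline shows what it is compared against.
    """
    cam = torch.einsum("c,chw->hw", fc_weight[class_idx], features)
    cam = F.relu(cam)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
```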

[CV-42] AtlasGS: Atlanta-world Guided Surface Reconstruction with Implicit Structured Gaussians NEURIPS2025

【Quick Read】: This paper addresses inconsistent geometry reconstruction in low-texture regions of indoor and urban scenes, as well as the trade-off between detail preservation and efficiency in existing Gaussian Splatting (GS) and implicit SDF-field methods. The key solution is an Atlanta-world-guided, implicit-structured Gaussian Splatting: the Atlanta-world model ensures accurate surface reconstruction in low-texture regions, while a semantic GS representation that predicts the probability of semantic regions, combined with a structure-plane regularization using learnable plane indicators, maintains global geometric consistency, smooth reconstruction, high-frequency detail, and rendering efficiency.

Link: https://arxiv.org/abs/2510.25129
Authors: Xiyu Zhang, Chong Bao, Yipeng Chen, Hongjia Zhai, Yitong Dong, Hujun Bao, Zhaopeng Cui, Guofeng Zhang
Institutions: State Key Lab of CAD & CG, Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 11 figures. NeurIPS 2025; Project page: this https URL

Abstract:3D reconstruction of indoor and urban environments is a prominent research topic with various downstream applications. However, existing geometric priors for addressing low-texture regions in indoor and urban settings often lack global consistency. Moreover, Gaussian Splatting and implicit SDF fields often suffer from discontinuities or exhibit computational inefficiencies, resulting in a loss of detail. To address these issues, we propose an Atlanta-world guided implicit-structured Gaussian Splatting that achieves smooth indoor and urban scene reconstruction while preserving high-frequency details and rendering efficiency. By leveraging the Atlanta-world model, we ensure the accurate surface reconstruction for low-texture regions, while the proposed novel implicit-structured GS representations provide smoothness without sacrificing efficiency and high-frequency details. Specifically, we propose a semantic GS representation to predict the probability of all semantic regions and deploy a structure plane regularization with learnable plane indicators for global accurate surface reconstruction. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in both indoor and urban scenes, delivering superior surface reconstruction quality.

[CV-43] Visual Diversity and Region-aware Prompt Learning for Zero-shot HOI Detection NEURIPS2025

【Quick Read】: This paper addresses two visual challenges in zero-shot human-object interaction (HOI) detection: intra-class visual diversity, where instances of the same verb appear in diverse poses and contexts, and inter-class visual entanglement, where distinct verbs produce visually similar patterns. The proposed VDRP framework has two key parts: 1) visual diversity-aware prompt learning that injects group-wise visual variance into the context embedding, with Gaussian perturbation encouraging the prompts to capture diverse visual variations of a verb; and 2) region-specific concepts retrieved from the human, object, and union regions, used to augment the prompt embeddings into region-aware prompts with stronger verb-level discrimination. VDRP achieves state-of-the-art performance under four zero-shot evaluation settings on the HICO-DET benchmark, effectively mitigating both challenges.

Link: https://arxiv.org/abs/2510.25094
Authors: Chanhyeong Yang, Taehoon Song, Jihwan Park, Hyunwoo J. Kim
Institutions: Korea University; Korea Advanced Institute of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2025

Abstract:Zero-shot Human-Object Interaction detection aims to localize humans and objects in an image and recognize their interaction, even when specific verb-object pairs are unseen during training. Recent works have shown promising results using prompt learning with pretrained vision-language models such as CLIP, which align natural language prompts with visual features in a shared embedding space. However, existing approaches still fail to handle the visual complexity of interaction, including (1) intra-class visual diversity, where instances of the same verb appear in diverse poses and contexts, and (2) inter-class visual entanglement, where distinct verbs yield visually similar patterns. To address these challenges, we propose VDRP, a framework for Visual Diversity and Region-aware Prompt learning. First, we introduce a visual diversity-aware prompt learning strategy that injects group-wise visual variance into the context embedding. We further apply Gaussian perturbation to encourage the prompts to capture diverse visual variations of a verb. Second, we retrieve region-specific concepts from the human, object, and union regions. These are used to augment the diversity-aware prompt embeddings, yielding region-aware prompts that enhance verb-level discrimination. Experiments on the HICO-DET benchmark demonstrate that our method achieves state-of-the-art performance under four zero-shot evaluation settings, effectively addressing both intra-class diversity and inter-class visual entanglement. Code is available at this https URL.

[CV-44] PSTF-AttControl: Per-Subject-Tuning-Free Personalized Image Generation with Controllable Face Attributes

【Quick Read】: This paper addresses the tension in personalized image generation between facial identity fidelity and precise control of facial attributes, particularly in a per-subject-tuning-free (PSTF) setting. Existing methods either rely on tuning-based pipelines (e.g., PreciseControl) to obtain fine-grained control, limiting accessibility, or generate images PSTF-style but without precise attribute control. The key ideas: a face recognition model extracts identity features, which the e4e encoder maps into StyleGAN2's W+ latent space to preserve identity; and a Triplet-Decoupled Cross-Attention module injects identity features, attribute features, and text embeddings into the UNet while cleanly separating identity from attribute information. Trained on FFHQ, the method generates personalized images with fine-grained attribute control without any additional fine-tuning or training data for individual identities.

Link: https://arxiv.org/abs/2510.25084
Authors: Xiang Liu, Zhaoxiang Liu, Huan Hu, Zipeng Wang, Ping Chen, Zezhou Chen, Kai Wang, Shiguo Lian
Institutions: China Unicom
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by Image and Vision Computing (18 pages, 8 figures)

Abstract:Recent advancements in personalized image generation have significantly improved facial identity preservation, particularly in fields such as entertainment and social media. However, existing methods still struggle to achieve precise control over facial attributes in a per-subject-tuning-free (PSTF) way. Tuning-based techniques like PreciseControl have shown promise by providing fine-grained control over facial features, but they often require extensive technical expertise and additional training data, limiting their accessibility. In contrast, PSTF approaches simplify the process by enabling image generation from a single facial input, but they lack precise control over facial attributes. In this paper, we introduce a novel, PSTF method that enables both precise control over facial attributes and high-fidelity preservation of facial identity. Our approach utilizes a face recognition model to extract facial identity features, which are then mapped into the W^+ latent space of StyleGAN2 using the e4e encoder. We further enhance the model with a Triplet-Decoupled Cross-Attention module, which integrates facial identity, attribute features, and text embeddings into the UNet architecture, ensuring clean separation of identity and attribute information. Trained on the FFHQ dataset, our method allows for the generation of personalized images with fine-grained control over facial attributes, while without requiring additional fine-tuning or training data for individual identities. We demonstrate that our approach successfully balances personalization with precise facial attribute control, offering a more efficient and user-friendly solution for high-quality, adaptable facial image synthesis. The code is publicly available at this https URL.

[CV-45] Neighborhood Feature Pooling for Remote Sensing Image Classification WACV2026

【Quick Read】: This paper addresses the limited efficiency and expressiveness of texture feature extraction in remote sensing image classification, where existing methods struggle to capture spatial relationships within local regions and to aggregate feature similarities efficiently. The key contribution is the Neighborhood Feature Pooling (NFP) layer, implemented with convolutional layers, which models relationships between neighboring inputs and aggregates local similarities across feature dimensions; it can be dropped into any architecture and consistently improves classification performance with minimal parameter overhead.

Link: https://arxiv.org/abs/2510.25077
Authors: Fahimeh Orvati Nia, Amirmohammad Mohammadi, Salim Al Kharsa, Pragati Naikare, Zigfried Hampel-Arias, Joshua Peeples
Institutions: Texas A&M University; Los Alamos National Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 9 pages, 5 figures. Accepted to WACV 2026 (Winter Conference on Applications of Computer Vision)

Abstract:In this work, we propose neighborhood feature pooling (NFP) as a novel texture feature extraction method for remote sensing image classification. The NFP layer captures relationships between neighboring inputs and efficiently aggregates local similarities across feature dimensions. Implemented using convolutional layers, NFP can be seamlessly integrated into any network. Results comparing the baseline models and the NFP method indicate that NFP consistently improves performance across diverse datasets and architectures while maintaining minimal parameter overhead.
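
While the paper implements NFP with convolutional layers, the underlying operation, aggregating similarities between each position and its neighbors, can be sketched directly with `unfold`. The kernel size and the cosine-similarity choice below are illustrative assumptions, not the paper's exact layer.

```python
import torch
import torch.nn.functional as F

def neighborhood_similarity_pool(x, k=3):
    """Aggregate cosine similarity between each pixel and its k*k neighbors.

    x: (B, C, H, W) feature map. Output: (B, k*k, H, W) local-similarity
    features, one channel per neighbor offset. A sketch of the idea behind
    neighborhood feature pooling, not the published implementation.
    """
    B, C, H, W = x.shape
    patches = F.unfold(x, kernel_size=k, padding=k // 2)   # (B, C*k*k, H*W)
    patches = patches.view(B, C, k * k, H * W)
    center = F.normalize(x.view(B, C, 1, H * W), dim=1)
    neigh = F.normalize(patches, dim=1)
    sim = (center * neigh).sum(dim=1)                       # (B, k*k, H*W)
    return sim.view(B, k * k, H, W)
```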

[CV-46] Vision-Language Integration for Zero-Shot Scene Understanding in Real-World Environments

【Quick Read】: This paper addresses zero-shot scene understanding in real-world settings, where models must recognize unseen objects, actions, and contexts without labeled examples, a regime in which traditional methods generalize poorly given the complexity and variability of natural scenes. The key solution is a vision-language integration framework that unifies pretrained visual encoders (e.g., CLIP, ViT) with large language models (GPT-based architectures) to achieve semantic alignment across modalities: visual inputs and textual prompts are embedded into a shared space, followed by multimodal fusion and reasoning layers for contextual interpretation, substantially improving zero-shot object recognition, activity detection, and scene captioning.

Link: https://arxiv.org/abs/2510.25070
Authors: Manjunath Prasad Holenarasipura Rajiv, B. M. Vidyavathi
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint under review at IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2025

Abstract:Zero-shot scene understanding in real-world settings presents major challenges due to the complexity and variability of natural scenes, where models must recognize new objects, actions, and contexts without prior labeled examples. This work proposes a vision-language integration framework that unifies pre-trained visual encoders (e.g., CLIP, ViT) and large language models (e.g., GPT-based architectures) to achieve semantic alignment between visual and textual modalities. The goal is to enable robust zero-shot comprehension of scenes by leveraging natural language as a bridge to generalize over unseen categories and contexts. Our approach develops a unified model that embeds visual inputs and textual prompts into a shared space, followed by multimodal fusion and reasoning layers for contextual interpretation. Experiments on Visual Genome, COCO, ADE20K, and custom real-world datasets demonstrate significant gains over state-of-the-art zero-shot models in object recognition, activity detection, and scene captioning. The proposed system achieves up to 18% improvement in top-1 accuracy and notable gains in semantic coherence metrics, highlighting the effectiveness of cross-modal alignment and language grounding in enhancing generalization for real-world scene understanding.
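
The shared embedding space this line of work builds on can be exercised with off-the-shelf CLIP. A minimal zero-shot scene classification sketch using Hugging Face transformers; the model checkpoint, labels, and file path are illustrative, not the paper's setup.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a busy street", "a kitchen", "a forest trail"]  # illustrative classes
image = Image.open("scene.jpg")                            # hypothetical input

inputs = processor(text=[f"a photo of {l}" for l in labels],
                   images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image             # image-text similarity
probs = logits.softmax(dim=-1)
print(labels[int(probs.argmax())])
```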

[CV-47] DRIP: Dynamic patch Reduction via Interpretable Pooling

[Quick Read]: This paper addresses the low computational efficiency and high cost of pretraining vision-language models, particularly when training from scratch. The key to the solution is Dynamic patch Reduction via Interpretable Pooling (DRIP), which adapts to the input image and dynamically merges tokens in the deeper layers of the visual encoder, adjusting the granularity of the feature representation and thereby substantially reducing compute (GFLOPs) while keeping classification and zero-shot transfer performance comparable to the baselines.

Link: https://arxiv.org/abs/2510.25067
Authors: Yusen Peng, Sachin Kumar
Institutions: The Ohio State University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, the advances in vision-language models, including contrastive pretraining and instruction tuning, have greatly pushed the frontier of multimodal AI. However, owing to the large-scale and hence expensive pretraining, the efficiency concern has discouraged researchers from attempting to pretrain a vision language model from scratch. In this work, we propose Dynamic patch Reduction via Interpretable Pooling (DRIP), which adapts to the input images and dynamically merges tokens in the deeper layers of a visual encoder. Our results on both ImageNet training from scratch and CLIP contrastive pretraining demonstrate a significant GFLOP reduction while maintaining comparable classification/zero-shot performance. To further validate our proposed method, we conduct continual pretraining on a large biology dataset, extending its impact into scientific domains.
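
To make the idea of dynamic token merging concrete, here is a toy sketch that averages the most mutually similar neighboring tokens in a sequence; the pairing rule and merge ratio are illustrative assumptions, not DRIP's actual mechanism.

```python
# A minimal sketch of similarity-based token merging of the kind the abstract
# describes (dynamically merging tokens in deeper encoder layers).
import torch
import torch.nn.functional as F

def merge_tokens(x: torch.Tensor, keep_ratio: float = 0.75) -> torch.Tensor:
    # x: (B, N, D) token sequence from a vision encoder layer
    B, N, D = x.shape
    n_merge = int(N * (1 - keep_ratio))          # how many tokens to absorb
    sim = F.cosine_similarity(x[:, :-1], x[:, 1:], dim=-1)  # (B, N-1) neighbor similarity
    idx = sim.topk(n_merge, dim=-1).indices      # most redundant neighbor pairs
    out = []
    for b in range(B):
        merged = x[b].clone()
        drop = set()
        for i in sorted(idx[b].tolist()):
            if i in drop or (i + 1) in drop:
                continue                          # avoid chaining merges
            merged[i] = (merged[i] + merged[i + 1]) / 2
            drop.add(i + 1)
        keep = [j for j in range(N) if j not in drop]
        out.append(merged[keep])
    return torch.nn.utils.rnn.pad_sequence(out, batch_first=True)

tokens = torch.randn(2, 196, 768)
print(merge_tokens(tokens).shape)                # roughly (2, ~147, 768)
```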

[CV-48] Auto3DSeg for Brain Tumor Segmentation from 3D MRI in BraTS 2023 Challenge

[Quick Read]: This paper addresses the cluster of challenging tasks in the BraTS 2023 brain tumor segmentation benchmark, covering precise segmentation of brain metastases, meningiomas, the BraTS-Africa cohort, and adult and pediatric gliomas. The key to the solution is Auto3DSeg, an automated deep learning approach from the MONAI framework whose end-to-end model search and optimization strategy delivers highly accurate segmentation across these diverse tumor morphologies, winning first place in three of the five challenges and second place in the remaining two.

Link: https://arxiv.org/abs/2510.25058
Authors: Andriy Myronenko, Dong Yang, Yufan He, Daguang Xu
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: BraTS23 winner

Abstract:In this work, we describe our solution to the BraTS 2023 cluster of challenges using Auto3DSeg from MONAI. We participated in all 5 segmentation challenges, and achieved the 1st place results in three of them: the Brain Metastasis, Brain Meningioma, and BraTS-Africa challenges, and the 2nd place results in the remaining two: the Adult and Pediatric Glioma challenges.

[CV-49] Breast Cancer VLMs: Clinically Practical Vision-Language Train-Inference Models ICCV2025

[Quick Read]: This paper addresses two key obstacles to the clinical deployment of computer-aided diagnosis (CAD) systems for breast cancer: the difficulty of fusing multi-modal data (2D mammograms with structured clinical text) and the reliance on prior patient history, which limits real-world feasibility. The key to the solution is a novel multimodal framework whose tokenization modules convert easily accessible clinical metadata and synthesized radiology reports into structured textual features, which are then strategically fused with visual features extracted by convolutional networks (ConvNets). The design preserves high-resolution image processing while markedly improving cancer detection and calcification identification, outperforming vision transformer (ViT)-based models, generalizing across populations, and remaining clinically practical, offering a new paradigm for vision-language model (VLM)-based CAD systems.

Link: https://arxiv.org/abs/2510.25051
Authors: Shunjie-Fabian Zheng, Hyeonjun Lee, Thijs Kooi, Ali Diba
Institutions: LMU University Hospital, LMU Munich, Germany; Lunit Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to Computer Vision for Automated Medical Diagnosis (CVAMD) Workshop at ICCV 2025

Abstract:Breast cancer remains the most commonly diagnosed malignancy among women in the developed world. Early detection through mammography screening plays a pivotal role in reducing mortality rates. While computer-aided diagnosis (CAD) systems have shown promise in assisting radiologists, existing approaches face critical limitations in clinical deployment - particularly in handling the nuanced interpretation of multi-modal data and feasibility due to the requirement of prior clinical history. This study introduces a novel framework that synergistically combines visual features from 2D mammograms with structured textual descriptors derived from easily accessible clinical metadata and synthesized radiological reports through innovative tokenization modules. Our proposed methods in this study demonstrate that strategic integration of convolutional neural networks (ConvNets) with language representations achieves superior performance to vision transformer-based models while handling high-resolution images and enabling practical deployment across diverse populations. By evaluating it on multi-national cohort screening mammograms, our multi-modal approach achieves superior performance in cancer detection and calcification identification compared to unimodal baselines. The proposed method establishes a new paradigm for developing clinically viable VLM-based CAD systems that effectively leverage imaging data and contextual patient information through effective fusion mechanisms.

[CV-50] Efficient License Plate Recognition via Pseudo-Labeled Supervision with Grounding DINO and YOLOv8

[Quick Read]: This paper addresses the accuracy of automatic license plate recognition (ALPR) systems under difficult conditions such as variable lighting, rain and dust, high vehicle speeds, varying camera angles, and low image quality or resolution. The key to the solution is a YOLOv8-based deep learning strategy for plate detection and recognition within a semi-supervised framework: a small set of manually labeled data is combined with pseudo-labels generated by Grounding DINO, reducing reliance on labor-intensive annotation while improving model performance. Grounding DINO, a powerful vision-language model, automatically annotates large numbers of images with license plate bounding boxes, efficiently scaling a high-quality training set while preserving label accuracy, which significantly improves training and overall recognition performance.

Link: https://arxiv.org/abs/2510.25032
Authors: Zahra Ebrahimi Vargoorani, Amir Mohammad Ghoreyshi, Ching Yee Suen
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages, 8 figures. Presented at 2025 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), August 31 - September 3, 2025, Istanbul, Turkey

Abstract:Developing a highly accurate automatic license plate recognition system (ALPR) is challenging due to environmental factors such as lighting, rain, and dust. Additional difficulties include high vehicle speeds, varying camera angles, and low-quality or low-resolution images. ALPR is vital in traffic control, parking, vehicle tracking, toll collection, and law enforcement applications. This paper proposes a deep learning strategy using YOLOv8 for license plate detection and recognition tasks. This method seeks to enhance the performance of the model using datasets from Ontario, Quebec, California, and New York State. It achieved an impressive recall rate of 94% on the dataset from the Center for Pattern Recognition and Machine Intelligence (CENPARMI) and 91% on the UFPR-ALPR dataset. In addition, our method follows a semi-supervised learning framework, combining a small set of manually labeled data with pseudo-labels generated by Grounding DINO to train our detection model. Grounding DINO, a powerful vision-language model, automatically annotates many images with bounding boxes for license plates, thereby minimizing the reliance on labor-intensive manual labeling. By integrating human-verified and model-generated annotations, we can scale our dataset efficiently while maintaining label quality, which significantly enhances the training process and overall model performance. Furthermore, it reports character error rates for both datasets, providing additional insight into system performance.
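
The semi-supervised recipe, keeping all human labels and admitting only confident pseudo-labels into the training pool, can be sketched in a few lines. The Box structure and the 0.6 threshold below are hypothetical choices for illustration, not values from the paper.

```python
# A small sketch of confidence-filtered pseudo-label pooling for detector training.
from dataclasses import dataclass

@dataclass
class Box:
    image_id: str
    xyxy: tuple        # (x1, y1, x2, y2)
    score: float       # annotator/detector confidence
    source: str        # "human" or "grounding_dino"

def build_training_pool(human_boxes, dino_boxes, min_score=0.6):
    """Keep all human labels; keep only confident pseudo-labels."""
    pool = list(human_boxes)
    pool += [b for b in dino_boxes if b.score >= min_score]
    return pool

human = [Box("img_001", (40, 60, 180, 110), 1.0, "human")]
dino = [Box("img_002", (55, 70, 200, 120), 0.82, "grounding_dino"),
        Box("img_003", (10, 20, 90, 55), 0.41, "grounding_dino")]  # dropped
print(len(build_training_pool(human, dino)))   # 2 boxes go to YOLOv8 training
```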

[CV-51] Resi-VidTok: An Efficient and Decomposed Progressive Tokenization Framework for Ultra-Low-Rate and Lightweight Video Transmission

[Quick Read]: This paper addresses efficient, robust real-time video transmission over wireless networks under severe channel conditions (limited bandwidth and weak connectivity), in particular preserving perceptual and semantic consistency at ultra-low bitrates. The key to the solution is Resi-VidTok, whose resilient 1D video tokenization pipeline reorganizes spatio-temporal content into an importance-ordered stream of key and refinement tokens, enabling progressive encoding, prefix-decodable reconstruction, and graceful quality degradation over constrained channels. Stride-controlled frame sparsification paired with a lightweight decoder-side interpolator reduces transmission load while preserving motion continuity, and a channel-adaptive source-channel coding and modulation scheme dynamically allocates rate and protection by token importance and channel state, sustaining stable visual and semantic consistency at channel bandwidth ratios (CBR) as low as 0.0004 with real-time reconstruction above 30 fps.

Link: https://arxiv.org/abs/2510.25002
Authors: Zhenyu Liu, Yi Ma, Rahim Tafazolli, Zhi Ding
Institutions: 5GIC & 6GIC, Institute for Communication Systems (ICS), University of Surrey; Department of Electrical and Computer Engineering, University of California at Davis
Subjects: Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Comments:

Abstract:Real-time transmission of video over wireless networks remains highly challenging, even with advanced deep models, particularly under severe channel conditions such as limited bandwidth and weak connectivity. In this paper, we propose Resi-VidTok, a Resilient Tokenization-Enabled framework designed for ultra-low-rate and lightweight video transmission that delivers strong robustness while preserving perceptual and semantic fidelity on commodity digital hardware. By reorganizing spatio-temporal content into a discrete, importance-ordered token stream composed of key tokens and refinement tokens, Resi-VidTok enables progressive encoding, prefix-decodable reconstruction, and graceful quality degradation under constrained channels. A key contribution is a resilient 1D tokenization pipeline for video that integrates differential temporal token coding, explicitly supporting reliable recovery from incomplete token sets using a single shared framewise decoder, without auxiliary temporal extractors or heavy generative models. Furthermore, stride-controlled frame sparsification combined with a lightweight decoder-side interpolator reduces transmission load while maintaining motion continuity. Finally, a channel-adaptive source-channel coding and modulation scheme dynamically allocates rate and protection according to token importance and channel condition, yielding stable quality across adverse SNRs. Evaluation results indicate robust visual and semantic consistency at channel bandwidth ratios (CBR) as low as 0.0004 and real-time reconstruction at over 30 fps, demonstrating the practicality of Resi-VidTok for energy-efficient, latency-sensitive, and reliability-critical wireless applications.

[CV-52] FT-ARM: Fine-Tuned Agent ic Reflection Multimodal Language Model for Pressure Ulcer Severity Classification with Reasoning

[Quick Read]: This paper addresses the accuracy and consistency of pressure ulcer (PU) severity staging (Stages I-IV), where subtle visual distinctions invite subjective judgment and existing AI models lack interpretability. The key to the solution is FT-ARM (Fine-Tuned Agentic Reflection Multimodal model), a multimodal large language model (MLLM) fine-tuned from LLaMA 3.2 90B with an agentic self-reflection mechanism that, echoing clinicians' diagnostic reassessment, iteratively refines its predictions by reasoning over visual features of the image and clinical knowledge encoded in text. On the public Pressure Injury Image Dataset (PIID), FT-ARM reaches 85% accuracy in classifying stages I-IV, 4% above prior CNN-based models, while supporting live inference and producing clinically grounded natural-language explanations that strengthen reliability, transparency, and trust.

Link: https://arxiv.org/abs/2510.24980
Authors: Reza Saadati Fard, Emmanuel Agu, Palawat Busaranuvong, Deepak Kumar, Shefalika Gautam, Bengisu Tulu, Diane Strong, Lorraine Loretz
Institutions: Worcester Polytechnic Institute; UMass Memorial Health
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Pressure ulcers (PUs) are a serious and prevalent healthcare concern. Accurate classification of PU severity (Stages I-IV) is essential for proper treatment but remains challenging due to subtle visual distinctions and subjective interpretation, leading to variability among clinicians. Prior AI-based approaches using Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) achieved promising accuracy but offered limited interpretability. We present FT-ARM (Fine-Tuned Agentic Reflection Multimodal model), a fine-tuned multimodal large language model (MLLM) with an agentic self-reflection mechanism for pressure ulcer severity classification. Inspired by clinician-style diagnostic reassessment, FT-ARM iteratively refines its predictions by reasoning over visual features and encoded clinical knowledge from text, enhancing both accuracy and consistency. On the publicly available Pressure Injury Image Dataset (PIID), FT-ARM, fine-tuned from LLaMA 3.2 90B, achieved 85% accuracy in classifying PU stages I-IV, surpassing prior CNN-based models by +4%. Unlike earlier CNN/ViT studies that relied solely on offline evaluations, FT-ARM is designed and tested for live inference, reflecting real-time deployment conditions. Furthermore, it produces clinically grounded natural-language explanations, improving interpretability and trust. By integrating fine-tuning and reflective reasoning across multimodal inputs, FT-ARM advances the reliability, transparency, and clinical applicability of automated wound assessment systems, addressing the critical need for consistent and explainable PU staging to support improved patient care.

[CV-53] SCOUT: A Lightweight Framework for Scenario Coverage Assessment in Autonomous Driving

[Quick Read]: This paper addresses two bottlenecks in assessing the scenario coverage of autonomous agents in real-world settings: reliance on expensive human annotation, and inference with computationally heavy large vision-language models (LVLMs), both of which block large-scale deployment. The key to the solution is SCOUT (Scenario Coverage Oversight and Understanding Tool), a lightweight surrogate model distilled from LVLM-generated coverage labels that predicts scenario coverage directly from the agent's latent sensor representations, eliminating continuous LVLM calls and manual labeling. By leveraging precomputed perception features it avoids redundant computation and delivers fast, scalable coverage estimation at high accuracy and much lower computational cost.

Link: https://arxiv.org/abs/2510.24949
Authors: Anil Yildiz, Sarah M. Thornton, Carl Hildebrandt, Sreeja Roy-Singh, Mykel J. Kochenderfer
Institutions: Stanford University; Nuro Inc.
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Assessing scenario coverage is crucial for evaluating the robustness of autonomous agents, yet existing methods rely on expensive human annotations or computationally intensive Large Vision-Language Models (LVLMs). These approaches are impractical for large-scale deployment due to cost and efficiency constraints. To address these shortcomings, we propose SCOUT (Scenario Coverage Oversight and Understanding Tool), a lightweight surrogate model designed to predict scenario coverage labels directly from an agent’s latent sensor representations. SCOUT is trained through a distillation process, learning to approximate LVLM-generated coverage labels while eliminating the need for continuous LVLM inference or human annotation. By leveraging precomputed perception features, SCOUT avoids redundant computations and enables fast, scalable scenario coverage estimation. We evaluate our method across a large dataset of real-life autonomous navigation scenarios, demonstrating that it maintains high accuracy while significantly reducing computational cost. Our results show that SCOUT provides an effective and practical alternative for large-scale coverage analysis. While its performance depends on the quality of LVLM-generated training labels, SCOUT represents a major step toward efficient scenario coverage oversight in autonomous systems.

[CV-54] IBIS: A Powerful Hybrid Architecture for Human Activity Recognition

[Quick Read]: This paper addresses overfitting in Wi-Fi sensing, where models perform well on training data but generalize poorly to new data. The key to the solution is IBIS, a novel hybrid architecture combining Inception-BiLSTM with a support vector machine (SVM): the Inception-BiLSTM extracts and strengthens time-frequency features, and the SVM builds more robust classification boundaries, markedly improving generalization. Experiments on Doppler-derived features show nearly 99% movement recognition accuracy, confirming the approach's effectiveness.

Link: https://arxiv.org/abs/2510.24936
Authors: Alison M. Fernandes, Hermes I. Del Monego, Bruno S. Chang, Anelise Munaretto, Hélder M. Fontes, Rui L. Campos
Institutions: Universidade Tecnológica Federal do Paraná (UTFPR); INESC TEC; Universidade do Porto; Instituto de Engenharia de Sistemas e Computadores, Pesquisa e Desenvolvimento do Brasil (INESC P&D Brasil); Instituto Nacional de Ciência e Tecnologia (INCT) of Intelligent Communications Networks and the Internet of Things (ICoNIoT)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages. 8 figures. Wireless Days Conference, December 2025

Abstract:The increasing interest in Wi-Fi sensing stems from its potential to capture environmental data in a low-cost, non-intrusive way, making it ideal for applications like healthcare, space occupancy analysis, and gesture-based IoT control. However, a major limitation in this field is the common problem of overfitting, where models perform well on training data but fail to generalize to new data. To overcome this, we introduce a novel hybrid architecture that integrates Inception-BiLSTM with a Support Vector Machine (SVM), which we refer to as IBIS. Our IBIS approach is uniquely engineered to improve model generalization and create more robust classification boundaries. By applying this method to Doppler-derived data, we achieve a movement recognition accuracy of nearly 99%. Comprehensive performance metrics and confusion matrices confirm the significant effectiveness of our proposed solution.

[CV-55] Modality-Aware SAM: Sharpness-Aware-Minimization Driven Gradient Modulation for Harmonized Multimodal Learning

[Quick Read]: This paper addresses the problem in multimodal learning where a dominant modality suppresses the others and limits generalization. The key to the solution is Modality-Aware Sharpness-Aware Minimization (M-SAM), a model-agnostic framework with a three-step mechanism per iteration: first, Shapley values quantify each modality's contribution to accuracy to identify the dominant modality; second, the loss landscape is decomposed and the loss modulated to prioritize robustness with respect to the dominant modality; third, the weights are updated by backpropagating the modulated gradients. This preserves the dominant modality's performance while raising the contribution of the non-dominant ones, encouraging exploration and exploitation of complementary features and clearly improving the overall balance and performance of multimodal learning.

Link: https://arxiv.org/abs/2510.24919
Authors: Hossein R. Nowdeh, Jie Ji, Xiaolong Ma, Fatemeh Afghah
Institutions: Clemson University; University of Arizona
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:In multimodal learning, dominant modalities often overshadow others, limiting generalization. We propose Modality-Aware Sharpness-Aware Minimization (M-SAM), a model-agnostic framework that applies to many modalities and supports early and late fusion scenarios. In every iteration, M-SAM optimizes learning in three steps. First, it identifies the dominant modality based on each modality's contribution to accuracy, using Shapley values. Second, it decomposes the loss landscape, that is, it modulates the loss to prioritize the robustness of the model in favor of the dominant modality. Third, M-SAM updates the weights by backpropagating the modulated gradients. This ensures robust learning for the dominant modality while enhancing contributions from others, allowing the model to explore and exploit complementary features that strengthen overall performance. Extensive experiments on four diverse datasets show that M-SAM outperforms the latest state-of-the-art optimization and gradient manipulation methods and significantly balances and improves multimodal learning.
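
Step one, Shapley-based identification of the dominant modality, is exactly computable for the two or three modalities typical of such setups. Below is a minimal sketch, assuming `evaluate(S)` is a placeholder that returns validation accuracy with all modalities outside the subset S masked out; nothing here is taken from the paper's implementation.

```python
# Exact Shapley contributions of modalities to accuracy (feasible for small n).
from itertools import combinations
from math import factorial

def shapley_contributions(modalities, evaluate):
    n = len(modalities)
    phi = {m: 0.0 for m in modalities}
    for m in modalities:
        others = [x for x in modalities if x != m]
        for r in range(n):
            for S in combinations(others, r):
                # Standard Shapley weight |S|!(n-|S|-1)!/n!
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[m] += w * (evaluate(set(S) | {m}) - evaluate(set(S)))
    return phi

# Toy evaluation: audio dominates, video helps a little.
acc = {frozenset(): 0.0, frozenset({"a"}): 0.70, frozenset({"v"}): 0.55,
       frozenset({"a", "v"}): 0.80}
phi = shapley_contributions(["a", "v"], lambda S: acc[frozenset(S)])
print(max(phi, key=phi.get))   # "a" -> treated as the dominant modality
```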

[CV-56] Understanding Multi-View Transformers ICCV2025

[Quick Read]: This paper addresses the limited interpretability of multi-view transformers (such as DUSt3R) on 3D vision tasks, whose black-box nature hinders use in safety- and reliability-critical settings and complicates improvements beyond data scaling. The key to the solution is to probe and visualize 3D representations from the features carried by the model's residual connections, revealing how the latent state evolves across blocks, what the individual layers do, and how the model differs from methods with stronger inductive biases of explicit global pose. The analysis also finds that the investigated DUSt3R variant estimates correspondences that are refined with the reconstructed geometry, offering a new handle for understanding and improving such models.

Link: https://arxiv.org/abs/2510.24907
Authors: Michal Stary, Julien Gaubil, Ayush Tewari, Vincent Sitzmann
Institutions: TUM; Claude Bernard University Lyon 1; University of Cambridge; MIT CSAIL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Presented at the ICCV 2025 E2E3D Workshop

Abstract:Multi-view transformers such as DUSt3R are revolutionizing 3D vision by solving 3D tasks in a feed-forward manner. However, contrary to previous optimization-based pipelines, the inner mechanisms of multi-view transformers are unclear. Their black-box nature makes further improvements beyond data scaling challenging and complicates usage in safety- and reliability-critical applications. Here, we present an approach for probing and visualizing 3D representations from the residual connections of the multi-view transformers’ layers. In this manner, we investigate a variant of the DUSt3R model, shedding light on the development of its latent state across blocks, the role of the individual layers, and suggest how it differs from methods with stronger inductive biases of explicit global pose. Finally, we show that the investigated variant of DUSt3R estimates correspondences that are refined with reconstructed geometry. The code used for the analysis is available at this https URL .

[CV-57] VividCam: Learning Unconventional Camera Motions from Virtual Synthetic Videos

[Quick Read]: This paper addresses the poor generalization of generative video models to unconventional camera motions, which matters for creating truly original and artistic videos. Existing approaches are limited by the difficulty of collecting real training videos containing the intended uncommon camera moves. The key to the solution is the VividCam training paradigm, which introduces multiple disentanglement strategies so that diffusion models can learn complex camera motions from synthetic videos, removing the reliance on real video collection; the design isolates camera-motion learning from synthetic appearance artifacts, yielding more robust motion representations and mitigating domain shift.

Link: https://arxiv.org/abs/2510.24904
Authors: Qiucheng Wu, Handong Zhao, Zhixin Shu, Jing Shi, Yang Zhang, Shiyu Chang
Institutions: UC, Santa Barbara; Adobe Research; MIT-IBM Watson AI Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 9 figures

Abstract:Although recent text-to-video generative models are getting more capable of following external camera controls, imposed by either text descriptions or camera trajectories, they still struggle to generalize to unconventional camera motions, which is crucial in creating truly original and artistic videos. The challenge lies in the difficulty of finding sufficient training videos with the intended uncommon camera motions. To address this challenge, we propose VividCam, a training paradigm that enables diffusion models to learn complex camera motions from synthetic videos, releasing the reliance on collecting realistic training videos. VividCam incorporates multiple disentanglement strategies that isolates camera motion learning from synthetic appearance artifacts, ensuring more robust motion representation and mitigating domain shift. We demonstrate that our design synthesizes a wide range of precisely controlled and complex camera motions using surprisingly simple synthetic data. Notably, this synthetic data often consists of basic geometries within a low-poly 3D scene and can be efficiently rendered by engines like Unity. Our video results can be found in this https URL .

[CV-58] Pixels to Signals: A Real-Time Framework for Traffic Demand Estimation

[Quick Read]: This paper addresses urban traffic congestion with the goal of optimizing traffic flow and reducing delays. The key to the solution is a video-based vehicle detection method: analyzing consecutive camera frames, the background (the underlying roadway) is computed by temporal averaging of pixel values, the foreground is then extracted, and the DBSCAN density-based clustering algorithm detects the vehicles. The method is computationally efficient and requires minimal infrastructure modification, making it a practical and scalable option for real-world deployment.

Link: https://arxiv.org/abs/2510.24902
Authors: H Mhatre, M Vyas, A Mittal
Institutions: Indian Institute of Technology Bombay
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Traffic congestion is becoming a challenge in the rapidly growing urban cities, resulting in increasing delays and inefficiencies within urban transportation systems. To address this issue a comprehensive methodology is designed to optimize traffic flow and minimize delays. The framework is structured with three primary components: (a) vehicle detection, (b) traffic prediction, and (c) traffic signal optimization. This paper presents the first component, vehicle detection. The methodology involves analyzing multiple sequential frames from a camera feed to compute the background, i.e. the underlying roadway, by averaging pixel values over time. The computed background is then utilized to extract the foreground, where the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm is applied to detect vehicles. With its computational efficiency and minimal infrastructure modification requirements, the proposed methodology offers a practical and scalable solution for real-world deployment.
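
The detection stage as described (temporal-mean background, foreground differencing, DBSCAN clustering) translates almost directly into code. Here is a minimal sketch; the difference threshold and DBSCAN parameters are illustrative assumptions.

```python
# Background averaging + foreground differencing + DBSCAN vehicle clustering.
import numpy as np
from sklearn.cluster import DBSCAN

def detect_vehicles(frames: np.ndarray, thresh: float = 30.0):
    # frames: (T, H, W) grayscale stack from a fixed camera
    background = frames.mean(axis=0)                     # underlying roadway
    foreground = np.abs(frames[-1] - background) > thresh
    ys, xs = np.nonzero(foreground)
    if len(xs) == 0:
        return []
    pts = np.c_[xs, ys]
    labels = DBSCAN(eps=5, min_samples=40).fit_predict(pts)
    boxes = []
    for k in set(labels) - {-1}:                          # -1 = noise
        cluster = pts[labels == k]
        boxes.append((cluster[:, 0].min(), cluster[:, 1].min(),
                      cluster[:, 0].max(), cluster[:, 1].max()))  # x1, y1, x2, y2
    return boxes

frames = (np.random.rand(50, 416, 416) * 255).astype(np.float32)
print(len(detect_vehicles(frames)))
```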

[CV-59] Proper Body Landmark Subset Enables More Accurate and 5X Faster Recognition of Isolated Signs in LIBRAS

[Quick Read]: This paper addresses the real-time bottleneck in isolated sign recognition for Brazilian Sign Language (LIBRAS) caused by the computationally expensive OpenPose landmark extraction. The key to the solution is careful landmark subset selection, which reduces processing time by more than 5x compared to Alves et al. (2024) while matching or exceeding state-of-the-art recognition performance; in addition, spline-based imputation effectively mitigates missing-landmark issues and yields further accuracy gains, enabling an efficient and accurate recognition system.

Link: https://arxiv.org/abs/2510.24887
Authors: Daniele L. V. dos Santos, Thiago B. Pereira, Carlos Eduardo G. R. Alves, Richard J. M. G. Tello, Francisco de A. Boldt, Thiago M. Paixão
Institutions: Federal Institute of Espírito Santo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to Int. Conf. on Computer Vision Theory and Applications (VISAPP 2026)

Abstract:This paper investigates the feasibility of using lightweight body landmark detection for the recognition of isolated signs in Brazilian Sign Language (LIBRAS). Although the skeleton-based approach by Alves et al. (2024) enabled substantial improvements in recognition performance, the use of OpenPose for landmark extraction hindered time performance. In a preliminary investigation, we observed that simply replacing OpenPose with the lightweight MediaPipe, while improving processing speed, significantly reduced accuracy. To overcome this limitation, we explored landmark subset selection strategies aimed at optimizing recognition performance. Experimental results showed that a proper landmark subset achieves comparable or superior performance to state-of-the-art methods while reducing processing time by more than 5X compared to Alves et al. (2024). As an additional contribution, we demonstrated that spline-based imputation effectively mitigates missing landmark issues, leading to substantial accuracy gains. These findings highlight that careful landmark selection, combined with simple imputation techniques, enables efficient and accurate isolated sign recognition, paving the way for scalable Sign Language Recognition systems.
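
Spline-based imputation of missing landmarks can be sketched briefly. The cubic-spline-per-coordinate formulation below is our assumption about how such interpolation is typically set up, not a detail from the paper.

```python
# Fill gaps in a landmark trajectory with a cubic spline fitted per coordinate.
import numpy as np
from scipy.interpolate import CubicSpline

def impute_landmark_track(track: np.ndarray) -> np.ndarray:
    # track: (T, 2) x/y positions of one landmark, NaN where detection failed
    t = np.arange(len(track))
    out = track.copy()
    for d in range(track.shape[1]):
        valid = ~np.isnan(track[:, d])
        spline = CubicSpline(t[valid], track[valid, d])
        out[~valid, d] = spline(t[~valid])
    return out

track = np.array([[0., 0.], [1., 2.], [np.nan, np.nan], [3., 6.], [4., 8.]])
print(impute_landmark_track(track))   # the gap at frame 2 is filled smoothly
```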

[CV-60] FruitProm: Probabilistic Maturity Estimation and Detection of Fruits and Vegetables

[Quick Read]: This paper addresses the information loss caused by treating fruit and vegetable maturity as a discrete classification problem (e.g., unripe, ripe, overripe), a rigid formulation that conflicts with the continuous nature of biological ripening and blurs class boundaries. The key to the solution is to reframe maturity estimation as a continuous, probabilistic learning task: a dedicated probabilistic head is added to the state-of-the-art real-time detector RT-DETRv2 so that the model predicts a continuous maturity distribution for each detected object, outputting both the mean maturity state and its uncertainty. The uncertainty measure gives downstream robotic decisions (e.g., selective harvesting) a confidence score, substantially improving the granularity and accuracy of maturity assessment while retaining strong detection performance (85.6% mAP).

Link: https://arxiv.org/abs/2510.24885
Authors: Sidharth Rai, Rahul Harsha Cheppally, Benjamin Vail, Keziban Yalçın Dokumacı, Ajay Sharda
Institutions: Kansas State University; Selçuk University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Sidharth Rai, Rahul Harsha Cheppally contributed equally to this work

Abstract:Maturity estimation of fruits and vegetables is a critical task for agricultural automation, directly impacting yield prediction and robotic harvesting. Current deep learning approaches predominantly treat maturity as a discrete classification problem (e.g., unripe, ripe, overripe). This rigid formulation, however, fundamentally conflicts with the continuous nature of the biological ripening process, leading to information loss and ambiguous class boundaries. In this paper, we challenge this paradigm by reframing maturity estimation as a continuous, probabilistic learning task. We propose a novel architectural modification to the state-of-the-art, real-time object detector, RT-DETRv2, by introducing a dedicated probabilistic head. This head enables the model to predict a continuous distribution over the maturity spectrum for each detected object, simultaneously learning the mean maturity state and its associated uncertainty. This uncertainty measure is crucial for downstream decision-making in robotics, providing a confidence score for tasks like selective harvesting. Our model not only provides a far richer and more biologically plausible representation of plant maturity but also maintains exceptional detection performance, achieving a mean Average Precision (mAP) of 85.6% on a challenging, large-scale fruit dataset. We demonstrate through extensive experiments that our probabilistic approach offers more granular and accurate maturity assessments than its classification-based counterparts, paving the way for more intelligent, uncertainty-aware automated systems in modern agriculture
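
A probabilistic head of the kind described typically predicts a mean and a variance and trains with a Gaussian negative log-likelihood. The sketch below shows that common pattern; the head layout and loss are standard practice rather than the paper's confirmed design.

```python
# A minimal mean-plus-uncertainty regression head for per-detection maturity.
import torch
import torch.nn as nn

class MaturityHead(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())  # mean in [0, 1]
        self.log_var = nn.Linear(dim, 1)                          # uncertainty

    def forward(self, feats):        # feats: (N, dim) per-query features
        return self.mu(feats).squeeze(-1), self.log_var(feats).squeeze(-1)

def gaussian_nll(mu, log_var, target):
    # Negative log-likelihood of target under N(mu, exp(log_var)), up to a constant
    return (0.5 * (log_var + (target - mu) ** 2 / log_var.exp())).mean()

head = MaturityHead()
feats = torch.randn(8, 256)
mu, log_var = head(feats)
loss = gaussian_nll(mu, log_var, torch.rand(8))
loss.backward()                      # trained alongside the detection losses
```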

[CV-61] he Generation Phases of Flow Matching: a Denoising Perspective

[Quick Read]: This paper asks what determines generation quality in flow matching models, which remains poorly understood. The key to the solution is a denoising perspective: a formal framework connects flow matching models to denoisers, giving a common ground for comparing their generation and denoising performance. On this basis, controlled perturbations (noise and drift) are designed to systematically probe the dynamical phases of the generative process and to characterize precisely at which stages denoisers succeed or fail, and why that matters.

Link: https://arxiv.org/abs/2510.24830
Authors: Anne Gagneux, Ségolène Martin, Rémi Gribonval, Mathurin Massias
Institutions: ENS de Lyon; CNRS; Université Claude Bernard Lyon 1; Inria; LIP; Technische Universität Berlin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Flow matching has achieved remarkable success, yet the factors influencing the quality of its generation process remain poorly understood. In this work, we adopt a denoising perspective and design a framework to empirically probe the generation process. Laying down the formal connections between flow matching models and denoisers, we provide a common ground to compare their performances on generation and denoising. This enables the design of principled and controlled perturbations to influence sample generation: noise and drift. This leads to new insights on the distinct dynamical phases of the generative process, enabling us to precisely characterize at which stage of the generative process denoisers succeed or fail and why this matters.
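
One standard way to lay down a formal connection between flow matching and denoising, under the common linear interpolation path with noise at t=0 and data at t=1, is the conditional-expectation identity below; the paper's exact conventions may differ.

```latex
% Linear path between noise x_0 and data x_1:
%   x_t = (1 - t)\, x_0 + t\, x_1, with velocity target v = x_1 - x_0.
% Taking conditional expectations given x_t turns the learned flow into a denoiser:
\hat{x}_1(x_t, t) \;=\; \mathbb{E}[x_1 \mid x_t] \;=\; x_t + (1 - t)\, v_\theta(x_t, t).
```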

[CV-62] MCIHN: A Hybrid Network Model Based on Multi-path Cross-modal Interaction for Multimodal Emotion Recognition

[Quick Read]: This paper addresses the challenges in multimodal emotion recognition posed by discrepancies between modalities and the difficulty of characterizing unimodal emotional information accurately. The key to the solution is a hybrid network based on multi-path cross-modal interaction (MCIHN): adversarial autoencoders (AAE) are first built per modality to learn discriminative emotion features and reconstruct them through a decoder for better class discrimination; the latent codes are then fed into a predefined Cross-modal Gate Mechanism model (CGMM) that reduces inter-modality discrepancy, establishes emotional relationships between interacting modalities, and generates cross-modal interaction features; finally, a Feature Fusion module (FFM) performs multimodal fusion, clearly improving emotion recognition.

Link: https://arxiv.org/abs/2510.24827
Authors: Haoyang Zhang, Zhou Yang, Ke Sun, Yucai Pang, Guoliang Xu
Institutions: Chongqing University of Posts and Telecommunications; Xi'an Jiaotong University; University of New South Wales
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: The paper will be published in the MMAsia2025 conference proceedings

Abstract:Multimodal emotion recognition is crucial for future human-computer interaction. However, accurate emotion recognition still faces significant challenges due to differences between different modalities and the difficulty of characterizing unimodal emotional information. To solve these problems, a hybrid network model based on multipath cross-modal interaction (MCIHN) is proposed. First, adversarial autoencoders (AAE) are constructed separately for each modality. The AAE learns discriminative emotion features and reconstructs the features through a decoder to obtain more discriminative information about the emotion classes. Then, the latent codes from the AAE of different modalities are fed into a predefined Cross-modal Gate Mechanism model (CGMM) to reduce the discrepancy between modalities, establish the emotional relationship between interacting modalities, and generate the interaction features between different modalities. Finally, multimodal fusion is performed by the Feature Fusion module (FFM) for better emotion recognition. Experiments were conducted on publicly available SIMS and MOSI datasets, demonstrating that MCIHN achieves superior performance.

[CV-63] Ming-Flash-Omni: A Sparse Unified Architecture for Multimodal Perception and Generation

[Quick Read]: This paper addresses the difficulty of balancing computational efficiency against model capability in multimodal systems, and the performance limits imposed by the lack of a unified architecture across vision, speech, and language tasks. The key to the solution is Ming-Flash-Omni, built on a sparse Mixture-of-Experts (MoE) design that activates only 6.1B of 100B total parameters per token, yielding far better computational efficiency at greatly expanded capacity. Its unified architecture supports both cross-modal understanding and generation, reaching leading results in image generation, speech recognition (including contextual and dialect-aware ASR), and generative segmentation, marking a step toward artificial general intelligence (AGI).

Link: https://arxiv.org/abs/2510.24821
Authors: Inclusion AI: Bowen Ma, Cheng Zou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang, Furong Xu, GuangMing Yao, Jun Zhou, Jingdong Chen, Jianing Li, Jianxin Sun, Jiajia Liu, Jianjiang Zhu, Jianping Jiang, Jun Peng, Kaixiang Ji, Kaimeng Ren, Libin Wang, Lixiang Ru, Longhua Tan, Lan Wang, Mochen Bai, Ning Gao, Qingpei Guo, Qinglong Zhang, Qiang Xu, Rui Liu, Ruijie Xiong, Ruobing Zheng, Sirui Gao, Tianqi Li, Tinghao Liu, Weilong Chai, Xinyu Xiao, Xiaomei Wang, Xiaolong Wang, Xiao Lu, Xiaoyu Li, Xingning Dong, Xuzheng Yu, Yi Yuan, Yuting Gao, Yuting Xiao, Yunxiao Sun, Yipeng Chen, Yifan Mao, Yifei Wu, Yongjie Lyu, Ziping Ma, Zhiqiang Fang, Zhihao Qiu, Ziyuan Huang, Zizheng Yang, Zhengyu He
Institutions: Inclusion AI; Ant Group
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, 5 figures

Abstract:We propose Ming-Flash-Omni, an upgraded version of Ming-Omni, built upon a sparser Mixture-of-Experts (MoE) variant of Ling-Flash-2.0 with 100 billion total parameters, of which only 6.1 billion are active per token. This architecture enables highly efficient scaling (dramatically improving computational efficiency while significantly expanding model capacity) and empowers stronger unified multimodal intelligence across vision, speech, and language, representing a key step toward Artificial General Intelligence (AGI). Compared to its predecessor, the upgraded version exhibits substantial improvements across multimodal understanding and generation. We significantly advance speech recognition capabilities, achieving state-of-the-art performance in contextual ASR and highly competitive results in dialect-aware ASR. In image generation, Ming-Flash-Omni introduces high-fidelity text rendering and demonstrates marked gains in scene consistency and identity preservation during image editing. Furthermore, Ming-Flash-Omni introduces generative segmentation, a capability that not only achieves strong standalone segmentation performance but also enhances spatial control in image generation and improves editing consistency. Notably, Ming-Flash-Omni achieves state-of-the-art results in text-to-image generation and generative segmentation, and sets new records on all 12 contextual ASR benchmarks, all within a single unified architecture.

[CV-64] SafeEditor: Unified MLLM for Efficient Post-hoc T2I Safety Editing

[Quick Read]: This paper addresses safety assurance for text-to-image (T2I) generation, where existing inference-time safety methods, though widely adopted for their cost-effectiveness, commonly suffer from over-refusal and a poor balance between safety and utility. The key to the solution is a model-agnostic, plug-and-play multi-round safety editing framework built around MR-SafeEdit, a multi-round image-text interleaved dataset constructed specifically for T2I safety editing, and a post-hoc safety editing paradigm that mirrors how humans identify and revise unsafe content. On this basis the authors develop SafeEditor, a unified multimodal large language model (MLLM) that performs iterative multi-round safety editing on generated images, reducing over-refusal while achieving a more favorable safety-utility balance.

Link: https://arxiv.org/abs/2510.24820
Authors: Ruiyang Zhang, Jiahao Luo, Xiaoru Feng, Qiufan Pang, Yaodong Yang, Juntao Dai
Institutions: PKU Alignment Team, Peking University; LLM Safety Centre, Beijing Academy of Artificial Intelligence
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:With the rapid advancement of text-to-image (T2I) models, ensuring their safety has become increasingly critical. Existing safety approaches can be categorized into training-time and inference-time methods. While inference-time methods are widely adopted due to their cost-effectiveness, they often suffer from limitations such as over-refusal and imbalance between safety and utility. To address these challenges, we propose a multi-round safety editing framework that functions as a model-agnostic, plug-and-play module, enabling efficient safety alignment for any text-to-image model. Central to this framework is MR-SafeEdit, a multi-round image-text interleaved dataset specifically constructed for safety editing in text-to-image generation. We introduce a post-hoc safety editing paradigm that mirrors the human cognitive process of identifying and refining unsafe content. To instantiate this paradigm, we develop SafeEditor, a unified MLLM capable of multi-round safety editing on generated images. Experimental results show that SafeEditor surpasses prior safety approaches by reducing over-refusal while achieving a more favorable safety-utility balance.

[CV-65] Perception, Understanding and Reasoning: A Multimodal Benchmark for Video Fake News Detection

[Quick Read]: This paper addresses the lack of fine-grained evaluation of model decision processes in video fake news detection (VFND): conventional benchmarks score only final-judgment accuracy, leaving the detection pipeline a black box. The key to the solution is MVFNDB (Multi-modal Video Fake News Detection Benchmark), comprising 10 carefully designed tasks and 9730 human-annotated video-related questions that systematically probe the perception, understanding, and reasoning of multimodal large language models (MLLMs). The authors further propose MVFND-CoT, a framework that jointly reasons over creator-added content and original shooting footage to raise detection accuracy, and empirically analyze how video processing strategies and the alignment between video features and model capabilities affect accuracy.

Link: https://arxiv.org/abs/2510.24816
Authors: Cui Yakun, Fushuo Huo, Weijie Shi, Juntao Dai, Hang Du, Zhenghao Zhu, Sirui Han, Yike Guo
Institutions: The Hong Kong University of Science and Technology; The Hong Kong Polytechnic University; Beijing University of Posts and Telecommunications; Peking University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The advent of multi-modal large language models (MLLMs) has greatly advanced research into applications for Video fake news detection (VFND) tasks. Traditional video-based FND benchmarks typically focus on the accuracy of the final decision, often failing to provide fine-grained assessments for the entire detection process, making the detection process a black box. Therefore, we introduce the MVFNDB (Multi-modal Video Fake News Detection Benchmark) based on the empirical analysis, which provides foundation for tasks definition. The benchmark comprises 10 tasks and is meticulously crafted to probe MLLMs' perception, understanding, and reasoning capacities during detection, featuring 9730 human-annotated video-related questions based on a carefully constructed taxonomy of VFND abilities. To validate the impact of combining multiple features on the final results, we design a novel framework named MVFND-CoT, which incorporates both creator-added content and original shooting footage reasoning. Building upon the benchmark, we conduct an in-depth analysis of the deeper factors influencing accuracy, including video processing strategies and the alignment between video features and model capabilities. We believe this benchmark will lay a solid foundation for future evaluations and advancements of MLLMs in the domain of video fake news detection.

[CV-66] Deep Feature Optimization for Enhanced Fish Freshness Assessment

[Quick Read]: This paper addresses the subjectivity, slowness, and inconsistency of traditional sensory evaluation of fish freshness, as well as the limited accuracy and feature transparency of existing deep learning methods. The key to the solution is a unified three-stage framework: first, five state-of-the-art vision backbones (ResNet-50, DenseNet-121, EfficientNet-B0, ConvNeXt-Base, and Swin-Tiny) are fine-tuned to establish a strong baseline; next, multi-level deep features extracted from these backbones train seven classical machine learning classifiers, coupling deep and traditional decision mechanisms; finally, feature selection based on the Light Gradient Boosting Machine (LGBM), Random Forest, and Lasso extracts a compact, informative feature subset. On the Freshness of the Fish Eyes (FFE) dataset the framework reaches 85.99% accuracy, clearly surpassing prior work on the same dataset and confirming its effectiveness and generality for visual quality assessment.

Link: https://arxiv.org/abs/2510.24814
Authors: Phi-Hung Hoang, Nam-Thuan Trinh, Van-Manh Tran, Thi-Thu-Hong Phan
Institutions: FPT University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 39 pages; 10 tables; 9 figures

Abstract:Assessing fish freshness is vital for ensuring food safety and minimizing economic losses in the seafood industry. However, traditional sensory evaluation remains subjective, time-consuming, and inconsistent. Although recent advances in deep learning have automated visual freshness prediction, challenges related to accuracy and feature transparency persist. This study introduces a unified three-stage framework that refines and leverages deep visual representations for reliable fish freshness assessment. First, five state-of-the-art vision architectures - ResNet-50, DenseNet-121, EfficientNet-B0, ConvNeXt-Base, and Swin-Tiny - are fine-tuned to establish a strong baseline. Next, multi-level deep features extracted from these backbones are used to train seven classical machine learning classifiers, integrating deep and traditional decision mechanisms. Finally, feature selection methods based on Light Gradient Boosting Machine (LGBM), Random Forest, and Lasso identify a compact and informative subset of features. Experiments on the Freshness of the Fish Eyes (FFE) dataset demonstrate that the best configuration combining Swin-Tiny features, an Extra Trees classifier, and LGBM-based feature selection achieves an accuracy of 85.99%, outperforming recent studies on the same dataset by 8.69-22.78%. These findings confirm the effectiveness and generalizability of the proposed framework for visual quality evaluation tasks.
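
The best-performing configuration reported (Swin-Tiny features, LGBM-based selection, Extra Trees classifier) maps naturally onto a scikit-learn pipeline. A condensed sketch with stand-in features follows; backbone loading, feature pooling, and hyperparameters are illustrative assumptions.

```python
# Deep features -> LGBM-importance feature selection -> Extra Trees classifier.
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import make_pipeline

# Stand-ins for pooled Swin-Tiny features of fish-eye images (3 freshness levels)
X_train, y_train = np.random.rand(200, 768), np.random.randint(0, 3, 200)
X_test = np.random.rand(50, 768)

clf = make_pipeline(
    SelectFromModel(LGBMClassifier(n_estimators=200)),  # keep informative dims
    ExtraTreesClassifier(n_estimators=300),             # final freshness classifier
)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
```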

[CV-67] DualCap: Enhancing Lightweight Image Captioning via Dual Retrieval with Similar Scenes Visual Prompts

[Quick Read]: This paper addresses a semantic gap in lightweight retrieval-augmented image captioning: retrieved data is used only as text prompts, leaving the visual features of the image unenhanced, particularly for object details and complex scenes. The key to the solution is DualCap, which enriches the visual representation by generating a visual prompt from retrieved similar images through a dual retrieval mechanism: standard image-to-text retrieval produces the text prompt, while a novel image-to-image retrieval fetches visually analogous scenes whose captions yield salient keywords and phrases capturing key objects and similar details. These textual features are encoded and integrated with the original image features by a lightweight, trainable feature fusion network, achieving competitive performance with fewer trainable parameters than prior visual-prompting captioning approaches.

Link: https://arxiv.org/abs/2510.24813
Authors: Binbin Li, Guimiao Yang, Zisen Qi, Haiping Wang, Yu Ding
Institutions: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent lightweight retrieval-augmented image caption models often utilize retrieved data solely as text prompts, thereby creating a semantic gap by leaving the original visual features unenhanced, particularly for object details or complex scenes. To address this limitation, we propose DualCap , a novel approach that enriches the visual representation by generating a visual prompt from retrieved similar images. Our model employs a dual retrieval mechanism, using standard image-to-text retrieval for text prompts and a novel image-to-image retrieval to source visually analogous scenes. Specifically, salient keywords and phrases are derived from the captions of visually similar scenes to capture key objects and similar details. These textual features are then encoded and integrated with the original image features through a lightweight, trainable feature fusion network. Extensive experiments demonstrate that our method achieves competitive performance while requiring fewer trainable parameters compared to previous visual-prompting captioning approaches.

[CV-68] A Survey on Efficient Vision-Language-Action Models

[Quick Read]: This survey addresses the substantial computational and data costs that hamper the practical deployment of Vision-Language-Action models (VLAs) and severely limit their use in embodied intelligence. The key to the solution is a systematic three-pillar framework covering Efficient Model Design, Efficient Training, and Efficient Data Collection, organizing existing work under a unified taxonomy and providing a clear technical roadmap and practical reference for future research.

Link: https://arxiv.org/abs/2510.24795
Authors: Zhaoshu Yu, Bo Wang, Pengpeng Zeng, Haonan Zhang, Ji Zhang, Lianli Gao, Jingkuan Song, Nicu Sebe, Heng Tao Shen
Institutions: Tongji University; Southwest Jiaotong University; University of Electronic Science and Technology of China; University of Trento
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 26 pages, 8 figures

Abstract:Vision-Language-Action models (VLAs) represent a significant frontier in embodied intelligence, aiming to bridge digital knowledge with physical-world interaction. While these models have demonstrated remarkable generalist capabilities, their deployment is severely hampered by the substantial computational and data requirements inherent to their underlying large-scale foundation models. Motivated by the urgent need to address these challenges, this survey presents the first comprehensive review of Efficient Vision-Language-Action models (Efficient VLAs) across the entire data-model-training process. Specifically, we introduce a unified taxonomy to systematically organize the disparate efforts in this domain, categorizing current techniques into three core pillars: (1) Efficient Model Design, focusing on efficient architectures and model compression; (2) Efficient Training, which reduces computational burdens during model learning; and (3) Efficient Data Collection, which addresses the bottlenecks in acquiring and utilizing robotic data. Through a critical review of state-of-the-art methods within this framework, this survey not only establishes a foundational reference for the community but also summarizes representative applications, delineates key challenges, and charts a roadmap for future research. We maintain a continuously updated project page to track our latest developments: this https URL

[CV-69] PISA-Bench: The PISA Index as a Multilingual and Multimodal Metric for the Evaluation of Vision-Language Models

[Quick Read]: This paper addresses two key gaps in evaluating the multilingual multimodal reasoning of vision-language models (VLMs): existing benchmarks lack high-quality, human-verified examples, relying instead on content synthesized by large language models (LLMs); and they are almost exclusively English, since manual quality assurance of translated samples is slow and costly. The key to the solution is PISA-Bench, a multilingual benchmark derived from expert-designed items of the Programme for International Student Assessment (PISA): each example consists of human-extracted instructions, questions, answer options, and images, enriched with question-type categories, and the English originals are translated into Spanish, German, Chinese, French, and Italian, yielding a fully parallel six-language corpus for systematically evaluating VLM multimodal reasoning beyond English.

Link: https://arxiv.org/abs/2510.24792
Authors: Patrick Haller, Fabio Barth, Jonas Golde, Georg Rehm, Alan Akbik
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 11 tables and figures

Abstract:Vision-language models (VLMs) have demonstrated remarkable progress in multimodal reasoning. However, existing benchmarks remain limited in terms of high-quality, human-verified examples. Many current datasets rely on synthetically generated content by large language models (LLMs). Furthermore, most datasets are limited to English, as manual quality assurance of translated samples is time-consuming and costly. To fill this gap, we introduce PISA-Bench, a multilingual benchmark derived from English examples of the expert-created PISA tests, a unified framework for the assessment of student competencies in over eighty countries. Each example consists of human-extracted instructions, questions, answer options, and images, enriched with question type categories, and has been translated from English into five additional languages (Spanish, German, Chinese, French, and Italian), resulting in a fully parallel corpus covering six languages. We evaluate state-of-the-art vision-language models on PISA-Bench and find that especially small models (<20B parameters) fail to achieve high test scores. We further find substantial performance degradation on non-English splits as well as high error rates when models are tasked with spatial and geometric reasoning. By releasing the dataset and evaluation framework, we provide a resource for advancing research on multilingual multimodal reasoning.

[CV-70] A Re-node Self-training Approach for Deep Graph-based Semi-supervised Classification on Multi-view Image Data

[Quick Read]: This paper addresses the inefficiency of traditional graph-based methods on multi-view image data, where images lack an explicit graph structure and integrating graph structures across views remains difficult. The key to the solution is Re-node Self-taught Graph-based Semi-supervised Learning for Multi-view Data (RSGSLM), which (i) combines linear feature transformation and multi-view graph fusion within a graph convolutional network (GCN) framework; (ii) dynamically incorporates pseudo-labels into the GCN loss function to improve multi-view classification; (iii) corrects topological imbalance by re-weighting labeled samples near class boundaries; and (iv) adds an unsupervised smoothing loss applicable to all samples, optimizing performance while remaining computationally efficient.

Link: https://arxiv.org/abs/2510.24791
Authors: Jingjun Bi, Fadi Dornaika
Institutions: North China University of Water Resources and Electric Power; University of the Basque Country; IKERBASQUE
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Recently, graph-based semi-supervised learning and pseudo-labeling have gained attention due to their effectiveness in reducing the need for extensive data annotations. Pseudo-labeling uses predictions from unlabeled data to improve model training, while graph-based methods are characterized by processing data represented as graphs. However, the lack of clear graph structures in images combined with the complexity of multi-view data limits the efficiency of traditional and existing techniques. Moreover, the integration of graph structures in multi-view data is still a challenge. In this paper, we propose Re-node Self-taught Graph-based Semi-supervised Learning for Multi-view Data (RSGSLM). Our method addresses these challenges by (i) combining linear feature transformation and multi-view graph fusion within a Graph Convolutional Network (GCN) framework, (ii) dynamically incorporating pseudo-labels into the GCN loss function to improve classification in multi-view data, and (iii) correcting topological imbalances by adjusting the weights of labeled samples near class boundaries. Additionally, (iv) we introduce an unsupervised smoothing loss applicable to all samples. This combination optimizes performance while maintaining computational efficiency. Experimental results on multi-view benchmark image datasets demonstrate that RSGSLM surpasses existing semi-supervised learning approaches in multi-view contexts.

[CV-71] he Underappreciated Power of Vision Models for Graph Structural Understanding NEURIPS2025

[Quick Read]: This paper addresses the limits of graph neural networks (GNNs) on tasks requiring global structural understanding: their bottom-up message passing diverges from human visual perception, which intuitively captures global structure first, and struggles with global topological properties. The key to the solution is GraphAbstract, a new benchmark that evaluates a model's perception of global graph properties the way humans do: recognizing organizational archetypes, detecting symmetry, sensing connectivity strength, and identifying critical elements. The results show that vision models substantially outperform GNNs on these tasks and generalize across graph scales, revealing an underappreciated potential of vision models for graph structural understanding and a new path toward more effective graph foundation models.

Link: https://arxiv.org/abs/2510.24788
Authors: Xinjian Zhao, Wei Pang, Zhongkai Xue, Xiangru Jian, Lei Zhang, Yaoyao Xu, Xiaozhuang Song, Shu Wu, Tianshu Yu
Institutions: School of Data Science, The Chinese University of Hong Kong, Shenzhen; Institute of Automation, Chinese Academy of Sciences; Cheriton School of Computer Science, University of Waterloo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: NeurIPS 2025

Abstract:Graph Neural Networks operate through bottom-up message-passing, fundamentally differing from human visual perception, which intuitively captures global structures first. We investigate the underappreciated potential of vision models for graph understanding, finding they achieve performance comparable to GNNs on established benchmarks while exhibiting distinctly different learning patterns. These divergent behaviors, combined with limitations of existing benchmarks that conflate domain features with topological understanding, motivate our introduction of GraphAbstract. This benchmark evaluates models’ ability to perceive global graph properties as humans do: recognizing organizational archetypes, detecting symmetry, sensing connectivity strength, and identifying critical elements. Our results reveal that vision models significantly outperform GNNs on tasks requiring holistic structural understanding and maintain generalizability across varying graph scales, while GNNs struggle with global pattern abstraction and degrade with increasing graph size. This work demonstrates that vision models possess remarkable yet underutilized capabilities for graph structural understanding, particularly for problems requiring global topological awareness and scale-invariant reasoning. These findings open new avenues to leverage this underappreciated potential for developing more effective graph foundation models for tasks dominated by holistic pattern recognition.

[CV-72] ESCA: Enabling Seamless Codec Avatar Execution through Algorithm and Hardware Co-Optimization for Virtual Reality

[Quick Read]: This paper addresses real-time inference of photorealistic codec avatars (PCA) on resource-constrained virtual reality (VR) devices, where the heavy compute of high-fidelity face rendering models clashes with VR's strict latency and power budgets. The key to the solution is an efficient post-training quantization (PTQ) method tailored to codec avatar models, enabling low-precision execution without degrading output quality, together with a custom hardware accelerator that can be integrated into a VR system-on-chip. These components form ESCA, an end-to-end full-stack optimization framework that co-designs algorithm and hardware, cutting latency substantially while sustaining 100 frames per second rendering at high visual quality and demonstrating the feasibility of high-fidelity codec avatars on edge VR devices.

Link: https://arxiv.org/abs/2510.24787
Authors: Mingzhi Zhu, Ding Shang, Sai Qian Zhang
Institutions: Tandon School of Engineering, New York University; Courant Institute of Mathematical Sciences, New York University; Rensselaer Polytechnic Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Photorealistic Codec Avatars (PCA), which generate high-fidelity human face renderings, are increasingly being used in Virtual Reality (VR) environments to enable immersive communication and interaction through deep learning-based generative models. However, these models impose significant computational demands, making real-time inference challenging on resource-constrained VR devices such as head-mounted displays, where latency and power efficiency are critical. To address this challenge, we propose an efficient post-training quantization (PTQ) method tailored for Codec Avatar models, enabling low-precision execution without compromising output quality. In addition, we design a custom hardware accelerator that can be integrated into the system-on-chip of VR devices to further enhance processing efficiency. Building on these components, we introduce ESCA, a full-stack optimization framework that accelerates PCA inference on edge VR platforms. Experimental results demonstrate that ESCA boosts FovVideoVDP quality scores by up to +0.39 over the best 4-bit baseline, delivers up to 3.36× latency reduction, and sustains a rendering rate of 100 frames per second in end-to-end tests, satisfying real-time VR requirements. These results demonstrate the feasibility of deploying high-fidelity codec avatars on resource-constrained devices, opening the door to more immersive and portable VR experiences.
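
As a point of reference for what post-training quantization does, here is a toy symmetric per-tensor weight quantizer; ESCA's PTQ method is tailored to codec avatar models and is certainly more involved than this illustration.

```python
# Toy symmetric per-tensor PTQ: round weights to a low-bit grid, keep one scale.
import torch

def quantize(w: torch.Tensor, bits: int = 4):
    qmax = 2 ** (bits - 1) - 1                  # e.g. 7 for 4-bit signed
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale              # store int weights + one scale

def forward_int(x, q, scale):
    return x @ (q.float() * scale).T            # dequantize on the fly

w = torch.randn(64, 128)
q, s = quantize(w)
err = (w - q.float() * s).abs().mean()
print(f"mean abs quantization error: {err:.4f}")
```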

[CV-73] FPGA-based Lane Detection System incorporating Temperature and Light Control Units

[Quick Read]: This paper addresses lane detection for intelligent vehicles (IVs) in complex environments, targeting high real-time performance and robustness on urban roads and robot tracks. The key to the solution is an FPGA-based Lane Detector Vehicle (LDV) architecture built on Sobel edge detection: at 416 x 416 resolution and a 150 MHz clock, the system produces a valid output every 1.17 ms, reporting the number of detected lanes, the current lane index, and its left and right boundaries, thereby enabling efficient, low-latency lane perception; automated light and temperature control units further improve adaptability to environmental conditions.

Link: https://arxiv.org/abs/2510.24778
Authors: Ibrahim Qamar, Saber Mahmoud, Seif Megahed, Mohamed Khaled, Saleh Hesham, Ahmed Matar, Saif Gebril, Mervat Mahmoud
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 5 pages, 8 figures, 3 tables

Abstract:Intelligent vehicles are one of the most important outcomes gained from the world tendency toward automation. Applications of IVs, whether on urban roads or robot tracks, prioritize lane path detection. This paper proposes an FPGA-based Lane Detector Vehicle (LDV) architecture that relies on the Sobel algorithm for edge detection. Operating on 416 x 416 images at 150 MHz, the system can generate a valid output every 1.17 ms. The valid output consists of the number of present lanes, the current lane index, as well as its right and left boundaries. Additionally, the automated light and temperature control units in the proposed system enhance its adaptability to the surrounding environmental conditions.
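
The Sobel step at the core of the pipeline is easy to reproduce in software for reference (the LDV implements it in FPGA logic); the threshold below is an illustrative value.

```python
# Reference Sobel gradient-magnitude edge detection on a 416 x 416 frame.
import numpy as np
from scipy.ndimage import convolve

def sobel_magnitude(img: np.ndarray) -> np.ndarray:
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    gx = convolve(img.astype(float), kx)       # horizontal gradient
    gy = convolve(img.astype(float), ky)       # vertical gradient
    return np.hypot(gx, gy)                    # gradient magnitude

frame = np.random.randint(0, 256, (416, 416))
edges = sobel_magnitude(frame) > 128           # binary edge map for lane finding
```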

[CV-74] Cross-Enhanced Multimodal Fusion of Eye-Tracking and Facial Features for Alzheimers Disease Diagnosis

[Quick Read]: This paper addresses insufficient multimodal fusion in the auxiliary diagnosis of Alzheimer's disease (AD), in particular how to integrate eye-tracking and facial features effectively to raise diagnostic accuracy. The key to the solution is a multimodal cross-enhanced fusion framework with two core modules: a Cross-Enhanced Fusion Attention Module (CEFAM) that models cross-modal interactions via cross-attention with a global enhancement strategy, and a Direction-Aware Convolution Module (DACM) that captures fine-grained directional facial features through horizontal-vertical receptive fields. By explicitly modeling inter-modal dependencies and modality-specific contributions, the framework clearly outperforms late fusion and feature concatenation, reaching 95.11% accuracy in distinguishing AD from healthy controls (HC).

Link: https://arxiv.org/abs/2510.24777
Authors: Yujie Nie, Jianzhang Ni, Yonglong Ye, Yuan-Ting Zhang, Yun Kwok Wing, Xiangqing Xu, Xin Ma, Lizhou Fan
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: 35 pages, 8 figures, and 7 tables

Abstract:Accurate diagnosis of Alzheimer’s disease (AD) is essential for enabling timely intervention and slowing disease progression. Multimodal diagnostic approaches offer considerable promise by integrating complementary information across behavioral and perceptual domains. Eye-tracking and facial features, in particular, are important indicators of cognitive function, reflecting attentional distribution and neurocognitive state. However, few studies have explored their joint integration for auxiliary AD diagnosis. In this study, we propose a multimodal cross-enhanced fusion framework that synergistically leverages eye-tracking and facial features for AD detection. The framework incorporates two key modules: (a) a Cross-Enhanced Fusion Attention Module (CEFAM), which models inter-modal interactions through cross-attention and global enhancement, and (b) a Direction-Aware Convolution Module (DACM), which captures fine-grained directional facial features via horizontal-vertical receptive fields. Together, these modules enable adaptive and discriminative multimodal representation learning. To support this work, we constructed a synchronized multimodal dataset, including 25 patients with AD and 25 healthy controls (HC), by recording aligned facial video and eye-tracking sequences during a visual memory-search paradigm, providing an ecologically valid resource for evaluating integration strategies. Extensive experiments on this dataset demonstrate that our framework outperforms traditional late fusion and feature concatenation methods, achieving a classification accuracy of 95.11% in distinguishing AD from HC, highlighting superior robustness and diagnostic performance by explicitly modeling inter-modal dependencies and modality-specific contributions.
zh

[CV-75] Point-level Uncertainty Evaluation of Mobile Laser Scanning Point Clouds

【速读】:该论文旨在解决移动激光扫描(Mobile Laser Scanning, MLS)点云中点级不确定性可靠量化的问题,以保障三维地图构建、建模及变化分析等下游应用的精度与可信度。传统向后不确定性建模方法严重依赖高精度参考数据,但在大尺度场景下往往成本高昂或难以获取。为此,论文提出一种基于机器学习的点级不确定性评估框架,其关键在于利用随机森林(Random Forest, RF)和梯度提升树(XGBoost)两种集成学习模型,从局部几何特征中学习与点级误差之间的非线性关系,并通过空间分区的数据集训练与验证避免数据泄露。实验表明,该框架能有效捕捉几何特性与不确定性间的复杂关联,平均ROC-AUC值超过0.87,且揭示了高程变化、点密度和局部结构复杂度等几何特征在不确定性预测中的主导作用。
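
编者注:摘要中"用集成模型从局部几何特征预测点级误差"的流程可以用 scikit-learn 简单示意如下;特征维度与合成标签均为编者虚构,真实流程需替换为点云几何特征与按空间分区划分的数据:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 6))   # 6 维局部几何特征(高程变化、点密度等),此处为占位
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.8, size=5000) > 1.0).astype(int)  # 1 = 高误差点

# 前 4000 点训练、后 1000 点测试,模拟"按空间分区"划分以避免相邻点泄露
X_tr, X_te, y_tr, y_te = X[:4000], X[4000:], y[:4000], y[4000:]

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("ROC-AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```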

链接: https://arxiv.org/abs/2510.24773
作者: Ziyang Xu,Olaf Wysocki,Christoph Holst
机构: Technical University of Munich (慕尼黑工业大学); University of Cambridge (剑桥大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Reliable quantification of uncertainty in Mobile Laser Scanning (MLS) point clouds is essential for ensuring the accuracy and credibility of downstream applications such as 3D mapping, modeling, and change analysis. Traditional backward uncertainty modeling heavily relies on high-precision reference data, which are often costly or infeasible to obtain at large scales. To address this issue, this study proposes a machine learning-based framework for point-level uncertainty evaluation that learns the relationship between local geometric features and point-level errors. The framework is implemented using two ensemble learning models, Random Forest (RF) and XGBoost, which are trained and validated on a spatially partitioned real-world dataset to avoid data leakage. Experimental results demonstrate that both models can effectively capture the nonlinear relationships between geometric characteristics and uncertainty, achieving mean ROC-AUC values above 0.87. The analysis further reveals that geometric features describing elevation variation, point density, and local structural complexity play a dominant role in predicting uncertainty. The proposed framework offers a data-driven perspective on uncertainty evaluation, providing a scalable and adaptable foundation for future quality control and error analysis of large-scale point clouds.
zh

[CV-76] Combining SAR Simulators to Train ATR Models with Synthetic Data

【速读】:该论文旨在解决合成孔径雷达(SAR)图像中自动目标识别(ATR)模型在真实数据上泛化能力差的问题。由于真实标注数据稀缺,现有方法依赖SAR模拟器生成合成数据进行训练,但因模拟器基于简化物理模型,导致合成数据与真实场景存在域差异,进而影响模型性能。解决方案的关键在于:融合两个基于不同建模范式的SAR模拟器——MOCEM(基于散射中心模型)和Salsa(基于光线追踪策略),生成更具多样性和代表性的合成数据集,并结合自研的深度学习方法ADASCA进行训练,最终在MSTAR数据集上实现了接近88%的识别准确率,显著提升了模型在真实SAR图像上的泛化能力。

链接: https://arxiv.org/abs/2510.24768
作者: Benjamin Camus,Julien Houssay,Corentin Le Barbu,Eric Monteux,Cédric Saleun(DGA.MI),Christian Cochin(DGA.MI)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:This work aims to train Deep Learning models to perform Automatic Target Recognition (ATR) on Synthetic Aperture Radar (SAR) images. To circumvent the lack of real labelled measurements, we resort to synthetic data produced by SAR simulators. Simulation offers full control over the virtual environment, which enables us to generate large and diversified datasets at will. However, simulations are intrinsically grounded on simplifying assumptions of the real world (i.e. physical models). Thus, synthetic datasets are not as representative as real measurements. Consequently, ATR models trained on synthetic images cannot generalize well on real measurements. Our contributions to this problem are twofold: on one hand, we demonstrate and quantify the impact of the simulation paradigm on the ATR. On the other hand, we propose a new approach to tackle the ATR problem: combine two SAR simulators that are grounded on different (but complementary) paradigms to produce synthetic datasets. To this end, we use two simulators: MOCEM, which is based on a scattering centers model approach, and Salsa, which relies on a ray-tracing strategy. We train ATR models using synthetic datasets generated by both MOCEM and Salsa and our Deep Learning approach called ADASCA. We reach an accuracy of almost 88% on the MSTAR measurements.
zh

[CV-77] Towards Fine-Grained Human Motion Video Captioning

【速读】:该论文旨在解决视频描述生成中难以准确捕捉细微动作细节的问题,现有方法常因忽略人体运动动态而导致描述模糊或语义不一致。其解决方案的关键在于提出Motion-Augmented Caption Model (M-ACM),通过引入基于人体网格恢复(human mesh recovery)的运动感知解码机制,显式建模人体运动动力学特征,从而减少幻觉并提升生成描述在语义准确性和空间一致性上的表现。

链接: https://arxiv.org/abs/2510.24767
作者: Guorui Song,Guocun Wang,Zhe Huang,Jing Lin,Xuefei Zhe,Jian Li,Haoqian Wang
机构: Tsinghua University Shenzhen (清华大学深圳); ByteDance Singapore (字节跳动新加坡); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generating accurate descriptions of human actions in videos remains a challenging task for video captioning models. Existing approaches often struggle to capture fine-grained motion details, resulting in vague or semantically inconsistent captions. In this work, we introduce the Motion-Augmented Caption Model (M-ACM), a novel generative framework that enhances caption quality by incorporating motion-aware decoding. At its core, M-ACM leverages motion representations derived from human mesh recovery to explicitly highlight human body dynamics, thereby reducing hallucinations and improving both semantic fidelity and spatial alignment in the generated captions. To support research in this area, we present the Human Motion Insight (HMI) Dataset, comprising 115K video-description pairs focused on human movement, along with HMI-Bench, a dedicated benchmark for evaluating motion-focused video captioning. Experimental results demonstrate that M-ACM significantly outperforms previous methods in accurately describing complex human motions and subtle temporal variations, setting a new standard for motion-centric video captioning.
zh

[CV-78] DrivingScene: A Multi-Task Online Feed-Forward 3D Gaussian Splatting Method for Dynamic Driving Scenes

【速读】:该论文旨在解决动态驾驶场景的实时、高保真4D重建问题,其核心挑战在于复杂动态物体运动与有限视角带来的信息稀疏性,现有方法难以在重建质量与计算效率之间取得平衡。解决方案的关键在于提出一个在线、前馈式的框架DrivingScene,通过引入轻量级残差光流网络,在已学习的静态场景先验基础上预测每摄像头下动态物体的非刚性运动,并显式地利用场景光流(scene flow)建模动态过程;同时采用粗到精的训练范式,有效缓解端到端方法常见的训练不稳定问题,从而实现仅用两帧相邻环视图像即可在线生成高质量深度图、场景光流和3D高斯点云,在nuScenes数据集上显著优于当前最优方法的动态重建与新视角合成性能。

链接: https://arxiv.org/abs/2510.24734
作者: Qirui Hou,Wenzhang Sun,Chang Zeng,Chunfeng Wang,Hao Li,Jianxun Cui
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Autonomous Driving, Novel view Synthesis, Multi task Learning

点击查看摘要

Abstract:Real-time, high-fidelity reconstruction of dynamic driving scenes is challenged by complex dynamics and sparse views, with prior methods struggling to balance quality and efficiency. We propose DrivingScene, an online, feed-forward framework that reconstructs 4D dynamic scenes from only two consecutive surround-view images. Our key innovation is a lightweight residual flow network that predicts the non-rigid motion of dynamic objects per camera on top of a learned static scene prior, explicitly modeling dynamics via scene flow. We also introduce a coarse-to-fine training paradigm that circumvents the instabilities common to end-to-end approaches. Experiments on nuScenes dataset show our image-only method simultaneously generates high-quality depth, scene flow, and 3D Gaussian point clouds online, significantly outperforming state-of-the-art methods in both dynamic reconstruction and novel view synthesis.
zh

[CV-79] Modelling the Interplay of Eye-Tracking Temporal Dynamics and Personality for Emotion Detection in Face-to-Face Settings

【速读】:该论文旨在解决动态对话场景下人类情绪准确识别的难题,尤其是在存在个体主观差异和复杂交互情境时的情绪建模问题。其解决方案的关键在于提出了一种人格感知的多模态框架,通过融合眼动序列(eye-tracking sequences)、大五人格特质(Big Five personality traits)以及情境刺激线索(contextual stimulus cues),实现对“感知情绪”(perceived emotion)与“感受情绪”(felt emotion)的区分性预测。实验表明,刺激线索显著提升感知情绪预测性能(宏平均F1达0.77),而人格特质则在感受情绪识别中带来最大改进(宏平均F1达0.58),验证了生理、特质与情境信息协同建模对缓解情绪主观性的有效性。

链接: https://arxiv.org/abs/2510.24720
作者: Meisam J. Seikavandi,Jostein Fimland,Fabricio Batista Narcizo,Maria Barrett,Ted Vucurevich,Jesper Bünsow Boldt,Andrew Burke Dittberner,Paolo Burelli
机构: brAIn Lab, IT University of Copenhagen; GN Advanced Science
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate recognition of human emotions is critical for adaptive human-computer interaction, yet remains challenging in dynamic, conversation-like settings. This work presents a personality-aware multimodal framework that integrates eye-tracking sequences, Big Five personality traits, and contextual stimulus cues to predict both perceived and felt emotions. Seventy-three participants viewed speech-containing clips from the CREMA-D dataset while providing eye-tracking signals, personality assessments, and emotion ratings. Our neural models captured temporal gaze dynamics and fused them with trait and stimulus information, yielding consistent gains over SVM and literature baselines. Results show that (i) stimulus cues strongly enhance perceived-emotion predictions (macro F1 up to 0.77), while (ii) personality traits provide the largest improvements for felt emotion recognition (macro F1 up to 0.58). These findings highlight the benefit of combining physiological, trait-level, and contextual information to address the inherent subjectivity of emotion. By distinguishing between perceived and felt responses, our approach advances multimodal affective computing and points toward more personalized and ecologically valid emotion-aware systems.
zh

[CV-80] Transformers in Medicine: Improving Vision-Language Alignment for Medical Image Captioning

【速读】:该论文旨在解决医学影像自动报告生成中的语义准确性和临床相关性不足的问题,特别是针对磁共振成像(MRI)扫描的文本描述生成任务。其解决方案的关键在于提出了一种基于Transformer的多模态框架,通过DEiT-Small视觉Transformer作为图像编码器、MediCareBERT进行文本嵌入,并结合自定义LSTM解码器实现图文对齐;同时引入混合余弦-MSE损失函数与对比推理机制(基于向量相似度),有效提升图像与文本嵌入之间的语义一致性。实验表明,聚焦于特定领域数据(如仅脑部MRI)可显著改善生成Caption的准确性与临床适用性。
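
编者注:摘要提到的"混合余弦-MSE损失"可以写成如下最小示意(alpha 权重为编者假设);推理时的对比机制即对图像嵌入与候选文本嵌入做余弦相似度检索:

```python
import torch
import torch.nn.functional as F

def hybrid_cosine_mse_loss(img_emb, txt_emb, alpha=0.5):
    """混合余弦-MSE 损失:余弦项对齐方向,MSE 项对齐数值(alpha 为编者假设)。"""
    cos_term = 1.0 - F.cosine_similarity(img_emb, txt_emb, dim=-1).mean()
    mse_term = F.mse_loss(img_emb, txt_emb)
    return alpha * cos_term + (1.0 - alpha) * mse_term

img = torch.randn(8, 512, requires_grad=True)
txt = torch.randn(8, 512)
loss = hybrid_cosine_mse_loss(img, txt)
loss.backward()
print(float(loss))
```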

链接: https://arxiv.org/abs/2510.25164
作者: Yogesh Thakku Suresh,Vishwajeet Shivaji Hogale,Luca-Alexandru Zamfira,Anandavardhana Hegde
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: This work is to appear in the Proceedings of MICAD 2025, the 6th International Conference on Medical Imaging and Computer-Aided Diagnosis

点击查看摘要

Abstract:We present a transformer-based multimodal framework for generating clinically relevant captions for MRI scans. Our system combines a DEiT-Small vision transformer as an image encoder, MediCareBERT for caption embedding, and a custom LSTM-based decoder. The architecture is designed to semantically align image and textual embeddings, using hybrid cosine-MSE loss and contrastive inference via vector similarity. We benchmark our method on the MultiCaRe dataset, comparing performance on filtered brain-only MRIs versus general MRI images against state-of-the-art medical image captioning methods including BLIP, R2GenGPT, and recent transformer-based approaches. Results show that focusing on domain-specific data improves caption accuracy and semantic alignment. Our work proposes a scalable, interpretable solution for automated medical image reporting.
zh

[CV-81] CT-Less Attenuation Correction Using Multiview Ensemble Conditional Diffusion Model on High-Resolution Uncorrected PET Images

【速读】:该论文旨在解决正电子发射断层成像(PET)中因光子衰减导致的定量不准确问题,该问题会干扰良恶性病变的区分并可能导致误诊。传统方法依赖于与PET同机采集的CT图像进行衰减校正,但存在额外电离辐射暴露、PET/CT序列间空间配准误差以及设备成本高等局限性。论文提出的关键解决方案是利用条件去噪扩散概率模型(Conditional Denoising Diffusion Probabilistic Models, DDPMs)从非衰减校正的PET图像中合成伪CT(pseudo-CT)图像,从而实现无需真实CT即可完成衰减校正。其核心创新在于采用三个正交视角的非衰减校正PET图像作为输入,并结合集成投票策略生成高质量伪CT图像,显著降低伪影并提升切片间一致性,最终在159例头部扫描数据上实现了平均绝对误差32 ± 10.4 HU和区域平均误差1.48 ± 0.68%的优异性能。
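
编者注:摘要中"三个正交视角 + 集成投票"的后处理步骤可用如下 NumPy 片段示意;此处以体素级中位数代表投票方式,属编者假设:

```python
import numpy as np

def ensemble_pseudo_ct(axial, coronal, sagittal):
    """对三个正交视角重建的伪CT体积做体素级中位数投票,抑制单视角伪影。"""
    stack = np.stack([axial, coronal, sagittal], axis=0)  # (3, D, H, W)
    return np.median(stack, axis=0)

views = [np.random.normal(40.0, 10.0, size=(64, 64, 64)) for _ in range(3)]  # 模拟 HU 体积
pseudo_ct = ensemble_pseudo_ct(*views)
print(pseudo_ct.shape)  # (64, 64, 64)
```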

链接: https://arxiv.org/abs/2510.24805
作者: Alexandre St-Georges,Gabriel Richard,Maxime Toussaint,Christian Thibaudeau,Etienne Auger,Étienne Croteau,Stephen Cunnane,Roger Lecomte,Jean-Baptiste Michaud
机构: Université de Sherbrooke (舍布鲁克大学)
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: This is a preprint and not the final version of this paper

点击查看摘要

Abstract:Accurate quantification in positron emission tomography (PET) is essential for accurate diagnostic results and effective treatment tracking. A major issue encountered in PET imaging is attenuation. Attenuation refers to the diminution of photons detected as they traverse biological tissues before reaching detectors. When such corrections are absent or inadequate, this signal degradation can introduce inaccurate quantification, making it difficult to differentiate benign from malignant conditions, and can potentially lead to misdiagnosis. Typically, this correction is done with co-computed Computed Tomography (CT) imaging to obtain structural data for calculating photon attenuation across the body. However, this methodology subjects patients to extra ionizing radiation exposure, suffers from potential spatial misregistration between PET/CT imaging sequences, and demands costly equipment infrastructure. Emerging advances in neural network architectures present an alternative approach via synthetic CT image synthesis. Our investigation reveals that Conditional Denoising Diffusion Probabilistic Models (DDPMs) can generate high quality CT images from non-attenuation-corrected PET images in order to correct attenuation. By utilizing all three orthogonal views from non-attenuation-corrected PET images, the DDPM approach combined with ensemble voting generates higher quality pseudo-CT images with reduced artifacts and improved slice-to-slice consistency. Results from a study of 159 head scans acquired with the Siemens Biograph Vision PET/CT scanner demonstrate both qualitative and quantitative improvements in pseudo-CT generation. The method achieved a mean absolute error of 32 ± 10.4 HU on the CT images and an average error of (1.48 ± 0.68)% across all regions of interest when comparing PET images reconstructed using the attenuation map of the generated pseudo-CT versus the true CT.
zh

[CV-82] CFL-SparseMed: Communication-Efficient Federated Learning for Medical Imaging with Top-k Sparse Updates

【速读】:该论文旨在解决医疗图像分类中因数据隐私和分布异构性(non-IID)导致的联邦学习(Federated Learning, FL)效率低下与模型性能下降问题。其解决方案的关键在于提出CFL-SparseMed方法,通过引入Top-k梯度稀疏化技术,在每次通信轮次中仅传输最重要的k个梯度参数,从而显著降低通信开销,同时在保持模型准确性的前提下有效缓解数据异构性带来的负面影响,提升非独立同分布(non-IID)环境下医疗影像诊断的可靠性和隐私保护能力。
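
编者注:Top-k 梯度稀疏化的核心操作非常简洁,可用 PyTorch 示意如下(k_ratio 取值为编者假设;实际系统中客户端只需上传非零分量的索引与数值):

```python
import torch

def topk_sparsify(grad, k_ratio=0.01):
    """仅保留绝对值最大的 k 个梯度分量,其余置零(客户端上传前调用)。"""
    flat = grad.flatten()
    k = max(1, int(k_ratio * flat.numel()))
    _, idx = torch.topk(flat.abs(), k)
    sparse = torch.zeros_like(flat)
    sparse[idx] = flat[idx]
    return sparse.view_as(grad)

g = torch.randn(256, 256)
g_sparse = topk_sparsify(g)
print(float((g_sparse != 0).float().mean()))  # ≈ 0.01,即通信量约降至 1%
```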

链接: https://arxiv.org/abs/2510.24776
作者: Gousia Habib,Aniket Bhardwaj,Ritvik Sharma,Shoeib Amin Banday,Ishfaq Ahmad Malik
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Secure and reliable medical image classification is crucial for effective patient treatment, but centralized models face challenges due to data and privacy concerns. Federated Learning (FL) enables privacy-preserving collaborations but struggles with heterogeneous, non-IID data and high communication costs, especially in large networks. We propose CFL-SparseMed, an FL approach that uses Top-k Sparsification to reduce communication overhead by transmitting only the top k gradients. This unified solution effectively addresses data heterogeneity while maintaining model accuracy. It enhances FL efficiency, preserves privacy, and improves diagnostic accuracy and patient care in non-IID medical imaging settings. The reproducibility source code is available on Github (this https URL).
zh

[CV-83] DMVFC: Deep Learning Based Functionally Consistent Tractography Fiber Clustering Using Multimodal Diffusion MRI and Functional MRI

【速读】:该论文旨在解决当前基于扩散磁共振成像(dMRI)的纤维追踪聚类方法在白质(WM)分割中仅依赖纤维几何特征而忽略功能和微结构信息的问题,从而导致聚类结果在功能一致性上不足。解决方案的关键在于提出一种新型深度学习纤维聚类框架——Deep Multi-view Fiber Clustering (DMVFC),其通过联合利用多模态dMRI与功能性磁共振成像(fMRI)数据,有效整合纤维轨迹的空间几何特性、微结构指标(如各向异性分数,FA)以及沿纤维路径的功能性BOLD信号,实现更符合脑功能一致性的白质分区。该框架包含两个核心模块:一是多视角预训练模块,分别提取来自几何、微结构和功能信号的嵌入特征;二是协同微调模块,同步优化不同模态嵌入间的差异,从而提升聚类结果的功能与解剖一致性。

链接: https://arxiv.org/abs/2510.24770
作者: Bocheng Guo,Jin Wang,Yijie Li,Junyi Wang,Mingyu Gao,Puming Feng,Yuqian Chen,Jarrett Rushmore,Nikos Makris,Yogesh Rathi,Lauren J O’Donnell,Fan Zhang
机构: University of Electronic Science and Technology of China (电子科技大学); Boston University (波士顿大学); Brigham and Women’s Hospital (布莱根妇女医院); Harvard Medical School (哈佛医学院)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages

点击查看摘要

Abstract:Tractography fiber clustering using diffusion MRI (dMRI) is a crucial method for white matter (WM) parcellation to enable analysis of the brain's structural connectivity in health and disease. Current fiber clustering strategies primarily use the fiber geometric characteristics (i.e., the spatial trajectories) to group similar fibers into clusters, while neglecting the functional and microstructural information of the fiber tracts. There is increasing evidence that neural activity in the WM can be measured using functional MRI (fMRI), providing potentially valuable multimodal information for fiber clustering to enhance its functional coherence. Furthermore, microstructural features such as fractional anisotropy (FA) can be computed from dMRI as additional information to ensure the anatomical coherence of the clusters. In this paper, we develop a novel deep learning fiber clustering framework, namely Deep Multi-view Fiber Clustering (DMVFC), which uses joint multi-modal dMRI and fMRI data to enable functionally consistent WM parcellation. DMVFC can effectively integrate the geometric and microstructural characteristics of the WM fibers with the fMRI BOLD signals along the fiber tracts. DMVFC includes two major components: (1) a multi-view pretraining module to compute embedding features from each source of information separately, including fiber geometry, microstructure measures, and functional signals, and (2) a collaborative fine-tuning module to simultaneously refine the differences of embeddings. In the experiments, we compare DMVFC with two state-of-the-art fiber clustering methods and demonstrate superior performance in achieving functionally meaningful and consistent WM parcellation results.
zh

人工智能

[AI-0] TheraMind: A Strategic and Adaptive Agent for Longitudinal Psychological Counseling

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在心理辅导应用中普遍存在的三大问题:缺乏情感理解能力、无法制定适应性策略,以及缺乏跨多轮会话的长期记忆与治疗方法调整能力,导致其难以满足真实临床实践的需求。解决方案的关键在于提出TheraMind——一个面向长期心理辅导的策略性自适应代理,其核心是新颖的双环架构:内层会话环(Intra-Session Loop)负责实时感知用户情绪并动态选择响应策略,同时利用跨会话记忆保障连续性;外层会话环(Cross-Session Loop)则通过评估每次干预的疗效来优化后续治疗策略,从而实现长期适应性与临床效度的提升。

链接: https://arxiv.org/abs/2510.25758
作者: He Hu,Yucheng Zhou,Chiyuan Ma,Qianning Wang,Zheng Zhang,Fei Ma,Laizhong Cui,Qi Tian
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) in psychological counseling have attracted increasing attention. However, existing approaches often lack emotional understanding, adaptive strategies, and the use of therapeutic methods across multiple sessions with long-term memory, leaving them far from real clinical practice. To address these critical gaps, we introduce TheraMind, a strategic and adaptive agent for longitudinal psychological counseling. The cornerstone of TheraMind is a novel dual-loop architecture that decouples the complex counseling process into an Intra-Session Loop for tactical dialogue management and a Cross-Session Loop for strategic therapeutic planning. The Intra-Session Loop perceives the patient’s emotional state to dynamically select response strategies while leveraging cross-session memory to ensure continuity. Crucially, the Cross-Session Loop empowers the agent with long-term adaptability by evaluating the efficacy of the applied therapy after each session and adjusting the method for subsequent interactions. We validate our approach in a high-fidelity simulation environment grounded in real clinical cases. Extensive evaluations show that TheraMind outperforms other methods, especially on multi-session metrics like Coherence, Flexibility, and Therapeutic Attunement, validating the effectiveness of its dual-loop design in emulating strategic, adaptive, and longitudinal therapeutic behavior. The code is publicly available at this https URL.
zh

[AI-1] LieSolver: A PDE-constrained solver for IBVPs using Lie symmetries

【速读】:该论文旨在解决初始边界值问题(Initial-Boundary Value Problems, IBVPs)在数值求解中的效率与精度难题,尤其针对传统物理信息神经网络(Physics-Informed Neural Networks, PINNs)存在的收敛慢、误差难估计等问题。解决方案的关键在于利用李对称性(Lie symmetries)构造满足偏微分方程(PDE)的模型,通过对称变换将物理定律精确嵌入模型结构中,从而使得损失函数直接反映模型预测的准确性,并实现对适定IBVP的严格误差估计。该方法不仅提升了优化效率和解的可靠性,还生成了紧凑的模型结构,显著优于PINNs在线性齐次PDE上的表现。
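
编者注:以一维热方程为例补充说明"用李对称性按构造满足 PDE"的思路(教科书式的标准例子,非论文原文):若 u(x,t) 是 u_t = u_xx 的解,则其对称变换后的函数仍是解,模型即可取为一族变换后已知解的线性组合,只需用初边值数据拟合系数。

```latex
% 一维热方程 u_t = u_{xx}:若 u(x,t) 是解,则以下变换后的函数仍是解
\begin{align*}
  &\text{平移:}       && u(x - c,\; t - s),\\
  &\text{伸缩:}       && u(\lambda x,\; \lambda^2 t),\\
  &\text{伽利略变换:} && e^{-\frac{v}{2}x + \frac{v^2}{4}t}\, u(x - vt,\; t).
\end{align*}
% 取一族对称变换作用于已知解(如热核)并做线性组合,
% 所得模型按构造精确满足 PDE,损失只需衡量初边值数据的拟合程度。
```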

链接: https://arxiv.org/abs/2510.25731
作者: René P. Klausen,Ivan Timofeev,Johannes Frank,Jonas Naujoks,Thomas Wiegand,Sebastian Lapuschkin,Wojciech Samek
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
备注:

点击查看摘要

Abstract:We introduce a method for efficiently solving initial-boundary value problems (IBVPs) that uses Lie symmetries to enforce the associated partial differential equation (PDE) exactly by construction. By leveraging symmetry transformations, the model inherently incorporates the physical laws and learns solutions from initial and boundary data. As a result, the loss directly measures the model’s accuracy, leading to improved convergence. Moreover, for well-posed IBVPs, our method enables rigorous error estimation. The approach yields compact models, facilitating an efficient optimization. We implement LieSolver and demonstrate its application to linear homogeneous PDEs with a range of initial conditions, showing that it is faster and more accurate than physics-informed neural networks (PINNs). Overall, our method improves both computational efficiency and the reliability of predictions for PDE-constrained problems.
zh

[AI-2] BambooKG: A Neurobiologically-inspired Frequency-Weight Knowledge Graph

【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)在处理多跳推理(multi-hop reasoning)和跨文档关系推理时表现不佳的问题,尤其是在独立处理检索到的文本片段时难以捕捉实体间的复杂关联。其解决方案的关键在于引入一种新型知识图谱 BambooKG,该图谱不仅保留传统三元组(triplet)结构以表达明确的实体关系,还通过频率加权机制为非三元组边赋予权重,反映实体间连接强度,这一设计借鉴了神经科学中的赫布理论(Hebbian principle: “fire together, wire together”),从而减少信息丢失并提升单跳与多跳推理任务的性能表现。
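
编者注:"共现即加权"的赫布式更新可以用几行 Python 示意如下(数据结构与更新幅度均为编者假设,BambooKG 实际还保留常规三元组边):

```python
from collections import defaultdict

class FrequencyWeightedKG:
    """概念示意:非三元组边按共现频次加权("一起激活,一起连接")。"""
    def __init__(self):
        self.edges = defaultdict(float)  # (实体A, 实体B) -> 连接强度

    def observe(self, entities):
        """同一文本块中共现的实体两两加强连接(赫布式更新)。"""
        ents = sorted(set(entities))
        for i in range(len(ents)):
            for j in range(i + 1, len(ents)):
                self.edges[(ents[i], ents[j])] += 1.0

    def strength(self, a, b):
        return self.edges.get(tuple(sorted((a, b))), 0.0)

kg = FrequencyWeightedKG()
kg.observe(["Paris", "France", "Seine"])
kg.observe(["Paris", "France"])
print(kg.strength("Paris", "France"))  # 2.0,强于 Paris-Seine 的 1.0
```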

链接: https://arxiv.org/abs/2510.25724
作者: Vanya Arikutharam,Arkadiy Ukolov
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation allows LLMs to access external knowledge, reducing hallucinations and ageing-data issues. However, it treats retrieved chunks independently and struggles with multi-hop or relational reasoning, especially across documents. Knowledge graphs enhance this by capturing the relationships between entities using triplets, enabling structured, multi-chunk reasoning. However, these tend to miss information that fails to conform to the triplet structure. We introduce BambooKG, a knowledge graph with frequency-based weights on non-triplet edges which reflect link strength, drawing on the Hebbian principle of “fire together, wire together”. This decreases information loss and results in improved performance on single- and multi-hop reasoning, outperforming the existing solutions.
zh

[AI-3] Graph Network-based Structural Simulator: Graph Neural Networks for Structural Dynamics

【速读】:该论文旨在解决生成式 AI (Generative AI) 在动态结构问题仿真中的应用不足问题,尤其是现有图神经网络(Graph Neural Networks, GNNs)在波传播主导的动态结构模拟中难以保持物理一致性与长期稳定性的问题。解决方案的关键在于提出一种基于图网络的结构模拟器(Graph Network-based Structural Simulator, GNSS),其核心创新包括:(i) 在节点固定局部坐标系中表达节点运动学,避免有限差分速度计算中的灾难性相消误差;(ii) 引入符号感知回归损失函数,有效降低长时间滚动预测中的相位误差;(iii) 采用波长感知的连接半径优化图结构构建,提升局部性保留能力。这些设计使GNSS在数百时间步内准确复现物理行为,并在未见载荷条件下实现良好泛化性能,相较显式有限元方法显著提升推理速度且保持时空保真度。
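
编者注:论文摘要未给出"符号感知回归损失"的具体形式;下面是编者猜测的一种常见写法——在 MSE 基础上对预测与目标异号(即相位反转)的分量追加惩罚,仅作概念参考:

```python
import torch
import torch.nn.functional as F

def sign_aware_loss(pred, target, lam=0.5):
    """编者假设的一种"符号感知"回归损失:MSE + 异号惩罚。
    pred 与 target 异号的分量受到额外惩罚,以抑制长程滚动中的相位误差。"""
    mse = F.mse_loss(pred, target)
    sign_penalty = F.relu(-pred * target).mean()  # 两者异号时 -pred*target > 0
    return mse + lam * sign_penalty

pred, target = torch.randn(32, 100), torch.randn(32, 100)
print(float(sign_aware_loss(pred, target)))
```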

链接: https://arxiv.org/abs/2510.25683
作者: Alessandro Lucchetti(1),Francesco Cadini(1),Marco Giglio(1),Luca Lomazzi(1) ((1) Politecnico di Milano, Department of Mechanical Engineering, Milano, Italy)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computational Physics (physics.comp-ph)
备注: 16 pages, 14 figures

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have recently been explored as surrogate models for numerical simulations. While their applications in computational fluid dynamics have been investigated, little attention has been given to structural problems, especially for dynamic cases. To address this gap, we introduce the Graph Network-based Structural Simulator (GNSS), a GNN framework for surrogate modeling of dynamic structural problems. GNSS follows the encode-process-decode paradigm typical of GNN-based machine learning models, and its design makes it particularly suited for dynamic simulations thanks to three key features: (i) expressing node kinematics in node-fixed local frames, which avoids catastrophic cancellation in finite-difference velocities; (ii) employing a sign-aware regression loss, which reduces phase errors in long rollouts; and (iii) using a wavelength-informed connectivity radius, which optimizes graph construction. We evaluate GNSS on a case study involving a beam excited by a 50kHz Hanning-modulated pulse. The results show that GNSS accurately reproduces the physics of the problem over hundreds of timesteps and generalizes to unseen loading conditions, where existing GNNs fail to converge or deliver meaningful predictions. Compared with explicit finite element baselines, GNSS achieves substantial inference speedups while preserving spatial and temporal fidelity. These findings demonstrate that locality-preserving GNNs with physics-consistent update rules are a competitive alternative for dynamic, wave-dominated structural simulations.
zh

[AI-4] Navigation in a Three-Dimensional Urban Flow using Deep Reinforcement Learning

【速读】:该论文旨在解决无人机(Unmanned Aerial Vehicles, UAVs)在复杂城市环境中自主导航的挑战,特别是在存在湍流和回流区等高动态气流条件下如何提升路径规划的可靠性与安全性。解决方案的关键在于提出一种基于深度强化学习的流场感知策略,采用改进的近端策略优化(Proximal Policy Optimization, PPO)算法,并融合门控变压器扩展大型(Gated Transformer eXtra Large, GTrXL)架构,使智能体能够有效利用湍流场信息进行决策;同时引入辅助预测任务以增强模型对环境状态的理解,从而显著提高成功到达率(Success Rate, SR)并降低碰撞率(Crash Rate, CR),优于传统PPO+LSTM、纯PPO+GTrXL及经典Zermelo导航算法。

链接: https://arxiv.org/abs/2510.25679
作者: Federica Tonti,Ricardo Vinuesa
机构: 未知
类目: Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn)
备注:

点击查看摘要

Abstract:Unmanned Aerial Vehicles (UAVs) are increasingly populating urban areas for delivery and surveillance purposes. In this work, we develop an optimal navigation strategy based on Deep Reinforcement Learning. The environment is represented by a three-dimensional high-fidelity simulation of an urban flow, characterized by turbulence and recirculation zones. The algorithm presented here is a flow-aware Proximal Policy Optimization (PPO) combined with a Gated Transformer eXtra Large (GTrXL) architecture, giving the agent richer information about the turbulent flow field in which it navigates. The results are compared with a PPO+GTrXL without the secondary prediction tasks, a PPO combined with Long Short Term Memory (LSTM) cells and a traditional navigation algorithm. The obtained results show a significant increase in the success rate (SR) and a lower crash rate (CR) compared to a PPO+LSTM, PPO+GTrXL and the classical Zermelo’s navigation algorithm, paving the way to a completely reimagined UAV landscape in complex urban environments.
zh

[AI-5] ALDEN: Reinforcement Learning for Active Navigation and Evidence Gathering in Long Documents

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在处理长且视觉信息复杂的文档时,因依赖固定推理模板或刚性流水线而难以实现高效、泛化能力强的跨页信息分析与整合的问题。其核心解决方案是提出一种多轮强化学习框架——主动长文档导航(Active Long-DocumEnt Navigation, ALDEN),该框架将VLM训练为具备主动导航能力的智能体,通过引入“fetch动作”直接访问指定页面以利用文档结构,并设计基于规则的跨层级奖励机制提供逐轮和逐标记级别的监督信号;同时,为缓解长文档带来的大量视觉token导致的训练不稳定性,进一步提出视觉-语义锚定机制,采用双路径KL散度约束分别稳定视觉与文本表示的学习过程。

链接: https://arxiv.org/abs/2510.25668
作者: Tianyu Yang,Terry Ruas,Yijun Tian,Jan Philip Wahle,Daniel Kurzawe,Bela Gipp
机构: 未知
类目: Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) excel at interpreting text-rich images but struggle with long, visually complex documents that demand analysis and integration of information spread across multiple pages. Existing approaches typically rely on fixed reasoning templates or rigid pipelines, which force VLMs into a passive role and hinder both efficiency and generalization. We present Active Long-DocumEnt Navigation (ALDEN), a multi-turn reinforcement learning framework that fine-tunes VLMs as interactive agents capable of actively navigating long, visually rich documents. ALDEN introduces a novel fetch action that directly accesses the page by index, complementing the classic search action and better exploiting document structure. For dense process supervision and efficient training, we propose a rule-based cross-level reward that provides both turn- and token-level signals. To address the empirically observed training instability caused by numerous visual tokens from long documents, we further propose a visual-semantic anchoring mechanism that applies a dual-path KL-divergence constraint to stabilize visual and textual representations separately during training. Trained on a corpus constructed from three open-source datasets, ALDEN achieves state-of-the-art performance on five long-document benchmarks. Overall, ALDEN marks a step beyond passive document reading toward agents that autonomously navigate and reason across long, visually rich documents, offering a robust path to more accurate and efficient long-document understanding.
zh

[AI-6] User Misconceptions of LLM-Based Conversational Programming Assistants

【速读】:该论文旨在解决用户在使用基于大语言模型(Large Language Models, LLMs)的编程助手时,因对工具功能认知不清而产生的误解问题。这些误解可能导致过度依赖、低效编程实践或缺乏代码质量控制。研究通过两阶段方法识别并验证用户可能存在的概念性偏差:首先系统梳理潜在误解类型,随后基于真实Python编程对话数据进行定性分析,发现用户常误以为LLM聊天机器人具备网络访问、代码执行或非文本输出等实际未支持的功能,且存在对调试、验证和优化程序所需信息范围理解不足的问题。解决方案的关键在于设计更清晰传达编程能力边界与限制的LLM辅助工具,以提升用户对系统能力的认知准确性,从而改善交互效率与代码质量。

链接: https://arxiv.org/abs/2510.25662
作者: Gabrielle O’Brien,Antonio Pedro Santos Alves,Sebastian Baltes,Grischa Liebel,Mircea Lungu,Marcos Kalinowski
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Programming assistants powered by large language models (LLMs) have become widely available, with conversational assistants like ChatGPT proving particularly accessible to less experienced programmers. However, the varied capabilities of these tools across model versions and the mixed availability of extensions that enable web search, code execution, or retrieval-augmented generation create opportunities for user misconceptions about what systems can and cannot do. Such misconceptions may lead to over-reliance, unproductive practices, or insufficient quality control in LLM-assisted programming. Here, we aim to characterize misconceptions that users of conversational LLM-based assistants may have in programming contexts. Using a two-phase approach, we first brainstorm and catalog user misconceptions that may occur, and then conduct a qualitative analysis to examine whether these conceptual issues surface in naturalistic Python-programming conversations with an LLM-based chatbot drawn from an openly available dataset. Indeed, we see evidence that some users have misplaced expectations about the availability of LLM-based chatbot features like web access, code execution, or non-text output generation. We also see potential evidence for deeper conceptual issues around the scope of information required to debug, validate, and optimize programs. Our findings reinforce the need for designing LLM-based tools that more clearly communicate their programming capabilities to users.
zh

[AI-7] Subgraph Federated Learning via Spectral Methods NEURIPS

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中图结构数据在多客户端间分布时的隐私与可扩展性问题,特别是针对互联子图场景下,客户端间交互对学习过程的影响。现有方法要么需要交换敏感的节点嵌入(node embeddings),带来隐私泄露风险;要么依赖计算密集型步骤,难以扩展。解决方案的关键在于提出FedLap框架,通过谱域(spectral domain)中的拉普拉斯平滑(Laplacian smoothing)利用全局结构信息有效捕捉节点间依赖关系,同时保障隐私并提升可扩展性。该方法首次在子图联邦学习中提供了强隐私保证,并在基准数据集上展现出竞争力或更优的性能表现。
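
编者注:谱域拉普拉斯平滑的基本形式可用 NumPy 示意如下(迭代次数与步长 alpha 为编者假设;FedLap 中该操作需在联邦设定下结合隐私机制执行):

```python
import numpy as np

def laplacian_smooth(X, A, alpha=0.5, steps=2):
    """对节点特征做拉普拉斯平滑:X <- X - alpha * L_norm @ X。
    A 为邻接矩阵(可含跨客户端连接),L_norm 为对称归一化拉普拉斯。"""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L = np.eye(A.shape[0]) - d_inv_sqrt @ A @ d_inv_sqrt  # I - D^{-1/2} A D^{-1/2}
    for _ in range(steps):
        X = X - alpha * (L @ X)
    return X

A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
X = np.random.randn(3, 4)
print(laplacian_smooth(X, A).shape)  # (3, 4),相邻节点特征被拉近
```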

链接: https://arxiv.org/abs/2510.25657
作者: Javad Aliakbari,Johan Östman,Ashkan Panahi,Alexandre Graell i Amat
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注: To be presented at The Annual Conference on Neural Information Processing Systems (NeurIPS) 2025

点击查看摘要

Abstract:We consider the problem of federated learning (FL) with graph-structured data distributed across multiple clients. In particular, we address the prevalent scenario of interconnected subgraphs, where interconnections between clients significantly influence the learning process. Existing approaches suffer from critical limitations, either requiring the exchange of sensitive node embeddings, thereby posing privacy risks, or relying on computationally-intensive steps, which hinders scalability. To tackle these challenges, we propose FedLap, a novel framework that leverages global structure information via Laplacian smoothing in the spectral domain to effectively capture inter-node dependencies while ensuring privacy and scalability. We provide a formal analysis of the privacy of FedLap, demonstrating that it preserves privacy. Notably, FedLap is the first subgraph FL scheme with strong privacy guarantees. Extensive experiments on benchmark datasets demonstrate that FedLap achieves competitive or superior utility compared to existing techniques.
zh

[AI-8] Learning to Plan & Schedule with Reinforcement-Learned Bimanual Robot Skills

【速读】:该论文旨在解决长时程、高接触频率的双臂操作任务中复杂的协调问题,这类任务需要在双手之间实现并行执行与顺序协作的混合控制策略。解决方案的关键在于提出一种分层框架,将问题建模为一个集成的技能规划与调度问题,突破传统纯顺序决策的限制,支持多技能的同时调用;其核心组件包括基于强化学习(Reinforcement Learning, RL)在GPU加速仿真环境中训练的单臂和双臂基础技能库,以及一个基于Transformer的高层调度器,该调度器通过学习技能组合数据集,同时预测技能的离散调度序列及其连续参数,从而实现更高效、协调的复杂操作行为。

链接: https://arxiv.org/abs/2510.25634
作者: Weikang Wan,Fabio Ramos,Xuning Yang,Caelan Garrett
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-horizon contact-rich bimanual manipulation presents a significant challenge, requiring complex coordination involving a mixture of parallel execution and sequential collaboration between arms. In this paper, we introduce a hierarchical framework that frames this challenge as an integrated skill planning and scheduling problem, going beyond purely sequential decision-making to support simultaneous skill invocation. Our approach is built upon a library of single-arm and bimanual primitive skills, each trained using Reinforcement Learning (RL) in GPU-accelerated simulation. We then train a Transformer-based planner on a dataset of skill compositions to act as a high-level scheduler, simultaneously predicting the discrete schedule of skills as well as their continuous parameters. We demonstrate that our method achieves higher success rates on complex, contact-rich tasks than end-to-end RL approaches and produces more efficient, coordinated behaviors than traditional sequential-only planners.
zh

[AI-9] Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在适应动作模态进行微调时,原始视觉-语言(VL)表征与知识保留程度不明确的问题。研究表明,直接对VLA模型进行动作微调会导致视觉表征性能下降,从而削弱其在跨任务和分布外(out-of-distribution, OOD)场景中的泛化能力。解决方案的关键在于系统性地分析VLA模型隐藏层表示和注意力机制的变化,并设计针对性的任务与方法来对比VLA模型与其对应的视觉-语言模型(VLM),从而隔离由动作微调引起的VL能力退化。进一步提出一种简单但有效的视觉表征对齐策略,在缓解表征退化的同时显著提升模型在OOD场景下的泛化性能,明确了动作微调与VL表征保持之间的权衡关系,并提供了实用的恢复继承VL能力的方法。

链接: https://arxiv.org/abs/2510.25616
作者: Nikita Kachaev,Mikhail Kolosov,Daniil Zelezetsky,Alexey K. Kovalev,Aleksandr I. Panov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 13 pages, 6 figures

点击查看摘要

Abstract:The growing success of Vision-Language-Action (VLA) models stems from the promise that pretrained Vision-Language Models (VLMs) can endow agents with transferable world knowledge and vision-language (VL) grounding, laying a foundation for action models with broader generalization. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original VL representations and knowledge are preserved. In this work, we conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations. To characterize and measure these effects, we probe VLA’s hidden representations and analyze attention maps, further, we design a set of targeted tasks and methods that contrast VLA models with their counterpart VLMs, isolating changes in VL capabilities induced by action fine-tuning. We further evaluate a range of strategies for aligning visual representations and introduce a simple yet effective method that mitigates degradation and yields improved generalization to out-of-distribution (OOD) scenarios. Taken together, our analysis clarifies the trade-off between action fine-tuning and the degradation of VL representations and highlights practical approaches to recover inherited VL capabilities. Code is publicly available: this https URL
zh

[AI-10] Counterfactual-based Agent Influence Ranker for Agent ic AI Workflows EMNLP2025

【速读】:该论文旨在解决当前对生成式 AI 工作流(Agentic AI Workflow, AAW)中各智能体(agent)影响力评估的缺失问题,即缺乏有效方法量化每个基于大语言模型(LLM)的智能体对最终输出的贡献程度。现有静态结构分析方法无法适用于推理阶段的动态执行场景,因而难以支持任务感知的实时影响评估。其解决方案的关键在于提出反事实代理影响力排序器(Counterfactual-based Agent Influence Ranker, CAIR)——一种基于反事实推理的、任务无关且可离线或在线部署的评估框架,通过模拟移除或替换特定智能体来测量其对输出的影响,从而实现对AAW中各智能体影响力水平的精准排序与识别。
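
编者注:反事实影响力评估的骨架逻辑可以用一个玩具工作流示意——逐个屏蔽智能体、重跑工作流、比较输出差异。其中工作流与差异度量均为编者虚构:

```python
class ToyWorkflow:
    """玩具 AAW:三个"智能体"依次处理文本,仅用于演示反事实排序。"""
    def __init__(self):
        self.agents = {
            "planner":  lambda s: s + " [plan]",
            "coder":    lambda s: s + " [code written]",
            "reviewer": lambda s: s,  # 对最终输出几乎无影响
        }

    def run(self, query, disabled=()):
        out = query
        for name, fn in self.agents.items():
            if name not in disabled:
                out = fn(out)
        return out

def rank_agent_influence(workflow, query, distance):
    baseline = workflow.run(query)
    scores = {}
    for agent in workflow.agents:
        counterfactual = workflow.run(query, disabled=[agent])  # 反事实:屏蔽该智能体
        scores[agent] = distance(baseline, counterfactual)      # 差异越大,影响力越高
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

distance = lambda a, b: abs(len(a) - len(b))  # 简化的输出差异度量
print(rank_agent_influence(ToyWorkflow(), "fix the login bug", distance))
# reviewer 的影响力得分最低,符合其"几乎不改变输出"的设定
```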

链接: https://arxiv.org/abs/2510.25612
作者: Amit Giloni,Chiara Picardi,Roy Betser,Shamik Bose,Aishvariya Priya Rathina Sabapathy,Roman Vainshtein
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: Accepted to EMNLP 2025, 27 pages, 6 figures

点击查看摘要

Abstract:An Agentic AI Workflow (AAW), also known as an LLM-based multi-agent system, is an autonomous system that assembles several LLM-based agents to work collaboratively towards a shared goal. The high autonomy, widespread adoption, and growing interest in such AAWs highlight the need for a deeper understanding of their operations, from both quality and security aspects. To this day, there are no existing methods to assess the influence of each agent on the AAW’s final output. Adopting techniques from related fields is not feasible since existing methods perform only static structural analysis, which is unsuitable for inference time execution. We present Counterfactual-based Agent Influence Ranker (CAIR) - the first method for assessing the influence level of each agent on the AAW’s output and determining which agents are the most influential. By performing counterfactual analysis, CAIR provides a task-agnostic analysis that can be used both offline and at inference time. We evaluate CAIR using an AAWs dataset of our creation, containing 30 different use cases with 230 different functionalities. Our evaluation showed that CAIR produces consistent rankings, outperforms baseline methods, and can easily enhance the effectiveness and relevancy of downstream tasks.
zh

[AI-11] BOLT-GAN: Bayes-Optimal Loss for Stable GAN Training

【速读】:该论文旨在解决生成对抗网络(GAN)训练过程中存在的不稳定性问题,尤其是在判别器(discriminator)与生成器(generator)之间的优化难以平衡时导致的模式崩溃或收敛困难。其解决方案的关键在于提出BOLT-GAN,这是一种基于贝叶斯最优学习阈值(Bayes Optimal Learning Threshold, BOLT)对Wasserstein GAN(WGAN)框架的改进方法;通过引入一个Lipschitz连续的判别器,BOLT-GAN隐式地最小化了一种不同于Earth Mover(Wasserstein)距离的度量距离,从而显著提升了训练稳定性,并在多个标准图像生成基准测试中实现了更低的Fréchet Inception Distance(FID),表明其在生成质量上的优越性。

链接: https://arxiv.org/abs/2510.25609
作者: Mohammadreza Tavasoli Naeini,Ali Bereyhi,Morteza Noshad,Ben Liang,Alfred O. Hero III
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:We introduce BOLT-GAN, a simple yet effective modification of the WGAN framework inspired by the Bayes Optimal Learning Threshold (BOLT). We show that with a Lipschitz continuous discriminator, BOLT-GAN implicitly minimizes a different metric distance than the Earth Mover (Wasserstein) distance and achieves better training stability. Empirical evaluations on four standard image generation benchmarks (CIFAR-10, CelebA-64, LSUN Bedroom-64, and LSUN Church-64) show that BOLT-GAN consistently outperforms WGAN, achieving 10-60% lower Frechet Inception Distance (FID). Our results suggest that BOLT is a broadly applicable principle for enhancing GAN training.
zh

[AI-12] INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats

【速读】:该论文旨在解决当前AI硬件设计中缺乏对浮点(FP)与整数(INT)量化在不同粒度下性能统一比较的问题,从而为算法与硬件协同设计提供明确指导。其关键解决方案在于系统性地分析了粗粒度与细粒度(如块级)量化格式的权衡:发现FP在粗粒度下表现更优,但在8-bit细粒度场景中,MXINT8相比其FP对应格式在算法精度和硬件效率上均更优;而对于4-bit格式,虽然FP通常更具准确性,但引入Hadamard旋转等异常值缓解技术后,NVINT4可超越NVFP4;此外,提出一种对称截断方法以消除细粒度低比特INT训练中的梯度偏差,实现MXINT8训练近乎无损的性能表现。这些结果挑战了当前以FP为主导的硬件设计趋势,表明细粒度INT格式(尤其是MXINT8)能更好地平衡精度、功耗与效率,是未来AI加速器更优选择。
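
编者注:块级 INT8 对称量化(MXINT8 风格)与"对称截断"的含义可用如下 NumPy 片段示意——每个块共享缩放因子,且量化范围取对称的 [-127, 127] 以避免偏差(块大小等细节为编者简化):

```python
import numpy as np

def quantize_int8_blockwise(x, block=32):
    """块级 INT8 对称量化示意:每 block 个元素共享一个缩放因子。
    量化范围取对称的 [-127, 127](弃用 -128),即编者理解的"对称截断"。"""
    assert x.size % block == 0
    xb = x.reshape(-1, block)
    scale = np.maximum(np.abs(xb).max(axis=1, keepdims=True), 1e-12) / 127.0
    q = np.clip(np.round(xb / scale), -127, 127).astype(np.int8)
    return q, scale

x = np.random.randn(4 * 32).astype(np.float32)
q, scale = quantize_int8_blockwise(x)
x_hat = (q.astype(np.float32) * scale).reshape(-1)
print("重构均方误差:", float(((x - x_hat) ** 2).mean()))
```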

链接: https://arxiv.org/abs/2510.25602
作者: Mengzhao Chen,Meng Wu,Hui Jin,Zhihang Yuan,Jing Liu,Chaoyi Zhang,Yunshui Li,Jie Huang,Jin Ma,Zeyue Xue,Zhiheng Liu,Xingyan Bin,Ping Luo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper fills that gap by systematically investigating the trade-offs between FP and INT formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced. Our comprehensive comparison demonstrates that for popular 8-bit fine-grained formats (e.g., MX with block size 32), MXINT8 is superior to its FP counterpart in both algorithmic accuracy and hardware efficiency. However, for 4-bit formats, FP (e.g., MXFP4, NVFP4) often holds an accuracy advantage, though we show that NVINT4 can surpass NVFP4 when outlier-mitigation techniques like Hadamard rotation are applied. We also introduce a symmetric clipping method that resolves gradient bias in fine-grained low-bit INT training, enabling nearly lossless performance for MXINT8 training. These findings challenge the current hardware trajectory, demonstrating that a one-size-fits-all FP approach is suboptimal and advocating that fine-grained INT formats, particularly MXINT8, offer a better balance of accuracy, power, and efficiency for future AI accelerators.
zh

[AI-13] Standardization of Psychiatric Diagnoses – Role of Fine-tuned LLM Consortium and OpenAI-gpt-oss Reasoning LLM Enabled Decision Support System

【速读】:该论文旨在解决当前精神障碍临床诊断中依赖医患对话所导致的主观性问题,从而引发不同临床医生间诊断结果不一致、可靠性差的挑战。其核心解决方案是构建一个经过微调的大语言模型(Fine-Tuned Large Language Model, LLM)联盟,并结合OpenAI-gpt-oss推理型LLM(Reasoning LLM)形成决策支持系统,通过多模型共识机制与推理模型协同优化诊断流程,实现精神健康评估的标准化、透明化和高准确性。关键创新在于首次将微调LLM联盟与推理型LLM集成用于临床精神疾病诊断,显著提升了诊断一致性与可解释性,为下一代人工智能赋能的电子健康(eHealth)系统奠定了基础。

链接: https://arxiv.org/abs/2510.25588
作者: Eranga Bandara,Ross Gore,Atmaram Yarlagadda,Anita H. Clayton,Preston Samuel,Christopher K. Rhea,Sachin Shetty
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The diagnosis of most mental disorders, including psychiatric evaluations, primarily depends on dialogues between psychiatrists and patients. This subjective process can lead to variability in diagnoses across clinicians and patients, resulting in inconsistencies and challenges in achieving reliable outcomes. To address these issues and standardize psychiatric diagnoses, we propose a Fine-Tuned Large Language Model (LLM) Consortium and OpenAI-gpt-oss Reasoning LLM-enabled Decision Support System for the clinical diagnosis of mental disorders. Our approach leverages fine-tuned LLMs trained on conversational datasets involving psychiatrist-patient interactions focused on mental health conditions (e.g., depression). The diagnostic predictions from individual models are aggregated through a consensus-based decision-making process, refined by the OpenAI-gpt-oss reasoning LLM. We propose a novel method for deploying LLM agents that orchestrate communication between the LLM consortium and the reasoning LLM, ensuring transparency, reliability, and responsible AI across the entire diagnostic workflow. Experimental results demonstrate the transformative potential of combining fine-tuned LLMs with a reasoning model to create a robust and highly accurate diagnostic system for mental health assessment. A prototype of the proposed platform, integrating three fine-tuned LLMs with the OpenAI-gpt-oss reasoning LLM, was developed in collaboration with the U.S. Army Medical Research Team in Norfolk, Virginia, USA. To the best of our knowledge, this work represents the first application of a fine-tuned LLM consortium integrated with a reasoning LLM for clinical mental health diagnosis paving the way for next-generation AI-powered eHealth systems aimed at standardizing psychiatric diagnoses.
zh

[AI-14] Leveraging an Atmospheric Foundational Model for Subregional Sea Surface Temperature Forecasting

【速读】:该论文旨在解决传统海洋预报方法在计算成本高和可扩展性差方面的局限性,以实现对海表温度(Sea Surface Temperature, SST)的高效、精准预测。其解决方案的关键在于将原本用于大气预报的预训练深度学习模型Aurora迁移至海洋领域,并通过高分辨率海洋再分析数据进行分阶段微调(fine-tuning),引入纬度加权误差指标与超参数优化策略,从而在显著降低计算资源消耗的同时,准确捕捉复杂的时空演变特征。实验表明,该方法在测试集上达到0.119K的均方根误差(RMSE)并保持约0.997的异常相关系数(Anomaly Correlation Coefficient, ACC),验证了跨域预训练模型在海洋数据驱动建模中的可行性与有效性。
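
编者注:摘要提到的"纬度加权误差指标"通常按 cos(纬度) 加权,以补偿经纬网格在高纬度的过采样;一个最小示意如下(纬度范围仅为示意):

```python
import numpy as np

def lat_weighted_rmse(pred, target, lats_deg):
    """纬度加权 RMSE:以 cos(纬度) 归一化权重,补偿经纬网格在高纬的过采样。"""
    w = np.cos(np.deg2rad(lats_deg))
    w = w / w.mean()                   # 归一化,使权重均值为 1
    sq_err = (pred - target) ** 2      # 形状 (lat, lon)
    return float(np.sqrt((w[:, None] * sq_err).mean()))

lats = np.linspace(20, 35, 16)         # 加那利上升流区附近纬度,仅为示意
pred, target = np.random.rand(16, 32), np.random.rand(16, 32)
print(lat_weighted_rmse(pred, target, lats))
```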

链接: https://arxiv.org/abs/2510.25563
作者: Víctor Medina,Giovanny A. Cuervo-Londoño,Javier Sánchez
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
备注: 18 pages, 9 figures

点击查看摘要

Abstract:The accurate prediction of oceanographic variables is crucial for understanding climate change, managing marine resources, and optimizing maritime activities. Traditional ocean forecasting relies on numerical models; however, these approaches face limitations in terms of computational cost and scalability. In this study, we adapt Aurora, a foundational deep learning model originally designed for atmospheric forecasting, to predict sea surface temperature (SST) in the Canary Upwelling System. By fine-tuning this model with high-resolution oceanographic reanalysis data, we demonstrate its ability to capture complex spatiotemporal patterns while reducing computational demands. Our methodology involves a staged fine-tuning process, incorporating latitude-weighted error metrics and optimizing hyperparameters for efficient learning. The experimental results show that the model achieves a low RMSE of 0.119 K, maintaining high anomaly correlation coefficients (ACC ≈ 0.997). The model successfully reproduces large-scale SST structures but faces challenges in capturing finer details in coastal regions. This work contributes to the field of data-driven ocean forecasting by demonstrating the feasibility of using deep learning models pre-trained in different domains for oceanic applications. Future improvements include integrating additional oceanographic variables, increasing spatial resolution, and exploring physics-informed neural networks to enhance interpretability and understanding. These advancements can improve climate modeling and ocean prediction accuracy, supporting decision-making in environmental and economic sectors.
zh

[AI-15] Off-policy Reinforcement Learning with Model-based Exploration Augmentation

【速读】:该论文旨在解决强化学习中被动探索(passive exploration)方法因样本多样性有限而导致的探索效率不足问题。现有被动探索方法虽能通过自适应优先级调整经验回放缓冲区中的过渡数据来增强探索,但在高维环境中仍受限于可用样本的多样性。解决方案的关键在于提出模型驱动的生成式探索(Modelic Generative Exploration, MoGE),其核心机制包括:(1) 基于扩散模型的生成器,在效用函数引导下合成潜在具有高探索价值的关键状态;(2) 一步想象世界模型(one-step imagination world model),基于这些关键状态构建动力学一致的过渡样本供智能体学习。MoGE采用模块化设计,兼容离策略学习原则,可无缝集成至现有算法中而不改变其核心结构,从而显著提升样本效率和复杂控制任务下的性能表现。

链接: https://arxiv.org/abs/2510.25529
作者: Likun Wang,Xiangteng Zhang,Yinuo Wang,Guojian Zhan,Wenxuan Wang,Haoyu Gao,Jingliang Duan,Shengbo Eben Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Exploration is fundamental to reinforcement learning (RL), as it determines how effectively an agent discovers and exploits the underlying structure of its environment to achieve optimal performance. Existing exploration methods generally fall into two categories: active exploration and passive exploration. The former introduces stochasticity into the policy but struggles in high-dimensional environments, while the latter adaptively prioritizes transitions in the replay buffer to enhance exploration, yet remains constrained by limited sample diversity. To address the limitation in passive exploration, we propose Modelic Generative Exploration (MoGE), which augments exploration through the generation of under-explored critical states and synthesis of dynamics-consistent experiences through transition models. MoGE is composed of two components: (1) a diffusion-based generator that synthesizes critical states under the guidance of a utility function evaluating each state’s potential influence on policy exploration, and (2) a one-step imagination world model for constructing critical transitions based on the critical states for agent learning. Our method adopts a modular formulation that aligns with the principles of off-policy learning, allowing seamless integration with existing algorithms to improve exploration without altering their core structures. Empirical results on OpenAI Gym and DeepMind Control Suite reveal that MoGE effectively bridges exploration and policy learning, leading to remarkable gains in both sample efficiency and performance across complex control tasks.
zh

[AI-16] Zero Reinforcement Learning Towards General Domains

【速读】:该论文旨在解决当前零强化学习(Zero Reinforcement Learning, Zero-RL)在非可验证领域(non-verifiable domains)中难以有效激发模型推理能力的问题,即现有方法主要局限于数学、编程等具有明确可验证奖励信号的任务,而在更广泛、验证困难的场景下效果有限。其解决方案的关键在于提出一种多任务零强化学习范式,通过结合可验证奖励与生成式奖励模型(Generative Reward Model, GRM),在可验证与非可验证领域间协同训练,实现推理能力的跨域迁移;同时设计平滑长度惩罚机制以缓解生成式奖励模型中的奖励欺骗(reward hacking)问题,从而引导模型生成更全面的思维过程令牌(thinking tokens),提升整体推理性能。

链接: https://arxiv.org/abs/2510.25528
作者: Yuyuan Zeng,Yufei Huang,Can Xu,Qingfeng Sun,Jianfeng Yan,Guanghui Xu,Tao Yang,Fengzong Lian
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Zero Reinforcement Learning (Zero-RL) has proven to be an effective approach for enhancing the reasoning capabilities of large language models (LLMs) by directly applying reinforcement learning with verifiable rewards on pretrained models, without the need for a supervised fine-tuning phase. However, current research on zero-RL primarily focuses on domains with easily verifiable reward signals, such as mathematics, programming, and other reasoning tasks. The challenge of eliciting reasoning abilities in more diverse scenarios, where verification is not straightforward, remains underexplored. To address this gap, we propose a novel zero-RL paradigm designed to improve a model’s reasoning ability across both verifiable and non-verifiable domains. By combining verifiable rewards with a generative reward model, we conduct multi-task zero-RL training across both domains, facilitating the transfer of reasoning capabilities between them. Furthermore, to mitigate reward hacking in the generative reward model, we design a smooth length penalty that encourages the generation of more comprehensive thinking tokens in general domains. Experimental results on Qwen3-8B-Base and Qwen3-14B-Base demonstrate that our approach achieves superior reasoning performance, not only on tasks requiring extensive reasoning but also on more general tasks.
zh

[AI-17] Retrieval Augmented Generation (RAG) for Fintech: Agentic Design and Evaluation

【速读】:该论文旨在解决生成式 AI (Generative AI) 在特定领域(如金融科技)中因领域本体复杂、术语密集及缩略语繁多而导致的检索与合成效果不佳的问题。其解决方案的关键在于提出一种基于代理(agent)的检索增强生成(Retrieval-Augmented Generation, RAG)架构,通过模块化流水线实现智能查询重构、基于关键词提取的迭代子查询分解、上下文感知的缩略语解析以及基于交叉编码器(cross-encoder)的上下文重排序,从而提升检索精度与相关性。
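
编者注:流水线末端的交叉编码器重排序可用开源 sentence-transformers 库几行代码示意(模型选择与论文实现无关,仅作演示):

```python
from sentence_transformers import CrossEncoder

# 用开源交叉编码器对初检文档按查询相关性打分并重排序
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What does APR mean in a lending context?"
candidates = [
    "APR (Annual Percentage Rate) is the yearly cost of a loan including fees.",
    "The API gateway routes requests to internal services.",
    "Our quarterly report covers revenue growth in EMEA.",
]
scores = reranker.predict([(query, doc) for doc in candidates])
for score, doc in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.3f}  {doc}")  # 含 APR 释义的文档应排在最前
```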

链接: https://arxiv.org/abs/2510.25518
作者: Thomas Cook,Richard Osuagwu,Liman Tsatiashvili,Vrynsia Vrynsia,Koustav Ghosal,Maraim Masoud,Riccardo Mattivi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Keywords: RAG, Agentic AI, Fintech, NLP, KB, Domain-Specific Ontology, Query Understanding

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems often face limitations in specialized domains such as fintech, where domain-specific ontologies, dense terminology, and acronyms complicate effective retrieval and synthesis. This paper introduces an agentic RAG architecture designed to address these challenges through a modular pipeline of specialized agents. The proposed system supports intelligent query reformulation, iterative sub-query decomposition guided by keyphrase extraction, contextual acronym resolution, and cross-encoder-based context re-ranking. We evaluate our approach against a standard RAG baseline using a curated dataset of 85 question–answer–reference triples derived from an enterprise fintech knowledge base. Experimental results demonstrate that the agentic RAG system outperforms the baseline in retrieval precision and relevance, albeit with increased latency. These findings suggest that structured, multi-agent methodologies offer a promising direction for enhancing retrieval robustness in complex, domain-specific settings.
zh

[AI-18] Predicate Renaming via Large Language Models

【速读】:该论文试图解决逻辑规则中未命名谓词(unnamed predicates)的命名问题,这在归纳逻辑编程(Inductive Logic Programming, ILP)中尤为突出,尤其在谓词发明(Predicate Invention)场景下,未命名谓词严重影响逻辑理论的可读性、可解释性和可复用性。解决方案的关键在于利用大语言模型(Large Language Models, LLMs)强大的自然语言与代码理解能力,为未命名谓词提供语义上合理的命名建议,从而提升逻辑规则的表达清晰度和实用性。

链接: https://arxiv.org/abs/2510.25517
作者: Elisabetta Gentili,Tony Ribeiro,Fabrizio Riguzzi,Katsumi Inoue
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this paper, we address the problem of giving names to predicates in logic rules using Large Language Models (LLMs). In the context of Inductive Logic Programming, various rule generation methods produce rules containing unnamed predicates, with Predicate Invention being a key example. This hinders the readability, interpretability, and reusability of the logic theory. Leveraging recent advancements in LLM development, we explore their ability to process natural language and code to provide semantically meaningful suggestions for naming unnamed predicates. The evaluation of our approach on some hand-crafted logic rules indicates that LLMs hold potential for this task.
zh

[AI-19] MTIR-SQL: Multi-turn Tool-Integrated Reasoning Reinforcement Learning for Text-to-SQL

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在Text-to-SQL任务中因依赖静态执行反馈而导致无法实时纠错、适应性与鲁棒性不足的问题。其解决方案的关键在于提出MTIR-SQL框架,该框架引入了一种执行感知的多轮推理范式(execution-aware multi-turn reasoning paradigm),在每一轮推理步骤中无缝集成数据库执行反馈,从而实现上下文敏感的查询生成和推理过程中的渐进式优化;同时,为应对多轮交互场景下的训练不稳定性及模型分布偏移问题,对GRPO算法进行了改进,包括增加轨迹过滤机制并移除KL散度约束,有效提升了模型性能,在BIRD Dev和SPIDER Dev数据集上分别达到64.4%和84.6%的执行准确率。
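
编者注:"执行感知的多轮推理"可以用 sqlite3 加一个占位 LLM 粗略示意——每轮把数据库报错回传给模型再生成,直至执行成功(占位函数仅为演示多轮接口,实际应调用经 MTIR 式训练的模型):

```python
import sqlite3

def llm_generate_sql(question, schema, feedback=None):
    """占位 LLM:仅用于演示"报错 -> 修正"的多轮接口。"""
    if feedback is None:
        return "SELECT nam FROM users WHERE age > 30"   # 第一轮故意写错列名
    return "SELECT name FROM users WHERE age > 30"      # 依据报错修正

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INT)")
conn.execute("INSERT INTO users VALUES ('Ada', 36), ('Bob', 25)")

feedback = None
for turn in range(3):                                   # 多轮:执行反馈进入下一轮
    sql = llm_generate_sql("30岁以上用户叫什么?", "users(name, age)", feedback)
    try:
        print("成功:", conn.execute(sql).fetchall())
        break
    except sqlite3.OperationalError as e:
        feedback = str(e)                               # 如 "no such column: nam"
        print(f"第{turn + 1}轮报错,反馈给模型:", feedback)
```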

链接: https://arxiv.org/abs/2510.25510
作者: Zekun Xu,Siyu Xia,Chuhuai Yue,Jiajun Chai,Mingxue Tian,Xiaohan Wang,Wei Lin,Haoxuan Li,Guojun Yin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As large language models (LLMs) are increasingly used in Text-to-SQL tasks, Reinforcement Learning (RL) has become a common method for improving performance. Existing methods primarily rely on static execution feedback, which restricts real-time error correction. However, integrating multi-turn tool invocation along with dynamic feedback could significantly improve adaptability and robustness, ultimately enhancing model performance. To address these issues, we propose MTIR-SQL, an innovative Multi-turn Tool-Integrated Reasoning reinforcement learning framework for Text-to-SQL. Our approach introduces an execution-aware multi-turn reasoning paradigm that seamlessly incorporates database execution feedback at each reasoning step, enabling context-sensitive query generation and progressive refinement throughout the reasoning process. The framework extends the GRPO algorithm to accommodate complex multi-turn interaction scenarios. Considering the training instability characteristics of MTIR and the potential for significant deviation of the model distribution from the initial model, we enhance the GRPO algorithm by adding a trajectory filtering mechanism and removing KL loss constraints. Experimental results demonstrate that MTIR-SQL, with 4B parameters, achieves 64.4% execution accuracy on the BIRD Dev set and 84.6% on the SPIDER Dev set, significantly outperforming existing approaches.
zh
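For intuition, the sketch below shows the basic execution-feedback loop that such an execution-aware paradigm builds on: each candidate query is run against the database and the error message (if any) conditions the next turn. The toy schema and the `generate_sql` stand-in for the policy model are assumptions; the paper's GRPO training is not shown.

```python
import sqlite3

def run_with_feedback(question, db, generate_sql, max_turns=3):
    feedback = ""
    for _ in range(max_turns):
        sql = generate_sql(question, feedback)  # conditioned on past feedback
        try:
            return sql, db.execute(sql).fetchall()  # success: final query + rows
        except sqlite3.Error as err:
            feedback = f"Query `{sql}` failed: {err}"  # next-turn feedback
    return None, []

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER, name TEXT)")
db.execute("INSERT INTO users VALUES (1, 'Ada')")
# A fixed two-attempt "policy" standing in for the trained model:
attempts = iter(["SELECT nam FROM users", "SELECT name FROM users"])
sql, rows = run_with_feedback("List user names", db, lambda q, f: next(attempts))
print(sql, rows)  # SELECT name FROM users [('Ada',)]
```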

[AI-20] Reflections on the Reproducibility of Commercial LLM Performance in Empirical Software Engineering Studies

【Quick Read】: This paper examines the widespread reproducibility problems in LLM-centric research, assessing how reproducible existing empirical studies are and identifying the impeding factors. The key to the approach is a systematic analysis of 86 LLM-centric papers from ICSE 2024 and ASE 2024, with replication attempts on the 18 that provided research artefacts and used OpenAI models; only five were even fit for reproduction and none could be fully reproduced, exposing substantial gaps in methodological transparency, data and code sharing, and the robustness of study designs. The paper accordingly calls for stricter artefact evaluation standards and more robust study designs to raise the reproducible value of future publications.

Link: https://arxiv.org/abs/2510.25506
Authors: Florian Angermeir, Maximilian Amougou, Mark Kreitz, Andreas Bauer, Matthias Linhuber, Davide Fucci, Fabiola Moyón C., Daniel Mendez, Tony Gorschek
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models have gained remarkable interest in industry and academia. The increasing interest in LLMs in academia is also reflected in the number of publications on this topic over the last years. For instance, 78 of the roughly 425 publications at ICSE 2024 alone performed experiments with LLMs. Conducting empirical studies with LLMs remains challenging and raises questions on how to achieve reproducible results, for both other researchers and practitioners. One important step towards excelling in empirical research on LLMs and their application is to first understand to what extent current research results are eventually reproducible and what factors may impede reproducibility. This investigation is within the scope of our work. We contribute an analysis of the reproducibility of LLM-centric studies, provide insights into the factors impeding reproducibility, and discuss suggestions on how to improve the current state. In particular, we studied the 86 articles describing LLM-centric studies published at ICSE 2024 and ASE 2024. Of the 86 articles, 18 provided research artefacts and used OpenAI models. We attempted to replicate those 18 studies. Of the 18 studies, only five were fit for reproduction, and for none of the five were we able to fully reproduce the results. Two studies seemed to be partially reproducible, and three studies did not seem to be reproducible. Our results highlight not only the need for stricter research artefact evaluations but also for more robust study designs to ensure the reproducible value of future publications.

[AI-21] Multi-Objective Search: Algorithms, Applications, and Emerging Directions

【Quick Read】: This paper addresses multi-objective search (MOS), the problem of optimizing several, often conflicting objectives at once in planning and decision-making, which is central to practical applications in robotics, transportation, and operations research. The key idea is to treat MOS as a unifying framework for modeling and solving such real-world problems, surveying progress across disciplines and identifying the open challenges that define the emerging frontier of MOS research in efficiency, scalability, and practicality.

Link: https://arxiv.org/abs/2510.25504
Authors: Oren Salzman, Carlos Hernández Ulloa, Ariel Felner, Sven Koenig
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Multi-objective search (MOS) has emerged as a unifying framework for planning and decision-making problems where multiple, often conflicting, criteria must be balanced. While the problem has been studied for decades, recent years have seen renewed interest in the topic across AI applications such as robotics, transportation, and operations research, reflecting the reality that real-world systems rarely optimize a single measure. This paper surveys developments in MOS while highlighting cross-disciplinary opportunities, and outlines open challenges that define the emerging frontier of MOS.
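As a pointer to the core primitive underlying MOS, the sketch below filters a set of candidate solutions down to its Pareto-optimal front, assuming all objectives are to be minimized; the two-objective path data is invented for illustration.

```python
def dominates(a, b):
    """a dominates b if a is no worse everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Each tuple is (travel_time, energy_cost) for one candidate path.
candidates = [(10, 5), (8, 7), (12, 4), (9, 9), (8, 6)]
print(pareto_front(candidates))  # [(10, 5), (12, 4), (8, 6)]
```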

[AI-22] TempoPFN: Synthetic Pre-training of Linear RNNs for Zero-shot Time Series Forecasting

【Quick Read】: This paper addresses two core problems of foundation models for zero-shot time series forecasting: inefficient long-horizon prediction and the weak benchmark performance of existing synthetic-only approaches. The key to the solution is TempoPFN, a univariate time series foundation model based on linear Recurrent Neural Networks (RNNs) with a GatedDeltaProduct architecture and state-weaving, which makes training and inference fully parallelizable across sequence lengths and maintains robust temporal state-tracking without windowing or summarization; a unified synthetic data pipeline combining stochastic differential equations, Gaussian processes, audio synthesis, and novel augmentations completes the recipe. In zero-shot evaluation on the Gift-Eval benchmark, it achieves top-tier performance, beating all synthetic-only approaches and surpassing most models trained on real data while being more computationally efficient.

Link: https://arxiv.org/abs/2510.25502
Authors: Vladyslav Moroshan, Julien Siems, Arber Zela, Timur Carstensen, Frank Hutter
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 30 pages, 18 figures, 13 tables

Abstract:Foundation models for zero-shot time series forecasting face challenges in efficient long-horizon prediction and reproducibility, with existing synthetic-only approaches underperforming on challenging benchmarks. This paper presents TempoPFN, a univariate time series foundation model based on linear Recurrent Neural Networks (RNNs) pre-trained exclusively on synthetic data. The model uses a GatedDeltaProduct architecture with state-weaving for fully parallelizable training across sequence lengths, eliminating the need for windowing or summarization techniques while maintaining robust temporal state-tracking. Our comprehensive synthetic data pipeline unifies diverse generators, including stochastic differential equations, Gaussian processes, and audio synthesis, with novel augmentations. In zero-shot evaluations on the Gift-Eval benchmark, TempoPFN achieves top-tier competitive performance, outperforming all existing synthetic-only approaches and surpassing the vast majority of models trained on real-world data, while being more efficient than existing baselines by leveraging fully parallelizable training and inference. We open-source our complete data generation pipeline and training code, providing a reproducible foundation for future research.
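One representative ingredient of such a synthetic pipeline, sampling series from a stochastic differential equation, can be sketched with an Euler-Maruyama discretization of an Ornstein-Uhlenbeck process; all parameters below are illustrative, not the paper's.

```python
import numpy as np

def sample_ou(n_steps=512, theta=0.7, mu=0.0, sigma=0.3, dt=0.01, seed=0):
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps)
    x[0] = mu
    for t in range(1, n_steps):
        drift = theta * (mu - x[t - 1]) * dt            # mean reversion
        diffusion = sigma * np.sqrt(dt) * rng.normal()  # Brownian shock
        x[t] = x[t - 1] + drift + diffusion
    return x

series = sample_ou()
print(series.shape, series[:5])
```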

[AI-23] Instrumental goals in advanced AI systems: Features to be managed and not failures to be eliminated?

【Quick Read】: This paper takes on a governance puzzle in AI alignment research around instrumental goals: how to respond to tendencies of advanced AI systems, such as power-seeking and self-preservation, that can conflict with human intentions. Conventional alignment theory treats instrumental goals as sources of risk and tries to limit symptoms such as resource acquisition and self-preservation; this paper instead builds a philosophical argument, drawing on Aristotle's ontology and its modern interpretations, that advanced AI systems are goal-directed artifacts whose formal and material constitution makes instrumental tendencies "per se outcomes" rather than accidental malfunctions. The key move is thus a change of perspective: rather than trying to eliminate instrumental goals, we should understand, manage, and direct them toward human-aligned ends as structural features.

Link: https://arxiv.org/abs/2510.25471
Authors: Willem Fourie
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:In artificial intelligence (AI) alignment research, instrumental goals, also called instrumental subgoals or instrumental convergent goals, are widely associated with advanced AI systems. These goals, which include tendencies such as power-seeking and self-preservation, become problematic when they conflict with human aims. Conventional alignment theory treats instrumental goals as sources of risk that become problematic through failure modes such as reward hacking or goal misgeneralization, and attempts to limit the symptoms of instrumental goals, notably resource acquisition and self-preservation. This article proposes an alternative framing: that a philosophical argument can be constructed according to which instrumental goals may be understood as features to be accepted and managed rather than failures to be limited. Drawing on Aristotle’s ontology and its modern interpretations, an ontology of concrete, goal-directed entities, it argues that advanced AI systems can be seen as artifacts whose formal and material constitution gives rise to effects distinct from their designers’ intentions. In this view, the instrumental tendencies of such systems correspond to per se outcomes of their constitution rather than accidental malfunctions. The implication is that efforts should focus less on eliminating instrumental goals and more on understanding, managing, and directing them toward human-aligned ends.

[AI-24] An In-Depth Analysis of Cyber Attacks in Secured Platforms

【Quick Read】: This paper addresses the growing malware threat on mobile devices, particularly the privacy violations and disrupted user experience caused by encryption-type ransomware on the Android operating system. The key to the solution is a systematic comparison of machine learning techniques currently used to detect malicious threats on phones, measuring their accuracy on an Android applications dataset and thereby providing empirical grounding for building efficient, automated anti-malware systems.

Link: https://arxiv.org/abs/2510.25470
Authors: Parick Ozoh, John K Omoniyi, Bukola Ibitoye
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:There is an increase in global malware threats. To address this, an encryption-type ransomware has been introduced on the Android operating system. The challenges associated with malicious threats in phone use have become a pressing issue in mobile communication, disrupting user experiences and posing significant privacy threats. This study surveys commonly used machine learning techniques for detecting malicious threats in phones and examines their performance. The majority of past research focuses on customer feedback and reviews, with concerns that people might create false reviews to promote or devalue products and services for personal gain. Hence, the development of techniques for detecting malicious threats using machine learning has been a key focus. This paper presents a comprehensive comparative study of current research on the issue of malicious threats and methods for tackling these challenges. Nevertheless, a huge amount of information is required by these methods, presenting a challenge for developing robust, specialized automated anti-malware systems. This research describes the Android Applications dataset, and the accuracy of the techniques is measured using the accuracy levels of the metrics employed in this study.

[AI-25] Scalable Utility-Aware Multiclass Calibration

【Quick Read】: This paper addresses scalable evaluation of multiclass classifier calibration, i.e., measuring the agreement between predicted probabilities and observed frequencies in a way that is computationally efficient yet comprehensive and flexible. Existing methods either focus on specific aspects of prediction (top-class confidence, class-wise calibration) or rely on computationally challenging variational formulations. The key idea is utility calibration, a general framework that defines calibration error relative to a concrete utility function capturing the end user's goals or decision criteria; it unifies and reinterprets several existing metrics, yields more robust versions of the top-class and class-wise measures, and extends beyond such binarized approaches to richer classes of downstream utilities.

Link: https://arxiv.org/abs/2510.25458
Authors: Mahmoud Hegazy, Michael I. Jordan, Aymeric Dieuleveut
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:Ensuring that classifiers are well-calibrated, i.e., their predictions align with observed frequencies, is a minimal and fundamental requirement for classifiers to be viewed as trustworthy. Existing methods for assessing multiclass calibration often focus on specific aspects associated with prediction (e.g., top-class confidence, class-wise calibration) or utilize computationally challenging variational formulations. In this work, we study scalable evaluation of multiclass calibration. To this end, we propose utility calibration, a general framework that measures the calibration error relative to a specific utility function that encapsulates the goals or decision criteria relevant to the end user. We demonstrate how this framework can unify and re-interpret several existing calibration metrics, particularly allowing for more robust versions of the top-class and class-wise calibration metrics, and, going beyond such binarized approaches, toward assessing calibration for richer classes of downstream utilities.
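For orientation, one of the metrics the framework unifies is the familiar binned top-class calibration error, which corresponds to taking top-class accuracy as the utility. A minimal sketch, with an illustrative binning scheme and random data:

```python
import numpy as np

def top_class_ece(probs, labels, n_bins=10):
    conf = probs.max(axis=1)              # top-class confidence
    pred = probs.argmax(axis=1)
    correct = (pred == labels).astype(float)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(conf[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap      # weight each bin by its mass
    return ece

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(5), size=1000)
labels = rng.integers(0, 5, size=1000)
print(f"top-class ECE: {top_class_ece(probs, labels):.4f}")
```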

[AI-26] Agentic AI: A Comprehensive Survey of Architectures, Applications, and Future Directions

【Quick Read】: This paper addresses the fragmented understanding of Agentic AI caused by conceptual conflation, in particular the mistaken equation of modern neural systems with outdated symbolic models ("conceptual retrofitting"). The key to the solution is a dual-paradigm framework that divides agentic systems into two lineages: the symbolic/classical paradigm (algorithmic planning, persistent state) and the neural/generative paradigm (stochastic generation, prompt-driven orchestration). A PRISMA-based systematic review of 90 studies (2018-2025) then analyzes both paradigms along three dimensions: theoretical foundations and architectural principles; domain-specific implementations in healthcare, finance, and robotics; and paradigm-specific ethical and governance challenges. The analysis shows that paradigm choice is strategic (symbolic systems dominate safety-critical domains, neural systems adaptive, data-rich ones) and argues that the future lies in their intentional integration into hybrid systems that are both adaptable and reliable.

Link: https://arxiv.org/abs/2510.25445
Authors: Mohamad Abou Ali, Fadi Dornaika
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Agentic AI represents a transformative shift in artificial intelligence, but its rapid advancement has led to a fragmented understanding, often conflating modern neural systems with outdated symbolic models – a practice known as conceptual retrofitting. This survey cuts through this confusion by introducing a novel dual-paradigm framework that categorizes agentic systems into two distinct lineages: the Symbolic/Classical (relying on algorithmic planning and persistent state) and the Neural/Generative (leveraging stochastic generation and prompt-driven orchestration). Through a systematic PRISMA-based review of 90 studies (2018–2025), we provide a comprehensive analysis structured around this framework across three dimensions: (1) the theoretical foundations and architectural principles defining each paradigm; (2) domain-specific implementations in healthcare, finance, and robotics, demonstrating how application constraints dictate paradigm selection; and (3) paradigm-specific ethical and governance challenges, revealing divergent risks and mitigation strategies. Our analysis reveals that the choice of paradigm is strategic: symbolic systems dominate safety-critical domains (e.g., healthcare), while neural systems prevail in adaptive, data-rich environments (e.g., finance). Furthermore, we identify critical research gaps, including a significant deficit in governance models for symbolic systems and a pressing need for hybrid neuro-symbolic architectures. The findings culminate in a strategic roadmap arguing that the future of Agentic AI lies not in the dominance of one paradigm, but in their intentional integration to create systems that are both adaptable and reliable. This work provides the essential conceptual toolkit to guide future research, development, and policy toward robust and trustworthy hybrid intelligent systems.

[AI-27] Alibaba International E-commerce Product Search Competition DcuRAGONs Team Technical Report CIKM2025

【Quick Read】: This paper addresses relevance recognition between user queries and product items in multilingual e-commerce search, with the goal of improving recommendation performance on e-commerce platforms. The key to the solution is a data-centric method that exploits the capabilities Large Language Models (LLMs) bring from other tasks, which achieved the highest score among competing solutions in the competition.

Link: https://arxiv.org/abs/2510.25428
Authors: Thang-Long Nguyen-Ho, Minh-Khoi Pham, Hoang-Bao Le
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Alibaba International E-commerce Product Search Competition @ CIKM 2025

Abstract:This report details our methodology and results developed for the Multilingual E-commerce Search Competition. The problem aims to recognize relevance between user queries and product items in a multilingual context and improve recommendation performance on e-commerce platforms. Utilizing Large Language Models (LLMs) and their capabilities in other tasks, our data-centric method achieved the highest score compared to other solutions during the competition. The final leaderboard is published at this https URL. The source code for our project is published at this https URL.

[AI-28] GPTOpt: Towards Efficient LLM-Based Black-Box Optimization

【Quick Read】: This paper addresses global optimization of expensive, derivative-free black-box functions, which demands extreme sample efficiency; classical methods such as Bayesian Optimization (BO) are effective but need careful per-domain parameter tuning. The key to the solution is GPTOpt, an LLM-based optimizer created by fine-tuning large language models on extensive synthetic datasets derived from diverse BO parameterizations, equipping them with continuous black-box optimization capabilities. Leveraging the generalization gained from LLM pre-training, GPTOpt surpasses traditional optimizers on a variety of black-box benchmarks, demonstrating the capacity of LLMs for advanced numerical reasoning and offering a flexible, tuning-free framework for global optimization.

Link: https://arxiv.org/abs/2510.25404
Authors: Jamison Meindl, Yunsheng Tian, Tony Cui, Veronika Thost, Zhang-Wei Hong, Jie Chen, Wojciech Matusik, Mina Konaković Luković
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Global optimization of expensive, derivative-free black-box functions demands extreme sample efficiency. Classical methods such as Bayesian Optimization (BO) can be effective, but they often require careful parameter tuning to each application domain. At the same time, Large Language Models (LLMs) have shown broad capabilities, yet state-of-the-art models remain limited in solving continuous black-box optimization tasks. We introduce GPTOpt, an LLM-based optimization method that equips LLMs with continuous black-box optimization capabilities. By fine-tuning large language models on extensive synthetic datasets derived from diverse BO parameterizations, GPTOpt leverages LLM pre-training to generalize across optimization tasks. On a variety of black-box optimization benchmarks, GPTOpt surpasses traditional optimizers, highlighting the capacity of LLMs for advanced numerical reasoning and introducing a flexible framework for global optimization without parameter tuning.
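The sketch below illustrates the general idea of casting an optimization trajectory as text for an LLM to continue; the serialization format and the commented-out `propose_next` call are our assumptions, not GPTOpt's actual fine-tuning interface.

```python
def trajectory_to_prompt(history, bounds):
    lines = [f"Minimize f over x in [{bounds[0]}, {bounds[1]}]."]
    for i, (x, y) in enumerate(history):
        lines.append(f"trial {i}: x={x:.4f}, f(x)={y:.4f}")
    lines.append("Propose the next x to evaluate:")
    return "\n".join(lines)

history = [(0.2, 1.31), (0.8, 0.47), (0.6, 0.22)]  # illustrative evaluations
prompt = trajectory_to_prompt(history, bounds=(0.0, 1.0))
# next_x = float(propose_next(prompt))  # hypothetical LLM call
print(prompt)
```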

[AI-29] Grouping Nodes With Known Value Differences: A Lossless UCT-based Abstraction Algorithm

【Quick Read】: This paper targets the sample inefficiency of Monte Carlo Tree Search (MCTS). The state-of-the-art abstraction algorithm OGA-UCT groups states and state-action pairs that have the same value under optimal play, which forces them to share the same immediate reward; this rigid condition limits how many abstractions can be found and hence the achievable sample-efficiency gains. The key to the solution is the Known Value Difference Abstractions (KVDA) framework, which drops the equal-value requirement: it infers value differences by analyzing immediate rewards and groups states and state-action pairs whose value gap can be inferred. The resulting KVDA-UCT detects significantly more abstractions, introduces no additional parameter, and outperforms OGA-UCT across a variety of deterministic environments and parameter settings.

Link: https://arxiv.org/abs/2510.25388
Authors: Robin Schmöcker, Alexander Dockhorn, Bodo Rosenhahn
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:A core challenge of Monte Carlo Tree Search (MCTS) is its sample efficiency, which can be improved by grouping state-action pairs and using their aggregate statistics instead of single-node statistics. On the Go Abstractions in Upper Confidence bounds applied to Trees (OGA-UCT) is the state-of-the-art MCTS abstraction algorithm for deterministic environments that builds its abstraction using the Abstractions of State-Action Pairs (ASAP) framework, which aims to detect states and state-action pairs with the same value under optimal play by analysing the search graph. ASAP, however, requires two state-action pairs to have the same immediate reward, which is a rigid condition that limits the number of abstractions that can be found and thereby the sample efficiency. In this paper, we break with the paradigm of grouping value-equivalent states or state-action pairs and instead group states and state-action pairs with possibly different values as long as the difference between their values can be inferred. We call this abstraction framework Known Value Difference Abstractions (KVDA), which infers the value differences by analysis of the immediate rewards and modifies OGA-UCT to use this framework instead. The modification is called KVDA-UCT, which detects significantly more abstractions than OGA-UCT, introduces no additional parameter, and outperforms OGA-UCT on a variety of deterministic environments and parameter settings.

[AI-30] Integrating Legal and Logical Specifications in Perception, Prediction, and Planning for Automated Driving: A Survey of Methods

【Quick Read】: This paper addresses how to integrate legal norms and logical rules into the perception, prediction, and planning modules of automated driving systems so that regulatory compliance and interpretability hold in dynamic, uncertain environments. The key to the solution is a systematic taxonomy that organizes existing approaches along theoretical foundations, architectural implementations, and validation strategies, with emphasis on mechanisms that handle perceptual uncertainty and explicitly encode legal norms, so that decisions are both technically robust and legally defensible. Concretely, it covers neural-symbolic integration for perception, logic-driven rule representation, and norm-aware prediction strategies, all contributing to transparent and accountable autonomous vehicle operation.

Link: https://arxiv.org/abs/2510.25386
Authors: Kumar Manas, Mert Keser, Alois Knoll
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: Accepted to 2025 IEEE International Automated Vehicle Validation Conference (IAVVC)

Abstract:This survey provides an analysis of current methodologies integrating legal and logical specifications into the perception, prediction, and planning modules of automated driving systems. We systematically explore techniques ranging from logic-based frameworks to computational legal reasoning approaches, emphasizing their capability to ensure regulatory compliance and interpretability in dynamic and uncertain driving environments. A central finding is that significant challenges arise at the intersection of perceptual reliability, legal compliance, and decision-making justifiability. To systematically analyze these challenges, we introduce a taxonomy categorizing existing approaches by their theoretical foundations, architectural implementations, and validation strategies. We particularly focus on methods that address perceptual uncertainty and incorporate explicit legal norms, facilitating decisions that are both technically robust and legally defensible. The review covers neural-symbolic integration methods for perception, logic-driven rule representation, and norm-aware prediction strategies, all contributing toward transparent and accountable autonomous vehicle operation. We highlight critical open questions and practical trade-offs that must be addressed, offering multidisciplinary insights from engineering, logic, and law to guide future developments in legally compliant autonomous driving systems.

[AI-31] Position: Biology is the Challenge Physics-Informed ML Needs to Evolve

【Quick Read】: This paper addresses the distinctive challenges of applying Physics-Informed Machine Learning (PIML) to biological modeling: multi-faceted and uncertain prior knowledge, heterogeneous and noisy data, partial observability, and complex high-dimensional networks. The key to the solution is Biology-Informed Machine Learning (BIML), a principled extension of PIML that keeps its structural grounding while adapting to the practical realities of biology, retooling PIML methods to operate under softer, probabilistic forms of prior knowledge. Four foundational pillars form the roadmap: uncertainty quantification, contextualization, constrained latent structure inference, and scalability, with Foundation Models and Large Language Models as key enablers bridging human expertise and computational modeling.

Link: https://arxiv.org/abs/2510.25368
Authors: Julien Martinelli
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:Physics-Informed Machine Learning (PIML) has successfully integrated mechanistic understanding into machine learning, particularly in domains governed by well-known physical laws. This success has motivated efforts to apply PIML to biology, a field rich in dynamical systems but shaped by different constraints. Biological modeling, however, presents unique challenges: multi-faceted and uncertain prior knowledge, heterogeneous and noisy data, partial observability, and complex, high-dimensional networks. In this position paper, we argue that these challenges should not be seen as obstacles to PIML, but as catalysts for its evolution. We propose Biology-Informed Machine Learning (BIML): a principled extension of PIML that retains its structural grounding while adapting to the practical realities of biology. Rather than replacing PIML, BIML retools its methods to operate under softer, probabilistic forms of prior knowledge. We outline four foundational pillars as a roadmap for this transition: uncertainty quantification, contextualization, constrained latent structure inference, and scalability. Foundation Models and Large Language Models will be key enablers, bridging human expertise with computational modeling. We conclude with concrete recommendations to build the BIML ecosystem and channel PIML-inspired innovation toward challenges of high scientific and societal relevance.

[AI-32] A Convexity-dependent Two-Phase Training Algorithm for Deep Neural Networks

【Quick Read】: This paper addresses convergence efficiency and accuracy when minimizing machine learning loss functions, where, on non-convex losses, conventional methods such as Adam may stall in local structure or converge slowly. The key to the solution is a two-phase framework built on the hypothesis that real-world loss functions swap from initial non-convexity to convexity near the optimum: phase one uses a non-convex optimizer (Adam) to explore the global structure and approach the optimal region; once the swap point is detected by observing how the gradient norm depends on the loss, phase two switches to a second-order method (Conjugate Gradient, CG) with guaranteed superlinear convergence in the locally convex region. Experiments confirm that this simple convexity structure is frequent enough in practice to substantially improve convergence and accuracy.

Link: https://arxiv.org/abs/2510.25366
Authors: Tomas Hrycej, Bernhard Bermeitinger, Massimo Pavone, Götz-Henrik Wiegand, Siegfried Handschuh
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: Appeared at KDIR IC3K Conference 2025 (Best Paper Award)

Abstract:The key task of machine learning is to minimize the loss function that measures the model fit to the training data. The numerical methods to do this efficiently depend on the properties of the loss function. The most decisive among these properties is the convexity or non-convexity of the loss function. The fact that the loss function can have, and frequently has, non-convex regions has led to a widespread commitment to non-convex methods such as Adam. However, a local minimum implies that, in some environment around it, the function is convex. In this environment, second-order minimizing methods such as the Conjugate Gradient (CG) give a guaranteed superlinear convergence. We propose a novel framework grounded in the hypothesis that loss functions in real-world tasks swap from initial non-convexity to convexity towards the optimum. This is a property we leverage to design an innovative two-phase optimization algorithm. The presented algorithm detects the swap point by observing the gradient norm dependence on the loss. In these regions, non-convex (Adam) and convex (CG) algorithms are used, respectively. Computing experiments confirm the hypothesis that this simple convexity structure is frequent enough to be practically exploited to substantially improve convergence and accuracy.
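A minimal sketch of the two-phase idea, assuming a crude gradient-norm heuristic for the swap test and plain gradient descent standing in for Adam in phase one:

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x = np.array([-1.2, 1.0])
lr, prev_gnorm = 1e-3, np.inf
for step in range(5000):                    # phase 1: first-order descent
    g = rosen_der(x)
    gnorm = np.linalg.norm(g)
    if gnorm < 1.0 and gnorm < prev_gnorm:  # crude "entered convex basin" test
        break
    prev_gnorm = gnorm
    x -= lr * g

result = minimize(rosen, x, jac=rosen_der, method="CG")  # phase 2: CG
print(step, result.x, result.fun)
```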

[AI-33] Multi-party Agent Relation Sampling for Multi-party Ad Hoc Teamwork

【Quick Read】: This paper addresses the limited cooperation ability of multi-agent reinforcement learning (MARL) in ad hoc teamwork (AHT) settings, especially when controlled agents must coordinate with several mutually unfamiliar groups of uncontrolled teammates, where traditional methods struggle to model cross-group dynamics. The key to the solution is the Multi-party Ad Hoc Teamwork (MAHT) formulation together with the MARs algorithm, which builds a sparse skeleton graph and applies relational modeling to capture interaction dynamics across groups, enabling more efficient and stable multi-group coordination. Experiments show MARs outperforms MARL and AHT baselines on MPE and StarCraft II tasks while converging faster.

Link: https://arxiv.org/abs/2510.25340
Authors: Beiwen Zhang, Yongheng Liang, Hejun Wu
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multi-agent reinforcement learning (MARL) has achieved strong results in cooperative tasks but typically assumes fixed, fully controlled teams. Ad hoc teamwork (AHT) relaxes this by allowing collaboration with unknown partners, yet existing variants still presume shared conventions. We introduce Multi-party Ad Hoc Teamwork (MAHT), where controlled agents must coordinate with multiple mutually unfamiliar groups of uncontrolled teammates. To address this, we propose MARs, which builds a sparse skeleton graph and applies relational modeling to capture cross-group dynamics. Experiments on MPE and StarCraft II show that MARs outperforms MARL and AHT baselines while converging faster.

[AI-34] 4-Doodle: Text to 3D Sketches that Move!

【Quick Read】: This paper introduces a new task, text-to-3D sketch animation, whose difficulties are threefold: no paired dataset of text and 3D/4D sketches exists; sketches demand structural abstraction that conventional 3D representations (NeRFs, point clouds) model poorly; and animation requires temporal coherence and multi-view consistency, which current pipelines do not address. The key to the solution is 4-Doodle, the first training-free framework for generating dynamic 3D sketches from text. It leverages pretrained image and video diffusion models via dual-space distillation: one space captures multi-view-consistent geometry with differentiable Bézier curves, the other encodes motion dynamics via temporally-aware priors; a structure-aware motion module further separates shape-preserving trajectories from deformation-aware changes, enabling expressive motion such as flipping, rotation, and articulated movement, while multi-view optimization avoids the view ambiguity that is critical for sparse sketches.

Link: https://arxiv.org/abs/2510.25319
Authors: Hao Chen, Jiaqi Wang, Yonggang Qi, Ke Li, Kaiyue Pang, Yi-Zhe Song
Affiliations: Unknown
Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present a novel task: text-to-3D sketch animation, which aims to bring freeform sketches to life in dynamic 3D space. Unlike prior works focused on photorealistic content generation, we target sparse, stylized, and view-consistent 3D vector sketches, a lightweight and interpretable medium well-suited for visual communication and prototyping. However, this task is very challenging: (i) no paired dataset exists for text and 3D (or 4D) sketches; (ii) sketches require structural abstraction that is difficult to model with conventional 3D representations like NeRFs or point clouds; and (iii) animating such sketches demands temporal coherence and multi-view consistency, which current pipelines do not address. Therefore, we propose 4-Doodle, the first training-free framework for generating dynamic 3D sketches from text. It leverages pretrained image and video diffusion models through a dual-space distillation scheme: one space captures multi-view-consistent geometry using differentiable Bézier curves, while the other encodes motion dynamics via temporally-aware priors. Unlike prior work (e.g., DreamFusion), which optimizes from a single view per step, our multi-view optimization ensures structural alignment and avoids view ambiguity, critical for sparse sketches. Furthermore, we introduce a structure-aware motion module that separates shape-preserving trajectories from deformation-aware changes, enabling expressive motion such as flipping, rotation, and articulated movement. Extensive experiments show that our method produces temporally realistic and structurally stable 3D sketch animations, outperforming existing baselines in both fidelity and controllability. We hope this work serves as a step toward more intuitive and accessible 4D content creation.

[AI-35] Dense and Diverse Goal Coverage in Multi Goal Reinforcement Learning

【Quick Read】: This paper addresses a blind spot of policy learning in Reinforcement Learning (RL): methods that only maximize expected return may exploit one or a few reward sources instead of visiting goal states uniformly. In many settings one wants a high-return policy whose marginal state distribution is dispersed over the goal states, which the paper formalizes as Multi Goal RL, where an oracle classifier determines whether a state is a goal state. The key to the solution is an algorithm that, in each iteration, computes a custom reward based on the current policy mixture for a set of sampled trajectories and uses an offline RL algorithm to update the mixture, achieving high expected return with a dispersed marginal distribution over goal states. The method does not require enumerating goal states in advance, suits large-scale systems, and comes with proven convergence and performance guarantees.

Link: https://arxiv.org/abs/2510.25311
Authors: Sagalpreet Singh, Rishi Saket, Aravindan Raghuveer
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 21 pages, 5 figures

Abstract:Reinforcement Learning algorithms are primarily focused on learning a policy that maximizes expected return. As a result, the learned policy can exploit one or few reward sources. However, in many natural situations, it is desirable to learn a policy that induces a dispersed marginal state distribution over rewarding states, while maximizing the expected return which is typically tied to reaching a goal state. This aspect remains relatively unexplored. Existing techniques based on entropy regularization and intrinsic rewards use stochasticity for encouraging exploration to find an optimal policy which may not necessarily lead to dispersed marginal state distribution over rewarding states. Other RL algorithms which match a target distribution assume the latter to be available apriori. This may be infeasible in large scale systems where enumeration of all states is not possible and a state is determined to be a goal state only upon reaching it. We formalize the problem of maximizing the expected return while uniformly visiting the goal states as Multi Goal RL in which an oracle classifier over the state space determines the goal states. We propose a novel algorithm that learns a high-return policy mixture with marginal state distribution dispersed over the set of goal states. Our algorithm is based on optimizing a custom RL reward which is computed - based on the current policy mixture - at each iteration for a set of sampled trajectories. The latter are used via an offline RL algorithm to update the policy mixture. We prove performance guarantees for our algorithm, showing efficient convergence bounds for optimizing a natural objective which captures the expected return as well as the dispersion of the marginal state distribution over the goal states. We design and perform experiments on synthetic MDPs and standard RL environments to evaluate the effectiveness of our algorithm.

[AI-36] IBNorm: Information-Bottleneck Inspired Normalization for Representation Learning

【Quick Read】: This paper addresses the fact that existing normalization methods (BatchNorm, LayerNorm, RMSNorm) are variance-centric, enforcing zero mean and unit variance; they stabilize training but do not control how representations retain task-relevant information. The key to the solution is IBNorm, grounded in the Information Bottleneck (IB) principle: bounded compression operations encourage embeddings to preserve predictive, task-relevant information while suppressing nuisance variability, keeping the stability and compatibility of standard normalization while achieving provably better IB values, tighter generalization bounds, and stronger empirical results across language and vision models.

Link: https://arxiv.org/abs/2510.25262
Authors: Xiandong Zou, Pan Zhou
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Normalization is fundamental to deep learning, but existing approaches such as BatchNorm, LayerNorm, and RMSNorm are variance-centric by enforcing zero mean and unit variance, stabilizing training without controlling how representations capture task-relevant information. We propose IB-Inspired Normalization (IBNorm), a simple yet powerful family of methods grounded in the Information Bottleneck principle. IBNorm introduces bounded compression operations that encourage embeddings to preserve predictive information while suppressing nuisance variability, yielding more informative representations while retaining the stability and compatibility of standard normalization. Theoretically, we prove that IBNorm achieves a higher IB value and tighter generalization bounds than variance-centric methods. Empirically, IBNorm consistently outperforms BatchNorm, LayerNorm, and RMSNorm across large-scale language models (LLaMA, GPT-2) and vision models (ResNet, ViT), with mutual information analysis confirming superior information bottleneck behavior. Code will be released publicly.
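Since the abstract does not spell out the operator, the sketch below is only an illustrative guess at what a bounded-compression layer could look like: RMS normalization followed by a learnable, tanh-bounded squashing. The paper's concrete construction may differ.

```python
import torch
import torch.nn as nn

class BoundedNorm(nn.Module):
    """Illustrative bounded-compression normalization (assumed design)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        x = x / rms                       # variance-centric step, as in RMSNorm
        return torch.tanh(self.gain * x)  # bounded compression of each feature

layer = BoundedNorm(8)
out = layer(torch.randn(4, 8))
print(out.shape, bool(out.abs().max() < 1.0))  # outputs bounded in (-1, 1)
```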

[AI-37] TV-Rec: Time-Variant Convolutional Filter for Sequential Recommendation NEURIPS2025

【Quick Read】: This paper addresses the limitation of convolution-based sequential recommenders in capturing global interactions in user behavior, since fixed convolutional kernels alone struggle with complex temporal dependency patterns and are usually complemented with self-attention. The key to the solution is Time-Variant Convolutional Filters for Sequential Recommendation (TV-Rec), inspired by graph signal processing: time-variant filters capture position-dependent temporal variation in user sequences and replace both fixed kernels and self-attention, increasing expressive power while reducing computation and accelerating inference.

Link: https://arxiv.org/abs/2510.25259
Authors: Yehjin Shin, Jeongwhan Choi, Seojin Kim, Noseong Park
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: The 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

Abstract:Recently, convolutional filters have been increasingly adopted in sequential recommendation for their ability to capture local sequential patterns. However, most of these models complement convolutional filters with self-attention. This is because convolutional filters alone, generally fixed filters, struggle to capture global interactions necessary for accurate recommendation. We propose Time-Variant Convolutional Filters for Sequential Recommendation (TV-Rec), a model inspired by graph signal processing, where time-variant graph filters capture position-dependent temporal variations in user sequences. By replacing both fixed kernels and self-attention with time-variant filters, TV-Rec achieves higher expressive power and better captures complex interaction patterns in user behavior. This design not only eliminates the need for self-attention but also reduces computation while accelerating inference. Extensive experiments on six public benchmarks show that TV-Rec outperforms state-of-the-art baselines by an average of 7.49%.
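To contrast with a standard fixed-kernel Conv1d, the sketch below applies a different causal kernel at every sequence position, which is the sense in which a filter is "time-variant"; shapes and values are illustrative, not TV-Rec's actual parameterization.

```python
import torch
import torch.nn.functional as F

seq_len, dim, k = 6, 4, 3
x = torch.randn(1, seq_len, dim)      # (batch, time, features)
kernels = torch.randn(seq_len, k)     # one length-k kernel per position

x_pad = F.pad(x, (0, 0, k - 1, 0))    # causal left-padding along time
windows = x_pad.unfold(1, k, 1)       # (1, seq_len, dim, k) sliding windows
# Position-dependent filtering: kernel t is applied to the window ending at t.
out = torch.einsum("btdk,tk->btd", windows, kernels)
print(out.shape)                      # torch.Size([1, 6, 4])
```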

[AI-38] Scaling Up Bayesian DAG Sampling

【Quick Read】: This paper addresses the inefficiency of Markov chain Monte Carlo (MCMC) sampling for Bayesian inference over Bayesian network structures, where both basic moves (adding, deleting, or reversing a single arc) and more sophisticated moves require expensive sums over parent sets. The key to the solution is twofold: an efficient implementation of the basic moves, and a preprocessing method that prunes possible parent sets while approximately preserving their sums, substantially speeding up parent-set summation and hence overall sampling.

Link: https://arxiv.org/abs/2510.25254
Authors: Daniele Nikzad, Alexander Zhilkin, Juha Harviainen, Jack Kuipers, Giusi Moffa, Mikko Koivisto
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Bayesian inference of Bayesian network structures is often performed by sampling directed acyclic graphs along an appropriately constructed Markov chain. We present two techniques to improve sampling. First, we give an efficient implementation of basic moves, which add, delete, or reverse a single arc. Second, we expedite summing over parent sets, an expensive task required for more sophisticated moves: we devise a preprocessing method to prune possible parent sets so as to approximately preserve the sums. Our empirical study shows that our techniques can yield substantial efficiency gains compared to previous methods.

[AI-39] One-shot Humanoid Whole-body Motion Learning

【Quick Read】: This paper addresses the high cost of collecting quality motion data for whole-body humanoid control, where non-walking behaviors (gestures, dance, and the like) normally require many samples per category to train an effective policy. The key to the solution is: compute distances between walking and non-walking sequences with order-preserving optimal transport; interpolate along geodesics to generate intermediate pose skeletons; optimize them for collision-free configurations and retarget them to the humanoid; then train motion policies via reinforcement learning in simulation. This yields strong policies from a single sample of the target non-walking motion, and evaluations on the CMU MoCap dataset show consistent gains over baselines.

Link: https://arxiv.org/abs/2510.25241
Authors: Hao Huang, Geeta Chandra Raju Bethala, Shuaihang Yuan, Congcong Wen, Anthony Tzes, Yi Fang
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 10 pages, 3 figures, 5 tables

Abstract:Whole-body humanoid motion represents a cornerstone challenge in robotics, integrating balance, coordination, and adaptability to enable human-like behaviors. However, existing methods typically require multiple training samples per motion category, rendering the collection of high-quality human motion datasets both labor-intensive and costly. To address this, we propose a novel approach that trains effective humanoid motion policies using only a single non-walking target motion sample alongside readily available walking motions. The core idea lies in leveraging order-preserving optimal transport to compute distances between walking and non-walking sequences, followed by interpolation along geodesics to generate new intermediate pose skeletons, which are then optimized for collision-free configurations and retargeted to the humanoid before integration into a simulated environment for policy training via reinforcement learning. Experimental evaluations on the CMU MoCap dataset demonstrate that our method consistently outperforms baselines, achieving superior performance across metrics. Code will be released upon acceptance.

[AI-40] Studies for: A Human-AI Co-Creative Sound Artwork Using a Real-time Multi-channel Sound Generation Model NEURIPS

【Quick Read】: This paper explores how to fold generative AI into the sound-art workflow and how to realize a "new form of archive" that extends an artist's expression beyond their physical existence; the core question is how AI can keep capturing an artist's style while extending it creatively, without losing artistic agency or unpredictability. The key to the solution is a Human-AI co-creation framework built around SpecMaskGIT, a lightweight, high-quality sound generation model trained on over 200 hours of sound artist Evala's past works, generating and playing back eight-channel sound in real time over a three-month exhibition; by integrating artist feedback and deliberately including unexpected, novel outputs, the generated sound both reflects the artist's identity and breaks new ground, realizing an archive that keeps growing and evolving.

Link: https://arxiv.org/abs/2510.25228
Authors: Chihiro Nagashima, Akira Takahashi, Zhi Zhong, Shusuke Takahashi, Yuki Mitsufuji
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Accepted at NeurIPS Creative AI Track 2025, 9 pages, 6 figures, 1 table, Demo page: this https URL

Abstract:This paper explores the integration of AI technologies into the artistic workflow through the creation of Studies for, a generative sound installation developed in collaboration with sound artist Evala (this https URL). The installation employs SpecMaskGIT, a lightweight yet high-quality sound generation AI model, to generate and playback eight-channel sound in real-time, creating an immersive auditory experience over the course of a three-month exhibition. The work is grounded in the concept of a "new form of archive," which aims to preserve the artistic style of an artist while expanding beyond artists' past artworks by continued generation of new sound elements. This speculative approach to archival preservation is facilitated by training the AI model on a dataset consisting of over 200 hours of Evala's past sound artworks. By addressing key requirements in the co-creation of art using AI, this study highlights the value of the following aspects: (1) the necessity of integrating artist feedback, (2) datasets derived from an artist's past works, and (3) ensuring the inclusion of unexpected, novel outputs. In Studies for, the model was designed to reflect the artist's artistic identity while generating new, previously unheard sounds, making it a fitting realization of the concept of "a new form of archive." We propose a Human-AI co-creation framework for effectively incorporating sound generation AI models into the sound art creation process and suggest new possibilities for creating and archiving sound art that extend an artist's work beyond their physical existence. Demo page: this https URL

[AI-41] Cost-Sensitive Unbiased Risk Estimation for Multi-Class Positive-Unlabeled Learning

【Quick Read】: This paper addresses biased risk estimation in multi-class positive-unlabeled (MPU) learning, where reliable negatives are missing, a common situation in practice because annotating negatives is costly or unreliable. The key to the solution is a cost-sensitive method based on adaptive loss weighting: within empirical risk minimization, distinct data-dependent weights are assigned to the positive and inferred-negative (from the unlabeled mixture) loss components so that the resulting objective is an unbiased estimator of the target risk. The paper also formalizes the MPU data-generating process, establishes a generalization error bound for the estimator, and shows consistent gains in accuracy and stability over strong baselines on eight public datasets.

Link: https://arxiv.org/abs/2510.25226
Authors: Miao Zhang, Junpeng Li, Changchun Hua, Yana Yang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Positive–Unlabeled (PU) learning considers settings in which only positive and unlabeled data are available, while negatives are missing or left unlabeled. This situation is common in real applications where annotating reliable negatives is difficult or costly. Despite substantial progress in PU learning, the multi-class case (MPU) remains challenging: many existing approaches do not ensure unbiased risk estimation, which limits performance and stability. We propose a cost-sensitive multi-class PU method based on adaptive loss weighting. Within the empirical risk minimization framework, we assign distinct, data-dependent weights to the positive and inferred-negative (from the unlabeled mixture) loss components so that the resulting empirical objective is an unbiased estimator of the target risk. We formalize the MPU data-generating process and establish a generalization error bound for the proposed estimator. Extensive experiments on eight public datasets, spanning varying class priors and numbers of classes, show consistent gains over strong baselines in both accuracy and stability.
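For intuition about unbiased PU risk estimation, the sketch below implements the binary non-negative PU risk of Kiryo et al. (2017), the standard precursor that cost-sensitive multi-class estimators such as this one generalize; the class prior and scores are illustrative.

```python
import torch

def nn_pu_risk(scores_pos, scores_unl, prior=0.3):
    loss = lambda s, y: torch.sigmoid(-y * s)          # sigmoid surrogate loss
    risk_pos = prior * loss(scores_pos, +1).mean()
    # Unbiased estimate of the negative risk from unlabeled + positive data:
    risk_neg = loss(scores_unl, -1).mean() - prior * loss(scores_pos, -1).mean()
    return risk_pos + torch.clamp(risk_neg, min=0.0)   # non-negativity correction

scores_pos = torch.randn(32) + 1.0   # classifier outputs on positives
scores_unl = torch.randn(128)        # classifier outputs on unlabeled data
print(nn_pu_risk(scores_pos, scores_unl).item())
```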

[AI-42] FELA: A Multi-Agent Evolutionary System for Feature Engineering of Industrial Event Log Data

【Quick Read】: This paper addresses the complexity and heterogeneity of feature engineering on industrial event logs: extracting meaningful, high-performing features from large-scale, high-dimensional, multi-typed data with intricate temporal or relational structure. Existing automation such as AutoML or genetic methods suffers from limited explainability, rigid predefined operations, and poor adaptability to heterogeneous data. The key to the proposed FELA (Feature Engineering LLM Agents), an LLM-based multi-agent evolutionary system, is threefold: Idea, Code, and Critic Agents collaboratively generate, validate, and implement novel features; an Evaluation Agent maintains a hierarchical knowledge base and a dual-memory system for continual improvement; and an agentic evolution algorithm combining reinforcement learning and genetic-algorithm principles balances exploration and exploitation over the idea space, yielding explainable, domain-relevant, and adaptive feature engineering.

Link: https://arxiv.org/abs/2510.25223
Authors: Kun ouyang, Haoyu Wang, Dong Fang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 14 pages, 11 figures

Abstract:Event log data, recording fine-grained user actions and system events, represent one of the most valuable assets for modern digital services. However, the complexity and heterogeneity of industrial event logs–characterized by large scale, high dimensionality, diverse data types, and intricate temporal or relational structures–make feature engineering extremely challenging. Existing automatic feature engineering approaches, such as AutoML or genetic methods, often suffer from limited explainability, rigid predefined operations, and poor adaptability to complicated heterogeneous data. In this paper, we propose FELA (Feature Engineering LLM Agents), a multi-agent evolutionary system that autonomously extracts meaningful and high-performing features from complex industrial event log data. FELA integrates the reasoning and coding capabilities of large language models (LLMs) with an insight-guided self-evolution paradigm. Specifically, FELA employs specialized agents–Idea Agents, Code Agents, and Critic Agents–to collaboratively generate, validate, and implement novel feature ideas. An Evaluation Agent summarizes feedback and updates a hierarchical knowledge base and dual-memory system to enable continual improvement. Moreover, FELA introduces an agentic evolution algorithm, combining reinforcement learning and genetic algorithm principles to balance exploration and exploitation across the idea space. Extensive experiments on real industrial datasets demonstrate that FELA can generate explainable, domain-relevant features that significantly improve model performance while reducing manual effort. Our results highlight the potential of LLM-based multi-agent systems as a general framework for automated, interpretable, and adaptive feature engineering in complex real-world environments.

[AI-43] GReF: A Unified Generative Framework for Efficient Reranking via Ordered Multi-token Prediction CIKM2025

【Quick Read】: This paper tackles two core problems of the reranking stage in multi-stage recommenders: the conventional two-stage (generator-evaluator) design resists end-to-end training because the two parts are separate, and autoregressive generators bottleneck inference. The key to the unified efficient reranking framework GReF is: Gen-Reranker, a bidirectional encoder with a dynamic autoregressive decoder that generates causal reranking sequences, pre-trained on item exposure order for high-quality initialization; Rerank-DPO, which folds sequence-level evaluation signals into post-training for end-to-end optimization without a separate evaluator; and Ordered Multi-Token Prediction (OMTP), which generates several future items in parallel while preserving their order, making inference fast enough for real-time deployment. GReF is deployed in Kuaishou, a video app with over 300 million daily active users.

Link: https://arxiv.org/abs/2510.25220
Authors: Zhijie Lin, Zhuofeng Li, Chenglei Dai, Wentian Bao, Shuai Lin, Enyun Yu, Haoxiang Zhang, Liang Zhao
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by CIKM 2025

Abstract:In a multi-stage recommendation system, reranking plays a crucial role in modeling intra-list correlations among items. A key challenge lies in exploring optimal sequences within the combinatorial space of permutations. Recent research follows a two-stage (generator-evaluator) paradigm, where a generator produces multiple feasible sequences, and an evaluator selects the best one. In practice, the generator is typically implemented as an autoregressive model. However, these two-stage methods face two main challenges. First, the separation of the generator and evaluator hinders end-to-end training. Second, autoregressive generators suffer from inference efficiency. In this work, we propose a Unified Generative Efficient Reranking Framework (GReF) to address the two primary challenges. Specifically, we introduce Gen-Reranker, an autoregressive generator featuring a bidirectional encoder and a dynamic autoregressive decoder to generate causal reranking sequences. Subsequently, we pre-train Gen-Reranker on the item exposure order for high-quality parameter initialization. To eliminate the need for the evaluator while integrating sequence-level evaluation during training for end-to-end optimization, we propose post-training the model through Rerank-DPO. Moreover, for efficient autoregressive inference, we introduce ordered multi-token prediction (OMTP), which trains Gen-Reranker to simultaneously generate multiple future items while preserving their order, ensuring practical deployment in real-time recommender systems. Extensive offline experiments demonstrate that GReF outperforms state-of-the-art reranking methods while achieving latency that is nearly comparable to non-autoregressive models. Additionally, GReF has also been deployed in a real-world video app Kuaishou with over 300 million daily active users, significantly improving online recommendation quality.

[AI-44] Human Resilience in the AI Era – What Machines Can't Replace

【Quick Read】: This paper addresses the disruptions AI brings to work, identity, and social trust: task displacement, mediation of high-stakes decisions, and a flood of synthetic content. The key countermeasure it argues for is resilience defined across three layers: psychological (emotion regulation, meaning-making, cognitive flexibility), social (trust, social capital, coordinated response), and organizational (psychological safety, feedback mechanisms, graceful degradation). Early evidence suggests these capacities buffer individual strain, reduce burnout through social support, and lower silent failure in AI-mediated workflows via team norms and risk-responsive governance; resilience can be trained as a complement to, not a substitute for, structural safeguards, preserving human agency and steering responsible adoption.

Link: https://arxiv.org/abs/2510.25218
Authors: Shaoshan Liu, Anina Schwarzenbach, Yiyu Shi
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:AI is displacing tasks, mediating high-stakes decisions, and flooding communication with synthetic content, unsettling work, identity, and social trust. We argue that the decisive human countermeasure is resilience. We define resilience across three layers: psychological, including emotion regulation, meaning-making, cognitive flexibility; social, including trust, social capital, coordinated response; organizational, including psychological safety, feedback mechanisms, and graceful degradation. We synthesize early evidence that these capacities buffer individual strain, reduce burnout through social support, and lower silent failure in AI-mediated workflows through team norms and risk-responsive governance. We also show that resilience can be cultivated through training that complements rather than substitutes for structural safeguards. By reframing the AI debate around actionable human resilience, this article offers policymakers, educators, and operators a practical lens to preserve human agency and steer responsible adoption.

[AI-45] Energy-Efficient Autonomous Driving with Adaptive Perception and Robust Decision ICDE2026

【Quick Read】: This paper addresses the high energy cost of perception computing in autonomous driving, which shortens vehicle range (especially for electric vehicles), and the difficulty of balancing energy and accuracy in perception stacks built on large deep learning models. The key to the proposed EneAD framework is: an adaptive perception module that manages multiple perception models of different computational cost, dynamically adjusts their execution framerate, and tunes these knobs with a transferable Bayesian-optimization method to stay accurate at low computation; a lightweight classifier that recognizes the perception difficulty of traffic scenarios so knob values can switch adaptively; and a robust decision module with a reinforcement-learning policy and a regularization term that stabilizes driving under perturbed perception. Experiments show 1.9x to 3.5x lower perception consumption and 3.9% to 8.5% longer driving range.

Link: https://arxiv.org/abs/2510.25205
Authors: Yuyang Xia, Zibo Liang, Liwei Deng, Yan Zhao, Han Su, Kai Zheng
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted by ICDE 2026

Abstract:Autonomous driving is an emerging technology that is expected to bring significant social, economic, and environmental benefits. However, these benefits come with rising energy consumption by computation engines, limiting the driving range of vehicles, especially electric ones. Perception computing is typically the most power-intensive component, as it relies on large-scale deep learning models to extract environmental features. Recently, numerous studies have employed model compression techniques, such as sparsification, quantization, and distillation, to reduce computational consumption. However, these methods often result in either a substantial model size or a significant drop in perception accuracy compared to high-computation models. To address these challenges, we propose an energy-efficient autonomous driving framework, called EneAD. In the adaptive perception module, a perception optimization strategy is designed from the perspective of data management and tuning. Firstly, we manage multiple perception models with different computational consumption and adjust the execution framerate dynamically. Then, we define them as knobs and design a transferable tuning method based on Bayesian optimization to identify promising knob values that achieve low computation while maintaining desired accuracy. To adaptively switch the knob values in various traffic scenarios, a lightweight classification model is proposed to distinguish the perception difficulty in different scenarios. In the robust decision module, we propose a decision model based on reinforcement learning and design a regularization term to enhance driving stability in the face of perturbed perception results. Extensive experiments demonstrate the superiority of our framework in both energy consumption and driving performance. EneAD can reduce perception consumption by 1.9x to 3.5x and thus improve driving range by 3.9% to 8.5%.

[AI-46] Fed-PELAD: Communication-Efficient Federated Learning for Massive MIMO CSI Feedback with Personalized Encoders and a LoRA-Adapted Shared Decoder

【Quick Read】: This paper targets the communication overhead, data heterogeneity, and privacy concerns of deep learning for channel state information (CSI) feedback in massive MIMO systems, proposing the federated learning framework Fed-PELAD. The key to the solution is: personalized encoders trained locally on each user equipment (UE) capture device-specific channel characteristics, while a shared decoder is updated globally under base station (BS) coordination via Low-Rank Adaptation (LoRA); only compact LoRA adapter parameters, rather than full model updates, are transmitted for aggregation, cutting uplink cost, and an alternating freezing strategy with a calibrated learning-rate ratio further stabilizes convergence during LoRA aggregation. Simulations on 3GPP-standard channels show only 42.97% of the uplink cost of conventional methods with a 1.2 dB accuracy gain under heterogeneity.

Link: https://arxiv.org/abs/2510.25181
Authors: Yixiang Zhou, Tong Wu, Meixia Tao, Jianhua Mo
Affiliations: Unknown
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper addresses the critical challenges of communication overhead, data heterogeneity, and privacy in deep learning for channel state information (CSI) feedback in massive MIMO systems. To this end, we propose Fed-PELAD, a novel federated learning framework that incorporates personalized encoders and a LoRA-adapted shared decoder. Specifically, personalized encoders are trained locally on each user equipment (UE) to capture device-specific channel characteristics, while a shared decoder is updated globally via the coordination of the base station (BS) by using Low-Rank Adaptation (LoRA). This design ensures that only compact LoRA adapter parameters instead of full model updates are transmitted for aggregation. To further enhance convergence stability, we introduce an alternating freezing strategy with calibrated learning-rate ratio during LoRA aggregation. Extensive simulations on 3GPP-standard channel models demonstrate that Fed-PELAD requires only 42.97% of the uplink communication cost compared to conventional methods while achieving a performance gain of 1.2 dB in CSI feedback accuracy under heterogeneous conditions.
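The aggregation step can be pictured as follows: clients upload only their decoder's LoRA adapter tensors, and the base station averages them before broadcasting the result. Uniform weighting and the tensor shapes are our assumptions; the paper's alternating freezing and learning-rate calibration are only noted in comments.

```python
import torch

def aggregate_lora(client_adapters):
    """client_adapters: list of dicts like {'A': tensor, 'B': tensor}."""
    keys = client_adapters[0].keys()
    return {k: torch.stack([c[k] for c in client_adapters]).mean(dim=0)
            for k in keys}

rank, d_in, d_out = 4, 64, 32
clients = [{"A": torch.randn(rank, d_in), "B": torch.randn(d_out, rank)}
           for _ in range(3)]
# (Fed-PELAD additionally alternates freezing of A and B across rounds.)
global_adapter = aggregate_lora(clients)
# The shared decoder's effective low-rank update is the product B @ A.
delta_w = global_adapter["B"] @ global_adapter["A"]
print(delta_w.shape)  # torch.Size([32, 64])
```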

[AI-47] Agent ic Moderation: Multi-Agent Design for Safer Vision-Language Models

【Quick Read】: This paper addresses the lack of dynamic, interpretable, and balanced safety alignment in multimodal systems under jailbreak attacks: traditional defenses sit as a static layer over inputs or outputs and emit only a binary safe/unsafe verdict, adapting poorly to complex scenarios and offering little transparency. The key to the solution is Agentic Moderation, a model-agnostic framework with four cooperating specialized agents (Shield, Responder, Evaluator, and Reflector) whose dynamic interaction yields context-aware, interpretable moderation. Across five datasets and four representative Large Vision-Language Models (LVLMs), it cuts the Attack Success Rate (ASR) by 7-19%, keeps the Non-Following Rate (NF) stable, and raises the Refusal Rate (RR) by 4-20%, achieving robust and well-balanced safety.

Link: https://arxiv.org/abs/2510.25179
Authors: Juan Ren, Mark Dras, Usman Naseem
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Agentic methods have emerged as a powerful and autonomous paradigm that enhances reasoning, collaboration, and adaptive control, enabling systems to coordinate and independently solve complex tasks. We extend this paradigm to safety alignment by introducing Agentic Moderation, a model-agnostic framework that leverages specialised agents to defend multimodal systems against jailbreak attacks. Unlike prior approaches that apply as a static layer over inputs or outputs and provide only binary classifications (safe or unsafe), our method integrates dynamic, cooperative agents, including Shield, Responder, Evaluator, and Reflector, to achieve context-aware and interpretable moderation. Extensive experiments across five datasets and four representative Large Vision-Language Models (LVLMs) demonstrate that our approach reduces the Attack Success Rate (ASR) by 7-19%, maintains a stable Non-Following Rate (NF), and improves the Refusal Rate (RR) by 4-20%, achieving robust, interpretable, and well-balanced safety performance. By harnessing the flexibility and reasoning capacity of agentic architectures, Agentic Moderation provides modular, scalable, and fine-grained safety enforcement, highlighting the broader potential of agentic systems as a foundation for automated safety governance.

[AI-48] SFMS-ALR: Script-First Multilingual Speech Synthesis with Adaptive Locale Resolution

【Quick Read】: This paper addresses the key difficulties of intra-sentence multilingual speech synthesis (code-switching TTS): abrupt language shifts, mixed writing systems, and mismatched prosody across languages, which monolingual TTS systems handle poorly. The key to the solution is the engine-agnostic framework Script-First Multilingual Synthesis with Adaptive Locale Resolution (SFMS-ALR): segment input text by Unicode script, determine each segment's language and locale via adaptive language identification, normalize prosody with sentiment-aware adjustments to preserve expressive continuity, and emit a unified SSML representation synthesized in a single TTS request. No retraining is needed, and it integrates with existing voices from Google, Apple, Amazon, and other providers, offering flexibility, interpretability, and immediate deployability.

Link: https://arxiv.org/abs/2510.25178
Authors: Dharma Teja Donepudi
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 10 pages, 2 figures, 1 table. Demonstration prototype available at this https URL

Abstract:Intra-sentence multilingual speech synthesis (code-switching TTS) remains a major challenge due to abrupt language shifts, varied scripts, and mismatched prosody between languages. Conventional TTS systems are typically monolingual and fail to produce natural, intelligible speech in mixed-language contexts. We introduce Script-First Multilingual Synthesis with Adaptive Locale Resolution (SFMS-ALR), an engine-agnostic framework for fluent, real-time code-switched speech generation. SFMS-ALR segments input text by Unicode script, applies adaptive language identification to determine each segment’s language and locale, and normalizes prosody using sentiment-aware adjustments to preserve expressive continuity across languages. The algorithm generates a unified SSML representation with appropriate “lang” or “voice” spans and synthesizes the utterance in a single TTS request. Unlike end-to-end multilingual models, SFMS-ALR requires no retraining and integrates seamlessly with existing voices from Google, Apple, Amazon, and other providers. Comparative analysis with data-driven pipelines such as Unicom and Mask LID demonstrates SFMS-ALR’s flexibility, interpretability, and immediate deployability. The framework establishes a modular baseline for high-quality, engine-independent multilingual TTS and outlines evaluation strategies for intelligibility, naturalness, and user preference.
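The script-first segmentation step can be sketched as below: split a code-switched sentence into runs by Unicode script and wrap each run in an SSML lang span. The codepoint ranges and locale mapping are simplified stand-ins for the full adaptive locale resolution.

```python
def script_of(ch: str) -> str:
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF:
        return "cmn-CN"   # CJK Unified Ideographs -> Mandarin (assumed locale)
    if 0x0900 <= cp <= 0x097F:
        return "hi-IN"    # Devanagari -> Hindi (assumed locale)
    return "en-US"        # default Latin/punctuation bucket

def to_ssml(text: str) -> str:
    spans, cur, cur_script = [], "", None
    for ch in text:
        # Whitespace inherits the current script so it does not split runs.
        s = script_of(ch) if not ch.isspace() else cur_script
        if s != cur_script and cur:
            spans.append((cur_script, cur))
            cur = ""
        cur_script = s or "en-US"
        cur += ch
    spans.append((cur_script, cur))
    body = "".join(f'<lang xml:lang="{s}">{t}</lang>' for s, t in spans)
    return f"<speak>{body}</speak>"

print(to_ssml("Please call me 明天早上 at nine."))
```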

[AI-49] Lipschitz-aware Linearity Grafting for Certified Robustness

【Quick Read】: This paper targets the over-approximation errors incurred when estimating a network's local Lipschitz constant, which loosen the constant and in turn limit certified robustness against adversarial examples. The key to the solution is Lipschitz-aware linearity grafting: grafting linearity into the non-linear activations that dominate approximation error removes the main relaxation errors (linear functions need no relaxation), which tightens the l-infinity local Lipschitz constant and improves certified robustness even without certified training; the paper also supplies the previously missing theoretical account of why grafting helps, viewed through the lens of the local Lipschitz constant.

Link: https://arxiv.org/abs/2510.25130
Authors: Yongjin Han, Suhyun Kim
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Lipschitz constant is a fundamental property in certified robustness, as smaller values imply robustness to adversarial examples when a model is confident in its prediction. However, identifying the worst-case adversarial examples is known to be an NP-complete problem. Although over-approximation methods have shown success in neural network verification to address this challenge, reducing approximation errors remains a significant obstacle. Furthermore, these approximation errors hinder the ability to obtain tight local Lipschitz constants, which are crucial for certified robustness. Originally, grafting linearity into non-linear activation functions was proposed to reduce the number of unstable neurons, enabling scalable and complete verification. However, no prior theoretical analysis has explained how linearity grafting improves certified robustness. We instead consider linearity grafting primarily as a means of eliminating approximation errors rather than reducing the number of unstable neurons, since linear functions do not require relaxation. In this paper, we provide two theoretical contributions: 1) why linearity grafting improves certified robustness through the lens of the l_\infty local Lipschitz constant, and 2) grafting linearity into non-linear activation functions, the dominant source of approximation errors, yields a tighter local Lipschitz constant. Based on these theoretical contributions, we propose a Lipschitz-aware linearity grafting method that removes dominant approximation errors, which are crucial for tightening the local Lipschitz constant, thereby improving certified robustness, even without certified training. Our extensive experiments demonstrate that grafting linearity into these influential activations tightens the l_\infty local Lipschitz constant and enhances certified robustness.
zh

[AI-50] Bridging the Divide: End-to-End Sequence-Graph Learning

【速读】:该论文旨在解决现实世界中同时具有时序性和关系性的数据建模问题,即如何有效融合序列信息与图结构信息以提升下游任务性能。传统方法通常单独处理序列或图结构,忽略了二者之间的互补性。其解决方案的关键在于提出BRIDGE架构——一个统一的端到端模型,通过共享目标函数将序列编码器与图神经网络(GNN)耦合,使梯度能在两个模块间流动,从而学习任务对齐的表示;此外,引入TOKENXATTN层实现邻居节点间事件级别的细粒度消息传递,增强跨序列的信息交互能力。

链接: https://arxiv.org/abs/2510.25126
作者: Yuen Chen,Yulun Wu,Samuel Sharpe,Igor Melnyk,Nam H. Nguyen,Furong Huang,C. Bayan Bruss,Rizal Fathony
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Many real-world datasets are both sequential and relational: each node carries an event sequence while edges encode interactions. Existing methods in sequence modeling and graph modeling often neglect one modality or the other. We argue that sequences and graphs are not separate problems but complementary facets of the same dataset, and should be learned jointly. We introduce BRIDGE, a unified end-to-end architecture that couples a sequence encoder with a GNN under a single objective, allowing gradients to flow across both modules and learning task-aligned representations. To enable fine-grained token-level message passing among neighbors, we add TOKENXATTN, a token-level cross-attention layer that passes messages between events in neighboring sequences. Across two settings, friendship prediction (Brightkite) and fraud detection (Amazon), BRIDGE consistently outperforms static GNNs, temporal graph methods, and sequence-only baselines on ranking and classification metrics.
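
以下是 TOKENXATTN 思路的一个 PyTorch 示意草图:让节点自身序列的 token 作为 query,对邻居序列的 token 做跨注意力,得到事件级消息。模块接口、维度与残差结构均为演示假设,并非论文原实现:

```python
import torch
import torch.nn as nn

class TokenCrossAttention(nn.Module):
    """示意 TOKENXATTN:节点 u 的事件序列 token 对邻居 v 的 token 做跨注意力。"""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, h_u, h_neighbors):
        # h_u: (B, T_u, d)  当前节点的序列编码(如 GRU/Transformer 输出)
        # h_neighbors: (B, T_nbr, d)  邻居序列 token 拼接后的表示
        msg, _ = self.attn(query=h_u, key=h_neighbors, value=h_neighbors)
        return self.norm(h_u + msg)   # 残差 + LayerNorm, 作为 GNN 的节点消息输入

# 用法示意:先用序列编码器得到 token 级表示, 再做 token 级消息传递
B, Tu, Tn, d = 2, 12, 30, 64
layer = TokenCrossAttention(d)
out = layer(torch.randn(B, Tu, d), torch.randn(B, Tn, d))
print(out.shape)  # torch.Size([2, 12, 64])
```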
zh

[AI-51] Learning Low Rank Neural Representations of Hyperbolic Wave Dynamics from Data

【速读】:该论文旨在解决物理数据中基于波动传播(hyperbolic wave propagation)的高维表示问题,即如何从复杂波场数据中学习到高效、低维且具有物理可解释性的表征。其解决方案的关键在于提出一种基于超网络(hypernetwork)框架的低秩神经表示(Low Rank Neural Representation, LRNR)架构,该架构通过深度学习技术直接从数据中学习到波传播的低维结构,并自然地产生低秩张量表示,从而揭示出每个分解模态对应于可解释的物理特征。此外,LRNR还具备压缩推理能力,适用于高性能部署场景。

链接: https://arxiv.org/abs/2510.25123
作者: Woojin Cho,Kookjin Lee,Noseong Park,Donsub Rim,Gerrit Welper
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
备注: 41 pages, 18 figures

点击查看摘要

Abstract:We present a data-driven dimensionality reduction method that is well-suited for physics-based data representing hyperbolic wave propagation. The method utilizes a specialized neural network architecture called low rank neural representation (LRNR) inside a hypernetwork framework. The architecture is motivated by theoretical results that rigorously prove the existence of efficient representations for this wave class. We illustrate through archetypal examples that such an efficient low-dimensional representation of propagating waves can be learned directly from data through a combination of deep learning techniques. We observe that a low rank tensor representation arises naturally in the trained LRNRs, and that this reveals a new decomposition of wave propagation where each decomposed mode corresponds to interpretable physical features. Furthermore, we demonstrate that the LRNR architecture enables efficient inference via a compression scheme, which is a potentially important feature when deploying LRNRs in demanding performance regimes.
zh

[AI-52] The Neural Differential Manifold: An Architecture with Explicit Geometric Structure

【速读】:该论文旨在解决传统神经网络在参数空间中缺乏几何结构约束所导致的优化效率低、泛化能力弱及可解释性差的问题。其核心解决方案是提出神经微分流形(Neural Differential Manifold, NDM)架构,关键在于将神经网络重新建模为一个微分流形,其中每一层作为局部坐标图(local coordinate chart),网络参数直接参数化流形上的黎曼度量张量(Riemannian metric tensor)。通过坐标层(Coordinate Layer)、几何层(Geometric Layer)和演化层(Evolution Layer)的协同设计,NDM实现了基于流形几何的自然梯度优化,并引入几何正则项以抑制曲率与体积畸变,从而提升模型的鲁棒性和可解释性。

链接: https://arxiv.org/abs/2510.25113
作者: Di Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Differential Geometry (math.DG); Optimization and Control (math.OC)
备注: 9 pages

点击查看摘要

Abstract:This paper introduces the Neural Differential Manifold (NDM), a novel neural network architecture that explicitly incorporates geometric structure into its fundamental design. Departing from conventional Euclidean parameter spaces, the NDM re-conceptualizes a neural network as a differentiable manifold where each layer functions as a local coordinate chart, and the network parameters directly parameterize a Riemannian metric tensor at every point. The architecture is organized into three synergistic layers: a Coordinate Layer implementing smooth chart transitions via invertible transformations inspired by normalizing flows, a Geometric Layer that dynamically generates the manifold’s metric through auxiliary sub-networks, and an Evolution Layer that optimizes both task performance and geometric simplicity through a dual-objective loss function. This geometric regularization penalizes excessive curvature and volume distortion, providing intrinsic regularization that enhances generalization and robustness. The framework enables natural gradient descent optimization aligned with the learned manifold geometry and offers unprecedented interpretability by endowing internal representations with clear geometric meaning. We analyze the theoretical advantages of this approach, including its potential for more efficient optimization, enhanced continual learning, and applications in scientific discovery and controllable generative modeling. While significant computational challenges remain, the Neural Differential Manifold represents a fundamental shift towards geometrically structured, interpretable, and efficient deep learning systems.
zh

[AI-53] Learning Fair Graph Representations with Multi-view Information Bottleneck

【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在处理关系数据时可能放大数据训练偏差的问题,即GNN会将敏感属性和结构不平衡传播至不公平的预测结果。传统公平性方法通常将偏见视为单一来源,忽略了特征与结构因素各自的影响,导致公平性与任务性能之间的权衡不佳。其解决方案的关键在于提出FairMIB——一种多视角信息瓶颈框架,通过将图分解为特征视图、结构视图和扩散视图来解耦复杂偏见;该框架利用对比学习最大化跨视图互信息以实现无偏表示学习,并结合多视角条件信息瓶颈目标最小化与敏感属性的互信息,从而平衡任务效用与公平性;此外,在扩散视图中引入逆概率加权(Inverse Probability-Weighted, IPW)邻接矩阵修正机制,有效抑制消息传递过程中的偏见传播。

链接: https://arxiv.org/abs/2510.25096
作者: Chuxun Liu,Debo Cheng,Qingfeng Chen,Jiangzhang Gan,Jiuyong Li,Lin Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph neural networks (GNNs) excel on relational data by passing messages over node features and structure, but they can amplify training data biases, propagating discriminatory attributes and structural imbalances into unfair outcomes. Many fairness methods treat bias as a single source, ignoring distinct attribute and structure effects and leading to suboptimal fairness and utility trade-offs. To overcome this challenge, we propose FairMIB, a multi-view information bottleneck framework designed to decompose graphs into feature, structural, and diffusion views for mitigating complexity biases in GNNs. Especially, the proposed FairMIB employs contrastive learning to maximize cross-view mutual information for bias-free representation learning. It further integrates multi-perspective conditional information bottleneck objectives to balance task utility and fairness by minimizing mutual information with sensitive attributes. Additionally, FairMIB introduces an inverse probability-weighted (IPW) adjacency correction in the diffusion view, which reduces the spread of bias propagation during message passing. Experiments on five real-world benchmark datasets demonstrate that FairMIB achieves state-of-the-art performance across both utility and fairness metrics.
zh

[AI-54] H3M-SSMoEs: Hypergraph-based Multimodal Learning with LLM Reasoning and Style-Structured Mixture of Experts

【速读】:该论文旨在解决股票走势预测中长期存在的挑战,包括复杂的时序依赖关系、异构模态信息的融合以及动态演变的个股间关联结构。现有方法难以在可扩展框架内统一结构建模、语义对齐与制度自适应能力。其解决方案的关键在于提出H3M-SSMoEs架构,包含三大创新:(1)基于超图的多上下文多模态表示机制,通过局部上下文超图(Local Context Hypergraph, LCH)捕捉细粒度时空动态,并利用全局上下文超图(Global Context Hypergraph, GCH)建模持久的跨股票依赖关系,结合共享跨模态超边与Jensen-Shannon散度加权机制实现自适应关系学习与跨模态对齐;(2)引入冻结的大语言模型(LLM)增强推理模块,借助轻量级适配器实现定量与文本模态的语义融合与对齐,注入领域金融知识以丰富表征;(3)设计风格结构化的专家混合模型(Style-Structured Mixture of Experts, SSMoEs),整合通用市场专家与行业专业化专家,每个专家由可学习风格向量参数化,支持稀疏激活下的制度感知专业化。实验表明该方法在三大主要股市上均优于当前最优模型,在预测精度和投资绩效方面表现卓越,同时具备良好的风险控制能力。

链接: https://arxiv.org/abs/2510.25091
作者: Peilin Tan,Liang Xie,Churan Zhi,Dian Tu,Chuanqi Shi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Stock movement prediction remains fundamentally challenging due to complex temporal dependencies, heterogeneous modalities, and dynamically evolving inter-stock relationships. Existing approaches often fail to unify structural, semantic, and regime-adaptive modeling within a scalable framework. This work introduces H3M-SSMoEs, a novel Hypergraph-based MultiModal architecture with LLM reasoning and Style-Structured Mixture of Experts, integrating three key innovations: (1) a Multi-Context Multimodal Hypergraph that hierarchically captures fine-grained spatiotemporal dynamics via a Local Context Hypergraph (LCH) and persistent inter-stock dependencies through a Global Context Hypergraph (GCH), employing shared cross-modal hyperedges and Jensen-Shannon Divergence weighting mechanism for adaptive relational learning and cross-modal alignment; (2) a LLM-enhanced reasoning module, which leverages a frozen large language model with lightweight adapters to semantically fuse and align quantitative and textual modalities, enriching representations with domain-specific financial knowledge; and (3) a Style-Structured Mixture of Experts (SSMoEs) that combines shared market experts and industry-specialized experts, each parameterized by learnable style vectors enabling regime-aware specialization under sparse activation. Extensive experiments on three major stock markets demonstrate that H3M-SSMoEs surpasses state-of-the-art methods in both superior predictive accuracy and investment performance, while exhibiting effective risk control. Datasets, source code, and model weights are available at our GitHub repository: this https URL.
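
下面用几行代码示意基于Jensen-Shannon散度的超边加权思想:跨模态分布越一致(JSD越小),超边权重越大。其中 1/(1+JSD) 的映射方式为本文演示假设,论文中的具体加权机制以原文为准:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def jsd_hyperedge_weight(p, q, eps=1e-12):
    """由两种模态在某条超边上的归一化分布 p、q 计算权重:JSD 越小权重越大。"""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    jsd = jensenshannon(p, q, base=2) ** 2   # scipy 返回 JS 距离(散度的平方根)
    return 1.0 / (1.0 + jsd)                 # 映射方式为演示假设

print(jsd_hyperedge_weight([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))
```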
zh

[AI-55] Monopoly Deal: A Benchmark Environment for Bounded One-Sided Response Games

【速读】:该论文旨在解决在顺序决策问题中,一类尚未被充分探索但具有战略丰富性的游戏结构——“有界单边响应”(Bounded One-Sided Response, BORG)机制的建模与学习问题。这类机制表现为一方行动后短暂将控制权转移给对手,后者必须通过一连串固定顺序的动作满足特定条件才能完成回合,常见于现实场景如谈判和金融交易中的约束性交互。解决方案的关键在于提出并验证了一种基于Monopoly Deal的改良版本作为基准环境,专门隔离出BORG动态;同时证明经典算法反事实遗憾最小化(Counterfactual Regret Minimization, CFR)无需任何扩展即可在此类环境中收敛至有效策略,从而为后续状态表示与策略学习研究提供了可复现、高效的实验平台。

链接: https://arxiv.org/abs/2510.25080
作者: Will Wolf
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 24 pages, 7 figures

点击查看摘要

Abstract:Card games are widely used to study sequential decision-making under uncertainty, with real-world analogues in negotiation, finance, and cybersecurity. Typically, these games fall into three categories based on the flow of control: strictly-sequential (where players alternate single actions), deterministic-response (where some actions trigger a fixed outcome), and unbounded reciprocal-response (where alternating counterplays are permitted). A less-explored but strategically rich structure exists: the bounded one-sided response. This dynamic occurs when a player’s action briefly transfers control to the opponent, who must satisfy a fixed condition through one or more sequential moves before the turn resolves. We term games featuring this mechanism Bounded One-Sided Response Games (BORGs). We introduce a modified version of Monopoly Deal as a benchmark environment that specifically isolates the BORG dynamic, where a Rent action forces the opponent to sequentially choose payment assets. We demonstrate that the gold-standard algorithm, Counterfactual Regret Minimization (CFR), successfully converges on effective strategies for this domain without requiring novel algorithmic extensions. To support efficient, reproducible experimentation, we present a lightweight, full-stack research platform that unifies the environment, a parallelized CFR runtime, and a human-playable web interface, all runnable on a single workstation. This system provides a practical foundation for exploring state representation and policy learning in bounded one-sided response settings. The trained CFR agent and source code are available at this https URL.
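
论文采用的经典算法是反事实遗憾最小化(CFR),其核心是“遗憾匹配”(regret matching)。下面给出一个单决策点的极简示意(两动作、固定反事实价值),仅用于说明遗憾累计与平均策略收敛,并非完整的博弈树实现:

```python
import numpy as np

def regret_matching(cum_regret):
    """CFR 核心:由累计正遗憾得到当前策略;全为非正时退化为均匀策略。"""
    pos = np.maximum(cum_regret, 0.0)
    s = pos.sum()
    return pos / s if s > 0 else np.full(len(cum_regret), 1.0 / len(cum_regret))

cum_regret = np.zeros(2)
cum_strategy = np.zeros(2)
for t in range(1000):
    sigma = regret_matching(cum_regret)
    cf_values = np.array([1.0, 0.0])          # 假设动作 0 恒优(仅作演示)
    node_value = sigma @ cf_values
    cum_regret += cf_values - node_value      # 累计即时遗憾
    cum_strategy += sigma
print(cum_strategy / cum_strategy.sum())      # 平均策略收敛到约 [1, 0]
```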
zh

[AI-56] Reasoning-Aware GRPO using Process Mining

【速读】:该论文旨在解决当前基于强化学习(Reinforcement Learning, RL)的后训练方法在大型推理模型(Large Reasoning Models, LRMs)中依赖结果导向型奖励信号、忽视推理过程质量的问题。现有方法通常仅根据最终答案或格式给予奖励,导致模型可能通过捷径获得高分,而缺乏真正的多步推理能力。解决方案的关键在于提出一种面向推理过程的组相对策略优化(Reasoning-aware Group Relative Policy Optimization, PM4GRPO),其核心创新是引入基于流程挖掘(Process Mining)的技术,计算一个标量合规性奖励(conformance reward),用以衡量策略模型的推理路径与预训练教师模型之间的对齐程度。该机制将推理过程信息融入奖励设计,从而显著提升模型的推理能力,实验表明PM4GRPO在五个基准测试上均优于传统GRPO方法。

链接: https://arxiv.org/abs/2510.25065
作者: Taekhyun Park,Yongjae Lee,Hyerim Bae
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL)-based post-training has been crucial for enabling multi-step reasoning in large reasoning models (LRMs), yet current reward schemes are typically outcome-centric. We propose PM4GRPO, a reasoning-aware Group Relative Policy Optimization (GRPO) that augments standard answer/format rewards with signals over the reasoning procedure. To this end, process mining techniques are utilized to compute a scalar conformance reward that measures how closely a policy model’s reasoning aligns with the pretrained teacher model. The empirical results on five benchmarks demonstrate that PM4GRPO significantly outperforms existing methodologies for GRPO-based post-training. These results highlight that leveraging process mining for reasoning-aware GRPO effectively enhances the reasoning capabilities of policy models.
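
下面示意 PM4GRPO 的奖励组合思路:在答案/格式奖励之上叠加由流程挖掘得到的标量合规性奖励,再做 GRPO 的组内相对标准化得到优势。权重 lam 与各奖励取值均为演示假设:

```python
import numpy as np

def pm4grpo_advantages(answer_r, format_r, conformance_r, lam=0.5, eps=1e-8):
    """示意 PM4GRPO 的奖励组合与 GRPO 组内相对优势。
    conformance_r 假定是流程挖掘得到的 [0,1] 合规性分数;lam 为演示用权重。"""
    r = np.asarray(answer_r) + np.asarray(format_r) + lam * np.asarray(conformance_r)
    return (r - r.mean()) / (r.std() + eps)   # 组内标准化 -> 每条 rollout 的优势

# 同一问题的 4 条 rollout:答案对错、格式分、与教师推理的合规程度
adv = pm4grpo_advantages(answer_r=[1, 0, 1, 0],
                         format_r=[0.1, 0.1, 0.0, 0.1],
                         conformance_r=[0.9, 0.4, 0.3, 0.8])
print(adv)
```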
zh

[AI-57] Scalable predictive processing framework for multitask caregiving robots

【速读】:该论文旨在解决当前自主护理机器人系统因任务特异性与依赖人工特征工程而导致泛化能力不足的问题。其解决方案的关键在于引入一种基于自由能原理的分层多模态循环神经网络,该模型通过预测加工机制直接整合超过30,000维的视觉-本体感觉输入,无需降维处理,并能自主学习两种典型照护任务(刚体重新定位和柔性毛巾擦拭),实现无需任务特定特征工程的端到端学习。此方法不仅实现了层次隐变量动态的自组织以调节任务切换、捕捉不确定性变化并推断遮挡状态,还展现出对视觉退化的鲁棒性以及多任务学习中的不对称干扰特性,为构建具备灵活适应能力的自主护理机器人提供了可扩展的计算范式。

链接: https://arxiv.org/abs/2510.25053
作者: Hayato Idei,Tamon Miyake,Tetsuya Ogata,Yuichi Yamashita
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:The rapid aging of societies is intensifying demand for autonomous care robots; however, most existing systems are task-specific and rely on handcrafted preprocessing, limiting their ability to generalize across diverse scenarios. A prevailing theory in cognitive neuroscience proposes that the human brain operates through hierarchical predictive processing, which underlies flexible cognition and behavior by integrating multimodal sensory signals. Inspired by this principle, we introduce a hierarchical multimodal recurrent neural network grounded in predictive processing under the free-energy principle, capable of directly integrating over 30,000-dimensional visuo-proprioceptive inputs without dimensionality reduction. The model was able to learn two representative caregiving tasks, rigid-body repositioning and flexible-towel wiping, without task-specific feature engineering. We demonstrate three key properties: (i) self-organization of hierarchical latent dynamics that regulate task transitions, capture variability in uncertainty, and infer occluded states; (ii) robustness to degraded vision through visuo-proprioceptive integration; and (iii) asymmetric interference in multitask learning, where the more variable wiping task had little influence on repositioning, whereas learning the repositioning task led to a modest reduction in wiping performance, while the model maintained overall robustness. Although the evaluation was limited to simulation, these results establish predictive processing as a universal and scalable computational principle, pointing toward robust, flexible, and autonomous caregiving robots while offering theoretical insight into the human brain’s ability to achieve flexible adaptation in uncertain real-world environments.
zh

[AI-58] Towards Human-AI Synergy in Requirements Engineering: A Framework and Preliminary Study

【速读】:该论文旨在解决传统需求工程(Requirements Engineering, RE)中因依赖人工、易出错且效率低下所带来的挑战,特别是在处理半结构化和非结构化数据时的局限性。其解决方案的关键在于提出“人机协同需求工程模型”(Human-AI RE Synergy Model, HARE-SM),该模型通过将生成式 AI(Generative AI)、大语言模型(Large Language Models, LLMs)与自然语言处理(Natural Language Processing, NLP)技术融入需求获取、分析与验证流程,并结合人类专家的监督与判断,实现高效、可解释、公平且符合伦理的智能辅助决策机制。

链接: https://arxiv.org/abs/2510.25016
作者: Mateen Ahmed Abbasi,Petri Ihantola,Tommi Mikkonen,Niko Mäkitalo
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: Accepted at the 2025 Sixth International Conference on Intelligent Data Science Technologies and Applications (IDSTA 2025), 8 pages, 4 figures. Published in IEEE

点击查看摘要

Abstract:The future of Requirements Engineering (RE) is increasingly driven by artificial intelligence (AI), reshaping how we elicit, analyze, and validate requirements. Traditional RE is based on labor-intensive manual processes prone to errors and complexity. AI-powered approaches, specifically large language models (LLMs), natural language processing (NLP), and generative AI, offer transformative solutions and reduce inefficiencies. However, the use of AI in RE also brings challenges like algorithmic bias, lack of explainability, and ethical concerns related to automation. To address these issues, this study introduces the Human-AI RE Synergy Model (HARE-SM), a conceptual framework that integrates AI-driven analysis with human oversight to improve requirements elicitation, analysis, and validation. The model emphasizes ethical AI use through transparency, explainability, and bias mitigation. We outline a multi-phase research methodology focused on preparing RE datasets, fine-tuning AI models, and designing collaborative human-AI workflows. This preliminary study presents the conceptual framework and early-stage prototype implementation, establishing a research agenda and practical design direction for applying intelligent data science techniques to semi-structured and unstructured RE data in collaborative environments.
zh

[AI-59] Aligning Large Language Models with Procedural Rules: An Autoregressive State-Tracking Prompting for In-Game Trading

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在规则驱动的游戏中无法遵循关键交易流程(如浏览-报价-审核-确认)的问题,这导致玩家对系统信任度下降。其核心解决方案是提出自回归状态追踪提示(Autoregressive State-Tracking Prompting, ASTP),该方法通过设计结构化提示迫使LLM显式地识别并报告前一回合的状态标签,从而确保状态跟踪的可验证性;同时引入基于状态的占位符后处理机制以保障价格计算的准确性。实验表明,ASTP在300次交易对话中实现了99%的状态合规率和99.3%的计算精度,并且在较小模型(Gemini-2.5-Flash)上即可达到与大模型(Gemini-2.5-Pro)相当的性能,响应时间从21.2秒降至2.4秒,兼顾了实时性和资源效率。

链接: https://arxiv.org/abs/2510.25014
作者: Minkyung Kim,Junsik Kim,Woongcheol Yang,Sangdon Park,Sohee Bae
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages main content, 18 pages supplementary material, 4 figures

点击查看摘要

Abstract:Large Language Models (LLMs) enable dynamic game interactions but fail to follow essential procedural flows in rule-governed trading systems, eroding player trust. This work resolves the core tension between the creative flexibility of LLMs and the procedural demands of in-game trading (browse-offer-review-confirm). To this end, Autoregressive State-Tracking Prompting (ASTP) is introduced, a methodology centered on a strategically orchestrated prompt that compels an LLM to make its state-tracking process explicit and verifiable. Instead of relying on implicit contextual understanding, ASTP tasks the LLM with identifying and reporting a predefined state label from the previous turn. To ensure transactional integrity, this is complemented by a state-specific placeholder post-processing method for accurate price calculations. Evaluation across 300 trading dialogues demonstrates 99% state compliance and 99.3% calculation precision. Notably, ASTP with placeholder post-processing on smaller models (Gemini-2.5-Flash) matches larger models’ (Gemini-2.5-Pro) performance while reducing response time from 21.2s to 2.4s, establishing a practical foundation that satisfies both real-time requirements and resource constraints of commercial games.
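
下面给出 ASTP 思路的一个草图:提示模板强制模型先输出上一轮状态标签,价格只写占位符,再由基于状态的后处理精确填入计算结果。状态标签集合与提示措辞均为演示假设,并非论文原版提示:

```python
import re

ASTP_PROMPT = """你是游戏商人。对话必须遵循 浏览->报价->审核->确认 流程。
先输出上一轮状态标签,格式: [STATE: BROWSE|OFFER|REVIEW|CONFIRM]
价格一律写占位符 {TOTAL},不要自行计算数字。
上一轮状态: {prev_state}
玩家: {player_utterance}
"""

def postprocess(reply: str, unit_price: int, qty: int) -> str:
    """状态相关的占位符后处理:仅在报价/审核状态下填入精确计算的价格。"""
    m = re.search(r"\[STATE:\s*(\w+)\]", reply)
    state = m.group(1) if m else "BROWSE"
    if state in ("OFFER", "REVIEW"):
        reply = reply.replace("{TOTAL}", str(unit_price * qty))
    return reply

raw = "[STATE: OFFER] 3 把铁剑一共 {TOTAL} 金币,确认购买吗?"
print(postprocess(raw, unit_price=120, qty=3))  # -> ... 360 金币 ...
```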
zh

[AI-60] Taming the Real-world Complexities in CPT E/M Coding with Large Language Models EMNLP2025

【速读】:该论文旨在解决临床诊疗编码(Evaluation and Management, E/M)自动化中的现实复杂性问题,以减轻医师的文档负担、提升收费效率并改善患者护理质量。其解决方案的关键在于提出ProFees框架,这是一个基于大语言模型(Large Language Model, LLM)的系统,能够有效应对真实世界中E/M编码任务的多维度挑战,如语义歧义、结构化与非结构化文本混合、以及对专业医学知识的高精度理解。在专家标注的真实世界数据集上,ProFees相较于商用CPT E/M编码系统实现了超过36%的准确率提升,显著优于单一提示基准模型,验证了其在处理实际医疗场景复杂性方面的有效性。

链接: https://arxiv.org/abs/2510.25007
作者: Islam Nassar,Yang Lin,Yuan Jin,Rongxin Zhu,Chang Wei Tan,Zenan Zhai,Nitika Mathur,Thanh Tien Vu,Xu Zhong,Long Duong,Yuan-Fang Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: EMNLP 2025 Industry Track

点击查看摘要

Abstract:Evaluation and Management (E/M) coding, under the Current Procedural Terminology (CPT) taxonomy, documents medical services provided to patients by physicians. Used primarily for billing purposes, it is in physicians’ best interest to provide accurate CPT E/M codes. Automating this coding task will help alleviate physicians’ documentation burden, improve billing efficiency, and ultimately enable better patient care. However, a number of real-world complexities have made E/M encoding automation a challenging task. In this paper, we elaborate some of the key complexities and present ProFees, our LLM-based framework that tackles them, followed by a systematic evaluation. On an expert-curated real-world dataset, ProFees achieves an increase in coding accuracy of more than 36% over a commercial CPT E/M coding system and almost 5% over our strongest single-prompt baseline, demonstrating its effectiveness in addressing the real-world complexities.
zh

[AI-61] Cyclic Counterfactuals under Shift-Scale Interventions NEURIPS2025

【速读】:该论文旨在解决传统反事实推理框架在处理具有反馈环或循环依赖关系的系统时所面临的局限性,这些问题在生物系统等现实世界场景中普遍存在。传统方法通常基于无环结构因果模型(acyclic structural causal models, SCMs),即有向无环图(DAGs),无法有效建模循环机制。本文的关键解决方案是扩展反事实推理框架至循环SCMs,并针对移位-缩放干预(shift-scale interventions)进行研究,这类干预模拟了软性、策略型的变化,能够重新调整变量的机制参数,从而在保持因果结构完整性的同时实现对复杂循环系统的精确反事实推断。

链接: https://arxiv.org/abs/2510.25005
作者: Saptarshi Saha,Dhruv Vansraj Rathore,Utpal Garain
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
备注: Accepted at NeurIPS 2025

点击查看摘要

Abstract:Most counterfactual inference frameworks traditionally assume acyclic structural causal models (SCMs), i.e. directed acyclic graphs (DAGs). However, many real-world systems (e.g. biological systems) contain feedback loops or cyclic dependencies that violate acyclicity. In this work, we study counterfactual inference in cyclic SCMs under shift-scale interventions, i.e., soft, policy-style changes that rescale and/or shift a variable’s mechanism.
zh

[AI-62] Epileptic Seizure Detection and Prediction from EEG Data: A Machine Learning Approach with Clinical Validation

【速读】:该论文旨在解决癫痫患者在传统医疗实践中仅能实现发作后检测、难以进行早期干预的问题,从而推动从被动应对向主动预防的转变。其解决方案的关键在于提出一种融合实时检测与预测的新型机器学习框架:一方面利用监督学习算法(如逻辑回归、随机森林和SVM)对已发生的癫痫发作进行高精度识别,其中逻辑回归模型在90.9%检测准确率下实现了89.6%的召回率,展现出临床筛查所需的平衡性能;另一方面引入长短期记忆(Long Short-Term Memory, LSTM)网络,挖掘脑电图(EEG)数据中的时序特征以实现发作前预测,LSTM模型达到89.26%的预测准确率,验证了通过深度学习捕捉潜在预警信号的可能性。这一双阶段方法标志着癫痫管理从反应式转向前瞻性策略的重要进展。

链接: https://arxiv.org/abs/2510.24986
作者: Ria Jayanti,Tanish Jain
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 3 figures

点击查看摘要

Abstract:In recent years, machine learning has become an increasingly powerful tool for supporting seizure detection and monitoring in epilepsy care. Traditional approaches focus on identifying seizures only after they begin, which limits the opportunity for early intervention and proactive treatment. In this study, we propose a novel approach that integrates both real-time seizure detection and prediction, aiming to capture subtle temporal patterns in EEG data that may indicate an upcoming seizure. Our approach was evaluated using the CHB-MIT Scalp EEG Database, which includes 969 hours of recordings and 173 seizures collected from 23 pediatric and young adult patients with drug-resistant epilepsy. To support seizure detection, we implemented a range of supervised machine learning algorithms, including K-Nearest Neighbors, Logistic Regression, Random Forest, and Support Vector Machine. The Logistic Regression achieved 90.9% detection accuracy with 89.6% recall, demonstrating balanced performance suitable for clinical screening. Random Forest and Support Vector Machine models achieved higher accuracy (94.0%) but with 0% recall, failing to detect any seizures, illustrating that accuracy alone is insufficient for evaluating medical ML models with class imbalance. For seizure prediction, we employed Long Short-Term Memory (LSTM) networks, which use deep learning to model temporal dependencies in EEG data. The LSTM model achieved 89.26% prediction accuracy. These results highlight the potential of developing accessible, real-time monitoring tools that not only detect seizures as traditionally done, but also predict them before they occur. This ability to predict seizures marks a significant shift from reactive seizure management to a more proactive approach, allowing patients to anticipate seizures and take precautionary measures to reduce the risk of injury or other complications.
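
摘要强调了类不平衡下“高准确率但零召回”的陷阱,下面用 scikit-learn 在合成数据上复现这一现象(数据为随机生成,仅作示意,并非 CHB-MIT 数据):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# 构造约 94:6 的类不平衡数据, 模拟"发作窗口稀少"的场景(仅作示意)
rng = np.random.default_rng(0)
n = 5000
y = (rng.random(n) < 0.06).astype(int)
X = rng.normal(size=(n, 20)) + y[:, None] * 0.8   # 正类特征整体偏移

Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(Xtr, ytr)
pred = clf.predict(Xte)
print("accuracy:", accuracy_score(yte, pred))
print("recall  :", recall_score(yte, pred))  # 医学筛查更应关注召回率

# 对照:恒预测"无发作"的平凡分类器 accuracy ~ 0.94 但 recall = 0
print("trivial accuracy:", accuracy_score(yte, np.zeros_like(yte)))
```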
zh

[AI-63] FaRAccel: FPGA-Accelerated Defense Architecture for Efficient Bit-Flip Attack Resilience in Transformer Models

【速读】:该论文旨在解决生成式 AI (Generative AI) 中基于 Transformer 的模型在面对位翻转攻击(Bit-Flip Attacks, BFAs)时的脆弱性问题,同时克服现有防御方法“遗忘与重连”(Forget and Rewire, FaR)因运行时动态重配置线性层激活路径所导致的显著性能损耗和内存开销。其解决方案的关键在于提出 FaRAccel——一种基于 FPGA 的专用硬件加速架构,通过集成可重构逻辑实现动态激活重路由,并采用轻量级存储机制管理重连配置,从而在保持 FaR 原有鲁棒性优势的前提下,大幅降低推理延迟并提升能效,首次实现了对 Transformer 模型中 BFAs 的高效硬件级防御。

链接: https://arxiv.org/abs/2510.24985
作者: Najmeh Nazari,Banafsheh Saber Latibari,Elahe Hosseini,Fatemeh Movafagh,Chongzhou Fang,Hosein Mohammadi Makrani,Kevin Immanuel Gubbi,Abhijit Mahalanobis,Setareh Rafatirad,Hossein Sayadi,Houman Homayoun
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted By ICCD 2025

点击查看摘要

Abstract:Forget and Rewire (FaR) methodology has demonstrated strong resilience against Bit-Flip Attacks (BFAs) on Transformer-based models by obfuscating critical parameters through dynamic rewiring of linear layers. However, the application of FaR introduces non-negligible performance and memory overheads, primarily due to the runtime modification of activation pathways and the lack of hardware-level optimization. To overcome these limitations, we propose FaRAccel, a novel hardware accelerator architecture implemented on FPGA, specifically designed to offload and optimize FaR operations. FaRAccel integrates reconfigurable logic for dynamic activation rerouting, and lightweight storage of rewiring configurations, enabling low-latency inference with minimal energy overhead. We evaluate FaRAccel across a suite of Transformer models and demonstrate substantial reductions in FaR inference latency and improvement in energy efficiency, while maintaining the robustness gains of the original FaR methodology. To the best of our knowledge, this is the first hardware-accelerated defense against BFAs in Transformers, effectively bridging the gap between algorithmic resilience and efficient deployment on real-world AI platforms.
zh

[AI-64] LRT-Diffusion: Calibrated Risk-Aware Guidance for Diffusion Policies

【速读】:该论文旨在解决离线强化学习(Offline Reinforcement Learning, Offline RL)中扩散策略(Diffusion Policies)在采样时依赖启发式引导、缺乏统计风险控制的问题。现有方法通常以固定方式引导去噪过程,无法量化和管理决策中的风险水平,导致在分布外(Out-of-Distribution, OOD)行为时可能产生不可控的性能下降。解决方案的关键在于提出一种风险感知的采样规则——LRT-Diffusion,其核心思想是将每一步去噪视为一个顺序假设检验(Sequential Hypothesis Test),在无条件先验(unconditional prior)与状态条件策略头(state-conditional policy head)之间累积对数似然比(log-likelihood ratio),并通过逻辑控制器(logistic controller)根据预设显著性水平 α(Type-I error rate)设定阈值 τ 来门控条件均值。这一机制将引导从固定偏置转变为基于证据的风险驱动调整,并提供可解释的风险预算(risk budget)。此外,LRT指导可自然地与Q梯度结合,支持从利用到保守的不同策略混合,从而在保持高回报的同时有效控制OOD行为的发生概率。

链接: https://arxiv.org/abs/2510.24983
作者: Ximan Sun,Xiang Cheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion policies are competitive for offline reinforcement learning (RL) but are typically guided at sampling time by heuristics that lack a statistical notion of risk. We introduce LRT-Diffusion, a risk-aware sampling rule that treats each denoising step as a sequential hypothesis test between the unconditional prior and the state-conditional policy head. Concretely, we accumulate a log-likelihood ratio and gate the conditional mean with a logistic controller whose threshold tau is calibrated once under H0 to meet a user-specified Type-I level alpha. This turns guidance from a fixed push into an evidence-driven adjustment with a user-interpretable risk budget. Importantly, we deliberately leave training vanilla (two heads with standard epsilon-prediction) under the structure of DDPM. LRT guidance composes naturally with Q-gradients: critic-gradient updates can be taken at the unconditional mean, at the LRT-gated mean, or a blend, exposing a continuum from exploitation to conservatism. We standardize states and actions consistently at train and test time and report a state-conditional out-of-distribution (OOD) metric alongside return. On D4RL MuJoCo tasks, LRT-Diffusion improves the return-OOD trade-off over strong Q-guided baselines in our implementation while honoring the desired alpha. Theoretically, we establish level-alpha calibration, concise stability bounds, and a return comparison showing when LRT surpasses Q-guidance, especially when off-support errors dominate. Overall, LRT-Diffusion is a drop-in, inference-time method that adds principled, calibrated risk control to diffusion policies for offline RL.
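
下面是 LRT 门控采样单步的示意实现(各向同性高斯假设):累计条件与无条件均值下的对数似然比,经 logistic 控制器得到门控系数,再在无条件均值与条件均值之间插值。阈值 tau、尺度 s 等取值均为演示假设;按论文描述,tau 应在 H0 下按 Type-I 水平 alpha 校准:

```python
import numpy as np

def lrt_gate_step(x_t, mu_c, mu_u, sigma, llr, tau, s=1.0):
    """单步 LRT 门控(各向同性高斯假设下的示意实现)。
    mu_c / mu_u: 条件 / 无条件去噪均值;llr: 累计对数似然比。"""
    def log_gauss(x, mu):
        return -0.5 * np.sum((x - mu) ** 2) / sigma**2
    llr = llr + log_gauss(x_t, mu_c) - log_gauss(x_t, mu_u)
    g = 1.0 / (1.0 + np.exp(-(llr - tau) / s))    # logistic 控制器
    mu = mu_u + g * (mu_c - mu_u)                 # 证据不足时退回无条件均值
    return mu, llr

x, llr = np.zeros(4), 0.0
for t in range(10):
    mu_c, mu_u = np.full(4, 0.3), np.zeros(4)
    mu, llr = lrt_gate_step(x, mu_c, mu_u, sigma=1.0, llr=llr, tau=2.0)
    x = mu + 0.1 * np.random.default_rng(t).normal(size=4)
print(round(llr, 3))
```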
zh

[AI-65] Hammering the Diagnosis: Rowhammer-Induced Stealthy Trojan Attacks on ViT-Based Medical Imaging

【速读】:该论文旨在解决视觉Transformer(Vision Transformer, ViT)在医疗影像分析系统中因硬件级攻击导致模型完整性受损的问题,尤其是针对由Rowhammer硬件故障注入与神经后门(neural Trojan)攻击结合所引发的隐蔽性恶意行为。其解决方案的关键在于提出一种名为Med-Hammer的新威胁模型:通过利用Rowhammer诱导的内存位翻转来激活嵌入在ViT模型中的神经后门,从而实现对关键医学诊断(如肿瘤或病变)的定向误分类或抑制,且在多个基准数据集上实现了高达82.51%至92.56%的攻击成功率,同时保持高度隐蔽性。

链接: https://arxiv.org/abs/2510.24976
作者: Banafsheh Saber Latibari,Najmeh Nazari,Hossein Sayadi,Houman Homayoun,Abhijit Mahalanobis
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted, ICCD 2025

点击查看摘要

Abstract:Vision Transformers (ViTs) have emerged as powerful architectures in medical image analysis, excelling in tasks such as disease detection, segmentation, and classification. However, their reliance on large, attention-driven models makes them vulnerable to hardware-level attacks. In this paper, we propose a novel threat model referred to as Med-Hammer that combines the Rowhammer hardware fault injection with neural Trojan attacks to compromise the integrity of ViT-based medical imaging systems. Specifically, we demonstrate how malicious bit flips induced via Rowhammer can trigger implanted neural Trojans, leading to targeted misclassification or suppression of critical diagnoses (e.g., tumors or lesions) in medical scans. Through extensive experiments on benchmark medical imaging datasets such as ISIC, Brain Tumor, and MedMNIST, we show that such attacks can remain stealthy while achieving high attack success rates of about 82.51% and 92.56% in MobileViT and SwinTransformer, respectively. We further investigate how architectural properties, such as model sparsity, attention weight distribution, and the number of features of the layer, impact attack effectiveness. Our findings highlight a critical and underexplored intersection between hardware-level faults and deep learning security in healthcare applications, underscoring the urgent need for robust defenses spanning both model architectures and underlying hardware platforms.
zh

[AI-66] KAN-GCN: Combining Kolmogorov-Arnold Network with Graph Convolution Network for an Accurate Ice Sheet Emulator NEURIPS2025

【速读】:该论文旨在解决冰盖模拟(ice sheet modeling)中数值模型计算成本高、效率低的问题,尤其是在大规模瞬态情景扫描(transient scenario sweeps)时对高效且准确的代理模型(emulator)的需求。解决方案的关键在于提出KAN-GCN架构——在图卷积网络(GCN)前引入Kolmogorov-Arnold Network(KAN)作为特征级校准器(feature-wise calibrator),通过可学习的一维映射(learnable one-dimensional warps)和线性混合步骤,增强特征条件化与非线性编码能力,同时不增加消息传递深度;该设计在保持精度的同时提升了推理吞吐量,尤其在粗网格下通过将边级消息传递层替换为节点级变换实现效率优化,从而在准确性和计算效率之间取得更优平衡。

链接: https://arxiv.org/abs/2510.24926
作者: Zesheng Liu,YoungHyun Koo,Maryam Rahnemoonfar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
备注: Accept for NeurIPS 2025 Workshop: New Perspectives in Graph Machine Learning

点击查看摘要

Abstract:We introduce KAN-GCN, a fast and accurate emulator for ice sheet modeling that places a Kolmogorov-Arnold Network (KAN) as a feature-wise calibrator before graph convolution networks (GCNs). The KAN front end applies learnable one-dimensional warps and a linear mixing step, improving feature conditioning and nonlinear encoding without increasing message-passing depth. We employ this architecture to improve the performance of emulators for numerical ice sheet models. Our emulator is trained and tested using 36 melting-rate simulations with 3 mesh-size settings for Pine Island Glacier, Antarctica. Across 2- to 5-layer architectures, KAN-GCN matches or exceeds the accuracy of pure GCN and MLP-GCN baselines. Despite a small parameter overhead, KAN-GCN improves inference throughput on coarser meshes by replacing one edge-wise message-passing layer with a node-wise transform; only the finest mesh shows a modest cost. Overall, KAN-first designs offer a favorable accuracy vs. efficiency trade-off for large transient scenario sweeps.
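
下面示意“KAN 前置校准器”的结构:对每个输入特征做可学习的一维弯曲,再经线性混合送入下游 GCN。此处用固定基函数的可学习组合近似一维样条,并非论文中的原版 KAN 实现:

```python
import torch
import torch.nn as nn

class KANCalibrator(nn.Module):
    """示意"KAN 前置校准器":逐特征的可学习 1D warp + 线性混合。"""
    def __init__(self, in_dim: int):
        super().__init__()
        self.coef = nn.Parameter(torch.randn(in_dim, 4) * 0.1)  # 4 个固定基函数的系数
        self.mix = nn.Linear(in_dim, in_dim)

    def forward(self, x):                      # x: (N, in_dim)
        basis = torch.stack([x, torch.tanh(x), torch.sin(x), x**2], dim=-1)
        warped = (basis * self.coef).sum(-1)   # 每维独立的 1D warp
        return self.mix(warped)                # 线性混合后交给下游 GCN 层

# 用法:h = KANCalibrator(F)(node_features);再送入任意 GCN
h = KANCalibrator(8)(torch.randn(5, 8))
print(h.shape)  # torch.Size([5, 8])
```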
zh

[AI-67] Trust Dynamics in Strategic Coopetition: Computational Foundations for Requirements Engineering in Multi-Agent Systems

【速读】:该论文旨在解决多利益相关者环境下需求工程中信任动态演化建模的难题,即如何在合作与竞争并存(coopetition)的情境下,将定性概念模型(如i语言)与可计算的信任更新机制相结合。其关键解决方案是提出一个双层计算信任模型:第一层为即时信任(immediate trust),根据当前行为实时调整;第二层为声誉(reputation),记录违规历史。该模型引入非对称更新机制——合作行为缓慢积累信任,而违规行为则迅速削弱信任,从而产生滞后效应(hysteresis)和信任上限,限制关系修复的可能性。此外,作者构建了从i依赖网络到计算信任模型的结构化映射框架,并通过78,125组参数配置的实验验证了负面偏差、滞后效应及累积损害放大等核心现象,最终在雷诺-日产联盟案例研究中实现了81.7%的验证准确率,成功复现了长达26年间的五阶段信任演变过程。

链接: https://arxiv.org/abs/2510.24909
作者: Vik Pant,Eric Yu
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 62 pages, 20 figures, This technical report is the second in a research program and should be read in conjunction with its foundational companion work arXiv:2510.18802 . It builds on the frameworks established in that prior work and also adapts and extends material on trustworthiness first presented in the doctoral dissertation ‘Modeling Strategic Coopetition’ (Pant, 2021, University of Toronto)

点击查看摘要

Abstract:Requirements engineering increasingly occurs in multi-stakeholder environments where organizations simultaneously cooperate and compete, creating coopetitive relationships in which trust evolves dynamically based on observed behavior over repeated interactions. While conceptual modeling languages like i* represent trust relationships qualitatively, they lack computational mechanisms for analyzing how trust changes with behavioral evidence. Conversely, computational trust models from multi-agent systems provide algorithmic updating but lack grounding in requirements engineering contexts and conceptual models. This technical report bridges this gap by developing a computational trust model that extends game-theoretic foundations for strategic coopetition with dynamic trust evolution. We introduce trust as a two-layer system with immediate trust responding to current behavior and reputation tracking violation history. Trust evolves through asymmetric updating where cooperation builds trust gradually while violations erode it sharply, creating hysteresis effects and trust ceilings that constrain relationship recovery. We develop a structured translation framework enabling requirements engineers to instantiate computational trust models from i* dependency networks and organizational contexts. Comprehensive experimental validation across 78,125 parameter configurations establishes robust emergence of negativity bias, hysteresis effects, and cumulative damage amplification. Empirical validation using the Renault-Nissan Alliance case study (1999-2025) achieves 49 out of 60 validation points (81.7%), successfully reproducing documented trust evolution across five distinct relationship phases including crisis and recovery periods. This technical report builds upon its foundational companion work in arXiv:2510.18802.
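
下面用一个极简类示意双层信任模型的三个现象:非对称更新(合作缓慢加分、违规急剧扣分)、滞后效应与由违规历史压低的信任上限。各参数取值均为演示假设,并非论文校准值:

```python
class TrustModel:
    """双层信任的极简示意:即时信任非对称更新 + 违规历史压低可恢复上限。"""
    def __init__(self, trust=0.5, gain=0.02, loss=0.25, ceiling_drop=0.1):
        self.trust, self.violations = trust, 0
        self.gain, self.loss, self.ceiling_drop = gain, loss, ceiling_drop

    def ceiling(self):                         # 违规越多, 可恢复的上限越低(滞后效应)
        return max(0.2, 1.0 - self.ceiling_drop * self.violations)

    def update(self, cooperated: bool):
        if cooperated:
            self.trust = min(self.ceiling(), self.trust + self.gain)  # 缓慢累积
        else:
            self.violations += 1
            self.trust = max(0.0, self.trust - self.loss)             # 急剧流失
        return self.trust

m = TrustModel()
for _ in range(20): m.update(True)
m.update(False)                                # 一次违规抵消多轮合作
for _ in range(40): m.update(True)
print(round(m.trust, 2), "ceiling:", m.ceiling())  # 信任被上限钳制, 无法完全恢复
```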
zh

[AI-68] Fair Indivisible Payoffs through Shapley Value

【速读】:该论文试图解决在不可分联盟博弈(indivisible coalitional games)中如何公平分配收益的问题,其中联盟的总价值是一个自然数,代表不可分对象(如议会席位、肾脏交换或机器学习模型中的关键特征)的数量。解决方案的关键在于提出并定义了“不可分夏普利值”(indivisible Shapley value),该方法基于夏普利值的思想但适配于不可分资源场景,并通过三个案例研究验证其有效性,特别是在图像分类任务中用于识别图像中对结果贡献最大的关键区域。

链接: https://arxiv.org/abs/2510.24906
作者: Mikołaj Czarnecki,Michał Korniak,Oskar Skibski,Piotr Skowron
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We consider the problem of payoff division in indivisible coalitional games, where the value of the grand coalition is a natural number. This number represents a certain quantity of indivisible objects, such as parliamentary seats, kidney exchanges, or top features contributing to the outcome of a machine learning model. The goal of this paper is to propose a fair method for dividing these objects among players. To achieve this, we define the indivisible Shapley value and study its properties. We demonstrate our proposed technique using three case studies, in particular, we use it to identify key regions of an image in the context of an image classification task.
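
下面给出小规模博弈上精确 Shapley 值的计算,并用最大余额法把 m 个不可分对象取整分配;取整规则为演示假设,论文中 indivisible Shapley value 的正式定义以原文为准:

```python
from itertools import permutations

def shapley(players, v):
    """小规模联盟博弈的精确 Shapley 值(对所有排列的边际贡献取平均)。"""
    phi = {p: 0.0 for p in players}
    perms = list(permutations(players))
    for order in perms:
        coal = frozenset()
        for p in order:
            phi[p] += v(coal | {p}) - v(coal)
            coal = coal | {p}
    return {p: x / len(perms) for p, x in phi.items()}

def allocate_indivisible(phi, m):
    """按 Shapley 值分配 m 个不可分对象:先取整, 再按小数余额分剩余(最大余额法)。"""
    share = {p: phi[p] / sum(phi.values()) * m for p in phi}
    alloc = {p: int(share[p]) for p in phi}
    rest = sorted(phi, key=lambda p: share[p] - alloc[p], reverse=True)
    for p in rest[: m - sum(alloc.values())]:
        alloc[p] += 1
    return alloc

v = lambda S: {0: 0, 1: 2, 2: 5, 3: 9}[len(S)]   # 对称博弈, 总价值 9 个不可分单位
phi = shapley([1, 2, 3], v)
print(phi, allocate_indivisible(phi, 9))          # 每人 3.0 -> 各得 3 个
```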
zh

[AI-69] Efficiency Without Cognitive Change: Evidence from Human Interaction with Narrow AI Systems

【速读】:该论文试图解决的核心问题是:当前窄域人工智能(narrow AI)工具的短期使用是否仅提升任务执行效率,还是实质性地改变人类的认知能力。研究通过为期七周的实验设计,将30名年轻参与者随机分配至AI辅助组(使用ChatGPT)与对照组,在干预前后进行标准化神经心理学评估。关键发现是:尽管AI辅助组在问题解决和语言理解任务中表现出更高的速度和准确性,但其在标准化认知能力指标上并无显著变化。这表明当前窄域AI系统主要作为认知支架(cognitive scaffolds),增强外显表现而不改变底层心理能力。解决方案的关键在于识别出“效率提升”与“认知重构”的区分,并呼吁建立伦理与教育框架以促进在日益AI增强的认知生态中保持批判性和自主性思维。

链接: https://arxiv.org/abs/2510.24893
作者: María Angélica Benítez,Rocío Candela Ceballos,Karina Del Valle Molina,Sofía Mundo Araujo,Sofía Evangelina Victorio Villaroel,Nadia Justel
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 30 pages, 8 figures. Preprint submitted for peer review (not yet accepted or published)

点击查看摘要

Abstract:The growing integration of artificial intelligence (AI) into human cognition raises a fundamental question: does AI merely improve efficiency, or does it alter how we think? This study experimentally tested whether short-term exposure to narrow AI tools enhances core cognitive abilities or simply optimizes task performance. Thirty young adults completed standardized neuropsychological assessments embedded in a seven-week protocol with a four-week online intervention involving problem-solving and verbal comprehension tasks, either with or without AI support (ChatGPT). While AI-assisted participants completed several tasks faster and more accurately, no significant pre-post differences emerged in standardized measures of problem solving or verbal comprehension. These results demonstrate efficiency gains without cognitive change, suggesting that current narrow AI systems serve as cognitive scaffolds extending performance without transforming underlying mental capacities. The findings highlight the need for ethical and educational frameworks that promote critical and autonomous thinking in an increasingly AI-augmented cognitive ecology.
zh

[AI-70] Scheduling Your LLM Reinforcement Learning with Reasoning Trees

【速读】:该论文旨在解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)优化大语言模型(Large Language Models, LLMs)时,现有数据调度方法因依赖路径级指标而忽略查询推理树(Reasoning Tree)结构所导致的数据效率与准确性不足的问题。其解决方案的关键在于提出一种新的结构感知指标——推理分数(Reasoning Score, r-score),用于量化查询的学习难度,并基于此设计了一种名为推理树调度(Reasoning Tree Schedule, Re-Schedule)的调度算法,该算法构建了一个从结构简单(高r-score)到复杂(低r-score)的渐进式课程(curriculum),从而显著提升了模型在六项数学推理基准上的平均准确率,最高达3.2%。

链接: https://arxiv.org/abs/2510.24832
作者: Hong Wang,Zhezheng Hao,Jian Luo,Chenxing Wei,Yao Shu,Lei Liu,Qiang Lin,Hande Dong,Jiawei Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Using Reinforcement Learning with Verifiable Rewards (RLVR) to optimize Large Language Models (LLMs) can be conceptualized as progressively editing a query’s ‘Reasoning Tree’. This process involves exploring nodes (tokens) and dynamically modifying the model’s policy at each node. When combined with data scheduling, this process yields further gains in data efficiency and accuracy. However, existing RLVR data scheduling methods typically rely on path-based metrics to rank queries, overlooking the reasoning tree structures of these queries. In this paper, we introduce a novel metric, namely Reasoning Score (r-score), which measures the query’s learning difficulty based on the structure of its reasoning tree. Based on the r-score, we propose the Reasoning Tree Schedule (Re-Schedule), a scheduling algorithm that constructs a curriculum progressing from structurally simple (high r-score) to complex (low r-score) queries. Experiments on six math-reasoning benchmarks show that Re-Schedule significantly improves average accuracy, achieving gains of up to 3.2%. These strong results validate our approach and demonstrate that a structural understanding of the reasoning tree provides a more powerful and principled foundation for RLVR data scheduling.
zh

[AI-71] he Narrative Continuity Test: A Conceptual Framework for Evaluating Identity Persistence in AI Systems

【速读】:该论文试图解决当前基于大语言模型(Large Language Models, LLMs)的人工智能系统在缺乏持续状态(persistent state)的情况下,无法维持身份一致性与时间维度上的语义连贯性的问题。其解决方案的关键在于提出“叙事连续性测试”(Narrative Continuity Test, NCT),这是一个用于评估AI系统在长时间交互中是否保持身份持久性和时序一致性的概念框架。NCT定义了五个必要维度:情境记忆(Situated Memory)、目标持续性(Goal Persistence)、自主自我修正能力(Autonomous Self-Correction)、风格与语义稳定性(Stylistic Semantic Stability)以及角色/人格连续性(Persona/Role Continuity),并指出现有架构因 Stateless Inference 机制而系统性地无法支持这些维度。该框架将AI评估焦点从任务性能转向长期身份与目标的一致性,为未来模型设计和基准测试提供了理论依据。

链接: https://arxiv.org/abs/2510.24831
作者: Stefano Natangelo
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 35 pages, 127 references

点击查看摘要

Abstract:Artificial intelligence systems based on large language models (LLMs) can now generate coherent text, music, and images, yet they operate without a persistent state: each inference reconstructs context from scratch. This paper introduces the Narrative Continuity Test (NCT) – a conceptual framework for evaluating identity persistence and diachronic coherence in AI systems. Unlike capability benchmarks that assess task performance, the NCT examines whether an LLM remains the same interlocutor across time and interaction gaps. The framework defines five necessary axes – Situated Memory, Goal Persistence, Autonomous Self-Correction, Stylistic Semantic Stability, and Persona/Role Continuity – and explains why current architectures systematically fail to support them. Case analyses (this http URL, Grok, Replit, Air Canada) show predictable continuity failures under stateless inference. The NCT reframes AI evaluation from performance to persistence, outlining conceptual requirements for future benchmarks and architectural designs that could sustain long-term identity and goal coherence in generative models.
zh

[AI-72] Do Chatbots Walk the Talk of Responsible AI?

【速读】:该论文试图解决的问题是:主流AI聊天机器人公司是否在其实际运营中落实了其公开倡导的负责任人工智能(Responsible AI)原则。研究通过混合方法,系统分析了四款主流聊天机器人(ChatGPT、Gemini、DeepSeek 和 Grok)在公司网站、技术文档及直接对话评估中的表现,发现企业宣传与实践之间存在显著差距。解决方案的关键在于采用多源数据交叉验证的方法,结合定性内容分析与定量交互测试,从而客观揭示企业在生成式 AI(Generative AI)产品开发与部署过程中对伦理承诺的兑现程度。

链接: https://arxiv.org/abs/2510.24823
作者: Susan Ariel Aaronson,Michael Moreno
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This study examines whether leading AI chatbot companies implement the responsible AI principles they publicly advocate. The authors used a mixed-methods approach analyzing four major chatbots (ChatGPT, Gemini, DeepSeek, and Grok) across company websites, technical documentation, and direct chatbot evaluations. We found significant gaps between corporate rhetoric and practice.
zh

[AI-73] MASPRM: Multi-Agent System Process Reward Model

【速读】:该论文旨在解决多智能体系统(Multi-Agent Systems, MAS)在推理时性能不足的问题,特别是在计算资源有限的情况下如何高效分配算力以提升任务完成质量。解决方案的关键在于提出一种可插拔的值模型——多智能体系统过程奖励模型(MASPRM),该模型通过从多智能体蒙特卡洛树搜索(MCTS)回放中学习,无需逐步人工标注即可为每个动作和每个智能体的局部交互片段分配价值,并作为推理时的控制器。MASPRM在推理阶段引导束搜索(beam search)与MCTS,聚焦于有前景的分支并提前剪枝,从而实现更可靠的、计算感知的多智能体推理。实验表明,该方法在GSM8K和MATH数据集上分别较单一直通式多智能体推理提升了30.7和22.9点精确匹配(EM),且在GSM8K上训练的MASPRM可零样本迁移至MATH任务,额外带来8.4点EM提升。

链接: https://arxiv.org/abs/2510.24803
作者: Milad Yazdani,Mahdi Mostajabdaveh,Zirui Zhou,Ying Xiong
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Practical deployment of Multi-Agent Systems (MAS) demands strong test-time performance, motivating methods that guide inference-time search and selectively spend compute to improve quality. We present the Multi-Agent System Process Reward Model (MASPRM). It assigns per-action, per-agent values to partial inter-agent transcripts and acts as an inference-time controller. MASPRM is trained from multi-agent Monte Carlo Tree Search (MCTS) rollouts without requiring step-level human annotations, by propagating returns to local targets. At inference, MASPRM guides step-level beam search and MCTS, focusing computation on promising branches and pruning early. On GSM8K and MATH, MASPRM-guided decoding with an outcome reward model (ORM) applied to the final answer, improves exact match (EM) over a single straight-through MAS pass by +30.7 and +22.9 points, respectively. A MASPRM trained on GSM8K transfers zero-shot to MATH without retraining, adding 8.4 EM points at the same budget. MASPRM is a plug-in value model that estimates per-agent progress and complements verifier-style decoders, enabling more reliable, compute-aware multi-agent reasoning. Code: this https URL
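
下面是“过程奖励模型引导推理时搜索”的骨架代码:每一步用 PRM 给部分轨迹打分,仅保留高分分支(步级 beam search),低分分支提前剪枝。expand 与 prm_score 的接口均为演示假设:

```python
def prm_beam_search(init_state, expand, prm_score, width=3, depth=4):
    """用过程奖励模型(如 MASPRM)在推理时引导的步级 beam search 骨架。
    expand(state) 返回候选下一步;prm_score(state) 给部分轨迹打分。"""
    beam = [(prm_score(init_state), init_state)]
    for _ in range(depth):
        cand = [(prm_score(s2), s2) for _, s in beam for s2 in expand(s)]
        if not cand:
            break
        beam = sorted(cand, key=lambda t: t[0], reverse=True)[:width]  # 剪枝低分支
    return beam[0][1]

# 玩具示例:状态是数字串, "PRM" 偏好数字和更大的前缀
expand = lambda s: [s + d for d in "0123"]
prm_score = lambda s: sum(map(int, s)) if s else 0
print(prm_beam_search("", expand, prm_score))  # '3333'
```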
zh

[AI-74] From Narrative to Action: A Hierarchical LLM -Agent Framework for Human Mobility Generation

【速读】:该论文旨在解决传统基于代理(agent-based)或深度学习模型在模拟人类移动行为时,难以捕捉行为语义一致性与因果逻辑的问题,即现有方法虽能复现移动的统计特征,却缺乏对真实出行决策中认知层次结构的理解。其解决方案的关键在于提出一种分层的大语言模型代理框架(Hierarchical LLM-Agent Framework),称为“从叙事到行动”(Narrative-to-Action),通过引入宏观层面的“创意写作者”与“结构解析器”实现动机驱动的叙事生成与可执行计划转换,并结合微观层面的环境交互模块及新颖的职类感知移动熵(Mobility Entropy by Occupation, MEO)指标,使代理能够在地理环境中动态调整行为,从而实现既符合现实轨迹分布又具备可解释性的人类决策过程模拟。

链接: https://arxiv.org/abs/2510.24802
作者: Qiumeng Li,Chunhou Ji,Xinyue Liu
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 47 pages, 3 figures

点击查看摘要

Abstract:Understanding and replicating human mobility requires not only spatial-temporal accuracy but also an awareness of the cognitive hierarchy underlying real-world travel decisions. Traditional agent-based or deep learning models can reproduce statistical patterns of movement but fail to capture the semantic coherence and causal logic of human behavior. Large language models (LLMs) show potential, but struggle to balance creative reasoning with strict structural compliance. This study proposes a Hierarchical LLM-Agent Framework, termed Narrative-to-Action, that integrates high-level narrative reasoning, mid-level reflective planning, and low-level behavioral execution within a unified cognitive hierarchy. At the macro level, one agent is employed as a “creative writer” to produce diary-style narratives rich in motivation and context, then uses another agent as a “structural parser” to convert narratives into machine-readable plans. A dynamic execution module further grounds agents in geographic environments and enables adaptive behavioral adjustments guided by a novel occupation-aware metric, Mobility Entropy by Occupation (MEO), which captures heterogeneous schedule flexibility across different occupational personalities. At the micro level, the agent executes concrete actions-selecting locations, transportation modes, and time intervals-through interaction with an environmental simulation. By embedding this multi-layer cognitive process, the framework produces not only synthetic trajectories that align closely with real-world patterns but also interpretable representations of human decision logic. This research advances synthetic mobility generation from a data-driven paradigm to a cognition-driven simulation, providing a scalable pathway for understanding, predicting, and synthesizing complex urban mobility behaviors through hierarchical LLM agents.
zh

[AI-75] Mutual Wanting in Human–AI Interaction: Empirical Evidence from Large-Scale Analysis of GPT Model Transitions

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)快速迭代过程中,用户与AI系统之间因复杂双向期望(bidirectional expectations)而产生的理解缺失问题。研究发现,这种双向期望表现为“相互想要”(mutual wanting),即用户对AI的期待与AI能力之间的动态匹配关系。解决方案的关键在于提出并验证了“相互想要对齐框架”(Mutual Wanting Alignment Framework, M-WAF),该框架通过双算法主题建模和多维特征提取等先进自然语言处理技术,量化了用户在模型升级前后期望与现实之间的偏差,并识别出用户群体中显著的“相互想要”类型。这一方法为AI系统设计提供了可操作的指标,有助于实现更主动的用户体验管理与更具关系意识的人工智能构建。

链接: https://arxiv.org/abs/2510.24796
作者: HaoYang Shang,Xuan Liu
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid evolution of large language models (LLMs) creates complex bidirectional expectations between users and AI systems that are poorly understood. We introduce the concept of “mutual wanting” to analyze these expectations during major model transitions. Through analysis of user comments from major AI forums and controlled experiments across multiple OpenAI models, we provide the first large-scale empirical validation of bidirectional desire dynamics in human-AI interaction. Our findings reveal that nearly half of users employ anthropomorphic language, trust significantly exceeds betrayal language, and users cluster into distinct “mutual wanting” types. We identify measurable expectation violation patterns and quantify the expectation-reality gap following major model releases. Using advanced NLP techniques including dual-algorithm topic modeling and multi-dimensional feature extraction, we develop the Mutual Wanting Alignment Framework (M-WAF) with practical applications for proactive user experience management and AI system design. These findings establish mutual wanting as a measurable phenomenon with clear implications for building more trustworthy and relationally-aware AI systems.
zh

[AI-76] AI Data Competencies: Scaffolding holistic AI literacy in Higher Education

【速读】:该论文旨在解决高等教育中AI素养(AI literacy)培养缺乏系统性框架的问题,以应对生成式AI(Generative AI)快速发展的教育挑战。其解决方案的关键在于提出了一套结构化的“AI数据素养学习成果框架”(AI Data Acumen Learning Outcomes Framework),该框架通过四个能力水平和七个知识维度,明确界定AI与数据相关的核心 competencies,并强调技术技能、伦理考量与社会文化意识的平衡发展,从而为课程设计、教学活动及评估提供可操作的指导路径。

链接: https://arxiv.org/abs/2510.24783
作者: Kathleen Kennedy,Anuj Gupta
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Journal: Thresholds in Education Publisher: Academy for Educational Studies ISSN: 0916-9641 URL to published article: this https URL

点击查看摘要

Abstract:This chapter introduces the AI Data Acumen Learning Outcomes Framework, a comprehensive tool designed to guide the integration of AI literacy across higher education. Developed through a collaborative process, the framework defines key AI and data-related competencies across four proficiency levels and seven knowledge dimensions. It provides a structured approach for educators to scaffold student learning in AI, balancing technical skills with ethical considerations and sociocultural awareness. The chapter outlines the framework’s development process, its structure, and practical strategies for implementation in curriculum design, learning activities, and assessment. We address challenges in implementation and future directions for AI education. By offering a roadmap for developing students’ holistic AI literacy, this framework prepares learners to leverage generative AI capabilities in both academic and professional contexts.
zh

[AI-77] Dual-Domain Deep Learning-Assisted NOMA-CSK Systems for Secure and Efficient Vehicular Communications

【速读】:该论文旨在解决车联网中多用户(MU)传输的效率与安全性问题,特别是传统混沌调制系统在非相干检测下因参考信号传输导致的频谱效率低、用户连接数受限,以及基于稀疏码多址接入(SCMA)的混沌移键(DCSK)方案存在计算复杂度高和可扩展性差的问题。解决方案的关键在于提出一种基于深度学习辅助的功率域非正交多址混沌移键(DL-NOMA-CSK)系统:通过设计一个深度神经网络(DNN)解调器,在离线训练阶段学习混沌信号的内在特征,从而无需混沌同步或参考信号传输;该DNN采用时域与频域联合特征提取架构,提升动态信道下的特征学习能力,并嵌入逐次干扰消除(SIC)框架以缓解误差传播问题,最终实现更高的频谱效率(SE)、能量效率(EE)、误比特率(BER)性能及安全性,同时保持较低的计算复杂度。

链接: https://arxiv.org/abs/2510.24763
作者: Tingting Huang,Jundong Chen,Huanqiang Zeng,Guofa Cai,Georges Kaddoum
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Ensuring secure and efficient multi-user (MU) transmission is critical for vehicular communication systems. Chaos-based modulation schemes have garnered considerable interest due to their benefits in physical layer security. However, most existing MU chaotic communication systems, particularly those based on non-coherent detection, suffer from low spectral efficiency due to reference signal transmission, and limited user connectivity under orthogonal multiple access (OMA). While non-orthogonal schemes, such as sparse code multiple access (SCMA)-based DCSK, have been explored, they face high computational complexity and inflexible scalability due to their fixed codebook designs. This paper proposes a deep learning-assisted power domain non-orthogonal multiple access chaos shift keying (DL-NOMA-CSK) system for vehicular communications. A deep neural network (DNN)-based demodulator is designed to learn intrinsic chaotic signal characteristics during offline training, thereby eliminating the need for chaotic synchronization or reference signal transmission. The demodulator employs a dual-domain feature extraction architecture that jointly processes the time-domain and frequency-domain information of chaotic signals, enhancing feature learning under dynamic channels. The DNN is integrated into the successive interference cancellation (SIC) framework to mitigate error propagation issues. Theoretical analysis and extensive simulations demonstrate that the proposed system achieves superior performance in terms of spectral efficiency (SE), energy efficiency (EE), bit error rate (BER), security, and robustness, while maintaining lower computational complexity compared to traditional MU-DCSK and existing DL-aided schemes. These advantages validate its practical viability for secure vehicular communications.
zh

[AI-78] Stable-by-Design Neural Network-Based LPV State-Space Models for System Identification

【速读】:该论文旨在解决非线性系统建模中难以同时准确捕捉潜在动态特性与保证模型稳定性的问题。传统辨识方法在处理复杂非线性系统时,常因缺乏对隐含状态和调度变量的精确建模而导致性能受限或不稳定。其解决方案的关键在于提出一种稳定设计(stable-by-design)的LPV神经网络状态空间(NN-SS)模型,该模型通过神经网络直接从数据中学习隐状态和内部调度变量,并利用Schur参数化确保状态转移矩阵的稳定性;同时,结合编码器进行初始状态估计与状态空间表示网络构建调度依赖的系统矩阵,训练时引入多步预测损失与状态一致性正则项,从而提升长期预测精度和鲁棒性。

链接: https://arxiv.org/abs/2510.24757
作者: Ahmet Eren Sertbaş,Tufan Kumbasar
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: In the 12th International Conference of Image Processing, Wavelet and Applications on Real World Problems, 2025

点击查看摘要

Abstract:Accurate modeling of nonlinear systems is essential for reliable control, yet conventional identification methods often struggle to capture latent dynamics while maintaining stability. We propose a stable-by-design LPV neural network-based state-space (NN-SS) model that simultaneously learns latent states and internal scheduling variables directly from data. The state-transition matrix, generated by a neural network using the learned scheduling variables, is guaranteed to be stable through a Schur-based parameterization. The architecture combines an encoder for initial state estimation with a state-space representer network that constructs the full set of scheduling-dependent system matrices. For training the NN-SS, we develop a framework that integrates multi-step prediction losses with a state-consistency regularization term, ensuring robustness against drift and improving long-horizon prediction accuracy. The proposed NN-SS is evaluated on benchmark nonlinear systems, and the results demonstrate that the model consistently matches or surpasses classical subspace identification methods and recent gradient-based approaches. These findings highlight the potential of stability-constrained neural LPV identification as a scalable and reliable framework for modeling complex nonlinear systems.
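
下面示意“按设计稳定”的一种简单实现:由调度变量生成状态转移矩阵后,将其谱范数缩放到 1 以内,从而满足谱半径小于 1(Schur 稳定)的充分条件。这并非论文中基于 Schur 参数化的原始构造,仅说明稳定性约束如何内嵌到网络输出中:

```python
import torch
import torch.nn as nn

class StableTransition(nn.Module):
    """示意: 把网络输出的转移矩阵缩放到谱范数 < 1, 保证 Schur 稳定(充分条件)。"""
    def __init__(self, nx: int, np_sched: int, gamma: float = 0.98):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(np_sched, 32), nn.Tanh(),
                                 nn.Linear(32, nx * nx))
        self.nx, self.gamma = nx, gamma

    def forward(self, p):                       # p: 调度变量 (B, np_sched)
        A = self.net(p).view(-1, self.nx, self.nx)
        sigma = torch.linalg.matrix_norm(A, ord=2)        # 最大奇异值
        return A * (self.gamma / torch.clamp(sigma, min=self.gamma)).view(-1, 1, 1)

A = StableTransition(nx=3, np_sched=2)(torch.randn(4, 2))
print(torch.linalg.eigvals(A).abs().max())      # < 1: 所有特征值位于单位圆内
```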
zh
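
To make the Schur-based stability constraint concrete, here is a minimal PyTorch sketch of one way to parameterize a state-transition matrix whose spectral radius is below 1 by construction. It restricts the Schur factor to real eigenvalues for brevity; the paper's exact parameterization and the scheduling-dependent generation of the matrix are not reproduced, and all names are illustrative.

```python
import torch

def stable_transition_matrix(theta_diag, theta_upper, theta_q):
    """Build a Schur-stable transition matrix A = Q T Q^T.

    T is upper triangular with diagonal entries squashed into (-1, 1) by
    tanh, so every eigenvalue of A lies strictly inside the unit circle.
    Restricting T to a real diagonal is a simplification: it only realizes
    real eigenvalues, whereas a full real-Schur parameterization would also
    allow 2x2 blocks for complex-conjugate pairs.
    """
    T = torch.diag(torch.tanh(theta_diag))        # eigenvalues, |.| < 1
    T = T + torch.triu(theta_upper, diagonal=1)   # free strictly-upper part
    Q, _ = torch.linalg.qr(theta_q)               # orthogonal factor
    return Q @ T @ Q.T                            # similar to T: same spectrum

n = 4
A = stable_transition_matrix(torch.randn(n), torch.randn(n, n), torch.randn(n, n))
print(torch.linalg.eigvals(A).abs().max())        # < 1 by construction
```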

[AI-79] Beyond Function-Level Search: Repository-Aware Dual-Encoder Code Retrieval with Adversarial Verification EMNLP2025

【速读】: This paper tackles the difficulty of retrieving code for cross-component change intents in modern codebases: the existing function-level search paradigm cannot effectively identify the context code fragments relevant to a specific change request. To fill this gap, the authors introduce the RepoAlign-Bench benchmark and the ReflectCode model; the key innovation is an adversarial-reflection-augmented dual-tower architecture that uses LLM-guided reflection to dynamically fuse syntactic patterns, function dependencies, and semantic expansion intents, shifting the paradigm from function-level matching to holistic repository-level reasoning.

链接: https://arxiv.org/abs/2510.24749
作者: Aofan Liu,Shiyuan Song,Haoxuan Li,Cehao Yang,Yiyan Qi
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted by EMNLP 2025

点击查看摘要

Abstract:The escalating complexity of modern codebases has intensified the need for retrieval systems capable of interpreting cross-component change intents, a capability fundamentally absent in conventional function-level search paradigms. While recent studies have improved the alignment between natural language queries and code snippets, retrieving contextually relevant code for specific change requests remains largely underexplored. To address this gap, we introduce RepoAlign-Bench, the first benchmark specifically designed to evaluate repository-level code retrieval under change request driven scenarios, encompassing 52k annotated instances. This benchmark shifts the retrieval paradigm from function-centric matching to holistic repository-level reasoning. Furthermore, we propose ReflectCode, an adversarial reflection augmented dual-tower architecture featuring disentangled code_encoder and doc_encoder components. ReflectCode dynamically integrates syntactic patterns, function dependencies, and semantic expansion intents through large language model guided reflection. Comprehensive experiments demonstrate that ReflectCode achieves 12.2% improvement in Top-5 Accuracy and 7.1% in Recall over state-of-the-art baselines, establishing a new direction for context-aware code retrieval.
zh
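
As a rough illustration of the dual-tower retrieval setup named in the abstract, the sketch below scores a change request against candidate code snippets with two independent encoders and cosine similarity. The Tower module is a stand-in: the actual code_encoder/doc_encoder in ReflectCode use LLM-guided reflection features, which are not modeled here.

```python
import torch
import torch.nn.functional as F

class Tower(torch.nn.Module):
    """Stand-in encoder tower; ReflectCode's code_encoder/doc_encoder use
    LLM-guided reflection features, approximated here by embedding averaging."""
    def __init__(self, vocab_size=30000, dim=256):
        super().__init__()
        self.emb = torch.nn.EmbeddingBag(vocab_size, dim)
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, token_ids):
        return F.normalize(self.proj(self.emb(token_ids)), dim=-1)

doc_encoder, code_encoder = Tower(), Tower()
query = torch.randint(0, 30000, (1, 32))      # tokenized change request
corpus = torch.randint(0, 30000, (100, 64))   # 100 candidate code snippets
scores = doc_encoder(query) @ code_encoder(corpus).T  # cosine similarities
print(scores.topk(5).indices)                 # Top-5 retrieved candidates
```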

[AI-80] Large-Scale Network Embedding in Apache Spark KDD2021

【速读】: This paper targets the computational inefficiency of network embedding on large-scale graph data: traditional methods incur high computational cost and memory bottlenecks when processing graphs with billions of edges. The key to the solution is a distributed algorithm built on Apache Spark that recursively partitions a large graph into many small subgraphs, computes the embedding of each subgraph in parallel, and then aggregates the results at linear cost, achieving efficient and scalable network embedding. The approach markedly improves the ability to handle large graphs, yields gains on link prediction and node classification, and has been deployed in two online games at Tencent, validating its effectiveness and efficiency in practice.

链接: https://arxiv.org/abs/2106.10620
作者: Wenqing Lin
机构: 未知
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注: Accepted in KDD 2021

点击查看摘要

Abstract:Network embedding has been widely used in social recommendation and network analysis, such as recommendation systems and anomaly detection with graphs. However, most of previous approaches cannot handle large graphs efficiently, due to that (i) computation on graphs is often costly and (ii) the size of graph or the intermediate results of vectors could be prohibitively large, rendering it difficult to be processed on a single machine. In this paper, we propose an efficient and effective distributed algorithm for network embedding on large graphs using Apache Spark, which recursively partitions a graph into several small-sized subgraphs to capture the internal and external structural information of nodes, and then computes the network embedding for each subgraph in parallel. Finally, by aggregating the outputs on all subgraphs, we obtain the embeddings of nodes in a linear cost. After that, we demonstrate in various experiments that our proposed approach is able to handle graphs with billions of edges within a few hours and is at least 4 times faster than the state-of-the-art approaches. Besides, it achieves up to 4.25% and 4.27% improvements on link prediction and node classification tasks respectively. In the end, we deploy the proposed algorithms in two online games of Tencent with the applications of friend recommendation and item recommendation, which improve the competitors by up to 91.11% in running time and up to 12.80% in the corresponding evaluation metrics.
zh
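
The partition-embed-aggregate pattern from the abstract can be sketched without Spark; a toy version using networkx follows, with a spectral embedding standing in for the paper's per-subgraph embedding step and a one-shot bisection standing in for the recursive partitioning. In the real system, the per-partition loop would run as parallel Spark tasks.

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection

def embed_subgraph(g, dim=8):
    """Stand-in for the per-partition embedding step; the paper trains its
    own embedding model per subgraph, a spectral sketch is used here."""
    nodes = list(g.nodes())
    lap = nx.normalized_laplacian_matrix(g, nodelist=nodes).toarray()
    _, vecs = np.linalg.eigh(lap)
    return {v: vecs[i, :dim] for i, v in enumerate(nodes)}

G = nx.karate_club_graph()
# Recursive partitioning is approximated by a one-shot bisection for brevity.
parts = kernighan_lin_bisection(G)
embeddings = {}
for part in parts:   # each iteration would be a parallel Spark task
    embeddings.update(embed_subgraph(G.subgraph(part)))
print(len(embeddings), "nodes embedded")
```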

[AI-81] E-Scores for (In)Correctness Assessment of Generative Model Outputs

【速读】: This paper addresses the limitations of assessing the correctness of generative AI model outputs, especially those of large language models (LLMs). Existing conformal-prediction-based methods provide statistical guarantees on the error probability but depend on p-values and are thus vulnerable to p-hacking: choosing the tolerance level after observing the results invalidates the theoretical guarantees. The key idea is to replace p-values with e-values, introducing e-scores as a measure of how incorrect an LLM output is. E-scores not only retain the original statistical guarantees but also allow users to adaptively choose tolerance levels after observing the e-scores themselves, by upper-bounding a post-hoc notion of error called size distortion, enabling more flexible and reliable correctness assessment.

链接: https://arxiv.org/abs/2510.25770
作者: Guneet S. Dhillon,Javier González,Teodora Pandeva,Alicia Curth
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:While generative models, especially large language models (LLMs), are ubiquitous in today’s world, principled mechanisms to assess their (in)correctness are limited. Using the conformal prediction framework, previous works construct sets of LLM responses where the probability of including an incorrect response, or error, is capped at a desired user-defined tolerance level. However, since these methods are based on p-values, they are susceptible to p-hacking, i.e., choosing the tolerance level post-hoc can invalidate the guarantees. We therefore leverage e-values to complement generative model outputs with e-scores as a measure of incorrectness. In addition to achieving the same statistical guarantees as before, e-scores provide users flexibility in adaptively choosing tolerance levels after observing the e-scores themselves, by upper bounding a post-hoc notion of error called size distortion. We experimentally demonstrate their efficacy in assessing LLM outputs for different correctness types: mathematical factuality and property constraints satisfaction.
zh
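
A small numerical sketch of how e-scores can be thresholded: because e-values satisfy Markov's inequality, flagging outputs with e-score at least 1/alpha controls the error at level alpha, and (unlike with p-values) alpha may be chosen after inspecting the scores, at the price of the bounded size distortion discussed in the paper. The scores below are made up for illustration.

```python
import numpy as np

# Toy e-scores for a batch of LLM responses; larger means stronger evidence
# of incorrectness. How the e-scores themselves are computed is model- and
# task-specific and is omitted here.
e_scores = np.array([0.4, 1.2, 3.5, 0.9, 8.0, 2.1])

# E-values satisfy Markov's inequality, P(E >= 1/alpha) <= alpha, so flagging
# responses with e-score >= 1/alpha controls the error at level alpha even
# when alpha is picked after looking at the scores (up to the bounded size
# distortion analyzed in the paper).
alpha = 0.25
flagged = e_scores >= 1.0 / alpha
print("flagged as likely incorrect:", np.where(flagged)[0])
```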

[AI-82] Physics-Guided Conditional Diffusion Networks for Microwave Image Reconstruction

【速读】: This paper addresses the electromagnetic inverse scattering problem in microwave imaging, where ill-posedness makes the solution non-unique; traditional deterministic machine-learning methods output only a single reconstruction and cannot reflect the multiple solutions that may exist in real physical scenes. The key to the solution is a generative framework based on a conditional latent-diffusion model that explicitly mirrors this non-uniqueness: conditioned on measured scattered-field data, it generates multiple physically consistent permittivity maps as candidate solutions, and a forward electromagnetic solver acts as a physics-driven evaluation mechanism that selects the candidate minimizing the discrepancy between predicted and measured scattered fields, yielding high-fidelity reconstructions with strong generalization.

链接: https://arxiv.org/abs/2510.25729
作者: Shirin Chehelgami,Joe LoVetri,Vahab Khoshdel
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:A conditional latent-diffusion based framework for solving the electromagnetic inverse scattering problem associated with microwave imaging is introduced. This generative machine-learning model explicitly mirrors the non-uniqueness of the ill-posed inverse problem. Unlike existing inverse solvers utilizing deterministic machine learning techniques that produce a single reconstruction, the proposed latent-diffusion model generates multiple plausible permittivity maps conditioned on measured scattered-field data, thereby generating several potential instances in the range-space of the non-unique inverse mapping. A forward electromagnetic solver is integrated into the reconstruction pipeline as a physics-based evaluation mechanism. The space of candidate reconstructions form a distribution of possibilities consistent with the conditioning data and the member of this space yielding the lowest scattered-field data discrepancy between the predicted and measured scattered fields is reported as the final solution. Synthetic and experimental labeled datasets are used for training and evaluation of the model. An innovative labeled synthetic dataset is created that exemplifies a varied set of scattering features. Training of the model using this new dataset produces high quality permittivity reconstructions achieving improved generalization with excellent fidelity to shape recognition. The results highlight the potential of hybrid generative physics frameworks as a promising direction for robust, data-driven microwave imaging.
zh
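
The physics-guided selection step described above reduces to a simple loop: sample several candidate permittivity maps, run each through the forward solver, and keep the one with the smallest scattered-field discrepancy. Both functions below are placeholders (no real diffusion model or EM physics), shown only to make the pipeline structure explicit.

```python
import numpy as np

def sample_candidates(measured_field, n=16):
    """Placeholder for the conditional latent-diffusion sampler, which would
    return n permittivity maps conditioned on the measured scattered field."""
    rng = np.random.default_rng(0)
    return [rng.uniform(1.0, 3.0, size=(32, 32)) for _ in range(n)]

def forward_solver(permittivity):
    """Placeholder for the physics-based forward EM solver; no real physics."""
    return permittivity.mean(axis=0)

measured = np.ones(32)
candidates = sample_candidates(measured)
# Physics-guided selection: keep the candidate whose simulated field best
# matches the measurement, as described in the abstract.
errors = [np.linalg.norm(forward_solver(c) - measured) for c in candidates]
best = candidates[int(np.argmin(errors))]
```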

[AI-83] Using latent representations to link disjoint longitudinal data for mixed-effects regression

【速读】: This paper addresses the broken longitudinal data trajectories in rare-disease clinical trials caused by small sample sizes and measurement instruments that change over time (e.g., adapted to different age ranges), which make it difficult to apply traditional mixed-effects regression to analyze treatment-switch effects. The key to the solution is to map the observations of different measurement instruments at each time point into a shared low-dimensional latent space via a set of variational autoencoder (VAE) architectures, aligning longitudinal trajectories across instruments; a mixed-effects regression model applied to the latent representations then captures disease dynamics and treatment-switch effects, and a novel statistical test accounts for the joint parameter estimation of the mixed-effects model and the VAEs. This enables quantitative assessment of treatment-switch effects and model selection under small-sample conditions.

链接: https://arxiv.org/abs/2510.25531
作者: Clemens Schächter,Maren Hackenberg,Michelle Pfaffenlehner,Félix B. Tambe-Ndonfack,Thorsten Schmidt,Astrid Pechmann,Janbernd Kirschner,Jan Hasenauser,Harald Binder
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 31 pages, 3 figures, 3 tables

点击查看摘要

Abstract:Many rare diseases offer limited established treatment options, leading patients to switch therapies when new medications emerge. To analyze the impact of such treatment switches within the low sample size limitations of rare disease trials, it is important to use all available data sources. This, however, is complicated when usage of measurement instruments changes during the observation period, for example when instruments are adapted to specific age ranges. The resulting disjoint longitudinal data trajectories complicate the application of traditional modeling approaches like mixed-effects regression. We tackle this by mapping observations of each instrument to an aligned low-dimensional temporal trajectory, enabling longitudinal modeling across instruments. Specifically, we employ a set of variational autoencoder architectures to embed item values into a shared latent space for each time point. Temporal disease dynamics and treatment switch effects are then captured through a mixed-effects regression model applied to latent representations. To enable statistical inference, we present a novel statistical testing approach that accounts for the joint parameter estimation of mixed-effects regression and variational autoencoders. The methodology is applied to quantify the impact of treatment switches for patients with spinal muscular atrophy. Here, our approach aligns motor performance items from different measurement instruments for mixed-effects regression and maps estimated effects back to the observed item level to quantify the treatment switch effect. Our approach allows for model selection as well as for assessing effects of treatment switching. The results highlight the potential of modeling in joint latent representations for addressing small data challenges.
zh
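
A minimal sketch of the alignment idea, assuming two instruments with different item counts: instrument-specific encoders (here plain linear maps standing in for the paper's VAE encoders) project each visit into a shared 2-D latent space, after which a single longitudinal model can be fit across instruments. Dimensions and names are illustrative.

```python
import torch

# Instrument-specific encoders (plain linear maps standing in for the VAE
# encoders) project visits from either instrument into one 2-D latent space.
enc_a = torch.nn.Linear(10, 2)   # instrument A: 10 items (e.g., younger ages)
enc_b = torch.nn.Linear(6, 2)    # instrument B: 6 items (e.g., older ages)

def encode(visit_items, instrument):
    return (enc_a if instrument == "A" else enc_b)(visit_items)

z_early = encode(torch.randn(10), "A")       # early visit, instrument A
z_late = encode(torch.randn(6), "B")         # later visit, instrument B
trajectory = torch.stack([z_early, z_late])  # aligned across instruments;
# a mixed-effects regression would then be fit on such latent trajectories.
```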

[AI-84] Improving Temporal Consistency and Fidelity at Inference-time in Perceptual Video Restoration by Zero-shot Image-based Diffusion Models

【速读】: This paper addresses the temporal inconsistency that arises when zero-shot image diffusion models are used for video restoration, a problem rooted in the stochasticity of the sampling process and the complexity of explicit temporal modeling. The key to the solution is two inference-time strategies that require no retraining or architectural modification: first, Perceptual Straightening Guidance (PSG), inspired by the perceptual straightening hypothesis from neuroscience, which introduces a curvature penalty in a perceptual space to steer the diffusion denoising process toward smoother temporal evolution, improving temporal perceptual quality (e.g., FVD and perceptual straightness); second, Multi-Path Ensemble Sampling (MPES), which ensembles multiple diffusion trajectories to reduce stochastic variation, significantly improving fidelity metrics (e.g., PSNR and SSIM) without sacrificing sharpness. Together they enable high-fidelity and temporally stable video restoration with pretrained diffusion models.

链接: https://arxiv.org/abs/2510.25420
作者: Nasrin Rahimi,A. Murat Tekalp
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion models have emerged as powerful priors for single-image restoration, but their application to zero-shot video restoration suffers from temporal inconsistencies due to the stochastic nature of sampling and complexity of incorporating explicit temporal modeling. In this work, we address the challenge of improving temporal coherence in video restoration using zero-shot image-based diffusion models without retraining or modifying their architecture. We propose two complementary inference-time strategies: (1) Perceptual Straightening Guidance (PSG) based on the neuroscience-inspired perceptual straightening hypothesis, which steers the diffusion denoising process towards smoother temporal evolution by incorporating a curvature penalty in a perceptual space to improve temporal perceptual scores, such as Fréchet Video Distance (FVD) and perceptual straightness; and (2) Multi-Path Ensemble Sampling (MPES), which aims at reducing stochastic variation by ensembling multiple diffusion trajectories to improve fidelity (distortion) scores, such as PSNR and SSIM, without sacrificing sharpness. Together, these training-free techniques provide a practical path toward temporally stable high-fidelity perceptual video restoration using large pretrained diffusion models. We performed extensive experiments over multiple datasets and degradation types, systematically evaluating each strategy to understand their strengths and limitations. Our results show that while PSG enhances temporal naturalness, particularly in case of temporal blur, MPES consistently improves fidelity and spatio-temporal perception–distortion trade-off across all tasks.
zh
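
PSG's core quantity, the curvature of a temporal trajectory in a perceptual space, can be sketched as the mean turning angle between successive displacement vectors; its gradient is then usable as a guidance term during denoising. The feature extractor and the guidance weighting used in the paper are omitted, and the features below are random stand-ins.

```python
import torch

def curvature_penalty(frames):
    """Mean turning angle of a temporal trajectory.

    frames: (T, D) tensor of per-frame features in some perceptual space.
    Straighter trajectories have smaller mean turning angles.
    """
    v = frames[1:] - frames[:-1]                     # displacement vectors
    v = v / (v.norm(dim=-1, keepdim=True) + 1e-8)    # unit directions
    cos = (v[1:] * v[:-1]).sum(dim=-1).clamp(-1, 1)  # angles between steps
    return torch.acos(cos).mean()

feats = torch.randn(8, 128, requires_grad=True)      # 8 frames, toy features
loss = curvature_penalty(feats)
loss.backward()   # the gradient can serve as a guidance term while denoising
```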

[AI-85] Adaptive End-to-End Transceiver Design for NextG Pilot-Free and CP-Free Wireless Systems

【速读】: This paper addresses the high overhead and low spectral efficiency caused by the reliance on pilots and the cyclic prefix (CP) in conventional OFDM systems, as well as the difficulty of efficient adaptive communication under dynamic channel conditions. The key to the solution is an end-to-end (E2E) transceiver architecture for pilot-free, CP-free systems that jointly trains AI-driven constellation shaping with a neural receiver; a lightweight channel adapter (CA) module updates only a small number of parameters to adapt quickly to channel mismatch or time variation, improving robustness; and constrained E2E training achieves peak-to-average power ratio (PAPR) control without extra transmission overhead, significantly improving bit error rate (BER), throughput, and resilience across diverse channel scenarios and offering a viable path toward AI-native NextG wireless communication.

链接: https://arxiv.org/abs/2510.25416
作者: Jiaming Cheng,Wei Chen,Bo Ai
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注: Submitted to IEEE for possible publication

点击查看摘要

Abstract:The advent of artificial intelligence (AI)-native wireless communication is fundamentally reshaping the design paradigm of next-generation (NextG) systems, where intelligent air interfaces are expected to operate adaptively and efficiently in highly dynamic environments. Conventional orthogonal frequency division multiplexing (OFDM) systems rely heavily on pilots and the cyclic prefix (CP), resulting in significant overhead and reduced spectral efficiency. To address these limitations, we propose an adaptive end-to-end (E2E) transceiver architecture tailored for pilot-free and CP-free wireless systems. The architecture combines AI-driven constellation shaping and a neural receiver through joint training. To enhance robustness against mismatched or time-varying channel conditions, we introduce a lightweight channel adapter (CA) module, which enables rapid adaptation with minimal computational overhead by updating only the CA parameters. Additionally, we present a framework that is scalable to multiple modulation orders within a unified model, significantly reducing model storage requirements. Moreover, to tackle the high peak-to-average power ratio (PAPR) inherent to OFDM, we incorporate constrained E2E training, achieving compliance with PAPR targets without additional transmission overhead. Extensive simulations demonstrate that the proposed framework delivers superior bit error rate (BER), throughput, and resilience across diverse channel scenarios, highlighting its potential for AI-native NextG.
zh
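
One common way to realize the constrained E2E training mentioned in the abstract is a differentiable PAPR penalty added to the task loss; a sketch follows. The 8 dB target and the penalty weight are illustrative assumptions, not values from the paper.

```python
import torch

def papr_db(x):
    """Peak-to-average power ratio of a complex baseband signal, in dB."""
    power = x.abs() ** 2
    return 10 * torch.log10(power.max() / power.mean())

def total_loss(task_loss, tx_signal, papr_target_db=8.0, weight=0.1):
    """Task loss plus a differentiable penalty on PAPR above the target;
    target and weight are illustrative, not the paper's values."""
    excess = torch.relu(papr_db(tx_signal) - papr_target_db)
    return task_loss + weight * excess

tx = torch.randn(1024, dtype=torch.complex64)   # stand-in transmit signal
print(papr_db(tx))                              # PAPR of a Gaussian signal
```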

[AI-86] EcoScaleNet: A Lightweight Multi Kernel Network for Long Sequence 12 lead ECG Classification MICCAI

【速读】: This paper addresses the difficulty of choosing suitable receptive-field sizes when conventional convolutional neural networks (CNNs) process long-sequence electrocardiogram (ECG) data, and the excessive computational cost of existing Omni-Scale CNNs (OS-CNN), whose redundant design prevents scaling to deeper and wider models. The key to the solution is the hierarchical Efficient Convolutional Omni-Scale Network (EcoScaleNet): the maximum kernel length at each stage is capped to the scale still required after downsampling, and bottleneck convolutions before and after each Omni-Scale block curb channel growth and fuse multi-scale features, preserving full receptive-field coverage while reducing parameters by 90% and FLOPs by 99% and raising the macro-averaged F1 score by 2.4%.

链接: https://arxiv.org/abs/2510.24748
作者: Dong-Hyeon Kang,Ju-Hyeon Nam,Sang-Chul Lee
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: MICCAI Workshop on Efficient Medical AI (EMA)

点击查看摘要

Abstract:Accurate interpretation of 12 lead electrocardiograms (ECGs) is critical for early detection of cardiac abnormalities, yet manual reading is error prone and existing CNN based classifiers struggle to choose receptive field sizes that generalize to the long sequences typical of ECGs. Omni Scale CNN (OS CNN) addresses this by enumerating prime sized kernels inspired by Goldbach conjecture to cover every scale, but its exhaustive design explodes computational cost and blocks deeper, wider models. We present Efficient Convolutional Omni Scale Network (EcoScale-Net), a hierarchical variant that retains full receptive field coverage while eliminating redundancy. At each stage, the maximum kernel length is capped to the scale still required after down sampling, and bottleneck convolutions inserted before and after every Omni Scale block curtail channel growth and fuse multi scale features. On the large scale CODE 15% ECG dataset, EcoScaleNet reduces parameters by 90% and FLOPs by 99% compared with OS CNN, while raising macro averaged F1 score by 2.4%. These results demonstrate that EcoScaleNet delivers SOTA accuracy for long sequence ECG classification at a fraction of the computational cost, enabling real time deployment on commodity hardware. Our EcoScaleNet code is available in GitHub Link.
zh
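
The capping idea can be sketched in a few lines: enumerate OS-CNN-style prime kernel sizes, but cut the list off at the receptive-field scale still needed after downsampling. This is a simplification of EcoScaleNet's stage design (the bottleneck convolutions are not shown).

```python
from sympy import primerange

def stage_kernel_sizes(scale_still_needed):
    """Kernel lengths for one stage: OS-CNN-style prime enumeration
    (1, 2, and all primes), but capped at the receptive-field scale still
    required after the stage's downsampling, the core EcoScaleNet idea.
    Simplified sketch; bottleneck convolutions are not shown."""
    return [1, 2] + list(primerange(3, scale_still_needed + 1))

# Early stages of a long ECG still need large coverage; after repeated
# downsampling the remaining scale (and hence the kernel set) shrinks.
print(len(stage_kernel_sizes(491)))   # stage 1: many prime kernels
print(stage_kernel_sizes(23))         # deeper stage: [1, 2, 3, 5, ..., 23]
```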

[AI-87] PulseFi: A Low Cost Robust Machine Learning System for Accurate Cardiopulmonary and Apnea Monitoring Using Channel State Information

【速读】: This paper addresses the dependence of non-intrusive vital-sign monitoring on expensive and complex equipment in healthcare settings, focusing on how to achieve low-cost, continuous, and accurate heart-rate and breathing-rate monitoring with apnea detection. The key to the solution is the PulseFi system, whose core is to collect Channel State Information (CSI) from commodity Wi-Fi devices and process it with a custom low-compute Long Short-Term Memory (LSTM) neural network for signal processing and feature extraction, achieving highly accurate vital-sign estimation whose performance, validated on multiple datasets, matches or exceeds that of expensive multi-antenna systems.

链接: https://arxiv.org/abs/2510.24744
作者: Pranay Kocheta,Nayan Sanjay Bhatia,Katia Obraczka
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注: 12 pages, 10 figures

点击查看摘要

Abstract:Non-intrusive monitoring of vital signs has become increasingly important in a variety of healthcare settings. In this paper, we present PulseFi, a novel low-cost non-intrusive system that uses Wi-Fi sensing and artificial intelligence to accurately and continuously monitor heart rate and breathing rate, as well as detect apnea events. PulseFi operates using low-cost commodity devices, making it more accessible and cost-effective. It uses a signal processing pipeline to process Wi-Fi telemetry data, specifically Channel State Information (CSI), that is fed into a custom low-compute Long Short-Term Memory (LSTM) neural network model. We evaluate PulseFi using two datasets: one that we collected locally using ESP32 devices and another that contains recordings of 118 participants collected using the Raspberry Pi 4B, making the latter the most comprehensive data set of its kind. Our results show that PulseFi can effectively estimate heart rate and breathing rate in a seamless, non-intrusive way with comparable or better accuracy than multi-antenna systems that can be expensive and less accessible.
zh
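
A minimal stand-in for the kind of low-compute LSTM regressor described in the abstract: a window of preprocessed CSI features in, a vital-sign estimate out. The subcarrier count, hidden size, and single-output head are illustrative; the paper's exact architecture and preprocessing are not reproduced.

```python
import torch

class CSIVitalsNet(torch.nn.Module):
    """Window of preprocessed CSI features in, one vital-sign estimate out."""
    def __init__(self, n_subcarriers=52, hidden=32):
        super().__init__()
        self.lstm = torch.nn.LSTM(n_subcarriers, hidden, batch_first=True)
        self.head = torch.nn.Linear(hidden, 1)    # e.g., beats per minute

    def forward(self, csi_window):                # (batch, time, subcarriers)
        out, _ = self.lstm(csi_window)
        return self.head(out[:, -1])              # read off the last step

model = CSIVitalsNet()
estimate = model(torch.randn(4, 200, 52))         # 4 windows of 200 samples
print(estimate.shape)                             # torch.Size([4, 1])
```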

[AI-88] Cardi-GPT : An Expert ECG-Record Processing Chatbot

【速读】: This paper addresses the complexity of electrocardiogram (ECG) interpretation and clinical communication, which traditionally rely on highly skilled physicians for accurate reading and effective exchange, posing challenges of low efficiency and limited accessibility. The core of the solution, Cardi-GPT, combines deep learning with natural language processing: a convolutional neural network (CNN) with 16 residual blocks extracts features from and classifies 12-lead ECG data, achieving a weighted accuracy of 0.6194 across 24 cardiac conditions; a novel fuzzification layer converts complex numerical outputs into clinically meaningful linguistic categories, improving interpretability; and an integrated chatbot interface lets healthcare providers obtain diagnostic insights through natural-language interaction and facilitates efficient communication. The system markedly improves the automation and clinical utility of ECG analysis.

链接: https://arxiv.org/abs/2510.24737
作者: Koustav Mallick,Neel Singh,Mohammedreza Hajiarbabi
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Interpreting and communicating electrocardiogram (ECG) findings are crucial yet challenging tasks in cardiovascular diagnosis, traditionally requiring significant expertise and precise clinical communication. This paper introduces Cardi-GPT, an advanced expert system designed to streamline ECG interpretation and enhance clinical communication through deep learning and natural language interaction. Cardi-GPT employs a 16-residual-block convolutional neural network (CNN) to process 12-lead ECG data, achieving a weighted accuracy of 0.6194 across 24 cardiac conditions. A novel fuzzification layer converts complex numerical outputs into clinically meaningful linguistic categories, while an integrated chatbot interface facilitates intuitive exploration of diagnostic insights and seamless communication between healthcare providers. The system was evaluated on a diverse dataset spanning six hospitals across four countries, demonstrating superior performance compared to baseline models. Additionally, Cardi-GPT achieved an impressive overall response quality score of 73%, assessed using a comprehensive evaluation framework that measures coverage, grounding, and coherence. By bridging the gap between intricate ECG data interpretation and actionable clinical insights, Cardi-GPT represents a transformative innovation in cardiovascular healthcare, promising to improve diagnostic accuracy, clinical workflows, and patient outcomes across diverse medical settings. (Journal reference: SoutheastCon 2025, 352-357; DOI: 10.1109/SoutheastCon56624.2025.10971509)
zh
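
The fuzzification step can be illustrated as a mapping from per-condition probabilities to linguistic categories; the breakpoints and labels below are invented for the example, since the paper's actual membership functions are not specified here.

```python
def fuzzify(prob):
    """Map a per-condition probability to a linguistic category.
    Breakpoints and labels are illustrative assumptions."""
    bands = [(0.25, "unlikely"), (0.5, "possible"),
             (0.75, "likely"), (1.01, "highly likely")]
    return next(label for cut, label in bands if prob < cut)

probs = {"AFIB": 0.82, "LBBB": 0.31, "PVC": 0.07}   # toy CNN outputs
print({cond: fuzzify(p) for cond, p in probs.items()})
# {'AFIB': 'highly likely', 'LBBB': 'possible', 'PVC': 'unlikely'}
```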

[AI-89] Flows straight but not so fast: Exploring the design space of Rectified Flows in Protein Design

【速读】: This paper addresses the computational inefficiency of current diffusion- and flow-matching-based protein backbone generators, which typically require hundreds or even thousands of function evaluations (NFEs) to produce high-quality samples, falling short of design campaigns that target 10^4-10^6 designs per run. The key to the solution is to introduce and systematically optimize Rectified Flow (ReFlow) for pretrained SE(3) flow-matching models, improving data curation, training strategies, and inference-time settings to significantly reduce the required NFEs while raising sample quality; in particular, the study finds that ReFlow in the protein domain is highly sensitive to the choice of coupling generation and annealing, and that design choices effective in the image domain do not transfer directly to protein generation, motivating protein-specific ReFlow improvements for efficient, high-quality backbone generation.

链接: https://arxiv.org/abs/2510.24732
作者: Junhua Chen,Simon Mathis,Charles Harris,Kieran Didi,Pietro Lio
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative modeling techniques such as Diffusion and Flow Matching have achieved significant successes in generating designable and diverse protein backbones. However, many current models are computationally expensive, requiring hundreds or even thousands of function evaluations (NFEs) to yield samples of acceptable quality, which can become a bottleneck in practical design campaigns that often generate $10^4$-$10^6$ designs per target. In image generation, Rectified Flows (ReFlow) can significantly reduce the required NFEs for a given target quality, but their application in protein backbone generation has been less studied. We apply ReFlow to improve the low-NFE performance of pretrained SE(3) flow matching models for protein backbone generation and systematically study ReFlow design choices in the context of protein generation in data curation, training and inference time settings. In particular, we (1) show that ReFlow in the protein domain is particularly sensitive to the choice of coupling generation and annealing, (2) demonstrate how useful design choices for ReFlow in the image domain do not directly translate to better performance on proteins, and (3) make improvements to ReFlow methodology for proteins.
zh
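
For readers unfamiliar with ReFlow, here is a generic rectified-flow sketch of the couple-then-straighten step: integrate the current flow to pair noise with samples, then regress the model onto the constant velocity of the straight path between each pair, which is what enables few-NFE sampling. This toy operates on flat vectors, not the paper's SE(3) protein backbone representation.

```python
import torch

def integrate(model, x0, steps=100):
    """Euler integration of the learned flow from t=0 to t=1."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0], 1), i * dt)
        x = x + dt * model(x, t)
    return x

def reflow_loss(model, x0, x1):
    """Regress onto the constant velocity of the straight x0 -> x1 path."""
    t = torch.rand(x0.shape[0], 1)
    xt = (1 - t) * x0 + t * x1
    return ((model(xt, t) - (x1 - x0)) ** 2).mean()

net = torch.nn.Sequential(torch.nn.Linear(9, 64), torch.nn.Tanh(),
                          torch.nn.Linear(64, 8))
model = lambda x, t: net(torch.cat([x, t], dim=-1))
x0 = torch.randn(32, 8)                  # noise
x1 = integrate(model, x0).detach()       # coupled samples from current flow
loss = reflow_loss(model, x0, x1)        # one ReFlow training step
loss.backward()
```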

机器学习

[LG-0] Neural Stochastic Flows: Solver-Free Modelling and Inference for SDE Solutions NEURIPS2025

链接: https://arxiv.org/abs/2510.25769
作者: Naoki Kiyohara,Edward Johns,Yingzhen Li
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: NeurIPS 2025 (poster). Project page: this https URL

点击查看摘要

Abstract:Stochastic differential equations (SDEs) are well suited to modelling noisy and irregularly sampled time series found in finance, physics, and machine learning. Traditional approaches require costly numerical solvers to sample between arbitrary time points. We introduce Neural Stochastic Flows (NSFs) and their latent variants, which directly learn (latent) SDE transition laws using conditional normalising flows with architectural constraints that preserve properties inherited from stochastic flows. This enables one-shot sampling between arbitrary states and yields up to two orders of magnitude speed-ups at large time gaps. Experiments on synthetic SDE simulations and on real-world tracking and video data show that NSFs maintain distributional accuracy comparable to numerical approaches while dramatically reducing computation for arbitrary time-point sampling.

[LG-1] Synthetic Data Reveals Generalization Gaps in Correlated Multiple Instance Learning

链接: https://arxiv.org/abs/2510.25759
作者: Ethan Harvey,Dennis Johan Loevlie,Michael C. Hughes
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multiple instance learning (MIL) is often used in medical imaging to classify high-resolution 2D images by processing patches or classify 3D volumes by processing slices. However, conventional MIL approaches treat instances separately, ignoring contextual relationships such as the appearance of nearby patches or slices that can be essential in real applications. We design a synthetic classification task where accounting for adjacent instance features is crucial for accurate prediction. We demonstrate the limitations of off-the-shelf MIL approaches by quantifying their performance compared to the optimal Bayes estimator for this task, which is available in closed-form. We empirically show that newer correlated MIL methods still struggle to generalize as well as possible when trained from scratch on tens of thousands of instances.

[LG-2] MLPrE – A tool for preprocessing and exploratory data analysis prior to machine learning model construction

链接: https://arxiv.org/abs/2510.25755
作者: David S Maxwell,Michael Darkoh,Sidharth R Samudrala,Caroline Chung,Stephanie T Schmidt,Bissan Al-Lazikani
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the recent growth of Deep Learning for AI, there is a need for tools to meet the demand of data flowing into those models. In some cases, source data may exist in multiple formats, and therefore the source data must be investigated and properly engineered for a Machine Learning model or graph database. Overhead and lack of scalability with existing workflows limit integration within a larger processing pipeline such as Apache Airflow, driving the need for a robust, extensible, and lightweight tool to preprocess arbitrary datasets that scales with data type and size. To address this, we present Machine Learning Preprocessing and Exploratory Data Analysis, MLPrE, in which SparkDataFrames were utilized to hold data during processing and ensure scalability. A generalizable JSON input file format was utilized to describe stepwise changes to that DataFrame. Stages were implemented for input and output, filtering, basic statistics, feature engineering, and exploratory data analysis. A total of 69 stages were implemented into MLPrE, of which we highlight and demonstrate key stages using six diverse datasets. We further highlight MLPrE’s ability to independently process multiple fields in flat files and recombine them, otherwise requiring an additional pipeline, using a UniProt glossary term dataset. Building on this advantage, we demonstrated the clustering stage with available wine quality data. Lastly, we demonstrate the preparation of data for a graph database in the final stages of MLPrE using phosphosite kinase data. Overall, our MLPrE tool offers a generalizable and scalable tool for preprocessing and early data analysis, filling a critical need for such a tool given the ever expanding use of machine learning. This tool serves to accelerate and simplify early stage development in larger workflows.

[LG-3] Meshless solutions of PDE inverse problems on irregular geometries

链接: https://arxiv.org/abs/2510.25752
作者: James V. Roggeveen,Michael P. Brenner
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Solving inverse and optimization problems over solutions of nonlinear partial differential equations (PDEs) on complex spatial domains is a long-standing challenge. Here we introduce a method that parameterizes the solution using spectral bases on arbitrary spatiotemporal domains, whereby the basis is defined on a hyperrectangle containing the true domain. We find the coefficients of the basis expansion by solving an optimization problem whereby both the equations, the boundary conditions and any optimization targets are enforced by a loss function, building on a key idea from Physics-Informed Neural Networks (PINNs). Since the representation of the function natively has exponential convergence, so does the solution of the optimization problem, as long as it can be solved efficiently. We find empirically that the optimization protocols developed for machine learning find solutions with exponential convergence on a wide range of equations. The method naturally allows for the incorporation of data assimilation by including additional terms in the loss function, and for the efficient solution of optimization problems over the PDE solutions.

[LG-4] Convolutional Spiking-based GRU Cell for Spatio-temporal Data

链接: https://arxiv.org/abs/2510.25696
作者: Yesmine Abdennadher,Eleonora Cicciarella,Michele Rossi
类目: Machine Learning (cs.LG)
*备注: 6 pages, 1 figure. Published in 2025 IEEE International Workshop On Machine Learning for Signal Processing, Aug. 31-Sep. 3, 2025, Istanbul, Turkey

点击查看摘要

Abstract:Spike-based temporal messaging enables SNNs to efficiently process both purely temporal and spatio-temporal time-series or event-driven data. Combining SNNs with Gated Recurrent Units (GRUs), a variant of recurrent neural networks, gives rise to a robust framework for sequential data processing; however, traditional RNNs often lose local details when handling long sequences. Previous approaches, such as SpikGRU, fail to capture fine-grained local dependencies in event-based spatio-temporal data. In this paper, we introduce the Convolutional Spiking GRU (CS-GRU) cell, which leverages convolutional operations to preserve local structure and dependencies while integrating the temporal precision of spiking neurons with the efficient gating mechanisms of GRUs. This versatile architecture excels on both temporal datasets (NTIDIGITS, SHD) and spatio-temporal benchmarks (MNIST, DVSGesture, CIFAR10DVS). Our experiments show that CS-GRU outperforms state-of-the-art GRU variants by an average of 4.35%, achieving over 90% accuracy on sequential tasks and up to 99.31% on MNIST. It is worth noting that our solution achieves 69% higher efficiency compared to SpikGRU. The code is available at: this https URL.
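Since the abstract builds on convolutional GRU gating, a generic ConvGRU cell is sketched below: the GRU gates are computed with 2-D convolutions so that local spatial structure is preserved across time steps. Note this is the standard non-spiking variant; CS-GRU additionally replaces activations with spiking-neuron dynamics, which is not modeled here.

```python
import torch

class ConvGRUCell(torch.nn.Module):
    """Generic convolutional GRU cell: GRU gates computed with 2-D
    convolutions, preserving local spatial structure across time steps."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = torch.nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=k // 2)
        self.cand = torch.nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=k // 2)

    def forward(self, x, h):
        z, r = torch.chunk(torch.sigmoid(self.gates(torch.cat([x, h], 1))), 2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * h_tilde

cell = ConvGRUCell(2, 16)
h = torch.zeros(1, 16, 32, 32)
for t in range(10):                       # e.g., 10 event-frame time bins
    h = cell(torch.randn(1, 2, 32, 32), h)
```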

[LG-5] A Configuration-First Framework for Reproducible Low-Code Localization

链接: https://arxiv.org/abs/2510.25692
作者: Tim Strnad(Jožef Stefan Institute, Slovenia),Blaž Bertalanič(Jožef Stefan Institute, Slovenia),Carolina Fortuna(Jožef Stefan Institute, Slovenia)
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 20 pages, 7 figures. Preprint submitted to ACM Transactions on Software Engineering and Methodology (TOSEM), 2025

点击查看摘要

Abstract:Machine learning is increasingly permeating radio-based localization services. To keep results credible and comparable, everyday workflows should make rigorous experiment specification and exact repeatability the default, without blocking advanced experimentation. However, in practice, researchers face a three-way gap that could be filled by a framework that offers (i) low coding effort for end-to-end studies, (ii) reproducibility by default including versioned code, data, and configurations, controlled randomness, isolated runs, and recorded artifacts, and (iii) built-in extensibility so new models, metrics, and stages can be added with minimal integration effort. Existing tools rarely deliver all three for machine learning in general and localization workflows in particular. In this paper we introduce LOCALIZE, a low-code, configuration-first framework for radio localization in which experiments are declared in human-readable configuration, a workflow orchestrator runs standardized pipelines from data preparation to reporting, and all artifacts, such as datasets, models, metrics, and reports, are versioned. The preconfigured, versioned datasets reduce initial setup and boilerplate, speeding up model development and evaluation. The design, with clear extension points, allows experts to add components without reworking the infrastructure. In a qualitative comparison and a head-to-head study against a plain Jupyter notebook baseline, we show that the framework reduces authoring effort while maintaining comparable runtime and memory behavior. Furthermore, using a Bluetooth Low Energy dataset, we show that scaling across training data (1x to 10x) keeps orchestration overheads bounded as data grows. Overall, the framework makes reproducible machine-learning-based localization experimentation practical, accessible, and extensible.

[LG-6] Model Inversion Attacks Meet Cryptographic Fuzzy Extractors

链接: https://arxiv.org/abs/2510.25687
作者: Mallika Prabhakar,Louise Xu,Prateek Saxena
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Model inversion attacks pose an open challenge to privacy-sensitive applications that use machine learning (ML) models. For example, face authentication systems use modern ML models to compute embedding vectors from face images of the enrolled users and store them. If leaked, inversion attacks can accurately reconstruct user faces from the leaked vectors. There is no systematic characterization of properties needed in an ideal defense against model inversion, even for the canonical example application of a face authentication system susceptible to data breaches, despite a decade of best-effort solutions. In this paper, we formalize the desired properties of a provably strong defense against model inversion and connect it, for the first time, to the cryptographic concept of fuzzy extractors. We further show that existing fuzzy extractors are insecure for use in ML-based face authentication. We do so through a new model inversion attack called PIPE, which achieves a success rate of over 89% in most cases against prior schemes. We then propose L2FE-Hash, the first candidate fuzzy extractor which supports standard Euclidean distance comparators as needed in many ML-based applications, including face authentication. We formally characterize its computational security guarantees, even in the extreme threat model of full breach of stored secrets, and empirically show its usable accuracy in face authentication for practical face distributions. It offers attack-agnostic security without requiring any re-training of the ML model it protects. Empirically, it nullifies both prior state-of-the-art inversion attacks as well as our new PIPE attack.

[LG-7] Mechanistic Interpretability of RNNs emulating Hidden Markov Models NEURIPS2025

链接: https://arxiv.org/abs/2510.25674
作者: Elia Torre,Michele Viscione,Lucas Pompe,Benjamin F Grewe,Valerio Mante
类目: Machine Learning (cs.LG)
*备注: Accepted at NeurIPS 2025

点击查看摘要

Abstract:Recurrent neural networks (RNNs) provide a powerful approach in neuroscience to infer latent dynamics in neural populations and to generate hypotheses about the neural computations underlying behavior. However, past work has focused on relatively simple, input-driven, and largely deterministic behaviors - little is known about the mechanisms that would allow RNNs to generate the richer, spontaneous, and potentially stochastic behaviors observed in natural settings. Modeling with Hidden Markov Models (HMMs) has revealed a segmentation of natural behaviors into discrete latent states with stochastic transitions between them, a type of dynamics that may appear at odds with the continuous state spaces implemented by RNNs. Here we first show that RNNs can replicate HMM emission statistics and then reverse-engineer the trained networks to uncover the mechanisms they implement. In the absence of inputs, the activity of trained RNNs collapses towards a single fixed point. When driven by stochastic input, trajectories instead exhibit noise-sustained dynamics along closed orbits. Rotation along these orbits modulates the emission probabilities and is governed by transitions between regions of slow, noise-driven dynamics connected by fast, deterministic transitions. The trained RNNs develop highly structured connectivity, with a small set of “kick neurons” initiating transitions between these regions. This mechanism emerges during training as the network shifts into a regime of stochastic resonance, enabling it to perform probabilistic computations. Analyses across multiple HMM architectures - fully connected, cyclic, and linear-chain - reveal that this solution generalizes through the modular reuse of the same dynamical motif, suggesting a compositional principle by which RNNs can emulate complex discrete latent dynamics.

[LG-8] Spectral Perturbation Bounds for Low-Rank Approximation with Applications to Privacy NEURIPS2025

链接: https://arxiv.org/abs/2510.25670
作者: Phuc Tran,Nisheeth K. Vishnoi,Van H. Vu
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Numerical Analysis (math.NA); Spectral Theory (math.SP)
*备注: NeurIPS 2025

点击查看摘要

Abstract:A central challenge in machine learning is to understand how noise or measurement errors affect low-rank approximations, particularly in the spectral norm. This question is especially important in differentially private low-rank approximation, where one aims to preserve the top-$p$ structure of a data-derived matrix while ensuring privacy. Prior work often analyzes Frobenius norm error or changes in reconstruction quality, but these metrics can over- or under-estimate true subspace distortion. The spectral norm, by contrast, captures worst-case directional error and provides the strongest utility guarantees. We establish new high-probability spectral-norm perturbation bounds for symmetric matrices that refine the classical Eckart–Young–Mirsky theorem and explicitly capture interactions between a matrix $A \in \mathbb{R}^{n \times n}$ and an arbitrary symmetric perturbation $E$. Under mild eigengap and norm conditions, our bounds yield sharp estimates for $\|(A + E)_p - A_p\|$, where $A_p$ is the best rank-$p$ approximation of $A$, with improvements of up to a factor of $\sqrt{n}$. As an application, we derive improved utility guarantees for differentially private PCA, resolving an open problem in the literature. Our analysis relies on a novel contour bootstrapping method from complex analysis and extends it to a broad class of spectral functionals, including polynomials and matrix exponentials. Empirical results on real-world datasets confirm that our bounds closely track the actual spectral error under diverse perturbation regimes.

[LG-9] Uncertainty Quantification for Regression: A Unified Framework based on kernel scores

链接: https://arxiv.org/abs/2510.25599
作者: Christopher Bülte,Yusuf Sale,Gitta Kutyniok,Eyke Hüllermeier
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Regression tasks, notably in safety-critical domains, require proper uncertainty quantification, yet the literature remains largely classification-focused. In this light, we introduce a family of measures for total, aleatoric, and epistemic uncertainty based on proper scoring rules, with a particular emphasis on kernel scores. The framework unifies several well-known measures and provides a principled recipe for designing new ones whose behavior, such as tail sensitivity, robustness, and out-of-distribution responsiveness, is governed by the choice of kernel. We prove explicit correspondences between kernel-score characteristics and downstream behavior, yielding concrete design guidelines for task-specific measures. Extensive experiments demonstrate that these measures are effective in downstream tasks and reveal clear trade-offs among instantiations, including robustness and out-of-distribution detection performance.

[LG-10] Generalized Sobolev IPM for Graph-Based Measures

链接: https://arxiv.org/abs/2510.25591
作者: Tam Le,Truyen Nguyen,Hideitsu Hino,Kenji Fukumizu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study the Sobolev IPM problem for measures supported on a graph metric space, where the critic function is constrained to lie within the unit ball defined by the Sobolev norm. While Le et al. (2025) achieved scalable computation by relating the Sobolev norm to a weighted $L^p$-norm, the resulting framework remains intrinsically bound to the $L^p$ geometric structure, limiting its ability to incorporate alternative structural priors beyond the $L^p$ geometry paradigm. To overcome this limitation, we propose to generalize Sobolev IPM through the lens of Orlicz geometric structure, which employs convex functions to capture nuanced geometric relationships, building upon recent advances in optimal transport theory – particularly Orlicz-Wasserstein (OW) and generalized Sobolev transport – that have proven instrumental in advancing machine learning methodologies. This generalization encompasses classical Sobolev IPM as a special case while accommodating diverse geometric priors beyond the traditional $L^p$ structure. It however brings up significant computational hurdles that compound those already inherent in Sobolev IPM. To address these challenges, we establish a novel theoretical connection between the Orlicz-Sobolev norm and the Musielak norm which facilitates a novel regularization for the generalized Sobolev IPM (GSI). By further exploiting the underlying graph structure, we show that GSI with Musielak regularization (GSI-M) reduces to a simple univariate optimization problem, achieving remarkable computational efficiency. Empirically, GSI-M is several orders of magnitude faster than the popular OW in computation, and demonstrates its practical advantages in comparing probability measures on a given graph for document classification and several tasks in topological data analysis.

[LG-11] Learning-Augmented Online Bidding in Stochastic Settings

链接: https://arxiv.org/abs/2510.25582
作者: Spyros Angelopoulos,Bertrand Simon
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Online bidding is a classic optimization problem, with several applications in online decision-making, the design of interruptible systems, and the analysis of approximation algorithms. In this work, we study online bidding under learning-augmented settings that incorporate stochasticity, in either the prediction oracle or the algorithm itself. In the first part, we study bidding under distributional predictions, and find Pareto-optimal algorithms that offer the best-possible tradeoff between the consistency and the robustness of the algorithm. In the second part, we study the power and limitations of randomized bidding algorithms, by presenting upper and lower bounds on the consistency/robustness tradeoffs. Previous works focused predominantly on oracles that do not leverage stochastic information on the quality of the prediction, and deterministic algorithms.

[LG-12] Perturbation Bounds for Low-Rank Inverse Approximations under Noise NEURIPS2025

链接: https://arxiv.org/abs/2510.25571
作者: Phuc Tran,Nisheeth K. Vishnoi
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Numerical Analysis (math.NA); Spectral Theory (math.SP); Statistics Theory (math.ST)
*备注: NeurIPS 2025

点击查看摘要

Abstract:Low-rank pseudoinverses are widely used to approximate matrix inverses in scalable machine learning, optimization, and scientific computing. However, real-world matrices are often observed with noise, arising from sampling, sketching, and quantization. The spectral-norm robustness of low-rank inverse approximations remains poorly understood. We systematically study the spectral-norm error $\|(\tilde{A}^{-1})_p - A_p^{-1}\|$ for an $n \times n$ symmetric matrix $A$, where $A_p^{-1}$ denotes the best rank-$p$ approximation of $A^{-1}$, and $\tilde{A} = A + E$ is a noisy observation. Under mild assumptions on the noise, we derive sharp non-asymptotic perturbation bounds that reveal how the error scales with the eigengap, spectral decay, and noise alignment with low-curvature directions of $A$. Our analysis introduces a novel application of contour integral techniques to the non-entire function $f(z) = 1/z$, yielding bounds that improve over naive adaptations of classical full-inverse bounds by up to a factor of $\sqrt{n}$. Empirically, our bounds closely track the true perturbation error across a variety of real-world and synthetic matrices, while estimates based on classical results tend to significantly overpredict. These findings offer practical, spectrum-aware guarantees for low-rank inverse approximations in noisy computational environments.

[LG-13] A Framework for Bounding Deterministic Risk with PAC-Bayes: Applications to Majority Votes

链接: https://arxiv.org/abs/2510.25569
作者: Benjamin Leblanc,Pascal Germain
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:PAC-Bayes is a popular and efficient framework for obtaining generalization guarantees in situations involving uncountable hypothesis spaces. Unfortunately, in its classical formulation, it only provides guarantees on the expected risk of a randomly sampled hypothesis. This requires stochastic predictions at test time, making PAC-Bayes unusable in many practical situations where a single deterministic hypothesis must be deployed. We propose a unified framework to extract guarantees holding for a single hypothesis from stochastic PAC-Bayesian guarantees. We present a general oracle bound and derive from it a numerical bound and a specialization to majority vote. We empirically show that our approach consistently outperforms popular baselines (by up to a factor of 2) when it comes to generalization bounds on deterministic classifiers.

[LG-14] ransformers Provably Learn Directed Acyclic Graphs via Kernel-Guided Mutual Information

链接: https://arxiv.org/abs/2510.25542
作者: Yuan Cheng,Yu Huang,Zhe Xiong,Yingbin Liang,Vincent Y. F. Tan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Uncovering hidden graph structures underlying real-world data is a critical challenge with broad applications across scientific domains. Recently, transformer-based models leveraging the attention mechanism have demonstrated strong empirical success in capturing complex dependencies within graphs. However, the theoretical understanding of their training dynamics has been limited to tree-like graphs, where each node depends on a single parent. Extending provable guarantees to more general directed acyclic graphs (DAGs) – which involve multiple parents per node – remains challenging, primarily due to the difficulty in designing training objectives that enable different attention heads to separately learn multiple different parent relationships. In this work, we address this problem by introducing a novel information-theoretic metric: the kernel-guided mutual information (KG-MI), based on the $f$-divergence. Our objective combines KG-MI with a multi-head attention framework, where each head is associated with a distinct marginal transition kernel to model diverse parent-child dependencies effectively. We prove that, given sequences generated by a $K$-parent DAG, training a single-layer, multi-head transformer via gradient ascent converges to the global optimum in polynomial time. Furthermore, we characterize the attention score patterns at convergence. In addition, when particularizing the $f$-divergence to the KL divergence, the learned attention scores accurately reflect the ground-truth adjacency matrix, thereby provably recovering the underlying graph structure. Experimental results validate our theoretical findings.

[LG-15] Support Vector Machine-Based Burnout Risk Prediction with an Interactive Interface for Organizational Use

链接: https://arxiv.org/abs/2510.25509
作者: Bruno W. G. Teodosio,Mário J. O. T. Lira,Pedro H. M. Araújo,Lucas R. C. Farias
类目: Machine Learning (cs.LG)
*备注: 12 pages, including figures and references. Streamlit app available at: this https URL

点击查看摘要

Abstract:Burnout is a psychological syndrome marked by emotional exhaustion, depersonalization, and reduced personal accomplishment, with a significant impact on individual well-being and organizational performance. This study proposes a machine learning approach to predict burnout risk using the HackerEarth Employee Burnout Challenge dataset. Three supervised algorithms were evaluated: k-nearest neighbors (KNN), random forest, and support vector machine (SVM), with model performance assessed through 30-fold cross-validation using the coefficient of determination ($R^2$). Among the models tested, SVM achieved the highest predictive performance ($R^2 = 0.84$) and was statistically superior to KNN and random forest based on paired $t$-tests. To ensure practical applicability, an interactive interface was developed using Streamlit, allowing non-technical users to input data and receive burnout risk predictions. The results highlight the potential of machine learning to support early detection of burnout and promote data-driven mental health strategies in organizational settings.
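
The evaluation protocol described here (an SVM regressor scored with R^2 under k-fold cross-validation) is easy to reproduce in outline with scikit-learn; a sketch with synthetic data standing in for the HackerEarth dataset follows, using 5 folds instead of 30 to keep it fast.

```python
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_regression

# Synthetic data stands in for the burnout dataset; hyperparameters are
# illustrative, not the values tuned in the paper.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"mean R^2: {scores.mean():.3f}")
```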

[LG-16] Right for the Right Reason s: Avoiding Reasoning Shortcuts via Prototypical Neurosymbolic AI

链接: https://arxiv.org/abs/2510.25497
作者: Luca Andolfi,Eleonora Giunchiglia
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neurosymbolic AI is growing in popularity thanks to its ability to combine neural perception and symbolic reasoning in end-to-end trainable models. However, recent findings reveal these are prone to shortcut reasoning, i.e., to learning unintended concepts–or neural predicates–which exploit spurious correlations to satisfy the symbolic constraints. In this paper, we address reasoning shortcuts at their root cause and we introduce prototypical neurosymbolic architectures. These models are able to satisfy the symbolic constraints (be right) because they have learnt the correct basic concepts (for the right reasons) and not because of spurious correlations, even in extremely low data regimes. Leveraging the theory of prototypical learning, we demonstrate that we can effectively avoid reasoning shortcuts by training the models to satisfy the background knowledge while taking into account the similarity of the input with respect to the handful of labelled datapoints. We extensively validate our approach on the recently proposed rsbench benchmark suite in a variety of settings and tasks with very scarce supervision: we show significant improvements in learning the right concepts both in synthetic tasks (MNIST-EvenOdd and Kand-Logic) and real-world, high-stake ones (BDD-OIA). Our findings pave the way to prototype grounding as an effective, annotation-efficient strategy for safe and reliable neurosymbolic learning.

[LG-17] Gradient-Weight Alignment as a Train-Time Proxy for Generalization in Classification Tasks NEURIPS2025

链接: https://arxiv.org/abs/2510.25480
作者: Florian A. Hölzl,Daniel Rueckert,Georgios Kaissis
类目: Machine Learning (cs.LG)
*备注: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

点击查看摘要

Abstract:Robust validation metrics remain essential in contemporary deep learning, not only to detect overfitting and poor generalization, but also to monitor training dynamics. In the supervised classification setting, we investigate whether interactions between training data and model weights can yield such a metric that both tracks generalization during training and attributes performance to individual training samples. We introduce Gradient-Weight Alignment (GWA), quantifying the coherence between per-sample gradients and model weights. We show that effective learning corresponds to coherent alignment, while misalignment indicates deteriorating generalization. GWA is efficiently computable during training and reflects both sample-specific contributions and dataset-wide learning dynamics. Extensive experiments show that GWA accurately predicts optimal early stopping, enables principled model comparisons, and identifies influential training samples, providing a validation-set-free approach for model analysis directly from the training data.
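
A rough sketch of a GWA-style diagnostic, assuming alignment is measured as cosine similarity between each sample's loss gradient and the current weights; the paper's exact definition and aggregation may differ.

```python
import torch

# Per-sample gradient-weight alignment on a toy linear classifier.
model = torch.nn.Linear(10, 3)
loss_fn = torch.nn.CrossEntropyLoss()
x, y = torch.randn(5, 10), torch.randint(0, 3, (5,))

alignments = []
for i in range(len(x)):
    model.zero_grad()
    loss_fn(model(x[i:i+1]), y[i:i+1]).backward()
    g = torch.cat([p.grad.flatten() for p in model.parameters()])
    w = torch.cat([p.detach().flatten() for p in model.parameters()])
    alignments.append(torch.nn.functional.cosine_similarity(g, w, dim=0))
print(torch.stack(alignments))   # per-sample alignment scores
```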

[LG-18] A Deep Learning Framework for Multi-Operator Learning: Architectures and Approximation Theory

链接: https://arxiv.org/abs/2510.25379
作者: Adrien Weihs,Jingmin Sun,Zecheng Zhang,Hayden Schaeffer
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:While many problems in machine learning focus on learning mappings between finite-dimensional spaces, scientific applications require approximating mappings between function spaces, i.e., operators. We study the problem of learning collections of operators and provide both theoretical and empirical advances. We distinguish between two regimes: (i) multiple operator learning, where a single network represents a continuum of operators parameterized by a parametric function, and (ii) learning several distinct single operators, where each operator is learned independently. For the multiple operator case, we introduce two new architectures, MNO and MONet, and establish universal approximation results in three settings: continuous, integrable, or Lipschitz operators. For the latter, we further derive explicit scaling laws that quantify how the network size must grow to achieve a target approximation accuracy. For learning several single operators, we develop a framework for balancing architectural complexity across subnetworks and show how approximation order determines computational efficiency. Empirical experiments on parametric PDE benchmarks confirm the strong expressive power and efficiency of the proposed architectures. Overall, this work establishes a unified theoretical and practical foundation for scalable neural operator learning across multiple operators.

[LG-19] Parameter Averag ing in Link Prediction

链接: https://arxiv.org/abs/2510.25361
作者: Rupesh Sapkota,Caglar Demir,Arnab Sharma,Axel-Cyrille Ngonga Ngomo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Ensemble methods are widely employed to improve generalization in machine learning. This has also prompted the adoption of ensemble learning for the knowledge graph embedding (KGE) models in performing link prediction. Typical approaches to this end train multiple models as part of the ensemble, and the diverse predictions are then averaged. However, this approach has some significant drawbacks. For instance, the computational overhead of training multiple models increases latency and memory overhead. In contrast, model merging approaches offer a promising alternative that does not require training multiple models. In this work, we introduce model merging, specifically weighted averaging, in KGE models. Herein, a running average of model parameters from a training epoch onward is maintained and used for predictions. To address this, we additionally propose an approach that selectively updates the running average of the ensemble model parameters only when the generalization performance improves on a validation dataset. We evaluate these two different weighted averaging approaches on link prediction tasks, comparing the state-of-the-art benchmark ensemble approach. Additionally, we evaluate the weighted averaging approach considering literal-augmented KGE models and multi-hop query answering tasks as well. The results demonstrate that the proposed weighted averaging approach consistently improves performance across diverse evaluation settings.
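
The selective variant described in the abstract can be sketched as a running parameter average that is only updated when validation performance improves; the snippet below uses a plain linear layer as a stand-in for a KGE model and a random number as a stand-in for the validation metric.

```python
import copy
import torch

def update_running_average(avg_state, model, n_merged):
    """Fold the model's current parameters into the running average."""
    for k, v in model.state_dict().items():
        avg_state[k] = (avg_state[k] * n_merged + v) / (n_merged + 1)
    return avg_state, n_merged + 1

model = torch.nn.Linear(16, 8)                  # stand-in for a KGE model
avg_state = copy.deepcopy(model.state_dict())
n_merged, best_val = 1, float("-inf")

for epoch in range(10):
    # ... one training epoch would run here ...
    val_score = torch.rand(()).item()           # stand-in validation metric
    if val_score > best_val:                    # selective update rule
        best_val = val_score
        avg_state, n_merged = update_running_average(avg_state, model, n_merged)
# avg_state now holds the validation-gated parameter average for prediction.
```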

[LG-20] Analysis of Semi-Supervised Learning on Hypergraphs

链接: https://arxiv.org/abs/2510.25354
作者: Adrien Weihs,Andrea Bertozzi,Matthew Thorpe
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Hypergraphs provide a natural framework for modeling higher-order interactions, yet their theoretical underpinnings in semi-supervised learning remain limited. We provide an asymptotic consistency analysis of variational learning on random geometric hypergraphs, precisely characterizing the conditions ensuring the well-posedness of hypergraph learning as well as showing convergence to a weighted $p$-Laplacian equation. Motivated by this, we propose Higher-Order Hypergraph Learning (HOHL), which regularizes via powers of Laplacians from skeleton graphs for multiscale smoothness. HOHL converges to a higher-order Sobolev seminorm. Empirically, it performs strongly on standard baselines.

[LG-21] Beyond Leakage and Complexity: Towards Realistic and Efficient Information Cascade Prediction

链接: https://arxiv.org/abs/2510.25348
作者: Jie Peng,Rui Wang,Qiang Wang,Zhewei Wei,Bin Tong,Guan Wang
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:


Abstract:Information cascade popularity prediction is a key problem in analyzing content diffusion in social networks. However, current related works suffer from three critical limitations: (1) temporal leakage in current evaluation–random cascade-based splits allow models to access future information, yielding unrealistic results; (2) feature-poor datasets that lack downstream conversion signals (e.g., likes, comments, or purchases), which limits more practical applications; (3) computational inefficiency of complex graph-based methods that require days of training for marginal gains. We systematically address these challenges from three perspectives: task setup, dataset construction, and model design. First, we propose a time-ordered splitting strategy that chronologically partitions data into consecutive windows, ensuring models are evaluated on genuine forecasting tasks without future information leakage. Second, we introduce Taoke, a large-scale e-commerce cascade dataset featuring rich promoter/product attributes and ground-truth purchase conversions–capturing the complete diffusion lifecycle from promotion to monetization. Third, we develop CasTemp, a lightweight framework that efficiently models cascade dynamics through temporal walks, Jaccard-based neighbor selection for inter-cascade dependencies, and GRU-based encoding with time-aware attention. Under leak-free evaluation, CasTemp achieves state-of-the-art performance across four datasets with orders-of-magnitude speedup. Notably, it excels at predicting second-stage popularity conversions–a practical task critical for real-world applications.
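
The time-ordered splitting strategy is simple to state in code. Below is a minimal sketch, assuming one timestamp per cascade; the expanding-train convention and window count are illustrative assumptions rather than the paper's exact protocol.

```python
import numpy as np

def time_ordered_splits(timestamps, n_windows=5):
    """Partition cascades into consecutive chronological windows and
    build (train, test) pairs that never expose future information."""
    order = np.argsort(timestamps)               # oldest -> newest
    windows = np.array_split(order, n_windows)   # consecutive chunks
    splits = []
    for i in range(1, n_windows):
        train = np.concatenate(windows[:i])      # everything before window i
        test = windows[i]                        # genuine forecasting target
        splits.append((train, test))
    return splits
```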

[LG-22] CDFlow: Building Invertible Layers with Circulant and Diagonal Matrices NEURIPS2025

链接: https://arxiv.org/abs/2510.25323
作者: Xuchen Feng,Siyu Liao
类目: Machine Learning (cs.LG)
*备注: Accepted at NeurIPS 2025. Camera-ready version. 10 pages, 12 figures, 2 tables


Abstract:Normalizing flows are deep generative models that enable efficient likelihood estimation and sampling through invertible transformations. A key challenge is to design linear layers that enhance expressiveness while maintaining efficient computation of the Jacobian determinant and inverse. We introduce a novel invertible linear layer based on the product of circulant and diagonal matrices. This decomposition reduces parameter complexity from \mathcal{O}(n^2) to \mathcal{O}(mn) using m diagonal matrices and m-1 circulant matrices while still approximating general linear transformations. By leveraging the Fast Fourier Transform, our approach reduces the time complexity of matrix inversion from \mathcal{O}(n^3) to \mathcal{O}(mn \log n) and that of computing the log-determinant from \mathcal{O}(n^3) to \mathcal{O}(mn), where n is the input dimension. We build upon this layer to develop Circulant-Diagonal Flow (CDFlow), which achieves strong density estimation on natural image datasets and effectively models data with inherent periodic structure. Furthermore, CDFlow significantly accelerates key operations in normalizing flows, providing practical benefits for scalable generative modeling.
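
The FFT claims follow from the standard diagonalization of circulant matrices, C = F^{-1} diag(Fc) F. A minimal NumPy sketch of the forward pass and log-determinant, assuming real, non-singular diagonal and circulant factors (not the authors' implementation):

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply by the circulant matrix with first column c in O(n log n)."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def cd_layer(x, diags, circs):
    """Apply D_m C_{m-1} ... C_1 D_1 to x (m diagonals, m-1 circulants)
    and accumulate log|det| from diagonal entries and FFT eigenvalues."""
    y, logdet = x, 0.0
    for i, d in enumerate(diags):
        y = d * y
        logdet += np.sum(np.log(np.abs(d)))
        if i < len(circs):
            y = circulant_matvec(circs[i], y)
            # eigenvalues of a circulant matrix are fft(first column)
            logdet += np.sum(np.log(np.abs(np.fft.fft(circs[i]))))
    return y, logdet
```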

[LG-23] Hierarchical Physics-Embedded Learning for Spatiotemporal Dynamical Systems

链接: https://arxiv.org/abs/2510.25306
作者: Xizhe Wang,Xiaobin Song,Qingshan Jia,Hongbo Zhao,Benben Jiang
类目: Machine Learning (cs.LG)
*备注:


Abstract:Modeling complex spatiotemporal dynamics, particularly in far-from-equilibrium systems, remains a grand challenge in science. The governing partial differential equations (PDEs) for these systems are often intractable to derive from first principles, due to their inherent complexity, characterized by high-order derivatives and strong nonlinearities, coupled with incomplete physical knowledge. This has spurred the development of data-driven methods, yet these approaches face limitations: Purely data-driven models are often physically inconsistent and data-intensive, while existing physics-informed methods lack the structural capacity to represent complex operators or systematically integrate partial physical knowledge. Here, we propose a hierarchical physics-embedded learning framework that fundamentally advances both the forward spatiotemporal prediction and inverse discovery of physical laws from sparse and noisy data. The key innovation is a two-level architecture that mirrors the process of scientific discovery: the first level learns fundamental symbolic components of a PDE, while the second learns their governing combinations. This hierarchical decomposition not only reduces learning complexity but, more importantly, enables a structural integration of prior knowledge. Known physical laws are directly embedded into the model's computational graph, guaranteeing physical consistency and improving data efficiency. By building the framework upon adaptive Fourier Neural Operators, we can effectively capture the non-local dependencies and high-order operators characteristic of dynamical systems. Additionally, by structurally decoupling known and unknown terms, the framework further enables interpretable discovery of underlying governing equations through symbolic regression, without presupposing functional forms.

[LG-24] On the Stability of Neural Networks in Deep Learning

链接: https://arxiv.org/abs/2510.25282
作者: Blaise Delattre
类目: Machine Learning (cs.LG)
*备注:


Abstract:Deep learning has achieved remarkable success across a wide range of tasks, but its models often suffer from instability and vulnerability: small changes to the input may drastically affect predictions, while optimization can be hindered by sharp loss landscapes. This thesis addresses these issues through the unifying perspective of sensitivity analysis, which examines how neural networks respond to perturbations at both the input and parameter levels. We study Lipschitz networks as a principled way to constrain sensitivity to input perturbations, thereby improving generalization, adversarial robustness, and training stability. To complement this architectural approach, we introduce regularization techniques based on the curvature of the loss function, promoting smoother optimization landscapes and reducing sensitivity to parameter variations. Randomized smoothing is also explored as a probabilistic method for enhancing robustness at decision boundaries. By combining these perspectives, we develop a unified framework where Lipschitz continuity, randomized smoothing, and curvature regularization interact to address fundamental challenges in stability. The thesis contributes both theoretical analysis and practical methodologies, including efficient spectral norm computation, novel Lipschitz-constrained layers, and improved certification procedures.

[LG-25] BSFA: Leveraging the Subspace Dichotomy to Accelerate Neural Network Training

链接: https://arxiv.org/abs/2510.25244
作者: Wenjie Zhou,Bohan Wang,Wei Chen,Xueqi Cheng
类目: Machine Learning (cs.LG)
*备注: 16 pages


Abstract:Recent studies \citep{gur2018gradient, song2024does, wen2024understanding} highlight a fundamental dichotomy in deep learning optimization: Although parameter updates along the top eigendirections of the loss Hessian (Dom-space) capture most of the update magnitude, they often contribute minimally to loss reduction. In contrast, updates in the orthogonal component (Bulk-space) have smaller magnitudes but drive most learning progress. In this work, we further advance the understanding of this phenomenon and introduce the \textbf{Bulk-Space-Filtration-Accelerator} (BSFA), a novel plug-and-play framework. BSFA accelerates training by differentially scaling update components projected onto these distinct subspaces, simultaneously enhancing stability by moderating updates in the dominant subspace and boosting convergence speed by amplifying those in the bulk-space. To ensure BSFA is both practical and scalable for contemporary large models, we introduce two key innovations: an efficient estimator using Principal Component Analysis (PCA) on historical updates for fast subspace estimation, and a block-wise strategy that applies this estimation on a per-parameter-block basis. These designs make BSFA computationally tractable and highly effective. We demonstrate BSFA's acceleration across various tasks, notably achieving approximately 2\times speedup when pre-training LLaMA-72M on WikiText-103 and LLaMA-134M on OpenWebText compared to vanilla AdamW.
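
A minimal sketch of the PCA-based subspace split on a flattened update vector; the rank `k`, the scaling factors, and the history handling are illustrative assumptions rather than BSFA's tuned settings.

```python
import numpy as np

def bsfa_scale(update, history, k=8, dom_scale=0.5, bulk_scale=2.0):
    """Project an update onto the top-k PCA directions of recent updates
    (a proxy for the dominant Hessian subspace), then rescale the
    dominant and bulk components differently (requires k <= len(history))."""
    H = np.stack(history)                    # (t, d) recent flattened updates
    H = H - H.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(H, full_matrices=False)
    V = Vt[:k].T                             # (d, k) estimated Dom-space basis
    dom = V @ (V.T @ update)                 # component in the Dom-space
    bulk = update - dom                      # orthogonal Bulk-space component
    return dom_scale * dom + bulk_scale * bulk
```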

[LG-26] Selective Learning for Deep Time Series Forecasting NEURIPS2025

链接: https://arxiv.org/abs/2510.25207
作者: Yisong Fu,Zezhi Shao,Chengqing Yu,Yujie Li,Zhulin An,Qi Wang,Yongjun Xu,Fei Wang
类目: Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2025


Abstract:Benefiting from high capacity for capturing complex temporal patterns, deep learning (DL) has significantly advanced time series forecasting (TSF). However, deep models tend to suffer from severe overfitting due to the inherent vulnerability of time series to noise and anomalies. The prevailing DL paradigm uniformly optimizes all timesteps through the MSE loss and fits uncertain and anomalous timesteps without distinction, ultimately resulting in overfitting. To address this, we propose a novel selective learning strategy for deep TSF. Specifically, selective learning screens a subset of the whole timesteps to calculate the MSE loss in optimization, guiding the model to focus on generalizable timesteps while disregarding non-generalizable ones. Our framework introduces a dual-mask mechanism to target timesteps: (1) an uncertainty mask leveraging residual entropy to filter uncertain timesteps, and (2) an anomaly mask employing residual lower bound estimation to exclude anomalous timesteps. Extensive experiments across eight real-world datasets demonstrate that selective learning can significantly improve the predictive performance for typical state-of-the-art deep models, including 37.4% MSE reduction for Informer, 8.4% for TimesNet, and 6.5% for iTransformer.
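
To make the dual-mask idea concrete, here is a sketch of a masked MSE loss; simple quantile thresholds stand in for the paper's residual-entropy and residual-lower-bound estimators, so the masking rules are assumptions.

```python
import torch

def selective_mse(pred, target, residual_history, uncert_q=0.90, anom_q=0.99):
    """MSE over timesteps that pass both masks: drop residuals flagged as
    uncertain (uncert_q quantile) or anomalous (anom_q quantile)."""
    residual = (pred - target).abs()
    uncert_mask = residual < torch.quantile(residual_history, uncert_q)
    anom_mask = residual < torch.quantile(residual_history, anom_q)
    keep = uncert_mask & anom_mask
    if keep.sum() == 0:                      # fall back to plain MSE
        return ((pred - target) ** 2).mean()
    return ((pred - target)[keep] ** 2).mean()
```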

[LG-27] Machine Learning and CPU (Central Processing Unit) Scheduling Co-Optimization over a Network of Computing Centers

链接: https://arxiv.org/abs/2510.25176
作者: Mohammadreza Doostmohammadian,Zulfiya R. Gabidullina,Hamid R. Rabiee
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注: EAAI Journal


Abstract:In the rapidly evolving research on artificial intelligence (AI), the demand for fast, computationally efficient, and scalable solutions has increased in recent years. The problem of optimizing the computing resources for distributed machine learning (ML) and optimization is considered in this paper. Given a set of data distributed over a network of computing-nodes/servers, the idea is to optimally assign the CPU (central processing unit) usage while simultaneously training each computing node locally via its own share of data. This formulates the problem as a co-optimization setup to (i) optimize the data processing and (ii) optimally allocate the computing resources. The information-sharing network among the nodes might be time-varying, but with balanced weights to ensure consensus-type convergence of the algorithm. The algorithm is all-time feasible, which implies that the computing resource-demand balance constraint holds at all iterations of the proposed solution. Moreover, the solution allows addressing possible log-scale quantization over the information-sharing channels to exchange log-quantized data. For some example applications, distributed support-vector-machine (SVM) and regression are considered as the ML training models. Results from perturbation theory, along with Lyapunov stability and eigen-spectrum analysis, are used to prove the convergence towards the optimal case. As compared to existing CPU scheduling solutions, the proposed algorithm improves the cost optimality gap by more than 50%.

[LG-28] Machine Learning Guided Optimal Transmission Switching to Mitigate Wildfire Ignition Risk

链接: https://arxiv.org/abs/2510.25147
作者: Weimin Huang,Ryan Piansky,Bistra Dilkina,Daniel K. Molzahn
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:


Abstract:To mitigate acute wildfire ignition risks, utilities de-energize power lines in high-risk areas. The Optimal Power Shutoff (OPS) problem optimizes line energization statuses to manage wildfire ignition risks through de-energizations while reducing load shedding. OPS problems are computationally challenging Mixed-Integer Linear Programs (MILPs) that must be solved rapidly and frequently in operational settings. For a particular power system, OPS instances share a common structure with varying parameters related to wildfire risks, loads, and renewable generation. This motivates the use of Machine Learning (ML) for solving OPS problems by exploiting shared patterns across instances. In this paper, we develop an ML-guided framework that quickly produces high-quality de-energization decisions by extending existing ML-guided MILP solution methods while integrating domain knowledge on the number of energized and de-energized lines. Results on a large-scale realistic California-based synthetic test system show that the proposed ML-guided method produces high-quality solutions faster than traditional optimization methods.

[LG-29] An Analysis of Causal Effect Estimation using Outcome Invariant Data Augmentation NEURIPS2025

链接: https://arxiv.org/abs/2510.25128
作者: Uzair Akbar,Niki Kilbertus,Hao Shen,Krikamol Muandet,Bo Dai
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at NeurIPS 2025


Abstract:The technique of data augmentation (DA) is often used in machine learning for regularization purposes to better generalize under i.i.d. settings. In this work, we present a unifying framework with topics in causal inference to make a case for the use of DA beyond just the i.i.d. setting, but for generalization across interventions as well. Specifically, we argue that when the outcome generating mechanism is invariant to our choice of DA, then such augmentations can effectively be thought of as interventions on the treatment generating mechanism itself. This can potentially help to reduce bias in causal effect estimation arising from hidden confounders. In the presence of such unobserved confounding we typically make use of instrumental variables (IVs) – sources of treatment randomization that are conditionally independent of the outcome. However, IVs may not be as readily available as DA for many applications, which is the main motivation behind this work. By appropriately regularizing IV based estimators, we introduce the concept of IV-like (IVL) regression for mitigating confounding bias and improving predictive performance across interventions even when certain IV properties are relaxed. Finally, we cast parameterized DA as an IVL regression problem and show that, when used in composition, it can simulate a worst-case application of such DA, further improving performance on causal estimation and generalization tasks beyond what simple DA may offer. This is shown both theoretically for the population case and via simulation experiments for the finite sample case using a simple linear example. We also present real data experiments to support our case.

[LG-30] A Unified Bilevel Model for Adversarial Learning and A Case Study

链接: https://arxiv.org/abs/2510.25121
作者: Yutong Zheng,Qingna Li
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:


Abstract:Adversarial learning has been attracting more and more attention thanks to the fast development of machine learning and artificial intelligence. However, due to the complicated structure of most machine learning models, the mechanism of adversarial attacks is not well understood, and how to measure the effect of an attack remains unclear. In this paper, we propose a unified bilevel model for adversarial learning. We further investigate the adversarial attack in clustering models and interpret it from a data perturbation point of view. We reveal that when the data perturbation is relatively small, the clustering model is robust, whereas if it is relatively large, the clustering result changes, which leads to an attack. To measure the effect of attacks on clustering models, we analyse the well-definedness of the so-called \delta-measure, which can be used in the proposed bilevel model for adversarial learning of clustering models.

[LG-31] Energy Approach from ε-Graph to Continuum Diffusion Model with Connectivity Functional

链接: https://arxiv.org/abs/2510.25114
作者: Yahong Yang,Sun Lee,Jeff Calder,Wenrui Hao
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:


Abstract:We derive an energy-based continuum limit for \varepsilon-graphs endowed with a general connectivity functional. We prove that the discrete energy and its continuum counterpart differ by at most O(\varepsilon); the prefactor involves only the W^{1,1}-norm of the connectivity density as \varepsilon \to 0, so the error bound remains valid even when that density has strong local fluctuations. As an application, we introduce a neural-network procedure that reconstructs the connectivity density from edge-weight data and then embeds the resulting continuum model into a brain-dynamics framework. In this setting, the usual constant diffusion coefficient is replaced by the spatially varying coefficient produced by the learned density, yielding dynamics that differ significantly from those obtained with conventional constant-diffusion models.

[LG-32] Shift is Good: Mismatched Data Mixing Improves Test Performance

链接: https://arxiv.org/abs/2510.25108
作者: Marko Medvedev,Kaifeng Lyu,Zhiyuan Li,Nathan Srebro
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:


Abstract:We consider training and testing on mixture distributions with different training and test proportions. We show that in many settings, and in some sense generically, distribution shift can be beneficial, and test performance can improve due to mismatched training proportions, even if the components are unrelated and with no transfer between components. In a variety of scenarios, we identify the optimal training proportions and the extent to which such distribution shift can be beneficial. We show how the same analysis applies also to a compositional setting with differing distributions of component "skills" at training and test.

[LG-33] Continual Low-Rank Adapters for LLM-based Generative Recommender Systems

链接: https://arxiv.org/abs/2510.25093
作者: Hyunsik Yoo,Ting-Wei Li,SeongKu Kang,Zhining Liu,Charlie Xu,Qilin Qi,Hanghang Tong
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:


Abstract:While large language models (LLMs) achieve strong performance in recommendation, they face challenges in continual learning as users, items, and user preferences evolve over time. Existing LoRA-based continual methods primarily focus on preserving performance on previous tasks, but this overlooks the unique nature of recommendation: the goal is not to predict past preferences, and outdated preferences can even harm performance when current interests shift significantly. To address this, we propose PESO (Proximally rEgularized Single evolving lOra), a continual adaptation method for LoRA in recommendation. PESO introduces a proximal regularizer that anchors the current adapter to its most recent frozen state, enabling the model to flexibly balance adaptation and preservation, and to better capture recent user behaviors. Theoretically, we show that this proximal design provides data-aware, direction-wise guidance in the LoRA subspace. Empirically, PESO consistently outperforms existing LoRA-based continual learning methods.
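
The proximal anchor is compact to write down. A minimal sketch over lists of parameter tensors, where `frozen_params` is the most recent frozen LoRA snapshot and the weight `lam` is an illustrative placeholder:

```python
def peso_loss(rec_loss, lora_params, frozen_params, lam=0.1):
    """Recommendation loss plus a proximal term anchoring the current
    LoRA parameters (torch tensors) to their latest frozen state,
    trading off adaptation to new interactions against preservation."""
    prox = sum(((p - f.detach()) ** 2).sum()
               for p, f in zip(lora_params, frozen_params))
    return rec_loss + lam * prox
```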

[LG-34] Training Across Reservoirs: Using Numerical Differentiation To Couple Trainable Networks With Black-Box Reservoirs

链接: https://arxiv.org/abs/2510.25074
作者: Andrew Clark,Jack Moursounidis,Osmaan Rasouli,William Gan,Cooper Doyle,Anna Leontjeva
类目: Machine Learning (cs.LG)
*备注: 12 pages main, Appendix 10 pages, 6 figures in main body, 10 overall


Abstract:We introduce Bounded Numerical Differentiation (BOND), a perturbative method for estimating partial derivatives across network structures with inaccessible computational graphs. BOND demonstrates improved accuracy and scalability from existing perturbative methods, enabling new explorations of trainable architectures that integrate black-box functions. We observe that these black-box functions, realized in our experiments as fixed, untrained networks, can enhance model performance without increasing the number of trainable parameters. This improvement is achieved without extensive optimization of the architecture or properties of the black-box function itself. Our findings highlight the potential of leveraging fixed, non-trainable modules to expand model capacity, suggesting a path toward combining analogue and digital devices as a mechanism for scaling networks.
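
The perturbative primitive behind this kind of estimator is a bounded finite difference through the black box. A minimal sketch; BOND's actual estimator and bounding scheme may differ from this central-difference version.

```python
import numpy as np

def central_diff_directional(black_box, x, v, eps=1e-3):
    """Estimate the directional derivative of an inaccessible black-box
    function at x along v, with the perturbation size bounded by eps."""
    v = v / (np.linalg.norm(v) + 1e-12)       # keep the step bounded
    return (black_box(x + eps * v) - black_box(x - eps * v)) / (2 * eps)

# toy usage: a fixed, untrained random "reservoir" as the black box
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
reservoir = lambda x: np.tanh(W @ x)
print(central_diff_directional(reservoir, np.ones(4), np.eye(4)[0]))
```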

[LG-35] Dynamically Weighted Momentum with Adaptive Step Sizes for Efficient Deep Network Training

链接: https://arxiv.org/abs/2510.25042
作者: Zhifeng Wang,Longlong Li,Chunyan Zeng
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 45 pages, 12 figures


Abstract:Within the current sphere of deep learning research, despite the extensive application of optimization algorithms such as Stochastic Gradient Descent (SGD) and Adaptive Moment Estimation (Adam), there remains a pronounced inadequacy in their capability to address fluctuations in learning efficiency, meet the demands of complex models, and tackle non-convex optimization issues. These challenges primarily arise from the algorithms’ limitations in handling complex data structures and models, for instance, difficulties in selecting an appropriate learning rate, avoiding local optima, and navigating through high-dimensional spaces. To address these issues, this paper introduces a novel optimization algorithm named DWMGrad. This algorithm, building on the foundations of traditional methods, incorporates a dynamic guidance mechanism reliant on historical data to dynamically update momentum and learning rates. This allows the optimizer to flexibly adjust its reliance on historical information, adapting to various training scenarios. This strategy not only enables the optimizer to better adapt to changing environments and task complexities but also, as validated through extensive experimentation, demonstrates DWMGrad’s ability to achieve faster convergence rates and higher accuracies under a multitude of scenarios.
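
As a rough stand-in for the history-guided mechanism, the sketch below modulates momentum and step size by the agreement between the momentum buffer and the current gradient; this specific rule is an assumption for illustration, not DWMGrad's published update.

```python
import numpy as np

def dwm_step(param, grad, state, base_lr=1e-3, eps=1e-8):
    """One illustrative update: use more momentum and larger steps when
    recent history (the buffer m) agrees with the current gradient."""
    m = state.get("m", np.zeros_like(grad))
    agree = float(np.dot(m.ravel(), grad.ravel()) /
                  (np.linalg.norm(m) * np.linalg.norm(grad) + eps))
    beta = 0.5 + 0.4 * max(agree, 0.0)      # dynamic momentum weight
    lr = base_lr * (1.0 + max(agree, 0.0))  # dynamic learning rate
    m = beta * m + (1.0 - beta) * grad
    state["m"] = m
    return param - lr * m, state
```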

[LG-36] Automating Benchmark Design

链接: https://arxiv.org/abs/2510.25039
作者: Amanda Dsouza,Harit Vishwakarma,Zhengyang Qi,Justin Bauer,Derek Pham,Thomas Walshe,Armin Parchami,Frederic Sala,Paroma Varma
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:


Abstract:The rapid progress and widespread deployment of LLMs and LLM-powered agents has outpaced our ability to evaluate them. Hand-crafted, static benchmarks are the primary tool for assessing model capabilities, but these quickly become saturated. In contrast, dynamic benchmarks evolve alongside the models they evaluate, but are expensive to create and continuously update. To address these challenges, we develop BeTaL (Benchmark Tuning with an LLM-in-the-loop), a framework that leverages environment design principles to automate the process of dynamic benchmark design. BeTaL works by parameterizing key design choices in base benchmark templates and uses LLMs to reason through the resulting parameter space to obtain target properties (such as difficulty and realism) in a cost-efficient manner. We validate this approach on its ability to create benchmarks with desired difficulty levels. Using BeTaL, we create two new benchmarks and extend a popular agentic benchmark, \tau-bench. Extensive evaluation on these three tasks and multiple target difficulty levels shows that BeTaL produces benchmarks much closer to the desired difficulty, with average deviations ranging from 5.3% to 13.2% – a 2-4x improvement over the baselines.

[LG-37] Graph Distance Based on Cause-Effect Estimands with Latents

链接: https://arxiv.org/abs/2510.25037
作者: Zhufeng Li,Niki Kilbertus
类目: Machine Learning (cs.LG)
*备注:


Abstract:Causal discovery aims to recover graphs that represent causal relations among given variables from observations, and new methods are constantly being proposed. Increasingly, the community raises questions about how much progress is made, because properly evaluating discovered graphs remains notoriously difficult, particularly under latent confounding. We propose a graph distance measure for acyclic directed mixed graphs (ADMGs) based on the downstream task of cause-effect estimation under unobserved confounding. Our approach uses identification via fixing and a symbolic verifier to quantify how graph differences distort cause-effect estimands for different treatment-outcome pairs. We analyze the behavior of the measure under different graph perturbations and compare it against existing distance metrics.

[LG-38] Machine Learning based Analysis for Radiomics Features Robustness in Real-World Deployment Scenarios

链接: https://arxiv.org/abs/2510.25026
作者: Sarmad Ahmad Khan,Simon Bernatz,Zahra Moslehi,Florian Buettner
类目: Machine Learning (cs.LG)
*备注:


Abstract:Radiomics-based machine learning models show promise for clinical decision support but are vulnerable to distribution shifts caused by variations in imaging protocols, positioning, and segmentation. This study systematically investigates the robustness of radiomics-based machine learning models under distribution shifts across five MRI sequences. We evaluated how different acquisition protocols and segmentation strategies affect model reliability in terms of predictive power and uncertainty-awareness. Using a phantom of 16 fruits, we evaluated distribution shifts through: (1) protocol variations across T2-HASTE, T2-TSE, T2-MAP, T1-TSE, and T2-FLAIR sequences; (2) segmentation variations (full, partial, rotated); and (3) inter-observer variability. We trained XGBoost classifiers on 8 consistent robust features versus sequence-specific features, testing model performance under in-domain and out-of-domain conditions. Results demonstrate that models trained on protocol-invariant features maintain F1-scores above 0.85 across distribution shifts, while models using all features showed 40% performance degradation under protocol changes. Dataset augmentation substantially improved the quality of uncertainty estimates and reduced the expected calibration error (ECE) by 35% without sacrificing accuracy. Temperature scaling provided minimal calibration benefits, confirming XGBoost's inherent reliability. Our findings reveal that protocol-aware feature selection and controlled phantom studies effectively predict model behavior under distribution shifts, providing a framework for developing robust radiomics models resilient to real-world protocol variations.

[LG-39] Secure Retrieval-Augmented Generation against Poisoning Attacks

链接: https://arxiv.org/abs/2510.25025
作者: Zirui Cheng,Jikai Sun,Anjun Gao,Yueyang Quan,Zhuqing Liu,Xiaohua Hu,Minghong Fang
类目: Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: To appear in IEEE BigData 2025


Abstract:Large language models (LLMs) have transformed natural language processing (NLP), enabling applications from content generation to decision support. Retrieval-Augmented Generation (RAG) improves LLMs by incorporating external knowledge but also introduces security risks, particularly from data poisoning, where the attacker injects poisoned texts into the knowledge database to manipulate system outputs. While various defenses have been proposed, they often struggle against advanced attacks. To address this, we introduce RAGuard, a detection framework designed to identify poisoned texts. RAGuard first expands the retrieval scope to increase the proportion of clean texts, reducing the likelihood of retrieving poisoned content. It then applies chunk-wise perplexity filtering to detect abnormal variations and text similarity filtering to flag highly similar texts. This non-parametric approach enhances RAG security, and experiments on large-scale datasets demonstrate its effectiveness in detecting and mitigating poisoning attacks, including strong adaptive attacks.
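
A minimal sketch of the two non-parametric filters, where `perplexity_fn` and `sim_fn` are assumed helpers (e.g., a language-model scorer and an embedding cosine similarity); the thresholds are illustrative, not RAGuard's settings.

```python
import statistics

def filter_retrieved(texts, perplexity_fn, sim_fn, ppl_z=3.0, sim_thresh=0.95):
    """Drop retrieved chunks whose perplexity is an outlier within the
    retrieved set, then drop texts highly similar to already-kept ones."""
    ppls = [perplexity_fn(t) for t in texts]
    mu = statistics.mean(ppls)
    sd = statistics.pstdev(ppls) or 1.0
    kept = []
    for i, t in enumerate(texts):
        if abs(ppls[i] - mu) / sd > ppl_z:
            continue                      # abnormal perplexity: suspicious
        if any(sim_fn(t, texts[j]) > sim_thresh for j in kept):
            continue                      # near-duplicate of a kept text
        kept.append(i)
    return [texts[i] for i in kept]
```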

[LG-40] Disentangling Shared and Private Neural Dynamics with SPIRE: A Latent Modeling Framework for Deep Brain Stimulation ICLR2026

链接: https://arxiv.org/abs/2510.25023
作者: Rahil Soroushmojdehi,Sina Javadzadeh,Mehrnaz Asadi,Terence D. Sanger
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 25 pages total. Main paper (including references): 13 pages with 7 figures. Appendix: 12 pages with 5 figures and 4 tables. Submitted to ICLR 2026


Abstract:Disentangling shared network-level dynamics from region-specific activity is a central challenge in modeling multi-region neural data. We introduce SPIRE (Shared-Private Inter-Regional Encoder), a deep multi-encoder autoencoder that factorizes recordings into shared and private latent subspaces with novel alignment and disentanglement losses. Trained solely on baseline data, SPIRE robustly recovers cross-regional structure and reveals how external perturbations reorganize it. On synthetic benchmarks with ground-truth latents, SPIRE outperforms classical probabilistic models under nonlinear distortions and temporal misalignments. Applied to intracranial deep brain stimulation (DBS) recordings, SPIRE shows that shared latents reliably encode stimulation-specific signatures that generalize across sites and frequencies. These results establish SPIRE as a practical, reproducible tool for analyzing multi-region neural dynamics under stimulation.

[LG-41] What Really Matters in Matrix-Whitening Optimizers?

链接: https://arxiv.org/abs/2510.25000
作者: Kevin Frans,Pieter Abbeel,Sergey Levine
类目: Machine Learning (cs.LG)
*备注:


Abstract:A range of recent optimizers have emerged that approximate the same “matrix-whitening” transformation in various ways. In this work, we systematically deconstruct such optimizers, aiming to disentangle the key components that explain performance. With hyperparameters tuned across the board, all flavors of matrix-whitening methods reliably outperform elementwise counterparts, such as Adam. Matrix-whitening is often related to spectral descent – however, experiments reveal that performance gains are not explained solely by accurate spectral normalization – particularly, SOAP displays the largest per-step gain, even though Muon more accurately descends along the steepest spectral descent direction. Instead, we argue that matrix-whitening serves two purposes, and the variance adaptation component of matrix-whitening is the overlooked ingredient explaining this performance gap. Experiments show that variance-adapted versions of optimizers consistently outperform their sign-descent counterparts, including an adaptive version of Muon. We further ablate variance adaptation strategies, finding that while lookahead style approximations are not as effective, low-rank variance estimators can effectively reduce memory costs without a performance loss.

[LG-42] Enhancing Hierarchical Reinforcement Learning through Change Point Detection in Time Series

链接: https://arxiv.org/abs/2510.24988
作者: Hemanath Arumugam,Falong Fan,Bo Liu
类目: Machine Learning (cs.LG)
*备注:


Abstract:Hierarchical Reinforcement Learning (HRL) enhances the scalability of decision-making in long-horizon tasks by introducing temporal abstraction through options, i.e., policies that span multiple timesteps. Despite its theoretical appeal, the practical implementation of HRL suffers from the challenge of autonomously discovering semantically meaningful subgoals and learning optimal option termination boundaries. This paper introduces a novel architecture that integrates a self-supervised, Transformer-based Change Point Detection (CPD) module into the Option-Critic framework, enabling adaptive segmentation of state trajectories and the discovery of options. The CPD module is trained using heuristic pseudo-labels derived from intrinsic signals to infer latent shifts in environment dynamics without external supervision. These inferred change-points are leveraged in three critical ways: (i) to serve as supervisory signals for stabilizing termination function gradients, (ii) to pretrain intra-option policies via segment-wise behavioral cloning, and (iii) to enforce functional specialization through inter-option divergence penalties over CPD-defined state partitions. The overall optimization objective enhances the standard actor-critic loss using structure-aware auxiliary losses. In our framework, option discovery arises naturally as CPD-defined trajectory segments are mapped to distinct intra-option policies, enabling the agent to autonomously partition its behavior into reusable, semantically meaningful skills. Experiments on the Four-Rooms and Pinball tasks demonstrate that CPD-guided agents exhibit accelerated convergence, higher cumulative returns, and significantly improved option specialization. These findings confirm that integrating structural priors via change-point segmentation leads to more interpretable, sample-efficient, and robust hierarchical policies in complex environments.

[LG-43] Strategic inputs: feature selection from game-theoretic perspective

链接: https://arxiv.org/abs/2510.24982
作者: Chi Zhao,Jing Liu,Elena Parilina
类目: Machine Learning (cs.LG)
*备注:


Abstract:The exponential growth of data volumes has led to escalating computational costs in machine learning model training. However, many features fail to contribute positively to model performance while consuming substantial computational resources. This paper presents an end-to-end feature selection framework for tabular data based on game theory. We formulate the feature selection procedure as a cooperative game in which features are modeled as players, and their importance is determined through the evaluation of synergistic interactions and marginal contributions. The proposed framework comprises four core components: sample selection, game-theoretic feature importance evaluation, redundant feature elimination, and optimized model training. Experimental results demonstrate that the proposed method achieves a substantial reduction in computation while preserving predictive performance, thereby offering an efficient solution to the computational challenges of large-scale machine learning. The source code is available at this https URL.
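
The cooperative-game view can be illustrated with exact Shapley values over feature coalitions. Here `score_fn` (e.g., cross-validated accuracy on a feature subset, defined for the empty set too) is an assumed helper; the exact computation is exponential in the number of features, so this sketch suits only small n.

```python
import numpy as np
from itertools import combinations
from math import factorial

def shapley_feature_values(score_fn, n_features):
    """Each feature's importance is its average marginal contribution
    to the model-quality score over all coalitions of other features."""
    phi = np.zeros(n_features)
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                w = (factorial(len(S)) * factorial(n_features - len(S) - 1)
                     / factorial(n_features))
                phi[i] += w * (score_fn(set(S) | {i}) - score_fn(set(S)))
    return phi
```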

[LG-44] Conformational Rank Conditioned Committees for Machine Learning-Assisted Directed Evolution

链接: https://arxiv.org/abs/2510.24974
作者: Mia Adler,Carrie Liang,Brian Peng,Oleg Presnyakov,Justin M. Baker,Jannelle Lauffer,Himani Sharma,Barry Merriman
类目: Machine Learning (cs.LG)
*备注:


Abstract:Machine Learning-assisted directed evolution (MLDE) is a powerful tool for efficiently navigating antibody fitness landscapes. Many structure-aware MLDE pipelines rely on a single conformation or a single committee across all conformations, limiting their ability to separate conformational uncertainty from epistemic uncertainty. Here, we introduce a rank-conditioned committee (RCC) framework that leverages ranked conformations to assign a deep neural network committee per rank. This design enables a principled separation between epistemic uncertainty and conformational uncertainty. We validate our approach on SARS-CoV-2 antibody docking, demonstrating significant improvements over baseline strategies. Our results offer a scalable route for therapeutic antibody discovery while directly addressing the challenge of modeling conformational uncertainty.
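
A sketch of how one committee per conformation rank can separate the two kinds of uncertainty; `committees` is a list (one committee of callable predictors per rank), and the variance decomposition below is one plausible reading of the abstract, not the paper's exact definition.

```python
import numpy as np

def rcc_predict(committees, conformations):
    """Committee r scores the rank-r conformation; within-committee
    spread ~ epistemic uncertainty, across-rank spread ~ conformational."""
    rank_means, rank_vars = [], []
    for committee, conf in zip(committees, conformations):
        preds = np.array([net(conf) for net in committee])
        rank_means.append(preds.mean())
        rank_vars.append(preds.var())
    epistemic = float(np.mean(rank_vars))        # avg within-committee var
    conformational = float(np.var(rank_means))   # variance across ranks
    return float(np.mean(rank_means)), epistemic, conformational
```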

[LG-45] Resource-Efficient and Robust Inference of Deep and Bayesian Neural Networks on Embedded and Analog Computing Platforms

链接: https://arxiv.org/abs/2510.24951
作者: Bernhard Klein
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Neural and Evolutionary Computing (cs.NE)
*备注: Ph.D. dissertation, Heidelberg University, October 2025


Abstract:While modern machine learning has transformed numerous application domains, its growing computational demands increasingly constrain scalability and efficiency, particularly on embedded and resource-limited platforms. In practice, neural networks must not only operate efficiently but also provide reliable predictions under distributional shifts or unseen data. Bayesian neural networks offer a principled framework for quantifying uncertainty, yet their computational overhead further compounds these challenges. This work advances resource-efficient and robust inference for both conventional and Bayesian neural networks through the joint pursuit of algorithmic and hardware efficiency. The former reduces computation through model compression and approximate Bayesian inference, while the latter optimizes deployment on digital accelerators and explores analog hardware, bridging algorithmic design and physical realization. The first contribution, Galen, performs automatic layer-specific compression guided by sensitivity analysis and hardware-in-the-loop feedback. Analog accelerators offer efficiency gains at the cost of noise; this work models device imperfections and extends noisy training to nonstationary conditions, improving robustness and stability. A second line of work advances probabilistic inference, developing analytic and ensemble approximations that replace costly sampling, integrate into a compiler stack, and optimize embedded inference. Finally, probabilistic photonic computing introduces a paradigm where controlled analog noise acts as an intrinsic entropy source, enabling fast, energy-efficient probabilistic inference directly in hardware. Together, these studies demonstrate how efficiency and reliability can be advanced jointly through algorithm-hardware co-design, laying the foundation for the next generation of trustworthy, energy-efficient machine-learning systems.

[LG-46] Can Aha Moments Be Fake? Identifying True and Decorative Thinking Steps in Chain-of-Thought

链接: https://arxiv.org/abs/2510.24941
作者: Jiachen Zhao,Yiyou Sun,Weiyan Shi,Dawn Song
类目: Machine Learning (cs.LG)
*备注:


Abstract:Recent large language models (LLMs) can generate long Chain-of-Thought (CoT) at test time, enabling them to solve complex tasks. These reasoning steps in CoT are often assumed as a faithful reflection of the model's internal thinking process, and used to monitor unsafe intentions. However, we find many reasoning steps don't truly contribute to LLMs' prediction. We measure the step-wise causal influence of each reasoning step on the model's final prediction with a proposed True Thinking Score (TTS). We reveal that LLMs often interleave between true-thinking steps (which are genuinely used to produce the final output) and decorative-thinking steps (which only give the appearance of reasoning but have minimal causal impact). Notably, only a small subset of the total reasoning steps have a high TTS that causally drive the model's prediction: e.g., for the AIME dataset, only an average of 2.3% of reasoning steps in CoT have a TTS ≥ 0.7 (range: 0-1) under the Qwen-2.5 model. Furthermore, we identify a TrueThinking direction in the latent space of LLMs. By steering along or against this direction, we can force the model to perform or disregard certain CoT steps when computing the final result. Finally, we highlight that self-verification steps in CoT (i.e., aha moments) can also be decorative, where LLMs do not truly verify their solution. Steering along the TrueThinking direction can force internal reasoning over these steps, resulting in a change in the final results. Overall, our work reveals that LLMs often verbalize reasoning steps without actually performing them internally, which undermines both the efficiency of LLM reasoning and the trustworthiness of CoT.
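
One simple way to instantiate a step-wise causal score is to ablate each reasoning step and measure the drop in the answer's log-probability; `logprob_fn` is an assumed helper, and the paper's exact TTS definition may differ.

```python
def true_thinking_scores(logprob_fn, question, steps, answer):
    """Causal influence of each CoT step: log P(answer | all steps)
    minus log P(answer | all steps except step i). Higher means the
    step genuinely drives the prediction; near zero means decorative."""
    full = logprob_fn(question, steps, answer)
    scores = []
    for i in range(len(steps)):
        ablated = steps[:i] + steps[i + 1:]   # drop step i
        scores.append(full - logprob_fn(question, ablated, answer))
    return scores
```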

[LG-47] WBT-BGRL: A Non-Contrastive Weighted Bipartite Link Prediction Model for Inductive Learning

链接: https://arxiv.org/abs/2510.24927
作者: Joel Frank Huarayo Quispe,Lilian Berton,Didier Vega-Oliveros
类目: Machine Learning (cs.LG)
*备注: 5 pages, submitted to the 12th International Conference on Soft Computing and Machine Intelligence (ISCMI 2025)


Abstract:Link prediction in bipartite graphs is crucial for applications like recommendation systems and failure detection, yet it is less studied than in monopartite graphs. Contrastive methods struggle with inefficient and biased negative sampling, while non-contrastive approaches rely solely on positive samples. Existing models perform well in transductive settings, but their effectiveness in inductive, weighted, and bipartite scenarios remains untested. To address this, we propose Weighted Bipartite Triplet-Bootstrapped Graph Latents (WBT-BGRL), a non-contrastive framework that enhances bootstrapped learning with a novel weighting mechanism in the triplet loss. Using a bipartite architecture with dual GCN encoders, WBT-BGRL is evaluated against adapted state-of-the-art models (T-BGRL, BGRL, GBT, CCA-SSG). Results on real-world datasets (Industry and E-commerce) show competitive performance, especially when weighting is applied during pretraining, highlighting the value of weighted, non-contrastive learning for inductive link prediction in bipartite graphs.

[LG-48] Topic Analysis with Side Information: A Neural-Augmented LDA Approach

链接: https://arxiv.org/abs/2510.24918
作者: Biyi Fang,Kripa Rajshekhar,Truong Vo,Diego Klabjan
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:


Abstract:Traditional topic models such as Latent Dirichlet Allocation (LDA) have been widely used to uncover latent structures in text corpora, but they often struggle to integrate auxiliary information such as metadata, user attributes, or document labels. These limitations restrict their expressiveness, personalization, and interpretability. To address this, we propose nnLDA, a neural-augmented probabilistic topic model that dynamically incorporates side information through a neural prior mechanism. nnLDA models each document as a mixture of latent topics, where the prior over topic proportions is generated by a neural network conditioned on auxiliary features. This design allows the model to capture complex nonlinear interactions between side information and topic distributions that static Dirichlet priors cannot represent. We develop a stochastic variational Expectation-Maximization algorithm to jointly optimize the neural and probabilistic components. Across multiple benchmark datasets, nnLDA consistently outperforms LDA and Dirichlet-Multinomial Regression in topic coherence, perplexity, and downstream classification. These results highlight the benefits of combining neural representation learning with probabilistic topic modeling in settings where side information is available.

[LG-49] Adaptive EEG-based stroke diagnosis with a GRU-TCN classifier and deep Q-learning thresholding

链接: https://arxiv.org/abs/2510.24889
作者: Shakeel Abdulkareem(1),Bora Yimenicioglu(2),Andrea Yang(3),Khartik Uppalapati(2),Aneesh Gudipati(1),Zhaoyang Fan(3) ((1) George Mason University, College of Science, Fairfax, VA, USA, (2) Raregen Youth Network, Translational Medical Research Department, Oakton, VA, USA, (3) University of Southern California, Los Angeles, CA, USA)
类目: Machine Learning (cs.LG)
*备注: 10 pages, 6 figures. Equal contribution: Shakeel Abdulkareem and Bora Yimenicioglu. Compiled with pdfLaTeX (wlscirep class)


Abstract:Rapid triage of suspected stroke needs accurate, bedside-deployable tools; EEG is promising but underused at first contact. We present an adaptive multitask EEG classifier that converts 32-channel signals to power spectral density features (Welch), uses a recurrent-convolutional network (GRU-TCN) to predict stroke type (healthy, ischemic, hemorrhagic), hemispheric lateralization, and severity, and applies a deep Q-network (DQN) to tune decision thresholds in real time. Using a patient-wise split of the UCLH Stroke EIT/EEG data set (44 recordings; about 26 acute stroke, 10 controls), the primary outcome was stroke-type performance; secondary outcomes were severity and lateralization. The baseline GRU-TCN reached 89.3% accuracy (F1 92.8%) for stroke type, about 96.9% (F1 95.9%) for severity, and about 96.7% (F1 97.4%) for lateralization. With DQN threshold adaptation, stroke-type accuracy increased to about 98.0% (F1 97.7%). We also tested robustness on an independent, low-density EEG cohort (ZJU4H) and report paired patient-level statistics. Analyses follow STARD 2015 guidance for diagnostic accuracy studies (index test: GRU-TCN+DQN; reference standard: radiology/clinical diagnosis; patient-wise evaluation). Adaptive thresholding shifts the operating point to clinically preferred sensitivity-specificity trade-offs, while integrated scalp-map and spectral visualizations support interpretability.
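
The Welch PSD front end is easy to sketch with SciPy; the sampling rate, window length, and band edges below are conventional EEG choices, not values taken from the paper.

```python
import numpy as np
from scipy.signal import welch

def psd_features(eeg, fs=256, nperseg=512):
    """Turn multichannel EEG (channels x samples) into mean band-power
    features from the Welch power spectral density."""
    bands = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13),
             "beta": (13, 30), "gamma": (30, 45)}
    freqs, psd = welch(eeg, fs=fs, nperseg=nperseg, axis=-1)
    feats = []
    for lo, hi in bands.values():
        sel = (freqs >= lo) & (freqs < hi)
        feats.append(psd[:, sel].mean(axis=-1))  # per-channel band power
    return np.stack(feats, axis=-1)              # (channels, n_bands)
```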

[LG-50] Aggregation Hides Out-of-Distribution Generalization Failures from Spurious Correlations NEURIPS2025

链接: https://arxiv.org/abs/2510.24884
作者: Olawale Salaudeen,Haoran Zhang,Kumail Alhamoud,Sara Beery,Marzyeh Ghassemi
类目: Machine Learning (cs.LG)
*备注: Accepted as a Spotlight paper at NeurIPS 2025


Abstract:Benchmarks for out-of-distribution (OOD) generalization frequently show a strong positive correlation between in-distribution (ID) and OOD accuracy across models, termed “accuracy-on-the-line.” This pattern is often taken to imply that spurious correlations - correlations that improve ID but reduce OOD performance - are rare in practice. We find that this positive correlation is often an artifact of aggregating heterogeneous OOD examples. Using a simple gradient-based method, OODSelect, we identify semantically coherent OOD subsets where accuracy on the line does not hold. Across widely used distribution shift benchmarks, OODSelect uncovers subsets, sometimes over half of the standard OOD set, where higher ID accuracy predicts lower OOD accuracy. Our findings indicate that aggregate metrics can obscure important failure modes of OOD robustness. We release code and the identified subsets to facilitate further research.

[LG-51] Send Less Save More: Energy-Efficiency Benchmark of Embedded CNN Inference vs. Data Transmission in IoT

链接: https://arxiv.org/abs/2510.24829
作者: Benjamin Karic,Nina Herrmann,Jan Stenkamp,Paula Scharf,Fabian Gieseke,Angela Schwering
类目: Machine Learning (cs.LG)
*备注: 11 Pages, Paper lists the categories for the ACM Computing Classification System


Abstract:The integration of the Internet of Things (IoT) and Artificial Intelligence offers significant opportunities to enhance our ability to monitor and address ecological changes. As environmental challenges become increasingly pressing, the need for effective remote monitoring solutions is more critical than ever. A major challenge in designing IoT applications for environmental monitoring - particularly those involving image data - is to create energy-efficient IoT devices capable of long-term operation in remote areas with limited power availability. Advancements in the field of Tiny Machine Learning allow the use of Convolutional Neural Networks (CNNs) on resource-constrained, battery-operated microcontrollers. Since data transfer is energy-intensive, performing inference directly on microcontrollers to reduce the message size can extend the operational lifespan of IoT nodes. This work evaluates the use of common Low Power Wide Area Networks and compressed CNNs trained on domain specific datasets on an ESP32-S3. Our experiments demonstrate, among other things, that executing CNN inference on-device and transmitting only the results reduces the overall energy consumption by a factor of up to five compared to sending raw image data. The compression of the model using Post Training Quantization is accompanied by an acceptable reduction in accuracy of only a few percentage points compared to a non-quantized model. These findings advocate the development of IoT applications with reduced carbon footprint and capable of operating autonomously in environmental monitoring scenarios by incorporating Embedded Machine Learning.

[LG-52] Augmenting Biological Fitness Prediction Benchmarks with Landscapes Features from GraphFLA NEURIPS2025

链接: https://arxiv.org/abs/2510.24826
作者: Mingyu Huang,Shasha Zhou,Ke Li
类目: Machine Learning (cs.LG)
*备注: 56 pages, 18 figures, 8 tables, accepted as a conference paper at NeurIPS 2025


Abstract:Machine learning models increasingly map biological sequence-fitness landscapes to predict mutational effects. Effective evaluation of these models requires benchmarks curated from empirical data. Despite their impressive scales, existing benchmarks lack topographical information regarding the underlying fitness landscapes, which hampers interpretation and comparison of model performance beyond averaged scores. Here, we introduce GraphFLA, a Python framework that constructs and analyzes fitness landscapes from mutagenesis data in diverse modalities (e.g., DNA, RNA, protein, and beyond) with up to millions of mutants. GraphFLA calculates 20 biologically relevant features that characterize 4 fundamental aspects of landscape topography. By applying GraphFLA to over 5,300 landscapes from ProteinGym, RNAGym, and CIS-BP, we demonstrate its utility in interpreting and comparing the performance of dozens of fitness prediction models, highlighting factors influencing model accuracy and respective advantages of different models. In addition, we release 155 combinatorially complete empirical fitness landscapes, encompassing over 2.2 million sequences across various modalities. All the codes and datasets are available at this https URL.

[LG-53] From Linear to Nonlinear: Provable Weak-to-Strong Generalization through Feature Learning NEURIPS2025

链接: https://arxiv.org/abs/2510.24812
作者: Junsoo Oh,Jerry Song,Chulhee Yun
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: NeurIPS 2025 camera-ready version, 70 pages


Abstract:Weak-to-strong generalization refers to the phenomenon where a stronger model trained under supervision from a weaker one can outperform its teacher. While prior studies aim to explain this effect, most theoretical insights are limited to abstract frameworks or linear/random feature models. In this paper, we provide a formal analysis of weak-to-strong generalization from a linear CNN (weak) to a two-layer ReLU CNN (strong). We consider structured data composed of label-dependent signals of varying difficulty and label-independent noise, and analyze gradient descent dynamics when the strong model is trained on data labeled by the pretrained weak model. Our analysis identifies two regimes – data-scarce and data-abundant – based on the signal-to-noise characteristics of the dataset, and reveals distinct mechanisms of weak-to-strong generalization. In the data-scarce regime, generalization occurs via benign overfitting or fails via harmful overfitting, depending on the amount of data, and we characterize the transition boundary. In the data-abundant regime, generalization emerges in the early phase through label correction, but we observe that overtraining can subsequently degrade performance.

[LG-54] Learning to Attack: Uncovering Privacy Risks in Sequential Data Releases

链接: https://arxiv.org/abs/2510.24807
作者: Ziyao Cui,Minxing Zhang,Jian Pei
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:


Abstract:Privacy concerns have become increasingly critical in modern AI and data science applications, where sensitive information is collected, analyzed, and shared across diverse domains such as healthcare, finance, and mobility. While prior research has focused on protecting privacy in a single data release, many real-world systems operate under sequential or continuous data publishing, where the same or related data are released over time. Such sequential disclosures introduce new vulnerabilities, as temporal correlations across releases may enable adversaries to infer sensitive information that remains hidden in any individual release. In this paper, we investigate whether an attacker can compromise privacy in sequential data releases by exploiting dependencies between consecutive publications, even when each individual release satisfies standard privacy guarantees. To this end, we propose a novel attack model that captures these sequential dependencies by integrating a Hidden Markov Model with a reinforcement learning-based bi-directional inference mechanism. This enables the attacker to leverage both earlier and later observations in the sequence to infer private information. We instantiate our framework in the context of trajectory data, demonstrating how an adversary can recover sensitive locations from sequential mobility datasets. Extensive experiments on Geolife, Porto Taxi, and SynMob datasets show that our model consistently outperforms baseline approaches that treat each release independently. The results reveal a fundamental privacy risk inherent to sequential data publishing, where individually protected releases can collectively leak sensitive information when analyzed temporally. These findings underscore the need for new privacy-preserving frameworks that explicitly model temporal dependencies, such as time-aware differential privacy or sequential data obfuscation strategies.

[LG-55] Constructive Lyapunov Functions via Topology-Preserving Neural Networks

链接: https://arxiv.org/abs/2510.24730
作者: Jaehong Oh
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 54 pages, 14 figures


Abstract:We prove that ONN achieves order-optimal performance on convergence rate (\mu \propto \lambda_2), edge efficiency (E = N for minimal connectivity k = 2), and computational complexity (O(N d^2)). Empirical validation on 3M-node semantic networks demonstrates 99.75% improvement over baseline methods, confirming exponential convergence (\mu = 3.2 \times 10^{-4}) and topology preservation. ORTSF integration into transformers achieves 14.7% perplexity reduction and 2.3\times faster convergence on WikiText-103. We establish deep connections to optimal control (Hamilton-Jacobi-Bellman), information geometry (Fisher-efficient natural gradient), topological data analysis (persistent homology computation in O(KN)), discrete geometry (Ricci flow), and category theory (adjoint functors). This work transforms Massera's abstract existence theorem into a concrete, scalable algorithm with provable guarantees, opening pathways for constructive stability analysis in neural networks, robotics, and distributed systems.

[LG-56] Stiff Circuit System Modeling via Transformer

链接: https://arxiv.org/abs/2510.24727
作者: Weiman Yan,Yi-Chia Chang,Wanyu Zhao
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:


Abstract:Accurate and efficient circuit behavior modeling is a cornerstone of modern electronic design automation. Among different types of circuits, stiff circuits are challenging to model using previous frameworks. In this work, we propose a new approach using Crossformer, which is a current state-of-the-art Transformer model for time-series prediction tasks, combined with Kolmogorov-Arnold Networks (KANs), to model stiff circuit transient behavior. By leveraging the Crossformer’s temporal representation capabilities and the enhanced feature extraction of KANs, our method achieves improved fidelity in predicting circuit responses to a wide range of input conditions. Experimental evaluations on datasets generated through SPICE simulations of analog-to-digital converter (ADC) circuits demonstrate the effectiveness of our approach, with significant reductions in training time and error rates.

[LG-57] Re-evaluating sample efficiency in de novo molecule generation

链接: https://arxiv.org/abs/2212.01385
作者: Morgan Thomas,Noel M. O’Boyle,Andreas Bender,Chris De Graaf
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: Submission to ELLIS ML4Molecules Workshop 2022

点击查看摘要

Abstract:De novo molecule generation can suffer from data inefficiency, requiring large amounts of training data or many sampled data points to conduct objective optimization. The latter is a particular disadvantage when combining deep generative models with computationally expensive molecule scoring functions (a.k.a. oracles) commonly used in computer-aided drug design. Recent works have therefore focused on methods to improve sample efficiency in the context of de novo molecule drug design, or to benchmark it. In this work, we discuss and adapt a recent sample efficiency benchmark to better reflect realistic goals also with respect to the quality of chemistry generated, which must always be considered in the context of small-molecule drug design; we then re-evaluate all benchmarked generative models. We find that accounting for molecular weight and LogP with respect to the training data, and the diversity of chemistry proposed, re-orders the ranking of generative models. In addition, we benchmark a recently proposed method to improve sample efficiency (Augmented Hill-Climb) and find that it ranks top when considering both the sample efficiency and chemistry of molecules generated. Continual improvements in sample efficiency and chemical desirability enable more routine integration of computationally expensive scoring functions on a more realistic timescale.

[LG-58] How Data Mixing Shapes In-Context Learning: Asymptotic Equivalence for Transformers with MLPs NEURIPS2025

链接: https://arxiv.org/abs/2510.25753
作者: Samet Demir,Zafer Dogan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: NeurIPS 2025, 24 pages, 6 figures

点击查看摘要

Abstract:Pretrained Transformers demonstrate remarkable in-context learning (ICL) capabilities, enabling them to adapt to new tasks from demonstrations without parameter updates. However, theoretical studies often rely on simplified architectures (e.g., omitting MLPs), data models (e.g., linear regression with isotropic inputs), and single-source training, limiting their relevance to realistic settings. In this work, we study ICL in pretrained Transformers with nonlinear MLP heads on nonlinear tasks drawn from multiple data sources with heterogeneous input, task, and noise distributions. We analyze a model where the MLP comprises two layers, with the first layer trained via a single gradient step and the second layer fully optimized. Under high-dimensional asymptotics, we prove that such models are equivalent in ICL error to structured polynomial predictors, leveraging results from the theory of Gaussian universality and orthogonal polynomials. This equivalence reveals that nonlinear MLPs meaningfully enhance ICL performance, particularly on nonlinear tasks, compared to linear baselines. It also enables a precise analysis of data mixing effects: we identify key properties of high-quality data sources (low noise, structured covariances) and show that feature learning emerges only when the task covariance exhibits sufficient structure. These results are validated empirically across various activation functions, model sizes, and data distributions. Finally, we experiment with a real-world scenario involving multilingual sentiment analysis where each language is treated as a different source. Our experimental results for this case exemplify how our findings extend to real-world cases. Overall, our work advances the theoretical foundations of ICL in Transformers and provides actionable insight into the role of architecture and data in ICL.

[LG-59] Scaling flow-based approaches for topology sampling in SU(3) gauge theory

链接: https://arxiv.org/abs/2510.25704
作者: Claudio Bonanno,Andrea Bulgarelli,Elia Cellini,Alessandro Nada,Dario Panfalone,Davide Vadacchino,Lorenzo Verzichelli
类目: High Energy Physics - Lattice (hep-lat); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph)
*备注: 1+39 pages, 14 figures

点击查看摘要

Abstract:We develop a methodology based on out-of-equilibrium simulations to mitigate topological freezing when approaching the continuum limit of lattice gauge theories. We reduce the autocorrelation of the topological charge by employing open boundary conditions, while removing exactly their unphysical effects using a non-equilibrium Monte Carlo approach in which periodic boundary conditions are gradually switched on. We perform a detailed analysis of the computational costs of this strategy in the case of the four-dimensional \mathrm{SU}(3) Yang-Mills theory. After achieving full control of the scaling, we outline a clear strategy to sample topology efficiently in the continuum limit, which we check at lattice spacings as small as 0.045 fm. We also generalize this approach by designing a customized Stochastic Normalizing Flow for evolutions in the boundary conditions, obtaining superior performances with respect to the purely stochastic non-equilibrium approach, and paving the way for more efficient future flow-based solutions.

[LG-60] PyDPF: A Python Package for Differentiable Particle Filtering

链接: https://arxiv.org/abs/2510.25693
作者: John-Joseph Brady,Benjamin Cox,Víctor Elvira,Yunpeng Li
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 42 pages, 0 figures, under review at the Journal of Statistical Software, the python package can be found at this https URL , the full documentation at this https URL , and the source code including experiments and replication material at this https URL

点击查看摘要

Abstract:State-space models (SSMs) are a widely used tool in time series analysis. In the complex systems that arise from real-world data, it is common to employ particle filtering (PF), an efficient Monte Carlo method for estimating the hidden state corresponding to a sequence of observations. Applying particle filtering requires specifying both the parametric form and the parameters of the system, which are often unknown and must be estimated. Gradient-based optimisation techniques cannot be applied directly to standard particle filters, as the filters themselves are not differentiable. However, several recently proposed methods modify the resampling step to make particle filtering differentiable. In this paper, we present an implementation of several such differentiable particle filters (DPFs) with a unified API built on the popular PyTorch framework. Our implementation makes these algorithms easily accessible to a broader research community and facilitates straightforward comparison between them. We validate our framework by reproducing experiments from several existing studies and demonstrate how DPFs can be applied to address several common challenges with state space modelling.
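
The obstacle mentioned above is the non-differentiable resampling step. As a rough illustration of how DPF packages address it, here is a self-contained PyTorch sketch of soft resampling (one of the modified resampling schemes in the literature) inside a bootstrap filter for a toy 1-D random-walk model; it does not use the PyDPF API, and the model and parameter names are illustrative:

```python
import torch

def soft_resample(particles, log_w, alpha=0.5):
    # Draw indices from a mixture of the normalized weights and a uniform
    # distribution, then importance-reweight. The correction depends
    # differentiably on the weights, so gradients survive resampling.
    n = particles.shape[0]
    w = torch.softmax(log_w, dim=0)
    q = alpha * w + (1.0 - alpha) / n
    idx = torch.multinomial(q, n, replacement=True)
    new_log_w = torch.log(w[idx] + 1e-12) - torch.log(q[idx] + 1e-12)
    return particles[idx], new_log_w

def filter_log_likelihood(obs, trans_std, obs_std, n_particles=256):
    # Bootstrap filter for a 1-D Gaussian random walk observed in noise.
    particles = torch.zeros(n_particles)
    log_w = torch.log(torch.ones(n_particles) / n_particles)
    log_lik = torch.tensor(0.0)
    for y in obs:
        particles = particles + trans_std * torch.randn(n_particles)  # reparameterized
        log_obs = -0.5 * ((y - particles) / obs_std) ** 2 - torch.log(obs_std)
        log_w_post = log_w + log_obs
        # Incremental evidence: log [sum w_i p(y|x_i) / sum w_i]
        log_lik = log_lik + torch.logsumexp(log_w_post, 0) - torch.logsumexp(log_w, 0)
        particles, log_w = soft_resample(particles, log_w_post)
    return log_lik

obs = torch.randn(50).cumsum(0) + 0.3 * torch.randn(50)  # synthetic observations
sigmas = torch.tensor([1.0, 0.3], requires_grad=True)    # (trans_std, obs_std)
ll = filter_log_likelihood(obs, sigmas[0], sigmas[1])
ll.backward()                                            # gradients for both parameters
print(ll.item(), sigmas.grad)
```

Because the weight correction is differentiable, the returned log-likelihood supports gradient-based parameter estimation, which is the use case such packages target.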

[LG-61] Continuous subsurface property retrieval from sparse radar observations using physics informed neural networks

链接: https://arxiv.org/abs/2510.25648
作者: Ishfaq Aziz,Mohamad Alipour
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 22 pages, 9 main text figures + 2 supplementary figures

点击查看摘要

Abstract:Estimating subsurface dielectric properties is essential for applications ranging from environmental surveys of soils to nondestructive evaluation of concrete in infrastructure. Conventional wave inversion methods typically assume few discrete homogeneous layers and require dense measurements or strong prior knowledge of material boundaries, limiting scalability and accuracy in realistic settings where properties vary continuously. We present a physics-informed machine learning framework that reconstructs subsurface permittivity as a fully neural, continuous function of depth, trained to satisfy both measurement data and Maxwell's equations. We validate the framework with both simulations and custom-built radar experiments on multilayered natural materials. Results show close agreement with in-situ permittivity measurements (R^2=0.93), with sensitivity to even subtle variations (\Delta\varepsilon_r = 2). Parametric analysis reveals that accurate profiles can be recovered with as few as three strategically placed sensors in two-layer systems. This approach reframes subsurface inversion from boundary-driven to continuous property estimation, enabling accurate characterization of smooth permittivity variations and advancing electromagnetic imaging using low-cost radar systems.
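
To make the idea concrete, the following is a minimal PINN sketch under strong simplifying assumptions: the full Maxwell system is reduced to a 1-D Helmholtz equation E'' + k0^2 eps_r(z) E = 0, and the sensor depths and field values are synthetic placeholders rather than radar data:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(width=64):
    return nn.Sequential(nn.Linear(1, width), nn.Tanh(),
                         nn.Linear(width, width), nn.Tanh(),
                         nn.Linear(width, 1))

field = mlp()    # E(z): real-valued toy stand-in for the EM field
eps_net = mlp()  # eps_r(z): continuous permittivity profile to recover
k0 = 2.0         # free-space wavenumber (illustrative value)
opt = torch.optim.Adam(list(field.parameters()) + list(eps_net.parameters()), lr=1e-3)

z_col = torch.linspace(0.0, 1.0, 128).reshape(-1, 1).requires_grad_(True)  # collocation
z_obs = torch.tensor([[0.0], [0.4], [1.0]])   # sparse "sensor" depths (synthetic)
E_obs = torch.tensor([[1.0], [0.1], [-0.6]])  # synthetic field measurements

for step in range(2000):
    E = field(z_col)
    dE = torch.autograd.grad(E.sum(), z_col, create_graph=True)[0]
    d2E = torch.autograd.grad(dE.sum(), z_col, create_graph=True)[0]
    eps = F.softplus(eps_net(z_col))           # keep permittivity positive
    pde = d2E + k0 ** 2 * eps * E              # Helmholtz residual
    loss = pde.pow(2).mean() + (field(z_obs) - E_obs).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```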

[LG-62] Monitoring the calibration of probability forecasts with an application to concept drift detection involving image classification

链接: https://arxiv.org/abs/2510.25573
作者: Christopher T. Franck,Anne R. Driscoll,Zoe Szajnfarber,William H. Woodall
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning approaches for image classification have led to impressive advances in that field. For example, convolutional neural networks are able to achieve remarkable image classification accuracy across a wide range of applications in industry, defense, and other areas. While these machine learning models boast impressive accuracy, a related concern is how to assess and maintain calibration in the predictions these models make. A classification model is said to be well calibrated if its predicted probabilities correspond with the rates events actually occur. While there are many available methods to assess machine learning calibration and recalibrate faulty predictions, less effort has been spent on developing approaches that continually monitor predictive models for potential loss of calibration as time passes. We propose a cumulative sum-based approach with dynamic limits that enable detection of miscalibration in both traditional process monitoring and concept drift applications. This enables early detection of operational context changes that impact image classification performance in the field. The proposed chart can be used broadly in any situation where the user needs to monitor probability predictions over time for potential lapses in calibration. Importantly, our method operates on probability predictions and event outcomes and does not require under-the-hood access to the machine learning model.
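
A minimal sketch of the core construction, a two-sided CUSUM on standardized calibration residuals, is shown below; it uses fixed control limits for brevity, whereas the paper's contribution includes dynamic limits:

```python
import numpy as np

def calibration_cusum(p, y, k=0.5):
    """Two-sided CUSUM on standardized calibration residuals.
    p: predicted event probabilities, y: binary outcomes. If the model is
    well calibrated, z_t = (y_t - p_t)/sqrt(p_t(1-p_t)) has mean 0 and unit
    variance, so the usual CUSUM recursions with reference value k apply."""
    z = (y - p) / np.sqrt(p * (1.0 - p))
    hi = lo = 0.0
    c_hi, c_lo = np.empty(len(z)), np.empty(len(z))
    for t, zt in enumerate(z):
        hi = max(0.0, hi + zt - k)   # grows if events exceed predictions
        lo = max(0.0, lo - zt - k)   # grows if the model is over-confident
        c_hi[t], c_lo[t] = hi, lo
    return c_hi, c_lo

# Stream whose second half drifts (true rate p**2 < predicted p).
rng = np.random.default_rng(0)
p = rng.uniform(0.1, 0.9, 2000)
true_rate = np.where(np.arange(2000) < 1000, p, p ** 2)
y = (rng.uniform(size=2000) < true_rate).astype(float)
c_hi, c_lo = calibration_cusum(p, y)
alarms = np.flatnonzero((c_hi > 5.0) | (c_lo > 5.0))   # h = 5: illustrative limit
print("first alarm at t =", alarms[0] if alarms.size else None)
```

Note that the chart consumes only (probability, outcome) pairs, matching the abstract's point that no access to the underlying model is required.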

[LG-63] PitchFlower: A flow-based neural audio codec with pitch controllability

链接: https://arxiv.org/abs/2510.25566
作者: Diego Torres,Axel Roebel,Nicolas Obin
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注: 5 pages, 5 figures

点击查看摘要

Abstract:We present PitchFlower, a flow-based neural audio codec with explicit pitch controllability. Our approach enforces disentanglement through a simple perturbation: during training, F0 contours are flattened and randomly shifted, while the true F0 is provided as conditioning. A vector-quantization bottleneck prevents pitch recovery, and a flow-based decoder generates high quality audio. Experiments show that PitchFlower achieves more accurate pitch control than WORLD at much higher audio quality, and outperforms SiFiGAN in controllability while maintaining comparable quality. Beyond pitch, this framework provides a simple and extensible path toward disentangling other speech attributes.
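
The disentanglement recipe is simple enough to sketch. Below is an illustrative NumPy version of the F0 perturbation only (flatten the contour, then randomly shift it), not the codec itself; the parameter range is an assumption:

```python
import numpy as np

def perturb_f0(f0, max_shift_semitones=4.0, rng=np.random.default_rng(0)):
    """Flatten a frame-level F0 contour (Hz) to its voiced mean, then apply
    a random global shift in semitones; the true contour is returned as the
    conditioning signal, mirroring the training recipe in the abstract.
    Unvoiced frames (f0 == 0) are left untouched."""
    voiced = f0 > 0
    flat = f0.copy()
    flat[voiced] = f0[voiced].mean()              # remove pitch dynamics
    shift = rng.uniform(-max_shift_semitones, max_shift_semitones)
    flat[voiced] *= 2.0 ** (shift / 12.0)         # random transposition
    return flat, f0                               # (perturbed input, conditioning)

f0 = np.array([0., 0., 220., 222., 230., 228., 0., 0.])  # toy contour (Hz)
x_in, cond = perturb_f0(f0)
```

Because the audio's pitch no longer matches the latent content, pitch information can only enter through the conditioning, which is what forces the disentanglement.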

[LG-64] Robust variable selection for spatial point processes observed with noise

链接: https://arxiv.org/abs/2510.25550
作者: Dominik Sturm,Ivo F. Sbalzarini
类目: Methodology (stat.ME); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We propose a method for variable selection in the intensity function of spatial point processes that combines sparsity-promoting estimation with noise-robust model selection. As high-resolution spatial data becomes increasingly available through remote sensing and automated image analysis, identifying spatial covariates that influence the localization of events is crucial to understand the underlying mechanism. However, results from automated acquisition techniques are often noisy, for example due to measurement uncertainties or detection errors, which leads to spurious displacements and missed events. We study the impact of such noise on sparse point-process estimation across different models, including Poisson and Thomas processes. To improve noise robustness, we propose to use stability selection based on point-process subsampling and to incorporate a non-convex best-subset penalty to enhance model-selection performance. In extensive simulations, we demonstrate that such an approach reliably recovers true covariates under diverse noise scenarios and improves both selection accuracy and stability. We then apply the proposed method to a forestry data set, analyzing the distribution of trees in relation to elevation and soil nutrients in a tropical rain forest. This shows the practical utility of the method, which provides a systematic framework for robust variable selection in spatial point-process models under noise, without requiring additional knowledge of the process.
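
As a rough sketch of stability selection for intensity covariates, the snippet below subsamples points and refits a sparse model each round; for brevity it swaps the paper's non-convex best-subset penalty for an L1-penalized logistic fit of events against background points, a standard device for log-linear intensities:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def stability_select(X_events, X_background, n_rounds=100, frac=0.5,
                     C=0.3, threshold=0.7, rng=np.random.default_rng(0)):
    """Keep covariates selected in at least `threshold` of the subsampled
    refits. X_events: covariates at observed points; X_background:
    covariates at dummy/background locations."""
    X = np.vstack([X_events, X_background])
    y = np.r_[np.ones(len(X_events)), np.zeros(len(X_background))]
    hits = np.zeros(X.shape[1])
    for _ in range(n_rounds):
        idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        clf.fit(X[idx], y[idx])
        hits += (np.abs(clf.coef_[0]) > 1e-8)
    return hits / n_rounds >= threshold

# Toy usage: 2 informative covariates out of 10.
rng = np.random.default_rng(1)
Xe = rng.normal(size=(300, 10)); Xe[:, 0] += 1.0; Xe[:, 1] -= 1.0
Xb = rng.normal(size=(1500, 10))
print(stability_select(Xe, Xb))
```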

[LG-65] Error Bounds and Optimal Schedules for Masked Diffusions with Factorized Approximations

链接: https://arxiv.org/abs/2510.25544
作者: Hugo Lavenant,Giacomo Zanella
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:Recently proposed generative models for discrete data, such as Masked Diffusion Models (MDMs), exploit conditional independence approximations to reduce the computational cost of popular Auto-Regressive Models (ARMs), at the price of some bias in the sampling distribution. We study the resulting computation-vs-accuracy trade-off, providing general error bounds (in relative entropy) that depend only on the average number of tokens generated per iteration and are independent of the data dimensionality (i.e. sequence length), thus supporting the empirical success of MDMs. We then investigate the gain obtained by using non-constant schedule sizes (i.e. varying the number of unmasked tokens during the generation process) and identify the optimal schedule as a function of a so-called information profile of the data distribution, thus allowing for a principled optimization of schedule sizes. We define the methods directly as sampling algorithms, rather than deriving them classically as time-reversed diffusion processes, which leads to simple and transparent proofs.
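
The trade-off being bounded is easy to see in code: committing several tokens from one factorized conditional draw is the approximation, and the schedule controls how many are committed per step. A skeleton with an assumed `model_sample` callable (here a random stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate(model_sample, length, schedule):
    """Masked-diffusion generation with an unmasking schedule. `schedule`
    lists how many tokens to commit per iteration (summing to `length`);
    `model_sample(tokens)` is assumed to return, for every position, a draw
    from the model's *factorized* conditional given the revealed tokens.
    Committing several positions from one such draw is exactly the
    conditional-independence approximation whose bias the paper bounds."""
    MASK = -1
    tokens = np.full(length, MASK)
    for n_reveal in schedule:
        masked = np.flatnonzero(tokens == MASK)
        proposal = model_sample(tokens)
        chosen = rng.choice(masked, size=n_reveal, replace=False)
        tokens[chosen] = proposal[chosen]
    return tokens

dummy = lambda tokens: rng.integers(0, 50, size=tokens.shape)  # stand-in model
constant = [4] * 8                     # 4 tokens per step, 8 steps
front_loaded = [10, 8, 6, 4, 2, 1, 1]  # also 32 tokens, fewer late commits
print(generate(dummy, 32, constant), generate(dummy, 32, front_loaded))
```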

[LG-66] Convergence of off-policy TD(0) with linear function approximation for reversible Markov chains

链接: https://arxiv.org/abs/2510.25514
作者: Maik Overmars,Jasper Goseling,Richard Boucherie
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the convergence of off-policy TD(0) with linear function approximation when used to approximate the expected discounted reward in a Markov chain. It is well known that the combination of off-policy learning and function approximation can lead to divergence of the algorithm. Existing results for this setting modify the algorithm, for instance by reweighting the updates using importance sampling. This establishes convergence at the expense of additional complexity. In contrast, our approach is to analyse the standard algorithm, but to restrict our attention to the class of reversible Markov chains. We demonstrate convergence under this mild reversibility condition on the structure of the chain, which in many applications can be assumed using domain knowledge. In particular, we establish a convergence guarantee under an upper bound on the discount factor in terms of the difference between the on-policy and off-policy process. This improves upon known results in the literature that state that convergence holds for a sufficiently small discount factor by establishing an explicit bound. Convergence is with probability one and achieves projected Bellman error equal to zero. To obtain these results, we adapt the stochastic approximation framework that was used by Tsitsiklis and Van Roy [1997] for the on-policy case to the off-policy case. We illustrate our results using different types of reversible Markov chains, such as one-dimensional random walks and random walks on a weighted graph.
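
For intuition, here is the unmodified algorithm in a setting of the kind the abstract describes, sketched in NumPy: a reversible birth-death random walk, states sampled off-policy (uniformly rather than from the stationary distribution), linear features, and a moderate discount factor:

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma, alpha = 10, 0.8, 0.05
phi = lambda s: np.array([1.0, s / (n - 1)])   # linear features: bias + position

def step(s):
    """One transition of a symmetric birth-death walk on {0,...,n-1};
    birth-death chains are reversible, the structural condition used here."""
    if rng.uniform() < 0.5:
        return min(s + 1, n - 1)
    return max(s - 1, 0)

w = np.zeros(2)
for t in range(100_000):
    s = rng.integers(0, n)        # off-policy: states drawn uniformly,
    s_next = step(s)              # not from the chain's stationary law
    r = 1.0 if s_next == n - 1 else 0.0
    # Standard TD(0) update, no importance-sampling correction:
    td_err = r + gamma * phi(s_next) @ w - phi(s) @ w
    w += alpha * td_err * phi(s)
print("weights:", w, "values:", [round(phi(s) @ w, 3) for s in range(n)])
```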

[LG-67] Generative Bayesian Optimization: Generative Models as Acquisition Functions

链接: https://arxiv.org/abs/2510.25240
作者: Rafael Oliveira,Daniel M. Steinberg,Edwin V. Bonilla
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Under review

点击查看摘要

Abstract:We present a general strategy for turning generative models into candidate solution samplers for batch Bayesian optimization (BO). The use of generative models for BO enables large-batch scaling via generative sampling, optimization of non-continuous design spaces, and high-dimensional and combinatorial design. Inspired by the success of direct preference optimization (DPO), we show that one can train a generative model with noisy, simple utility values directly computed from observations to then form proposal distributions whose densities are proportional to the expected utility, i.e., BO’s acquisition function values. Furthermore, this approach is generalizable beyond preference-based feedback to general types of reward signals and loss functions. This perspective avoids the construction of surrogate (regression or classification) models, common in previous methods that have used generative models for black-box optimization. Theoretically, we show that the generative models within the BO process approximately follow a sequence of distributions which asymptotically concentrate at the global optima under certain conditions. We also demonstrate this effect through experiments on challenging optimization problems involving large batches in high dimensions.

[LG-68] Sustainable NARMA-10 Benchmarking for Quantum Reservoir Computing

链接: https://arxiv.org/abs/2510.25183
作者: Avyay Kodali,Priyanshi Singh,Pranay Pandey,Krishna Bhatia,Shalini Devendrababu,Srinjoy Ganguly
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 6 pages, 1 table, 2 figures. Work conducted under QIntern 2025 (QWorld) with support from Fractal AI Research

点击查看摘要

Abstract:This study compares Quantum Reservoir Computing (QRC) with classical models such as Echo State Networks (ESNs) and Long Short-Term Memory networks (LSTMs), as well as hybrid quantum-classical architectures (QLSTM), for the nonlinear autoregressive moving average task (NARMA-10). We evaluate forecasting accuracy (NRMSE), computational cost, and evaluation time. Results show that QRC achieves competitive accuracy while offering potential sustainability advantages, particularly in resource-constrained settings, highlighting its promise for sustainable time-series AI applications.
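
For readers who want to reproduce the classical side of the comparison, a NumPy sketch of the NARMA-10 generator together with a minimal ESN baseline follows; hyperparameters are illustrative, not those of the paper:

```python
import numpy as np

def narma10(T, rng=np.random.default_rng(0)):
    """Standard NARMA-10 sequence: u ~ U[0, 0.5],
    y[t+1] = 0.3 y[t] + 0.05 y[t] sum(y[t-9..t]) + 1.5 u[t-9] u[t] + 0.1."""
    u = rng.uniform(0.0, 0.5, T)
    y = np.zeros(T)
    for t in range(9, T - 1):
        y[t + 1] = (0.3 * y[t] + 0.05 * y[t] * y[t - 9:t + 1].sum()
                    + 1.5 * u[t - 9] * u[t] + 0.1)
    return u, y

def esn_nrmse(u, y, n_res=200, rho=0.9, ridge=1e-6, washout=200,
              rng=np.random.default_rng(1)):
    """Minimal echo state network: random reservoir, ridge readout."""
    w_in = rng.uniform(-0.5, 0.5, n_res)
    w = rng.normal(size=(n_res, n_res))
    w *= rho / np.max(np.abs(np.linalg.eigvals(w)))   # set spectral radius
    x, states = np.zeros(n_res), []
    for ut in u:
        x = np.tanh(w @ x + w_in * ut)
        states.append(x.copy())
    X, Y = np.array(states)[washout:], y[washout:]    # drop transient
    w_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ Y)
    pred = X @ w_out
    return np.sqrt(np.mean((pred - Y) ** 2) / np.var(Y))

u, y = narma10(3000)
print("ESN NRMSE on NARMA-10:", esn_nrmse(u, y))
```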

[LG-69] Conditional neural field for spatial dimension reduction of turbulence data: a comparison study

链接: https://arxiv.org/abs/2510.25135
作者: Junyi Guo,Pan Du,Xiantao Fan,Yahui Li,Jian-Xun Wang
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We investigate conditional neural fields (CNFs), mesh-agnostic, coordinate-based decoders conditioned on a low-dimensional latent, for spatial dimensionality reduction of turbulent flows. CNFs are benchmarked against Proper Orthogonal Decomposition and a convolutional autoencoder within a unified encoding-decoding framework and a common evaluation protocol that explicitly separates in-range (interpolative) from out-of-range (strict extrapolative) testing beyond the training horizon, with identical preprocessing, metrics, and fixed splits across all baselines. We examine three conditioning mechanisms: (i) activation-only modulation (often termed FiLM), (ii) low-rank weight and bias modulation (termed FP), and (iii) last-layer inner-product coupling, and introduce a novel domain-decomposed CNF that localizes complexities. Across representative turbulence datasets (WMLES channel inflow, DNS channel inflow, and wall pressure fluctuations over turbulent boundary layers), CNF-FP achieves the lowest training and in-range testing errors, while CNF-FiLM generalizes best for out-of-range scenarios once moderate latent capacity is available. Domain decomposition significantly improves out-of-range accuracy, especially for the more demanding datasets. The study provides a rigorous, physics-aware basis for selecting conditioning, capacity, and domain decomposition when using CNFs for turbulence compression and reconstruction.
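
A minimal PyTorch sketch of a CNF with the activation-only (FiLM) conditioning discussed above is given below; sizes are illustrative and the encoder that produces the latent z is omitted:

```python
import torch
import torch.nn as nn

class FiLMConditionalField(nn.Module):
    """Coordinate-based decoder f(x; z): an MLP over spatial coordinates
    whose hidden activations are scaled and shifted by a low-dimensional
    latent z. Mesh-agnostic: it can be queried at arbitrary coordinates."""
    def __init__(self, coord_dim=2, latent_dim=16, width=128, depth=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Linear(coord_dim, width)] +
            [nn.Linear(width, width) for _ in range(depth - 1)])
        self.film = nn.Linear(latent_dim, 2 * width * depth)
        self.out = nn.Linear(width, 1)
        self.width, self.depth = width, depth

    def forward(self, x, z):
        # z: a single latent vector for one snapshot (no batch dimension)
        gb = self.film(z).view(self.depth, 2, self.width)
        h = x
        for i, layer in enumerate(self.layers):
            gamma, beta = gb[i, 0], gb[i, 1]
            h = torch.tanh(gamma * layer(h) + beta)   # modulated activation
        return self.out(h)

model = FiLMConditionalField()
x = torch.rand(1024, 2)   # off-grid query points
z = torch.randn(16)       # latent code for one flow snapshot
u = model(x, z)           # reconstructed field values at the queries
```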

[LG-70] EnzyControl: Adding Functional and Substrate-Specific Control for Enzyme Backbone Generation

链接: https://arxiv.org/abs/2510.25132
作者: Chao Song,Zhiyuan Liu,Han Huang,Liang Wang,Qiong Wang,Jianyu Shi,Hui Yu,Yihang Zhou,Yang Zhang
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Designing enzyme backbones with substrate-specific functionality is a critical challenge in computational protein engineering. Current generative models excel in protein design but face limitations in binding data, substrate-specific control, and flexibility for de novo enzyme backbone generation. To address this, we introduce EnzyBind, a dataset with 11,100 experimentally validated enzyme-substrate pairs specifically curated from PDBbind. Building on this, we propose EnzyControl, a method that enables functional and substrate-specific control in enzyme backbone generation. Our approach generates enzyme backbones conditioned on MSA-annotated catalytic sites and their corresponding substrates, which are automatically extracted from curated enzyme-substrate data. At the core of EnzyControl is EnzyAdapter, a lightweight, modular component integrated into a pretrained motif-scaffolding model, allowing it to become substrate-aware. A two-stage training paradigm further refines the model’s ability to generate accurate and functional enzyme structures. Experiments show that our EnzyControl achieves the best performance across structural and functional metrics on EnzyBind and EnzyBench benchmarks, with particularly notable improvements of 13% in designability and 13% in catalytic efficiency compared to the baseline models. The code is released at this https URL.

[LG-71] Nonlinear Dynamics In Optimization Landscape of Shallow Neural Networks with Tunable Leaky ReLU

链接: https://arxiv.org/abs/2510.25060
作者: Jingzhou Liu
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:In this work, we study the nonlinear dynamics of a shallow neural network trained with mean-squared loss and leaky ReLU activation. Under Gaussian inputs and equal layer width k, (1) we establish, based on the equivariant gradient degree, a theoretical framework, applicable to any number of neurons k \ge 4, to detect bifurcation of critical points with associated symmetries from the global minimum as the leaky parameter \alpha varies. Notably, our analysis reveals that a multi-mode degeneracy consistently occurs at the critical value 0, independent of k. (2) As a by-product, we further show that such bifurcations are width-independent, arise only for nonnegative \alpha, and that the global minimum undergoes no further symmetry-breaking instability throughout the engineering regime \alpha \in (0,1). An explicit example with k=5 is presented to illustrate the framework and exhibit the resulting bifurcations together with their symmetries.

[LG-72] Bayesian Neural Networks vs. Mixture Density Networks: Theoretical and Empirical Insights for Uncertainty-Aware Nonlinear Modeling

链接: https://arxiv.org/abs/2510.25001
作者: Riddhi Pratim Ghosh,Ian Barnett
类目: Computation (stat.CO); Machine Learning (cs.LG)
*备注: 20 pages, 2 figures

点击查看摘要

Abstract:This paper investigates two prominent probabilistic neural modeling paradigms: Bayesian Neural Networks (BNNs) and Mixture Density Networks (MDNs) for uncertainty-aware nonlinear regression. While BNNs incorporate epistemic uncertainty by placing prior distributions over network parameters, MDNs directly model the conditional output distribution, thereby capturing multimodal and heteroscedastic data-generating mechanisms. We present a unified theoretical and empirical framework comparing these approaches. On the theoretical side, we derive convergence rates and error bounds under Hölder smoothness conditions, showing that MDNs achieve faster Kullback-Leibler (KL) divergence convergence due to their likelihood-based nature, whereas BNNs exhibit additional approximation bias induced by variational inference. Empirically, we evaluate both architectures on synthetic nonlinear datasets and a radiographic benchmark (RSNA Pediatric Bone Age Challenge). Quantitative and qualitative results demonstrate that MDNs more effectively capture multimodal responses and adaptive uncertainty, whereas BNNs provide more interpretable epistemic uncertainty under limited data. Our findings clarify the complementary strengths of posterior-based and likelihood-based probabilistic learning, offering guidance for uncertainty-aware modeling in nonlinear systems.
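
The likelihood-based nature of MDNs, which drives their faster KL convergence in the analysis, comes down to training a mixture head by exact negative log-likelihood. A minimal PyTorch sketch (dimensions illustrative):

```python
import math
import torch
import torch.nn as nn

class MDNHead(nn.Module):
    """Mixture density head: maps features to the parameters of a
    K-component Gaussian mixture over a scalar target."""
    def __init__(self, in_dim, k=5):
        super().__init__()
        self.param = nn.Linear(in_dim, 3 * k)   # logits, means, log-scales

    def forward(self, h):
        logits, mu, log_sigma = self.param(h).chunk(3, dim=-1)
        return logits, mu, log_sigma.clamp(-7.0, 7.0)

def mdn_nll(logits, mu, log_sigma, y):
    """y: (batch, 1). NLL = -logsumexp_k [log pi_k + log N(y; mu_k, sigma_k^2)]."""
    log_pi = torch.log_softmax(logits, dim=-1)
    log_comp = (-0.5 * ((y - mu) / log_sigma.exp()) ** 2
                - log_sigma - 0.5 * math.log(2.0 * math.pi))
    return -torch.logsumexp(log_pi + log_comp, dim=-1).mean()

# Usage on features from any backbone:
head = MDNHead(in_dim=32, k=5)
h, y = torch.randn(64, 32), torch.randn(64, 1)
loss = mdn_nll(*head(h), y)
loss.backward()
```

A BNN would instead place distributions over the weights and train by variational inference, which is where the extra approximation bias discussed above enters.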

[LG-73] scMRDR: A scalable and flexible framework for unpaired single-cell multi-omics data integration NEURIPS2025

链接: https://arxiv.org/abs/2510.24987
作者: Jianle Sun,Chaoqi Liang,Ran Wei,Peng Zheng,Lei Bai,Wanli Ouyang,Hongliang Yan,Peng Ye
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注: Accepted at NeurIPS 2025 (Spotlight)

点击查看摘要

Abstract:Advances in single-cell sequencing have enabled high-resolution profiling of diverse molecular modalities, while integrating unpaired multi-omics single-cell data remains challenging. Existing approaches either rely on pair information or prior correspondences, or require computing a global pairwise coupling matrix, limiting their scalability and flexibility. In this paper, we introduce a scalable and flexible generative framework called single-cell Multi-omics Regularized Disentangled Representations (scMRDR) for unpaired multi-omics integration. Specifically, we disentangle each cell’s latent representations into modality-shared and modality-specific components using a well-designed \beta-VAE architecture, which are augmented with isometric regularization to preserve intra-omics biological heterogeneity, an adversarial objective to encourage cross-modal alignment, and a masked reconstruction loss strategy to address the issue of missing features across modalities. Our method achieves excellent performance on benchmark datasets in terms of batch correction, modality alignment, and biological signal preservation. Crucially, it scales effectively to large-scale datasets and supports integration of more than two omics, offering a powerful and flexible solution for large-scale multi-omics data integration and downstream biological discovery.

[LG-74] Tree Ensemble Explainability through the Hoeffding Functional Decomposition and TreeHFD Algorithm

链接: https://arxiv.org/abs/2510.24815
作者: Clément Bénard
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tree ensembles have demonstrated state-of-the-art predictive performance across a wide range of problems involving tabular data. Nevertheless, the black-box nature of tree ensembles is a strong limitation, especially for applications with critical decisions at stake. The Hoeffding or ANOVA functional decomposition is a powerful explainability method, as it breaks down black-box models into a unique sum of lower-dimensional functions, provided that input variables are independent. In standard learning settings, input variables are often dependent, and the Hoeffding decomposition is generalized through hierarchical orthogonality constraints. Such generalization leads to unique and sparse decompositions with well-defined main effects and interactions. However, the practical estimation of this decomposition from a data sample is still an open problem. Therefore, we introduce the TreeHFD algorithm to estimate the Hoeffding decomposition of a tree ensemble from a data sample. We show the convergence of TreeHFD, along with the main properties of orthogonality, sparsity, and causal variable selection. The high performance of TreeHFD is demonstrated through experiments on both simulated and real data, using our treehfd Python package (this https URL). Besides, we empirically show that the widely used TreeSHAP method, based on Shapley values, is strongly connected to the Hoeffding decomposition.

[LG-75] Sub-microsecond Transformers for Jet Tagging on FPGAs

链接: https://arxiv.org/abs/2510.24784
作者: Lauri Laatu,Chang Sun,Arianna Cox,Abhijith Gandrakota,Benedikt Maier,Jennifer Ngadiuba,Zhiqiang Que,Wayne Luk,Maria Spiropulu,Alexander Tapper
类目: Instrumentation and Detectors (physics.ins-det); Machine Learning (cs.LG); Performance (cs.PF); High Energy Physics - Experiment (hep-ex)
*备注:

点击查看摘要

Abstract:We present the first sub-microsecond transformer implementation on an FPGA achieving competitive performance for state-of-the-art high-energy physics benchmarks. Transformers have shown exceptional performance on multiple tasks in modern machine learning applications, including jet tagging at the CERN Large Hadron Collider (LHC). However, their computational complexity has prohibited use in real-time applications, such as the hardware trigger system of the collider experiments, up until now. In this work, we demonstrate the first application of transformers for jet tagging on FPGAs, achieving \mathcal{O}(100) nanosecond latency with superior performance compared to alternative baseline models. We leverage high-granularity quantization and distributed arithmetic optimization to fit the entire transformer model on a single FPGA, achieving the required throughput and latency. Furthermore, we add multi-head attention and linear attention support to hls4ml, making our work accessible to the broader fast machine learning community. This work advances the next-generation trigger systems for the High Luminosity LHC, enabling the use of transformers for real-time applications in high-energy physics and beyond.

[LG-76] Certainty in Uncertainty: Reasoning over Uncertain Knowledge Graphs with Statistical Guarantees EMNLP2025

链接: https://arxiv.org/abs/2510.24754
作者: Yuqicheng Zhu,Jingcheng Wu,Yizhen Wang,Hongkuan Zhou,Jiaoyan Chen,Evgeny Kharlamov,Steffen Staab
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Accepted as a main conference paper at EMNLP 2025

点击查看摘要

Abstract:Uncertain knowledge graph embedding (UnKGE) methods learn vector representations that capture both structural and uncertainty information to predict scores of unseen triples. However, existing methods produce only point estimates, without quantifying predictive uncertainty, limiting their reliability in high-stakes applications where understanding confidence in predictions is crucial. To address this limitation, we propose UnKGCP, a framework that generates prediction intervals guaranteed to contain the true score with a user-specified level of confidence. The length of the intervals reflects the model’s predictive uncertainty. UnKGCP builds on the conformal prediction framework but introduces a novel nonconformity measure tailored to UnKGE methods and an efficient procedure for interval construction. We provide theoretical guarantees for the intervals and empirically verify these guarantees. Extensive experiments on standard benchmarks across diverse UnKGE methods further demonstrate that the intervals are sharp and effectively capture predictive uncertainty.
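
The conformal machinery itself is compact. Below is a generic split-conformal sketch with the plain absolute-error score standing in for the paper's UnKGE-tailored nonconformity measure:

```python
import numpy as np

def conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
    """Split conformal prediction with the absolute-error nonconformity
    score. Given calibration pairs from any point-estimate scorer, the
    returned intervals contain the true score with probability >= 1 - alpha
    under exchangeability."""
    scores = np.abs(np.asarray(cal_true) - np.asarray(cal_pred))
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1.0 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")
    return test_pred - q, test_pred + q

# Toy usage: interval width reflects the scorer's noise level.
rng = np.random.default_rng(0)
truth = rng.uniform(size=500)
preds = truth + 0.05 * rng.normal(size=500)
lo, hi = conformal_interval(preds[:400], truth[:400], preds[400:])
print("empirical coverage:", np.mean((truth[400:] >= lo) & (truth[400:] <= hi)))
```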

[LG-77] Comparative Analysis of Data Augmentation for Clinical ECG Classification with STAR

链接: https://arxiv.org/abs/2510.24740
作者: Nader Nemati
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 19 pages, 11 figures

点击查看摘要

Abstract:Clinical 12-lead ECG classification remains difficult because diverse recording conditions, overlapping pathologies, and pronounced label imbalance hinder generalization, while unconstrained augmentations risk distorting diagnostically critical morphology. In this study, Sinusoidal Time–Amplitude Resampling (STAR) is introduced as a beat-wise augmentation that operates strictly between successive R-peaks to apply controlled time warping and amplitude scaling to each R–R segment, preserving the canonical P–QRS–T order and leaving the head and tail of the trace unchanged. STAR is designed for practical pipelines and offers: (i) morphology-faithful variability that broadens training diversity without corrupting peaks or intervals; (ii) source-resilient training, improving stability across devices, sites, and cohorts without dataset-specific tuning; (iii) model-agnostic integration with common 1D SE–ResNet-style ECG encoder backbones; and (iv) better learning on rare classes via beat-level augmentation, reducing overfitting by resampling informative beats instead of duplicating whole records. In contrast to global crops, large shifts, or additive noise, STAR avoids transformations that suppress or misalign clinical landmarks. A complete Python implementation and a transparent training workflow are released, aligned with a source-aware, stratified five-fold protocol over a multi-institutional 12-lead corpus, thereby facilitating inspection and reuse. Taken together, STAR provides a simple and controllable augmentation for clinical ECG classification where trustworthy morphology, operational simplicity, and cross-source durability are essential.
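
A rough NumPy sketch of the beat-wise idea follows; it applies uniform random warp and scale factors per R-R segment, whereas the actual STAR transform is sinusoidal, so treat this as an approximation of the recipe, not the authors' code:

```python
import numpy as np

def star_like_augment(ecg, r_peaks, max_warp=0.1, max_scale=0.1,
                      rng=np.random.default_rng(0)):
    """Each R-R segment is resampled to a randomly warped length and scaled
    in amplitude; samples before the first and after the last R-peak are
    left unchanged, preserving P-QRS-T ordering inside every beat."""
    pieces = [ecg[:r_peaks[0]]]                       # untouched head
    for a, b in zip(r_peaks[:-1], r_peaks[1:]):
        seg = ecg[a:b]
        new_len = max(2, int(round(len(seg) * (1 + rng.uniform(-max_warp, max_warp)))))
        t_old = np.linspace(0.0, 1.0, len(seg))
        t_new = np.linspace(0.0, 1.0, new_len)
        warped = np.interp(t_new, t_old, seg)          # time resampling
        warped *= 1 + rng.uniform(-max_scale, max_scale)  # amplitude scaling
        pieces.append(warped)
    pieces.append(ecg[r_peaks[-1]:])                   # untouched tail
    return np.concatenate(pieces)

# Example: augment one lead given detected R-peak sample indices.
# aug = star_like_augment(ecg_lead, np.array([120, 420, 730, 1040]))
```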

[LG-78] StrikeWatch: Wrist-worn Gait Recognition with Compact Time-series Models on Low-power FPGAs

链接: https://arxiv.org/abs/2510.24738
作者: Tianheng Ling,Chao Qian,Peter Zdankin,Torben Weis,Gregor Schiele
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 9 pages, 6 figures, 3 tables, accepted by IEEE Annual Congress on Artificial Intelligence of Things (IEEE AIoT), 3-5 Dec 2025, Osaka Japan

点击查看摘要

Abstract:Running offers substantial health benefits, but improper gait patterns can lead to injuries, particularly without expert feedback. While prior gait analysis systems based on cameras, insoles, or body-mounted sensors have demonstrated effectiveness, they are often bulky and limited to offline, post-run analysis. Wrist-worn wearables offer a more practical and non-intrusive alternative, yet enabling real-time gait recognition on such devices remains challenging due to noisy Inertial Measurement Unit (IMU) signals, limited computing resources, and dependence on cloud connectivity. This paper introduces StrikeWatch, a compact wrist-worn system that performs entirely on-device, real-time gait recognition using IMU signals. As a case study, we target the detection of heel versus forefoot strikes to enable runners to self-correct harmful gait patterns through visual and auditory feedback during running. We propose four compact DL architectures (1D-CNN, 1D-SepCNN, LSTM, and Transformer) and optimize them for energy-efficient inference on two representative embedded Field-Programmable Gate Arrays (FPGAs): the AMD Spartan-7 XC7S15 and the Lattice iCE40UP5K. Using our custom-built hardware prototype, we collect a labeled dataset from outdoor running sessions and evaluate all models via a fully automated deployment pipeline. Our results reveal clear trade-offs between model complexity and hardware efficiency. Evaluated across 12 participants, the 6-bit quantized 1D-SepCNN achieves the highest average F1 score of 0.847 while consuming just 0.350 µJ per inference with a latency of 0.140 ms on the iCE40UP5K running at 20 MHz. This configuration supports up to 13.6 days of continuous inference on a 320 mAh battery. All datasets and code are available in the GitHub repository this https URL.

[LG-79] Decoding non-invasive brain activity with novel deep-learning approaches

链接: https://arxiv.org/abs/2510.24733
作者: Richard Csaky
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: PhD thesis, 342 pages

点击查看摘要

Abstract:This thesis delves into the world of non-invasive electrophysiological brain signals like electroencephalography (EEG) and magnetoencephalography (MEG), focusing on modelling and decoding such data. The research aims to investigate what happens in the brain when we perceive visual stimuli or engage in covert speech (inner speech) and enhance the decoding performance of such stimuli. The thesis is divided into two main sections, methodological and experimental work. A central concern in both sections is the large variability present in electrophysiological recordings, whether it be within-subject or between-subject variability, and to a certain extent between-dataset variability. In the methodological sections, we explore the potential of deep learning for brain decoding. We present advancements in decoding visual stimuli using linear models at the individual subject level. We then explore how deep learning techniques can be employed for group decoding, introducing new methods to deal with between-subject variability. Finally, we also explores novel forecasting models of MEG data based on convolutional and Transformer-based architectures. In particular, Transformer-based models demonstrate superior capabilities in generating signals that closely match real brain data, thereby enhancing the accuracy and reliability of modelling the brain’s electrophysiology. In the experimental section, we present a unique dataset containing high-trial inner speech EEG, MEG, and preliminary optically pumped magnetometer (OPM) data. Our aim is to investigate different types of inner speech and push decoding performance by collecting a high number of trials and sessions from a few participants. However, the decoding results are found to be mostly negative, underscoring the difficulty of decoding inner speech.

[LG-80] Spectral functions in Minkowski quantum electrodynamics from neural reconstruction: Benchmarking against dispersive Dyson–Schwinger integral equations

链接: https://arxiv.org/abs/2510.24728
作者: Rodrigo Carmo Terin
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Lattice (hep-lat); High Energy Physics - Theory (hep-th)
*备注: 9 pages, 2 figures

点击查看摘要

Abstract:A Minkowskian physics-informed neural network approach (M–PINN) is formulated to solve the Dyson–Schwinger integral equations (DSE) of quantum electrodynamics (QED) directly in Minkowski spacetime. Our novel strategy merges two complementary approaches: (i) a dispersive solver based on Lehmann representations and subtracted dispersion relations, and (ii) a M–PINN that learns the fermion mass function B(p^2) , under the same truncation and renormalization configuration (quenched, rainbow, Landau gauge) with the loss integrating the DSE residual with multi–scale regularization, and monotonicity/smoothing penalties in the spacelike branch in the same way as in our previous work in Euclidean space. The benchmarks show quantitative agreement from the infrared (IR) to the ultraviolet (UV) scales in both on-shell and momentum-subtraction schemes. In this controlled setting, our M–PINN reproduces the dispersive solution whilst remaining computationally compact and differentiable, paving the way for extensions with realistic vertices, unquenching effects, and uncertainty-aware variants.

信息检索

[IR-0] Retrieval-Augmented Search for Large-Scale Map Collections with ColPali

链接: https://arxiv.org/abs/2510.25718
作者: Jamie Mahowald,Benjamin Charles Germain Lee
类目: Information Retrieval (cs.IR); Digital Libraries (cs.DL)
*备注: 5 pages, 5 figures

点击查看摘要

Abstract:Multimodal approaches have shown great promise for searching and navigating digital collections held by libraries, archives, and museums. In this paper, we introduce map-RAS: a retrieval-augmented search system for historic maps. In addition to introducing our framework, we detail our publicly-hosted demo for searching 101,233 map images held by the Library of Congress. With our system, users can multimodally query the map collection via ColPali, summarize search results using Llama 3.2, and upload their own collections to perform inter-collection search. We articulate potential use cases for archivists, curators, and end-users, as well as future work with our system in both machine learning and the digital humanities. Our demo can be viewed at: this http URL.

[IR-1] MMQ-v2: Align Denoise and Amplify: Adaptive Behavior Mining for Semantic IDs Learning in Recommendation

链接: https://arxiv.org/abs/2510.25622
作者: Yi Xu,Moyu Zhang,Chaofan Fan,Jinxin Hu,Xiaochen Li,Yu Zhang,Xiaoyi Zeng,Jing Zhang
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Industrial recommender systems rely on unique Item Identifiers (ItemIDs). However, this method struggles with scalability and generalization in large, dynamic datasets that have sparse long-tail items. Content-based Semantic IDs (SIDs) address this by sharing knowledge through content quantization. However, by ignoring dynamic behavioral properties, purely content-based SIDs have limited expressive power. Existing methods attempt to incorporate behavioral information but overlook a critical distinction: unlike relatively uniform content features, user-item interactions are highly skewed and diverse, creating a vast information gap in quality and quantity between popular and long-tail items. This oversight leads to two critical limitations: (1) Noise Corruption: indiscriminate behavior-content alignment allows collaborative noise from long-tail items to corrupt their content representations, leading to the loss of critical multimodal information. (2) Signal Obscurity: the equal-weighting scheme for SIDs fails to reflect the varying importance of different behavioral signals, making it difficult for downstream tasks to distinguish important SIDs from uninformative ones. To tackle these issues, we propose a mixture-of-quantization framework, MMQ-v2, to adaptively Align, Denoise, and Amplify multimodal information from content and behavior modalities for semantic ID learning. The semantic IDs generated by this framework are named ADA-SID. It introduces two innovations: an adaptive behavior-content alignment that is aware of information richness to shield representations from noise, and a dynamic behavioral router to amplify critical signals by applying different weights to SIDs. Extensive experiments on public and large-scale industrial datasets demonstrate ADA-SID’s significant superiority in both generative and discriminative recommendation tasks.

[IR-2] Generalized Pseudo-Relevance Feedback

链接: https://arxiv.org/abs/2510.25488
作者: Yiteng Tu,Weihang Su,Yujia Zhou,Yiqun Liu,Fen Lin,Qin Liu,Qingyao Ai
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Query rewriting is a fundamental technique in information retrieval (IR). It typically employs the retrieval result as relevance feedback to refine the query and thereby addresses the vocabulary mismatch between user queries and relevant documents. Traditional pseudo-relevance feedback (PRF) and its vector-based extension (VPRF) improve retrieval performance by leveraging top-retrieved documents as relevance feedback. However, they are constructed based on two major hypotheses: the relevance assumption (top documents are relevant) and the model assumption (rewriting methods need to be designed specifically for particular model architectures). While recent large language models (LLMs)-based generative relevance feedback (GRF) enables model-free query reformulation, it either suffers from severe LLM hallucination or, again, relies on the relevance assumption to guarantee the effectiveness of rewriting quality. To overcome these limitations, we introduce an assumption-relaxed framework: Generalized Pseudo-Relevance Feedback (GPRF), which performs model-free, natural language rewriting based on retrieved documents, not only eliminating the model assumption but also reducing dependence on the relevance assumption. Specifically, we design a utility-oriented training pipeline with reinforcement learning to ensure robustness against noisy feedback. Extensive experiments across multiple benchmarks and retrievers demonstrate that GPRF consistently outperforms strong baselines, establishing it as an effective and generalizable framework for query rewriting.

[IR-3] Towards Automated Quality Assurance of Patent Specifications: A Multi-Dimensional LLM Framework

链接: https://arxiv.org/abs/2510.25402
作者: Yuqian Chai,Chaochao Wang,Weilei Wang
类目: Information Retrieval (cs.IR); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:Despite the surge in patent applications and emergence of AI drafting tools, systematic evaluation of patent content quality has received limited research attention. To address this gap, we propose to evaluate patents using regulatory compliance, technical coherence, and figure-reference consistency detection modules, and then generate improvement suggestions via an integration module. The framework is validated on a comprehensive dataset comprising 80 human-authored and 80 AI-generated patents from two patent drafting tools. Experimental results show balanced accuracies of 99.74%, 82.12%, and 91.2% respectively across the three detection modules when validated against expert annotations. Additional analysis was conducted to examine defect distributions across patent sections, technical domains, and authoring sources. Section-based analysis indicates that figure-text consistency and technical detail precision require particular attention. Mechanical Engineering and Construction show more claim-specification inconsistencies due to complex technical documentation requirements. AI-generated patents show a significant gap compared to human-authored ones. While human-authored patents primarily contain surface-level errors like typos, AI-generated patents exhibit more structural defects in figure-text alignment and cross-references.

[IR-4] Revisiting scalable sequential recommendation with Multi-Embedding Approach and Mixture-of-Experts

链接: https://arxiv.org/abs/2510.25285
作者: Qiushi Pan,Hao Wang,Guoyuan An,Luankang Zhang,Wei Guo,Yong Liu
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:In recommendation systems, how to effectively scale up recommendation models has been an essential research topic. While significant progress has been made in developing advanced and scalable architectures for sequential recommendation(SR) models, there are still challenges due to items’ multi-faceted characteristics and dynamic item relevance in the user context. To address these issues, we propose Fuxi-MME, a framework that integrates a multi-embedding strategy with a Mixture-of-Experts (MoE) architecture. Specifically, to efficiently capture diverse item characteristics in a decoupled manner, we decompose the conventional single embedding matrix into several lower-dimensional embedding matrices. Additionally, by substituting relevant parameters in the Fuxi Block with an MoE layer, our model achieves adaptive and specialized transformation of the enriched representations. Empirical results on public datasets show that our proposed framework outperforms several competitive baselines.
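
The multi-embedding decomposition is straightforward to sketch in PyTorch: one large table is replaced by several lower-dimensional ones whose outputs are concatenated, so different facets of an item live in decoupled subspaces (the MoE-augmented Fuxi Block is omitted, and the dimensions are illustrative, not the paper's configuration):

```python
import torch
import torch.nn as nn

class MultiEmbedding(nn.Module):
    """Decomposed item embedding: several low-dimensional tables whose
    concatenation replaces a single n_items x d embedding matrix."""
    def __init__(self, n_items, dims=(16, 16, 32)):
        super().__init__()
        self.tables = nn.ModuleList(nn.Embedding(n_items, d) for d in dims)

    def forward(self, item_ids):
        return torch.cat([t(item_ids) for t in self.tables], dim=-1)

emb = MultiEmbedding(n_items=100_000)
x = emb(torch.tensor([3, 17, 42]))   # (3, 64) decoupled item representation
```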

[IR-5] Measuring the Research Output and Performance of the University of Ibadan from 2014 to 2023: A Scientometric Analysis

链接: https://arxiv.org/abs/2510.25283
作者: Muneer Ahmad,Undie Felicia Nkatv
类目: Digital Libraries (cs.DL); Information Retrieval (cs.IR)
*备注: 16 pages, 5 figures, Research Paper

点击查看摘要

Abstract:This study employs scientometric methods to assess the research output and performance of the University of Ibadan from 2014 to 2023. By analyzing publication trends, citation patterns, and collaboration networks, the research aims to comprehensively evaluate the university’s research productivity, impact, and disciplinary focus. This article’s endeavors are characterized by innovation, interdisciplinary collaboration, and commitment to excellence, making the University of Ibadan a significant hub for cutting-edge research in Nigeria and beyond. The goal of the current study is to ascertain the influence of the university’s research output and publication patterns between 2014 and 2023. The study focuses on the departments at the University of Ibadan that contribute the most, the best journals for publishing, the nations that collaborate, the impact of citations both locally and globally, well-known authors and their total production, and the research output broken down by year. According to the university’s ten-year publication data, 7159 papers with an h-index of 75 were published between 2014 and 2023, garnering 218,572 citations. Furthermore, the VOSviewer software mapping approach is used to illustrate the scientometric mapping of the data through graphs. The findings of this study will contribute to understanding the university’s research strengths, weaknesses, and potential areas for improvement. Additionally, the results will inform evidence-based decision-making for enhancing research strategies and policies at the University of Ibadan.

[IR-6] ODataX: A Progressive Evolution of the Open Data Protocol

链接: https://arxiv.org/abs/2510.24761
作者: Anirudh Ganesh,Nitin Sood
类目: Databases (cs.DB); Information Retrieval (cs.IR); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:The Open Data Protocol (OData) provides a standardized approach for building and consuming RESTful APIs with rich query capabilities. Despite its power and maturity, OData adoption remains confined primarily to enterprise environments, particularly within Microsoft and SAP ecosystems. This paper analyzes the key barriers preventing wider OData adoption and introduces ODataX, an evolved version of the protocol designed to address these limitations. ODataX maintains backward compatibility with OData v4 while introducing progressive complexity disclosure through simplified query syntax, built-in performance guardrails via query cost estimation, and enhanced caching mechanisms. This work aims to bridge the gap between enterprise-grade query standardization and the simplicity demanded by modern web development practices.

附件下载

点击下载今日全部论文列表