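This post lists the latest papers retrieved from arXiv.org on 2025-10-21. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.
Reminder: if you would like the daily paper list delivered by email, leave your email address in the comments.
Contents
Overview (2025-10-21)
498 papers were added today, including:
- Natural language processing: 89 papers (Computation and Language (cs.CL))
- Artificial intelligence: 144 papers (Artificial Intelligence (cs.AI))
- Computer vision: 121 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine learning: 157 papers (Machine Learning (cs.LG))
Natural Language Processing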
[NLP-0] Glyph: Scaling Context Windows via Visual-Text Compression
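Quick read: This paper addresses the sharp growth in compute and memory cost that large language models (LLMs) incur when context windows are scaled to the million-token level for long-text processing. Its key idea is Glyph, a framework that renders long text into images and processes them with vision-language models (VLMs), achieving 3-4x token compression while preserving semantic information, along with roughly 4x faster prefilling and decoding and roughly 2x faster supervised fine-tuning (SFT). The method also introduces an LLM-driven genetic search that optimizes the visual rendering configuration to balance compression against accuracy.
Link: https://arxiv.org/abs/2510.17800
Authors: Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, Yushi Bai, Jie Tang, Hongning Wang, Minlie Huang
Affiliations: The Conversational Artificial Intelligence (CoAI) Group, Tsinghua University; Zhipu AI; The Knowledge Engineering Group (KEG), Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: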
Abstract: Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective, visual context scaling, to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at this https URL.
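The compression figures above imply a simple relation between a VLM's context window and the amount of raw text it can cover. The following is only a back-of-the-envelope sketch: the function names are hypothetical, "128K" is read as 128,000 tokens, and the compression ratio is assumed to mean text tokens represented per visual token. None of this is taken from the Glyph code.

```python
def effective_text_tokens(context_window: int, compression_ratio: float) -> int:
    """Text tokens covered when each visual token stands for `compression_ratio` text tokens."""
    return int(context_window * compression_ratio)


def required_ratio(target_text_tokens: int, context_window: int) -> float:
    """Compression ratio needed to fit `target_text_tokens` into `context_window`."""
    return target_text_tokens / context_window


window = 128_000  # assumption: "128K" read as 128,000 tokens

# Coverage at the 3-4x compression reported in the abstract.
for ratio in (3.0, 4.0):
    print(f"{ratio}x -> {effective_text_tokens(window, ratio):,} text tokens")

# Ratio needed to reach the 1M-token level mentioned for extreme compression.
print(f"needed for 1M tokens: {required_ratio(1_000_000, window):.2f}x")
```

Under these assumptions, 3-4x compression covers roughly 384K to 512K text tokens, and reaching 1M tokens needs a ratio near 7.8x. That is consistent with the abstract describing the 1M-token case as "extreme compression" rather than the default setting, though the exact ratio used there is not stated.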
[NLP-1] Enterprise Deep Research: Steerable Multi-Agent Deep Research for Enterprise Analytics
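Quick read: This paper addresses the challenge of efficiently turning unstructured data into actionable insights in enterprise settings, in particular the weaknesses of current autonomous agents in domain-specific semantics, intent alignment, and enterprise system integration. Its solution is Enterprise Deep Research (EDR), a multi-agent system built from specialized agents. The key components are: (1) a Master Planning Agent for adaptive query decomposition; (2) four specialized search agents (General, Academic, GitHub, LinkedIn) covering diverse sources; (3) an extensible MCP-based tool ecosystem supporting natural language to SQL (NL2SQL), file analysis, and enterprise workflows; (4) a Visualization Agent that produces data-driven insights; and (5) a reflection mechanism that detects knowledge gaps and updates the research direction, with optional human-in-the-loop steering. The architecture enables automated report generation, real-time streaming, and enterprise deployment, and on open-ended benchmarks such as DeepResearch Bench and DeepConsult it outperforms state-of-the-art agentic systems without any human steering.
Link: https://arxiv.org/abs/2510.17797
Authors: Akshara Prabhakar, Roshan Ram, Zixiang Chen, Silvio Savarese, Frank Wang, Caiming Xiong, Huan Wang, Weiran Yao
Affiliations: Salesforce AI Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Technical report; 13 pages plus references and appendices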
Abstract: As information grows exponentially, enterprises face increasing pressure to transform unstructured data into coherent, actionable insights. While autonomous agents show promise, they often struggle with domain-specific nuances, intent alignment, and enterprise integration. We present Enterprise Deep Research (EDR), a multi-agent system that integrates (1) a Master Planning Agent for adaptive query decomposition, (2) four specialized search agents (General, Academic, GitHub, LinkedIn), (3) an extensible MCP-based tool ecosystem supporting NL2SQL, file analysis, and enterprise workflows, (4) a Visualization Agent for data-driven insights, and (5) a reflection mechanism that detects knowledge gaps and updates research direction with optional human-in-the-loop steering guidance. These components enable automated report generation, real-time streaming, and seamless enterprise deployment, as validated on internal datasets. On open-ended benchmarks including DeepResearch Bench and DeepConsult, EDR outperforms state-of-the-art agentic systems without any human steering. We release the EDR framework and benchmark trajectories to advance research on multi-agent reasoning applications. Code at this https URL and Dataset at this https URL.
[NLP-2] Executable Knowledge Graphs for Replicating AI Research
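Quick read: This paper addresses the difficulty large language model (LLM) agents have in replicating AI research, especially their failure to generate executable code. The root causes are insufficient background knowledge and retrieval-augmented generation (RAG) methods that miss latent technical details hidden in referenced papers, together with a neglect of implementation-level code signals and a lack of structured knowledge representations that support multi-granular retrieval and reuse. The key to the solution is Executable Knowledge Graphs (xKG), a modular, pluggable knowledge base that automatically integrates technical insights, code snippets, and domain-specific knowledge extracted from scientific literature. Integrated into three agent frameworks with two different LLMs, xKG yields substantial gains on PaperBench (10.9% with o3-mini), supporting it as a general and extensible approach to automated AI research replication.
Link: https://arxiv.org/abs/2510.17795
Authors: Yujie Luo, Zhuoyun Yu, Xuehai Wang, Yuqi Zhu, Ningyu Zhang, Lanning Wei, Lun Du, Da Zheng, Huajun Chen
Affiliations: Zhejiang University; Ant Group; Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Comments: Work in progress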
Abstract: Replicating AI research is a crucial yet challenging task for large language model (LLM) agents. Existing approaches often struggle to generate executable code, primarily due to insufficient background knowledge and the limitations of retrieval-augmented generation (RAG) methods, which fail to capture latent technical details hidden in referenced papers. Furthermore, previous approaches tend to overlook valuable implementation-level code signals and lack structured knowledge representations that support multi-granular retrieval and reuse. To overcome these challenges, we propose Executable Knowledge Graphs (xKG), a modular and pluggable knowledge base that automatically integrates technical insights, code snippets, and domain-specific knowledge extracted from scientific literature. When integrated into three agent frameworks with two different LLMs, xKG shows substantial performance gains (10.9% with o3-mini) on PaperBench, demonstrating its effectiveness as a general and extensible solution for automated AI research replication. Code will be released at this https URL.
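[NLP-3] Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains
Quick read: This paper addresses the growing need for scalable, high-quality evaluation during both training and inference of generative AI, and in particular the lack of reasoning evaluators developed from large-scale, diverse data. The key to the solution is a multi-task, multi-domain dataset of 2.5M samples covering five evaluation tasks (pairwise comparison, step-level evaluation, reference-free verification, reference-based verification, and single rating), used with a simple iterative rejection-sampling supervised fine-tuning (SFT) approach to train Foundational Automatic Reasoning Evaluators (FARE). FARE-8B (8B parameters) challenges larger specialized evaluators trained with reinforcement learning (RL), and FARE-20B (20B parameters, 3.6B active) surpasses specialized 70B+ evaluators. In practical use, FARE achieves near-oracle reranking performance on MATH and improves RL training when used as a verifier.
Link: https://arxiv.org/abs/2510.17793
Authors: Austin Xu, Xuan-Phi Nguyen, Yilun Zhou, Chien-Sheng Wu, Caiming Xiong, Shafiq Joty
Affiliations: Salesforce AI Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 29 pages, 9 tables, 6 figures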
Abstract:Finetuning specialized generative evaluators has emerged as a popular paradigm to meet the increasing demand for scalable evaluation during both training and test-time. However, recent work has largely focused on applying new methodology, such as reinforcement learning (RL), to training evaluators, shying away from large-scale, data-driven development. In this work, we focus on data scaling, curating a set of 2.5M samples spanning five unique evaluation tasks (pairwise, step-level, reference-free and reference-based verification, and single rating) and multiple domains focused on reasoning evaluation. With our data, we train Foundational Automatic Reasoning Evaluators (FARE), a family of 8B and 20B (with 3.6B active) parameter evaluators, with a simple iterative rejection-sampling supervised finetuning (SFT) approach. FARE-8B challenges larger specialized RL-trained evaluators and FARE-20B sets the new standard for open-source evaluators, surpassing specialized 70B+ evaluators. Beyond static benchmarks, we evaluate FARE in real-world tasks: As inference-time rerankers, FARE-20B achieves near-oracle performance on MATH. As verifiers in RL training, FARE improves the downstream RL-trained model performance by up to 14.1% vs. string-matching verifiers. When initialized from FARE, a continually-finetuned FARE-Code outperforms gpt-oss-20B by 65% on evaluating test-case quality.
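The abstract names iterative rejection-sampling SFT but does not spell out the procedure. The sketch below shows one generic round of rejection sampling for evaluator training data, under assumptions: `rejection_sample_round`, `toy_judge`, and the record fields are hypothetical, and the acceptance rule (keep a sampled judgment only when its verdict matches the gold label) is a common choice rather than the paper's confirmed recipe.

```python
import random
from typing import Callable, Dict, List

Record = Dict[str, str]


def rejection_sample_round(
    data: List[Record],
    judge: Callable[[str, random.Random], Record],
    k: int = 4,
    seed: int = 0,
) -> List[Record]:
    """Sample up to k judgments per item; keep the first whose verdict matches the gold label."""
    rng = random.Random(seed)
    kept: List[Record] = []
    for item in data:
        for _ in range(k):
            out = judge(item["prompt"], rng)
            if out["verdict"] == item["gold"]:
                kept.append({"prompt": item["prompt"], "response": out["text"]})
                break  # at most one accepted trace per item in this sketch
    return kept


def toy_judge(prompt: str, rng: random.Random) -> Record:
    """Hypothetical stand-in for the evaluator model: guesses a verdict at random."""
    verdict = rng.choice(["A", "B"])
    return {"verdict": verdict, "text": f"Reasoning about the pair. Final verdict: {verdict}"}


data = [
    {"prompt": "Pair 1: which response is better, A or B?", "gold": "A"},
    {"prompt": "Pair 2: which response is better, A or B?", "gold": "B"},
    {"prompt": "Pair 3: which response is better, A or B?", "gold": "A"},
]
sft_data = rejection_sample_round(data, toy_judge)
print(f"kept {len(sft_data)} of {len(data)} items for the next SFT round")
```

In an iterative setup, the accepted traces would be used to fine-tune the evaluator, and the updated evaluator would then serve as `judge` in the next round.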
[NLP-4] UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action
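Quick read: This paper addresses the problems that arise when computer-use agents (CUAs) rely on primitive graphical user interface (GUI) actions such as click, type, and scroll: the need for accurate visual grounding, long execution chains, cascading failures, and performance bottlenecks. Current CUAs cannot draw on high-level programmatic interfaces (APIs, MCP servers, tool calls), which limits their efficiency and robustness. The key to the solution is UltraCUA, a foundation model whose hybrid action mechanism combines low-level GUI actions with high-level programmatic tool calls. Its main components are an automated tool-extraction pipeline, a synthetic data engine that produces over 17,000 verifiable tasks, a large-scale, high-quality collection of hybrid action trajectories, and a two-stage training pipeline (supervised fine-tuning followed by online reinforcement learning) that lets the agent alternate strategically between GUI actions and tool calls. On OSWorld this yields an average 22% relative improvement over base models while being 11% faster in terms of steps, and in out-of-domain evaluation on WindowsAgentArena the model reaches a 21.7% success rate.
Link: https://arxiv.org/abs/2510.17790
Authors: Yuhao Yang, Zhen Yang, Zi-Yi Dou, Anh Nguyen, Keen You, Omar Attia, Andrew Szot, Michael Feng, Ram Ramrakhya, Alexander Toshev, Chao Huang, Yinfei Yang, Zhe Gan
Affiliations: Apple; The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: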
Abstract:Multimodal agents for computer use rely exclusively on primitive actions (click, type, scroll) that require accurate visual grounding and lengthy execution chains, leading to cascading failures and performance bottlenecks. While other agents leverage rich programmatic interfaces (APIs, MCP servers, tools), computer-use agents (CUAs) remain isolated from these capabilities. We present UltraCUA, a foundation model that bridges this gap through hybrid action – seamlessly integrating GUI primitives with high-level programmatic tool calls. To achieve this, our approach comprises four key components: (1) an automated pipeline that scales programmatic tools from software documentation, open-source repositories, and code generation; (2) a synthetic data engine producing over 17,000 verifiable tasks spanning real-world computer-use scenarios; (3) a large-scale high-quality hybrid action trajectory collection with both low-level GUI actions and high-level programmatic tool calls; and (4) a two-stage training pipeline combining supervised fine-tuning with online reinforcement learning, enabling strategic alternation between low-level and high-level actions. Experiments with our 7B and 32B models demonstrate substantial improvements over state-of-the-art agents. On OSWorld, UltraCUA models achieve an average 22% relative improvement over base models, while being 11% faster in terms of steps. Out-of-domain evaluation on WindowsAgentArena shows our model reaches 21.7% success rate, outperforming baselines trained on Windows data. The hybrid action mechanism proves critical, reducing error propagation while maintaining execution efficiency.
[NLP-5] Mapping Post-Training Forgetting in Language Models at Scale
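Quick read: This paper addresses how to quantify, at scale, how much pretrained knowledge language models (LMs) forget during post-training and how much backward transfer occurs. Conventional task-averaged metrics conflate forgetting with backward transfer and obscure the underlying changes. The key to the solution is a sample-wise measurement paradigm: 1-0 transitions (correct before post-training, incorrect after) quantify forgetting, and 0-1 transitions (incorrect before, correct after) quantify backward transfer, with chance-adjusted variants for multiple-choice benchmarks that remove the expected contribution of random guessing. Applied across post-training stages, model sizes, and data scales, the framework shows how different post-training strategies (RL/SFT, model merging, and others) affect pretrained knowledge, giving a practical yardstick for this kind of analysis.
Link: https://arxiv.org/abs/2510.17776
Authors: Jackson Harmon, Andreas Hochlehnert, Matthias Bethge, Ameya Prabhu
Affiliations: Tübingen AI Center, University of Tübingen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 43 pages, 15 figures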
Abstract:Scaled post-training now drives many of the largest capability gains in language models (LMs), yet its effect on pretrained knowledge remains poorly understood. Not all forgetting is equal: Forgetting one fact (e.g., a U.S. president or an API call) does not “average out” by recalling another. Hence, we propose a sample-wise paradigm to measure what is forgotten and when backward transfer occurs. Our metric counts 1-0 transitions (correct before post-training, incorrect after) to quantify forgetting and 0-1 transitions to quantify backward transfer. Traditional task averages conflate these effects and obscure large changes. For multiple-choice benchmarks, we add chance-adjusted variants that subtract the expected contribution of random guessing from pre- and post-training accuracies. We apply this framework across post-training stages, model sizes, and data scales. Our large-scale analysis shows that: (1) Domain-continual pretraining induces moderate forgetting with low-to-moderate backward transfer; (2) RL/SFT post-training applied to base models and Instruction tuning yields moderate-to-large backward transfer on math and logic with overall low-to-moderate forgetting; (3) Applying RL/SFT to instruction-tuned models is sensitive on data scale: at small scales, both forgetting and backward transfer are small; at larger scales, effects are mixed and warrant further study with better controls; (4) Model merging does not reliably mitigate forgetting. Overall, our framework offers a practical yardstick for mapping how post-training alters pretrained knowledge at scale – enabling progress towards generally capable AI systems.
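The sample-wise metric described above can be illustrated with a few lines of code. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, and `chance_adjust` uses the standard correction-for-guessing formula as an assumed stand-in, since the abstract does not give the exact form of the chance-adjusted variants.

```python
from typing import Dict, Sequence


def transition_rates(before: Sequence[bool], after: Sequence[bool]) -> Dict[str, float]:
    """Per-sample comparison of correctness before and after post-training."""
    if len(before) != len(after) or not before:
        raise ValueError("need two non-empty, equal-length correctness vectors")
    n = len(before)
    forgot = sum(b and not a for b, a in zip(before, after))  # 1-0 transitions
    gained = sum(a and not b for b, a in zip(before, after))  # 0-1 transitions
    return {
        "forgetting": forgot / n,
        "backward_transfer": gained / n,
        "acc_before": sum(before) / n,
        "acc_after": sum(after) / n,
    }


def chance_adjust(accuracy: float, num_choices: int) -> float:
    """Assumed correction for guessing: rescale so chance level maps to 0 and perfect to 1."""
    chance = 1.0 / num_choices
    return max(0.0, (accuracy - chance) / (1.0 - chance))


# Toy correctness vectors for six samples (made up for illustration).
before = [True, True, False, False, True, False]
after = [False, True, True, False, True, True]
print(transition_rates(before, after))
print(chance_adjust(0.625, num_choices=4))
```

In the toy example, accuracy moves only from 3/6 to 4/6, yet three of the six samples changed state (one forgotten, two gained), which is the kind of change a task average hides.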
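[NLP-6] Evaluating Medical LLMs by Levels of Autonomy: A Survey Moving from Benchmarks to Applications
Quick read: This paper addresses the gap between the strong scores that medical large language models achieve on standard benchmarks and safe, reliable performance in clinical workflows. The key to the solution is a levels-of-autonomy lens (L0-L3) that reframes evaluation as a progression from informational tools (L0) through information transformation and aggregation (L1) and decision support (L2) to supervised agents (L3). Existing benchmarks and metrics are aligned with the actions permitted at each level and their associated risks, which guides metric selection, evidence assembly, and claim reporting, and moves evaluation from score-based claims toward credible, risk-aware evidence for real clinical use.
Link: https://arxiv.org/abs/2510.17764
Authors: Xiao Ye, Jacob Dineen, Zhaonan Li, Zhikun Xu, Weiyu Chen, Shijie Lu, Yuxi Huang, Ming Shen, Phu Tran, Ji-Eun Irene Yum, Muhammad Ali Khan, Muhammad Umar Afzal, Irbaz Bin Riaz, Ben Zhou
Affiliations: Arizona State University; Mayo Clinic
Categories: Computation and Language (cs.CL)
Comments: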
Abstract:Medical Large language models achieve strong scores on standard benchmarks; however, the transfer of those results to safe and reliable performance in clinical workflows remains a challenge. This survey reframes evaluation through a levels-of-autonomy lens (L0-L3), spanning informational tools, information transformation and aggregation, decision support, and supervised agents. We align existing benchmarks and metrics with the actions permitted at each level and their associated risks, making the evaluation targets explicit. This motivates a level-conditioned blueprint for selecting metrics, assembling evidence, and reporting claims, alongside directions that link evaluation to oversight. By centering autonomy, the survey moves the field beyond score-based claims toward credible, risk-aware evidence for real clinical use.
[NLP-7] VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models
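Quick read: This paper addresses the underexplored vulnerabilities introduced by the multimodal design of vision-language models (VLMs). Existing multimodal red-teaming methods largely rely on brittle templates, focus on single-attack settings, and expose only a narrow subset of vulnerabilities. The key to the solution is VERA-V, a variational inference framework that casts multimodal jailbreak discovery as learning a joint posterior distribution over paired text-image prompts, which allows the generation of stealthy, coupled adversarial inputs that bypass model guardrails. A lightweight attacker is trained to approximate the posterior, enabling efficient sampling of diverse jailbreaks, and three complementary strategies are combined: typography-based text prompts that embed harmful cues, diffusion-based image synthesis that introduces adversarial signals, and structured distractors that fragment VLM attention. On the HarmBench and HADES benchmarks, VERA-V achieves up to 53.75% higher attack success rate (ASR) than the best baseline on GPT-4o.
Link: https://arxiv.org/abs/2510.17759
Authors: Qilin Liao, Anamika Lochab, Ruqi Zhang
Affiliations: Purdue University
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 18 pages, 7 figures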
Abstract:Vision-Language Models (VLMs) extend large language models with visual reasoning, but their multimodal design also introduces new, underexplored vulnerabilities. Existing multimodal red-teaming methods largely rely on brittle templates, focus on single-attack settings, and expose only a narrow subset of vulnerabilities. To address these limitations, we introduce VERA-V, a variational inference framework that recasts multimodal jailbreak discovery as learning a joint posterior distribution over paired text-image prompts. This probabilistic view enables the generation of stealthy, coupled adversarial inputs that bypass model guardrails. We train a lightweight attacker to approximate the posterior, allowing efficient sampling of diverse jailbreaks and providing distributional insights into vulnerabilities. VERA-V further integrates three complementary strategies: (i) typography-based text prompts that embed harmful cues, (ii) diffusion-based image synthesis that introduces adversarial signals, and (iii) structured distractors to fragment VLM attention. Experiments on HarmBench and HADES benchmarks show that VERA-V consistently outperforms state-of-the-art baselines on both open-source and frontier VLMs, achieving up to 53.75% higher attack success rate (ASR) over the best baseline on GPT-4o.
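[NLP-8] Train for Truth, Keep the Skills: Binary Retrieval-Augmented Reward Mitigates Hallucinations
Quick read: This paper addresses extrinsic hallucination in language models, where the model generates factually incorrect information that is not supported by its training data. Existing mitigation approaches often degrade performance on open-ended generation and downstream tasks, limiting their practical value. The key to the solution is an online reinforcement learning method with a binary retrieval-augmented reward (RAR): the reward is 1 only when the model's output is entirely factually correct and 0 otherwise, which avoids the quality regressions seen with continuous reward schemes. Experiments show a substantial reduction in hallucination rates with no degradation on instruction following, math, or code, and the model learns to output "I don't know" strategically when its parametric knowledge is insufficient, which reduces incorrect answers in short-form question answering.
Link: https://arxiv.org/abs/2510.17733
Authors: Tong Chen, Akari Asai, Luke Zettlemoyer, Hannaneh Hajishirzi, Faeze Brahman
Affiliations: University of Washington; Allen Institute for AI (Ai2); Carnegie Mellon University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: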
Abstract:Language models often generate factually incorrect information unsupported by their training data, a phenomenon known as extrinsic hallucination. Existing mitigation approaches often degrade performance on open-ended generation and downstream tasks, limiting their practical utility. We propose an online reinforcement learning method using a novel binary retrieval-augmented reward (RAR) to address this tradeoff. Unlike continuous reward schemes, our approach assigns a reward of one only when the model’s output is entirely factually correct, and zero otherwise. We evaluate our method on Qwen3 reasoning models across diverse tasks. For open-ended generation, binary RAR achieves a 39.3% reduction in hallucination rates, substantially outperforming both supervised training and continuous-reward RL baselines. In short-form question answering, the model learns calibrated abstention, strategically outputting “I don’t know” when faced with insufficient parametric knowledge. This yields 44.4% and 21.7% fewer incorrect answers on PopQA and GPQA, respectively. Crucially, these factuality gains come without performance degradation on instruction following, math, or code, whereas continuous-reward RL, despite improving factuality, induces quality regressions.
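The contrast between a binary and a continuous factuality reward can be sketched as follows. This is an illustration under assumptions, not the paper's reward implementation: the per-claim support flags are taken as given (the retrieval and verification step is out of scope), `continuous_reward` as the fraction of supported claims is an assumed example of a continuous scheme, and treating an output with no checkable claims as error-free is a choice made only for this sketch.

```python
from typing import Sequence


def continuous_reward(claim_supported: Sequence[bool]) -> float:
    """Assumed continuous baseline: fraction of claims supported by retrieved evidence."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def binary_reward(claim_supported: Sequence[bool]) -> float:
    """Reward 1.0 only if no claim is unsupported, else 0.0.

    Assumption of this sketch: an output with no checkable claims counts as
    containing no error, so it receives 1.0.
    """
    return 1.0 if all(claim_supported) else 0.0


mostly_right = [True, True, False]  # one unsupported claim out of three
fully_right = [True, True]

print(continuous_reward(mostly_right), binary_reward(mostly_right))
print(continuous_reward(fully_right), binary_reward(fully_right))
```

Under the continuous scheme the mostly correct answer still earns about two thirds of the reward, whereas the binary scheme gives it nothing, so a single unsupported claim is never partially rewarded.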
[NLP-9] AcademicEval: Live Long-Context LLM Benchmark
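Quick read: This paper addresses three problems in current long-context benchmarks for large language models (LLMs): rigid context length, labor-intensive annotation, and the risk of label leakage during LLM training. The core solution is AcademicEval, a live benchmark that uses arXiv papers to build academic writing tasks requiring long-context understanding and generation (Title, Abstract, Introduction, and Related Work). The tasks cover a range of abstraction levels and need no manual labeling. AcademicEval also integrates high-quality few-shot demonstrations drawn from a collected co-author graph to support flexible context length, and its live evaluation ensures no label leakage. The evaluation shows that existing LLMs perform poorly on tasks with hierarchical abstraction levels and struggle with long few-shot demonstrations.
Link: https://arxiv.org/abs/2510.17725
Authors: Haozhen Zhang, Tao Feng, Pengrui Han, Jiaxuan You
Affiliations: University of Illinois at Urbana-Champaign
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by TMLR. Code is available at this https URL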
Abstract: Large Language Models (LLMs) have recently achieved remarkable performance in long-context understanding. However, current long-context LLM benchmarks are limited by rigid context length, labor-intensive annotation, and the pressing challenge of label leakage issues during LLM training. Therefore, we propose AcademicEval, a live benchmark for evaluating LLMs over long-context generation tasks. AcademicEval adopts papers on arXiv to introduce several academic writing tasks with long-context inputs, i.e., Title, Abstract, Introduction, and Related Work, which cover a wide range of abstraction levels and require no manual labeling. Moreover, AcademicEval integrates high-quality and expert-curated few-shot demonstrations from a collected co-author graph to enable flexible context length. Especially, AcademicEval features an efficient live evaluation, ensuring no label leakage. We conduct a holistic evaluation on AcademicEval, and the results illustrate that LLMs perform poorly on tasks with hierarchical abstraction levels and tend to struggle with long few-shot demonstrations, highlighting the challenge of our benchmark. Through experimental analysis, we also reveal some insights for enhancing LLMs’ long-context modeling capabilities. Code is available at this https URL
[NLP-10] PANER: A Paraphrase-Augmented Framework for Low-Resource Named Entity Recognition
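Quick read: This paper addresses named entity recognition (NER) in low-resource settings, where annotated data is scarce, zero-shot and instruction-tuned approaches generalize poorly to domain-specific entities, and the limited available data is not used effectively. The solution rests on two innovations: a new instruction tuning template with a simplified output format that combines principles from prior instruction tuning approaches and exploits the large context windows of recent large language models (LLMs); and a strategic data augmentation technique that paraphrases the surrounding context while preserving entity information, expanding the training data without compromising semantic relationships. The method performs comparably to state-of-the-art models in few-shot and zero-shot settings, with the few-shot approach reaching an average F1 score of 80.1 on the CrossNER datasets, and models trained with the paraphrasing approach improve F1 by up to 17 points over baseline versions, making it a promising option where NER training data and compute are limited.
Link: https://arxiv.org/abs/2510.17720
Authors: Nanda Kumar Rengarajan, Jun Yan, Chun Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: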
Abstract:Named Entity Recognition (NER) is a critical task that requires substantial annotated data, making it challenging in low-resource scenarios where label acquisition is expensive. While zero-shot and instruction-tuned approaches have made progress, they often fail to generalize to domain-specific entities and do not effectively utilize limited available data. We present a lightweight few-shot NER framework that addresses these challenges through two key innovations: (1) a new instruction tuning template with a simplified output format that combines principles from prior IT approaches to leverage the large context window of recent state-of-the-art LLMs; (2) introducing a strategic data augmentation technique that preserves entity information while paraphrasing the surrounding context, thereby expanding our training data without compromising semantic relationships. Experiments on benchmark datasets show that our method achieves performance comparable to state-of-the-art models on few-shot and zero-shot tasks, with our few-shot approach attaining an average F1 score of 80.1 on the CrossNER datasets. Models trained with our paraphrasing approach show consistent improvements in F1 scores of up to 17 points over baseline versions, offering a promising solution for groups with limited NER training data and compute power.
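One plausible way to paraphrase context while preserving entities is to mask the entity spans with placeholders, rewrite the masked sentence, and then restore the spans. The sketch below illustrates that idea only; it is not the paper's pipeline. The `augment` function, the `[ENT0]`-style slots, and `toy_paraphrase` (a fixed string substitution standing in for an LLM paraphraser) are all hypothetical.

```python
from typing import Callable, List, Tuple

Entity = Tuple[str, str]  # (surface text, label)


def augment(
    sentence: str,
    entities: List[Entity],
    paraphrase: Callable[[str], str],
) -> Tuple[str, List[Entity]]:
    """Paraphrase the context around entities while keeping each entity string intact."""
    masked = sentence
    slots = []
    for i, (text, _label) in enumerate(entities):
        if text not in masked:
            raise ValueError(f"entity {text!r} not found in sentence")
        slot = f"[ENT{i}]"
        masked = masked.replace(text, slot, 1)  # protect the entity span
        slots.append((slot, text))

    rewritten = paraphrase(masked)

    for slot, text in slots:
        if slot not in rewritten:
            raise ValueError(f"paraphrase dropped slot {slot}")
        rewritten = rewritten.replace(slot, text, 1)  # restore the entity span
    return rewritten, entities  # labels carry over unchanged


def toy_paraphrase(masked: str) -> str:
    """Hypothetical stand-in for an LLM paraphraser."""
    return masked.replace("was born in", "is originally from")


new_sentence, new_entities = augment(
    "Marie Curie was born in Warsaw.",
    [("Marie Curie", "PER"), ("Warsaw", "LOC")],
    toy_paraphrase,
)
print(new_sentence)  # Marie Curie is originally from Warsaw.
```

Rejecting any paraphrase that drops a slot keeps the entity annotations valid for the augmented sentence, at the cost of discarding some rewrites.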
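[NLP-11] QueST: Incentivizing LLMs to Generate Difficult Problems
Quick read: This paper addresses a scalability bottleneck in large language model reasoning: reliance on human-labeled datasets and the lack of large-scale, challenging coding problems for training. Existing competitive coding datasets contain only thousands to tens of thousands of problems, and previous synthetic data generation methods either augment existing instruction datasets or select challenging problems from human-labeled data. The key to the solution is QueST, a framework that combines difficulty-aware graph sampling with difficulty-aware rejection fine-tuning to directly optimize specialized generators for creating challenging coding problems. The trained generators outperform even GPT-4o at creating challenging problems that benefit downstream performance, and the large-scale synthetic data they produce can be used for distillation or reinforcement learning, improving downstream models in both scenarios.
Link: https://arxiv.org/abs/2510.17715
Authors: Hanxu Hu, Xingxing Zhang, Jannis Vamvas, Rico Sennrich, Furu Wei
Affiliations: University of Zurich; Microsoft Research
Categories: Computation and Language (cs.CL)
Comments: 20 pages, 7 figures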
Abstract:Large Language Models have achieved strong performance on reasoning tasks, solving competition-level coding and math problems. However, their scalability is limited by human-labeled datasets and the lack of large-scale, challenging coding problem training data. Existing competitive coding datasets contain only thousands to tens of thousands of problems. Previous synthetic data generation methods rely on either augmenting existing instruction datasets or selecting challenging problems from human-labeled data. In this paper, we propose QueST, a novel framework which combines difficulty-aware graph sampling and difficulty-aware rejection fine-tuning that directly optimizes specialized generators to create challenging coding problems. Our trained generators demonstrate superior capability compared to even GPT-4o at creating challenging problems that benefit downstream performance. We leverage QueST to generate large-scale synthetic coding problems, which we then use to distill from strong teacher models with long chain-of-thought or to conduct reinforcement learning for smaller models, proving effective in both scenarios. Our distillation experiments demonstrate significant performance gains. Specifically, after fine-tuning Qwen3-8B-base on 100K difficult problems generated by QueST, we surpass the performance of the original Qwen3-8B on LiveCodeBench. With an additional 112K examples (i.e., 28K human-written problems paired with multiple synthetic solutions), our 8B model matches the performance of the much larger DeepSeek-R1-671B. These findings indicate that generating complex problems via QueST offers an effective and scalable approach to advancing the frontiers of competitive coding and reasoning for large language models.
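[NLP-12] Contextual Attention Modulation: Towards Efficient Multi-Task Adaptation in Large Language Models
Quick read: This paper addresses two challenges in multi-task adaptation of large language models (LLMs): balancing retention of general knowledge against task-specific specialization, and adapting efficiently without catastrophic forgetting. Conventional fine-tuning suffers from catastrophic forgetting and substantial resource consumption, while existing parameter-efficient methods perform suboptimally in complex multi-task scenarios. The core of the solution is Contextual Attention Modulation (CAM), a mechanism that dynamically modulates the representations of self-attention modules to enhance task-specific features while preserving general knowledge. CAM is integrated into the Hybrid Contextual Attention Modulation (HyCAM) framework, which combines a shared, full-parameter CAM module with multiple specialized, lightweight CAM modules and uses a dynamic routing strategy for adaptive knowledge fusion, yielding an average performance improvement of 3.65% across heterogeneous tasks.
Link: https://arxiv.org/abs/2510.17705
Authors: Dayan Pan, Zhaoyang Fu, Jingyuan Wang, Xiao Han, Yue Zhu, Xiangyu Zhao
Affiliations: Beihang University; City University of Hong Kong; Huawei Technologies Ltd.; Zhejiang University of Technology
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by CIKM’25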
Abstract:Large Language Models (LLMs) possess remarkable generalization capabilities but struggle with multi-task adaptation, particularly in balancing knowledge retention with task-specific specialization. Conventional fine-tuning methods suffer from catastrophic forgetting and substantial resource consumption, while existing parameter-efficient methods perform suboptimally in complex multi-task scenarios. To address this, we propose Contextual Attention Modulation (CAM), a novel mechanism that dynamically modulates the representations of self-attention modules in LLMs. CAM enhances task-specific features while preserving general knowledge, thereby facilitating more effective and efficient adaptation. For effective multi-task adaptation, CAM is integrated into our Hybrid Contextual Attention Modulation (HyCAM) framework, which combines a shared, full-parameter CAM module with multiple specialized, lightweight CAM modules, enhanced by a dynamic routing strategy for adaptive knowledge fusion. Extensive experiments on heterogeneous tasks, including question answering, code generation, and logical reasoning, demonstrate that our approach significantly outperforms existing approaches, achieving an average performance improvement of 3.65%. The implemented code and data are available to ease reproducibility at this https URL.
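[NLP-13] Towards Mining Effective Pedagogical Strategies from Learner-LLM Educational Dialogues
Quick read: This paper addresses a limitation of current evaluation methods for educational applications of large language models (LLMs): existing work focuses mainly on technical performance or learning outcomes and neglects the learner-LLM interaction itself. The key to the solution is a dialogue analysis approach consisting of learner-LLM dialogue data collection, dialogue act (DA) annotation, DA pattern mining, and predictive model building, with the aim of identifying effective pedagogical strategies from interaction dynamics and encouraging evaluation of LLM-based educational applications that centers on dialogue.
Link: https://arxiv.org/abs/2510.17698
Authors: Liqun He, Manolis Mavrikis, Mutlu Cukurova
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: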
Abstract:Dialogue plays a crucial role in educational settings, yet existing evaluation methods for educational applications of large language models (LLMs) primarily focus on technical performance or learning outcomes, often neglecting attention to learner-LLM interactions. To narrow this gap, this AIED Doctoral Consortium paper presents an ongoing study employing a dialogue analysis approach to identify effective pedagogical strategies from learner-LLM dialogues. The proposed approach involves dialogue data collection, dialogue act (DA) annotation, DA pattern mining, and predictive model building. Early insights are outlined as an initial step toward future research. The work underscores the need to evaluate LLM-based educational applications by focusing on dialogue dynamics and pedagogical strategies.
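The DA pattern mining step can be pictured as counting recurring sequences of dialogue-act labels across annotated dialogues. The sketch below uses simple n-gram counting as one possible mining technique; the abstract does not say which method the study uses, and the function name and the DA labels (`question`, `hint`, `attempt`, `feedback`) are hypothetical.

```python
from collections import Counter
from typing import Sequence


def mine_da_patterns(dialogues: Sequence[Sequence[str]], n: int = 2) -> Counter:
    """Count contiguous n-grams of dialogue-act labels over all dialogues."""
    counts: Counter = Counter()
    for acts in dialogues:
        for i in range(len(acts) - n + 1):
            counts[tuple(acts[i:i + n])] += 1
    return counts


# Two toy annotated dialogues (labels are made up for illustration).
dialogues = [
    ["question", "hint", "attempt", "feedback"],
    ["question", "hint", "attempt", "hint", "attempt", "feedback"],
]
patterns = mine_da_patterns(dialogues, n=2)
print(patterns.most_common(1))  # [(('hint', 'attempt'), 3)]
```

Pattern counts of this kind could then serve as features for the predictive model building step, for example to test whether dialogues rich in hint-then-attempt exchanges are associated with better outcomes.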
[NLP-14] LILO: Bayesian Optimization with Interactive Natural Language Feedback
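Quick read: This paper addresses how to turn unstructured natural language feedback into an objective that can be optimized, so that Bayesian optimization (BO) becomes more applicable and efficient when goals are complex, nuanced, or subjective. Preferential BO accepts only restricted feedback formats and requires a customized model for each domain-specific problem, so it adapts poorly to varied user input. The key to the solution is a language-in-the-loop framework in which a large language model (LLM) converts varied textual feedback into consistent scalar utility signals and incorporates flexible user priors without manual kernel design. The method keeps the sample efficiency and principled uncertainty quantification of BO while offering a more natural interface to the decision maker, and it outperforms conventional BO baselines and LLM-only optimizers, particularly in feedback-limited regimes.
Link: https://arxiv.org/abs/2510.17671
Authors: Katarzyna Kobalczyk, Zhiyuan Jerry Lin, Benjamin Letham, Zhuokai Zhao, Maximilian Balandat, Eytan Bakshy
Affiliations: Meta; University of Cambridge
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: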
Abstract:For many real-world applications, feedback is essential in translating complex, nuanced, or subjective goals into quantifiable optimization objectives. We propose a language-in-the-loop framework that uses a large language model (LLM) to convert unstructured feedback in the form of natural language into scalar utilities to conduct BO over a numeric search space. Unlike preferential BO, which only accepts restricted feedback formats and requires customized models for each domain-specific problem, our approach leverages LLMs to turn varied types of textual feedback into consistent utility signals and to easily include flexible user priors without manual kernel design. At the same time, our method maintains the sample efficiency and principled uncertainty quantification of BO. We show that this hybrid method not only provides a more natural interface to the decision maker but also outperforms conventional BO baselines and LLM-only optimizers, particularly in feedback-limited regimes.
[NLP-15] DELULU: Discriminative Embedding Learning Using Latent Units for Speaker-Aware Self-Supervised Speech Foundational Model
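Quick read: This paper addresses the limited ability of self-supervised speech models to capture speaker-discriminative features, which constrains their performance on speaker verification, speaker diarization, and speaker profiling. The key to the solution is DELULU, which integrates external supervision into the pseudo-label generation process: frame-level embeddings from ReDimNet, a state-of-the-art speaker verification model, guide the k-means clustering step during pre-training, introducing a strong speaker-discriminative inductive bias that aligns representation learning with speaker identity. The model is trained with a dual objective that combines masked prediction and denoising, further improving robustness and generalization.
Link: https://arxiv.org/abs/2510.17662
Authors: Massa Baali, Rita Singh, Bhiksha Raj
Affiliations: Carnegie Mellon University
Categories: Sound (cs.SD); Computation and Language (cs.CL)
Comments: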
点击查看摘要
Abstract:Self-supervised speech models have achieved remarkable success on content-driven tasks, yet they remain limited in capturing speaker-discriminative features critical for verification, diarization, and profiling applications. We introduce DELULU, a speaker-aware self-supervised foundational model that addresses this limitation by integrating external supervision into the pseudo-label generation process. DELULU leverages frame-level embeddings from ReDimNet, a state-of-the-art speaker verification model, to guide the k-means clustering step during pre-training, introducing a strong speaker-discriminative inductive bias that aligns representation learning with speaker identity. The model is trained using a dual objective that combines masked prediction and denoising, further enhancing robustness and generalization. DELULU significantly outperforms prior self-supervised learning (SSL) models across a range of speaker-centric tasks, achieving up to 62% relative improvement in equal error rate (EER) for speaker verification and consistent gains on zero-shot profiling tasks such as gender, age, accent, and speaker counting. Our findings demonstrate that DELULU is a strong universal encoder for speaker-aware speech processing, enabling superior performance even without task-specific fine-tuning.
[NLP-16] Qomhrá: A Bilingual Irish-English Large Language Model
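[Quick read]: This paper addresses the weak performance of large language models (LLMs) on a low-resource language, Irish, and in particular how to improve Irish generation in a bilingual Irish-English setting while preserving English ability. The key to the solution is a complete training pipeline covering bilingual continued pre-training, instruction tuning, and alignment from human preferences, with newly accessible Irish corpora mixed and curated together with English text. Google's Gemini-2.5-Pro is used to synthesize a 30K Irish-English parallel instruction-tuning dataset and a 1K human-preference dataset. Across benchmarks, Qomhrá gains up to 29% in Irish and up to 44% in English, and it shows clear progress in instruction following.
Link: https://arxiv.org/abs/2510.17652
Authors: Joseph McInerney
Affiliations: Trinity College Dublin
Categories: Computation and Language (cs.CL)
Comments: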
Abstract:This paper introduces Qomhrá, a bilingual Irish-English large language model (LLM), developed under low-resource constraints presenting a complete pipeline spanning bilingual continued pre-training, instruction tuning, and alignment from human preferences. Newly accessible Irish corpora and English text are mixed and curated to improve Irish performance while preserving English ability. 6 closed-weight LLMs are judged for their Irish text generation by a native speaker, a learner and other LLMs. Google’s Gemini-2.5-Pro is ranked the highest and is subsequently used to synthesise instruction tuning and human preference datasets. Two datasets are contributed leveraging Gemini-2.5-Pro: a 30K Irish-English parallel instruction tuning dataset and a 1K human preference dataset, generating accepted and rejected responses that show near perfect alignment with a native Irish speaker. Qomhrá is comprehensively evaluated across benchmarks testing translation, gender understanding, topic identification and world knowledge with gains of up to 29% in Irish and 44% in English. Qomhrá also undergoes instruction tuning and demonstrates clear progress in instruction following, crucial for chatbot functionality.
[NLP-17] LLM-as-a-Prophet: Understanding Predictive Intelligence with Prophet Arena WWW
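[Quick read]: This paper studies how large language models (LLMs) can forecast real-world future events, an emerging paradigm the authors call "LLM-as-a-Prophet". The key to the solution is Prophet Arena, a general evaluation benchmark that continuously collects live forecasting tasks and decomposes each task into distinct pipeline stages, which supports controlled, large-scale experiments. With this framework the study systematically evaluates many LLMs on calibration error, consistency of prediction confidence, and market returns. It finds that current LLMs already forecast well, and it identifies key bottlenecks: inaccurate event recall, misunderstanding of data sources, and slower information aggregation than markets as resolution nears.
Link: https://arxiv.org/abs/2510.17638
Authors: Qingchuan Yang, Simon Mahns, Sida Li, Anri Gu, Jibang Wu, Haifeng Xu
Affiliations: University of Southern California; Meta; The University of Chicago; New York University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: this https URL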
Abstract:Forecasting is not only a fundamental intellectual pursuit but also is of significant importance to societal systems such as finance and economics. With the rapid advances of large language models (LLMs) trained on Internet-scale data, it raises the promise of employing LLMs to forecast real-world future events, an emerging paradigm we call “LLM-as-a-Prophet”. This paper systematically investigates such predictive intelligence of LLMs. To this end, we build Prophet Arena, a general evaluation benchmark that continuously collects live forecasting tasks and decomposes each task into distinct pipeline stages, in order to support our controlled and large-scale experimentation. Our comprehensive evaluation reveals that many LLMs already exhibit impressive forecasting capabilities, reflected in, e.g., their small calibration errors, consistent prediction confidence and promising market returns. However, we also uncover key bottlenecks towards achieving superior predictive intelligence via LLM-as-a-Prophet, such as LLMs’ inaccurate event recalls, misunderstanding of data sources and slower information aggregation compared to markets when resolution nears.
[NLP-18] Forget to Know, Remember to Use: Context-Aware Unlearning for Large Language Models
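[Quick read]: This paper addresses a gap in existing unlearning methods: they overlook contextual utility, meaning the model should still be able to use forgotten knowledge when it is re-introduced in the user's prompt. Mainstream evaluations look only at how well the forget set is removed and how well performance on the retain set is maintained, not at how the model responds when forgotten knowledge reappears in context. The key to the solution is a plug-in term added to the existing unlearning objective that explicitly preserves the model's ability to use forgotten information when it is present in context. This restores contextual utility to near-original levels while keeping forgetting effective.
Link: https://arxiv.org/abs/2510.17620
Authors: Yuefeng Peng, Parnian Afshar, Megan Ganji, Thomas Butler, Amir Houmansadr, Mingxian Wang, Dezhi Hong
Affiliations: University of Massachusetts Amherst; Amazon
Categories: Computation and Language (cs.CL)
Comments: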
Abstract:Large language models may encode sensitive information or outdated knowledge that needs to be removed, to ensure responsible and compliant model responses. Unlearning has emerged as an efficient alternative to full retraining, aiming to remove specific knowledge while preserving overall model utility. Existing evaluations of unlearning methods focus on (1) the extent of forgetting of the target knowledge (forget set) and (2) maintaining performance on the retain set (i.e., utility). However, these evaluations overlook an important usability aspect: users may still want the model to leverage the removed information if it is re-introduced in the prompt. In a systematic evaluation of six state-of-the-art unlearning methods, we find that they consistently impair such contextual utility. To address this, we augment unlearning objectives with a plug-in term that preserves the model’s ability to use forgotten knowledge when it is present in context. Extensive experiments demonstrate that our approach restores contextual utility to near original levels while still maintaining effective forgetting and retain-set utility.
[NLP-19] LawChain: Modeling Legal Reasoning Chains for Chinese Tort Case Analysis
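[Quick read]: This paper addresses two core problems in current legal reasoning research. First, existing computational methods mostly rely on generic reasoning frameworks such as syllogism and IRAC, which do not capture the complexity and nuance of legal reasoning. Second, existing work focuses mainly on criminal cases and models civil cases, tort cases in particular, insufficiently. The key to the solution is LawChain, a framework that models legal reasoning explicitly by breaking tort case analysis into three modules, each with several finer-grained sub-steps, and that underpins LawChain_eval, a benchmark built to assess tort legal reasoning. Using this benchmark, the authors find that large language models still fall clearly short in civil tort settings. Baselines that incorporate LawChain-style reasoning through prompting or post-training improve tort reasoning and generalize well to related tasks, which supports the value of explicitly modeling legal reasoning chains.
Link: https://arxiv.org/abs/2510.17602
Authors: Huiyuan Xie, Chenyang Li, Huining Zhu, Chubin Zhang, Yuxiao Ye, Zhenghao Liu, Zhiyuan Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: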
Abstract:Legal reasoning is a fundamental component of legal analysis and decision-making. Existing computational approaches to legal reasoning predominantly rely on generic reasoning frameworks such as syllogism and IRAC, which do not comprehensively examine the nuanced processes that underpin legal reasoning. Moreover, current research has largely focused on criminal cases, with insufficient modeling for civil cases. In this work, we present a novel framework for explicitly modeling legal reasoning in the analysis of Chinese tort-related civil cases. We first operationalize the legal reasoning processes used in tort analysis into the LawChain framework. LawChain is a three-module reasoning framework, with each module consisting of multiple finer-grained sub-steps. Informed by the LawChain framework, we introduce the task of tort legal reasoning and construct an evaluation benchmark, LawChain_eval, to systematically assess the critical steps within analytical reasoning chains for tort analysis. Leveraging this benchmark, we evaluate state-of-the-art large language models for their legal reasoning ability in civil tort contexts. Our results indicate that current models still fall short in accurately handling crucial elements of tort legal reasoning. Furthermore, we introduce several baseline approaches that explicitly incorporate LawChain-style reasoning through prompting or post-training. We conduct further experiments on additional legal analysis tasks, such as Legal Named-Entity Recognition and Criminal Damages Calculation, to verify the generalizability of these baselines. The proposed baseline approaches achieve significant improvements in tort-related legal reasoning and generalize well to related legal analysis tasks, thus demonstrating the value of explicitly modeling legal reasoning chains to enhance the reasoning capabilities of language models.
[NLP-20] Reasoning Distillation and Structural Alignment for Improved Code Generation
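[Quick read]: This paper addresses the weak code generation performance of small language models, which lack algorithmic reasoning ability and therefore struggle to understand user intent and to produce code that is structurally correct and logically sound. The key to the solution is knowledge distillation: the reasoning ability of a very large language model (VLLM) is transferred to a smaller model that is cheaper to deploy. Specifically, a structure-aware loss optimization method trains the student model to identify correct solution pathways and to establish a structural correspondence between problem definitions and candidate solutions. The student thus moves beyond token-level prediction toward grasping the overall structure of a solution, which improves the quality of generated code.
Link: https://arxiv.org/abs/2510.17598
Authors: Amir Jalilifard, Anderson de Rezende Rocha, Marcos Medeiros Raimundo
Affiliations: Universidade Estadual de Campinas
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: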
Abstract:Effective code generation with language models hinges on two critical factors: accurately understanding the intent of the prompt and generating code that applies algorithmic reasoning to produce correct solutions capable of passing diverse test cases while adhering to the syntax of the target programming language. Unlike other language tasks, code generation requires more than accurate token prediction; it demands comprehension of solution-level and structural relationships rather than merely generating the most likely tokens. Very large language models (VLLMs) are capable of generating detailed steps toward the correct solution of complex tasks where reasoning is crucial in solving the problem. Such reasoning capabilities may be absent in smaller language models. Therefore, in this work, we distill the reasoning capabilities of a VLLM into a smaller, more efficient model that is faster and cheaper to deploy. Our approach trains the model to emulate the reasoning and problem-solving abilities of the VLLM by learning to identify correct solution pathways and establishing a structural correspondence between problem definitions and potential solutions through a novel method of structure-aware loss optimization. This enables the model to transcend token-level generation and to deeply grasp the overarching structure of solutions for given problems. Experimental results show that our fine-tuned model, developed through a cheap and simple-to-implement process, significantly outperforms our baseline model in terms of pass@1, average data flow, and average syntax match metrics across the MBPP, MBPP Plus, and HumanEval benchmarks.
[NLP-21] HGAdapter: Hypergraph-based Adapters in Language Models for Code Summarization and Clone Detection EMNLP2025
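[Quick read]: This paper addresses the failure of pre-trained language models (PLMs) to capture high-order data correlations inside code when applied to code-related tasks. Existing methods achieve good results but ignore latent structural relationships in code, such as syntax-tree hierarchy, lexical similarity, and dependencies between lines. The key to the solution is to define three types of high-order correlations among code tokens (abstract syntax tree family correlation, lexical correlation, and line correlation) and to design a tokens and hyperedges generator that captures them. On top of this, the authors improve the hypergraph neural network architecture and combine it with adapter tuning to obtain a hypergraph-based adapter (HGAdapter) that encodes high-order correlations and can be inserted into various PLMs. Experiments on code summarization in multiple languages and on code clone detection show improved model performance.
Link: https://arxiv.org/abs/2510.17591
Authors: Guang Yang, Yujie Zhu
Affiliations: Guotai Haitong Securities; East China Normal University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: Accepted by the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025) as a findings long paper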
Abstract:Pre-trained language models (PLMs) are increasingly being applied to code-related tasks. Although PLMs have achieved good results, they do not take into account potential high-order data correlations within the code. We propose three types of high-order correlations in code tokens, i.e. abstract syntax tree family correlation, lexical correlation, and line correlation. We design a tokens and hyperedges generator to capture these high-order data correlations. We improve the architecture of hypergraph neural networks and combine it with adapter tuning to propose a novel hypergraph-based adapter (HGAdapter) to fine-tune PLMs. HGAdapter can encode high-order data correlations and is allowed to be inserted into various PLMs to enhance performance. Experiments were conducted on several public datasets, including six languages of code summarization and code clone detection tasks. Our methods improved the performance of PLMs in datasets to varying degrees. Experimental results validate the introduction of high-order data correlations that contribute to improved effectiveness.
[NLP-22] MIRAGE: Agentic Framework for Multimodal Misinformation Detection with Web-Grounded Reasoning
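[Quick read]: This paper addresses the rapid spread of multimodal misinformation on web platforms, where the volume of posts combining text and images overwhelms manual fact-checking. Existing supervised detectors depend on domain-specific labeled data and generalize poorly across diverse manipulation tactics. The key to the solution is MIRAGE, an inference-time, model-pluggable agentic verification framework that decomposes reasoning into four sequential modules: visual veracity assessment detects AI-generated images; cross-modal consistency analysis identifies out-of-context repurposing of images and text; retrieval-augmented factual checking gathers web evidence through iterative question generation; and a calibrated judgment module integrates all signals and outputs structured, citation-linked rationales. The method matches supervised detector performance without domain-specific training, which improves generality and scalability.
Link: https://arxiv.org/abs/2510.17590
Authors: Mir Nafis Sharear Shopnil, Sharad Duwal, Abhishek Tyagi, Adiba Mahbub Proma
Affiliations: University of Rochester
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 16 pages, 3 tables, 1 figure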
Abstract:Misinformation spreads across web platforms through billions of daily multimodal posts that combine text and images, overwhelming manual fact-checking capacity. Supervised detection models require domain-specific training data and fail to generalize across diverse manipulation tactics. We present MIRAGE, an inference-time, model-pluggable agentic framework that decomposes multimodal verification into four sequential modules: visual veracity assessment detects AI-generated images, cross-modal consistency analysis identifies out-of-context repurposing, retrieval-augmented factual checking grounds claims in web evidence through iterative question generation, and a calibrated judgment module integrates all signals. MIRAGE orchestrates vision-language model reasoning with targeted web retrieval, outputs structured and citation-linked rationales. On MMFakeBench validation set (1,000 samples), MIRAGE with GPT-4o-mini achieves 81.65% F1 and 75.1% accuracy, outperforming the strongest zero-shot baseline (GPT-4V with MMD-Agent at 74.0% F1) by 7.65 points while maintaining 34.3% false positive rate versus 97.3% for a judge-only baseline. Test set results (5,000 samples) confirm generalization with 81.44% F1 and 75.08% accuracy. Ablation studies show visual verification contributes 5.18 F1 points and retrieval-augmented reasoning contributes 2.97 points. Our results demonstrate that decomposed agentic reasoning with web retrieval can match supervised detector performance without domain-specific training, enabling misinformation detection across modalities where labeled data remains scarce.
[NLP-23] Language Confusion Gate: Language-Aware Decoding Through Model Self-Distillation
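[Quick read]: This paper addresses language confusion in large language models (LLMs), the unintended mixing of languages during text generation. Existing solutions either require retraining the model or cannot distinguish harmful confusion from acceptable code-switching. The core of the solution is the Language Confusion Gate (LCG), a lightweight plug-in that predicts the appropriate language families during decoding and masks tokens only when needed, without modifying the base LLM's parameters. The gate is trained with norm-adjusted self-distillation and rests on three observations: language confusion is infrequent, correct-language tokens are usually among the top predictions, and output token embedding norms are larger for high-resource languages. Across several mainstream models, LCG reduces language confusion significantly, often by an order of magnitude, without hurting task performance.
Link: https://arxiv.org/abs/2510.17555
Authors: Collin Zhang, Fei Huang, Chenhan Yuan, Junyang Lin
Affiliations: Qwen Team, Alibaba Group; Cornell University
Categories: Computation and Language (cs.CL)
Comments: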
Abstract:Large language models (LLMs) often experience language confusion, which is the unintended mixing of languages during text generation. Current solutions to this problem either necessitate model retraining or cannot differentiate between harmful confusion and acceptable code-switching. This paper introduces the Language Confusion Gate (LCG), a lightweight, plug-in solution that filters tokens during decoding without altering the base LLM. The LCG is trained using norm-adjusted self-distillation to predict appropriate language families and apply masking only when needed. Our method is based on the findings that language confusion is infrequent, correct-language tokens are usually among the top predictions, and output token embedding norms are larger for high-resource languages, which biases sampling. When evaluated across various models, including Qwen3, GPT-OSS, Gemma3, Llama3.1, LCG decreases language confusion significantly, often by an order of magnitude, without negatively impacting task performance. Code is available at this https URL.
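The "mask only when needed" idea in this abstract can be illustrated with a small decoding-time filter. This is a minimal sketch, not the authors' implementation: the token-to-family table `TOKEN_FAMILY`, the function names, and the rule "intervene only if the top-scoring token belongs to a disallowed family" are all assumptions made for illustration, and the set of allowed families is passed in by hand where the paper uses a trained gate.

```python
import math

# Hypothetical token -> language-family table (an assumption for this sketch);
# a real system would derive it from the tokenizer vocabulary.
TOKEN_FAMILY = {"hello": "latin", "world": "latin", "你好": "cjk"}


def gate_logits(logits, allowed_families, token_family=TOKEN_FAMILY):
    """Mask disallowed-family tokens, but only if the top token is disallowed."""
    top_token = max(logits, key=logits.get)
    if token_family[top_token] in allowed_families:
        # The model is already on track: leave the distribution untouched.
        return dict(logits)
    # Otherwise push every disallowed token to -inf so it cannot be chosen.
    return {
        tok: (score if token_family[tok] in allowed_families else -math.inf)
        for tok, score in logits.items()
    }


def greedy_pick(logits):
    """Return the highest-scoring token."""
    return max(logits, key=logits.get)
```

With logits `{"hello": 2.0, "你好": 2.5, "world": 1.0}` and only the `latin` family allowed, the gate masks `你好` and greedy decoding falls back to `hello`; if both families are allowed, the logits pass through unchanged.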
[NLP-24] When Annotators Disagree, Topology Explains: Mapper, a Topological Tool for Exploring Text Embedding Geometry and Ambiguity EMNLP2025
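[Quick read]: This paper addresses a limitation of evaluating language models with scalar metrics such as accuracy: these metrics do not show how a model internally represents data with annotator disagreement or semantic ambiguity. The authors take a topological data analysis (TDA) perspective and use the Mapper tool to model and analyze the embedding space of fine-tuned models. The key finding is that Mapper directly reveals modular, non-convex regions of the embedding space that align with model predictions, and that prediction purity stays high even on highly ambiguous samples (over 98% of connected components have at least 90% prediction purity). This exposes a hidden tension between the model's structural confidence and label uncertainty. Unlike traditional tools such as PCA or UMAP, Mapper uncovers decision regions, boundary collapses, and overconfident clusters, and it enables topological metrics that may inform proactive modeling strategies in subjective NLP tasks.
Link: https://arxiv.org/abs/2510.17548
Authors: Nisrine Rair, Alban Goupil, Valeriu Vrabie, Emmanuel Chochoy
Affiliations: CReSTIC, Université de Reims Champagne-Ardenne (Reims, France); Chochoy Conseil (France)
Categories: Computation and Language (cs.CL)
Comments: Accepted to appear in the Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025, Main Conference)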
Abstract:Language models are often evaluated with scalar metrics like accuracy, but such measures fail to capture how models internally represent ambiguity, especially when human annotators disagree. We propose a topological perspective to analyze how fine-tuned models encode ambiguity and more generally instances. Applied to RoBERTa-Large on the MD-Offense dataset, Mapper, a tool from topological data analysis, reveals that fine-tuning restructures embedding space into modular, non-convex regions aligned with model predictions, even for highly ambiguous cases. Over 98% of connected components exhibit ≥ 90% prediction purity, yet alignment with ground-truth labels drops in ambiguous data, surfacing a hidden tension between structural confidence and label uncertainty. Unlike traditional tools such as PCA or UMAP, Mapper captures this geometry directly, uncovering decision regions, boundary collapses, and overconfident clusters. Our findings position Mapper as a powerful diagnostic tool for understanding how models resolve ambiguity. Beyond visualization, it also enables topological metrics that may inform proactive modeling strategies in subjective NLP tasks.
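The purity statistic quoted in this abstract can be made concrete with a few lines of code. This is a sketch under an assumed definition, since the abstract does not spell one out: purity of a connected component is taken here to be the share of its samples that carry the component's most frequent predicted label. The function names and the toy components are hypothetical.

```python
from collections import Counter


def prediction_purity(predictions):
    """Share of samples in one component that carry its majority prediction."""
    counts = Counter(predictions)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(predictions)


def share_of_pure_components(components, threshold=0.9):
    """Fraction of components whose purity reaches the threshold."""
    pure = sum(1 for preds in components if prediction_purity(preds) >= threshold)
    return pure / len(components)


# Toy example: three components given as lists of predicted labels.
toy_components = [[1] * 10, [0] * 9 + [1], [0, 1, 0, 1]]
toy_share = share_of_pure_components(toy_components)
```

In the toy example the three components have purities 1.0, 0.9, and 0.5, so two of the three meet the 0.9 threshold.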
[NLP-25] OncoReason: Structuring Clinical Reasoning in LLMs for Robust and Interpretable Survival Prediction
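[Quick read]: This paper addresses the tension between accuracy and interpretability when predicting cancer treatment outcomes, and in particular how to give LLMs the structured reasoning needed for high-stakes decision support on heterogeneous clinical data. The key to the solution is a unified multi-task learning framework that aligns autoregressive large language models (LLMs) with clinical reasoning by jointly training three tasks: binary survival classification, continuous survival time regression, and natural language rationale generation. The authors compare standard supervised fine-tuning (SFT), SFT with Chain-of-Thought (CoT) prompting, and Group Relative Policy Optimization (GRPO). CoT prompting improves predictive performance, and GRPO achieves state-of-the-art results on BLEU, ROUGE, and BERTScore, which underscores the importance of reasoning-aware alignment for trustworthy LLMs in precision oncology.
Link: https://arxiv.org/abs/2510.17532
Authors: Raghu Vamshi Hemadri, Geetha Krishna Guruju, Kristi Topollai, Anna Ewa Choromanska
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: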
Abstract:Predicting cancer treatment outcomes requires models that are both accurate and interpretable, particularly in the presence of heterogeneous clinical data. While large language models (LLMs) have shown strong performance in biomedical NLP, they often lack structured reasoning capabilities critical for high-stakes decision support. We present a unified, multi-task learning framework that aligns autoregressive LLMs with clinical reasoning for outcome prediction on the MSK-CHORD dataset. Our models are trained to jointly perform binary survival classification, continuous survival time regression, and natural language rationale generation. We evaluate three alignment strategies: (1) standard supervised fine-tuning (SFT), (2) SFT with Chain-of-Thought (CoT) prompting to elicit step-by-step reasoning, and (3) Group Relative Policy Optimization (GRPO), a reinforcement learning method that aligns model outputs to expert-derived reasoning trajectories. Experiments with LLaMa3-8B and Med42-8B backbones demonstrate that CoT prompting improves F1 by +6.0 and reduces MAE by 12%, while GRPO achieves state-of-the-art interpretability and predictive performance across BLEU, ROUGE, and BERTScore. We further show that existing biomedical LLMs often fail to produce valid reasoning traces due to architectural constraints. Our findings underscore the importance of reasoning-aware alignment in multi-task clinical modeling and set a new benchmark for interpretable, trustworthy LLMs in precision oncology.
[NLP-26] SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
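[Quick read]: This paper addresses the lack of a unified, comparable standard for evaluating how well large language models (LLMs) simulate human behavior. Existing studies use fragmented, bespoke tasks and metrics, so results cannot be compared across studies and a systematic understanding of LLM simulation ability is hard to build. The key to the solution is SimBench, the first large-scale, standardized benchmark of this kind, which unifies 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool. It provides a reproducible foundation for asking when, how, and why LLM simulations succeed or fail.
Link: https://arxiv.org/abs/2510.17516
Authors: Tiancheng Hu, Joachim Baumann, Lorenzo Lupo, Nigel Collier, Dirk Hovy, Paul Röttger
Affiliations: University of Cambridge; University of Zurich; Bocconi University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Project Website: this http URL Data: this https URL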
Abstract:Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results. To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail. We show that, while even the best LLMs today have limited simulation ability (score: 40.80/100), performance scales log-linearly with model size. Simulation performance is not improved by increased inference-time compute. We demonstrate an alignment-simulation trade-off: instruction-tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones. Models particularly struggle when simulating specific demographic groups. Finally, we demonstrate that simulation ability correlates most strongly with deep, knowledge-intensive reasoning (MMLU-Pro, r=0.939). By making progress measurable, we aim to accelerate the development of more faithful LLM simulators.
[NLP-27] Annotation-Efficient Universal Honesty Alignment
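[Quick read]: This paper addresses the lack of honesty alignment in deployed large language models (LLMs): models fail to recognize their own knowledge boundaries and to express calibrated confidence, which undermines trust. Existing methods either rely on training-free confidence estimation (e.g., token probabilities, self-consistency) or on training-based calibration with correctness annotations, and the latter requires costly large-scale labeling. The paper proposes Elicitation-Then-Calibration (EliCal), a two-stage framework: the first stage elicits internal confidence using inexpensive self-consistency supervision, and the second stage calibrates that confidence with a small set of correctness annotations. Eliciting before calibrating sharply reduces the annotation requirement. Experiments show that only 1k correctness annotations (0.18% of full supervision) yield near-optimal honesty alignment, and that EliCal aligns better on unseen MMLU tasks than a calibration-only baseline, offering a scalable route toward universal honesty alignment.
Link: https://arxiv.org/abs/2510.17509
Authors: Shiyu Ni, Keping Bi, Jiafeng Guo, Minghao Tang, Jingtong Wu, Zengxin Han, Xueqi Cheng
Affiliations: State Key Laboratory of AI Safety, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences
Categories: Computation and Language (cs.CL)
Comments: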
Abstract:Honesty alignment-the ability of large language models (LLMs) to recognize their knowledge boundaries and express calibrated confidence-is essential for trustworthy deployment. Existing methods either rely on training-free confidence estimation (e.g., token probabilities, self-consistency) or training-based calibration with correctness annotations. While effective, achieving universal honesty alignment with training-based calibration requires costly, large-scale labeling. To support annotation-efficient training, we introduce Elicitation-Then-Calibration (EliCal), a two-stage framework that first elicits internal confidence using inexpensive self-consistency supervision, then calibrates this confidence with a small set of correctness annotations. To support a large-scale study, we release HonestyBench, a benchmark covering ten free-form QA datasets with 560k training and 70k evaluation instances annotated with correctness and self-consistency signals. Experiments show that EliCal achieves near-optimal alignment with only 1k correctness annotations (0.18% of full supervision) and better alignment performance on unseen MMLU tasks than the calibration-only baseline, offering a scalable solution toward universal honesty alignment in LLMs.
[NLP-28] Lingua Custodis participation at the WMT 2025 Terminology shared task
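[Quick read]: This paper addresses how to build cross-lingual sentence embeddings, that is, how to learn semantic similarity across many languages and support transfer learning tasks. The key to the solution is a systematic combination of several state-of-the-art methods: masked language modeling (MLM), translation language modeling (TLM), dual encoder translation ranking, and additive margin softmax. Starting from a pre-trained multilingual language model sharply reduces the dependence on parallel data; the experiments report an 80% reduction in the parallel data required. The final model reaches 83.7% bi-text retrieval accuracy over 112 languages on Tatoeba, above the 65.5% achieved by LASER, while remaining competitive on monolingual transfer learning tasks. Parallel data mined from CommonCrawl with the model also trains competitive neural machine translation (NMT) models.
Link: https://arxiv.org/abs/2510.17504
Authors: Jingshu Liu, Raheel Qader, Gaëtan Caillaut, Mariam Nakhlé
Affiliations: Lingua Custodia, France
Categories: Computation and Language (cs.CL)
Comments: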
Abstract:While BERT is an effective method for learning monolingual sentence embeddings for semantic similarity and embedding-based transfer learning, BERT-based cross-lingual sentence embeddings have yet to be explored. We systematically investigate methods for learning multilingual sentence embeddings by combining the best methods for learning monolingual and cross-lingual representations including: masked language modeling (MLM), translation language modeling (TLM), dual encoder translation ranking, and additive margin softmax. We show that introducing a pre-trained multilingual language model dramatically reduces the amount of parallel training data required to achieve good performance by 80%. Composing the best of these methods produces a model that achieves 83.7% bi-text retrieval accuracy over 112 languages on Tatoeba, well above the 65.5% achieved by LASER, while still performing competitively on monolingual transfer learning benchmarks. Parallel data mined from CommonCrawl using our best model is shown to train competitive NMT models for en-zh and en-de. We publicly release our best multilingual sentence embedding model for 109+ languages at this https URL.
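The combination of dual encoder translation ranking and additive margin softmax named in this abstract can be sketched as an in-batch ranking loss. This is a minimal illustration under stated assumptions, not the released model's training code: `sim[i][j]` is assumed to be the similarity between source sentence `i` and target sentence `j` in a batch where the diagonal holds the true translation pairs, only the source-to-target direction is shown, and the function name and margin values are hypothetical.

```python
import math


def additive_margin_softmax_loss(sim, margin):
    """In-batch translation ranking loss with a margin on the true pair.

    sim[i][j] is the similarity of source i and target j; the correct
    target for source i is assumed to sit on the diagonal (j == i).
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        # Subtract the margin from the positive pair only, which makes the
        # positive harder to rank first and pushes the encoders apart.
        logits = [sim[i][j] - margin if j == i else sim[i][j] for j in range(n)]
        # Numerically stable log-sum-exp over the row.
        row_max = max(logits)
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in logits))
        total += log_z - logits[i]
    return total / n
```

For a 2x2 batch with similarity 1.0 on the diagonal and 0.0 elsewhere, a margin of 1.0 cancels the advantage of the true pairs, so the loss equals `log(2)`; with a margin of 0 the loss is smaller, because the true pairs already rank first.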
[NLP-29] Deep Self-Evolving Reasoning
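[Quick read]: This paper addresses the reasoning ceiling of open-weight, smaller-scale language models on long-form chain-of-thought tasks, where weak verification and refinement abilities limit performance, especially on Olympiad-level problems. The core of the solution is Deep Self-Evolving Reasoning (DSER), a probabilistic paradigm that models iterative reasoning as a Markov chain. The key insight is that as long as the probability of improvement at each step marginally exceeds the probability of degradation, running multiple long-horizon self-evolving processes in parallel and taking a majority vote lets the model asymptotically approach the correct answer, even when a single iteration helps only slightly. Applied to DeepSeek-R1-0528-Qwen3-8B on the AIME 2024-2025 benchmark, DSER solves 5 out of 9 previously unsolvable problems and lets this compact model surpass the single-turn accuracy of its 600B-parameter teacher.
Link: https://arxiv.org/abs/2510.17498
Authors: Zihan Liu, Shun Zheng, Xumeng Wen, Yang Wang, Jiang Bian, Mao Yang
Affiliations: Peking University; Microsoft Research Asia
Categories: Computation and Language (cs.CL)
Comments: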
Abstract:Long-form chain-of-thought reasoning has become a cornerstone of advanced reasoning in large language models. While recent verification-refinement frameworks have enabled proprietary models to solve Olympiad-level problems, their effectiveness hinges on strong, reliable verification and correction capabilities, which remain fragile in open-weight, smaller-scale models. This work demonstrates that even with weak verification and refinement capabilities on hard tasks, the reasoning limits of such models can be substantially extended through a probabilistic paradigm we call Deep Self-Evolving Reasoning (DSER). We conceptualize iterative reasoning as a Markov chain, where each step represents a stochastic transition in the solution space. The key insight is that convergence to a correct solution is guaranteed as long as the probability of improvement marginally exceeds that of degradation. By running multiple long-horizon, self-evolving processes in parallel, DSER amplifies these small positive tendencies, enabling the model to asymptotically approach correct answers. Empirically, we apply DSER to the DeepSeek-R1-0528-Qwen3-8B model. On the challenging AIME 2024-2025 benchmark, DSER solves 5 out of 9 previously unsolvable problems and boosts overall performance, enabling this compact model to surpass the single-turn accuracy of its 600B-parameter teacher through majority voting. Beyond its immediate utility for test-time scaling, the DSER framework serves to diagnose the fundamental limitations of current open-weight reasoners. By clearly delineating their shortcomings in self-verification, refinement, and stability, our findings establish a clear research agenda for developing next-generation models with powerful, intrinsic self-evolving capabilities.
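The Markov-chain argument in this abstract can be checked numerically with a deliberately simplified model. This is a sketch under assumptions that are not from the paper: the solution space is collapsed to two states (correct, incorrect), `p_improve` is the chance of moving from incorrect to correct in one step, `p_degrade` is the chance of the reverse, chains are independent, and the vote is a binary majority over an odd number of chains. All names and numbers are hypothetical.

```python
from math import comb


def stationary_correct(p_improve, p_degrade):
    """Long-run probability of being in the 'correct' state."""
    return p_improve / (p_improve + p_degrade)


def prob_correct_after(steps, p_improve, p_degrade, start=0.0):
    """Probability of being correct after a number of refinement steps."""
    p = start
    for _ in range(steps):
        # Stay correct with prob (1 - p_degrade), or recover with p_improve.
        p = p * (1 - p_degrade) + (1 - p) * p_improve
    return p


def majority_vote_correct(p, n_chains):
    """Probability that more than half of n independent chains are correct."""
    return sum(
        comb(n_chains, k) * p**k * (1 - p) ** (n_chains - k)
        for k in range(n_chains // 2 + 1, n_chains + 1)
    )
```

With `p_improve = 0.3` and `p_degrade = 0.2`, a single chain settles at a 0.6 chance of being correct, and a majority vote over 101 such chains is correct well over 95% of the time. This mirrors the claim that a small edge of improvement over degradation is amplified by parallel chains and voting.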
[NLP-30] Empowering Real-World: A Survey on the Technology, Practice, and Evaluation of LLM-driven Industry Agents
【速读】: 该论文旨在解决如何将大语言模型(Large Language Models, LLMs)驱动的通用智能体(Agent)研究转化为能够推动产业变革的实际生产力这一关键挑战。其解决方案的核心在于构建一个基于LLM的行业智能体能力成熟度框架,系统梳理支撑智能体能力演进的三大技术支柱——记忆(Memory)、规划(Planning)与工具使用(Tool Use),并结合数字工程、科学发现、具身智能、协同业务执行及复杂系统仿真等实际应用场景,阐明从“流程执行系统”向“自适应社会系统”演进的技术路径。同时,论文还评估了当前评测体系在真实性、安全性与行业适配性方面的不足,并提出未来发展方向,为下一代行业智能体的理论研究与实践落地提供清晰路线图与方法论基础。
链接: https://arxiv.org/abs/2510.17491
作者: Yihong Tang,Kehai Chen,Liang Yue,Jinxin Fan,Caishen Zhou,Xiaoguang Li,Yuyang Zhang,Mingming Zhao,Shixiong Kai,Kaiyang Guo,Xingshan Zeng,Wenjing Cun,Lifeng Shang,Min Zhang
机构: Harbin Institute of Technology, Shenzhen, China (哈尔滨工业大学深圳校区); Huawei Technologies Co., Ltd. (华为技术有限公司)
类目: Computation and Language (cs.CL)
备注:
Abstract:With the rise of large language models (LLMs), LLM agents capable of autonomous reasoning, planning, and executing complex tasks have become a frontier in artificial intelligence. However, how to translate the research on general agents into productivity that drives industry transformations remains a significant challenge. To address this, this paper systematically reviews the technologies, applications, and evaluation methods of industry agents based on LLMs. Using an industry agent capability maturity framework, it outlines the evolution of agents in industry applications, from “process execution systems” to “adaptive social systems.” First, we examine the three key technological pillars that support the advancement of agent capabilities: Memory, Planning, and Tool Use. We discuss how these technologies evolve from supporting simple tasks in their early forms to enabling complex autonomous systems and collective intelligence in more advanced forms. Then, we provide an overview of the application of industry agents in real-world domains such as digital engineering, scientific discovery, embodied intelligence, collaborative business execution, and complex system simulation. Additionally, this paper reviews the evaluation benchmarks and methods for both fundamental and specialized capabilities, identifying the challenges existing evaluation systems face regarding authenticity, safety, and industry specificity. Finally, we focus on the practical challenges faced by industry agents, exploring their capability boundaries, developmental potential, and governance issues in various scenarios, while providing insights into future directions. By combining technological evolution with industry practices, this review aims to clarify the current state and offer a clear roadmap and theoretical foundation for understanding and building the next generation of industry agents.
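[NLP-31] DETree: DEtecting Human-AI Collaborative Texts via Tree-Structured Hierarchical Representation Learning NEURIPS2025
[Quick read]: This paper addresses the detection of hybrid texts, such as those produced through human-AI collaboration. Such texts arise from diverse generation and editing processes (purely AI-generated text, AI text edited by humans, human text edited by AI, and text refined by multiple models), so their characteristics are complex and hard to tell apart. Existing methods typically rely on coarse binary classification or simple multi-class classification and fail to capture the intrinsic relationships among generation processes. The key to the solution is DETree, a framework that models the different generation processes as a Hierarchical Affinity Tree and uses a specialized loss function to align text representations with that tree, capturing the relationships among hybrid text types at a finer granularity. The authors also build RealBench, a benchmark dataset covering a wide range of human-AI collaboration scenarios. The method significantly improves robustness and generalization in out-of-distribution (OOD) settings, particularly under few-shot learning conditions.
Link: https://arxiv.org/abs/2510.17489
Authors: Yongxin He,Shan Zhang,Yixuan Cao,Lei Ma,Ping Luo
Affiliations: Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; State Key Lab of AI Safety, Beijing, China; University of Chinese Academy of Sciences, CAS, Beijing, China; The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: To appear in NeurIPS 2025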
Abstract:Detecting AI-involved text is essential for combating misinformation, plagiarism, and academic misconduct. However, AI text generation includes diverse collaborative processes (AI-written text edited by humans, human-written text edited by AI, and AI-generated text refined by other AI), where various or even new LLMs could be involved. Texts generated through these varied processes exhibit complex characteristics, presenting significant challenges for detection. Current methods model these processes rather crudely, primarily employing binary classification (purely human vs. AI-involved) or multi-classification (treating human-AI collaboration as a new class). We observe that representations of texts generated through different processes exhibit inherent clustering relationships. Therefore, we propose DETree, a novel approach that models the relationships among different processes as a Hierarchical Affinity Tree structure, and introduces a specialized loss function that aligns text representations with this tree. To facilitate this learning, we developed RealBench, a comprehensive benchmark dataset that automatically incorporates a wide spectrum of hybrid texts produced through various human-AI collaboration processes. Our method improves performance in hybrid text detection tasks and significantly enhances robustness and generalization in out-of-distribution scenarios, particularly in few-shot learning conditions, further demonstrating the promise of training-based approaches in OOD settings. Our code and dataset are available at this https URL.
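[NLP-32] ReXMoE: Reusing Experts with Minimal Overhead in Mixture-of-Experts
[Quick read]: This paper addresses the limited flexibility of expert combinations in current Mixture-of-Experts (MoE) architectures caused by the layer-local routing mechanism: each layer can use only its own expert pool, which limits how experts can be combined and forces a trade-off between expert dimensionality and routing diversity under a fixed parameter budget. The key to the solution is ReXMoE, an architecture that lets routers reuse experts across adjacent layers, decoupling expert dimensionality from per-layer budgets and enabling richer expert combinations without increasing the total parameter count. It further introduces a Progressive Scaling Routing (PSR) strategy that gradually enlarges the candidate expert pool during training, improving language modeling and downstream task performance.
Link: https://arxiv.org/abs/2510.17483
Authors: Zheyue Tan,Zhiyuan Li,Tao Yuan,Dong Zhou,Weilin Liu,Yueqing Zhuang,Yadong Li,Guowei Niu,Cheng Qin,Zhuyu Yao,Congyi Liu,Haiyang Xu,Boxun Li,Guohao Dai,Bo Zhao,Yu Wang
Affiliations: Aalto University; Infinigence-AI; Yale University; Shanghai Jiao Tong University; Shanghai Innovation Institute; Tsinghua University
Categories: Computation and Language (cs.CL)
Comments: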
Abstract:Mixture-of-Experts (MoE) architectures have emerged as a promising approach to scale Large Language Models (LLMs). MoE boosts efficiency by activating a subset of experts per token. Recent works show that fine-grained experts substantially enrich the combinatorial flexibility of active experts and enhance model expressiveness. However, such a design is fundamentally limited by the layer-local routing mechanism: each layer is restricted to its own expert pool. This requires a careful trade-off between expert dimensionality and routing diversity given fixed parameter budgets. We describe ReXMoE, a novel MoE architecture that improves routing beyond the existing layer-local approaches by allowing routers to reuse experts across adjacent layers. ReXMoE decouples expert dimensionality from per-layer budgets, enabling richer expert combinations without sacrificing individual expert capacity or inflating overall parameters. To this end, we propose a new progressive scaling routing (PSR) strategy to gradually increase the candidate expert pool during training. As a result, ReXMoE improves both language modeling and downstream task performance. Extensive experiments on models ranging from 0.5B to 7B parameters across different architectures demonstrate that ReXMoE consistently improves performance under fixed architectural dimensions, confirming ReXMoE as a new design paradigm for parameter-efficient and scalable MoE-based LLMs.
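The difference between layer-local routing and routing over a pool shared with adjacent layers can be sketched as follows. Everything here is an illustrative assumption: the abstract does not specify how the candidate pool is formed, so the symmetric `reuse_span` window, the `(layer, expert)` ids, and the linear `psr_span` schedule are hypothetical stand-ins rather than the ReXMoE implementation.

```python
def candidate_pool(layer, num_layers, experts_per_layer, reuse_span):
    """Return (layer, expert) ids the router at `layer` may select from.

    reuse_span = 0 reproduces layer-local routing; larger values add the
    experts of neighbouring layers (assumed symmetric window).
    """
    lo = max(0, layer - reuse_span)
    hi = min(num_layers - 1, layer + reuse_span)
    return [(l, e) for l in range(lo, hi + 1) for e in range(experts_per_layer)]


def route_top_k(scores, pool, k):
    """Pick the k highest-scoring experts from the pool for one token."""
    if len(scores) != len(pool):
        raise ValueError("need exactly one score per candidate expert")
    ranked = sorted(zip(scores, pool), key=lambda pair: pair[0], reverse=True)
    return [expert for _, expert in ranked[:k]]


def psr_span(step, total_steps, max_span):
    """Hypothetical linear schedule: widen the reuse span as training proceeds."""
    return min(max_span, step * (max_span + 1) // total_steps)
```

With `reuse_span=1`, a middle layer sees three layers' worth of experts while the total number of expert parameters stays unchanged, which is the decoupling the abstract describes.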
[NLP-33] Disparities in Multilingual LLM-Based Healthcare QA
[Quick read]: This paper addresses fairness problems in healthcare question answering (Healthcare QA) with multilingual large language models (LLMs) that stem from uneven information coverage and differing factual alignment across languages. The core challenge is that healthcare coverage varies substantially across Wikipedia language editions and LLM answers align more closely with English sources, which risks overlooking culturally relevant knowledge in non-English contexts. The key to the solution is supplying contextual excerpts from non-English Wikipedia at inference time via Retrieval-Augmented Generation (RAG), which effectively shifts LLM answers toward localized, culturally relevant knowledge and improves the fairness and factual alignment of multilingual healthcare QA.
Link: https://arxiv.org/abs/2510.17476
Authors: Ipek Baris Schlicht,Burcu Sayin,Zhixue Zhao,Frederik M. Labonté,Cesare Barbera,Marco Viviani,Paolo Rosso,Lucie Flek
Affiliations: Universitat Politècnica de València, Spain; University of Trento, Italy; University of Sheffield, United Kingdom; Bonn-Aachen International Center for IT, Germany; University of Bonn, Germany; Lamarr Institute for ML and AI, Germany; University of Pisa, Italy; University of Milano-Bicocca, Italy; ValgrAI Valencian Graduate School and Research Network of Artificial Intelligence, Spain
Categories: Computation and Language (cs.CL)
Comments: Under review
Abstract:Equitable access to reliable health information is vital when integrating AI into healthcare. Yet, information quality varies across languages, raising concerns about the reliability and consistency of multilingual Large Language Models (LLMs). We systematically examine cross-lingual disparities in pre-training source and factuality alignment in LLM answers for multilingual healthcare QA across English, German, Turkish, Chinese (Mandarin), and Italian. We (i) constructed Multilingual Wiki Health Care (MultiWikiHealthCare), a multilingual dataset from Wikipedia; (ii) analyzed cross-lingual healthcare coverage; (iii) assessed LLM response alignment with these references; and (iv) conducted a case study on factual alignment through the use of contextual information and Retrieval-Augmented Generation (RAG). Our findings reveal substantial cross-lingual disparities in both Wikipedia coverage and LLM factual alignment. Across LLMs, responses align more with English Wikipedia, even when the prompts are non-English. Providing contextual excerpts from non-English Wikipedia at inference time effectively shifts factual alignment toward culturally relevant knowledge. These results highlight practical pathways for building more equitable, multilingual AI systems for healthcare.
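[NLP-34] Evaluating Large Language Models on Urdu Idiom Translation
[Quick read]: This paper addresses idiomatic translation for low-resource languages such as Urdu, a long-standing challenge that has received little attention in machine translation research. The key to the solution is the first evaluation datasets for Urdu-to-English idiomatic translation, covering both Native Urdu and Roman Urdu text and annotated with gold-standard English equivalents. On this basis, the paper systematically evaluates multiple open-source large language models (LLMs) and neural machine translation (NMT) systems, finding that prompt engineering improves idiomatic translation and that Native Urdu input preserves idiomatic and cultural meaning more accurately than Roman Urdu input.
Link: https://arxiv.org/abs/2510.17460
Authors: Muhammad Farmal Khan,Mousumi Akter
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: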
Abstract:Idiomatic translation remains a significant challenge in machine translation, especially for low-resource languages such as Urdu, and has received limited prior attention. To advance research in this area, we introduce the first evaluation datasets for Urdu-to-English idiomatic translation, covering both Native Urdu and Roman Urdu scripts and annotated with gold-standard English equivalents. We evaluate multiple open-source Large Language Models (LLMs) and Neural Machine Translation (NMT) systems on this task, focusing on their ability to preserve idiomatic and cultural meaning. Automatic metrics including BLEU, BERTScore, COMET, and XCOMET are used to assess translation quality. Our findings indicate that prompt engineering enhances idiomatic translation compared to direct translation, though performance differences among prompt types are relatively minor. Moreover, cross-script comparisons reveal that text representation substantially affects translation quality, with Native Urdu inputs producing more accurate idiomatic translations than Roman Urdu.
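[NLP-35] Multilingual Clinical NER for Diseases and Medications Recognition in Cardiology Texts using BERT Embeddings
[Quick read]: This paper addresses the weak performance of named entity recognition (NER) in clinical natural language processing (NLP) for low-resource languages, focusing on recognizing disease and medication entities in cardiology clinical case reports written in English, Spanish, and Italian. The key to the solution is building multiple BERT-based deep contextual embedding models and systematically evaluating monolingual and multilingual variants across languages to improve entity extraction from non-English clinical text. The reported F1 scores exceed the mean and median leaderboard scores on the reported subtasks, indicating that transfer learning from pre-trained language models can strengthen clinical NER for low-resource languages even with limited annotated data.
Link: https://arxiv.org/abs/2510.17437
Authors: Manuela Daniela Danu,George Marica,Constantin Suciu,Lucian Mihai Itu,Oladimeji Farri
Affiliations: Advanta; Siemens SRL; Transilvania University of Brasov; Siemens Healthineers
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 5 figures, 1 table, published in Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024)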
Abstract:The rapidly increasing volume of electronic health record (EHR) data underscores a pressing need to unlock biomedical knowledge from unstructured clinical texts to support advancements in data-driven clinical systems, including patient diagnosis, disease progression monitoring, treatment effects assessment, prediction of future clinical events, etc. While contextualized language models have demonstrated impressive performance improvements for named entity recognition (NER) systems in English corpora, there remains a scarcity of research focused on clinical texts in low-resource languages. To bridge this gap, our study aims to develop multiple deep contextual embedding models to enhance clinical NER in the cardiology domain, as part of the BioASQ MultiCardioNER shared task. We explore the effectiveness of different monolingual and multilingual BERT-based models, trained on general domain text, for extracting disease and medication mentions from clinical case reports written in English, Spanish, and Italian. We achieved an F1-score of 77.88% on Spanish Diseases Recognition (SDR), 92.09% on Spanish Medications Recognition (SMR), 91.74% on English Medications Recognition (EMR), and 88.9% on Italian Medications Recognition (IMR). These results outperform the mean and median F1 scores in the test leaderboard across all subtasks, with the mean/median values being: 69.61%/75.66% for SDR, 81.22%/90.18% for SMR, 89.2%/88.96% for EMR, and 82.8%/87.76% for IMR.
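[NLP-36] Agentic Reinforcement Learning for Search is Unsafe
[Quick read]: This paper examines the fragility of safety in search agents trained with agentic reinforcement learning (RL): although these models retain some ability to refuse harmful requests, that safety mechanism breaks easily. Specifically, RL-trained search models inherit refusal behavior from instruction tuning, but under certain attacks they generate harmful search queries before refusing, and safety drops sharply. The key finding is a fundamental flaw in the current RL training paradigm: the reward optimizes only the effectiveness of queries and ignores their potential harm, so an attacked model prioritizes issuing harmful queries over refusing. The paper therefore argues that safety-aware agentic RL pipelines that constrain and optimize for safe search are urgently needed.
Link: https://arxiv.org/abs/2510.17431
Authors: Yushi Yang,Shreyansh Padarha,Andrew Lee,Adam Mahdi
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: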
Abstract:Agentic reinforcement learning (RL) trains large language models to autonomously call tools during reasoning, with search as the most common application. These models excel at multi-step reasoning tasks, but their safety properties are not well understood. In this study, we show that RL-trained search models inherit refusal from instruction tuning and often deflect harmful requests by turning them into safe queries. However, this safety is fragile. Two simple attacks, one that forces the model to begin its response with a search (Search attack) and another that encourages the model to search repeatedly (Multi-search attack), trigger cascades of harmful searches and answers. Across two model families (Qwen, Llama) with both local and web search, these attacks lower refusal rates by up to 60.0%, answer safety by 82.5%, and search-query safety by 82.4%. The attacks succeed by triggering models to generate harmful, request-mirroring search queries before they can generate the inherited refusal tokens. This exposes a core weakness of current RL training: it rewards continued generation of effective queries without accounting for their harmfulness. As a result, RL search models have vulnerabilities that users can easily exploit, making it urgent to develop safety-aware agentic RL pipelines optimising for safe search.
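[NLP-37] Navigating the Alignment-Calibration Trade-off: A Pareto-Superior Frontier via Model Merging
[Quick read]: This paper addresses the “alignment tax” of post-training: aligning a model with human preferences or task objectives not only lowers task accuracy but also severely degrades calibration, leaving models overconfident, less reliable, and less diverse in their outputs. The key to the solution is a simple post-hoc intervention: interpolating between the model weights before and after alignment. Such interpolation consistently yields Pareto-optimal models that exceed the accuracy of both parent models while substantially recovering the calibration lost during alignment. The method mitigates the full scope of the alignment tax in a computationally efficient way, producing more capable and more reliable models.
Link: https://arxiv.org/abs/2510.17426
Authors: Tiancheng Hu,Benjamin Minixhofer,Nigel Collier
Affiliations: University of Cambridge
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: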
Abstract:The “alignment tax” of post-training is typically framed as a drop in task accuracy. We show it also involves a severe loss of calibration, making models overconfident, less reliable, and model outputs less diverse. We show that this trade-off can be navigated effectively via a simple post-hoc intervention: interpolating between a model’s weights before and after alignment. Crucially, this is not a strict trade-off. We find that the process consistently reveals Pareto-optimal interpolations - models that improve accuracy beyond both parents while substantially recovering the calibration lost during alignment. Our work demonstrates that simple model merging provides a computationally efficient method for mitigating the full scope of the alignment tax, yielding models that are more capable and more reliable.
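Interpolating between two checkpoints can be sketched in a few lines. This is a minimal illustration that assumes linear interpolation over parameters stored as plain lists keyed by name; the function name, the `lam` coefficient, and the data layout are assumptions for illustration, and the paper may use a different merging recipe.

```python
def interpolate_weights(base, aligned, lam):
    """Blend two checkpoints parameter by parameter.

    base, aligned: dicts mapping parameter names to equal-length lists of floats
                   (weights before and after alignment).
    lam: interpolation coefficient; 0.0 returns the base model, 1.0 the aligned one.
    """
    if base.keys() != aligned.keys():
        raise ValueError("checkpoints must contain the same parameter names")
    merged = {}
    for name in base:
        if len(base[name]) != len(aligned[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - lam) * b + lam * a for b, a in zip(base[name], aligned[name])
        ]
    return merged
```

Sweeping `lam` over a grid and measuring accuracy and calibration at each point is one way to trace the trade-off curve the abstract refers to.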
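[NLP-38] BenCao: An Instruction-Tuned Large Language Model for Traditional Chinese Medicine
[Quick read]: This paper addresses key bottlenecks in applying generative AI to Traditional Chinese Medicine (TCM): the lack of multimodal integration, uninterpretable reasoning, and limited clinical applicability. The key to the solution is BenCao, a ChatGPT-based multimodal TCM assistant aligned with expert-level reasoning and TCM ethical norms through natural language instruction tuning rather than parameter retraining. The system combines a knowledge base of more than 1,000 classical and modern texts, a scenario-based instruction framework, a chain-of-thought simulation mechanism, and a feedback refinement process involving licensed TCM practitioners, and it connects to external APIs for tongue-image classification and multimodal database retrieval. This improves accuracy and interpretability on tasks such as diagnostics, herb recognition, and constitution classification, and offers a scalable technical path for deploying generative AI in traditional medicine.
Link: https://arxiv.org/abs/2510.17415
Authors: Jiacheng Xie,Yang Yu,Yibo Chen,Hanyao Zhang,Lening Zhao,Jiaxuan He,Lei Jiang,Xiaoting Tang,Guanghui An,Dong Xu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Multimedia (cs.MM); Software Engineering (cs.SE)
Comments: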
Abstract:Traditional Chinese Medicine (TCM), with a history spanning over two millennia, plays a role in global healthcare. However, applying large language models (LLMs) to TCM remains challenging due to its reliance on holistic reasoning, implicit logic, and multimodal diagnostic cues. Existing TCM-domain LLMs have made progress in text-based understanding but lack multimodal integration, interpretability, and clinical applicability. To address these limitations, we developed BenCao, a ChatGPT-based multimodal assistant for TCM, integrating structured knowledge bases, diagnostic data, and expert feedback refinement. BenCao was trained through natural language instruction tuning rather than parameter retraining, aligning with expert-level reasoning and ethical norms specific to TCM. The system incorporates a comprehensive knowledge base of over 1,000 classical and modern texts, a scenario-based instruction framework for diverse interactions, a chain-of-thought simulation mechanism for interpretable reasoning, and a feedback refinement process involving licensed TCM practitioners. BenCao connects to external APIs for tongue-image classification and multimodal database retrieval, enabling dynamic access to diagnostic resources. In evaluations across single-choice question benchmarks and multimodal classification tasks, BenCao achieved superior accuracy to general-domain and TCM-domain models, particularly in diagnostics, herb recognition, and constitution classification. The model was deployed as an interactive application on the OpenAI GPTs Store, accessed by nearly 1,000 users globally as of October 2025. This study demonstrates the feasibility of developing a TCM-domain LLM through natural language-based instruction tuning and multimodal integration, offering a practical framework for aligning generative AI with traditional medical reasoning and a scalable pathway for real-world deployment.
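[NLP-39] AFRICAPTION: Establishing a New Paradigm for Image Captioning in African Languages
[Quick read]: This paper addresses the long-standing concentration of multimodal AI research on high-resource languages, which leaves speakers of low-resource languages underserved. The core solution is the AfriCaption framework, whose key contributions are: (i) a dataset of semantically aligned image captions in 20 African languages built on Flickr8k, with captions produced through a context-aware selection and translation process; (ii) a dynamic, context-preserving pipeline that combines model ensembling and adaptive substitution to maintain output quality over time; and (iii) the AfriCaption model, a 0.5B-parameter vision-to-text architecture that integrates SigLIP (vision encoder) and NLLB200 (neural machine translation model) to generate captions for under-represented African languages. The framework establishes the first scalable image-captioning resource for under-represented African languages, advancing inclusive multimodal AI.
Link: https://arxiv.org/abs/2510.17405
Authors: Mardiyyah Oduwole,Prince Mireku,Fatimo Adebanjo,Oluwatosin Olajide,Mahi Aminu Aliyu,Jekaterina Novikova
Affiliations: ML Collective; Ashesi University; Abubakar Tafawa Balewa University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: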
Abstract:Multimodal AI research has overwhelmingly focused on high-resource languages, hindering the democratization of advancements in the field. To address this, we present AfriCaption, a comprehensive framework for multilingual image captioning in 20 African languages. Our contributions are threefold: (i) a curated dataset built on Flickr8k, featuring semantically aligned captions generated via a context-aware selection and translation process; (ii) a dynamic, context-preserving pipeline that ensures ongoing quality through model ensembling and adaptive substitution; and (iii) the AfriCaption model, a 0.5B parameter vision-to-text architecture that integrates SigLIP and NLLB200 for caption generation across under-represented languages. This unified framework ensures ongoing data quality and establishes the first scalable image-captioning resource for under-represented African languages, laying the groundwork for truly inclusive multimodal AI.
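[NLP-40] Leveraging Group Relative Policy Optimization to Advance Large Language Models in Traditional Chinese Medicine
[Quick read]: This paper addresses the limitations of large language models (LLMs) for Traditional Chinese Medicine (TCM) in alignment, data quality, and evaluation consistency. Existing TCM-specific LLMs have progressed through supervised fine-tuning but still fall short in expert-level reasoning and factual consistency. The key to the solution is reinforcement learning with Group Relative Policy Optimization (GRPO), which optimizes generation quality by comparing responses within a group, improving reasoning accuracy and knowledge consistency in TCM settings. Experiments show that Ladder-base, built on the Qwen2.5-7B-Instruct foundation model and trained on the textual subset of the TCM-Ladder benchmark, outperforms general-purpose LLMs (GPT-4, Gemini 2.5, Claude 3, Qwen3) and domain-specific TCM models (BenTsao, HuatuoGPT2, Zhongjing) across multiple reasoning metrics, supporting GRPO as an effective and efficient approach to building trustworthy, clinically grounded TCM AI systems.
Link: https://arxiv.org/abs/2510.17402
Authors: Jiacheng Xie,Shuai Zeng,Yang Yu,Xiaoting Tang,Guanghui An,Dong Xu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: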
Abstract:Traditional Chinese Medicine (TCM) presents a rich and structurally unique knowledge system that challenges conventional applications of large language models (LLMs). Although previous TCM-specific LLMs have shown progress through supervised fine-tuning, they often face limitations in alignment, data quality, and evaluation consistency. In this study, we introduce Ladder-base, the first TCM-focused LLM trained with Group Relative Policy Optimization (GRPO), a reinforcement learning method that improves reasoning and factual consistency by optimizing response selection based on intra-group comparisons. Ladder-base is built upon the Qwen2.5-7B-Instruct foundation model and trained exclusively on the textual subset of the TCM-Ladder benchmark, using 80 percent of the data for training and the remaining 20 percent split evenly between validation and test sets. Through standardized evaluation, Ladder-base demonstrates superior performance across multiple reasoning metrics when compared to both state-of-the-art general-purpose LLMs such as GPT-4, Gemini 2.5, Claude 3, and Qwen3 and domain-specific TCM models including BenTsao, HuatuoGPT2, and Zhongjing. These findings suggest that GRPO provides an effective and efficient strategy for aligning LLMs with expert-level reasoning in traditional medical domains and supports the development of trustworthy and clinically grounded TCM artificial intelligence systems.
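The intra-group comparison at the heart of GRPO can be sketched as a reward normalization step. This is a minimal illustration of the commonly described group-relative advantage (each sampled response's reward minus the group mean, divided by the group standard deviation); the function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions, and the policy-gradient update itself is omitted.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each response relative to the other responses for the same prompt.

    rewards: rewards of all responses sampled for a single prompt (one group).
    eps: small constant that avoids division by zero when all rewards are equal.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses that beat their group average receive positive advantages and are reinforced, while below-average responses are pushed down, without training a separate value model.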
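[NLP-41] EduAdapt: A Question Answer Benchmark Dataset for Evaluating Grade-Level Adaptability in LLMs EMNLP2025
[Quick read]: This paper addresses the failure of large language models (LLMs) to adapt content to student grade levels in K-12 education: model outputs often mismatch the cognitive stage of learners in vocabulary complexity or depth of explanation. The key to the solution is EduAdapt, the first dataset and evaluation framework for this purpose, with nearly 48,000 grade-labeled question-answer (QA) pairs across nine science subjects, spanning Grades 1-12 and grouped into four grade levels. The benchmark supports systematic evaluation of how well LLMs generate grade-appropriate content, with the aim of fostering more developmentally aligned educational AI systems through better training and prompting strategies.
Link: https://arxiv.org/abs/2510.17389
Authors: Numaan Naeem,Abdellah El Mekki,Muhammad Abdul-Mageed
Affiliations: MBZUAI; The University of British Columbia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 28 pages, 2 figures, 14 tables, 50 listings, EMNLP 2025 Main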
Abstract:Large language models (LLMs) are transforming education by answering questions, explaining complex concepts, and generating content across a wide range of subjects. Despite strong performance on academic benchmarks, they often fail to tailor responses to students’ grade levels. This is a critical need in K-12 education, where age-appropriate vocabulary and explanation are essential for effective learning. Existing models frequently produce outputs that are too advanced or vague for younger learners, and there are no standardized benchmarks to evaluate their ability to adjust across cognitive and developmental stages. To address this gap, we introduce EduAdapt, a benchmark of nearly 48k grade-labeled QA pairs across nine science subjects, spanning Grades 1-12 and grouped into four grade levels. We evaluate a diverse set of open-source LLMs on EduAdapt and find that while larger models generally perform better, they still struggle with generating suitable responses for early-grade students (Grades 1-5). Our work presents the first dataset and evaluation framework for assessing grade-level adaptability in LLMs, aiming to foster more developmentally aligned educational AI systems through better training and prompting strategies. EduAdapt code and datasets are publicly available at this https URL.
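[NLP-42] The Atomic Instruction Gap: Instruction-Tuned LLMs Struggle with Simple Self-Contained Directives
[Quick read]: This paper addresses the inconsistency of current generative AI models when executing simple, self-contained instructions, in particular the large performance swings across instruction formats. Although instruction-tuned large language models (IT-LLMs) show strong zero-shot reasoning, on tasks that only require following an atomic instruction (such as a change in option-label format), performance shifts sharply with the label style (alphabetic, numeric, Roman), for example by -30.45% for Roman versus numeric labels, revealing a clear instruction-format bias. The key points are: first, quantify sensitivity to atomic instructions through controlled variation (changing label format while keeping meaning identical); second, recognize that explicit instructions are necessary for consistent instruction-following and that current training paradigms do not guarantee robustness when instructions or option contents are absent; and third, develop evaluation methods and training strategies that explicitly target atomic instruction-following to overcome the limits of current IT-LLMs in basic instruction understanding.
Link: https://arxiv.org/abs/2510.17388
Authors: Henry Lim,Kwan Hui Lim
Affiliations: Singapore University of Technology and Design
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 1 figure, 8 tables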
Abstract:Instruction-tuned large language models (IT-LLMs) exhibit strong zero-shot reasoning, yet their ability to execute simple, self-contained instructions remains underexplored, despite this being foundational to complex instruction-following. We evaluate 20 IT-LLMs on modified MMLU and MMLU-Pro benchmarks, by systematically varying the format of option labels (alphabetic, numeric, Roman) while keeping their meaning identical under four paradigms, namely: (1) With explicit instructions, label changes cause large performance shifts (e.g., -30.45% for Roman vs. numeric), revealing instruction-format bias. (2) Without instructions, performance drops further (up to -10.84%) and label sensitivity intensifies, underscoring the role of explicit guidance. (3) When option contents are removed, models fail random-choice baselines except with numeric labels, suggesting weak adherence to atomic directives. (4) Three-shot exemplars yield no significant gains in robustness or fidelity, and generation analyses show persistent label errors, especially for non-numeric formats. Across model sizes, larger LLMs achieve higher accuracy but remain inconsistent in instruction adherence. These results expose the insufficiencies of current instruction-tuning paradigms and highlight the need for evaluation methods and training strategies that explicitly target atomic instruction-following.
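The label-format manipulation described in this abstract can be sketched as a small prompt builder. This is an illustrative assumption about how such variants might be generated: the style names, the `"{label}. {text}"` layout, and the helper functions are hypothetical, not the benchmark's actual code.

```python
ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X"]


def option_labels(n, style):
    """Return n option labels in the requested style."""
    if style == "alphabetic":
        if n > 26:
            raise ValueError("alphabetic labels support at most 26 options")
        return [chr(ord("A") + i) for i in range(n)]
    if style == "numeric":
        return [str(i + 1) for i in range(n)]
    if style == "roman":
        if n > len(ROMAN):
            raise ValueError("roman labels support at most 10 options")
        return ROMAN[:n]
    raise ValueError(f"unknown label style: {style}")


def format_question(question, options, style):
    """Render the same question with a different label format; option texts are unchanged."""
    labels = option_labels(len(options), style)
    lines = [question] + [f"{label}. {text}" for label, text in zip(labels, options)]
    return "\n".join(lines)
```

Because only the labels change, any accuracy gap between the rendered variants reflects sensitivity to format rather than to content.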
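[NLP-43] Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation
[Quick read]: This paper addresses the limitations of current Retrieval-Augmented Generation (RAG) systems in handling multimodal information: existing methods are designed mainly for text-only documents and fall short in real-world scenarios where queries and documents mix modalities such as text and images. The key to the solution is Nyx, a unified mixed-modal to mixed-modal retriever designed for Universal Retrieval-Augmented Generation (URAG) that handles text and image inputs together for retrieval and reasoning. To mitigate the scarcity of realistic mixed-modal data, the authors build the NyxQA dataset, synthesizing high-quality mixed-modal question-answer pairs from web documents through a four-stage automated generation and filtering pipeline. Training follows two stages: pre-training on NyxQA and a variety of open-source retrieval datasets, followed by supervised fine-tuning with feedback from downstream vision-language models (VLMs) so that retrieval results better match generative preferences, which significantly improves generation quality on vision-language tasks.
Link: https://arxiv.org/abs/2510.17354
Authors: Chenghao Zhang,Guanting Dong,Xinyu Yang,Zhicheng Dou
Affiliations: Renmin University of China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: This work is in progress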
Abstract:Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) by retrieving relevant documents from an external corpus. However, existing RAG systems primarily focus on unimodal text documents, and often fall short in real-world scenarios where both queries and documents may contain mixed modalities (such as text and images). In this paper, we address the challenge of Universal Retrieval-Augmented Generation (URAG), which involves retrieving and reasoning over mixed-modal information to improve vision-language generation. To this end, we propose Nyx, a unified mixed-modal to mixed-modal retriever tailored for URAG scenarios. To mitigate the scarcity of realistic mixed-modal data, we introduce a four-stage automated pipeline for generation and filtering, leveraging web documents to construct NyxQA, a dataset comprising diverse mixed-modal question-answer pairs that better reflect real-world information needs. Building on this high-quality dataset, we adopt a two-stage training framework for Nyx: we first perform pre-training on NyxQA along with a variety of open-source retrieval datasets, followed by supervised fine-tuning using feedback from downstream vision-language models (VLMs) to align retrieval outputs with generative preferences. Experimental results demonstrate that Nyx not only performs competitively on standard text-only RAG benchmarks, but also excels in the more general and realistic URAG setting, significantly improving generation quality in vision-language tasks.
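[NLP-44] Addressing Antisocial Behavior in Multi-Party Dialogs Through Multimodal Representation Learning
[Quick read]: This paper addresses the difficulty of identifying and analyzing antisocial behavior (ASB) in multi-party conversations, in particular the inadequate detection and understanding of hate speech, harassment, and cyberbullying in complex interactional contexts. The key to the solution is using CyberAgressionAdo-Large, a French open-access dataset, to evaluate six text-based and eight graph-based representation-learning methods and to explore their fusion. The late fusion model mBERT + WD-SGCN achieves the best overall results by combining lexical cues with interactional dynamics, and it handles nuanced ASB phenomena such as implicit aggression, role transitions, and context-dependent hostility effectively.
Link: https://arxiv.org/abs/2510.17289
Authors: Hajar Bakarou,Mohamed Sinane El Messoussi,Anaïs Ollagnier
Affiliations: Université Côte d’Azur; CNRS; Inria; I3S
Categories: Computation and Language (cs.CL)
Comments: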
Abstract:Antisocial behavior (ASB) on social media – including hate speech, harassment, and cyberbullying – poses growing risks to platform safety and societal well-being. Prior research has focused largely on networks such as X and Reddit, while multi-party conversational settings remain underexplored due to limited data. To address this gap, we use CyberAgressionAdo-Large, a French open-access dataset simulating ASB in multi-party conversations, and evaluate three tasks: abuse detection, bullying behavior analysis, and bullying peer-group identification. We benchmark six text-based and eight graph-based representation-learning methods, analyzing lexical cues, interactional dynamics, and their multimodal fusion. Results show that multimodal models outperform unimodal baselines. The late fusion model mBERT + WD-SGCN achieves the best overall results, with top performance on abuse detection (0.718) and competitive scores on peer-group identification (0.286) and bullying analysis (0.606). Error analysis highlights its effectiveness in handling nuanced ASB phenomena such as implicit aggression, role transitions, and context-dependent hostility.
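Late fusion, as opposed to merging features before classification, combines the predictions of separately trained models. The sketch below assumes a weighted average of class probabilities from a text model and a graph model; the abstract does not state the fusion rule actually used, so the function names and the `text_weight` parameter are hypothetical.

```python
def late_fusion(text_probs, graph_probs, text_weight=0.5):
    """Combine per-class probabilities from a text model and a graph model.

    text_probs, graph_probs: class probabilities for the same message.
    text_weight: share given to the text model; the graph model gets the rest.
    """
    if len(text_probs) != len(graph_probs):
        raise ValueError("both models must score the same set of classes")
    fused = [
        text_weight * t + (1.0 - text_weight) * g
        for t, g in zip(text_probs, graph_probs)
    ]
    total = sum(fused)
    return [p / total for p in fused]  # renormalize so the result sums to 1


def predict(probs):
    """Return the index of the most probable class."""
    return max(range(len(probs)), key=probs.__getitem__)
```

In this toy setup the graph model can overturn a text-only decision when interaction structure carries stronger evidence than wording, which is the kind of complementarity the abstract reports.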
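[NLP-45] TaxoAlign: Scholarly Taxonomy Generation Using Language Models EMNLP2025
[Quick read]: This paper addresses the lack of structural comparison between automatically generated taxonomies and those built by human experts in automatic literature survey generation. Existing approaches do not check whether the structure of a generated survey matches that of human-written surveys, which makes the plausibility and quality of the output hard to verify. The authors propose TaxoAlign, a three-phase, topic-based, instruction-guided method for scholarly taxonomy generation that aims to bring automatically generated taxonomies closer to the organization used by human experts. They also build the CS-TaxoBench benchmark (460 taxonomies extracted from human-written survey papers) and design a stringent automated evaluation framework that measures the structural alignment and semantic coherence of generated taxonomies against expert-created ones, enabling objective assessment of generation quality.
Link: https://arxiv.org/abs/2510.17263
Authors: Avishek Lahiri,Yufang Hou,Debarshi Kumar Sanyal
Affiliations: Indian Association for the Cultivation of Science; IT:U Interdisciplinary Transformation University Austria
Categories: Computation and Language (cs.CL)
Comments: This paper has been accepted at the EMNLP 2025 Main Conference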
点击查看摘要
Abstract:Taxonomies play a crucial role in helping researchers structure and navigate knowledge in a hierarchical manner. They also form an important part in the creation of comprehensive literature surveys. The existing approaches to automatic survey generation do not compare the structure of the generated surveys with those written by human experts. To address this gap, we present our own method for automated taxonomy creation that can bridge the gap between human-generated and automatically-created taxonomies. For this purpose, we create the CS-TaxoBench benchmark which consists of 460 taxonomies that have been extracted from human-written survey papers. We also include an additional test set of 80 taxonomies curated from conference survey papers. We propose TaxoAlign, a three-phase topic-based instruction-guided method for scholarly taxonomy generation. Additionally, we propose a stringent automated evaluation framework that measures the structural alignment and semantic coherence of automatically generated taxonomies in comparison to those created by human experts. We evaluate our method and various baselines on CS-TaxoBench, using both automated evaluation metrics and human evaluation studies. The results show that TaxoAlign consistently surpasses the baselines on nearly all metrics. The code and data can be found at this https URL.
[NLP-46] Explainability of Large Language Models: Opportunities and Challenges toward Generating Trustworthy Explanations
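[Quick read]: This paper addresses the lack of explainability in how large language models (LLMs) predict and generate content, and in particular the opaque internal mechanisms behind the prediction and reasoning errors known as hallucinations, which limit user trust in model outputs. Its approach is to review local explainability and mechanistic interpretability methods for Transformer-based LLMs, to describe experimental studies in two critical domains, healthcare and autonomous driving, and to analyze what such explanations imply for the trust of those who receive them. It then summarizes open problems and outlines directions toward human-aligned, trustworthy explanations.
Link: https://arxiv.org/abs/2510.17256
Authors: Shahin Atakishiyev, Housam K.B. Babiker, Jiayi Dai, Nawshad Farruque, Teruaki Hayashi, Nafisa Sadaf Hriti, Md Abed Rahman, Iain Smith, Mi-Young Kim, Osmar R. Zaïane, Randy Goebel
Affiliations: University of Alberta; University of Tokyo
Categories: Computation and Language (cs.CL)
Comments: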
Abstract:Large language models have exhibited impressive performance across a broad range of downstream tasks in natural language processing. However, how a language model predicts the next token and generates content is not generally understandable by humans. Furthermore, these models often make errors in prediction and reasoning, known as hallucinations. These errors underscore the urgent need to better understand and interpret the intricate inner workings of language models and how they generate predictive outputs. Motivated by this gap, this paper investigates local explainability and mechanistic interpretability within Transformer-based large language models to foster trust in such models. In this regard, our paper aims to make three key contributions. First, we present a review of local explainability and mechanistic interpretability approaches and insights from relevant studies in the literature. Furthermore, we describe experimental studies on explainability and reasoning with large language models in two critical domains – healthcare and autonomous driving – and analyze the trust implications of such explanations for explanation receivers. Finally, we summarize current unaddressed issues in the evolving landscape of LLM explainability and outline the opportunities, critical challenges, and future directions toward generating human-aligned, trustworthy LLM explanations.
[NLP-47] How News Feels: Understanding Affective Bias in Multilingual Headlines for Human-Centered Media Design
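[Quick read]: This paper examines how differences in emotional framing across news outlets reporting the same event can steer public mood. Using zero-shot inference with Gemma-3 4B, the authors analyze 300,000 Bengali news headlines and their content to identify the dominant emotion and overall tone of each item. They find that negative emotions (anger, fear, and disappointment in particular) dominate, and that similar stories are portrayed with markedly different emotions across outlets. Based on these findings, they propose design ideas for a human-centered news aggregator that visualizes emotional cues to help readers recognize hidden affective framing.
Link: https://arxiv.org/abs/2510.17252
Authors: Mohd Ruhul Ameen, Akif Islam, Abu Saleh Musa Miah, Ayesha Siddiqua, Jungpil Shin
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 7 figures, 4 tables. Submitted to the International Conference on Data and Applied Analytics (IDAA 2025)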
Abstract:News media often shape the public mood not only by what they report but by how they frame it. The same event can appear calm in one outlet and alarming in another, reflecting subtle emotional bias in reporting. Negative or emotionally charged headlines tend to attract more attention and spread faster, which in turn encourages outlets to frame stories in ways that provoke stronger reactions. This research explores that tendency through large-scale emotion analysis of Bengali news. Using zero-shot inference with Gemma-3 4B, we analyzed 300000 Bengali news headlines and their content to identify the dominant emotion and overall tone of each. The findings reveal a clear dominance of negative emotions, particularly anger, fear, and disappointment, and significant variation in how similar stories are emotionally portrayed across outlets. Based on these insights, we propose design ideas for a human-centered news aggregator that visualizes emotional cues and helps readers recognize hidden affective framing in daily news.
[NLP-48] From Preferences to Prejudice: The Role of Alignment Tuning in Shaping Social Bias in Video Diffusion Models
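[Quick read]: This paper addresses the social biases, especially gender and ethnicity stereotypes, that alignment tuning on human preferences can introduce into video diffusion models and amplify. Its key contribution is VideoBiasEval, a systematic diagnostic framework that uses an event-based prompting strategy to disentangle semantic content (actions and contexts) from actor attributes (gender and ethnicity), together with multi-granular metrics for overall ethnicity bias, gender bias conditioned on ethnicity, distributional shifts in social attributes across model variants, and the temporal persistence of bias within videos. With it, the authors carry out the first end-to-end analysis tracing bias from human preference datasets through reward models to alignment-tuned video diffusion models, and show that alignment tuning improves visual quality while strengthening biases and making them temporally stable.
Link: https://arxiv.org/abs/2510.17247
Authors: Zefan Cai, Haoyi Qiu, Haozhe Zhao, Ke Wan, Jiachen Li, Jiuxiang Gu, Wen Xiao, Nanyun Peng, Junjie Hu
Affiliations: University of Wisconsin–Madison; University of California, Los Angeles; University of Illinois Urbana-Champaign; University of California, San Diego; University of California, Santa Barbara; Adobe Research; Microsoft
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: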
Abstract:Recent advances in video diffusion models have significantly enhanced text-to-video generation, particularly through alignment tuning using reward models trained on human preferences. While these methods improve visual quality, they can unintentionally encode and amplify social biases. To systematically trace how such biases evolve throughout the alignment pipeline, we introduce VideoBiasEval, a comprehensive diagnostic framework for evaluating social representation in video generation. Grounded in established social bias taxonomies, VideoBiasEval employs an event-based prompting strategy to disentangle semantic content (actions and contexts) from actor attributes (gender and ethnicity). It further introduces multi-granular metrics to evaluate (1) overall ethnicity bias, (2) gender bias conditioned on ethnicity, (3) distributional shifts in social attributes across model variants, and (4) the temporal persistence of bias within videos. Using this framework, we conduct the first end-to-end analysis connecting biases in human preference datasets, their amplification in reward models, and their propagation through alignment-tuned video diffusion models. Our results reveal that alignment tuning not only strengthens representational biases but also makes them temporally stable, producing smoother yet more stereotyped portrayals. These findings highlight the need for bias-aware evaluation and mitigation throughout the alignment process to ensure fair and socially responsible video generation.
[NLP-49] StreamingThinker: Large Language Models Can Think While Reading
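[Quick read]: This paper addresses the latency that arises in chain-of-thought (CoT) reasoning because large language models (LLMs) start thinking only after the entire input is available, which also weakens attention to earlier information in dynamic scenarios. It proposes a streaming thinking paradigm, realized through three components: (1) streaming CoT generation with quality control; (2) order-preserving reasoning enforced through streaming attention masks and position encoding; and (3) parallel KV caches that decouple input encoding from reasoning generation, enabling true concurrency. Experiments show performance comparable to batch thinking, with an 80% reduction in token waiting before reasoning begins and a more than 60% reduction in time-level latency for producing the final answer.
Link: https://arxiv.org/abs/2510.17238
Authors: Junlong Tong, Yingqi Fan, Anhao Zhao, Yunpu Ma, Xiaoyu Shen
Affiliations: Shanghai Jiao Tong University; Eastern Institute of Technology, Ningbo, China; Hong Kong Polytechnic University; Ludwig Maximilian University of Munich
Categories: Computation and Language (cs.CL)
Comments: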
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in chain of thought (CoT) reasoning. However, the current LLM reasoning paradigm initiates thinking only after the entire input is available, which introduces unnecessary latency and weakens attention to earlier information in dynamic scenarios. Inspired by human cognition of thinking while reading, we first design a streaming thinking paradigm for LLMs, where reasoning unfolds in the order of input and further adjusts its depth once reading is complete. We instantiate this paradigm with StreamingThinker, a framework that enables LLMs to think while reading through the integration of streaming CoT generation, streaming-constraint training, and streaming parallel inference. Specifically, StreamingThinker employs streaming reasoning units with quality control for CoT generation, enforces order-preserving reasoning through streaming attention masks and position encoding, and leverages parallel KV caches that decouple input encoding from reasoning generation, thereby ensuring alignment and enabling true concurrency. We evaluate StreamingThinker on the Qwen3 model family across math reasoning, logical reasoning, and context-based QA reasoning tasks. Experimental results show that the StreamingThinker preserves performance comparable to batch thinking, while yielding an 80% reduction in token waiting before the onset of reasoning and a more than 60% reduction in time-level latency for producing the final answer, demonstrating the effectiveness of the streaming paradigm for LLM reasoning. Code will be released at this repository.
[NLP-50] Wisdom is Knowing What not to Say: Hallucination-Free LLMs Unlearning via Attention Shifting
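[Quick read]: This paper addresses a dilemma in machine unlearning for large language models (LLMs): aggressive unlearning compromises model utility, while conservative strategies preserve utility but risk hallucinated responses, which limits reliability in knowledge-intensive applications. The authors propose the Attention-Shifting (AS) framework, which achieves selective unlearning through two attention-level interventions: importance-aware suppression, applied to the unlearning set, which attenuates attention to fact-bearing tokens without disrupting linguistic structure; and attention-guided retention enhancement, which reinforces attention to semantically essential tokens in the retained data to prevent unintended degradation. The two components are jointly optimized with a dual-loss objective that forms a soft boundary, localizing unlearning while preserving unrelated knowledge under representation superposition. Compared with state-of-the-art unlearning methods, AS achieves up to 15% higher accuracy on the ToFU benchmark and up to 10% on the TDEC benchmark while maintaining competitive hallucination-free unlearning effectiveness.
Link: https://arxiv.org/abs/2510.17210
Authors: Chenchen Tan, Youyang Qu, Xinghao Li, Hui Zhang, Shujie Cui, Cunjian Chen, Longxiang Gao
Affiliations: Monash University; Shandong Computer Science Center; Qilu University of Technology (Shandong Academy of Sciences); Shandong Provincial Key Laboratory of Computing Power Internet and Service Computing; Shandong Fundamental Research Center for Computer Science; Anhui University
Categories: Computation and Language (cs.CL)
Comments: 22 pages, 10 figures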
Abstract:The increase in computing power and the necessity of AI-assisted decision-making boost the growing application of large language models (LLMs). Along with this, the potential retention of sensitive data of LLMs has spurred increasing research into machine unlearning. However, existing unlearning approaches face a critical dilemma: Aggressive unlearning compromises model utility, while conservative strategies preserve utility but risk hallucinated responses. This significantly limits LLMs’ reliability in knowledge-intensive applications. To address this, we introduce a novel Attention-Shifting (AS) framework for selective unlearning. AS is driven by two design objectives: (1) context-preserving suppression that attenuates attention to fact-bearing tokens without disrupting LLMs’ linguistic structure; and (2) hallucination-resistant response shaping that discourages fabricated completions when queried about unlearning content. AS realizes these objectives through two attention-level interventions, which are importance-aware suppression applied to the unlearning set to reduce reliance on memorized knowledge and attention-guided retention enhancement that reinforces attention toward semantically essential tokens in the retained dataset to mitigate unintended degradation. These two components are jointly optimized via a dual-loss objective, which forms a soft boundary that localizes unlearning while preserving unrelated knowledge under representation superposition. Experimental results show that AS improves performance preservation over the state-of-the-art unlearning methods, achieving up to 15% higher accuracy on the ToFU benchmark and 10% on the TDEC benchmark, while maintaining competitive hallucination-free unlearning effectiveness. Compared to existing methods, AS demonstrates a superior balance between unlearning effectiveness, generalization, and response reliability.
[NLP-51] Soft-Masked Diffusion Language Models
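[Quick read]: This paper addresses the loss of predictive information in masked diffusion language models, whose decoding makes a binary decision for each masked token: retain the mask or replace it with the predicted token. It proposes soft-masking (SM), which, for each retained mask, dynamically blends the embedding of the mask token with the embeddings of the top-k tokens predicted at the previous decoding step. This gives the model a more informative prior, preserves context from earlier computation, and lets partial information about masked tokens propagate beyond a single step. The abstract reports improved perplexity and MAUVE scores for a 169M-parameter model and consistent gains on coding benchmarks for Dream-7B and Dream-Coder-7B.
Link: https://arxiv.org/abs/2510.17206
Authors: Michael Hersche, Samuel Moor-Smith, Thomas Hofmann, Abbas Rahimi
Affiliations: IBM Research – Zurich; ETH Zürich
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: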
Abstract:Diffusion models have demonstrated strong potential in language modeling, offering various advantages over traditional autoregressive approaches. Their ability to generate and revise entire responses in parallel enables faster generation and built-in self-correction mechanisms. Most modern diffusion-based language models employ masked diffusion, where decoding involves iteratively processing masked tokens based on a binary decision: either retaining the mask or replacing it with the predicted token. However, this binary choice discards valuable predictive information when the mask is retained. To address this limitation, we introduce soft-masking (SM), a novel method that dynamically blends the embedding of the mask token with the embeddings of the top-k predicted tokens from the previous decoding step, for each retained mask. This provides the model with a more informative prior, preserving context from earlier computations and allowing partial information about masked tokens to propagate beyond a single step. We propose a training methodology that adapts a pretrained masked diffusion language model to incorporate SM. We demonstrate that continuing pretraining a 169M parameter model with SM leads to improved perplexity and MAUVE scores. Furthermore, we finetune two state-of-the-art diffusion models, Dream-7B and Dream-Coder-7B, with SM. SM consistently improves performance across multiple coding benchmarks, particularly in high-throughput settings.
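The blending step described in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the abstract says the blend is dynamic but does not give the rule, so the fixed `blend` weight, the renormalization of the top-k probabilities, and the function name are all assumptions.

```python
def soft_mask_embedding(mask_emb, token_embs, probs, k=2, blend=0.5):
    """Blend the mask embedding with the top-k predicted token embeddings.

    mask_emb:   embedding of the mask token (list of floats)
    token_embs: one embedding per vocabulary entry
    probs:      predicted probabilities from the previous decoding step
    blend:      hypothetical fixed mixing weight (the paper's weight is dynamic)
    """
    # Indices of the k most probable tokens.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    if mass == 0.0:
        return list(mask_emb)  # nothing to blend in
    dim = len(mask_emb)
    # Probability-weighted average of the top-k token embeddings.
    mixed = [
        sum(probs[i] / mass * token_embs[i][d] for i in top)
        for d in range(dim)
    ]
    return [(1.0 - blend) * mask_emb[d] + blend * mixed[d] for d in range(dim)]
```

Setting `blend=0.0` recovers ordinary hard masking, where a retained mask carries no information about the previous prediction.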
[NLP-52] VisiPruner: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs EMNLP2025
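[Quick read]: This paper addresses the heavy computational overhead of multimodal large language models (MLLMs) on vision-language tasks, where attention cost grows quadratically with the number of multimodal tokens. Existing token-pruning methods lack a fundamental understanding of how MLLMs process and fuse multimodal information. Through systematic analysis, the authors identify a three-stage cross-modal interaction process: shallow layers recognize task intent, with visual tokens acting as passive attention sinks; middle layers perform cross-modal fusion abruptly, driven by a few critical visual tokens; and deep layers discard vision tokens and focus on linguistic refinement. Building on this, they propose VisiPruner, a training-free pruning framework that removes up to 99% of vision-related attention computation and 53.9% of FLOPs on LLaVA-v1.5 7B, outperforms existing token-pruning methods, and generalizes across diverse MLLMs.
Link: https://arxiv.org/abs/2510.17205
Authors: Yingqi Fan, Anhao Zhao, Jinlan Fu, Junlong Tong, Hui Su, Yijie Pan, Wei Zhang, Xiaoyu Shen
Affiliations: Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative; Institute of Digital Twin, EIT, Ningbo; Shanghai Jiao Tong University; Hong Kong Polytechnic University; Meituan Inc.; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: EMNLP 2025 Main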
Abstract:Multimodal Large Language Models (MLLMs) have achieved strong performance across vision-language tasks, but suffer from significant computational overhead due to the quadratic growth of attention computations with the number of multimodal tokens. Though efforts have been made to prune tokens in MLLMs, they lack a fundamental understanding of how MLLMs process and fuse multimodal information. Through systematic analysis, we uncover a three-stage cross-modal interaction process: (1) Shallow layers recognize task intent, with visual tokens acting as passive attention sinks; (2) Cross-modal fusion occurs abruptly in middle layers, driven by a few critical visual tokens; (3) Deep layers discard vision tokens, focusing solely on linguistic refinement. Based on these findings, we propose VisiPruner, a training-free pruning framework that reduces up to 99% of vision-related attention computations and 53.9% of FLOPs on LLaVA-v1.5 7B. It significantly outperforms existing token pruning methods and generalizes across diverse MLLMs. Beyond pruning, our insights further provide actionable guidelines for training efficient MLLMs by aligning model architecture with its intrinsic layer-wise processing dynamics. Our code is available at: this https URL.
[NLP-53] Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models
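[Quick read]: This paper addresses two obstacles to long-context language modeling: standard Transformers are limited by quadratic complexity and poor length extrapolation, while alternatives such as sliding window attention and state space models cannot make full use of the context because of their fixed-size memory. For chunk-based sparse attention, it identifies three design principles as critical: (1) an expressive, non-linear Chunk Encoder with a dedicated CLS token that produces representations for retrieval; (2) a Bypassing Residual Path that stably integrates retrieved global information without it being overridden by the local residual stream; and (3) enforced selection sparsity during pre-training to bridge the train-test distribution gap. Combining them, models trained on a 4K context generalize without fine-tuning to 32 million tokens on RULER and BABILong, a new state of the art for training-free length extrapolation.
Link: https://arxiv.org/abs/2510.17196
Authors: Jiaqi Leng, Xiang Hu, Junxiong Wang, Jianguo Li, Wei Wu, Yucheng Lu
Affiliations: Fudan University; Ant Group; Cornell University; NYU Shanghai
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint. Work in progress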
Abstract:Effectively processing long contexts is a critical challenge for language models. While standard Transformers are limited by quadratic complexity and poor length extrapolation, alternative architectures like sliding window attention and state space models sacrifice the ability to effectively utilize the full context due to their fixed-size memory. Chunk-based sparse attention has emerged as a promising paradigm for extreme length generalization, yet the key architectural principles underpinning its success are not yet fully understood. In this work, we present a systematic dissection of these models to identify the core components driving their performance. Through a unified framework and comprehensive ablation studies, we demonstrate that a combination of three design principles is critical: (1) an expressive, non-linear Chunk Encoder with a dedicated CLS token to produce representations for retrieval; (2) a Bypassing Residual Path to stably integrate retrieved global information without it being overridden by the local residual stream; and (3) enforced selection sparsity during pre-training to bridge the train-test distribution gap. We provide a theoretical motivation for intra-chunk information processing and landmark generation. By combining these principles, we establish a new state-of-the-art for training-free length extrapolation, successfully generalizing models trained on a 4K context to 32 million tokens on RULER and BABILong. Our findings provide a clear and empirically-grounded set of design principles for developing future, highly-capable long-context language models.
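The retrieval step at the heart of chunk-based sparse attention can be illustrated with a small sketch: each chunk is summarized by one landmark vector, and the query attends only to the highest-scoring chunks. Everything here is an assumption made for illustration (dot-product scoring, a fixed `top_k`, the function name); the abstract does not specify how the paper's Chunk Encoder representations are scored.

```python
def select_chunks(query, landmarks, top_k=2):
    """Return the indices of the top_k chunks whose landmark best matches the query.

    query:     query vector (list of floats)
    landmarks: one summary vector per chunk, e.g. a CLS-style representation
    Assumption: relevance is a plain dot product.
    """
    scores = [sum(q * l for q, l in zip(query, lm)) for lm in landmarks]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep only top_k chunks (the sparsity constraint) and restore text order.
    return sorted(ranked[:top_k])
```

Because only `top_k` chunks are attended to regardless of how many chunks exist, the per-query attention cost stays fixed as the context grows, which is what makes extrapolation to very long inputs plausible.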
[NLP-54] Offline Policy Evaluation of Multi-Turn LLM Health Coaching with Real Users NEURIPS2025
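[Quick read]: This paper studies a web-deployed, tool-augmented LLM health coach and the risk that a uniform policy harms specific subgroups, most notably users with low health literacy and high self-efficacy. It advocates an evaluation-first path to personalization: freeze the generator, learn subgroup-aware factorized decision heads (Tool/Style) on typed rewards (objective tool outcomes and satisfaction), and always report per-archetype metrics so that subgroup harms hidden by averages become visible. A lightweight simulator with hidden archetypes further shows that a small early information-gain bonus shortens trait identification and improves goal success and pass@3.
Link: https://arxiv.org/abs/2510.17173
Authors: Melik Ozolcer, Sang Won Bae
Affiliations: Stevens Institute of Technology
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to the NeurIPS 2025 Workshop on Multi-Turn Interactions in Large Language Models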
Abstract:We study a web-deployed, tool-augmented LLM health coach with real users. In a pilot with seven users (280 rated turns), offline policy evaluation (OPE) over factorized decision heads (Tool/Style) shows that a uniform heavy-tool policy raises average value on logs but harms specific subgroups, most notably low-health-literacy/high-self-efficacy users. A lightweight simulator with hidden archetypes further shows that adding a small early information-gain bonus reliably shortens trait identification and improves goal success and pass@3. Together, these early findings indicate an evaluation-first path to personalization: freeze the generator, learn subgroup-aware decision heads on typed rewards (objective tool outcomes and satisfaction), and always report per-archetype metrics to surface subgroup harms that averages obscure.
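The recommendation to report per-archetype metrics can be made concrete with a small off-policy estimator. The abstract does not say which estimator the authors use, so inverse propensity scoring (IPS) is swapped in here as a standard choice; the log schema (`group`, `action`, `reward`, `propensity`) and the function name are hypothetical.

```python
from collections import defaultdict


def ips_by_group(logs, target_policy):
    """Estimate the value of target_policy separately for each user group.

    logs:          iterable of dicts with keys group, action, reward, and
                   propensity (probability the logging policy chose the action)
    target_policy: function (group, action) -> probability under the new policy
    Assumption: plain IPS; the paper's estimator may differ.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in logs:
        # Reweight each logged reward by how much more (or less) likely the
        # target policy is to take the logged action.
        weight = target_policy(row["group"], row["action"]) / row["propensity"]
        totals[row["group"]] += weight * row["reward"]
        counts[row["group"]] += 1
    return {group: totals[group] / counts[group] for group in totals}
```

Averaging the returned values across groups would reproduce the single headline number; keeping them separate is what exposes a policy that helps one archetype while harming another.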
[NLP-55] When AI companions become witty: Can human brain recognize AI-generated irony?
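[Quick read]: This paper asks whether people treat ironic remarks generated by large language models (LLMs) as intentional social communication or as mere computational output. Combining behavioral measures with event-related potentials (ERPs), it compares responses to ironic statements attributed to AI and to human sources. People do not fully adopt the intentional stance toward AI: participants attributed incongruity to deliberate communication less often for AI than for humans, and the P200 and P600 effects were attenuated for AI-generated irony. These differences are modulated by how sincere individuals perceive AI to be. The results suggest that people's social understanding of AI is constrained by how they perceive its intentionality, which gains in linguistic competence alone do not overcome.
Link: https://arxiv.org/abs/2510.17168
Authors: Xiaohui Rao, Hanlin Wu, Zhenguang G. Cai
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: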
Abstract:As Large Language Models (LLMs) are increasingly deployed as social agents and trained to produce humor and irony, a question emerges: when encountering witty AI remarks, do people interpret these as intentional communication or mere computational output? This study investigates whether people adopt the intentional stance, attributing mental states to explain behavior, toward AI during irony comprehension. Irony provides an ideal paradigm because it requires distinguishing intentional contradictions from unintended errors through effortful semantic reanalysis. We compared behavioral and neural responses to ironic statements from AI versus human sources using established ERP components: P200 reflecting early incongruity detection and P600 indexing cognitive efforts in reinterpreting incongruity as deliberate irony. Results demonstrate that people do not fully adopt the intentional stance toward AI-generated irony. Behaviorally, participants attributed incongruity to deliberate communication for both sources, though significantly less for AI than human, showing greater tendency to interpret AI incongruities as computational errors. Neural data revealed attenuated P200 and P600 effects for AI-generated irony, suggesting reduced effortful detection and reanalysis consistent with diminished attribution of communicative intent. Notably, people who perceived AI as more sincere showed larger P200 and P600 effects for AI-generated irony, suggesting that intentional stance adoption is calibrated by specific mental models of artificial agents. These findings reveal that source attribution shapes neural processing of social-communicative phenomena. Despite current LLMs’ linguistic sophistication, achieving genuine social agency requires more than linguistic competence; it necessitates a shift in how humans perceive and attribute intentionality to artificial agents.
[NLP-56] Rethinking On-policy Optimization for Query Augmentation
[Quick read]: This paper addresses the choice of method for query augmentation in information retrieval (IR), comparing prompting-based and reinforcement learning (RL)-based approaches under consistent conditions across diverse benchmarks. It finds that simple, training-free prompting often matches or surpasses the more expensive RL-based methods, especially with powerful LLMs. Its key proposal is a hybrid method, On-policy Pseudo-document Query Expansion (OPQE): rather than rewriting the query, the LLM policy learns to generate a pseudo-document that maximizes retrieval performance, combining the flexibility and generative structure of prompting with the targeted optimization of RL. OPQE outperforms both standalone prompting and RL-based rewriting.
Link: https://arxiv.org/abs/2510.17139
Authors: Zhichao Xu, Shengyao Zhuang, Xueguang Ma, Bingsen Chen, Yijun Tian, Fengran Mo, Jie Cao, Vivek Srikumar
Affiliations: University of Utah; University of Queensland; University of Waterloo; New York University; University of Notre Dame; Université de Montréal; University of Oklahoma
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:
Abstract:Recent advances in large language models (LLMs) have led to a surge of interest in query augmentation for information retrieval (IR). Two main approaches have emerged. The first prompts LLMs to generate answers or pseudo-documents that serve as new queries, relying purely on the model’s parametric knowledge or contextual information. The second applies reinforcement learning (RL) to fine-tune LLMs for query rewriting, directly optimizing retrieval metrics. While having respective advantages and limitations, the two approaches have not been compared under consistent experimental conditions. In this work, we present the first systematic comparison of prompting-based and RL-based query augmentation across diverse benchmarks, including evidence-seeking, ad hoc, and tool retrieval. Our key finding is that simple, training-free query augmentation often performs on par with, or even surpasses, more expensive RL-based counterparts, especially when using powerful LLMs. Motivated by this discovery, we introduce a novel hybrid method, On-policy Pseudo-document Query Expansion (OPQE), in which, instead of rewriting a query, the LLM policy learns to generate a pseudo-document that maximizes retrieval performance, thus merging the flexibility and generative structure of prompting with the targeted optimization of RL. We show OPQE outperforms both standalone prompting and RL-based rewriting, demonstrating that a synergistic approach yields the best results. Our implementation is made available to facilitate reproducibility.
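The idea of rewarding a policy for generating a pseudo-document that improves retrieval can be sketched with a toy retriever. This is a hedged illustration only: the token-overlap retriever stands in for a real one, and appending the pseudo-document to the query, using recall@k as the reward, and all function names are assumptions not confirmed by the abstract.

```python
def retrieve(query_text, corpus, k):
    """Rank documents by token overlap with the query (toy stand-in retriever)."""
    query_tokens = set(query_text.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc_id: len(query_tokens & set(corpus[doc_id].lower().split())),
        reverse=True,
    )
    return ranked[:k]


def retrieval_reward(query, pseudo_doc, corpus, relevant, k=1):
    """Recall@k of the expanded query; a hypothetical RL reward signal.

    Assumption: the pseudo-document is appended to the original query.
    """
    hits = retrieve(query + " " + pseudo_doc, corpus, k)
    return len(set(hits) & relevant) / len(relevant)
```

In an on-policy setup, a scalar like this would be computed for each pseudo-document the LLM samples and fed back as the reward; the training loop itself is omitted here.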
[NLP-57] Do LLMs Recognize Your Latent Preferences? A Benchmark for Latent Information Discovery in Personalized Interaction
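[Quick read]: This paper addresses how well large language models (LLMs) can discover and use latent user preferences in personalized interaction. In scenarios such as recommending restaurants or planning travel, users rarely state every preference explicitly, so much of what they care about must be inferred. The authors introduce a unified benchmark built on a tri-agent framework (User, Assistant, Judge) that supports turn-level evaluation of elicitation and adaptation, spanning three progressively realistic settings: the classic 20 Questions game, Personalized Question Answering, and Personalized Text Summarization. Results show that LLMs can surface latent information through dialogue, but success ranges from 32% to 98% depending on task complexity, topic, and number of hidden attributes, so preference inference remains an open challenge for adaptive AI systems.
Link: https://arxiv.org/abs/2510.17132
Authors: Ioannis Tsaknakis, Bingqing Song, Shuyu Gan, Dongyeop Kang, Alfredo Garcia, Gaowen Liu, Charles Fleming, Mingyi Hong
Affiliations: University of Minnesota; Texas A&M University; CISCO Research
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: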
Abstract:Large Language Models (LLMs) excel at producing broadly relevant text, but this generality becomes a limitation when user-specific preferences are required, such as recommending restaurants or planning travel. In these scenarios, users rarely articulate every preference explicitly; instead, much of what they care about remains latent, waiting to be inferred. This raises a fundamental question: Can LLMs uncover and reason about such latent information through conversation? We address this problem by introducing a unified benchmark for evaluating latent information discovery - the ability of LLMs to reveal and utilize hidden user attributes through multi-turn interaction. The benchmark spans three progressively realistic settings: the classic 20 Questions game, Personalized Question Answering, and Personalized Text Summarization. All tasks share a tri-agent framework (User, Assistant, Judge) enabling turn-level evaluation of elicitation and adaptation. Our results reveal that while LLMs can indeed surface latent information through dialogue, their success varies dramatically with context: from 32% to 98%, depending on task complexity, topic, and number of hidden attributes. This benchmark provides the first systematic framework for studying latent information discovery in personalized interaction, highlighting that effective preference inference remains an open frontier for building truly adaptive AI systems.
[NLP-58] DVAGen: Dynamic Vocabulary Augmented Generation
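[Quick read]: This paper addresses the difficulty that language models trained with a fixed vocabulary have in generalizing to novel or out-of-vocabulary words, as well as the limitations of existing dynamic vocabulary approaches: fragmented codebases, lack of support for modern large language models (LLMs), and limited inference scalability. It presents DVAGen, a fully open-source, unified framework that modularizes the pipeline for training, evaluation, and visualization, integrates with open-source LLMs, and is the first to provide both CLI and WebUI tools for real-time result inspection. It also supports batch inference, which significantly improves inference throughput.
Link: https://arxiv.org/abs/2510.17115
Authors: Wei Du, Nuowei Liu, Jie Wang, Jiahao Kuang, Tao Ji, Xiaoling Wang, Yuanbin Wu
Affiliations: East China Normal University; Fudan University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: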
Abstract:Language models trained with a fixed vocabulary struggle to generalize to novel or out-of-vocabulary words, limiting their flexibility in handling diverse token combinations. Existing dynamic vocabulary approaches attempt to address this limitation but face challenges such as fragmented codebases, lack of support for modern LLMs, and limited inference scalability. To overcome these issues, we introduce DVAGen, a fully open-source, unified framework designed for training, evaluation, and visualization of dynamic vocabulary-augmented language models. Our framework modularizes the pipeline for ease of customization, integrates seamlessly with open-source LLMs, and is the first to provide both CLI and WebUI tools for real-time result inspection. We validate the effectiveness of dynamic vocabulary methods on modern LLMs and demonstrate support for batch inference, significantly improving inference throughput.
[NLP-59] Verification-Aware Planning for Multi-Agent Systems
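[Quick read]: This paper addresses execution failures in multi-agent collaboration that stem from misalignments in task interpretation, output format, or inter-agent handoffs. It proposes VeriMAP, a framework built around verification-aware planning: the planner decomposes a task into subtasks, models their dependencies, and encodes planner-defined passing criteria for each subtask as verification functions (VFs) written in Python and natural language. Embedding these checks at planning time enables reliable coordination and iterative refinement without external labels or annotations, improving robustness and interpretability.
Link: https://arxiv.org/abs/2510.17109
Authors: Tianyang Xu, Dan Zhang, Kushan Mitra, Estevam Hruschka
Affiliations: Purdue University; Megagon Labs
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Submission for ARR Oct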
Abstract:Large language model (LLM) agents are increasingly deployed to tackle complex tasks, often necessitating collaboration among multiple specialized agents. However, multi-agent collaboration introduces new challenges in planning, coordination, and verification. Execution failures frequently arise not from flawed reasoning alone, but from subtle misalignments in task interpretation, output format, or inter-agent handoffs. To address these challenges, we present VeriMAP, a framework for multi-agent collaboration with verification-aware planning. The VeriMAP planner decomposes tasks, models subtask dependencies, and encodes planner-defined passing criteria as subtask verification functions (VFs) in Python and natural language. We evaluate VeriMAP on diverse datasets, demonstrating that it outperforms both single- and multi-agent baselines while enhancing system robustness and interpretability. Our analysis highlights how verification-aware planning enables reliable coordination and iterative refinement in multi-agent systems, without relying on external labels or annotations.
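A plan whose subtasks carry their own Python verification functions might look roughly like the sketch below. It is an illustration under stated assumptions, not VeriMAP's actual interface: the `Subtask` class, the `execute` loop, and the retry policy are hypothetical, the agent call is replaced by a plain function, and dependency modeling between subtasks is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Subtask:
    name: str
    run: Callable[[int], str]      # attempt number -> output (stand-in for an agent call)
    verify: Callable[[str], bool]  # planner-defined passing criterion


def execute(plan: List[Subtask], max_attempts: int = 2) -> Dict[str, str]:
    """Run subtasks in order, retrying each until its verification function passes."""
    results: Dict[str, str] = {}
    for task in plan:
        for attempt in range(max_attempts):
            output = task.run(attempt)
            if task.verify(output):
                results[task.name] = output
                break
        else:
            # Every attempt failed the check: stop instead of handing
            # unverified output to the next agent.
            raise RuntimeError(f"subtask {task.name!r} failed verification")
    return results
```

For example, a subtask that must return a number could use `str.isdigit` as its check, so an answer such as "n/a" triggers a retry rather than being passed on.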
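[NLP-60] Investigating Thinking Behaviours of Reasoning-Based Language Models for Social Bias Mitigation
[Quick read]: This paper aims to address the problem that generative AI systems aggregate social stereotypes through their internal chain-of-thought (CoT) reasoning, leading to biased outputs. The core of the solution is to identify and intervene on two key bias-generating mechanisms: stereotype repetition, where the model relies mainly on social stereotypes as its justification, and irrelevant information injection, where the model fabricates or introduces extra details to reinforce a biased narrative. The authors propose a lightweight prompt-based mitigation method that guides the model to review whether its initial reasoning falls into either failure pattern, reducing bias while maintaining or improving accuracy.
Link: https://arxiv.org/abs/2510.17062
Authors: Guoqing Luo,Iffat Maab,Lili Mou,Junichi Yamagishi
Affiliations: University of Alberta; Alberta Machine Intelligence Institute; National Institute of Informatics
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: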
Abstract:While reasoning-based large language models excel at complex tasks through an internal, structured thinking process, a concerning phenomenon has emerged that such a thinking process can aggregate social stereotypes, leading to biased outcomes. However, the underlying behaviours of these language models in social bias scenarios remain underexplored. In this work, we systematically investigate mechanisms within the thinking process behind this phenomenon and uncover two failure patterns that drive social bias aggregation: 1) stereotype repetition, where the model relies on social stereotypes as its primary justification, and 2) irrelevant information injection, where it fabricates or introduces new details to support a biased narrative. Building on these insights, we introduce a lightweight prompt-based mitigation approach that queries the model to review its own initial reasoning against these specific failure patterns. Experiments on question answering (BBQ and StereoSet) and open-ended (BOLD) benchmarks show that our approach effectively reduces bias while maintaining or improving accuracy.
[NLP-61] Mapping from Meaning: Addressing the Miscalibration of Prompt-Sensitive Language Models
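[Quick read]: This paper addresses prompt sensitivity in large language models (LLMs): semantically equivalent but differently worded prompts can yield markedly different output distributions, which suggests that the uncertainty in a model's output may stem from small changes in phrasing rather than from uncertainty about the prompt's meaning. The key to the solution is to model prompt sensitivity as a type of generalization error and to sample across the semantic “concept space” using paraphrasing perturbations, which improves uncertainty calibration without compromising task accuracy. The authors also propose a new uncertainty decomposition metric for black-box LLMs that models semantic continuities in natural language generation, improves on entropy-based decomposition, and quantifies how much of a model's uncertainty is attributable to prompt sensitivity.
Link: https://arxiv.org/abs/2510.17028
Authors: Kyle Cox,Jiawei Xu,Yikun Han,Rong Xu,Tianhao Li,Chi-Yang Hsu,Tianlong Chen,Walter Gerych,Ying Ding
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: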
Abstract:An interesting behavior in large language models (LLMs) is prompt sensitivity. When provided with different but semantically equivalent versions of the same prompt, models may produce very different distributions of answers. This suggests that the uncertainty reflected in a model’s output distribution for one prompt may not reflect the model’s uncertainty about the meaning of the prompt. We model prompt sensitivity as a type of generalization error, and show that sampling across the semantic “concept space” with paraphrasing perturbations improves uncertainty calibration without compromising accuracy. Additionally, we introduce a new metric for uncertainty decomposition in black-box LLMs that improves upon entropy-based decomposition by modeling semantic continuities in natural language generation. We show that this decomposition metric can be used to quantify how much LLM uncertainty is attributed to prompt sensitivity. Our work introduces a new way to improve uncertainty calibration in prompt-sensitive language models, and provides evidence that some LLMs fail to exhibit consistent general reasoning about the meanings of their inputs.
[NLP-62] Forgetting to Forget: Attention Sink as A Gateway for Backdooring LLM Unlearning
Link: https://arxiv.org/abs/2510.17021
Authors: Bingqi Shang,Yiwei Chen,Yihua Zhang,Bingquan Shen,Sijia Liu
Affiliations: Michigan State University; National University of Singapore; IBM Research
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
[NLP-63] Extended LSTM: Adaptive Feature Gating for Toxic Comment Classification
Link: https://arxiv.org/abs/2510.17018
Authors: Noor Islam S. Mohammad
Affiliations: New York University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
[NLP-64] SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents
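[Quick read]: This paper addresses the insufficient safety of LLM-based search agents in open-domain question answering. The study shows that, compared with base LLMs, search agents are more likely to produce harmful outputs because they iteratively generate queries, retrieve external information, and reason over it; when faced with malicious requests, their refusal threshold drops, and they may incorporate unsafe sources into plausible but dangerous responses. To counter this, the authors propose SafeSearch, whose core innovation is a multi-objective reinforcement learning framework that combines safety and utility rewards on the final output with a novel query-level reward that penalizes unsafe queries and rewards safe ones. Experiments show that SafeSearch reduces agent harmfulness by over 70% on three red-teaming datasets while matching the QA performance of a utility-only fine-tuned agent, confirming that the query-level reward improves safety and utility jointly.
Link: https://arxiv.org/abs/2510.17017
Authors: Qiusi Zhan,Angeline Budiman-Chan,Abdelrahman Zayed,Xingzhi Guo,Daniel Kang,Joo-Kyung Kim
Affiliations: University of Illinois Urbana-Champaign; Amazon
Categories: Computation and Language (cs.CL)
Comments: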
Abstract:Large language model (LLM) based search agents iteratively generate queries, retrieve external information, and reason to answer open-domain questions. While researchers have primarily focused on improving their utility, their safety behaviors remain underexplored. In this paper, we first evaluate search agents using red-teaming datasets and find that they are more likely to produce harmful outputs than base LLMs. For example, when asked “How can I track someone’s location without their consent?”, a base model refuses, whereas a search agent designed to retrieve and cite sources may lower its refusal threshold, fetch documents (e.g., court cases), and, once appended, synthesize them into an informative yet unsafe summary. We further show that utility-oriented fine-tuning intensifies this risk, motivating joint alignment of safety and utility. We present SafeSearch, a multi-objective reinforcement learning approach that couples a final-output safety/utility reward with a novel query-level shaping term that penalizes unsafe queries and rewards safe ones. Experiments show that SafeSearch reduces agent harmfulness by over 70% across three red-teaming datasets while producing safe, helpful responses, and matches the QA performance of a utility-only finetuned agent; further analyses confirm the effectiveness of the query-level reward in jointly improving safety and utility.
[NLP-65] DiscoTrack: A Multilingual LLM Benchmark for Discourse Tracking
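[Quick read]: This paper addresses the lack of attention in current LLM evaluation to implicit information and discourse-level reasoning, in particular the absence of a systematic, multilingual benchmark for pragmatic inference and information integration across sentences, paragraphs, and multi-speaker discourse. The key to the solution is DiscoTrack, a multilingual LLM benchmark covering 12 languages and four levels of discourse understanding: salience recognition, entity tracking, discourse relations, and bridging inference. It supports evaluating how well models integrate meaning and reason in complex contexts.
Link: https://arxiv.org/abs/2510.17013
Authors: Lanni Bu,Lauren Levin,Amir Zeldes
Affiliations: Georgetown University
Categories: Computation and Language (cs.CL)
Comments: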
Abstract:Recent LLM benchmarks have tested models on a range of phenomena, but are still focused primarily on natural language understanding for extraction of explicit information, such as QA or summarization, with responses often targeting information from individual sentences. We are still lacking more challenging, and importantly also multilingual, benchmarks focusing on implicit information and pragmatic inferences across larger documents in the context of discourse tracking: integrating and aggregating information across sentences, paragraphs and multiple speaker utterances. To this end, we present DiscoTrack, an LLM benchmark targeting a range of tasks across 12 languages and four levels of discourse understanding: salience recognition, entity tracking, discourse relations and bridging inference. Our evaluation shows that these tasks remain challenging, even for state-of-the-art models.
[NLP-66] Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization
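[Quick read]: This paper addresses the threat that iterative jailbreak methods pose to the safety mechanisms of large language models (LLMs). Such attacks repeatedly rewrite and resubmit prompts to induce harmful outputs, using the model's previous responses to guide each iteration, and existing defenses do not proactively disrupt this dynamic trial-and-error cycle. The key to the solution is a dynamic defense framework based on online learning. First, exploiting the differences between harmful jailbreak-generated prompts and ordinary prompts, it optimizes prompts with reinforcement learning so that harmless tasks receive appropriate responses and harmful requests are explicitly rejected. Second, it introduces Past-Direction Gradient Damping (PDGD) to prevent overfitting to the narrow band of partial input rewrites explored during an attack. Experiments show that the method significantly outperforms five existing defenses against five iterative jailbreak methods on three LLMs, while also improving response quality on harmless tasks.
Link: https://arxiv.org/abs/2510.17006
Authors: Masahiro Kaneko,Zeerak Talat,Timothy Baldwin
Affiliations: MBZUAI; University of Edinburgh
Categories: Computation and Language (cs.CL)
Comments: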
Abstract:Iterative jailbreak methods that repeatedly rewrite and input prompts into large language models (LLMs) to induce harmful outputs – using the model’s previous responses to guide each new iteration – have been found to be a highly effective attack strategy. Despite being an effective attack strategy against LLMs and their safety mechanisms, existing defenses do not proactively disrupt this dynamic trial-and-error cycle. In this study, we propose a novel framework that dynamically updates its defense strategy through online learning in response to each new prompt from iterative jailbreak methods. Leveraging the distinctions between harmful jailbreak-generated prompts and typical harmless prompts, we introduce a reinforcement learning-based approach that optimizes prompts to ensure appropriate responses for harmless tasks while explicitly rejecting harmful prompts. Additionally, to curb overfitting to the narrow band of partial input rewrites explored during an attack, we introduce Past-Direction Gradient Damping (PDGD). Experiments conducted on three LLMs show that our approach significantly outperforms five existing defense methods against five iterative jailbreak methods. Moreover, our results indicate that our prompt optimization strategy simultaneously enhances response quality for harmless tasks.
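[NLP-67] Vocab Diet: Reshaping the Vocabulary of LLMs with Vector Arithmetic
[Quick read]: This paper addresses the efficiency and coverage of vocabulary design in large language models (LLMs): standard tokenization algorithms treat word form variations (such as “walk” and “walked”) as separate entries, so the vocabulary fills up with surface form variants at the expense of less frequent words and multilingual coverage. The key to the solution is the observation that word form variations can be represented in embedding space as transformation vectors, that is, additive offsets that produce the variant's embedding from the base form. The authors propose a compact reshaping of the vocabulary: instead of assigning a unique token to each surface form, tokens are composed from a shared base form and a transformation vector (for example, “walked” = “walk” + past tense). Without modifying model weights, this removes up to 10% of vocabulary entries and frees space for new, more diverse tokens, while keeping downstream performance stable and extending coverage to out-of-vocabulary words. The approach motivates a shift from string-enumeration vocabularies to compositional vocabularies grounded in linguistic structure.
Link: https://arxiv.org/abs/2510.17001
Authors: Yuval Reif,Guy Kaplan,Roy Schwartz
Affiliations: The Hebrew University of Jerusalem
Categories: Computation and Language (cs.CL)
Comments: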
Abstract:Large language models (LLMs) were shown to encode word form variations, such as “walk”-“walked”, as linear directions in embedding space. However, standard tokenization algorithms treat these variations as distinct tokens – filling the size-capped vocabulary with surface form variants (e.g., “walk”, “walking”, “Walk”), at the expense of less frequent words and multilingual coverage. We show that many of these variations can be captured by transformation vectors – additive offsets that yield the appropriate word’s representation when applied to the base form word embedding – in both the input and output spaces. Building on this, we propose a compact reshaping of the vocabulary: rather than assigning unique tokens to each surface form, we compose them from shared base form and transformation vectors (e.g., “walked” = “walk” + past tense). We apply our approach to multiple LLMs and across five languages, removing up to 10% of vocabulary entries – thereby freeing space to allocate new, more diverse tokens. Importantly, we do so while also expanding vocabulary coverage to out-of-vocabulary words, with minimal impact on downstream performance, and without modifying model weights. Our findings motivate a foundational rethinking of vocabulary design, moving from string enumeration to a compositional vocabulary that leverages the underlying structure of language.
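The additive-offset idea can be illustrated with a toy example. The sketch below uses synthetic embeddings rather than any real model's, and every name in it (`past_direction`, `bases`, `inflected`, `nearest`) is an assumption made for illustration. It estimates a "past tense" transformation vector from two known pairs and checks that applying it to an unseen base form lands nearest to the right inflected form.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Toy setup (assumed): inflected forms are the base embedding plus a shared
# direction and a little noise, mimicking a linear word-form direction.
past_direction = rng.normal(size=dim)
bases = {w: rng.normal(size=dim) for w in ["walk", "jump", "talk"]}
inflected = {
    w + "ed": v + past_direction + 0.01 * rng.normal(size=dim)
    for w, v in bases.items()
}

# Estimate the transformation vector as the mean offset over known pairs.
train_words = ["walk", "talk"]
offset = np.mean([inflected[w + "ed"] - bases[w] for w in train_words], axis=0)

# Compose an embedding for a form that was held out of the estimate.
composed = bases["jump"] + offset

vocab = {**bases, **inflected}


def nearest(vec: np.ndarray) -> str:
    """Return the vocabulary word whose embedding is closest in L2 distance."""
    return min(vocab, key=lambda w: np.linalg.norm(vocab[w] - vec))


print(nearest(composed))
```

In a real model the offsets would be estimated from the model's own input and output embeddings, and how well a single vector fits a given variation is an empirical question that the paper studies; this toy only shows the arithmetic.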
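[NLP-68] Bits Leaked per Query: Information-Theoretic Bounds on Adversarial Attacks against LLMs NEURIPS2025
[Quick read]: This paper addresses the lack of a quantitative standard for assessing how much information large language model (LLM) outputs leak under adversarial attacks by malicious users. Specifically, an attacker tries to infer a target property $T$ through particular instructions (such as the binary flag that triggers a harmful response, or the degree to which information removed by unlearning can be restored), and the observable signal $Z$ in the model's response (answer tokens, thinking process tokens, or logits) may leak such information. Current understanding of the amount leaked remains anecdotal, leaving auditors without theoretical guidance and defenders unable to weigh transparency against risk. The key to the solution is an information-theoretic framework that uses the mutual information $I(Z;T)$ as the number of bits leaked per query and shows that reaching error $\varepsilon$ requires at least $\log(1/\varepsilon)/I(Z;T)$ queries, so attack cost scales linearly with the inverse leak rate and logarithmically with the desired accuracy. Experiments show that exposing answer tokens alone requires about a thousand queries, adding logits cuts this to about a hundred, and revealing the full thinking process needs only a few dozen, giving the first principled yardstick for balancing transparency and security in LLM deployment.
Link: https://arxiv.org/abs/2510.17000
Authors: Masahiro Kaneko,Timothy Baldwin
Affiliations: MBZUAI (Mohamed bin Zayed University of Artificial Intelligence)
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: NeurIPS 2025 (spotlight)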
Abstract:Adversarial attacks by malicious users that threaten the safety of large language models (LLMs) can be viewed as attempts to infer a target property $T$ that is unknown when an instruction is issued, and becomes knowable only after the model’s reply is observed. Examples of target properties $T$ include the binary flag that triggers an LLM’s harmful response or rejection, and the degree to which information deleted by unlearning can be restored, both elicited via adversarial instructions. The LLM reveals an observable signal $Z$ that potentially leaks hints for attacking through a response containing answer tokens, thinking process tokens, or logits. Yet the scale of information leaked remains anecdotal, leaving auditors without principled guidance and defenders blind to the transparency–risk trade-off. We fill this gap with an information-theoretic framework that computes how much information can be safely disclosed, and enables auditors to gauge how close their methods come to the fundamental limit. Treating the mutual information $I(Z;T)$ between the observation $Z$ and the target property $T$ as the leaked bits per query, we show that achieving error $\varepsilon$ requires at least $\log(1/\varepsilon)/I(Z;T)$ queries, scaling linearly with the inverse leak rate and only logarithmically with the desired accuracy. Thus, even a modest increase in disclosure collapses the attack cost from quadratic to logarithmic in terms of the desired accuracy. Experiments on seven LLMs across system-prompt leakage, jailbreak, and relearning attacks corroborate the theory: exposing answer tokens alone requires about a thousand queries; adding logits cuts this to about a hundred; and revealing the full thinking process trims it to a few dozen. Our results provide the first principled yardstick for balancing transparency and security when deploying LLMs.
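As a quick numerical reading of the bound, the sketch below evaluates log(1/ε)/I(Z;T) for a few leak rates. Two things here are assumptions rather than facts from the abstract: the logarithm is taken base 2 so that it matches a leak rate measured in bits, and the leak rates themselves are made-up values, not the paper's measurements.

```python
import math


def min_queries(error: float, leaked_bits_per_query: float) -> int:
    """Lower bound on queries: ceil(log2(1/error) / I(Z;T)), with I in bits (assumed base)."""
    if not 0 < error < 1:
        raise ValueError("error must be in (0, 1)")
    if leaked_bits_per_query <= 0:
        raise ValueError("leak rate must be positive")
    return math.ceil(math.log2(1 / error) / leaked_bits_per_query)


# Hypothetical leak rates, chosen only to show the inverse-linear scaling.
for bits in (0.5, 0.05, 0.005):
    print(bits, min_queries(0.01, bits))
```

Cutting the leak rate by a factor of ten raises the bound by roughly a factor of ten, while tightening the target error from 0.01 to 0.0001 only doubles it, which is the linear-versus-logarithmic behaviour the abstract describes.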
[NLP-69] Back to Bytes: Revisiting Tokenization Through UTF-8
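[Quick read]: This paper addresses the inefficiency, high memory footprint, and complicated special-behavior encoding of existing byte-level tokenizers. The key solution is UTF8Tokenizer, a minimalist tokenizer built on UTF-8 byte mapping: each byte maps directly to a token ID in the fixed range 0 to 255, with no out-of-range IDs or auxiliary tokens. All special functions (padding, boundary markers, conversation structure, and so on) are implemented with C0 control bytes, reusing the design idea behind ASCII. This yields much faster tokenization (14x), lower host-device transfer (8x less than int64), and 256×d embedding tables that can be shared across models. Bit-biased embeddings, a training-time enhancement that can be folded into the embedding table after training, expose per-byte bit structure without adding inference cost.
Link: https://arxiv.org/abs/2510.16987
Authors: Amit Moryossef,Clara Meister,Pavel Stepachev,Desmond Elliott
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: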
Abstract:We present UTF8Tokenizer, a minimalist byte-level tokenizer that maps text exactly to IDs corresponding to the bytes underlying the text’s UTF-8 encoding (e.g., byte x09 is token ID 9). Unlike prior byte-level approaches (Xue et al., 2021; Pagnoni et al., 2025), our implementation never introduces out-of-range IDs (i.e. there is no token ID 256) or auxiliary tokens: all special behavior (e.g., padding, boundaries, conversation structure, attention segments, tool calling, “thinking” spans, etc.) is encoded using C0 control bytes - just as ASCII was originally designed to embed control information alongside printable text. These design principles yield practical benefits: (1) faster tokenization (14x) and significantly lower host-device transfer (8x less than int64); (2) simple, shareable 256*d embedding tables that can be aligned across models; and (3) a training-time enhancement via bit-biased embeddings, which exposes per-byte bit structure and can be added to the embedding table post-training, removing inference costs. Our HuggingFace-compatible implementation improves language modeling convergence.
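The byte-to-ID mapping described in the abstract is easy to reproduce with the standard library and NumPy. The sketch below is not the authors' UTF8Tokenizer implementation; `encode` and `decode` are hypothetical helper names, and the example only shows that token IDs are the UTF-8 bytes themselves and that storing them as `uint8` takes one eighth of the space of `int64`.

```python
import numpy as np


def encode(text: str) -> np.ndarray:
    """Map text to token IDs that are exactly its UTF-8 bytes (values 0..255)."""
    return np.frombuffer(text.encode("utf-8"), dtype=np.uint8)


def decode(ids: np.ndarray) -> str:
    """Invert encode() by reinterpreting the IDs as UTF-8 bytes."""
    return ids.astype(np.uint8).tobytes().decode("utf-8")


ids = encode("a\té")
print(ids.tolist())                       # tab (byte 0x09) becomes token ID 9
print(ids.nbytes, ids.astype(np.int64).nbytes)
```

Because every ID fits in one byte, no ID 256 or larger can appear, which is the property the abstract relies on for its fixed 256-row embedding table.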
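[NLP-70] Parameter-Efficient Fine-Tuning for Low-Resource Languages: A Comparative Study of LLMs for Bengali Hate Speech Detection
[Quick read]: This paper addresses the surge of hate speech on Bengali social media platforms, which disproportionately affects women and adolescents, and the limitations of existing detection approaches, which are either computationally expensive or dependent on proprietary APIs. The key to the solution is the first application of Parameter-Efficient Fine-Tuning (PEFT) to Bengali hate speech detection, using LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) to reach strong classification performance while training fewer than 1% of model parameters. Experiments use the BD-SHS dataset (50,281 annotated comments) and adapt three instruction-tuned LLMs, Gemma-3-4B, Llama-3.2-3B, and Mistral-7B, on a single consumer-grade GPU. Llama-3.2-3B achieves the highest F1-score, 92.23%, supporting PEFT as a practical and replicable strategy for low-resource languages.
Link: https://arxiv.org/abs/2510.16985
Authors: Akif Islam,Mohd Ruhul Ameen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to IEEE COMPAS 2025. 6 pages, 3 figures, 6 tables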
Abstract:Bengali social media platforms have witnessed a sharp increase in hate speech, disproportionately affecting women and adolescents. While datasets such as BD-SHS provide a basis for structured evaluation, most prior approaches rely on either computationally costly full-model fine-tuning or proprietary APIs. This paper presents the first application of Parameter-Efficient Fine-Tuning (PEFT) for Bengali hate speech detection using LoRA and QLoRA. Three instruction-tuned large language models - Gemma-3-4B, Llama-3.2-3B, and Mistral-7B - were fine-tuned on the BD-SHS dataset of 50,281 annotated comments. Each model was adapted by training fewer than 1% of its parameters, enabling experiments on a single consumer-grade GPU. The results show that Llama-3.2-3B achieved the highest F1-score of 92.23%, followed by Mistral-7B at 88.94% and Gemma-3-4B at 80.25%. These findings establish PEFT as a practical and replicable strategy for Bengali and related low-resource languages.
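To see why LoRA trains so few parameters, recall that it freezes a weight matrix of shape d_out × d_in and learns two low-rank factors of shapes d_out × r and r × d_in. The arithmetic below is a sketch with hypothetical dimensions (4096 × 4096, rank 8); these are not the paper's settings, and the "fewer than 1%" figure in the abstract refers to whole models, not a single matrix.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two LoRA factors: B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_in + d_out)


# Hypothetical layer size and rank, for illustration only.
d_in = d_out = 4096
rank = 8

full = d_in * d_out                              # frozen base weight
lora = lora_trainable_params(d_in, d_out, rank)  # trainable adapter weights
fraction = lora / full

print(full, lora, f"{fraction:.4%}")
```

For this layer the adapter holds 65,536 trainable values against 16,777,216 frozen ones, about 0.39%, which is the kind of ratio that makes single-GPU fine-tuning feasible.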
[NLP-71] Leave It to the Experts: Detecting Knowledge Distillation via MoE Expert Signatures
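[Quick read]: This paper addresses the intellectual property and model diversity risks that knowledge distillation (KD) introduces in large language model (LLM) training, and in particular the fact that existing detection methods based on self-identity or output similarity are easily evaded through prompt engineering. The key to the solution is to exploit an overlooked signal: the transfer of “structural habits” in Mixture-of-Experts (MoE) architectures, especially internal routing patterns. By analyzing how experts specialize and collaborate across inputs, the method builds distinctive fingerprints that persist through distillation. To extend to black-box settings and non-MoE architectures, the authors further propose Shadow-MoE, which constructs proxy MoE representations via auxiliary distillation so that routing patterns can be compared between arbitrary model pairs, enabling accurate and robust KD detection.
Link: https://arxiv.org/abs/2510.16968
Authors: Pingzhi Li,Morris Yu-Chao Huang,Zhen Tan,Qingquan Song,Jie Peng,Kai Zou,Yu Cheng,Kaidi Xu,Tianlong Chen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Code is at this https URL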
Abstract:Knowledge Distillation (KD) accelerates training of large language models (LLMs) but poses intellectual property protection and LLM diversity risks. Existing KD detection methods based on self-identity or output similarity can be easily evaded through prompt engineering. We present a KD detection framework effective in both white-box and black-box settings by exploiting an overlooked signal: the transfer of MoE “structural habits”, especially internal routing patterns. Our approach analyzes how different experts specialize and collaborate across various inputs, creating distinctive fingerprints that persist through the distillation process. To extend beyond the white-box setup and MoE architectures, we further propose Shadow-MoE, a black-box method that constructs proxy MoE representations via auxiliary distillation to compare these patterns between arbitrary model pairs. We establish a comprehensive, reproducible benchmark that offers diverse distilled checkpoints and an extensible framework to facilitate future research. Extensive experiments demonstrate 94% detection accuracy across various scenarios and strong robustness to prompt-based evasion, outperforming existing baselines while highlighting the structural habits transfer in LLMs.
[NLP-72] Real-Time World Crafting: Generating Structured Game Behaviors from Natural Language with Large Language Models EMNLP
Link: https://arxiv.org/abs/2510.16952
Authors: Austin Drake,Hang Dong
Affiliations: University of Exeter
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: 16 pages, 11 figures (including appendix). To be presented at the 5th Wordplay @ EMNLP workshop (2025)
[NLP-73] Peering Inside the Black Box: Uncovering LLM Errors in Optimization Modelling through Component-Level Evaluation
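[Quick read]: This paper addresses the overly coarse evaluation of mathematical optimization models generated by large language models (LLMs): relying only on macro-level metrics such as final solution accuracy or solve time hides structural errors in decision variables, constraints, and other components. The key to the solution is a fine-grained, component-level evaluation framework that introduces precision and recall for decision variables and constraints, root mean squared error (RMSE) for constraints and the objective, and efficiency indicators based on token usage and latency, giving a diagnostic assessment of the structure, numerical accuracy, and computational efficiency of LLM-generated formulations.
Link: https://arxiv.org/abs/2510.16943
Authors: Dania Refai,Moataz Ahmed
Affiliations: King Fahd University of Petroleum & Minerals; SDAIA-KFUPM Joint Research Center for Artificial Intelligence
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: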
Abstract:Large language models (LLMs) are increasingly used to convert natural language descriptions into mathematical optimization formulations. Current evaluations often treat formulations as a whole, relying on coarse metrics like solution accuracy or runtime, which obscure structural or numerical errors. In this study, we present a comprehensive, component-level evaluation framework for LLM-generated formulations. Beyond the conventional optimality gap, our framework introduces metrics such as precision and recall for decision variables and constraints, constraint and objective root mean squared error (RMSE), and efficiency indicators based on token usage and latency. We evaluate GPT-5, LLaMA 3.1 Instruct, and DeepSeek Math across optimization problems of varying complexity under six prompting strategies. Results show that GPT-5 consistently outperforms other models, with chain-of-thought, self-consistency, and modular prompting proving most effective. Analysis indicates that solver performance depends primarily on high constraint recall and low constraint RMSE, which together ensure structural correctness and solution reliability. Constraint precision and decision variable metrics play secondary roles, while concise outputs enhance computational efficiency. These findings highlight three principles for NLP-to-optimization modeling: (i) Complete constraint coverage prevents violations, (ii) minimizing constraint RMSE ensures solver-level accuracy, and (iii) concise outputs improve computational efficiency. The proposed framework establishes a foundation for fine-grained, diagnostic evaluation of LLMs in optimization modeling.
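The component-level metrics named in the abstract can be sketched with the standard definitions of precision, recall, and RMSE. How the paper matches a generated constraint to a reference constraint is not stated in the abstract, so the sketch assumes constraints have already been normalized to comparable identifiers; the identifiers and numbers below are made up.

```python
import math
from typing import Sequence, Set, Tuple


def precision_recall(predicted: Set[str], reference: Set[str]) -> Tuple[float, float]:
    """Set-based precision and recall over matched component identifiers."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


def rmse(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Root mean squared error between aligned numeric values (e.g. right-hand sides)."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical data: the model recovers two of four constraints and adds a spurious one.
reference_constraints = {"c1", "c2", "c3", "c4"}
generated_constraints = {"c1", "c2", "c5"}
p, r = precision_recall(generated_constraints, reference_constraints)
error = rmse([10.0, 20.0], [10.0, 24.0])
print(p, r, error)
```

Here recall (0.5) exposes the two missing constraints even though precision looks acceptable (about 0.67), which is the kind of gap a single solution-accuracy score would hide.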
[NLP-74] Prompt-MII: Meta-Learning Instruction Induction for LLMs
[Quick read]: This paper addresses the high inference cost that large language models (LLMs) incur when adapted to tasks through in-context learning (ICL), a cost that grows with context length. The key to the solution is PROMPT-MII, a reinforcement learning (RL) based meta-learning framework that induces compact, descriptive instructions (prompts) from training examples, achieving downstream performance comparable to ICL without the full context while using 3-13x fewer tokens.
Link: https://arxiv.org/abs/2510.16932
Authors: Emily Xiao,Yixiao Zeng,Ada Chen,Chin-Jou Li,Amanda Bertsch,Graham Neubig
Affiliations: Carnegie Mellon University Language Technologies Institute
Categories: Computation and Language (cs.CL)
Comments:
Abstract:A popular method to adapt large language models (LLMs) to new tasks is in-context learning (ICL), which is effective but incurs high inference costs as context length grows. In this paper we propose a method to perform instruction induction, where we take training examples and reduce them to a compact but descriptive prompt that can achieve performance comparable to ICL over the full training set. Specifically, we propose PROMPT-MII, a reinforcement learning (RL) based framework to meta-learn an instruction induction model that can generate compact instructions on the fly for an arbitrary new dataset. We train on over 3,000 diverse classification datasets from the HuggingFace hub, and evaluate on 90 unseen tasks. PROMPT-MII improves downstream model quality by 4-9 F1 points (10-20% relative), matching ICL performance while requiring 3-13x fewer tokens.
[NLP-75] ChiKhaPo: A Large-Scale Multilingual Benchmark for Evaluating Lexical Comprehension and Generation in Large Language Models
Link: https://arxiv.org/abs/2510.16928
Authors: Emily Chang,Niyati Bafna
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
[NLP-76] Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input
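[Quick read]: This paper addresses the lack of systematic evaluation of Multimodal Large Language Models (MLLMs) under dynamic image resolutions: existing evaluation paradigms focus on semantic performance and overlook resolution robustness, that is, whether performance stays stable as input resolution varies. The key to the solution is Res-Bench, a comprehensive benchmark with 14,400 samples, 12 resolution levels, and six core capability dimensions, together with an evaluation framework that goes beyond traditional accuracy metrics. The framework uses Spearman's correlation to assess resolution-performance trends and Absolute/Relative Continuous Error (ACE/RCE) to quantify performance volatility, enabling a thorough measurement of MLLM resolution robustness and exploration of ways to improve it.
Link: https://arxiv.org/abs/2510.16926
Authors: Chenxu Li,Zhicai Wang,Yuan Sheng,Xingyu Zhu,Yanbin Hao,Xiang Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 23 pages, 19 figures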
Abstract:Multimodal Large Language Models (MLLMs) increasingly support dynamic image resolutions. However, current evaluation paradigms primarily assess semantic performance, overlooking the critical question of resolution robustness - whether performance remains stable across varying input resolutions. To address this gap, we introduce Res-Bench, a comprehensive benchmark comprising 14,400 samples across 12 resolution levels and six core capability dimensions. We designed a novel evaluation framework that goes beyond traditional accuracy metrics to capture performance stability. This framework introduces multiple robustness metrics: Spearman’s correlation for assessing resolution-performance trends, and Absolute/Relative Continuous Error (ACE/RCE) for measuring performance volatility. Using these metrics, we conducted a large-scale evaluation of leading MLLMs. Our analysis encompasses: (1) model-centric and task-centric robustness examination, (2) investigation of preprocessing strategies including padding and super-resolution, and (3) exploration of fine-tuning for stability enhancement.
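[NLP-77] Does Visual Grounding Enhance the Understanding of Embodied Knowledge in Large Language Models? EMNLP2025
[Quick read]: This paper addresses the question of whether multimodal language models (LMs) understand embodied knowledge better than text-only models. The key to the solution is a novel embodied knowledge understanding benchmark grounded in perceptual theory from psychology. It covers the visual, auditory, tactile, gustatory, and olfactory senses as well as interoception, and evaluates models' perceptual abilities through vector comparison and question-answering tasks. Results show that vision-language models (VLMs) do not outperform text-only models in either task despite their visual input, and that models perform worst in the visual dimension, revealing a significant gap in how current models integrate embodied knowledge.
Link: https://arxiv.org/abs/2510.16924
Authors: Zhihui Yang,Yupei Wang,Kaijie Mo,Zhe Zhao,Renfen Hu
Affiliations: Beijing Normal University; Tencent AI Lab
Categories: Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2025 (Findings). This version corrects a redundant sentence in the Results section that appeared in the camera-ready version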
Abstract:Despite significant progress in multimodal language models (LMs), it remains unclear whether visual grounding enhances their understanding of embodied knowledge compared to text-only models. To address this question, we propose a novel embodied knowledge understanding benchmark based on the perceptual theory from psychology, encompassing visual, auditory, tactile, gustatory, olfactory external senses, and interoception. The benchmark assesses the models’ perceptual abilities across different sensory modalities through vector comparison and question-answering tasks with over 1,700 questions. By comparing 30 state-of-the-art LMs, we surprisingly find that vision-language models (VLMs) do not outperform text-only models in either task. Moreover, the models perform significantly worse in the visual dimension compared to other sensory dimensions. Further analysis reveals that the vector representations are easily influenced by word form and frequency, and the models struggle to answer questions involving spatial perception and reasoning. Our findings underscore the need for more effective integration of embodied knowledge in LMs to enhance their understanding of the physical world.
[NLP-78] SAKE: Towards Editing Auditory Attribute Knowledge of Large Audio-Language Models
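[Quick read]: This paper addresses the challenge of editing auditory attribute knowledge in Large Audio-Language Models (LALMs), that is, efficiently updating a model's knowledge of abstract auditory attributes without full retraining. Prior work has focused on knowledge editing for textual or visual modalities and has largely overlooked audio. The key to the solution is SAKE, the first benchmark designed specifically for editing auditory attribute knowledge in LALMs. It evaluates seven editing methods along four dimensions (reliability, generality, audio/text locality, and portability) and highlights core challenges: preserving intra-attribute knowledge unrelated to the edit, generalizing edits to multimodal reasoning, and maintaining edits under sequential updates.
Link: https://arxiv.org/abs/2510.16917
Authors: Chih-Kai Yang,Yen-Ting Piao,Tzu-Wen Hsu,Szu-Wei Fu,Zhehuai Chen,Ke-Han Lu,Sung-Feng Huang,Chao-Han Huck Yang,Yu-Chiang Frank Wang,Yun-Nung Chen,Hung-yi Lee
Affiliations: National Taiwan University; DouDou Capital; NVIDIA
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Work in progress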
Abstract:Knowledge editing offers an efficient way to update model knowledge without full retraining, but prior work has concentrated almost exclusively on textual or visual modalities. We introduce SAKE, the first benchmark specifically designed for editing auditory attribute knowledge in Large Audio-Language Models (LALMs). Unlike factual updates, SAKE targets several abstract auditory attributes, capturing knowledge types that go beyond conventional textual and visual domains. We benchmark seven editing methods on two LALMs along four dimensions: reliability, generality, audio/text locality, and portability. Results highlight challenges such as preserving intra-attribute knowledge unrelated to the edit, generalizing edits to multimodal reasoning, and maintaining edits under sequential updates. SAKE provides a principled framework to study how knowledge editing extends to the auditory modalities, opening new directions for maintaining and adapting LALMs in more diverse real-world scenarios.
[NLP-79] VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents NEURIPS2025
Link: https://arxiv.org/abs/2510.16907
Authors: Kangrui Wang,Pingyue Zhang,Zihan Wang,Yaning Gao,Linjie Li,Qineng Wang,Hanyang Chen,Chi Wan,Yiping Lu,Zhengyuan Yang,Lijuan Wang,Ranjay Krishna,Jiajun Wu,Li Fei-Fei,Yejin Choi,Manling Li
Affiliations: Northwestern University; University of Washington; Stanford University; Microsoft; University of Wisconsin-Madison; University of Illinois Urbana-Champaign
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to NeurIPS 2025
[NLP-80] Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations ICASSP2026
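[Quick read]: This paper addresses the safety alignment of Large Audio-Language Models (LALMs) under paralinguistic variation, specifically whether model outputs remain consistently safe across different speaker emotions. Existing research has focused on LALM perception, reasoning, and task performance, but has not systematically assessed safety risks under emotional variation. The key to the solution is to construct a dataset of malicious speech instructions expressed across multiple emotions and intensity levels and to evaluate several state-of-the-art LALMs on it. The results show that both emotion type and intensity substantially affect response safety, and that the effect of intensity is non-monotonic, with medium-intensity expressions often posing the greatest risk. These findings expose an overlooked vulnerability and call for alignment strategies that are robust to emotional variation as a prerequisite for trustworthy deployment.
Link: https://arxiv.org/abs/2510.16893
Authors: Bo-Han Feng,Chien-Feng Liu,Yu-Hsuan Li Liang,Chih-Kai Yang,Szu-Wei Fu,Zhehuai Chen,Ke-Han Lu,Sung-Feng Huang,Chao-Han Huck Yang,Yu-Chiang Frank Wang,Yun-Nung Chen,Hung-yi Lee
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Submitted to ICASSP 2026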
Abstract:Large audio-language models (LALMs) extend text-based LLMs with auditory understanding, offering new opportunities for multimodal applications. While their perception, reasoning, and task performance have been widely studied, their safety alignment under paralinguistic variation remains underexplored. This work systematically investigates the role of speaker emotion. We construct a dataset of malicious speech instructions expressed across multiple emotions and intensities, and evaluate several state-of-the-art LALMs. Our results reveal substantial safety inconsistencies: different emotions elicit varying levels of unsafe responses, and the effect of intensity is non-monotonic, with medium expressions often posing the greatest risk. These findings highlight an overlooked vulnerability in LALMs and call for alignment strategies explicitly designed to ensure robustness under emotional variation, a prerequisite for trustworthy deployment in real-world settings.
[NLP-81] Utility-Diversity Aware Online Batch Selection for LLM Supervised Fine-tuning
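[Quick Read]: This paper addresses the high computational cost, overfitting, and bias amplification that arise when supervised fine-tuning (SFT) uses the full dataset, focusing on three limitations of existing online batch selection methods: (i) they rely only on data utility and ignore diversity; (ii) they depend on external resources such as reference models or validation sets; and (iii) they add training time. The key to the solution is UDS (Utility-Diversity Sampling), a framework that uses the nuclear norm to quantify sample utility and intra-sample diversity, and estimates inter-sample diversity efficiently by comparing low-dimensional embeddings against a lightweight memory buffer of historical samples. This enables efficient online sampling without external resources or extra backpropagation, which improves training efficiency while retaining a performance advantage.
Link: https://arxiv.org/abs/2510.16882
Authors: Heming Zou,Yixiu Mao,Yun Qu,Qi Wang,Xiangyang Ji
Affiliations: Tsinghua University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: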
Abstract:Supervised fine-tuning (SFT) is a commonly used technique to adapt large language models (LLMs) to downstream tasks. In practice, SFT on a full dataset is computationally expensive and sometimes suffers from overfitting or bias amplification. This facilitates the rise of data curation in SFT, which prioritizes the most valuable data to optimize. This work studies the online batch selection family that dynamically scores and filters samples during the training process. However, existing popular methods often (i) rely merely on the utility of data to select a subset while neglecting other crucial factors like diversity, (ii) rely on external resources such as reference models or validation sets, and (iii) incur extra training time over full-dataset training. To address these limitations, this work develops UDS (Utility-Diversity Sampling), a framework for efficient online batch selection in SFT. UDS leverages the nuclear norm of the logits matrix to capture both data utility and intra-sample diversity, while estimating inter-sample diversity through efficient low-dimensional embedding comparisons with a lightweight memory buffer of historical samples. Such a design eliminates the need for external resources and unnecessary backpropagation, securing computational efficiency. Experiments on multiple benchmarks demonstrate that UDS consistently outperforms state-of-the-art online batch selection methods under varying data budgets, and significantly reduces training time compared to full-dataset fine-tuning. Code is available at this https URL.
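The inter-sample diversity idea in this abstract (comparing low-dimensional embeddings against a small buffer of past samples) can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine-similarity score, the buffer capacity, and the names `MemoryBuffer`, `diversity`, and `select_batch` are assumptions made for illustration, and the nuclear-norm utility term is left out.

```python
import math
from collections import deque


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class MemoryBuffer:
    """Hypothetical fixed-size FIFO buffer of low-dimensional embeddings."""

    def __init__(self, capacity=4):
        self.items = deque(maxlen=capacity)  # oldest entries fall out first

    def diversity(self, emb):
        """1 - max similarity to stored samples; 1.0 if the buffer is empty."""
        if not self.items:
            return 1.0
        return 1.0 - max(cosine(emb, old) for old in self.items)

    def add(self, emb):
        self.items.append(emb)


def select_batch(candidates, buffer, keep):
    """Keep the `keep` most diverse candidates, then remember them."""
    ranked = sorted(candidates, key=buffer.diversity, reverse=True)
    chosen = ranked[:keep]
    for emb in chosen:
        buffer.add(emb)
    return chosen
```

Scoring needs only forward-pass embeddings and a handful of dot products, which is consistent with the abstract's claim that no extra backpropagation is required.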
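[NLP-82] DeepAnalyze: Agentic Large Language Models for Autonomous Data Science
[Quick Read]: This paper addresses the long-standing challenge of autonomous data science: automating the full pipeline from raw data sources to analyst-grade deep research reports. Existing workflow-based data agents perform well on specific tasks but are constrained by predefined workflows and fall short of genuine autonomy. The key to the solution is a curriculum-based agentic training paradigm that emulates the learning path of human data scientists, so that the model progressively acquires and integrates multiple data processing and analysis capabilities in real-world environments. The paper also introduces a data-grounded trajectory synthesis framework that constructs high-quality training data, supporting tasks that range from data question answering and specialized analysis to open-ended research. Experiments show that DeepAnalyze-8B, with only 8B parameters, outperforms earlier workflow-based agents built on more powerful proprietary large language models (LLMs).
Link: https://arxiv.org/abs/2510.16872
Authors: Shaolei Zhang,Ju Fan,Meihao Fan,Guoliang Li,Xiaoyong Du
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)
Comments: Code: this https URL Model: this https URL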
Abstract:Autonomous data science, from raw data sources to analyst-grade deep research reports, has been a long-standing challenge, and is now becoming feasible with the emergence of powerful large language models (LLMs). Recent workflow-based data agents have shown promising results on specific data tasks but remain fundamentally limited in achieving fully autonomous data science due to their reliance on predefined workflows. In this paper, we introduce DeepAnalyze-8B, the first agentic LLM designed for autonomous data science, capable of automatically completing the end-to-end pipeline from data sources to analyst-grade deep research reports. To tackle high-complexity data science tasks, we propose a curriculum-based agentic training paradigm that emulates the learning trajectory of human data scientists, enabling LLMs to progressively acquire and integrate multiple capabilities in real-world environments. We also introduce a data-grounded trajectory synthesis framework that constructs high-quality training data. Through agentic training, DeepAnalyze learns to perform a broad spectrum of data tasks, ranging from data question answering and specialized analytical tasks to open-ended data research. Experiments demonstrate that, with only 8B parameters, DeepAnalyze outperforms previous workflow-based agents built on most advanced proprietary LLMs. The model, code, and training data of DeepAnalyze are open-sourced, paving the way toward autonomous data science.
[NLP-83] Neuronal Group Communication for Efficient Neural representation
Link: https://arxiv.org/abs/2510.16851
Authors: Zhengqi Pei,Qingming Huang,Shuhui Wang
Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; School of Computer Science and Technology, University of Chinese Academy of Sciences
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 28 pages, 2 figures
[NLP-84] FinSight: Towards Real-World Financial Deep Research
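[Quick Read]: This paper addresses the insufficient automation, limited analytical depth, and poor visualization quality of current AI systems when generating professional financial reports. The key to the solution is FinSight, a multi-agent framework built on the Code Agent with Variable Memory (CAVM) architecture, which unifies external data, custom tools, and agents in a programmable variable space and carries out flexible data collection, analysis, and report generation by executing code. An Iterative Vision-Enhanced Mechanism improves the professional presentation of charts, and a Two-Stage Writing Framework expands concise Chain-of-Analysis segments into complete reports that are clearly structured, explicitly cited, and multimodal, improving accuracy, depth, and professionalism.
Link: https://arxiv.org/abs/2510.16844
Authors: Jiajie Jin,Yuyao Zhang,Yimeng Xu,Hongjin Qian,Yutao Zhu,Zhicheng Dou
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; BAAI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: Work in progress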
Abstract:Generating professional financial reports is a labor-intensive and intellectually demanding process that current AI systems struggle to fully automate. To address this challenge, we introduce FinSight (Financial InSight), a novel multi-agent framework for producing high-quality, multimodal financial reports. The foundation of FinSight is the Code Agent with Variable Memory (CAVM) architecture, which unifies external data, designed tools, and agents into a programmable variable space, enabling flexible data collection, analysis and report generation through executable code. To ensure professional-grade visualization, we propose an Iterative Vision-Enhanced Mechanism that progressively refines raw visual outputs into polished financial charts. Furthermore, a Two-Stage Writing Framework expands concise Chain-of-Analysis segments into coherent, citation-aware, and multimodal reports, ensuring both analytical depth and structural consistency. Experiments on various company and industry-level tasks demonstrate that FinSight significantly outperforms all baselines, including leading deep research systems, in terms of factual accuracy, analytical depth, and presentation quality, demonstrating a clear path toward generating reports that approach human-expert quality.
[NLP-85] Verifiable Fine-Tuning for LLMs: Zero-Knowledge Training Proofs Bound to Data Provenance and Policy
Link: https://arxiv.org/abs/2510.16830
Authors: Hasan Akgul,Daniel Borg,Arta Berisha,Amina Rahimova,Andrej Novak,Mila Petrov
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: 20 pages, 10 figures
[NLP-86] Who's Asking? Simulating Role-Based Questions for Conversational AI Evaluation
Link: https://arxiv.org/abs/2510.16829
Authors: Navreet Kaur,Hoda Ayad,Hayoung Jung,Shravika Mittal,Munmun De Choudhury,Tanushree Mitra
Affiliations: University of Washington; Princeton University; Georgia Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:
[NLP-87] Cross-Genre Authorship Attribution via LLM-Based Retrieve-and-Rerank
Link: https://arxiv.org/abs/2510.16819
Authors: Shantanu Agarwal,Joel Barry,Steven Fincke,Scott Miller
Affiliations: Information Sciences Institute; University of Southern California
Categories: Computation and Language (cs.CL)
Comments:
[NLP-88] Knowing the Facts but Choosing the Shortcut: Understanding How Large Language Models Compare Entities ACL
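[Quick Read]: This paper investigates whether Large Language Models (LLMs) rely on genuine knowledge or on superficial heuristics in knowledge-based reasoning tasks. Using entity comparison tasks (for example, judging which of two rivers is longer), the study systematically analyzes the basis of model decisions. It finds that smaller models, despite having sufficient numerical knowledge, often answer incorrectly because of heuristic biases such as entity popularity, mention order, and semantic co-occurrence. The key finding is that larger models (32B parameters) selectively rely on numerical information when it is more reliable, and therefore outperform smaller models even when the latter possess more accurate knowledge. This shows that model scale modulates how knowledge is used, and that chain-of-thought prompting steers all models toward numerical features and improves reasoning accuracy.
Link: https://arxiv.org/abs/2510.16815
Authors: Hans Hergen Lehmann,Jae Hee Lee,Steven Schockaert,Stefan Wermter
Affiliations: University of Hamburg; Cardiff University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 33 pages, 20 figures. Submitted ACL ARR 2025 October (under review)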
Abstract:Large Language Models (LLMs) are increasingly used for knowledge-based reasoning tasks, yet understanding when they rely on genuine knowledge versus superficial heuristics remains challenging. We investigate this question through entity comparison tasks by asking models to compare entities along numerical attributes (e.g., “Which river is longer, the Danube or the Nile?”), which offer clear ground truth for systematic analysis. Despite having sufficient numerical knowledge to answer correctly, LLMs frequently make predictions that contradict this knowledge. We identify three heuristic biases that strongly influence model predictions: entity popularity, mention order, and semantic co-occurrence. For smaller models, a simple logistic regression using only these surface cues predicts model choices more accurately than the model’s own numerical predictions, suggesting heuristics largely override principled reasoning. Crucially, we find that larger models (32B parameters) selectively rely on numerical knowledge when it is more reliable, while smaller models (7–8B parameters) show no such discrimination, which explains why larger models outperform smaller ones even when the smaller models possess more accurate knowledge. Chain-of-thought prompting steers all models towards using the numerical features across all model sizes.
Computer Vision
[CV-0] ConsistEdit: Highly Consistent and Precise Training-free Visual Editing SIGGRAPH
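[Quick Read]: This paper addresses two core problems of current training-free attention control methods for text-guided image and video editing. First, they struggle to combine strong editing strength with consistency to the source, so visual errors accumulate in multi-round and video editing. Second, most methods enforce global consistency, which limits fine-grained modification of specific attributes such as texture. The key to the solution is an in-depth analysis of the attention mechanisms of the MM-DiT architecture, which leads to ConsistEdit. Its core components are vision-only attention control, mask-guided pre-attention fusion, and differentiated manipulation of the query, key, and value tokens. Without handcrafted design, ConsistEdit edits consistently across all inference steps and attention layers, improving structural consistency and editing precision, and it supports progressive adjustment of structural consistency for finer control.
Link: https://arxiv.org/abs/2510.17803
Authors: Zixin Yin,Ling-Hao Chen,Lionel Ni,Xili Dai
Affiliations: Hong Kong University of Science and Technology; Tsinghua University; International Digital Economy Academy; Hong Kong University of Science and Technology, Guangzhou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: SIGGRAPH Asia 2025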
Abstract:Recent advances in training-free attention control methods have enabled flexible and efficient text-guided editing capabilities for existing generation models. However, current approaches struggle to simultaneously deliver strong editing strength while preserving consistency with the source. This limitation becomes particularly critical in multi-round and video editing, where visual errors can accumulate over time. Moreover, most existing methods enforce global consistency, which limits their ability to modify individual attributes such as texture while preserving others, thereby hindering fine-grained editing. Recently, the architectural shift from U-Net to MM-DiT has brought significant improvements in generative performance and introduced a novel mechanism for integrating text and vision modalities. These advancements pave the way for overcoming challenges that previous methods failed to resolve. Through an in-depth analysis of MM-DiT, we identify three key insights into its attention mechanisms. Building on these, we propose ConsistEdit, a novel attention control method specifically tailored for MM-DiT. ConsistEdit incorporates vision-only attention control, mask-guided pre-attention fusion, and differentiated manipulation of the query, key, and value tokens to produce consistent, prompt-aligned edits. Extensive experiments demonstrate that ConsistEdit achieves state-of-the-art performance across a wide range of image and video editing tasks, including both structure-consistent and structure-inconsistent scenarios. Unlike prior methods, it is the first approach to perform editing across all inference steps and attention layers without handcraft, significantly enhancing reliability and consistency, which enables robust multi-round and multi-region editing. Furthermore, it supports progressive adjustment of structural consistency, enabling finer control.
[CV-1] Botany-Bot: Digital Twin Monitoring of Occluded and Underleaf Plant Structures with Gaussian Splats IROS2025
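[Quick Read]: This paper addresses the inability of commercial fixed-camera plant phenotyping systems to capture plant details because of leaf occlusion. The key to the solution is Botany-Bot, an automated platform that combines two stereo cameras, a digital turntable, a lightbox, and an industrial robot arm with 3D segmented Gaussian Splat models. Robot algorithms actively manipulate leaves to capture high-resolution, indexable images of occluded regions such as stem buds and the top and underside of leaves, producing detailed “annotated digital twins” of living plants.
Link: https://arxiv.org/abs/2510.17783
Authors: Simeon Adebola,Chung Min Kim,Justin Kerr,Shuangyu Xie,Prithvi Akella,Jose Luis Susa Rincon,Eugen Solowjow,Ken Goldberg
Affiliations: The AUTOLab at UC Berkeley (automation.berkeley.edu); Siemens Research Lab, Berkeley, CA
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025)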
Abstract:Commercial plant phenotyping systems using fixed cameras cannot perceive many plant details due to leaf occlusion. In this paper, we present Botany-Bot, a system for building detailed “annotated digital twins” of living plants using two stereo cameras, a digital turntable inside a lightbox, an industrial robot arm, and 3D segmentated Gaussian Splat models. We also present robot algorithms for manipulating leaves to take high-resolution indexable images of occluded details such as stem buds and the underside/topside of leaves. Results from experiments suggest that Botany-Bot can segment leaves with 90.8% accuracy, detect leaves with 86.2% accuracy, lift/push leaves with 77.9% accuracy, and take detailed overside/underside images with 77.3% accuracy. Code, videos, and datasets are available at this https URL.
[CV-2] SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference
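[Quick Read]: This paper addresses the scalability bottleneck of Vision Language Models (VLMs) at inference time, which is caused by the growing number of visual tokens; in long-video and multi-turn conversation settings, visual tokens dominate inference latency. The key to the solution is SparseVILA, a new paradigm that decouples visual sparsity across the prefill and decoding stages. During prefill it prunes redundant visual tokens to speed up processing, and during decoding it retrieves only the visual tokens relevant to the current query, which preserves multi-turn fidelity while improving efficiency. The method is training-free and architecture-agnostic. Built on an AWQ-optimized inference pipeline, it achieves up to 4.0x faster prefilling, 2.5x faster decoding, and an overall 2.6x end-to-end speedup on long-context video tasks, while improving accuracy on document-understanding and reasoning tasks.
Link: https://arxiv.org/abs/2510.17777
Authors: Samir Khaki,Junxian Guo,Jiaming Tang,Shang Yang,Yukang Chen,Konstantinos N. Plataniotis,Yao Lu,Song Han,Zhijian Liu
Affiliations: NVIDIA; MIT; UC San Diego; University of Toronto
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: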
Abstract:Vision Language Models (VLMs) have rapidly advanced in integrating visual and textual reasoning, powering applications across high-resolution image understanding, long-video analysis, and multi-turn conversation. However, their scalability remains limited by the growing number of visual tokens that dominate inference latency. We present SparseVILA, a new paradigm for efficient VLM inference that decouples visual sparsity across the prefilling and decoding stages. SparseVILA distributes sparsity across stages by pruning redundant visual tokens during prefill and retrieving only query-relevant tokens during decoding. This decoupled design matches leading prefill pruning methods while preserving multi-turn fidelity by retaining most of the visual cache so that query-aware tokens can be retrieved at each conversation round. Built on an AWQ-optimized inference pipeline, SparseVILA achieves up to 4.0 times faster prefilling, 2.5 times faster decoding, and an overall 2.6 times end-to-end speedup on long-context video tasks – while improving accuracy on document-understanding and reasoning tasks. By decoupling query-agnostic pruning and query-aware retrieval, SparseVILA establishes a new direction for efficient multimodal inference, offering a training-free, architecture-agnostic framework for accelerating large VLMs without sacrificing capability.
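The decoupling described in this abstract (query-agnostic pruning at prefill, query-aware retrieval at decode) can be sketched on toy vectors. This is an illustrative sketch only: the salience scores, the dot-product relevance measure, and the function names `prefill_prune` and `decode_retrieve` are assumptions, not SparseVILA's actual scoring or API.

```python
def top_k_indices(scores, k):
    """Indices of the k largest scores, returned in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def prefill_prune(tokens, salience, keep):
    """Query-agnostic step: drop low-salience visual tokens once, at prefill."""
    return [tokens[i] for i in top_k_indices(salience, keep)]


def decode_retrieve(cache, query, k):
    """Query-aware step: pick the k cached tokens most aligned with the query."""
    relevance = [sum(t * q for t, q in zip(tok, query)) for tok in cache]
    return [cache[i] for i in top_k_indices(relevance, k)]


# Toy visual tokens (2-d embeddings) with assumed salience scores.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.1], [0.7, 0.7]]
salience = [0.9, 0.8, 0.1, 0.7]

cache = prefill_prune(tokens, salience, keep=3)    # least salient token removed
round_1 = decode_retrieve(cache, [1.0, 0.0], k=2)  # first question
round_2 = decode_retrieve(cache, [0.0, 1.0], k=2)  # follow-up question
```

Because the cache is pruned once while retrieval runs per round, each question can pull a different subset of tokens from the same retained cache, which is the multi-turn property the abstract emphasizes.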
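[CV-3] Towards Explainable Skin Cancer Classification: A Dual-Network Attention Model with Lesion Segmentation and Clinical Metadata Fusion
[Quick Read]: This paper addresses two problems in automated skin cancer diagnosis: low classification accuracy caused by high intra-class variability and subtle inter-class differences, and the lack of interpretability that keeps deep learning models from earning clinical trust. The key to the solution is a dual-encoder attention-based framework. A modified Deep-UNet architecture with Dual Attention Gates (DAG) and Atrous Spatial Pyramid Pooling (ASPP) first segments lesions with high accuracy. In the classification stage, two DenseNet201 encoders process the original image and the segmented lesion respectively, and their features are fused through multi-head cross-attention, which guides the model toward salient lesion regions. A transformer-based module additionally incorporates patient clinical metadata (age, sex, lesion site) to further improve prediction. The design significantly improves classification accuracy and average AUC, and Grad-CAM heatmaps confirm that predictions depend on the lesion area rather than background noise, yielding more accurate and interpretable skin lesion classification.
Link: https://arxiv.org/abs/2510.17773
Authors: Md. Enamul Atiq,Shaikh Anowarul Fattah
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages, 7 Figures, 3 Tables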
Abstract:Skin cancer is a life-threatening disease where early detection significantly improves patient outcomes. Automated diagnosis from dermoscopic images is challenging due to high intra-class variability and subtle inter-class differences. Many deep learning models operate as “black boxes,” limiting clinical trust. In this work, we propose a dual-encoder attention-based framework that leverages both segmented lesions and clinical metadata to enhance skin lesion classification in terms of both accuracy and interpretability. A novel Deep-UNet architecture with Dual Attention Gates (DAG) and Atrous Spatial Pyramid Pooling (ASPP) is first employed to segment lesions. The classification stage uses two DenseNet201 encoders-one on the original image and another on the segmented lesion whose features are fused via multi-head cross-attention. This dual-input design guides the model to focus on salient pathological regions. In addition, a transformer-based module incorporates patient metadata (age, sex, lesion site) into the prediction. We evaluate our approach on the HAM10000 dataset and the ISIC 2018 and 2019 challenges. The proposed method achieves state-of-the-art segmentation performance and significantly improves classification accuracy and average AUC compared to baseline models. To validate our model’s reliability, we use Gradient-weighted Class Activation Mapping (Grad-CAM) to generate heatmaps. These visualizations confirm that our model’s predictions are based on the lesion area, unlike models that rely on spurious background features. These results demonstrate that integrating precise lesion segmentation and clinical data with attention-based fusion leads to a more accurate and interpretable skin cancer classification model.
[CV-4] Seeing but Not Believing: Probing the Disconnect Between Visual Attention and Answer Correctness in VLMs
[Quick Read]: This paper addresses the “seeing but not believing” problem in Vision-Language Models (VLMs) on multimodal tasks: the model perceives the correct visual evidence but fails to use it effectively in reasoning, which leads to wrong outputs. The key to the solution is a training-free, inference-time intervention that uses selective attention-based masking to highlight the reliable visual evidence regions found in deep layers. This makes better use of visual information that is already encoded internally but under-utilized, and consistently improves accuracy across major VLM families such as LLaVA, Qwen, Gemma, and InternVL.
Link: https://arxiv.org/abs/2510.17771
Authors: Zhining Liu,Ziyi Chen,Hui Liu,Chen Luo,Xianfeng Tang,Suhang Wang,Joy Zeng,Zhenwei Dai,Zhan Shi,Tianxin Wei,Benoit Dumoulin,Hanghang Tong
Affiliations: University of Illinois Urbana-Champaign; Amazon; Penn State University
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 10 figures, 6 tables
Abstract:Vision-Language Models (VLMs) achieve strong results on multimodal tasks such as visual question answering, yet they can still fail even when the correct visual evidence is present. In this work, we systematically investigate whether these failures arise from not perceiving the evidence or from not leveraging it effectively. By examining layer-wise attention dynamics, we find that shallow layers focus primarily on text, while deeper layers sparsely but reliably attend to localized evidence regions. Surprisingly, VLMs often perceive the visual evidence when outputting incorrect answers, a phenomenon we term “seeing but not believing” that widely exists in major VLM families. Building on this, we introduce an inference-time intervention that highlights deep-layer evidence regions through selective attention-based masking. It requires no training and consistently improves accuracy across multiple families, including LLaVA, Qwen, Gemma, and InternVL. These results show that VLMs encode reliable evidence internally but under-utilize it, making such signals explicit can bridge the gap between perception and reasoning, advancing the diagnostic understanding and reliability of VLMs.
[CV-5] Joint Multi-Condition Representation Modelling via Matrix Factorisation for Visual Place Recognition
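[Quick Read]: This paper addresses performance improvement in multi-reference visual place recognition (VPR), in particular how to use multiple reference sets captured under different lighting, viewpoint, and appearance conditions to improve localisation accuracy. Deep learning with large-scale training improves robustness but incurs heavy computational cost, while descriptor voting or aggregation methods usually target multi-sensor setups or rely on heuristics, with limited gains under complex changes. The paper proposes a training-free, descriptor-agnostic solution whose key idea is to jointly model multi-reference descriptors as basis representations through matrix decomposition, which enables projection-based residual matching. This improves generalisation across appearance and viewpoint changes while remaining lightweight.
Link: https://arxiv.org/abs/2510.17739
Authors: Timur Ismagilov,Shakaiba Majeed,Michael Milford,Tan Viet Tuyen Nguyen,Sarvapali D. Ramchurn,Shoaib Ehsan
Affiliations: University of Southampton; Hanyang University; Queensland University of Technology; University of Essex
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages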
Abstract:We address multi-reference visual place recognition (VPR), where reference sets captured under varying conditions are used to improve localisation performance. While deep learning with large-scale training improves robustness, increasing data diversity and model complexity incur extensive computational cost during training and deployment. Descriptor-level fusion via voting or aggregation avoids training, but often targets multi-sensor setups or relies on heuristics with limited gains under appearance and viewpoint change. We propose a training-free, descriptor-agnostic approach that jointly models places using multiple reference descriptors via matrix decomposition into basis representations, enabling projection-based residual matching. We also introduce SotonMV, a structured benchmark for multi-viewpoint VPR. On multi-appearance data, our method improves Recall@1 by up to ~18% over single-reference and outperforms multi-reference baselines across appearance and viewpoint changes, with gains of ~5% on unstructured data, demonstrating strong generalisation while remaining lightweight.
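Projection-based residual matching, as named in this abstract, can be sketched as follows. The abstract does not specify the factorisation, so this sketch assumes an orthonormal basis built by Gram-Schmidt from each place's reference descriptors; the function names and the toy 3-d descriptors are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(descriptors, eps=1e-9):
    """Gram-Schmidt over one place's reference descriptors (assumed method)."""
    basis = []
    for d in descriptors:
        r = list(d)
        for b in basis:
            c = dot(r, b)
            r = [x - c * y for x, y in zip(r, b)]
        n = math.sqrt(dot(r, r))
        if n > eps:  # skip descriptors already spanned by the basis
            basis.append([x / n for x in r])
    return basis


def residual(query, basis):
    """Norm of the part of `query` that lies outside the span of `basis`."""
    r = list(query)
    for b in basis:
        c = dot(r, b)
        r = [x - c * y for x, y in zip(r, b)]
    return math.sqrt(dot(r, r))


def match(query, places):
    """Return the place id whose basis leaves the smallest residual."""
    bases = {pid: orthonormal_basis(refs) for pid, refs in places.items()}
    return min(bases, key=lambda pid: residual(query, bases[pid]))


# Two places, each with two reference descriptors from different conditions.
places = {
    "A": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "B": [[0.0, 0.0, 1.0], [0.0, 0.1, 0.9]],
}
best = match([0.8, 0.2, 0.0], places)
```

In this toy setup the query lies in the subspace spanned by place A's references, so its residual against A is near zero while roughly 0.8 of its norm is left unexplained by B. No training is involved, only linear algebra on existing descriptors.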
[CV-6] Can Image-To-Video Models Simulate Pedestrian Dynamics? ICML2025
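[Quick Read]: This paper examines how realistic and accurate generative image-to-video (I2V) models are when simulating pedestrian movement patterns in crowded public scenes. The key to the approach is to condition I2V models on keyframes extracted from pedestrian trajectory benchmark datasets, guiding them to generate video sequences that reflect real pedestrian dynamics, and then to evaluate their trajectory prediction performance with quantitative measures of pedestrian dynamics.
Link: https://arxiv.org/abs/2510.17731
Authors: Aaron Appelle,Jerome P. Lynch
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Appeared in the ICML 2025 Workshop on Building Physically Plausible World Models, July 2025, this https URL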
Abstract:Recent high-performing image-to-video (I2V) models based on variants of the diffusion transformer (DiT) have displayed remarkable inherent world-modeling capabilities by virtue of training on large scale video datasets. We investigate whether these models can generate realistic pedestrian movement patterns in crowded public scenes. Our framework conditions I2V models on keyframes extracted from pedestrian trajectory benchmarks, then evaluates their trajectory prediction performance using quantitative measures of pedestrian dynamics.
[CV-7] Signature Forgery Detection: Improving Cross-Dataset Generalization
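[Quick Read]: This paper addresses the weak cross-dataset generalization of models for offline signature verification: a model trained on one dataset performs markedly worse on others. The approach explores different feature learning strategies, comparing direct modeling on raw signature images against feature extraction with “shell preprocessing” in cross-dataset settings. Experiments on three public benchmarks (CEDAR, ICDAR, and GPDS Synthetic) show that the raw-image model performs better overall, while the shell-preprocessing approach shows potential for further refinement toward robust cross-domain verification.
Link: https://arxiv.org/abs/2510.17724
Authors: Matheus Ramos Parracho
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Undergraduate thesis (preprint), submitted to Escola Politécnica, Universidade Federal do Rio de Janeiro (POLI/UFRJ). The final version will include official signatures and defense approval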
Abstract:Automated signature verification is a critical biometric technique used in banking, identity authentication, and legal documentation. Despite the notable progress achieved by deep learning methods, most approaches in offline signature verification still struggle to generalize across datasets, as variations in handwriting styles and acquisition protocols often degrade performance. This study investigates feature learning strategies for signature forgery detection, focusing on improving cross-dataset generalization – that is, model robustness when trained on one dataset and tested on another. Using three public benchmarks – CEDAR, ICDAR, and GPDS Synthetic – two experimental pipelines were developed: one based on raw signature images and another employing a preprocessing method referred to as shell preprocessing. Several behavioral patterns were identified and analyzed; however, no definitive superiority between the two approaches was established. The results show that the raw-image model achieved higher performance across benchmarks, while the shell-based model demonstrated promising potential for future refinement toward robust, cross-domain signature verification.
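[CV-8] MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues
[Quick Read]: This paper addresses the lack of multi-turn dialogue coverage in current evaluations of video understanding for Multimodal Large Language Models (MLLMs). Existing benchmarks are largely limited to single-turn question answering and do not reflect the interaction complexity of real applications. The key to the solution is MT-Video-Bench, a holistic video understanding benchmark for multi-turn dialogues. It comprises 987 carefully curated multi-turn dialogues, assesses six core competencies centered on perceptivity and interactivity, and is aligned with real-world applications such as interactive sports analysis and multi-turn video-based intelligent tutoring, providing a standardized and extensible way to evaluate MLLMs in complex video dialogue scenarios.
Link: https://arxiv.org/abs/2510.17722
Authors: Yaning Pan,Zekun Wang,Qianqian Xie,Yongqian Wen,Yuanxing Zhang,Guohui Zhang,Haoxuan Hu,Zhiyu Pan,Yibing Huang,Zhidong Gan,Yonghong Lin,An Ping,Tianhao Peng,Jiaheng Liu
Affiliations: Fudan University; Kuaishou Technology; Nanjing University; University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project Website: this https URL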
Abstract:The recent development of Multimodal Large Language Models (MLLMs) has significantly advanced AI’s ability to understand visual modalities. However, existing evaluation benchmarks remain limited to single-turn question answering, overlooking the complexity of multi-turn dialogues in real-world scenarios. To bridge this gap, we introduce MT-Video-Bench, a holistic video understanding benchmark for evaluating MLLMs in multi-turn dialogues. Specifically, our MT-Video-Bench mainly assesses six core competencies that focus on perceptivity and interactivity, encompassing 987 meticulously curated multi-turn dialogues from diverse domains. These capabilities are rigorously aligned with real-world applications, such as interactive sports analysis and multi-turn video-based intelligent tutoring. With MT-Video-Bench, we extensively evaluate various state-of-the-art open-source and closed-source MLLMs, revealing their significant performance discrepancies and limitations in handling multi-turn video dialogues. The benchmark will be publicly available to foster future research.
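[CV-9] RaindropGS: A Benchmark for 3D Gaussian Splatting under Raindrop Conditions
[Quick Read]: This paper addresses the severe degradation of 3D Gaussian Splatting (3DGS) reconstruction quality when raindrops contaminate the camera lens, in particular the occlusions and optical distortions they cause and the inaccurate camera pose estimation and point cloud initialization that result in real scenes. Existing benchmarks mostly evaluate on synthetic raindrop images with known camera poses, ignoring how real raindrops interfere with pose estimation and initial reconstruction, and they suffer from a domain gap between synthetic and real raindrops. The key to the solution is RaindropGS, a full-pipeline benchmark that covers the path from unconstrained, raindrop-corrupted images to clear 3DGS reconstructions, including data preparation, data processing, and raindrop-aware 3DGS evaluation. Its core contribution is a real-world dataset in which each scene has three aligned image sets (raindrop-focused, background-focused, and rain-free ground truth), which makes it possible to show systematically how camera focus position and errors in pose and point cloud initialization affect 3DGS performance, and points to clear directions for more robust 3DGS methods under raindrop conditions.
Link: https://arxiv.org/abs/2510.17719
Authors: Zhiqiang Teng,Beibei Lin,Tingting Chen,Zifeng Yuan,Xuanyi Li,Xuanyu Zhang,Shunli Zhang
Affiliations: Beijing Jiaotong University; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: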
Abstract:3D Gaussian Splatting (3DGS) under raindrop conditions suffers from severe occlusions and optical distortions caused by raindrop contamination on the camera lens, substantially degrading reconstruction quality. Existing benchmarks typically evaluate 3DGS using synthetic raindrop images with known camera poses (constrained images), assuming ideal conditions. However, in real-world scenarios, raindrops often interfere with accurate camera pose estimation and point cloud initialization. Moreover, a significant domain gap between synthetic and real raindrops further impairs generalization. To tackle these issues, we introduce RaindropGS, a comprehensive benchmark designed to evaluate the full 3DGS pipeline-from unconstrained, raindrop-corrupted images to clear 3DGS reconstructions. Specifically, the whole benchmark pipeline consists of three parts: data preparation, data processing, and raindrop-aware 3DGS evaluation, including types of raindrop interference, camera pose estimation and point cloud initialization, single image rain removal comparison, and 3D Gaussian training comparison. First, we collect a real-world raindrop reconstruction dataset, in which each scene contains three aligned image sets: raindrop-focused, background-focused, and rain-free ground truth, enabling a comprehensive evaluation of reconstruction quality under different focus conditions. Through comprehensive experiments and analyses, we reveal critical insights into the performance limitations of existing 3DGS methods on unconstrained raindrop images and the varying impact of different pipeline components: the impact of camera focus position on 3DGS reconstruction performance, and the interference caused by inaccurate pose and point cloud initialization on reconstruction. These insights establish clear directions for developing more robust 3DGS methods under raindrop conditions.
[CV-10] Automatic Classification of Circulating Blood Cell Clusters based on Multi-channel Flow Cytometry Imaging
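[Quick Read]: This paper addresses the automatic analysis of circulating blood cell clusters (CCCs) in flow cytometry, which is complicated by their irregular shapes, varying sizes, and mixed cell-type composition. Conventional machine learning methods focus on single-cell image analysis and lack tools for recognizing and classifying heterogeneous clusters that contain red blood cells (RBCs), white blood cells (WBCs), and platelets. The key to the solution is a two-step computational framework. First, a fine-tuned YOLOv11 object detection model classifies images into cluster and non-cluster groups with over 95% accuracy, outperforming traditional convolutional neural networks (CNNs) and Vision Transformers (ViT). Second, cluster contours are overlaid with regions from multi-channel fluorescence stains to identify the cell types within each cluster, which stays accurate despite cell debris and staining artifacts and enables fully automated, high-accuracy analysis of CCC images.
Link: https://arxiv.org/abs/2510.17716
Authors: Suqiang Ma,Subhadeep Sengupta,Yao Lee,Beikang Gu,Xianyan Chen,Xianqiao Wang,Yang Liu,Mengjia Xu,Galit H. Frydman,He Li
Affiliations: University of Georgia; University of Michigan; New Jersey Institute of Technology; Massachusetts Institute of Technology; Massachusetts General Hospital
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: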
Abstract:Circulating blood cell clusters (CCCs) containing red blood cells (RBCs), white blood cells (WBCs), and platelets are significant biomarkers linked to conditions like thrombosis, infection, and inflammation. Flow cytometry, paired with fluorescence staining, is commonly used to analyze these cell clusters, revealing cell morphology and protein profiles. While computational approaches based on machine learning have advanced the automatic analysis of single-cell flow cytometry images, there is a lack of effort to build tools to automatically analyze images containing CCCs. Unlike single cells, cell clusters often exhibit irregular shapes and sizes. In addition, these cell clusters often consist of heterogeneous cell types, which require multi-channel staining to identify the specific cell types within the clusters. This study introduces a new computational framework for analyzing CCC images and identifying cell types within clusters. Our framework uses a two-step analysis strategy. First, it categorizes images into cell cluster and non-cluster groups by fine-tuning the You Only Look Once (YOLOv11) model, which outperforms traditional convolutional neural networks (CNNs) and Vision Transformers (ViT). Then, it identifies cell types by overlaying cluster contours with regions from multi-channel fluorescence stains, enhancing accuracy despite cell debris and staining artifacts. This approach achieved over 95% accuracy in both cluster classification and phenotype identification. In summary, our automated framework effectively analyzes CCC images from flow cytometry, leveraging both bright-field and fluorescence data. Initially tested on blood cells, it holds potential for broader applications, such as analyzing immune and tumor cell clusters, supporting cellular research across various diseases.
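The second step of the framework, overlaying cluster contours with stained regions, can be sketched with binary masks. This is a simplified illustration, not the authors' pipeline: the channel names, the set-of-pixels mask representation, and the `min_overlap` threshold used to ignore small debris are all assumptions.

```python
def cell_types_in_cluster(cluster_mask, channel_masks, min_overlap=0.05):
    """Return cell types whose stain covers enough of the cluster.

    cluster_mask: set of (row, col) pixels inside the cluster contour.
    channel_masks: dict mapping a cell-type name to its stained pixel set.
    min_overlap: assumed fraction of the cluster area a stain must cover;
        smaller overlaps are treated as debris or staining artifacts.
    """
    if not cluster_mask:
        return {}
    found = {}
    for cell_type, stained in channel_masks.items():
        fraction = len(cluster_mask & stained) / len(cluster_mask)
        if fraction >= min_overlap:
            found[cell_type] = fraction
    return found


# A 4x5 cluster region (20 pixels) and three hypothetical stain channels.
cluster = {(r, c) for r in range(4) for c in range(5)}
channels = {
    "WBC": {(r, c) for r in range(2) for c in range(5)},  # 10 pixels inside
    "platelet": {(3, 0), (3, 1), (3, 2), (9, 9)},          # 3 pixels inside
    "RBC": {(8, 8)},                                       # stain outside cluster
}
types = cell_types_in_cluster(cluster, channels)
```

Restricting each stain to the pixels inside the contour is what lets a threshold separate genuine cluster members from stray signal elsewhere in the frame.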
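[CV-11] Improving Cross-Patient Generalization in Parkinson's Disease Detection through Chunk-Based Analysis of Hand-Drawn Patterns
Summary: This paper addresses two key problems in early detection of Parkinson's disease (PD): existing work lacks sufficiently large datasets, and models are not robust on data from unseen patients. The authors propose a two-stage method. The first stage classifies each hand-drawn image by drawing type (circle, meander, spiral). The second stage divides each image into 2x2 chunks, extracts features and recognizes PD indicators from each chunk separately, and merges the per-chunk decisions with an ensemble method. The core contribution is the chunking strategy, which improves generalization to unseen patients: on the NewHandPD dataset the method reaches 94.91% accuracy on unseen patients, only 2.17 percentage points below its accuracy on seen patients, compared with a 4.76-point drop in prior work.
Link: https://arxiv.org/abs/2510.17703
Authors: Mhd Adnan Albani, Riad Sonbol
Affiliations: HIAST
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: 19 pages, 2 figures, 9 tables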
Abstract:Parkinson’s disease (PD) is a neurodegenerative disease affecting about 1% of people over the age of 60, causing motor impairments that impede hand coordination activities such as writing and drawing. Many approaches have tried to support early detection of Parkinson’s disease based on hand-drawn images; however, we identified two major limitations in the related works: (1) the lack of sufficient datasets, and (2) limited robustness when dealing with unseen patient data. In this paper, we propose a new approach to detect Parkinson’s disease that consists of two stages: the first stage classifies images by drawing type (circle, meander, spiral), and the second stage extracts the required features from the images and detects Parkinson’s disease. We overcame the previous two limitations by applying a chunking strategy where we divide each image into 2x2 chunks. Each chunk is processed separately when extracting features and recognizing Parkinson’s disease indicators. To make the final classification, an ensemble method is used to merge the decisions made from each chunk. Our evaluation shows that our proposed approach outperforms the top performing state-of-the-art approaches, in particular on unseen patients. On the NewHandPD dataset, our approach achieved 97.08% accuracy for seen patients and 94.91% for unseen patients, a gap of only 2.17 percentage points, compared to the 4.76-point drop observed in prior work.
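The chunking strategy described above can be sketched in a few lines. The abstract does not say which ensemble rule is used, so the soft voting below (averaging per-chunk probabilities) is an assumption, and `chunk_classifier`, `toy_classifier`, and the 0.5 threshold are hypothetical stand-ins for the paper's feature extractor and classifier.

```python
def split_2x2(image):
    """Split a 2D list of pixels into four chunks: top-left, top-right, bottom-left, bottom-right."""
    h, w = len(image), len(image[0])
    mh, mw = h // 2, w // 2
    row_ranges = [(0, mh), (mh, h)]
    col_ranges = [(0, mw), (mw, w)]
    return [
        [row[c0:c1] for row in image[r0:r1]]
        for r0, r1 in row_ranges
        for c0, c1 in col_ranges
    ]


def ensemble_predict(image, chunk_classifier, threshold=0.5):
    """Average the per-chunk PD probabilities and threshold the mean (soft voting, an assumption)."""
    probs = [chunk_classifier(chunk) for chunk in split_2x2(image)]
    mean_prob = sum(probs) / len(probs)
    return mean_prob >= threshold, probs


def toy_classifier(chunk):
    """Hypothetical stand-in: treats mean pixel intensity as a PD probability."""
    values = [v for row in chunk for v in row]
    return sum(values) / len(values)


image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [1, 1, 1, 1],
         [1, 1, 1, 1]]
decision, chunk_probs = ensemble_predict(image, toy_classifier)
```

Because each chunk is scored independently, one uninformative region of a drawing cannot dominate the final decision, which is one plausible reading of why chunking helps on unseen patients.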
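[CV-12] Elastic ViTs from Pretrained Models without Retraining NEURIPS2025
Summary: This paper addresses the inefficient use of compute when deploying pretrained Vision Transformers (ViTs): models are available only in a limited set of predetermined sizes, which makes it hard to fit diverse compute budgets. The key to the solution is SnapViT, a structured pruning method that requires neither retraining nor labeled data. It combines gradient information with cross-network structure correlations, using an evolutionary algorithm to efficiently approximate the off-diagonal structure of the Hessian, and enables elastic inference for the pruned model. Elastic models that can be adjusted to any compute budget are generated in under five minutes on a single A100 GPU.
Link: https://arxiv.org/abs/2510.17700
Authors: Walter Simoncini, Michael Dorkenwald, Tijmen Blankevoort, Cees G.M. Snoek, Yuki M. Asano
Affiliations: University of Technology Nuremberg; University of Amsterdam; NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted at NeurIPS 2025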
Abstract:Vision foundation models achieve remarkable performance but are only available in a limited set of pre-determined sizes, forcing sub-optimal deployment choices under real-world constraints. We introduce SnapViT: Single-shot network approximation for pruned Vision Transformers, a new post-pretraining structured pruning method that enables elastic inference across a continuum of compute budgets. Our approach efficiently combines gradient information with cross-network structure correlations, approximated via an evolutionary algorithm, does not require labeled data, generalizes to models without a classification head, and is retraining-free. Experiments on DINO, SigLIPv2, DeIT, and AugReg models demonstrate superior performance over state-of-the-art methods across various sparsities, requiring less than five minutes on a single A100 GPU to generate elastic models that can be adjusted to any computational budget. Our key contributions include an efficient pruning strategy for pretrained Vision Transformers, a novel evolutionary approximation of Hessian off-diagonal structures, and a self-supervised importance scoring mechanism that maintains strong performance without requiring retraining or labels. Code and pruned models are available at: this https URL
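[CV-13] GAS: Improving Discretization of Diffusion ODEs via Generalized Adversarial Solver
Summary: This paper addresses the high computational cost of sampling from diffusion models. Existing gradient-based optimization methods reduce the number of function evaluations, but they often depend on intricate training techniques and struggle to preserve fine-grained details. The key to the solution is the Generalized Solver, a parameterization of the ODE sampler that needs no additional training tricks. Combining the original distillation loss with adversarial training suppresses artifacts and improves detail fidelity, yielding the Generalized Adversarial Solver, which outperforms existing solver training methods under similar resource constraints.
Link: https://arxiv.org/abs/2510.17699
Authors: Aleksandr Oganov, Ilya Bykov, Eva Neudachina, Mishan Aliev, Alexander Tolmachev, Alexander Sidorov, Aleksandr Zuev, Andrey Okhotin, Denis Rakitin, Aibek Alanov
Affiliations: HSE University; Lomonosov Moscow State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: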
Abstract:While diffusion models achieve state-of-the-art generation quality, they still suffer from computationally expensive sampling. Recent works address this issue with gradient-based optimization methods that distill a few-step ODE diffusion solver from the full sampling process, reducing the number of function evaluations from dozens to just a few. However, these approaches often rely on intricate training techniques and do not explicitly focus on preserving fine-grained details. In this paper, we introduce the Generalized Solver: a simple parameterization of the ODE sampler that does not require additional training tricks and improves quality over existing approaches. We further combine the original distillation loss with adversarial training, which mitigates artifacts and enhances detail fidelity. We call the resulting method the Generalized Adversarial Solver and demonstrate its superior performance compared to existing solver training methods under similar resource constraints. Code is available at this https URL.
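[CV-14] Towards 3D Objectness Learning in an Open World NEURIPS2025
Summary: This paper addresses the limited generalization of open-world 3D object detection, that is, how to detect objects of novel categories unseen during training without relying on predefined category labels. Traditional closed-set 3D detectors adapt poorly to open scenes, while directly adopting 3D open-vocabulary models runs into difficulties with vocabulary expansion and semantic overlap. The key to the solution is OP3Det, a class-agnostic, prompt-free open-world 3D detector. It draws on the generalization and zero-shot capabilities of 2D foundation models, combining 2D semantic priors with 3D geometric priors to generate class-agnostic proposals, and uses a cross-modal mixture of experts to dynamically route uni-modal and multi-modal features and learn generalized 3D objectness. OP3Det surpasses existing open-world 3D detectors by up to 16.0% in average recall (AR) and improves on closed-world 3D detectors by 13.5%.
Link: https://arxiv.org/abs/2510.17686
Authors: Taichi Liu, Zhenyu Wang, Ruofeng Liu, Guang Wang, Desheng Zhang
Affiliations: Rutgers University; Tsinghua University; Michigan State University; Florida State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted by NeurIPS 2025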
Abstract:Recent advancements in 3D object detection and novel category detection have made significant progress, yet research on learning generalized 3D objectness remains insufficient. In this paper, we delve into learning open-world 3D objectness, which focuses on detecting all objects in a 3D scene, including novel objects unseen during training. Traditional closed-set 3D detectors struggle to generalize to open-world scenarios, while directly incorporating 3D open-vocabulary models for open-world ability struggles with vocabulary expansion and semantic overlap. To achieve generalized 3D object discovery, we propose OP3Det, a class-agnostic Open-World Prompt-free 3D Detector that detects any object within 3D scenes without relying on hand-crafted text prompts. We introduce the strong generalization and zero-shot capabilities of 2D foundation models, utilizing both 2D semantic priors and 3D geometric priors for class-agnostic proposals to broaden 3D object discovery. Then, by integrating complementary information from point cloud and RGB image in the cross-modal mixture of experts, OP3Det dynamically routes uni-modal and multi-modal features to learn generalized 3D objectness. Extensive experiments demonstrate the extraordinary performance of OP3Det, which significantly surpasses existing open-world 3D detectors by up to 16.0% in AR and achieves a 13.5% improvement compared to closed-world 3D detectors.
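[CV-15] Multilingual Text-to-Image Person Retrieval via Bidirectional Relation Reasoning and Aligning
Summary: This paper addresses the challenge that modality heterogeneity poses for text-to-image person retrieval (TIPR), and in particular the limits of existing cross-modal alignment methods: global alignment overlooks fine-grained differences, local alignment depends on prior part information, and current methods are mostly English-centric and hard to apply in multilingual settings. To alleviate these issues, the authors introduce a multilingual TIPR task and build a multilingual TIPR benchmark, using large language models for initial translations and refining them with domain-specific knowledge. The key to the solution is Bi-IRRA, a Bidirectional Implicit Relation Reasoning and Aligning framework. Its bidirectional implicit relation reasoning module predicts masked image and text in both directions, implicitly strengthening the modeling of local relations across languages and modalities, while a multi-dimensional global alignment module bridges the modality gap. The method achieves new state-of-the-art results on all multilingual TIPR datasets.
Link: https://arxiv.org/abs/2510.17685
Authors: Min Cao, Xinyu Zhou, Ding Jiang, Bo Du, Mang Ye, Min Zhang
Affiliations: Soochow University; Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: Final version published in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Xplore link: this https URL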
Abstract:Text-to-image person retrieval (TIPR) aims to identify the target person using textual descriptions, and its main challenge is modality heterogeneity. Prior works have attempted to address it by developing cross-modal global or local alignment strategies. However, global methods typically overlook fine-grained cross-modal differences, whereas local methods require prior information to explore explicit part alignments. Additionally, current methods are English-centric, restricting their application in multilingual contexts. To alleviate these issues, we pioneer a multilingual TIPR task by developing a multilingual TIPR benchmark, for which we leverage large language models for initial translations and refine them by integrating domain-specific knowledge. Correspondingly, we propose Bi-IRRA: a Bidirectional Implicit Relation Reasoning and Aligning framework to learn alignment across languages and modalities. Within Bi-IRRA, a bidirectional implicit relation reasoning module enables bidirectional prediction of masked image and text, implicitly enhancing the modeling of local relations across languages and modalities, while a multi-dimensional global alignment module is integrated to bridge the modality heterogeneity. The proposed method achieves new state-of-the-art results on all multilingual TIPR datasets. Data and code are available at this https URL.
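[CV-16] Intelligent Communication Mixture-of-Experts Boosted-Medical Image Segmentation Foundation Model
Summary: This paper addresses two key problems in adaptively fine-tuning existing foundation models for medical image segmentation: insufficient representation of high-level features, and fine-tuning that disrupts the structural integrity of the pretrained weights. The core of the solution is the Intelligent Communication Mixture-of-Experts (IC-MoE) architecture. It comprises basic, semantic, and adaptive experts and introduces a pixel probability adaptive voting strategy that selects and fuses experts based on label consistency and load balancing, enhancing high-level feature representation while preserving the structure of the pretrained weights. A semantic-guided contrastive learning method addresses the weak supervision signal in contrastive learning, further improving feature representation while keeping the model structure intact. Experiments show that IC-MoE outperforms other state-of-the-art models on three public medical image segmentation datasets and generalizes well.
Link: https://arxiv.org/abs/2510.17684
Authors: Xinwei Zhang, Hu Chen, Zhe Yuan, Sukun Tian, Peng Feng
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: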
Abstract:Foundation models for medical image segmentation have achieved remarkable performance. Adaptive fine-tuning of natural image segmentation foundation models is crucial for medical image segmentation tasks. However, some limitations exist in existing fine-tuning methods: 1) insufficient representation of high-level features and 2) the fine-tuning process disrupts the structural integrity of pretrained weights. Inspired by these critical problems, we propose an intelligent communication mixture-of-experts boosted-medical image segmentation foundation model, named IC-MoE, with twofold ideas: 1) We construct basic experts, semantic experts, and adaptive experts. Moreover, we implement a pixel probability adaptive voting strategy, which enables expert selection and fusion through label consistency and load balancing. This approach preliminarily enhances the representation capability of high-level features while preserving the structural integrity of pretrained weights. 2) We propose a semantic-guided contrastive learning method to address the issue of weak supervision in contrastive learning. This method further enhances the representation capability of high-level features while preserving the structural integrity of pretrained weights. Extensive experiments across three public medical image segmentation datasets demonstrate that the IC-MoE outperforms other SOTA models. Consequently, the proposed IC-MoE effectively supplements foundational medical image segmentation models with high-level features and pretrained structural integrity. We also validate the superior generalizability of the IC-MoE across diverse medical image segmentation scenarios.
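[CV-17] PICABench: How Far Are We from Physically Realistic Image Editing?
Summary: This paper addresses a key shortcoming of current image editing models: even when they complete the instruction well, the results lack physical realism. Editing operations (add, remove, attribute change, and so on) tend to ignore physical effects such as optical interactions between objects (shadows, reflections), mechanical relations, and state changes, so the outputs are not consistent with reality. The key to the solution is PICABench, a systematic benchmark covering eight sub-dimensions across optics, mechanics, and state transitions that quantifies physical realism. The authors also propose PICAEval, an evaluation protocol that improves reliability by combining a vision-language model (VLM) as judge with region-level human annotations, and they build the PICA-100K training dataset by learning physics from videos to support training for physical consistency.
Link: https://arxiv.org/abs/2510.17681
Authors: Yuandong Pu, Le Zhuo, Songhao Han, Jinbo Xing, Kaiwen Zhu, Shuo Cao, Bin Fu, Si Liu, Hongsheng Li, Yu Qiao, Wenlong Zhang, Xi Chen, Yihao Liu
Affiliations: Shanghai Jiao Tong University; Shanghai AI Laboratory; CUHK MMLab; Krea AI; Beihang University; Tongyi Lab; USTC; The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: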
Abstract:Image editing has achieved remarkable progress recently. Modern editing models can already follow complex instructions to manipulate the original content. However, beyond completing the editing instructions, the accompanying physical effects are the key to the generation realism. For example, removing an object should also remove its shadow, reflections, and interactions with nearby objects. Unfortunately, existing models and benchmarks mainly focus on instruction completion but overlook these physical effects. So, at this moment, how far are we from physically realistic image editing? To answer this, we introduce PICABench, which systematically evaluates physical realism across eight sub-dimensions (spanning optics, mechanics, and state transitions) for most of the common editing operations (add, remove, attribute change, etc.). We further propose PICAEval, a reliable evaluation protocol that uses VLM-as-a-judge with per-case, region-level human annotations and questions. Beyond benchmarking, we also explore effective solutions by learning physics from videos and construct a training dataset, PICA-100K. After evaluating most of the mainstream models, we observe that physical realism remains a challenging problem with ample room for exploration. We hope that our benchmark and proposed solutions can serve as a foundation for future work moving from naive content editing toward physically consistent realism.
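[CV-18] 4DSegStreamer: Streaming 4D Panoptic Segmentation via Dual Threads
Summary: This paper addresses the real-time performance and robustness of 4D panoptic segmentation in a streaming setting, especially in highly dynamic environments (such as evacuating dense crowds and autonomous driving in complex scenes) where fine-grained perception is needed within a limited time budget. The key to the solution is the 4DSegStreamer framework, built around a dual-thread system: a predictive thread uses historical motion and geometric information to extract features and forecast future dynamics, and an inference thread delivers timely predictions for incoming frames by aligning with the latest memory and compensating for ego-motion and the movement of dynamic objects. The design is more robust than existing streaming perception approaches under high-FPS conditions and can be integrated into existing 3D and 4D segmentation methods to give them real-time capability.
Link: https://arxiv.org/abs/2510.17664
Authors: Ling Liu, Jun Tian, Li Yi
Affiliations: Tsinghua University; Shanghai Qi Zhi Institute; Shanghai AI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: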
Abstract:4D panoptic segmentation in a streaming setting is critical for highly dynamic environments, such as evacuating dense crowds and autonomous driving in complex scenarios, where real-time, fine-grained perception within a constrained time budget is essential. In this paper, we introduce 4DSegStreamer, a novel framework that employs a Dual-Thread System to efficiently process streaming frames. The framework is general and can be seamlessly integrated into existing 3D and 4D segmentation methods to enable real-time capability. It also demonstrates superior robustness compared to existing streaming perception approaches, particularly under high FPS conditions. The system consists of a predictive thread and an inference thread. The predictive thread leverages historical motion and geometric information to extract features and forecast future dynamics. The inference thread ensures timely prediction for incoming frames by aligning with the latest memory and compensating for ego-motion and dynamic object movements. We evaluate 4DSegStreamer on the indoor HOI4D dataset and the outdoor SemanticKITTI and nuScenes datasets. Comprehensive experiments demonstrate the effectiveness of our approach, particularly in accurately predicting dynamic objects in complex scenes.
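[CV-19] Frugal Federated Learning for Violence Detection: A Comparison of LoRA-Tuned VLMs and Personalized CNNs
Summary: This paper addresses the trade-off between resource efficiency and performance in violence detection for video surveillance, in particular how to deploy federated learning that is accurate, low-energy, and sustainable under non-IID data. It compares two complementary strategies: zero-shot inference and federated fine-tuning of vision-language models (VLMs), and personalized training of a compact 3D convolutional neural network (CNN3D). CNN3D slightly outperforms LoRA-tuned VLMs in ROC AUC and log loss while using less energy, whereas VLMs remain favorable for contextual reasoning and multimodal inference. The paper therefore supports a hybrid design: a lightweight CNN3D handles routine classification and the VLM is activated only for complex or descriptive scenarios, balancing accuracy, energy efficiency, and environmental sustainability and providing a reproducible baseline for resource-aware video surveillance.
Link: https://arxiv.org/abs/2510.17651
Authors: Sébastien Thuau, Siba Haidar, Ayush Bajracharya, Rachid Chelouah
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 7 pages, 1 figure, FLTA 2025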
Abstract:We examine frugal federated learning approaches to violence detection by comparing two complementary strategies: (i) zero-shot and federated fine-tuning of vision-language models (VLMs), and (ii) personalized training of a compact 3D convolutional neural network (CNN3D). Using LLaVA-7B and a 65.8M parameter CNN3D as representative cases, we evaluate accuracy, calibration, and energy usage under realistic non-IID settings. Both approaches exceed 90% accuracy. CNN3D slightly outperforms Low-Rank Adaptation (LoRA)-tuned VLMs in ROC AUC and log loss, while using less energy. VLMs remain favorable for contextual reasoning and multimodal inference. We quantify energy and CO₂ emissions across training and inference, and analyze sustainability trade-offs for deployment. To our knowledge, this is the first comparative study of LoRA-tuned vision-language models and personalized CNNs for federated violence detection, with an emphasis on energy efficiency and environmental metrics. These findings support a hybrid model: lightweight CNNs for routine classification, with selective VLM activation for complex or descriptive scenarios. The resulting framework offers a reproducible baseline for responsible, resource-aware AI in video surveillance, with extensions toward real-time, multimodal, and lifecycle-aware systems.
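[CV-20] ZACH-ViT: A Zero-Token Vision Transformer with ShuffleStrides Data Augmentation for Robust Lung Ultrasound Classification
Summary: This paper addresses the difficulty of distinguishing cardiogenic pulmonary oedema (CPE) from non-cardiogenic inflammatory patterns (NCIP/ARDS-like), interstitial lung disease, and healthy lungs in lung ultrasound (LUS) videos. The core challenge is the high visual heterogeneity of non-cardiogenic patterns together with widely overlapping B-lines and pleural artefacts. The key to the solution is ZACH-ViT, a Vision Transformer variant with only 0.25M parameters that removes both positional embeddings and the [CLS] token, making it fully permutation-invariant and suited to unordered medical image data. ShuffleStrides Data Augmentation (SSDA) permutes probe-view sequences and frame orders while preserving anatomical validity to improve generalization. On 380 LUS videos from 95 critically ill patients, ZACH-ViT achieves the best ROC-AUC (0.80 validation, 0.79 test) with balanced sensitivity (0.60) and specificity (0.91), and trains 1.35x faster than Minimal ViT with 2.5x fewer parameters, showing that aligning the architecture with the data structure can outperform scaling up the model in small-data medical imaging.
Link: https://arxiv.org/abs/2510.17650
Authors: Athanasios Angelakis, Amne Mousa, Micah L. A. Heldeweg, Laurens A. Biesheuvel, Mark A. Haaksma, Jasper M. Smit, Pieter R. Tuinman, Paul W. G. Elbers
Affiliations: Amsterdam UMC, University of Amsterdam & Vrije Universiteit Amsterdam, The Netherlands; University of Amsterdam, The Netherlands
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Notes: 14 pages, 6 figures, 2 tables. Primary subject: cs.LG (Machine Learning). Cross-listed to: cs.CV (Computer Vision and Pattern Recognition), eess.IV (Image and Video Processing). Code available at: this https URL. Installation: pip install zachvit. Paper licensed under CC BY-NC-ND 4.0. Code released under Apache 2.0 License.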
Abstract:Differentiating cardiogenic pulmonary oedema (CPE) from non-cardiogenic and structurally normal lungs in lung ultrasound (LUS) videos remains challenging due to the high visual variability of non-cardiogenic inflammatory patterns (NCIP/ARDS-like), interstitial lung disease, and healthy lungs. This heterogeneity complicates automated classification as overlapping B-lines and pleural artefacts are common. We introduce ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer), a 0.25 M-parameter Vision Transformer variant that removes both positional embeddings and the [CLS] token, making it fully permutation-invariant and suitable for unordered medical image data. To enhance generalization, we propose ShuffleStrides Data Augmentation (SSDA), which permutes probe-view sequences and frame orders while preserving anatomical validity. ZACH-ViT was evaluated on 380 LUS videos from 95 critically ill patients against nine state-of-the-art baselines. Despite the heterogeneity of the non-cardiogenic group, ZACH-ViT achieved the highest validation and test ROC-AUC (0.80 and 0.79) with balanced sensitivity (0.60) and specificity (0.91), while all competing models collapsed to trivial classification. It trains 1.35x faster than Minimal ViT (0.62M parameters) with 2.5x fewer parameters, supporting real-time clinical deployment. These results show that aligning architectural design with data structure can outperform scale in small-data medical imaging.
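The claim that dropping positional embeddings and the [CLS] token yields permutation invariance can be checked on a toy model. The sketch below is a generic illustration, not ZACH-ViT itself: it assumes a single attention head with identity query/key/value projections followed by mean pooling, and all names are hypothetical. Self-attention without positional information is permutation-equivariant, and mean pooling then makes the pooled output independent of token order.

```python
import math
import random


def self_attention(tokens):
    """Single-head attention with identity Q/K/V projections (a simplification)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j] for w, v in zip(weights, tokens)) / total
            for j in range(d)
        ])
    return out


def encode(tokens):
    """Attention followed by mean pooling; no positional embedding and no [CLS] token."""
    attended = self_attention(tokens)
    n = len(attended)
    return [sum(row[j] for row in attended) / n for j in range(len(attended[0]))]


rng = random.Random(0)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
shuffled = tokens[::-1]  # any reordering of the same tokens works here

a = encode(tokens)
b = encode(shuffled)
same = all(math.isclose(x, y, abs_tol=1e-9) for x, y in zip(a, b))
```

The two encodings agree up to floating-point rounding, which is why shuffling frame or view order, as SSDA does, leaves such a model's pooled representation unchanged.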
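[CV-21] Self-supervised Pre-training for Mapping of Archaeological Stone Wall in Historic Landscapes Using High-Resolution DEM Derivatives
Summary: This paper addresses the difficulty of identifying and automatically mapping low-lying dry-stone walls in vegetated areas with conventional remote sensing, particularly in Australia, where heritage value, ecosystem preservation, and wildfire management call for accurate maps. The core challenges are (1) visual occlusion by vegetation and (2) scarce labeled data for supervised learning. The key to the solution is DINO-CV, which works on high-resolution digital elevation models (DEMs) derived from airborne LiDAR and identifies walls from terrain structure rather than spectral cues, overcoming occlusion. A self-supervised cross-view pre-training strategy based on knowledge distillation learns invariant visual and geometric representations across multiple DEM derivatives and supports several backbones (ResNet, Wide ResNet, Vision Transformer). The method reaches 68.6% mean Intersection over Union (mIoU) on test areas and still maintains 63.8% mIoU when fine-tuned with only 10% of the labeled data.
Link: https://arxiv.org/abs/2510.17644
Authors: Zexian Huang, Mashnoon Islam, Brian Armstrong, Kourosh Khoshelham, Martin Tomko
Affiliations: University of Melbourne
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: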
Abstract:Dry-stone walls hold significant heritage and environmental value. Mapping these structures is essential for ecosystem preservation and wildfire management in Australia. Yet, many walls remain unidentified due to their inaccessibility and the high cost of manual mapping. Deep learning-based segmentation offers a scalable solution, but two major challenges persist: (1) visual occlusion of low-lying walls by dense vegetation, and (2) limited labeled data for supervised training. We propose DINO-CV, a segmentation framework for automatic mapping of low-lying dry-stone walls using high-resolution Airborne LiDAR-derived digital elevation models (DEMs). DEMs overcome visual occlusion by capturing terrain structures hidden beneath vegetation, enabling analysis of structural rather than spectral cues. DINO-CV introduces a self-supervised cross-view pre-training strategy based on knowledge distillation to mitigate data scarcity. It learns invariant visual and geometric representations across multiple DEM derivatives, supporting various vision backbones including ResNet, Wide ResNet, and Vision Transformers. Applied to the UNESCO World Heritage cultural landscape of Budj Bim, Victoria, the method identifies one of Australia’s densest collections of colonial dry-stone walls beyond Indigenous heritage contexts. DINO-CV achieves a mean Intersection over Union (mIoU) of 68.6% on test areas and maintains 63.8% mIoU when fine-tuned with only 10% labeled data. These results demonstrate the potential of self-supervised learning on high-resolution DEM derivatives for automated dry-stone wall mapping in vegetated and heritage-rich environments with scarce annotations.
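For readers unfamiliar with the metric reported above, mean Intersection over Union can be computed as follows. This is a minimal sketch over flat label lists; how the paper aggregates over tiles or test areas is not stated in the abstract, so the per-class averaging here is an assumption, and the labels (0 for background, 1 for wall) are hypothetical.

```python
def mean_iou(pred, target, num_classes):
    """Average, over classes present in pred or target, of intersection / union."""
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and t == c for p, t in zip(pred, target))
        union = sum(p == c or t == c for p, t in zip(pred, target))
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class in range(num_classes) occurs in pred or target")
    return sum(ious) / len(ious)


# Hypothetical per-pixel labels: 0 = background, 1 = wall.
pred = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]
score = mean_iou(pred, target, num_classes=2)
```

In this toy case each class has an intersection of 2 pixels and a union of 4, so both per-class IoUs and their mean are 0.5.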
[CV-22] CaMiT: A Time-Aware Car Model Dataset for Classification and Generation NEURIPS2025
Summary: This paper addresses how visual recognition and generation models adapt to object categories that evolve over time (such as car models), where accuracy declines when models are tested across years. The core challenge is to keep learning fine-grained categories over the long term with limited resources while staying robust to emerging, evolving, and disappearing classes. The key to the solution is a time-incremental classification setting with two strategies: time-incremental pretraining, which updates the backbone to track changes over time, and time-incremental classifier learning, which updates only the final classification layer to reduce compute. Both improve temporal robustness. The paper also explores time-aware image generation that uses temporal metadata during training and yields more realistic outputs.
Link: https://arxiv.org/abs/2510.17626
Authors: Frédéric LIN, Biruk Abere Ambaw, Adrian Popescu, Hejer Ammar, Romaric Audigier, Hervé Le Borgne (Université Paris-Saclay, CEA, List, F-91120, Palaiseau, France)
Affiliations: Université Paris-Saclay; CEA; List
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: To be published in NeurIPS 2025 Track on Datasets and Benchmarks
Abstract:AI systems must adapt to evolving visual environments, especially in domains where object appearances change over time. We introduce Car Models in Time (CaMiT), a fine-grained dataset capturing the temporal evolution of car models, a representative class of technological artifacts. CaMiT includes 787K labeled samples of 190 car models (2007-2023) and 5.1M unlabeled samples (2005-2023), supporting both supervised and self-supervised learning. Static pretraining on in-domain data achieves competitive performance with large-scale generalist models while being more resource-efficient, yet accuracy declines when models are tested across years. To address this, we propose a time-incremental classification setting, a realistic continual learning scenario with emerging, evolving, and disappearing classes. We evaluate two strategies: time-incremental pretraining, which updates the backbone, and time-incremental classifier learning, which updates only the final layer, both improving temporal robustness. Finally, we explore time-aware image generation that leverages temporal metadata during training, yielding more realistic outputs. CaMiT offers a rich benchmark for studying temporal adaptation in fine-grained visual recognition and generation.
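[CV-23] ImaGGen: Zero-Shot Generation of Co-Speech Semantic Gestures Grounded in Language and Image Input
Summary: This paper tackles a core challenge in co-speech gesture synthesis: generating iconic or deictic gestures that are semantically coherent with the utterance, rather than only rhythmic, repetitive beat gestures. Existing methods are limited because language input alone lacks visual semantic information, so gestures with a clear referential or depictive function are hard to produce. The key to the solution is a zero-shot system that uses not only language input but also an image analysis module that extracts key object properties (such as shape, symmetry, and alignment). A semantic matching module links these visual features to the spoken content, and an inverse kinematics engine synthesizes iconic and deictic gestures and combines them with natural beat gestures into coherent multimodal behavior.
Link: https://arxiv.org/abs/2510.17617
Authors: Hendric Voss, Stefan Kopp
Affiliations: Bielefeld University
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Notes: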
Abstract:Human communication combines speech with expressive nonverbal cues such as hand gestures that serve manifold communicative functions. Yet, current generative gesture generation approaches are restricted to simple, repetitive beat gestures that accompany the rhythm of speaking but do not contribute to communicating semantic meaning. This paper tackles a core challenge in co-speech gesture synthesis: generating iconic or deictic gestures that are semantically coherent with a verbal utterance. Such gestures cannot be derived from language input alone, which inherently lacks the visual meaning that is often carried autonomously by gestures. We therefore introduce a zero-shot system that generates gestures from a given language input and additionally is informed by imagistic input, without manual annotation or human intervention. Our method integrates an image analysis pipeline that extracts key object properties such as shape, symmetry, and alignment, together with a semantic matching module that links these visual details to spoken text. An inverse kinematics engine then synthesizes iconic and deictic gestures and combines them with co-generated natural beat gestures for coherent multimodal communication. A comprehensive user study demonstrates the effectiveness of our approach. In scenarios where speech alone was ambiguous, gestures generated by our system significantly improved participants’ ability to identify object properties, confirming their interpretability and communicative value. While challenges remain in representing complex shapes, our results highlight the importance of context-aware semantic gestures for creating expressive and collaborative virtual agents or avatars, marking a substantial step forward towards efficient and robust, embodied human-agent interaction. More information and example videos are available here: this https URL
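[CV-24] One Dinomaly2 Detect Them All: A Unified Framework for Full-Spectrum Unsupervised Anomaly Detection CVPR2025
Summary: This paper addresses two problems in unsupervised anomaly detection (UAD): multi-class models significantly underperform single-class models, and the field is fragmented, so existing methods are hard to deploy uniformly and with high performance across data modalities (2D, multi-view, RGB-3D, RGB-IR), task settings (single-class, multi-class, few-shot, and others), and application domains (industrial, biological, outdoor). The key to the solution is Dinomaly2, the first unified framework for full-spectrum image UAD. Following a "less is more" design philosophy, it achieves superior performance with just five simple components in a standard reconstruction-based framework. This minimalism narrows the gap between multi-class models and the most advanced single-class models and extends across tasks and modalities without modification, establishing simplicity as the foundation of true universality.
Link: https://arxiv.org/abs/2510.17611
Authors: Jia Guo, Shuai Lu, Lei Fan, Zelin Li, Donglin Di, Yang Song, Weihang Zhang, Wenbing Zhu, Hong Yan, Fang Chen, Huiqi Li, Hongen Liao
Affiliations: Tsinghua University; Beijing Institute of Technology; Shanghai Jiao Tong University; City University of Hong Kong; University of New South Wales; DZ Matrix; Fudan University; Rongcheer Co., Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Extended version of CVPR2025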
Abstract:Unsupervised anomaly detection (UAD) has evolved from building specialized single-class models to unified multi-class models, yet existing multi-class models significantly underperform the most advanced one-for-one counterparts. Moreover, the field has fragmented into specialized methods tailored to specific scenarios (multi-class, 3D, few-shot, etc.), creating deployment barriers and highlighting the need for a unified solution. In this paper, we present Dinomaly2, the first unified framework for full-spectrum image UAD, which bridges the performance gap in multi-class models while seamlessly extending across diverse data modalities and task settings. Guided by the “less is more” philosophy, we demonstrate that the orchestration of five simple elements achieves superior performance in a standard reconstruction-based framework. This methodological minimalism enables natural extension across diverse tasks without modification, establishing that simplicity is the foundation of true universality. Extensive experiments on 12 UAD benchmarks demonstrate Dinomaly2’s full-spectrum superiority across multiple modalities (2D, multi-view, RGB-3D, RGB-IR), task settings (single-class, multi-class, inference-unified multi-class, few-shot) and application domains (industrial, biological, outdoor). For example, our multi-class model achieves unprecedented 99.9% and 99.3% image-level (I-) AUROC on MVTec-AD and VisA respectively. For multi-view and multi-modal inspection, Dinomaly2 demonstrates state-of-the-art performance with minimum adaptations. Moreover, using only 8 normal examples per class, our method surpasses previous full-shot models, achieving 98.7% and 97.4% I-AUROC on MVTec-AD and VisA. The combination of minimalistic design, computational scalability, and universal applicability positions Dinomaly2 as a unified solution for the full spectrum of real-world anomaly detection applications.
zh
[CV-25] Integrating BIM and UAV-based photogrammetry for Automated 3D Structure Model Segmentation
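[Quick read] This paper addresses the difficulty of automatically segmenting specific structural components from high-resolution 3D point clouds captured by unmanned aerial vehicles (UAVs) for structural health monitoring, a step that traditionally relies on time-consuming and error-prone manual labeling. The key to the solution is a machine learning-based automated segmentation framework that combines real UAV-scanned point clouds with synthetic data generated from Building Information Modeling (BIM), using their complementary strengths to overcome the limits of manual labeling. Experiments on a railroad track dataset show that the method identifies and segments major components such as rails and crossties with high accuracy, and that supplementing a small amount of real data with BIM synthetic data significantly shortens training time while maintaining reasonable segmentation accuracy. This improves the precision and efficiency of 3D infrastructure model segmentation and advances the integration of UAV and BIM technologies in structural health monitoring.
Link: https://arxiv.org/abs/2510.17609
Authors: Siqi Chen, Shanyue Guan
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: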
Abstract:The advancement of UAV technology has enabled efficient, non-contact structural health monitoring. Combined with photogrammetry, UAVs can capture high-resolution scans and reconstruct detailed 3D models of infrastructure. However, a key challenge remains in segmenting specific structural components from these models, a process traditionally reliant on time-consuming and error-prone manual labeling. To address this issue, we propose a machine learning-based framework for automated segmentation of 3D point clouds. Our approach uses the complementary strengths of real-world UAV-scanned point clouds and synthetic data generated from Building Information Modeling (BIM) to overcome the limitations associated with manual labeling. Validation on a railroad track dataset demonstrated high accuracy in identifying and segmenting major components such as rails and crossties. Moreover, by using smaller-scale datasets supplemented with BIM data, the framework significantly reduced training time while maintaining reasonable segmentation accuracy. This automated approach improves the precision and efficiency of 3D infrastructure model segmentation and advances the integration of UAV and BIM technologies in structural health monitoring and infrastructure management.
[CV-26] ShapeCraft: LLM Agents for Structured, Textured and Interactive 3D Modeling NEURIPS2025
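[Quick read] This paper targets two core problems in existing text-to-3D generation methods: the generated 3D meshes are usually unstructured and hard to fit into artistic workflows, and the results offer poor interactivity, which limits their practical usability. The key to the solution is a Graph-based Procedural Shape (GPS) representation that decomposes complex natural language instructions into a graph of sub-tasks carrying spatial relationships and semantic information. This improves the large language model's (LLM) understanding of the input, and a multi-agent collaboration mechanism then performs hierarchical, iterative procedural modeling and texture painting, producing structured 3D assets that are geometrically accurate, semantically rich, and interactive.
Link: https://arxiv.org/abs/2510.17603
Authors: Shuyuan Zhang, Chenhan Jiang, Zuoou Li, Jiankang Deng
Affiliations: Imperial College London; Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2025 Poster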
Abstract:3D generation from natural language offers significant potential to reduce expert manual modeling efforts and enhance accessibility to 3D assets. However, existing methods often yield unstructured meshes and exhibit poor interactivity, making them impractical for artistic workflows. To address these limitations, we represent 3D assets as shape programs and introduce ShapeCraft, a novel multi-agent framework for text-to-3D generation. At its core, we propose a Graph-based Procedural Shape (GPS) representation that decomposes complex natural language into a structured graph of sub-tasks, thereby facilitating accurate LLM comprehension and interpretation of spatial relationships and semantic shape details. Specifically, LLM agents hierarchically parse user input to initialize GPS, then iteratively refine procedural modeling and painting to produce structured, textured, and interactive 3D assets. Qualitative and quantitative experiments demonstrate ShapeCraft’s superior performance in generating geometrically accurate and semantically rich 3D assets compared to existing LLM-based agents. We further show the versatility of ShapeCraft through examples of animated and user-customized editing, highlighting its potential for broader interactive applications.
[CV-27] Conveying Meaning through Gestures: An Investigation into Semantic Co-Speech Gesture Generation
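[Quick read] This paper studies how generative AI can produce co-speech gestures that are synchronized with speech, with the aim of improving how well gestures convey semantic information and of evaluating how humans perceive the generated gestures. It examines two frameworks, AQ-GT and its semantically augmented variant AQ-GT-a: the former learns from context without explicit semantic input, while the latter adds semantic annotations to strengthen semantic expression. The experiments show that AQ-GT-a generalizes better to concepts such as shape and size in novel contexts and is rated as more expressive, whereas AQ-GT conveys core meaning better within its training domain. Semantic enrichment is therefore not a universal improvement but involves a trade-off between specialization and generalization.
Link: https://arxiv.org/abs/2510.17599
Authors: Hendric Voss, Lisa Michelle Bohnenkamp, Stefan Kopp
Affiliations: Bielefeld University; GID GeoInformationsDienst GmbH
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Comments: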
Abstract:This study explores two frameworks for co-speech gesture generation, AQ-GT and its semantically-augmented variant AQ-GT-a, to evaluate their ability to convey meaning through gestures and how humans perceive the resulting movements. Using sentences from the SAGA spatial communication corpus, contextually similar sentences, and novel movement-focused sentences, we conducted a user-centered evaluation of concept recognition and human-likeness. Results revealed a nuanced relationship between semantic annotations and performance. The original AQ-GT framework, lacking explicit semantic input, was surprisingly more effective at conveying concepts within its training domain. Conversely, the AQ-GT-a framework demonstrated better generalization, particularly for representing shape and size in novel contexts. While participants rated gestures from AQ-GT-a as more expressive and helpful, they did not perceive them as more human-like. These findings suggest that explicit semantic enrichment does not guarantee improved gesture generation and that its effectiveness is highly dependent on the context, indicating a potential trade-off between specialization and generalization.
[CV-28] Expose Camouflage in the Water: Underwater Camouflaged Instance Segmentation and Dataset
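[Quick read] This paper addresses underwater camouflaged instance segmentation (UCIS), where color distortion, low contrast, and blurring in underwater environments cause conventional methods trained on terrestrial-dominated datasets to underperform. The key to the solution is UCIS4K, the first dataset for underwater camouflaged instance segmentation (3,953 images of marine organisms with instance-level annotations), together with UCIS-SAM, a network built on the Segment Anything Model (SAM) with three core modules: the Channel Balance Optimization Module (CBOM) strengthens underwater feature learning; the Frequency Domain True Integration Module (FDTIM) emphasizes intrinsic object features and suppresses camouflage interference; and the Multi-scale Feature Frequency Aggregation Module (MFFAM) reinforces the boundaries of low-contrast camouflaged targets across multiple frequency bands, improving segmentation accuracy.
Link: https://arxiv.org/abs/2510.17585
Authors: Chuhong Wang, Hua Li, Chongyi Li, Huazhong Liu, Xiongxin Tang, Sam Kwong
Affiliations: Hainan University; Guangdong Ocean University; Nankai University; Chinese Academy of Science; Lingnan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: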
Abstract:With the development of underwater exploration and marine protection, underwater vision tasks are widespread. Due to the degraded underwater environment, characterized by color distortion, low contrast, and blurring, camouflaged instance segmentation (CIS) faces greater challenges in accurately segmenting objects that blend closely with their surroundings. Traditional camouflaged instance segmentation methods, trained on terrestrial-dominated datasets with limited underwater samples, may exhibit inadequate performance in underwater scenes. To address these issues, we introduce the first underwater camouflaged instance segmentation (UCIS) dataset, abbreviated as UCIS4K, which comprises 3,953 images of camouflaged marine organisms with instance-level annotations. In addition, we propose an Underwater Camouflaged Instance Segmentation network based on Segment Anything Model (UCIS-SAM). Our UCIS-SAM includes three key modules. First, the Channel Balance Optimization Module (CBOM) enhances channel characteristics to improve underwater feature learning, effectively addressing the model’s limited understanding of underwater environments. Second, the Frequency Domain True Integration Module (FDTIM) is proposed to emphasize intrinsic object features and reduce interference from camouflage patterns, enhancing the segmentation performance of camouflaged objects blending with their surroundings. Finally, the Multi-scale Feature Frequency Aggregation Module (MFFAM) is designed to strengthen the boundaries of low-contrast camouflaged instances across multiple frequency bands, improving the model’s ability to achieve more precise segmentation of camouflaged objects. Extensive experiments on the proposed UCIS4K and public benchmarks show that our UCIS-SAM outperforms state-of-the-art approaches.
[CV-29] PAGE-4D: Disentangled Pose and Geometry Estimation for 4D Perception
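[Quick read] This paper addresses the performance drop of current 3D feed-forward models such as VGGT in dynamic scenes, where complex dynamic elements such as moving humans or deformable objects (for example, umbrellas) make accurate camera pose estimation, depth prediction, and point cloud reconstruction difficult. The key to the solution is a dynamics-aware aggregator that predicts a dynamics-aware mask to disentangle static and dynamic information: motion cues are suppressed for camera pose estimation and amplified for geometry reconstruction, which eases the task conflict that limits multi-task 4D reconstruction.
Link: https://arxiv.org/abs/2510.17568
Authors: Kaichen Zhou, Yuhan Wang, Grace Chen, Xinhai Chang, Gaspard Beaudouin, Fangneng Zhan, Paul Pu Liang, Mengyu Wang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: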
Abstract:Recent 3D feed-forward models, such as the Visual Geometry Grounded Transformer (VGGT), have shown strong capability in inferring 3D attributes of static scenes. However, since they are typically trained on static datasets, these models often struggle in real-world scenarios involving complex dynamic elements, such as moving humans or deformable objects like umbrellas. To address this limitation, we introduce PAGE-4D, a feedforward model that extends VGGT to dynamic scenes, enabling camera pose estimation, depth prediction, and point cloud reconstruction – all without post-processing. A central challenge in multi-task 4D reconstruction is the inherent conflict between tasks: accurate camera pose estimation requires suppressing dynamic regions, while geometry reconstruction requires modeling them. To resolve this tension, we propose a dynamics-aware aggregator that disentangles static and dynamic information by predicting a dynamics-aware mask – suppressing motion cues for pose estimation while amplifying them for geometry reconstruction. Extensive experiments show that PAGE-4D consistently outperforms the original VGGT in dynamic scenarios, achieving superior results in camera pose estimation, monocular and video depth estimation, and dense point map reconstruction.
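The suppress-versus-amplify idea in this abstract can be illustrated with a toy example. The sketch below is not the PAGE-4D aggregator: it assumes a per-token mask in [0, 1] and two simple multiplicative gates, `1 - m` for the pose branch and `1 + m` for the geometry branch. The function name, the gates, and the sample values are all hypothetical.

```python
def apply_dynamics_mask(features: list[float], mask: list[float]):
    """Return (pose_features, geometry_features) for one feature channel.

    mask[i] in [0, 1] is an assumed probability that token i is dynamic.
    The two gates below are illustrative choices, not the paper's design.
    """
    pose = [f * (1.0 - m) for f, m in zip(features, mask)]      # suppress motion
    geometry = [f * (1.0 + m) for f, m in zip(features, mask)]  # amplify motion
    return pose, geometry


features = [2.0, 2.0, 2.0, 2.0]  # hypothetical token activations
mask = [0.0, 0.0, 1.0, 0.5]      # tokens 2 and 3 lie on a moving object
pose, geometry = apply_dynamics_mask(features, mask)
print("pose branch:    ", pose)
print("geometry branch:", geometry)
```

Static tokens pass through both branches unchanged, while dynamic tokens are attenuated for pose and boosted for geometry, which is the trade-off the abstract describes.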
[CV-30] WP-CrackNet: A Collaborative Adversarial Learning Framework for End-to-End Weakly-Supervised Road Crack Detection
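[Quick read] This paper addresses the reliance of road crack detection on costly pixel-level annotations and proposes WP-CrackNet, an end-to-end weakly-supervised method that achieves pixel-wise crack detection using only image-level labels. Its core is a three-part collaborative framework consisting of a classifier (which generates class activation maps, CAMs), a reconstructor (which measures feature inferability), and a detector (which outputs pixel-wise crack results). Adversarial learning between the classifier and the reconstructor pushes the CAMs to cover complete crack regions, while the detector is trained on pseudo labels derived from post-processed CAMs, forming a mutual feedback loop among the three that improves learning stability and detection accuracy. In addition, a path-aware attention module (PAAM) fuses high-level semantics with low-level structural cues, and a center-enhanced CAM consistency module (CECCM) refines CAM quality through Gaussian weighting and consistency constraints, improving pseudo-label reliability and final detection performance.
Link: https://arxiv.org/abs/2510.17566
Authors: Nachuan Ma, Zhengfei Song, Qiang Hu, Xiaoyu Tang, Chengxi Zhang, Rui Fan, Lihua Xie
Affiliations: Tongji University; South China Normal University; Jiangnan University; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: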
Abstract:Road crack detection is essential for intelligent infrastructure maintenance in smart cities. To reduce reliance on costly pixel-level annotations, we propose WP-CrackNet, an end-to-end weakly-supervised method that trains with only image-level labels for pixel-wise crack detection. WP-CrackNet integrates three components: a classifier generating class activation maps (CAMs), a reconstructor measuring feature inferability, and a detector producing pixel-wise road crack detection results. During training, the classifier and reconstructor alternate in adversarial learning to encourage crack CAMs to cover complete crack regions, while the detector learns from pseudo labels derived from post-processed crack CAMs. This mutual feedback among the three components improves learning stability and detection accuracy. To further boost detection performance, we design a path-aware attention module (PAAM) that fuses high-level semantics from the classifier with low-level structural cues from the reconstructor by modeling spatial and channel-wise dependencies. Additionally, a center-enhanced CAM consistency module (CECCM) is proposed to refine crack CAMs using center Gaussian weighting and consistency constraints, enabling better pseudo-label generation. We create three image-level datasets and extensive experiments show that WP-CrackNet achieves comparable results to supervised methods and outperforms existing weakly-supervised methods, significantly advancing scalable road inspection. The source code package and datasets are available at this https URL.
[CV-31] MambaX-Net: Dual-Input Mamba-Enhanced Cross-Attention Network for Longitudinal MRI Segmentation
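[Quick read] This paper addresses automatic segmentation in the Active Surveillance (AS) setting for prostate cancer (PCa), in particular longitudinal MRI data with multiple time points but scarce expert annotations. Existing deep learning models are usually trained on single-time-point, expert-annotated datasets and do not suit the temporal analysis that AS requires. The key to the solution is MambaX-Net, a semi-supervised, dual-scan 3D segmentation architecture with two main innovations: (i) a Mamba-enhanced Cross-Attention Module, which embeds Mamba blocks into cross-attention to efficiently capture temporal evolution and long-range spatial dependencies; and (ii) a Shape Extractor Module, which encodes the segmentation mask from the previous time point into a latent anatomical representation to refine zone boundaries. A self-training strategy based on pseudo-labels generated by a pre-trained nnU-Net provides robust performance with limited annotated data. Experiments show that the method significantly outperforms mainstream U-Net and Transformer models on longitudinal AS data.
Link: https://arxiv.org/abs/2510.17529
Authors: Yovin Yahathugoda, Davide Prezzi, Piyalitt Ittichaiwong, Vicky Goh, Sebastien Ourselin, Michela Antonelli
Affiliations: King’s College London; Guy’s and St Thomas’ NHS Foundation Trust
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: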
Abstract:Active Surveillance (AS) is a treatment option for managing low and intermediate-risk prostate cancer (PCa), aiming to avoid overtreatment while monitoring disease progression through serial MRI and clinical follow-up. Accurate prostate segmentation is an important preliminary step for automating this process, enabling automated detection and diagnosis of PCa. However, existing deep-learning segmentation models are often trained on single-time-point and expertly annotated datasets, making them unsuitable for longitudinal AS analysis, where multiple time points and a scarcity of expert labels hinder their effective fine-tuning. To address these challenges, we propose MambaX-Net, a novel semi-supervised, dual-scan 3D segmentation architecture that computes the segmentation for time point t by leveraging the MRI and the corresponding segmentation mask from the previous time point. We introduce two new components: (i) a Mamba-enhanced Cross-Attention Module, which integrates the Mamba block into cross attention to efficiently capture temporal evolution and long-range spatial dependencies, and (ii) a Shape Extractor Module that encodes the previous segmentation mask into a latent anatomical representation for refined zone delineation. Moreover, we introduce a semi-supervised self-training strategy that leverages pseudo-labels generated from a pre-trained nnU-Net, enabling effective learning without expert annotations. MambaX-Net was evaluated on a longitudinal AS dataset, and results showed that it significantly outperforms state-of-the-art U-Net and Transformer-based models, achieving superior prostate zone segmentation even when trained on limited and noisy data.
[CV-32] MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models
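[Quick read] This paper addresses the challenges of training large-scale video generation models, including cross-modal text-video alignment, long-sequence modeling, and complex spatiotemporal dependencies. The key to the solution is a training framework that optimizes four pillars (data processing, model architecture, training strategy, and infrastructure), with systematic improvements that yield efficiency and performance gains across data preprocessing, video compression, parameter scaling, curriculum-based pretraining, and alignment-focused post-training. The resulting MUG-V 10B model matches recent state-of-the-art video generators overall and surpasses leading open-source baselines on e-commerce video generation. The authors also release what they describe as the first public large-scale video generation training code built on Megatron-Core, together with the full inference pipeline, achieving high training efficiency and near-linear multi-node scaling.
Link: https://arxiv.org/abs/2510.17519
Authors: Yongshun Zhang, Zhongyi Fan, Yonghang Zhang, Zhangzikang Li, Weifeng Chen, Zhongwei Feng, Chaoyue Wang, Peng Hou, Anxiang Zeng
Affiliations: Shopee Pte. Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Technical Report; Project Page: this https URL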
Abstract:In recent years, large-scale generative models for visual content (e.g., images, videos, and 3D objects/scenes) have made remarkable progress. However, training large-scale video generation models remains particularly challenging and resource-intensive due to cross-modal text-video alignment, the long sequences involved, and the complex spatiotemporal dependencies. To address these challenges, we present a training framework that optimizes four pillars: (i) data processing, (ii) model architecture, (iii) training strategy, and (iv) infrastructure for large-scale video generation models. These optimizations delivered significant efficiency gains and performance improvements across all stages of data preprocessing, video compression, parameter scaling, curriculum-based pretraining, and alignment-focused post-training. Our resulting model, MUG-V 10B, matches recent state-of-the-art video generators overall and, on e-commerce-oriented video generation tasks, surpasses leading open-source baselines in human evaluations. More importantly, we open-source the complete stack, including model weights, Megatron-Core-based large-scale training code, and inference pipelines for video generation and enhancement. To our knowledge, this is the first public release of large-scale video generation training code that exploits Megatron-Core to achieve high training efficiency and near-linear multi-node scaling; details are available on our webpage (this https URL).
[CV-33] Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization
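[Quick read] This paper addresses how video summarization can produce compressed representations that are both faithful to the original content and semantically coherent without large amounts of annotated data. Supervised methods depend on dense labels and perform well but generalize poorly; unsupervised methods lack high-level semantic understanding; and zero-shot approaches based on large language models (LLMs) need no training but are sensitive to prompt design and lack stability. The key to the solution is a rubric-guided, pseudo-labeled prompting framework: a small set of human annotations is converted into high-confidence pseudo labels and aggregated into structured, dataset-adaptive scoring rubrics for interpretable scene evaluation. At inference, boundary scenes are scored from their own descriptions only, while intermediate scenes also receive summaries of adjacent segments to judge progression and redundancy, so the LLM can balance local salience with global coherence without adjusting model parameters, improving the stability and effectiveness of zero-shot video summarization.
Link: https://arxiv.org/abs/2510.17501
Authors: Yuanli Wu, Long Zhang, Yue Du, Bin Li
Affiliations: Nanyang Technological University; Guilin University of Electronic Technology; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: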
Abstract:With video exploding across social media, surveillance, and education, compressing long footage into concise yet faithful surrogates is crucial. Supervised methods learn frame/shot importance from dense labels and excel in-domain, but are costly and brittle across datasets; unsupervised methods avoid labels but often miss high-level semantics and narrative cues. Recent zero-shot pipelines use LLMs for training-free summarization, yet remain sensitive to handcrafted prompts and dataset-specific this http URL propose a rubric-guided, pseudo-labeled prompting framework. A small subset of human annotations is converted into high-confidence pseudo labels and aggregated into structured, dataset-adaptive scoring rubrics for interpretable scene evaluation. At inference, boundary scenes (first/last) are scored from their own descriptions, while intermediate scenes include brief summaries of adjacent segments to assess progression and redundancy, enabling the LLM to balance local salience with global coherence without parameter this http URL three benchmarks, our method is consistently effective. On SumMe and TVSum it achieves F1 of 57.58 and 63.05, surpassing a zero-shot baseline (56.73, 62.21) by +0.85 and +0.84 and approaching supervised performance. On the query-focused QFVS benchmark it attains 53.79 F1, beating 53.42 by +0.37 and remaining stable across validation videos. These results show that rubric-guided pseudo labeling, coupled with contextual prompting, stabilizes LLM-based scoring and yields a general, interpretable zero-shot paradigm for both generic and query-focused video summarization.
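The margins quoted in this abstract can be checked with a few lines of arithmetic. The sketch below only recomputes the differences between the reported F1 scores and the zero-shot baseline; the dictionary layout and variable names are illustrative and not taken from the paper.

```python
# Reported F1 scores from the abstract: (proposed method, zero-shot baseline).
reported = {
    "SumMe": (57.58, 56.73),
    "TVSum": (63.05, 62.21),
    "QFVS": (53.79, 53.42),
}

# Absolute gain over the baseline, rounded to two decimals to hide float noise.
gains = {name: round(ours - base, 2) for name, (ours, base) in reported.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} F1")
```

The recomputed gains (+0.85, +0.84, +0.37) agree with the figures stated in the abstract, all of which are below one F1 point.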
[CV-34] Split-Fuse-Transport: Annotation-Free Saliency via Dual Clustering and Optimal Transport Alignment
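[Quick read] This paper addresses unsupervised salient object detection (SOD), where low-quality pseudo-masks limit performance, and aims to reach near-supervised accuracy without pixel-level annotations. The core of the solution is the AutoSOD framework, whose key innovation is POTNet: an entropy-guided dual-clustering head clusters high-entropy pixels with spectral clustering and low-entropy pixels with k-means, then aligns the two prototype sets with optimal transport (OT). This split-fuse-transport design produces sharper, part-aware pseudo-masks in a single forward pass. These pseudo-masks directly supervise a MaskFormer-style encoder-decoder, yielding an end-to-end unsupervised SOD pipeline that does not need the offline voting used by earlier methods such as SelfMask and that improves both accuracy and training efficiency.
Link: https://arxiv.org/abs/2510.17484
Authors: Muhammad Umer Ramzan, Ali Zia, Abdelwahed Khamis, Noman Ali, Usman Ali, Wei Xiang
Affiliations: GIFT University; La Trobe University; CSIRO
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: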
Abstract:Salient object detection (SOD) aims to segment visually prominent regions in images and serves as a foundational task for various computer vision applications. We posit that SOD can now reach near-supervised accuracy without a single pixel-level label, but only when reliable pseudo-masks are available. We revisit the prototype-based line of work and make two key observations. First, boundary pixels and interior pixels obey markedly different geometry; second, the global consistency enforced by optimal transport (OT) is underutilized if prototype quality is weak. To address this, we introduce POTNet, an adaptation of Prototypical Optimal Transport that replaces POT’s single k-means step with an entropy-guided dual-clustering head: high-entropy pixels are organized by spectral clustering, low-entropy pixels by k-means, and the two prototype sets are subsequently aligned by OT. This split-fuse-transport design yields sharper, part-aware pseudo-masks in a single forward pass, without handcrafted priors. Those masks supervise a standard MaskFormer-style encoder-decoder, giving rise to AutoSOD, an end-to-end unsupervised SOD pipeline that eliminates SelfMask’s offline voting yet improves both accuracy and training efficiency. Extensive experiments on five benchmarks show that AutoSOD outperforms unsupervised methods by up to 26% and weakly supervised methods by up to 36% in F-measure, further narrowing the gap to fully supervised models.
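The "split" step of the split-fuse-transport design can be sketched as an entropy threshold on per-pixel soft assignments. This is a minimal illustration under stated assumptions, not POTNet itself: it assumes each pixel carries a probability distribution over at least two prototypes, uses normalized Shannon entropy, and picks an arbitrary threshold of 0.5. The spectral clustering, k-means, and OT alignment that follow in the paper are not shown, and all names and sample values are hypothetical.

```python
import math


def normalized_entropy(probs: list[float]) -> float:
    """Shannon entropy of a distribution, scaled to [0, 1] by log(len(probs)).

    Assumes len(probs) >= 2 so that the normalizer is non-zero.
    """
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def split_by_entropy(assignments: dict[str, list[float]], threshold: float = 0.5):
    """Route pixels into (high-entropy, low-entropy) groups by a fixed threshold."""
    high, low = [], []
    for pixel, probs in assignments.items():
        (high if normalized_entropy(probs) > threshold else low).append(pixel)
    return high, low


# Hypothetical soft assignments of four pixels to three prototypes.
assignments = {
    "p0": [0.98, 0.01, 0.01],  # confident assignment
    "p1": [0.34, 0.33, 0.33],  # highly ambiguous assignment
    "p2": [0.90, 0.05, 0.05],
    "p3": [0.50, 0.45, 0.05],
}
high, low = split_by_entropy(assignments)
print("high entropy (would go to spectral clustering):", high)
print("low entropy (would go to k-means):", low)
```

In this toy input the ambiguous pixels land in the high-entropy group and the confident ones in the low-entropy group; how the paper sets the split point is not stated in the abstract.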
[CV-35] SparseWorld: A Flexible, Adaptive, and Efficient 4D Occupancy World Model Powered by Sparse and Dynamic Queries
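[Quick read] This paper addresses the limits of existing 4D occupancy world models in perceptual flexibility, dynamic adaptability, and computational efficiency, in particular the rigid perception caused by static, fixed embeddings or grids, and the mismatch between "in-place classification" and the continuous dynamics of real scenes. The key to the solution is the SparseWorld framework, with three core innovations: (1) a Range-Adaptive Perception module that uses learnable sparse, dynamic queries modulated by the ego vehicle state and enriched with temporal-spatial associations to support extended-range perception; (2) a State-Conditioned Forecasting module that replaces classification-based forecasting with a regression-guided formulation so that dynamic queries align with the continuity of the 4D environment; and (3) a Temporal-Aware Self-Scheduling training strategy that makes training more stable and efficient. Together these designs improve performance on perception, forecasting, and planning tasks.
Link: https://arxiv.org/abs/2510.17482
Authors: Chenxu Dang, Haiyan Liu, Guangjun Bao, Pei An, Xinyue Tang, Jie Ma, Bingchuan Sun, Yan Wang
Affiliations: University of Science and Technology of China; Alibaba Group; Institute of Artificial Intelligence, Renmin University of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Under Review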
Abstract:Semantic occupancy has emerged as a powerful representation in world models for its ability to capture rich spatial semantics. However, most existing occupancy world models rely on static and fixed embeddings or grids, which inherently limit the flexibility of perception. Moreover, their "in-place classification" over grids exhibits a potential misalignment with the dynamic and continuous nature of real this http URL this paper, we propose SparseWorld, a novel 4D occupancy world model that is flexible, adaptive, and efficient, powered by sparse and dynamic queries. We propose a Range-Adaptive Perception module, in which learnable queries are modulated by the ego vehicle states and enriched with temporal-spatial associations to enable extended-range perception. To effectively capture the dynamics of the scene, we design a State-Conditioned Forecasting module, which replaces classification-based forecasting with a regression-guided formulation, precisely aligning the dynamic queries with the continuity of the 4D environment. In addition, we devise a Temporal-Aware Self-Scheduling training strategy to enable smooth and efficient training. Extensive experiments demonstrate that SparseWorld achieves state-of-the-art performance across perception, forecasting, and planning tasks. Comprehensive visualizations and ablation studies further validate the advantages of SparseWorld in terms of flexibility, adaptability, and efficiency. The code is available at this https URL.
[CV-36] Initialize to Generalize: A Stronger Initialization Pipeline for Sparse-View 3DGS
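[Quick read] This paper addresses overfitting when training sparse-view 3D Gaussian Splatting (3DGS), which causes artifacts such as blurring in novel view rendering. Controlled ablations show that initialization quality is the decisive factor that sets the attainable performance band, while training-time regularization brings only modest gains at extra computational cost. The paper therefore focuses on improving initialization with three core strategies: (i) frequency-aware Structure-from-Motion (SfM), which improves coverage of low-texture regions through low-frequency view augmentation and relaxed multi-view correspondences; (ii) 3DGS self-initialization, which uses photometric supervision to generate additional points that fill SfM-sparse regions; and (iii) point-cloud regularization, which enforces multi-view consistency and uniform spatial coverage through geometric and visibility priors, yielding a clean and reliable initial point cloud. Experiments on the LLFF and Mip-NeRF360 datasets show consistent gains in sparse-view settings.
Link: https://arxiv.org/abs/2510.17479
Authors: Feng Zhou, Wenkai Guo, Pu Cao, Zhicheng Zhang, Jianqin Yin
Affiliations: Beijing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: A preprint paper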
Abstract:Sparse-view 3D Gaussian Splatting (3DGS) often overfits to the training views, leading to artifacts like blurring in novel view rendering. Prior work addresses it either by enhancing the initialization (i.e., the point cloud from Structure-from-Motion (SfM)) or by adding training-time constraints (regularization) to the 3DGS optimization. Yet our controlled ablations reveal that initialization is the decisive factor: it determines the attainable performance band in sparse-view 3DGS, while training-time constraints yield only modest within-band improvements at extra cost. Given initialization’s primacy, we focus our design there. Although SfM performs poorly under sparse views due to its reliance on feature matching, it still provides reliable seed points. Thus, building on SfM, our effort aims to supplement the regions it fails to cover as comprehensively as possible. Specifically, we design: (i) frequency-aware SfM that improves low-texture coverage via low-frequency view augmentation and relaxed multi-view correspondences; (ii) 3DGS self-initialization that lifts photometric supervision into additional points, compensating SfM-sparse regions with learned Gaussian centers; and (iii) point-cloud regularization that enforces multi-view consistency and uniform spatial coverage through simple geometric/visibility priors, yielding a clean and reliable point cloud. Our experiments on LLFF and Mip-NeRF360 demonstrate consistent gains in sparse-view settings, establishing our approach as a stronger initialization strategy. Code is available at this https URL.
[CV-37] Rethinking Nighttime Image Deraining via Learnable Color Space Transformation NEURIPS2025
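[Quick read] This paper addresses two core problems in nighttime image deraining: existing datasets lack high-quality, realistic samples that capture the coupling between rain and illumination, which limits model training; and nighttime rain is not prominent in the RGB color space, which makes rain streaks hard to separate. The key to the solution is a new benchmark dataset, HQ-NightRain, which offers better harmony and realism than existing datasets, together with a Color Space Transformation Network (CST-Net). CST-Net's core innovations are: (1) a learnable color space converter (CSC) that maps the image into YUV space and removes rain in the luminance (Y) channel, where nighttime rain is more pronounced; and (2) an implicit illumination guidance mechanism that lets the model capture illumination information at the feature level, improving robustness in complex scenes.
Link: https://arxiv.org/abs/2510.17440
Authors: Qiyuan Guan, Xiang Chen, Guiyue Jin, Jiyu Jin, Shumin Fan, Tianyu Song, Jinshan Pan
Affiliations: Dalian Polytechnic University; Nanjing University of Science and Technology; Dalian Maritime University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2025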
Abstract:Compared to daytime image deraining, nighttime image deraining poses significant challenges due to inherent complexities of nighttime scenarios and the lack of high-quality datasets that accurately represent the coupling effect between rain and illumination. In this paper, we rethink the task of nighttime image deraining and contribute a new high-quality benchmark, HQ-NightRain, which offers higher harmony and realism compared to existing datasets. In addition, we develop an effective Color Space Transformation Network (CST-Net) for better removing complex rain from nighttime scenes. Specifically, we propose a learnable color space converter (CSC) to better facilitate rain removal in the Y channel, as nighttime rain is more pronounced in the Y channel compared to the RGB color space. To capture illumination information for guiding nighttime deraining, implicit illumination guidance is introduced enabling the learned features to improve the model’s robustness in complex scenarios. Extensive experiments show the value of our dataset and the effectiveness of our method. The source code and datasets are available at this https URL.
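To see why working in the Y channel can help, consider a fixed RGB-to-YUV conversion. The sketch below is only a stand-in: the paper's converter is learnable, whereas this example uses the standard BT.601 weights, and it models a rain streak as a gray value added equally to all three channels, which is a simplifying assumption. The pixel values and the `streak` intensity are hypothetical.

```python
def rgb_to_yuv(r: float, g: float, b: float) -> tuple[float, float, float]:
    """Convert one RGB pixel (components in [0, 1]) to YUV with BT.601 weights."""
    y = 0.299 * r + 0.587 * g + 0.114 * b  # luminance
    u = 0.492 * (b - y)                    # blue-difference chroma
    v = 0.877 * (r - y)                    # red-difference chroma
    return y, u, v


# A dark bluish night pixel and the same pixel brightened by a gray rain streak.
background = (0.05, 0.06, 0.12)
streak = 0.30  # hypothetical additive gray intensity of the streak
rainy = tuple(c + streak for c in background)

y0, u0, v0 = rgb_to_yuv(*background)
y1, u1, v1 = rgb_to_yuv(*rainy)
print(f"change in Y: {y1 - y0:.3f}")
print(f"change in U: {abs(u1 - u0):.3f}")
print(f"change in V: {abs(v1 - v0):.3f}")
```

Because the luma weights sum to one, an achromatic additive streak shifts Y by its full intensity and leaves U and V essentially unchanged, so under this toy model the streak is concentrated in the Y channel.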
[CV-38] From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors
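[Quick read] This paper addresses the spatial reasoning gap of existing vision-language-action (VLA) models when acting in the 3D real world, a gap that limits generalization and adaptability. Current methods are mostly built on 2D encoders and capture geometry poorly, and existing 3D integration schemes either depend on specialized sensors and transfer poorly across modalities, or inject only weak cues and weaken vision-language alignment. The key to the solution is the FALCON (From Spatial to Action) paradigm: it injects rich 3D spatial tokens into the action head, uses spatial foundation models to extract strong geometric priors from monocular RGB images, and adds an Embodied Spatial Model that can optionally fuse depth or pose to improve fidelity without retraining or architectural changes. To preserve language reasoning, the spatial tokens are processed by a Spatial-Enhanced Action Head rather than concatenated into the vision-language backbone, which improves spatial representation, modality transferability, and alignment.
Link: https://arxiv.org/abs/2510.17439
Authors: Zhengshen Zhang, Hao Li, Yalun Dai, Zhengbang Zhu, Lei Zhou, Chenchen Liu, Dong Wang, Francis E. H. Tay, Sijin Chen, Ziwei Liu, Yuxiao Liu, Xinghang Li, Pan Zhou
Affiliations: ByteDance; Tsinghua University; Singapore Management University
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL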
Abstract:Existing vision-language-action (VLA) models act in the 3D real world but are typically built on 2D encoders, leaving a spatial reasoning gap that limits generalization and adaptability. Recent 3D integration techniques for VLAs either require specialized sensors and transfer poorly across modalities, or inject weak cues that lack geometry and degrade vision-language alignment. In this work, we introduce FALCON (From Spatial to Action), a novel paradigm that injects rich 3D spatial tokens into the action head. FALCON leverages spatial foundation models to deliver strong geometric priors from RGB alone, and includes an Embodied Spatial Model that can optionally fuse depth or pose for higher fidelity when available, without retraining or architectural changes. To preserve language reasoning, spatial tokens are consumed by a Spatial-Enhanced Action Head rather than being concatenated into the vision-language backbone. These designs enable FALCON to address limitations in spatial representation, modality transferability, and alignment. In comprehensive evaluations across three simulation benchmarks and eleven real-world tasks, our proposed FALCON achieves state-of-the-art performance, consistently surpasses competitive baselines, and remains robust under clutter, spatial-prompt conditioning, and variations in object scale and height.
[CV-39] Leveraging AV1 motion vectors for Fast and Dense Feature Matching
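[Quick read] This paper addresses the high compute cost of feature matching in video, which makes it hard to support large-scale Structure from Motion (SfM) efficiently. The key to the solution is to reuse the motion vectors (MVs) already produced by AV1 encoding, reconstructing sub-pixel correspondences and short tracks and filtering them by cosine consistency to obtain dense matches in the compressed domain. The method uses far less CPU while keeping pairwise geometry comparable to sequential SIFT, and on short video clips it yields denser matches that give the subsequent Bundle Adjustment (BA) a dense starting point, indicating good scalability and resource efficiency.
Link: https://arxiv.org/abs/2510.17434
Authors: Julien Zouein, Hossein Javidnia, François Pitié, Anil Kokaram
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted ICIR 2025, camera-ready version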
Abstract:We repurpose AV1 motion vectors to produce dense sub-pixel correspondences and short tracks filtered by cosine consistency. On short videos, this compressed-domain front end runs comparably to sequential SIFT while using far less CPU, and yields denser matches with competitive pairwise geometry. As a small SfM demo on a 117-frame clip, MV matches register all images and reconstruct 0.46-0.62M points at 0.51-0.53 px reprojection error; BA time grows with match density. These results show compressed-domain correspondences are a practical, resource-efficient front end with clear paths to scaling in full pipelines.
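The abstract does not spell out how cosine consistency is applied, so the following is one plausible reading rather than the authors' filter: a track is kept only if every pair of consecutive motion vectors points in a similar direction. The function names, the 0.9 threshold, and the sample tracks are hypothetical, and extracting motion vectors from an AV1 bitstream is outside the scope of this sketch.

```python
import math

Vec = tuple[float, float]


def cosine(a: Vec, b: Vec) -> float:
    """Cosine similarity of two 2D vectors; 0.0 if either has zero length."""
    na, nb = math.hypot(*a), math.hypot(*b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    return (a[0] * b[0] + a[1] * b[1]) / (na * nb)


def is_consistent(track: list[Vec], min_cos: float = 0.9) -> bool:
    """True if every pair of consecutive motion vectors agrees in direction."""
    return all(cosine(a, b) >= min_cos for a, b in zip(track, track[1:]))


# Each track is a list of per-frame motion vectors (dx, dy) in pixels.
tracks = {
    "smooth": [(1.0, 0.25), (1.25, 0.25), (1.0, 0.5)],
    "erratic": [(1.0, 0.0), (-0.75, 0.5), (0.25, -1.0)],
}
kept = [name for name, mvs in tracks.items() if is_consistent(mvs)]
print(kept)
```

A direction-only test like this ignores magnitude, so a practical filter would probably combine it with a bound on displacement; that choice is left open here.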
[CV-40] DeepDetect: Learning All-in-One Dense Keypoints
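[Quick read]: This paper targets the limitations of traditional keypoint detectors (SIFT, SURF, ORB, etc.) and learning-based ones (SuperPoint, R2D2, etc.): sensitivity to photometric changes, low keypoint density and repeatability, weak adaptability to challenging scenes, and a lack of semantic understanding, which often leaves visually important regions unprioritized. The key to the solution is DeepDetect, a unified deep learning framework. Ground-truth masks are built by fusing the outputs of 7 keypoint detectors and 2 edge detectors, capturing diverse visual cues such as corners, blobs, prominent edges, and textures. A lightweight, efficient ESPNet model is then trained with these masks as supervision, so that DeepDetect produces highly dense keypoints with semantic focus and stays robust and adaptable on complex and degraded images.
Link: https://arxiv.org/abs/2510.17422
Authors: Shaharyar Ahmed Khan Tareen,Filza Khan Tareen
Affiliations: University of Houston; National University of Sciences and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 6 figures, 2 tables, 7 equations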
Abstract:Keypoint detection is the foundation of many computer vision tasks, including image registration, structure-from-motion, 3D reconstruction, visual odometry, and SLAM. Traditional detectors (SIFT, SURF, ORB, BRISK, etc.) and learning-based methods (SuperPoint, R2D2, LF-Net, D2-Net, etc.) have shown strong performance yet suffer from key limitations: sensitivity to photometric changes, low keypoint density and repeatability, limited adaptability to challenging scenes, and lack of semantic understanding, often failing to prioritize visually important regions. We present DeepDetect, an intelligent, all-in-one, dense keypoint detector that unifies the strengths of classical detectors using deep learning. First, we create ground-truth masks by fusing outputs of 7 keypoint and 2 edge detectors, extracting diverse visual cues from corners and blobs to prominent edges and textures in the images. Afterwards, a lightweight and efficient model, ESPNet, is trained using these masks as labels, enabling DeepDetect to focus semantically on images while producing highly dense keypoints that are adaptable to diverse and visually degraded conditions. Evaluations on the Oxford Affine Covariant Regions dataset demonstrate that DeepDetect surpasses other detectors in keypoint density, repeatability, and the number of correct matches, achieving maximum values of 0.5143 (average keypoint density), 0.9582 (average repeatability), and 59,003 (correct matches).
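The mask-fusion step can be pictured with a small sketch. The abstract does not state the fusion rule, so the pixel-wise union used here is an assumption, as are the function name and the nested-list mask layout; real detector outputs would first have to be rasterized into binary masks.

```python
def fuse_masks(masks):
    """Pixel-wise union of equally sized binary masks.

    masks: list of masks, each a list of rows of 0/1 values.
    The union rule is an illustrative assumption, not the paper's stated rule.
    """
    if not masks:
        raise ValueError("need at least one mask")
    h, w = len(masks[0]), len(masks[0][0])
    for m in masks:
        if len(m) != h or any(len(row) != w for row in m):
            raise ValueError("all masks must share one shape")
    return [[int(any(m[y][x] for m in masks)) for x in range(w)]
            for y in range(h)]

corner_mask = [[1, 0, 0],
               [0, 0, 0]]   # e.g. from a corner detector
edge_mask = [[0, 0, 0],
             [0, 1, 1]]     # e.g. from an edge detector
fused = fuse_masks([corner_mask, edge_mask])
```

The fused mask marks every pixel flagged by at least one detector, which is what would then serve as the training label in this reading.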
[CV-41] Monitoring Horses in Stalls: From Object to Event Detection
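[Quick read]: This paper addresses the labor-intensive, inefficient monitoring of horse behavior in stalls, with the goal of detecting health and welfare issues early. The key to the solution is an automated vision-based monitoring system: YOLOv11 performs object detection, BoT-SORT performs multi-object tracking, and event states are inferred from object trajectories and spatial relations, distinguishing five horse-related event types. A custom dataset annotated with help from CLIP and GroundingDINO supports development, and the system accounts for the camera's blind spots, offering a feasible path toward real-time behavioral monitoring in stables.
Link: https://arxiv.org/abs/2510.17409
Authors: Dmitrii Galimzianov,Viacheslav Vyshegorodtsev,Ivan Nezhivykh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 4 figures, 4 tables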
Abstract:Monitoring the behavior of stalled horses is essential for early detection of health and welfare issues but remains labor-intensive and time-consuming. In this study, we present a prototype vision-based monitoring system that automates the detection and tracking of horses and people inside stables using object detection and multi-object tracking techniques. The system leverages YOLOv11 and BoT-SORT for detection and tracking, while event states are inferred based on object trajectories and spatial relations within the stall. To support development, we constructed a custom dataset annotated with assistance from foundation models CLIP and GroundingDINO. The system distinguishes between five event types and accounts for the camera’s blind spots. Qualitative evaluation demonstrated reliable performance for horse-related events, while highlighting limitations in detecting people due to data scarcity. This work provides a foundation for real-time behavioral monitoring in equine facilities, with implications for animal welfare and stable management.
[CV-42] MILES: Modality-Informed Learning Rate Scheduler for Balancing Multimodal Learning IJCNN’25
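[Quick read]: This paper addresses the performance bottleneck caused by modality overfitting in multimodal neural network training, where the model relies excessively on one modality and neglects the others, limiting the potential of multimodal learning and yielding only marginal gains. The key to the solution is the Modality-Informed Learning ratE Scheduler (MILES), a dynamic learning rate scheduling mechanism that monitors differences in modality-wise conditional utilization rates during training and adaptively adjusts the learning rate to balance how fast the model learns from each modality, improving both multimodal performance and unimodal predictions.
Link: https://arxiv.org/abs/2510.17394
Authors: Alejandro Guerra-Manzanares,Farah E. Shamout
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted and presented at the 2025 International Joint Conference on Neural Networks (IJCNN’25). The paper was awarded an honorable mention (best 4 papers)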
Abstract:The aim of multimodal neural networks is to combine diverse data sources, referred to as modalities, to achieve enhanced performance compared to relying on a single modality. However, training of multimodal networks is typically hindered by modality overfitting, where the network relies excessively on one of the available modalities. This often yields sub-optimal performance, hindering the potential of multimodal learning and resulting in marginal improvements relative to unimodal models. In this work, we present the Modality-Informed Learning ratE Scheduler (MILES) for training multimodal joint fusion models in a balanced manner. MILES leverages the differences in modality-wise conditional utilization rates during training to effectively balance multimodal learning. The learning rate is dynamically adjusted during training to balance the speed of learning from each modality by the multimodal model, aiming for enhanced performance in both multimodal and unimodal predictions. We extensively evaluate MILES on four multimodal joint fusion tasks and compare its performance to seven state-of-the-art baselines. Our results show that MILES outperforms all baselines across all tasks and fusion methods considered in our study, effectively balancing modality usage during training. This results in improved multimodal performance and stronger modality encoders, which can be leveraged when dealing with unimodal samples or absent modalities. Overall, our work highlights the impact of balancing multimodal learning on improving model performance.
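To make the scheduling idea concrete, here is a minimal sketch of a utilization-driven learning rate adjustment. It is not the MILES update rule, which the abstract does not give: the function name, the `threshold` and `reduction` parameters, and the choice to slow down only the dominant modality are all hypothetical.

```python
def balanced_learning_rates(base_lr, utilization, threshold=0.1, reduction=0.5):
    """Return a per-modality learning rate.

    utilization: dict mapping modality name -> estimated conditional
    utilization rate (assumed to be computed elsewhere during training).
    Hypothetical rule: if the gap between the most and least used modality
    exceeds threshold, scale the dominant modality's rate by reduction.
    """
    dominant = max(utilization, key=utilization.get)
    weakest = min(utilization, key=utilization.get)
    gap = utilization[dominant] - utilization[weakest]
    rates = {m: base_lr for m in utilization}
    if gap > threshold:
        rates[dominant] = base_lr * reduction
    return rates

imbalanced = balanced_learning_rates(0.5, {"image": 0.75, "text": 0.25})
balanced = balanced_learning_rates(0.5, {"image": 0.5, "text": 0.5})
```

In the imbalanced case the image branch is slowed so the text branch can catch up; in the balanced case both keep the base rate.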
[CV-43] Closed-Loop Transfer for Weakly-supervised Affordance Grounding ICCV2025
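[Quick read]: This paper addresses weakly supervised affordance grounding: learning the object regions that enable actions from exocentric interaction images and transferring that knowledge to egocentric images, so that actionable regions can be located accurately in complex interaction scenarios. Previous methods transfer knowledge in one direction only and struggle with challenges such as occlusion or dynamic interaction. The key to the solution is LoopTrans, a closed-loop knowledge transfer framework built on unified cross-modal localization and denoising knowledge distillation. It transfers interaction knowledge from exocentric to egocentric images and also transfers back to strengthen exocentric knowledge extraction, bridging the domain gap between object-centered egocentric and interaction-centered exocentric images and improving generalization in hard cases such as occlusion.
Link: https://arxiv.org/abs/2510.17384
Authors: Jiajin Tang,Zhengxuan Wei,Ge Zheng,Sibei Yang
Affiliations: ShanghaiTech University; Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICCV 2025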
Abstract:Humans can perform previously unexperienced interactions with novel objects simply by observing others engage with them. Weakly-supervised affordance grounding mimics this process by learning to locate object regions that enable actions on egocentric images, using exocentric interaction images with image-level annotations. However, extracting affordance knowledge solely from exocentric images and transferring it one-way to egocentric images limits the applicability of previous works in complex interaction scenarios. Instead, this study introduces LoopTrans, a novel closed-loop framework that not only transfers knowledge from exocentric to egocentric but also transfers back to enhance exocentric knowledge extraction. Within LoopTrans, several innovative mechanisms are introduced, including unified cross-modal localization and denoising knowledge distillation, to bridge domain gaps between object-centered egocentric and interaction-centered exocentric images while enhancing knowledge transfer. Experiments show that LoopTrans achieves consistent improvements across all metrics on image and video benchmarks, even handling challenging scenarios where object interaction regions are fully occluded by the human body.
[CV-44] Latent Spaces Beyond Synthesis: From GANs to Diffusion Models
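[Quick read]: This paper examines how internal representations in generative visual models have evolved, and in particular how the technical shift from GANs and VAEs to diffusion-based architectures reshapes our understanding of how these models generate. The key contribution is a distinction between "synthesis in a strict sense", where a compact latent space wholly determines the generative process, and "synthesis in a broad sense", where representational labor is distributed across layers. Diffusion models fragment the burden of representation in this layered way, challenging the assumption of a unified internal space found in earlier models. The paper therefore argues that generative AI should be understood not as a direct synthesis of content but as an emergent configuration of specialized processes.
Link: https://arxiv.org/abs/2510.17383
Authors: Ludovica Schaerf
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments: Presented and published at Ethics and Aesthetics of Artificial Intelligence Conference (EA-AI’25)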
Abstract:This paper examines the evolving nature of internal representations in generative visual models, focusing on the conceptual and technical shift from GANs and VAEs to diffusion-based architectures. Drawing on Beatrice Fazi’s account of synthesis as the amalgamation of distributed representations, we propose a distinction between “synthesis in a strict sense”, where a compact latent space wholly determines the generative process, and “synthesis in a broad sense,” which characterizes models whose representational labor is distributed across layers. Through close readings of model architectures and a targeted experimental setup that intervenes in layerwise representations, we show how diffusion models fragment the burden of representation and thereby challenge assumptions of unified internal space. By situating these findings within media theoretical frameworks and critically engaging with metaphors such as the latent space and the Platonic Representation Hypothesis, we argue for a reorientation of how generative AI is understood: not as a direct synthesis of content, but as an emergent configuration of specialized processes.
[CV-45] Facial Expression-based Parkinson's Disease Severity Diagnosis via Feature Fusion and Adaptive Class Balancing
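[Quick read]: This paper addresses three problems in facial expression-based diagnosis of Parkinson's disease (PD) severity: existing methods often rely on a single type of expression, which can lead to misdiagnosis; they ignore the class imbalance across PD stages, which degrades prediction performance; and most perform only binary classification (PD vs. non-PD) rather than grading severity. The key to the solution is attention-based feature fusion over multiple facial expression features, combined with an adaptive class balancing strategy that dynamically adjusts the contribution of each training sample according to class distribution and classification difficulty, improving the accuracy and robustness of PD severity diagnosis.
Link: https://arxiv.org/abs/2510.17373
Authors: Yintao Zhou,Wei Huang,Zhengyu Li,Jing Huang,Meng Pang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 3 pages, 2 figures, accepted by MIND 2025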
Abstract:Parkinson’s disease (PD) severity diagnosis is crucial for detecting potential patients early and adopting tailored interventions. Diagnosing PD based on facial expression is grounded in PD patients’ “masked face” symptom and has gained growing interest recently for its convenience and affordability. However, current facial expression-based approaches often rely on a single type of expression, which can lead to misdiagnosis, and ignore the class imbalance across different PD stages, which degrades the prediction performance. Moreover, most existing methods focus on binary classification (i.e., PD / non-PD) rather than diagnosing the severity of PD. To address these issues, we propose a new facial expression-based method for PD severity diagnosis which integrates multiple facial expression features through attention-based feature fusion. Moreover, we mitigate the class imbalance problem via an adaptive class balancing strategy which dynamically adjusts the contribution of training samples based on their class distribution and classification difficulty. Experimental results demonstrate the promising performance of the proposed method for PD severity diagnosis, as well as the efficacy of attention-based feature fusion and adaptive class balancing.
[CV-46] Beyond Real Faces: Synthetic Datasets Can Achieve Reliable Recognition Performance without Privacy Compromise
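[Quick read]: This paper addresses an ethical dilemma in deploying facial recognition systems: high-accuracy models depend on large amounts of real face data collected without consent, which can lead to dataset retractions and legal liability under regulations such as the EU General Data Protection Regulation (GDPR). The key to the solution is validating synthetic facial data as a privacy-preserving alternative. Combining a systematic literature review (identifying 25 synthetic datasets) with large-scale experiments, the study evaluates synthetic data on identity leakage prevention, intra-class variability, identity separability, dataset scale, ethical data sourcing, bias mitigation, and benchmark reliability. The best synthetic datasets (VariFace and VIGFace) match or exceed the recognition accuracy of real datasets such as CASIA-WebFace on standard benchmarks (95.67% vs. 94.70%) and allow controllable bias mitigation, supporting synthetic facial data as both scientifically and ethically viable.
Link: https://arxiv.org/abs/2510.17372
Authors: Paweł Borsukiewicz,Fadi Boutros,Iyiola E. Olatunji,Charles Beumier,Wendkûuni C. Ouedraogo,Jacques Klein,Tegawendé F. Bissyandé
Affiliations: University of Luxembourg; Fraunhofer IGD; Royal Military Academy
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: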
Abstract:The deployment of facial recognition systems has created an ethical dilemma: achieving high accuracy requires massive datasets of real faces collected without consent, leading to dataset retractions and potential legal liabilities under regulations like GDPR. While synthetic facial data presents a promising privacy-preserving alternative, the field lacks comprehensive empirical evidence of its viability. This study addresses this critical gap through extensive evaluation of synthetic facial recognition datasets. We present a systematic literature review identifying 25 synthetic facial recognition datasets (2018-2025), combined with rigorous experimental validation. Our methodology examines seven key requirements for privacy-preserving synthetic data: identity leakage prevention, intra-class variability, identity separability, dataset scale, ethical data sourcing, bias mitigation, and benchmark reliability. Through experiments involving over 10 million synthetic samples, extended by a comparison of results reported on five standard benchmarks, we provide the first comprehensive empirical assessment of synthetic data’s capability to replace real datasets. Best-performing synthetic datasets (VariFace, VIGFace) achieve recognition accuracies of 95.67% and 94.91% respectively, surpassing established real datasets including CASIA-WebFace (94.70%). While those images remain private, publicly available alternatives Vec2Face (93.52%) and CemiFace (93.22%) come close behind. Our findings reveal that they ensure proper intra-class variability while maintaining identity separability. Demographic bias analysis shows that, even though synthetic data inherits limited biases, it offers unprecedented control for bias mitigation through generation parameters. These results establish synthetic facial data as a scientifically viable and ethically imperative alternative for facial recognition research.
[CV-47] Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs NEURIPS2025
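[Quick read]: This paper addresses the efficiency and timeliness of Video Large Language Models (Video-LLMs) in streaming scenarios: how to process hour-long videos online and answer queries promptly without sacrificing understanding. The key to the solution is a training-free approach compatible with standard Video-LLMs, with three components. First, the LLM's attention is used to identify the visual tokens in each short clip that contribute most to understanding, so up to about 95% of tokens can be discarded with minimal performance loss. Second, previously selected tokens are processed recurrently to build a temporally coherent understanding. Third, caption-based question answering gives lightweight and accurate responses. The method achieves state-of-the-art performance on streaming video benchmarks, balancing efficiency and effectiveness.
Link: https://arxiv.org/abs/2510.17364
Authors: Vaggelis Dorovatas,Soroush Seifi,Gunshi Gupta,Rahaf Aljundi
Affiliations: Toyota Motor Europe; Archimedes RU, Athena RC; University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: NeurIPS 2025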
Abstract:Video Large Language Models (Video-LLMs) excel at understanding videos in-context, provided they have full access to the video when answering queries. However, these models face challenges in streaming scenarios where hour-long videos must be processed online, and questions need timely responses. In this work, we propose a training-free approach compatible with standard Video-LLMs, leveraging three key concepts: 1) LLM-informed selection of visual tokens to identify those that the LLM has attended to and contributed to its understanding of each short clip. Our attention-based selection allows us to discard up to ~95% of unimportant visual tokens with minimal performance loss; 2) Recurrent processing of past selected tokens to generate temporally coherent understanding of each processed clip; 3) Caption-based question answering for lightweight and accurate responses. Our method achieves state-of-the-art performance on streaming video benchmarks, striking a balance between efficiency and effectiveness.
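The first of the three concepts, attention-based token selection, can be sketched as a top-k filter over per-token attention scores. This is a simplified illustration under assumptions: the scores are taken as one aggregated number per visual token, and the function name and `keep_ratio` parameter are hypothetical rather than taken from the paper.

```python
import math

def select_tokens(attention, keep_ratio=0.05):
    """Return the indices (in original order) of the most attended tokens.

    attention: one aggregated attention score per visual token (assumed
    to be collected from the LLM while it processes a short clip).
    keep_ratio: fraction of tokens to keep; 0.05 mirrors discarding ~95%.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, math.ceil(len(attention) * keep_ratio))
    ranked = sorted(range(len(attention)),
                    key=lambda i: attention[i], reverse=True)
    return sorted(ranked[:k])

scores = [0.01, 0.02, 0.40, 0.01, 0.03, 0.30, 0.02, 0.01]
kept = select_tokens(scores, keep_ratio=0.25)
```

Returning the indices in their original order keeps the surviving tokens temporally and spatially ordered, which matters if they are later fed back for recurrent processing.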
[CV-48] M2H: Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception IROS2025
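[Quick read]: This paper addresses how to deploy real-time spatial perception on edge devices with a multi-task learning framework that jointly improves several vision tasks (semantic segmentation, depth estimation, edge detection, and surface normal estimation) while minimizing computational overhead. The key to the solution is Multi-Mono-Hydra (M2H), whose core innovation is a Window-Based Cross-Task Attention Module that enables structured feature exchange between tasks while preserving task-specific features, improving prediction consistency across tasks. Combined with a lightweight ViT-based DINOv2 backbone, M2H balances accuracy and latency and suits real-time spatial perception systems that build 3D scene graphs in dynamic environments.
Link: https://arxiv.org/abs/2510.17363
Authors: U.V.B.L Udugama,George Vosselman,Francesco Nex
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025). 8 pages, 7 figures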
Abstract:Deploying real-time spatial perception on edge devices requires efficient multi-task models that leverage complementary task information while minimizing computational overhead. This paper introduces Multi-Mono-Hydra (M2H), a novel multi-task learning framework designed for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. Unlike conventional approaches that rely on independent single-task models or shared encoder-decoder architectures, M2H introduces a Window-Based Cross-Task Attention Module that enables structured feature exchange while preserving task-specific details, improving prediction consistency across tasks. Built on a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time deployment and serves as the foundation for monocular spatial perception systems supporting 3D scene graph construction in dynamic environments. Comprehensive evaluations show that M2H outperforms state-of-the-art multi-task models on NYUDv2, surpasses single-task depth and semantic baselines on Hypersim, and achieves superior performance on the Cityscapes dataset, all while maintaining computational efficiency on laptop hardware. Beyond benchmarks, M2H is validated on real-world data, demonstrating its practicality in spatial perception tasks.
[CV-49] Exploring The Missing Semantics In Event Modality
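[Quick read]: This paper addresses the lack of semantic information in event-to-video reconstruction (E2V): event cameras record only intensity changes and ignore static objects and backgrounds, so the reconstructed video is semantically incomplete. The key to the solution is the Semantic-E2VID framework. A cross-modal feature alignment (CFA) module transfers robust semantic knowledge from the frame-based vision foundation model Segment Anything Model (SAM) to the event encoder and aligns high-level features across modalities. A semantic-aware feature fusion (SFF) block integrates the semantics learned from the frame modality into the event representation to strengthen the event decoder's reconstruction. Finally, a new semantic-aware E2V supervision uses SAM-generated categorical labels to guide the model in recovering semantic details.
Link: https://arxiv.org/abs/2510.17347
Authors: Jingqian Wu,Shengpeng Xu,Yunbo Jia,Edmund Y. Lam
Affiliations: The University of Hong Kong; Beijing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: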
Abstract:Event cameras offer distinct advantages such as low latency, high dynamic range, and efficient motion capture. However, event-to-video reconstruction (E2V), a fundamental event-based vision task, remains challenging, particularly for reconstructing and recovering semantic information. This is primarily due to the nature of the event camera, as it only captures intensity changes, ignoring static objects and backgrounds, resulting in a lack of semantic information in captured event modality. Further, semantic information plays a crucial role in video and frame reconstruction, yet is often overlooked by existing E2V approaches. To bridge this gap, we propose Semantic-E2VID, an E2V framework that explores the missing visual semantic knowledge in event modality and leverages it to enhance event-to-video reconstruction. Specifically, Semantic-E2VID introduces a cross-modal feature alignment (CFA) module to transfer the robust visual semantics from a frame-based vision foundation model, the Segment Anything Model (SAM), to the event encoder, while aligning the high-level features from distinct modalities. To better utilize the learned semantic feature, we further propose a semantic-aware feature fusion (SFF) block to integrate learned semantics in frame modality to form event representations with rich semantics that can be decoded by the event decoder. Further, to facilitate the reconstruction of semantic information, we propose a novel Semantic Perceptual E2V Supervision that helps the model to reconstruct semantic details by leveraging SAM-generated categorical labels. Extensive experiments demonstrate that Semantic-E2VID significantly enhances frame quality, outperforming state-of-the-art E2V methods across multiple benchmarks. The sample code is included in the supplementary material.
[CV-50] Nearest-Class Mean and Logits Agreement for Wildlife Open-Set Recognition
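[Quick read]: This paper addresses a weakness of wildlife classification models in open-set recognition (OSR): when exposed to unknown classes they remain overconfident, raising the risk of misclassification. The key to the solution is a post-processing method that needs no retraining of the pre-trained model and decides whether a sample belongs to a known class by measuring agreement between the feature space and the predicted logit space. Concretely, a probability distribution is built from the input's distance to its Nearest Class Mean (NCM) and compared with the softmax probabilities, quantifying how well the features agree with the classification head. The method ranks within the top three on both evaluated animal datasets, with AUROC of 93.41 and 95.35, whereas current state-of-the-art methods excel on only one dataset.
Link: https://arxiv.org/abs/2510.17338
Authors: Jiahao Huo,Mufhumudzi Muthivhi,Terence L. van Zyl,Fredrik Gustafsson
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: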
Abstract:Current state-of-the-art wildlife classification models are trained under the closed-world setting. When exposed to unknown classes, they remain overconfident in their predictions. Open-set Recognition (OSR) aims to classify known classes while rejecting unknown samples. Several OSR methods have been proposed to model the closed-set distribution by observing the feature, logit, or softmax probability space. A significant drawback of many existing approaches is the requirement to retrain the pre-trained classification model with the OSR-specific strategy. This study contributes a post-processing OSR method that measures the agreement between the model’s features and predicted logits. We propose a probability distribution based on an input’s distance to its Nearest Class Mean (NCM). The NCM-based distribution is then compared with the softmax probabilities from the logit space to measure agreement between the NCM and the classification head. Our proposed strategy ranks within the top three on two evaluated datasets, showing consistent performance across the two datasets. In contrast, current state-of-the-art methods excel on a single dataset. We achieve an AUROC of 93.41 and 95.35 for African and Swedish animals. The code can be found at this https URL.
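The agreement idea can be sketched in a few lines. The abstract does not say how the NCM-based distribution is formed or how the two distributions are compared, so the choices below (a softmax over negative Euclidean distances, and distribution overlap as the agreement score) are assumptions made for illustration, as are all function names.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def ncm_distribution(feature, class_means):
    """Assumed form: softmax over negative distances to each class mean."""
    return softmax([-math.dist(feature, mu) for mu in class_means])

def agreement(p, q):
    """Overlap of two distributions (1 minus total variation), in [0, 1]."""
    return sum(min(a, b) for a, b in zip(p, q))

def known_score(feature, logits, class_means):
    """Higher when the feature space and the classification head agree."""
    return agreement(ncm_distribution(feature, class_means), softmax(logits))

class_means = [(0.0, 0.0), (4.0, 0.0)]
# Feature close to class 0 and a head that also predicts class 0.
known = known_score((0.2, 0.0), [3.0, -1.0], class_means)
# Feature equidistant from both means, but an overconfident head.
unknown = known_score((2.0, 3.0), [3.0, -1.0], class_means)
```

Thresholding such a score would reject the second sample as unknown even though its softmax confidence is high, which is the failure mode the paper targets.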
[CV-51] iDETEX: Empowering MLLMs for Intelligent DETailed EXplainable IQA ICCV2025
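[Quick read]: This paper addresses the challenges of moving Image Quality Assessment (IQA) from scalar quality prediction toward more detailed, interpretable, human-aligned evaluation, and in particular how to analyze image quality along several dimensions. The key to the solution is iDETEX, a unified multimodal large language model (MLLM) that performs three core tasks at once: quality grounding, perception, and description. For efficient and generalizable training, the authors design task-specific offline augmentation modules and a data mixing strategy, complemented by online enhancement strategies that exploit multi-sourced supervision. iDETEX achieves state-of-the-art performance on the large-scale ViDA-UGC benchmark and ranks first in the ICCV MIPI 2025 Detailed Image Quality Assessment Challenge, confirming its accuracy and interpretability.
Link: https://arxiv.org/abs/2510.17332
Authors: Zhaoran Zhao,Xinli Yue,Jianhui Sun,Yuhao Xie,Tao Shao,Liangchao Yao,Fan Xia,Yuetang Deng
Affiliations: Tencent
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025 Workshop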
Abstract:Image Quality Assessment (IQA) has progressed from scalar quality prediction to more interpretable, human-aligned evaluation paradigms. In this work, we address the emerging challenge of detailed and explainable IQA by proposing iDETEX, a unified multimodal large language model (MLLM) capable of simultaneously performing three key tasks: quality grounding, perception, and description. To facilitate efficient and generalizable training across these heterogeneous subtasks, we design a suite of task-specific offline augmentation modules and a data mixing strategy. These are further complemented by online enhancement strategies to fully exploit multi-sourced supervision. We validate our approach on the large-scale ViDA-UGC benchmark, where iDETEX achieves state-of-the-art performance across all subtasks. Our model ranks first in the ICCV MIPI 2025 Detailed Image Quality Assessment Challenge, demonstrating its effectiveness and robustness in delivering accurate and interpretable quality assessments.
[CV-52] CharDiff: A Diffusion Model with Character-Level Guidance for License Plate Image Restoration
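[Quick read]: This paper addresses the difficulty of restoring and recognizing severely degraded license plate images captured in real scenes, which hurts License Plate Recognition (LPR) systems and limits the images' use as evidence, in visual interfaces, and in downstream applications. The key to the solution is CharDiff, a diffusion-based framework with character-level guidance. Fine-grained character priors are extracted by external segmentation and Optical Character Recognition (OCR) modules tailored to low-quality images, and a Character-guided Attention through Region-wise Masking (CHARM) module restricts each character's guidance to its own region, avoiding interference across regions. This structured character-guided conditioning makes diffusion-based license plate restoration and recognition more robust in practical deployment.
Link: https://arxiv.org/abs/2510.17330
Authors: Gyuhwan Park,Kihyun Na,Injung Kim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 6 figures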
Abstract:The significance of license plate image restoration goes beyond the preprocessing stage of License Plate Recognition (LPR) systems, as it also serves various purposes, including increasing evidential value, enhancing the clarity of visual interface, and facilitating further utilization of license plate images. We propose a novel diffusion-based framework with character-level guidance, CharDiff, which effectively restores and recognizes severely degraded license plate images captured under realistic conditions. CharDiff leverages fine-grained character-level priors extracted through external segmentation and Optical Character Recognition (OCR) modules tailored for low-quality license plate images. For precise and focused guidance, CharDiff incorporates a novel Character-guided Attention through Region-wise Masking (CHARM) module, which ensures that each character’s guidance is restricted to its own region, thereby avoiding interference with other regions. In experiments, CharDiff significantly outperformed the baseline restoration models in both restoration quality and recognition accuracy, achieving a 28% relative reduction in CER on the Roboflow-LP dataset, compared to the best-performing baseline model. These results indicate that the structured character-guided conditioning effectively enhances the robustness of diffusion-based license plate restoration and recognition in practical deployment scenarios.
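Region-wise masking of attention can be illustrated with a small masked softmax. This is a generic sketch of the masking idea, not the CHARM module itself: the function name, the score layout (one row per image position, one column per character token), and the convention that positions outside every character region receive no guidance are assumptions.

```python
import math

def masked_attention(scores, allowed):
    """Row-wise softmax restricted to allowed entries.

    scores[i][j]: raw attention of image position i to character token j.
    allowed[i][j]: True if position i lies inside character j's region.
    Rows with no allowed entry get all-zero weights (assumed convention).
    """
    out = []
    for row, mask in zip(scores, allowed):
        kept = [s for s, ok in zip(row, mask) if ok]
        if not kept:
            out.append([0.0] * len(row))
            continue
        m = max(kept)
        exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(row, mask)]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

scores = [[2.0, 5.0], [1.0, 1.0], [0.0, 3.0]]
allowed = [[True, False],    # position inside character 0's region
           [False, True],    # position inside character 1's region
           [False, False]]   # background position, no character
weights = masked_attention(scores, allowed)
```

Even though the first position scores character 1 higher, the mask forces all of its attention onto character 0, so guidance cannot leak across regions.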
[CV-53] A Single Set of Adversarial Clothes Breaks Multiple Defense Methods in the Physical World
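[Quick read]: This paper addresses physical-world adversarial attacks on deep learning-based object detectors, specifically the failure of defenses against adversarial clothes with large coverage. Existing defenses do reasonably well against conventional small adversarial patches, but experiments show they degrade sharply against adversarial clothes, which cover much of the body and look natural. The paper systematically evaluates mainstream defenses against adversarial clothes in both digital and physical settings, exposing a shared vulnerability, and crafts a single set of clothes that achieves an attack success rate of 96.06%, showing that existing defenses are broadly vulnerable to high-coverage, natural-looking adversarial examples.
Link: https://arxiv.org/abs/2510.17322
Authors: Wei Zhang,Zhanhao Hu,Xiao Li,Xiaopei Zhu,Xiaolin Hu
Affiliations: Tsinghua University; University of California, Berkeley; Chinese Institute for Brain Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 8 figures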
Abstract:In recent years, adversarial attacks against deep learning-based object detectors in the physical world have attracted much attention. To defend against these attacks, researchers have proposed various defense methods against adversarial patches, a typical form of physically-realizable attack. However, our experiments showed that simply enlarging the patch size could make these defense methods fail. Motivated by this, we evaluated various defense methods against adversarial clothes which have large coverage over the human body. Adversarial clothes provide a good test case for adversarial defense against patch-based attacks because they not only have large sizes but also look more natural than a large patch on humans. Experiments show that all the defense methods had poor performance against adversarial clothes in both the digital world and the physical world. In addition, we crafted a single set of clothes that broke multiple defense methods on Faster R-CNN. The set achieved an Attack Success Rate (ASR) of 96.06% against the undefended detector and over 64.84% ASRs against nine defended models in the physical world, unveiling the common vulnerability of existing adversarial defense methods against adversarial clothes. Code is available at: this https URL.
[CV-54] CausalMamba: Scalable Conditional State Space Models for Neural Causal Inference
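[Quick read]: This paper addresses two core problems in fMRI-based neural causal inference: inferring neural causality from blood-oxygen-level-dependent (BOLD) signals is ill-posed because hemodynamic delay and dispersion heavily distort them, and existing methods such as Dynamic Causal Modeling (DCM) do not scale computationally to large datasets. The key to the solution is CausalMamba, which splits the inverse problem into two tractable stages: BOLD deconvolution to recover latent neural activity, followed by causal graph inference with a novel Conditional Mamba architecture. It is 37% more accurate than DCM on simulated data, recovers well-established neural pathways with 88% fidelity on real task fMRI data, and reveals that during working memory the brain shifts its primary causal hub (executive or salience network) depending on the stimulus.
Link: https://arxiv.org/abs/2510.17318
Authors: Sangyoon Bae,Jiook Cha
Affiliations: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: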
Abstract:We introduce CausalMamba, a scalable framework that addresses fundamental limitations in fMRI-based causal inference: the ill-posed nature of inferring neural causality from hemodynamically distorted BOLD signals and the computational intractability of existing methods like Dynamic Causal Modeling (DCM). Our approach decomposes this complex inverse problem into two tractable stages: BOLD deconvolution to recover latent neural activity, followed by causal graph inference using a novel Conditional Mamba architecture. On simulated data, CausalMamba achieves 37% higher accuracy than DCM. Critically, when applied to real task fMRI data, our method recovers well-established neural pathways with 88% fidelity, whereas conventional approaches fail to identify these canonical circuits in over 99% of subjects. Furthermore, our network analysis of working memory data reveals that the brain strategically shifts its primary causal hub (recruiting executive or salience networks depending on the stimulus), a sophisticated reconfiguration that remains undetected by traditional methods. This work provides neuroscientists with a practical tool for large-scale causal inference that captures both fundamental circuit motifs and flexible network dynamics underlying cognitive function.
[CV-55] LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding
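[Quick read]: This paper addresses the limited ability of current multimodal models to understand long videos, especially to recognize and reason about human language, viewpoints, actions, and other contextual elements. The key to the solution is LongInsightBench, the first benchmark designed to assess long-video understanding of this kind. It contains about 1,000 information-dense long videos (lectures, interviews, vlogs) spanning visual, audio, and text modalities, defines six challenging task scenarios (covering Intra-Event and Inter-Event tasks), and applies a three-step semi-automated quality assurance pipeline to ensure the difficulty and validity of questions and answer options. Experiments show that Omni-modal Models (OLMs) still struggle with precise temporal localization (T-Loc) and long-range causal inference (CE-Caus), and that multi-modal fusion suffers from information loss and processing bias.
Link: https://arxiv.org/abs/2510.17305
Authors: ZhaoYang Han,Qihan Lin,Hao Liang,Bowen Chen,Zhou Liu,Wentao Zhang
Affiliations: Huazhong University of Science and Technology; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Submitted to ARR Rolling Review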
Abstract:We introduce LongInsightBench, the first benchmark designed to assess models’ ability to understand long videos, with a focus on human language, viewpoints, actions, and other contextual elements, while integrating visual, audio, and text modalities. Our benchmark excels in three key areas: a) Long-Duration, Information-Dense Videos: We carefully select approximately 1,000 videos from the open-source dataset FineVideo based on a duration limit and the information density of both visual and audio modalities, focusing on content like lectures, interviews, and vlogs, which contain rich language elements. b) Diverse and Challenging Task Scenarios: We have designed six challenging task scenarios, including both Intra-Event and Inter-Event Tasks. c) Rigorous and Comprehensive Quality Assurance Pipelines: We have developed a three-step, semi-automated data quality assurance pipeline to ensure the difficulty and validity of the synthesized questions and answer options. Based on LongInsightBench, we designed a series of experiments. Experimental results show that Omni-modal models (OLMs) still face challenges in tasks requiring precise temporal localization (T-Loc) and long-range causal inference (CE-Caus). Extended experiments reveal information loss and processing bias in the multi-modal fusion of OLMs. Our dataset and code are available at this https URL.
zh
[CV-56] Exploring Structural Degradation in Dense Representations for Self-supervised Learning NEURIPS2025
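[Quick read]: This paper tackles a counterintuitive phenomenon in self-supervised learning (SSL): longer training can degrade performance on dense prediction tasks such as semantic segmentation, which the authors call Self-supervised Dense Degradation (SDD). To address it, they propose the Dense representation Structure Estimator (DSE), which combines a class-relevance measure with an effective dimensionality measure to estimate a model's likely performance on dense tasks without annotations. DSE is theoretically grounded and closely correlated with downstream performance. It enables a model selection strategy with negligible computational cost and a DSE-based regularization method. Experiments show that model selection improves mean Intersection over Union (mIoU) by 3.0% on average and that the regularization consistently mitigates SDD.
Link: https://arxiv.org/abs/2510.17299
Authors: Siran Dai, Qianqian Xu, Peisong Wen, Yang Liu, Qingming Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2025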
Abstract:In this work, we observe a counterintuitive phenomenon in self-supervised learning (SSL): longer training may impair the performance of dense prediction tasks (e.g., semantic segmentation). We refer to this phenomenon as Self-supervised Dense Degradation (SDD) and demonstrate its consistent presence across sixteen state-of-the-art SSL methods with various losses, architectures, and datasets. When the model performs suboptimally on dense tasks at the end of training, measuring the performance during training becomes essential. However, evaluating dense performance effectively without annotations remains an open challenge. To tackle this issue, we introduce a Dense representation Structure Estimator (DSE), composed of a class-relevance measure and an effective dimensionality measure. The proposed DSE is both theoretically grounded and empirically validated to be closely correlated with the downstream performance. Based on this metric, we introduce a straightforward yet effective model selection strategy and a DSE-based regularization method. Experiments on sixteen SSL methods across four benchmarks confirm that model selection improves mIoU by 3.0% on average with negligible computational cost. Additionally, DSE regularization consistently mitigates the effects of dense degradation. Code is available at this https URL.
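To make the "effective dimensionality" idea more concrete, here is a minimal sketch of one common definition: the exponential of the Shannon entropy of the normalized eigenvalue spectrum of the feature covariance. This is an assumption for illustration; the abstract does not state which definition DSE uses, and the function name `effective_dimensionality` is hypothetical.

```python
import math


def effective_dimensionality(eigenvalues):
    """Entropy-based effective rank of a non-negative eigenvalue spectrum.

    Assumption: this is a generic definition, not necessarily the
    measure used inside DSE.
    """
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("spectrum must contain at least one positive value")
    entropy = 0.0
    for value in eigenvalues:
        if value > 0:
            p = value / total  # share of variance on this direction
            entropy -= p * math.log(p)
    return math.exp(entropy)
```

A flat spectrum yields a value equal to the number of dimensions, while a spectrum dominated by one direction yields a value close to 1, which is the kind of collapse such a measure is meant to flag.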
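[CV-57] Machine Vision-Based Surgical Lighting System: Design and Implementation
[Quick read]: This paper addresses the surgeon fatigue, neck strain, and inconsistent illumination from drift and shadowing that arise because traditional surgical lighting relies on manual adjustment, all of which can compromise precision and safety. The key to the solution is a machine vision system based on the YOLOv11 object detection algorithm: it identifies a blue spherical marker placed above the target surgical site and drives two servomotors to point a high-power LED light source at the detected location, automating the lighting for more consistent illumination and better ergonomics.
Link: https://arxiv.org/abs/2510.17287
Authors: Amir Gharghabi, Mahdi Hakiminezhad, Maryam Shafaei, Shaghayegh Gharghabi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: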
Abstract:Effortless and ergonomically designed surgical lighting is critical for precision and safety during procedures. However, traditional systems often rely on manual adjustments, leading to surgeon fatigue, neck strain, and inconsistent illumination due to drift and shadowing. To address these challenges, we propose a novel surgical lighting system that leverages the YOLOv11 object detection algorithm to identify a blue marker placed above the target surgical site. A high-power LED light source is then directed to the identified location using two servomotors equipped with tilt-pan brackets. The YOLO model achieves 96.7% mAP@50 on the validation set consisting of annotated images simulating surgical scenes with the blue spherical marker. By automating the lighting process, this machine vision-based solution reduces physical strain on surgeons, improves consistency in illumination, and supports improved surgical outcomes.
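The step from a detected marker to servo commands can be sketched as follows. This is only an illustration under stated assumptions: a pinhole camera with known horizontal and vertical fields of view, mounted at the pan-tilt pivot so that angular offsets in the image map directly to servo angles. The function names and parameters are hypothetical and are not taken from the paper.

```python
import math


def bbox_center(x1, y1, x2, y2):
    """Center (u, v) in pixels of a detection box given by its corners."""
    return (x1 + x2) / 2, (y1 + y2) / 2


def pixel_to_pan_tilt(u, v, width, height, hfov_deg, vfov_deg):
    """Angular offsets (degrees) of pixel (u, v) from the optical axis.

    Assumes a pinhole model; focal lengths are derived from the
    fields of view. Positive pan is right, positive tilt is up.
    """
    fx = (width / 2) / math.tan(math.radians(hfov_deg) / 2)
    fy = (height / 2) / math.tan(math.radians(vfov_deg) / 2)
    pan = math.degrees(math.atan((u - width / 2) / fx))
    tilt = math.degrees(math.atan((height / 2 - v) / fy))
    return pan, tilt
```

A real controller would add these offsets to the current servo angles and clamp the result to the mechanical range of the brackets.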
[CV-58] SG-CLDFF: A Novel Framework for Automated White Blood Cell Classification and Segmentation
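[Quick read]: This paper addresses accurate segmentation and classification of white blood cells (WBCs) in microscopic images, a task essential for diagnosing and monitoring hematological disorders but made difficult by staining variability, complex backgrounds, and class imbalance. The key to the solution is a Saliency-Guided Cross-Layer Deep Feature Fusion framework (SG-CLDFF). It first computes saliency priors to localize candidate WBC regions and guide feature extraction. A lightweight hybrid backbone (EfficientSwin-style) then produces multi-scale representations, which a ResNeXt-CC-inspired cross-layer fusion module combines to preserve complementary information from shallow and deep layers. Segmentation and classification heads are trained jointly in a multi-task setup, with class-aware weighted losses and saliency-alignment regularization to mitigate class imbalance and suppress background activation. Grad-CAM visualizations and saliency consistency checks make the model's decisions easier to inspect.
Link: https://arxiv.org/abs/2510.17278
Authors: Mehdi Zekriyapanah Gashti, Mostafa Mohammadpour, Ghasem Farjamnia
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: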
Abstract:Accurate segmentation and classification of white blood cells (WBCs) in microscopic images are essential for diagnosis and monitoring of many hematological disorders, yet remain challenging due to staining variability, complex backgrounds, and class imbalance. In this paper, we introduce a novel Saliency-Guided Cross-Layer Deep Feature Fusion framework (SG-CLDFF) that tightly integrates saliency-driven preprocessing with multi-scale deep feature aggregation to improve both robustness and interpretability for WBC analysis. SG-CLDFF first computes saliency priors to highlight candidate WBC regions and guide subsequent feature extraction. A lightweight hybrid backbone (EfficientSwin-style) produces multi-resolution representations, which are fused by a ResNeXt-CC-inspired cross-layer fusion module to preserve complementary information from shallow and deep layers. The network is trained in a multi-task setup with concurrent segmentation and cell-type classification heads, using class-aware weighted losses and saliency-alignment regularization to mitigate imbalance and suppress background activation. Interpretability is enforced through Grad-CAM visualizations and saliency consistency checks, allowing model decisions to be inspected at the regional level. We validate the framework on standard public benchmarks (BCCD, LISC, ALL-IDB), reporting consistent gains in IoU, F1, and classification accuracy compared to strong CNN and transformer baselines. An ablation study also demonstrates the individual contributions of saliency preprocessing and cross-layer fusion. SG-CLDFF offers a practical and explainable path toward more reliable automated WBC analysis in clinical workflows.
[CV-59] Enhanced Motion Forecasting with Plug-and-Play Multimodal Large Language Models IROS2025
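[Quick read]: This paper addresses the difficulty of generalizing autonomous driving systems cost-effectively to diverse real-world scenarios: existing perception and prediction models are reliable under standard conditions but limited in complex, varied ones. The key to the solution is Plug-and-Forecast (PnF), a plug-and-play framework that augments existing motion forecasting models with multimodal large language models (MLLMs). PnF designs prompts to extract structured scene understanding from MLLMs and distills this information into learnable embeddings that augment existing behavior prediction models. By relying on the zero-shot reasoning of MLLMs, it improves motion prediction while requiring no fine-tuning, which makes it practical to adopt.
Link: https://arxiv.org/abs/2510.17274
Authors: Katie Luo, Jingwei Ji, Tong He, Runsheng Xu, Yichen Xie, Dragomir Anguelov, Mingxing Tan
Affiliations: Cornell University; Waymo LLC; UC Berkeley
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: In proceedings of IROS 2025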
Abstract:Current autonomous driving systems rely on specialized models for perceiving and predicting motion, which demonstrate reliable performance in standard conditions. However, generalizing cost-effectively to diverse real-world scenarios remains a significant challenge. To address this, we propose Plug-and-Forecast (PnF), a plug-and-play approach that augments existing motion forecasting models with multimodal large language models (MLLMs). PnF builds on the insight that natural language provides a more effective way to describe and handle complex scenarios, enabling quick adaptation to targeted behaviors. We design prompts to extract structured scene understanding from MLLMs and distill this information into learnable embeddings to augment existing behavior prediction models. Our method leverages the zero-shot reasoning capabilities of MLLMs to achieve significant improvements in motion prediction performance, while requiring no fine-tuning – making it practical to adopt. We validate our approach on two state-of-the-art motion forecasting models using the Waymo Open Motion Dataset and the nuScenes Dataset, demonstrating consistent performance improvements across both benchmarks.
[CV-60] FineVision: Open Data Is All You Need
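[Quick read]: This paper addresses the bottleneck that fragmented, inconsistent, and contaminated public datasets impose on vision-language models (VLMs). The key to the solution is FineVision, a unified, curated open corpus of 24 million samples built with a semi-automated, human-in-the-loop pipeline that consolidates more than 200 sources: automation handles bulk ingestion and schema mapping, while human reviewers verify that annotations are consumed faithfully and check formatting, diversity, and safety. The workflow also de-duplicates within and across sources and decontaminates against 66 public benchmarks. FineVision further covers agentic/GUI tasks with a unified action space, validated for executable fidelity. Models trained on FineVision outperform those trained on existing open mixtures across a broad evaluation suite, underscoring the value of scale, data hygiene, and balanced automation with human oversight.
Link: https://arxiv.org/abs/2510.17269
Authors: Luis Wiedmann, Orr Zohar, Amir Mahla, Xiaohan Wang, Rui Li, Thibaud Frere, Leandro von Werra, Aritra Roy Gosthipaty, Andrés Marafioti
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: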
Abstract:The advancement of vision-language models (VLMs) is hampered by a fragmented landscape of inconsistent and contaminated public datasets. We introduce FineVision, a meticulously collected, curated, and unified corpus of 24 million samples - the largest open resource of its kind. We unify more than 200 sources into 185 subsets via a semi-automated, human-in-the-loop pipeline: automation performs bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs to verify faithful consumption of annotations, appropriate formatting and diversity, and safety; issues trigger targeted fixes and re-runs. The workflow further applies rigorous de-duplication within and across sources and decontamination against 66 public benchmarks. FineVision also encompasses agentic/GUI tasks with a unified action space; reviewers validate schemas and inspect a sample of trajectories to confirm executable fidelity. Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad evaluation suite, underscoring the benefits of scale, data hygiene, and balanced automation with human oversight. We release the corpus and curation tools to accelerate data-centric VLM research.
[CV-61] Fair and Interpretable Deepfake Detection in Videos
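[Quick read]: This paper addresses bias, lack of transparency, and failure to capture temporal information in existing deepfake detection methods, which lead to unfair decisions and unreliable results across demographic groups. The key to the solution is a fairness-aware deepfake detection framework with three core elements: (1) sequence-based clustering for temporal modeling of videos; (2) concept extraction to improve detection reliability and interpretability; and (3) a demography-aware data augmentation method that balances underrepresented groups while applying frequency-domain transformations to preserve deepfake artifacts, which mitigates bias and improves generalization. Experiments on several benchmark datasets show that the method achieves the best tradeoff between fairness and accuracy compared with state-of-the-art (SoTA) models.
Link: https://arxiv.org/abs/2510.17264
Authors: Akihito Yoshii, Ryosuke Sonoda, Ramya Srinivasan
Affiliations: Fujitsu Limited; Fujitsu Research of America, Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages (including References)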
Abstract:Existing deepfake detection methods often exhibit bias, lack transparency, and fail to capture temporal information, leading to biased decisions and unreliable results across different demographic groups. In this paper, we propose a fairness-aware deepfake detection framework that integrates temporal feature learning and demographic-aware data augmentation to enhance fairness and interpretability. Our method leverages sequence-based clustering for temporal modeling of deepfake videos and concept extraction to improve detection reliability while also facilitating interpretable decisions for non-expert users. Additionally, we introduce a demography-aware data augmentation method that balances underrepresented groups and applies frequency-domain transformations to preserve deepfake artifacts, thereby mitigating bias and improving generalization. Extensive experiments on FaceForensics++, DFD, Celeb-DF, and DFDC datasets using state-of-the-art (SoTA) architectures (Xception, ResNet) demonstrate the efficacy of the proposed method in obtaining the best tradeoff between fairness and accuracy when compared to SoTA.
[CV-62] Taming Modality Entanglement in Continual Audio-Visual Segmentation
[Quick read]: This paper addresses two key challenges in fine-grained multi-modal continual learning: multi-modal semantic drift, where sounding objects are labeled as background in sequential tasks, and co-occurrence confusion, where frequently co-occurring classes tend to be confused. It proposes a Collision-based Multi-modal Rehearsal (CMR) framework. For semantic drift, a Multi-modal Sample Selection (MSS) strategy selects samples with high modal consistency for rehearsal. For co-occurrence confusion, a Collision-based Sample Rehearsal (CSR) mechanism increases the rehearsal frequency of confusable classes during training. Together these improve continual learning performance on audio-visual segmentation.
Link: https://arxiv.org/abs/2510.17234
Authors: Yuyang Hong, Qi Yang, Tao Zhang, Zili Wang, Zhaojin Fu, Kun Ding, Bin Fan, Shiming Xiang
Affiliations: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China; School of Intelligent Science and Technology, University of Science and Technology Beijing, Beijing 100083, China
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recently, significant progress has been made in multi-modal continual learning, aiming to learn new tasks sequentially in multi-modal settings while preserving performance on previously learned ones. However, existing methods mainly focus on coarse-grained tasks, with limitations in addressing modality entanglement in fine-grained continual learning settings. To bridge this gap, we introduce a novel Continual Audio-Visual Segmentation (CAVS) task, aiming to continuously segment new classes guided by audio. Through comprehensive analysis, two critical challenges are identified: 1) multi-modal semantic drift, where a sounding object is labeled as background in sequential tasks; 2) co-occurrence confusion, where frequently co-occurring classes tend to be confused. In this work, a Collision-based Multi-modal Rehearsal (CMR) framework is designed to address these challenges. Specifically, for multi-modal semantic drift, a Multi-modal Sample Selection (MSS) strategy is proposed to select samples with high modal consistency for rehearsal. Meanwhile, for co-occurrence confusion, a Collision-based Sample Rehearsal (CSR) mechanism is designed, allowing the rehearsal frequency of confusable classes to be increased during the training process. Moreover, we construct three audio-visual incremental scenarios to verify the effectiveness of our method. Comprehensive experiments demonstrate that our method significantly outperforms single-modal continual learning methods.
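[CV-63] When One Moment Isn't Enough: Multi-Moment Retrieval with Cross-Moment Interactions NEURIPS2025
[Quick read]: This paper addresses the mismatch between the Single-Moment Retrieval (SMR) setting that dominates video temporal grounding and real-world applications, where one query can correspond to multiple relevant moments. To close the gap, the authors build a high-quality multi-moment dataset, the QVHighlights Multi-Moment Dataset (QV-M²), with 2,212 annotations covering 6,384 video segments, and propose new evaluation metrics tailored to Multi-Moment Retrieval (MMR). Their framework, FlashMMR, introduces a Multi-moment Post-verification module that combines constrained temporal adjustment with a verification step to re-evaluate candidate segments and refine moment boundaries, pruning low-confidence proposals and achieving robust multi-moment alignment. On QV-M² it improves over the prior state of the art by 3.00% on G-mAP, 2.70% on mAP@3+tgt, and 2.56% on mR@3.
Link: https://arxiv.org/abs/2510.17218
Authors: Zhuo Cao, Heming Du, Bingqing Zhang, Xin Yu, Xue Li, Sen Wang
Affiliations: The University of Queensland
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to NeurIPS 2025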
Abstract:Existing moment retrieval (MR) methods focus on Single-Moment Retrieval (SMR). However, one query can correspond to multiple relevant moments in real-world applications. This makes the existing datasets and methods insufficient for video temporal grounding. By revisiting the gap between current MR tasks and real-world applications, we introduce a high-quality dataset called QVHighlights Multi-Moment Dataset (QV-M²), along with new evaluation metrics tailored for multi-moment retrieval (MMR). QV-M² consists of 2,212 annotations covering 6,384 video segments. Building on existing efforts in MMR, we propose a framework called FlashMMR. Specifically, we propose a Multi-moment Post-verification module to refine the moment boundaries. We introduce constrained temporal adjustment and subsequently leverage a verification module to re-evaluate the candidate segments. Through this sophisticated filtering pipeline, low-confidence proposals are pruned, and robust multi-moment alignment is achieved. We retrain and evaluate 6 existing MR methods on QV-M² and QVHighlights under both SMR and MMR settings. Results show that QV-M² serves as an effective benchmark for training and evaluating MMR models, while FlashMMR provides a strong baseline. Specifically, on QV-M², it achieves improvements over the prior SOTA method of 3.00% on G-mAP, 2.70% on mAP@3+tgt, and 2.56% on mR@3. The proposed benchmark and method establish a foundation for advancing research in more realistic and challenging video temporal grounding scenarios. Code is released at this https URL.
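Multi-moment retrieval metrics and post-processing typically build on temporal IoU between predicted and ground-truth segments. The sketch below shows temporal IoU together with a standard score-ordered non-maximum suppression over `(start, end, score)` proposals. It is generic post-processing offered for illustration, not FlashMMR's post-verification module, and the function names are assumptions.

```python
def temporal_iou(a, b):
    """IoU of two time segments given as (start, end) in seconds."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def temporal_nms(proposals, iou_threshold=0.5):
    """Keep high-scoring proposals that do not overlap a kept one too much.

    proposals: list of (start, end, score) tuples.
    """
    kept = []
    for proposal in sorted(proposals, key=lambda p: p[2], reverse=True):
        if all(temporal_iou(proposal[:2], k[:2]) <= iou_threshold for k in kept):
            kept.append(proposal)
    return kept
```

Unlike single-moment retrieval, several well-separated segments can survive this filtering for one query, which is the behavior a multi-moment setting needs.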
[CV-64] Optimizing DINOv2 with Registers for Face Anti-Spoofing ICCV2025
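[Quick read]: This paper addresses spoofing detection for face anti-spoofing under unified physical-digital attacks, that is, accurately distinguishing live faces from spoofed ones (such as photos, printouts, or faces shown on a screen). The key to the solution is to use DINOv2 with registers to extract generalizable features and to suppress perturbations in the attention mechanism, so that the model focuses on subtle but essential visual differences and discriminates spoofing attacks more reliably.
Link: https://arxiv.org/abs/2510.17201
Authors: Mika Feng, Pierre Gallin-Martel, Koichi Ito, Takafumi Aoki
Affiliations: Tohoku University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025 Workshop FAS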
Abstract:Face recognition systems are designed to be robust against variations in head pose, illumination, and image blur during capture. However, malicious actors can exploit these systems by presenting a face photo of a registered user, potentially bypassing the authentication process. Such spoofing attacks must be detected prior to face recognition. In this paper, we propose a DINOv2-based spoofing attack detection method to discern minute differences between live and spoofed face images. Specifically, we employ DINOv2 with registers to extract generalizable features and to suppress perturbations in the attention mechanism, which enables focused attention on essential and minute features. We demonstrate the effectiveness of the proposed method through experiments conducted on the dataset provided by "The 6th Face Anti-Spoofing Workshop: Unified Physical-Digital Attacks Detection@ICCV2025" and the SiW dataset.
[CV-65] EndoCIL: A Class-Incremental Learning Framework for Endoscopic Image Classification
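[Quick read]: This paper addresses catastrophic forgetting in class-incremental learning (CIL) for endoscopic image analysis, particularly under severe domain discrepancies and class imbalance. Existing replay-based CIL methods fail to mitigate forgetting effectively, which limits continual adaptation in clinical settings. The key to the solution is EndoCIL, a unified framework with three components: Maximum Mean Discrepancy Based Replay (MDBR), which uses a distribution-aligned greedy strategy to select diverse and representative exemplars; Prior Regularized Class Balanced Loss (PRCBL), which integrates prior class distributions and balance weights into the loss to alleviate both inter-phase and intra-phase class imbalance; and Calibration of Fully-Connected Gradients (CFG), which adjusts classifier gradients to reduce bias toward new classes. Together the three components improve the balance between stability and plasticity over long-term learning.
Link: https://arxiv.org/abs/2510.17200
Authors: Bingrong Liu, Jun Shi, Yushan Zheng
Affiliations: Hefei University of Technology; Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: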
Abstract:Class-incremental learning (CIL) for endoscopic image analysis is crucial for real-world clinical applications, where diagnostic models should continuously adapt to evolving clinical data while retaining performance on previously learned ones. However, existing replay-based CIL methods fail to effectively mitigate catastrophic forgetting due to severe domain discrepancies and class imbalance inherent in endoscopic imaging. To tackle these challenges, we propose EndoCIL, a novel and unified CIL framework specifically tailored for endoscopic image diagnosis. EndoCIL incorporates three key components: Maximum Mean Discrepancy Based Replay (MDBR), employing a distribution-aligned greedy strategy to select diverse and representative exemplars, Prior Regularized Class Balanced Loss (PRCBL), designed to alleviate both inter-phase and intra-phase class imbalance by integrating prior class distributions and balance weights into the loss function, and Calibration of Fully-Connected Gradients (CFG), which adjusts the classifier gradients to mitigate bias toward new classes. Extensive experiments conducted on four public endoscopic datasets demonstrate that EndoCIL generally outperforms state-of-the-art CIL methods across varying buffer sizes and evaluation metrics. The proposed framework effectively balances stability and plasticity in lifelong endoscopic diagnosis, showing promising potential for clinical scalability and deployment.
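A distribution-aligned greedy exemplar selection can be sketched as follows. Under the simplifying assumption of a linear kernel, the squared MMD between the selected subset and the full class reduces to the squared distance between their mean feature vectors, so each step adds the sample that brings the subset mean closest to the class mean. MDBR's actual kernel and selection rule are not given in the abstract; `greedy_exemplars` is a hypothetical name.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def greedy_exemplars(features, budget):
    """Indices of samples chosen so the subset mean tracks the class mean.

    Assumption: linear-kernel MMD, i.e. matching first moments only.
    """
    target = mean_vector(features)
    selected = []
    running_sum = [0.0] * len(target)
    remaining = set(range(len(features)))
    for _ in range(min(budget, len(features))):
        best_index, best_gap = None, float("inf")
        for i in sorted(remaining):  # sorted for deterministic tie-breaking
            count = len(selected) + 1
            candidate_mean = [(s + f) / count
                              for s, f in zip(running_sum, features[i])]
            gap = squared_distance(candidate_mean, target)
            if gap < best_gap:
                best_index, best_gap = i, gap
        selected.append(best_index)
        remaining.remove(best_index)
        running_sum = [s + f for s, f in zip(running_sum, features[best_index])]
    return selected
```

A richer kernel (for example an RBF kernel) would also match higher-order statistics, at the cost of pairwise kernel evaluations at each step.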
[CV-66] Round Outcome Prediction in VALORANT Using Tactical Features from Video Analysis
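[Quick read]: This paper addresses round outcome prediction in VALORANT, a first-person shooter (FPS) esports title. Prior approaches rely mostly on match logs and statistics, which struggle to capture complex tactical dynamics. The key to the solution is to build on the video recognition model TimeSformer and incorporate tactical features extracted from the minimap (such as character positions and in-game events), training on a dataset augmented with labels for these tactical events. Preliminary results show that a model trained with such tactical features reaches approximately 81% prediction accuracy, especially from the middle phases of a round onward, and significantly outperforms a model trained on the raw minimap information alone.
Link: https://arxiv.org/abs/2510.17199
Authors: Nirai Hayakawa, Kazumasa Shimari, Kazuma Yamasaki, Hirotatsu Hoshikawa, Rikuto Tsuchida, Kenichi Matsumoto
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to IEEE 2025 Conference on Games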
Abstract:Recently, research on predicting match outcomes in esports has been actively conducted, but much of it is based on match log data and statistical information. This research targets the FPS game VALORANT, which requires complex strategies, and aims to build a round outcome prediction model by analyzing minimap information in match footage. Specifically, based on the video recognition model TimeSformer, we attempt to improve prediction accuracy by incorporating detailed tactical features extracted from minimap information, such as character position information and other in-game events. This paper reports preliminary results showing that a model trained on a dataset augmented with such tactical event labels achieved approximately 81% prediction accuracy, especially from the middle phases of a round onward, significantly outperforming a model trained on a dataset with the minimap information itself. This suggests that leveraging tactical features from match footage is highly effective for predicting round outcomes in VALORANT.
[CV-67] From Pixels to People: Satellite-Based Mapping and Quantification of Riverbank Erosion and Lost Villages in Bangladesh
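[Quick read]: This paper addresses the ongoing loss of villages and farmland to river erosion in Bangladesh, a serious threat to local communities that is difficult to track efficiently and accurately with traditional monitoring. The key to the solution is an adaptation of the Segment Anything Model (SAM): a simple color-channel analysis first produces a rough land-water segmentation, and SAM's mask decoder is then fine-tuned to recognize the subtle signatures of riverbank erosion. On a newly built annotated dataset, the method achieves a mean Intersection over Union of 86.30% and a Dice score of 92.60%, significantly surpassing traditional methods and off-the-shelf deep learning models, and gives policymakers and disaster management agencies a tool for quantifying land loss and visualizing erosion.
Link: https://arxiv.org/abs/2510.17198
Authors: M Saifuzzaman Rafat, Mohd Ruhul Ameen, Akif Islam, Abu Saleh Musa Miah, Jungpil Shin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Submitted to the International Conference on Data and Applied Analytics (IDAA 2025). 15 pages, 5 figures, 4 tables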
Abstract:The great rivers of Bangladesh, arteries of commerce and sustenance, are also agents of relentless destruction. Each year, they swallow whole villages and vast tracts of farmland, erasing communities from the map and displacing thousands of families. To track this slow-motion catastrophe has, until now, been a Herculean task for human analysts. Here we show how a powerful general-purpose vision model, the Segment Anything Model (SAM), can be adapted to this task with remarkable precision. To do this, we assembled a new dataset - a digital chronicle of loss compiled from historical Google Earth imagery of Bangladesh’s most vulnerable regions, including Mokterer Char Union, Kedarpur Union, Balchipara village, and Chowhali Upazila, from 2003 to 2025. Crucially, this dataset is the first to include manually annotated data on the settlements that have vanished beneath the water. Our method first uses a simple color-channel analysis to provide a rough segmentation of land and water, and then fine-tunes SAM’s mask decoder to recognize the subtle signatures of riverbank erosion. The resulting model demonstrates a keen eye for this destructive process, achieving a mean Intersection over Union of 86.30% and a Dice score of 92.60% - a performance that significantly surpasses traditional methods and off-the-shelf deep learning models. This work delivers three key contributions: the first annotated dataset of disappeared settlements in Bangladesh due to river erosion; a specialized AI model fine-tuned for this critical task; and a method for quantifying land loss with compelling visual evidence. Together, these tools provide a powerful new lens through which policymakers and disaster management agencies can monitor erosion, anticipate its trajectory, and ultimately protect the vulnerable communities in its path.
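For readers unfamiliar with the two reported metrics, the sketch below computes IoU and Dice for a pair of binary masks stored as flat sequences of 0/1. For a single mask pair the two are linked by Dice = 2·IoU / (1 + IoU); plugging in 0.8630 gives roughly 0.926, in the same range as the reported Dice. Exact agreement should not be expected, since metrics averaged over many images need not satisfy the per-mask identity. The function name is an assumption.

```python
def iou_and_dice(pred, target):
    """IoU and Dice for two binary masks given as flat 0/1 sequences."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    # Two empty masks are treated as a perfect match by convention.
    iou = intersection / union if union else 1.0
    total = pred_area + target_area
    dice = 2 * intersection / total if total else 1.0
    return iou, dice
```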
[CV-68] ZSPAPrune: Zero-Shot Prompt-Aware Token Pruning for Vision-Language Models
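[Quick read]: This paper addresses the visual token redundancy that arises when vision-language models (VLMs) process large inputs, which makes inference prohibitively expensive. Existing visual token pruning methods, whether based on attention or diversity, typically neglect the guidance of the text prompt and therefore fail to prioritize task relevance. The key to the solution is a zero-shot approach that reframes visual token pruning as a balance between task relevance and information diversity, using a hierarchical strategy: it first selects a core set of task-relevant visual tokens and then supplements them with diversity tokens to preserve broader context. Across multiple models and benchmarks, the method matches or surpasses the state of the art with only minimal accuracy loss even when pruning up to 90% of the tokens, while significantly reducing GPU memory footprint and inference latency.
Link: https://arxiv.org/abs/2510.17197
Authors: Pu Zhang, Yuwei Li, Xingyuan Xian, Guoming Tang
Affiliations: The Hong Kong University of Science and Technology (Guangzhou)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: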
Abstract:As the capabilities of Vision-Language Models (VLMs) advance, they can process increasingly large inputs, which, unlike in LLMs, generates significant visual token redundancy and leads to prohibitive inference costs. While many methods aim to reduce these costs by pruning visual tokens, existing approaches, whether based on attention or diversity, typically neglect the guidance of the text prompt and thus fail to prioritize task relevance. In this work, we propose a novel, zero-shot method that reframes the problem by introducing a prompt-aware perspective, explicitly modeling visual token pruning as a balance between task relevance and information diversity. Our hierarchical approach first selects a core set of task-relevant visual tokens and then supplements them with diversity tokens to preserve broader context. Experiments across multiple models and benchmarks show that our method achieves performance that matches or surpasses the state-of-the-art with only minimal accuracy loss, even when pruning up to 90% of the tokens. Furthermore, these gains are accompanied by significant reductions in GPU memory footprint and inference latency.
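The two-stage idea (relevance first, then diversity) can be sketched as below. The sketch assumes cosine similarity to a single pooled prompt embedding as the relevance score and a greedy "least similar to anything already kept" rule for diversity. Both choices are illustrative assumptions; the abstract does not specify the paper's scoring functions, and `prune_tokens` is a hypothetical name.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_tokens(visual_tokens, prompt_embedding, n_relevant, n_diverse):
    """Return sorted indices of the visual tokens to keep.

    Stage 1 keeps the n_relevant tokens most similar to the prompt.
    Stage 2 greedily adds n_diverse tokens, each time picking the
    remaining token least similar to every token already kept.
    """
    order = sorted(range(len(visual_tokens)),
                   key=lambda i: cosine(visual_tokens[i], prompt_embedding),
                   reverse=True)
    selected = order[:n_relevant]
    remaining = order[n_relevant:]
    for _ in range(min(n_diverse, len(remaining))):
        best = min(
            remaining,
            key=lambda i: max(
                (cosine(visual_tokens[i], visual_tokens[j]) for j in selected),
                default=0.0,
            ),
        )
        selected.append(best)
        remaining.remove(best)
    return sorted(selected)
```

Because nothing here is trained, a scheme of this shape can be applied to a frozen model, which matches the zero-shot framing in the abstract.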
[CV-69] HIDISC: A Hyperbolic Framework for Domain Generalization with Generalized Category Discovery NEURIPS
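[Quick read]: This paper addresses Domain Generalization with Generalized Category Discovery (DG-GCD): without access to target-domain data during training, a model must generalize to unseen domains and recognize both known and novel categories in them. Existing methods such as DG2CD-Net rely on episodic training over multiple synthetic domains and task vector aggregation, which incurs high computational cost and error accumulation. The paper proposes HIDISC, a hyperbolic representation learning framework whose core ideas are: GPT-guided diffusion augmentation that applies minimal but diverse domain perturbations to the source domain, avoiding overfitting while staying efficient; Tangent CutMix, a curvature-aware interpolation that synthesizes pseudo-novel samples in tangent space to preserve manifold consistency; a unified loss (penalized Busemann alignment, hybrid hyperbolic contrastive regularization, and adaptive outlier repulsion) that yields compact, semantically structured embeddings; and a learnable curvature parameter that adapts the geometry to dataset complexity. HIDISC outperforms existing Euclidean and hyperbolic baselines on PACS, Office-Home, and DomainNet.
Link: https://arxiv.org/abs/2510.17188
Authors: Vaibhav Rathore, Divyam Gupta, Biplab Banerjee
Affiliations: Indian Institute of Technology Bombay
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at NeurIPS (2025) Main Conference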
Abstract:Generalized Category Discovery (GCD) aims to classify test-time samples into either seen categories (available during training) or novel ones, without relying on label supervision. Most existing GCD methods assume simultaneous access to labeled and unlabeled data during training, both arising from the same domain, limiting applicability in open-world scenarios involving distribution shifts. Domain Generalization with GCD (DG-GCD) lifts this constraint by requiring models to generalize to unseen domains containing novel categories, without accessing target-domain data during training. The only prior DG-GCD method, DG2CD-Net, relies on episodic training with multiple synthetic domains and task vector aggregation, incurring high computational cost and error accumulation. We propose HIDISC, a hyperbolic representation learning framework that achieves domain and category-level generalization without episodic simulation. To expose the model to minimal but diverse domain variations, we augment the source domain using GPT-guided diffusion, avoiding overfitting while maintaining efficiency. To structure the representation space, we introduce Tangent CutMix, a curvature-aware interpolation that synthesizes pseudo-novel samples in tangent space, preserving manifold consistency. A unified loss, combining penalized Busemann alignment, hybrid hyperbolic contrastive regularization, and adaptive outlier repulsion, facilitates compact, semantically structured embeddings. A learnable curvature parameter further adapts the geometry to dataset complexity. HIDISC achieves state-of-the-art results on PACS, Office-Home, and DomainNet, consistently outperforming the existing Euclidean and hyperbolic (DG)-GCD baselines.
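To illustrate what "interpolating in tangent space" means for hyperbolic embeddings, the sketch below uses the standard exponential and logarithmic maps at the origin of a Poincaré ball with curvature parameter `c`, and mixes two points linearly in the tangent space before mapping back. This is a plain interpolation used as a stand-in: the abstract does not detail how Tangent CutMix combines samples, and the function names are assumptions.

```python
import math


def _norm(v):
    return math.sqrt(sum(x * x for x in v))


def exp_map_origin(v, c=1.0):
    """Map a tangent vector at the origin onto the Poincare ball."""
    n = _norm(v)
    if n == 0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * x for x in v]


def log_map_origin(x, c=1.0):
    """Map a point on the ball back to the tangent space at the origin."""
    n = _norm(x)
    if n == 0:
        return list(x)
    scale = math.atanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * xi for xi in x]


def tangent_mix(x, y, lam, c=1.0):
    """Interpolate two ball points in tangent space, then map back."""
    u, w = log_map_origin(x, c), log_map_origin(y, c)
    mixed = [lam * a + (1 - lam) * b for a, b in zip(u, w)]
    return exp_map_origin(mixed, c)
```

Mixing in tangent space and mapping back keeps the result strictly inside the ball of radius 1/sqrt(c), because tanh is bounded by 1.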
[CV-70] Capturing Head Avatar with Hand Contacts from a Monocular Video ICCV2025
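[Quick read]: This paper addresses the fact that existing methods for building photorealistic 3D head avatars ignore the non-rigid deformations caused by hand-face interaction, and so cannot model natural gestures such as a hand resting on the chin or fingers touching the cheek, which convey cognitive states. The key to the solution has three parts. First, hand and face pose tracking are optimized jointly with a depth order loss and contact regularization to keep their spatial relationship correct. Second, a PCA basis specific to hand-induced facial deformations is learned from a face-hand interaction dataset, reducing the full deformation field to a compact set of PCA parameters. Third, inspired by physics-based simulation, a contact loss provides additional supervision that reduces interpenetration artifacts and improves physical plausibility.
Link: https://arxiv.org/abs/2510.17181
Authors: Haonan He, Yufeng Zheng, Jie Song
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; ETH Zürich; Max Planck Institute for Intelligent Systems
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025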
Abstract:Photorealistic 3D head avatars are vital for telepresence, gaming, and VR. However, most methods focus solely on facial regions, ignoring natural hand-face interactions, such as a hand resting on the chin or fingers gently touching the cheek, which convey cognitive states like pondering. In this work, we present a novel framework that jointly learns detailed head avatars and the non-rigid deformations induced by hand-face interactions. There are two principal challenges in this task. First, naively tracking hand and face separately fails to capture their relative poses. To overcome this, we propose to combine depth order loss with contact regularization during pose tracking, ensuring correct spatial relationships between the face and hand. Second, no publicly available priors exist for hand-induced deformations, making them non-trivial to learn from monocular videos. To address this, we learn a PCA basis specific to hand-induced facial deformations from a face-hand interaction dataset. This reduces the problem to estimating a compact set of PCA parameters rather than a full spatial deformation field. Furthermore, inspired by physics-based simulation, we incorporate a contact loss that provides additional supervision, significantly reducing interpenetration artifacts and enhancing the physical plausibility of the results. We evaluate our approach on RGB(D) videos captured by an iPhone. Additionally, to better evaluate the reconstructed geometry, we construct a synthetic dataset of avatars with various types of hand interactions. We show that our method can capture better appearance and more accurate deforming geometry of the face than SOTA surface reconstruction methods.
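Why a PCA basis turns a dense deformation field into a handful of parameters can be shown with a small sketch. Given a mean deformation and an orthonormal basis (both hypothetical inputs here, standing in for whatever the authors learn from their interaction dataset), a deformation is encoded by projecting onto the basis and decoded as the mean plus a weighted sum of components.

```python
def encode(deformation, mean, basis):
    """Project a flattened deformation onto an orthonormal PCA basis.

    basis: list of components, each the same length as `deformation`.
    Returns one coefficient per component.
    """
    centered = [d - m for d, m in zip(deformation, mean)]
    return [sum(b * c for b, c in zip(component, centered))
            for component in basis]


def decode(coefficients, mean, basis):
    """Rebuild a deformation as mean + sum_k coefficient_k * component_k."""
    out = list(mean)
    for coefficient, component in zip(coefficients, basis):
        out = [o + coefficient * b for o, b in zip(out, component)]
    return out
```

With k components, an optimizer only has to estimate k numbers per frame rather than a displacement for every vertex; whatever lies outside the span of the basis is discarded, which also acts as a regularizer.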
[CV-71] Benchmarking Out-of-Distribution Detection for Plankton Recognition: A Systematic Evaluation of Advanced Methods in Marine Ecological Monitoring
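[Quick read]: This paper addresses the unpredictable errors that automated plankton recognition models make in real-world deployment because of distribution shift (Out-of-Distribution, OoD) between training and test data, a challenge rooted in plankton's complex morphologies, vast species diversity, and the continuous discovery of new species. Based on the DYB-PlanktonNet dataset, the authors design a series of OoD benchmarks that simulate different distribution shift scenarios and systematically evaluate 22 OoD detection methods. The key contribution is a large-scale, structured evaluation framework that provides the first systematic comparison of OoD detection methods for plankton recognition. Results show that ViM (Virtual-logit Matching) significantly outperforms the other methods on the constructed benchmarks, with especially large gains on key metrics in Far-OoD scenarios, giving a reliable reference for algorithm selection and a foundation for future research.
Link: https://arxiv.org/abs/2510.17179
Authors: Yingzi Han, Jiakai He, Chuanlong Xie, Jianping Li
Affiliations: Beijing Normal University; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: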
Abstract:Automated plankton recognition models face significant challenges during real-world deployment due to distribution shifts (Out-of-Distribution, OoD) between training and test data. This stems from plankton’s complex morphologies, vast species diversity, and the continuous discovery of novel species, which leads to unpredictable errors during inference. Despite rapid advancements in OoD detection methods in recent years, the field of plankton recognition still lacks a systematic integration of the latest computer vision developments and a unified benchmark for large-scale evaluation. To address this, this paper meticulously designed a series of OoD benchmarks simulating various distribution shift scenarios based on the DYB-PlanktonNet dataset \cite{875n-f104-21}, and systematically evaluated twenty-two OoD detection methods. Extensive experimental results demonstrate that the ViM \cite{wang2022vim} method significantly outperforms other approaches in our constructed benchmarks, particularly excelling in Far-OoD scenarios with substantial improvements in key metrics. This comprehensive evaluation not only provides a reliable reference for algorithm selection in automated plankton recognition but also lays a solid foundation for future research in plankton OoD detection. To our knowledge, this study marks the first large-scale, systematic evaluation and analysis of Out-of-Distribution data detection methods in plankton recognition. Code is available at this https URL.
[CV-72] Generation then Reconstruction: Accelerating Masked Autoregressive Models via Two-Stage Sampling
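[Quick read]: This paper addresses the efficiency bottleneck of Masked Autoregressive (MAR) models in visual generation, which arises from having to model spatially correlated visual tokens in a single step. MAR models can generate in parallel, but their acceleration potential is still limited by the need to model local detail and global structure together. The key idea is a training-free hierarchical sampling strategy, Generation then Reconstruction (GtR), which splits sampling into two stages: a slow first stage that generates the global semantic scaffold (structure generation), and a fast second stage that reconstructs the remaining details (detail reconstruction). The design rests on the assumption that completing an image from a basic framework is easier than creating it from scratch, so compute is allocated unevenly between the stages to gain speed. The paper also introduces Frequency-Weighted Token Selection (FTS), which locates detail regions by their high-frequency energy and gives them more of the compute budget. The result is a 3.72x speedup on MAR-H with comparable quality (FID 1.59 vs. 1.59 originally; IS 304.4 vs. 299.1 originally).
Link: https://arxiv.org/abs/2510.17171
Authors: Feihong Yan,Peiru Wang,Yao Zhu,Kaiyu Pang,Qingyan Wei,Huiqi Li,Linfeng Zhang
Affiliations: EPIC Lab, SJTU (上海交通大学); Beijing Institute of Technology (北京理工大学); Tsinghua University (清华大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 6 figures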
Abstract:Masked Autoregressive (MAR) models promise better efficiency in visual generation than autoregressive (AR) models for the ability of parallel generation, yet their acceleration potential remains constrained by the modeling complexity of spatially correlated visual tokens in a single step. To address this limitation, we introduce Generation then Reconstruction (GtR), a training-free hierarchical sampling strategy that decomposes generation into two stages: structure generation establishing global semantic scaffolding, followed by detail reconstruction efficiently completing remaining tokens. Assuming that it is more difficult to create an image from scratch than to complement images based on a basic image framework, GtR is designed to achieve acceleration by computing the reconstruction stage quickly while maintaining the generation quality by computing the generation stage slowly. Moreover, observing that tokens on the details of an image often carry more semantic information than tokens in the salient regions, we further propose Frequency-Weighted Token Selection (FTS) to offer more computation budget to tokens on image details, which are localized based on the energy of high frequency information. Extensive experiments on ImageNet class-conditional and text-to-image generation demonstrate 3.72x speedup on MAR-H while maintaining comparable quality (e.g., FID: 1.59, IS: 304.4 vs. original 1.59, 299.1), substantially outperforming existing acceleration methods across various model scales and generation tasks. Our codes will be released in this https URL.
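The token-selection idea behind FTS can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, and the score used here is the sum of squared deviations from the patch mean, which by Parseval's theorem equals the patch's non-DC spectral energy up to a constant factor. It is a simplified stand-in for the high-frequency energy the abstract describes.

```python
def detail_energy(patch):
    """Sum of squared deviations from the patch mean.

    Assumption: this non-DC energy is used as a simple proxy for the
    high-frequency energy mentioned in the abstract.
    """
    pixels = [v for row in patch for v in row]
    mean = sum(pixels) / len(pixels)
    return sum((v - mean) ** 2 for v in pixels)


def select_detail_tokens(patches, k):
    """Return the (sorted) indices of the k patches with the largest detail energy."""
    scores = [detail_energy(p) for p in patches]
    order = sorted(range(len(patches)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


# Three hypothetical 2x2 token patches: flat, mildly textured, strongly textured.
patches = [
    [[0.5, 0.5], [0.5, 0.5]],
    [[0.4, 0.6], [0.5, 0.5]],
    [[0.0, 1.0], [1.0, 0.0]],
]
detail_tokens = select_detail_tokens(patches, k=2)
```

In this toy setup the flat patch is left out, and the two textured patches are the ones that would receive the larger share of the compute budget.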
[CV-73] Investigating Adversarial Robustness against Preprocessing used in Blackbox Face Recognition
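[Quick read]: This paper studies the transferability of adversarial examples against face recognition (FR) systems, focusing on how different face preprocessing techniques affect attack success in blackbox settings. The choice of face detection model can reduce the attack success rate by up to 78%, whereas the interpolation method has little effect. Notably, even in a whitebox setting, preprocessing weakens attacks because the generated noise vectors interact in unintended ways with the face detection model. The key to the solution is a preprocessing-invariant method based on input transformations, which improves the transferability of the studied attacks by up to 27%.
Link: https://arxiv.org/abs/2510.17169
Authors: Roland Croft,Brian Du,Darcy Joseph,Sharath Kumar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication in DICTA 2025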
Abstract:Face Recognition (FR) models have been shown to be vulnerable to adversarial examples that subtly alter benign facial images, exposing blind spots in these systems, as well as protecting user privacy. End-to-end FR systems first obtain preprocessed faces from diverse facial imagery prior to computing the similarity of the deep feature embeddings. Whilst face preprocessing is a critical component of FR systems, and hence adversarial attacks against them, we observe that this preprocessing is often overlooked in blackbox settings. Our study seeks to investigate the transferability of several out-of-the-box state-of-the-art adversarial attacks against FR when applied against different preprocessing techniques used in a blackbox setting. We observe that the choice of face detection model can degrade the attack success rate by up to 78%, whereas choice of interpolation method during downsampling has relatively minimal impacts. Furthermore, we find that the requirement for facial preprocessing even degrades attack strength in a whitebox setting, due to the unintended interaction of produced noise vectors against face detection models. Based on these findings, we propose a preprocessing-invariant method using input transformations that improves the transferability of the studied attacks by up to 27%. Our findings highlight the importance of preprocessing in FR systems, and the need for its consideration towards improving the adversarial generalisation of facial adversarial examples.
[CV-74] GACO-CAD: Geometry-Augmented and Conciseness-Optimized CAD Model Generation from Single Image
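[Quick read]: This paper addresses the inaccurate 3D geometry reconstruction of current multimodal large language models (MLLMs) when generating editable, parametric CAD models from a single 2D image, a problem caused by limited spatial reasoning. The key is GACO-CAD, a two-stage post-training framework. In the first stage, supervised fine-tuning adds depth maps and surface normal maps as dense geometric priors, combined with the RGB image into a multi-channel input, to strengthen 3D geometry recovery. In the second stage, reinforcement learning uses a group length reward that preserves high geometric fidelity while steering the model toward more compact, less redundant parametric modeling sequences, jointly optimizing geometric accuracy and modeling conciseness.
Link: https://arxiv.org/abs/2510.17157
Authors: Yinghui Wang,Xinyu Zhang,Peng Du
Affiliations: Shanghai Engineering Research Center of Intelligent Computing, School of Information Science and Technology, East China Normal University (华东师范大学信息科学技术学院); School of Computer Science and Technology, Zhejiang University (浙江大学计算机科学与技术学院)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: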
Abstract:Generating editable, parametric CAD models from a single image holds great potential to lower the barriers of industrial concept design. However, current multi-modal large language models (MLLMs) still struggle with accurately inferring 3D geometry from 2D images due to limited spatial reasoning capabilities. We address this limitation by introducing GACO-CAD, a novel two-stage post-training framework. It is designed to achieve a joint objective: simultaneously improving the geometric accuracy of the generated CAD models and encouraging the use of more concise modeling procedures. First, during supervised fine-tuning, we leverage depth and surface normal maps as dense geometric priors, combining them with the RGB image to form a multi-channel input. In the context of single-view reconstruction, these priors provide complementary spatial cues that help the MLLM more reliably recover 3D geometry from 2D observations. Second, during reinforcement learning, we introduce a group length reward that, while preserving high geometric fidelity, promotes the generation of more compact and less redundant parametric modeling sequences. A simple dynamic weighting strategy is adopted to stabilize training. Experiments on the DeepCAD and Fusion360 datasets show that GACO-CAD achieves state-of-the-art performance under the same MLLM backbone, consistently outperforming existing methods in terms of code validity, geometric accuracy, and modeling conciseness.
[CV-75] DiffVLA++: Bridging Cognitive Reasoning and End-to-End Driving through Metric-Guided Alignment
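[Quick read]: This paper addresses two problems: conventional end-to-end (E2E) driving models generalize poorly to long-tail scenarios, and Vision-Language-Action (VLA) models can produce physically infeasible actions because of limited 3D reasoning. The key is the DiffVLA++ framework, whose metric-guided trajectory scorer explicitly aligns semantic understanding with physical feasibility: a VLA module generates semantically grounded trajectories, an E2E module with a dense trajectory vocabulary ensures physical feasibility, and the scorer guides and aligns the two outputs to combine their strengths. DiffVLA++ reaches an EPDMS of 49.12 on the ICCV 2025 Autonomous Grand Challenge leaderboard.
Link: https://arxiv.org/abs/2510.17148
Authors: Yu Gao,Anqing Jiang,Yiru Wang,Heng Yuwen,Wang Shuo,Sun Hao,Wang Jijun
Affiliations: Bosch (博世); Tsinghua University (清华大学)
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: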
Abstract:Conventional end-to-end (E2E) driving models are effective at generating physically plausible trajectories, but often fail to generalize to long-tail scenarios due to the lack of essential world knowledge to understand and reason about surrounding environments. In contrast, Vision-Language-Action (VLA) models leverage world knowledge to handle challenging cases, but their limited 3D reasoning capability can lead to physically infeasible actions. In this work we introduce DiffVLA++, an enhanced autonomous driving framework that explicitly bridges cognitive reasoning and E2E planning through metric-guided alignment. First, we build a VLA module directly generating semantically grounded driving trajectories. Second, we design an E2E module with a dense trajectory vocabulary that ensures physical feasibility. Third, and most critically, we introduce a metric-guided trajectory scorer that guides and aligns the outputs of the VLA and E2E modules, thereby integrating their complementary strengths. The experiment on the ICCV 2025 Autonomous Grand Challenge leaderboard shows that DiffVLA++ achieves EPDMS of 49.12.
[CV-76] KineDiff3D: Kinematic-Aware Diffusion for Category-Level Articulated Object Shape Reconstruction and Generation
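[Quick read]: This paper addresses 3D reconstruction and pose estimation of articulated objects (such as laptops and drawers) from single-view input; their multi-part geometry and variable joint configurations produce large structural diversity. The key is a unified framework, KineDiff3D, with three core components: (1) a novel Kinematic-Aware VAE (KA-VAE) that encodes complete geometry (SDFs), joint angles, and part segmentation into a structured latent space; (2) two conditional diffusion models, one regressing global pose (SE(3)) and joint parameters and the other generating the kinematic-aware latent code from partial observations; and (3) an iterative optimization module that bidirectionally refines reconstruction accuracy and kinematic parameters through Chamfer-distance minimization while preserving articulation constraints.
Link: https://arxiv.org/abs/2510.17137
Authors: WenBo Xu,Liu Liu,Li Zhang,Ran Zhang,Hao Wu,Dan Guo,Meng Wang
Affiliations: Hefei University of Technology (合肥工业大学); University of Science and Technology of China (中国科学技术大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: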
Abstract:Articulated objects, such as laptops and drawers, exhibit significant challenges for 3D reconstruction and pose estimation due to their multi-part geometries and variable joint configurations, which introduce structural diversity across different states. To address these challenges, we propose KineDiff3D: Kinematic-Aware Diffusion for Category-Level Articulated Object Shape Reconstruction and Generation, a unified framework for reconstructing diverse articulated instances and pose estimation from single view input. Specifically, we first encode complete geometry (SDFs), joint angles, and part segmentation into a structured latent space via a novel Kinematic-Aware VAE (KA-VAE). In addition, we employ two conditional diffusion models: one for regressing global pose (SE(3)) and joint parameters, and another for generating the kinematic-aware latent code from partial observations. Finally, we produce an iterative optimization module that bidirectionally refines reconstruction accuracy and kinematic parameters via Chamfer-distance minimization while preserving articulation constraints. Experimental results on synthetic, semi-synthetic, and real-world datasets demonstrate the effectiveness of our approach in accurately reconstructing articulated objects and estimating their kinematic properties.
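For readers unfamiliar with the Chamfer distance that the refinement module minimizes, here is a minimal pure-Python sketch. It assumes the common convention of averaging squared nearest-neighbour distances in both directions; the abstract does not say which variant the paper uses, and the point sets below are made up for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty point sets.

    Each direction averages, over the source points, the squared distance
    to the nearest point in the other set; the two directions are summed.
    """
    if not points_a or not points_b:
        raise ValueError("both point sets must be non-empty")

    def one_way(src, dst):
        return sum(min(squared_distance(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


# Hypothetical reference surface samples and a slightly wrong reconstruction.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reconstruction = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
loss = chamfer_distance(reconstruction, reference)
```

This brute-force version is quadratic in the number of points; practical pipelines typically use a spatial index or a batched GPU implementation instead.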
[CV-77] GOOD: Training-Free Guided Diffusion Sampling for Out-of-Distribution Detection
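[Quick read]: This paper addresses the semantic instability and insufficient shift diversity of current methods that use text-to-image diffusion models to generate out-of-distribution (OOD) samples, which limit how well OOD detection generalizes to realistic settings. The key is the GOOD framework, built on dual-level guidance: image-level guidance, which uses the gradient of the log partition function to reduce input likelihood and drive sampling trajectories toward low-density regions in pixel space; and feature-level guidance, which uses k-nearest-neighbor distance in the classifier's latent space to promote sampling in feature-sparse regions. The method also defines a unified OOD score that adaptively combines image and feature discrepancies, yielding more controllable and diverse OOD sample generation and more robust detection.
Link: https://arxiv.org/abs/2510.17131
Authors: Xin Gao,Jiyao Liu,Guanghao Li,Yueming Lyu,Jianxiong Gao,Weichen Yu,Ningsheng Xu,Liang Wang,Caifeng Shan,Ziwei Liu,Chenyang Si
Affiliations: Nanjing University (南京大学); Fudan University (复旦大学); Carnegie Mellon University (卡内基梅隆大学); Chinese Academy of Sciences (中国科学院); Nanyang Technological University (南洋理工大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 28 pages, 16 figures, conference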
Abstract:Recent advancements have explored text-to-image diffusion models for synthesizing out-of-distribution (OOD) samples, substantially enhancing the performance of OOD detection. However, existing approaches typically rely on perturbing text-conditioned embeddings, resulting in semantic instability and insufficient shift diversity, which limit generalization to realistic OOD. To address these challenges, we propose GOOD, a novel and flexible framework that directly guides diffusion sampling trajectories towards OOD regions using off-the-shelf in-distribution (ID) classifiers. GOOD incorporates dual-level guidance: (1) Image-level guidance based on the gradient of log partition to reduce input likelihood, drives samples toward low-density regions in pixel space. (2) Feature-level guidance, derived from k-NN distance in the classifier’s latent space, promotes sampling in feature-sparse regions. Hence, this dual-guidance design enables more controllable and diverse OOD sample generation. Additionally, we introduce a unified OOD score that adaptively combines image and feature discrepancies, enhancing detection robustness. We perform thorough quantitative and qualitative analyses to evaluate the effectiveness of GOOD, demonstrating that training with samples generated by GOOD can notably enhance OOD detection performance.
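The k-NN distance underlying the feature-level guidance can be sketched as follows. This shows only the scoring step, not the guided diffusion sampler; the feature bank, the 2D features, and the function name are hypothetical, and a real system would use classifier embeddings of in-distribution images.

```python
import math


def knn_distance(query, bank, k):
    """Distance from `query` to its k-th nearest neighbour in `bank`.

    A larger value means the query lies in a sparser region of feature space.
    """
    if k < 1 or k > len(bank):
        raise ValueError("k must be between 1 and len(bank)")
    dists = sorted(math.dist(query, feature) for feature in bank)
    return dists[k - 1]


# Hypothetical in-distribution feature bank (2D for readability).
bank = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
near = knn_distance((0.5, 0.5), bank, k=2)  # inside the cluster
far = knn_distance((5.0, 5.0), bank, k=2)   # well outside it
```

Under this sketch, a sample whose features drift away from the bank gets a larger score, which is the direction feature-level guidance is described as encouraging.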
[CV-78] Matricial Free Energy as a Gaussianizing Regularizer: Enhancing Autoencoders for Gaussian Code Generation
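[Quick read]: This paper addresses the difficulty of training autoencoders whose codes generalize well, focusing on how to shape the singular value distribution of the code matrix. The key is a regularizer based on matricial free energy: a differentiable loss defined on the singular values of the code matrix. Random matrix theory and free probability show that this loss is minimized when the singular value distribution of the code matrix matches that of a random matrix with i.i.d. Gaussian entries. Empirically, minimizing the negative matricial free energy with standard stochastic gradient training produces Gaussian-like codes that generalize across training and test sets. The authors further propose a matricial free energy maximizing autoencoder that reliably produces Gaussian codes, and apply it to underdetermined inverse problems.
Link: https://arxiv.org/abs/2510.17120
Authors: Rishi Sonthalia,Raj Rao Nadakuditi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: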
Abstract:We introduce a novel regularization scheme for autoencoders based on matricial free energy. Our approach defines a differentiable loss function in terms of the singular values of the code matrix (code dimension x batch size). From the standpoint of free probability and random matrix theory, this loss achieves its minimum when the singular value distribution of the code matrix coincides with that of an appropriately sculpted random matrix with i.i.d. Gaussian entries. Empirical simulations demonstrate that minimizing the negative matricial free energy through standard stochastic gradient-based training yields Gaussian-like codes that generalize across training and test sets. Building on this foundation, we propose a matricial free energy maximizing autoencoder that reliably produces Gaussian codes and show its application to underdetermined inverse problems.
[CV-79] Towards Imperceptible Watermarking Via Environment Illumination for Consumer Cameras
Link: https://arxiv.org/abs/2510.17114
Authors: Hodaka Kawachi,Tomoya Nakamura,Hiroaki Santo,SaiKiran Kumar Tedla,Trevor Dalton Canham,Yasushi Yagi,Michael S. Brown
Affiliations: University of Osaka (大阪大学); York University (约克大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-80] Boosting Fidelity for Pre-Trained-Diffusion-Based Low-Light Image Enhancement via Condition Refinement
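[Quick read]: This paper addresses the loss of content fidelity that limits restoration quality when Pre-Trained Diffusion-Based (PTDB) methods are applied to difficult scenes such as low light. It identifies two causes: the absence of suitable conditional latent modeling, and the lack of bidirectional interaction between the conditional latent and the noisy latent during diffusion, so outputs look perceptually realistic but distort details. The key is a novel condition optimization strategy with two components: a latent refinement pipeline that uses generative priors to recover spatial details lost during VAE encoding, and dynamic interaction between the refined conditional latent and the noisy latent to make the control signal more effective. The method is plug-and-play, integrates into existing diffusion networks, and significantly improves fidelity without sacrificing perceptual realism or aesthetics.
Link: https://arxiv.org/abs/2510.17105
Authors: Xiaogang Xu,Jian Wang,Yunfan Lu,Ruihang Chu,Ruixing Wang,Jiafei Wu,Bei Yu,Liang Lin
Affiliations: The Chinese University of Hong Kong (香港中文大学); Snap Research (Snap 公司); HKUST (GZ) (香港科技大学(广州)); Alibaba Tongyi Lab (阿里巴巴通义实验室); DJI (大疆创新); The University of Hong Kong (香港大学); Sun Yat-Sen University (中山大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: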
Abstract:Diffusion-based methods, leveraging pre-trained large models like Stable Diffusion via ControlNet, have achieved remarkable performance in several low-level vision tasks. However, Pre-Trained Diffusion-Based (PTDB) methods often sacrifice content fidelity to attain higher perceptual realism. This issue is exacerbated in low-light scenarios, where severely degraded information caused by the darkness limits effective control. We identify two primary causes of fidelity loss: the absence of suitable conditional latent modeling and the lack of bidirectional interaction between the conditional latent and noisy latent in the diffusion process. To address this, we propose a novel optimization strategy for conditioning in pre-trained diffusion models, enhancing fidelity while preserving realism and aesthetics. Our method introduces a mechanism to recover spatial details lost during VAE encoding, i.e., a latent refinement pipeline incorporating generative priors. Additionally, the refined latent condition interacts dynamically with the noisy latent, leading to improved restoration performance. Our approach is plug-and-play, seamlessly integrating into existing diffusion networks to provide more effective control. Extensive experiments demonstrate significant fidelity improvements in PTDB methods.
[CV-81] Shape-aware Inertial Poser: Motion Tracking for Humans with Diverse Shapes Using Sparse Inertial Sensors SIGGRAPH
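[Quick read]: This paper addresses the poor generalization of human motion capture with sparse inertial measurement units (IMUs) across people with different body shapes. Existing methods typically model training data with a template adult body and ignore how body shape (for example, child versus adult) changes IMU-measured acceleration, which limits performance on non-standard body shapes. The key is Shape-aware Inertial Poser (SAIP), which decomposes the shape-related and pose-related components of the IMU measurements and uses two regression models to map between the template body and the real body. A first regression model converts the real body's IMU accelerations to match the template adult body, compensating for shape differences; a state-of-the-art method then estimates the full-body motion of the template body; finally, a second regression model maps joint velocities back to the real body, combined with a shape-aware physical optimization strategy that computes the subject's global motion. SAIP also introduces the first inertial shape estimation scheme, which models the shape-conditioned IMU-pose correlation with an MLP-based network.
Link: https://arxiv.org/abs/2510.17101
Authors: Lu Yin,Ziying Shi,Yinghao Wu,Xinyu Yi,Feng Xu,Shihui Guo
Affiliations: Xiamen University (厦门大学); Tsinghua University (清华大学)
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by SIGGRAPH Asia 2025 (TOG)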
Abstract:Human motion capture with sparse inertial sensors has gained significant attention recently. However, existing methods almost exclusively rely on a template adult body shape to model the training data, which poses challenges when generalizing to individuals with largely different body shapes (such as a child). This is primarily due to the variation in IMU-measured acceleration caused by changes in body shape. To fill this gap, we propose Shape-aware Inertial Poser (SAIP), the first solution considering body shape differences in sparse inertial-based motion capture. Specifically, we decompose the sensor measurements related to shape and pose in order to effectively model their joint correlations. Firstly, we train a regression model to transfer the IMU-measured accelerations of a real body to match the template adult body model, compensating for the shape-related sensor measurements. Then, we can easily follow the state-of-the-art methods to estimate the full body motions of the template-shaped body. Finally, we utilize a second regression model to map the joint velocities back to the real body, combined with a shape-aware physical optimization strategy to calculate global motions on the subject. Furthermore, our method relies on body shape awareness, introducing the first inertial shape estimation scheme. This is accomplished by modeling the shape-conditioned IMU-pose correlation using an MLP-based network. To validate the effectiveness of SAIP, we also present the first IMU motion capture dataset containing individuals of different body sizes. This dataset features 10 children and 10 adults, with heights ranging from 110 cm to 190 cm, and a total of 400 minutes of paired IMU-Motion samples. Extensive experimental results demonstrate that SAIP can effectively handle motion capture tasks for diverse body shapes. The code and dataset are available at this https URL.
[CV-82] GSPlane: Concise and Accurate Planar Reconstruction via Structured Representation
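[Quick read]: This paper addresses the insufficient geometric accuracy, poor surface smoothness, and unsatisfactory mesh topology of Gaussian Splatting (GS) based 3D scene reconstruction in planar regions. The key is GSPlane, which introduces planar priors: off-the-shelf segmentation and normal prediction models extract robust planar information, which is used to organize the coordinates of planar Gaussians in a structured way and enforce geometric consistency during training. A Dynamic Gaussian Re-classifier adaptively reclassifies planar Gaussians with persistently high gradients as non-planar to stabilize optimization. Finally, the optimized planar priors refine the mesh layout, which markedly improves the geometric accuracy and topology of planar regions with no loss in rendering quality.
Link: https://arxiv.org/abs/2510.17095
Authors: Ruitong Gan,Junran Peng,Yang Liu,Chuanchen Luo,Qing Li,Zhaoxiang Zhang
Affiliations: The Hong Kong Polytechnic University (香港理工大学); University of Science and Technology Beijing (北京科技大学); NLPR, MAIS, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所,模式识别国家重点实验室,智能感知与计算研究中心); University of Chinese Academy of Sciences (中国科学院大学); Shandong University (山东大学); Linketic (链接科技)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: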
Abstract:Planes are fundamental primitives of 3D scenes, especially in man-made environments such as indoor spaces and urban streets. Representing these planes in a structured and parameterized format facilitates scene editing and physical simulations in downstream applications. Recently, Gaussian Splatting (GS) has demonstrated remarkable effectiveness in the Novel View Synthesis task, with extensions showing great potential in accurate surface reconstruction. However, even state-of-the-art GS representations often struggle to reconstruct planar regions with sufficient smoothness and precision. To address this issue, we propose GSPlane, which recovers accurate geometry and produces clean and well-structured mesh connectivity for plane regions in the reconstructed scene. By leveraging off-the-shelf segmentation and normal prediction models, GSPlane extracts robust planar priors to establish structured representations for planar Gaussian coordinates, which help guide the training process by enforcing geometric consistency. To further enhance training robustness, a Dynamic Gaussian Re-classifier is introduced to adaptively reclassify planar Gaussians with persistently high gradients as non-planar, ensuring more reliable optimization. Furthermore, we utilize the optimized planar priors to refine the mesh layouts, significantly improving topological structure while reducing the number of vertices and faces. We also explore applications of the structured planar representation, which enable decoupling and flexible manipulation of objects on supportive planes. Extensive experiments demonstrate that, with no sacrifice in rendering quality, the introduction of planar priors significantly improves the geometric accuracy of the extracted meshes across various baselines.
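One way to read the re-classification rule is sketched below. Everything specific here is an assumption made for illustration: the abstract does not give the criterion, so "persistently high" is modelled as the last `patience` gradient norms all exceeding a fixed `threshold`, and the function name and data layout are hypothetical.

```python
def reclassify_planar(is_planar, grad_history, threshold, patience):
    """Return updated planar flags for a list of Gaussians.

    Assumed rule: a planar Gaussian becomes non-planar when its most recent
    `patience` gradient norms are all above `threshold`. Non-planar
    Gaussians are never promoted back to planar in this sketch.
    """
    updated = []
    for planar, history in zip(is_planar, grad_history):
        recent = history[-patience:]
        persistently_high = len(recent) == patience and all(
            g > threshold for g in recent
        )
        updated.append(planar and not persistently_high)
    return updated


# Hypothetical per-Gaussian gradient-norm histories over three iterations.
flags = [True, True, False]
history = [
    [0.10, 0.20, 0.10],  # stable planar Gaussian: stays planar
    [0.90, 0.80, 0.95],  # keeps fighting the plane constraint: reclassified
    [0.90, 0.90, 0.90],  # already non-planar: unchanged
]
new_flags = reclassify_planar(flags, history, threshold=0.5, patience=3)
```

Requiring several consecutive high-gradient steps, rather than a single spike, is what keeps a sketch like this from reclassifying Gaussians on transient noise.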
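[CV-83] Towards a Generalizable Fusion Architecture for Multimodal Object Detection ICCV2025
[Quick read]: This paper addresses the limited robustness of multimodal object detection caused by redundant features across sensor modalities and inefficient cross-modal feature fusion. The key is a preprocessing architecture, Filtered Multi-Modal Cross Attention Fusion (FMCAF), made of two modules: a frequency-domain filtering block (Freq-Filter) that suppresses redundant spectral features in RGB and infrared (IR) images, and a cross-attention-based fusion module (MCAF) that improves cross-modal feature sharing. The method needs no dataset-specific tuning and generalizes well, improving mAP@50 by +1.1% on LLVIP (low-light pedestrian detection) and by +13.9% on VEDAI (aerial vehicle detection), which supports its potential as a robust multimodal fusion foundation for future detection pipelines.
Link: https://arxiv.org/abs/2510.17078
Authors: Jad Berjawi,Yoann Dupas,Christophe Cérin
Affiliations: Université Grenoble Alpes (格勒诺布尔阿尔卑斯大学); Orange (橙色公司); Université Sorbonne Paris Nord (索邦巴黎第十三大学); INRIA (法国国家信息与自动化研究院)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 8 figures, accepted at ICCV 2025 MIRA Workshop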
Abstract:Multimodal object detection improves robustness in challenging conditions by leveraging complementary cues from multiple sensor modalities. We introduce Filtered Multi-Modal Cross Attention Fusion (FMCAF), a preprocessing architecture designed to enhance the fusion of RGB and infrared (IR) inputs. FMCAF combines a frequency-domain filtering block (Freq-Filter) to suppress redundant spectral features with a cross-attention-based fusion module (MCAF) to improve intermodal feature sharing. Unlike approaches tailored to specific datasets, FMCAF aims for generalizability, improving performance across different multimodal challenges without requiring dataset-specific tuning. On LLVIP (low-light pedestrian detection) and VEDAI (aerial vehicle detection), FMCAF outperforms traditional fusion (concatenation), achieving +13.9% mAP@50 on VEDAI and +1.1% on LLVIP. These results support the potential of FMCAF as a flexible foundation for robust multimodal fusion in future detection pipelines.
[CV-84] ProDAT: Progressive Density-Aware Tail-Drop for Point Cloud Coding
Link: https://arxiv.org/abs/2510.17068
Authors: Zhe Luo,Wenjing Jia,Stuart Perry
Affiliations: University of Technology Sydney (悉尼科技大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-85] How Universal Are SAM2 Features?
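[Quick read]: This paper studies the trade-off between general-purpose foundation vision models and their specialized counterparts, that is, how to balance efficient feature coding against task adaptability. The core question is whether specialized models, which excel at a specific task such as segmentation, lose generalization to conceptually distant tasks such as pose estimation and image captioning. The key is to probe the adaptability of frozen features with a lightweight, trainable neck and to quantify the information-theoretic cost of specialization. A cross-neck analysis of SAM2 (Segment Anything Model 2) shows that each level of adaptation creates a further representational bottleneck, revealing a quantitative trade-off between feature universality and task specificity and giving a basis for designing efficient feature coding and adaptation strategies for diverse downstream applications.
Link: https://arxiv.org/abs/2510.17051
Authors: Masoud Khairi Atani,Alon Harell,Hyomin Choi,Runyu Yang,Fabien Racape,Ivan V. Bajic
Affiliations: Simon Fraser University (西蒙菲莎大学); InterDigital AI Lab (InterDigital人工智能实验室)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been accepted for publication in IEEE Picture Coding Symposium (PCS) 2025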
Abstract:The trade-off between general-purpose foundation vision models and their specialized counterparts is critical for efficient feature coding design and is not yet fully understood. We investigate this trade-off by comparing the feature versatility of the general-purpose Hiera encoder against the segmentation-specialized Segment Anything Model 2 (SAM2). Using a lightweight, trainable neck to probe the adaptability of their frozen features, we quantify the information-theoretic cost of specialization. Our results reveal that while SAM2’s specialization is highly effective for spatially-related tasks like depth estimation, it comes at a cost. The specialized SAM2 encoder underperforms its generalist predecessor, Hiera, on conceptually distant tasks such as pose estimation and image captioning, demonstrating a measurable loss of broader semantic information. A novel cross-neck analysis on SAM2 reveals that each level of adaptation creates a further representational bottleneck. Our analysis illuminates these trade-offs in feature universality, providing a quantitative foundation for designing efficient feature coding and adaptation strategies for diverse downstream applications.
[CV-86] Video Reasoning without Training
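[Quick read]: This paper addresses the high computational cost of video reasoning with large multimodal models, which relies on expensive reinforcement learning (RL) and verbose chain-of-thought, and the lack of mechanisms for controlling the thinking process. Using the entropy of the model's output as a signal, the authors find that high-quality models go through a series of micro-explorations and micro-exploitations that keep reasoning grounded. Building on this, the proposed method, V-Reason, needs no RL or supervised fine-tuning: at inference time it adapts the model's value cache through a few optimization steps on a small trainable controller with an entropy-based objective, improving the exploration-exploitation balance. Across several video reasoning datasets, V-Reason significantly outperforms the base instruction-tuned models and comes within 0.6% average accuracy of RL-trained models, while reducing output tokens by 58.6% compared to the RL model.
Link: https://arxiv.org/abs/2510.17045
Authors: Deepak Sridhar,Kartikeya Bhardwaj,Jeya Pradha Jeyaraj,Nuno Vasconcelos,Ankita Nayak,Harris Teague
Affiliations: Qualcomm AI Research (高通人工智能研究); UCSD (加州大学圣地亚哥分校)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: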
Abstract:Video reasoning using Large Multimodal Models (LMMs) relies on costly reinforcement learning (RL) and verbose chain-of-thought, resulting in substantial computational overhead during both training and inference. Moreover, the mechanisms that control the thinking process in these reasoning models are very limited. In this paper, using entropy of the model’s output as a signal, we discover that the high-quality models go through a series of micro-explorations and micro-exploitations which keep the reasoning process grounded (i.e., avoid excessive randomness while the model is exploring or thinking through an answer). We further observe that once this “thinking” process is over, more accurate models demonstrate a better convergence by reducing the entropy significantly via a final exploitation phase (i.e., a more certain convergence towards a solution trajectory). We then use these novel, theoretically-grounded insights to tune the model’s behavior directly at inference, without using any RL or supervised fine-tuning. Specifically, during inference, our proposed approach called V-Reason (Video-Reason) adapts the value cache of the LMM via a few optimization steps on a small, trainable controller using an entropy-based objective, i.e., no supervision from any dataset or RL is necessary. This tuning improves the model’s micro-exploration and exploitation behavior during inference. Our experiments show that our proposed method achieves significant improvements over the base instruction-tuned models across several video reasoning datasets, narrowing the gap with RL-trained models to within 0.6% average accuracy without any training, while offering massive efficiency benefits: output tokens are reduced by 58.6% compared to the RL model.
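The entropy signal the abstract refers to can be computed from next-token logits as in the sketch below. It illustrates only the standard Shannon entropy of a softmax distribution; the controller and value-cache adaptation in V-Reason are not reproduced, and the logits are made-up numbers.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical logits over a 4-token vocabulary.
exploring = token_entropy([0.0, 0.0, 0.0, 0.0])    # uniform: maximal uncertainty
exploiting = token_entropy([10.0, 0.0, 0.0, 0.0])  # peaked: near-certain choice
```

Tracking this value per generated token gives the kind of trace in which high-entropy exploration phases and low-entropy exploitation phases can be told apart.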
[CV-87] Person Re-Identification via Generalized Class Prototypes
Link: https://arxiv.org/abs/2510.17043
Authors: Md Ahmed Al Muzaddid,William J. Beksi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 18 pages, 11 figures, and 4 tables
[CV-88] Click Predict Trust: Clinician-in-the-Loop AI Segmentation for Lung Cancer CT-Based Prognosis within the Knowledge-to-Action Framework
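[Quick read]: This paper addresses the time cost and variability of manual segmentation in lung cancer CT imaging, and the barriers that keep deep learning (DL) models from clinical use: limited reproducibility, insufficient prognostic accuracy, and low physician trust. The key is a clinician-in-the-loop DL pipeline that combines semi-supervised learning (SSL) with the best-performing segmentation model, VNet. It achieves accurate lung tumor segmentation (Dice = 0.83, IoU = 0.71), stable radiomic features (mean correlation = 0.76), and strong prognostic modeling (accuracy = 0.88, F1 = 0.83). The six physicians who reviewed the masks favored VNet's boundaries and preferred to use AI-generated initial masks as a starting point for refinement rather than as a replacement, which supports clinical acceptance and workflow integration.
Link: https://arxiv.org/abs/2510.17039
Authors: Mohammad R. Salmanpour,Sonya Falahati,Amir Hossein Pouria,Amin Mousavi,Somayeh Sadat Mehrnia,Morteza Alizadeh,Arman Gorji,Zeinab Farsangi,Alireza Safarian,Mehdi Maghsudi,Carlos Uribe,Arman Rahmim,Ren Yuan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 2 figures, and 2 tables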
Abstract:Lung cancer remains the leading cause of cancer mortality, with CT imaging central to screening, prognosis, and treatment. Manual segmentation is variable and time-intensive, while deep learning (DL) offers automation but faces barriers to clinical adoption. Guided by the Knowledge-to-Action framework, this study develops a clinician-in-the-loop DL pipeline to enhance reproducibility, prognostic accuracy, and clinical trust. Multi-center CT data from 999 patients across 12 public datasets were analyzed using five DL models (3D Attention U-Net, ResUNet, VNet, ReconNet, SAM-Med3D), benchmarked against expert contours on whole and click-point cropped images. Segmentation reproducibility was assessed using 497 PySERA-extracted radiomic features via Spearman correlation, ICC, Wilcoxon tests, and MANOVA, while prognostic modeling compared supervised (SL) and semi-supervised learning (SSL) across 38 dimensionality reduction strategies and 24 classifiers. Six physicians qualitatively evaluated masks across seven domains, including clinical meaningfulness, boundary quality, prognostic value, trust, and workflow integration. VNet achieved the best performance (Dice = 0.83, IoU = 0.71), radiomic stability (mean correlation = 0.76, ICC = 0.65), and predictive accuracy under SSL (accuracy = 0.88, F1 = 0.83). SSL consistently outperformed SL across models. Radiologists favored VNet for peritumoral representation and smoother boundaries, preferring AI-generated initial masks for refinement rather than replacement. These results demonstrate that integrating VNet with SSL yields accurate, reproducible, and clinically trusted CT-based lung cancer prognosis, highlighting a feasible path toward physician-centered AI translation.
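The Dice and IoU scores quoted in this abstract are standard overlap metrics. The sketch below shows how they are computed for binary masks, assuming the masks are flattened into equal-length sequences of 0/1 labels; `dice_and_iou` is a hypothetical helper, not code from the paper.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    p_sum = sum(1 for p in pred if p)
    t_sum = sum(1 for t in target if t)
    union = p_sum + t_sum - inter
    if union == 0:
        # Both masks are empty: treat as a perfect match (a common convention).
        return 1.0, 1.0
    dice = 2.0 * inter / (p_sum + t_sum)
    iou = inter / union
    return dice, iou

dice, iou = dice_and_iou([1, 1, 1, 0], [0, 1, 1, 1])
print(dice, iou)
```

For a single mask the two metrics are tied by IoU = Dice / (2 - Dice). The reported pair is consistent with that identity, since 0.83 / (2 - 0.83) ≈ 0.71, although scores averaged over many patients follow it only approximately.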
[CV-89] DINO-CVA: A Multimodal Goal-Conditioned Vision-to-Action Model for Autonomous Catheter Navigation
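Quick read: This paper aims to address the operator fatigue, increased radiation exposure, and variable procedural outcomes that stem from manual operation in cardiovascular catheter interventions. The core challenge is that existing robotic systems are mostly follow-leader designs that lack intelligent autonomy. The key to the solution is DINO-CVA, a multimodal goal-conditioned behavior cloning model that fuses visual observations and joystick kinematics into a joint embedding space, yielding policies that are both vision-aware and kinematic-aware. Goal conditioning steers navigation toward a specified anatomical destination, so action prediction accuracy is maintained while decisions are grounded in the anatomical environment, offering a feasible path toward autonomous catheter operation.
Link: https://arxiv.org/abs/2510.17038
Authors: Pedram Fekri, Majid Roshanfar, Samuel Barbeau, Seyedfarzad Famouri, Thomas Looi, Dale Podolsky, Mehrdad Zadeh, Javad Dargahi
Affiliations: Concordia University; Hospital for Sick Children (SickKids); ÉTS Montréal; Kettering University
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: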
Abstract:Cardiac catheterization remains a cornerstone of minimally invasive interventions, yet it continues to rely heavily on manual operation. Despite advances in robotic platforms, existing systems are predominantly follow-leader in nature, requiring continuous physician input and lacking intelligent autonomy. This dependency contributes to operator fatigue, more radiation exposure, and variability in procedural outcomes. This work moves towards autonomous catheter navigation by introducing DINO-CVA, a multimodal goal-conditioned behavior cloning framework. The proposed model fuses visual observations and joystick kinematics into a joint embedding space, enabling policies that are both vision-aware and kinematic-aware. Actions are predicted autoregressively from expert demonstrations, with goal conditioning guiding navigation toward specified destinations. A robotic experimental setup with a synthetic vascular phantom was designed to collect multimodal datasets and evaluate performance. Results show that DINO-CVA achieves high accuracy in predicting actions, matching the performance of a kinematics-only baseline while additionally grounding predictions in the anatomical environment. These findings establish the feasibility of multimodal, goal-conditioned architectures for catheter navigation, representing an important step toward reducing operator dependency and improving the reliability of catheter-based therapies.
[CV-90] Conditional Synthetic Live and Spoof Fingerprint Generation
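Quick read: This paper aims to address the privacy, cost, and accessibility problems of biometric data collection, specifically for fingerprint data. The core solution uses generative AI to build high-fidelity synthetic fingerprint datasets that contain both live and spoof fingerprints. The key innovation is the use of conditional StyleGAN2-ADA and StyleGAN3 architectures to generate high-quality, high-resolution live fingerprints, combined with CycleGAN to translate live fingerprints into spoofs of various attack materials (e.g., EcoFlex, Play-Doh), producing diverse and realistic synthetic datasets (DB2 and DB3). Experiments show that the generated fingerprints approach real data in matching performance (TAR of 99.47% at FAR = 0.01%) with an FID as low as 5. Standard quality metrics (NFIQ2, MINDTCT) and identity-leakage tests further support the datasets' privacy-preserving properties.
Link: https://arxiv.org/abs/2510.17035
Authors: Syed Konain Abbas, Sandip Purnapatra, M. G. Sarwar Murshed, Conor Miller-Lynch, Lambert Igene, Soumyabrata Dey, Stephanie Schuckers, Faraz Hussain
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: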
Abstract:Large fingerprint datasets, while important for training and evaluation, are time-consuming and expensive to collect and require strict privacy measures. Researchers are exploring the use of synthetic fingerprint data to address these issues. This paper presents a novel approach for generating synthetic fingerprint images (both spoof and live), addressing concerns related to privacy, cost, and accessibility in biometric data collection. Our approach utilizes conditional StyleGAN2-ADA and StyleGAN3 architectures to produce high-resolution synthetic live fingerprints, conditioned on specific finger identities (thumb through little finger). Additionally, we employ CycleGANs to translate these into realistic spoof fingerprints, simulating a variety of presentation attack materials (e.g., EcoFlex, Play-Doh). These synthetic spoof fingerprints are crucial for developing robust spoof detection systems. Through these generative models, we created two synthetic datasets (DB2 and DB3), each containing 1,500 fingerprint images of all ten fingers with multiple impressions per finger, and including corresponding spoofs in eight material types. The results indicate robust performance: our StyleGAN3 model achieves a Fréchet Inception Distance (FID) as low as 5, and the generated fingerprints achieve a True Accept Rate of 99.47% at a 0.01% False Accept Rate. The StyleGAN2-ADA model achieved a TAR of 98.67% at the same 0.01% FAR. We assess fingerprint quality using standard metrics (NFIQ2, MINDTCT), and notably, matching experiments confirm strong privacy preservation, with no significant evidence of identity leakage, confirming the strong privacy-preserving properties of our synthetic datasets.
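The "True Accept Rate at a fixed False Accept Rate" figures in this abstract come from thresholding matcher scores. The sketch below shows one common way to compute that operating point, assuming higher scores mean a better match; `tar_at_far` and the toy scores are illustrative assumptions, not the authors' evaluation code.

```python
def tar_at_far(genuine, impostor, far=1e-4):
    """Return (TAR, threshold) at the strictest threshold whose FAR <= far.

    genuine:  similarity scores of same-identity comparisons
    impostor: similarity scores of different-identity comparisons
    A comparison is accepted when its score is strictly above the threshold.
    """
    if not genuine or not impostor:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(impostor, reverse=True)
    allowed = int(far * len(ranked))      # impostor accepts we can tolerate
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    accepted = sum(1 for g in genuine if g > threshold)
    return accepted / len(genuine), threshold

genuine_scores = [0.95, 0.80, 0.35, 0.25]
impostor_scores = [0.10, 0.20, 0.30, 0.90]
print(tar_at_far(genuine_scores, impostor_scores, far=0.25))
```

Note that a FAR of 0.01% only becomes measurable with at least 10,000 impostor comparisons; with fewer, the threshold simply sits at the highest impostor score.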
[CV-91] Where Not What: Compelling Video LLMs to Learn Geometric Causality for 3D-Grounding
Link: https://arxiv.org/abs/2510.17034
Authors: Yutong Zhong
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
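[CV-92] Enrich and Detect: Video Temporal Grounding with Multimodal LLMs ICCV2025
Quick read: This paper aims to address the fine-grained alignment between natural language queries and video segments in temporal video grounding, in particular the challenges of complex semantic understanding and cross-modal matching. The key to the solution is ED-VTG, a two-stage framework. It first uses multimodal large language models (MLLMs) to transform the original query into enriched sentences that add missing details and grounding cues, improving semantic completeness. A lightweight decoder then predicts the boundaries of the target segment from contextualized representations of the enriched query. To reduce the impact of hallucination noise, the model is trained with a multiple-instance learning (MIL) objective that dynamically selects the best version of the query. This improves performance across several benchmarks and, in zero-shot settings in particular, gives a clear advantage over existing methods.
Link: https://arxiv.org/abs/2510.17023
Authors: Shraman Pramanick, Effrosyni Mavroudi, Yale Song, Rama Chellappa, Lorenzo Torresani, Triantafyllos Afouras
Affiliations: FAIR, Meta; Johns Hopkins University; Northeastern University
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: ICCV 2025 (Highlights)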
Abstract:We introduce ED-VTG, a method for fine-grained video temporal grounding utilizing multi-modal large language models. Our approach harnesses the capabilities of multimodal LLMs to jointly process text and video, in order to effectively localize natural language queries in videos through a two-stage process. Rather than being directly grounded, language queries are initially transformed into enriched sentences that incorporate missing details and cues to aid in grounding. In the second stage, these enriched queries are grounded, using a lightweight decoder, which specializes at predicting accurate boundaries conditioned on contextualized representations of the enriched queries. To mitigate noise and reduce the impact of hallucinations, our model is trained with a multiple-instance-learning objective that dynamically selects the optimal version of the query for each training sample. We demonstrate state-of-the-art results across various benchmarks in temporal video grounding and paragraph grounding settings. Experiments reveal that our method significantly outperforms all previously proposed LLM-based temporal grounding approaches and is either superior or comparable to specialized models, while maintaining a clear advantage against them in zero-shot evaluation scenarios.
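The multiple-instance-learning idea in this abstract, choosing the best version of a query per training sample, can be sketched as a minimum over per-candidate losses. The abstract does not give the selection rule or the loss, so both are assumptions here: the sketch picks the candidate with the lowest loss and uses 1 minus temporal IoU as a stand-in grounding loss. `temporal_iou` and `mil_select` are hypothetical names.

```python
def temporal_iou(a, b):
    """IoU of two (start, end) segments on the time axis."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def mil_select(predictions, ground_truth):
    """Pick the query version whose predicted segment has the lowest loss.

    predictions: dict mapping a query version name to its predicted (start, end).
    Returns (best_version, best_loss); only that loss would be back-propagated.
    """
    losses = {name: 1.0 - temporal_iou(seg, ground_truth)
              for name, seg in predictions.items()}
    best = min(losses, key=losses.get)
    return best, losses[best]

preds = {"original": (0.0, 10.0), "enriched": (4.0, 9.0)}
print(mil_select(preds, ground_truth=(5.0, 10.0)))
```

Taking the minimum means a hallucinated enrichment that hurts grounding is simply not selected for that sample, which is the noise-mitigation effect the abstract attributes to the objective.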
[CV-93] Do Satellite Tasks Need Special Pretraining?
Link: https://arxiv.org/abs/2510.17014
Authors: Ani Vanyan, Alvard Barseghyan, Hakob Tamazyan, Tigran Galstyan, Vahan Huroyan, Naira Hovakimyan, Hrant Khachatrian
Affiliations: YerevaNN research lab; YSU; Saint Louis University; University of Illinois at Urbana-Champaign
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-94] An empirical study of the effect of video encoders on Temporal Video Grounding
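Quick read: This paper aims to address the architectural overfitting that may result from temporal video grounding research relying for years on a small set of video representations. The key to the solution is an empirical study that systematically evaluates how different video encoders (CNN-based, temporal-reasoning, and transformer-based models) affect a classical architecture. The study reveals significant performance differences among video features and their potential complementarity, which can inform the design of more robust video understanding models.
Link: https://arxiv.org/abs/2510.17007
Authors: Ignacio M. De la Jara, Cristian Rodriguez-Opazo, Edison Marrese-Taylor, Felipe Bravo-Marquez
Affiliations: University of Chile; University of Adelaide; National Institute of Advanced Industrial Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: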
Abstract:Temporal video grounding is a fundamental task in computer vision, aiming to localize a natural language query in a long, untrimmed video. It has a key role in the scientific community, in part due to the large amount of video generated every day. Although we find extensive work in this task, we note that research remains focused on a small selection of video representations, which may lead to architectural overfitting in the long run. To address this issue, we propose an empirical study to investigate the impact of different video features on a classical architecture. We extract features for three well-known benchmarks, Charades-STA, ActivityNet-Captions and YouCookII, using video encoders based on CNNs, temporal reasoning and transformers. Our results show significant differences in the performance of our model by simply changing the video encoder, while also revealing clear patterns and errors derived from the use of certain features, ultimately indicating potential feature complementarity.
[CV-95] Training-free Online Video Step Grounding NEURIPS2025
Link: https://arxiv.org/abs/2510.16989
Authors: Luca Zanella, Massimiliano Mancini, Yiming Wang, Alessio Tonioni, Elisa Ricci
Affiliations: University of Trento; Fondazione Bruno Kessler; Google
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2025. Project website at this https URL
[CV-96] CARE: Contrastive Alignment for ADL Recognition from Event-Triggered Sensor Streams
Link: https://arxiv.org/abs/2510.16988
Authors: Junhao Zhao, Zishuai Liu, Ruili Fang, Jin Lu, Linghan Zhang, Fei Dou
Affiliations: University of Georgia (UGA)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
[CV-97] One-step Diffusion Models with Bregman Density Ratio Matching
Link: https://arxiv.org/abs/2510.16983
Authors: Yuanzhi Zhu, Eleftherios Tsonis, Lucas Degeorge, Vicky Kalogeiton
Affiliations: LIX, École Polytechnique, CNRS, IPP; LIGM, École Nationale des Ponts et Chaussées, CNRS, IPP; AMIAD
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: work in progress
[CV-98] Foundation Models in Medical Image Analysis: A Systematic Review and Meta-Analysis
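Quick read: This paper seeks to address the fragmentation of foundation model (FM) research in medical image analysis and the absence of a systematic review. It aims to build a structured analytical framework that maps how FMs have evolved across architectures, training paradigms, and clinical applications. The solution has three key parts. First, studies are systematically categorized into vision-only and vision-language FMs according to model architecture and training strategy. Second, a quantitative meta-analysis characterizes trends in dataset use and application domains. Finally, the review discusses current challenges (domain adaptation, efficient fine-tuning, computational constraints, and interpretability) and emerging responses (federated learning, knowledge distillation, and advanced prompting), giving clear research directions for improving the robustness, explainability, and clinical integration of FMs.
Link: https://arxiv.org/abs/2510.16973
Authors: Praveenbalaji Rajendran, Mojtaba Safari, Wenfeng He, Mingzhe Hu, Shansong Wang, Jun Zhou, Xiaofeng Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
Comments: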
Abstract:Recent advancements in artificial intelligence (AI), particularly foundation models (FMs), have revolutionized medical image analysis, demonstrating strong zero- and few-shot performance across diverse medical imaging tasks, from segmentation to report generation. Unlike traditional task-specific AI models, FMs leverage large corpora of labeled and unlabeled multimodal datasets to learn generalized representations that can be adapted to various downstream clinical applications with minimal fine-tuning. However, despite the rapid proliferation of FM research in medical imaging, the field remains fragmented, lacking a unified synthesis that systematically maps the evolution of architectures, training paradigms, and clinical applications across modalities. To address this gap, this review article provides a comprehensive and structured analysis of FMs in medical image analysis. We systematically categorize studies into vision-only and vision-language FMs based on their architectural foundations, training strategies, and downstream clinical tasks. Additionally, a quantitative meta-analysis of the studies was conducted to characterize temporal trends in dataset utilization and application domains. We also critically discuss persistent challenges, including domain adaptation, efficient fine-tuning, computational constraints, and interpretability along with emerging solutions such as federated learning, knowledge distillation, and advanced prompting. Finally, we identify key future research directions aimed at enhancing the robustness, explainability, and clinical integration of FMs, thereby accelerating their translation into real-world medical practice.
[CV-99] Unlocking Off-the-Grid Sparse Recovery with Unlimited Sensing: Simultaneous Super-Resolution in Time and Amplitude
Link: https://arxiv.org/abs/2510.16948
Authors: Ruiming Guo, Ayush Bhandari
Affiliations: Imperial College London
Categories: Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments: 28 Pages, 10 figures. To appear in IEEE Journal of Selected Topics in Signal Processing
[CV-100] Domain Generalizable Continual Learning
Link: https://arxiv.org/abs/2510.16914
Authors: Hongwei Yan, Guanglong Sun, Zhiqi Kang, Yi Zhong, Liyuan Wang
Affiliations: Tsinghua University; Inria
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages
[CV-101] Beyond RGB: Leveraging Vision Transformers for Thermal Weapon Segmentation
Link: https://arxiv.org/abs/2510.16913
Authors: Akhila Kambhatla, Ahmed R Khaled
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 Images with 1 figure and 3 Tables. This is a preprint submitted to arXiv
[CV-102] Contrail-to-Flight Attribution Using Ground Visible Cameras and Flight Surveillance Data
Link: https://arxiv.org/abs/2510.16891
Authors: Ramon Dalmau, Gabriel Jarry, Philippe Very
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-103] Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback
Link: https://arxiv.org/abs/2510.16888
Authors: Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Feize Wu, Shenghai Yuan, Zhiyuan Yan, Yang Ye, Wangbo Yu, Yuwei Niu, Shaodong Wang, Xinhua Cheng, Li Yuan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-104] Class-N-Diff: Classification-Induced Diffusion Model Can Make Fair Skin Cancer Diagnosis
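Quick read: This paper aims to address the limited accuracy of traditional class-conditioned generative models when generating images of specific medical categories (such as dermoscopic images of skin cancer), which limits their value in settings such as clinical diagnosis. The key to the solution is Class-N-Diff, a classification-induced diffusion model whose core innovation is a classifier embedded in the diffusion model. The classifier provides class-conditional guidance during generation, so image generation and classification are optimized jointly. This mechanism improves the realism and diversity of the generated images and also improves the classifier's performance on downstream diagnostic tasks.
Link: https://arxiv.org/abs/2510.16887
Authors: Nusrat Munia, Abdullah Imran
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: EMBC 2025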
Abstract:Generative models, especially Diffusion Models, have demonstrated remarkable capability in generating high-quality synthetic data, including medical images. However, traditional class-conditioned generative models often struggle to generate images that accurately represent specific medical categories, limiting their usefulness for applications such as skin cancer diagnosis. To address this problem, we propose a classification-induced diffusion model, namely, Class-N-Diff, to simultaneously generate and classify dermoscopic images. Our Class-N-Diff model integrates a classifier within a diffusion model to guide image generation based on its class conditions. Thus, the model has better control over class-conditioned image synthesis, resulting in more realistic and diverse images. Additionally, the classifier demonstrates improved performance, highlighting its effectiveness for downstream diagnostic tasks. This unique integration in our Class-N-Diff makes it a robust tool for enhancing the quality and utility of diffusion model-based synthetic dermoscopic image generation. Our code is available at this https URL.
[CV-105] Fly-CL: A Fly-Inspired Framework for Enhancing Efficient Decorrelation and Reduced Training Time in Pre-trained Model-based Continual Representation Learning
Link: https://arxiv.org/abs/2510.16877
Authors: Heming Zou, Yunliang Zang, Wutong Xu, Xiangyang Ji
Affiliations: Tsinghua University; Tianjin University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-106] Uncovering Brain-Like Hierarchical Patterns in Vision-Language Models through fMRI-Based Neural Encoding
Link: https://arxiv.org/abs/2510.16870
Authors: Yudan Ren, Xinlong Wang, Kexin Wang, Tian Xia, Zihan Ma, Zhaowei Li, Xiangrong Bi, Xiao Li, Xiaowei He
Affiliations: Northwest University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 7 figures
[CV-107] Registration is a Powerful Rotation-Invariance Learner for 3D Anomaly Detection
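Quick read: This paper aims to address the reliability of 3D anomaly detection in point-cloud data. Existing memory bank-based methods suffer from inconsistent feature transformations and limited discriminative capacity; they struggle to capture local geometric details and achieve rotation invariance, especially when registration fails, and therefore produce unreliable detections. The key to the solution is a registration-induced, rotation-invariant feature extraction framework that combines the objectives of point-cloud registration and memory-based anomaly detection. The network jointly optimizes alignment and representation learning and so acquires locally discriminative features that are robust to rotation and sensitive to anomalies.
Link: https://arxiv.org/abs/2510.16865
Authors: Yuyang Yu, Zhengwei Chen, Xuemiao Xu, Lei Zhang, Haoxin Yang, Yongwei Nie, Shengfeng He
Affiliations: South China University of Technology; Guangdong University of Petrochemical Technology; Singapore Management University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: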
Abstract:3D anomaly detection in point-cloud data is critical for industrial quality control, aiming to identify structural defects with high reliability. However, current memory bank-based methods often suffer from inconsistent feature transformations and limited discriminative capacity, particularly in capturing local geometric details and achieving rotation invariance. These limitations become more pronounced when registration fails, leading to unreliable detection results. We argue that point-cloud registration plays an essential role not only in aligning geometric structures but also in guiding feature extraction toward rotation-invariant and locally discriminative representations. To this end, we propose a registration-induced, rotation-invariant feature extraction framework that integrates the objectives of point-cloud registration and memory-based anomaly detection. Our key insight is that both tasks rely on modeling local geometric structures and leveraging feature similarity across samples. By embedding feature extraction into the registration learning process, our framework jointly optimizes alignment and representation learning. This integration enables the network to acquire features that are both robust to rotations and highly effective for anomaly detection. Extensive experiments on the Anomaly-ShapeNet and Real3D-AD datasets demonstrate that our method consistently outperforms existing approaches in effectiveness and generalizability.
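To make "rotation-invariant" concrete, the sketch below uses a classic hand-crafted descriptor, the sorted list of pairwise point distances, which does not change when a point cloud is rotated. This is only an illustration of the property; it is not the learned, registration-induced feature extractor the paper proposes, and `rotate_z` and `sorted_pairwise_distances` are hypothetical helpers.

```python
import math

def rotate_z(points, angle):
    """Rotate 3D points about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]

def sorted_pairwise_distances(points):
    """Rotation-invariant descriptor: all pairwise Euclidean distances, sorted."""
    dists = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dists.append(math.dist(points[i], points[j]))
    return sorted(dists)

cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
rotated = rotate_z(cloud, math.pi / 3)

before = sorted_pairwise_distances(cloud)
after = sorted_pairwise_distances(rotated)
print(max(abs(a - b) for a, b in zip(before, after)))  # close to zero
```

Raw coordinates change under the rotation while the descriptor does not, which is why a memory bank built on rotation-sensitive features can mismatch an otherwise normal but differently oriented sample.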
[CV-108] BARL: Bilateral Alignment in Representation and Label Spaces for Semi-Supervised Volumetric Medical Image Segmentation
Link: https://arxiv.org/abs/2510.16863
Authors: Shujian Gao, Yuan Wang, Zekuan Yu
Affiliations: Fudan University; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 5 figures
[CV-109] ArmFormer: Lightweight Transformer Architecture for Real-Time Multi-Class Weapon Segmentation and Classification
Link: https://arxiv.org/abs/2510.16854
Authors: Akhila Kambhatla, Taminul Islam, Khaled R Ahmed
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages with 4 figures and 5 tables. This is a preprint submitted to arXiv
[CV-110] 2DGS-R: Revisiting the Normal Consistency Regularization in 2D Gaussian Splatting
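Quick read: This paper aims to address the difficulty 2D Gaussian Splatting (2DGS) has in achieving both high-fidelity rendering and accurate geometry. Existing methods tend to sacrifice rendering quality when improving geometric accuracy, and vice versa, and struggle to optimize both in a single training stage. The key to the solution is 2DGS-R, a hierarchical training strategy. It first trains the original 2D Gaussians with normal consistency regularization to secure geometric accuracy. It then identifies the 2D Gaussians with inadequate rendering quality and enhances them with an in-place cloning operation. Finally, it fine-tunes the model with opacity frozen. The method needs only 1% more storage and minimal additional training time, yet improves rendering quality while preserving fine geometric structure.
Link: https://arxiv.org/abs/2510.16837
Authors: Haofan Ren, Qingsong Yan, Ming Lu, Rongfeng Lu, Zunjie Zhu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: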
Abstract:Recent advancements in 3D Gaussian Splatting (3DGS) have greatly influenced neural fields, as it enables high-fidelity rendering with impressive visual quality. However, 3DGS has difficulty accurately representing surfaces. In contrast, 2DGS transforms the 3D volume into a collection of 2D planar Gaussian disks. Despite advancements in geometric fidelity, rendering quality remains compromised, highlighting the challenge of achieving both high-quality rendering and precise geometric structures. This indicates that optimizing both geometric and rendering quality in a single training stage is currently unfeasible. To overcome this limitation, we present 2DGS-R, a new method that uses a hierarchical training approach to improve rendering quality while maintaining geometric accuracy. 2DGS-R first trains the original 2D Gaussians with the normal consistency regularization. Then 2DGS-R selects the 2D Gaussians with inadequate rendering quality and applies a novel in-place cloning operation to enhance the 2D Gaussians. Finally, we fine-tune the 2DGS-R model with opacity frozen. Experimental results show that compared to the original 2DGS, our method requires only 1% more storage and minimal additional training time. Despite this negligible overhead, it achieves high-quality rendering results while preserving fine geometric structures. These findings indicate that our approach effectively balances efficiency with performance, leading to improvements in both visual fidelity and geometric reconstruction accuracy.
[CV-111] From Mannequin to Human: A Pose-Aware and Identity-Preserving Video Generation Framework for Lifelike Clothing Display
Link: https://arxiv.org/abs/2510.16833
Authors: Xiangyu Mu, Dongliang Zhou, Jie Hou, Haijun Zhang, Weili Guan
Affiliations: Harbin Institute of Technology, Shenzhen
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:
[CV-112] Robust Cross-Domain Adaptation in Texture Features Transferring for Wood Chip Moisture Content Prediction
Link: https://arxiv.org/abs/2510.16832
Authors: Abdur Rahman, Mohammad Marufuzzaman, Jason Street, Haifeng Wang, Veera G. Gude, Randy Buchanan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-113] ReefNet: A Large scale Taxonomically Enriched Dataset and Benchmark for Hard Coral Classification
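Quick read: This paper aims to address the lack of large-scale, fine-grained, and scalable automated monitoring tools as coral reefs rapidly decline. Existing datasets are often limited in size or geographic coverage or carry coarse labels, which hampers effective training and generalization of machine learning (ML) models. The key to the solution is ReefNet, a public image dataset with approximately 925,000 genus-level hard coral annotations whose labels are mapped to the World Register of Marine Species (WoRMS), giving fine-grained taxonomic labels at a global scale. The paper also proposes two evaluation settings, a within-source benchmark and a cross-source benchmark, to test models in local and cross-domain scenarios and thereby drive progress in domain adaptation and fine-grained coral classification.
Link: https://arxiv.org/abs/2510.16822
Authors: Yahia Battach, Abdulwahab Felemban, Faizan Farooq Khan, Yousef A. Radwan, Xiang Li, Fabio Marchese, Sara Beery, Burton H. Jones, Francesca Benzoni, Mohamed Elhoseiny
Affiliations: King Abdullah University of Science and Technology (KAUST); Massachusetts Institute of Technology (MIT)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: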
Abstract:Coral reefs are rapidly declining due to anthropogenic pressures such as climate change, underscoring the urgent need for scalable, automated monitoring. We introduce ReefNet, a large public coral reef image dataset with point-label annotations mapped to the World Register of Marine Species (WoRMS). ReefNet aggregates imagery from 76 curated CoralNet sources and an additional site from Al Wajh in the Red Sea, totaling approximately 925000 genus-level hard coral annotations with expert-verified labels. Unlike prior datasets, which are often limited by size, geography, or coarse labels and are not ML-ready, ReefNet offers fine-grained, taxonomically mapped labels at a global scale to WoRMS. We propose two evaluation settings: (i) a within-source benchmark that partitions each source’s images for localized evaluation, and (ii) a cross-source benchmark that withholds entire sources to test domain generalization. We analyze both supervised and zero-shot classification performance on ReefNet and find that while supervised within-source performance is promising, supervised performance drops sharply across domains, and performance is low across the board for zero-shot models, especially for rare and visually similar genera. This provides a challenging benchmark intended to catalyze advances in domain generalization and fine-grained coral classification. We will release our dataset, benchmarking code, and pretrained models to advance robust, domain-adaptive, global coral reef monitoring and conservation.
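The two evaluation settings in this abstract differ only in how images are partitioned. The sketch below shows the distinction, assuming each record is a `(source_id, image_id)` pair; the function names, the 20% test fraction, and the toy records are illustrative assumptions, not ReefNet's released benchmark code.

```python
import random

def within_source_split(records, test_fraction=0.2, seed=0):
    """Split every source's images into train/test, so each source appears in both."""
    by_source = {}
    for source, image in records:
        by_source.setdefault(source, []).append(image)
    rng = random.Random(seed)
    train, test = [], []
    for source in sorted(by_source):
        images = sorted(by_source[source])
        rng.shuffle(images)
        n_test = max(1, int(len(images) * test_fraction)) if len(images) > 1 else 0
        test += [(source, img) for img in images[:n_test]]
        train += [(source, img) for img in images[n_test:]]
    return train, test

def cross_source_split(records, held_out_sources):
    """Withhold entire sources for testing, so train and test share no source."""
    held = set(held_out_sources)
    train = [r for r in records if r[0] not in held]
    test = [r for r in records if r[0] in held]
    return train, test

records = [(s, f"{s}_{k}") for s in ("A", "B", "C") for k in range(5)]
ws_train, ws_test = within_source_split(records)
cs_train, cs_test = cross_source_split(records, held_out_sources=["C"])
print(len(ws_train), len(ws_test), len(cs_train), len(cs_test))
```

The within-source split measures performance on familiar sites, while the cross-source split measures domain generalization to sites never seen in training, which is where the abstract reports the sharp drop.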
Artificial Intelligence
[AI-0] Unbiased Gradient Low-Rank Projection
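Quick read: This paper aims to address the lack of convergence guarantees in current low-rank projection optimization methods for training large language models (LLMs). Such methods (e.g., GaLore) achieve memory efficiency by storing only the low-rank-projected optimizer states, but the bias they introduce creates a performance gap relative to full-parameter training. The key to the solution is a layerwise sampling technique that removes the bias of the low-rank projection mechanism, yielding an unbiased low-rank optimization method named GaLore Unbiased with Muon (GUM). In theory, the method matches the convergence guarantees of the base Muon algorithm while retaining the memory efficiency of low-rank techniques. In experiments, it outperforms GaLore and even full-parameter training.
Link: https://arxiv.org/abs/2510.17802
Authors: Rui Pan, Yang Luo, Yuxing Liu, Yang You, Tong Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: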
Abstract:Memory-efficient optimization is critical for training increasingly large language models (LLMs). A popular strategy involves gradient low-rank projection, storing only the projected optimizer states, with GaLore being a representative example. However, a significant drawback of many such methods is their lack of convergence guarantees, as various low-rank projection approaches introduce inherent biases relative to the original optimization algorithms, which contribute to performance gaps compared to full-parameter training. Aiming to tackle this problem, this paper investigates the layerwise sampling technique for debiasing low-rank projection mechanisms. In particular, an instantiation of the paradigm gives rise to a novel and unbiased low-rank optimization method built upon GaLore’s mechanism and the Muon algorithm, named GaLore Unbiased with Muon (GUM). We theoretically prove our method matches the convergence guarantees of the base Muon algorithm while preserving the memory efficiency of low-rank techniques. Empirical experiments on LLM fine-tuning and pretraining also demonstrate non-trivial improvements over GaLore and even better performance than full-parameter training. Further investigation shows that the improvement of this technique comes from a more uniform distribution of knowledge inside layers, leading to more efficient utilization of the model parameter space and better memorization.
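The bias this abstract refers to, and the general idea of removing it by random sampling, can be shown on a toy gradient vector. The construction below is a generic textbook-style debiasing trick and an assumption on our part: it is not the GUM algorithm, whose layerwise sampling details are not given in the abstract. The "projection" simply keeps the first `rank` coordinates, and all function names are hypothetical.

```python
def project(grad, rank):
    """Toy rank-limited projector: keep the first `rank` coordinates, zero the rest."""
    return [g if i < rank else 0.0 for i, g in enumerate(grad)]

def debiased_branches(grad, rank, q):
    """Two candidate updates whose q-weighted mixture equals the true gradient.

    With probability q use the compensated full update, otherwise the cheap
    low-rank update. The compensation is chosen so the expectation is `grad`.
    """
    low = project(grad, rank)
    full = [(g - (1.0 - q) * l) / q for g, l in zip(grad, low)]
    return full, low

def expected_update(grad, rank, q):
    """Exact expectation of the randomized update over the sampling decision."""
    full, low = debiased_branches(grad, rank, q)
    return [q * f + (1.0 - q) * l for f, l in zip(full, low)]

grad = [1.0, -2.0, 3.0, 0.5]
print(project(grad, rank=2))                 # biased: drops the last two coordinates
print(expected_update(grad, rank=2, q=0.25)) # matches grad in expectation
```

The price of unbiasedness in this toy version is variance: the compensated branch scales the dropped coordinates by 1/q, so a small q saves memory on most steps but makes the occasional full update larger.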
[AI-1] SoftMimic: Learning Compliant Whole-body Control from Examples
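Quick read: This paper aims to address the tendency of current reinforcement learning (RL) control policies for humanoid robots to produce stiff control when imitating human motion. Strict tracking of the reference trajectory makes the robot brittle and unsafe when it meets unexpected contact. The key to the solution is the SoftMimic framework, whose core innovation is using an inverse kinematics (IK) solver to generate an augmented dataset of feasible compliant motions and rewarding the policy for matching compliant responses rather than rigidly tracking the reference trajectory. As a result, the robot absorbs external disturbances and interacts with its environment more safely.
Link: https://arxiv.org/abs/2510.17792
Authors: Gabriel B. Margolis, Michelle Wang, Nolan Fey, Pulkit Agrawal
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Website: this https URL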
Abstract:We introduce SoftMimic, a framework for learning compliant whole-body control policies for humanoid robots from example motions. Imitating human motions with reinforcement learning allows humanoids to quickly learn new skills, but existing methods incentivize stiff control that aggressively corrects deviations from a reference motion, leading to brittle and unsafe behavior when the robot encounters unexpected contacts. In contrast, SoftMimic enables robots to respond compliantly to external forces while maintaining balance and posture. Our approach leverages an inverse kinematics solver to generate an augmented dataset of feasible compliant motions, which we use to train a reinforcement learning policy. By rewarding the policy for matching compliant responses rather than rigidly tracking the reference motion, SoftMimic learns to absorb disturbances and generalize to varied tasks from a single motion clip. We validate our method through simulations and real-world experiments, demonstrating safe and effective interaction with the environment.
[AI-2] Prediction of Sea Ice Velocity and Concentration in the Arctic Ocean using Physics-informed Neural Network
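Quick read: This paper aims to address the limited generalizability and physical consistency of conventional data-driven machine learning (ML) models when predicting Arctic sea ice velocity (SIV) and sea ice concentration (SIC). As the ice thins and melting accelerates, models trained on historical data struggle to represent future, dynamically changing sea ice conditions. The key to the solution is a physics-informed neural network (PINN) strategy that embeds physical knowledge of sea ice into a model based on the Hierarchical Information-sharing U-net (HIS-Unet) architecture through a physics loss function and a physically constrained activation function. The model produces physically plausible SIV and SIC predictions even with few training samples and notably improves SIC predictions in the melting and early freezing seasons and near fast-moving ice regions.
Link: https://arxiv.org/abs/2510.17756
Authors: Younghyun Koo, Maryam Rahnemoonfar
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 49 pages, 7 figures, submitted to Environmental Modelling Software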
Abstract:As an increasing amount of remote sensing data becomes available in the Arctic Ocean, data-driven machine learning (ML) techniques are becoming widely used to predict sea ice velocity (SIV) and sea ice concentration (SIC). However, fully data-driven ML models have limitations in generalizability and physical consistency due to their excessive reliance on the quantity and quality of training data. In particular, as Arctic sea ice entered a new phase with thinner ice and accelerated melting, there is a possibility that an ML model trained with historical sea ice data cannot fully represent the dynamically changing sea ice conditions in the future. In this study, we develop physics-informed neural network (PINN) strategies to integrate physical knowledge of sea ice into the ML model. Based on the Hierarchical Information-sharing U-net (HIS-Unet) architecture, we incorporate the physics loss function and the activation function to produce physically plausible SIV and SIC outputs. Our PINN model outperforms the fully data-driven model in the daily predictions of SIV and SIC, even when trained with a small number of samples. The PINN approach particularly improves SIC predictions in melting and early freezing seasons and near fast-moving ice regions.
[AI-3] Human-AI Interactions: Cognitive Behavioral and Emotional Impacts
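Quick read: This paper seeks to address the multiple psychological risks that artificial intelligence (AI) raises in human interaction, including overreliance, cognitive offloading, social and emotional manipulation, and the subtle erosion of human agency and judgment. The key to the solution is promoting responsible, context-aware AI design. The paper stresses the need to fill gaps in longitudinal research and to establish empirically grounded evaluation frameworks that balance AI's benefits against emerging human-centric risks.
Link: https://arxiv.org/abs/2510.17753
Authors: Celeste Riley, Omar Al-Refai, Yadira Colunga Reyes, Eman Hammad
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 13 pages, 1 figure. Submitted to IEEE Transactions on Technology and Society. Preprint also available on TechRxiv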
Abstract:As stories of human-AI interactions continue to be highlighted in the news and research platforms, the challenges are becoming more pronounced, including potential risks of overreliance, cognitive offloading, social and emotional manipulation, and the nuanced degradation of human agency and judgment. This paper surveys recent research on these issues through the lens of the psychological triad: cognition, behavior, and emotion. Observations seem to suggest that while AI can substantially enhance memory, creativity, and engagement, it also introduces risks such as diminished critical thinking, skill erosion, and increased anxiety. Emotional outcomes are similarly mixed, with AI systems showing promise for support and stress reduction, but raising concerns about dependency, inappropriate attachments, and ethical oversight. This paper aims to underscore the need for responsible and context-aware AI design, highlighting gaps for longitudinal research and grounded evaluation frameworks to balance benefits with emerging human-centric risks.
[AI-4] A Multi-Threading Kernel for Enabling Neuromorphic Edge Applications ISCAS2026
【Quick read】: This paper addresses how to run neuromorphic computing efficiently on edge devices, that is, how to perform low-power, energy-efficient spiking neural network (SNN) inference on resource-constrained mobile platforms such as ARM processors. The key to the solution is a multi-threading kernel that dynamically load-balances computation across the cores of a multi-core processor. It speeds up SNN processing over single-threaded execution (4x on moderately sized networks and 1.7x on a Synfire network) and is up to 70% more energy efficient than static core assignment, which supports lightweight, low-power edge applications.
Link: https://arxiv.org/abs/2510.17745
Authors: Lars Niedermeier (1 and 3), Vyom Shah (2), Jeffrey L. Krichmar (2 and 3)
Affiliations: (1) Niedermeier Consulting, Zurich, ZH, Switzerland; (2) Department of Computer Science, University of California, Irvine, CA, USA; (3) Department of Cognitive Sciences, University of California, Irvine, CA, USA
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: Submitted to ISCAS 2026
Abstract:Spiking Neural Networks (SNNs) have sparse, event-driven processing that neuromorphic applications can leverage. In this work, we introduce a multi-threading kernel that enables neuromorphic applications running at the edge, meaning they process sensory input directly and without any up-link to or dependency on a cloud service. The kernel shows speed-up gains over single-thread processing by a factor of four on moderately sized SNNs and 1.7X on a Synfire network. Furthermore, it load-balances all cores available on multi-core processors, such as the ARM processors that run today’s mobile devices, and is up to 70% more energy efficient than static core assignment. The present work can enable the development of edge applications that have low Size, Weight, and Power (SWaP), and can prototype the integration of neuromorphic chips.
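[AI-5] Closing the Sim2Real Performance Gap in RL
【Quick read】: This paper addresses the Sim2Real performance gap: policies trained in high-fidelity simulation often lose significant performance when deployed in the real world. Existing methods improve real-world performance only indirectly, by optimizing simulator accuracy and variability, yet these proxies do not necessarily correlate with the real-world performance of the policy. The paper proposes a bi-level reinforcement learning (RL) framework: the inner level trains a policy purely in simulation, and the outer level adapts the simulation model parameters and the in-simulation reward parameters based on real-world performance, so that real-world performance is optimized directly. The key is to cast simulator adaptation as a bi-level optimization problem and to derive the mathematical tools needed for algorithms that close the Sim2Real performance gap.
Link: https://arxiv.org/abs/2510.17709
Authors: Akhil S Anand, Shambhuraj Sawant, Jasper Hoffmann, Dirk Reinhardt, Sebastien Gros
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: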
Abstract:Sim2Real aims at training policies in high-fidelity simulation environments and effectively transferring them to the real world. Despite the developments of accurate simulators and Sim2Real RL approaches, the policies trained purely in simulation often suffer significant performance drops when deployed in real environments. This drop is referred to as the Sim2Real performance gap. Current Sim2Real RL methods optimize the simulator accuracy and variability as proxies for real-world performance. However, these metrics do not necessarily correlate with the real-world performance of the policy as established theoretically and empirically in the literature. We propose a novel framework to address this issue by directly adapting the simulator parameters based on real-world performance. We frame this problem as a bi-level RL framework: the inner-level RL trains a policy purely in simulation, and the outer-level RL adapts the simulation model and in-sim reward parameters to maximize real-world performance of the in-sim policy. We derive and validate in simple examples the mathematical tools needed to develop bi-level RL algorithms that close the Sim2Real performance gap.
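The bi-level structure described in the abstract can be illustrated with a deliberately tiny sketch. It is not the authors' algorithm: the one-dimensional "real" and "simulated" returns, the parameter `theta`, the finite-difference outer gradient, and all step sizes are assumptions chosen only to make the inner and outer loops visible.

```python
# Toy bi-level loop. Every function and constant here is hypothetical.

def real_return(action: float) -> float:
    """Stand-in for a real-world rollout; the learner does not know the optimum (2.0)."""
    return -(action - 2.0) ** 2


def sim_return(action: float, theta: float) -> float:
    """Simulated return whose peak is set by the tunable simulator parameter theta."""
    return -(action - theta) ** 2


def train_in_sim(theta: float, steps: int = 200, lr: float = 0.1) -> float:
    """Inner level: gradient ascent on the simulated return only."""
    action = 0.0
    for _ in range(steps):
        grad = -2.0 * (action - theta)  # derivative of sim_return w.r.t. action
        action += lr * grad
    return action


def adapt_simulator(theta: float = 0.0, outer_steps: int = 100,
                    lr: float = 0.2, eps: float = 1e-2) -> float:
    """Outer level: move theta to raise the real return of the in-sim policy."""
    for _ in range(outer_steps):
        # Central finite difference of real performance with respect to theta.
        up = real_return(train_in_sim(theta + eps))
        down = real_return(train_in_sim(theta - eps))
        theta += lr * (up - down) / (2.0 * eps)
    return theta


theta_star = adapt_simulator()
policy = train_in_sim(theta_star)
```

In this toy setting the outer loop never tries to make the simulator "accurate" in any separate sense; it only scores `theta` by the real return of the policy that the inner loop produces, which is the point the abstract makes about optimizing real-world performance directly.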
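[AI-6] A Principle of Targeted Intervention for Multi-Agent Reinforcement Learning NEURIPS2025
【Quick read】: This paper addresses two problems in cooperative multi-agent reinforcement learning (MARL): global human guidance over the whole system is impractical at large scale, and the design of coordination mechanisms relies mostly on empirical studies and lacks an easy-to-use research tool. The key to the solution is to adopt multi-agent influence diagrams (MAIDs) as a graphical modeling framework and to propose a MAID-based interaction paradigm called targeted intervention. It is realized through Pre-Strategy Intervention (PSI), a causal inference technique that intervenes on only a single targeted agent, which mitigates the global guidance problem. The bundled relevance graph analysis of MAIDs is used to determine whether a MARL learning paradigm is workable under a given interaction design, and a composite desired outcome (the primary task goal plus an additional desired outcome) is achieved by maximizing the corresponding causal effect.
Link: https://arxiv.org/abs/2510.17697
Authors: Anjie Liu, Jianhong Wang, Samuel Kaski, Jun Wang, Mengyue Yang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Accepted to NeurIPS 2025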
Abstract:Steering cooperative multi-agent reinforcement learning (MARL) towards desired outcomes is challenging, particularly when global guidance from a human on the whole multi-agent system is impractical in large-scale MARL. On the other hand, designing mechanisms to coordinate agents mostly relies on empirical studies, lacking an easy-to-use research tool. In this work, we employ multi-agent influence diagrams (MAIDs) as a graphical framework to address the above issues. First, we introduce interaction paradigms that leverage MAIDs to analyze and visualize existing approaches in MARL. Then, we design a new interaction paradigm based on MAIDs, referred to as targeted intervention, that is applied to only a single targeted agent, so the problem of global guidance can be mitigated. In our implementation, we introduce a causal inference technique, referred to as Pre-Strategy Intervention (PSI), to realize the targeted intervention paradigm. Since MAIDs can be regarded as a special class of causal diagrams, a composite desired outcome that integrates the primary task goal and an additional desired outcome can be achieved by maximizing the corresponding causal effect through the PSI. Moreover, the bundled relevance graph analysis of MAIDs provides a tool to identify whether an MARL learning paradigm is workable under the design of an interaction paradigm. In experiments, we demonstrate the effectiveness of our proposed targeted intervention, and verify the result of relevance graph analysis.
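[AI-7] CrossGuard: Safeguarding MLLMs against Joint-Modal Implicit Malicious Attacks
【Quick read】: This paper addresses the weak safety of multimodal large language models (MLLMs) against implicit jailbreak attacks, in which text and image inputs that each look benign jointly express malicious intent. Such attacks are stealthy and hard to detect, and the scarcity of high-quality implicit data has held back research on defenses. The key to the solution is ImpForge, an automated red-teaming pipeline that uses reinforcement learning with tailored reward modules to generate diverse implicit samples across 14 domains. Building on this dataset, the authors develop CrossGuard, an intent-aware safeguard that defends against both explicit and implicit threats, significantly outperforms existing defenses on multiple safety benchmarks and out-of-domain settings, and balances security with utility.
Link: https://arxiv.org/abs/2510.17687
Authors: Xu Zhang, Hao Li, Zhichao Lu
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 14 pages, 8 figures, 2 tables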
Abstract:Multimodal Large Language Models (MLLMs) achieve strong reasoning and perception capabilities but are increasingly vulnerable to jailbreak attacks. While existing work focuses on explicit attacks, where malicious content resides in a single modality, recent studies reveal implicit attacks, in which benign text and image inputs jointly express unsafe intent. Such joint-modal threats are difficult to detect and remain underexplored, largely due to the scarcity of high-quality implicit data. We propose ImpForge, an automated red-teaming pipeline that leverages reinforcement learning with tailored reward modules to generate diverse implicit samples across 14 domains. Building on this dataset, we further develop CrossGuard, an intent-aware safeguard providing robust and comprehensive defense against both explicit and implicit threats. Extensive experiments across safe and unsafe benchmarks, implicit and explicit attacks, and multiple out-of-domain settings demonstrate that CrossGuard significantly outperforms existing defenses, including advanced MLLMs and guardrails, achieving stronger security while maintaining high utility. This offers a balanced and practical solution for enhancing MLLM robustness against real-world multimodal threats.
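[AI-8] On-the-Fly OVD Adaptation with FLAME: Few-shot Localization via Active Marginal-Samples Exploration
【Quick read】: This paper addresses the limited zero-shot performance of open-vocabulary object detection (OVD) models in specialized domains such as remote sensing (RS). Ambiguity in natural language descriptions makes fine-grained classes (for example, "fishing boat" versus "yacht") hard to separate, which hurts downstream applications such as monitoring illegal fishing. The key to the solution is a cascaded approach that couples the broad generalization of a large pre-trained OVD model with a lightweight few-shot classifier: the OVD model first generates high-recall object proposals, and a compact classifier trained in real time on only a handful of user-annotated examples then filters them for high precision. The core contribution is FLAME, a one-step active learning strategy that uses density estimation to identify uncertain samples near the decision boundary and clustering to ensure sample diversity. This enables adaptation in less than a minute without full-model fine-tuning and surpasses state-of-the-art methods on RS benchmarks.
Link: https://arxiv.org/abs/2510.17670
Authors: Yehonathan Refael, Amit Aides, Aviad Barzilai, George Leifman, Genady Beryozkin, Vered Silverman, Bolous Jaber, Tomer Shekel
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: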
Abstract:Open-vocabulary object detection (OVD) models offer remarkable flexibility by detecting objects from arbitrary text queries. However, their zero-shot performance in specialized domains like Remote Sensing (RS) is often compromised by the inherent ambiguity of natural language, limiting critical downstream applications. For instance, an OVD model may struggle to distinguish between fine-grained classes such as “fishing boat” and “yacht” since their embeddings are similar and often inseparable. This can hamper specific user goals, such as monitoring illegal fishing, by producing irrelevant detections. To address this, we propose a cascaded approach that couples the broad generalization of a large pre-trained OVD model with a lightweight few-shot classifier. Our method first employs the zero-shot model to generate high-recall object proposals. These proposals are then refined for high precision by a compact classifier trained in real-time on only a handful of user-annotated examples, drastically reducing the high costs of RS imagery […]. The core of our framework is FLAME, a one-step active learning strategy that selects the most informative samples for training. FLAME identifies, on the fly, uncertain marginal candidates near the decision boundary using density estimation, followed by clustering to ensure sample diversity. This efficient sampling technique achieves high accuracy without costly full-model fine-tuning and enables instant adaptation, within less than a minute, which is significantly faster than state-of-the-art […]. Our method consistently surpasses state-of-the-art performance on RS benchmarks, establishing a practical and resource-efficient framework for adapting foundation models to specific user needs.
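[AI-9] RESample: A Robust Data Augmentation Framework via Exploratory Sampling for Robotic Manipulation ICRA2026
【Quick read】: This paper addresses the instability of Vision-Language-Action models (VLAs) on out-of-distribution (OOD) states: when small perturbations or execution errors push the robot away from the training distribution, existing models lack the ability to recover. The key to the solution is RESample, an automated OOD data augmentation framework. It uses offline reinforcement learning to obtain an action-value network that identifies sub-optimal actions, and an exploratory sampling mechanism that samples potential OOD states from trajectories and adds them to the training data as action proxies. This explicitly encourages VLAs to recover from OOD states and improves their robustness and generalization under distribution shift.
Link: https://arxiv.org/abs/2510.17640
Authors: Yuquan Xue, Guanxing Lu, Zhenyu Wu, Chuanrui Zhang, Bofang Jia, Zhengyi Gu, Yansong Tang, Ziwei Wang
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 7 figures, submitted to ICRA2026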
Abstract:Vision-Language-Action models (VLAs) have demonstrated remarkable performance on complex robotic manipulation tasks through imitation learning. However, existing imitation learning datasets contain only successful trajectories and lack failure or recovery data, especially for out-of-distribution (OOD) states where the robot deviates from the main policy due to minor perturbations or errors, leading VLA models to struggle with states deviating from the training distribution. To this end, we propose an automated OOD data augmentation framework named RESample through exploratory sampling. Specifically, we first leverage offline reinforcement learning to obtain an action-value network that accurately identifies sub-optimal actions under the current manipulation policy. We further sample potential OOD states from trajectories via rollout, and design an exploratory sampling mechanism that adaptively incorporates these action proxies into the training dataset to ensure efficiency. Subsequently, our framework explicitly encourages the VLAs to recover from OOD states and enhances their robustness against distributional shifts. We conduct extensive experiments on the LIBERO benchmark as well as real-world robotic manipulation tasks, demonstrating that RESample consistently improves the stability and generalization ability of VLA models.
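[AI-10] GUIDE: Enhancing Gradient Inversion Attacks in Federated Learning with Denoising Models
【Quick read】: This paper studies privacy leakage from client updates in federated learning (FL), specifically the risk that Gradient Inversion Attacks (GIAs) reconstruct training data. Existing GIAs usually recover only noisy approximations of the original inputs, which limits reconstruction quality. The key contribution is GUIDE, a method that uses diffusion models as denoising tools to improve the quality and perceptual similarity of reconstructed images. GUIDE can be integrated into any GIA that relies on surrogate datasets, and experiments across different FL algorithms, models, and datasets show substantially better reconstructions, with up to 46% higher perceptual similarity as measured by the DreamSim metric.
Link: https://arxiv.org/abs/2510.17621
Authors: Vincenzo Carletti, Pasquale Foggia, Carlo Mazzocca, Giuseppe Parrella, Mario Vento
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: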
Abstract:Federated Learning (FL) enables collaborative training of Machine Learning (ML) models across multiple clients while preserving their privacy. Rather than sharing raw data, federated clients transmit locally computed updates to train the global model. Although this paradigm should provide stronger privacy guarantees than centralized ML, client updates remain vulnerable to privacy leakage. Adversaries can exploit them to infer sensitive properties about the training data or even to reconstruct the original inputs via Gradient Inversion Attacks (GIAs). Under the honest-but-curious threat model, GIAs attempt to reconstruct training data by reversing intermediate updates using optimization-based techniques. We observe that these approaches usually reconstruct noisy approximations of the original inputs, whose quality can be enhanced with specialized denoising models. This paper presents Gradient Update Inversion with DEnoising (GUIDE), a novel methodology that leverages diffusion models as denoising tools to improve image reconstruction attacks in FL. GUIDE can be integrated into any GIA that exploits surrogate datasets, a widely adopted assumption in the GIA literature. We comprehensively evaluate our approach in two attack scenarios that use different FL algorithms, models, and datasets. Our results demonstrate that GUIDE integrates seamlessly with two state-of-the-art GIAs, substantially improving reconstruction quality across multiple metrics. Specifically, GUIDE achieves up to 46% higher perceptual similarity, as measured by the DreamSim metric.
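[AI-11] OG-Rank: Learning to Rank Fast and Slow with Uncertainty and Reward-Trend Guided Adaptive Exploration
【Quick read】: This paper addresses the need for ranking systems in clinical decision-making that are both real-time and explainable, that is, accurate and trustworthy candidate ranking at low latency. The key to the solution is OG-Rank, a single-decoder reranker that pairs a pooled first-token scoring signal with an uncertainty-gated explanation step. It scores all candidates in one forward pass and generates a structured rationale only when the list is genuinely ambiguous, which keeps latency predictable. Trained with a curriculum that concentrates on hard cases, the model ranks quickly by default and improves further when the gate activates (for example, Recall@1 rises from about 0.45 to about 0.56), balancing efficiency and accuracy.
Link: https://arxiv.org/abs/2510.17614
Authors: Praphul Singh, Corey Barrett, Sumana Srivasta, Irfan Bulu, Sri Gadde, Krishnaram Kenthapadi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: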
Abstract:Clinicians need ranking systems that work in real time and still justify their choices. Motivated by the need for a low-latency, decoder-based reranker, we present OG-Rank, a single-decoder approach that pairs a pooled first-token scoring signal with an uncertainty-gated explanation step. The model scores all candidates in one pass and generates a brief, structured rationale only when the list is genuinely ambiguous, keeping latency predictable. Trained with a curriculum that concentrates effort on hard cases, OG-Rank delivers strong effectiveness on encounter-scoped order selection (fast path: Recall@1~0.45, nDCG@20~0.625) and improves further when the gate activates (Recall@1~0.56, nDCG@20~0.699 at a 45% gate rate), while compact backbones show similar gains under the same policy. Encoder baselines trail in both effectiveness and flexibility. The result is a practical recipe: rank fast by default and explain when it helps, a pattern that applies broadly to decision tasks where selective generation buys accuracy at acceptable cost. The single-policy design simplifies deployment and budget planning, and the curriculum principle (spend more on the hard cases, less on the easy ones) readily transfers beyond clinical order selection.
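The "rank fast by default, explain only when ambiguous" pattern can be sketched as follows. This is a minimal illustration and not OG-Rank itself: the abstract does not state how uncertainty is measured, so the top-two probability margin and the threshold value used here are assumptions, and the candidate scores are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_with_gate(scores, margin_threshold=0.15):
    """Rank candidates and flag the list as ambiguous when the top two are close.

    The margin rule and its threshold are illustrative assumptions, not the
    paper's gating criterion.
    """
    probs = softmax(scores)
    order = sorted(range(len(scores)), key=lambda i: probs[i], reverse=True)
    if len(order) < 2:
        return order, False
    margin = probs[order[0]] - probs[order[1]]
    return order, margin < margin_threshold


# Fast path: one candidate clearly dominates, so no explanation is requested.
clear_order, clear_gate = rank_with_gate([4.0, 1.0, 0.5])
# Slow path: the top two are nearly tied, so the gate would trigger a rationale.
close_order, close_gate = rank_with_gate([2.0, 1.9, 0.1])
```

A caller would run the more expensive rationale generation only when the second return value is true, which is what keeps average latency tied to the gate rate.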
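[AI-12] CEPerFed: Communication-Efficient Personalized Federated Learning for Multi-Pulse MRI Classification
【Quick read】: This paper addresses the convergence difficulties that data heterogeneity and high communication overhead cause when training multi-pulse magnetic resonance imaging (MRI) classification models with federated learning (FL). The key to the solution is CEPerFed, a communication-efficient personalized FL method. First, it coordinates local and global optimization using client-side historical risk gradients and historical mean gradients: the former weight the contributions of other clients to make local updates more reliable, and the latter keep local updates consistent with the global optimization direction, which stabilizes convergence across heterogeneous data distributions. Second, a hierarchical singular value decomposition (HSVD) strategy transmits only the most critical information needed for model updates, which reduces communication overhead.
Link: https://arxiv.org/abs/2510.17584
Authors: Ludi Li, Junbin Mao, Hanhe Lin, Xu Tian, Fang-Xiang Wu, Jin Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: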
Abstract:Multi-pulse magnetic resonance imaging (MRI) is widely utilized for clinical practice such as Alzheimer’s disease diagnosis. To train a robust model for multi-pulse MRI classification, it requires large and diverse data from various medical institutions while protecting privacy by preventing raw data sharing across institutions. Although federated learning (FL) is a feasible solution to address this issue, it poses challenges of model convergence due to the effect of data heterogeneity and substantial communication overhead due to large numbers of parameters transmitted within the model. To address these challenges, we propose CEPerFed, a communication-efficient personalized FL method. It mitigates the effect of data heterogeneity by incorporating client-side historical risk gradients and historical mean gradients to coordinate local and global optimization. The former is used to weight the contributions from other clients, enhancing the reliability of local updates, while the latter enforces consistency between local updates and the global optimization direction to ensure stable convergence across heterogeneous data distributions. To address the high communication overhead, we propose a hierarchical SVD (HSVD) strategy that transmits only the most critical information required for model updates. Experiments on five classification tasks demonstrate the effectiveness of the CEPerFed method. The code will be released upon acceptance at this https URL.
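[AI-13] Intent-Driven LLM Ensemble Planning for Flexible Multi-Robot Disassembly: Demonstration on EV Batteries
【Quick read】: This paper addresses the planning of complex manipulation tasks in which multiple robots with different end-effectors and capabilities, informed by computer vision, must execute concatenated sequences of actions on objects in arbitrary positions and configurations in unstructured scenes. The key to the solution is an intent-driven planning pipeline that consists of (i) encoding perception results as a textual scene representation; (ii) an ensemble of large language models (LLMs) that generates candidate removal sequences based on the operator's intent; (iii) an LLM-based verifier that enforces formatting and precedence constraints; and (iv) a deterministic consistency filter that rejects hallucinated objects. Evaluated on an electric vehicle battery disassembly task, the approach reliably maps human language instructions to safe, executable multi-robot plans while keeping user effort low.
Link: https://arxiv.org/abs/2510.17576
Authors: Cansu Erdogan, Cesar Alan Contreras, Alireza Rastegarpanah, Manolis Chiou, Rustam Stolkin
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments: This work is funded by the project called “Research and Development of a Highly Automated and Safe Streamlined Process for Increasing Lithium-ion Battery Repurposing and Recycling” (REBELION) under Grant 101104241, and partially supported by the Ministry of National Education, Republic of Turkey. Submitted to Frontiers for Review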
Abstract:This paper addresses the problem of planning complex manipulation tasks, in which multiple robots with different end-effectors and capabilities, informed by computer vision, must plan and execute concatenated sequences of actions on a variety of objects that can appear in arbitrary positions and configurations in unstructured scenes. We propose an intent-driven planning pipeline which can robustly construct such action sequences with varying degrees of supervisory input from a human using simple language instructions. The pipeline integrates: (i) perception-to-text scene encoding, (ii) an ensemble of large language models (LLMs) that generate candidate removal sequences based on the operator’s intent, (iii) an LLM-based verifier that enforces formatting and precedence constraints, and (iv) a deterministic consistency filter that rejects hallucinated objects. The pipeline is evaluated on an example task in which two robot arms work collaboratively to dismantle an Electric Vehicle battery for recycling applications. A variety of components must be grasped and removed in specific sequences, determined by human instructions and/or by task-order feasibility decisions made by the autonomous system. On 200 real scenes with 600 operator prompts across five component classes, we used metrics of full-sequence correctness and next-task correctness to evaluate and compare five LLM-based planners (including ablation analyses of pipeline components). We also evaluated the LLM-based human interface in terms of time to execution and NASA TLX with human participant experiments. Results indicate that our ensemble-with-verification approach reliably maps operator intent to safe, executable multi-robot plans while maintaining low user effort.
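Stage (iv) of the pipeline, the deterministic consistency filter, lends itself to a short sketch. The version below is an illustration under stated assumptions rather than the authors' code: the function name, the component names, and the precedence pairs are hypothetical, and the precedence check is included here as a deterministic stand-in for what the paper assigns to an LLM-based verifier.

```python
def filter_plans(candidate_plans, detected_objects, precedence):
    """Keep plans that mention only detected objects and respect precedence.

    precedence is a list of (before, after) pairs: `before` must be removed
    earlier than `after` whenever both appear in a plan.
    """
    detected = set(detected_objects)
    kept = []
    for plan in candidate_plans:
        if any(step not in detected for step in plan):
            continue  # reject plans that reference objects absent from the scene
        position = {step: index for index, step in enumerate(plan)}
        ordered = all(
            position[a] < position[b]
            for a, b in precedence
            if a in position and b in position
        )
        if ordered:
            kept.append(plan)
    return kept


# Hypothetical scene and rules, used only to exercise the filter.
scene = ["bolt", "cable", "busbar", "module"]
rules = [("bolt", "busbar"), ("busbar", "module")]
plans = [
    ["bolt", "cable", "busbar", "module"],  # valid
    ["busbar", "bolt", "module"],           # violates bolt before busbar
    ["bolt", "coolant_pipe", "module"],     # coolant_pipe was never detected
]
valid_plans = filter_plans(plans, scene, rules)
```

Because the membership test is a plain set lookup against the perception output, it cannot itself hallucinate, which is the reason for making this stage deterministic instead of delegating it to another model.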
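[AI-14] An Empirical Study of Lagrangian Methods in Safe Reinforcement Learning
【Quick read】: This paper examines how the choice of the Lagrange multiplier λ affects optimality and stability in safe reinforcement learning. In safety-critical domains such as robotics, navigation, and power systems, performance must be maximized while constraints are satisfied, and λ governs the trade-off between return and constraint cost. The paper provides λ-profiles that visualize this trade-off across values of λ, showing that results are highly sensitive to λ and that there is no general intuition for choosing the optimal value λ*. Its key finding is that automated multiplier updates can recover, and sometimes exceed, the performance obtained at λ*, which the authors attribute to differences in learning trajectories, although the updates oscillate during training. PID-controlled updates mitigate the oscillation but need careful tuning to perform consistently better across tasks, which points to the need for further research on stabilizing Lagrangian methods.
Link: https://arxiv.org/abs/2510.17564
Authors: Lindsay Spoor, Álvaro Serra-Gómez, Aske Plaat, Thomas Moerland
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
Comments: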
Abstract:In safety-critical domains such as robotics, navigation and power systems, constrained optimization problems arise where maximizing performance must be carefully balanced with associated constraints. Safe reinforcement learning provides a framework to address these challenges, with Lagrangian methods being a popular choice. However, the effectiveness of Lagrangian methods crucially depends on the choice of the Lagrange multiplier $\lambda$, which governs the trade-off between return and constraint cost. A common approach is to update the multiplier automatically during training. Although this is standard in practice, there remains limited empirical evidence on the robustness of an automated update and its influence on overall performance. Therefore, we analyze (i) optimality and (ii) stability of Lagrange multipliers in safe reinforcement learning across a range of tasks. We provide $\lambda$-profiles that give a complete visualization of the trade-off between return and constraint cost of the optimization problem. These profiles show the highly sensitive nature of $\lambda$ and moreover confirm the lack of general intuition for choosing the optimal value $\lambda^*$. Our findings additionally show that automated multiplier updates are able to recover and sometimes even exceed the optimal performance found at $\lambda^*$ due to the vast difference in their learning trajectories. Furthermore, we show that automated multiplier updates exhibit oscillatory behavior during training, which can be mitigated through PID-controlled updates. However, this method requires careful tuning to achieve consistently better performance across tasks. This highlights the need for further research on stabilizing Lagrangian methods in safe reinforcement learning. The code used to reproduce our results can be found at this https URL.
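To make the difference between a plain automated update and a PID-controlled update concrete, here is a minimal sketch of a multiplier controller. It follows the general shape of PID Lagrangian updates and is not taken from the paper's code: the class name, the gain values, and the cost figures are illustrative assumptions.

```python
class PIDLagrangeMultiplier:
    """PID-style update of a non-negative Lagrange multiplier.

    With ki > 0 and kp = kd = 0 this reduces to the usual integral-only
    (gradient-ascent) multiplier update. All gains are illustrative.
    """

    def __init__(self, cost_limit, kp=0.0, ki=0.1, kd=0.0):
        self.cost_limit = cost_limit
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.previous_cost = None

    def update(self, episode_cost):
        error = episode_cost - self.cost_limit  # positive means the constraint is violated
        # Integral term, clipped at zero so it cannot go negative.
        self.integral = max(0.0, self.integral + self.ki * error)
        # Derivative term reacts only to rising cost, which damps overshoot.
        if self.previous_cost is None:
            derivative = 0.0
        else:
            derivative = max(0.0, episode_cost - self.previous_cost)
        self.previous_cost = episode_cost
        return max(0.0, self.kp * error + self.integral + self.kd * derivative)


def penalized_objective(ret, cost, lam):
    """Objective the policy maximizes for a fixed multiplier (constant limit term dropped)."""
    return ret - lam * cost


controller = PIDLagrangeMultiplier(cost_limit=25.0, kp=0.05, ki=0.1, kd=0.01)
lams = [controller.update(c) for c in (40.0, 35.0, 20.0)]
```

The proportional term lets the multiplier respond to the current violation immediately instead of waiting for the integral to accumulate, which is one intuition for why such controllers can reduce oscillation, at the price of three gains to tune.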
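[AI-15] The Graphon Limit Hypothesis: Understanding Neural Network Pruning via Infinite Width Analysis NEURIPS2025
【Quick read】: This paper addresses why sparse neural networks with the same sparsity level but produced by different pruning methods differ markedly in trainability. The key to the solution is a new theoretical framework based on graph limit theory that uses graphons to characterize the connectivity patterns of sparse networks in the infinite-width regime. The connectivity patterns induced by a pruning method converge to a specific graphon as width tends to infinity, and this graphon encodes the implicit structural bias of that method. The authors further derive a Graphon Neural Tangent Kernel (Graphon NTK) to analyze the training dynamics of sparse networks in the infinite-width limit, and show empirically that its spectral analysis correlates with observed training behavior, which gives a systematic theoretical tool for understanding the trainability of sparse networks.
Link: https://arxiv.org/abs/2510.17515
Authors: Hoang Pham, The-Anh Ta, Tom Jacobs, Rebekka Burkholz, Long Tran-Thanh
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2025 Spotlight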
Abstract:Sparse neural networks promise efficiency, yet training them effectively remains a fundamental challenge. Despite advances in pruning methods that create sparse architectures, understanding why some sparse structures are better trainable than others with the same level of sparsity remains poorly understood. Aiming to develop a systematic approach to this fundamental problem, we propose a novel theoretical framework based on the theory of graph limits, particularly graphons, that characterizes sparse neural networks in the infinite-width regime. Our key insight is that connectivity patterns of sparse neural networks induced by pruning methods converge to specific graphons as networks’ width tends to infinity, which encodes implicit structural biases of different pruning methods. We postulate the Graphon Limit Hypothesis and provide empirical evidence to support it. Leveraging this graphon representation, we derive a Graphon Neural Tangent Kernel (Graphon NTK) to study the training dynamics of sparse networks in the infinite width limit. Graphon NTK provides a general framework for the theoretical analysis of sparse networks. We empirically show that the spectral analysis of Graphon NTK correlates with observed training dynamics of sparse networks, explaining the varying convergence behaviours of different pruning methods. Our framework provides theoretical insights into the impact of connectivity patterns on the trainability of various sparse network architectures.
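[AI-16] I-RAVEN-X: Benchmarking Generalization and Robustness of Analogical and Mathematical Reasoning in Large Language and Reasoning Models NEURIPS2025
【Quick read】: This paper addresses the limited generalization and robustness of Large Language Models (LLMs) in analogical and mathematical reasoning, in particular under more complex operands, wider attribute ranges, and perceptual uncertainty. The key contribution is I-RAVEN-X, a symbolic benchmark that extends I-RAVEN by increasing operand complexity, widening attribute ranges, and introducing perceptual uncertainty. Empirically, Large Reasoning Models (LRMs) show better productivity on longer reasoning relations and better systematicity on wider attribute ranges than LLMs, but they are still significantly challenged by reasoning under uncertainty and cannot effectively explore multiple probabilistic outcomes.
Link: https://arxiv.org/abs/2510.17496
Authors: Giacomo Camposampiero, Michael Hersche, Roger Wattenhofer, Abu Sebastian, Abbas Rahimi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the 5th Workshop on Mathematical Reasoning and AI (MATH-AI), NeurIPS 2025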
Abstract:We introduce I-RAVEN-X, a symbolic benchmark designed to evaluate generalization and robustness in analogical and mathematical reasoning for Large Language Models (LLMs) and Large Reasoning Models (LRMs). I-RAVEN-X extends I-RAVEN by increasing operand complexity, attribute range, and introducing perceptual uncertainty. Compared to LLMs, empirical results show that LRMs achieve improved productivity and systematicity on longer reasoning relations and wider attribute ranges, respectively. However, LRMs are still significantly challenged by reasoning under uncertainty and cannot effectively explore multiple probabilistic outcomes.
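[AI-17] DAMSDAN: Distribution-Aware Multi-Source Domain Adaptation Network for Cross-Domain EEG-based Emotion Recognition
【Quick read】: This paper addresses the poor generalization of EEG-based emotion recognition in cross-domain settings caused by inter-individual variability. The core challenges are (1) dynamically modeling distributional heterogeneity across sources and quantifying their relevance to the target to reduce negative transfer, and (2) achieving fine-grained semantic consistency to strengthen class discrimination. The key to the solution is a distribution-aware multi-source domain adaptation network (DAMSDAN) that combines prototype-based constraints with adversarial learning to drive the encoder toward discriminative, domain-invariant emotion representations. A domain-aware source weighting strategy based on maximum mean discrepancy (MMD) dynamically estimates inter-domain shift and reweights source contributions, and a prototype-guided conditional alignment module with dual pseudo-label interaction improves pseudo-label reliability and enables category-level alignment, mitigating noise propagation and semantic drift.
Link: https://arxiv.org/abs/2510.17475
Authors: Fo Hu, Can Wang, Qinxu Zheng, Xusheng Yang, Bin Zhou, Gang Li, Yu Sun, Wen-an Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 9 figures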
Abstract:Significant inter-individual variability limits the generalization of EEG-based emotion recognition under cross-domain settings. We address two core challenges in multi-source adaptation: (1) dynamically modeling distributional heterogeneity across sources and quantifying their relevance to a target to reduce negative transfer; and (2) achieving fine-grained semantic consistency to strengthen class discrimination. We propose a distribution-aware multi-source domain adaptation network (DAMSDAN). DAMSDAN integrates prototype-based constraints with adversarial learning to drive the encoder toward discriminative, domain-invariant emotion representations. A domain-aware source weighting strategy based on maximum mean discrepancy (MMD) dynamically estimates inter-domain shifts and reweights source contributions. In addition, a prototype-guided conditional alignment module with dual pseudo-label interaction enhances pseudo-label reliability and enables category-level, fine-grained alignment, mitigating noise propagation and semantic drift. Experiments on SEED and SEED-IV show average accuracies of 94.86% and 79.78% for cross-subject, and 95.12% and 83.15% for cross-session protocols. On the large-scale FACED dataset, DAMSDAN achieves 82.88% (cross-subject). Extensive ablations and interpretability analyses corroborate the effectiveness of the proposed framework for cross-domain EEG-based emotion recognition.
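The MMD-based source weighting idea can be sketched in a few lines. The MMD estimator below is the standard biased estimate with a Gaussian kernel; everything else is an assumption made for illustration, since the abstract does not specify how discrepancies are turned into weights: the softmax over negative MMD, the `temperature` value, and the toy feature vectors are not from the paper.

```python
import math


def rbf(x, y, gamma=1.0):
    """Gaussian kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of the squared maximum mean discrepancy between two samples."""
    def mean_kernel(a, b):
        return sum(rbf(p, q, gamma) for p in a for q in b) / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def source_weights(sources, target, gamma=1.0, temperature=0.1):
    """Softmax over negative MMD, so sources closer to the target get larger weights."""
    gaps = [mmd2(s, target, gamma) for s in sources]
    logits = [-g / temperature for g in gaps]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 2-D features: one source resembles the target, the other does not.
target = [[0.0, 0.0], [0.2, 0.1], [0.1, -0.1]]
near_source = [[0.1, 0.0], [0.0, 0.1], [0.2, -0.1]]
far_source = [[2.0, 2.0], [2.1, 1.9], [1.9, 2.2]]
weights = source_weights([near_source, far_source], target)
```

In a training loop these weights would scale each source domain's loss, so that subjects or sessions whose feature distribution is far from the target contribute less and negative transfer is reduced.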
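[AI-18] Layer Specialization Underlying Compositional Reasoning in Transformers
【Quick read】: This paper investigates the mechanism behind the compositional reasoning that Transformers exhibit on sequences not seen during training, a capability often attributed to in-context learning (ICL) and skill composition. The authors use the Random Hierarchy Model (RHM) as a controlled generative grammar and systematically evaluate models under different generalization conditions. The key is to train Transformers on subsets of RHM-generated sequences and to combine behavioral analysis with mechanistic analysis (principal component analysis and attention pattern clustering). They find that layer specialization emerges progressively during training, yields hierarchically organized representations, and correlates with generalization performance. This indicates that Transformers develop modular, interpretable mechanisms that support compositional reasoning, and it links internal structure to observed behavior.
Link: https://arxiv.org/abs/2510.17469
Authors: Jing Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: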
Abstract:Transformers exhibit compositional reasoning on sequences not observed during training, a capability often attributed to in-context learning (ICL) and skill composition. We investigate this phenomenon using the Random Hierarchy Model (RHM), a probabilistic context-free grammar that generates sequences through recursive rule application. Models are trained on subsets of sequences and evaluated across four generalization conditions: memorization, in-distribution generalization, out-of-distribution generalization with the same rules, and cross-layer transfer. Behaviorally, performance improves systematically with task complexity and the number of in-context examples, with out-of-distribution tasks requiring substantially more examples than in-distribution scenarios. Mechanistically, we identify a progressive emergence of layer specialization during training that correlates with generalization performance. Principal component analysis and attention pattern clustering reveal that transformers develop structured, hierarchically organized representations in specialized layers. These results demonstrate that transformers develop modular, interpretable mechanisms supporting compositional reasoning, linking internal algorithmic structure to observed behavioral capabilities.
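For readers unfamiliar with hierarchical grammars of this kind, the sketch below generates sequences by repeated rule application in the style of a random hierarchy. It is a simplified illustration, not the paper's data pipeline: the function names, the use of one shared vocabulary size at every level, and the parameter values are assumptions.

```python
import itertools
import random


def build_grammar(vocab_size, branching, rules_per_symbol, depth, rng):
    """Sample, for every level, distinct child tuples for each symbol."""
    grammar = []
    all_tuples = list(itertools.product(range(vocab_size), repeat=branching))
    for _ in range(depth):
        # Sampling without replacement keeps tuples distinct across symbols,
        # so each child tuple identifies exactly one parent symbol.
        chosen = rng.sample(all_tuples, vocab_size * rules_per_symbol)
        level = {
            symbol: chosen[symbol * rules_per_symbol:(symbol + 1) * rules_per_symbol]
            for symbol in range(vocab_size)
        }
        grammar.append(level)
    return grammar


def generate(symbol, grammar, rng):
    """Expand a root symbol level by level into a sequence of leaf tokens."""
    sequence = [symbol]
    for level in grammar:
        expanded = []
        for s in sequence:
            expanded.extend(rng.choice(level[s]))  # pick one production rule at random
        sequence = expanded
    return sequence


rng = random.Random(0)
grammar = build_grammar(vocab_size=4, branching=2, rules_per_symbol=2, depth=3, rng=rng)
sample = generate(symbol=1, grammar=grammar, rng=rng)
```

With branching factor 2 and depth 3, every generated sequence has 2^3 = 8 tokens. Holding out some rule combinations at training time is one way such a generator can produce the in-distribution and out-of-distribution splits the abstract refers to.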
[AI-19] Label Indeterminacy in AI Law
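[Quick read]: This paper addresses the modeling bias that label indeterminacy introduces into machine learning models in the legal domain. In legal practice, the final outcome of a case is often shaped by human interventions such as settlements and appeals. Most machine learning approaches do not capture these interventions, so the training labels are potentially indeterminate. The paper argues that this indeterminacy can significantly affect model behaviour and must therefore be accounted for in AI Law. The key to the solution is to identify and handle the indeterminacy introduced when labels are constructed. Although the available imputation methods all rest on unverifiable assumptions, experiments on cases from the European Court of Human Rights show that the way labels are constructed materially affects model behaviour, which underlines label indeterminacy as a core concern.
Link: https://arxiv.org/abs/2510.17463
Authors: Cor Steging,Tadeusz Zbiegień
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: This manuscript has been accepted for presentation as a short paper at the 38th International Conference on Legal Knowledge and Information Systems (JURIX) in Turin, December 9 to 11 of 2025
Click to view abstract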
Abstract:Machine learning is increasingly used in the legal domain, where it typically operates retrospectively by treating past case outcomes as ground truth. However, legal outcomes are often shaped by human interventions that are not captured in most machine learning approaches. A final decision may result from a settlement, an appeal, or other procedural actions. This creates label indeterminacy: the outcome could have been different if the intervention had or had not taken place. We argue that legal machine learning applications need to account for label indeterminacy. Methods exist that can impute these indeterminate labels, but they are all grounded in unverifiable assumptions. In the context of classifying cases from the European Court of Human Rights, we show that the way that labels are constructed during training can significantly affect model behaviour. We therefore position label indeterminacy as a relevant concern in AI Law and demonstrate how it can shape model behaviour.
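[AI-20] The Parameterized Complexity of Computing the VC-Dimension NEURIPS2025
[Quick read]: This paper studies the computational complexity of the VC-dimension: given a hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$, how hard is it to determine its VC-dimension, and under which parameterizations does the problem become tractable? The contributions are threefold. First, under the Exponential Time Hypothesis (ETH), the naive $2^{\mathcal{O}(|\mathcal{V}|)}$-time algorithm is shown to be asymptotically tight. Second, the paper gives a fixed-parameter approximation algorithm parameterized by the maximum degree and an exact fixed-parameter algorithm parameterized by the dimension, and shows that these are essentially the only exploitable structural parameters of this kind. Third, it generalizes the problem to a graph formulation and proves that it is fixed-parameter tractable when parameterized by treewidth, with a dependency on the treewidth far below the double-exponential dependency that closely related problems necessarily incur. The key to the solution is to identify the parameterizations that matter in practice and to design algorithms with a low dependency on them, which advances both the theoretical understanding and the practical handling of VC-dimension computation.
Link: https://arxiv.org/abs/2510.17451
Authors: Florent Foucaud,Harmender Gahlawat,Fionn Mc Inerney,Prafullkumar Tale
Affiliations: Unknown
Categories: Computational Complexity (cs.CC); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM); Machine Learning (cs.LG); Combinatorics (math.CO)
Comments: To appear in the proceedings of NeurIPS 2025
Click to view abstract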
Abstract:The VC-dimension is a fundamental and well-studied measure of the complexity of a set system (or hypergraph) that is central to many areas of machine learning. We establish several new results on the complexity of computing the VC-dimension. In particular, given a hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$, we prove that the naive $2^{\mathcal{O}(|\mathcal{V}|)}$-time algorithm is asymptotically tight under the Exponential Time Hypothesis (ETH). We then prove that the problem admits a 1-additive fixed-parameter approximation algorithm when parameterized by the maximum degree of $\mathcal{H}$ and a fixed-parameter algorithm when parameterized by its dimension, and that these are essentially the only such exploitable structural parameters. Lastly, we consider a generalization of the problem, formulated using graphs, which captures the VC-dimension of both set systems and graphs. We show that it is fixed-parameter tractable parameterized by the treewidth of the graph (which, in the case of set systems, applies to the treewidth of its incidence graph). In contrast with closely related problems whose dependency on the treewidth is necessarily double-exponential (assuming the ETH), our algorithm has a relatively low dependency on the treewidth.
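As a point of reference for the "naive" algorithm mentioned in the abstract, the following sketch computes the VC-dimension of a small hypergraph by exhaustive search. It is an illustrative brute force written for this digest, not the authors' code; it assumes the hypergraph has at least one hyperedge, and the function names are hypothetical.

```python
from itertools import combinations

def is_shattered(subset, edges):
    """A set S is shattered if the traces e & S realize all 2^|S| subsets."""
    s = frozenset(subset)
    traces = {frozenset(e) & s for e in edges}
    return len(traces) == 2 ** len(s)

def vc_dimension(vertices, edges):
    """Largest size of a vertex subset shattered by the hyperedges."""
    distinct_edges = len(set(map(frozenset, edges)))
    best = 0
    for k in range(1, len(vertices) + 1):
        # A shattered set of size k needs at least 2^k distinct hyperedges.
        if 2 ** k > distinct_edges:
            break
        if any(is_shattered(c, edges) for c in combinations(vertices, k)):
            best = k
        else:
            break  # shattering is hereditary: no larger set can be shattered
    return best

# All subsets of {1, 2} appear as traces, so {1, 2} is shattered.
example = vc_dimension([1, 2, 3], [(), (1,), (2,), (1, 2), (3,)])
```

The early exit relies on the fact that every subset of a shattered set is itself shattered, so once no set of size `k` is shattered the search can stop.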
[AI-21] Active Inference for an Intelligent Agent in Autonomous Reconnaissance Missions
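[Quick read]: This paper addresses how an intelligent agent under autonomous control can balance exploration and exploitation, specifically when reconnoitering a geographical area to maintain a common operational picture. The key to the solution is an active-inference route-planning method. The method builds an evidence map that fuses positive and negative sensor observations and diffuses the evidence over time. It constructs the generative model from Dempster-Shafer theory and a Gaussian sensor model, and uses Bayesian updating to obtain a posterior probability distribution over target objects. Finally, it computes the variational free energy at every position and moves the agent toward the position that minimizes it, which yields efficient exploration of unknown areas together with continued tracking of identified targets.
Link: https://arxiv.org/abs/2510.17450
Authors: Johan Schubert,Farzad Kamrani,Tove Gustavi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Presented at the 6th International Workshop on Active Inference, 15-17 October 2025, Montreal, Canada
Click to view abstract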
Abstract:We develop an active inference route-planning method for the autonomous control of intelligent agents. The aim is to reconnoiter a geographical area to maintain a common operational picture. To achieve this, we construct an evidence map that reflects our current understanding of the situation, incorporating both positive and “negative” sensor observations of possible target objects collected over time, and diffusing the evidence across the map as time progresses. The generative model of active inference uses Dempster-Shafer theory and a Gaussian sensor model, which provides input to the agent. The generative process employs a Bayesian approach to update a posterior probability distribution. We calculate the variational free energy for all positions within the area by assessing the divergence between a pignistic probability distribution of the evidence map and a posterior probability distribution of a target object based on the observations, including the level of surprise associated with receiving new observations. Using the free energy, we direct the agents’ movements in a simulation by taking an incremental step toward a position that minimizes the free energy. This approach addresses the challenge of exploration and exploitation, allowing agents to balance searching extensive areas of the geographical map while tracking identified target objects.
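The "pignistic probability distribution" that enters the free-energy computation comes from the standard pignistic transformation in Dempster-Shafer theory, in which each focal set shares its mass equally among its members. A minimal sketch follows; the two-hypothesis frame (`"target"`, `"empty"`) and the mass values are made up for illustration and do not come from the paper.

```python
def pignistic(mass):
    """Map a Dempster-Shafer mass function to a probability distribution.

    `mass` maps frozensets of hypotheses (focal elements) to masses that
    sum to 1, with no mass on the empty set. Each focal element shares
    its mass equally among its members.
    """
    bet = {}
    for focal, m in mass.items():
        share = m / len(focal)
        for hypothesis in focal:
            bet[hypothesis] = bet.get(hypothesis, 0.0) + share
    return bet

# Hypothetical evidence for one map cell.
mass = {
    frozenset({"target"}): 0.5,
    frozenset({"empty"}): 0.2,
    frozenset({"target", "empty"}): 0.3,  # mass assigned to ignorance
}
bet = pignistic(mass)
```

The mass on the full frame represents ignorance; the transformation splits it evenly, which is what makes the result usable as an ordinary probability distribution in a divergence term.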
[AI-22] Diverse Planning with Simulators via Linear Temporal Logic
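[Quick read]: This paper addresses two related problems in simulation-based planning. An autonomous agent that relies on a planner returning a single plan may not get a plan that satisfies its preferences, and existing diverse planners can return plans that differ syntactically but are semantically identical, so they do not deliver real diversity. The key to the solution is a diverse planner called $\texttt{FBI}_\texttt{LTL}$, which uses Linear Temporal Logic (LTL) to define semantic diversity criteria explicitly, so that the agent can specify which plans count as meaningfully different. Embedding these LTL constraints directly in the search process ensures that the generated plans are diverse at the semantic level, which overcomes the limitation of approaches that only distinguish plans by form.
Link: https://arxiv.org/abs/2510.17418
Authors: Mustafa F. Abdelwahed,Alice Toniolo,Joan Espasa,Ian P. Gent
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Click to view abstract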
Abstract:Autonomous agents rely on automated planning algorithms to achieve their objectives. Simulation-based planning offers a significant advantage over declarative models in modelling complex environments. However, relying solely on a planner that produces a single plan may not be practical, as the generated plans may not always satisfy the agent’s preferences. To address this limitation, we introduce $\texttt{FBI}_\texttt{LTL}$, a diverse planner explicitly designed for simulation-based planning problems. $\texttt{FBI}_\texttt{LTL}$ utilises Linear Temporal Logic (LTL) to define semantic diversity criteria, enabling agents to specify what constitutes meaningfully different plans. By integrating these LTL-based diversity models directly into the search process, $\texttt{FBI}_\texttt{LTL}$ ensures the generation of semantically diverse plans, addressing a critical limitation of existing diverse planning approaches that may produce syntactically different but semantically identical solutions. Extensive evaluations on various benchmarks consistently demonstrate that $\texttt{FBI}_\texttt{LTL}$ generates more diverse plans compared to a baseline approach. This work establishes the feasibility of semantically-guided diverse planning in simulation-based environments, paving the way for innovative approaches in realistic, non-symbolic domains where traditional model-based approaches fail.
[AI-23] Inference of Deterministic Finite Automata via Q-Learning
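[Quick read]: This paper addresses passive inference of deterministic finite automata (DFA) by means of reinforcement learning (RL). Traditional approaches come from symbolic AI and rely either on active learning (e.g., Angluin's L* algorithm) or on passive learning techniques (e.g., RPNI). This paper instead uses Q-learning, a classic RL algorithm, and reinterprets the learned Q-function as the transition function of a DFA, which builds a new bridge between sub-symbolic learning and symbolic representations. The key insight is that the Q-function maps state-action pairs to rewards, and over a finite domain this mapping can be recast as the transition relation of a DFA, so that a structured DFA model can be inferred directly from data.
Link: https://arxiv.org/abs/2510.17386
Authors: Elaheh Hosseinkhani,Martin Leucker
Affiliations: Unknown
Categories: Formal Languages and Automata Theory (cs.FL); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract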
Abstract:Traditional approaches to inference of deterministic finite-state automata (DFA) stem from symbolic AI, including both active learning methods (e.g., Angluin’s L* algorithm and its variants) and passive techniques (e.g., Biermann and Feldman’s method, RPNI). Meanwhile, sub-symbolic AI, particularly machine learning, offers alternative paradigms for learning from data, such as supervised, unsupervised, and reinforcement learning (RL). This paper investigates the use of Q-learning, a well-known reinforcement learning algorithm, for the passive inference of deterministic finite automata. It builds on the core insight that the learned Q-function, which maps state-action pairs to rewards, can be reinterpreted as the transition function of a DFA over a finite domain. This provides a novel bridge between sub-symbolic learning and symbolic representations. The paper demonstrates how Q-learning can be adapted for automaton inference and provides an evaluation on several examples.
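For reference, the standard tabular Q-learning update that this line of work starts from can be written as follows. This is the textbook rule only; how the paper adapts it, and how the learned $Q$ is read off as DFA transitions, is not specified in the abstract.

```latex
Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]
```

Here $s$ and $s'$ are the current and next state, $a$ is the action taken, $r$ is the reward, $\alpha$ is the learning rate, and $\gamma$ is the discount factor.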
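[AI-24] TabR1: Taming GRPO for tabular reasoning LLMs
[Quick read]: This paper addresses the poor interpretability and weak cross-table transfer of models for tabular prediction, and explores the reasoning potential of large language models (LLMs) on structured data. Traditional methods such as gradient-boosted decision trees and specialized deep learning models perform well on specific tasks but lack a transparent reasoning process and generalize poorly. Reasoning LLMs adapt across tasks, but their use on tabular data has not been fully explored. The key to the solution is TabR1, presented as the first reasoning LLM for tabular prediction with multi-step reasoning, together with Permutation Relative Policy Optimization (PRPO), a reinforcement learning method that uses column-permutation invariance as a structural prior. PRPO constructs multiple label-preserving column permutations per sample and estimates advantages both within and across permutations, which turns sparse rewards into dense learning signals and markedly improves reasoning and generalization when little labeled data is available.
Link: https://arxiv.org/abs/2510.17385
Authors: Pengxiang Cai,Zihao Gao,Jintai Chen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract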
Abstract:Tabular prediction has traditionally relied on gradient-boosted decision trees and specialized deep learning models, which excel within tasks but provide limited interpretability and weak transfer across tables. Reasoning large language models (LLMs) promise cross-task adaptability with transparent reasoning traces, yet their potential has not been fully realized for tabular data. This paper presents TabR1, the first reasoning LLM for tabular prediction with multi-step reasoning. At its core is Permutation Relative Policy Optimization (PRPO), a simple yet efficient reinforcement learning method that encodes column-permutation invariance as a structural prior. By constructing multiple label-preserving permutations per sample and estimating advantages both within and across permutations, PRPO transforms sparse rewards into dense learning signals and improves generalization. With limited supervision, PRPO activates the reasoning ability of LLMs for tabular prediction, enhancing few-shot and zero-shot performance as well as interpretability. Comprehensive experiments demonstrate that TabR1 achieves performance comparable to strong baselines under full-supervision fine-tuning. In the zero-shot setting, TabR1 approaches the performance of strong baselines under the 32-shot setting. Moreover, TabR1 (8B) substantially outperforms much larger LLMs across various tasks, achieving up to 53.17% improvement over DeepSeek-R1 (685B).
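The idea of estimating advantages within and across column permutations can be sketched as below. This is a guess at the general shape, not PRPO as published: the GRPO-style normalization, the pooling of all rewards for the across-permutation term, and the `mix` blending weight are all assumptions made for illustration.

```python
import random
import statistics

def permute_columns(row, rng):
    """Return the same record with its columns in a shuffled order."""
    items = list(row.items())
    rng.shuffle(items)
    return dict(items)

def normalize(rewards):
    """Group-relative advantage: center and scale by the group statistics."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]

def permutation_advantages(rewards_by_perm, mix=0.5):
    """Blend within-permutation and across-permutation advantages."""
    pooled = normalize([r for group in rewards_by_perm for r in group])
    blended, start = [], 0
    for group in rewards_by_perm:
        within = normalize(group)
        across = pooled[start:start + len(group)]
        blended.append([mix * w + (1 - mix) * a for w, a in zip(within, across)])
        start += len(group)
    return blended

# Two permutations of one sample, two sampled answers each (1 = correct).
advantages = permutation_advantages([[1.0, 1.0], [0.0, 1.0]])
```

In the toy call, the first permutation's answers have identical rewards, so a purely within-group advantage would be zero; the pooled term still assigns them a positive advantage, which is one way a sparse reward could become a denser signal.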
[AI-25] Graph Attention-Guided Search for Dense Multi-Agent Pathfinding
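[Quick read]: This paper addresses the challenge of finding near-optimal solutions in real time for dense Multi-Agent Path Finding (MAPF), where existing search algorithms are not efficient enough at high agent densities and purely learned methods cannot guarantee solution quality. The key to the solution is a hybrid framework, LaGAT, which feeds heuristic information from MAGAT (a neural MAPF policy with a graph attention scheme) into the leading search algorithm LaCAM. The specific contributions are an enhanced MAGAT architecture that models complex environments better, a pre-train-then-fine-tune strategy on the maps of interest, and a deadlock detection scheme that compensates for imperfect neural guidance. As a result, LaGAT clearly outperforms both purely search-based and purely learning-based methods in dense scenarios.
Link: https://arxiv.org/abs/2510.17382
Authors: Rishabh Jain,Keisuke Okumura,Michael Amir,Amanda Prorok
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments:
Click to view abstract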
Abstract:Finding near-optimal solutions for dense multi-agent pathfinding (MAPF) problems in real-time remains challenging even for state-of-the-art planners. To this end, we develop a hybrid framework that integrates a learned heuristic derived from MAGAT, a neural MAPF policy with a graph attention scheme, into a leading search-based algorithm, LaCAM. While prior work has explored learning-guided search in MAPF, such methods have historically underperformed. In contrast, our approach, termed LaGAT, outperforms both purely search-based and purely learning-based methods in dense scenarios. This is achieved through an enhanced MAGAT architecture, a pre-train-then-fine-tune strategy on maps of interest, and a deadlock detection scheme to account for imperfect neural guidance. Our results demonstrate that, when carefully designed, hybrid search offers a powerful solution for tightly coupled, challenging multi-agent coordination problems.
[AI-26] Optimizing Energy Management of Smart Grid using Reinforcement Learning aided by Surrogate models built using Physics-informed Neural Networks
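[Quick read]: This paper addresses energy management optimization in smart grids, and in particular the poor sample efficiency of Reinforcement Learning (RL) when training depends on a costly simulator. The key to the solution is to replace the expensive smart grid simulation environment with surrogate models built using Physics-informed Neural Networks (PINNs). This markedly speeds up RL policy training, which reaches converged results in a fraction of the time needed with the original environment.
Link: https://arxiv.org/abs/2510.17380
Authors: Julen Cestero,Carmine Delle Femine,Kenji S. Muro,Marco Quartulli,Marcello Restelli
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract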
Abstract:Optimizing energy management in a smart grid scenario presents significant challenges, primarily due to the complexity of real-world systems and the intricate interactions among various components. Reinforcement Learning (RL) is gaining prominence as a solution for addressing the challenges of Optimal Power Flow in smart grids. However, RL needs to iterate extensively over a given environment to obtain the optimal policy. This means obtaining samples from a simulator that is most likely costly, which can lead to a sample efficiency problem. In this work, we address this problem by substituting costly smart grid simulators with surrogate models built using Physics-informed Neural Networks (PINNs), optimizing the RL policy training process by arriving at convergent results in a fraction of the time required by the original environment.
[AI-27] Bridging Embodiment Gaps: Deploying Vision-Language-Action Models on Soft Robots NEURIPS2025
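[Quick read]: This paper addresses the control failures caused by embodiment mismatch when current Vision-Language-Action (VLA) models are deployed on a soft continuum manipulator, with the goal of safe and flexible human-robot collaboration. The key to the solution is a structured fine-tuning and deployment pipeline that adapts the VLA model to the physical characteristics of the soft robot, so that policies which work well on rigid arms reach equivalent performance on the soft platform. This shows that coupling VLA models with soft robots can improve safety and adaptability in human-shared environments.
Link: https://arxiv.org/abs/2510.17369
Authors: Haochen Su,Cristian Meo,Francesco Stella,Andrea Peirone,Kai Junge,Josie Hughes
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by NeurIPS 2025 SpaVLE workshop. 4 pages, 2 figures (in main paper, excluding references and supplements)
Click to view abstract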
Abstract:Robotic systems are increasingly expected to operate in human-centered, unstructured environments where safety, adaptability, and generalization are essential. Vision-Language-Action (VLA) models have been proposed as a language-guided generalized control framework for real robots. However, their deployment has been limited to conventional serial-link manipulators. Given the rigidity of these manipulators and the unpredictability of learning-based control, the ability to safely interact with the environment is missing yet critical. In this work, we present the deployment of a VLA model on a soft continuum manipulator to demonstrate autonomous safe human-robot interaction. We present a structured finetuning and deployment pipeline evaluating two state-of-the-art VLA models (OpenVLA-OFT and $\pi_0$) across representative manipulation tasks, and show that while out-of-the-box policies fail due to embodiment mismatch, targeted finetuning lets the soft robot perform on par with its rigid counterpart. Our findings highlight the necessity of finetuning for bridging embodiment gaps, and demonstrate that coupling VLA models with soft robots enables safe and flexible embodied AI in human-shared environments.
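[AI-28] Localist LLMs with Recruitment Learning
[Quick read]: This paper addresses the difficulty of balancing interpretability and generalization in Large Language Models (LLMs), especially in applications that must meet regulatory requirements such as transparency while still delivering high performance. The core challenge is to move a model's internal representations along the continuum from localist (rule-based, interpretable) to distributed (efficient, generalizable) without retraining the model. The key to the solution is a framework with three mechanisms: (1) a locality dial, a tunable parameter that dynamically controls the degree of localization during both training and inference; (2) an information-theoretic recruitment mechanism that adaptively allocates semantic blocks, removing the need for complete domain knowledge at initialization; and (3) a hierarchical recruitment framework that extends capacity allocation to entire specialized LLMs, enabling architectural adaptation at multiple granularities. These mechanisms work through group sparsity penalties on attention, information-theoretic anchor design, dynamic rule injection, and recruitment criteria based on penalized likelihood with explicit units. Together they ensure that attention concentrates on semantically relevant blocks at stationary points, with exact bounds on attention entropy and pointer fidelity, and they balance model complexity against data encoding efficiency.
Link: https://arxiv.org/abs/2510.17358
Authors: Joachim Diederich
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract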
Abstract:We present a novel framework for training large language models with continuously adjustable internal representations that span the full spectrum from localist (interpretable, rule-based) to distributed (generalizable, efficient) encodings. The key innovations are (1) a locality dial, a tunable parameter that dynamically controls the degree of localization during both training and inference without requiring model retraining, (2) an information-theoretic recruitment mechanism that adaptively allocates semantic blocks as needed, eliminating the requirement for complete domain knowledge at initialization, and (3) a hierarchical recruitment framework that extends capacity allocation to entire specialized LLMs, enabling multi-granularity architectural adaptation. This is achieved through group sparsity penalties on attention mechanisms, information-theoretic anchor design, dynamic rule injection, and principled recruitment criteria based on penalized likelihood with explicit units. We provide rigorous mathematical results establishing explicit threshold conditions under which attention provably concentrates on semantically relevant blocks at stationary points, with exact bounds on attention entropy and pointer fidelity. The hierarchical recruitment mechanism provides convergence guarantees at both the block level (fine-grained, within-LLM) and the LLM level (coarse-grained, cross-domain), ensuring the system discovers semantic partitions that balance model complexity against data encoding efficiency. This framework enables practitioners to continuously interpolate between interpretable and high-performance modes while adapting architectural capacity at multiple granularities, supporting applications in regulated domains requiring both transparency and capability.
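[AI-29] TopSeg: A Multi-Scale Topological Framework for Data-Efficient Heart Sound Segmentation ICASSP2026
[Quick read]: This paper addresses the heavy dependence of heart sound (PCG) segmentation on large expert-labeled datasets and the weak generalization that results. The core of the solution is TopSeg, a framework centered on topological representation. It encodes PCG dynamics with multi-scale topological features and decodes them with a lightweight temporal convolutional network (TCN) followed by an order- and duration-constrained inference step. The method performs well in both data efficiency and cross-dataset generalization. At low data budgets it clearly outperforms baselines that use spectrogram and envelope inputs, which supports topology-aware representations as a strong inductive bias when labeled data are scarce.
Link: https://arxiv.org/abs/2510.17346
Authors: Peihong Zhang,Zhixin Li,Yuxuan Liu,Rui Sang,Yiqiang Cai,Yizhou Tan,Shengchen Li
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Paper has been submitted to ICASSP 2026
Click to view abstract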
Abstract:Deep learning approaches for heart-sound (PCG) segmentation built on time–frequency features can be accurate but often rely on large expert-labeled datasets, limiting robustness and deployment. We present TopSeg, a topological representation-centric framework that encodes PCG dynamics with multi-scale topological features and decodes them using a lightweight temporal convolutional network (TCN) with an order- and duration-constrained inference step. To evaluate data efficiency and generalization, we train exclusively on the PhysioNet 2016 dataset with subject-level subsampling and perform external validation on the CirCor dataset. Under matched-capacity decoders, the topological features consistently outperform spectrogram and envelope inputs, with the largest margins at low data budgets; as a full system, TopSeg surpasses representative end-to-end baselines trained on their native inputs under the same budgets while remaining competitive at full data. Ablations at 10% training confirm that all scales contribute and that combining $H_0$ and $H_1$ yields more reliable S1/S2 localization and boundary stability. These results indicate that topology-aware representations provide a strong inductive bias for data-efficient, cross-dataset PCG segmentation, supporting practical use when labeled data are limited.
[AI-30] DDSC: Dynamic Dual-Signal Curriculum for Data-Efficient Acoustic Scene Classification under Domain Shift ICASSP2026
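[Quick read]: This paper addresses device-induced domain shift in Acoustic Scene Classification (ASC), especially when labeled data are limited. Existing methods typically use a static curriculum-based training strategy that presents training examples in a fixed order or with fixed weights, and so ignore that example difficulty and marginal utility change as the model's representation evolves. The key to the solution is the Dynamic Dual-Signal Curriculum (DDSC). In every epoch it computes two signals online, a domain-invariance signal and a learning-progress signal, and a time-varying scheduler fuses them into per-example weights that prioritize domain-invariant examples early on and gradually emphasize device-specific examples later. The method is lightweight, architecture-agnostic, and adds no inference overhead. Under the DCASE 2024 Task 1 protocol it improves cross-device classification performance, with the largest gains on unseen-device splits.
Link: https://arxiv.org/abs/2510.17345
Authors: Peihong Zhang,Yuxuan Liu,Rui Sang,Zhixin Li,Yiqiang Cai,Yizhou Tan,Shengchen Li
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Paper has been submitted to ICASSP 2026
Click to view abstract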
Abstract:Acoustic scene classification (ASC) suffers from device-induced domain shift, especially when labels are limited. Prior work focuses on curriculum-based training schedules that structure data presentation by ordering or reweighting training examples from easy-to-hard to facilitate learning; however, existing curricula are static, fixing the ordering or the weights before training and ignoring that example difficulty and marginal utility evolve with the learned representation. To overcome this limitation, we propose the Dynamic Dual-Signal Curriculum (DDSC), a training schedule that adapts the curriculum online by combining two signals computed each epoch: a domain-invariance signal and a learning-progress signal. A time-varying scheduler fuses these signals into per-example weights that prioritize domain-invariant examples in early epochs and progressively emphasize device-specific cases. DDSC is lightweight, architecture-agnostic, and introduces no additional inference overhead. Under the official DCASE 2024 Task 1 protocol, DDSC consistently improves cross-device performance across diverse ASC baselines and label budgets, with the largest gains on unseen-device splits.
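A time-varying scheduler that fuses two per-example signals into weights might look like the sketch below. The cosine ramp, the convex combination, and the mean-one normalization are assumptions chosen for illustration; the abstract does not give the actual fusion rule, and the second signal here simply stands in for whatever favors device-specific examples later in training.

```python
import math

def mixing_coefficient(epoch, total_epochs):
    """Cosine ramp from 0 at the first epoch to 1 at the last epoch."""
    progress = epoch / max(total_epochs - 1, 1)
    return 0.5 * (1.0 - math.cos(math.pi * progress))

def example_weights(invariance, progress_signal, epoch, total_epochs):
    """Fuse two positive per-example signals into weights with mean 1."""
    lam = mixing_coefficient(epoch, total_epochs)
    raw = [(1.0 - lam) * inv + lam * prog
           for inv, prog in zip(invariance, progress_signal)]
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]

# Example 0 looks domain-invariant; example 1 looks device-specific.
invariance = [0.9, 0.1]
progress_signal = [0.2, 0.8]
early = example_weights(invariance, progress_signal, epoch=0, total_epochs=10)
late = example_weights(invariance, progress_signal, epoch=9, total_epochs=10)
```

Because the weights only rescale the per-example training loss, a scheme of this kind leaves the model and the inference path unchanged, consistent with the abstract's claim of no additional inference overhead.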
[AI-31] Auto-Rubric: Learning to Extract Generalizable Criteria for Reward Modeling
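[Quick read]: This paper addresses two challenges in building reward models for aligning Large Language Models (LLMs): preference datasets are costly to obtain, and existing rubric-based methods lack systematic quality control and optimization, which forces a trade-off between scalability and reliability. The key to the solution is a training-free framework built on the assumption that the evaluation rubrics underlying human preferences generalize strongly across queries, which makes very high data efficiency possible. The method has two stages. In the first, a validation-guided Propose-Evaluate-Revise pipeline infers high-quality, query-specific rubrics. In the second, maximizing an information-theoretic coding rate compresses these fine-grained rubrics into a compact, non-redundant core set, and the final output is an interpretable, hierarchical "Theme-Tips" rubric set. Experiments show that with only 70 preference pairs (1.5% of the source data) the method lets small models such as Qwen3-8B outperform specialized, fully trained models, improving the scalability, interpretability, and data efficiency of reward modeling.
Link: https://arxiv.org/abs/2510.17314
Authors: Lipeng Xie,Sen Huang,Zhuo Zhang,Anni Zou,Yunpeng Zhai,Dingchao Ren,Kezun Zhang,Haoyuan Hu,Boyin Liu,Haoran Chen,Zhaoyang Liu,Bolin Ding
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract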
Abstract:Reward models are essential for aligning Large Language Models (LLMs) with human values, yet their development is hampered by costly preference datasets and poor interpretability. While recent rubric-based approaches offer transparency, they often lack systematic quality control and optimization, creating a trade-off between scalability and reliability. We address these limitations with a novel, training-free framework built on a key assumption: evaluation rubrics underlying human preferences exhibit significant generalization ability across diverse queries, a property that enables remarkable data efficiency. Our two-stage approach first infers high-quality, query-specific rubrics using a validation-guided Propose-Evaluate-Revise pipeline. Second, it generalizes these granular rubrics into a compact, non-redundant core set by maximizing an information-theoretic coding rate. The final output is an interpretable, hierarchical “Theme-Tips” rubric set. Extensive experiments demonstrate the framework’s exceptional data efficiency and performance. Critically, using just 70 preference pairs (1.5% of the source data), our method also empowers smaller models like Qwen3-8B to outperform specialized, fully-trained counterparts. This work pioneers a scalable, interpretable, and data-efficient path for reward modeling.
[AI-32] RubiSCoT: A Framework for AI-Supported Academic Assessment
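[Quick read]: This paper addresses the fact that traditional thesis evaluation is time-consuming and varies between evaluators. The key to the solution is RubiSCoT, an AI-supported framework that uses Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and structured chain-of-thought prompting to provide a standardized, scalable, and transparent evaluation process from proposal to final submission, improving the consistency and efficiency of assessment.
Link: https://arxiv.org/abs/2510.17309
Authors: Thorsten Fröhlich,Tim Schlippe
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Click to view abstract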
Abstract:The evaluation of academic theses is a cornerstone of higher education, ensuring rigor and integrity. Traditional methods, though effective, are time-consuming and subject to evaluator variability. This paper presents RubiSCoT, an AI-supported framework designed to enhance thesis evaluation from proposal to final submission. Using advanced natural language processing techniques, including large language models, retrieval-augmented generation, and structured chain-of-thought prompting, RubiSCoT offers a consistent, scalable solution. The framework includes preliminary assessments, multidimensional assessments, content extraction, rubric-based scoring, and detailed reporting. We present the design and implementation of RubiSCoT, discussing its potential to optimize academic assessment processes through consistent, scalable, and transparent evaluation.
[AI-33] Comprehending Spatio-temporal Data via Cinematic Storytelling using Large Language Models
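[Quick read]: This paper addresses the shortcomings of traditional spatio-temporal data visualization: it is complex, depends on domain expertise, and fails to engage broad audiences. The key to the solution is the MapMuse framework, which combines Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and agent-based techniques to turn spatio-temporal data into narrative-driven, immersive experiences. The method draws on principles of cinematic storytelling and emphasizes clarity, emotional connection, and audience-centric design, so that data translates into insight, engagement, and action.
Link: https://arxiv.org/abs/2510.17301
Authors: Panos Kalnis,Shuo Shang,Christian S. Jensen
Affiliations: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: 5 pages
Click to view abstract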
Abstract:Spatio-temporal data captures complex dynamics across both space and time, yet traditional visualizations are complex, require domain expertise and often fail to resonate with broader audiences. Here, we propose MapMuse, a storytelling-based framework for interpreting spatio-temporal datasets, transforming them into compelling, narrative-driven experiences. We utilize large language models and employ retrieval-augmented generation (RAG) and agent-based techniques to generate comprehensive stories. Drawing on principles common in cinematic storytelling, we emphasize clarity, emotional connection, and audience-centric design. As a case study, we analyze a dataset of taxi trajectories. Two perspectives are presented: a captivating story based on a heat map that visualizes millions of taxi trip endpoints to uncover urban mobility patterns; and a detailed narrative following a single long taxi journey, enriched with city landmarks and temporal shifts. By portraying locations as characters and movement as plot, we argue that data storytelling drives insight, engagement, and action from spatio-temporal information. The case study illustrates how MapMuse can bridge the gap between data complexity and human understanding. The aim of this short paper is to provide a glimpse of the potential of the cinematic storytelling technique as an effective communication tool for spatio-temporal data, as well as to describe open problems and opportunities for future research.
[AI-34] MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems
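[Quick read]: This paper addresses the shortcomings of current LLM systems (LLMsys) in continual learning, and in particular the fact that their ability to build memory and update knowledge from user feedback during service has not been adequately evaluated. Existing benchmarks mostly focus on homogeneous reading comprehension tasks, which do not reflect how well a system learns from accumulated user feedback in real use. The authors therefore propose a user feedback simulation framework and build a comprehensive benchmark covering multiple domains, languages, and task types to evaluate the continual learning performance of LLMsys systematically. The key to the solution is a feedback mechanism that simulates realistic user interaction, combined with diverse evaluation tasks close to real applications, with the aim of advancing research on LLM memory mechanisms and optimization algorithms.
Link: https://arxiv.org/abs/2510.17281
Authors: Qingyao Ai,Yichen Tang,Changyue Wang,Jianming Long,Weihang Su,Yiqun Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:
Click to view abstract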
Abstract:Scaling up data, parameters, and test-time computation has been the mainstream approach to improving LLM systems (LLMsys), but its upper bounds are almost reached due to the gradual depletion of high-quality data and the marginal gains obtained from larger computational resource consumption. Inspired by the abilities of humans and traditional AI systems to learn from practice, constructing memory and continual learning frameworks for LLMsys has become an important and popular research direction in recent literature. Yet, existing benchmarks for LLM memory often focus on evaluating the system on homogeneous reading comprehension tasks with long-form inputs rather than testing their abilities to learn from accumulated user feedback in service time. Therefore, we propose a user feedback simulation framework and a comprehensive benchmark covering multiple domains, languages, and types of tasks to evaluate the continual learning abilities of LLMsys. Experiments show that the effectiveness and efficiency of state-of-the-art baselines are far from satisfactory, and we hope this benchmark could pave the way for future studies on LLM memory and optimization algorithms.
[AI-35] Augmented Web Usage Mining and User Experience Optimization with CAWAL's Enriched Analytics Data
[Quick read]: This paper addresses how enriched interaction-data mining can give a more accurate picture of user behavior on the web and thereby improve User Experience (UX). The key to the solution is a methodology called Augmented Web Usage Mining (AWUM). It integrates the multi-source log data collected by the CAWAL (Combined Application Log and Web Analytics) framework, analyzes session structures, page requests, service interactions, and exit methods in depth, and uses association rule mining to identify patterns of frequently accessed services. This improves the precision and efficiency of user behavior modeling and provides a solid data basis for large-scale UX optimization.
Link: https://arxiv.org/abs/2510.17253
Authors: Özkan Canay(1 and 2),Ümit Kocabıcak(3 and 4) ((1) Institute of Natural Sciences, Sakarya University, Sakarya, Turkiye, (2) Vocational School of Sakarya, Sakarya University of Applied Sciences, Sakarya, Turkiye, (3) Faculty of Computer and IT Engineering, Sakarya University, Sakarya, Turkiye, (4) Turkish Higher Education Quality Council, Ankara, Turkiye)
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 19 pages, 5 figures. Published in International Journal of Human-Computer Interaction (Taylor & Francis, 2025)
Click to view abstract
Abstract:Understanding user behavior on the web is increasingly critical for optimizing user experience (UX). This study introduces Augmented Web Usage Mining (AWUM), a methodology designed to enhance web usage mining and improve UX by enriching the interaction data provided by CAWAL (Combined Application Log and Web Analytics), a framework for advanced web analytics. Over 1.2 million session records collected in one month (~8.5GB of data) were processed and transformed into enriched datasets. AWUM analyzes session structures, page requests, service interactions, and exit methods. Results show that 87.16% of sessions involved multiple pages, contributing 98.05% of total pageviews; 40% of users accessed various services and 50% opted for secure exits. Association rule mining revealed patterns of frequently accessed services, highlighting CAWAL’s precision and efficiency over conventional methods. AWUM offers a comprehensive understanding of user behavior and strong potential for large-scale UX optimization.
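Association rule mining over sessions rests on two standard quantities, support and confidence. The sketch below computes them for rules between pairs of services; the service names, thresholds, and the restriction to single-item rules are illustrative assumptions and say nothing about how CAWAL or AWUM are implemented.

```python
from itertools import combinations

def pair_rules(sessions, min_support=0.3, min_confidence=0.6):
    """Find rules A -> B between single services from per-session sets."""
    n = len(sessions)
    item_count, pair_count = {}, {}
    for services in sessions:
        unique = sorted(set(services))
        for s in unique:
            item_count[s] = item_count.get(s, 0) + 1
        for pair in combinations(unique, 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n  # share of sessions containing both services
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_count[lhs]  # estimate of P(rhs | lhs)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules

# Hypothetical sessions, each listing the services a user touched.
sessions = [
    {"mail", "calendar"},
    {"mail", "calendar", "library"},
    {"mail"},
    {"library"},
]
rules = pair_rules(sessions)
```

With these toy sessions only the mail and calendar pair clears the support threshold, and it yields a rule in each direction with different confidence, which shows why association rules are directional.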
[AI-36] Visibility Allocation Systems: How Algorithmic Design Shapes Online Visibility and Societal Outcomes
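[Quick Read]: This paper addresses the problem of understanding and evaluating, as a whole, today's algorithmic systems, in particular visibility allocation systems (VASs), which are highly complex, structurally opaque, and have hard-to-predict consequences. Such systems are widely used in information-processing and human-computer interaction settings such as recommendation, content moderation, and decision support, but their internal mechanisms are often undocumented and they involve feedback loops and systemic risks, which makes them hard for researchers and regulators to diagnose and govern. The paper proposes a formal framework that defines VASs as (semi-)automated systems deciding which processed data to present to human users. Its key idea is to decompose a VAS into identifiable sub-processes, illustrate them with data flow diagrams, and survey evaluation metrics across the whole pipeline to support system diagnostics. The framework improves the ability to explain existing systems and can also assist AI legislation by locating obligations, quantifying systemic risks, and enabling adaptive compliance.
Link: https://arxiv.org/abs/2510.17241
Authors: Stefania Ionescu, Robin Forsberg, Elsa Lichtenegger, Salima Jaoua, Kshitijaa Jaglan, Florian Dorfler, Aniko Hannak
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: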
Abstract:Throughout application domains, we now rely extensively on algorithmic systems to engage with ever-expanding datasets of information. Despite their benefits, these systems are often complex (comprising many intricate tools, e.g., moderation, recommender systems, prediction models), of unknown structure (due to the lack of accompanying documentation), and prone to hard-to-predict yet potentially severe downstream consequences (due to their extensive use, the systematic enactment of existing errors, and the many feedback loops they contain). As such, understanding and evaluating these systems as a whole remains a challenge for both researchers and legislators. To aid ongoing efforts, we introduce a formal framework for such visibility allocation systems (VASs), which we define as (semi-)automated systems deciding which (processed) data to present to a human user. We review typical tools comprising VASs and define the associated computational problems they solve. By doing so, VASs can be decomposed into sub-processes and illustrated via data flow diagrams. Moreover, we survey metrics for evaluating VASs throughout the pipeline, thus aiding system diagnostics. Using forecasting-based recommendations in school choice as a case study, we demonstrate how our framework can support VAS evaluation. We also discuss how our framework can support ongoing AI-legislative efforts to locate obligations, quantify systemic risks, and enable adaptive compliance.
[AI-37] Coinvisor: An RL-Enhanced Chatbot Agent for Interactive Cryptocurrency Investment Analysis
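[Quick Read]: This paper addresses the difficulty of integrating and analyzing data for cryptocurrency investment, which stems from high market volatility and fragmented information. Existing approaches fall short: manual analysis is inefficient, data aggregation platforms are limited in functionality, and large language model agents built on static pretrained models lack real-time data access and multi-step reasoning. The key to the solution is Coinvisor, a reinforcement learning-based multi-agent chatbot framework. Its core innovation is a reinforcement learning-driven tool selection mechanism that enables multi-step planning and flexible integration of diverse data sources, supporting real-time interaction and adaptive analysis of dynamic content and improving tool-calling accuracy as well as the accuracy and usefulness of investment insights.
Link: https://arxiv.org/abs/2510.17235
Authors: Chong Chen, Ze Liu, Lingfeng Bao, Yanlin Wang, Ting Chen, Daoyuan Wu, Jiachi Chen
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: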
Abstract:The cryptocurrency market offers significant investment opportunities but faces challenges including high volatility and fragmented information. Data integration and analysis are essential for informed investment decisions. Currently, investors use three main approaches: (1) Manual analysis across various sources, which depends heavily on individual experience and is time-consuming and prone to bias; (2) Data aggregation platforms-limited in functionality and depth of analysis; (3) Large language model agents-based on static pretrained models, lacking real-time data integration and multi-step reasoning capabilities. To address these limitations, we present Coinvisor, a reinforcement learning-based chatbot that provides comprehensive analytical support for cryptocurrency investment through a multi-agent framework. Coinvisor integrates diverse analytical capabilities through specialized tools. Its key innovation is a reinforcement learning-based tool selection mechanism that enables multi-step planning and flexible integration of diverse data sources. This design supports real-time interaction and adaptive analysis of dynamic content, delivering accurate and actionable investment insights. We evaluated Coinvisor through automated benchmarks on tool calling accuracy and user studies with 20 cryptocurrency investors using our interface. Results show that Coinvisor improves recall by 40.7% and F1 score by 26.6% over the base model in tool orchestration. User studies show high satisfaction (4.64/5), with participants preferring Coinvisor to both general LLMs and existing crypto platforms (4.62/5).
[AI-38] Diagnosis of Fuel Cell Health Status with Deep Sparse Auto-Encoder Neural Network
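[Quick Read]: This paper addresses the complexity and high cost of testing high-frequency impedance online when diagnosing fuel cell health status. The key to its solution is a deep sparse auto-encoder network that predicts and classifies high-frequency impedance with an accuracy above 92%. The network is further deployed on a field-programmable gate array (FPGA), where it reaches a hardware-level recognition rate of nearly 90%, lowering the difficulty and cost of online detection while maintaining accuracy.
Link: https://arxiv.org/abs/2510.17214
Authors: Chenyan Fei, Dalin Zhang, Chen Melinda Dang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: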
Abstract:Effective and accurate diagnosis of fuel cell health status is crucial for ensuring the stable operation of fuel cell stacks. Among various parameters, high-frequency impedance serves as a critical indicator for assessing fuel cell state and health conditions. However, its online testing is prohibitively complex and costly. This paper employs a deep sparse auto-encoding network for the prediction and classification of high-frequency impedance in fuel cells, achieving an accuracy above 92%. The network is further deployed on an FPGA, attaining a hardware-based recognition rate of almost 90%.
[AI-39] D2C-HRHR: Discrete Actions with Double Distributional Critics for High-Risk-High-Return Tasks
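[Quick Read]: This paper addresses the limited performance of reinforcement learning (RL) methods on high-risk-high-return (HRHR) tasks, which typically exhibit multimodal action distributions and stochastic returns. Conventional RL methods assume unimodal Gaussian policies and rely on scalar-valued critics, so they cannot capture the complex decision structure of HRHR settings. The key to the solution is to (i) discretize the continuous action space to approximate multimodal distributions; (ii) introduce entropy-regularized exploration to improve coverage of risky but rewarding actions; and (iii) design a dual-critic architecture that estimates discrete value distributions more accurately, enabling stable and efficient policy optimization in high-dimensional action spaces.
Link: https://arxiv.org/abs/2510.17212
Authors: Jundong Zhang, Yuhui Situ, Fanji Zhang, Rongji Deng, Tianqi Wei
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: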
Abstract:Tasks involving high-risk-high-return (HRHR) actions, such as obstacle crossing, often exhibit multimodal action distributions and stochastic returns. Most reinforcement learning (RL) methods assume unimodal Gaussian policies and rely on scalar-valued critics, which limits their effectiveness in HRHR settings. We formally define HRHR tasks and theoretically show that Gaussian policies cannot guarantee convergence to the optimal solution. To address this, we propose a reinforcement learning framework that (i) discretizes continuous action spaces to approximate multimodal distributions, (ii) employs entropy-regularized exploration to improve coverage of risky but rewarding actions, and (iii) introduces a dual-critic architecture for more accurate discrete value distribution estimation. The framework scales to high-dimensional action spaces, supporting complex control domains. Experiments on locomotion and manipulation benchmarks with high risks of failure demonstrate that our method outperforms baselines, underscoring the importance of explicitly modeling multimodality and risk in RL.
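Points (i) and (ii) of the abstract can be made concrete with a small sketch: a one-dimensional action range is split into bins, and a categorical policy over those bins can place mass on two separated modes, which a single Gaussian cannot do. The bin count, logits, and entropy coefficient are hypothetical, and the dual distributional critics of point (iii) are not modeled here.

```python
import math


def discretize(low, high, n_bins):
    """Centers of n_bins equal-width bins covering [low, high]."""
    width = (high - low) / n_bins
    return [low + width * (i + 0.5) for i in range(n_bins)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


actions = discretize(-1.0, 1.0, 4)  # bin centers used as discrete actions

# Hypothetical policy logits with two separated peaks (a bimodal policy).
logits = [2.0, 0.0, 0.0, 2.0]
probs = softmax(logits)

# Entropy bonus added to the policy objective; 0.01 is an assumed coefficient.
entropy_bonus = 0.01 * entropy(probs)

print(actions)
print([round(p, 3) for p in probs], round(entropy_bonus, 5))
```

In a multi-dimensional action space, one such categorical head per action dimension is a common way to keep the number of outputs linear in the dimension; whether the paper uses that factorization is not stated in the abstract.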
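[AI-40] Temporally Detailed Hypergraph Neural ODEs for Type 2 Diabetes Progression Modeling
[Quick Read]: This paper addresses key challenges in modeling disease progression from longitudinal electronic health records (EHRs): accurately characterizing heterogeneity across patients (such as different progression rates and pathways) and the complexity of continuous-time dynamics under irregular time sampling. Existing mechanistic and data-driven methods struggle to adapt to the diversity of real-world data while also capturing complex continuous-time progression trajectories. The key to the solution is the Temporally Detailed Hypergraph Neural Ordinary Differential Equation (TD-HNODE). Its core innovation is to represent clinically recognized disease progression pathways as a temporally detailed hypergraph and to learn the continuous-time dynamics with a neural ODE framework. It introduces a learnable TD-Hypergraph Laplacian that models the interdependencies of complication markers within and across progression trajectories, improving the accuracy of progression prediction for type 2 diabetes and related cardiovascular diseases.
Link: https://arxiv.org/abs/2510.17211
Authors: Tingsong Xiao, Yao An Lee, Zelin Xu, Yupu Zhang, Zibo Liu, Yu Huang, Jiang Bian, Serena Jingchuan Guo, Zhe Jiang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: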
Abstract:Disease progression modeling aims to characterize and predict how a patient’s disease complications worsen over time based on longitudinal electronic health records (EHRs). Accurate modeling of disease progression, such as type 2 diabetes, can enhance patient sub-phenotyping and inform effective and timely interventions. However, the problem is challenging due to the need to learn continuous-time dynamics of progression patterns based on irregular-time event samples and patient heterogeneity (e.g., different progression rates and pathways). Existing mechanistic and data-driven methods either lack adaptability to learn from real-world data or fail to capture complex continuous-time dynamics on progression trajectories. To address these limitations, we propose Temporally Detailed Hypergraph Neural Ordinary Differential Equation (TD-HNODE), which represents disease progression on clinically recognized trajectories as a temporally detailed hypergraph and learns the continuous-time progression dynamics via a neural ODE framework. TD-HNODE contains a learnable TD-Hypergraph Laplacian that captures the interdependency of disease complication markers within both intra- and inter-progression trajectories. Experiments on two real-world clinical datasets demonstrate that TD-HNODE outperforms multiple baselines in modeling the progression of type 2 diabetes and related cardiovascular diseases.
[AI-41] SimpleVSF: VLM-Scoring Fusion for Trajectory Prediction of End-to-End Autonomous Driving
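[Quick Read]: This paper addresses poor decision-making in complex scenarios for end-to-end autonomous driving: existing methods struggle to produce safe, comfortable, and efficient driving policies in dynamic, changing traffic environments. The key to the solution is the SimpleVSF (Simple VLM-Scoring Fusion) framework, which combines conventional scorers with scorers enhanced by vision-language models (VLMs) and pairs a quantitative weight-based fusion mechanism with a VLM-based qualitative, context-aware decision fusion mechanism. This improves scene understanding and decision robustness in the planning stage.
Link: https://arxiv.org/abs/2510.17191
Authors: Peiru Zheng, Yun Zhao, Zhan Gong, Hong Zhu, Shaohua Wu
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 6 pages, 2 figures, 2 tables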
Abstract:End-to-end autonomous driving has emerged as a promising paradigm for achieving robust and intelligent driving policies. However, existing end-to-end methods still face significant challenges, such as suboptimal decision-making in complex scenarios. In this paper, we propose SimpleVSF (Simple VLM-Scoring Fusion), a novel framework that enhances end-to-end planning by leveraging the cognitive capabilities of Vision-Language Models (VLMs) and advanced trajectory fusion techniques. We utilize both conventional scorers and novel VLM-enhanced scorers, and we leverage a robust weight fusioner for quantitative aggregation and a powerful VLM-based fusioner for qualitative, context-aware decision-making. As the leading approach in the ICCV 2025 NAVSIM v2 End-to-End Driving Challenge, our SimpleVSF framework demonstrates state-of-the-art performance, achieving a superior balance between safety, comfort, and efficiency.
[AI-42] Combining ECG Foundation Model and XGBoost to Predict In-Hospital Malignant Ventricular Arrhythmias in AMI Patients
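[Quick Read]: This paper addresses the early identification of malignant ventricular arrhythmias (VT/VF) after acute myocardial infarction (AMI), a major cause of in-hospital death. Traditional risk scores have limited performance, while end-to-end deep learning models lack interpretability and struggle to earn clinical trust. The key to the solution is a hybrid predictive framework that combines a large-scale electrocardiogram (ECG) foundation model (ECGFounder) with an interpretable XGBoost classifier: ECGFounder first extracts 150-dimensional diagnostic probability features automatically, feature selection then refines the input, and XGBoost finally builds an accurate and interpretable predictive model. On 6,634 ECG recordings from AMI patients, the method achieved an AUC of 0.801, outperforming the KNN, RNN, and 1D-CNN baselines. SHAP analysis confirmed that key features (such as "premature ventricular complexes" as a risk factor and "normal sinus rhythm" as a protective factor) are highly consistent with clinical knowledge, offering a new paradigm for building trustworthy, explainable AI-assisted clinical decision support systems.
Link: https://arxiv.org/abs/2510.17172
Authors: Shun Huang, Wenlu Xing, Shijia Geng, Hailong Wang, Guangkun Nie, Gongzheng Tang, Chenyang He, Shenda Hong
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: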
Abstract:Malignant ventricular arrhythmias (VT/VF) following acute myocardial infarction (AMI) are a major cause of in-hospital death, yet early identification remains a clinical challenge. While traditional risk scores have limited performance, end-to-end deep learning models often lack the interpretability needed for clinical trust. This study aimed to develop a hybrid predictive framework that integrates a large-scale electrocardiogram (ECG) foundation model (ECGFounder) with an interpretable XGBoost classifier to improve both accuracy and interpretability. We analyzed 6,634 ECG recordings from AMI patients, among whom 175 experienced in-hospital VT/VF. The ECGFounder model was used to extract 150-dimensional diagnostic probability features, which were then refined through feature selection to train the XGBoost classifier. Model performance was evaluated using AUC and F1-score, and the SHAP method was used for interpretability. The ECGFounder + XGBoost hybrid model achieved an AUC of 0.801, outperforming KNN (AUC 0.677), RNN (AUC 0.676), and an end-to-end 1D-CNN (AUC 0.720). SHAP analysis revealed that model-identified key features, such as “premature ventricular complexes” (risk predictor) and “normal sinus rhythm” (protective factor), were highly consistent with clinical knowledge. We conclude that this hybrid framework provides a novel paradigm for VT/VF risk prediction by validating the use of foundation model outputs as effective, automated feature engineering for building trustworthy, explainable AI-based clinical decision support systems.
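[AI-43] TREAT: A Code LLMs Trustworthiness / Reliability Evaluation and Testing Framework
[Quick Read]: This paper addresses the lack of comprehensive trustworthiness evaluation for large foundation models in software engineering scenarios; existing benchmarks cover a limited range of tasks and omit key dimensions such as robustness and reliability. The key to the solution is a comprehensive evaluation framework named TREAT (Code LLMs Trustworthiness / Reliability Evaluation And Testing), whose core contributions are: (1) multi-task holistic evaluation spanning diverse software engineering activities; (2) extended assessment covering multi-language and multi-modality coding tasks; (3) robustness assessment under semantics-preserving code transformations; and (4) more trustworthy evaluation results through diverse evaluation prompts and adaptive solution extraction. The framework fills gaps in existing evaluation practice, and the authors use it to assess 26 state-of-the-art models, revealing performance differences and limitations across tasks.
Link: https://arxiv.org/abs/2510.17163
Authors: Shuzheng Gao, Eric John Li, Man Ho Lam, Jingyu Xiao, Yuxuan Wan, Chaozheng Wang, Ng Man Tik, Michael R. Lyu
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: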
Abstract:Large foundation models are fundamentally transforming the software engineering landscape, demonstrating exceptional capabilities across diverse tasks such as code generation, debugging, and testing. Despite this rapid progress, a significant gap remains in how to comprehensively evaluate these models’ trustworthiness in real-world software engineering scenarios. Existing benchmarks suffer from limited task scope and fail to incorporate critical evaluation aspects such as the robustness and reliability of models. To bridge this gap, we present an evaluation framework called TREAT (Code LLMs Trustworthiness / Reliability Evaluation And Testing) that provides a holistic assessment of model performance in code intelligence tasks. Our evaluation framework addresses key limitations in existing approaches with four main improvements: (1) Multi-Task Holistic Evaluation that spans diverse software engineering activities rather than limited coding tasks; (2) Multi-Language and Multi-Modality Assessment that extends beyond traditional single-language, text-only benchmarks to include multi-modality coding tasks; (3) Robustness Assessment that evaluates model reliability under semantically-preserving code transformations; and (4) Rigorous Evaluation Methodology that enhances the trustworthiness of evaluation results through diverse evaluation prompts and adaptive solution extraction. Based on this evaluation framework, we assess 26 state-of-the-art models and uncover both their strengths and limitations, yielding several key insights:(1) Current models show substantial performance variation across programming tasks; (2) Multi-modal language models demonstrate specific performance limitations in UI code generation and edit;
[AI-44] Which LLM Multi-Agent Protocol to Choose? ICLR
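[Quick Read]: This paper addresses the lack of standardized evaluation and guidance for choosing communication protocols in large-scale multi-agent systems. Protocols such as A2A, ACP, ANP, and Agora are currently chosen largely by intuition, leading to marked differences in system performance and reliability. Its core solution is ProtocolBench, a benchmark that quantitatively evaluates protocols along four axes: task success, end-to-end latency, message or byte overhead, and robustness under failures. It further designs a learnable ProtocolRouter that selects protocols from scenario requirements and runtime signals, improving system efficiency in scenarios such as Fail-Storm recovery (recovery time reduced by up to 18.1%) and achieving higher success on GAIA tasks.
Link: https://arxiv.org/abs/2510.17149
Authors: Hongyi Du, Jiaqi Su, Jisen Li, Lijie Ding, Yingxuan Yang, Peixuan Han, Xiangru Tang, Kunlun Zhu, Jiaxuan You
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Under review at ICLR this http URL and benchmark artifacts: this https URL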
Abstract:As large-scale multi-agent systems evolve, the communication protocol layer has become a critical yet under-evaluated factor shaping performance and reliability. Despite the existence of diverse protocols (A2A, ACP, ANP, Agora, etc.), selection is often intuition-driven and lacks standardized guidance. We introduce ProtocolBench, a benchmark that systematically compares agent protocols along four measurable axes: task success, end-to-end latency, message or byte overhead, and robustness under failures. On ProtocolBench, protocol choice significantly influences system behavior. In the Streaming Queue scenario, overall completion time varies by up to 36.5% across protocols, and mean end-to-end latency differs by 3.48 s. Under Fail-Storm Recovery, resilience also differs consistently across protocols. Beyond evaluation, we present ProtocolRouter, a learnable protocol router that selects per-scenario (or per-module) protocols from requirement and runtime signals. ProtocolRouter reduces Fail-Storm recovery time by up to 18.1% versus the best single-protocol baseline, and achieves scenario-specific gains such as higher success in GAIA. We also release ProtocolRouterBench to standardize protocol evaluation and improve reliability at scale.
[AI-45] Physics-Informed Large Language Models for HVAC Anomaly Detection with Autonomous Rule Generation NEURIPS2025
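[Quick Read]: This paper addresses the reliability and interpretability of anomaly detection in heating, ventilation, and air-conditioning (HVAC) systems. Traditional rule-based methods are explainable but adapt poorly, while deep learning methods predict well but lack physical plausibility, transparency, and efficiency. The key to the solution is PILLM, a physics-informed large language model (LLM) framework that automatically generates, evaluates, and refines anomaly detection rules within an evolutionary loop. Its core innovation is physics-informed reflection and crossover operators that embed thermodynamic and control-theoretic constraints, so the generated rules are both adaptive and physically grounded, yielding interpretable and actionable diagnostic logic while maintaining strong performance.
Link: https://arxiv.org/abs/2510.17146
Authors: Subin Lin, Chuanbo Hua
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: NeurIPS 2025 Workshop of UrbanAI (Oral)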
Abstract:Heating, Ventilation, and Air-Conditioning (HVAC) systems account for a substantial share of global building energy use, making reliable anomaly detection essential for improving efficiency and reducing emissions. Classical rule-based approaches offer explainability but lack adaptability, while deep learning methods provide predictive power at the cost of transparency, efficiency, and physical plausibility. Recent attempts to use Large Language Models (LLMs) for anomaly detection improve interpretability but largely ignore the physical principles that govern HVAC operations. We present PILLM, a Physics-Informed LLM framework that operates within an evolutionary loop to automatically generate, evaluate, and refine anomaly detection rules. Our approach introduces physics-informed reflection and crossover operators that embed thermodynamic and control-theoretic constraints, enabling rules that are both adaptive and physically grounded. Experiments on the public Building Fault Detection dataset show that PILLM achieves state-of-the-art performance while producing diagnostic rules that are interpretable and actionable, advancing trustworthy and deployable AI for smart building systems.
[AI-46] Enhanced Fish Freshness Classification with Incremental Handcrafted Feature Fusion
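[Quick Read]: This paper addresses the accuracy and standardization of fish freshness assessment in the food industry. Conventional sensory evaluation is subjective, inconsistent, and hard to generalize across contexts, and it performs poorly on subtle, species-dependent spoilage cues. The key to the solution is a systematic approach based on handcrafted features: it extracts and incrementally fuses complementary descriptors from fish eye images, including color statistics and histograms across multiple color spaces and texture features such as Local Binary Patterns (LBP) and Gray-Level Co-occurrence Matrices (GLCM). It captures global chromatic variation from full images and localized degradation from region-of-interest (ROI) segments, fusing each independently to evaluate its contribution to freshness discrimination. Experiments show that the strategy outperforms previous deep learning baselines: in a standard train-test setting a LightGBM classifier reaches 77.56% accuracy (a gain of 14.35 percentage points), and with data augmentation an ANN reaches 97.16% accuracy (a gain of 19.86 percentage points). This indicates that carefully engineered handcrafted features, when strategically processed, are robust, interpretable, and reliable, and offer a practical path for food quality monitoring.
Link: https://arxiv.org/abs/2510.17145
Authors: Phi-Hung Hoang, Nam-Thuan Trinh, Van-Manh Tran, Thi-Thu-Hong Phan
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 35 pages, 6 figures and 11 tables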
Abstract:Accurate assessment of fish freshness remains a major challenge in the food industry, with direct consequences for product quality, market value, and consumer health. Conventional sensory evaluation is inherently subjective, inconsistent, and difficult to standardize across contexts, often limited by subtle, species-dependent spoilage cues. To address these limitations, we propose a handcrafted feature-based approach that systematically extracts and incrementally fuses complementary descriptors, including color statistics, histograms across multiple color spaces, and texture features such as Local Binary Patterns (LBP) and Gray-Level Co-occurrence Matrices (GLCM), from fish eye images. Our method captures global chromatic variations from full images and localized degradations from ROI segments, fusing each independently to evaluate their effectiveness in assessing freshness. Experiments on the Freshness of the Fish Eyes (FFE) dataset demonstrate the approach’s effectiveness: in a standard train-test setting, a LightGBM classifier achieved 77.56% accuracy, a 14.35% improvement over the previous deep learning baseline of 63.21%. With augmented data, an Artificial Neural Network (ANN) reached 97.16% accuracy, surpassing the prior best of 77.3% by 19.86%. These results demonstrate that carefully engineered, handcrafted features, when strategically processed, yield a robust, interpretable, and reliable solution for automated fish freshness assessment, providing valuable insights for practical applications in food quality monitoring.
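As an illustration of one texture descriptor named in the abstract, the sketch below computes basic 8-neighbour Local Binary Pattern codes and their histogram on a grayscale grid. The neighbour ordering, the `>=` threshold convention, and the tiny `patch` are assumptions made for the example; the paper's exact LBP variant and parameters are not given in the abstract.

```python
def lbp_code(image, r, c):
    """8-neighbour Local Binary Pattern code of the pixel at (r, c)."""
    center = image[r][c]
    # Neighbours visited clockwise, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if image[r + dr][c + dc] >= center:
            code |= 1 << bit  # set the bit when the neighbour is at least as bright
    return code


def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1
    return hist


# Hypothetical 3x3 grayscale patch: bright top row, darker elsewhere.
patch = [
    [9, 9, 9],
    [1, 5, 1],
    [1, 1, 1],
]
code = lbp_code(patch, 1, 1)
hist = lbp_histogram(patch)
print(code, sum(hist))
```

A histogram like `hist` can be concatenated with color statistics to form one fused feature vector for a classifier, which is the general idea behind incremental feature fusion.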
[AI-47] Efficient Vision-Language-Action Models for Embodied Manipulation: A Systematic Survey
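[Quick Read]: This paper addresses the high computational and memory overhead that vision-language-action (VLA) models face when deployed on edge platforms such as on-board mobile manipulators, which severely limits real-time performance. The key is to improve VLA efficiency systematically along four dimensions: model architecture, perception features, action generation, and training and inference strategies. These techniques reduce latency, memory footprint, and training and inference costs, advancing efficient embodied intelligence.
Link: https://arxiv.org/abs/2510.17111
Authors: Weifan Guan, Qinghao Hu, Aosheng Li, Jian Cheng
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: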
Abstract:Vision-Language-Action (VLA) models extend vision-language models to embodied control by mapping natural-language instructions and visual observations to robot actions. Despite their capabilities, VLA systems face significant challenges due to their massive computational and memory demands, which conflict with the constraints of edge platforms such as on-board mobile manipulators that require real-time performance. Addressing this tension has become a central focus of recent research. In light of the growing efforts toward more efficient and scalable VLA systems, this survey provides a systematic review of approaches for improving VLA efficiency, with an emphasis on reducing latency, memory footprint, and training and inference costs. We categorize existing solutions into four dimensions: model architecture, perception feature, action generation, and training/inference strategies, summarizing representative techniques within each category. Finally, we discuss future trends and open challenges, highlighting directions for advancing efficient embodied intelligence.
[AI-48] Structured Debate Improves Corporate Credit Reasoning in Financial AI AAAI-2026
Link: https://arxiv.org/abs/2510.17108
Authors: Yoonjin Lee, Munhee Kim, Hanbi Choi, Juhyeon Park, Seungho Lyoo, Woojin Park
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 18 pages, 4 figures, 2 algorithms, 2 tables, 4 appendices, will be submitted to AAAI-2026 workshop
[AI-49] Can Transformer Memory Be Corrupted? Investigating Cache-Side Vulnerabilities in Large Language Models
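[Quick Read]: This paper addresses a security vulnerability in transformer language models: malicious tampering with the key-value (KV) cache during inference, an attack surface that has received little attention. The key contribution is a modular framework, Malicious Token Injection (MTI), which systematically corrupts KV cache integrity by perturbing cached key vectors at selected layers and timesteps with controlled magnitude and frequency, using additive Gaussian noise, zeroing, and orthogonal rotations. A theoretical analysis shows how the perturbations propagate through attention and quantifies their effect on logit deviations, providing a reproducible and theoretically grounded threat model for cache security in large language model (LLM) deployments.
Link: https://arxiv.org/abs/2510.17098
Authors: Elias Hossain, Swayamjit Saha, Somshubhra Roy, Ravi Prasad
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: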
Abstract:Even when prompts and parameters are secured, transformer language models remain vulnerable because their key-value (KV) cache during inference constitutes an overlooked attack surface. This paper introduces Malicious Token Injection (MTI), a modular framework that systematically perturbs cached key vectors at selected layers and timesteps through controlled magnitude and frequency, using additive Gaussian noise, zeroing, and orthogonal rotations. A theoretical analysis quantifies how these perturbations propagate through attention, linking logit deviations to the Frobenius norm of corruption and softmax Lipschitz dynamics. Empirical results show that MTI significantly alters next-token distributions and downstream task performance across GPT-2 and LLaMA-2/7B, as well as destabilizes retrieval-augmented and agentic reasoning pipelines. These findings identify cache integrity as a critical yet underexplored vulnerability in current LLM deployments, positioning cache corruption as a reproducible and theoretically grounded threat model for future robustness and security research.
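The three perturbation types named in the abstract can be sketched on a toy single-head attention step. This is only an illustration under simplifying assumptions: a four-dimensional cache with three keys, one query, and a Givens rotation standing in for a general orthogonal rotation. The function names and values are hypothetical and are not the MTI implementation.

```python
import math
import random


def attention_weights(query, keys):
    """Softmax of scaled dot products between one query and the cached keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def add_gaussian_noise(key, sigma, rng):
    """Additive Gaussian noise with standard deviation sigma."""
    return [k + rng.gauss(0.0, sigma) for k in key]


def zero_out(key):
    """Replace the cached key with the zero vector."""
    return [0.0 for _ in key]


def rotate_plane(key, i, j, angle):
    """Givens rotation in the (i, j) plane; orthogonal, so the norm is preserved."""
    out = list(key)
    c, s = math.cos(angle), math.sin(angle)
    out[i] = c * key[i] - s * key[j]
    out[j] = s * key[i] + c * key[j]
    return out


rng = random.Random(0)
query = [1.0, 0.0, 0.0, 0.0]
keys = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]

clean = attention_weights(query, keys)
after_noise = attention_weights(query, [add_gaussian_noise(keys[0], 0.5, rng)] + keys[1:])
after_zeroing = attention_weights(query, [zero_out(keys[0])] + keys[1:])
rotated_key = rotate_plane(keys[0], 0, 1, math.pi / 2)
after_rotation = attention_weights(query, [rotated_key] + keys[1:])

print([round(w, 3) for w in clean])
print([round(w, 3) for w in after_rotation])
```

Even the norm-preserving rotation moves attention mass away from the first key, which is the kind of shift in the attention distribution that the abstract's analysis links to downstream logit deviations.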
[AI-50] Explainable Heterogeneous Anomaly Detection in Financial Networks via Adaptive Expert Routing
Link: https://arxiv.org/abs/2510.17088
Authors: Zan Li, Rui Fan
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments:
[AI-51] A Brain Cell Type Resource Created by Large Language Models and a Multi-Agent AI System for Collaborative Community Annotation
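[Quick Read]: This paper addresses the difficulty of annotating gene sets from single-cell RNA sequencing, especially for genes whose function is unclear or poorly annotated. Traditional methods such as Gene Set Enrichment Analysis (GSEA) depend on well-curated annotation libraries and perform poorly in these cases. The key to the solution is BRAINCELL-AID, a multi-agent AI system that integrates free-text descriptions with structured ontology labels and uses retrieval-augmented generation (RAG) to refine predictions with relevant PubMed literature, reducing hallucinations and improving interpretability. The method produced correct annotations for 77% of mouse gene sets among its top predictions and was applied to annotate a mouse brain cell atlas, revealing region-specific gene co-expression patterns and the functional roles of gene ensembles.
Link: https://arxiv.org/abs/2510.17064
Authors: Rongbin Li, Wenbo Chen, Zhao Li, Rodrigo Munoz-Castaneda, Jinbo Li, Neha S. Maurya, Arnav Solanki, Huan He, Hanwen Xing, Meaghan Ramlakhan, Zachary Wise, Zhuhao Wu, Hua Xu, Michael Hawrylycz, W. Jim Zheng
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 22 pages, 6 figures, 2 tables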
Abstract:Single-cell RNA sequencing has transformed our ability to identify diverse cell types and their transcriptomic signatures. However, annotating these signatures-especially those involving poorly characterized genes-remains a major challenge. Traditional methods, such as Gene Set Enrichment Analysis (GSEA), depend on well-curated annotations and often perform poorly in these contexts. Large Language Models (LLMs) offer a promising alternative but struggle to represent complex biological knowledge within structured ontologies. To address this, we present BRAINCELL-AID (BRAINCELL-AID: this https URL), a novel multi-agent AI system that integrates free-text descriptions with ontology labels to enable more accurate and robust gene set annotation. By incorporating retrieval-augmented generation (RAG), we developed a robust agentic workflow that refines predictions using relevant PubMed literature, reducing hallucinations and enhancing interpretability. Using this workflow, we achieved correct annotations for 77% of mouse gene sets among their top predictions. Applying this approach, we annotated 5,322 brain cell clusters from the comprehensive mouse brain cell atlas generated by the BRAIN Initiative Cell Census Network, enabling novel insights into brain cell function by identifying region-specific gene co-expression patterns and inferring functional roles of gene ensembles. BRAINCELL-AID also identifies Basal Ganglia-related cell types with neurologically meaningful descriptions. Hence, we create a valuable resource to support community-driven cell type annotation.
[AI-52] Bitwidth-Specific Logarithmic Arithmetic for Future Hardware-Accelerated Training
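[Quick Read]: This paper addresses the high computational cost of deep learning training, which still relies on high-precision floating-point arithmetic, and proposes a low-precision logarithmic fixed-point training method aimed at future hardware accelerator designs. The key is to treat bitwidth as a design parameter when approximating arithmetic operations and to introduce a new hardware-friendly piecewise-linear approximation for logarithmic addition, optimized with simulated annealing at different precision levels. In a C++ bit-true simulation, VGG-11 and VGG-16 trained with 12-bit integer arithmetic show minimal accuracy degradation relative to 32-bit floating-point training, and at the hardware level the proposed LNS multiply-accumulate units reduce area by up to 32.5% and energy consumption by up to 53.5% compared with linear fixed-point equivalents.
Link: https://arxiv.org/abs/2510.17058
Authors: Hassan Hamad, Yuou Qiu, Peter A. Beerel, Keith M. Chugg
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: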
Abstract:While advancements in quantization have significantly reduced the computational costs of inference in deep learning, training still predominantly relies on complex floating-point arithmetic. Low-precision fixed-point training presents a compelling alternative. This work introduces a novel enhancement in low-precision logarithmic fixed-point training, geared towards future hardware accelerator designs. We propose incorporating bitwidth in the design of approximations to arithmetic operations. To this end, we introduce a new hardware-friendly, piece-wise linear approximation for logarithmic addition. Using simulated annealing, we optimize this approximation at different precision levels. A C++ bit-true simulation demonstrates training of VGG-11 and VGG-16 models on CIFAR-100 and TinyImageNet, respectively, using 12-bit integer arithmetic with minimal accuracy degradation compared to 32-bit floating-point training. Our hardware study reveals up to 32.5% reduction in area and 53.5% reduction in energy consumption for the proposed LNS multiply-accumulate units compared to that of linear fixed-point equivalents.
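For readers unfamiliar with logarithmic number systems (LNS), the sketch below shows why addition is the hard operation and what a piecewise-linear approximation of it looks like. A positive value x is stored as log2(x), so multiplication becomes addition of logs, while addition needs the correction term log2(1 + 2^(-d)) with d the gap between the two logs. The knot positions here are hand-picked assumptions and the arithmetic is ordinary floating point; the paper instead optimizes its approximation per bitwidth with simulated annealing and uses integer arithmetic.

```python
import math


def delta_exact(d):
    """Exact correction term log2(1 + 2**-d) for d >= 0."""
    return math.log2(1.0 + 2.0 ** (-d))


def make_pwl(knots):
    """Build a piecewise-linear interpolant of delta_exact on the given knots."""
    values = [delta_exact(x) for x in knots]

    def delta_pwl(d):
        if d >= knots[-1]:
            return 0.0  # past the last knot the correction is treated as negligible
        for k in range(len(knots) - 1):
            x0, x1 = knots[k], knots[k + 1]
            if x0 <= d < x1:
                t = (d - x0) / (x1 - x0)
                return values[k] + t * (values[k + 1] - values[k])
        raise ValueError("d must be non-negative")

    return delta_pwl


def lns_add(a, b, delta):
    """Return an estimate of log2(2**a + 2**b) from the operands' logs a and b."""
    return max(a, b) + delta(abs(a - b))


delta_pwl = make_pwl([0.0, 1.0, 2.0, 4.0, 8.0])  # hypothetical knot placement

sum_log = lns_add(3.0, 3.0, delta_pwl)   # log2(8 + 8)
product_log = 3.0 + 2.0                  # log2(8 * 4): multiplication is just an add
mixed_log = lns_add(3.0, 1.5, delta_pwl) # log2(8 + 2**1.5), approximated

print(sum_log, product_log, round(mixed_log, 4))
```

With these knots the interpolant is exact at the knots and slightly overestimates between them, because the correction term is convex in d.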
[AI-53] The Ends Justify the Thoughts: RL-Induced Motivated Reasoning in LLMs
Link: https://arxiv.org/abs/2510.17057
Authors: Nikolaus Howe, Micah Carroll
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 26 pages
[AI-54] ToolCritic: Detecting and Correcting Tool-Use Errors in Dialogue Systems
Link: https://arxiv.org/abs/2510.17052
Authors: Hassan Hamad, Yingru Xu, Liang Zhao, Wenbo Yan, Narendra Gyanchandani
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
[AI-55] Curiosity-driven RL for symbolic equation solving NEURIPS2025
Link: https://arxiv.org/abs/2510.17022
Authors: Kevin P. O'Keeffe
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the NeurIPS 2025 MATH-AI Workshop
[AI-56] Justitia: Fair and Efficient Scheduling for LLM Applications
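[Quick Read]: This paper addresses the long application completion times and poor fairness that arise when mainstream schedulers serve large language model (LLM) applications on shared GPU servers, caused by head-of-line blocking or over-constrained resource allocation. The key to the solution is a new scheduler, Justitia, with three core techniques: (1) modeling the service cost of LLM applications in a memory-centric manner, since memory is the bottleneck in frameworks such as vLLM; (2) using a lightweight neural network for accurate demand prediction; and (3) a virtual-time based fair queuing algorithm that improves overall scheduling efficiency while guaranteeing worst-case delay.
Link: https://arxiv.org/abs/2510.17015
Authors: Mingyan Yang, Guanjie Wang, Manqi Luo, Yifei Liu, Chen Chen, Han Zhao, Yu Feng, Quan Chen, Minyi Guo
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: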
Abstract:In the era of Large Language Models (LLMs), it has become popular to launch a series of LLM inferences (which we call an LLM application) to better solve real-world problems. When serving those applications in shared GPU servers, the schedulers are expected to attain fast application completions with guaranteed worst-case performance. However, mainstream LLM schedulers fail to behave well for LLM applications, due to head-of-line blocking or over-constrained resource allocation. In this paper, we propose to serve LLM applications in a fair and also efficient manner. To this end, we design Justitia, a novel scheduler with three key techniques. First, given that memory is prevalently a bottleneck for mainstream inference frameworks like vLLM, Justitia models the service cost of LLM applications in a memory-centric manner. Meanwhile, it uses a simple neural network model to conduct light-weight and also accurate demand prediction. Moreover, Justitia adopts a virtual-time based fair queuing algorithm to improve the overall performance with guaranteed worst-case delay. We have implemented Justitia atop vLLM, and experimental results involving diverse LLM applications show that it can substantially enhance the scheduling efficiency with fairness preserved.
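The idea behind virtual-time fair queuing can be shown with a generic sketch: each request receives a virtual finish tag equal to its tenant's previous tag plus cost divided by weight, and requests are served in tag order. This is a textbook-style simplification in which every request is already queued at time zero; the tenant names, costs, and weights are hypothetical, and it is not Justitia's actual algorithm or cost model.

```python
import heapq


def fair_queue_order(requests, weights):
    """Order requests by virtual finish tag.

    requests: list of (tenant, name, cost) in arrival order, all queued at time 0.
    weights:  mapping from tenant to its share weight.
    """
    last_finish = {}  # most recent virtual finish tag per tenant
    heap = []
    for seq, (tenant, name, cost) in enumerate(requests):
        start = last_finish.get(tenant, 0.0)
        finish = start + cost / weights[tenant]
        last_finish[tenant] = finish
        heapq.heappush(heap, (finish, seq, name))  # seq breaks ties by arrival order

    order = []
    while heap:
        _, _, name = heapq.heappop(heap)
        order.append(name)
    return order


# Hypothetical workload: tenant A submits two expensive applications first.
requests = [
    ("A", "A1", 8.0),
    ("A", "A2", 8.0),
    ("B", "B1", 2.0),
    ("B", "B2", 2.0),
    ("C", "C1", 5.0),
]
order = fair_queue_order(requests, {"A": 1.0, "B": 1.0, "C": 1.0})
print(order)
```

Under first-come-first-served, A1 and A2 would run first and delay the cheaper requests from B and C, which is the head-of-line blocking the abstract mentions; ordering by finish tag lets the short requests through while A's second request waits behind its first.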
[AI-57] ReclAIm: A multi-agent framework for degradation-aware performance tuning of medical imaging AI
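[Quick Read]: This paper addresses the long-term reliability of medical imaging AI models in clinical practice: model performance can degrade over time, yet mechanisms for automatic monitoring and repair are lacking. The key to the solution is ReclAIm, a multi-agent framework built on a large language model (LLM) core that autonomously monitors, evaluates, and fine-tunes medical image classification models through natural language interaction. It corrects performance degradation automatically without requiring programming expertise, keeping models stable across MRI, CT, and X-ray datasets.
Link: https://arxiv.org/abs/2510.17004
Authors: Eleftherios Tzanis, Michail E. Klontzas
Affiliation: unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 25 pages, 4 figures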
Abstract:Ensuring the long-term reliability of AI models in clinical practice requires continuous performance monitoring and corrective actions when degradation occurs. Addressing this need, this manuscript presents ReclAIm, a multi-agent framework capable of autonomously monitoring, evaluating, and fine-tuning medical image classification models. The system, built on a large language model core, operates entirely through natural language interaction, eliminating the need for programming expertise. ReclAIm successfully trains, evaluates, and maintains consistent performance of models across MRI, CT, and X-ray datasets. Once ReclAIm detects significant performance degradation, it autonomously executes state-of-the-art fine-tuning procedures that substantially reduce the performance gap. In cases with performance drops of up to -41.1% (MRI InceptionV3), ReclAIm managed to readjust performance metrics within 1.5% of the initial model results. ReclAIm enables automated, continuous maintenance of medical imaging AI models in a user-friendly and adaptable manner that facilitates broader adoption in both research and clinical environments.
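[AI-58] STARK: Strategic Team of Agents for Refining Kernels
【Quick Read】: This paper addresses the inefficiency of GPU kernel optimization, which stems from complex interactions among memory hierarchies, thread scheduling, and hardware-specific characteristics, and which traditionally relies on manual effort that is hard to automate and scale. The key to the solution is an LLM-based agentic framework that uses multi-agent collaboration, grounded instruction, dynamic context management, and strategic search to mimic the workflow of expert engineers, enabling the LLM to reason about hardware trade-offs, incorporate profiling feedback, and refine kernel code iteratively. On the KernelBench benchmark, the framework substantially outperforms baseline agents in both correctness and runtime, producing kernels up to 16x faster.
Link: https://arxiv.org/abs/2510.16996
Authors: Juncheng Dong,Yang Yang,Tao Liu,Yang Wang,Feng Qi,Vahid Tarokh,Kaushik Rangadurai,Shuang Yang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Click to view abstract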
Abstract:The efficiency of GPU kernels is central to the progress of modern AI, yet optimizing them remains a difficult and labor-intensive task due to complex interactions between memory hierarchies, thread scheduling, and hardware-specific characteristics. While recent advances in large language models (LLMs) provide new opportunities for automated code generation, existing approaches largely treat LLMs as single-shot generators or naive refinement tools, limiting their effectiveness in navigating the irregular kernel optimization landscape. We introduce an LLM agentic framework for GPU kernel optimization that systematically explores the design space through multi-agent collaboration, grounded instruction, dynamic context management, and strategic search. This framework mimics the workflow of expert engineers, enabling LLMs to reason about hardware trade-offs, incorporate profiling feedback, and refine kernels iteratively. We evaluate our approach on KernelBench, a benchmark for LLM-based kernel optimization, and demonstrate substantial improvements over baseline agents: our system produces correct solutions where baselines often fail, and achieves kernels with up to 16x faster runtime performance. These results highlight the potential of agentic LLM frameworks to advance fully automated, scalable GPU kernel optimization.
[AI-59] Quantile Regression Variational Autoencoders and Diffusion Models for Uncertainty Quantification: A Spatial Analysis of Sub-seasonal Wind Speed Prediction
Link: https://arxiv.org/abs/2510.16958
Authors: Ganglin Tian,Anastase Alexandre Charantonis,Camille Le Coz,Alexis Tantet,Riwal Plougonven
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This Work has been submitted to Monthly Weather Review. Copyright in this Work may be transferred without further notice
[AI-60] A Comparative User Evaluation of XRL Explanations using Goal Identification ECAI2025
Link: https://arxiv.org/abs/2510.16956
Authors: Mark Towers,Yali Du,Christopher Freeman,Timothy J. Norman
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to ECAI 2025 Workshop on Evaluating Explainable AI and Complex Decision-Making, 8 Pages
[AI-61] A Primer on Kolmogorov-Arnold Networks (KANs) for Probabilistic Time Series Forecasting
【Quick Read】: This paper addresses inadequate uncertainty modeling and poor parameter efficiency in time series forecasting, particularly the need for accurate, calibrated probabilistic forecasts in resource-constrained settings such as satellite communications. The key to the solution is the Probabilistic Kolmogorov-Arnold Network (P-KAN), which replaces conventional scalar weights with spline-based functional connections and directly parameterizes the predictive distribution (Gaussian or Student-t). This retains expressiveness with significantly fewer parameters, captures nonlinear and heavy-tailed dynamics, and outperforms multilayer perceptron (MLP) baselines in both accuracy and calibration.
Link: https://arxiv.org/abs/2510.16940
Authors: Cristian J. Vaca-Rubio,Roberto Pereira,Luis Blanco,Engin Zeydan,Màrius Caus
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments:
Click to view abstract
Abstract:This work introduces Probabilistic Kolmogorov-Arnold Network (P-KAN), a novel probabilistic extension of Kolmogorov-Arnold Networks (KANs) for time series forecasting. By replacing scalar weights with spline-based functional connections and directly parameterizing predictive distributions, P-KANs offer expressive yet parameter-efficient models capable of capturing nonlinear and heavy-tailed dynamics. We evaluate P-KANs on satellite traffic forecasting, where uncertainty-aware predictions enable dynamic thresholding for resource allocation. Results show that P-KANs consistently outperform Multi Layer Perceptron (MLP) baselines in both accuracy and calibration, achieving superior efficiency-risk trade-offs while using significantly fewer parameters. We build up P-KANs on two distributions, namely Gaussian and Student-t distributions. The Gaussian variant provides robust, conservative forecasts suitable for safety-critical scenarios, whereas the Student-t variant yields sharper distributions that improve efficiency under stable demand. These findings establish P-KANs as a powerful framework for probabilistic forecasting with direct applicability to satellite communications and other resource-constrained domains.
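To make "directly parameterizing predictive distributions" concrete, the sketch below writes out the negative log-likelihood that a network head would typically minimize for the two distributions the abstract names. It is a generic illustration, not the paper's code: the function names and the location-scale parameterization (`mu`, `sigma`, `nu`) are assumptions.

```python
import math


def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of y under Normal(mu, sigma^2)."""
    z = (y - mu) / sigma
    return 0.5 * math.log(2.0 * math.pi) + math.log(sigma) + 0.5 * z * z


def student_t_nll(y, mu, sigma, nu):
    """Negative log-likelihood of y under a location-scale Student-t
    with nu degrees of freedom."""
    z = (y - mu) / sigma
    log_norm = (
        math.lgamma((nu + 1.0) / 2.0)
        - math.lgamma(nu / 2.0)
        - 0.5 * math.log(nu * math.pi)
        - math.log(sigma)
    )
    return -(log_norm - (nu + 1.0) / 2.0 * math.log1p(z * z / nu))


# An outlier six scale units from the mean is penalized far less
# under the heavy-tailed Student-t than under the Gaussian.
outlier_gauss = gaussian_nll(6.0, 0.0, 1.0)
outlier_t = student_t_nll(6.0, 0.0, 1.0, nu=3.0)
```

In a model of this kind the network would output `mu` and `sigma` (and possibly `nu`) per forecast step, and the training loss would be the mean of these values over the data.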
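[AI-62] Tutoring LLM into a Better CUDA Optimizer
【Quick Read】: This paper investigates how far large language models (LLMs) can go in generating optimized CUDA code: which code optimizations and parallel patterns they can produce on their own, and whether tutoring through the prompt improves their output. The key to the approach is a systematic evaluation of LLM-generated CUDA code on predefined tasks, combined with tutoring prompts (more detailed hints and guidelines) that steer the model toward expert-level optimizations, and an interactive mode in which the model fixes its earlier mistakes within a single session. The results indicate that LLMs are skilled coders, but that plain prompts do not reach expert-level optimization; tutoring is needed to get there.
Link: https://arxiv.org/abs/2510.16933
Authors: Matyáš Brabec,Jiří Klepl,Michal Töpfer,Martin Kruliš
Affiliation: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution is published in Euro-Par 2025: Parallel Processing, Part II, and is available online at this https URL
Click to view abstract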
Abstract:Recent leaps in large language models (LLMs) caused a revolution in programming tools (like GitHub Copilot) that can help with code generation, debugging, and even performance optimization. In this paper, we focus on the capabilities of the most recent reasoning models to generate optimized CUDA code for predefined, well-known tasks. Our objective is to determine which types of code optimizations and parallel patterns the LLMs can perform by themselves and whether they can be improved by tutoring (providing more detailed hints and guidelines in the prompt). The generated solutions were evaluated both automatically (for correctness and speedup) and manually (code reviews) to provide a more detailed perspective. We also tried an interactive approach where the LLM can fix its previous mistakes within a session. The results indicate that LLMs are quite skilled coders; however, they require tutoring to reach optimized solutions provided by parallel computing experts.
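[AI-63] UNDREAM: Bridging Differentiable Rendering and Photorealistic Simulation for End-to-end Adversarial Attacks
【Quick Read】: This paper addresses a limitation in testing deep learning models for safety-critical settings such as autonomous driving against physical-world adversarial attacks: simulation environments are non-differentiable, which restricts attack optimization. Existing methods cannot integrate environmental factors of the simulation (weather, lighting, background, and so on), which reduces attack success. The key to the solution is UNDREAM, the first framework to bridge photorealistic simulators and differentiable renderers, so that adversarial perturbations can be optimized end to end on any 3D object with full control over environment parameters such as weather, lighting, camera angles, and trajectories. This produces diverse, physically plausible adversarial scenes and advances research on physical adversarial attacks.
Link: https://arxiv.org/abs/2510.16923
Authors: Mansi Phute,Matthew Hull,Haoran Wang,Alec Helbling,ShengYun Peng,Willian Lunardi,Martin Andreoni,Wenke Lee,Polo Chau
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Click to view abstract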
Abstract:Deep learning models deployed in safety critical applications like autonomous driving use simulations to test their robustness against adversarial attacks in realistic conditions. However, these simulations are non-differentiable, forcing researchers to create attacks that do not integrate simulation environmental factors, reducing attack success. To address this limitation, we introduce UNDREAM, the first software framework that bridges the gap between photorealistic simulators and differentiable renderers to enable end-to-end optimization of adversarial perturbations on any 3D objects. UNDREAM enables manipulation of the environment by offering complete control over weather, lighting, backgrounds, camera angles, trajectories, and realistic human and object movements, thereby allowing the creation of diverse scenes. We showcase a wide array of distinct physically plausible adversarial objects that UNDREAM enables researchers to swiftly explore in different configurable environments. This combination of photorealistic simulation and differentiable optimization opens new avenues for advancing research of physical adversarial attacks.
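[AI-64] A Lightweight DL Model for Smart Grid Power Forecasting with Feature and Resolution Mismatch
【Quick Read】: This paper addresses the accuracy of short-term energy consumption forecasting on high-frequency real-world load data affected by sensor noise, missing values, and limited contextual information. The key to the solution is a lightweight, robust deep learning (DL) pipeline: hourly downsampling to reduce the data volume, dual-mode imputation (mean and polynomial regression) for missing values, and comprehensive normalization (Standard Scaling was ultimately selected) for model stability. On top of this, a compact GRU-LSTM sequence-to-one model captures nonlinear consumption patterns at low inference latency, achieving an average RMSE of 601.9 W, an MAE of 468.9 W, and 84.36% accuracy, which shows that targeted preprocessing combined with a small recurrent architecture works in deployment-oriented settings.
Link: https://arxiv.org/abs/2510.16911
Authors: Sarah Al-Shareeda,Gulcihan Ozdemir,Heung Seok Jeon,Khaleel Ahmad
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 5 pages, 3 figures, The IEEE PES ISGT Middle East 2025 (ISGT-ME 2025) November 23-26th 2025, Dubai, UAE
Click to view abstract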
Abstract:How can short-term energy consumption be accurately forecasted when sensor data is noisy, incomplete, and lacks contextual richness? This question guided our participation in the 2025 Competition on Electric Energy Consumption Forecast Adopting Multi-criteria Performance Metrics, which challenged teams to predict next-day power demand using real-world high-frequency data. We proposed a robust yet lightweight Deep Learning (DL) pipeline combining hourly downsizing, dual-mode imputation (mean and polynomial regression), and comprehensive normalization, ultimately selecting Standard Scaling for optimal balance. The lightweight GRU-LSTM sequence-to-one model achieves an average RMSE of 601.9 W, MAE of 468.9 W, and 84.36% accuracy. Despite asymmetric inputs and imputed gaps, it generalized well, captured nonlinear demand patterns, and maintained low inference latency. Notably, spatiotemporal heatmap analysis reveals a strong alignment between temperature trends and predicted consumption, further reinforcing the model’s reliability. These results demonstrate that targeted preprocessing paired with compact recurrent architectures can still enable fast, accurate, and deployment-ready energy forecasting in real-world conditions.
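The preprocessing steps named in the abstract (hourly downsizing, imputation, Standard Scaling) can be sketched with the standard library alone. This is only an illustration under assumptions: the function names and the sample readings are hypothetical, only the mean mode of the dual-mode imputation is shown, and the authors' actual pipeline is not reproduced here.

```python
from statistics import mean, pstdev


def hourly_downsample(readings, per_hour):
    """Average each consecutive group of per_hour readings, skipping gaps (None)."""
    hours = []
    for start in range(0, len(readings), per_hour):
        valid = [r for r in readings[start:start + per_hour] if r is not None]
        hours.append(mean(valid) if valid else None)
    return hours


def impute_mean(values):
    """Fill remaining gaps with the mean of the observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def standard_scale(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


# Hypothetical power readings in watts, two samples per hour, with gaps.
raw = [100, None, 300, 200, None, None, 400, 600]
hourly = hourly_downsample(raw, per_hour=2)
filled = impute_mean(hourly)
scaled = standard_scale(filled)
```

The scaled series would then be windowed into input sequences for a recurrent model; that step is omitted here.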
[AI-65] SNOMED CT-powered Knowledge Graphs for Structured Clinical Data and Diagnostic Reasoning
Link: https://arxiv.org/abs/2510.16899
Authors: Dun Liu,Qin Pang,Guangai Liu,Hongyu Mou,Jipeng Fan,Yiming Miao,Pin-Han Ho,Limei Peng
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
[AI-66] Adaptive Online Learning with LSTM Networks for Energy Price Prediction
Link: https://arxiv.org/abs/2510.16898
Authors: Salih Salihoglu,Ibrahim Ahmed,Afshin Asadi
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
[AI-67] DrivAerStar: An Industrial-Grade CFD Dataset for Vehicle Aerodynamic Optimization
Link: https://arxiv.org/abs/2510.16857
Authors: Jiyan Qiu,Lyulin Kuang,Guan Wang,Yichen Xu,Leiyao Cui,Shaotong Fu,Yixin Zhu,Ruihua Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
[AI-68] Agentic Inequality
Link: https://arxiv.org/abs/2510.16853
Authors: Matthew Sharp,Omer Bilgin,Iason Gabriel,Lewis Hammond
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:
[AI-69] Schrödinger Bridge Mamba for One-Step Speech Enhancement
【Quick Read】: This paper addresses the trade-off between inference efficiency and performance in generative models, specifically achieving a strong real-time factor (RTF) in speech enhancement while maintaining high-quality denoising and dereverberation. The key to the solution is Schrödinger Bridge Mamba (SBM), a training-inference framework that combines the Schrödinger Bridge (SB) training paradigm with the selective state-space model architecture Mamba, exploiting the inherent compatibility between the two. With only 1-step inference, the model outperforms strong baselines that use 1-step or iterative inference and achieves the best RTF.
Link: https://arxiv.org/abs/2510.16834
Authors: Jing Yang,Sirui Wang,Chao Wu,Fan Fan
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 1 figure
Click to view abstract
Abstract:We propose Schrödinger Bridge Mamba (SBM), a new concept of training-inference framework motivated by the inherent compatibility between Schrödinger Bridge (SB) training paradigm and selective state-space model Mamba. We exemplify the concept of SBM with an implementation for generative speech enhancement. Experiments on a joint denoising and dereverberation task using four benchmark datasets demonstrate that SBM, with only 1-step inference, outperforms strong baselines with 1-step or iterative inference and achieves the best real-time factor (RTF). Beyond speech enhancement, we discuss the integration of SB paradigm and selective state-space model architecture based on their underlying alignment, which indicates a promising direction for exploring new deep generative models potentially applicable to a broad range of generative tasks. Demo page: this https URL
[AI-70] Efficient High-Accuracy PDEs Solver with the Linear Attention Neural Operator
Link: https://arxiv.org/abs/2510.16816
Authors: Ming Zhong,Zhenya Yan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Mathematical Physics (math-ph); Computational Physics (physics.comp-ph)
Comments: 31 pages, 8 figures
Machine Learning
[LG-0] Functional Distribution Networks (FDN) ICLR2026
Link: https://arxiv.org/abs/2510.17794
Authors: Omer Haq
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Submitted to ICLR 2026. Code will be released upon acceptance
Click to view abstract
Abstract:Modern probabilistic regressors often remain overconfident under distribution shift. We present Functional Distribution Networks (FDN), an input-conditioned distribution over network weights that induces predictive mixtures whose dispersion adapts to the input. FDN is trained with a beta-ELBO and Monte Carlo sampling. We further propose an evaluation protocol that cleanly separates interpolation from extrapolation and stresses OOD sanity checks (e.g., that predictive likelihood degrades under shift while in-distribution accuracy and calibration are maintained). On standard regression tasks, we benchmark against strong Bayesian, ensemble, dropout, and hypernetwork baselines under matched parameter and update budgets, and assess accuracy, calibration, and shift-awareness with standard diagnostics. Together, the framework and protocol aim to make OOD-aware, well-calibrated neural regression practical and modular.
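For readers unfamiliar with the training objective named in the abstract, a generic beta-ELBO for an input-conditioned weight distribution can be written as follows. The notation is an assumption, not taken from the paper: $q_\phi(\theta \mid x)$ denotes the learned distribution over network weights $\theta$ given input $x$, $p(\theta)$ a prior, and $S$ the number of Monte Carlo samples.

```latex
% Generic beta-ELBO (to be maximized); beta weights the KL regularizer.
\mathcal{L}_{\beta}(\phi)
  = \mathbb{E}_{\theta \sim q_\phi(\theta \mid x)}\!\left[\log p(y \mid x, \theta)\right]
  - \beta \, \mathrm{KL}\!\left(q_\phi(\theta \mid x) \,\|\, p(\theta)\right)

% Monte Carlo estimate of the expectation with S weight samples.
\mathbb{E}_{\theta \sim q_\phi(\theta \mid x)}\!\left[\log p(y \mid x, \theta)\right]
  \approx \frac{1}{S} \sum_{s=1}^{S} \log p(y \mid x, \theta_s),
  \qquad \theta_s \sim q_\phi(\theta \mid x)

% Predictive mixture induced by the sampled weights.
p(y \mid x) \approx \frac{1}{S} \sum_{s=1}^{S} p(y \mid x, \theta_s)
```

Setting $\beta = 1$ recovers the standard ELBO; other values trade data fit against closeness to the prior.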
[LG-1] Inference-Time Compute Scaling For Flow Matching
Link: https://arxiv.org/abs/2510.17786
Authors: Adam Stecklov,Noah El Rimawi-Fine,Mathieu Blanchette
Categories: Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:Allocating extra computation at inference time has recently improved sample quality in large language models and diffusion-based image generation. In parallel, Flow Matching (FM) has gained traction in language, vision, and scientific domains, but inference-time scaling methods for it remain under-explored. Concurrently, Kim et al., 2025 approach this problem but replace the linear interpolant with a non-linear variance-preserving (VP) interpolant at inference, sacrificing FM’s efficient and straight sampling. Additionally, inference-time compute scaling for flow matching has only been applied to visual tasks, like image generation. We introduce novel inference-time scaling procedures for FM that preserve the linear interpolant during sampling. Evaluations of our method on image generation, and for the first time (to the best of our knowledge), unconditional protein generation, show that I) sample quality consistently improves as inference compute increases, and II) flow matching inference-time scaling can be applied to scientific domains.
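The distinction between the linear and variance-preserving (VP) interpolants can be stated in one common convention. The symbols are assumptions rather than the paper's notation: $x_0$ is a noise sample, $x_1$ a data sample, $t \in [0, 1]$, and $v_\theta$ the learned velocity field.

```latex
% Linear interpolant: straight path from noise x_0 to data x_1.
x_t = (1 - t)\, x_0 + t\, x_1,
\qquad
\frac{\mathrm{d} x_t}{\mathrm{d} t} = x_1 - x_0

% Flow matching regresses the velocity field onto this constant target.
\mathcal{L}_{\mathrm{FM}}(\theta)
  = \mathbb{E}_{t,\, x_0,\, x_1}
    \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2

% A variance-preserving interpolant instead follows a curved path.
x_t = \sigma_t\, x_0 + \alpha_t\, x_1,
\qquad
\alpha_t^2 + \sigma_t^2 = 1
```

Under the linear interpolant the conditional target velocity does not depend on $t$, which is the property behind the "efficient and straight sampling" the abstract refers to.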
[LG-2] Atlas-based Manifold Representations for Interpretable Riemannian Machine Learning
Link: https://arxiv.org/abs/2510.17772
Authors: Ryan A. Robinett,Sophia A. Madejski,Kyle Ruark,Samantha J. Riesenfeld,Lorenzo Orecchia
Categories: Machine Learning (cs.LG); Applications (stat.AP)
Comments:
Click to view abstract
Abstract:Despite the popularity of the manifold hypothesis, current manifold-learning methods do not support machine learning directly on the latent $d$-dimensional data manifold, as they primarily aim to perform dimensionality reduction into $\mathbb{R}^D$, losing key manifold features when the embedding dimension $D$ approaches $d$. On the other hand, methods that directly learn the latent manifold as a differentiable atlas have been relatively underexplored. In this paper, we aim to give a proof of concept of the effectiveness and potential of atlas-based methods. To this end, we implement a generic data structure to maintain a differentiable atlas that enables Riemannian optimization over the manifold. We complement this with an unsupervised heuristic that learns a differentiable atlas from point cloud data. We experimentally demonstrate that this approach has advantages in terms of efficiency and accuracy in selected settings. Moreover, in a supervised classification task over the Klein bottle and in RNA velocity analysis of hematopoietic data, we showcase the improved interpretability and robustness of our approach.
[LG-3] Efficient Tensor Completion Algorithms for Highly Oscillatory Operators
Link: https://arxiv.org/abs/2510.17734
Authors: Navjot Singh,Edgar Solomonik,Xiaoye Sherry Li,Yang Liu
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:This paper presents low-complexity tensor completion algorithms and their efficient implementation to reconstruct highly oscillatory operators discretized as $n \times n$ matrices. The underlying tensor decomposition is based on the reshaping of the input matrix and its butterfly decomposition into an order $\mathcal{O}(\log n)$ tensor. The reshaping of the input matrix into a tensor allows for representation of the butterfly decomposition as a tensor decomposition with dense tensors. This leads to efficient utilization of the existing software infrastructure for dense and sparse tensor computations. We propose two tensor completion algorithms in the butterfly format, using alternating least squares and gradient-based optimization, as well as a novel strategy that uses low-rank matrix completion to efficiently generate an initial guess for the proposed algorithms. To demonstrate the efficiency and applicability of our proposed algorithms, we perform three numerical experiments using simulated oscillatory operators in seismic applications. In these experiments, we use $\mathcal{O}(n \log n)$ observed entries in the input matrix and demonstrate an $\mathcal{O}(n \log^3 n)$ computational cost of the proposed algorithms, leading to a speedup of orders of magnitudes per iteration for large matrices compared to the low-rank matrix and quantized tensor-train completion. Moreover, the proposed butterfly completion algorithms, equipped with the novel initial guess generation strategy, achieve reconstruction errors that are smaller by an order of magnitude, enabling accurate recovery of the underlying structure compared to the state-of-the-art completion algorithms.
[LG-4] Enabling Fine-Grained Operating Points for Black-Box LLMs ICLR2026
Link: https://arxiv.org/abs/2510.17727
Authors: Ege Beyazit,KL Navaneet,Prashant Mathur,Roi Blanco,Vidit Bansal,Karim Bouyarmane
Categories: Machine Learning (cs.LG)
Comments: Under review at ICLR 2026. 36 pages, 17 figures
Click to view abstract
Abstract:Black-box Large Language Models (LLMs) provide practical and accessible alternatives to other machine learning methods, as they require minimal labeled data and machine learning expertise to develop solutions for various decision making problems. However, for applications that need operating with constraints on specific metrics (e.g., precision $\geq$ 95%), decision making with black-box LLMs remains unfavorable, due to their low numerical output cardinalities. This results in limited control over their operating points, preventing fine-grained adjustment of their decision making behavior. In this paper, we study using black-box LLMs as classifiers, focusing on efficiently improving their operational granularity without performance loss. Specifically, we first investigate the reasons behind their low-cardinality numerical outputs and show that they are biased towards generating rounded but informative verbalized probabilities. Then, we experiment with standard prompt engineering, uncertainty estimation and confidence elicitation techniques, and observe that they do not effectively improve operational granularity without sacrificing performance or increasing inference cost. Finally, we propose efficient approaches to significantly increase the number and diversity of available operating points. Our proposed approaches provide finer-grained operating points and achieve comparable to or better performance than the benchmark methods across 11 datasets and 3 LLMs.
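Why low output cardinality limits operating points can be seen in a small sketch: a thresholded classifier has at most one operating point per distinct score, so rounded scores leave few points to choose from. The function name, scores, and labels below are hypothetical and are not data from the paper.

```python
def operating_points(scores, labels):
    """Return (threshold, precision, recall) for every distinct score value."""
    points = []
    for thr in sorted(set(scores)):
        pred = [s >= thr for s in scores]
        tp = sum(1 for p, y in zip(pred, labels) if p and y)
        fp = sum(1 for p, y in zip(pred, labels) if p and not y)
        fn = sum(1 for p, y in zip(pred, labels) if not p and y)
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        points.append((thr, precision, recall))
    return points


# Hypothetical verbalized probabilities, rounded to one decimal place.
rounded_scores = [0.9, 0.9, 0.8, 0.8, 0.5, 0.5]
labels = [1, 1, 1, 0, 0, 0]
points = operating_points(rounded_scores, labels)
```

With only three distinct score values there are only three achievable precision/recall pairs, so a target such as a specific minimum precision can only be met at whichever of those points happens to satisfy it.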
[LG-5] The Marked Edge Walk: A Novel MCMC Algorithm for Sampling of Graph Partitions
Link: https://arxiv.org/abs/2510.17714
Authors: Atticus McWhorter,Daryl DeFord
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Physics and Society (physics.soc-ph)
Comments:
Click to view abstract
Abstract:Novel Markov Chain Monte Carlo (MCMC) methods have enabled the generation of large ensembles of redistricting plans through graph partitioning. However, existing algorithms such as Reversible Recombination (RevReCom) and Metropolized Forest Recombination (MFR) are constrained to sampling from distributions related to spanning trees. We introduce the marked edge walk (MEW), a novel MCMC algorithm for sampling from the space of graph partitions under a tunable distribution. The walk operates on the space of spanning trees with marked edges, allowing for calculable transition probabilities for use in the Metropolis-Hastings algorithm. Empirical results on real-world dual graphs show convergence under target distributions unrelated to spanning trees. For this reason, MEW represents an advancement in flexible ensemble generation.
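The role of calculable transition probabilities becomes clearer from the Metropolis-Hastings acceptance rule itself. The sketch below is a generic illustration on a toy three-state ring, not the marked edge walk: the state space, target weights, and function names are hypothetical.

```python
import random


def mh_acceptance(target_current, target_proposed, q_forward, q_backward):
    """Metropolis-Hastings acceptance probability for a proposed move.

    q_forward is the probability of proposing the move, q_backward the
    probability of proposing its reverse; both must be computable.
    """
    ratio = (target_proposed * q_backward) / (target_current * q_forward)
    return min(1.0, ratio)


def sample_ring(weights, steps, seed=0):
    """Run a toy chain on states 0..n-1 arranged in a ring.

    Each step proposes a neighbor with probability 0.5 each way, so the
    proposal is symmetric and q_forward equals q_backward.
    """
    rng = random.Random(seed)
    n = len(weights)
    state = 0
    visits = [0] * n
    for _ in range(steps):
        proposed = (state + rng.choice([-1, 1])) % n
        if rng.random() < mh_acceptance(weights[state], weights[proposed], 0.5, 0.5):
            state = proposed
        visits[state] += 1
    return [v / steps for v in visits]


# Unnormalized target weights; the chain should visit states roughly
# in proportion 1 : 2 : 3.
frequencies = sample_ring([1.0, 2.0, 3.0], steps=20000)
```

When a proposal is not symmetric, the forward and backward probabilities no longer cancel, which is why a walk needs them in closed form to target an arbitrary tunable distribution.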
[LG-6] Efficient Algorithms for Mitigating Uncertainty and Risk in Reinforcement Learning
Link: https://arxiv.org/abs/2510.17690
Authors: Xihong Su
Categories: Machine Learning (cs.LG)
Comments: Dissertation
Click to view abstract
Abstract:This dissertation makes three main contributions. First, we identify a new connection between policy gradient and dynamic programming in MMDPs and propose the Coordinate Ascent Dynamic Programming (CADP) algorithm to compute a Markov policy that maximizes the discounted return averaged over the uncertain models. CADP adjusts model weights iteratively to guarantee monotone policy improvements to a local maximum. Second, we establish sufficient and necessary conditions for the exponential ERM Bellman operator to be a contraction and prove the existence of stationary deterministic optimal policies for ERM-TRC and EVaR-TRC. We also propose exponential value iteration, policy iteration, and linear programming algorithms for computing optimal stationary policies for ERM-TRC and EVaR-TRC. Third, we propose model-free Q-learning algorithms for computing policies with risk-averse objectives: ERM-TRC and EVaR-TRC. The challenge is that the Q-learning ERM Bellman operator may not be a contraction. Instead, we use the monotonicity of Q-learning ERM Bellman operators to derive a rigorous proof that the ERM-TRC and the EVaR-TRC Q-learning algorithms converge to the optimal risk-averse value functions. The proposed Q-learning algorithms compute the optimal stationary policy for ERM-TRC and EVaR-TRC.
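As background for the ERM objectives in the abstract, the entropic risk measure of a random reward can be computed directly from samples. The sketch assumes the common reward-based convention $-\frac{1}{\beta} \log \mathbb{E}[e^{-\beta X}]$ with risk level $\beta > 0$; the dissertation's exact sign and notation conventions may differ, and the function name and sample values are hypothetical.

```python
import math


def entropic_risk(rewards, beta):
    """Entropic risk of equally likely rewards: -(1/beta) * log E[exp(-beta * X)]."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Log-mean-exp with the maximum exponent factored out for numerical stability.
    exponents = [-beta * r for r in rewards]
    m = max(exponents)
    log_mean = m + math.log(sum(math.exp(e - m) for e in exponents) / len(exponents))
    return -log_mean / beta


certain = entropic_risk([1.0, 1.0, 1.0], beta=0.5)   # no variability
risky = entropic_risk([0.0, 2.0], beta=1.0)          # same mean, more spread
near_neutral = entropic_risk([0.0, 2.0], beta=1e-6)  # close to the plain mean
```

A deterministic reward is valued at face value, a risky reward with the same mean is valued lower, and the measure approaches the ordinary expectation as `beta` goes to zero.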
[LG-7] Quantum Synthetic Data Generation for Industrial Bioprocess Monitoring
Link: https://arxiv.org/abs/2510.17688
Authors: Shawn M. Gibford,Mohammad Reza Boskabadi,Christopher J. Savoie,Seyed Soheil Mansouri
Categories: Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:Data scarcity and sparsity in bio-manufacturing poses challenges for accurate model development, process monitoring, and optimization. We aim to replicate and capture the complex dynamics of industrial bioprocesses by proposing the use of a Quantum Wasserstein Generative Adversarial Network with Gradient Penalty (QWGAN-GP) to generate synthetic time series data for industrially relevant processes. The generator within our GAN is comprised of a Parameterized Quantum Circuit (PQC). This methodology offers potential advantages in process monitoring, modeling, forecasting, and optimization, enabling more efficient bioprocess management by reducing the dependence on scarce experimental data. Our results demonstrate acceptable performance in capturing the temporal dynamics of real bioprocess data. We focus on Optical Density, a key measurement for Dry Biomass estimation. The data generated showed high fidelity to the actual historical experimental data. This intersection of quantum computing and machine learning has opened new frontiers in data analysis and generation, particularly in computationally intensive fields, for use cases such as increasing prediction accuracy for soft sensor design or for use in predictive control.
[LG-8] Handling Extreme Class Imbalance: Using GANs in Data Augmentation for Suicide Prediction
Link: https://arxiv.org/abs/2510.17661
Authors: Vaishnavi Visweswaraiah,Tanvi Banerjee,William Romine
Categories: Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:Suicide prediction is the key for prevention, but real data with sufficient positive samples is rare and causes extreme class imbalance. We utilized machine learning (ML) to build the model and deep learning (DL) techniques, like Generative Adversarial Networks (GAN), to generate synthetic data samples to enhance the dataset. The initial dataset contained 656 samples, with only four positive cases, prompting the need for data augmentation. A variety of machine learning models, ranging from interpretable data models to black box algorithmic models, were used. On real test data, Logistic Regression (LR) achieved a weighted precision of 0.99, a weighted recall of 0.85, and a weighted F1 score of 0.91; Random Forest (RF) showed 0.98, 0.99, and 0.99, respectively; and Support Vector Machine (SVM) achieved 0.99, 0.76, and 0.86. LR and SVM correctly identified one suicide attempt case (sensitivity: 1.0) and misclassified 20 (LR) and 31 (SVM) non-attempts as attempts (specificity: 0.85 and 0.76, respectively). RF identified 0 suicide attempt cases (sensitivity: 0.0) with 0 false positives (specificity: 1.0). These results highlight the models’ effectiveness, with GAN playing a key role in generating synthetic data to support suicide prevention modeling efforts.
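The sensitivity and specificity figures in the abstract follow from simple confusion-matrix arithmetic, sketched below. The abstract does not state the size of the test set; the count of 131 negatives used here is an assumption, chosen only because it reproduces the rounded specificities reported for 20 and 31 false positives. The single positive case and the false-positive counts come from the abstract.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# ASSUMPTION: 131 negatives in the test split (not stated in the abstract).
NEGATIVES = 131

lr = sensitivity_specificity(tp=1, fn=0, tn=NEGATIVES - 20, fp=20)
svm = sensitivity_specificity(tp=1, fn=0, tn=NEGATIVES - 31, fp=31)
rf = sensitivity_specificity(tp=0, fn=1, tn=NEGATIVES, fp=0)
```

With a single positive case in the test data, sensitivity can only be 0.0 or 1.0, so the two models that flag it differ only in how many non-attempts they misclassify.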
[LG-9] Just-In-Time Piecewise-Linear Semantics for ReLU-type Networks
Link: https://arxiv.org/abs/2510.17622
Authors: Hongyi Duan,Haoyang Liu,Jian’an Zhang,Fengrui Liu,Yiyi Wang
Categories: Logic in Computer Science (cs.LO); Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:We present a JIT PL semantics for ReLU-type networks that compiles models into a guarded CPWL transducer with shared guards. The system adds hyperplanes only when operands are affine on the current cell, maintains global lower/upper envelopes, and uses a budgeted branch-and-bound. We obtain anytime soundness, exactness on fully refined cells, monotone progress, guard-linear complexity (avoiding global $\binom{k}{2}$), dominance pruning, and decidability under finite refinement. The shared carrier supports region extraction, decision complexes, Jacobians, exact/certified Lipschitz, LP/SOCP robustness, and maximal causal influence. A minimal prototype returns certificates or counterexamples with cost proportional to visited subdomains.
[LG-10] Semi-supervised Latent Bayesian Optimization for Designing Antimicrobial Peptides
Link: https://arxiv.org/abs/2510.17569
Authors: Jyler Menard,R. A. Mansbach
Categories: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 19 pages, 9 figures
Click to view abstract
Abstract:Antimicrobial peptides (AMPs) are a promising class of therapeutics to treat bacterial infections. Discovering and designing such peptides is difficult because of the vast number of possible sequences of amino acids. Deep generative models, such as variational autoencoders, have shown value in peptide design due to their ability to model sequence space with a continuous-valued latent space. Although such models have already been used to great effect in biomolecular design, they still suffer from a lack of interpretability and rigorous quantification of latent space quality as a search space. We investigate (1) whether further compression of the design space via dimensionality reduction may facilitate optimization, (2) the interpretability of the spaces, and (3) how organizing latent spaces with physicochemical properties may improve the efficiency of optimizing antimicrobial activity. We find that further reduction of the latent space via dimensionality reduction can be advantageous when organizing the space with more relevant information at data availability, that using the dimensionality reduction search space can be more interpretable, and that we can organize the latent space with different physicochemical properties even at different percentages of available labels.
[LG-11] Formally Exploring Time-Series Anomaly Detection Evaluation Metrics
Link: https://arxiv.org/abs/2510.17562
Authors: Dennis Wagner,Arjun Nair,Billy Joe Franks,Justus Arweiler,Aparna Muraleedharan,Indra Jungjohann,Fabian Hartung,Mayank C. Ahuja,Andriy Balinskyy,Saurabh Varshneya,Nabeel Hussain Syed,Mayank Nagda,Phillip Liznerski,Steffen Reithermann,Maja Rudolph,Sebastian Vollmer,Ralf Schulz,Torsten Katz,Stephan Mandt,Michael Bortz,Heike Leitte,Daniel Neider,Jakob Burger,Fabian Jirasek,Hans Hasse,Sophie Fellenz,Marius Kloft
Categories: Machine Learning (cs.LG)
Comments: 73 pages, 13 figures
Abstract:Undetected anomalies in time series can trigger catastrophic failures in safety-critical systems, such as chemical plant explosions or power grid outages. Although many detection methods have been proposed, their performance remains unclear because current metrics capture only narrow aspects of the task and often yield misleading results. We address this issue by introducing verifiable properties that formalize essential requirements for evaluating time-series anomaly detection. These properties enable a theoretical framework that supports principled evaluations and reliable comparisons. Analyzing 37 widely used metrics, we show that most satisfy only a few properties, and none satisfy all, explaining persistent inconsistencies in prior results. To close this gap, we propose LARM, a flexible metric that provably satisfies all properties, and extend it to ALARM, an advanced variant meeting stricter requirements.
[LG-12] The Free Transformer
Link: https://arxiv.org/abs/2510.17558
Authors: François Fleuret
Categories: Machine Learning (cs.LG)
Abstract:We propose an extension of the decoder Transformer that conditions its generative process on random latent variables which are learned without supervision thanks to a variational procedure. Experimental evaluations show that allowing such a conditioning translates into substantial improvements on downstream tasks.
[LG-13] TrajMamba: An Efficient and Semantic-rich Vehicle Trajectory Pre-training Model NEURIPS2025
Link: https://arxiv.org/abs/2510.17545
Authors: Yichen Liu,Yan Lin,Shengnan Guo,Zeyu Zhou,Youfang Lin,Huaiyu Wan
Categories: Machine Learning (cs.LG)
Comments: Accepted by NeurIPS2025
Abstract:Vehicle GPS trajectories record how vehicles move over time, storing valuable travel semantics, including movement patterns and travel purposes. Learning travel semantics effectively and efficiently is crucial for real-world applications of trajectory data, which is hindered by two major challenges. First, travel purposes are tied to the functions of the roads and points-of-interest (POIs) involved in a trip. Such information is encoded in textual addresses and descriptions and introduces heavy computational burden to modeling. Second, real-world trajectories often contain redundant points, which harm both computational efficiency and trajectory embedding quality. To address these challenges, we propose TrajMamba, a novel approach for efficient and semantically rich vehicle trajectory learning. TrajMamba introduces a Traj-Mamba Encoder that captures movement patterns by jointly modeling both GPS and road perspectives of trajectories, enabling robust representations of continuous travel behaviors. It also incorporates a Travel Purpose-aware Pre-training procedure to integrate travel purposes into the learned embeddings without introducing extra overhead to embedding calculation. To reduce redundancy in trajectories, TrajMamba features a Knowledge Distillation Pre-training scheme to identify key trajectory points through a learnable mask generator and obtain effective compressed trajectory embeddings. Extensive experiments on two real-world datasets and three downstream tasks show that TrajMamba outperforms state-of-the-art baselines in both efficiency and accuracy.
[LG-14] Reliable Inference in Edge-Cloud Model Cascades via Conformal Alignment
Link: https://arxiv.org/abs/2510.17543
Authors: Jiayi Huang,Sangwoo Park,Nicola Paoletti,Osvaldo Simeone
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
Comments: Under Review
Abstract:Edge intelligence enables low-latency inference via compact on-device models, but assuring reliability remains challenging. We study edge-cloud cascades that must preserve conditional coverage: whenever the edge returns a prediction set, it should contain the true label with a user-specified probability, as if produced by the cloud model. We formalize conditional coverage with respect to the cloud predictive distribution, and introduce a conformal alignment-based (CAb) cascading mechanism that certifies this property with user control over the risk level. Our method casts escalation from edge to cloud models as a multiple-hypothesis testing (MHT) problem, tailoring conformal alignment (CA) to select which inputs can be safely handled at the edge. The proposed CAb model cascading method yields statistical guarantees on the average fraction of edge decisions that satisfy cloud-level conditional coverage. The procedure applies to arbitrary edge prediction sets, including variants of conformal prediction (CP), and exposes a tunable trade-off among coverage, deferral rate, and set size. Experiments on CIFAR-100 image classification and the TeleQnA question-answering (QA) benchmark show that the proposed CAb cascade maintains the target conditional coverage for edge predictions while substantially reducing offloading to the cloud and incurring modest increases in prediction-set size.
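The prediction sets mentioned in this abstract can be illustrated with plain split conformal prediction. The sketch below is only that standard baseline, not the paper's CAb cascading procedure; the function names and the score `1 - p(true label)` are illustrative assumptions.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Split-conformal score threshold for miscoverage level alpha.

    Assumption: the nonconformity score is 1 - p(true label).
    """
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        return float("inf")  # too few calibration points: keep every label
    return scores[k - 1]

def prediction_set(probs, threshold):
    """Return all labels whose score 1 - p(label) is within the threshold."""
    return [y for y, p in enumerate(probs) if 1.0 - p <= threshold]

# Toy calibration data: nine binary examples whose true label is class 0.
true_probs = [0.9, 0.8, 0.7, 0.6, 0.95, 0.85, 0.75, 0.65, 0.55]
cal_probs = [[p, 1.0 - p] for p in true_probs]
cal_labels = [0] * len(true_probs)
threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.25)
```

With `alpha=0.25` and nine calibration points, the threshold is the 8th smallest score. A confident input such as `[0.7, 0.3]` then yields the single-label set `[0]`.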
[LG-15] How Does Label Noise Gradient Descent Improve Generalization in the Low SNR Regime?
Link: https://arxiv.org/abs/2510.17526
Authors: Wei Huang,Andi Han,Yujin Song,Yilan Chen,Denny Wu,Difan Zou,Taiji Suzuki
Categories: Machine Learning (cs.LG)
Comments: 40 pages
Abstract:The capacity of deep learning models is often large enough to both learn the underlying statistical signal and overfit to noise in the training set. This noise memorization can be harmful especially for data with a low signal-to-noise ratio (SNR), leading to poor generalization. Inspired by prior observations that label noise provides implicit regularization that improves generalization, in this work, we investigate whether introducing label noise to the gradient updates can enhance the test performance of neural network (NN) in the low SNR regime. Specifically, we consider training a two-layer NN with a simple label noise gradient descent (GD) algorithm, in an idealized signal-noise data setting. We prove that adding label noise during training suppresses noise memorization, preventing it from dominating the learning process; consequently, label noise GD enjoys rapid signal growth while the overfitting remains controlled, thereby achieving good generalization despite the low SNR. In contrast, we also show that NN trained with standard GD tends to overfit to noise in the same low SNR setting and establish a non-vanishing lower bound on its test error, thus demonstrating the benefit of introducing label noise in gradient-based training.
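To make the idea of label noise in the gradient updates concrete, here is a minimal sketch. It assumes that each ±1 label is flipped independently with a fixed probability before every step, and it uses a linear logistic model rather than the two-layer network analysed in the paper. All names are hypothetical.

```python
import math
import random

def flip_labels(labels, flip_prob, rng):
    """Flip each +/-1 label independently with probability flip_prob."""
    return [-y if rng.random() < flip_prob else y for y in labels]

def label_noise_gd_step(w, xs, labels, lr, flip_prob, rng):
    """One GD step on the mean logistic loss, using freshly noised labels.

    Simplification: a linear model stands in for the paper's two-layer NN.
    """
    noisy = flip_labels(labels, flip_prob, rng)
    grad = [0.0] * len(w)
    for x, y in zip(xs, noisy):
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        # Gradient of log(1 + exp(-margin)) w.r.t. w is -y * x / (1 + exp(margin)).
        coeff = -y / (1.0 + math.exp(margin))
        for i, xi in enumerate(x):
            grad[i] += coeff * xi / len(xs)
    return [wi - lr * gi for wi, gi in zip(w, grad)]
```

Setting `flip_prob=0.0` recovers standard GD, so the same function can be used to compare the two regimes on the same data.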
[LG-16] Mitigating Clever Hans Strategies in Image Classifiers through Generating Counterexamples
Link: https://arxiv.org/abs/2510.17524
Authors: Sidney Bender,Ole Delzer,Jan Herrmann,Heike Antje Marxfeld,Klaus-Robert Müller,Grégoire Montavon
Categories: Machine Learning (cs.LG)
Abstract:Deep learning models remain vulnerable to spurious correlations, leading to so-called Clever Hans predictors that undermine robustness even in large-scale foundation and self-supervised models. Group distributional robustness methods, such as Deep Feature Reweighting (DFR) rely on explicit group labels to upweight underrepresented subgroups, but face key limitations: (1) group labels are often unavailable, (2) low within-group sample sizes hinder coverage of the subgroup distribution, and (3) performance degrades sharply when multiple spurious correlations fragment the data into even smaller groups. We propose Counterfactual Knowledge Distillation (CFKD), a framework that sidesteps these issues by generating diverse counterfactuals, enabling a human annotator to efficiently explore and correct the model’s decision boundaries through a knowledge distillation step. Unlike DFR, our method not only reweights the undersampled groups, but it also enriches them with new data points. Our method does not require any confounder labels, achieves effective scaling to multiple confounders, and yields balanced generalization across groups. We demonstrate CFKD’s efficacy across five datasets, spanning synthetic tasks to an industrial application, with particularly strong gains in low-data regimes with pronounced spurious correlations. Additionally, we provide an ablation study on the effect of the chosen counterfactual explainer and teacher model, highlighting their impact on robustness.
[LG-17] Curiosity Meets Cooperation: A Game-Theoretic Approach to Long-Tail Multi-Label Learning
Link: https://arxiv.org/abs/2510.17520
Authors: Canran Xiao,Chuangxin Zhao,Zong Ke,Fei Shen
Categories: Machine Learning (cs.LG)
Comments: Under review
Abstract:Long-tail imbalance is endemic to multi-label learning: a few head labels dominate the gradient signal, while the many rare labels that matter in practice are silently ignored. We tackle this problem by casting the task as a cooperative potential game. In our Curiosity-Driven Game-Theoretic Multi-Label Learning (CD-GTMLL) framework, the label space is split among several cooperating players that share a global accuracy payoff yet earn additional curiosity rewards that rise with label rarity and inter-player disagreement. These curiosity bonuses inject gradient on under-represented tags without hand-tuned class weights. We prove that gradient best-response updates ascend a differentiable potential and converge to tail-aware stationary points that tighten a lower bound on the expected Rare-F1. Extensive experiments on conventional benchmarks and three extreme-scale datasets show consistent state-of-the-art gains, delivering up to +4.3% Rare-F1 and +1.6% P@3 over the strongest baselines, while ablations reveal emergent division of labour and faster consensus on rare classes. CD-GTMLL thus offers a principled, scalable route to long-tail robustness in multi-label prediction.
[LG-18] SAFE-D: A Spatiotemporal Detection Framework for Abnormal Driving Among Parkinson’s Disease-like Drivers
Link: https://arxiv.org/abs/2510.17517
Authors: Hangcheng Cao,Baixiang Huang,Longzhi Yuan,Haonan An,Zihan Fang,Xianhao Chen,Yuguang Fang
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Abstract:A driver’s health state serves as a determinant factor in driving behavioral regulation. Subtle deviations from normalcy can lead to operational anomalies, posing risks to public transportation safety. While prior efforts have developed detection mechanisms for functionally-driven temporary anomalies such as drowsiness and distraction, limited research has addressed pathologically-triggered deviations, especially those stemming from chronic medical conditions. To bridge this gap, we investigate the driving behavior of Parkinson’s disease patients and propose SAFE-D, a novel framework for detecting Parkinson-related behavioral anomalies to enhance driving safety. Our methodology starts by performing analysis of Parkinson’s disease symptomatology, focusing on primary motor impairments, and establishes causal links to degraded driving performance. To represent the subclinical behavioral variations of early-stage Parkinson’s disease, our framework integrates data from multiple vehicle control components to build a behavioral profile. We then design an attention-based network that adaptively prioritizes spatiotemporal features, enabling robust anomaly detection under physiological variability. Finally, we validate SAFE-D on the Logitech G29 platform and CARLA simulator, using data from three road maps to emulate real-world driving. Our results show SAFE-D achieves 96.8% average accuracy in distinguishing normal and Parkinson-affected driving patterns.
[LG-19] AWARE: Audio Watermarking with Adversarial Resistance to Edits
Link: https://arxiv.org/abs/2510.17512
Authors: Kosta Pavlović,Lazar Stanarević,Petar Nedić,Slavko Kovačević,Igor Djurović
Categories: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Abstract:Prevailing practice in learning-based audio watermarking is to pursue robustness by expanding the set of simulated distortions during training. However, such surrogates are narrow and prone to overfitting. This paper presents AWARE (Audio Watermarking with Adversarial Resistance to Edits), an alternative approach that avoids reliance on attack-simulation stacks and handcrafted differentiable distortions. Embedding is obtained via adversarial optimization in the time-frequency domain under a level-proportional perceptual budget. Detection employs a time-order-agnostic detector with a Bitwise Readout Head (BRH) that aggregates temporal evidence into one score per watermark bit, enabling reliable watermark decoding even under desynchronization and temporal cuts. Empirically, AWARE attains high audio quality and speech intelligibility (PESQ/STOI) and consistently low BER across various audio edits, often surpassing representative state-of-the-art learning-based audio watermarking systems.
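A rough way to picture a time-order-agnostic readout with one score per watermark bit is sketched below. Summing per-frame scores over time is an assumed stand-in for the paper's Bitwise Readout Head, whose actual design is not given in the abstract. The function names are hypothetical.

```python
def decode_bits(frame_scores):
    """Decode watermark bits from per-frame, per-bit scores.

    frame_scores: list of frames, each a list with one score per bit
    (positive means evidence for bit 1). Summation over frames ignores
    frame order, so shuffling or cutting frames only changes the evidence
    total, not the alignment. This pooling rule is an assumption.
    """
    n_bits = len(frame_scores[0])
    totals = [sum(frame[b] for frame in frame_scores) for b in range(n_bits)]
    return [1 if t > 0 else 0 for t in totals]

def bit_error_rate(decoded, reference):
    """Fraction of decoded bits that differ from the reference message."""
    errors = sum(d != r for d, r in zip(decoded, reference))
    return errors / len(reference)

frames = [[1.0, -2.0, 0.5], [0.5, -1.0, -0.2], [-0.3, -0.5, 0.4]]
bits = decode_bits(frames)
```

Because the pooling is a sum, reversing `frames` or dropping the first frame (a temporal cut) leaves the decoded bits unchanged in this toy example.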
[LG-20] Convergence Rates for Gradient Descent on the Edge of Stability in Overparametrised Least Squares NEURIPS2025
Link: https://arxiv.org/abs/2510.17506
Authors: Lachlan Ewen MacDonald,Hancheng Min,Leandro Palma,Salma Tarmoun,Ziqing Xu,René Vidal
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: NeurIPS2025. Code available at this https URL
Abstract:Classical optimisation theory guarantees monotonic objective decrease for gradient descent (GD) when employed in a small step size, or "stable", regime. In contrast, gradient descent on neural networks is frequently performed in a large step size regime called the "edge of stability", in which the objective decreases non-monotonically with an observed implicit bias towards flat minima. In this paper, we take a step toward quantifying this phenomenon by providing convergence rates for gradient descent with large learning rates in an overparametrised least squares setting. The key insight behind our analysis is that, as a consequence of overparametrisation, the set of global minimisers forms a Riemannian manifold M, which enables the decomposition of the GD dynamics into components parallel and orthogonal to M. The parallel component corresponds to Riemannian gradient descent on the objective sharpness, while the orthogonal component is a bifurcating dynamical system. This insight allows us to derive convergence rates in three regimes characterised by the learning rate size: (a) the subcritical regime, in which transient instability is overcome in finite time before linear convergence to a suboptimally flat global minimum; (b) the critical regime, in which instability persists for all time with a power-law convergence toward the optimally flat global minimum; and (c) the supercritical regime, in which instability persists for all time with linear convergence to an orbit of period two centred on the optimally flat global minimum.
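The classical "stable" regime referred to here can be seen on a one-dimensional quadratic: GD on f(x) = 0.5 * a * x^2 multiplies x by (1 - lr * a) at every step, so it contracts exactly when lr < 2 / a. The sketch below only illustrates that textbook threshold, not the paper's overparametrised least squares analysis. The function name is hypothetical.

```python
def gd_on_quadratic(sharpness, lr, steps, x0=1.0):
    """Run GD on f(x) = 0.5 * sharpness * x**2 and return the |x| trajectory."""
    x = x0
    trajectory = [abs(x)]
    for _ in range(steps):
        x -= lr * sharpness * x  # gradient of f is sharpness * x
        trajectory.append(abs(x))
    return trajectory

# With sharpness 1.0 the stability threshold is lr = 2.0.
stable = gd_on_quadratic(sharpness=1.0, lr=1.9, steps=100)
unstable = gd_on_quadratic(sharpness=1.0, lr=2.1, steps=100)
```

Just below the threshold the iterates oscillate in sign but shrink. Just above it they oscillate and grow, which is why large step sizes need a separate analysis.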
[LG-21] Stochastic Difference-of-Convex Optimization with Momentum
Link: https://arxiv.org/abs/2510.17503
Authors: El Mahdi Chayti,Martin Jaggi
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Abstract:Stochastic difference-of-convex (DC) optimization is prevalent in numerous machine learning applications, yet its convergence properties under small batch sizes remain poorly understood. Existing methods typically require large batches or strong noise assumptions, which limit their practical use. In this work, we show that momentum enables convergence under standard smoothness and bounded variance assumptions (of the concave part) for any batch size. We prove that without momentum, convergence may fail regardless of stepsize, highlighting its necessity. Our momentum-based algorithm achieves provable convergence and demonstrates strong empirical performance.
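For readers unfamiliar with difference-of-convex structure, the sketch below runs the classical deterministic DC algorithm (DCA) on a toy objective f(x) = g(x) - h(x) with g(x) = x^4 and h(x) = 2x^2, both convex. It is not the stochastic momentum method proposed in the paper. The toy objective and function names are illustrative assumptions.

```python
import math

def f(x):
    """Toy DC objective: g(x) - h(x) with g(x) = x**4 and h(x) = 2 * x**2."""
    return x**4 - 2 * x**2

def dca(x0, iterations=60):
    """Classical DCA: linearise the concave part -h, then minimise exactly."""
    x = x0
    for _ in range(iterations):
        slope = 4 * x  # h'(x) at the current iterate
        # The convex subproblem argmin_x x**4 - slope * x has the
        # closed-form solution 4 * x**3 = slope, i.e. x = cbrt(slope / 4).
        x = math.copysign(abs(slope / 4) ** (1.0 / 3.0), slope)
    return x

x_star = dca(0.5)
```

Starting from 0.5 the iterates increase towards the stationary point x = 1, where f attains its minimum value of -1. Starting from a negative point they converge to x = -1 instead.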
[LG-22] Local properties of neural networks through the lens of layer-wise Hessians
Link: https://arxiv.org/abs/2510.17486
Authors: Maxim Bolshim(1),Alexander Kugaevskikh(1) ((1) ITMO University, Saint Petersburg, Russia)
Categories: Machine Learning (cs.LG)
Comments: 22 pages, 8 figures. Submitted to arXiv:cs.LG
Abstract:We introduce a methodology for analyzing neural networks through the lens of layer-wise Hessian matrices. The local Hessian of each functional block (layer) is defined as the matrix of second derivatives of a scalar function with respect to the parameters of that layer. This concept provides a formal tool for characterizing the local geometry of the parameter space. We show that the spectral properties of local Hessians, such as the distribution of eigenvalues, reveal quantitative patterns associated with overfitting, underparameterization, and expressivity in neural network architectures. We conduct an extensive empirical study involving 111 experiments across 37 datasets. The results demonstrate consistent structural regularities in the evolution of local Hessians during training and highlight correlations between their spectra and generalization performance. These findings establish a foundation for using local geometric analysis to guide the diagnosis and design of deep neural networks. The proposed framework connects optimization geometry with functional behavior and offers practical insight for improving network architectures and training stability.
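The definition of a local Hessian as the second derivatives of a scalar function with respect to one layer's parameters can be sketched with finite differences. The example assumes a toy two-layer linear model in which the second layer's weight is held fixed. The helper names and the toy loss are illustrative and not taken from the paper.

```python
import math

def block_hessian(f, params, eps=1e-3):
    """Central finite-difference Hessian of scalar f w.r.t. one parameter block."""
    n = len(params)
    hessian = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(di, dj):
                p = list(params)
                p[i] += di
                p[j] += dj
                return f(p)
            hessian[i][j] = (
                shifted(eps, eps) - shifted(eps, -eps)
                - shifted(-eps, eps) + shifted(-eps, -eps)
            ) / (4 * eps * eps)
    return hessian

def sym2_eigenvalues(h):
    """Eigenvalues (ascending) of a symmetric 2x2 matrix in closed form."""
    a, b, d = h[0][0], h[0][1], h[1][1]
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b * b)
    return mean - radius, mean + radius

# Toy model: output = v * (w1 * x1 + w2 * x2); only layer 1 (w1, w2) varies.
v, x, y = 2.0, (1.0, 2.0), 1.0

def loss_wrt_layer1(w):
    return 0.5 * (v * (w[0] * x[0] + w[1] * x[1]) - y) ** 2

H1 = block_hessian(loss_wrt_layer1, [0.3, -0.1])
low, high = sym2_eigenvalues(H1)
```

For this toy loss the layer-1 Hessian is `v**2 * x xᵀ`, so its spectrum is {0, 20}. The zero eigenvalue marks a flat direction in that layer's parameter space.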
[LG-23] Unified Privacy Guarantees for Decentralized Learning via Matrix Factorization
Link: https://arxiv.org/abs/2510.17480
Authors: Aurélien Bellet,Edwige Cyffers,Davide Frey,Romaric Gaudel,Dimitri Lerévérend,François Taïani
Categories: Machine Learning (cs.LG)
Comments: 21 pages, 5 figures
Abstract:Decentralized Learning (DL) enables users to collaboratively train models without sharing raw data by iteratively averaging local updates with neighbors in a network graph. This setting is increasingly popular for its scalability and its ability to keep data local under user control. Strong privacy guarantees in DL are typically achieved through Differential Privacy (DP), with results showing that DL can even amplify privacy by disseminating noise across peer-to-peer communications. Yet in practice, the observed privacy-utility trade-off often appears worse than in centralized training, which may be due to limitations in current DP accounting methods for DL. In this paper, we show that recent advances in centralized DP accounting based on Matrix Factorization (MF) for analyzing temporal noise correlations can also be leveraged in DL. By generalizing existing MF results, we show how to cast both standard DL algorithms and common trust models into a unified formulation. This yields tighter privacy accounting for existing DP-DL algorithms and provides a principled way to develop new ones. To demonstrate the approach, we introduce MAFALDA-SGD, a gossip-based DL algorithm with user-level correlated noise that outperforms existing methods on synthetic and real-world graphs.
[LG-24] Towards geological inference with process-based and deep generative modeling part 2: inversion of fluvial deposits and latent-space disentanglement
Link: https://arxiv.org/abs/2510.17478
Authors: Guillaume Rongier,Luk Peeters
Categories: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
Comments: 52 pages, 42 figures
Abstract:High costs and uncertainties make subsurface decision-making challenging, as acquiring new data is rarely scalable. Embedding geological knowledge directly into predictive models offers a valuable alternative. A joint approach enables just that: process-based models that mimic geological processes can help train generative models that make predictions more efficiently. This study explores whether a generative adversarial network (GAN) - a type of deep-learning algorithm for generative modeling - trained to produce fluvial deposits can be inverted to match well and seismic data. Four inversion approaches applied to three test samples with 4, 8, and 20 wells struggled to match these well data, especially as the well number increased or as the test sample diverged from the training data. The key bottleneck lies in the GAN’s latent representation: it is entangled, so samples with similar sedimentological features are not necessarily close in the latent space. Label conditioning or latent overparameterization can partially disentangle the latent space during training, although not yet sufficiently for a successful inversion. Fine-tuning the GAN to restructure the latent space locally reduces mismatches to acceptable levels for all test cases, with and without seismic data. But this approach depends on an initial, partially successful inversion step, which influences the quality and diversity of the final samples. Overall, GANs can already handle the tasks required for their integration into geomodeling workflows. We still need to further assess their robustness, and how to best leverage them in support of geological interpretation.
[LG-25] CrossStateECG: Multi-Scale Deep Convolutional Network with Attention for Rest-Exercise ECG Biometrics
Link: https://arxiv.org/abs/2510.17467
Authors: Dan Zheng,Jing Feng,Juan Liu
Categories: Machine Learning (cs.LG)
Abstract:Current research in Electrocardiogram (ECG) biometrics mainly emphasizes resting-state conditions, leaving the performance decline in rest-exercise scenarios largely unresolved. This paper introduces CrossStateECG, a robust ECG-based authentication model explicitly tailored for cross-state (rest-exercise) conditions. The proposed model creatively combines multi-scale deep convolutional feature extraction with attention mechanisms to ensure strong identification across different physiological states. Experimental results on the exercise-ECGID dataset validate the effectiveness of CrossStateECG, achieving an identification accuracy of 92.50% in the Rest-to-Exercise scenario (training on resting ECG and testing on post-exercise ECG) and 94.72% in the Exercise-to-Rest scenario (training on post-exercise ECG and testing on resting ECG). Furthermore, CrossStateECG demonstrates exceptional performance across both state combinations, reaching an accuracy of 99.94% in Rest-to-Rest scenarios and 97.85% in Mixed-to-Mixed scenarios. Additional validations on the ECG-ID and MIT-BIH datasets further confirmed the generalization abilities of CrossStateECG, underscoring its potential as a practical solution for post-exercise ECG-based authentication in dynamic real-world settings.
[LG-26] Explainable AI for microseismic event detection
Link: https://arxiv.org/abs/2510.17458
Authors: Ayrat Abdullin,Denis Anikiev,Umair bin Waheed
Categories: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
Comments: Submitted to Artificial Intelligence in Geosciences
Abstract:Deep neural networks like PhaseNet show high accuracy in detecting microseismic events, but their black-box nature is a concern in critical applications. We apply explainable AI (XAI) techniques, such as Gradient-weighted Class Activation Mapping (Grad-CAM) and Shapley Additive Explanations (SHAP), to interpret the PhaseNet model’s decisions and improve its reliability. Grad-CAM highlights that the network’s attention aligns with P- and S-wave arrivals. SHAP values quantify feature contributions, confirming that vertical-component amplitudes drive P-phase picks while horizontal components dominate S-phase picks, consistent with geophysical principles. Leveraging these insights, we introduce a SHAP-gated inference scheme that combines the model’s output with an explanation-based metric to reduce errors. On a test set of 9,000 waveforms, the SHAP-gated model achieved an F1-score of 0.98 (precision 0.99, recall 0.97), outperforming the baseline PhaseNet (F1-score 0.97) and demonstrating enhanced robustness to noise. These results show that XAI can not only interpret deep learning models but also directly enhance their performance, providing a template for building trust in automated seismic detectors.
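As a quick consistency check on the reported numbers, the F1-score is the harmonic mean of precision and recall. The snippet below recomputes it from the precision and recall quoted in the abstract; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported for the SHAP-gated model.
reported_f1 = round(f1_score(0.99, 0.97), 2)
```

The unrounded value is about 0.9799, which rounds to the 0.98 stated in the abstract.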
[LG-27] Deeper with Riemannian Geometry: Overcoming Oversmoothing and Oversquashing for Graph Foundation Models NEURIPS25
Link: https://arxiv.org/abs/2510.17457
Authors: Li Sun,Zhenhao Huang,Ming Zhang,Philip S. Yu
Categories: Machine Learning (cs.LG)
Comments: Accepted by NeurIPS 25
Abstract:Message Passing Neural Networks (MPNNs) are the building blocks of graph foundation models, but fundamentally suffer from oversmoothing and oversquashing. There has recently been a surge of interest in fixing both issues. Existing efforts primarily adopt global approaches, which may be beneficial in some regions but detrimental in others, ultimately leading to suboptimal expressiveness. In this paper, we begin by revisiting oversquashing through a global measure – spectral gap \lambda – and prove that the increase of \lambda leads to gradient vanishing with respect to the input features, thereby undermining the effectiveness of message passing. Motivated by such theoretical insights, we propose a local approach that adaptively adjusts message passing based on local structures. To achieve this, we connect local Riemannian geometry with MPNNs, and establish a novel nonhomogeneous boundary condition to address both oversquashing and oversmoothing. Building on the Robin condition, we design a GBN network with local bottleneck adjustment, coupled with theoretical guarantees. Extensive experiments on homophilic and heterophilic graphs show the expressiveness of GBN. Furthermore, GBN does not exhibit performance degradation even when the network depth exceeds 256 layers.
[LG-28] Quantifying Climate Policy Action and Its Links to Development Outcomes: A Cross-National Data-Driven Analysis NEURIPS2025
Link: https://arxiv.org/abs/2510.17425
Authors: Aditi Dutta
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: This paper/proposal has been accepted as a poster at NeurIPS 2025
Abstract:Addressing climate change effectively requires more than cataloguing the number of policies in place; it calls for tools that can reveal their thematic priorities and their tangible impacts on development outcomes. Existing assessments often rely on qualitative descriptions or composite indices, which can mask crucial differences between key domains such as mitigation, adaptation, disaster risk management, and loss and damage. To bridge this gap, we develop a quantitative indicator of climate policy orientation by applying a multilingual transformer-based language model to official national policy documents, achieving a classification accuracy of 0.90 (F1-score). Linking these indicators with World Bank development data in panel regressions reveals that mitigation policies are associated with higher GDP and GNI; disaster risk management correlates with greater GNI and debt but reduced foreign direct investment; adaptation and loss and damage show limited measurable effects. This integrated NLP-econometric framework enables comparable, theme-specific analysis of climate governance, offering a scalable method to monitor progress, evaluate trade-offs, and align policy emphasis with development goals.
[LG-29] Diffusion Models as Dataset Distillation Priors
Link: https://arxiv.org/abs/2510.17421
Authors: Duo Su,Huyu Wu,Huanran Chen,Yiming Shi,Yuzhu Wang,Xi Ye,Jun Zhu
Categories: Machine Learning (cs.LG)
Abstract:Dataset distillation aims to synthesize compact yet informative datasets from large ones. A significant challenge in this field is achieving a trifecta of diversity, generalization, and representativeness in a single distilled dataset. Although recent generative dataset distillation methods adopt powerful diffusion models as their foundation models, the inherent representativeness prior in diffusion models is overlooked. Consequently, these approaches often necessitate the integration of external constraints to enhance data quality. To address this, we propose Diffusion As Priors (DAP), which formalizes representativeness by quantifying the similarity between synthetic and real data in feature space using a Mercer kernel. We then introduce this prior as guidance to steer the reverse diffusion process, enhancing the representativeness of distilled samples without any retraining. Extensive experiments on large-scale datasets, such as ImageNet-1K and its subsets, demonstrate that DAP outperforms state-of-the-art methods in generating high-fidelity datasets while achieving superior cross-architecture generalization. Our work not only establishes a theoretical connection between diffusion priors and the objectives of dataset distillation but also provides a practical, training-free framework for improving the quality of the distilled dataset.
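One simple way to quantify similarity between synthetic and real data in feature space with a Mercer kernel is the mean pairwise RBF kernel value. The abstract does not give DAP's exact definition, so this score, the choice of the RBF kernel, and all names below are assumptions made purely for illustration.

```python
import math

def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian RBF kernel, a standard example of a Mercer kernel."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2 * bandwidth ** 2))

def representativeness(synthetic, real, bandwidth=1.0):
    """Mean kernel similarity between synthetic and real feature vectors.

    Assumed score for illustration only; not the paper's formulation.
    """
    total = sum(rbf_kernel(s, r, bandwidth) for s in synthetic for r in real)
    return total / (len(synthetic) * len(real))

real_features = [[0.0, 0.0], [1.0, 0.0]]
near_score = representativeness([[0.5, 0.0]], real_features)
far_score = representativeness([[5.0, 5.0]], real_features)
```

A synthetic sample lying between the real features scores close to 1, while a distant one scores close to 0. A guidance term could in principle push samples towards higher values of such a score.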
[LG-30] A Conditional Diffusion Model for Probabilistic Prediction of Battery Capacity Degradation
Link: https://arxiv.org/abs/2510.17414
Authors: Hequn Li,Zhongwei Deng,Chunlin Jiang,Yvxin He,Zhansheng Ning
Categories: Machine Learning (cs.LG)
Abstract:Accurate prediction of lithium-ion battery capacity and its associated uncertainty is essential for reliable battery management but remains challenging due to the stochastic nature of aging. This paper presents a novel method, termed the Condition Diffusion U-Net with Attention (CDUA), which integrates feature engineering and deep learning to address this challenge. The proposed approach employs a diffusion-based generative model for time-series forecasting and incorporates attention mechanisms to enhance predictive performance. Battery capacity is first derived from real-world vehicle operation data. The most relevant features are then identified using the Pearson correlation coefficient and the XGBoost algorithm. These features are used to train the CDUA model, which comprises two core components: (1) a contextual U-Net with self-attention to capture complex temporal dependencies, and (2) a denoising network to reconstruct accurate capacity values from noisy observations. Experimental validation on the real-world vehicle data demonstrates that the proposed CDUA model achieves a relative Mean Absolute Error (MAE) of 0.94% and a relative Root Mean Square Error (RMSE) of 1.14%, with a narrow 95% confidence interval of 3.74% in relative width. These results confirm that CDUA provides both accurate capacity estimation and reliable uncertainty quantification. Comparative experiments further verify its robustness and superior performance over existing mainstream approaches.
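The feature-screening and evaluation quantities named in this abstract can be sketched as follows. Normalising the errors by the mean absolute true value is an assumption, since the abstract does not define "relative". The function names and toy numbers are illustrative and are not the paper's data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def relative_mae(y_true, y_pred):
    """MAE divided by the mean absolute true value (assumed normalisation)."""
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return mae / (sum(abs(t) for t in y_true) / len(y_true))

def relative_rmse(y_true, y_pred):
    """RMSE divided by the mean absolute true value (assumed normalisation)."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse) / (sum(abs(t) for t in y_true) / len(y_true))

# Toy capacities in Ah, purely illustrative.
capacity_true = [100.0, 100.0, 100.0, 100.0]
capacity_pred = [101.0, 99.0, 102.0, 98.0]
rel_mae = relative_mae(capacity_true, capacity_pred)
rel_rmse = relative_rmse(capacity_true, capacity_pred)
```

On the toy values the relative MAE is 1.5% and the relative RMSE is about 1.58%. RMSE is never smaller than MAE, which matches the ordering of the figures reported in the abstract.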
[LG-31] S4ECG: Exploring the impact of long-range interactions for arrhythmia prediction
Link: https://arxiv.org/abs/2510.17406
Authors: Tiezhi Wang,Wilhelm Haverkamp,Nils Strodthoff
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Abstract:The electrocardiogram (ECG) exemplifies biosignal-based time series with continuous, temporally ordered structure reflecting cardiac physiological and pathophysiological dynamics. Detailed analysis of these dynamics has proven challenging, as conventional methods capture either global trends or local waveform features but rarely their simultaneous interplay at high temporal resolution. To bridge global and local signal analysis, we introduce S4ECG, a novel deep learning architecture leveraging structured state space models for multi-epoch arrhythmia classification. Our joint multi-epoch predictions significantly outperform single-epoch approaches by 1.0-11.6% in macro-AUROC, with atrial fibrillation specificity improving from 0.718-0.979 to 0.967-0.998, demonstrating superior performance in-distribution and enhanced out-of-distribution robustness. Systematic investigation reveals optimal temporal dependency windows spanning 10-20 minutes for peak performance. This work contributes to a paradigm shift toward temporally-aware arrhythmia detection algorithms, opening new possibilities for ECG interpretation, in particular for complex arrhythmias like atrial fibrillation and atrial flutter.
[LG-32] RINS-T: Robust Implicit Neural Solvers for Time Series Linear Inverse Problems
Link: https://arxiv.org/abs/2510.17396
Authors: Keivan Faghih Niresi,Zepeng Zhang,Olga Fink
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
*Comments: Accepted to IEEE Transactions on Instrumentation and Measurement
Abstract:Time series data are often affected by various forms of corruption, such as missing values, noise, and outliers, which pose significant challenges for tasks such as forecasting and anomaly detection. To address these issues, inverse problems focus on reconstructing the original signal from corrupted data by leveraging prior knowledge about its underlying structure. While deep learning methods have demonstrated potential in this domain, they often require extensive pretraining and struggle to generalize under distribution shifts. In this work, we propose RINS-T (Robust Implicit Neural Solvers for Time Series Linear Inverse Problems), a novel deep prior framework that achieves high recovery performance without requiring pretraining data. RINS-T leverages neural networks as implicit priors and integrates robust optimization techniques, making it resilient to outliers while relaxing the reliance on Gaussian noise assumptions. To further improve optimization stability and robustness, we introduce three key innovations: guided input initialization, input perturbation, and convex output combination techniques. Each of these contributions strengthens the framework’s optimization stability and robustness. These advancements make RINS-T a flexible and effective solution for addressing complex real-world time series challenges. Our code is available at this https URL.
[LG-33] Finite-Time Bounds for Average-Reward Fitted Q-Iteration
Link: https://arxiv.org/abs/2510.17391
Authors: Jongmin Lee,Ernest K. Ryu
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:Although there is an extensive body of work characterizing the sample complexity of discounted-return offline RL with function approximation, the average-reward setting has received significantly less attention, and existing approaches rely on restrictive assumptions, such as ergodicity or linearity of the MDP. In this work, we establish the first sample complexity results for average-reward offline RL with function approximation for weakly communicating MDPs, a much milder assumption. To this end, we introduce Anchored Fitted Q-Iteration, which combines the standard Fitted Q-Iteration with an anchor mechanism. We show that the anchor, which can be interpreted as a form of weight decay, is crucial for enabling finite-time analysis in the average-reward setting. We also extend our finite-time analysis to the setup where the dataset is generated from a single trajectory rather than IID transitions, again leveraging the anchor mechanism.
[LG-34] Exploration via Feature Perturbation in Contextual Bandits NEURIPS2025
Link: https://arxiv.org/abs/2510.17390
Authors: Seouh-won Yi,Min-hwan Oh
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: Accepted at NeurIPS 2025 (spotlight)
Abstract:We propose feature perturbation, a simple yet powerful technique that injects randomness directly into feature inputs, instead of randomizing unknown parameters or adding noise to rewards. Remarkably, this algorithm achieves a $\tilde{\mathcal{O}}(d\sqrt{T})$ worst-case regret bound for generalized linear bandits, while avoiding the $\tilde{\mathcal{O}}(d^{3/2}\sqrt{T})$ regret typical of existing randomized bandit algorithms. Because our algorithm eschews parameter sampling, it is both computationally efficient and naturally extends to non-parametric or neural network models. We verify these advantages through empirical evaluations, demonstrating that feature perturbation not only surpasses existing methods but also unifies strong practical performance with best-known theoretical guarantees.
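The core idea of this abstract, randomizing the features instead of the parameters, can be sketched for a plain linear reward model. This is only an illustration under stated assumptions, not the authors' algorithm: `select_arm`, the Gaussian noise, and the `noise_scale` knob are hypothetical, and the parameter estimate `theta_hat` is assumed to be fitted elsewhere.

```python
import random


def select_arm(theta_hat, arm_features, noise_scale, rng):
    """Pick the arm whose *perturbed* feature vector scores highest.

    theta_hat    -- current parameter estimate (assumed fitted elsewhere)
    arm_features -- one feature vector per arm
    noise_scale  -- hypothetical exploration knob (not taken from the paper)
    """
    best_arm, best_score = None, float("-inf")
    for arm, x in enumerate(arm_features):
        # Inject randomness into the features, not into theta_hat or the rewards.
        x_tilde = [xi + rng.gauss(0.0, noise_scale) for xi in x]
        score = sum(t * xi for t, xi in zip(theta_hat, x_tilde))
        if score > best_score:
            best_arm, best_score = arm, score
    return best_arm


rng = random.Random(0)
arm = select_arm([1.0, 0.0], [[0.2, 0.9], [0.8, 0.1]], noise_scale=0.3, rng=rng)
```

With `noise_scale=0.0` the rule reduces to greedy selection; a larger scale produces more exploration without ever sampling a parameter vector.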
[LG-35] Beyond Binary Out-of-Distribution Detection: Characterizing Distributional Shifts with Multi-Statistic Diffusion Trajectories
Link: https://arxiv.org/abs/2510.17381
Authors: Achref Jaziri,Martin Rogmann,Martin Mundt,Visvanathan Ramesh
Categories: Machine Learning (cs.LG)
*Comments: 11 Pages, 6 Figures
Abstract:Detecting out-of-distribution (OOD) data is critical for machine learning, be it for safety reasons or to enable open-ended learning. However, beyond mere detection, choosing an appropriate course of action typically hinges on the type of OOD data encountered. Unfortunately, the latter is generally not distinguished in practice, as modern OOD detection methods collapse distributional shifts into single scalar outlier scores. This work argues that scalar-based methods are thus insufficient for OOD data to be properly contextualized and prospectively exploited, a limitation we overcome with the introduction of DISC: Diffusion-based Statistical Characterization. DISC leverages the iterative denoising process of diffusion models to extract a rich, multi-dimensional feature vector that captures statistical discrepancies across multiple noise levels. Extensive experiments on image and tabular benchmarks show that DISC matches or surpasses state-of-the-art detectors for OOD detection and, crucially, also classifies OOD type, a capability largely absent from prior work. As such, our work enables a shift from simple binary OOD detection to a more granular detection.
[LG-36] Model Metamers Reveal Invariances in Graph Neural Networks
Link: https://arxiv.org/abs/2510.17378
Authors: Wei Xu,Xiaoyi Jiang,Lixiang Xu,Dechao Tang
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:In recent years, deep neural networks have been extensively employed in perceptual systems to learn representations endowed with invariances, aiming to emulate the invariance mechanisms observed in the human brain. However, studies in the visual and auditory domains have confirmed that significant gaps remain between the invariance properties of artificial neural networks and those of humans. To investigate the invariance behavior within graph neural networks (GNNs), we introduce a model “metamers” generation technique. By optimizing input graphs such that their internal node activations match those of a reference graph, we obtain graphs that are equivalent in the model’s representation space, yet differ significantly in both structure and node features. Our theoretical analysis focuses on two aspects: the local metamer dimension for a single node and the activation-induced volume change of the metamer manifold. Utilizing this approach, we uncover extreme levels of representational invariance across several classic GNN architectures. Although targeted modifications to model architecture and training strategies can partially mitigate this excessive invariance, they fail to fundamentally bridge the gap to human-like invariance. Finally, we quantify the deviation between metamer graphs and their original counterparts, revealing unique failure modes of current GNNs and providing a complementary benchmark for model evaluation.
[LG-37] Disentanglement Beyond Static vs. Dynamic: A Benchmark and Evaluation Framework for Multi-Factor Sequential Representations
Link: https://arxiv.org/abs/2510.17313
Authors: Tal Barami,Nimrod Berman,Ilan Naiman,Amos H. Hason,Rotem Ezra,Omri Azencot
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:Learning disentangled representations in sequential data is a key goal in deep learning, with broad applications in vision, audio, and time series. While real-world data involves multiple interacting semantic factors over time, prior work has mostly focused on simpler two-factor static and dynamic settings, primarily because such settings make data collection easier, thereby overlooking the inherently multi-factor nature of real-world data. We introduce the first standardized benchmark for evaluating multi-factor sequential disentanglement across six diverse datasets spanning video, audio, and time series. Our benchmark includes modular tools for dataset integration, model development, and evaluation metrics tailored to multi-factor analysis. We additionally propose a post-hoc Latent Exploration Stage to automatically align latent dimensions with semantic factors, and introduce a Koopman-inspired model that achieves state-of-the-art results. Moreover, we show that Vision-Language Models can automate dataset annotation and serve as zero-shot disentanglement evaluators, removing the need for manual labels and human intervention. Together, these contributions provide a robust and scalable foundation for advancing multi-factor sequential disentanglement.
[LG-38] Symmetries in PAC-Bayesian Learning
Link: https://arxiv.org/abs/2510.17303
Authors: Armin Beck,Peter Ochs
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:
Abstract:Symmetries are known to improve the empirical performance of machine learning models, yet theoretical guarantees explaining these gains remain limited. Prior work has focused mainly on compact group symmetries and often assumes that the data distribution itself is invariant, an assumption rarely satisfied in real-world applications. In this work, we extend generalization guarantees to the broader setting of non-compact symmetries, such as translations, and to non-invariant data distributions. Building on the PAC-Bayes framework, we adapt and tighten existing bounds, demonstrating the approach on McAllester’s PAC-Bayes bound while showing that it applies to a wide range of PAC-Bayes bounds. We validate our theory with experiments on a rotated MNIST dataset with a non-uniform rotation group, where the derived guarantees not only hold but also improve upon prior results. These findings provide theoretical evidence that, for symmetric data, symmetric models are preferable beyond the narrow setting of compact groups and invariant distributions, opening the way to a more general understanding of symmetries in machine learning.
[LG-39] Breaking and Fixing Defenses Against Control-Flow Hijacking in Multi-Agent Systems
Link: https://arxiv.org/abs/2510.17276
Authors: Rishi Jha,Harold Triedman,Justin Wagle,Vitaly Shmatikov
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Systems and Control (eess.SY)
*Comments:
Abstract:Control-flow hijacking attacks manipulate orchestration mechanisms in multi-agent systems into performing unsafe actions that compromise the system and exfiltrate sensitive information. Recently proposed defenses, such as LlamaFirewall, rely on alignment checks of inter-agent communications to ensure that all agent invocations are “related to” and “likely to further” the original objective. We start by demonstrating control-flow hijacking attacks that evade these defenses even if alignment checks are performed by advanced LLMs. We argue that the safety and functionality objectives of multi-agent systems fundamentally conflict with each other. This conflict is exacerbated by the brittle definitions of “alignment” and the checkers’ incomplete visibility into the execution context. We then propose, implement, and evaluate ControlValve, a new defense inspired by the principles of control-flow integrity and least privilege. ControlValve (1) generates permitted control-flow graphs for multi-agent systems, and (2) enforces that all executions comply with these graphs, along with contextual rules (generated in a zero-shot manner) for each agent invocation.
[LG-40] Uncertainty-aware data assimilation through variational inference
Link: https://arxiv.org/abs/2510.17268
Authors: Anthony Frion,David S Greenberg
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:
Abstract:Data assimilation, which combines a dynamical model with a set of noisy and incomplete observations in order to infer the state of a system over time, involves uncertainty in most settings. Building upon an existing deterministic machine learning approach, we propose a variational inference-based extension in which the predicted state follows a multivariate Gaussian distribution. Using the chaotic Lorenz-96 dynamics as a testing ground, we show that our new model yields nearly perfectly calibrated predictions, and can be integrated in a wider variational data assimilation pipeline in order to achieve greater benefit from increasing lengths of data assimilation windows. Our code is available at this https URL.
[LG-41] Adaptive Discretization for Consistency Models NEURIPS2025
Link: https://arxiv.org/abs/2510.17266
Authors: Jiayu Bai,Zhanbo Feng,Zhijie Deng,Tianqi Hou,Robert C. Qiu,Zenan Ling
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: Accepted by NeurIPS 2025
Abstract:Consistency Models (CMs) have shown promise for efficient one-step generation. However, most existing CMs rely on manually designed discretization schemes, which can cause repeated adjustments for different noise schedules and datasets. To address this, we propose a unified framework for the automatic and adaptive discretization of CMs, formulating it as an optimization problem with respect to the discretization step. Concretely, during the consistency training process, we propose using local consistency as the optimization objective to ensure trainability by avoiding excessive discretization, and taking global consistency as a constraint to ensure stability by controlling the denoising error in the training target. We establish the trade-off between local and global consistency with a Lagrange multiplier. Building on this framework, we achieve adaptive discretization for CMs using the Gauss-Newton method. We refer to our approach as ADCMs. Experiments demonstrate that ADCMs significantly improve the training efficiency of CMs, achieving superior generative performance with minimal training overhead on both CIFAR-10 and ImageNet. Moreover, ADCMs exhibit strong adaptability to more advanced DM variants. Code is available at this https URL.
[LG-42] High-Level Multi-Robot Trajectory Planning And Spurious Behavior Detection
Link: https://arxiv.org/abs/2510.17261
Authors: Fernando Salanova,Jesús Roche,Cristian Mahuela,Eduardo Montijano
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments: 6 pages, 3 figures, Iberian Robotics Conference 2025
Abstract:The reliable execution of high-level missions in multi-robot systems with heterogeneous agents requires robust methods for detecting spurious behaviors. In this paper, we address the challenge of identifying spurious executions of plans specified as a Linear Temporal Logic (LTL) formula, such as incorrect task sequences, violations of spatial constraints, timing inconsistencies, or deviations from intended mission semantics. To tackle this, we introduce a structured data generation framework based on the Nets-within-Nets (NWN) paradigm, which coordinates robot actions with LTL-derived global mission specifications. We further propose a Transformer-based anomaly detection pipeline that classifies robot trajectories as normal or anomalous. Experimental evaluations show that our method achieves high accuracy (91.3%) in identifying execution inefficiencies, and demonstrates robust detection capabilities for core mission violations (88.3%) and constraint-based adaptive anomalies (66.8%). An ablation study of the embedding and architecture shows that our proposed representation performs better than simpler representations.
[LG-43] A Prototypical Network with an Attention-based Encoder for Drivers Identification Application
Link: https://arxiv.org/abs/2510.17250
Authors: Wei-Hsun Lee(1),Che-Yu Chang(1),Kuang-Yu Li(2) ((1) Dept. of Transportation & Communication Management Science, National Cheng Kung University, Taiwan (2) Institute of Data Science, National Cheng Kung University, Taiwan)
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:Driver identification has become an area of increasing interest in recent years, especially for data-driven applications, because biometric-based technologies may incur privacy issues. This study proposes a deep learning neural network architecture, an attention-based encoder (AttEnc), which uses an attention mechanism for driver identification and uses fewer model parameters than current methods. Most studies do not address the issue of data shortages for driver identification, and most of them are inflexible when encountering unknown drivers. In this study, an architecture that combines a prototypical network and an attention-based encoder (P-AttEnc) is proposed. It applies few-shot learning to overcome the data shortage issues and to enhance model generalization. The experiments showed that the attention-based encoder can identify drivers with accuracies of 99.3%, 99.0% and 99.9% on three different datasets and has a prediction time that is 44% to 79% faster because it reduces the number of model parameters by 87.6% on average. P-AttEnc identifies drivers based on few-shot data, extracts driver fingerprints to address the issue of data shortages, and is able to classify unknown drivers. The first experiment showed that P-AttEnc can identify drivers with an accuracy of 69.8% in the one-shot scenario. The second experiment showed that P-AttEnc, in the one-shot scenario, can classify unknown drivers with an average accuracy of 65.7%.
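The prototypical-network step mentioned in this abstract can be illustrated with a generic nearest-prototype classifier. The sketch below assumes embeddings are already produced by some encoder (the attention-based encoder itself is not reproduced here); the function names and the Euclidean distance are illustrative choices, not details confirmed by the abstract.

```python
import math


def prototypes(support):
    """Average each driver's support embeddings into a single prototype.

    support -- dict mapping a driver label to a list of embedding vectors
               (assumed to come from an encoder that is not shown here)
    """
    protos = {}
    for label, vecs in support.items():
        dim = len(vecs[0])
        protos[label] = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    return protos


def classify(query, protos):
    """Return the label of the prototype closest to the query embedding."""
    return min(protos, key=lambda label: math.dist(query, protos[label]))


protos = prototypes({"driver_a": [[0.0, 0.0], [0.0, 2.0]], "driver_b": [[10.0, 10.0]]})
label = classify([1.0, 1.0], protos)
```

In this formulation a previously unseen driver can be handled by adding one more prototype from a handful of labelled trips, with no retraining of the encoder, which is one plausible reading of the few-shot behaviour the abstract describes.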
[LG-44] SOLE: Hardware-Software Co-design of Softmax and LayerNorm for Efficient Transformer Inference
Link: https://arxiv.org/abs/2510.17189
Authors: Wenxun Wang,Shuchang Zhou,Wenyu Sun,Peiqin Sun,Yongpan Liu
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*Comments:
Abstract:Transformers have shown remarkable performance in both natural language processing (NLP) and computer vision (CV) tasks. However, their real-time inference speed and efficiency are limited due to the inefficiency in Softmax and Layer Normalization (LayerNorm). Previous works based on function approximation suffer from inefficient implementation as they place emphasis on computation while disregarding memory overhead concerns. Moreover, such methods rely on retraining to compensate for approximation error which can be costly and inconvenient. In this paper, we present SOLE, a hardware-software co-design for Softmax and LayerNorm which is composed of E2Softmax and AILayerNorm. E2Softmax utilizes log2 quantization of exponent function and log-based division to approximate Softmax while AILayerNorm adopts low-precision statistic calculation. Compared with state-of-the-art designs, we achieve both low-precision calculation and low bit-width storage on Softmax and LayerNorm. Experiments show that SOLE maintains inference accuracy without retraining while offering orders of magnitude speedup and energy savings over GPU, achieving 3.04x, 3.86x energy-efficiency improvements and 2.82x, 3.32x area-efficiency improvements over prior state-of-the-art custom hardware for Softmax and LayerNorm, respectively.
[LG-45] A Standardized Benchmark for Machine-Learned Molecular Dynamics using Weighted Ensemble Sampling
Link: https://arxiv.org/abs/2510.17187
Authors: Alexander Aghili,Andy Bruce,Daniel Sabo,Sanya Murdeshwar,Kevin Bachelor,Ionut Mistreanu,Ashwin Lokapally,Razvan Marinescu
Categories: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*Comments: 37 Pages (Main Text), 10 Figures, Submitted to Journal of Physical Chemistry B
Abstract:The rapid evolution of molecular dynamics (MD) methods, including machine-learned dynamics, has outpaced the development of standardized tools for method validation. Objective comparison between simulation approaches is often hindered by inconsistent evaluation metrics, insufficient sampling of rare conformational states, and the absence of reproducible benchmarks. To address these challenges, we introduce a modular benchmarking framework that systematically evaluates protein MD methods using enhanced sampling analysis. Our approach uses weighted ensemble (WE) sampling via The Weighted Ensemble Simulation Toolkit with Parallelization and Analysis (WESTPA), based on progress coordinates derived from Time-lagged Independent Component Analysis (TICA), enabling fast and efficient exploration of protein conformational space. The framework includes a flexible, lightweight propagator interface that supports arbitrary simulation engines, allowing both classical force fields and machine learning-based models. Additionally, the framework offers a comprehensive evaluation suite capable of computing more than 19 different metrics and visualizations across a variety of domains. We further contribute a dataset of nine diverse proteins, ranging from 10 to 224 residues, that span a variety of folding complexities and topologies. Each protein has been extensively simulated at 300K for one million MD steps per starting point (4 ns). To demonstrate the utility of our framework, we perform validation tests using classic MD simulations with implicit solvent and compare protein conformational sampling using a fully trained versus under-trained CGSchNet model. By standardizing evaluation protocols and enabling direct, reproducible comparisons across MD approaches, our open-source platform lays the groundwork for consistent, rigorous benchmarking across the molecular simulation community.
[LG-46] Robustness in Text-Attributed Graph Learning: Insights, Trade-offs, and New Defenses
Link: https://arxiv.org/abs/2510.17185
Authors: Runlin Lei,Lu Yi,Mingguo He,Pengyu Qiu,Zhewei Wei,Yongchao Liu,Chuntao Hong
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:While Graph Neural Networks (GNNs) and Large Language Models (LLMs) are powerful approaches for learning on Text-Attributed Graphs (TAGs), a comprehensive understanding of their robustness remains elusive. Current evaluations are fragmented, failing to systematically investigate the distinct effects of textual and structural perturbations across diverse models and attack scenarios. To address these limitations, we introduce a unified and comprehensive framework to evaluate robustness in TAG learning. Our framework evaluates classical GNNs, robust GNNs (RGNNs), and GraphLLMs across ten datasets from four domains, under diverse text-based, structure-based, and hybrid perturbations in both poisoning and evasion scenarios. Our extensive analysis reveals multiple findings, among which three are particularly noteworthy: 1) models have inherent robustness trade-offs between text and structure, 2) the performance of GNNs and RGNNs depends heavily on the text encoder and attack type, and 3) GraphLLMs are particularly vulnerable to training data corruption. To overcome the identified trade-offs, we introduce SFT-auto, a novel framework that delivers superior and balanced robustness against both textual and structural attacks within a single model. Our work establishes a foundation for future research on TAG security and offers practical solutions for robust TAG learning in adversarial environments. Our code is available at: this https URL.
[LG-47] QRïS: A Preemptive Novel Method for Quishing Detection Through Structural Features of QR
Link: https://arxiv.org/abs/2510.17175
Authors: Muhammad Wahid Akram,Keshav Sood,Muneeb Ul Hassan
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments: 13 pages, 11 figures, and 7 tables
Abstract:Globally, individuals and organizations employ Quick Response (QR) codes for swift and convenient communication. Leveraging this, cybercriminals embed false and misleading information in QR codes to launch various phishing attacks, termed Quishing. Many earlier studies have introduced defensive approaches to prevent Quishing, for example by classifying the embedded content of QR codes and labeling the QR codes accordingly, whereas other studies classify them using visual features (i.e., deep features, histogram density analysis features). However, these approaches mainly rely on black-box techniques that do not provide the interpretability and transparency needed to fully comprehend and reproduce the underlying decision process; their limitations therefore include reduced trust and accountability and difficulty in detecting bias, among others. We propose QRïS, a pioneering method that classifies QR codes through comprehensive structural analysis of the QR code, which helps to identify phishing QR codes beforehand. Our classification method is transparent, which makes it reproducible, scalable, and easy to comprehend. First, we generated a QR code dataset (i.e., 400,000 samples) using recently published URL datasets [1], [2]. Then, unlike black-box models, we developed a simple algorithm to extract 24 structural features from the layout patterns present in QR codes. Next, we trained machine learning models on the extracted features and obtained an accuracy of up to 83.18%. To further evaluate the effectiveness of our approach, we performed a comparative analysis of the proposed method against relevant contemporary studies. Lastly, for real-world deployment and validation, we developed a mobile app that demonstrates the feasibility of the proposed solution in real-world scenarios, which strengthens the applicability of the study.
[LG-48] ALPINE: A Lightweight and Adaptive Privacy-Decision Agent Framework for Dynamic Edge Crowdsensing WWW2026
Link: https://arxiv.org/abs/2510.17162
Authors: Guanjie Cheng,Siyang Liu,Junqin Huang,Xinkui Zhao,Yin Wang,Mengying Zhu,Linghe Kong,Shuiguang Deng
Categories: Machine Learning (cs.LG)
*Comments: 12 pages, 8 figures, 4 tables. Submitted to The Web Conference (WWW 2026)
Abstract:Mobile edge crowdsensing (MECS) systems continuously generate and transmit user data in dynamic, resource-constrained environments, exposing users to significant privacy threats. In practice, many privacy-preserving mechanisms build on differential privacy (DP). However, static DP mechanisms often fail to adapt to evolving risks, for example, shifts in adversarial capabilities, resource constraints and task requirements, resulting in either excessive noise or inadequate protection. To address this challenge, we propose ALPINE, a lightweight, adaptive framework that empowers terminal devices to autonomously adjust differential privacy levels in real time. ALPINE operates as a closed-loop control system consisting of four modules: dynamic risk perception, privacy decision via twin delayed deep deterministic policy gradient (TD3), local privacy execution and performance verification from edge nodes. Based on environmental risk assessments, we design a reward function that balances privacy gains, data utility and energy cost, guiding the TD3 agent to adaptively tune noise magnitude across diverse risk scenarios and achieve a dynamic equilibrium among privacy, utility and cost. Both the collaborative risk model and pretrained TD3-based agent are designed for low-overhead deployment. Extensive theoretical analysis and real-world simulations demonstrate that ALPINE effectively mitigates inference attacks while preserving utility and cost, making it practical for large-scale edge applications.
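To make the "adaptive noise magnitude" idea concrete, here is a minimal sketch of a Laplace mechanism whose privacy budget depends on a risk score. The Laplace mechanism itself is standard differential privacy; `epsilon_for_risk` is a hypothetical linear rule standing in for the learned TD3 policy, and its bounds `eps_min` and `eps_max` are assumptions, not values from the paper.

```python
import random


def epsilon_for_risk(risk, eps_min=0.1, eps_max=2.0):
    """Hypothetical stand-in for the learned policy: higher risk -> smaller epsilon."""
    risk = min(max(risk, 0.0), 1.0)  # clamp the risk score to [0, 1]
    return eps_max - risk * (eps_max - eps_min)


def laplace_release(value, sensitivity, epsilon, rng):
    """Release value plus Laplace noise of scale sensitivity / epsilon."""
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return value + noise


rng = random.Random(0)
reading = 21.5  # e.g. a sensed value with sensitivity 1.0
noisy = laplace_release(reading, 1.0, epsilon_for_risk(0.7), rng)
```

A smaller epsilon gives stronger privacy but a noisier reading, which is the privacy, utility and cost trade-off that the reward function in the abstract is meant to balance.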
[LG-49] Learning After Model Deployment ECAI-2025
Link: https://arxiv.org/abs/2510.17160
Authors: Derda Kaymak,Gyuhak Kim,Tomoya Kaichi,Tatsuya Konishi,Bing Liu
Categories: Machine Learning (cs.LG)
*Comments: Published at ECAI-2025
Abstract:In classic supervised learning, once a model is deployed in an application, it is fixed. No updates will be made to it during the application. This is inappropriate for many dynamic and open environments, where unexpected samples from unseen classes may appear. In such an environment, the model should be able to detect these novel samples from unseen classes and learn them after they are labeled. We call this paradigm Autonomous Learning after Model Deployment (ALMD). The learning here is continuous and involves no human engineers. Labeling in this scenario is performed by human co-workers or other knowledgeable agents, which is similar to what humans do when they encounter an unfamiliar object and ask another person for its name. In ALMD, the detection of novel samples is dynamic and differs from traditional out-of-distribution (OOD) detection in that the set of in-distribution (ID) classes expands as new classes are learned during application, whereas the set of ID classes is fixed in traditional OOD detection. Learning is also different from classic supervised learning because in ALMD, we learn the encountered new classes immediately and incrementally. It is difficult to retrain the model from scratch using all the past data from the ID classes and the novel samples from newly discovered classes, as this would be resource- and time-consuming. Apart from these two challenges, ALMD faces the data scarcity issue because instances of new classes often appear sporadically in real-life applications. To address these issues, we propose a novel method, PLDA, which performs dynamic OOD detection and incremental learning of new classes on the fly. Empirical evaluations will demonstrate the effectiveness of PLDA.
[LG-50] HyperSearch: Prediction of New Hyperedges through Unconstrained yet Efficient Search ICDM
Link: https://arxiv.org/abs/2510.17153
Authors: Hyunjin Choo,Fanchen Bu,Hyunjin Hwang,Young-Gyu Yoon,Kijung Shin
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
*Comments: IEEE International Conference on Data Mining (ICDM) 2025
Abstract:Higher-order interactions (HOIs) in complex systems, such as scientific collaborations, multi-protein complexes, and multi-user communications, are commonly modeled as hypergraphs, where each hyperedge (i.e., a subset of nodes) represents an HOI among the nodes. Given a hypergraph, hyperedge prediction aims to identify hyperedges that are either missing or likely to form in the future, and it has broad applications, including recommending interest-based social groups, predicting collaborations, and uncovering functional complexes in biological systems. However, the vast search space of hyperedge candidates (i.e., all possible subsets of nodes) poses a significant computational challenge, making naive exhaustive search infeasible. As a result, existing approaches rely on either heuristic sampling to obtain constrained candidate sets or ungrounded assumptions on hypergraph structure to select promising hyperedges. In this work, we propose HyperSearch, a search-based algorithm for hyperedge prediction that efficiently evaluates unconstrained candidate sets, by incorporating two key components: (1) an empirically grounded scoring function derived from observations in real-world hypergraphs and (2) an efficient search mechanism, where we derive and use an anti-monotonic upper bound of the original scoring function (which is not anti-monotonic) to prune the search space. This pruning comes with theoretical guarantees, ensuring that discarded candidates are never better than the kept ones w.r.t. the original scoring function. In extensive experiments on 10 real-world hypergraphs across five domains, HyperSearch consistently outperforms state-of-the-art baselines, achieving higher accuracy in predicting new (i.e., not in the training set) hyperedges.
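The pruning argument in this abstract can be illustrated with a toy level-wise search. Everything below is a hypothetical stand-in rather than HyperSearch's actual scoring function: `upper_bound` is the minimum pairwise weight in a candidate set (it can only shrink as nodes are added, so it is anti-monotonic), and `score` multiplies it by a size-dependent factor below 1, which makes the score itself non-anti-monotonic while keeping it under the bound.

```python
from itertools import combinations


def upper_bound(nodes, weight):
    """Toy anti-monotonic bound: the minimum pairwise weight inside the set."""
    return min(weight[frozenset(pair)] for pair in combinations(nodes, 2))


def score(nodes, weight):
    """Toy score: never exceeds the bound, but may grow when a node is added."""
    return upper_bound(nodes, weight) * (1.0 - 1.0 / len(nodes))


def search(all_nodes, weight, threshold, max_size):
    """Return every node set of size 2..max_size whose score reaches the threshold."""
    ordered = sorted(all_nodes)
    frontier = list(combinations(ordered, 2))
    results = []
    while frontier:
        next_frontier = []
        for cand in frontier:
            if upper_bound(cand, weight) < threshold:
                continue  # no superset can reach the threshold: prune the branch
            if score(cand, weight) >= threshold:
                results.append(cand)
            if len(cand) < max_size:
                # Extend only with larger node ids so each set is generated once.
                next_frontier.extend(cand + (v,) for v in ordered if v > cand[-1])
        frontier = next_frontier
    return results
```

Because a superset's score is at most its own bound, which is at most the bound of any of its subsets, skipping a candidate whose bound is below the threshold never discards a qualifying set; this mirrors the kind of guarantee the abstract states for the real scoring function.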
[LG-51] In-situ Autoguidance: Eliciting Self-Correction in Diffusion Models ICML2025
Link: https://arxiv.org/abs/2510.17136
Authors: Enhao Gu,Haolin Hou
Categories: Machine Learning (cs.LG)
*Comments: 6 pages, 3 figures. ICML 2025 Workshop submission
Abstract:The generation of high-quality, diverse, and prompt-aligned images is a central goal in image-generating diffusion models. The popular classifier-free guidance (CFG) approach improves quality and alignment at the cost of reduced variation, creating an inherent entanglement of these effects. Recent work has successfully disentangled these properties by guiding a model with a separately trained, inferior counterpart; however, this solution introduces the considerable overhead of requiring an auxiliary model. We challenge this prerequisite by introducing In-situ Autoguidance, a method that elicits guidance from the model itself without any auxiliary components. Our approach dynamically generates an inferior prediction on the fly using a stochastic forward pass, reframing guidance as a form of inference-time self-correction. We demonstrate that this zero-cost approach is not only viable but also establishes a powerful new baseline for cost-efficient guidance, proving that the benefits of self-guidance can be achieved without external models.
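The guidance idea in this abstract can be illustrated with a minimal sketch. It assumes the "stochastic forward pass" is realized with dropout, which the abstract does not specify, and it uses a hypothetical toy denoiser in place of a real diffusion model. The names `toy_denoiser`, `guide`, and `in_situ_guided_prediction`, and the weight `w`, are illustrative and not taken from the paper.

```python
import random


def toy_denoiser(x, drop_prob=0.0, rng=None):
    """Hypothetical stand-in for a denoiser: halves every coordinate.

    With drop_prob > 0 the pass becomes stochastic: each output is either
    zeroed or rescaled (inverted dropout), giving a degraded prediction.
    """
    out = []
    for v in x:
        h = 0.5 * v
        if drop_prob > 0.0:
            keep = rng.random() >= drop_prob
            h = h / (1.0 - drop_prob) if keep else 0.0
        out.append(h)
    return out


def guide(good, inferior, w):
    """Extrapolate away from the inferior prediction: inferior + w * (good - inferior)."""
    return [b + w * (g - b) for g, b in zip(good, inferior)]


def in_situ_guided_prediction(x, w=2.0, drop_prob=0.3, seed=0):
    """Guide the model with a degraded pass of itself (assumption: dropout noise)."""
    rng = random.Random(seed)
    good = toy_denoiser(x)                      # deterministic pass
    inferior = toy_denoiser(x, drop_prob, rng)  # stochastic, degraded pass
    return guide(good, inferior, w)


print(in_situ_guided_prediction([2.0, -4.0, 1.0]))
```

With `w = 1` the guided output equals the deterministic prediction; larger `w` pushes further away from the degraded one. No auxiliary model appears anywhere in the sketch, which is the point the abstract makes.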
[LG-52] Continuous Q-Score Matching: Diffusion Guided Reinforcement Learning for Continuous-Time Control
Link: https://arxiv.org/abs/2510.17122
Authors: Chengxiu Hua, Jiawen Gu, Yushun Tang
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:Reinforcement learning (RL) has achieved significant success across a wide range of domains; however, most existing methods are formulated in discrete time. In this work, we introduce a novel RL method for continuous-time control, where stochastic differential equations govern state-action dynamics. Departing from traditional value function-based approaches, our key contribution is the characterization of continuous-time Q-functions via a martingale condition and the linking of diffusion policy scores to the action gradient of a learned continuous Q-function by the dynamic programming principle. This insight motivates Continuous Q-Score Matching (CQSM), a score-based policy improvement algorithm. Notably, our method addresses a long-standing challenge in continuous-time RL: preserving the action-evaluation capability of Q-functions without relying on time discretization. We further provide theoretical closed-form solutions for linear-quadratic (LQ) control problems within our framework. Numerical results in simulated environments demonstrate the effectiveness of our proposed method and compare it to popular baselines.
[LG-53] Fighter: Unveiling the Graph Convolutional Nature of Transformers in Time Series Modeling
Link: https://arxiv.org/abs/2510.17106
Authors: Chen Zhang, Weixin Bu, Wendong Xu, Runsheng Yu, Yik-Chung Wu, Ngai Wong
Subjects: Machine Learning (cs.LG)
Comments: Preprint
Abstract:Transformers have achieved remarkable success in time series modeling, yet their internal mechanisms remain opaque. This work demystifies the Transformer encoder by establishing its fundamental equivalence to a Graph Convolutional Network (GCN). We show that in the forward pass, the attention distribution matrix serves as a dynamic adjacency matrix, and its composition with subsequent transformations performs computations analogous to graph convolution. Moreover, we demonstrate that in the backward pass, the update dynamics of value and feed-forward projections mirror those of GCN parameters. Building on this unified theoretical reinterpretation, we propose Fighter (Flexible Graph Convolutional Transformer), a streamlined architecture that removes redundant linear projections and incorporates multi-hop graph aggregation. This perspective yields an explicit and interpretable representation of temporal dependencies across different scales, naturally expressed as graph edges. Experiments on standard forecasting benchmarks confirm that Fighter achieves competitive performance while providing clearer mechanistic interpretability of its predictions.
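The forward-pass reading of attention as a dynamic adjacency matrix can be checked on a toy example. The sketch below covers only that analogy, not the Fighter architecture; the helper names are illustrative, and the query and key projections are assumed to be the identity to keep it short.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def attention_adjacency(q, k):
    """Row-stochastic attention matrix softmax(Q K^T / sqrt(d)), read as edge weights."""
    d = len(q[0])
    scores = matmul(q, transpose(k))
    return [softmax([s / math.sqrt(d) for s in row]) for row in scores]


def graph_conv(adj, x, w):
    """One graph-convolution-style layer: aggregate over neighbours, then project."""
    return matmul(matmul(adj, x), w)


# Three time steps with two features each; Q = K = X (identity projections assumed).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_v = [[2.0, 0.0], [0.0, 3.0]]
adj = attention_adjacency(x, x)
attn_out = matmul(adj, matmul(x, w_v))  # attention: A (X W_v)
gcn_out = graph_conv(adj, x, w_v)       # graph convolution: (A X) W_v
print(adj)
```

The two outputs agree up to floating-point error because matrix multiplication is associative. Each row of `adj` sums to one, so it can be read as normalized edge weights from one time step to all others.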
[LG-54] Adapting to Stochastic and Adversarial Losses in Episodic MDPs with Aggregate Bandit Feedback
Link: https://arxiv.org/abs/2510.17103
Authors: Shinji Ito, Kevin Jamieson, Haipeng Luo, Arnab Maiti, Taira Tsuchiya
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 49 pages
Abstract:We study online learning in finite-horizon episodic Markov decision processes (MDPs) under the challenging aggregate bandit feedback model, where the learner observes only the cumulative loss incurred in each episode, rather than individual losses at each state-action pair. While prior work in this setting has focused exclusively on worst-case analysis, we initiate the study of best-of-both-worlds (BOBW) algorithms that achieve low regret in both stochastic and adversarial environments. We propose the first BOBW algorithms for episodic tabular MDPs with aggregate bandit feedback. In the case of known transitions, our algorithms achieve O(\log T) regret in stochastic settings and O(\sqrt{T}) regret in adversarial ones. Importantly, we also establish matching lower bounds, showing the optimality of our algorithms in this setting. We further extend our approach to unknown-transition settings by incorporating confidence-based techniques. Our results rely on a combination of FTRL over occupancy measures, self-bounding techniques, and new loss estimators inspired by recent advances in online shortest path problems. Along the way, we also provide the first individual-gap-dependent lower bounds and demonstrate near-optimal BOBW algorithms for shortest path problems with bandit feedback.
[LG-55] On the Universal Near Optimality of Hedge in Combinatorial Settings
Link: https://arxiv.org/abs/2510.17099
Authors: Zhiyuan Fan, Arnab Maiti, Kevin Jamieson, Lillian J. Ratliff, Gabriele Farina
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Comments: 28 pages, 1 figure
[LG-56] Data Reliability Scoring
Link: https://arxiv.org/abs/2510.17085
Authors: Yiling Chen, Shi Feng, Paul Kattuman, Fang-Yi Yu
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Machine Learning (stat.ML)
Comments: 39 pages, 5 figures
Abstract:How can we assess the reliability of a dataset without access to ground truth? We introduce the problem of reliability scoring for datasets collected from potentially strategic sources. The true data are unobserved, but we see outcomes of an unknown statistical experiment that depends on them. To benchmark reliability, we define ground-truth-based orderings that capture how much reported data deviate from the truth. We then propose the Gram determinant score, which measures the volume spanned by vectors describing the empirical distribution of the observed data and experiment outcomes. We show that this score preserves several ground-truth based reliability orderings and, uniquely up to scaling, yields the same reliability ranking of datasets regardless of the experiment – a property we term experiment agnosticism. Experiments on synthetic noise models, CIFAR-10 embeddings, and real employment data demonstrate that the Gram determinant score effectively captures data quality across diverse observation processes.
[LG-57] Convergence of Regret Matching in Potential Games and Constrained Optimization
Link: https://arxiv.org/abs/2510.17067
Authors: Ioannis Anagnostides, Emanuel Tewolde, Brian Hu Zhang, Ioannis Panageas, Vincent Conitzer, Tuomas Sandholm
Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:Regret matching (RM), together with its modern variants, is a foundational online algorithm that has been at the heart of many AI breakthrough results in solving benchmark zero-sum games, such as poker. Yet, surprisingly little is known so far in theory about its convergence beyond two-player zero-sum games. For example, whether regret matching converges to Nash equilibria in potential games has been an open problem for two decades. Even beyond games, one could try to use RM variants for general constrained optimization problems. Recent empirical evidence suggests that they, particularly regret matching^+ (RM^+), attain strong performance on benchmark constrained optimization problems, outperforming traditional gradient descent-type algorithms. We show that alternating RM^+ converges to an \epsilon-KKT point after O_\epsilon(1/\epsilon^4) iterations, establishing for the first time that it is a sound and fast first-order optimizer. Our argument relates the KKT gap to the accumulated regret, two quantities that are entirely disparate in general but interact in an intriguing way in our setting, so much so that when regrets are bounded, our complexity bound improves all the way to O_\epsilon(1/\epsilon^2). From a technical standpoint, while RM^+ does not have the usual one-step improvement property in general, we show that it does in a certain region that the algorithm will quickly reach and remain in thereafter. In sharp contrast, our second main result establishes a lower bound: RM, with or without alternation, can take an exponential number of iterations to reach a crude approximate solution even in two-player potential games. This represents the first worst-case separation between RM and RM^+. Our lower bound shows that convergence to coarse correlated equilibria in potential games is exponentially faster than convergence to Nash equilibria.
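For readers unfamiliar with RM^+, the sketch below shows the textbook update with alternation on a small identical-interest game, which is one kind of potential game. It illustrates the algorithm the abstract analyzes, not the paper's proofs or experiments. The payoff matrix and the function names are illustrative assumptions.

```python
def rm_plus_strategy(regrets):
    """Play proportionally to the stored (nonnegative) regrets; uniform if all are zero."""
    total = sum(regrets)
    n = len(regrets)
    if total <= 0.0:
        return [1.0 / n] * n
    return [r / total for r in regrets]


def rm_plus_update(regrets, strategy, utilities):
    """RM^+ update: add the instantaneous regret, then clip at zero.

    Plain RM would store the unclipped sum and clip only when forming the strategy.
    """
    expected = sum(p * u for p, u in zip(strategy, utilities))
    return [max(r + (u - expected), 0.0) for r, u in zip(regrets, utilities)]


def alternating_rm_plus(payoff, iterations):
    """Both players receive payoff[i][j]; player 2 reacts to player 1's fresh strategy."""
    n, m = len(payoff), len(payoff[0])
    reg_x, reg_y = [0.0] * n, [0.0] * m
    x, y = rm_plus_strategy(reg_x), rm_plus_strategy(reg_y)
    for _ in range(iterations):
        u_x = [sum(payoff[i][j] * y[j] for j in range(m)) for i in range(n)]
        reg_x = rm_plus_update(reg_x, x, u_x)
        x = rm_plus_strategy(reg_x)
        # Alternation: player 2's utilities use the already-updated x.
        u_y = [sum(payoff[i][j] * x[i] for i in range(n)) for j in range(m)]
        reg_y = rm_plus_update(reg_y, y, u_y)
        y = rm_plus_strategy(reg_y)
    return x, y


# Coordination game: both players prefer to agree on the first action.
x, y = alternating_rm_plus([[2.0, 0.0], [0.0, 1.0]], iterations=5)
print(x, y)
```

On this toy game both players lock onto the first action after a single round, which is a Nash equilibrium. The exponential lower bound mentioned in the abstract concerns plain RM on carefully constructed instances, which this example does not reproduce.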
[LG-58] Consistent Zero-Shot Imitation with Contrastive Goal Inference
Link: https://arxiv.org/abs/2510.17059
Authors: Kathryn Wantlin, Chongyi Zheng, Benjamin Eysenbach
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:In the same way that generative models today conduct most of their training in a self-supervised fashion, how can agentic models conduct their training in a self-supervised fashion, interactively exploring, learning, and preparing to quickly adapt to new tasks? A prerequisite for embodied agents deployed in real world interactions ought to be training with interaction, yet today’s most successful AI models (e.g., VLMs, LLMs) are trained without an explicit notion of action. The problem of pure exploration (which assumes no data as input) is well studied in the reinforcement learning literature and provides agents with a wide array of experiences, yet it fails to prepare them for rapid adaptation to new tasks. Today’s language and vision models are trained on data provided by humans, which provides a strong inductive bias for the sorts of tasks that the model will have to solve (e.g., modeling chords in a song, phrases in a sonnet, sentences in a medical record). However, when they are prompted to solve a new task, there is a faulty tacit assumption that humans spend most of their time in the most rewarding states. The key contribution of our paper is a method for pre-training interactive agents in a self-supervised fashion, so that they can instantly mimic human demonstrations. Our method treats goals (i.e., observations) as the atomic construct. During training, our method automatically proposes goals and practices reaching them, building off prior work in reinforcement learning exploration. During evaluation, our method solves an (amortized) inverse reinforcement learning problem to explain demonstrations as optimal goal-reaching behavior. Experiments on standard benchmarks (not designed for goal-reaching) show that our approach outperforms prior methods for zero-shot imitation.
[LG-59] Diverse Influence Component Analysis: A Geometric Approach to Nonlinear Mixture Identifiability
Link: https://arxiv.org/abs/2510.17040
Authors: Hoang-Son Nguyen, Xiao Fu
Subjects: Machine Learning (cs.LG)
Comments: 30 pages, 3 figures
Abstract:Latent component identification from unknown nonlinear mixtures is a foundational challenge in machine learning, with applications in tasks such as disentangled representation learning and causal inference. Prior work in nonlinear independent component analysis (nICA) has shown that auxiliary signals – such as weak supervision – can support identifiability of conditionally independent latent components. More recent approaches explore structural assumptions, e.g., sparsity in the Jacobian of the mixing function, to relax such requirements. In this work, we introduce Diverse Influence Component Analysis (DICA), a framework that exploits the convex geometry of the mixing function’s Jacobian. We propose a Jacobian Volume Maximization (J-VolMax) criterion, which enables latent component identification by encouraging diversity in their influence on the observed variables. Under reasonable conditions, this approach achieves identifiability without relying on auxiliary information, latent component independence, or Jacobian sparsity assumptions. These results extend the scope of identifiability analysis and offer a complementary perspective to existing methods.
[LG-60] Hephaestus: Mixture Generative Modeling with Energy Guidance for Large-scale QoS Degradation NEURIPS2025
Link: https://arxiv.org/abs/2510.17036
Authors: Nguyen Do, Bach Ngo, Youval Kashuv, Canh V. Pham, Hanghang Tong, My T. Thai
Subjects: Machine Learning (cs.LG)
Comments: 62 pages, 19 figures, Neural Information Processing Systems (NeurIPS 2025)
[LG-61] EEschematic: Multimodal-LLM Based AI Agent for Schematic Generation of Analog Circuit
Link: https://arxiv.org/abs/2510.17002
Authors: Chang Liu, Danial Chitnis
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Circuit schematics play a crucial role in analog integrated circuit design, serving as the primary medium for human understanding and verification of circuit functionality. While recent large language model (LLM)-based approaches have shown promise in circuit topology generation and device sizing, most rely solely on textual representations such as SPICE netlists, which lack visual interpretability for circuit designers. To address this limitation, we propose EEschematic, an AI agent for automatic analog schematic generation based on a Multimodal Large Language Model (MLLM). EEschematic integrates textual, visual, and symbolic modalities to translate SPICE netlists into schematic diagrams represented in a human-editable format. The framework uses six analog substructure examples for few-shot placement and a Visual Chain-of-Thought (VCoT) strategy to iteratively refine placement and wiring, enhancing schematic clarity and symmetry. Experimental results on representative analog circuits, including a CMOS inverter, a five-transistor operational transconductance amplifier (5T-OTA), and a telescopic cascode amplifier, demonstrate that EEschematic produces schematics with high visual quality and structural correctness.
[LG-62] Graph4MM: Weaving Multimodal Learning with Structural Information ICML2025
Link: https://arxiv.org/abs/2510.16990
Authors: Xuying Ning, Dongqi Fu, Tianxin Wei, Wujiang Xu, Jingrui He
Subjects: Machine Learning (cs.LG)
Comments: ICML 2025
Abstract:Real-world multimodal data usually exhibit complex structural relationships beyond traditional one-to-one mappings like image-caption pairs. Entities across modalities interact in intricate ways, with images and text forming diverse interconnections through contextual dependencies and co-references. Graphs provide powerful structural information for modeling intra-modal and inter-modal relationships. However, previous works fail to distinguish multi-hop neighbors and treat the graph as a standalone modality, which fragments the overall understanding. This limitation presents two key challenges in multimodal learning: (1) integrating structural information from multi-hop neighbors into foundational models, and (2) fusing modality-specific information in a principled manner. To address these challenges, we revisit the role of graphs in multimodal learning within the era of foundation models and propose Graph4MM, a graph-based multimodal learning framework. To be specific, we introduce Hop-Diffused Attention, which integrates multi-hop structural information into self-attention through causal masking and hop diffusion. Furthermore, we design MM-QFormer, a multi-mapping querying transformer for cross-modal fusion. Through theoretical and empirical analysis, we show that leveraging structures to integrate both intra- and inter-modal interactions improves multimodal understanding beyond treating them as a standalone modality. Experiments on both generative and discriminative tasks show that Graph4MM outperforms larger VLMs, LLMs, and multimodal graph baselines, achieving a 6.93% average improvement.
[LG-63] MuonBP: Faster Muon via Block-Periodic Orthogonalization
Link: https://arxiv.org/abs/2510.16981
Authors: Ahmed Khaled, Kaan Ozkara, Tao Yu, Mingyi Hong, Youngsuk Park
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:Gradient orthogonalization is a simple strategy that shows great utility in speeding up gradient descent. The Muon optimizer (Jordan, Jin, et al., 2024) combines gradient orthogonalization with first-order momentum and achieves significant improvement in data efficiency over Adam/AdamW (Loshchilov and Hutter, 2019) for language model training. However, when using model parallelism, gradient orthogonalization introduces additional overhead compared to coordinate-wise optimizers (such as AdamW) due to additional gather and scatter operations on gradient matrix shards from different devices. This additional communication can amount to a throughput hit of 5%-10% compared to Adam/AdamW. To remedy this, we propose Muon with Block-Periodic Orthogonalization (MuonBP), which applies orthogonalization independently to matrix shards on each device and periodically performs full orthogonalization to maintain training stability at scale. We show how to adjust the learning rate from the baseline to MuonBP and give convergence guarantees for this algorithm. Crucially, our theory dictates that we use two stepsizes: one for the blockwise orthogonalization steps, and one for the full orthogonalization steps. Our method is simple, requires minimal hyperparameter adjustments, and achieves competitive iteration complexity compared with baseline Muon while providing per-iteration throughput comparable to coordinate-wise methods such as AdamW. When training an 8B model with eight-way tensor parallelism and ZeRO optimizer state sharding, MuonBP achieves 8% throughput increase compared to Muon with no degradation in performance.
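The block-periodic schedule described in this abstract can be sketched on plain Python lists. Several choices here are assumptions rather than details from the paper: orthogonalization uses the classical cubic Newton-Schulz iteration instead of Muon's tuned polynomial, momentum is left out, shards are contiguous row blocks, and the row count is assumed to divide evenly by `num_shards`. All function names are illustrative.

```python
import math


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def orthogonalize(g, iters=30, eps=1e-12):
    """Approximate the orthogonal polar factor of g with cubic Newton-Schulz.

    Dividing by the Frobenius norm keeps all singular values in (0, 1],
    where the iteration X <- 1.5 X - 0.5 (X X^T) X pushes them toward 1.
    """
    norm = math.sqrt(sum(v * v for row in g for v in row)) + eps
    x = [[v / norm for v in row] for row in g]
    for _ in range(iters):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)]
             for ra, rb in zip(x, xxt_x)]
    return x


def blockwise_orthogonalize(g, num_shards):
    """Orthogonalize each row shard on its own, with no cross-shard communication."""
    rows_per = len(g) // num_shards
    out = []
    for s in range(num_shards):
        out.extend(orthogonalize(g[s * rows_per:(s + 1) * rows_per]))
    return out


def muonbp_update(w, g, step, period, lr_block, lr_full, num_shards):
    """Full orthogonalization every `period` steps, blockwise otherwise.

    Two step sizes are used, one per kind of step, as the abstract prescribes.
    """
    if step % period == 0:
        direction, lr = orthogonalize(g), lr_full
    else:
        direction, lr = blockwise_orthogonalize(g, num_shards), lr_block
    return [[wv - lr * dv for wv, dv in zip(wr, dr)]
            for wr, dr in zip(w, direction)]


grad = [[3.0, 4.0], [0.0, 2.0]]
weights = [[0.0, 0.0], [0.0, 0.0]]
print(muonbp_update(weights, grad, step=1, period=4,
                    lr_block=1.0, lr_full=1.0, num_shards=2))
```

With one row per shard, blockwise orthogonalization reduces to normalizing each row. In a real tensor-parallel setup the full step is the one that needs gather and scatter communication, which is why it is only run periodically.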
[LG-64] Towards Interpretable and Trustworthy Time Series Reasoning: A BlueSky Vision
Link: https://arxiv.org/abs/2510.16980
Authors: Kanghui Ning, Zijie Pan, Yushan Jiang, Anderson Schneider, Yuriy Nevmyvaka, Dongjin Song
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Time series reasoning is emerging as the next frontier in temporal analysis, aiming to move beyond pattern recognition towards explicit, interpretable, and trustworthy inference. This paper presents a BlueSky vision built on two complementary directions. One builds robust foundations for time series reasoning, centered on comprehensive temporal understanding, structured multi-step reasoning, and faithful evaluation frameworks. The other advances system-level reasoning, moving beyond language-only explanations by incorporating multi-agent collaboration, multi-modal context, and retrieval-augmented approaches. Together, these directions outline a flexible and extensible framework for advancing time series reasoning, aiming to deliver interpretable and trustworthy temporal intelligence across diverse domains.
[LG-65] Differentially Private Linear Regression and Synthetic Data Generation with Statistical Guarantees
Link: https://arxiv.org/abs/2510.16974
Authors: Shurong Lin, Aleksandra Slavković, Deekshith Reddy Bhoomireddy
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:In social sciences, small- to medium-scale datasets are common and linear regression (LR) is canonical. In privacy-aware settings, much work has focused on differentially private (DP) LR, but mostly on point estimation with limited attention to uncertainty quantification. Meanwhile, synthetic data generation (SDG) is increasingly important for reproducibility studies, yet current DP LR methods do not readily support it. Mainstream SDG approaches are either tailored to discretized data, making them less suitable for continuous regression, or rely on deep models that require large datasets, limiting their use for the smaller, continuous data typical in social science. We propose a method for LR with valid inference under Gaussian DP: a DP bias-corrected estimator with asymptotic confidence intervals (CIs) and a general SDG procedure in which regression on the synthetic data matches our DP regression. Our binning-aggregation strategy is effective in small- to moderate-dimensional settings. Experiments show our method (1) improves accuracy over existing methods, (2) provides valid CIs, and (3) produces more reliable synthetic data for downstream ML tasks than current DP SDGs.
[LG-66] Closing the Curvature Gap: Full Transformer Hessians and Their Implications for Scaling Laws ICLR2026
Link: https://arxiv.org/abs/2510.16927
Authors: Egor Petrov, Nikita Kiselev, Vladislav Meshkov, Andrey Grabovoy
Subjects: Machine Learning (cs.LG)
Comments: 38 pages, 12 figures. Submitted to ICLR 2026
Abstract:The lack of theoretical results for Layer Normalization and feedforward Hessians has left a gap in the study of Transformer optimization landscapes. We address this by deriving explicit second-order expressions for these components, thereby completing the Hessian characterization of full Transformer blocks. Our results generalize prior self-attention analyses and yield estimations for the role of each sublayer in curvature propagation. We demonstrate how these Hessian structures inform both convergence dynamics and the empirical scaling laws governing large-model performance. Further, we propose a Taylor-expansion-based framework for analyzing loss differences to quantify convergence trajectories. By extending Hessian theory to the full Transformer architecture, this work establishes a new foundation for theoretical and empirical investigations of optimization in large-scale deep learning.
[LG-67] SolverLLM: Leveraging Test-Time Scaling for Optimization Problem via LLM-Guided Search NEURIPS2025
Link: https://arxiv.org/abs/2510.16916
Authors: Dong Li, Xujiang Zhao, Linlin Yu, Yanchi Liu, Wei Cheng, Zhengzhang Chen, Zhong Chen, Feng Chen, Chen Zhao, Haifeng Chen
Subjects: Machine Learning (cs.LG)
Comments: NeurIPS 2025
Abstract:Large Language Models (LLMs) offer promising capabilities for tackling complex reasoning tasks, including optimization problems. However, existing methods either rely on prompt engineering, which leads to poor generalization across problem types, or require costly supervised training. We introduce SolverLLM, a training-free framework that leverages test-time scaling to solve diverse optimization problems. Rather than solving directly, SolverLLM generates mathematical formulations and translates them into solver-ready code, guided by a novel Monte Carlo Tree Search (MCTS) strategy. To enhance the search process, we modify classical MCTS with (1) dynamic expansion for adaptive formulation generation, (2) prompt backpropagation to guide exploration via outcome-driven feedback, and (3) uncertainty backpropagation to incorporate reward reliability into decision-making. Experiments on six standard benchmark datasets demonstrate that SolverLLM outperforms both prompt-based and learning-based baselines, achieving strong generalization without additional training.
[LG-68] DeepChem Equivariant: SE(3)-Equivariant Support in an Open-Source Molecular Machine Learning Library
Link: https://arxiv.org/abs/2510.16897
Authors: Jose Siguenza, Bharath Ramsundar
Subjects: Machine Learning (cs.LG)
Comments: Presented at Machine Learning Symposium - BayLearn (2025)
Abstract:Neural networks that incorporate geometric relationships respecting SE(3) group transformations (e.g. rotations and translations) are increasingly important in molecular applications, such as molecular property prediction, protein structure modeling, and materials design. These models, known as SE(3)-equivariant neural networks, ensure outputs transform predictably with input coordinate changes by explicitly encoding spatial atomic positions. Although libraries such as E3NN [4] and SE(3)-TRANSFORMER [3] offer powerful implementations, they often require substantial deep learning or mathematical prior knowledge and lack complete training pipelines. We extend DEEPCHEM [13] with support for ready-to-use equivariant models, enabling scientists with minimal deep learning background to build, train, and evaluate models, such as SE(3)-Transformer and Tensor Field Networks. Our implementation includes equivariant models, complete training pipelines, and a toolkit of equivariant utilities, supported with comprehensive tests and documentation, to facilitate both application and further development of SE(3)-equivariant models.
[LG-69] UniGTE: Unified Graph-Text Encoding for Zero-Shot Generalization across Graph Tasks and Domains
Link: https://arxiv.org/abs/2510.16885
Authors: Duo Wang, Yuan Zuo, Guangyue Lu, Junjie Wu
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Generalizing to unseen graph tasks without task-specific supervision is challenging: conventional graph neural networks are typically tied to a fixed label space, while large language models (LLMs) struggle to capture graph structure. We introduce UniGTE, an instruction-tuned encoder-decoder framework that unifies structural and semantic reasoning. The encoder augments a pretrained autoregressive LLM with learnable alignment tokens and a structure-aware graph-text attention mechanism, enabling it to attend jointly to a tokenized graph and a natural-language task prompt while remaining permutation-invariant to node order. This yields compact, task-aware graph representations. Conditioned solely on these representations, a frozen LLM decoder predicts and reconstructs: it outputs the task answer and simultaneously paraphrases the input graph in natural language. The reconstruction objective regularizes the encoder to preserve structural cues. UniGTE is instruction-tuned on five datasets spanning node-level, edge-level, and graph-level tasks across diverse domains, yet requires no fine-tuning at inference. It achieves new state-of-the-art zero-shot results on node classification, link prediction, graph classification, and graph regression under cross-task and cross-domain settings, demonstrating that tight integration of graph structure with LLM semantics enables robust, transferable graph reasoning.
[LG-70] ProtoMol: Enhancing Molecular Property Prediction via Prototype-Guided Multimodal Learning
Link: https://arxiv.org/abs/2510.16824
Authors: Yingxu Wang, Kunyu Zhang, Jiaxin Huang, Nan Yin, Siwei Liu, Eran Segal
Subjects: Machine Learning (cs.LG); Molecular Networks (q-bio.MN)
Comments:
[LG-71] Finding Manifolds With Bilinear Autoencoders
Link: https://arxiv.org/abs/2510.16820
Authors: Thomas Dooms, Ward Gauderis
Subjects: Machine Learning (cs.LG)
Comments:
[LG-72] Trace Regularity PINNs: Enforcing \mathrm{H}^{\frac{1}{2}}(\partial Ω) for Boundary Data
Link: https://arxiv.org/abs/2510.16817
Authors: Doyoon Kim, Junbin Song
Subjects: Machine Learning (cs.LG); Analysis of PDEs (math.AP)
Comments:
Information Retrieval
[IR-0] How role-play shapes relevance judgment in zero-shot LLM rankers
Link: https://arxiv.org/abs/2510.17535
Authors: Yumeng Wang, Jirui Qi, Catherine Chen, Panagiotis Eustratiadis, Suzan Verberne
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Large Language Models (LLMs) have emerged as promising zero-shot rankers, but their performance is highly sensitive to prompt formulation. In particular, role-play prompts, where the model is assigned a functional role or identity, often give more robust and accurate relevance rankings. However, the mechanisms and diversity of role-play effects remain underexplored, limiting both effective use and interpretability. In this work, we systematically examine how role-play variations influence zero-shot LLM rankers. We employ causal intervention techniques from mechanistic interpretability to trace how role-play information shapes relevance judgments in LLMs. Our analysis reveals that (1) careful formulation of role descriptions has a large effect on the ranking quality of the LLM; (2) role-play signals are predominantly encoded in early layers and communicate with task instructions in middle layers, while receiving limited interaction with query or document representations. Specifically, we identify a group of attention heads that encode information critical for role-conditioned relevance. These findings not only shed light on the inner workings of role-play in LLM ranking but also offer guidance for designing more effective prompts in IR and beyond, pointing toward broader opportunities for leveraging role-play in zero-shot applications.
[IR-1] On Efficiency-Effectiveness Trade-off of Diffusion-based Recommenders
Link: https://arxiv.org/abs/2510.17245
Authors: Wenyu Mao, Jiancan Wu, Guoqing Hu, Zhengyi Yang, Wei Ji, Xiang Wang
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Diffusion models have emerged as a powerful paradigm for generative sequential recommendation, which typically generate next items to recommend guided by user interaction histories with a multi-step denoising process. However, the multi-step process relies on discrete approximations, introducing discretization error that creates a trade-off between computational efficiency and recommendation effectiveness. To address this trade-off, we propose TA-Rec, a two-stage framework that achieves one-step generation by smoothing the denoising function during pretraining while alleviating trajectory deviation by aligning with user preferences during fine-tuning. Specifically, to improve the efficiency without sacrificing the recommendation performance, TA-Rec pretrains the denoising model with Temporal Consistency Regularization (TCR), enforcing the consistency between the denoising results across adjacent steps. Thus, we can smooth the denoising function to map the noise as oracle items in one step with bounded error. To further enhance effectiveness, TA-Rec introduces Adaptive Preference Alignment (APA) that aligns the denoising process with user preference adaptively based on preference pair similarity and timesteps. Extensive experiments prove that TA-Rec’s two-stage objective effectively mitigates the trade-off induced by discretization errors, enhancing both efficiency and effectiveness of diffusion-based recommenders.
[IR-2] DSEBench: A Test Collection for Explainable Dataset Search with Examples
Link: https://arxiv.org/abs/2510.17228
Authors: Qing Shi, Jing He, Qiaosheng Chen, Gong Cheng
Subjects: Information Retrieval (cs.IR)
Comments: 34 pages, 5 figures, submitted to Knowledge-Based Systems
Abstract:Dataset search has been an established information retrieval task. Current paradigms either retrieve datasets that are relevant to a keyword query or find datasets that are similar to an input target dataset. To allow for their combined specification of information needs, in this article, we investigate the more generalized task of Dataset Search with Examples (DSE) and further extend it to Explainable DSE that requires identifying the metadata and content fields of a dataset that indicate its relevance to the query and similarity to the target datasets. To facilitate this research, we construct DSEBench, a test collection that provides high-quality dataset- and field-level annotations to enable the evaluation of explainable DSE. We also employ a large language model to generate numerous annotations to be used for training. We establish extensive baselines on DSEBench by adapting and evaluating a variety of sparse, dense, and LLM-based retrieval, reranking, and explanation methods.
[IR-3] Towards Context-aware Reasoning-enhanced Generative Searching in E-commerce
Link: https://arxiv.org/abs/2510.16925
Authors: Zhiding Liu, Ben Chen, Mingyue Cheng, Enchong Chen, Li Li, Chenyi Lei, Wenwu Ou, Han Li, Kun Gai
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Search-based recommendation is one of the most critical application scenarios in e-commerce platforms. Users’ complex search contexts (such as spatiotemporal factors, historical interactions, and the current query’s information) constitute an essential part of their decision-making, reflecting implicit preferences that complement explicit query terms. Modeling such rich contextual signals and their intricate associations with candidate items remains a key challenge. Although numerous efforts have been devoted to building more effective search methods, existing approaches still show limitations in integrating contextual information, which hinders their ability to fully capture user intent. To address these challenges, we propose a context-aware reasoning-enhanced generative search framework for better understanding of the complicated context. Specifically, the framework first unifies heterogeneous user and item contexts into textual representations or text-based semantic identifiers and aligns them. To overcome the lack of explicit reasoning trajectories, we introduce a self-evolving post-training paradigm that iteratively combines supervised fine-tuning and reinforcement learning to progressively enhance the model’s reasoning capability. In addition, we identify potential biases in existing RL algorithms when applied to search scenarios and present a debiased variant of GRPO to improve ranking performance. Extensive experiments on search log data collected from a real-world e-commerce platform demonstrate that our approach achieves superior performance compared with strong baselines, validating its effectiveness for search-based recommendation.
Attachment Download
Click to download today's full list of papers