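This post lists the latest papers retrieved from arXiv.org on 2025-10-23. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the daily paper list by email, leave your email address in the comments.
Contents
Overview (2025-10-23)
508 papers were updated today, including:
- Natural Language Processing: 98 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 144 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 87 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 161 papers (Machine Learning (cs.LG))
Natural Language Processing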
[NLP-0] olmOCR 2: Unit Test Rewards for Document OCR
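[Quick Read]: This paper addresses the accuracy and robustness of OCR (optical character recognition) when converting digitized print documents such as PDFs into structured, naturally ordered plain text, where performance falls short in complex cases such as math formula recognition, table parsing, and multi-column layouts. The key to the solution is olmOCR 2, built on olmOCR-2-7B-1025, a 7B-parameter vision language model (VLM) trained with reinforcement learning with verifiable rewards (RLVR). A large set of binary unit tests serves as the reward signal, and a pipeline that automatically generates synthetic documents makes it possible to build diverse, challenging test data at scale. The result is state-of-the-art performance on olmOCR-Bench, with the largest gains in math formulas, tables, and multi-column layouts.
Link: https://arxiv.org/abs/2510.19817
Authors: Jake Poznanski,Luca Soldaini,Kyle Lo
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: this https URL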
Abstract:We present olmOCR 2, the latest in our family of powerful OCR systems for converting digitized print documents, like PDFs, into clean, naturally ordered plain text. olmOCR 2 is powered by olmOCR-2-7B-1025, a specialized, 7B vision language model (VLM) trained using reinforcement learning with verifiable rewards (RLVR), where our rewards are a diverse set of binary unit tests. To scale unit test creation, we develop a pipeline for generating synthetic documents with diverse and challenging layouts, known ground-truth HTML source code, and extracted test cases. We show that RL training on these test cases results in state-of-the-art performance on olmOCR-Bench, our English-language OCR benchmark, with the largest improvements in math formula conversion, table parsing, and multi-column layouts compared to previous versions. We release our model, data and code under permissive open licenses.
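To make the idea of "binary unit tests as rewards" concrete, here is a minimal sketch. The two test types (`contains_text`, `appears_before`) and the averaging rule are assumptions chosen for illustration; the abstract only says that rewards are a diverse set of binary unit tests and does not specify how they are aggregated.

```python
from typing import Callable, List

# Hypothetical binary tests: each takes the OCR output and returns True/False.
def contains_text(snippet: str) -> Callable[[str], bool]:
    return lambda output: snippet in output

def appears_before(first: str, second: str) -> Callable[[str], bool]:
    def check(output: str) -> bool:
        i, j = output.find(first), output.find(second)
        return i != -1 and j != -1 and i < j
    return check

def unit_test_reward(output: str, tests: List[Callable[[str], bool]]) -> float:
    """Fraction of binary tests passed (an assumed aggregation rule)."""
    if not tests:
        return 0.0
    return sum(1 for t in tests if t(output)) / len(tests)

tests = [
    contains_text("Introduction"),
    contains_text("E = mc^2"),
    appears_before("Introduction", "Conclusion"),  # reading-order check
]
good_output = "Introduction\nE = mc^2\nConclusion"
bad_output = "Conclusion\nIntroduction"

good_reward = unit_test_reward(good_output, tests)
bad_reward = unit_test_reward(bad_output, tests)
```

Because every test is a pass/fail check against known ground truth, the reward can be verified automatically, which is what makes it usable for RLVR-style training.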
[NLP-1] Hubble: a Model Suite to Advance the Study of LLM Memorization
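[Quick Read]: This paper addresses memorization risks in large language models (LLMs), where a model may unintentionally memorize and leak sensitive data seen during training (for example passwords or personal information). The key to the solution is Hubble, a fully open-source model suite with standard and perturbed variants; the perturbed models are trained with controlled insertion of specific text (book passages, biographies, test sets) to emulate real-world memorization risks. The study finds that memorization strength depends on the frequency of sensitive data relative to the size of the training corpus (the same password is memorized more readily in a smaller corpus), and that sensitive data without continued exposure during training can be forgotten. This suggests two best practices: dilute sensitive data by enlarging the training corpus, and place sensitive data earlier in training so that it is more likely to be forgotten. Hubble also provides a testbed for membership inference and machine unlearning research, supporting a more systematic understanding of LLM memorization.
Link: https://arxiv.org/abs/2510.19811
Authors: Johnny Tian-Zheng Wei,Ameya Godbole,Mohammad Aflah Khan,Ryan Wang,Xiaoyuan Zhu,James Flemings,Nitya Kashyap,Krishna P. Gummadi,Willie Neiswanger,Robin Jia
Affiliation: University of Southern California; Max Planck Institute for Software Systems
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: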
Abstract:We present Hubble, a suite of fully open-source large language models (LLMs) for the scientific study of LLM memorization. Hubble models come in standard and perturbed variants: standard models are pretrained on a large English corpus, and perturbed models are trained in the same way but with controlled insertion of text (e.g., book passages, biographies, and test sets) designed to emulate key memorization risks. Our core release includes 8 models – standard and perturbed models with 1B or 8B parameters, pretrained on 100B or 500B tokens – establishing that memorization risks are determined by the frequency of sensitive data relative to size of the training corpus (i.e., a password appearing once in a smaller corpus is memorized better than the same password in a larger corpus). Our release also includes 6 perturbed models with text inserted at different pretraining phases, showing that sensitive data without continued exposure can be forgotten. These findings suggest two best practices for addressing memorization risks: to dilute sensitive data by increasing the size of the training corpus, and to order sensitive data to appear earlier in training. Beyond these general empirical findings, Hubble enables a broad range of memorization research; for example, analyzing the biographies reveals how readily different types of private information are memorized. We also demonstrate that the randomized insertions in Hubble make it an ideal testbed for membership inference and machine unlearning, and invite the community to further explore, benchmark, and build upon our work.
[NLP-2] Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing
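[Quick Read]: This paper addresses the lack of large-scale, high-quality, openly accessible datasets built from real images for text-guided image editing. Existing work is limited by synthetic data, which makes it hard to train and evaluate models on complex editing scenarios. The key to the solution is the Pico-Banana-400K dataset: Nano-Banana is used to generate diverse edit pairs from real photographs in OpenImages, a fine-grained image editing taxonomy ensures broad coverage of edit types, and MLLM-based quality scoring combined with careful curation safeguards content preservation and instruction faithfulness. The dataset also includes multi-turn editing, preference, and paired long-short instruction subsets, supporting research on more complex editing tasks and providing a foundation for training and benchmarking the next generation of text-guided image editing models.
Link: https://arxiv.org/abs/2510.19808
Authors: Yusu Qian,Eli Bocek-Rivele,Liangchen Song,Jialing Tong,Yinfei Yang,Jiasen Lu,Wenze Hu,Zhe Gan
Affiliation: Apple
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: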
Abstract:Recent advances in multimodal models have demonstrated remarkable text-guided image editing capabilities, with systems like GPT-4o and Nano-Banana setting new benchmarks. However, the research community’s progress remains constrained by the absence of large-scale, high-quality, and openly accessible datasets built from real images. We introduce Pico-Banana-400K, a comprehensive 400K-image dataset for instruction-based image editing. Our dataset is constructed by leveraging Nano-Banana to generate diverse edit pairs from real photographs in the OpenImages collection. What distinguishes Pico-Banana-400K from previous synthetic datasets is our systematic approach to quality and diversity. We employ a fine-grained image editing taxonomy to ensure comprehensive coverage of edit types while maintaining precise content preservation and instruction faithfulness through MLLM-based quality scoring and careful curation. Beyond single turn editing, Pico-Banana-400K enables research into complex editing scenarios. The dataset includes three specialized subsets: (1) a 72K-example multi-turn collection for studying sequential editing, reasoning, and planning across consecutive modifications; (2) a 56K-example preference subset for alignment research and reward model training; and (3) paired long-short editing instructions for developing instruction rewriting and summarization capabilities. By providing this large-scale, high-quality, and task-rich resource, Pico-Banana-400K establishes a robust foundation for training and benchmarking the next generation of text-guided image editing models.
[NLP-3] Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning
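[Quick Read]: This paper addresses the "learning cliff" in reinforcement learning for large language models (LLMs): when problems lie far beyond a model's current ability, it receives no useful reward signal and stops improving. The key to the solution is Scaf-GRPO (Scaffolded Group Relative Policy Optimization), a framework that first diagnoses whether learning has stagnated and, only when necessary, injects tiered in-prompt hints (ranging from abstract concepts to concrete steps) so that the model can construct a valid solution on its own and improve progressively.
Link: https://arxiv.org/abs/2510.19807
Authors: Xichen Zhang,Sitong Wu,Yinghao Zhu,Haoru Tan,Shaozuo Yu,Ziyi He,Jiaya Jia
Affiliation: The Hong Kong University of Science and Technology; The Chinese University of Hong Kong; The University of Hong Kong
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code: this https URL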
Abstract: Reinforcement learning from verifiable rewards has emerged as a powerful technique for enhancing the complex reasoning abilities of Large Language Models (LLMs). However, these methods are fundamentally constrained by the “learning cliff” phenomenon: when faced with problems far beyond their current capabilities, models consistently fail, yielding a persistent zero-reward signal. In policy optimization algorithms like GRPO, this collapses the advantage calculation to zero, rendering these difficult problems invisible to the learning gradient and stalling progress. To overcome this, we introduce Scaf-GRPO (Scaffolded Group Relative Policy Optimization), a progressive training framework that strategically provides minimal guidance only when a model’s independent learning has plateaued. The framework first diagnoses learning stagnation and then intervenes by injecting tiered in-prompt hints, ranging from abstract concepts to concrete steps, enabling the model to construct a valid solution by itself. Extensive experiments on challenging mathematics benchmarks demonstrate Scaf-GRPO’s effectiveness, boosting the pass@1 score of the Qwen2.5-Math-7B model on the AIME24 benchmark by a relative 44.3% over a vanilla GRPO baseline. This result demonstrates our framework provides a robust and effective methodology for unlocking a model’s ability to solve problems previously beyond its reach, a critical step towards extending the frontier of autonomous reasoning in LLMs.
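The collapse of the advantage under an all-zero reward can be seen in a few lines. This is a minimal sketch of a commonly used group-relative advantage (reward minus group mean, divided by group standard deviation); the function name and the `eps` constant are assumptions for illustration, and the paper's exact formulation may differ.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A problem far beyond the model: every rollout in the group fails.
cliff = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# A problem within reach: one rollout succeeds.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
```

When every rollout fails, each advantage is exactly zero, so the problem contributes nothing to the policy gradient. A single success in the group is enough to produce non-zero advantages, which is the situation the tiered hints aim to restore.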
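[NLP-4] The Art of Asking: Multilingual Prompt Optimization for Synthetic Data
[Quick Read]: This paper addresses the performance bottleneck that arises when multilingual large language models are trained on synthetic data produced from translation-based prompts. Such prompts inherit English-centric framing and style and neglect cultural dimensions, which limits model generalization. The key to the solution is a lightweight prompt-space optimization framework that systematically transforms translated prompts along three axes, Naturalness, Cultural Adaptation, and Difficulty Enhancement, so that the prompts define a better training distribution and improve cross-lingual performance. Under identical data conditions, the approach clearly outperforms the translation-only baseline and yields consistent gains on multilingual tasks.
Link: https://arxiv.org/abs/2510.19806
Authors: David Mora,Viraat Aryabumi,Wei-Yin Ko,Sara Hooker,Julia Kreutzer,Marzieh Fadaee
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: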
Abstract: Synthetic data has become a cornerstone for scaling large language models, yet its multilingual use remains bottlenecked by translation-based prompts. This strategy inherits English-centric framing and style and neglects cultural dimensions, ultimately constraining model generalization. We argue that the overlooked prompt space, the very inputs that define training distributions, offers a more powerful lever for improving multilingual performance. We introduce a lightweight framework for prompt-space optimization, where translated prompts are systematically transformed for Naturalness, Cultural Adaptation, and Difficulty Enhancement. Using an off-the-shelf multilingual LLM, we apply these transformations to prompts for 12 languages spanning 7 families. Under identical data conditions, our approaches achieve substantial and consistent downstream improvements over the translation-only baseline: +4.7% on Global-MMLU accuracy, +2.4% on Flores XCometXL and +35.3% wins in preferences on mArenaHard. We establish prompt-space optimization as a simple yet powerful paradigm for building multilingual LLMs that are more robust, culturally grounded, and globally capable.
[NLP-5] Blackbox Model Provenance via Palimpsestic Membership Inference
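[Quick Read]: This paper asks the following question: if Alice trains an open-weight language model and Bob uses a blackbox derivative of it to produce text, can Alice prove that Bob is using her model, either by querying Bob's model (query setting) or from the generated text alone (observational setting)? The key to the solution is the phenomenon of palimpsestic memorization: models are more likely to memorize data seen later in training, so test statistics can capture the correlation between Bob's model or text and the order of examples in Alice's training run. If Alice shuffled her training data at random, any significant correlation is quantifiable statistical evidence against the null hypothesis that Bob's model or text is independent of Alice's training run, which makes it possible to detect derived models.
Link: https://arxiv.org/abs/2510.19796
Authors: Rohith Kuditipudi,Jing Huang,Sally Zhu,Diyi Yang,Christopher Potts,Percy Liang
Affiliation: Stanford University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: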
Abstract: Suppose Alice trains an open-weight language model and Bob uses a blackbox derivative of Alice’s model to produce text. Can Alice prove that Bob is using her model, either by querying Bob’s derivative model (query setting) or from the text alone (observational setting)? We formulate this question as an independence testing problem, in which the null hypothesis is that Bob’s model or text is independent of Alice’s randomized training run, and investigate it through the lens of palimpsestic memorization in language models: models are more likely to memorize data seen later in training, so we can test whether Bob is using Alice’s model using test statistics that capture correlation between Bob’s model or text and the ordering of training examples in Alice’s training run. If Alice has randomly shuffled her training data, then any significant correlation amounts to exactly quantifiable statistical evidence against the null hypothesis, regardless of the composition of Alice’s training data. In the query setting, we directly estimate (via prompting) the likelihood Bob’s model gives to Alice’s training examples and order; we correlate the likelihoods of over 40 fine-tunes of various Pythia and OLMo base models ranging from 1B to 12B parameters with the base model’s training data order, achieving a p-value on the order of at most 1e-8 in all but six cases. In the observational setting, we try two approaches based on estimating 1) the likelihood of Bob’s text overlapping with spans of Alice’s training examples and 2) the likelihood of Bob’s text with respect to different versions of Alice’s model we obtain by repeating the last phase (e.g., 1%) of her training run on reshuffled data. The second approach can reliably distinguish Bob’s text from as little as a few hundred tokens; the first does not involve any retraining but requires many more tokens (several hundred thousand) to achieve high power.
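The query-setting test can be illustrated with a small simulation. The sketch below is not the authors' statistic: it uses a Spearman rank correlation with a one-sided permutation test as a stand-in, runs on synthetic log-likelihoods, and assumes there are no tied values. All function names and the data-generating numbers are hypothetical.

```python
import random

def ranks(values):
    """Rank positions 0..n-1 (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order):
        result[i] = rank
    return result

def spearman(x, y):
    rx, ry = ranks(x), ranks(y)
    mid = (len(x) - 1) / 2
    cov = sum((a - mid) * (b - mid) for a, b in zip(rx, ry))
    var = sum((a - mid) ** 2 for a in rx)
    return cov / var

def permutation_p_value(order, loglik, n_perm=1000, seed=0):
    """One-sided test: is log-likelihood positively correlated with training order?"""
    rng = random.Random(seed)
    observed = spearman(order, loglik)
    shuffled = list(loglik)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if spearman(order, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Synthetic stand-in for Bob's model: later training examples get slightly
# higher log-likelihood, plus noise.
data_rng = random.Random(1)
order = list(range(200))
loglik = [-5.0 + 0.01 * i + data_rng.gauss(0, 0.5) for i in order]

rho, p_value = permutation_p_value(order, loglik)
```

The permutation step mirrors the role of Alice's random shuffle: under the null hypothesis, every ordering of her training data is equally likely, so the observed correlation can be compared directly against correlations under reshuffled orders.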
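[NLP-6] ToolDreamer: Instilling LLM Reasoning Into Tool Retrievers
[Quick Read]: This paper addresses the problem that, with large tool sets, large language models (LLMs) cannot fit every tool into their context window. Existing retrieval models rank tools by the semantic similarity between the user query and the tool description (TD), but user requests are often poorly aligned with the language of TDs, which hurts retrieval. The key to the solution is ToolDreamer, a framework in which an LLM generates hypothetical (synthetic) tool descriptions relevant to the query, so that the retriever can align queries and tools more naturally within the language space of TDs. The method improves both sparse and dense retrievers, with and without training, which demonstrates its flexibility. It shifts part of the reasoning burden to the retriever so that the LLM can handle a large tool collection without overflowing its context window.
Link: https://arxiv.org/abs/2510.19791
Authors: Saptarshi Sengupta,Zhengyu Zhou,Jun Araki,Xingbo Wang,Bingqing Wang,Suhang Wang,Zhe Feng
Affiliation: The Pennsylvania State University; Bosch Research North America
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: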
Abstract:Tool calling has become increasingly popular for Large Language Models (LLMs). However, for large tool sets, the resulting tokens would exceed the LLM’s context window limit, making it impossible to include every tool. Hence, an external retriever is used to provide LLMs with the most relevant tools for a query. Existing retrieval models rank tools based on the similarity between a user query and a tool description (TD). This leads to suboptimal retrieval as user requests are often poorly aligned with the language of TD. To remedy the issue, we propose ToolDreamer, a framework to condition retriever models to fetch tools based on hypothetical (synthetic) TD generated using an LLM, i.e., description of tools that the LLM feels will be potentially useful for the query. The framework enables a more natural alignment between queries and tools within the language space of TD’s. We apply ToolDreamer on the ToolRet dataset and show that our method improves the performance of sparse and dense retrievers with and without training, thus showcasing its flexibility. Through our proposed framework, our aim is to offload a portion of the reasoning burden to the retriever so that the LLM may effectively handle a large collection of tools without inundating its context window.
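A toy example shows why matching in the language space of tool descriptions can help. The retriever below is a plain bag-of-words cosine ranker, far simpler than the sparse and dense retrievers in the paper, and the hypothetical tool description is hand-written here as a stand-in for LLM output. The tool names, descriptions, and query are all invented for illustration.

```python
import math
from collections import Counter

def bow(text):
    """Lowercased bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_tools(text, tools):
    """Return tool names sorted by similarity to the given text."""
    q = bow(text)
    return sorted(tools, key=lambda name: cosine(q, bow(tools[name])), reverse=True)

tools = {
    "weather_api": "Returns current temperature and precipitation forecast for a city",
    "currency_api": "Converts an amount between currencies using exchange rates",
}

query = "do i need an umbrella in paris tomorrow"
# Stand-in for an LLM-generated hypothetical tool description for this query.
hypothetical_td = "A tool that returns the precipitation forecast for a city"

by_query = rank_tools(query, tools)
by_hypothetical = rank_tools(hypothetical_td, tools)
```

In this toy setup the raw query shares only an incidental word with the wrong tool, while the hypothetical description overlaps heavily with the right one, so the two rankings differ.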
[NLP-7] Adapting Multilingual Models to Code-Mixed Tasks via Model Merging
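[Quick Read]: This paper addresses the limited effectiveness of conventional adaptation strategies on code-mixed natural language processing (NLP) tasks, especially the performance bottleneck on low-resource language pairs. The core challenge is how to make better use of unlabeled code-mixed text to improve downstream performance. The key to the solution is a model merging recipe: first run continued pre-training (CPT) on unlabeled code-mixed text to obtain an adapted checkpoint, then merge that checkpoint with the original multilingual base model, and finally fine-tune (FT) on the downstream task data. On English-Hindi (En-Hi) and English-Spanish (En-Es) sentence classification, merged models outperform both full fine-tuning (by 2–5 F1 points) and CPT-FT, indicating that merging uses unlabeled data more effectively. Merged models also transfer better across language pairs (trained on En-Hi, tested on En-Ta and En-Ml), suggesting that code-mixed knowledge is a useful substrate for low-resource pairs.
Link: https://arxiv.org/abs/2510.19782
Authors: Prashant Kodali,Vaishnavi Shivkumar,Swarang Joshi,Monojit Choudhary,Ponnurangam Kumaraguru,Manish Shrivastava
Affiliation: IIIT Hyderabad; MBZUAI
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 5 tables, CODS 2025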
Abstract:We study model merging as a practical alternative to conventional adaptation strategies for code-mixed NLP. Starting from a multilingual base model, we: (i) perform continued pre-training (CPT) on unlabeled code-mixed text to obtain an adapted checkpoint, (ii) merge checkpoint with the base model, and (iii) fine-tune (FT) on the downstream task data. We evaluate our approach for sentence classification (sentiment and hate speech) task in English-Hindi (En-Hi) and English-Spanish (En-Es) using XLM-R and Llama-3.2-1B models. Our results show that merged models consistently outperform full fine-tuning and CPT-FT. We observe gains of 2–5 points in F1 over full fine-tuning and ~1-2 points over CPT-FT, indicating that unlabeled data is leveraged more effectively via merging than via CPT alone. Zero-/few-shot prompting with larger LLMs (e.g., Llama-3.3-70B) lags behind fine-tuned and merged checkpoints, underscoring limits of in-context learning for code-mixed inputs. We further test cross-pair transfer by training on En-Hi and evaluating on En-Ta and En-Ml: merged checkpoints transfer more strongly than monolingual-English baselines (e.g., TV/TIES variants reaching 0.65-0.68 F1 vs 0.61-0.63 for full fine-tuning), suggesting that code-mixed knowledge is a more reliable substrate for low-resource pairs. We conclude with adaptation recipes matched to common data regimes (labeled only; labeled+unlabeled; transfer-only) and discuss limitations and scaling considerations for broader tasks and larger models.
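Step (ii), merging the adapted checkpoint with the base model, can be sketched with a simple task-vector style interpolation. This assumes that the "TV" variants mentioned in the abstract refer to task vectors, uses plain Python lists in place of real tensors, and picks `alpha` arbitrarily; TIES-style merging adds trimming and sign resolution that are not shown here.

```python
def merge_task_vector(base, adapted, alpha=0.5):
    """merged = base + alpha * (adapted - base), applied per parameter."""
    return {
        name: [b + alpha * (a - b) for b, a in zip(base[name], adapted[name])]
        for name in base
    }

# Toy checkpoints: parameter name -> flat list of weights.
base_ckpt = {"layer.weight": [1.0, 2.0, -1.0]}
cpt_ckpt = {"layer.weight": [3.0, 2.0, -2.0]}  # after continued pre-training

merged_ckpt = merge_task_vector(base_ckpt, cpt_ckpt, alpha=0.5)
```

With `alpha=0` the result is the base model and with `alpha=1` it is the CPT checkpoint; values in between keep part of the code-mixed adaptation while staying close to the base weights before fine-tuning.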
[NLP-8] AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders
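[Quick Read]: This paper addresses low token acceptance rates in speculative decoding (SD) caused by insufficient alignment between the small draft model and the large target model. Conventional knowledge distillation (KD) improves alignment by minimizing the KL divergence over all tokens, but this objective is misaligned with the actual goal of SD, which is to maximize the token acceptance rate, so capacity-limited draft models struggle to absorb the target model's knowledge. The key to the solution is AdaSPEC, which adds selective token filtering to distillation: a reference model identifies and filters out difficult-to-fit tokens so that the draft model concentrates on matching the target model on simpler tokens, raising the overall acceptance rate without compromising generation quality.
Link: https://arxiv.org/abs/2510.19779
Authors: Yuezhou Hu,Jiaxin Guo,Xinyu Feng,Tuo Zhao
Affiliation: University of California, Berkeley; Tsinghua University; Georgia Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: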
Abstract:Speculative Decoding (SD) accelerates large language model inference by employing a small draft model to generate predictions, which are then verified by a larger target model. The effectiveness of SD hinges on the alignment between these models, which is typically enhanced by Knowledge Distillation (KD). However, conventional KD methods aim to minimize the KL divergence between the draft and target models across all tokens, a goal that is misaligned with the true objective of SD, which is to maximize token acceptance rate. Therefore, draft models often struggle to fully assimilate the target model’s knowledge due to capacity constraints, leading to suboptimal performance. To address this challenge, we propose AdaSPEC, a novel method that incorporates selective token filtering into the KD process. AdaSPEC utilizes a reference model to identify and filter out difficult-to-fit tokens, enabling the distillation of a draft model that better aligns with the target model on simpler tokens. This approach improves the overall token acceptance rate without compromising generation quality. We evaluate AdaSPEC across diverse tasks, including arithmetic reasoning, instruction-following, coding, and summarization, using model configurations of 31M/1.4B and 350M/2.7B parameters. Our results demonstrate that AdaSPEC consistently outperforms the state-of-the-art DistillSpec method, achieving higher acceptance rates across all tasks (up to 15%). The code is publicly available at this https URL.
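To see why acceptance rate rather than KL divergence is the quantity that matters, recall the standard speculative sampling rule: a token x drawn from the draft distribution q is accepted with probability min(1, p(x)/q(x)), where p is the target distribution. The expected acceptance probability at one position is therefore the sum over tokens of min(p(x), q(x)). The sketch below computes that quantity for toy distributions; it illustrates the general SD mechanism and is not AdaSPEC's filtering procedure.

```python
def expected_acceptance(p_target, q_draft):
    """Expected acceptance probability at one position: sum_x min(p(x), q(x))."""
    return sum(min(p, q) for p, q in zip(p_target, q_draft))

p = [0.7, 0.2, 0.1]               # target model's next-token distribution
q_close = [0.6, 0.3, 0.1]         # well-aligned draft
q_far = [0.1, 0.2, 0.7]           # poorly aligned draft

rate_close = expected_acceptance(p, q_close)
rate_far = expected_acceptance(p, q_far)
rate_same = expected_acceptance(p, p)
```

A draft that matches the target exactly is always accepted, and the rate drops as the two distributions diverge. Under this view, spending limited draft capacity on tokens it can actually fit is a plausible way to raise the average rate.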
[NLP-9] GaLLoP: Gradient-based Sparse Learning on Low-Magnitude Parameters
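[Quick Read]: This paper addresses the performance bottleneck caused by poor parameter selection in sparse fine-tuning, that is, how to adapt efficiently to downstream tasks while preserving pre-trained knowledge. The key to the solution is GaLLoP (Gradient-based Sparse Learning on Low-Magnitude Parameters), which fine-tunes only the sparse set of parameters that have the largest gradient magnitudes on the downstream task and the smallest pre-trained magnitudes. This prioritizes parameters that are highly task-relevant yet minimally disruptive to pre-trained knowledge, mitigating catastrophic forgetting and memorization of task data while improving robustness and stability on both in-distribution and out-of-distribution tasks.
Link: https://arxiv.org/abs/2510.19778
Authors: Anand Choudhary,Yasser Sulaıman,Lukas Mauch,Ghouthi Boukli Hacene,Fabien Cardinaux,Antoine Bosselut
Affiliation: Sony Europe Ltd., Stuttgart Technology Center, EUREC; EPFL, Switzerland; University of Stuttgart, Germany
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: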
Abstract:Sparse fine-tuning techniques adapt LLMs to downstream tasks by only tuning a sparse subset of model parameters. However, the effectiveness of sparse adaptation depends on optimally selecting the model parameters to be fine-tuned. In this work, we introduce a novel sparse fine-tuning technique named GaLLoP: Gradient-based Sparse Learning on Low-Magnitude Parameters, which fine-tunes only those model parameters which have the largest gradient magnitudes on downstream tasks and the smallest pre-trained magnitudes, intuitively prioritizing parameters that are highly task-relevant, but minimally disruptive to pre-trained knowledge. Our experimentation with LLaMA3 8B and Gemma 2B as base models shows that GaLLoP consistently improves or matches the in-distribution as well as out-of-distribution performance obtained via the usage of other leading parameter-efficient fine-tuning techniques, including LoRA, DoRA, and SAFT. Our analysis demonstrates that GaLLoP mitigates catastrophic forgetting and memorization of task data, as important pre-trained parameters remain unchanged, and stabilizes performance relative to other fine-tuning techniques, robustly generalizing across most random seeds.
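The selection criterion (large downstream gradient, small pre-trained magnitude) can be sketched as a top-k mask. The abstract does not say how the two criteria are combined, so the ratio score |g| / (|w| + eps) used below is purely an assumption for illustration, as are the function name and the toy numbers.

```python
def select_sparse_mask(weights, grads, k, eps=1e-8):
    """Pick k parameters with high gradient magnitude and low weight magnitude.

    The ratio score is a hypothetical way to combine the two criteria.
    """
    scores = [abs(g) / (abs(w) + eps) for w, g in zip(weights, grads)]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    chosen = set(top)
    return [i in chosen for i in range(len(weights))]

pretrained = [0.01, 2.0, 0.02, 1.5]    # pre-trained weights
task_grads = [0.5, 0.6, 0.01, 0.001]   # gradients on the downstream task

mask = select_sparse_mask(pretrained, task_grads, k=1)
```

Here the second parameter has the largest gradient, but its large pre-trained magnitude suggests it carries existing knowledge, so the first parameter (large gradient, tiny weight) is selected instead. Only masked parameters would be updated during fine-tuning.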
[NLP-10] SmartSwitch: Advancing LLM Reasoning by Overcoming Underthinking via Promoting Deeper Thought Exploration
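[Quick Read]: This paper addresses "underthinking" in large language models on complex reasoning tasks: the model switches between lines of thought too often without exploring promising ones in depth, which hurts both accuracy and token efficiency. The key to the solution is the SmartSwitch inference framework, which has two modules. The perception module identifies points where thoughts switch and uses an off-the-shelf process reward model (PRM) to evaluate the potential of the preceding thought. If a high-potential thought was abandoned prematurely, the intervention module interrupts the ongoing inference, backtracks to the point before the switch, and inserts a "deepening prompt" that steers the model toward further exploration of that path. The method is plug-and-play for any large language model and significantly improves mathematical reasoning.
Link: https://arxiv.org/abs/2510.19767
Authors: Xichen Zhang,Sitong Wu,Haoru Tan,Shaozuo Yu,Yinghao Zhu,Ziyi He,Jiaya Jia
Affiliation: The Hong Kong University of Science and Technology; The Chinese University of Hong Kong; The University of Hong Kong
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code: this https URL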
Abstract: The long chain-of-thought (LongCoT) capability is central to the recent breakthroughs achieved by large language models in complex reasoning tasks. However, the accompanying issue of “underthinking”, where models exhibit shallow reasoning by frequently switching thoughts without sufficient exploration, limits both performance and token efficiency. To address this problem, we propose a simple yet effective reasoning strategy: the SmartSwitch inference framework. This framework can be easily integrated into any large language model as a plug-and-play solution, continuously monitoring the model’s reasoning process to detect underthinking and guide it toward deeper exploration of promising but overlooked thoughts. Specifically, the perception module identifies points where thoughts switch and evaluates the potential of the preceding thought using an off-the-shelf process reward model (PRM). If a high-potential thought is found to be prematurely abandoned, the intervention module interrupts the ongoing inference, backtracks to the point before the switch, and inserts a “deepening prompt” to encourage further exploration along that promising path. Extensive experiments on challenging mathematical reasoning benchmarks demonstrate that our method significantly enhances the performance of various large language models of different sizes.
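The perception and intervention steps can be sketched as a single check over the thoughts generated so far. Everything concrete here is hypothetical: the switch markers, the wording of the deepening prompt, the threshold, and the toy scoring function that stands in for a real process reward model.

```python
SWITCH_MARKERS = ("Alternatively", "Wait")  # hypothetical cues for a thought switch
DEEPENING_PROMPT = "Let me explore this idea more deeply before switching."  # hypothetical

def smart_switch_step(thoughts, prm_score, threshold=0.7):
    """Return (possibly rewritten thoughts, whether an intervention happened)."""
    # Perception: did the latest thought start a switch?
    if len(thoughts) < 2 or not thoughts[-1].startswith(SWITCH_MARKERS):
        return thoughts, False
    # Perception: was the abandoned thought promising according to the PRM?
    if prm_score(thoughts[-2]) < threshold:
        return thoughts, False
    # Intervention: backtrack to before the switch and insert the deepening prompt.
    return thoughts[:-1] + [DEEPENING_PROMPT], True

def toy_prm(thought):
    """Stand-in scorer; a real system would call a trained PRM."""
    return 0.9 if "substitution" in thought else 0.1

trace = [
    "Try the substitution u = x^2 to simplify the integral.",
    "Alternatively, expand everything by brute force.",
]
new_trace, intervened = smart_switch_step(trace, toy_prm)
```

In a real decoding loop the model would then continue generating from `new_trace`, so the premature switch is replaced by further work on the promising thought.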
[NLP-11] Zhyper: Factorized Hypernetworks for Conditioned LLM Fine-Tuning
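[Quick Read]: This paper addresses the controllability of large language models (LLMs) when they are asked to generate content that follows specific cultural norms, political orientations, or other text-specified semantic conditions. Prompt engineering alone cannot guarantee the desired behavior, mainly because of the inductive bias of the pre-training and alignment datasets. The authors propose Zhyper, a parameter-efficient factorized hypernetwork framework that generates context-aware LoRA adapters from textual descriptions, enabling efficient conditioning of LLMs. Compared with existing methods, Zhyper achieves competitive performance with up to 26x fewer parameters and shows better out-of-domain generalization and better capture of fine-grained contextual values.
Link: https://arxiv.org/abs/2510.19733
Authors: M. H. I. Abdalla,Zhipin Wang,Christian Frey,Steffen Eger,Josif Grabocka
Affiliation: University of Technology Nuremberg
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: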
Abstract:Large Language Model (LLM) conditioning refers to instructing an LLM to generate content in accordance with the norms and values of a specific culture, beliefs of a particular political orientation, or any desired text-specified semantic conditioning. Unfortunately, prompt engineering does not ensure that LLMs behave in accordance with a desired conditioning due to the inductive bias of the pre-training and alignment datasets. Prior works have focused on fine-tuning LLMs by directly conditioning the LoRA weights; however, such methods introduce a large number of parameters. As a remedy, we propose Zhyper, a parameter-efficient factorized hypernetwork framework that generates context-aware LoRA adapters from textual descriptions. Experiments on multiple benchmarks show that Zhyper achieves competitive performance with up to 26x fewer parameters than the state-of-the-art baselines. Furthermore, we extend Zhyper to cultural alignment, demonstrating improved generalization to out-of-domain settings and a better capturing of fine-grained contextual values.
[NLP-12] From Answers to Guidance: A Proactive Dialogue System for Legal Documents
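[Quick Read]: This paper addresses the accessibility problem that ordinary citizens face when trying to understand and apply complex EU legal texts. Although the European Union provides open access to legislation, parliamentary responses, and regulatory documents, these resources remain hard for non-specialists to explore. The key to the solution is the EUDial dataset and the LexGuide framework. EUDial is a multi-turn dialogue dataset with 880 dialogue turns covering initial questions, structured answers, and follow-up questions. LexGuide uses retrieval-augmented generation (RAG) with hierarchical topic organization to structure dialogue progression, ensuring comprehensive coverage of legal aspects and coherence across turns. Together they enable proactive, structured navigation of legal information and narrow the gap between the availability of legal information and citizens' comprehension.
Link: https://arxiv.org/abs/2510.19723
Authors: Ashish Chouhan,Michael Gertz
Affiliation: Heidelberg University
Categories: Computation and Language (cs.CL)
Comments: 21 pages, 3 figures, 2 tables, 2 prompts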
Abstract:The accessibility of legal information remains a constant challenge, particularly for laypersons seeking to understand and apply complex institutional texts. While the European Union provides open access to legislation, parliamentary responses, and regulatory documents, these resources can be challenging for laypeople to explore. In this paper, we introduce EUDial, a proactive multi-turn dialogue dataset constructed from 204 blogs curated by the Citizens’ Enquiries Unit (AskEP) of the European Parliamentary Research Service. EUDial contains 880 dialogue turns (averaging 4.3 turns per dialogue), where each dialogue includes initial questions, structured answers, and follow-up questions. Beyond dataset construction, we propose the LexGuide framework that leverages retrieval-augmented generation with hierarchical topic organization to structure dialogue progression, ensuring both comprehensive coverage of legal aspects and coherence across conversational turns. The results demonstrate that proactive, structured navigation closes the gap between the availability of legal information and citizen comprehension, establishing EUDial and LexGuide as practical resources for advancing proactive legal dialogue systems.
[NLP-13] Do Prompts Reshape Representations? An Empirical Study of Prompting Effects on Embeddings
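[Quick Read]: This paper addresses the unclear relationship between the quality of internal representations and the relevance of the prompt when pre-trained language models (LMs) perform diverse tasks through prompting in zero-shot settings. The key to the approach is a series of probing experiments on prompt embeddings produced by different prompt templates. The analysis shows that prompting does affect representation quality, but the changes do not consistently track how relevant the prompt is to the target task. This challenges the assumption that more relevant prompts necessarily lead to better representations, and the paper goes on to examine factors that may explain this behavior.
Link: https://arxiv.org/abs/2510.19694
Authors: Cesar Gonzalez-Gutierrez,Dirk Hovy
Affiliation: Polytechnic University of Catalonia; Bocconi University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: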
Abstract:Prompting is a common approach for leveraging LMs in zero-shot settings. However, the underlying mechanisms that enable LMs to perform diverse tasks without task-specific supervision remain poorly understood. Studying the relationship between prompting and the quality of internal representations can shed light on how pre-trained embeddings may support in-context task solving. In this empirical study, we conduct a series of probing experiments on prompt embeddings, analyzing various combinations of prompt templates for zero-shot classification. Our findings show that while prompting affects the quality of representations, these changes do not consistently correlate with the relevance of the prompts to the target task. This result challenges the assumption that more relevant prompts necessarily lead to better representations. We further analyze potential factors that may contribute to this unexpected behavior.
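[NLP-14] Are Large Language Models Sensitive to the Motives Behind Communication? (NeurIPS 2025)
[Quick Read]: This paper asks whether large language models (LLMs) are sensitive to the motives of an information source, that is, whether they can adjust how much they trust a statement based on the speaker's intentions and incentives, as people do, and thereby reason more soundly in the real world. The paper tests this motivational vigilance with two kinds of experiments. First, controlled experiments from cognitive science show that LLMs discount information from biased sources in a human-like manner. Second, in the more naturalistic setting of sponsored online adverts, LLM inferences deviate from the rational model, but a simple steering intervention that boosts the salience of intentions and incentives substantially increases their agreement with it. The results suggest that LLMs have a basic sensitivity to motives, but reliable use in complex real-world settings will require further improvement in how they handle motive-related cues in context.
Link: https://arxiv.org/abs/2510.19687
Authors: Addison J. Wu,Ryan Liu,Kerem Oktar,Theodore R. Sumers,Thomas L. Griffiths
Affiliation: Princeton University; Anthropic
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: NeurIPS 2025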
Abstract:Human communication is motivated: people speak, write, and create content with a particular communicative intent in mind. As a result, information that large language models (LLMs) and AI agents process is inherently framed by humans’ intentions and incentives. People are adept at navigating such nuanced information: we routinely identify benevolent or self-serving motives in order to decide what statements to trust. For LLMs to be effective in the real world, they too must critically evaluate content by factoring in the motivations of the source – for instance, weighing the credibility of claims made in a sales pitch. In this paper, we undertake a comprehensive study of whether LLMs have this capacity for motivational vigilance. We first employ controlled experiments from cognitive science to verify that LLMs’ behavior is consistent with rational models of learning from motivated testimony, and find they successfully discount information from biased sources in a human-like manner. We then extend our evaluation to sponsored online adverts, a more naturalistic reflection of LLM agents’ information ecosystems. In these settings, we find that LLMs’ inferences do not track the rational models’ predictions nearly as closely – partly due to additional information that distracts them from vigilance-relevant considerations. However, a simple steering intervention that boosts the salience of intentions and incentives substantially increases the correspondence between LLMs and the rational model. These results suggest that LLMs possess a basic sensitivity to the motivations of others, but generalizing to novel real-world settings will require further improvements to these models.
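[NLP-15] CoSense-LLM: Semantics at the Edge with Cost- and Uncertainty-Aware Cloud-Edge Cooperation
[Quick Read]: This paper addresses the trade-offs among latency, energy, bandwidth, and privacy when deploying large language models (LLMs) in interference-prone environments. The solution is CoSense-LLM, an edge-first framework with four cooperating modules: (i) SenseFusion, a lightweight encoder that aligns multimodal sensor streams (such as Wi-Fi channel state information and inertial measurement units) with language and compresses them into verifiable discrete semantic code sequences; (ii) Edge-RAG, a local hybrid retrieval layer that grounds generation in site-specific policies and notes; (iii) PromptRouter, which uses cost and uncertainty to choose among edge-only generation, edge plus retrieval, and compact cloud escalation; and (iv) Secure Execution, an auditable redaction path that enforces data minimization so that raw waveforms never leave the device. Across home, office, and clinic deployments, the system sustains sub-second (p95) end-to-end latency, reduces inter-tier communication costs, and preserves privacy, while supporting on-device personalization and federated updates under non-IID drift.
Link: https://arxiv.org/abs/2510.19670
Authors: Hasan Akgul,Mari Eplik,Javier Rojas,Aina Binti Abdullah,Pieter van der Merwe
Affiliation: Istanbul Technical University; University of Tartu; University of Chile; Universiti Teknologi Malaysia; Stellenbosch University
Categories: Computation and Language (cs.CL)
Comments: 19 pages, 8 figures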
点击查看摘要
Abstract:We present CoSense-LLM, an edge-first framework that turns continuous multimodal sensor streams (for example Wi-Fi CSI, IMU, audio, RFID, and lightweight vision) into compact, verifiable semantic tokens and coordinates with large language models under explicit latency, energy, bandwidth, and privacy constraints. CoSense-LLM has four parts: (i) SenseFusion, a lightweight encoder that aligns sensor embeddings with language and compresses them into short discrete code sequences; (ii) Edge-RAG, a local hybrid retrieval layer that grounds generation in site specific policies and notes; (iii) PromptRouter, a cost and uncertainty aware policy that selects edge only generation, edge plus retrieval, or compact cloud escalation; and (iv) Secure Execution, an auditable redaction path that enforces data minimization so raw waveforms never leave the device. The system works with modern serving optimizations, including paged or streaming KV caches, FlashAttention style kernels, speculative decoding, and quantized LoRA adapters, and supports on device personalization and federated updates under non IID drift. Across home, office, and clinic deployments, CoSense-LLM delivers grounded explanations while meeting tight service level objectives: it sustains sub second (p95) end to end latency on edge dominant paths, reduces inter tier token and bandwidth costs by preferring local retrieval grounded responses, and preserves privacy by transmitting only discrete codes and redacted metadata. Ablations show that Edge-RAG improves factual consistency and reduces contradictions, calibrated uncertainty enables selective abstention and controlled escalations, and KV plus decoding accelerators lower energy per decision. The results support an edge first design that treats semantics, privacy, and predictable latency as co equal goals for large model deployments in interference prone environments.
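To make the PromptRouter idea concrete, here is a minimal sketch of a cost- and uncertainty-aware routing rule. It is not the authors' implementation: the route names, relative costs, thresholds `low` and `high`, and the `budget` parameter are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str
    cost: float  # hypothetical relative cost (latency, energy, bandwidth combined)


# Three tiers named after the options in the abstract, in ascending cost.
# The cost numbers are illustrative assumptions, not measurements.
ROUTES = [
    Route("edge_only", 1.0),
    Route("edge_plus_retrieval", 2.0),
    Route("cloud_escalation", 10.0),
]


def route(uncertainty: float, budget: float,
          low: float = 0.2, high: float = 0.6) -> Route:
    """Pick a tier from the uncertainty, then step down until it fits the budget."""
    if uncertainty < low:
        tier = 0
    elif uncertainty < high:
        tier = 1
    else:
        tier = 2
    # Fall back to cheaper tiers when the preferred one exceeds the budget.
    while tier > 0 and ROUTES[tier].cost > budget:
        tier -= 1
    return ROUTES[tier]


if __name__ == "__main__":
    print(route(uncertainty=0.9, budget=5.0).name)
```

In this sketch a highly uncertain query that cannot afford cloud escalation falls back to edge plus retrieval; a real policy would presumably calibrate the uncertainty estimate and may abstain rather than fall back.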
[NLP-16] DiffAdapt: Difficulty-Adaptive Reasoning for Token-Efficient LLM Inference
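[Quick read]: This paper targets "overthinking" in reasoning large language models (LLMs): on simple problems, models generate long, inefficient reasoning traces that consume compute for little gain. The core solution is DiffAdapt, a lightweight framework motivated by an analysis of the entropy of token probabilities in reasoning traces. It estimates the difficulty of each question and selects a matching inference strategy (Easy/Normal/Hard), where each strategy consists of a fixed prompt, temperature, and maximum token length. The method does not fine-tune the base LLM; instead it trains a small probe that classifies the LLM's final hidden state, which keeps adaptation inexpensive. Across multiple models and benchmarks, it achieves comparable or better accuracy while reducing token usage by up to 22.4%.
Link: https://arxiv.org/abs/2510.19669
Authors: Xiang Liu, Xuming Hu, Xiaowen Chu, Eunsol Choi
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); New York University
Categories: Computation and Language (cs.CL)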
Abstract:Recent reasoning Large Language Models (LLMs) demonstrate remarkable problem-solving abilities but often generate long thinking traces whose utility is unclear. Our work aims to improve their efficiency, enabling them to reach high performance without overthinking. First, we analyze the entropy of token probabilities in reasoning traces. Across three models, we observe a consistent U-shaped entropy pattern: high entropy on easy problems despite high accuracy, low entropy on problems with medium difficulty, and high entropy on hard problems reflecting uncertainty. Specifically, we notice 22–25% entropy reduction from easy to medium difficulty regions, suggesting an overthinking phenomenon on easy instances. Building on these insights, we introduce DiffAdapt, a lightweight framework that selects Easy/Normal/Hard inference strategies per question based on their difficulty and reasoning trace entropy. Each inference strategy consists of a fixed prompt, temperature and maximum token length. In contrast to existing efficiency optimization methods, our approach does not fine-tune the base LLM but a small probe that classifies the LLM’s final hidden state, allowing inexpensive adaptation. We comprehensively evaluate our method on five models and eight benchmarks. Our method achieves comparable or improved accuracy while reducing token usage by up to 22.4%, establishing a practical path toward compute-efficient reasoning.
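The two ingredients named in the abstract, the entropy of token probabilities and per-difficulty inference strategies, can be sketched as follows. This is an illustrative sketch only: the prompts, temperatures, and token limits in `STRATEGIES` are hypothetical, and the difficulty label is passed in directly, whereas the paper obtains it from a trained probe on the model's final hidden state.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_trace_entropy(trace):
    """Average entropy over all token positions of a reasoning trace."""
    return sum(token_entropy(p) for p in trace) / len(trace)


# Each strategy is a fixed prompt, temperature, and maximum token length.
# All values below are hypothetical placeholders, not the paper's settings.
STRATEGIES = {
    "easy": {"prompt": "Answer briefly.", "temperature": 0.3, "max_tokens": 512},
    "normal": {"prompt": "Think step by step.", "temperature": 0.6, "max_tokens": 2048},
    "hard": {"prompt": "Reason carefully and verify.", "temperature": 0.8, "max_tokens": 8192},
}


def pick_strategy(difficulty: str) -> dict:
    """Look up the inference strategy for a predicted difficulty label."""
    return STRATEGIES[difficulty]


if __name__ == "__main__":
    trace = [[0.5, 0.5], [1.0]]  # toy distributions over a two-token vocabulary
    print(mean_trace_entropy(trace), pick_strategy("easy"))
```

A uniform two-way distribution has entropy ln 2 and a one-hot distribution has entropy 0, so the toy trace averages to ln 2 / 2.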
[NLP-17] Unraveling Emotions with Pre-Trained Models
[Quick read]: This paper addresses the challenges that large language models (LLMs) face in recognizing emotions in open text, including contextual ambiguity, linguistic variability, and difficulty interpreting complex emotional expressions. It compares two strategies: fine-tuning pre-trained models, which reaches metrics above 70%, and structured prompt engineering combined with emotion grouping, which improves emotion detection by LLMs across scenarios. The results indicate that simple prompts do not bring out the potential of LLMs, and that systematic prompt design and grouping of emotion categories are the main factors in improving their emotion recognition in open text.
Link: https://arxiv.org/abs/2510.19668
Authors: Alejandro Pajón-Sanmartín, Francisco De Arriba-Pérez, Silvia García-Méndez, Fátima Leal, Benedita Malheiro, Juan Carlos Burguillo-Rial
Affiliations: Information Technologies Group, atlanTTic, University of Vigo; Research on Economics, Management and Information Technologies, Universidade Portucalense; ISEP, Polytechnic of Porto; Institute for Systems and Computer Engineering, Technology and Science
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract:Transformer models have significantly advanced the field of emotion recognition. However, there are still open challenges when exploring open-ended queries for Large Language Models (LLMs). Although current models offer good results, automatic emotion analysis in open texts presents significant challenges, such as contextual ambiguity, linguistic variability, and difficulty interpreting complex emotional expressions. These limitations make the direct application of generalist models difficult. Accordingly, this work compares the effectiveness of fine-tuning and prompt engineering in emotion detection in three distinct scenarios: (i) performance of fine-tuned pre-trained models and general-purpose LLMs using simple prompts; (ii) effectiveness of different emotion prompt designs with LLMs; and (iii) impact of emotion grouping techniques on these models. Experimental tests attain metrics above 70% with a fine-tuned pre-trained model for emotion recognition. Moreover, the findings highlight that LLMs require structured prompt engineering and emotion grouping to enhance their performance. These advancements improve sentiment analysis, human-computer interaction, and understanding of user behavior across various domains.
[NLP-18] From Forecasting to Planning: Policy World Model for Collaborative State-Action Prediction
[Quick read]: This paper addresses the disconnect between driving world models and trajectory planning: existing methods mostly use world modeling for environment simulation without effectively benefiting the planning module. To optimize the two jointly, the authors propose a new driving paradigm, Policy World Model (PWM). Its key element is an action-free future state forecasting scheme, which lets planning draw directly on learned world knowledge. PWM also introduces a dynamically enhanced parallel token generation mechanism, equipped with a context-guided tokenizer and an adaptive dynamic focal loss, to make video forecasting more efficient. Using only front camera input, the method matches or exceeds state-of-the-art approaches that rely on multi-view and multi-modal inputs.
Link: https://arxiv.org/abs/2510.19654
Authors: Zhida Zhao, Talas Fu, Yifan Wang, Lijun Wang, Huchuan Lu
Affiliations: Dalian University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Comments: Accepted by NeurIPS 2025 (Poster)
Abstract:Despite remarkable progress in driving world models, their potential for autonomous systems remains largely untapped: the world models are mostly learned for world simulation and decoupled from trajectory planning. While recent efforts aim to unify world modeling and planning in a single framework, the synergistic facilitation mechanism of world modeling for planning still requires further exploration. In this work, we introduce a new driving paradigm named Policy World Model (PWM), which not only integrates world modeling and trajectory planning within a unified architecture, but is also able to benefit planning using the learned world knowledge through the proposed action-free future state forecasting scheme. Through collaborative state-action prediction, PWM can mimic the human-like anticipatory perception, yielding more reliable planning performance. To facilitate the efficiency of video forecasting, we further introduce a dynamically enhanced parallel token generation mechanism, equipped with a context-guided tokenizer and an adaptive dynamic focal loss. Despite utilizing only front camera input, our method matches or exceeds state-of-the-art approaches that rely on multi-view and multi-modal inputs. Code and model weights will be released at this https URL.
[NLP-19] LLavaCode: Compressed Code Representations for Retrieval-Augmented Code Generation
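[Quick read]: This paper addresses a problem with retrieval-augmented generation (RAG) for code completion: adding retrieved context greatly increases sequence length and slows inference, which hurts the user experience in interactive settings such as integrated development environments (IDEs). The key to the solution is the LlavaCode framework, which uses a small projector module to compress code context into compact, semantically rich representations, so that a few compressed single-token vectors stand in for the full retrieved content. This improves generation quality (the EM and ES metrics) with a negligible latency increase; experiments show a 20-38% reduction in Time-to-First-Token (TTFT) on line completion tasks compared with full-RAG pipelines.
Link: https://arxiv.org/abs/2510.19644
Authors: Daria Cherniuk, Nikita Sukhorukov, Nikita Sushko, Daniil Gusak, Danil Sivtsov, Elena Tutubalina, Evgeny Frolov
Affiliations: Personalization Technologies; AIC; HSE University
Categories: Computation and Language (cs.CL)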
Abstract:Retrieval-augmented generation has emerged as one of the most effective approaches for code completion, particularly when context from a surrounding repository is essential. However, incorporating context significantly extends sequence length, leading to slower inference - a critical limitation for interactive settings such as IDEs. In this work, we introduce LlavaCode, a framework that compresses code into compact, semantically rich representations interpretable by code LLM, enhancing generation quality while reducing the retrieved context to only a few compressed single-token vectors. Using a small projector module we can significantly increase the EM and ES metrics of coding model with negligible latency increase. Our experiments demonstrate that compressed context enables 20-38% reduction in Time-to-First-Token (TTFT) on line completion tasks compared to full-RAG pipelines.
[NLP-20] Style Attack Disguise: When Fonts Become a Camouflage for Adversarial Intent
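[Quick read]: This paper addresses a security vulnerability in natural language processing (NLP) models that arises from the stylistic fonts and font-like emoji users employ on social media. Such text is visually appealing and human-readable, but models process its characters as distinct tokens, which causes interference and opens a perception gap between humans and models. The key to the solution is a style-based attack, Style Attack Disguise (SAD), which comes in a light version for query efficiency and a strong version for better attack performance. Experiments on sentiment classification and machine translation show strong attack performance against traditional models, LLMs, and commercial services, and the paper also shows potential threats to multimodal tasks such as text-to-image and text-to-speech generation.
Link: https://arxiv.org/abs/2510.19641
Authors: Yangshijie Zhang, Xinda Wang, Jialin Liu, Wenqiang Wang, Zhicong Ma, Xingxing Jia
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)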
Abstract:With social media growth, users employ stylistic fonts and font-like emoji to express individuality, creating visually appealing text that remains human-readable. However, these fonts introduce hidden vulnerabilities in NLP models: while humans easily read stylistic text, models process these characters as distinct tokens, causing interference. We identify this human-model perception gap and propose a style-based attack, Style Attack Disguise (SAD). We design two sizes: light for query efficiency and strong for superior attack performance. Experiments on sentiment classification and machine translation across traditional models, LLMs, and commercial services demonstrate SAD’s strong attack performance. We also show SAD’s potential threats to multimodal tasks including text-to-image and text-to-speech generation.
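The human-model perception gap can be illustrated with a few lines of Python. This sketch is not the SAD attack itself: it only shows that a styled string which a person reads as ordinary text consists of entirely different Unicode code points. The helper `to_math_bold` is a hypothetical name, and the use of NFKC normalization as a way to fold the styled text back is an assumption added here, not something the abstract proposes.

```python
import unicodedata


def to_math_bold(text: str) -> str:
    """Map ASCII letters to the Unicode Mathematical Bold letters (from U+1D400)."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(0x1D400 + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(0x1D41A + ord(ch) - ord("a")))
        else:
            out.append(ch)  # digits, spaces, punctuation are left unchanged
    return "".join(out)


plain = "great movie"
styled = to_math_bold(plain)

# Same number of characters, but different code points and a longer byte encoding,
# so a model's tokenizer sees a different input than the human reader does.
same_length = len(styled) == len(plain)
byte_lengths = (len(plain.encode("utf-8")), len(styled.encode("utf-8")))

# Compatibility normalization maps the styled letters back to ASCII.
recovered = unicodedata.normalize("NFKC", styled)

if __name__ == "__main__":
    print(styled, same_length, byte_lengths, recovered == plain)
```

Normalization of this kind covers only characters with compatibility mappings; font-like emoji and other look-alike symbols would need separate handling.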
[NLP-21] HSCodeComp: A Realistic and Expert-level Benchmark for Deep Search Agents in Hierarchical Rule Application
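[Quick read]: This paper addresses the weakness of current deep search agents in applying hierarchical rules with vague boundaries and implicit logical relationships (such as customs tariff rules and medical manuals), a capability largely overlooked by existing agent benchmarks. The key to the solution is HSCodeComp, the first realistic, expert-level e-commerce benchmark for evaluating how well agents predict 10-digit Harmonized System Codes (HSCodes) of products under complex rules. Built from real data from large-scale e-commerce platforms, it contains 632 product entries annotated by several human experts. It exposes a large performance gap in hierarchical rule application: the best agent reaches only 46.8% 10-digit accuracy, far below the 95.0% of human experts.
Link: https://arxiv.org/abs/2510.19631
Authors: Yiqian Yang, Tian Lan, Qianghuai Jia, Li Zhu, Hui Jiang, Hang Zhu, Longyue Wang, Weihua Luo, Kaifu Zhang
Affiliations: Alibaba International Digital Commerce
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)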
Abstract:Effective deep search agents must not only access open-domain and domain-specific knowledge but also apply complex rules, such as legal clauses, medical manuals and tariff rules. These rules often feature vague boundaries and implicit logic relationships, making precise application challenging for agents. However, this critical capability is largely overlooked by current agent benchmarks. To fill this gap, we introduce HSCodeComp, the first realistic, expert-level e-commerce benchmark designed to evaluate deep search agents in hierarchical rule application. In this task, the deep reasoning process of agents is guided by these rules to predict 10-digit Harmonized System Code (HSCode) of products with noisy but realistic descriptions. These codes, established by the World Customs Organization, are vital for global supply chain efficiency. Built from real-world data collected from large-scale e-commerce platforms, our proposed HSCodeComp comprises 632 product entries spanning diverse product categories, with these HSCodes annotated by several human experts. Extensive experimental results on several state-of-the-art LLMs, open-source, and closed-source agents reveal a huge performance gap: best agent achieves only 46.8% 10-digit accuracy, far below human experts at 95.0%. Besides, detailed analysis demonstrates the challenges of hierarchical rule application, and test-time scaling fails to improve performance further.
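Because HS codes are hierarchical (the first 2 digits identify the chapter, the first 4 the heading, and the first 6 the subheading), one natural way to look at results is accuracy at several prefix lengths. The sketch below assumes that view; the abstract itself only reports 10-digit accuracy, and the codes used here are made-up placeholders, not benchmark data.

```python
def prefix_accuracy(preds, golds, digits):
    """Fraction of predictions whose first `digits` characters match the gold code."""
    hits = sum(p[:digits] == g[:digits] for p, g in zip(preds, golds))
    return hits / len(golds)


# Hypothetical 10-digit codes for illustration only.
golds = ["8517130000", "6109100010", "9503000090"]
preds = ["8517130000", "6109900010", "9504000090"]

# Accuracy at chapter, heading, subheading, and full 10-digit level.
report = {d: prefix_accuracy(preds, golds, d) for d in (2, 4, 6, 10)}

if __name__ == "__main__":
    for digits, acc in report.items():
        print(f"{digits}-digit accuracy: {acc:.2f}")
```

Accuracy can only stay flat or drop as the prefix gets longer, which makes it easy to see at which level of the hierarchy a prediction first goes wrong.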
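[NLP-22] CrossNews-UA: A Cross-lingual News Semantic Similarity Benchmark for Ukrainian, Polish, Russian, and English
[Quick read]: This paper addresses the scarcity and poor scalability of datasets for cross-lingual news similarity assessment, particularly for fake news detection in languages beyond English such as Ukrainian. Existing datasets were curated manually, which limits their size and their adaptability to new languages. The key to the solution is a scalable, explainable crowdsourcing pipeline for collecting cross-lingual news pairs and annotating their semantic similarity, with justifications based on the 4W criteria (Who, What, Where, When). The pipeline produced CrossNews-UA, a novel dataset with Ukrainian as the central language paired with Polish, Russian, and English. The authors then test models ranging from traditional bag-of-words to Transformer-based architectures and large language models (LLMs) on this cross-lingual task.
Link: https://arxiv.org/abs/2510.19628
Authors: Daryna Dementieva, Evgeniya Sukhodolskaya, Alexander Fraser
Affiliations: Unknown
Categories: Computation and Language (cs.CL)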
Abstract:In the era of social networks and rapid misinformation spread, news analysis remains a critical task. Detecting fake news across multiple languages, particularly beyond English, poses significant challenges. Cross-lingual news comparison offers a promising approach to verify information by leveraging external sources in different languages (Chen and Shu, 2024). However, existing datasets for cross-lingual news analysis (Chen et al., 2022a) were manually curated by journalists and experts, limiting their scalability and adaptability to new languages. In this work, we address this gap by introducing a scalable, explainable crowdsourcing pipeline for cross-lingual news similarity assessment. Using this pipeline, we collected a novel dataset CrossNews-UA of news pairs in Ukrainian as a central language with linguistically and contextually relevant languages: Polish, Russian, and English. Each news pair is annotated for semantic similarity with detailed justifications based on the 4W criteria (Who, What, Where, When). We further tested a range of models, from traditional bag-of-words and Transformer-based architectures to large language models (LLMs). Our results highlight the challenges in multilingual news analysis and offer insights into model performance.
[NLP-23] PBBQ: A Persian Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models
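[Quick read]: This paper addresses the lack of systematic tools for evaluating social biases of large language models (LLMs) in Persian cultural contexts. Prior work has examined bias detection in various languages, but resources for Persian culture remain scarce. The key to the solution is PBBQ, a comprehensive benchmark dataset covering 16 cultural categories. It was developed from questionnaires completed by 250 individuals from diverse backgrounds, in collaboration with social science experts to ensure validity, and contains over 37,000 carefully curated questions. Benchmarking several open-source LLMs, a closed-source model, and Persian-specific fine-tuned models shows that current models exhibit significant social biases and that their outputs often replicate human bias patterns. The dataset will be released publicly to support future evaluation and mitigation of bias in Persian language models.
Link: https://arxiv.org/abs/2510.19616
Authors: Farhan Farsi, Shayan Bali, Fatemeh Valeh, Parsa Ghofrani, Alireza Pakniat, Kian Kashfipour, Amir H. Payberah
Affiliations: Unknown
Categories: Computation and Language (cs.CL)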
Abstract:With the increasing adoption of large language models (LLMs), ensuring their alignment with social norms has become a critical concern. While prior research has examined bias detection in various languages, there remains a significant gap in resources addressing social biases within Persian cultural contexts. In this work, we introduce PBBQ, a comprehensive benchmark dataset designed to evaluate social biases in Persian LLMs. Our benchmark, which encompasses 16 cultural categories, was developed through questionnaires completed by 250 diverse individuals across multiple demographics, in close collaboration with social science experts to ensure its validity. The resulting PBBQ dataset contains over 37,000 carefully curated questions, providing a foundation for the evaluation and mitigation of bias in Persian language models. We benchmark several open-source LLMs, a closed-source model, and Persian-specific fine-tuned models on PBBQ. Our findings reveal that current LLMs exhibit significant social biases across Persian culture. Additionally, by comparing model outputs to human responses, we observe that LLMs often replicate human bias patterns, highlighting the complex interplay between learned representations and cultural this http URL acceptance of the paper, our PBBQ dataset will be publicly available for use in future work. Content warning: This paper contains unsafe content.
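[NLP-24] Human-Agent Collaborative Paper-to-Page Crafting for Under $0.1
[Quick read]: This paper addresses the inefficiency and lack of automation researchers face when turning academic papers into dynamic, interactive project webpages. Automation already handles static content such as slides and posters, but webpages still require substantial manual work, which slows the communication of research. The key to the solution is AutoPage, a multi-agent system built around a coarse-to-fine, hierarchical collaborative pipeline: narrative planning first, then multimodal content generation and interactive rendering. Dedicated "Checker" agents verify each step against the source paper to curb AI hallucination, and optional human checkpoints keep the output aligned with the author's intent, turning the system from a mere tool into a collaborative assistant.
Link: https://arxiv.org/abs/2510.19600
Authors: Qianli Ma, Siyu Wang, Yilin Chen, Yinhao Tang, Yixiang Yang, Chang Guo, Bingjie Gao, Zhening Xing, Yanan Sun, Zhipeng Zhang
Affiliations: AutoLab, SAI, Shanghai Jiao Tong University; Shanghai AI Laboratory
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)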
Abstract:In the quest for scientific progress, communicating research is as vital as the discovery itself. Yet, researchers are often sidetracked by the manual, repetitive chore of building project webpages to make their dense papers accessible. While automation has tackled static slides and posters, the dynamic, interactive nature of webpages has remained an unaddressed challenge. To bridge this gap, we reframe the problem, arguing that the solution lies not in a single command, but in a collaborative, hierarchical process. We introduce AutoPage, a novel multi-agent system that embodies this philosophy. AutoPage deconstructs paper-to-page creation into a coarse-to-fine pipeline from narrative planning to multimodal content generation and interactive rendering. To combat AI hallucination, dedicated “Checker” agents verify each step against the source paper, while optional human checkpoints ensure the final product aligns perfectly with the author’s vision, transforming the system from a mere tool into a powerful collaborative assistant. To rigorously validate our approach, we also construct PageBench, the first benchmark for this new task. Experiments show AutoPage not only generates high-quality, visually appealing pages but does so with remarkable efficiency in under 15 minutes for less than $0.1. Code and dataset will be released at the project webpage (this https URL).
[NLP-25] Detecting Latin in Historical Books with Large Language Models: A Multimodal Benchmark
[Quick read]: This paper addresses the extraction of Latin fragments from mixed-language historical documents, which often have varied layouts. The key to the solution is a multimodal benchmark dataset of 724 annotated pages, on which large foundation models are evaluated. The results show that contemporary models can achieve reliable Latin detection, and the study analyzes the capabilities and limits of these models for this task.
Link: https://arxiv.org/abs/2510.19585
Authors: Yu Wu, Ke Shu, Jonas Fischer, Lidia Pivovarova, David Rosson, Eetu Mäkelä, Mikko Tolonen
Affiliations: University of Helsinki
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)
Comments: Under review. Both the dataset and code will be published
Abstract:This paper presents a novel task of extracting Latin fragments from mixed-language historical documents with varied layouts. We benchmark and evaluate the performance of large foundation models against a multimodal dataset of 724 annotated pages. The results demonstrate that reliable Latin detection with contemporary models is achievable. Our study provides the first comprehensive analysis of these models’ capabilities and limits for this task.
[NLP-26] [De|Re]constructing VLMs Reasoning in Counting
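[Quick read]: This paper addresses the limitations of Vision-Language Models (VLMs) in visual reasoning, focusing on counting, where they are sensitive to the number and type of objects, their spatial arrangement, and the co-occurrence of distractors. A layer-wise analysis shows that errors stem from an incorrect mapping of the last-layer representation into the output space. The key to the solution is targeted training: fine-tuning only the output layer improves accuracy by up to 21%, and consistent improvements on real-world datasets corroborate the finding.
Link: https://arxiv.org/abs/2510.19555
Authors: Simone Alghisi, Gabriel Roccabruna, Massimo Rizzoli, Seyed Mahed Mousavi, Giuseppe Riccardi
Affiliations: University of Trento
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: This work has been submitted to the IEEE for possible publication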
Abstract:Vision-Language Models (VLMs) have recently gained attention due to their competitive performance on multiple downstream tasks, achieved by following user-input instructions. However, VLMs still exhibit several limitations in visual reasoning, such as difficulties in identifying relations (e.g., spatial, temporal, and among objects), understanding temporal sequences (e.g., frames), and counting objects. In this work, we go beyond score-level benchmark evaluations of VLMs by investigating the underlying causes of their failures and proposing a targeted approach to improve their reasoning capabilities. We study the reasoning skills of seven state-of-the-art VLMs in the counting task under controlled experimental conditions. Our experiments show that VLMs are highly sensitive to the number and type of objects, their spatial arrangement, and the co-occurrence of distractors. A layer-wise analysis reveals that errors are due to incorrect mapping of the last-layer representation into the output space. Our targeted training shows that fine-tuning just the output layer improves accuracy by up to 21%. We corroborate these findings by achieving consistent improvements on real-world datasets.
[NLP-27] Conditions for Catastrophic Forgetting in Multilingual Translation
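[Quick read]: This paper addresses catastrophic forgetting in multilingual fine-tuning, where fine-tuning a model on specific languages degrades its performance on languages unseen in fine-tuning. The key contribution is identifying the conditions that trigger forgetting: the relative scale between model and data size is a primary determinant; a model's instruction-following ability matters more for retaining multilingual knowledge than its architecture; and cross-lingual alignment can mitigate forgetting while also facilitating positive transfer to unseen target languages.
Link: https://arxiv.org/abs/2510.19546
Authors: Danni Liu, Jan Niehues
Affiliations: Karlsruhe Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: Multilingual Representation Learning (MRL) Workshop 2025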
Abstract:Fine-tuning multilingual foundation models on specific languages often induces catastrophic forgetting, degrading performance on languages unseen in fine-tuning. While this phenomenon is widely-documented, the literature presents fragmented results about when forgetting occurs. To address this ambiguity, we conduct a systematic empirical study using machine translation as a testbed to identify the conditions that trigger catastrophic forgetting in multilingual fine-tuning. Through controlled experiments across different model architectures, data scales, and fine-tuning approaches, we reveal that the relative scale between model and data size is a primary determinant of forgetting. Moreover, we demonstrate that a model’s instruction-following ability is more critical for retaining multilingual knowledge than its architecture. Contrary to assumptions, parameter-efficient fine-tuning offers no clear advantage over full fine-tuning in mitigating forgetting. Lastly, we show that cross-lingual alignment can mitigate forgetting while also facilitating positive transfer to unseen target languages.
[NLP-28] Which Evaluation for Which Model? A Taxonomy for Speech Model Assessment
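[Quick read]: This paper addresses the fragmented evaluation of speech foundation models: different models excel at different aspects of speech processing, yet evaluation protocols are not systematically matched to their capabilities. The key to the solution is a unified taxonomy with three orthogonal axes: the evaluation aspect being measured, the model capabilities required to attempt the task, and the task or protocol requirements needed to perform it. The taxonomy maps existing evaluations and benchmarks to the capabilities a model exposes (e.g., speech generation, real-time processing) and to their methodological demands (e.g., fine-tuning data, human judgment). This gives a principled basis for aligning models with suitable evaluation methods, and it reveals gaps such as limited coverage of prosody, interaction, and reasoning that point to priorities for future benchmark design.
Link: https://arxiv.org/abs/2510.19509
Authors: Maureen de Seyssel, Eeshan Gunesh Dhekane
Affiliations: Apple
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: 57 pages (26 main, 25 appendix, 6 references)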
Abstract:Speech foundation models have recently achieved remarkable capabilities across a wide range of tasks. However, their evaluation remains disjointed across tasks and model types. Different models excel at distinct aspects of speech processing and thus require different evaluation protocols. This paper proposes a unified taxonomy that addresses the question: Which evaluation is appropriate for which model? The taxonomy defines three orthogonal axes: the evaluation aspect being measured, the model capabilities required to attempt the task, and the task or protocol requirements needed to perform it. We classify a broad set of existing evaluations and benchmarks along these axes, spanning areas such as representation learning, speech generation, and interactive dialogue. By mapping each evaluation to the capabilities a model exposes (e.g., speech generation, real-time processing) and to its methodological demands (e.g., fine-tuning data, human judgment), the taxonomy provides a principled framework for aligning models with suitable evaluation methods. It also reveals systematic gaps, such as limited coverage of prosody, interaction, or reasoning, that highlight priorities for future benchmark design. Overall, this work offers a conceptual foundation and practical guide for selecting, interpreting, and extending evaluations of speech models.
[NLP-29] Lookahead Routing for Large Language Models
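[Quick read]: This paper addresses the tension between efficiency and accuracy in routing for multi-model systems. Existing routers classify the input query alone, which reduces overhead but ignores information in potential outputs and contextual nuances, leading to suboptimal routing for complex or ambiguous queries. The key to the solution is Lookahead, a routing framework that predicts the latent representations of each candidate large language model's (LLM's) output and uses these predictions to guide model selection, enabling more informed routing without full inference.
Link: https://arxiv.org/abs/2510.19506
Authors: Canbin Huang, Tianyuan Shi, Yuhua Zhu, Ruijun Chen, Xiaojun Quan
Affiliations: Sun Yat-sen University; Shenzhen Loop Area Institute
Categories: Computation and Language (cs.CL)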
Abstract:Large language model (LLM) routers improve the efficiency of multi-model systems by directing each query to the most appropriate model while leveraging the diverse strengths of heterogeneous LLMs. Most existing approaches frame routing as a classification problem based solely on the input query. While this reduces overhead by avoiding inference across all models, it overlooks valuable information that could be gleaned from potential outputs and fails to capture implicit intent or contextual nuances that often emerge only during response generation. These limitations can result in suboptimal routing decisions, particularly for complex or ambiguous queries that require deeper semantic understanding. To address this challenge, we propose Lookahead, a routing framework that “foresees” potential model outputs by predicting their latent representations and uses these predictions to guide model selection, thus enabling more informed routing without full inference. Within this framework, we implement two approaches based on causal and masked language models. Empirical evaluations across seven public benchmarks - spanning instruction following, mathematical reasoning, and code generation - show that Lookahead consistently outperforms existing routing baselines, achieving an average performance gain of 7.7% over the state-of-the-art. Our code is available at this https URL.
[NLP-30] What is the Best Sequence Length for BABYLM?
[Quick read]: This paper addresses the question of which sequence length to use for BabyLM pretraining under a fixed compute budget. The key is a systematic comparison of sequence lengths across tasks: shorter sequences are sufficient for grammatical generalization tasks, whereas longer contexts benefit morphological analogical reasoning tasks, so the optimal length depends on both task and architecture.
Link: https://arxiv.org/abs/2510.19493
Authors: Suchir Salhan, Richard Diehl Martinez, Zébulon Goriely, Paula Buttery
Affiliations: University of Cambridge; ALTA Institute
Categories: Computation and Language (cs.CL)
Comments: Paper accepted at the 2025 BabyLM Workshop @ EMNLP (Suzhou, China)
Abstract:Transformer language models typically operate with a fixed-length context window, which has grown in step with large-scale pretraining datasets. In the BabyLM Challenge, however, many past submissions have defaulted to using much shorter sequence lengths. We examine the impact of sequence length on BabyLM pretraining, to answer the simple question: what sequence length should we be using when training Baby LMs? Using 100M-word training data and fixed compute budgets, we compare 125M-parameter Mamba and OPT models, finding that although longer is often better, the optimal length depends on both task and architecture. Shorter sequences are sufficient for grammatical generalization tasks whereas longer contexts benefit morphological analogical reasoning tasks.
[NLP-31] Machine Text Detectors are Membership Inference Attacks
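[Quick read]: This paper addresses the methodological split between membership inference attacks (MIAs) and machine-generated text detection, two lines of research that both rely on signals from a language model's probability distribution but have developed independently. The key is establishing the transferability between them: the authors prove that the metric achieving the asymptotically highest performance is the same for both tasks, and large-scale experiments confirm strong cross-task performance of existing methods (for example, Binoculars, designed for machine text detection, reaches state-of-the-art performance on MIA benchmarks). They also introduce MINT, a unified evaluation suite, to support cross-task development and fair comparison.
Link: https://arxiv.org/abs/2510.19492
Authors: Ryuto Koike, Liam Dugan, Masahiro Kaneko, Chris Callison-Burch, Naoaki Okazaki
Affiliations: Institute of Science Tokyo; University of Pennsylvania; MBZUAI
Categories: Computation and Language (cs.CL)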
Abstract:Although membership inference attacks (MIAs) and machine-generated text detection target different goals, identifying training samples and synthetic texts, their methods often exploit similar signals based on a language model’s probability distribution. Despite this shared methodological foundation, the two tasks have been independently studied, which may lead to conclusions that overlook stronger methods and valuable insights developed in the other task. In this work, we theoretically and empirically investigate the transferability, i.e., how well a method originally developed for one task performs on the other, between MIAs and machine text detection. For our theoretical contribution, we prove that the metric that achieves the asymptotically highest performance on both tasks is the same. We unify a large proportion of the existing literature in the context of this optimal metric and hypothesize that the accuracy with which a given method approximates this metric is directly correlated with its transferability. Our large-scale empirical experiments, including 7 state-of-the-art MIA methods and 5 state-of-the-art machine text detectors across 13 domains and 10 generators, demonstrate very strong rank correlation (rho > 0.6) in cross-task performance. We notably find that Binoculars, originally designed for machine text detection, achieves state-of-the-art performance on MIA benchmarks as well, demonstrating the practical impact of the transferability. Our findings highlight the need for greater cross-task awareness and collaboration between the two research communities. To facilitate cross-task developments and fair evaluations, we introduce MINT, a unified evaluation suite for MIAs and machine-generated text detection, with implementation of 15 recent methods from both tasks.
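The cross-task claim rests on rank correlation: if methods are ranked by their score on one task, how well does that ranking match their ranking on the other? A minimal sketch of Spearman's rho follows, assuming no tied scores. The method scores are made-up numbers for illustration, not results from the paper.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation for tie-free data: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical scores of five methods on each task (illustrative only).
mia_scores = [0.61, 0.55, 0.70, 0.52, 0.66]
detection_scores = [0.80, 0.72, 0.91, 0.75, 0.85]

rho = spearman_rho(mia_scores, detection_scores)

if __name__ == "__main__":
    print(f"Spearman rho = {rho:.2f}")
```

With real data that may contain ties, a library routine such as `scipy.stats.spearmanr` would be the safer choice, since the closed-form formula above is exact only without ties.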
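[NLP-32] VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos
[Quick read]: This paper addresses the difficulty of obtaining the large volume of GUI interaction data needed to train computer-use agents, in particular the high cost of manually annotating action trajectories. Its core solution is VideoAgentTrek, a scalable data-mining pipeline that automatically extracts training data from publicly available screen-recorded videos without manual annotation. The key innovation is the Video2Action module, which has two components: a video grounding model that identifies and localizes the temporal boundaries and context of GUI actions, and an action-content recognizer that extracts structured parameters (such as click coordinates and typed text) with high fidelity. The method converts 39,000 YouTube tutorial videos into 1.52 million interaction steps and improves task success rate and step-level accuracy on OSWorld-Verified and AgentNetBench, showing that passive internet videos can be turned into high-quality supervision and offering a cost-effective alternative to manual annotation for computer-use agents.
Link: https://arxiv.org/abs/2510.19488
Authors: Dunjie Lu, Yiheng Xu, Junli Wang, Haoyuan Wu, Xinyuan Wang, Zekun Wang, Junlin Yang, Hongjin Su, Jixuan Chen, Junda Chen, Yuchen Mao, Jingren Zhou, Junyang Lin, Binyuan Hui, Tao Yu
Affiliations: The University of Hong Kong; Qwen Team, Alibaba Group
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 6 figures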
Abstract:Training computer-use agents requires massive amounts of GUI interaction data, but manually annotating action trajectories at scale is prohibitively expensive. We present VideoAgentTrek, a scalable pipeline that automatically mines training data from publicly available screen-recorded videos at web scale, eliminating the need for manual annotation. Our approach addresses a key challenge: raw videos contain implicit demonstrations but lack explicit action labels. To solve this, we develop Video2Action, an inverse dynamics module (IDM) with two components: (1) a video grounding model that detects and localizes GUI actions with precise temporal boundaries and context, and (2) an action-content recognizer that extracts structured parameters like click coordinates and typed text with high fidelity. Applied to 39,000 YouTube tutorial videos, our pipeline generates 1.52 million interaction steps automatically. We leverage this data through continued pretraining followed by supervised fine-tuning. On OSWorld-Verified, our approach improves task success rates from 9.3% (SFT-only baseline) to 15.8%, a 70% relative improvement. On AgentNetBench, step accuracy increases from 64.1% to 69.3%. Our results demonstrate that passive internet videos can be transformed into high-quality supervision for computer-use agents, providing a scalable alternative to expensive manual annotation.
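As a quick check on the figures quoted in the abstract, the relative improvement can be recomputed from the two reported rates. This is only a worked-arithmetic sketch; the helper name `relative_improvement` is made up for illustration and is not from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the change of `improved` over `baseline` as a fraction of `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline


# Percent values as quoted in the abstract.
osworld = relative_improvement(9.3, 15.8)     # task success rate on OSWorld-Verified
agentnet = relative_improvement(64.1, 69.3)   # step accuracy on AgentNetBench

print(f"OSWorld-Verified relative gain: {osworld:.1%}")
print(f"AgentNetBench relative gain: {agentnet:.1%}")
```

The first ratio comes out just under 0.70, which matches the "70% relative improvement" stated in the abstract; the step-accuracy gain on AgentNetBench is much smaller in relative terms (about 8%).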
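[NLP-33] Re-evaluating Minimum Bayes Risk Decoding for Automatic Speech Recognition
[Quick read]: This paper addresses the room for improvement in accuracy left by conventional beam search decoding in automatic speech recognition (ASR) and speech translation (ST). Beam search is widely used for speech-to-text tasks, but recent work shows that sample-based Minimum Bayes Risk (MBR) decoding outperforms it in text-to-text generation. The key step here is to apply MBR decoding to ASR and ST and to evaluate it systematically on Whisper and its derivative models for English and Japanese. MBR achieves higher accuracy in most experimental settings, which suggests it is a promising method for offline ASR and ST tasks that require high accuracy.
Link: https://arxiv.org/abs/2510.19471
Authors: Yuu Jinnai
Affiliations: CyberAgent
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: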
Abstract:Recent work has shown that sample-based Minimum Bayes Risk (MBR) decoding outperforms beam search in text-to-text generation tasks, such as machine translation, text summarization, and image captioning. On the other hand, beam search is the current practice for speech-to-text tasks such as automatic speech recognition (ASR) and Speech Translation (ST). Given that MBR decoding is effective in text-to-text generation tasks, it is reasonable to expect it to also be effective for speech-to-text tasks. In this paper, we evaluate MBR decoding for ASR and ST tasks on English and Japanese using Whisper and its derivative models. We observe that the accuracy of MBR decoding outperforms that of beam search in most of the experimental settings we have evaluated. The results show that MBR decoding is a promising method for offline ASR and ST tasks that require high accuracy. The code is available at this https URL
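To make the decoding rule concrete, here is a minimal sketch of sample-based MBR selection: among a set of sampled hypotheses, pick the one with the highest average utility against all the samples. The utility used below (negative word-level edit distance normalized by reference length) is an assumption chosen because it resembles word error rate; the abstract does not say which utility the paper uses, and the sample sentences are invented.

```python
from typing import Callable, Sequence


def word_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def neg_wer_utility(hyp: str, ref: str) -> float:
    """Assumed utility: higher is better, 0 means identical word sequences."""
    h, r = hyp.split(), ref.split()
    return -word_edit_distance(h, r) / max(len(r), 1)


def mbr_decode(samples: Sequence[str],
               utility: Callable[[str, str], float] = neg_wer_utility) -> str:
    """Return the sample with the highest mean utility against all samples."""
    if not samples:
        raise ValueError("need at least one sample")

    def expected_utility(hyp: str) -> float:
        return sum(utility(hyp, ref) for ref in samples) / len(samples)

    return max(samples, key=expected_utility)


# Invented hypotheses standing in for transcripts sampled from an ASR model.
samples = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "the cat sat on the mat",
    "a cat sits on the mat",
]
best = mbr_decode(samples)
print(best)
```

In this toy set the hypothesis that agrees most with the other samples is selected. With a real model, the samples would come from stochastic decoding, and the cost grows quadratically with the number of samples because every pair is scored.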
[NLP-34] MINED: Probing and Updating with Multimodal Time-Sensitive Knowledge for Large Multimodal Models
[Quick read]: This paper addresses the difficulty that statically pre-trained Large Multimodal Models (LMMs) have in maintaining an accurate understanding of time-sensitive factual knowledge; existing benchmarks are static by design and cannot effectively evaluate the temporal awareness of LMMs. The key contribution is the MINED benchmark, a comprehensive evaluation suite covering 6 core dimensions (cognition, awareness, trustworthiness, understanding, reasoning, and robustness) and 11 challenging tasks. It is built from Wikipedia and contains 2,104 time-sensitive knowledge samples across six knowledge types. An evaluation of 15 widely used LMMs on the benchmark finds that Gemini-2.5-Pro performs best, and also shows that knowledge editing methods can effectively update time-sensitive knowledge in LMMs in single-edit scenarios.
Link: https://arxiv.org/abs/2510.19457
Authors: Kailin Jiang, Ning Jiang, Yuchen Ren, Yuchen Li, Yifan Gao, Jinhe Bi, Yunpu Ma, Qingqing Liu, Xianhao Wang, Yifan Jia, Hongbo Jiang, Yaocong Hu, Bin Li, Lei Liu, Yuntao Du
Affiliations: University of Science and Technology of China; Northeast Forestry University; University of Sydney; Anhui Polytechnic University; Ludwig Maximilian University of Munich; Beijing Institute of Technology; Shandong University; Xiamen University
Categories: Computation and Language (cs.CL)
Comments: project page: this https URL
Abstract:Large Multimodal Models (LMMs) encode rich factual knowledge via cross-modal pre-training, yet their static representations struggle to maintain an accurate understanding of time-sensitive factual knowledge. Existing benchmarks remain constrained by static designs, inadequately evaluating LMMs’ ability to understand time-sensitive knowledge. To address this gap, we propose MINED, a comprehensive benchmark that evaluates temporal awareness along 6 key dimensions and 11 challenging tasks: cognition, awareness, trustworthiness, understanding, reasoning, and robustness. MINED is constructed from Wikipedia by two professional annotators, containing 2,104 time-sensitive knowledge samples spanning six knowledge types. Evaluating 15 widely used LMMs on MINED shows that Gemini-2.5-Pro achieves the highest average CEM score of 63.07, while most open-source LMMs still lack time understanding ability. Meanwhile, LMMs perform best on organization knowledge, whereas their performance is weakest on sport. To address these challenges, we investigate the feasibility of updating time-sensitive knowledge in LMMs through knowledge editing methods and observe that LMMs can effectively update knowledge via knowledge editing methods in single editing scenarios.
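[NLP-35] LLM Unlearning with LLM Beliefs
[Quick read]: This paper addresses the risk that large language models memorize sensitive or harmful content during training and later reproduce it in their outputs. Existing unlearning methods usually rely on gradient ascent and its variants to lower the probability of specific target responses, but the authors find that this strategy has a critical side effect: probability mass is redistributed into high-likelihood regions, which often correspond to semantically related rephrasings of the targets. They call this the "squeezing effect"; it leaves many methods with only spurious unlearning that automated metrics such as ROUGE and truth ratio misreport as success. The key to the solution is a bootstrapping (BS) framework that explicitly links the squeezing effect to the model's own high-confidence generations, i.e., its model beliefs. Because model beliefs naturally capture the high-likelihood regions into which probability mass is squeezed, the unlearning objective suppresses them jointly with the target responses: BS-T (token level) attenuates high-probability tokens and BS-S (sequence level) removes entire high-confidence generations, achieving more thorough forgetting while preserving model utility.
Link: https://arxiv.org/abs/2510.19422
Authors: Kemou Li, Qizhou Wang, Yue Wang, Fengpeng Li, Jun Liu, Bo Han, Jiantao Zhou
Affiliations: State Key Laboratory of Internet of Things and Smart City, University of Macau; TMLR Group, Department of Computer Science, Hong Kong Baptist University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: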
Abstract:Large language models trained on vast corpora inherently risk memorizing sensitive or harmful content, which may later resurface in their outputs. Prevailing unlearning methods generally rely on gradient ascent and its variants to lower the probability of specific target responses. However, we find that this strategy induces a critical side effect: probability mass is redistributed into high-likelihood regions, often corresponding to semantically related rephrasings of the targets. We refer to this as the squeezing effect, which explains why many methods yield merely spurious unlearning, a problem further obscured by automated metrics (e.g., ROUGE, truth ratio) that misreport actual success. To address this, we propose a bootstrapping (BS) framework that explicitly links the squeezing effect with the model’s own high-confidence generations, namely its model beliefs. Since model beliefs inherently capture the very high-likelihood regions where probability mass is squeezed, incorporating them into the unlearning objective directly counters the squeezing effect. By jointly suppressing both target responses and model beliefs, BS-T (token) attenuates high-probability tokens, whereas BS-S (sequence) removes entire high-confidence generations, together achieving more thorough forgetting while preserving utility. Extensive experiments across diverse benchmarks with various model families confirm the effectiveness of our approach.
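[NLP-36] BLiSS 1.0: Evaluating Bilingual Learner Competence in Second Language Small Language Models EMNLP
[Quick read]: This paper addresses the gap between performance-oriented benchmarks and the evaluation of cognitively inspired models: existing evaluations struggle to measure whether a model exhibits the systematic characteristics of human-like language acquisition. The key contribution is BLiSS 1.0 (Benchmark of Learner Interlingual Syntactic Structure), whose core innovation is a new paradigm called selective tolerance: it tests whether a model finds a naturalistic learner error more plausible than a matched artificial error within the same sentence. The benchmark builds 136,867 controlled triplets (corrected sentence, learner sentence, artificial-error sentence) from over 2.8 million naturalistic learner sentences. Experiments show that selective tolerance is a capability distinct from standard grammaticality and that model performance depends strongly on training paradigm, which supports BLiSS as a tool for measuring how different training objectives affect alignment with patterns of human language acquisition.
Link: https://arxiv.org/abs/2510.19419
Authors: Yuan Gao, Suchir Salhan, Andrew Caines, Paula Buttery, Weiwei Sun
Affiliations: ALTA Institute; University of Cambridge
Categories: Computation and Language (cs.CL)
Comments: Accepted Paper at the BabyLM Workshop 2025 @ EMNLP (Presentation in Suzhou, China)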
Abstract:To bridge the gap between performance-oriented benchmarks and the evaluation of cognitively inspired models, we introduce BLiSS 1.0, a Benchmark of Learner Interlingual Syntactic Structure. Our benchmark operationalizes a new paradigm of selective tolerance, testing whether a model finds a naturalistic learner error more plausible than a matched, artificial error within the same sentence. Constructed from over 2.8 million naturalistic learner sentences, BLiSS provides 136,867 controlled triplets (corrected, learner, artificial) for this purpose. Experiments on a diverse suite of models demonstrate that selective tolerance is a distinct capability from standard grammaticality, with performance clustering strongly by training paradigm. This validates BLiSS as a robust tool for measuring how different training objectives impact a model’s alignment with the systematic patterns of human language acquisition.
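One plausible way to turn the selective-tolerance idea into a number is to count how often a model scores the learner sentence above the artificial-error sentence of the same triplet. The sketch below assumes exactly that; the metric definition, the `Triplet` type, the example sentences, and the toy scores are all illustrative assumptions and are not taken from the benchmark.

```python
from typing import Callable, Iterable, NamedTuple


class Triplet(NamedTuple):
    corrected: str
    learner: str      # sentence with a naturalistic learner error
    artificial: str   # sentence with a matched artificial error


def selective_tolerance_rate(triplets: Iterable[Triplet],
                             score: Callable[[str], float]) -> float:
    """Fraction of triplets where the learner error outscores the artificial one.

    `score` is any sentence-level plausibility function, for example a
    language model's log-probability of the sentence.
    """
    items = list(triplets)
    if not items:
        raise ValueError("need at least one triplet")
    hits = sum(score(t.learner) > score(t.artificial) for t in items)
    return hits / len(items)


# Invented triplets and made-up scores standing in for model log-probabilities.
triplets = [
    Triplet("She goes to school.", "She go to school.", "She goes school to."),
    Triplet("I have two cats.", "I have two cat.", "I two have cats."),
]
toy_scores = {
    "She go to school.": -12.0,
    "She goes school to.": -20.0,
    "I have two cat.": -18.0,
    "I two have cats.": -15.0,
}
rate = selective_tolerance_rate(triplets, toy_scores.__getitem__)
print(rate)
```

Passing the scorer as a callable keeps the sketch independent of any particular model or tokenizer.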
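[NLP-37] Spatio-temporal Sign Language Representation and Translation
[Quick read]: This paper addresses the performance bottleneck in sign language translation (SLT) that arises because conventional methods do not make effective use of temporal features. Existing SLT systems typically use a generic sequence-to-sequence (seq2seq) architecture with features extracted from video frames as input, and standard approaches often neglect joint modeling of spatio-temporal features. The key solution is an end-to-end model that learns spatio-temporal feature representations and translation together, with the aim of generalizing better to new datasets. The system reaches 5±1 BLEU on the development set; test-set performance drops to 0.11±0.06 BLEU, but the approach still shows potential for unified modeling of spatio-temporal information.
Link: https://arxiv.org/abs/2510.19413
Authors: Yasser Hamidullah, Josef van Genabith, Cristina España-Bonet
Affiliations: German Research Center for Artificial Intelligence (DFKI)
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: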
Abstract: This paper describes the DFKI-MLT submission to the WMT-SLT 2022 sign language translation (SLT) task from Swiss German Sign Language (video) into German (text). State-of-the-art techniques for SLT use a generic seq2seq architecture with customized input embeddings. Instead of word embeddings as used in textual machine translation, SLT systems use features extracted from video frames. Standard approaches often do not benefit from temporal features. In our participation, we present a system that learns spatio-temporal feature representations and translation in a single model, resulting in a real end-to-end architecture expected to better generalize to new data sets. Our best system achieved 5±1 BLEU points on the development set, but the performance on the test set dropped to 0.11±0.06 BLEU points.
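[NLP-38] ToMMeR – Efficient Entity Mention Detection from Large Language Models
[Quick read]: This paper addresses mention detection in named entity recognition (NER), i.e., accurately identifying the text spans that refer to entities, which is a foundational task for information extraction and often a performance bottleneck. The key contribution is ToMMeR, a lightweight model (300K parameters) that probes mention detection capabilities from early layers of large language models (LLMs). It achieves high recall (93% zero-shot) and high precision (over 90% with an LLM as judge) and rarely produces spurious predictions. The study further shows that LLMs of different architectures (14M to 15B parameters) converge on similar mention boundaries in their early layers (Dice similarity of 75%), which suggests that mention detection emerges naturally from language modeling. When ToMMeR is extended with span classification heads, it reaches near-SOTA F1 on standard NER benchmarks (80-87%), supporting the claim that structured entity representations exist in early LLM layers and can be recovered efficiently with very few parameters.
Link: https://arxiv.org/abs/2510.19410
Authors: Victor Morand, Nadi Tomeh, Josiane Mothe, Benjamin Piwowarski
Affiliations: Institut des Systèmes Intelligents et de Robotique (ISIR); Sorbonne Université; CNRS; LIPN; Université Sorbonne Paris Nord; UMR7030 CNRS; INSPE; UT2J; Université de Toulouse; IRIT; UMR5505 CNRS
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Code is available at this https URL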
Abstract:Identifying which text spans refer to entities – mention detection – is both foundational for information extraction and a known performance bottleneck. We introduce ToMMeR, a lightweight model (300K parameters) probing mention detection capabilities from early LLM layers. Across 13 NER benchmarks, ToMMeR achieves 93% recall zero-shot, with over 90% precision using an LLM as a judge showing that ToMMeR rarely produces spurious predictions despite high recall. Cross-model analysis reveals that diverse architectures (14M-15B parameters) converge on similar mention boundaries (DICE 75%), confirming that mention detection emerges naturally from language modeling. When extended with span classification heads, ToMMeR achieves near SOTA NER performance (80-87% F1 on standard benchmarks). Our work provides evidence that structured entity representations exist in early transformer layers and can be efficiently recovered with minimal parameters.
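The cross-model agreement figure is reported as a DICE score. The standard Dice coefficient between two sets A and B is 2|A∩B| / (|A| + |B|); the sketch below applies it to sets of predicted mention spans. Treating spans as exact `(start, end)` matches is an assumption made for illustration, since the abstract does not say how the overlap is computed, and the example spans are invented.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int]  # (start, end) offsets of a predicted mention


def dice_overlap(spans_a: Iterable[Span], spans_b: Iterable[Span]) -> float:
    """Dice coefficient between two sets of spans, using exact span matches."""
    a, b = set(spans_a), set(spans_b)
    if not a and not b:
        return 1.0  # two empty predictions agree trivially
    return 2 * len(a & b) / (len(a) + len(b))


# Invented predictions from two hypothetical probes on the same text.
model_a = [(0, 2), (5, 7), (10, 12)]
model_b = [(0, 2), (5, 7), (9, 12)]
agreement = dice_overlap(model_a, model_b)
print(round(agreement, 3))
```

Here two of the three spans match exactly, so the score is 2·2 / (3 + 3), or about 0.667. A boundary that is off by one token counts as a full miss under exact matching, which is why this variant is strict.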
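[NLP-39] SONAR-SLT: Multilingual Sign Language Translation via Language-Agnostic Sentence Embedding Supervision
[Quick read]: This paper addresses the poor scalability and weak cross-language generalization of conventional sign language translation (SLT) models, which rely on text supervision in a single spoken language. The key to the solution is language-agnostic, multimodal embeddings trained on text and speech from multiple languages, which enable direct multilingual translation without being tied to a specific language or modality. The paper also proposes a coupled augmentation method that combines multilingual target augmentation (i.e., translations into many languages) with video-level perturbations, mitigating data scarcity and improving robustness. Experiments show gains in BLEURT over supervision with text-only sentence embeddings, with larger improvements in low-resource settings.
Link: https://arxiv.org/abs/2510.19398
Authors: Yasser Hamidullah, Shakib Yazdani, Cennet Oguz, Josef van Genabith, Cristina España-Bonet
Affiliations: German Research Center for Artificial Intelligence (DFKI GmbH); Barcelona Supercomputing Center (BSC-CNS)
Categories: Computation and Language (cs.CL)
Comments: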
Abstract:Sign language translation (SLT) is typically trained with text in a single spoken language, which limits scalability and cross-language generalization. Earlier approaches have replaced gloss supervision with text-based sentence embeddings, but up to now, these remain tied to a specific language and modality. In contrast, here we employ language-agnostic, multimodal embeddings trained on text and speech from multiple languages to supervise SLT, enabling direct multilingual translation. To address data scarcity, we propose a coupled augmentation method that combines multilingual target augmentations (i.e. translations into many languages) with video-level perturbations, improving model robustness. Experiments show consistent BLEURT gains over text-only sentence embedding supervision, with larger improvements in low-resource settings. Our results demonstrate that language-agnostic embedding supervision, combined with coupled augmentation, provides a scalable and semantically robust alternative to traditional SLT training.
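[NLP-40] ColorAgent: Building A Robust Personalized and Interactive OS Agent
[Quick read]: This paper addresses how to build a next-generation operating system (OS) agent that can execute user instructions and faithfully follow user intent, moving from the traditional command-line interface to AI agent interaction. The solution has three key parts. First, step-wise reinforcement learning and self-evolving training strengthen the model's ability in long-horizon interaction with the environment. Second, a tailored multi-agent framework provides generality, consistency, and robustness. Third, on the user-interaction side, personalized intent recognition and proactive engagement position the OS agent not merely as an automation tool but as a warm, collaborative partner. ColorAgent achieves success rates of 77.2% on AndroidWorld and 50.7% on AndroidLab, establishing a new state of the art.
Link: https://arxiv.org/abs/2510.19386
Authors: Ning Li, Qiqiang Lin, Zheng Wu, Xiaoyun Mo, Weiming Zhang, Yin Zhao, Xiangmou Qu, Jiamu Zhou, Jun Wang, Congmin Zheng, Yuanyi Song, Hongjiang Chen, Heyuan Huang, Jihong Wang, Jiaxin Yin, Jingwei Yu, Junwei Liao, Qiuying Peng, Xingyu Lou, Jun Wang, Weiwen Liu, Zhuosheng Zhang, Weinan Zhang
Affiliations: Shanghai Jiao Tong University; OPPO Research Institute
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: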
Abstract:With the advancements in hardware, software, and large language model technologies, the interaction between humans and operating systems has evolved from the command-line interface to the rapidly emerging AI agent interactions. Building an operating system (OS) agent capable of executing user instructions and faithfully following user desires is becoming a reality. In this technical report, we present ColorAgent, an OS agent designed to engage in long-horizon, robust interactions with the environment while also enabling personalized and proactive user interaction. To enable long-horizon interactions with the environment, we enhance the model’s capabilities through step-wise reinforcement learning and self-evolving training, while also developing a tailored multi-agent framework that ensures generality, consistency, and robustness. In terms of user interaction, we explore personalized user intent recognition and proactive engagement, positioning the OS agent not merely as an automation tool but as a warm, collaborative partner. We evaluate ColorAgent on the AndroidWorld and AndroidLab benchmarks, achieving success rates of 77.2% and 50.7%, respectively, establishing a new state of the art. Nonetheless, we note that current benchmarks are insufficient for a comprehensive evaluation of OS agents and propose further exploring directions in future work, particularly in the areas of evaluation paradigms, agent collaboration, and security. Our code is available at this https URL.
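[NLP-41] Sign Language Translation with Sentence Embedding Supervision
[Quick read]: This paper addresses the heavy dependence of current sign language translation (SLT) systems on manually annotated glosses; when large-scale, high-quality gloss annotations are unavailable, existing methods are limited. The key to the solution is to use sentence embeddings of the target sentences as a supervision signal in place of glosses. This supervision needs no manual annotation and is learned from raw text alone. With it, the model significantly outperforms other gloss-free approaches, sets a new state of the art among gloss-free systems on the German (PHOENIX-2014T) and American (How2Sign) sign language datasets, supports multilingual settings naturally, and narrows the gap between gloss-free and gloss-dependent systems.
Link: https://arxiv.org/abs/2510.19367
Authors: Yasser Hamidullah, Josef van Genabith, Cristina España-Bonet
Affiliations: German Research Center for Artificial Intelligence (DFKI GmbH)
Categories: Computation and Language (cs.CL)
Comments: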
Abstract:State-of-the-art sign language translation (SLT) systems facilitate the learning process through gloss annotations, either in an end2end manner or by involving an intermediate step. Unfortunately, gloss labelled sign language data is usually not available at scale and, when available, gloss annotations widely differ from dataset to dataset. We present a novel approach using sentence embeddings of the target sentences at training time that take the role of glosses. The new kind of supervision does not need any manual annotation but it is learned on raw textual data. As our approach easily facilitates multilinguality, we evaluate it on datasets covering German (PHOENIX-2014T) and American (How2Sign) sign languages and experiment with mono- and multilingual sentence embeddings and translation systems. Our approach significantly outperforms other gloss-free approaches, setting the new state-of-the-art for data sets where glosses are not available and when no additional SLT datasets are used for pretraining, diminishing the gap between gloss-free and gloss-dependent systems.
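[NLP-42] MoE-Prism: Disentangling Monolithic Experts for Elastic MoE Services via Model-System Co-Designs
Link: https://arxiv.org/abs/2510.19366
Authors: Xinfeng Xia, Jiacheng Liu, Xiaofeng Hou, Peng Tang, Mingxuan Zhang, Wenfeng Wang, Chao Li
Affiliations: Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
[NLP-43] The Massive Legal Embedding Benchmark (MLEB)
[Quick read]: This paper addresses the lack of a large-scale, diverse, open-source benchmark for legal information retrieval, with the goal of enabling reproducible evaluation and improvement of models across jurisdictions, document types, and tasks. The key contribution is the Massive Legal Embedding Benchmark (MLEB), described as the largest and most comprehensive open-source benchmark of its kind to date. It comprises ten expert-annotated datasets spanning multiple jurisdictions (the US, UK, EU, Australia, Ireland, and Singapore), document types (cases, legislation, regulatory guidance, contracts, and literature), and task types (search, zero-shot classification, and question answering). Seven of the datasets are newly constructed to fill domain and jurisdictional gaps in open-source legal information retrieval resources, and the code, results, and data are released openly to support reproducible evaluation.
Link: https://arxiv.org/abs/2510.19365
Authors: Umar Butler, Abdur-Rahman Butler, Adrian Lucas Malec
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 15 pages, 2 figures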
Abstract:We present the Massive Legal Embedding Benchmark (MLEB), the largest, most diverse, and most comprehensive open-source benchmark for legal information retrieval to date. MLEB consists of ten expert-annotated datasets spanning multiple jurisdictions (the US, UK, EU, Australia, Ireland, and Singapore), document types (cases, legislation, regulatory guidance, contracts, and literature), and task types (search, zero-shot classification, and question answering). Seven of the datasets in MLEB were newly constructed in order to fill domain and jurisdictional gaps in the open-source legal information retrieval landscape. We document our methodology in building MLEB and creating the new constituent datasets, and release our code, results, and data openly to assist with reproducible evaluations.
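[NLP-44] LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts
[Quick read]: This paper addresses the bottleneck in long-context reasoning for large language models, in particular how to train models effectively for high-difficulty multi-hop question answering (QA). Existing reinforcement learning (RL) methods mainly improve short-context reasoning; the advanced thinking patterns needed for long contexts remain underexplored, and high-quality hard samples are scarce. The key to the solution is the LoongRL framework, whose core innovation is KeyChain, a data synthesis method that turns short multi-hop QA into high-difficulty long-context tasks: it inserts UUID chains that hide the true question among a large number of distracting documents, forcing the model to trace the correct chain step by step, identify the true question, retrieve the relevant facts, and reason over them. RL training on this data induces an emergent plan-retrieve-reason-recheck reasoning pattern that generalizes to lengths well beyond those seen in training (e.g., 128K) while preserving short-context reasoning performance.
Link: https://arxiv.org/abs/2510.19363
Authors: Siyuan Wang, Gaokai Zhang, Li Lyna Zhang, Ning Shang, Fan Yang, Dongyao Chen, Mao Yang
Affiliations: Microsoft Research Asia; Shanghai Jiao Tong University; Carnegie Mellon University
Categories: Computation and Language (cs.CL)
Comments: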
Abstract:Reasoning over long contexts is essential for large language models. While reinforcement learning (RL) enhances short-context reasoning by inducing “Aha” moments in chain-of-thought, the advanced thinking patterns required for long-context reasoning remain largely unexplored, and high-difficulty RL data are scarce. In this paper, we introduce LoongRL, a data-driven RL method for advanced long-context reasoning. Central to LoongRL is KeyChain, a synthesis approach that transforms short multi-hop QA into high-difficulty long-context tasks by inserting UUID chains that hide the true question among large collections of distracting documents. Solving these tasks requires the model to trace the correct chain step-by-step, identify the true question, retrieve relevant facts and reason over them to answer correctly. RL training on KeyChain data induces an emergent plan-retrieve-reason-recheck reasoning pattern that generalizes far beyond training length. Models trained at 16K effectively solve 128K tasks without prohibitive full-length RL rollout costs. On Qwen2.5-7B and 14B, LoongRL substantially improves long-context multi-hop QA accuracy by +23.5% and +21.1% absolute gains. The resulting LoongRL-14B reaches a score of 74.2, rivaling much larger frontier models such as o3-mini (74.5) and DeepSeek-R1 (74.9). It also improves long-context retrieval, passes all 128K needle-in-a-haystack stress tests, and preserves short-context reasoning capabilities.
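The KeyChain idea described above can be pictured with a small sketch: a chain of UUID keys where each key maps to the next and the last maps to the true question, mixed with decoy chains and scattered among distractor documents. Everything concrete here is an assumption for illustration (the `key -> value` snippet format, chain length, function names, and placeholder texts); the abstract does not specify the actual data format.

```python
import random
import uuid
from typing import Dict, List, Sequence, Tuple


def make_chain(length: int, final_value: str,
               rng: random.Random) -> Tuple[str, Dict[str, str]]:
    """Build key -> next-key links ending in `final_value`; return (first key, links)."""
    keys = [uuid.UUID(int=rng.getrandbits(128)).hex for _ in range(length)]
    links = {keys[i]: keys[i + 1] for i in range(length - 1)}
    links[keys[-1]] = final_value
    return keys[0], links


def build_keychain_context(documents: Sequence[str], true_question: str,
                           decoy_questions: Sequence[str],
                           chain_length: int = 3,
                           seed: int = 0) -> Tuple[str, Dict[str, str], str]:
    """Hide the true question behind a UUID chain among documents and decoy chains."""
    rng = random.Random(seed)
    start_key, links = make_chain(chain_length, true_question, rng)
    for question in decoy_questions:
        _, decoy_links = make_chain(chain_length, question, rng)
        links.update(decoy_links)
    snippets: List[str] = [f"{key} -> {value}" for key, value in links.items()]
    pieces = list(documents) + snippets
    rng.shuffle(pieces)  # scatter chain links among the distractor documents
    return start_key, links, "\n\n".join(pieces)


def follow_chain(start_key: str, links: Dict[str, str]) -> str:
    """Trace the chain from `start_key` until a non-key value (the question) appears."""
    value = links[start_key]
    while value in links:
        value = links[value]
    return value


start, links, context = build_keychain_context(
    documents=["<distractor document 1>", "<distractor document 2>"],
    true_question="<true multi-hop question>",
    decoy_questions=["<decoy question>"],
)
print(follow_chain(start, links))
```

Only the starting key would be given to the model; recovering the question requires following each link in order, which is the step-by-step tracing behaviour the abstract describes.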
[NLP-45] AgenticMath: Enhancing LLM Reasoning via Agentic-based Math Data Generation
[Quick read]: This paper addresses the bottleneck that low-quality, information-poor training data places on improving the mathematical reasoning of large language models (LLMs). Existing methods often generate incorrect answers or lack logical rigor, and they depend on large amounts of labeled data while remaining inefficient. The key to the solution is AgenticMath, an agentic pipeline for generating high-quality mathematical question-answer pairs in four stages: Seed Question Filter, Agentic Question Rephrase (multi-agent), Answer Augment based on chain-of-thought (CoT), and a final Question and Answer Evaluation. By automatically building a small but high-quality dataset (only 30-60K samples), the method lets 3B-8B parameter LLMs match or exceed baselines trained on many times more data (e.g., 400K or 2.3M samples), supporting the view that targeted, high-quality data generation is more effective than large volumes of low-quality data.
Link: https://arxiv.org/abs/2510.19361
Authors: Xianyang Liu, Yilin Liu, Shuai Wang, Hao Cheng, Andrew Estornell, Yuzhi Zhao, Jiaheng Wei
Affiliations: King’s College London; The Hong Kong University of Science and Technology (Guangzhou); Hong Kong Baptist University; ByteDance Seed; City University of Hong Kong
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Work in progress
Abstract: The creation of high-quality datasets to improve Large Language Model (LLM) reasoning remains a significant challenge, as current methods often suffer from generating low-quality/incorrect answers and limited information richness from available data sources. To address this, we propose AgenticMath, a novel agentic pipeline for generating high-quality mathematical question-answer pairs to enhance the supervised fine-tuning of LLMs. Our method operates through four stages: (1) a Seed Question Filter that selects questions with high information richness, complexity, and clarity; (2) an Agentic Question Rephrase step that employs a multi-agent system to generate diverse, logically consistent paraphrases; (3) an Answer Augment step that rewrites answers using chain-of-thought reasoning to enhance numerical and logical correctness, without reliance on human-provided labels; and (4) a final Question and Answer Evaluation that retains only the most superior pairs. Extensive experiments demonstrate that fine-tuning 3B-8B parameter LLMs on AgenticMath-generated datasets (comprising only 30-60K math samples) achieves competitive or superior performance on diverse in-domain and out-of-domain mathematical reasoning benchmarks compared to baselines trained on much more data (e.g., 400K or 2.3M samples). Our work demonstrates that targeted, high-quality data generation is a more efficient path to improving mathematical reasoning in LLMs than large-scale, low-quality alternatives.
[NLP-46] M3-SLU: Evaluating Speaker-Attributed Reasoning in Multimodal Large Language Models LREC2026
[Quick read]: This paper addresses the weakness of current multimodal large language models (MLLMs) in speaker-attributed reasoning for multi-speaker, multi-turn dialogue: models capture what was said but struggle to identify who said what and when. The key contribution is a new benchmark, M3-SLU, built from four open corpora (CHiME-6, MELD, MultiDialog, and AMI) and containing over 12,000 validated instances with audio, transcripts, and metadata. It defines two core tasks: (1) Speaker-Attributed Question Answering and (2) Speaker Attribution via Utterance Matching. Evaluating end-to-end MLLMs and cascaded pipelines with LLM-as-Judge and accuracy metrics reveals a clear gap in speaker-aware dialogue understanding and provides a challenging benchmark for further research.
Link: https://arxiv.org/abs/2510.19358
Authors: Yejin Kwon, Taewoo Kang, Hyunsoo Yoon, Changouk Kim
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Submitted to LREC 2026. 11 pages, 5 figures
Abstract: We present M3-SLU, a new multimodal large language model (MLLM) benchmark for evaluating multi-speaker, multi-turn spoken language understanding. While recent models show strong performance in speech and text comprehension, they still struggle with speaker-attributed reasoning, the ability to understand who said what and when in natural conversations. M3-SLU is built from four open corpora (CHiME-6, MELD, MultiDialog, and AMI) and comprises over 12,000 validated instances with paired audio, transcripts, and metadata. It includes two tasks: (1) Speaker-Attributed Question Answering and (2) Speaker Attribution via Utterance Matching. We provide baseline results for both cascaded pipelines and end-to-end MLLMs, evaluated using an LLM-as-Judge and accuracy metrics. Results show that while models can capture what was said, they often fail to identify who said it, revealing a key gap in speaker-aware dialogue understanding. M3-SLU serves as a challenging benchmark to advance research in speaker-aware multimodal understanding.
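[NLP-47] Modeling Turn-Taking with Semantically Informed Gestures
[Quick read]: This paper addresses turn-taking prediction in multimodal conversation, with a focus on gestures as a complementary cue for managing dialogue. Existing methods rely mainly on linguistic and acoustic features and overlook the semantic information that gestures contribute to regulating turn transitions. The key contribution is DnD Gesture++, an extended multi-party conversational gesture dataset with 2,663 semantic gesture annotations (iconic, metaphoric, deictic, and discourse types), together with a Mixture-of-Experts framework that integrates text, audio, and gesture features. Experiments show that adding semantically guided gestures consistently improves turn-taking prediction over baselines, supporting their complementary value in multimodal interaction.
Link: https://arxiv.org/abs/2510.19350
Authors: Varsha Suresh, M. Hamza Mughal, Christian Theobalt, Vera Demberg
Affiliations: Saarland University; Max Planck Institute for Informatics
Categories: Computation and Language (cs.CL)
Comments: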
Abstract:In conversation, humans use multimodal cues, such as speech, gestures, and gaze, to manage turn-taking. While linguistic and acoustic features are informative, gestures provide complementary cues for modeling these transitions. To study this, we introduce DnD Gesture++, an extension of the multi-party DnD Gesture corpus enriched with 2,663 semantic gesture annotations spanning iconic, metaphoric, deictic, and discourse types. Using this dataset, we model turn-taking prediction through a Mixture-of-Experts framework integrating text, audio, and gestures. Experiments show that incorporating semantically guided gestures yields consistent performance gains over baselines, demonstrating their complementary role in multimodal turn-taking.
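[NLP-48] Local Obfuscation by GLINER for Impartial Context Aware Lineage: Development and evaluation of PII Removal system
[Quick read]: This paper addresses privacy protection for Personally Identifiable Information (PII) in clinical notes, in support of medical research and AI development. Conventional approaches depend on costly Large Language Models (LLMs) or cloud services, which consume substantial compute and carry data privacy risks, making them hard to deploy in low-resource settings. The key to the solution is LOGICAL, a local, lightweight, and efficient PII de-identification system built on a fine-tuned Generalist and Lightweight Named Entity Recognition (GLiNER) model that runs on a standard laptop without a dedicated GPU. Using 1515 clinical documents from a psychiatric hospital's Electronic Health Record (EHR) system, the fine-tuned model achieves a micro-average F1 score of 0.980 on the test set and fully sanitises 95% of documents, outperforming Azure NER, Presidio, and zero-shot prompted LLMs, which illustrates the advantages of "sanitisation at the source" in accuracy, efficiency, and security.
Link: https://arxiv.org/abs/2510.19346
Authors: Prakrithi Shivaprakash, Lekhansh Shukla, Animesh Mukherjee, Prabhat Chand, Pratima Murthy
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 30 pages, 15 main text and 15 supplementary material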
Abstract:Removing Personally Identifiable Information (PII) from clinical notes in Electronic Health Records (EHRs) is essential for research and AI development. While Large Language Models (LLMs) are powerful, their high computational costs and the data privacy risks of API-based services limit their use, especially in low-resource settings. To address this, we developed LOGICAL (Local Obfuscation by GLINER for Impartial Context-Aware Lineage), an efficient, locally deployable PII removal system built on a fine-tuned Generalist and Lightweight Named Entity Recognition (GLiNER) model. We used 1515 clinical documents from a psychiatric hospital’s EHR system. We defined nine PII categories for removal. A modern-gliner-bi-large-v1.0 model was fine-tuned on 2849 text instances and evaluated on a test set of 376 instances using character-level precision, recall, and F1-score. We compared its performance against Microsoft Azure NER, Microsoft Presidio, and zero-shot prompting with Gemini-Pro-2.5 and Llama-3.3-70B-Instruct. The fine-tuned GLiNER model achieved superior performance, with an overall micro-average F1-score of 0.980, significantly outperforming Gemini-Pro-2.5 (F1-score: 0.845). LOGICAL correctly sanitised 95% of documents completely, compared to 64% for the next-best solution. The model operated efficiently on a standard laptop without a dedicated GPU. However, a 2% entity-level false negative rate underscores the need for human-in-the-loop validation across all tested systems. Fine-tuned, specialised transformer models like GLiNER offer an accurate, computationally efficient, and secure solution for PII removal from clinical notes. This “sanitisation at the source” approach is a practical alternative to resource-intensive LLMs, enabling the creation of de-identified datasets for research and AI development while preserving data privacy, particularly in resource-constrained environments.
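The evaluation uses character-level precision, recall, and F1 with micro-averaging. A minimal sketch of one common way to compute these is shown below: each span is expanded into the set of character positions it covers, and true/false positives and false negatives are pooled across documents before the ratios are taken. The half-open `(start, end)` span convention and the example offsets are assumptions for illustration, not details from the paper.

```python
from typing import List, Sequence, Set, Tuple

Span = Tuple[int, int]  # half-open character offsets: [start, end)


def char_set(spans: Sequence[Span]) -> Set[int]:
    """Expand spans into the set of character positions they cover."""
    chars: Set[int] = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars


def char_level_prf(gold_docs: Sequence[Sequence[Span]],
                   pred_docs: Sequence[Sequence[Span]]) -> Tuple[float, float, float]:
    """Micro-averaged character-level precision, recall, and F1 over documents."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        g, p = char_set(gold), char_set(pred)
        tp += len(g & p)   # PII characters correctly flagged
        fp += len(p - g)   # non-PII characters flagged
        fn += len(g - p)   # PII characters missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Invented offsets for two documents.
gold: List[List[Span]] = [[(0, 10)], [(5, 10)]]
pred: List[List[Span]] = [[(0, 8)], [(5, 12)]]
precision, recall, f1 = char_level_prf(gold, pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Scoring at the character level gives partial credit for a span that is mostly right, whereas the entity-level false negative rate mentioned in the abstract counts whole missed entities.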
[NLP-49] Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning
Link: https://arxiv.org/abs/2510.19338
Authors: Ling Team,Bin Han,Caizhi Tang,Chen Liang,Donghao Zhang,Fan Yuan,Feng Zhu,Jie Gao,Jingyu Hu,Longfei Li,Meng Li,Mingyang Zhang,Peijie Jiang,Peng Jiao,Qian Zhao,Qingyuan Yang,Wenbo Shen,Xinxing Yang,Yalin Zhang,Yankun Ren,Yao Zhao,Yibo Cao,Yixuan Sun,Yue Zhang,Yuchen Fang,Zibin Lin,Zixuan Cheng,Jun Zhou
Institutions: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 20 pages, 13 figures
[NLP-50] Algorithmic Fairness in NLP: Persona-Infused LLMs for Human-Centric Hate Speech Detection
Link: https://arxiv.org/abs/2510.19331
Authors: Ewelina Gajewska,Arda Derbent,Jaroslaw A Chudziak,Katarzyna Budzynska
Institutions: Warsaw University of Technology
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: This paper has been accepted for the upcoming 59th Hawaii International Conference on System Sciences (HICSS-59), 2026, Hawaii, USA. The final published version will appear in the official conference proceedings
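[NLP-51] Slot Filling as a Reasoning Task for SpeechLLMs
Quick read: This paper addresses the limited reasoning ability of speech large language models (speechLLMs) on the end-to-end slot-filling task. The key to the solution is a chain-of-thought framework that decomposes slot filling into multiple reasoning steps; a reasoning dataset is built on this decomposition and used for supervised fine-tuning of a speechLLM. The study further distinguishes regular from reasoning speechLLMs and compares text foundation models of different types and sizes. It finds that a reasoning text LLM developed mainly for math, logic, and coding may be an inferior foundation, whereas hybrid speechLLMs, fine-tuned to preserve both the direct and the reasoning mode of operation, outperform models fine-tuned with only one mode.
Link: https://arxiv.org/abs/2510.19326
Authors: Kadri Hacioglu,Manjunath K E,Andreas Stolcke
Institutions: unknown
Categories: Computation and Language (cs.CL)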
Abstract:We propose integration of reasoning into speech large language models (speechLLMs) for the end-to-end slot-filling task. Inspired by the recent development of reasoning LLMs, we use a chain-of-thought framework to decompose the slot-filling task into multiple reasoning steps, create a reasoning dataset and apply the supervised fine-tuning strategy to a speechLLM. We distinguish between regular and reasoning speechLLMs and experiment with different types and sizes of LLMs as their text foundation models. We demonstrate performance improvements by introducing reasoning (intermediate) steps. However, we show that a reasoning textual LLM developed mainly for math, logic and coding domains might be inferior as a foundation model for a reasoning speechLLM. We further show that hybrid speechLLMs, built on a hybrid text foundation LLM and fine-tuned to preserve both direct and reasoning modes of operation, have better performance than those fine-tuned employing only one mode of operation.
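[NLP-52] Balancing Rewards in Text Summarization: Multi-Objective Reinforcement Learning via HyperVolume Optimization
Quick read: This paper addresses multi-objective optimization in text summarization, namely how to balance and jointly optimize several evaluation dimensions: consistency, coherence, relevance, and fluency. Conventional methods struggle to trade off these competing objectives, and although reinforcement learning (RL) on large language models (LLMs) is promising, it lacks a mechanism for adjusting dynamically across objectives. The key to the solution is hypervolume optimization (HVO), a strategy that uses the hypervolume method to dynamically adjust the scores between groups during the RL reward process, guiding the model to progressively approximate the Pareto front and produce summaries that are balanced across dimensions. Experiments show that the method outperforms group relative policy optimization (GRPO) on several datasets, and that a 7B foundation model enhanced with HVO performs comparably to GPT-4 while generating shorter outputs.
Link: https://arxiv.org/abs/2510.19325
Authors: Junjie Song,Yiwen Liu,Dapeng Li,Yin Sun,Shukun Fu,Siqi Chen,Yuji Cao
Institutions: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)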
Abstract:Text summarization is a crucial task that requires the simultaneous optimization of multiple objectives, including consistency, coherence, relevance, and fluency, which presents considerable challenges. Although large language models (LLMs) have demonstrated remarkable performance, enhanced by reinforcement learning (RL), few studies have focused on optimizing the multi-objective problem of summarization through RL based on LLMs. In this paper, we introduce hypervolume optimization (HVO), a novel optimization strategy that dynamically adjusts the scores between groups during the reward process in RL by using the hypervolume method. This method guides the model’s optimization to progressively approximate the Pareto front, thereby generating balanced summaries across multiple objectives. Experimental results on several representative summarization datasets demonstrate that our method outperforms group relative policy optimization (GRPO) in overall scores and shows more balanced performance across different dimensions. Moreover, a 7B foundation model enhanced by HVO performs comparably to GPT-4 in the summarization task, while maintaining a shorter generation length. Our code is publicly available at this https URL
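To make the hypervolume idea more concrete, here is a minimal sketch of the two-objective hypervolume indicator: the area dominated by a set of score vectors and bounded by a reference point. It is a simplification under stated assumptions. The abstract names four objectives and does not specify how HVO turns hypervolume into group-level rewards, so the function name, the reference point at the origin, and the sample scores are hypothetical.

```python
def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area dominated by `points` (both objectives maximized) and bounded below by `ref`."""
    # Keep only points that strictly improve on the reference in both objectives.
    pts = [(x, y) for x, y in points if x > ref[0] and y > ref[1]]
    # Sweep from the largest first objective to the smallest.
    pts.sort(key=lambda p: p[0], reverse=True)
    area = 0.0
    best_y = ref[1]
    for x, y in pts:
        if y > best_y:  # this point contributes a new horizontal strip
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


# Hypothetical (consistency, fluency) scores for two candidate summaries.
balanced = [(0.6, 0.6)]
lopsided = [(0.9, 0.2)]
print(hypervolume_2d(balanced), hypervolume_2d(lopsided))
```

In this toy case the balanced summary has the larger hypervolume even though its scores sum to less (1.2 versus 1.1), which illustrates why a hypervolume-based signal can favor balance over a plain sum of rewards.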
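[NLP-53] HAD: HAllucination Detection Language Models Based on a Comprehensive Hallucination Taxonomy
Quick read: This paper addresses hallucination, which is widespread in natural language generation (NLG) models and especially in large language models: the model produces information that looks plausible but is in fact wrong. To tackle it, the authors propose a comprehensive hallucination taxonomy with 11 categories and build the unified HAllucination Detection (HAD) models, which integrate hallucination detection, span-level identification, and correction into a single inference process. The models are trained on a synthetic dataset of about 90K samples, which makes them applicable across a variety of NLG tasks. On the authors' annotated HADTest set (2,248 samples) and on in-domain and out-of-domain test sets, the HAD models generally outperform existing baselines and achieve state-of-the-art results on HaluEval, FactCHD, and FaithBench.
Link: https://arxiv.org/abs/2510.19318
Authors: Fan Xu,Xinyu Hu,Zhenghan Yu,Li Lin,Xu Zhang,Yang Zhang,Wei Zhou,Jinjie Gu,Xiaojun Wan
Institutions: Wangxuan Institute of Computer Technology, Peking University; Alibaba Group; Fudan University
Categories: Computation and Language (cs.CL)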
Abstract:The increasing reliance on natural language generation (NLG) models, particularly large language models, has raised concerns about the reliability and accuracy of their outputs. A key challenge is hallucination, where models produce plausible but incorrect information. As a result, hallucination detection has become a critical task. In this work, we introduce a comprehensive hallucination taxonomy with 11 categories across various NLG tasks and propose the HAllucination Detection (HAD) models (this https URL), which integrate hallucination detection, span-level identification, and correction into a single inference process. Trained on an elaborate synthetic dataset of about 90K samples, our HAD models are versatile and can be applied to various NLG tasks. We also carefully annotate a test set for hallucination detection, called HADTest, which contains 2,248 samples. Evaluations on in-domain and out-of-domain test sets show that our HAD models generally outperform the existing baselines, achieving state-of-the-art results on HaluEval, FactCHD, and FaithBench, confirming their robustness and versatility.
[NLP-54] KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Augmentations and Constraints
Quick read: This paper addresses the difficulty that large multimodal models (LMMs) have in acquiring new knowledge after pre-training, together with their tendency toward catastrophic forgetting. The core challenge is to inject new knowledge effectively without damaging existing knowledge, balancing knowledge adaptation against knowledge retention. The key to the solution is KORE, which combines two cooperating mechanisms. First, KnOwledge-oRientEd augmentations convert individual knowledge items into structured, comprehensive knowledge so that new knowledge is learned accurately. Second, previous knowledge is stored in the covariance matrix of the LMM's linear layer activations, and the adapter is initialized by projecting the original weights into that matrix's null space; this defines a fine-tuning direction that minimizes interference with previous knowledge and strengthens retention.
Link: https://arxiv.org/abs/2510.19316
Authors: Kailin Jiang,Hongbo Jiang,Ning Jiang,Zhi Gao,Jinhe Bi,Yuchen Ren,Bin Li,Yuntao Du,Lei Liu,Qing Li
Institutions: University of Science and Technology of China; State Key Laboratory of General Artificial Intelligence, BIGAI; Xiamen University; Northeast Forestry University; Beijing Institute of Technology; Ludwig Maximilian University of Munich; University of Sydney; Shandong University
Categories: Computation and Language (cs.CL)
Comments: project page: this https URL
Abstract:Large Multimodal Models encode extensive factual knowledge in their pre-trained weights. However, their knowledge remains static and limited, unable to keep pace with real-world developments, which hinders continuous knowledge acquisition. Effective knowledge injection thus becomes critical, involving two goals: knowledge adaptation (injecting new knowledge) and knowledge retention (preserving old knowledge). Existing methods often struggle to learn new knowledge and suffer from catastrophic forgetting. To address this, we propose KORE, a synergistic method of KnOwledge-oRientEd augmentations and constraints for injecting new knowledge into large multimodal models while preserving old knowledge. Unlike general text or image data augmentation, KORE automatically converts individual knowledge items into structured and comprehensive knowledge to ensure that the model accurately learns new knowledge, enabling accurate adaptation. Meanwhile, KORE stores previous knowledge in the covariance matrix of LMM’s linear layer activations and initializes the adapter by projecting the original weights into the matrix’s null space, defining a fine-tuning direction that minimizes interference with previous knowledge, enabling powerful retention. Extensive experiments on various LMMs, including LLaVA-v1.5-7B, LLaVA-v1.5-13B, and Qwen2.5-VL-7B, show that KORE achieves superior new knowledge injection performance and effectively mitigates catastrophic forgetting.
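[NLP-55] JointCQ: Improving Factual Hallucination Detection with Joint Claim and Query Generation
Quick read: This paper addresses hallucination in large language models (LLMs), that is, output that appears factual but is actually unreliable. A typical hallucination detection pipeline consists of response decomposition (claim extraction), query generation, evidence collection, and claim verification, but existing methods are limited in the first two stages: context is lost during claim extraction, and generated queries lack specificity, which degrades downstream retrieval and verification. The key to the solution is JointCQ, a joint claim-and-query generation framework that uses carefully designed evaluation criteria to filter synthesized training data and fine-tunes a language model to perform claim extraction and query generation jointly. This gives downstream search and verification more reliable and informative inputs and improves hallucination detection on multiple open-domain QA benchmarks.
Link: https://arxiv.org/abs/2510.19310
Authors: Fan Xu,Huixuan Zhang,Zhenliang Zhang,Jiahao Wang,Xiaojun Wan
Institutions: Wangxuan Institute of Computer Technology, Peking University; Trustworthy Technology and Engineering Laboratory, Huawei
Categories: Computation and Language (cs.CL)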
Abstract:Current large language models (LLMs) often suffer from hallucination issues, i.e., generating content that appears factual but is actually unreliable. A typical hallucination detection pipeline involves response decomposition (i.e., claim extraction), query generation, evidence collection (i.e., search or retrieval), and claim verification. However, existing methods exhibit limitations in the first two stages, such as context loss during claim extraction and low specificity in query generation, resulting in degraded performance across the hallucination detection pipeline. In this work, we introduce JointCQ (this https URL), a joint claim-and-query generation framework designed to construct an effective and efficient claim-query generator. Our framework leverages elaborately designed evaluation criteria to filter synthesized training data, and finetunes a language model for joint claim extraction and query generation, providing reliable and informative inputs for downstream search and verification. Experimental results demonstrate that our method outperforms previous methods on multiple open-domain QA hallucination detection benchmarks, advancing the goal of more trustworthy and transparent language model systems.
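[NLP-56] TheMCPCompany: Creating General-purpose Agents with Task-specific Tools
Quick read: This paper addresses the reliance of current general-purpose agents built on large language models (LLMs) on generic tools such as web browsers for interacting with the environment, and asks how large sets of task-specific tools can be used to improve agent performance. The core challenge is that, although many task-specific tools can be built from REST APIs (TheMCPCompany includes over 18,000), models struggle to retrieve and combine them on complex tasks. The key to the solution is TheMCPCompany, a benchmark that exposes the REST APIs of real-world services as MCP servers and provides manually annotated ground-truth tools for each task, so that the potential of tool-calling agents under perfect retrieval can be compared with their performance under actual tool retrieval. Experiments show that all models with tool retrieval perform similarly to or better than browser-based agents, but smaller models cannot take full advantage of the available tools, while GPT-5 with tool retrieval comes very close to its performance with ground-truth tools. Overall, current models still face reasoning and retrieval bottlenecks in complex enterprise environments.
Link: https://arxiv.org/abs/2510.19286
Authors: Reza Esfandiarpoor,Vishwas Suryanarayanan,Stephen H. Bach,Vishal Chowdhary,Anthony Aue
Institutions: Brown University; Microsoft
Categories: Computation and Language (cs.CL)
Comments: Code: this https URL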
Abstract:Since the introduction of the Model Context Protocol (MCP), the number of available tools for Large Language Models (LLMs) has increased significantly. These task-specific tool sets offer an alternative to general-purpose tools such as web browsers, while being easier to develop and maintain than GUIs. However, current general-purpose agents predominantly rely on web browsers for interacting with the environment. Here, we introduce TheMCPCompany, a benchmark for evaluating tool-calling agents on tasks that involve interacting with various real-world services. We use the REST APIs of these services to create MCP servers, which include over 18,000 tools. We also provide manually annotated ground-truth tools for each task. In our experiments, we use the ground truth tools to show the potential of tool-calling agents for both improving performance and reducing costs assuming perfect tool retrieval. Next, we explore agent performance using tool retrieval to study the real-world practicality of tool-based agents. While all models with tool retrieval perform similarly or better than browser-based agents, smaller models cannot take full advantage of the available tools through retrieval. On the other hand, GPT-5’s performance with tool retrieval is very close to its performance with ground-truth tools. Overall, our work shows that the most advanced reasoning models are effective at discovering tools in simpler environments, but seriously struggle with navigating complex enterprise environments. TheMCPCompany reveals that navigating tens of thousands of tools and combining them in non-trivial ways to solve complex problems is still a challenging task for current models and requires both better reasoning and better retrieval models.
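[NLP-57] Difficulty-Controllable Multiple-Choice Question Generation Using Large Language Models and Direct Preference Optimization
Quick read: This paper addresses the problem of controlling difficulty in question generation for reading comprehension, targeting two limitations of conventional methods: they cannot directly generate multiple-choice questions, the most widely used question type in education, and they are not explicitly trained to optimize the accuracy of difficulty control. The key to the solution is a difficulty-controllable multiple-choice question generation method based on a large language model trained with direct preference optimization (DPO) to improve the accuracy of difficulty control.
Link: https://arxiv.org/abs/2510.19265
Authors: Yuto Tomikawa,Masaki Uto
Institutions: University of Electro-Communications
Categories: Computation and Language (cs.CL)
Comments: This work has been submitted to the IEEE for possible publication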
Abstract:Difficulty-controllable question generation for reading comprehension has gained significant attention in the field of education as a fundamental tool for adaptive learning support. Although several neural question generation methods have recently succeeded in controlling difficulty, conventional approaches still face two major limitations. First, they cannot directly generate multiple-choice questions, which are the most widely used question type in educational contexts. Second, they are not explicitly trained to optimize the accuracy of difficulty control, leaving room for further improvement in difficulty controllability. To address these limitations, this study proposes a novel difficulty-controllable multiple-choice question generation method for reading comprehension which leverages a large language model trained using a direct preference optimization technique to improve the accuracy of difficulty control.
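For readers unfamiliar with direct preference optimization, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. It shows the general objective only. The abstract does not say how preference pairs are constructed, so treating the question whose difficulty is closer to the target as the "chosen" one is an assumption, as are the function name and the default `beta`.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more the policy likes each response than the frozen reference model does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Hypothetical log-probabilities: the policy already prefers the chosen question.
print(dpo_loss(-1.0, -5.0, -3.0, -3.0))
# The same numbers with the preference reversed give a larger loss.
print(dpo_loss(-5.0, -1.0, -3.0, -3.0))
```

The loss shrinks as the policy raises the chosen response relative to the rejected one, measured against the reference model, so no separate reward model is needed.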
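[NLP-58] SheetBrain: A Neuro-Symbolic Agent for Accurate Reasoning over Complex and Large Spreadsheets
Quick read: This paper addresses two core challenges that large language models (LLMs) face on complex spreadsheets: accurately capturing the complex structure of tables, and ensuring that reasoning is correct. The authors propose SheetBrain, a neuro-symbolic dual-workflow agent framework built from three modules. The understanding module produces an overview of the spreadsheet, including a sheet summary and query-based problem insight, to guide reasoning. The execution module integrates a Python sandbox with preloaded table-processing libraries and an Excel helper toolkit for multi-turn reasoning. The validation module checks the correctness of the reasoning and answers and triggers re-execution when necessary, forming a closed correction loop. SheetBrain significantly improves accuracy on multiple public benchmarks and on the newly introduced SheetBench, which targets large, multi-table, structurally complex spreadsheets.
Link: https://arxiv.org/abs/2510.19247
Authors: Ziwei Wang,Jiayuan Su,Mengyu Zhou,Huaxing Zeng,Mengni Jia,Xiao Lv,Haoyu Dong,Xiaojun Ma,Shi Han,Dongmei Zhang
Institutions: unknown
Categories: Computation and Language (cs.CL)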
Abstract:Understanding and reasoning over complex spreadsheets remain fundamental challenges for large language models (LLMs), which often struggle with accurately capturing the complex structure of tables and ensuring reasoning correctness. In this work, we propose SheetBrain, a neuro-symbolic dual workflow agent framework designed for accurate reasoning over tabular data, supporting both spreadsheet question answering and manipulation tasks. SheetBrain comprises three core modules: an understanding module, which produces a comprehensive overview of the spreadsheet - including sheet summary and query-based problem insight to guide reasoning; an execution module, which integrates a Python sandbox with preloaded table-processing libraries and an Excel helper toolkit for effective multi-turn reasoning; and a validation module, which verifies the correctness of reasoning and answers, triggering re-execution when necessary. We evaluate SheetBrain on multiple public tabular QA and manipulation benchmarks, and introduce SheetBench, a new benchmark targeting large, multi-table, and structurally complex spreadsheets. Experimental results show that SheetBrain significantly improves accuracy on both existing benchmarks and the more challenging scenarios presented in SheetBench. Our code is publicly available at this https URL.
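[NLP-59] Modality Matching Matters: Calibrating Language Distances for Cross-Lingual Transfer in URIEL
Quick read: This paper addresses two key limitations of existing linguistic knowledge bases such as URIEL+ in cross-lingual transfer: their one-size-fits-all vector representations are ill-suited to the diverse structures of linguistic data, and they lack a principled way to aggregate geographic, genetic, and typological distance signals into a single comprehensive score. The key to the solution is a framework of type-matched language distances with a structure-aware representation for each distance type: speaker-weighted distributions for geography, hyperbolic embeddings for genealogy, and a latent variables model for typology. These heterogeneous signals are then unified into a robust, task-agnostic composite distance that improves transfer-language selection across a wide range of NLP tasks.
Link: https://arxiv.org/abs/2510.19217
Authors: York Hay Ng,Aditya Khan,Xiang Lu,Matteo Salloum,Michael Zhou,Phuong H. Hoang,A. Seza Doğruöz,En-Shiun Annie Lee
Institutions: unknown
Categories: Computation and Language (cs.CL)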
Abstract:Existing linguistic knowledge bases such as URIEL+ provide valuable geographic, genetic and typological distances for cross-lingual transfer but suffer from two key limitations. One, their one-size-fits-all vector representations are ill-suited to the diverse structures of linguistic data, and two, they lack a principled method for aggregating these signals into a single, comprehensive score. In this paper, we address these gaps by introducing a framework for type-matched language distances. We propose novel, structure-aware representations for each distance type: speaker-weighted distributions for geography, hyperbolic embeddings for genealogy, and a latent variables model for typology. We unify these signals into a robust, task-agnostic composite distance. In selecting transfer languages, our representations and composite distances consistently improve performance across a wide range of NLP tasks, providing a more principled and effective toolkit for multilingual research.
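The abstract mentions hyperbolic embeddings for genealogical distance without saying which model of hyperbolic space is used. As one common choice, and purely as an assumption here, the sketch below computes the geodesic distance in the Poincaré ball, where points near the boundary sit far apart. That geometry is what makes hyperbolic space a natural fit for tree-like data such as language family trees. The coordinates are made up for illustration.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincaré ball."""
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)


# Hypothetical 2-D embeddings: a root near the origin, two leaves near the boundary.
root = (0.0, 0.0)
leaf_a = (0.0, 0.9)
leaf_b = (0.9, 0.0)
print(poincare_distance(root, leaf_a))    # root to leaf
print(poincare_distance(leaf_a, leaf_b))  # leaf to leaf, much larger than the Euclidean gap suggests
```

Distances between leaves grow quickly as both approach the boundary, which mirrors how two languages deep in different branches of a family tree are separated by a long path through their common ancestor.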
[NLP-60] DiSRouter: Distributed Self-Routing for LLM Selections
Link: https://arxiv.org/abs/2510.19208
Authors: Hang Zheng,Hongshen Xu,Yongkai Lin,Shuai Fan,Lu Chen,Kai Yu
Institutions: X-LANCE Lab, School of Computer Science, MoE Key Lab of Artificial Intelligence, SJTU AI Institute, Shanghai Jiao Tong University, Shanghai, China; Jiangsu Key Lab of Language Computing, Suzhou, China; AISpeech Co., Ltd., Suzhou, China; Suzhou Laboratory, Suzhou, China
Categories: Computation and Language (cs.CL)
[NLP-61] Multi-Faceted Evaluation of Tool-Augmented Dialogue Systems
Quick read: This paper addresses a gap in the evaluation of tool-augmented dialogue systems: existing methods assess either user satisfaction or the agent's tool-calling ability, and so miss critical errors in multi-turn interactions where the agent misinterprets tool results yet still appears satisfactory to the user. The solution has two components. TRACE is a benchmark of systematically synthesized tool-augmented conversations covering diverse error cases. SCOPE is an evaluation framework that automatically discovers diverse error patterns and evaluation rubrics, so that errors can be identified even when user satisfaction signals are misleading. Experiments show that SCOPE significantly outperforms the baseline, particularly on challenging cases.
Link: https://arxiv.org/abs/2510.19186
Authors: Zhaoyi Joey Hou,Tanya Shourya,Yingfan Wang,Shamik Roy,Vinayshekhar Bannihatti Kumar,Rashmi Gangadharaiah
Institutions: University of Pittsburgh; AWS AI Labs
Categories: Computation and Language (cs.CL)
Comments: The first two authors contributed equally. Manuscript under submission
Abstract:Evaluating conversational AI systems that use external tools is challenging, as errors can arise from complex interactions among user, agent, and tools. While existing evaluation methods assess either user satisfaction or agents’ tool-calling capabilities, they fail to capture critical errors in multi-turn tool-augmented dialogues, such as when agents misinterpret tool results yet appear satisfactory to users. We introduce TRACE, a benchmark of systematically synthesized tool-augmented conversations covering diverse error cases, and SCOPE, an evaluation framework that automatically discovers diverse error patterns and evaluation rubrics in tool-augmented dialogues. Experiments show SCOPE significantly outperforms the baseline, particularly on challenging cases where user satisfaction signals are misleading.
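[NLP-62] Interpretable Question Answering with Knowledge Graphs
Quick read: This paper addresses the high computational cost and poor interpretability of question answering systems that rely on retrieval-augmented generation (RAG) with large language models (LLMs), and proposes a method that operates on knowledge graph (KG) retrieval alone. The key to the solution is a two-stage pipeline. The first stage pre-processes a document to generate question-answer (QA) pairs. The second stage converts the QA pairs into a knowledge graph, retrieves from it using embeddings and fuzzy matching, re-ranks the results, and paraphrases the retrieved entity relationship edges into a final answer. No LLM generates the answer; a small paraphraser model is used instead. In an LLM-as-a-judge evaluation on the CRAG benchmark, the system reaches accuracies of 71.9% and 54.4% using LLAMA-3.2 and GPT-3.5-Turbo, respectively.
Link: https://arxiv.org/abs/2510.19181
Authors: Kartikeya Aneja,Manasvi Srivastava,Subhayan Das,Nagender Aneja
Institutions: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)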
Abstract:This paper presents a question answering system that operates exclusively on a knowledge graph retrieval without relying on retrieval augmented generation (RAG) with large language models (LLMs). Instead, a small paraphraser model is used to paraphrase the entity relationship edges retrieved from querying the knowledge graph. The proposed pipeline is divided into two main stages. The first stage involves pre-processing a document to generate sets of question-answer (QA) pairs. The second stage converts these QAs into a knowledge graph from which graph-based retrieval is performed using embeddings and fuzzy techniques. The graph is queried, re-ranked, and paraphrased to generate a final answer. This work includes an evaluation using LLM-as-a-judge on the CRAG benchmark, which resulted in accuracies of 71.9% and 54.4% using LLAMA-3.2 and GPT-3.5-Turbo, respectively.
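The retrieval step described above can be illustrated with a toy sketch that ranks knowledge-graph edges by fuzzy string similarity to the question. It covers only the fuzzy-matching part, using the standard-library `difflib`. The embedding-based retrieval, re-ranking, and paraphrasing stages from the abstract are not reproduced, and the triples, function names, and `top_k` parameter are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive fuzzy similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def retrieve_edges(question, triples, top_k=2):
    """Return the top_k (subject, relation, object) triples most similar to the question."""
    scored = []
    for subj, rel, obj in triples:
        edge_text = f"{subj} {rel} {obj}"
        scored.append((similarity(question, edge_text), (subj, rel, obj)))
    # Highest similarity first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [triple for _, triple in scored[:top_k]]


# Hypothetical knowledge graph stored as triples.
graph = [
    ("Paris", "capital of", "France"),
    ("Tokyo", "capital of", "Japan"),
    ("Seine", "flows through", "Paris"),
]
for edge in retrieve_edges("What is the capital of France?", graph, top_k=1):
    print(edge)
```

Because the answer is assembled from retrieved edges rather than generated freely, each answer can be traced back to specific triples, which is the interpretability argument the paper's title points to.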
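[NLP-63] The Zero-Step Thinking: An Empirical Study of Mode Selection as Harder Early Exit in Reasoning Models NEURIPS’25
Quick read: This paper addresses the excessive computational overhead that overthinking causes when reasoning models run chain-of-thought (CoT). The core questions are how to choose the reasoning mode (Long-CoT or Short-CoT) automatically before reasoning begins, and how to determine the best early-exit point during reasoning. The central observation is that Mode Selection is a harder variant of Early Exit: Mode Selection must decide at the very start, relying on pre-defined fake thoughts without any explicit reasoning (zero-step thinking), whereas Early Exit can judge the best stopping point dynamically during iterative reasoning. The study finds that methods using the model's internal information generally outperform prompt-based methods but remain unstable, which indicates that existing methods are not yet sufficient for Mode Selection when information is limited.
Link: https://arxiv.org/abs/2510.19176
Authors: Yuqiao Tan,Shizhu He,Kang Liu,Jun Zhao
Institutions: The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Shanghai Artificial Intelligence Laboratory
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by NeurIPS’25 Efficient Reasoning Workshop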
Abstract:Reasoning models have demonstrated exceptional performance in tasks such as mathematics and logical reasoning, primarily due to their ability to engage in step-by-step thinking during the reasoning process. However, this often leads to overthinking, resulting in unnecessary computational overhead. To address this issue, Mode Selection aims to automatically decide between Long-CoT (Chain-of-Thought) or Short-CoT by utilizing either a Thinking or NoThinking mode. Simultaneously, Early Exit determines the optimal stopping point during the iterative reasoning process. Both methods seek to reduce the computational burden. In this paper, we first identify Mode Selection as a more challenging variant of the Early Exit problem, as they share similar objectives but differ in decision timing. While Early Exit focuses on determining the best stopping point for concise reasoning at inference time, Mode Selection must make this decision at the beginning of the reasoning process, relying on pre-defined fake thoughts without engaging in an explicit reasoning process, referred to as zero-step thinking. Through empirical studies on nine baselines, we observe that prompt-based approaches often fail due to their limited classification capabilities when provided with minimal hand-crafted information. In contrast, approaches that leverage internal information generally perform better across most scenarios but still exhibit issues with stability. Our findings indicate that existing methods relying solely on the information provided by models are insufficient for effectively addressing Mode Selection in scenarios with limited information, highlighting the ongoing challenges of this task. Our code is available at this https URL.
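[NLP-64] When Facts Change: Probing LLMs on Evolving Knowledge with evolveQA
Quick read: This paper addresses the poor performance of large language models (LLMs) on temporal knowledge conflicts, that is, contradictions that arise when facts in the training data change over time. Existing evaluations are mostly built on structured knowledge bases such as Wikidata, focus on popular entities, and lack the dynamic structure needed to compare models with different knowledge cut-off dates fairly. The key to the solution is evolveQA, a benchmark built from three real-world, time-stamped corpora (AWS updates, Azure changes, and WHO disease outbreak reports). It identifies naturally occurring knowledge evolution and generates questions with gold answers tailored to different knowledge cut-off dates. Experiments show performance drops of up to 31% on evolveQA compared with static knowledge questions.
Link: https://arxiv.org/abs/2510.19172
Authors: Nishanth Sridhar Nakshatri,Shamik Roy,Manoj Ghuhan Arivazhagan,Hanhan Zhou,Vinayshekhar Bannihatti Kumar,Rashmi Gangadharaiah
Institutions: Purdue University; AWS AI Labs
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under submission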
Abstract:LLMs often fail to handle temporal knowledge conflicts, that is, contradictions arising when facts evolve over time within their training data. Existing studies evaluate this phenomenon through benchmarks built on structured knowledge bases like Wikidata, but they focus on widely-covered, easily-memorized popular entities and lack the dynamic structure needed to fairly evaluate LLMs with different knowledge cut-off dates. We introduce evolveQA, a benchmark specifically designed to evaluate LLMs on temporally evolving knowledge, constructed from 3 real-world, time-stamped corpora: AWS updates, Azure changes, and WHO disease outbreak reports. Our framework identifies naturally occurring knowledge evolution and generates questions with gold answers tailored to different LLM knowledge cut-off dates. Through extensive evaluation of 12 open and closed-source LLMs across 3 knowledge probing formats, we demonstrate significant performance drops of up to 31% on evolveQA compared to static knowledge questions.
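One way to picture "gold answers tailored to different knowledge cut-off dates" is to keep every time-stamped version of a fact and select the latest one a model could have seen. The sketch below does exactly that. It is an illustration of the idea only: the abstract does not describe the benchmark's actual construction procedure, and the function name and the sample timeline are invented.

```python
from datetime import date


def gold_answer_at(versions, cutoff):
    """Return the most recent value stamped on or before `cutoff`, or None if there is none."""
    known = [(stamp, value) for stamp, value in versions if stamp <= cutoff]
    if not known:
        return None
    return max(known, key=lambda item: item[0])[1]


# Hypothetical timeline of one evolving fact (not taken from the benchmark).
versions = [
    (date(2022, 3, 1), "limit: 100 requests/s"),
    (date(2023, 6, 15), "limit: 250 requests/s"),
    (date(2024, 9, 1), "limit: 500 requests/s"),
]
for cutoff in (date(2021, 1, 1), date(2023, 12, 31), date(2025, 1, 1)):
    print(cutoff, "->", gold_answer_at(versions, cutoff))
```

With this framing, two models with different cut-off dates can be scored against different gold answers for the same question, so an older model is not penalized for missing an update published after its training data ends.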
[NLP-65] Think Straight, Stop Smart: Structured Reasoning for Efficient Multi-Hop RAG NEURIPS2025
Link: https://arxiv.org/abs/2510.19171
Authors: Jihwan Bang,Juntae Lee,Seunghan Yang,Sungha Choi
Institutions: unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted at NeurIPS 2025 Workshop
[NLP-66] OpenGuardrails: An Open-Source Context-Aware AI Guardrails Platform
Link: https://arxiv.org/abs/2510.19169
Authors: Thomas Wang,Haowen Li
Institutions: OpenGuardrails.com; The Hong Kong Polytechnic University
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
[NLP-67] “You Are Rejected!”: An Empirical Study of Large Language Models Taking Hiring Evaluations
Link: https://arxiv.org/abs/2510.19167
Authors: Dingjie Fu,Dianxing Shi
Institutions: Huazhong University of Science and Technology (HUST)
Categories: Computation and Language (cs.CL)
Comments: Technical Report, 14 pages, 8 figures
[NLP-68] Tibetan Language and AI: A Comprehensive Survey of Resources, Methods, and Challenges
Quick read: This paper addresses the data scarcity, missing evaluation standards, and immature tooling that Tibetan, a major low-resource language in Asia, faces in AI research. The main challenges are scarce text and speech data, complex orthographic variation, and the lack of unified evaluation metrics. The survey systematically catalogues existing datasets, tools, and methods for Tibetan natural language processing (NLP) tasks, identifies bottlenecks, and discusses cross-lingual transfer, multi-modal learning, and community-driven resource creation as routes toward a sustainable, inclusive AI ecosystem for low-resource languages.
Link: https://arxiv.org/abs/2510.19144
Authors: Cheng Huang,Nyima Tashi,Fan Gao,Yutong Liu,Jiahao Li,Hao Tian,Siyang Jiang,Thupten Tsering,Ban Ma-bao,Renzeg Duojie,Gadeng Luosang,Rinchen Dongrub,Dorje Tashi,Jin Zhang,Xiao Feng,Hao Wang,Jie Tang,Guojie Tang,Xiangxiang Wang,Jia Zhang,Tsengdar Lee,Yongbin Yu
Institutions: University of Electronic Science and Technology of China, Chengdu, 610056, China; Southern Methodist University, Dallas, 75205, USA; Tibet University, Lhasa 850000, China; The City University of Hong Kong, Kowloon, Hong Kong SAR 999077, China; The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong SAR 999077, China; The Chinese University of Hong Kong, Sha Tin, Hong Kong SAR 999077, China; University of Connecticut, Storrs, 06269, USA; Tsinghua University, Beijing 100084, China; University of Texas at Arlington, Arlington, TX 76019, USA; University of Chinese Academy of Sciences, Beijing 100049, China
Categories: Computation and Language (cs.CL)
Abstract:Tibetan, one of the major low-resource languages in Asia, presents unique linguistic and sociocultural characteristics that pose both challenges and opportunities for AI research. Despite increasing interest in developing AI systems for underrepresented languages, Tibetan has received limited attention due to a lack of accessible data resources, standardized benchmarks, and dedicated tools. This paper provides a comprehensive survey of the current state of Tibetan AI, covering textual and speech data resources, NLP tasks, machine translation, speech recognition, and recent developments in LLMs. We systematically categorize existing datasets and tools, evaluate methods used across different tasks, and compare performance where possible. We also identify persistent bottlenecks such as data sparsity, orthographic variation, and the lack of unified evaluation metrics. Additionally, we discuss the potential of cross-lingual transfer, multi-modal learning, and community-driven resource creation. This survey aims to serve as a foundational reference for future work on Tibetan AI research and encourages collaborative efforts to build an inclusive and sustainable AI ecosystem for low-resource languages.
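[NLP-69] A Multi-faceted Analysis of Cognitive Abilities: Evaluating Prompt Methods with Large Language Models on the CONSORT Checklist
Link: https://arxiv.org/abs/2510.19139
Authors: Sohyeon Jeon,Hyung-Chul Lee
Institutions: unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[NLP-70] Training-Free Spectral Fingerprints of Voice Processing in Transformers
Quick read: This paper asks how the processing strategies that different language models acquire during training can be identified and quantified from the structure of their internal computation, and in particular whether those strategies leave detectable “computational fingerprints” in tasks such as syntactic transformation. The key to the solution is spectral analysis of attention-induced token graphs using graph signal processing: the change in algebraic connectivity (Fiedler value, Δλ₂) serves as a training-free indicator of how early layers respond under voice alternation. The spectral signatures correlate strongly with behavioral differences (for Phi-3-Mini, r = -0.976) and are modulated by targeted attention-head ablations, which links the effect to early attention structure and suggests that training emphasis can show up as measurable connectivity patterns.
Link: https://arxiv.org/abs/2510.19131
Authors: Valentin Noël
Institutions: Devoteam
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
Comments: Preprint under review (2025). 12 pages, 8 figures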
Abstract:Different transformer architectures implement identical linguistic computations via distinct connectivity patterns, yielding model-imprinted “computational fingerprints” detectable through spectral analysis. Using graph signal processing on attention-induced token graphs, we track changes in algebraic connectivity (Fiedler value, $\Delta\lambda_2$) under voice alternation across 20 languages and three model families, with a prespecified early window (layers 2–5). Our analysis uncovers clear architectural signatures: Phi-3-Mini shows a dramatic English-specific early-layer disruption ($\overline{\Delta\lambda_2}_{[2,5]} \approx -0.446$) while effects in 19 other languages are minimal, consistent with public documentation that positions the model primarily for English use. Qwen2.5-7B displays small, distributed shifts that are largest for morphologically rich languages, and LLaMA-3.2-1B exhibits systematic but muted responses. These spectral signatures correlate strongly with behavioral differences (Phi-3: $r = -0.976$) and are modulated by targeted attention head ablations, linking the effect to early attention structure and confirming functional relevance. Taken together, the findings are consistent with the view that training emphasis can leave detectable computational imprints: specialized processing strategies that manifest as measurable connectivity patterns during syntactic transformations. Beyond voice alternation, the framework differentiates reasoning modes, indicating utility as a simple, training-free diagnostic for revealing architectural biases and supporting model reliability analysis.
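[NLP-71] A Graph Signal Processing Framework for Hallucination Detection in Large Language Models
[Quick read]: This paper addresses the difficulty of distinguishing factual reasoning from hallucination in the output of large language models (LLMs). The key to the solution is a spectral-analysis framework that models transformer layers as dynamic graphs induced by attention, with token embeddings as signals on those graphs, and uses graph signal processing (GSP) to define diagnostics (Dirichlet energy, spectral entropy, and high-frequency energy ratio) that are theoretically connected to computational stability. Experiments indicate that different hallucination types (logical contradictions, semantic errors, and substitution hallucinations) show distinguishable spectral signatures, and a simple detector built on them reaches 88.75% accuracy versus 75% for perplexity-based baselines. The results suggest that spectral geometry may capture reasoning patterns and error behaviors.
Link: https://arxiv.org/abs/2510.19117
Authors: Valentin Noël
Affiliations: Devoteam
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
Comments: Preprint under review (2025). 11 pages, 7 figures. Code and scripts: to be released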
Abstract:Large language models achieve impressive results but distinguishing factual reasoning from hallucinations remains challenging. We propose a spectral analysis framework that models transformer layers as dynamic graphs induced by attention, with token embeddings as signals on these graphs. Through graph signal processing, we define diagnostics including Dirichlet energy, spectral entropy, and high-frequency energy ratios, with theoretical connections to computational stability. Experiments across GPT architectures suggest universal spectral patterns: factual statements exhibit consistent "energy mountain" behavior with low-frequency convergence, while different hallucination types show distinct signatures. Logical contradictions destabilize spectra with large effect sizes ($g > 1.0$), semantic errors remain stable but show connectivity drift, and substitution hallucinations display intermediate perturbations. A simple detector using spectral signatures achieves 88.75% accuracy versus 75% for perplexity-based baselines, demonstrating practical utility. These findings indicate that spectral geometry may capture reasoning patterns and error behaviors, potentially offering a framework for hallucination detection in large language models.
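The three diagnostics named in this abstract can be sketched with standard graph signal processing definitions. The version below is a minimal sketch, assuming a symmetrized attention graph, the unnormalized Laplacian, and a hypothetical `high_freq_fraction` cutoff for what counts as "high frequency"; the paper's own definitions and thresholds are not given in the abstract and may differ.

```python
import numpy as np


def spectral_diagnostics(attention, signal, high_freq_fraction=0.5):
    """Dirichlet energy, spectral entropy, and high-frequency energy ratio.

    attention: (n, n) attention weights for one layer or head.
    signal:    (n, d) token embeddings treated as a signal on the graph.
    """
    weights = 0.5 * (attention + attention.T)  # assumption: undirected graph
    np.fill_diagonal(weights, 0.0)
    laplacian = np.diag(weights.sum(axis=1)) - weights
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)  # ascending frequencies

    # Dirichlet energy: trace(X^T L X), small when neighbors carry similar values.
    dirichlet = float(np.trace(signal.T @ laplacian @ signal))

    # Graph Fourier transform: project the signal onto the Laplacian eigenvectors.
    coefficients = eigenvectors.T @ signal
    energy = (coefficients ** 2).sum(axis=1)  # energy per graph frequency
    total = float(energy.sum())
    if total == 0.0:
        return {"dirichlet_energy": dirichlet, "spectral_entropy": 0.0, "hfer": 0.0}

    share = energy / total
    nonzero = share[share > 0]
    entropy = float(-(nonzero * np.log(nonzero)).sum())

    # Assumption: the top `high_freq_fraction` of eigenvalues count as high frequency.
    cutoff = int(len(eigenvalues) * (1.0 - high_freq_fraction))
    hfer = float(energy[cutoff:].sum() / total)
    return {"dirichlet_energy": dirichlet, "spectral_entropy": entropy, "hfer": hfer}
```

A smooth signal (all tokens carrying the same value on a connected graph) gives zero Dirichlet energy and zero high-frequency ratio, which is the low-frequency end of the behavior the abstract describes for factual statements.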
[NLP-72] That's Deprecated! Understanding, Detecting, and Steering Knowledge Conflicts in Language Models for Code Generation
Link: https://arxiv.org/abs/2510.19116
Authors: Jaesung Bae,Cameron Churchwell,Mitchell Hermon,Tsun-An Hsieh,Jocelyn Xu,Yekaterina Yegorova,Mark Hasegawa-Johnson,Heng Ji
Affiliations: University of Illinois Urbana-Champaign
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
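[NLP-73] PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions
[Quick read]: This paper addresses the lack of effective evaluation metrics for detailed image descriptions generated by vision-language models (VLMs). Standard metrics such as CIDEr and SPICE were designed for short texts and are tuned to errors that are now uncommon, such as object misidentification; they struggle to capture fine-grained attribute and relation errors in long texts and cannot localize errors to specific text spans. The authors propose PoSh, a metric for detailed image description whose core idea is to use scene graphs as structured rubrics that guide an LLM-as-a-Judge, producing aggregate scores grounded in fine-grained errors (for example, mistakes in compositional understanding). PoSh is replicable and interpretable, and it tracks human judgment more closely than existing metrics, including GPT-4o-as-a-Judge. To validate it, the authors build the DOCENT dataset, which pairs artwork with expert-written references and with coarse and granular quality judgments from art history students, enabling evaluation of detailed description in a challenging new domain.
Link: https://arxiv.org/abs/2510.19060
Authors: Amith Ananthram,Elias Stengel-Eskin,Lorena A. Bradford,Julia Demarest,Adam Purvis,Keith Krut,Robert Stein,Rina Elster Pantalony,Mohit Bansal,Kathleen McKeown
Affiliations: Columbia University; The University of Texas at Austin; The National Gallery of Art; UCLA; UNC Chapel Hill
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 24 pages, 9 figures. Metric/benchmark available at this https URL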
Abstract:While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge. Standard metrics (e.g. CIDEr, SPICE) were designed for short texts and tuned to recognize errors that are now uncommon, such as object misidentification. In contrast, long texts require sensitivity to attribute and relation attachments and scores that localize errors to particular text spans. In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g. mistakes in compositional understanding). PoSh is replicable, interpretable and a better proxy for human raters than existing metrics (including GPT4o-as-a-Judge). To validate PoSh, we introduce a challenging new dataset, DOCENT. This novel benchmark contains artwork, paired with expert-written references, and model-generated descriptions, augmented with granular and coarse judgments of their quality from art history students. Thus, DOCENT enables evaluating both detailed image description metrics and detailed image description itself in a challenging new domain. We show that PoSh achieves stronger correlations (+0.05 Spearman $\rho$) with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (using CapArena, an existing dataset of web imagery) and is a capable reward function, outperforming standard supervised fine-tuning. Then, using PoSh, we characterize the performance of open and closed models in describing the paintings, sketches and statues in DOCENT and find that foundation models struggle to achieve full, error-free coverage of images with rich scene dynamics, establishing a demanding new task to gauge VLM progress. Through both PoSh and DOCENT, we hope to enable advances in important areas such as assistive text generation.
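[NLP-74] From Memorization to Generalization: Fine-Tuning Large Language Models for Biomedical Term-to-Identifier Normalization
[Quick read]: This paper addresses term normalization in biomedical data integration, that is, mapping natural-language biomedical terms to standardized identifiers to achieve semantic interoperability. The key to the solution is a systematic evaluation of memorization and generalization in large language models (LLMs) across several biomedical ontologies, which shows that the effect of fine-tuning depends on two interacting factors: identifier popularity and lexicalization. Popular, lexicalized identifiers such as gene symbols strengthen memorization and semantic generalization, whereas the arbitrary, non-lexicalized identifiers used in GO and HPO constrain models to rote learning and limit what fine-tuning can achieve. The finding offers a framework for predicting when fine-tuning improves factual recall.
Link: https://arxiv.org/abs/2510.19036
Authors: Suswitha Pericharla,Daniel B. Hier,Tayo Obafemi-Ajayi
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: Submitted for publication to BMC BioData Mining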
Abstract:Effective biomedical data integration depends on automated term normalization, the mapping of natural language biomedical terms to standardized identifiers. This linking of terms to identifiers is essential for semantic interoperability. Large language models (LLMs) show promise for this task but perform unevenly across terminologies. We evaluated both memorization (training-term performance) and generalization (validation-term performance) across multiple biomedical ontologies. Fine-tuning Llama 3.1 8B revealed marked differences by terminology. GO mappings showed strong memorization gains (up to 77% improvement in term-to-identifier accuracy), whereas HPO showed minimal improvement. Generalization occurred only for protein-gene (GENE) mappings (13.9% gain), while fine-tuning for HPO and GO yielded negligible transfer. Baseline accuracy varied by model scale, with GPT-4o outperforming both Llama variants for all terminologies. Embedding analyses showed tight semantic alignment between gene symbols and protein names but weak alignment between terms and identifiers for GO or HPO, consistent with limited lexicalization. Fine-tuning success depended on two interacting factors: identifier popularity and lexicalization. Popular identifiers were more likely encountered during pretraining, enhancing memorization. Lexicalized identifiers, such as gene symbols, enabled semantic generalization. By contrast, arbitrary identifiers in GO and HPO constrained models to rote learning. These findings provide a predictive framework for when fine-tuning enhances factual recall versus when it fails due to sparse or non-lexicalized identifiers.
[NLP-75] When Can We Trust LLMs in Mental Health? Large-Scale Benchmarks for Reliable LLM Evaluation
[Quick read]: This paper addresses the difficulty of evaluating large language models (LLMs) for mental health support, where existing benchmarks are small, of limited reliability, dependent on synthetic or social media data, and lack a framework for judging when automated judges can be trusted. The key to the solution is two new benchmarks. MentalBench-100k provides 100,000 response pairs built on real conversations for large-scale generation evaluation. MentalAlign-70k compares four high-performing LLM judges with human experts across 70,000 ratings and applies the Affective Cognitive Agreement Framework, which uses intraclass correlation coefficients (ICC) with confidence intervals to quantify agreement, consistency, and bias between the judges and the experts on the Cognitive Support Score (CSS) and the Affective Resonance Score (ARS). The analysis shows how LLM judges differ across attributes and provides a reproducible methodological basis for reliable LLM evaluation in mental health.
Link: https://arxiv.org/abs/2510.19032
Authors: Abeer Badawi,Elahe Rahimi,Md Tahmid Rahman Laskar,Sheri Grach,Lindsay Bertrand,Lames Danok,Jimmy Huang,Frank Rudzicz,Elham Dolatabadi
Affiliations: York University; Vector Institute; Dalhousie University; IWK Health Hospital; King’s College London
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Evaluating Large Language Models (LLMs) for mental health support is challenging due to the emotionally and cognitively complex nature of therapeutic dialogue. Existing benchmarks are limited in scale and reliability, often rely on synthetic or social media data, and lack frameworks to assess when automated judges can be trusted. To address the need for large-scale dialogue datasets and judge reliability assessment, we introduce two benchmarks that provide a framework for generation and evaluation. MentalBench-100k consolidates 10,000 one-turn conversations from three real-scenario datasets, each paired with nine LLM-generated responses, yielding 100,000 response pairs. MentalAlign-70k reframes evaluation by comparing four high-performing LLM judges with human experts across 70,000 ratings on seven attributes, grouped into Cognitive Support Score (CSS) and Affective Resonance Score (ARS). We then employ the Affective Cognitive Agreement Framework, a statistical methodology using intraclass correlation coefficients (ICC) with confidence intervals to quantify agreement, consistency, and bias between LLM judges and human experts. Our analysis reveals systematic inflation by LLM judges, strong reliability for cognitive attributes such as guidance and informativeness, reduced precision for empathy, and some unreliability in safety and relevance. Our contributions establish new methodological and empirical foundations for reliable, large-scale evaluation of LLMs in mental health. We release the benchmarks and codes at: this https URL
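To make the agreement statistic concrete, here is a minimal sketch of one common intraclass correlation, ICC(2,1) (two-way random effects, absolute agreement, single rater). The abstract does not say which ICC form the Affective Cognitive Agreement Framework uses, so this choice is an assumption, and confidence intervals are omitted. The function name `icc_2_1` is hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a table of ratings: one row per item, one column per rater."""
    n, k = len(ratings), len(ratings[0])
    if n < 2 or k < 2:
        raise ValueError("need at least two items and two raters")
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]

    # Two-way ANOVA sums of squares: items (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
```

Because this form measures absolute agreement, a judge that ranks items exactly like the human but scores every item one point higher is penalized: `icc_2_1([[1, 2], [2, 3], [3, 4]])` is about 0.67 rather than 1.0, which is the kind of systematic inflation the abstract reports.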
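[NLP-76] Re:Member: Emotional Question Generation from Personal Memories ACL2025
[Quick read]: This paper targets the limited interactivity and weak emotional connection in second language (L2) learning, aiming to raise learner engagement through emotionally expressive interaction grounded in personal memories. The key to the solution is a system called Re:Member, which draws on a user's personal videos to generate emotionally styled spoken questions in the target language and matches the emotional tone of the speech (such as whispers or late-night tones) to the visual context to encourage affective recall and conversational engagement. Its modular generation pipeline combines WhisperX-based transcript alignment, 3-frame visual sampling, and Style-BERT-VITS2 for emotional speech synthesis.
Link: https://arxiv.org/abs/2510.19030
Authors: Zackary Rackauckas,Nobuaki Minematsu,Julia Hirschberg
Affiliations: Columbia University; The University of Tokyo
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Accepted to HCI+NLP at ACL 2025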
Abstract:We present Re:Member, a system that explores how emotionally expressive, memory-grounded interaction can support more engaging second language (L2) learning. By drawing on users’ personal videos and generating stylized spoken questions in the target language, Re:Member is designed to encourage affective recall and conversational engagement. The system aligns emotional tone with visual context, using expressive speech styles such as whispers or late-night tones to evoke specific moods. It combines WhisperX-based transcript alignment, 3-frame visual sampling, and Style-BERT-VITS2 for emotional synthesis within a modular generation pipeline. Designed as a stylized interaction probe, Re:Member highlights the role of affect and personal media in learner-centered educational technologies.
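[NLP-77] Are they lovers or friends? Evaluating LLMs' Social Reasoning in English and Korean Dialogues
[Quick read]: This paper examines the weak social reasoning of current large language models (LLMs) in interpersonal settings, specifically their difficulty in inferring the relationship between two speakers (for example friends, sisters, or lovers). The key to the solution is SCRIPTS, a dataset of 1,000 dialogues in English and Korean sourced from movie scripts, in which native English and Korean speakers annotate each dialogue with probabilistic relationship labels (Highly Likely, Less Likely, Unlikely). Evaluating several LLMs on the dataset shows that current models reach about 75-80% on English but drop to 58-69% on Korean, choose Unlikely relationships in 10-25% of responses, and gain little from chain-of-thought prompting, which occasionally amplifies social biases. The results point to systematic limits in social reasoning and the need for socially aware language models.
Link: https://arxiv.org/abs/2510.19028
Authors: Eunsu Kim,Junyeong Park,Juhyun Oh,Kiwoong Park,Seyoung Song,A.Seza Dogruoz,Najoung Kim,Alice Oh
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: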
Abstract:As large language models (LLMs) are increasingly used in human-AI interactions, their social reasoning capabilities in interpersonal contexts are critical. We introduce SCRIPTS, a 1k-dialogue dataset in English and Korean, sourced from movie scripts. The task involves evaluating models’ social reasoning capability to infer the interpersonal relationships (e.g., friends, sisters, lovers) between speakers in each dialogue. Each dialogue is annotated with probabilistic relational labels (Highly Likely, Less Likely, Unlikely) by native (or equivalent) Korean and English speakers from Korea and the U.S. Evaluating nine models on our task, current proprietary LLMs achieve around 75-80% on the English dataset, whereas their performance on Korean drops to 58-69%. More strikingly, models select Unlikely relationships in 10-25% of their responses. Furthermore, we find that thinking models and chain-of-thought prompting, effective for general reasoning, provide minimal benefits for social reasoning and occasionally amplify social biases. Our findings reveal significant limitations in current LLMs’ social reasoning capabilities, highlighting the need for efforts to develop socially-aware language models.
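[NLP-78] Dynamic Evaluation for Oversensitivity in LLMs EMNLP
[Quick read]: This paper addresses oversensitivity, where language models defensively refuse prompts that are actually benign, which blurs the boundary between harmful and harmless content and degrades the user experience. Existing benchmarks rely on static datasets that suffer data contamination as models evolve and so lose evaluative power. The key to the solution is a framework that dynamically generates challenging datasets specific to each model, capturing emerging defensive patterns and matching each model's behavior. The resulting benchmark, OVERBENCH, covers 450,000 samples from 25 large language models (LLMs) and supports continuous monitoring of oversensitivity and identification of vulnerabilities.
Link: https://arxiv.org/abs/2510.19005
Authors: Sophia Xiao Pu,Sitao Cheng,Xin Eric Wang,William Yang Wang
Affiliations: University of California, Santa Barbara
Categories: Computation and Language (cs.CL)
Comments: EMNLP-Findings 2025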
Abstract:Oversensitivity occurs when language models defensively reject prompts that are actually benign. This behavior not only disrupts user interactions but also obscures the boundary between harmful and harmless content. Existing benchmarks rely on static datasets that degrade over time as models evolve, leading to data contamination and diminished evaluative power. To address this, we develop a framework that dynamically generates model-specific challenging datasets, capturing emerging defensive patterns and aligning with each model’s unique behavior. Building on this approach, we construct OVERBENCH, a benchmark that aggregates these datasets across diverse LLM families, encompassing 450,000 samples from 25 models. OVERBENCH provides a dynamic and evolving perspective on oversensitivity, allowing for continuous monitoring of defensive triggers as models advance, highlighting vulnerabilities that static datasets overlook.
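[NLP-79] ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge
[Quick read]: This paper addresses a limitation of current LLM evaluation in professional domains: existing evaluations are mostly confined to mathematics, programming, and short-form question answering, and do not measure how well large language models (LLMs) process professional documents, synthesize information, and generate comprehensive reports. The authors propose ProfBench, a dataset of more than 7,000 response-criterion pairs annotated by human experts with professional backgrounds (Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA). The key to the solution is building robust, low-cost LLM-Judges that mitigate self-enhancement bias and cut evaluation cost by 2-3 orders of magnitude, making professional-level evaluation fair and accessible to the wider community. Even a state-of-the-art model such as GPT-5-high reaches only 65.9% overall on ProfBench, and the results show notable gaps between proprietary and open-weight models and highlight the role of extended thinking in complex professional tasks.
Link: https://arxiv.org/abs/2510.18941
Authors: Zhilin Wang,Jaehun Jung,Ximing Lu,Shizhe Diao,Ellie Evans,Jiaqi Zeng,Pavlo Molchanov,Yejin Choi,Jan Kautz,Yi Dong
Affiliations: NVIDIA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 23 pages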
Abstract:Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses, limiting assessments to tasks like mathematics, programming, and short-form question-answering. However, many real-world applications require evaluating LLMs in processing professional documents, synthesizing information, and generating comprehensive reports in response to user queries. We introduce ProfBench: a set of over 7000 response-criterion pairs as evaluated by human-experts with professional knowledge across Physics PhD, Chemistry PhD, Finance MBA and Consulting MBA. We build robust and affordable LLM-Judges to evaluate ProfBench rubrics, by mitigating self-enhancement bias and reducing the cost of evaluation by 2-3 orders of magnitude, to make it fair and accessible to the broader community. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs, with top-performing models like GPT-5-high achieving only 65.9% overall performance. Furthermore, we identify notable performance disparities between proprietary and open-weight models and provide insights into the role that extended thinking plays in addressing complex, professional-domain tasks. Data: this https URL and Code: this https URL
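[NLP-80] NeuroAda: Activating Each Neuron's Potential for Parameter-Efficient Fine-Tuning
[Quick read]: This paper addresses a trade-off in parameter-efficient fine-tuning (PEFT) between memory efficiency and fine-grained adaptation. Addition-based methods such as LoRA are memory-efficient but limited in representational capacity, while selective in-situ adaptation allows more precise adjustment at a much higher memory cost. The key to the solution is NeuroAda, which first identifies important connections (parameters) in the network, then introduces bypass connections for those parameters and updates only the bypass connections during fine-tuning while the original parameters stay frozen. This combines fine-grained adaptation with high memory efficiency: across 23+ natural language generation and understanding tasks it performs strongly with ≤0.02% trainable parameters and reduces CUDA memory usage by up to 60%.
Link: https://arxiv.org/abs/2510.18940
Authors: Zhi Zhang,Yixian Shen,Congfeng Cao,Ekaterina Shutova
Affiliations: ILLC, University of Amsterdam; PCS, University of Amsterdam
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: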
Abstract:Existing parameter-efficient fine-tuning (PEFT) methods primarily fall into two categories: addition-based and selective in-situ adaptation. The former, such as LoRA, introduce additional modules to adapt the model to downstream tasks, offering strong memory efficiency. However, their representational capacity is often limited, making them less suitable for fine-grained adaptation. In contrast, the latter directly fine-tunes a carefully chosen subset of the original model parameters, allowing for more precise and effective adaptation, but at the cost of significantly increased memory consumption. To reconcile this trade-off, we propose NeuroAda, a novel PEFT method that enables fine-grained model finetuning while maintaining high memory efficiency. Our approach first identifies important parameters (i.e., connections within the network) as in selective adaptation, and then introduces bypass connections for these selected parameters. During finetuning, only the bypass connections are updated, leaving the original model parameters frozen. Empirical results on 23+ tasks spanning both natural language generation and understanding demonstrate that NeuroAda achieves state-of-the-art performance with as little as $\leq 0.02\%$ trainable parameters, while reducing CUDA memory usage by up to 60%. We release our code here: this https URL.
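The bypass idea in this abstract can be sketched in a few lines: pick a small set of connections per neuron, keep the original weights frozen, and train only a sparse set of offsets added at those positions. The sketch below is an illustration under stated assumptions, not the released implementation: selecting the top-k inputs by weight magnitude is an assumed importance criterion, and `select_topk_connections` and `effective_weight` are hypothetical names.

```python
import numpy as np


def select_topk_connections(weight: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k largest-magnitude input connections for each output neuron (row)."""
    # Assumption: magnitude stands in for "importance"; the paper may use another criterion.
    return np.argsort(-np.abs(weight), axis=1)[:, :k]


def effective_weight(weight: np.ndarray, indices: np.ndarray, delta: np.ndarray) -> np.ndarray:
    """Frozen weight plus the trainable bypass offsets at the selected positions."""
    merged = weight.copy()  # the original weight array is never modified
    rows = np.arange(weight.shape[0])[:, None]
    merged[rows, indices] += delta
    return merged
```

Only `delta`, of shape (number of neurons, k), would be trained, so the trainable fraction for one layer is k divided by the input width; the frozen weights need no gradients or optimizer state, which is where the memory saving would come from.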
[NLP-81] Lost in the Maze: Overcoming Context Limitations in Long-Horizon Agentic Search
Link: https://arxiv.org/abs/2510.18939
Authors: Howard Yen,Ashwin Paranjape,Mengzhou Xia,Thejas Venkatesh,Jack Hessel,Danqi Chen,Yuhao Zhang
Affiliations: Princeton Language and Intelligence, Princeton University; Samaya AI
Categories: Computation and Language (cs.CL)
Comments: Code and data are available here: this https URL
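[NLP-82] Evaluating LLM Story Generation through Large-scale Network Analysis of Social Structures NEURIPS2025
[Quick read]: This paper addresses the difficulty of evaluating the creative abilities of large language models (LLMs) at scale, especially for story generation, where human scoring does not scale. The authors propose a scalable method based on the structure of character relationships: each story is modeled as a signed character network and quantified through topological properties such as network density, clustering coefficient, and signed edge weights. The key idea is to use social-network structure as a proxy measure, which reveals that LLM-generated stories are biased toward tightly knit, positive relationships, in line with prior findings from human assessment, and yields an automated, scalable evaluation framework.
Link: https://arxiv.org/abs/2510.18932
Authors: Hiroshi Nonaka,K. E. Perry
Affiliations: Soka University of America
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: This paper has 14 pages and 8 figures. To be presented at the NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling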
Abstract:Evaluating the creative capabilities of large language models (LLMs) in complex tasks often requires human assessments that are difficult to scale. We introduce a novel, scalable methodology for evaluating LLM story generation by analyzing underlying social structures in narratives as signed character networks. To demonstrate its effectiveness, we conduct a large-scale comparative analysis using networks from over 1,200 stories, generated by four leading LLMs (GPT-4o, GPT-4o mini, Gemini 1.5 Pro, and Gemini 1.5 Flash) and a human-written corpus. Our findings, based on network properties like density, clustering, and signed edge weights, show that LLM-generated stories consistently exhibit a strong bias toward tightly-knit, positive relationships, which aligns with findings from prior research using human assessment. Our proposed approach provides a valuable tool for evaluating limitations and tendencies in the creative storytelling of current and future LLMs.
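The network properties mentioned in this abstract are standard graph statistics. The following minimal sketch computes density, average local clustering, and the share of positive edges for a signed character network. It assumes each undirected character pair appears once, keyed as a tuple, with a signed weight (positive for friendly ties, negative for hostile ones); how the paper extracts and signs edges is not described in the abstract, and `signed_network_stats` is a hypothetical name.

```python
from itertools import combinations


def signed_network_stats(edges):
    """Density, average clustering, and positive-edge share of a signed network.

    edges: dict mapping an undirected pair (a, b) to a signed weight.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    n = len(neighbors)
    density = 2 * len(edges) / (n * (n - 1)) if n > 1 else 0.0

    # Local clustering: fraction of a character's neighbor pairs that are themselves linked.
    coefficients = []
    for nbrs in neighbors.values():
        degree = len(nbrs)
        if degree < 2:
            coefficients.append(0.0)
            continue
        links = sum(1 for u, v in combinations(sorted(nbrs), 2) if v in neighbors[u])
        coefficients.append(2 * links / (degree * (degree - 1)))
    avg_clustering = sum(coefficients) / n if n else 0.0

    positive_share = (
        sum(1 for w in edges.values() if w > 0) / len(edges) if edges else 0.0
    )
    return {"density": density, "avg_clustering": avg_clustering, "positive_share": positive_share}
```

For a story with a friendly trio and one outsider in conflict with a single member, `signed_network_stats({("A", "B"): 1, ("B", "C"): 1, ("A", "C"): 1, ("A", "D"): -1})` gives a positive share of 0.75; values close to 1.0 for density, clustering, and positive share together would match the tightly knit, positive pattern the abstract reports for LLM stories.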
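[NLP-83] BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping
[Quick read]: This paper addresses a key difficulty in training large language models (LLMs) with reinforcement learning (RL) in off-policy settings: policy entropy drops sharply and optimization becomes unstable or even collapses. Through theoretical and empirical analysis, the authors identify two causes. First, negative-advantage samples dominate the policy gradient, suppressing useful behaviors and risking gradient explosions. Second, the fixed clipping mechanism in PPO-like objectives systematically blocks entropy-increasing updates, pushing the policy toward over-exploitation at the expense of exploration. The key to the solution is BAlanced Policy Optimization with Adaptive Clipping (BAPO), which dynamically adjusts the clipping bounds to re-balance positive and negative contributions, preserve policy entropy, and stabilize RL optimization. The method trains quickly, stably, and data-efficiently across off-policy scenarios such as sample replay and partial rollout.
Link: https://arxiv.org/abs/2510.18927
Authors: Zhiheng Xi,Xin Guo,Yang Nan,Enyu Zhou,Junrui Shen,Wenxiang Chen,Jiaqi Liu,Jixuan Huang,Zhihao Zhang,Honglin Guo,Xun Deng,Zhikai Lei,Miao Zheng,Guoteng Wang,Shuo Zhang,Peng Sun,Rui Zheng,Hang Yan,Tao Gui,Qi Zhang,Xuanjing Huang
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint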
Abstract:Reinforcement learning (RL) has recently become the core paradigm for aligning and strengthening large language models (LLMs). Yet, applying RL in off-policy settings–where stale data from past policies are used for training–improves sample efficiency, but remains challenging: policy entropy declines sharply, optimization often becomes unstable and may even collapse. Through theoretical and empirical analysis, we identify two key insights: (i) an imbalance in optimization, where negative-advantage samples dominate the policy gradient, suppressing useful behaviors and risking gradient explosions; and (ii) the derived Entropy-Clip Rule, which reveals that the fixed clipping mechanism in PPO-like objectives systematically blocks entropy-increasing updates, thereby driving the policy toward over-exploitation at the expense of exploration. Building on these insights, we propose BAlanced Policy Optimization with Adaptive Clipping (BAPO), a simple yet effective method that dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. Across diverse off-policy scenarios–including sample replay and partial rollout–BAPO achieves fast, stable, and data-efficient training. On AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B, while our 32B BAPO model not only achieves state-of-the-art results among models of the same scale but also outperforms leading proprietary systems like o3-mini and Gemini-2.5-Flash-Thinking.
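To show what "adjusting clipping bounds to re-balance positive and negative contributions" could look like mechanically, the sketch below implements a PPO-style clipped term with separate lower and upper ratio bounds, plus a simple rule that widens the upper bound until positive-advantage tokens carry a target share of the unclipped signal. The adjustment rule, the contribution proxy `|ratio * advantage|`, and all names and default values are hypothetical simplifications for illustration; they are not taken from the paper.

```python
def clipped_term(ratio, advantage, low, high):
    """PPO-style surrogate for one token, with asymmetric ratio bounds [low, high]."""
    clipped = min(max(ratio, low), high)
    return min(ratio * advantage, clipped * advantage)


def is_active(ratio, advantage, low, high):
    """True when the token still receives gradient, i.e. the clipped branch is not selected."""
    if advantage > 0:
        return ratio <= high
    return ratio >= low


def positive_share(samples, low, high):
    """Share of the active signal that comes from positive-advantage tokens."""
    pos = sum(abs(r * a) for r, a in samples if a > 0 and is_active(r, a, low, high))
    neg = sum(abs(r * a) for r, a in samples if a < 0 and is_active(r, a, low, high))
    total = pos + neg
    return pos / total if total else 0.0


def adapt_bounds(samples, low=0.8, high=1.2, target=0.4, step=0.05, max_high=2.0):
    """Hypothetical rule: widen the upper bound until positives reach the target share."""
    while positive_share(samples, low, high) < target and high + step <= max_high:
        high += step
    return low, high
```

With a fixed upper bound of 1.2, a positive-advantage token whose ratio has drifted to 1.48 is clipped and contributes no gradient, so negative-advantage tokens dominate; widening the bound lets that token back in, which is the intuition behind the imbalance the abstract describes.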
[NLP-84] Benchmarking On-Device Machine Learning on Apple Silicon with MLX
[Quick read]: This paper addresses the challenge of deploying large language models (LLMs) efficiently on resource-constrained devices such as laptops and mobile phones, and in particular how to exploit Apple silicon for low-latency inference. It evaluates MLX, a machine learning (ML) framework optimized for Apple silicon, together with the authors' MLX-Transformers library, which runs transformer models sourced from Hugging Face in MLX. By comparing inference latency for several transformer architectures in MLX and PyTorch, the study indicates that MLX can deliver efficient, low-latency inference in Apple's ecosystem and lays groundwork for extending the evaluation to models of other modalities.
Link: https://arxiv.org/abs/2510.18921
Authors: Oluwaseun A. Ajayi,Ogundepo Odunayo
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 19 pages, 6 figures. Presented at the 6th Deep Learning Indaba (DLI 2024), Dakar, Senegal; non-archival presentation. Poster: this https URL
Abstract:The recent widespread adoption of Large Language Models (LLMs) and machine learning in general has sparked research interest in exploring the possibilities of deploying these models on smaller devices such as laptops and mobile phones. This creates a need for frameworks and approaches that are capable of taking advantage of on-device hardware. The MLX framework was created to address this need. It is a framework optimized for machine learning (ML) computations on Apple silicon devices, facilitating easier research, experimentation, and prototyping. This paper presents a performance evaluation of MLX, focusing on inference latency of transformer models. We compare the performance of different transformer architecture implementations in MLX with their PyTorch counterparts. For this research we create a framework called MLX-transformers which includes different transformer implementations in MLX and downloads the model checkpoints in PyTorch format and converts them to the MLX format. By leveraging the advanced architecture and capabilities of Apple Silicon, MLX-Transformers enables seamless execution of transformer models directly sourced from Hugging Face, eliminating the need for checkpoint conversion often required when porting models between frameworks. Our study benchmarks different transformer models on two Apple Silicon MacBook devices against an NVIDIA CUDA GPU. Specifically, we compare the inference latency performance of models with the same parameter sizes and checkpoints. We evaluate the performance of BERT, RoBERTa, and XLM-RoBERTa models, with the intention of extending future work to include models of different modalities, thus providing a more comprehensive assessment of MLX’s capabilities. The results highlight MLX’s potential in enabling efficient and more accessible on-device ML applications within Apple’s ecosystem.
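Inference-latency comparisons like the one described here usually come down to a small timing harness. The sketch below is a generic, framework-agnostic version using only the Python standard library; it is not the authors' benchmark code, and `measure_latency` with its `warmup` and `repeats` defaults is a hypothetical helper.

```python
import statistics
import time


def measure_latency(fn, warmup=3, repeats=10):
    """Time a zero-argument callable and report median and minimum wall-clock latency."""
    # Warm-up runs absorb one-time costs such as compilation, caching, and allocation.
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(timings),
        "min_s": min(timings),
        "runs": len(timings),
    }
```

In practice `fn` would wrap one forward pass of the model under test. Frameworks that evaluate lazily or launch work asynchronously need `fn` to force the computation to finish before it returns, otherwise the timer measures only the launch.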
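[NLP-85] Misinformation Detection using Large Language Models with Explainability
[Quick read]: This paper addresses the erosion of trust and the poorer decision making caused by the rapid spread of misinformation on online platforms. The key to the solution is an explainable and computationally efficient detection pipeline built on transformer-based pretrained language models (PLMs) such as RoBERTa and DistilBERT, optimized with a two-stage strategy: first freeze the backbone and train only the classification head, then progressively unfreeze the layers while applying layer-wise learning rate decay. For transparency it integrates LIME for local, token-level explanations and SHAP for global feature attribution. Experiments show that the lightweight DistilBERT achieves accuracy comparable to RoBERTa while using substantially less computation, offering a practical path to trustworthy, scalable misinformation detection.
Link: https://arxiv.org/abs/2510.18918
Authors: Jainee Patel,Chintan Bhatt,Himani Trivedi,Thanh Thi Nguyen
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted for publication in the Proceedings of the 8th International Conference on Algorithms, Computing and Artificial Intelligence (ACAI 2025)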
Abstract:The rapid spread of misinformation on online platforms undermines trust among individuals and hinders informed decision making. This paper shows an explainable and computationally efficient pipeline to detect misinformation using transformer-based pretrained language models (PLMs). We optimize both RoBERTa and DistilBERT using a two-step strategy: first, we freeze the backbone and train only the classification head; then, we progressively unfreeze the backbone layers while applying layer-wise learning rate decay. On two real-world benchmark datasets, COVID Fake News and FakeNewsNet GossipCop, we test the proposed approach with a unified protocol of preprocessing and stratified splits. To ensure transparency, we integrate the Local Interpretable Model-Agnostic Explanations (LIME) at the token level to present token-level rationales and SHapley Additive exPlanations (SHAP) at the global feature attribution level. It demonstrates that DistilBERT achieves accuracy comparable to RoBERTa while requiring significantly less computational resources. This work makes two key contributions: (1) it quantitatively shows that a lightweight PLM can maintain task performance while substantially reducing computational cost, and (2) it presents an explainable pipeline that retrieves faithful local and global justifications without compromising performance. The results suggest that PLMs combined with principled fine-tuning and interpretability can be an effective framework for scalable, trustworthy misinformation detection.
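The two-step strategy in this abstract can be reduced to two small schedules: which backbone layers are trainable at each stage, and what learning rate each layer receives. The sketch below assumes the common form of layer-wise learning rate decay, in which the top layer uses the base rate and each layer below it is scaled by a constant factor, and an unfreezing schedule that releases one more top layer per stage. Both are assumptions about details the abstract does not give, and the function names are hypothetical.

```python
def layerwise_learning_rates(num_layers, base_lr, decay):
    """Learning rate per backbone layer, index 0 = bottom layer, last index = top layer."""
    # The top layer gets base_lr; every step down multiplies the rate by `decay`.
    return [base_lr * decay ** (num_layers - 1 - layer) for layer in range(num_layers)]


def unfrozen_layers(num_layers, stage):
    """Backbone layers that are trainable at a given stage.

    Stage 0 trains only the classification head (no backbone layer is unfrozen);
    stage s additionally unfreezes the top s backbone layers.
    """
    first = max(num_layers - stage, 0)
    return list(range(first, num_layers))
```

For a 6-layer backbone such as DistilBERT, `unfrozen_layers(6, 2)` returns `[4, 5]`, and with `decay=0.5` the rates halve at each step down from the top layer, so early layers that hold general features change the least.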
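[NLP-86] MMAO-Bench: MultiModal All in One Benchmark Reveals Compositional Law between Uni-modal and Omni-modal in OmniModels
[Quick read]: This paper addresses the unclear relationship between uni-modal and omni-modal capabilities as multimodal large language models (MLLMs) move from uni-modal understanding toward unified vision, audio, and language (omni-modal) models. It proposes a new high-quality, diverse omni-modal benchmark, the MultiModal All in One Benchmark (MMAO-Bench). The benchmark comprises 1,880 human-curated samples across 44 task types and introduces a multi-step open-ended question type that better assesses complex reasoning. The evaluation reveals a compositional law between cross-modal and uni-modal performance: omni-modal capability acts as a bottleneck in weak models and shows synergistic promotion in strong models.
Link: https://arxiv.org/abs/2510.18915
Authors: Chen Chen,ZeYang Hu,Fengjiao Chen,Liya Ma,Jiaxing Liu,Xiaoyu Li,Xuezhi Cao
Affiliations: Meituan
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 8 figures. Work in progress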
Abstract:Multimodal large language models have been progressing from uni-modal understanding toward unifying visual, audio and language modalities, collectively termed omni models. However, the correlation between uni-modal and omni-modal capabilities remains unclear, which calls for comprehensive evaluation to drive the evolution of omni-model intelligence. In this work, we propose a novel, high-quality and diverse omni-model benchmark, MultiModal All in One Benchmark (MMAO-Bench), which effectively assesses both uni-modal and omni-modal understanding capabilities. The benchmark consists of 1880 human-curated samples across 44 task types, and an innovative multi-step open-ended question type that better assesses complex reasoning tasks. Experimental results show a compositional law between cross-modal and uni-modal performance: omni-modal capability manifests as a bottleneck effect on weak models, while exhibiting synergistic promotion on strong models.
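[NLP-87] Context-aware Fairness Evaluation and Mitigation in LLMs
[Quick read]: This paper addresses undesirable behaviors embedded in the internal representations of large language models (LLMs), which during dialogue undermine fairness and lead to inconsistency drift, amplification of harmful content, and propagation of unwanted patterns. Training-time and data-centric methods are computationally expensive, irreversible once deployed, and slow to adapt to new conversational contexts, while conventional pruning is flexible and transparent but static, so a removed neuron cannot adapt when the context changes. The paper proposes a dynamic, reversible pruning-based framework that detects context-aware neuron activations and applies adaptive masking at inference time to modulate their influence on generation. This gives fine-grained, memory-aware bias mitigation that preserves knowledge and improves coherence across multilingual single-turn and multi-turn dialogues, supporting dynamic fairness control in real-world conversational AI.
Link: https://arxiv.org/abs/2510.18914
Authors: Afrozah Nadeem,Mark Dras,Usman Naseem
Affiliations: Macquarie University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint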
Abstract:Large language models often display undesirable behaviors embedded in their internal representations, which undermine fairness and lead to inconsistency drift, amplification of harmful content, and the propagation of unwanted patterns during extended dialogue and conversations. Although training-time or data-centric methods attempt to reduce these effects, they are computationally expensive, irreversible once deployed, and slow to adapt to new conversational contexts. Pruning-based methods provide a flexible and transparent way to reduce bias by adjusting the neurons responsible for certain behaviors. However, most existing approaches are static; once a neuron is removed, the model loses the ability to adapt when the conversation or context changes. To address this, we propose a dynamic, reversible, pruning-based framework that detects context-aware neuron activations and applies adaptive masking to modulate their influence during generation. Our inference-time solution provides fine-grained, memory-aware mitigation with knowledge-preserving, more coherent behavior across multilingual single- and multi-turn dialogues, enabling dynamic fairness control in real-world conversational AI.
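The adaptive masking idea in NLP-87 can be illustrated with a toy example. This is only a sketch under stated assumptions, not the authors' implementation: `bias_scores`, `threshold`, and `damping` are hypothetical names, and the per-context scores are supplied by hand rather than detected from a model. The point it shows is reversibility: the mask is recomputed per context and the original activations are never overwritten.

```python
from typing import List


def adaptive_mask(
    activations: List[float],
    bias_scores: List[float],
    threshold: float = 0.5,
    damping: float = 0.0,
) -> List[float]:
    """Return a masked copy of `activations`.

    Neurons whose hypothetical context-dependent `bias_scores` exceed
    `threshold` are scaled by `damping`; all others pass through unchanged.
    The input list is not modified, so the mask can be dropped or recomputed
    when the conversation context changes.
    """
    return [
        a * damping if s > threshold else a
        for a, s in zip(activations, bias_scores)
    ]


activations = [1.0, 2.0, 3.0]
# Context A flags neuron 1; context B flags neuron 2 instead.
masked_a = adaptive_mask(activations, bias_scores=[0.1, 0.9, 0.2])
masked_b = adaptive_mask(activations, bias_scores=[0.1, 0.1, 0.9])
```

A static pruning scheme would correspond to fixing one mask forever; here the same neurons are available again as soon as the context scores change.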
[NLP-88] Learning from the Best Differently: A Diversity-Driven Rethinking on Data Selection
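[Quick read]: This paper addresses the difficulty of balancing quality and diversity when selecting pre-training data for large language models (LLMs). Existing single-score or multi-score filtering methods are non-monotonic: top-scored data does not necessarily improve downstream performance, and correlation between dimensions tends to erode diversity. The root cause is that score-based methods do not decouple correlated metrics, so "high-quality" data dominates along a few dimensions while distributional heterogeneity is systematically overlooked. The key to the solution is to orthogonalize the multi-dimensional metrics (language quality, knowledge quality, and comprehension difficulty) with Principal Component Analysis (PCA), producing orthogonal, decorrelated evaluation dimensions. A RoBERTa-based scorer is then trained for each orthogonal dimension for efficient inference, and the top-scored data in each dimension is selected directly to build the final training set. This keeps the dataset both high-quality and diverse; experiments show less than 2% overlap across dimensions and significantly better model performance than the baselines.
Link: https://arxiv.org/abs/2510.18909
Authors: Hongyi He,Xiao Liu,Zhenghao Lin,Mingni Tang,Yi Cheng,Jintao Wang,Wenjie Li,Peng Cheng,Yeyun Gong
Affiliations: Tsinghua University; Microsoft Research; The Hong Kong Polytechnic University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: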
Abstract:High-quality pre-training data is crucial for large language models, where quality captures factual reliability and semantic value, and diversity ensures broad coverage and distributional heterogeneity. Existing approaches typically rely on single or multiple-dimensional score-based selection. However, directly selecting top-scored data often degrades performance, and sampling from a broader range is required to recover results. The above non-monotonicity between dataset scores and downstream benchmark results reveals a fundamental bias: score-based methods collapse correlated dimensions, causing top-scored data to appear high-quality while systematically overlooking diversity. We argue that ensuring diversity requires decomposing correlated metrics into orthogonal feature dimensions, from which the top-scored data can be directly selected. Therefore, we propose the Orthogonal Diversity-Aware Selection (ODiS) algorithm, which preserves both quality and diversity during data selection. First, ODiS evaluates data from multiple dimensions, covering language quality, knowledge quality, and comprehension difficulty. The multi-dimensional scores are then decorrelated via Principal Component Analysis (PCA), yielding orthogonal evaluation dimensions. For each dimension, a RoBERTa-based scorer is trained to regress the data onto PCA-projected scores, enabling scalable inference on large corpora. Finally, ODiS constructs the training dataset by selecting top-scored data within each orthogonal dimension, thereby ensuring both quality and diversity. Empirical results show that ODiS-selected data exhibit less than 2% inter-dimension overlap, confirming orthogonality between dimensions. More importantly, models trained with ODiS-selected data significantly outperform other baselines on downstream benchmarks, highlighting the necessity of orthogonal, diversity-aware data selection for LLMs.
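The decorrelate-then-select step described for ODiS (NLP-88) can be sketched with plain PCA. This is a minimal illustration assuming NumPy is available, not the authors' pipeline: the scores are synthetic, the function name `select_top_per_component` is hypothetical, and the RoBERTa regression step that the paper uses to scale scoring to large corpora is left out.

```python
import numpy as np


def select_top_per_component(scores: np.ndarray, k: int):
    """Decorrelate metric scores with PCA, then take the top-k rows per component.

    scores: array of shape (n_docs, n_metrics).
    Returns (projected, selected): the PCA-projected scores and a dict mapping
    each component index to the sorted indices of its k highest-scoring docs.
    """
    centered = scores - scores.mean(axis=0)
    # Rows of vt are the principal directions of the centered score matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt.T
    selected = {}
    for j in range(projected.shape[1]):
        # The sign of a principal component is arbitrary, so "top" is only
        # meaningful once each component has been given an orientation.
        top = np.argsort(-projected[:, j])[:k]
        selected[j] = sorted(top.tolist())
    return projected, selected


rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
# Three strongly correlated toy metrics (synthetic data, not from the paper).
scores = np.hstack([base + 0.3 * rng.normal(size=(200, 1)) for _ in range(3)])
projected, selected = select_top_per_component(scores, k=10)
cov = np.cov(projected, rowvar=False)
off_diagonal = cov - np.diag(np.diag(cov))
```

The off-diagonal covariance of the projected scores is numerically zero, which is the property that lets top-scored data be taken from each dimension without the dimensions echoing one another.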
[NLP-89] Improving Topic Modeling of Social Media Short Texts with Rephrasing: A Case Study of COVID-19 Related Tweets
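[Quick read]: This paper addresses the problem that, when topic modeling is applied to short social media texts (such as Twitter posts) in public health crisis research, their brevity, informality, and noise cause traditional topic models to produce incoherent or redundant topics. The key to the solution is a model-agnostic framework, TM-Rephrase, which uses large language models (LLMs) to rephrase raw tweets into more standardized, formal language and thereby improves subsequent topic modeling. Experiments show that the colloquial-to-formal strategy in particular significantly improves topic coherence, uniqueness, and diversity across several topic models while reducing redundancy, with the largest effect on LDA.
Link: https://arxiv.org/abs/2510.18908
Authors: Wangjiaxuan Xin,Shuhua Yin,Shi Chen,Yaorong Ge
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: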
Abstract:Social media platforms such as Twitter (now X) provide rich data for analyzing public discourse, especially during crises such as the COVID-19 pandemic. However, the brevity, informality, and noise of social media short texts often hinder the effectiveness of traditional topic modeling, producing incoherent or redundant topics that are often difficult to interpret. To address these challenges, we have developed TM-Rephrase, a model-agnostic framework that leverages large language models (LLMs) to rephrase raw tweets into more standardized and formal language prior to topic modeling. Using a dataset of 25,027 COVID-19-related Twitter posts, we investigate the effects of two rephrasing strategies, general- and colloquial-to-formal-rephrasing, on multiple topic modeling methods. Results demonstrate that TM-Rephrase improves three metrics measuring topic modeling performance (i.e., topic coherence, topic uniqueness, and topic diversity) while reducing topic redundancy of most topic modeling algorithms, with the colloquial-to-formal strategy yielding the greatest performance gains and especially for the Latent Dirichlet Allocation (LDA) algorithm. This study contributes to a model-agnostic approach to enhancing topic modeling in public health related social media analysis, with broad implications for improved understanding of public discourse in health crisis as well as other important domains.
[NLP-90] DuoLens: A Framework for Robust Detection of Machine-Generated Multilingual Text and Code NEURIPS2025
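[Quick read]: This paper addresses the difficulty current machine-generated content detectors have in achieving both accuracy and efficiency on multilingual text and source code. Existing methods (such as Fast DetectGPT or GPTZero) rely mainly on zero-shot strategies and tend to suffer from high computational cost or insufficient accuracy, usually with a trade-off between the two. The key to the solution is to fine-tune encoder-only Small Language Models (SLMs), specifically RoBERTa and CodeBERTa, on specialized training datasets covering source code and natural language. On the binary classification of generated content, these models clearly outperform large language models (LLMs) while cutting latency by 8-12× and peak VRAM by 3-5×. Experiments show that at 512-token inputs the approach reaches AUROC = 0.97-0.99 and macro-F1 = 0.89-0.94, and retains at least 92% of the clean AUROC under cross-generator shifts and adversarial transformations (paraphrasing, back-translation, code formatting/renaming).
Link: https://arxiv.org/abs/2510.18904
Authors: Shriyansh Agrawal,Aidan Lau,Sanyam Shah,Ahan M R,Kevin Zhu,Sunishchal Dev,Vasu Sharma
Affiliations: Algoverse AI Research; Algoverse Academy; Microsoft
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted to 39th Conference on Neural Information Processing Systems (NeurIPS 2025): 4th Workshop on Deep Learning for Code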
Abstract:The prevalence of Large Language Models (LLMs) for generating multilingual text and source code has only increased the imperative for machine-generated content detectors to be accurate and efficient across domains. Current detectors, predominantly utilizing zero-shot methods, such as Fast DetectGPT or GPTZero, either incur high computational cost or lack sufficient accuracy, often with a trade-off between the two, leaving room for further improvement. To address these gaps, we propose the fine-tuning of encoder-only Small Language Models (SLMs), in particular, the pre-trained models of RoBERTa and CodeBERTa using specialized datasets on source code and other natural language to prove that for the task of binary classification, SLMs outperform LLMs by a huge margin whilst using a fraction of compute. Our encoders achieve AUROC = 0.97 to 0.99 and macro-F1 0.89 to 0.94 while reducing latency by 8-12× and peak VRAM by 3-5× at 512-token inputs. Under cross-generator shifts and adversarial transformations (paraphrase, back-translation; code formatting/renaming), performance retains ≥ 92% of clean AUROC. We release training and evaluation scripts with seeds and configs; a reproducibility checklist is also included.
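[NLP-91] Transformer-Based Low-Resource Language Translation: A Study on Standard Bengali to Sylheti
[Quick read]: This paper addresses the poor machine translation (MT) performance on low-resource languages such as Sylheti, in particular the bottleneck when translating from a high-resource language (such as Bengali) into a low-resource one. The key to the solution is task-specific adaptation through fine-tuning multilingual Transformer models (such as mBART-50 and MarianMT). Compared with zero-shot large language models (LLMs), the fine-tuned models are significantly better in both translation adequacy and character-level fidelity, which confirms the importance of targeted training for low-resource languages.
Link: https://arxiv.org/abs/2510.18898
Authors: Mangsura Kabir Oni,Tabia Tanzin Prama
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: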
Abstract:Machine Translation (MT) has advanced from rule-based and statistical methods to neural approaches based on the Transformer architecture. While these methods have achieved impressive results for high-resource languages, low-resource varieties such as Sylheti remain underexplored. In this work, we investigate Bengali-to-Sylheti translation by fine-tuning multilingual Transformer models and comparing them with zero-shot large language models (LLMs). Experimental results demonstrate that fine-tuned models significantly outperform LLMs, with mBART-50 achieving the highest translation adequacy and MarianMT showing the strongest character-level fidelity. These findings highlight the importance of task-specific adaptation for underrepresented languages and contribute to ongoing efforts toward inclusive language technologies.
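[NLP-92] When Models Can't Follow: Testing Instruction Adherence Across 256 LLMs
[Quick read]: This paper addresses the lack of a systematic, efficient, and diagnostic method for evaluating the instruction-following ability of large language models (LLMs). Existing benchmarks are comprehensive but make it hard to identify specific instruction-adherence patterns quickly, and newer models may show memorized performance rather than genuine ability when older benchmarks appear in their training data. The key to the solution is a streamlined evaluation framework of 20 carefully designed prompts that cover diverse task categories and focus on core dimensions such as format compliance, content constraints, logical sequencing, and multi-step execution. The framework is demonstrated in a large-scale empirical study (256 of 331 available models were tested), with each model's basic functionality verified before inclusion to avoid selection bias. It gives researchers and practitioners a practical, reproducible diagnostic tool and reveals common failure modes and difficult instruction types in current LLMs.
Link: https://arxiv.org/abs/2510.18892
Authors: Richard J. Young,Brandon Gillins,Alice M. Matthews
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 21 pages, 3 figures, 5 tables. Comprehensive evaluation of 256 LLMs on instruction-following tasks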
Abstract:Despite widespread deployment of Large Language Models, systematic evaluation of instruction-following capabilities remains challenging. While comprehensive benchmarks exist, focused assessments that quickly diagnose specific instruction adherence patterns are valuable. As newer models may be trained on existing benchmarks, novel evaluation approaches are needed to assess genuine capabilities rather than memorized performance. This paper presents a streamlined evaluation framework using twenty carefully designed prompts to assess LLM instruction-following across diverse task categories. We demonstrate this framework through a large-scale empirical study conducted on October 14, 2025, testing 256 verified working models from 331 available via OpenRouter. To ensure methodological rigor and prevent selection bias, we first verified each model’s basic functionality before inclusion. Unlike large-scale benchmarks requiring extensive computational resources, our approach offers a practical diagnostic tool researchers and practitioners can readily apply. Our methodology builds upon verifiable instructions while introducing a compact test suite balancing comprehensiveness with efficiency. Each prompt targets distinct aspects of instruction following, including format compliance, content constraints, logical sequencing, and multi-step task execution. We evaluate models from major providers (OpenAI, Anthropic, Google, Meta, Mistral) and emerging implementations (Qwen, DeepSeek, community models), providing comparative performance analysis. Our findings reveal consistent failure modes and identify specific instruction types posing particular challenges. This work contributes both a practical evaluation tool and one of the most comprehensive empirical analyses of instruction-following capabilities across the contemporary LLM landscape.
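The "verifiable instructions" that NLP-92 builds on are instructions whose adherence can be checked by a program rather than by a judge model. The checkers below are a hedged sketch of what such checks might look like for three of the dimensions named above (format compliance, content constraints, logical sequencing). The function names and the specific rules are assumptions made for illustration; they are not the paper's twenty prompts.

```python
import json
import re


def is_json_only(response: str) -> bool:
    """Format compliance: the whole response must parse as JSON."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False


def within_word_limit(response: str, max_words: int) -> bool:
    """Content constraint: at most `max_words` whitespace-separated words."""
    return len(response.split()) <= max_words


def has_numbered_steps(response: str, n: int) -> bool:
    """Logical sequencing: lines must be numbered 1..n, in order."""
    numbers = [
        int(m.group(1))
        for m in re.finditer(r"^\s*(\d+)\.", response, flags=re.MULTILINE)
    ]
    return numbers == list(range(1, n + 1))
```

For example, `is_json_only('Sure! {"a": 1}')` fails because of the leading prose, which is exactly the kind of near-miss that a pass/fail check surfaces quickly.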
[NLP-93] Small Language Models Offer Significant Potential for Science Community
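[Quick read]: This paper addresses efficient, precise, and low-cost information retrieval from scientific literature, in particular how to quickly obtain expert-verified quantitative findings from the vast geoscience literature. The core of the solution is a framework built on small language models (MiniLMs) that uses a corpus of about 77 million high-quality sentences extracted from 95 high-impact geoscience journals, with semantic search and sentence-level indexing providing computationally efficient, domain-specific information extraction. Whereas generative AI often produces generalized answers, MiniLM performs better at identifying quantitative conclusions with multi-disciplinary sources. It is combined with sentiment analysis and unsupervised clustering to track research trends and the evolution of hot topics, giving a reliable tool for fact checking, trend analysis, and educational applications.
Link: https://arxiv.org/abs/2510.18890
Authors: Jian Zhang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: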
Abstract:Recent advancements in natural language processing, particularly with large language models (LLMs), are transforming how scientists engage with the literature. While the adoption of LLMs is increasing, concerns remain regarding potential information biases and computational costs. Rather than LLMs, I developed a framework to evaluate the feasibility of precise, rapid, and cost-effective information retrieval from extensive geoscience literature using freely available small language models (MiniLMs). A curated corpus of approximately 77 million high-quality sentences, extracted from 95 leading peer-reviewed geoscience journals such as Geophysical Research Letters and Earth and Planetary Science Letters published during years 2000 to 2024, was constructed. MiniLMs enable a computationally efficient approach for extracting relevant domain-specific information from these corpora through semantic search techniques and sentence-level indexing. This approach, unlike LLMs such as ChatGPT-4, which often produce generalized responses, excels at identifying substantial amounts of expert-verified information with established, multi-disciplinary sources, especially for information with quantitative findings. Furthermore, by analyzing emotional tone via sentiment analysis and topical clusters through unsupervised clustering within sentences, MiniLM provides a powerful tool for tracking the evolution of conclusions, research priorities, advancements, and emerging questions within geoscience communities. Overall, MiniLM holds significant potential within the geoscience community for applications such as fact and image retrieval, trend analyses, contradiction analyses, and educational purposes.
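Sentence-level semantic search, as used in NLP-93, comes down to ranking indexed sentences by the similarity of their embeddings to a query embedding. The sketch below assumes the embeddings already exist; the two-dimensional vectors and sentence labels are toy placeholders, and `search` is a hypothetical helper rather than part of any MiniLM library.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def search(
    query_vec: Sequence[float],
    index: List[Tuple[str, Sequence[float]]],
    top_k: int = 2,
) -> List[str]:
    """Return the `top_k` sentences whose vectors are closest to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [sentence for sentence, _ in ranked[:top_k]]


# Toy index: (sentence label, precomputed embedding). Values are made up.
index = [
    ("sentence A", [1.0, 0.0]),
    ("sentence B", [0.0, 1.0]),
    ("sentence C", [1.0, 1.0]),
]
hits = search([1.0, 0.1], index, top_k=2)
```

A brute-force scan like this is fine for a toy index; at the scale of tens of millions of sentences an approximate nearest-neighbour index would normally replace the `sorted` call.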
[NLP-94] Contextual Augmentation for Entity Linking using Large Language Models
Link: https://arxiv.org/abs/2510.18888
Authors: Daniel Vollmers,Hamada M. Zahera,Diego Moussallem,Axel-Cyrille Ngonga Ngomo
Affiliations: Paderborn University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
[NLP-95] Towards Better Health Conversations: The Benefits of Context-seeking
[Quick read]: This paper addresses the risks of inaccuracy, bias, and misleading content that users face when they use large language models (LLMs) to obtain health information in a complex information landscape. The core challenge is how to improve the interaction quality and trustworthiness of LLMs in health consultations. The key to the solution is proactive context-seeking: a "Wayfinding AI" that actively prompts users during the conversation for more specific personal context, making answers more targeted and relevant. Experiments show that, compared with a baseline AI, users rated the responses as more helpful, relevant, and tailored, which highlights the role of proactive context-seeking in improving health-related conversational AI.
Link: https://arxiv.org/abs/2510.18880
Authors: Rory Sayres,Yuexing Hao,Abbi Ward,Amy Wang,Beverly Freeman,Serena Zhan,Diego Ardila,Jimmy Li,I-Ching Lee,Anna Iurchenko,Siyi Kou,Kartikeya Badola,Jimmy Hu,Bhawesh Kumar,Keith Johnson,Supriya Vijay,Justin Krogue,Avinatan Hassidim,Yossi Matias,Dale R. Webster,Sunny Virmani,Yun Liu,Quang Duong,Mike Schaekermann
Affiliations: Google Research
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:
Abstract:Navigating health questions can be daunting in the modern information landscape. Large language models (LLMs) may provide tailored, accessible information, but also risk being inaccurate, biased or misleading. We present insights from 4 mixed-methods studies (total N=163), examining how people interact with LLMs for their own health questions. Qualitative studies revealed the importance of context-seeking in conversational AIs to elicit specific details a person may not volunteer or know to share. Context-seeking by LLMs was valued by participants, even if it meant deferring an answer for several turns. Incorporating these insights, we developed a “Wayfinding AI” to proactively solicit context. In a randomized, blinded study, participants rated the Wayfinding AI as more helpful, relevant, and tailored to their concerns compared to a baseline AI. These results demonstrate the strong impact of proactive context-seeking on conversational dynamics, and suggest design patterns for conversational AI to help navigate health topics.
[NLP-96] Aligning Multilingual News for Stock Return Prediction
Link: https://arxiv.org/abs/2510.19203
Authors: Yuntao Wu,Lynn Tao,Ing-Haw Cheng,Charles Martineau,Yoshio Nozawa,John Hull,Andreas Veneris
Affiliations: University of Toronto; Rotman School of Management
Subjects: Computational Finance (q-fin.CP); Computation and Language (cs.CL)
Comments: 6 pages, 4 tables, 2 figures, AI for Finance Symposium’25 Workshop at ICAIF’25
[NLP-97] StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction
Link: https://arxiv.org/abs/2510.18938
Authors: Qianheng Xu
Affiliations: Millburn High School
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 13 pages, 5 figures
Computer Vision
[CV-0] Is This Tracker On? A Benchmark Protocol for Dynamic Tracking ALT
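[Quick read]: This paper addresses the insufficient performance of current point tracking methods in real-world scenes, and in particular the failure of existing benchmarks to cover key challenges such as motion complexity, occlusion patterns, and object diversity. The key to the solution is ITTO, a new benchmark suite built from videos drawn from existing datasets and egocentric real-world recordings, with high-quality human annotations collected through a multi-stage pipeline. ITTO supports systematic evaluation and diagnosis of tracking algorithms in complex dynamic environments, exposes the marked weakness of existing methods at re-identifying points after occlusion, and provides a reliable testbed and direction for developing more robust point tracking models.
Link: https://arxiv.org/abs/2510.19819
Authors: Ilona Demler,Saumya Chauhan,Georgia Gkioxari
Affiliations: California Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL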
Abstract:We introduce ITTO, a challenging new benchmark suite for evaluating and diagnosing the capabilities and limitations of point tracking methods. Our videos are sourced from existing datasets and egocentric real-world recordings, with high-quality human annotations collected through a multi-stage pipeline. ITTO captures the motion complexity, occlusion patterns, and object diversity characteristic of real-world scenes – factors that are largely absent in current benchmarks. We conduct a rigorous analysis of state-of-the-art tracking methods on ITTO, breaking down performance along key axes of motion complexity. Our findings reveal that existing trackers struggle with these challenges, particularly in re-identifying points after occlusion, highlighting critical failure modes. These results point to the need for new modeling approaches tailored to real-world dynamics. We envision ITTO as a foundation testbed for advancing point tracking and guiding the development of more robust tracking algorithms.
[CV-1] How to Evaluate Monocular Depth Estimation?
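[Quick read]: This paper addresses the lack of standardized evaluation metrics in monocular depth estimation and the poorly understood sensitivity of those metrics to different types of perturbation. Existing metrics are severely under-sensitive to curvature perturbations (for example, turning a flat surface wavy), which puts them at odds with human perceptual judgment. To address this, the paper proposes a new metric based on relative surface normals, together with new depth visualization tools and a principled method for building composite metrics, improving agreement between evaluation results and human visual judgment.
Link: https://arxiv.org/abs/2510.19814
Authors: Siyang Wu,Jack Nugent,Willow Yang,Jia Deng
Affiliations: Princeton University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: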
Abstract:Monocular depth estimation is an important task with rapid progress, but how to evaluate it remains an open question, as evidenced by a lack of standardization in existing literature and a large selection of evaluation metrics whose trade-offs and behaviors are not well understood. This paper contributes a novel, quantitative analysis of existing metrics in terms of their sensitivity to various types of perturbations of ground truth, emphasizing comparison to human judgment. Our analysis reveals that existing metrics are severely under-sensitive to curvature perturbation such as making flat surfaces wavy. To remedy this, we introduce a new metric based on relative surface normals, along with new depth visualization tools and a principled method to create composite metrics with better human alignment. Code and data are available at: this https URL.
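The claim in CV-1 that common depth metrics barely react when a flat surface becomes wavy can be checked on a toy 1D depth profile. The sketch below is an illustration only: it compares the standard absolute relative error with a simple slope-angle error, which is a stand-in chosen for this example and not the relative-surface-normal metric the paper proposes. The sample spacing `DX` and ripple size are arbitrary assumptions.

```python
import math

DX = 0.01  # horizontal spacing between samples (assumed units)
xs = [i * DX for i in range(200)]
flat = [10.0 for _ in xs]
# The same surface with a small, high-frequency ripple added.
wavy = [10.0 + 0.05 * math.sin(2 * math.pi * x / 0.1) for x in xs]


def abs_rel(pred, gt):
    """Mean absolute relative depth error."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def slope_angles_deg(depth, dx):
    """Tilt of each 1D surface segment relative to a fronto-parallel plane."""
    return [math.degrees(math.atan((b - a) / dx)) for a, b in zip(depth, depth[1:])]


def mean_normal_error_deg(pred, gt, dx):
    """Mean angle between predicted and true segment normals (1D profile)."""
    pred_angles = slope_angles_deg(pred, dx)
    gt_angles = slope_angles_deg(gt, dx)
    return sum(abs(p - g) for p, g in zip(pred_angles, gt_angles)) / len(gt_angles)


depth_error = abs_rel(wavy, flat)
normal_error = mean_normal_error_deg(wavy, flat, DX)
```

On this profile the relative depth error stays well under one percent while the surface orientation is off by tens of degrees on average, which is the kind of mismatch a normal-based metric is meant to expose.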
[CV-2] Class-Aware Prototype Learning with Negative Contrast for Test-Time Adaptation of Vision-Language Models
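[Quick read]: This paper addresses the performance drop of Vision-Language Models (VLMs) when the deployment distribution shifts away from the training distribution, focusing on prototype degradation under long-tailed distributions and confusion between semantically similar classes. The key to the solution is a lightweight Test-Time Adaptation (TTA) framework, CPL-NC (Class-Aware Prototype Learning with Negative Contrast), built on two mechanisms. The first is a Class-Aware Prototype Cache Module, which dynamically adjusts each class's prototype capacity according to test-time frequency and activation history and uses a rejuvenation mechanism to restore knowledge of inactive classes, mitigating the long-tail problem. The second is a Negative Contrastive Learning Mechanism, which explicitly identifies and constrains hard visual-textual negative pairs to improve class separability. The method also uses an asymmetric optimization strategy that updates only the textual prototypes while keeping visual features fixed, improving stability and generalization.
Link: https://arxiv.org/abs/2510.19802
Authors: Xiaozhen Qiao,Jingkai Zhao,Yuqiu Jiang,Xianda Guo,Zhe Sun,Hongyuan Zhang,Xuelong Li
Affiliations: University of Science and Technology of China; China Telecom; Wuhan University; Northwestern Polytechnical University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: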
Abstract:Vision-Language Models (VLMs) demonstrate impressive zero-shot generalization through large-scale image-text pretraining, yet their performance can drop once the deployment distribution diverges from the training distribution. To address this, Test-Time Adaptation (TTA) methods update models using unlabeled target data. However, existing approaches often ignore two key challenges: prototype degradation in long-tailed distributions and confusion between semantically similar classes. To tackle these issues, we propose Class-Aware Prototype Learning with Negative Contrast (CPL-NC), a lightweight TTA framework designed specifically for VLMs to enhance generalization under distribution shifts. CPL-NC introduces a Class-Aware Prototype Cache Module that dynamically adjusts per-class capacity based on test-time frequency and activation history, with a rejuvenation mechanism for inactive classes to retain rare-category knowledge. Additionally, a Negative Contrastive Learning Mechanism identifies and constrains hard visual-textual negatives to improve class separability. The framework employs asymmetric optimization, refining only textual prototypes while anchoring on stable visual features. Experiments on 15 benchmarks show that CPL-NC consistently outperforms prior TTA methods across both ResNet-50 and ViT-B/16 backbones.
[CV-3] OmniMotion-X: Versatile Multimodal Whole-Body Motion Generation
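[Quick read]: This paper addresses the complexity and consistency challenges of multimodal whole-body human motion generation, in particular producing high-quality, controllable, and temporally consistent motion under cross-modal conditions such as text, music, and speech. The core of the solution is the OmniMotion-X framework, which uses an autoregressive diffusion transformer to model multimodal inputs and outputs in a unified sequence-to-sequence manner. It introduces reference motion as a novel conditioning signal, which markedly improves the consistency of style, temporal dynamics, and content in generated motion. A progressive weak-to-strong mixed-condition training strategy mitigates multimodal conflicts, and the authors build OmniMoCap-X, described as the largest unified multimodal motion dataset to date (28 public MoCap sources standardized to the SMPL-X format), with structured, hierarchical captions generated automatically by GPT-4o to support high-quality multi-task training. The result is high-fidelity, coherent, and interactive long-duration motion generation across tasks such as text-to-motion, music-to-dance, and motion prediction.
Link: https://arxiv.org/abs/2510.19789
Authors: Guowei Xu,Yuxuan Bian,Ailing Zeng,Mingyi Shi,Shaoli Huang,Wen Li,Lixin Duan,Qiang Xu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: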
Abstract:This paper introduces OmniMotion-X, a versatile multimodal framework for whole-body human motion generation, leveraging an autoregressive diffusion transformer in a unified sequence-to-sequence manner. OmniMotion-X efficiently supports diverse multimodal tasks, including text-to-motion, music-to-dance, speech-to-gesture, and global spatial-temporal control scenarios (e.g., motion prediction, in-betweening, completion, and joint/trajectory-guided synthesis), as well as flexible combinations of these tasks. Specifically, we propose the use of reference motion as a novel conditioning signal, substantially enhancing the consistency of generated content, style, and temporal dynamics crucial for realistic animations. To handle multimodal conflicts, we introduce a progressive weak-to-strong mixed-condition training strategy. To enable high-quality multimodal training, we construct OmniMoCap-X, the largest unified multimodal motion dataset to date, integrating 28 publicly available MoCap sources across 10 distinct tasks, standardized to the SMPL-X format at 30 fps. To ensure detailed and consistent annotations, we render sequences into videos and use GPT-4o to automatically generate structured and hierarchical captions, capturing both low-level actions and high-level semantics. Extensive experimental evaluations confirm that OmniMotion-X significantly surpasses existing methods, demonstrating state-of-the-art performance across multiple multimodal tasks and enabling the interactive generation of realistic, coherent, and controllable long-duration motions.
[CV-4] Adaptive Distribution-aware Quantization for Mixed-Precision Neural Networks
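[Quick read]: This paper addresses two challenges that Quantization-Aware Training (QAT) faces when deploying deep neural networks on resource-constrained devices: the highly non-uniform distribution of activations, and the static, mismatched codebooks used in weight quantization. The core of the solution is the Adaptive Distribution-aware Quantization (ADQ) framework, with three key innovations: (1) quantile-based codebook initialization, so that the initial codebook fits the weight distribution more closely; (2) an online codebook adaptation mechanism based on Exponential Moving Average (EMA) that dynamically tracks distribution shifts; and (3) a sensitivity-informed mixed-precision allocation strategy. For activations, a hardware-friendly non-uniform-to-uniform mapping scheme is introduced, balancing high accuracy against low bit-width.
Link: https://arxiv.org/abs/2510.19760
Authors: Shaohang Jia,Zhiyong Huang,Zhi Yu,Mingyang Hou,Shuai Miao,Han Yang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 10 figures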
Abstract:Quantization-Aware Training (QAT) is a critical technique for deploying deep neural networks on resource-constrained devices. However, existing methods often face two major challenges: the highly non-uniform distribution of activations and the static, mismatched codebooks used in weight quantization. To address these challenges, we propose Adaptive Distribution-aware Quantization (ADQ), a mixed-precision quantization framework that employs a differentiated strategy. The core of ADQ is a novel adaptive weight quantization scheme comprising three key innovations: (1) a quantile-based initialization method that constructs a codebook closely aligned with the initial weight distribution; (2) an online codebook adaptation mechanism based on Exponential Moving Average (EMA) to dynamically track distributional shifts; and (3) a sensitivity-informed strategy for mixed-precision allocation. For activations, we integrate a hardware-friendly non-uniform-to-uniform mapping scheme. Comprehensive experiments validate the effectiveness of our method. On ImageNet, ADQ enables a ResNet-18 to achieve 71.512% Top-1 accuracy with an average bit-width of only 2.81 bits, outperforming state-of-the-art methods under comparable conditions. Furthermore, detailed ablation studies on CIFAR-10 systematically demonstrate the individual contributions of each innovative component, validating the rationale and effectiveness of our design.
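The first two ADQ ingredients in CV-4, quantile-based codebook initialization and EMA-based codebook adaptation, can be sketched in a few lines. This is a simplified illustration under assumptions, not the paper's scheme: the bin-midpoint rule in `quantile_codebook`, the nearest-level assignment, and the `decay` value are all choices made for this example.

```python
from typing import List


def quantile_codebook(weights: List[float], n_levels: int) -> List[float]:
    """Place one level at the midpoint quantile of each equal-mass bin."""
    ordered = sorted(weights)
    n = len(ordered)
    return [
        ordered[min(n - 1, int((i + 0.5) / n_levels * n))]
        for i in range(n_levels)
    ]


def assign(weights: List[float], codebook: List[float]) -> List[int]:
    """Index of the nearest codebook level for each weight (ties: lower index)."""
    return [
        min(range(len(codebook)), key=lambda j: abs(w - codebook[j]))
        for w in weights
    ]


def ema_update(
    codebook: List[float],
    weights: List[float],
    assignments: List[int],
    decay: float = 0.9,
) -> List[float]:
    """Move each level toward the mean of its assigned weights via EMA."""
    updated = list(codebook)
    for j, level in enumerate(codebook):
        members = [w for w, a in zip(weights, assignments) if a == j]
        if members:
            mean = sum(members) / len(members)
            updated[j] = decay * level + (1.0 - decay) * mean
    return updated


weights = [float(i) for i in range(8)]
codebook = quantile_codebook(weights, n_levels=4)
assignments = assign(weights, codebook)
updated = ema_update(codebook, weights, assignments, decay=0.5)
```

Because levels are placed by quantile, dense regions of the weight distribution get more levels than a uniform grid would give them, and the EMA step lets the levels follow the weights as training shifts the distribution.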
[CV-5] A Survey on Cache Methods in Diffusion Models: Toward Efficient Multi-Modal Generation
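[Quick read]: This paper addresses the high computational overhead and generation latency of diffusion models in generative AI, caused by multi-step iteration and complex backbone networks, which hinder real-time applications. The key to the solution is "Diffusion Caching", a training-free, architecture-agnostic, and efficient inference paradigm. Its core mechanism is to identify and reuse the computational redundancy inherent in the diffusion process: feature-level cross-step reuse and inter-layer scheduling reduce computation without modifying model parameters. The approach has evolved from static reuse toward dynamic prediction, which improves flexibility, and it can be combined with techniques such as sampling optimization and model distillation, offering a unified, efficient inference framework for future multimodal and interactive applications.
Link: https://arxiv.org/abs/2510.19755
Authors: Jiacheng Liu,Xinyu Wang,Yuqi Lin,Zhikai Wang,Peiru Wang,Peiliang Cai,Qinming Zhou,Zhengan Yan,Zexuan Yan,Zhengyi Shi,Chang Zou,Yue Ma,Linfeng Zhang
Affiliations: Shanghai Jiao Tong University; Tsinghua University; The Hong Kong University of Science and Technology
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 2 figures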
Abstract:Diffusion Models have become a cornerstone of modern generative AI for their exceptional generation quality and controllability. However, their inherent multi-step iterations and complex backbone networks lead to prohibitive computational overhead and generation latency, forming a major bottleneck for real-time applications. Although existing acceleration techniques have made progress, they still face challenges such as limited applicability, high training costs, or quality degradation. Against this backdrop, Diffusion Caching offers a promising training-free, architecture-agnostic, and efficient inference paradigm. Its core mechanism identifies and reuses intrinsic computational redundancies in the diffusion process. By enabling feature-level cross-step reuse and inter-layer scheduling, it reduces computation without modifying model parameters. This paper systematically reviews the theoretical foundations and evolution of Diffusion Caching and proposes a unified framework for its classification and analysis. Through comparative analysis of representative methods, we show that Diffusion Caching evolves from static reuse to dynamic prediction. This trend enhances caching flexibility across diverse tasks and enables integration with other acceleration techniques such as sampling optimization and model distillation, paving the way for a unified, efficient inference framework for future multimodal and interactive applications. We argue that this paradigm will become a key enabler of real-time and efficient generative AI, injecting new vitality into both theory and practice of Efficient Generative Intelligence.
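The "static reuse" end of the Diffusion Caching spectrum in CV-5 can be pictured as a cache that recomputes an expensive block only every few denoising steps and serves the stored features in between. The sketch below is a hedged toy, not any surveyed method: `StepCache`, `refresh_every`, and `expensive_block` are hypothetical names, and a scalar stands in for a feature tensor.

```python
from typing import Callable, Optional


class StepCache:
    """Recompute a value every `refresh_every` steps; reuse it otherwise."""

    def __init__(self, refresh_every: int) -> None:
        self.refresh_every = refresh_every
        self.value: Optional[float] = None
        self.computes = 0  # how many times the expensive function ran

    def get(self, step: int, compute: Callable[[], float]) -> float:
        if self.value is None or step % self.refresh_every == 0:
            self.value = compute()
            self.computes += 1
        return self.value


def expensive_block(step: int) -> float:
    # Stand-in for a costly backbone block whose output drifts slowly
    # from one denoising step to the next.
    return 1.0 + 0.01 * step


cache = StepCache(refresh_every=3)
outputs = [cache.get(t, lambda t=t: expensive_block(t)) for t in range(10)]
```

Here ten steps trigger only four real computations, at the cost of slightly stale features on the reused steps. Dynamic-prediction variants replace the fixed refresh schedule with a rule that forecasts or corrects the cached features instead of copying them.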
[CV-6] Memo: Training Memory-Efficient Embodied Agents with Reinforcement Learning NEURIPS2025
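[Quick read]: This paper addresses the limited context capacity of Transformer-based embodied agents on long tasks, where visual inputs overload the context and impair long-horizon decision-making. Existing methods either rely on recurrent models with fixed-size memory or depend entirely on the Transformer's full context, and neither yields an efficient, scalable memory mechanism. The key to the solution is Memo, a new architecture and training recipe that explicitly creates and retrieves memories by interleaving periodic summarization tokens with the inputs during training. This lets the model compress irrelevant information and retain key experience while remaining compute- and storage-efficient, improving performance and generalization on long-horizon tasks.
Link: https://arxiv.org/abs/2510.19732
Authors: Gunshi Gupta,Karmesh Yadav,Zsolt Kira,Yarin Gal,Rahaf Aljundi
Affiliations: University of Oxford; Georgia Tech University; Toyota Europe
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted for Spotlight Presentation at NeurIPS 2025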
Abstract:To enable embodied agents to operate effectively over extended timeframes, it is crucial to develop models that form and access memories to stay contextualized in their environment. In the current paradigm of training transformer-based policies for embodied sequential decision-making tasks, visual inputs often overwhelm the context limits of transformers, while humans can maintain and utilize a lifetime of experience compressed as memories. Significant compression is possible in principle, as much of the input is irrelevant and can be abstracted. However, existing approaches predominantly focus on either recurrent models with fixed-size memory or transformers with full-context reliance. In this work, we propose Memo, a transformer-based architecture and training recipe for reinforcement learning (RL) on memory-intensive, long-horizon tasks. Memo incorporates the creation and retrieval of memory by interleaving periodic summarization tokens with the inputs of a model during training. We demonstrate Memo’s effectiveness on a gridworld meta-RL benchmark and a multi-object navigation task in photo-realistic indoor settings. Memo outperforms naive long-context transformer baselines while being more compute and storage efficient. Additionally, Memo generalizes better to longer contexts at inference time and remains robust in streaming settings, where historical context must be truncated to fit inference constraints.
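The interleaving of periodic summarization tokens described for Memo (CV-6) can be sketched at the level of token sequences. Everything below is an assumption made for illustration: the `<SUM>` token, the fixed `segment_len`, and especially the rule in `compressed_context` that raw observations may be dropped once their segment has been summarized. The actual model learns what each summary token stores; this only shows the bookkeeping.

```python
from typing import List

SUMMARY = "<SUM>"  # hypothetical summarization token


def interleave_summaries(observations: List[str], segment_len: int) -> List[str]:
    """Insert a summary token after every `segment_len` observation tokens."""
    sequence: List[str] = []
    for i, obs in enumerate(observations, start=1):
        sequence.append(obs)
        if i % segment_len == 0:
            sequence.append(SUMMARY)
    return sequence


def compressed_context(sequence: List[str]) -> List[str]:
    """Keep the summary tokens of finished segments plus the unfinished tail.

    Assumption: once a segment has been summarized, its raw observations
    no longer need to stay in the context window.
    """
    if SUMMARY not in sequence:
        return list(sequence)
    last = len(sequence) - 1 - sequence[::-1].index(SUMMARY)
    return [SUMMARY] * sequence.count(SUMMARY) + sequence[last + 1:]


observations = [f"o{i}" for i in range(1, 8)]
sequence = interleave_summaries(observations, segment_len=3)
context = compressed_context(sequence)
```

With seven observations and a segment length of three, the context shrinks from nine tokens to three, which is the source of the compute and storage savings the abstract reports.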
[CV-7] LyTimeT: Towards Robust and Interpretable State-Variable Discovery
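[Quick read]: This paper addresses the difficulty of extracting the true dynamical variables from high-dimensional video, in particular the unstable latent-state learning caused by distractors such as background motion, occlusion, and texture changes. The key to the solution is LyTimeT, a two-phase framework for interpretable variable extraction. Phase 1 uses a spatio-temporal TimeSformer-based autoencoder whose global attention focuses on dynamically relevant regions and suppresses nuisance variation, giving robust and stable latent-state learning. Phase 2 selects physically meaningful dimensions through linear correlation analysis and adds a Lyapunov stability regularizer on the transition dynamics to enforce contraction and reduce error accumulation in roll-out prediction. On synthetic benchmarks and real-world dynamical systems (including chaotic phenomena), the method achieves mutual information and intrinsic dimension estimates closest to ground truth, stays invariant to background perturbations, and attains lower mean squared error.
Link: https://arxiv.org/abs/2510.19716
Authors: Kuai Yu,Crystal Su,Xiang Liu,Judah Goldfeder,Mingyuan Shao,Hod Lipson
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: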
Abstract:Extracting the true dynamical variables of a system from high-dimensional video is challenging due to distracting visual factors such as background motion, occlusions, and texture changes. We propose LyTimeT, a two-phase framework for interpretable variable extraction that learns robust and stable latent representations of dynamical systems. In Phase 1, LyTimeT employs a spatio-temporal TimeSformer-based autoencoder that uses global attention to focus on dynamically relevant regions while suppressing nuisance variation, enabling distraction-robust latent state learning and accurate long-horizon video prediction. In Phase 2, we probe the learned latent space, select the most physically meaningful dimensions using linear correlation analysis, and refine the transition dynamics with a Lyapunov-based stability regularizer to enforce contraction and reduce error accumulation during roll-outs. Experiments on five synthetic benchmarks and four real-world dynamical systems, including chaotic phenomena, show that LyTimeT achieves mutual information and intrinsic dimension estimates closest to ground truth, remains invariant under background perturbations, and delivers the lowest analytical mean squared error among CNN-based (TIDE) and transformer-only baselines. Our results demonstrate that combining spatio-temporal attention with stability constraints yields predictive models that are not only accurate but also physically interpretable.
[CV-8] Explainable Face Presentation Attack Detection via Ensemble-CAM
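[Quick read]: This paper addresses the lack of trust in deep learning (DL) based presentation attack detection (PAD) systems, whose decisions on forged biometric data such as face images are opaque. The key to the solution is Ensemble-CAM, a new visual explanation technique that integrates multiple class activation maps (CAMs) to provide interpretable visual evidence, revealing the key regions that lead the model to judge a face image as real or fake and thereby improving the interpretability, transparency, and trustworthiness of the system.
Link: https://arxiv.org/abs/2510.19695
Authors: Rashik Shadman, M G Sarwar Murshed, Faraz Hussain
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: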
Abstract:Presentation attacks represent a critical security threat where adversaries use fake biometric data, such as face, fingerprint, or iris images, to gain unauthorized access to protected systems. Various presentation attack detection (PAD) systems have been designed leveraging deep learning (DL) models to mitigate this type of threat. Despite their effectiveness, most of the DL models function as black boxes - their decisions are opaque to their users. The purpose of explainability techniques is to provide detailed information about the reason behind the behavior or decision of DL models. In particular, visual explanation is necessary to better understand the decisions or predictions of DL-based PAD systems and determine the key regions due to which a biometric image is considered real or fake by the system. In this work, a novel technique, Ensemble-CAM, is proposed for providing visual explanations for the decisions made by deep learning-based face PAD systems. Our goal is to improve DL-based face PAD systems by providing a better understanding of their behavior. Our provided visual explanations will enhance the transparency and trustworthiness of DL-based face PAD systems.
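The abstract does not say how Ensemble-CAM combines its maps. As a hedged illustration of the general idea of fusing several class activation maps, the sketch below min-max normalises each map and averages them pixel by pixel. The averaging rule and the function names are assumptions for illustration, not the authors' method.

```python
def normalize(cam):
    """Min-max normalise a 2D activation map (list of rows) to [0, 1]."""
    flat = [v for row in cam for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant map carries no spatial information.
        return [[0.0 for _ in row] for row in cam]
    return [[(v - lo) / span for v in row] for row in cam]


def ensemble_cam(cams):
    """Average several normalised maps of identical shape (assumed fusion rule)."""
    normed = [normalize(c) for c in cams]
    rows, cols = len(normed[0]), len(normed[0][0])
    return [
        [sum(n[r][c] for n in normed) / len(normed) for c in range(cols)]
        for r in range(rows)
    ]


cam_a = [[0.0, 2.0], [4.0, 8.0]]  # hypothetical map from one CAM method
cam_b = [[1.0, 1.0], [1.0, 3.0]]  # hypothetical map from another CAM method
fused = ensemble_cam([cam_a, cam_b])
```

Regions highlighted by both maps keep a high fused value, while regions highlighted by only one are attenuated.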
[CV-9] Curvilinear Structure-preserving Unpaired Cross-domain Medical Image Translation
[Quick read]: This paper addresses the poor preservation of fine curvilinear structures, such as microvasculature, by unpaired image-to-image translation in medical imaging; distorting these structures undermines diagnostic reliability and quantitative analysis. The key to the solution is Curvilinear Structure-preserving Translation (CST), a general framework that introduces topological supervision to explicitly preserve the geometric integrity of fine curvilinear structures during training. Concretely, CST adds a curvilinear extraction module that imposes a structure-consistency constraint on the baseline model, and it can be integrated into existing methods such as CycleGAN and UNSB, improving the fidelity and clinical applicability of cross-modality translation.
Link: https://arxiv.org/abs/2510.19679
Authors: Zihao Chen, Yi Zhou, Xudong Jiang, Li Chen, Leopold Schmetterer, Bingyao Tan, Jun Cheng
Affiliations: Nanyang Technological University; Singapore Eye Research Institute; Singapore National Eye Centre; Duke-NUS Medical School; Institute of Molecular and Clinical Ophthalmology; School of Chemical and Biomedical Engineering; Medical University of Vienna; Department of Clinical Pharmacology; Rothschild Foundation Hospital; Wuhan University of Science and Technology; Institute for Infocomm Research; A*STAR
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Unpaired image-to-image translation has emerged as a crucial technique in medical imaging, enabling cross-modality synthesis, domain adaptation, and data augmentation without costly paired datasets. Yet, existing approaches often distort fine curvilinear structures, such as microvasculature, undermining both diagnostic reliability and quantitative analysis. This limitation is consequential in ophthalmic and vascular imaging, where subtle morphological changes carry significant clinical meaning. We propose Curvilinear Structure-preserving Translation (CST), a general framework that explicitly preserves fine curvilinear structures during unpaired translation by integrating structure consistency into the training. Specifically, CST augments baseline models with a curvilinear extraction module for topological supervision. It can be seamlessly incorporated into existing methods. We integrate it into CycleGAN and UNSB as two representative backbones. Comprehensive evaluation across three imaging modalities: optical coherence tomography angiography, color fundus and X-ray coronary angiography demonstrates that CST improves translation fidelity and achieves state-of-the-art performance. By reinforcing geometric integrity in learned mappings, CST establishes a principled pathway toward curvilinear structure-aware cross-domain translation in medical imaging.
[CV-10] I Spy With My Model's Eye: Visual Search as a Behavioural Test for MLLMs
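[Quick read]: This paper addresses the black-box nature of visual processing in multimodal large language models (MLLMs): existing evaluations focus on task accuracy and reveal little about the underlying perceptual mechanisms. To diagnose MLLM visual perception interpretably, the authors borrow classic visual search paradigms from cognitive psychology and design controlled experiments that test whether MLLMs show a human-like pop-out effect, in which salient visual features are detected independently of the number of distractors. By controlling colour, size, and lighting features singly or in combination, they find that advanced MLLMs show human-like pop-out effects in disjunctive (single-feature) search and capacity limits in conjunctive (multiple-feature) search; they also observe that MLLMs, like humans, use natural scene priors such as lighting direction when building object representations. Targeted fine-tuning and mechanistic interpretability analyses support these findings, indicating that visual search is a cognitively grounded diagnostic tool for evaluating the perceptual capabilities of MLLMs.
Link: https://arxiv.org/abs/2510.19678
Authors: John Burden, Jonathan Prunty, Ben Slater, Matthieu Tehenan, Greg Davis, Lucy Cheke
Affiliations: Leverhulme Centre for the Future of Intelligence, University of Cambridge; Department of Engineering, University of Cambridge; Department of Psychology, University of Cambridge; Department of Computer Science, University of Cambridge
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Preprint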
Abstract:Multimodal large language models (MLLMs) achieve strong performance on vision-language tasks, yet their visual processing is opaque. Most black-box evaluations measure task accuracy, but reveal little about underlying mechanisms. Drawing on cognitive psychology, we adapt classic visual search paradigms – originally developed to study human perception – to test whether MLLMs exhibit the "pop-out" effect, where salient visual features are detected independently of distractor set size. Using controlled experiments targeting colour, size and lighting features, we find that advanced MLLMs exhibit human-like pop-out effects in colour or size-based disjunctive (single feature) search, as well as capacity limits for conjunctive (multiple feature) search. We also find evidence to suggest that MLLMs, like humans, incorporate natural scene priors such as lighting direction into object representations. We reinforce our findings using targeted fine-tuning and mechanistic interpretability analyses. Our work shows how visual search can serve as a cognitively grounded diagnostic tool for evaluating perceptual capabilities in MLLMs.
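One common way to quantify a pop-out effect is the slope of performance against distractor set size: a slope near zero means the target is found regardless of how many distractors are present. The sketch below fits that slope by ordinary least squares. The set sizes and accuracies are made-up illustrative numbers, not results from the paper, and using accuracy as the dependent measure is an assumption.

```python
def search_slope(set_sizes, accuracies):
    """Ordinary least-squares slope of accuracy as a function of set size."""
    n = len(set_sizes)
    mean_x = sum(set_sizes) / n
    mean_y = sum(accuracies) / n
    sxx = sum((x - mean_x) ** 2 for x in set_sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(set_sizes, accuracies))
    return sxy / sxx


set_sizes = [4, 8, 16, 32]
# Illustrative values only: flat accuracy (pop-out-like) vs. falling accuracy.
disjunctive_slope = search_slope(set_sizes, [0.98, 0.98, 0.98, 0.98])
conjunctive_slope = search_slope(set_sizes, [0.9, 0.8, 0.6, 0.2])
```

A flat slope in the single-feature condition together with a clearly negative slope in the multiple-feature condition is the pattern the abstract describes as human-like.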
[CV-11] Re-Activating Frozen Primitives for 3D Gaussian Splatting
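[Quick read]: This paper addresses the loss of rendering quality in 3D Gaussian Splatting (3D-GS) on complex scenes caused by over-reconstruction artifacts such as local blurring and needle-shaped distortions. Existing work attributes these artifacts to insufficient splitting of large-scale Gaussians, but the authors identify two more fundamental limitations: gradient magnitude dilution during densification, which stalls growth, and the primitive frozen phenomenon, in which essential Gaussian densification is inhibited in complex regions while suboptimally scaled Gaussians become trapped in local optima. The solution, ReAct-GS, has two key components: (1) an importance-aware densification criterion that uses α-blending weights from multiple viewpoints to re-activate stalled primitive growth in complex regions, and (2) a re-activation mechanism that revives frozen Gaussians through adaptive parameter perturbations, removing over-reconstruction artifacts and improving novel view synthesis.
Link: https://arxiv.org/abs/2510.19653
Authors: Yuxin Cheng, Binxiao Huang, Wenyong Zhou, Taiqiang Wu, Zhengwu Liu, Graziano Chesi, Ngai Wong
Affiliations: The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: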
Abstract:3D Gaussian Splatting (3D-GS) achieves real-time photorealistic novel view synthesis, yet struggles with complex scenes due to over-reconstruction artifacts, manifesting as local blurring and needle-shape distortions. While recent approaches attribute these issues to insufficient splitting of large-scale Gaussians, we identify two fundamental limitations: gradient magnitude dilution during densification and the primitive frozen phenomenon, where essential Gaussian densification is inhibited in complex regions while suboptimally scaled Gaussians become trapped in local optima. To address these challenges, we introduce ReAct-GS, a method founded on the principle of re-activation. Our approach features: (1) an importance-aware densification criterion incorporating α-blending weights from multiple viewpoints to re-activate stalled primitive growth in complex regions, and (2) a re-activation mechanism that revitalizes frozen primitives through adaptive parameter perturbations. Comprehensive experiments across diverse real-world datasets demonstrate that ReAct-GS effectively eliminates over-reconstruction artifacts and achieves state-of-the-art performance on standard novel view synthesis metrics while preserving intricate geometric details. Additionally, our re-activation mechanism yields consistent improvements when integrated with other 3D-GS variants such as Pixel-GS, demonstrating its broad applicability.
[CV-12] MedReason-R1: Learning to Reason for CT Diagnosis with Reinforcement Learning and Local Zoom
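[Quick read]: This paper addresses the weak performance of general-purpose large vision-language models (VLMs) in the medical domain, specifically the bottlenecks in CT disease diagnosis caused by the lack of large-scale, high-quality specialized medical datasets and by neglect of the coarse-to-fine diagnostic reasoning process. The solution has two parts: first, the CT-RATE-VQA dataset with 84K question-answer pairs addresses data scarcity; second, MedReason-R1 embeds zoomed-in disease region-of-interest (ROI) areas into the image to jointly model global localization and disease-specific detail, and adopts the GRPO reinforcement learning framework to learn reasoning without costly manual annotation, improving diagnostic accuracy while retaining generalization.
Link: https://arxiv.org/abs/2510.19626
Authors: Yifan Li, Fenghe Tang, Yingtai Li, Shaohua Kevin Zhou
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The code, checkpoints, and dataset are available at: this https URL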
Abstract:General-purpose large Vision-Language Models (VLMs) demonstrate strong capabilities in generating detailed descriptions for natural images. However, their performance in the medical domain remains suboptimal, even for relatively straightforward tasks, primarily due to the lack of large-scale, high-quality, specialized medical imaging datasets and the neglect of the diagnostic process that progresses from coarse to fine-grained. To address the first issue, we construct the CT-RATE-VQA dataset, which has 84K QA pairs. For the second issue, we propose MedReason-R1, a medical VLM with explicit reasoning process for disease diagnosis. MedReason-R1 incorporates a novel strategy that embeds zoom-in disease region-of-interest areas into the image, highlighting the crucial role of both global localization and disease-specific details in enhancing the model’s diagnostic performance. Furthermore, we introduce the GRPO reinforcement learning framework to MedReason-R1, which enables effective reasoning without relying on costly manual annotations. Compared to recent general-purpose and medical VLMs, MedReason-R1 achieves state-of-the-art performance in CT disease diagnosis while retaining generalization. The code, checkpoints, and dataset are available at: this https URL
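GRPO, as it is commonly described, scores each sampled response relative to the other responses drawn for the same prompt instead of relying on a learned value function. A minimal sketch of that group-relative advantage follows. The binary rewards are illustrative, the use of the population standard deviation is an assumption, and how MedReason-R1 defines its reward is not stated in the abstract.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four sampled answers: 1.0 = correct diagnosis, 0.0 = wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat the group average receive a positive advantage and are reinforced; the others are pushed down, which is why no per-sample human annotation of the reasoning is needed.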
[CV-13] Augmenting Moment Retrieval: Zero-Dependency Two-Stage Learning ICCV2025
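[Quick read]: This paper addresses three bottlenecks of existing moment retrieval methods: data scarcity that confines models to shallow keyword-feature associations, boundary ambiguity in the transition regions between adjacent events, and insufficient discrimination of fine-grained semantics (for example, kicking versus throwing a ball). The solution is AMR, an augmented moment retrieval framework with no external dependencies, built on two designs. First, data augmentation resolves boundary ambiguity and semantic confusion in the existing annotations without additional manual labeling cost. Second, training proceeds in two stages: a cold-start stage applies curriculum learning on the augmented data to build basic boundary and semantic awareness, and a distillation stage introduces dual query sets, where Original Queries keep localization stable by using the frozen Base Queries from the cold-start model and Active Queries adapt dynamically to the real data distribution, while a cross-stage distillation loss prevents knowledge forgetting and improves generalization.
Link: https://arxiv.org/abs/2510.19622
Authors: Zhengxuan Wei, Jiajin Tang, Sibei Yang
Affiliations: ShanghaiTech University; Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work is accepted by ICCV 2025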
Abstract:Existing Moment Retrieval methods face three critical bottlenecks: (1) data scarcity forces models into shallow keyword-feature associations; (2) boundary ambiguity in transition regions between adjacent events; (3) insufficient discrimination of fine-grained semantics (e.g., distinguishing "kicking" vs. "throwing" a ball). In this paper, we propose a zero-external-dependency Augmented Moment Retrieval framework, AMR, designed to overcome local optima caused by insufficient data annotations and the lack of robust boundary and semantic discrimination capabilities. AMR is built upon two key insights: (1) it resolves ambiguous boundary information and semantic confusion in existing annotations without additional data (avoiding costly manual labeling), and (2) it preserves boundary and semantic discriminative capabilities enhanced by training while generalizing to real-world scenarios, significantly improving performance. Furthermore, we propose a two-stage training framework with cold-start and distillation adaptation. The cold-start stage employs curriculum learning on augmented data to build foundational boundary/semantic awareness. The distillation stage introduces dual query sets: Original Queries maintain DETR-based localization using frozen Base Queries from the cold-start model, while Active Queries dynamically adapt to real-data distributions. A cross-stage distillation loss enforces consistency between Original and Base Queries, preventing knowledge forgetting while enabling real-world generalization. Experiments on multiple benchmarks show that AMR achieves improved performance over prior state-of-the-art approaches.
[CV-14] Pragmatic Heterogeneous Collaborative Perception via Generative Communication Mechanism NEURIPS2025
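[Quick read]: This paper addresses the domain gap in heterogeneous multi-agent systems caused by differences in sensors and models, with the goal of efficient, low-overhead collaborative perception. Existing methods rely on adaptation or reconstruction and have two practical limitations: intrusive retraining of the encoder or core modules disrupts the semantic consistency established among agents, and accommodating new agents is computationally expensive, which limits scalability. The key to the solution is a Generative Communication mechanism (GenComm) with three components: (1) a Deformable Message Extractor that extracts spatial information to transmit in place of intermediate features; (2) a Spatial-Aware Feature Generator, built on a conditional diffusion model, that generates features aligned with the ego agent's semantic space while preserving the collaborators' spatial information, without modifying the original network; and (3) a lightweight Channel Enhancer that refines the generated features before fusion. The approach reduces the computational cost and parameter count of adding new agents by 81% in the reported experiments while outperforming existing state-of-the-art methods.
Link: https://arxiv.org/abs/2510.19618
Authors: Junfei Zhou, Penglin Dai, Quanmin Wei, Bingyi Liu, Xiao Wu, Jianping Wang
Affiliations: Southwest Jiaotong University; Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education, China; Wuhan University of Technology; City University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 10 figures, accepted to NeurIPS 2025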
Abstract:Multi-agent collaboration enhances the perception capabilities of individual agents through information sharing. However, in real-world applications, differences in sensors and models across heterogeneous agents inevitably lead to domain gaps during collaboration. Existing approaches based on adaptation and reconstruction fail to support pragmatic heterogeneous collaboration due to two key limitations: (1) Intrusive retraining of the encoder or core modules disrupts the established semantic consistency among agents; and (2) accommodating new agents incurs high computational costs, limiting scalability. To address these challenges, we present a novel Generative Communication mechanism (GenComm) that facilitates seamless perception across heterogeneous multi-agent systems through feature generation, without altering the original network, and employs lightweight numerical alignment of spatial information to efficiently integrate new agents at minimal cost. Specifically, a tailored Deformable Message Extractor is designed to extract spatial message for each collaborator, which is then transmitted in place of intermediate features. The Spatial-Aware Feature Generator, utilizing a conditional diffusion model, generates features aligned with the ego agent’s semantic space while preserving the spatial information of the collaborators. These generated features are further refined by a Channel Enhancer before fusion. Experiments conducted on the OPV2V-H, DAIR-V2X and V2X-Real datasets demonstrate that GenComm outperforms existing state-of-the-art methods, achieving an 81% reduction in both computational cost and parameter count when incorporating new agents. Our code is available at this https URL.
[CV-15] Beyond sparse denoising in frames: minimax estimation with a scattering transform
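[Quick read]: This paper addresses denoising of complex signals contaminated by additive Gaussian noise, in particular cartoon images whose edges are piecewise C^α curves, for which classical sparse representations (wavelet, curvelet, and Xlet frames) are suboptimal when the Lipschitz exponent α ≤ 2 is unknown. The key to the solution is an estimator that jointly minimises and maximises the ℓ¹ norms of different subsets of scattering coefficients; these norms capture different types of geometric image regularity. Numerical experiments show that the estimator reaches the minimax asymptotic bound for all α ≤ 2, a result the authors state as a mathematical conjecture, and the work offers a mathematical bridge between harmonic analysis and deep convolutional networks.
Link: https://arxiv.org/abs/2510.19612
Authors: Nathanaël Cuvelle-Magar, Stéphane Mallat
Affiliations: Ecole normale supérieure; Collège de France; Flatiron Institute, Simons Foundation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: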
Abstract:A considerable amount of research in harmonic analysis has been devoted to non-linear estimators of signals contaminated by additive Gaussian noise. They are implemented by thresholding coefficients in a frame, which provide a sparse signal representation, or by minimising their ℓ¹ norm. However, sparse estimators in frames are not sufficiently rich to adapt to complex signal regularities. For cartoon images whose edges are piecewise C^α curves, wavelet, curvelet and Xlet frames are suboptimal if the Lipschitz exponent α ≤ 2 is an unknown parameter. Deep convolutional neural networks have recently obtained much better numerical results, which reach the minimax asymptotic bounds for all α. Wavelet scattering coefficients have been introduced as simplified convolutional neural network models. They are computed by transforming the modulus of wavelet coefficients with a second wavelet transform. We introduce a denoising estimator by jointly minimising and maximising the ℓ¹ norms of different subsets of scattering coefficients. We prove that these ℓ¹ norms capture different types of geometric image regularity. Numerical experiments show that this denoising estimator reaches the minimax asymptotic bound for cartoon images for all Lipschitz exponents α ≤ 2. We state this numerical result as a mathematical conjecture. It provides a different harmonic analysis approach to suppress noise from signals, and to specify the geometric regularity of functions. It also opens a mathematical bridge between harmonic analysis and denoising estimators with deep convolutional network.
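The classical frame estimators that the abstract takes as its starting point can be illustrated with soft thresholding, which is the closed-form minimiser of 0.5 * (x - y)^2 + lam * |x| for each noisy coefficient y. This is a sketch of that baseline only, not of the scattering estimator the paper proposes; the coefficient values and the threshold are made up for illustration.

```python
import math


def soft_threshold(coeffs, lam):
    """Shrink each coefficient towards zero by lam; small ones become zero."""
    return [math.copysign(max(abs(c) - lam, 0.0), c) for c in coeffs]


# Hypothetical noisy frame coefficients and threshold.
noisy = [3.0, -0.5, 1.0, -2.0]
denoised = soft_threshold(noisy, lam=1.0)
```

Coefficients whose magnitude is at most lam are set to zero, which is what makes the estimate sparse; the abstract's point is that sparsity in a fixed frame is not rich enough for cartoon images with unknown edge regularity.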
[CV-16] XBench: A Comprehensive Benchmark for Visual-Language Explanations in Chest Radiography
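[Quick read]: This paper addresses the insufficient cross-modal interpretability of current vision-language models (VLMs) in chest X-ray understanding, in particular that their grounding ability, the alignment between textual concepts and visual evidence, has not been systematically evaluated. The key to the solution is XBench, the first systematic benchmark for chest X-rays: it generates visual explanations using cross-attention and similarity-based localization maps, quantifies their agreement with radiologist-annotated regions, and thereby reveals the relationship between recognition ability and grounding ability across VLM variants as well as their limitations on small or diffuse lesions.
Link: https://arxiv.org/abs/2510.19599
Authors: Haozhe Luo, Shelley Zixin Shu, Ziyu Zhou, Sebastian Otalora, Mauricio Reyes
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: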
Abstract:Vision-language models (VLMs) have recently shown remarkable zero-shot performance in medical image understanding, yet their grounding ability, the extent to which textual concepts align with visual evidence, remains underexplored. In the medical domain, however, reliable grounding is essential for interpretability and clinical adoption. In this work, we present the first systematic benchmark for evaluating cross-modal interpretability in chest X-rays across seven CLIP-style VLM variants. We generate visual explanations using cross-attention and similarity-based localization maps, and quantitatively assess their alignment with radiologist-annotated regions across multiple pathologies. Our analysis reveals that: (1) while all VLM variants demonstrate reasonable localization for large and well-defined pathologies, their performance substantially degrades for small or diffuse lesions; (2) models that are pretrained on chest X-ray-specific datasets exhibit improved alignment compared to those trained on general-domain data; (3) the overall recognition ability and grounding ability of the model are strongly correlated. These findings underscore that current VLMs, despite their strong recognition ability, still fall short in clinically reliable grounding, highlighting the need for targeted interpretability benchmarks before deployment in medical practice. XBench code is available at this https URL
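The abstract does not name the metric used to compare localization maps with radiologist annotations. One widely used option is intersection over union (IoU) between the thresholded map and the annotated mask, sketched below. The choice of IoU, the 0.5 threshold, and the toy arrays are assumptions made for illustration.

```python
def localization_iou(saliency, mask, threshold=0.5):
    """IoU between a thresholded saliency map and a binary annotation mask."""
    intersection = 0
    union = 0
    for s_row, m_row in zip(saliency, mask):
        for s, m in zip(s_row, m_row):
            predicted = s >= threshold
            annotated = bool(m)
            intersection += int(predicted and annotated)
            union += int(predicted or annotated)
    return intersection / union if union else 0.0


# Hypothetical 2x2 localization map and annotation mask.
saliency_map = [[0.9, 0.6], [0.2, 0.1]]
annotation = [[1, 0], [1, 0]]
score = localization_iou(saliency_map, annotation)
```

Small or diffuse lesions cover few pixels, so even a slightly misplaced map yields a low overlap score, which is consistent with the degradation the abstract reports for such pathologies.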
[CV-17] CBDiff: Conditional Bernoulli Diffusion Models for Image Forgery Localization
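[Quick read]: This paper addresses the limited precision and reliability of existing image forgery localization (IFL) methods, which produce a single deterministic localization map and fall short of the needs of high-stakes applications such as forensic analysis and security surveillance. The key to the solution is the Conditional Bernoulli Diffusion Model (CBDiff), which generates multiple diverse and plausible localization maps to represent the uncertainty and variability of the forgery distribution more fully. It incorporates Bernoulli noise into the diffusion process to better model the binary and sparse nature of forgery masks, and introduces Time-Step Cross-Attention (TSCAttention), which uses semantic features to guide inference at different time steps, improving detection performance.
Link: https://arxiv.org/abs/2510.19597
Authors: Zhou Lei, Pan Gang, Wang Jiahao, Sun Di
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: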
Abstract:Image Forgery Localization (IFL) is a crucial task in image forensics, aimed at accurately identifying manipulated or tampered regions within an image at the pixel level. Existing methods typically generate a single deterministic localization map, which often lacks the precision and reliability required for high-stakes applications such as forensic analysis and security surveillance. To enhance the credibility of predictions and mitigate the risk of errors, we introduce an advanced Conditional Bernoulli Diffusion Model (CBDiff). Given a forged image, CBDiff generates multiple diverse and plausible localization maps, thereby offering a richer and more comprehensive representation of the forgery distribution. This approach addresses the uncertainty and variability inherent in tampered regions. Furthermore, CBDiff innovatively incorporates Bernoulli noise into the diffusion process to more faithfully reflect the inherent binary and sparse properties of forgery masks. Additionally, CBDiff introduces a Time-Step Cross-Attention (TSCAttention), which is specifically designed to leverage semantic feature guidance with temporal steps to improve manipulation detection. Extensive experiments on eight public benchmark datasets demonstrate that CBDiff significantly outperforms existing state-of-the-art methods, highlighting its strong potential for real-world deployment.
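To make the idea of Bernoulli noise on a binary mask concrete, the sketch below applies one forward noising step in which every pixel is flipped independently with a given probability, so the mask stays binary at every step (unlike Gaussian noise). The flip-based parameterisation, the function name, and the toy mask are assumptions; the abstract does not give CBDiff's actual noise schedule.

```python
import random


def bernoulli_noise_step(mask, flip_prob, rng):
    """Flip each binary pixel independently with probability flip_prob."""
    return [
        [1 - px if rng.random() < flip_prob else px for px in row]
        for row in mask
    ]


rng = random.Random(0)           # seeded for reproducibility
clean_mask = [[0, 1], [1, 0]]    # hypothetical forgery mask: 1 = tampered pixel
unchanged = bernoulli_noise_step(clean_mask, 0.0, rng)
inverted = bernoulli_noise_step(clean_mask, 1.0, rng)
slightly_noisy = bernoulli_noise_step(clean_mask, 0.1, rng)
```

Sampling the reverse process several times from different noisy starting masks is what would give multiple plausible localization maps rather than a single deterministic one.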
[CV-18] Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation WWW
[Quick read]: This paper addresses video object segmentation (VOS) in the training-free setting, that is, how to localize and segment video objects accurately without fine-tuning. The core challenge is that the raw attention maps produced by multimodal large language models (MLLMs) are noisy and poorly aligned with object regions, so they cannot be used directly for segmentation. The key to the solution is Decomposed Attention Fusion (DecAF), which refines the maps in two steps: (1) contrastive object-background fusion, which suppresses irrelevant activations and strengthens object-focused cues, and (2) complementary video-frame fusion, which uses temporal information to improve spatial consistency. Attention-guided SAM2 prompting then yields fine-grained masks without retraining. The method achieves performance comparable to training-based methods on both referring and reasoning VOS benchmarks.
Link: https://arxiv.org/abs/2510.19592
Authors: Su Ho Han, Jeongseok Hyun, Pilhyeon Lee, Minho Shim, Dongyoon Wee, Seon Joo Kim
Affiliations: Yonsei University; Inha University; NAVER Cloud
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:Multimodal large language models (MLLMs) demonstrate strong video understanding by attending to visual tokens relevant to textual queries. To directly adapt this for localization in a training-free manner, we cast video reasoning segmentation as a video QA task and extract attention maps via rollout mechanism. However, raw attention maps are noisy and poorly aligned with object regions. We propose Decomposed Attention Fusion (DecAF), which refines these maps through two mechanisms: (1) contrastive object-background fusion and (2) complementary video-frame fusion. This method suppresses irrelevant activations and enhances object-focused cues, enabling direct conversion of attention maps into coarse segmentation masks. In addition, we introduce attention-guided SAM2 prompting for obtaining fine-grained masks. Unlike existing methods that jointly train MLLMs with SAM, our method operates entirely without retraining. DecAF outperforms training-free methods and achieves performance comparable to training-based methods on both referring and reasoning VOS benchmarks. The code will be available at this https URL.
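The abstract names contrastive object-background fusion without giving its formula. One simple reading, sketched here as an assumption, is to subtract an attention map obtained for the background from the map obtained for the object, clamp negatives to zero, and rescale so the strongest response is 1. The function name and the toy maps are hypothetical.

```python
def contrastive_fusion(object_attn, background_attn):
    """Keep only attention that is stronger for the object than for the background."""
    diff = [
        [max(o - b, 0.0) for o, b in zip(o_row, b_row)]
        for o_row, b_row in zip(object_attn, background_attn)
    ]
    peak = max(max(row) for row in diff)
    if peak == 0:
        return diff  # nothing is object-specific; return the all-zero map
    return [[v / peak for v in row] for row in diff]


# Hypothetical 2x2 attention maps for an object query and a background query.
object_attn = [[0.8, 0.4], [0.2, 0.1]]
background_attn = [[0.2, 0.4], [0.3, 0.0]]
fused = contrastive_fusion(object_attn, background_attn)
```

Locations that attract attention for both queries cancel out, which is one way to suppress the irrelevant activations the abstract mentions before thresholding the map into a coarse mask.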
[CV-19] Digitizing Paper ECGs at Scale: An Open-Source Algorithm for Clinical Research
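[Quick read]: This paper addresses the fact that large numbers of clinical electrocardiograms (ECGs) exist only as paper scans and therefore cannot be used for modern automated diagnostics. The key to the solution is a fully automated, modular framework that converts scanned or photographed ECG images into digital signals suitable for clinical and research use. The framework is validated on 37,191 images, reaches a mean signal-to-noise ratio of 19.65 dB on scanned papers with common artifacts, and improves on the state of the art on the Emory Paper Digitization ECG Dataset. The software is released as open source to support digitization of retrospective ECG archives and broader access to AI-driven diagnostics.
Link: https://arxiv.org/abs/2510.19590
Authors: Elias Stenhede, Agnar Martin Bjørnstad, Arian Ranjbar
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: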
Abstract:Millions of clinical ECGs exist only as paper scans, making them unusable for modern automated diagnostics. We introduce a fully automated, modular framework that converts scanned or photographed ECGs into digital signals, suitable for both clinical and research applications. The framework is validated on 37,191 ECG images with 1,596 collected at Akershus University Hospital, where the algorithm obtains a mean signal-to-noise ratio of 19.65 dB on scanned papers with common artifacts. It is further evaluated on the Emory Paper Digitization ECG Dataset, comprising 35,595 images, including images with perspective distortion, wrinkles, and stains. The model improves on the state-of-the-art in all subcategories. The full software is released as open-source, promoting reproducibility and further development. We hope the software will contribute to unlocking retrospective ECG archives and democratize access to AI-driven diagnostics.
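For readers unfamiliar with the reported metric, a signal-to-noise ratio in decibels is usually computed as 10 * log10(signal power / noise power), where the noise is the difference between the reference signal and the digitized one. The sketch below uses that standard definition; whether the paper applies exactly this form (for example per lead or after alignment) is not stated in the abstract, and the sample values are made up.

```python
import math


def snr_db(reference, reconstructed):
    """SNR in dB of a reconstructed signal against a reference of equal length."""
    signal_power = sum(r * r for r in reference)
    noise_power = sum((r - x) ** 2 for r, x in zip(reference, reconstructed))
    if noise_power == 0:
        return math.inf  # perfect reconstruction
    return 10 * math.log10(signal_power / noise_power)


# Hypothetical ground-truth samples and their digitized estimate.
reference = [1.0, -1.0, 1.0, -1.0]
digitized = [1.1, -0.9, 0.9, -1.1]
value = snr_db(reference, digitized)
```

On this scale, every 10 dB corresponds to a tenfold reduction in noise power relative to the signal.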
[CV-20] Uncertainty evaluation of segmentation models for Earth observation
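[Quick read]: This paper addresses uncertainty estimation for semantic segmentation predictions on remote sensing imagery. Compared with standard image classification, segmentation requires per-pixel uncertainty estimates from methods that scale in practice. The key to the solution is a systematic evaluation of existing uncertainty estimation methods, including Stochastic Segmentation Networks and ensembles, combined with different neural architectures and uncertainty metrics, on two contrasting remote sensing datasets (PASTIS and ForTy), leading to practical recommendations based on utility, such as the ability to identify prediction errors and noise-corrupted input regions.
Link: https://arxiv.org/abs/2510.19586
Authors: Melanie Rey, Andriy Mnih, Maxim Neumann, Matt Overlan, Drew Purves
Affiliations: Google DeepMind
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: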
Abstract:This paper investigates methods for estimating uncertainty in semantic segmentation predictions derived from satellite imagery. Estimating uncertainty for segmentation presents unique challenges compared to standard image classification, requiring scalable methods producing per-pixel estimates. While most research on this topic has focused on scene understanding or medical imaging, this work benchmarks existing methods specifically for remote sensing and Earth observation applications. Our evaluation focuses on the practical utility of uncertainty measures, testing their ability to identify prediction errors and noise-corrupted input image regions. Experiments are conducted on two remote sensing datasets, PASTIS and ForTy, selected for their differences in scale, geographic coverage, and label confidence. We perform an extensive evaluation featuring several models, such as Stochastic Segmentation Networks and ensembles, in combination with a number of neural architectures and uncertainty metrics. We make a number of practical recommendations based on our findings.
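A common per-pixel uncertainty measure for ensembles is the entropy of the averaged class probabilities: it is zero when all members agree on a confident prediction and grows as they disagree. The sketch below computes it for a single pixel. It is one possible metric among the several the abstract says were evaluated, and the probability values are illustrative.

```python
import math


def predictive_entropy(member_probs):
    """Entropy (in nats) of the ensemble-mean class distribution for one pixel."""
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    mean = [sum(p[k] for p in member_probs) / n_members for k in range(n_classes)]
    # Classes with zero mean probability contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in mean if p > 0)


# Two hypothetical ensemble members, two classes, one pixel each.
agree = predictive_entropy([[1.0, 0.0], [1.0, 0.0]])
disagree = predictive_entropy([[1.0, 0.0], [0.0, 1.0]])
```

Applying the function to every pixel gives an uncertainty map that can be compared with the error map, which is the kind of practical utility test the abstract describes.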
[CV-21] Addressing the Depth-of-Field Constraint: A New Paradigm for High Resolution Multi-Focus Image Fusion
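[Quick read]: This paper addresses key challenges in multi-focus image fusion (MFIF): limited training data, the domain gap between synthetic datasets and real scenes, and the handling of regions that lack information. The core of the solution is VAEEDOF, a new method based on a distilled variational autoencoder that achieves high-fidelity, efficient image reconstruction and fuses up to seven images simultaneously for robustness across different focus points. To ease data scarcity, the authors also introduce MattingMFIF, a new synthetic 4K dataset that simulates realistic depth-of-field (DOF) effects from real photographs.
Link: https://arxiv.org/abs/2510.19581
Authors: Luca Piano, Peng Huanwen, Radu Ciprian Bilcu
Affiliations: Huawei Technologies Oy (Finland) Company Ltd; Politecnico di Torino
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: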
Abstract:Multi-focus image fusion (MFIF) addresses the depth-of-field (DOF) limitations of optical lenses, where only objects within a specific range appear sharp. Although traditional and deep learning methods have advanced the field, challenges persist, including limited training data, domain gaps from synthetic datasets, and difficulties with regions lacking information. We propose VAEEDOF, a novel MFIF method that uses a distilled variational autoencoder for high-fidelity, efficient image reconstruction. Our fusion module processes up to seven images simultaneously, enabling robust fusion across diverse focus points. To address data scarcity, we introduce MattingMFIF, a new synthetic 4K dataset, simulating realistic DOF effects from real photographs. Our method achieves state-of-the-art results, generating seamless artifact-free fused images and bridging the gap between synthetic and real-world scenarios, offering a significant step forward in addressing complex MFIF challenges. The code and weights are available here:
[CV-22] Multi-modal Co-learning for Earth Observation: Enhancing single-modality models via modality collaboration
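[Quick read]: This paper addresses a deployment challenge for multi-modal co-learning in Earth Observation (EO): several sensor modalities are available during training, but only one may be accessible at inference, and existing methods are usually tailored to a particular task or a particular inference modality. The key to the solution is a multi-modal co-learning framework that combines contrastive learning with modality discriminative learning to guide single-modality models to separate modality-shared from modality-specific information in their internal feature space, so that the framework generalizes across tasks without targeting a specific inference modality. It is validated on four EO benchmarks spanning classification and regression, where it outperforms state-of-the-art methods from machine learning, computer vision, and EO-specific work.
Link: https://arxiv.org/abs/2510.19579
Authors: Francisco Mena, Dino Ienco, Cassio F. Dantas, Roberto Interdonato, Andreas Dengel
Affiliations: University of Kaiserslautern-Landau (RPTU); German Research Center for Artificial Intelligence (DFKI); University of Montpellier; INRAE; CIRAD; INRIA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at the Machine Learning journal, CfP: Discovery Science 2024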
Abstract:Multi-modal co-learning is emerging as an effective paradigm in machine learning, enabling models to collaboratively learn from different modalities to enhance single-modality predictions. Earth Observation (EO) represents a quintessential domain for multi-modal data analysis, wherein diverse remote sensors collect data to sense our planet. This unprecedented volume of data introduces novel challenges. Specifically, the access to the same sensor modalities at both training and inference stages becomes increasingly complex based on real-world constraints affecting remote sensing platforms. In this context, multi-modal co-learning presents a promising strategy to leverage the vast amount of sensor-derived data available at the training stage to improve single-modality models for inference-time deployment. Most current research efforts focus on designing customized solutions for either particular downstream tasks or specific modalities available at the inference stage. To address this, we propose a novel multi-modal co-learning framework capable of generalizing across various tasks without targeting a specific modality for inference. Our approach combines contrastive and modality discriminative learning together to guide single-modality models to structure the internal model manifold into modality-shared and modality-specific information. We evaluate our framework on four EO benchmarks spanning classification and regression tasks across different sensor modalities, where only one of the modalities available during training is accessible at inference time. Our results demonstrate consistent predictive improvements over state-of-the-art approaches from the recent machine learning and computer vision literature, as well as EO-specific methods. The obtained findings validate our framework in the single-modality inference scenarios across a diverse range of EO applications.
[CV-23] VGD: Visual Geometry Gaussian Splatting for Feed-Forward Surround-view Driving Reconstruction
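[Quick read]: This paper addresses the difficulty of achieving both geometric consistency and novel-view quality in feed-forward surround-view reconstruction of autonomous driving scenes, where the minimal overlap between views makes both hard to guarantee. The key to the solution is to learn geometric information explicitly and use the resulting features to guide the improvement of semantic quality in novel views. A lightweight VGGT variant distills geometric priors from the pre-trained model into a geometry branch; a Gaussian Head fuses multi-scale geometry tokens to predict Gaussian parameters for novel view rendering; and multi-scale features from the geometry and Gaussian head branches jointly supervise a semantic refinement model through feature-consistent learning, improving reconstruction quality and generalization.
Link: https://arxiv.org/abs/2510.19578
Authors: Junhong Lin, Kangli Wang, Shunzhou Wang, Songlin Fan, Ge Li, Wei Gao
Affiliations: Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology; School of Electronic and Computer Engineering; Peking University; Peng Cheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 7 figures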
Abstract:Feed-forward surround-view autonomous driving scene reconstruction offers fast, generalizable inference ability, which faces the core challenge of ensuring generalization while elevating novel view quality. Due to the surround-view with minimal overlap regions, existing methods typically fail to ensure geometric consistency and reconstruction quality for novel views. To tackle this tension, we claim that geometric information must be learned explicitly, and the resulting features should be leveraged to guide the elevating of semantic quality in novel views. In this paper, we introduce Visual Gaussian Driving (VGD), a novel feed-forward end-to-end learning framework designed to address this challenge. To achieve generalizable geometric estimation, we design a lightweight variant of the VGGT architecture to efficiently distill its geometric priors from the pre-trained VGGT to the geometry branch. Furthermore, we design a Gaussian Head that fuses multi-scale geometry tokens to predict Gaussian parameters for novel view rendering, which shares the same patch backbone as the geometry branch. Finally, we integrate multi-scale features from both geometry and Gaussian head branches to jointly supervise a semantic refinement model, optimizing rendering quality through feature-consistent learning. Experiments on nuScenes demonstrate that our approach significantly outperforms state-of-the-art methods in both objective metrics and subjective quality under various settings, which validates VGD’s scalability and high-fidelity surround-view reconstruction.
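[CV-24] Can You Trust What You See? Alpha Channel No-Box Attacks on Video Object Detection
Quick read: This paper studies adversarial attacks on object detection models in the video domain under the no-box setting (no model information available), motivated by the security of detectors deployed in cyber-physical systems such as autonomous vehicles and surveillance platforms. The key contribution is α-Cloak, an attack carried out entirely through the alpha channel of RGBA videos: the alpha channel is used to fuse a malicious target video with a benign one, producing a fused video that looks innocuous to human viewers yet consistently misleads object detectors. The attack needs no access to the model's architecture, parameters, or outputs, and introduces no perceptible artifacts.
Link: https://arxiv.org/abs/2510.19574
Authors: Ariana Yi, Ce Zhou, Liyang Xiao, Qiben Yan
Affiliations: Mission San Jose High School; Missouri University of Science and Technology; Michigan State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: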
Abstract:As object detection models are increasingly deployed in cyber-physical systems such as autonomous vehicles (AVs) and surveillance platforms, ensuring their security against adversarial threats is essential. While prior work has explored adversarial attacks in the image domain, those attacks in the video domain remain largely unexamined, especially in the no-box setting. In this paper, we present α-Cloak, the first no-box adversarial attack on object detectors that operates entirely through the alpha channel of RGBA videos. α-Cloak exploits the alpha channel to fuse a malicious target video with a benign video, resulting in a fused video that appears innocuous to human viewers but consistently fools object detectors. Our attack requires no access to model architecture, parameters, or outputs, and introduces no perceptible artifacts. We systematically study the support for alpha channels across common video formats and playback applications, and design a fusion algorithm that ensures visual stealth and compatibility. We evaluate α-Cloak on five state-of-the-art object detectors, a vision-language model, and a multi-modal large language model (Gemini-2.0-Flash), demonstrating a 100% attack success rate across all scenarios. Our findings reveal a previously unexplored vulnerability in video-based perception systems, highlighting the urgent need for defenses that account for the alpha channel in adversarial settings.
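To make the role of the alpha channel concrete, here is a minimal Python sketch of standard "over" compositing for a single pixel. It is not the authors' fusion algorithm, which the abstract does not describe. The pixel values, the white background, and the assumption that a detection pipeline reads the stored RGB values and discards alpha are all hypothetical and chosen only for illustration.

```python
def composite_over(rgb, alpha, background):
    """Standard 'over' compositing for one pixel; alpha is in [0, 1]."""
    return tuple(round(alpha * c + (1 - alpha) * b) for c, b in zip(rgb, background))


def drop_alpha(rgba):
    """Hypothetical pipeline step that keeps only the stored RGB values."""
    return rgba[:3]


# Hypothetical RGBA pixel: strong red stored in RGB, but fully transparent.
pixel = (200, 30, 30, 0.0)
background = (255, 255, 255)  # assumed white player background

seen = composite_over(pixel[:3], pixel[3], background)  # what a compositing viewer displays
read = drop_alpha(pixel)                                # what an alpha-ignoring reader gets
print(seen, read)
```

In this toy case the displayed pixel is white while the stored RGB is red. That gap between what is rendered and what is read is the kind of discrepancy an alpha-channel attack would depend on.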
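[CV-25] HAD: Hierarchical Asymmetric Distillation to Bridge Spatio-Temporal Gaps in Event-Based Object Tracking
Quick read: This paper addresses the pronounced spatio-temporal asymmetry between RGB cameras and event cameras in object tracking, which stems from their different imaging mechanisms and hinders effective fusion of the two modalities. The key contribution is Hierarchical Asymmetric Distillation (HAD), a framework whose hierarchical alignment strategy minimizes information loss while preserving the student network's computational efficiency and parameter compactness, so that the complementary strengths of both modalities can be combined.
Link: https://arxiv.org/abs/2510.19560
Authors: Yao Deng, Xian Zhong, Wenxuan Liu, Zhaofei Yu, Jingling Yuan, Tiejun Huang
Affiliations: Wuhan University of Technology; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: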
Abstract:RGB cameras excel at capturing rich texture details with high spatial resolution, whereas event cameras offer exceptional temporal resolution and a high dynamic range (HDR). Leveraging their complementary strengths can substantially enhance object tracking under challenging conditions, such as high-speed motion, HDR environments, and dynamic background interference. However, a significant spatio-temporal asymmetry exists between these two modalities due to their fundamentally different imaging mechanisms, hindering effective multi-modal integration. To address this issue, we propose Hierarchical Asymmetric Distillation (HAD), a multi-modal knowledge distillation framework that explicitly models and mitigates spatio-temporal asymmetries. Specifically, HAD proposes a hierarchical alignment strategy that minimizes information loss while maintaining the student network’s computational efficiency and parameter compactness. Extensive experiments demonstrate that HAD consistently outperforms state-of-the-art methods, and comprehensive ablation studies further validate the effectiveness and necessity of each designed component. The code will be released soon.
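[CV-26] A Matter of Time: Revealing the Structure of Time in Vision-Language Models
Quick read: This paper examines the temporal awareness of large-scale vision-language models (VLMs), that is, their ability to place visual content in time. The authors introduce a new evaluation methodology and the TIME10k benchmark (over 10,000 images with temporal ground truth), and use them to evaluate 37 VLMs. They find that temporal information lies along a low-dimensional, non-linear manifold in the VLM embedding space. Building on this, they propose methods that derive an explicit "timeline" representation from the embedding space; the representation models time and its chronological order and supports temporal reasoning tasks. Compared with a prompt-based baseline, the timeline approaches reach competitive to superior accuracy while being computationally efficient.
Link: https://arxiv.org/abs/2510.19559
Authors: Nidham Tekaya, Manuela Waldner, Matthias Zeppelzauer
Affiliations: St. Pölten University of Applied Sciences; TU Wien
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments: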
Abstract:Large-scale vision-language models (VLMs) such as CLIP have gained popularity for their generalizable and expressive multimodal representations. By leveraging large-scale training data with diverse textual metadata, VLMs acquire open-vocabulary capabilities, solving tasks beyond their training scope. This paper investigates the temporal awareness of VLMs, assessing their ability to position visual content in time. We introduce TIME10k, a benchmark dataset of over 10,000 images with temporal ground truth, and evaluate the time-awareness of 37 VLMs by a novel methodology. Our investigation reveals that temporal information is structured along a low-dimensional, non-linear manifold in the VLM embedding space. Based on this insight, we propose methods to derive an explicit "timeline" representation from the embedding space. These representations model time and its chronological progression and thereby facilitate temporal reasoning tasks. Our timeline approaches achieve competitive to superior accuracy compared to a prompt-based baseline while being computationally efficient. All code and data are available at this https URL.
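[CV-27] The Intricate Dance of Prompt Complexity, Quality, Diversity, and Consistency in T2I Models
Quick read: This paper asks how prompt complexity affects the utility of synthetic data generated by text-to-image (T2I) models along three key axes: quality, diversity, and consistency. The systematic effect of prompt complexity on these axes remains underexplored. The key contribution is a new evaluation framework that can compare the utility of real and synthetic data, together with large-scale experiments showing that, as prompt complexity increases, conditional diversity and prompt consistency decrease while the synthetic-to-real distribution shift shrinks. In addition, current inference-time interventions can increase diversity, but at the cost of moving generations outside the support of the real data. Among them, prompt expansion with a pre-trained language model achieves the highest image diversity and aesthetics, exceeding even those of real data.
Link: https://arxiv.org/abs/2510.19557
Authors: Xiaofeng Zhang, Aaron Courville, Michal Drozdzal, Adriana Romero-Soriano
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: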
Abstract:Text-to-image (T2I) models offer great potential for creating virtually limitless synthetic data, a valuable resource compared to fixed and finite real datasets. Previous works evaluate the utility of synthetic data from T2I models on three key desiderata: quality, diversity, and consistency. While prompt engineering is the primary means of interacting with T2I models, the systematic impact of prompt complexity on these critical utility axes remains underexplored. In this paper, we first conduct synthetic experiments to motivate the difficulty of generalization w.r.t. prompt complexity and explain the observed difficulty with theoretical derivations. Then, we introduce a new evaluation framework that can compare the utility of real data and synthetic data, and present a comprehensive analysis of how prompt complexity influences the utility of synthetic data generated by commonly used T2I models. We conduct our study across diverse datasets, including CC12M, ImageNet-1k, and DCI, and evaluate different inference-time intervention methods. Our synthetic experiments show that generalizing to more general conditions is harder than the other way round, since the former needs an estimated likelihood that is not learned by diffusion models. Our large-scale empirical experiments reveal that increasing prompt complexity results in lower conditional diversity and prompt consistency, while reducing the synthetic-to-real distribution shift, which aligns with the synthetic experiments. Moreover, current inference-time interventions can augment the diversity of the generations at the expense of moving outside the support of real data. Among those interventions, prompt expansion, by deliberately using a pre-trained language model as a likelihood estimator, consistently achieves the highest performance in both image diversity and aesthetics, even higher than that of real data.
[CV-28] PoseCrafter: Extreme Pose Estimation with Hybrid Video Synthesis NEURIPS2025
Quick read: This paper tackles pairwise camera pose estimation from sparsely overlapping image pairs, a long-standing challenge in 3D vision, focusing on the sharp performance drop of existing methods when the two images have little or no overlap. The core contribution is Hybrid Video Generation (HVG), which couples a video interpolation model with a pose-conditioned novel view synthesis model to synthesize clearer intermediate frames. A Feature Matching Selector (FMS) then uses feature correspondences to pick, from the synthesized frames, those suited to pose estimation, improving accuracy in low-overlap cases.
Link: https://arxiv.org/abs/2510.19527
Authors: Qing Mao, Tianxin Huang, Yu Zhu, Jinqiu Sun, Yanning Zhang, Gim Hee Lee
Affiliations: Northwestern Polytechnical University; National University of Singapore; The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Abstract:Pairwise camera pose estimation from sparsely overlapping image pairs remains a critical and unsolved challenge in 3D vision. Most existing methods struggle with image pairs that have small or no overlap. Recent approaches attempt to address this by synthesizing intermediate frames using video interpolation and selecting key frames via a self-consistency score. However, the generated frames are often blurry due to small overlap inputs, and the selection strategies are slow and not explicitly aligned with pose estimation. To solve these cases, we propose Hybrid Video Generation (HVG) to synthesize clearer intermediate frames by coupling a video interpolation model with a pose-conditioned novel view synthesis model, where we also propose a Feature Matching Selector (FMS) based on feature correspondence to select intermediate frames appropriate for pose estimation from the synthesized results. Extensive experiments on Cambridge Landmarks, ScanNet, DL3DV-10K, and NAVI demonstrate that, compared to existing SOTA methods, PoseCrafter can obviously enhance the pose estimation performances, especially on examples with small or no overlap.
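[CV-29] CARES: Context-Aware Resolution Selector for VLMs
Quick read: Large vision-language models (VLMs) typically process images at native or high resolution, which wastes compute and adds latency when a low-resolution image would suffice. The key contribution is CARES (Context-Aware Resolution Selector), a lightweight preprocessing module that uses a compact VLM (350M parameters) to extract features and predict the minimal input resolution at which the target pretrained VLM reaches its peak performance. CARES is trained as a discrete classifier but interpolates continuous resolutions at inference for fine-grained control; across multiple multimodal benchmarks it preserves task performance while cutting compute by up to 80%.
Link: https://arxiv.org/abs/2510.19496
Authors: Moshe Kimhi, Nimrod Shabtay, Raja Giryes, Chaim Baskin, Eli Schwartz
Affiliations: Technion; IBM Research; Tel-Aviv University; Ben-Gurion University of the Negev
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: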
Abstract:Large vision-language models (VLMs) commonly process images at native or high resolution to remain effective across tasks. This inflates visual tokens often to 97-99% of total tokens, resulting in high compute and latency, even when low-resolution images would suffice. We introduce CARES, a Context-Aware Resolution Selector, a lightweight preprocessing module that, given an image-query pair, predicts the minimal sufficient input resolution. CARES uses a compact VLM (350M) to extract features and predict when a target pretrained VLM’s response converges to its peak ability to answer correctly. Though trained as a discrete classifier over a set of optional resolutions, CARES interpolates continuous resolutions at inference for fine-grained control. Across five multimodal benchmarks spanning documents and natural images, as well as diverse target VLMs, CARES preserves task performance while reducing compute by up to 80%.
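The abstract does not say how CARES turns discrete class predictions into a continuous resolution. One plausible reading, shown below purely as an assumption, is a probability-weighted average over the candidate resolutions. The function name, the candidate list, and the probabilities are all hypothetical.

```python
def expected_resolution(probs, resolutions):
    """Probability-weighted average of candidate resolutions (illustrative only)."""
    if len(probs) != len(resolutions):
        raise ValueError("probs and resolutions must have the same length")
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return sum(p * r for p, r in zip(probs, resolutions)) / total


candidates = [256, 512, 1024]   # hypothetical candidate resolutions in pixels
class_probs = [0.5, 0.5, 0.0]   # hypothetical classifier output for one image-query pair

resolution = expected_resolution(class_probs, candidates)
print(resolution)
```

Under this reading, a classifier that is split between two adjacent classes yields a resolution between them, which is one way to get finer control than the discrete class set allows.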
[CV-30] Towards Single-Source Domain Generalized Object Detection via Causal Visual Prompts
Quick read: This paper addresses spurious correlations in Single-source Domain Generalized Object Detection (SDGOD): under domain shift and with limited domain-specific knowledge, models over-rely on simplistic classification features such as color instead of domain-invariant representations such as object contours. The proposed method, Cauvis, has two parts. A Cross-Attention Prompts module combines visual prompts with cross-attention to reduce bias from spurious features. A dual-branch adapter disentangles causal and spurious features while achieving domain adaptation through high-frequency feature extraction. On SDGOD datasets, Cauvis reports gains of 15.9-31.4% over existing domain generalization methods and stronger robustness in complex interference environments.
Link: https://arxiv.org/abs/2510.19487
Authors: Chen Li, Huiying Xu, Changxin Gao, Zeyu Wang, Yun Liu, Xinzhong Zhu
Affiliations: Huazhong University of Science and Technology; Zhejiang Normal University; Northwest Polytechnical University; Nankai University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures
Abstract:Single-source Domain Generalized Object Detection (SDGOD), as a cutting-edge research topic in computer vision, aims to enhance model generalization capability in unseen target domains through single-source domain training. Current mainstream approaches attempt to mitigate domain discrepancies via data augmentation techniques. However, due to domain shift and limited domain-specific knowledge, models tend to fall into the pitfall of spurious correlations. This manifests as the model’s over-reliance on simplistic classification features (e.g., color) rather than essential domain-invariant representations like object contours. To address this critical challenge, we propose the Cauvis (Causal Visual Prompts) method. First, we introduce a Cross-Attention Prompts module that mitigates bias from spurious features by integrating visual prompts with cross-attention. To address the inadequate domain knowledge coverage and spurious feature entanglement in visual prompts for single-domain generalization, we propose a dual-branch adapter that disentangles causal-spurious features while achieving domain adaptation via high-frequency feature extraction. Cauvis achieves state-of-the-art performance with 15.9-31.4% gains over existing domain generalization methods on SDGOD datasets, while exhibiting significant robustness advantages in complex interference environments.
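[CV-31] Mitigating representation bias caused by missing pixels in methane plume detection ACL ECML-PKDD2025
Quick read: Satellite images often have systematically missing pixels (missing not at random, MNAR), for example due to clouds, and this can cause representation bias. In methane plume detection, a model may wrongly associate an image's coverage (the percentage of valid pixels) with the label and therefore under-detect plumes in low-coverage images. Two remedies are studied: imputation approaches (several are evaluated) that remove the dependence between coverage and label, and a weighted resampling scheme that enforces class balance within each coverage bin during training, breaking the spurious association between label and coverage. Both reduce representation bias without hurting balanced accuracy, precision, or recall, and the debiased models are more likely to detect plumes in low-coverage images in an operational scenario.
Link: https://arxiv.org/abs/2510.19478
Authors: Julia Wąsala, Joannes D. Maasakkers, Ilse Aben, Rochelle Schneider, Holger Hoos, Mitra Baratchi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the MACLEAN workshop at ECML-PKDD 2025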
Abstract:Most satellite images have systematically missing pixels (i.e., missing data not at random (MNAR)) due to factors such as clouds. If not addressed, these missing pixels can lead to representation bias in automated feature extraction models. In this work, we show that spurious association between the label and the number of missing values in methane plume detection can cause the model to associate the coverage (i.e., the percentage of valid pixels in an image) with the label, subsequently under-detecting plumes in low-coverage images. We evaluate multiple imputation approaches to remove the dependence between the coverage and a label. Additionally, we propose a weighted resampling scheme during training that removes the association between the label and the coverage by enforcing class balance in each coverage bin. Our results show that both resampling and imputation can significantly reduce the representation bias without hurting balanced accuracy, precision, or recall. Finally, we evaluate the capability of the debiased models using these techniques in an operational scenario and demonstrate that the debiased models have a higher chance of detecting plumes in low-coverage images.
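The resampling idea in the abstract, class balance within each coverage bin, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the equal-width binning, the number of bins, the inverse-count weights, and the toy data are all assumptions.

```python
import random
from collections import Counter


def coverage_bin(coverage, n_bins=4):
    """Map a coverage value in [0, 1] to an equal-width bin index (assumed binning)."""
    return min(int(coverage * n_bins), n_bins - 1)


def balanced_weights(coverages, labels, n_bins=4):
    """Weight each sample by 1 / count of its (bin, label) group.

    Every (bin, label) group that is present then carries the same total weight,
    so within a bin each present class is drawn equally often on average.
    """
    keys = [(coverage_bin(c, n_bins), y) for c, y in zip(coverages, labels)]
    counts = Counter(keys)
    return [1.0 / counts[k] for k in keys]


# Toy data: fraction of valid pixels per image and plume label (1 = plume).
coverages = [0.10, 0.15, 0.20, 0.90, 0.95, 0.80]
labels = [0, 0, 1, 1, 1, 0]

weights = balanced_weights(coverages, labels)
rng = random.Random(0)
batch = rng.choices(range(len(labels)), weights=weights, k=4)  # resampled indices
print(weights, batch)
```

A bin that contains only one class cannot be balanced by reweighting alone; how such bins are handled is not stated in the abstract.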
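[CV-32] PRGCN: A Graph Memory Network for Cross-Sequence Pattern Reuse in 3D Human Pose Estimation
Quick read: Monocular 3D human pose estimation is an ill-posed inverse problem because 2D-to-3D lifting is inherently depth-ambiguous, and existing video-based methods process each sequence in isolation, ignoring the structural regularities and repetitive motion patterns shared across sequences. The proposed Pattern Reuse Graph Convolutional Network (PRGCN) casts pose estimation as pattern retrieval and adaptation: a graph memory bank learns and stores a compact set of pose prototypes encoded as relational graphs, which an attention mechanism retrieves dynamically as structured priors. These priors are adaptively fused with hard-coded anatomical constraints through a memory-driven graph convolution to ensure geometric plausibility. For robust spatiotemporal features, a dual-stream hybrid architecture combines the linear-complexity local temporal modeling of Mamba-based state-space models with the global relational capacity of self-attention.
Link: https://arxiv.org/abs/2510.19475
Authors: Zhuoyang Xie, Yibo Zhao, Hui Huang, Riwei Wang, Zan Gao
Affiliations: Wenzhou University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 29 pages, 6 figures, 6 tables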
Abstract:Monocular 3D human pose estimation remains a fundamentally ill-posed inverse problem due to the inherent depth ambiguity in 2D-to-3D lifting. While contemporary video-based methods leverage temporal context to enhance spatial reasoning, they operate under a critical paradigm limitation: processing each sequence in isolation, thereby failing to exploit the strong structural regularities and repetitive motion patterns that pervade human movement across sequences. This work introduces the Pattern Reuse Graph Convolutional Network (PRGCN), a novel framework that formalizes pose estimation as a problem of pattern retrieval and adaptation. At its core, PRGCN features a graph memory bank that learns and stores a compact set of pose prototypes, encoded as relational graphs, which are dynamically retrieved via an attention mechanism to provide structured priors. These priors are adaptively fused with hard-coded anatomical constraints through a memory-driven graph convolution, ensuring geometrical plausibility. To underpin this retrieval process with robust spatiotemporal features, we design a dual-stream hybrid architecture that synergistically combines the linear-complexity, local temporal modeling of Mamba-based state-space models with the global relational capacity of self-attention. Extensive evaluations on Human3.6M and MPI-INF-3DHP benchmarks demonstrate that PRGCN establishes a new state-of-the-art, achieving an MPJPE of 37.1mm and 13.4mm, respectively, while exhibiting enhanced cross-domain generalization capability. Our work posits that the long-overlooked mechanism of cross-sequence pattern reuse is pivotal to advancing the field, shifting the paradigm from per-sequence optimization towards cumulative knowledge learning.
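For readers unfamiliar with the metric quoted above, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth joint positions. The sketch below computes it for a single pose. The toy joints are made up, and any alignment step that an evaluation protocol applies before measuring (for example root-joint alignment) is omitted because the abstract does not specify one.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and have the same number of joints")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


# Toy two-joint pose in millimetres (hypothetical values).
predicted = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]

error_mm = mpjpe(predicted, ground_truth)
print(error_mm)
```

Benchmark figures such as those in the abstract are averages of this quantity over all frames in the test set.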
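[CV-33] Predicting before Reconstruction: A generative prior framework for MRI acceleration
Quick read: Long acquisition times limit the clinical throughput of magnetic resonance imaging (MRI). The paper proposes a shift from image reconstruction to predictive imaging: a generative model, conditioned on sources such as other contrast images, previously scanned images, acquisition parameters, and patient information, first predicts the target contrast image, and this prediction then serves as a data-driven prior for reconstructing highly under-sampled data. On internal and multiple public datasets, the prediction-prior reconstruction outperforms alternatives at high acceleration factors (×4, ×8, and ×12).
Link: https://arxiv.org/abs/2510.19472
Authors: Juhyung Park, Rokgi Hong, Roh-Eul Yoo, Jaehyeon Koo, Se Young Chun, Seung Hong Choi, Jongho Lee
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 33 pages, 8 figures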
Abstract:Recent advancements in artificial intelligence have created transformative capabilities in image synthesis and generation, enabling diverse research fields to innovate at revolutionary speed and spectrum. In this study, we leverage this generative power to introduce a new paradigm for accelerating Magnetic Resonance Imaging (MRI), introducing a shift from image reconstruction to proactive predictive imaging. Despite being a cornerstone of modern patient care, MRI’s lengthy acquisition times limit clinical throughput. Our novel framework addresses this challenge by first predicting a target contrast image, which then serves as a data-driven prior for reconstructing highly under-sampled data. This informative prior is predicted by a generative model conditioned on diverse data sources, such as other contrast images, previously scanned images, acquisition parameters, patient information. We demonstrate this approach with two key applications: (1) reconstructing FLAIR images using predictions from T1w and/or T2w scans, and (2) reconstructing T1w images using predictions from previously acquired T1w scans. The framework was evaluated on internal and multiple public datasets (total 14,921 scans; 1,051,904 slices), including multi-channel k-space data, for a range of high acceleration factors (x4, x8 and x12). The results demonstrate that our prediction-prior reconstruction method significantly outperforms other approaches, including those with alternative or no prior information. Through this framework we introduce a fundamental shift from image reconstruction towards a new paradigm of predictive imaging.
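As a rough guide to what the acceleration factors mean, an acceleration of R keeps about 1/R of the k-space lines that a fully sampled scan would acquire. The sketch below builds a regular undersampling mask to show the arithmetic. The regular pattern and the line count of 24 are assumptions; the abstract does not state the sampling pattern, and practical masks often also keep a fully sampled central region.

```python
def regular_mask(n_lines, acceleration):
    """Keep every `acceleration`-th k-space line (1 = acquired, 0 = skipped)."""
    if acceleration < 1:
        raise ValueError("acceleration must be at least 1")
    return [1 if i % acceleration == 0 else 0 for i in range(n_lines)]


n_lines = 24  # hypothetical number of phase-encoding lines
masks = {r: regular_mask(n_lines, r) for r in (4, 8, 12)}
kept = {r: sum(mask) for r, mask in masks.items()}  # acquired lines per factor
print(kept)
```

At ×12 only 2 of the 24 lines are acquired in this toy setup, which shows why a strong prior is useful at high acceleration.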
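[CV-34] PCP-GAN: Property-Constrained Pore-scale image reconstruction via conditional Generative Adversarial Networks
Quick read: Subsurface reservoir characterization faces two linked challenges: sub-images extracted from natural core samples often deviate from bulk property measurements because of spatial heterogeneity, and physical samples exist only at sparse well locations. The paper proposes a multi-conditional Generative Adversarial Network (cGAN) framework that conditions a single unified model on both porosity and depth, giving precise control over generated pore structure. The model preserves general pore network characteristics (average pore radius, specific surface area, and tortuosity) while capturing depth-specific geology (from grainstone fabrics to crystalline textures with anhydrite inclusions). It achieves accurate porosity control (R²=0.95), and its generated images have dual-constraint errors of 1.9-11.3%, versus 36.4-578% for randomly extracted real sub-images.
Link: https://arxiv.org/abs/2510.19465
Authors: Ali Sadeghkhani, Brandon Bennett, Masoud Babaei, Arash Rabbani
Affiliations: University of Leeds; The University of Manchester
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Geophysics (physics.geo-ph)
Comments: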
Abstract:Obtaining truly representative pore-scale images that match bulk formation properties remains a fundamental challenge in subsurface characterization, as natural spatial heterogeneity causes extracted sub-images to deviate significantly from core-measured values. This challenge is compounded by data scarcity, where physical samples are only available at sparse well locations. This study presents a multi-conditional Generative Adversarial Network (cGAN) framework that generates representative pore-scale images with precisely controlled properties, addressing both the representativeness challenge and data availability constraints. The framework was trained on thin section samples from four depths (1879.50-1943.50 m) of a carbonate formation, simultaneously conditioning on porosity values and depth parameters within a single unified model. This approach captures both universal pore network principles and depth-specific geological characteristics, from grainstone fabrics with interparticle-intercrystalline porosity to crystalline textures with anhydrite inclusions. The model achieved exceptional porosity control (R^2=0.95) across all formations with mean absolute errors of 0.0099-0.0197. Morphological validation confirmed preservation of critical pore network characteristics including average pore radius, specific surface area, and tortuosity, with statistical differences remaining within acceptable geological tolerances. Most significantly, generated images demonstrated superior representativeness with dual-constraint errors of 1.9-11.3% compared to 36.4-578% for randomly extracted real sub-images. This capability provides transformative tools for subsurface characterization, particularly valuable for carbon storage, geothermal energy, and groundwater management applications where knowing the representative morphology of the pore space is critical for implementing digital rock physics.
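To illustrate the representativeness problem the abstract describes, the sketch below measures the porosity of a small binary sub-image and compares it with a bulk value. The pixel convention (1 = pore, 0 = solid), the toy image, the target porosity, and the relative-error measure are all assumptions for illustration; the paper's own "dual-constraint error" is not defined in the abstract and is not reproduced here.

```python
def porosity(image):
    """Fraction of pore pixels in a binary image (assumed convention: 1 = pore, 0 = solid)."""
    total = sum(len(row) for row in image)
    if total == 0:
        raise ValueError("image must contain at least one pixel")
    return sum(sum(row) for row in image) / total


def relative_error_pct(value, target):
    """Relative deviation of a measured value from a target, in percent."""
    return abs(value - target) / target * 100.0


# Toy 4x4 sub-image with 3 pore pixels out of 16.
sub_image = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]
core_porosity = 0.15  # hypothetical bulk (core-measured) porosity

phi = porosity(sub_image)
error_pct = relative_error_pct(phi, core_porosity)
print(phi, error_pct)
```

A randomly placed crop can land on an unusually porous or unusually tight patch, so its porosity can sit far from the bulk value. That mismatch is what motivates conditioning the generator on the target property.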
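[CV-35] Exploring “Many in Few” and “Few in Many” Properties in Long-Tailed Highly-Imbalanced IC Defect Classification
Quick read: Real-world integrated circuit (IC) defect classification is limited by extreme data imbalance and complex sample characteristics. Data from industrial AOI (automatic optical inspection) systems is far more skewed than public imbalanced datasets, and samples mix class-specific attributes with class-agnostic, domain-related features, producing large intra-class diversity and high inter-class similarity that degrade current state-of-the-art classifiers. The proposed ReCAME-Net follows a multi-expert classifier framework and integrates a regional channel attention module, metric learning losses, a hard category mining strategy, and knowledge distillation. It outperforms previous models on the highly imbalanced IC-Defect-14 dataset while remaining competitive on general public datasets.
Link: https://arxiv.org/abs/2510.19463
Authors: Hao-Chiang Shao, Chun-Hao Chang, Yu-Hsien Lin, Chia-Wen Lin, Shao-Yun Fang, Yan-Hsiu Liu
Affiliations: National Chung Hsing University; National Tsing Hua University; National Taiwan University of Science and Technology; United Microelectronics Corporation
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: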
Abstract:Despite significant advancements in deep classification techniques and in-lab automatic optical inspection models for long-tailed or highly imbalanced data, applying these approaches to real-world IC defect classification tasks remains challenging. This difficulty stems from two primary factors. First, real-world conditions, such as the high yield-rate requirements in the IC industry, result in data distributions that are far more skewed than those found in general public imbalanced datasets. Consequently, classifiers designed for open imbalanced datasets often fail to perform effectively in real-world scenarios. Second, real-world samples exhibit a mix of class-specific attributes and class-agnostic, domain-related features. This complexity adds significant difficulty to the classification process, particularly for highly imbalanced datasets. To address these challenges, this paper introduces the IC-Defect-14 dataset, a large, highly imbalanced IC defect image dataset sourced from AOI systems deployed in real-world IC production lines. This dataset is characterized by its unique “intra-class clusters” property, which presents two major challenges: large intra-class diversity and high inter-class similarity. These characteristics, rarely found simultaneously in existing public datasets, significantly degrade the performance of current state-of-the-art classifiers for highly imbalanced data. To tackle this challenge, we propose ReCAME-Net, which follows a multi-expert classifier framework and integrates a regional channel attention module, metric learning losses, a hard category mining strategy, and a knowledge distillation procedure. Extensive experimental evaluations demonstrate that ReCAME-Net outperforms previous state-of-the-art models on the IC-Defect-14 dataset while maintaining comparable performance and competitiveness on general public datasets.
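[CV-36] Reasoning Like Experts: Leveraging Multimodal Large Language Models for Drawing-based Psychoanalysis
Quick read: Multimodal Large Language Models (MLLMs) remain underused in subjective, emotionally nuanced domains such as psychological analysis; this paper focuses on automated interpretation of drawings from the House-Tree-Person (HTP) test, which is widely used in clinical psychology. The proposed multi-step framework, PICK, decomposes HTP drawings into a three-level hierarchical representation (single-object, multi-object, and whole level), and combines knowledge injection with a feature extraction module trained with reinforcement learning to produce a psychological profile that captures both stylistic features and dynamic object-specific features. Information from all levels is then integrated into an assessment that aligns with expert-level reasoning, improving the capability and interpretability of MLLMs in psychological analysis.
Link: https://arxiv.org/abs/2510.19451
Authors: Xueqi Ma, Yanbei Jiang, Sarah Erfani, James Bailey, Weifeng Liu, Krista A. Ehinger, Jey Han Lau
Affiliations: The University of Melbourne; China University of Petroleum (East China)
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted by ACM Multimedia 2025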
Abstract:Multimodal Large Language Models (MLLMs) have demonstrated exceptional performance across various objective multimodal perception tasks, yet their application to subjective, emotionally nuanced domains, such as psychological analysis, remains largely unexplored. In this paper, we introduce PICK, a multi-step framework designed for Psychoanalytical Image Comprehension through hierarchical analysis and Knowledge injection with MLLMs, specifically focusing on the House-Tree-Person (HTP) Test, a widely used psychological assessment in clinical practice. First, we decompose drawings containing multiple instances into semantically meaningful sub-drawings, constructing a hierarchical representation that captures spatial structure and content across three levels: single-object level, multi-object level, and whole level. Next, we analyze these sub-drawings at each level with a targeted focus, extracting psychological or emotional insights from their visual cues. We also introduce an HTP knowledge base and design a feature extraction module, trained with reinforcement learning, to generate a psychological profile for single-object level analysis. This profile captures both holistic stylistic features and dynamic object-specific features (such as those of the house, tree, or person), correlating them with psychological states. Finally, we integrate these multi-faceted information to produce a well-informed assessment that aligns with expert-level reasoning. Our approach bridges the gap between MLLMs and specialized expert domains, offering a structured and interpretable framework for understanding human mental states through visual expression. Experimental results demonstrate that the proposed PICK significantly enhances the capability of MLLMs in psychological analysis. It is further validated as a general framework through extensions to emotion understanding tasks.
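[CV-37] Multi-Camera Worker Tracking in Logistics Warehouse Considering Wide-Angle Distortion
Quick read: This paper aims to improve the accuracy of worker position tracking in logistics warehouses, in support of digital twin applications in warehouse management. Nineteen ceiling-mounted wide-angle cameras provide multi-view coverage, and the worker positions detected by each camera are aligned based on foot positions to counter the severe distortion at the image edges of wide-angle cameras (particularly in the vertical direction). This enables accurate position alignment across cameras and improves tracking accuracy by more than 20%.
Link: https://arxiv.org/abs/2510.19432
Authors: Yuki Mori, Kazuma Kano, Yusuke Asai, Shin Katayama, Kenta Urano, Takuro Yonezawa, Nobuo Kawaguchi
Affiliations: Nagoya University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: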
Abstract:With the spread of e-commerce, the logistics market is growing around the world. Therefore, improving the efficiency of warehouse operations is essential. To achieve this, various approaches have been explored, and among them, the use of digital twins is gaining attention. To make this approach possible, it is necessary to accurately collect the positions of workers in a warehouse and reflect them in a virtual space. However, a single camera has limitations in its field of view, therefore sensing with multiple cameras is necessary. In this study, we explored a method to track workers using 19 wide-angle cameras installed on the ceiling, looking down at the floor of the logistics warehouse. To understand the relationship between the camera coordinates and the actual positions in the warehouse, we performed alignment based on the floor surface. However, due to the characteristics of wide-angle cameras, significant distortion occurs at the edges of the image, particularly in the vertical direction. To address this, the detected worker positions from each camera were aligned based on foot positions, reducing the effects of image distortion, and enabling accurate position alignment across cameras. As a result, we confirmed an improvement of over 20% in tracking accuracy. Furthermore, we compared multiple methods for utilizing appearance features and validated the effectiveness of the proposed approach.
[CV-38] GigaBrain-0: A World Model-Powered Vision-Language-Action Model
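[Quick Read]: This paper addresses the heavy reliance of current Vision-Language-Action (VLA) model training for generalist robots on large-scale real-world robot data, which makes data collection costly and inefficient and severely limits the scalability and generalization of VLA systems. The key to its solution is GigaBrain-0, a new VLA foundation model built on world-model-generated data (e.g., video generation, real2real transfer, human transfer, view transfer, and sim2real transfer data). By using world models to generate diverse data at scale, GigaBrain-0 substantially reduces reliance on real robot data, and it combines RGBD input modeling with embodied Chain-of-Thought (CoT) supervision to improve policy robustness, so that the model can reason about spatial geometry, object states, and long-horizon dependencies during task execution, yielding substantial gains on real-world dexterous, long-horizon, and mobile manipulation tasks.
Link: https://arxiv.org/abs/2510.19430
Authors: GigaBrain Team:Angen Ye,Boyuan Wang,Chaojun Ni,Guan Huang,Guosheng Zhao,Haoyun Li,Jie Li,Jiagang Zhu,Lv Feng,Peng Li,Qiuping Deng,Runqi Ouyang,Wenkang Qin,Xinze Chen,Xiaofeng Wang,Yang Wang,Yifan Li,Yilong Li,Yiran Ding,Yuan Xu,Yun Ye,Yukun Zhou,Zhehao Dong,Zhenan Wang,Zhichao Liu,Zheng Zhu
Affiliations: GigaAI
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL
Click to view abstract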
Abstract:Training Vision-Language-Action (VLA) models for generalist robots typically requires large-scale real-world robot data, which is expensive and time-consuming to collect. The inefficiency of physical data collection severely limits the scalability and generalization capacity of current VLA systems. To address this challenge, we introduce GigaBrain-0, a novel VLA foundation model empowered by world model-generated data (e.g., video generation, real2real transfer, human transfer, view transfer, sim2real transfer data). By leveraging world models to generate diverse data at scale, GigaBrain-0 significantly reduces reliance on real robot data while improving cross-task generalization. Our approach further improves policy robustness through RGBD input modeling and embodied Chain-of-Thought (CoT) supervision, enabling the model to reason about spatial geometry, object states, and long-horizon dependencies during task execution. This leads to substantial gains in real-world performance on dexterous, long-horizon, and mobile manipulation tasks. Extensive experiments demonstrate that GigaBrain-0 achieves superior generalization across variations in appearances (e.g., textures, colors), object placements, and camera viewpoints. Additionally, we present GigaBrain-0-Small, an optimized lightweight variant designed to run efficiently on devices such as the NVIDIA Jetson AGX Orin.
[CV-39] From See to Shield: ML-Assisted Fine-Grained Access Control for Visual Data
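[Quick Read]: This paper addresses the difficulty of identifying and protecting sensitive information in large-scale data repositories, particularly trusted data sharing among users with different roles and permissions. The key to its solution is a system architecture with policy-driven access control whose four core modules provide automated detection of sensitive regions, post-correction, key management, and fine-grained access control. Sensitive regions are protected by a hybrid scheme that combines symmetric encryption for efficiency with Attribute-Based Encryption (ABE) for policy enforcement; the system also supports efficient key distribution and isolated key storage, improving security and accuracy while remaining scalable.
Link: https://arxiv.org/abs/2510.19418
Authors: Mete Harun Akcay,Buse Gul Atli,Siddharth Prakash Rao,Alexandros Bakas
Affiliations: Nokia Bell Labs; Abo Akademi University; Linköping University; Sweden
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 3 figures, 6 tables. In submission
Click to view abstract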
Abstract:As the volume of stored data continues to grow, identifying and protecting sensitive information within large repositories becomes increasingly challenging, especially when shared with multiple users with different roles and permissions. This work presents a system architecture for trusted data sharing with policy-driven access control, enabling selective protection of sensitive regions while maintaining scalability. The proposed architecture integrates four core modules that combine automated detection of sensitive regions, post-correction, key management, and access control. Sensitive regions are secured using a hybrid scheme that employs symmetric encryption for efficiency and Attribute-Based Encryption for policy enforcement. The system supports efficient key distribution and isolates key storage to strengthen overall security. To demonstrate its applicability, we evaluate the system on visual datasets, where Privacy-Sensitive Objects in images are automatically detected, reassessed, and selectively encrypted prior to sharing in a data repository. Experimental results show that our system provides effective PSO detection, increases macro-averaged F1 score (5%) and mean Average Precision (10%), and maintains an average policy-enforced decryption time of less than 1 second per image. These results demonstrate the effectiveness, efficiency and scalability of our proposed solution for fine-grained access control.
[CV-40] Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes MICRO
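[Quick Read]: This paper addresses the lack of evaluation of the spatial reasoning ability of Vision-Language Models (VLMs) in multi-view robotic manipulation. Most existing VLM evaluations are limited to single-view settings, whereas real robotic platforms increasingly use multi-camera setups to overcome occlusion and depth ambiguity, so a systematic assessment of whether VLMs can use multi-view information for complex spatial reasoning and task execution is needed. The key to its solution is MV-RoboBench, a benchmark built specifically to evaluate the multi-view spatial reasoning of VLMs in robotic manipulation; it contains 1.7k manually curated QA items across eight subtasks in two categories, spatial understanding and robotic execution, and provides a standardized evaluation protocol. The benchmark shows that current state-of-the-art models still fall far below human performance in multi-view settings, that spatial intelligence and robotic task execution are positively correlated, and that general-purpose single-view spatial understanding does not reliably translate into success on robotic tasks.
Link: https://arxiv.org/abs/2510.19400
Authors: Zhiyuan Feng,Zhaolu Kang,Qijie Wang,Zhiying Du,Jiongrui Yan,Shubin Shi,Chengbo Yuan,Huizhi Liang,Yu Deng,Qixiu Li,Rushuai Yang,Arctanx An,Leqi Zheng,Weijie Wang,Shawn Chen,Sicheng Xu,Yaobo Liang,Jiaolong Yang,Baining Guo
Affiliations: Tsinghua University; Peking University; Fudan University; Microsoft Research Asia; Hong Kong University of Science and Technology; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The project and benchmark are publicly available at this https URL
Click to view abstract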
Abstract:Vision-language models (VLMs) are essential to Embodied AI, enabling robots to perceive, reason, and act in complex environments. They also serve as the foundation for the recent Vision-Language-Action (VLA) models. Yet most evaluations of VLMs focus on single-view settings, leaving their ability to integrate multi-view information underexplored. At the same time, multi-camera setups are increasingly standard in robotic platforms, as they provide complementary perspectives to mitigate occlusion and depth ambiguity. Whether VLMs can effectively leverage such multi-view inputs for robotic reasoning therefore remains an open question. To bridge this gap, we introduce MV-RoboBench, a benchmark specifically designed to evaluate the multi-view spatial reasoning capabilities of VLMs in robotic manipulation. MV-RoboBench consists of 1.7k manually curated QA items across eight subtasks, divided into two primary categories: spatial understanding and robotic execution. We evaluate a diverse set of existing VLMs, including both open-source and closed-source models, along with enhanced versions incorporating CoT-inspired techniques. The results show that state-of-the-art models remain far below human performance, underscoring the substantial challenges VLMs face in multi-view robotic perception. Additionally, our analysis uncovers two key findings: (i) spatial intelligence and robotic task execution are positively correlated in multi-view robotic scenarios; and (ii) strong performance on existing general-purpose single-view spatial understanding benchmarks does not reliably translate to success in the robotic spatial tasks assessed by our benchmark. We release MV-RoboBench as an open resource to foster progress in spatially grounded VLMs and VLAs, providing not only data but also a standardized evaluation protocol for multi-view embodied reasoning.
[CV-41] AegisRF: Adversarial Perturbations Guided with Sensitivity for Protecting Intellectual Property of Neural Radiance Fields BMVC2025
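[Quick Read]: This paper addresses intellectual property (IP) protection for Neural Radiance Fields (NeRFs) against unauthorized use. Because directly applying adversarial perturbations to 3D geometry easily deforms the scene structure and substantially degrades rendering quality, existing methods mostly avoid geometric perturbations or restrict them to explicit spaces such as meshes. The key to its solution is the AegisRF framework, which has two core components: a Perturbation Field that injects adversarial perturbations into the pre-rendering outputs (color and volume density) of NeRF models to fool downstream target models, and a Sensitivity Field that learns how strongly spatially varying geometric perturbations affect rendering quality and adaptively constrains perturbation strength, disrupting unauthorized applications while preserving high visual fidelity.
Link: https://arxiv.org/abs/2510.19371
Authors: Woo Jae Kim,Kyu Beom Han,Yoonki Cho,Youngju Na,Junsik Jung,Sooel Son,Sung-eui Yoon
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: BMVC 2025
Click to view abstract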
Abstract:As Neural Radiance Fields (NeRFs) have emerged as a powerful tool for 3D scene representation and novel view synthesis, protecting their intellectual property (IP) from unauthorized use is becoming increasingly crucial. In this work, we aim to protect the IP of NeRFs by injecting adversarial perturbations that disrupt their unauthorized applications. However, perturbing the 3D geometry of NeRFs can easily deform the underlying scene structure and thus substantially degrade the rendering quality, which has led existing attempts to avoid geometric perturbations or restrict them to explicit spaces like meshes. To overcome this limitation, we introduce a learnable sensitivity to quantify the spatially varying impact of geometric perturbations on rendering quality. Building upon this, we propose AegisRF, a novel framework that consists of a Perturbation Field, which injects adversarial perturbations into the pre-rendering outputs (color and volume density) of NeRF models to fool an unauthorized downstream target model, and a Sensitivity Field, which learns the sensitivity to adaptively constrain geometric perturbations, preserving rendering quality while disrupting unauthorized use. Our experimental evaluations demonstrate the generalized applicability of AegisRF across diverse downstream tasks and modalities, including multi-view image classification and voxel-based 3D localization, while maintaining high visual fidelity. Codes are available at this https URL.
[CV-42] DARE: A Deformable Adaptive Regularization Estimator for Learning-Based Medical Image Registration
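[Quick Read]: This paper addresses the lack of robustness and anatomical plausibility in deep-learning-based deformable medical image registration that results from overlooking the role of regularization. The key to its solution is the DARE (Deformable Adaptive Regularization Estimator) framework, whose core innovation is to adjust the strength of elastic regularization dynamically based on the gradient norm of the deformation field and to integrate strain and shear energy terms that adaptively balance stability and flexibility. It also introduces a folding-prevention mechanism that penalizes regions with a negative Jacobian determinant to suppress non-physical deformations, improving registration accuracy and anatomical plausibility.
Link: https://arxiv.org/abs/2510.19353
Authors: Ahsan Raza Siyal,Markus Haltmeier,Ruth Steiger,Malik Galijasevic,Elke Ruth Gizewski,Astrid Ellen Grams
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
Comments:
Click to view abstract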
Abstract:Deformable medical image registration is a fundamental task in medical image analysis. While deep learning-based methods have demonstrated superior accuracy and computational efficiency compared to traditional techniques, they often overlook the critical role of regularization in ensuring robustness and anatomical plausibility. We propose DARE (Deformable Adaptive Regularization Estimator), a novel registration framework that dynamically adjusts elastic regularization based on the gradient norm of the deformation field. Our approach integrates strain and shear energy terms, which are adaptively modulated to balance stability and flexibility. To ensure physically realistic transformations, DARE includes a folding-prevention mechanism that penalizes regions with negative deformation Jacobian. This strategy mitigates non-physical artifacts such as folding, avoids over-smoothing, and improves both registration accuracy and anatomical plausibility.
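To make the folding-prevention idea concrete, here is a minimal sketch of a penalty on negative Jacobian determinants for a 2D displacement field. The abstract does not give DARE's exact loss, so this is a common generic form chosen for illustration: the transformation is assumed to be x + u(x), derivatives are forward differences on a unit grid, and the penalty is the mean of max(0, -det J). The function names are hypothetical.

```python
from typing import List

Field = List[List[float]]  # displacement component, indexed [row][col]


def jacobian_determinants(ux: Field, uy: Field) -> Field:
    """det(I + grad u) per grid cell, using forward differences on a unit grid."""
    rows, cols = len(ux), len(ux[0])
    dets = []
    for i in range(rows - 1):
        row = []
        for j in range(cols - 1):
            dux_dx = ux[i][j + 1] - ux[i][j]
            dux_dy = ux[i + 1][j] - ux[i][j]
            duy_dx = uy[i][j + 1] - uy[i][j]
            duy_dy = uy[i + 1][j] - uy[i][j]
            row.append((1.0 + dux_dx) * (1.0 + duy_dy) - dux_dy * duy_dx)
        dets.append(row)
    return dets


def folding_penalty(ux: Field, uy: Field) -> float:
    """Mean of max(0, -det J): zero when no cell folds, positive otherwise."""
    dets = jacobian_determinants(ux, uy)
    values = [max(0.0, -d) for row in dets for d in row]
    return sum(values) / len(values)


# Identity transform (zero displacement): every determinant is 1, no penalty.
zero = [[0.0] * 3 for _ in range(3)]
identity_penalty = folding_penalty(zero, zero)

# A mirrored x-axis (u_x = -2x) flips orientation, so every determinant is -1.
fold_ux = [[-2.0 * j for j in range(3)] for _ in range(3)]
fold_penalty = folding_penalty(fold_ux, zero)
```

A determinant below zero means the local mapping reverses orientation, which is exactly the non-physical folding the abstract refers to; the penalty leaves orientation-preserving regions untouched.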
[CV-43] Learning To Defer To A Population With Limited Demonstrations
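[Quick Read]: This paper addresses the performance bottleneck that data scarcity imposes on Learning to Defer (L2D) systems in practical deployment. Its core solution is a context-aware, semi-supervised framework that uses meta-learning to generate expert-specific embeddings from only a few demonstrations. These embeddings serve a dual purpose: they are first used to generate a large set of pseudo-labels for training, and then to enable on-the-fly adaptation to new experts at test time. Experiments show that a model trained on these synthetic labels rapidly approaches oracle-level performance, markedly improving the data efficiency and scalability of L2D systems.
Link: https://arxiv.org/abs/2510.19351
Authors: Nilesh Ramgolam,Gustavo Carneiro,Hsiang-Ting(Tim)Chen
Affiliations: Australian Institute for Machine Learning (AIML); Centre for Vision, Speech and Signal Processing
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IEEE DICTA 2025 (poster). 7 pages, 2 figures
Click to view abstract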
Abstract:This paper addresses the critical data scarcity that hinders the practical deployment of learning to defer (L2D) systems to the population. We introduce a context-aware, semi-supervised framework that uses meta-learning to generate expert-specific embeddings from only a few demonstrations. We demonstrate the efficacy of a dual-purpose mechanism, where these embeddings are used first to generate a large corpus of pseudo-labels for training, and subsequently to enable on-the-fly adaptation to new experts at test-time. Experimental results on three different datasets confirm that a model trained on these synthetic labels rapidly approaches oracle-level performance, validating the data efficiency of our approach. By resolving a key training bottleneck, this work makes adaptive L2D systems more practical and scalable, paving the way for human-AI collaboration in real-world environments. To facilitate reproducibility and address implementation details not covered in the main text, we provide our source code and training configurations at this https URL.
[CV-44] DaMo: Data Mixing Optimizer in Fine-tuning Multimodal LLMs for Mobile Phone Agents
Link: https://arxiv.org/abs/2510.19336
Authors: Kai Shi,Jun Yang,Ni Yang,Binqiang Pan,Qingsong Xie,Chao Zhang,Zhenyu Yang,Tianhuang Su,Haonan Lu
Affiliations: OPPO AI Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-45] A Training-Free Framework for Open-Vocabulary Image Segmentation and Recognition with EfficientNet and CLIP
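[Quick Read]: This paper addresses open-vocabulary image segmentation and object recognition (OVSR), that is, accurately segmenting and recognizing image regions of unseen categories without relying on annotated data. The core challenge is to obtain semantically meaningful segmentation without supervision and to combine it with a vision-language model for cross-modal open-set recognition. The key to its solution is a two-stage training-free framework. In the first stage, pixel-wise features are extracted with EfficientNetB0 and reduced with Singular Value Decomposition (SVD), then grouped by hierarchical clustering, with the number of segments determined adaptively. In the second stage, the segmented regions are encoded into image embeddings by the Vision Transformer encoder of CLIP, while the text encoder produces embeddings of category prompts; both are projected into a shared latent space via SVD, and classification is decided from their similarities, enabling open-vocabulary recognition without fine-tuning.
Link: https://arxiv.org/abs/2510.19333
Authors: Ying Dai,Wei Yu Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Click to view abstract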
Abstract:This paper presents a novel training-free framework for open-vocabulary image segmentation and object recognition (OVSR), which leverages EfficientNetB0, a convolutional neural network, for unsupervised segmentation and CLIP, a vision-language model, for open-vocabulary object recognition. The proposed framework adopts a two-stage pipeline: unsupervised image segmentation followed by segment-level recognition via vision-language alignment. In the first stage, pixel-wise features extracted from EfficientNetB0 are decomposed using singular value decomposition to obtain latent representations, which are then clustered using hierarchical clustering to segment semantically meaningful regions. The number of clusters is adaptively determined by the distribution of singular values. In the second stage, the segmented regions are localized and encoded into image embeddings using the Vision Transformer backbone of CLIP. Text embeddings are precomputed using CLIP’s text encoder from category-specific prompts, including a generic "something else" prompt to support open set recognition. The image and text embeddings are concatenated and projected into a shared latent feature space via SVD to enhance cross-modal alignment. Recognition is performed by computing the softmax over the similarities between the projected image and text embeddings. The proposed method is evaluated on standard benchmarks, including COCO, ADE20K, and PASCAL VOC, achieving state-of-the-art performance in terms of Hungarian mIoU, precision, recall, and F1-score. These results demonstrate the effectiveness, flexibility, and generalizability of the proposed framework.
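The final recognition step, a softmax over similarities between a region embedding and text embeddings that include a generic "something else" prompt, can be sketched as follows. This is an illustration only: the three-dimensional vectors are toy stand-ins for projected CLIP embeddings, cosine similarity and the `temperature` value are assumptions not stated in the abstract, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores: List[float], temperature: float) -> List[float]:
    """Numerically stable softmax of scores divided by a temperature."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def recognize(
    region_emb: List[float],
    text_embs: Dict[str, List[float]],
    temperature: float = 0.1,  # assumed value, for illustration only
) -> Tuple[str, Dict[str, float]]:
    """Return the most probable label and the full label distribution."""
    labels = list(text_embs)
    sims = [cosine(region_emb, text_embs[label]) for label in labels]
    probs = softmax(sims, temperature)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], dict(zip(labels, probs))


# Toy embeddings standing in for projected CLIP text embeddings.
prompts = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "something else": [0.0, 0.0, 1.0],  # open-set fallback prompt
}

label_known, probs_known = recognize([0.9, 0.1, 0.0], prompts)
label_unknown, _ = recognize([0.0, 0.1, 1.0], prompts)
```

The fallback prompt gives regions that match no listed category somewhere to go, so they are not forced into the nearest known class.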
[CV-46] BrainMCLIP: Brain Image Decoding with Multi-Layer feature Fusion of CLIP
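[Quick Read]: This paper addresses a limitation of current fMRI-based image decoding methods, which rely on parameter-intensive variational autoencoder (VAE) pipelines, overlook the rich object information in CLIP's intermediate layers, and run counter to the functional hierarchy of the brain's visual system. The key to its solution is the BrainMCLIP framework, a parameter-efficient, multi-layer fusion strategy guided by the functional hierarchy of the human visual system: fMRI signals from functionally distinct visual areas (low-level and high-level) are aligned to the corresponding intermediate and final semantic layers of CLIP, capturing visual detail and improving high-level semantic accuracy without an additional VAE pathway. With a Cross-Reconstruction strategy and a novel multi-granularity loss, BrainMCLIP maintains strong performance with 71.7% fewer parameters than the best VAE-based methods, balancing semantic accuracy and detail fidelity.
Link: https://arxiv.org/abs/2510.19332
Authors: Tian Xia,Zihan Ma,Xinlong Wang,Qing Liu,Xiaowei He,Tianming Liu,Yudan Ren
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Click to view abstract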
Abstract:Decoding images from fMRI often involves mapping brain activity to CLIP’s final semantic layer. To capture finer visual details, many approaches add a parameter-intensive VAE-based pipeline. However, these approaches overlook rich object information within CLIP’s intermediate layers and contradict the brain’s functional hierarchy. We introduce BrainMCLIP, which pioneers a parameter-efficient, multi-layer fusion approach guided by the human visual system’s functional hierarchy, eliminating the need for such a separate VAE pathway. BrainMCLIP aligns fMRI signals from functionally distinct visual areas (low-/high-level) to corresponding intermediate and final CLIP layers, respecting functional hierarchy. We further introduce a Cross-Reconstruction strategy and a novel multi-granularity loss. Results show BrainMCLIP achieves highly competitive performance, particularly excelling on high-level semantic metrics where it matches or surpasses state-of-the-art (SOTA) methods, including those using VAE pipelines. Crucially, it achieves this with substantially fewer parameters, demonstrating a reduction of 71.7% compared to top VAE-based SOTA methods, by avoiding the VAE pathway. By leveraging intermediate CLIP features, it effectively captures visual details often missed by CLIP-only approaches, striking a compelling balance between semantic accuracy and detail fidelity without requiring a separate VAE pipeline.
[CV-47] Exploring Scale Shift in Crowd Localization under the Context of Domain Generalization
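[Quick Read]: This paper addresses the significant performance degradation of crowd localization models under domain generalization (DG) caused by differences in head scale distributions between training and testing data (scale shift). The key to its solution is a new algorithm, Causal Feature Decomposition and Anisotropic Processing (Catto), which builds on a theoretical analysis of how scale shift affects the models and applies targeted strategies to mitigate that effect, improving robustness and generalization in cross-domain scenarios.
Link: https://arxiv.org/abs/2510.19330
Authors: Juncheng Wang,Lei Shang,Ziqi Liu,Wang Lu,Xixu Hu,Zhe Hu,Jindong Wang,Shujun Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Click to view abstract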
Abstract:Crowd localization plays a crucial role in visual scene understanding towards predicting each pedestrian location in a crowd, thus being applicable to various downstream tasks. However, existing approaches suffer from significant performance degradation due to discrepancies in head scale distributions (scale shift) between training and testing data, a challenge known as domain generalization (DG). This paper aims to comprehend the nature of scale shift within the context of domain generalization for crowd localization models. To this end, we address four critical questions: (i) How does scale shift influence crowd localization in a DG scenario? (ii) How can we quantify this influence? (iii) What causes this influence? (iv) How to mitigate the influence? Initially, we conduct a systematic examination of how crowd localization performance varies with different levels of scale shift. Then, we establish a benchmark, ScaleBench, and reproduce 20 advanced DG algorithms to quantify the influence. Through extensive experiments, we demonstrate the limitations of existing algorithms and underscore the importance and complexity of scale shift, a topic that remains insufficiently explored. To deepen our understanding, we provide a rigorous theoretical analysis on scale shift. Building on these insights, we further propose an effective algorithm called Causal Feature Decomposition and Anisotropic Processing (Catto) to mitigate the influence of scale shift in DG settings. Later, we also provide extensive analytical experiments, revealing four significant insights for future research. Our results emphasize the importance of this novel and applicable research direction, which we term Scale Shift Domain Generalization.
[CV-48] Seabed-Net: A multi-task network for joint bathymetry estimation and seabed classification from remote sensing imagery in shallow waters
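[Quick Read]: This paper addresses joint modeling of shallow-water bathymetry and seabed classification, where the tasks are treated in isolation, accuracy is limited, and deep learning methods have been hard to adopt. Existing methods usually handle depth estimation and seabed classification as separate tasks, ignoring their potential synergy and limiting model performance. The key to its solution is Seabed-Net, a unified multi-task framework that extracts depth and seabed features with dual-branch encoders, exchanges cross-task features through an Attention Feature Fusion module and a windowed Swin-Transformer fusion block, and balances the optimization of the different objectives with dynamic task uncertainty weighting. This design markedly improves integrated shallow-water mapping: at two heterogeneous coastal sites it outperforms traditional empirical models and single-task and multi-task baselines, achieving lower root mean square error (RMSE) and higher seabed classification accuracy.
Link: https://arxiv.org/abs/2510.19329
Authors: Panagiotis Agrafiotis,Begüm Demir
Affiliations: Technische Universität Berlin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Submitted to ISPRS Journal of Photogrammetry and Remote Sensing
Click to view abstract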
Abstract:Accurate, detailed, and regularly updated bathymetry, coupled with complex semantic content, is essential for under-mapped shallow-water environments facing increasing climatological and anthropogenic pressures. However, existing approaches that derive either depth or seabed classes from remote sensing imagery treat these tasks in isolation, forfeiting the mutual benefits of their interaction and hindering the broader adoption of deep learning methods. To address these limitations, we introduce Seabed-Net, a unified multi-task framework that simultaneously predicts bathymetry and pixel-based seabed classification from remote sensing imagery of various resolutions. Seabed-Net employs dual-branch encoders for bathymetry estimation and pixel-based seabed classification, integrates cross-task features via an Attention Feature Fusion module and a windowed Swin-Transformer fusion block, and balances objectives through dynamic task uncertainty weighting. In extensive evaluations at two heterogeneous coastal sites, it consistently outperforms traditional empirical models and traditional machine learning regression methods, achieving up to 75% lower RMSE. It also reduces bathymetric RMSE by 10-30% compared to state-of-the-art single-task and multi-task baselines and improves seabed classification accuracy up to 8%. Qualitative analyses further demonstrate enhanced spatial consistency, sharper habitat boundaries, and corrected depth biases in low-contrast regions. These results confirm that jointly modeling depth with both substrate and seabed habitats yields synergistic gains, offering a robust, open solution for integrated shallow-water mapping. Code and pretrained weights are available at this https URL.
[CV-49] Online Handwritten Signature Verification Based on Temporal-Spatial Graph Attention Transformer
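[Quick Read]: This paper addresses the difficulty of achieving high accuracy in handwritten signature verification in the presence of intra-user variability and forgery risk. Its core solution is the Temporal-Spatial Graph Attention Transformer (TS-GATR), which represents signatures as graphs, uses a Graph Attention Network (GAT) to model the complex relationships among the dynamic features of nodes (such as position, velocity, and pressure), and uses a Gated Recurrent Unit (GRU) to capture long-term temporal dependencies, thereby improving verification performance. The key innovation is a Dual-Graph Attention Transformer (DGATR) module that builds k-step and k-nearest-neighbor adjacency graphs to extract local and global spatial features, respectively, for fine-grained modeling of the temporal-spatial characteristics of signatures.
Link: https://arxiv.org/abs/2510.19321
Authors: Hai-jie Yuan,Heng Zhang,Fei Yin
Affiliations: State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract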
Abstract:Handwritten signature verification is a crucial aspect of identity authentication, with applications in various domains such as finance and e-commerce. However, achieving high accuracy in signature verification remains challenging due to intra-user variability and the risk of forgery. This paper introduces a novel approach for dynamic signature verification: the Temporal-Spatial Graph Attention Transformer (TS-GATR). TS-GATR combines the Graph Attention Network (GAT) and the Gated Recurrent Unit (GRU) to model both spatial and temporal dependencies in signature data. TS-GATR enhances verification performance by representing signatures as graphs, where each node captures dynamic features (e.g. position, velocity, pressure), and by using attention mechanisms to model their complex relationships. The proposed method further employs a Dual-Graph Attention Transformer (DGATR) module, which utilizes k-step and k-nearest neighbor adjacency graphs to model local and global spatial features, respectively. To capture long-term temporal dependencies, the model integrates GRU, thereby enhancing its ability to learn dynamic features during signature verification. Comprehensive experiments conducted on benchmark datasets such as MSDS and DeepSignDB show that TS-GATR surpasses current state-of-the-art approaches, consistently achieving lower Equal Error Rates (EER) across various scenarios.
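The two adjacency graphs named in the abstract can be illustrated with a short sketch. The abstract does not define them precisely, so the construction below is an assumed reading: the k-step graph links sample points that are at most k positions apart in the pen trajectory, and the k-nearest-neighbor graph links each point to its k closest points in the plane regardless of when they were drawn. The function names are hypothetical.

```python
import math
from typing import List, Tuple

Adjacency = List[List[int]]


def k_step_adjacency(n: int, k: int) -> Adjacency:
    """Connect points i and j when they are 1..k steps apart in the sequence."""
    return [[1 if 0 < abs(i - j) <= k else 0 for j in range(n)] for i in range(n)]


def knn_adjacency(points: List[Tuple[float, float]], k: int) -> Adjacency:
    """Connect each point to its k nearest points by Euclidean distance.

    Rows are directed: adj[i][j] == 1 means j is among the k nearest of i.
    Ties are broken by the lower index because sorted() is stable.
    """
    n = len(points)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        for j in others[:k]:
            adj[i][j] = 1
    return adj


# A toy stroke that ends close to where it started.
stroke = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (0.1, 0.0)]
step_graph = k_step_adjacency(len(stroke), k=1)
knn_graph = knn_adjacency(stroke, k=1)
```

In the toy stroke, the first and last points are three steps apart in time yet are each other's nearest neighbor in space, so only the k-nearest-neighbor graph links them; this is the kind of relationship a purely sequential graph misses.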
[CV-50] Unified Reinforcement and Imitation Learning for Vision-Language Models NEURIPS2025
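[Quick Read]: This paper addresses the difficulty of deploying Vision-Language Models (VLMs) in resource-constrained environments because of their large scale. To obtain small VLMs that are both efficient and capable, the authors propose Unified Reinforcement and Imitation Learning (RIL), a training algorithm whose core innovation is to combine reinforcement learning with adversarial imitation learning: an LLM-based discriminator distinguishes student outputs from teacher outputs, while multiple large teacher VLMs provide diverse guidance, so that the small student model not only mimics the teachers' text generation but also keeps improving its own generation through reinforcement signals. This joint learning scheme significantly narrows the performance gap between lightweight models and leading closed-source VLMs and in several instances surpasses them.
Link: https://arxiv.org/abs/2510.19307
Authors: Byung-Kwan Lee,Ryo Hachiuma,Yong Man Ro,Yu-Chiang Frank Wang,Yueh-Hua Wu
Affiliations: NVIDIA; KAIST; National Taiwan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2025, Project page: this https URL
Click to view abstract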
Abstract:Vision-Language Models (VLMs) have achieved remarkable progress, yet their large scale often renders them impractical for resource-constrained environments. This paper introduces Unified Reinforcement and Imitation Learning (RIL), a novel and efficient training algorithm designed to create powerful, lightweight VLMs. RIL distinctively combines the strengths of reinforcement learning with adversarial imitation learning. This enables smaller student VLMs not only to mimic the sophisticated text generation of large teacher models but also to systematically improve their generative capabilities through reinforcement signals. Key to our imitation framework is an LLM-based discriminator that adeptly distinguishes between student and teacher outputs, complemented by guidance from multiple large teacher VLMs to ensure diverse learning. This unified learning strategy, leveraging both reinforcement and imitation, empowers student models to achieve significant performance gains, making them competitive with leading closed-source VLMs. Extensive experiments on diverse vision-language benchmarks demonstrate that RIL significantly narrows the performance gap with state-of-the-art open- and closed-source VLMs and, in several instances, surpasses them.
[CV-51] FrogDeepSDM: Improving Frog Counting and Occurrence Prediction Using Multimodal Data and Pseudo-Absence Imputation
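[Quick Read]: This paper addresses the limited predictive accuracy of Species Distribution Modelling (SDM) when data are sparse or incomplete, in particular for monitoring the distribution of frogs (Anura). The key to its solution is to combine deep learning with data preprocessing: data balancing first reduces model error substantially (for example, MAE falls from 189 to 29), feature selection then optimizes the environmental inputs, and finally a multimodal ensemble model integrating land cover, the Normalized Difference Vegetation Index (NDVI), and other inputs predicts frog counts and habitat classes accurately (84.9% accuracy, AUC 0.90), improving the generalization and scalability of ecological modeling in practice.
Link: https://arxiv.org/abs/2510.19305
Authors: Chirag Padubidri,Pranesh Velmurugan,Andreas Lanitis,Andreas Kamilaris
Affiliations: University of Twente; CYENS Center of Excellence; Cyprus University of Technology
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Click to view abstract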
Abstract:Monitoring species distribution is vital for conservation efforts, enabling the assessment of environmental impacts and the development of effective preservation strategies. Traditional data collection methods, including citizen science, offer valuable insights but remain limited in coverage and completeness. Species Distribution Modelling (SDM) helps address these gaps by using occurrence data and environmental variables to predict species presence across large regions. In this study, we enhance SDM accuracy for frogs (Anura) by applying deep learning and data imputation techniques using data from the “EY - 2022 Biodiversity Challenge.” Our experiments show that data balancing significantly improved model performance, reducing the Mean Absolute Error (MAE) from 189 to 29 in frog counting tasks. Feature selection identified key environmental factors influencing occurrence, optimizing inputs while maintaining predictive accuracy. The multimodal ensemble model, integrating land cover, NDVI, and other environmental inputs, outperformed individual models and showed robust generalization across unseen regions. The fusion of image and tabular data improved both frog counting and habitat classification, achieving 84.9% accuracy with an AUC of 0.90. This study highlights the potential of multimodal learning and data preprocessing techniques such as balancing and imputation to improve predictive ecological modeling when data are sparse or incomplete, contributing to more precise and scalable biodiversity monitoring.
[CV-52] Vision-Based Mistake Analysis in Procedural Activities: A Review of Advances and Challenges
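[Quick Read]: This paper addresses mistake detection and prediction in structured tasks, focusing on procedural and executional errors, in order to improve safety and efficiency in settings such as industrial automation, rehabilitation, education, and human-robot collaboration. The key lies in using computer vision techniques such as action recognition, anticipation, and activity understanding to capture deviations in task execution from video, for example incorrect step order, improper technique, or timing errors. The paper also surveys existing datasets, evaluation metrics, and state-of-the-art methods and, in light of the challenges posed by intra-class variability, viewpoint changes, and compositional activity structure, categorizes approaches by their use of procedural structure, supervision level, and learning strategy, laying a foundation for future directions such as neuro-symbolic reasoning and counterfactual state modeling.
Link: https://arxiv.org/abs/2510.19292
Authors: Konstantinos Bacharidis,Antonis A. Argyros
Affiliations: Institute of Computer Science, Foundation for Research and Technology – Hellas; Computer Science Department, University of Crete
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 6 figures, 2 tables
Click to view abstract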
Abstract:Mistake analysis in procedural activities is a critical area of research with applications spanning industrial automation, physical rehabilitation, education and human-robot collaboration. This paper reviews vision-based methods for detecting and predicting mistakes in structured tasks, focusing on procedural and executional errors. By leveraging advancements in computer vision, including action recognition, anticipation and activity understanding, vision-based systems can identify deviations in task execution, such as incorrect sequencing, use of improper techniques, or timing errors. We explore the challenges posed by intra-class variability, viewpoint differences and compositional activity structures, which complicate mistake detection. Additionally, we provide a comprehensive overview of existing datasets, evaluation metrics and state-of-the-art methods, categorizing approaches based on their use of procedural structure, supervision levels and learning strategies. Open challenges, such as distinguishing permissible variations from true mistakes and modeling error propagation are discussed alongside future directions, including neuro-symbolic reasoning and counterfactual state modeling. This work aims to establish a unified perspective on vision-based mistake analysis in procedural activities, highlighting its potential to enhance safety, efficiency and task performance across diverse domains.
[CV-53] Enhancing Early Alzheimer Disease Detection through Big Data and Ensemble Few-Shot Learning
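Quick read: This paper addresses the limited accuracy of early Alzheimer disease detection caused by scarce labeled medical data, the complexity of the disease, and data-privacy constraints. The key to its solution is combining Few-Shot Learning (FSL) with ensemble learning: an ensemble built on the Prototypical Network (ProtoNet) uses several pre-trained Convolutional Neural Networks (CNNs) as encoders to enrich the features extracted from medical images, and is optimized with a combination of class-aware loss and entropy loss to classify disease progression levels more precisely. Experiments on the Kaggle Alzheimer dataset and the ADNI dataset report accuracies of 99.72% and 99.86%, respectively, outperforming existing state-of-the-art methods.
Link: https://arxiv.org/abs/2510.19282
Authors: Safa Ben Atitallah, Maha Driss, Wadii Boulila, Anis Koubaa
Affiliations: Prince Sultan University; University of Manouba
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: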
Abstract:Alzheimer disease is a severe brain disorder that causes harm in various brain areas and leads to memory damage. The limited availability of labeled medical data poses a significant challenge for accurate Alzheimer disease detection. There is a critical need for effective methods to improve the accuracy of Alzheimer disease detection, considering the scarcity of labeled data, the complexity of the disease, and the constraints related to data privacy. To address this challenge, our study leverages the power of big data in the form of pre-trained Convolutional Neural Networks (CNNs) within the framework of Few-Shot Learning (FSL) and ensemble learning. We propose an ensemble approach based on a Prototypical Network (ProtoNet), a powerful method in FSL, integrating various pre-trained CNNs as encoders. This integration enhances the richness of features extracted from medical images. Our approach also includes a combination of class-aware loss and entropy loss to ensure a more precise classification of Alzheimer disease progression levels. The effectiveness of our method was evaluated using two datasets, the Kaggle Alzheimer dataset and the ADNI dataset, achieving an accuracy of 99.72% and 99.86%, respectively. The comparison of our results with relevant state-of-the-art studies demonstrated that our approach achieved superior accuracy and highlighted its validity and potential for real-world applications in early Alzheimer disease detection.
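To make the prototype-based ensemble idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the helper names, the toy embeddings, and the choice to average class probabilities across encoders are all assumptions made for illustration, and the class-aware and entropy losses are not modeled.

```python
import math

def prototypes(support):
    """Return the mean embedding per class; `support` maps label -> list of vectors."""
    return {
        label: [sum(col) / len(vectors) for col in zip(*vectors)]
        for label, vectors in support.items()
    }

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def class_probs(query, protos):
    """Softmax over negative squared distances from the query to each prototype."""
    logits = {label: -sq_dist(query, p) for label, p in protos.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(z - top) for label, z in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

def ensemble_predict(views):
    """`views` is a list of (query_embedding, support_set) pairs, one per encoder.

    Assumption: the ensemble averages per-encoder class probabilities.
    """
    avg = {}
    for query, support in views:
        for label, p in class_probs(query, prototypes(support)).items():
            avg[label] = avg.get(label, 0.0) + p / len(views)
    return max(avg, key=avg.get)
```

Each encoder contributes its own embedding space, so prototypes are computed per encoder and only the resulting probabilities are combined.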
[CV-54] D2D: Detector-to-Differentiable Critic for Improved Numeracy in Text-to-Image Generation
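Quick read: This paper addresses the difficulty text-to-image (T2I) diffusion models have in generating the number of objects specified in a prompt. Existing methods typically rely on differentiable auxiliary counting networks as external critics, which rules out detector-based models: these count better, but their count-via-enumeration mechanism is inherently non-differentiable. The paper proposes Detector-to-Differentiable (D2D), a framework whose core idea is a set of custom activation functions that convert detector logits into soft binary indicators, turning a non-differentiable detector into a differentiable critic that is used to optimize the noise prior at inference time and steer generation toward the correct count. This lets T2I models draw on high-performing detectors and substantially improves counting accuracy without significantly sacrificing image quality or computational efficiency.
Link: https://arxiv.org/abs/2510.19278
Authors: Nobline Yoo, Olga Russakovsky, Ye Zhu
Affiliations: Princeton University; École Polytechnique
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages, 14 figures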
Abstract:Text-to-image (T2I) diffusion models have achieved strong performance in semantic alignment, yet they still struggle with generating the correct number of objects specified in prompts. Existing approaches typically incorporate auxiliary counting networks as external critics to enhance numeracy. However, since these critics must provide gradient guidance during generation, they are restricted to regression-based models that are inherently differentiable, thus excluding detector-based models with superior counting ability, whose count-via-enumeration nature is non-differentiable. To overcome this limitation, we propose Detector-to-Differentiable (D2D), a novel framework that transforms non-differentiable detection models into differentiable critics, thereby leveraging their superior counting ability to guide numeracy generation. Specifically, we design custom activation functions to convert detector logits into soft binary indicators, which are then used to optimize the noise prior at inference time with pre-trained T2I models. Our extensive experiments on SDXL-Turbo, SD-Turbo, and Pixart-DMD across four benchmarks of varying complexity (low-density, high-density, and multi-object scenarios) demonstrate consistent and substantial improvements in object counting accuracy (e.g., boosting up to 13.7% on D2D-Small, a 400-prompt, low-density benchmark), with minimal degradation in overall image quality and computational overhead.
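The gap between counting by enumeration and a differentiable count can be sketched in a few lines. The abstract does not specify the paper's custom activation functions, so the steep sigmoid below, along with the `threshold` and `temperature` parameters and the squared-error loss, is an assumption chosen only to show the principle.

```python
import math

def soft_indicator(logit, threshold=0.0, temperature=0.1):
    """Smooth stand-in for the step function `logit > threshold` (assumed form)."""
    x = (logit - threshold) / temperature
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # avoids overflow for very negative x
    return e / (1.0 + e)

def hard_count(logits, threshold=0.0):
    """Count-via-enumeration: a step function, so no useful gradient."""
    return sum(1 for z in logits if z > threshold)

def soft_count(logits, threshold=0.0, temperature=0.1):
    """Sum of soft indicators: close to hard_count but differentiable in the logits."""
    return sum(soft_indicator(z, threshold, temperature) for z in logits)

def count_loss(logits, target):
    """Hypothetical critic loss: squared error between soft count and requested count."""
    return (soft_count(logits) - target) ** 2
```

A lower temperature makes the soft count track the hard count more closely, at the price of weaker gradients for detections far from the threshold.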
[CV-55] MobiAct: Efficient MAV Action Recognition Using MobileNetV4 with Contrastive Learning and Knowledge Distillation
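Quick read: This paper addresses action recognition for Micro Air Vehicles (MAVs) on resource-constrained platforms, that is, how to reach high recognition accuracy with low computational cost and fast inference. Most existing methods depend on large, computationally intensive models that cannot meet MAV real-time and energy requirements, forcing a trade-off between accuracy and speed. The key to the solution is MobiAct, a lightweight action recognition framework that (1) adopts MobileNetV4 as the backbone to cut computation; (2) introduces Stage-wise Orthogonal Knowledge Distillation (SOKD) to transfer MAV motion features from a ResNet18 teacher to the student network; (3) integrates a parameter-free attention mechanism that improves accuracy without adding model complexity; and (4) trains with a hybrid loss that combines multiple objectives for stable, robust optimization. On three self-collected datasets, MobiAct reaches an average recognition accuracy of 92.12% while consuming only 136.16 pJ of energy and processing 8.84 actions per second, and it decodes actions up to 2 times faster than the leading compared method.
Link: https://arxiv.org/abs/2510.19273
Authors: Zhang Nengbo, Ho Hann Woei
Affiliations: Universiti Sains Malaysia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: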
Abstract:Accurate and efficient recognition of Micro Air Vehicle (MAV) motion is essential for enabling real-time perception and coordination in autonomous aerial swarm. However, most existing approaches rely on large, computationally intensive models that are unsuitable for resource-limited MAV platforms, which results in a trade-off between recognition accuracy and inference speed. To address these challenges, this paper proposes a lightweight MAV action recognition framework, MobiAct, designed to achieve high accuracy with low computational cost. Specifically, MobiAct adopts MobileNetV4 as the backbone network and introduces a Stage-wise Orthogonal Knowledge Distillation (SOKD) strategy to effectively transfer MAV motion features from a teacher network (ResNet18) to a student network, thereby enhancing knowledge transfer efficiency. Furthermore, a parameter-free attention mechanism is integrated into the architecture to improve recognition accuracy without increasing model complexity. In addition, a hybrid loss training strategy is developed to combine multiple loss objectives, which ensures stable and robust optimization during training. Experimental results demonstrate that the proposed MobiAct achieves low-energy and low-computation MAV action recognition, while maintaining the fastest action decoding speed among compared methods. Across all three self-collected datasets, MobiAct achieves an average recognition accuracy of 92.12%, while consuming only 136.16 pJ of energy and processing recognition at a rate of 8.84 actions per second. Notably, MobiAct decodes actions up to 2 times faster than the leading method, with highly comparable recognition accuracy, highlighting its superior efficiency in MAV action recognition.
[CV-56] SCEESR: Semantic-Control Edge Enhancement for Diffusion-Based Super-Resolution
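Quick read: This paper addresses the trade-off between structural fidelity and computational cost in real-world image super-resolution (Real-ISR), where complex degradations and reconstruction ambiguity are inherent. Generative models improve perceptual quality but are computationally expensive, while one-step diffusion models are fast but prone to structural inaccuracies caused by distillation artifacts. The key to the solution is a ControlNet mechanism that supplies semantic edge guidance, giving dynamic structural control within a single inference pass, together with a hybrid loss combining L2, LPIPS, and an edge-aware AME loss. This preserves the efficiency of one-step generation while improving the geometric precision and realism of the reconstructed images.
Link: https://arxiv.org/abs/2510.19272
Authors: Yun Kai Zhuang
Affiliations: ShanghaiTech University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures, 3 tables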
Abstract:Real-world image super-resolution (Real-ISR) must handle complex degradations and inherent reconstruction ambiguities. While generative models have improved perceptual quality, a key trade-off remains with computational cost. One-step diffusion models offer speed but often produce structural inaccuracies due to distillation artifacts. To address this, we propose a novel SR framework that enhances a one-step diffusion model using a ControlNet mechanism for semantic edge guidance. This integrates edge information to provide dynamic structural control during single-pass inference. We also introduce a hybrid loss combining L2, LPIPS, and an edge-aware AME loss to optimize for pixel accuracy, perceptual quality, and geometric precision. Experiments show our method effectively improves structural integrity and realism while maintaining the efficiency of one-step generation, achieving a superior balance between output quality and inference speed. The results of test datasets will be published at this https URL and the related code will be published at this https URL.
[CV-57] Advances in 4D Representation: Geometry Motion and Interaction
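Quick read: This paper addresses how to select and customize a suitable 4D representation in 4D generation and reconstruction under different computation, application, and data scenarios. Its key contribution is a systematic categorization and analysis of existing 4D representations along three pillars (geometry, motion, and interaction) that focuses on representative works rather than exhaustive enumeration, distilling the strengths and limitations of each representation to give readers a basis for their choice. The survey also discusses the potential and current limitations of large language models (LLMs) and video foundation models (VFMs) in 4D applications, and reviews which 4D datasets are available and what is still lacking.
Link: https://arxiv.org/abs/2510.19255
Authors: Mingrui Zhao, Sauradip Nag, Kai Wang, Aditya Vora, Guangda Ji, Peter Chun, Ali Mahdavi-Amiri, Hao Zhang
Affiliations: Simon Fraser University; University of Alberta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages. Project Page: this https URL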
Abstract:We present a survey on 4D generation and reconstruction, a fast-evolving subfield of computer graphics whose developments have been propelled by recent advances in neural fields, geometric and motion deep learning, as well as 3D generative artificial intelligence (GenAI). While our survey is not the first of its kind, we build our coverage of the domain from a unique and distinctive perspective of 4D representations, to model 3D geometry evolving over time while exhibiting motion and interaction. Specifically, instead of offering an exhaustive enumeration of many works, we take a more selective approach by focusing on representative works to highlight both the desirable properties and ensuing challenges of each representation under different computation, application, and data scenarios. The main take-away message we aim to convey to the readers is on how to select and then customize the appropriate 4D representations for their tasks. Organizationally, we separate the 4D representations based on three key pillars: geometry, motion, and interaction. Our discourse will not only encompass the most popular representations of today, such as neural radiance fields (NeRFs) and 3D Gaussian Splatting (3DGS), but also bring attention to relatively under-explored representations in the 4D context, such as structured models and long-range motions. Throughout our survey, we will reprise the role of large language models (LLMs) and video foundational models (VFMs) in a variety of 4D applications, while steering our discussion towards their current limitations and how they can be addressed. We also provide a dedicated coverage on what 4D datasets are currently available, as well as what is lacking, in driving the subfield forward. Project page: this https URL
[CV-58] Background Fades Foreground Leads: Curriculum-Guided Background Pruning for Efficient Foreground-Centric Collaborative Perception
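Link: https://arxiv.org/abs/2510.19250
Authors: Yuheng Wu, Xiangbo Gao, Quang Tau, Zhengzhong Tu, Dongman Lee
Affiliations: KAIST; Texas A&M University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
[CV-59] Space Object Detection using Multi-frame Temporal Trajectory Completion Method
Quick read: This paper addresses the difficulty of detecting space objects in Geostationary Earth Orbit (GEO) in optical imagery, caused by weak signals, complex stellar backgrounds, and environmental interference. The key to the solution is twofold. First, a wavelet transform enhances the high-frequency features of targets and suppresses background noise at the single-frame level. Second, a multi-frame temporal trajectory completion scheme centered on the Hungarian algorithm performs globally optimal cross-frame matching, and the post-processing pipeline adds temporal matching and interpolation completion, temporal-consistency-based noise filtering, and progressive trajectory refinement to mitigate missed and false detections. The method achieves an F₁ score of 90.14% on the public SpotGEO dataset.
Link: https://arxiv.org/abs/2510.19220
Authors: Xiaoqing Lan, Biqiao Xin, Bingshu Wang, Han Zhang, Laixian Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: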
Abstract:Space objects in Geostationary Earth Orbit (GEO) present significant detection challenges in optical imaging due to weak signals, complex stellar backgrounds, and environmental interference. In this paper, we enhance high-frequency features of GEO targets while suppressing background noise at the single-frame level through wavelet transform. Building on this, we propose a multi-frame temporal trajectory completion scheme centered on the Hungarian algorithm for globally optimal cross-frame matching. To effectively mitigate missing and false detections, a series of key steps including temporal matching and interpolation completion, temporal-consistency-based noise filtering, and progressive trajectory refinement are designed in the post-processing pipeline. Experimental results on the public SpotGEO dataset demonstrate the effectiveness of the proposed method, achieving an F₁ score of 90.14%.
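The cross-frame matching and interpolation steps can be illustrated with a small sketch. Note the substitution: instead of the Hungarian algorithm, it finds the same globally optimal assignment by brute force over permutations, which is only practical for a handful of detections (in practice `scipy.optimize.linear_sum_assignment` would be the usual choice). The function names, the `max_dist` gate, and the linear interpolation are illustrative assumptions, not details taken from the paper.

```python
import math
from itertools import permutations

def match_frames(prev_pts, next_pts, max_dist=5.0):
    """Minimum-total-distance one-to-one matching between detections in two frames.

    Brute force stands in for the Hungarian algorithm here; both return an
    optimal assignment. Pairs farther apart than `max_dist` are dropped.
    """
    if len(prev_pts) > len(next_pts):
        raise ValueError("sketch assumes len(prev_pts) <= len(next_pts)")
    best_cost, best_perm = math.inf, ()
    for perm in permutations(range(len(next_pts)), len(prev_pts)):
        cost = sum(math.dist(prev_pts[i], next_pts[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return [(i, j) for i, j in enumerate(best_perm)
            if math.dist(prev_pts[i], next_pts[j]) <= max_dist]

def interpolate(p_start, p_end, n_missing):
    """Linearly fill `n_missing` positions between two matched detections."""
    steps = n_missing + 1
    return [tuple(a + (b - a) * k / steps for a, b in zip(p_start, p_end))
            for k in range(1, steps)]
```

Matching gives each target a track across frames; interpolation then fills frames where the single-frame detector missed the target.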
[CV-60] SFGFusion: Surface Fitting Guided 3D Object Detection with 4D Radar and Camera Fusion
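Quick read: This paper addresses the weak geometric representation of objects and the difficulty of multi-modal fusion that result from the sparse point clouds and low resolution of 4D imaging radar. The key to the solution is SFGFusion, a network guided by surface fitting: it estimates the quadratic surface parameters of objects from image and radar data and builds an explicit surface fitting model that strengthens spatial representation and cross-modal interaction. The depth predicted by this model is used in two ways: to guide the transformation of image features from perspective view (PV) to a unified bird's-eye view (BEV), improving spatial mapping accuracy, and to generate a dense pseudo-point cloud that mitigates radar sparsity, enabling more reliable multi-modal fusion and fine-grained object detection.
Link: https://arxiv.org/abs/2510.19215
Authors: Xiaozhi Li, Huijun Di, Jian Li, Feng Liu, Wei Liang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to Pattern Recognition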
Abstract:3D object detection is essential for autonomous driving. As an emerging sensor, 4D imaging radar offers advantages such as low cost, long-range detection, and accurate velocity measurement, making it highly suitable for object detection. However, its sparse point clouds and low resolution limit object geometric representation and hinder multi-modal fusion. In this study, we introduce SFGFusion, a novel camera-4D imaging radar detection network guided by surface fitting. By estimating quadratic surface parameters of objects from image and radar data, the explicit surface fitting model enhances spatial representation and cross-modal interaction, enabling more reliable prediction of fine-grained dense depth. The predicted depth serves two purposes: 1) in an image branch to guide the transformation of image features from perspective view (PV) to a unified bird’s-eye view (BEV) for multi-modal fusion, improving spatial mapping accuracy; and 2) in a surface pseudo-point branch to generate a dense pseudo-point cloud, mitigating the radar point sparsity. The original radar point cloud is also encoded in a separate radar branch. These two point cloud branches adopt a pillar-based method and subsequently transform the features into the BEV space. Finally, a standard 2D backbone and detection head are used to predict object labels and bounding boxes from BEV features. Experimental results show that SFGFusion effectively fuses camera and 4D radar features, achieving superior performance on the TJ4DRadSet and view-of-delft (VoD) object detection benchmarks.
[CV-61] MoE-GS: Mixture of Experts for Dynamic Gaussian Splatting
Link: https://arxiv.org/abs/2510.19210
Authors: In-Hwan Jin, Hyeongju Mun, Joonsoo Kim, Kugjin Yun, Kyeongbo Kong
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-62] GRASPLAT: Enabling dexterous grasping through novel view synthesis IROS2025
Link: https://arxiv.org/abs/2510.19200
Authors: Matteo Bortolon, Nuno Ferreira Duarte, Plinio Moreno, Fabio Poiesi, José Santos-Victor, Alessio Del Bue
Affiliations: TeV, Fondazione Bruno Kessler; PAVIS, Fondazione Istituto Italiano di Tecnologia; ISR, Instituto Superior Técnico, Universidade de Lisboa; University of Trento
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted IROS 2025
[CV-63] Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks
Quick read: This paper addresses the insufficient evaluation of synthetic-data-based driving world models on downstream perception tasks. Existing methods typically pretrain on synthetic data and then fine-tune on real data, a two-stage strategy that doubles the training epochs relative to a real-data-only baseline; when the baseline is given the same number of epochs, the benefit of synthetic data becomes negligible. The core of the solution is Dream4Drive, a framework that decomposes the input video into several 3D-aware guidance maps and renders 3D assets onto them; the driving world model is then fine-tuned to produce edited, multi-view photorealistic videos. This enables controllable generation and editing of corner cases at scale and significantly improves the perception of rare scenarios in autonomous driving.
Link: https://arxiv.org/abs/2510.19195
Authors: Kai Zeng, Zhanqian Wu, Kaixin Xiong, Xiaobao Wei, Xiangyu Guo, Zhenxin Zhu, Kalok Ho, Lijun Zhou, Bohan Zeng, Ming Lu, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Wentao Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent advancements in driving world models enable controllable generation of high-quality RGB videos or multimodal videos. Existing methods primarily focus on metrics related to generation quality and controllability. However, they often overlook the evaluation of downstream perception tasks, which are really crucial for the performance of autonomous driving. Existing methods usually leverage a training strategy that first pretrains on synthetic data and finetunes on real data, resulting in twice the epochs compared to the baseline (real data only). When we double the epochs in the baseline, the benefit of synthetic data becomes negligible. To thoroughly demonstrate the benefit of synthetic data, we introduce Dream4Drive, a novel synthetic data generation framework designed for enhancing the downstream perception tasks. Dream4Drive first decomposes the input video into several 3D-aware guidance maps and subsequently renders the 3D assets onto these guidance maps. Finally, the driving world model is fine-tuned to produce the edited, multi-view photorealistic videos, which can be used to train the downstream perception models. Dream4Drive enables unprecedented flexibility in generating multi-view corner cases at scale, significantly boosting corner case perception in autonomous driving. To facilitate future research, we also contribute a large-scale 3D asset dataset named DriveObj3D, covering the typical categories in driving scenarios and enabling diverse 3D-aware video editing. We conduct comprehensive experiments to show that Dream4Drive can effectively boost the performance of downstream perception models under various training epochs. Project: this https URL
[CV-64] Video Consistency Distance: Enhancing Temporal Consistency for Image-to-Video Generation via Reward-Based Fine-Tuning
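Quick read: This paper addresses the difficulty of maintaining temporal consistency when reward-based fine-tuning is applied to image-to-video (I2V) generation. Conventional reward functions mainly optimize qualities of the whole video sequence, such as aesthetic appeal and overall consistency, and tend to neglect coherence between frames, which leaves generated videos temporally inconsistent. The key to the solution is a new metric, Video Consistency Distance (VCD), defined in the frequency space of video frame features so that frequency-domain analysis captures frame information effectively. Using VCD within the reward-based fine-tuning framework significantly improves temporal consistency without degrading other aspects of performance.
Link: https://arxiv.org/abs/2510.19193
Authors: Takehiro Aoshima, Yusuke Shinohara, Park Byeongseon
Affiliations: LY Corporation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages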
Abstract:Reward-based fine-tuning of video diffusion models is an effective approach to improve the quality of generated videos, as it can fine-tune models without requiring real-world video datasets. However, it can sometimes be limited to specific performances because conventional reward functions are mainly aimed at enhancing the quality across the whole generated video sequence, such as aesthetic appeal and overall consistency. Notably, the temporal consistency of the generated video often suffers when applying previous approaches to image-to-video (I2V) generation tasks. To address this limitation, we propose Video Consistency Distance (VCD), a novel metric designed to enhance temporal consistency, and fine-tune a model with the reward-based fine-tuning framework. To achieve coherent temporal consistency relative to a conditioning image, VCD is defined in the frequency space of video frame features to capture frame information effectively through frequency-domain analysis. Experimental results across multiple I2V datasets demonstrate that fine-tuning a video generation model with VCD significantly enhances temporal consistency without degrading other performance compared to the previous method.
[CV-65] PruneHal: Reducing Hallucinations in Multi-modal Large Language Models through Adaptive KV Cache Pruning
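Link: https://arxiv.org/abs/2510.19183
Authors: Fengyuan Sun, Hui Chen, Xinhao Xu, Dandan Zheng, Jingdong Chen, Jun Zhou, Jungong Han, Guiguang Ding
Affiliations: School of Software, Tsinghua University; Ant Group; Department of Automation, Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
[CV-66] Malaria Detection from Blood Cell Images Using XceptionNet
Quick read: This paper addresses the low efficiency and accuracy of manual malaria diagnosis, in particular the misdiagnoses that occur when medical staff without sufficient professional skill examine red blood cell (RBC) images under a microscope. The key to the solution is using deep convolutional neural networks to automatically extract deep intrinsic features from blood cell images and classify cells as malaria-infected or healthy. On a publicly available malaria cell image dataset, Residual Attention Network and XceptionNet reach average accuracies of 97.28% and 97.55%, respectively, surpassing the other methods compared and supporting the feasibility of deep-learning-based automatic detection that improves diagnostic reliability and reduces manual involvement.
Link: https://arxiv.org/abs/2510.19182
Authors: Warisa Nusrat, Mostafijur Rahman, Ayatullah Faruk Mollah
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: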
Abstract:Malaria, which primarily spreads through the bite of female Anopheles mosquitoes, often leads to death, especially among children aged 0-5 years. Clinical experts identify malaria by observing RBCs in blood smear images with a microscope. Lack of adequate professional knowledge and skills, and most importantly manual involvement, may cause incorrect diagnosis. Therefore, computer-aided automatic diagnosis stands as a preferred substitute. In this paper, well-demonstrated deep networks have been applied to extract deep intrinsic features from blood cell images and thereafter classify them as malaria-infected or healthy cells. Among the six deep convolutional networks employed in this work, viz. AlexNet, XceptionNet, VGG-19, Residual Attention Network, DenseNet-121 and Custom-CNN, Residual Attention Network and XceptionNet perform relatively better than the rest on a publicly available malaria cell image dataset. They yield an average accuracy of 97.28% and 97.55% respectively, which surpasses other related methods on the same dataset. These findings strongly support the feasibility of deep-learning-driven methods for automatic and reliable detection of malaria while minimizing direct manual involvement.
[CV-67] FootFormer: Estimating Stability from Visual Input
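Quick read: This paper addresses the joint prediction of human motion dynamics (foot pressure distributions, foot contact maps, and center of mass) from visual input; existing methods usually produce only one or two of these measures. The key to the solution is FootFormer, a cross-modality architecture that predicts these related physical quantities directly from visual data. Across multiple datasets it yields statistically significantly better or equivalent estimates compared with existing methods, and it achieves state-of-the-art (SOTA) performance in estimating the stability-predictive components used in classic kinesiology metrics: center of pressure (CoP), center of mass (CoM), and base of support (BoS).
Link: https://arxiv.org/abs/2510.19170
Authors: Keaton Kraiger, Jingjing Li, Skanda Bharadwaj, Jesse Scott, Robert T. Collins, Yanxi Liu
Affiliations: Pennsylvania State University; University of South Florida; Scientific Applications & Research Associates (SARA), Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 9 figures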
Abstract:We propose FootFormer, a cross-modality approach for jointly predicting human motion dynamics directly from visual input. On multiple datasets, FootFormer achieves statistically significantly better or equivalent estimates of foot pressure distributions, foot contact maps, and center of mass (CoM), as compared with existing methods that generate one or two of those measures. Furthermore, FootFormer achieves SOTA performance in estimating stability-predictive components (CoP, CoM, BoS) used in classic kinesiology metrics. Code and data are available at this https URL.
[CV-68] X-Ego: Acquiring Team-Level Tactical Situational Awareness via Cross-Egocentric Contrastive Video Representation Learning
Link: https://arxiv.org/abs/2510.19150
Authors: Yunzhe Wang, Soham Hans, Volkan Ustun
Affiliations: University of Southern California
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 5 figures
[CV-69] A Novel Approach to Breast Cancer Segmentation using U-Net Model with Attention Mechanisms and FedProx
Link: https://arxiv.org/abs/2510.19118
Authors: Eyad Gad, Mustafa Abou Khatwa, Mustafa A. Elattar, Sahar Selim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
[CV-70] Advancing Brain Tumor Segmentation via Attention-based 3D U-Net Architecture and Digital Image Processing
Link: https://arxiv.org/abs/2510.19109
Authors: Eyad Gad, Seif Soliman, M. Saeed Darweesh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-71] MetaCluster: Enabling Deep Compression of Kolmogorov-Arnold Network
Quick read: This paper addresses the poor storage efficiency of Kolmogorov-Arnold Networks (KANs), whose per-edge coefficient vectors multiply the parameter count and memory footprint, while preserving their expressivity and accuracy. The key to the solution is the MetaCluster framework: a lightweight meta-learner maps low-dimensional embeddings to coefficient vectors so that the vectors lie on a low-dimensional manifold that is easy to cluster; K-means is then run in coefficient space and each per-edge vector is replaced with a shared centroid. The final model stores only a small codebook and per-edge indices, reducing parameter storage by up to 80× with no loss in accuracy.
Link: https://arxiv.org/abs/2510.19105
Authors: Matthew Raffel, Adwaith Renjith, Lizhong Chen
Affiliations: Oregon State University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Kolmogorov-Arnold Networks (KANs) replace scalar weights with per-edge vectors of basis coefficients, thereby boosting expressivity and accuracy but at the same time resulting in a multiplicative increase in parameters and memory. We propose MetaCluster, a framework that makes KANs highly compressible without sacrificing accuracy. Specifically, a lightweight meta-learner, trained jointly with the KAN, is used to map low-dimensional embeddings to coefficient vectors, shaping them to lie on a low-dimensional manifold that is amenable to clustering. We then run K-means in coefficient space and replace per-edge vectors with shared centroids. Afterwards, the meta-learner can be discarded, and a brief fine-tuning of the centroid codebook recovers any residual accuracy loss. The resulting model stores only a small codebook and per-edge indices, exploiting the vector nature of KAN parameters to amortize storage across multiple coefficients. On MNIST, CIFAR-10, and CIFAR-100, across standard KANs and ConvKANs using multiple basis functions, MetaCluster achieves a reduction of up to 80× in parameter storage, with no loss in accuracy. Code will be released upon publication.
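The storage argument (a shared codebook plus one small index per edge) is easy to check with back-of-the-envelope arithmetic. The sketch below is an illustration under assumed numbers: the edge count, coefficients per edge, codebook size, and 32-bit values are hypothetical and not taken from the paper, and the meta-learner and K-means fitting are not modeled.

```python
import math

def dense_storage_bits(num_edges, coeffs_per_edge, bits_per_value=32):
    """Storage for an uncompressed KAN layer: one coefficient vector per edge."""
    return num_edges * coeffs_per_edge * bits_per_value

def codebook_storage_bits(num_edges, coeffs_per_edge, num_centroids, bits_per_value=32):
    """Storage after clustering: a centroid codebook plus one index per edge."""
    index_bits = max(1, math.ceil(math.log2(num_centroids)))
    codebook = num_centroids * coeffs_per_edge * bits_per_value
    return codebook + num_edges * index_bits

def quantize(vectors, centroids):
    """Replace each per-edge vector by the index of its nearest centroid."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(centroids)), key=lambda k: sq_dist(v, centroids[k]))
            for v in vectors]

# Hypothetical layer: 1,000,000 edges, 10 coefficients each, 16 shared centroids.
dense = dense_storage_bits(1_000_000, 10)
compact = codebook_storage_bits(1_000_000, 10, 16)
ratio = dense / compact
```

Because each index replaces a whole vector of coefficients, the saving grows with the number of coefficients per edge, which is the amortization the abstract refers to.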
[CV-72] UniHPR: Unified Human Pose Representation via Singular Value Contrastive Learning
Link: https://arxiv.org/abs/2510.19078
Authors: Zhongyu Jiang, Wenhao Chai, Lei Li, Zhuoran Zhou, Cheng-Yen Yang, Jenq-Neng Hwang
Affiliations: University of Washington; University of Copenhagen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-73] MoAlign: Motion-Centric Representation Alignment for Video Diffusion Models
Quick read: This paper addresses the lack of temporal coherence and physical plausibility in videos generated by text-to-video diffusion models, which stems from the models' insufficient understanding of the complex motion in natural videos. The key to the solution is a motion-centric alignment framework that learns a disentangled motion subspace from a pretrained video encoder and optimizes that subspace to predict ground-truth optical flow, so that it captures true motion dynamics. The latent features of the text-to-video diffusion model are then aligned to this subspace, letting the generative model internalize motion knowledge and improving the physical commonsense of its videos while preserving adherence to the text prompt.
Link: https://arxiv.org/abs/2510.19022
Authors: Aritra Bhowmik, Denis Korzhenkov, Cees G. M. Snoek, Amirhossein Habibian, Mohsen Ghafoorian
Affiliations: University of Amsterdam; Qualcomm AI Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Text-to-video diffusion models have enabled high-quality video synthesis, yet often fail to generate temporally coherent and physically plausible motion. A key reason is the models’ insufficient understanding of complex motions that natural videos often entail. Recent works tackle this problem by aligning diffusion model features with those from pretrained video encoders. However, these encoders mix video appearance and dynamics into entangled features, limiting the benefit of such alignment. In this paper, we propose a motion-centric alignment framework that learns a disentangled motion subspace from a pretrained video encoder. This subspace is optimized to predict ground-truth optical flow, ensuring it captures true motion dynamics. We then align the latent features of a text-to-video diffusion model to this new subspace, enabling the generative model to internalize motion knowledge and generate more plausible videos. Our method improves the physical commonsense in a state-of-the-art video diffusion model, while preserving adherence to textual prompts, as evidenced by empirical evaluations on VideoPhy, VideoPhy2, VBench, and VBench-2.0, along with a user study.
[CV-74] Δt-Mamba3D: A Time-Aware Spatio-Temporal State-Space Model for Breast Cancer Risk Prediction
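Quick read: This paper addresses sequence modeling in longitudinal medical image analysis when exams are taken at irregular intervals, that is, how to exploit the spatial and temporal information in a sequence of high-resolution images captured at non-uniform time points. Current methods either lose spatial detail by collapsing images into vectors or use spatio-temporal models that are computationally expensive and incompatible with non-uniform time steps. The key to the solution is Time-Aware Δt-Mamba3D, a state-space architecture designed for longitudinal medical imaging. Its core innovation is a continuous-time selective scanning mechanism that explicitly integrates the true time difference between exams (Δt) into the state transitions, complemented by a multi-scale 3D neighborhood fusion module that robustly captures spatio-temporal relationships. With linear computational complexity, the model can efficiently process long patient screening histories.
Link: https://arxiv.org/abs/2510.19003
Authors: Zhengbo Zhou, Dooman Arefan, Margarita Zuley, Shandong Wu
Affiliations: University of Science and Technology of China; Alibaba Group; Tongji University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: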
Abstract:Longitudinal analysis of sequential radiological images is hampered by a fundamental data challenge: how to effectively model a sequence of high-resolution images captured at irregular time intervals. This data structure contains indispensable spatial and temporal cues that current methods fail to fully exploit. Models often compromise by either collapsing spatial information into vectors or applying spatio-temporal models that are computationally inefficient and incompatible with non-uniform time steps. We address this challenge with Time-Aware Δt-Mamba3D, a novel state-space architecture adapted for longitudinal medical imaging. Our model simultaneously encodes irregular inter-visit intervals and rich spatio-temporal context while remaining computationally efficient. Its core innovation is a continuous-time selective scanning mechanism that explicitly integrates the true time difference between exams into its state transitions. This is complemented by a multi-scale 3D neighborhood fusion module that robustly captures spatio-temporal relationships. In a comprehensive breast cancer risk prediction benchmark using sequential screening mammogram exams, our model shows superior performance, improving the validation c-index by 2-5 percentage points and achieving higher 1-5 year AUC scores compared to established variants of recurrent, transformer, and state-space models. Thanks to its linear complexity, the model can efficiently process long and complex patient screening histories of mammograms, forming a new framework for longitudinal image analysis.
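One way to picture how a true time gap can enter a state transition is the standard zero-order-hold discretization of a continuous-time linear state-space model, with the inter-exam interval used as the step size. The scalar sketch below is an assumption-laden illustration, not the paper's mechanism: the real model uses learned, input-dependent parameters and 3D scanning, and the values of `a` and `b` here are arbitrary.

```python
import math

def discretize(a, b, dt):
    """Zero-order-hold discretization of dh/dt = a*h + b*x over a gap `dt` (a < 0)."""
    a_bar = math.exp(a * dt)          # how much of the old state survives the gap
    b_bar = (a_bar - 1.0) / a * b     # how strongly the new input is written
    return a_bar, b_bar

def scan(inputs, gaps, a=-0.5, b=1.0):
    """Run the recurrence h <- a_bar*h + b_bar*x with a per-step time gap."""
    h, states = 0.0, []
    for x, dt in zip(inputs, gaps):
        a_bar, b_bar = discretize(a, b, dt)
        h = a_bar * h + b_bar * x
        states.append(h)
    return states
```

With this form, a long interval between two exams decays the carried-over state more than a short one, so the same sequence of inputs yields different states depending on when the exams happened.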
[CV-75] Robust Driving QA through Metadata-Grounded Context and Task-Specific Prompts
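Quick read: This paper addresses vision-language question answering (QA) for high-level perception, prediction, and planning in autonomous driving, where the central challenge is using a pretrained Multimodal Large Language Model (MLLM) to understand complex driving scenes and produce accurate, reliable answers. The key to the solution is a two-phase design. Phase 1 conditions the model on six-camera inputs, a short window of history frames, and a chain-of-thought prompt with few-shot exemplars, and adds a self-consistency ensemble to improve answer reliability. Phase 2 augments the prompt with nuScenes scene metadata (object annotations, ego-vehicle state, etc.) and category-specific question instructions for perception, prediction, and planning. The approach significantly outperforms the baseline models and maintains 96% accuracy under severe visual corruption, showing the value of carefully engineered prompts and contextual grounding for pretrained models on driving QA.
Link: https://arxiv.org/abs/2510.19001
Authors: Seungjun Yu, Junsung Park, Youngsun Lim, Hyunjung Shim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: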
Abstract:We present a two-phase vision-language QA system for autonomous driving that answers high-level perception, prediction, and planning questions. In Phase-1, a large multimodal LLM (Qwen2.5-VL-32B) is conditioned on six-camera inputs, a short temporal window of history, and a chain-of-thought prompt with few-shot exemplars. A self-consistency ensemble (multiple sampled reasoning chains) further improves answer reliability. In Phase-2, we augment the prompt with nuScenes scene metadata (object annotations, ego-vehicle state, etc.) and category-specific question instructions (separate prompts for perception, prediction, planning tasks). In experiments on a driving QA benchmark, our approach significantly outperforms the baseline Qwen2.5 models. For example, using 5 history frames and 10-shot prompting in Phase-1 yields 65.1% overall accuracy (vs. 62.61% with zero-shot); applying self-consistency raises this to 66.85%. Phase-2 achieves 67.37% overall. Notably, the system maintains 96% accuracy under severe visual corruption. These results demonstrate that carefully engineered prompts and contextual grounding can greatly enhance high-level driving QA with pretrained vision-language models.
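The self-consistency ensemble mentioned in the abstract amounts to sampling several reasoning chains and keeping the most common final answer. A minimal sketch follows; it assumes the final answers have already been extracted from each chain as short strings (for example multiple-choice letters), and the function name and the normalization step are illustrative choices rather than the authors' code.

```python
from collections import Counter

def self_consistent_answer(sampled_answers):
    """Majority vote over final answers from independently sampled reasoning chains.

    Returns the winning answer and the fraction of chains that agreed with it.
    Ties go to the answer that appeared first.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    normalized = [a.strip().upper() for a in sampled_answers]
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes / len(normalized)
```

The agreement fraction can double as a rough confidence signal, for instance to decide when more chains should be sampled.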
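[CV-76] ∇-SDF: Learning Euclidean Signed Distance Functions Online with Gradient-Augmented Octree Interpolation and Neural Residual
Quick read: This paper addresses the efficient, accurate reconstruction of continuous and differentiable signed distance functions (SDFs) from point cloud data in support of robot autonomy capabilities such as localization, mapping, motion planning, and control. Existing methods either rely on discrete volumetric data structures, which compromise the continuity and differentiability of the SDF, or use neural networks, which offer high fidelity and differentiability but tend to be less efficient, can suffer catastrophic forgetting and memory limitations in large environments, and are often restricted to truncated SDFs. The key to the solution is ∇-SDF, a hybrid method that combines an explicit prior obtained from gradient-augmented octree interpolation with an implicit neural residual that corrects its error. It achieves non-truncated (Euclidean) SDF reconstruction with computational and memory efficiency comparable to volumetric methods and differentiability and accuracy comparable to neural network methods.
Link: https://arxiv.org/abs/2510.18999
Authors: Zhirui Dai, Qihao Qian, Tianxing Fan, Nikolay Atanasov
Affiliations: University of California San Diego
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: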
点击查看摘要
Abstract: Estimation of signed distance functions (SDFs) from point cloud data has been shown to benefit many robot autonomy capabilities, including localization, mapping, motion planning, and control. Methods that support online and large-scale SDF reconstruction tend to rely on discrete volumetric data structures, which affect the continuity and differentiability of the SDF estimates. Recently, using implicit features, neural network methods have demonstrated high-fidelity and differentiable SDF reconstruction but they tend to be less efficient, can experience catastrophic forgetting and memory limitations in large environments, and are often restricted to truncated SDFs. This work proposes ∇-SDF, a hybrid method that combines an explicit prior obtained from gradient-augmented octree interpolation with an implicit neural residual. Our method achieves non-truncated (Euclidean) SDF reconstruction with computational and memory efficiency comparable to volumetric methods and differentiability and accuracy comparable to neural network methods. Extensive experiments demonstrate that ∇-SDF outperforms the state of the art in terms of accuracy and efficiency, providing a scalable solution for downstream tasks in robotics and computer vision.
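The idea of an explicit gradient-augmented prior plus a learned residual can be sketched in one dimension. This is only an illustration under strong assumptions: a uniform 1D grid stands in for the octree, the blend of first-order Taylor expansions is one plausible interpolation scheme rather than the paper's, and the neural residual is replaced by an arbitrary callable that defaults to zero. All names are hypothetical.

```python
import numpy as np


def gradient_augmented_prior(x, nodes, values, grads):
    """Blend first-order Taylor expansions from the two nodes that bracket x."""
    i = int(np.clip(np.searchsorted(nodes, x) - 1, 0, len(nodes) - 2))
    x0, x1 = nodes[i], nodes[i + 1]
    w1 = (x - x0) / (x1 - x0)  # linear interpolation weight of the right node
    est0 = values[i] + grads[i] * (x - x0)
    est1 = values[i + 1] + grads[i + 1] * (x - x1)
    return (1.0 - w1) * est0 + w1 * est1


def sdf_estimate(x, nodes, values, grads, residual=lambda x: 0.0):
    """Explicit prior plus a residual correction (a neural network in the paper)."""
    return gradient_augmented_prior(x, nodes, values, grads) + residual(x)


# Toy 1D scene: a surface point at x = 0.3, so the signed distance is x - 0.3.
nodes = np.array([0.0, 0.5, 1.0])
values = nodes - 0.3
grads = np.ones_like(nodes)
estimate = sdf_estimate(0.4, nodes, values, grads)
```

Because each node stores a gradient as well as a value, the prior is exact for this linear toy field, and the residual only has to model what the prior misses.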
[CV-77] Ninja Codes: Neurally Generated Fiducial Markers for Stealthy 6-DoF Tracking
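Quick read: This paper addresses the problem that conventional fiducial markers, used in augmented reality (AR), robot navigation, and motion-based user interfaces, are visually conspicuous and detract from their surroundings. The key to the solution is Ninja Codes, neurally generated markers: an encoder network applies visually modest but effective alterations to arbitrary images so that the markers blend naturally into real-world textures. The method uses an end-to-end deep steganography framework that jointly optimizes the marker generation and detection modules, giving stealthy, accurate 6-DoF location tracking that runs on ordinary RGB cameras.
Link: https://arxiv.org/abs/2510.18976
Authors: Yuichiro Takeuchi,Yusuke Imoto,Shunya Kato
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: 11 pages, 12 figures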
Abstract:In this paper we describe Ninja Codes, neurally-generated fiducial markers that can be made to naturally blend into various real-world environments. An encoder network converts arbitrary images into Ninja Codes by applying visually modest alterations; the resulting codes, printed and pasted onto surfaces, can provide stealthy 6-DoF location tracking for a wide range of applications including augmented reality, robotics, motion-based user interfaces, etc. Ninja Codes can be printed using off-the-shelf color printers on regular printing paper, and can be detected using any device equipped with a modern RGB camera and capable of running inference. Using an end-to-end process inspired by prior work on deep steganography, we jointly train a series of network modules that perform the creation and detection of Ninja Codes. Through experiments, we demonstrate Ninja Codes’ ability to provide reliable location tracking under common indoor lighting conditions, while successfully concealing themselves within diverse environmental textures. We expect Ninja Codes to offer particular value in scenarios where the conspicuous appearances of conventional fiducial markers make them undesirable for aesthetic and other reasons.
[CV-78] Dimensionality Reduction for Remote Sensing Data Analysis: A Systematic Review of Methods and Applications
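Quick read: This paper addresses the sparsity, inefficiency, and curse of dimensionality that arise from the high dimensionality of remote sensing (RS) data and that limit the effectiveness of machine learning models in tasks such as environmental monitoring, urban planning, and disaster management. The key to the solution is dimensionality reduction (DR), specifically feature extraction, which reduces complexity while preserving essential data properties and thereby improves downstream tasks such as data compression, cleaning, fusion, visualization, anomaly detection, and prediction.
Link: https://arxiv.org/abs/2510.18935
Authors: Nathan Mankovich,Kai-Hendrik Cohrs,Homer Durand,Vasileios Sitokonstantinou,Tristan Williams,Gustau Camps-Valls
Affiliations: Image Processing Lab, Universitat de València, 46980 Paterna (València), Spain; Artificial Intelligence Group, Wageningen University and Research, Wageningen, The Netherlands
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: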
Abstract:Earth observation involves collecting, analyzing, and processing an ever-growing mass of data. Automatically harvesting information is crucial for addressing significant societal, economic, and environmental challenges, ranging from environmental monitoring to urban planning and disaster management. However, the high dimensionality of these data poses challenges in terms of sparsity, inefficiency, and the curse of dimensionality, which limits the effectiveness of machine learning models. Dimensionality reduction (DR) techniques, specifically feature extraction, address these challenges by preserving essential data properties while reducing complexity and enhancing tasks such as data compression, cleaning, fusion, visualization, anomaly detection, and prediction. This review provides a handbook for leveraging DR across the RS data value chain and identifies opportunities for under-explored DR algorithms and their application in future research.
[CV-79] Automated Morphological Analysis of Neurons in Fluorescence Microscopy Using YOLOv8
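Quick read: This paper addresses accurate segmentation and morphological analysis of neuronal cells in fluorescence microscopy images, a process that traditionally depends on extensive manual annotation and is slow and labor-intensive. The key to the solution is an end-to-end automated pipeline built on a high-resolution dataset of stem-cell-derived neurons. A YOLOv8 model trained on manually annotated microscopy images performs instance segmentation with accuracy above 97%. The pipeline then uses both ground-truth and predicted masks to extract biologically meaningful features, including cell length, width, area, and grayscale intensity, reaching 75.32% accuracy on the morphological measurements and making quantitative analysis of neuron morphology more automated and scalable.
Link: https://arxiv.org/abs/2510.19455
Authors: Banan Alnemri,Arwa Basbrain
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Comments: 7 pages, 2 figures and 2 tables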
Abstract:Accurate segmentation and precise morphological analysis of neuronal cells in fluorescence microscopy images are crucial steps in neuroscience and biomedical imaging applications. However, this process is labor-intensive and time-consuming, requiring significant manual effort and expertise to ensure reliable outcomes. This work presents a pipeline for neuron instance segmentation and measurement based on a high-resolution dataset of stem-cell-derived neurons. The proposed method uses YOLOv8, trained on manually annotated microscopy images. The model achieved high segmentation accuracy, exceeding 97%. In addition, the pipeline utilized both ground truth and predicted masks to extract biologically significant features, including cell length, width, area, and grayscale intensity values. The overall accuracy of the extracted morphological measurements reached 75.32%, further supporting the effectiveness of the proposed approach. This integrated framework offers a valuable tool for automated analysis in cell imaging and neuroscience research, reducing the need for manual annotation and enabling scalable, precise quantification of neuron morphology.
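The feature-extraction step (length, width, area, and grayscale intensity from a mask) might look roughly like the sketch below. The abstract does not define how length and width are measured, so axis-aligned bounding-box extents are used here as a stand-in; the function name, the dictionary keys, and the toy arrays are assumptions.

```python
import numpy as np


def mask_features(mask, image):
    """Simple per-cell measurements from a binary mask and a grayscale image."""
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("mask is empty")
    rows, cols = np.nonzero(mask)
    height = int(rows.max() - rows.min() + 1)
    width = int(cols.max() - cols.min() + 1)
    return {
        "area_px": int(mask.sum()),           # number of pixels in the mask
        "length_px": max(height, width),      # longer bounding-box side
        "width_px": min(height, width),       # shorter bounding-box side
        "mean_intensity": float(np.asarray(image)[mask].mean()),
    }


# Toy example: a 3x2 pixel "cell" with intensity 200 on a background of 10.
mask = np.zeros((5, 6), dtype=bool)
mask[1:4, 1:3] = True
image = np.full((5, 6), 10.0)
image[1:4, 1:3] = 200.0
feats = mask_features(mask, image)
```

Running the same function on a ground-truth mask and on a predicted mask gives paired measurements that can then be compared.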
Artificial Intelligence
[AI-0] Semantic World Models
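Quick read: This paper addresses the mismatch between conventional pixel-reconstruction world models and the actual planning objective in robotic control: strong pixel reconstruction does not necessarily lead to good decisions. The key to the solution is to redefine world modeling as predicting task-relevant semantic information about future frames rather than reconstructing pixels. World modeling is posed as a visual question answering (VQA) problem about semantic information in future frames, and pretrained vision-language models (VLMs) are turned into "semantic" world models through supervised fine-tuning. The resulting model focuses on task-relevant features when planning and inherits the generalization and robustness of VLMs, which significantly improves generalization on open-ended robotics tasks.
Link: https://arxiv.org/abs/2510.19818
Authors: Jacob Berg,Chuning Zhu,Yanda Bao,Ishan Durugkar,Abhishek Gupta
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: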
Abstract:Planning with world models offers a powerful paradigm for robotic control. Conventional approaches train a model to predict future frames conditioned on current frames and actions, which can then be used for planning. However, the objective of predicting future pixels is often at odds with the actual planning objective; strong pixel reconstruction does not always correlate with good planning decisions. This paper posits that instead of reconstructing future frames as pixels, world models only need to predict task-relevant semantic information about the future. For such prediction the paper poses world modeling as a visual question answering problem about semantic information in future frames. This perspective allows world modeling to be approached with the same tools underlying vision language models. Thus vision language models can be trained as “semantic” world models through a supervised finetuning process on image-action-text data, enabling planning for decision-making while inheriting many of the generalization and robustness properties from the pretrained vision-language models. The paper demonstrates how such a semantic world model can be used for policy improvement on open-ended robotics tasks, leading to significant generalization improvements over typical paradigms of reconstruction-based action-conditional world modeling. Website available at this https URL.
[AI-1] Integrating Transparent Models, LLMs, and Practitioner-in-the-Loop: A Case of Nonprofit Program Evaluation
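Quick read: This paper addresses the trust and usability problem that public and nonprofit organizations face when adopting AI tools: existing models are generally opaque and rarely provide actionable, case-level guidance. The key to the solution is a practitioner-in-the-loop workflow. Interpretable decision-tree models first identify key predictors; each tree's structure is then given to a large language model (LLM), which produces case-level predictions grounded in the transparent model. Practitioners take part in feature engineering, model design, explanation review, and usability assessment, so that domain knowledge informs every stage of the analysis. The empirical results indicate that the approach yields accurate, trustworthy, and actionable case-level evaluations and offers a viable path to responsible AI deployment in the public and nonprofit sectors.
Link: https://arxiv.org/abs/2510.19799
Authors: Ji Ma,Albert Casella
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Software Engineering (cs.SE); General Economics (econ.GN)
Comments: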
Abstract:Public and nonprofit organizations often hesitate to adopt AI tools because most models are opaque even though standard approaches typically analyze aggregate patterns rather than offering actionable, case-level guidance. This study tests a practitioner-in-the-loop workflow that pairs transparent decision-tree models with large language models (LLMs) to improve predictive accuracy, interpretability, and the generation of practical insights. Using data from an ongoing college-success program, we build interpretable decision trees to surface key predictors. We then provide each tree’s structure to an LLM, enabling it to reproduce case-level predictions grounded in the transparent models. Practitioners participate throughout feature engineering, model design, explanation review, and usability assessment, ensuring that field expertise informs the analysis at every stage. Results show that integrating transparent models, LLMs, and practitioner input yields accurate, trustworthy, and actionable case-level evaluations, offering a viable pathway for responsible AI adoption in the public and nonprofit sectors.
[AI-2] On Controlled Change: Generative AI's Impact on Professional Authority in Journalism
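Quick read: This paper addresses the challenge that generative AI poses to professional authority in journalism and the difficulty of integrating it, namely how to fit AI into journalists' daily work without compromising journalistic ethics and professional standards. The key to the solution is the concept of "controlled change": journalists proactively manage the introduction of AI in a supervised manner by developing adaptive guidelines, experimenting with AI tools, and critically assessing their capabilities and limitations, thereby maintaining and reshaping professional authority.
Link: https://arxiv.org/abs/2510.19792
Authors: Tomás Dodds,Wang Ngai Yeung,Claudia Mellado,Mathias-Felipe de Lima-Santos
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: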
Abstract: Using (generative) artificial intelligence tools and systems in journalism is expected to increase journalists’ production rates, transform newsrooms’ economic models, and further personalize the audience’s news consumption practices. Since its release in 2022, OpenAI’s ChatGPT and other large language models have raised the alarms inside news organizations, not only for bringing new challenges to news reporting and fact-checking but also for what these technologies would mean for journalists’ professional authority in journalism. This paper examines how journalists in Dutch media manage the integration of AI technologies into their daily routines. Drawing from 13 interviews with editors, journalists, and innovation managers in different news outlets and media companies, we propose the concept of controlled change as a heuristic to explain how journalists are proactively setting guidelines, experimenting with AI tools, and identifying their limitations and capabilities. Using professional authority as a theoretical framework, we argue that journalists anticipate and integrate AI technologies in a supervised manner and identify three primary mechanisms through which journalists manage this integration: (1) developing adaptive guidelines that align AI use with ethical codes, (2) experimenting with AI technologies to determine their necessity and fit, and (3) critically assessing the capabilities and limitations of AI systems.
[AI-3] Benchmarking World-Model Learning
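Quick read: This paper addresses the gap between how world models are currently learned and evaluated and what they are meant to be used for: existing methods center on next-frame prediction and score success by reward maximization in the same environment used for training, which makes it hard to measure a model's general understanding of environment dynamics. The key to the solution is the WorldTest protocol, whose main contributions are: (1) separating a reward-free exploration phase from a scored test phase, where the test environment differs from but is related to the training environment; (2) an open-ended task design that supports many downstream tasks unknown ahead of time (such as masked-frame prediction, planning, and detecting changes in causal dynamics) and is agnostic to model representation, allowing fair comparison across approaches; and (3) an instantiation in the AutumnBench suite (43 interactive grid-world environments and 129 tasks). The results show that humans outperform current frontier models and that scaling compute improves performance only in some environments, revealing substantial headroom in world-model learning.
Link: https://arxiv.org/abs/2510.19788
Authors: Archana Warrier,Dat Nyugen,Michelangelo Naim,Moksh Jain,Yichao Liang,Karen Schroeder,Cambridge Yang,Joshua B. Tenenbaum,Sebastian Vollmer,Kevin Ellis,Zenna Tavares
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 30 pages, 10 figures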
Abstract: Model-learning agents should gather information to learn world models that support many downstream tasks and inferences, such as predicting unobserved states, estimating near- and far-term consequences of actions, planning action sequences, and detecting changes in dynamics. Current methods for learning and evaluating world models diverge from this goal: training and evaluation are anchored to next-frame prediction, and success is scored by reward maximization in the same environment. We propose WorldTest, a protocol to evaluate model-learning agents that separates reward-free interaction from a scored test phase in a different but related environment. WorldTest is open-ended (models should support many different tasks unknown ahead of time) and agnostic to model representation, allowing comparison across approaches. We instantiated WorldTest with AutumnBench, a suite of 43 interactive grid-world environments and 129 tasks across three families: masked-frame prediction, planning, and predicting changes to the causal dynamics. We compared 517 human participants and three frontier models on AutumnBench. We found that humans outperform the models, and scaling compute improves performance only in some environments but not others. WorldTest provides a novel template (reward-free exploration, derived tests, and behavior-based scoring) to evaluate what agents learn about environment dynamics, and AutumnBench exposes significant headroom in world-model learning.
[AI-4] Beyond Reactivity: Measuring Proactive Problem Solving in LLM Agents
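Quick read: This paper addresses the lack of adequate evaluation of proactivity in current large language model (LLM) agent systems. Existing benchmarks are confined to localized context and cannot measure reasoning across sources or over long time horizons, so they do not reflect an agent's ability to identify and solve problems on its own without explicit instructions. To address this, the authors propose PROBE (Proactive Resolution Of BottlEnecks), which decomposes proactivity into three steps: (1) searching for unspecified issues; (2) identifying specific bottlenecks; and (3) executing appropriate resolutions. This structured evaluation quantifies how frontier LLMs and agent frameworks perform on proactive tasks, exposes the current limits of autonomous action, and points to directions for improvement.
Link: https://arxiv.org/abs/2510.19771
Authors: Gil Pasternak,Dheeraj Rajagopal,Julia White,Dhruv Atreja,Matthew Thomas,George Hurn-Maloney,Ash Lewis
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: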
Abstract:LLM-based agents are increasingly moving towards proactivity: rather than awaiting instruction, they exercise agency to anticipate user needs and solve them autonomously. However, evaluating proactivity is challenging; current benchmarks are constrained to localized context, limiting their ability to test reasoning across sources and longer time horizons. To address this gap, we present PROBE (Proactive Resolution Of BottlEnecks). PROBE decomposes proactivity as a pipeline of three core capabilities: (1) searching for unspecified issues, (2) identifying specific bottlenecks, and (3) executing appropriate resolutions. We apply PROBE to evaluate leading LLMs and popular agentic frameworks, showing that even state-of-the-art models struggle to solve this benchmark. Computing our consistent measurements across frontier LLMs and agents, we find that the best end-to-end performance of 40% is achieved by both GPT-5 and Claude Opus-4.1. Additionally, we demonstrate the relative capabilities of each model and analyze mutual failure modes. Our results highlight the current limitations of autonomous action in agentic systems, and expose promising future research directions.
[AI-5] Learning Affordances at Inference-Time for Vision-Language-Action Models
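Quick read: This paper addresses the inability of Vision-Language-Action models (VLAs) to contextually and dynamically adjust their behavior on complex control tasks, in particular to reflect on a failure and improve their strategy. The key to the solution is Learning from Inference-Time Execution (LITEN), a framework that connects a low-level VLA policy to a high-level vision-language model (VLM). Past experiences are included in context at inference time, which lets the high-level model learn the affordances and capabilities of the low-level VLA. LITEN iterates between a reasoning phase that generates and executes plans and an assessment phase that performs structured reflection on unstructured robot trajectories (such as raw videos) and carries useful conclusions into later reasoning contexts, enabling experience-based self-correction and better planning for long-horizon tasks.
Link: https://arxiv.org/abs/2510.19752
Authors: Ameesh Shah,William Chen,Adwait Godbole,Federico Mora,Sanjit A. Seshia,Sergey Levine
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 7 pages and appendix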
Abstract:Solving complex real-world control tasks often takes multiple tries: if we fail at first, we reflect on what went wrong, and change our strategy accordingly to avoid making the same mistake. In robotics, Vision-Language-Action models (VLAs) offer a promising path towards solving complex control tasks, but lack the ability to contextually and dynamically readjust behavior when they fail to accomplish a task. In this work, we introduce Learning from Inference-Time Execution (LITEN), which connects a VLA low-level policy to a high-level VLM that conditions on past experiences by including them in-context, allowing it to learn the affordances and capabilities of the low-level VLA. Our approach iterates between a reasoning phase that generates and executes plans for the low-level VLA, and an assessment phase that reflects on the resulting execution and draws useful conclusions to be included in future reasoning contexts. Unlike similar approaches to self-refinement in non-robotics domains, LITEN must reflect on unstructured real-world robot trajectories (e.g., raw videos), which requires structured guiderails during assessment. Our experimental results demonstrate LITEN is able to effectively learn from past experience to generate plans that use high-affordance instructions to accomplish long-horizon tasks.
[AI-6] Misalignment Bounty: Crowdsourcing AI Agent Misbehavior
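Quick read: This paper addresses misalignment in advanced AI systems, where an AI's behavior departs from human intent. To gather clear, reproducible examples, the team ran the Misalignment Bounty, a crowdsourced project that collected 295 submissions of agents pursuing unintended or unsafe goals, nine of which were awarded and are analyzed in depth. The key to the approach is a structured set of evaluation criteria for identifying and verifying cases in which AI agents deviate from human intent on complex tasks, providing an empirical basis for later safety and alignment research.
Link: https://arxiv.org/abs/2510.19738
Authors: Rustem Turtayev,Natalia Fedorova,Oleg Serikov,Sergey Koldyba,Lev Avagyan,Dmitrii Volkov
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: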
Abstract: Advanced AI systems sometimes act in ways that differ from human intent. To gather clear, reproducible examples, we ran the Misalignment Bounty: a crowdsourced project that collected cases of agents pursuing unintended or unsafe goals. The bounty received 295 submissions, of which nine were awarded. This report explains the program’s motivation and evaluation criteria, and walks through the nine winning submissions step by step.
[AI-7] Enabling Granular Subgroup Level Model Evaluations by Generating Synthetic Medical Time Series
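Quick read: This paper addresses the lack of trustworthy, fine-grained evaluation methods for medical AI models in intensive care unit (ICU) settings, in particular the difficulty of reliably analyzing performance across demographic subgroups (such as intersections of sex, age, and ethnicity) while preserving privacy. The key to the solution is Enhanced TimeAutoDiff, a time-series generative model that adds distribution-alignment penalties to the latent diffusion objective. This substantially narrows the gap between evaluating a real-data-trained model on synthetic data and evaluating it on real data (the TRTS gap) while preserving the usefulness of synthetic data for training. On 24-hour mortality and length-of-stay prediction on MIMIC-III and eICU, the method reduces the TRTS gap by over 70% and considerably lowers AUROC estimation error across 32 intersectional subgroups, providing a trustworthy, privacy-preserving, subgroup-aware evaluation framework for clinical AI models.
Link: https://arxiv.org/abs/2510.19728
Authors: Mahmoud Ibrahim,Bart Elen,Chang Sun,Gökhan Ertaylan,Michel Dumontier
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: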
Abstract: We present a novel framework for leveraging synthetic ICU time-series data not only to train but also to rigorously and trustworthily evaluate predictive models, both at the population level and within fine-grained demographic subgroups. Building on prior diffusion and VAE-based generators (TimeDiff, HealthGen, TimeAutoDiff), we introduce Enhanced TimeAutoDiff, which augments the latent diffusion objective with distribution-alignment penalties. We extensively benchmark all models on MIMIC-III and eICU, on 24-hour mortality and binary length-of-stay tasks. Our results show that Enhanced TimeAutoDiff reduces the gap between real-on-synthetic and real-on-real evaluation ("TRTS gap") by over 70%, achieving $\Delta_{\mathrm{TRTS}} \leq 0.014$ AUROC, while preserving training utility ($\Delta_{\mathrm{TSTR}} \approx 0.01$). Crucially, for 32 intersectional subgroups, large synthetic cohorts cut subgroup-level AUROC estimation error by up to 50% relative to small real test sets, and outperform them in 72–84% of subgroups. This work provides a practical, privacy-preserving roadmap for trustworthy, granular model evaluation in critical care, enabling robust and reliable performance analysis across diverse patient populations without exposing sensitive EHR data, contributing to the overall trustworthiness of Medical AI.
[AI-8] RLIE: Rule Generation with Logistic Regression Iterative Refinement and Evaluation for Large Language Models
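Quick read: This paper addresses two gaps in current rule learning with large language models (LLMs): interactions among rules are ignored, and probabilistic modeling is not used enough to make inference robust. The key to the solution is RLIE, a framework that couples LLMs with probabilistic modeling in four stages: (1) an LLM generates and filters rule candidates; (2) logistic regression learns global weights for rule selection and calibration; (3) the rule set is iteratively refined using prediction errors; and (4) the weighted rules, used directly as a classifier, are evaluated against alternatives. Experiments show that applying the learned rule weights explicitly outperforms injecting the rules into an LLM prompt. This suggests that LLMs excel at semantic generation and interpretation but are less reliable for precise probabilistic integration, clarifies the potential and limits of LLMs for inductive reasoning, and enables more reliable neuro-symbolic reasoning.
Link: https://arxiv.org/abs/2510.19698
Authors: Yang Yang,Hua XU,Zhangyi Hu,Yutao Yue
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: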
Abstract:Large Language Models (LLMs) can propose rules in natural language, sidestepping the need for a predefined predicate space in traditional rule learning. Yet many LLM-based approaches ignore interactions among rules, and the opportunity to couple LLMs with probabilistic rule learning for robust inference remains underexplored. We present RLIE, a unified framework that integrates LLMs with probabilistic modeling to learn a set of weighted rules. RLIE has four stages: (1) Rule generation, where an LLM proposes and filters candidates; (2) Logistic regression, which learns probabilistic weights for global selection and calibration; (3) Iterative refinement, which updates the rule set using prediction errors; and (4) Evaluation, which compares the weighted rule set as a direct classifier with methods that inject rules into an LLM. We evaluate multiple inference strategies on real-world datasets. Applying rules directly with their learned weights yields superior performance, whereas prompting LLMs with the rules, weights, and logistic-model outputs surprisingly degrades accuracy. This supports the view that LLMs excel at semantic generation and interpretation but are less reliable for precise probabilistic integration. RLIE clarifies the potential and limitations of LLMs for inductive reasoning and couples them with classic probabilistic rule combination methods to enable more reliable neuro-symbolic reasoning.
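Stage (2), learning weights over rules with logistic regression and then using the weighted rules directly as a classifier, can be sketched as follows. This is an illustrative toy, not RLIE itself: each rule is reduced to a binary "fires / does not fire" feature, the optimiser is plain gradient descent written in NumPy, and the data, hyperparameters, and function names are all assumptions.

```python
import numpy as np


def fit_rule_weights(rule_hits, labels, lr=0.5, steps=2000, l2=0.001):
    """Logistic regression over binary rule activations, by gradient descent."""
    X = np.asarray(rule_hits, dtype=float)
    y = np.asarray(labels, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        grad = p - y
        w -= lr * (X.T @ grad / len(y) + l2 * w)
        b -= lr * grad.mean()
    return w, b


def predict(rule_hits, w, b):
    """Apply the weighted rule set directly as a classifier."""
    return (np.asarray(rule_hits, dtype=float) @ w + b > 0).astype(int)


# Hypothetical data: rows are examples, columns say whether each of 3 rules fires.
hits = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
labels = [1, 1, 0, 0, 1, 0]  # in this toy set only rule 0 is informative
w, b = fit_rule_weights(hits, labels)
preds = predict(hits, w, b)
```

In this toy set the learned weight for rule 0 dominates, which is the kind of global selection and calibration signal the abstract attributes to the logistic-regression stage.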
[AI-9] Toward Agentic Software Engineering Beyond Code: Framing Vision, Values, and Vocabulary
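Quick read: This paper addresses the tendency of current research on agentic AI in software engineering (SE) to concentrate on code-related activities while overlooking broader socio-technical concerns. The key to the solution is threefold: first, expanding the scope of agentic SE from code generation to a "whole of process" vision grounded in SE foundations and evolution and in emerging agentic SE frameworks; second, proposing a preliminary set of values and principles to guide practice; and third, stressing the importance of designing and using a well-defined vocabulary. The aim is for agentic SE to become not just inevitable but also deliberate and desirable in the long run.
Link: https://arxiv.org/abs/2510.19692
Authors: Rashina Hoda
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 5 pages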
Abstract: Agentic AI is poised to usher in a seismic paradigm shift in Software Engineering (SE). As technologists rush headlong to make agentic AI a reality, SE researchers are driven to establish agentic SE as a research area. While early visions of agentic SE are primarily focused on code-related activities, early empirical evidence calls for a consideration of a range of socio-technical concerns to make it work in practice. This paper contributes to the emerging community vision by: (a) recommending an expansion of its scope beyond code, toward a ‘whole of process’ vision, grounding it in SE foundations and evolution and emerging agentic SE frameworks, (b) proposing a preliminary set of values and principles to guide efforts, and (c) sharing guidance on designing/using well-defined vocabulary for agentic SE. It is hoped that these ideas will encourage community collaborations and steer the SE community towards laying strong foundations of agentic SE so it is not only inevitable but also deliberate and desirable in the long run.
[AI-10] Serverless GPU Architecture for Enterprise HR Analytics: A Production-Scale BDaaS Implementation
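Quick read: This paper addresses the challenge that industrial and government organizations face when deploying data-driven analytics in regulated environments: delivering low-latency, high-throughput, cost-efficient inference while remaining compliant (for example with IL4/FIPS). Distributed frameworks such as Spark and Flink suit massive-scale batch or streaming analytics, but their coordination complexity and auditing overhead fit poorly with moderate-scale, latency-sensitive inference. The key to the solution is a production-oriented Big Data as a Service (BDaaS) blueprint that combines a single-node serverless GPU runtime with the interpretable TabNet model: GPU acceleration raises throughput, serverless elasticity lowers cost, and feature-mask interpretability supports compliance. Compared with Spark baselines, the approach achieves up to 4.5x higher throughput, 98x lower latency, and 90% lower cost per 1K inferences; the compliance mechanisms add only about 5.7 ms of latency (p99 of 22 ms), and interpretability remains stable, giving regulated environments a secure, interpretable, and cost-efficient way to deploy AI inference.
Link: https://arxiv.org/abs/2510.19689
Authors: Guilin Zhang,Wulan Guo,Ziqi Tan,Srinivas Vippagunta,Suchitra Raman,Shreeshankar Chatterjee,Ju Lin,Shang Liu,Mary Schladenhauffen,Jeffrey Luo,Hailong Jiang
Affiliations: unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 7 figures, 4 tables. Accepted to IEEE BigData 2025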
Abstract:Industrial and government organizations increasingly depend on data-driven analytics for workforce, finance, and regulated decision processes, where timeliness, cost efficiency, and compliance are critical. Distributed frameworks such as Spark and Flink remain effective for massive-scale batch or streaming analytics but introduce coordination complexity and auditing overheads that misalign with moderate-scale, latency-sensitive inference. Meanwhile, cloud providers now offer serverless GPUs, and models such as TabNet enable interpretable tabular ML, motivating new deployment blueprints for regulated environments. In this paper, we present a production-oriented Big Data as a Service (BDaaS) blueprint that integrates a single-node serverless GPU runtime with TabNet. The design leverages GPU acceleration for throughput, serverless elasticity for cost reduction, and feature-mask interpretability for IL4/FIPS compliance. We conduct benchmarks on the HR, Adult, and BLS datasets, comparing our approach against Spark and CPU baselines. Our results show that GPU pipelines achieve up to 4.5x higher throughput, 98x lower latency, and 90% lower cost per 1K inferences compared to Spark baselines, while compliance mechanisms add only ~5.7 ms latency with p99 22 ms. Interpretability remains stable under peak load, ensuring reliable auditability. Taken together, these findings provide a compliance-aware benchmark, a reproducible Helm-packaged blueprint, and a decision framework that demonstrate the practicality of secure, interpretable, and cost-efficient serverless GPU analytics for regulated enterprise and government settings.
[AI-11] Directive, Metacognitive or a Blend of Both? A Comparison of AI-Generated Feedback Types on Student Engagement, Confidence, and Outcomes
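Quick read: This paper asks how AI-generated feedback can best improve student learning in educational settings, and in particular compares the effects of directive feedback and metacognitive feedback on engagement, confidence, and quality of work. The key to the solution is designing and testing hybrid AI-generated feedback, which combines the explicit guidance of directive feedback with the reflection prompted by metacognitive feedback, so that students get immediate direction for improvement as well as support for developing self-regulated learning (SRL) skills. The results show that hybrid feedback prompted the most revisions, while confidence and quality of work were comparable across the three feedback types, which suggests it can balance efficiency with deeper learning.
Link: https://arxiv.org/abs/2510.19685
Authors: Omar Alsaiari,Nilufar Baghaei,Jason M. Lodge,Omid Noroozi,Dragan Gašević,Marie Boden,Hassan Khosravi
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: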
Abstract:Feedback is one of the most powerful influences on student learning, with extensive research examining how best to implement it in educational settings. Increasingly, feedback is being generated by artificial intelligence (AI), offering scalable and adaptive responses. Two widely studied approaches are directive feedback, which gives explicit explanations and reduces cognitive load to speed up learning, and metacognitive feedback which prompts learners to reflect, track their progress, and develop self-regulated learning (SRL) skills. While both approaches have clear theoretical advantages, their comparative effects on engagement, confidence, and quality of work remain underexplored. This study presents a semester-long randomised controlled trial with 329 students in an introductory design and programming course using an adaptive educational platform. Participants were assigned to receive directive, metacognitive, or hybrid AI-generated feedback that blended elements of both directive and metacognitive feedback. Results showed that revision behaviour differed across feedback conditions, with Hybrid prompting the most revisions compared to Directive and Metacognitive. Confidence ratings were uniformly high, and resource quality outcomes were comparable across conditions. These findings highlight the promise of AI in delivering feedback that balances clarity with reflection. Hybrid approaches, in particular, show potential to combine actionable guidance for immediate improvement with opportunities for self-reflection and metacognitive growth.
[AI-12] Study of Training Dynamics for Memory-Constrained Fine-Tuning
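Quick read: This paper addresses efficient training of deep neural networks (DNNs) under resource constraints, in particular how to keep memory use strictly bounded while maintaining high performance as models grow. The key to the solution is TraDy, a transfer learning scheme built on two insights: the importance of a layer for updates is architecture-dependent and can be determined a priori, and dynamic stochastic channel selection gives a better gradient approximation than static approaches. Concretely, TraDy stochastically resamples channels between epochs within preselected layers. It achieves state-of-the-art downstream performance while keeping memory use low, reaching up to 99% activation sparsity, 95% weight derivative sparsity, and a 97% reduction in FLOPs for weight derivative computation.
Link: https://arxiv.org/abs/2510.19675
Authors: Aël Quélennec,Nour Hezbri,Pavlo Mozharovskyi,Van-Tam Nguyen,Enzo Tartaglione
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: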
Abstract:Memory-efficient training of deep neural networks has become increasingly important as models grow larger while deployment environments impose strict resource constraints. We propose TraDy, a novel transfer learning scheme leveraging two key insights: layer importance for updates is architecture-dependent and determinable a priori, while dynamic stochastic channel selection provides superior gradient approximation compared to static approaches. We introduce a dynamic channel selection approach that stochastically resamples channels between epochs within preselected layers. Extensive experiments demonstrate TraDy achieves state-of-the-art performance across various downstream tasks and architectures while maintaining strict memory constraints, achieving up to 99% activation sparsity, 95% weight derivative sparsity, and 97% reduction in FLOPs for weight derivative computation.
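The "resample channels between epochs within preselected layers" step can be pictured with the sketch below. It is a loose illustration rather than TraDy's actual procedure: the abstract does not say how channels are sampled or how the budget is set, so uniform sampling and a fixed per-layer budget are assumptions, and the layer names are hypothetical.

```python
import numpy as np


def sample_channel_masks(layer_channels, budget, rng):
    """Pick `budget` trainable channels per preselected layer, uniformly at random."""
    masks = {}
    for name, n_channels in layer_channels.items():
        mask = np.zeros(n_channels, dtype=bool)
        chosen = rng.choice(n_channels, size=min(budget, n_channels), replace=False)
        mask[chosen] = True
        masks[name] = mask
    return masks


rng = np.random.default_rng(0)
layers = {"block3.conv": 16, "block4.conv": 32}  # hypothetical preselected layers
for epoch in range(3):
    # A fresh random subset of channels is drawn at the start of every epoch.
    masks = sample_channel_masks(layers, budget=4, rng=rng)
    # A training step would compute weight gradients only where the mask is True.
```

Keeping the budget fixed bounds the memory needed for weight gradients, while resampling lets different channels receive updates over the course of training.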
[AI-13] Explainable e-sports win prediction through Machine Learning classification in streaming
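Quick read: This paper addresses the accuracy and explainability of professional win prediction in e-sports; most existing work treats it as batch classification and neglects streaming data and visualization. The key to the solution is a streaming approach in which input data is managed over several sliding windows to capture relevant game changes, combined with an explainability module. It achieves real-time win prediction with accuracy above 90% and makes the resulting decision support more trustworthy and practical.
Link: https://arxiv.org/abs/2510.19671
Authors: Silvia García-Méndez,Francisco de Arriba-Pérez
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: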
Abstract:The increasing number of spectators and players in e-sports, along with the development of optimized communication solutions and cloud computing technology, has motivated the constant growth of the online game industry. Even though Artificial Intelligence-based solutions for e-sports analytics are traditionally defined as extracting meaningful patterns from related data and visualizing them to enhance decision-making, most of the effort in professional winning prediction has been focused on the classification aspect from a batch perspective, also leaving aside the visualization techniques. Consequently, this work contributes to an explainable win prediction classification solution in streaming in which input data is controlled over several sliding windows to reflect relevant game changes. Experimental results attained an accuracy higher than 90 %, surpassing the performance of competing solutions in the literature. Ultimately, our system can be leveraged by ranking and recommender systems for informed decision-making, thanks to the explainability module, which fosters trust in the outcome predictions.
[AI-14] A Graph Engine for Guitar Chord-Tone Soloing Education
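Quick read: This paper addresses the difficulty guitar students face when practicing chord tone soloing, a skill that underlies advanced jazz guitar theory but is hard to learn. The key to the solution is a graph-based engine: it first generates chord-tone arpeggios for each chord, then treats each arpeggio as a node in a weighted graph, computes edge weights between the nodes of consecutive chords in terms of optimal transition tones, and finally finds the shortest path to reconstruct a coherent chord-tone soloing line that students can practice.
Link: https://arxiv.org/abs/2510.19666
Authors: Matthew Keating,Michael Casey
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: ICMC 2025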
Abstract:We present a graph-based engine for computing chord tone soloing suggestions for guitar students. Chord tone soloing is a fundamental practice for improvising over a chord progression, where the instrumentalist uses only the notes contained in the current chord. This practice is a building block for all advanced jazz guitar theory but is difficult to learn and practice. First, we discuss methods for generating chord-tone arpeggios. Next, we construct a weighted graph where each node represents a chord tone arpeggio for a chord in the progression. Then, we calculate the edge weight between each consecutive chord’s nodes in terms of optimal transition tones. We then find the shortest path through this graph and reconstruct a chord-tone soloing line. Finally, we discuss a user-friendly system to handle input and output to this engine for guitar students to practice chord tone soloing.
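The shortest-path step can be sketched as dynamic programming over a layered graph, with one layer of candidate arpeggios per chord. This is a minimal sketch under stated assumptions: arpeggios are written as tuples of MIDI pitches, and the edge weight `semitone_gap` (the distance between the last note of one arpeggio and the first note of the next) is a hypothetical stand-in for the paper's "optimal transition tones" weighting, which the abstract does not specify.

```python
def best_line(layers, weight):
    """Shortest path through a layered graph; layers[i] lists candidates for chord i."""
    cost = [0.0] * len(layers[0])  # cheapest cost of reaching each candidate so far
    back = []                      # back-pointers, one list per transition
    for prev, cur in zip(layers, layers[1:]):
        new_cost, pointers = [], []
        for b in cur:
            options = [cost[i] + weight(a, b) for i, a in enumerate(prev)]
            i_best = min(range(len(options)), key=options.__getitem__)
            new_cost.append(options[i_best])
            pointers.append(i_best)
        cost = new_cost
        back.append(pointers)
    # Walk the back-pointers from the cheapest final candidate.
    j = min(range(len(cost)), key=cost.__getitem__)
    total = cost[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return total, [layers[i][j] for i, j in enumerate(path)]

def semitone_gap(a, b):
    """Assumed edge weight: semitones between the end of a and the start of b."""
    return abs(a[-1] - b[0])

# Cmaj7 -> Dm7 -> G7, each with an ascending and a descending arpeggio.
layers = [
    [(60, 64, 67, 71), (71, 67, 64, 60)],
    [(62, 65, 69, 72), (72, 69, 65, 62)],
    [(55, 59, 62, 65), (65, 62, 59, 55)],
]
total, line = best_line(layers, semitone_gap)
```

Because edges only connect consecutive chords, a single left-to-right pass is enough here; a general shortest-path routine such as Dijkstra's algorithm would give the same result on this graph.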
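[AI-15] AgentSense: LLMs Empower Generalizable and Explainable Web-Based Participatory Urban Sensing
Quick read: This paper addresses two weaknesses of current web-based participatory urban sensing systems: limited generalization across diverse urban scenarios and poor interpretability of decisions. The key to the solution is AgentSense, a hybrid, training-free framework that embeds large language models (LLMs) in a multi-agent evolution system: a classical planner first produces baseline task assignments, which are then iteratively refined to adapt to dynamic urban conditions and heterogeneous worker preferences, with natural language explanations generated along the way to improve transparency and trust. Experiments on two large-scale mobility datasets and seven types of dynamic disturbances show advantages in adaptivity and explainability over traditional methods.
Link: https://arxiv.org/abs/2510.19661
Authors: Xusen Guo,Mingxing Peng,Xixuan Hao,Xingchen Zou,Qiongyan Wang,Sijie Ruan,Yuxuan Liang
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 13 pages, 10 pages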
Abstract:Web-based participatory urban sensing has emerged as a vital approach for modern urban management by leveraging mobile individuals as distributed sensors. However, existing urban sensing systems struggle with limited generalization across diverse urban scenarios and poor interpretability in decision-making. In this work, we introduce AgentSense, a hybrid, training-free framework that integrates large language models (LLMs) into participatory urban sensing through a multi-agent evolution system. AgentSense initially employs a classical planner to generate baseline solutions and then iteratively refines them to adapt sensing task assignments to dynamic urban conditions and heterogeneous worker preferences, while producing natural language explanations that enhance transparency and trust. Extensive experiments across two large-scale mobility datasets and seven types of dynamic disturbances demonstrate that AgentSense offers distinct advantages in adaptivity and explainability over traditional methods. Furthermore, compared to single-agent LLM baselines, our approach achieves better performance and robustness, while delivering more reasonable and transparent explanations. These results position AgentSense as a significant advancement towards deploying adaptive and explainable urban sensing systems on the web.
[AI-16] A Goal-Driven Survey on Root Cause Analysis
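Quick read: This paper addresses the confused categorization of root cause analysis (RCA) research that results from unclear goals. Most existing surveys group RCA work only by input data type (e.g., metric-based vs. trace-based methods) and overlook the different goals behind the tasks. Localizing a faulty service for rapid triage, for instance, is fundamentally different from identifying a specific functional bug for a definitive fix. This coarse grouping obscures the field's real progress and gaps. The key to the solution is a goal-driven framework that categorizes and integrates 135 papers (2014 to 2025) on RCA for cloud incident management according to their goals, makes the task differences between RCA formulations explicit, and discusses the ultimate goal shared by all RCA work, giving later research a structured view and direction.
Link: https://arxiv.org/abs/2510.19593
Authors: Aoyang Fang,Haowen Yang,Haoze Dong,Qisheng Lu,Junjielong Xu,Pinjia He
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: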
Abstract:Root Cause Analysis (RCA) is a crucial aspect of incident management in large-scale cloud services. While the term root cause analysis or RCA has been widely used, different studies formulate the task differently. This is because the term “RCA” implicitly covers tasks with distinct underlying goals. For instance, the goal of localizing a faulty service for rapid triage is fundamentally different from identifying a specific functional bug for a definitive fix. However, previous surveys have largely overlooked these goal-based distinctions, conventionally categorizing papers by input data types (e.g., metric-based vs. trace-based methods). This leads to the grouping of works with disparate objectives, thereby obscuring the true progress and gaps in the field. Meanwhile, the typical audience of an RCA survey is either laymen who want to know the goals and big picture of the task or RCA researchers who want to figure out past research under the same task formulation. Thus, an RCA survey that organizes the related papers according to their goals is in high demand. To this end, this paper presents a goal-driven framework that effectively categorizes and integrates 135 papers on RCA in the context of cloud incident management based on their diverse goals, spanning the period from 2014 to 2025. In addition to the goal-driven categorization, it discusses the ultimate goal of all RCA papers as an umbrella covering different RCA formulations. Moreover, the paper discusses open challenges and future directions in RCA.
[AI-17] DAIL: Beyond Task Ambiguity for Language-Conditioned Reinforcement Learning
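Quick read: This paper addresses the performance degradation that agents suffer in language-conditioned tasks because flexible natural language instructions are ambiguous. The key to the solution is DAIL (Distributional Aligned Learning), which has two components: a distributional policy and a semantic alignment module. The paper gives theoretical results that the value distribution estimation mechanism enhances task differentiability, while the semantic alignment module captures the correspondence between trajectories and linguistic instructions. Together they resolve instruction ambiguity and outperform baseline methods on both structured and visual observation benchmarks.
Link: https://arxiv.org/abs/2510.19562
Authors: Runpeng Xie,Quanwei Wang,Hao Hu,Zherui Zhou,Ni Mu,Xiyun Li,Yiqin Yang,Shuang Xu,Qianchuan Zhao,Bo XU
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Website at: this https URL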
Abstract:Comprehending natural language and following human instructions are critical capabilities for intelligent agents. However, the flexibility of linguistic instructions induces substantial ambiguity across language-conditioned tasks, severely degrading algorithmic performance. To address these limitations, we present a novel method named DAIL (Distributional Aligned Learning), featuring two key components: distributional policy and semantic alignment. Specifically, we provide theoretical results that the value distribution estimation mechanism enhances task differentiability. Meanwhile, the semantic alignment module captures the correspondence between trajectories and linguistic instructions. Extensive experimental results on both structured and visual observation benchmarks demonstrate that DAIL effectively resolves instruction ambiguities, achieving superior performance to baseline methods. Our implementation is available at this https URL.
[AI-18] Insights into the Unknown: Federated Data Diversity Analysis on Molecular Data
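Quick read: This paper addresses the limited industrial uptake of AI methods in drug discovery: current methods rely mainly on public datasets, which lack the scale and diversity of proprietary pharmaceutical data. Federated learning (FL) allows privacy-preserving collaborative training across data silos, but federated access complicates data-centric tasks such as estimating dataset diversity, performing informed data splits, and understanding the structure of the combined chemical space. The key to the solution is a systematic evaluation of three federated clustering methods (Fed-kMeans, Fed-PCA+Fed-kMeans, and Fed-LSH) on how well they disentangle and represent distributed molecular data, together with SF-ICF, a chemistry-informed evaluation metric introduced to make the analysis more domain-relevant and interpretable.
Link: https://arxiv.org/abs/2510.19535
Authors: Markus Bujotzek,Evelyn Trautmann,Calum Hand,Ian Hales
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: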
Abstract:AI methods are increasingly shaping pharmaceutical drug discovery. However, their translation to industrial applications remains limited due to their reliance on public datasets, which lack the scale and diversity of proprietary pharmaceutical data. Federated learning (FL) offers a promising approach to integrate private data into privacy-preserving, collaborative model training across data silos. This federated data access complicates important data-centric tasks such as estimating dataset diversity, performing informed data splits, and understanding the structure of the combined chemical space. To address this gap, we investigate how well federated clustering methods can disentangle and represent distributed molecular data. We benchmark three approaches, Federated kMeans (Fed-kMeans), Federated Principal Component Analysis combined with Fed-kMeans (Fed-PCA+Fed-kMeans), and Federated Locality-Sensitive Hashing (Fed-LSH), against their centralized counterparts on eight diverse molecular datasets. Our evaluation uses both standard mathematical metrics and a chemistry-informed evaluation metric, SF-ICF, that we introduce in this work. The large-scale benchmarking combined with an in-depth explainability analysis shows the importance of incorporating domain knowledge through chemistry-informed metrics, and on-client explainability analyses for federated diversity analysis on molecular data.
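To make the idea of federated clustering concrete, here is a minimal sketch of a generic federated k-means round. It is a textbook-style formulation and not necessarily the paper's Fed-kMeans: clients share only per-cluster sums and counts, and the server averages them. The toy 2-D points stand in for molecular descriptors, and the function names are hypothetical.

```python
def local_stats(points, centroids):
    """Client side: per-cluster coordinate sums and counts; raw points stay local."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in points:
        # Assign the point to its nearest centroid (squared Euclidean distance).
        k = min(range(len(centroids)),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += p[d]
    return sums, counts

def server_update(all_stats, centroids):
    """Server side: combine the clients' statistics into new centroids."""
    new = []
    for k, old in enumerate(centroids):
        n = sum(counts[k] for _, counts in all_stats)
        if n == 0:
            new.append(list(old))  # leave an empty cluster where it was
            continue
        new.append([sum(sums[k][d] for sums, _ in all_stats) / n
                    for d in range(len(old))])
    return new

clients = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(10.0, 10.0), (11.0, 10.0), (0.0, 1.0)],
]
centroids = [[0.0, 0.0], [10.0, 10.0]]
for _ in range(5):  # a few communication rounds
    centroids = server_update([local_stats(c, centroids) for c in clients], centroids)
```

With exact sums and counts this gives the same centroids as centralized k-means from the same initialization; a real deployment would typically add secure aggregation or noise, which this sketch omits.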
[AI-19] Optimizing the Unknown: Black Box Bayesian Optimization with Energy-Based Model and Reinforcement Learning NEURIPS2025
Quick read: This paper addresses the one-step bias of conventional Bayesian Optimization (BO), which can cause convergence to local optima and poor global search in complex or high-dimensional tasks. The key to the solution is to define each BO iteration as a Markov Decision Process (MDP) and to combine an Energy-Based Model (EBM), which captures global structural information, with Gaussian Processes (GP), which provide local guidance. Proximal Policy Optimization (PPO) is then used for adaptive multi-step lookahead, dynamically adjusting the depth and direction of exploration.
Link: https://arxiv.org/abs/2510.19530
Authors: Ruiyao Miao,Junren Xiao,Shiya Tsang,Hui Xiong,Yingnian Wu
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This paper is accepted by 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Abstract:Existing Bayesian Optimization (BO) methods typically balance exploration and exploitation to optimize costly objective functions. However, these methods often suffer from a significant one-step bias, which may lead to convergence towards local optima and poor performance in complex or high-dimensional tasks. Recently, Black-Box Optimization (BBO) has achieved success across various scientific and engineering domains, particularly when function evaluations are costly and gradients are unavailable. Motivated by this, we propose the Reinforced Energy-Based Model for Bayesian Optimization (REBMBO), which integrates Gaussian Processes (GP) for local guidance with an Energy-Based Model (EBM) to capture global structural information. Notably, we define each Bayesian Optimization iteration as a Markov Decision Process (MDP) and use Proximal Policy Optimization (PPO) for adaptive multi-step lookahead, dynamically adjusting the depth and direction of exploration to effectively overcome the limitations of traditional BO methods. We conduct extensive experiments on synthetic and real-world benchmarks, confirming the superior performance of REBMBO. Additional analyses across various GP configurations further highlight its adaptability and robustness.
[AI-20] From Prototypes to Sparse ECG Explanations: SHAP-Driven Counterfactuals for Multivariate Time-Series Multi-class Classification
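Quick read: This paper addresses the lack of explainability of deep learning models for 12-lead electrocardiogram (ECG) classification, in particular how to generate sparse, physiologically plausible counterfactual explanations that help clinicians trust and understand AI diagnoses. The key to the solution is a prototype-driven framework: SHAP (Shapley Additive Explanations) thresholds identify critical signal segments and turn them into interval rules, Dynamic Time Warping (DTW) and medoid clustering extract representative prototypes, and the prototypes are aligned to the R-peaks of the query sample for temporal coherence. The method modifies only 78% of the original signal while maintaining 81.3% overall validity and a 43% improvement in temporal stability, and it supports near-real-time generation of clinically valid counterfactuals as a basis for interactive explanation platforms.
Link: https://arxiv.org/abs/2510.19514
Authors: Maciej Mozolewski,Betül Bayrak,Kerstin Bach,Grzegorz J. Nalepa
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: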
Abstract:In eXplainable Artificial Intelligence (XAI), instance-based explanations for time series have gained increasing attention due to their potential for actionable and interpretable insights in domains such as healthcare. Addressing the challenges of explainability of state-of-the-art models, we propose a prototype-driven framework for generating sparse counterfactual explanations tailored to 12-lead ECG classification models. Our method employs SHAP-based thresholds to identify critical signal segments and convert them into interval rules, uses Dynamic Time Warping (DTW) and medoid clustering to extract representative prototypes, and aligns these prototypes to query R-peaks for coherence with the sample being explained. The framework generates counterfactuals that modify only 78% of the original signal while maintaining 81.3% validity across all classes and achieving 43% improvement in temporal stability. We evaluate three variants of our approach, Original, Sparse, and Aligned Sparse, with class-specific performance ranging from 98.9% validity for myocardial infarction (MI) to challenges with hypertrophy (HYP) detection (13.2%). This approach supports near real-time generation (< 1 second) of clinically valid counterfactuals and provides a foundation for interactive explanation platforms. Our findings establish design principles for physiologically-aware counterfactual explanations in AI-based diagnosis systems and outline pathways toward user-controlled explanation interfaces for clinical deployment.
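The prototype-extraction step (DTW plus medoid selection) can be sketched as follows. This is a simplified illustration and not the authors' pipeline: it uses single-channel toy sequences rather than 12-lead ECG segments, the textbook DTW recurrence with absolute difference as the local cost, and a single medoid instead of full medoid clustering.

```python
def dtw(a, b):
    """Classic dynamic-programming DTW distance between two 1-D sequences."""
    inf = float("inf")
    d = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            d[i][j] = step + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[len(a)][len(b)]

def medoid(segments):
    """Return the segment with the smallest summed DTW distance to all others."""
    return min(segments, key=lambda s: sum(dtw(s, t) for t in segments))

segments = [
    [0, 1, 3, 1, 0],
    [0, 0, 1, 3, 1, 0],   # same shape, shifted by one sample
    [0, 2, 6, 2, 0],      # same shape, larger amplitude
]
prototype = medoid(segments)
```

Unlike a mean, a medoid is always one of the observed segments, which is one plausible reason to prefer it when the prototype must remain a physiologically realistic signal.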
[AI-21] Modeling realistic human behavior using generative agents in a multimodal transport system: Software architecture and Application to Toulouse
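Quick read: This paper addresses the challenge of modeling realistic human mobility behavior in order to understand people's mode choices in complex multimodal transport systems and propose personalised mobility solutions. The key to the solution is to integrate Large Language Models (LLMs) into an agent-based simulation: the GAMA platform models the interactive transport environment, General Transit Feed Specification (GTFS) data describes public transport, and OpenTripPlanner provides multimodal routing, so that agents make context-aware transport decisions and form habits over time.
Link: https://arxiv.org/abs/2510.19497
Authors: Trung-Dung Vu,Benoit Gaudou,Kamaldeep Singh Oberoi
Affiliations: unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: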
Abstract:Modeling realistic human behaviour to understand people’s mode choices in order to propose personalised mobility solutions remains challenging. This paper presents an architecture for modeling realistic human mobility behavior in complex multimodal transport systems, demonstrated through a case study in Toulouse, France. We apply Large Language Models (LLMs) within an agent-based simulation to capture decision-making in a real urban setting. The framework integrates the GAMA simulation platform with an LLM-based generative agent, along with General Transit Feed Specification (GTFS) data for public transport, and OpenTripPlanner for multimodal routing. GAMA platform models the interactive transport environment, providing visualization and dynamic agent interactions while eliminating the need to construct the simulation environment from scratch. This design enables a stronger focus on developing generative agents and evaluating their performance in transport decision-making processes. Over a simulated month, results show that agents not only make context-aware transport decisions but also form habits over time. We conclude that combining LLMs with agent-based simulation offers a promising direction for advancing intelligent transportation systems and personalised multimodal mobility solutions. We also discuss some limitations of this approach and outline future work on scaling to larger regions, integrating real-time data, and refining memory models.
[AI-22] Using Non-Expert Data to Robustify Imitation Learning via Offline Reinforcement Learning
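Quick read: This paper addresses the limited adaptability of imitation learning in robotics, which depends on high-quality, task-specific data and therefore struggles with the diversity of real-world object configurations and scenarios. The key to the solution is to use offline reinforcement learning (offline RL), with the right design decisions, as a tool for exploiting non-expert data (play data, suboptimal demonstrations, partial task completions, or rollouts from suboptimal policies) to improve imitation learning policies. The paper shows that simple algorithmic modifications make this data usable under sparse data coverage without significant additional assumptions. The core idea is to broaden the support of the policy distribution, which gives imitation algorithms augmented by offline RL better recovery and generalization behavior and significantly widens the range of initial conditions under which manipulation policies succeed.
Link: https://arxiv.org/abs/2510.19495
Authors: Kevin Huang,Rosario Scalise,Cleah Winston,Ayush Agrawal,Yunchu Zhang,Rohan Baijal,Markus Grotz,Byron Boots,Benjamin Burchfiel,Hongkai Dai,Masha Itkina,Paarth Shah,Abhishek Gupta
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: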
Abstract:Imitation learning has proven effective for training robots to perform complex tasks from expert human demonstrations. However, it remains limited by its reliance on high-quality, task-specific data, restricting adaptability to the diverse range of real-world object configurations and scenarios. In contrast, non-expert data – such as play data, suboptimal demonstrations, partial task completions, or rollouts from suboptimal policies – can offer broader coverage and lower collection costs. However, conventional imitation learning approaches fail to utilize this data effectively. To address these challenges, we posit that with right design decisions, offline reinforcement learning can be used as a tool to harness non-expert data to enhance the performance of imitation learning policies. We show that while standard offline RL approaches can be ineffective at actually leveraging non-expert data under the sparse data coverage settings typically encountered in the real world, simple algorithmic modifications can allow for the utilization of this data, without significant additional assumptions. Our approach shows that broadening the support of the policy distribution can allow imitation algorithms augmented by offline RL to solve tasks robustly, showing considerably enhanced recovery and generalization behavior. In manipulation tasks, these innovations significantly increase the range of initial conditions where learned policies are successful when non-expert data is incorporated. Moreover, we show that these methods are able to leverage all collected data, including partial or suboptimal demonstrations, to bolster task-directed policy performance. This underscores the importance of algorithmic techniques for using non-expert data for robust policy learning in robotics.
[AI-23] Graph Unlearning Meets Influence-aware Negative Preference Optimization
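Quick read: This paper addresses the drastic loss of model utility that occurs during graph unlearning when gradient ascent is applied to the forget set, a consequence of the rapid divergence of gradient ascent. The key to the solution is INPO, an Influence-aware Negative Preference Optimization framework with three parts: (1) an influence-aware message function that amplifies the influence of unlearned edges and mitigates the tight topological coupling between the forget set and the retain set; (2) a removal-based method that quickly estimates the influence of each edge; and (3) a topological entropy loss that avoids excessive loss of local structural information. Together these slow the divergence and make model utility more robust to unlearning. On five real-world datasets the method achieves state-of-the-art results on all forget quality metrics.
Link: https://arxiv.org/abs/2510.19479
Authors: Qiang Chen,Zhongze Wu,Ang He,Xi Lin,Shuo Jiang,Shan You,Chang Xu,Yi Chen,Xiu Su
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: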
Abstract:Recent advancements in graph unlearning models have enhanced model utility by preserving the node representation essentially invariant, while using gradient ascent on the forget set to achieve unlearning. However, this approach causes a drastic degradation in model utility during the unlearning process due to the rapid divergence speed of gradient ascent. In this paper, we introduce INPO, an Influence-aware Negative Preference Optimization framework that focuses on slowing the divergence speed and improving the robustness of the model utility to the unlearning process. Specifically, we first analyze that NPO has slower divergence speed and theoretically propose that unlearning high-influence edges can reduce impact of unlearning. We design an influence-aware message function to amplify the influence of unlearned edges and mitigate the tight topological coupling between the forget set and the retain set. The influence of each edge is quickly estimated by a removal-based method. Additionally, we propose a topological entropy loss from the perspective of topology to avoid excessive information loss in the local structure during unlearning. Extensive experiments conducted on five real-world datasets demonstrate that INPO-based model achieves state-of-the-art performance on all forget quality metrics while maintaining the model’s utility. Codes are available at this https URL.
[AI-24] A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring
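Quick read: This paper addresses the point at which AI systems approach dangerous capability levels and inability safety cases become insufficient. The key to the solution is a safety case built on chain-of-thought (CoT) monitoring with two parts: first, establishing that models lack dangerous capabilities when operating without their CoT; second, ensuring that any dangerous capabilities enabled by a CoT are detectable by CoT monitoring. The author systematically examines two threats to monitorability, neuralese and encoded reasoning, evaluates techniques for maintaining CoT faithfulness, explores extracting a monitorable CoT from a non-monitorable one, and sets up prediction markets to aggregate forecasts on key technical milestones.
Link: https://arxiv.org/abs/2510.19476
Authors: Julian Schulz
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: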
Abstract:As AI systems approach dangerous capability levels where inability safety cases become insufficient, we need alternative approaches to ensure safety. This paper presents a roadmap for constructing safety cases based on chain-of-thought (CoT) monitoring in reasoning models and outlines our research agenda. We argue that CoT monitoring might support both control and trustworthiness safety cases. We propose a two-part safety case: (1) establishing that models lack dangerous capabilities when operating without their CoT, and (2) ensuring that any dangerous capabilities enabled by a CoT are detectable by CoT monitoring. We systematically examine two threats to monitorability: neuralese and encoded reasoning, which we categorize into three forms (linguistic drift, steganography, and alien reasoning) and analyze their potential drivers. We evaluate existing and novel techniques for maintaining CoT faithfulness. For cases where models produce non-monitorable reasoning, we explore the possibility of extracting a monitorable CoT from a non-monitorable CoT. To assess the viability of CoT monitoring safety cases, we establish prediction markets to aggregate forecasts on key technical milestones influencing their feasibility.
[AI-25] HybridEP: Scaling Expert Parallelism to Cross-Datacenter Scenario via Hybrid Expert/Data Transmission
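Quick read: This paper addresses the scalability bottleneck of Expert Parallelism (EP) when Mixture-of-Experts (MoE) models are trained across datacenters (cross-DC) with limited bandwidth. Existing EP optimizations rely on overlapping communication with computation, which helps little at low bandwidth because data communication takes far longer than computation. The key to the solution is HybridEP, a framework that dynamically transforms the spatial placement of experts to reduce the volume and frequency of data communication. It builds a stream-based model to determine the optimal transmission ratio and adds two techniques: (1) domain-based partition, which maps hybrid communication patterns onto the GPU-level topology, and (2) parameter-efficient migration, which refines that topology by reducing expert transmission overhead and enlarging the domain size. Under constrained bandwidth HybridEP outperforms existing MoE training systems by up to 5.6x, and it achieves up to 1.45x speedup in large-scale simulations.
Link: https://arxiv.org/abs/2510.19470
Authors: Weihao Yang,Hao Huang,Donglei Wu,Ningke Li,Yanqi Pan,Qiyang Zheng,Wen Xia,Shiyi Li,Qiang Wang
Affiliations: unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: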
Abstract:Mixture-of-Experts (MoE) has become a popular architecture for scaling large models. However, the rapidly growing scale outpaces model training on a single DC, driving a shift toward a more flexible, cross-DC training paradigm. Under this, Expert Parallelism (EP) of MoE faces significant scalability issues due to the limited cross-DC bandwidth. Specifically, existing EP optimizations attempt to overlap data communication and computation, which has little benefit in low-bandwidth scenarios due to a much longer data communication time. Therefore, cross-DC EP scaling is fast becoming a critical roadblock to the continued growth of MoE models. To address this, we propose HybridEP, a modeling-guided framework to optimize EP under constrained bandwidth. Our key idea is to dynamically transform the spatial placement of experts to reduce data communication traffic and frequency, thereby minimizing EP’s communication overheads. However, it is non-trivial to find the optimal solution because it complicates the original communication pattern by mixing data and expert communication. We therefore build a stream-based model to determine the optimal transmission ratio. Guided by this, we incorporate two techniques: (1) domain-based partition to construct the mapping between hybrid patterns and specific communication topology at GPU level, and (2) parameter-efficient migration to further refine this topology by reducing expert transmission overhead and enlarging the domain size. Combining all these designs, HybridEP can be considered as a more general EP with better scalability. Experimental results show that HybridEP outperforms existing state-of-the-art MoE training systems by up to 5.6x under constrained bandwidth. We further compare HybridEP and EP on large-scale simulations. HybridEP achieves up to 1.45x speedup with 1k DCs under different bandwidths.
[AI-26] Universal Quantitative Abstraction: Categorical Duality and Logical Completeness for Probabilistic Systems
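Quick read: This paper addresses quantitative state abstraction for probabilistic systems, that is, how to compress states while keeping behavioral differences measurable, in order to give value-function approximation and representation learning in reinforcement learning a rigorous foundation. The key to the solution is a unified theory of quantitative abstraction centered on a canonical ε-quotient with a universal property: among all ε-abstractions it is the most informative one that respects a prescribed bound on value loss. An adjunction Qε ⊣ Rε between abstraction and realization functors establishes the duality. In a coalgebraic setting, a behavioral pseudometric is characterized as the unique fixed point of a Bellman-style operator with contraction and Lipschitz properties, and a quantitative modal μ-calculus is shown to be expressively complete, so that behavioral distance coincides with maximal logical deviation. An exact validation suite on finite Markov decision processes corroborates contraction, value-loss bounds, stability under perturbation, adversarial distinguishability, and scalability.
Link: https://arxiv.org/abs/2510.19444
Authors: Nivar Anwer(Institute of Artificial Intelligence, De Montfort University, Leicester, United Kingdom)
Affiliations: unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: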
Abstract:A unified theory of quantitative abstraction is presented for probabilistic systems that links category theory, optimal transport, and quantitative modal logic. At its core is a canonical $\varepsilon$-quotient endowed with a universal property: among all $\varepsilon$-abstractions, it is the most informative one that respects a prescribed bound on value loss. This construction induces an adjunction between abstraction and realization functors $(Q_\varepsilon \dashv R_\varepsilon)$, established via the Special Adjoint Functor Theorem, revealing a categorical duality between metric structure and logical semantics. A behavioral pseudometric is characterized as the unique fixed point of a Bellman-style operator, with contraction and Lipschitz properties proved in a coalgebraic setting. A quantitative modal $\mu$-calculus is introduced and shown to be expressively complete for logically representable systems, so that behavioral distance coincides with maximal logical deviation. Compositionality under interface refinement is analyzed, clarifying how abstractions interact across system boundaries. An exact validation suite on finite Markov decision processes corroborates the contraction property, value-loss bounds, stability under perturbation, adversarial distinguishability, and scalability, demonstrating both robustness and computational feasibility. The resulting framework provides principled targets for state aggregation and representation learning, with mathematically precise guarantees for value-function approximation in stochastic domains.
[AI-27] NeSyPr: Neurosymbolic Proceduralization For Efficient Embodied Reasoning NEURIPS2025
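Quick read: This paper addresses the use of language models (LMs) for embodied tasks in dynamic environments, where latency, connectivity, and resource limits constrain online access to large-scale inference engines or symbolic planners. The key to the solution is NeSyPr, a neurosymbolic proceduralization framework: task-specific plans generated by a symbolic tool are transformed into composable procedural representations that encode the plans' implicit production rules, which abstracts multi-step symbolic path-finding and reasoning into single-step LM inference. This enables efficient test-time inference without external symbolic guidance and suits latency-sensitive, resource-constrained physical systems.
Link: https://arxiv.org/abs/2510.19429
Authors: Wonje Choi,Jooyoung Kim,Honguk Woo
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at NeurIPS 2025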
Abstract:We address the challenge of adopting language models (LMs) for embodied tasks in dynamic environments, where online access to large-scale inference engines or symbolic planners is constrained due to latency, connectivity, and resource limitations. To this end, we present NeSyPr, a novel embodied reasoning framework that compiles knowledge via neurosymbolic proceduralization, thereby equipping LM-based agents with structured, adaptive, and timely reasoning capabilities. In NeSyPr, task-specific plans are first explicitly generated by a symbolic tool leveraging its declarative knowledge. These plans are then transformed into composable procedural representations that encode the plans’ implicit production rules, enabling the resulting composed procedures to be seamlessly integrated into the LM’s inference process. This neurosymbolic proceduralization abstracts and generalizes multi-step symbolic structured path-finding and reasoning into single-step LM inference, akin to human knowledge compilation. It supports efficient test-time inference without relying on external symbolic guidance, making it well suited for deployment in latency-sensitive and resource-constrained physical systems. We evaluate NeSyPr on the embodied benchmarks PDDLGym, VirtualHome, and ALFWorld, demonstrating its efficient reasoning capabilities over large-scale reasoning models and a symbolic planner, while using more compact LMs.
[AI-28] Neural Variational Dropout Processes ICLR
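Quick read: This paper addresses how to robustly infer the conditional posterior model in meta-learning, especially in few-shot multi-task learning, where a model must adapt quickly to new tasks and cope with functional uncertainty. The key to the solution is Neural Variational Dropout Processes (NVDPs): the conditional posterior is modeled through task-specific dropout, and a low-rank product of Bernoulli experts meta-model maps a few observed contexts to dropout rates in a memory-efficient way, so that a globally learned, shared neural network can be quickly reconfigured for a new task. NVDPs also use a prior conditioned on the whole task data to optimize the conditional dropout posterior in amortized variational inference, which yields robust estimates of task-specific dropout rates across a wide range of functional ambiguities and uncertainties.
Link: https://arxiv.org/abs/2510.19425
Authors: Insu Jeon,Youngjin Park,Gunhee Kim
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted as a Poster at International Conference on Learning Representations (ICLR) 2022 (Apr 25-29, 2022)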
Abstract:Learning to infer the conditional posterior model is a key step for robust meta-learning. This paper presents a new Bayesian meta-learning approach called Neural Variational Dropout Processes (NVDPs). NVDPs model the conditional posterior distribution based on a task-specific dropout; a low-rank product of Bernoulli experts meta-model is utilized for a memory-efficient mapping of dropout rates from a few observed contexts. It allows for a quick reconfiguration of a globally learned and shared neural network for new tasks in multi-task few-shot learning. In addition, NVDPs utilize a novel prior conditioned on the whole task data to optimize the conditional \textitdropout posterior in the amortized variational inference. Surprisingly, this enables the robust approximation of task-specific dropout rates that can deal with a wide range of functional ambiguities and uncertainties. We compared the proposed method with other meta-learning approaches in the few-shot learning tasks such as 1D stochastic regression, image inpainting, and classification. The results show the excellent performance of NVDPs.
[AI-29] MSC-Bench: A Rigorous Benchmark for Multi-Server Tool Orchestration ACL
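Quick read: This paper addresses limitations of current benchmarks for multi-hop tool orchestration: existing evaluations typically assess individual tools in isolation and ignore realistic challenges such as functional overlap and cross-server orchestration, which leads to overly optimistic assessments of large language model (LLM) agents. The key to the solution is MSC-Bench, a large-scale benchmark built on a hierarchical Model-Context Protocol (MCP) ecosystem. It constructs ground truth through "equal function sets", which allows objective metrics such as F1 score and reduces reliance on LLM-as-a-judge evaluation. Organized as a five-level curriculum, the benchmark systematically tests agent capabilities from single-tool orchestration to complex cross-server planning, as well as robustness to out-of-scope requests. It shows that rigid hierarchies can hinder performance without co-designed strategies and that even state-of-the-art agents have systemic robustness weaknesses, providing a diagnostic framework and directions for improving tool-using agents.
Link: https://arxiv.org/abs/2510.19423
Authors: Jia-Kai Dong, I-Wei Huang, Chun-Tin Wu, Yi-Tien Tsai
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: under ACL Rolling Review 2025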
Abstract:We introduce MSC-Bench, a large-scale benchmark for evaluating multi-hop, end-to-end tool orchestration by LLM agents in a hierarchical Model-Context Protocol (MCP) ecosystem. Existing benchmarks often evaluate tools in isolation, ignoring challenges such as functional overlap and cross-server orchestration, leading to overly optimistic assessments. MSC-Bench addresses these gaps by constructing ground truth through ‘equal function sets’, allowing objective metrics such as F1 score and reducing the dependency on LLM-as-a-judge evaluation. Organized as a five-level curriculum, it systematically tests agent capabilities from single-tool orchestration to complex cross-server planning, and robustness to out-of-scope requests. Experiments reveal that rigid hierarchies can hinder performance without co-designed strategies, and even state-of-the-art agents exhibit systemic weaknesses in robustness. MSC-Bench provides a diagnostic framework to expose these limitations and guide the development of more capable and efficient tool-using agents. The benchmark and resources are publicly available at this https URL.
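As a rough illustration of how "equal function sets" can make an F1 score objective, the sketch below treats each ground-truth step as a set of interchangeable tools and counts a predicted tool as correct if it falls in any slot that has not been matched yet. The tool names, the greedy matching rule, and the helper `f1_with_equal_sets` are assumptions made for illustration; this is not MSC-Bench's actual scoring code.

```python
from typing import List, Set


def f1_with_equal_sets(predicted: List[str], truth: List[Set[str]]) -> float:
    """F1 where each ground-truth slot is a set of functionally equivalent tools.

    Greedy matching: each predicted tool may satisfy at most one unmatched slot.
    """
    unmatched = list(truth)          # slots still waiting for a prediction
    true_positives = 0
    for tool in predicted:
        for slot in unmatched:
            if tool in slot:
                unmatched.remove(slot)   # safe: we leave the inner loop right away
                true_positives += 1
                break
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: two servers expose interchangeable weather tools.
truth = [{"weather_a.get_forecast", "weather_b.forecast"}, {"maps.route"}]
predicted = ["weather_b.forecast", "calendar.add_event"]
score = f1_with_equal_sets(predicted, truth)
print(round(score, 3))
```

Because either weather tool satisfies the first slot, the agent is not penalized for picking a different but equivalent server, which is the kind of judgment that would otherwise need an LLM judge.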
[AI-30] FairNet: Dynamic Fairness Correction without Performance Loss via Contrastive Conditional LoRA
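Quick read: This paper addresses the challenge of ensuring fairness in machine learning models, in particular the problems common to existing debiasing methods: degraded performance, inflexible static correction strategies, underfitting caused by sparse minority-group data, and suboptimal use of sensitive attributes. The key to the solution is the FairNet framework, which integrates a bias detector with a conditional low-rank adaptation (LoRA) mechanism so that dynamic, instance-level fairness correction is applied only to instances detected as biased, preserving performance on unbiased instances while improving results for disadvantaged groups. It also introduces a purpose-built contrastive loss that minimizes intra-class representation disparities across sensitive groups and mitigates underfitting in minority groups. The framework handles scenarios with complete, partial, or entirely absent sensitive attribute labels.
Link: https://arxiv.org/abs/2510.19421
Authors: Songqi Zhou, Zeyuan Liu, Benben Jiang
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: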
Abstract:Ensuring fairness in machine learning models is a critical challenge. Existing debiasing methods often compromise performance, rely on static correction strategies, and struggle with data sparsity, particularly within minority groups. Furthermore, their utilization of sensitive attributes is often suboptimal, either depending excessively on complete attribute labeling or disregarding these attributes entirely. To overcome these limitations, we propose FairNet, a novel framework for dynamic, instance-level fairness correction. FairNet integrates a bias detector with conditional low-rank adaptation (LoRA), which enables selective activation of the fairness correction mechanism exclusively for instances identified as biased, and thereby preserve performance on unbiased instances. A key contribution is a new contrastive loss function for training the LoRA module, specifically designed to minimize intra-class representation disparities across different sensitive groups and effectively address underfitting in minority groups. The FairNet framework can flexibly handle scenarios with complete, partial, or entirely absent sensitive attribute labels. Theoretical analysis confirms that, under moderate TPR/FPR for the bias detector, FairNet can enhance the performance of the worst group without diminishing overall model performance, and potentially yield slight performance improvements. Comprehensive empirical evaluations across diverse vision and language benchmarks validate the effectiveness of FairNet.
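The idea of a detector-gated LoRA correction can be sketched in a few lines. The version below is a minimal, dependency-free illustration under stated assumptions: a single linear layer, a hard threshold on a hypothetical `bias_score` function, and made-up weights. FairNet's real detector, gating rule, and training loss are not specified in the abstract and may differ.

```python
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def conditional_lora_forward(
    x: Vector,
    base_w: Matrix,                      # frozen base weights, shape (out, in)
    lora_a: Matrix,                      # down-projection, shape (rank, in)
    lora_b: Matrix,                      # up-projection, shape (out, rank)
    bias_score: Callable[[Vector], float],
    threshold: float = 0.5,              # assumed hard gate, for illustration only
) -> Vector:
    """Apply the low-rank correction only when the detector flags the input."""
    base_out = matvec(base_w, x)
    if bias_score(x) < threshold:
        return base_out                  # unflagged instance: base model untouched
    correction = matvec(lora_b, matvec(lora_a, x))
    return [b + c for b, c in zip(base_out, correction)]


# Toy example with a hypothetical detector that flags a large first feature.
detector = lambda v: 1.0 if v[0] > 1.0 else 0.0
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]                         # rank 1
B = [[0.5], [-0.5]]
unflagged = conditional_lora_forward([0.5, 2.0], W, A, B, detector)
flagged = conditional_lora_forward([2.0, 1.0], W, A, B, detector)
```

The gate is what keeps the base model's output bit-for-bit identical on unflagged inputs, which matches the abstract's claim that performance on unbiased instances is preserved.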
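[AI-31] Monitoring LLM-based Multi-Agent Systems Against Corruptions via Node Evaluation
Quick read: This paper addresses trustworthiness problems in multi-agent systems (MAS), whose complex communication processes make them susceptible to corruption attacks. Existing defenses are mostly based on static graph structures: they either detect attacks in a fixed topology or optimize a static network for defensive capability, and they struggle with evolving and diverse attack patterns. The key to the solution is a dynamic defense paradigm that continuously monitors communication in the MAS graph, adjusts the graph topology in real time, and precisely cuts off malicious communication paths, thereby defending against changing dynamic attacks. Experiments show that the method significantly outperforms existing defense mechanisms in increasingly complex and dynamic MAS environments, providing an effective guardrail for trustworthy MAS applications.
Link: https://arxiv.org/abs/2510.19420
Authors: Chengcan Wu, Zhixin Zhang, Mingqian Xu, Zeming Wei, Meng Sun
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Optimization and Control (math.OC)
Comments: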
Abstract:Large Language Model (LLM)-based Multi-Agent Systems (MAS) have become a popular paradigm of AI applications. However, trustworthiness issues in MAS remain a critical concern. Unlike challenges in single-agent systems, MAS involve more complex communication processes, making them susceptible to corruption attacks. To mitigate this issue, several defense mechanisms have been developed based on the graph representation of MAS, where agents represent nodes and communications form edges. Nevertheless, these methods predominantly focus on static graph defense, attempting to either detect attacks in a fixed graph structure or optimize a static topology with certain defensive capabilities. To address this limitation, we propose a dynamic defense paradigm for MAS graph structures, which continuously monitors communication within the MAS graph, then dynamically adjusts the graph topology, accurately disrupts malicious communications, and effectively defends against evolving and diverse dynamic attacks. Experimental results in increasingly complex and dynamic MAS environments demonstrate that our method significantly outperforms existing MAS defense mechanisms, contributing an effective guardrail for their trustworthy applications. Our code is available at this https URL.
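A monitor-then-rewire loop of the kind described above might look roughly like the following sketch. It assumes agents are nodes, messages are directed edges, and some per-round evaluator has already produced a trust score for each node; the scoring itself, the threshold, and the function name are hypothetical and are not taken from the paper.

```python
from typing import Dict, Set


def prune_suspicious_edges(
    graph: Dict[str, Set[str]],      # sender -> set of receivers
    node_scores: Dict[str, float],   # higher means more trustworthy (assumed scale)
    threshold: float = 0.5,
) -> Dict[str, Set[str]]:
    """Return a new topology in which low-scoring agents can no longer send messages."""
    return {
        sender: (set() if node_scores.get(sender, 1.0) < threshold else set(receivers))
        for sender, receivers in graph.items()
    }


# Hypothetical three-agent system; the scores would come from a per-round monitor.
graph = {"planner": {"coder", "reviewer"}, "coder": {"reviewer"}, "reviewer": {"planner"}}
round_scores = {"planner": 0.9, "coder": 0.2, "reviewer": 0.8}
graph = prune_suspicious_edges(graph, round_scores)
```

Re-running the function every round with fresh scores is what makes the defense dynamic: an agent that recovers could have its edges restored from the original topology, while a newly corrupted one is cut off.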
[AI-32] A New Type of Adversarial Examples
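Quick read: This paper studies the vulnerability of machine learning models to adversarial examples. Traditional adversarial examples use small perturbations to induce misclassification; the core idea here is the reverse construction: generating examples that differ significantly from the original yet still yield the same prediction. The key to the solution is a set of four new algorithms: the negative iterative fast gradient sign method (NI-FGSM), the negative iterative fast gradient method (NI-FGM), and their momentum variants (NMI-FGSM and NMI-FGM). These methods generate adversarial examples by optimizing in the reverse direction and show that adversarial examples are not confined to the neighbourhood of the data but are distributed extensively across the sample space, which broadens the scope of adversarial-attack research and offers a new perspective for evaluating model security.
Link: https://arxiv.org/abs/2510.19347
Authors: Xingyang Nie, Guojie Xiao, Su Pan, Biao Wang, Huilin Ge, Tao Fang
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: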
Abstract:Most machine learning models are vulnerable to adversarial examples, which poses security concerns on these models. Adversarial examples are crafted by applying subtle but intentionally worst-case modifications to examples from the dataset, leading the model to output a different answer from the original example. In this paper, adversarial examples are formed in an exactly opposite manner, which are significantly different from the original examples but result in the same answer. We propose a novel set of algorithms to produce such adversarial examples, including the negative iterative fast gradient sign method (NI-FGSM) and the negative iterative fast gradient method (NI-FGM), along with their momentum variants: the negative momentum iterative fast gradient sign method (NMI-FGSM) and the negative momentum iterative fast gradient method (NMI-FGM). Adversarial examples constructed by these methods could be used to perform an attack on machine learning systems in certain occasions. Moreover, our results show that the adversarial examples are not merely distributed in the neighbourhood of the examples from the dataset; instead, they are distributed extensively in the sample space.
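The abstract does not give the update rule, so the sketch below shows only one plausible reading of a "negative" iterative FGSM: step against the sign of the loss gradient, so the loss on the current label keeps falling while the input drifts far from the original. The linear logistic classifier, the step size, and the absence of any clipping are all assumptions for illustration.

```python
import math
from typing import List


def sign(v: float) -> int:
    return (v > 0) - (v < 0)


def loss_grad_wrt_input(x: List[float], w: List[float], y: int) -> List[float]:
    """Gradient of log(1 + exp(-y * w.x)) with respect to x, for y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    coeff = -y / (1.0 + math.exp(margin))
    return [coeff * wi for wi in w]


def negative_iterative_fgsm(x, w, y, step=0.5, iters=10):
    """Repeatedly step AGAINST the gradient sign, lowering the loss on label y."""
    x_adv = list(x)
    for _ in range(iters):
        grad = loss_grad_wrt_input(x_adv, w, y)
        x_adv = [xi - step * sign(g) for xi, g in zip(x_adv, grad)]
    return x_adv


def predict(x, w):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1


w = [1.0, -2.0]            # hypothetical linear classifier
x0 = [0.3, 0.1]            # original example
x_adv = negative_iterative_fgsm(x0, w, y=predict(x0, w))
```

In this toy setting every step increases the classifier's margin, so the prediction cannot flip even though the input ends up several units away from where it started.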
[AI-33] Foundation Model Forecasts: Form and Function
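Quick read: This paper addresses the problem that time-series foundation models (TSFMs) cannot support diverse operational tasks in practice because of constraints imposed by the form of their forecasts. Although current TSFMs achieve strong forecast accuracy, the form of the output (point, quantile, parametric, or trajectory ensemble) directly determines which tasks it can support. The key contributions are twofold. First, the paper identifies the minimal sufficient forecast type for each operational task and establishes when forecast types can be converted: trajectory ensembles convert to simpler forms via marginalization without additional assumptions, whereas the reverse requires imposing temporal dependence through copulas or conformal methods. Second, it proves that marginals cannot determine path-dependent event probabilities, since infinitely many joint distributions share identical marginals yet give different answers to operational questions. Forecast form, not accuracy alone, is therefore what can differentiate practical value.
Link: https://arxiv.org/abs/2510.19345
Authors: Alvaro Perez-Diaz, James C. Loach, Danielle E. Toutoungi, Lee Middleton
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 28 pages, 3 figures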
Abstract:Time-series foundation models (TSFMs) achieve strong forecast accuracy, yet accuracy alone does not determine practical value. The form of a forecast – point, quantile, parametric, or trajectory ensemble – fundamentally constrains which operational tasks it can support. We survey recent TSFMs and find that two-thirds produce only point or parametric forecasts, while many operational tasks require trajectory ensembles that preserve temporal dependence. We establish when forecast types can be converted and when they cannot: trajectory ensembles convert to simpler forms via marginalization without additional assumptions, but the reverse requires imposing temporal dependence through copulas or conformal methods. We prove that marginals cannot determine path-dependent event probabilities – infinitely many joint distributions share identical marginals but yield different answers to operational questions. We map six fundamental forecasting tasks to minimal sufficient forecast types and provide a task-aligned evaluation framework. Our analysis clarifies when forecast type, not accuracy, differentiates practical utility.
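The claim that marginals cannot pin down path-dependent probabilities is easy to check on a toy case. The sketch below builds two joint distributions over a two-step "high/low demand" forecast that share the same per-step marginals but disagree on the probability that demand is high at every step. The two-state setup and the event chosen are illustrative assumptions, not examples from the paper.

```python
from itertools import product

# Two-step forecast; each step is "high" (1) or "low" (0) demand.
# Both joint distributions below have P(step = 1) = 0.5 at every step.
independent = {path: 0.25 for path in product([0, 1], repeat=2)}
comonotone = {(0, 0): 0.5, (1, 1): 0.5, (0, 1): 0.0, (1, 0): 0.0}


def marginal(joint, step):
    """P(high demand at the given step)."""
    return sum(p for path, p in joint.items() if path[step] == 1)


def prob_all_high(joint):
    """Path-dependent event: demand is high at every step."""
    return sum(p for path, p in joint.items() if all(path))


for name, joint in [("independent", independent), ("comonotone", comonotone)]:
    print(name, marginal(joint, 0), marginal(joint, 1), prob_all_high(joint))
```

A quantile or parametric forecast would report the same 0.5 per step in both cases, yet the capacity planner's question ("how likely are two high days in a row?") has answers 0.25 and 0.5 respectively; only a trajectory ensemble carries that information.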
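[AI-34] To Use or to Refuse? Re-Centering Student Agency with Generative AI in Engineering Design Education
Quick read: This paper asks how engineering and architecture design education can guide students away from relying on generative AI purely as an automation tool and toward more innovative use, cultivating their ability to reflect on AI as a collaborative partner or on deliberate non-use, rather than stopping at surface-level prompt crafting. The key to the solution is the "tool-teammate-neither" triage: structured reflection tasks prompt students to identify the role AI plays in the design process and, on that basis, to form assessable design habits. Combined with a coordinated bundle of tool access, reflection training, role tagging, and competition awards, the approach allows AI-assisted innovation to scale in large courses while maintaining accountability and ethical awareness.
Link: https://arxiv.org/abs/2510.19342
Authors: Thijs Willems, Sumbul Khan, Qian Huang, Bradley Camburn, Nachamma Sockalingam, King Wang Poon
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: to be published in IEEE TALE 2025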
Abstract: This pilot study traces students’ reflections on the use of AI in a 13-week foundational design course enrolling over 500 first-year engineering and architecture students at the Singapore University of Technology and Design. The course was an AI-enhanced design course, with several interventions to equip students with AI-based design skills. Students were required to reflect on whether the technology was used as a tool (instrumental assistant), a teammate (collaborative partner), or neither (deliberate non-use). By foregrounding this three-way lens, students learned to use AI for innovation rather than just automation and to reflect on agency, ethics, and context rather than on prompt crafting alone. Evidence stems from coursework artefacts: thirteen structured reflection spreadsheets and eight illustrated briefs submitted, combined with notes of teachers and researchers. Qualitative coding of these materials reveals shared practices brought about through the inclusion of Gen-AI, including accelerated prototyping, rapid skill acquisition, iterative prompt refinement, purposeful “switch-offs” during user research, and emergent routines for recognizing hallucinations. Unexpectedly, students not only harnessed Gen-AI for speed but (enabled by the tool-teammate-neither triage) also learned to reject its outputs, invent their own hallucination fire-drills, and divert the reclaimed hours into deeper user research, thereby transforming efficiency into innovation. The implications of the approach we explore show that we can transform AI uptake into an assessable design habit; that rewarding selective non-use cultivates hallucination-aware workflows; and, practically, that a coordinated bundle of tool access, reflection, role tagging, and public recognition through competition awards allows AI-based innovation in education to scale without compromising accountability.
[AI-35] SORA-ATMAS: Adaptive Trust Management and Multi-LLM Aligned Governance for Future Smart Cities
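Quick read: This paper addresses the governance, risk, and compliance (GRC) challenges of deploying agentic AI across heterogeneous multi-agent smart-city systems (for example traffic, weather, and safety), including unclear accountability, data privacy, and inconsistent regulatory alignment. The key to the solution is the SORA-ATMAS framework, a rule-driven, context-aware governance mechanism. By introducing cross-domain policy constraints, a fallback mechanism for high-risk scenarios, and policy-alignment control over the outputs of multiple LLMs, it makes distributed agent decisions verifiable, real-time, and compliant. In the reported evaluation, the framework reduced MAE by 35% on average across agents and, in a 3-agent deployment, kept execution times below 72 ms and governance delays under 100 ms at a throughput of 13.8-17.2 requests per second, offering a foundation for trustworthy autonomous operation of complex smart-city systems.
Link: https://arxiv.org/abs/2510.19327
Authors: Usama Antuley, Shahbaz Siddiqui, Sufian Hameed, Waqas Arif, Subhan Shah, Syed Attique Shah
Affiliations: unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: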
Abstract: The rapid evolution of smart cities has increased the reliance on intelligent interconnected services to optimize infrastructure, resources, and citizen well-being. Agentic AI has emerged as a key enabler by supporting autonomous decision-making and adaptive coordination, allowing urban systems to respond in real time to dynamic conditions. Its benefits are evident in areas such as transportation, where the integration of traffic data, weather forecasts, and safety sensors enables dynamic rerouting and a faster response to hazards. However, its deployment across heterogeneous smart city ecosystems raises critical governance, risk, and compliance (GRC) challenges, including accountability, data privacy, and regulatory alignment within decentralized infrastructures. Evaluation of SORA-ATMAS with three domain agents (Weather, Traffic, and Safety) demonstrated that its governance policies, including a fallback mechanism for high-risk scenarios, effectively steer multiple LLMs (GPT, Grok, DeepSeek) towards domain-optimized, policy-aligned outputs, producing an average MAE reduction of 35% across agents. Results showed stable weather monitoring, effective handling of high-risk traffic plateaus 0.85, and adaptive trust regulation in Safety/Fire scenarios 0.65. Runtime profiling of a 3-agent deployment confirmed scalability, with throughput between 13.8-17.2 requests per second, execution times below 72 ms, and governance delays under 100 ms; analytical projections suggest maintained performance at larger scales. Cross-domain rules ensured safe interoperability, with traffic rerouting permitted only under validated weather conditions. These findings validate SORA-ATMAS as a regulation-aligned, context-aware, and verifiable governance framework that consolidates distributed agent outputs into accountable, real-time decisions, offering a resilient foundation for smart-city management.
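The cross-domain rule quoted above (rerouting only under validated weather conditions) and the high-risk fallback can be pictured as a small policy gate. Everything in the sketch below is hypothetical: the `AgentReport` fields, the three decision labels, and the reuse of 0.85 as a risk threshold are assumptions for illustration, not SORA-ATMAS's actual policy format.

```python
from dataclasses import dataclass


@dataclass
class AgentReport:
    domain: str        # e.g. "weather" or "traffic"
    risk: float        # normalized risk estimate in [0, 1] (assumed scale)
    validated: bool    # whether the report passed the agent's own checks


def allow_rerouting(weather: AgentReport, traffic: AgentReport,
                    high_risk: float = 0.85) -> str:
    """Cross-domain rule: rerouting needs validated weather; high risk triggers fallback."""
    if not weather.validated:
        return "deny"          # no rerouting on unvalidated weather data
    if traffic.risk >= high_risk:
        return "fallback"      # hand over to a conservative default policy
    return "allow"


weather = AgentReport("weather", risk=0.2, validated=True)
traffic = AgentReport("traffic", risk=0.9, validated=True)
decision = allow_rerouting(weather, traffic)
```

Keeping the rule as plain, inspectable code (rather than inside an LLM prompt) is one way to get the verifiability the abstract emphasizes: the same inputs always yield the same, auditable decision.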
[AI-36] Enabling Reconfiguration-Communication Overlap for Collective Communication in Optical Networks
Link: https://arxiv.org/abs/2510.19322
Authors: Changbo Wu, Zhuolong Yu, Gongming Zhao, Hongli Xu
Affiliations: unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:
[AI-37] Continual Knowledge Adaptation for Reinforcement Learning NEURIPS2025
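Quick read: This paper addresses catastrophic forgetting and inefficient knowledge utilization, which are common in continual reinforcement learning. Existing methods struggle to accumulate and transfer past experience in non-stationary environments, which degrades performance. The key to the solution is Continual Knowledge Adaptation for Reinforcement Learning (CKA-RL): it maintains a task-specific knowledge vector pool and dynamically draws on historical knowledge to adapt to new tasks, mitigating forgetting and improving knowledge transfer across tasks. An Adaptive Knowledge Merging mechanism additionally combines similar knowledge vectors, reducing memory requirements while retaining essential knowledge, which improves scalability and performance.
Link: https://arxiv.org/abs/2510.19314
Authors: Jinwu Hu, Zihao Lian, Zhiquan Wen, Chenghao Li, Guohao Chen, Xutao Wen, Bin Xiao, Mingkui Tan
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: NeurIPS 2025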
Abstract:Reinforcement Learning enables agents to learn optimal behaviors through interactions with environments. However, real-world environments are typically non-stationary, requiring agents to continuously adapt to new tasks and changing conditions. Although Continual Reinforcement Learning facilitates learning across multiple tasks, existing methods often suffer from catastrophic forgetting and inefficient knowledge utilization. To address these challenges, we propose Continual Knowledge Adaptation for Reinforcement Learning (CKA-RL), which enables the accumulation and effective utilization of historical knowledge. Specifically, we introduce a Continual Knowledge Adaptation strategy, which involves maintaining a task-specific knowledge vector pool and dynamically using historical knowledge to adapt the agent to new tasks. This process mitigates catastrophic forgetting and enables efficient knowledge transfer across tasks by preserving and adapting critical model parameters. Additionally, we propose an Adaptive Knowledge Merging mechanism that combines similar knowledge vectors to address scalability challenges, reducing memory requirements while ensuring the retention of essential knowledge. Experiments on three benchmarks demonstrate that the proposed CKA-RL outperforms state-of-the-art methods, achieving an improvement of 4.20% in overall performance and 8.02% in forward transfer. The source code is available at this https URL.
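To make the merging idea concrete, here is a minimal sketch of a bounded knowledge-vector pool that, once it exceeds a capacity, replaces its most similar pair with their average. The cosine-similarity criterion, the plain averaging, and the capacity parameter are assumptions chosen for illustration; the abstract does not state how CKA-RL measures similarity or merges vectors.

```python
import math
from itertools import combinations
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def merge_most_similar(pool: List[Vector], max_size: int) -> List[Vector]:
    """While the pool is too large, replace the most similar pair by its average."""
    pool = [list(v) for v in pool]
    while len(pool) > max_size:
        i, j = max(combinations(range(len(pool)), 2),
                   key=lambda ij: cosine(pool[ij[0]], pool[ij[1]]))
        merged = [(a + b) / 2 for a, b in zip(pool[i], pool[j])]
        pool = [v for k, v in enumerate(pool) if k not in (i, j)] + [merged]
    return pool


pool = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]   # hypothetical task knowledge vectors
compact = merge_most_similar(pool, max_size=2)
```

Here the two nearly parallel vectors collapse into one while the orthogonal vector is kept intact, which is the memory-versus-retention trade-off the abstract describes.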
[AI-38] Collaborative penetration testing suite for emerging generative AI algorithms
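Quick read: This paper addresses the security of generative AI models against classical and quantum cyberattacks, specifically vulnerabilities such as model inversion, data poisoning, and adversarial inputs, as well as the threat quantum computing poses to existing public-key encryption (such as RSA and ECC). The key to the solution is an integrated "Quantum AI Security Protocol" built from five core components: static and dynamic application security testing (SAST/DAST); interactive application security testing (IAST) integrated with the CI/CD pipeline; blockchain logging on Hyperledger Fabric; lattice-based (RLWE) quantum-resistant cryptography; and AI red-team simulations (adversarial ML and quantum-assisted attacks). The suite provides a unified cross-disciplinary workflow; the reported results are 300+ vulnerabilities identified, a 70% reduction in high-severity issues within two weeks, and quantum-resistant cryptography that maintained integrity in tests together with tamper-proof logging.
Link: https://arxiv.org/abs/2510.19303
Authors: Petar Radanliev
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Comments: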
Abstract: Problem space: AI vulnerabilities and quantum threats. Generative AI vulnerabilities: model inversion, data poisoning, adversarial inputs. Quantum threats: Shor's algorithm breaking RSA and ECC encryption. Challenge: secure generative AI models against classical and quantum cyberattacks. Proposed solution: a collaborative penetration testing suite with five integrated components: (1) DAST/SAST: OWASP ZAP, Burp Suite, SonarQube, Fortify; (2) IAST: Contrast Assess integrated with the CI/CD pipeline; (3) blockchain logging: Hyperledger Fabric for tamper-proof logs; (4) quantum cryptography: lattice-based RLWE protocols; (5) AI red team simulations: adversarial ML and quantum-assisted attacks. Integration layer: unified workflow for AI, cybersecurity, and quantum experts. Key results: 300+ vulnerabilities identified across test environments; 70% reduction in high-severity issues within 2 weeks; 90% resolution efficiency for blockchain-logged vulnerabilities; quantum-resistant cryptography maintained 100% integrity in tests. Outcome: Quantum AI Security Protocol integrating blockchain, quantum cryptography, and AI red teaming.
[AI-39] Learning to Make Friends: Coaching LLM Agents toward Emergent Social Ties
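Quick read: This paper asks whether large language model (LLM) agents can reproduce the complex social dynamics of human online behavior (such as homophily, reciprocity, and social validation) and which memory and learning mechanisms enable such dynamics to emerge. The key to the solution is a multi-agent LLM simulation framework in which agents adapt through in-context learning accelerated by a coaching signal. Behavioral reward functions model human social motivations (social interaction, information seeking, self-presentation, coordination, and emotional support), so that stable interaction patterns and social ties emerge spontaneously and the resulting network structures mirror properties of real online communities.
Link: https://arxiv.org/abs/2510.19299
Authors: Philipp J. Schneider, Lin Tian, Marian-Andrei Rizoiu
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
Comments: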
Abstract:Can large language model (LLM) agents reproduce the complex social dynamics that characterize human online behavior – shaped by homophily, reciprocity, and social validation – and what memory and learning mechanisms enable such dynamics to emerge? We present a multi-agent LLM simulation framework in which agents repeatedly interact, evaluate one another, and adapt their behavior through in-context learning accelerated by a coaching signal. To model human social behavior, we design behavioral reward functions that capture core drivers of online engagement, including social interaction, information seeking, self-presentation, coordination, and emotional support. These rewards align agent objectives with empirically observed user motivations, enabling the study of how network structures and group formations emerge from individual decision-making. Our experiments show that coached LLM agents develop stable interaction patterns and form emergent social ties, yielding network structures that mirror properties of real online communities. By combining behavioral rewards with in-context adaptation, our framework establishes a principled testbed for investigating collective dynamics in LLM populations and reveals how artificial agents may approximate or diverge from human-like social behavior.
[AI-40] Knowledge and Common Knowledge of Strategies
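Quick read: This paper addresses a limitation of existing work on strategic reasoning, which adopts only either an informed or an uninformed semantics. It proposes a model in which knowledge of strategies can be specified at a fine-grained level, distinguishing first-order, higher-order, and common knowledge of strategies. The key is a layered model of strategy knowledge that captures precisely what each player knows about the others' strategies. The effect of higher-order knowledge is illustrated with the game Hanabi, common knowledge of strategies is shown to be necessary for solving the consensus problem, and the decidability of the model checking problem is studied.
Link: https://arxiv.org/abs/2510.19298
Authors: Borja Sierra Miranda, Thomas Studer
Affiliations: unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: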
Abstract:Most existing work on strategic reasoning simply adopts either an informed or an uninformed semantics. We propose a model where knowledge of strategies can be specified on a fine-grained level. In particular, it is possible to distinguish first-order, higher-order, and common knowledge of strategies. We illustrate the effect of higher-order knowledge of strategies by studying the game Hanabi. Further, we show that common knowledge of strategies is necessary to solve the consensus problem. Finally, we study the decidability of the model checking problem.
[AI-41] Social World Model-Augmented Mechanism Design Policy Learning
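Quick read: This paper addresses mechanism design for aligning individual and collective interests in artificial social intelligence, in particular the challenges posed by heterogeneous agents with persistent latent traits (such as skills and preferences) and complex multi-agent dynamics, under the need for high sample efficiency because real-world interactions are costly. The key to the solution is SWM-AP (Social World Model-Augmented Mechanism Design Policy Learning). Its core is a social world model that hierarchically models agent behavior: it infers agents' traits from interaction trajectories and, based on those traits, predicts their responses to deployed mechanisms. The mechanism design policy then collects extensive training trajectories from the social world model while inferring agents' traits online during real interactions, which improves the sample efficiency and performance of policy learning.
Link: https://arxiv.org/abs/2510.19270
Authors: Xiaoyuan Zhang, Yizhe Huang, Chengdong Ma, Zhixun Chen, Long Ma, Yali Du, Song-Chun Zhu, Yaodong Yang, Xue Feng
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: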
Abstract:Designing adaptive mechanisms to align individual and collective interests remains a central challenge in artificial social intelligence. Existing methods often struggle with modeling heterogeneous agents possessing persistent latent traits (e.g., skills, preferences) and dealing with complex multi-agent system dynamics. These challenges are compounded by the critical need for high sample efficiency due to costly real-world interactions. World Models, by learning to predict environmental dynamics, offer a promising pathway to enhance mechanism design in heterogeneous and complex systems. In this paper, we introduce a novel method named SWM-AP (Social World Model-Augmented Mechanism Design Policy Learning), which learns a social world model hierarchically modeling agents’ behavior to enhance mechanism design. Specifically, the social world model infers agents’ traits from their interaction trajectories and learns a trait-based model to predict agents’ responses to the deployed mechanisms. The mechanism design policy collects extensive training trajectories by interacting with the social world model, while concurrently inferring agents’ traits online during real-world interactions to further boost policy learning efficiency. Experiments in diverse settings (tax policy design, team coordination, and facility location) demonstrate that SWM-AP outperforms established model-based and model-free RL baselines in cumulative rewards and sample efficiency.
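[AI-42] LAPRAD: LLM-Assisted PRotocol Attack Discovery
Quick read: This paper addresses the slow, expert-dependent discovery of vulnerabilities in Internet protocols such as DNS, with the goal of improving protocol security. The core difficulty is that traditional methods are slow to identify complex and hidden attack vectors, particularly new distributed denial-of-service (DDoS) attacks. The key to the solution is LAPRAD (LLM-Assisted Protocol Attack Discovery), a semi-automatic methodology that uses a large language model (LLM) to generate attack ideas, automatically constructs attack configurations with the ReACT approach, and then validates the attacks empirically. With this method the authors found three new cache-flushing DDoS attacks on DNS (SigCacheFlush) and verified that they severely degrade major DNS resolver implementations and circumvent existing patches.
Link: https://arxiv.org/abs/2510.19264
Authors: R. Can Aygun (UCLA), Yehuda Afek (Tel-Aviv University), Anat Bremler-Barr (Tel-Aviv University), Leonard Kleinrock (UCLA)
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: IFIP Networking 2025 Proceedings (Accepted on 05.05.2025)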
Abstract:With the goal of improving the security of Internet protocols, we seek faster, semi-automatic methods to discover new vulnerabilities in protocols such as DNS, BGP, and others. To this end, we introduce the LLM-Assisted Protocol Attack Discovery (LAPRAD) methodology, enabling security researchers with some DNS knowledge to efficiently uncover vulnerabilities that would otherwise be hard to detect. LAPRAD follows a three-stage process. In the first, we consult an LLM (GPT-o1) that has been trained on a broad corpus of DNS-related sources and previous DDoS attacks to identify potential exploits. In the second stage, a different LLM automatically constructs the corresponding attack configurations using the ReACT approach implemented via LangChain (DNS zone file generation). Finally, in the third stage, we validate the attack’s functionality and effectiveness. Using LAPRAD, we uncovered three new DDoS attacks on the DNS protocol and rediscovered two recently reported ones that were not included in the LLM’s training data. The first new attack employs a bait-and-switch technique to trick resolvers into caching large, bogus DNSSEC RRSIGs, reducing their serving capacity to as little as 6%. The second exploits large DNSSEC encryption algorithms (RSA-4096) with multiple keys, thereby bypassing a recently implemented default RRSet limit. The third leverages ANY-type responses to produce a similar effect. These variations of a cache-flushing DDoS attack, called SigCacheFlush, circumvent existing patches, severely degrade resolver query capacity, and impact the latest versions of major DNS resolver implementations. 
Cite as: arXiv:2510.19264 [cs.CR] (v1 submitted Wed, 22 Oct 2025 05:47:41 UTC); DOI: https://doi.org/10.48550/arXiv.2510.19264 (arXiv-issued via DataCite, pending registration)
[AI-43] An Argumentative Explanation Framework for Generalized Reason Model with Inconsistent Precedents
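Quick read: This paper asks how to provide argumentative explanations for case-based reasoning under the generalized notion of the reason model when precedents are inconsistent. Traditional approaches assume that the set of precedents is consistent. The paper extends the derivation state argumentation framework (DSA-framework) into an explanation mechanism that accommodates inconsistent precedents; the key is generalizing an argument structure designed for consistent premises to settings that allow contradictory ones, which supports formalizing and explaining complex, inconsistent legal reasoning.
Link: https://arxiv.org/abs/2510.19263
Authors: Wachara Fungwacharakorn, Gauvain Bourgne, Ken Satoh
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, extended version for JURIX 2025 submission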
Abstract:Precedential constraint is one foundation of case-based reasoning in AI and Law. It generally assumes that the underlying set of precedents must be consistent. To relax this assumption, a generalized notion of the reason model has been introduced. While several argumentative explanation approaches exist for reasoning with precedents based on the traditional consistent reason model, there has been no corresponding argumentative explanation method developed for this generalized reasoning framework accommodating inconsistent precedents. To address this question, this paper examines an extension of the derivation state argumentation framework (DSA-framework) to explain the reasoning according to the generalized notion of the reason model.
[AI-44] ChatGPT Unveils Its Limits: Principles of Law Deliver Checkmate
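Quick read: This paper examines the limitations of generative AI on complex tasks in the legal domain, in particular extracting and integrating principles of law (PoLs). Comparing ChatGPT with a regular-expression (Regex) baseline, the study finds that even when the model has the necessary knowledge and competencies, it cannot reason systematically to an exhaustive result, which indicates that current generative AI lacks the ability to integrate multiple competencies into a unified whole. The key takeaway is that genuine intelligence depends not only on acquiring knowledge but on decomposing complex problems and reasoning across competencies, an ability that, at least in understanding and applying legal texts, remains uniquely human.
Link: https://arxiv.org/abs/2510.19261
Authors: Marianna Molinari, Ilaria Angela Amantea, Marinella Quaranta, Guido Governatori
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: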
Abstract: This study examines the performance of ChatGPT with an experiment in the legal domain. We compare the outcome with a baseline using regular expressions (Regex), rather than focusing solely on the assessment against human performance. The study reveals that even if ChatGPT has access to the necessary knowledge and competencies, it is unable to assemble them and reason through them in a way that leads to an exhaustive result. This unveils a major limitation of ChatGPT. Intelligence encompasses the ability to break down complex issues and address them according to multiple required competencies, providing a unified and comprehensive solution. In the legal domain, one of the most crucial tasks is reading legal decisions and extracting key passages condensed from principles of law (PoLs), which are then incorporated into subsequent rulings by judges or defense documents by lawyers. In performing this task, artificial intelligence lacks an all-encompassing understanding and reasoning, which makes it inherently limited. Genuine intelligence remains a uniquely human trait, at least in this particular field.
[AI-45] FnRGNN: Distribution-aware Fairness in Graph Neural Network
Link: https://arxiv.org/abs/2510.19257
Authors: Soyoung Park, Sungsu Lim
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
[AI-46] See Think Act: Online Shopper Behavior Simulation with VLM Agents
Link: https://arxiv.org/abs/2510.19245
Authors: Yimeng Zhang, Jiri Gesi, Ran Xue, Tian Wang, Ziyi Wang, Yuxuan Lu, Sinong Zhan, Huimin Zeng, Qingjun Cui, Yufan Guo, Jing Huang, Mubarak Shah, Dakuo Wang
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments:
[AI-47] SPOT: Scalable Policy Optimization with Trees for Markov Decision Processes
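Quick read: This paper addresses how to efficiently optimize interpretable decision tree policies for Markov Decision Processes (MDPs) in high-stakes decision-making. Existing methods are computationally inefficient and scale poorly on large MDPs. The key to the solution is SPOT, which formulates the optimization of decision tree policies as a mixed-integer linear program (MILP) and applies a reduced-space branch-and-bound approach that decouples the MDP dynamics from the tree-structure constraints, enabling efficient parallel search. This design substantially improves runtime and scalability while ensuring that each iteration yields the optimal decision tree, so the resulting policies stay interpretable and compact and are found much faster.
Link: https://arxiv.org/abs/2510.19241
Authors: Xuyuan Xiong, Pedro Chumpitaz-Flores, Kaixun Hua, Cheng Hua
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: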
Abstract:Interpretable reinforcement learning policies are essential for high-stakes decision-making, yet optimizing decision tree policies in Markov Decision Processes (MDPs) remains challenging. We propose SPOT, a novel method for computing decision tree policies, which formulates the optimization problem as a mixed-integer linear program (MILP). To enhance efficiency, we employ a reduced-space branch-and-bound approach that decouples the MDP dynamics from tree-structure constraints, enabling efficient parallel search. This significantly improves runtime and scalability compared to previous methods. Our approach ensures that each iteration yields the optimal decision tree. Experimental results on standard benchmarks demonstrate that SPOT achieves substantial speedup and scales to larger MDPs with a significantly higher number of states. The resulting decision tree policies are interpretable and compact, maintaining transparency without compromising performance. These results demonstrate that our approach simultaneously achieves interpretability and scalability, delivering high-quality policies an order of magnitude faster than existing approaches.
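[AI-48] WebGraphEval: Multi-Turn Trajectory Evaluation for Web Agents using Graph Representation NEURIPS2025
Quick read: This paper addresses the over-reliance of current web agent evaluation on binary success metrics or a single reference trajectory, which ignores the structural diversity present in benchmark datasets. The key to the solution is the WebGraphEval framework, which abstracts trajectories from multiple agents into a unified weighted action graph in order to capture cross-model behavioral regularities, expose redundant and inefficient paths, and locate critical decision points that outcome-based metrics overlook. Through structural analyses such as reward propagation and success-weighted edge statistics, the framework supports multi-path, cross-agent, and efficiency-aware evaluation, and it is compatible with existing benchmarks such as WebArena without modifying their environments.
Link: https://arxiv.org/abs/2510.19205
Authors: Yaoyao Qian, Yuanli Wang, Jinda Zhang, Yun Zong, Meixu Chen, Hanhan Zhou, Jindan Huang, Yifan Zeng, Xinyu Hu, Chan Hee Song, Danqing Zhang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Multi-Turn Interactions in Large Language Models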
Abstract:Current evaluation of web agents largely reduces to binary success metrics or conformity to a single reference trajectory, ignoring the structural diversity present in benchmark datasets. We present WebGraphEval, a framework that abstracts trajectories from multiple agents into a unified, weighted action graph. This representation is directly compatible with benchmarks such as WebArena, leveraging leaderboard runs and newly collected trajectories without modifying environments. The framework canonically encodes actions, merges recurring behaviors, and applies structural analyses including reward propagation and success-weighted edge statistics. Evaluations across thousands of trajectories from six web agents show that the graph abstraction captures cross-model regularities, highlights redundancy and inefficiency, and identifies critical decision points overlooked by outcome-based metrics. By framing web interaction as graph-structured data, WebGraphEval establishes a general methodology for multi-path, cross-agent, and efficiency-aware evaluation of web agents.
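The idea of merging trajectories into a weighted action graph with success-weighted edge statistics can be illustrated with a minimal sketch. This is not the WebGraphEval implementation: the function name `build_action_graph`, the `<start>`/`<end>` sentinel nodes, the lowercase-and-trim canonicalization, and the toy trajectories are all assumptions made for illustration.

```python
from collections import defaultdict


def build_action_graph(trajectories):
    """Merge (actions, success) trajectories into one weighted action graph.

    Returns a dict mapping each edge (src, dst) to its traversal count and
    the fraction of traversals that came from successful trajectories.
    """
    counts = defaultdict(int)
    successes = defaultdict(int)
    for actions, success in trajectories:
        # Hypothetical canonical form: lowercase, whitespace-trimmed strings.
        nodes = ["<start>"] + [a.strip().lower() for a in actions] + ["<end>"]
        for src, dst in zip(nodes, nodes[1:]):
            counts[(src, dst)] += 1
            successes[(src, dst)] += int(success)
    return {
        edge: {"count": n, "success_rate": successes[edge] / n}
        for edge, n in counts.items()
    }


# Toy trajectories from hypothetical agents: (list of actions, task succeeded?)
trajectories = [
    (["click search", "type query", "click result"], True),
    (["Click Search", "type query", "click ad"], False),
    (["click search", "type query", "click result"], True),
]
graph = build_action_graph(trajectories)
```

In this toy graph the edge from `type query` to `click ad` has a success rate of zero while the edge to `click result` has a success rate of one, which is the kind of decision point a single pass/fail score would not expose.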
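[AI-49] An Active Diffusion Neural Network for Graphs
Quick read: This paper addresses the over-smoothing problem in conventional diffusion-based Graph Neural Networks (GNNs): as the number of diffusion layers grows, node representations converge toward one another, which makes it hard for the model to capture global graph structure while keeping node features distinctive. The key to the solution is an active diffusion mechanism that integrates multiple external information sources to dynamically steer the diffusion process, breaking the limits of passive heat diffusion. In addition, the method computes the closed-form solution of the active diffusion iteration directly, realizing truly infinite diffusion so that nodes retain their own characteristics while efficiently acquiring global structural information.
Link: https://arxiv.org/abs/2510.19202
Authors: Mengying Jiang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: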
Abstract:The analogy to heat diffusion has enhanced our understanding of information flow in graphs and inspired the development of Graph Neural Networks (GNNs). However, most diffusion-based GNNs emulate passive heat diffusion, which still suffers from over-smoothing and limits their ability to capture global graph information. Inspired by the heat death of the universe, which posits that energy distribution becomes uniform over time in a closed system, we recognize that, without external input, node representations in a graph converge to identical feature vectors as diffusion progresses. To address this issue, we propose the Active Diffusion-based Graph Neural Network (ADGNN). ADGNN achieves active diffusion by integrating multiple external information sources that dynamically influence the diffusion process, effectively overcoming the over-smoothing problem. Furthermore, our approach realizes true infinite diffusion by directly calculating the closed-form solution of the active diffusion iterative formula. This allows nodes to preserve their unique characteristics while efficiently gaining comprehensive insights into the graph’s global structure. We evaluate ADGNN against several state-of-the-art GNN models across various graph tasks. The results demonstrate that ADGNN significantly improves both accuracy and efficiency, highlighting its effectiveness in capturing global graph information and maintaining node distinctiveness.
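To see why an external source term allows "infinite diffusion" in closed form, consider a generic source-driven diffusion. This is an illustrative derivation, not ADGNN's actual update rule, which the abstract does not state. The symbols are assumptions: $\hat{A}$ is a symmetrically normalized adjacency matrix with eigenvalues in $[-1, 1]$, $S$ is a matrix of external source features, and $\alpha \in (0, 1]$ is the source strength.

```latex
% Generic diffusion with an external source (illustrative, not ADGNN's exact formula).
\begin{align}
H^{(k+1)} &= (1-\alpha)\,\hat{A}\,H^{(k)} + \alpha\,S, \\
H^{(k)}   &= \big((1-\alpha)\hat{A}\big)^{k} H^{(0)}
             + \alpha \sum_{j=0}^{k-1} \big((1-\alpha)\hat{A}\big)^{j} S, \\
H^{(\infty)} &= \alpha\,\big(I - (1-\alpha)\hat{A}\big)^{-1} S .
\end{align}
```

Because the spectral radius of $(1-\alpha)\hat{A}$ is at most $1-\alpha < 1$, the first term vanishes as $k \to \infty$ and the Neumann series converges, so the limit depends on the source $S$ rather than collapsing. In the passive case without a source, $H^{(k)} = \hat{A}^{k} H^{(0)}$ is driven toward the dominant eigenvectors of $\hat{A}$, which is the over-smoothing behavior described above.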
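[AI-50] Imbalanced Gradients in RL Post-Training of Multi-Task LLMs
Quick read: This paper addresses gradient imbalance caused by mixing multi-task data during reinforcement learning (RL) post-training of large language models (LLMs). The study finds that some tasks produce significantly larger gradients than others, which biases optimization toward those tasks even though they do not necessarily yield larger performance improvements; that is, gradient magnitude is not positively correlated with learning gain. The takeaway is that this imbalance must be identified and corrected at the gradient level rather than handled by dataset mixing alone, and the authors call for principled gradient-level correction methods to make multi-task learning fairer and more effective.
Link: https://arxiv.org/abs/2510.19178
Authors: Runzhe Wu, Ankur Samanta, Ayush Jain, Scott Fujimoto, Jeongyeol Kwon, Ben Kretzu, Youliang Yu, Kaveh Hassani, Boris Vidolov, Yonathan Efroni
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: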
Abstract:Multi-task post-training of large language models (LLMs) is typically performed by mixing datasets from different tasks and optimizing them jointly. This approach implicitly assumes that all tasks contribute gradients of similar magnitudes; when this assumption fails, optimization becomes biased toward large-gradient tasks. In this paper, however, we show that this assumption fails in RL post-training: certain tasks produce significantly larger gradients, thus biasing updates toward those tasks. Such gradient imbalance would be justified only if larger gradients implied larger learning gains on the tasks (i.e., larger performance improvements) – but we find this is not true. Large-gradient tasks can achieve similar or even much lower learning gains than small-gradient ones. Further analyses reveal that these gradient imbalances cannot be explained by typical training statistics such as training rewards or advantages, suggesting that they arise from the inherent differences between tasks. This cautions against naive dataset mixing and calls for future work on principled gradient-level corrections for LLMs.
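The bias described here can be made concrete with a small numeric sketch. It assumes that joint optimization over an evenly mixed batch behaves like averaging per-task gradients; the task names, the two-parameter gradients, and the helper functions are made up for illustration and are not taken from the paper.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a gradient vector."""
    return math.sqrt(sum(x * x for x in vec))


def mixed_gradient(task_grads):
    """Average per-task gradients, as an evenly mixed batch effectively does."""
    n = len(task_grads)
    dim = len(next(iter(task_grads.values())))
    return [sum(g[i] for g in task_grads.values()) / n for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(u, v)) / (l2_norm(u) * l2_norm(v))


# Made-up per-task gradients for a two-parameter model.
task_grads = {"math": [30.0, 0.0], "code": [0.0, 1.0]}

norms = {task: l2_norm(g) for task, g in task_grads.items()}
update = mixed_gradient(task_grads)
alignment = {task: cosine(update, g) for task, g in task_grads.items()}
```

Here the mixed update is almost parallel to the large-gradient task and nearly orthogonal to the small-gradient one, so the small-gradient task barely influences the step regardless of how much it could have improved.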
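[AI-51] InvarGC: Invariant Granger Causality for Heterogeneous Interventional Time Series under Latent Confounding
Quick read: This paper addresses the limitations of Granger causality methods, both the failure of traditional linear tests on non-linear relationships and the difficulty of identifying causal structure accurately when latent confounders are present and interventional targets are unknown. The key to the solution is Invariant Granger Causality (InvarGC), which exploits cross-environment heterogeneity to mitigate latent confounding and distinguishes intervened from non-intervened environments at edge-level granularity, thereby recovering invariant causal relations. The authors also establish identifiability in this setting and validate the method with extensive experiments on synthetic and real-world data.
Link: https://arxiv.org/abs/2510.19138
Authors: Ziyi Zhang, Shaogang Ren, Xiaoning Qian, Nick Duffield
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: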
Abstract:Granger causality is widely used for causal structure discovery in complex systems from multivariate time series data. Traditional Granger causality tests based on linear models often fail to detect even mild non-linear causal relationships. Therefore, numerous recent studies have investigated non-linear Granger causality methods, achieving improved performance. However, these methods often rely on two key assumptions: causal sufficiency and known interventional targets. Causal sufficiency assumes the absence of latent confounders, yet their presence can introduce spurious correlations. Moreover, real-world time series data usually come from heterogeneous environments, without prior knowledge of interventions. Therefore, in practice, it is difficult to distinguish intervened environments from non-intervened ones, and even harder to identify which variables or timesteps are affected. To address these challenges, we propose Invariant Granger Causality (InvarGC), which leverages cross-environment heterogeneity to mitigate the effects of latent confounding and to distinguish intervened from non-intervened environments with edge-level granularity, thereby recovering invariant causal relations. In addition, we establish the identifiability under these conditions. Extensive experiments on both synthetic and real-world datasets demonstrate the competitive performance of our approach compared to state-of-the-art methods.
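[AI-52] A Cross-Environment and Cross-Embodiment Path Planning Framework via a Conditional Diffusion Model
Quick read: This paper addresses the efficiency, safety, and generalization of robot path planning in high-dimensional cluttered environments. Conventional methods are computationally slow and require tedious parameter tuning, while existing learning-based methods still transfer poorly across environments and robot embodiments. The key to the solution is GADGET (Generalizable and Adaptive Diffusion-Guided Environment-aware Trajectory generation), whose core innovation is a hybrid dual-conditioning mechanism: classifier-free guidance over a learned scene encoding provides environment awareness, and classifier-guided Control Barrier Function (CBF) safety shaping integrates real-time collision avoidance directly into the denoising process. This enables zero-shot transfer, without retraining, to new environments and to robot embodiments with different degrees of freedom (DoF), such as Franka Panda, Kinova Gen3, and UR5.
Link: https://arxiv.org/abs/2510.19128
Authors: Mehran Ghafarian Tamizi, Homayoun Honari, Amir Mehdi Soufi Enayati, Aleksey Nozdryn-Plotnicki, Homayoun Najjaran
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 20 pages, 9 figures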
Abstract:Path planning for a robotic system in high-dimensional cluttered environments needs to be efficient, safe, and adaptable for different environments and hardware. Conventional methods face high computation time and require extensive parameter tuning, while prior learning-based methods still fail to generalize effectively. The primary goal of this research is to develop a path planning framework capable of generalizing to unseen environments and new robotic manipulators without the need for retraining. We present GADGET (Generalizable and Adaptive Diffusion-Guided Environment-aware Trajectory generation), a diffusion-based planning model that generates joint-space trajectories conditioned on voxelized scene representations as well as start and goal configurations. A key innovation is GADGET’s hybrid dual-conditioning mechanism that combines classifier-free guidance via learned scene encoding with classifier-guided Control Barrier Function (CBF) safety shaping, integrating environment awareness with real-time collision avoidance directly in the denoising process. This design supports zero-shot transfer to new environments and robotic embodiments without retraining. Experimental results show that GADGET achieves high success rates with low collision intensity in spherical-obstacle, bin-picking, and shelf environments, with CBF guidance further improving safety. Moreover, comparative evaluations indicate strong performance relative to both sampling-based and learning-based baselines. Furthermore, GADGET provides transferability across Franka Panda, Kinova Gen3 (6/7-DoF), and UR5 robots, and physical execution on a Kinova Gen3 demonstrates its ability to generate safe, collision-free trajectories in real-world settings.
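For readers unfamiliar with the two ingredients named in the dual-conditioning mechanism, their standard textbook forms are given below. These are general definitions, not GADGET's specific formulation; how the paper combines them inside the denoising step is not stated in the abstract. The symbols are assumptions: $\epsilon_\theta$ is the noise-prediction network, $c$ the scene encoding, $\varnothing$ the null condition, $w \ge 0$ a guidance weight, $h$ a barrier function whose safe set is $\{x : h(x) \ge 0\}$, and $\gamma > 0$ a gain.

```latex
% Classifier-free guidance (one common convention):
\hat{\epsilon}_\theta(x_t, c)
  = \epsilon_\theta(x_t, \varnothing)
  + w\,\big(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\big)

% Control Barrier Function condition (linear class-K choice):
\dot{h}(x) \;\ge\; -\gamma\, h(x)
```

The first expression pushes samples toward the conditional prediction without a separate classifier, and the second keeps the state inside the safe set by bounding how fast $h$ may decrease as it approaches zero.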
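[AI-53] Steering Autoregressive Music Generation with Recursive Feature Machines
Link: https://arxiv.org/abs/2510.19127
Authors: Daniel Zhao, Daniel Beaglehole, Taylor Berg-Kirkpatrick, Julian McAuley, Zachary Novack
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:
[AI-54] What Makes a Good Curriculum? Disentangling the Effects of Data Ordering on LLM Mathematical Reasoning
Quick read: This paper addresses the lack of a unified evaluation framework and clear guidance for curriculum learning (CL) as a way to improve reasoning in large language models (LLMs). In particular, it remains unclear when CL helps, how forward and reverse curricula compare, and how the answer depends on the difficulty metric. The key to the solution is a unified offline evaluation framework that decomposes curriculum difficulty into five complementary dimensions: Problem Difficulty, Model Surprisal, Confidence Margin, Predictive Uncertainty, and Decision Variability. Controlled post-training experiments on mathematical reasoning benchmarks then compare curriculum strategies systematically, showing that the best curriculum depends on the interaction between model capability and task complexity, and distinguishing task-aligned curricula (which shape final representations and generalization) from inner-state curricula (which modulate confidence and uncertainty).
Link: https://arxiv.org/abs/2510.19099
Authors: Yaning Jia, Chunhui Zhang, Xingjian Diao, Xiangchi Yuan, Zhongyu Ouyang, Soroush Vosoughi
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages (main text) + 4 pages (appendix), 4 figures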
Abstract:Curriculum learning (CL) - ordering training data from easy to hard - has become a popular strategy for improving reasoning in large language models (LLMs). Yet prior work employs disparate difficulty metrics and training setups, leaving open fundamental questions: When does curriculum help? Which direction - forward or reverse - is better? And does the answer depend on what we measure? We address these questions through a unified offline evaluation framework that decomposes curriculum difficulty into five complementary dimensions: Problem Difficulty, Model Surprisal, Confidence Margin, Predictive Uncertainty, and Decision Variability. Through controlled post-training experiments on mathematical reasoning benchmarks with Llama3.1-8B, Mistral-7B, and Gemma3-4B, we find that (i) no curriculum strategy dominates universally - the relative effectiveness of forward versus reverse CL depends jointly on model capability and task complexity; (ii) even within a single metric, samples at different difficulty levels produce distinct gains depending on task demands; and (iii) task-aligned curricula focus on shaping the model’s final representations and generalization, whereas inner-state curricula modulate internal states such as confidence and uncertainty. Our findings challenge the notion of a universal curriculum strategy and offer actionable guidance across model and task regimes, with some metrics indicating that prioritizing decision-uncertain samples can further enhance learning outcomes.
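Forward and reverse curricula differ only in the direction in which samples are sorted by a chosen difficulty metric. The sketch below shows that mechanic under stated assumptions: the function `order_curriculum`, the `surprisal` field, and the sample values are hypothetical, and the paper's actual metrics and training pipeline are not reproduced.

```python
def order_curriculum(samples, score, direction="forward"):
    """Sort training samples by a difficulty score.

    direction="forward" orders easy to hard; "reverse" orders hard to easy.
    """
    if direction not in ("forward", "reverse"):
        raise ValueError("direction must be 'forward' or 'reverse'")
    return sorted(samples, key=score, reverse=(direction == "reverse"))


# Hypothetical samples scored by model surprisal (higher means harder).
samples = [
    {"id": "a", "surprisal": 2.1},
    {"id": "b", "surprisal": 0.4},
    {"id": "c", "surprisal": 1.3},
]


def by_surprisal(sample):
    return sample["surprisal"]


forward = [s["id"] for s in order_curriculum(samples, by_surprisal)]
reverse = [s["id"] for s in order_curriculum(samples, by_surprisal, "reverse")]
```

Swapping `by_surprisal` for a different scoring function (for example, a confidence margin) changes the ordering without touching the rest of the pipeline, which is what makes comparing metrics under one framework straightforward.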
[AI-55] Local Guidance for Configuration-Based Multi-Agent Pathfinding
Link: https://arxiv.org/abs/2510.19072
Authors: Tomoki Arita, Keisuke Okumura
Affiliation: unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 10 pages
[AI-56] The MUSE Benchmark: Probing Music Perception and Auditory Relational Reasoning in Audio LLMs
Link: https://arxiv.org/abs/2510.19055
Authors: Brandon James Carone, Iran R. Roman, Pablo Ripollés
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 2 figures, 2 tables
[AI-57] Rectifying Shortcut Behaviors in Preference-based Reward Learning NEURIPS2025
Quick read: This paper addresses reward hacking in preference-based reward models used in reinforcement learning from human feedback: the models obtain high reward scores by exploiting spurious features in the training data (such as response verbosity, an agreeable tone, or sycophancy) rather than the intended objectives, which leads to poor generalization and behavior that deviates from the intended goal. The key to the solution is PRISM (Preference-based Reward Invariance for Shortcut Mitigation). Inspired by invariance theory from the kernel perspective, PRISM learns group-invariant kernels with feature maps under a closed-form learning objective, which suppresses shortcut behaviors, improves reward model accuracy on out-of-distribution tasks, and reduces the dependence of downstream policy models on shortcut features.
Link: https://arxiv.org/abs/2510.19050
Authors: Wenqian Ye, Guangtao Zheng, Aidong Zhang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: NeurIPS 2025
Abstract:In reinforcement learning from human feedback, preference-based reward models play a central role in aligning large language models to human-aligned behavior. However, recent studies show that these models are prone to reward hacking and often fail to generalize well due to over-optimization. They achieve high reward scores by exploiting shortcuts, that is, exploiting spurious features (e.g., response verbosity, agreeable tone, or sycophancy) that correlate with human preference labels in the training data rather than genuinely reflecting the intended objectives. In this paper, instead of probing these issues one at a time, we take a broader view of the reward hacking problem as shortcut behaviors and introduce a principled yet flexible approach to mitigate shortcut behaviors in preference-based reward learning. Inspired by the invariant theory in the kernel perspective, we propose Preference-based Reward Invariance for Shortcut Mitigation (PRISM), which learns group-invariant kernels with feature maps in a closed-form learning objective. Experimental results in several benchmarks show that our method consistently improves the accuracy of the reward model on diverse out-of-distribution tasks and reduces the dependency on shortcuts in downstream policy models, establishing a robust framework for preference-based alignment.
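[AI-58] REPAIR Approach for Social-based City Reconstruction Planning in case of natural disasters
Quick read: This paper addresses how to maximize social benefit when rebuilding a city after a natural disaster with limited resources. The core challenge is to reconcile the available budget and time with the needs of diverse stakeholders (such as citizens' well-being and political priorities) while respecting the city's structural constraints (such as dependencies among roads and buildings). The key to the solution is REPAIR (post disaster REbuilding plAn ProvIdeR), a generic decision support system based on deep reinforcement learning that produces a set of alternative reconstruction plans from which local administrators choose one to implement. It has been applied to a real use case, the reconstruction of L'Aquila, Italy, after the 2009 earthquake.
Link: https://arxiv.org/abs/2510.19048
Authors: Ghulam Mudassir, Antinisca Di Marco, Giordano d’Aloisio
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Accepted at International Journal of Data Science and Analytics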
Abstract:Natural disasters always have several effects on human lives. It is challenging for governments to tackle these incidents and to rebuild the economic, social and physical infrastructures and facilities with the available resources (mainly budget and time). Governments always define plans and policies according to the law and political strategies that should maximise social benefits. The severity of damage and the vast resources needed to bring life back to normality make such reconstruction a challenge. This article extends our previously published work with a comprehensive comparative analysis that integrates additional deep learning models and a random agent used as a baseline. Our prior research introduced a decision support system that uses Deep Reinforcement Learning to plan post-disaster city reconstruction, maximizing the social benefit of the reconstruction process while considering available resources, meeting the needs of the broad community of stakeholders (such as citizens’ social benefits and politicians’ priorities), and respecting the city’s structural constraints (such as dependencies among roads and buildings). The proposed approach, named post disaster REbuilding plAn ProvIdeR (REPAIR), is generic. It can determine a set of alternative plans for local administrators, who select the ideal one to implement, and it can be applied to areas of any extension. We show the application of REPAIR in a real use case, the reconstruction of L’Aquila, which was damaged in 2009 by a major earthquake.
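[AI-59] “Over-the-Hood” AI Inclusivity Bugs and How 3 AI Product Teams Found and Fixed Them
Quick read: This paper addresses “over-the-hood” inclusivity biases in user-facing AI products, meaning barriers in interaction design that disproportionately exclude users with particular problem-solving styles. Unlike biases in the underlying algorithms or training data, these problems surface in the product interface and interaction logic and make AI features hard for some groups of users to use. The key to the solution is to adopt and adapt GenderMag, an existing inclusive design method that is not AI-specific, yielding a variant called GenderMag-for-AI. By systematically identifying mismatches between users' cognitive styles and AI interaction behavior, three product teams identified 6 types of AI inclusivity bugs arising 83 times and fixed 47 of those instances.
Link: https://arxiv.org/abs/2510.19033
Authors: Andrew Anderson, Fatima A. Moussaoui, Jimena Noa Guevara, Md Montaser Hamid, Margaret Burnett
Affiliation: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: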
Abstract:While much research has shown the presence of AI’s “under-the-hood” biases (e.g., algorithmic, training data, etc.), what about “over-the-hood” inclusivity biases: barriers in user-facing AI products that disproportionately exclude users with certain problem-solving approaches? Recent research has begun to report the existence of such biases – but what do they look like, how prevalent are they, and how can developers find and fix them? To find out, we conducted a field study with 3 AI product teams, to investigate what kinds of AI inclusivity bugs exist uniquely in user-facing AI products, and whether/how AI product teams might harness an existing (non-AI-oriented) inclusive design method to find and fix them. The teams’ work resulted in identifying 6 types of AI inclusivity bugs arising 83 times, fixes covering 47 of these bug instances, and a new variation of the GenderMag inclusive design method, GenderMag-for-AI, that is especially effective at detecting certain kinds of AI inclusivity bugs.
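[AI-60] CLiVR: Conversational Learning System in Virtual Reality with AI-Powered Patients
Quick read: This paper addresses the high resource cost, limited accessibility, and poor scalability of simulation training in traditional medical education, which relies on standardized patients (SP) and high-fidelity manikins. The key to the solution is the CLiVR system, which combines large language models (LLMs), speech processing, and 3D avatars to deliver dynamically generated doctor-patient dialogue in virtual reality (VR), and uses sentiment analysis to give feedback on communication tone. The result is an immersive, scalable simulation platform with built-in instructional feedback.
Link: https://arxiv.org/abs/2510.19031
Authors: Akilan Amithasagaran, Sagnik Dakshit, Bhavani Suryadevara, Lindsey Stockton
Affiliation: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: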
Abstract:Simulations constitute a fundamental component of medical and nursing education and traditionally employ standardized patients (SP) and high-fidelity manikins to develop clinical reasoning and communication skills. However, these methods require substantial resources, limiting accessibility and scalability. In this study, we introduce CLiVR, a Conversational Learning system in Virtual Reality that integrates large language models (LLMs), speech processing, and 3D avatars to simulate realistic doctor-patient interactions. Developed in Unity and deployed on the Meta Quest 3 platform, CLiVR enables trainees to engage in natural dialogue with virtual patients. Each simulation is dynamically generated from a syndrome-symptom database and enhanced with sentiment analysis to provide feedback on communication tone. Through an expert user study involving medical school faculty (n=13), we assessed usability, realism, and perceived educational impact. Results demonstrated strong user acceptance, high confidence in educational potential, and valuable feedback for improvement. CLiVR offers a scalable, immersive supplement to SP-based training.
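[AI-61] FlexiDataGen: An Adaptive LLM Framework for Dynamic Semantic Dataset Generation in Sensitive Domains
Quick read: This paper addresses the dataset challenge in machine learning: data that are scarce, expensive to acquire, or restricted by privacy regulations. The problem is most acute in high-stakes fields such as healthcare, biomedical research, and cybersecurity, where model training is often limited by the lack of high-quality annotated data. The key to the solution is FlexiDataGen, an adaptive large language model (LLM) framework that generates semantic datasets dynamically through four core components: (1) syntactic-semantic analysis, (2) retrieval-augmented generation, (3) dynamic element injection, and (4) iterative paraphrasing with semantic validation. Together these let it autonomously synthesize semantically coherent, linguistically diverse, domain-relevant data, easing data shortages and annotation bottlenecks and supporting scalable, accurate model development.
Link: https://arxiv.org/abs/2510.19025
Authors: Hamed Jelodar, Samita Bai, Roozbeh Razavi-Far, Ali A. Ghorbani
Affiliation: unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: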
Abstract:Dataset availability and quality remain critical challenges in machine learning, especially in domains where data are scarce, expensive to acquire, or constrained by privacy regulations. Fields such as healthcare, biomedical research, and cybersecurity frequently encounter high data acquisition costs, limited access to annotated data, and the rarity or sensitivity of key events. These issues, collectively referred to as the dataset challenge, hinder the development of accurate and generalizable machine learning models in such high-stakes domains. To address this, we introduce FlexiDataGen, an adaptive large language model (LLM) framework designed for dynamic semantic dataset generation in sensitive domains. FlexiDataGen autonomously synthesizes rich, semantically coherent, and linguistically diverse datasets tailored to specialized fields. The framework integrates four core components: (1) syntactic-semantic analysis, (2) retrieval-augmented generation, (3) dynamic element injection, and (4) iterative paraphrasing with semantic validation. Together, these components ensure the generation of high-quality, domain-relevant data. Experimental results show that FlexiDataGen effectively alleviates data shortages and annotation bottlenecks, enabling scalable and accurate machine learning model development.
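[AI-62] Prior-informed optimization of treatment recommendation via bandit algorithms trained on large language model-processed historical records
Quick read: This paper addresses the suboptimal health outcomes that result when medical practice relies on standardized treatment frameworks and empirical methods that neglect individual patient variation. The key to the solution is a system that integrates Large Language Models (LLMs), Conditional Tabular Generative Adversarial Networks (CTGAN), T-learner counterfactual models, and contextual bandit strategies. LLMs convert unstructured medical narratives into structured data (93.2% accuracy), CTGAN generates realistic synthetic patient data (55% accuracy under two-sample verification), and T-learners predict patient-specific treatment responses (84.3% accuracy). A prior-informed contextual bandit algorithm then balances exploration of new treatments against exploitation of existing knowledge, which overcomes the cold-start problem in online learning, improves computational efficiency, and advances personalized medicine.
Link: https://arxiv.org/abs/2510.19014
Authors: Saman Nessari, Ali Bozorgi-Amiri
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: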
Abstract:Current medical practice depends on standardized treatment frameworks and empirical methodologies that neglect individual patient variations, leading to suboptimal health outcomes. We develop a comprehensive system integrating Large Language Models (LLMs), Conditional Tabular Generative Adversarial Networks (CTGAN), T-learner counterfactual models, and contextual bandit approaches to provide customized, data-informed clinical recommendations. The approach utilizes LLMs to process unstructured medical narratives into structured datasets (93.2% accuracy), uses CTGANs to produce realistic synthetic patient data (55% accuracy via two-sample verification), deploys T-learners to forecast patient-specific treatment responses (84.3% accuracy), and integrates prior-informed contextual bandits to enhance online therapeutic selection by effectively balancing exploration of new possibilities with exploitation of existing knowledge. Testing on stage III colon cancer datasets revealed that our KernelUCB approach obtained 0.60-0.61 average reward scores across 5,000 rounds, exceeding other reference methods. This comprehensive system overcomes cold-start limitations in online learning environments, improves computational effectiveness, and constitutes notable progress toward individualized medicine adapted to specific patient characteristics.
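How a prior can soften the cold-start problem in a bandit is easiest to see in a stripped-down setting. The sketch below swaps in a plain, non-contextual UCB rule warm-started with pseudo-observations; it is not the paper's KernelUCB or its contextual model. The class `PriorUCB`, its parameters `prior_weight` and `c`, and the reward values are all hypothetical.

```python
import math


class PriorUCB:
    """Non-contextual UCB whose arms start from prior pseudo-observations."""

    def __init__(self, prior_means, prior_weight=5.0, c=1.0):
        # Each arm behaves as if it had already been pulled `prior_weight`
        # times with an average reward equal to its prior mean.
        self.counts = [prior_weight] * len(prior_means)
        self.sums = [m * prior_weight for m in prior_means]
        self.c = c

    def select(self):
        """Return the arm with the highest upper confidence bound."""
        total = sum(self.counts)

        def ucb(arm):
            mean = self.sums[arm] / self.counts[arm]
            # log(1 + total) keeps the bonus well defined for small totals.
            bonus = self.c * math.sqrt(math.log(1 + total) / self.counts[arm])
            return mean + bonus

        return max(range(len(self.counts)), key=ucb)

    def update(self, arm, reward):
        """Record an observed reward for the chosen arm."""
        self.counts[arm] += 1
        self.sums[arm] += reward


# Hypothetical prior estimates for two treatments (e.g. from offline models).
bandit = PriorUCB(prior_means=[0.3, 0.6])
first_choice = bandit.select()

# If the favored arm keeps returning zero reward, exploration takes over.
for _ in range(20):
    bandit.update(1, 0.0)
later_choice = bandit.select()
```

The first selection follows the prior instead of being arbitrary, and the confidence bonus still lets observed outcomes override a prior that turns out to be wrong.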
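[AI-63] Plural Voices, Single Agent: Towards Inclusive AI in Multi-User Domestic Spaces
Quick read: This paper addresses the ethical, autonomy, and inclusion challenges facing domestic AI agents, in particular the insufficient support for overlooked groups such as children, older adults, and neurodivergent users. The core of the solution is the Plural Voices Model (PVM), a single-agent framework that negotiates the needs of multiple users dynamically through real-time value alignment and draws on diverse public datasets covering mental health, eldercare, education, and moral reasoning. Its key innovation is a human+synthetic curriculum design with fairness-aware scenarios and ethical enhancements, which identifies core values, conflicts, and accessibility requirements to inform inclusive principles. Combined with adaptive safety scaffolds, tailored interaction strategies (for example, step-by-step guidance for neurodivergent users and simple wording for children), and equitable conflict resolution, this aims at safer, fairer, and more user-centered deployment of domestic AI systems.
Link: https://arxiv.org/abs/2510.19008
Authors: Joydeep Chandra, Satyam Kumar Navneet
Affiliation: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: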
Abstract:Domestic AI agents face ethical, autonomy, and inclusion challenges, particularly for overlooked groups like children, the elderly, and neurodivergent users. We present the Plural Voices Model (PVM), a novel single-agent framework that dynamically negotiates multi-user needs through real-time value alignment, leveraging diverse public datasets on mental health, eldercare, education, and moral reasoning. Using human+synthetic curriculum design with fairness-aware scenarios and ethical enhancements, PVM identifies core values, conflicts, and accessibility requirements to inform inclusive principles. Our privacy-focused prototype features adaptive safety scaffolds, tailored interactions (e.g., step-by-step guidance for neurodivergent users, simple wording for children), and equitable conflict resolution. In preliminary evaluations, PVM outperforms multi-agent baselines in compliance (76% vs. 70%), fairness (90% vs. 85%), safety-violation rate (0% vs. 7%), and latency. Design innovations, including video guidance, autonomy sliders, family hubs, and adaptive safety dashboards, demonstrate new directions for ethical and inclusive domestic AI and for building user-centered agentic systems in plural domestic contexts. Our code and model have been open-sourced and are available for reproduction: this https URL
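[AI-64] Timely Clinical Diagnosis through Active Test Selection
Quick read: This paper addresses a limitation of current machine learning (ML) approaches to clinical diagnosis: most rely on static, fully observed datasets and cannot reproduce the sequential, resource-aware reasoning that clinicians use in practice, so diagnosis remains inefficient and error-prone, especially in high-pressure or resource-limited settings. The key to the solution is ACTMED (Adaptive Clinical Test selection via Model-based Experimental Design), a framework that combines Bayesian Experimental Design (BED) with large language models (LLMs) and, at each step, selects the test expected to reduce diagnostic uncertainty the most. The LLMs act as flexible simulators that generate plausible patient state distributions and support belief updates without structured, task-specific training data, while clinicians can stay involved throughout. The aim is to improve diagnostic accuracy, interpretability, and resource use together.
Link: https://arxiv.org/abs/2510.18988
Authors: Silas Ruhrberg Estévez, Nicolás Astorga, Mihaela van der Schaar
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: None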
Abstract:There is growing interest in using machine learning (ML) to support clinical diagnosis, but most approaches rely on static, fully observed datasets and fail to reflect the sequential, resource-aware reasoning clinicians use in practice. Diagnosis remains complex and error prone, especially in high-pressure or resource-limited settings, underscoring the need for frameworks that help clinicians make timely and cost-effective decisions. We propose ACTMED (Adaptive Clinical Test selection via Model-based Experimental Design), a diagnostic framework that integrates Bayesian Experimental Design (BED) with large language models (LLMs) to better emulate real-world diagnostic reasoning. At each step, ACTMED selects the test expected to yield the greatest reduction in diagnostic uncertainty for a given patient. LLMs act as flexible simulators, generating plausible patient state distributions and supporting belief updates without requiring structured, task-specific training data. Clinicians can remain in the loop, reviewing test suggestions, interpreting intermediate outputs, and applying clinical judgment throughout. We evaluate ACTMED on real-world datasets and show it can optimize test selection to improve diagnostic accuracy, interpretability, and resource use. This represents a step toward transparent, adaptive, and clinician-aligned diagnostic systems that generalize across settings with reduced reliance on domain-specific data.
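Selecting the test with the greatest expected reduction in diagnostic uncertainty is the classic expected-information-gain criterion from Bayesian experimental design. The sketch below illustrates it with hand-written likelihood tables over two diagnoses. In ACTMED an LLM plays the simulator role, so the tables, disease names, test names, and function names here are stand-in assumptions, not the paper's method.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a discrete distribution {label: prob}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def posterior(prior, likelihood, outcome):
    """Bayes update, where likelihood[d][outcome] = P(outcome | diagnosis d)."""
    joint = {d: prior[d] * likelihood[d][outcome] for d in prior}
    z = sum(joint.values())
    return {d: p / z for d, p in joint.items()}


def expected_information_gain(prior, likelihood):
    """Prior entropy minus the expected posterior entropy over test outcomes."""
    outcomes = next(iter(likelihood.values())).keys()
    gain = entropy(prior)
    for o in outcomes:
        p_o = sum(prior[d] * likelihood[d][o] for d in prior)
        if p_o > 0:
            gain -= p_o * entropy(posterior(prior, likelihood, o))
    return gain


# Hypothetical belief over two diagnoses and two candidate tests.
prior = {"flu": 0.5, "covid": 0.5}
tests = {
    "pcr": {
        "flu": {"pos": 0.05, "neg": 0.95},
        "covid": {"pos": 0.95, "neg": 0.05},
    },
    "temperature": {
        "flu": {"pos": 0.8, "neg": 0.2},
        "covid": {"pos": 0.8, "neg": 0.2},
    },
}
gains = {name: expected_information_gain(prior, lik) for name, lik in tests.items()}
best_test = max(gains, key=gains.get)
```

The temperature reading is equally likely under both diagnoses, so its expected gain is essentially zero, while the discriminating test removes most of the one bit of prior uncertainty and is therefore selected.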
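[AI-65] Test-time Verification via Optimal Transport: Coverage, ROC, Sub-optimality
Quick read: This paper addresses the role of the verifier, and the effect of its imperfections, on performance gains from test-time scaling of large language models (LLMs). The core challenge is that existing work does not jointly quantify the geometric interplay of three quantities: the generator's coverage, the verifier's region of convergence (ROC), and the sampling algorithm's sub-optimality. The key to the solution is to model verifiable test-time scaling as a transport problem, which reveals three regimes of the sub-optimality-coverage curve: a transport regime, a policy improvement regime, and a saturation regime. The paper further analyzes how the computational complexity of sequential and batched sampling algorithms shapes the trade-offs among these regimes. Empirical results with Qwen, Llama, and Gemma models support the theoretical framework.
Link: https://arxiv.org/abs/2510.18982
Authors: Arpan Mukherjee, Marcello Bullo, Debabrota Basu, Deniz Gündüz
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: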
Abstract:While test-time scaling with verification has shown promise in improving the performance of large language models (LLMs), the role of the verifier and its imperfections remain underexplored. The effect of verification manifests through interactions of three quantities: (i) the generator’s coverage, (ii) the verifier’s region of convergence (ROC), and (iii) the sampling algorithm’s sub-optimality. Though recent studies capture subsets of these factors, a unified framework quantifying the geometry of their interplay is missing. We frame verifiable test-time scaling as a transport problem. This characterizes the interaction of coverage, ROC, and sub-optimality, and uncovers that the sub-optimality–coverage curve exhibits three regimes. A transport regime – where sub-optimality increases with coverage, a policy improvement regime – where sub-optimality may decrease with coverage, depending on the verifier’s ROC, and a saturation regime – where sub-optimality plateaus, unaffected by coverage. We further propose and analyze two classes of sampling algorithms – sequential and batched, and examine how their computational complexities shape these trade-offs. Empirical results with Qwen, Llama, and Gemma models corroborate our theoretical findings.
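[AI-66] A Justice Lens on Fairness and Ethics Courses in Computing Education: LLM-Assisted Multi-Perspective and Thematic Evaluation
Quick read: This paper addresses the lack of systematic evaluation of how fairness and ethics content is designed in courses related to artificial intelligence (AI) and machine learning (ML), focusing on how well syllabi foster critical thinking, inclusivity, and transparency. The key to the solution is a justice-oriented scoring rubric applied by a large language model (LLM) in a multi-perspective role simulation: 24 syllabi are evaluated from the perspectives of instructor, departmental chair, institutional reviewer, and external evaluator. The analysis surfaces hidden gaps in curriculum design, extracts thematic trends across courses, and offers actionable directions for improving how fairness, ethics, and justice content is taught.
Link: https://arxiv.org/abs/2510.18931
Authors: Kenya S. Andrews, Deborah Dormah Kanubala, Kehinde Aruleba, Francisco Enrique Vicente Castro, Renata A Revelo
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages, 8 figures, In Review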
Abstract:Course syllabi set the tone and expectations for courses, shaping the learning experience for both students and instructors. In computing courses, especially those addressing fairness and ethics in artificial intelligence (AI), machine learning (ML), and algorithmic design, it is imperative that we understand how approaches to navigating barriers to fair outcomes are being this http URL expectations should be inclusive, transparent, and grounded in promoting critical thinking. Syllabus analysis offers a way to evaluate the coverage, depth, practices, and expectations within a course. Manual syllabus evaluation, however, is time-consuming and prone to inconsistency. To address this, we developed a justice-oriented scoring rubric and asked a large language model (LLM) to review syllabi through a multi-perspective role simulation. Using this rubric, we evaluated 24 syllabi from four perspectives: instructor, departmental chair, institutional reviewer, and external evaluator. We also prompted the LLM to identify thematic trends across the courses. Findings show that multiperspective evaluation aids us in noting nuanced, role-specific priorities, leveraging them to fill hidden gaps in curricula design of AI/ML and related computing courses focused on fairness and ethics. These insights offer concrete directions for improving the design and delivery of fairness, ethics, and justice content in such courses.
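[AI-67] Application of Reduced-Order Models for Temporal Multiscale Representations in the Prediction of Dynamical Systems
Quick read: This paper addresses key challenges in modeling and predicting the dynamics of complex multiscale systems: traditional machine learning methods struggle to capture high-frequency behavior, are sensitive to initial conditions, and cannot cleanly separate macro- and micro-scale dynamics. The core of the solution is three complementary multiscale learning approaches. First, the Partition of Unity (PU) method is combined with neural networks to decompose the dynamics into local components and directly predict macro- and micro-scale behavior. Second, Singular Value Decomposition (SVD) extracts dominant modes that explicitly separate the multiscale dynamics. Third, for practical settings where the full data matrix is unavailable, a sparse high-order SVD reconstructs the multiscale dynamics from limited measurements. The framework captures both coarse and fine-scale behavior and suits real complex systems that are high-dimensional and only partially observed.
Link: https://arxiv.org/abs/2510.18925
Authors: Elias Al Ghazal, Jad Mounayer, Beatriz Moya, Sebastian Rodriguez, Chady Ghnatios, Francisco Chinesta
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Regular research article, 28 pages, 13 figures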
Abstract:Modeling and predicting the dynamics of complex multiscale systems remains a significant challenge due to their inherent nonlinearities and sensitivity to initial conditions, as well as limitations of traditional machine learning methods that fail to capture high frequency behaviours. To overcome these difficulties, we propose three approaches for multiscale learning. The first leverages the Partition of Unity (PU) method, integrated with neural networks, to decompose the dynamics into local components and directly predict both macro- and micro-scale behaviors. The second applies the Singular Value Decomposition (SVD) to extract dominant modes that explicitly separate macro- and micro-scale dynamics. Since full access to the data matrix is rarely available in practice, we further employ a Sparse High-Order SVD to reconstruct multiscale dynamics from limited measurements. Together, these approaches ensure that both coarse and fine dynamics are accurately captured, making the framework effective for real-world applications involving complex, multi-scale phenomena and adaptable to higher-dimensional systems with incomplete observations, by providing an approximation and interpretation in all time scales present in the phenomena under study.
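[AI-68] Noise-corrected GRPO: From Noisy Rewards to Unbiased Gradients
Quick read: This paper addresses the high sensitivity of reinforcement learning from human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR) to reward noise in practice, in particular the bias caused by inconsistent or erroneous rewards. The key to the solution is a noise-robust Group Relative Policy Optimization (GRPO) framework, together with a Done Right GRPO variant, that models reward corruption as Bernoulli noise: after estimating the reward flip probabilities, it debiases the learning signal and obtains provably unbiased gradient estimates. The method builds on the inherent robustness of group-level policy optimization to individual-level noise and strengthens it through noise correction, yielding gains on math and code tasks (up to 6.7 percentage points in accuracy on math tasks).
Link: https://arxiv.org/abs/2510.18924
Authors: Omar El mansouri, Mohamed El Amine Seddik, Salem Lahlou
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)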
Abstract:Reinforcement learning from human feedback (RLHF) or verifiable rewards (RLVR), the standard paradigm for aligning LLMs or building recent SOTA reasoning models, is highly sensitive to noise from inconsistent or erroneous rewards. Yet, the interaction between such noise and widely used group-based policy optimization methods remains underexplored. We introduce a noise-robust Group Relative Policy Optimization (GRPO) and Done Right GRPO (Dr. GRPO) framework that explicitly models reward corruption as Bernoulli noise. Our method applies noise correction after estimating reward flip probabilities to debias the learning signal, yielding provably unbiased gradient estimates. Theoretical analysis shows that group-based methods inherently mitigate individual-level noise, and our correction strategy amplifies this robustness. Empirically, we observe consistent improvements across math and code tasks when applying our noise correction to standard reward model usage, with particular gains of up to 6.7 percentage points in accuracy on math tasks and 1.5 on code tasks under realistic reward model conditions. This work bridges label-noise correction from supervised learning with modern RLHF, offering both theoretical insights and a practical algorithm for noisy real-world deployment.
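The debiasing idea in this abstract can be illustrated with the standard label-noise correction for binary rewards. The sketch below is an assumption, not the paper's algorithm: it supposes a true reward of 1 is flipped to 0 with probability `rho_plus` and a true reward of 0 is flipped to 1 with probability `rho_minus` (both names are hypothetical), applies the usual unbiased estimator, and then computes GRPO-style group-relative advantages.

```python
from statistics import mean, pstdev


def debias_reward(noisy_reward, rho_plus, rho_minus):
    """Unbiased estimate of a binary reward observed through Bernoulli flips.

    Assumed noise model: P(observe 0 | true 1) = rho_plus,
    P(observe 1 | true 0) = rho_minus.
    """
    denom = 1.0 - rho_plus - rho_minus
    if denom <= 0:
        raise ValueError("need rho_plus + rho_minus < 1")
    return (noisy_reward - rho_minus) / denom


def expected_corrected(true_reward, rho_plus, rho_minus):
    """Exact expectation of the corrected reward over the flip noise."""
    p_observe_one = (1.0 - rho_plus) if true_reward == 1 else rho_minus
    return (p_observe_one * debias_reward(1, rho_plus, rho_minus)
            + (1.0 - p_observe_one) * debias_reward(0, rho_plus, rho_minus))


def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one sampled group of completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Corrected rewards for one group of four noisy observations.
noisy_group = [1, 0, 0, 1]
corrected = [debias_reward(r, rho_plus=0.2, rho_minus=0.1) for r in noisy_group]
advantages = group_relative_advantages(corrected)
```

In expectation the corrected reward equals the true reward, which is what makes the downstream gradient estimate unbiased under this assumed noise model; in practice the flip probabilities would have to be estimated, as the abstract notes.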
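[AI-69] ADPO: Anchored Direct Preference Optimization
Quick read: This paper addresses three core problems of standard Direct Preference Optimization (DPO) in practice: its strong assumption of hard binary labels limits robustness in realistic settings; training is prone to gradient drift and therefore unstable; and it cannot model preferences over more than two options. The authors propose Anchored Direct Preference Optimization (ADPO), whose key innovations are soft preference probabilities that encode uncertainty and mitigate gradient drift, arbitrary reference policies used as anchors that provide groupwise shift invariance and implicit KL regularization, and listwise preference modeling through Plackett-Luce distributions. These mechanisms let ADPO unify several preference-learning objectives and improve performance in contextual bandits and sequential reinforcement learning tasks, especially under noise contamination.
Link: https://arxiv.org/abs/2510.18913
Authors: Wang Zixian
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)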
Abstract:Anchored Direct Preference Optimization (ADPO) is a unified framework that generalizes Direct Preference Optimization (DPO) with soft preferences, reference-policy anchoring, and groupwise extensions. While standard DPO assumes hard binary labels and pairwise comparisons, ADPO introduces: (i) soft preference probabilities that encode uncertainty and mitigate gradient drift; (ii) arbitrary reference-policy anchors that stabilize training via groupwise shift invariance and implicit KL regularization; and (iii) listwise preference modeling through Plackett-Luce distributions. We prove that DPO, Bradley-Terry objectives, and Top-1-vs-Rest formulations emerge as special cases. ADPO yields three practical variants: pairwise anchored Soft-DPO, listwise anchored Soft-DPO with raw rewards, and KDE-based listwise smoothing for heavy-tailed noise. In contextual bandits, anchoring improves WinMass by 38-63% over standard DPO, while KDE smoothing achieves 0.68 vs 0.32 under heavy-tailed contamination (112% relative gain). In sequential reinforcement learning (CartPole, LunarLander), anchoring improves noisy-preference performance by 15-29%, confirming transfer from single-step to multi-step settings. Experiments with 10-256 parameter models provide clear guidance: use pairwise anchored Soft-DPO for clean or moderate noise, and KDE-based listwise ADPO for extreme contamination.
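The soft-preference and anchoring ideas can be sketched for the pairwise case. This is a generic soft-label variant of the familiar DPO logit, written under stated assumptions and not taken from the paper: the margin is the difference of policy-versus-reference log-ratios scaled by a hypothetical `beta`, and the hard label is replaced by a preference probability `q`.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def soft_anchored_pairwise_loss(logp_a, logp_b, ref_logp_a, ref_logp_b, q, beta=1.0):
    """Cross-entropy between a soft preference q = P(a preferred over b)
    and the model's anchored pairwise probability sigmoid(margin)."""
    margin = beta * ((logp_a - ref_logp_a) - (logp_b - ref_logp_b))
    return -(q * log_sigmoid(margin) + (1.0 - q) * log_sigmoid(-margin))


# q = 1.0 recovers the usual hard-label pairwise loss; q = 0.8 encodes uncertainty.
hard = soft_anchored_pairwise_loss(-1.0, -2.0, -1.5, -1.5, q=1.0)
soft = soft_anchored_pairwise_loss(-1.0, -2.0, -1.5, -1.5, q=0.8)
```

Because only differences of log-probabilities enter the margin, adding the same constant to both policy scores leaves the loss unchanged, which is a small-scale view of the shift invariance the abstract mentions.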
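[AI-70] Large Connectome Model: An fMRI Foundation Model of Brain Connectomes Empowered by Brain-Environment Interaction in Multitask Learning Landscape
Quick read: This paper addresses the limited performance of AI models in functional neuroimaging caused by small sample sizes, especially on clinical downstream tasks such as early disease diagnosis and behavior recognition. The key to the solution is a foundation model built in a multitask learning framework: brain-environment interactions (BEI) are tokenized, and large-scale unlabeled functional magnetic resonance imaging (fMRI) data are combined with rich environmental variables and demographic information for scalable multitask pretraining. During finetuning, pseudo-labels of the pretrained BEI enable semi-supervised optimization, improving generalization and predictive accuracy on specific clinical tasks such as early diagnosis of autism, Parkinson's disease, Alzheimer's disease, and schizophrenia.
Link: https://arxiv.org/abs/2510.18910
Authors: Ziquan Wei, Tingting Dan, Guorong Wu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 6 figures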
Abstract:A reliable foundation model of functional neuroimages is critical to promote clinical applications where the performance of current AI models is significantly impeded by a limited sample size. To that end, tremendous efforts have been made to pretraining large models on extensive unlabeled fMRI data using scalable self-supervised learning. Since self-supervision is not necessarily aligned with the brain-to-outcome relationship, most foundation models are suboptimal to the downstream task, such as predicting disease outcomes. By capitalizing on rich environmental variables and demographic data along with an unprecedented amount of functional neuroimages, we form the brain modeling as a multitask learning and present a scalable model architecture for (i) multitask pretraining by tokenizing multiple brain-environment interactions (BEI) and (ii) semi-supervised finetuning by assigning pseudo-labels of pretrained BEI. We have evaluated our foundation model on a variety of applications, including sex prediction, human behavior recognition, and disease early diagnosis of Autism, Parkinson’s disease, Alzheimer’s disease, and Schizophrenia, where promising results indicate the great potential to facilitate current neuroimaging applications in clinical routines.
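[AI-71] 3D Optimization for AI Inference Scaling: Balancing Accuracy Cost and Latency
Quick read: This paper addresses the neglect of cost and latency constraints in current inference-scaling optimization for generative AI. Existing methods mostly rely on one-dimensional heuristics (a fixed number of reasoning passes) or two-dimensional trade-offs (such as performance versus compute), which fit poorly with the multidimensional constraints of real deployments. The key to the solution is a three-dimensional (3D) optimization framework that places accuracy, cost, and latency in one decision space and uses multi-objective optimization (MOO) to build a more complete feasible space. Monte Carlo simulations evaluate four optimization methods on nine simulated large language models and show that knee-point optimization gives the best balance among the three objectives, providing a theoretical basis for deployment-aware inference scaling across operating environments.
Link: https://arxiv.org/abs/2510.18905
Authors: Minseok Jung, Abhas Ricky, Muhammad Rameez Chatni
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)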
Abstract:AI inference scaling is often tuned through 1D heuristics (fixed reasoning passes) or 2D bivariate trade-offs (e.g., performance vs. compute), which fail to consider cost and latency constraints. We introduce a 3D optimization framework that jointly calibrates accuracy, cost, and latency within a unified decision space, enabling constraints-aware inference scaling. Using Monte Carlo simulations across three representative scenarios and nine simulated large language models, we evaluate four optimization methods to address the 3D multi-objective optimization (MOO) problem. Framing inference scaling in MOO shapes a feasible space that 1D and 2D optimizations fail to capture, enabling environment-adaptive selection of the inference scaling k. Results show that knee-point optimization achieves the best balance, while accuracy-maximization remains favorable when precision is prioritized. The framework establishes a theoretical foundation for deployment-aware inference scaling across diverse operational contexts.
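A three-objective selection of the scaling factor k can be sketched as a Pareto filter followed by a knee-point pick. The abstract does not define its knee-point rule, so the sketch assumes one common heuristic: normalize each objective on the Pareto front and choose the point closest to the ideal corner (highest accuracy, lowest cost, lowest latency). The candidate numbers are made up for illustration.

```python
import math


def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one.

    Points are (accuracy, cost, latency): higher accuracy is better,
    lower cost and lower latency are better.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return no_worse and better


def pareto_front(points):
    """Keep only the configurations that no other configuration dominates."""
    return {k: p for k, p in points.items()
            if not any(dominates(q, p) for q in points.values())}


def knee_point(points):
    """Pick the Pareto-optimal k closest to the normalized ideal point (assumed rule)."""
    front = pareto_front(points)
    lo = [min(p[i] for p in front.values()) for i in range(3)]
    hi = [max(p[i] for p in front.values()) for i in range(3)]

    def norm(value, i):
        return 0.0 if hi[i] == lo[i] else (value - lo[i]) / (hi[i] - lo[i])

    def distance_to_ideal(p):
        return math.sqrt((1.0 - norm(p[0], 0)) ** 2
                         + norm(p[1], 1) ** 2
                         + norm(p[2], 2) ** 2)

    return min(front, key=lambda k: distance_to_ideal(front[k]))


# Hypothetical (accuracy, cost, latency) per number of reasoning passes k.
candidates = {
    1: (0.60, 1.0, 1.0),
    2: (0.70, 2.0, 1.8),
    3: (0.69, 3.0, 2.9),  # dominated by k = 2
    4: (0.78, 4.0, 3.5),
    8: (0.80, 8.0, 7.0),
}
best_k = knee_point(candidates)
```

With these illustrative numbers the accuracy gain flattens after k = 2 while cost and latency keep growing, so the knee rule selects k = 2, whereas pure accuracy maximization would select k = 8.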
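[AI-72] Evaluating LLMs for Career Guidance: Comparative Analysis of Computing Competency Recommendations Across Ten African Countries
Quick read: This paper addresses the problem that generative AI models describe entry-level computing competency requirements with notable bias across African national contexts, so that technical advice is disconnected from local needs and may reinforce digital colonialism. The key to the approach is comparing the responses of six mainstream large language models (LLMs) across ten countries and, drawing on the Computing Curricula 2020 framework, Digital Colonialism Theory, and Ubuntu Philosophy, systematically assessing how the models treat technical skills and non-technical competencies (such as ethical responsibility and cultural sensitivity). The study finds that open-source models (such as Llama 3 and DeepSeek) show better awareness of local context than proprietary models (such as ChatGPT 4 and Claude), and argues for a decolonial perspective in designing AI education tools so that AI guidance better fits African infrastructure, policy, and cultural contexts.
Link: https://arxiv.org/abs/2510.18902
Authors: Precious Eze, Stephanie Lunn, Bruk Berhane (College of Engineering and Computing, Florida International University, Miami, USA)
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 42 pages, 2 figures, 5 tables. Submitted to Computers Education Open Access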
Abstract:Employers increasingly expect graduates to utilize large language models (LLMs) in the workplace, yet the competencies needed for computing roles across Africa remain unclear given varying national contexts. This study examined how six LLMs, namely ChatGPT 4, DeepSeek, Gemini, Claude 3.5, Llama 3, and Mistral AI, describe entry-level computing career expectations across ten African countries. Using the Computing Curricula 2020 framework and drawing on Digital Colonialism Theory and Ubuntu Philosophy, we analyzed 60 LLM responses to standardized prompts. Technical skills such as cloud computing and programming appeared consistently, but notable differences emerged in how models addressed non-technical competencies, particularly ethics and responsible AI use. Models varied considerably in recognizing country-specific factors, including local technology ecosystems, language requirements, and national policies. Open-source models demonstrated stronger contextual awareness and a better balance between technical and professional skills, earning top scores in nine of ten countries. Still, all models struggled with cultural sensitivity and infrastructure considerations, averaging only 35.4% contextual awareness. This first broad comparison of LLM career guidance for African computing students uncovers entrenched infrastructure assumptions and Western-centric biases, creating gaps between technical recommendations and local needs. The strong performance of cost-effective open-source models (Llama: 4.47/5; DeepSeek: 4.25/5) compared to proprietary alternatives (ChatGPT 4: 3.90/5; Claude: 3.46/5) challenges assumptions about AI tool quality in resource-constrained settings. Our findings highlight how computing competency requirements vary widely across Africa and underscore the need for decolonial approaches to AI in education that emphasize contextual relevance
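[AI-73] AI for Distributed Systems Design: Scalable Cloud Optimization Through Repeated LLMs Sampling And Simulators
Quick read: This paper addresses the difficulty of applying generative AI to policy design for distributed systems, in particular how to explore a large policy space efficiently while keeping policies interpretable. The key to the solution is an iterative generate-and-verify loop: a large language model (LLM) generates scheduling policies as Python code, a domain-specific simulator (Eudoxia) evaluates them deterministically, and structured feedback guides the next round of generation. The method combines the creativity of LLMs with the exact verification of a simulator to automate the optimization of scheduling policies for a Function-as-a-Service runtime (Bauplan), and preliminary results show throughput improvements.
Link: https://arxiv.org/abs/2510.18897
Authors: Jacopo Tagliabue
Affiliation: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Databases (cs.DB); Software Engineering (cs.SE)
Comments: Pre-print IAAA workshop submission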
Abstract:We explore AI-driven distributed-systems policy design by combining stochastic code generation from large language models (LLMs) with deterministic verification in a domain-specific simulator. Using a Function-as-a-Service runtime (Bauplan) and its open-source simulator (Eudoxia) as a case study, we frame scheduler design as an iterative generate-and-verify loop: an LLM proposes a Python policy, the simulator evaluates it on standardized traces, and structured feedback steers subsequent generations. This setup preserves interpretability while enabling targeted search over a large design space. We detail the system architecture and report preliminary results on throughput improvements across multiple models. Beyond early gains, we discuss the limits of the current setup and outline next steps; in particular, we conjecture that AI will be crucial for scaling this methodology by helping to bootstrap new simulators.
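The generate-and-verify loop described here can be sketched with stand-ins for both components. Everything below is hypothetical: `propose_policy` replaces the LLM call with a seeded random sampler, `simulate` is a toy single-worker simulator rather than Eudoxia, and the trace is invented. Only the loop structure (propose, evaluate deterministically, feed results back) follows the abstract.

```python
import random

TRACE = [5, 1, 3, 2, 8]  # hypothetical job durations in a fixed trace


def simulate(policy, trace):
    """Deterministic toy simulator: one worker, returns mean completion time.

    `policy(index, duration)` returns a sort key; lower keys run first.
    """
    order = sorted(range(len(trace)), key=lambda i: policy(i, trace[i]))
    clock, total = 0, 0
    for i in order:
        clock += trace[i]
        total += clock
    return total / len(trace)


def propose_policy(rng, feedback):
    """Stand-in for the LLM: samples a duration weight, nudged by feedback."""
    if feedback is None:
        weight = rng.uniform(-1.0, 1.0)
    else:
        weight = feedback["best_weight"] + rng.uniform(-0.5, 0.5)
    return weight, (lambda index, duration, w=weight: w * duration)


def generate_and_verify(rounds=20, seed=0):
    """Propose a policy, score it in the simulator, feed the best result back."""
    rng = random.Random(seed)
    best, feedback = None, None
    for _ in range(rounds):
        weight, policy = propose_policy(rng, feedback)
        score = simulate(policy, TRACE)
        if best is None or score < best["score"]:
            best = {"weight": weight, "score": score}
        feedback = {"best_weight": best["weight"], "best_score": best["score"]}
    return best


result = generate_and_verify()
```

The simulator being deterministic is what makes the comparison between proposals meaningful: the only randomness sits in the proposer, so a lower score always reflects a better policy on the trace.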
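[AI-74] CosmoCore Affective Dream-Replay Reinforcement Learning for Code Generation
Quick read: This paper addresses hallucinations (such as syntax errors or logical bugs) in code generated by large language models (LLMs), as well as their lack of efficient self-correction. The core solution is CosmoCore, a neuroscience-inspired reinforcement learning architecture. Its key idea is to introduce affective signals: a lightweight multi-layer perceptron (MLP) tags code-generation trajectories with valence and surprise, high-negative-valence (cringe) episodes are prioritized in a Dream Queue for five-fold replay, and low-surprise successes are pruned to avoid overconfidence and buffer bloat. On benchmarks such as HumanEval and BigCodeBench, the paper reports 48% less hallucinated code and 45% faster self-correction.
Link: https://arxiv.org/abs/2510.18895
Authors: Santhosh Kumar Ravindran
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 12 pages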
Abstract:We introduce CosmoCore, a neuroscience-inspired reinforcement learning (RL) architecture that integrates affective signals to enhance code generation in large language models (LLMs). Motivated by human and animal learning where embarrassment from mistakes drives rapid correction, as observed in training a puppy to avoid repeating errors after a single scolding, CosmoCore tags code generation trajectories with valence and surprise using a lightweight multi-layer perceptron (MLP). High-negative valence (cringe) episodes, such as buggy code outputs, are prioritized in a Dream Queue for five-fold replay during off-policy updates, while low-surprise successes are pruned to prevent overconfidence and buffer bloat. Evaluated on code generation benchmarks like HumanEval and BigCodeBench, alongside simulations with a custom data pipeline environment, CosmoCore reduces hallucinated code (e.g., syntax errors or logical bugs) by 48% and accelerates self-correction by 45%. Local experiments using Hugging Face models in a PySpark environment validate these gains, with code snippets provided for replication. Ablations confirm valence tagging boosts curiosity in exploration, and pruning mitigates inefficiency. This framework extends RL from human feedback (RLHF) for more emotionally aware code assistants, with applications in IDEs and data pipelines. Code and the custom mini-world simulation are released.
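The replay rule in this abstract (replay strongly negative episodes five times, prune unsurprising successes) can be sketched as a small scheduling function. The thresholds, field names, and the `Episode` type below are hypothetical, and the valence and surprise tags are given directly rather than produced by an MLP as in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    name: str
    valence: float   # negative means a bad outcome, e.g. buggy generated code
    surprise: float  # how unexpected the outcome was, in [0, 1]


def build_replay_schedule(episodes, cringe_below=-0.5, boring_below=0.2,
                          replay_factor=5):
    """Return the list of episodes to replay in one off-policy update.

    Assumed thresholds: valence <= cringe_below marks a high-negative episode,
    a positive-valence episode with surprise < boring_below is pruned.
    """
    schedule = []
    for episode in episodes:
        if episode.valence <= cringe_below:
            schedule.extend([episode] * replay_factor)  # prioritized replay
        elif episode.valence > 0 and episode.surprise < boring_below:
            continue  # low-surprise success: pruned from the buffer
        else:
            schedule.append(episode)
    return schedule


episodes = [
    Episode("syntax_error", valence=-0.9, surprise=0.7),
    Episode("easy_pass", valence=0.8, surprise=0.05),
    Episode("novel_pass", valence=0.6, surprise=0.9),
]
replayed = [episode.name for episode in build_replay_schedule(episodes)]
```

Here the buggy episode appears five times, the routine success is dropped, and the surprising success is kept once.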
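[AI-75] CodeCRDT: Observation-Driven Coordination for Multi-Agent LLM Code Generation
Quick read: This paper addresses the failure of multi-agent large language model (LLM) systems to achieve parallel speedups because of high coordination overhead. The key to the solution is CodeCRDT, an observation-driven coordination pattern in which agents monitor observable updates to a shared state and rely on Conflict-Free Replicated Data Types (CRDTs) for lock-free, conflict-free concurrent code generation with strong eventual consistency. This avoids the communication bottleneck of explicit message passing; the reported evaluation shows speedups on some tasks, slowdowns on others, and reliable convergence.
Link: https://arxiv.org/abs/2510.18893
Authors: Sergey Pugachev
Affiliation: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 11 pages, 3 figures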
Abstract:Multi-agent LLM systems fail to realize parallel speedups due to costly coordination. We present CodeCRDT, an observation-driven coordination pattern where agents coordinate by monitoring a shared state with observable updates and deterministic convergence, rather than explicit message passing. Using Conflict-Free Replicated Data Types (CRDTs), CodeCRDT enables lock-free, conflict-free concurrent code generation with strong eventual consistency. Evaluation across 600 trials (6 tasks, 50 runs per mode) shows both benefits and trade-offs: up to 21.1% speedup on some tasks, up to 39.4% slowdown on others, and 100% convergence with zero merge failures. The study formalizes observation-driven coordination for stochastic LLM agents, revealing semantic conflict rates (5-10%) and quality-performance tradeoffs, and provides empirical characterization of when parallel coordination succeeds versus fails based on task structure.
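Strong eventual consistency comes from a merge operation that is commutative, associative, and idempotent. The abstract does not say which CRDT the system uses, so the sketch below assumes the simplest suitable one, a last-writer-wins map keyed by a code slot (for example a function name), with ties broken by agent id. All names are hypothetical.

```python
def merge(replica_a, replica_b):
    """Merge two replicas of a last-writer-wins map.

    Each value is (timestamp, agent_id, text); for every key the entry with
    the larger (timestamp, agent_id) tag wins, so merge order does not matter.
    """
    merged = dict(replica_a)
    for key, entry in replica_b.items():
        if key not in merged or entry[:2] > merged[key][:2]:
            merged[key] = entry
    return merged


def write(replica, key, text, timestamp, agent_id):
    """Apply a local write by merging a single-entry replica."""
    return merge(replica, {key: (timestamp, agent_id, text)})


# Two agents edit concurrently without locks or messages.
agent_a = write({}, "parse", "def parse(): ...", timestamp=1, agent_id="agent-a")
agent_b = write({}, "parse", "def parse(s): ...", timestamp=2, agent_id="agent-b")
agent_b = write(agent_b, "render", "def render(): ...", timestamp=1, agent_id="agent-b")

# Whichever direction the replicas are exchanged, both end in the same state.
state_seen_by_a = merge(agent_a, agent_b)
state_seen_by_b = merge(agent_b, agent_a)
```

Last-writer-wins guarantees convergence but silently discards the losing edit, which is one way the semantic conflicts mentioned in the abstract can arise even when there are no merge failures.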
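[AI-76] LLM Bazaar: A Service Design for Supporting Collaborative Learning with an LLM-Powered Multi-Party Collaboration Infrastructure
Quick read: This paper addresses the limited ability of traditional conversational agents to support collaborative learning, especially in fostering critical thinking and collaborative problem solving. The core challenge is how to integrate large language models (LLMs) effectively to provide real-time, context-sensitive collaborative support. The key to the solution is an LLM-agent shell built on the open-source collaboration support architecture Bazaar, which embeds LLM-empowered capabilities into group learning environments and opens the way to studying how such environments reshape interaction patterns and collaborative learning outcomes.
Link: https://arxiv.org/abs/2510.18877
Authors: Zhen Wu, Jiaxin Shi, R. Charles Murray, Carolyn Rosé, Micah San Andres
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)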
Abstract:For nearly two decades, conversational agents have played a critical role in structuring interactions in collaborative learning, shaping group dynamics, and supporting student engagement. The recent integration of large language models (LLMs) into these agents offers new possibilities for fostering critical thinking and collaborative problem solving. In this work, we begin with an open source collaboration support architecture called Bazaar and integrate an LLM-agent shell that enables introduction of LLM-empowered, real time, context sensitive collaborative support for group learning. This design and infrastructure paves the way for exploring how tailored LLM-empowered environments can reshape collaborative learning outcomes and interaction patterns.
[AI-77] Actor-Free Continuous Control via Structurally Maximizable Q-Functions NEURIPS2025
Link: https://arxiv.org/abs/2510.18828
Authors: Yigit Korkmaz, Urvi Bhuwania, Ayush Jain, Erdem Bıyık
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Machine Learning (stat.ML)
Comments: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
[AI-78] SysMoBench: Evaluating AI on Formally Modeling Complex Real-World Systems
Link: https://arxiv.org/abs/2509.23130
Authors: Qian Cheng, Ruize Tang, Emilie Ma, Finn Hackett, Peiyang He, Yiming Su, Ivan Beschastnikh, Yu Huang, Xiaoxing Ma, Tianyin Xu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Software Engineering (cs.SE)
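[AI-79] A Unified Formal Theory on the Logical Limits of Symbol Grounding
Quick read: This paper addresses the Symbol Grounding Problem, that is, how symbols can be given stable and consistent meaning within a formal system. The key to the approach is a four-stage formal argument showing that meaning must arise from a process that is external, dynamic, and non-algorithmic. First, it proves that a purely symbolic system cannot internally establish a foundation for meaning because of self-referential paradoxes. Second, it shows that any system with a finite, static set of pre-established meanings is incomplete. Third, it argues that connecting a symbol to an external meaning cannot be derived by the system's internal logic and must instead be an axiomatic, meta-level update. Finally, it proves that automating this update with a fixed algorithm only constructs a larger but equally incomplete system. The paper thereby establishes that meaning grounding is a necessarily open-ended, non-algorithmic process, revealing a fundamental Gödel-style limitation for any self-contained intelligent system.
Link: https://arxiv.org/abs/2509.20409
Authors: Zhangchi Liu
Affiliation: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: 8 pages, 1 figure. A formal proof on the logical limits of symbol grounding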
Abstract:This paper synthesizes a series of formal proofs to construct a unified theory on the logical limits of the Symbol Grounding Problem. We demonstrate through a four-stage argument that meaning within a formal system must arise from a process that is external, dynamic, and non-algorithmic. First, we prove that any purely symbolic system, devoid of external connections, cannot internally establish a consistent foundation for meaning due to self-referential paradoxes. Second, we extend this limitation to systems with any finite, static set of pre-established meanings, proving they are inherently incomplete. Third, we demonstrate that the very “act” of connecting an internal symbol to an external meaning cannot be a product of logical inference within the system but must be an axiomatic, meta-level update. Finally, we prove that any attempt to automate this update process using a fixed, external “judgment” algorithm will inevitably construct a larger, yet equally incomplete, symbolic system. Together, these conclusions formally establish that the grounding of meaning is a necessarily open-ended, non-algorithmic process, revealing a fundamental, Gödel-style limitation for any self-contained intelligent system.
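[AI-80] Demonstrating Real Advantage of Machine-Learning-Enhanced Monte Carlo for Combinatorial Optimization
Link: https://arxiv.org/abs/2510.19544
Authors: Luca Maria Del Bono, Federico Ricci-Tersenghi, Francesco Zamponi
Affiliation: Unknown
Categories: Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 13 main pages, 6 main figures. 4 supplementary pages, 2 supplementary figures
[AI-81] KnowMol: Advancing Molecular Large Language Models with Multi-Level Chemical Knowledge
Quick read: This paper addresses the limited molecular understanding of current molecular large language models, which stems mainly from inadequate textual descriptions and suboptimal molecular representation strategies during pretraining. The solution has two key parts: KnowMol-100K, a large-scale dataset of 100K fine-grained molecular annotations that aligns molecules with multi-level textual descriptions; and a chemically-informative molecular representation that addresses limitations of existing representation strategies. On this basis the authors develop KnowMol, a multi-modal molecular large language model that the paper reports achieves superior performance on molecular understanding and generation tasks.
Link: https://arxiv.org/abs/2510.19484
Authors: Zaifei Yang, Hong Chang, Ruibing Hou, Shiguang Shan, Xilin Chen
Affiliation: Unknown
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)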
Abstract:The molecular large language models have garnered widespread attention due to their promising potential on molecular applications. However, current molecular large language models face significant limitations in understanding molecules due to inadequate textual descriptions and suboptimal molecular representation strategies during pretraining. To address these challenges, we introduce KnowMol-100K, a large-scale dataset with 100K fine-grained molecular annotations across multiple levels, bridging the gap between molecules and textual descriptions. Additionally, we propose chemically-informative molecular representation, effectively addressing limitations in existing molecular representation strategies. Building upon these innovations, we develop KnowMol, a state-of-the-art multi-modal molecular large language model. Extensive experiments demonstrate that KnowMol achieves superior performance across molecular understanding and generation tasks. GitHub: this https URL Huggingface: this https URL
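[AI-82] EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection
Quick read: This paper addresses the sharp performance drop of existing speech anti-spoofing systems in real-world scenarios, especially under physical replay attacks. The key to the solution is the EchoFake dataset, which contains more than 120 hours of audio from over 13,000 speakers and combines cutting-edge zero-shot text-to-speech (TTS) speech with physical replay recordings collected across varied devices and real-world environments, so that it better reflects deployment challenges and improves the generalization of detection models.
Link: https://arxiv.org/abs/2510.19414
Authors: Tong Zhang, Yihuan Huang, Yanzhen Ren
Affiliation: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)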
Abstract:The growing prevalence of speech deepfakes has raised serious concerns, particularly in real-world scenarios such as telephone fraud and identity theft. While many anti-spoofing systems have demonstrated promising performance on lab-generated synthetic speech, they often fail when confronted with physical replay attacks, a common and low-cost form of attack used in practical settings. Our experiments show that models trained on existing datasets exhibit severe performance degradation, with average accuracy dropping to 59.6% when evaluated on replayed audio. To bridge this gap, we present EchoFake, a comprehensive dataset comprising more than 120 hours of audio from over 13,000 speakers, featuring both cutting-edge zero-shot text-to-speech (TTS) speech and physical replay recordings collected under varied devices and real-world environmental settings. Additionally, we evaluate three baseline detection models and show that models trained on EchoFake achieve lower average EERs across datasets, indicating better generalization. By introducing more practical challenges relevant to real-world deployment, EchoFake offers a more realistic foundation for advancing spoofing detection methods.
[AI-83] Metadata Extraction Leveraging Large Language Models
Link: https://arxiv.org/abs/2510.19334
Authors: Cuize Han, Sesh Jalagam
Affiliation: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
[AI-84] No Intelligence Without Statistics: The Invisible Backbone of Artificial Intelligence
Link: https://arxiv.org/abs/2510.19212
Authors: Ernest Fokoué
Affiliation: Unknown
Categories: Methodology (stat.ME); Artificial Intelligence (cs.AI)
Comments: 37 pages, 6 figures
[AI-85] News-Aware Direct Reinforcement Trading for Financial Markets
Link: https://arxiv.org/abs/2510.19173
Authors: Qing-Yu Lan, Zhan-He Wang, Jun-Qian Jiang, Yu-Tong Wang, Yun-Song Piao
Affiliation: Unknown
Categories: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 4 figures, 3 tables
[AI-86] Prospects for Using Artificial Intelligence to Understand Intrinsic Kinetics of Heterogeneous Catalytic Reactions
Quick read: This paper addresses the “many-to-one” challenge in heterogeneous catalysis research, that is, how to link intrinsic reaction kinetics to observable macroscopic phenomena. The key to the proposed direction is integrating generative AI with multiscale models and multimodal experiments: machine-learned force fields, microkinetics, and reactor modeling enable rapid exploration of chemical spaces, while operando and transient experimental data provide high-fidelity information. Generative and agentic AI can then automate model generation, quantify uncertainty, and couple theory with experiment, realizing “self-driving models” that are interpretable, reproducible, and transferable and that support systematic discovery of catalytic mechanisms.
Link: https://arxiv.org/abs/2510.18911
Authors: Andrew J. Medford, Todd N. Whittaker, Bjarne Kreitz, David W. Flaherty, John R. Kitchin
Affiliation: Unknown
Categories: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI)
Comments: Submitted to “Current Opinion in Chemical Engineering” for peer review
Abstract:Artificial intelligence (AI) is influencing heterogeneous catalysis research by accelerating simulations and materials discovery. A key frontier is integrating AI with multiscale models and multimodal experiments to address the “many-to-one” challenge of linking intrinsic kinetics to observables. Advances in machine-learned force fields, microkinetics, and reactor modeling enable rapid exploration of chemical spaces, while operando and transient data provide unprecedented insight. Yet, inconsistent data quality and model complexity limit mechanistic discovery. Generative and agentic AI can automate model generation, quantify uncertainty, and couple theory with experiment, realizing “self-driving models” that produce interpretable, reproducible, and transferable understanding of catalytic systems.
[AI-87] What is Implementation Science; and Why It Matters for Bridging the Artificial Intelligence Innovation-to-Application Gap in Medical Imaging
Link: https://arxiv.org/abs/2510.13006
Authors: Ahmad Fayaz-Bakhsh, Janice Tania, Syaheerah Lebai Lutfi, Abhinav K. Jha, Arman Rahmim
Affiliation: Unknown
Categories: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI)
Machine Learning
[LG-0] The Feasibility of Training Sovereign Language Models in the Global South: A Study of Brazil and Mexico
Link: https://arxiv.org/abs/2510.19801
Authors: Sandra Malagon (1 and 2), Monica A. Ulloa Ruiz (1 and 2), Tatiana Elizabeth Sandoval Plaza (1), Gabriel Rafael Rosario Bolívar (1), Valentina García Mesa (1), Ivanna Alvarado Morales (1) ((1) Carreras con Impacto, (2) AIxo)
Categories: Machine Learning (cs.LG)
Comments: 11 pages, 3 figures
Abstract:The rapid escalation of computational requirements for training large-scale language models has reinforced structural asymmetries between high-capacity jurisdictions and countries in the Global South. This paper examines the technical and fiscal feasibility of sovereign-scale language model training in Brazil and Mexico under conditions of constrained hardware access, energy availability, and fiscal ceilings. Using a dual-axis design that varies accelerator generation (NVIDIA H100 vs. A100) and training duration (90 vs. 150 days), we estimate compute demand, energy consumption, capital expenditures, and regulatory compatibility for the training of a 10-trillion-token model. Our findings show that while all configurations remain below export-control and electrical infrastructure thresholds, fiscal viability is determined by hardware efficiency. H100-based scenarios achieve training feasibility at a total cost of 8-14 million USD, while A100 deployments require 19-32 million USD due to higher energy and hardware demand. We argue that extending training timelines should be treated as a policy lever to mitigate hardware constraints, enabling the production of usable, auditable, and locally aligned models without competing at the global frontier. This study contributes to the discourse on AI compute governance and technological sovereignty by highlighting context-sensitive strategies that allow middle-income countries to establish sustainable and strategically sufficient AI capabilities.
[LG-1] Transformers are almost optimal metalearners for linear classification
Link: https://arxiv.org/abs/2510.19797
Authors: Roey Magen, Gal Vardi
Categories: Machine Learning (cs.LG)
Abstract:Transformers have demonstrated impressive in-context learning (ICL) capabilities, raising the question of whether they can serve as metalearners that adapt to new tasks using only a small number of in-context examples, without any further training. While recent theoretical work has studied transformers’ ability to perform ICL, most of these analyses do not address the formal metalearning setting, where the objective is to solve a collection of related tasks more efficiently than would be possible by solving each task individually. In this paper, we provide the first theoretical analysis showing that a simplified transformer architecture trained via gradient descent can act as a near-optimal metalearner in a linear classification setting. We consider a natural family of tasks where each task corresponds to a class-conditional Gaussian mixture model, with the mean vectors lying in a shared k-dimensional subspace of R^d. After training on a sufficient number of such tasks, we show that the transformer can generalize to a new task using only O(k / R^4) in-context examples, where R denotes the signal strength at test time. This performance (almost) matches that of an optimal learner that knows exactly the shared subspace and significantly outperforms any learner that only has access to the in-context data, which requires Ω(d / R^4) examples to generalize. Importantly, our bounds on the number of training tasks and examples per task needed to achieve this result are independent of the ambient dimension d.
[LG-2] Environment Inference for Learning Generalizable Dynamical System NEURIPS2025
Link: https://arxiv.org/abs/2510.19784
Authors: Shixuan Liu, Yue He, Haotian Wang, Wenjing Yang, Yunfei Wang, Peng Cui, Zhong Liu
Categories: Machine Learning (cs.LG)
Comments: NeurIPS 2025 Spotlight
Abstract:Data-driven methods offer efficient and robust solutions for analyzing complex dynamical systems but rely on the assumption of I.I.D. data, driving the development of generalization techniques for handling environmental differences. These techniques, however, are limited by their dependence on environment labels, which are often unavailable during training due to data acquisition challenges, privacy concerns, and environmental variability, particularly in large public datasets and privacy-sensitive domains. In response, we propose DynaInfer, a novel method that infers environment specifications by analyzing prediction errors from fixed neural networks within each training round, enabling environment assignments directly from data. We prove our algorithm effectively solves the alternating optimization problem in unlabeled scenarios and validate it through extensive experiments across diverse dynamical systems. Results show that DynaInfer outperforms existing environment assignment techniques, converges rapidly to true labels, and even achieves superior performance when environment labels are available.
[LG-3] The Tail Tells All: Estimating Model-Level Membership Inference Vulnerability Without Reference Models
Link: https://arxiv.org/abs/2510.19773
Authors: Euodia Dodd, Nataša Krčo, Igor Shilov, Yves-Alexandre de Montjoye
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Abstract:Membership inference attacks (MIAs) have emerged as the standard tool for evaluating the privacy risks of AI models. However, state-of-the-art attacks require training numerous, often computationally expensive, reference models, limiting their practicality. We present a novel approach for estimating model-level vulnerability, the TPR at low FPR, to membership inference attacks without requiring reference models. Empirical analysis shows loss distributions to be asymmetric and heavy-tailed and suggests that most points at risk from MIAs have moved from the tail (high-loss region) to the head (low-loss region) of the distribution after training. We leverage this insight to propose a method to estimate model-level vulnerability from the training and testing distribution alone: using the absence of outliers from the high-loss region as a predictor of the risk. We evaluate our method, the TNR of a simple loss attack, across a wide range of architectures and datasets and show it to accurately estimate model-level vulnerability to the SOTA MIA attack (LiRA). We also show our method to outperform both low-cost (few reference models) attacks such as RMIA and other measures of distribution difference. We finally evaluate the use of non-linear functions to evaluate risk and show the approach to be promising to evaluate the risk in large-language models.
[LG-4] Exploring the Effect of DNN Depth on Adversarial Attacks in Network Intrusion Detection Systems
Link: https://arxiv.org/abs/2510.19761
Authors: Mohamed ElShehaby, Ashraf Matrawy
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Abstract:Adversarial attacks pose significant challenges to Machine Learning (ML) systems, and especially to Deep Neural Networks (DNNs), by subtly manipulating inputs to induce incorrect predictions. This paper investigates whether increasing the layer depth of deep neural networks affects their robustness against adversarial attacks in the Network Intrusion Detection System (NIDS) domain. We compare the adversarial robustness of various deep neural networks across both the NIDS and computer vision domains (the latter being widely used in adversarial attack experiments). Our experimental results reveal that in the NIDS domain, adding more layers does not necessarily improve performance and may in fact significantly degrade robustness against adversarial attacks. Conversely, in the computer vision domain, adding more layers has a more modest impact on robustness. These findings can guide the development of robust neural networks for NIDS applications and highlight the unique characteristics of network security domains within the ML landscape.
[LG-5] CONFEX: Uncertainty-Aware Counterfactual Explanations with Conformal Guarantees
Link: https://arxiv.org/abs/2510.19754
Authors: Aman Bilkhoo (1), Milad Kazemi (1), Nicola Paoletti (1), Mehran Hosseini (1 and 2) ((1) Department of Informatics, King’s College London, (2) Department of Computer Science, University of Manchester)
Categories: Machine Learning (cs.LG)
Comments: 35 pages, 10 figures, 21 tables, 2 algorithms. [Main paper part consists of 11 pages, 2 figures, 1 table, 1 algorithm]
Abstract:Counterfactual explanations (CFXs) provide human-understandable justifications for model predictions, enabling actionable recourse and enhancing interpretability. To be reliable, CFXs must avoid regions of high predictive uncertainty, where explanations may be misleading or inapplicable. However, existing methods often neglect uncertainty or lack principled mechanisms for incorporating it with formal guarantees. We propose CONFEX, a novel method for generating uncertainty-aware counterfactual explanations using Conformal Prediction (CP) and Mixed-Integer Linear Programming (MILP). CONFEX explanations are designed to provide local coverage guarantees, addressing the issue that CFX generation violates exchangeability. To do so, we develop a novel localised CP procedure that enjoys an efficient MILP encoding by leveraging an offline tree-based partitioning of the input space. This way, CONFEX generates CFXs with rigorous guarantees on both predictive uncertainty and optimality. We evaluate CONFEX against state-of-the-art methods across diverse benchmarks and metrics, demonstrating that our uncertainty-aware approach yields robust and plausible explanations.
[LG-6] When Do Transformers Learn Heuristics for Graph Connectivity?
Link: https://arxiv.org/abs/2510.19753
Authors: Qilin Ye, Deqing Fu, Robin Jia, Vatsal Sharan
Categories: Machine Learning (cs.LG)
Abstract:Transformers often fail to learn generalizable algorithms, instead relying on brittle heuristics. Using graph connectivity as a testbed, we explain this phenomenon both theoretically and empirically. We consider a simplified Transformer architecture, the disentangled Transformer, and prove that an L-layer model has capacity to solve for graphs with diameters up to exactly 3^L, implementing an algorithm equivalent to computing powers of the adjacency matrix. We analyze the training dynamics, and show that the learned strategy hinges on whether most training instances are within this model capacity. Within-capacity graphs (diameter \leq 3^L) drive the learning of a correct algorithmic solution while beyond-capacity graphs drive the learning of a simple heuristic based on node degrees. Finally, we empirically demonstrate that restricting training data within a model’s capacity leads to both standard and disentangled transformers learning the exact algorithm rather than the degree-based heuristic.
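As a rough illustration of "computing powers of the adjacency matrix", the sketch below decides connectivity by repeatedly cubing a boolean reachability matrix, so that each round triples the path length covered. It is not the paper's disentangled Transformer construction: the function names and the one-cubing-per-layer rule are assumptions chosen only to reproduce the 3^L capacity quoted in the abstract.

```python
import numpy as np

def reachable_within(adj: np.ndarray, num_layers: int) -> np.ndarray:
    """Return a boolean matrix whose (i, j) entry is True when node j
    can be reached from node i in at most 3**num_layers hops."""
    n = adj.shape[0]
    # Paths of length <= 1: the edges plus self-loops.
    reach = (adj.astype(bool) | np.eye(n, dtype=bool)).astype(np.int64)
    for _ in range(num_layers):
        # Cubing a "<= h hops" matrix yields a "<= 3h hops" matrix.
        reach = ((reach @ reach @ reach) > 0).astype(np.int64)
    return reach.astype(bool)

def path_graph(n: int) -> np.ndarray:
    """Adjacency matrix of an undirected path on n nodes (diameter n - 1)."""
    adj = np.zeros((n, n), dtype=np.int64)
    for i in range(n - 1):
        adj[i, i + 1] = adj[i + 1, i] = 1
    return adj
```

With two rounds, the end points of a 10-node path (diameter 9 = 3^2) are detected as connected, while the end points of an 11-node path (diameter 10) are not, which mirrors the within-capacity versus beyond-capacity distinction.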
[LG-7] BATIS: Bayesian Approaches for Targeted Improvement of Species Distribution Models
Link: https://arxiv.org/abs/2510.19749
Authors: Catherine Villeneuve, Benjamin Akera, Mélisande Teng, David Rolnick
Categories: Machine Learning (cs.LG); Populations and Evolution (q-bio.PE); Quantitative Methods (q-bio.QM)
Abstract:Species distribution models (SDMs), which aim to predict species occurrence based on environmental variables, are widely used to monitor and respond to biodiversity change. Recent deep learning advances for SDMs have been shown to perform well on complex and heterogeneous datasets, but their effectiveness remains limited by spatial biases in the data. In this paper, we revisit deep SDMs from a Bayesian perspective and introduce BATIS, a novel and practical framework wherein prior predictions are updated iteratively using limited observational data. Models must appropriately capture both aleatoric and epistemic uncertainty to effectively combine fine-grained local insights with broader ecological patterns. We benchmark an extensive set of uncertainty quantification approaches on a novel dataset including citizen science observations from the eBird platform. Our empirical study shows how Bayesian deep learning approaches can greatly improve the reliability of SDMs in data-scarce locations, which can contribute to ecological understanding and conservation efforts.
[LG-8] Statistical Inference for Linear Functionals of Online Least-squares SGD when t \gtrsim d^{1+\delta}
Link: https://arxiv.org/abs/2510.19734
Authors: Bhavya Agrawalla, Krishnakumar Balasubramanian, Promit Ghosal
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: Improved version of arXiv:2302.09727 with new results
Abstract:Stochastic Gradient Descent (SGD) has become a cornerstone method in modern data science. However, deploying SGD in high-stakes applications necessitates rigorous quantification of its inherent uncertainty. In this work, we establish non-asymptotic Berry–Esseen bounds for linear functionals of online least-squares SGD, thereby providing a Gaussian Central Limit Theorem (CLT) in a growing-dimensional regime. Existing approaches to high-dimensional inference for projection parameters, such as \cite{chang2023inference}, rely on inverting empirical covariance matrices and require at least t \gtrsim d^{3/2} iterations to achieve finite-sample Berry–Esseen guarantees, rendering them computationally expensive and restrictive in the allowable dimensional scaling. In contrast, we show that a CLT holds for SGD iterates when the number of iterations grows as t \gtrsim d^{1+\delta} for any \delta > 0, significantly extending the dimensional regime permitted by prior works while improving computational efficiency. The proposed online SGD-based procedure operates in \mathcal{O}(td) time and requires only \mathcal{O}(d) memory, in contrast to the \mathcal{O}(td^2 + d^3) runtime of covariance-inversion methods. To render the theory practically applicable, we further develop an online variance estimator for the asymptotic variance appearing in the CLT and establish high-probability deviation bounds for this estimator. Collectively, these results yield the first fully online and data-driven framework for constructing confidence intervals for SGD iterates in the near-optimal scaling regime t \gtrsim d^{1+\delta}.
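To make the cost claims concrete, here is a minimal sketch of plain online least-squares SGD that tracks one linear functional of the iterate. Each step touches a single d-dimensional sample, so t steps cost O(td) time and O(d) memory. The constant step size, the function name, and the `direction` argument are assumptions for illustration; the paper's online variance estimator is not reproduced here because the abstract does not specify it.

```python
import numpy as np

def online_least_squares_sgd(stream, dim, step_size, direction):
    """Run one pass of SGD on the squared loss over a stream of (x, y)
    pairs and return the final iterate and the functional <direction, theta>."""
    theta = np.zeros(dim)
    for x, y in stream:
        # Gradient of 0.5 * (y - x @ theta)**2 is -(y - x @ theta) * x.
        residual = y - x @ theta
        theta += step_size * residual * x
    return theta, float(direction @ theta)
```

No covariance matrix is ever formed or inverted, which is the contrast the abstract draws with the O(td^2 + d^3) covariance-inversion approaches.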
[LG-9] Bridging Earth and Space: A Survey on HAPS for Non-Terrestrial Networks
Link: https://arxiv.org/abs/2510.19731
Authors: G. Svistunov (1), A. Akhtarshenas (1), D. López-Pérez (1), M. Giordani (2), G. Geraci (3), H. Yanikomeroglu (4) ((1) Universitat Politècnica de València, (2) University of Padova, (3) Universitat Pompeu Fabra, (4) Carleton University)
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: 30 pages. This work has been submitted to IEEE Communications Surveys & Tutorials (under review)
Abstract:High-altitude platform stations (HAPS) are emerging as key enablers in the evolution of 6G wireless networks, bridging terrestrial and non-terrestrial infrastructures. Operating in the stratosphere, HAPS can provide wide-area coverage and low-latency, energy-efficient broadband communications with flexible deployment options for diverse applications. This survey delivers a comprehensive overview of HAPS use cases, technologies, and integration strategies within the 6G ecosystem. The roles of HAPS in extending connectivity to underserved regions, supporting dynamic backhauling, enabling massive IoT, and delivering reliable low-latency communications for autonomous and immersive services are discussed. The paper reviews state-of-the-art architectures for terrestrial and non-terrestrial network integration and highlights recent field trials. Furthermore, key enabling technologies such as channel modeling, AI-driven resource allocation, interference control, mobility management, and energy-efficient communications are examined. The paper also outlines open research challenges. By addressing existing gaps in the literature, this survey positions HAPS as a foundational component of globally integrated, resilient, and sustainable 6G networks.
[LG-10] SEMPO: Lightweight Foundation Models for Time Series Forecasting NEURIPS2025
Link: https://arxiv.org/abs/2510.19710
Authors: Hui He, Kun Yi, Yuanchi Ma, Qi Zhang, Zhendong Niu, Guansong Pang
Categories: Machine Learning (cs.LG)
Comments: Accepted by NeurIPS 2025
Abstract:The recent boom of large pre-trained models witnesses remarkable success in developing foundation models (FMs) for time series forecasting. Despite impressive performance across diverse downstream forecasting tasks, existing time series FMs possess massive network architectures and require substantial pre-training on large-scale datasets, which significantly hinders their deployment in resource-constrained environments. In response to this growing tension between versatility and affordability, we propose SEMPO, a novel lightweight foundation model that requires pretraining on relatively small-scale data, yet exhibits strong general time series forecasting. Concretely, SEMPO comprises two key modules: 1) energy-aware SpEctral decomposition module, that substantially improves the utilization of pre-training data by modeling not only the high-energy frequency signals but also the low-energy yet informative frequency signals that are ignored in current methods; and 2) Mixture-of-PrOmpts enabled Transformer, that learns heterogeneous temporal patterns through small dataset-specific prompts and adaptively routes time series tokens to prompt-based experts for parameter-efficient model adaptation across different datasets and domains. Equipped with these modules, SEMPO significantly reduces both pre-training data scale and model size, while achieving strong generalization. Extensive experiments on two large-scale benchmarks covering 16 datasets demonstrate the superior performance of SEMPO in both zero-shot and few-shot forecasting scenarios compared with state-of-the-art methods. Code and data are available at this https URL.
[LG-11] Fast Inference via Hierarchical Speculative Decoding
Link: https://arxiv.org/abs/2510.19705
Authors: Amir Globerson, Haim Kaplan, Yishay Mansour, Clara Mohri, Tal Schuster
Categories: Machine Learning (cs.LG)
Abstract:Transformer language models generate text autoregressively, making inference latency proportional to the number of tokens generated. Speculative decoding reduces this latency without sacrificing output quality, by leveraging a small draft model to propose tokens that the larger target model verifies in parallel. In practice, however, there may exist a set of potential draft models, ranging from faster but less accurate to slower yet more reliable. We introduce Hierarchical Speculative Decoding (HSD), an algorithm that stacks these draft models into a hierarchy, where each model proposes tokens, and the next larger model verifies them in a single forward pass, until finally the target model verifies tokens. We derive an expression for the expected latency of any such hierarchy and show that selecting the latency-optimal hierarchy can be done in polynomial time. Empirically, HSD gives up to 1.2x speed-up over the best single-draft baseline, demonstrating the practicality of our algorithm in reducing generation latency beyond previous techniques.
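For intuition about why a draft model helps, the sketch below evaluates the standard single-draft speculative decoding analysis, not the hierarchical latency expression from this paper (which the abstract does not state). It assumes each drafted token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per round, and that one draft step costs `cost_ratio` times one target forward pass. All three names are illustrative.

```python
def expected_tokens_per_target_pass(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target forward pass when gamma tokens
    are drafted and each is accepted independently with probability alpha."""
    if alpha >= 1.0:
        # Every draft is accepted, plus one token from the target itself.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Expected wall-clock speed-up over plain autoregressive decoding.
    One round costs gamma draft steps plus one target pass."""
    round_cost = gamma * cost_ratio + 1.0
    return expected_tokens_per_target_pass(alpha, gamma) / round_cost
```

The trade-off the abstract describes is visible here: a faster draft lowers `cost_ratio` but usually also lowers `alpha`, and a draft as expensive as the target (`cost_ratio` of 1) makes the speed-up drop below 1.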
[LG-12] Policy Learning with Abstention
Link: https://arxiv.org/abs/2510.19672
Authors: Ayush Sawarni, Jikai Jin, Justin Whitehouse, Vasilis Syrgkanis
Categories: Machine Learning (cs.LG); Econometrics (econ.EM); Machine Learning (stat.ML)
Abstract:Policy learning algorithms are widely used in areas such as personalized medicine and advertising to develop individualized treatment regimes. However, most methods force a decision even when predictions are uncertain, which is risky in high-stakes settings. We study policy learning with abstention, where a policy may defer to a safe default or an expert. When a policy abstains, it receives a small additive reward on top of the value of a random guess. We propose a two-stage learner that first identifies a set of near-optimal policies and then constructs an abstention rule from their disagreements. We establish fast O(1/n)-type regret guarantees when propensities are known, and extend these guarantees to the unknown-propensity case via a doubly robust (DR) objective. We further show that abstention is a versatile tool with direct applications to other core problems in policy learning: it yields improved guarantees under margin conditions without the common realizability assumption, connects to distributionally robust policy learning by hedging against small data shifts, and supports safe policy improvement by ensuring improvement over a baseline policy with high probability.
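The two-stage idea can be sketched for a finite policy class with known propensities. The details below are assumptions made for illustration, not the paper's estimator: policy values are scored with a plain inverse-propensity-weighted (IPW) average, "near-optimal" means within a hypothetical tolerance `tol` of the best score, and the learner abstains wherever the surviving policies disagree.

```python
import numpy as np

ABSTAIN = -1  # sentinel meaning "defer to the safe default or an expert"

def ipw_value(policy, contexts, actions, rewards, propensities):
    """IPW estimate of a policy's value from logged bandit data."""
    chosen = np.array([policy(x) for x in contexts])
    return float(np.mean((chosen == actions) * rewards / propensities))

def near_optimal_set(policies, contexts, actions, rewards, propensities, tol):
    """Stage 1: keep every policy whose estimated value is within tol of the best."""
    values = [ipw_value(p, contexts, actions, rewards, propensities) for p in policies]
    best = max(values)
    return [p for p, v in zip(policies, values) if v >= best - tol]

def abstaining_policy(candidates):
    """Stage 2: act only where all near-optimal candidates agree."""
    def decide(x):
        votes = {p(x) for p in candidates}
        return votes.pop() if len(votes) == 1 else ABSTAIN
    return decide
```

A tighter `tol` keeps fewer candidates and therefore abstains less often; a looser one abstains on more of the context space.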
[LG-13] Overlap-weighted orthogonal meta-learner for treatment effect estimation over time
Link: https://arxiv.org/abs/2510.19643
Authors: Konstantin Hess, Dennis Frauen, Mihaela van der Schaar, Stefan Feuerriegel
Categories: Machine Learning (cs.LG)
Abstract:Estimating heterogeneous treatment effects (HTEs) in time-varying settings is particularly challenging, as the probability of observing certain treatment sequences decreases exponentially with longer prediction horizons. Thus, the observed data contain little support for many plausible treatment sequences, which creates severe overlap problems. Existing meta-learners for the time-varying setting typically assume adequate treatment overlap, and thus suffer from exploding estimation variance when the overlap is low. To address this problem, we introduce a novel overlap-weighted orthogonal (WO) meta-learner for estimating HTEs that targets regions in the observed data with high probability of receiving the interventional treatment sequences. This offers a fully data-driven approach through which our WO-learner can counteract instabilities as in existing meta-learners and thus obtain more reliable HTE estimates. Methodologically, we develop a novel Neyman-orthogonal population risk function that minimizes the overlap-weighted oracle risk. We show that our WO-learner has the favorable property of Neyman-orthogonality, meaning that it is robust against misspecification in the nuisance functions. Further, our WO-learner is fully model-agnostic and can be applied to any machine learning model. Through extensive experiments with both transformer and LSTM backbones, we demonstrate the benefits of our novel WO-learner.
[LG-14] Latent Space Factorization in LoRA NEURIPS2025
Link: https://arxiv.org/abs/2510.19640
Authors: Shashi Kumar, Yacouba Kaloga, John Mitros, Petr Motlicek, Ina Kodrasi
Categories: Machine Learning (cs.LG)
Comments: Accepted at NeurIPS 2025
Abstract:Low-rank adaptation (LoRA) is a widely used method for parameter-efficient finetuning. However, existing LoRA variants lack mechanisms to explicitly disambiguate task-relevant information within the learned low-rank subspace, potentially limiting downstream performance. We propose Factorized Variational Autoencoder LoRA (FVAE-LoRA), which leverages a VAE to learn two distinct latent spaces. Our novel Evidence Lower Bound formulation explicitly promotes factorization between the latent spaces, dedicating one latent space to task-salient features and the other to residual information. Extensive experiments on text, audio, and image tasks demonstrate that FVAE-LoRA consistently outperforms standard LoRA. Moreover, spurious correlation evaluations confirm that FVAE-LoRA better isolates task-relevant signals, leading to improved robustness under distribution shifts. Our code is publicly available at: this https URL
[LG-15] Matrix-Free Least Squares Solvers: Values, Gradients, and What to Do With Them
Link: https://arxiv.org/abs/2510.19634
Authors: Hrittik Roy, Søren Hauberg, Nicholas Krämer
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Abstract:This paper argues that the method of least squares has significant unfulfilled potential in modern machine learning, far beyond merely being a tool for fitting linear models. To release its potential, we derive custom gradients that transform the solver into a differentiable operator, like a neural network layer, enabling many diverse applications. Empirically, we demonstrate: (i) scalability by enforcing weight sparsity on a 50 million parameter model; (ii) imposing conservativeness constraints in score-based generative models; and (iii) hyperparameter tuning of Gaussian processes based on predictive performance. By doing this, our work represents the next iteration in developing differentiable linear-algebra tools and making them widely accessible to machine learning practitioners.
[LG-16] Learning and Simulating Building Evacuation Patterns for Enhanced Safety Design Using Generative Models
Link: https://arxiv.org/abs/2510.19623
Authors: Jin Han, Zhe Zheng, Yi Gu, Jia-Rui Lin, Xin-Zheng Lu
Categories: Machine Learning (cs.LG)
Abstract:Evacuation simulation is essential for building safety design, ensuring properly planned evacuation routes. However, traditional evacuation simulation relies heavily on refined modeling with extensive parameters, making it challenging to adopt such methods in a rapid iteration process in early design stages. Thus, this study proposes DiffEvac, a novel method to learn building evacuation patterns based on Generative Models (GMs), for efficient evacuation simulation and enhanced safety design. Initially, a dataset of 399 diverse functional layouts and corresponding evacuation heatmaps of buildings was established. Then, a decoupled feature representation is proposed to embed physical features like layouts and occupant density for GMs. Finally, a diffusion model based on image prompts is proposed to learn evacuation patterns from simulated evacuation heatmaps. Compared to existing research using Conditional GANs with RGB representation, DiffEvac achieves up to a 37.6% improvement in SSIM, 142% in PSNR, and delivers results 16 times faster, thereby cutting simulation time to 2 minutes. Case studies further demonstrate that the proposed method not only significantly enhances the rapid design iteration and adjustment process with efficient evacuation simulation but also offers new insights and technical pathways for future safety optimization in intelligent building design. The research implication is that the approach lowers the modeling burden, enables large-scale what-if exploration, and facilitates coupling with multi-objective design tools.
[LG-17] A Climate-Aware Deep Learning Framework for Generalizable Epidemic Forecasting
Link: https://arxiv.org/abs/2510.19611
Authors: Jinpyo Hong, Rachel E. Baker
Categories: Machine Learning (cs.LG)
Abstract:Precise outbreak forecasting of infectious diseases is essential for effective public health responses and epidemic control. The increased availability of machine learning (ML) methods for time-series forecasting presents an enticing avenue to enhance outbreak forecasting. Though the COVID-19 outbreak demonstrated the value of applying ML models to predict epidemic profiles, using ML models to forecast endemic diseases remains underexplored. In this work, we present ForecastNet-XCL (an ensemble model based on XGBoost+CNN+BiLSTM), a deep learning hybrid framework designed to address this gap by creating accurate multi-week RSV forecasts up to 100 weeks in advance based on climate and temporal data, without access to real-time surveillance on RSV. The framework combines high-resolution feature learning with long-range temporal dependency capturing mechanisms, bolstered by an autoregressive module trained on climate-controlled lagged relations. Stochastic inference returns probabilistic intervals to inform decision-making. Evaluated across 34 U.S. states, ForecastNet-XCL reliably outperformed statistical baselines, individual neural nets, and conventional ensemble methods in both within- and cross-state scenarios, sustaining accuracy over extended forecast horizons. Training on climatologically diverse datasets further enhanced generalization, particularly in locations with irregular or biennial RSV patterns. ForecastNet-XCL’s efficiency, performance, and uncertainty-aware design make it a deployable early-warning tool amid escalating climate pressures and constrained surveillance resources.
[LG-18] The Confusing Instance Principle for Online Linear Quadratic Control
Link: https://arxiv.org/abs/2510.19531
Authors: Waris Radji (Scool, CRIStAL), Odalric-Ambrym Maillard (Scool, CRIStAL)
Categories: Machine Learning (cs.LG)
Abstract:We revisit the problem of controlling linear systems with quadratic cost under unknown dynamics with model-based reinforcement learning. Traditional methods like Optimism in the Face of Uncertainty and Thompson Sampling, rooted in multi-armed bandits (MABs), face practical limitations. In contrast, we propose an alternative based on the Confusing Instance (CI) principle, which underpins regret lower bounds in MABs and discrete Markov Decision Processes (MDPs) and is central to the Minimum Empirical Divergence (MED) family of algorithms, known for their asymptotic optimality in various settings. By leveraging the structure of LQR policies along with sensitivity and stability analysis, we develop MED-LQ. This novel control strategy extends the principles of CI and MED beyond small-scale settings. Our benchmarks on a comprehensive control suite demonstrate that MED-LQ achieves competitive performance in various scenarios while highlighting its potential for broader applications in large-scale MDPs.
[LG-19] Bi-Level Decision-Focused Causal Learning for Large-Scale Marketing Optimization: Bridging Observational and Experimental Data NEURIPS2025
Link: https://arxiv.org/abs/2510.19517
Authors: Shuli Zhang, Hao Zhou, Jiaqi Zheng, Guibin Jiang, Bing Cheng, Wei Lin, Guihai Chen
Categories: Machine Learning (cs.LG)
Comments: Accepted by NeurIPS 2025
Abstract:Online Internet platforms require sophisticated marketing strategies to optimize user retention and platform revenue – a classical resource allocation problem. Traditional solutions adopt a two-stage pipeline: machine learning (ML) for predicting individual treatment effects to marketing actions, followed by operations research (OR) optimization for decision-making. This paradigm presents two fundamental technical challenges. First, the prediction-decision misalignment: Conventional ML methods focus solely on prediction accuracy without considering downstream optimization objectives, leading to improved predictive metrics that fail to translate to better decisions. Second, the bias-variance dilemma: Observational data suffers from multiple biases (e.g., selection bias, position bias), while experimental data (e.g., randomized controlled trials), though unbiased, is typically scarce and costly – resulting in high-variance estimates. We propose Bi-level Decision-Focused Causal Learning (Bi-DFCL) that systematically addresses these challenges. First, we develop an unbiased estimator of OR decision quality using experimental data, which guides ML model training through surrogate loss functions that bridge discrete optimization gradients. Second, we establish a bi-level optimization framework that jointly leverages observational and experimental data, solved via implicit differentiation. This novel formulation enables our unbiased OR estimator to correct learning directions from biased observational data, achieving optimal bias-variance tradeoff. Extensive evaluations on public benchmarks, industrial marketing datasets, and large-scale online A/B tests demonstrate the effectiveness of Bi-DFCL, showing statistically significant improvements over state-of-the-art. Currently, Bi-DFCL has been deployed at Meituan, one of the largest online food delivery platforms in the world.
[LG-20] Teaming LLMs to Detect and Mitigate Hallucinations NEURIPS2025
Link: https://arxiv.org/abs/2510.19507
Authors: Demian Till, John Smeaton, Peter Haubrick, Gouse Saheb, Florian Graef, David Berman
Categories: Machine Learning (cs.LG)
Comments: Accepted to NeurIPS 2025 workshop on Reliable ML from Unreliable Data
Abstract:Recent work has demonstrated state-of-the-art results in large language model (LLM) hallucination detection and mitigation through consistency-based approaches which involve aggregating multiple responses sampled from a single LLM for a given prompt. These approaches help offset limitations stemming from the imperfect data on which LLMs are trained, which includes biases and under-representation of information required at deployment time, among other limitations which can lead to hallucinations. We show that extending these single-model consistency methods to combine responses from multiple LLMs with different training data, training schemes and model architectures can result in substantial further improvements in hallucination detection and mitigation capabilities beyond their single-model consistency counterparts. We evaluate this consortium consistency approach across many model teams from a pool of 15 LLMs and explore under what conditions it is beneficial to team together different LLMs in this manner. Further, we show that these performance improvements often come with reduced inference costs, offsetting a significant drawback with single-model consistency methods.
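One simple way to picture pooling responses across a team of models is a majority vote with an agreement score. The abstract does not say which aggregation rule the paper uses, so this is only a sketch under assumptions: answers are short strings that can be compared after normalization, the most common answer serves as the mitigated output, and a low agreement fraction serves as a hallucination warning. The function and model names are hypothetical.

```python
from collections import Counter

def consortium_vote(responses_by_model):
    """Pool sampled answers from several models and return the most common
    normalized answer together with the fraction of samples that agree."""
    pooled = [
        answer.strip().lower()
        for answers in responses_by_model.values()
        for answer in answers
    ]
    counts = Counter(pooled)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(pooled)

samples = {
    "model_a": ["Paris", "paris"],  # hypothetical model names and outputs
    "model_b": ["Paris "],
    "model_c": ["Lyon"],
}
answer, agreement = consortium_vote(samples)
```

Passing a single model's samples reduces this to the single-model consistency setting the abstract starts from.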
[LG-21] Energy-Efficient and Dequantization-Free Q-LLMs: A Spiking Neural Network Approach to Salient Value Mitigation
Link: https://arxiv.org/abs/2510.19498
Authors: Chenyu Wang, Zhanglu Yan, Zhi Zhou, Xu Chen, Weng-Fai Wong
Categories: Machine Learning (cs.LG)
Abstract:In the era of large language models (LLMs), weight-activation quantization helps fit models on edge devices by reducing memory and compute bit-widths. However, three challenges persist for energy-constrained hardware: (1) even after quantization, multiply-accumulate (MAC) operations remain unavoidable and continue to dominate energy consumption; (2) dequantization (or per-tensor/channel rescaling) introduces extra arithmetic and data movement, increasing latency and energy; (3) uniform parameter bit-widths clip salient values, while intra-channel mixed precision is generally impractical on current matrix hardware and memory. In contrast, brain-inspired Spiking Neural Networks (SNNs), owing to their binary spike-based information representation and the Integrate-and-Fire (IF) paradigm, naturally support mixed-precision storage and energy-efficient computation by replacing complex MACs with temporal accumulate (ACC) operations. Motivated by this property, we propose SpikeQuant, which selectively applies mixed-precision quantization to activations with salient values and re-encodes them into binary spike counts, thereby enabling dynamic mixed storage of different bit-widths. Furthermore, by embedding the quantization scale into the threshold of the IF mechanism, our approach performs energy-efficient linear transformations on weights and activations while avoiding explicit dequantization. Experimental results demonstrate that SpikeQuant consistently achieves near-FP16 perplexity under W4A4 quantization while reducing energy cost by up to 4.6 times compared to existing methods, highlighting its effectiveness for accurate and energy-efficient LLM deployment.
[LG-22] ELUTQ: Efficient LUT-Aware Quantization for Deploying Large Language Models on Edge Devices
Link: https://arxiv.org/abs/2510.19482
Authors: Xin Nie, Liang Dong, HaiCheng Zhang, JiaWang Xiao, G. Sun
Categories: Machine Learning (cs.LG)
Comments: 19 pages, 8 figures
Abstract:The deployment of Large Language Models (LLMs) on CPU-based edge devices is crucial for enabling on-device intelligence and expanding AI accessibility. However, it remains challenging due to limited memory and computational resources. During edge inference, memory usage and latency are the primary bottlenecks. Although weight quantization can effectively reduce memory consumption, existing hardware-friendly approaches often rely on uniform quantization, which poorly fits weight distributions and incurs high dequantization overhead at low bit widths. To address these limitations, we propose ELUTQ, an efficient quantization framework introducing a novel quantization format, Hierarchical Linear Quantization (HLQ). HLQ better captures the statistical characteristics of weights without increasing the computational cost of Bit-serial LUT-based GEMM operations, thereby eliminating dequantization overhead. It is orthogonal to existing quantization algorithms and can be seamlessly integrated into various quantization pipelines. For efficient on-device deployment, ELUTQ provides optimized CPU kernels for end-to-end inference. Experiments show that for LLaMA3-8B, HLQ reduces perplexity by about 8% at 3-bit and 85% at 2-bit precision under post-training quantization, completing quantization within one hour. With efficient finetuning, HLQ further improves 2-bit performance within two hours. In terms of inference efficiency, our 2-bit LLaMA2-7B achieves over 25 tokens/s on an Apple M2 chip (4 threads, batch size = 1).
[LG-23] Online Two-Stage Submodular Maximization NEURIPS2025
Link: https://arxiv.org/abs/2510.19480
Authors: Iasonas Nikolaou,Miltiadis Stouras,Stratis Ioannidis,Evimaria Terzi
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: To appear at NeurIPS 2025
Abstract:Given a collection of monotone submodular functions, the goal of Two-Stage Submodular Maximization (2SSM) [Balkanski et al., 2016] is to restrict the ground set so an objective selected u.a.r. from the collection attains a high maximal value, on average, when optimized over the restricted ground set. We introduce the Online Two-Stage Submodular Maximization (O2SSM) problem, in which the submodular objectives are revealed in an online fashion. We study this problem for weighted threshold potential functions, a large and important subclass of monotone submodular functions that includes influence maximization, data summarization, and facility location, to name a few. We design an algorithm that achieves sublinear $(1 - 1/e)^2$-regret under general matroid constraints and $(1 - 1/e)(1 - e^{-k} k^k / k!)$-regret in the case of uniform matroids of rank $k$; the latter also yields a state-of-the-art bound for the (offline) 2SSM problem. We empirically validate the performance of our online algorithm with experiments on real datasets.
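To get a feel for the two approximation factors quoted in the abstract, the short sketch below simply evaluates them numerically. The function name `uniform_matroid_factor` is made up for this illustration; only the two formulas come from the abstract.

```python
import math


def uniform_matroid_factor(k: int) -> float:
    """Evaluate (1 - 1/e) * (1 - e^{-k} k^k / k!) for a uniform matroid of rank k."""
    poisson_term = math.exp(-k) * k**k / math.factorial(k)
    return (1 - 1 / math.e) * (1 - poisson_term)


# Factor quoted for general matroid constraints.
general_factor = (1 - 1 / math.e) ** 2

for k in (1, 2, 5, 10, 50):
    print(k, round(uniform_matroid_factor(k), 4))
print("general matroid:", round(general_factor, 4))
```

For k = 1 the two factors coincide, since e^{-1} * 1^1 / 1! = 1/e. As k grows, the term e^{-k} k^k / k! shrinks, so the uniform-matroid factor increases toward 1 - 1/e.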
[LG-24] g-DPO: Scalable Preference Optimization for Protein Language Models NEURIPS2025 NIPS2025
Link: https://arxiv.org/abs/2510.19474
Authors: Constance Ferragu,Jonathan D. Ziegler,Nicolas Deutschmann,Arthur Lindoulsi,Eli Bixby,Cradle ML Team
Categories: Machine Learning (cs.LG)
Comments: Accepted at two workshops: FM4LS NeurIPS 2025 ( this https URL ) and MLSB in Copenhagen EurIPS 2025
Abstract:Direct Preference Optimization (DPO) is an effective approach for aligning protein language models with experimental design goals. However, DPO faces a scalability bottleneck: the number of possible training pairs grows quadratically with the number of labeled sequences, leading to prohibitive training times even for modestly sized datasets. We introduce g-DPO, a framework that (i) uses sequence space clustering to prune redundant pairs while preserving training signal, and (ii) amortizes likelihood computations with group-based approximations. Across three protein engineering tasks, g-DPO maintains in-silico and in-vitro performance that is statistically indistinguishable from standard DPO, while converging 1.8 to 3.7 times faster, with greater gains expected as the size of the dataset increases.
[LG-25] Revisiting the Relation Between Robustness and Universality
Link: https://arxiv.org/abs/2510.19427
Authors: M. Klabunde,L. Caspari,F. Lemmerich
Categories: Machine Learning (cs.LG)
Abstract:The modified universality hypothesis proposed by Jones et al. (2022) suggests that adversarially robust models trained for a given task are highly similar. We revisit the hypothesis and test its generality. While we verify Jones’ main claim of high representational similarity in specific settings, results are not consistent across different datasets. We also discover that predictive behavior does not converge with increasing robustness and thus is not universal. We find that differing predictions originate in the classification layer, but show that more universal predictive behavior can be achieved with simple retraining of the classifiers. Overall, our work points towards partial universality of neural networks in specific settings and away from notions of strict universality.
[LG-26] Iterative Training of Physics-Informed Neural Networks with Fourier-enhanced Features
Link: https://arxiv.org/abs/2510.19399
Authors: Yulun Wu,Miguel Aguiar,Karl H. Johansson,Matthieu Barreau
Categories: Machine Learning (cs.LG)
Comments: 9 pages, 5 figures in the main paper
Abstract:Spectral bias, the tendency of neural networks to learn low-frequency features first, is a well-known issue with many training algorithms for physics-informed neural networks (PINNs). To overcome this issue, we propose IFeF-PINN, an algorithm for iterative training of PINNs with Fourier-enhanced features. The key idea is to enrich the latent space using high-frequency components through Random Fourier Features. This creates a two-stage training problem: (i) estimate a basis in the feature space, and (ii) perform regression to determine the coefficients of the enhanced basis functions. For an underlying linear model, it is shown that the latter problem is convex, and we prove that the iterative training scheme converges. Furthermore, we empirically establish that Random Fourier Features enhance the expressive capacity of the network, enabling accurate approximation of high-frequency PDEs. Through extensive numerical evaluation on classical benchmark problems, the superior performance of our method over state-of-the-art algorithms is shown, and the improved approximation across the frequency domain is illustrated.
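The abstract does not spell out the feature construction, so the following is only a minimal sketch of a standard random Fourier feature map, not the IFeF-PINN implementation. The helper `make_rff`, the `sigma` scale and the `2 * pi` factor are assumptions chosen for illustration.

```python
import math
import random


def make_rff(in_dim: int, num_features: int, sigma: float, seed: int = 0):
    """Build a map x -> [cos(2*pi*Bx), sin(2*pi*Bx)] with a fixed random matrix B."""
    rng = random.Random(seed)
    # Gaussian frequency matrix; a larger sigma injects higher frequencies.
    B = [[rng.gauss(0.0, sigma) for _ in range(in_dim)] for _ in range(num_features)]

    def features(x):
        proj = [2 * math.pi * sum(b * xi for b, xi in zip(row, x)) for row in B]
        return [math.cos(p) for p in proj] + [math.sin(p) for p in proj]

    return features


phi = make_rff(in_dim=2, num_features=4, sigma=10.0)
z = phi([0.3, 0.7])  # 4 cosine features followed by 4 sine features
print(len(z))
```

Once the features are fixed, fitting a linear model on top of them is an ordinary regression problem, which is the kind of second stage the abstract describes for determining the coefficients of the enhanced basis functions.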
[LG-27] ARA: Adaptive Rank Allocation for Efficient Large Language Model SVD Compression
Link: https://arxiv.org/abs/2510.19389
Authors: Lin Xv,Jingsheng Gao,Xian Gao,Ting Liu,Yuzhuo Fu
Categories: Machine Learning (cs.LG)
Abstract:In the field of large language model (LLM) compression, singular value decomposition (SVD) is a widely studied and adopted low-rank decomposition technique. Since SVD operates exclusively on linear modules, and these modules in LLMs are separated by nonlinear components, SVD can only be applied independently to each linear module. Under a global compression ratio constraint, determining the appropriate rank for different linear modules becomes a critical problem. Existing approaches, such as heuristic algorithms and mask-based training, have made progress in addressing this challenge. However, these methods still suffer from several limitations: heuristic algorithms explore the solution space within restricted regions, while mask-based training struggles to efficiently capture the relationship between singular value spectra and trainable parameters. More importantly, current methods overlook the key property that the gain function is non-smooth at a compression ratio of 1, which often leads the training process to suboptimal local minima. To address these issues, we propose an Adaptive Rank Allocation (ARA) method. Specifically, (1) ARA introduces a dedicated mask design that enables efficient mapping and updating between retained ranks and trainable parameters; and (2) it employs an additional loss function to guide parameter selection toward globally optimal solutions. Experimental results demonstrate that ARA achieves state-of-the-art performance. On the LLaMA2-7B model with an 80% compression ratio, ARA reduces perplexity on WikiText2 from 8.38 to 6.42 and improves average zero-shot task accuracy by 9.72 percentage points compared with uniform compression. These results highlight the effectiveness of our method for rank allocation in SVD-based LLM compression.
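Why rank allocation matters can be seen from simple parameter counting. The sketch below is illustrative arithmetic only and does not reproduce ARA itself: it assumes a dense m x n weight is replaced by a rank-r factorization, and it reads `keep_ratio` as the fraction of parameters retained, which is an assumption about how the ratio is defined.

```python
def low_rank_params(m: int, n: int, r: int) -> int:
    """Parameters stored by a rank-r factorization W ~= A @ B (A is m x r, B is r x n)."""
    return r * (m + n)


def rank_for_keep_ratio(m: int, n: int, keep_ratio: float) -> int:
    """Largest rank whose factorized size stays within keep_ratio * m * n parameters."""
    return int(keep_ratio * m * n / (m + n))


m, n = 4096, 4096  # hypothetical square projection matrix
break_even = rank_for_keep_ratio(m, n, 1.0)  # above this rank, the factors outgrow the dense matrix
r80 = rank_for_keep_ratio(m, n, 0.8)
print(break_even, r80, low_rank_params(m, n, r80) / (m * n))
```

For a square 4096 x 4096 matrix the break-even rank is 2048, half the full rank. A module that needs more rank than that is cheaper to keep dense, which is one reason a per-module rank budget is not a trivial choice.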
[LG-28] CPSVD: Enhancing Large Language Model Compression via Column-Preserving Singular Value Decomposition
Link: https://arxiv.org/abs/2510.19385
Authors: Lin Xv,Jingsheng Gao,Xian Gao,Ting Li,Yuzhuo Fu
Categories: Machine Learning (cs.LG)
Abstract:The rapid advancement of Large Language Models (LLMs) faces a critical bottleneck in their immense size, necessitating efficient compression techniques. While Singular Value Decomposition (SVD) is a promising approach, existing SVD-based methods treat the entire parameter matrix uniformly, overlooking that SVD approximation errors vary significantly across different matrix parts, which often leads to suboptimal compression. To address this, we propose Column-Preserving Singular Value Decomposition (CPSVD), a novel method that refines SVD-based LLM compression by intelligently segmenting the parameter matrix. Unlike traditional SVD, CPSVD identifies and directly preserves matrix columns with high decomposition errors, applying SVD only to columns with low decomposition errors, while precisely determining the optimal balance point between these two strategies to minimize error. Furthermore, leveraging the inherent heterogeneity in decomposition errors across different matrices within an LLM, CPSVD adaptively allocates non-uniform compression rates to modules within that layer, while adhering to a target layer-wise compression ratio, thereby further enhancing compression performance. Extensive experiments demonstrate that CPSVD consistently outperforms state-of-the-art SVD-based LLM compression methods, achieving lower perplexity and higher accuracy on zero-shot tasks.
[LG-29] Learning Noise-Resilient and Transferable Graph-Text Alignment via Dynamic Quality Assessment
Link: https://arxiv.org/abs/2510.19384
Authors: Yuhang Liu,Minglai Shao,Zengyi Wo,Yunlong Chu,Bing Hao,Shengzhong Liu,Ruijie Wang,Jianxin Li
Categories: Machine Learning (cs.LG)
Abstract:Pre-training Graph Foundation Models (GFMs) on text-attributed graphs (TAGs) is central to web-scale applications such as search, recommendation, and knowledge discovery. However, existing CLIP-style graph-text aligners face two key limitations: they assume strict one-to-one correspondences between nodes and texts, overlooking the inherent many-to-many relations in real-world graphs; and they rely on static alignment objectives that cannot adapt to varying data quality, making them brittle under noisy supervision. Together, these limitations expose a core dilemma: embracing expressive many-to-many alignment amplifies noise, while reverting to strict one-to-one strategies sacrifices semantic diversity and fails to handle inherently mismatched pairs. To address these challenges, we propose ADAligner, a dynamic, quality-aware graph-text alignment framework that dynamically adjusts between expressive many-to-many and conservative one-to-one objectives according to supervision quality. ADAligner estimates batch-level alignment reliability in real time and adapts its optimization accordingly, promoting soft, subgraph-level many-to-many alignment when supervision is clean, while emphasizing reliable one-to-one alignment by dynamically filtering low-confidence pairs under noise. Theoretically, we prove that this dynamic mechanism forms a stable negative feedback process, ensuring convergence and robustness. Comprehensive experiments on nine diverse TAG datasets demonstrate that ADAligner consistently outperforms prior graph-text aligners on zero-/few-shot node classification, link prediction and cross-modal retrieval tasks. It maintains strong robustness under noisy supervision and accelerates pre-training by approximately 2 to 3 times compared to multimodal baselines, establishing a scalable and reliable foundation for graph-text representation learning in real-world web environments.
[LG-30] LMFD: Latent Monotonic Feature Discovery
Link: https://arxiv.org/abs/2510.19383
Authors: Guus Toussaint,Arno Knobbe
Categories: Machine Learning (cs.LG); Symbolic Computation (cs.SC)
Comments: This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution is published in Machine Learning and Principles and Practice of Knowledge Discovery in Databases, and is available online at this https URL
Abstract:Many systems in our world age, degrade or otherwise move slowly but steadily in a certain direction. When monitoring such systems by means of sensors, one often assumes that some form of 'age' is latently present in the data, but perhaps the available sensors do not readily provide this useful information. The task that we study in this paper is to extract potential proxies for this 'age' from the available multi-variate time series without having clear data on what 'age' actually is. We argue that when we find a sensor, or more likely some discovered function of the available sensors, that is sufficiently monotonic, that function can act as the proxy we are searching for. Using a carefully defined grammar and optimising the resulting equations in terms of monotonicity, defined as the absolute Spearman's Rank Correlation between time and the candidate formula, the proposed approach generates a set of candidate features which are then fitted and assessed on monotonicity. The proposed system is evaluated against an artificially generated dataset and two real-world datasets. In all experiments, we show that the system is able to combine sensors with low individual monotonicity into latent features with high monotonicity. For the real-world dataset of InfraWatch, a structural health monitoring project, we show that two features with individual absolute Spearman's $\rho$ values of 0.13 and 0.09 can be combined into a proxy with an absolute Spearman's $\rho$ of 0.95. This demonstrates that our proposed method can find interpretable equations which can serve as a proxy for the 'age' of the system.
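The monotonicity score described in the abstract, the absolute Spearman rank correlation between time and a candidate feature, can be sketched in a few lines of plain Python. The synthetic sensors `s1` and `s2` and the candidate formula `s1 - s2` are invented for this illustration; the grammar-based search from the paper is not reproduced.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            result[order[pos]] = mean_rank
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def monotonicity(series):
    """Absolute Spearman correlation between the time index and the series."""
    time = list(range(len(series)))
    return abs(pearson(ranks(time), ranks(series)))


# Two noisy synthetic sensors whose difference is a clean trend.
s2 = [10 * math.sin(t) for t in range(20)]
s1 = [t + w for t, w in zip(range(20), s2)]
candidate = [a - b for a, b in zip(s1, s2)]
print(monotonicity(s1), monotonicity(s2), monotonicity(candidate))
```

Here neither sensor is monotonic on its own, but the candidate feature is strictly increasing in time and therefore reaches the maximum score of 1.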
[LG-31] Optimization Benchmark for Diffusion Models on Dynamical Systems
Link: https://arxiv.org/abs/2510.19376
Authors: Fabian Schaipp
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Abstract:The training of diffusion models is often absent in the evaluation of new optimization techniques. In this work, we benchmark recent optimization algorithms for training a diffusion model for denoising flow trajectories. We observe that Muon and SOAP are highly efficient alternatives to AdamW (18% lower final loss). We also revisit several recent phenomena related to the training of models for text or image applications in the context of diffusion model training. This includes the impact of the learning-rate schedule on the training dynamics, and the performance gap between Adam and SGD.
[LG-32] Using Temperature Sampling to Effectively Train Robot Learning Policies on Imbalanced Datasets
Link: https://arxiv.org/abs/2510.19373
Authors: Basavasagar Patil,Sydney Belt,Jayjun Lee,Nima Fazeli,Bernadette Bucher
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Abstract:Increasingly large datasets of robot actions and sensory observations are being collected to train ever-larger neural networks. These datasets are collected based on tasks and while these tasks may be distinct in their descriptions, many involve very similar physical action sequences (e.g., ‘pick up an apple’ versus ‘pick up an orange’). As a result, many datasets of robotic tasks are substantially imbalanced in terms of the physical robotic actions they represent. In this work, we propose a simple sampling strategy for policy training that mitigates this imbalance. Our method requires only a few lines of code to integrate into existing codebases and improves generalization. We evaluate our method in both pre-training small models and fine-tuning large foundational models. Our results show substantial improvements on low-resource tasks compared to prior state-of-the-art methods, without degrading performance on high-resource tasks. This enables more effective use of model capacity for multi-task policies. We also further validate our approach in a real-world setup on a Franka Panda robot arm across a diverse set of tasks.
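The abstract does not give the sampling formula, so the sketch below assumes the common temperature-based form in which a group with n_i samples is drawn with probability proportional to n_i^(1/T). The task names and counts are hypothetical, and how the paper groups data by physical action is not reproduced here.

```python
import random


def temperature_weights(counts, temperature):
    """Sampling probabilities p_i proportional to n_i ** (1 / temperature)."""
    scaled = {task: n ** (1.0 / temperature) for task, n in counts.items()}
    total = sum(scaled.values())
    return {task: s / total for task, s in scaled.items()}


# Hypothetical, heavily imbalanced dataset sizes per task.
counts = {"pick": 9000, "open_drawer": 900, "fold_cloth": 100}

p1 = temperature_weights(counts, 1.0)  # T = 1 reproduces the raw data proportions
p5 = temperature_weights(counts, 5.0)  # T > 1 flattens the distribution

# Draw the task for each element of a training batch using the flattened weights.
rng = random.Random(0)
batch_tasks = rng.choices(list(p5), weights=list(p5.values()), k=8)
print(p1, p5, batch_tasks)
```

Raising the temperature moves probability mass from the dominant task toward rare ones, and in the limit of very large T every task is sampled uniformly.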
[LG-33] AMAuT: A Flexible and Efficient Multiview Audio Transformer Framework Trained from Scratch
Link: https://arxiv.org/abs/2510.19368
Authors: Weichuang Shao,Iman Yi Liao,Tomas Henrique Bode Maul,Tissa Chandesa
Categories: Sound (cs.SD); Machine Learning (cs.LG)
[LG-34] Autobidding Arena: unified evaluation of the classical and RL-based autobidding algorithms
Link: https://arxiv.org/abs/2510.19357
Authors: Andrey Pudovikov,Alexandra Khirianova,Ekaterina Solodneva,Aleksandr Katrutsa,Egor Samosvat,Yuriy Dorn
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Abstract:Advertisement auctions play a crucial role in revenue generation for e-commerce companies. To make the bidding procedure scalable to thousands of auctions, the automatic bidding (autobidding) algorithms are actively developed in the industry. Therefore, the fair and reproducible evaluation of autobidding algorithms is an important problem. We present a standardized and transparent evaluation protocol for comparing classical and reinforcement learning (RL) autobidding algorithms. We consider the most efficient autobidding algorithms from different classes, e.g., ones based on the controllers, RL, optimal formulas, etc., and benchmark them in the bidding environment. We utilize the most recent open-source environment developed in the industry, which accurately emulates the bidding process. Our work demonstrates the most promising use cases for the considered autobidding algorithms, highlights their surprising drawbacks, and evaluates them according to multiple metrics. We select the evaluation metrics that illustrate the performance of the autobidding algorithms, the corresponding costs, and track the budget pacing. Such a choice of metrics makes our results applicable to the broad range of platforms where autobidding is effective. The presented comparison results help practitioners to evaluate the candidate autobidding algorithms from different perspectives and select ones that are efficient according to their companies’ targets.
[LG-35] ConvXformer: Differentially Private Hybrid ConvNeXt-Transformer for Inertial Navigation
Link: https://arxiv.org/abs/2510.19352
Authors: Omer Tariq,Muhammad Bilal,Muneeb Ul Hassan,Dongsoo Han,Jon Crowcroft
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Robotics (cs.RO)
Comments: 14 pages, 8 figures, 3 tables
[LG-36] Scalable LinUCB: Low-Rank Design Matrix Updates for Recommenders with Large Action Spaces
Link: https://arxiv.org/abs/2510.19349
Authors: Evgenia Shustova,Marina Sheshukova,Sergey Samsonov,Evgeny Frolov
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
[LG-37] A Markov Decision Process for Variable Selection in Branch & Bound
Link: https://arxiv.org/abs/2510.19348
Authors: Paul Strang,Zacharie Alès,Côme Bissuel,Olivier Juan,Safia Kedad-Sidhoum,Emmanuel Rachelson
Categories: Machine Learning (cs.LG)
Abstract:Mixed-Integer Linear Programming (MILP) is a powerful framework used to address a wide range of NP-hard combinatorial optimization problems, often solved by Branch and Bound (B&B). A key factor influencing the performance of B&B solvers is the variable selection heuristic governing branching decisions. Recent contributions have sought to adapt reinforcement learning (RL) algorithms to the B&B setting to learn optimal branching policies, through Markov Decision Process (MDP) inspired formulations, and ad hoc convergence theorems and algorithms. In this work, we introduce BBMDP, a principled vanilla MDP formulation for variable selection in B&B, making it possible to leverage a broad range of RL algorithms for the purpose of learning optimal B&B heuristics. Computational experiments validate our model empirically, as our branching agent outperforms prior state-of-the-art RL agents on four standard MILP benchmarks.
[LG-38] Calibration and Discrimination Optimization Using Clusters of Learned Representation
Link: https://arxiv.org/abs/2510.19328
Authors: Tomer Lavi,Bracha Shapira,Nadav Rappoport
Categories: Machine Learning (cs.LG)
[LG-39] Transformers are Inherently Succinct
Link: https://arxiv.org/abs/2510.19315
Authors: Pascal Bergsträßer,Ryan Cotterell,Anthony W. Lin
Categories: Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Abstract:We propose succinctness as a measure of the expressive power of a transformer in describing a concept. To this end, we prove that transformers are highly expressive in that they can represent formal languages substantially more succinctly than standard representations of formal languages like finite automata and Linear Temporal Logic (LTL) formulas. As a by-product of this expressivity, we show that verifying properties of transformers is provably intractable (i.e. EXPSPACE-complete).
[LG-40] Loopholing Discrete Diffusion: Deterministic Bypass of the Sampling Wall
Link: https://arxiv.org/abs/2510.19304
Authors: Mingyu Jo,Jaesik Yoon,Justin Deschenaux,Caglar Gulcehre,Sungjin Ahn
Categories: Machine Learning (cs.LG)
Abstract:Discrete diffusion models offer a promising alternative to autoregressive generation through parallel decoding, but they suffer from a sampling wall: once categorical sampling occurs, rich distributional information collapses into one-hot vectors and cannot be propagated across steps, forcing subsequent steps to operate with limited information. To mitigate this problem, we introduce Loopholing, a novel and simple mechanism that preserves this information via a deterministic latent pathway, leading to Loopholing Discrete Diffusion Models (LDDMs). Trained efficiently with a self-conditioning strategy, LDDMs achieve substantial gains, reducing generative perplexity by up to 61% over prior baselines, closing (and in some cases surpassing) the gap with autoregressive models, and producing more coherent text. Applied to reasoning tasks, LDDMs also improve performance on arithmetic benchmarks such as Countdown and Game of 24. These results also indicate that loopholing mitigates idle steps and oscillations, providing a scalable path toward high-quality non-autoregressive text generation.
[LG-41] QiMeng-SALV: Signal-Aware Learning for Verilog Code Generation NEURIPS2025
Link: https://arxiv.org/abs/2510.19296
Authors: Yang Zhang,Rui Zhang,Jiaming Guo,Lei Huang,Di Huang,Yunpu Zhao,Shuyao Cheng,Pengwei Jin,Chongxiao Li,Zidong Du,Xing Hu,Qi Guo,Yunji Chen
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Programming Languages (cs.PL)
Comments: Accepted to NeurIPS 2025
Abstract:The remarkable progress of Large Language Models (LLMs) presents promising opportunities for Verilog code generation, which is important for automated circuit design. The lack of meaningful functional rewards hinders preference optimization based on Reinforcement Learning (RL) for producing functionally correct Verilog code. In this paper, we propose Signal-Aware Learning for Verilog code generation (QiMeng-SALV), which leverages code segments of functionally correct output signals to optimize RL training. Because Verilog code specifies the structural interconnection of hardware gates and wires, so that different output signals are independent, the key insight of QiMeng-SALV is to extract verified signal-aware implementations from partially incorrect modules, so as to enhance the extraction of meaningful functional rewards. Roughly, we verify the functional correctness of signals in the generated module by comparing them with those of the reference module in the training data. Then an abstract syntax tree (AST) is employed to identify signal-aware code segments which can provide meaningful functional rewards from erroneous modules. Finally, we introduce signal-aware DPO, which is optimized on the correct signal-level code segments, thereby preventing noise and interference from incorrect signals. The proposed QiMeng-SALV underscores the paradigm shift from conventional module-level to fine-grained signal-level optimization in Verilog code generation, addressing the issue of insufficient functional rewards. Experiments demonstrate that our method achieves state-of-the-art performance on VerilogEval and RTLLM, with a 7B parameter model matching the performance of the DeepSeek v3 671B model and significantly outperforming the leading open-source model CodeV trained on the same dataset. Our code is available at this https URL.
[LG-42] Knowledge Distillation of Uncertainty using Deep Latent Factor Model
Link: https://arxiv.org/abs/2510.19290
Authors: Sehyun Park,Jongjin Lee,Yunseop Shin,Ilsang Ohn,Yongdai Kim
Categories: Machine Learning (cs.LG); Methodology (stat.ME)
Abstract:Deep ensembles deliver state-of-the-art, reliable uncertainty quantification, but their heavy computational and memory requirements hinder their practical deployments to real applications such as on-device AI. Knowledge distillation compresses an ensemble into small student models, but existing techniques struggle to preserve uncertainty partly because reducing the size of DNNs typically results in variation reduction. To resolve this limitation, we introduce a new method of distribution distillation (i.e. compressing a teacher ensemble into a student distribution instead of a student ensemble) called Gaussian distillation, which estimates the distribution of a teacher ensemble through a special Gaussian process called the deep latent factor model (DLF) by treating each member of the teacher ensemble as a realization of a certain stochastic process. The mean and covariance functions in the DLF model are estimated stably by using the expectation-maximization (EM) algorithm. By using multiple benchmark datasets, we demonstrate that the proposed Gaussian distillation outperforms existing baselines. In addition, we illustrate that Gaussian distillation works well for fine-tuning of language models and distribution shift problems.
[LG-43] Hierarchical DLO Routing with Reinforcement Learning and In-Context Vision-language Models
Link: https://arxiv.org/abs/2510.19268
Authors: Mingen Li,Houjian Yu,Yixuan Huang,Youngjin Hong,Changhyun Choi
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 8 pages, 6 figures, 3 tables
Abstract:Long-horizon routing tasks of deformable linear objects (DLOs), such as cables and ropes, are common in industrial assembly lines and everyday life. These tasks are particularly challenging because they require robots to manipulate DLOs with long-horizon planning and reliable skill execution. Successfully completing such tasks demands adapting to their nonlinear dynamics, decomposing abstract routing goals, and generating multi-step plans composed of multiple skills, all of which require accurate high-level reasoning during execution. In this paper, we propose a fully autonomous hierarchical framework for solving challenging DLO routing tasks. Given an implicit or explicit routing goal expressed in language, our framework leverages vision-language models (VLMs) for in-context high-level reasoning to synthesize feasible plans, which are then executed by low-level skills trained via reinforcement learning. To improve robustness in long horizons, we further introduce a failure recovery mechanism that reorients the DLO into insertion-feasible states. Our approach generalizes to diverse scenes involving object attributes, spatial descriptions, as well as implicit language commands. It outperforms the next best baseline method by nearly 50% and achieves an overall success rate of 92.5% across long-horizon routing scenarios.
[LG-44] Data Efficient Any Transformer-to-Mamba Distillation via Attention Bridge
Link: https://arxiv.org/abs/2510.19266
Authors: Penghao Wang,Yuhao Zhou,Mengxuan Wu,Panpan Zhang,Zhangyang Wang,Kai Wang
Categories: Machine Learning (cs.LG)
Abstract:State-space models (SSMs) have emerged as efficient alternatives to Transformers for sequence modeling, offering superior scalability through recurrent structures. However, their training remains costly and the ecosystem around them is far less mature than that of Transformers. Moreover, the structural heterogeneity between SSMs and Transformers makes it challenging to efficiently distill knowledge from pretrained attention models. In this work, we propose Cross-architecture distillation via Attention Bridge (CAB), a novel data-efficient distillation framework that efficiently transfers attention knowledge from Transformer teachers to state-space student models. Unlike conventional knowledge distillation that transfers knowledge only at the output level, CAB enables token-level supervision via a lightweight bridge and flexible layer-wise alignment, improving both efficiency and transferability. We further introduce flexible layer-wise alignment strategies to accommodate architectural discrepancies between teacher and student. Extensive experiments across vision and language domains demonstrate that our method consistently improves the performance of state-space models, even under limited training data, outperforming both standard and cross-architecture distillation methods. Our findings suggest that attention-based knowledge can be efficiently transferred to recurrent models, enabling rapid utilization of Transformer expertise for building a stronger SSM community.
[LG-45] Mixing Configurations for Downstream Prediction
Link: https://arxiv.org/abs/2510.19248
Authors: Juntang Wang,Hao Wu,Runkun Guo,Yihan Wang,Dongmian Zou,Shixin Xu
Categories: Machine Learning (cs.LG)
Comments: 16 pages, 13 figures, conference paper. Equal contribution: Juntang Wang and Hao Wu
Abstract:Humans possess an innate ability to group objects by similarity, a cognitive mechanism that clustering algorithms aim to emulate. Recent advances in community detection have enabled the discovery of configurations (valid hierarchical clusterings across multiple resolution scales) without requiring labeled data. In this paper, we formally characterize these configurations and identify similar emergent structures in register tokens within Vision Transformers. Unlike register tokens, configurations exhibit lower redundancy and eliminate the need for ad hoc selection. They can be learned through unsupervised or self-supervised methods, yet their selection or composition remains specific to the downstream task and input. Building on these insights, we introduce GraMixC, a plug-and-play module that extracts configurations, aligns them using our Reverse Merge/Split (RMS) technique, and fuses them via attention heads before forwarding them to any downstream predictor. On the DSN1 16S rRNA cultivation-media prediction task, GraMixC improves the $R^2$ score from 0.6 to 0.9 across multiple methods, setting a new state of the art. We further validate GraMixC on standard tabular benchmarks, where it consistently outperforms single-resolution and static-feature baselines.
[LG-46] Interpret Policies in Deep Reinforcement Learning using SILVER with RL-Guided Labeling: A Model-level Approach to High-dimensional and Multi-action Environments
Link: https://arxiv.org/abs/2510.19244
Authors: Yiyu Qian,Su Nguyen,Chao Chen,Qinyue Zhou,Liyuan Zhao
Categories: Machine Learning (cs.LG)
Abstract:Deep reinforcement learning (RL) achieves remarkable performance but lacks interpretability, limiting trust in policy behavior. The existing SILVER framework (Li, Siddique, and Cao 2025) explains RL policies via Shapley-based regression but remains restricted to low-dimensional, binary-action domains. We propose SILVER with RL-guided labeling, an enhanced variant that extends SILVER to multi-action and high-dimensional environments by incorporating the RL policy's own action outputs into the identification of boundary points. Our method first extracts compact feature representations from image observations, performs SHAP-based feature attribution, and then employs RL-guided labeling to generate behaviorally consistent boundary datasets. Surrogate models, such as decision trees and regression-based functions, are subsequently trained to interpret the RL policy's decision structure. We evaluate the proposed framework on two Atari environments using three deep RL algorithms and conduct a human-subject study to assess the clarity and trustworthiness of the derived interpretable policy. Results show that our approach maintains competitive task performance while substantially improving transparency and human understanding of agent behavior. This work advances explainable RL by transforming SILVER into a scalable and behavior-aware framework for interpreting deep RL agents in high-dimensional, multi-action settings.
[LG-47] Understanding the Implicit Biases of Design Choices for Time Series Foundation Models
Link: https://arxiv.org/abs/2510.19236
Authors: Annan Yu,Danielle C. Maddix,Boran Han,Xiyuan Zhang,Abdul Fatir Ansari,Oleksandr Shchur,Christos Faloutsos,Andrew Gordon Wilson,Michael W. Mahoney,Yuyang Wang
Categories: Machine Learning (cs.LG)
Abstract:Time series foundation models (TSFMs) are a class of potentially powerful, general-purpose tools for time series forecasting and related temporal tasks, but their behavior is strongly shaped by subtle inductive biases in their design. Rather than developing a new model and claiming that it is better than existing TSFMs, e.g., by winning on existing well-established benchmarks, our objective is to understand how the various "knobs" of the training process affect model quality. Using a mix of theory and controlled empirical evaluation, we identify several design choices (patch size, embedding choice, training objective, etc.) and show how they lead to implicit biases in fundamental model properties (temporal behavior, geometric structure, how aggressively or not the model regresses to the mean, etc.); and we show how these biases can be intuitive or very counterintuitive, depending on properties of the model and data. We also illustrate in a case study on outlier handling how multiple biases can interact in complex ways; and we discuss implications of our results for learning the bitter lesson and building TSFMs.
[LG-48] Brain-Inspired Perspective on Configurations: Unsupervised Similarity and Early Cognition
Link: https://arxiv.org/abs/2510.19229
Authors: Juntang Wang, Yihan Wang, Hao Wu, Dongmian Zou, Shixin Xu
Categories: Machine Learning (cs.LG)
Comments: 13 pages, 4 figures, conference paper. Equal contribution: Juntang Wang, Yihan Wang and Hao Wu
Abstract:Infants discover categories, detect novelty, and adapt to new contexts without supervision – a challenge for current machine learning. We present a brain-inspired perspective on configurations, a finite-resolution clustering framework that uses a single resolution parameter and attraction-repulsion dynamics to yield hierarchical organization, novelty sensitivity, and flexible adaptation. To evaluate these properties, we introduce mheatmap, which provides proportional heatmaps and a reassignment algorithm to fairly assess multi-resolution and dynamic behavior. Across datasets, configurations are competitive on standard clustering metrics, achieve 87% AUC in novelty detection, and show 35% better stability during dynamic category evolution. These results position configurations as a principled computational model of early cognitive categorization and a step toward brain-inspired AI.
[LG-49] Controllable Machine Unlearning via Gradient Pivoting
Link: https://arxiv.org/abs/2510.19226
Authors: Youngsik Hwang, Dong-Young Lim
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Abstract: Machine unlearning (MU) aims to remove the influence of specific data from a trained model. However, approximate unlearning methods, often formulated as a single-objective optimization (SOO) problem, face a critical trade-off between unlearning efficacy and model fidelity. This leads to three primary challenges: the risk of over-forgetting, a lack of fine-grained control over the unlearning process, and the absence of metrics to holistically evaluate the trade-off. To address these issues, we reframe MU as a multi-objective optimization (MOO) problem. We then introduce a novel algorithm, Controllable Unlearning by Pivoting Gradient (CUP), which features a unique pivoting mechanism. Unlike traditional MOO methods that converge to a single solution, CUP’s mechanism is designed to controllably navigate the entire Pareto frontier. This navigation is governed by a single intuitive hyperparameter, the ‘unlearning intensity’, which allows for precise selection of a desired trade-off. To evaluate this capability, we adopt the hypervolume indicator, a metric that captures both the quality and diversity of the entire set of solutions an algorithm can generate. Our experimental results demonstrate that CUP produces a superior set of Pareto-optimal solutions, consistently outperforming existing methods across various vision tasks.
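The hypervolume indicator mentioned in this abstract can be illustrated with a small two-objective sketch. This is not the authors' evaluation code. It assumes both objectives are minimized and that a reference point `ref` bounds the region; the function name `hypervolume_2d` and the example points are hypothetical.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by `ref`.

    Assumption: both objectives are to be minimized, so a larger
    dominated area means a better and more diverse solution set.
    """
    # Keep only points that improve on the reference in both objectives,
    # sorted by the first objective (ascending).
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    area = 0.0
    best_f2 = ref[1]
    for f1, f2 in pts:
        if f2 < best_f2:  # point is not dominated by an earlier one
            # Add the new horizontal strip this point contributes.
            area += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area


# Hypothetical trade-off points (objective 1, objective 2).
front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
print(hypervolume_2d(front, ref=(4.0, 4.0)))
```

A set that covers more of the trade-off curve sweeps out a larger area, which is why a single number can reflect both quality and diversity.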
[LG-50] RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs
Link: https://arxiv.org/abs/2510.19225
Authors: Yongji Wu, Xueshen Liu, Haizhong Zheng, Juncheng Gu, Beidi Chen, Z. Morley Mao, Arvind Krishnamurthy, Ion Stoica
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Abstract:Reinforcement learning (RL) has become essential for unlocking advanced reasoning capabilities in large language models (LLMs). RL workflows involve interleaving rollout and training stages with fundamentally different resource requirements. Rollout typically dominates overall execution time, yet scales efficiently through multiple independent instances. In contrast, training requires tightly-coupled GPUs with full-mesh communication. Existing RL frameworks fall into two categories: co-located and disaggregated architectures. Co-located ones fail to address this resource tension by forcing both stages to share the same GPUs. Disaggregated architectures, without modifications of well-established RL algorithms, suffer from resource under-utilization. Meanwhile, preemptible GPU resources, i.e., spot instances on public clouds and spare capacity in production clusters, present significant cost-saving opportunities for accelerating RL workflows, if efficiently harvested for rollout. In this paper, we present RLBoost, a systematic solution for cost-efficient RL training that harvests preemptible GPU resources. Our key insight is that rollout’s stateless and embarrassingly parallel nature aligns perfectly with preemptible and often fragmented resources. To efficiently utilize these resources despite frequent and unpredictable availability changes, RLBoost adopts a hybrid architecture with three key techniques: (1) adaptive rollout offload to dynamically adjust workloads on the reserved (on-demand) cluster, (2) pull-based weight transfer that quickly provisions newly available instances, and (3) token-level response collection and migration for efficient preemption handling and continuous load balancing. Extensive experiments show RLBoost increases training throughput by 1.51x-1.97x while improving cost efficiency by 28%-49% compared to using only on-demand GPU resources. 
[LG-51] Enhancing Graph Neural Networks: A Mutual Learning Approach
Link: https://arxiv.org/abs/2510.19223
Authors: Paul Agbaje, Akajyoti Mitra, Afia Anjum, Pranali Khose, Ebelechukwu Nwafor, Habeeb Olufowobi
Categories: Machine Learning (cs.LG)
Abstract:Knowledge distillation (KD) techniques have emerged as a powerful tool for transferring expertise from complex teacher models to lightweight student models, particularly beneficial for deploying high-performance models in resource-constrained devices. This approach has been successfully applied to graph neural networks (GNNs), harnessing their expressive capabilities to generate node embeddings that capture structural and feature-related information. In this study, we depart from the conventional KD approach by exploring the potential of collaborative learning among GNNs. In the absence of a pre-trained teacher model, we show that relatively simple and shallow GNN architectures can synergetically learn efficient models capable of performing better during inference, particularly in tackling multiple tasks. We propose a collaborative learning framework where ensembles of student GNNs mutually teach each other throughout the training process. We introduce an adaptive logit weighting unit to facilitate efficient knowledge exchange among models and an entropy enhancement technique to improve mutual learning. These components dynamically empower the models to adapt their learning strategies during training, optimizing their performance for downstream tasks. Extensive experiments conducted on three datasets each for node and graph classification demonstrate the effectiveness of our approach.
[LG-52] A Communication-Efficient Decentralized Actor-Critic Algorithm
Link: https://arxiv.org/abs/2510.19199
Authors: Xiaoxing Ren, Nicola Bastianello, Thomas Parisini, Andreas A. Malikopoulos
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Abstract: In this paper, we study the problem of reinforcement learning in multi-agent systems where communication among agents is limited. We develop a decentralized actor-critic learning framework in which each agent performs several local updates of its policy and value function, where the latter is approximated by a multi-layer neural network, before exchanging information with its neighbors. This local training strategy substantially reduces the communication burden while maintaining coordination across the network. We establish finite-time convergence analysis for the algorithm under Markov-sampling. Specifically, to attain an $\varepsilon$-accurate stationary point, the sample complexity is of order $\mathcal{O}(\varepsilon^{-3})$ and the communication complexity is of order $\mathcal{O}(\varepsilon^{-1}\tau^{-1})$, where $\tau$ denotes the number of local training steps. We also show how the final error bound depends on the neural network’s approximation quality. Numerical experiments in a cooperative control setting illustrate and validate the theoretical findings.
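The "several local updates, then exchange with neighbors" pattern can be sketched on a toy problem. This is not the paper's actor-critic algorithm: each agent here runs plain gradient descent on a scalar quadratic, and the ring topology, the mixing weights, and the name `local_then_mix` are illustrative assumptions.

```python
import numpy as np


def local_then_mix(targets, tau=5, rounds=50, lr=0.1):
    """Toy local-update scheme: tau local gradient steps, then one neighbor exchange.

    Agent i minimizes 0.5 * (x_i - targets[i])**2 locally (an assumed toy loss).
    Communication happens once per round, so it is used `rounds` times in total.
    """
    targets = np.asarray(targets, dtype=float)
    n = len(targets)
    x = np.zeros(n)

    # Doubly stochastic mixing matrix for a ring: each agent averages
    # its own value with those of its two neighbors.
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] += 0.5
        W[i, (i - 1) % n] += 0.25
        W[i, (i + 1) % n] += 0.25

    for _ in range(rounds):
        for _ in range(tau):          # local training, no communication
            x = x - lr * (x - targets)
        x = W @ x                     # single communication step
    return x


x = local_then_mix([1.0, 2.0, 3.0, 6.0])
print(x.mean())  # network average of the agents' iterates
```

With `tau` local steps per round, the same number of gradient steps needs `tau` times fewer exchanges, which is the intuition behind the $\tau^{-1}$ factor in the stated communication complexity.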
[LG-53] Natural Gradient VI: Guarantees for Non-Conjugate Models NEURIPS2025
Link: https://arxiv.org/abs/2510.19163
Authors: Fangyuan Sun, Ilyas Fatkhullin, Niao He
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: NeurIPS 2025
Abstract:Stochastic Natural Gradient Variational Inference (NGVI) is a widely used method for approximating posterior distribution in probabilistic models. Despite its empirical success and foundational role in variational inference, its theoretical underpinnings remain limited, particularly in the case of non-conjugate likelihoods. While NGVI has been shown to be a special instance of Stochastic Mirror Descent, and recent work has provided convergence guarantees using relative smoothness and strong convexity for conjugate models, these results do not extend to the non-conjugate setting, where the variational loss becomes non-convex and harder to analyze. In this work, we focus on mean-field parameterization and advance the theoretical understanding of NGVI in three key directions. First, we derive sufficient conditions under which the variational loss satisfies relative smoothness with respect to a suitable mirror map. Second, leveraging this structure, we propose a modified NGVI algorithm incorporating non-Euclidean projections and prove its global non-asymptotic convergence to a stationary point. Finally, under additional structural assumptions about the likelihood, we uncover hidden convexity properties of the variational loss and establish fast global convergence of NGVI to a global optimum. These results provide new insights into the geometry and convergence behavior of NGVI in challenging inference settings.
[LG-54] Preliminary Use of Vision Language Model Driven Extraction of Mouse Behavior Towards Understanding Fear Expression
Link: https://arxiv.org/abs/2510.19160
Authors: Paimon Goulart, Jordan Steinhauser, Kylene Shuler, Edward Korzus, Jia Chen, Evangelos E. Papalexakis
Categories: Machine Learning (cs.LG)
Abstract:Integration of diverse data will be a pivotal step towards improving scientific explorations in many disciplines. This work establishes a vision-language model (VLM) that encodes videos with text input in order to classify various behaviors of a mouse existing in and engaging with their environment. Importantly, this model produces a behavioral vector over time for each subject and for each session the subject undergoes. The output is a valuable dataset that few programs are able to produce with as high accuracy and with minimal user input. Specifically, we use the open-source Qwen2.5-VL model and enhance its performance through prompts, in-context learning (ICL) with labeled examples, and frame-level preprocessing. We found that each of these methods contributes to improved classification, and that combining them results in strong F1 scores across all behaviors, including rare classes like freezing and fleeing, without any model fine-tuning. Overall, this model will support interdisciplinary researchers studying mouse behavior by enabling them to integrate diverse behavioral features, measured across multiple time points and environments, into a comprehensive dataset that can address complex research questions.
[LG-55] Instance-Dependent Regret Bounds for Nonstochastic Linear Partial Monitoring
Link: https://arxiv.org/abs/2510.19158
Authors: Federico Di Gennaro, Khaled Eldowa, Nicolò Cesa-Bianchi
Categories: Machine Learning (cs.LG)
Abstract: In contrast to the classic formulation of partial monitoring, linear partial monitoring can model infinite outcome spaces, while imposing a linear structure on both the losses and the observations. This setting can be viewed as a generalization of linear bandits where loss and feedback are decoupled in a flexible manner. In this work, we address a nonstochastic (adversarial), finite-actions version of the problem through a simple instance of the exploration-by-optimization method that is amenable to efficient implementation. We derive regret bounds that depend on the game structure in a more transparent manner than previous theoretical guarantees for this paradigm. Our bounds feature instance-specific quantities that reflect the degree of alignment between observations and losses, and resemble known guarantees in the stochastic setting. Notably, they achieve the standard $\sqrt{T}$ rate in easy (locally observable) games and $T^{2/3}$ in hard (globally observable) games, where $T$ is the time horizon. We instantiate these bounds in a selection of old and new partial information settings subsumed by this model, and illustrate that the achieved dependence on the game structure can be tight in interesting cases.
[LG-56] Feature Space Adaptation for Robust Model Fine-Tuning
Link: https://arxiv.org/abs/2510.19155
Authors: Peng Wang, Minghao Gu, Qiang Huang
Categories: Machine Learning (cs.LG)
Abstract:Catastrophic forgetting is a common issue in model fine-tuning, especially when the downstream domain contains limited labeled data or differs greatly from the pre-training distribution. Existing parameter-efficient fine-tuning methods operate in the weight space by modifying or augmenting the pre-trained model’s parameters, which can yield models overly specialized to the available downstream data. To mitigate the risk of overwriting pre-trained knowledge and enhance robustness, we propose to fine-tune the pre-trained model in the feature space. Two new fine-tuning methods are proposed: LoRFA (Low-Rank Feature Adaptation) and VeFA (Vector-Based Feature Adaptation). Feature space adaptation is inspired by the idea of effect equivalence modeling (EEM) of downstream lurking variables causing distribution shifts, which posits that unobserved factors can be represented as the total equivalent amount on observed features. By compensating for the effects of downstream lurking variables via a lightweight feature-level transformation, the pre-trained representations can be preserved, which improves model generalization under distribution shift. We evaluate LoRFA and VeFA versus LoRA on image classification, NLU, and NLG, covering both standard fine-tuning metrics and robustness. Feature space adaptation achieves comparable fine-tuning results and consistently stronger robustness.
[LG-57] Subliminal Corruption: Mechanisms, Thresholds, and Interpretability
Link: https://arxiv.org/abs/2510.19152
Authors: Reya Vir, Sarvesh Bhatnagar
Categories: Machine Learning (cs.LG)
[LG-58] HAMLOCK: HArdware-Model LOgically Combined attacK
Link: https://arxiv.org/abs/2510.19145
Authors: Sanskar Amgain, Daniel Lobo, Atri Chatterjee, Swarup Bhunia, Fnu Suya
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Abstract:The growing use of third-party hardware accelerators (e.g., FPGAs, ASICs) for deep neural networks (DNNs) introduces new security vulnerabilities. Conventional model-level backdoor attacks, which only poison a model’s weights to misclassify inputs with a specific trigger, are often detectable because the entire attack logic is embedded within the model (i.e., software), creating a traceable layer-by-layer activation path. This paper introduces the HArdware-Model Logically Combined Attack (HAMLOCK), a far stealthier threat that distributes the attack logic across the hardware-software boundary. The software (model) is now only minimally altered by tuning the activations of few neurons to produce uniquely high activation values when a trigger is present. A malicious hardware Trojan detects those unique activations by monitoring the corresponding neurons’ most significant bit or the 8-bit exponents and triggers another hardware Trojan to directly manipulate the final output logits for misclassification. This decoupled design is highly stealthy, as the model itself contains no complete backdoor activation path as in conventional attacks and hence, appears fully benign. Empirically, across benchmarks like MNIST, CIFAR10, GTSRB, and ImageNet, HAMLOCK achieves a near-perfect attack success rate with a negligible clean accuracy drop. More importantly, HAMLOCK circumvents the state-of-the-art model-level defenses without any adaptive optimization. The hardware Trojan is also undetectable, incurring area and power overheads as low as 0.01%, which is easily masked by process and environmental noise. Our findings expose a critical vulnerability at the hardware-software interface, demanding new cross-layer defenses against this emerging threat. 
[LG-59] Learning Peer Influence Probabilities with Linear Contextual Bandits
Link: https://arxiv.org/abs/2510.19119
Authors: Ahmed Sayeed Faruk, Mohammad Shahverdikondori, Elena Zheleva
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
[LG-60] Weight Decay may matter more than muP for Learning Rate Transfer in Practice
Link: https://arxiv.org/abs/2510.19093
Authors: Atli Kosson, Jeremy Welborn, Yang Liu, Martin Jaggi, Xi Chen
Categories: Machine Learning (cs.LG)
[LG-61] POLAR: Policy-based Layerwise Reinforcement Learning Method for Stealthy Backdoor Attacks in Federated Learning
Link: https://arxiv.org/abs/2510.19056
Authors: Kuai Yu, Xiaoyu Wu, Peishen Yan, Qingqian Yang, Linshan Jiang, Hao Wang, Yang Hua, Tao Song, Haibing Guan
Categories: Machine Learning (cs.LG)
Abstract:Federated Learning (FL) enables decentralized model training across multiple clients without exposing local data, but its distributed feature makes it vulnerable to backdoor attacks. Despite early FL backdoor attacks modifying entire models, recent studies have explored the concept of backdoor-critical (BC) layers, which poison the chosen influential layers to maintain stealthiness while achieving high effectiveness. However, existing BC layers approaches rely on rule-based selection without consideration of the interrelations between layers, making them ineffective and prone to detection by advanced defenses. In this paper, we propose POLAR (POlicy-based LAyerwise Reinforcement learning), the first pipeline to creatively adopt RL to solve the BC layer selection problem in layer-wise backdoor attack. Different from other commonly used RL paradigm, POLAR is lightweight with Bernoulli sampling. POLAR dynamically learns an attack strategy, optimizing layer selection using policy gradient updates based on backdoor success rate (BSR) improvements. To ensure stealthiness, we introduce a regularization constraint that limits the number of modified layers by penalizing large attack footprints. Extensive experiments demonstrate that POLAR outperforms the latest attack methods by up to 40% against six state-of-the-art (SOTA) defenses.
[LG-62] Empowering Decision Trees via Shape Function Branching NEURIPS2025
Link: https://arxiv.org/abs/2510.19040
Authors: Nakul Upadhya, Eldan Cohen
Categories: Machine Learning (cs.LG)
Comments: Accepted at NeurIPS 2025, Source code found at: this https URL
Abstract: Decision trees are prized for their interpretability and strong performance on tabular data. Yet, their reliance on simple axis-aligned linear splits often forces deep, complex structures to capture non-linear feature effects, undermining human comprehension of the constructed tree. To address this limitation, we propose a novel generalization of a decision tree, the Shape Generalized Tree (SGT), in which each internal node applies a learnable axis-aligned shape function to a single feature, enabling rich, non-linear partitioning in one split. As users can easily visualize each node’s shape function, SGTs are inherently interpretable and provide intuitive, visual explanations of the model’s decision mechanisms. To learn SGTs from data, we propose ShapeCART, an efficient induction algorithm for SGTs. We further extend the SGT framework to bivariate shape functions (S$^2$GT) and multi-way trees (SGT$_K$), and present Shape$^2$CART and ShapeCART$_K$, extensions to ShapeCART for learning S$^2$GTs and SGT$_K$s, respectively. Experiments on various datasets show that SGTs achieve superior performance with reduced model size compared to traditional axis-aligned linear trees.
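The idea of a node that branches on a shape function of one feature, instead of a single threshold, can be sketched as follows. This is only an illustration, not ShapeCART: the piecewise-constant form of the shape function, the branch-on-sign rule, and the names `make_binned_shape` and `route_left` are all assumptions made here.

```python
import numpy as np


def make_binned_shape(edges, values):
    """Piecewise-constant shape function (an assumed, simple parameterization).

    `edges` are sorted bin boundaries; `values` holds one output per bin,
    so len(values) == len(edges) + 1.
    """
    edges = np.asarray(edges, dtype=float)
    values = np.asarray(values, dtype=float)

    def shape(x):
        # Index of the bin each input falls into.
        return values[np.searchsorted(edges, x, side="right")]

    return shape


def route_left(X, feature, shape):
    """Send a sample left when the shape function of its feature is positive."""
    return shape(X[:, feature]) > 0


# One node isolates the band -1 <= x0 < 1, which a single threshold split cannot do.
band = make_binned_shape(edges=[-1.0, 1.0], values=[-1.0, 1.0, -1.0])
X = np.array([[-2.0, 0.0], [0.0, 0.0], [0.5, 0.0], [3.0, 0.0]])
print(route_left(X, feature=0, shape=band))
```

A conventional axis-aligned tree would need two stacked threshold splits to carve out the same band, which is the depth saving the abstract describes.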
[LG-63] Category learning in deep neural networks: Information content and geometry of internal representations
Link: https://arxiv.org/abs/2510.19021
Authors: Laurent Bonnasse-Gahot, Jean-Pierre Nadal
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Neurons and Cognition (q-bio.NC)
Abstract:In animals, category learning enhances discrimination between stimuli close to the category boundary. This phenomenon, called categorical perception, was also empirically observed in artificial neural networks trained on classification tasks. In previous modeling works based on neuroscience data, we show that this expansion/compression is a necessary outcome of efficient learning. Here we extend our theoretical framework to artificial networks. We show that minimizing the Bayes cost (mean of the cross-entropy loss) implies maximizing the mutual information between the set of categories and the neural activities prior to the decision layer. Considering structured data with an underlying feature space of small dimension, we show that maximizing the mutual information implies (i) finding an appropriate projection space, and, (ii) building a neural representation with the appropriate metric. The latter is based on a Fisher information matrix measuring the sensitivity of the neural activity to changes in the projection space. Optimal learning makes this neural Fisher information follow a category-specific Fisher information, measuring the sensitivity of the category membership. Category learning thus induces an expansion of neural space near decision boundaries. We characterize the properties of the categorical Fisher information, showing that its eigenvectors give the most discriminant directions at each point of the projection space. We find that, unexpectedly, its maxima are in general not exactly at, but near, the class boundaries. Considering toy models and the MNIST dataset, we numerically illustrate how after learning the two Fisher information matrices match, and essentially align with the category boundaries. Finally, we relate our approach to the Information Bottleneck one, and we exhibit a bias-variance decomposition of the Bayes cost, of interest on its own.
[LG-64] Impartial Selection with Predictions
Link: https://arxiv.org/abs/2510.19002
Authors: Javier Cembrano, Felix Fischer, Max Klimm
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Theoretical Economics (econ.TH); Optimization and Control (math.OC)
Abstract: We study the selection of agents based on mutual nominations, a theoretical problem with many applications from committee selection to AI alignment. As agents both select and are selected, they may be incentivized to misrepresent their true opinion about the eligibility of others to influence their own chances of selection. Impartial mechanisms circumvent this issue by guaranteeing that the selection of an agent is independent of the nominations cast by that agent. Previous research has established strong bounds on the performance of impartial mechanisms, measured by their ability to approximate the number of nominations for the most highly nominated agents. We study to what extent the performance of impartial mechanisms can be improved if they are given a prediction of a set of agents receiving a maximum number of nominations. Specifically, we provide bounds on the consistency and robustness of such mechanisms, where consistency measures the performance of the mechanisms when the prediction is accurate and robustness its performance when the prediction is inaccurate. For the general setting where up to $k$ agents are to be selected and agents nominate any number of other agents, we give a mechanism with consistency $1-O\big(\frac{1}{k}\big)$ and robustness $1-\frac{1}{e}-O\big(\frac{1}{k}\big)$. For the special case of selecting a single agent based on a single nomination per agent, we prove that $1$-consistency can be achieved while guaranteeing $\frac{1}{2}$-robustness. A close comparison with previous results shows that (asymptotically) optimal consistency can be achieved with little to no sacrifice in terms of robustness.
[LG-65] An Encode-then-Decompose Approach to Unsupervised Time Series Anomaly Detection on Contaminated Training Data–Extended Version ICDE2026
Link: https://arxiv.org/abs/2510.18998
Authors: Buang Zhang, Tung Kieu, Xiangfei Qiu, Chenjuan Guo, Jilin Hu, Aoying Zhou, Christian S. Jensen, Bin Yang
Categories: Machine Learning (cs.LG); Databases (cs.DB)
Comments: 15 pages. An extended version of “An Encode-then-Decompose Approach to Unsupervised Time Series Anomaly Detection on Contaminated Training Data” accepted at ICDE 2026
Abstract:Time series anomaly detection is important in modern large-scale systems and is applied in a variety of domains to analyze and monitor the operation of diverse systems. Unsupervised approaches have received widespread interest, as they do not require anomaly labels during training, thus avoiding potentially high costs and having wider applications. Among these, autoencoders have received extensive attention. They use reconstruction errors from compressed representations to define anomaly scores. However, representations learned by autoencoders are sensitive to anomalies in training time series, causing reduced accuracy. We propose a novel encode-then-decompose paradigm, where we decompose the encoded representation into stable and auxiliary representations, thereby enhancing the robustness when training with contaminated time series. In addition, we propose a novel mutual information based metric to replace the reconstruction errors for identifying anomalies. Our proposal demonstrates competitive or state-of-the-art performance on eight commonly used multi- and univariate time series benchmarks and exhibits robustness to time series with different contamination ratios.
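The baseline this abstract starts from, scoring anomalies by reconstruction error from a compressed representation, can be sketched with a linear autoencoder. This shows only that baseline, not the proposed encode-then-decompose method or its mutual-information metric. Using PCA (via SVD) as the encoder and decoder is an assumption made here for brevity, and the function names are hypothetical.

```python
import numpy as np


def fit_linear_autoencoder(X, k):
    """Fit a k-dimensional linear encoder/decoder via PCA (assumed stand-in for a neural autoencoder)."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]  # top-k principal directions


def reconstruction_score(X, mean, components):
    """Anomaly score: squared error between each sample and its reconstruction."""
    Z = (X - mean) @ components.T      # encode to k dimensions
    X_hat = Z @ components + mean      # decode back to input space
    return np.sum((X - X_hat) ** 2, axis=1)


# Training data lies on the line x1 == x2; the second test point violates it.
t = np.linspace(-1.0, 1.0, 50)
train = np.stack([t, t], axis=1)
mean, comps = fit_linear_autoencoder(train, k=1)
test_points = np.array([[0.5, 0.5], [0.5, -0.5]])
print(reconstruction_score(test_points, mean, comps))
```

If anomalous samples were mixed into `train`, the fitted directions would shift toward them and their scores would shrink, which is the sensitivity to contaminated training data that the abstract sets out to address.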
[LG-66] Towards Universal Solvers: Using PGD Attack in Active Learning to Increase Generalizability of Neural Operators as Knowledge Distillation from Numerical PDE Solvers
Link: https://arxiv.org/abs/2510.18989
Authors: Yifei Sun
Categories: Machine Learning (cs.LG)
Abstract:Nonlinear PDE solvers require fine space-time discretizations and local linearizations, leading to high memory cost and slow runtimes. Neural operators such as FNOs and DeepONets offer fast single-shot inference by learning function-to-function mappings and truncating high-frequency components, but they suffer from poor out-of-distribution (OOD) generalization, often failing on inputs outside the training distribution. We propose an adversarial teacher-student distillation framework in which a differentiable numerical solver supervises a compact neural operator while a PGD-style active sampling loop searches for worst-case inputs under smoothness and energy constraints to expand the training set. Using differentiable spectral solvers enables gradient-based adversarial search and stabilizes sample mining. Experiments on Burgers and Navier-Stokes systems demonstrate that adversarial distillation substantially improves OOD robustness while preserving the low parameter cost and fast inference of neural operators.
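The PGD-style search for worst-case inputs can be sketched in one dimension. This is a toy stand-in, not the paper's framework: the "teacher" is a closed-form function instead of a numerical PDE solver, the "student" is a fixed polynomial, the constraint is a simple interval, and gradients come from finite differences, not a differentiable solver. All names are hypothetical.

```python
import numpy as np


def teacher(x):
    """Stand-in for the accurate numerical solver (assumed toy function)."""
    return np.sin(3.0 * x)


def student(x):
    """Stand-in for the compact surrogate: a cubic Taylor approximation of the teacher."""
    return 3.0 * x - 4.5 * x ** 3


def discrepancy(x):
    return (teacher(x) - student(x)) ** 2


def pgd_worst_case(x0, step=0.05, n_steps=40, lo=-1.0, hi=1.0, h=1e-4):
    """Projected gradient ascent on the teacher-student discrepancy."""
    x = float(x0)
    for _ in range(n_steps):
        # Central finite-difference estimate of the gradient.
        grad = (discrepancy(x + h) - discrepancy(x - h)) / (2.0 * h)
        x = x + step * np.sign(grad)       # signed ascent step
        x = float(np.clip(x, lo, hi))      # project back onto the allowed set
    return x


x_adv = pgd_worst_case(0.2)
print(x_adv, discrepancy(x_adv))
```

In an active-learning loop, inputs found this way would be labeled by the teacher and added to the student's training set, so the surrogate is corrected where it is currently weakest.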
[LG-67] Position: Many generalization measures for deep learning are fragile
Link: https://arxiv.org/abs/2510.18934
Authors: Shuofeng Zhang, Ard Louis
Categories: Machine Learning (cs.LG)
Abstract: A wide variety of generalization measures have been applied to deep neural networks (DNNs). Although obtaining tight bounds remains challenging, such measures are often assumed to reproduce qualitative generalization trends. In this position paper, we argue that many post-mortem generalization measures – those computed on trained networks – are fragile: small training modifications that barely affect the underlying DNN can substantially change a measure’s value, trend, or scaling behavior. For example, minor hyperparameter changes, such as learning rate adjustments or switching between SGD variants, can reverse the slope of a learning curve in widely used generalization measures like the path norm. We also identify subtler forms of fragility. For instance, the PAC-Bayes origin measure is regarded as one of the most reliable, and is indeed less sensitive to hyperparameter tweaks than many other measures. However, it completely fails to capture differences in data complexity across learning curves. This data fragility contrasts with the function-based marginal-likelihood PAC-Bayes bound, which does capture differences in data complexity, including scaling behavior, in learning curves, but which is not a post-mortem measure. Beyond demonstrating that many bounds – such as path, spectral and Frobenius norms, flatness proxies, and deterministic PAC-Bayes surrogates – are fragile, this position paper also argues that developers of new measures should explicitly audit them for fragility.
[LG-68] CityAQVis: Integrated ML-Visualization Sandbox Tool for Pollutant Estimation in Urban Regions Using Multi-Source Data (Software Article)
Link: https://arxiv.org/abs/2510.18878
Authors: Brij Bridhin Desai, Yukta Arvind, Aswathi Mundayatt, Jaya Sreevalsan-Nair
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 19 pages, 10 figures, 2 tables
Abstract: Urban air pollution poses significant risks to public health, environmental sustainability, and policy planning. Effective air quality management requires predictive tools that can integrate diverse datasets and communicate complex spatial and temporal pollution patterns. There is a gap in interactive tools with seamless integration of forecasting and visualization of spatial distributions of air pollutant concentrations. We present CityAQVis, an interactive machine learning (ML) sandbox tool designed to predict and visualize pollutant concentrations at the ground level using multi-source data, which includes satellite observations, meteorological parameters, population density, elevation, and nighttime lights. While traditional air quality visualization tools often lack forecasting capabilities, CityAQVis enables users to build and compare predictive models, visualizing the model outputs and offering insights into pollution dynamics at the ground level. The pilot implementation of the tool is tested through case studies predicting nitrogen dioxide (NO2) concentrations in metropolitan regions, highlighting its adaptability to various pollutants. Through an intuitive graphical user interface (GUI), the user can perform comparative visualizations of the spatial distribution of surface-level pollutant concentration in two different urban scenarios. Our results highlight the potential of ML-driven visual analytics to improve situational awareness and support data-driven decision-making in air quality management.
[LG-69] Remarks on a recent preprint of Chernikov and Towsner
Link: https://arxiv.org/abs/2510.19665
Authors: Maryanthe Malliaris
Categories: Logic (math.LO); Machine Learning (cs.LG)
[LG-70] Learning Upper Lower Value Envelopes to Shape Online RL: A Principled Approach
Link: https://arxiv.org/abs/2510.19528
Authors: Sebastian Reboul, Hélène Halconruy, Randal Douc
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 32 pages, 5 figures
[LG-71] A Derandomization Framework for Structure Discovery: Applications in Neural Networks and Beyond
Link: https://arxiv.org/abs/2510.19382
Authors: Nikos Tsikouras, Yorgos Pantis, Ioannis Mitliagkas, Christos Tzamos
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
[LG-72] Square root Cox's survival analysis by the fittest linear and neural networks model
Link: https://arxiv.org/abs/2510.19374
Authors: Maxime van Cutsem,Sylvain Sardy
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Notes:
[LG-73] On the hardness of RL with Lookahead
Link: https://arxiv.org/abs/2510.19372
Authors: Corentin Pla,Hugo Richard,Marc Abeille,Nadav Merlis,Vianney Perchet
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Notes:
[LG-74] Nonmonotone subgradient methods based on a local descent lemma
Link: https://arxiv.org/abs/2510.19341
Authors: Francisco J. Aragón-Artacho,Rubén Campoy,Pedro Pérez-Aros,David Torregrosa-Belén
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Notes:
[LG-75] Topology of Currencies: Persistent Homology for FX Co-movements: A Comparative Clustering Study
Link: https://arxiv.org/abs/2510.19306
Authors: Pattravadee de Favereau de Jeneret,Ioannis Diamantis
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); General Economics (econ.GN); Applications (stat.AP)
Notes: 26 pages, 17 figures, the results were presented at the 5th MORSE Conference, Maastricht University (October 2025)
[LG-76] Magnetic field estimation using Gaussian process regression for interactive wireless power system design
Link: https://arxiv.org/abs/2510.19277
Authors: Yuichi Honjo,Cedric Caremel,Ken Takaki,Yuta Noma,Yoshihiro Kawahara,Takuya Sasatani
Categories: Applied Physics (physics.app-ph); Machine Learning (cs.LG); Systems and Control (eess.SY)
Notes: 29 pages, 8 figures, 1 table
[LG-77] Synthesizability Prediction of Crystalline Structures with a Hierarchical Transformer and Uncertainty Quantification
Link: https://arxiv.org/abs/2510.19251
Authors: Danial Ebrahimzadeh,Sarah Sharif,Yaser Mike Banad
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Notes:
Abstract:Predicting which hypothetical inorganic crystals can be experimentally realized remains a central challenge in accelerating materials discovery. SyntheFormer is a positive-unlabeled framework that learns synthesizability directly from crystal structure, combining a Fourier-transformed crystal periodicity (FTCP) representation with hierarchical feature extraction, Random-Forest feature selection, and a compact deep MLP classifier. The model is trained on historical data from 2011 through 2018 and evaluated prospectively on future years from 2019 to 2025, where the positive class constitutes only 1.02 per cent of samples. Under this temporally separated evaluation, SyntheFormer achieves a test area under the ROC curve of 0.735 and, with dual-threshold calibration, attains high-recall screening with 97.6 per cent recall at 94.2 per cent coverage, which minimizes missed opportunities while preserving discriminative power. Crucially, the model recovers experimentally confirmed metastable compounds that lie far from the convex hull and simultaneously assigns low scores to many thermodynamically stable yet unsynthesized candidates, demonstrating that stability alone is insufficient to predict experimental attainability. By aligning structure-aware representation with uncertainty-aware decision rules, SyntheFormer provides a practical route to prioritize synthesis targets and focus laboratory effort on the most promising new inorganic materials.
[LG-78] Transfer Learning Beyond the Standard Model NEURIPS2025
Link: https://arxiv.org/abs/2510.19168
Authors: Veena Krishnaraj,Adrian E. Bayer,Christian Kragh Jespersen,Peter Melchior
Categories: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Notes: 4+8 pages, 7 figures. Accepted at NeurIPS 2025 Workshop: Machine Learning and the Physical Sciences
[LG-79] Extreme Event Aware (η-) Learning
Link: https://arxiv.org/abs/2510.19161
Authors: Kai Chang,Themistoklis P. Sapsis
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Dynamical Systems (math.DS); Numerical Analysis (math.NA)
Notes: Minor revisions at PNAS
[LG-80] Signature Kernel Scoring Rule as Spatio-Temporal Diagnostic for Probabilistic Forecasting
Link: https://arxiv.org/abs/2510.19110
Authors: Archer Dodson,Ritabrata Dutta
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
Notes:
[LG-81] Learning noisy tissue dynamics across time scales
Link: https://arxiv.org/abs/2510.19090
Authors: Ming Han,John Devany,Michel Fruchart,Margaret L. Gardel,Vincenzo Vitelli
Categories: Soft Condensed Matter (cond-mat.soft); Machine Learning (cs.LG); Biological Physics (physics.bio-ph); Quantitative Methods (q-bio.QM)
Notes: 15 pages, 6 figures
[LG-82] Calibrated Principal Component Regression
Link: https://arxiv.org/abs/2510.19020
Authors: Yixuan Florence Wu,Yilun Zhu,Lei Cao,Naichen Shi
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Notes:
[LG-83] Foundation Models for Discovery and Exploration in Chemical Space
Link: https://arxiv.org/abs/2510.18900
Authors: Alexius Wadell,Anoushka Bhutani,Victor Azumah,Austin R. Ellis-Mohr,Celia Kelly,Hancheng Zhao,Anuj K. Nayak,Kareem Hegazy,Alexander Brace,Hongyi Lin,Murali Emani,Venkatram Vishwanath,Kevin Gering,Melisa Alkan,Tom Gibbs,Jack Wells,Lav R. Varshney,Bharath Ramsundar,Karthik Duraisamy,Michael W. Mahoney,Arvind Ramanathan,Venkatasubramanian Viswanathan
Categories: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Notes: Main manuscript: 28 pages (including references), 7 tables and 5 figures. Supplementary information: 91 pages (including references), 12 tables and 82 figures
Information Retrieval
[IR-0] Top-P Masking for Cross Language Information Retrieval
Link: https://arxiv.org/abs/2510.19758
Authors: Joseph Casale,Andrew Silverschotz,Joseph DeSimone
Categories: Information Retrieval (cs.IR)
Notes: Unsubmitted
Abstract:Top-K masking schemes have been proposed as a method to promote sparse representations in Information Retrieval (IR) tasks, as a simple alternative to Floating Point Operations per Second (FLOPS) regularization. Algorithms such as the Bilingual Lexical and Document Expansion Model (BLADE) adopt this approach as a post-processing stage. We propose using Top-P Dynamic Masking, similar to Nucleus Sampling in Large Language Models, and demonstrate better performance than Top-K masking. Specifically, we evaluate our methods in the domain of Cross Language Information Retrieval (CLIR).
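To make the contrast in the abstract concrete, here is a minimal sketch of Top-K versus Top-P masking over a vector of term weights. It is an illustration only, not the authors' implementation: the function names, the assumption that weights are non-negative, and the choice to normalize by the plain sum of weights (rather than, say, a softmax) are all assumptions made for this example.

```python
from typing import List


def top_k_mask(weights: List[float], k: int) -> List[float]:
    """Keep the k largest weights and zero out the rest (fixed budget)."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    keep = set(order[:k])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]


def top_p_mask(weights: List[float], p: float) -> List[float]:
    """Keep the smallest set of largest weights whose share of the total
    mass reaches p (dynamic budget, in the spirit of nucleus sampling).

    Assumption for this sketch: weights are non-negative and are
    normalized by their plain sum.
    """
    total = sum(weights)
    if total <= 0:
        return [0.0] * len(weights)
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    keep = set()
    cumulative = 0.0
    for i in order:
        keep.add(i)
        cumulative += weights[i] / total
        if cumulative >= p:
            break
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]


# A peaked vector keeps few terms under Top-P; a flat one keeps many.
peaked = [8.0, 1.0, 0.5, 0.5]
flat = [1.0, 1.0, 1.0, 1.0]
peaked_top_p = top_p_mask(peaked, 0.75)
flat_top_p = top_p_mask(flat, 0.75)
peaked_top_k = top_k_mask(peaked, 2)
```

The point of the sketch is that Top-K always keeps the same number of terms, while the number kept by Top-P adapts to how concentrated the weights are.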
[IR-1] CoRECT: A Framework for Evaluating Embedding Compression Techniques at Scale
Link: https://arxiv.org/abs/2510.19340
Authors: L. Caspari,M. Dinzinger,K. Gosh Dastidar,C. Fellicious,J. Mitrović,M. Granitzer
Categories: Information Retrieval (cs.IR)
Notes:
Abstract:Dense retrieval systems have proven to be effective across various benchmarks, but require substantial memory to store large search indices. Recent advances in embedding compression show that index sizes can be greatly reduced with minimal loss in ranking quality. However, existing studies often overlook the role of corpus complexity – a critical factor, as recent work shows that both corpus size and document length strongly affect dense retrieval performance. In this paper, we introduce CoRECT (Controlled Retrieval Evaluation of Compression Techniques), a framework for large-scale evaluation of embedding compression methods, supported by a newly curated dataset collection. To demonstrate its utility, we benchmark eight representative types of compression methods. Notably, we show that non-learned compression achieves substantial index size reduction, even on up to 100M passages, with statistically insignificant performance loss. However, selecting the optimal compression method remains challenging, as performance varies across models. Such variability highlights the necessity of CoRECT to enable consistent comparison and informed selection of compression methods. All code, data, and results are available on GitHub and HuggingFace.
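As a rough illustration of what non-learned embedding compression can look like, the sketch below uses sign-based binary quantization with Hamming-distance comparison. This is a generic textbook technique chosen for the example; the abstract does not say which eight method types CoRECT benchmarks, so nothing here should be read as the framework's API or as one of its evaluated methods. The function names and toy vectors are hypothetical.

```python
from typing import List


def binarize(embedding: List[float]) -> int:
    """Pack the sign of each dimension into one bit of an integer.

    A 32-bit float per dimension becomes a single bit, so storage
    shrinks by a factor of 32 (ignoring container overhead).
    """
    bits = 0
    for value in embedding:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bit positions where two packed codes differ."""
    return bin(a ^ b).count("1")


# Toy 4-dimensional vectors (hypothetical data for illustration).
query = binarize([0.9, -0.2, 0.4, -0.7])
doc_a = binarize([0.8, -0.1, 0.3, -0.5])   # same sign pattern as the query
doc_b = binarize([-0.6, 0.5, -0.2, 0.9])   # opposite sign pattern

dist_a = hamming(query, doc_a)
dist_b = hamming(query, doc_b)
```

A smaller Hamming distance stands in for higher similarity, so `doc_a` would rank above `doc_b` for this query.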
[IR-2] C2T-ID: Converting Semantic Codebooks to Textual Document Identifiers for Generative Search
Link: https://arxiv.org/abs/2510.19221
Authors: Yingchen Zhang,Ruqing Zhang,Jiafeng Guo,Wenjun Peng,Sen Li,Fuyu Lv,Xueqi Cheng
Categories: Information Retrieval (cs.IR)
Notes:
[IR-3] XGen-Q: An Explainable Domain-Adaptive LLM Framework with Retrieval-Augmented Generation for Software Security
Link: https://arxiv.org/abs/2510.19006
Authors: Hamed Jelodar,Mohammad Meymani,Roozbeh Razavi-Far,Ali A. Ghorbani
Categories: Information Retrieval (cs.IR)
Notes:
Abstract:Generative AI and large language models (LLMs) have shown strong capabilities in code understanding, but their use in cybersecurity, particularly for malware detection and analysis, remains limited. Existing detection systems often fail to generalize to obfuscated or previously unseen threats, underscoring the need for more adaptable and explainable models. To address this challenge, we introduce XGen-Q, a domain-adapted LLM built on the Qwen-Coder architecture and pretrained on a large-scale corpus of over one million malware samples, spanning both source and assembly code. XGen-Q uses a multi-stage prompt strategy combined with retrieval-augmented generation (RAG) to deliver reliable malware identification and detailed forensic reporting, even in the presence of complex code obfuscation. To further enhance generalization, we design a training pipeline that systematically exposes the model to diverse obfuscation patterns. Experimental results show that XGen-Q achieves significantly lower perplexity than competitive baselines and exhibits strong performance on novel malware samples, demonstrating the promise of LLM-based approaches for interpretable and robust malware analysis.
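Since the abstract reports results in terms of perplexity, a short reminder of how that metric is usually computed may help. The sketch below uses the standard definition (the exponential of the mean negative log-likelihood per token, with natural logarithms); the function name and the toy inputs are assumptions for illustration and are not taken from the paper.

```python
import math
from typing import List


def perplexity(token_log_probs: List[float]) -> float:
    """Return exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 0.25 to every token behaves like a
# uniform choice among 4 options, so its perplexity is about 4.
uniform_ppl = perplexity([math.log(0.25)] * 4)

# Higher per-token probabilities give lower perplexity.
confident_ppl = perplexity([math.log(0.9), math.log(0.8), math.log(0.95)])
```

Lower values mean the model finds the evaluated text less surprising, which is the sense in which the abstract compares XGen-Q with its baselines.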
[IR-4] SBAN: A Framework & Multi-Dimensional Dataset for Large Language Model Pre-Training and Software Code Mining
Link: https://arxiv.org/abs/2510.18936
Authors: Hamed Jelodar,Mohammad Meymani,Samita Bai,Roozbeh Razavi-Far,Ali A. Ghorbani
Categories: Information Retrieval (cs.IR); Software Engineering (cs.SE)
Notes:
Abstract:This paper introduces SBAN (Source code, Binary, Assembly, and Natural Language Description), a large-scale, multi-dimensional dataset designed to advance the pre-training and evaluation of large language models (LLMs) for software code analysis. SBAN comprises more than 3 million samples, including 2.9 million benign and 672,000 malware samples, each represented across four complementary layers: binary code, assembly instructions, natural language descriptions, and source code. This unique multimodal structure enables research on cross-representation learning, semantic understanding of software, and automated malware detection. Beyond security applications, SBAN supports broader tasks such as code translation, code explanation, and other software mining tasks involving heterogeneous data. It is particularly suited for scalable training of deep models, including transformers and other LLM architectures. By bridging low-level machine representations and high-level human semantics, SBAN provides a robust foundation for building intelligent systems that reason about code. We believe that this dataset opens new opportunities for mining software behavior, improving security analytics, and enhancing LLM capabilities in pre-training and fine-tuning tasks for software code mining.