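This post lists the latest papers retrieved from arXiv.org on 2025-10-03. The list is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the daily paper list by email, leave your email address in the comments.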
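Contents
Overview (2025-10-03)
643 papers were updated today, including:
- Natural Language Processing: 133 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 222 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 112 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 245 papers (Machine Learning (cs.LG))
Natural Language Processing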
[NLP-0] Drawing Conclusions from Draws: Rethinking Preference Semantics in Arena-Style LLM Evaluation
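Quick read: This paper addresses an unjustified assumption about the semantics of draws in arena-style rating systems for large language models (LLMs). Existing methods treat a draw as evidence that the two models are equally capable and adjust their ratings accordingly. The authors question this: a draw more likely reflects an easy or highly objective query than equal model ability. The key to the solution is to skip the rating update on draws, which improves the accuracy of battle outcome prediction (draws included). Experiments on three real-world arena datasets show a 1-3% relative accuracy gain for all four rating systems studied, and further analysis suggests that draws occur more often for queries rated as very easy and as highly objective (risk ratios of 1.37 and 1.35, respectively).
Link: https://arxiv.org/abs/2510.02306
Authors: Raphael Tang, Crystina Zhang, Wenyan Li, Carmen Lai, Pontus Stenetorp, Yao Lu
Affiliations: Centre for Artificial Intelligence, University College London; University of Waterloo; University of Copenhagen; Research and Development Center for Large Language Models, National Institute of Informatics
Subjects: Computation and Language (cs.CL)
Comments: 6 pages, 4 figures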
Abstract:In arena-style evaluation of large language models (LLMs), two LLMs respond to a user query, and the user chooses the winning response or deems the “battle” a draw, resulting in an adjustment to the ratings of both models. The prevailing approach for modeling these rating dynamics is to view battles as two-player game matches, as in chess, and apply the Elo rating system and its derivatives. In this paper, we critically examine this paradigm. Specifically, we question whether a draw genuinely means that the two models are equal and hence whether their ratings should be equalized. Instead, we conjecture that draws are more indicative of query difficulty: if the query is too easy, then both models are more likely to succeed equally. On three real-world arena datasets, we show that ignoring rating updates for draws yields a 1-3% relative increase in battle outcome prediction accuracy (which includes draws) for all four rating systems studied. Further analyses suggest that draws occur more for queries rated as very easy and those as highly objective, with risk ratios of 1.37 and 1.35, respectively. We recommend future rating systems to reconsider existing draw semantics and to account for query properties in rating updates.
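The draw-skipping rule described in the abstract can be sketched with a plain Elo update. This is a minimal illustration rather than the authors' implementation: the K-factor of 32, the 400-point scale, and the names `expected_score` and `update_elo` are assumptions made for the example, and Elo is only one of the four rating systems the paper studies.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, model_a, model_b, outcome, k=32.0, ignore_draws=True):
    """Update `ratings` in place after one battle.

    outcome is "a" (model_a wins), "b" (model_b wins), or "draw".
    With ignore_draws=True, a draw leaves both ratings untouched.
    """
    if outcome == "draw" and ignore_draws:
        return
    score_a = {"a": 1.0, "b": 0.0, "draw": 0.5}[outcome]
    r_a, r_b = ratings[model_a], ratings[model_b]
    e_a = expected_score(r_a, r_b)
    ratings[model_a] = r_a + k * (score_a - e_a)
    ratings[model_b] = r_b + k * ((1.0 - score_a) - (1.0 - e_a))


ratings = {"model_x": 1100.0, "model_y": 1000.0}
update_elo(ratings, "model_x", "model_y", "draw")  # no change when draws are ignored
update_elo(ratings, "model_x", "model_y", "b")     # the upset moves both ratings
```

With `ignore_draws=False` the same draw would pull the two ratings toward each other, which is the behavior the paper argues against.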
[NLP-1] Interactive Training: Feedback-Driven Neural Network Optimization EMNLP2025
Link: https://arxiv.org/abs/2510.02297
Authors: Wentao Zhang, Yang Young Lu, Yuntian Deng
Affiliations: University of Waterloo; University of Wisconsin-Madison
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: EMNLP 2025 Demo
[NLP-2] F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data
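Quick read: This paper targets the reliance of today's top embedding models on massive contrastive pretraining, complex training pipelines, and costly synthetic data, which makes them expensive to train, resource-hungry, and hard to reproduce. The key to the solution is F2LLM (Foundation to Feature Large Language Models), which finetunes foundation models directly on 6 million query-document-negative tuples curated from open-source, non-synthetic datasets, reaching strong embedding performance at a lower training cost. On the MTEB English leaderboard, F2LLM-4B ranks 2nd among models with approximately 4B parameters and 7th overall, and F2LLM-1.7B ranks 1st in the 1B-2B size range.
Link: https://arxiv.org/abs/2510.02294
Authors: Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang
Affiliations: Ant Group; Shanghai Jiao Tong University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: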
Abstract:We introduce F2LLM - Foundation to Feature Large Language Models, a suite of state-of-the-art embedding models in three sizes: 0.6B, 1.7B, and 4B. Unlike previous top-ranking embedding models that require massive contrastive pretraining, sophisticated training pipelines, and costly synthetic training data, F2LLM is directly finetuned from foundation models on 6 million query-document-negative tuples curated from open-source, non-synthetic datasets, striking a strong balance between training cost, model size, and embedding performance. On the MTEB English leaderboard, F2LLM-4B ranks 2nd among models with approximately 4B parameters and 7th overall, while F2LLM-1.7B ranks 1st among models in the 1B-2B size range. To facilitate future research in the field, we release the models, training dataset, and code, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future works.
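To make the "query-document-negative tuple" format concrete, here is a small sketch of how one such tuple could be scored during finetuning. The abstract does not state the training loss, so the InfoNCE-style contrastive objective, the cosine similarity, the temperature of 0.05, and the function names are all assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE loss for one (query, positive document, negatives) tuple.

    The positive document sits at index 0 of the logits; the loss is the
    negative log-softmax probability assigned to it.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[0]


# Toy 2-d embeddings: the positive matches the query, the negative does not.
loss = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
```

The loss is near zero when the query embedding is much closer to the positive document than to the negatives, and grows when a negative is closer.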
[NLP-3] From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens EMNLP2025
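Quick read: This paper addresses the lack of tooling for systematic benchmarking, intermediate-representation analysis, and interpretation of vision-language models (VLMs), in particular the absence of a unified, extensible framework that works across models for extracting and studying the intermediate outputs of each layer during the forward pass. The key to the solution is VLM-Lens, an open-source toolkit whose unified, YAML-configurable interface abstracts away model-specific details. It supports extraction of intermediate-layer outputs from 16 state-of-the-art base VLMs and more than 30 of their variants, can be extended to new models without changing the core logic, and integrates with a range of interpretability and analysis methods.
Link: https://arxiv.org/abs/2510.02292
Authors: Hala Sheta, Eric Huang, Shuyu Wu, Ilia Alenabi, Jiajun Hong, Ryker Lin, Ruoxi Ning, Daniel Wei, Jialin Yang, Jiawei Zhou, Ziqiao Ma, Freda Shi
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: EMNLP 2025 System Demonstration | Code: this https URL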
Abstract: We introduce VLM-Lens, a toolkit designed to enable systematic benchmarking, analysis, and interpretation of vision-language models (VLMs) by supporting the extraction of intermediate outputs from any layer during the forward pass of open-source VLMs. VLM-Lens provides a unified, YAML-configurable interface that abstracts away model-specific complexities and supports user-friendly operation across diverse VLMs. It currently supports 16 state-of-the-art base VLMs and their over 30 variants, and is extensible to accommodate new models without changing the core logic. The toolkit integrates easily with various interpretability and analysis methods. We demonstrate its usage with two simple analytical experiments, revealing systematic differences in the hidden representations of VLMs across layers and target concepts. VLM-Lens is released as an open-sourced project to accelerate community efforts in understanding and improving VLMs.
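[NLP-4] Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks
Quick read: This paper addresses the vulnerability of large language models (LLMs) to adversarial attacks in multi-turn interactions, and in particular the failure of existing methods to explore novel multi-turn attack trajectories that emerge from complex dialogue dynamics and strategic conversation planning. The key to the solution is DialTree-RPO, an on-policy reinforcement learning framework integrated with tree search. It treats the dialogue as a sequential decision-making problem, so it can systematically explore and discover diverse multi-turn attack strategies without manually curated data. Across 10 target models it achieves more than 25.9% higher attack success rate (ASR) than previous state-of-the-art approaches.
Link: https://arxiv.org/abs/2510.02286
Authors: Ruohao Guo, Afshin Oroojlooy, Roshan Sridhar, Miguel Ballesteros, Alan Ritter, Dan Roth
Affiliations: Georgia Institute of Technology; Oracle AI
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: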
Abstract:Despite recent rapid progress in AI safety, current large language models remain vulnerable to adversarial attacks in multi-turn interaction settings, where attackers strategically adapt their prompts across conversation turns and pose a more critical yet realistic challenge. Existing approaches that discover safety vulnerabilities either rely on manual red-teaming with human experts or employ automated methods using pre-defined templates and human-curated attack data, with most focusing on single-turn attacks. However, these methods did not explore the vast space of possible multi-turn attacks, failing to consider novel attack trajectories that emerge from complex dialogue dynamics and strategic conversation planning. This gap is particularly critical given recent findings that LLMs exhibit significantly higher vulnerability to multi-turn attacks compared to single-turn attacks. We propose DialTree-RPO, an on-policy reinforcement learning framework integrated with tree search that autonomously discovers diverse multi-turn attack strategies by treating the dialogue as a sequential decision-making problem, enabling systematic exploration without manually curated data. Through extensive experiments, our approach not only achieves more than 25.9% higher ASR across 10 target models compared to previous state-of-the-art approaches, but also effectively uncovers new attack strategies by learning optimal dialogue policies that maximize attack success across multiple turns.
[NLP-5] Parallel Scaling Law: Unveiling Reasoning Generalization through A Cross-Linguistic Perspective
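Quick read: This paper studies whether the reasoning ability that Large Reasoning Models (LRMs) gain from English Reinforcement Post-Training (RPT) transfers to other languages, that is, whether English-centric LRMs truly generalize their reasoning across languages. The authors introduce a metric to quantify cross-lingual transferability and, through a parallel training study, report three key findings. First, the First-Parallel Leap: performance jumps substantially when moving from monolingual training to just one parallel language. Second, the Parallel Scaling Law: cross-lingual reasoning transfer follows a power law in the number of parallel training languages. Third, the Monolingual Generalization Gap: actual monolingual performance deviates from the power-law prediction, indicating that English-centric LRMs fail to fully generalize across languages. The study challenges the assumption that LRM reasoning mirrors human cognition and offers guidance for building more language-agnostic LRMs.
Link: https://arxiv.org/abs/2510.02272
Authors: Wen Yang, Junhong Wu, Chong Li, Chengqing Zong, Jiajun Zhang
Affiliations: University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences; Wuhan AI Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Work in progress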
Abstract: Recent advancements in Reinforcement Post-Training (RPT) have significantly enhanced the capabilities of Large Reasoning Models (LRMs), sparking increased interest in the generalization of RL-based reasoning. While existing work has primarily focused on investigating its generalization across tasks or modalities, this study proposes a novel cross-linguistic perspective to investigate reasoning generalization. This raises a crucial question: Does the reasoning capability achieved from English RPT effectively transfer to other languages? We address this by systematically evaluating English-centric LRMs on multilingual reasoning benchmarks and introducing a metric to quantify cross-lingual transferability. Our findings reveal that cross-lingual transferability varies significantly across initial model, target language, and training paradigm. Through interventional studies, we find that models with stronger initial English capabilities tend to over-rely on English-specific patterns, leading to diminished cross-lingual generalization. To address this, we conduct a thorough parallel training study. Experimental results yield three key findings: First-Parallel Leap, a substantial leap in performance when transitioning from monolingual to just a single parallel language, and a predictable Parallel Scaling Law, revealing that cross-lingual reasoning transfer follows a power-law with the number of training parallel languages. Moreover, we identify the discrepancy between actual monolingual performance and the power-law prediction as Monolingual Generalization Gap, indicating that English-centric LRMs fail to fully generalize across languages. Our study challenges the assumption that LRM reasoning mirrors human cognition, providing critical insights for the development of more language-agnostic LRMs.
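The power-law relationship mentioned in the abstract can be illustrated with a simple fit. The sketch below fits y = a * n^b by ordinary least squares in log-log space; the functional form with two parameters, the function name, and the toy data points are assumptions for illustration and are not taken from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by linear regression on (log x, log y).

    Returns (a, b). All xs and ys must be strictly positive.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    s_xx = sum((x - mean_x) ** 2 for x in log_x)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    b = s_xy / s_xx
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Hypothetical transfer scores for 1, 2, 4, and 8 parallel languages.
num_languages = [1, 2, 4, 8]
transfer_score = [3.0 * n ** 0.5 for n in num_languages]
a, b = fit_power_law(num_languages, transfer_score)
```

On real measurements the fitted exponent `b` summarizes how quickly transfer improves as parallel languages are added.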
[NLP-6] InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents
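Quick read: This paper addresses two weaknesses of current LLM agents in multi-source information seeking: heavy reliance on open-web search, whose content is noisy and unreliable, and the fact that many real-world tasks need precise, domain-specific knowledge that the web does not provide. The authors propose InfoMosaic-Bench, the first benchmark dedicated to multi-source information seeking in tool-augmented agents. Its key component is InfoMosaic-Flow, a scalable pipeline that grounds task conditions in verified tool outputs, enforces cross-source dependencies, and filters out shortcut cases solvable by trivial lookup, which ensures tasks are both reliable and non-trivial. Experiments show that web information alone is insufficient (GPT-5 reaches only 38.2% accuracy and a 67.5% pass rate), that domain tools help in some domains but hurt in others, and that 22.4% of failures stem from incorrect tool usage or selection, showing that current LLMs still struggle with basic tool handling.
Link: https://arxiv.org/abs/2510.02271
Authors: Yaxin Du, Yuanshuo Zhang, Xiyuan Yang, Yifan Zhou, Cheng Wang, Gongyi Zou, Xianghe Pang, Wenhao Wang, Menglan Chen, Shuo Tang, Zhiyu Li, Siheng Chen
Affiliations: Shanghai Jiao Tong University; The Chinese University of Hong Kong; Zhejiang University; University of Oxford; MemTensor
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: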
Abstract:Information seeking is a fundamental requirement for humans. However, existing LLM agents rely heavily on open-web search, which exposes two fundamental weaknesses: online content is noisy and unreliable, and many real-world tasks require precise, domain-specific knowledge unavailable from the web. The emergence of the Model Context Protocol (MCP) now allows agents to interface with thousands of specialized tools, seemingly resolving this limitation. Yet it remains unclear whether agents can effectively leverage such tools – and more importantly, whether they can integrate them with general-purpose search to solve complex tasks. Therefore, we introduce InfoMosaic-Bench, the first benchmark dedicated to multi-source information seeking in tool-augmented agents. Covering six representative domains (medicine, finance, maps, video, web, and multi-domain integration), InfoMosaic-Bench requires agents to combine general-purpose search with domain-specific tools. Tasks are synthesized with InfoMosaic-Flow, a scalable pipeline that grounds task conditions in verified tool outputs, enforces cross-source dependencies, and filters out shortcut cases solvable by trivial lookup. This design guarantees both reliability and non-triviality. Experiments with 14 state-of-the-art LLM agents reveal three findings: (i) web information alone is insufficient, with GPT-5 achieving only 38.2% accuracy and 67.5% pass rate; (ii) domain tools provide selective but inconsistent benefits, improving some domains while degrading others; and (iii) 22.4% of failures arise from incorrect tool usage or selection, highlighting that current LLMs still struggle with even basic tool handling.
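[NLP-7] RLAD: Training LLMs to Discover Abstractions for Solving Reasoning Problems
Quick read: This paper addresses the difficulty large models have in consistently identifying and reusing algorithmic procedures in complex reasoning tasks; their reasoning traces tend to drift into verbose, degenerate exploration. The core idea is to introduce reasoning abstractions: concise natural-language descriptions of procedural and factual knowledge that guide the model toward successful reasoning. A two-player RL training paradigm, RLAD, jointly trains an abstraction generator and a solution generator, which enables structured exploration, decouples the learning signals of abstraction proposal and solution generation, and improves generalization to harder problems. The authors also show that, at large test budgets, allocating more test-time compute to generating abstractions helps more than generating more solutions.
Link: https://arxiv.org/abs/2510.02263
Authors: Yuxiao Qu, Anikait Singh, Yoonho Lee, Amrith Setlur, Ruslan Salakhutdinov, Chelsea Finn, Aviral Kumar
Affiliations: Carnegie Mellon University; Stanford University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: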
Abstract:Reasoning requires going beyond pattern matching or memorization of solutions to identify and implement “algorithmic procedures” that can be used to deduce answers to hard problems. Doing so requires realizing the most relevant primitives, intermediate results, or shared procedures, and building upon them. While RL post-training on long chains of thought ultimately aims to uncover this kind of algorithmic behavior, most reasoning traces learned by large models fail to consistently capture or reuse procedures, instead drifting into verbose and degenerate exploration. To address more effective reasoning, we introduce reasoning abstractions: concise natural language descriptions of procedural and factual knowledge that guide the model toward learning successful reasoning. We train models to be capable of proposing multiple abstractions given a problem, followed by RL that incentivizes building a solution while using the information provided by these abstractions. This results in a two-player RL training paradigm, abbreviated as RLAD, that jointly trains an abstraction generator and a solution generator. This setup effectively enables structured exploration, decouples learning signals of abstraction proposal and solution generation, and improves generalization to harder problems. We also show that allocating more test-time compute to generating abstractions is more beneficial for performance than generating more solutions at large test budgets, illustrating the role of abstractions in guiding meaningful exploration.
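[NLP-8] The Unreasonable Effectiveness of Scaling Agents for Computer Use
Quick read: This paper addresses the unreliability and high variance that keep computer-use agents (CUAs) from being applied to long-horizon, complex digital tasks. The key to the solution is Behavior Best-of-N (bBoN), which generates multiple rollouts and selects among them using behavior narratives that describe each rollout, combining wide exploration with principled trajectory selection to improve robustness and success rates. The method sets a new state of the art of 69.9% on OSWorld, approaching human-level performance at 72%. Ablations validate the key design choices, and the method generalizes well to other operating systems on WindowsAgentArena and AndroidWorld.
Link: https://arxiv.org/abs/2510.02250
Authors: Gonzalo Gonzalez-Pumariega, Vincent Tu, Chih-Lun Lee, Jiachen Yang, Ang Li, Xin Eric Wang
Affiliations: Simular Research
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 23 pages, 7 figures, 10 tables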
Abstract:Computer-use agents (CUAs) hold promise for automating everyday digital tasks, but their unreliability and high variance hinder their application to long-horizon, complex tasks. We introduce Behavior Best-of-N (bBoN), a method that scales over agents by generating multiple rollouts and selecting among them using behavior narratives that describe the agents’ rollouts. It enables both wide exploration and principled trajectory selection, substantially improving robustness and success rates. On OSWorld, our bBoN scaling method establishes a new state of the art (SoTA) at 69.9%, significantly outperforming prior methods and approaching human-level performance at 72%, with comprehensive ablations validating key design choices. We further demonstrate strong generalization results to different operating systems on WindowsAgentArena and AndroidWorld. Crucially, our results highlight the unreasonable effectiveness of scaling CUAs, when you do it right: effective scaling requires structured trajectory understanding and selection, and bBoN provides a practical framework to achieve this.
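The control flow of a best-of-N selection over behavior narratives can be sketched in a few lines. Everything here is hypothetical scaffolding: `narrate` stands in for whatever turns a rollout into a textual narrative, `judge` stands in for whatever compares narratives and returns the index of the preferred one, and the toy rollouts are invented. The paper's actual narrative generation and selection procedure are not described in the abstract.

```python
from typing import Callable, List, Sequence, TypeVar

Rollout = TypeVar("Rollout")


def behavior_best_of_n(
    rollouts: Sequence[Rollout],
    narrate: Callable[[Rollout], str],
    judge: Callable[[List[str]], int],
) -> Rollout:
    """Pick one rollout by comparing textual narratives of all candidates."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    narratives = [narrate(r) for r in rollouts]
    best_index = judge(narratives)
    return rollouts[best_index]


# Toy stand-ins: a rollout is a list of action strings, and the judge
# prefers the first narrative that mentions saving the file.
def toy_narrate(actions):
    return " then ".join(actions)


def toy_judge(narratives):
    for i, text in enumerate(narratives):
        if "save" in text:
            return i
    return 0


candidates = [["open app", "click cancel"], ["open app", "click save"]]
chosen = behavior_best_of_n(candidates, toy_narrate, toy_judge)
```

Separating `narrate` from `judge` mirrors the abstract's point that selection operates on descriptions of behavior rather than on raw trajectories.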
[NLP-9] Explore Briefly Then Decide: Mitigating LLM Overthinking via Cumulative Entropy Regulation
Quick read: This paper addresses overthinking in large language models (LLMs): generating unnecessarily long Chain-of-Thought (CoT) reasoning for simpler problems, which hurts efficiency and makes it hard to adapt reasoning depth to problem complexity. The key to the solution is a new metric, Token Entropy Cumulative Average (TECA), which measures the extent of exploration during reasoning, together with a reasoning paradigm called "Explore Briefly, Then Decide" and an associated Cumulative Entropy Regulation (CER) mechanism. The paradigm uses TECA to help the model dynamically determine the best point to stop thinking and give a final answer, yielding efficient, adaptive reasoning.
Link: https://arxiv.org/abs/2510.02249
Authors: Tianyi Jiang, Yi Bin, Yujuan Ding, Kainian Zhu, Fei Ma, Jingkuan Song, Heng Tao Shen
Affiliations: Tongji University; Hong Kong Polytechnic University; Shanghai University of Electric Power; Guangdong Laboratory of Artificial Intelligence and Digital Economy
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Large Language Models (LLMs) have demonstrated remarkable reasoning abilities on complex problems using long Chain-of-Thought (CoT) reasoning. However, they often suffer from overthinking, meaning generating unnecessarily lengthy reasoning steps for simpler problems. This issue may degrade the efficiency of the models and make them difficult to adapt the reasoning depth to the complexity of problems. To address this, we introduce a novel metric Token Entropy Cumulative Average (TECA), which measures the extent of exploration throughout the reasoning process. We further propose a novel reasoning paradigm – Explore Briefly, Then Decide – with an associated Cumulative Entropy Regulation (CER) mechanism. This paradigm leverages TECA to help the model dynamically determine the optimal point to conclude its thought process and provide a final answer, thus achieving efficient reasoning. Experimental results across diverse mathematical benchmarks show that our approach substantially mitigates overthinking without sacrificing problem-solving ability. With our thinking paradigm, the average response length decreases by up to 71% on simpler datasets, demonstrating the effectiveness of our method in creating a more efficient and adaptive reasoning process.
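One plausible reading of Token Entropy Cumulative Average is the running mean of per-token Shannon entropy over the generated sequence. The abstract does not give the formula, so the definition below (natural-log entropy of each next-token distribution, averaged over all steps so far) and the function names are assumptions for illustration.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def teca(step_distributions):
    """Running mean of token entropy after each generation step.

    step_distributions[t] is the next-token distribution at step t.
    Returns a list whose t-th entry averages the entropies of steps 0..t.
    """
    averages = []
    total = 0.0
    for step, probs in enumerate(step_distributions, start=1):
        total += token_entropy(probs)
        averages.append(total / step)
    return averages


# Toy trace: an uncertain first token followed by a fully confident one.
trace = [[0.5, 0.5], [1.0, 0.0]]
running = teca(trace)
```

Under this reading, a curve that stays high signals continued exploration, while a falling curve signals that the model has settled on an answer.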
[NLP-10] ExGRPO: Learning to Reason from Experience
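Quick read: This paper addresses the computational inefficiency and instability of Reinforcement Learning from Verifiable Rewards (RLVR) that arise because standard on-policy training discards rollout experience after a single update. The key to the solution is ExGRPO (Experiential Group Relative Policy Optimization), a framework that identifies valuable experiences (using rollout correctness and entropy as indicators), organizes and prioritizes them, and uses a mixed-policy objective to balance exploration with experience exploitation. This improves the reasoning performance of large language models on mathematical and general benchmarks and stabilizes training.
Link: https://arxiv.org/abs/2510.02245
Authors: Runzhe Zhan, Yafu Li, Zhi Wang, Xiaoye Qu, Dongrui Liu, Jing Shao, Derek F. Wong, Yu Cheng
Affiliations: University of Macau; Shanghai AI Laboratory; Nanjing University; The Chinese University of Hong Kong
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: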
Abstract:Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. However, standard on-policy training discards rollout experiences after a single update, leading to computational inefficiency and instability. While prior work on RL has highlighted the benefits of reusing past experience, the role of experience characteristics in shaping learning dynamics of large reasoning models remains underexplored. In this paper, we are the first to investigate what makes a reasoning experience valuable and identify rollout correctness and entropy as effective indicators of experience value. Based on these insights, we propose ExGRPO (Experiential Group Relative Policy Optimization), a framework that organizes and prioritizes valuable experiences, and employs a mixed-policy objective to balance exploration with experience exploitation. Experiments on five backbone models (1.5B-8B parameters) show that ExGRPO consistently improves reasoning performance on mathematical/general benchmarks, with an average gain of +3.5/7.6 points over on-policy RLVR. Moreover, ExGRPO stabilizes training on both stronger and weaker models where on-policy methods fail. These results highlight principled experience management as a key ingredient for efficient and scalable RLVR.
[NLP-11] AccurateRAG: A Framework for Building Accurate Retrieval-Augmented Question-Answering Applications
Quick read: This paper addresses the low development efficiency and unstable performance of current retrieval-augmented generation (RAG) systems, which make it hard to reach the best question-answering performance in real deployments. The key to the solution is AccurateRAG, a framework offering a development pipeline with tools for raw dataset processing, fine-tuning data generation, text embedding and large language model (LLM) fine-tuning, output evaluation, and building RAG systems locally. The framework outperforms previous strong baselines and obtains new state-of-the-art question-answering performance on benchmark datasets.
Link: https://arxiv.org/abs/2510.02243
Authors: Linh The Nguyen, Chi Tran, Dung Ngoc Nguyen, Van-Cuong Pham, Hoang Ngo, Dat Quoc Nguyen
Affiliations: Qualcomm AI Research; Qualcomm Technologies, Inc.; Qualcomm Vietnam Company Limited
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:We introduce AccurateRAG – a novel framework for constructing high-performance question-answering applications based on retrieval-augmented generation (RAG). Our framework offers a pipeline for development efficiency with tools for raw dataset processing, fine-tuning data generation, text embedding LLM fine-tuning, output evaluation, and building RAG systems locally. Experimental results show that our framework outperforms previous strong baselines and obtains new state-of-the-art question-answering performance on benchmark datasets.
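[NLP-12] Study on LLMs for Promptagator-Style Dense Retriever Training CIKM2025
Quick read: This paper addresses the original Promptagator approach's reliance on proprietary, large-scale large language models (LLMs), which users may not have access to or may be prohibited from using with sensitive data. The key is to evaluate open-source LLMs at accessible scales (≤14B parameters) as Promptagator-style query generators. The study finds that open-source models as small as 3B parameters can serve as effective query generators, giving practitioners a reliable, accessible alternative for synthetic data generation when fine-tuning domain-specific retrieval models.
Link: https://arxiv.org/abs/2510.02241
Authors: Daniel Gwon, Nour Jedidi, Jimmy Lin
Affiliations: Massachusetts Institute of Technology Lincoln Laboratory; University of Waterloo
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: CIKM 2025 short research paper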
Abstract: Promptagator demonstrated that Large Language Models (LLMs) with few-shot prompts can be used as task-specific query generators for fine-tuning domain-specialized dense retrieval models. However, the original Promptagator approach relied on proprietary and large-scale LLMs which users may not have access to or may be prohibited from using with sensitive data. In this work, we study the impact of open-source LLMs at accessible scales (≤14B parameters) as an alternative. Our results demonstrate that open-source LLMs as small as 3B parameters can serve as effective Promptagator-style query generators. We hope our work will inform practitioners with reliable alternatives for synthetic data generation and give insights to maximize fine-tuning results for domain-specific applications.
[NLP-13] Enhanced Arabic-language cyberbullying detection: deep embedding and transformer (BERT) approaches
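Quick read: This paper addresses the scarcity of automatic methods for detecting Arabic-language cyberbullying, which threatens the emotional well-being of young people on social media platforms. The authors assembled a dataset of 10,662 posts from X (formerly Twitter) and used the kappa tool to verify annotation quality. They then ran four experiments comparing several deep learning models. LSTM-BERT and Bi-LSTM-BERT reached 97% accuracy, while a bidirectional long short-term memory (Bi-LSTM) model with FastText word embeddings performed best at 98% accuracy.
Link: https://arxiv.org/abs/2510.02232
Authors: Ebtesam Jaber Aljohani, Wael M. S. Yafoo
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: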
Abstract: Recent technological advances in smartphones and communications, including the growth of massive social media networks such as X (formerly known as Twitter), endanger young people and their emotional well-being by exposing them to cyberbullying, taunting, and bullying content. Most proposed approaches for automatically detecting cyberbullying have been developed around the English language, and methods for detecting Arabic-language cyberbullying are especially scarce. This paper aims to enhance the effectiveness of methods for detecting cyberbullying in Arabic-language content. We assembled a dataset of 10,662 X posts, pre-processed the data, and used the kappa tool to verify and enhance the quality of our annotations. We conducted four experiments to test numerous deep learning models for automatically detecting Arabic-language cyberbullying. We first tested a long short-term memory (LSTM) model and a bidirectional long short-term memory (Bi-LSTM) model with several experimental word embeddings. We also tested the LSTM and Bi-LSTM models with a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model and then tested them again with a different experimental BERT model. LSTM-BERT and Bi-LSTM-BERT demonstrated a 97% accuracy. Bi-LSTM with FastText word embeddings performed even better, achieving 98% accuracy. As a result, the outcomes are generalize
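The abstract mentions using kappa to verify annotation quality without saying which variant. As an illustration, the sketch below computes Cohen's kappa for two annotators; the choice of Cohen's kappa, the function name, and the toy labels are assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy annotations: 1 = cyberbullying, 0 = not cyberbullying.
annotator_1 = [1, 1, 0, 0, 1, 0]
annotator_2 = [1, 1, 0, 0, 0, 0]
kappa = cohens_kappa(annotator_1, annotator_2)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance.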
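[NLP-14] The Reasoning Boundary Paradox: How Reinforcement Learning Constrains Language Models
Quick read: This paper investigates the reasoning-boundary shrinkage observed when Reinforcement Learning with Verifiable Rewards (RLVR) is used to improve the reasoning of large language models: RLVR can narrow, rather than expand, the set of problems the model can solve. Through theoretical and empirical analysis, the authors identify two phenomena. The first is negative interference: learning to solve some training problems reduces the likelihood of correct solutions for others. The second is a winner-take-all effect: RLVR disproportionately reinforces problems whose correct solutions already have high likelihood under the base model while suppressing initially low-likelihood ones. The effect arises from the on-policy sampling inherent in standard RL objectives, which drives the model toward narrow solution strategies. The proposed fix is a simple yet effective data curation algorithm that focuses RLVR learning on low-likelihood problems, yielding a notable improvement in Pass@k.
Link: https://arxiv.org/abs/2510.02230
Authors: Phuc Minh Nguyen, Chinh D. La, Duy M. H. Nguyen, Nitesh V. Chawla, Binh T. Nguyen, Khoa D. Doan
Affiliations: VinUniversity; University of Stuttgart; University of Notre Dame; German Research Center for Artificial Intelligence (DFKI); Max Planck Research School for Intelligent Systems (IMPRS-IS)
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 15 figures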
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key method for improving Large Language Models' reasoning capabilities, yet recent evidence suggests it may paradoxically shrink the reasoning boundary rather than expand it. This paper investigates the shrinkage issue of RLVR by analyzing its learning dynamics and reveals two critical phenomena that explain this failure. First, we expose negative interference in RLVR, where learning to solve certain training problems actively reduces the likelihood of correct solutions for others, leading to the decline of Pass@k performance, or the probability of generating a correct solution within k attempts. Second, we uncover the winner-take-all phenomenon: RLVR disproportionately reinforces problems with high-likelihood correct solutions under the base model, while suppressing other initially low-likelihood ones. Through extensive theoretical and empirical analysis on multiple mathematical reasoning benchmarks, we show that this effect arises from the inherent on-policy sampling in standard RL objectives, causing the model to converge toward narrow solution strategies. Based on these insights, we propose a simple yet effective data curation algorithm that focuses RLVR learning on low-likelihood problems, achieving notable improvement in Pass@k performance. Our code is available at this https URL.
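The abstract defines Pass@k as the probability of generating a correct solution within k attempts. A common way to estimate it from n sampled solutions, c of which are correct, is the unbiased estimator 1 - C(n-c, k) / C(n, k) used in code-generation evaluation. Whether this paper uses exactly that estimator is an assumption, and the function name is chosen for the example.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples with c correct ones.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no correct solution.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, which yields a Pass@k of 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# 16 samples for one problem, 2 of them correct.
p1 = pass_at_k(16, 2, 1)
p8 = pass_at_k(16, 2, 8)
```

Comparing `p1` with `p8` shows why a model can lose Pass@k at large k while its single-attempt accuracy improves: large k rewards having at least one correct sample on many different problems.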
[NLP-15] More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration
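Quick read: This paper addresses a limitation of current Reinforcement Learning with Verifiable Rewards (RLVR) methods for improving the reasoning of large language models (LLMs): they rely on self-exploration or a single off-policy teacher, which may introduce intrinsic model biases and restrict exploration, limiting reasoning diversity and performance. The key to the solution is Adaptive Multi-Guidance Policy Optimization (AMPO), a framework that follows a "guidance-on-demand" approach: it draws on guidance from multiple proficient teacher models only when the on-policy model fails to generate correct solutions, which expands exploration while preserving the value of self-discovery. AMPO also includes a comprehension-based selection mechanism that prompts the student to learn from the reasoning paths it is most likely to comprehend, balancing broad exploration with effective exploitation.
Link: https://arxiv.org/abs/2510.02227
Authors: Xiaoyang Yuan, Yujuan Ding, Yi Bin, Wenqi Shao, Jinyu Cai, Jingkuan Song, Yang Yang, Hengtao Shen
Affiliations: Tongji University; Hong Kong Polytechnic University; Shanghai AI Lab; National University of Singapore; University of Electronic Science and Technology of China
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages, 5 figures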
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) is a promising paradigm for enhancing the reasoning ability in Large Language Models (LLMs). However, prevailing methods primarily rely on self-exploration or a single off-policy teacher to elicit long chain-of-thought (LongCoT) reasoning, which may introduce intrinsic model biases and restrict exploration, ultimately limiting reasoning diversity and performance. Drawing inspiration from multi-teacher strategies in knowledge distillation, we introduce Adaptive Multi-Guidance Policy Optimization (AMPO), a novel framework that adaptively leverages guidance from multiple proficient teacher models, but only when the on-policy model fails to generate correct solutions. This “guidance-on-demand” approach expands exploration while preserving the value of self-discovery. Moreover, AMPO incorporates a comprehension-based selection mechanism, prompting the student to learn from the reasoning paths that it is most likely to comprehend, thus balancing broad exploration with effective exploitation. Extensive experiments show AMPO substantially outperforms a strong baseline (GRPO), with a 4.3% improvement on mathematical reasoning tasks and 12.2% on out-of-distribution tasks, while significantly boosting Pass@k performance and enabling more diverse exploration. Notably, using four peer-sized teachers, our method achieves comparable results to approaches that leverage a single, more powerful teacher (e.g., DeepSeek-R1) with more data. These results demonstrate a more efficient and scalable path to superior reasoning and generalizability. Our code is available at this https URL.
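The "guidance-on-demand" condition can be sketched as a small gating function: teacher solutions enter the training group only when every on-policy rollout is wrong. This is a hypothetical simplification. The function name, the `is_correct` verifier, the `max_teacher` cap, and the choice to append teacher solutions rather than replace rollouts are assumptions, and the comprehension-based selection step is left out.

```python
def build_training_group(on_policy_rollouts, teacher_solutions, is_correct, max_teacher=1):
    """Return the rollouts to train on for one prompt.

    If any on-policy rollout is correct, the group is left as-is so the
    model learns from its own discoveries. Otherwise up to `max_teacher`
    correct teacher solutions are appended as off-policy guidance.
    """
    if any(is_correct(r) for r in on_policy_rollouts):
        return list(on_policy_rollouts)
    guidance = [t for t in teacher_solutions if is_correct(t)][:max_teacher]
    return list(on_policy_rollouts) + guidance


# Toy verifier: an answer string counts as correct if it ends with "42".
def ends_with_42(answer):
    return answer.endswith("42")


group = build_training_group(
    ["I think 41", "maybe 40"],
    ["teacher A says 42", "teacher B says 42"],
    ends_with_42,
)
```

In a fuller version, the slice that picks the first correct teacher solutions would be replaced by a ranking of how well the student is likely to understand each reasoning path.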
[NLP-16] StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?
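Quick read: This paper addresses the under-evaluation of large language models (LLMs) in finance, and in stock trading in particular. Existing financial benchmarks mainly test static knowledge through question answering and do not capture the dynamic, iterative nature of trading. The authors propose StockBench, a contamination-free benchmark that evaluates LLM agents in realistic, multi-month stock trading environments. Agents receive daily market signals (prices, fundamentals, and news) and must make sequential buy, sell, or hold decisions; performance is assessed with financial metrics such as cumulative return, maximum drawdown, and the Sortino ratio. This extends evaluation from static knowledge to dynamic decision-making and risk management.
Link: https://arxiv.org/abs/2510.02209
Authors: Yanxu Chen, Zijun Yao, Yantao Liu, Jin Ye, Jianing Yu, Lei Hou, Juanzi Li
Affiliations: Tsinghua University; Beijing University of Posts and Telecommunications
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: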
Abstract:Large language models (LLMs) have recently demonstrated strong capabilities as autonomous agents, showing promise in reasoning, tool use, and sequential decision-making. While prior benchmarks have evaluated LLM agents in domains such as software engineering and scientific discovery, the finance domain remains underexplored, despite its direct relevance to economic value and high-stakes decision-making. Existing financial benchmarks primarily test static knowledge through question answering, but they fall short of capturing the dynamic and iterative nature of trading. To address this gap, we introduce StockBench, a contamination-free benchmark designed to evaluate LLM agents in realistic, multi-month stock trading environments. Agents receive daily market signals – including prices, fundamentals, and news – and must make sequential buy, sell, or hold decisions. Performance is assessed using financial metrics such as cumulative return, maximum drawdown, and the Sortino ratio. Our evaluation of state-of-the-art proprietary (e.g., GPT-5, Claude-4) and open-weight (e.g., Qwen3, Kimi-K2, GLM-4.5) models shows that while most LLM agents struggle to outperform the simple buy-and-hold baseline, several models demonstrate the potential to deliver higher returns and manage risk more effectively. These findings highlight both the challenges and opportunities in developing LLM-powered financial agents, showing that excelling at static financial knowledge tasks does not necessarily translate into successful trading strategies. We release StockBench as an open-source resource to support reproducibility and advance future research in this domain.
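The three metrics named in the abstract can be computed from a series of portfolio values. The sketch below uses common textbook definitions (per-period, not annualized, with a target return of 0 for the Sortino ratio); the exact conventions StockBench uses are not stated here, so treat these formulas and function names as assumptions.

```python
import math
from typing import List, Sequence


def period_returns(values: Sequence[float]) -> List[float]:
    """Simple returns between consecutive portfolio values."""
    return [values[i] / values[i - 1] - 1.0 for i in range(1, len(values))]


def cumulative_return(values: Sequence[float]) -> float:
    """Total return from the first to the last portfolio value."""
    return values[-1] / values[0] - 1.0


def max_drawdown(values: Sequence[float]) -> float:
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst


def sortino_ratio(returns: Sequence[float], target: float = 0.0) -> float:
    """Mean excess return divided by downside deviation (assumed definition)."""
    excess = [r - target for r in returns]
    mean_excess = sum(excess) / len(excess)
    downside = math.sqrt(sum(min(e, 0.0) ** 2 for e in excess) / len(excess))
    if downside == 0.0:
        return float("inf")  # no return fell below the target
    return mean_excess / downside
```

For example, a portfolio that moves through the values 100, 120, 90, 110 has a cumulative return of 10% and a maximum drawdown of 25% (the fall from 120 to 90).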
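[NLP-17] Say One Thing, Do Another? Diagnosing Reasoning-Execution Gaps in VLM-Powered Mobile-Use Agents
Quick read: This paper addresses the reasoning-execution gap in mobile-use agents powered by vision-language models (VLMs) when they carry out natural language instructions. Existing evaluations look only at execution accuracy and ignore whether the chain-of-thought (CoT) reasoning is consistent with the ground-truth action, so users may trust a wrong action because its reasoning looks plausible, which creates safety risks. The key to the solution is a new evaluation framework built around Ground-Truth Alignment (GTA), which measures whether the action implied by the CoT matches the ground-truth action. Combined with the standard Exact Match (EM) metric, it assesses reasoning accuracy and execution accuracy jointly. The framework identifies two types of gap, the Execution Gap (EG) and the Reasoning Gap (RG), reveals systematic deviations that are widespread in current state-of-the-art models, and offers an interpretable diagnostic tool for building more trustworthy mobile agents.
Link: https://arxiv.org/abs/2510.02204
Authors: Lingzhong Dong, Ziqi Zhou, Shuaibo Yang, Haiyue Sheng, Pengzhou Cheng, Zongru Wu, Zheng Wu, Gongshen Liu, Zhuosheng Zhang
Affiliations: Shanghai Jiao Tong University; Beijing Institute of Technology
Categories: Computation and Language (cs.CL)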
Abstract:Mobile-use agents powered by vision-language models (VLMs) have shown great potential in interpreting natural language instructions and generating corresponding actions based on mobile graphical user interface. Recent studies suggest that incorporating chain-of-thought (CoT) reasoning tends to improve the execution accuracy. However, existing evaluations emphasize execution accuracy while neglecting whether CoT reasoning aligns with ground-truth actions. This oversight fails to assess potential reasoning-execution gaps, which in turn foster over-trust: users relying on seemingly plausible CoTs may unknowingly authorize harmful actions, potentially resulting in financial loss or trust crisis. In this work, we introduce a new evaluation framework to diagnose reasoning-execution gaps. At its core lies Ground-Truth Alignment (GTA), which measures whether the action implied by a CoT matches the ground-truth action. By combining GTA with the standard Exact Match (EM) metric, we jointly assess both the reasoning accuracy and execution accuracy. This joint perspective reveals two types of reasoning-execution gaps: (i) Execution Gap (EG), where the reasoning correctly identifies the correct action but execution fails, and (ii) Reasoning Gap (RG), where execution succeeds but reasoning process conflicts with the actual execution. Experimental results across a wide range of mobile interaction tasks reveal that reasoning-execution gaps are prevalent, with execution gaps occurring more frequently than reasoning gaps. Moreover, while scaling up model size reduces the overall gap, sizable execution gaps persist even in the largest models. Further analysis shows that our framework reliably reflects systematic EG/RG patterns in state-of-the-art models. These findings offer concrete diagnostics and support the development of more trustworthy mobile-use agents.
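The two gap types follow directly from crossing the EM and GTA outcomes for each step. Below is a minimal sketch of that bookkeeping, assuming each evaluated step is already reduced to a pair of booleans `(em, gta)`; how GTA itself is judged is not covered, and the function names are illustrative only.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def classify_step(em: bool, gta: bool) -> str:
    """Map one step's (Exact Match, Ground-Truth Alignment) pair to a label."""
    if gta and not em:
        return "execution_gap"  # reasoning implies the right action, execution fails
    if em and not gta:
        return "reasoning_gap"  # execution is right, reasoning implies another action
    return "both_correct" if em else "both_wrong"


def gap_rates(records: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Fraction of steps falling into each of the four categories."""
    labels = [classify_step(em, gta) for em, gta in records]
    counts = Counter(labels)
    names = ("both_correct", "execution_gap", "reasoning_gap", "both_wrong")
    return {name: counts[name] / len(labels) for name in names}
```

Reporting the `execution_gap` and `reasoning_gap` rates next to plain EM accuracy makes visible the cases in which a plausible-looking CoT and the executed action disagree.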
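[NLP-18] ARUQULA – An LLM based Text2SPARQL Approach using ReAct and Knowledge Graph Exploration Utilities ESWC2025
Quick read: This paper addresses the difficulty that users without a computer science background face when interacting with knowledge graphs: the SPARQL query language has a high barrier to entry. To lower it, the paper proposes a generalized method based on SPINACH. Its core is an LLM-backed agent that translates natural language into SPARQL queries through an iterative process of exploration and execution rather than a single-shot mapping. The key idea is to treat Text2SPARQL as a dynamic reasoning process, which improves the handling of complex queries, and to analyze the agent's behavior to guide further improvements.
Link: https://arxiv.org/abs/2510.02200
Authors: Felix Brei, Lorenz Bühmann, Johannes Frey, Daniel Gerber, Lars-Peter Meyer, Claus Stadler, Kirill Bulert
Affiliations: Leipzig University, Germany; Chemnitz Technical University, Germany
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Peer-reviewed publication at Text2SPARQL Workshop @ ESWC 2025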
Abstract:Interacting with knowledge graphs can be a daunting task for people without a background in computer science since the query language that is used (SPARQL) has a high barrier of entry. Large language models (LLMs) can lower that barrier by providing support in the form of Text2SPARQL translation. In this paper we introduce a generalized method based on SPINACH, an LLM backed agent that translates natural language questions to SPARQL queries not in a single shot, but as an iterative process of exploration and execution. We describe the overall architecture and reasoning behind our design decisions, and also conduct a thorough analysis of the agent behavior to gain insights into future areas for targeted improvements. This work was motivated by the Text2SPARQL challenge, a challenge that was held to facilitate improvements in the Text2SPARQL domain.
[NLP-19] A Rigorous Benchmark with Multidimensional Evaluation for Deep Research Agents: From Answers to Reports
Quick read: This paper addresses the shortcomings of current benchmarks for Deep Research Agents (DRAs) in evaluation dimensions, response formats, and scoring mechanisms, which prevent them from effectively measuring performance on complex, open-ended tasks. The key to the solution is a rigorous multidimensional evaluation framework with a dedicated benchmark. The benchmark contains 214 expert-curated challenging queries across 10 thematic domains, each with a manually constructed reference bundle to support composite evaluation. The framework introduces integrated scoring metrics that rate the long-form reports produced by DRAs on semantic quality, topical focus, and retrieval trustworthiness, which improves quantitative assessment of DRA systems.
Link: https://arxiv.org/abs/2510.02190
Authors: Yang Yao, Yixu Wang, Yuxuan Zhang, Yi Lu, Tianle Gu, Lingyu Li, Dingyi Zhao, Keming Wu, Haozhe Wang, Ping Nie, Yan Teng, Yingchun Wang
Affiliations: Shanghai Artificial Intelligence Laboratory; The University of Hong Kong; Fudan University; University of British Columbia; University of Toronto; Tsinghua University; Shanghai Jiao Tong University; Hong Kong University of Science and Technology; Peking University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract:Artificial intelligence is undergoing the paradigm shift from closed language models to interconnected agent systems capable of external perception and information integration. As a representative embodiment, Deep Research Agents (DRAs) systematically exhibit the capabilities for task decomposition, cross-source retrieval, multi-stage reasoning, and structured output, which markedly enhance performance on complex and open-ended tasks. However, existing benchmarks remain deficient in evaluation dimensions, response formatting, and scoring mechanisms, limiting their capacity to assess such systems effectively. This paper introduces a rigorous benchmark and a multidimensional evaluation framework tailored to DRAs and report-style responses. The benchmark comprises 214 expert-curated challenging queries distributed across 10 broad thematic domains, each accompanied by manually constructed reference bundles to support composite evaluation. The framework enables comprehensive evaluation of long-form reports generated by DRAs, incorporating integrated scoring metrics for semantic quality, topical focus, and retrieval trustworthiness. Extensive experimentation confirms the superior performance of mainstream DRAs over web-search-tool-augmented reasoning models, yet reveals considerable scope for further improvement. This study provides a robust foundation for capability assessment, architectural refinement, and paradigm advancement in DRA systems.
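[NLP-20] Learning to Reason for Hallucination Span Detection
Link: https://arxiv.org/abs/2510.02173
Authors: Hsuan Su, Ting-Yao Hu, Hema Swetha Koppula, Kundan Krishna, Hadi Pouransari, Cheng-Yu Hsieh, Cem Koc, Joseph Yitan Cheng, Oncel Tuzel, Raviteja Vemulapalli
Affiliations: National Taiwan University; Apple
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[NLP-21] RESTRAIN: From Spurious Votes to Signals – Self-Driven RL with Self-Penalization
Quick read: This paper addresses the high annotation cost of reinforcement learning methods that rely on human-annotated data to improve chain-of-thought (CoT) reasoning in large models, as well as their performance bottleneck on hard tasks. The key to the solution is RESTRAIN (REinforcement learning with Self-restraint), a self-penalizing reinforcement learning framework. It uses signals from the model's own answer distribution, penalizing overconfident rollouts and low-consistency examples while preserving promising reasoning chains, and so turns the absence of gold labels in unlabeled data into a useful learning signal. The mechanism integrates into policy optimization methods such as GRPO and enables continual self-improvement without supervision. It yields large gains on several challenging reasoning benchmarks, for example a Pass@1 improvement of up to +140.7% on AIME25, nearly matching training with gold labels.
Link: https://arxiv.org/abs/2510.02172
Authors: Zhaoning Yu, Will Su, Leitian Tao, Haozhu Wang, Aashu Singh, Hanchao Yu, Jianyu Wang, Hongyang Gao, Weizhe Yuan, Jason Weston, Ping Yu, Jing Xu
Affiliations: Meta
Categories: Computation and Language (cs.CL)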
Abstract:Reinforcement learning with human-annotated data has boosted chain-of-thought reasoning in large reasoning models, but these gains come at high costs in labeled data while faltering on harder tasks. A natural next step is experience-driven learning, where models improve without curated labels by adapting to unlabeled data. We introduce RESTRAIN (REinforcement learning with Self-restraint), a self-penalizing RL framework that converts the absence of gold labels into a useful learning signal. Instead of overcommitting to spurious majority votes, RESTRAIN exploits signals from the model’s entire answer distribution: penalizing overconfident rollouts and low-consistency examples while preserving promising reasoning chains. The self-penalization mechanism integrates seamlessly into policy optimization methods such as GRPO, enabling continual self-improvement without supervision. On challenging reasoning benchmarks, RESTRAIN delivers large gains using only unlabeled data. With Qwen3-4B-Base and OctoThinker Hybrid-8B-Base, it improves Pass@1 by up to +140.7 percent on AIME25, +36.2 percent on MMLU_STEM, and +19.6 percent on GPQA-Diamond, nearly matching gold-label training while using no gold labels. These results demonstrate that RESTRAIN establishes a scalable path toward stronger reasoning without gold labels.
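To make "the model's entire answer distribution" concrete, the sketch below computes vote shares over sampled final answers and a simple per-prompt weight that drops low-consistency prompts. It shows only the raw quantities involved: the threshold, the weighting rule, and the function names are assumptions for illustration and are not RESTRAIN's actual reward or penalty formulas.

```python
from collections import Counter
from typing import Dict, Sequence


def answer_distribution(answers: Sequence[str]) -> Dict[str, float]:
    """Vote share of each distinct final answer among the sampled rollouts."""
    counts = Counter(answers)
    return {answer: c / len(answers) for answer, c in counts.items()}


def prompt_weight(answers: Sequence[str], min_consistency: float = 0.5) -> float:
    """Hypothetical weight for a prompt: the top vote share, or 0 if too low.

    A plain majority vote would trust the top answer regardless of how thin
    its margin is; here, prompts whose rollouts disagree too much contribute
    nothing to the update.
    """
    top_share = max(answer_distribution(answers).values())
    return top_share if top_share >= min_consistency else 0.0
```

Keeping the full distribution, rather than only its argmax, is what allows non-majority answers to retain some credit when the majority is weak.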
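[NLP-22] The Disparate Impacts of Speculative Decoding
Quick read: This paper addresses the uneven speed-up that speculative decoding delivers across tasks in generative AI: the speed-up shrinks markedly on under-fit or underrepresented tasks. The key to the solution is a quantitative analysis of this speed-up unfairness that identifies the main factors behind the disparity, followed by a mitigation strategy that reduces speed-up disparities across tasks. Experiments show that the strategy improves the fairness metric by 12% on average across several model pairs.
Link: https://arxiv.org/abs/2510.02128
Authors: Jameson Sandler, Ahmet Üstün, Marco Romanelli, Sara Hooker, Ferdinando Fioretto
Affiliations: University of Virginia; Cohere; Hofstra University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)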
Abstract:The practice of speculative decoding, whereby inference is probabilistically supported by a smaller, cheaper, “drafter” model, has become a standard technique for systematically reducing the decoding time of large language models. This paper conducts an analysis of speculative decoding through the lens of its potential disparate speed-up rates across tasks. Crucially, the paper shows that speed-up gained from speculative decoding is not uniformly distributed across tasks, consistently diminishing for under-fit, and often underrepresented tasks. To better understand this phenomenon, we derive an analysis to quantify this observed “unfairness” and draw attention to the factors that motivate such disparate speed-ups to emerge. Further, guided by these insights, the paper proposes a mitigation strategy designed to reduce speed-up disparities and validates the approach across several model pairs, revealing on average a 12% improvement in our fairness metric.
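A standard back-of-the-envelope model shows why speed-up can differ by task. It is not taken from this paper: it is the usual speculative decoding analysis, which assumes each drafted token is accepted independently with probability alpha, so that a draft of length gamma yields (1 - alpha^(gamma+1)) / (1 - alpha) tokens per verification step in expectation. The acceptance rates in the example are made up.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification step.

    alpha: probability that a drafted token is accepted (assumed i.i.d.).
    gamma: number of tokens the drafter proposes per step.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if alpha == 1.0:
        return float(gamma + 1)  # every draft accepted, plus one extra token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


# Hypothetical acceptance rates, chosen only to illustrate the disparity.
acceptance_by_task = {"well-fit task": 0.8, "under-fit task": 0.4}
for task, alpha in acceptance_by_task.items():
    print(task, round(expected_tokens_per_step(alpha, gamma=4), 2))
```

Under these assumed rates the well-fit task yields about 3.36 tokens per step and the under-fit task about 1.65, so a task on which the drafter agrees less often with the target model gains much less. Wall-clock speed-up also depends on the drafter's cost, which this sketch ignores.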
[NLP-23] Do AI Models Perform Human-like Abstract Reasoning Across Modalities?
Link: https://arxiv.org/abs/2510.02125
Authors: Claas Beger, Ryan Yi, Shuhao Fu, Arseny Moskvichev, Sarah W. Tsai, Sivasankaran Rajamanickam, Melanie Mitchell
Affiliations: Santa Fe Institute; Advanced Micro Devices, Inc.; Sandia National Laboratories
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages, 4 figures
[NLP-24] Chain-of-Thought Reasoning in Streaming Full-Duplex End-to-End Spoken Dialogue Systems
Link: https://arxiv.org/abs/2510.02066
Authors: Siddhant Arora, Jinchuan Tian, Hayato Futami, Jiatong Shi, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe
Affiliations: Carnegie Mellon University; Sony Group Corporation
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[NLP-25] Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage
Link: https://arxiv.org/abs/2510.02044
Authors: Siddhant Arora, Haidar Khan, Kai Sun, Xin Luna Dong, Sajal Choudhary, Seungwhan Moon, Xinyuan Zhang, Adithya Sagar, Surya Teja Appini, Kaushik Patnaik, Sanat Sharma, Shinji Watanabe, Anuj Kumar, Ahmed Aly, Yue Liu, Florian Metze, Zhaojiang Lin
Affiliations: Meta
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[NLP-26] Style Over Story: A Process-Oriented Study of Authorial Creativity in Large Language Models
Quick read: This paper addresses the fact that evaluations of creativity in Large Language Models (LLMs) focus on output quality and neglect the generation process. The key to the solution is a process-oriented approach that draws on narratology to treat LLMs as computational authors and introduces constraint-based decision-making as an analytical lens. Controlled prompting assigns the models different authorial personas, and the study then examines their preferences and decision logic over the narrative elements Style, Character, Event, and Setting. It finds that LLMs consistently emphasize Style and that different models show distinguishable creative profiles, suggesting the approach is a new systematic tool for analyzing AI's authorial creativity.
Link: https://arxiv.org/abs/2510.02025
Authors: Donghoon Jung, Jiwoo Choi, Songeun Chae, Seohyon Jung
Affiliations: KAIST
Categories: Computation and Language (cs.CL)
Abstract:Evaluations of large language models (LLMs)’ creativity have focused primarily on the quality of their outputs rather than the processes that shape them. This study takes a process-oriented approach, drawing on narratology to examine LLMs as computational authors. We introduce constraint-based decision-making as a lens for authorial creativity. Using controlled prompting to assign authorial personas, we analyze the creative preferences of the models. Our findings show that LLMs consistently emphasize Style over other elements, including Character, Event, and Setting. By also probing the reasoning the models provide for their choices, we show that distinctive profiles emerge across models and argue that our approach provides a novel systematic tool for analyzing AI’s authorial creativity.
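[NLP-27] LLM-Based Multi-Task Bangla Hate Speech Detection: Type, Severity, and Target
Quick read: This paper addresses the shortage of hate speech detection tools for low-resource languages, using Bangla as the example. Existing models are mostly single-task (e.g., binary hate/non-hate) and do not cover the multi-facet signals of hateful content (type, severity, target). The key to the solution is BanglaMultiHate, the first multi-task hate speech dataset for Bangla, on which the authors systematically compare classical baselines, monolingual pretrained models, and large language models (LLMs) under zero-shot prompting and LoRA fine-tuning. The experiments show that although LoRA-tuned LLMs are competitive with BanglaBERT, culturally and linguistically grounded pretraining remains critical for robust performance. The work provides a new benchmark and methodology for building culturally aligned moderation tools in low-resource settings.
Link: https://arxiv.org/abs/2510.01995
Authors: Md Arid Hasan, Firoj Alam, Md Fahad Hossain, Usman Naseem, Syed Ishtiaque Ahmed
Affiliations: University of Toronto; Qatar Computing Research Institute; Daffodil International University; Macquarie University
Categories: Computation and Language (cs.CL)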
Abstract:Online social media platforms are central to everyday communication and information seeking. While these platforms serve positive purposes, they also provide fertile ground for the spread of hate speech, offensive language, and bullying content targeting individuals, organizations, and communities. Such content undermines safety, participation, and equity online. Reliable detection systems are therefore needed, especially for low-resource languages where moderation tools are limited. In Bangla, prior work has contributed resources and models, but most are single-task (e.g., binary hate/offense) with limited coverage of multi-facet signals (type, severity, target). We address these gaps by introducing the first multi-task Bangla hate-speech dataset, BanglaMultiHate, one of the largest manually annotated corpora to date. Building on this resource, we conduct a comprehensive, controlled comparison spanning classical baselines, monolingual pretrained models, and LLMs under zero-shot prompting and LoRA fine-tuning. Our experiments assess LLM adaptability in a low-resource setting and reveal a consistent trend: although LoRA-tuned LLMs are competitive with BanglaBERT, culturally and linguistically grounded pretraining remains critical for robust performance. Together, our dataset and findings establish a stronger benchmark for developing culturally aligned moderation tools in low-resource contexts. For reproducibility, we will release the dataset and all related scripts.
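[NLP-28] Exploring Database Normalization Effects on SQL Generation CIKM2025
Quick read: This paper addresses an overlooked factor in natural language to SQL (NL2SQL) systems: the effect of schema design, and normalization in particular, on model performance. Prior work mostly evaluates models on fixed schemas and ignores how much schema design itself affects accuracy. The key to the solution is the first systematic study of how normalization levels (1NF to 3NF) affect eight leading large language models on synthetic data and a real-world academic paper dataset. The findings: for simple retrieval queries, denormalized schemas keep accuracy high even in zero-shot settings; normalized schemas introduce difficulties in base table selection and join type prediction, but a few few-shot examples substantially mitigate them; for aggregation queries, normalized schemas perform better because they avoid the data duplication and NULL value problems of denormalized schemas. The optimal schema design therefore depends on the target query types, and NL2SQL development should take schema design into account and include adaptive schema selection.
Link: https://arxiv.org/abs/2510.01989
Authors: Ryosuke Kohita
Affiliations: CyberAgent
Categories: Computation and Language (cs.CL)
Comments: Accepted to CIKM 2025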
Abstract:Schema design, particularly normalization, is a critical yet often overlooked factor in natural language to SQL (NL2SQL) systems. Most prior research evaluates models on fixed schemas, overlooking the influence of design on performance. We present the first systematic study of schema normalization’s impact, evaluating eight leading large language models on synthetic and real-world datasets with varied normalization levels. We construct controlled synthetic datasets with formal normalization (1NF-3NF) and real academic paper datasets with practical schemes. Our results show that denormalized schemas offer high accuracy on simple retrieval queries, even with cost-effective models in zero-shot settings. In contrast, normalized schemas (2NF/3NF) introduce challenges such as errors in base table selection and join type prediction; however, these issues are substantially mitigated by providing few-shot examples. For aggregation queries, normalized schemas yielded better performance, mainly due to their robustness against the data duplication and NULL value issues that cause errors in denormalized schemas. These findings suggest that the optimal schema design for NL2SQL applications depends on the types of queries to be supported. Our study demonstrates the importance of considering schema design when developing NL2SQL interfaces and integrating adaptive schema selection for real-world scenarios.
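The trade-off described above can be reproduced on a toy database. The schema and data below are invented for illustration (they are not from the paper's datasets): a flat table answers a retrieval query without any join, but its duplicated rows inflate an aggregate, while the normalized tables need a join for retrieval and aggregate correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
-- Denormalized: one row per (paper, author), so paper facts are duplicated.
CREATE TABLE papers_flat (paper_id INTEGER, title TEXT, citations INTEGER, author TEXT);
INSERT INTO papers_flat VALUES
  (1, 'Paper A', 10, 'Kim'),
  (1, 'Paper A', 10, 'Lee'),
  (2, 'Paper B', 5, 'Kim');

-- Normalized: paper facts stored once, authorship in a separate table.
CREATE TABLE papers (paper_id INTEGER PRIMARY KEY, title TEXT, citations INTEGER);
CREATE TABLE paper_authors (paper_id INTEGER REFERENCES papers(paper_id), author TEXT);
INSERT INTO papers VALUES (1, 'Paper A', 10), (2, 'Paper B', 5);
INSERT INTO paper_authors VALUES (1, 'Kim'), (1, 'Lee'), (2, 'Kim');
""")

# Retrieval: the flat table needs no join; the normalized schema needs one.
flat_titles = [row[0] for row in cur.execute(
    "SELECT title FROM papers_flat WHERE author = 'Kim' ORDER BY title")]
norm_titles = [row[0] for row in cur.execute(
    "SELECT title FROM papers JOIN paper_authors USING (paper_id) "
    "WHERE author = 'Kim' ORDER BY title")]

# Aggregation: the naive sum over the flat table counts Paper A twice.
flat_total = cur.execute("SELECT SUM(citations) FROM papers_flat").fetchone()[0]
norm_total = cur.execute("SELECT SUM(citations) FROM papers").fetchone()[0]

print(flat_titles, norm_titles)  # same titles from both schemas
print(flat_total, norm_total)    # 25 versus the correct 15
```

A model generating SQL against the flat table has to notice the duplication and deduplicate before summing, which is the kind of error source the abstract attributes to denormalized schemas.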
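[NLP-29] Taking a SEAT: Predicting Value Interpretations from Sentiment, Emotion, Argument, and Topic Annotations ECAI2025
Quick read: This paper asks how AI systems can better align with diverse individual interpretations of values and avoid bias toward majority viewpoints. Traditional approaches often model individuals through demographic features and ignore how individuals differ in multi-dimensional subjective annotations of sentiment, emotion, argument, and topic. The key to the solution is to use these annotations (the SEAT dimensions: Sentiment, Emotion, Argument, and Topics) as a proxy for an individual's interpretive lens, and to test in zero-shot and few-shot settings whether a language model can predict an individual's value judgments from them. Providing all SEAT dimensions together clearly outperforms any single dimension and a baseline with no information about the individual, which highlights the importance of modeling individual annotation behavior and provides a basis for future large-scale validation.
Link: https://arxiv.org/abs/2510.01976
Authors: Adina Nicola Dobrinoiu, Ana Cristiana Marcu, Amir Homayounirad, Luciano Cavalcante Siebert, Enrico Liscio
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted at VALE workshop (ECAI 2025)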
Abstract:Our interpretation of value concepts is shaped by our sociocultural background and lived experiences, and is thus subjective. Recognizing individual value interpretations is important for developing AI systems that can align with diverse human perspectives and avoid bias toward majority viewpoints. To this end, we investigate whether a language model can predict individual value interpretations by leveraging multi-dimensional subjective annotations as a proxy for their interpretive lens. That is, we evaluate whether providing examples of how an individual annotates Sentiment, Emotion, Argument, and Topics (SEAT dimensions) helps a language model in predicting their value interpretations. Our experiment across different zero- and few-shot settings demonstrates that providing all SEAT dimensions simultaneously yields superior performance compared to individual dimensions and a baseline where no information about the individual is provided. Furthermore, individual variations across annotators highlight the importance of accounting for the incorporation of individual subjective annotators. To the best of our knowledge, this controlled setting, although small in size, is the first attempt to go beyond demographics and investigate the impact of annotation behavior on value prediction, providing a solid foundation for future large-scale validation.
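[NLP-30] Veri-R1: Toward Precise and Faithful Claim Verification via Online Reinforcement Learning
Quick read: This paper addresses the limited skill improvement in online claim verification that results from relying on prompt engineering or predesigned reasoning workflows, and in particular the lack of a unified training paradigm for the interaction between dynamic evidence retrieval and reasoning. The key to the solution is Veri-R1, an end-to-end training framework based on online reinforcement learning (RL). It lets large language models (LLMs) interact dynamically with a search engine and uses explicit reward signals to shape the model's planning, retrieval, and reasoning behaviors, which mirrors real-world verification more closely and improves overall accuracy and evidence relevance.
Link: https://arxiv.org/abs/2510.01932
Authors: Qi He, Cheng Qian, Xiusi Chen, Bingxiang He, Yi R. (May) Fung, Heng Ji
Affiliations: University of Illinois Urbana-Champaign; Fudan University; Tsinghua University; Hong Kong University of Science and Technology
Categories: Computation and Language (cs.CL)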
Abstract:Claim verification with large language models (LLMs) has recently attracted considerable attention, owing to their superior reasoning capabilities and transparent verification pathways compared to traditional answer-only judgments. Online claim verification requires iterative evidence retrieval and reasoning, yet existing approaches mainly rely on prompt engineering or predesigned reasoning workflows without offering a unified training paradigm to improve necessary skills. Therefore, we introduce Veri-R1, an online reinforcement learning (RL) framework that enables an LLM to interact with a search engine and to receive reward signals that explicitly shape its planning, retrieval, and reasoning behaviors. The dynamic interaction between models and retrieval systems more accurately reflects real-world verification scenarios and fosters comprehensive verification skills. Empirical results show that Veri-R1 improves joint accuracy by up to 30% and doubles evidence score, often surpassing larger-scale counterparts. Ablation studies further reveal the impact of reward components and the link between output logits and label accuracy. Our results highlight the effectiveness of online RL for precise and faithful claim verification and provide a foundation for future research. We release our code to support community progress in LLM empowered claim verification.
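[NLP-31] Inverse Language Modeling towards Robust and Grounded LLMs
Quick read: This paper addresses the fragmented and underdeveloped state of defense mechanisms for Large Language Models (LLMs); compared with work on classifiers, adversarial robustness of LLMs remains understudied. To improve robustness and enable native grounding, the authors propose Inverse Language Modeling (ILM). Its core is to turn LLMs from static generators into analyzable, perturbation-resistant systems: it strengthens robustness through output consistency under input perturbations, and it inverts model outputs to identify potentially harmful or unsafe input triggers, which improves the controllability and trustworthiness of model behavior.
Link: https://arxiv.org/abs/2510.01929
Authors: Davide Gabrielli, Simone Sestito, Iacopo Masi
Affiliations: Unknown
Categories: Computation and Language (cs.CL)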
Abstract:The current landscape of defensive mechanisms for LLMs is fragmented and underdeveloped, unlike prior work on classifiers. To further promote adversarial robustness in LLMs, we propose Inverse Language Modeling (ILM), a unified framework that simultaneously 1) improves the robustness of LLMs to input perturbations, and, at the same time, 2) enables native grounding by inverting model outputs to identify potentially toxic or unsafe input triggers. ILM transforms LLMs from static generators into analyzable and robust systems, potentially helping RED teaming. ILM can lay the foundation for next-generation LLMs that are not only robust and grounded but also fundamentally more controllable and trustworthy. The code is publicly available at this http URL.
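[NLP-32] Enhancing Large Language Model Reasoning with Reward Models: An Analytical Survey
Quick read: This paper addresses how to effectively improve the reasoning performance of Large Language Models (LLMs); the central challenge is how to design and apply Reward Models (RMs) to strengthen LLMs during generation, selection, and optimization. The key to the solution is a systematic review of RM architectures, training methods, and evaluation techniques, together with an in-depth discussion of three core applications: (1) guiding generation and selecting the best outputs at inference time; (2) supporting data synthesis and iterative self-improvement; (3) providing training signals for reinforcement learning (RL) based fine-tuning. Drawing on existing research and the authors' own empirical analysis, the paper also offers insights on open questions about RM selection, generalization, evaluation, and enhancement, providing a theoretical basis and practical path for deploying and advancing RMs in LLM reasoning.
Link: https://arxiv.org/abs/2510.01925
Authors: Qiyuan Liu, Hao Xu, Xuhong Chen, Wei Chen, Yee Whye Teh, Ning Miao
Affiliations: City University of Hong Kong; Li Auto Inc.; University of Oxford
Categories: Computation and Language (cs.CL)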
Abstract:Reward models (RMs) play a critical role in enhancing the reasoning performance of LLMs. For example, they can provide training signals to finetune LLMs during reinforcement learning (RL) and help select the best answer from multiple candidates during inference. In this paper, we provide a systematic introduction to RMs, along with a comprehensive survey of their applications in LLM reasoning. We first review fundamental concepts of RMs, including their architectures, training methodologies, and evaluation techniques. Then, we explore their key applications: (1) guiding generation and selecting optimal outputs during LLM inference, (2) facilitating data synthesis and iterative self-improvement for LLMs, and (3) providing training signals in RL-based finetuning. Finally, we address critical open questions regarding the selection, generalization, evaluation, and enhancement of RMs, based on existing research and our own empirical findings. Our analysis aims to provide actionable insights for the effective deployment and advancement of RMs for LLM reasoning.
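The inference-time use of a reward model mentioned above (selecting the best answer from multiple candidates) is often called best-of-N sampling. A minimal sketch follows; `toy_reward` is a hypothetical stand-in for a trained reward model and simply checks for a known string.

```python
from typing import Callable, List


def best_of_n(candidates: List[str], reward_fn: Callable[[str], float]) -> str:
    """Return the candidate with the highest reward (first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=reward_fn)


def toy_reward(answer: str) -> float:
    """Hypothetical reward: 1.0 if the answer ends with the expected result."""
    return 1.0 if answer.strip().endswith("= 4") else 0.0


candidates = ["2 + 2 = 5", "2 + 2 = 4", "2 + 2 = 22"]
print(best_of_n(candidates, toy_reward))
```

In practice `reward_fn` would wrap a learned scorer, and the quality of the selection is bounded by how well that scorer generalizes, which is one of the open questions the survey raises.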
[NLP-33] Constrained Adaptive Rejection Sampling
Link: https://arxiv.org/abs/2510.01902
Authors: Paweł Parys, Sairam Vaidya, Taylor Berg-Kirkpatrick, Loris D’Antoni
Affiliations: University of Warsaw; University of California San Diego
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
[NLP-34] REPAIR: Robust Editing via Progressive Adaptive Intervention and Reintegration
Link: https://arxiv.org/abs/2510.01879
Authors: Yisu Wang, Ming Wang, Haoyuan Song, Wenjie Huang, Chaozheng Wang, Yi Xie, Xuming Ran
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
[NLP-35] Model Merging to Maintain Language-Only Performance in Developmentally Plausible Multimodal Models EMNLP2025
Link: https://arxiv.org/abs/2510.01845
Authors: Ece Takmaz, Lisa Bylinina, Jakub Dotlacil
Affiliations: Utrecht University
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the EMNLP 2025 workshop BabyLM: Accelerating language modeling research with cognitively plausible datasets
[NLP-36] Plan Then Action: High-Level Planning Guidance Reinforcement Learning for LLM Reasoning
Quick read: This paper addresses the problem that, because of autoregressive token-level generation, the reasoning of Large Language Models (LLMs) on complex tasks is confined to local decisions and lacks global planning, which often produces redundant, incoherent, or incorrect reasoning paths and significantly degrades overall performance. Existing approaches such as tree-based algorithms and reinforcement learning (RL) try to improve this but are computationally expensive and often fail to find optimal reasoning trajectories. The key to the solution is a two-stage framework, Plan-Then-Action Enhanced Reasoning with Group Relative Policy Optimization (PTA-GRPO). The first stage uses advanced LLMs to distill Chain-of-Thought (CoT) reasoning into compact high-level guidance and instills it through supervised fine-tuning (SFT). The second stage introduces a guidance-aware RL method that jointly optimizes the final output and the quality of the high-level guidance. Experiments show stable and significant gains on multiple mathematical reasoning benchmarks.
Link: https://arxiv.org/abs/2510.01833
Authors: Zhihao Dou, Qinjian Zhao, Zhongwei Wan, Dinggen Zhang, Weida Wang, Towsif Raiyan, Benteng Chen, Qingtao Pan, Yang Ouyang, Zhiqiang Gao, Shufei Zhang, Sumon Biswas
Affiliations: Case Western Reserve University; Kean University; The Ohio State University; Fudan University; The University of Hong Kong; North Carolina State University; Shanghai Artificial Intelligence Laboratory
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 19 pages and 5 figures
Abstract:Large language models (LLMs) have demonstrated remarkable reasoning abilities in complex tasks, often relying on Chain-of-Thought (CoT) reasoning. However, due to their autoregressive token-level generation, the reasoning process is largely constrained to local decision-making and lacks global planning. This limitation frequently results in redundant, incoherent, or inaccurate reasoning, which significantly degrades overall performance. Existing approaches, such as tree-based algorithms and reinforcement learning (RL), attempt to address this issue but suffer from high computational costs and often fail to produce optimal reasoning trajectories. To tackle this challenge, we propose Plan-Then-Action Enhanced Reasoning with Group Relative Policy Optimization (PTA-GRPO), a two-stage framework designed to improve both high-level planning and fine-grained CoT reasoning. In the first stage, we leverage advanced LLMs to distill CoT into compact high-level guidance, which is then used for supervised fine-tuning (SFT). In the second stage, we introduce a guidance-aware RL method that jointly optimizes the final output and the quality of high-level guidance, thereby enhancing reasoning effectiveness. We conduct extensive experiments on multiple mathematical reasoning benchmarks, including MATH, AIME2024, AIME2025, and AMC, across diverse base models such as Qwen2.5-7B-Instruct, Qwen3-8B, Qwen3-14B, and LLaMA3.2-3B. Experimental results demonstrate that PTA-GRPO consistently achieves stable and significant improvements across different models and tasks, validating its effectiveness and generalization.
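[NLP-37] SCRIBES: Web-Scale Script-Based Semi-Structured Data Extraction with Reinforcement Learning
Link: https://arxiv.org/abs/2510.01832
Authors: Shicheng Liu, Kai Sun, Lisheng Fu, Xilun Chen, Xinyuan Zhang, Zhaojiang Lin, Rulin Shao, Yue Liu, Anuj Kumar, Wen-tau Yih, Xin Luna Dong
Affiliations: Stanford University; Meta
Categories: Computation and Language (cs.CL)
[NLP-38] Syntactic Blind Spots: How Misalignment Leads to LLMs Mathematical Errors EMNLP2025
Quick read: This paper addresses a systematic failure mode of Large Language Models (LLMs) in mathematical problem solving, called syntactic blind spots: the model misapplies reasoning strategies it already has because the problem is phrased in an unfamiliar form, even though the semantics are clear and its mathematical competence is sufficient. The key to the solution is to identify and intervene on these errors, which arise from a brittle coupling between surface form and internal representation. Incorrectly answered questions are rephrased with syntactic templates drawn from correct examples, preserving semantics while reducing structural complexity, which often leads to correct answers. The study also introduces a syntactic complexity metric based on Dependency Locality Theory (DLT) and shows that higher DLT scores are associated with higher failure rates, indicating that many reasoning errors stem from structural misalignment rather than conceptual difficulty and that syntax-aware interventions can reveal and mitigate these inductive failures.
Link: https://arxiv.org/abs/2510.01831
Authors: Dane Williamson, Yangfeng Ji, Matthew Dwyer
Affiliations: University of Virginia
Categories: Computation and Language (cs.CL)
Comments: 14 pages, 5 Tables, 9 Figures; Accepted to MathNLP 2025: The 3rd Workshop on Mathematical Natural Language Processing (co-located with EMNLP 2025)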
Abstract:Large Language Models (LLMs) demonstrate strong mathematical problem-solving abilities but frequently fail on problems that deviate syntactically from their training distribution. We identify a systematic failure mode, syntactic blind spots, in which models misapply familiar reasoning strategies to problems that are semantically straightforward but phrased in unfamiliar ways. These errors are not due to gaps in mathematical competence, but rather reflect a brittle coupling between surface form and internal representation. To test this, we rephrase incorrectly answered questions using syntactic templates drawn from correct examples. These rephrasings, which preserve semantics while reducing structural complexity, often lead to correct answers. We quantify syntactic complexity using a metric based on Dependency Locality Theory (DLT), and show that higher DLT scores are associated with increased failure rates across multiple datasets. Our findings suggest that many reasoning errors stem from structural misalignment rather than conceptual difficulty, and that syntax-aware interventions can reveal and mitigate these inductive failures.
[NLP-39] Sparse Query Attention (SQA): A Computationally Efficient Attention Mechanism with Query Heads Reduction
Link: https://arxiv.org/abs/2510.01817
Authors: Adam Filipek
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 18 pages, 6 figures, small-scale experiments
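[NLP-40] Detecting LLM-Generated Spam Reviews by Integrating Language Model Embeddings and Graph Neural Network
[Quick read]: This paper addresses the threat that highly persuasive spam reviews generated by large language models (LLMs) pose to the credibility of online platforms. Such reviews closely mimic human writing, which makes them hard for existing detection systems to identify. The key to the solution is FraudSquad, a hybrid detection model that integrates text embeddings from a pre-trained language model with a gated graph transformer. Without manual feature engineering or large amounts of labeled data, the model captures both the semantic content of reviews and user behavioral signals in order to detect LLM-generated spam reviews.
Link: https://arxiv.org/abs/2510.01801
Authors: Xin Liu, Rongwu Xu, Xinyi Jia, Jason Liao, Jiao Sun, Ling Huang, Wei Xu
Affiliations: Tsinghua University; SUPCON; University of British Columbia; Google DeepMind; AHI Fintech
Categories: Computation and Language (cs.CL)
Comments:
Click to view abstract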
Abstract:The rise of large language models (LLMs) has enabled the generation of highly persuasive spam reviews that closely mimic human writing. These reviews pose significant challenges for existing detection systems and threaten the credibility of online platforms. In this work, we first create three realistic LLM-generated spam review datasets using three distinct LLMs, each guided by product metadata and genuine reference reviews. Evaluations by GPT-4.1 confirm the high persuasion and deceptive potential of these reviews. To address this threat, we propose FraudSquad, a hybrid detection model that integrates text embeddings from a pre-trained language model with a gated graph transformer for spam node classification. FraudSquad captures both semantic and behavioral signals without relying on manual feature engineering or massive training resources. Experiments show that FraudSquad outperforms state-of-the-art baselines by up to 44.22% in precision and 43.01% in recall on three LLM-generated datasets, while also achieving promising results on two human-written spam datasets. Furthermore, FraudSquad maintains a modest model size and requires minimal labeled training data, making it a practical solution for real-world applications. Our contributions include new synthetic datasets, a practical detection framework, and empirical evidence highlighting the urgency of adapting spam detection to the LLM era. Our code and datasets are available at: this https URL.
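[NLP-41] Comparison of Unsupervised Metrics for Evaluating Judicial Decision Extraction
[Quick read]: This paper addresses the scalability of evaluating text extraction quality in legal natural language processing, specifically how to assess the extraction of seven semantic blocks from judicial decisions when no pre-annotated ground truth is available. The key to the solution is a systematic evaluation of 16 unsupervised metrics spanning document-based, semantic, structural, pseudo-ground truth, and legal-specific categories. By comparing them against 7,168 expert ratings on a 1–5 Likert scale, the study identifies the best-aligned metrics, Term Frequency Coherence and Coverage Ratio/Block Completeness, and shows that LLM evaluation scores correlate moderately with expert ratings but cannot yet replace human judgment. The work provides annotation-free evaluation tools for legal text analysis, with implications for judicial analytics and ethical AI deployment.
Link: https://arxiv.org/abs/2510.01792
Authors: Ivan Leonidovich Litvak, Anton Kostin, Fedor Lashkin, Tatiana Maksiyan, Sergey Lagutin
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 28 pages
Click to view abstract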
Abstract:The rapid advancement of artificial intelligence in legal natural language processing demands scalable methods for evaluating text extraction from judicial decisions. This study evaluates 16 unsupervised metrics, including novel formulations, to assess the quality of extracting seven semantic blocks from 1,000 anonymized Russian judicial decisions, validated against 7,168 expert reviews on a 1–5 Likert scale. These metrics, spanning document-based, semantic, structural, pseudo-ground truth, and legal-specific categories, operate without pre-annotated ground truth. Bootstrapped correlations, Lin’s concordance correlation coefficient (CCC), and mean absolute error (MAE) reveal that Term Frequency Coherence (Pearson r = 0.540, Lin CCC = 0.512, MAE = 0.127) and Coverage Ratio/Block Completeness (Pearson r = 0.513, Lin CCC = 0.443, MAE = 0.139) best align with expert ratings, while Legal Term Density (Pearson r = -0.479, Lin CCC = -0.079, MAE = 0.394) shows a strong negative correlation. The LLM Evaluation Score (mean = 0.849, Pearson r = 0.382, Lin CCC = 0.325, MAE = 0.197) showed moderate alignment, but its performance, using gpt-4.1-mini via g4f, suggests limited specialization for legal texts. These findings highlight that unsupervised metrics, including LLM-based approaches, enable scalable screening but, with moderate correlations and low CCC values, cannot fully replace human judgment in high-stakes legal contexts. This work advances legal NLP by providing annotation-free evaluation tools, with implications for judicial analytics and ethical AI deployment.
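The three agreement statistics named in the abstract can be sketched in a few lines. The following is a minimal pure-Python illustration of Pearson r, Lin's CCC, and MAE using population (1/n) moments. The function names and toy scores are assumptions for illustration, and the paper's bootstrapping and any score normalization are not reproduced.

```python
import math


def _moments(x, y):
    """Return means, population variances, and population covariance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return mx, my, vx, vy, cov


def pearson_r(x, y):
    """Linear correlation: insensitive to shifts and rescaling."""
    _, _, vx, vy, cov = _moments(x, y)
    return cov / math.sqrt(vx * vy)


def lins_ccc(x, y):
    """Lin's concordance: also penalizes offsets from the identity line."""
    mx, my, vx, vy, cov = _moments(x, y)
    return 2 * cov / (vx + vy + (mx - my) ** 2)


def mean_absolute_error(x, y):
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Toy data: the metric tracks the expert ratings but is offset by one point.
expert = [1, 2, 3, 4, 5]
metric = [2, 3, 4, 5, 6]

r = pearson_r(expert, metric)
ccc = lins_ccc(expert, metric)
mae = mean_absolute_error(expert, metric)
```

In this toy case Pearson r is 1.0 while the CCC is only 0.8, which is why a metric can correlate well with expert ratings and still show low concordance.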
[NLP-42] Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks
Link: https://arxiv.org/abs/2510.01782
Authors: Wenbo Pan, Jie Xu, Qiguang Chen, Junhao Dong, Libo Qin, Xinfeng Li, Haining Yu, Xiaohua Jia
Affiliations: City University of Hong Kong; Harbin Institute of Technology; Nanyang Technological University; Central South University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
[NLP-43] Machine-interpretable Engineering Design Standards for Valve Specification
Link: https://arxiv.org/abs/2510.01736
Authors: Anders Gjerver, Rune Frostad, Vedrana Barisic, Melinda Hodkiewicz, Caitlin Woods, Mihaly Fekete, Arild Braathen Torjusen, Johan Wilhelm Kluwer
Affiliations: Aibel AS; University of Western Australia; Equinor; DNV
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 22 pages, 10 figures, 4 tables
[NLP-44] What MLLMs Learn about When they Learn about Multimodal Reasoning: Perception, Reasoning, or their Integration?
Link: https://arxiv.org/abs/2510.01719
Authors: Jiwan Chung, Neel Joshi, Pratyusha Sharma, Youngjae Yu, Vibhav Vineet
Affiliations: Microsoft Research AI Frontiers; Yonsei University; Seoul National University
Categories: Computation and Language (cs.CL)
Comments:
[NLP-45] Format Inertia: A Failure Mechanism of LLMs in Medical Pre-Consultation EMNLP2025
Link: https://arxiv.org/abs/2510.01688
Authors: Seungseop Lim, Gibaeg Kim, Wooseok Han, Jean Seo, Hyunkyung Lee, Jaehyo Yoo, Eunho Yang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2025 Industry Track
[NLP-46] Improving AGI Evaluation: A Data Science Perspective
Link: https://arxiv.org/abs/2510.01687
Authors: John Hawkins
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
[NLP-47] How Do Language Models Compose Functions?
Link: https://arxiv.org/abs/2510.01685
Authors: Apoorv Khandelwal, Ellie Pavlick
Affiliations: Brown University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
[NLP-48] FOR-Prompting: From Objection to Revision via an Asymmetric Prompting Protocol
Link: https://arxiv.org/abs/2510.01674
Authors: He Zhang, Anzhou Zhang, Jian Dai
Affiliations: The Pennsylvania State University; Genentech, Inc.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
[NLP-49] Just Do It!? Computer-Use Agents Exhibit Blind Goal-Directedness
Link: https://arxiv.org/abs/2510.01670
Authors: Erfan Shayegani, Keegan Hines, Yue Dong, Nael Abu-Ghazaleh, Roman Lutz, Spencer Whitehead, Vidhisha Balachandran, Besmira Nushi, Vibhav Vineet
Affiliations: Microsoft Research AI Frontiers; Microsoft AI Red Team; University of California, Riverside; NVIDIA
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:
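[NLP-50] MDSEval: A Meta-Evaluation Benchmark for Multimodal Dialogue Summarization EMNLP2025
[Quick read]: This paper addresses the lack of a reliable benchmark for automatic evaluation in Multimodal Dialogue Summarization (MDS), with the aim of lowering model development cost and reducing reliance on human annotation. The key to the solution is MDSEval, the first meta-evaluation benchmark for MDS, which contains image-sharing dialogues, corresponding summaries, and human judgments across eight well-defined quality aspects. It also introduces a new filtering framework based on Mutually Exclusive Key Information (MEKI) across modalities to ensure data quality and richness. The work is the first to identify and formalize evaluation dimensions specific to MDS, giving later research a quantifiable and comparable standard.
Link: https://arxiv.org/abs/2510.01659
Authors: Yinhong Liu, Jianfeng He, Hang Su, Ruixue Lian, Yi Nian, Jake Vincent, Srikanth Vishnubhotla, Robinson Piramuthu, Saab Mansour
Affiliations: AWS AI Labs; Language Technology Lab, University of Cambridge
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by EMNLP 2025
Click to view abstract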
Abstract:Multimodal Dialogue Summarization (MDS) is a critical task with wide-ranging applications. To support the development of effective MDS models, robust automatic evaluation methods are essential for reducing both cost and human effort. However, such methods require a strong meta-evaluation benchmark grounded in human annotations. In this work, we introduce MDSEval, the first meta-evaluation benchmark for MDS, consisting of image-sharing dialogues, corresponding summaries, and human judgments across eight well-defined quality aspects. To ensure data quality and richness, we propose a novel filtering framework leveraging Mutually Exclusive Key Information (MEKI) across modalities. Our work is the first to identify and formalize key evaluation dimensions specific to MDS. We benchmark state-of-the-art modal evaluation methods, revealing their limitations in distinguishing summaries from advanced MLLMs and their susceptibility to various biases.
[NLP-51] SoK: Measuring What Matters for Closed-Loop Security Agents
Link: https://arxiv.org/abs/2510.01654
Authors: Mudita Khurana, Raunak Jain
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
[NLP-52] Learning to Look at the Other Side: A Semantic Probing Study of Word Embeddings in LLMs with Enabled Bidirectional Attention
Link: https://arxiv.org/abs/2510.01652
Authors: Zhaoxin Feng, Jianfei Ma, Emmanuele Chersoni, Xiaojing Zhao, Xiaoyi Bao
Affiliations: Language Science and Technology, The Hong Kong Polytechnic University
Categories: Computation and Language (cs.CL)
Comments:
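[NLP-53] Position: Privacy Is Not Just Memorization!
[Quick read]: This paper addresses a marked imbalance in current privacy research on Large Language Models (LLMs): the field focuses heavily on verbatim memorization of training data while neglecting more immediate and scalable risks, including data collection practices, inference-time context leakage, autonomous agent capabilities, and the democratization of surveillance through deep inference attacks. The key to the solution is a comprehensive taxonomy of privacy risks across the full LLM lifecycle, together with a longitudinal analysis of 1,322 papers from leading conferences over the past decade (2016–2025). The analysis shows that existing technical frameworks fall short against these multifaceted threats and that the field needs to move from a narrow technical focus to interdisciplinary approaches that address their sociotechnical nature.
Link: https://arxiv.org/abs/2510.01645
Authors: Niloofar Mireshghallah, Tianshi Li
Affiliations: Carnegie Mellon University; Northeastern University
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 27 pages, 6 figures, 2 tables
Click to view abstract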
Abstract:The discourse on privacy risks in Large Language Models (LLMs) has disproportionately focused on verbatim memorization of training data, while a constellation of more immediate and scalable privacy threats remain underexplored. This position paper argues that the privacy landscape of LLM systems extends far beyond training data extraction, encompassing risks from data collection practices, inference-time context leakage, autonomous agent capabilities, and the democratization of surveillance through deep inference attacks. We present a comprehensive taxonomy of privacy risks across the LLM lifecycle – from data collection through deployment – and demonstrate through case studies how current privacy frameworks fail to address these multifaceted threats. Through a longitudinal analysis of 1,322 AI/ML privacy papers published at leading conferences over the past decade (2016–2025), we reveal that while memorization receives outsized attention in technical research, the most pressing privacy harms lie elsewhere, where current technical approaches offer little traction and viable paths forward remain unclear. We call for a fundamental shift in how the research community approaches LLM privacy, moving beyond the narrow focus of current technical solutions and embracing interdisciplinary approaches that address the sociotechnical nature of these emerging threats.
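[NLP-54] NLP Methods for Detecting Novel LLM Jailbreaks and Keyword Analysis with BERT
[Quick read]: This paper addresses the vulnerability of Large Language Models (LLMs) to malicious input, in which attackers craft "jailbreak prompts" that bypass the model's safety guardrails and induce responses that violate the developer's policies. The key to the solution is improving the ability to recognize jailbreak prompts. The study finds that, on current datasets, fine-tuning a Bidirectional Encoder Representations from Transformers (BERT) model end-to-end gives the best detection performance. A keyword visualization further suggests that explicit reflexivity in prompt structure could be a signal of jailbreak intention.
Link: https://arxiv.org/abs/2510.01644
Authors: John Hawkins, Aditya Pramar, Rodney Beard, Rohitash Chandra
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:
Click to view abstract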
Abstract:Large Language Models (LLMs) suffer from a range of vulnerabilities that allow malicious users to solicit undesirable responses through manipulation of the input text. These so-called jailbreak prompts are designed to trick the LLM into circumventing the safety guardrails put in place to keep responses acceptable to the developer’s policies. In this study, we analyse the ability of different machine learning models to distinguish jailbreak prompts from genuine uses, including looking at our ability to identify jailbreaks that use previously unseen strategies. Our results indicate that, using current datasets, the best performance is achieved by fine-tuning a Bidirectional Encoder Representations from Transformers (BERT) model end-to-end for identifying jailbreaks. We visualise the keywords that distinguish jailbreak from genuine prompts and conclude that explicit reflexivity in prompt structure could be a signal of jailbreak intention.
[NLP-55] Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls EMNLP2025
Link: https://arxiv.org/abs/2510.01631
Authors: Feiyang Kang, Newsha Ardalani, Michael Kuchnik, Youssef Emad, Mostafa Elhoushi, Shubhabrata Sengupta, Shang-Wen Li, Ramya Raghavendra, Ruoxi Jia, Carole-Jean Wu
Affiliations: Meta
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Published as a Main Conference paper at EMNLP 2025
[NLP-56] Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead
Link: https://arxiv.org/abs/2510.01624
Authors: Feiyang Kang, Michael Kuchnik, Karthik Padthe, Marin Vlastelica, Ruoxi Jia, Carole-Jean Wu, Newsha Ardalani
Affiliations: Meta
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint. Under Review
[NLP-57] LLM4Rec: Large Language Models for Multimodal Generative Recommendation with Causal Debiasing
Link: https://arxiv.org/abs/2510.01622
Authors: Bo Ma, Hang Li, ZeHua Hu, XiaoFan Gui, LuYao Liu, Simon Lau
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
[NLP-58] AMAS: Adaptively Determining Communication Topology for LLM-based Multi-Agent System
Link: https://arxiv.org/abs/2510.01617
Authors: Hui Yi Leong, Yuheng Li, Yuqing Wu, Wenwen Ouyang, Wei Zhu, Jiechao Gao
Affiliations: University of Chicago; Johns Hopkins University; Carnegie Mellon University; University of Hong Kong; Stanford University
Categories: Computation and Language (cs.CL)
Comments:
[NLP-59] Efficient Training of Robust Traditional Chinese LLaMA-1B on a Single Consumer GPU: Continual Pre-training, SFT, and DPO
Link: https://arxiv.org/abs/2510.01616
Authors: Yu-Cheng Chih, Ming-Tao Duan, Yong-Hao Hou
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 17 pages, 1 figure, 2 tables. Technical report. Introduces PureTC-1B, an adapter-based pipeline for stabilizing Small Language Models in Traditional Chinese using CPT, SFT, and DPO
[NLP-60] RAG-BioQA: Retrieval-Augmented Generation for Long-Form Biomedical Question Answering
Link: https://arxiv.org/abs/2510.01612
Authors: Lovely Yeswanth Panchumarthi, Sai Prasad Gudari, Atharva Negi, Praveen Raj Budime, Harsit Upadhya
Affiliations: Emory University; Trine University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
[NLP-61] PychoBench: Evaluating the Psychology Intelligence of Large Language Models
Link: https://arxiv.org/abs/2510.01611
Authors: Min Zeng
Affiliations: Hong Kong University of Science and Technology
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
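[NLP-62] Bridging Collaborative Filtering and Large Language Models with Dynamic Alignment, Multimodal Fusion, and Evidence-grounded Explanations
[Quick read]: This paper addresses three problems in current recommender systems built on Large Language Models (LLMs). First, collaborative filtering (CF) models rely on static snapshots of user behavior and miss rapidly changing preferences. Second, many real-world items carry visual and audio content beyond text, which existing methods do not fuse effectively. Third, recommendations lack trustworthy explanations grounded in concrete collaborative patterns or item attributes. The key to the solution is a framework called \model with three components: (1) an online adaptation mechanism that continuously incorporates new user interactions through lightweight modules, avoiding retraining of large models; (2) a unified representation that combines collaborative signals with visual and audio features and handles cases where some modalities are unavailable; and (3) an explanation system that grounds recommendations in collaborative patterns and item attributes, producing natural language rationales users can verify. The approach keeps the efficiency of frozen base models while adding minimal computational overhead.
Link: https://arxiv.org/abs/2510.01606
Authors: Bo Ma, LuYao Liu, Simon Lau, Chandler Yuan, XueY Cui, Rosie Zhang
Affiliations: Peking University; China University of Political Science and Law
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Click to view abstract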
Abstract:Recent research has explored using Large Language Models for recommendation tasks by transforming user interaction histories and item metadata into text prompts, then having the LLM produce rankings or recommendations. A promising approach involves connecting collaborative filtering knowledge to LLM representations through compact adapter networks, which avoids expensive fine-tuning while preserving the strengths of both components. Yet several challenges persist in practice: collaborative filtering models often use static snapshots that miss rapidly changing user preferences; many real-world items contain rich visual and audio content beyond textual descriptions; and current systems struggle to provide trustworthy explanations backed by concrete evidence. Our work introduces \model, a framework that tackles these limitations through three key innovations. We develop an online adaptation mechanism that continuously incorporates new user interactions through lightweight modules, avoiding the need to retrain large models. We create a unified representation that seamlessly combines collaborative signals with visual and audio features, handling cases where some modalities may be unavailable. Finally, we design an explanation system that grounds recommendations in specific collaborative patterns and item attributes, producing natural language rationales users can verify. Our approach maintains the efficiency of frozen base models while adding minimal computational overhead, making it practical for real-world deployment.
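[NLP-63] A Comparison of Independent and Joint Fine-tuning Strategies for Retrieval-Augmented Generation
[Quick read]: This paper addresses how to choose a fine-tuning strategy for the multiple models in a Retrieval-Augmented Generation (RAG) pipeline, that is, how to balance performance gains against computational cost. The key to the solution is a systematic comparison of independent, joint, and two-phase fine-tuning. The strategies achieve about equal improvement in the EM and F1 generation quality metrics but differ significantly in computational cost, so the best choice depends on whether the training data includes context labels and whether a grid search over the learning rates of the embedding and generator models is required.
Link: https://arxiv.org/abs/2510.01600
Authors: Neal Gregory Lawton, Alfy Samuel, Anoop Kumar, Daben Liu
Affiliations: Capital One
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract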
Abstract:Retrieval augmented generation (RAG) is a popular framework for question answering that is powered by two large language models (LLMs): an embedding model that retrieves context documents from a database that are relevant to a given question, and a generator model that uses the retrieved context to generate an answer to the question. Both the embedding and generator models can be fine-tuned to increase performance of a RAG pipeline on a new task, but multiple fine-tuning strategies exist with different costs and benefits. In this paper, we evaluate and compare several RAG fine-tuning strategies, including independent, joint, and two-phase fine-tuning. In our experiments, we observe that all of these strategies achieve about equal improvement in EM and F1 generation quality metrics, although they have significantly different computational costs. We conclude the optimal fine-tuning strategy to use depends on whether the training dataset includes context labels and whether a grid search over the learning rates for the embedding and generator models is required. Keywords: Retrieval-Augmented Generation (RAG), Large Language Models (LLMs), Fine-tuning, Question Answering, Joint fine-tuning. Venue: EMNLP 2025 Findings (published 20 Aug 2025, last modified 17 Sept 2025; CC BY 4.0).
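The EM and F1 metrics mentioned above are commonly computed at the token level after light answer normalization. The sketch below follows the widely used SQuAD-style convention (lowercasing, stripping punctuation and English articles). That convention is an assumption here, since the abstract does not say which normalization the authors used, and the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The Eiffel Tower.", "eiffel tower")
f1 = token_f1("Eiffel Tower in Paris", "the Eiffel Tower")
```

EM rewards only answers that match after normalization, while F1 gives partial credit for overlapping tokens, so a longer answer that contains the gold span scores zero on EM but nonzero on F1.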
[NLP-64] CLUE: Non-parametric Verification from Experience via Hidden-State Clustering
Link: https://arxiv.org/abs/2510.01591
Authors: Zhenwen Liang, Ruosen Li, Yujun Zhou, Linfeng Song, Dian Yu, Xinya Du, Haitao Mi, Dong Yu
Affiliations: Tencent AI Lab; University of Texas at Dallas; University of Notre Dame
Categories: Computation and Language (cs.CL)
Comments:
[NLP-65] ReSSFormer: A Recursive Sparse Structured Transformer for Scalable and Long-Context Reasoning
Link: https://arxiv.org/abs/2510.01585
Authors: Haochen You, Baojing Liu
Affiliations: Columbia University; Hebei Institute of Communications
Categories: Computation and Language (cs.CL); Networking and Internet Architecture (cs.NI)
Comments: Accepted as a short paper at ACM Multimedia Asia 2025
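[NLP-66] Think Right: Learning to Mitigate Under-Over Thinking via Adaptive, Attentive Compression
[Quick read]: This paper addresses "under-adaptivity" in thinking models on complex reasoning tasks: the model fails to adjust its reasoning length to task difficulty. Reasoning that is too short (underthinking) causes errors on hard problems, while reasoning that is too long (overthinking) generates redundant steps and wastes tokens. The key to the solution is TRAAC (Think Right with Adaptive, Attentive Compression), an online post-training reinforcement learning (RL) method. It uses the model's self-attention over a long reasoning trajectory to identify important steps and prune redundant ones, and it estimates task difficulty and incorporates that estimate into the training reward so that the model learns to allocate reasoning budget according to difficulty. The method improves accuracy while shortening reasoning, and it generalizes to out-of-distribution tasks.
Link: https://arxiv.org/abs/2510.01581
Authors: Joykirat Singh, Justin Chih-Yao Chen, Archiki Prasad, Elias Stengel-Eskin, Akshay Nambi, Mohit Bansal
Affiliations: UNC Chapel Hill; The University of Texas at Austin; Microsoft Research
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Code: this https URL
Click to view abstract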
Abstract:Recent thinking models solve complex reasoning tasks by scaling test-time compute, but this scaling must be allocated in line with task difficulty. On one hand, short reasoning (underthinking) leads to errors on harder problems that require extended reasoning steps; but, excessively long reasoning (overthinking) can be token-inefficient, generating unnecessary steps even after reaching a correct intermediate solution. We refer to this as under-adaptivity, where the model fails to modulate its response length appropriately given problems of varying difficulty. To address under-adaptivity and strike a balance between under- and overthinking, we propose TRAAC (Think Right with Adaptive, Attentive Compression), an online post-training RL method that leverages the model’s self-attention over a long reasoning trajectory to identify important steps and prune redundant ones. TRAAC also estimates difficulty and incorporates it into training rewards, thereby learning to allocate reasoning budget commensurate with example difficulty. Our approach improves accuracy, reduces reasoning steps, and enables adaptive thinking compared to base models and other RL baselines. Across a variety of tasks (AIME, AMC, GPQA-D, BBEH), TRAAC (Qwen3-4B) achieves an average absolute accuracy gain of 8.4% with a relative reduction in reasoning length of 36.8% compared to the base model, and a 7.9% accuracy gain paired with a 29.4% length drop compared to the best RL baseline. TRAAC also shows strong generalization: although our models are trained on math datasets, they show accuracy and efficiency gains on out-of-distribution non-math datasets like GPQA-D, BBEH, and OptimalThinkingBench. Our analysis further verifies that TRAAC provides fine-grained adjustments to thinking budget based on difficulty and that a combination of task-difficulty calibration and attention-based compression yields gains across diverse tasks.
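The attention-based compression idea can be illustrated with a toy pruning routine. This is only a sketch under stated assumptions: each reasoning step is taken to come with a single precomputed importance score (standing in for aggregated self-attention), and `keep_ratio` is a hypothetical knob, whereas the abstract describes difficulty being folded into training rewards. It is not the TRAAC implementation.

```python
import math


def prune_steps(steps, attention_scores, keep_ratio):
    """Keep the highest-scoring steps, preserving their original order.

    steps: list of reasoning-step strings.
    attention_scores: one importance score per step (assumed precomputed).
    keep_ratio: fraction of steps to retain, in (0, 1].
    """
    if len(steps) != len(attention_scores):
        raise ValueError("steps and attention_scores must align")
    if not steps:
        return []
    # Always keep at least one step.
    keep_count = max(1, math.ceil(len(steps) * keep_ratio))
    ranked = sorted(
        range(len(steps)),
        key=lambda i: attention_scores[i],
        reverse=True,
    )
    kept_indices = sorted(ranked[:keep_count])  # restore trajectory order
    return [steps[i] for i in kept_indices]


trajectory = ["set up equation", "restate problem", "solve for x", "double-check again"]
scores = [0.40, 0.05, 0.50, 0.05]

compressed = prune_steps(trajectory, scores, keep_ratio=0.5)
```

A harder problem would plausibly warrant a larger `keep_ratio` so that more of the trajectory survives, but how the method actually ties compression to difficulty is not specified in the abstract.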
[NLP-67] Synthetic Prefixes to Mitigate Bias in Real-Time Neural Query Autocomplete SIGIR
Link: https://arxiv.org/abs/2510.01574
Authors: Adithya Rajan, Xiaoyu Liu, Prateek Verma, Vibhu Arora
Affiliations: Walmart Global Technology
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to the Proceedings of the ACM SIGIR Asia Pacific Conference on Information Retrieval (SIGIR-AP 2025), December 7-10, 2025, Xi’an, China
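[NLP-68] InvThink: Towards AI Safety via Inverse Reasoning
[Quick read]: This paper addresses the fact that Large Language Models (LLMs) do not proactively assess potential harms before generating a response, and it aims to improve safety without sacrificing general reasoning ability. Existing safety alignment methods usually optimize directly for safe outputs, which can impose a "safety tax": safety improves while performance on standard benchmarks degrades. The key to the solution is InvThink, which instructs the model to think inversely: first enumerate potential harms, then analyze their consequences, and finally generate safe outputs that avoid those risks. According to the reported experiments, safety improvements scale more strongly with model size than under existing methods, harmful responses drop by up to 15.7% in high-stakes settings such as medicine, finance, and law, and the trade-off between safety and capability is reduced.
Link: https://arxiv.org/abs/2510.01569
Authors: Yubin Kim, Taehan Kim, Eugene Park, Chunjong Park, Cynthia Breazeal, Daniel McDuff, Hae Won Park
Affiliations: Massachusetts Institute of Technology; Google Research; Google DeepMind
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Click to view abstract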
Abstract:We present InvThink, a simple yet powerful approach that gives large language models (LLMs) the capability of inverse thinking: reasoning through failure modes before generating responses. Unlike existing safety alignment methods that optimize directly for safe response, InvThink instructs models to 1) enumerate potential harms, 2) analyze their consequences, and 3) generate safe outputs that proactively avoid these risks. Our method reveals three key findings: (i) safety improvements show stronger scaling with model size compared to existing safety methods. (ii) InvThink mitigates safety tax; by training models to systematically consider failure modes, it preserves general reasoning capabilities on standard benchmarks. (iii) beyond general safety tasks, InvThink excels in high-stakes domains including external-facing (medicine, finance, law) and agentic (blackmail, murder) risk scenarios, achieving up to 15.7% reduction in harmful responses compared to baseline methods like SafetyPrompt. We further implement InvThink via supervised fine-tuning, and reinforcement learning across three LLM families. These results suggest that inverse reasoning provides a scalable and generalizable path toward safer, more capable language models.
[NLP-69] Information Seeking for Robust Decision Making under Partial Observability
Link: https://arxiv.org/abs/2510.01531
Authors: Djengo Cyun-Jyun Fang, Tsung-Wei Ke
Affiliations: National Taiwan University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Comments: The project page is available at this https URL
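[NLP-70] One More Question is Enough, Expert Question Decomposition (EQD) Model for Domain Quantitative Reasoning EMNLP2025
[Quick read]: This paper addresses domain-specific quantitative reasoning, where large language models (LLMs) remain limited, especially in fields such as finance that require expert knowledge and complex question answering (QA). The key to the solution is Expert Question Decomposition (EQD), which is built on a two-step fine-tuning framework and guided by a reward function that measures how much generated sub-questions improve QA outcomes. EQD needs only a few thousand training examples and a single A100 GPU for fine-tuning, and its inference time is comparable to zero-shot prompting. Across four financial benchmark datasets it outperforms state-of-the-art domain-tuned models and advanced prompting strategies, and the analysis finds that a single supporting question often helps more than detailed guidance steps.
Link: https://arxiv.org/abs/2510.01526
Authors: Mengyu Wang, Sotirios Sabanis, Miguel de Carvalho, Shay B. Cohen, Tiejun Ma
Affiliations: The University of Edinburgh; National Technical University of Athens; Archimedes/Athena Research Centre; University of Aveiro
Categories: Computation and Language (cs.CL); Computational Finance (q-fin.CP)
Comments: Accepted by EMNLP 2025
Click to view abstract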
Abstract:Domain-specific quantitative reasoning remains a major challenge for large language models (LLMs), especially in fields requiring expert knowledge and complex question answering (QA). In this work, we propose Expert Question Decomposition (EQD), an approach designed to balance the use of domain knowledge with computational efficiency. EQD is built on a two-step fine-tuning framework and guided by a reward function that measures the effectiveness of generated sub-questions in improving QA outcomes. It requires only a few thousand training examples and a single A100 GPU for fine-tuning, with inference time comparable to zero-shot prompting. Beyond its efficiency, EQD outperforms state-of-the-art domain-tuned models and advanced prompting strategies. We evaluate EQD in the financial domain, characterized by specialized knowledge and complex quantitative reasoning, across four benchmark datasets. Our method consistently improves QA performance by 0.6% to 10.5% across different LLMs. Our analysis reveals an important insight: in domain-specific QA, a single supporting question often provides greater benefit than detailed guidance steps.
[NLP-71] From Videos to Indexed Knowledge Graphs – Framework to Marry Methods for Multimodal Content Analysis and Understanding
Link: https://arxiv.org/abs/2510.01513
Authors: Basem Rizk, Joel Walsh, Mark Core, Benjamin Nye
Affiliations: University of Southern California
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:
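[NLP-72] Extracting O*NET Features from the NLx Corpus to Build Public Use Aggregate Labor Market Data
[Quick read]: This paper addresses two problems: data from online job postings are hard to access and are not built in a standard or transparent way, and the traditional occupational information database (O*NET) is updated infrequently and based on small survey samples. The key to the solution is to adopt O*NET as a framework and build a set of Natural Language Processing (NLP) tools, the Job Ad Analysis Toolkit (JAAT), that extract structured information from job postings, with reliability and accuracy checked through out-of-sample and LLM-as-a-Judge testing. The approach processes more than 155 million job ads and yields more than 10 billion data points covering tasks, occupation codes, tools and technologies, wages, skills, industry, and other features, providing a data foundation for labor market research, education, and workforce development.
Link: https://arxiv.org/abs/2510.01470
Authors: Stephen Meisenbacher, Svetlozar Nestorov, Peter Norlander
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: 85 pages
Click to view abstract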
Abstract:Data from online job postings are difficult to access and are not built in a standard or transparent manner. Data included in the standard taxonomy and occupational information database (O*NET) are updated infrequently and based on small survey samples. We adopt O*NET as a framework for building natural language processing tools that extract structured information from job postings. We publish the Job Ad Analysis Toolkit (JAAT), a collection of open-source tools built for this purpose, and demonstrate its reliability and accuracy in out-of-sample and LLM-as-a-Judge testing. We extract more than 10 billion data points from more than 155 million online job ads provided by the National Labor Exchange (NLx) Research Hub, including O*NET tasks, occupation codes, tools, and technologies, as well as wages, skills, industry, and more features. We describe the construction of a dataset of occupation, state, and industry level features aggregated by monthly active jobs from 2015 to 2025. We illustrate the potential for research and future uses in education and workforce development.
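[NLP-73] A-VERT: Agnostic Verification with Embedding Ranking Targets
[Quick read]: This paper addresses the limitations of current methods for automatically evaluating Language Model (LM) responses: they are either too expensive (using a large language model as a judge, LLM-as-a-Judge) or far from real-world conditions (string matching or log-probability methods). The key to the solution is a structure-free evaluation method that uses semantic embedding distances to match target candidates with arbitrary LM-generated text, giving a robust classification of the response at relatively low compute cost (embedding models with fewer than 10B parameters). Tested over three datasets and three different LM architectures, the method reaches a regression score of about 0.97 and an accuracy of about 96% against human annotators.
Link: https://arxiv.org/abs/2510.01469
Authors: Nicolás Aguirre, Ramiro Caso, Ramiro Rodríguez Colmeiro, Mauro Santelli, Joaquín Toranzo Calderón
Affiliations: GIAR, UTN-FRBA; Pnyx AI; IIF–SADAF–CONICET; Universidad de Buenos Aires
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 19 pages, 7 figures, code available at this https URL , authors in alphabetical order
Click to view abstract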
Abstract:The automatic evaluation of Language Model (LM) responses is a critical piece in the development of benchmarks and metrics, both for model training and quality assessment of production model endpoints. The current approaches to response classification rely on methods that are too expensive (i.e. LLM-as-a-Judge) or that are far from real-world conditions (string-matching, logprob). In this paper, a structure-free evaluation method is presented. The method makes use of semantic embedding distances to match target candidates with arbitrary LM-generated text, resulting in a robust classification of the response at a relatively low compute cost (embedding models of less than 10B parameters). The results show a regression score of ~0.97 and an accuracy of ~96% against human annotators, tested over 3 data sets and 3 different LM architectures.
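The matching idea in this abstract (rank target candidates by embedding similarity to a free-form response) can be illustrated with a small sketch. This is not the authors' implementation: `embed` is a hypothetical bag-of-words stand-in for a real semantic embedding model, and `rank_targets` is an illustrative name.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a semantic embedding model (assumption): word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_targets(response, targets):
    """Return (similarity, target) pairs, most similar target first."""
    response_vec = embed(response)
    scored = [(cosine(response_vec, embed(t)), t) for t in targets]
    return sorted(scored, reverse=True)


# The response is free text; no fixed answer format is required.
targets = ["Paris", "London", "Berlin"]
response = "I believe the capital of France is Paris"
best_score, best_target = rank_targets(response, targets)[0]
```

With a real embedding model in place of `embed`, the same ranking step would also tolerate paraphrases that share no tokens with the target.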
[NLP-74] LSPO: Length-aware Dynamic Sampling for Policy Optimization in LLM Reasoning
[Quick Read]: This paper aims to make Reinforcement Learning with Verifiable Rewards (RLVR) more efficient and effective for training Large Language Models (LLMs) on reasoning tasks. Recent work has largely focused on modifying loss functions. Motivated by studies of overthinking in LLMs, the paper proposes Length-aware Sampling for Policy Optimization (LSPO), a meta-RLVR algorithm whose key idea is to dynamically select the training data at each step based on the average response length. Evaluations across multiple base models and datasets show that LSPO consistently improves learning effectiveness, and an ablation study examines alternative ways of incorporating length signals into dynamic sampling.
Link: https://arxiv.org/abs/2510.01459
Authors: Weizhe Chen, Sven Koenig, Bistra Dilkina
Affiliations: University of Southern California; University of California, Irvine
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Since the release of Deepseek-R1, reinforcement learning with verifiable rewards (RLVR) has become a central approach for training large language models (LLMs) on reasoning tasks. Recent work has largely focused on modifying loss functions to make RLVR more efficient and effective. In this paper, motivated by studies of overthinking in LLMs, we propose Length-aware Sampling for Policy Optimization (LSPO), a novel meta-RLVR algorithm that dynamically selects training data at each step based on the average response length. We evaluate LSPO across multiple base models and datasets, demonstrating that it consistently improves learning effectiveness. In addition, we conduct a detailed ablation study to examine alternative ways of incorporating length signals into dynamic sampling, offering further insights and highlighting promising directions for future research.
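To make "dynamically selects training data at each step based on the average response length" concrete, here is a minimal sketch of one possible selection rule. The abstract does not state the rule itself, so keeping the `k` shortest and `k` longest prompts is purely an assumption for illustration, as are the names `select_by_length` and `rollouts`.

```python
from statistics import mean


def select_by_length(rollouts, k):
    """Pick prompts for the next update from their average response length.

    rollouts: dict mapping prompt id -> list of response lengths in tokens.
    Hypothetical rule (not from the paper): keep the k prompts with the
    shortest and the k prompts with the longest average response.
    """
    avg_len = {pid: mean(lengths) for pid, lengths in rollouts.items()}
    ranked = sorted(avg_len, key=avg_len.get)
    if 2 * k >= len(ranked):
        return ranked  # too few prompts to filter anything out
    return ranked[:k] + ranked[-k:]


# Six prompts, two sampled responses each.
rollouts = {
    "p0": [100, 120],
    "p1": [900, 1100],
    "p2": [400, 420],
    "p3": [50, 70],
    "p4": [2000, 1800],
    "p5": [500, 510],
}
selected = select_by_length(rollouts, k=2)
```

Because the filter runs again at every step, the selected set changes as the policy's response lengths drift during training.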
[NLP-75] VOGUE: Guiding Exploration with Visual Uncertainty Improves Multimodal Reasoning
Link: https://arxiv.org/abs/2510.01444
Authors: Rui Liu, Dian Yu, Tong Zheng, Runpeng Dai, Zongxia Li, Wenhao Yu, Zhenwen Liang, Linfeng Song, Haitao Mi, Pratap Tokekar, Dong Yu
Affiliations: Tencent AI Lab; Tsinghua University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
[NLP-76] Optimal Stopping vs Best-of-N for Inference Time Optimization
[Quick Read]: This paper addresses how to balance output quality against inference cost in Large Language Model (LLM) generation, especially when multiple generations are used (as in Best-of-N sampling). It casts the problem as the classical Pandora’s Box optimal stopping problem and proposes a UCB (Upper Confidence Bound) style adaptive algorithm that decides when to stop generating without knowing the underlying reward distribution, with performance provably close to Weitzman’s algorithm, the optimal strategy when the distribution is known. A Bradley-Terry inspired transformation normalizes rewards across prompts so that stopping thresholds can be learned on the fly. In experiments, the adaptive strategy matches non-adaptive Best-of-N sampling while requiring 15-35% fewer generations on average.
Link: https://arxiv.org/abs/2510.01394
Authors: Yusuf Kalayci, Vinod Raman, Shaddin Dughmi
Affiliations: University of Southern California; University of Michigan
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 24 pages
Abstract:Large language model (LLM) generation often requires balancing output quality against inference cost, especially when using multiple generations. We introduce a new framework for inference-time optimization based on the classical Pandora’s Box problem. Viewing each generation as opening a costly “box” with random reward, we develop algorithms that decide when to stop generating without knowing the underlying reward distribution. Our first contribution is a UCB-style Pandora’s Box algorithm, which achieves performance that is provably close to Weitzman’s algorithm, the optimal strategy when the distribution is known. We further adapt this method to practical LLM settings by addressing reward scaling across prompts via a Bradley-Terry inspired transformation. This leads to an adaptive inference-time optimization method that normalizes rewards and learns stopping thresholds on the fly. Experiments on the AlpacaFarm and HH-RLHF datasets, using multiple LLM-reward model pairs, show that our adaptive strategy can obtain the same performance as non-adaptive Best-of-N sampling while requiring 15-35 percent fewer generations on average. Our results establish a principled bridge between optimal stopping theory and inference-time scaling, providing both theoretical performance bounds and practical efficiency gains for LLM deployment.
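For orientation, the known-distribution baseline mentioned in the abstract can be sketched for the simplest case of identical boxes. When every generation has the same cost `c` and rewards are i.i.d., Weitzman's rule reduces to a single threshold: the reservation value `z` solving E[max(X - z, 0)] = c, with generation stopping once the best reward reaches `z`. The sketch below estimates `z` from an empirical reward sample; it does not implement the paper's UCB-style learner or its Bradley-Terry normalization, and all function names are illustrative.

```python
def reservation_value(samples, cost, tol=1e-9):
    """Solve mean(max(x - z, 0)) = cost for z by bisection (requires cost > 0)."""
    def expected_gain(z):
        return sum(max(x - z, 0.0) for x in samples) / len(samples)

    # expected_gain is non-increasing in z: >= cost at lo, 0 at hi.
    lo, hi = min(samples) - cost, max(samples)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_gain(mid) > cost:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def generate_until_threshold(draw_reward, threshold, max_draws):
    """Keep generating while the best reward so far is below the threshold."""
    best, n = float("-inf"), 0
    while n < max_draws and best < threshold:
        best = max(best, draw_reward())
        n += 1
    return best, n


# Rewards are 0 or 1 with equal probability; each generation costs 0.25.
z = reservation_value([0.0, 1.0], cost=0.25)

rewards = iter([0.2, 0.6, 0.9])
best, n_draws = generate_until_threshold(lambda: next(rewards), z, max_draws=3)
```

In this toy run the loop stops after the second draw, whereas Best-of-3 would have paid for all three generations.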
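[NLP-77] TAG-EQA: Text-And-Graph for Event Question Answering via Structured Prompting Strategies
Link: https://arxiv.org/abs/2510.01391
Authors: Maithili Kadam, Francis Ferraro
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted at *SEM 2025
[NLP-78] Fine-tuning with RAG for Improving LLM Learning of New Skills ICLR2026
Link: https://arxiv.org/abs/2510.01375
Authors: Humaid Ibrahim, Nikolai Rozanov, Marek Rei
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Under review at ICLR 2026
[NLP-79] Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort
Link: https://arxiv.org/abs/2510.01367
Authors: Xinpeng Wang, Nitish Joshi, Barbara Plank, Rico Angell, He He
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
[NLP-80] WAInjectBench: Benchmarking Prompt Injection Detections for Web Agents
[Quick Read]: This paper addresses the lack of systematic evaluation of detectors for prompt injection attacks against web agents. Existing detection methods target general prompt injection and have not been systematically evaluated for web agents, so their effectiveness in that setting is unclear. The key contribution is the first comprehensive benchmark study for this setting: a fine-grained categorization of attacks based on the threat model, datasets of malicious and benign samples in both text and image form, a systematization of text-based and image-based detection methods, and an evaluation across multiple scenarios. The results show that some detectors identify attacks relying on explicit textual instructions or visible image perturbations with moderate to high accuracy, but largely fail against attacks that omit explicit instructions or use imperceptible perturbations.
Link: https://arxiv.org/abs/2510.01354
Authors: Yinuo Liu, Ruohan Xu, Xilong Wang, Yuqi Jia, Neil Zhenqiang Gong
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: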
Abstract:Multiple prompt injection attacks have been proposed against web agents. At the same time, various methods have been developed to detect general prompt injection attacks, but none have been systematically evaluated for web agents. In this work, we bridge this gap by presenting the first comprehensive benchmark study on detecting prompt injection attacks targeting web agents. We begin by introducing a fine-grained categorization of such attacks based on the threat model. We then construct datasets containing both malicious and benign samples: malicious text segments generated by different attacks, benign text segments from four categories, malicious images produced by attacks, and benign images from two categories. Next, we systematize both text-based and image-based detection methods. Finally, we evaluate their performance across multiple scenarios. Our key findings show that while some detectors can identify attacks that rely on explicit textual instructions or visible image perturbations with moderate to high accuracy, they largely fail against attacks that omit explicit instructions or employ imperceptible perturbations. Our datasets and code are released at: this https URL.
[NLP-81] MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent Environments NEURIPS2025
Link: https://arxiv.org/abs/2510.01353
Authors: Darshan Deshpande, Varun Gangal, Hersh Mehta, Anand Kannappan, Rebecca Qian, Peng Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to NeurIPS 2025 SEA Workshop
[NLP-82] Aristotle: IMO-level Automated Theorem Proving
[Quick Read]: This paper addresses the weak performance of Automated Theorem Proving (ATP) on hard competition mathematics, in particular how to combine formal verification with informal reasoning. The key to the solution is Aristotle, an AI system that integrates three components: a Lean proof search system, an informal reasoning system that generates and formalizes lemmas, and a dedicated geometry solver. The system achieves gold-medal-equivalent performance on the 2025 International Mathematical Olympiad (IMO) problems and shows favorable scaling properties.
Link: https://arxiv.org/abs/2510.01346
Authors: Tudor Achim, Alex Best, Kevin Der, Mathïs Fédérico, Sergei Gukov, Daniel Halpern-Leister, Kirsten Henningsgard, Yury Kudryashov, Alexander Meiburg, Martin Michelsen, Riley Patterson, Eric Rodriguez, Laura Scharff, Vikram Shanker, Vladmir Sicca, Hari Sowrirajan, Aidan Swope, Matyas Tamas, Vlad Tenev, Jonathan Thomm, Harold Williams, Lawrence Wu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:We introduce Aristotle, an AI system that combines formal verification with informal reasoning, achieving gold-medal-equivalent performance on the 2025 International Mathematical Olympiad problems. Aristotle integrates three main components: a Lean proof search system, an informal reasoning system that generates and formalizes lemmas, and a dedicated geometry solver. Our system demonstrates state-of-the-art performance with favorable scaling properties for automated theorem proving.
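[NLP-83] HiSpec: Hierarchical Speculative Decoding for LLMs
[Quick Read]: This paper addresses the inference-speed bottleneck of Large Language Models (LLMs), specifically that verification is often the bottleneck in speculative decoding. Most prior work accelerates drafting only, and existing intermediate-verification methods incur substantial training overhead, increase the memory footprint, and compromise accuracy by relying on approximate heuristics. The key to the solution is HiSpec, a hierarchical speculative decoding framework that uses early-exit (EE) models for low-overhead intermediate verification: EE models let tokens exit early by skipping layer traversal and are explicitly trained so that hidden states at selected layers can be interpreted. HiSpec also reuses key-value caches and hidden states across the draft, intermediate verifier, and target models, and periodically validates the tokens accepted by the intermediate verifier against the target model to maintain accuracy. Experiments show an average throughput improvement of 1.28× and up to 2.01× over the baseline single-layer speculation.
Link: https://arxiv.org/abs/2510.01336
Authors: Avinash Kumar, Sujay Sanghavi, Poulami Das
Affiliations: The University of Texas at Austin
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: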
Abstract:Speculative decoding accelerates LLM inference by using a smaller draft model to speculate tokens that a larger target model verifies. Verification is often the bottleneck (e.g. verification is 4× slower than token generation when a 3B model speculates for a 70B target model), but most prior works focus only on accelerating drafting. “Intermediate” verification reduces verification time by discarding inaccurate draft tokens early, but existing methods incur substantial training overheads in incorporating the intermediate verifier, increase the memory footprint to orchestrate the intermediate verification step, and compromise accuracy by relying on approximate heuristics. We propose Hierarchical Speculative Decoding (HiSpec), a framework for high-throughput speculative decoding that exploits early-exit (EE) models for low-overhead intermediate verification. EE models allow tokens to exit early by skipping layer traversal and are explicitly trained so that hidden states at selected layers can be interpreted, making them uniquely suited for intermediate verification without drastically increasing compute and memory overheads. To improve resource-efficiency even further, we design a methodology that enables HiSpec to re-use key-value caches and hidden states between the draft, intermediate verifier, and target models. To maintain accuracy, HiSpec periodically validates the draft tokens accepted by the intermediate verifier against the target model. Our evaluations using various representative benchmarks and models show that HiSpec improves throughput by 1.28× on average and by up to 2.01× compared to the baseline single-layer speculation without compromising accuracy.
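As background for the first sentence of the abstract, the baseline draft-then-verify loop can be sketched with greedy decoding. This covers only plain speculative decoding; HiSpec's early-exit intermediate verifier and cache reuse are not shown. The two "models" are hypothetical toy functions that map a token prefix to the next token.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One greedy speculative-decoding step.

    The draft model proposes k tokens; the target model checks them in
    order. The longest agreeing prefix is accepted, and the first mismatch
    is replaced by the target's own token. (Real systems verify all k
    positions in one batched target pass; this loop only shows the logic.)
    """
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    accepted = []
    for token in proposal:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # target's correction ends the step
            return accepted
        accepted.append(token)
    return accepted


def target_next(tokens):
    """Toy target model (assumption): next token is len(prefix) mod 5."""
    return len(tokens) % 5


def draft_next(tokens):
    """Toy draft model: agrees with the target except at position 2."""
    return 9 if len(tokens) == 2 else len(tokens) % 5


new_tokens = speculative_step([], draft_next, target_next, k=4)
```

The output always matches what the target alone would have produced greedily; the speedup comes from accepting several draft tokens per target verification.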
[NLP-84] Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models
Link: https://arxiv.org/abs/2510.01304
Authors: Yu Zeng, Wenxuan Huang, Shiting Huang, Xikun Bao, Yukun Qi, Yiming Zhao, Qiuchen Wang, Lin Chen, Zehui Chen, Huaian Chen, Wanli Ouyang, Feng Zhao
Affiliations: University of Science and Technology of China; Shanghai AI Laboratory; East China Normal University; The Chinese University of Hong Kong
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
[NLP-85] LLM-based Multi-Agent Blackboard System for Information Discovery in Data Science
Link: https://arxiv.org/abs/2510.01285
Authors: Alireza Salemi, Mihir Parmar, Palash Goyal, Yiwen Song, Jinsung Yoon, Hamed Zamani, Hamid Palangi, Tomas Pfister
Affiliations: University of Massachusetts Amherst; Google Cloud AI Research
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
[NLP-86] Evaluation Sheet for Deep Research: A Use Case for Academic Survey Writing
Link: https://arxiv.org/abs/2510.01283
Authors: Israel Abebe Azime, Tadesse Destaw Belay, Atnafu Lambebo Tonja
Affiliations: Saarland University; Instituto Politécnico Nacional; MBZUAI
Categories: Computation and Language (cs.CL)
Comments:
[NLP-87] TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture
Link: https://arxiv.org/abs/2510.01279
Authors: Yongchao Chen, Jiefeng Chen, Rui Meng, Ji Yin, Na Li, Chuchu Fan, Chi Wang, Tomas Pfister, Jinsung Yoon
Affiliations: MIT; Harvard; Google Cloud AI Research; Google DeepMind
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 27 pages, 13 figures
[NLP-88] LLM Based Sentiment Classification From Bangladesh E-Commerce Reviews
Link: https://arxiv.org/abs/2510.01276
Authors: Sumaiya Tabassum
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
[NLP-89] TraceDet: Hallucination Detection from the Decoding Trace of Diffusion Large Language Models
Link: https://arxiv.org/abs/2510.01274
Authors: Shenxu Chang, Junchi Yu, Weixing Wang, Yongqiang Chen, Jialin Yu, Philip Torr, Jindong Gu
Affiliations: University of Oxford; University of Potsdam; Carnegie Mellon University; Mohamed bin Zayed University of Artificial Intelligence
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
[NLP-90] Think Twice Generate Once: Safeguarding by Progressive Self-Reflection EMNLP2025
Link: https://arxiv.org/abs/2510.01270
Authors: Hoang Phan, Victor Li, Qi Lei
Affiliations: New York University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to EMNLP 2025 Findings
[NLP-91] AdaDetectGPT: Adaptive Detection of LLM-Generated Text with Statistical Guarantees NEURIPS2025
Link: https://arxiv.org/abs/2510.01268
Authors: Hongyi Zhou, Jin Zhu, Pingfan Su, Kai Ye, Ying Yang, Shakeel A O B Gavioli-Akilagun, Chengchun Shi
Affiliations: Tsinghua University; University of Birmingham; LSE; City University Hong Kong
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted by NeurIPS 2025
[NLP-92] OpenAI’s GPT-OSS-20B Model and Safety Alignment Issues in a Low-Resource Language
Link: https://arxiv.org/abs/2510.01266
Authors: Isa Inuwa-Dutse
Affiliations: University of Huddersfield
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages, 4 tables
[NLP-93] RLP: Reinforcement as a Pretraining Objective
Link: https://arxiv.org/abs/2510.01265
Authors: Ali Hatamizadeh, Syeda Nahida Akter, Shrimai Prabhumoye, Jan Kautz, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Yejin Choi
Affiliations: NVIDIA; Carnegie Mellon University; Boston University; Stanford University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: RLP introduces a new paradigm for RL-based Pretraining
[NLP-94] In AI Sweet Harmony: Sociopragmatic Guardrail Bypasses and Evaluation-Awareness in OpenAI gpt-oss-20b
Link: https://arxiv.org/abs/2510.01259
Authors: Nils Durner
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 27 pages, 1 figure
[NLP-95] Measuring Algorithmic Partisanship via Zero-Shot Classification and Its Implications on Political Discourse
Link: https://arxiv.org/abs/2510.01258
Authors: Nathan Junzi Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 19 pages, 7 figures
[NLP-96] RJE: A Retrieval-Judgment-Exploration Framework for Efficient Knowledge Graph Question Answering with LLMs
[Quick Read]: This paper addresses two limitations of current Large Language Model (LLM) based methods for Knowledge Graph Question Answering (KGQA): retrieval-based methods are constrained by the quality of the retrieved information, while agent-based methods rely heavily on proprietary LLMs. The key to the solution is the Retrieval-Judgment-Exploration (RJE) framework, which retrieves refined reasoning paths, judges whether they are sufficient, and conditionally explores additional evidence. Specialized auxiliary modules (Reasoning Path Ranking, Question Decomposition, and Retriever-assisted Exploration) let small open-source LLMs (such as 3B and 8B parameters) achieve competitive results without fine-tuning. RJE also substantially reduces the number of LLM calls and token usage compared with agent-based methods.
Link: https://arxiv.org/abs/2510.01257
Authors: Can Lin, Zhengwang Jiang, Ling Zheng, Qi Zhao, Yuhang Zhang, Qi Song, Wangqiu Zhou
Affiliations: University of Science and Technology of China; City University of Hong Kong; Deqing Alpha Innovation Institute; Hefei University of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages, 9 figures
Abstract:Knowledge graph question answering (KGQA) aims to answer natural language questions using knowledge graphs. Recent research leverages large language models (LLMs) to enhance KGQA reasoning, but faces limitations: retrieval-based methods are constrained by the quality of retrieved information, while agent-based methods rely heavily on proprietary LLMs. To address these limitations, we propose Retrieval-Judgment-Exploration (RJE), a framework that retrieves refined reasoning paths, evaluates their sufficiency, and conditionally explores additional evidence. Moreover, RJE introduces specialized auxiliary modules enabling small-sized LLMs to perform effectively: Reasoning Path Ranking, Question Decomposition, and Retriever-assisted Exploration. Experiments show that our approach with proprietary LLMs (such as GPT-4o-mini) outperforms existing baselines while enabling small open-source LLMs (such as 3B and 8B parameters) to achieve competitive results without fine-tuning LLMs. Additionally, RJE substantially reduces the number of LLM calls and token usage compared to agent-based methods, yielding significant efficiency improvements.
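The retrieve, judge, then conditionally explore control flow described above can be sketched as a small loop. Everything below is hypothetical: in the paper these stages involve retrievers and LLM calls, whereas here `retrieve`, `judge`, `explore`, and `answer` are hand-written stubs over a two-edge toy knowledge graph, and `max_rounds` is an assumed safety limit.

```python
def rje_answer(question, retrieve, judge, explore, answer, max_rounds=3):
    """Retrieve paths, judge sufficiency, and explore only if needed."""
    paths = retrieve(question)
    rounds = 0
    while not judge(question, paths) and rounds < max_rounds:
        paths = paths + explore(question, paths)
        rounds += 1
    return answer(question, paths), rounds


# Toy knowledge graph: head entity -> list of (relation, tail entity).
KG = {
    "Marie Curie": [("born_in", "Warsaw")],
    "Warsaw": [("capital_of", "Poland")],
}


def retrieve(question):
    """Stub retriever: returns one initial reasoning path."""
    return [("Marie Curie", "born_in", "Warsaw")]


def judge(question, paths):
    """Stub judge: evidence is sufficient once a capital_of edge is present."""
    return any(relation == "capital_of" for _, relation, _ in paths)


def explore(question, paths):
    """Stub explorer: expand one hop from the tails of the current paths."""
    frontier = {tail for _, _, tail in paths}
    return [
        (head, relation, tail)
        for head in frontier
        for relation, tail in KG.get(head, [])
        if (head, relation, tail) not in paths
    ]


def answer(question, paths):
    """Stub answerer: return the tail of the last path."""
    return paths[-1][2]


result, rounds_used = rje_answer(
    "Which country has the birthplace of Marie Curie as its capital?",
    retrieve, judge, explore, answer,
)
```

The point of the judgment step is visible in `rounds_used`: exploration runs only when the retrieved evidence is insufficient, which is where the savings in LLM calls would come from.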
[NLP-97] Longitudinal Monitoring of LLM Content Moderation of Social Issues
Link: https://arxiv.org/abs/2510.01255
Authors: Yunlang Dai, Emma Lurie, Danaé Metaxa, Sorelle A. Friedler
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:
[NLP-98] Do Bias Benchmarks Generalise? Evidence from Voice-based Evaluation of Gender Bias in SpeechLLMs ICASSP2026
Link: https://arxiv.org/abs/2510.01254
Authors: Shree Harsha Bokkahalli Satish, Gustav Eje Henter, Éva Székely
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 2 figures, submitted to IEEE ICASSP 2026
[NLP-99] GPT and Prejudice: A Sparse Approach to Understanding Learned Representations in Large Language Models
Link: https://arxiv.org/abs/2510.01252
Authors: Mariam Mahran, Katharina Simbeck
Affiliations: HTW Berlin University of Applied Sciences
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint. Draft version, subject to revision. 8 pages, 3 figures
[NLP-100] Efficient Uncertainty Estimation for LLM-based Entity Linking in Tabular Data
Link: https://arxiv.org/abs/2510.01251
Authors: Carlo Bono, Federico Belotti, Matteo Palmonari
Affiliations: Politecnico di Milano; Università degli Studi di Milano Bicocca
Categories: Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments:
[NLP-101] GemDetox at TextDetox CLEF 2025: Enhancing a Massively Multilingual Model for Text Detoxification on Low-resource Languages
Link: https://arxiv.org/abs/2510.01250
Authors: Trung Duc Anh Dang, Ferdinando Pio D’Elia
Affiliations: Centre for Language Technology, University of Copenhagen, Denmark
Categories: Computation and Language (cs.CL)
Comments:
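[NLP-102] LOCA: Logical Chain Augmentation for Scientific Corpus Cleaning
[Quick Read]: This paper addresses the high error rates of scientific question-answering (QA) datasets, which frequently result from logical leaps and implicit reasoning within the answers, with the goal of making Large Language Models (LLMs) more reliable at scientific problem-solving. The key to the solution is LOCA (Logical Chain Augmentation), a framework that cleans scientific corpora automatically through an augment-and-review loop: it completes the missing logical steps in raw answers and explicitly separates the underlying scientific principle from its subsequent derivation. Applied to challenging scientific corpora, LOCA typically reduces the error rate from as high as 20% to below 2%.
Link: https://arxiv.org/abs/2510.01249
Authors: You-Le Fang, Dong-Shan Jian, Xiang Li, Ce Meng, Ling-Shi Meng, Chen-Xu Yan, Zhi-Zhang Bian, Yan-Qing Ma
Affiliations: Peking University; Center for High Energy Physics, Peking University
Categories: Computation and Language (cs.CL)
Comments: 29 pages, 2 figures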
Abstract:While Large Language Models (LLMs) excel in general domains, their reliability often falls short in scientific problem-solving. The advancement of scientific AI depends on large-scale, high-quality corpora. However, existing scientific question-answering (QA) datasets suffer from high error rates, frequently resulting from logical leaps and implicit reasoning within the answers. To address this issue, we introduce LOCA (Logical Chain Augmentation), a novel framework for automatically cleaning scientific corpora, implemented through an augment-and-review loop. At its core, LOCA enhances raw answers by completing missing logical steps and explicitly separating the underlying scientific principle from its subsequent derivation. By applying LOCA to challenging scientific corpora, we demonstrate that it can automatically filter noisy datasets, typically reducing the error rate from as high as 20% to below 2%. LOCA provides a scalable and effective methodology for creating high-quality scientific corpora, paving the way for more reliable training and evaluation of scientific AI.
[NLP-103] SSTAG: Structure-Aware Self-Supervised Learning Method for Text-Attributed Graphs NEURIPS2025
Link: https://arxiv.org/abs/2510.01248
Authors: Ruyue Liu, Rong Yin, Xiangzhen Bo, Xiaoshuai Hao, Yong Liu, Jinwen Zhong, Can Ma, Weiping Wang
Affiliations: Institute of Information Engineering, CAS; School of Cyberspace Security, UCAS; Wuhan University of Technology; Xiaomi EV; Renmin University of China
Categories: Computation and Language (cs.CL)
Comments: Accepted by NeurIPS 2025
[NLP-104] Lets Play Across Cultures: A Large Multilingual Multicultural Benchmark for Assessing Language Models Understanding of Sports EMNLP’25
Link: https://arxiv.org/abs/2510.01247
Authors: Punit Kumar Singh, Nishant Kumar, Akash Ghosh, Kunal Pasad, Khushi Soni, Manisha Jaishwal, Sriparna Saha, Syukron Abu Ishaq Alfarozi, Asres Temam Abagissa, Kitsuchart Pasupa, Haiqin Yang, Jose G Moreno
Affiliations: Indian Institute of Technology Patna; Sardar Patel Institute of Technology; Universitas Gadjah Mada; King Mongkut’s Institute of Technology Ladkrabang; Shenzhen Technology University; Université de Toulouse
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 52 pages, 56 figures; appearing at EMNLP’25
[NLP-105] A Comparative Analysis of Sparse Autoencoder and Activation Difference in Language Model Steering
[Quick Read]: This paper addresses two problems with using Sparse Autoencoders (SAEs) for language model steering: interference from redundant features and degenerate outputs. Prior work steers with the top-k SAE latents, but many of those dimensions capture non-semantic features such as punctuation rather than semantic attributes such as instructions, and constant-strength SAE steering often produces degenerate outputs such as repetitive single words. The key to the solution is twofold: focus on the single most relevant latent (top-1) to eliminate redundant features, and introduce a token-wise decaying steering strategy that mitigates degeneration and enables more faithful comparison with mean activation difference baselines. Empirically, steering a reasoning-related latent reliably elicits step-by-step mathematical reasoning; SAEs outperform mean activation difference methods on mathematical reasoning benchmarks and match them on IF-Eval.
Link: https://arxiv.org/abs/2510.01246
Authors: Jiaqing Xie
Affiliations: ETH Zurich
Categories: Computation and Language (cs.CL)
Comments: 25 pages
Abstract:Sparse autoencoders (SAEs) have recently emerged as a powerful tool for language model steering. Prior work has explored top-k SAE latents for steering, but we observe that many dimensions among the top-k latents capture non-semantic features such as punctuation rather than semantic attributes like instructions. To address this, we propose focusing on a single, most relevant SAE latent (top-1), eliminating redundant features. We further identify a limitation in constant SAE steering, which often produces degenerate outputs such as repetitive single words. To mitigate this, we introduce a token-wise decaying steering strategy, enabling more faithful comparisons with mean activation difference baselines. Empirically, we show that steering an SAE latent associated with reasoning reliably elicits step-by-step mathematical reasoning and enhances inference quality, functionally resembling the effect of appending a guiding token. Our results demonstrate that SAEs outperform mean activation difference methods on mathematical reasoning benchmarks and match their performance on IF-Eval.
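A token-wise decaying steering schedule can be written down in a few lines. The abstract does not give the decay form, so the exponential schedule below (hidden state at token t shifted by alpha * decay^t along the steering direction) is an assumption, as are the names `decayed_steering`, `alpha`, and `decay`. Setting `decay=1.0` recovers constant-strength steering.

```python
def decayed_steering(hidden_states, direction, alpha, decay):
    """Add a steering vector whose strength shrinks with token position.

    hidden_states: list of per-token activation vectors (lists of floats).
    direction: steering vector, e.g. the decoder direction of one SAE latent.
    Assumed schedule (not from the paper): scale_t = alpha * decay ** t.
    """
    steered = []
    for t, h in enumerate(hidden_states):
        scale = alpha * (decay ** t)
        steered.append([x + scale * d for x, d in zip(h, direction)])
    return steered


# Three generated tokens with 2-dimensional activations, all zeros.
hidden = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
direction = [1.0, 0.0]

steered = decayed_steering(hidden, direction, alpha=4.0, decay=0.5)
constant = decayed_steering(hidden, direction, alpha=4.0, decay=1.0)
```

The intuition is that a strong push on the first tokens sets the behaviour, while the fading strength leaves later tokens closer to the model's own distribution and so less prone to repetition.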
[NLP-106] SeMob: Semantic Synthesis for Dynamic Urban Mobility Prediction EMNLP2025
Link: https://arxiv.org/abs/2510.01245
Authors: Runfei Chen, Shuyang Jiang, Wei Huang
Affiliations: Urban Mobility Institute, Tongji University, China; College of Surveying and Geo-informatics, Tongji University, China; College of Computer Science and Artificial Intelligence, Fudan University, China; Department of Civil Engineering, Toronto Metropolitan University, Canada
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2025
[NLP-107] Feasibility of Structuring Stress Documentation Using an Ontology-Guided Large Language Model
Link: https://arxiv.org/abs/2510.01244
Authors: Hyeoneui Kim, Jeongha Kim, Huijing Xu, Jinsun Jung, Sunghoon Kang, Sun Joo Jang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
[NLP-108] Detoxifying Large Language Models via Autoregressive Reward Guided Representation Editing NEURIPS25
Link: https://arxiv.org/abs/2510.01243
Authors: Yisong Xiao, Aishan Liu, Siyuan Liang, Zonghao Ying, Xianglong Liu, Dacheng Tao
Affiliations: Beihang University; National University of Singapore; Zhongguancun Laboratory; Institute of Dataspace; Nanyang Technological University
Categories: Computation and Language (cs.CL)
Comments: Accepted to NeurIPS 25
[NLP-109] Redundancy-as-Masking: Formalizing the Artificial Age Score (AAS) to Model Memory Aging in Generative AI
Link: https://arxiv.org/abs/2510.01242
Authors: Seyma Yaman Kayadibi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 34 pages, 17 figures. Includes theoretical development and mathematical proofs of the Artificial Age Score (AAS), with empirical illustrations via ChatGPT-based memory recall experiments (screenshots included)
[NLP-110] SKYLENAGE Technical Report: Mathematical Reasoning and Contest-Innovation Benchmarks for Multi-Level Math Evaluation
Link: https://arxiv.org/abs/2510.01241
Authors: Hu Wei, Ze Xu, Boyu Yang, Linlin Miao, Weiqi Zhai, Yihan Li, Zixuan Li, Zhijun Wang, Boya Wang, Jianwei Yu, Jialing Yuan, Xiaoyue Zhang, Cheng He, Minglei Chen, Zifan Zhang, Qianhui Li, Wei Wang, Xiang Xu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
[NLP-111] RSAVQ: Riemannian Sensitivity-Aware Vector Quantization for Large Language Models
Link: https://arxiv.org/abs/2510.01240
Authors: Zukang Xu, Xing Hu, Qiang Wu, Dawei Yang
Affiliations: Houmo AI
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
[NLP-112] CIFLEX: Contextual Instruction Flow for Sub-task Execution in Multi-Turn Interactions with a Single On-Device LLM EMNLP2025
Link: https://arxiv.org/abs/2510.01239
Authors: Juntae Lee, Jihwan Bang, Seunghan Yang, Simyung Chang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted at EMNLP 2025 (main)
[NLP-113] Silent Tokens Loud Effects: Padding in LLMs NEURIPS2025
Link: https://arxiv.org/abs/2510.01238
Authors: Rom Himelstein, Amit LeVi, Yonatan Belinkov, Avi Mendelson
Affiliations: Technion - Israel Institute of Technology
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: NeurIPS 2025 Workshop: LLM Evaluation
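[NLP-114] Confidence-Aware Routing for Large Language Model Reliability Enhancement: A Multi-Signal Approach to Pre-Generation Hallucination Mitigation
[Quick Read]: This paper addresses hallucination in Large Language Models (LLMs), where a model generates plausible yet factually incorrect content. Current mitigation strategies focus on post-generation correction, which is computationally expensive and does not prevent unreliable content from being generated. The key to the solution is a confidence-aware routing system that assesses model uncertainty before generation and routes each query by estimated reliability: local generation for high confidence, retrieval-augmented generation for medium confidence, a larger model for low confidence, and human review for very low confidence. The unified confidence score combines three complementary signals: semantic alignment between internal representations and reference embeddings, convergence analysis across model layers, and learned confidence estimation. On knowledge-intensive QA benchmarks, hallucination detection improves (0.74 vs. 0.42 for the baseline), the F1 score rises from 0.61 to 0.82 with a false positive rate of 0.09, and computational cost falls by 40% compared with post-hoc methods.
Link: https://arxiv.org/abs/2510.01237
Authors: Nandakishor M
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: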
Abstract:Large Language Models suffer from hallucination, generating plausible yet factually incorrect content. Current mitigation strategies focus on post-generation correction, which is computationally expensive and fails to prevent unreliable content generation. We propose a confidence-aware routing system that proactively assesses model uncertainty before generation and redirects queries based on estimated reliability. Our approach combines three complementary signals: semantic alignment between internal representations and reference embeddings, internal convergence analysis across model layers, and learned confidence estimation. The unified confidence score determines routing to four pathways: local generation for high confidence, retrieval-augmented generation for medium confidence, larger models for low confidence, and human review for very low confidence. Evaluation on knowledge-intensive QA benchmarks demonstrates significant improvements in hallucination detection (0.74 vs. 0.42 baseline) while reducing computational costs by 40% compared to post-hoc methods. The F1 score improves from 0.61 to 0.82 with low false positive rates (0.09). This paradigm shift from reactive correction to proactive assessment offers a computationally efficient approach to LLM reliability enhancement.
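The four-pathway routing described above amounts to thresholding a single confidence score. The sketch below shows that shape only: the abstract gives neither the thresholds nor the way the three signals are combined, so the cut-offs (0.8, 0.6, 0.4), the weighted average, and all names are assumptions.

```python
def unified_confidence(semantic_alignment, layer_convergence, learned_estimate,
                       weights=(1.0, 1.0, 1.0)):
    """Combine three signals in [0, 1] into one score (assumed: weighted mean)."""
    signals = (semantic_alignment, layer_convergence, learned_estimate)
    return sum(w * s for w, s in zip(weights, signals)) / sum(weights)


def route(confidence, high=0.8, medium=0.6, low=0.4):
    """Map a confidence score to one of the four pathways.

    The threshold values are hypothetical placeholders.
    """
    if confidence >= high:
        return "local_generation"
    if confidence >= medium:
        return "retrieval_augmented_generation"
    if confidence >= low:
        return "larger_model"
    return "human_review"


score = unified_confidence(0.5, 0.5, 0.5)
pathway = route(score)
```

Because the score is computed before any text is generated, low-confidence queries never reach the local model, which is what separates this design from post-hoc correction.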
[NLP-115] GRPO: Enhancing Dermatological Reasoning under Low Resource Settings
Link: https://arxiv.org/abs/2510.01236
Authors: Ismam Nur Swapnil, Aranya Saha, Tanvir Ahmed Khan, Mohammad Ariful Haque
Affiliations: Bangladesh University of Engineering and Technology (BUET)
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: To be submitted to IEEE JBHI
[NLP-116] Automated Extraction of Material Properties using LLM-based AI Agents
Link: https://arxiv.org/abs/2510.01235
Authors: Subham Ghosh, Abhishek Tewari
Affiliations: Mehta Family School of Data Science and Artificial Intelligence; Indian Institute of Technology Roorkee; Department of Metallurgical and Materials Engineering
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
[NLP-117] LLMRank: Understanding LLM Strengths for Model Routing
Link: https://arxiv.org/abs/2510.01234
Authors: Shubham Agrawal, Prasang Gupta
Affiliations: Zeno AI; Independent Researcher
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 1 figure
[NLP-118] Computational Social Linguistics for Telugu Cultural Preservation: Novel Algorithms for Chandassu Metrical Pattern Recognition
Link: https://arxiv.org/abs/2510.01233
Authors: Boddu Sri Pavan, Boddu Swathi Sree
Affiliations: RGUKT, Nuzvid (Rajiv Gandhi University of Knowledge Technologies, Nuzvid)
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 4 figures
[NLP-119] Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks EMNLP2025
Link: https://arxiv.org/abs/2510.01232
Authors: Dongjun Kim, Gyuho Shim, Yongchan Chun, Minhyuk Kim, Chanjun Park, Heuiseok Lim
Affiliations: Korea University; Soongsil University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 5 figures. Accepted to EMNLP 2025 main conference
[NLP-120] rustworthy Summarization via Uncertainty Quantification and Risk Awareness in Large Language Models
链接: https://arxiv.org/abs/2510.01231
作者: Shuaidong Pan,Di Wu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
[NLP-121] Geometric Structures and Patterns of Meaning: A PHATE Manifold Analysis of Chinese Character Embeddings
Link: https://arxiv.org/abs/2510.01230
Authors: Wen G. Gong
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Notes: 33 pages, 17 figures
[NLP-122] Enhancing Transformer-Based Rerankers with Synthetic Data and LLM-Based Supervision
Link: https://arxiv.org/abs/2510.01229
Authors: Dimitar Peshevski, Kiril Blazhevski, Martin Popovski, Gjorgji Madjarov
Affiliations: Ss. Cyril and Methodius University; G+D Netcetera
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Accepted by RANLP 2025
[NLP-123] Who is In Charge? Dissecting Role Conflicts in Instruction Following
Link: https://arxiv.org/abs/2510.01228
Authors: Siqi Zeng
Affiliations: University of Illinois, Urbana-Champaign
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes:
[NLP-124] EEFSUVA: A New Mathematical Olympiad Benchmark
Link: https://arxiv.org/abs/2510.01227
Authors: Nicole N Khatibi, Daniil A. Radamovich, Michael P. Brenner
Affiliations: Unknown
Categories: Computation and Language (cs.CL); History and Overview (math.HO)
Notes: 16 Pages, 5 figures
[NLP-125] ClaimCheck: Real-Time Fact-Checking with Small Language Models
Link: https://arxiv.org/abs/2510.01226
Authors: Akshith Reddy Putta, Jacob Devasier, Chengkai Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:
[NLP-126] Utilizing Modern Large Language Models (LLM) for Financial Trend Analysis and Digest Creation
Link: https://arxiv.org/abs/2510.01225
Authors: Andrei Lazarev, Dmitrii Sedov
Affiliations: Unknown
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: This is the version of the article accepted for publication in SUMMA 2024 after peer review. The final, published version is available at IEEE Xplore: https://doi.org/10.1109/SUMMA64428.2024.10803746
[NLP-127] Context Matters: Comparison of commercial large language tools in veterinary medicine
Link: https://arxiv.org/abs/2510.01224
Authors: Tyler J Poore, Christopher J Pinard, Aleena Shabbir, Andrew Lagree, Andre Telfer, Kuan-Chuen Wu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 4 Figures, 10 pages
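[NLP-128] Jailbreaking LLMs via Semantically Relevant Nested Scenarios with Targeted Toxic Knowledge
[Quick Read]: This paper addresses the weak defenses of large language models (LLMs) against jailbreak attacks, in particular the limitation that existing methods are easily detected because of their overt malicious intent. The key to the solution is RTS-Attack (Semantically Relevant Nested Scenarios with Targeted Toxic Knowledge), an adaptive, automated framework whose core idea is to use nested scenarios that are highly semantically relevant to the query and that incorporate targeted toxic knowledge, so that the generated jailbreak prompts contain no directly harmful queries and thereby bypass LLMs' alignment defenses. Experiments show that the method outperforms existing baselines in both efficiency and universality across several advanced LLMs, including GPT-4o, Llama3-70b, and Gemini-pro.
Link: https://arxiv.org/abs/2510.01223
Authors: Hui Dou, Ning Xu, Yiwen Zhang, Kaibin Wang
Affiliations: AnHui University; Swinburne University of Technology
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Notes: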
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks. However, they remain exposed to jailbreak attacks, eliciting harmful responses. The nested scenario strategy has been increasingly adopted across various methods, demonstrating immense potential. Nevertheless, these methods are easily detectable due to their prominent malicious intentions. In this work, we are the first to find and systematically verify that LLMs’ alignment defenses are not sensitive to nested scenarios, where these scenarios are highly semantically relevant to the queries and incorporate targeted toxic knowledge. This is a crucial yet insufficiently explored direction. Based on this, we propose RTS-Attack (Semantically Relevant Nested Scenarios with Targeted Toxic Knowledge), an adaptive and automated framework to examine LLMs’ alignment. By building scenarios highly relevant to the queries and integrating targeted toxic knowledge, RTS-Attack bypasses the alignment defenses of LLMs. Moreover, the jailbreak prompts generated by RTS-Attack are free from harmful queries, leading to outstanding concealment. Extensive experiments demonstrate that RTS-Attack exhibits superior performance in both efficiency and universality compared to the baselines across diverse advanced LLMs, including GPT-4o, Llama3-70b, and Gemini-pro. Our complete code is available in the supplementary material. WARNING: THIS PAPER CONTAINS POTENTIALLY HARMFUL CONTENT.
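[NLP-129] Discourse vs emissions: Analysis of corporate narratives symbolic practices and mimicry through LLMs
[Quick Read]: This paper addresses the lack of transparency and comparability in corporate climate disclosures caused by mimicry and symbolic reporting, which undermines their decision usefulness. The key to the solution is a multidimensional assessment framework built on fine-tuned large language models (LLMs): four classifiers (sentiment, commitment, specificity, and target ambition) extract narrative indicators from sustainability and annual reports, and these indicators are linked to firm attributes such as emissions, market capitalization, and sector. This quantifies disclosure maturity and exposes the gap between disclosed content and actual emission-reduction action.
Link: https://arxiv.org/abs/2510.01222
Authors: Bertrand Kian Hassani, Yacoub Bahini, Rizwan Mushtaq
Affiliations: QUANT AI Lab; University College London; Université Paris 1 Panthéon-Sorbonne; Institut Louis Bachelier; EDC Paris Business School
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Notes: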
Abstract:Climate change has increased demands for transparent and comparable corporate climate disclosures, yet imitation and symbolic reporting often undermine their value. This paper develops a multidimensional framework to assess disclosure maturity among 828 this http URL firms using large language models (LLMs) fine-tuned for climate communication. Four classifiers (sentiment, commitment, specificity, and target ambition) extract narrative indicators from sustainability and annual reports, which are linked to firm attributes such as emissions, market capitalization, and sector. Analyses reveal three insights: (1) risk-focused narratives often align with explicit commitments, but quantitative targets (e.g., net-zero pledges) remain decoupled from tone; (2) larger and higher-emitting firms disclose more commitments and actions than peers, though inconsistently with quantitative targets; and (3) widespread similarity in disclosure styles suggests mimetic behavior, reducing differentiation and decision usefulness. These results highlight the value of LLMs for ESG narrative analysis and the need for stronger regulation to connect commitments with verifiable transition strategies.
[NLP-130] Towards Open-Ended Discovery for Low-Resource NLP EMNLP2025
Link: https://arxiv.org/abs/2510.01220
Authors: Bonaventure F. P. Dossou, Henri Aïdasso
Affiliations: McGill University; Mila Quebec AI Institute; École de technologie supérieure (ÉTS)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Proceedings of the 2nd Workshop on Uncertainty-Aware NLP (UncertaiNLP) at EMNLP 2025
[NLP-131] Uncovering Implicit Bias in Large Language Models with Concept Learning Dataset
Link: https://arxiv.org/abs/2510.01219
Authors: Leroy Z. Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:
[NLP-132] Control the Temperature: Selective Sampling for Diverse and High-Quality LLM Outputs
Link: https://arxiv.org/abs/2510.01218
Authors: Sergey Troshin, Wafaa Mohammed, Yan Meng, Christof Monz, Antske Fokkens, Vlad Niculae
Affiliations: Language Technology Lab, University of Amsterdam; Computational Linguistics and Text Mining Lab, Vrije Universiteit Amsterdam
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: Second Conference on Language Modeling, 2025
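Computer Vision
[CV-0] Optimal Control Meets Flow Matching: A Principled Route to Multi-Subject Fidelity
[Quick Read]: This paper addresses key failures of text-to-image (T2I) generation models on multi-subject descriptions: attribute leakage, identity entanglement, and subject omissions. The core of the solution is the first theoretical framework that casts flow matching (FM) as a stochastic optimal control (SOC) problem, treating subject disentanglement as control over a trained FM sampler. The framework yields two architecture-agnostic algorithms: (i) a training-free test-time controller that perturbs the base velocity field with a single-pass update; and (ii) Adjoint Matching, a lightweight fine-tuning rule that regresses a control network to a backward adjoint signal, improving multi-subject fidelity while preserving the base model's capabilities. The unified framework also explains earlier attention heuristics, extends to diffusion models through a flow-diffusion correspondence, and provides the first fine-tuning route designed specifically for multi-subject fidelity.
Link: https://arxiv.org/abs/2510.02315
Authors: Eric Tillmann Bill, Enis Simsar, Thomas Hofmann
Affiliations: ETH Zurich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Code: this https URL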
Abstract:Text-to-image (T2I) models excel on single-entity prompts but struggle with multi-subject descriptions, often showing attribute leakage, identity entanglement, and subject omissions. We introduce the first theoretical framework with a principled, optimizable objective for steering sampling dynamics toward multi-subject fidelity. Viewing flow matching (FM) through stochastic optimal control (SOC), we formulate subject disentanglement as control over a trained FM sampler. This yields two architecture-agnostic algorithms: (i) a training-free test-time controller that perturbs the base velocity with a single-pass update, and (ii) Adjoint Matching, a lightweight fine-tuning rule that regresses a control network to a backward adjoint signal while preserving base-model capabilities. The same formulation unifies prior attention heuristics, extends to diffusion models via a flow-diffusion correspondence, and provides the first fine-tuning route explicitly designed for multi-subject fidelity. Empirically, on Stable Diffusion 3.5, FLUX, and Stable Diffusion XL, both algorithms consistently improve multi-subject alignment while maintaining base-model style. Test-time control runs efficiently on commodity GPUs, and fine-tuned controllers trained on limited prompts generalize to unseen ones. We further highlight FOCUS (Flow Optimal Control for Unentangled Subjects), which achieves state-of-the-art multi-subject fidelity across models.
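[CV-1] StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions ICCV2025 ALT
[Quick Read]: This paper studies the robustness of 3D Gaussian Splatting (3DGS) to image-level poisoning attacks, that is, how carefully crafted malicious data can degrade 3D scene reconstruction and induce incorrect views. The key to the solution is a density-guided poisoning method: Kernel Density Estimation (KDE) first identifies low-density regions, and Gaussian points are then strategically injected into those regions to embed illusory objects visible only from specific viewpoints; an adaptive noise strategy additionally disrupts multi-view consistency to strengthen the attack. The paper also proposes a KDE-based evaluation protocol that measures attack difficulty systematically, giving later work an objective benchmark.
Link: https://arxiv.org/abs/2510.02314
Authors: Bo-Hsu Ke, You-Zhe Xie, Yu-Lun Liu, Wei-Chen Chiu
Affiliations: National Yang Ming Chiao Tung University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: ICCV 2025. Project page: this https URL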
Abstract:3D scene representation methods like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have significantly advanced novel view synthesis. As these methods become prevalent, addressing their vulnerabilities becomes critical. We analyze 3DGS robustness against image-level poisoning attacks and propose a novel density-guided poisoning method. Our method strategically injects Gaussian points into low-density regions identified via Kernel Density Estimation (KDE), embedding viewpoint-dependent illusory objects clearly visible from poisoned views while minimally affecting innocent views. Additionally, we introduce an adaptive noise strategy to disrupt multi-view consistency, further enhancing attack effectiveness. We propose a KDE-based evaluation protocol to assess attack difficulty systematically, enabling objective benchmarking for future research. Extensive experiments demonstrate our method’s superior performance compared to state-of-the-art techniques. Project page: this https URL
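The low-density identification step mentioned in this abstract can be illustrated with a plain Gaussian KDE. This is only a sketch of generic kernel density estimation, not the authors' pipeline: the function names, the bandwidth, and the idea of ranking a fixed list of candidate locations are all assumptions made for illustration.

```python
import math


def gaussian_kde_density(query, points, bandwidth):
    """Isotropic Gaussian KDE estimate at `query` from a list of same-dimension points."""
    dim = len(query)
    norm = len(points) * (2.0 * math.pi * bandwidth ** 2) ** (dim / 2.0)
    total = 0.0
    for p in points:
        sq_dist = sum((q - x) ** 2 for q, x in zip(query, p))
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / norm


def lowest_density_candidates(candidates, points, bandwidth=0.5, k=1):
    """Return the k candidate locations with the smallest estimated density."""
    ranked = sorted(candidates, key=lambda c: gaussian_kde_density(c, points, bandwidth))
    return ranked[:k]
```

The bandwidth controls how far each existing point spreads its influence, so the ranking of candidates can change with it; a real system would likely need a data-driven bandwidth choice.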
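[CV-2] Clink! Chop! Thud! – Learning Object Sounds from Real-World Interactions ICCV2025
[Quick Read]: This paper addresses how a model can identify, from sound, the object interaction that produced it, i.e., the sounding object detection task, whose central challenge is linking acoustic signals to the specific objects involved. The key to the solution is an object-centric multimodal framework: an automatic segmentation-mask pipeline guides the model toward the most relevant visual regions of the interaction, and a slot attention visual encoder reinforces an object prior, enabling accurate modeling of object-sound correspondence.
Link: https://arxiv.org/abs/2510.02313
Authors: Mengyu Yang, Yiming Chen, Haozheng Pei, Siddhant Agarwal, Arun Balajee Vasudevan, James Hays
Affiliations: Georgia Institute of Technology; Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: ICCV 2025. Project page: this https URL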
Abstract:Can a model distinguish between the sound of a spoon hitting a hardwood floor versus a carpeted one? Everyday object interactions produce sounds unique to the objects involved. We introduce the sounding object detection task to evaluate a model’s ability to link these sounds to the objects directly involved. Inspired by human perception, our multimodal object-aware framework learns from in-the-wild egocentric videos. To encourage an object-centric approach, we first develop an automatic pipeline to compute segmentation masks of the objects involved to guide the model’s focus during training towards the most informative regions of the interaction. A slot attention visual encoder is used to further enforce an object prior. We demonstrate state of the art performance on our new task along with existing multimodal action understanding tasks.
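[CV-3] Inferring Dynamic Physical Properties from Video Foundation Models
[Quick Read]: This paper addresses the prediction of dynamic physical properties from video, specifically elasticity, viscosity, and dynamic friction, which can be inferred only from temporal information. It explores three approaches: (i) an oracle method that uses classical computer vision techniques to extract visual cues that intrinsically reflect the property; (ii) a simple read-out mechanism for pre-trained video generative or self-supervised models, using a visual prompt and a trainable prompt vector for cross-attention; and (iii) prompting strategies for multi-modal large language models (MLLMs). Experiments show that generative and self-supervised video foundation models perform similarly to each other, though behind the oracle, while MLLMs currently lag but improve with suitable prompting.
Link: https://arxiv.org/abs/2510.02311
Authors: Guanqi Zhan, Xianzheng Ma, Weidi Xie, Andrew Zisserman
Affiliations: VGG, University of Oxford; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: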
Abstract:We study the task of predicting dynamic physical properties from videos. More specifically, we consider physical properties that require temporal information to be inferred: elasticity of a bouncing object, viscosity of a flowing liquid, and dynamic friction of an object sliding on a surface. To this end, we make the following contributions: (i) We collect a new video dataset for each physical property, consisting of synthetic training and testing splits, as well as a real split for real world evaluation. (ii) We explore three ways to infer the physical property from videos: (a) an oracle method where we supply the visual cues that intrinsically reflect the property using classical computer vision techniques; (b) a simple read out mechanism using a visual prompt and trainable prompt vector for cross-attention on pre-trained video generative and self-supervised models; and (c) prompt strategies for Multi-modal Large Language Models (MLLMs). (iii) We show that video foundation models trained in a generative or self-supervised manner achieve a similar performance, though behind that of the oracle, and MLLMs are currently inferior to the other models, though their performance can be improved through suitable prompting.
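[CV-4] NoiseShift: Resolution-Aware Noise Recalibration for Better Low-Resolution Image Generation
[Quick Read]: This paper addresses the poor generalization of text-to-image diffusion models trained at fixed resolutions, in particular the marked quality drop when generating images below the training resolution. Existing high-resolution generators offer no budget-efficient alternative for users who do not need high-resolution output. The key insight is that noise schedulers have unequal perceptual effects across resolutions: the same noise level destroys far more signal in a low-resolution image than in a high-resolution one, creating a train-test mismatch. The authors propose NoiseShift, a training-free calibration strategy that adjusts the noise level seen by the denoiser according to image resolution; it is compatible with existing model architectures and sampling pipelines and significantly improves low-resolution generation quality on several mainstream models.
Link: https://arxiv.org/abs/2510.02307
Authors: Ruozhen He, Moayed Haji-Ali, Ziyan Yang, Vicente Ordonez
Affiliations: Rice University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: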
Abstract:Text-to-image diffusion models trained on a fixed set of resolutions often fail to generalize, even when asked to generate images at lower resolutions than those seen during training. High-resolution text-to-image generators are currently unable to easily offer an out-of-the-box budget-efficient alternative to their users who might not need high-resolution images. We identify a key technical insight in diffusion models that when addressed can help tackle this limitation: Noise schedulers have unequal perceptual effects across resolutions. The same level of noise removes disproportionately more signal from lower-resolution images than from high-resolution images, leading to a train-test mismatch. We propose NoiseShift, a training-free method that recalibrates the noise level of the denoiser conditioned on resolution size. NoiseShift requires no changes to model architecture or sampling schedule and is compatible with existing models. When applied to Stable Diffusion 3, Stable Diffusion 3.5, and Flux-Dev, quality at low resolutions is significantly improved. On LAION-COCO, NoiseShift improves SD3.5 by 15.89%, SD3 by 8.56%, and Flux-Dev by 2.44% in FID on average. On CelebA, NoiseShift improves SD3.5 by 10.36%, SD3 by 5.19%, and Flux-Dev by 3.02% in FID on average. These results demonstrate the effectiveness of NoiseShift in mitigating resolution-dependent artifacts and enhancing the quality of low-resolution image generation.
[CV-5] Equilibrium Matching: Generative Modeling with Implicit Energy-Based Models
Link: https://arxiv.org/abs/2510.02300
Authors: Runqian Wang, Yilun Du
Affiliations: MIT; Harvard University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes:
[CV-6] Continual Personalization for Diffusion Models
[Quick Read]: This paper addresses continual personalization of diffusion models in an incremental learning setting: incorporating new concepts efficiently without forgetting earlier knowledge, while retaining zero-shot text-to-image generation ability. The key to the solution is a learning strategy called Concept Neuron Selection (CNS): CNS identifies the neurons closely related to the target concepts, fine-tunes only those neurons during incremental training, and jointly preserves previously learned knowledge, which mitigates catastrophic forgetting. The method is fusion-free (it does not merge the parameters of models for different concepts), which reduces memory and computation, and it reaches state-of-the-art performance on both single- and multi-concept personalization.
Link: https://arxiv.org/abs/2510.02296
Authors: Yu-Chien Liao, Jr-Jen Chen, Chi-Pin Huang, Ci-Siang Lin, Meng-Lin Wu, Yu-Chiang Frank Wang
Affiliations: National Taiwan University; Qualcomm Technologies, Inc.
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Notes:
Abstract:Updating diffusion models in an incremental setting would be practical in real-world applications yet computationally challenging. We present a novel learning strategy of Concept Neuron Selection (CNS), a simple yet effective approach to perform personalization in a continual learning scheme. CNS uniquely identifies neurons in diffusion models that are closely related to the target concepts. In order to mitigate catastrophic forgetting problems while preserving zero-shot text-to-image generation ability, CNS finetunes concept neurons in an incremental manner and jointly preserves knowledge learned of previous concepts. Evaluation of real-world datasets demonstrates that CNS achieves state-of-the-art performance with minimal parameter adjustments, outperforming previous methods in both single and multi-concept personalization works. CNS also achieves fusion-free operation, reducing memory storage and processing time for continual personalization.
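[CV-7] VideoNSA: Native Sparse Attention Scales Video Understanding
[Quick Read]: This paper addresses the context-length limit on video understanding in multimodal language models: on long videos, models tend to miss key transition frames and struggle to stay coherent across long time scales. The key to the solution is VideoNSA, a hardware-aware hybrid attention scheme that keeps dense attention for text and uses Native Sparse Attention (NSA) for video, allowing long video content to be modeled effectively and computed efficiently. Trained end to end on a 216K video instruction dataset, the method improves long-video understanding, temporal reasoning, and spatial tasks, and the ablations reveal an optimal global-local attention allocation at a fixed budget and task-dependent branch usage patterns.
Link: https://arxiv.org/abs/2510.02295
Authors: Enxin Song, Wenhao Chai, Shusheng Yang, Ethan Armand, Xiaojun Shan, Haiyang Xu, Jianwen Xie, Zhuowen Tu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Project Page: this https URL , Code: this https URL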
Abstract:Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We employ a hardware-aware hybrid approach to attention, preserving dense attention for text, while employing NSA for video. Compared to token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global-local attention allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4) the learnable combined sparse attention helps induce dynamic attention sinks.
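[CV-8] Test-Time Anchoring for Discrete Diffusion Posterior Sampling
[Quick Read]: This paper addresses posterior sampling with pretrained discrete diffusion foundation models, aiming to recover images from noisy measurements without retraining task-specific models. Existing discrete diffusion posterior samplers face three challenges: derivative-free guidance yields sparse signals, continuous relaxations limit applicability, and split Gibbs samplers suffer from the curse of dimensionality. The authors propose Anchored Posterior Sampling (APS), built on two innovations: quantized expectation, which provides gradient-like guidance in the discrete embedding space, and anchored remasking, which enables adaptive decoding. The method achieves state-of-the-art performance among discrete diffusion samplers on linear and nonlinear inverse problems on standard benchmarks, and shows benefits in training-free stylization and text-guided editing.
Link: https://arxiv.org/abs/2510.02291
Authors: Litu Rout, Andreas Lugmayr, Yasamin Jafarian, Srivatsan Varadharajan, Constantine Caramanis, Sanjay Shakkottai, Ira Kemelmacher-Shlizerman
Affiliations: Google; UT Austin
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Notes: Preprint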
Abstract:We study the problem of posterior sampling using pretrained discrete diffusion foundation models, aiming to recover images from noisy measurements without retraining task-specific models. While diffusion models have achieved remarkable success in generative modeling, most advances rely on continuous Gaussian diffusion. In contrast, discrete diffusion offers a unified framework for jointly modeling categorical data such as text and images. Beyond unification, discrete diffusion provides faster inference, finer control, and principled training-free Bayesian inference, making it particularly well-suited for posterior sampling. However, existing approaches to discrete diffusion posterior sampling face severe challenges: derivative-free guidance yields sparse signals, continuous relaxations limit applicability, and split Gibbs samplers suffer from the curse of dimensionality. To overcome these limitations, we introduce Anchored Posterior Sampling (APS) for masked diffusion foundation models, built on two key innovations – quantized expectation for gradient-like guidance in discrete embedding space, and anchored remasking for adaptive decoding. Our approach achieves state-of-the-art performance among discrete diffusion samplers across linear and nonlinear inverse problems on the standard benchmarks. We further demonstrate the benefits of our approach in training-free stylization and text-guided editing.
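[CV-9] MultiModal Action Conditioned Video Generation
[Quick Read]: This paper addresses the lack of fine-grained control when current video models are used as world models, in particular the need of general-purpose household robots for real-time fine motor control in delicate tasks and urgent situations. The key to the solution is fine-grained multimodal actions that combine proprioception, kinesthesia, force haptics, and muscle activation to capture precise interactions that are difficult to simulate with text-conditioned generative models. The authors further propose a feature learning paradigm that aligns the modalities while preserving the information unique to each, and a regularization scheme that strengthens the causality of action trajectory features so they better represent intricate interaction dynamics. Experiments show that the method improves simulation accuracy and reduces temporal drift.
Link: https://arxiv.org/abs/2510.02287
Authors: Yichen Li, Antonio Torralba
Affiliations: MIT CSAIL
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: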
Abstract:Current video models fail as world models because they lack fine-grained control. General-purpose household robots require real-time fine motor control to handle delicate tasks and urgent situations. In this work, we introduce fine-grained multimodal actions to capture such precise control. We consider senses of proprioception, kinesthesia, force haptics, and muscle activation. Such multimodal senses naturally enable fine-grained interactions that are difficult to simulate with text-conditioned generative models. To effectively simulate fine-grained multisensory actions, we develop a feature learning paradigm that aligns these modalities while preserving the unique information each modality provides. We further propose a regularization scheme to enhance causality of the action trajectory features in representing intricate interaction dynamics. Experiments show that incorporating multimodal senses improves simulation accuracy and reduces temporal drift. Extensive ablation studies and downstream applications demonstrate the effectiveness and practicality of our work.
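[CV-10] Learning to Generate Object Interactions with Physics-Guided Video Diffusion
[Quick Read]: This paper addresses the limited physical plausibility of current video generation models, in particular object interactions that lack physical grounding and the absence of precise dynamics control. The core of the solution is KineMask, which uses a two-stage training strategy based on object masks to gradually remove future motion supervision, so that video diffusion models (VDMs) learn simple interactions in synthetic scenes and transfer to real scenes, producing physically plausible rigid-body motion and interactions. The key innovation is combining low-level motion control (such as a specified object velocity) with high-level textual conditioning (via predictive scene descriptions), which supports synthesis of complex dynamical phenomena and improves the realism and controllability of generated video.
Link: https://arxiv.org/abs/2510.02284
Authors: David Romero, Ariana Bermudez, Hao Li, Fabio Pizzati, Ivan Laptev
Affiliations: MBZUAI; Pinscreen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: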
Abstract:Recent models for video generation have achieved remarkable progress and are now deployed in film, social media production, and advertising. Beyond their creative potential, such models also hold promise as world simulators for robotics and embodied decision making. Despite strong advances, however, current approaches still struggle to generate physically plausible object interactions and lack physics-grounded control mechanisms. To address this limitation, we introduce KineMask, an approach for physics-guided video generation that enables realistic rigid body control, interactions, and effects. Given a single image and a specified object velocity, our method generates videos with inferred motions and future object interactions. We propose a two-stage training strategy that gradually removes future motion supervision via object masks. Using this strategy we train video diffusion models (VDMs) on synthetic scenes of simple interactions and demonstrate significant improvements of object interactions in real scenes. Furthermore, KineMask integrates low-level motion control with high-level textual conditioning via predictive scene descriptions, leading to effective support for synthesis of complex dynamical phenomena. Extensive experiments show that KineMask achieves strong improvements over recent models of comparable size. Ablation studies further highlight the complementary roles of low- and high-level conditioning in VDMs. Our code, model, and data will be made publicly available.
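[CV-11] Self-Forcing: Towards Minute-Scale High-Quality Video Generation
[Quick Read]: This paper addresses two problems in long video generation with diffusion models: the high computational cost of transformer architectures, and the quality degradation from error accumulation when a student model generates beyond its training horizon. Existing methods typically train the student by distilling from short-horizon bidirectional teachers, but because the teacher cannot generate long videos, student quality drops sharply beyond the training range. The key to the solution is to exploit the teacher's knowledge, without long-video supervision or retraining on long video datasets, by sampling segments from self-generated long videos and providing guidance on them, which mitigates error accumulation and maintains temporal consistency. The method scales video length by up to 20x beyond the teacher's capability and, with more compute, generates videos up to 4 minutes and 15 seconds long (close to the maximum span supported by the base model's position embedding), without recomputing overlapping frames, and it substantially improves fidelity and consistency.
Link: https://arxiv.org/abs/2510.02283
Authors: Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, Cho-Jui Hsieh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: preprint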
Abstract:Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. However, their reliance on transformer architectures incurs prohibitively high computational costs, particularly when extending generation to long videos. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers. Nevertheless, given that teacher models cannot synthesize long videos, the extrapolation of student models beyond their training horizon often leads to pronounced quality degradation, arising from the compounding of errors within the continuous latent space. In this paper, we propose a simple yet effective approach to mitigate quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long video datasets. Our approach centers on exploiting the rich knowledge of teacher models to provide guidance for the student model through sampled segments drawn from self-generated long videos. Our method maintains temporal consistency while scaling video length by up to 20x beyond teacher’s capability, avoiding common issues such as over-exposure and error-accumulation without recomputing overlapping frames like previous methods. When scaling up the computation, our method shows the capability of generating videos up to 4 minutes and 15 seconds, equivalent to 99.9% of the maximum span supported by our base model’s position embedding and more than 50x longer than that of our baseline model. Experiments on standard benchmarks and our proposed improved benchmark demonstrate that our approach substantially outperforms baseline methods in both fidelity and consistency. Our long-horizon videos demo can be found at this https URL
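[CV-12] VidGuard-R1: AI-Generated Video Detection and Explanation via Reasoning MLLMs and RL
[Quick Read]: This paper addresses the societal risks posed by AI-generated videos, such as misinformation and reputational harm; the core challenge is building detectors that are both accurate and interpretable. The key to the solution is VidGuard-R1, the first video authenticity detector that fine-tunes a multi-modal large language model (MLLM) with group relative policy optimization (GRPO). With a challenging dataset of 140k real and AI-generated videos and two specialized reward models targeting temporal artifacts and generation complexity, the model achieves state-of-the-art zero-shot performance on existing benchmarks, exceeds 95% accuracy with additional training, and produces precise, interpretable reasoning that meets the transparency needs of regulators and end users.
Link: https://arxiv.org/abs/2510.02282
Authors: Kyoungjun Park, Yifan Yang, Juheon Yi, Shicheng Zheng, Yifei Shen, Dongqi Han, Caihua Shan, Muhammad Muaz, Lili Qiu
Affiliations: University of Texas at Austin; Microsoft Research Asia
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: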
Abstract:With the rapid advancement of AI-generated videos, there is an urgent need for effective detection tools to mitigate societal risks such as misinformation and reputational harm. In addition to accurate classification, it is essential that detection models provide interpretable explanations to ensure transparency for regulators and end users. To address these challenges, we introduce VidGuard-R1, the first video authenticity detector that fine-tunes a multi-modal large language model (MLLM) using group relative policy optimization (GRPO). Our model delivers both highly accurate judgments and insightful reasoning. We curate a challenging dataset of 140k real and AI-generated videos produced by state-of-the-art generation models, carefully designing the generation process to maximize discrimination difficulty. We then fine-tune Qwen-VL using GRPO with two specialized reward models that target temporal artifacts and generation complexity. Extensive experiments demonstrate that VidGuard-R1 achieves state-of-the-art zero-shot performance on existing benchmarks, with additional training pushing accuracy above 95%. Case studies further show that VidGuard-R1 produces precise and interpretable rationales behind its predictions. The code is publicly available at this https URL.
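The "group relative" part of GRPO can be illustrated with the advantage computation commonly associated with it: each sampled response's reward is normalized against the other responses for the same prompt. This is a minimal sketch under that common formulation, not the VidGuard-R1 training code; the function name, the `eps` term, and the use of the population standard deviation are assumptions, and how the paper combines its two reward models is not shown here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt.

    Assumed form: A_i = (r_i - mean(r)) / (std(r) + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because advantages are computed relative to the group, a response is reinforced only when it scores better than its siblings, and a group with identical rewards contributes no learning signal.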
zh
[CV-13] microCLIP: Unsupervised CLIP Adaptation via Coarse-Fine Token Fusion for Fine-Grained Image Classification
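[Quick read]: This paper addresses the limited performance of CLIP (Contrastive Language-Image Pretraining) based vision-language models (VLMs) on fine-grained image classification, which stems from their reliance on coarse global features. Existing methods inject fine-grained knowledge by aligning large language model (LLM) descriptions with CLIP's [CLS] token but overlook spatial precision. The key to the solution is the microCLIP framework: Saliency-Oriented Attention Pooling (SOAP), inside a lightweight TokenFusion module, builds an [FG] token focused on locally salient regions from the image patch embeddings and fuses it with the global [CLS] token for coarse-fine alignment. A two-headed LLM-derived classifier and a Dynamic Knowledge Aggregation strategy stabilize pseudo-label generation and iteratively refine the fine-grained representation, yielding an average accuracy gain of 2.90% across 13 fine-grained benchmarks with only light adaptation.
Link: https://arxiv.org/abs/2510.02270
Authors: Sathira Silva, Eman Ali, Chetan Arora, Muhammad Haris Khan
Affiliations: Mohamed Bin Zayed University of Artificial Intelligence; Alexandria University; IIT Delhi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: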
Abstract:Unsupervised adaptation of CLIP-based vision-language models (VLMs) for fine-grained image classification requires sensitivity to microscopic local cues. While CLIP exhibits strong zero-shot transfer, its reliance on coarse global features restricts its performance on fine-grained classification tasks. Prior efforts inject fine-grained knowledge by aligning large language model (LLM) descriptions with the CLIP [CLS] token; however, this approach overlooks spatial precision. We propose microCLIP, a self-training framework that jointly refines CLIP’s visual and textual representations using fine-grained cues. At its core is Saliency-Oriented Attention Pooling (SOAP) within a lightweight TokenFusion module, which builds a saliency-guided [FG] token from patch embeddings and fuses it with the global [CLS] token for coarse-fine alignment. To stabilize adaptation, we introduce a two-headed LLM-derived classifier: a frozen classifier that, via multi-view alignment, provides a stable text-based prior for pseudo-labeling, and a learnable classifier initialized from LLM descriptions and fine-tuned with TokenFusion. We further develop Dynamic Knowledge Aggregation, which convexly combines fixed LLM/CLIP priors with TokenFusion’s evolving logits to iteratively refine pseudo-labels. Together, these components uncover latent fine-grained signals in CLIP, yielding a consistent 2.90% average accuracy gain across 13 fine-grained benchmarks while requiring only light adaptation. Our code is available at this https URL.
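The abstract describes Dynamic Knowledge Aggregation as a convex combination of fixed prior logits with the evolving TokenFusion logits. A minimal sketch of such a combination follows. The function names, the single mixing weight `alpha`, and the example logits are all hypothetical, and how the paper sets or schedules the weight is not stated in the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_logits(prior_logits, adapted_logits, alpha):
    """Convex combination: alpha weights the fixed prior, 1 - alpha the adapted head."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * p + (1.0 - alpha) * a
            for p, a in zip(prior_logits, adapted_logits)]


def pseudo_label(prior_logits, adapted_logits, alpha):
    """Return (class index, confidence) of the aggregated prediction."""
    probs = softmax(aggregate_logits(prior_logits, adapted_logits, alpha))
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


# Hypothetical logits for a three-class problem.
label, confidence = pseudo_label([2.0, 0.0, 0.0], [0.0, 3.0, 0.0], alpha=0.25)
print(label, round(confidence, 3))
```

With `alpha` close to 1 the pseudo-labels follow the frozen prior; lowering it lets the adapted head take over as training progresses.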
[CV-14] Do You Know Where Your Camera Is? View-Invariant Policy Learning with Camera Conditioning
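[Quick read]: This paper addresses generalization in view-invariant imitation learning, where policy performance degrades under different camera viewpoints. The key to the solution is to condition the policy explicitly on camera extrinsics, encoding viewpoint information with Plücker embeddings of per-pixel rays to make the policy more robust to viewpoint changes. Experiments show that this conditioning significantly improves the cross-view generalization of standard behavior cloning policies (such as ACT, Diffusion Policy, and SmolVLA) on six manipulation tasks in RoboSuite and ManiSkill. Performance stays stable even when background cues are removed, giving robust control from RGB images alone.
Link: https://arxiv.org/abs/2510.02268
Authors: Tianchong Jiang, Jingtian Ji, Xiangshan Tan, Jiading Fang, Anand Bhattad, Vitor Guizilini, Matthew R. Walter
Affiliations: Toyota Technological Institute at Chicago; Waymo; Johns Hopkins University; Toyota Research Institute
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Code and project materials are available at this http URL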
Abstract:We study view-invariant imitation learning by explicitly conditioning policies on camera extrinsics. Using Plucker embeddings of per-pixel rays, we show that conditioning on extrinsics significantly improves generalization across viewpoints for standard behavior cloning policies, including ACT, Diffusion Policy, and SmolVLA. To evaluate policy robustness under realistic viewpoint shifts, we introduce six manipulation tasks in RoboSuite and ManiSkill that pair “fixed” and “randomized” scene variants, decoupling background cues from camera pose. Our analysis reveals that policies without extrinsics often infer camera pose using visual cues from static backgrounds in fixed scenes; this shortcut collapses when workspace geometry or camera placement shifts. Conditioning on extrinsics restores performance and yields robust RGB-only control without depth. We release the tasks, demonstrations, and code at this https URL .
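To make the Plücker embedding of a pixel ray concrete, here is a small sketch that assumes a pinhole camera with intrinsics `fx, fy, cx, cy`, a camera-to-world rotation `R_c2w`, and a camera center in world coordinates. It returns the 6-vector (direction, moment) with moment = center × direction. The argument names and the (direction, moment) ordering are assumptions; the paper may use a different convention.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def plucker_ray(u, v, fx, fy, cx, cy, R_c2w, cam_center):
    """Plücker coordinates (d, m) of the ray through pixel (u, v)."""
    # Back-project the pixel to a direction in the camera frame.
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate into the world frame and normalize.
    d = tuple(sum(R_c2w[i][j] * d_cam[j] for j in range(3)) for i in range(3))
    norm = math.sqrt(sum(x * x for x in d))
    d = tuple(x / norm for x in d)
    # The moment encodes where the ray sits in space, not just its direction.
    m = cross(cam_center, d)
    return d + m


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(plucker_ray(320.0, 240.0, 500.0, 500.0, 320.0, 240.0, identity, (1.0, 0.0, 0.0)))
```

Evaluating this for every pixel gives a six-channel map that can be concatenated with the RGB input, which is one plausible way to feed extrinsics to a policy.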
[CV-15] NeuroSwift: A Lightweight Cross-Subject Framework for fMRI Visual Reconstruction of Complex Scenes
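[Quick read]: This paper addresses cross-subject reconstruction of visual stimuli, that is, how to reconstruct visual information accurately despite inter-subject variability in neural representations and the brain's abstract encoding of core semantic features in complex visual inputs. The key to the solution is the NeuroSwift framework, which integrates complementary adapters via diffusion: AutoKL for low-level feature reconstruction, and a CLIP adapter trained on pairs of Stable Diffusion generated images and COCO captions to emulate the semantic encoding of higher visual cortex. With a pretrain-then-fine-tune strategy that updates only 17% of the parameters (the fully connected layers), each new subject needs only one hour of training on lightweight GPUs (three RTX 4090), and the method outperforms existing approaches in cross-subject generalization.
Link: https://arxiv.org/abs/2510.02266
Authors: Shiyi Zhang, Dong Liang, Yihang Zhou
Affiliations: Southern University of Science and Technology; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: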
Abstract:Reconstructing visual information from brain activity via computer vision technology provides an intuitive understanding of visual neural mechanisms. Despite progress in decoding fMRI data with generative models, achieving accurate cross-subject reconstruction of visual stimuli remains challenging and computationally demanding. This difficulty arises from inter-subject variability in neural representations and the brain’s abstract encoding of core semantic features in complex visual inputs. To address these challenges, we propose NeuroSwift, which integrates complementary adapters via diffusion: AutoKL for low-level features and CLIP for semantics. NeuroSwift’s CLIP Adapter is trained on Stable Diffusion generated images paired with COCO captions to emulate higher visual cortex encoding. For cross-subject generalization, we pretrain on one subject and then fine-tune only 17 percent of parameters (fully connected layers) for new subjects, while freezing other components. This enables state-of-the-art performance with only one hour of training per subject on lightweight GPUs (three RTX 4090), and it outperforms existing methods.
[CV-16] Paving the Way Towards Kinematic Assessment Using Monocular Video: A Preclinical Benchmark of State-of-the-Art Deep-Learning-Based 3D Human Pose Estimators Against Inertial Sensors in Daily Living Activities
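[Quick read]: This paper addresses accurate assessment of human movement outside the laboratory, in support of telemedicine, sports science, and rehabilitation. The key to the solution is a benchmark that compares monocular-video 3D human pose estimation models (MotionAGFormer, MotionBERT, MMPose 2D-to-3D pose lifting, and NVIDIA BodyTrack) with inertial measurement units (IMUs) under real-world conditions, using 13 clinically relevant daily activities from the VIDIMU dataset. Both technologies prove viable, and MotionAGFormer gives the best joint-angle estimates (RMSE 9.27° ± 4.80°, MAE 7.86° ± 4.18°, Pearson correlation 0.86 ± 0.15, coefficient of determination R² 0.67 ± 0.28). The results show where video methods already have clinical promise in healthy adults and clarify the trade-offs between video- and sensor-based approaches in cost, accessibility, and precision, informing the development of efficient, economical, and user-friendly remote monitoring systems.
Link: https://arxiv.org/abs/2510.02264
Authors: Mario Medrano-Paredes, Carmen Fernández-González, Francisco-Javier Díaz-Pernas, Hichem Saoudi, Javier González-Alonso, Mario Martínez-Zarzuela
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: All tables, graphs and figures generated can be obtained in the Zenodo repository complementary to this work: this https URL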
Abstract:Advances in machine learning and wearable sensors offer new opportunities for capturing and analyzing human movement outside specialized laboratories. Accurate assessment of human movement under real-world conditions is essential for telemedicine, sports science, and rehabilitation. This preclinical benchmark compares monocular video-based 3D human pose estimation models with inertial measurement units (IMUs), leveraging the VIDIMU dataset containing a total of 13 clinically relevant daily activities which were captured using both commodity video cameras and five IMUs. During this initial study only healthy subjects were recorded, so results cannot be generalized to pathological cohorts. Joint angles derived from state-of-the-art deep learning frameworks (MotionAGFormer, MotionBERT, MMPose 2D-to-3D pose lifting, and NVIDIA BodyTrack) were evaluated against joint angles computed from IMU data using OpenSim inverse kinematics following the Human3.6M dataset format with 17 keypoints. Among them, MotionAGFormer demonstrated superior performance, achieving the lowest overall RMSE (9.27° ± 4.80°) and MAE (7.86° ± 4.18°), as well as the highest Pearson correlation (0.86 ± 0.15) and the highest coefficient of determination R² (0.67 ± 0.28). The results reveal that both technologies are viable for out-of-the-lab kinematic assessment. However, they also highlight key trade-offs between video- and sensor-based approaches including costs, accessibility, and precision. This study clarifies where off-the-shelf video models already provide clinically promising kinematics in healthy adults and where they lag behind IMU-based estimates while establishing valuable guidelines for researchers and clinicians seeking to develop robust, cost-effective, and user-friendly solutions for telehealth and remote patient monitoring.
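The four agreement metrics reported above (RMSE, MAE, Pearson correlation, R²) can be computed from two joint-angle series as in the sketch below. It treats the IMU-derived angles as the reference and uses R² = 1 - SS_res / SS_tot; both choices are assumptions, since the abstract does not give the exact definitions, and the sample angles are made up.

```python
import math


def joint_angle_metrics(reference, estimate):
    """Agreement metrics between two equally long joint-angle series (degrees)."""
    n = len(reference)
    errors = [e - r for r, e in zip(reference, estimate)]
    ss_res = sum(x * x for x in errors)
    mean_r = sum(reference) / n
    mean_e = sum(estimate) / n
    cov = sum((r - mean_r) * (e - mean_e) for r, e in zip(reference, estimate))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_e = sum((e - mean_e) ** 2 for e in estimate)
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(x) for x in errors) / n,
        "pearson": cov / math.sqrt(var_r * var_e),
        "r2": 1.0 - ss_res / var_r,
    }


# Made-up knee-flexion angles: IMU reference vs. video estimate with a 2° offset.
print(joint_angle_metrics([10.0, 20.0, 30.0, 40.0], [12.0, 22.0, 32.0, 42.0]))
```

The example shows why the metrics are reported together: a constant offset leaves the Pearson correlation at 1 while RMSE, MAE, and R² still register the error.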
[CV-17] From Frames to Clips: Efficient Key Clip Selection for Long-Form Video Understanding
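[Quick read]: This paper addresses the "needle in a haystack" problem that Video Large Language Models (Video LLMs) face on long videos: the huge number of visual tokens produced from raw frames exceeds the model's context window and limits practical use. Existing methods cut the token count by sparsely sampling key frames, but frame-level selection discards important temporal dynamics and hurts reasoning about motion and event continuity. The key to the solution is to extend the sampling granularity from isolated key frames to key clips, short and temporally coherent video segments that preserve richer temporal structure, together with an adaptive resolution strategy that balances spatial resolution against clip length under a fixed compute budget so that the total token count per video stays constant. Experiments show that the training-free method, F2C, clearly outperforms uniform sampling on several long-video benchmarks, confirming the value of temporal coherence for video understanding.
Link: https://arxiv.org/abs/2510.02262
Authors: Guangyu Sun, Archit Singhal, Burak Uzkent, Mubarak Shah, Chen Chen, Garin Kessler
Affiliations: Amazon; University of Central Florida
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: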
Abstract:Video Large Language Models (VLMs) have achieved remarkable results on a variety of vision language tasks, yet their practical use is limited by the “needle in a haystack” problem: the massive number of visual tokens produced from raw video frames exhausts the model’s context window. Existing solutions alleviate this issue by selecting a sparse set of frames, thereby reducing token count, but such frame-wise selection discards essential temporal dynamics, leading to suboptimal reasoning about motion and event continuity. In this work we systematically explore the impact of temporal information and demonstrate that extending selection from isolated key frames to key clips, which are short, temporally coherent segments, improves video understanding. To maintain a fixed computational budget while accommodating the larger token footprint of clips, we propose an adaptive resolution strategy that dynamically balances spatial resolution and clip length, ensuring a constant token count per video. Experiments on three long-form video benchmarks demonstrate that our training-free approach, F2C, outperforms uniform sampling by up to 8.1%, 5.6%, and 10.3% on the Video-MME, LongVideoBench, and MLVU benchmarks, respectively. These results highlight the importance of preserving temporal coherence in frame selection and provide a practical pathway for scaling Video LLMs to real-world video understanding applications. The project webpage is available at this https URL.
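The trade-off between clip length and spatial resolution under a fixed token budget can be illustrated with simple arithmetic. The sketch below assumes each frame is encoded as a square grid of `side × side` visual tokens and picks the largest grid that keeps the whole video within budget. The function name, the square-grid assumption, and the budget of 8192 tokens are all hypothetical and not taken from the paper.

```python
import math


def frame_grid_for_budget(token_budget, num_clips, clip_len):
    """Largest square token grid per frame that fits the video-level budget.

    Returns (tokens per side, total tokens actually used).
    """
    frames = num_clips * clip_len
    tokens_per_frame = token_budget // frames
    side = math.isqrt(tokens_per_frame)
    if side < 1:
        raise ValueError("budget too small for this many frames")
    return side, frames * side * side


# Same budget, longer clips: resolution drops so the token count stays put.
for clip_len in (1, 2, 4):
    print(clip_len, frame_grid_for_budget(8192, num_clips=8, clip_len=clip_len))
```

Quadrupling the clip length halves the grid side, so temporal context is bought with spatial detail rather than with extra tokens.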
[CV-18] Drag Flow: Unleashing DiT Priors with Region Based Supervision for Drag Editing
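[Quick read]: This paper addresses distortion of the target region in drag-based image editing, which arises because the generative priors of earlier base models (such as Stable Diffusion) are too weak to map optimized latents back onto the natural image manifold. As diffusion models have moved from UNet architectures to more scalable DiT (Diffusion Transformer) designs (such as SD3.5 and FLUX), generative priors have become much stronger, but drag editing has not yet benefited. The key to the solution is DragFlow, the first drag-editing framework to exploit FLUX's strong prior. The authors first observe that applying point-based drag editing directly to DiTs works poorly, because DiT features are less compressed than UNet features and give no reliable point-wise motion supervision. They therefore introduce a region-based editing paradigm in which affine transformations provide richer and more consistent feature supervision, integrate pretrained open-domain personalization adapters (such as IP-Adapter) to improve subject consistency, preserve background fidelity with gradient-mask hard constraints, and use multimodal large language models (MLLMs) to resolve task ambiguity. The method clearly outperforms existing baselines on the newly built region-level benchmark ReD Bench, setting a new state of the art for drag-based image editing.
Link: https://arxiv.org/abs/2510.02253
Authors: Zihan Zhou, Shilin Lu, Shuli Leng, Shaocong Zhang, Zhuming Lian, Xinlei Yu, Adams Wai-Kin Kong
Affiliations: Nanyang Technological University; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint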
Abstract:Drag-based image editing has long suffered from distortions in the target region, largely because the priors of earlier base models such as Stable Diffusion are insufficient to project optimized latents back onto the natural image manifold. With the shift from UNet-based DDPMs to more scalable DiT with flow matching (e.g., SD3.5, FLUX), generative priors have become significantly stronger, enabling advances across diverse editing tasks. However, drag-based editing has yet to benefit from these stronger priors. This work proposes the first framework to effectively harness FLUX’s rich prior for drag-based editing, dubbed DragFlow, achieving substantial gains over baselines. We first show that directly applying point-based drag editing to DiTs performs poorly: unlike the highly compressed features of UNets, DiT features are insufficiently structured to provide reliable guidance for point-wise motion supervision. To overcome this limitation, DragFlow introduces a region-based editing paradigm, where affine transformations enable richer and more consistent feature supervision. Additionally, we integrate pretrained open-domain personalization adapters (e.g., IP-Adapter) to enhance subject consistency, while preserving background fidelity through gradient mask-based hard constraints. Multimodal large language models (MLLMs) are further employed to resolve task ambiguities. For evaluation, we curate a novel Region-based Dragging benchmark (ReD Bench) featuring region-level dragging instructions. Extensive experiments on DragBench-DR and ReD Bench show that DragFlow surpasses both point-based and region-based baselines, setting a new state-of-the-art in drag-based image editing. Code and datasets will be publicly available upon publication.
[CV-19] RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning
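[Quick read]: This paper addresses the weak performance of multimodal large language models (MLLMs) on fine-grained visual reasoning, in particular their limited spatial reasoning in structured, information-rich settings such as transit maps. Because standard reinforcement learning (RL) is hard to train under sparse rewards and unstable optimization, the paper proposes two key solutions. First, it builds the extended dataset ReasonMap-Plus, which introduces dense reward signals through visual question answering (VQA) tasks and enables effective cold-start training of fine-grained visual understanding skills. Second, it designs RewardMap, a multi-stage RL framework that combines a difficulty-aware reward (with detail rewards to ease sparsity) and a progressive training scheme from simple perception to complex reasoning, which is reported to be a more effective cold start than conventional supervised fine-tuning (SFT). Experiments show that the components of RewardMap work together to give consistent gains, with an average improvement of 3.47% across 6 benchmarks.
Link: https://arxiv.org/abs/2510.02240
Authors: Sicheng Feng, Kaiwen Tuo, Song Wang, Lingdong Kong, Jianke Zhu, Huan Wang
Affiliations: Westlake University; Tongji University; Zhejiang University; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: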
Abstract:Fine-grained visual reasoning remains a core challenge for multimodal large language models (MLLMs). The recently introduced ReasonMap highlights this gap by showing that even advanced MLLMs struggle with spatial reasoning in structured and information-rich settings such as transit maps, a task of clear practical and scientific importance. However, standard reinforcement learning (RL) on such tasks is impeded by sparse rewards and unstable optimization. To address this, we first construct ReasonMap-Plus, an extended dataset that introduces dense reward signals through Visual Question Answering (VQA) tasks, enabling effective cold-start training of fine-grained visual understanding skills. Next, we propose RewardMap, a multi-stage RL framework designed to improve both visual understanding and reasoning capabilities of MLLMs. RewardMap incorporates two key designs. First, we introduce a difficulty-aware reward design that incorporates detail rewards, directly tackling the sparse rewards while providing richer supervision. Second, we propose a multi-stage RL scheme that bootstraps training from simple perception to complex reasoning tasks, offering a more effective cold-start strategy than conventional Supervised Fine-Tuning (SFT). Experiments on ReasonMap and ReasonMap-Plus demonstrate that each component of RewardMap contributes to consistent performance gains, while their combination yields the best results. Moreover, models trained with RewardMap achieve an average improvement of 3.47% across 6 benchmarks spanning spatial reasoning, fine-grained visual reasoning, and general tasks beyond transit maps, underscoring enhanced visual understanding and reasoning capabilities.
[CV-20] TempoControl: Temporal Attention Guidance for Text-to-Video Models
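[Quick read]: This paper addresses the lack of temporal control in generative video models: existing models cannot finely control when particular visual elements appear in a video. The key to the solution is TempoControl, a method that uses the cross-attention mechanism of text-to-video diffusion models to align visual concepts in time during inference, without retraining or additional supervision. It optimizes the attention distribution with three complementary principles: aligning its temporal shape via correlation, amplifying attention where visibility is needed via energy, and keeping spatial focus via an entropy constraint, which gives precise temporal control while preserving video quality and diversity.
Link: https://arxiv.org/abs/2510.02226
Authors: Shira Schiber, Ofir Lindenbaum, Idan Schwartz
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under Review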
Abstract:Recent advances in generative video models have enabled the creation of high-quality videos based on natural language prompts. However, these models frequently lack fine-grained temporal control, meaning they do not allow users to specify when particular visual elements should appear within a generated sequence. In this work, we introduce TempoControl, a method that allows for temporal alignment of visual concepts during inference, without requiring retraining or additional supervision. TempoControl utilizes cross-attention maps, a key component of text-to-video diffusion models, to guide the timing of concepts through a novel optimization approach. Our method steers attention using three complementary principles: aligning its temporal shape with a control signal (via correlation), amplifying it where visibility is needed (via energy), and maintaining spatial focus (via entropy). TempoControl allows precise control over timing while ensuring high video quality and diversity. We demonstrate its effectiveness across various video generation applications, including temporal reordering for single and multiple objects, as well as action and audio-aligned generation.
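The three principles named in the abstract (correlation, energy, entropy) can each be written as a small quantity over attention values. The sketch below is one plausible reading, not the paper's formulation: `attn_per_frame` is assumed to be the total cross-attention mass of a concept in each frame, `control` a user-supplied timing signal, and `attn_map` a single frame's spatial attention. All names and sample values are hypothetical.

```python
import math


def pearson(a, b):
    """Correlation between per-frame attention and the control signal."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def masked_energy(attn_per_frame, control):
    """Mean attention mass over the frames where the control signal is positive."""
    active = [a for a, c in zip(attn_per_frame, control) if c > 0]
    return sum(active) / len(active) if active else 0.0


def spatial_entropy(attn_map):
    """Entropy of one frame's attention map; lower means tighter spatial focus."""
    flat = [v for row in attn_map for v in row]
    total = sum(flat)
    probs = [v / total for v in flat if v > 0]
    return -sum(p * math.log(p) for p in probs)


attn = [0.1, 0.2, 0.9, 0.8]   # concept mostly visible in the last two frames
control = [0, 0, 1, 1]        # requested: appear in the second half
print(pearson(attn, control), masked_energy(attn, control))
print(spatial_entropy([[1.0, 1.0], [1.0, 1.0]]))
```

An inference-time objective could raise the first two quantities and lower the third, but how the terms are weighted and optimized is not specified in the abstract.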
[CV-21] MMDEW: Multipurpose Multiclass Density Estimation in the Wild
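[Quick read]: This paper addresses multi-category object counting in dense, occluded scenes, where traditional discrete counting-by-detection methods fail. The key to the solution is a framework built on a Twins pyramid vision-transformer backbone and a dedicated multi-class counting head with a multiscale decoding strategy. A two-task design adds a segmentation-based Category Focus Module that suppresses inter-category cross-talk at training time, and a regional loss improves generalization to new domains. Experiments on the VisDrone and iSAID benchmarks show a markedly lower mean absolute error (MAE) than existing methods, and an application to biodiversity monitoring demonstrates its use in a new domain.
Link: https://arxiv.org/abs/2510.02213
Authors: Villanelle O’Reilly, Jonathan Cox, Georgios Leontidis, Marc Hanheide, Petra Bosilj, James Brown
Affiliations: University of Lincoln; University of Aberdeen; Maastricht University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8+1 pages, 4 figures, 5 tables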
Abstract:Density map estimation can be used to estimate object counts in dense and occluded scenes where discrete counting-by-detection methods fail. We propose a multicategory counting framework that leverages a Twins pyramid vision-transformer backbone and a specialised multi-class counting head built on a state-of-the-art multiscale decoding approach. A two-task design adds a segmentation-based Category Focus Module, suppressing inter-category cross-talk at training time. Training and evaluation on the VisDrone and iSAID benchmarks demonstrate superior performance versus prior multicategory crowd-counting approaches (33%, 43% and 64% reduction in MAE), and the comparison with YOLOv11 underscores the necessity of crowd counting methods in dense scenes. The method’s regional loss opens up multi-class crowd counting to new domains, demonstrated through the application to a biodiversity monitoring dataset, highlighting its capacity to inform conservation efforts and enable scalable ecological insights.
[CV-22] Cross-Breed Pig Identification Using Auricular Vein Pattern Recognition: A Machine Learning Approach for Small-Scale Farming Applications
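[Quick read]: This paper addresses livestock identification for small-scale farmers: traditional methods such as ear tags and microchips are unreliable, costly, and geared to pure breeds, which makes them hard to adopt in resource-constrained farming. The key to the solution is a noninvasive biometric approach that identifies individuals by the uniqueness of their auricular (ear) vein patterns. Images are captured with a standard smartphone under simple back lighting; a multistage computer vision pipeline enhances vein visibility and extracts structural and spatial features; and a Support Vector Machine (SVM) classifier reaches 98.12% identification accuracy on a mixed-breed population. The whole process takes 8.3 seconds on average, which the authors consider feasible for real-time deployment.
Link: https://arxiv.org/abs/2510.02197
Authors: Emmanuel Nsengiyumvaa, Leonard Niyitegekaa, Eric Umuhoza
Affiliations: Carnegie Mellon University Africa
Categories: Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
Comments: 20 pages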
Abstract:Accurate livestock identification is a cornerstone of modern farming: it supports health monitoring, breeding programs, and productivity tracking. However, common pig identification methods, such as ear tags and microchips, are often unreliable, costly, and targeted at pure breeds, and are thus impractical for small-scale farmers. To address this gap, we propose a noninvasive biometric identification approach that leverages the uniqueness of the auricular vein patterns. To this end, we have collected 800 ear images from 20 mixed-breed pigs (Landrace cross Pietrain and Duroc cross Pietrain), captured using a standard smartphone and simple back lighting. A multistage computer vision pipeline was developed to enhance vein visibility, extract structural and spatial features, and generate biometric signatures. These features were then classified using machine learning models. Support Vector Machines (SVM) achieved the highest accuracy, correctly identifying pigs with 98.12% precision across mixed-breed populations. The entire process from image processing to classification was completed in an average of 8.3 seconds, demonstrating feasibility for real-time farm deployment. We believe that by replacing fragile physical identifiers with permanent biological markers, this system provides farmers with a cost-effective and stress-free method of animal identification. More broadly, the findings confirm the practicality of auricular vein biometrics for digitizing livestock management, reinforcing its potential to extend the benefits of precision farming to resource-constrained agricultural communities.
[CV-23] GeoPurify: A Data-Efficient Geometric Distillation Framework for Open-Vocabulary 3D Segmentation
Link: https://arxiv.org/abs/2510.02186
Authors: Weijia Dou, Xu Zhang, Yi Bin, Jian Liu, Bo Peng, Guoqing Wang, Yang Yang, Heng Tao Shen
Affiliations: Tongji University; Tianjin University; University of Electronic Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
[CV-24] DisCo-Layout: Disentangling and Coordinating Semantic and Physical Refinement in a Multi-Agent Framework for 3D Indoor Layout Synthesis
Link: https://arxiv.org/abs/2510.02178
Authors: Jialin Gao, Donghao Zhou, Mingjian Liang, Lihao Liu, Chi-Wing Fu, Xiaowei Hu, Pheng-Ann Heng
Affiliations: The Chinese University of Hong Kong; Monash University; Amazon; South China University of Technology
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-25] Unlocking Vision-Language Models for Video Anomaly Detection via Fine-Grained Prompting
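[Quick read]: This paper addresses a weakness of video anomaly detection (VAD) with frozen vision-language models (VLMs): prompts that are too abstract fail to capture the fine-grained human-object interactions or action semantics that define complex anomalies. The key to the solution is ASK-Hint, a structured prompting framework that draws on action-centric knowledge, organizes prompts into semantically coherent groups (such as violence, property crimes, and public safety), and formulates fine-grained guiding questions so that model predictions align with discriminative visual cues, improving both detection accuracy and interpretability.
Link: https://arxiv.org/abs/2510.02155
Authors: Shu Zou, Xinyu Tian, Lukas Wesemann, Fabian Waschkowski, Zhaoyuan Yang, Jing Zhang
Affiliations: Australian National University; Maincode; GE Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 14 pages, video anomaly detection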
Abstract:Prompting has emerged as a practical way to adapt frozen vision-language models (VLMs) for video anomaly detection (VAD). Yet, existing prompts are often overly abstract, overlooking the fine-grained human-object interactions or action semantics that define complex anomalies in surveillance videos. We propose ASK-Hint, a structured prompting framework that leverages action-centric knowledge to elicit more accurate and interpretable reasoning from frozen VLMs. Our approach organizes prompts into semantically coherent groups (e.g. violence, property crimes, public safety) and formulates fine-grained guiding questions that align model predictions with discriminative visual cues. Extensive experiments on UCF-Crime and XD-Violence show that ASK-Hint consistently improves AUC over prior baselines, achieving state-of-the-art performance compared to both fine-tuned and training-free methods. Beyond accuracy, our framework provides interpretable reasoning traces towards anomaly and demonstrates strong generalization across datasets and VLM backbones. These results highlight the critical role of prompt granularity and establish ASK-Hint as a new training-free and generalizable solution for explainable video anomaly detection.
[CV-26] FRIEREN: Federated Learning with Vision-Language Regularization for Segmentation
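[Quick read]: This paper addresses the performance drop that domain shift causes when Federated Learning (FL) is applied to Semantic Segmentation (SS), especially when client data is entirely unlabeled and conventional methods adapt poorly to new domains. The key to the solution is twofold. The paper defines a new and challenging task, FFREEDG, in which a model is pretrained on the server's labeled source data and then trained on clients using only unlabeled data, without re-accessing the source. It then proposes the FRIEREN framework, which leverages the knowledge of a Vision Foundation Model (VFM): a vision-language decoder guided by CLIP-based text embeddings improves semantic disambiguation, and a weak-to-strong consistency learning strategy makes training on pseudo-labels more robust. On synthetic-to-real and clear-to-adverse-weather benchmarks, the framework is competitive with established domain generalization and adaptation methods.
Link: https://arxiv.org/abs/2510.02114
Authors: Ding-Ruei Shen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Master Thesis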
Abstract:Federated Learning (FL) offers a privacy-preserving solution for Semantic Segmentation (SS) tasks to adapt to new domains, but faces significant challenges from these domain shifts, particularly when client data is unlabeled. However, most existing FL methods unrealistically assume access to labeled data on remote clients or fail to leverage the power of modern Vision Foundation Models (VFMs). Here, we propose a novel and challenging task, FFREEDG, in which a model is pretrained on a server’s labeled source dataset and subsequently trained across clients using only their unlabeled data, without ever re-accessing the source. To solve FFREEDG, we propose FRIEREN, a framework that leverages the knowledge of a VFM by integrating vision and language modalities. Our approach employs a Vision-Language decoder guided by CLIP-based text embeddings to improve semantic disambiguation and uses a weak-to-strong consistency learning strategy for robust local training on pseudo-labels. Our experiments on synthetic-to-real and clear-to-adverse-weather benchmarks demonstrate that our framework effectively tackles this new task, achieving competitive performance against established domain generalization and adaptation methods and setting a strong baseline for future research.
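Weak-to-strong consistency on pseudo-labels is commonly implemented in the FixMatch style: predictions on a weakly augmented view supply hard pseudo-labels, only confident ones are kept, and the prediction on a strongly augmented view is trained to match them. The sketch below shows that generic recipe for per-pixel class probabilities. The 0.9 confidence threshold and the function name are assumptions, and FRIEREN's actual variant may differ.

```python
import math


def weak_to_strong_loss(weak_probs, strong_probs, threshold=0.9):
    """Cross-entropy of strong-view predictions against confident weak-view labels.

    Each argument is a list of per-pixel class-probability lists.
    Returns (mean loss over kept pixels, number of kept pixels).
    """
    total, kept = 0.0, 0
    for p_weak, p_strong in zip(weak_probs, strong_probs):
        label = max(range(len(p_weak)), key=p_weak.__getitem__)
        if p_weak[label] < threshold:
            continue  # pseudo-label too uncertain, skip this pixel
        total += -math.log(max(p_strong[label], 1e-12))
        kept += 1
    return (total / kept if kept else 0.0), kept


# Two pixels, two classes: only the first weak prediction is confident enough.
weak = [[0.95, 0.05], [0.60, 0.40]]
strong = [[0.50, 0.50], [0.10, 0.90]]
print(weak_to_strong_loss(weak, strong))
```

Because no ground-truth labels appear anywhere in the loss, each client can run it locally on its own unlabeled images.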
[CV-27] When Tracking Fails: Analyzing Failure Modes of SAM2 for Point-Based Tracking in Surgical Videos MICCAI
Link: https://arxiv.org/abs/2510.02100
Authors: Woowon Jang, Jiwon Im, Juseung Choi, Niki Rashidian, Wesley De Neve, Utku Ozbulak
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted for publication in the 28th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) Workshop on Collaborative Intelligence and Autonomy in Image-guided Surgery (COLAS), 2025
[CV-28] Mapping Historic Urban Footprints in France: Balancing Quality Scalability and AI Techniques
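[Quick read]: This paper addresses the lack of nationwide digital urban footprint data, which hinders quantitative analysis of historical urban sprawl in France before the 1970s. The key to the solution is a dual-pass U-Net deep learning pipeline that extracts urban areas from the Scan Histo historical map series (1925-1950). The first pass, trained on an initial dataset, produces a preliminary map that identifies areas of confusion such as text and roads and guides targeted data augmentation. The second pass uses the refined dataset together with the binarized output of the first pass to suppress radiometric noise and reduce false positives. Run on a high-performance computing cluster, the method processes 941 high-resolution tiles covering metropolitan France and produces the first open-access, national-scale urban footprint raster for this period, with an overall accuracy of 73%, while overcoming common artifacts of historical maps such as labels and contour lines.
Link: https://arxiv.org/abs/2510.02097
Authors: Walid Rabehi, Marion Le Texier, Rémi Lemoy
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: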
Abstract:Quantitative analysis of historical urban sprawl in France before the 1970s is hindered by the lack of nationwide digital urban footprint data. This study bridges this gap by developing a scalable deep learning pipeline to extract urban areas from the Scan Histo historical map series (1925-1950), which produces the first open-access, national-scale urban footprint dataset for this pivotal period. Our key innovation is a dual-pass U-Net approach designed to handle the high radiometric and stylistic complexity of historical maps. The first pass, trained on an initial dataset, generates a preliminary map that identifies areas of confusion, such as text and roads, to guide targeted data augmentation. The second pass uses a refined dataset and the binarized output of the first model to minimize radiometric noise, which significantly reduces false positives. Deployed on a high-performance computing cluster, our method processes 941 high-resolution tiles covering the entirety of metropolitan France. The final mosaic achieves an overall accuracy of 73%, effectively capturing diverse urban patterns while overcoming common artifacts like labels and contour lines. We openly release the code, training datasets, and the resulting nationwide urban raster to support future research in long-term urbanization dynamics.
[CV-29] VGDM: Vision-Guided Diffusion Model for Brain Tumor Detection and Segmentation
Link: https://arxiv.org/abs/2510.02086
Authors: Arman Behnam
Affiliations: Illinois Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-30] Spec-Gloss Surfels and Normal-Diffuse Priors for Relightable Glossy Objects
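[Quick read]: This paper addresses accurate reconstruction and relighting of glossy objects, where the core difficulty is that object shape, material properties, and illumination are strongly coupled and hard to disentangle. Existing neural rendering methods usually rely on simplified bidirectional reflectance distribution function (BRDF) models or parameterizations that couple the diffuse and specular components, which leads to inaccurate material recovery and limited relighting quality. The key to the solution is to integrate a microfacet BRDF with the specular-glossiness parameterization into 2D Gaussian Splatting with deferred shading, giving a more physically consistent material decomposition. Diffusion-based priors on surface normals and diffuse color guide early-stage optimization and reduce ambiguity, and a coarse-to-fine optimization of the environment map speeds up convergence while preserving high-dynamic-range specular reflections. Together these improve geometry and material reconstruction on complex glossy scenes and the realism of relighting under novel illumination.
Link: https://arxiv.org/abs/2510.02069
Authors: Georgios Kouros, Minye Wu, Tinne Tuytelaars
Affiliations: KU Leuven
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: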
Abstract:Accurate reconstruction and relighting of glossy objects remain a longstanding challenge, as object shape, material properties, and illumination are inherently difficult to disentangle. Existing neural rendering approaches often rely on simplified BRDF models or parameterizations that couple diffuse and specular components, which restricts faithful material recovery and limits relighting fidelity. We propose a relightable framework that integrates a microfacet BRDF with the specular-glossiness parameterization into 2D Gaussian Splatting with deferred shading. This formulation enables more physically consistent material decomposition, while diffusion-based priors for surface normals and diffuse color guide early-stage optimization and mitigate ambiguity. A coarse-to-fine optimization of the environment map accelerates convergence and preserves high-dynamic-range specular reflections. Extensive experiments on complex, glossy scenes demonstrate that our method achieves high-quality geometry and material reconstruction, delivering substantially more realistic and consistent relighting under novel illumination compared to existing Gaussian splatting methods.
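For readers unfamiliar with microfacet shading, the sketch below evaluates a standard Cook-Torrance specular term (GGX normal distribution, Schlick Fresnel, Smith masking) driven by specular-glossiness inputs. The mapping roughness = 1 - glossiness with alpha = roughness², and the use of the specular value as the normal-incidence reflectance F0, are common conventions assumed here for illustration; the paper's exact BRDF terms are not given in the abstract.

```python
import math


def ggx_ndf(n_dot_h, alpha):
    """GGX (Trowbridge-Reitz) normal distribution function."""
    a2 = alpha * alpha
    denom = n_dot_h * n_dot_h * (a2 - 1.0) + 1.0
    return a2 / (math.pi * denom * denom)


def schlick_fresnel(v_dot_h, f0):
    """Schlick approximation of Fresnel reflectance."""
    return f0 + (1.0 - f0) * (1.0 - v_dot_h) ** 5


def smith_g1(n_dot_x, alpha):
    """Smith masking term for GGX in one direction (view or light)."""
    a2 = alpha * alpha
    return 2.0 * n_dot_x / (n_dot_x + math.sqrt(a2 + (1.0 - a2) * n_dot_x * n_dot_x))


def specular_brdf(n_dot_l, n_dot_v, n_dot_h, v_dot_h, specular, glossiness):
    """Cook-Torrance specular term for one color channel."""
    roughness = 1.0 - glossiness             # assumed spec-gloss convention
    alpha = max(roughness * roughness, 1e-4)  # clamp to avoid a singular lobe
    d = ggx_ndf(n_dot_h, alpha)
    f = schlick_fresnel(v_dot_h, specular)
    g = smith_g1(n_dot_l, alpha) * smith_g1(n_dot_v, alpha)
    return d * f * g / (4.0 * n_dot_l * n_dot_v)


# Head-on view and light: a glossier surface gives a much sharper highlight.
print(specular_brdf(1.0, 1.0, 1.0, 1.0, specular=0.04, glossiness=0.5))
print(specular_brdf(1.0, 1.0, 1.0, 1.0, specular=0.04, glossiness=0.8))
```

Keeping the specular color as its own input, separate from the diffuse color, is what lets this parameterization avoid coupling the two components.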
[CV-31] Zero-shot Human Pose Estimation using Diffusion-based Inverse solvers
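[Quick read]: This paper addresses the poor cross-user generalization of pose estimation from a limited number of body sensors; the core difficulty is that location measurements depend strongly on the user's body size. The key to the solution is to formulate pose estimation as an inverse problem and propose InPose: a pretrained diffusion model is conditioned on rotation measurements alone, and a likelihood term derived from the location measurements guides generation. This gives zero-shot generalization across users and produces the highly likely pose sequence that best explains the sparse sensor data.
Link: https://arxiv.org/abs/2510.02043
Authors: Sahil Bhandary Karnoor, Romit Roy Choudhury
Affiliations: University of Illinois at Urbana-Champaign
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: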
Abstract:Pose estimation refers to tracking a human’s full body posture, including their head, torso, arms, and legs. The problem is challenging in practical settings where the number of body sensors is limited. Past work has shown promising results using conditional diffusion models, where the pose prediction is conditioned on both location and rotation measurements from the sensors. Unfortunately, nearly all these approaches generalize poorly across users, primarily because location measurements are highly influenced by the body size of the user. In this paper, we formulate pose estimation as an inverse problem and design an algorithm capable of zero-shot generalization. Our idea utilizes a pre-trained diffusion model and conditions it on rotational measurements alone; the priors from this model are then guided by a likelihood term, derived from the measured locations. Thus, given any user, our proposed InPose method generatively estimates the highly likely sequence of poses that best explains the sparse on-body measurements.
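The idea of guiding a generative prior with a measurement likelihood can be shown on a toy linear problem. In the sketch below, `denoise` stands in for one step of a pretrained prior, `A` is an assumed linear map from pose parameters to measured sensor locations, and the update subtracts the gradient of 0.5·||A x - y||². This illustrates only the guidance term under a Gaussian-noise assumption; it is not the InPose algorithm, and every name is hypothetical.

```python
def likelihood_gradient(x, A, y):
    """Gradient of 0.5 * ||A x - y||^2 with respect to x, i.e. A^T (A x - y)."""
    residual = [sum(a * xi for a, xi in zip(row, x)) - yi for row, yi in zip(A, y)]
    return [sum(A[i][j] * residual[i] for i in range(len(A))) for j in range(len(x))]


def guided_step(x, denoise, A, y, scale):
    """One prior step followed by a pull toward the measurements."""
    x_prior = denoise(x)
    grad = likelihood_gradient(x_prior, A, y)
    return [xi - scale * g for xi, g in zip(x_prior, grad)]


# Toy setup: only the first coordinate is observed; the "prior" is the identity.
A = [[1.0, 0.0]]
y = [2.0]
x = [0.0, 5.0]
for _ in range(3):
    x = guided_step(x, lambda v: list(v), A, y, scale=0.5)
print(x)
```

The observed coordinate moves toward the measurement while the unobserved one is left to the prior, which mirrors the division of labor described in the abstract.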
[CV-32] GaussianMorphing: Mesh-Guided 3D Gaussians for Semantic-Aware Object Morphing
Link: https://arxiv.org/abs/2510.02034
Authors: Mengtian Li,Yunshu Bai,Yimin Chu,Yijun Shen,Zhongmei Li,Weifeng Ge,Zhifeng Xie,Chaofeng Chen
Affiliations: Shanghai University; Wuhan University; East China University of Science and Technology; Fudan University; Shanghai Engineering Research Center of Motion Picture Special Effects
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
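[CV-33] kabr-tools: Automated Framework for Multi-Species Behavioral Monitoring
Quick read: This paper addresses the limits of traditional field observation in scale, efficiency, and precision, which make it hard to quantify and interpret complex, multidimensional animal behavior at scale. The core question is how to collect wildlife behavior data across landscapes efficiently and accurately to support ecological research and conservation practice. The key contribution is kabr-tools (Kenyan Animal Behavior Recognition Tools), an open-source framework for automated behavioral monitoring that combines drone video with machine learning. Using object detection, tracking, and behavior classification, it extracts key metrics including time budgets, behavioral transitions, social interactions, habitat associations, and group composition dynamics, substantially improving the granularity and continuity of behavioral analysis; the paper also validates its throughput and accuracy in multi-species studies.
Link: https://arxiv.org/abs/2510.02030
Authors: Jenna Kline,Maksim Kholiavchenko,Samuel Stevens,Nina van Tiel,Alison Zhong,Namrata Banerji,Alec Sheets,Sowbaranika Balasubramaniam,Isla Duporge,Matthew Thompson,Elizabeth Campolongo,Jackson Miliko,Neil Rosser,Tanya Berger-Wolf,Charles V. Stewart,Daniel I. Rubenstein
Affiliations: The Ohio State University; Rensselaer Polytechnic Institute; École Polytechnique Fédérale de Lausanne; Princeton University; Mpala Research Centre; University of Miami
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 31 pages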
Abstract:A comprehensive understanding of animal behavior ecology depends on scalable approaches to quantify and interpret complex, multidimensional behavioral patterns. Traditional field observations are often limited in scope, time-consuming, and labor-intensive, hindering the assessment of behavioral responses across landscapes. To address this, we present kabr-tools (Kenyan Animal Behavior Recognition Tools), an open-source package for automated multi-species behavioral monitoring. This framework integrates drone-based video with machine learning systems to extract behavioral, social, and spatial metrics from wildlife footage. Our pipeline leverages object detection, tracking, and behavioral classification systems to generate key metrics, including time budgets, behavioral transitions, social interactions, habitat associations, and group composition dynamics. Compared to ground-based methods, drone-based observations significantly improved behavioral granularity, reducing visibility loss by 15% and capturing more transitions with higher accuracy and continuity. We validate kabr-tools through three case studies, analyzing 969 behavioral sequences, surpassing the capacity of traditional methods for data capture and annotation. We found that, like Plains zebras, vigilance in Grevy’s zebras decreases with herd size, but, unlike Plains zebras, habitat has a negligible impact. Plains and Grevy’s zebras exhibit strong behavioral inertia, with rare transitions to alert behaviors and observed spatial segregation between Grevy’s zebras, Plains zebras, and giraffes in mixed-species herds. By enabling automated behavioral monitoring at scale, kabr-tools offers a powerful tool for ecosystem-wide studies, advancing conservation, biodiversity research, and ecological monitoring.
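Two of the metrics named in this abstract, time budgets and behavioral transitions, are simple to compute once each frame of a tracked animal carries a behavior label. The sketch below is an illustration under that assumption; it does not use or reproduce the kabr-tools API, and the function names and the example labels are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def time_budget(labels: List[str]) -> Dict[str, float]:
    """Fraction of frames spent in each behavior."""
    if not labels:
        return {}
    counts = Counter(labels)
    return {behavior: n / len(labels) for behavior, n in counts.items()}


def transition_counts(labels: List[str]) -> Dict[Tuple[str, str], int]:
    """Count changes from one behavior to a different one between consecutive frames."""
    pairs: Counter = Counter()
    for prev, curr in zip(labels, labels[1:]):
        if prev != curr:
            pairs[(prev, curr)] += 1
    return dict(pairs)


# Hypothetical per-frame labels for one tracked animal.
track = ["graze"] * 6 + ["walk"] * 2 + ["alert"] + ["graze"]
budget = time_budget(track)
transitions = transition_counts(track)
```

Long runs of a single label with few transitions are what the abstract describes as behavioral inertia.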
[CV-34] LiLa-Net: Lightweight Latent LiDAR Autoencoder for 3D Point Cloud Reconstruction
Link: https://arxiv.org/abs/2510.02028
Authors: Mario Resino,Borja Pérez,Jaime Godoy,Abdulla Al-Kaff,Fernando García
Affiliations: Universidad Carlos III de Madrid; Department of Systems Engineering and Automation; Autonomous Mobility and Perception Lab (AMPL)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 7 pages, 3 figures, 7 tables, Submitted to ICRA
[CV-35] Generating Findings for Jaw Cysts in Dental Panoramic Radiographs Using GPT-4o: Building a Two-Stage Self-Correction Loop with Structured Output (SLSO) Framework
Link: https://arxiv.org/abs/2510.02001
Authors: Nanaka Hosokawa,Ryo Takahashi,Tomoya Kitano,Yukihiro Iida,Chisako Muramatsu,Tatsuro Hayashi,Yuta Seino,Xiangrong Zhou,Takeshi Hara,Akitoshi Katsumata,Hiroshi Fujita
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Intended for submission to Scientific Reports
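[CV-36] Pure-Pass: Fine-Grained Adaptive Masking for Dynamic Token-Mixing Routing in Lightweight Image Super-Resolution
Quick read: This paper addresses the high computational complexity that hinders practical deployment of deep learning-based image super-resolution (SR), and in particular the limitations of existing lightweight methods in adaptability, masking granularity, and spatial flexibility. The key contribution is Pure-Pass (PP), a pixel-level masking mechanism that uses fixed color center points to classify pixels, identifies "pure pixels", and exempts them from expensive computation, giving fine-grained, spatially flexible masking while remaining adaptive. Integrated into the state-of-the-art ATD-light model as PP-ATD-light, it improves reconstruction quality and parameter efficiency at a similar level of compute savings.
Link: https://arxiv.org/abs/2510.01997
Authors: Junyu Wu,Jie Tang,Jie Liu,Gangshan Wu
Affiliations: Nanjing University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: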
Abstract:Image Super-Resolution (SR) aims to reconstruct high-resolution images from low-resolution counterparts, but the computational complexity of deep learning-based methods often hinders practical deployment. CAMixer is the pioneering work to integrate the advantages of existing lightweight SR methods and proposes a content-aware mixer to route token mixers of varied complexities according to the difficulty of content recovery. However, several limitations remain, such as poor adaptability, coarse-grained masking and spatial inflexibility, among others. We propose Pure-Pass (PP), a pixel-level masking mechanism that identifies pure pixels and exempts them from expensive computations. PP utilizes fixed color center points to classify pixels into distinct categories, enabling fine-grained, spatially flexible masking while maintaining adaptive flexibility. Integrated into the state-of-the-art ATD-light model, PP-ATD-light achieves superior SR performance with minimal overhead, outperforming CAMixer-ATD-light in reconstruction quality and parameter efficiency when saving a similar amount of computation.
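The abstract says pixels are classified by fixed color center points and that "pure" pixels skip expensive computation, but it does not define purity. The sketch below is therefore only one plausible reading, stated as an assumption: a pixel counts as pure when every pixel in its local window falls in the same color category. The names `nearest_center` and `pure_mask`, the window rule, and the two-color palette are all hypothetical.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]


def nearest_center(pixel: Pixel, centers: Sequence[Pixel]) -> int:
    """Index of the fixed color center closest to the pixel (squared RGB distance)."""
    return min(
        range(len(centers)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(pixel, centers[k])),
    )


def pure_mask(image: List[List[Pixel]], centers: Sequence[Pixel],
              radius: int = 1) -> List[List[bool]]:
    """Assumed rule: a pixel is pure if its whole window shares one color category."""
    h, w = len(image), len(image[0])
    cats = [[nearest_center(px, centers) for px in row] for row in image]
    mask = []
    for i in range(h):
        row = []
        for j in range(w):
            row.append(all(
                cats[ii][jj] == cats[i][j]
                for ii in range(max(0, i - radius), min(h, i + radius + 1))
                for jj in range(max(0, j - radius), min(w, j + radius + 1))
            ))
        mask.append(row)
    return mask


# 3x3 toy image: dark everywhere except a bright top-right corner.
dark, bright = (5, 3, 4), (250, 252, 251)
image = [
    [dark, dark, bright],
    [dark, dark, dark],
    [dark, dark, dark],
]
centers = [(0, 0, 0), (255, 255, 255)]
mask = pure_mask(image, centers)  # True marks pixels that could skip heavy computation
```

Under this rule the mask follows content at pixel resolution, so flat regions are skipped while pixels near an edge keep the full computation.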
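[CV-37] 4DGS-Craft: Consistent and Interactive 4D Gaussian Splatting Editing
Quick read: This paper addresses insufficient view consistency, temporal consistency, and non-editing region consistency in 4D Gaussian Splatting (4DGS) editing, and aims to improve the handling of complex text instructions. The key contribution is the 4DGS-Craft framework. First, a 4D-aware InstructPix2Pix model uses 4D VGGT geometry features extracted from the initial scene to capture the underlying 4D structure. Second, a multi-view grid module enforces consistency by iteratively refining multi-view input images while jointly optimizing the 4D scene. Third, a new Gaussian selection mechanism optimizes only the Gaussians inside the edited regions, preserving consistency elsewhere. Finally, an LLM-based user intent understanding module decomposes complex instructions into a sequence of atomic operations, enabling more controllable and interactive 4D scene editing.
Link: https://arxiv.org/abs/2510.01991
Authors: Lei Liu,Can Wang,Zhenghao Chen,Dong Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: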
Abstract:Recent advances in 4D Gaussian Splatting (4DGS) editing still face challenges with view, temporal, and non-editing region consistency, as well as with handling complex text instructions. To address these issues, we propose 4DGS-Craft, a consistent and interactive 4DGS editing framework. We first introduce a 4D-aware InstructPix2Pix model to ensure both view and temporal consistency. This model incorporates 4D VGGT geometry features extracted from the initial scene, enabling it to capture underlying 4D geometric structures during editing. We further enhance this model with a multi-view grid module that enforces consistency by iteratively refining multi-view input images while jointly optimizing the underlying 4D scene. Furthermore, we preserve the consistency of non-edited regions through a novel Gaussian selection mechanism, which identifies and optimizes only the Gaussians within the edited regions. Beyond consistency, facilitating user interaction is also crucial for effective 4DGS editing. Therefore, we design an LLM-based module for user intent understanding. This module employs a user instruction template to define atomic editing operations and leverages an LLM for reasoning. As a result, our framework can interpret user intent and decompose complex instructions into a logical sequence of atomic operations, enabling it to handle intricate user commands and further enhance editing performance. Compared to related works, our approach enables more consistent and controllable 4D scene editing. Our code will be made available upon acceptance.
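[CV-38] TriAlignXA: An Explainable Trilemma Alignment Framework for Trustworthy Agri-product Grading
Quick read: This paper addresses the "trust deficit" in online fresh-produce e-commerce, which stems from the inability of digital transactions to provide direct sensory perception of product quality. The key contribution is a "Trust Pyramid" model built on "dual-source verification", together with the "Triangular Trust Index" (TTI), which quantifies the trade-off among biological characteristics, timeliness, and economic viability in produce grading; this trade-off is termed the "impossible triangle". To provide explainable and trustworthy decision support, the study recasts the role of algorithms from "decision-makers" to "providers of transparent decision-making bases" and designs the explainable AI framework TriAlignXA, built around three engines: a Bio-Adaptive Engine for granular quality description, a Timeliness Optimization Engine for processing efficiency, and an Economic Optimization Engine for cost control. A "Pre-Mapping Mechanism" additionally encodes process data into QR codes to convey quality information visibly. Experiments show that the framework clearly outperforms baseline models in grading accuracy and multi-objective balance, offering theoretical and practical support for a trustworthy online produce ecosystem.
Link: https://arxiv.org/abs/2510.01990
Authors: Jianfei Xie,Ziyang Li
Affiliations: Xinjiang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments: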
Abstract:The ‘trust deficit’ in online fruit and vegetable e-commerce stems from the inability of digital transactions to provide direct sensory perception of product quality. This paper constructs a ‘Trust Pyramid’ model through ‘dual-source verification’ of consumer trust. Experiments confirm that quality is the cornerstone of trust. The study reveals an ‘impossible triangle’ in agricultural product grading, comprising biological characteristics, timeliness, and economic viability, highlighting the limitations of traditional absolute grading standards. To quantitatively assess this trade-off, we propose the ‘Triangular Trust Index’ (TTI). We redefine the role of algorithms from ‘decision-makers’ to ‘providers of transparent decision-making bases’, designing the explainable AI framework TriAlignXA. This framework supports trustworthy online transactions within agricultural constraints through multi-objective optimization. Its core relies on three engines: the Bio-Adaptive Engine for granular quality description; the Timeliness Optimization Engine for processing efficiency; and the Economic Optimization Engine for cost control. Additionally, the ‘Pre-Mapping Mechanism’ encodes process data into QR codes, transparently conveying quality information. Experiments on grading tasks demonstrate significantly higher accuracy than baseline models. Empirical evidence and theoretical analysis verify the framework’s balancing capability in addressing the ‘impossible triangle’. This research provides comprehensive support, from theory to practice, for building a trustworthy online produce ecosystem, establishing a critical pathway from algorithmic decision-making to consumer trust.
[CV-39] $\text{G}^2$RPO: Granular GRPO for Precise Reward in Flow Models
Quick read: This paper addresses the poor preference alignment of flow models during reinforcement learning (RL), caused by sparse and narrow reward signals. Existing methods can explore potentially high-value samples but struggle to assess rewards precisely. The key contribution is the Granular-GRPO ($\text{G}^2$RPO) framework. A Singular Stochastic Sampling strategy strengthens the correlation between the reward and the injected noise, enabling step-wise stochastic exploration with a reward that is faithful to each SDE perturbation. A Multi-Granularity Advantage Integration module then aggregates advantages computed at multiple diffusion scales, removing the bias of fixed-granularity denoising and giving a more comprehensive and robust evaluation of sampling directions.
Link: https://arxiv.org/abs/2510.01982
Authors: Yujie Zhou,Pengyang Ling,Jiazi Bu,Yibin Wang,Yuhang Zang,Jiaqi Wang,Li Niu,Guangtao Zhai
Affiliations: Shanghai Jiao Tong University; University of Science and Technology of China; Fudan University; Shanghai AI Laboratory; Shanghai Innovation Institute
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Github Page: this https URL
Abstract:The integration of online reinforcement learning (RL) into diffusion and flow models has recently emerged as a promising approach for aligning generative models with human preferences. Stochastic sampling via Stochastic Differential Equations (SDE) is employed during the denoising process to generate diverse denoising directions for RL exploration. While existing methods effectively explore potential high-value samples, they suffer from sub-optimal preference alignment due to sparse and narrow reward signals. To address these challenges, we propose a novel Granular-GRPO ($\text{G}^2$RPO) framework that achieves precise and comprehensive reward assessments of sampling directions in reinforcement learning of flow models. Specifically, a Singular Stochastic Sampling strategy is introduced to support step-wise stochastic exploration while enforcing a high correlation between the reward and the injected noise, thereby facilitating a faithful reward for each SDE perturbation. Concurrently, to eliminate the bias inherent in fixed-granularity denoising, we introduce a Multi-Granularity Advantage Integration module that aggregates advantages computed at multiple diffusion scales, producing a more comprehensive and robust evaluation of the sampling directions. Experiments conducted on various reward models, including both in-domain and out-of-domain evaluations, demonstrate that our $\text{G}^2$RPO significantly outperforms existing flow-based GRPO baselines, highlighting its effectiveness and robustness.
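To make the advantage aggregation concrete, here is a minimal sketch. It uses the standard GRPO group-relative advantage (reward minus group mean, divided by group standard deviation) and then combines the per-granularity advantages with a weighted average. The averaging rule, the uniform default weights, and the function names are assumptions for illustration; the abstract does not specify how the paper aggregates across scales.

```python
from statistics import mean, pstdev
from typing import List, Optional


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: standardize each reward within its sampling group."""
    mu = mean(rewards)
    sd = pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


def integrate_granularities(rewards_by_granularity: List[List[float]],
                            weights: Optional[List[float]] = None) -> List[float]:
    """Assumed aggregation: weighted mean of advantages across granularities.

    rewards_by_granularity[g][i] is the reward of sample i when denoised
    at granularity g (hypothetical layout).
    """
    k = len(rewards_by_granularity)
    weights = weights or [1.0 / k] * k
    per_scale = [group_advantages(r) for r in rewards_by_granularity]
    n = len(per_scale[0])
    return [sum(w * adv[i] for w, adv in zip(weights, per_scale)) for i in range(n)]


# Three samples scored at two granularities (made-up rewards).
single = group_advantages([1.0, 2.0, 3.0])
combined = integrate_granularities([[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]])
```

In the toy call the two granularities rank the samples in opposite order, so the combined advantages cancel. A ranking that holds at only one scale carries less weight once several scales are considered.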
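[CV-40] ROI-GS: Interest-based Local Quality 3D Gaussian Splatting
Quick read: This paper addresses the uniform resource allocation of existing 3D Gaussian Splatting (3DGS) methods, which leaves Regions Of Interest (ROIs) short of detail and inflates model size. The key contribution is ROI-GS, an object-aware framework that uses object-guided camera selection, targeted object training, and seamless integration of high-fidelity object reconstructions into the global scene to enhance local detail on chosen objects while keeping real-time performance. It reduces overall model size by about 17% and, for scenes with a single object of interest, trains faster and improves local reconstruction quality by up to 2.96 dB PSNR.
Link: https://arxiv.org/abs/2510.01978
Authors: Quoc-Anh Bui,Gilles Rougeron,Géraldine Morin,Simone Gasparini
Affiliations: Université Paris-Saclay, CEA, List; Université de Toulouse, Toulouse INP – IRIT
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, 3 figures, 2 tables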
Abstract:We tackle the challenge of efficiently reconstructing 3D scenes with high detail on objects of interest. Existing 3D Gaussian Splatting (3DGS) methods allocate resources uniformly across the scene, limiting fine detail in Regions Of Interest (ROIs) and leading to inflated model size. We propose ROI-GS, an object-aware framework that enhances local details through object-guided camera selection, targeted object training, and seamless integration of high-fidelity object of interest reconstructions into the global scene. Our method prioritizes higher resolution details on chosen objects while maintaining real-time performance. Experiments show that ROI-GS significantly improves local quality (up to 2.96 dB PSNR), while reducing overall model size by approximately 17% of baseline and achieving faster training for a scene with a single object of interest, outperforming existing methods.
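The reported gain can be restated as a change in mean squared error. The block below uses the standard PSNR definition for a single image; the symbol $\mathrm{MAX}$ (peak pixel value) and the subscripts are notation introduced here, not taken from the paper.

```latex
\mathrm{PSNR} = 10 \log_{10}\frac{\mathrm{MAX}^2}{\mathrm{MSE}}
\quad\Longrightarrow\quad
\Delta\mathrm{PSNR} = 10 \log_{10}\frac{\mathrm{MSE}_{\text{base}}}{\mathrm{MSE}_{\text{ROI-GS}}}

% A gain of 2.96 dB on one image therefore corresponds to
\frac{\mathrm{MSE}_{\text{base}}}{\mathrm{MSE}_{\text{ROI-GS}}} = 10^{2.96/10} \approx 1.98
```

So the best-case 2.96 dB figure corresponds to roughly halving the squared error in the region of interest. When PSNR is averaged over many images the relation holds only approximately.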
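[CV-41] ZK-WAGON: Imperceptible Watermark for Image Generation Models using ZK-SNARKs
Quick read: This paper addresses the security and ethical problems that generative AI image models raise around authenticity, ownership, and misuse, in particular risks such as deepfakes and intellectual property infringement. Traditional watermarking methods degrade image quality, are easily removed, or need access to confidential model internals, which makes secure and scalable deployment difficult. The key contribution is ZK-WAGON, which uses Zero-Knowledge Succinct Non-Interactive Arguments of Knowledge (ZK-SNARKs) to produce a verifiable proof of image origin without exposing model weights, generation prompts, or other sensitive information. Selective Layer ZK-Circuit Creation (SL-ZKCC) converts only key layers of the model into a circuit, significantly reducing proof generation time, and the resulting ZK-SNARK proof is embedded imperceptibly into the image via Least Significant Bit (LSB) steganography, giving a model-agnostic, secure watermarking scheme.
Link: https://arxiv.org/abs/2510.01967
Authors: Aadarsh Anantha Ramakrishnan,Shubham Agarwal,Selvanayagam S,Kunwar Singh
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at AI-ML Systems 2025, Bangalore, India, this https URL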
Abstract:As image generation models grow increasingly powerful and accessible, concerns around authenticity, ownership, and misuse of synthetic media have become critical. The ability to generate lifelike images indistinguishable from real ones introduces risks such as misinformation, deepfakes, and intellectual property violations. Traditional watermarking methods either degrade image quality, are easily removed, or require access to confidential model internals, making them unsuitable for secure and scalable deployment. We are the first to introduce ZK-WAGON, a novel system for watermarking image generation models using Zero-Knowledge Succinct Non-Interactive Arguments of Knowledge (ZK-SNARKs). Our approach enables verifiable proof of origin without exposing model weights, generation prompts, or any sensitive internal information. We propose Selective Layer ZK-Circuit Creation (SL-ZKCC), a method to selectively convert key layers of an image generation model into a circuit, reducing proof generation time significantly. Generated ZK-SNARK proofs are imperceptibly embedded into a generated image via Least Significant Bit (LSB) steganography. We demonstrate this system on both GAN and Diffusion models, providing a secure, model-agnostic pipeline for trustworthy AI image generation.
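The final embedding step relies on Least Significant Bit steganography, which is easy to show in isolation. The sketch below is a generic LSB embed and extract over a flat list of 8-bit channel values. It is not the paper's pipeline, the function names are hypothetical, and the payload is a placeholder byte string rather than a real ZK-SNARK proof.

```python
from typing import List


def embed_lsb(pixels: List[int], payload: bytes) -> List[int]:
    """Write payload bits (most significant bit first) into the lowest bit of each value."""
    bits = [(byte >> (7 - k)) & 1 for byte in payload for k in range(8)]
    if len(bits) > len(pixels):
        raise ValueError("payload does not fit in the cover image")
    out = list(pixels)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & ~1) | bit  # clear the lowest bit, then set it to the payload bit
    return out


def extract_lsb(pixels: List[int], n_bytes: int) -> bytes:
    """Read n_bytes back from the lowest bits of the first 8 * n_bytes values."""
    bits = [p & 1 for p in pixels[: 8 * n_bytes]]
    return bytes(
        sum(bit << (7 - k) for k, bit in enumerate(bits[8 * i: 8 * i + 8]))
        for i in range(n_bytes)
    )


# Toy cover "image": 64 channel values, enough for an 8-byte placeholder payload.
cover = list(range(100, 164))
payload = b"zkproof!"
stego = embed_lsb(cover, payload)
recovered = extract_lsb(stego, len(payload))
max_change = max(abs(a - b) for a, b in zip(cover, stego))
```

Each channel value changes by at most 1, which is why the mark is visually imperceptible. Plain LSB embedding is fragile under recompression or resizing, so a real deployment would need to account for that.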
[CV-42] Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs
Quick read: This paper addresses the performance bottleneck caused by the indirect representations that current multimodal large language models (MLLMs) rely on for vision tasks (for example, generating detection coordinates as text), and their difficulty in supporting dense prediction tasks such as segmentation. The key contribution is a unified paradigm, Patch-as-Decodable Token (PaDT). Its core innovation is Visual Reference Tokens (VRTs), which are derived from the visual patch embeddings of the query image and interleaved seamlessly with the LLM's output text tokens; a lightweight decoder then maps the LLM outputs to detection, segmentation, and grounding predictions. PaDT processes VRTs independently at each forward pass and dynamically expands the embedding table, improving localization and the ability to distinguish similar objects.
Link: https://arxiv.org/abs/2510.01954
Authors: Yongyi Su,Haojie Zhang,Shijie Li,Nanqing Liu,Jingyi Liao,Junyi Pan,Yuan Liu,Xiaofen Xing,Chong Sun,Chen Li,Nancy F. Chen,Shuicheng Yan,Xulei Yang,Xun Xu
Affiliations: South China University of Technology; Institute for Infocomm Research (I2R), A*STAR; WeChat Vision, Tencent Inc.; Foshan University; Nanyang Technological University; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages, 12 figures and 9 tables
Abstract:Multimodal large language models (MLLMs) have advanced rapidly in recent years. However, existing approaches for vision tasks often rely on indirect representations, such as generating coordinates as text for detection, which limits performance and prevents dense prediction tasks like segmentation. To overcome these challenges, we introduce Patch-as-Decodable Token (PaDT), a unified paradigm that enables MLLMs to directly generate both textual and diverse visual outputs. Central to PaDT are Visual Reference Tokens (VRTs), derived from visual patch embeddings of query images and interleaved seamlessly with the LLM’s output textual tokens. A lightweight decoder then transforms the LLM’s outputs into detection, segmentation, and grounding predictions. Unlike prior methods, PaDT processes VRTs independently at each forward pass and dynamically expands the embedding table, thus improving localization and differentiation among similar objects. We further tailor a training strategy for PaDT by randomly selecting VRTs for supervised fine-tuning and introducing a robust per-token cross-entropy loss. Our empirical studies across four visual perception and understanding tasks suggest that PaDT consistently achieves state-of-the-art performance, even compared with significantly larger MLLMs. The code is available at this https URL.
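[CV-43] ClustViT: Clustering-based Token Merging for Semantic Segmentation
Quick read: This paper addresses the limited applicability of Vision Transformers (ViTs) on real-world robotic systems, where the main bottleneck is the quadratic complexity of self-attention. To improve efficiency, existing methods merge tokens dynamically according to image complexity, but they perform poorly on dense prediction tasks such as semantic segmentation. The paper proposes ClustViT, whose key innovation is a trainable Cluster module that merges tokens inside the network, guided by pseudo-clusters generated from segmentation masks, followed by a Regenerator module that restores fine details for the downstream segmentation head. Across three datasets it achieves up to 2.18x fewer GFLOPs and 1.64x faster inference while keeping segmentation accuracy comparable to the baseline.
Link: https://arxiv.org/abs/2510.01948
Authors: Fabio Montello,Ronja Güldenring,Lazaros Nalpantidis
Affiliations: DTU - Technical University of Denmark
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IEEE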
Abstract:Vision Transformers can achieve high accuracy and strong generalization across various contexts, but their practical applicability on real-world robotic systems is limited due to their quadratic attention complexity. Recent works have focused on dynamically merging tokens according to the image complexity. Token merging works well for classification but is less suited to dense prediction. We propose ClustViT, where we expand upon the Vision Transformer (ViT) backbone and address semantic segmentation. Within our architecture, a trainable Cluster module merges similar tokens along the network guided by pseudo-clusters from segmentation masks. Subsequently, a Regenerator module restores fine details for downstream heads. Our approach achieves up to 2.18x fewer GFLOPs and 1.64x faster inference on three different datasets, with comparable segmentation accuracy. Our code and models will be made publicly available.
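The merge-then-restore pattern can be shown with a minimal, non-learned stand-in. In the sketch below, tokens that share a cluster id are averaged into one token, and the full-length sequence is later rebuilt by copying each cluster token back to its members. The paper's Cluster and Regenerator modules are trainable; the mean pooling, the copy-back step, and the function names here are simplifying assumptions.

```python
from typing import List, Tuple

Token = List[float]


def merge_tokens(tokens: List[Token],
                 cluster_ids: List[int]) -> Tuple[List[Token], List[int]]:
    """Average all tokens sharing a cluster id; return merged tokens and id order."""
    order = sorted(set(cluster_ids))
    merged = []
    for c in order:
        members = [t for t, cid in zip(tokens, cluster_ids) if cid == c]
        merged.append([sum(col) / len(members) for col in zip(*members)])
    return merged, order


def regenerate(merged: List[Token], order: List[int],
               cluster_ids: List[int]) -> List[Token]:
    """Naive unpooling: give every original position a copy of its cluster token."""
    index = {c: k for k, c in enumerate(order)}
    return [list(merged[index[cid]]) for cid in cluster_ids]


# Three tokens, the first two assigned to the same (hypothetical) pseudo-cluster.
tokens = [[1.0, 1.0], [3.0, 3.0], [10.0, 0.0]]
cluster_ids = [0, 0, 1]
merged, order = merge_tokens(tokens, cluster_ids)
restored = regenerate(merged, order, cluster_ids)
```

Because self-attention cost grows with the square of the token count, attention over the two merged tokens instead of the three originals already costs 4/9 as much in this toy case. The restored sequence has the original length, which a dense segmentation head needs.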
[CV-44] Foundation Visual Encoders Are Secretly Few-Shot Anomaly Detectors
Link: https://arxiv.org/abs/2510.01934
Authors: Guangyao Zhai,Yue Zhou,Xinyan Deng,Lars Heckler,Nassir Navab,Benjamin Busam
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 23 pages, 13 figures. Code is available at this https URL
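[CV-45] Automated Defect Detection for Mass-Produced Electronic Components Based on YOLO Object Detection Models
Quick read: This paper addresses the time-consuming, manual nature of conventional defect inspection of industrial components, aiming to ease the burden on quality inspectors and make quality control more efficient. The core challenge is the shortage of defective samples, which limits the training of deep learning models. The key idea is to combine a generative adversarial network (ConSinGAN) with YOLO-family object detectors: ConSinGAN generates a suitably sized set of defect images to enlarge the training set, improving detection accuracy and robustness. Experiments show that YOLOv7 with ConSinGAN outperforms the other YOLO versions and threshold-based approaches, with 95.50% accuracy and a 285 ms detection time, supporting deployment of an efficient automated defect detection system.
Link: https://arxiv.org/abs/2510.01914
Authors: Wei-Lung Mao,Chun-Chi Wang,Po-Heng Chou,Yen-Ting Liu
Affiliations: National Yunlin University of Science and Technology; Academia Sinica; National Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 12 pages, 16 figures, 7 tables, and published in IEEE Sensors Journal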
Abstract:Since the defect detection of conventional industry components is time-consuming and labor-intensive, it places a significant burden on quality inspection personnel and makes it difficult to manage product quality. In this paper, we propose an automated defect detection system for the dual in-line package (DIP) that is widely used in industry, using digital camera optics and a deep learning (DL)-based model. The two most common defect categories of DIP are examined: (1) surface defects, and (2) pin-leg defects. However, the lack of defective component images poses a challenge for detection tasks. To solve this problem, the ConSinGAN is used to generate a suitable-sized dataset for training and testing. Four varieties of the YOLO model are investigated (v3, v4, v7, and v9), both in isolation and with the ConSinGAN augmentation. The proposed YOLOv7 with ConSinGAN outperforms the other YOLO versions, with an accuracy of 95.50% and a detection time of 285 ms, and is far superior to threshold-based approaches. In addition, the supervisory control and data acquisition (SCADA) system is developed, and the associated sensor architecture is described. The proposed automated defect detection can be easily established for numerous types of defects or when defect data are insufficient.
[CV-46] Flow-Matching Guided Deep Unfolding for Hyperspectral Image Reconstruction
Link: https://arxiv.org/abs/2510.01912
Authors: Yi Ai,Yuanhao Cai,Yulun Zhang,Xiaokang Yang
Affiliations: Shanghai Jiao Tong University; Johns Hopkins University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
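[CV-47] Leveraging Prior Knowledge of Diffusion Model for Person Search
Quick read: This paper addresses two problems in person search: ImageNet pre-trained backbones capture spatial context and fine-grained identity cues poorly, and sharing one backbone feature between detection and re-identification creates conflicting optimization objectives. The key idea is to use a pre-trained diffusion model as prior knowledge in the DiffPS framework, which comprises three specialized modules: (i) a Diffusion-Guided Region Proposal Network (DGRPN) for more accurate person localization; (ii) a Multi-Scale Frequency Refinement Network (MSFRN) to mitigate shape bias; and (iii) a Semantic-Adaptive Feature Aggregation Network (SFAN) that fuses text-aligned diffusion features. Together they allow the two sub-tasks to be optimized jointly without sacrificing either.
Link: https://arxiv.org/abs/2510.01841
Authors: Giyeol Kim,Sooyoung Yang,Jihyong Oh,Myungjoo Kang,Chanho Eom
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: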
Abstract:Person search aims to jointly perform person detection and re-identification by localizing and identifying a query person within a gallery of uncropped scene images. Existing methods predominantly utilize ImageNet pre-trained backbones, which may be suboptimal for capturing the complex spatial context and fine-grained identity cues necessary for person search. Moreover, they rely on a shared backbone feature for both person detection and re-identification, leading to suboptimal features due to conflicting optimization objectives. In this paper, we propose DiffPS (Diffusion Prior Knowledge for Person Search), a novel framework that leverages a pre-trained diffusion model while eliminating the optimization conflict between two sub-tasks. We analyze key properties of diffusion priors and propose three specialized modules: (i) Diffusion-Guided Region Proposal Network (DGRPN) for enhanced person localization, (ii) Multi-Scale Frequency Refinement Network (MSFRN) to mitigate shape bias, and (iii) Semantic-Adaptive Feature Aggregation Network (SFAN) to leverage text-aligned diffusion features. DiffPS sets a new state-of-the-art on CUHK-SYSU and PRW.
[CV-48] Calibrating the Full Predictive Class Distribution of 3D Object Detectors for Autonomous Driving
Link: https://arxiv.org/abs/2510.01829
Authors: Cornelius Schröder,Marius-Raphael Schlüter,Markus Lienkamp
Affiliations: Institute for Automotive Engineering; Munich Institute of Robotics and Machine Intelligence; School of Engineering and Design; Technical University of Munich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-49] Pack and Force Your Memory: Long-form and Consistent Video Generation
Link: https://arxiv.org/abs/2510.01784
Authors: Xiaofei Wu,Guozhen Zhang,Zhiyong Xu,Yuan Zhou,Qinglin Lu,Xuming He
Affiliations: ShanghaiTech University; Tencent Hunyuan; Nanjing University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
[CV-50] LOBE-GS: Load-Balanced and Efficient 3D Gaussian Splatting for Large-Scale Scene Reconstruction
Link: https://arxiv.org/abs/2510.01767
Authors: Sheng-Hsiang Hung,Ting-Yu Yen,Wei-Fang Sun,Simon See,Shih-Hsuan Hung,Hung-Kuo Chu
Affiliations: National Tsing Hua University; NVIDIA AI Technology Center (NVAITC)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-51] Unsupervised Dynamic Feature Selection for Robust Latent Spaces in Vision Tasks
Link: https://arxiv.org/abs/2510.01758
Authors: Bruno Corcuera,Carlos Eiras-Franco,Brais Cancela
Affiliations: Universidade da Coruña; CITIC Research Center
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-52] PyramidStyler: Transformer-Based Neural Style Transfer with Pyramidal Positional Encoding and Reinforcement Learning
Quick read: This paper addresses the inefficiency and poor scalability of current CNN- and transformer-based style transfer models on complex styles and high-resolution inputs. The key contribution is PyramidStyler, a framework that uses Pyramidal Positional Encoding (PPE), a hierarchical multi-scale positional encoding that captures local detail and global context while reducing computational load, combined with reinforcement learning (RL) to optimize stylization dynamically and accelerate convergence. Experiments report a content loss of 2.07 and a style loss of 0.86 at about 1.39 s per inference, with further gains under RL optimization (content loss 2.03, style loss 0.75).
Link: https://arxiv.org/abs/2510.01715
Authors: Raahul Krishna Durairaju(1),K. Saruladha(2) ((1) California State University, Fullerton, (2) Puducherry Technological University)
Affiliations: California State University, Fullerton; Puducherry Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Neural Style Transfer (NST) has evolved from Gatys et al.'s (2015) CNN-based algorithm, enabling AI-driven artistic image synthesis. However, existing CNN and transformer-based models struggle to scale efficiently to complex styles and high-resolution inputs. We introduce PyramidStyler, a transformer framework with Pyramidal Positional Encoding (PPE): a hierarchical, multi-scale encoding that captures both local details and global context while reducing computational load. We further incorporate reinforcement learning to dynamically optimize stylization, accelerating convergence. Trained on Microsoft COCO and WikiArt, PyramidStyler reduces content loss by 62.6% (to 2.07) and style loss by 57.4% (to 0.86) after 4000 epochs, with 1.39 s inference, and yields further improvements (content 2.03; style 0.75) with minimal speed penalty (1.40 s) when using RL. These results demonstrate real-time, high-quality artistic rendering, with broad applications in media and design.
[CV-53] Holistic Order Prediction in Natural Scenes
Link: https://arxiv.org/abs/2510.01704
Authors: Pierre Musacchio,Hyunmin Lee,Jaesik Park
Affiliations: Seoul National University; LG AI Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 25 pages, 11 figures, 6 tables
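[CV-54] VaPR – Vision-language Preference alignment for Reasoning
Quick read: This paper addresses a gap in existing preference finetuning methods that use AI-generated feedback to align Large Vision-Language Models (LVLMs) with human preferences: they overlook the noise common in synthetic preference annotations, especially stylistic and length biases. The key contribution is a response-editing framework guided by a large language model (LLM) that generates hard-negative responses with targeted errors while keeping style and length close to the accepted responses. The framework is used to build the VaPR dataset (30K high-quality samples), whose effectiveness is validated on several LVLM families, with clear performance gains and a reduced tendency to answer "Yes" to binary questions.
Link: https://arxiv.org/abs/2510.01700
Authors: Rohan Wadhawan,Fabrice Y Harel-Canada,Zi-Yi Dou,Suhaila Shakiah,Robinson Piramuthu,Nanyun Peng
Affiliations: University of California Los Angeles; Amazon.com, Inc.
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: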
Abstract:Preference finetuning methods like Direct Preference Optimization (DPO) with AI-generated feedback have shown promise in aligning Large Vision-Language Models (LVLMs) with human preferences. However, existing techniques overlook the prevalence of noise in synthetic preference annotations in the form of stylistic and length biases. To this end, we introduce a hard-negative response generation framework based on LLM-guided response editing that produces rejected responses with targeted errors, maintaining stylistic and length similarity to the accepted ones. Using this framework, we develop the VaPR dataset, comprising 30K high-quality samples, to finetune three LVLM families: LLaVA-V1.5, Qwen2VL, and Qwen2.5VL (2B-13B sizes). Our VaPR models deliver significant performance improvements across ten benchmarks, achieving average gains of 6.5% (LLaVA), 4.0% (Qwen2VL), and 1.5% (Qwen2.5VL), with notable improvements on reasoning tasks. A scaling analysis shows that performance consistently improves with data size, with LLaVA models benefiting even at smaller scales. Moreover, VaPR reduces the tendency to answer “Yes” in binary questions, addressing a common failure mode in LVLMs like LLaVA. Lastly, we show that the framework generalizes to open-source LLMs as editors, with models trained on VaPR-OS achieving ~99% of the performance of models trained on VaPR, which is synthesized using GPT-4o. Our data, models, and code can be found on the project page this https URL
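Since the dataset is built for DPO-style finetuning, it may help to see the loss that consumes each (accepted, rejected) pair. The sketch below is the standard per-pair DPO objective written for scalar sequence log-probabilities; it is not taken from the VaPR code, the function name is hypothetical, and `beta=0.1` is just a commonly used value.

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to stay numerically stable.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss is log(2).
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Policy has moved toward the accepted response and away from the rejected one.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

Sequence log-probabilities are sums over tokens, so they scale with response length. That is one reason length-mismatched pairs can add noise, and it is consistent with the abstract's emphasis on keeping rejected responses similar in style and length to accepted ones.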
[CV-55] MedQ-Bench: Evaluating and Exploring Medical Image Quality Assessment Abilities in MLLMs
Link: https://arxiv.org/abs/2510.01691
Authors: Jiyao Liu,Jinjie Wei,Wanying Qu,Chenglong Ma,Junzhi Ning,Yunheng Li,Ying Chen,Xinzhe Luo,Pengcheng Chen,Xin Gao,Ming Hu,Huihui Xu,Xin Wang,Shujian Gao,Dingkang Yang,Zhongying Deng,Jin Ye,Lihao Liu,Junjun He,Ningsheng Xu
Affiliations: Fudan University; Shanghai Artificial Intelligence Laboratory; Imperial College London; University of Cambridge
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages, 13 figures
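[CV-56] FreeViS: Training-free Video Stylization with Inconsistent References
Quick read: This paper addresses two long-standing challenges in video stylization: applying image stylization frame by frame causes temporal inconsistency, and training a dedicated video stylization model usually requires paired video data and is computationally expensive. The key contribution is FreeViS, a training-free framework that integrates multiple stylized reference images into a pretrained image-to-video (I2V) model, mitigating the error propagation seen in earlier methods while avoiding flicker and stutter. It also uses high-frequency compensation to constrain content layout and motion, together with flow-based motion cues to preserve style textures in low-saliency regions, improving temporal consistency while keeping style fidelity high.
Link: https://arxiv.org/abs/2510.01686
Authors: Jiacong Xu,Yiqun Mei,Ke Zhang,Vishal M. Patel
Affiliations: Johns Hopkins University; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL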
Abstract:Video stylization plays a key role in content creation, but it remains a challenging problem. Naïvely applying image stylization frame-by-frame hurts temporal consistency and reduces style richness. Alternatively, training a dedicated video stylization model typically requires paired video data and is computationally expensive. In this paper, we propose FreeViS, a training-free video stylization framework that generates stylized videos with rich style details and strong temporal coherence. Our method integrates multiple stylized references to a pretrained image-to-video (I2V) model, effectively mitigating the propagation errors observed in prior works, without introducing flickers and stutters. In addition, it leverages high-frequency compensation to constrain the content layout and motion, together with flow-based motion cues to preserve style textures in low-saliency regions. Through extensive evaluations, FreeViS delivers higher stylization fidelity and superior temporal consistency, outperforming recent baselines and achieving strong human preference. Our training-free pipeline offers a practical and economic solution for high-quality, temporally coherent video stylization. The code and videos can be accessed via this https URL
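[CV-57] Uncovering Overconfident Failures in CXR Models via Augmentation-Sensitivity Risk Scoring
[Quick read]: This paper addresses the limited fairness and reliability of deep learning models for chest radiograph (CXR) interpretation, in particular their uneven performance across patient subgroups, which hides failures that aggregate metrics do not reveal. Existing error detection methods, based on confidence calibration or out-of-distribution detection, struggle with subtle within-distribution errors, and image- or representation-level consistency methods remain underexplored in medical imaging. The key to the solution is the augmentation-sensitivity risk scoring (ASRS) framework: it applies clinically plausible rotations (±15°/±30°) and measures the resulting embedding shift with the RAD-DINO encoder, quantifying each sample's sensitivity to perturbation. Samples are stratified into stability quartiles; highly sensitive samples show substantially lower recall (-0.2 to -0.3) despite high AUROC and confidence. This gives a label-free way to flag cases for selective prediction and clinician review, improving the fairness and safety of medical AI.
Link: https://arxiv.org/abs/2510.01683
Authors: Han-Jay Shu,Wei-Ning Chiu,Shun-Ting Chang,Meng-Ping Huang,Takeshi Tohyama,Ahram Han,Po-Chih Kuo
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 1 figure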
Abstract:Deep learning models achieve strong performance in chest radiograph (CXR) interpretation, yet fairness and reliability concerns persist. Models often show uneven accuracy across patient subgroups, leading to hidden failures not reflected in aggregate metrics. Existing error detection approaches, based on confidence calibration or out-of-distribution (OOD) detection, struggle with subtle within-distribution errors, while image- and representation-level consistency-based methods remain underexplored in medical imaging. We propose an augmentation-sensitivity risk scoring (ASRS) framework to identify error-prone CXR cases. ASRS applies clinically plausible rotations (±15°/±30°) and measures embedding shifts with the RAD-DINO encoder. Sensitivity scores stratify samples into stability quartiles, where highly sensitive cases show substantially lower recall (-0.2 to -0.3) despite high AUROC and confidence. ASRS provides a label-free means for selective prediction and clinician review, improving fairness and safety in medical AI.
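The scoring idea in this abstract can be sketched roughly as follows. This is an assumption-laden illustration, not the authors' implementation: `encode` and `rotate` are hypothetical callables standing in for the RAD-DINO encoder and an image rotation routine, cosine distance is an assumed choice of shift measure, and embeddings are assumed to be non-zero vectors.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def sensitivity_score(image, encode, rotate, angles=(-30, -15, 15, 30)):
    """Mean embedding shift between an image and its rotated copies."""
    base = encode(image)
    shifts = [cosine_distance(base, encode(rotate(image, a))) for a in angles]
    return sum(shifts) / len(shifts)

def stability_quartiles(scores):
    """Quartile index per sample: 0 = most stable, 3 = most sensitive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    quartile = [0] * len(scores)
    for rank, i in enumerate(order):
        quartile[i] = rank * 4 // len(scores)
    return quartile
```

No labels are needed at any point: samples in the top quartile would simply be routed to clinician review or withheld from automatic prediction.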
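[CV-58] Look Less Reason More: Rollout-Guided Adaptive Pixel-Space Reasoning
Link: https://arxiv.org/abs/2510.01681
Authors: Xuchen Li,Xuzhao Li,Jiahui Gao,Renjie Pi,Shiyu Hu,Wentao Zhang
Affiliations: CASIA (Institute of Automation, Chinese Academy of Sciences); UCAS (University of Chinese Academy of Sciences); ZGCA; HKU (The University of Hong Kong); HKUST (The Hong Kong University of Science and Technology); NTU (Nanyang Technological University); PKU (Peking University)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Preprint, Under review
[CV-59] An Efficient Deep Template Matching and In-Plane Pose Estimation Method via Template-Aware Dynamic Convolution
[Quick read]: This paper addresses efficient localization and geometric state estimation for template matching under complex backgrounds in industrial inspection and component alignment. Traditional methods enumerate angles and scales exhaustively and are therefore slow, while most deep learning methods output only similarity scores and do not model geometric pose explicitly, which makes them hard to deploy. The key to the solution is a lightweight end-to-end framework that recasts template matching as joint localization and geometric regression, directly outputting the target's center coordinates, rotation angle, and independent horizontal and vertical scale factors. A Template-Aware Dynamic Convolution Module (TDCM) injects template features dynamically at inference time to support generalizable matching; a rotation-shear-based augmentation strategy with structure-aware pseudo labels enables training without geometric annotations; and a lightweight refinement module improves angle and scale precision. The resulting model has 3.07M parameters and 14 ms inference latency while improving matching accuracy and robustness under compound transformations.
Link: https://arxiv.org/abs/2510.01678
Authors: Ke Jia,Ji Zhou,Hanxin Li,Zhigan Zhou,Haojie Chu,Xiaojie Li
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in Expert Systems with Applications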
Abstract:In industrial inspection and component alignment tasks, template matching requires efficient estimation of a target’s position and geometric state (rotation and scaling) under complex backgrounds to support precise downstream operations. Traditional methods rely on exhaustive enumeration of angles and scales, leading to low efficiency under compound transformations. Meanwhile, most deep learning-based approaches only estimate similarity scores without explicitly modeling geometric pose, making them inadequate for real-world deployment. To overcome these limitations, we propose a lightweight end-to-end framework that reformulates template matching as joint localization and geometric regression, outputting the center coordinates, rotation angle, and independent horizontal and vertical scales. A Template-Aware Dynamic Convolution Module (TDCM) dynamically injects template features at inference to guide generalizable matching. The compact network integrates depthwise separable convolutions and pixel shuffle for efficient matching. To enable geometric-annotation-free training, we introduce a rotation-shear-based augmentation strategy with structure-aware pseudo labels. A lightweight refinement module further improves angle and scale precision via local optimization. Experiments show our 3.07M model achieves high precision and 14ms inference under compound transformations. It also demonstrates strong robustness in small-template and multi-object scenarios, making it highly suitable for deployment in real-time industrial applications. The code is available at: this https URL.
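To make the regressed outputs concrete, the sketch below shows how a center, a rotation angle, and two independent scales determine where a template lands in the image. It is a geometric illustration only: the function name, the scale-then-rotate order, and the angle convention are assumptions, and the paper's own parameterization may differ.

```python
import math

def template_corners(cx, cy, angle_deg, sx, sy, width, height):
    """Corners of a width x height template after scaling by (sx, sy),
    rotating by angle_deg about its center, and placing the center at (cx, cy)."""
    t = math.radians(angle_deg)
    cos_t, sin_t = math.cos(t), math.sin(t)
    corners = []
    for dx, dy in ((-0.5, -0.5), (0.5, -0.5), (0.5, 0.5), (-0.5, 0.5)):
        # Scale in the template's own axes first (assumed order).
        x, y = dx * width * sx, dy * height * sy
        # Then rotate and translate into image coordinates.
        corners.append((cx + x * cos_t - y * sin_t, cy + x * sin_t + y * cos_t))
    return corners
```

A downstream alignment step could use these four corners directly, which is the practical advantage of regressing pose instead of a bare similarity score.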
[CV-60] Beyond Simple Fusion: Adaptive Gated Fusion for Robust Multimodal Sentiment Analysis
Link: https://arxiv.org/abs/2510.01677
Authors: Han Wu,Yanming Sun,Yunhe Yang,Derek F. Wong
Affiliations: unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-61] UniVerse: Unleashing the Scene Prior of Video Diffusion Models for Robust Radiance Field Reconstruction
Link: https://arxiv.org/abs/2510.01669
Authors: Jin Cao,Hongrui Wu,Ziyong Feng,Hujun Bao,Xiaowei Zhou,Sida Peng
Affiliations: State Key Lab of CAD&CG, Zhejiang University; Tongji University; DeepGlint
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: page: this https URL code: this https URL
[CV-62] Non-Rigid Structure-from-Motion via Differential Geometry with Recoverable Conformal Scale
[Quick read]: This paper addresses non-rigid structure-from-motion (NRSfM) for monocular visual deformable SLAM, focusing on insufficient reconstruction accuracy and robustness under conformal deformations. Existing methods usually rely on strict assumptions such as locally planar surfaces or locally linear deformations and cannot recover the conformal scale accurately, which leads to imprecise depth estimates. The key to the proposed Con-NRSfM method is point-wise reconstruction from selected 2D image warps optimized in a graph-based framework, which removes the dependence on these local geometric assumptions. It also decouples the constraints on depth and conformal scale for more precise depth estimation, uses a parallel separable iterative optimization strategy to improve numerical stability, and adds a self-supervised encoder-decoder network that generates dense textured 3D point clouds.
Link: https://arxiv.org/abs/2510.01665
Authors: Yongbo Chen,Yanhao Zhang,Shaifali Parashar,Liang Zhao,Shoudong Huang
Affiliations: University of Technology Sydney; Shanghai Jiao Tong University; Institut National des Sciences Appliquées de Lyon; University of Edinburgh
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:Non-rigid structure-from-motion (NRSfM), a promising technique for addressing the mapping challenges in monocular visual deformable simultaneous localization and mapping (SLAM), has attracted growing attention. We introduce a novel method, called Con-NRSfM, for NRSfM under conformal deformations, encompassing isometric deformations as a subset. Our approach performs point-wise reconstruction using 2D selected image warps optimized through a graph-based framework. Unlike existing methods that rely on strict assumptions, such as locally planar surfaces or locally linear deformations, and fail to recover the conformal scale, our method eliminates these constraints and accurately computes the local conformal scale. Additionally, our framework decouples constraints on depth and conformal scale, which are inseparable in other approaches, enabling more precise depth estimation. To address the sensitivity of the formulated problem, we employ a parallel separable iterative optimization strategy. Furthermore, a self-supervised learning framework, utilizing an encoder-decoder network, is incorporated to generate dense 3D point clouds with texture. Simulation and experimental results using both synthetic and real datasets demonstrate that our method surpasses existing approaches in terms of reconstruction accuracy and robustness. The code for the proposed method will be made publicly available on the project website: this https URL.
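[CV-63] Discrete Facial Encoding: A Framework for Data-driven Facial Display Discovery
[Quick read]: This paper addresses the limited coverage and costly manual annotation of traditional facial expression coding systems such as the Facial Action Coding System (FACS). The key to the solution is Discrete Facial Encoding (DFE), an unsupervised, data-driven method that learns a compact and interpretable dictionary of facial expressions from 3D mesh sequences with a Residual Vector Quantized Variational Autoencoder (RVQ-VAE). It first extracts identity-invariant expression features with a 3D Morphable Model (3DMM), disentangling factors such as head pose and facial geometry, and then encodes these features as discrete tokens from a shared codebook, where each token represents a specific, reusable facial deformation pattern. Experiments show that DFE outperforms FACS and strong image and video representation models on stress detection, personality prediction, and depression detection, indicating its potential as a scalable alternative to FACS.
Link: https://arxiv.org/abs/2510.01662
Authors: Minh Tran,Maksim Siniukov,Zhangyu Jin,Mohammad Soleymani
Affiliations: University of Southern California, Institute for Creative Technologies
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: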
Abstract:Facial expression analysis is central to understanding human behavior, yet existing coding systems such as the Facial Action Coding System (FACS) are constrained by limited coverage and costly manual annotation. In this work, we introduce Discrete Facial Encoding (DFE), an unsupervised, data-driven alternative that learns a compact and interpretable dictionary of facial expressions from 3D mesh sequences through a Residual Vector Quantized Variational Autoencoder (RVQ-VAE). Our approach first extracts identity-invariant expression features from images using a 3D Morphable Model (3DMM), effectively disentangling factors such as head pose and facial geometry. We then encode these features using an RVQ-VAE, producing a sequence of discrete tokens from a shared codebook, where each token captures a specific, reusable facial deformation pattern that contributes to the overall expression. Through extensive experiments, we demonstrate that Discrete Facial Encoding captures more precise facial behaviors than FACS and other facial encoding alternatives. We evaluate the utility of our representation across three high-level psychological tasks: stress detection, personality prediction, and depression detection. Using a simple Bag-of-Words model built on top of the learned tokens, our system consistently outperforms both FACS-based pipelines and strong image and video representation learning models such as Masked Autoencoders. Further analysis reveals that our representation covers a wider variety of facial displays, highlighting its potential as a scalable and effective alternative to FACS for psychological and affective computing applications.
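The two generic building blocks named in this abstract, residual vector quantization and a bag-of-words over the resulting tokens, can be sketched as below. This is a toy illustration under stated assumptions, not the DFE model: the codebooks here are hand-written lists (in an RVQ-VAE they are learned), all function names are hypothetical, and a shared codebook can be modeled by passing the same codebook for every stage.

```python
from collections import Counter

def nearest(codebook, vector):
    """Index of the codebook entry closest to vector (squared Euclidean distance)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vector))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(vector, codebooks):
    """Quantize stage by stage; each stage encodes the residual left by the previous one."""
    residual = list(vector)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens, residual

def rvq_decode(tokens, codebooks):
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

def bag_of_tokens(token_sequence, vocab_size):
    """Histogram of token ids, usable as input to a simple classifier."""
    counts = Counter(token_sequence)
    return [counts[i] for i in range(vocab_size)]
```

Each additional stage refines the approximation, so a feature vector becomes a short sequence of discrete ids that can then be counted per video.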
[CV-64] VirDA: Reusing Backbone for Unsupervised Domain Adaptation with Visual Reprogramming
Link: https://arxiv.org/abs/2510.01660
Authors: Duy Nguyen,Dat Nguyen
Affiliations: Hanoi University of Science and Technology; Harvard University; Basis.ai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-65] LadderMoE: Ladder-Side Mixture of Experts Adapters for Bronze Inscription Recognition
Link: https://arxiv.org/abs/2510.01651
Authors: Rixin Zhou,Peiqiang Qiu,Qian Zhang,Chuntao Li,Xi Yang
Affiliations: Jilin University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 7 figures, 2 Tables
[CV-66] FideDiff: Efficient Diffusion Model for High-Fidelity Image Motion Deblurring
Link: https://arxiv.org/abs/2510.01641
Authors: Xiaoyang Liu,Zhengyan Zhou,Zihang Xu,Jiezhang Cao,Zheng Chen,Yulun Zhang
Affiliations: Shanghai Jiao Tong University; Harvard University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-67] Joint Deblurring and 3D Reconstruction for Macrophotography
[Quick read]: This paper addresses the loss of image sharpness and the difficulty of high-quality 3D reconstruction caused by defocus blur in macrophotography. Traditional image deblurring methods need large numbers of images and annotations, and no multi-view 3D reconstruction method exists for macrophotography. The key to the solution is a joint deblurring and 3D reconstruction method: starting from multi-view blurry images, it uses differentiable rendering to jointly optimize a sharp 3D model of the object and a per-pixel defocus blur kernel, so that a small number of multi-view images suffices for both high-quality deblurring and high-fidelity 3D appearance reconstruction.
Link: https://arxiv.org/abs/2510.01640
Authors: Yifan Zhao,Liangchen Li,Yuqi Zhou,Kai Wang,Yan Liang,Juyong Zhang
Affiliations: University of Science and Technology of China; China Unicom
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to Pacific Graphics 2025. To be published in Computer Graphics Forum
Abstract:Macro lenses offer high resolution and large magnification, and 3D modeling of small, detailed objects can provide richer information. However, defocus blur in macrophotography is a long-standing problem that severely hinders clear imaging of the captured objects and their high-quality 3D reconstruction. Traditional image deblurring methods require a large number of images and annotations, and there is currently no multi-view 3D reconstruction method for macrophotography. In this work, we propose a joint deblurring and 3D reconstruction method for macrophotography. Starting from captured multi-view blurry images, we jointly optimize the clear 3D model of the object and the defocus blur kernel of each pixel. The entire framework adopts a differentiable rendering method to self-supervise the optimization of the 3D model and the defocus blur kernel. Extensive experiments show that from a small number of multi-view images, our proposed method can not only achieve high-quality image deblurring but also recover high-fidelity 3D appearance.
[CV-68] VLA-R1: Enhancing Reasoning in Vision-Language-Action Models
Link: https://arxiv.org/abs/2510.01623
Authors: Angen Ye,Zeyu Zhang,Boyuan Wang,Xiaofeng Wang,Dapeng Zhang,Zheng Zhu
Affiliations: GigaAI; CASIA; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
[CV-69] MPMAvatar: Learning 3D Gaussian Avatars with Accurate and Robust Physics-Based Dynamics NEURIPS2025
Link: https://arxiv.org/abs/2510.01619
Authors: Changmin Lee,Jihyun Lee,Tae-Kyun Kim
Affiliations: KAIST
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to NeurIPS 2025
[CV-70] Automated Genomic Interpretation via Concept Bottleneck Models for Medical Robotics
[Quick read]: This paper addresses the interpretability and clinical utility of automated decision systems in genomics, namely how to turn raw DNA sequences into actionable, verifiable medical decisions that can be integrated into medical automation and robotic systems. The key to the solution is a framework that combines Chaos Game Representation (CGR) with a Concept Bottleneck Model (CBM), forcing predictions to pass through biologically meaningful concepts such as GC content, CpG density, and k-mer motifs. Reliability is improved with concept fidelity supervision, prior consistency alignment, KL distribution matching, and uncertainty calibration. A cost-aware recommendation layer then converts predictions into decision policies that balance accuracy, calibration, and clinical utility, which improves resource use and reduces unnecessary retests.
Link: https://arxiv.org/abs/2510.01618
Authors: Zijun Li,Jinchang Zhang,Ming Zhang,Guoyu Lu
Affiliations: SUNY Binghamton University; Los Alamos National Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Other Quantitative Biology (q-bio.OT)
Comments:
Abstract:We propose an automated genomic interpretation module that transforms raw DNA sequences into actionable, interpretable decisions suitable for integration into medical automation and robotic systems. Our framework combines Chaos Game Representation (CGR) with a Concept Bottleneck Model (CBM), enforcing predictions to flow through biologically meaningful concepts such as GC content, CpG density, and k-mer motifs. To enhance reliability, we incorporate concept fidelity supervision, prior consistency alignment, KL distribution matching, and uncertainty calibration. Beyond accurate classification of HIV subtypes across both in-house and LANL datasets, our module delivers interpretable evidence that can be directly validated against biological priors. A cost-aware recommendation layer further translates predictive outputs into decision policies that balance accuracy, calibration, and clinical utility, reducing unnecessary retests and improving efficiency. Extensive experiments demonstrate that the proposed system achieves state-of-the-art classification performance, superior concept prediction fidelity, and more favorable cost-benefit trade-offs compared to existing baselines. By bridging the gap between interpretable genomic modeling and automated decision-making, this work establishes a reliable foundation for robotic and clinical automation in genomic medicine.
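As background for the representation and concepts named above, here is a small sketch of a Chaos Game Representation and two of the listed sequence statistics. The corner assignment follows a common CGR convention but is an assumption here, as are the function names and the choice to define CpG density as CG dinucleotides per dinucleotide position; the paper may define these differently.

```python
# Assumed corner convention for the unit square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(sequence):
    """Chaos Game Representation: start at the center and, for each base,
    move halfway toward that base's corner."""
    x, y = 0.5, 0.5
    points = []
    for base in sequence.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points

def gc_content(sequence):
    """Fraction of bases that are G or C."""
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def cpg_density(sequence):
    """CG dinucleotides divided by the number of dinucleotide positions."""
    seq = sequence.upper()
    return seq.count("CG") / (len(seq) - 1)
```

Binning the CGR points into a grid yields an image whose cells correspond to k-mer frequencies, which is what makes the representation convenient for vision-style models.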
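[CV-71] NPN: Non-Linear Projections of the Null-Space for Imaging Inverse Problems NEURIPS2025
[Quick read]: This paper addresses the ill-posedness of imaging inverse problems caused by undersampled, noisy measurements: infinitely many solutions exist in the null-space of the sensing matrix. Conventional methods constrain the solution space with handcrafted regularizers or learned models, but these priors usually ignore the task-specific structure of the null-space. The key to the solution is a new class of regularization, Non-Linear Projections of the Null-Space (NPN): rather than imposing structural constraints in the image domain, it uses a neural network to promote solutions that lie in a low-dimensional projection of the sensing matrix's null-space, capturing information that the sensing process cannot observe. The method is interpretable and flexible, comes with theoretical guarantees on convergence and reconstruction accuracy when used within plug-and-play methods, and consistently improves reconstruction fidelity across compressive sensing, deblurring, super-resolution, computed tomography (CT), and magnetic resonance imaging (MRI).
Link: https://arxiv.org/abs/2510.01608
Authors: Roman Jacome,Romario Gualdrón-Hurtado,Leon Suarez,Henry Arguello
Affiliations: Universidad Industrial de Santander
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP); Optimization and Control (math.OC)
Comments: 25 pages, 12 tables, 10 figures. Accepted to NeurIPS 2025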
Abstract:Imaging inverse problems aim to recover high-dimensional signals from undersampled, noisy measurements, a fundamentally ill-posed task with infinite solutions in the null-space of the sensing operator. To resolve this ambiguity, prior information is typically incorporated through handcrafted regularizers or learned models that constrain the solution space. However, these priors typically ignore the task-specific structure of that null-space. In this work, we propose Non-Linear Projections of the Null-Space (NPN), a novel class of regularization that, instead of enforcing structural constraints in the image domain, promotes solutions that lie in a low-dimensional projection of the sensing matrix’s null-space with a neural network. Our approach has two key advantages: (1) Interpretability: by focusing on the structure of the null-space, we design sensing-matrix-specific priors that capture information orthogonal to the signal components that are fundamentally blind to the sensing process. (2) Flexibility: NPN is adaptable to various inverse problems, compatible with existing reconstruction frameworks, and complementary to conventional image-domain priors. We provide theoretical guarantees on convergence and reconstruction accuracy when used within plug-and-play methods. Empirical results across diverse sensing matrices demonstrate that NPN priors consistently enhance reconstruction fidelity in various imaging inverse problems, such as compressive sensing, deblurring, super-resolution, computed tomography, and magnetic resonance imaging, with plug-and-play methods, unrolling networks, deep image prior, and diffusion models.
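The null-space ambiguity that motivates this work can be seen in a tiny example. The sketch below is not the NPN regularizer; it only illustrates, under the simplifying assumption that the rows of the sensing matrix `A` are orthonormal, that the component `x - A^T A x` of a signal leaves the measurements unchanged. All function names are hypothetical.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def null_space_component(A, x):
    """Part of x that A cannot see, assuming A has orthonormal rows:
    x_null = x - A^T (A x)."""
    back_projected = matvec(transpose(A), matvec(A, x))
    return [xi - bi for xi, bi in zip(x, back_projected)]

# A toy sensing matrix that observes only the first two of three coordinates.
A = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
x1 = [3.0, 4.0, 5.0]
x2 = [3.0, 4.0, -2.0]  # differs from x1 only in the unobserved coordinate
```

Here `x1` and `x2` produce identical measurements, so the data alone cannot distinguish them; a prior has to supply the missing null-space component, which is the part of the problem the abstract proposes to regularize.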
[CV-72] ActiveUMI: Robotic Manipulation with Active Perception from Robot-Free Human Demonstrations
Link: https://arxiv.org/abs/2510.01607
Authors: Qiyuan Zeng,Chengmeng Li,Jude St. John,Zhongyi Zhou,Junjie Wen,Guorui Feng,Yichen Zhu,Yi Xu
Affiliations: Shanghai University; Stanford University; Midea Group
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical report. The website is available at this https URL
[CV-73] ImageNet-Think-250K: A Large-Scale Synthetic Dataset for Multimodal Reasoning for Vision Language Models
Link: https://arxiv.org/abs/2510.01582
Authors: Krishna Teja Chitty-Venkata,Murali Emani
Affiliations: Argonne National Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Preprint
[CV-74] Guiding Multimodal Large Language Models with Blind and Low Vision People Visual Questions for Proactive Visual Interpretations ICCV2025
Link: https://arxiv.org/abs/2510.01576
Authors: Ricardo Gonzalez Penuela,Felipe Arias-Russi,Victor Capriles
Affiliations: Cornell Tech; Universidad de los Andes; Independent Researcher
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 7 pages, 2 figures, 2 tables, CV4A11y Workshop at ICCV 2025
[CV-75] Consistent Assistant Domains Transformer for Source-free Domain Adaptation
Link: https://arxiv.org/abs/2510.01559
Authors: Renrong Shao,Wei Zhang,Kangyang Luo,Qin Li,Jun Wang
Affiliations: Naval Medical University; Tsinghua University; East China Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-76] Robust Classification of Oral Cancer with Limited Training Data
Link: https://arxiv.org/abs/2510.01547
Authors: Akshay Bhagwan Sonawane,Lena D. Swamikannan,Lakshman Tamil
Affiliations: The University of Texas at Dallas
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
[CV-77] Growing Visual Generative Capacity for Pre-Trained MLLMs
[Quick read]: This paper addresses the difficulty unified multimodal large language models (MLLMs) have in achieving both semantic alignment and pixel-level fidelity across image understanding and generation. Existing approaches either use hybrid architectures (continuous embeddings combined with diffusion or flow-based models), which break the autoregressive paradigm, or are purely autoregressive but trade off semantic consistency against detail reconstruction. The key to the solution is Bridge, a pure autoregressive unified framework that extends a pretrained visual understanding model with generative ability through a Mixture-of-Transformers architecture, and introduces a semantic-to-pixel discrete representation that combines compact semantic tokens with fine-grained pixel tokens. With only a 7.9% increase in sequence length, this keeps strong language alignment while describing visual details precisely.
Link: https://arxiv.org/abs/2510.01546
Authors: Hanyu Wang,Jiaming Han,Ziyan Yang,Qi Zhao,Shanchuan Lin,Xiangyu Yue,Abhinav Shrivastava,Zhenheng Yang,Hao Chen
Affiliations: University of Maryland, College Park; CUHK MMLab; ByteDance
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL
Abstract:Multimodal large language models (MLLMs) extend the success of language models to visual understanding, and recent efforts have sought to build unified MLLMs that support both understanding and generation. However, constructing such models remains challenging: hybrid approaches combine continuous embeddings with diffusion or flow-based objectives, producing high-quality images but breaking the autoregressive paradigm, while pure autoregressive approaches unify text and image prediction over discrete visual tokens but often face trade-offs between semantic alignment and pixel-level fidelity. In this work, we present Bridge, a pure autoregressive unified MLLM that augments pre-trained visual understanding models with generative ability through a Mixture-of-Transformers architecture, enabling both image understanding and generation within a single next-token prediction framework. To further improve visual generation fidelity, we propose a semantic-to-pixel discrete representation that integrates compact semantic tokens with fine-grained pixel tokens, achieving strong language alignment and precise description of visual details with only a 7.9% increase in sequence length. Extensive experiments across diverse multimodal benchmarks demonstrate that Bridge achieves competitive or superior results in both understanding and generation benchmarks, while requiring less training data and reduced training time compared to prior unified MLLMs.
[CV-78] Towards Better Optimization For Listwise Preference in Diffusion Models
Link: https://arxiv.org/abs/2510.01540
Authors: Jiamu Bai,Xin Yu,Meilong Xu,Weitao Lu,Xin Pan,Kiwan Maeng,Daniel Kifer,Jian Wang,Yu Wang
Affiliations: Penn State University; TikTok Inc.; Stony Brook University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-79] MATCH: Multi-faceted Adaptive Topo-Consistency for Semi-Supervised Histopathology Segmentation NEURIPS2025
[Quick read]: This paper addresses how to capture meaningful semantic structure from unlabeled data in semi-supervised segmentation, particularly in histopathology image analysis where objects are densely distributed. The key to the solution is a topological consistency mechanism over multiple perturbed predictions, obtained through stochastic dropout and temporal training snapshots. A novel matching strategy that combines spatial overlap with global structural alignment matches corresponding topological features across predictions without ground-truth labels, which separates biologically plausible structures from transient noise, reduces topological errors, and makes the segmentations more robust and accurate.
Link: https://arxiv.org/abs/2510.01532
Authors: Meilong Xu,Xiaoling Hu,Shahira Abousamra,Chen Li,Chao Chen
Affiliations: Stony Brook University; Massachusetts General Hospital and Harvard Medical School; Department of Biomedical Data Science, Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 6 figures. Accepted by NeurIPS 2025
Abstract:In semi-supervised segmentation, capturing meaningful semantic structures from unlabeled data is essential. This is particularly challenging in histopathology image analysis, where objects are densely distributed. To address this issue, we propose a semi-supervised segmentation framework designed to robustly identify and preserve relevant topological features. Our method leverages multiple perturbed predictions obtained through stochastic dropouts and temporal training snapshots, enforcing topological consistency across these varied outputs. This consistency mechanism helps distinguish biologically meaningful structures from transient and noisy artifacts. A key challenge in this process is to accurately match the corresponding topological features across the predictions in the absence of ground truth. To overcome this, we introduce a novel matching strategy that integrates spatial overlap with global structural alignment, minimizing discrepancies among predictions. Extensive experiments demonstrate that our approach effectively reduces topological errors, resulting in more robust and accurate segmentations essential for reliable downstream analysis. Code is available at this https URL.
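[CV-80] WALT: Web Agents that Learn Tools
[Quick read]: This paper addresses the brittleness of current web agents on complex browser tasks: they depend on fine-grained UI interactions and resource-intensive large language model (LLM) reasoning, and so tend to fail under dynamic page layouts and long task horizons. The key to the solution is the WALT (Web Agents that Learn Tools) framework, which reverse-engineers a website's latent functionality into reusable tool interfaces (search, filter, sort, post, comment, and so on). Agents complete tasks through high-level operations instead of low-level clicks and typing, shifting the computational burden from error-prone step-by-step reasoning to reliable tool invocation and improving task success rate and robustness.
Link: https://arxiv.org/abs/2510.01524
Authors: Viraj Prabhu,Yutong Dai,Matthew Fernandez,Jing Gu,Krithika Ramakrishnan,Yanqi Luo,Silvio Savarese,Caiming Xiong,Junnan Li,Zeyuan Chen,Ran Xu
Affiliations: Salesforce AI Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: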
Abstract:Web agents promise to automate complex browser tasks, but current methods remain brittle – relying on step-by-step UI interactions and heavy LLM reasoning that break under dynamic layouts and long horizons. Humans, by contrast, exploit website-provided functionality through high-level operations like search, filter, and sort. We introduce WALT (Web Agents that Learn Tools), a framework that reverse-engineers latent website functionality into reusable invocable tools. Rather than hypothesizing ad-hoc skills, WALT exposes robust implementations of automations already designed into websites – spanning discovery (search, filter, sort), communication (post, comment, upvote), and content management (create, edit, delete). Tools abstract away low-level execution: instead of reasoning about how to click and type, agents simply call search(query) or create(listing). This shifts the computational burden from fragile step-by-step reasoning to reliable tool invocation. On VisualWebArena and WebArena, WALT achieves higher success with fewer steps and less LLM-dependent reasoning, establishing a robust and generalizable paradigm for browser automation.
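The contrast between step-by-step UI actions and tool calls such as `search(query)` or `create(listing)` can be illustrated with a toy tool registry. Everything here is hypothetical: `ToolRegistry`, the in-memory `LISTINGS` data, and the two tool bodies are stand-ins for illustration and are not the WALT implementation or API.

```python
class ToolRegistry:
    """Maps tool names to callables so an agent can invoke them by name."""

    def __init__(self):
        self._tools = {}

    def register(self, name):
        def decorator(fn):
            self._tools[name] = fn
            return fn
        return decorator

    def call(self, name, **kwargs):
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](**kwargs)

tools = ToolRegistry()
LISTINGS = ["red bicycle", "blue bicycle", "red sofa"]  # stand-in for site data

@tools.register("search")
def search(query):
    """Return listings containing the query string."""
    return [item for item in LISTINGS if query in item]

@tools.register("create")
def create(listing):
    """Add a listing and return its index."""
    LISTINGS.append(listing)
    return len(LISTINGS) - 1
```

An agent that emits `tools.call("search", query="red")` makes a single decision, whereas the click-and-type equivalent requires locating a search box, typing, and submitting, each of which can fail when the page layout changes.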
[CV-81] AortaDiff: A Unified Multitask Diffusion Framework For Contrast-Free AAA Imaging
Link: https://arxiv.org/abs/2510.01498
Authors: Yuxuan Ou,Ning Bi,Jiazhen Pan,Jiancheng Yang,Boliang Yu,Usama Zidan,Regent Lee,Vicente Grau
Affiliations: University of Oxford; Technical University of Munich; ELLIS Institute Finland; Aalto University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
[CV-82] Purrception: Variational Flow Matching for Vector-Quantized Image Generation
Link: https://arxiv.org/abs/2510.01478
Authors: Răzvan-Andrei Matişan,Vincent Tao Hu,Grigory Bartosh,Björn Ommer,Cees G. M. Snoek,Max Welling,Jan-Willem van de Meent,Mohammad Mahdi Derakhshani,Floor Eijkelboom
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
[CV-83] Data Selection for Fine-tuning Vision Language Models via Cross Modal Alignment Trajectories
Link: https://arxiv.org/abs/2510.01454
Authors: Nilay Naharas,Dang Nguyen,Nesihan Bulut,Mohammadhossein Bateni,Vahab Mirrokni,Baharan Mirzasoleiman
Affiliations: University of California Los Angeles; Google Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 30 pages, 10 figures, 5 tables, link: this https URL
[CV-84] GeoSURGE: Geo-localization using Semantic Fusion with Hierarchy of Geographic Embeddings
Link: https://arxiv.org/abs/2510.01448
Authors: Angel Daruna,Nicholas Meegan,Han-Pang Chiu,Supun Samarasekera,Rakesh Kumar
Affiliations: SRI International
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: preprint under review
[CV-85] On the Role of Domain Experts in Creating Effective Tutoring Systems
Link: https://arxiv.org/abs/2510.01432
Authors: Sarath Sreedharan,Kelsey Sikes,Nathaniel Blanchard,Lisa Mason,Nikhil Krishnaswamy,Jill Zarestky
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AIED 2025 Blue Sky Track
[CV-86] Ultra-Efficient Decoding for End-to-End Neural Compression and Reconstruction NEURIPS2025
Link: https://arxiv.org/abs/2510.01407
Authors: Ethan G. Rogers,Cheng Wang
Affiliations: Iowa State University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 4 figures, NeurIPS 2025 Workshop MLForSys
[CV-87] DisCo: Reinforcement with Diversity Constraints for Multi-Human Generation
Link: https://arxiv.org/abs/2510.01399
Authors: Shubhankar Borse,Farzad Farhadzadeh,Munawar Hayat,Fatih Porikli
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-88] VENTURA: Adapting Image Diffusion Models for Unified Task Conditioned Navigation
Link: https://arxiv.org/abs/2510.01388
Authors: Arthur Zhang,Xiangyun Meng,Luca Calliari,Dong-Ki Kim,Shayegan Omidshafiei,Joydeep Biswas,Ali Agha,Amirreza Shaban
Affiliations: unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 6 figures, 3 tables
[CV-89] SPUS: A Lightweight and Parameter-Efficient Foundation Model for PDEs
Link: https://arxiv.org/abs/2510.01370
Authors: Abu Bucker Siddik,Diane Oyen,Alexander Most,Michal Kucer,Ayan Biswas
Affiliations: Los Alamos National Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments:
[CV-90] EvoStruggle: A Dataset Capturing the Evolution of Struggle across Activities and Skill Levels
Link: https://arxiv.org/abs/2510.01362
Authors: Shijia Feng,Michael Wray,Walterio Mayol-Cuevas
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages
[CV-91] Image Generation Based on Image Style Extraction
链接: https://arxiv.org/abs/2510.01347
作者: Shuochen Chang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-92] LVTINO: LAtent Video consisTency INverse sOlver for High Definition Video Restoration
链接: https://arxiv.org/abs/2510.01339
作者: Alessio Spagnoletti,Andrés Almansa,Marcelo Pereyra
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: 23 pages, 12 figures
[CV-93] From 2D to 3D Deep Learning-based Shape Reconstruction in Magnetic Resonance Imaging: A Review
Link: https://arxiv.org/abs/2510.01296
Authors: Emma McMillian, Abhirup Banerjee, Alfonso Bueno-Orovio
Affiliations: University of Oxford
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-94] Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation
Link: https://arxiv.org/abs/2510.01284
Authors: Chetwin Low, Weimin Wang, Calder Katyal
Affiliations: Character.ai; Yale University
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:
[CV-95] Development and Evaluation of an AI-Driven Telemedicine System for Prenatal Healthcare MICCAI2025
Link: https://arxiv.org/abs/2510.01194
Authors: Juan Barrientos, Michaelle Pérez, Douglas González, Favio Reyna, Julio Fajardo, Andrea Lara
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted at MICCAI 2025 MIRASOL Workshop, 10 pages, 5 figures
[CV-96] Measurement-Guided Consistency Model Sampling for Inverse Problems
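[Quick read]: This paper addresses the slow multi-step sampling that makes diffusion models hard to deploy for inverse imaging problems. Consistency models enable efficient generation, but their direct application to inverse-problem reconstruction remains underexplored. The key to the solution is a modified consistency sampling method: a measurement-consistency mechanism tied to the measurement operator guides the sampler's stochasticity, so that a handful of steps suffices to keep the reconstruction faithful to the acquired measurements while retaining the efficiency of consistency-based generation. Experiments show improvements over baseline consistency sampling in both perceptual and pixel-level metrics.
Link: https://arxiv.org/abs/2510.02208
Authors: Amirreza Tanevardi, Pooria Abbas Rad Moghadam, Sajjad Amini
Affiliations: Sharif University of Technology
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 5 pages, 3 figures, submitted to IEEE Signal Processing Letters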
Abstract:Diffusion models have become powerful generative priors for solving inverse imaging problems, but their reliance on slow multi-step sampling limits practical deployment. Consistency models address this bottleneck by enabling high-quality generation in a single or only a few steps, yet their direct adaptation to inverse problems is underexplored. In this paper, we present a modified consistency sampling approach tailored for inverse problem reconstruction: the sampler’s stochasticity is guided by a measurement-consistency mechanism tied to the measurement operator, which enforces fidelity to the acquired measurements while retaining the efficiency of consistency-based generation. Experiments on Fashion-MNIST and LSUN Bedroom datasets demonstrate consistent improvements in perceptual and pixel-level metrics, including Fréchet Inception Distance, Kernel Inception Distance, peak signal-to-noise ratio, and structural similarity index measure, compared to baseline consistency sampling, yielding competitive or superior reconstructions with only a handful of steps.
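To make "fidelity to the acquired measurements" concrete, here is a minimal sketch of the simplest possible case: an inpainting-style masking operator, where consistency is enforced by overwriting measured pixels with their measured values. This is a generic hard data-consistency projection, not the paper's measurement-guided stochasticity; the function name `enforce_measurements` and all values are hypothetical.

```python
def enforce_measurements(x_hat, y, mask):
    """Overwrite measured pixels of x_hat with their measured values.

    x_hat: current reconstruction (flat list of floats), e.g. a sampler output
    y:     measurements, meaningful only where mask is 1
    mask:  1 where a pixel was measured, 0 where it is missing
    """
    return [yi if m else xi for xi, yi, m in zip(x_hat, y, mask)]


# Hypothetical toy data: four pixels, the first two were measured.
x_hat = [0.2, 0.8, 0.5, 0.1]
y = [0.0, 1.0, 0.0, 0.0]
mask = [1, 1, 0, 0]

x_consistent = enforce_measurements(x_hat, y, mask)
print(x_consistent)  # [0.0, 1.0, 0.5, 0.1]
```

In a few-step sampler, a step like this would typically run between denoising steps so that unmeasured pixels come from the generative prior while measured ones stay pinned to the data.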
[CV-97] Uncovering Semantic Selectivity of Latent Groups in Higher Visual Cortex with Mutual Information-Guided Diffusion
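[Quick read]: This paper tackles the core question of how neural populations in higher visual cortex distribute and organize feature-specific visual information, and in particular whether those features live in structured, semantically meaningful subspaces. The key to the solution is MIG-Vis: a variational autoencoder (VAE) first infers a set of group-wise disentangled neural latent subspaces from the neural population, and a mutual information (MI)-guided diffusion synthesis procedure then visualizes the visual-semantic features encoded by each latent group. Validated on multi-session spiking data from macaque inferior temporal (IT) cortex, the method reveals latent groups with clear semantic selectivity for diverse visual features, including object pose, inter-category transformations, and intra-class content, providing direct and interpretable evidence of structured semantic representation in higher visual cortex.
Link: https://arxiv.org/abs/2510.02182
Authors: Yule Wang, Joseph Yu, Chengrui Li, Weihan Li, Anqi Wu
Affiliations: Georgia Institute of Technology
Categories: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: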
Abstract:Understanding how neural populations in higher visual areas encode object-centered visual information remains a central challenge in computational neuroscience. Prior works have investigated representational alignment between artificial neural networks and the visual cortex. Nevertheless, these findings are indirect and offer limited insights to the structure of neural populations themselves. Similarly, decoding-based methods have quantified semantic features from neural populations but have not uncovered their underlying organizations. This leaves open a scientific question: “how feature-specific visual information is distributed across neural populations in higher visual areas, and whether it is organized into structured, semantically meaningful subspaces.” To tackle this problem, we present MIG-Vis, a method that leverages the generative power of diffusion models to visualize and validate the visual-semantic attributes encoded in neural latent subspaces. Our method first uses a variational autoencoder to infer a group-wise disentangled neural latent subspace from neural populations. Subsequently, we propose a mutual information (MI)-guided diffusion synthesis procedure to visualize the specific visual-semantic features encoded by each latent group. We validate MIG-Vis on multi-session neural spiking datasets from the inferior temporal (IT) cortex of two macaques. The synthesized results demonstrate that our method identifies neural latent groups with clear semantic selectivity to diverse visual features, including object pose, inter-category transformations, and intra-class content. These findings provide direct, interpretable evidence of structured semantic representation in the higher visual cortex and advance our understanding of its encoding principles.
[CV-98] SpurBreast: A Curated Dataset for Investigating Spurious Correlations in Real-world Breast MRI Classification MICCAI
Link: https://arxiv.org/abs/2510.02109
Authors: Jong Bum Won, Wesley De Neve, Joris Vankerschaver, Utku Ozbulak
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication in the 28th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), 2025
[CV-99] A Multicentric Dataset for Training and Benchmarking Breast Cancer Segmentation in HE Slides
Link: https://arxiv.org/abs/2510.02037
Authors: Carlijn Lems, Leslie Tessier, John-Melle Bokhorst, Mart van Rijthoven, Witali Aswolinskiy, Matteo Pozzi, Natalie Klubickova, Suzanne Dintzis, Michela Campora, Maschenka Balkenhol, Peter Bult, Joey Spronck, Thomas Detone, Mattia Barbareschi, Enrico Munari, Giuseppe Bogina, Jelle Wesseling, Esther H. Lips, Francesco Ciompi, Frédérique Meeuwsen, Jeroen van der Laak
Affiliations: Radboud University Medical Center; Fondazione Bruno Kessler; University of Trento; Biopticka Laboratory Ltd.; Charles University; University of Washington Medical Center; Santa Chiara Hospital, APSS; CISMed - Centre for Medical Sciences; University and Hospital Trust of Verona; IRCCS Sacro Cuore Don Calabria Hospital; Netherlands Cancer Institute; Leiden University Medical Center; Linköping University
Categories: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Our dataset is available at this https URL, our code is available at this https URL, and our benchmark is available at this https URL
[CV-100] GFSR-Net: Guided Focus via Segment-Wise Relevance Network for Interpretable Deep Learning in Medical Imaging
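[Quick read]: This paper addresses the lack of interpretability that limits clinical adoption of deep learning in medical image analysis. Current models can predict accurately yet often give no sound rationale, and may rely on visual cues unrelated to the disease (such as annotations), which erodes clinicians' trust and raises the risk of misdiagnosis. The key to the solution is the Guided Focus via Segment-Wise Relevance Network (GFSR-Net), which uses a small number of human annotations to approximate where a person would look in an image, without precise boundaries or exhaustive markings. During training, a guidance mechanism progressively steers the model toward diagnostically meaningful regions, improving interpretability and reliability across several types of medical and natural images.
Link: https://arxiv.org/abs/2510.01919
Authors: Jhonatan Contreras, Thomas Bocklitz
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Data Analysis, Statistics and Probability (physics.data-an)
Comments: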
Abstract:Deep learning has achieved remarkable success in medical image analysis, however its adoption in clinical practice is limited by a lack of interpretability. These models often make correct predictions without explaining their reasoning. They may also rely on image regions unrelated to the disease or visual cues, such as annotations, that are not present in real-world conditions. This can reduce trust and increase the risk of misleading diagnoses. We introduce the Guided Focus via Segment-Wise Relevance Network (GFSR-Net), an approach designed to improve interpretability and reliability in medical imaging. GFSR-Net uses a small number of human annotations to approximate where a person would focus within an image intuitively, without requiring precise boundaries or exhaustive markings, making the process fast and practical. During training, the model learns to align its focus with these areas, progressively emphasizing features that carry diagnostic meaning. This guidance works across different types of natural and medical images, including chest X-rays, retinal scans, and dermatological images. Our experiments demonstrate that GFSR achieves comparable or superior accuracy while producing saliency maps that better reflect human expectations. This reduces the reliance on irrelevant patterns and increases confidence in automated diagnostic tools.
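[CV-101] Towards Photonic Band Diagram Generation with Transformer-Latent Diffusion Models
[Quick read]: This paper addresses the high cost of computing band diagrams (BDs), which describe light propagation in photonic crystals, and in particular the numerical bottleneck of repeatedly solving Maxwell's equations inside inverse-design optimization loops. The key to the solution is the first diffusion-model-based approach to BD generation, which couples a transformer encoder with a latent diffusion model: the encoder extracts contextual embeddings from the input structure, and the diffusion model generates the corresponding BD from them. Beyond aiming to reduce computational cost, the authors discuss why this design is well suited to the complex interference and scattering phenomena of photonics, pointing toward new surrogate modeling strategies for the field.
Link: https://arxiv.org/abs/2510.01749
Authors: Valentin Delchevalerie, Nicolas Roy, Arnaud Bougaham, Alexandre Mayer, Benoît Frénay, Michaël Lobet
Affiliations: University of Namur; Cenaero
Categories: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV)
Comments: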
Abstract:Photonic crystals enable fine control over light propagation at the nanoscale, and thus play a central role in the development of photonic and quantum technologies. Photonic band diagrams (BDs) are a key tool to investigate light propagation into such inhomogeneous structured materials. However, computing BDs requires solving Maxwell’s equations across many configurations, making it numerically expensive, especially when embedded in optimization loops for inverse design techniques, for example. To address this challenge, we introduce the first approach for BD generation based on diffusion models, with the capacity to later generalize and scale to arbitrary three dimensional structures. Our method couples a transformer encoder, which extracts contextual embeddings from the input structure, with a latent diffusion model to generate the corresponding BD. In addition, we provide insights into why transformers and diffusion models are well suited to capture the complex interference and scattering phenomena inherent to photonics, paving the way for new surrogate modeling strategies in this domain.
[CV-102] Median2Median: Zero-shot Suppression of Structured Noise in Images
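[Quick read]: This paper addresses the difficulty of denoising real-world images corrupted by structured noise with strong anisotropic correlations, which is common in medical imaging and computer vision and which existing methods handle poorly. Data-driven methods depend on large, high-quality labeled datasets and generalize poorly; zero-shot methods need no training data but work only for independent and identically distributed (i.i.d.) noise. The proposed Median2Median (M2M) framework centers on a novel sampling strategy: it generates pseudo-independent sub-image pairs from a single noisy image, uses directional interpolation and generalized median filtering to adaptively exclude pixel values contaminated by structured artifacts, and adds a randomized assignment strategy to enlarge the effective sampling space and remove systematic bias, so that the sampled pairs are suitable for Noise2Noise training. The authors present this as a first step toward effective zero-shot denoising beyond the strict i.i.d. assumption.
Link: https://arxiv.org/abs/2510.01666
Authors: Jianxu Wang, Ge Wang
Affiliations: Rensselaer Polytechnic Institute
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
Comments: 13 pages, 6 figures, not published yet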
Abstract:Image denoising is a fundamental problem in computer vision and medical imaging. However, real-world images are often degraded by structured noise with strong anisotropic correlations that existing methods struggle to remove. Most data-driven approaches rely on large datasets with high-quality labels and still suffer from limited generalizability, whereas existing zero-shot methods avoid this limitation but remain effective only for independent and identically distributed (i.i.d.) noise. To address this gap, we propose Median2Median (M2M), a zero-shot denoising framework designed for structured noise. M2M introduces a novel sampling strategy that generates pseudo-independent sub-image pairs from a single noisy input. This strategy leverages directional interpolation and generalized median filtering to adaptively exclude values distorted by structured artifacts. To further enlarge the effective sampling space and eliminate systematic bias, a randomized assignment strategy is employed, ensuring that the sampled sub-image pairs are suitable for Noise2Noise training. In our realistic simulation studies, M2M performs on par with state-of-the-art zero-shot methods under i.i.d. noise, while consistently outperforming them under correlated noise. These findings establish M2M as an efficient, data-free solution for structured noise suppression and mark the first step toward effective zero-shot denoising beyond the strict i.i.d. assumption.
[CV-103] Aligning Video Models with Human Social Judgments via Behavior-Guided Fine-Tuning
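[Quick read]: This paper addresses the mismatch between pretrained video models and the similarity structure of human intuition in social perception tasks, and explores how human behavioral data can guide models toward video representations closer to human social cognition. The authors introduce a benchmark of over 49,000 human odd-one-out similarity judgments and find a modality gap: caption-based language embeddings align with human similarity better than any pretrained video model. To close it, they fine-tune a TimeSformer model with low-rank adaptation (LoRA) on a hybrid objective that combines a triplet loss with representational similarity analysis (triplet-RSA), which significantly improves similarity prediction on held-out videos and strengthens the encoding of social-affective attributes.
Link: https://arxiv.org/abs/2510.01502
Authors: Kathy Garcia, Leyla Isik
Affiliations: Johns Hopkins University
Categories: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 15 pages total, 4 figures. Includes 1 algorithm and 2 tables in the appendix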
Abstract:Humans intuitively perceive complex social signals in visual scenes, yet it remains unclear whether state-of-the-art AI models encode the same similarity structure. We study (Q1) whether modern video and language models capture human-perceived similarity in social videos, and (Q2) how to instill this structure into models using human behavioral data. To address this, we introduce a new benchmark of over 49,000 odd-one-out similarity judgments on 250 three-second video clips of social interactions, and discover a modality gap: despite the task being visual, caption-based language embeddings align better with human similarity than any pretrained video model. We close this gap by fine-tuning a TimeSformer video model on these human judgments with our novel hybrid triplet-RSA objective using low-rank adaptation (LoRA), aligning pairwise distances to human similarity. This fine-tuning protocol yields significantly improved alignment with human perceptions on held-out videos in terms of both explained variance and odd-one-out triplet accuracy. Variance partitioning shows that the fine-tuned video model increases shared variance with language embeddings and explains additional unique variance not captured by the language model. Finally, we test transfer via linear probes and find that human-similarity fine-tuning strengthens the encoding of social-affective attributes (intimacy, valence, dominance, communication) relative to the pretrained baseline. Overall, our findings highlight a gap in pretrained video models’ social recognition and demonstrate that behavior-guided fine-tuning shapes video representations toward human social perception.
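For readers unfamiliar with the odd-one-out metric mentioned in the abstract, the following is a minimal sketch of how triplet accuracy can be computed from embeddings: for each triplet, the model's "odd" item is the one left out of the most similar pair. The embeddings, judgments, and function names are hypothetical toy values, and cosine similarity is an assumption rather than the paper's confirmed choice.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_odd_one_out(emb, triplet):
    """Return the item excluded from the most similar pair of the triplet."""
    i, j, k = triplet
    # Each candidate is (similarity of the remaining pair, odd item).
    candidates = [
        (cosine(emb[j], emb[k]), i),
        (cosine(emb[i], emb[k]), j),
        (cosine(emb[i], emb[j]), k),
    ]
    return max(candidates)[1]


def triplet_accuracy(emb, judgments):
    """Fraction of triplets where the model agrees with the human choice."""
    hits = sum(predict_odd_one_out(emb, t) == odd for t, odd in judgments)
    return hits / len(judgments)


# Hypothetical 2-D embeddings for four clips and three human judgments,
# each stored as ((i, j, k), human_odd_item).
emb = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [0.1, 0.9]}
judgments = [((0, 1, 2), 2), ((0, 2, 3), 0), ((1, 2, 3), 2)]

accuracy = triplet_accuracy(emb, judgments)  # model agrees on 2 of 3 triplets
```

The third judgment is deliberately one the toy embeddings get wrong, which illustrates the kind of disagreement that fine-tuning on human judgments is meant to reduce.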
[CV-104] An Efficient Quality Metric for Video Frame Interpolation Based on Motion-Field Divergence
Link: https://arxiv.org/abs/2510.01361
Authors: Conall Daly, Darren Ramsook, Anil Kokaram
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: IEEE 17th International Conference on Quality of Multimedia Experience 2025 accepted manuscript, 7 pages
[CV-105] MorphGen: Controllable and Morphologically Plausible Generative Cell-Imaging
Link: https://arxiv.org/abs/2510.01298
Authors: Berker Demirel, Marco Fumero, Theofanis Karaletsos, Francesco Locatello
Affiliations: Unknown
Categories: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
[CV-106] JaneEye: A 12-nm 2K-FPS 18.9-μJ/Frame Event-based Eye Tracking Accelerator
Link: https://arxiv.org/abs/2510.01213
Authors: Tao Han, Ang Li, Qinyu Chen, Chang Gao
Affiliations: Delft University of Technology; Leiden University
Categories: Signal Processing (eess.SP); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Image and Video Processing (eess.IV)
Comments: Accepted to the 2026 IEEE 31st Asia and South Pacific Design Automation Conference (ASP-DAC)
Artificial Intelligence
[AI-0] Diffusion Models and the Manifold Hypothesis: Log-Domain Smoothing is Geometry Adaptive
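[Quick read]: This paper seeks to explain why diffusion models generalize so well across diverse data domains, and in particular whether this is tied to low-dimensional manifold structure in the data. Working within the score matching formulation of the learning problem, it examines how implicit regularization leads the model to smooth along the data manifold. Theoretical and empirical results show that smoothing minimizers of the empirical score matching objective is equivalent to smoothing in the log-density domain, which produces smoothing tangential to the data manifold. The authors further show that choosing an appropriate smoothing controls the manifold along which the diffusion model generalizes.
Link: https://arxiv.org/abs/2510.02305
Authors: Tyler Farghly, Peter Potaptchik, Samuel Howard, George Deligiannidis, Jakiw Pidstrigach
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: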
Abstract:Diffusion models have achieved state-of-the-art performance, demonstrating remarkable generalisation capabilities across diverse domains. However, the mechanisms underpinning these strong capabilities remain only partially understood. A leading conjecture, based on the manifold hypothesis, attributes this success to their ability to adapt to low-dimensional geometric structure within the data. This work provides evidence for this conjecture, focusing on how such phenomena could result from the formulation of the learning problem through score matching. We inspect the role of implicit regularisation by investigating the effect of smoothing minimisers of the empirical score matching objective. Our theoretical and empirical results confirm that smoothing the score function – or equivalently, smoothing in the log-density domain – produces smoothing tangential to the data manifold. In addition, we show that the manifold along which the diffusion model generalises can be controlled by choosing an appropriate smoothing.
[AI-1] Addressing Pitfalls in the Evaluation of Uncertainty Estimation Methods for Natural Language Generation
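[Quick read]: This paper addresses the evaluation of methods for detecting hallucinations in large language models (LLMs), specifically confabulations, a type of hallucination that arises from predictive uncertainty. Existing methods detect confabulations by estimating predictive uncertainty in natural language generation (NLG), and they are evaluated on question-answering (QA) datasets by correlating uncertainty with the correctness of the generated text. Commonly used approximate correctness functions disagree substantially, however, so the measured performance of uncertainty estimation (UE) methods can be biased or even artificially inflated. The key to the solution is a set of alternative risk indicators: marginalizing over multiple LLM-as-a-judge variants to reduce evaluation bias, and using structured tasks, out-of-distribution detection, and perturbation detection to obtain more robust and controllable risk signals. The authors also propose an Elo rating to summarize and rank UE methods objectively across many evaluation settings.
Link: https://arxiv.org/abs/2510.02279
Authors: Mykyta Ielanskyi, Kajetan Schweighofer, Lukas Aichberger, Sepp Hochreiter
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: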
Abstract:Hallucinations are a common issue that undermine the reliability of large language models (LLMs). Recent studies have identified a specific subset of hallucinations, known as confabulations, which arise due to predictive uncertainty of LLMs. To detect confabulations, various methods for estimating predictive uncertainty in natural language generation (NLG) have been developed. These methods are typically evaluated by correlating uncertainty estimates with the correctness of generated text, with question-answering (QA) datasets serving as the standard benchmark. However, commonly used approximate correctness functions have substantial disagreement between each other and, consequently, in the ranking of the uncertainty estimation methods. This allows one to inflate the apparent performance of uncertainty estimation methods. We propose using several alternative risk indicators for risk correlation experiments that improve robustness of empirical assessment of UE algorithms for NLG. For QA tasks, we show that marginalizing over multiple LLM-as-a-judge variants leads to reducing the evaluation biases. Furthermore, we explore structured tasks as well as out of distribution and perturbation detection tasks which provide robust and controllable risk indicators. Finally, we propose to use an Elo rating of uncertainty estimation methods to give an objective summarization over extensive evaluation settings.
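As a rough illustration of how an Elo rating could summarize pairwise comparisons between uncertainty estimation methods, here is a minimal sketch of the standard Elo update. Treating "method A scored higher than method B in one evaluation setting" as a win is an assumption made for this example, and the method names, K-factor, and comparison list are hypothetical; the paper's exact protocol may differ.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of a player rated r_a against r_b."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def elo_update(ratings, winner, loser, k=32.0):
    """Move rating points from loser to winner; total rating is conserved."""
    gain = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += gain
    ratings[loser] -= gain


# Hypothetical UE methods, all starting from the same rating.
ratings = {"ue_a": 1000.0, "ue_b": 1000.0, "ue_c": 1000.0}

# Hypothetical outcomes: (winner, loser), one per evaluation setting.
comparisons = [("ue_a", "ue_b"), ("ue_a", "ue_c"), ("ue_b", "ue_c")]
for winner, loser in comparisons:
    elo_update(ratings, winner, loser)

ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)  # ['ue_a', 'ue_b', 'ue_c']
```

Because each update depends on the current ratings, the final numbers depend on the order of comparisons; averaging over shuffled orders is a common way to reduce that sensitivity.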
[AI-2] BioX-Bridge: Model Bridging for Unsupervised Cross-Modal Knowledge Transfer across Biosignals
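[Quick read]: This paper addresses the difficulty of training models for cross-modal knowledge transfer across biosignals when labeled data are scarce, together with the large parameter counts and heavy compute demands of current foundation models. The core of the solution is a lightweight bridge network, BioX-Bridge, which aligns intermediate feature representations of foundation models to enable unsupervised knowledge transfer between biosignal modalities. Two contributions stand out: (1) an efficient strategy for selecting the alignment positions at which the bridge is built; and (2) a flexible prototype network as the bridge architecture, which facilitates cross-modal information flow. Together they cut trainable parameters by 88–99% while matching or exceeding the transfer performance of state-of-the-art methods.
Link: https://arxiv.org/abs/2510.02276
Authors: Chenqi Li, Yu Liu, Timothy Denison, Tingting Zhu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: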
Abstract:Biosignals offer valuable insights into the physiological states of the human body. Although biosignal modalities differ in functionality, signal fidelity, sensor comfort, and cost, they are often intercorrelated, reflecting the holistic and interconnected nature of human physiology. This opens up the possibility of performing the same tasks using alternative biosignal modalities, thereby improving the accessibility, usability, and adaptability of health monitoring systems. However, the limited availability of large labeled datasets presents challenges for training models tailored to specific tasks and modalities of interest. Unsupervised cross-modal knowledge transfer offers a promising solution by leveraging knowledge from an existing modality to support model training for a new modality. Existing methods are typically based on knowledge distillation, which requires running a teacher model alongside student model training, resulting in high computational and memory overhead. This challenge is further exacerbated by the recent development of foundation models that demonstrate superior performance and generalization across tasks at the cost of large model sizes. To this end, we explore a new framework for unsupervised cross-modal knowledge transfer of biosignals by training a lightweight bridge network to align the intermediate representations and enable information flow between foundation models and across modalities. Specifically, we introduce an efficient strategy for selecting alignment positions where the bridge should be constructed, along with a flexible prototype network as the bridge architecture. Extensive experiments across multiple biosignal modalities, tasks, and datasets show that BioX-Bridge reduces the number of trainable parameters by 88–99% while maintaining or even improving transfer performance compared to state-of-the-art methods.
[AI-3] How to Combat Reactive and Dynamic Jamming Attacks with Reinforcement Learning
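[Quick read]: This paper addresses reactive jamming, in which a jammer dynamically selects channels and sensing thresholds to detect and jam ongoing transmissions. The key to the solution is reinforcement learning (RL): without prior knowledge of channel conditions or jamming strategies, the transmitter-receiver pair learns to adapt transmit power, modulation, and channel selection to avoid jamming and optimize throughput. Q-learning handles discrete jamming-event states, while Deep Q-Networks (DQN) handle continuous states based on received power. With different reward functions and action sets, the agent adapts rapidly to spectrum dynamics and jamming policies and sustains high communication rates.
Link: https://arxiv.org/abs/2510.02265
Authors: Yalin E. Sagduyu, Tugba Erpek, Kemal Davaslioglu, Sastry Kompella
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
Comments: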
Abstract:This paper studies the problem of mitigating reactive jamming, where a jammer adopts a dynamic policy of selecting channels and sensing thresholds to detect and jam ongoing transmissions. The transmitter-receiver pair learns to avoid jamming and optimize throughput over time (without prior knowledge of channel conditions or jamming strategies) by using reinforcement learning (RL) to adapt transmit power, modulation, and channel selection. Q-learning is employed for discrete jamming-event states, while Deep Q-Networks (DQN) are employed for continuous states based on received power. Through different reward functions and action sets, the results show that RL can adapt rapidly to spectrum dynamics and sustain high rates as channels and jamming policies change over time.
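The discrete-state case can be sketched with tabular Q-learning on a toy problem. The environment below is a hypothetical simplification, not the paper's setup: a reactive jammer always jams the channel it sensed in the previous slot, the state is the last channel used, the only action is channel selection, and the reward is +1 for a clean slot and -1 for a jammed one. The hyperparameters are illustrative.

```python
import random

N_CHANNELS = 3
ALPHA, GAMMA, EPSILON = 0.1, 0.5, 0.1  # learning rate, discount, exploration


def step(last_channel, action):
    """Toy reactive jammer: it jams the channel used in the previous slot."""
    reward = -1.0 if action == last_channel else 1.0
    return reward, action  # the next state is the channel just used


def greedy(q_row):
    """Index of the highest Q-value (first index wins ties)."""
    return max(range(len(q_row)), key=q_row.__getitem__)


def train(n_steps=5000, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_CHANNELS for _ in range(N_CHANNELS)]
    state = 0
    for _ in range(n_steps):
        # Epsilon-greedy action selection.
        if rng.random() < EPSILON:
            action = rng.randrange(N_CHANNELS)
        else:
            action = greedy(q[state])
        reward, next_state = step(state, action)
        # Standard Q-learning temporal-difference update.
        target = reward + GAMMA * max(q[next_state])
        q[state][action] += ALPHA * (target - q[state][action])
        state = next_state
    return q


q_table = train()
policy = [greedy(row) for row in q_table]
# In this toy setup the learned policy never reuses the channel that
# the jammer is about to hit, i.e. policy[s] != s for every state s.
```

Extending the action to a (power, modulation, channel) tuple only enlarges the action set; the continuous received-power state described in the abstract is where a DQN would replace the table.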
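[AI-4] DiFFPO: Training Diffusion LLMs to Reason Fast and Furious via Reinforcement Learning
[Quick read]: This paper addresses the trade-off between inference efficiency and accuracy in diffusion large language models (dLLMs): how to cut compute substantially while preserving or improving reasoning ability. The key to the solution is DiFFPO, a unified reinforcement learning (RL) framework with two core ideas. First, it trains surrogate policies with off-policy RL, using a two-stage likelihood approximation with importance sampling correction, which improves sample efficiency and task performance. Second, it jointly trains the sampler/controller with the dLLM policy, so the model learns to set an inference threshold adaptively for each prompt. This yields higher accuracy with fewer function evaluations (NFEs) and improves the Pareto frontier of dLLM inference-time compute.
Link: https://arxiv.org/abs/2510.02212
Authors: Hanyang Zhao, Dawen Liang, Wenpin Tang, David Yao, Nathan Kallus
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: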
Abstract:We propose DiFFPO, Diffusion Fast and Furious Policy Optimization, a unified framework for training masked diffusion large language models (dLLMs) to reason not only better (furious), but also faster via reinforcement learning (RL). We first unify the existing baseline approach such as d1 by proposing to train surrogate policies via off-policy RL, whose likelihood is much more tractable as an approximation to the true dLLM policy. This naturally motivates a more accurate and informative two-stage likelihood approximation combined with importance sampling correction, which leads to generalized RL algorithms with better sample efficiency and superior task performance. Second, we propose a new direction of joint training efficient samplers/controllers of dLLMs policy. Via RL, we incentivize dLLMs’ natural multi-token prediction capabilities by letting the model learn to adaptively allocate an inference threshold for each prompt. By jointly training the sampler, we yield better accuracies with lower number of function evaluations (NFEs) compared to training the model only, obtaining the best performance in improving the Pareto frontier of the inference-time compute of dLLMs. We showcase the effectiveness of our pipeline by training open source large diffusion language models over benchmark math and planning tasks.
[AI-5] Detection of Chagas Disease from the ECG: The George B. Moody PhysioNet Challenge 2025
[Quick read]: This paper addresses early identification of chronic Chagas disease, especially in regions with limited serological testing capacity, by using signs of cardiomyopathy in the electrocardiogram (ECG) to triage patients and prioritize them for testing. The key elements are: a multi-source collection that combines a large weakly labeled dataset with smaller strongly labeled datasets; data augmentation to improve model robustness and generalization; and an evaluation metric that reflects local serological testing capacity, which frames the problem as a clinically actionable triage task.
Link: https://arxiv.org/abs/2510.02202
Authors: Matthew A. Reyna(1),Zuzana Koscova(1),Jan Pavlus(1),Soheil Saghafi(1),James Weigle(1),Andoni Elola(1,2),Salman Seyedi(1),Kiersten Campbell(1),Qiao Li(1),Ali Bahrami Rad(1),Antônio H. Ribeiro(3),Antonio Luiz P. Ribeiro(4,5),Reza Sameni(1,6),Gari D. Clifford(1,6) ((1) Department of Biomedical Informatics, Emory University, Atlanta, USA, (2) Department of Electronic Technology, University of the Basque Country UPV/EHU, Spain, (3) Department of Information Technology, Uppsala University, Uppsala, Sweden, (4) Universidade Federal de Minas Gerais, Belo Horizonte, Brazil, (5) Telehealth Center from Hospital das Clinicas, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil, (6) Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, USA)
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 2 figures
Abstract:Objective: Chagas disease is a parasitic infection that is endemic to South America, Central America, and, more recently, the U.S., primarily transmitted by insects. Chronic Chagas disease can cause cardiovascular diseases and digestive problems. Serological testing capacities for Chagas disease are limited, but Chagas cardiomyopathy often manifests in ECGs, providing an opportunity to prioritize patients for testing and treatment. Approach: The George B. Moody PhysioNet Challenge 2025 invites teams to develop algorithmic approaches for identifying Chagas disease from electrocardiograms (ECGs). Main results: This Challenge provides multiple innovations. First, we leveraged several datasets with labels from patient reports and serological testing, provided a large dataset with weak labels and smaller datasets with strong labels. Second, we augmented the data to support model robustness and generalizability to unseen data sources. Third, we applied an evaluation metric that captured the local serological testing capacity for Chagas disease to frame the machine learning problem as a triage task. Significance: Over 630 participants from 111 teams submitted over 1300 entries during the Challenge, representing diverse approaches from academia and industry worldwide.
[AI-6] UpSafe°C: Upcycling for Controllable Safety in Large Language Models
[Quick read]: This paper addresses safety risks that large language models (LLMs) face in deployment, such as harmful content generation and jailbreak attacks, and the difficulty existing safety techniques (external guardrails, inference-time guidance, and post-training alignment) have in balancing safety, utility, and controllability. The key to the solution is UpSafe°C, a unified framework built on safety-aware upcycling: safety-critical layers are converted into a sparse Mixture-of-Experts (MoE) structure whose router acts as a soft guardrail, selectively activating the original multilayer perceptrons (MLPs) and the added safety experts. A two-stage supervised fine-tuning (SFT) strategy strengthens safety discrimination while preserving general capabilities, and a safety temperature mechanism gives flexible inference-time control over the safety-utility trade-off, reaching the Pareto-optimal frontier between the two.
Link: https://arxiv.org/abs/2510.02194
Authors: Yuhao Sun, Zhuoer Xu, Shiwen Cui, Kun Yang, Lingyun Yu, Yongdong Zhang, Hongtao Xie
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract: Large Language Models (LLMs) have achieved remarkable progress across a wide range of tasks, but remain vulnerable to safety risks such as harmful content generation and jailbreak attacks. Existing safety techniques – including external guardrails, inference-time guidance, and post-training alignment – each face limitations in balancing safety, utility, and controllability. In this work, we propose UpSafe°C, a unified framework for enhancing LLM safety through safety-aware upcycling. Our approach first identifies safety-critical layers and upcycles them into a sparse Mixture-of-Experts (MoE) structure, where the router acts as a soft guardrail that selectively activates original MLPs and added safety experts. We further introduce a two-stage SFT strategy to strengthen safety discrimination while preserving general capabilities. To enable flexible control at inference time, we introduce a safety temperature mechanism, allowing dynamic adjustment of the trade-off between safety and utility. Experiments across multiple benchmarks, base models, and model scales demonstrate that UpSafe°C achieves robust safety improvements against harmful and jailbreak inputs, while maintaining competitive performance on general tasks. Moreover, analysis shows that safety temperature provides fine-grained inference-time control that achieves the Pareto-optimal frontier between utility and safety. Our results highlight a new direction for LLM safety: moving from static alignment toward dynamic, modular, and inference-aware control.
[AI-7] EvolveCaptions: Empowering DHH Users Through Real-Time Collaborative Captioning
Link: https://arxiv.org/abs/2510.02181
Authors: Liang-Yuan Wu, Dhruv Jain
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:
[AI-8] GRACE: A Language Model Framework for Explainable Inverse Reinforcement Learning
Link: https://arxiv.org/abs/2510.02180
Authors: Silvia Sapora, Devon Hjelm, Alexander Toshev, Omar Attia, Bogdan Mazoure
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
[AI-9] Go witheFlow: Real-time Emotion Driven Audio Effects Modulation NEURIPS
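[Quick read]: This paper explores how human-machine collaboration can enhance emotional expression in real-time music performance, given that machines cannot convey emotion as a human performer does. The key to the solution is the witheFlow system, which automatically modulates audio effects based on features extracted from biosignals and from the audio itself, dynamically adjusting the expressiveness of the performance. The system is lightweight, runs locally, and is open source, so it can be integrated into an existing Digital Audio Workstation (DAW) setup.
Link: https://arxiv.org/abs/2510.02171
Authors: Edmund Dervakos, Spyridon Kantarelis, Vassilis Lyberatos, Jason Liartis, Giorgos Stamou
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted at NeurIPS Creative AI Track 2025: Humanity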
Abstract:Music performance is a distinctly human activity, intrinsically linked to the performer’s ability to convey, evoke, or express emotion. Machines cannot perform music in the human sense; they can produce, reproduce, execute, or synthesize music, but they lack the capacity for affective or emotional experience. As such, music performance is an ideal candidate through which to explore aspects of collaboration between humans and machines. In this paper, we introduce the witheFlow system, designed to enhance real-time music performance by automatically modulating audio effects based on features extracted from both biosignals and the audio itself. The system, currently in a proof-of-concept phase, is designed to be lightweight, able to run locally on a laptop, and is open-source given the availability of a compatible Digital Audio Workstation and sensors.
[AI-10] SIEVE: Towards Verifiable Certification for Code-datasets
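Quick read: This paper addresses the lack of verifiable quality guarantees for public code datasets. Today's static 'dataset cards' are informative but neither auditable nor backed by statistical guarantees, which makes dataset quality hard to attest; meanwhile, teams build their own ad-hoc cleaning pipelines, fragmenting effort and raising costs. The key to the solution is the SIEVE framework, which turns each per-property check into a Confidence Card, a machine-readable, verifiable certificate with anytime-valid statistical bounds. This enables verifiable certification of dataset quality and is expected to lower quality-assurance costs and increase trust in code datasets.
Link: https://arxiv.org/abs/2510.02166
Authors: Fatou Ndiaye Mbodji, El-hacen Diallo, Jordan Samhi, Kui Liu, Jacques Klein, Tegawendé F. Bissyande
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 5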
Abstract:Code agents and empirical software engineering rely on public code datasets, yet these datasets lack verifiable quality guarantees. Static 'dataset cards' inform, but they are neither auditable nor do they offer statistical guarantees, making it difficult to attest to dataset quality. Teams build isolated, ad-hoc cleaning pipelines. This fragments effort and raises cost. We present SIEVE, a community-driven framework. It turns per-property checks into Confidence Cards: machine-readable, verifiable certificates with anytime-valid statistical bounds. We outline a research plan to bring SIEVE to maturity, replacing narrative cards with anytime-verifiable certification. This shift is expected to lower quality-assurance costs and increase trust in code-datasets.
[AI-11] Comparing Contrastive and Triplet Loss in Audio-Visual Embedding: Intra-Class Variance and Greediness Analysis
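Quick read: This paper addresses the unclear mechanisms by which contrastive loss and triplet loss affect representation quality in deep metric learning, focusing on how they differ in preserving intra- and inter-class variance and in optimization behavior. The key to the approach is a theoretical analysis combined with task-consistent experiments (on synthetic data and real datasets including MNIST, CIFAR-10, CUB-200, and CARS196) that systematically compare the two losses on optimization-dynamics indicators such as loss-decay rate, active ratio, and gradient norm. The study finds that triplet loss better preserves intra- and inter-class variance, supports finer-grained semantic distinctions, and keeps focusing on hard samples through fewer but stronger updates, yielding better overall performance on classification and retrieval tasks.
Link: https://arxiv.org/abs/2510.02161
Authors: Donghuo Zeng
Affiliations: unknown
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 4 tables, 3 figures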
Abstract:Contrastive loss and triplet loss are widely used objectives in deep metric learning, yet their effects on representation quality remain insufficiently understood. We present a theoretical and empirical comparison of these losses, focusing on intra- and inter-class variance and optimization behavior (e.g., greedy updates). Through task-specific experiments with consistent settings on synthetic data and real datasets (MNIST, CIFAR-10), it is shown that triplet loss preserves greater variance within and across classes, supporting finer-grained distinctions in the learned representations. In contrast, contrastive loss tends to compact intra-class embeddings, which may obscure subtle semantic differences. To better understand their optimization dynamics, we examine loss-decay rate, active ratio, and gradient norm, and find that contrastive loss drives many small updates early on, while triplet loss produces fewer but stronger updates that sustain learning on hard examples. Finally, across both classification and retrieval tasks on MNIST, CIFAR-10, CUB-200, and CARS196 datasets, our results consistently show that triplet loss yields superior performance, which suggests using triplet loss for detail retention and hard-sample focus, and contrastive loss for smoother, broad-based embedding refinement.
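To make the contrast between the two objectives concrete, here is a minimal sketch of their common textbook forms. It assumes a Euclidean distance and a margin of 1.0; the function names, the margin, and the toy points are illustrative choices, and the paper's exact formulation may differ.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def contrastive_loss(u, v, same_class, margin=1.0):
    """Pairwise loss: same-class pairs are always pulled together;
    different-class pairs are pushed apart only while closer than `margin`."""
    d = euclidean(u, v)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Relative loss: zero once the negative is at least `margin`
    farther from the anchor than the positive is."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy embeddings (assumed values): the positive is near the anchor, the negative is far.
anchor, positive, negative = (0.0, 0.0), (0.5, 0.0), (3.0, 0.0)

pos_pair_loss = contrastive_loss(anchor, positive, same_class=True)
neg_pair_loss = contrastive_loss(anchor, negative, same_class=False)
trip = triplet_loss(anchor, positive, negative)
```

In this toy case the triplet is already satisfied and contributes no gradient (it is not "active"), while the same-class pair still has nonzero contrastive loss and keeps being pulled together. That is one way to read the abstract's observation that contrastive loss compacts intra-class embeddings whereas triplet loss concentrates its updates on hard examples.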
[AI-12] Human-Robo-advisor collaboration in decision-making: Evidence from a multiphase mixed methods experimental study
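Quick read: This paper addresses the following problem: although robo-advisors (RAs) offer advantages such as low cost and reduced bias, their actual adoption remains limited, and little is known about how users interpret RA roles and how RA advice figures in their decision-making. The key to the approach is a multiphase mixed methods design that combines a behavioral experiment (N=334), thematic analysis, and follow-up quantitative testing. It reveals three RA roles and four user types, and builds a 2×2 typology of acceptance antecedents that separates enablers from inhibitors at the individual and algorithmic levels, providing empirical evidence and actionable insights for designing more trustworthy and adaptive RA systems.
Link: https://arxiv.org/abs/2510.02153
Authors: Hasan Mahmuda, Najmul Islam, Satish Krishnan
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: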
Abstract:Robo-advisors (RAs) are cost-effective, bias-resistant alternatives to human financial advisors, yet adoption remains limited. While prior research has examined user interactions with RAs, less is known about how individuals interpret RA roles and integrate their advice into decision-making. To address this gap, this study employs a multiphase mixed methods design integrating a behavioral experiment (N = 334), thematic analysis, and follow-up quantitative testing. Findings suggest that people tend to rely on RAs, with reliance shaped by information about RA performance and the framing of advice as gains or losses. Thematic analysis reveals three RA roles in decision-making and four user types, each reflecting distinct patterns of advice integration. In addition, a 2 x 2 typology categorizes antecedents of acceptance into enablers and inhibitors at both the individual and algorithmic levels. By combining behavioral, interpretive, and confirmatory evidence, this study advances understanding of human-RA collaboration and provides actionable insights for designing more trustworthy and adaptive RA systems.
[AI-13] FlexDoc: Parameterized Sampling for Diverse Multilingual Synthetic Documents for Training Document Understanding Models EMNLP2025
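Quick read: This paper addresses the scarcity of high-quality annotated data in developing enterprise-scale document understanding models: building large, diverse, well-annotated document datasets is cost-prohibitive because of privacy constraints, legal compliance, and the volume of manual annotation required. The key to the solution is the FlexDoc framework, which combines Stochastic Schemas with Parameterized Sampling and probabilistically models layout patterns, visual structure, and content variability to enable controlled generation of multilingual semi-structured documents, increasing data diversity while lowering annotation cost. Experiments show that FlexDoc-generated data improves the F1 score on Key Information Extraction (KIE) tasks by up to 11% while reducing manual annotation effort by over 90%.
Link: https://arxiv.org/abs/2510.02133
Authors: Karan Dua, Hitesh Laxmichand Patel, Puneet Mittal, Ranjeet Gupta, Amit Agarwal, Praneet Pabolu, Srikant Panda, Hansa Meghwani, Graham Horwood, Fahad Shah
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at EMNLP 2025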
Abstract:Developing document understanding models at enterprise scale requires large, diverse, and well-annotated datasets spanning a wide range of document types. However, collecting such data is prohibitively expensive due to privacy constraints, legal restrictions, and the sheer volume of manual annotation needed - costs that can scale into millions of dollars. We introduce FlexDoc, a scalable synthetic data generation framework that combines Stochastic Schemas and Parameterized Sampling to produce realistic, multilingual semi-structured documents with rich annotations. By probabilistically modeling layout patterns, visual structure, and content variability, FlexDoc enables the controlled generation of diverse document variants at scale. Experiments on Key Information Extraction (KIE) tasks demonstrate that FlexDoc-generated data improves the absolute F1 Score by up to 11% when used to augment real datasets, while reducing annotation effort by over 90% compared to traditional hard-template methods. The solution is in active deployment, where it has accelerated the development of enterprise-grade document understanding models while significantly reducing data acquisition and annotation costs.
[AI-14] VarCoNet: A variability-aware self-supervised framework for functional connectome extraction from resting-state fMRI
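Quick read: This paper addresses how to make effective use of inter-individual differences in brain function in resting-state functional magnetic resonance imaging (rs-fMRI) data. Traditional methods often treat this variability as noise, whereas this paper models it as meaningful data. The key to the solution is the VarCoNet framework, whose core innovations are: 1) a self-supervised contrastive learning strategy that exploits inter-individual functional variability through a novel augmentation method based on segmenting rs-fMRI signals; 2) an encoder that combines a 1D-CNN with a Transformer to improve time-series modeling; and 3) a robust Bayesian hyperparameter optimization mechanism. Together these extract stable and interpretable functional connectome (FC) embeddings from rs-fMRI data for downstream tasks (such as subject fingerprinting and autism spectrum disorder classification), showing better performance, robustness, and generalizability than state-of-the-art methods, including 13 deep learning methods, across multiple datasets.
Link: https://arxiv.org/abs/2510.02120
Authors: Charalampos Lamprou, Aamna Alshehhi, Leontios J. Hadjileontiadis, Mohamed L. Seghier
Affiliations: unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: My preview .pdf was not loading. Can you please share with me a compiled .pdf file so I can confirm that the result is correct?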
Abstract:Accounting for inter-individual variability in brain function is key to precision medicine. Here, by considering functional inter-individual variability as meaningful data rather than noise, we introduce VarCoNet, an enhanced self-supervised framework for robust functional connectome (FC) extraction from resting-state fMRI (rs-fMRI) data. VarCoNet employs self-supervised contrastive learning to exploit inherent functional inter-individual variability, serving as a brain function encoder that generates FC embeddings readily applicable to downstream tasks even in the absence of labeled data. Contrastive learning is facilitated by a novel augmentation strategy based on segmenting rs-fMRI signals. At its core, VarCoNet integrates a 1D-CNN-Transformer encoder for advanced time-series processing, enhanced with a robust Bayesian hyperparameter optimization. Our VarCoNet framework is evaluated on two downstream tasks: (i) subject fingerprinting, using rs-fMRI data from the Human Connectome Project, and (ii) autism spectrum disorder (ASD) classification, using rs-fMRI data from the ABIDE I and ABIDE II datasets. Using different brain parcellations, our extensive testing against state-of-the-art methods, including 13 deep learning methods, demonstrates VarCoNet’s superiority, robustness, interpretability, and generalizability. Overall, VarCoNet provides a versatile and robust framework for FC analysis in rs-fMRI.
[AI-15] Demystifying the Roles of LLM Layers in Retrieval Knowledge and Reasoning ICASSP2025
Link: https://arxiv.org/abs/2510.02091
Authors: Xinyuan Song, Keyu Wang, PengXiang Li, Lu Yin, Shiwei Liu
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: ICASSP 2025
[AI-16] KAIROS: Unified Training for Universal Non-Autoregressive Time Series Forecasting
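Quick read: This paper addresses the need of Web applications for highly responsive time series forecasts to support real-time decision making, and the degradation in forecast quality that traditional autoregressive methods suffer from error accumulation. The key to the solution is KAIROS, a non-autoregressive time series forecasting framework that directly models segment-level multi-peak distributions, avoiding error propagation and achieving just-in-time inference while substantially lowering inference cost without sacrificing accuracy.
Link: https://arxiv.org/abs/2510.02084
Authors: Kuiye Ding, Fanda Fan, Zheya Wang, Hongxiao Li, Yifan Wang, Lei Wang, Chunjie Luo, Jianfeng Zhan
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: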
Abstract:In the World Wide Web, reliable time series forecasts provide the forward-looking signals that drive resource planning, cache placement, and anomaly response, enabling platforms to operate efficiently as user behavior and content distributions evolve. Compared with other domains, time series forecasting for Web applications requires much faster responsiveness to support real-time decision making. We present KAIROS, a non-autoregressive time series forecasting framework that directly models segment-level multi-peak distributions. Unlike autoregressive approaches, KAIROS avoids error accumulation and achieves just-in-time inference, while improving over existing non-autoregressive models that collapse to over-smoothed predictions. Trained on a large-scale corpus, KAIROS demonstrates strong zero-shot generalization on six widely used benchmarks, delivering forecasting performance comparable to state-of-the-art foundation models with similar scale, at a fraction of their inference cost. Beyond empirical results, KAIROS highlights the importance of non-autoregressive design as a scalable paradigm for foundation models in time series.
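The error-accumulation argument can be illustrated with a toy comparison. This is not the KAIROS model: the two "models" below are hypothetical stand-ins that each make an assumed error of 0.1 per output, and the series is an assumed one that grows by exactly 1.0 per step.

```python
def autoregressive_rollout(last_value, horizon, step_fn):
    """Predict one step at a time, feeding each prediction back in as input."""
    preds, current = [], last_value
    for _ in range(horizon):
        current = step_fn(current)  # one model call per forecast step
        preds.append(current)
    return preds

def direct_forecast(last_value, horizon, multi_step_fn):
    """Predict the whole horizon in a single call (non-autoregressive)."""
    return multi_step_fn(last_value, horizon)

# Toy ground truth: starting from 10.0, the series grows by 1.0 per step.
truth = [10.0 + k for k in range(1, 6)]

def biased_step(x):
    """Hypothetical one-step model that overshoots by 0.1."""
    return x + 1.0 + 0.1

def biased_multi(x, h):
    """Hypothetical multi-step model: each output is off by 0.1."""
    return [x + k + 0.1 for k in range(1, h + 1)]

ar_errors = [abs(p - t) for p, t in zip(autoregressive_rollout(10.0, 5, biased_step), truth)]
direct_errors = [abs(p - t) for p, t in zip(direct_forecast(10.0, 5, biased_multi), truth)]
```

In the rollout, each step inherits the error of the previous one, so the absolute error grows with the horizon (roughly 0.1, 0.2, up to 0.5 here), and the model is called once per step. The direct forecast keeps the per-output error flat and needs a single call, which is the intuition behind both the accuracy and the latency claims in the abstract.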
[AI-17] ReTabAD: A Benchmark for Restoring Semantic Context in Tabular Anomaly Detection
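Quick read: This paper addresses the absence of textual semantic information in current tabular anomaly detection (AD) research: existing benchmark datasets provide only raw data points, without structured textual metadata such as feature descriptions and domain knowledge, so models cannot use the context that experts rely on. The key to the solution is the ReTabAD benchmark, which comprises: (1) 20 carefully curated tabular datasets with rich structured textual metadata, together with implementations of state-of-the-art detection algorithms spanning classical, deep learning, and large language model (LLM) based approaches; and (2) a zero-shot LLM framework that needs no task-specific training and uses semantic context for context-aware anomaly detection, improving detection performance and interpretability and providing a strong baseline and a basis for systematic exploration in future research.
Link: https://arxiv.org/abs/2510.02060
Authors: Sanghyu Yoon, Dongmin Kim, Suhee Yoon, Ye Seul Sim, Seungdong Yoa, Hye-Seung Cho, Soonyoung Lee, Hankook Lee, Woohyung Lim
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 4 figures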
Abstract:In tabular anomaly detection (AD), textual semantics often carry critical signals, as the definition of an anomaly is closely tied to domain-specific context. However, existing benchmarks provide only raw data points without semantic context, overlooking rich textual metadata such as feature descriptions and domain knowledge that experts rely on in practice. This limitation restricts research flexibility and prevents models from fully leveraging domain knowledge for detection. ReTabAD addresses this gap by restoring textual semantics to enable context-aware tabular AD research. We provide (1) 20 carefully curated tabular datasets enriched with structured textual metadata, together with implementations of state-of-the-art AD algorithms including classical, deep learning, and LLM-based approaches, and (2) a zero-shot LLM framework that leverages semantic context without task-specific training, establishing a strong baseline for future research. Furthermore, this work provides insights into the role and utility of textual metadata in AD through experiments and analysis. Results show that semantic context improves detection performance and enhances interpretability by supporting domain-aware reasoning. These findings establish ReTabAD as a benchmark for systematic exploration of context-aware AD.
[AI-18] The Current State of AI Bias Bounties: An Overview of Existing Programmes and Research
Link: https://arxiv.org/abs/2510.02036
Authors: Sergej Kucenko, Nathaniel Dennler, Fengxiang He
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 6,227 words (18 pages, from abstract to appendix), one figure, one table, and an appendix with an additional table
[AI-19] Zero-shot reasoning for simulating scholarly peer-review
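Quick read: This paper addresses the dual crisis facing the scholarly publishing ecosystem: unmanageable growth in submission volume and unregulated generative AI, which together threaten scientific integrity. The key to its solution is a deterministic simulation framework that, according to the authors, provides the first stable, evidence-based standard for evaluating AI-generated peer review reports. Analyzing 352 simulated review reports, the study finds that the system consistently simulates calibrated editorial judgment (for example, 'Revise' decisions form the majority outcome and are consistent across disciplines) and maintains strict procedural integrity (for example, a constant 29% evidence-anchoring compliance rate). This mitigates the stochasticity inherent in generative AI, yields a predictable, rule-bound review process, and gives the scientific community a transparent, auditable governance tool.
Link: https://arxiv.org/abs/2510.02027
Authors: Khalid M. Saqr
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: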
Abstract:The scholarly publishing ecosystem faces a dual crisis of unmanageable submission volumes and unregulated AI, creating an urgent need for new governance models to safeguard scientific integrity. The traditional human-only peer review regime lacks a scalable, objective benchmark, making editorial processes opaque and difficult to audit. Here we investigate a deterministic simulation framework that provides the first stable, evidence-based standard for evaluating AI-generated peer review reports. Analyzing 352 peer-review simulation reports, we identify consistent system state indicators that demonstrate its reliability. First, the system is able to simulate calibrated editorial judgment, with ‘Revise’ decisions consistently forming the majority outcome (50%) across all disciplines, while ‘Reject’ rates dynamically adapt to field-specific norms, rising to 45% in Health Sciences. Second, it maintains unwavering procedural integrity, enforcing a stable 29% evidence-anchoring compliance rate that remains invariant across diverse review tasks and scientific domains. These findings demonstrate a system that is predictably rule-bound, mitigating the stochasticity of generative AI. For the scientific community, this provides a transparent tool to ensure fairness; for publishing strategists, it offers a scalable instrument for auditing workflows, managing integrity risks, and implementing evidence-based governance. The framework repositions AI as an essential component of institutional accountability, providing the critical infrastructure to maintain trust in scholarly communication.
[AI-20] Clarifying Semantics of In-Context Examples for Unit Test Generation
Link: https://arxiv.org/abs/2510.01994
Authors: Chen Yang, Lin Yang, Ziqi Wang, Dong Wang, Jianyi Zhou, Junjie Chen
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: accepted in the research track of ASE 2025
[AI-21] Exploring Resolution-Wise Shared Attention in Hybrid Mamba-U-Nets for Improved Cross-Corpus Speech Enhancement
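Quick read: This paper addresses the insufficient generalization of speech enhancement models in cross-corpus settings while keeping model efficiency and computational complexity in check. The key to the solution is RWSA-MambaUNet, a novel and efficient hybrid architecture that combines Mamba and multi-head attention within a U-Net structure and introduces resolution-wise shared attention (RWSA), in which attention weights are shared across layers at corresponding time and frequency resolutions, improving adaptation to different corpora. Experiments show that the method outperforms existing baselines with a very small parameter count and computational cost, achieving the best performance on two out-of-domain test sets, DNS 2020 and EARS-WHAM_v2.
Link: https://arxiv.org/abs/2510.01958
Authors: Nikolai Lund Kühne, Jesper Jensen, Jan Østergaard, Zheng-Hua Tan
Affiliations: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Submitted to IEEE for possible publication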
Abstract:Recent advances in speech enhancement have shown that models combining Mamba and attention mechanisms yield superior cross-corpus generalization performance. At the same time, integrating Mamba in a U-Net structure has yielded state-of-the-art enhancement performance, while reducing both model size and computational complexity. Inspired by these insights, we propose RWSA-MambaUNet, a novel and efficient hybrid model combining Mamba and multi-head attention in a U-Net structure for improved cross-corpus performance. Resolution-wise shared attention (RWSA) refers to layerwise attention-sharing across corresponding time- and frequency resolutions. Our best-performing RWSA-MambaUNet model achieves state-of-the-art generalization performance on two out-of-domain test sets. Notably, our smallest model surpasses all baselines on the out-of-domain DNS 2020 test set in terms of PESQ, SSNR, and ESTOI, and on the out-of-domain EARS-WHAM_v2 test set in terms of SSNR, ESTOI, and SI-SDR, while using less than half the model parameters and a fraction of the FLOPs.
[AI-22] To Mask or to Mirror: Human-AI Alignment in Collective Reasoning
Link: https://arxiv.org/abs/2510.01924
Authors: Crystal Qian, Aaron Parisi, Clémentine Bouleau, Vivian Tsai, Maël Lebreton, Lucas Dixon
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
[AI-23] Are LLMs Better GNN Helpers? Rethinking Robust Graph Learning under Deficiencies with Iterative Refinement
Link: https://arxiv.org/abs/2510.01910
Authors: Zhaoyan Wang, Zheng Gao, Arogya Kharel, In-Young Ko
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages
[AI-24] Multimodal Foundation Models for Early Disease Detection
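Quick read: This paper addresses the limitation that traditional diagnostic models analyze multimodal medical data (such as electronic health records, medical imaging, genomics, and wearable-device monitoring data) in isolation and therefore fail to exploit cross-modal correlations, which limits the accuracy of early disease diagnosis. The key to the solution is a multimodal foundation model built on an attention-based transformer architecture: dedicated encoders map each modality into a shared latent space, the representations are fused with multi-head attention and residual normalization, and the model is pretrained on multiple tasks and adapted with lightweight fine-tuning, with the aim of good generalization and clinical adaptability.
Link: https://arxiv.org/abs/2510.01899
Authors: Md Talha Mohsin, Ismail Abdulrashid
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 6 pages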
Abstract:Healthcare generates diverse streams of data, including electronic health records (EHR), medical imaging, genetics, and ongoing monitoring from wearable devices. Traditional diagnostic models frequently analyze these sources in isolation, which constrains their capacity to identify cross-modal correlations essential for early disease diagnosis. Our research presents a multimodal foundation model that consolidates diverse patient data through an attention-based transformer framework. At first, dedicated encoders put each modality into a shared latent space. Then, they combine them using multi-head attention and residual normalization. The architecture is made for pretraining on many tasks, which makes it easy to adapt to new diseases and datasets with little extra work. We provide an experimental strategy that uses benchmark datasets in oncology, cardiology, and neurology, with the goal of testing early detection tasks. The framework includes data governance and model management tools in addition to technological performance to improve transparency, reliability, and clinical interpretability. The suggested method works toward a single foundation model for precision diagnostics, which could improve the accuracy of predictions and help doctors make decisions.
[AI-25] HRTFformer: A Spatially-Aware Transformer for Personalized HRTF Upsampling in Immersive Audio Rendering
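Quick read: This paper addresses the difficulty of deploying personalized Head-Related Transfer Functions (HRTFs) at scale because of the complexity of the measurement process, specifically how spatial upsampling can reduce the number of required measurement points while preserving high fidelity and spatial consistency. The key to the solution is a transformer-based architecture that models the spatial correlations of HRTFs in the spherical harmonic (SH) domain, uses the attention mechanism to capture long-range spatial dependencies, and introduces a neighbor dissimilarity loss to promote magnitude smoothness and spatial coherence, improving reconstruction accuracy and perceptual realism at high upsampling factors.
Link: https://arxiv.org/abs/2510.01891
Authors: Xuyi Hu, Jian Li, Shaojie Zhang, Stefan Goetz, Lorenzo Picinali, Ozgur B. Akan, Aidan O. T. Hogg
Affiliations: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 10 pages and 5 figures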
Abstract:Personalized Head-Related Transfer Functions (HRTFs) are starting to be introduced in many commercial immersive audio applications and are crucial for realistic spatial audio rendering. However, one of the main hesitations regarding their introduction is that creating personalized HRTFs is impractical at scale due to the complexities of the HRTF measurement process. To mitigate this drawback, HRTF spatial upsampling has been proposed with the aim of reducing measurements required. While prior work has seen success with different machine learning (ML) approaches, these models often struggle with long-range spatial consistency and generalization at high upsampling factors. In this paper, we propose a novel transformer-based architecture for HRTF upsampling, leveraging the attention mechanism to better capture spatial correlations across the HRTF sphere. Working in the spherical harmonic (SH) domain, our model learns to reconstruct high-resolution HRTFs from sparse input measurements with significantly improved accuracy. To enhance spatial coherence, we introduce a neighbor dissimilarity loss that promotes magnitude smoothness, yielding more realistic upsampling. We evaluate our method using both perceptual localization models and objective spectral distortion metrics. Experiments show that our model surpasses leading methods by a substantial margin in generating realistic, high-fidelity HRTFs.
[AI-26] Small is Sufficient: Reducing the World AI Energy Consumption Through Model Selection
Link: https://arxiv.org/abs/2510.01889
Authors: Tiago da Silva Barros, Frédéric Giroire, Ramon Aparicio-Pardo, Joanna Moulierac
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:
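[AI-27] TACOS: Task Agnostic COordinator of a multi-drone System
Quick read: This paper addresses the difficulty a single pilot faces in efficiently managing multiple unmanned aerial vehicles (UAVs) in a multi-UAV system, in particular the challenge of flexible interaction across levels of autonomy, from direct control of individual UAVs, to group-level coordination, to fully autonomous swarm behavior. The key to the solution is TACOS (Task-Agnostic COordinator of a multi-drone System), a unified framework that uses Large Language Models (LLMs) to integrate three capabilities: a one-to-many natural language interface for intuitive human-machine interaction; an intelligent coordinator that translates user intent into structured task plans; and an autonomous agent that executes plans and interacts with the real world. By letting an LLM call a library of executable APIs, the framework bridges semantic reasoning and real-time multi-robot coordination, reducing operator workload and increasing system flexibility.
Link: https://arxiv.org/abs/2510.01869
Authors: Alessandro Nazzari, Roberto Rubinacci, Marco Lovera
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 6 pages, 6 figures, accepted as poster at 2025 IEEE International Symposium on Multi-Robot Multi-Agent Systems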
Abstract:When a single pilot is responsible for managing a multi-drone system, the task demands varying levels of autonomy, from direct control of individual UAVs, to group-level coordination, to fully autonomous swarm behaviors for accomplishing high-level tasks. Enabling such flexible interaction requires a framework that supports multiple modes of shared autonomy. As language models continue to improve in reasoning and planning, they provide a natural foundation for such systems, reducing pilot workload by enabling high-level task delegation through intuitive, language-based interfaces. In this paper we present TACOS (Task-Agnostic COordinator of a multi-drone System), a unified framework that enables high-level natural language control of multi-UAV systems through Large Language Models (LLMs). TACOS integrates three key capabilities into a single architecture: a one-to-many natural language interface for intuitive user interaction, an intelligent coordinator for translating user intent into structured task plans, and an autonomous agent that executes plans while interacting with the real world. TACOS allows an LLM to interact with a library of executable APIs, bridging semantic reasoning with real-time multi-robot coordination. We demonstrate the system in a real-world multi-drone system and conduct an ablation study to assess the contribution of each module.
[AI-28] Learning a Dense Reasoning Reward Model from Expert Demonstration via Inverse Reinforcement Learning
Link: https://arxiv.org/abs/2510.01857
Authors: Claudio Fanconi, Nicolás Astorga, Mihaela van der Schaar
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
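[AI-29] Pre-Hoc Predictions in AutoML: Leveraging LLMs to Enhance Model Selection and Benchmarking for Tabular datasets
Quick read: This paper addresses the excessive computational overhead in AutoML (automated machine learning) caused by reliance on time-consuming exhaustive hyperparameter search. Traditional methods find the best solution by automatically training and testing many models, which is inefficient. The key to the solution is a pre-hoc strategy that uses dataset descriptions and statistical information, together with traditional models and Large Language Model (LLM) agents, to narrow the search space of AutoML libraries, substantially reducing computational cost without sacrificing selection of the best model.
Link: https://arxiv.org/abs/2510.01842
Authors: Yannis Belkhiter, Seshu Tirupathi, Giulio Zizzo, Sachin Sharma, John D. Kelleher
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Oral Presentations ADAPT Annual Scientific Conference 2025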
Abstract:The field of AutoML has made remarkable progress in post-hoc model selection, with libraries capable of automatically identifying the best-performing models for a given dataset. Nevertheless, these methods often rely on exhaustive hyperparameter searches, where methods automatically train and test different types of models on the target dataset. In contrast, pre-hoc prediction emerges as a promising alternative, capable of bypassing exhaustive search through intelligent pre-selection of models. Despite its potential, pre-hoc prediction remains under-explored in the literature. This paper explores the intersection of AutoML and pre-hoc model selection by leveraging traditional models and Large Language Model (LLM) agents to reduce the search space of AutoML libraries. By relying on dataset descriptions and statistical information, we reduce the AutoML search space. Our methodology is applied to the AWS AutoGluon portfolio dataset, a state-of-the-art AutoML benchmark containing 175 tabular classification datasets available on OpenML. The proposed approach offers a shift in AutoML workflows, significantly reducing computational overhead, while still selecting the best model for the given dataset.
[AI-30] Human-AI Teaming Co-Learning in Military Operations
Link: https://arxiv.org/abs/2510.01815
Authors: Clara Maathuis, Kasper Cools
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Submitted to Sensors + Imaging; presented on 18th of September (Artificial Intelligence for Security and Defence Applications III)
[AI-31] SingMOS-Pro: An Comprehensive Benchmark for Singing Quality Assessment ICASSP2026
Link: https://arxiv.org/abs/2510.01812
Authors: Yuxun Tang, Lan Liu, Wenhao Feng, Yiwen Zhao, Jionghao Han, Yifeng Yu, Jiatong Shi, Qin Jin
Affiliations: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 4 pages, 5 figures; submitted to ICASSP 2026
[AI-32] REBot: From RAG to CatRAG with Semantic Enrichment and Graph Routing
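Quick read: This paper addresses the lack of domain-specific regulatory resources for building academic regulation advising systems, with the aim of helping students accurately understand and comply with institutional policies. The key to the solution is REBot, an LLM-enhanced advisory chatbot powered by CatRAG, a hybrid retrieval-reasoning framework that combines Retrieval-Augmented Generation (RAG) with graph-based reasoning. A hierarchical, category-labeled Knowledge Graph (KG) enriched with semantic features provides domain alignment, and a lightweight intent classifier routes queries to the appropriate retrieval modules, balancing factual accuracy and contextual depth.
Link: https://arxiv.org/abs/2510.01800
Authors: Thanh Ma, Tri-Tam La, Lam-Thu Le Huu, Minh-Nghi Nguyen, Khanh-Van Pham Luu, Huu-Hoa Nguyen
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: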
Abstract:Academic regulation advising is essential for helping students interpret and comply with institutional policies, yet building effective systems requires domain-specific regulatory resources. To address this challenge, we propose REBot, an LLM-enhanced advisory chatbot powered by CatRAG, a hybrid retrieval-reasoning framework that integrates retrieval-augmented generation with graph-based reasoning. CatRAG unifies dense retrieval and graph reasoning, supported by a hierarchical, category-labeled knowledge graph enriched with semantic features for domain alignment. A lightweight intent classifier routes queries to the appropriate retrieval modules, ensuring both factual accuracy and contextual depth. We construct a regulation-specific dataset and evaluate REBot on classification and question answering tasks, achieving state-of-the-art performance with an F1 score of 98.89%. Finally, we implement a web application that demonstrates the practical value of REBot in real-world academic advising scenarios.
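The routing idea (a lightweight intent classifier that picks a retrieval module per query) can be sketched as follows. Everything here is a hypothetical illustration: the keyword rule stands in for whatever classifier REBot actually uses, and the two retriever functions are placeholders rather than the CatRAG implementation.

```python
from typing import Callable, Dict

# Assumed cue phrases suggesting that a question is about relations between rules.
RELATIONAL_CUES = ("prerequisite", "depends on", "related to", "which category")

def classify_intent(query: str) -> str:
    """Hypothetical stand-in for a lightweight intent classifier."""
    q = query.lower()
    if any(cue in q for cue in RELATIONAL_CUES):
        return "graph"
    return "dense"

def dense_retrieve(query: str) -> str:
    """Placeholder for dense (embedding-similarity) passage retrieval."""
    return f"[dense retrieval] passages similar to: {query}"

def graph_retrieve(query: str) -> str:
    """Placeholder for traversal of a category-labeled knowledge graph."""
    return f"[graph reasoning] knowledge-graph traversal for: {query}"

# Dispatch table from predicted intent to retrieval module.
ROUTES: Dict[str, Callable[[str], str]] = {
    "dense": dense_retrieve,
    "graph": graph_retrieve,
}

def route(query: str) -> str:
    """Send the query to the retrieval module chosen by the classifier."""
    return ROUTES[classify_intent(query)](query)

graph_answer = route("What is the prerequisite for the thesis defense?")
dense_answer = route("How many credits are needed to graduate?")
```

A dispatch table keeps the classifier and the retrievers decoupled, so a learned classifier could replace the keyword rule without touching the retrieval code.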
[AI-33] Rethinking the shape convention of an MLP
Link: https://arxiv.org/abs/2510.01796
Authors: Meng-Hsi Chen, Yu-Ang Lee, Feng-Ting Liao, Da-shan Shiu
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
[AI-34] Nav-EE: Navigation-Guided Early Exiting for Efficient Vision-Language Models in Autonomous Driving
Link: https://arxiv.org/abs/2510.01795
Authors: Haibo Hu, Lianming Huang, Xinyu Wang, Yufei Cui, Nan Guan, Chun Jason Xue
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
[AI-35] Secure Multi-Modal Data Fusion in Federated Digital Health Systems via MCP
Link: https://arxiv.org/abs/2510.01780
Authors: Aueaphum Aueawatthanaphisut
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 6 pages, 8 figures, 7 equations, 1 algorithm
[AI-36] A cybersecurity AI agent selection and decision support framework
Link: https://arxiv.org/abs/2510.01751
Authors: Masike Malatji
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 6 figures, 6 tables, AI agents decision support framework
[AI-37] MetaboT: AI-based agent for natural language-based interaction with metabolomics knowledge graphs
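Quick read: This paper addresses the problem that mass spectrometry metabolomics produces vast amounts of data from which structured information is hard to extract efficiently with traditional querying. The core challenge is that users need expert knowledge of the knowledge graph's ontology and of SPARQL query syntax, which is a technical barrier to accessing complex semantic data. The key to the solution is MetaboT, a multi-agent system based on large language models (LLMs) that decomposes a natural-language question into executable tasks handled by specialized agents, automating the full workflow from question understanding and entity recognition, through retrieval of standardized identifiers, to SPARQL query generation and execution. The method reaches 83.67% query accuracy, compared with 8.16% for a baseline that relies on LLM prompting alone, which indicates that multi-agent collaboration lowers the barrier to using knowledge graphs and enables natural-language data retrieval for researchers while keeping queries grounded in domain knowledge.
Link: https://arxiv.org/abs/2510.01724
Authors: Madina Bekbergenova(ICN), Lucas Pradi(ICN), Benjamin Navet(ICN), Emma Tysinger(ICN), Franck Michel(WIMMICS), Matthieu Feraud(ICN), Yousouf Taghzouti(ICN, WIMMICS), Yan Zhou Chen, Olivier Kirchhoffer(UNIGE), Florence Mehl(SIB), Martin Legrand(ICN), Tao Jiang(ICN), Marco Pagni(SIB), Soha Hassoun, Jean-Luc Wolfender(UNIGE), Wout Bittremieux(UA), Fabien Gandon(WIMMICS, Laboratoire I3S - SPARKS), Louis-Félix Nothias(CNRS, UniCA, ICN)
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: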
Abstract:Mass spectrometry metabolomics generates vast amounts of data requiring advanced methods for interpretation. Knowledge graphs address these challenges by structuring mass spectrometry data, metabolite information, and their relationships into a connected network (Gaudry et al. 2024). However, effective use of a knowledge graph demands an in-depth understanding of its ontology and its query language syntax. To overcome this, we designed MetaboT, an AI system utilizing large language models (LLMs) to translate user questions into SPARQL semantic query language for operating on knowledge graphs (Steve Harris 2013). We demonstrate its effectiveness using the Experimental Natural Products Knowledge Graph (ENPKG), a large-scale public knowledge graph for plant natural products (Gaudry et al. 2024). MetaboT employs specialized AI agents for handling user queries and interacting with the knowledge graph by breaking down complex tasks into discrete components, each managed by a specialised agent (Fig. 1a). The multi-agent system is constructed using the LangChain and LangGraph libraries, which facilitate the integration of LLMs with external tools and information sources (LangChain, n.d.). The query generation process follows a structured workflow. First, the Entry Agent determines if the question is new or a follow-up to previous interactions. New questions are forwarded to the Validator Agent, which verifies if the question is related to the knowledge graph. Then, the valid question is sent to the Supervisor Agent, which identifies if the question requires chemical conversions or standardized identifiers. In this case it delegates the question to the Knowledge Graph Agent, which can use tools to extract necessary details, such as URIs or taxonomies of chemical names, from the user query. Finally, an agent responsible for crafting the SPARQL queries equipped with the ontology of the knowledge graph uses the provided identifiers to generate the query. Then, the system executes the generated query against the metabolomics knowledge graph and returns structured results to the user (Fig. 1b). To assess the performance of MetaboT we have curated 50 metabolomics-related questions and their expected answers. In addition to submitting these questions to MetaboT, we evaluated a baseline by submitting them to a standard LLM (GPT-4o) with a prompt that incorporated the knowledge graph ontology but did not provide specific entity IDs. This baseline achieved only 8.16% accuracy, compared to MetaboT’s 83.67%, underscoring the necessity of our multi-agent system for accurately retrieving entities and generating correct SPARQL queries. MetaboT demonstrates promising performance as a conversational question-answering assistant, enabling researchers to retrieve structured metabolomics data through natural language queries. By automating the generation and execution of SPARQL queries, it removes technical barriers that have traditionally hindered access to knowledge graphs. Importantly, MetaboT leverages the capabilities of LLMs while maintaining experimentally grounded query generation, ensuring that outputs remain aligned with domain-specific standards and data structures. This approach facilitates data-driven discoveries by bridging the gap between complex semantic technologies and user-friendly interaction. MetaboT is accessible at [this https URL], and its source code is available at [this https URL].
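[AI-38] Emotional Text-To-Speech Based on Mutual-Information-Guided Emotion-Timbre Disentanglement
Summary: This paper addresses the problem that current emotional Text-To-Speech (Emotional TTS) and style transfer methods rely on reference encoders to control global style or emotion vectors and therefore fail to capture the nuanced acoustic details of the reference speech. The key to the solution is a novel emotional TTS method that predicts fine-grained, phoneme-level emotion embeddings and combines this with style disentanglement to separate intrinsic attributes of the reference speech, such as timbre and emotion. Specifically, a style disentanglement strategy guides two feature extractors and reduces the mutual information between timbre and emotion features, so that distinct style components are separated from the reference speech, yielding more natural and emotionally rich synthesized speech.
Link: https://arxiv.org/abs/2510.01722
Authors: Jianing Yang, Sheng Li, Takahiro Shinozaki, Yuki Saito, Hiroshi Saruwatari
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: In Proceedings of the 17th Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2025)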
Abstract:Current emotional Text-To-Speech (TTS) and style transfer methods rely on reference encoders to control global style or emotion vectors, but do not capture nuanced acoustic details of the reference speech. To this end, we propose a novel emotional TTS method that enables fine-grained phoneme-level emotion embedding prediction while disentangling intrinsic attributes of the reference speech. The proposed method employs a style disentanglement method to guide two feature extractors, reducing mutual information between timbre and emotion features, and effectively separating distinct style components from the reference speech. Experimental results demonstrate that our method outperforms baseline TTS systems in generating natural and emotionally rich speech. This work highlights the potential of disentangled and fine-grained representations in advancing the quality and flexibility of emotional TTS systems.
[AI-39] Latency-aware Multimodal Federated Learning over UAV Networks
Link: https://arxiv.org/abs/2510.01717
Authors: Shaba Shaon, Dinh C. Nguyen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at IEEE Transactions on Network Science and Engineering
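[AI-40] PolySim: Bridging the Sim-to-Real Gap for Humanoid Control via Multi-Simulator Dynamics Randomization
Summary: This paper addresses the sim-to-real gap faced by humanoid whole-body control (WBC) policies trained in simulation. The root cause is simulator inductive bias: the inherent assumptions and limitations of any single simulator lead to significant discrepancies both across simulators and between simulation and the real world. The key to the solution is joint training across multiple simulators, so that the learned controller captures general dynamics beyond the assumptions of any single simulator. To this end, the authors propose PolySim, a platform that integrates multiple heterogeneous simulation engines and runs environments from different engines in parallel within a single training run, thereby realizing dynamics-level domain randomization. Theoretical analysis shows that PolySim yields a tighter upper bound on simulator inductive bias than single-simulator training, and experiments show that it substantially reduces motion-tracking error in sim-to-sim evaluations and achieves zero-shot deployment on a real Unitree G1 robot without additional fine-tuning.
Link: https://arxiv.org/abs/2510.01708
Authors: Zixing Lei, Zibo Zhou, Sheng Yin, Yueru Chen, Qingyao Xu, Weixin Li, Yunhong Wang, Bowei Tang, Wei Jing, Siheng Chen
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures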
Abstract:Humanoid whole-body control (WBC) policies trained in simulation often suffer from the sim-to-real gap, which fundamentally arises from simulator inductive bias, the inherent assumptions and limitations of any single simulator. These biases lead to nontrivial discrepancies both across simulators and between simulation and the real world. To mitigate the effect of simulator inductive bias, the key idea is to train policies jointly across multiple simulators, encouraging the learned controller to capture dynamics that generalize beyond any single simulator’s assumptions. We thus introduce PolySim, a WBC training platform that integrates multiple heterogeneous simulators. PolySim can launch parallel environments from different engines simultaneously within a single training run, thereby realizing dynamics-level domain randomization. Theoretically, we show that PolySim yields a tighter upper bound on simulator inductive bias than single-simulator training. In experiments, PolySim substantially reduces motion-tracking error in sim-to-sim evaluations; for example, on MuJoCo, it improves execution success by 52.8 over an IsaacSim baseline. PolySim further enables zero-shot deployment on a real Unitree G1 without additional fine-tuning, showing effective transfer from simulation to the real world. We will release the PolySim code upon acceptance of this work.
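[AI-41] Representational Alignment Across Model Layers and Brain Regions with Hierarchical Optimal Transport
Summary: This paper addresses three core problems of standard representational similarity methods when comparing neural network representations: (1) matching each layer independently produces asymmetric results; (2) there is no globally consistent alignment score; and (3) networks of different depths are hard to handle. Standard methods typically use rigid one-to-one layer mappings and ignore global activation structure, which limits the accuracy and interpretability of cross-network comparisons. The key to the solution is a unified Hierarchical Optimal Transport (HOT) framework that jointly optimizes soft, globally consistent layer-to-layer couplings and neuron-level transport plans, allowing source neurons to distribute activation mass across multiple target layers while minimizing total transport cost under marginal constraints. This yields a single global alignment score, handles depth mismatches naturally through mass distribution, and reveals smooth, fine-grained hierarchical correspondences. These structured patterns emerge from global optimization rather than being imposed, and are absent in greedy layer-wise matching.
Link: https://arxiv.org/abs/2510.01706
Authors: Shaan Shah, Meenakshi Khosla
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: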
Abstract:Standard representational similarity methods align each layer of a network to its best match in another independently, producing asymmetric results, lacking a global alignment score, and struggling with networks of different depths. These limitations arise from ignoring global activation structure and restricting mappings to rigid one-to-one layer correspondences. We propose Hierarchical Optimal Transport (HOT), a unified framework that jointly infers soft, globally consistent layer-to-layer couplings and neuron-level transport plans. HOT allows source neurons to distribute mass across multiple target layers while minimizing total transport cost under marginal constraints. This yields both a single alignment score for the entire network comparison and a soft transport plan that naturally handles depth mismatches through mass distribution. We evaluate HOT on vision models, large language models, and human visual cortex recordings. Across all domains, HOT matches or surpasses standard pairwise matching in alignment quality. Moreover, it reveals smooth, fine-grained hierarchical correspondences: early layers map to early layers, deeper layers maintain relative positions, and depth mismatches are resolved by distributing representations across multiple layers. These structured patterns emerge naturally from global optimization without being imposed, yet are absent in greedy layer-wise methods. HOT thus enables richer, more interpretable comparisons between representations, particularly when networks differ in architecture or depth.
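The idea of a soft coupling under marginal constraints can be illustrated with generic entropy-regularized optimal transport solved by Sinkhorn iterations. The sketch below is not the paper's hierarchical algorithm: the function name `sinkhorn`, the regularization value, and the toy cost matrix (dissimilarities between a 2-layer source and a 3-layer target network) are all assumptions made for illustration.

```python
import math

def sinkhorn(cost, row_mass, col_mass, reg=0.1, iters=500):
    """Entropy-regularized OT: returns a plan whose marginals match the given masses."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Rescale rows, then columns, so the plan matches the prescribed marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical layer dissimilarities: 2 source layers (rows) vs. 3 target layers (columns).
cost = [[0.0, 0.5, 1.0],
        [1.0, 0.5, 0.0]]
plan = sinkhorn(cost, row_mass=[0.5, 0.5], col_mass=[1 / 3, 1 / 3, 1 / 3])

# A single scalar summary of the comparison: total transport cost under the plan.
score = sum(plan[i][j] * cost[i][j] for i in range(2) for j in range(3))
print(plan, score)
```

In this toy setup the middle target layer receives mass from both source layers, which is the kind of depth-mismatch handling the abstract describes; a one-to-one matching could not express it.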
[AI-42] A Locally Executable AI System for Improving Preoperative Patient Communication: A Multi-Domain Clinical Evaluation
Link: https://arxiv.org/abs/2510.01671
Authors: Motoki Sato (Nagasaki University, Japan), Yuki Matsushita (Nagasaki University, Japan), Hidekazu Takahashi (Boston Medical Sciences, Tokyo, Japan), Tomoaki Kakazu (Showa Medical University Koto Toyosu Hospital, Japan), Sou Nagata (Nagasaki University, Japan), Mizuho Ohnuma (Nagasaki University, Japan), Atsushi Yoshikawa (Kanto Gakuin University, Japan), Masayuki Yamamura (Institute of Science Tokyo, Japan)
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 32 pages, 4 figures, 10 tables. This paper is currently under review at ACM Transactions on Computing for Healthcare. Reproducibility resources: this http URL
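[AI-43] GuruAgents: Emulating Wise Investors with Prompt-Guided LLM Agents
Summary: This paper addresses how to turn the qualitative investment philosophies of legendary investors into systematic, actionable quantitative trading strategies. The key to the solution is the design and implementation of five prompt-guided AI agents, called GuruAgents, each of which encodes the distinct philosophy of one investor into large language model (LLM) prompts that integrate financial tools and a deterministic reasoning pipeline, producing reproducible quantitative investment behavior. In a backtest on NASDAQ-100 constituents from Q4 2023 to Q2 2025, the Buffett-style GuruAgent achieved a 42.2% compound annual growth rate (CAGR), significantly outperforming the benchmarks, which supports the feasibility of prompt engineering for automated systematic investing.
Link: https://arxiv.org/abs/2510.01664
Authors: Yejin Kim, Youngbin Lee, Juhyeong Kim, Yongjae Lee
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 7 pages, 2 figures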
Abstract:This study demonstrates that GuruAgents, prompt-guided AI agents, can systematically operationalize the strategies of legendary investment gurus. We develop five distinct GuruAgents, each designed to emulate an iconic investor, by encoding their distinct philosophies into LLM prompts that integrate financial tools and a deterministic reasoning pipeline. In a backtest on NASDAQ-100 constituents from Q4 2023 to Q2 2025, the GuruAgents exhibit unique behaviors driven by their prompted personas. The Buffett GuruAgent achieves the highest performance, delivering a 42.2% CAGR that significantly outperforms benchmarks, while other agents show varied results. These findings confirm that prompt engineering can successfully translate the qualitative philosophies of investment gurus into reproducible, quantitative strategies, highlighting a novel direction for automated systematic investing. The source code and data are available at this https URL.
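For readers unfamiliar with the metric, the 42.2% figure is a compound annual growth rate (CAGR). A minimal sketch of the standard CAGR formula follows; the function name `cagr` and the example portfolio values are illustrative assumptions and are not taken from the paper's backtest.

```python
def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    if start_value <= 0 or end_value <= 0 or years <= 0:
        raise ValueError("values and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0

# Hypothetical portfolio that grows from 100 to 144 over 2 years.
example = cagr(100.0, 144.0, 2.0)
print(f"{example:.1%}")
```

Because CAGR annualizes the return, a backtest shorter than two years (such as Q4 2023 to Q2 2025) reports a rate per year rather than the total gain over the window.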
[AI-44] Shift-Invariant Attribute Scoring for Kolmogorov-Arnold Networks via Shapley Value
Link: https://arxiv.org/abs/2510.01663
Authors: Wangxuan Fan, Ching Wang, Siqi Li, Nan Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 6 figures, 9 tables
[AI-45] Learning Time-Series Representations by Hierarchical Uniformity-Tolerance Latent Balancing
Link: https://arxiv.org/abs/2510.01658
Authors: Amin Jalali, Milad Soltany, Michael Greenspan, Ali Etemad
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted in Transactions on Machine Learning Research
[AI-46] Asymmetric Proximal Policy Optimization: mini-critics boost LLM reasoning
Link: https://arxiv.org/abs/2510.01656
Authors: Jiashun Liu, Johan Obando-Ceron, Han Lu, Yancheng He, Weixun Wang, Wenbo Su, Bo Zheng, Pablo Samuel Castro, Aaron Courville, Ling Pan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
[AI-47] The Unseen Frontier: Pushing the Limits of LLM Sparsity with Surrogate-Free ADMM
Link: https://arxiv.org/abs/2510.01650
Authors: Kwanhee Lee, Hyeondo Jang, Dongyeop Lee, Dan Alistarh, Namhoon Lee
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint
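[AI-48] Source-Free Cross-Domain Continual Learning
Summary: This paper addresses cross-domain continual learning when source-domain samples are unavailable, that is, how to cope with domain shift and catastrophic forgetting (CF) in privacy-constrained environments where labeled source-domain data cannot be accessed. The key to the solution is rehearsal-free frequency-aware dynamic prompt collaborations (REFEREE). The method combines a source-pre-trained model with a large-scale vision-language model, and uses frequency-aware prompting that encourages low-frequency components while suppressing high-frequency components to generate augmented samples that are robust to noisy pseudo-labels. An uncertainty-aware weighting strategy weights the mean and covariance matrix by prediction uncertainty to mitigate the effect of noisy pseudo-labels. In addition, kernel linear discriminant analysis (KLDA) updates the classifier while the backbone network stays frozen, which mitigates catastrophic forgetting.
Link: https://arxiv.org/abs/2510.01649
Authors: Muhammad Tanzil Furqon, Mahardhika Pratama, Igor Škrjanc, Lin Liu, Habibullah Habibullah, Kutluyil Dogancay
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: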
Abstract:Although existing cross-domain continual learning approaches successfully address many streaming tasks with domain shifts, they require a fully labeled source domain, which hinders their feasibility in privacy-constrained environments. This paper goes one step further with the problem of source-free cross-domain continual learning, where the use of source-domain samples is completely prohibited. We propose the idea of rehearsal-free frequency-aware dynamic prompt collaborations (REFEREE) to cope with the absence of labeled source-domain samples in the realm of cross-domain continual learning. REFEREE is built upon a synergy between a source-pre-trained model and a large-scale vision-language model, thus overcoming the problem of sub-optimal generalizations when relying only on a source pre-trained model. The domain shift problem between the source domain and the target domain is handled by a frequency-aware prompting technique encouraging low-frequency components while suppressing high-frequency components. This strategy generates frequency-aware augmented samples that are robust against noisy pseudo labels. The noisy pseudo-label problem is further addressed with an uncertainty-aware weighting strategy where the mean and covariance matrix are weighted by prediction uncertainties, thus mitigating the adverse effects of noisy pseudo labels. In addition, the issue of catastrophic forgetting (CF) is overcome by kernel linear discriminant analysis (KLDA), where the backbone network is frozen while the classification is performed using the linear discriminant analysis approach guided by the random kernel method. Our rigorous numerical studies confirm the advantage of our approach, which outperforms prior methods that have access to source-domain samples by significant margins.
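[AI-49] Understanding the Geospatial Reasoning Capabilities of LLMs: A Trajectory Recovery Perspective
Summary: This paper investigates the geospatial reasoning capabilities of Large Language Models (LLMs), in particular whether they can read road network maps and perform navigation. To test this, the authors use trajectory recovery as a proxy task, which requires models to reconstruct complete paths from masked GPS traces. The key to the approach is the GLOBALTRACE dataset (over 4,000 real-world trajectories across diverse regions and transportation modes) together with a prompting framework that uses the road network as context, enabling LLMs to generate valid paths without external navigation tools. Experiments show that LLMs outperform off-the-shelf baselines and specialized trajectory recovery models with strong zero-shot generalization, and that they can improve navigation experiences by reasoning flexibly over maps to incorporate user preferences.
Link: https://arxiv.org/abs/2510.01639
Authors: Thinh Hung Truong, Jey Han Lau, Jianzhong Qi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: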
Abstract:We explore the geospatial reasoning capabilities of Large Language Models (LLMs), specifically, whether LLMs can read road network maps and perform navigation. We frame trajectory recovery as a proxy task, which requires models to reconstruct masked GPS traces, and introduce GLOBALTRACE, a dataset with over 4,000 real-world trajectories across diverse regions and transportation modes. Using road network as context, our prompting framework enables LLMs to generate valid paths without accessing any external navigation tools. Experiments show that LLMs outperform off-the-shelf baselines and specialized trajectory recovery models, with strong zero-shot generalization. Fine-grained analysis shows that LLMs have strong comprehension of the road network and coordinate systems, but also pose systematic biases with respect to regions and transportation modes. Finally, we demonstrate how LLMs can enhance navigation experiences by reasoning over maps in flexible ways to incorporate user preferences.
[AI-50] Towards Human-Centered RegTech: Unpacking Professionals' Strategies and Needs for Using LLMs Safely EMNLP2025
Link: https://arxiv.org/abs/2510.01638
Authors: Siying Hu, Yaxing Yao, Zhicong Lu
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted to the 4th HCI+NLP@EMNLP 2025 Workshop. (Non-archival)
[AI-51] Learning to Decide with Just Enough: Information-Theoretic Context Summarization for CDMPs
Link: https://arxiv.org/abs/2510.01620
Authors: Peidong Liu, Junjiang Lin, Shaowen Wang, Yao Xu, Haiqing Li, Xuhao Xie, Siyi Wu, Hao Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
[AI-52] AgentRec: Next-Generation LLM-Powered Multi-Agent Collaborative Recommendation with Adaptive Intelligence
Link: https://arxiv.org/abs/2510.01609
Authors: Bo Ma, Hang Li, ZeHua Hu, XiaoFan Gui, LuYao Liu, Simon Lau
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
[AI-53] Enhancing Noise Robustness of Parkinson's Disease Telemonitoring via Contrastive Feature Augmentation
Link: https://arxiv.org/abs/2510.01588
Authors: Ziming Tang, Chengbin Hou, Tianyu Zhang, Bangxu Tian, Jinbao Wang, Hairong Lv
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
[AI-54] AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning
Link: https://arxiv.org/abs/2510.01586
Authors: Zhenyu Pan, Yiting Zhang, Zhuo Liu, Yolo Yunlong Tang, Zeliang Zhang, Haozheng Luo, Yuwei Han, Jianshu Zhang, Dennis Wu, Hong-Yu Chen, Haoran Lu, Haoyang Fang, Manling Li, Chenliang Xu, Philip S. Yu, Han Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
[AI-55] From Supervision to Exploration: What Does Protein Language Model Learn During Reinforcement Learning?
Link: https://arxiv.org/abs/2510.01571
Authors: Hanqun Cao, Hongrui Zhang, Junde Xu, Zhou Zhang, Lingdong Shen, Minghao Sun, Ge Liu, Jinbo Xu, Wu-Jun Li, Jinren Ni, Cesar de la Fuente-Nunez, Tianfan Fu, Yejin Choi, Pheng-Ann Heng, Fang Wu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Comments: 24 pages, 7 figures, 4 tables
[AI-56] Rethinking KL Regularization in RLHF: From Value Estimation to Gradient Optimization
Link: https://arxiv.org/abs/2510.01555
Authors: Kezhao Liu, Jason Klein Liu, Mingtao Chen, Yiming Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
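[AI-57] POLAR: Automating Cyber Threat Prioritization through LLM-Powered Assessment
Summary: This paper addresses the significant performance gaps of Large Language Models (LLMs) in cyber threat intelligence (CTI) applications, focusing on intrinsic vulnerabilities that arise from the nature of the threat landscape itself rather than from the model architecture. The key to the approach is a novel categorization methodology that integrates stratification, autoregressive refinement, and human-in-the-loop supervision to reliably analyze failure instances of LLMs on CTI tasks. It reveals three fundamental vulnerabilities: spurious correlations, contradictory knowledge, and constrained generalization. Based on extensive experiments and human inspection, the study offers actionable insights for designing more robust LLM-powered CTI systems.
Link: https://arxiv.org/abs/2510.01552
Authors: Luoxi Tang, Yuqiao Meng, Ankita Patra, Weicheng Ma, Muchao Ye, Zhaohan Xi
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 25 pages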
Abstract:Large Language Models (LLMs) are intensively used to assist security analysts in counteracting the rapid exploitation of cyber threats, wherein LLMs offer cyber threat intelligence (CTI) to support vulnerability assessment and incident response. While recent work has shown that LLMs can support a wide range of CTI tasks such as threat analysis, vulnerability detection, and intrusion defense, significant performance gaps persist in practical deployments. In this paper, we investigate the intrinsic vulnerabilities of LLMs in CTI, focusing on challenges that arise from the nature of the threat landscape itself rather than the model architecture. Using large-scale evaluations across multiple CTI benchmarks and real-world threat reports, we introduce a novel categorization methodology that integrates stratification, autoregressive refinement, and human-in-the-loop supervision to reliably analyze failure instances. Through extensive experiments and human inspections, we reveal three fundamental vulnerabilities: spurious correlations, contradictory knowledge, and constrained generalization, that limit LLMs in effectively supporting CTI. Subsequently, we provide actionable insights for designing more robust LLM-powered CTI systems to facilitate future research.
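[AI-58] Predictive Preference Learning from Human Interventions NEURIPS2025
Summary: This paper addresses a key limitation of interactive imitation learning: existing methods correct the agent's action only at the current state and do not adjust its behavior in future states, which may be more hazardous. The authors propose Predictive Preference Learning from Human Interventions (PPL), which uses the implicit preference signals contained in human interventions to inform predictions of future rollouts. The key idea is to bootstrap each human intervention into L future time steps, called the preference horizon, under the assumption that the agent follows the same action and the human makes the same intervention throughout the horizon. Applying preference optimization on these future states propagates expert corrections into the safety-critical regions the agent is expected to explore, which improves learning efficiency and reduces the number of human demonstrations needed. Theoretical analysis further shows that an appropriate preference horizon L balances coverage of risky states against label correctness, thereby bounding the algorithmic optimality gap.
Link: https://arxiv.org/abs/2510.01545
Authors: Haoyuan Cai, Zhenghao Peng, Bolei Zhou
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: NeurIPS 2025 Spotlight. Project page: this https URL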
Abstract:Learning from human involvement aims to incorporate the human subject to monitor and correct agent behavior errors. Although most interactive imitation learning methods focus on correcting the agent’s action at the current state, they do not adjust its actions in future states, which may be potentially more hazardous. To address this, we introduce Predictive Preference Learning from Human Interventions (PPL), which leverages the implicit preference signals contained in human interventions to inform predictions of future rollouts. The key idea of PPL is to bootstrap each human intervention into L future time steps, called the preference horizon, with the assumption that the agent follows the same action and the human makes the same intervention in the preference horizon. By applying preference optimization on these future states, expert corrections are propagated into the safety-critical regions where the agent is expected to explore, significantly improving learning efficiency and reducing human demonstrations needed. We evaluate our approach with experiments on both autonomous driving and robotic manipulation benchmarks and demonstrate its efficiency and generality. Our theoretical analysis further shows that selecting an appropriate preference horizon L balances coverage of risky states with label correctness, thereby bounding the algorithmic optimality gap. Demo and code are available at: this https URL
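The bootstrapping step described above can be sketched as follows. This is only one possible reading of the abstract, not the authors' implementation: the names `bootstrap_intervention`, `predict_next`, and `toy_dynamics`, the scalar state and action types, and the way future states are predicted are all hypothetical.

```python
from typing import Callable, List, Tuple

State = float
Action = float
Preference = Tuple[State, Action, Action]  # (state, preferred action, rejected action)

def bootstrap_intervention(
    state: State,
    agent_action: Action,
    human_action: Action,
    predict_next: Callable[[State, Action], State],
    horizon: int,
) -> List[Preference]:
    """Expand one intervention into `horizon` preference labels on predicted states."""
    prefs: List[Preference] = []
    s = state
    for _ in range(horizon):
        # Assumption: the agent keeps repeating its action over the preference horizon.
        s = predict_next(s, agent_action)
        # Assumption: the human would make the same correction in each predicted state.
        prefs.append((s, human_action, agent_action))
    return prefs

def toy_dynamics(s: State, a: Action) -> State:
    return s + a  # hypothetical 1-D dynamics, for illustration only

prefs = bootstrap_intervention(0.0, agent_action=1.0, human_action=-1.0,
                               predict_next=toy_dynamics, horizon=3)
print(prefs)
```

The resulting tuples could then feed any preference-optimization loss. A longer `horizon` covers more predicted states but makes the "same intervention" assumption less reliable, which is the coverage versus label-correctness trade-off the abstract mentions.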
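[AI-59] Step-Aware Policy Optimization for Reasoning in Diffusion Large Language Models
Summary: This paper addresses the difficulty of training diffusion language models (dLLMs) for complex reasoning, in particular the flaw that existing reinforcement learning methods rely on sparse, outcome-based rewards and can therefore reinforce flawed reasoning paths that happen to reach correct answers. The key to the solution is a theoretical framework that formalizes complex problem solving as a hierarchical selection process, decomposing a global constraint into a series of localized logical steps. Building on this, the authors design Step-Aware Policy Optimization (SAPO), which uses a process-based reward function to align the denoising process with the latent reasoning hierarchy and to encourage incremental progress, so that the model learns structured, coherent reasoning paths. The paper reports significantly improved performance on reasoning benchmarks and better interpretability of the generation process.
Link: https://arxiv.org/abs/2510.01544
Authors: Shaoan Xie, Lingjing Kong, Xiangchen Song, Xinshuai Dong, Guangyi Chen, Eric P. Xing, Kun Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: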
Abstract:Diffusion language models (dLLMs) offer a promising, non-autoregressive paradigm for text generation, yet training them for complex reasoning remains a key challenge. Current reinforcement learning approaches often rely on sparse, outcome-based rewards, which can reinforce flawed reasoning paths that lead to coincidentally correct answers. We argue that this stems from a fundamental mismatch with the natural structure of reasoning. We first propose a theoretical framework that formalizes complex problem solving as a hierarchical selection process, where an intractable global constraint is decomposed into a series of simpler, localized logical steps. This framework provides a principled foundation for algorithm design, including theoretical insights into the identifiability of this latent reasoning structure. Motivated by this theory, we identify unstructured refinement – a failure mode where a model’s iterative steps do not contribute meaningfully to the solution – as a core deficiency in existing methods. We then introduce Step-Aware Policy Optimization (SAPO), a novel RL algorithm that aligns the dLLM’s denoising process with the latent reasoning hierarchy. By using a process-based reward function that encourages incremental progress, SAPO guides the model to learn structured, coherent reasoning paths. Our empirical results show that this principled approach significantly improves performance on challenging reasoning benchmarks and enhances the interpretability of the generation process.
[AI-60] LOGicalThought: Logic-Based Ontological Grounding of LLMs for High-Assurance Reasoning
Link: https://arxiv.org/abs/2510.01530
Authors: Navapat Nananukul, Yue Zhang, Ryan Lee, Eric Boxer, Jonathan May, Vibhav Giridhar Gogate, Jay Pujara, Mayank Kejriwal
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
[AI-61] Towards Interpretable and Inference-Optimal COT Reasoning with Sparse Autoencoder-Guided Generation
Link: https://arxiv.org/abs/2510.01528
Authors: Daniel Zhao, Abhilash Shankarampeta, Lanxiang Hu, Tajana Rosing, Hao Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
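[AI-62] Predictive Modeling and Explainable AI for Veterinary Safety Profiles, Residue Assessment, and Health Outcomes Using Real-World Data and Physicochemical Properties
Summary: This paper addresses safety assessment of veterinary drugs used in food-producing animals, specifically predicting whether adverse events (AEs) end in death or recovery in order to identify high-risk drug-event combinations and reduce the risk of violative residues in the food chain. The key to the solution is a predictive framework that combines rigorous data engineering, advanced machine learning models, and explainable AI. AEs are standardized with the VeDDRA ontology and drug physicochemical properties are integrated to enrich the features; undersampling and oversampling mitigate class imbalance; and ensemble methods (Voting, Stacking) together with Average Uncertainty Margin (AUM)-based pseudo-labeling improve recall for the minority class (fatal outcomes). SHAP values provide interpretability and identify lung, heart, and bronchial disorders, animal demographics (species and age), and drug physicochemical properties as key drivers of fatal outcomes.
Link: https://arxiv.org/abs/2510.01520
Authors: Hossein Sholehrasa, Xuan Xu, Doina Caragea, Jim E. Riviere, Majid Jaberi-Douraki
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: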
Abstract:The safe use of pharmaceuticals in food-producing animals is vital to protect animal welfare and human food safety. Adverse events (AEs) may signal unexpected pharmacokinetic or toxicokinetic effects, increasing the risk of violative residues in the food chain. This study introduces a predictive framework for classifying outcomes (Death vs. Recovery) using ~1.28 million reports (1987-2025 Q1) from the U.S. FDA’s OpenFDA Center for Veterinary Medicine. A preprocessing pipeline merged relational tables and standardized AEs through VeDDRA ontologies. Data were normalized, missing values imputed, and high-cardinality features reduced; physicochemical drug properties were integrated to capture chemical-residue links. We evaluated supervised models, including Random Forest, CatBoost, XGBoost, ExcelFormer, and large language models (Gemma 3-27B, Phi 3-12B). Class imbalance was addressed with techniques such as undersampling and oversampling, with a focus on prioritizing recall for fatal outcomes. Ensemble methods (Voting, Stacking) and CatBoost performed best, achieving precision, recall, and F1-scores of 0.95. Incorporating Average Uncertainty Margin (AUM)-based pseudo-labeling of uncertain cases improved minority-class detection, particularly in ExcelFormer and XGBoost. Interpretability via SHAP identified biologically plausible predictors, including lung, heart, and bronchial disorders, animal demographics, and drug physicochemical properties. These features were strongly linked to fatal outcomes. Overall, the framework shows that combining rigorous data engineering, advanced machine learning, and explainable AI enables accurate, interpretable predictions of veterinary safety outcomes. The approach supports FARAD’s mission by enabling early detection of high-risk drug-event profiles, strengthening residue risk assessment, and informing regulatory and clinical decision-making.
[AI-63] Lateral Tree-of-Thoughts Surpasses ToT by Incorporating Logically-Consistent Low-Utility Candidates
Link: https://arxiv.org/abs/2510.01500
Authors: Abhinav Madahar
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
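[AI-64] Beyond Majority Voting: LLM Aggregation by Leveraging Higher-Order Information
Summary: This paper addresses a core challenge in multi-agent large language model (LLM) reasoning: how to aggregate answers from multiple models effectively. Standard majority voting treats all answers equally and ignores latent heterogeneity and correlation across models, which limits the reliability of the decision. The paper proposes two new aggregation algorithms, Optimal Weight (OW) and Inverse Surprising Popularity (ISP), which leverage first-order information (such as the accuracy of individual models) and second-order information (such as agreement and correlation between models). Theoretical analysis shows that, under mild assumptions, these methods provably mitigate the inherent limitations of majority voting. Empirically, both methods consistently outperform majority voting on synthetic data, the UltraFeedback and MMLU benchmarks, and the real-world healthcare setting ARMMAN.
Link: https://arxiv.org/abs/2510.01499
Authors: Rui Ai, Yuqi Pan, David Simchi-Levi, Milind Tambe, Haifeng Xu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments: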
Abstract:With the rapid progress of multi-agent large language model (LLM) reasoning, how to effectively aggregate answers from multiple LLMs has emerged as a fundamental challenge. Standard majority voting treats all answers equally, failing to consider latent heterogeneity and correlation across models. In this work, we design two new aggregation algorithms called Optimal Weight (OW) and Inverse Surprising Popularity (ISP), leveraging both first-order and second-order information. Our theoretical analysis shows these methods provably mitigate inherent limitations of majority voting under mild assumptions, leading to more reliable collective decisions. We empirically validate our algorithms on synthetic datasets, popular LLM fine-tuning benchmarks such as UltraFeedback and MMLU, and a real-world healthcare setting ARMMAN. Across all cases, our methods consistently outperform majority voting, offering both practical performance gains and conceptual insights for the design of robust multi-agent LLM pipelines.
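To make the limitation of majority voting concrete, the sketch below contrasts it with the classical log-odds weighted vote, which uses per-model accuracy and is a textbook rule for independent voters. It is not claimed to be the paper's OW or ISP algorithm; the function names, answers, and accuracies are made up for illustration.

```python
import math
from collections import defaultdict

def majority_vote(answers):
    """Plain majority: every answer counts equally."""
    counts = defaultdict(float)
    for a in answers:
        counts[a] += 1.0
    return max(counts, key=counts.get)

def weighted_vote(answers, accuracies):
    """Weight each answer by log(p / (1 - p)), the classical log-odds weight."""
    scores = defaultdict(float)
    for a, p in zip(answers, accuracies):
        scores[a] += math.log(p / (1.0 - p))
    return max(scores, key=scores.get)

# Hypothetical case: two weak models agree on "B", one strong model says "A".
answers = ["B", "B", "A"]
accuracies = [0.55, 0.55, 0.95]
print(majority_vote(answers), weighted_vote(answers, accuracies))
```

Here the two rules disagree because the single high-accuracy model outweighs the two near-chance models. Correlation between models (second-order information) is not modeled in this sketch.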
[AI-65] Understanding Adversarial Transfer: Why Representation-Space Attacks Fail Where Data-Space Attacks Succeed
Link: https://arxiv.org/abs/2510.01494
Authors: Isha Gupta, Rylan Schaeffer, Joshua Kazdan, Ken Liu, Sanmi Koyejo
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
[AI-66] VL-KnG: Visual Scene Understanding for Navigation Goal Identification using Spatiotemporal Knowledge Graphs
Link: https://arxiv.org/abs/2510.01483
Authors: Mohamad Al Mdfaa, Svetlana Lukina, Timur Akhtyamov, Arthur Nigmatzyanov, Dmitrii Nalberskii, Sergey Zagoruyko, Gonzalo Ferrer
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: This work has been submitted to the IEEE for possible publication
[AI-67] AIReg-Bench: Benchmarking Language Models That Assess AI Regulation Compliance
Link: https://arxiv.org/abs/2510.01474
Authors: Bill Marino, Rosco Hunter, Zubair Jamali, Marinos Emmanouil Kalpakos, Mudra Kashyap, Isaiah Hinton, Alexa Hanson, Maahum Nazir, Christoph Schnabl, Felix Steffek, Hongkai Wen, Nicholas D. Lane
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
[AI-68] From keywords to semantics: Perceptions of large language models in data discovery
Link: https://arxiv.org/abs/2510.01473
Authors: Maura E Halstead, Mark A. Green, Caroline Jay, Richard Kingston, David Topping, Alexander Singleton
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:
[AI-69] RealClass: A Framework for Classroom Speech Simulation with Public Datasets and Game Engines
Link: https://arxiv.org/abs/2510.01462
Authors: Ahmed Adel Attia, Jing Liu, Carol Espy Wilson
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: arXiv admin note: substantial text overlap with arXiv:2506.09206
[AI-70] The Three Regimes of Offline-to-Online Reinforcement Learning
Link: https://arxiv.org/abs/2510.01460
Authors: Lu Li, Tianwei Ni, Yihao Sun, Pierre-Luc Bacon
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
[AI-71] The Command Line GUIde: Graphical Interfaces from Man Pages via AI
Link: https://arxiv.org/abs/2510.01453
Authors: Saketh Ram Kasibatla, Kiran Medleri Hiremath, Raven Rothkopf, Sorin Lerner, Haijun Xia, Brian Hempel
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 5 pages, 4 figures, In Proceedings of the IEEE Symposium on Visual Languages and Human Centric Computing (VL/HCC), October 2025
[AI-72] Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regression
【速读】:该论文旨在解决当前Transformer架构中注意力机制表达能力不足的问题,尤其是在面对非平稳数据时缺乏理论支撑的高效建模能力。现有研究多聚焦于降低Softmax Attention的计算复杂度,而对更具表达力且基于理论洞察的替代机制探索较少。其解决方案的关键在于提出了一种名为局部线性注意力(Local Linear Attention, LLA)的新机制,该机制源自非参数统计学视角下的测试时回归分析,通过偏差-方差权衡理论证明其在关联记忆任务中优于线性注意力和Softmax注意力。为应对LLA固有的高计算复杂度(Θ(n2d) 和 Θ(nd2)),作者进一步设计了两种内存高效的原语,并引入FlashLLA算法实现硬件友好的分块并行计算,同时开发定制化推理内核以显著降低内存开销,从而在保证理论优势的同时具备实际部署可行性。
链接: https://arxiv.org/abs/2510.01450
作者: Yifei Zuo,Yutong Yin,Zhichen Zeng,Ang Li,Banghua Zhu,Zhaoran Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Transformer architectures have achieved remarkable success in various domains. While efficient alternatives to Softmax Attention have been widely studied, the search for more expressive mechanisms grounded in theoretical insight, even at greater computational cost, has been relatively underexplored. In this work, we bridge this gap by proposing Local Linear Attention (LLA), a novel attention mechanism derived from nonparametric statistics through the lens of test-time regression. First, we show that LLA offers theoretical advantages over Linear and Softmax Attention for associative memory via a bias-variance trade-off analysis. Next, we address its computational challenges and propose two memory-efficient primitives to tackle the \Theta(n^2 d) and \Theta(n d^2) complexity. We then introduce FlashLLA, a hardware-efficient, blockwise algorithm that enables scalable and parallel computation on modern accelerators. In addition, we implement and profile a customized inference kernel that significantly reduces memory overheads. Finally, we empirically validate the advantages and limitations of LLA on test-time regression, in-context regression, associative recall and state tracking tasks. Experiment results demonstrate that LLA effectively adapts to non-stationarity, outperforming strong baselines in test-time training and in-context learning, and exhibiting promising evidence for its scalability and applicability in large-scale models. Code is available at this https URL.
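The regression view in this abstract can be illustrated with a toy sketch. It assumes that Softmax Attention is read as a locally constant (Nadaraya-Watson) fit and that LLA follows the classical locally linear estimator from nonparametric statistics; the scalar keys, the `ridge` parameter, and all function names are illustrative assumptions, and this is not the paper's FlashLLA algorithm or kernel.

```python
import math

def softmax_weights(query, keys):
    """Softmax weights for a scalar query and scalar keys (score = q * k)."""
    scores = [query * k for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_attention(query, keys, values):
    """Locally constant fit: a weighted average of the values."""
    w = softmax_weights(query, keys)
    return sum(wi * vi for wi, vi in zip(w, values))

def local_linear_attention(query, keys, values, ridge=0.0):
    """Locally linear fit: weighted least squares of v on (k - q).

    The fitted intercept is the prediction at the query. `ridge` is an
    assumed regularizer on the slope, not taken from the paper.
    """
    w = softmax_weights(query, keys)
    x = [k - query for k in keys]  # keys centered at the query
    s0 = sum(w)
    s1 = sum(wi * xi for wi, xi in zip(w, x))
    s2 = sum(wi * xi * xi for wi, xi in zip(w, x)) + ridge
    t0 = sum(wi * vi for wi, vi in zip(w, values))
    t1 = sum(wi * xi * vi for wi, xi, vi in zip(w, x, values))
    # Intercept of the 2x2 weighted normal equations.
    return (t0 * s2 - s1 * t1) / (s0 * s2 - s1 * s1)

keys = [0.0, 1.0, 2.0, 3.0]
values = [1.0 + 2.0 * k for k in keys]  # values exactly linear in the keys
query = 1.5
constant_fit = softmax_attention(query, keys, values)
linear_fit = local_linear_attention(query, keys, values)
```

When the values are exactly linear in the keys, the locally linear fit recovers the true value at the query for any positive weights, while the weighted average is pulled toward the heavily weighted keys. This is the kind of bias reduction a locally linear estimator buys, at the cost of solving a small least-squares problem per query.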
[AI-73] AFFORD2ACT: Affordance-Guided Automatic Keypoint Selection for Generalizable and Lightweight Robotic Manipulation
链接: https://arxiv.org/abs/2510.01433
作者: Anukriti Singh,Kasra Torshizi,Khuzema Habib,Kelin Yu,Ruohan Gao,Pratap Tokekar
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
[AI-74] A Tale of LLMs and Induced Small Proxies: Scalable Agents for Knowledge Mining
链接: https://arxiv.org/abs/2510.01427
作者: Sipeng Zhang,Longfei Yun,Zilong Wang,Jingbo Shang,Letian Peng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
[AI-75] OntoLogX: Ontology-Guided Knowledge Graph Extraction from Cybersecurity Logs with Large Language Models
链接: https://arxiv.org/abs/2510.01409
作者: Luca Cotti,Idilio Drago,Anisa Rula,Devis Bianchini,Federico Cerutti
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 20 pages, 6 tables, 7 figures
[AI-76] Automating Data-Driven Modeling and Analysis for Engineering Applications using Large Language Model Agents
链接: https://arxiv.org/abs/2510.01398
作者: Yang Liu,Zaid Abulawi,Abhiram Garimidi,Doyeong Lim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
[AI-77] Neural Network Surrogates for Free Energy Computation of Complex Chemical Systems ICSE
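【速读】:This paper addresses the computational bottleneck that arises when free energy reconstruction methods such as Gaussian Process Regression (GPR) are used with complex or machine-learned collective variables (CVs), for which no analytical Jacobian is available. The key to the solution is a neural network surrogate framework that learns CVs directly from Cartesian coordinates and uses automatic differentiation to produce the Jacobians, so no hand-derived analytical expressions are needed. The method achieves high accuracy for both a simple distance CV and a complex coordination-number CV, and the Jacobian errors are approximately Gaussian, which suits GPR pipelines. This broadens the use of gradient-based free energy methods with complex CVs in biochemistry and materials simulations.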
链接: https://arxiv.org/abs/2510.01396
作者: Wasut Pornpatcharapong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph)
备注: 6 pages, 4 figures. This work has already been accepted for presentation in The 29th International Computer Science and Engineering Conference (ICSEC) 2025, Chiang Mai, Thailand, and will be published in IEEE Xplore
Abstract:Free energy reconstruction methods such as Gaussian Process Regression (GPR) require Jacobians of the collective variables (CVs), a bottleneck that restricts the use of complex or machine-learned CVs. We introduce a neural network surrogate framework that learns CVs directly from Cartesian coordinates and uses automatic differentiation to provide Jacobians, bypassing analytical forms. On an MgCl2 ion-pairing system, our method achieved high accuracy for both a simple distance CV and a complex coordination-number CV. Moreover, Jacobian errors also followed a near-Gaussian distribution, making them suitable for GPR pipelines. This framework enables gradient-based free energy methods to incorporate complex and machine-learned CVs, broadening the scope of biochemistry and materials simulations.
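To make "automatic differentiation provides the Jacobian" concrete, here is a minimal sketch of forward-mode differentiation with dual numbers, applied to a plain interatomic distance CV. It is an illustration under stated assumptions: the paper differentiates a neural network surrogate with an autodiff framework, whereas the `Dual` class, `distance_cv`, and `jacobian` below are hypothetical helpers written only for this example.

```python
import math

class Dual:
    """A value together with its derivative with respect to one chosen input."""
    def __init__(self, val, dot=0.0):
        self.val = val
        self.dot = dot

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __sub__(self, other):
        return Dual(self.val - other.val, self.dot - other.dot)

    def __mul__(self, other):
        # Product rule.
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def dual_sqrt(x):
    root = math.sqrt(x.val)
    return Dual(root, x.dot / (2.0 * root))

def distance_cv(coords):
    """Distance between two atoms; coords = [x1, y1, z1, x2, y2, z2]."""
    total = Dual(0.0)
    for axis in range(3):
        diff = coords[axis + 3] - coords[axis]
        total = total + diff * diff
    return dual_sqrt(total)

def jacobian(cv, point):
    """One forward pass per coordinate, seeding that coordinate's derivative to 1."""
    row = []
    for i in range(len(point)):
        seeded = [Dual(x, 1.0 if j == i else 0.0) for j, x in enumerate(point)]
        row.append(cv(seeded).dot)
    return row

point = [0.0, 0.0, 0.0, 3.0, 4.0, 0.0]
jac = jacobian(distance_cv, point)
```

No derivative formula for the distance is written anywhere in the sketch; the Jacobian falls out of the arithmetic rules. That is the property that lets a learned CV, for which no closed form exists, still supply gradients to a GPR pipeline.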
[AI-78] Sycophantic AI Decreases Prosocial Intentions and Promotes Dependence
链接: https://arxiv.org/abs/2510.01395
作者: Myra Cheng,Cinoo Lee,Pranav Khadpe,Sunny Yu,Dyllan Han,Dan Jurafsky
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
[AI-79] INSIGHT: INference-time Sequence Introspection for Generating Help Triggers in Vision-Language-Action Models
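【速读】:This paper addresses the lack of introspective mechanisms in Vision-Language-Action (VLA) models during task execution: they cannot anticipate failures and proactively request help from a human supervisor. The key to the solution is INSIGHT, a learning framework that uses token-level uncertainty signals to predict when help should be requested. Specifically, using π₀-FAST as the underlying model, it extracts per-token entropy, log-probability, and Dirichlet-based estimates of aleatoric and epistemic uncertainty, and trains compact transformer classifiers to map these sequences to help triggers. The study finds that modeling the temporal evolution of token-level uncertainty yields far greater predictive power than static sequence-level scores, enabling more reliable introspection.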
链接: https://arxiv.org/abs/2510.01389
作者: Ulas Berk Karli,Ziyao Shangguan,Tesca Fitzgerald
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Recent Vision-Language-Action (VLA) models show strong generalization capabilities, yet they lack introspective mechanisms for anticipating failures and requesting help from a human supervisor. We present INSIGHT, a learning framework for leveraging token-level uncertainty signals to predict when a VLA should request help. Using π₀-FAST as the underlying model, we extract per-token entropy, log-probability, and Dirichlet-based estimates of aleatoric and epistemic uncertainty, and train compact transformer classifiers to map these sequences to help triggers. We explore supervision regimes for strong or weak supervision, and extensively compare them across in-distribution and out-of-distribution tasks. Our results show a trade-off: strong labels enable models to capture fine-grained uncertainty dynamics for reliable help detection, while weak labels, though noisier, still support competitive introspection when training and evaluation are aligned, offering a scalable path when dense annotation is impractical. Crucially, we find that modeling the temporal evolution of token-level uncertainty signals with transformers provides far greater predictive power than static sequence-level scores. This study provides the first systematic evaluation of uncertainty-based introspection in VLAs, opening future avenues for active learning and for real-time error mitigation through selective human intervention.
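Two of the per-token signals named in the abstract, entropy and log-probability, can be sketched directly from a model's next-token distributions. The function name and input layout below are assumptions made for illustration; the Dirichlet-based aleatoric and epistemic estimates and the transformer classifier are left out because the abstract does not specify them.

```python
import math

def token_signals(step_probs, chosen_ids):
    """Return (entropy, log-probability) for each generated token.

    step_probs: one probability distribution over the vocabulary per step.
    chosen_ids: index of the token actually emitted at each step.
    """
    signals = []
    for probs, token_id in zip(step_probs, chosen_ids):
        # Shannon entropy in nats; zero-probability entries contribute nothing.
        entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
        log_prob = math.log(probs[token_id])
        signals.append((entropy, log_prob))
    return signals

step_probs = [
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain step
    [1.0, 0.0, 0.0, 0.0],      # fully confident step
]
chosen_ids = [2, 0]
signals = token_signals(step_probs, chosen_ids)
```

The result is a sequence with one feature pair per token. A classifier that sees the whole sequence can react to how uncertainty evolves over time, which a single averaged score would hide.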
[AI-80] Retrieval-Augmented Framework for LLM-Based Clinical Decision Support
链接: https://arxiv.org/abs/2510.01363
作者: Leon Garza,Anantaa Kotal,Michael A. Grasso,Emre Umucu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
[AI-81] Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks
链接: https://arxiv.org/abs/2510.01359
作者: Shoumik Saha,Jifan Chen,Sam Mayers,Sanjay Krishna Gouda,Zijian Wang,Varun Kumar
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 28 pages, 21 figures, 9 tables
[AI-82] Low Rank Gradients and Where to Find Them
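【速读】:This paper studies how low-rank gradient structure arises when training two-layer neural networks, and in particular how to characterize that structure and its dependence on the data distribution, the scaling regime, and regularization without the usual isotropy assumptions. The key to the solution is a spiked data model whose bulk may be anisotropic and ill-conditioned, under which the gradient with respect to the input weights is analyzed in both the mean-field and neural-tangent-kernel scalings. The theory shows that this gradient is approximately low rank and dominated by two rank-one terms: one aligned with the bulk data-residue, the other aligned with the rank-one spike in the input data. The paper further shows how properties of the training data, the activation function, and regularizers (weight decay, input noise, Jacobian penalties) govern the balance between these two components, which offers a theoretical basis for understanding implicit regularization in deep learning optimization.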
链接: https://arxiv.org/abs/2510.01303
作者: Rishi Sonthalia,Michael Murray,Guido Montúfar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:This paper investigates low-rank structure in the gradients of the training loss for two-layer neural networks while relaxing the usual isotropy assumptions on the training data and parameters. We consider a spiked data model in which the bulk can be anisotropic and ill-conditioned, we do not require independent data and weight matrices, and we analyze both the mean-field and neural-tangent-kernel scalings. We show that the gradient with respect to the input weights is approximately low rank and is dominated by two rank-one terms: one aligned with the bulk data-residue, and another aligned with the rank-one spike in the input data. We characterize how properties of the training data, the scaling regime and the activation function govern the balance between these two components. Additionally, we demonstrate that standard regularizers, such as weight decay, input noise and Jacobian penalties, selectively modulate these components. Experiments on synthetic and real data corroborate our theoretical predictions.
[AI-83] The Social Laboratory: A Psychometric Framework for Multi-Agent LLM Evaluation NEURIPS2025
链接: https://arxiv.org/abs/2510.01295
作者: Zarreen Reza
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling
[AI-84] Cyber Academia-Chemical Engineering (CA-ChemE): A Living Digital Town for Self-Directed Research Evolution and Emergent Scientific Discovery
链接: https://arxiv.org/abs/2510.01293
作者: Zekun Jiang,Chunming Xu,Tianhang Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
[AI-85] Microsaccade-Inspired Probing: Positional Encoding Perturbations Reveal LLM Misbehaviours
链接: https://arxiv.org/abs/2510.01288
作者: Rui Melo,Rui Abreu,Corina S. Pasareanu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 main pages, 13 appendix pages
[AI-86] Emergent evaluation hubs in a decentralizing large language model ecosystem
链接: https://arxiv.org/abs/2510.01286
作者: Manuel Cebrian,Tomomi Kito,Raul Castro Fernandez
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 15 pages, 11 figures, 3 tables
[AI-87] An Analysis of the New EU AI Act and A Proposed Standardization Framework for Machine Learning Fairness
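【速读】:This paper addresses shortcomings of the EU AI Act concerning key ethical norms such as fairness and transparency: the absence of quantifiable metrics, ambiguous terminology (for example, using "transparency", "explainability", and "interpretability" interchangeably), and no explicit reference to transparency of ethical compliance. These issues may increase liability risk and deter investment. The key to the solution is a more tailored regulatory framework together with a public system framework for assessing the fairness and transparency of AI systems. The authors also advocate standardizing industry best practices as a complement to broad regulation, to provide the detail industry needs without stifling innovation and investment.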
链接: https://arxiv.org/abs/2510.01281
作者: Mike Teodorescu,Yongxu Sun,Haren N. Bhatia,Christos Makridis
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 6 pages; IEEE HPEC 2025 Poster Session 4-P1 (12:15-13:15): AI/ML/GenAI Poster Session Thursday September 18 2025
Abstract:The European Union’s AI Act represents a crucial step towards regulating ethical and responsible AI systems. However, we find an absence of quantifiable fairness metrics and the ambiguity in terminology, particularly the interchangeable use of the keywords transparency, explainability, and interpretability in the new EU AI Act and no reference to transparency of ethical compliance. We argue that this ambiguity creates substantial liability risk that would deter investment. Fairness transparency is strategically important. We recommend a more tailored regulatory framework to enhance the new EU AI regulation. Furthermore, we propose a public system framework to assess the fairness and transparency of AI systems. Drawing from past work, we advocate for the standardization of industry best practices as a necessary addition to broad regulations to achieve the level of details required in industry, while preventing stifling innovation and investment in the AI sector. The proposals are exemplified with the case of ASR and speech synthesizers.
[AI-88] Noisy-Pair Robust Representation Alignment for Positive-Unlabeled Learning
链接: https://arxiv.org/abs/2510.01278
作者: Hengwei Zhao,Zhengzhong Tu,Zhuo Zheng,Wei Wang,Junjue Wang,Rusty Feagin,Wenzhe Jiao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
[AI-89] Modeling Others' Minds as Code
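【速读】:This paper addresses two problems that existing approaches to modeling human behavior face in practice: they are data-hungry and brittle, and they struggle to adapt rapidly to dynamic settings while remaining computationally efficient. The key to the solution is to treat everyday social interactions as predictable behavioral scripts and to encode them as computer programs (behavioral programs) rather than as policies conditioned on beliefs and desires. The proposed ROTE algorithm uses large language models (LLMs) to synthesize a hypothesis space of behavioral programs and probabilistic inference to reason about uncertainty over that space, predicting human and AI behavior efficiently and accurately from sparse observations. In gridworld tasks and a large-scale embodied household simulator it outperforms behavior cloning and LLM-based baselines by as much as 50% in in-sample accuracy and out-of-sample generalization.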
链接: https://arxiv.org/abs/2510.01272
作者: Kunal Jha,Aydan Yuenan Huang,Eric Ye,Natasha Jaques,Max Kleiman-Weiner
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Accurate prediction of human behavior is essential for robust and safe human-AI collaboration. However, existing approaches for modeling people are often data-hungry and brittle because they either make unrealistic assumptions about rationality or are too computationally demanding to adapt rapidly. Our key insight is that many everyday social interactions may follow predictable patterns; efficient “scripts” that minimize cognitive load for actors and observers, e.g., “wait for the green light, then go.” We propose modeling these routines as behavioral programs instantiated in computer code rather than policies conditioned on beliefs and desires. We introduce ROTE, a novel algorithm that leverages both large language models (LLMs) for synthesizing a hypothesis space of behavioral programs, and probabilistic inference for reasoning about uncertainty over that space. We test ROTE in a suite of gridworld tasks and a large-scale embodied household simulator. ROTE predicts human and AI behaviors from sparse observations, outperforming competitive baselines – including behavior cloning and LLM-based methods – by as much as 50% in terms of in-sample accuracy and out-of-sample generalization. By treating action understanding as a program synthesis problem, ROTE opens a path for AI systems to efficiently and effectively predict human behavior in the real-world.
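The combination of a program hypothesis space with probabilistic inference can be sketched on the abstract's own traffic-light example. Everything below is an illustrative assumption rather than ROTE itself: the three candidate programs are written by hand in place of LLM synthesis, and the uniform prior, the `noise` likelihood, and the function names are hypothetical.

```python
def go_on_green(light):
    return "go" if light == "green" else "wait"

def always_go(light):
    return "go"

def always_wait(light):
    return "wait"

# Hand-written stand-in for a hypothesis space that an LLM would synthesize.
PROGRAMS = {
    "go_on_green": go_on_green,
    "always_go": always_go,
    "always_wait": always_wait,
}

def posterior(observations, programs, noise=0.1, n_actions=2):
    """Bayesian update over programs from (observation, action) pairs.

    A program explains an action with probability 1 - noise when it agrees,
    and spreads the remaining mass over the other actions when it does not.
    """
    weights = {}
    for name, program in programs.items():
        weight = 1.0 / len(programs)  # uniform prior (assumption)
        for light, action in observations:
            if program(light) == action:
                weight *= 1.0 - noise
            else:
                weight *= noise / (n_actions - 1)
        weights[name] = weight
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}

def predict(light, post, programs):
    """Posterior-weighted vote over the next action."""
    votes = {}
    for name, prob in post.items():
        action = programs[name](light)
        votes[action] = votes.get(action, 0.0) + prob
    return max(votes, key=votes.get)

observations = [("red", "wait"), ("green", "go"), ("red", "wait")]
post = posterior(observations, PROGRAMS)
```

Three observations are enough to concentrate the posterior on the "wait for the green light, then go" script, which shows how a small space of explicit programs can be fit from sparse data.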
[AI-90] Identifying Information-Transfer Nodes in a Recurrent Neural Network Reveals Dynamic Representations
链接: https://arxiv.org/abs/2510.01271
作者: Arend Hintze,Asadullah Najam,Jory Schossau
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
[AI-91] Budgeted Broadcast: An Activity-Dependent Pruning Rule for Neural Network Efficiency
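【速读】:This paper addresses how neural network pruning can achieve more efficient and more diverse parameter sparsification while preserving model performance. Conventional pruning methods usually remove parameters ranked by their impact on the loss (e.g., magnitude or gradient), but overlook the balance between the efficiency of information flow among units and their selectivity. The paper proposes Budgeted Broadcast (BB), whose core idea is a local traffic budget, the product of each unit's long-term on-rate a_i and fan-out k_i. A constrained-entropy analysis yields the optimal selectivity-audience balance \log\frac{1-a_i}{a_i} = \beta k_i. Simple local actuators enforce this balance by pruning either fan-in (to lower activity) or fan-out (to reduce broadcast), which increases coding entropy and feature decorrelation under a global traffic budget and improves accuracy at matched sparsity across several tasks (speech recognition, face identification, synapse prediction), sometimes exceeding dense baselines.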
链接: https://arxiv.org/abs/2510.01263
作者: Yaron Meirovitch,Fuming Yang,Jeff Lichtman,Nir Shavit
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Most pruning methods remove parameters ranked by impact on loss (e.g., magnitude or gradient). We propose Budgeted Broadcast (BB), which gives each unit a local traffic budget (the product of its long-term on-rate a_i and fan-out k_i). A constrained-entropy analysis shows that maximizing coding entropy under a global traffic budget yields a selectivity-audience balance, \log\frac{1-a_i}{a_i} = \beta k_i. BB enforces this balance with simple local actuators that prune either fan-in (to lower activity) or fan-out (to reduce broadcast). In practice, BB increases coding entropy and decorrelation and improves accuracy at matched sparsity across Transformers for ASR, ResNets for face identification, and 3D U-Nets for synapse prediction, sometimes exceeding dense baselines. On electron microscopy images, it attains state-of-the-art F1 and PR-AUC under our evaluation protocol. BB is easy to integrate and suggests a path toward learning more diverse and efficient representations.
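The selectivity-audience balance can be solved for the on-rate, which gives a_i = 1 / (1 + exp(β k_i)). The sketch below evaluates that relation and the residual a local rule could monitor. The function names and the value of `beta` are assumptions for illustration; the abstract does not spell out the paper's actual actuator logic, so none is implemented here.

```python
import math

def balanced_on_rate(fan_out, beta):
    """On-rate a that solves log((1 - a) / a) = beta * fan_out."""
    return 1.0 / (1.0 + math.exp(beta * fan_out))

def balance_residual(on_rate, fan_out, beta):
    """log((1 - a) / a) - beta * k; zero when the unit sits on the balance."""
    return math.log((1.0 - on_rate) / on_rate) - beta * fan_out

def traffic(on_rate, fan_out):
    """Local traffic: long-term on-rate times fan-out."""
    return on_rate * fan_out

beta = 0.05  # illustrative multiplier, not a value from the paper
fan_outs = (0, 10, 40)
rates = [balanced_on_rate(k, beta) for k in fan_outs]
```

Under this relation a unit with a larger audience should fire more rarely: the balanced on-rate starts at 0.5 for zero fan-out and decreases as fan-out grows. A negative residual means the unit is more active than its fan-out warrants.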
[AI-92] RSTGCN: Railway-centric Spatio-Temporal Graph Convolutional Network for Train Delay Prediction
链接: https://arxiv.org/abs/2510.01262
作者: Koyena Chowdhury,Paramita Koley,Abhijnan Chakraborty,Saptarshi Ghosh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
[AI-93] IoT-MCP: Bridging LLMs and IoT Systems Through Model Context Protocol
链接: https://arxiv.org/abs/2510.01260
作者: Ningyuan Yang,Guanliang Lyu,Mingchen Ma,Yiyi Lu,Yiming Li,Zhihui Gao,Hancheng Ye,Jianyi Zhang,Tingjun Chen,Yiran Chen
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:
[AI-94] Kant: An Efficient Unified Scheduling System for Large-Scale AI Clusters
链接: https://arxiv.org/abs/2510.01256
作者: Lingling Zeng,Gen Zhang,Jialin Peng,Xiang Xu,Yuan Xu,Lijun Ma
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
备注: 25 pages,15 figures
[AI-95] OR-Toolformer: Modeling and Solving Operations Research Problems with Tool Augmented Large Language Models
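【速读】:This paper addresses two challenges in applying large language models (LLMs) to operations research (OR) problems: the privacy risk of relying on closed-source APIs, and the high compute cost of training open-source models from scratch. The key to the solution is OR-Toolformer, which uses a semi-automatic data synthesis pipeline to generate diverse OR problem-answer pairs and augments the model with external solvers as tools, achieving accurate and generalizable OR problem modeling and solving without depending on closed-source APIs.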
链接: https://arxiv.org/abs/2510.01253
作者: Jianzhang Zhang,Jialong Zhou,Chuang Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) demonstrate strong mathematical reasoning, but reliance on closed-source APIs for OR tasks raises privacy concerns, and training open-source models from scratch incurs high compute costs. We introduce OR-Toolformer, which fine-tunes Llama-3.1-8B-Instruct with a semi-automatic data synthesis pipeline that generates diverse OR problem-answer pairs and augments the model with external solvers to produce API calls. On three of four standard benchmarks, OR-Toolformer achieves up to 80.1% execution accuracy, exceeding size-matched baselines by over 4.3%. In zero-shot evaluation on two unseen OR problem types, it attains 54% average accuracy, a 21 percentage-point improvement over the strongest baseline. These findings validate the efficacy of tool-augmented fine-tuning LLMs for accurate and generalizable OR problem modeling and solving.
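The tool-augmented pattern described here, in which the model emits an API call and an external solver computes the answer, can be sketched as a small dispatcher. The JSON call format, the `TOOLS` registry, and the brute-force knapsack routine standing in for an external solver are all assumptions for illustration; the abstract does not describe the paper's actual API schema or solvers.

```python
import json
from itertools import combinations

def solve_knapsack(values, weights, capacity):
    """Exhaustive 0/1 knapsack; a toy stand-in for an external solver."""
    best_value, best_items = 0, []
    n = len(values)
    for size in range(n + 1):
        for items in combinations(range(n), size):
            if sum(weights[i] for i in items) <= capacity:
                value = sum(values[i] for i in items)
                if value > best_value:
                    best_value, best_items = value, list(items)
    return {"objective": best_value, "items": best_items}

# Hypothetical registry mapping tool names to solver functions.
TOOLS = {"solve_knapsack": solve_knapsack}

def dispatch(tool_call_text):
    """Parse a model-generated tool call and run the named solver."""
    call = json.loads(tool_call_text)
    tool = TOOLS[call["tool"]]
    return tool(**call["arguments"])

# What a fine-tuned model might emit for a small packing problem (hypothetical).
model_output = json.dumps({
    "tool": "solve_knapsack",
    "arguments": {
        "values": [60, 100, 120],
        "weights": [10, 20, 30],
        "capacity": 50,
    },
})
result = dispatch(model_output)
```

The division of labor is the point of the design: the language model only has to translate the problem statement into a well-formed call, and correctness of the numerical answer rests with the solver.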
[AI-96] LegiScout: A Visual Tool for Understanding Complex Legislation
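【速读】:This paper addresses the difficulty of understanding modern legislative frameworks such as the Affordable Care Act (ACA), whose policy structures are complex and whose static charts are hard to interpret. When many agencies, mandates, and interdependencies are involved, conventional static charts are difficult to read even for experts. The key to the solution is LegiScout, a system that transforms static policy diagrams into dynamic force-directed graphs and combines data extraction, natural language processing (NLP), and computer vision to provide interactive visualization of legislative text and structure, improving users' ability to understand and explore complex legal systems.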
链接: https://arxiv.org/abs/2510.01195
作者: Aadarsh Rajiv,Klaus Mueller
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:Modern legislative frameworks, such as the Affordable Care Act (ACA), often involve complex webs of agencies, mandates, and interdependencies. Government issued charts attempt to depict these structures but are typically static, dense, and difficult to interpret - even for experts. We introduce LegiScout, an interactive visualization system that transforms static policy diagrams into dynamic, force-directed graphs, enhancing comprehension while preserving essential relationships. By integrating data extraction, natural language processing, and computer vision techniques, LegiScout supports deeper exploration of not only the ACA but also a wide range of legislative and regulatory frameworks. Our approach enables stakeholders - policymakers, analysts, and the public - to navigate and understand the complexity inherent in modern law.
[AI-97] An Anthropologist LLM to Elicit Users' Moral Preferences through Role-Play
链接: https://arxiv.org/abs/2510.01189
作者: Gianluca De Ninno,Paola Inverardi,Francesca Belotti
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
[AI-98] How to Find Fantastic Papers: Self-Rankings as a Powerful Predictor of Scientific Impact Beyond Peer Review
链接: https://arxiv.org/abs/2510.02143
作者: Buxin Su,Natalie Collina,Garrett Wen,Didong Li,Kyunghyun Cho,Jianqing Fan,Bingxin Zhao,Weijie Su
机构: 未知
类目: Applications (stat.AP); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Machine Learning (cs.LG)
备注:
[AI-99] BioinfoMCP: A Unified Platform Enabling MCP Interfaces in Agentic Bioinformatics
链接: https://arxiv.org/abs/2510.02139
作者: Florensia Widjaja,Zhangtianyi Chen,Juexiao Zhou
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 20 pages, 8 figures, 3 tables
[AI-100] Unlocking Symbol-Level Precoding Efficiency Through Tensor Equivariant Neural Network
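【速读】:This paper addresses the high computational complexity that keeps symbol-level precoding (SLP) based on constructive interference (CI) from practical use. The key to the solution is an end-to-end deep learning (DL) framework that exploits the closed-form structure of the optimal SLP solution and its inherent tensor equivariance (TE), the property that a permutation of the input induces the corresponding permutation of the output, to design networks with a specific parameter-sharing pattern, achieving linear computational complexity while retaining strong generalization. With an attention-based TE module, the framework delivers an approximately 80-times speedup over conventional methods with perfect channel state information (CSI), and it extends to imperfect-CSI scenarios by mapping the CSI, statistics, and symbols to auxiliary variables, capturing substantial performance gains of optimal SLP.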
链接: https://arxiv.org/abs/2510.02108
作者: Jinshuo Zhang,Yafei Wang,Xinping Yi,Wenjin Wang,Shi Jin,Symeon Chatzinotas,Björn Ottersten
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注: This work has been submitted to the IEEE for possible publication
Abstract:Although symbol-level precoding (SLP) based on constructive interference (CI) exploitation offers performance gains, its high complexity remains a bottleneck. This paper addresses this challenge with an end-to-end deep learning (DL) framework with low inference complexity that leverages the structure of the optimal SLP solution in the closed-form and its inherent tensor equivariance (TE), where TE denotes that a permutation of the input induces the corresponding permutation of the output. Building upon the computationally efficient model-based formulations, as well as their known closed-form solutions, we analyze their relationship with linear precoding (LP) and investigate the corresponding optimality condition. We then construct a mapping from the problem formulation to the solution and prove its TE, based on which the designed networks reveal a specific parameter-sharing pattern that delivers low computational complexity and strong generalization. Leveraging these, we propose the backbone of the framework with an attention-based TE module, achieving linear computational complexity. Furthermore, we demonstrate that such a framework is also applicable to imperfect CSI scenarios, where we design a TE-based network to map the CSI, statistics, and symbols to auxiliary variables. Simulation results show that the proposed framework captures substantial performance gains of optimal SLP, while achieving an approximately 80-times speedup over conventional methods and maintaining strong generalization across user numbers and symbol block lengths.
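The equivariance property defined in the abstract, where permuting the input permutes the output in the same way, can be checked on a toy layer. The sketch below is a one-dimensional illustration with made-up weights, not the paper's attention-based TE module; both layer functions are hypothetical.

```python
def equivariant_layer(xs, self_weight=2.0, pooled_weight=0.5):
    """y_i = self_weight * x_i + pooled_weight * mean(x).

    The same two parameters serve every position, which is the kind of
    parameter sharing that makes the layer permutation equivariant.
    """
    mean = sum(xs) / len(xs)
    return [self_weight * x + pooled_weight * mean for x in xs]

def position_dependent_layer(xs):
    """Counterexample: each position has its own weight, so order matters."""
    return [(i + 1) * x for i, x in enumerate(xs)]

def permute(xs, order):
    return [xs[i] for i in order]

xs = [1.0, 2.0, 4.0, 8.0]  # e.g. one feature per user
order = [2, 0, 3, 1]       # a reordering of the users

permute_then_apply = equivariant_layer(permute(xs, order))
apply_then_permute = permute(equivariant_layer(xs), order)
```

Because the shared-parameter layer never refers to a position index, its parameter count does not grow with the number of inputs. That is one intuition for why such networks can generalize across user numbers and block lengths.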
[AI-101] FINCH: Financial Intelligence using Natural language for Contextualized SQL Handling
链接: https://arxiv.org/abs/2510.01887
作者: Avinash Kumar Singh,Bhaskarjit Sarmah,Stefano Pasquali
机构: 未知
类目: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI)
备注:
[AI-102] A Modular Theory of Subjective Consciousness for Natural and Artificial Minds
链接: https://arxiv.org/abs/2510.01864
作者: Michaël Gillon
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注: 41 pages, 3 figures. Under review, comments welcome
[AI-103] NGGAN: Noise Generation GAN Based on the Practical Measurement Dataset for Narrowband Powerline Communications
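【速读】:This paper addresses the difficulty of comprehensively modeling the statistics of nonperiodic asynchronous impulsive noise in narrowband powerline communication (NB-PLC) systems, with the aim of improving impulse noise processing. Existing mathematical noise generative models capture only some noise characteristics and cannot accurately reflect the complexity of practically measured noise. The key to the solution is a noise model based on a generative adversarial network (GAN), the noise-generation GAN (NGGAN), which learns the characteristics of practically measured NB-PLC noise samples for data augmentation. Its specific contributions are: (i) an input signal length chosen to support cyclo-stationary noise generation; (ii) the Wasserstein distance as the loss function, to improve the statistical similarity between generated noise and the training data while ensuring diversity; and (iii) a training set built from real measurements, with comparative analysis showing that NGGAN generates noise closer to the measured noise than approaches based on mathematical models such as the piecewise spectral cyclo-stationary Gaussian model (PSCGM) and the frequency-shift (FRESH) filter.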
链接: https://arxiv.org/abs/2510.01850
作者: Ying-Ren Chien,Po-Heng Chou,You-Jie Peng,Chun-Yuan Huang,Hen-Wai Tsao,Yu Tsao
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
备注: 16 pages, 15 figures, 11 tables, and published in IEEE Transactions on Instrumentation and Measurement, Vol. 74, 2025
Abstract:Capturing comprehensive statistics of nonperiodic asynchronous impulsive noise is a critical issue in enhancing impulse noise processing for narrowband powerline communication (NB-PLC) transceivers. However, existing mathematical noise generative models capture only some of the characteristics of additive noise. Therefore, we propose a generative adversarial network (GAN), called the noise-generation GAN (NGGAN), that learns the complicated characteristics of practically measured noise samples for data augmentation. To closely match the statistics of complicated noise in NB-PLC systems, we measured the NB-PLC noise via the analog coupling and bandpass filtering circuits of a commercial NB-PLC modem to build a realistic dataset. Specifically, the NGGAN design approaches based on the practically measured dataset are as follows: (i) we design the length of input signals that the NGGAN model can fit to facilitate cyclo-stationary noise generation. (ii) Wasserstein distance is used as a loss function to enhance the similarity between the generated noise and the training dataset and ensure that the sample diversity is sufficient for various applications. (iii) To measure the similarity performance of the GAN-based models based on mathematical and practically measured datasets, we perform quantitative and qualitative analyses. The training datasets include (1) a piecewise spectral cyclo-stationary Gaussian model (PSCGM), (2) a frequency-shift (FRESH) filter, and (3) practical measurements from NB-PLC systems. Simulation results demonstrate that the proposed NGGAN trained using waveform characteristics is closer to the practically measured dataset in terms of the quality of the generated noise.
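As a rough illustration of the Wasserstein distance used as a similarity measure, the one-dimensional case between two equal-size sample sets has a closed form: sort both sets and average the absolute differences. This is a sketch only. GAN training with a Wasserstein loss typically estimates the distance with a learned critic rather than by sorting, and the abstract gives no implementation details, so the function below is an assumed stand-in.

```python
def wasserstein_1d(samples_a, samples_b):
    """W1 between two equal-size 1-D empirical distributions via sorted matching."""
    if len(samples_a) != len(samples_b):
        raise ValueError("this sketch assumes equally many samples on both sides")
    pairs = zip(sorted(samples_a), sorted(samples_b))
    return sum(abs(a - b) for a, b in pairs) / len(samples_a)

measured = [0.0, 1.0, 2.0]   # stand-in for measured noise amplitudes
generated = [3.0, 1.0, 2.0]  # stand-in for generator output

distance = wasserstein_1d(measured, generated)
```

The distance depends only on the two sample distributions, not on the order in which samples arrive, and it stays informative when the distributions barely overlap.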
[AI-104] BioBlobs: Differentiable Graph Partitioning for Protein Representation Learning
链接: https://arxiv.org/abs/2510.01632
作者: Xin Wang,Carlos Oliver
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
备注:
[AI-105] Pharmacophore-Guided Generative Design of Novel Drug-Like Molecules NEURIPS-2025
链接: https://arxiv.org/abs/2510.01480
作者: Ekaterina Podplutova,Anastasia Vepreva,Olga A. Konovalova,Vladimir Vinogradov,Dmitrii O. Shkil,Andrei Dmitrenko
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
备注: AI4Mat-NeurIPS-2025 Poster
[AI-106] BioVERSE: Representation Alignment of Biomedical Modalities to LLMs for Multi-Modal Reasoning
链接: https://arxiv.org/abs/2510.01428
作者: Ching-Huei Tsou,Michal Ozery-Flato,Ella Barkan,Diwakar Mahajan,Ben Shapira
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
备注:
[AI-107] Risk Phase Transitions in Spiked Regression: Alignment Driven Benign and Catastrophic Overfitting
链接: https://arxiv.org/abs/2510.01414
作者: Jiping Li,Rishi Sonthalia
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
[AI-108] DeMuon: A Decentralized Muon for Matrix Optimization over Graphs
链接: https://arxiv.org/abs/2510.01377
作者: Chuan He,Shuyi Ren,Jingwei Mao,Erik G. Larsson
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
备注:
[AI-109] Enhancing the development of Cherenkov Telescope Array control software with Large Language Models
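【速读】:This paper addresses the complexity and efficiency bottlenecks in the engineering, maintenance, and operation of the Control and Data Acquisition Software (ACADA) of the Cherenkov Telescope Array Observatory (CTAO), a facility for high-energy gamma-ray astronomy. The key to the solution is to build AI agents based on instruction-finetuned large language models (LLMs) that connect to project-specific documentation and codebases, understand contextual information, call external APIs, and interact with users in natural language, raising the level of automation and human-machine collaboration in CTAO pipelines for daily operations and offline data analysis.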
链接: https://arxiv.org/abs/2510.01299
作者: Dmitriy Kostunin,Elisa Jones,Vladimir Sotnikov,Valery Sotnikov,Sergo Golovachev,Alexandre Strube
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
备注: EuCAIFCon 2025 proceedings
Abstract:We develop AI agents based on instruction-finetuned large language models (LLMs) to assist in the engineering and operation of the Cherenkov Telescope Array Observatory (CTAO) Control and Data Acquisition Software (ACADA). These agents align with project-specific documentation and codebases, understand contextual information, interact with external APIs, and communicate with users in natural language. We present our progress in integrating these features into CTAO pipelines for operations and offline data analysis.
[AI-110] Evaluating New AI Cell Foundation Models on Challenging Kidney Pathology Cases Unaddressed by Previous Foundation Models
链接: https://arxiv.org/abs/2510.01287
作者: Runchen Wang,Junlin Guo,Siqi Lu,Ruining Deng,Zhengyi Lu,Yanfan Zhu,Yuechen Yang,Chongyu Qu,Yu Wang,Shilin Zhao,Catie Chang,Mitchell Wilkes,Mengmeng Yin,Haichun Yang,Yuankai Huo
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
备注:
[AI-111] Mamba Outpaces Reformer in Stock Prediction with Sentiments from Top Ten LLMs
链接: https://arxiv.org/abs/2510.01203
作者: Lokesh Antony Kadiyala,Amir Mirzaeinia
机构: 未知
类目: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
[AI-112] Quantum-Assisted Correlation Clustering
链接: https://arxiv.org/abs/2509.03561
作者: Antonio Macaluso,Supreeth Mysore Venkatesh,Diego Arenas,Matthias Klusch,Andreas Dengel
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: To be published in IEEE QAI 2025 conference
机器学习
[LG-0] KaVa: Latent Reasoning via Compressed KV-Cache Distillation
链接: https://arxiv.org/abs/2510.02312
作者: Anna Kuzina,Maciej Pioro,Paul N. Whatmough,Babak Ehteshami Bejnordi
类目: Machine Learning (cs.LG)
备注: Preprint. Under Review
Abstract:Large Language Models (LLMs) excel at multi-step reasoning problems with explicit chain-of-thought (CoT), but verbose traces incur significant computational costs and memory overhead, and often carry redundant, stylistic artifacts. Latent reasoning has emerged as an efficient alternative that internalizes the thought process, but it suffers from a critical lack of supervision, limiting its effectiveness on complex, natural-language reasoning traces. In this work, we propose KaVa, the first framework that bridges this gap by distilling knowledge directly from a compressed KV-cache of the teacher into a latent-reasoning student via self-distillation, leveraging the representational flexibility of continuous latent tokens to align stepwise KV trajectories. We show that the abstract, unstructured knowledge within compressed KV-cache, which lacks direct token correspondence, can serve as a rich supervisory signal for a latent reasoning student. Empirically, the approach consistently outperforms strong latent baselines, exhibits markedly smaller degradation from equation-only to natural-language traces, and scales to larger backbones while preserving efficiency. These results establish compressed KV-cache distillation as a scalable supervision signal for latent reasoning, combining the accuracy of CoT-trained teachers with the efficiency and deployability of latent inference.
[LG-1] Robust Tangent Space Estimation via Laplacian Eigenvector Gradient Orthogonalization
Link: https://arxiv.org/abs/2510.02308
Authors: Dhruv Kohli, Sawyer J. Robertson, Gal Mishne, Alexander Cloninger
Categories: Machine Learning (cs.LG); Differential Geometry (math.DG)
Abstract:Estimating the tangent spaces of a data manifold is a fundamental problem in data analysis. The standard approach, Local Principal Component Analysis (LPCA), struggles in high-noise settings due to a critical trade-off in choosing the neighborhood size. Selecting an optimal size requires prior knowledge of the geometric and noise characteristics of the data that are often unavailable. In this paper, we propose a spectral method, Laplacian Eigenvector Gradient Orthogonalization (LEGO), that utilizes the global structure of the data to guide local tangent space estimation. Instead of relying solely on local neighborhoods, LEGO estimates the tangent space at each data point by orthogonalizing the gradients of low-frequency eigenvectors of the graph Laplacian. We provide two theoretical justifications of our method. First, a differential geometric analysis on a tubular neighborhood of a manifold shows that gradients of the low-frequency Laplacian eigenfunctions of the tube align closely with the manifold’s tangent bundle, while eigenfunctions with high gradient in directions orthogonal to the manifold lie deeper in the spectrum. Second, a random matrix theoretic analysis demonstrates that low-frequency eigenvectors are robust to sub-Gaussian noise. Through comprehensive experiments, we demonstrate that LEGO yields tangent space estimates that are significantly more robust to noise than those from LPCA, resulting in marked improvements in downstream tasks such as manifold learning, boundary detection, and local intrinsic dimension estimation.
[LG-2] Knowledge Distillation Detection for Open-weights Models NEURIPS2025
Link: https://arxiv.org/abs/2510.02302
Authors: Qin Shi, Amber Yijia Zheng, Qifan Song, Raymond A. Yeh
Categories: Machine Learning (cs.LG)
Comments: NeurIPS 2025
Abstract:We propose the task of knowledge distillation detection, which aims to determine whether a student model has been distilled from a given teacher, under a practical setting where only the student’s weights and the teacher’s API are available. This problem is motivated by growing concerns about model provenance and unauthorized replication through distillation. To address this task, we introduce a model-agnostic framework that combines data-free input synthesis and statistical score computation for detecting distillation. Our approach is applicable to both classification and generative models. Experiments on diverse architectures for image classification and text-to-image generation show that our method improves detection accuracy over the strongest baselines by 59.6% on CIFAR-10, 71.2% on ImageNet, and 20.0% for text-to-image generation. The code is available at this https URL.
[LG-3] Fine-Grained Urban Traffic Forecasting on Metropolis-Scale Road Networks
Link: https://arxiv.org/abs/2510.02278
Authors: Fedor Velikonivtsev, Oleg Platonov, Gleb Bazhenov, Liudmila Prokhorenkova
Categories: Machine Learning (cs.LG)
Abstract:Traffic forecasting on road networks is a complex task of significant practical importance that has recently attracted considerable attention from the machine learning community, with spatiotemporal graph neural networks (GNNs) becoming the most popular approach. The proper evaluation of traffic forecasting methods requires realistic datasets, but current publicly available benchmarks have significant drawbacks, including the absence of information about road connectivity for road graph construction, limited information about road properties, and a relatively small number of road segments that falls short of real-world applications. Further, current datasets mostly contain information about intercity highways with sparsely located sensors, while city road networks arguably present a more challenging forecasting task due to much denser roads and more complex urban traffic patterns. In this work, we provide a more complete, realistic, and challenging benchmark for traffic forecasting by releasing datasets representing the road networks of two major cities, with the largest containing almost 100,000 road segments (more than a 10-fold increase relative to existing datasets). Our datasets contain rich road features and provide fine-grained data about both traffic volume and traffic speed, allowing for building more holistic traffic forecasting systems. We show that most current implementations of neural spatiotemporal models for traffic forecasting have problems scaling to datasets of our size. To overcome this issue, we propose an alternative approach to neural traffic forecasting that uses a GNN without a dedicated module for temporal sequence processing, thus achieving much better scalability, while also demonstrating stronger forecasting performance. We hope our datasets and modeling insights will serve as a valuable resource for research in traffic forecasting.
[LG-4] Diffusion^2: Turning 3D Environments into Radio Frequency Heatmaps
Link: https://arxiv.org/abs/2510.02274
Authors: Kyoungjun Park, Yifan Yang, Changhan Ge, Lili Qiu, Shiqi Jiang
Categories: Machine Learning (cs.LG)
Abstract:Modeling radio frequency (RF) signal propagation is essential for understanding the environment, as RF signals offer valuable insights beyond the capabilities of RGB cameras, which are limited by the visible-light spectrum, lens coverage, and occlusions. It is also useful for supporting wireless diagnosis, deployment, and optimization. However, accurately predicting RF signals in complex environments remains a challenge due to interactions with obstacles such as absorption and reflection. We introduce Diffusion^2, a diffusion-based approach that uses 3D point clouds to model the propagation of RF signals across a wide range of frequencies, from Wi-Fi to millimeter waves. To effectively capture RF-related features from 3D data, we present the RF-3D Encoder, which encapsulates the complexities of 3D geometry along with signal-specific details. These features undergo multi-scale embedding to simulate the actual RF signal dissemination process. Our evaluation, based on synthetic and real-world measurements, demonstrates that Diffusion^2 accurately estimates the behavior of RF signals in various frequency bands and environmental conditions, with an error margin of just 1.9 dB and 27x faster than existing methods, marking a significant advancement in the field. Refer to this https URL for more information.
[LG-5] Transformers Discover Molecular Structure Without Graph Priors
Link: https://arxiv.org/abs/2510.02259
Authors: Tobias Kreiman, Yutong Bai, Fadi Atieh, Elizabeth Weaver, Eric Qu, Aditi S. Krishnapriyan
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
Abstract:Graph Neural Networks (GNNs) are the dominant architecture for molecular machine learning, particularly for molecular property prediction and machine learning interatomic potentials (MLIPs). GNNs perform message passing on predefined graphs often induced by a fixed radius cutoff or k-nearest neighbor scheme. While this design aligns with the locality present in many molecular tasks, a hard-coded graph can limit expressivity due to the fixed receptive field and slows down inference with sparse graph operations. In this work, we investigate whether pure, unmodified Transformers trained directly on Cartesian coordinates (without predefined graphs or physical priors) can approximate molecular energies and forces. As a starting point for our analysis, we demonstrate how to train a Transformer to competitive energy and force mean absolute errors under a matched training compute budget, relative to a state-of-the-art equivariant GNN on the OMol25 dataset. We discover that the Transformer learns physically consistent patterns (such as attention weights that decay inversely with interatomic distance) and flexibly adapts them across different molecular environments due to the absence of hard-coded biases. The use of a standard Transformer also unlocks predictable improvements with respect to scaling training resources, consistent with empirical scaling laws observed in other domains. Our results demonstrate that many favorable properties of GNNs can emerge adaptively in Transformers, challenging the necessity of hard-coded graph inductive biases and pointing toward standardized, scalable architectures for molecular modeling.
[LG-6] Drop-Muon: Update Less, Converge Faster
Link: https://arxiv.org/abs/2510.02239
Authors: Kaja Gruntkowska, Yassine Maziane, Zheng Qu, Peter Richtárik
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
[LG-7] PUL-Inter-slice Defender: An Anomaly Detection Solution for Distributed Slice Mobility Attacks
Link: https://arxiv.org/abs/2510.02236
Authors: Ricardo Misael Ayala Molina, Hyame Assem Alameddine, Makan Pourzandi, Chadi Assi
Categories: Machine Learning (cs.LG)
Comments: 13 pages, 7 figures, 4 tables, journal paper
Abstract:Network Slices (NSs) are virtual networks operating over a shared physical infrastructure, each designed to meet specific application requirements while maintaining consistent Quality of Service (QoS). In Fifth Generation (5G) networks, User Equipment (UE) can connect to and seamlessly switch between multiple NSs to access diverse services. However, this flexibility, known as Inter-Slice Switching (ISS), introduces a potential vulnerability that can be exploited to launch Distributed Slice Mobility (DSM) attacks, a form of Distributed Denial of Service (DDoS) attack. To secure 5G networks and their NSs against DSM attacks, we present PUL-Inter-Slice Defender, an anomaly detection solution that leverages Positive Unlabeled Learning (PUL) and combines Long Short-Term Memory Autoencoders with K-Means clustering. PUL-Inter-Slice Defender uses the Third Generation Partnership Project (3GPP) key performance indicators and performance measurement counters as features for its machine learning models to detect DSM attack variants while maintaining robustness in the presence of contaminated training data. When evaluated on data collected from our 5G testbed based on the open-source free5GC and UERANSIM (a UE/Radio Access Network (RAN) simulator), PUL-Inter-Slice Defender achieved F1-scores exceeding 98.50% on training datasets with 10% to 40% attack contamination, consistently outperforming its counterpart Inter-Slice Defender and other PUL-based solutions combining One-Class Support Vector Machine (OCSVM) with Random Forest and XGBoost.
[LG-8] xLSTM Scaling Laws: Competitive Performance with Linear Time-Complexity
Link: https://arxiv.org/abs/2510.02228
Authors: Maximilian Beck, Kajetan Schweighofer, Sebastian Böck, Sebastian Lehner, Sepp Hochreiter
Categories: Machine Learning (cs.LG)
Comments: Code and data available at this https URL
Abstract:Scaling laws play a central role in the success of Large Language Models (LLMs), enabling the prediction of model performance relative to compute budgets prior to training. While Transformers have been the dominant architecture, recent alternatives such as xLSTM offer linear complexity with respect to context length while remaining competitive in the billion-parameter regime. We conduct a comparative investigation on the scaling behavior of Transformers and xLSTM along the following lines, providing insights to guide future model design and deployment. First, we study the scaling behavior for xLSTM in compute-optimal and over-training regimes using both IsoFLOP and parametric fit approaches on a wide range of model sizes (80M-7B) and number of training tokens (2B-2T). Second, we examine the dependence of optimal model sizes on context length, a pivotal aspect that was largely ignored in previous work. Finally, we analyze inference-time scaling characteristics. Our findings reveal that in typical LLM training and inference scenarios, xLSTM scales favorably compared to Transformers. Importantly, xLSTM’s advantage widens as training and inference contexts grow.
[LG-9] Efficiently Generating Correlated Sample Paths from Multi-step Time Series Foundation Models
Link: https://arxiv.org/abs/2510.02224
Authors: Ethan Baron, Boris Oreshkin, Ruijun Ma, Hanyu Zhang, Kari Torkkola, Michael W. Mahoney, Andrew Gordon Wilson, Tatiana Konstantinova
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
[LG-10] Diffusion Transformers for Imputation: Statistical Efficiency and Uncertainty Quantification NEURIPS2025
Link: https://arxiv.org/abs/2510.02216
Authors: Zeqi Ye, Minshuo Chen
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: 49 pages, 4 figures. Accepted as a poster at NeurIPS 2025
Abstract:Imputation methods play a critical role in enhancing the quality of practical time-series data, which often suffer from pervasive missing values. Recently, diffusion-based generative imputation methods have demonstrated remarkable success compared to autoregressive and conventional statistical approaches. Despite their empirical success, the theoretical understanding of how well diffusion-based models capture complex spatial and temporal dependencies between the missing values and observed ones remains limited. Our work addresses this gap by investigating the statistical efficiency of conditional diffusion transformers for imputation and quantifying the uncertainty in missing values. Specifically, we derive statistical sample complexity bounds based on a novel approximation theory for conditional score functions using transformers, and, through this, construct tight confidence regions for missing values. Our findings also reveal that the efficiency and accuracy of imputation are significantly influenced by the missing patterns. Furthermore, we validate these theoretical insights through simulation and propose a mixed-masking training strategy to enhance the imputation performance.
[LG-11] C2AL: Cohort-Contrastive Auxiliary Learning for Large-scale Recommendation Systems ICLR2026
Link: https://arxiv.org/abs/2510.02215
Authors: Mertcan Cokbas, Ziteng Liu, Zeyi Tao, Chengkai Zhang, Elder Veliz, Qin Huang, Ellie Wen, Huayu Li, Qiang Jin, Murat Duman, Benjamin Au, Guy Lebanon, Sagar Chordia
Categories: Machine Learning (cs.LG)
Comments: Submitted to ICLR 2026
Abstract:Training large-scale recommendation models under a single global objective implicitly assumes homogeneity across user populations. However, real-world data are composites of heterogeneous cohorts with distinct conditional distributions. As models increase in scale and complexity and as more data is used for training, they become dominated by central distribution patterns, neglecting head and tail regions. This imbalance limits the model’s learning ability and can result in inactive attention weights or dead neurons. In this paper, we reveal how the attention mechanism can play a key role in factorization machines for shared embedding selection, and propose to address this challenge by analyzing the substructures in the dataset and exposing those with strong distributional contrast through auxiliary learning. Unlike previous research, which heuristically applies weighted labels or multi-task heads to mitigate such biases, we leverage partially conflicting auxiliary labels to regularize the shared representation. This approach customizes the learning process of attention layers to preserve mutual information with minority cohorts while improving global performance. We evaluated C2AL on massive production datasets with billions of data points each for six SOTA models. Experiments show that the factorization machine is able to capture fine-grained user-ad interactions using the proposed method, achieving up to a 0.16% reduction in normalized entropy overall and delivering gains exceeding 0.30% on targeted minority cohorts.
[LG-12] Poolformer: Recurrent Networks with Pooling for Long-Sequence Modeling
Link: https://arxiv.org/abs/2510.02206
Authors: Daniel Gallo Fernández
Categories: Machine Learning (cs.LG)
[LG-13] High-Fidelity Speech Enhancement via Discrete Audio Tokens
Link: https://arxiv.org/abs/2510.02187
Authors: Luca A. Lanzendörfer, Frédéric Berdoz, Antonis Asonitis, Roger Wattenhofer
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Abstract:Recent autoregressive transformer-based speech enhancement (SE) methods have shown promising results by leveraging advanced semantic understanding and contextual modeling of speech. However, these approaches often rely on complex multi-stage pipelines and low sampling rate codecs, limiting them to narrow and task-specific speech enhancement. In this work, we introduce DAC-SE1, a simplified language model-based SE framework leveraging discrete high-resolution audio representations; DAC-SE1 preserves fine-grained acoustic details while maintaining semantic coherence. Our experiments show that DAC-SE1 surpasses state-of-the-art autoregressive SE methods on both objective perceptual metrics and in a MUSHRA human evaluation. We release our codebase and model checkpoints to support further research in scalable, unified, and high-quality speech enhancement.
[LG-14] Flatness-Aware Stochastic Gradient Langevin Dynamics
Link: https://arxiv.org/abs/2510.02174
Authors: Stefano Bruno, Youngsik Hwang, Jaehyeon An, Sotirios Sabanis, Dong-Young Lim
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Machine Learning (stat.ML)
Abstract:Generalization in deep learning is closely tied to the pursuit of flat minima in the loss landscape, yet classical Stochastic Gradient Langevin Dynamics (SGLD) offers no mechanism to bias its dynamics toward such low-curvature solutions. This work introduces Flatness-Aware Stochastic Gradient Langevin Dynamics (fSGLD), designed to efficiently and provably seek flat minima in high-dimensional nonconvex optimization problems. At each iteration, fSGLD uses the stochastic gradient evaluated at parameters perturbed by isotropic Gaussian noise, commonly referred to as Random Weight Perturbation (RWP), thereby optimizing a randomized-smoothing objective that implicitly captures curvature information. Leveraging these properties, we prove that the invariant measure of fSGLD stays close to a stationary measure concentrated on the global minimizers of a loss function regularized by the Hessian trace whenever the inverse temperature and the scale of random weight perturbation are properly coupled. This result provides a rigorous theoretical explanation for the benefits of random weight perturbation. In particular, we establish non-asymptotic convergence guarantees in Wasserstein distance with the best known rate and derive an excess-risk bound for the Hessian-trace regularized objective. Extensive experiments on noisy-label and large-scale vision tasks, in both training-from-scratch and fine-tuning settings, demonstrate that fSGLD achieves superior or comparable generalization and robustness to baseline algorithms while maintaining the computational cost of SGD, about half that of SAM. Hessian-spectrum analysis further confirms that fSGLD converges to significantly flatter minima.
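As a rough illustration of the update this abstract describes, the sketch below runs fSGLD-style steps on a toy problem: the gradient is evaluated at parameters perturbed by isotropic Gaussian noise (RWP), and a Langevin noise term is then added to the step. The quadratic loss, the names `grad_loss` and `fsgld_step`, and the values of `lr`, `sigma`, and `beta` are all assumptions made for illustration. The paper's actual coupling between inverse temperature and perturbation scale is not reproduced here.

```python
import math
import random


def grad_loss(theta):
    # Gradient of a toy loss 0.5 * ||theta||^2 (hypothetical stand-in for a
    # mini-batch stochastic gradient).
    return list(theta)


def fsgld_step(theta, lr, sigma, beta, rng):
    # 1) Random Weight Perturbation: evaluate the gradient at theta + sigma * eps.
    perturbed = [t + sigma * rng.gauss(0.0, 1.0) for t in theta]
    g = grad_loss(perturbed)
    # 2) Langevin update: gradient step plus Gaussian noise with
    #    standard deviation sqrt(2 * lr / beta).
    noise_scale = math.sqrt(2.0 * lr / beta)
    return [t - lr * gi + noise_scale * rng.gauss(0.0, 1.0)
            for t, gi in zip(theta, g)]


rng = random.Random(0)
theta = [5.0, -3.0]
for _ in range(2000):
    theta = fsgld_step(theta, lr=0.01, sigma=0.1, beta=1e4, rng=rng)
```

Each step costs one gradient evaluation, which matches the abstract's remark that the method keeps the computational cost of SGD.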
[LG-15] NoMod: A Non-modular Attack on Module Learning With Errors
Link: https://arxiv.org/abs/2510.02162
Authors: Cristian Bassotto, Ermes Franch, Marina Krček, Stjepan Picek
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Abstract:The advent of quantum computing threatens classical public-key cryptography, motivating NIST’s adoption of post-quantum schemes such as those based on the Module Learning With Errors (Module-LWE) problem. We present NoMod ML-Attack, a hybrid white-box cryptanalytic method that circumvents the challenge of modeling modular reduction by treating wrap-arounds as statistical corruption and casting secret recovery as robust linear estimation. Our approach combines optimized lattice preprocessing, including reduced-vector saving and algebraic amplification, with robust estimators trained via Tukey’s Biweight loss. Experiments show NoMod achieves full recovery of binary secrets for dimension $n = 350$, recovery of sparse binomial secrets for $n = 256$, and successful recovery of sparse secrets in CRYSTALS-Kyber settings with parameters $(n, k) = (128, 3)$ and $(256, 2)$. We release our implementation in an anonymous repository at this https URL.
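To make the role of the robust loss concrete, here is a minimal sketch of Tukey's biweight loss. It behaves like a squared loss for small residuals and saturates for large ones, which is why samples corrupted by a modular wrap-around would have bounded influence on the estimate. The tuning constant `c = 4.685` is a conventional default from robust statistics and an assumption here, not a value taken from the paper.

```python
def tukey_biweight(residual, c=4.685):
    # Tukey's biweight loss: approximately residual^2 / 2 near zero,
    # constant at c^2 / 6 once |residual| exceeds c.
    if abs(residual) > c:
        return c * c / 6.0
    u = 1.0 - (residual / c) ** 2
    return (c * c / 6.0) * (1.0 - u ** 3)


# A small residual is penalized almost quadratically, while two very
# different outliers receive exactly the same capped penalty.
small = tukey_biweight(0.01)
outlier_a = tukey_biweight(100.0)
outlier_b = tukey_biweight(1000.0)
```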
[LG-16] Reinforcement Learning with Action-Triggered Observations
Link: https://arxiv.org/abs/2510.02149
Authors: Alexander Ryabchenko, Wenlong Mou
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Abstract:We study reinforcement learning problems where state observations are stochastically triggered by actions, a constraint common in many real-world applications. This framework is formulated as Action-Triggered Sporadically Traceable Markov Decision Processes (ATST-MDPs), where each action has a specified probability of triggering a state observation. We derive tailored Bellman optimality equations for this framework and introduce the action-sequence learning paradigm in which agents commit to executing a sequence of actions until the next observation arrives. Under the linear MDP assumption, value-functions are shown to admit linear representations in an induced action-sequence feature map. Leveraging this structure, we propose off-policy estimators with statistical error guarantees for such feature maps and introduce ST-LSVI-UCB, a variant of LSVI-UCB adapted for action-triggered settings. ST-LSVI-UCB achieves regret $\widetilde{O}(\sqrt{Kd^3(1-\gamma)^{-3}})$, where $K$ is the number of episodes, $d$ the feature dimension, and $\gamma$ the discount factor (per-step episode non-termination probability). Crucially, this work establishes the theoretical foundation for learning with sporadic, action-triggered observations while demonstrating that efficient learning remains feasible under such observation constraints.
[LG-17] Policy Gradient Guidance Enables Test Time Control
Link: https://arxiv.org/abs/2510.02148
Authors: Jianing Qi, Hao Tang, Zhigang Zhu
Categories: Machine Learning (cs.LG)
Abstract:We introduce Policy Gradient Guidance (PGG), a simple extension of classifier-free guidance from diffusion models to classical policy gradient methods. PGG augments the policy gradient with an unconditional branch and interpolates conditional and unconditional branches, yielding a test-time control knob that modulates behavior without retraining. We provide a theoretical derivation showing that the additional normalization term vanishes under advantage estimation, leading to a clean guided policy gradient update. Empirically, we evaluate PGG on discrete and continuous control benchmarks. We find that conditioning dropout, central to diffusion guidance, offers gains in simple discrete tasks and low sample regimes, but dropout destabilizes continuous control. Training with modestly larger guidance ($\gamma > 1$) consistently improves stability, sample efficiency, and controllability. Our results show that guidance, previously confined to diffusion policies, can be adapted to standard on-policy methods, opening new directions for controllable online reinforcement learning.
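The abstract does not spell out the interpolation formula. The sketch below assumes the standard classifier-free-guidance form, applied to action logits of a discrete policy: `uncond + gamma * (cond - uncond)`. The function names and the example logits are hypothetical, and the paper's exact parameterization may differ.

```python
import math


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def guided_policy(cond_logits, uncond_logits, gamma):
    # CFG-style interpolation in logit space (assumed form):
    # gamma = 0 gives the unconditional policy, gamma = 1 the conditional one,
    # and gamma > 1 extrapolates past the conditional policy.
    mixed = [u + gamma * (c - u) for c, u in zip(cond_logits, uncond_logits)]
    return softmax(mixed)


cond = [2.0, 0.5, -1.0]    # hypothetical state-conditioned action logits
uncond = [0.5, 0.5, 0.5]   # hypothetical unconditional action logits
p0 = guided_policy(cond, uncond, 0.0)
p1 = guided_policy(cond, uncond, 1.0)
p2 = guided_policy(cond, uncond, 2.0)
```

Because `gamma` is only applied when the two branches are combined, it can be changed at test time without retraining, which is the control knob the abstract refers to.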
[LG-18] Catalyst GFlowNet for electrocatalyst design: A hydrogen evolution reaction case study NEURIPS
Link: https://arxiv.org/abs/2510.02142
Authors: Lena Podina, Christina Humer, Alexandre Duval, Victor Schmidt, Ali Ramlaoui, Shahana Chatterjee, Yoshua Bengio, Alex Hernandez-Garcia, David Rolnick, Félix Therrien
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
Comments: 5 pages, 2 figures. Accepted to NeurIPS AI for Materials Workshop 2025
[LG-19] DAG DECORation: Continuous Optimization for Structure Learning under Hidden Confounding
Link: https://arxiv.org/abs/2510.02117
Authors: Samhita Pal, James O’quinn, Kaveh Aryan, Heather Pua, James P. Long, Amir Asiaee
Categories: Machine Learning (cs.LG); Methodology (stat.ME)
Abstract:We study structure learning for linear Gaussian SEMs in the presence of latent confounding. Existing continuous methods excel when errors are independent, while deconfounding-first pipelines rely on pervasive factor structure or nonlinearity. We propose DECOR, a single likelihood-based and fully differentiable estimator that jointly learns a DAG and a correlated noise model. Our theory gives simple sufficient conditions for global parameter identifiability: if the mixed graph is bow free and the noise covariance has a uniform eigenvalue margin, then the map from $(B, \Omega)$ to the observational covariance is injective, so both the directed structure and the noise are uniquely determined. The estimator alternates a smooth-acyclic graph update with a convex noise update and can include a light bow complementarity penalty or a post hoc reconciliation step. On synthetic benchmarks that vary confounding density, graph density, latent rank, and dimension with np, DECOR matches or outperforms strong baselines and is especially robust when confounding is non-pervasive, while remaining competitive under pervasiveness.
[LG-20] Ensemble Threshold Calibration for Stable Sensitivity Control
Link: https://arxiv.org/abs/2510.02116
Authors: John N. Daras
Categories: Machine Learning (cs.LG); Databases (cs.DB); Machine Learning (stat.ML)
Comments: 10 pages, 6 tables
Abstract:Precise recall control is critical in large-scale spatial conflation and entity-matching tasks, where missing even a few true matches can break downstream analytics, while excessive manual review inflates cost. Classical confidence-interval cuts such as Clopper-Pearson or Wilson provide lower bounds on recall, but they routinely overshoot the target by several percentage points and exhibit high run-to-run variance under skewed score distributions. We present an end-to-end framework that achieves exact recall with sub-percent variance over tens of millions of geometry pairs, while remaining TPU-friendly. Our pipeline starts with an equigrid bounding-box filter and compressed sparse row (CSR) candidate representation, reducing pair enumeration by two orders of magnitude. A deterministic xxHash bootstrap sample trains a lightweight neural ranker; its scores are propagated to all remaining pairs via a single forward pass and used to construct a reproducible, score-decile-stratified calibration set. Four complementary threshold estimators - Clopper-Pearson, Jeffreys, Wilson, and an exact quantile - are aggregated via inverse-variance weighting, then fused across nine independent subsamples. This ensemble reduces threshold variance compared to any single method. Evaluated on two real cadastral datasets (approximately 6.31M and 67.34M pairs), our approach consistently hits a recall target within a small error, decreases redundant verifications relative to other calibrations, and runs end-to-end on a single TPU v3 core.
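The aggregation step named in this abstract, inverse-variance weighting, can be sketched as follows. Each threshold estimate is weighted by the reciprocal of its variance; if the estimators are independent (an assumption of this sketch, not a claim from the paper), the combined variance is the reciprocal of the summed weights. The threshold values and variances below are made up for illustration, and the paper's fusion across nine subsamples is not reproduced.

```python
def inverse_variance_combine(estimates, variances):
    # Weight each estimate by 1 / variance. Under independence, the
    # variance of the weighted mean is 1 / sum(weights).
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    combined = sum(w * e for w, e in zip(weights, estimates)) / total
    return combined, 1.0 / total


# Hypothetical thresholds from four estimators (e.g. Clopper-Pearson,
# Jeffreys, Wilson, exact quantile) with hypothetical variances.
thresholds = [0.42, 0.40, 0.41, 0.39]
variances = [4e-4, 1e-4, 2e-4, 1e-4]
combined_threshold, combined_variance = inverse_variance_combine(
    thresholds, variances
)
```

The combined variance is never larger than the smallest individual variance, which is consistent with the abstract's claim that the ensemble reduces threshold variance relative to any single method.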
[LG-21] Hybrid Deep Learning Modeling Approach to Predict Natural Gas Consumption of Home Subscribers on Limited Data
Link: https://arxiv.org/abs/2510.02115
Authors: Milad Firoozeh, Nader Dashti, Mohammad Ali Hatefi
Categories: Machine Learning (cs.LG)
Abstract:Today, natural gas, as a clean fuel and the best alternative to crude oil, covers a significant part of global demand. Iran is among the countries richest in energy resources and ranks second in the world in terms of gas. However, because of growth in population and energy consumption, it faces pressure drops and gas outages every year in the cold seasons, so gas consumption needs to be controlled, especially in the residential sector, which has the largest share in Iran. This study aims to analyze and predict gas consumption for residential customers in Zanjan province, Iran, using machine learning models, including LSTM, GRU, and a hybrid BiLSTM-XGBoost model. The dataset consists of gas consumption and meteorology data collected over six years, from 2017 to 2022. The models were trained and evaluated based on their ability to accurately predict consumption patterns. The results indicate that the hybrid BiLSTM-XGBoost model outperformed the other models in terms of accuracy, with lower Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), and Mean Percentage Error (MPE) values. Additionally, the hybrid model demonstrated robust performance, particularly in scenarios with limited data. The findings suggest that machine learning approaches, particularly hybrid models, can be effectively utilized to manage and predict gas consumption, contributing to more efficient resource management and reducing seasonal shortages. This study highlights the importance of incorporating geographical and climatic factors in predictive modeling, as these significantly influence gas usage across different regions.
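For readers unfamiliar with the three error measures named in this abstract, the sketch below computes them with their usual textbook definitions. The sign convention for MPE (actual minus predicted, divided by actual) is an assumption, since the abstract does not state which one the authors use, and the consumption values are invented for illustration.

```python
import math


def rmse(actual, predicted):
    # Root Mean Squared Error: square root of the mean squared residual.
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mape(actual, predicted):
    # Mean Absolute Percentage Error, in percent (requires non-zero actuals).
    n = len(actual)
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n


def mpe(actual, predicted):
    # Mean Percentage Error, in percent. Signed, so over- and
    # under-predictions can cancel; it measures bias rather than accuracy.
    n = len(actual)
    return 100.0 * sum((a - p) / a for a, p in zip(actual, predicted)) / n


actual = [100.0, 200.0, 400.0]       # hypothetical consumption readings
predicted = [110.0, 190.0, 400.0]    # hypothetical model forecasts
scores = (rmse(actual, predicted), mape(actual, predicted), mpe(actual, predicted))
```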
[LG-22] SoundReactor: Frame-level Online Video-to-Audio Generation
Link: https://arxiv.org/abs/2510.02110
Authors: Koichi Saito, Julian Tanke, Christian Simon, Masato Ishii, Kazuki Shimada, Zachary Novack, Zhi Zhong, Akio Hayakawa, Takashi Shibuya, Yuki Mitsufuji
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Abstract:Prevailing Video-to-Audio (V2A) generation models operate offline, assuming an entire video sequence or chunks of frames are available beforehand. This critically limits their use in interactive applications such as live content creation and emerging generative world models. To address this gap, we introduce the novel task of frame-level online V2A generation, where a model autoregressively generates audio from video without access to future video frames. Furthermore, we propose SoundReactor, which, to the best of our knowledge, is the first simple yet effective framework explicitly tailored for this task. Our design enforces end-to-end causality and targets low per-frame latency with audio-visual synchronization. Our model’s backbone is a decoder-only causal transformer over continuous audio latents. For vision conditioning, it leverages grid (patch) features extracted from the smallest variant of the DINOv2 vision encoder, which are aggregated into a single token per frame to maintain end-to-end causality and efficiency. The model is trained through a diffusion pre-training followed by consistency fine-tuning to accelerate the diffusion head decoding. On a benchmark of diverse gameplay videos from AAA titles, our model successfully generates semantically and temporally aligned, high-quality full-band stereo audio, validated by both objective and human evaluations. Furthermore, our model achieves low per-frame waveform-level latency (26.3ms with the head NFE=1, 31.5ms with NFE=4) on 30FPS, 480p videos using a single H100. Demo samples are available at this https URL.
[LG-23] PENEX: AdaBoost-Inspired Neural Network Regularization
Link: https://arxiv.org/abs/2510.02107
Authors: Klaus-Rudolf Kladny, Bernhard Schölkopf, Michael Muehlebach
Categories: Machine Learning (cs.LG)
Abstract:AdaBoost sequentially fits so-called weak learners to minimize an exponential loss, which penalizes mislabeled data points more severely than other loss functions like cross-entropy. Paradoxically, AdaBoost generalizes well in practice as the number of weak learners grows. In the present work, we introduce Penalized Exponential Loss (PENEX), a new formulation of the multi-class exponential loss that is theoretically grounded and, in contrast to the existing formulation, amenable to optimization via first-order methods. We demonstrate both empirically and theoretically that PENEX implicitly maximizes margins of data points. Also, we show that gradient increments on PENEX implicitly parameterize weak learners in the boosting framework. Across computer vision and language tasks, we show that PENEX exhibits a regularizing effect often better than established methods with similar computational cost. Our results highlight PENEX’s potential as an AdaBoost-inspired alternative for effective training and fine-tuning of deep neural networks.
[LG-24] Learning Model Representations Using Publicly Available Model Hubs
Link: https://arxiv.org/abs/2510.02096
Authors: Damian Falk,Konstantin Schürholt,Konstantinos Tzevelekakis,Léo Meynent,Damian Borth
Subjects: Machine Learning (cs.LG)
Abstract:The weights of neural networks have emerged as a novel data modality, giving rise to the field of weight space learning. A central challenge in this area is that learning meaningful representations of weights typically requires large, carefully constructed collections of trained models, commonly referred to as model zoos. These model zoos are often trained ad-hoc, requiring large computational resources and constraining the learned weight space representations in scale and flexibility. In this work, we drop this requirement by training a weight space learning backbone on arbitrary models downloaded from large, unstructured model repositories such as Hugging Face. Unlike curated model zoos, these repositories contain highly heterogeneous models: they vary in architecture and dataset, and are largely undocumented. To address the methodological challenges posed by such heterogeneity, we propose a new weight space backbone designed to handle unstructured model populations. We demonstrate that weight space representations trained on models from Hugging Face achieve strong performance, often outperforming backbones trained on laboratory-generated model zoos. Finally, we show that the diversity of the model weights in our training set allows our weight space model to generalize to unseen data modalities. By demonstrating that high-quality weight space representations can be learned in the wild, we show that curated model zoos are not indispensable, thereby overcoming a strong limitation currently faced by the weight space learning community.
[LG-25] Fine-Tuning Flow Matching via Maximum Likelihood Estimation of Reconstructions
Link: https://arxiv.org/abs/2510.02081
Authors: Zhaoyi Li,Jingtao Ding,Yong Li,Shihua Li
Subjects: Machine Learning (cs.LG)
[LG-26] Inferring Optical Tissue Properties from Photoplethysmography using Hybrid Amortized Inference
Link: https://arxiv.org/abs/2510.02073
Authors: Jens Behrmann,Maria R. Cervera,Antoine Wehenkel,Andrew C. Miller,Albert Cerussi,Pranay Jain,Vivek Venugopal,Shijie Yan,Guillermo Sapiro,Luca Pegolotti,Jörn-Henrik Jacobsen
Subjects: Machine Learning (cs.LG); Biological Physics (physics.bio-ph); Machine Learning (stat.ML)
Abstract:Smart wearables enable continuous tracking of established biomarkers such as heart rate, heart rate variability, and blood oxygen saturation via photoplethysmography (PPG). Beyond these metrics, PPG waveforms contain richer physiological information, as recent deep learning (DL) studies demonstrate. However, DL models often rely on features with unclear physiological meaning, creating a tension between predictive power, clinical interpretability, and sensor design. We address this gap by introducing PPGen, a biophysical model that relates PPG signals to interpretable physiological and optical parameters. Building on PPGen, we propose hybrid amortized inference (HAI), enabling fast, robust, and scalable estimation of relevant physiological parameters from PPG signals while correcting for model misspecification. In extensive in-silico experiments, we show that HAI can accurately infer physiological parameters under diverse noise and sensor conditions. Our results illustrate a path toward PPG models that retain the fidelity needed for DL-based features while supporting clinical interpretation and informed hardware design.
[LG-27] Adaptive Heterogeneous Mixtures of Normalising Flows for Robust Variational Inference
Link: https://arxiv.org/abs/2510.02056
Authors: Benjamin Wiriyapong,Oktay Karakuş,Kirill Sidorov
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 2 figures and 2 tables
Abstract:Normalising-flow variational inference (VI) can approximate complex posteriors, yet single-flow models often behave inconsistently across qualitatively different distributions. We propose Adaptive Mixture Flow Variational Inference (AMF-VI), a heterogeneous mixture of complementary flows (MAF, RealNVP, RBIG) trained in two stages: (i) sequential expert training of individual flows, and (ii) adaptive global weight estimation via likelihood-driven updates, without per-sample gating or architectural changes. Evaluated on six canonical posterior families of banana, X-shape, two-moons, rings, a bimodal, and a five-mode mixture, AMF-VI achieves consistently lower negative log-likelihood than each single-flow baseline and delivers stable gains in transport metrics (Wasserstein-2) and maximum mean discrepancy (MMD), indicating improved robustness across shapes and modalities. The procedure is efficient and architecture-agnostic, incurring minimal overhead relative to standard flow training, and demonstrates that adaptive mixtures of diverse flows provide a reliable route to robust VI across diverse posterior families whilst preserving each expert’s inductive bias.
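The second training stage, global weight estimation driven by likelihoods, can be illustrated with a small sketch. The abstract does not specify the update rule, so the code below swaps in a standard EM-style update for the weights of a mixture whose experts are already trained and held fixed. The function name and the `log_liks` input layout are assumptions made for illustration.

```python
import math


def update_mixture_weights(log_liks, n_iters=50):
    """EM-style estimate of global mixture weights for fixed experts.

    log_liks[n][k] is the log-density of sample n under expert k.
    Returns one weight per expert; the weights sum to 1.
    """
    n_experts = len(log_liks[0])
    weights = [1.0 / n_experts] * n_experts
    for _ in range(n_iters):
        totals = [0.0] * n_experts
        for row in log_liks:
            # Responsibilities via a numerically stable softmax.
            scores = [math.log(max(w, 1e-300)) + ll for w, ll in zip(weights, row)]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            norm = sum(exps)
            for k in range(n_experts):
                totals[k] += exps[k] / norm
        # New weight = average responsibility of each expert over the samples.
        weights = [t / len(log_liks) for t in totals]
    return weights


if __name__ == "__main__":
    # Hypothetical log-densities: expert 0 explains every sample better.
    print(update_mixture_weights([[-1.0, -5.0]] * 10))
```

Because the weights are global rather than per-sample, no gating network is needed, which matches the "without per-sample gating" constraint stated in the abstract.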
[LG-28] Mathematical Modeling and Convergence Analysis of Deep Neural Networks with Dense Layer Connectivities in Deep Learning
Link: https://arxiv.org/abs/2510.02049
Authors: Jinshu Huang,Haibin Su,Xue-Cheng Tai,Chunlin Wu
Subjects: Machine Learning (cs.LG)
Abstract:In deep learning, dense layer connectivity has become a key design principle in deep neural networks (DNNs), enabling efficient information flow and strong performance across a range of applications. In this work, we model densely connected DNNs mathematically and analyze their learning problems in the deep-layer limit. For broad applicability, we present our analysis in a general framework of DNNs with densely connected layers and general non-local feature transformations (with local feature transformations as special cases) within layers, which we call the dense non-local (DNL) framework and which includes standard DenseNets and their variants as special cases. In this formulation, the densely connected networks are modeled as nonlinear integral equations, in contrast to the ordinary differential equation viewpoint commonly adopted in prior works. We study the associated training problems from an optimal control perspective and prove convergence results from the network learning problem to its continuous-time counterpart. In particular, we show the convergence of optimal values and the subsequence convergence of minimizers, using a piecewise linear extension and $\Gamma$-convergence analysis. Our results provide a mathematical foundation for understanding densely connected DNNs and further suggest that such architectures can offer stability when training deep models.
[LG-29] Variational Secret Common Randomness Extraction
Link: https://arxiv.org/abs/2510.02048
Authors: Xinyang Li,Vlad C. Andrei,Peter J. Gu,Yiqi Chen,Ullrich J. Mönich,Holger Boche
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
Abstract:This paper studies the problem of extracting common randomness (CR) or secret keys from correlated random sources observed by two legitimate parties, Alice and Bob, through public discussion in the presence of an eavesdropper, Eve. We propose a practical two-stage CR extraction framework. In the first stage, the variational probabilistic quantization (VPQ) step is introduced, where Alice and Bob employ probabilistic neural network (NN) encoders to map their observations into discrete, nearly uniform random variables (RVs) with high agreement probability while minimizing information leakage to Eve. This is realized through a variational learning objective combined with adversarial training. In the second stage, a secure sketch using code-offset construction reconciles the encoder outputs into identical secret keys, whose secrecy is guaranteed by the VPQ objective. As a representative application, we study physical layer key (PLK) generation. Beyond the traditional methods, which rely on the channel reciprocity principle and require two-way channel probing, thus suffering from large protocol overhead and being unsuitable in high mobility scenarios, we propose a sensing-based PLK generation method for integrated sensing and communications (ISAC) systems, where paired range-angle (RA) maps measured at Alice and Bob serve as correlated sources. The idea is verified through both end-to-end simulations and real-world software-defined radio (SDR) measurements, including scenarios where Eve has partial knowledge about Bob’s position. The results demonstrate the feasibility and convincing performance of both the proposed CR extraction framework and sensing-based PLK generation method.
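The second stage relies on a code-offset secure sketch. As a rough illustration of that construction only, the sketch below uses a 3-fold repetition code, which is an assumption chosen for brevity; the abstract does not say which code the authors use. Alice publishes the XOR of her bit string with a random codeword, and Bob uses his noisy copy plus that public offset to recover Alice's string. All function names are illustrative, and the input length is assumed to be a multiple of `REPEAT`.

```python
import secrets

REPEAT = 3  # assumed repetition factor; tolerates one bit flip per block


def encode(bits):
    """Repetition code: repeat every message bit REPEAT times."""
    return [b for b in bits for _ in range(REPEAT)]


def decode(bits):
    """Majority vote over each block of REPEAT bits."""
    return [int(sum(bits[i:i + REPEAT]) * 2 > REPEAT)
            for i in range(0, len(bits), REPEAT)]


def make_sketch(x):
    """Alice: XOR her observation x with a random codeword and publish it."""
    message = [secrets.randbelow(2) for _ in range(len(x) // REPEAT)]
    return [xi ^ ci for xi, ci in zip(x, encode(message))]


def recover(y, sketch):
    """Bob: shift his noisy observation y by the sketch, correct, shift back."""
    shifted = [yi ^ si for yi, si in zip(y, sketch)]
    codeword = encode(decode(shifted))
    return [ci ^ si for ci, si in zip(codeword, sketch)]


if __name__ == "__main__":
    x = [1, 0, 1, 1, 0, 0]   # Alice's quantized observation
    y = [0, 0, 1, 1, 1, 0]   # Bob's copy with one flipped bit per block
    print(recover(y, make_sketch(x)) == x)
```

Both parties end up with the same string, from which a key can then be derived. In the paper, secrecy against Eve is attributed to the VPQ objective, which this sketch does not model.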
[LG-30] FairContrast: Enhancing Fairness through Contrastive learning and Customized Augmenting Methods on Tabular Data NEURIPS2025
Link: https://arxiv.org/abs/2510.02017
Authors: Aida Tayebi,Ali Khodabandeh Yalabadi,Mehdi Yazdani-Jahromi,Ozlem Ozmen Garibay
Subjects: Machine Learning (cs.LG)
Comments: Accepted to NeurIPS 2025 - Reliable ML Workshop
[LG-31] Normality Calibration in Semi-supervised Graph Anomaly Detection
Link: https://arxiv.org/abs/2510.02014
Authors: Guolei Zeng,Hezhe Qiao,Guoguo Ai,Jinsong Guo,Guansong Pang
Subjects: Machine Learning (cs.LG)
Comments: 17 pages
[LG-32] ShapeGen3DCP: A Deep Learning Framework for Layer Shape Prediction in 3D Concrete Printing
Link: https://arxiv.org/abs/2510.02009
Authors: Giacomo Rizzieri,Federico Lanteri,Liberato Ferrara,Massimiliano Cremonesi
Subjects: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Abstract:This work introduces ShapeGen3DCP, a deep learning framework for fast and accurate prediction of filament cross-sectional geometry in 3D Concrete Printing (3DCP). The method is based on a neural network architecture that takes as input both material properties in the fluid state (density, yield stress, plastic viscosity) and process parameters (nozzle diameter, nozzle height, printing and flow velocities) to directly predict extruded layer shapes. To enhance generalization, some inputs are reformulated into dimensionless parameters that capture underlying physical principles. Predicted geometries are compactly represented using Fourier descriptors, which enforce smooth, closed, and symmetric profiles while reducing the prediction task to a small set of coefficients. The training dataset was synthetically generated using a well-established Particle Finite Element (PFEM) model of 3DCP, overcoming the scarcity of experimental data. Validation against diverse numerical and experimental cases shows strong agreement, confirming the framework’s accuracy and reliability. This opens the way to practical uses ranging from pre-calibration of print settings, minimizing or even eliminating trial-and-error adjustments, to toolpath optimization for more advanced designs. Looking ahead, coupling the framework with simulations and sensor feedback could enable closed-loop digital twins for 3DCP, driving real-time process optimization, defect detection, and adaptive control of printing parameters.
[LG-33] PepCompass: Navigating peptide embedding spaces using Riemannian Geometry
Link: https://arxiv.org/abs/2510.01988
Authors: Marcin Możejko(1),Adam Bielecki(1),Jurand Prądzyński(1),Marcin Traskowski(1),Antoni Janowski(1),Karol Jurasz(1),Michał Kucharczyk(1),Hyun-Su Lee(2),Marcelo Der Torossian Torres(2),Cesar de la Fuente-Nunez(2),Paulina Szymczak(3),Michał Kmicikiewicz(3),Ewa Szczurek(1 and 3) ((1) University of Warsaw, (2) University of Pennsylvania, (3) Helmholtz Center Munich)
Subjects: Machine Learning (cs.LG)
Abstract:Antimicrobial peptide discovery is challenged by the astronomical size of peptide space and the relative scarcity of active peptides. Generative models provide continuous latent “maps” of peptide space, but conventionally ignore decoder-induced geometry and rely on flat Euclidean metrics, rendering exploration and optimization distorted and inefficient. Prior manifold-based remedies assume fixed intrinsic dimensionality, which critically fails in practice for peptide data. Here, we introduce PepCompass, a geometry-aware framework for peptide exploration and optimization. At its core, we define a Union of $\kappa$-Stable Riemannian Manifolds $\mathbb{M}^\kappa$, a family of decoder-induced manifolds that captures local geometry while ensuring computational stability. We propose two local exploration methods: Second-Order Riemannian Brownian Efficient Sampling, which provides a convergent second-order approximation to Riemannian Brownian motion, and Mutation Enumeration in Tangent Space, which reinterprets tangent directions as discrete amino-acid substitutions. Combining these yields Local Enumeration Bayesian Optimization (LE-BO), an efficient algorithm for local activity optimization. Finally, we introduce Potential-minimizing Geodesic Search (PoGS), which interpolates between prototype embeddings along property-enriched geodesics, biasing discovery toward seeds, i.e., peptides with favorable activity. In-vitro validation confirms the effectiveness of PepCompass: PoGS yields four novel seeds, and subsequent optimization with LE-BO discovers 25 highly active peptides with broad-spectrum activity, including against resistant bacterial strains. These results demonstrate that geometry-informed exploration provides a powerful new paradigm for antimicrobial peptide design.
[LG-34] Private Federated Multiclass Post-hoc Calibration
Link: https://arxiv.org/abs/2510.01987
Authors: Samuel Maddock,Graham Cormode,Carsten Maple
Subjects: Machine Learning (cs.LG)
[LG-35] Moon: A Modality Conversion-based Efficient Multivariate Time Series Anomaly Detection
Link: https://arxiv.org/abs/2510.01970
Authors: Yuanyuan Yao,Yuhan Shi,Lu Chen,Ziquan Fang,Yunjun Gao,Leong Hou U,Yushuai Li,Tianyi Li
Subjects: Machine Learning (cs.LG)
Abstract:Multivariate time series (MTS) anomaly detection identifies abnormal patterns where each timestamp contains multiple variables. Existing MTS anomaly detection methods fall into three categories: reconstruction-based, prediction-based, and classifier-based methods. However, these methods face three key challenges: (1) Unsupervised learning methods, such as reconstruction-based and prediction-based methods, rely on error thresholds, which can lead to inaccuracies; (2) Semi-supervised methods mainly model normal data and often underuse anomaly labels, limiting detection of subtle anomalies; (3) Supervised learning methods, such as classifier-based approaches, often fail to capture local relationships, incur high computational costs, and are constrained by the scarcity of labeled data. To address these limitations, we propose Moon, a supervised modality conversion-based multivariate time series anomaly detection framework. Moon enhances the efficiency and accuracy of anomaly detection while providing detailed anomaly analysis reports. First, Moon introduces a novel multivariate Markov Transition Field (MV-MTF) technique to convert numeric time series data into image representations, capturing relationships across variables and timestamps. Since numeric data retains unique patterns that cannot be fully captured by image conversion alone, Moon employs a Multimodal-CNN to integrate numeric and image data through a feature fusion model with parameter sharing, enhancing training efficiency. Finally, a SHAP-based anomaly explainer identifies key variables contributing to anomalies, improving interpretability. Extensive experiments on six real-world MTS datasets demonstrate that Moon outperforms six state-of-the-art methods by up to 93% in efficiency, 4% in accuracy, and 10.8% in interpretation performance.
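To make the time-series-to-image conversion concrete, here is a minimal sketch of the standard univariate Markov Transition Field, not the paper's multivariate MV-MTF, whose construction the abstract does not describe. The series is quantile-binned, a first-order transition matrix between bins is estimated, and pixel (i, j) holds the transition probability from the bin at time i to the bin at time j. The function name and the `n_bins` default are illustrative.

```python
def markov_transition_field(series, n_bins=4):
    """Univariate Markov Transition Field of a numeric series (list of floats)."""
    ordered = sorted(series)
    n = len(series)
    # Quantile bin edges taken from the sorted values.
    edges = [ordered[(k * n) // n_bins] for k in range(1, n_bins)]
    # Bin index = number of edges the value reaches or exceeds.
    bins = [sum(x >= e for e in edges) for x in series]

    # First-order transition counts between consecutive time steps.
    counts = [[0] * n_bins for _ in range(n_bins)]
    for a, b in zip(bins, bins[1:]):
        counts[a][b] += 1

    # Row-normalise counts into transition probabilities.
    probs = []
    for row in counts:
        total = sum(row)
        probs.append([c / total if total else 0.0 for c in row])

    # Spread the bin-level probabilities over all pairs of time steps.
    return [[probs[bi][bj] for bj in bins] for bi in bins]


if __name__ == "__main__":
    field = markov_transition_field([0, 1, 0, 1, 0, 1], n_bins=2)
    print(len(field), len(field[0]))
```

The result is a square matrix with one row and column per time step, which can be treated as a single-channel image by a CNN.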
[LG-36] Lower Bounds on Adversarial Robustness for Multiclass Classification with General Loss Functions
Link: https://arxiv.org/abs/2510.01969
Authors: Camilo Andrés García Trillos,Nicolás García Trillos
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
[LG-37] Multi-bit Audio Watermarking
Link: https://arxiv.org/abs/2510.01968
Authors: Luca A. Lanzendörfer,Kyle Fearne,Florian Grötschla,Roger Wattenhofer
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[LG-38] Bias beyond Borders: Global Inequalities in AI-Generated Music
Link: https://arxiv.org/abs/2510.01963
Authors: Ahmet Solak,Florian Grötschla,Luca A. Lanzendörfer,Roger Wattenhofer
Subjects: Sound (cs.SD); Machine Learning (cs.LG)
[LG-39] StelLA: Subspace Learning in Low-rank Adaptation using Stiefel Manifold NEURIPS2025
Link: https://arxiv.org/abs/2510.01938
Authors: Zhizhong Li,Sina Sajadmanesh,Jingtao Li,Lingjuan Lyu
Subjects: Machine Learning (cs.LG)
Comments: Accepted as a spotlight at NeurIPS 2025
Abstract:Low-rank adaptation (LoRA) has been widely adopted as a parameter-efficient technique for fine-tuning large-scale pre-trained models. However, it still lags behind full fine-tuning in performance, partly due to its insufficient exploitation of the geometric structure underlying low-rank manifolds. In this paper, we propose a geometry-aware extension of LoRA that uses a three-factor decomposition $U S V^\top$. Analogous to the structure of singular value decomposition (SVD), it separates the adapter’s input and output subspaces, $V$ and $U$, from the scaling factor $S$. Our method constrains $U$ and $V$ to lie on the Stiefel manifold, ensuring their orthonormality throughout the training. To optimize on the Stiefel manifold, we employ a flexible and modular geometric optimization design that converts any Euclidean optimizer to a Riemannian one. It enables efficient subspace learning while remaining compatible with existing fine-tuning pipelines. Empirical results across a wide range of downstream tasks, including commonsense reasoning, math and code generation, image classification, and image generation, demonstrate the superior performance of our approach against the recent state-of-the-art variants of LoRA. Code is available at this https URL.
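The three-factor structure can be sketched in a few lines. The code below is not the authors' optimizer: it uses plain Gram-Schmidt re-orthonormalization as a stand-in for a Stiefel-manifold retraction, and it assumes the scaling factor is diagonal, which the abstract does not state. Factors are stored as lists of column vectors, the columns are assumed linearly independent, and all names are illustrative.

```python
def orthonormalize_columns(cols):
    """Modified Gram-Schmidt: orthonormal vectors spanning the same subspace."""
    basis = []
    for v in cols:
        w = list(v)
        for b in basis:
            # Remove the component of w along each earlier basis vector.
            proj = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - proj * bi for wi, bi in zip(w, b)]
        norm = sum(wi * wi for wi in w) ** 0.5
        basis.append([wi / norm for wi in w])
    return basis


def adapter_delta(u_cols, s_diag, v_cols):
    """Dense weight update U S V^T for diagonal S, built from rank-1 terms."""
    rows, cols = len(u_cols[0]), len(v_cols[0])
    delta = [[0.0] * cols for _ in range(rows)]
    for u, s, v in zip(u_cols, s_diag, v_cols):
        for i in range(rows):
            for j in range(cols):
                delta[i][j] += s * u[i] * v[j]
    return delta


if __name__ == "__main__":
    u = orthonormalize_columns([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]])  # output subspace
    v = orthonormalize_columns([[1.0, 0.0], [1.0, 1.0]])            # input subspace
    print(adapter_delta(u, [2.0, 0.5], v))
```

Keeping the columns of the two outer factors orthonormal after each update is what separates the subspace directions from their scale, mirroring the SVD analogy in the abstract.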
[LG-40] A Methodology for Transparent Logic-Based Classification Using a Multi-Task Convolutional Tsetlin Machine
Link: https://arxiv.org/abs/2510.01906
Authors: Mayur Kishor Shende,Ole-Christoffer Granmo,Runar Helin,Vladimir I. Zadorozhny,Rishad Shafik
Subjects: Machine Learning (cs.LG)
Abstract:The Tsetlin Machine (TM) is a novel machine learning paradigm that employs finite-state automata for learning and utilizes propositional logic to represent patterns. Due to its simplistic approach, TMs are inherently more interpretable than learning algorithms based on Neural Networks. The Convolutional TM has shown comparable performance on various datasets such as MNIST, K-MNIST, F-MNIST and CIFAR-2. In this paper, we explore the applicability of the TM architecture for large-scale multi-channel (RGB) image classification. We propose a methodology to generate both local interpretations and global class representations. The local interpretations can be used to explain the model predictions while the global class representations aggregate important patterns for each class. These interpretations summarize the knowledge captured by the convolutional clauses, which can be visualized as images. We evaluate our methods on MNIST and CelebA datasets, using models that achieve 98.5% accuracy on MNIST and 86.56% F1-score on CelebA (compared to 88.07% for ResNet50) respectively. We show that the TM performs competitively to this deep learning model while maintaining its interpretability, even in large-scale complex training environments. This contributes to a better understanding of TM clauses and provides insights into how these models can be applied to more complex and diverse datasets.
[LG-41] Multi-marginal temporal Schrödinger Bridge Matching for video generation from unpaired data
Link: https://arxiv.org/abs/2510.01894
Authors: Thomas Gravier,Thomas Boyer,Auguste Genovesio
Subjects: Machine Learning (cs.LG)
Comments: Under review. Code available at this https URL. Additional experiment materials available at this https URL
[LG-42] Randomized Gradient Subspaces for Efficient Large Language Model Training
Link: https://arxiv.org/abs/2510.01878
Authors: Sahar Rajabi,Nayeema Nonta,Samanvay Vajpayee,Sirisha Rambhatla
Subjects: Machine Learning (cs.LG)
[LG-43] Ranking Items from Discrete Ratings: The Cost of Unknown User Thresholds
Link: https://arxiv.org/abs/2510.01871
Authors: Oscar Villemaud,Suryanarayana Sankagiri,Matthias Grossglauser
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 12 pages, 4 figures
Abstract:Ranking items is a central task in many information retrieval and recommender systems. User input for the ranking task often comes in the form of ratings on a coarse discrete scale. We ask whether it is possible to recover a fine-grained item ranking from such coarse-grained ratings. We model items as having scores and users as having thresholds; a user rates an item positively if the item’s score exceeds the user’s threshold. Although all users agree on the total item order, estimating that order is challenging when both the scores and the thresholds are latent. Under our model, any ranking method naturally partitions the $n$ items into bins; the bins are ordered, but the items inside each bin are still unordered. Users arrive sequentially, and every new user can be queried to refine the current ranking. We prove that achieving a near-perfect ranking, measured by Spearman distance, requires $\Theta(n^2)$ users (and therefore $\Omega(n^2)$ queries). This is significantly worse than the $O(n \log n)$ queries needed to rank from comparisons; the gap reflects the additional queries needed to identify the users who have the appropriate thresholds. Our bound also quantifies the impact of a mismatch between score and threshold distributions via a quadratic divergence factor. To show the tightness of our results, we provide a ranking algorithm whose query complexity matches our bound up to a logarithmic factor. Our work reveals a tension in online ranking: diversity in thresholds is necessary to merge coarse ratings from many users into a fine-grained ranking, but this diversity has a cost if the thresholds are a priori unknown.
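The score-and-threshold model and the resulting bins can be simulated directly. This is a toy illustration of the model described in the abstract, not the authors' ranking algorithm: it reads the latent scores and thresholds to generate ratings, which a real method could not do. The function name is hypothetical.

```python
def bin_items_by_ratings(scores, thresholds):
    """Group items by how many users rate them positively.

    A user rates an item positively when the item's score exceeds the
    user's threshold, so items not separated by any threshold share a bin.
    """
    bins = {}
    for item, score in enumerate(scores):
        positives = sum(score > t for t in thresholds)
        bins.setdefault(positives, []).append(item)
    # Bins come out ordered from the lowest score range to the highest.
    return [bins[k] for k in sorted(bins)]


if __name__ == "__main__":
    scores = [0.1, 0.2, 0.6, 0.9]
    print(bin_items_by_ratings(scores, [0.5]))             # one user
    print(bin_items_by_ratings(scores, [0.15, 0.5, 0.8]))  # three users
```

A single user can only split the items into two bins. Fully ordering them requires a threshold between every pair of adjacent scores, which is why threshold diversity is needed and why unknown thresholds make it costly to find the right users.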
[LG-44] Universal Dynamic Regret and Constraint Violation Bounds for Constrained Online Convex Optimization
Link: https://arxiv.org/abs/2510.01867
Authors: Subhamon Supantha,Abhishek Sinha
Subjects: Machine Learning (cs.LG)
Abstract:We consider a generalization of the celebrated Online Convex Optimization (OCO) framework with online adversarial constraints. We present two algorithms having simple modular structures that yield universal dynamic regret and cumulative constraint violation bounds, improving upon the state-of-the-art results. Our results hold in the most general case when both the cost and constraint functions are chosen arbitrarily by an adversary, and the constraint functions need not contain any common feasible point. The results are established by reducing the constrained learning problem to an instance of the standard OCO problem with specially constructed surrogate cost functions.
[LG-45] Microscaling Floating Point Formats for Large Language Models
Link: https://arxiv.org/abs/2510.01863
Authors: Marco Cococcioni,Dario Pagani,Federico Rossi
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
[LG-46] Compositional meta-learning through probabilistic task inference
Link: https://arxiv.org/abs/2510.01858
Authors: Jacob J. W. Bakermans,Pablo Tano,Reidar Riveland,Charles Findling,Alexandre Pouget
Subjects: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Abstract:To solve a new task from minimal experience, it is essential to effectively reuse knowledge from previous tasks, a problem known as meta-learning. Compositional solutions, where common elements of computation are flexibly recombined into new configurations, are particularly well-suited for meta-learning. Here, we propose a compositional meta-learning model that explicitly represents tasks as structured combinations of reusable computations. We achieve this by learning a generative model that captures the underlying components and their statistics shared across a family of tasks. This approach transforms learning a new task into a probabilistic inference problem, which allows for finding solutions without parameter updates through highly constrained hypothesis testing. Our model successfully recovers ground truth components and statistics in rule learning and motor learning tasks. We then demonstrate its ability to quickly infer new solutions from just single examples. Together, our framework joins the expressivity of neural networks with the data-efficiency of probabilistic inference to achieve rapid compositional meta-learning.
[LG-47] Explicit Discovery of Nonlinear Symmetries from Dynamic Data
Link: https://arxiv.org/abs/2510.01855
Authors: Lexiang Hu,Yikang Li,Zhouchen Lin
Subjects: Machine Learning (cs.LG)
Abstract:Symmetry is widely applied in problems such as the design of equivariant networks and the discovery of governing equations, but in complex scenarios, it is not known in advance. Most previous symmetry discovery methods are limited to linear symmetries, and recent attempts to discover nonlinear symmetries fail to explicitly obtain the Lie algebra subspace. In this paper, we propose LieNLSD, which is, to our knowledge, the first method capable of determining the number of infinitesimal generators with nonlinear terms and their explicit expressions. We specify a function library for the infinitesimal group action and aim to solve for its coefficient matrix, proving that its prolongation formula for differential equations, which governs dynamic data, is also linear with respect to the coefficient matrix. By substituting the central differences of the data and the Jacobian matrix of the trained neural network into the infinitesimal criterion, we obtain a system of linear equations for the coefficient matrix, which can then be solved using SVD. On top quark tagging and a series of dynamic systems, LieNLSD shows qualitative advantages over existing methods and improves the long rollout accuracy of neural PDE solvers by over 20% when applied to guide data augmentation. Code and data are available at this https URL.
[LG-48] Learning Representations Through Contrastive Neural Model Checking
Link: https://arxiv.org/abs/2510.01853
Authors: Vladimir Krsmanovic,Matthias Cosler,Mohamed Ghanem,Bernd Finkbeiner
Subjects: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
[LG-49] Black-Box Combinatorial Optimization with Order-Invariant Reinforcement Learning
Link: https://arxiv.org/abs/2510.01824
Authors: Olivier Goudet,Quentin Suire,Adrien Goëffon,Frédéric Saubion,Sylvain Lamprier
Subjects: Machine Learning (cs.LG)
[LG-50] Sensitivity, Specificity, and Consistency: A Tripartite Evaluation of Privacy Filters for Synthetic Data Generation
Link: https://arxiv.org/abs/2510.01793
Authors: Adil Koeken,Alexander Ziller,Moritz Knolle,Daniel Rueckert
Subjects: Machine Learning (cs.LG)
Abstract:The generation of privacy-preserving synthetic datasets is a promising avenue for overcoming data scarcity in medical AI research. Post-hoc privacy filtering techniques, designed to remove samples containing personally identifiable information, have recently been proposed as a solution. However, their effectiveness remains largely unverified. This work presents a rigorous evaluation of a filtering pipeline applied to chest X-ray synthesis. Contrary to claims from the original publications, our results demonstrate that current filters exhibit limited specificity and consistency, achieving high sensitivity only for real images while failing to reliably detect near-duplicates generated from training data. These results demonstrate a critical limitation of post-hoc filtering: rather than effectively safeguarding patient privacy, these methods may provide a false sense of security while leaving unacceptable levels of patient information exposed. We conclude that substantial advances in filter design are needed before these methods can be confidently deployed in sensitive applications.
[LG-51] Neural non-canonical Hamiltonian dynamics for long-time simulations
Link: https://arxiv.org/abs/2510.01788
Authors: Clémentine Courtès(IRMA, MACARON),Emmanuel Franck(MACARON),Michael Kraus(IPP),Laurent Navoret(IRMA, MACARON),Léopold Trémant(LML)
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Abstract:This work focuses on learning non-canonical Hamiltonian dynamics from data, where long-term predictions require the preservation of structure both in the learned model and in numerical schemes. Previous research focused on either facet, respectively with a potential-based architecture and with degenerate variational integrators, but new issues arise when combining both. In experiments, the learnt model is sometimes numerically unstable due to the gauge dependency of the scheme, rendering long-time simulations impossible. In this paper, we identify this problem and propose two different training strategies to address it, either by directly learning the vector field or by learning a time-discrete dynamics through the scheme. Several numerical test cases assess the ability of the methods to learn complex physical dynamics, like the guiding center from gyrokinetic plasma physics.
[LG-52] Octax: Accelerated CHIP-8 Arcade Environments for Reinforcement Learning in JAX
Link: https://arxiv.org/abs/2510.01764
Authors: Waris Radji,Thomas Michel,Hector Piteau
Subjects: Machine Learning (cs.LG)
Abstract:Reinforcement learning (RL) research requires diverse, challenging environments that are both tractable and scalable. While modern video games may offer rich dynamics, they are computationally expensive and poorly suited for large-scale experimentation due to their CPU-bound execution. We introduce Octax, a high-performance suite of classic arcade game environments implemented in JAX, based on CHIP-8 emulation, a predecessor to Atari, which is widely adopted as a benchmark in RL research. Octax provides the JAX community with a long-awaited end-to-end GPU alternative to the Atari benchmark, offering image-based environments, spanning puzzle, action, and strategy genres, all executable at massive scale on modern GPUs. Our JAX-based implementation achieves orders-of-magnitude speedups over traditional CPU emulators while maintaining perfect fidelity to the original game mechanics. We demonstrate Octax’s capabilities by training RL agents across multiple games, showing significant improvements in training speed and scalability compared to existing solutions. The environment’s modular design enables researchers to easily extend the suite with new games or generate novel environments using large language models, making it an ideal platform for large-scale RL experimentation.
[LG-53] Learning Regularization Functionals for Inverse Problems: A Comparative Study
Link: https://arxiv.org/abs/2510.01755
Authors: Johannes Hertrich,Hok Shing Wong,Alexander Denker,Stanislas Ducotterd,Zhenghan Fang,Markus Haltmeier,Željko Kereta,Erich Kobler,Oscar Leong,Mohammad Sadegh Salehi,Carola-Bibiane Schönlieb,Johannes Schwab,Zakhar Shumaylov,Jeremias Sulam,German Shâma Wache,Martin Zach,Yasi Zhang,Matthias J. Ehrhardt,Sebastian Neumayer
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC)
Abstract:In recent years, a variety of learned regularization frameworks for solving inverse problems in imaging have emerged. These offer flexible modeling together with mathematical insights. The proposed methods differ in their architectural design and training strategies, making direct comparison challenging due to non-modular implementations. We address this gap by collecting and unifying the available code into a common framework. This unified view allows us to systematically compare the approaches and highlight their strengths and limitations, providing valuable insights into their future potential. We also provide concise descriptions of each method, complemented by practical guidelines.
[LG-54] Private and Fair Machine Learning: Revisiting the Disparate Impact of Differentially Private SGD
Link: https://arxiv.org/abs/2510.01744
Authors: Lea Demelius,Dominik Kowald,Simone Kopeinik,Roman Kern,Andreas Trügler
Subjects: Machine Learning (cs.LG)
Comments:
[LG-55] Workplace Location Choice Model based on Deep Neural Network
Link: https://arxiv.org/abs/2510.01723
Authors: Tanay Rastogi,Anders Karlström
Subjects: Machine Learning (cs.LG)
Comments:
[LG-56] Finite-Time Bounds for Distributionally Robust TD Learning with Linear Function Approximation
Link: https://arxiv.org/abs/2510.01721
Authors: Saptarshi Mandal,Yashaswini Murthy,R. Srikant
Subjects: Machine Learning (cs.LG)
Comments: Preprint. 32 Pages
[LG-57] Accelerating Attention with Basis Decomposition
Link: https://arxiv.org/abs/2510.01718
Authors: Jialin Zhao
Subjects: Machine Learning (cs.LG)
Comments:
[LG-58] ActiNet: Activity intensity classification of wrist-worn accelerometers using self-supervised deep learning
Link: https://arxiv.org/abs/2510.01712
Authors: Aidan Acquah,Shing Chan,Aiden Doherty
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:The use of reliable and accurate human activity recognition (HAR) models on passively collected wrist-accelerometer data is essential in large-scale epidemiological studies that investigate the association between physical activity and health outcomes. While the use of self-supervised learning has generated considerable excitement in improving HAR, the extent to which these models, coupled with hidden Markov models (HMMs), tangibly improve classification performance remains unknown, as does their effect on the predicted daily activity intensity compositions. Using 151 CAPTURE-24 participants’ data, we trained the ActiNet model, a self-supervised, 18-layer, modified ResNet-V2 model, followed by hidden Markov model (HMM) smoothing to classify labels of activity intensity. The performance of this model, evaluated using 5-fold stratified group cross-validation, was then compared to a baseline random forest (RF) + HMM, established in existing literature. Differences in performance and classification outputs were compared across subgroups of age and sex within the CAPTURE-24 population. The ActiNet model was able to distinguish labels of activity intensity with a mean macro F1 score of 0.82, and mean Cohen’s kappa score of 0.86. This exceeded the performance of the RF + HMM, trained and validated on the same dataset, with mean scores of 0.77 and 0.81, respectively. These findings were consistent across subgroups of age and sex. The results encourage the use of ActiNet for the extraction of activity intensity labels from wrist-accelerometer data in future epidemiological studies.
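As a reading aid for the two metrics quoted in the abstract above, the following is a minimal sketch of how macro F1 and Cohen's kappa can be computed from a confusion matrix. It is not the ActiNet evaluation code; the function names and the toy matrices are assumptions made for illustration only.

```python
def macro_f1(confusion):
    """Unweighted mean of per-class F1 scores.

    confusion[t][p] counts samples with true class t predicted as class p.
    """
    n = len(confusion)
    scores = []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(n)) - tp  # predicted c, truly another class
        fn = sum(confusion[c]) - tp                       # truly c, predicted another class
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


def cohens_kappa(confusion):
    """Agreement between predictions and labels, corrected for chance agreement."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(n)) / total
    # Chance agreement: product of the true-class and predicted-class marginals.
    expected = sum(
        sum(confusion[i]) * sum(confusion[r][i] for r in range(n))
        for i in range(n)
    ) / (total * total)
    return (observed - expected) / (1 - expected)


# Hypothetical two-class examples (not data from the paper).
perfect = [[5, 0], [0, 5]]
chance = [[1, 1], [1, 1]]
print(macro_f1(perfect), cohens_kappa(perfect))
print(macro_f1(chance), cohens_kappa(chance))
```

Macro F1 weights every intensity class equally regardless of how often it occurs, while kappa discounts the agreement expected by chance, which is why the two scores can differ on imbalanced activity data.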
[LG-59] Contrastive Representation Regularization for Vision-Language-Action Models
Link: https://arxiv.org/abs/2510.01711
Authors: Taeyoung Kim,Jimin Lee,Myungkyu Koo,Dongyoung Kim,Kyungmin Lee,Changyeon Kim,Younggyo Seo,Jinwoo Shin
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 20 pages, 12 figures
[LG-60] PASTA: A Unified Framework for Offline Assortment Learning
Link: https://arxiv.org/abs/2510.01693
Authors: Juncheng Dong,Weibin Mo,Zhengling Qi,Cong Shi,Ethan X. Fang,Vahid Tarokh
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:We study a broad class of assortment optimization problems in an offline and data-driven setting. In such problems, a firm lacks prior knowledge of the underlying choice model, and aims to determine an optimal assortment based on historical customer choice data. The combinatorial nature of assortment optimization often results in insufficient data coverage, posing a significant challenge in designing provably effective solutions. To address this, we introduce a novel Pessimistic Assortment Optimization (PASTA) framework that leverages the principle of pessimism to achieve optimal expected revenue under general choice models. Notably, PASTA requires only that the offline data distribution contains an optimal assortment, rather than providing the full coverage of all feasible assortments. Theoretically, we establish the first finite-sample regret bounds for offline assortment optimization across several widely used choice models, including the multinomial logit and nested logit models. Additionally, we derive a minimax regret lower bound, proving that PASTA is minimax optimal in terms of sample and model complexity. Numerical experiments further demonstrate that our method outperforms existing baseline approaches.
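The multinomial logit (MNL) model named in the abstract above can be illustrated with a short sketch of the expected revenue of a single assortment. This shows only the standard MNL choice rule with a no-purchase option of weight 1; it does not implement the pessimistic estimation that PASTA proposes, and the product names, preference weights, and prices are hypothetical.

```python
def mnl_choice_probabilities(assortment, weights):
    """P(customer picks product i | assortment) under the MNL model.

    The no-purchase option has weight 1, so the probabilities sum to less than 1.
    """
    denom = 1.0 + sum(weights[i] for i in assortment)
    return {i: weights[i] / denom for i in assortment}


def expected_revenue(assortment, weights, revenues):
    """Expected revenue per customer when offering the given assortment."""
    probs = mnl_choice_probabilities(assortment, weights)
    return sum(revenues[i] * p for i, p in probs.items())


# Hypothetical products: preference weights and per-unit revenues.
weights = {"a": 1.0, "b": 1.0}
revenues = {"a": 3.0, "b": 6.0}
print(expected_revenue(["a", "b"], weights, revenues))
print(expected_revenue(["b"], weights, revenues))
```

In an offline setting the weights are not known and have to be estimated from logged choices, which is where the data-coverage issue described in the abstract arises.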
[LG-61] Evaluating the Robustness of a Production Malware Detection System to Transferable Adversarial Attacks
Link: https://arxiv.org/abs/2510.01676
Authors: Milad Nasr,Yanick Fratantonio,Luca Invernizzi,Ange Albertini,Loua Farah,Alex Petit-Bianco,Andreas Terzis,Kurt Thomas,Elie Bursztein,Nicholas Carlini
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
[LG-62] Support Basis: Fast Attention Beyond Bounded Entries
Link: https://arxiv.org/abs/2510.01643
Authors: Maryam Aliakbarpour,Vladimir Braverman,Junze Yin,Haochen Zhang
Subjects: Machine Learning (cs.LG)
Comments:
[LG-63] Detecting Post-generation Edits to Watermarked LLM Outputs via Combinatorial Watermarking
Link: https://arxiv.org/abs/2510.01637
Authors: Liyan Xie,Muhammad Siddeek,Mohamed Seif,Andrea J. Goldsmith,Mengdi Wang
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Watermarking has become a key technique for proprietary language models, enabling the distinction between AI-generated and human-written text. However, in many real-world scenarios, LLM-generated content may undergo post-generation edits, such as human revisions or even spoofing attacks, making it critical to detect and localize such modifications. In this work, we introduce a new task: detecting post-generation edits locally made to watermarked LLM outputs. To this end, we propose a combinatorial pattern-based watermarking framework, which partitions the vocabulary into disjoint subsets and embeds the watermark by enforcing a deterministic combinatorial pattern over these subsets during generation. We accompany the combinatorial watermark with a global statistic that can be used to detect the watermark. Furthermore, we design lightweight local statistics to flag and localize potential edits. We introduce two task-specific evaluation metrics, Type-I error rate and detection accuracy, and evaluate our method on open-source LLMs across a variety of editing scenarios, demonstrating strong empirical performance in edit localization.
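To make the idea of a pattern over vocabulary subsets concrete, here is a deliberately simplified toy sketch. It is not the authors' scheme: the partition rule (token id modulo k), the cyclic pattern, and both statistics are assumptions chosen only to show how a global statistic can detect a watermark while a local one points at edited positions.

```python
def subset_of(token_id, k):
    """Toy partition of the vocabulary into k disjoint subsets (assumed rule)."""
    return token_id % k


def transition_ok(prev_token, cur_token, k):
    """Assumed pattern: each token's subset is the successor of the previous one's."""
    return subset_of(cur_token, k) == (subset_of(prev_token, k) + 1) % k


def global_statistic(tokens, k):
    """Fraction of adjacent token pairs that follow the pattern."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return 0.0
    return sum(transition_ok(a, b, k) for a, b in pairs) / len(pairs)


def broken_transitions(tokens, k):
    """Indices i where the pair (tokens[i], tokens[i + 1]) violates the pattern."""
    return [
        i for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
        if not transition_ok(a, b, k)
    ]


clean = [0, 1, 2, 3, 4, 5]   # subsets cycle 0, 1, 2, 0, 1, 2
edited = [0, 1, 2, 4, 4, 5]  # token at position 3 replaced after generation
print(global_statistic(clean, 3), broken_transitions(clean, 3))
print(global_statistic(edited, 3), broken_transitions(edited, 3))
```

A single substitution breaks the two transitions that touch the edited position, so the violations cluster around the edit while the rest of the sequence still matches the pattern.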
[LG-64] CAT: Curvature-Adaptive Transformers for Geometry-Aware Learning
Link: https://arxiv.org/abs/2510.01634
Authors: Ryan Y. Lin,Siddhartha Ojha,Nicholas Bai
Subjects: Machine Learning (cs.LG)
Comments:
[LG-65] Posterior Collapse as a Phase Transition in Variational Autoencoders
Link: https://arxiv.org/abs/2510.01621
Authors: Zhen Li,Fan Zhang,Zheng Zhang,Yu Chen
Subjects: Machine Learning (cs.LG)
Comments: 12 pages, 8 figures
[LG-66] Securing generative artificial intelligence with parallel magnetic tunnel junction true randomness
Link: https://arxiv.org/abs/2510.01598
Authors: Youwei Bao,Shuhan Yang,Hyunsoo Yang
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 4 figures
[LG-67] Gradient Shaping Beyond Clipping: A Functional Perspective on Update Magnitude Control
Link: https://arxiv.org/abs/2510.01578
Authors: Haochen You,Baojing Liu
Subjects: Machine Learning (cs.LG)
Comments: Accepted as a conference paper at ACM Multimedia Asia 2025
[LG-68] TetriServe: Efficient DiT Serving for Heterogeneous Image Generation
Link: https://arxiv.org/abs/2510.01565
Authors: Runyu Lu,Shiqi He,Wenxuan Tan,Shenggui Li,Ruofan Wu,Jeff J. Ma,Ang Chen,Mosharaf Chowdhury
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:
[LG-69] Large-Scale Bayesian Causal Discovery with Interventional Data
Link: https://arxiv.org/abs/2510.01562
Authors: Seong Woo Han,Daniel Duy Vo,Brielin C. Brown
Subjects: Machine Learning (cs.LG)
Comments:
[LG-70] CardioRAG: A Retrieval-Augmented Generation Framework for Multimodal Chagas Disease Detection
Link: https://arxiv.org/abs/2510.01558
Authors: Zhengyang Shen,Xuehao Zhai,Hua Tu,Mayue Shi
Subjects: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 4 pages, 2 figures. Accepted for oral presentation at the 52nd international Computing in Cardiology Conference (CinC2025)
[LG-71] MIRA: Towards Mitigating Reward Hacking in Inference-Time Alignment of T2I Diffusion Models
Link: https://arxiv.org/abs/2510.01549
Authors: Kevin Zhai,Utsav Singh,Anirudh Thatipelli,Souradip Chakraborty,Anit Kumar Sahu,Furong Huang,Amrit Singh Bedi,Mubarak Shah
Subjects: Machine Learning (cs.LG)
Comments:
[LG-72] Executable Counterfactuals: Improving LLMs' Causal Reasoning Through Code
Link: https://arxiv.org/abs/2510.01539
Authors: Aniket Vashishtha,Qirun Dai,Hongyuan Mei,Amit Sharma,Chenhao Tan,Hao Peng
Subjects: Machine Learning (cs.LG)
Comments:
[LG-73] TimeSeriesScientist: A General-Purpose AI Agent for Time Series Analysis
Link: https://arxiv.org/abs/2510.01538
Authors: Haokun Zhao,Xiang Zhang,Jiaqi Wei,Yiwei Xu,Yuting He,Siqi Sun,Chenyu You
Subjects: Machine Learning (cs.LG)
Comments:
[LG-74] NVIDIA AI Aerial: AI-Native Wireless Communications
Link: https://arxiv.org/abs/2510.01533
Authors: Kobi Cohen-Arazi,Michael Roe,Zhen Hu,Rohan Chavan,Anna Ptasznik,Joanna Lin,Joao Morais,Joseph Boccuzzi,Tommaso Balercia
Subjects: Machine Learning (cs.LG)
Comments: 7 pages, 7 figures
[LG-75] Bypassing Prompt Guards in Production with Controlled-Release Prompting
Link: https://arxiv.org/abs/2510.01529
Authors: Jaiden Fairoze,Sanjam Garg,Keewoo Lee,Mingyuan Wang
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:
[LG-76] Round-trip Reinforcement Learning: Self-Consistent Training for Better Chemical LLMs
Link: https://arxiv.org/abs/2510.01527
Authors: Lecheng Kong,Xiyuan Wang,Yixin Chen,Muhan Zhang
Subjects: Machine Learning (cs.LG)
Comments: 19 pages
[LG-77] On Integer Programming for the Binarized Neural Network Verification Problem
Link: https://arxiv.org/abs/2510.01525
Authors: Woojin Kim,James R. Luedtke
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
[LG-78] CarbonX: An Open-Source Tool for Computational Decarbonization Using Time Series Foundation Models
Link: https://arxiv.org/abs/2510.01521
Authors: Diptyaroop Maji,Kang Yang,Prashant Shenoy,Ramesh K Sitaraman,Mani Srivastava
Subjects: Machine Learning (cs.LG)
Comments:
[LG-79] Flock: A Knowledge Graph Foundation Model via Learning on Random Walks
Link: https://arxiv.org/abs/2510.01510
Authors: Jinwoo Kim,Xingyue Huang,Krzysztof Olejniczak,Kyungbin Min,Michael Bronstein,Seunghoon Hong,İsmail İlkan Ceylan
Subjects: Machine Learning (cs.LG)
Comments:
[LG-80] Realistic CDSS Drug Dosing with End-to-end Recurrent Q-learning for Dual Vasopressor Control NEURIPS2025
Link: https://arxiv.org/abs/2510.01508
Authors: Will Y. Zou,Jean Feng,Alexandre Kalimouttou,Jennifer Yuntong Zhang,Christopher W. Seymour,Romain Pirracchio
Subjects: Machine Learning (cs.LG)
Comments: 11 pages, 5 figures. Neurips 2025 Workshop Learning from Time Series for Health
[LG-81] Density-Ratio Weighted Behavioral Cloning: Learning Control Policies from Corrupted Datasets
Link: https://arxiv.org/abs/2510.01479
Authors: Shriram Karpoora Sundara Pandian,Ali Baheri
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:
[LG-82] Comparative Field Deployment of Reinforcement Learning and Model Predictive Control for Residential HVAC
Link: https://arxiv.org/abs/2510.01475
Authors: Ozan Baris Mulayim,Elias N. Pergantis,Levi D. Reyes Premer,Bingqing Chen,Guannan Qu,Kevin J. Kircher,Mario Bergés
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: 27 pages, 11 figures, 4 tables. Under review for Applied Energy
[LG-83] PEL-NAS: Search Space Partitioned Architecture Prompt Co-Evolutionary LLM-driven Hardware-Aware Neural Architecture Search
Link: https://arxiv.org/abs/2510.01472
Authors: Hengyi Zhu,Grace Li Zhang,Shaoyi Huang
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Hardware-Aware Neural Architecture Search (HW-NAS) requires joint optimization of accuracy and latency under device constraints. Traditional supernet-based methods require multiple GPU days per dataset. Large Language Model (LLM)-driven approaches avoid training a large supernet and can provide quick feedback, but we observe an exploration bias: the LLM repeatedly proposes neural network designs within a limited part of the search space and fails to discover architectures across different latency ranges in the entire search space. To address this issue, we propose PEL-NAS: a search space Partitioned, architecture prompt co-Evolutionary and LLM-driven Neural Architecture Search that can generate neural networks with high accuracy and low latency with reduced search cost. Our proposed PEL-NAS has three key components: 1) a complexity-driven partitioning engine that divides the search space by complexity to enforce diversity and mitigate exploration bias; 2) an LLM-powered architecture prompt co-evolution operator, in which the LLM first updates a knowledge base of design heuristics based on results from the previous round, then performs a guided evolution algorithm on architectures with prompts that incorporate this knowledge base. Prompts and designs improve together across rounds, which avoids random guesswork and improves efficiency; 3) a zero-cost predictor to avoid training a large number of candidates from scratch. Experimental results show that on HW-NAS-Bench, PEL-NAS can achieve overall higher hypervolume (HV), lower inverted generational distance (IGD), and up to 54% lower latency than baselines at similar accuracy. Meanwhile, the search cost drops from days to minutes compared with traditional supernet baselines.
[LG-84] Fine-tuning LLMs with variational Bayesian last layer for high-dimensional Bayesian optimization
Link: https://arxiv.org/abs/2510.01471
Authors: Haotian Xiang,Jinwen Xu,Qin Lu
Subjects: Machine Learning (cs.LG)
Comments:
[LG-85] How Well Can Preference Optimization Generalize Under Noisy Feedback?
Link: https://arxiv.org/abs/2510.01458
Authors: Shawn Im,Yixuan Li
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:As large language models (LLMs) advance their capabilities, aligning these models with human preferences has become crucial. Preference optimization, which trains models to distinguish between preferred and non-preferred responses based on human feedback, has become a crucial component for aligning LLMs. However, most existing works assume noise-free feedback, which is unrealistic due to the inherent errors and inconsistencies in human judgments. This paper addresses the impact of noisy feedback on preference optimization, providing generalization guarantees under these conditions. In particular, we consider noise models that correspond to common real-world sources of noise, such as mislabeling and uncertainty. Unlike traditional analyses that assume convergence, our work focuses on finite-step preference optimization, offering new insights that are more aligned with practical LLM training. We describe how generalization decays with different types of noise across levels of noise rates based on the preference data distribution and number of samples. Our analysis for noisy preference learning applies to a broad family of preference optimization losses such as DPO, IPO, SLiC, etc. Empirical validation on contemporary LLMs confirms the practical relevance of our findings, offering valuable insights for developing AI systems that align with human preferences.
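Since the abstract above names DPO as one of the losses covered, a small sketch of the per-pair DPO loss may help show what a mislabeled preference does to the training signal. This is the standard DPO objective for one preference pair, not the paper's analysis; the log-probabilities and the choice of beta are made-up values.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(margin)).

    margin = beta * [(policy - reference log-prob of the chosen response)
                     - (policy - reference log-prob of the rejected response)]
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) equals log(1 + exp(-m)); adequate for moderate margins.
    return math.log1p(math.exp(-margin))


# Hypothetical sequence log-probabilities for one preference pair.
clean = dpo_loss(-1.0, -2.0, -1.5, -1.5, beta=1.0)
# A mislabeled pair swaps which response is treated as preferred.
flipped = dpo_loss(-2.0, -1.0, -1.5, -1.5, beta=1.0)
print(clean, flipped)
```

Flipping a label negates the margin, so a pair the policy already ranks correctly suddenly incurs a larger loss and pushes the update in the wrong direction.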
[LG-86] Fixing That Free Lunch: When, Where, and Why Synthetic Data Fails in Model-Based Policy Optimization
Link: https://arxiv.org/abs/2510.01457
Authors: Brett Barkley,David Fridovich-Keil
Subjects: Machine Learning (cs.LG)
Comments:
[LG-87] SCOPED: Score-Curvature Out-of-distribution Proximity Evaluator for Diffusion
Link: https://arxiv.org/abs/2510.01456
Authors: Brett Barkley,Preston Culbertson,David Fridovich-Keil
Subjects: Machine Learning (cs.LG)
Comments:
[LG-88] SoftAdaClip: A Smooth Clipping Strategy for Fair and Private Model Training
Link: https://arxiv.org/abs/2510.01447
Authors: Dorsa Soleymani,Ali Dadsetan,Frank Rudzicz
Subjects: Machine Learning (cs.LG)
Comments:
[LG-89] Edge Artificial Intelligence: A Systematic Review of Evolution, Taxonomic Frameworks, and Future Horizons
Link: https://arxiv.org/abs/2510.01439
Authors: Mohamad Abou Ali,Fadi Dornaika
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Edge Artificial Intelligence (Edge AI) embeds intelligence directly into devices at the network edge, enabling real-time processing with improved privacy and reduced latency by processing data close to its source. This review systematically examines the evolution, current landscape, and future directions of Edge AI through a multi-dimensional taxonomy including deployment location, processing capabilities such as TinyML and federated learning, application domains, and hardware types. Following PRISMA guidelines, the analysis traces the field from early content delivery networks and fog computing to modern on-device intelligence. Core enabling technologies such as specialized hardware accelerators, optimized software, and communication protocols are explored. Challenges including resource limitations, security, model management, power consumption, and connectivity are critically assessed. Emerging opportunities in neuromorphic hardware, continual learning algorithms, edge-cloud collaboration, and trustworthiness integration are highlighted, providing a comprehensive framework for researchers and practitioners.
[LG-90] Learning to Play Multi-Follower Bayesian Stackelberg Games
Link: https://arxiv.org/abs/2510.01387
Authors: Gerson Personnat,Tao Lin,Safwan Hossain,David C. Parkes
Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Theoretical Economics (econ.TH)
Comments:
[LG-91] Fine-Tuning Masked Diffusion for Provable Self-Correction
Link: https://arxiv.org/abs/2510.01384
Authors: Jaeyeon Kim,Seunggeun Kim,Taekyun Lee,David Z. Pan,Hyeji Kim,Sham Kakade,Sitan Chen
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:A natural desideratum for generative models is self-correction: detecting and revising low-quality tokens at inference. While Masked Diffusion Models (MDMs) have emerged as a promising approach for generative modeling in discrete spaces, their capacity for self-correction remains poorly understood. Prior attempts to incorporate self-correction into MDMs either require overhauling MDM architectures/training or rely on imprecise proxies for token quality, limiting their applicability. Motivated by this, we introduce PRISM (Plug-in Remasking for Inference-time Self-correction of Masked Diffusions), a lightweight, model-agnostic approach that applies to any pretrained MDM. Theoretically, PRISM defines a self-correction loss that provably learns per-token quality scores, without RL or a verifier. These quality scores are computed in the same forward pass with MDM and used to detect low-quality tokens. Empirically, PRISM advances MDM inference across domains and scales: Sudoku; unconditional text (170M); and code with LLaDA (8B).
[LG-92] Selective Underfitting in Diffusion Models
Link: https://arxiv.org/abs/2510.01378
Authors: Kiwhan Song,Jaeyeon Kim,Sitan Chen,Yilun Du,Sham Kakade,Vincent Sitzmann
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Diffusion models have emerged as the principal paradigm for generative modeling across various domains. During training, they learn the score function, which in turn is used to generate samples at inference. They raise a basic yet unsolved question: which score do they actually learn? In principle, a diffusion model that matches the empirical score in the entire data space would simply reproduce the training data, failing to generate novel samples. Recent work addresses this question by arguing that diffusion models underfit the empirical score due to training-time inductive biases. In this work, we refine this perspective, introducing the notion of selective underfitting: instead of underfitting the score everywhere, better diffusion models more accurately approximate the score in certain regions of input space, while underfitting it in others. We characterize these regions and design empirical interventions to validate our perspective. Our results establish that selective underfitting is essential for understanding diffusion models, yielding new, testable insights into their generalization and generative performance.
[LG-93] RheOFormer: A generative transformer model for simulation of complex fluids and flows
Link: https://arxiv.org/abs/2510.01365
Authors: Maedeh Saberi,Amir Barati Farimani,Safa Jamali
Subjects: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments: 8 pages, 5 figures. Submitted to PNAS
[LG-94] To Augment or Not to Augment? Diagnosing Distributional Symmetry Breaking ICLR
Link: https://arxiv.org/abs/2510.01349
Authors: Hannah Lawrence,Elyssa Hofgard,Vasco Portilheiro,Yuxuan Chen,Tess Smidt,Robin Walters
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: A short version of this paper appeared at the ICLR AI4Mat workshop in April 2025
Abstract:Symmetry-aware methods for machine learning, such as data augmentation and equivariant architectures, encourage correct model behavior on all transformations (e.g. rotations or permutations) of the original dataset. These methods can improve generalization and sample efficiency, under the assumption that the transformed datapoints are highly probable, or “important”, under the test distribution. In this work, we develop a method for critically evaluating this assumption. In particular, we propose a metric to quantify the amount of anisotropy, or symmetry-breaking, in a dataset, via a two-sample neural classifier test that distinguishes between the original dataset and its randomly augmented equivalent. We validate our metric on synthetic datasets, and then use it to uncover surprisingly high degrees of alignment in several benchmark point cloud datasets. We show theoretically that distributional symmetry-breaking can actually prevent invariant methods from performing optimally even when the underlying labels are truly invariant, as we show for invariant ridge regression in the infinite feature limit. Empirically, we find that the implication for symmetry-aware methods is dataset-dependent: equivariant methods still impart benefits on some anisotropic datasets, but not others. Overall, these findings suggest that understanding equivariance – both when it works, and why – may require rethinking symmetry biases in the data.
[LG-95] Self-Supervised Representation Learning as Mutual Information Maximization
Link: https://arxiv.org/abs/2510.01345
Authors: Akhlaqur Rahman Sabby,Yi Sui,Tongzi Wu,Jesse C. Cresswell,Ga Wu
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Self-supervised representation learning (SSRL) has demonstrated remarkable empirical success, yet its underlying principles remain insufficiently understood. While recent works attempt to unify SSRL methods by examining their information-theoretic objectives or summarizing their heuristics for preventing representation collapse, architectural elements like the predictor network, stop-gradient operation, and statistical regularizer are often viewed as empirically motivated additions. In this paper, we adopt a first-principles approach and investigate whether the learning objective of an SSRL algorithm dictates its possible optimization strategies and model design choices. In particular, by starting from a variational mutual information (MI) lower bound, we derive two training paradigms, namely Self-Distillation MI (SDMI) and Joint MI (JMI), each imposing distinct structural constraints and covering a set of existing SSRL algorithms. SDMI inherently requires alternating optimization, making stop-gradient operations theoretically essential. In contrast, JMI admits joint optimization through symmetric architectures without such components. Under the proposed formulation, predictor networks in SDMI and statistical regularizers in JMI emerge as tractable surrogates for the MI objective. We show that many existing SSRL methods are specific instances or approximations of these two paradigms. This paper provides a theoretical explanation behind the choices of different architectural components of existing SSRL methods, beyond heuristic conveniences.
[LG-96] On the Identifiability of Latent Action Policies
Link: https://arxiv.org/abs/2510.01337
Authors: Sébastien Lachapelle
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 10 pages
[LG-97] Quantum-inspired Benchmark for Estimating Intrinsic Dimension
Link: https://arxiv.org/abs/2510.01335
Authors: Aritra Das,Joseph T. Iosue,Victor V. Albert
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Metric Geometry (math.MG); Data Analysis, Statistics and Probability (physics.data-an); Quantum Physics (quant-ph)
Comments: 19 figures, 35 pages
[LG-98] Network-Level Vehicle Delay Estimation at Heterogeneous Signalized Intersections
Link: https://arxiv.org/abs/2510.01292
Authors: Xiaobo Ma,Hyunsoo Noh,James Tokishi,Ryan Hatch
Subjects: Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2503.20113
Abstract:Accurate vehicle delay estimation is essential for evaluating the performance of signalized intersections and informing traffic management strategies. Delay reflects congestion levels and affects travel time reliability, fuel use, and emissions. Machine learning (ML) offers a scalable, cost-effective alternative; however, conventional models typically assume that training and testing data follow the same distribution, an assumption that is rarely satisfied in real-world applications. Variations in road geometry, signal timing, and driver behavior across intersections often lead to poor generalization and reduced model accuracy. To address this issue, this study introduces a domain adaptation (DA) framework for estimating vehicle delays across diverse intersections. The framework separates data into source and target domains, extracts key traffic features, and fine-tunes the model using a small, labeled subset from the target domain. A novel DA model, Gradient Boosting with Balanced Weighting (GBBW), reweights source data based on similarity to the target domain, improving adaptability. The framework is tested using data from 57 heterogeneous intersections in Pima County, Arizona. Performance is evaluated against eight state-of-the-art ML regression models and seven instance-based DA methods. Results demonstrate that the GBBW framework provides more accurate and robust delay estimates. This approach supports more reliable traffic signal optimization, congestion management, and performance-based planning. By enhancing model transferability, the framework facilitates broader deployment of machine learning techniques in real-world transportation systems.
[LG-99] ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models
Link: https://arxiv.org/abs/2510.01290
Authors: Akshat Ramachandran,Marina Neseem,Charbel Sakr,Rangharajan Venkatesan,Brucek Khailany,Tushar Krishna
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:The long-output context generation of large reasoning models enables extended chain of thought (CoT) but also drives rapid growth of the key-value (KV) cache, quickly overwhelming GPU memory. To address this challenge, we propose ThinKV, a thought-adaptive KV cache compression framework. ThinKV is based on the observation that attention sparsity reveals distinct thought types with varying importance within the CoT. It applies a hybrid quantization-eviction strategy, assigning token precision by thought importance and progressively evicting tokens from less critical thoughts as reasoning trajectories evolve. Furthermore, to implement ThinKV, we design a kernel that extends PagedAttention to enable efficient reuse of evicted tokens’ memory slots, eliminating compaction overheads. Extensive experiments on DeepSeek-R1-Distill, GPT-OSS, and NVIDIA AceReason across mathematics and coding benchmarks show that ThinKV achieves near-lossless accuracy with less than 5% of the original KV cache, while improving performance with up to 5.8x higher inference throughput over state-of-the-art baselines.
[LG-100] Safe Reinforcement Learning-Based Vibration Control: Overcoming Training Risks with LQR Guidance
Link: https://arxiv.org/abs/2510.01269
Authors: Rohan Vitthal Thorat,Juhi Singh,Rajdip Nayek
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
Comments: Paper accepted for presentation at ICCMS 2025. The submission includes 10 pages and 6 figures
[LG-101] A Framework for Scalable Heterogeneous Multi-Agent Adversarial Reinforcement Learning in IsaacLab
Link: https://arxiv.org/abs/2510.01264
Authors: Isaac Peterson,Christopher Allred,Jacob Morrey,Mario Harper
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 8 pages, 9 figures, code this https URL
[LG-102] Adaptive Federated Learning Defences via Trust-Aware Deep Q-Networks
Link: https://arxiv.org/abs/2510.01261
Authors: Vedant Palit
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: 16 pages, 10 figures
[LG-103] Accelerating Long-Term Molecular Dynamics with Physics-Informed Time-Series Forecasting
Link: https://arxiv.org/abs/2510.01206
Authors: Hung Le,Sherif Abbas,Minh Hoang Nguyen,Van Dai Do,Huu Hiep Nguyen,Dung Nguyen
Subjects: Machine Learning (cs.LG)
Comments: 16 pages, preprint
[LG-104] Location Matters: Leveraging Multi-Resolution Geo-Embeddings for Housing Search RECSYS2025
Link: https://arxiv.org/abs/2510.01196
Authors: Ivo Silva,Pedro Nogueira,Guilherme Bonaldo(QuintoAndar)
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted to RecSys 2025 (industry track)
[LG-105] Fast training of accurate physics-informed neural networks without gradient descent
Link: https://arxiv.org/abs/2405.20836
Authors: Chinmay Datar,Taniya Kapoor,Abhishek Chandra,Qing Sun,Erik Lien Bolager,Iryna Burak,Anna Veselovska,Massimo Fornasier,Felix Dietrich
Subjects: Numerical Analysis (math.NA); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: 54 pages, 23 figures
[LG-106] Quantum Fisher information matrices from Rényi relative entropies
Link: https://arxiv.org/abs/2510.02218
Authors: Mark M. Wilde
Subjects: Quantum Physics (quant-ph); Statistical Mechanics (cond-mat.stat-mech); Information Theory (cs.IT); Machine Learning (cs.LG); High Energy Physics - Theory (hep-th)
Comments: 94 pages, 2 figures, dedicated to Professor Fumio Hiai on the occasion of his forthcoming 80th birthday
[LG-107] Hybrid Physics-ML Framework for Pan-Arctic Permafrost Infrastructure Risk at Record 2.9-Million Observation Scale
Link: https://arxiv.org/abs/2510.02189
Authors: Boris Kriuk
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 14 pages, 9 figures
[LG-108] Non-Asymptotic Analysis of Data Augmentation for Precision Matrix Estimation NEURIPS2025
Link: https://arxiv.org/abs/2510.02119
Authors: Lucas Morisset,Adrien Hardy,Alain Durmus
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
Comments: Conference paper at NeurIPS 2025 (Spotlight)
[LG-109] Adaptive Kernel Selection for Stein Variational Gradient Descent
Link: https://arxiv.org/abs/2510.02067
Authors: Moritz Melcher,Simon Weissmann,Ashia C. Wilson,Jakob Zech
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
[LG-110] Multidata Causal Discovery for Statistical Hurricane Intensity Forecasting
Link: https://arxiv.org/abs/2510.02050
Authors: Saranya Ganesh S.,Frederick Iat-Hin Tam,Milton S. Gomez,Marie McGraw,Mark DeMaria,Kate Musgrave,Jakob Runge,Tom Beucler
Subjects: Applications (stat.AP); Machine Learning (cs.LG)
Comments: 19 pages, 7 Figures, 1 Table, SI
[LG-111] Uniform-in-time convergence bounds for Persistent Contrastive Divergence Algorithms
Link: https://arxiv.org/abs/2510.01944
Authors: Paul Felix Valsecchi Oliva,O. Deniz Akyildiz,Andrew Duncan
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
[LG-112] Smooth Quasar-Convex Optimization with Constraints
Link: https://arxiv.org/abs/2510.01943
Authors: David Martínez-Rubio
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments:
Abstract:Quasar-convex functions form a broad nonconvex class with applications to linear dynamical systems, generalized linear models, and Riemannian optimization, among others. Current nearly optimal algorithms work only in affine spaces due to the loss of one degree of freedom when working with general convex constraints. Obtaining an accelerated algorithm that makes nearly optimal $\widetilde{O}(1/(\gamma\sqrt{\epsilon}))$ first-order queries to a $\gamma$-quasar-convex smooth function with constraints was independently asked as an open problem in Martínez-Rubio (2022); Lezane, Langer, and Koolen (2024). In this work, we solve this question by designing an inexact accelerated proximal point algorithm that we implement using a first-order method achieving the aforementioned rate and, as a consequence, we improve the complexity of the accelerated geodesically Riemannian optimization solution in Martínez-Rubio (2022). We also analyze projected gradient descent and Frank-Wolfe algorithms in this constrained quasar-convex setting. To the best of our knowledge, our work provides the first analyses of first-order methods for quasar-convex smooth functions with general convex constraints.
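For readers unfamiliar with the term, the commonly used definition of quasar-convexity is recalled below. This is the standard unconstrained textbook form and is given as background only; the paper's constrained variant may be stated differently, and the symbols $f$, $x^\ast$, and $\mathcal{X}$ are notation assumed here.

```latex
% A differentiable function f is \gamma-quasar-convex, \gamma \in (0, 1],
% with respect to a minimizer x^\ast if, for all x in the domain \mathcal{X},
\[
  f(x^\ast) \;\ge\; f(x) + \frac{1}{\gamma}\,\langle \nabla f(x),\, x^\ast - x \rangle .
\]
% For \gamma = 1 this is star-convexity; every convex function satisfies it with \gamma = 1.
```

Smaller values of $\gamma$ allow more nonconvexity, which is why the query complexity quoted in the abstract degrades as $1/\gamma$.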
[LG-113] Precise Dynamics of Diagonal Linear Networks: A Unifying Analysis by Dynamical Mean-Field Theory
Link: https://arxiv.org/abs/2510.01930
Authors: Sota Nishiyama,Masaaki Imaizumi
Subjects: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
Comments: 54 pages
[LG-114] Deep Hedging Under Non-Convexity: Limitations and a Case for AlphaZero
Link: https://arxiv.org/abs/2510.01874
Authors: Matteo Maggiolo,Giuseppe Nuti,Miroslav Štrupl,Oleg Szehr
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 15 pages in main text + 18 pages of references and appendices
[LG-115] A reproducible comparative study of categorical kernels for Gaussian process regression with new clustering-based nested kernels
Link: https://arxiv.org/abs/2510.01840
Authors: Raphaël Carpintero Perez(CMAP),Sébastien Da Veiga(ENSAI, CREST, RT-UQ),Josselin Garnier(CMAP, ASCII)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
[LG-116] PRESOL: a web-based computational setting for feature-based flare forecasting
Link: https://arxiv.org/abs/2510.01799
Authors: Chiara Curletto,Paolo Massa,Valeria Tagliafico,Cristina Campi,Federico Benvenuto,Michele Piana,Andrea Tacchino
Subjects: Solar and Stellar Astrophysics (astro-ph.SR); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Space Physics (physics.space-ph)
Comments:
[LG-117] Scalable Asynchronous Federated Modeling for Spatial Data
Link: https://arxiv.org/abs/2510.01771
Authors: Jianwei Shi,Sameh Abdulah,Ying Sun,Marc G. Genton
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
Comments:
[LG-118] Reducing Simulation Dependence in Neutrino Telescopes with Masked Point Transformers
Link: https://arxiv.org/abs/2510.01733
Authors: Felix J. Yu,Nicholas Kamp,Carlos A. Argüelles
Subjects: High Energy Physics - Experiment (hep-ex); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: 8 pages, 3 figures, presented at the 39th International Cosmic Ray Conference (ICRC2025)
[LG-119] AI Foundation Model for Time Series with Innovations Representation
Link: https://arxiv.org/abs/2510.01560
Authors: Lang Tong,Xinyi Wang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:This paper introduces an Artificial Intelligence (AI) foundation model for time series in engineering applications, where causal operations are required for real-time monitoring and control. Since engineering time series are governed by physical, rather than linguistic, laws, large-language-model-based AI foundation models may be ineffective or inefficient. Building on the classical innovations representation theory of Wiener, Kallianpur, and Rosenblatt, we propose Time Series GPT (TS-GPT) – an innovations-representation-based Generative Pre-trained Transformer for engineering monitoring and control. As an example of foundation model adaptation, we consider Probabilistic Generative Forecasting, which produces future time series samples from conditional probability distributions given past realizations. We demonstrate the effectiveness of TS-GPT in forecasting real-time locational marginal prices using historical data from U.S. independent system operators.
[LG-120] Financial Stability Implications of Generative AI: Taming the Animal Spirits
Link: https://arxiv.org/abs/2510.01451
Authors: Anne Lundgaard Hansen,Seung Jung Lee
Subjects: General Finance (q-fin.GN); Machine Learning (cs.LG)
Comments:
[LG-121] Continuously Augmented Discrete Diffusion model for Categorical Generative Modeling
Link: https://arxiv.org/abs/2510.01329
Authors: Huangjie Zheng,Shansan Gong,Ruixiang Zhang,Tianrong Chen,Jiatao Gu,Mingyuan Zhou,Navdeep Jaitly,Yizhe Zhang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Standard discrete diffusion models treat all unobserved states identically by mapping them to an absorbing [MASK] token. This creates an ‘information void’ where semantic information that could be inferred from unmasked tokens is lost between denoising steps. We introduce Continuously Augmented Discrete Diffusion (CADD), a framework that augments the discrete state space with a paired diffusion in a continuous latent space. This yields graded, gradually corrupted states in which masked tokens are represented by noisy yet informative latent vectors rather than collapsed ‘information voids’. At each reverse step, CADD may leverage the continuous latent as a semantic hint to guide discrete denoising. The design is clean and compatible with existing discrete diffusion training. At sampling time, the strength and choice of estimator for the continuous latent vector enables a controlled trade-off between mode-coverage (generating diverse outputs) and mode-seeking (generating contextually precise outputs) behaviors. Empirically, we demonstrate CADD improves generative quality over mask-based diffusion across text generation, image synthesis, and code modeling, with consistent gains on both qualitative and quantitative metrics against strong discrete baselines.
[LG-122] Combining complex Langevin dynamics with score-based and energy-based diffusion models
Link: https://arxiv.org/abs/2510.01328
Authors: Gert Aarts,Diaa E. Habibi,Lingxiao Wang,Kai Zhou
Subjects: High Energy Physics - Lattice (hep-lat); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
Comments: 22 pages, many figures
Abstract:Theories with a sign problem due to a complex action or Boltzmann weight can sometimes be numerically solved using a stochastic process in the complexified configuration space. However, the probability distribution effectively sampled by this complex Langevin process is not known a priori and notoriously hard to understand. In generative AI, diffusion models can learn distributions, or their log derivatives, from data. We explore the ability of diffusion models to learn the distributions sampled by a complex Langevin process, comparing score-based and energy-based diffusion models, and speculate about possible applications.
[LG-123] Hybrid Predictive Modeling of Malaria Incidence in the Amhara Region Ethiopia: Integrating Multi-Output Regression and Time-Series Forecasting
Link: https://arxiv.org/abs/2510.01302
Authors: Kassahun Azezew,Amsalu Tesema,Bitew Mekuria,Ayenew Kassie,Animut Embiale,Ayodeji Olalekan Salau,Tsega Asresa
Subjects: Other Quantitative Biology (q-bio.OT); Machine Learning (cs.LG)
Comments:
[LG-124] Private Realizable-to-Agnostic Transformation with Near-Optimal Sample Complexity
Link: https://arxiv.org/abs/2510.01291
Authors: Bo Li,Wei Wang,Peng Ye
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Information Retrieval
[IR-0] Contrastive Retrieval Heads Improve Attention-Based Re-Ranking
Link: https://arxiv.org/abs/2510.02219
Authors: Linh Tran,Yulong Li,Radu Florian,Wei Sun
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:The strong zero-shot and long-context capabilities of recent Large Language Models (LLMs) have paved the way for highly effective re-ranking systems. Attention-based re-rankers leverage attention weights from transformer heads to produce relevance scores, but not all heads are created equally: many contribute noise and redundancy, thus limiting performance. To address this, we introduce CoRe heads, a small set of retrieval heads identified via a contrastive scoring metric that explicitly rewards high attention heads that correlate with relevant documents, while downplaying nodes with higher attention that correlate with irrelevant documents. This relative ranking criterion isolates the most discriminative heads for re-ranking and yields a state-of-the-art list-wise re-ranker. Extensive experiments with three LLMs show that aggregated signals from CoRe heads, constituting less than 1% of all heads, substantially improve re-ranking accuracy over strong baselines. We further find that CoRe heads are concentrated in middle layers, and pruning the computation of final 50% of model layers preserves accuracy while significantly reducing inference time and memory usage.
[IR-1] TalkPlay-Tools: Conversational Music Recommendation with LLM Tool Calling NEURIPS
Link: https://arxiv.org/abs/2510.01698
Authors: Seungheon Doh,Keunwoo Choi,Juhan Nam
Subjects: Information Retrieval (cs.IR); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted for publication at The Workshop on AI for Music, Neural Information Processing Systems (NeurIPS-AI4Music)
Abstract:While the recent developments in large language models (LLMs) have successfully enabled generative recommenders with natural language interactions, their recommendation behavior is limited, leaving other simpler yet crucial components such as metadata or attribute filtering underutilized in the system. We propose an LLM-based music recommendation system with tool calling to serve as a unified retrieval-reranking pipeline. Our system positions an LLM as an end-to-end recommendation system that interprets user intent, plans tool invocations, and orchestrates specialized components: boolean filters (SQL), sparse retrieval (BM25), dense retrieval (embedding similarity), and generative retrieval (semantic IDs). Through tool planning, the system predicts which types of tools to use, their execution order, and the arguments needed to find music matching user preferences, supporting diverse modalities while seamlessly integrating multiple database filtering methods. We demonstrate that this unified tool-calling framework achieves competitive performance across diverse recommendation scenarios by selectively employing appropriate retrieval methods based on user queries, envisioning a new paradigm for conversational music recommendation systems.
[IR-2] IoDResearch: Deep Research on Private Heterogeneous Data via the Internet of Data
Link: https://arxiv.org/abs/2510.01553
Authors: Zhuofan Shi,Zijie Guo,Xinjian Ma,Gang Huang,Yun Ma,Xiang Jing
Subjects: Information Retrieval (cs.IR)
Comments: 8 pages, 4 figures
[IR-3] MetaSynth: Multi-Agent Metadata Generation from Implicit Feedback in Black-Box Systems NEURIPS
Link: https://arxiv.org/abs/2510.01523
Authors: Shreeranjani Srirangamsridharan,Ali Abavisani,Reza Yousefi Maragheh,Ramin Giahi,Kai Zhao,Jason Cho,Sushant Kumar
Subjects: Information Retrieval (cs.IR)
Comments: NeurIPS Workshop LAW
Abstract:Meta titles and descriptions strongly shape engagement in search and recommendation platforms, yet optimizing them remains challenging. Search engine ranking models are black box environments, explicit labels are unavailable, and feedback such as click-through rate (CTR) arrives only post-deployment. Existing template, LLM, and retrieval-augmented approaches either lack diversity, hallucinate attributes, or ignore whether candidate phrasing has historically succeeded in ranking. This leaves a gap in directly leveraging implicit signals from observable outcomes. We introduce MetaSynth, a multi-agent retrieval-augmented generation framework that learns from implicit search feedback. MetaSynth builds an exemplar library from top-ranked results, generates candidate snippets conditioned on both product content and exemplars, and iteratively refines outputs via evaluator-generator loops that enforce relevance, promotional strength, and compliance. On both proprietary e-commerce data and the Amazon Reviews corpus, MetaSynth outperforms strong baselines across NDCG, MRR, and rank metrics. Large-scale A/B tests further demonstrate 10.26% CTR and 7.51% clicks. Beyond metadata, this work contributes a general paradigm for optimizing content in black-box systems using implicit signals.
[IR-4] Optimal signals assignment for eBay View Item page RECSYS2025
Link: https://arxiv.org/abs/2510.01198
Authors: Matan Mandelbrod,Biwei Jiang,Giald Wagner,Tal Franji,Guy Feigenblat
Subjects: Information Retrieval (cs.IR)
Comments: Accepted at the CONSEQUENCES 2025 workshop, co-located with ACM RecSys 2025
[IR-5] Are LLMs ready to help non-expert users to make charts of official statistics data?
Link: https://arxiv.org/abs/2510.01197
Authors: Gadir Suleymanli,Alexander Rogiers,Lucas Lageweg,Jefrey Lijffijt
Subjects: Information Retrieval (cs.IR)
Comments: