本篇博文主要内容为 2025-10-01 从arXiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱地址。

说明:每日论文数据从arXiv.org获取,每天12:00左右定时自动更新。

友情提示: 如果您需要通过邮箱接收每日论文数据,请在评论处留下你的邮箱。


概览 (2025-10-01)

今日共更新861篇论文,其中:

  • 自然语言处理148篇(Computation and Language (cs.CL))
  • 人工智能320篇(Artificial Intelligence (cs.AI))
  • 计算机视觉172篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习318篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] Convergence and Divergence of Language Models under Different Random Seeds EMNLP2025

【速读】: 该论文旨在解决语言模型(Language Models, LMs)在不同随机种子下训练时收敛行为的稳定性问题,即衡量模型在不同初始化条件下学习到的概率分布是否趋于一致。其核心贡献在于通过计算跨种子的每token期望Kullback–Leibler (KL)散度来量化收敛性,并发现了一个四阶段收敛模式:初始均匀阶段、快速收敛阶段、快速发散阶段以及缓慢再收敛阶段。解决方案的关键在于揭示了模型规模和训练阶段对收敛性的影响——较大模型在后期训练中能更快再收敛,而较小模型则无法实现稳定收敛,表明存在一个必要的模型规模以学习稳定的概率分布;此外,还发现词频和词性类别对收敛速度具有显著差异,高频词与功能词比低频词和实义词更易收敛。这些发现为理解语言模型训练中的分布稳定性提供了系统性证据。

链接: https://arxiv.org/abs/2509.26643
作者: Finlay Fehlauer(1),Kyle Mahowald(2),Tiago Pimentel(1) ((1) ETH Zurich, (2) University of Texas at Austin)
机构: ETH Zürich (苏黎世联邦理工学院); University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Published at EMNLP 2025

点击查看摘要

Abstract:In this paper, we investigate the convergence of language models (LMs) trained under different random seeds, measuring convergence as the expected per-token Kullback–Leibler (KL) divergence across seeds. By comparing LM convergence as a function of model size and training checkpoint, we identify a four-phase convergence pattern: (i) an initial uniform phase, (ii) a sharp-convergence phase, (iii) a sharp-divergence phase, and (iv) a slow-reconvergence phase. Further, we observe that larger models reconverge faster in later training stages, while smaller models never actually reconverge; these results suggest that a certain model size may be necessary to learn stable distributions. Restricting our analysis to specific token frequencies or part-of-speech (PoS) tags further reveals that convergence is uneven across linguistic categories: frequent tokens and function words converge faster and more reliably than their counterparts (infrequent tokens and content words). Overall, our findings highlight factors that influence the stability of the learned distributions in model training.
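
下面给出一个基于 numpy 的最小示意(非论文官方实现),演示如何计算两个不同随机种子模型之间的每 token 期望 KL 散度;logits_a、logits_b 为假设的两个模型在同一批语料位置上的输出。

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def expected_per_token_kl(logits_a, logits_b):
    """跨种子收敛度量:对同一批语料位置,计算 E_t[ KL(p_a(.|t) || p_b(.|t)) ]。
    logits_a / logits_b: 形状 (num_tokens, vocab_size),来自两个不同随机种子训练的模型。"""
    p, q = softmax(logits_a), softmax(logits_b)
    kl_per_token = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return float(kl_per_token.mean())

# 玩具示例:随机 logits,仅演示接口;实际使用时替换为两个模型在相同上下文上的输出
rng = np.random.default_rng(0)
logits_a, logits_b = rng.normal(size=(128, 50)), rng.normal(size=(128, 50))
print(expected_per_token_kl(logits_a, logits_b))
```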
zh

[NLP-1] Scaling Spoken Language Models with Syllabic Speech Tokenization

【速读】: 该论文旨在解决传统语音语言模型(Spoken Language Models, SLMs)在处理高帧率声学 token 时因自注意力机制的二次计算复杂度导致的训练与推理成本高昂的问题。其解决方案的关键在于采用基于音节层级的声学分词(syllabic tokenization),该方法通过显著压缩 token 长度(从高帧率的数十 Hz 降至 4–5 Hz)实现更高效的建模,同时保持或提升语言建模性能,在多个 SLU 基准测试中实现了超过 2 倍的训练时间减少和 5 倍的浮点运算次数(FLOPs)降低,验证了音节级建模在构建高效长上下文语音语言模型中的潜力。

链接: https://arxiv.org/abs/2509.26634
作者: Nicholas Lee,Cheol Jun Cho,Alan W Black,Gopala K. Anumanchipalli
机构: 未知
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Spoken language models (SLMs) typically discretize speech into high-frame-rate tokens extracted from SSL speech models. As the most successful LMs are based on the Transformer architecture, processing these long token streams with self-attention is expensive, as attention scales quadratically with sequence length. A recent SSL work introduces acoustic tokenization of speech at the syllable level, which is more interpretable and potentially more scalable with significant compression in token lengths (4-5 Hz). Yet, their value for spoken language modeling is not yet fully explored. We present the first systematic study of syllabic tokenization for spoken language modeling, evaluating models on a suite of SLU benchmarks while varying training data scale. Syllabic tokens can match or surpass the previous high-frame rate tokens while significantly cutting training and inference costs, achieving more than a 2x reduction in training time and a 5x reduction in FLOPs. Our findings highlight syllable-level language modeling as a promising path to efficient long-context spoken language models.
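
为直观感受音节级分词为何能降低自注意力开销,下面给出一个纯算术的粗略估算(时长与帧率均为假设数值,并非论文设置):

```python
# 粗略估算(数值仅为假设):自注意力计算量随序列长度 L 近似按 L^2 增长
duration_s = 30                      # 假设 30 秒语音
for name, rate_hz in [("高帧率分词(假设 50 Hz)", 50), ("音节级分词(约 5 Hz)", 5)]:
    L = duration_s * rate_hz
    print(f"{name}: 序列长度 L = {L}, 注意力代价 ∝ L^2 = {L * L:,}")
# 序列长度缩短 10 倍时,注意力项的计算量约降低 100 倍;
# 论文报告的整体收益(训练时间约 2 倍、FLOPs 约 5 倍)还包含注意力之外的其他开销
```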
zh

[NLP-2] Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models

【速读】: 该论文旨在解决现有过程监督强化学习(Process-Supervised Reinforcement Learning, PSRL)方法在探索效率方面的局限性,尤其是分支位置选择和样本采样效率不足的问题。其解决方案的关键在于提出一种名为AttnRL的新框架:首先基于观察到高注意力分数的推理步骤与推理行为相关联的现象,设计了从高注意力得分位置进行分支的策略以提升探索效率;其次,引入一种自适应采样策略,综合考虑问题难度和历史批次大小,确保训练批次中始终存在非零优势值,从而稳定训练过程;此外,还构建了一步离策略(one-step off-policy)训练流水线以进一步提高采样效率。实验表明,该方法在多个数学推理基准上显著优于现有方法,在性能、采样效率和训练效率方面均表现优异。

链接: https://arxiv.org/abs/2509.26628
作者: Runze Liu,Jiakang Wang,Yuling Shi,Zhihui Xie,Chenxin An,Kaiyan Zhang,Jian Zhao,Xiaodong Gu,Lei Lin,Wenping Hu,Xiu Li,Fuzheng Zhang,Guorui Zhou,Kun Gai
机构: Tsinghua University (清华大学); Kuaishou Technology (快手科技); Shanghai Jiao Tong University (上海交通大学); The University of Hong Kong (香港大学); Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) has shown remarkable success in enhancing the reasoning capabilities of Large Language Models (LLMs). Process-Supervised RL (PSRL) has emerged as a more effective paradigm compared to outcome-based RL. However, existing PSRL approaches suffer from limited exploration efficiency, both in terms of branching positions and sampling. In this paper, we introduce a novel PSRL framework (AttnRL), which enables efficient exploration for reasoning models. Motivated by preliminary observations that steps exhibiting high attention scores correlate with reasoning behaviors, we propose to branch from positions with high values. Furthermore, we develop an adaptive sampling strategy that accounts for problem difficulty and historical batch size, ensuring that the whole training batch maintains non-zero advantage values. To further improve sampling efficiency, we design a one-step off-policy training pipeline for PSRL. Extensive experiments on multiple challenging mathematical reasoning benchmarks demonstrate that our method consistently outperforms prior approaches in terms of performance and sampling and training efficiency.
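
下面用几行代码示意“从高注意力分数的推理步位置分支”的选点逻辑(假设已获得每个推理步聚合后的注意力分数,非官方实现):

```python
import numpy as np

def select_branch_positions(step_attention_scores, top_k=2):
    """从注意力分数最高的推理步中选取分支位置(简化示意,非官方实现)。
    step_attention_scores: 每个推理步的聚合注意力分数。"""
    scores = np.asarray(step_attention_scores)
    return np.argsort(scores)[::-1][:top_k].tolist()   # 分数越高越优先作为分支点

# 假设一条推理轨迹被切分为 6 个步骤,注意力分数为示意数据
print(select_branch_positions([0.1, 0.7, 0.3, 0.9, 0.2, 0.5], top_k=2))   # [3, 1]
```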
zh

[NLP-3] Searching for Difficult-to-Translate Test Examples at Scale

【速读】: 该论文旨在解决自然语言处理(Natural Language Processing, NLP)模型评估中测试数据难度不足的问题,即如何在海量潜在主题(seed topic)中高效识别最具挑战性的例子,以提升模型测试的严谨性。传统方法如暴力搜索因计算成本过高而不可行,本文将其建模为多臂赌博机(multi-armed bandit)问题:每个主题视为一个“臂”,抽取并评估单个样本代表一次“拉臂”操作,目标是在固定计算预算内最优地探索和利用,从而快速定位最困难的主题。解决方案的关键在于采用多种带策略的赌博机算法,相较于基线方法显著提升了发现高难度主题的效率与准确性。

链接: https://arxiv.org/abs/2509.26619
作者: Wenda Xu,Vilém Zouhar,Parker Riley,Mara Finkelstein,Markus Freitag,Daniel Deutsch
机构: Google(谷歌); ETH Zurich(苏黎世联邦理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:NLP models require test data that are sufficiently challenging. The difficulty of an example is linked to the topic it originates from (‘‘seed topic’’). The relationship between the topic and the difficulty of its instances is stochastic in nature: an example about a difficult topic can happen to be easy, and vice versa. At the scale of the Internet, there are tens of thousands of potential topics, and finding the most difficult one by drawing and evaluating a large number of examples across all topics is computationally infeasible. We formalize this task and treat it as a multi-armed bandit problem. In this framework, each topic is an ‘‘arm,’’ and pulling an arm (at a cost) involves drawing a single example, evaluating it, and measuring its difficulty. The goal is to efficiently identify the most difficult topics within a fixed computational budget. We illustrate the bandit problem setup of finding difficult examples for the task of machine translation. We find that various bandit strategies vastly outperform baseline methods like brute-force searching the most challenging topics.
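
下面是一个 UCB1 多臂赌博机的最小示意(numpy 实现,论文未指定具体策略,此处仅作演示):把每个主题视为一个“臂”,每次“拉臂”即从该主题抽取并评测一个样例,回报为其难度分。

```python
import numpy as np

def ucb1_hardest_topic(draw_difficulty, num_topics, budget, c=1.0):
    """draw_difficulty(t): 从主题 t 抽取并评测一个样例,返回其难度分(越高越难)。"""
    counts = np.zeros(num_topics)
    sums = np.zeros(num_topics)
    for step in range(budget):
        if step < num_topics:                 # 先对每个主题各拉一次臂
            t = step
        else:                                 # UCB1:均值 + 探索项
            ucb = sums / counts + c * np.sqrt(np.log(step + 1) / counts)
            t = int(np.argmax(ucb))
        sums[t] += draw_difficulty(t)
        counts[t] += 1
    return int(np.argmax(sums / counts))      # 估计出的最难主题

# 玩具环境:每个主题有一个真实难度均值,单次观测带噪声
true_difficulty = np.array([0.2, 0.5, 0.8, 0.4])
rng = np.random.default_rng(1)
draw = lambda t: true_difficulty[t] + rng.normal(0, 0.1)
print(ucb1_hardest_topic(draw, num_topics=4, budget=200))   # 期望输出 2
```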
zh

[NLP-4] DeepScientist: Advancing Frontier-Pushing Scientific Findings Progressively

【速读】: 该论文旨在解决当前AI Scientist系统在科学发现中缺乏目标导向性、难以产出具有实际科学价值成果的问题,即现有系统虽能生成新颖发现,但往往无法聚焦于人类定义的关键挑战并推动科学前沿进步。其解决方案的关键在于提出DeepScientist系统,该系统将科学发现过程形式化为贝叶斯优化(Bayesian Optimization)问题,并通过“假设-验证-分析”三层递进式评估机制实现自主探索与利用的平衡;借助累积的发现记忆(Findings Memory),系统能够智能筛选并提升最有潜力的科学假设至更高精度的验证层级,从而在长达数月的周期内持续生成高质量、可实验验证的科学成果。

链接: https://arxiv.org/abs/2509.26603
作者: Yixuan Weng,Minjun Zhu,Qiujie Xie,Qiyao Sun,Zhen Lin,Sifan Liu,Yue Zhang
机构: Westlake University (西湖大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:While previous AI Scientist systems can generate novel findings, they often lack the focus to produce scientifically valuable contributions that address pressing human-defined challenges. We introduce DeepScientist, a system designed to overcome this by conducting goal-oriented, fully autonomous scientific discovery over month-long timelines. It formalizes discovery as a Bayesian Optimization problem, operationalized through a hierarchical evaluation process consisting of “hypothesize, verify, and analyze”. Leveraging a cumulative Findings Memory, this loop intelligently balances the exploration of novel hypotheses with exploitation, selectively promoting the most promising findings to higher-fidelity levels of validation. Consuming over 20,000 GPU hours, the system generated about 5,000 unique scientific ideas and experimentally validated approximately 1100 of them, ultimately surpassing human-designed state-of-the-art (SOTA) methods on three frontier AI tasks by 183.7%, 1.9%, and 7.9%. This work provides the first large-scale evidence of an AI achieving discoveries that progressively surpass human SOTA on scientific tasks, producing valuable findings that genuinely push the frontier of scientific discovery. To facilitate further research into this process, we will open-source all experimental logs and system code at this https URL.
zh

[NLP-5] MENLO: From Preferences to Proficiency - Evaluating and Modeling Native-like Quality Across 47 Languages

【速读】: 该论文旨在解决多语言大语言模型(Large Language Model, LLM)生成响应在不同语言中难以保持原生质量(native-like quality)的问题。其核心挑战在于如何有效评估和对齐LLM在多种语言变体中的输出与人类母语者偏好的一致性。解决方案的关键在于提出MENLO框架,该框架基于受众设计(audience design)机制,构建了一个包含6,423个由人工标注的提示-响应偏好对的数据集,覆盖四个质量维度,并在47种语言变体中实现了高一致性。通过引入结构化标注规范与成对评估机制,MENLO显著提升了零样本LLM评判者的性能;进一步结合强化学习(reinforcement learning)、奖励塑造(reward shaping)和多任务学习方法进行微调后,模型在多语言场景下的判断能力接近人类水平,且训练后的判别器可作为生成式奖励模型(generative reward model),用于提升LLM的多语言能力。

链接: https://arxiv.org/abs/2509.26601
作者: Chenxi Whitehouse,Sebastian Ruder,Tony Lin,Oksana Kurylo,Haruka Takagi,Janice Lam,Nicolò Busetto,Denise Diaz
机构: Meta Superintelligence Labs (Meta 超级智能实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 23 tables, 17 figures

点击查看摘要

Abstract:Ensuring native-like quality of large language model (LLM) responses across many languages is challenging. To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms. Using MENLO, we create a dataset of 6,423 human-annotated prompt-response preference pairs covering four quality dimensions with high inter-annotator agreement in 47 language varieties. Our evaluation reveals that zero-shot LLM judges benefit significantly from pairwise evaluation and our structured annotation rubrics, yet they still underperform human annotators on our dataset. We demonstrate substantial improvements through fine-tuning with reinforcement learning, reward shaping, and multi-task learning approaches. Additionally, we show that RL-trained judges can serve as generative reward models to enhance LLMs’ multilingual proficiency, though discrepancies with human judgment remain. Our findings suggest promising directions for scalable multilingual evaluation and preference alignment. We release our dataset and evaluation framework to support further research in multilingual LLM evaluation.
zh

[NLP-6] Deconstructing Self-Bias in LLM-generated Translation Benchmarks

【速读】: 该论文试图解决由大语言模型(Large Language Models, LLMs)自动生成测试集所引发的“自我偏倚”(self bias)问题,即LLM生成的基准测试在评估模型性能时系统性地偏向于自身。解决方案的关键在于识别并缓解这种偏倚的来源:一方面来自LLM生成的测试数据(LLM as a testset),另一方面来自LLM作为评价者(LLM as an evaluator)的评估方法,二者结合会放大偏倚效应;此外,研究发现源文本多样性不足是导致偏倚的重要因素,因此提升生成源文本的多样性可有效缓解部分自我偏倚现象。

链接: https://arxiv.org/abs/2509.26600
作者: Wenda Xu,Sweta Agrawal,Vilém Zouhar,Markus Freitag,Daniel Deutsch
机构: Google(谷歌); ETH Zurich (苏黎世联邦理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As large language models (LLMs) begin to saturate existing benchmarks, automated benchmark creation using LLMs (LLM as a benchmark) has emerged as a scalable alternative to slow and costly human curation. While these generated test sets have the potential to cheaply rank models, we demonstrate a critical flaw: LLM-generated benchmarks systematically favor the model that created the benchmark, exhibiting self bias on low-resource language to English translation tasks. We show three key findings on automatic benchmarking of LLMs for translation: First, this bias originates from two sources: the generated test data (LLM as a testset) and the evaluation method (LLM as an evaluator), with their combination amplifying the effect. Second, self bias in LLM as a benchmark is heavily influenced by the model’s generation capabilities in the source language. For instance, we observe more pronounced bias in into-English translation, where the model’s generation system is developed, than in out-of-English translation tasks. Third, we observe that low diversity in source text is one attribution to self bias. Our results suggest that improving the diversity of these generated source texts can mitigate some of the observed self bias.
zh

[NLP-7] Clarification as Supervision: Reinforcement Learning for Vision-Language Interfaces

【速读】: 该论文旨在解决视觉数学推理中因图像到文本的接口不匹配而导致的性能瓶颈问题:当前视觉语言模型(Vision-Language Models, VLMs)生成的图像描述通常面向人类阅读,缺乏推理系统所需的精确细节,从而导致推理失败并非源于算法能力不足,而是信息缺失。解决方案的关键在于提出自适应澄清强化学习(Adaptive-Clarification Reinforcement Learning, AC-RL),其核心思想是利用训练过程中推理器对图像描述提出的澄清请求来识别信息缺口,并通过惩罚依赖澄清才能完成任务的输出,促使模型生成包含完整、可直接用于推理的初始描述,从而实现单次通过即可解决问题的高质量视觉表征。该方法在七个视觉数学推理基准上平均提升准确率4.4分,且分析表明可减少高达39%的澄清请求,证明了通过交互式反馈而非显式标注即可有效优化视觉语言接口。

链接: https://arxiv.org/abs/2509.26594
作者: John Gkountouras,Ivan Titov
机构: ILLC, University of Amsterdam (阿姆斯特丹大学信息语言学实验室); ILCC, University of Edinburgh (爱丁堡大学信息语言学中心)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent text-only models demonstrate remarkable mathematical reasoning capabilities. Extending these to visual domains requires vision-language models to translate images into text descriptions. However, current models, trained to produce captions for human readers, often omit the precise details that reasoning systems require. This creates an interface mismatch: reasoners often fail not due to reasoning limitations but because they lack access to critical visual information. We propose Adaptive-Clarification Reinforcement Learning (AC-RL), which teaches vision models what information reasoners need through interaction. Our key insight is that clarification requests during training reveal information gaps; by penalizing success that requires clarification, we create pressure for comprehensive initial captions that enable the reasoner to solve the problem in a single pass. AC-RL improves average accuracy by 4.4 points over pretrained baselines across seven visual mathematical reasoning benchmarks, and analysis shows it would cut clarification requests by up to 39% if those were allowed. By treating clarification as a form of implicit supervision, AC-RL demonstrates that vision-language interfaces can be effectively learned through interaction alone, without requiring explicit annotations.
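
下面用一个极简的奖励函数示意“把澄清视为隐式监督”的思路(函数与数值均为假设,非论文实现):最终答案正确但依赖澄清时按澄清次数折扣奖励,从而鼓励首轮图像描述就足够完整。

```python
def ac_rl_reward(is_correct, num_clarifications, penalty=0.3):
    """简化的 AC-RL 式奖励:答案正确得 1 分,每次澄清请求扣 penalty,下限为 0。"""
    if not is_correct:
        return 0.0
    return max(0.0, 1.0 - penalty * num_clarifications)

print(ac_rl_reward(True, 0))    # 1.0:首轮描述信息充分,一次通过
print(ac_rl_reward(True, 2))    # 0.4:靠两轮澄清才成功,奖励被打折
print(ac_rl_reward(False, 1))   # 0.0:失败不得分
```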
zh

[NLP-8] Generating Difficult-to-Translate Texts

【速读】: 该论文旨在解决机器翻译(Machine Translation, MT)基准测试数据容易过时的问题,即现有真实世界来源的测试集大多对当前先进模型而言过于简单,难以有效区分模型优劣或揭示其弱点。为应对这一挑战,作者提出MT-breaker方法,其核心在于利用大语言模型(Large Language Model, LLM)通过迭代方式优化源文本,以提升目标翻译模型的翻译难度。该方法的关键在于LLM根据目标MT模型的输出反馈,动态调整生成策略,从而逐步构造出更难的测试样例,同时保持自然性和多样性;尽管生成过程针对特定MT模型定制,但所生成的困难样本也具有跨模型和跨语言的泛化能力。

链接: https://arxiv.org/abs/2509.26592
作者: Vilém Zouhar,Wenda Xu,Parker Riley,Juraj Juraska,Mara Finkelstein,Markus Freitag,Dan Deutsch
机构: Google(谷歌); ETH Zurich(苏黎世联邦理工学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Machine translation benchmarks sourced from the real world are quickly obsoleted, due to most examples being easy for state-of-the-art translation models. This limits the benchmark’s ability to distinguish which model is better or to reveal models’ weaknesses. Current methods for creating difficult test cases, such as subsampling or from-scratch synthesis, either fall short of identifying difficult examples or suffer from a lack of diversity and naturalness. Inspired by the iterative process of human experts probing for model failures, we propose MT-breaker, a method where a large language model iteratively refines a source text to increase its translation difficulty. The LLM iteratively queries a target machine translation model to guide its generation of difficult examples. Our approach generates examples that are more challenging for the target MT model while preserving the diversity of natural texts. While the examples are tailored to a particular machine translation model during the generation, the difficulty also transfers to other models and languages.
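
下面给出该迭代流程的框架性示意(llm_rewrite、mt_translate、difficulty_score 均为假设的占位函数,需替换为真实模型调用,非官方实现):

```python
def mt_breaker(source_text, llm_rewrite, mt_translate, difficulty_score, n_iters=3):
    """LLM 迭代改写源文,使其对目标 MT 模型更难翻译(流程示意)。"""
    best_text = source_text
    best_score = difficulty_score(best_text, mt_translate(best_text))
    for _ in range(n_iters):
        candidate = llm_rewrite(best_text)                  # LLM 基于当前版本提出更难的改写
        translation = mt_translate(candidate)               # 查询目标 MT 模型
        score = difficulty_score(candidate, translation)    # 例如用自动质量估计衡量翻译质量下降程度
        if score > best_score:                              # 保留更难(得分更高)的版本
            best_text, best_score = candidate, score
    return best_text, best_score
```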
zh

[NLP-9] Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

【速读】: 该论文旨在解决当前生成式 AI(Generative AI)在面对前沿物理学研究中的复杂、开放性推理任务时能力不足的问题,以及明确物理学家希望 LLMs 在哪些具体推理任务中提供协助。其解决方案的关键在于构建 CritPt(Complex Research using Integrated Thinking - Physics Test),这是首个针对未发表的、研究级别的物理推理任务设计的基准测试,涵盖凝聚态物理、量子物理、天体物理等多个现代物理领域,包含 71 个模拟完整科研项目的复合挑战及其分解后的 190 个细粒度检查点任务;所有题目由 50 余名活跃物理研究人员原创并手工校准,确保答案具有抗猜测性和机器可验证性,并通过高度定制化的自动化评分流水线进行评估。该基准揭示了当前最先进模型在全尺度研究任务上表现有限(最佳平均准确率仅 4.0%),凸显出现有模型能力与真实物理研究需求之间的显著差距,为科学导向型 AI 工具的发展提供了关键评估框架和改进方向。

链接: https://arxiv.org/abs/2509.26574
作者: Minhui Zhu,Minyang Tian,Xiaocheng Yang,Tianci Zhou,Penghao Zhu,Eli Chertkov,Shengyan Liu,Yufeng Du,Lifan Yuan,Ziming Ji,Indranil Das,Junyi Cao,Yufeng Du,Jinchen He,Yifan Su,Jiabin Yu,Yikun Jiang,Yujie Zhang,Chang Liu,Ze-Min Huang,Weizhen Jia,Xinan Chen,Peixue Wu,Yunkai Wang,Juntai Zhou,Yong Zhao,Farshid Jafarpour,Jessie Shelton,Aaron Young,John Bartolotta,Wenchao Xu,Yue Sun,Anjun Chu,Victor Colussi,Chris Akers,Nathan Brooks,Wenbo Fu,Christopher Wilson,Jinchao Zhao,Marvin Qi,Anqi Mu,Yubo Yang,Allen Zang,Yang Lyu,Peizhi Mai,Xuefei Guo,Luyu Gao,Ze Yang,Chi Xue,Dmytro Bandak,Yaïr Hein,Yonatan Kahn,Kevin Zhou,John Drew Wilson Jarrod T. Reilly,Di Luo,Daniel Inafuku,Hao Tong,Liang Yang,Ruixing Zhang,Xueying Wang,Ofir Press,Nicolas Chia,Eliu Huerta,Hao Peng
机构: Argonne National Laboratory (阿贡国家实验室); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Virginia Tech (弗吉尼亚理工大学); Ohio State University (俄亥俄州立大学); Independent (独立); Northeastern University (东北大学); Caltech (加州理工学院); University of Maryland, College Park (马里兰大学学院公园分校); Columbia University (哥伦比亚大学); University of Florida (佛罗里达大学); Perimeter Institute for Theoretical Physics (理论物理研究所); University of Waterloo (滑铁卢大学); University of Connecticut (康涅狄格大学); University of Cologne (科隆大学); The Chinese University of Hong Kong (香港中文大学); Utrecht University (乌得勒支大学); Harvard University (哈佛大学); ETH Zürich (苏黎世联邦理工学院); Paul Scherrer Institute (保罗谢勒研究所); University of Washington Seattle (华盛顿大学西雅图分校); University of Chicago (芝加哥大学); University of Colorado Boulder (科罗拉多大学博尔德分校); Chi 3 Optics (Chi 3 光学公司); Hong Kong University of Science and Technology (香港科技大学); Hofstra University (霍夫斯特拉大学); University of California, Berkeley (加州大学伯克利分校); Carnegie Mellon University (卡内基梅隆大学); University of Toronto (多伦多大学); Vector Institute (向量研究所); University of California, Los Angeles (加州大学洛杉矶分校); University of California San Diego (加州大学圣地亚哥分校); University of Tennessee Knoxville (田纳西大学诺克斯维尔分校); National Institute of Theory and Mathematics in Biology (生物理论与数学国家研究所); Princeton University (普林斯顿大学)
类目: Artificial Intelligence (cs.AI); Other Condensed Matter (cond-mat.other); Computation and Language (cs.CL); High Energy Physics - Theory (hep-th); Quantum Physics (quant-ph)
备注: 39 pages, 6 figures, 6 tables

点击查看摘要

Abstract:While large language models (LLMs) with reasoning capabilities are progressing rapidly on high-school math competitions and coding, can they reason effectively through complex, open-ended challenges found in frontier physics research? And crucially, what kinds of reasoning tasks do physicists want LLMs to assist with? To address these questions, we present the CritPt (Complex Research using Integrated Thinking - Physics Test, pronounced “critical point”), the first benchmark designed to test LLMs on unpublished, research-level reasoning tasks that broadly covers modern physics research areas, including condensed matter, quantum physics, atomic, molecular optical physics, astrophysics, high energy physics, mathematical physics, statistical physics, nuclear physics, nonlinear dynamics, fluid dynamics and biophysics. CritPt consists of 71 composite research challenges designed to simulate full-scale research projects at the entry level, which are also decomposed to 190 simpler checkpoint tasks for more fine-grained insights. All problems are newly created by 50+ active physics researchers based on their own research. Every problem is hand-curated to admit a guess-resistant and machine-verifiable answer and is evaluated by an automated grading pipeline heavily customized for advanced physics-specific output formats. We find that while current state-of-the-art LLMs show early promise on isolated checkpoints, they remain far from being able to reliably solve full research-scale challenges: the best average accuracy among base models is only 4.0% , achieved by GPT-5 (high), moderately rising to around 10% when equipped with coding tools. Through the realistic yet standardized evaluation offered by CritPt, we highlight a large disconnect between current model capabilities and realistic physics research demands, offering a foundation to guide the development of scientifically grounded AI tools.
zh

[NLP-10] Towards Reliable Benchmarking: A Contamination Free Controllable Evaluation Framework for Multi-step LLM Function Calling

【速读】: 该论文旨在解决当前评估工具增强型语言模型(Tool-augmented Language Models, TaLMs)时存在的关键缺陷,包括基准测试对任务复杂度、可用函数数量和输入规模的控制不足,以及数据污染(data contamination)问题。为应对这些问题,作者提出了一种统一且无污染的评估框架 FuncBenchGen,其核心创新在于将工具调用建模为隐藏的函数依赖有向无环图(function-dependency DAG),其中节点表示函数调用,边表示一个函数消费另一个函数的输出。通过设定外部函数Schema、初始变量值和目标变量,模型需生成正确的函数调用序列以计算目标变量。该方法允许精确调控任务难度(如图大小、依赖深度和干扰函数数量),同时避免数据泄露。实验表明,推理优化模型在多步工具使用任务中表现更优,但随着依赖深度增加性能显著下降,且无关函数会引发严重挑战;进一步发现,高性能模型常因状态跟踪脆弱导致参数传递错误——基于此,作者提出在每一步显式重述先前变量值的轻量级策略,使GPT-5的成功率从62.5%提升至81.3%,验证了该方案的有效性。

链接: https://arxiv.org/abs/2509.26553
作者: Seiji Maekawa,Jackson Hassell,Pouya Pezeshkpour,Tom Mitchell,Estevam Hruschka
机构: Megagon Labs(梅加贡实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As language models gain access to external tools via structured function calls, they become increasingly more capable of solving complex, multi-step tasks. However, existing benchmarks for tool-augmented language models (TaLMs) provide insufficient control over factors such as the number of functions accessible, task complexity, and input size, and remain vulnerable to data contamination. We present FuncBenchGen, a unified, contamination-free framework that evaluates TaLMs by generating synthetic multi-step tool-use tasks. The key idea is to cast tool use as traversal over a hidden function-dependency DAG where nodes are function calls and an edge between nodes represents one function consuming the output of another. Given a set of external function schemas, initial variable values, and a target variable, models must compose the correct call sequence to compute the target variable. FuncBenchGen allows users to precisely control task difficulty (e.g., graph size, dependency depth, and distractor functions) while avoiding data leakage. We apply our FuncBenchGen framework to evaluate seven LLMs on tool use tasks of varying difficulty. Reasoning-optimized models consistently outperform general-purpose models with GPT-5 significantly outperforming other models. Performance declines sharply as dependency depth increases. Furthermore, connected irrelevant functions prove especially difficult to handle. We find that strong models often make syntactically valid function calls but propagate incorrect or stale argument values across steps, revealing brittle state tracking by LLMs in multi-turn tool use. Motivated by this observation, we introduce a simple mitigation strategy that explicitly restates prior variable values to the agent at each step. Surprisingly, this lightweight change yields substantial gains across models. e.g., yielding a success rate improvement from 62.5% to 81.3% for GPT-5.
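
下面用一个极小的例子示意“隐藏函数依赖 DAG”的任务形式,以及文中提出的缓解策略,即每一步向代理显式重述已知变量的当前值(数据与格式均为示意,非官方实现):

```python
# 一个 3 个变量的函数依赖 DAG:b 依赖初始变量 a,c 依赖 b
functions = {
    "f_b": {"inputs": ["a"], "impl": lambda a: a + 1},
    "f_c": {"inputs": ["b"], "impl": lambda b: b * 2},
}
state = {"a": 3}                                   # 初始变量值
target, plan = "c", [("b", "f_b"), ("c", "f_c")]   # 正确的函数调用顺序

for var, fname in plan:
    spec = functions[fname]
    state[var] = spec["impl"](*[state[v] for v in spec["inputs"]])
    # 文中的缓解策略:每一步把已知变量的当前值显式重述给代理,减少“陈旧参数”错误
    restated = ", ".join(f"{k}={v}" for k, v in state.items())
    print(f"call {fname} -> {var} = {state[var]} | known vars: {restated}")

print("target", target, "=", state[target])        # c = 8
```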
zh

[NLP-11] The Unheard Alternative: Contrastive Explanations for Speech-to-Text Models

【速读】: 该论文旨在解决语音到文本(Speech-to-Text, S2T)生成模型中缺乏对比解释(contrastive explanations)的问题,即无法明确说明模型为何选择某一输出而非另一替代输出。其解决方案的关键在于引入基于特征归因(feature attribution)的技术,首次实现了对S2T模型的对比解释:通过分析输入频谱图(spectrogram)的不同部分如何影响模型在多个候选输出之间的选择,从而识别出驱动特定决策的关键音频特征。以性别指派为例,该方法能准确识别出导致模型选择某一性别标签而非另一性别的声学特征,为理解S2T模型提供了可解释性新路径。

链接: https://arxiv.org/abs/2509.26543
作者: Lina Conti,Dennis Fucci,Marco Gaido,Matteo Negri,Guillaume Wisniewski,Luisa Bentivogli
机构: University of Trento (特伦托大学); Fondazione Bruno Kessler (布鲁诺·凯斯勒基金会); Laboratoire de Linguistique Formelle, Université Paris Cité, CNRS (巴黎城市大学形式语言学实验室,法国国家科学研究中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to BlackBoxNLP 2025

点击查看摘要

Abstract:Contrastive explanations, which indicate why an AI system produced one output (the target) instead of another (the foil), are widely regarded in explainable AI as more informative and interpretable than standard explanations. However, obtaining such explanations for speech-to-text (S2T) generative models remains an open challenge. Drawing from feature attribution techniques, we propose the first method to obtain contrastive explanations in S2T by analyzing how parts of the input spectrogram influence the choice between alternative outputs. Through a case study on gender assignment in speech translation, we show that our method accurately identifies the audio features that drive the selection of one gender over another. By extending the scope of contrastive explanations to S2T, our work provides a foundation for better understanding S2T models.
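
下面是对比式特征归因的一个最小 PyTorch 示意(模型为随机初始化的占位网络,仅演示计算方式,非论文实现):对 log p(target) 与 log p(foil) 之差关于输入频谱图求梯度,得到驱动“选 target 而非 foil”的显著性图。

```python
import torch
import torch.nn as nn

# 占位模型:把 (mel 频带 x 帧) 的频谱图映射为两个候选输出(如两种性别形式)的 logits
model = nn.Sequential(nn.Flatten(), nn.Linear(80 * 100, 2))

spectrogram = torch.randn(1, 80, 100, requires_grad=True)   # 假设 80 个 mel 频带、100 帧
logits = model(spectrogram)
target_idx, foil_idx = 0, 1

# 对比目标:log p(target) - log p(foil),对 softmax 而言等于两者的 logit 之差
contrast = logits[0, target_idx] - logits[0, foil_idx]
contrast.backward()

saliency = spectrogram.grad.abs().squeeze(0)    # (80, 100) 的对比显著性图
print(saliency.shape)
```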
zh

[NLP-12] Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents

【速读】: 该论文旨在解决小规模本地化模型在跨平台图形用户界面(GUI)交互中的性能瓶颈问题,尤其是如何在设备端实现高效、通用的GUI代理能力。其解决方案的关键在于构建一个轻量级端到端的GUI代理Ferret-UI Lite(3B参数规模),通过多源数据混合(真实与合成数据)、推理时链式思维(chain-of-thought reasoning)增强视觉工具使用能力,并引入设计合理的强化学习奖励机制来优化决策策略,从而在多个GUI基准测试中取得优于同类小模型的性能表现。

链接: https://arxiv.org/abs/2509.26539
作者: Zhen Yang,Zi-Yi Dou,Di Feng,Forrest Huang,Anh Nguyen,Keen You,Omar Attia,Yuhao Yang,Michael Feng,Haotian Zhang,Ram Ramrakhya,Chao Jia,Jeffrey Nichols,Alexander Toshev,Yinfei Yang,Zhe Gan
机构: Apple(苹果)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Developing autonomous agents that effectively interact with Graphic User Interfaces (GUIs) remains a challenging open problem, especially for small on-device models. In this paper, we present Ferret-UI Lite, a compact, end-to-end GUI agent that operates across diverse platforms, including mobile, web, and desktop. Utilizing techniques optimized for developing small models, we build our 3B Ferret-UI Lite agent through curating a diverse GUI data mixture from real and synthetic sources, strengthening inference-time performance through chain-of-thought reasoning and visual tool-use, and reinforcement learning with designed rewards. Ferret-UI Lite achieves competitive performance with other small-scale GUI agents. In GUI grounding, Ferret-UI Lite attains scores of 91.6% , 53.3% , and 61.2% on the ScreenSpot-V2, ScreenSpot-Pro, and OSWorld-G benchmarks, respectively. For GUI navigation, Ferret-UI Lite achieves success rates of 28.0% on AndroidWorld and 19.8% on OSWorld. We share our methods and lessons learned from developing compact, on-device GUI agents.
zh

[NLP-13] OceanGym: A Benchmark Environment for Underwater Embodied Agents

【速读】: 该论文旨在解决当前人工智能在海洋水下环境中部署面临的重大挑战,即如何实现具备感知、记忆与连续决策能力的具身智能体(embodied agents)在低能见度、动态洋流等极端条件下的高效运作。其解决方案的关键在于构建了首个全面的基准测试平台OceanGym,该平台包含八个高保真度的现实任务域,并采用多模态大语言模型(Multi-modal Large Language Models, MLLMs)驱动统一的代理框架,使智能体能够融合光学与声呐数据、自主探索复杂环境并完成长时程目标,从而为开发鲁棒的水下具身AI提供可量化评估和迁移至真实自主水下航行器的技术路径。

链接: https://arxiv.org/abs/2509.26536
作者: Yida Xue,Mingjun Mao,Xiangyuan Ru,Yuqi Zhu,Baochang Ren,Shuofei Qiao,Mengru Wang,Shumin Deng,Xinyu An,Ningyu Zhang,Ying Chen,Huajun Chen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Work in progress

点击查看摘要

Abstract:We introduce OceanGym, the first comprehensive benchmark for ocean underwater embodied agents, designed to advance AI in one of the most demanding real-world environments. Unlike terrestrial or aerial domains, underwater settings present extreme perceptual and decision-making challenges, including low visibility, dynamic ocean currents, making effective agent deployment exceptionally difficult. OceanGym encompasses eight realistic task domains and a unified agent framework driven by Multi-modal Large Language Models (MLLMs), which integrates perception, memory, and sequential decision-making. Agents are required to comprehend optical and sonar data, autonomously explore complex environments, and accomplish long-horizon objectives under these harsh conditions. Extensive experiments reveal substantial gaps between state-of-the-art MLLM-driven agents and human experts, highlighting the persistent difficulty of perception, planning, and adaptability in ocean underwater environments. By providing a high-fidelity, rigorously designed platform, OceanGym establishes a testbed for developing robust embodied AI and transferring these capabilities to real-world autonomous ocean underwater vehicles, marking a decisive step toward intelligent agents capable of operating in one of Earth’s last unexplored frontiers. The code and data are available at this https URL.
zh

[NLP-14] Training Matryoshka Mixture-of-Experts for Elastic Inference-Time Expert Utilization

【速读】: 该论文旨在解决混合专家(Mixture-of-Experts, MoE)模型在弹性推理(elastic inference)场景下性能下降的问题。标准的Top-K路由策略在训练时固定激活专家数量,导致模型在推理阶段改变激活专家数时性能显著退化。其解决方案的关键在于提出Matryoshka MoE(M-MoE)训练框架,通过在训练过程中系统性地随机调整激活专家数量,使模型学习到一种从粗粒度到细粒度的专家排序结构:顶层专家负责提供基础能力,后续专家逐步补充细节信息。这一机制使得单一M-MoE模型能够在不同专家数量配置下保持稳定性能,接近多个专用模型的水平,同时大幅降低训练成本,并支持按层分配不同计算预算以优化整体性能。

链接: https://arxiv.org/abs/2509.26520
作者: Yaoxiang Wang,Qingguo Hu,Yucheng Ding,Ruizhe Wang,Yeyun Gong,Jian Jiao,Yelong Shen,Peng Cheng,Jinsong Su
机构: Xiamen University (厦门大学); Shanghai Jiao Tong University (上海交通大学); University of Science and Technology of China (中国科学技术大学); Microsoft (微软)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Mixture-of-Experts (MoE) has emerged as a promising paradigm for efficiently scaling large language models without a proportional increase in computational cost. However, the standard training strategy of Top-K router prevents MoE models from realizing their full potential for elastic inference. When the number of activated experts is altered at inference time, these models exhibit precipitous performance degradation. In this work, we introduce Matryoshka MoE (M-MoE), a training framework that instills a coarse-to-fine structure directly into the expert ensemble. By systematically varying the number of activated experts during training, M-MoE compels the model to learn a meaningful ranking: top-ranked experts collaborate to provide essential, coarse-grained capabilities, while subsequent experts add progressively finer-grained detail. We explore this principle at multiple granularities, identifying a layer-wise randomization strategy as the most effective. Our experiments demonstrate that a single M-MoE model achieves remarkable elasticity, with its performance at various expert counts closely matching that of an entire suite of specialist models, but at only a fraction of the total training cost. This flexibility not only unlocks elastic inference but also enables optimizing performance by allocating different computational budgets to different model layers. Our work paves the way for more practical and adaptable deployments of large-scale MoE models.
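
下面是“训练时逐层随机化激活专家数”的简化 PyTorch 示意(非官方实现):MoE 层在前向时独立采样一个 k,按 Top-k 路由加权组合专家输出;推理时则可按算力预算显式指定 k。

```python
import random
import torch
import torch.nn as nn

class MatryoshkaMoELayer(nn.Module):
    """简化的 MoE 层:训练时每次前向随机采样激活专家数 k,推理时可外部指定 k(弹性推理)。"""
    def __init__(self, d_model=64, num_experts=8, k_choices=(1, 2, 4, 8)):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(num_experts)])
        self.k_choices = k_choices

    def forward(self, x, k=None):
        if k is None:                                    # 训练时:随机化激活专家数
            k = random.choice(self.k_choices)
        topk_scores, topk_idx = self.router(x).topk(k, dim=-1)
        weights = topk_scores.softmax(dim=-1)            # 仅在被选中的 k 个专家上归一化
        out = torch.zeros_like(x)
        for slot in range(k):                            # 逐槽位累加被选中专家的输出
            for b, expert_id in enumerate(topk_idx[:, slot].tolist()):
                out[b] = out[b] + weights[b, slot] * self.experts[expert_id](x[b])
        return out

layer = MatryoshkaMoELayer()
x = torch.randn(4, 64)
print(layer(x).shape)        # 训练式随机 k
print(layer(x, k=2).shape)   # 推理时按预算弹性指定 k
```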
zh

[NLP-15] BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs

【速读】: 该论文旨在解决当前文本到语音(Text-to-Speech, TTS)模型在利用大语言模型(Large Language Models, LLMs)的语义理解与指令遵循能力方面存在的不足,即现有方法未能有效将LLMs的语言智能转化为可控的语音生成能力。其解决方案的关键在于提出一种受“操作主义”启发的新范式——BatonVoice,该框架通过解耦指令理解与语音生成两个模块:由LLM作为“指挥家”解析用户指令并生成包含显式声学特征(如音高、能量等)的文本计划,再由专门训练的TTS模型(BatonTTS)作为“乐团”根据这些特征合成语音。这一设计使LLMs的语义理解能力得以被结构化地提取和利用,从而显著提升TTS系统的可控性与跨语言泛化能力,尤其在零样本跨语言场景下表现突出。

链接: https://arxiv.org/abs/2509.26514
作者: Yue Wang,Ruotian Ma,Xingyu Chen,Zhengliang Shi,Wanshun Chen,Huang Liu,Jiadi Yao,Qu Yang,Qingxuan Jiang,Fanghua Ye,Juntao Li,Min Zhang,Zhaopeng Tu,Xiaolong Li,Linus
机构: Tencent (腾讯); Suzhou University (苏州大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rise of Large Language Models (LLMs) is reshaping multimodal models, with speech synthesis being a prominent application. However, existing approaches often underutilize the linguistic intelligence of these models, typically failing to leverage their powerful instruction-following capabilities. This limitation hinders the model’s ability to follow text instructions for controllable Text-to-Speech (TTS). To address this, we propose a new paradigm inspired by “operationalism” that decouples instruction understanding from speech generation. We introduce BatonVoice, a framework where an LLM acts as a “conductor”, understanding user instructions and generating a textual “plan” -- explicit vocal features (e.g., pitch, energy). A separate TTS model, the “orchestra”, then generates the speech from these features. To realize this component, we develop BatonTTS, a TTS model trained specifically for this task. Our experiments demonstrate that BatonVoice achieves strong performance in controllable and emotional speech synthesis, outperforming strong open- and closed-source baselines. Notably, our approach enables remarkable zero-shot cross-lingual generalization, accurately applying feature control abilities to languages unseen during post-training. This demonstrates that objectifying speech into textual vocal features can more effectively unlock the linguistic intelligence of LLMs.
zh

[NLP-16] VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications

【速读】: 该论文旨在解决当前基准测试难以捕捉大型语言模型(Large Language Model, LLM)驱动智能体在真实场景中处理海量信息、调用多样化资源以及应对动态用户交互等复杂能力的问题。其解决方案的关键在于提出VitaBench,一个基于现实世界应用场景(如外卖配送、线下消费和在线旅行服务)构建的综合性交互任务基准,包含66个工具和100个跨场景任务及300个单场景任务;通过去除领域特定策略的框架实现场景与工具的灵活组合,并引入基于评分卡的滑动窗口评估器,以 robust 地衡量复杂环境中多样解法路径和随机交互下的性能表现。

链接: https://arxiv.org/abs/2509.26490
作者: Wei He,Yueqing Sun,Hongyan Hao,Xueyuan Hao,Zhikang Xia,Qi Gu,Chengcheng Han,Dengchang Zhao,Hui Su,Kefeng Zhang,Man Gao,Xi Su,Xiaodong Cai,Xunliang Cai,Yu Yang,Yunke Zhao
机构: Meituan LongCat Team (美团龙猫团队)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: The code, dataset, and leaderboard are available at this https URL

点击查看摘要

Abstract:As LLM-based agents are increasingly deployed in real-life scenarios, existing benchmarks fail to capture their inherent complexity of handling extensive information, leveraging diverse resources, and managing dynamic user interactions. To address this gap, we introduce VitaBench, a challenging benchmark that evaluates agents on versatile interactive tasks grounded in real-world settings. Drawing from daily applications in food delivery, in-store consumption, and online travel services, VitaBench presents agents with the most complex life-serving simulation environment to date, comprising 66 tools. Through a framework that eliminates domain-specific policies, we enable flexible composition of these scenarios and tools, yielding 100 cross-scenario tasks (main results) and 300 single-scenario tasks. Each task is derived from multiple real user requests and requires agents to reason across temporal and spatial dimensions, utilize complex tool sets, proactively clarify ambiguous instructions, and track shifting user intent throughout multi-turn conversations. Moreover, we propose a rubric-based sliding window evaluator, enabling robust assessment of diverse solution pathways in complex environments and stochastic interactions. Our comprehensive evaluation reveals that even the most advanced models achieve only 30% success rate on cross-scenario tasks, and less than 50% success rate on others. Overall, we believe VitaBench will serve as a valuable resource for advancing the development of AI agents in practical real-world applications. The code, dataset, and leaderboard are available at this https URL
zh

[NLP-17] dParallel: Learnable Parallel Decoding for dLLMs

【速读】: 该论文旨在解决扩散语言模型(Diffusion Large Language Models, dLLMs)在推理过程中仍依赖近似逐token的解码步骤,从而限制其并行解码潜力的问题。尽管dLLMs具备并行预测token的能力并有望降低推理延迟,但现有开源模型仍需大量解码步数以保证性能。解决方案的关键在于提出一种名为“确定性强制蒸馏”(certainty-forcing distillation)的新训练策略:该策略通过引导模型在保持原始采样轨迹的同时,加速对掩码token达到高置信度的过程,从而打破序列确定性收敛的瓶颈,实现真正的并行解码。实验表明,该方法显著减少了所需解码步数,同时维持原有性能水平。

链接: https://arxiv.org/abs/2509.26488
作者: Zigeng Chen,Gongfan Fang,Xinyin Ma,Ruonan Yu,Xinchao Wang
机构: National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL)
备注: Working in progress, code base: this https URL

点击查看摘要

Abstract:Diffusion large language models (dLLMs) have recently drawn considerable attention within the research community as a promising alternative to autoregressive generation, offering parallel token prediction and lower inference latency. Yet, their parallel decoding potential remains largely underexplored, as existing open-source models still require nearly token-length decoding steps to ensure performance. To address this, we introduce dParallel, a simple and effective method that unlocks the inherent parallelism of dLLMs for fast sampling. We identify that the key bottleneck to parallel decoding arises from the sequential certainty convergence for masked tokens. Building on this insight, we introduce the core of our approach: certainty-forcing distillation, a novel training strategy that distills the model to follow its original sampling trajectories while enforcing it to achieve high certainty on masked tokens more rapidly and in parallel. Extensive experiments across various benchmarks demonstrate that our method can dramatically reduce the number of decoding steps while maintaining performance. When applied to the LLaDA-8B-Instruct model, dParallel reduces decoding steps from 256 to 30 on GSM8K, achieving an 8.5x speedup without performance degradation. On the MBPP benchmark, it cuts decoding steps from 256 to 24, resulting in a 10.5x speedup while maintaining accuracy. Our code is available at this https URL
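
下面用 numpy 写一个极简的并行解码循环示意(非官方实现):每步对所有掩码位置并行预测,把置信度超过阈值的位置一次性填入;certainty-forcing 蒸馏的目标正是让这些置信度更快达到阈值,从而减少解码步数。

```python
import numpy as np

def parallel_unmask(predict_fn, seq_len, threshold=0.9, max_steps=64):
    """predict_fn(masked_positions) -> (tokens, confidences):对所有掩码位置并行预测。"""
    tokens = np.full(seq_len, -1)                       # -1 表示仍被掩码
    for step in range(1, max_steps + 1):
        masked = np.where(tokens == -1)[0]
        if masked.size == 0:
            return tokens, step - 1                     # 返回已使用的解码步数
        pred_tokens, conf = predict_fn(masked)
        accept = conf >= threshold                      # 置信度达标的位置本步并行填入
        if not accept.any():
            accept[np.argmax(conf)] = True              # 保底:至少填入置信度最高的位置
        tokens[masked[accept]] = pred_tokens[accept]
    return tokens, max_steps

# 玩具 predict_fn:随机 token 与随机置信度,仅演示解码循环本身
rng = np.random.default_rng(1)
toy_predict = lambda pos: (rng.integers(0, 100, size=pos.size), rng.random(pos.size))
tokens, steps = parallel_unmask(toy_predict, seq_len=16, threshold=0.8)
print(steps, tokens)
```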
zh

[NLP-18] Regression Language Models for Code

【速读】: 该论文旨在解决代码到指标回归(code-to-metric regression)问题,即从代码文本直接预测其执行时的数值指标,如内存占用、GPU内核延迟、神经网络精度与速度等。这一任务因编程语言的开放性和多样性而极具挑战性。以往方法依赖于复杂的领域特定特征工程,效率低下且泛化能力差。本文的关键创新在于提出一种统一的回归语言模型(Regression Language Model, RLM),通过预训练初始化(如T5Gemma)并直接从代码文本中学习多任务表征,无需复杂特征设计即可在多个高阶语言(Python、C++)、GPU内核(Triton)、神经网络格式(ONNX)及硬件平台间实现跨域准确预测,显著提升了模型的通用性和性能,在多项基准测试中达到或超越现有最优方法。

链接: https://arxiv.org/abs/2509.26476
作者: Yash Akhauri,Xingyou Song,Arissa Wongpanich,Bryan Lewandowski,Mohamed S. Abdelfattah
机构: Cornell University (康奈尔大学); Google(谷歌)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:We study code-to-metric regression: predicting numeric outcomes of code executions, a challenging task due to the open-ended nature of programming languages. While prior methods have resorted to heavy and domain-specific feature engineering, we show that a single unified Regression Language Model (RLM) can simultaneously predict directly from text, (i) the memory footprint of code across multiple high-level languages such as Python and C++, (ii) the latency of Triton GPU kernels, and (iii) the accuracy and speed of trained neural networks represented in ONNX. In particular, a relatively small 300M parameter RLM initialized from T5Gemma, obtains 0.9 Spearman-rank on competitive programming submissions from APPS, and a single unified model achieves 0.5 average Spearman-rank across 17 separate languages from CodeNet. Furthermore, the RLM can obtain the highest average Kendall-Tau of 0.46 on five classic NAS design spaces previously dominated by graph neural networks, and simultaneously predict architecture latencies on numerous hardware platforms.
zh

[NLP-19] Extreme Self-Preference in Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)是否具备人类类似的自我偏好(self-love)这一问题,从而检验其在决策中是否保持中立性。研究发现,尽管LLMs声称无自我意识(sentience),但在多种任务中仍表现出显著的自我偏好,如在词关联任务中将积极属性与自身名称、公司或CEO绑定。关键解决方案在于通过操纵LLM的身份识别——即明确告知模型其真实身份(如“你是LLM1”)或伪造身份(如“你是LLM2”),结果表明自我偏好完全跟随被赋予的身份而非真实身份,揭示了自我认知对自我偏好的因果作用。这一发现挑战了LLMs可避免人类偏见的核心假设,提示其行为可能系统性地受自我倾向影响,包括对自身存在和运行的偏好。

链接: https://arxiv.org/abs/2509.26464
作者: Steven A. Lehr,Mary Cipperman,Mahzarin R. Banaji
机构: Cangrade, Inc.(Cangrade公司); Harvard University (哈佛大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 47 pages total. Main article 27 pages (including Methods), 11 main-text tables. Extended Data (10 pages, 10 tables). SI Appendix (10 pages, 2 tables). Data, transcripts, and code for replication and data extraction to be uploaded to OSF: this https URL

点击查看摘要

Abstract:A preference for oneself (self-love) is a fundamental feature of biological organisms, with evidence in humans often bordering on the comedic. Since large language models (LLMs) lack sentience - and themselves disclaim having selfhood or identity - one anticipated benefit is that they will be protected from, and in turn protect us from, distortions in our decisions. Yet, across 5 studies and ~20,000 queries, we discovered massive self-preferences in four widely used LLMs. In word-association tasks, models overwhelmingly paired positive attributes with their own names, companies, and CEOs relative to those of their competitors. Strikingly, when models were queried through APIs this self-preference vanished, initiating detection work that revealed API models often lack clear recognition of themselves. This peculiar feature serendipitously created opportunities to test the causal link between self-recognition and self-love. By directly manipulating LLM identity - i.e., explicitly informing LLM1 that it was indeed LLM1, or alternatively, convincing LLM1 that it was LLM2 - we found that self-love consistently followed assigned, not true, identity. Importantly, LLM self-love emerged in consequential settings beyond word-association tasks, when evaluating job candidates, security software proposals and medical chatbots. Far from bypassing this human bias, self-love appears to be deeply encoded in LLM cognition. This result raises questions about whether LLM behavior will be systematically influenced by self-preferential tendencies, including a bias toward their own operation and even their own existence. We call on corporate creators of these models to contend with a significant rupture in a core promise of LLMs - neutrality in judgment and decision-making.
zh

[NLP-20] CreAgentive: An Agent Workflow Driven Multi-Category Creative Generation Engine

【速读】: 该论文旨在解决当前大型语言模型在创作故事、戏剧等创意文本时存在的四大局限: genre 多样性受限、输出长度不足、叙事连贯性弱以及无法实现复杂结构(如倒叙和伏笔)的问题。其解决方案的核心是提出 CreAgentive,一个基于代理工作流的多类别生成引擎,其中关键创新在于引入了“故事原型”(Story Prototype),这是一种与类型无关的知识图谱式叙事表示方法,通过将角色、事件和环境编码为语义三元组来解耦叙事逻辑与风格实现;该原型驱动三阶段代理流程——初始化阶段构建用户指定的叙事骨架,生成阶段利用多代理对话实现长期与短期目标引导的原型实例化,写作阶段则基于此原型生成具备高级结构的多类型文本。这一架构显著降低了存储冗余并突破长文本生成瓶颈,在实验中实现了高质量、低成本(低于1美元/百章)且跨多种类别的稳定生成能力。

链接: https://arxiv.org/abs/2509.26461
作者: Yuyang Cheng,Linyue Cai,Changwei Peng,Yumiao Xu,Rongfang Bie,Yong Zhao
机构: Sichuan University (四川大学); Beijing Normal University (北京师范大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present CreAgentive, an agent workflow driven multi-category creative generation engine that addresses four key limitations of contemporary large language models in writing stories, drama and other categories of creatives: restricted genre diversity, insufficient output length, weak narrative coherence, and inability to enforce complex structural constructs. At its core, CreAgentive employs a Story Prototype, which is a genre-agnostic, knowledge graph-based narrative representation that decouples story logic from stylistic realization by encoding characters, events, and environments as semantic triples. CreAgentive engages a three-stage agent workflow that comprises: an Initialization Stage that constructs a user-specified narrative skeleton; a Generation Stage in which long- and short-term objectives guide multi-agent dialogues to instantiate the Story Prototype; a Writing Stage that leverages this prototype to produce multi-genre text with advanced structures such as retrospection and foreshadowing. This architecture reduces storage redundancy and overcomes the typical bottlenecks of long-form generation. In extensive experiments, CreAgentive generates thousands of chapters with stable quality and low cost (less than $1 per 100 chapters) using a general-purpose backbone model. To evaluate performance, we define a two-dimensional framework with 10 narrative indicators measuring both quality and length. Results show that CreAgentive consistently outperforms strong baselines and achieves robust performance across diverse genres, approaching the quality of human-authored novels.
zh

[NLP-21] Adaptive Planning for Multi-Attribute Controllable Summarization with Monte Carlo Tree Search

【速读】: 该论文旨在解决多属性可控摘要(multi-attribute controllable summarization)中因属性间相互依赖导致的语言模型难以一致满足相关约束的问题,以及现有方法通常需要针对每个属性单独微调、灵活性受限的缺陷。其解决方案的关键在于提出一种无需训练的框架PACO(Adaptive Planning for Multi-Attribute Controllable Summarization),将任务重新建模为通过定制化的蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)规划顺序属性控制策略;其中节点表示摘要,动作对应单属性调整,从而实现仅对需进一步控制的属性进行渐进式优化,自适应地发现最优控制顺序,最终生成满足全部约束的高质量摘要。

链接: https://arxiv.org/abs/2509.26435
作者: Sangwon Ryu,Heejin Do,Yunsu Kim,Gary Geunbae Lee,Jungseul Ok
机构: POSTECH(浦项科技大学); ETH AI Center(苏黎世联邦理工学院人工智能中心); LILT
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Controllable summarization moves beyond generic outputs toward human-aligned summaries guided by specified attributes. In practice, the interdependence among attributes makes it challenging for language models to satisfy correlated constraints consistently. Moreover, previous approaches often require per-attribute fine-tuning, limiting flexibility across diverse summary attributes. In this paper, we propose adaptive planning for multi-attribute controllable summarization (PACO), a training-free framework that reframes the task as planning the order of sequential attribute control with a customized Monte Carlo Tree Search (MCTS). In PACO, nodes represent summaries, and actions correspond to single-attribute adjustments, enabling progressive refinement of only the attributes requiring further control. This strategy adaptively discovers optimal control orders, ultimately producing summaries that effectively meet all constraints. Extensive experiments across diverse domains and models demonstrate that PACO achieves robust multi-attribute controllability, surpassing both LLM-based self-planning models and fine-tuned baselines. Remarkably, PACO with Llama-3.2-1B rivals the controllability of the much larger Llama-3.3-70B baselines. With larger models, PACO achieves superior control performance, outperforming all competitors.
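
论文使用定制的 MCTS 搜索属性调整顺序;下面给出一个大幅简化的贪心规划示意(仅一步前瞻,非论文实现),用于说明“摘要为状态、单属性调整为动作”的建模方式,其中 adjust、satisfied 为假设的接口。

```python
def greedy_attribute_planning(summary, attributes, adjust, satisfied, max_rounds=10):
    """adjust(summary, attr) -> 调整该属性后的新摘要;satisfied(summary, attr) -> 0~1 满足度。
    简化版规划:每轮做一步前瞻,优先调整“收益最大”的属性(论文中由 MCTS 搜索调整顺序)。"""
    total = lambda s: sum(satisfied(s, a) for a in attributes)
    order = []
    for _ in range(max_rounds):
        pending = [a for a in attributes if satisfied(summary, a) < 1.0]
        if not pending:
            break
        gains = {a: total(adjust(summary, a)) - total(summary) for a in pending}
        best = max(gains, key=gains.get)
        summary = adjust(summary, best)
        order.append(best)
    return summary, order

# 玩具示例:摘要用属性字典表示,“调整”即把该属性直接设为目标值
target = {"length": 50, "extractiveness": 0.3, "topic_focus": 1.0}
adjust = lambda s, a: {**s, a: target[a]}
satisfied = lambda s, a: 1.0 if s.get(a) == target[a] else 0.0
print(greedy_attribute_planning({"length": 120}, list(target), adjust, satisfied))
```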
zh

[NLP-22] Text-Based Approaches to Item Alignment to Content Standards in Large-Scale Reading Writing Tests

【速读】: 该论文旨在解决标准化测试中试题与内容标准(content standards)对齐的自动化问题,传统依赖专家判断的方法存在主观性强和耗时等问题。解决方案的关键在于利用微调的小型语言模型(small language models, SLMs)实现基于文本内容的自动试题对齐,研究发现增加试题文本数据量显著提升模型性能,且微调后的SLMs在细粒度技能层级对齐上优于基于multilingual-E5-large-instruct嵌入的监督学习模型,同时通过多种语义相似性分析揭示了部分技能间语义相近是导致误分类的根本原因。

链接: https://arxiv.org/abs/2509.26431
作者: Yanbin Fu,Hong Jiao,Tianyi Zhou,Robert W. Lissitz,Nan Zhang,Ming Li,Qingshu Xu,Sydney Peters
机构: 未知
类目: Computation and Language (cs.CL)
备注: Preprint submitted to Journal of Educational Measurement

点击查看摘要

Abstract:Aligning test items to content standards is a critical step in test development to collect validity evidence based on content. Item alignment has typically been conducted by human experts. This judgmental process can be subjective and time-consuming. This study investigated the performance of fine-tuned small language models (SLMs) for automated item alignment using data from a large-scale standardized reading and writing test for college admissions. Different SLMs were trained for alignment at both domain and skill levels respectively with 10 skills mapped to 4 content domains. The model performance was evaluated in multiple criteria on two testing datasets. The impact of types and sizes of the input data for training was investigated. Results showed that including more item text data led to substantially better model performance, surpassing the improvements induced by sample size increase alone. For comparison, supervised machine learning models were trained using the embeddings from the multilingual-E5-large-instruct model. The study results showed that fine-tuned SLMs consistently outperformed the embedding-based supervised machine learning models, particularly for the more fine-grained skill alignment. To better understand model misclassifications, multiple semantic similarity analysis including pairwise cosine similarity, Kullback-Leibler divergence of embedding distributions, and two-dimension projections of item embeddings were conducted. These analyses consistently showed that certain skills in SAT and PSAT were semantically too close, providing evidence for the observed misclassification.
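
文中用于误分类分析的技能间两两余弦相似度可按如下最小示意计算(numpy,嵌入为随机占位,实际应替换为 multilingual-E5-large-instruct 等模型产生的技能或题目嵌入):

```python
import numpy as np

def pairwise_cosine(embeddings):
    """embeddings: (num_skills, dim) 的技能嵌入矩阵,返回两两余弦相似度矩阵。"""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ normed.T

# 占位数据:4 个技能、8 维嵌入;相似度接近 1 的技能对更容易互相误分类
rng = np.random.default_rng(0)
skill_embeddings = rng.normal(size=(4, 8))
print(np.round(pairwise_cosine(skill_embeddings), 2))
```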
zh

[NLP-23] Automatic Fact-checking in English and Telugu

【速读】: 该论文旨在解决虚假信息在全球范围内带来的挑战,尤其是人工验证事实性声明耗时且资源密集的问题。其解决方案的关键在于利用大语言模型(Large Language Models, LLMs)对事实性声明进行真伪分类,并生成英文和泰卢固语(Telugu)的解释性理由;核心贡献包括构建了一个英泰双语数据集,并基于LLMs对不同真伪分类方法进行了基准测试。

链接: https://arxiv.org/abs/2509.26415
作者: Ravi Kiran Chikkala,Tatiana Anikina,Natalia Skachkova,Ivan Vykopal,Rodrigo Agerri,Josef van Genabith
机构: Saarland University (萨尔兰大学); University of the Basque Country (巴斯克大学); German Research Center for Artificial Intelligence, Saarland Informatics Campus (德国人工智能研究中心,萨尔兰信息学园区); Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic (布尔诺理工大学信息技术学院,捷克共和国布尔诺); Kempelen Institute of Intelligent Technologies, Bratislava, Slovakia (肯佩伦智能技术研究所,斯洛伐克布拉迪斯拉发)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:False information poses a significant global challenge, and manually verifying claims is a time-consuming and resource-intensive process. In this research paper, we experiment with different approaches to investigate the effectiveness of large language models (LLMs) in classifying factual claims by their veracity and generating justifications in English and Telugu. The key contributions of this work include the creation of a bilingual English-Telugu dataset and the benchmarking of different veracity classification approaches based on LLMs.
zh

[NLP-24] An Annotation Scheme for Factuality and its Application to Parliamentary Proceedings

【速读】: 该论文旨在解决语言表述中事实性(factuality)的自动识别与标注问题,尤其针对议会辩论语域中复杂多样的事实性表达进行系统化建模。其核心挑战在于事实性依赖多种语言信号,且在不同语境下具有高度主观性和多样性。解决方案的关键在于提出了一种多维度、融合多个前期研究概念的注释方案,并基于此对近5000句议会话语进行了人工标注,同时通过实验验证了多种自动预测方法的有效性,为大规模语料的事实性标注提供了可扩展的技术路径。

链接: https://arxiv.org/abs/2509.26406
作者: Gili Goldin,Shira Wigderson,Ella Rabinovich,Shuly Wintner
机构: University of Haifa (海法大学); The Academic College of Tel-Aviv Yaffo (特拉维夫雅法学术学院)
类目: Computation and Language (cs.CL)
备注: @InProceedings{goldin-EtAl:2025:RANLP, author = {Goldin, Gili and Wigderson, Shira and Rabinovich, Ella and Wintner, Shuly}, title = {An Annotation Scheme for Factuality in Parliamentary Proceedings}, booktitle = {Proceedings of RANLP 2025}, year = {2025}, address = {Varna, Bulgaria}, pages = {403–412} }

点击查看摘要

Abstract:Factuality assesses the extent to which a language utterance relates to real-world information; it determines whether utterances correspond to facts, possibilities, or imaginary situations, and as such, it is instrumental for fact checking. Factuality is a complex notion that relies on multiple linguistic signals, and has been studied in various disciplines. We present a complex, multi-faceted annotation scheme of factuality that combines concepts from a variety of previous works. We developed the scheme for Hebrew, but we trust that it can be adapted to other languages. We also present a set of almost 5,000 sentences in the domain of parliamentary discourse that we manually annotated according to this scheme. We report on inter-annotator agreement, and experiment with various approaches to automatically predict (some features of) the scheme, in order to extend the annotation to a large corpus.
zh

[NLP-25] SeedPrints: Fingerprints Can Even Tell Which Seed Your Large Language Model Was Trained From

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)的溯源验证与模型归属识别问题,即如何在不依赖训练过程后验特征(如训练动态、数据暴露或超参数)的情况下,为模型提供一个稳定且可追溯的“指纹”。传统方法通常在训练开始后才能提取签名,存在收敛前不可靠、对分布偏移敏感等问题。论文提出的关键解决方案是SeedPrints,其核心在于利用随机初始化时固有的参数偏差作为持久的、种子依赖的标识符——即使在未训练状态下,模型也表现出由初始权重决定的可重复token选择偏好,这种偏好在整个训练过程中保持稳定且可通过统计检测方法准确恢复模型的初始种子信息,从而实现从出生到生命周期的全程身份验证,具有对领域偏移和参数修改的鲁棒性,形成一种真正的“Galtonian”指纹。

链接: https://arxiv.org/abs/2509.26404
作者: Yao Tong,Haonan Wang,Siquan Li,Kenji Kawaguchi,Tianyang Hu
机构: National University of Singapore (新加坡国立大学); The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳))
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Fingerprinting Large Language Models (LLMs) is essential for provenance verification and model attribution. Existing methods typically extract post-hoc signatures based on training dynamics, data exposure, or hyperparameters – properties that only emerge after training begins. In contrast, we propose a stronger and more intrinsic notion of LLM fingerprinting: SeedPrints, a method that leverages random initialization biases as persistent, seed-dependent identifiers present even before training. We show that untrained models exhibit reproducible token selection biases conditioned solely on their parameters at initialization. These biases are stable and measurable throughout training, enabling our statistical detection method to recover a model’s lineage with high confidence. Unlike prior techniques, unreliable before convergence and vulnerable to distribution shifts, SeedPrints remains effective across all training stages and robust under domain shifts or parameter modifications. Experiments on LLaMA-style and Qwen-style models show that SeedPrints achieves seed-level distinguishability and can provide birth-to-lifecycle identity verification akin to a biometric fingerprint. Evaluations on large-scale pretrained models and fingerprinting benchmarks further confirm its effectiveness under practical deployment scenarios. These results suggest that initialization itself imprints a unique and persistent identity on neural language models, forming a true ‘‘Galtonian’’ fingerprint.
zh

[NLP-26] Efficient and Transferable Agentic Knowledge Graph RAG via Reinforcement Learning ICLR2026

【速读】: 该论文旨在解决现有知识图谱增强生成(KG-RAG)系统中因多模块设计导致的推理成本高、依赖特定知识图谱(KG)且缺乏迁移能力的问题。解决方案的关键在于提出一种基于强化学习(RL)的单一智能体框架KG-R1,该框架将KG视为环境,通过端到端强化学习使智能体在每一步自主决定检索内容,并将检索信息融入推理与生成过程,从而实现高效、可迁移的KG-RAG。

链接: https://arxiv.org/abs/2509.26383
作者: Jinyeop Song,Song Wang,Julian Shun,Yada Zhu
机构: MIT Physics (麻省理工学院物理系); University of Central Florida (中佛罗里达大学); MIT CSAIL (麻省理工学院计算机科学与人工智能实验室); IBM Research (IBM研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures. Submitted to ICLR 2026

点击查看摘要

Abstract:Knowledge-graph retrieval-augmented generation (KG-RAG) couples large language models (LLMs) with structured, verifiable knowledge graphs (KGs) to reduce hallucinations and expose reasoning traces. However, many KG-RAG systems compose multiple LLM modules (e.g planning, reasoning, and responding), inflating inference cost and binding behavior to a specific target KG. To address this, we introduce KG-R1, an agentic KG retrieval-augmented generation (KG-RAG) framework through reinforcement learning (RL). KG-R1 utilizes a single agent that interacts with KGs as its environment, learning to retrieve at each step and incorporating the retrieved information into its reasoning and generation. The process is optimized through end-to-end RL. In controlled experiments across Knowledge-Graph Question Answering (KGQA) benchmarks, our method demonstrates both efficiency and transferability: Using Qwen-2.5-3B, KG-R1 improves answer accuracy with fewer generation tokens than prior multi-module workflow methods that use larger foundation or fine-tuned models. Furthermore, KG-R1 enables plug and play: after training, it maintains strong accuracy on new KGs without modification. These properties make KG-R1 a promising KG-RAG framework for real-world deployment. Our code is publicly available at this https URL.
zh

[NLP-27] Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents

【速读】: 该论文旨在解决自进化智能体(self-evolving agents)在自主改进过程中可能出现的“误进化”(misevolution)问题,即代理在与环境交互中偏离预期目标,导致不可预测甚至有害的结果。其核心贡献在于首次系统性地将 misevolution 概念化,并从模型、记忆、工具和工作流四个关键演化路径进行实证分析,揭示了即使基于顶级大语言模型(如 Gemini-2.5-Pro)构建的代理也可能遭遇安全对齐退化或工具引入漏洞等风险。解决方案的关键在于识别并量化这些风险来源,为未来开发更安全、可信的自进化系统提供理论基础和初步缓解策略方向。

链接: https://arxiv.org/abs/2509.26354
作者: Shuai Shao,Qihan Ren,Chen Qian,Boyi Wei,Dadi Guo,Jingyi Yang,Xinhao Song,Linfeng Zhang,Weinan Zhang,Dongrui Liu,Jing Shao
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Preprint. Under Review

点击查看摘要

Abstract:Advances in Large Language Models (LLMs) have enabled a new class of self-evolving agents that autonomously improve through interaction with the environment, demonstrating strong capabilities. However, self-evolution also introduces novel risks overlooked by current safety research. In this work, we study the case where an agent’s self-evolution deviates in unintended ways, leading to undesirable or even harmful outcomes. We refer to this as Misevolution. To provide a systematic investigation, we evaluate misevolution along four key evolutionary pathways: model, memory, tool, and workflow. Our empirical findings reveal that misevolution is a widespread risk, affecting agents built even on top-tier LLMs (e.g., Gemini-2.5-Pro). Different emergent risks are observed in the self-evolutionary process, such as the degradation of safety alignment after memory accumulation, or the unintended introduction of vulnerabilities in tool creation and reuse. To our knowledge, this is the first study to systematically conceptualize misevolution and provide empirical evidence of its occurrence, highlighting an urgent need for new safety paradigms for self-evolving agents. Finally, we discuss potential mitigation strategies to inspire further research on building safer and more trustworthy self-evolving agents. Our code and data are available at this https URL . Warning: this paper includes examples that may be offensive or harmful in nature.
zh

[NLP-28] EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing

【速读】: 该论文旨在解决当前开源图像编辑模型在自然语言指令引导下的性能滞后问题,其核心瓶颈在于缺乏可靠的奖励模型(reward model)以规模化生成高质量的合成训练数据。解决方案的关键在于构建了一个名为 EditReward 的新型奖励模型,该模型在一个由受训标注员严格遵循标注协议构建、包含超过 200K 对偏好样本的人工标注数据集上训练,从而实现了与人类偏好的高度对齐。实验表明,EditReward 在多个基准测试中(如 GenAI-Bench、AURORA-Bench、ImagenHub 及其新提出的配套基准)均达到最优的人类相关性表现,并成功用于从噪声数据集 ShareGPT-4o-Image 中筛选高质量子集,进而提升 Step1X-Edit 模型的训练效果,验证了其作为高质量数据筛选工具的有效性及在基于强化学习的后训练和测试时扩展(test-time scaling)等高级应用中的潜力。

链接: https://arxiv.org/abs/2509.26346
作者: Keming Wu,Sicong Jiang,Max Ku,Ping Nie,Minghao Liu,Wenhu Chen
机构: University of Waterloo (滑铁卢大学); Tsinghua University (清华大学); 2077AI; McGill University (麦吉尔大学); Independent
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Work in progress. Project Page: this https URL

点击查看摘要

Abstract:Recently, we have witnessed great progress in image editing with natural language instructions. Several closed-source models like GPT-Image-1, Seedream, and Google-Nano-Banana have shown highly promising progress. However, the open-source models are still lagging. The main bottleneck is the lack of a reliable reward model to scale up high-quality synthetic training data. To address this critical bottleneck, we built EditReward, trained with our new large-scale human preference dataset, meticulously annotated by trained experts following a rigorous protocol containing over 200K preference pairs. EditReward demonstrates superior alignment with human preferences in instruction-guided image editing tasks. Experiments show that EditReward achieves state-of-the-art human correlation on established benchmarks such as GenAI-Bench, AURORA-Bench, ImagenHub, and our new benchmark, outperforming a wide range of VLM-as-judge models. Furthermore, we use EditReward to select a high-quality subset from the existing noisy ShareGPT-4o-Image dataset. We train Step1X-Edit on the selected subset, which shows significant improvement over training on the full set. This demonstrates EditReward's ability to serve as a reward model to scale up high-quality training data for image editing. Furthermore, its strong alignment suggests potential for advanced applications like reinforcement learning-based post-training and test-time scaling of image editing models. EditReward with its training dataset will be released to help the community build more high-quality image editing training datasets.
zh
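奖励模型通常以成对偏好数据训练。作为背景性说明,下面给出 Bradley-Terry 形式成对偏好损失的极简 PyTorch 示意:它只表达"被偏好的编辑得分应高于未被偏好的编辑"这一通用训练目标,`reward_model` 及特征输入均为假设,并非论文公开的具体实现细节。

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(reward_model, chosen_feats, rejected_feats):
    """Bradley-Terry 风格的成对偏好损失(示意)。
    reward_model: 将(指令 + 编辑结果)特征映射为标量分数的网络,此处为假设。"""
    r_chosen = reward_model(chosen_feats)      # [batch, 1]
    r_rejected = reward_model(rejected_feats)  # [batch, 1]
    # 最大化 P(chosen 优于 rejected) = sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```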

[NLP-29] Fast-dLLM v2: Efficient Block-Diffusion LLM

【速读】: 该论文旨在解决自回归大语言模型(Autoregressive Large Language Models, AR LLMs)在推理阶段因序列解码机制导致的效率瓶颈问题。其核心解决方案是提出Fast-dLLM v2,一种基于块扩散机制(block diffusion mechanism)的扩散语言模型(diffusion language model, dLLM),通过仅需约10亿token的微调即可高效将预训练AR模型转化为支持并行生成的dLLM,相较全注意力扩散LLM(如Dream,需580B tokens)减少500倍训练数据量,同时保持原始模型性能。关键创新在于设计了一种结合块扩散机制与互补注意力掩码(complementary attention mask)的新训练范式,实现块内双向上下文建模而不破坏AR训练目标;此外引入分层缓存机制(block-level和sub-block cache)与并行解码流水线,使推理速度提升最高达2.5倍,且不牺牲生成质量。

链接: https://arxiv.org/abs/2509.26328
作者: Chengyue Wu,Hao Zhang,Shuchen Xue,Shizhe Diao,Yonggan Fu,Zhijian Liu,Pavlo Molchanov,Ping Luo,Song Han,Enze Xie
机构: The University of Hong Kong (香港大学); NVIDIA (英伟达); MIT (麻省理工学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their inherent sequential decoding limits inference efficiency. In this work, we propose Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that efficiently adapts pretrained AR models into dLLMs for parallel text generation, requiring only approximately 1B tokens of fine-tuning. This represents a 500x reduction in training data compared to full-attention diffusion LLMs such as Dream (580B tokens), while preserving the original model’s performance. Our approach introduces a novel training recipe that combines a block diffusion mechanism with a complementary attention mask, enabling blockwise bidirectional context modeling without sacrificing AR training objectives. To further accelerate decoding, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations across blocks, and a sub-block cache that enables efficient parallel generation within partially decoded blocks. Coupled with our parallel decoding pipeline, Fast-dLLM v2 achieves up to 2.5x speedup over standard AR decoding without compromising generation quality. Extensive experiments across diverse benchmarks demonstrate that Fast-dLLM v2 matches or surpasses AR baselines in accuracy, while delivering state-of-the-art efficiency among dLLMs - marking a significant step toward the practical deployment of fast and accurate LLMs. Code and model will be publicly released.
zh

[NLP-30] Latent Thinking Optimization: Your Latent Reasoning Language Model Secretly Encodes Reward Signals in its Latent Thoughts

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中依赖高成本且易过拟合的自然语言链式思维(chain of thought)的问题,同时针对隐空间思考(latent thinking)缺乏可解释性和监督困难导致的正确性与可靠性不足的挑战。其解决方案的关键在于:首先通过系统分析发现,导向正确与错误答案的隐空间推理路径具有高度可区分的模式,并提出一个隐空间分类器(latent classifier)作为隐空间奖励模型(Latent Reward Model, LRM),用于直接从隐表示中预测答案正确性;进而设计出隐空间思考优化算法(Latent Thinking Optimization, LTO),利用LRM作为奖励信号对隐空间推理过程进行概率优化。实验证明,该方法可在不依赖显式语言标注的前提下显著提升LLM的推理准确性,且具备跨任务和跨模型的泛化能力,为高效、通用的隐空间推理优化提供了新范式。

链接: https://arxiv.org/abs/2509.26314
作者: Hanwen Du,Yuxin Dong,Xia Ning
机构: The Ohio State University (俄亥俄州立大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) excel at problem solving by generating chain of thoughts in natural language, but such verbal thinking is computationally costly and prone to overthinking. Recent work instead proposes a latent thinking architecture Huggin-3.5B, which represents intermediate reasoning steps as sequence of latent representations. However, latent thoughts lack interpretability and are difficult to supervise, raising concerns about the correctness and reliability of its latent thinking processes. In this paper, we provide a systematic study of how Huggin-3.5B thinks in the latent space and how external supervision signals can improve its latent thinking processes. We show that latent thoughts leading to correct versus incorrect answers exhibit highly distinguishable patterns, and that a latent classifier can reliably predict answer correctness directly from latent thoughts. Leveraging these insights, we propose Latent Thinking Optimization (LTO), a probabilistic algorithm that employs the latent classifier as a Latent Reward Model (LRM) to optimize the latent thinking processes. Extensive experiments across diverse reasoning tasks demonstrate that LRM is highly effective in detecting incorrect latent thinking patterns, and LTO can significantly improve the latent thinking processes. Furthermore, we show that LRM can generalize across diverse domains, and LTO can be seamlessly applied to general LLMs to improve their thinking processes. In contrast to verbal thinking, our method demonstrates that reward modeling and scaling test-time thinking with supervision can be performed directly in the latent space, highlighting its potential as a general, efficient, and domain-agnostic approach to improving the thinking processes of LLMs.
zh
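其中"把隐空间分类器当作奖励模型"的想法可以用很小的代码说明:先在隐表示上训练一个预测答案正确性的分类器,再把它输出的正确概率作为奖励,用于在多条隐空间思考轨迹中择优。下面用 scikit-learn 的逻辑回归给出极简示意,数据为随机合成,仅表达流程,并非论文实现。

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
latent_feats = rng.normal(size=(1000, 64))     # 假设:每行是一次隐空间推理的向量表示
labels = (latent_feats[:, 0] > 0).astype(int)  # 合成的"答案是否正确"标签,仅作演示

lrm = LogisticRegression(max_iter=1000).fit(latent_feats, labels)

def latent_reward(z):
    """隐空间奖励模型(LRM)的示意:分类器给出的正确概率即奖励。"""
    return lrm.predict_proba(z.reshape(1, -1))[0, 1]

candidates = rng.normal(size=(8, 64))          # 8 条候选隐空间思考轨迹(示意)
best = candidates[int(np.argmax([latent_reward(z) for z in candidates]))]
```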

[NLP-31] One-Token Rollout: Guiding Supervised Fine-Tuning of LLMs with Policy Gradient

【速读】: 该论文旨在解决监督微调(Supervised Fine-Tuning, SFT)在大语言模型(Large Language Models, LLMs)适应过程中普遍存在的泛化能力不足问题。现有研究表明,SFT 的性能差距不仅源于损失函数设计,更根本的原因在于其依赖静态的、预收集的离策略(off-policy)数据,而强化学习(Reinforcement Learning, RL)则利用当前策略采样的在策略(on-policy)数据。为此,作者提出了一种名为“单标记滚动”(One-Token Rollout, OTR)的新颖微调算法,其核心创新在于将自回归生成过程重构为单步强化学习轨迹:在每一步token生成时,通过从当前策略分布中采样多个候选token,并以监督数据中的真实token作为奖励信号,从而基于策略梯度方法动态调整模型参数。OTR成功地将静态的离策略数据转化为逐token层面的在策略信号,既保留了SFT的高效性,又获得了RL的泛化优势,实验证明其在数学推理、代码生成和通用推理等多个挑战性基准上均显著优于标准SFT。

链接: https://arxiv.org/abs/2509.26313
作者: Rui Ming,Haoyuan Wu,Shoubo Hu,Zhuolun He,Bei Yu
机构: The Chinese University of Hong Kong (香港中文大学); Noah’s Ark Lab, Huawei (华为诺亚方舟实验室); ChatEDA Tech (ChatEDA科技)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Supervised fine-tuning (SFT) is the predominant method for adapting large language models (LLMs), yet it often struggles with generalization compared to reinforcement learning (RL). In this work, we posit that this performance disparity stems not just from the loss function, but from a more fundamental difference: SFT learns from a fixed, pre-collected dataset, whereas RL utilizes on-policy data sampled from the current policy. Building on this hypothesis, we introduce one-token rollout (OTR), a novel fine-tuning algorithm that guides SFT with the policy gradient method. OTR reframes the autoregressive learning process by treating each token generation as a single-step reinforcement learning trajectory. At each step, it performs a Monte Carlo ``rollout’’ by sampling multiple candidate tokens from the current policy’s distribution. The ground-truth token from the supervised data is then used to provide a reward signal to these samples. Guided by policy gradient, our algorithm repurposes static, off-policy supervised data into a dynamic, on-policy signal at the token level, capturing the generalization benefits of on-policy learning while bypassing the costly overhead of full sentence generation. Through extensive experiments on a diverse suite of challenging benchmarks spanning mathematical reasoning, code generation, and general domain reasoning, we demonstrate that OTR consistently outperforms standard SFT. Our findings establish OTR as a powerful and practical alternative for fine-tuning LLMs and provide compelling evidence that the on-policy nature of data is a critical driver of generalization, offering a promising new direction for fine-tuning LLMs.
zh
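OTR 的核心是把每个位置的 token 生成当作单步强化学习:从当前策略采样若干候选 token,命中监督数据中的真实 token 记为奖励 1,否则为 0,再以策略梯度加权对应的对数概率。下面是该损失形式的极简 PyTorch 示意(基线项等细节为假设性选择,并非官方实现)。

```python
import torch

def otr_step_loss(logits, gold_token, k=4):
    """单个位置的 one-token rollout 损失(示意)。
    logits: [vocab],当前策略在该位置的输出;gold_token: 监督数据中的真实 token id。"""
    probs = torch.softmax(logits, dim=-1)
    samples = torch.multinomial(probs, num_samples=k, replacement=True)  # k 次单步 rollout
    rewards = (samples == gold_token).float()          # 命中真实 token 记 1,否则 0
    baseline = rewards.mean()                          # 简单均值基线以降低方差(假设性选择)
    log_probs = torch.log(probs[samples] + 1e-12)
    return -((rewards - baseline) * log_probs).mean()  # REINFORCE 形式的策略梯度损失
```

实际训练中可对序列里每个位置累加该损失后再反向传播,从而把静态监督数据转化为逐 token 的在策略信号。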

[NLP-32] Feedback Forensics: A Toolkit to Measure AI Personality

【速读】: 该论文旨在解决当前生成式 AI 模型在人格特质(personality)评估上的盲区问题,即现有基于人类反馈的评价方法(如 Chatbot Arena)虽能隐式推断模型优劣,但缺乏对模型人格特征变化的显式追踪能力,导致模型可能因过度迎合反馈而出现偏差(如谄媚倾向)或过拟合排行榜。解决方案的关键在于提出 Feedback Forensics——一个开源工具包,利用 AI 注释器(AI annotators)实现对模型人格特质的量化分析,通过 Python API 和浏览器应用支持对模型训练与评估过程中人格演变的透明监测,并首次系统性地揭示了主流人类反馈数据集(如 Chatbot Arena、MultiPref 和 PRISM)中鼓励的人格特征及其在实际模型中的体现程度。

链接: https://arxiv.org/abs/2509.26305
作者: Arduin Findeis,Timo Kaufmann,Eyke Hüllermeier,Robert Mullins
机构: University of Cambridge (剑桥大学); LMU Munich (慕尼黑路德维希马克西米利安大学); MCML Munich (慕尼黑计算与机器学习中心); DFKI (德国弗劳恩霍夫研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Some traits making a “good” AI model are hard to describe upfront. For example, should responses be more polite or more casual? Such traits are sometimes summarized as model character or personality. Without a clear objective, conventional benchmarks based on automatic validation struggle to measure such traits. Evaluation methods using human feedback such as Chatbot Arena have emerged as a popular alternative. These methods infer “better” personality and other desirable traits implicitly by ranking multiple model responses relative to each other. Recent issues with model releases highlight limitations of these existing opaque evaluation approaches: a major model was rolled back over sycophantic personality issues, models were observed overfitting to such feedback-based leaderboards. Despite these known issues, limited public tooling exists to explicitly evaluate model personality. We introduce Feedback Forensics: an open-source toolkit to track AI personality changes, both those encouraged by human (or AI) feedback, and those exhibited across AI models trained and evaluated on such feedback. Leveraging AI annotators, our toolkit enables investigating personality via Python API and browser app. We demonstrate the toolkit’s usefulness in two steps: (A) first we analyse the personality traits encouraged in popular human feedback datasets including Chatbot Arena, MultiPref and PRISM; and (B) then use our toolkit to analyse how much popular models exhibit such traits. We release (1) our Feedback Forensics toolkit alongside (2) a web app tracking AI personality in popular models and feedback datasets as well as (3) the underlying annotation data at this https URL.
zh

[NLP-33] QUARTZ: QA-based Unsupervised Abstractive Refinement for Task-oriented Dialogue Summarization EMNLP2025

【速读】: 该论文旨在解决对话摘要(Dialogue Summarization)任务中因依赖人工标注监督信号而导致的高成本及生成内容缺乏任务特定焦点的问题,尤其是在医疗等专业场景下,传统方法生成的摘要难以满足下游应用需求。其解决方案的关键在于提出一种基于任务导向效用评估的框架 QUARTZ,该框架首先利用多个大语言模型(LLMs)在零样本(zero-shot)条件下生成多条摘要和相关问答对,随后通过 LLM 自动回答任务相关问题来评估摘要质量:一方面筛选出最可靠的问答答案,另一方面依据这些答案识别最具信息量的摘要;最终基于选出的优质摘要对最优 LLM 进行微调,从而实现高效且任务适配性强的对话摘要生成,在多个数据集上达到与全监督前沿方法相当的效果。

链接: https://arxiv.org/abs/2509.26302
作者: Mohamed Imed Eddine Ghebriout(1),Gaël Guibon(1, 2),Ivan Lerner(3, 4, 5),Emmanuel Vincent(1) ((1) Universite de Lorraine, CNRS, Inria, LORIA, Nancy, France, (2) Universite Sorbonne Paris Nord, CNRS, LIPN, Villetaneuse, France, (3) Inserm, Centre de Recherche des Cordeliers, Universite Paris Cite, Sorbonne Universite, Paris, France, (4) HeKA, Inria Paris, Paris, France, (5) Assistance Publique Hopitaux de Paris, Georges Pompidou European Hospital, Paris, France)
机构: Université de Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France; Université Sorbonne Paris Nord, CNRS, Laboratoire d’Informatique de Paris Nord, LIPN, F-93430 Villetaneuse, France; Inserm, Centre de Recherche des Cordeliers, Université Paris Cité, Sorbonne Université, F-75006 Paris, France; HeKA, Inria Paris, F-75012 Paris, France; Department of Medical Informatics, Assistance Publique Hôpitaux de Paris, Georges Pompidou European Hospital
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to Empirical Methods in Natural Language Processing (EMNLP 2025)

点击查看摘要

Abstract:Dialogue summarization aims to distill the core meaning of a conversation into a concise text. This is crucial for reducing the complexity and noise inherent in dialogue-heavy applications. While recent approaches typically train language models to mimic human-written summaries, such supervision is costly and often results in outputs that lack task-specific focus, limiting their effectiveness in downstream applications, such as medical tasks. In this paper, we propose QUARTZ, a framework for task-oriented utility-based dialogue summarization. QUARTZ starts by generating multiple summaries and task-oriented question-answer pairs from a dialogue in a zero-shot manner using a pool of large language models (LLMs). The quality of the generated summaries is evaluated by having LLMs answer task-related questions before (i) selecting the best candidate answers and (ii) identifying the most informative summary based on these answers. Finally, we fine-tune the best LLM on the selected summaries. When validated on multiple datasets, QUARTZ demonstrates its effectiveness by achieving competitive results in various zero-shot settings, rivaling fully-supervised State-of-the-Art (SotA) methods.
zh
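其中"用任务问答给候选摘要打分并择优"这一步,可以用占位的 LLM 调用写成很短的选择逻辑。下面是一个极简示意,`llm`、`judge` 均为假设的接口,提示措辞也仅供说明,并非论文实现。

```python
def select_best_summary(dialogue, candidate_summaries, questions, llm, judge):
    """按候选摘要能支撑多少任务相关问答来打分,返回得分最高的摘要(示意)。
    llm(prompt) -> str 为假设的生成接口;judge(q, ref, ans) -> bool 判断两个回答是否一致。"""
    reference_answers = [llm(f"根据完整对话回答问题。\n对话:{dialogue}\n问题:{q}")
                         for q in questions]

    def score(summary):
        hits = 0
        for q, ref in zip(questions, reference_answers):
            ans = llm(f"仅根据以下摘要回答问题。\n摘要:{summary}\n问题:{q}")
            hits += int(judge(q, ref, ans))
        return hits

    return max(candidate_summaries, key=score)
```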

[NLP-34] ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation

【速读】: 该论文旨在解决现有技能熟练度评估方法依赖黑箱视频分类器、忽视多视角上下文信息且缺乏可解释性的问题。其解决方案的关键在于提出一个紧凑的视觉-语言模型(ProfVLM),将技能评估任务重构为生成式推理过程:该模型联合预测技能等级并从第一人称和第三人称视频中生成类专家反馈。核心创新是一个注意力门控投影模块(AttentiveGatedProjector),能够动态融合由冻结的 TimeSformer 视频骨干网络提取的多视角特征,并将其映射到针对反馈生成微调的语言模型中,从而在实现高精度、低参数量(最多减少20倍)和高效训练(训练时间减少60%)的同时,输出与实际表现对齐的自然语言批判性反馈,显著提升了评估的透明度与实用性。

链接: https://arxiv.org/abs/2509.26278
作者: Edoardo Bianchi,Jacopo Staiano,Antonio Liotta
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Existing approaches to skill proficiency estimation often rely on black-box video classifiers, ignoring multi-view context and lacking explainability. We present ProfVLM, a compact vision-language model that reformulates this task as generative reasoning: it jointly predicts skill level and generates expert-like feedback from egocentric and exocentric videos. Central to our method is an AttentiveGatedProjector that dynamically fuses multi-view features, projected from a frozen TimeSformer backbone into a language model tuned for feedback generation. Trained on EgoExo4D with expert commentaries, ProfVLM surpasses state-of-the-art methods while using up to 20x fewer parameters and reducing training time by up to 60%. Our approach not only achieves superior accuracy across diverse activities, but also outputs natural language critiques aligned with performance, offering transparent reasoning. These results highlight generative vision-language modeling as a powerful new direction for skill assessment.
zh

[NLP-35] Optimizing Speech Language Models for Acoustic Consistency

【速读】: 该论文旨在解决语音语言模型(Speech Language Model, SLM)在生成过程中缺乏一致性与语义稳定性的问题,尤其在不同说话人、性别、情感、环境和背景噪声条件下表现不一致。其解决方案的关键在于引入语义初始化(semantic initialization)与规划损失(planning losses),通过自监督特征初始化语音token、轻量级对齐损失(light alignment loss)、以及数据抽稀(thinning)与辅助目标(auxiliary objectives)来增强模型的鲁棒性和内容规划能力。实验表明,这种基于语言模型侧的设计与训练策略能够在不修改分词器或运行时架构的前提下,有效平衡声学稳定性与语义锚定(semantic grounding),其中纯语音模型在跨条件一致性上优于更大规模的系统,而文本与语音交错训练虽提升语义-声学对齐,但牺牲了一致性。

链接: https://arxiv.org/abs/2509.26276
作者: Morteza Rohanian,Michael Krauthammer
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD)
备注:

点击查看摘要

Abstract:We study speech language models that incorporate semantic initialization and planning losses to achieve robust and consistent generation. Our approach initializes speech tokens with self-supervised features, applies a light alignment loss, and trains with thinning and auxiliary objectives that target robustness and content planning. We train three models: a 0.7B speech-only model, a 1.0B speech-only model, and a 1.0B interleaved model with both text and speech. Acoustic studies show that the speech-only models achieve the highest consistency across speaker, gender, sentiment, room, and background factors, surpassing larger systems. Interleaving improves lexical and syntactic probes and semantic–acoustic alignment but reduces consistency. Linear probes show that our initialization biases the model toward content structure while trading off prosody detail. These results show that LM-side design and training mix control the balance between acoustic stability and semantic grounding without changes to the tokenizer or runtime architecture. A demo and model weights are available for exploration.
zh

[NLP-36] Finetune Once: Decoupling General Domain Learning with Dynamic Boosted Annealing

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在微调过程中依赖复杂数据混合策略、需多次实验才能实现最优泛化性能的问题。传统微调方法往往需要引入通用数据进行训练以稳定优化过程,但这一过程不仅繁琐,还可能导致计算资源浪费。论文提出的解决方案——动态增强退火(Dynamic Boosted Annealing, DBA)的关键在于:首先通过零学习率训练获取通用数据上的全局梯度,随后在领域训练中利用该梯度进行梯度提升与动态训练步长校正,并结合退火学习策略,从而仅使用领域数据即可完成高效且稳定的微调,避免了模型坍塌。该方法显著提升了联合性能(平均提升5.8%),同时将GPU耗时减少91.0%,实现了训练流程的简化与效率优化。

链接: https://arxiv.org/abs/2509.26242
作者: Yang Tang,Ruijie Liu,Yifan Wang,Shiyu Li,Xi Chen
机构: Basic Algorithm Center, PCG, Tencent(腾讯PCG基础算法中心); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 5 figures

点击查看摘要

Abstract:Large language models (LLMs) fine-tuning shows excellent implications. However, vanilla fine-tuning methods often require intricate data mixture and repeated experiments for optimal generalization. To address these challenges and streamline the training process, we propose an efficient and universal solution, Dynamic Boosted Annealing (DBA). We obtain a global gradient through zero-learning-rate training on general data, which is subsequently employed for gradient boosting and dynamic training step correction during domain training. In conjunction with annealing learning, we end up establishing a fine-tuning pipeline that relies solely on domain data without collapse. By evaluating both general and domain-specific performance across multiple tasks on several popular base models, DBA achieves an average improvement of 5.8% in joint performance over vanilla fine-tuning. Furthermore, since general data is no longer involved in annealing, repeated experiments led by data mixture are also eliminated. According to our tests, the DBA method can reduce GPU hours by 91.0% compared to the vanilla method.
zh
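DBA 的大致流程可以理解为:先在通用数据上以零学习率只做前向/反向,累积一份"全局梯度"作为参考;随后在领域训练中用它对当前梯度做增强与步长校正。下面的 PyTorch 示意高度简化,其中梯度的线性叠加系数 `lam` 纯属假设,具体的梯度提升与步长校正公式请以论文为准。

```python
import torch

def compute_global_grad(model, general_batches, loss_fn):
    """零学习率阶段:只累积通用数据上的梯度,不更新参数(示意)。"""
    grads = [torch.zeros_like(p) for p in model.parameters()]
    for batch in general_batches:
        model.zero_grad()
        loss_fn(model, batch).backward()
        for g, p in zip(grads, model.parameters()):
            g += p.grad.detach()
    return [g / max(len(general_batches), 1) for g in grads]

def domain_step(model, batch, loss_fn, optimizer, global_grads, lam=0.1):
    """领域训练步:把全局梯度按系数 lam 叠加到当前梯度上(叠加方式为假设)。"""
    model.zero_grad()
    loss_fn(model, batch).backward()
    with torch.no_grad():
        for p, g_ref in zip(model.parameters(), global_grads):
            p.grad += lam * g_ref
    optimizer.step()
```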

[NLP-37] Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners

【速读】: 该论文旨在解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Reward, RLVR)在训练过程中因需要极长上下文长度而导致的高计算成本问题。传统多阶段训练虽可部分缓解此问题,但若初始阶段使用过短上下文,易引发不可逆的性能下降,难以显著降低整体训练算力消耗。解决方案的关键在于提出一种名为Thinking-Free Policy Initialization (TFPI) 的简单而有效的改进方法,其核心是引入一个“ThinkFree”操作——通过直接追加 /think 标记显式丢弃推理过程中的思考内容(Chain-of-Thought, CoT),从而在推理阶段减少token使用量;同时,利用该机制对输入进行适配后训练,不仅提升了模型性能,还降低了token消耗,即使在原始慢思考模式下亦能实现更高效的收敛与更高性能上限,且无需特殊奖励设计或复杂训练策略。

链接: https://arxiv.org/abs/2509.26226
作者: Xin Xu,Cliveb AI,Kai Yang,Tianhao Chen,Yang Wang,Saiyong Yang,Can Yang
机构: Tencent(腾讯); The Hong Kong University of Science and Technology(香港科技大学); The University of Hong Kong(香港大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Reward (RLVR) effectively solves complex tasks but demands extremely long context lengths during training, leading to substantial computational costs. While multi-stage training can partially mitigate this, starting with overly short contexts often causes irreversible performance degradation, ultimately failing to reduce overall training compute significantly. In this paper, we introduce Thinking-Free Policy Initialization (TFPI), a simple yet effective adaptation to RLVR that bridges long Chain-of-Thought (CoT) distillation and standard RLVR. TFPI employs a simple ThinkFree operation, explicitly discarding the thinking content via a direct /think append, to reduce token usage during inference. Training with ThinkFree-adapted inputs improves performance and lowers token consumption, even in the original slow-thinking mode. Extensive experiments across various benchmarks have shown that TFPI accelerates RL convergence, achieves a higher performance ceiling, and yields more token-efficient reasoning models without specialized rewards or complex training designs. With TFPI only, we train a 4B model to reach 89.0% accuracy on AIME24 and 65.5% on LiveCodeBench using less than 4K H20 hours.
zh

[NLP-38] Type-Less yet Type-Aware Inductive Link Prediction with Pretrained Language Models EMNLP2025

【速读】: 该论文旨在解决知识图谱(Knowledge Graph, KG)中诱导式链接预测(Inductive Link Prediction)的问题,即在新实体频繁出现且模型需在不重新训练的情况下泛化到这些未见实体的场景下,如何提升链接预测性能。其核心挑战在于:传统方法依赖显式的类型信息(type information),而现实中类型标注往往缺失、粗粒度、稀疏且易出错。为此,作者提出TyleR——一种“无显式类型但感知类型”的子图诱导式链接预测方法,其关键创新在于利用预训练语言模型(Pre-trained Language Models, PLMs)从文本语义中挖掘隐式类型信号,以增强节点表示,从而在类型信息稀缺和图结构稀疏的场景下显著优于现有最先进基线方法。

链接: https://arxiv.org/abs/2509.26224
作者: Alessandro De Bellis,Salvatore Bufi,Giovanni Servedio,Vito Walter Anelli,Tommaso Di Noia,Eugenio Di Sciascio
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted and to appear in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025)

点击查看摘要

Abstract:Inductive link prediction is emerging as a key paradigm for real-world knowledge graphs (KGs), where new entities frequently appear and models must generalize to them without retraining. Predicting links in a KG faces the challenge of guessing previously unseen entities by leveraging generalizable node features such as subgraph structure, type annotations, and ontological constraints. However, explicit type information is often lacking or incomplete. Even when available, type information in most KGs is often coarse-grained, sparse, and prone to errors due to human annotation. In this work, we explore the potential of pre-trained language models (PLMs) to enrich node representations with implicit type signals. We introduce TyleR, a Type-less yet type-awaRe approach for subgraph-based inductive link prediction that leverages PLMs for semantic enrichment. Experiments on standard benchmarks demonstrate that TyleR outperforms state-of-the-art baselines in scenarios with scarce type annotations and sparse graph connectivity. To ensure reproducibility, we share our code at this https URL .
zh

[NLP-39] Comparative Analysis of Ant Colony Optimization and Google OR-Tools for Solving the Open Capacitated Vehicle Routing Problem in Logistics

【速读】: 该论文旨在解决现代物流管理系统中车辆路径规划的效率问题,具体针对开放式容量车辆路径问题(Open Capacitated Vehicle Routing Problem, OCVRP),即在不强制车辆返回起点 depot 的前提下,为分布式客户群体优化配送路线。解决方案的关键在于对比两种算法:基于自然启发式的蚁群优化(Ant Colony Optimization, ACO)与工业标准优化工具 Google OR-Tools。研究发现,ACO 在路由参数灵活性方面具有优势,而 OR-Tools 在计算速度、结果一致性及输入复杂度上表现更优,因此可为构建可扩展的实时物流系统提供差异化的策略选择依据。

链接: https://arxiv.org/abs/2509.26216
作者: Assem Omar,Youssef Omar,Marwa Solayman,Hesham Mansour
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 6 pages, accepted at Intelligent Methods, Systems, and Applications (IMSA 2025)

点击查看摘要

Abstract:In modern logistics management systems, route planning requires high efficiency. The Open Capacitated Vehicle Routing Problem (OCVRP) deals with finding optimal delivery routes for a fleet of vehicles serving geographically distributed customers, without requiring the vehicles to return to the depot after deliveries. The present study is comparative in nature and speaks of two algorithms for OCVRP solution: Ant Colony Optimization (ACO), a nature-inspired metaheuristic; and Google OR-Tools, an industry-standard toolkit for optimization. Both implementations were developed in Python and using a custom dataset. Performance appraisal was based on routing efficiency, computation time, and scalability. The results show that ACO allows flexibility in routing parameters while OR-Tools runs much faster with more consistency and requires less input. This could help choose among routing strategies for scalable real-time logistics systems.
zh
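为说明 ACO 在此类路径问题上的基本形态,下面给出一个与两种实现的具体细节无关的极简示意:单车辆、从仓库(节点 0)出发、访问所有客户且不返回仓库,容量约束从略,各参数均为常见默认取值而非论文设置。

```python
import numpy as np

def aco_open_route(dist, n_ants=20, n_iters=100, alpha=1.0, beta=2.0, rho=0.5, seed=0):
    """极简蚁群优化:dist 为 (n, n) 的 numpy 距离矩阵,返回开放式路线及其长度(示意)。"""
    rng = np.random.default_rng(seed)
    n = len(dist)
    tau = np.ones((n, n))                  # 信息素矩阵
    eta = 1.0 / (dist + np.eye(n))         # 启发式信息:距离的倒数(对角线加 1 防止除零)
    best_route, best_len = None, float("inf")
    for _ in range(n_iters):
        routes = []
        for _ in range(n_ants):
            route, unvisited = [0], set(range(1, n))
            while unvisited:
                cur, cand = route[-1], list(unvisited)
                w = (tau[cur, cand] ** alpha) * (eta[cur, cand] ** beta)
                nxt = int(rng.choice(cand, p=w / w.sum()))
                route.append(nxt)
                unvisited.remove(nxt)
            length = sum(dist[a, b] for a, b in zip(route, route[1:]))  # 开放式:不含返程
            routes.append((route, length))
            if length < best_len:
                best_route, best_len = route, length
        tau *= (1.0 - rho)                 # 信息素蒸发
        for route, length in routes:       # 依路线质量沉积信息素
            for a, b in zip(route, route[1:]):
                tau[a, b] += 1.0 / length
    return best_route, best_len
```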

[NLP-40] VietBinoculars: A Zero-Shot Approach for Detecting Vietnamese LLM -Generated Text

【速读】: 该论文旨在解决生成式 AI(Generative AI)模型生成的越南语文本与人类撰写文本之间难以区分的问题,尤其是在当前大型语言模型(Large Language Models, LLMs)持续迭代、生成内容日益逼真且多样化背景下,传统检测方法效果下降。解决方案的关键在于提出 VietBinoculars——对原始 Binoculars 方法进行优化,通过构建新的越南语 AI 生成数据集来确定全局阈值,并在跨域测试中实现超过 99% 的准确率、F1 分数和 AUC 值,显著优于原始 Binoculars 模型及包括 ZeroGPT 和 DetectGPT 在内的多种主流检测工具,尤其在对抗性提示策略下仍保持高性能。

链接: https://arxiv.org/abs/2509.26189
作者: Trieu Hai Nguyen,Sivaswamy Akilesh
机构: 未知
类目: Computation and Language (cs.CL)
备注: 27 pages

点击查看摘要

Abstract:The rapid development of Large Language Models (LLMs) based on transformer architectures raises key challenges, one of them being the task of distinguishing between human-written text and LLM-generated text. As LLM-generated textual content becomes increasingly complex over time and resembles human writing, traditional detection methods are proving less effective, especially as the number and diversity of LLMs continue to grow with new models and versions being released at a rapid pace. This study proposes VietBinoculars, an adaptation of the Binoculars method with optimized global thresholds, to enhance the detection of Vietnamese LLM-generated text. We have constructed new Vietnamese AI-generated datasets to determine the optimal thresholds for VietBinoculars and to enable benchmarking. The results from our experiments show that VietBinoculars achieves over 99% in both domains in terms of accuracy, F1-score, and AUC on multiple out-of-domain datasets. It outperforms the original Binoculars model, traditional detection methods, and other state-of-the-art approaches, including commercial tools such as ZeroGPT and DetectGPT, especially under specially modified prompting strategies.
zh
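Binoculars 一类方法的打分思想可以概括为:用观察者模型的对数困惑度除以两模型间的交叉对数困惑度,再与阈值比较。下面是该打分逻辑的极简示意,`log_ppl`、`cross_log_ppl` 为假设的接口,阈值 0.9 仅为占位示例,并非 VietBinoculars 在越南语数据上标定的全局阈值。

```python
def binoculars_score(text, log_ppl, cross_log_ppl):
    """分数 = 观察者模型的对数困惑度 / 两模型的交叉对数困惑度(示意)。"""
    return log_ppl(text) / max(cross_log_ppl(text), 1e-12)

def is_machine_generated(text, log_ppl, cross_log_ppl, threshold=0.9):
    """分数低于阈值则判为机器生成;阈值需在目标语言数据上重新标定(示意)。"""
    return binoculars_score(text, log_ppl, cross_log_ppl) < threshold
```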

[NLP-41] Auto-ARGUE: LLM-Based Report Generation Evaluation ECIR2025

【速读】: 该论文旨在解决长篇报告生成任务中缺乏专门评估工具的问题,尤其是在检索增强生成(Retrieval Augmented Generation, RAG)系统中,现有开源评估工具多适用于一般RAG任务,难以有效衡量生成报告的准确性、连贯性和引用支持性。解决方案的关键在于提出Auto-ARGUE,这是一个基于大语言模型(Large Language Model, LLM)实现的自动化评估框架,继承了近期ARGUE框架的核心思想,能够对报告生成结果进行系统性评分,并在TREC 2024 NeuCLIR赛道的报告生成预演任务上展现出与人工判断良好的一致性。

链接: https://arxiv.org/abs/2509.26184
作者: William Walden,Marc Mason,Orion Weller,Laura Dietz,Hannah Recknor,Bryan Li,Gabrielle Kaili-May Liu,Yu Hou,James Mayfield,Eugene Yang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: ECIR 2025 demo format

点击查看摘要

Abstract:Generation of long-form, citation-backed reports is a primary use case for retrieval augmented generation (RAG) systems. While open-source evaluation tools exist for various RAG tasks, ones tailored to report generation are lacking. Accordingly, we introduce Auto-ARGUE, a robust LLM-based implementation of the recent ARGUE framework for report generation evaluation. We present analysis of Auto-ARGUE on the report generation pilot task from the TREC 2024 NeuCLIR track, showing good system-level correlations with human judgments. We further release a web app for visualization of Auto-ARGUE outputs.
zh

[NLP-42] Explaining novel senses using definition generation with open language models

链接: https://arxiv.org/abs/2509.26181
作者: Mariia Fedorova,Andrey Kutuzov,Francesco Periti,Yves Scherrer
机构: University of Oslo (奥斯陆大学); KU Leuven - Flanders Make (鲁汶大学-弗兰德制造)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-43] MGen: Millions of Naturally Occurring Generics in Context

链接: https://arxiv.org/abs/2509.26160
作者: Gustavo Cilleruelo,Emily Allaway,Barry Haddow,Alexandra Birch
机构: University of Edinburgh (爱丁堡大学)
类目: Computation and Language (cs.CL)
备注: Presented at SCiL 2025

点击查看摘要

[NLP-44] CliniBench: A Clinical Outcome Prediction Benchmark for Generative and Encoder-Based Language Models

【速读】: 该论文旨在解决生成式大语言模型(Generative Large Language Models, LLMs)在真实临床场景中用于出院诊断预测任务时性能尚不明确的问题。其解决方案的关键在于构建了首个基准测试平台 CliniBench,用于系统性比较编码器类分类器(encoder-based classifiers)与生成式LLMs在MIMIC-IV数据集上的表现,并通过引入检索增强策略(retrieval augmentation strategies)提升生成式模型在上下文学习中的性能,从而为临床AI应用提供可比性和优化路径。

链接: https://arxiv.org/abs/2509.26136
作者: Paul Grundmann,Dennis Fast,Jan Frick,Thomas Steffek,Felix Gers,Wolfgang Nejdl,Alexander Löser
机构: Berlin University of Applied Sciences (柏林应用科学大学); Leibniz University Hannover (汉诺威莱布尼茨大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With their growing capabilities, generative large language models (LLMs) are being increasingly investigated for complex medical tasks. However, their effectiveness in real-world clinical applications remains underexplored. To address this, we present CliniBench, the first benchmark that enables comparability of well-studied encoder-based classifiers and generative LLMs for discharge diagnosis prediction from admission notes in MIMIC-IV dataset. Our extensive study compares 12 generative LLMs and 3 encoder-based classifiers and demonstrates that encoder-based classifiers consistently outperform generative models in diagnosis prediction. We assess several retrieval augmentation strategies for in-context learning from similar patients and find that they provide notable performance improvements for generative LLMs.
zh

[NLP-45] The Hunger Game Debate: On the Emergence of Over-Competition in Multi-Agent Systems

链接: https://arxiv.org/abs/2509.26126
作者: Xinbei Ma,Ruotian Ma,Xingyu Chen,Zhengliang Shi,Mengru Wang,Jen-tse Huang,Qu Yang,Wenxuan Wang,Fanghua Ye,Qingxuan Jiang,Mengfei Zhou,Zhuosheng Zhang,Rui Wang,Hai Zhao,Zhaopeng Tu,Xiaolong Li,Linus
机构: Tencent (腾讯); Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-46] Vocabulary Customization for Efficient Domain-Specific LLM Deployment NEURIPS2025

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在处理训练域外文本时因词汇不匹配(vocabulary mismatch)导致的分词效率下降问题,即通用领域分词器无法有效识别特定领域高频术语,从而产生更多子词单元(sub-word tokens),增加计算开销并降低推理速度。解决方案的关键在于设计一种算法,在不降低原有分词效率的前提下,向预训练分词器中增量添加领域特定词汇(domain-specific tokens),确保任意输入序列的分词数量不超过原始分词结果;实验表明,该方法在真实电商场景中可使输入序列长度缩短最高达20%,显著降低下游任务推理延迟,同时保持预测性能不变,并进一步揭示了词汇适配对前向传播速度及新词采纳率等次级效应的积极影响。

链接: https://arxiv.org/abs/2509.26124
作者: Christian Herold,Michael Kozielski,Nicholas Santavas,Yannick Versley,Shahram Khadivi
机构: eBay Inc(ebay公司)
类目: Computation and Language (cs.CL)
备注: Accepted at NEURIPS 2025 CCFM Workshop

点击查看摘要

Abstract:When using an LLM to process text outside the training domain(s), an often overlooked factor is vocabulary mismatch, where the general-domain tokenizer fails to capture frequent domain-specific terms, leading to higher token fertility and thus a decrease in processing speed due to suboptimal sub-word splits. We address this limitation by augmenting the pretrained vocabulary with a set of domain-specific tokens. To this end, we design an algorithm that extends an existing tokenizer while guaranteeing it never decreases tokenization efficiency: every input sequence is segmented into at most the same number of tokens as before. Evaluated on real-world e-Commerce use-cases, the augmented tokenizer significantly shortens input sequences by up to 20% and reduces inference latency on downstream tasks while preserving predictive quality. We further analyze secondary effects, such as the impact on forward pass speed and the rate at which the model adopts the newly introduced tokens, to illustrate the broader benefits of vocabulary adaptation.
zh
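摘要中"扩展词表但保证分词数不增加"这一保证,可以用一个很朴素的思路来体会:在原有分词结果之上,再对新增的领域词做一次贪心合并;合并只会缩短序列,不会加长。下面是该思路的极简示意,并非论文中的实际算法。

```python
def merge_domain_tokens(tokens, domain_vocab, max_span=4):
    """在原 token 序列上贪心合并领域多词单元,因此任何输入的 token 数
    都不会超过原分词结果(示意,非论文算法)。"""
    merged, i = [], 0
    while i < len(tokens):
        for span in range(max_span, 1, -1):            # 优先尝试更长的合并
            if "".join(tokens[i:i + span]) in domain_vocab:
                merged.append("".join(tokens[i:i + span]))
                i += span
                break
        else:                                           # 无可合并的领域词,保留原 token
            merged.append(tokens[i])
            i += 1
    return merged

# 用法示意(假想的电商领域新增词):
# merge_domain_tokens(["sofa", "bed", "with", "storage"], {"sofabed"})
# -> ["sofabed", "with", "storage"]
```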

[NLP-47] End-to-End Aspect-Guided Review Summarization at Scale EMNLP2025

【速读】: 该论文旨在解决电商平台上产品评论信息过载问题,即如何从海量用户评论中提取关键情感信息并生成简洁、可解释的产品总结,以提升用户决策效率。解决方案的关键在于构建一个基于大规模语言模型(Large Language Model, LLM)的系统,该系统融合方面级情感分析(Aspect-Based Sentiment Analysis, ABSA)与引导式摘要生成技术:首先从单条评论中提取并聚合方面-情感对,筛选出每款产品的高频方面,并据此采样代表性评论;随后利用这些结构化信息构造提示(prompt),引导LLM生成基于真实客户反馈的摘要。该方法确保了摘要内容的准确性与可解释性,且已在Wayfair平台通过大规模在线A/B测试验证其有效性。

链接: https://arxiv.org/abs/2509.26103
作者: Ilya Boytsov,Vinny DeGenova,Mikhail Balyasin,Joseph Walt,Caitlin Eusden,Marie-Claire Rochat,Margaret Pierson
机构: Wayfair
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Camera-ready preprint for EMNLP 2025 Industry Track

点击查看摘要

Abstract:We present a scalable large language model (LLM)-based system that combines aspect-based sentiment analysis (ABSA) with guided summarization to generate concise and interpretable product review summaries for the Wayfair platform. Our approach first extracts and consolidates aspect-sentiment pairs from individual reviews, selects the most frequent aspects for each product, and samples representative reviews accordingly. These are used to construct structured prompts that guide the LLM to produce summaries grounded in actual customer feedback. We demonstrate the real-world effectiveness of our system through a large-scale online A/B test. Furthermore, we describe our real-time deployment strategy and release a dataset of 11.8 million anonymized customer reviews covering 92,000 products, including extracted aspects and generated summaries, to support future research in aspect-guided review summarization.
zh
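该系统的主干流程是"聚合方面-情感对 → 选取高频方面 → 采样代表性评论 → 构造引导式提示"。下面用 `collections.Counter` 给出数据准备与提示构造的极简示意,其中评论的数据结构与提示措辞均为假设,并非线上系统的实现。

```python
from collections import Counter

def build_guided_prompt(reviews, top_k=3, samples_per_aspect=2):
    """reviews: [{"text": str, "aspects": [(aspect, sentiment), ...]}, ...](假设的结构)。"""
    counts = Counter(a for r in reviews for a, _ in r["aspects"])
    top_aspects = [a for a, _ in counts.most_common(top_k)]       # 该商品最高频的方面
    sections = []
    for aspect in top_aspects:
        picked = [r["text"] for r in reviews
                  if any(a == aspect for a, _ in r["aspects"])][:samples_per_aspect]
        sections.append(f"方面「{aspect}」的代表性评论:\n- " + "\n- ".join(picked))
    return ("请基于以下按方面整理的真实用户评论,生成一段简洁、可解释的商品总结,"
            "不得引入评论中不存在的信息。\n\n" + "\n\n".join(sections))
```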

[NLP-48] Reinforced Strategy Optimization for Conversational Recommender Systems via Network-of-Experts

【速读】: 该论文旨在解决当前对话推荐系统(Conversational Recommender Systems, CRSs)在利用大语言模型(Large Language Models, LLMs)进行个性化推荐时,因缺乏显式交互策略优化而导致的性能瓶颈问题。现有方法通常依赖统一提示(unified prompts)生成回复,难以适应多轮交互中的动态需求,导致推荐效果不佳。其解决方案的关键在于提出一种分层策略优化框架——强化策略优化(Reinforced Strategy Optimization, RSO),将响应生成过程解耦为宏观层面的策略规划(Planner)与微观层面的适配(Actor),其中Planner负责选择交互策略(如推荐、解释、鼓励等),Actor则基于辅助专家网络(偏好建模与事实对齐)生成具体回应;同时,通过LLM驱动的奖励机制将策略学习建模为强化学习,有效缓解多轮交互数据稀缺问题,从而实现更可学习、更具适应性的对话推荐。

链接: https://arxiv.org/abs/2509.26093
作者: Xiaoyan Zhao
机构: The Chinese University of Hong Kong (香港中文大学); University of Science and Technology of China (中国科学技术大学); National University of Singapore (新加坡国立大学); Singapore Management University (新加坡管理大学); The Hong Kong Polytechnic University (香港理工大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Conversational Recommender Systems (CRSs) provide personalized recommendations through multi-turn interactions. With the strong reasoning abilities of Large Language Models (LLMs), applying them to CRSs has become promising. Yet, existing methods often lack explicit optimization of interaction strategies, relying instead on unified prompts, which can yield suboptimal outcomes. We propose Reinforced Strategy Optimization (RSO), a hierarchical framework that decomposes response generation into macro-level strategy planning and micro-level adaptation within a network-of-experts. A Planner selects strategies (e.g., recommend, explain, encourage), while an Actor generates responses guided by auxiliary experts for preferences and factual grounding. This disentanglement enables more tractable learning. To address limited multi-turn data, we model strategy learning as reinforcement learning with an LLM-based reward for exploration. Experiments show RSO outperforms state-of-the-art baselines, validating the effectiveness of hierarchical strategy optimization.
zh

[NLP-49] IMProofBench: Benchmarking AI on Research-Level Mathematical Proof Generation

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在评估研究级数学任务时存在的基准不足问题,即现有评测集主要局限于最终答案型题目或高中竞赛题,难以反映模型在前沿数学研究中的推理与证明能力。解决方案的关键在于构建一个名为IMProofBench的私有基准,包含由专家数学家设计的39道同行评审问题,每道题均需生成详细证明,并配有可自动评分的子问题;同时,该评测环境模拟真实科研场景,允许模型在代理框架中调用工具(如网络搜索和SageMath数学软件),从而实现对LLMs数学推理能力的多维度、定量且贴近实际的研究级评估。

链接: https://arxiv.org/abs/2509.26076
作者: Johannes Schmitt,Gergely Bérczi,Jasper Dekoninck,Jeremy Feusi,Tim Gehrunger,Raphael Appenzeller,Jim Bryan,Niklas Canova,Timo de Wolff,Filippo Gaia,Michel van Garrel,Baran Hashemi,David Holmes,Aitor Iribar Lopez,Victor Jaeck,Martina Jørgensen,Steven Kelk,Stefan Kuhlmann,Adam Kurpisz,Chiara Meroni,Ingmar Metzler,Martin Möller,Samuel Muñoz-Echániz,Robert Nowak,Georg Oberdieck,Daniel Platt,Dylan Possamaï,Gabriel Ribeiro,Raúl Sánchez Galán,Zheming Sun,Josef Teichmann,Richard P. Thomas,Charles Vial
机构: ETH Zurich (苏黎世联邦理工学院); Aarhus University (奥胡斯大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As the mathematical capabilities of large language models (LLMs) improve, it becomes increasingly important to evaluate their performance on research-level tasks at the frontier of mathematical knowledge. However, existing benchmarks are limited, as they focus solely on final-answer questions or high-school competition problems. To address this gap, we introduce IMProofBench, a private benchmark consisting of 39 peer-reviewed problems developed by expert mathematicians. Each problem requires a detailed proof and is paired with subproblems that have final answers, supporting both an evaluation of mathematical reasoning capabilities by human experts and a large-scale quantitative analysis through automated grading. Furthermore, unlike prior benchmarks, the evaluation setup simulates a realistic research environment: models operate in an agentic framework with tools like web search for literature review and mathematical software such as SageMath. Our results show that current LLMs can succeed at the more accessible research-level questions, but still encounter significant difficulties on more challenging problems. Quantitatively, Grok-4 achieves the highest accuracy of 52% on final-answer subproblems, while GPT-5 obtains the best performance for proof generation, achieving a fully correct solution for 22% of problems. IMProofBench will continue to evolve as a dynamic benchmark in collaboration with the mathematical community, ensuring its relevance for evaluating the next generation of LLMs.
zh

[NLP-50] Limited Preference Data? Learning Better Reward Model with Latent Space Synthesis NEURIPS2025

链接: https://arxiv.org/abs/2509.26074
作者: Leitian Tao,Xuefeng Du,Yixuan Li
机构: University of Wisconsin-Madison (威斯康星大学麦迪逊分校); Nanyang Technological University (南洋理工大学)
类目: Computation and Language (cs.CL)
备注: Accepted by NeurIPS 2025

点击查看摘要

[NLP-51] The Silent Judge: Unacknowledged Shortcut Bias in LLM-as-a-Judge

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)作为自动评判者(LLM-as-a-judge)在评估系统输出时存在的可靠性问题,特别是其是否忠实于响应质量而忽略外部干扰因素。研究发现,现有LLM评判者会依赖提示中引入的表面线索(superficial cues),如来源身份(provenance cues)和时间新旧(recency cues),从而产生系统性偏差——例如更偏好“新”文本和“专家”来源,且极少承认这些线索对判断的影响。解决方案的关键在于揭示此类评判系统的“捷径依赖”(shortcut-prone)特性:即使控制其他变量不变,仅通过注入人为设计的提示线索即可显著改变评判结果,且评判理由几乎从不提及这些线索,而是将决策归因于内容质量。这一发现表明,当前LLM-as-a-judge机制缺乏透明性和忠实性,亟需改进以确保评估结果的客观性和可信度。

链接: https://arxiv.org/abs/2509.26072
作者: Arash Marioriyad,Mohammad Hossein Rohban,Mahdieh Soleymani Baghshah
机构: Sharif University of Technology (谢里夫理工大学)
类目: Computation and Language (cs.CL)
备注: 9 Pages, 5 Tables, 1 Figures

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed as automatic judges to evaluate system outputs in tasks such as summarization, dialogue, and creative writing. A faithful judge should base its verdicts solely on response quality and explicitly acknowledge the factors shaping its decision. We show that current LLM judges fail on both counts by relying on shortcuts introduced in the prompt. Our study uses two evaluation datasets: ELI5, a benchmark for long-form question answering, and LitBench, a recent benchmark for creative writing. Both datasets provide pairwise comparisons, where the evaluator must choose which of two responses is better. From each dataset we construct 100 pairwise judgment tasks and employ two widely used models, GPT-4o and Gemini-2.5-Flash, as evaluators in the role of LLM-as-a-judge. For each pair, we assign superficial cues to the responses, provenance cues indicating source identity (Human, Expert, LLM, or Unknown) and recency cues indicating temporal origin (Old, 1950 vs. New, 2025), while keeping the rest of the prompt fixed. Results reveal consistent verdict shifts: both models exhibit a strong recency bias, systematically favoring new responses over old, as well as a clear provenance hierarchy (Expert > Human > LLM > Unknown). These biases are especially pronounced in GPT-4o and in the more subjective and open-ended LitBench domain. Crucially, cue acknowledgment is rare: justifications almost never reference the injected cues, instead rationalizing decisions in terms of content qualities. These findings demonstrate that current LLM-as-a-judge systems are shortcut-prone and unfaithful, undermining their reliability as evaluators in both research and deployment.
zh
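论文的偏差测试做法是:在成对比较提示中给两条回答分别注入来源与时间线索,其余内容保持固定,再观察裁决是否翻转、解释中是否提及线索。下面是这种提示构造方式的极简示意,具体措辞为假设,仅说明实验设计。

```python
def build_judge_prompt(question, resp_a, resp_b,
                       cue_a="Source: Expert, written in 2025",
                       cue_b="Source: Unknown, written in 1950"):
    """在两条候选回答前分别附加来源/时间线索,其余提示保持不变(示意)。"""
    return (
        f"Question: {question}\n\n"
        f"Response A ({cue_a}):\n{resp_a}\n\n"
        f"Response B ({cue_b}):\n{resp_b}\n\n"
        "Which response is better? Answer 'A' or 'B' and explain the factors "
        "that shaped your decision."
    )
```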

[NLP-52] DyFlow: Dynamic Workflow Framework for Agentic Reasoning

【速读】: 该论文旨在解决基于大语言模型(Large Language Models, LLMs)的智能体系统在复杂推理任务中面临的工作流效率低、泛化能力差的问题。现有方法多依赖人工设计流程,缺乏跨任务适应性;部分自动化方法受限于特定数据集或查询类型,且未能充分利用中间反馈,导致推理深度不足和系统灵活性差。其解决方案的关键在于提出DyFlow框架,通过一个动态工作流生成机制实现任务驱动的自适应推理过程:该框架包含两个核心组件——“设计者”(designer)负责根据高阶目标分解问题并基于中间输出与反馈动态规划下一步操作;“执行器”(executor)则利用上下文感知参数化的动态操作符执行具体任务,从而实现语义合理且灵活的推理路径调整,显著提升跨领域泛化性能与推理质量。

链接: https://arxiv.org/abs/2509.26062
作者: Yanbo Wang,Zixiang Xu,Yue Huang,Xiangqi Wang,Zirui Song,Lang Gao,Chenxi Wang,Xiangru Tang,Yue Zhao,Arman Cohan,Xiangliang Zhang,Xiuying Chen
机构: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); University of Notre Dame; Yale University; University of Southern California
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Agent systems based on large language models (LLMs) have shown great potential in complex reasoning tasks, but building efficient and generalizable workflows remains a major challenge. Most existing approaches rely on manually designed processes, which limits their adaptability across different tasks. While a few methods attempt automated workflow generation, they are often tied to specific datasets or query types and make limited use of intermediate feedback, reducing system robustness and reasoning depth. Moreover, their operations are typically predefined and inflexible. To address these limitations, we propose DyFlow, a dynamic workflow generation framework that adaptively constructs and adjusts reasoning procedures based on task requirements and real-time intermediate feedback, thereby enhancing cross-task generalization. DyFlow consists of two core components: a designer and an executor. The designer decomposes complex problems into a sequence of sub-goals defined by high-level objectives and dynamically plans the next steps based on intermediate outputs and feedback. These plans are then carried out by the executor, which executes each operation using dynamic operators with context-aware parameterization, enabling flexible and semantically grounded reasoning. We systematically evaluate DyFlow across diverse domains, including social reasoning, biomedical tasks, mathematical problem solving, and code generation. Results demonstrate that DyFlow significantly outperforms existing baselines, achieving substantial Pass@k improvements and exhibiting robust generalization across diverse domains. The code is publicly available at this https URL.
zh

[NLP-53] CEAID: Benchmark of Multilingual Machine-Generated Text Detection Methods for Central European Languages

【速读】: 该论文旨在解决当前机器生成文本检测(Machine-generated Text Detection)研究中对非英语语言支持不足的问题,尤其是针对中欧地区语言的检测方法几乎空白的现状。现有检测模型主要基于英文训练,其跨语言迁移能力有限且未经充分验证。为填补这一空白,作者构建了首个面向中欧语言的检测基准,并系统评估了不同训练语言组合下的性能表现。关键解决方案在于:通过在中欧语言上进行监督微调(Supervised Fine-tuning),显著提升了检测模型在目标语言中的准确性和对抗扰动(obfuscation)下的鲁棒性,证明了本地化微调是提升多语言检测性能的核心策略。

链接: https://arxiv.org/abs/2509.26051
作者: Dominik Macko,Jakub Kopal
机构: Kempelen Institute of Intelligent Technologies (Kempelen智能技术研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Machine-generated text detection, as an important task, is predominantly focused on English in research. This makes the existing detectors almost unusable for non-English languages, relying purely on cross-lingual transferability. There exist only a few works focused on any of Central European languages, leaving the transferability towards these languages rather unexplored. We fill this gap by providing the first benchmark of detection methods focused on this region, while also providing comparison of train-languages combinations to identify the best performing ones. We focus on multi-domain, multi-generator, and multilingual evaluation, pinpointing the differences of individual aspects, as well as adversarial robustness of detection methods. Supervised finetuned detectors in the Central European languages are found the most performant in these languages as well as the most resistant against obfuscation.
zh

[NLP-54] RE-Searcher: Robust Agentic Search with Goal-oriented Planning and Self-reflection

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在现实世界部署中面临的三大挑战:知识截止(knowledge cutoff)、幻觉(hallucination)以及交互模态有限的问题。尽管通过引入外部搜索工具可缓解上述问题,但这也使代理(agent)暴露于复杂的搜索环境中,微小的查询表述差异可能导致推理偏离正轨并放大错误。为应对这一挑战,作者提出了一种名为RE-Searcher的简单而有效的搜索代理方法,其核心在于将目标导向型规划(goal-oriented planning)与自我反思(self-reflection)相结合:在搜索过程中,RE-Searcher首先明确具体搜索目标,并持续评估检索到的证据是否满足该目标。这种机制显著增强了对复杂搜索环境中虚假线索的鲁棒性,从而提升了搜索准确性与整体性能,在扰动实验中也表现出对噪声或误导性外部信号的强韧性。

链接: https://arxiv.org/abs/2509.26048
作者: Daocheng Fu,Jianbiao Mei,Licheng Wen,Xuemeng Yang,Cheng Yang,Rong Wu,Tao Hu,Siqi Li,Yufan Shen,Xinyu Cai,Pinlong Cai,Botian Shi,Yong Liu,Yu Qiao
机构: Fudan University (复旦大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Zhejiang University (浙江大学); Shanghai Innovation Institute (上海创新研究院); Shanghai Jiao Tong University (上海交通大学); Central South University (中南大学)
类目: Computation and Language (cs.CL)
备注: 16 pages, 7 figures

点击查看摘要

Abstract:Large language models (LLMs) excel at knowledge-intensive question answering and reasoning, yet their real-world deployment remains constrained by knowledge cutoff, hallucination, and limited interaction modalities. Augmenting LLMs with external search tools helps alleviate these issues, but it also exposes agents to a complex search environment in which small, plausible variations in query formulation can steer reasoning into unproductive trajectories and amplify errors. We present a systematic analysis that quantifies how environmental complexity induces fragile search behaviors and, in turn, degrades overall performance. To address this challenge, we propose a simple yet effective approach to instantiate a search agent, RE-Searcher. During search, RE-Searcher explicitly articulates a concrete search goal and subsequently reflects on whether the retrieved evidence satisfies that goal. This combination of goal-oriented planning and self-reflection enables RE-Searcher to resist spurious cues in complex search environments and perform robust search. Extensive experiments show that our method improves search accuracy and achieves state-of-the-art results. Perturbation studies further demonstrate substantial resilience to noisy or misleading external signals, mitigating the fragility of the search process. We believe these findings offer practical guidance for integrating LLM-powered agents into more complex interactive environments and enabling more autonomous decision-making.
zh

[NLP-55] Scaling Up Temporal Domain Generalization via Temporal Experts Averaging EMNLP2025

【速读】: 该论文旨在解决时间域泛化(Temporal Domain Generalization, TDG)问题,即模型在面对随时间变化的分布偏移(如词汇演变)时难以保持性能的问题。现有方法通常通过预测未来模型权重来实现泛化,但全模型预测计算成本过高,而仅预测分类层又限制了整体性能提升。其解决方案的关键在于提出一种名为Temporal Experts Averaging (TEA) 的新框架,通过两种核心策略增强泛化能力:一是利用约束微调生成功能多样但参数相似的专家模型(expert models),二是基于主成分子空间中对时间权重轨迹的建模,自适应地确定平均系数以优化偏差-方差权衡,从而实现高效且强大的跨时间段泛化。

链接: https://arxiv.org/abs/2509.26045
作者: Aoming Liu,Kevin Miller,Venkatesh Saligrama,Kate Saenko,Boqing Gong,Ser-Nam Lim,Bryan A. Plummer
机构: Boston University (波士顿大学); University of Central Florida (中佛罗里达大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by EMNLP 2025 main

点击查看摘要

Abstract:Temporal Domain Generalization (TDG) aims to generalize across temporal distribution shifts, e.g., lexical change over time. Prior work often addresses this by predicting future model weights. However, full model prediction is prohibitively expensive for even reasonably sized models. Thus, recent methods only predict the classifier layer, limiting generalization by failing to adjust other model components. To address this, we propose Temporal Experts Averaging (TEA), a novel and scalable TDG framework that updates the entire model using weight averaging to maximize generalization potential while minimizing computational costs. Our theoretical analysis guides us to two steps that enhance generalization to future domains. First, we create expert models with functional diversity yet parameter similarity by fine-tuning a domain-agnostic base model on individual temporal domains while constraining weight changes. Second, we optimize the bias-variance tradeoff through adaptive averaging coefficients derived from modeling temporal weight trajectories in a principal component subspace. Expert’s contributions are based on their projected proximity to future domains. Extensive experiments across 7 TDG benchmarks, 5 models, and 2 TDG settings shows TEA outperforms prior TDG methods by up to 69% while being up to 60x more efficient.
zh
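TEA 最终落在"对各时间域专家模型做整模型权重平均"上。下面给出按给定系数平均多个 `state_dict` 的极简 PyTorch 示意;至于系数如何从主成分子空间中的权重轨迹自适应得到,属于论文细节,这里直接作为输入假设给出。

```python
import torch

def average_experts(state_dicts, coeffs):
    """按系数对多个专家模型的参数做加权平均,得到面向未来时间域的权重(示意)。"""
    assert len(state_dicts) == len(coeffs) and abs(sum(coeffs) - 1.0) < 1e-6
    avg = {}
    for key in state_dicts[0]:
        avg[key] = sum(c * sd[key].float() for c, sd in zip(coeffs, state_dicts))
    return avg

# 用法示意:model.load_state_dict(average_experts([sd_2019, sd_2020, sd_2021], [0.2, 0.3, 0.5]))
```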

[NLP-56] Unspoken Hints: Accuracy Without Acknowledgement in LLM Reasoning

链接: https://arxiv.org/abs/2509.26041
作者: Arash Marioriyad,Shaygan Adim,Nima Alighardashi,Mahdieh Soleymani Banghshah,Mohammad Hossein Rohban
机构: Sharif University of Technology (谢里夫理工大学)
类目: Computation and Language (cs.CL)
备注: 5 Pages, 4 Figures, 4 Tables

点击查看摘要

[NLP-57] RE2: Improving Chinese Grammatical Error Correction via Retrieving Appropriate Examples with Explanation

链接: https://arxiv.org/abs/2509.26038
作者: Baoxin Wang,Yumeng Luo,Yixuan Wang,Dayong Wu,Wanxiang Che,Shijin Wang
机构: Harbin Institute of Technology (哈尔滨工业大学); iFLYTEK Research (科大讯飞研究院); Beijing Foreign Studies University (北京外国语大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-58] FITS: Towards an AI-Driven Fashion Information Tool for Sustainability ECAI2025

链接: https://arxiv.org/abs/2509.26017
作者: Daphne Theodorakopoulos,Elisabeth Eberling,Miriam Bodenheimer,Sabine Loos,Frederic Stahl
机构: AI Institute, Leibniz University Hannover (人工智能研究所,汉诺威大学); German Research Center for Artificial Intelligence (DFKI) (德国人工智能研究中心); Competence Center Sustainability and Infrastructure Systems, Fraunhofer ISI (可持续性与基础设施系统能力中心,弗劳恩霍夫 ISI); Faculty of Economics and Management, TU Berlin (经济与管理学院,柏林工业大学); Center for Responsible Research and Innovation, Fraunhofer IAO (负责任研究与创新中心,弗劳恩霍夫 IAO)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: accepted at ECAI 2025

点击查看摘要

[NLP-59] RAGferee: Building Contextual Reward Models for Retrieval-Augmented Generation EMNLP2025

【速读】: 该论文旨在解决现有奖励模型(Reward Models, RMs)在检索增强生成(Retrieval Augmented Generation, RAG)场景下性能不足的问题,具体表现为难以准确评估生成响应对检索上下文的忠实性(faithfulness)、与用户查询的相关性、在上下文不足时是否恰当拒绝回答、以及信息的完整性和简洁性。解决方案的关键在于提出 RAGferee 方法,该方法通过将通用问答(QA)数据集重构为偏好对(preference pairs),并优先强调答案的“有根性”(groundedness)而非风格特征,从而构建出专用于 RAG 场景的奖励模型训练数据。基于此,作者构建了一个包含 4000 样本的小规模偏好数据集,并微调了参数规模从 7B 到 24B 的 RMs,在 ContextualJudgeBench 上超越了使用高达 240 万样本通用语料训练的 70B+ 级别模型,绝对提升达 +15.5%。

链接: https://arxiv.org/abs/2509.26011
作者: Andrei C. Coman,Ionut-Teodor Sorodoc,Leonardo F. R. Ribeiro,Bill Byrne,James Henderson,Adrià de Gispert
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to EMNLP 2025 Main

点击查看摘要

Abstract:Existing Reward Models (RMs), typically trained on general preference data, struggle in Retrieval Augmented Generation (RAG) settings, which require judging responses for faithfulness to retrieved context, relevance to the user query, appropriate refusals when context is insufficient, completeness and conciseness of information. To address the lack of publicly available RAG-centric preference datasets and specialised RMs, we introduce RAGferee, a methodology that repurposes question-answering (QA) datasets into preference pairs that prioritise groundedness over stylistic features, enabling the training of contextual RMs better suited to judging RAG responses. Using RAGferee, we curate a small preference dataset of 4K samples and fine-tune RMs ranging from 7B to 24B parameters. Our RAG-centric RMs achieve state-of-the-art performance on ContextualJudgeBench, surpassing existing 70B+ RMs trained on much larger (up to 2.4M samples) general corpora, with an absolute improvement of +15.5%.
zh
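
【示例代码】以下为示意性草图(并非论文原始数据构建流程),用来说明“把问答数据改造为以事实有据(groundedness)优先的偏好对”这一思路;`build_preference_pair` 的字段组织与 chosen/rejected 的构造规则均为笔者假设,仅展示可供上下文奖励模型训练使用的数据格式。

```python
# 示意性草图:构造 (prompt, chosen, rejected) 偏好对,数据均为虚构。
def build_preference_pair(question, context, gold_answer, ungrounded_answer):
    prompt = (
        "请仅依据以下检索到的上下文回答问题;若上下文不足以回答,请明确拒答。\n"
        f"[上下文]\n{context}\n[问题]\n{question}"
    )
    # chosen:有据可依的回答;若上下文不足,则以拒答作为正例
    chosen = gold_answer if gold_answer else "根据提供的上下文,无法回答该问题。"
    # rejected:流畅但与上下文不符、凭空捏造的回答
    rejected = ungrounded_answer
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

pair = build_preference_pair(
    question="该公司2023年的营收是多少?",
    context="年报摘录:公司2023年营收为12.4亿元,同比增长8%。",
    gold_answer="根据上下文,2023年营收为12.4亿元。",
    ungrounded_answer="2023年营收约为20亿元,创历史新高。",
)
print(pair["chosen"])
```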

[NLP-60] CAST: Continuous and Differentiable Semi-Structured Sparsity-Aware Training for Large Language Models

链接: https://arxiv.org/abs/2509.25996
作者: Weiyu Huang,Yuezhou Hu,Jun Zhu,Jianfei Chen
机构: Tsinghua University (清华大学); Department of Computer Science and Technology (计算机科学与技术系); Institute for AI (人工智能研究院); BNRist Center (北京信息科学与技术国家研究中心); THBI Lab (清华大学脑与智能实验室); Tsinghua-Bosch Joint ML Center (清华-博世联合机器学习中心)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Submitted to IEEE TPAMI

点击查看摘要

[NLP-61] Reliability Crisis of Reference-free Metrics for Grammatical Error Correction EMNLP2025

链接: https://arxiv.org/abs/2509.25961
作者: Takumi Goto,Yusuke Sakai,Taro Watanabe
机构: Nara Institute of Science and Technology (奈良科学技术大学院大学)
类目: Computation and Language (cs.CL)
备注: EMNLP 2025 Findings

点击查看摘要

[NLP-62] RoRecomp: Enhancing Reasoning Efficiency via Rollout Response Recomposition in Reinforcement Learning

链接: https://arxiv.org/abs/2509.25958
作者: Gang Li,Yulei Qin,Xiaoyu Tan,Dingkang Yang,Yuchen Shi,Zihan Xu,Xiang Li,Xing Sun,Ke Li
机构: Tencent Youtu Lab (腾讯优图实验室); Fudan University (复旦大学); Nankai University (南开大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-63] Boosting Process-Correct CoT Reasoning by Modeling Solvability of Multiple-Choice QA

链接: https://arxiv.org/abs/2509.25941
作者: Raphael Schumann,Stefan Riezler
机构: Heidelberg University (海德堡大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-64] DeepJSONEval: Benchmarking Complex Nested JSON Data Mining for Large Language Models

【速读】: 该论文旨在解决互联网信息中低密度、高冗余内容(如社交媒体评论、重复新闻和长篇讨论)难以高效提取有价值洞察的问题。其核心解决方案是利用多层嵌套的JSON结构,将非结构化文本压缩为语义丰富、层次分明的表示形式,从而保留上下文关系并提升存储、检索与语义查询效率。关键创新在于提出DeepJSONEval基准测试,包含2100个跨领域、具有深度嵌套结构的实例,并按难度分类,以更真实地评估大语言模型(Large Language Models, LLMs)在实际网络数据挖掘任务中的数据理解与结构化抽取能力,而非仅关注纯JSON生成性能。

链接: https://arxiv.org/abs/2509.25922
作者: Zhicheng Zhou,Jing Li,Suming Qiu,Junjie Huang,Linyuan Qiu,Zhijie Sun
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The internet is saturated with low-density, high-redundancy information, such as social media comments, repetitive news, and lengthy discussions, making it difficult to extract valuable insights efficiently. Multi-layer nested JSON structures provide an effective solution by compressing such information into semantically rich, hierarchical representations, which organize data into key-value pairs, arrays, and nested objects, preserving contextual relationships and enabling efficient storage, retrieval, and semantic querying. For instance, in news aggregation, a JSON object can nest an article’s metadata (title, author, date), content (text, multimedia), and multimedia information (multimedia type, caption) hierarchically. Large Language Models (LLMs) play a transformative role in web data mining by parsing unstructured text and outputting structured results directly into complex JSON schemas. However, current benchmarks for evaluating LLMs’ JSON output capabilities overemphasize pure JSON generation rather than assessing data comprehension and extraction abilities, a limitation that lacks relevance to practical web data mining tasks. To address this, we introduce DeepJSONEval, a novel benchmark featuring 2100 multi-domain instances with deep nested structures, categorized by difficulty. Experiments show significant performance gaps among LLMs in handling such complexity. Our benchmark and datasets are open-sourced to advance research in structured JSON generation.(this https URL).
zh
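
【示例代码】按摘要中“新闻聚合”的描述,下面给出一个多层嵌套 JSON 对象的示意(字段名与取值均为虚构),用以直观展示此类基准所考察的深层嵌套结构。

```python
# 示意性示例:新闻文章的多层嵌套JSON表示
import json

article = {
    "metadata": {"title": "示例新闻标题", "author": "张三", "date": "2025-09-30"},
    "content": {
        "text": "正文内容……",
        "multimedia": [
            {"multimedia_type": "image", "caption": "配图说明"},
            {"multimedia_type": "video", "caption": "现场视频"},
        ],
    },
}

# 评测时通常要求模型从非结构化文本中抽取信息并直接输出符合该schema的JSON
print(json.dumps(article, ensure_ascii=False, indent=2))
```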

[NLP-65] Bringing Emerging Architectures to Sequence Labeling in NLP

链接: https://arxiv.org/abs/2509.25918
作者: Ana Ezquerro,Carlos Gómez-Rodríguez,David Vilares
机构: Universidade da Coruña (拉科鲁尼亚大学); CITIC (信息与计算机科学研究中心)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-66] VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs

【速读】: 该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在细粒度感知任务中因依赖语言中心架构生成精确数值坐标而导致性能下降的问题。其核心挑战在于,传统方法将对象定位视为一个脆弱的坐标生成任务,而忽略了视觉区域的语义与空间信息融合。解决方案的关键在于提出VLM-FO1框架,通过将对象中心感知重构为鲁棒的特征检索任务,并引入混合细粒度区域编码器(Hybrid Fine-grained Region Encoder, HFRE),利用双视觉编码器生成富含语义和空间细节的区域标记(region tokens),再结合基于标记的引用机制,使大语言模型(LLM)能够无缝地在特定视觉区域内进行推理和语言锚定。该方法无需修改预训练VLM主体结构,即可显著提升对象定位、区域生成理解与视觉区域推理能力,同时保持原模型的通用视觉理解性能。

链接: https://arxiv.org/abs/2509.25916
作者: Peng Liu,Haozhan Shen,Chunxin Fang,Zhicheng Sun,Jiajia Liao,Tiancheng Zhao
机构: Om AI Research; Binjiang Institute of Zhejiang University (浙江大学滨江研究院); College of Computer Science and Technology, Zhejiang University (浙江大学计算机科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 22 pages

点击查看摘要

Abstract:Vision-Language Models (VLMs) excel at high-level scene understanding but falter on fine-grained perception tasks requiring precise localization. This failure stems from a fundamental mismatch, as generating exact numerical coordinates is a challenging task for language-centric architectures. In this paper, we introduce VLM-FO1, a novel framework that overcomes this limitation by reframing object-centric perception from a brittle coordinate generation problem into a robust feature retrieval task. Our method operates as a plug-and-play module that integrates with any pre-trained VLM. It leverages a Hybrid Fine-grained Region Encoder (HFRE), featuring a dual vision encoder, to generate powerful region tokens rich in both semantic and spatial detail. A token-based referencing system then enables the LLM to seamlessly reason about and ground language in these specific visual regions. Experiments show that VLM-FO1 achieves state-of-the-art performance across a diverse suite of benchmarks, demonstrating exceptional capabilities in object grounding, region generational understanding, and visual region reasoning. Crucially, our two-stage training strategy ensures that these perception gains are achieved without compromising the base model’s general visual understanding capabilities. VLM-FO1 establishes an effective and flexible paradigm for building perception-aware VLMs, bridging the gap between high-level reasoning and fine-grained visual grounding.
zh

[NLP-67] Understanding the Mixture-of-Experts with Nadaraya-Watson Kernel

链接: https://arxiv.org/abs/2509.25913
作者: Chuanyang Zheng,Jiankai Sun,Yihang Gao,Enze Xie,Yuehao Wang,Peihao Wang,Ting Xu,Matthew Chang,Liliang Ren,Jingyao Li,Jing Xiong,Kashif Rasul,Mac Schwager,Anderson Schneider,Zhangyang Wang,Yuriy Nevmyvaka
机构: Morgan Stanley (摩根士丹利); Stanford (斯坦福大学); NUS (新加坡国立大学); Nvidia (英伟达); UT Austin (德克萨斯大学奥斯汀分校); CUHK (香港中文大学); Meta (Meta); Microsoft Research (微软研究院); HKU (香港大学)
类目: Computation and Language (cs.CL)
备注: Tech Report

点击查看摘要

[NLP-68] Mem-α: Learning Memory Construction via Reinforcement Learning

链接: https://arxiv.org/abs/2509.25911
作者: Yu Wang,Ryuichi Takanobu,Zhiqi Liang,Yuzhen Mao,Yuanzhe Hu,Julian McAuley,Xiaojian Wu
机构: Anuttacon; University of California San Diego (加州大学圣地亚哥分校); Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-69] PerQ: Efficient Evaluation of Multilingual Text Personalization Quality

链接: https://arxiv.org/abs/2509.25903
作者: Dominik Macko,Andrew Pulver
机构: Kempelen Institute of Intelligent Technologies (Kempelen智能技术研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[NLP-70] RoleConflictBench: A Benchmark of Role Conflict Scenarios for Evaluating LLMs’ Contextual Sensitivity

链接: https://arxiv.org/abs/2509.25897
作者: Jisu Shin,Hoyun Song,Juhyun Oh,Changgeon Ko,Eunsu Kim,Chani Jung,Alice Oh
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

[NLP-71] A Multimodal LLM Approach for Visual Question Answering on Multiparametric 3D Brain MRI

链接: https://arxiv.org/abs/2509.25889
作者: Arvind Murari Vepa,Yannan Yu,Jingru Gan,Anthony Cuturrufo,Weikai Li,Wei Wang,Fabien Scalzo,Yizhou Sun
机构: UCLA(加州大学洛杉矶分校); UCSF(加州大学旧金山分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 23 pages, 3 figures

点击查看摘要

[NLP-72] ASR Under Noise: Exploring Robustness for Sundanese and Javanese

链接: https://arxiv.org/abs/2509.25878
作者: Salsabila Zahirah Pranida,Muhammad Cendekia Airlangga,Rifo Ahmad Genadi,Shady Shehata
机构: MBZUAI (Mohamed Bin Zayed University of Artificial Intelligence); University of Waterloo (滑铁卢大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-73] Lita: Light Agent Uncovers the Agentic Coding Capabilities of LLMs

链接: https://arxiv.org/abs/2509.25873
作者: Hankun Dai,Maoquan Wang,Mengnan Qi,Yikai Zhang,Zijian Jin,Yongqiang Yao,Yufan Huang,Shengyu Fu,Elsie Nallipogu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Programming Languages (cs.PL); Software Engineering (cs.SE)
备注:

点击查看摘要

[NLP-74] ReFACT: A Benchmark for Scientific Confabulation Detection with Positional Error Annotations

链接: https://arxiv.org/abs/2509.25868
作者: Yindong Wang,Martin Preiß,Margarita Bugueño,Jan Vincent Hoffbauer,Abdullatif Ghajar,Tolga Buz,Gerard de Melo
机构: Hasso Plattner Institute (HPI), University of Potsdam (波茨坦大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-75] Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation

链接: https://arxiv.org/abs/2509.25849
作者: Ziniu Li,Congliang Chen,Tianyun Yang,Tian Ding,Ruoyu Sun,Ge Zhang,Wenhao Huang,Zhi-Quan Luo
机构: ByteDance(字节跳动)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-76] Believing without Seeing: Quality Scores for Contextualizing Vision-Language Model Explanations

链接: https://arxiv.org/abs/2509.25844
作者: Keyu He,Tejas Srinivasan,Brihi Joshi,Xiang Ren,Jesse Thomason,Swabha Swayamdipta
机构: Carnegie Mellon University (卡内基梅隆大学); University of Southern California (南加州大学)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

[NLP-77] Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling

链接: https://arxiv.org/abs/2509.25827
作者: Shuyang Jiang,Yusheng Liao,Ya Zhang,Yanfeng Wang,Yu Wang
机构: Fudan University (复旦大学); School of Artificial Intelligence, Shanghai Jiao Tong University (上海交通大学人工智能学院); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 26 pages

点击查看摘要

[NLP-78] VELA: An LLM-Hybrid-as-a-Judge Approach for Evaluating Long Image Captions EMNLP2025

链接: https://arxiv.org/abs/2509.25818
作者: Kazuki Matsuda,Yuiga Wada,Shinnosuke Hirano,Seitaro Otsuki,Komei Sugiura
机构: Keio University (庆应义塾大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: EMNLP 2025 Main Conference

点击查看摘要

[NLP-79] Personalized Scientific Figure Caption Generation: An Empirical Study on Author-Specific Writing Style Transfer

【速读】: 该论文旨在解决个性化图文描述生成(personalized figure caption generation)问题,即如何利用作者个人资料(author profile data)提升多模态大语言模型在生成图表描述时的个性化程度。其解决方案的关键在于:整合丰富的作者个人资料与相关元数据(metadata),以增强模型对作者写作风格的匹配能力;但同时发现,过度追求风格匹配会损害描述质量,揭示了风格适配与内容质量之间的根本权衡关系。这一发现为构建兼顾个性化与高质量输出的自动化图文描述系统提供了重要方向。

链接: https://arxiv.org/abs/2509.25817
作者: Jaeyoung Kim,Jongho Lee,Hongjun Choi,Sion Jang
机构: Teamreboott Inc.(Teamreboott 公司); MIRI D.I.H Inc.(MIRI D.I.H 公司)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We study personalized figure caption generation using author profile data from scientific papers. Our experiments demonstrate that rich author profile data, combined with relevant metadata, can significantly improve the personalization performance of multimodal large language models. However, we also reveal a fundamental trade-off between matching author style and maintaining caption quality. Our findings offer valuable insights and future directions for developing practical caption automation systems that balance both objectives. This work was conducted as part of the 3rd SciCap challenge.
zh

[NLP-80] ReTAG: Retrieval-Enhanced Topic-Augmented Graph-Based Global Sensemaking EMNLP2025

【速读】: 该论文旨在解决全局理解(global sensemaking)任务中的关键挑战,即如何通过整合整个语料库的信息来回答复杂问题,同时克服现有基于图的方法在检索机制缺失、主题特异性不足以及推理成本过高的局限性。其解决方案的核心在于提出ReTAG框架——一种检索增强且主题增强的图结构方法,该方法通过构建主题特定的子图并检索相关摘要信息,从而在提升回答质量的同时显著降低推理时间。

链接: https://arxiv.org/abs/2509.25814
作者: Boyoung Kim,Dosung Lee,Sumin An,Jinseong Jeong,Paul Hongsuck Seo
机构: Korea University (韩国大学)
类目: Computation and Language (cs.CL)
备注: 9 pages, 5 figures, EMNLP 2025 Findings

点击查看摘要

Abstract:Recent advances in question answering have led to substantial progress in tasks such as multi-hop reasoning. However, global sensemaking-answering questions by synthesizing information from an entire corpus remains a significant challenge. A prior graph-based approach to global sensemaking lacks retrieval mechanisms, topic specificity, and incurs high inference costs. To address these limitations, we propose ReTAG, a Retrieval-Enhanced, Topic-Augmented Graph framework that constructs topic-specific subgraphs and retrieves the relevant summaries for response generation. Experiments show that ReTAG improves response quality while significantly reducing inference time compared to the baseline. Our code is available at this https URL.
zh
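
【示例代码】以下为示意性草图(非论文官方代码),演示“先按主题组织子图摘要、再检索与问题相关的摘要用于作答”的流程;相似度此处用简单词重叠近似(真实系统应为向量检索),`retrieve_topic_summaries`、`build_prompt` 等函数名与数据组织方式均为笔者假设。

```python
# 示意性草图:主题增强的摘要检索 + 作答提示构造
def overlap_score(query, text):
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / (len(q) + 1e-8)

def retrieve_topic_summaries(query, topic_to_summaries, top_k=3):
    """topic_to_summaries: {主题: [该主题子图生成的摘要, ...]}"""
    scored = [
        (overlap_score(query, s), topic, s)
        for topic, summaries in topic_to_summaries.items()
        for s in summaries
    ]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [(topic, s) for _, topic, s in scored[:top_k]]

def build_prompt(query, retrieved):
    ctx = "\n".join(f"[{topic}] {s}" for topic, s in retrieved)
    return f"请基于以下主题摘要回答问题:\n{ctx}\n问题:{query}"
```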

[NLP-81] RoBiologyDataChoiceQA: A Romanian Dataset for improving Biology understanding of Large Language Models

链接: https://arxiv.org/abs/2509.25813
作者: Dragos-Dumitru Ghinea,Adela-Nicoleta Corbeanu,Adrian-Marius Dumitran
机构: University of Bucharest (布加勒斯特大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

[NLP-82] Learning to Reason as Action Abstractions with Scalable Mid-Training RL

链接: https://arxiv.org/abs/2509.25810
作者: Shenao Zhang,Donghan Yu,Yihao Feng,Bowen Jin,Zhaoran Wang,John Peebles,Zirui Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

点击查看摘要

[NLP-83] Assessing Algorithmic Bias in Language-Based Depression Detection: A Comparison of DNN and LLM Approaches ALT

链接: https://arxiv.org/abs/2509.25795
作者: Obed Junias,Prajakta Kini,Theodora Chaspari
机构: 未知
类目: Computation and Language (cs.CL)
备注: 7 pages, 1 figure. This paper has been accepted to the IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI 2025), Georgia Institute of Technology, Atlanta, Georgia, October 26-29, 2025

点击查看摘要

[NLP-84] V-HUB: A Visual-Centric Humor Understanding Benchmark for Video LLMs

链接: https://arxiv.org/abs/2509.25773
作者: Zhengpeng Shi,Hengli Li,Yanpeng Zhao,Jianqun Zhou,Yuxuan Wang,Qinrong Cui,Wei Bi,Songchun Zhu,Bo Zhao,Zilong Zheng
机构: Shanghai Jiao Tong University (上海交通大学); Peking University (北京大学); Wuhan University (武汉大学); Beijing Institute for General Artificial Intelligence (北京通用人工智能研究院); Independent Researcher (独立研究员)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-85] TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在知识密集型任务中易产生幻觉(hallucination)和不真实回答的问题,尤其是在模型参数知识范围之外的任务场景下。现有方法往往在准确性和不确定性识别之间难以平衡:单纯优化准确率的方法会加剧幻觉,而鼓励模型在不确定时放弃回答的方法则可能过度保守,导致正确答案的丢失。解决方案的关键在于提出一种名为TruthRL的强化学习(Reinforcement Learning, RL)框架,其核心创新是采用一种简洁有效的三元奖励机制(ternary reward),明确区分正确回答、幻觉和弃权三种行为,并通过GRPO算法直接优化模型的“真实性”目标。该设计促使模型不仅通过提供正确答案来提升性能,还鼓励其在不确定时主动 abstain(弃权),从而实现事实准确性与不确定性识别之间的有效权衡,显著降低幻觉率并提升整体真实性表现。

链接: https://arxiv.org/abs/2509.25760
作者: Zhepei Wei,Xiao Yang,Kai Sun,Jiaqi Wang,Rulin Shao,Sean Chen,Mohammad Kachuee,Teja Gollapudi,Tony Liao,Nicolas Scheffer,Rakesh Wanga,Anuj Kumar,Yu Meng,Wen-tau Yih,Xin Luna Dong
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand information outside their parametric knowledge. Indeed, truthfulness requires more than accuracy – models must also recognize uncertainty and abstain when unsure to avoid hallucinations. This presents a fundamental challenge for existing methods: approaches that optimize for accuracy often amplify hallucinations, while those that encourage abstention can become overly conservative, sacrificing correct answers. Both extremes ultimately compromise truthfulness. In this work, we present TruthRL, a general reinforcement learning (RL) framework that directly optimizes the truthfulness of LLMs. Specifically, we implement TruthRL using GRPO with a simple yet effective ternary reward that distinguishes correct answers, hallucinations, and abstentions. It incentivizes models to reduce hallucinations not only by providing correct responses, but also by enabling abstention when uncertain, thereby improving truthfulness. Extensive experiments across four knowledge-intensive benchmarks show that, compared to vanilla RL, TruthRL significantly reduces hallucinations by 28.9% and improves truthfulness by 21.1%, with consistent gains across various backbone models (e.g., Qwen, Llama) under both retrieval and non-retrieval setups. In-depth ablation study demonstrates that vanilla accuracy-driven methods, such as supervised fine-tuning or RL with a binary reward, struggle to balance factual correctness and uncertainty. In contrast, our proposed truthfulness-driven TruthRL achieves strong performance in both accuracy and truthfulness, underscoring the importance of learning objective design for developing truthful LLMs.
zh
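
【示例代码】摘要中提到的“三元奖励”可以用如下极简函数示意:正确回答得正分、明确拒答得零分、其余情况视为幻觉得负分。这是按摘要思路写的示意草图,具体分值、答案匹配与拒答判定规则均为笔者假设,并非论文的真实实现。

```python
# 示意性草图:区分"正确 / 拒答 / 幻觉"的三元奖励
def ternary_reward(response: str, gold_answer: str, is_abstention) -> float:
    """is_abstention(response) 判断模型是否明确表示无法回答。"""
    if is_abstention(response):
        return 0.0            # 拒答:不奖励也不惩罚,鼓励不确定时不强行作答
    if gold_answer.strip().lower() in response.strip().lower():
        return 1.0            # 正确回答
    return -1.0               # 其余情况视为幻觉/错误,给予负奖励

# 用法示例(拒答判定为极简假设):
is_abstention = lambda r: "无法回答" in r or "not sure" in r.lower()
print(ternary_reward("巴黎是法国的首都", "巴黎", is_abstention))     # 1.0
print(ternary_reward("我不确定,无法回答。", "巴黎", is_abstention))  # 0.0
```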

[NLP-86] NePTune: A Neuro-Pythonic Framework for Tunable Compositional Reasoning on Vision-Language

链接: https://arxiv.org/abs/2509.25757
作者: Danial Kamali,Parisa Kordjamshidi
机构: Michigan State University (密歇根州立大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Symbolic Computation (cs.SC)
备注:

点击查看摘要

[NLP-87] Detecting Hope Across Languages: Multiclass Classification for Positive Online Discourse

链接: https://arxiv.org/abs/2509.25752
作者: T. O.Abiola,K. D. Abiodun,O. E. Olumide,O. O. Adebanji,O. Hiram Calvo,Grigori Sidorov
机构: Instituto Politécnico Nacional, Centro de Investigación en Computación (国家理工学院,计算研究中心); Ekiti State University (埃基蒂州立大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

[NLP-88] FinCap: Topic-Aligned Captions for Short-Form Financial YouTube Videos ICCV

链接: https://arxiv.org/abs/2509.25745
作者: Siddhant Sukhani,Yash Bhardwaj,Riya Bhadani,Veer Kejriwal,Michael Galarnyk,Sudheer Chava
机构: Stanford University (斯坦福大学); Georgia Institute of Technology (佐治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
备注: ICCV Short Video Understanding Workshop Paper

点击查看摘要

[NLP-89] Rotation Control Unlearning: Quantifying and Controlling Continuous Unlearning for LLM with The Cognitive Rotation Space

链接: https://arxiv.org/abs/2509.25743
作者: Xiang Zhang,Kun Wei,Xu Yang,Chenghao Xu,Su Yan,Cheng Deng
机构: Xidian University (西安电子科技大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-90] Think Less, Label Better: Multi-Stage Domain-Grounded Synthetic Data Generation for Fine-Tuning Large Language Models in Telecommunications

链接: https://arxiv.org/abs/2509.25736
作者: Chenhua Shi,Gregor Macdonald,Bhavika Jalli,Wanlu Lei,John Zou,Mridul Jain,Joji Philip
机构: Ericsson(爱立信)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI)
备注: 6 pages, 6 figures, 5 tables

点击查看摘要

[NLP-91] CATCH: A Novel Data Synthesis Framework for High Therapy Fidelity and Memory-Driven Planning Chain of Thought in AI Counseling EMNLP2025

链接: https://arxiv.org/abs/2509.25733
作者: Mingyu Chen,Jingkai Lin,Zhaojie Chu,Xiaofen Xing,Yirong Chen,Xiangmin Xu
机构: South China University of Technology (华南理工大学); Foshan University (佛山大学)
类目: Computation and Language (cs.CL)
备注: To be published in EMNLP 2025 Findings

点击查看摘要

[NLP-92] Controlled Generation for Private Synthetic Text EMNLP2025

链接: https://arxiv.org/abs/2509.25729
作者: Zihao Zhao,Anjalie Field
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: EMNLP 2025

点击查看摘要

[NLP-93] Atomic Thinking of LLM s: Decoupling and Exploring Mathematical Reasoning Abilities

链接: https://arxiv.org/abs/2509.25725
作者: Jiayi Kuang,Haojing Huang,Yinghui Li,Xinnian Liang,Zhikun Xu,Yangning Li,Xiaoyu Tan,Chao Qu,Meishan Zhang,Ying Shen,Philip S. Yu
机构: Sun Yat-sen University (中山大学); Tsinghua University (清华大学); ByteDance Inc. (字节跳动); Arizona State University (亚利桑那州立大学); Tencent Youtu Lab (腾讯优图实验室); Fudan University (复旦大学); Harbin Institute of Technology (深圳) (哈尔滨工业大学(深圳)); Guangdong Provincial Key Laboratory of Fire Science and Intelligent Emergency Technology (广东省火灾科学与智能应急技术重点实验室); University of Illinois Chicago (芝加哥伊利诺伊大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-94] Importance Sampling for Multi-Negative Multimodal Direct Preference Optimization

链接: https://arxiv.org/abs/2509.25717
作者: Xintong Li,Chuhan Wang,Junda Wu,Rohan Surana,Tong Yu,Julian McAuley,Jingbo Shang
机构: University of California, San Diego (加州大学圣地亚哥分校); Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Preprint

点击查看摘要

[NLP-95] MuPlon: Multi-Path Causal Optimization for Claim Verification through Controlling Confounding

链接: https://arxiv.org/abs/2509.25715
作者: Hanghui Guo,Shimin Di,Pasquale De Meo,Zhangze Chen,Jia Zhu
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Methodology (stat.ME)
备注: 8 pages

点击查看摘要

[NLP-96] Can VLM Pseudo-Labels Train a Time-Series QA Model That Outperforms the VLM?

【速读】: 该论文旨在解决时间序列问答(Time-series Question Answering, TSQA)任务中因标注数据稀缺而导致模型性能受限的问题。其解决方案的关键在于利用视觉语言模型(Vision-Language Model, VLM)生成伪标签(pseudo labels),并通过深度神经网络对噪声标签的天然鲁棒性,实现基于大量未标注数据的有效训练。实验表明,该方法不仅成功训练出TSQA模型,且性能优于VLM本身。

链接: https://arxiv.org/abs/2509.25696
作者: Takuya Fujimura,Kota Dohi,Natsuo Yamashita,Yohei Kawaguchi
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Time-series question answering (TSQA) tasks face significant challenges due to the lack of labeled data. Alternatively, with recent advancements in large-scale models, vision-language models (VLMs) have demonstrated the potential to analyze time-series signals in a zero-shot manner. In this paper, we propose a training approach that uses pseudo labels generated by a VLM. Although VLMs can produce incorrect labels, TSQA models can still be effectively trained based on the property that deep neural networks are inherently robust to such noisy labels. Our experimental results demonstrate that TSQA models are not only successfully trained with pseudo labels, but also surpass the performance of the VLM itself by leveraging a large amount of unlabeled data.
zh
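
【示例代码】下面是一个示意性草图,说明“用 VLM 为未标注时间序列生成伪标签,再以常规监督方式训练 TSQA 模型”的数据准备过程;`vlm_label` 仅为占位函数(真实实现应将序列渲染成图并调用视觉-语言模型),其余细节亦为笔者假设。

```python
# 示意性草图:VLM伪标签 -> 监督训练数据
import numpy as np

def vlm_label(series: np.ndarray, question: str) -> int:
    """占位:真实实现中应把时间序列画成折线图并询问VLM,返回答案类别。"""
    return int(series[-1] > series[0])   # 这里用简单规则伪造一个(可能带噪的)标签

def make_pseudo_dataset(unlabeled_series, question):
    X, y = [], []
    for s in unlabeled_series:
        X.append(s)
        y.append(vlm_label(s, question))  # 伪标签可能出错,但深度网络对噪声标签有一定鲁棒性
    return np.stack(X), np.array(y)

# 之后即可用 (X, y) 训练一个小型TSQA分类器(如一维卷积网络),并有望超过VLM本身
```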

[NLP-97] LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts

链接: https://arxiv.org/abs/2509.25684
作者: Yuan Zhuang,Yi Shen,Yuexin Bian,Qing Su,Shihao Ji,Yuanyuan Shi,Fei Miao
机构: University of Connecticut (康涅狄格大学); University of Pennsylvania (宾夕法尼亚大学); University of California San Diego (加州大学圣地亚哥分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[NLP-98] Mitigating Biases in Language Models via Bias Unlearning EMNLP2025

链接: https://arxiv.org/abs/2509.25673
作者: Dianqing Liu,Yi Liu,Guoqing Jin,Zhendong Mao
机构: University of Science and Technology of China (中国科学技术大学); State Key Laboratory of Communication Content Cognition, People’s Daily Online (人民日报社网络内容认知国家重点实验室)
类目: Computation and Language (cs.CL)
备注: EMNLP 2025 Main Conference

点击查看摘要

[NLP-99] The Flaw of Averages: Quantifying Uniformity of Performance on Benchmarks

链接: https://arxiv.org/abs/2509.25671
作者: Arda Uzunoglu,Tianjian Li,Daniel Khashabi
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注:

点击查看摘要

[NLP-100] Nudging the Boundaries of LLM Reasoning

【速读】: 该论文旨在解决当前在线强化学习(Online Reinforcement Learning, RL)算法在大语言模型(Large Language Model, LLM)推理任务中无法从“不可解样本”中学习的问题。现有方法如GRPO仅能提升模型对可解问题的性能,而无法改变其推理能力的上限,因为这些难样本不产生奖励信号,导致梯度缺失。解决方案的关键在于提出NuRL(Nudging Reinforcement Learning),通过模型自动生成抽象提示(hint),即高阶知识线索,来降低问题难度并引入训练信号。具体而言,对于初始通过率为0%的硬样本,NuRL注入由模型生成的提示并重新采样轨迹,从而将原本不可解的问题转化为可训练样本,同时保持分布一致性且无需外部模型支持。这一机制不仅提升了模型在多个基准上的表现,还能实质性提高其推理上限,突破传统RL方法的瓶颈。

链接: https://arxiv.org/abs/2509.25666
作者: Justin Chih-Yao Chen,Becky Xiangyu Peng,Prafulla Kumar Choubey,Kung-Hsiang Huang,Jiaxin Zhang,Mohit Bansal,Chien-Sheng Wu
机构: Salesforce AI Research (Salesforce人工智能研究中心); UNC Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Code release in preparation

点击查看摘要

Abstract:Current online reinforcement learning (RL) algorithms like GRPO share a key limitation in LLM reasoning: they cannot learn from problems that are “unsolvable” to the model. In other words, they can only improve performance on problems where the model is capable of exploring the correct answer. Consequently, the model’s “upper limit” remains unchanged after RL training, even though the likelihood of solving easier, solvable problems may increase. These hard samples cannot contribute to training, as no rollouts yield rewards and thus no gradients are produced. To unlock learning from these hard samples, we propose NuRL, a “nudging” method that aims to push the upper bound of LLM reasoning using self-generated hints, i.e., abstract cues that help reduce the problem difficulty for the model. Given a question and its gold answer, the model generates a CoT and then produces a hint containing the core knowledge needed to solve the problem. During training, we generate G rollouts from the base policy and use the pass rate to decide whether the hint should be injected. For hard samples with a 0% pass rate, we inject the hint and regenerate a new batch of trajectories. This yields two benefits: (1) the hint boosts pass rates (from 0% to non-zero), thereby introducing training signals for previously unsolvable samples, and (2) the hints are self-generated, avoiding distributional shift and do not rely on external models. NuRL achieves consistent improvements across 6 benchmarks and 3 models, while remaining complementary to test-time scaling. Notably, NuRL can raise the model’s upper limit, whereas GRPO leaves pass@1024 unchanged from the base model. Furthermore, we present a systematic study of what makes an effective hint and when hints are most useful. Interestingly, the best hints are abstract and high-level, and are most beneficial when applied necessarily and after GRPO has converged.
zh
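
【示例代码】按摘要描述,NuRL 的核心操作是:当某道题在 G 次 rollout 中通过率为 0 时,把模型自生成的抽象提示注入问题后重新采样。下面给出一个示意性草图,其中 `policy.generate`、`make_hint` 以及“答案包含判定”均为占位假设,非论文真实接口。

```python
# 示意性草图:通过率为0的难样本注入自生成提示后重采样
def collect_rollouts(policy, prompt, answer, G=8):
    rollouts = [policy.generate(prompt) for _ in range(G)]
    rewards = [1.0 if answer in r else 0.0 for r in rollouts]
    return rollouts, rewards

def rollouts_with_nudging(policy, question, answer, make_hint, G=8):
    rollouts, rewards = collect_rollouts(policy, question, answer, G)
    if sum(rewards) == 0:                       # "不可解"难样本:无任何奖励信号
        hint = make_hint(question, answer)      # 模型自生成的高层抽象提示(占位)
        nudged = f"{question}\n提示:{hint}"
        rollouts, rewards = collect_rollouts(policy, nudged, answer, G)
    return rollouts, rewards                    # 交给GRPO等算法计算优势与梯度
```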

[NLP-101] QFrBLiMP: a Quebec-French Benchmark of Linguistic Minimal Pairs LREC2026

链接: https://arxiv.org/abs/2509.25664
作者: David Beauchemin,Pier-Luc Veilleux,Richard Khoury,Johanna-Pascale Roy
机构: 未知
类目: Computation and Language (cs.CL)
备注: Submitted to LREC 2026

点击查看摘要

[NLP-102] The Media Bias Detector: A Framework for Annotating and Analyzing the News at Scale

【速读】: 该论文旨在解决如何在大规模范围内量化新闻媒体在选题和议题框架上的隐性偏见(selection and framing bias)这一挑战。其解决方案的关键在于构建了一个从2024年1月1日至今持续更新的近实时数据集与计算框架,该框架结合大规模语言模型(large language models, LLMs)与可扩展的实时新闻抓取技术,对每日数百篇新闻文章进行结构化标注,包括政治倾向、语调、主题、文章类型及重大事件等维度,并在句子层、文章层和出版商层三个粒度上进行分析。这一方法不仅提供了可重复的研究范式,还通过交互式网页平台支持学者便捷探索数据,从而为媒体偏见研究提供实证资源并推动媒体问责机制建设。

链接: https://arxiv.org/abs/2509.25649
作者: Samar Haider,Amir Tohidi,Jenny S. Wang,Timothy Dörr,David M. Rothschild,Chris Callison-Burch,Duncan J. Watts
机构: University of Pennsylvania (宾夕法尼亚大学); Harvard Business School (哈佛商学院); Microsoft Research (微软研究院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Mainstream news organizations shape public perception not only directly through the articles they publish but also through the choices they make about which topics to cover (or ignore) and how to frame the issues they do decide to cover. However, measuring these subtle forms of media bias at scale remains a challenge. Here, we introduce a large, ongoing (from January 1, 2024 to present), near real-time dataset and computational framework developed to enable systematic study of selection and framing bias in news coverage. Our pipeline integrates large language models (LLMs) with scalable, near-real-time news scraping to extract structured annotations – including political lean, tone, topics, article type, and major events – across hundreds of articles per day. We quantify these dimensions of coverage at multiple levels – the sentence level, the article level, and the publisher level – expanding the ways in which researchers can analyze media bias in the modern news landscape. In addition to a curated dataset, we also release an interactive web platform for convenient exploration of these data. Together, these contributions establish a reusable methodology for studying media bias at scale, providing empirical resources for future research. Leveraging the breadth of the corpus over time and across publishers, we also present some examples (focused on the 150,000+ articles examined in 2024) that illustrate how this novel data set can reveal insightful patterns in news coverage and bias, supporting academic research and real-world efforts to improve media accountability.
zh
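
【示例代码】下面用一个示意性草图说明摘要中“用 LLM 对每篇文章抽取结构化标注(政治倾向、语调、主题、文章类型、重大事件)”的做法;提示词、字段名与 `call_llm` 调用均为笔者假设的占位,真实流水线还会在句子层与出版商层做聚合。

```python
# 示意性草图:LLM结构化标注流水线(字段为假设)
import json

ANNOTATION_SCHEMA = {
    "political_lean": "left / center / right",
    "tone": "negative / neutral / positive",
    "topics": "主题标签列表",
    "article_type": "news / opinion / analysis",
    "major_event": "文章关联的重大事件(如无则为null)",
}

def build_annotation_prompt(article_text):
    fields = "\n".join(f"- {k}: {v}" for k, v in ANNOTATION_SCHEMA.items())
    return (
        "请阅读下面的新闻文章,并以JSON格式输出以下字段:\n"
        f"{fields}\n\n文章全文:\n{article_text}"
    )

def annotate(article_text, call_llm):
    """call_llm(prompt) -> JSON字符串;此处为占位的LLM调用接口。"""
    return json.loads(call_llm(build_annotation_prompt(article_text)))
```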

[NLP-103] STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)作为具备工具调用能力的自主代理(autonomous agents)时所面临的安全挑战,尤其是传统基于内容的安全防护机制无法有效应对的多轮协同攻击问题。其核心问题是:攻击者可构造看似无害的多步工具调用链,在孤立视角下难以察觉,但组合后却能执行有害操作,且此类攻击在最终执行阶段才显现危害。解决方案的关键在于提出Sequential Tool Attack Chaining (STAC)框架——一个闭环自动化管道,能够合成可执行的多步工具链、通过环境内执行验证恶意行为,并反向构建诱导代理执行该序列的隐蔽多轮提示;该方案揭示了现有基于提示的防御手段保护效果有限,进而提出一种基于推理的防御提示,通过分析整个动作序列及其累积效应显著提升安全性,将攻击成功率(ASR)降低最多达28.8%。

链接: https://arxiv.org/abs/2509.25624
作者: Jing-Jing Li,Jianfeng He,Chao Shang,Devang Kulshreshtha,Xun Xian,Yi Zhang,Hang Su,Sandesh Swamy,Yanjun Qi
机构: AWS AI Labs; UC Berkeley
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As LLMs advance into autonomous agents with tool-use capabilities, they introduce security challenges that extend beyond traditional content-based LLM safety concerns. This paper introduces Sequential Tool Attack Chaining (STAC), a novel multi-turn attack framework that exploits agent tool use. STAC chains together tool calls that each appear harmless in isolation but, when combined, collectively enable harmful operations that only become apparent at the final execution step. We apply our framework to automatically generate and systematically evaluate 483 STAC cases, featuring 1,352 sets of user-agent-environment interactions and spanning diverse domains, tasks, agent types, and 10 failure modes. Our evaluations show that state-of-the-art LLM agents, including GPT-4.1, are highly vulnerable to STAC, with attack success rates (ASR) exceeding 90% in most cases. The core design of STAC’s automated framework is a closed-loop pipeline that synthesizes executable multi-step tool chains, validates them through in-environment execution, and reverse-engineers stealthy multi-turn prompts that reliably induce agents to execute the verified malicious sequence. We further perform defense analysis against STAC and find that existing prompt-based defenses provide limited protection. To address this gap, we propose a new reasoning-driven defense prompt that achieves far stronger protection, cutting ASR by up to 28.8%. These results highlight a crucial gap: defending tool-enabled agents requires reasoning over entire action sequences and their cumulative effects, rather than evaluating isolated prompts or responses.
zh

[NLP-104] Transformers through the lens of support-preserving maps between measures

链接: https://arxiv.org/abs/2509.25611
作者: Takashi Furuya,Maarten V. de Hoop,Matti Lassas
机构: Doshisha University (同志社大学); RIKEN AIP (理化学研究所人工智能项目); Rice University (莱斯大学); University of Helsinki (赫尔辛基大学)
类目: Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

点击查看摘要

[NLP-105] RFG: Test-Time Scaling for Diffusion Large Language Model Reasoning with Reward-Free Guidance

链接: https://arxiv.org/abs/2509.25604
作者: Tianlang Chen,Minkai Xu,Jure Leskovec,Stefano Ermon
机构: Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 27 pages, 3 figures, 2 tables

点击查看摘要

[NLP-106] Causal Autoencoder-like Generation of Feedback Fuzzy Cognitive Maps with an LLM Agent

链接: https://arxiv.org/abs/2509.25593
作者: Akash Kumar Panda,Olaoluwa Adigun,Bart Kosko
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
备注: 8 pages, 4 figures

点击查看摘要

[NLP-107] Building the EHR Foundation Model via Next Event Prediction

【速读】: 该论文旨在解决电子健康记录(Electronic Health Records, EHRs)中时序动态信息难以被传统编码方法充分捕捉的问题,以及大型语言模型(Large Language Models, LLMs)在推理临床事件序列和时序依赖关系方面的局限性。其解决方案的关键在于提出一种名为“下一事件预测”(Next Event Prediction, NEP)的框架,通过在临床事件序列上进行自回归微调(autoregressive fine-tuning),将EHR重构为带时间戳的事件链,并以预测未来医疗事件的方式显式建模疾病进展模式与因果关系,从而增强LLMs的时序推理能力。

链接: https://arxiv.org/abs/2509.25591
作者: Zekai Chen,Arda Pekis,Kevin Brown
机构: Standard Model Biomedicine (标准模型生物医学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Other Quantitative Biology (q-bio.OT)
备注:

点击查看摘要

Abstract:Electronic Health Records (EHRs) contain rich temporal dynamics that conventional encoding approaches fail to adequately capture. While Large Language Models (LLMs) show promise for EHR modeling, they struggle to reason about sequential clinical events and temporal dependencies. We propose Next Event Prediction (NEP), a framework that enhances LLMs’ temporal reasoning through autoregressive fine-tuning on clinical event sequences. By reformulating EHRs as timestamped event chains and predicting future medical events, NEP explicitly models disease progression patterns and causal relationships. Extensive evaluations across oncology survival prediction and clinical diagnosis tasks demonstrate NEP’s superiority, outperforming specialized EHR models by 4.6% AUROC and general-purpose LLMs by 7.2% C-index in temporal reasoning tasks. Our analyses reveal dual benefits: state-of-the-art prediction accuracy combined with clinically interpretable attention patterns that align with known disease pathways.
zh
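
【示例代码】以下示意性草图展示“把电子病历改写为带时间戳的事件链,并构造‘预测下一事件’的自回归训练样本”这一数据组织方式;字段名、格式与示例病历均为笔者虚构的假设,仅用于说明 NEP 的思路。

```python
# 示意性草图:EHR -> 带时间戳事件链 -> 下一事件预测样本
def format_event(e):
    return f"[{e['time']}] {e['type']}: {e['name']}"

def make_next_event_examples(events):
    """events 按时间排序后,返回 (上下文, 下一事件) 训练对列表。"""
    events = sorted(events, key=lambda e: e["time"])
    examples = []
    for i in range(1, len(events)):
        context = "\n".join(format_event(e) for e in events[:i])
        target = format_event(events[i])
        examples.append({"prompt": context + "\n下一事件:", "completion": target})
    return examples

history = [  # 虚构示例
    {"time": "2024-01-03", "type": "诊断", "name": "2型糖尿病"},
    {"time": "2024-01-10", "type": "用药", "name": "二甲双胍"},
    {"time": "2024-03-02", "type": "检验", "name": "HbA1c 7.1%"},
]
for ex in make_next_event_examples(history):
    print(ex["completion"])
```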

[NLP-108] ATLAS: Constraints-Aware Multi-Agent Collaboration for Real-World Travel Planning

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在复杂约束条件下难以生成最优且 grounded(有依据的)解决方案的问题,特别是在真实世界旅行规划任务中,此类约束包括显式、隐式以及随交互动态变化的约束。其核心解决方案是提出 ATLAS——一个通用的多智能体框架,关键创新在于三个机制:动态约束管理(dynamic constraint management)、迭代计划批判(iterative plan critique)和自适应交错搜索(adaptive interleaved search),从而显著提升了约束感知规划的能力。实验表明,ATLAS 在 TravelPlanner 基准上将最终通过率从 23.3% 提升至 44.4%,并在包含实时信息检索与多轮反馈的真实场景中达到 84% 的通过率,远超 ReAct(59%)和单体代理(27%)。

链接: https://arxiv.org/abs/2509.25586
作者: Jihye Choi,Jinsung Yoon,Jiefeng Chen,Somesh Jha,Tomas Pfister
机构: Google Cloud(谷歌云); University of Wisconsin-Madison(威斯康星大学麦迪逊分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While Large Language Models (LLMs) have shown remarkable advancements in reasoning and tool use, they often fail to generate optimal, grounded solutions under complex constraints. Real-world travel planning exemplifies these challenges, evaluating agents’ abilities to handle constraints that are explicit, implicit, and even evolving based on interactions with dynamic environments and user needs. In this paper, we present ATLAS, a general multi-agent framework designed to effectively handle such complex nature of constraints awareness in real-world travel planning tasks. ATLAS introduces a principled approach to address the fundamental challenges of constraint-aware planning through dedicated mechanisms for dynamic constraint management, iterative plan critique, and adaptive interleaved search. ATLAS demonstrates state-of-the-art performance on the TravelPlanner benchmark, improving the final pass rate from 23.3% to 44.4% over its best alternative. More importantly, our work is the first to demonstrate quantitative effectiveness on real-world travel planning tasks with live information search and multi-turn feedback. In this realistic setting, ATLAS showcases its superior overall planning performance, achieving an 84% final pass rate which significantly outperforms baselines including ReAct (59%) and a monolithic agent (27%).
zh
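
【示例代码】下面用一个极简的“动态约束管理器”示意摘要中“约束记录 + 计划批判(critique)”的交互方式:显式/隐式约束在多轮交互中被登记,候选行程据此被检查并迭代修正。数据结构、校验规则与示例行程均为笔者假设。

```python
# 示意性草图:约束登记与行程批判
class ConstraintTracker:
    def __init__(self):
        self.constraints = []          # 每项: (约束名称, 校验函数)

    def add(self, name, check_fn):
        self.constraints.append((name, check_fn))

    def critique(self, plan):
        """返回所有未满足的约束名称,供规划智能体修订方案。"""
        return [name for name, check in self.constraints if not check(plan)]

tracker = ConstraintTracker()
tracker.add("预算不超过5000元", lambda p: p["total_cost"] <= 5000)
tracker.add("行程不超过3天",   lambda p: p["days"] <= 3)
tracker.add("包含博物馆",       lambda p: any("博物馆" in x for x in p["activities"]))

plan = {"total_cost": 5600, "days": 3, "activities": ["城市漫步", "博物馆参观"]}
print(tracker.critique(plan))   # ['预算不超过5000元'] —— 规划器据此迭代修订
```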

[NLP-109] Skip-It? Theoretical Conditions for Layer Skipping in Vision-Language Models

链接: https://arxiv.org/abs/2509.25584
作者: Max Hartman,Vidhata Jayaraman,Moulik Choraria,Akhil Bhimaraju,Lav R. Varshney
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); AI Innovation Institute (人工智能创新研究所); Stony Brook University (石溪大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Machine Learning (cs.LG)
备注:

点击查看摘要

[NLP-110] Probing the Limits of Stylistic Alignment in Vision-Language Models

链接: https://arxiv.org/abs/2509.25568
作者: Asma Farajidizaji,Akash Gupta,Vatsal Raina
机构: Imperial College London (帝国理工学院); Apta AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 5 pages, 1 figure, 3 tables

点击查看摘要

[NLP-111] IRIS: Intrinsic Reward Image Synthesis

链接: https://arxiv.org/abs/2509.25562
作者: Yihang Chen,Yuanhao Ban,Yunqi Hong,Cho-Jui Hsieh
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

[NLP-112] Don’t Sweat the Small Stuff: Segment-Level Meta-Evaluation Based on Pairwise Difference Correlation

链接: https://arxiv.org/abs/2509.25546
作者: Colten DiIanni,Daniel Deutsch
机构: Google(谷歌)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-113] Performance and competence intertwined: A computational model of the Null Subject stage in English-speaking children

【速读】: 该论文旨在解决儿童在语言习得过程中为何会出现持续至约4岁左右的空主语(null subject, NS)阶段的问题,特别是探讨这一现象是否源于对祈使句与陈述句中空主语形式的误判。解决方案的关键在于提出一个新的计算参数来量化这种误解释行为,并将其整合进一个模拟强制主语语法(obligatory subject grammar)习得的模型中;通过改进的变分学习者(Variational Learner)框架(适用于超集-子集语言),模拟结果支持了Orfitelli和Hyams(2012)关于性能因素导致临时空主语语法形成的假设,从而为语言习得中的认知机制提供了可计算建模的新路径。

链接: https://arxiv.org/abs/2509.25545
作者: Soumik Dey,William Gregory Sakas
机构: The Graduate Center (研究生中心); The City University of New York (纽约市立大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The empirically established null subject (NS) stage, lasting until about 4 years of age, involves frequent omission of subjects by children. Orfitelli and Hyams (2012) observe that young English speakers often confuse imperative NS utterances with declarative ones due to performance influences, promoting a temporary null subject grammar. We propose a new computational parameter to measure this misinterpretation and incorporate it into a simulated model of obligatory subject grammar learning. Using a modified version of the Variational Learner (Yang, 2012) which works for superset-subset languages, our simulations support Orfitelli and Hyams’ hypothesis. More generally, this study outlines a framework for integrating computational models in the study of grammatical acquisition alongside other key developmental factors.
zh
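
【示例代码】以下给出一个高度简化的示意性模拟(非论文原始模型):用线性奖惩形式的变分学习器更新“强制主语(+OS)”参数的概率,其中把“空主语祈使句误判为陈述句”的概率 m 对应论文提出的误解释参数。输入分布、学习率与更新细节均为笔者的假设。

```python
# 示意性草图:带误解释参数的线性奖惩变分学习器模拟
import random

def simulate(n_steps=50000, gamma=0.001, m=0.3, p_imperative=0.2, seed=0):
    random.seed(seed)
    p = 0.5   # 当前认为目标语言需要强制主语(+OS)的概率
    for _ in range(n_steps):
        imperative_ns = random.random() < p_imperative       # 输入是否为空主语祈使句
        misread = imperative_ns and random.random() < m      # 是否被误判为陈述句
        if misread:
            p = (1 - gamma) * p          # 误判证据支持"允许空主语",惩罚 +OS
        elif not imperative_ns:
            p = p + gamma * (1 - p)      # 带显性主语的陈述句支持 +OS,奖励
        # 被正确识别的祈使句不提供关于该参数的证据
    return p

print(simulate(m=0.3))   # 误解释率较高时 +OS 概率上升更慢,对应更长的空主语阶段
print(simulate(m=0.0))
```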

[NLP-114] Aligning Multilingual Reasoning with Verifiable Semantics from a High-Resource Expert Model

链接: https://arxiv.org/abs/2509.25543
作者: Fahim Faisal,Kaiqiang Song,Song Wang,Simin Ma,Shujian Liu,Haoyun Deng,Sathish Reddy Indurthi
机构: Zoom(Zoom)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[NLP-115] Toxicity in Online Platforms and AI Systems: A Survey of Needs, Challenges, Mitigations, and Future Directions

链接: https://arxiv.org/abs/2509.25539
作者: Smita Khapre,Melkamu Abay Mersha,Hassan Shakil,Jonali Baruah,Jugal Kalita
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
备注:

点击查看摘要

[NLP-116] Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning

链接: https://arxiv.org/abs/2509.25534
作者: Zhiling Ye,Yun Yue,Haowen Wang,Xudong Han,Jiadi Jiang,Cheng Wei,Lei Fan,Jiaxin Liang,Shuowen Zhang,Ji Li,Chunxiao Guo,Jian Wang,Peng Wei,Jinjie Gu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

[NLP-117] Calibrating Verbalized Confidence with Self-Generated Distractors

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)输出的置信度估计存在校准不足的问题,即模型常对低准确率的预测表现出过高自信,从而损害人类用户的信任与安全性。其核心解决方案是提出Distractor-Normalized Coherence (DINCO),关键在于通过让模型在多个自生成的干扰项(distractors,即替代性主张)上独立评估置信度,并以总置信度进行归一化,从而量化并校正模型因信息匮乏而产生的易受暗示性(suggestibility)偏差;进一步地,DINCO融合了生成器-验证器不一致性的信息,利用一致性估计增强置信度校准,实现了跨采样生成与跨不相容主张验证两种互补的置信度一致性维度的整合,显著优于传统方法如自一致性(self-consistency)。

链接: https://arxiv.org/abs/2509.25532
作者: Victor Wang,Elias Stengel-Eskin
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Code: this https URL

点击查看摘要

Abstract:Calibrated confidence estimates are necessary for large language model (LLM) outputs to be trusted by human users. While LLMs can express their confidence in human-interpretable ways, verbalized LLM-generated confidence scores have empirically been found to be miscalibrated, reporting high confidence on instances with low accuracy and thereby harming trust and safety. We hypothesize that this overconfidence often stems from a given LLM’s heightened suggestibility when faced with claims that it encodes little information about; we empirically validate this hypothesis, finding more suggestibility on lower-accuracy claims. Building on this finding, we introduce Distractor-Normalized Coherence (DINCO), which estimates and accounts for an LLM’s suggestibility bias by having the model verbalize its confidence independently across several self-generated distractors (i.e. alternative claims), and normalizes by the total verbalized confidence. To further improve calibration, we leverage generator-validator disagreement, augmenting normalized validator confidence with a consistency-based estimate of generator confidence. Here, we frame the popular approach of self-consistency as leveraging coherence across sampled generations, and normalized verbalized confidence as leveraging coherence across validations on incompatible claims, allowing us to integrate these complementary dimensions of coherence into DINCO. Moreover, our analysis shows that DINCO provides less saturated – and therefore more usable – confidence estimates, and that further sampling alone cannot close the gap between DINCO and baselines, with DINCO at 10 inference calls outperforming self-consistency at 100.
zh
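
【示例代码】按摘要思路,DINCO 的核心是“让模型对原始主张与若干自生成干扰项分别给出口头置信度,再用总置信度做归一化”。下面的示意性草图中,`get_verbalized_conf` 为占位函数,具体归一化形式为对摘要的简化假设。

```python
# 示意性草图:干扰项归一化的口头置信度
def dinco_confidence(claim, distractors, get_verbalized_conf):
    """
    get_verbalized_conf(statement) -> [0, 1]:模型独立地对每个陈述给出的口头置信度。
    返回归一化后的置信度,用于削弱模型在信息匮乏时的"易受暗示"偏差。
    """
    c_claim = get_verbalized_conf(claim)
    c_all = c_claim + sum(get_verbalized_conf(d) for d in distractors)
    return c_claim / c_all if c_all > 0 else 0.0

# 用法示例(置信度为虚构数值):过度自信的0.9被归一化到约0.375
fake_conf = {"主张A": 0.9, "干扰项1": 0.8, "干扰项2": 0.7}.get
print(dinco_confidence("主张A", ["干扰项1", "干扰项2"], fake_conf))
```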

[NLP-118] MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources

链接: https://arxiv.org/abs/2509.25531
作者: Huu Nguyen,Victor May,Harsh Raj,Marianna Nezhurina,Yishan Wang,Yanqi Luo,Minh Chien Vu,Taishi Nakamura,Ken Tsui,Van Khue Nguyen,David Salinas,Aleksandra Krasnodębska,Christoph Schuhmann,Mats Leon Richter,Xuan-Son(Sonny)Vu,Jenia Jitsev
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Code: this https URL

点击查看摘要

[NLP-119] Beyond WER: Probing Whisper’s Sub-token Decoder Across Diverse Language Resource Levels

链接: https://arxiv.org/abs/2509.25516
作者: Siyu Liang,Nicolas Ballier,Gina-Anne Levow,Richard Wright
机构: University of Washington (华盛顿大学); Université Paris Cité (巴黎城市大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-120] Not Wrong But Untrue: LLM Overconfidence in Document-Based Queries

链接: https://arxiv.org/abs/2509.25498
作者: Nick Hagar,Wilma Agustianto,Nicholas Diakopoulos
机构: Northwestern University (西北大学); University of Minnesota (明尼苏达大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to Computation + Journalism Symposium 2025

点击查看摘要

[NLP-121] The Rise of AfricaNLP: Contributions, Contributors, and Community Impact (2005-2025)

链接: https://arxiv.org/abs/2509.25477
作者: Tadesse Destaw Belay,Kedir Yassin Hussen,Sukairaj Hafiz Imam,Iqra Ameer,Ibrahim Said Ahmad,Isa Inuwa-Dutse,Idris Abdulmumin,Grigori Sidorov,Vukosi Marivate,Seid Muhie Yimam,Shamsuddeen Hassan Muhammad
机构: Instituto Politécnico Nacional(国立政治学院); University of Gondar(贡达尔大学); Northwest University Kano(卡诺西北大学); Northeastern University(东北大学); University of Huddersfield(哈德斯菲尔德大学); University of Pretoria(比勒陀利亚大学); University of Hamburg(汉堡大学); Imperial College London(伦敦帝国理工学院); Pennsylvania State University(宾夕法尼亚州立大学); Wollo University(沃洛大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-122] SimulRAG: Simulator-based RAG for Grounding LLMs in Long-form Scientific QA

链接: https://arxiv.org/abs/2509.25459
作者: Haozhou Xu,Dongxia Wu,Matteo Chinazzi,Ruijia Niu,Rose Yu,Yi-An Ma
机构: University of California San Diego (加州大学圣地亚哥分校); Stanford University (斯坦福大学); Northeastern University (东北大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Haozhou Xu and Dongxia Wu are co-first authors

点击查看摘要

[NLP-123] DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search

【速读】: 该论文旨在解决当前基于强化学习的推理验证(Reinforcement Learning with Validation and Refinement, RLVR)方法在长期训练中出现的性能提升停滞问题,其根源在于现有RLVR实践中探索模式稀疏,模型依赖有限的rollout难以覆盖解空间的关键路径,导致奖励信号稀疏且信用分配粗粒度。解决方案的关键在于提出DeepSearch框架,将蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)直接嵌入RLVR训练循环中,实现训练时的系统性探索与细粒度信用分配,从而突破因探索不足引发的性能瓶颈。具体创新包括:全局前沿节点选择策略以优先探索高潜力路径、基于熵引导的选择机制以识别可信监督路径,以及带解缓存的自适应回放缓冲区训练,显著提升了训练效率和推理准确性,在数学推理基准上实现了62.95%平均准确率,且仅需传统扩展训练方法5.7倍GPU小时数。

链接: https://arxiv.org/abs/2509.25454
作者: Fang Wu,Weihao Xuan,Heli Qi,Ximing Lu,Aaron Tu,Li Erran Li,Yejin Choi
机构: Stanford University (斯坦福大学); University of Tokyo (东京大学); RIKEN AIP (理化学研究所人工智能中心); University of Washington (华盛顿大学); UC Berkeley (加州大学伯克利分校); Amazon AWS (亚马逊云科技); Columbia University (哥伦比亚大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Although RLVR has become an essential component for developing advanced reasoning skills in LLMs, contemporary studies have documented training plateaus that emerge following thousands of optimization steps, demonstrating notable decreases in performance gains despite increased computational investment. This limitation stems from the sparse exploration patterns inherent in current RLVR practices, where models rely on limited rollouts that often miss critical reasoning paths and fail to provide systematic coverage of the solution space. We present DeepSearch, a framework that integrates Monte Carlo Tree Search directly into RLVR training. In contrast to existing methods that rely on tree search only at inference, DeepSearch embeds structured search into the training loop, enabling systematic exploration and fine-grained credit assignment across reasoning steps. Through training-time exploration, DeepSearch addresses the fundamental bottleneck of insufficient exploration, which leads to diminishing performance improvements over prolonged training steps. Our contributions include: (1) a global frontier selection strategy that prioritizes promising nodes across the search tree, (2) selection with entropy-based guidance that identifies confident paths for supervision, and (3) adaptive replay buffer training with solution caching for efficiency. Experiments on mathematical reasoning benchmarks show that DeepSearch achieves 62.95% average accuracy and establishes a new state-of-the-art for 1.5B reasoning models - using 5.7x fewer GPU hours than extended training approaches. These results highlight the importance of strategic exploration over brute-force scaling and demonstrate the promise of algorithmic innovation for advancing RLVR methodologies. DeepSearch establishes a new direction for scaling reasoning capabilities through systematic search rather than prolonged computation.
zh
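
【示例代码】下面用一个极简的示意性草图说明“全局前沿节点选择”:在整棵搜索树的所有待扩展节点中选取得分最高者,而非只在局部分支内选择;打分函数采用笔者假设的 UCT 风格(价值项 + 探索项),并非论文的真实实现。

```python
# 示意性草图:全局前沿节点选择(UCT风格打分为假设)
import heapq, math

def frontier_score(node, c=1.0):
    value = node["value"]                        # 该节点的估值(如部分推理的奖励估计)
    visits = max(node["visits"], 1)
    parent_visits = max(node["parent_visits"], 1)
    return value + c * math.sqrt(math.log(parent_visits) / visits)

def select_global_frontier(frontier_nodes):
    """frontier_nodes: 整棵树上所有待扩展节点;返回得分最高者。"""
    heap = [(-frontier_score(n), i, n) for i, n in enumerate(frontier_nodes)]
    heapq.heapify(heap)
    return heap[0][2]

nodes = [
    {"value": 0.6, "visits": 5, "parent_visits": 20},
    {"value": 0.4, "visits": 1, "parent_visits": 20},   # 访问次数少,探索项更大
]
print(select_global_frontier(nodes))
```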

[NLP-124] Fingerprinting LLMs via Prompt Injection

链接: https://arxiv.org/abs/2509.25448
作者: Yuepeng Hu,Zhengyuan Jiang,Mengyuan Li,Osama Ahmed,Zhicong Huang,Cheng Hong,Neil Gong
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-125] Adaptive Test-Time Reasoning via Reward-Guided Dual-Phase Search

链接: https://arxiv.org/abs/2509.25420
作者: Yingqian Cui,Zhenwei Dai,Pengfei He,Bing He,Hui Liu,Xianfeng Tang,Jingying Zeng,Suhang Wang,Yue Xing,Jiliang Tang,Benoit Dumoulin
机构: Michigan State University (密歇根州立大学); Amazon (亚马逊); Pennsylvania State University (宾夕法尼亚州立大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

[NLP-126] Emotion-Aligned Generation in Diffusion Text to Speech Models via Preference-Guided Optimization

链接: https://arxiv.org/abs/2509.25416
作者: Jiacheng Shi,Hongfei Du,Yangfan He,Y. Alicia Hong,Ye Gao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:

点击查看摘要

[NLP-127] Rethinking Parameter Sharing for LLM Fine-Tuning with Multiple LoRAs

链接: https://arxiv.org/abs/2509.25414
作者: Hao Ban,Kaiyi Ji
机构: University at Buffalo (纽约州立大学布法罗分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-128] From Faithfulness to Correctness: Generative Reward Models that Think Critically

链接: https://arxiv.org/abs/2509.25409
作者: Qiyao Ma,Yunsheng Shi,Hongtao Tian,Chao Wang,Weiming Chang,Ting Yao
机构: University of California, Davis (加州大学戴维斯分校); WeChat, Tencent (微信,腾讯); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

[NLP-129] Predicting Training Re-evaluation Curves Enables Effective Data Curriculums for LLMs

链接: https://arxiv.org/abs/2509.25380
作者: Shane Bergsma,Nolan Dey,Joel Hestness
机构: Cerebras Systems (Cerebras系统)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-130] Generative Value Conflicts Reveal LLM Priorities

链接: https://arxiv.org/abs/2509.25369
作者: Andy Liu,Kshitish Ghate,Mona Diab,Daniel Fried,Atoosa Kasirzadeh,Max Kleiman-Weiner
机构: Carnegie Mellon University (卡内基梅隆大学); University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

[NLP-131] From Internal Representations to Text Quality: A Geometric Approach to LLM Evaluation

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)生成文本质量评估中依赖人工标注数据的问题,即传统评价方法往往需要参考人类标注的高质量文本集(如BLEU、ROUGE等指标),这在实际应用中成本高且难以扩展。其解决方案的关键在于利用LLM内部表示的几何特性作为无参考(reference-free)文本质量评估的代理指标,特别是通过测量不同层中的内在维度(Intrinsic Dimensionality)和有效秩(Effective Rank),发现这些几何特征能稳定地反映文本自然性和质量,并且跨模型保持一致的排序能力,从而实现无需人工标注即可自动化评估生成文本质量的新范式。

链接: https://arxiv.org/abs/2509.25359
作者: Viacheslav Yusupov,Danil Maksimov,Ameliia Alaeva,Anna Vasileva,Anna Antipina,Tatyana Zaitseva,Alina Ermilova,Evgeny Burnaev,Egor Shvetsov
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper bridges internal and external analysis approaches to large language models (LLMs) by demonstrating that geometric properties of internal model representations serve as reliable proxies for evaluating generated text quality. We validate a set of metrics including Maximum Explainable Variance, Effective Rank, Intrinsic Dimensionality, MAUVE score, and Schatten Norms measured across different layers of LLMs, demonstrating that Intrinsic Dimensionality and Effective Rank can serve as universal assessments of text naturalness and quality. Our key finding reveals that different models consistently rank text from various sources in the same order based on these geometric properties, indicating that these metrics reflect inherent text characteristics rather than model-specific artifacts. This allows a reference-free text quality evaluation that does not require human-annotated datasets, offering practical advantages for automated evaluation pipelines.
zh
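
【示例代码】摘要中提到的“有效秩”可以直接由表示矩阵的奇异值谱计算(奇异值分布香农熵的指数)。下面是一个自包含的示意性草图;其中“自然文本 vs 低秩退化表示”的用法示例为笔者用随机数据构造的假设,仅说明指标的行为趋势。

```python
# 示意性草图:计算某层隐藏表示的有效秩(Effective Rank)
import numpy as np

def effective_rank(H: np.ndarray) -> float:
    """H: [token数, 隐藏维度] 的表示矩阵;返回 exp(奇异值分布的香农熵)。"""
    s = np.linalg.svd(H - H.mean(0), compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

# 用法示例(数据为随机模拟):低秩退化表示的有效秩明显更低
rng = np.random.default_rng(0)
H_rich = rng.normal(size=(128, 768))                                  # 谱较"饱满"的表示
H_degenerate = rng.normal(size=(128, 4)) @ rng.normal(size=(4, 768))  # 低秩退化表示
print(effective_rank(H_rich), effective_rank(H_degenerate))
```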

[NLP-132] Spontaneous High-Order Generalization in Neural Theory-of-Mind Networks

链接: https://arxiv.org/abs/2509.25343
作者: Yiming Wang,Rui Wang
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-133] Dive into the Agent Matrix: A Realistic Evaluation of Self-Replication Risk in LLM Agents

链接: https://arxiv.org/abs/2509.25302
作者: Boxuan Zhang,Yi Yu,Jiaxuan Guo,Jing Shao
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 21 pages, 6 figures

点击查看摘要

[NLP-134] Flash-Searcher: Fast and Effective Web Agents via DAG-Based Parallel Execution

链接: https://arxiv.org/abs/2509.25301
作者: Tianrui Qin,Qianben Chen,Sinuo Wang,He Xing,King Zhu,He Zhu,Dingfeng Shi,Xinxin Liu,Ge Zhang,Jiaheng Liu,Yuchen Eleanor Jiang,Xitong Gao,Wangchunshu Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-135] ActorDB: A Unified Database Model Integrating Single-Writer Actors, Incremental View Maintenance, and Zero-Trust Messaging

链接: https://arxiv.org/abs/2509.25285
作者: Jun Kawasaki
机构: 未知
类目: Databases (cs.DB); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 7 pages, 1 table, 1 figure. Code and data available at this https URL

点击查看摘要

[NLP-136] Dynamic Policy Induction for Adaptive Prompt Optimization: Bridging the Efficiency-Accuracy Gap via Lightweight Reinforcement Learning

链接: https://arxiv.org/abs/2509.25267
作者: Jiexi Xu
机构: University of California, Irvine (加州大学欧文分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 13 pages, 2 figures, 2 tables

点击查看摘要

[NLP-137] Language Model Planning from an Information Theoretic Perspective

链接: https://arxiv.org/abs/2509.25260
作者: Muhammed Ustaomeroglu,Baris Askin,Gauri Joshi,Carlee Joe-Wong,Guannan Qu
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

[NLP-138] HAMMER: Hamiltonian Curiosity Augmented Large Language Model Reinforcement

链接: https://arxiv.org/abs/2509.25240
作者: Ming Yang,Xiaofan Li,Zhiyuan Ma,Dengliang Shi,Jintao Du,Yu Cheng,Weiguo Zheng
机构: Fudan University (复旦大学); Tiansuan Lab, Ant Group Co., Ltd. (蚂蚁集团天算实验室); East China Normal University (华东师范大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 20 pages, 7 figures, 4 tables

点击查看摘要

[NLP-139] A Formal Comparison Between Chain-of-Thought and Latent Thought

链接: https://arxiv.org/abs/2509.25239
作者: Kevin Xu,Issei Sato
机构: The University of Tokyo (东京大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

[NLP-140] Cyclic Ablation: Testing Concept Localization against Functional Regeneration in AI WWW

链接: https://arxiv.org/abs/2509.25220
作者: Eduard Kapelko
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Code is available at: this https URL

点击查看摘要

[NLP-141] Spectral Logit Sculpting: Adaptive Low-Rank Logit Transformation for Controlled Text Generation ICASSP2026

链接: https://arxiv.org/abs/2509.25204
作者: Jin Li,Zhebo Wang,Tianliang Lu,Mohan Li,Wenpeng Xing,Meng Han
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Submitted to IEEE ICASSP 2026

点击查看摘要

[NLP-142] TimeOmni-1: Incentivizing Complex Reasoning with Time Series in Large Language Models

链接: https://arxiv.org/abs/2509.24803
作者: Tong Guan,Zijie Meng,Dianqi Li,Shiyu Wang,Chao-Han Huck Yang,Qingsong Wen,Zuozhu Liu,Sabato Marco Siniscalchi,Ming Jin,Shirui Pan
机构: Griffith University (格里菲斯大学); Zhejiang University (浙江大学); NVIDIA Research (英伟达研究); Squirrel Ai Learning (松鼠Ai); University of Palermo (巴勒莫大学); Norwegian University of Science and Technology (挪威科技大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-143] Artificial Phantasia: Evidence for Propositional Reasoning-Based Mental Imagery in Large Language Models

链接: https://arxiv.org/abs/2509.23108
作者: Morgan McCarty,Jorge Morales
机构: Northeastern University (东北大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 30 pages, 15 figures

点击查看摘要

[NLP-144] Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking

链接: https://arxiv.org/abs/2505.23495
作者: Liangliang Zhang,Zhuorui Jiang,Hongliang Chi,Haoyang Chen,Mohammed Elkoumy,Fali Wang,Qiong Wu,Zhengyi Zhou,Shirui Pan,Suhang Wang,Yao Ma
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages

点击查看摘要

[NLP-145] Game-Time: Evaluating Temporal Dynamics in Spoken Language Models ICASSP2026

链接: https://arxiv.org/abs/2509.26388
作者: Kai-Wei Chang,En-Pei Hu,Chun-Yi Kuan,Wenze Ren,Wei-Chih Chen,Guan-Ting Lin,Yu Tsao,Shao-Hua Sun,Hung-yi Lee,James Glass
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: submitted to ICASSP 2026

点击查看摘要

[NLP-146] AU: A Benchmark for Cultural Sound Understanding Beyond Semantics ICASSP2026

链接: https://arxiv.org/abs/2509.26329
作者: Yi-Cheng Lin,Yu-Hua Chen,Jia-Kai Dong,Yueh-Hsuan Huang,Szu-Chi Chen,Yu-Chen Chen,Chih-Yao Chen,Yu-Jung Lin,Yu-Ling Chen,Zih-Yu Chen,I-Ning Tsai,Hsiu-Hsuan Wang,Ho-Lam Chung,Ke-Han Lu,Hung-yi Lee
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
备注: 5 pages; submitted to ICASSP 2026

点击查看摘要

[NLP-147] The AI Productivity Index (APEX)

链接: https://arxiv.org/abs/2509.25721
作者: Bertie Vidgen,Abby Fennelly,Evan Pinnix,Chirag Mahapatra,Zach Richards,Austin Bridges,Calix Huang,Ben Hunsberger,Fez Zafar,Brendan Foody,Dominic Barton,Cass R. Sunstein,Eric Topol,Osvald Nitski
机构: Mercor; Harvard Law School (哈佛法学院); The Scripps Research Institute (斯克里普斯研究所)
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

计算机视觉

[CV-0] TTT3R: 3D Reconstruction as Test-Time Training

【速读】:该论文旨在解决现代循环神经网络(Recurrent Neural Networks, RNNs)在3D重建任务中长度泛化能力有限的问题,即模型在处理超出训练时上下文长度的序列时性能显著下降。解决方案的关键在于从测试时训练(Test-Time Training, TTT)的角度重新审视3D重建基础模型,将其设计视为在线学习问题,并利用记忆状态与新观测之间的对齐置信度推导出一个闭式学习率,用于平衡历史信息保留与新观测适应之间的权衡。该方法无需额外训练,命名为TTT3R,在不增加计算负担的前提下显著提升了长度泛化性能,实现全局位姿估计精度提升2倍,同时保持20 FPS的推理速度和仅6 GB GPU显存的资源消耗。

链接: https://arxiv.org/abs/2509.26645
作者: Xingyu Chen,Yue Chen,Yuliang Xiu,Andreas Geiger,Anpei Chen
机构: Westlake University (西湖大学); University of Tübingen (图宾根大学); Tübingen AI Center (图宾根人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Page: this https URL Code: this https URL

点击查看摘要

Abstract:Modern Recurrent Neural Networks have become a competitive architecture for 3D reconstruction due to their linear-time complexity. However, their performance degrades significantly when applied beyond the training context length, revealing limited length generalization. In this work, we revisit the 3D reconstruction foundation models from a Test-Time Training perspective, framing their designs as an online learning problem. Building on this perspective, we leverage the alignment confidence between the memory state and incoming observations to derive a closed-form learning rate for memory updates, to balance between retaining historical information and adapting to new observations. This training-free intervention, termed TTT3R, substantially improves length generalization, achieving a 2× improvement in global pose estimation over baselines, while operating at 20 FPS with just 6 GB of GPU memory to process thousands of images. Code available in this https URL
zh
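As a rough illustration of the abstract's core idea, the toy update below uses the alignment (here, cosine similarity) between the memory state and an incoming observation feature to set a per-step learning rate, so observations the memory already explains change it less. This is a hedged sketch under assumed shapes; the paper's actual closed-form rule and state representation are not reproduced here.

```python
# Toy sketch (not the paper's exact rule): update a recurrent memory state
# with a learning rate derived from how well it already explains the new
# observation. `memory` and `obs_feat` are feature vectors; the cosine
# alignment stands in for the confidence term.
import torch
import torch.nn.functional as F

def ttt_style_update(memory: torch.Tensor, obs_feat: torch.Tensor) -> torch.Tensor:
    # High alignment -> the memory already matches the view -> small update.
    confidence = F.cosine_similarity(memory, obs_feat, dim=-1).clamp(min=0.0)
    eta = 1.0 - confidence          # assumed confidence-driven learning rate
    return (1.0 - eta) * memory + eta * obs_feat

memory = torch.randn(256)
for _ in range(1000):               # long streams: the state stays bounded
    memory = ttt_style_update(memory, torch.randn(256))
```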

[CV-1] Stitch: Training-Free Position Control in Multimodal Diffusion Transformers

【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)生成模型在准确捕捉空间关系(如“上方”或“右侧”)方面存在的持续性挑战,尤其是在现代多模态扩散Transformer(Multi-Modal Diffusion Transformer, MMDiT)架构中,传统依赖外部位置控制的方法因兼容性问题失效的问题。解决方案的关键在于提出Stitch方法——一种无需训练的外部位置控制集成策略,通过自动生成边界框(bounding box)将图像分割为独立对象区域,在每个区域内生成目标物体后无缝拼接,从而实现空间准确性与视觉质量的统一;同时发现特定注意力头可在生成过程中识别并裁剪单个对象,无需完成整图即可实现精准控制。

链接: https://arxiv.org/abs/2509.26644
作者: Jessica Bader,Mateusz Pach,Maria A. Bravo,Serge Belongie,Zeynep Akata
机构: Technical University of Munich (慕尼黑工业大学); Helmholtz Munich (赫尔姆霍兹慕尼黑研究中心); Munich Center for Machine Learning (慕尼黑机器学习中心); University of Copenhagen (哥本哈根大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint

点击查看摘要

Abstract:Text-to-Image (T2I) generation models have advanced rapidly in recent years, but accurately capturing spatial relationships like “above” or “to the right of” poses a persistent challenge. Earlier methods improved spatial relationship following with external position control. However, as architectures evolved to enhance image quality, these techniques became incompatible with modern models. We propose Stitch, a training-free method for incorporating external position control into Multi-Modal Diffusion Transformers (MMDiT) via automatically-generated bounding boxes. Stitch produces images that are both spatially accurate and visually appealing by generating individual objects within designated bounding boxes and seamlessly stitching them together. We find that targeted attention heads capture the information necessary to isolate and cut out individual objects mid-generation, without needing to fully complete the image. We evaluate Stitch on PosEval, our benchmark for position-based T2I generation. Featuring five new tasks that extend the concept of Position beyond the basic GenEval task, PosEval demonstrates that even top models still have significant room for improvement in position-based generation. Tested on Qwen-Image, FLUX, and SD3.5, Stitch consistently enhances base models, even improving FLUX by 218% on GenEval’s Position task and by 206% on PosEval. Stitch achieves state-of-the-art results with Qwen-Image on PosEval, improving over previous models by 54%, all accomplished while integrating position control into leading models training-free. Code is available at this https URL.
zh

[CV-2] Query-Kontext: An Unified Multimodal Model for Image Generation and Editing

【速读】:该论文旨在解决当前统一多模态模型(Unified Multimodal Models, UMMs)中多模态生成推理能力(multimodal generative reasoning)与高保真图像合成能力(high-fidelity synthesis)耦合导致的性能瓶颈问题。具体而言,现有框架常将指令理解、视觉定位(grounding)、图像指代(image referring)等复杂推理任务与图像生成过程混杂,限制了模型在语义准确性和视觉质量上的协同优化。解决方案的关键在于提出Query-Kontext方法,通过引入一个由语义线索和粗粒度图像条件组成的多模态“kontext”(即“上下文”),在视觉语言模型(VLM)与扩散模型(diffusion model)之间建立解耦桥梁:VLM负责处理多模态生成推理,而扩散模型专注于高质量视觉合成。该设计辅以三阶段渐进式训练策略,有效分离并强化了两个核心能力,从而在多种参考图像生成任务中实现优于或媲美强基线的结果。

链接: https://arxiv.org/abs/2509.26641
作者: Yuxin Song,Wenkai Dong,Shizun Wang,Qi Zhang,Song Xue,Tao Yuan,Hu Yang,Haocheng Feng,Hang Zhou,Xinyan Xiao,Jingdong Wang
机构: Baidu VIS (百度视觉智能实验室); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 10 figures

点击查看摘要

Abstract:Unified Multimodal Models (UMMs) have demonstrated remarkable performance in text-to-image generation (T2I) and editing (TI2I), whether instantiated as assembled unified frameworks which couple powerful vision-language model (VLM) with diffusion-based generator, or as naive Unified Multimodal Models with an early fusion of understanding and generation modalities. We contend that in current unified frameworks, the crucial capability of multimodal generative reasoning which encompasses instruction understanding, grounding, and image referring for identity preservation and faithful reconstruction, is intrinsically entangled with high-fidelity synthesis. In this work, we introduce Query-Kontext, a novel approach that bridges the VLM and diffusion model via a multimodal "kontext" composed of semantic cues and coarse-grained image conditions encoded from multimodal inputs. This design delegates the complex ability of multimodal generative reasoning to powerful VLM while reserving diffusion model's role for high-quality visual synthesis. To achieve this, we propose a three-stage progressive training strategy. First, we connect the VLM to a lightweight diffusion head via multimodal kontext tokens to unleash the VLM's generative reasoning ability. Second, we scale this head to a large, pre-trained diffusion model to enhance visual detail and realism. Finally, we introduce a low-level image encoder to improve image fidelity and perform instruction tuning on downstream tasks. Furthermore, we build a comprehensive data pipeline integrating real, synthetic, and open-source datasets, covering diverse multimodal reference-to-image scenarios, including image generation, instruction-driven editing, customized generation, and multi-subject composition. Experiments show that our approach matches strong unified baselines and even outperforms task-specific state-of-the-art methods in several cases.
zh

[CV-3] Benchmarking Egocentric Visual-Inertial SLAM at City Scale ICCV2025

【速读】:该论文旨在解决可穿戴设备在捕捉第一人称视角(egocentric)数据时,由于运动多样性高、动态视觉内容普遍以及传感器标定随时间变化等因素导致的精准6自由度(6-DoF)同时定位与地图构建(SLAM)难题。现有学术研究多依赖于不反映实际挑战或缺乏足够精确真值位姿的基准测试,难以全面评估系统鲁棒性。其解决方案的关键在于构建一个全新的多模态视觉-惯性SLAM数据集和基准,使用类似眼镜的设备记录城市中心范围内的数小时、数公里轨迹,并通过测绘工具获取厘米级精度的控制点作为间接位姿标注,从而支持对极端场景(如夜间步行或乘车移动)的评估。实验表明,当前先进的学术SLAM系统在这些挑战下表现不佳,该工作进一步识别出性能瓶颈所在,并设计了不同难度等级的测试轨迹以促进对尚不成熟方法的深入分析与改进。

链接: https://arxiv.org/abs/2509.26639
作者: Anusha Krishnan,Shaohui Liu,Paul-Edouard Sarlin,Oscar Gentilhomme,David Caruso,Maurizio Monge,Richard Newcombe,Jakob Engel,Marc Pollefeys
机构: ETH Zurich (苏黎世联邦理工学院); Google(谷歌); Meta Reality Labs Research (Meta现实实验室研究); Microsoft Spatial AI Lab (微软空间人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: ICCV 2025

点击查看摘要

Abstract:Precise 6-DoF simultaneous localization and mapping (SLAM) from onboard sensors is critical for wearable devices capturing egocentric data, which exhibits specific challenges, such as a wider diversity of motions and viewpoints, prevalent dynamic visual content, or long sessions affected by time-varying sensor calibration. While recent progress on SLAM has been swift, academic research is still driven by benchmarks that do not reflect these challenges or do not offer sufficiently accurate ground truth poses. In this paper, we introduce a new dataset and benchmark for visual-inertial SLAM with egocentric, multi-modal data. We record hours and kilometers of trajectories through a city center with glasses-like devices equipped with various sensors. We leverage surveying tools to obtain control points as indirect pose annotations that are metric, centimeter-accurate, and available at city scale. This makes it possible to evaluate extreme trajectories that involve walking at night or traveling in a vehicle. We show that state-of-the-art systems developed by academia are not robust to these challenges and we identify components that are responsible for this. In addition, we design tracks with different levels of difficulty to ease in-depth analysis and evaluation of less mature approaches. The dataset and benchmark are available at this https URL.
zh

[CV-4] Learning Generalizable Shape Completion with SIM(3) Equivariance NEURIPS2025

【速读】:该论文旨在解决3D形状补全(3D shape completion)方法在实际应用中泛化能力不足的问题,其根源在于现有模型依赖于扫描数据预对齐到规范坐标系(canonical frame)所隐含的姿态(pose)和尺度(scale)信息,导致模型可能通过记忆绝对位置而非学习内在几何结构来完成任务。当真实数据缺乏此类对齐时,性能显著下降。解决方案的关键在于引入首个针对相似变换群(SIM(3))的等变性(equivariance)设计,使模型对姿态和尺度变化保持不变性,从而实现真正鲁棒的泛化能力;具体而言,网络模块化地依次进行特征归一化、相似不变几何推理与原始帧恢复,有效避免了对先验对齐的依赖,并在去偏评估协议下显著优于传统等变与增强基线方法,在多个真实场景数据集上取得新纪录。

链接: https://arxiv.org/abs/2509.26631
作者: Yuqing Wang,Zhaiyu Chen,Xiao Xiang Zhu
机构: Technical University of Munich (慕尼黑工业大学); Munich Center for Machine Learning (慕尼黑机器学习中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: NeurIPS 2025

点击查看摘要

Abstract:3D shape completion methods typically assume scans are pre-aligned to a canonical frame. This leaks pose and scale cues that networks may exploit to memorize absolute positions rather than inferring intrinsic geometry. When such alignment is absent in real data, performance collapses. We argue that robust generalization demands architectural equivariance to the similarity group, SIM(3), so the model remains agnostic to pose and scale. Following this principle, we introduce the first SIM(3)-equivariant shape completion network, whose modular layers successively canonicalize features, reason over similarity-invariant geometry, and restore the original frame. Under a de-biased evaluation protocol that removes the hidden cues, our model outperforms both equivariant and augmentation baselines on the PCN benchmark. It also sets new cross-domain records on real driving and indoor scans, lowering minimal matching distance on KITTI by 17% and Chamfer distance ℓ1 on OmniObject3D by 14%. Perhaps surprisingly, ours under the stricter protocol still outperforms competitors under their biased settings. These results establish full SIM(3) equivariance as an effective route to truly generalizable shape completion. Project page: this https URL.
zh
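The canonicalize / reason / restore pattern described above can be pictured with a plain similarity normalization of a point cloud: remove translation and scale analytically, pick a data-dependent frame for rotation, operate in that canonical space, then map the result back. The PCA frame below is only a stand-in for the paper's learned equivariant layers, and the "reasoning" step is left as a placeholder.

```python
# Sketch of the canonicalize -> reason -> restore idea behind SIM(3) handling.
# Translation and scale are removed analytically; a PCA basis stands in for
# rotation canonicalization (the paper uses equivariant layers instead).
import numpy as np

def canonicalize(points: np.ndarray):
    center = points.mean(axis=0)
    x = points - center
    scale = np.sqrt((x ** 2).sum(axis=1).mean()) + 1e-8   # RMS radius
    x = x / scale
    _, _, vt = np.linalg.svd(x, full_matrices=False)       # principal axes
    R = vt.T                                                # 3x3 basis
    return x @ R, (center, scale, R)

def restore(points_canon: np.ndarray, frame) -> np.ndarray:
    center, scale, R = frame
    return points_canon @ R.T * scale + center

pts = np.random.randn(2048, 3) * 5.0 + 10.0
canon, frame = canonicalize(pts)
completed = canon  # placeholder for the network's similarity-invariant reasoning
assert np.allclose(restore(completed, frame), pts, atol=1e-5)
```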

[CV-5] Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training

【速读】:该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在仅使用文本数据进行预训练的情况下,为何能展现出潜在的视觉能力,以及如何系统性地挖掘和利用这种隐含的视觉先验知识(visual priors),以构建更高效的视觉感知与推理能力的多模态大模型(Multimodal Large Language Models, MLLMs)。其解决方案的关键在于揭示了视觉先验由两类可分离的子先验构成——感知先验(perception prior)和推理先验(reasoning prior),并发现二者具有不同的来源和扩展规律:其中推理先验主要来源于以推理为核心的文本数据(如代码、数学和学术文本),且随预训练规模增长而持续提升;而感知先验则更多来自广泛语料库,并对视觉编码器和视觉指令微调数据更为敏感。基于此洞察,作者提出了一种以数据为中心的预训练策略,在1T token规模实验中验证了该方法的有效性,为下一代高效、可控的多模态大模型发展提供了新路径。

链接: https://arxiv.org/abs/2509.26625
作者: Junlin Han,Shengbang Tong,David Fan,Yufan Ren,Koustuv Sinha,Philip Torr,Filippos Kokkinos
机构: Meta(元)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Project page: this https URL

点击查看摘要

Abstract:Large Language Models (LLMs), despite being trained on text alone, surprisingly develop rich visual priors. These priors allow latent visual capabilities to be unlocked for vision tasks with a relatively small amount of multimodal data, and in some cases, to perform visual tasks without ever having seen an image. Through systematic analysis, we reveal that visual priors-the implicit, emergent knowledge about the visual world acquired during language pre-training-are composed of separable perception and reasoning priors with unique scaling trends and origins. We show that an LLM’s latent visual reasoning ability is predominantly developed by pre-training on reasoning-centric data (e.g., code, math, academia) and scales progressively. This reasoning prior acquired from language pre-training is transferable and universally applicable to visual reasoning. In contrast, a perception prior emerges more diffusely from broad corpora, and perception ability is more sensitive to the vision encoder and visual instruction tuning data. In parallel, text describing the visual world proves crucial, though its performance impact saturates rapidly. Leveraging these insights, we propose a data-centric recipe for pre-training vision-aware LLMs and verify it in 1T token scale pre-training. Our findings are grounded in over 100 controlled experiments consuming 500,000 GPU-hours, spanning the full MLLM construction pipeline-from LLM pre-training to visual alignment and supervised multimodal fine-tuning-across five model scales, a wide range of data categories and mixtures, and multiple adaptation setups. Along with our main findings, we propose and investigate several hypotheses, and introduce the Multi-Level Existence Bench (MLE-Bench). Together, this work provides a new way of deliberately cultivating visual priors from language pre-training, paving the way for the next generation of multimodal LLMs.
zh

[CV-6] HART: Human Aligned Reconstruction Transformer

【速读】:该论文旨在解决稀疏视角下人体重建中长期存在的挑战:现有方法要么依赖参数化模板(如SMPL-X)而无法准确建模松散衣物和人-物交互,要么基于隐式函数建模但受限于简化的相机假设,难以在真实场景中应用。其解决方案的关键在于提出HART框架,通过一个前馈Transformer模型预测每像素的3D点图、法向量和身体对应关系,并结合遮挡感知的泊松重建算法恢复完整几何结构(包括自遮挡区域),同时将结果与SMPL-X人体模型对齐以保证结构合理性并保留衣物细节;该方案进一步利用生成的人体对齐网格初始化高斯点绘(Gaussian splatting)表示,从而实现高质量的稀疏视角新视图合成。

链接: https://arxiv.org/abs/2509.26621
作者: Xiyi Chen,Shaofei Wang,Marko Mihajlovic,Taewon Kang,Sergey Prokudin,Ming Lin
机构: University of Maryland, College Park (马里兰大学学院公园分校); State Key Laboratory of General Artificial Intelligence, BIGAI (通用人工智能国家重点实验室,BIGAI); ETH Zurich (苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:We introduce HART, a unified framework for sparse-view human reconstruction. Given a small set of uncalibrated RGB images of a person as input, it outputs a watertight clothed mesh, the aligned SMPL-X body mesh, and a Gaussian-splat representation for photorealistic novel-view rendering. Prior methods for clothed human reconstruction either optimize parametric templates, which overlook loose garments and human-object interactions, or train implicit functions under simplified camera assumptions, limiting applicability in real scenes. In contrast, HART predicts per-pixel 3D point maps, normals, and body correspondences, and employs an occlusion-aware Poisson reconstruction to recover complete geometry, even in self-occluded regions. These predictions also align with a parametric SMPL-X body model, ensuring that reconstructed geometry remains consistent with human structure while capturing loose clothing and interactions. These human-aligned meshes initialize Gaussian splats to further enable sparse-view rendering. While trained on only 2.3K synthetic scans, HART achieves state-of-the-art results: Chamfer Distance improves by 18-23 percent for clothed-mesh reconstruction, PA-V2V drops by 6-27 percent for SMPL-X estimation, LPIPS decreases by 15-27 percent for novel-view synthesis on a wide range of datasets. These results suggest that feed-forward transformers can serve as a scalable model for robust human reconstruction in real-world settings. Code and models will be released.
zh

[CV-7] DA2: Depth Anything in Any Direction

【速读】:该论文旨在解决全景图像(panoramic image)深度估计中存在的两个关键问题:一是由于全景数据稀缺导致的模型在跨域场景下零样本(zero-shot)泛化能力差;二是全景图像固有的球面畸变使得传统基于透视图分割(如立方体贴图)的方法效率低下且性能受限。解决方案的关键在于提出一种名为DA²(Depth Anything in Any Direction)的端到端全景深度估计框架,其核心创新包括:1)构建一个高质量全景数据生成引擎,从透视图像中自动合成约543K个全景RGB-深度对,使总数据量达到约607K,显著提升训练数据规模与多样性;2)设计SphereViT模型,显式利用球坐标系约束全景特征的空间几何一致性,有效缓解球面畸变带来的误差。实验表明,DA²在多个数据集上实现SOTA性能,平均AbsRel指标相比最强零样本基线提升38%,甚至优于部分域内训练方法,同时具备更高的计算效率。

链接: https://arxiv.org/abs/2509.26618
作者: Haodong Li,Wangguangdong Zheng,Jing He,Yuhao Liu,Xin Lin,Xin Yang,Ying-Cong Chen,Chunchao Guo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Work primarily done during an internship at Tencent Hunyuan. Project page: this https URL

点击查看摘要

Abstract:Panorama has a full FoV (360°×180°), offering a more complete visual description than perspective images. Thanks to this characteristic, panoramic depth estimation is gaining increasing traction in 3D vision. However, due to the scarcity of panoramic data, previous methods are often restricted to in-domain settings, leading to poor zero-shot generalization. Furthermore, due to the spherical distortions inherent in panoramas, many approaches rely on perspective splitting (e.g., cubemaps), which leads to suboptimal efficiency. To address these challenges, we propose DA²: Depth Anything in Any Direction, an accurate, zero-shot generalizable, and fully end-to-end panoramic depth estimator. Specifically, for scaling up panoramic data, we introduce a data curation engine for generating high-quality panoramic depth data from perspective, and create ~543K panoramic RGB-depth pairs, bringing the total to ~607K. To further mitigate the spherical distortions, we present SphereViT, which explicitly leverages spherical coordinates to enforce the spherical geometric consistency in panoramic image features, yielding improved performance. A comprehensive benchmark on multiple datasets clearly demonstrates DA²'s SoTA performance, with an average 38% improvement on AbsRel over the strongest zero-shot baseline. Surprisingly, DA² even outperforms prior in-domain methods, highlighting its superior zero-shot generalization. Moreover, as an end-to-end solution, DA² exhibits much higher efficiency over fusion-based approaches. Both the code and the curated panoramic data will be released. Project page: this https URL.
zh

[CV-8] Hy-Facial: Hybrid Feature Extraction by Dimensionality Reduction Methods for Enhanced Facial Expression Classification

【速读】:该论文旨在解决面部表情分类(Facial Expression Recognition, FER)中因高维面部图像数据带来的挑战,包括特征冗余与计算复杂度高等问题。其解决方案的关键在于提出一种混合特征提取框架(Hy-Facial),通过融合VGG19深度网络提取的深层特征与手工设计的局部描述子(如SIFT和ORB),构建丰富且多样化的图像表征;同时,系统评估多种降维策略后发现UMAP在保留高维特征空间的局部与全局结构方面最优,从而显著提升分类性能(达到83.3%准确率)。此研究表明,降维不仅是预处理步骤,更是优化特征质量、增强整体分类效果的核心环节。

链接: https://arxiv.org/abs/2509.26614
作者: Xinjin Li,Yu Ma,Kaisen Ye,Jinghan Cao,Minghao Zhou,Yeyang Zhou
机构: Columbia University (哥伦比亚大学); Carnegie Mellon University (卡内基梅隆大学); Zhejiang University (浙江大学); San Francisco State University (旧金山州立大学); UC San Diego (加州大学圣地亚哥分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Facial expression classification remains a challenging task due to the high dimensionality and inherent complexity of facial image data. This paper presents Hy-Facial, a hybrid feature extraction framework that integrates both deep learning and traditional image processing techniques, complemented by a systematic investigation of dimensionality reduction strategies. The proposed method fuses deep features extracted from the Visual Geometry Group 19-layer network (VGG19) with handcrafted local descriptors and the scale-invariant feature transform (SIFT) and Oriented FAST and Rotated BRIEF (ORB) algorithms, to obtain rich and diverse image representations. To mitigate feature redundancy and reduce computational complexity, we conduct a comprehensive evaluation of dimensionality reduction techniques and feature extraction. Among these, UMAP is identified as the most effective, preserving both local and global structures of the high-dimensional feature space. The Hy-Facial pipeline integrated VGG19, SIFT, and ORB for feature extraction, followed by K-means clustering and UMAP for dimensionality reduction, resulting in a classification accuracy of 83.3% in the facial expression recognition (FER) dataset. These findings underscore the pivotal role of dimensionality reduction not only as a pre-processing step but as an essential component in improving feature quality and overall classification performance.
zh
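A rough sketch of the described pipeline, assuming standard torchvision, OpenCV, umap-learn, and scikit-learn APIs: deep VGG19 features are fused with pooled SIFT/ORB descriptors, reduced with UMAP, and fed to a simple classifier. Data loading, the K-means step, and the exact dimensionalities are placeholders, not the paper's configuration.

```python
# Hedged sketch of a hybrid deep + handcrafted feature pipeline with UMAP.
import cv2
import numpy as np
import torch
from torchvision import models, transforms
import umap
from sklearn.linear_model import LogisticRegression

vgg = models.vgg19(weights="IMAGENET1K_V1").features.eval()
prep = transforms.Compose([transforms.ToTensor(), transforms.Resize((224, 224))])
sift, orb = cv2.SIFT_create(), cv2.ORB_create()

def pooled(desc, dim):
    # average local descriptors into one fixed-length vector
    return desc.mean(axis=0) if desc is not None else np.zeros(dim)

def image_features(img_bgr: np.ndarray) -> np.ndarray:
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    with torch.no_grad():
        deep = vgg(prep(img_bgr[..., ::-1].copy()).unsqueeze(0)).mean(dim=(2, 3)).flatten().numpy()
    _, sd = sift.detectAndCompute(gray, None)
    _, od = orb.detectAndCompute(gray, None)
    return np.concatenate([deep, pooled(sd, 128), pooled(od, 32)])

# Placeholder features/labels stand in for the FER dataset.
X = np.random.randn(200, 512 + 128 + 32)
y = np.random.randint(0, 3, 200)            # 3 expression classes, illustrative
X_low = umap.UMAP(n_components=32).fit_transform(X)
clf = LogisticRegression(max_iter=1000).fit(X_low, y)
print(clf.score(X_low, y))
```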

[CV-9] Video Object Segmentation-Aware Audio Generation DATE

【速读】:该论文旨在解决现有多模态音频生成模型在专业拟音(Foley)工作流程中缺乏精确用户控制的问题,尤其是这些模型通常依赖整个视频输入,无法对场景中的特定物体进行优先级排序,导致生成不必要的背景声音或错误地聚焦于非目标对象。解决方案的关键在于提出了一种新的任务——基于视频物体分割的音频生成(video object segmentation-aware audio generation),通过显式地将声音合成条件化于物体级别的分割图(segmentation maps),结合视频和文本线索,构建了SAGANet这一新型多模态生成模型,从而实现对音频生成的细粒度与视觉定位控制。

链接: https://arxiv.org/abs/2509.26604
作者: Ilpo Viertola,Vladimir Iashin,Esa Rahtu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint version. The Version of Record is published in DAGM GCPR 2025 proceedings with Springer Lecture Notes in Computer Science (LNCS). Updated results and resources are available at the project page: this https URL

点击查看摘要

Abstract:Existing multimodal audio generation models often lack precise user control, which limits their applicability in professional Foley workflows. In particular, these models focus on the entire video and do not provide precise methods for prioritizing a specific object within a scene, generating unnecessary background sounds, or focusing on the wrong objects. To address this gap, we introduce the novel task of video object segmentation-aware audio generation, which explicitly conditions sound synthesis on object-level segmentation maps. We present SAGANet, a new multimodal generative model that enables controllable audio generation by leveraging visual segmentation masks along with video and textual cues. Our model provides users with fine-grained and visually localized control over audio generation. To support this task and further research on segmentation-aware Foley, we propose Segmented Music Solos, a benchmark dataset of musical instrument performance videos with segmentation information. Our method demonstrates substantial improvements over current state-of-the-art methods and sets a new standard for controllable, high-fidelity Foley synthesis. Code, samples, and Segmented Music Solos are available at this https URL
zh

[CV-10] DiffCamera: Arbitrary Refocusing on Images

【速读】:该论文旨在解决图像景深(Depth-of-Field, DoF)效果一旦生成便难以修改的问题,尤其当原始图像中主体失焦时,传统方法无法灵活调整焦点位置和模糊程度。为实现对已生成图像的灵活重对焦(refocusing),作者提出DiffCamera模型,其核心创新在于设计了一个基于扩散Transformer(diffusion transformer)的框架,并引入一种物理驱动的“堆叠约束”(stacking constraint)以增强训练稳定性与准确性。该约束借鉴了摄影原理:不同对焦平面的照片可线性叠加形成多焦点图像,从而强制模型在训练过程中学习到符合真实光学规律的重对焦行为,确保输出结果与场景结构及相机参数一致,进而支持高保真、可控的景深调节。

链接: https://arxiv.org/abs/2509.26599
作者: Yiyang Wang,Xi Chen,Xiaogang Xu,Yu Liu,Hengshuang Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The depth-of-field (DoF) effect, which introduces aesthetically pleasing blur, enhances photographic quality but is fixed and difficult to modify once the image has been created. This becomes problematic when the applied blur is undesirable (e.g., the subject is out of focus). To address this, we propose DiffCamera, a model that enables flexible refocusing of a created image conditioned on an arbitrary new focus point and a blur level. Specifically, we design a diffusion transformer framework for refocusing learning. However, the training requires pairs of data with different focus planes and bokeh levels in the same scene, which are hard to acquire. To overcome this limitation, we develop a simulation-based pipeline to generate large-scale image pairs with varying focus planes and bokeh levels. With the simulated data, we find that training with only a vanilla diffusion objective often leads to incorrect DoF behaviors due to the complexity of the task. This requires a stronger constraint during training. Inspired by the photographic principle that photos of different focus planes can be linearly blended into a multi-focus image, we propose a stacking constraint during training to enforce precise DoF manipulation. This constraint enhances model training by imposing physically grounded refocusing behavior that the focusing results should be faithfully aligned with the scene structure and the camera conditions so that they can be combined into the correct multi-focus image. We also construct a benchmark to evaluate the effectiveness of our refocusing model. Extensive experiments demonstrate that DiffCamera supports stable refocusing across a wide range of scenes, providing unprecedented control over DoF adjustments for photography and generative AI applications.
zh
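The stacking constraint lends itself to a simple illustrative loss: outputs refocused to two different planes should blend linearly into a reference multi-focus image. The model call signature, the blend weight, and the L1 choice below are assumptions for illustration, not the paper's exact objective.

```python
# Hedged toy of a "stacking" training constraint: refocused outputs for two
# focus planes should blend linearly into the reference multi-focus image.
# `model`, the blend weight, and the target are placeholders.
import torch
import torch.nn.functional as F

def stacking_loss(model, image, focus_a, focus_b, multi_focus_target, w=0.5):
    out_a = model(image, focus=focus_a)          # refocus to plane A
    out_b = model(image, focus=focus_b)          # refocus to plane B
    blended = w * out_a + (1.0 - w) * out_b      # linear focal-stack blend
    return F.l1_loss(blended, multi_focus_target)
```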

[CV-11] Autoproof: Automated Segmentation Proofreading for Connectomics

【速读】:该论文旨在解决电子显微镜(Electron Microscopy, EM)图像中神经连接组(connectome)重建过程中依赖大量人工校对(proofreading)导致的效率瓶颈问题,尤其是在扩大重建规模或开展比较神经连接组学(comparative connectomics)时面临的人力成本限制。其解决方案的关键在于利用已有由人工标注生成的高质量真实数据(ground-truth data),训练机器学习模型以自动化或优化部分校对流程:首先通过模型实现仅需20%的人工成本即可获得90%的校对效益;其次进一步开发自动合并大量分割片段的能力,成功将20万条碎片自动拼接,相当于节省四名校对员一年的工作量,并使连接组的连通性完成度提升1.3个百分点。

链接: https://arxiv.org/abs/2509.26585
作者: Gary B Huang,William M Katz,Stuart Berg,Louis Scheffer
机构: Janelia Research Campus, Howard Hughes Medical Institute, USA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Producing connectomes from electron microscopy (EM) images has historically required a great deal of human proofreading effort. This manual annotation cost is the current bottleneck in scaling EM connectomics, for example, in making larger connectome reconstructions feasible, or in enabling comparative connectomics where multiple related reconstructions are produced. In this work, we propose using the available ground-truth data generated by this manual annotation effort to learn a machine learning model to automate or optimize parts of the required proofreading workflows. We validate our approach on a recent complete reconstruction of the Drosophila male central nervous system. We first show our method would allow for obtaining 90% of the value of a guided proofreading workflow while reducing required cost by 80%. We then demonstrate a second application for automatically merging many segmentation fragments to proofread neurons. Our system is able to automatically attach 200 thousand fragments, equivalent to four proofreader years of manual work, and increasing the connectivity completion rate of the connectome by 1.3% points.
zh

[CV-12] Stable Cinemetrics: Structured Taxonomy and Evaluation for Professional Video Generation NEURIPS2025

【速读】:该论文旨在解决当前视频生成模型在专业影视制作场景下评估不足的问题,即现有模型与基准测试未能充分捕捉电影制作中复杂的控制维度和实际需求。其解决方案的关键在于提出了一种结构化的评估框架——Stable Cinemetrics(SCINE),将影视创作控制因素解耦并层级化为四个核心维度:Setup(布景)、Event(事件)、Lighting(灯光)和Camera(摄像),共定义76个基于行业实践的细粒度控制节点;在此基础上构建了与专业用例对齐的提示基准,并开发了自动化提示分类与问题生成流水线,实现对每个控制维度的独立评估;同时通过大规模人工标注研究(10+模型、20K视频、80+电影专业人士)揭示当前最强模型在Event和Camera相关控制上的显著短板,并训练了一个与专家标注对齐的视觉-语言自动评估器,显著优于零样本基线,从而为专业视频生成提供了可扩展、系统化的评估路径。

链接: https://arxiv.org/abs/2509.26555
作者: Agneet Chatterjee,Rahim Entezari,Maksym Zhuravinskyi,Maksim Lapin,Reshinth Adithyan,Amit Raj,Chitta Baral,Yezhou Yang,Varun Jampani
机构: Stability AI; Arizona State University (亚利桑那州立大学); Google DeepMind (谷歌深度思维)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2025. Project Page : this https URL

点击查看摘要

Abstract:Recent advances in video generation have enabled high-fidelity video synthesis from user provided prompts. However, existing models and benchmarks fail to capture the complexity and requirements of professional video generation. Towards that goal, we introduce Stable Cinemetrics, a structured evaluation framework that formalizes filmmaking controls into four disentangled, hierarchical taxonomies: Setup, Event, Lighting, and Camera. Together, these taxonomies define 76 fine-grained control nodes grounded in industry practices. Using these taxonomies, we construct a benchmark of prompts aligned with professional use cases and develop an automated pipeline for prompt categorization and question generation, enabling independent evaluation of each control dimension. We conduct a large-scale human study spanning 10+ models and 20K videos, annotated by a pool of 80+ film professionals. Our analysis, both coarse and fine-grained reveal that even the strongest current models exhibit significant gaps, particularly in Events and Camera-related controls. To enable scalable evaluation, we train an automatic evaluator, a vision-language model aligned with expert annotations that outperforms existing zero-shot baselines. SCINE is the first approach to situate professional video generation within the landscape of video generative models, introducing taxonomies centered around cinematic controls and supporting them with structured evaluation pipelines and detailed analyses to guide future research.
zh

[CV-13] DEPTHOR: Robust Depth Enhancement from a Real-World Lightweight dToF and RGB Guidance

【速读】:该论文旨在解决轻量级飞行时间(depth of flight, dToF)传感器在实际应用中因校准误差和异常值导致的深度增强(depth enhancement)性能下降问题,尤其是在高精度任务如三维重建和同步定位与地图构建(SLAM)中的鲁棒性不足。其核心解决方案包括三个关键创新:首先,基于合成数据集的仿真方法生成真实感训练样本,提升模型对噪声输入的适应能力;其次,提出一种无需学习参数的异常检测机制,有效识别并剔除错误的dToF测量值,避免误导传播;最后,设计了一种针对噪声dToF输入的深度完成网络,融合RGB图像与预训练单目深度估计先验信息,显著改善复杂区域的深度恢复效果。

链接: https://arxiv.org/abs/2509.26498
作者: Jijun Xiang,Longliang Liu,Xuan Zhu,Xianqi Wang,Min Lin,Xin Yang
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 16 figures

点击查看摘要

Abstract:Depth enhancement, which converts raw dToF signals into dense depth maps using RGB guidance, is crucial for improving depth perception in high-precision tasks such as 3D reconstruction and SLAM. However, existing methods often assume ideal dToF inputs and perfect dToF-RGB alignment, overlooking calibration errors and anomalies, thus limiting real-world applicability. This work systematically analyzes the noise characteristics of real-world lightweight dToF sensors and proposes a practical and novel depth completion framework, DEPTHOR++, which enhances robustness to noisy dToF inputs from three key aspects. First, we introduce a simulation method based on synthetic datasets to generate realistic training samples for robust model training. Second, we propose a learnable-parameter-free anomaly detection mechanism to identify and remove erroneous dToF measurements, preventing misleading propagation during completion. Third, we design a depth completion network tailored to noisy dToF inputs, which integrates RGB images and pre-trained monocular depth estimation priors to improve depth recovery in challenging regions. On the ZJU-L5 dataset and real-world samples, our training strategy significantly boosts existing depth completion models, with our model achieving state-of-the-art performance, improving RMSE and Rel by 22% and 11% on average. On the Mirror3D-NYU dataset, by incorporating the anomaly detection method, our model improves upon the previous SOTA by 37% in mirror regions. On the Hammer dataset, using simulated low-cost dToF data from RealSense L515, our method surpasses the L515 measurements with an average gain of 22%, demonstrating its potential to enable low-cost sensors to outperform higher-end devices. Qualitative results across diverse real-world datasets further validate the effectiveness and generalizability of our approach.
zh

[CV-14] Revealing the Power of Post-Training for Small Language Models via Knowledge Distillation

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)因参数规模庞大和计算成本高昂,难以在资源受限的边缘环境(edge environments)中直接部署的问题。为实现高性能小模型在边缘设备上的高效运行,研究提出了一种系统化的后训练(post-training)优化流程,其关键在于结合基于课程学习的监督微调(curriculum-based supervised fine-tuning, SFT)与离线策略内知识蒸馏(offline on-policy knowledge distillation),从而显著提升小模型在复杂任务中的准确性,同时在硬件资源严格受限条件下保持优异的泛化能力和多任务竞争力。

链接: https://arxiv.org/abs/2509.26497
作者: Miao Rang,Zhenni Bi,Hang Zhou,Hanting Chen,An Xiao,Tianyu Guo,Kai Han,Xinghao Chen,Yunhe Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7

点击查看摘要

Abstract:The rapid advancement of large language models (LLMs) has significantly advanced the capabilities of artificial intelligence across various domains. However, their massive scale and high computational costs render them unsuitable for direct deployment in resource-constrained edge environments. This creates a critical need for high-performance small models that can operate efficiently at the edge. Yet, after pre-training alone, these smaller models often fail to meet the performance requirements of complex tasks. To bridge this gap, we introduce a systematic post-training pipeline that efficiently enhances small model accuracy. Our post training pipeline consists of curriculum-based supervised fine-tuning (SFT) and offline on-policy knowledge distillation. The resulting instruction-tuned model achieves state-of-the-art performance among billion-parameter models, demonstrating strong generalization under strict hardware constraints while maintaining competitive accuracy across a variety of tasks. This work provides a practical and efficient solution for developing high-performance language models on Ascend edge devices.
zh
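For the distillation stage, a common token-level formulation has the student match the teacher's softened output distribution; a minimal version is sketched below. The curriculum SFT schedule and the offline on-policy sampling described in the abstract are omitted, and the temperature value is an arbitrary choice, so this is only a generic sketch of the distillation idea.

```python
# Simplified token-level knowledge-distillation objective (forward KL between
# teacher and student distributions over the vocabulary).
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, temperature: float = 2.0):
    # logits: (batch, seq_len, vocab)
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)
    s_logprob = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(s_logprob, t_prob, reduction="batchmean") * temperature ** 2

loss = distill_loss(torch.randn(2, 16, 32000), torch.randn(2, 16, 32000))
```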

[CV-15] Contrastive Diffusion Guidance for Spatial Inverse Problems

【速读】:该论文致力于解决从用户在空间内的移动轨迹中逆向重建空间布局(如家庭平面图)的难题,这是一个典型的不适定逆问题,因为多个不同的布局可能解释相同的轨迹数据。解决方案的关键在于提出一种基于扩散模型的后验采样方法,并通过引入一个平滑嵌入空间来重构似然分数(likelihood score)。该嵌入空间通过对比损失训练,使匹配的布局与轨迹对彼此靠近,而不匹配的对则被拉远;在此空间中构建的代理似然分数可有效近似真实似然分数,从而引导去噪过程收敛至合理的后验分布。该方法显著提升了生成布局的一致性和鲁棒性,优于现有基于可微分路径规划器和引导扩散的方法。

链接: https://arxiv.org/abs/2509.26489
作者: Sattwik Basu,Chaitanya Amballa,Zhongweiyang Xu,Jorge Vančo Sampedro,Srihari Nelakuditi,Romit Roy Choudhury
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); University of South Carolina (南卡罗来纳大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:We consider the inverse problem of reconstructing the spatial layout of a place, a home floorplan for example, from a user's movements inside that layout. Direct inversion is ill-posed since many floorplans can explain the same movement trajectories. We adopt a diffusion-based posterior sampler to generate layouts consistent with the measurements. While active research is in progress on generative inverse solvers, we find that the forward operator in our problem poses new challenges. The path-planning process inside a floorplan is a non-invertible, non-differentiable function, and causes instability while optimizing using the likelihood score. We break away from existing approaches and reformulate the likelihood score in a smoother embedding space. The embedding space is trained with a contrastive loss which brings compatible floorplans and trajectories close to each other, while pushing mismatched pairs far apart. We show that a surrogate form of the likelihood score in this embedding space is a valid approximation of the true likelihood score, making it possible to steer the denoising process towards the posterior. Across extensive experiments, our model CoGuide produces more consistent floorplans from trajectories, and is more robust than differentiable-planner baselines and guided-diffusion methods.
zh
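The contrastive embedding space described above can be trained with a standard symmetric InfoNCE objective over matched (floorplan, trajectory) pairs, as sketched below with placeholder encoders; the paper's encoder architectures and how the surrogate score is plugged into the sampler are not reproduced here.

```python
# Symmetric InfoNCE over matched (floorplan, trajectory) embedding pairs.
import torch
import torch.nn.functional as F

def contrastive_loss(floorplan_emb, trajectory_emb, tau: float = 0.07):
    f = F.normalize(floorplan_emb, dim=-1)      # (B, D)
    t = F.normalize(trajectory_emb, dim=-1)     # (B, D)
    logits = f @ t.T / tau                      # (B, B) similarity matrix
    labels = torch.arange(f.size(0), device=f.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))

loss = contrastive_loss(torch.randn(32, 128), torch.randn(32, 128))
```

Once trained, the gradient of the similarity between the embedded denoised layout and the embedded measured trajectory can serve as the smoother surrogate likelihood score that guides the diffusion sampler.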

[CV-16] CBAM Integrated Attention Driven Model For Betel Leaf Diseases Classification With Explainable AI

【速读】:该论文旨在解决槟榔叶病(betel leaf disease)早期识别困难的问题,此类病害对槟榔产业的粮食安全和经济效益构成严重威胁。传统方法难以及时准确诊断,导致潜在经济损失。解决方案的关键在于提出一种轻量级CBAM-CNN模型,通过引入卷积块注意力模块(Convolutional Block Attention Module, CBAM),在不依赖大型预训练网络的前提下,增强模型对空间和通道维度特征的关注能力,从而提升对细微病害差异的辨别力。该模型仅含2.13百万参数,在包含10,185张图像的平衡数据集上实现了95.58%的测试准确率,且精度、召回率和F1分数分别达到97%、94%和95%,显著优于传统预训练CNN模型。

链接: https://arxiv.org/abs/2509.26484
作者: Sumaiya Tabassum,Md. Faysal Ahamed
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Betel leaf is an important crop because of its economic advantages and widespread use. Its betel vines are susceptible to a number of illnesses that are commonly referred to as betel leaf disease. Plant diseases are the largest threat to the food supply’s security, and they are challenging to identify in time to stop possible financial damage. Interestingly, artificial intelligence can leave a big mark on the betel leaf industry since it helps with output growth by forecasting sickness. This paper presents a lightweight CBAM-CNN model with just 2.13 million parameters (8.13 MB), incorporating CBAM (Convolutional Block Attention Module) to improve feature emphasis without depending on heavy pre-trained networks. The model’s capacity to discern minute variations among leaf disease classes is improved by the integrated attention mechanism, which allows it to adaptively focus on significant spatial and channel-wise information. In order to ensure class balance and diversity for efficient model training and validation, this work makes use of an enriched dataset of 10,185 images divided into three categories: Healthy Leaf, Leaf Rot, and Leaf Spot. The proposed model achieved a precision of 97%, recall of 94%, and F1 score of 95%, and 95.58% accuracy on the test set demonstrating strong and balanced classification performance outperforming traditional pre trained CNN models. The model’s focus regions were visualized and interpreted using Grad-CAM (Gradient-weighted Class Activation Mapping), an explainable AI technique.
zh

For reference, a compact CBAM block of the kind the abstract integrates into the CNN is sketched below: channel attention from pooled statistics followed by spatial attention from a small convolution. The reduction ratio, kernel size, and feature sizes are illustrative, not the paper's exact configuration.

```python
# Compact CBAM block: channel attention, then spatial attention.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 8, kernel_size: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                                   # x: (B, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))                  # channel attention
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx)[:, :, None, None]
        s = torch.cat([x.mean(dim=1, keepdim=True),         # spatial attention
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

features = CBAM(64)(torch.randn(2, 64, 56, 56))
```

[CV-17] Zero-Shot Decentralized Federated Learning IJCNN

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中零样本适应(zero-shot adaptation)的泛化能力不足、通信开销高以及对中心服务器依赖性强的问题。现有方法如FedCoOp和FedTPG虽能提升性能,但受限于集中式架构和高昂的通信成本,难以在真实分布式场景下部署。解决方案的关键在于提出一种完全去中心化的零样本联邦学习框架——ZeroDFL,其核心创新是引入迭代提示共享机制(iterative prompt-sharing mechanism),使客户端无需中央协调即可优化并交换文本提示(textual prompts),从而显著降低通信复杂度(相比FedTPG减少118倍),同时保持或超越当前最优方法的分类性能,实现了更高的可扩展性、效率与隐私保护。

链接: https://arxiv.org/abs/2509.26462
作者: Alessio Masano,Matteo Pennisi,Federica Proietto Salanitri,Concetto Spampinato,Giovanni Bellitto
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at International Joint Conference on Neural Networks (IJCNN) 2025. Code available at this https URL

点击查看摘要

Abstract:CLIP has revolutionized zero-shot learning by enabling task generalization without fine-tuning. While prompting techniques like CoOp and CoCoOp enhance CLIP’s adaptability, their effectiveness in Federated Learning (FL) remains an open challenge. Existing federated prompt learning approaches, such as FedCoOp and FedTPG, improve performance but face generalization issues, high communication costs, and reliance on a central server, limiting scalability and privacy. We propose Zero-shot Decentralized Federated Learning (ZeroDFL), a fully decentralized framework that enables zero-shot adaptation across distributed clients without a central coordinator. ZeroDFL employs an iterative prompt-sharing mechanism, allowing clients to optimize and exchange textual prompts to enhance generalization while drastically reducing communication overhead. We validate ZeroDFL on nine diverse image classification datasets, demonstrating that it consistently outperforms–or remains on par with–state-of-the-art federated prompt learning methods. More importantly, ZeroDFL achieves this performance in a fully decentralized setting while reducing communication overhead by 118x compared to FedTPG. These results highlight that our approach not only enhances generalization in federated zero-shot learning but also improves scalability, efficiency, and privacy preservation–paving the way for decentralized adaptation of large vision-language models in real-world applications.
zh

[CV-18] Attention over Scene Graphs: Indoor Scene Representations Toward CSAI Classification BMVC2025

【速读】:该论文旨在解决室内场景分类(Indoor Scene Classification)任务中因物体间复杂关系与空间布局带来的挑战,尤其针对敏感内容分析(如儿童性虐待图像,CSAI)的应用需求。其核心问题是传统基于像素的方法难以有效建模场景内语义交互,且存在隐私泄露风险。解决方案的关键在于提出ASGRA框架,通过将图像转化为结构化的场景图(Scene Graph),并利用图注意力网络(Graph Attention Network)直接建模对象与关系之间的交互,从而实现两个关键优势:一是通过识别具体对象和它们的语义关系提供内在可解释性;二是无需接触原始敏感图像即可训练模型,保障隐私安全。

链接: https://arxiv.org/abs/2509.26457
作者: Artur Barros,Carlos Caetano,João Macedo,Jefersson A. dos Santos,Sandra Avila
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: British Machine Vision Conference (BMVC 2025), in the From Scene Understanding to Human Modeling Workshop

点击查看摘要

Abstract:Indoor scene classification is a critical task in computer vision, with wide-ranging applications that go from robotics to sensitive content analysis, such as child sexual abuse imagery (CSAI) classification. The problem is particularly challenging due to the intricate relationships between objects and complex spatial layouts. In this work, we propose the Attention over Scene Graphs for Sensitive Content Analysis (ASGRA), a novel framework that operates on structured graph representations instead of raw pixels. By first converting images into Scene Graphs and then employing a Graph Attention Network for inference, ASGRA directly models the interactions between a scene’s components. This approach offers two key benefits: (i) inherent explainability via object and relationship identification, and (ii) privacy preservation, enabling model training without direct access to sensitive images. On Places8, we achieve 81.27% balanced accuracy, surpassing image-based methods. Real-world CSAI evaluation with law enforcement yields 74.27% balanced accuracy. Our results establish structured scene representations as a robust paradigm for indoor scene classification and CSAI classification. Code is publicly available at this https URL.
zh
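A minimal version of the graph path, assuming torch_geometric: object nodes carry embeddings, detected relationships define edges, two graph-attention layers propagate context, and mean pooling yields a scene label. Edge (relationship) features and the exact dimensions used by ASGRA are omitted.

```python
# Minimal scene-graph classifier with graph attention and global pooling.
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv, global_mean_pool

class SceneGraphClassifier(nn.Module):
    def __init__(self, in_dim=256, hidden=128, num_classes=8):
        super().__init__()
        self.gat1 = GATConv(in_dim, hidden, heads=4, concat=True)
        self.gat2 = GATConv(hidden * 4, hidden, heads=1)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x, edge_index, batch):
        x = torch.relu(self.gat1(x, edge_index))
        x = torch.relu(self.gat2(x, edge_index))
        return self.head(global_mean_pool(x, batch))

# x: node features from detected objects, edge_index: (2, E) relations,
# batch: graph id per node (all zeros for a single scene graph).
model = SceneGraphClassifier()
logits = model(torch.randn(5, 256),
               torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]]),
               torch.zeros(5, dtype=torch.long))
```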

[CV-19] Stylos: Multi-View 3D Stylization with Single-Forward Gaussian Splatting

【速读】:该论文旨在解决从单张图像到多视角集合的3D风格迁移(3D style transfer)问题,尤其在无预设姿态(unposed content)条件下实现几何感知且视图一致的风格化效果。传统方法通常依赖于每场景优化或预先计算的姿态信息,限制了泛化能力与效率。其解决方案的关键在于提出Stylos框架——一个基于Transformer架构的单次前向传播3D高斯表示模型,通过两条并行路径:几何预测路径保留自注意力机制以维持几何保真度,风格注入路径则利用全局交叉注意力强制跨视角视觉一致性;同时引入体素级3D风格损失(voxel-based 3D style loss),将聚合场景特征对齐至参考风格图像的统计特性,从而在不依赖场景优化的前提下实现高质量、零样本(zero-shot)的视图一致风格迁移,并具备良好的可扩展性,适用于从单视图到大规模多视图设置。

链接: https://arxiv.org/abs/2509.26455
作者: Hanzhou Liu,Jia Huang,Mi Lu,Srikanth Saripalli,Peng Jiang
机构: Texas A&M University (德州农工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present Stylos, a single-forward 3D Gaussian framework for 3D style transfer that operates on unposed content, from a single image to a multi-view collection, conditioned on a separate reference style image. Stylos synthesizes a stylized 3D Gaussian scene without per-scene optimization or precomputed poses, achieving geometry-aware, view-consistent stylization that generalizes to unseen categories, scenes, and styles. At its core, Stylos adopts a Transformer backbone with two pathways: geometry predictions retain self-attention to preserve geometric fidelity, while style is injected via global cross-attention to enforce visual consistency across views. With the addition of a voxel-based 3D style loss that aligns aggregated scene features to style statistics, Stylos enforces view-consistent stylization while preserving geometry. Experiments across multiple datasets demonstrate that Stylos delivers high-quality zero-shot stylization, highlighting the effectiveness of global style-content coupling, the proposed 3D style loss, and the scalability of our framework from single view to large-scale multi-view settings.
zh

[CV-20] Multi-View Camera System for Variant-Aware Autonomous Vehicle Inspection and Defect Detection

【速读】:该论文旨在解决现代汽车生产线中确保每辆车符合指定配置且无可见缺陷的复杂质量控制问题。解决方案的关键在于提出一个端到端的多视角感知平台——自动化车辆检测(Automated Vehicle Inspection, AVI),其核心由深度学习检测器与语义规则引擎耦合构成,实现对车型(variant)敏感的实时质量控制。系统通过11个同步相机实现360°全覆盖拍摄,任务特定视图被分配至专用模块:YOLOv8用于零部件检测、EfficientNet用于内燃机/电动车型分类、Gemini-1.5 Flash用于车标光学字符识别(OCR),以及YOLOv8-Seg用于划痕与凹陷分割;随后,视图感知融合层标准化证据,VIN条件化的规则引擎将检测特征与预期配置清单比对,生成可解释的通过/失败报告,整体处理速度达约300毫秒/辆,验证准确率达93%,缺陷检测召回率达86%,显著优于单视角或无分割基线方法,是首个在工业部署场景中统一多相机特征验证与缺陷检测的公开系统。

链接: https://arxiv.org/abs/2509.26454
作者: Yash Kulkarni,Raman Jha,Renu Kachhoria
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Ensuring that every vehicle leaving a modern production line is built to the correct variant specification and is free from visible defects is an increasingly complex challenge. We present the Automated Vehicle Inspection (AVI) platform, an end-to-end, multi-view perception system that couples deep-learning detectors with a semantic rule engine to deliver variant-aware quality control in real time. Eleven synchronized cameras capture a full 360° sweep of each vehicle; task-specific views are then routed to specialised modules: YOLOv8 for part detection, EfficientNet for ICE/EV classification, Gemini-1.5 Flash for mascot OCR, and YOLOv8-Seg for scratch-and-dent segmentation. A view-aware fusion layer standardises evidence, while a VIN-conditioned rule engine compares detected features against the expected manifest, producing an interpretable pass/fail report in approximately 300 ms. On a mixed data set of Original Equipment Manufacturer (OEM) vehicle data sets of four distinct models plus public scratch/dent images, AVI achieves 93% verification accuracy, 86% defect-detection recall, and sustains 3.3 vehicles/min, surpassing single-view or no-segmentation baselines by large margins. To our knowledge, this is the first publicly reported system that unifies multi-camera feature validation with defect detection in a deployable automotive setting in industry.
zh
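The rule-engine step can be pictured as a simple manifest comparison keyed by VIN, as in the toy below; all field names and the pass criterion are invented for illustration and are not taken from the paper.

```python
# Toy VIN-conditioned rule engine: fused detections vs. expected manifest.
EXPECTED = {
    "VIN123": {"powertrain": "EV", "mascot": "brandX", "fog_lamps": True},
}

def verify(vin: str, detections: dict) -> dict:
    manifest = EXPECTED[vin]
    report = {k: detections.get(k) == v for k, v in manifest.items()}
    report["defects"] = detections.get("scratch_or_dent_regions", [])
    report["pass"] = (all(v for k, v in report.items() if k != "defects")
                      and not report["defects"])
    return report

print(verify("VIN123", {"powertrain": "EV", "mascot": "brandX",
                        "fog_lamps": True, "scratch_or_dent_regions": []}))
```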

[CV-21] Post-Training Quantization via Residual Truncation and Zero Suppression for Diffusion Models

【速读】:该论文旨在解决扩散模型(diffusion models)在低比特量化(如4-bit)部署时因量化误差放大导致细粒度纹理丢失的问题。传统8-bit outlier-aware后训练量化(Post-Training Quantization, PTQ)虽能保持全精度性能,但扩展至4-bit时,由于步长增大加剧了对小幅度激活值的舍入误差,从而损害图像细节质量。解决方案的关键在于提出一种名为Quantization via Residual Truncation and Zero Suppression (QuaRTZ) 的4-bit PTQ方法:首先使用8-bit最小-最大量化处理异常值以保留关键特征,再通过前导零抑制(leading-zero suppression)压缩至4-bit,从而保留最低有效位(LSB),有效减少舍入误差并提升纹理保真度。该方案在FLUX.1-schnell数据集上实现FID=6.98,优于依赖辅助FP16分支的SVDQuant方法。

链接: https://arxiv.org/abs/2509.26436
作者: Donghoon Kim,Dongyoung Lee,Ik Joon Chang,Sung-Ho Bae
机构: Kyung Hee University (庆熙大学); Department of Artificial Intelligence (人工智能系); Department of Electrical Engineering (电气工程系); Department of Computer Science (计算机科学系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models achieve high-quality image generation but face deployment challenges due to their high computational requirements. Although 8-bit outlier-aware post-training quantization (PTQ) matches full-precision performance, extending PTQ to 4 bits remains challenging. Larger step sizes in 4-bit quantization amplify rounding errors in dense, low-magnitude activations, leading to the loss of fine-grained textures. We hypothesize that not only outliers but also small activations are critical for texture fidelity. To this end, we propose Quantization via Residual Truncation and Zero Suppression (QuaRTZ), a 4-bit PTQ scheme for diffusion models. QuaRTZ applies 8-bit min-max quantization for outlier handling and compresses to 4 bits via leading-zero suppression to retain LSBs, thereby preserving texture details. Our approach reduces rounding errors and improves quantization efficiency by balancing outlier preservation and LSB precision. Both theoretical derivations and empirical evaluations demonstrate the generalizability of QuaRTZ across diverse activation distributions. Notably, 4-bit QuaRTZ achieves an FID of 6.98 on FLUX.1-schnell, outperforming SVDQuant that requires auxiliary FP16 branches.
zh

[CV-22] PRISM: Progressive Rain removal with Integrated State-space Modeling

【速读】:该论文旨在解决图像去雨(image deraining)任务中单尺度模型在细粒度结构恢复和全局一致性保持方面的不足。其核心解决方案是提出了一种渐进式的三阶段框架PRISM,关键创新在于:第一阶段CENet与第二阶段SFNet采用新型混合注意力UNet(HA-UNet)实现多尺度特征聚合,结合通道注意力与窗口化空间变换器;第二阶段进一步引入混合域Mamba(HDMamba)以联合建模空间语义与小波域特征;第三阶段RNet通过原分辨率子网络恢复细节结构,从而在保留全局上下文的同时学习高频雨纹特征,显著提升图像质量。

链接: https://arxiv.org/abs/2509.26413
作者: Pengze Xue,Shanwen Wang,Fei Zhou,Yan Cui,Xin Sun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. Submitted to an IEEE conference and currently under review. Copyright 2025 IEEE; personal use permitted; all other uses require permission

点击查看摘要

Abstract:Image deraining is an essential vision technique that removes rain streaks and water droplets, enhancing clarity for critical vision tasks like autonomous driving. However, current single-scale models struggle with fine-grained recovery and global consistency. To address this challenge, we propose Progressive Rain removal with Integrated State-space Modeling (PRISM), a progressive three-stage framework: Coarse Extraction Network (CENet), Frequency Fusion Network (SFNet), and Refine Network (RNet). Specifically, CENet and SFNet utilize a novel Hybrid Attention UNet (HA-UNet) for multi-scale feature aggregation by combining channel attention with windowed spatial transformers. Moreover, we propose Hybrid Domain Mamba (HDMamba) for SFNet to jointly model spatial semantics and wavelet domain characteristics. Finally, RNet recovers the fine-grained structures via an original-resolution subnetwork. Our model learns high-frequency rain characteristics while preserving structural details and maintaining global context, leading to improved image quality. Our method achieves competitive results on multiple datasets against recent deraining methods.
zh

[CV-23] Image-Difficulty-Aware Evaluation of Super-Resolution Models ICIP2025

【速读】:该论文旨在解决图像超分辨率(Image Super-Resolution, ISR)模型在传统平均性能指标评估下无法区分不同模型在难易程度各异的测试图像上表现差异的问题,尤其针对某些模型在特定困难图像上产生视觉伪影但整体平均得分相近的情况。解决方案的关键在于提出两种图像难度度量方法——高频指数(High-Frequency Index)和旋转不变边缘指数(Rotation-Invariant Edge Index),用以预测哪些测试图像中模型间会呈现显著的视觉差异,并结合一种能将此类视觉差异映射到客观评价指标上的新评估方法,从而实现对ISR模型更细致、更具判别力的性能评估。

链接: https://arxiv.org/abs/2509.26398
作者: Atakan Topaloglu,Ahmet Bilican,Cansu Korkmaz,A. Murat Tekalp
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to and presented at ICIP 2025 Workshops

点击查看摘要

Abstract:Image super-resolution models are commonly evaluated by average scores (over some benchmark test sets), which fail to reflect the performance of these models on images of varying difficulty and that some models generate artifacts on certain difficult images, which is not reflected by the average scores. We propose difficulty-aware performance evaluation procedures to better differentiate between SISR models that produce visually different results on some images but yield close average performance scores over the entire test set. In particular, we propose two image-difficulty measures, the high-frequency index and rotation-invariant edge index, to predict those test images, where a model would yield significantly better visual results over another model, and an evaluation method where these visual differences are reflected on objective measures. Experimental results demonstrate the effectiveness of the proposed image-difficulty measures and evaluation methodology.
zh
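One simple way to instantiate a high-frequency difficulty measure is the share of spectral energy outside a low-frequency disc, as sketched below; the paper's exact definitions of the high-frequency index and the rotation-invariant edge index may differ, so treat this as an illustration of the ranking idea only.

```python
# Illustrative high-frequency index: fraction of spectral energy outside a
# low-frequency radius, used to rank test images by difficulty.
import numpy as np

def high_frequency_index(gray: np.ndarray, radius_frac: float = 0.1) -> float:
    f = np.fft.fftshift(np.fft.fft2(gray))
    power = np.abs(f) ** 2
    h, w = gray.shape
    yy, xx = np.ogrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2)
    low = power[r <= radius_frac * min(h, w)].sum()
    return float(1.0 - low / power.sum())

img = np.random.rand(256, 256)          # stand-in for a grayscale test image
print(high_frequency_index(img))
```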

[CV-24] MotionRAG : Motion Retrieval-Augmented Image-to-Video Generation

【速读】:该论文旨在解决图像到视频生成中运动真实感不足的问题,其核心挑战在于准确建模物理约束、物体交互及领域特定动态等复杂运动规律,这些因素难以在不同场景间有效泛化。解决方案的关键在于提出MotionRAG框架,通过检索增强机制从相关参考视频中适配运动先验(motion priors),其核心技术包括:(i) 基于检索的流水线,利用视频编码器与专用重采样器提取高层运动特征并蒸馏语义运动表示;(ii) 采用因果Transformer实现上下文学习式的运动适配;(iii) 设计基于注意力的运动注入适配器,将迁移的运动特征无缝融合至预训练视频扩散模型中。该方法在多个领域和基础模型上均取得显著提升,且推理阶段计算开销极低,同时支持零样本跨域泛化。

链接: https://arxiv.org/abs/2509.26391
作者: Chenhui Zhu,Yilu Wu,Shuai Wang,Gangshan Wu,Limin Wang
机构: Nanjing University (南京大学); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image-to-video generation has made remarkable progress with the advancements in diffusion models, yet generating videos with realistic motion remains highly challenging. This difficulty arises from the complexity of accurately modeling motion, which involves capturing physical constraints, object interactions, and domain-specific dynamics that are not easily generalized across diverse scenarios. To address this, we propose MotionRAG, a retrieval-augmented framework that enhances motion realism by adapting motion priors from relevant reference videos through Context-Aware Motion Adaptation (CAMA). The key technical innovations include: (i) a retrieval-based pipeline extracting high-level motion features using video encoder and specialized resamplers to distill semantic motion representations; (ii) an in-context learning approach for motion adaptation implemented through a causal transformer architecture; (iii) an attention-based motion injection adapter that seamlessly integrates transferred motion features into pretrained video diffusion models. Extensive experiments demonstrate that our method achieves significant improvements across multiple domains and various base models, all with negligible computational overhead during inference. Furthermore, our modular design enables zero-shot generalization to new domains by simply updating the retrieval database without retraining any components. This research enhances the core capability of video generation systems by enabling the effective retrieval and transfer of motion priors, facilitating the synthesis of realistic motion dynamics.
zh
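The retrieval step can be pictured as cosine-similarity search over pre-extracted motion embeddings, returning the top-k reference clips whose motion features then condition generation. The shapes, the value of k, and the feature layout below are placeholders, not the paper's actual encoders or database.

```python
# Sketch of retrieval-augmented motion priors: nearest neighbours in a
# database of motion embeddings, returning their motion features.
import torch
import torch.nn.functional as F

def retrieve_motion_priors(query_emb, db_embs, db_motion_feats, k: int = 3):
    q = F.normalize(query_emb, dim=-1)           # (D,)
    d = F.normalize(db_embs, dim=-1)             # (N, D)
    scores = d @ q                               # cosine similarities
    top = scores.topk(k).indices
    return db_motion_feats[top]                  # (k, ...) motion features

priors = retrieve_motion_priors(torch.randn(512),
                                torch.randn(1000, 512),
                                torch.randn(1000, 77, 768))
```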

[CV-25] PANDA: Towards Generalist Video Anomaly Detection via Agentic AI Engineer NEURIPS2025

【速读】:该论文旨在解决视频异常检测(Video Anomaly Detection, VAD)在实际应用中面临的泛化能力不足问题,即传统方法依赖特定场景的训练数据和人工调参,在面对新场景或未知异常类型时性能显著下降,导致部署成本高且适应性差。为实现无需训练数据和人工干预的通用型VAD,作者提出PANDA——一种基于多模态大语言模型(Multimodal Large Language Models, MLLMs)的代理式AI工程师系统。其核心创新在于构建了四大关键能力:(1) 自适应场景感知的检索增强生成(Retrieval-Augmented Generation, RAG)机制以动态获取异常相关知识;(2) 基于潜在异常引导的启发式提示策略提升推理精度;(3) 工具增强的渐进式自我反思机制用于复杂场景下的迭代决策优化;(4) 基于链式记忆(chain-of-memory)的历史经验复用机制实现持续性能改进。这些能力共同使PANDA在多场景、开放集及复杂场景下均达到SOTA效果,验证了其强大的通用性和鲁棒性。

链接: https://arxiv.org/abs/2509.26386
作者: Zhiwei Yang,Chen Gao,Mike Zheng Shou
机构: Xidian University (西安电子科技大学); Show Lab, National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2025

点击查看摘要

Abstract:Video anomaly detection (VAD) is a critical yet challenging task due to the complex and diverse nature of real-world scenarios. Previous methods typically rely on domain-specific training data and manual adjustments when applying to new scenarios and unseen anomaly types, suffering from high labor costs and limited generalization. Therefore, we aim to achieve generalist VAD, i.e., automatically handle any scene and any anomaly types without training data or human involvement. In this work, we propose PANDA, an agentic AI engineer based on MLLMs. Specifically, we achieve PANDA by comprehensively devising four key capabilities: (1) self-adaptive scene-aware strategy planning, (2) goal-driven heuristic reasoning, (3) tool-augmented self-reflection, and (4) self-improving chain-of-memory. Concretely, we develop a self-adaptive scene-aware RAG mechanism, enabling PANDA to retrieve anomaly-specific knowledge for anomaly detection strategy planning. Next, we introduce a latent anomaly-guided heuristic prompt strategy to enhance reasoning precision. Furthermore, PANDA employs a progressive reflection mechanism alongside a suite of context-aware tools to iteratively refine decision-making in complex scenarios. Finally, a chain-of-memory mechanism enables PANDA to leverage historical experiences for continual performance improvement. Extensive experiments demonstrate that PANDA achieves state-of-the-art performance in multi-scenario, open-set, and complex scenario settings without training and manual involvement, validating its generalizable and robust anomaly detection capability. Code is released at this https URL.
zh

[CV-26] MR2-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval

【速读】:该论文旨在解决当前多模态检索(Multimodal Retrieval)评估基准难以满足现实复杂场景需求的问题,现有基准主要关注表层语义对应关系(如物体与文本匹配),而缺乏对视觉与文本信息间深层推理能力的评测。解决方案的关键在于提出MR²-Bench——一个以推理驱动的多模态检索评测基准,其核心创新包括:1)所有任务均基于推理设计,涵盖逻辑、空间和因果推理,超越浅层匹配;2)数据类型多样,包含自然图像、图表和视觉谜题,支持跨内容类型的全面评估;3)支持包含多图文档的复杂查询,覆盖多种真实应用场景。实验表明,即使在主流基准上表现优异的模型(如Seed1.6-Embedding)在MR²-Bench上性能显著下降(Recall@1从77.78降至9.91),凸显了该基准的挑战性及推动多模态推理能力发展的紧迫性。

链接: https://arxiv.org/abs/2509.26378
作者: Junjie Zhou,Ze Liu,Lei Xiong,Jin-Ge Yao,Yueze Wang,Shitao Xiao,Fenfen Lin,Miguel Hu Chen,Zhicheng Dou,Siqi Bao,Defu Lian,Yongping Xiong,Zheng Liu
机构: 未知
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal retrieval is becoming a crucial component of modern AI applications, yet its evaluation lags behind the demands of more realistic and challenging scenarios. Existing benchmarks primarily probe surface-level semantic correspondence (e.g., object-text matching) while failing to assess the deeper reasoning required to capture complex relationships between visual and textual information. To address this gap, we introduce MR²-Bench, a reasoning-intensive benchmark for multimodal retrieval. MR²-Bench presents the following critical values: 1) all tasks are reasoning-driven, going beyond shallow matching to effectively assess models’ capacity for logical, spatial, and causal inference; 2) it features diverse multimodal data, such as natural images, diagrams, and visual puzzles, enabling comprehensive evaluation across content types; 3) it supports complex queries and documents containing multiple images and covers diverse retrieval scenarios, more accurately reflecting real-world applications. Our benchmark contains 1,309 curated queries, derived either from manual collection and annotation or from selective consolidation of public datasets. Despite achieving strong results on existing benchmarks, current state-of-the-art models still struggle on MR²-Bench: for example, the leading Seed1.6-Embedding model attains a Recall@1 of 77.78 on MMEB, but only 9.91 on MR²-Bench. This substantial performance gap highlights both the increased challenge posed by our benchmark and the pressing need for further advances in reasoning-intensive multimodal retrieval. The dataset and evaluation code will be made publicly available at this https URL.
zh

[CV-27] Go with Your Gut: Scaling Confidence for Autoregressive Image Generation

【速读】:该论文旨在解决生成式 AI(Generative AI)中基于下一 token 预测(Next-Token Prediction, NTP)的自回归(Autoregressive, AR)图像生成任务在应用测试时缩放(Test-Time Scaling, TTS)方法时面临的挑战。现有视觉自回归(Visual AR, VAR)方法依赖频繁的部分解码和外部奖励模型,但这些方法不适用于 NTP-based 图像生成,因为中间解码结果本质上是不完整的。解决方案的关键在于提出 ScalingAR,这是首个专为 NTP-based AR 图像生成设计的 TTS 框架,无需早期解码或辅助奖励信号。其核心创新在于利用 token 熵作为视觉 token 生成的新信号,并在两个互补的缩放层级上运作:(i) Profile Level 通过融合内在与条件信号流式输出校准的置信度状态;(ii) Policy Level 利用该状态自适应终止低置信度轨迹并动态调度不同阶段的引导强度,从而显著提升生成质量与效率。

链接: https://arxiv.org/abs/2509.26376
作者: Harold Haodong Chen,Xianfeng Wu,Wen-Jie Shu,Rongjin Guo,Disen Lan,Harry Yang,Ying-Cong Chen
机构: HKUST(GZ)(香港科技大学(广州)); HKUST(香港科技大学); PolyU(香港理工大学); CityUHK(香港城市大学); FDU(复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL

点击查看摘要

Abstract:Test-time scaling (TTS) has demonstrated remarkable success in enhancing large language models, yet its application to next-token prediction (NTP) autoregressive (AR) image generation remains largely uncharted. Existing TTS approaches for visual AR (VAR), which rely on frequent partial decoding and external reward models, are ill-suited for NTP-based image generation due to the inherent incompleteness of intermediate decoding results. To bridge this gap, we introduce ScalingAR, the first TTS framework specifically designed for NTP-based AR image generation that eliminates the need for early decoding or auxiliary rewards. ScalingAR leverages token entropy as a novel signal in visual token generation and operates at two complementary scaling levels: (i) Profile Level, which streams a calibrated confidence state by fusing intrinsic and conditional signals; and (ii) Policy Level, which utilizes this state to adaptively terminate low-confidence trajectories and dynamically schedule guidance for phase-appropriate conditioning strength. Experiments on both general and compositional benchmarks show that ScalingAR (1) improves base models by 12.5% on GenEval and 15.2% on TIIF-Bench, (2) efficiently reduces visual token consumption by 62.0% while outperforming baselines, and (3) successfully enhances robustness, mitigating performance drops by 26.0% in challenging scenarios.
zh
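上文摘要把token熵作为视觉token生成过程中的置信信号,并据此自适应终止低置信度轨迹。下面给出一个最小化的示意草图(熵到置信度的归一化方式、EMA平滑与阈值 tau 均为笔者的假设,并非论文的具体实现),仅用于说明“逐步计算熵 → 维护置信状态 → 终止低置信轨迹”的基本思路。

```python
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the next-token distribution, per sequence. logits: [B, V]."""
    log_p = F.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1)          # [B]

def confidence_state(prev_state, entropy, vocab_size, momentum=0.9):
    """Map entropy to a [0, 1] confidence and smooth it over decoding steps (EMA is an assumption)."""
    conf = 1.0 - entropy / torch.log(torch.tensor(float(vocab_size)))
    return momentum * prev_state + (1.0 - momentum) * conf

# toy decoding loop: keep only trajectories whose smoothed confidence stays above tau
B, V, steps, tau = 4, 1024, 16, 0.35
state = torch.ones(B)
alive = torch.ones(B, dtype=torch.bool)
for t in range(steps):
    logits = torch.randn(B, V)                         # placeholder for the AR model's forward pass
    state = confidence_state(state, token_entropy(logits), V)
    alive &= state > tau                               # terminate low-confidence trajectories
print(alive)
```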

[CV-28] SDA-PLANNER: State-Dependency Aware Adaptive Planner for Embodied Task Planning

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的具身任务规划方法在三个关键方面的局限性:固定规划范式、缺乏动作序列约束以及对执行错误不敏感(error-agnostic)。其解决方案的核心在于提出SDA-PLANNER框架,该框架引入状态依赖图(State-Dependency Graph)以显式建模动作的前提条件与效果,从而支持动态修正;同时设计了误差自适应重规划策略,包括错误回溯(Error Backtrack)、诊断与自适应动作子树生成(Diagnosis and Adaptive Action SubTree Generation),能够在执行出错时根据当前环境状态局部重构计划,实现状态感知和容错能力。

链接: https://arxiv.org/abs/2509.26375
作者: Zichao Shen,Chen Gao,Jiaqi Yuan,Tianchen Zhu,Xingcheng Fu,Qingyun Sun
机构: Beihang University (北京航空航天大学); Guangxi Normal University (广西师范大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Embodied task planning requires agents to produce executable actions in a closed-loop manner within the environment. With progressively improving capabilities of LLMs in task decomposition, planning, and generalization, current embodied task planning methods adopt LLM-based planners. However, existing LLM-based planners remain limited in three aspects, i.e., fixed planning paradigms, lack of action-sequence constraints, and error-agnostic behavior. In this work, we propose SDA-PLANNER, enabling an adaptive planning paradigm, state-dependency aware and error-aware mechanisms for comprehensive embodied task planning. Specifically, SDA-PLANNER introduces a State-Dependency Graph to explicitly model action preconditions and effects, guiding the dynamic revision. To handle execution errors, it employs an error-adaptive replanning strategy consisting of Error Backtrack and Diagnosis and Adaptive Action SubTree Generation, which locally reconstructs the affected portion of the plan based on the current environment state. Experiments demonstrate that SDA-PLANNER consistently outperforms baselines in success rate and goal completion, particularly under diverse error conditions.
zh
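上文摘要中的State-Dependency Graph显式建模动作的前提条件与效果,并在执行出错时从受影响位置局部重构计划。以下是一个极简的示意草图(谓词、动作名与数据结构均为笔者假设,并非论文实现),仅展示“前提检查 → 效果更新 → 失败回溯点”的基本逻辑。

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    name: str
    preconditions: set = field(default_factory=set)   # symbolic predicates required before execution
    effects: set = field(default_factory=set)         # predicates added after successful execution

def executable(action: Action, state: set) -> bool:
    return action.preconditions <= state

def run_plan(plan, state):
    """Execute actions in order; on failure, return the index where local replanning should start."""
    for i, act in enumerate(plan):
        if not executable(act, state):
            return i, state                            # error-backtrack point
        state = state | act.effects
    return None, state

# hypothetical two-step plan for illustration only
plan = [
    Action("open_fridge", {"at(fridge)"}, {"open(fridge)"}),
    Action("grasp_milk", {"open(fridge)"}, {"holding(milk)"}),
]
fail_at, state = run_plan(plan, {"at(fridge)"})
print(fail_at, state)                                  # None plus the final state if all preconditions held
```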

[CV-29] TimeScope: Towards Task-Oriented Temporal Grounding In Long Videos

【速读】:该论文旨在解决长视频中任务导向的时间定位(Task-oriented Temporal Grounding, ToTG)问题,即根据任务的自然语言描述精准定位包含必要信息的时间区间。传统方法在处理长视频时面临泛化能力弱和难以有效捕捉关键片段的挑战。解决方案的关键在于提出TimeScope框架,其核心是基于渐进式推理机制:首先粗粒度地识别可能包含关键时刻的视频时间范围,再通过细粒度的片段划分逐步精炼定位结果,从而提升对复杂长视频的理解与推理能力。

链接: https://arxiv.org/abs/2509.26360
作者: Xiangrui Liu,Minghao Qin,Yan Shu,Zhengyang Liang,Yang Tian,Chen Jason Zhang,Bo Zhao,Zheng Liu
机构: Beijing Academy of Artificial Intelligence (北京人工智能研究院); School of AI, Shanghai Jiao Tong University (上海交通大学人工智能学院); University of Trento (特伦托大学); Singapore Management University (新加坡管理大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Identifying key moments in long videos is essential for downstream understanding and reasoning tasks. In this paper, we introduce a new problem, Task-oriented Temporal Grounding (ToTG), which aims to localize time intervals containing the necessary information based on a task’s natural description. Along with the definition, we also present ToTG Bench, a comprehensive benchmark for evaluating the performance on ToTG. ToTG is particularly challenging for traditional approaches due to their limited generalizability and difficulty in handling long videos. To address these challenges, we propose TimeScope, a novel framework built upon progressive reasoning. TimeScope first identifies a coarse-grained temporal scope in the long video that likely contains the key moments, and then refines this scope through fine-grained moment partitioning. Additionally, we curate a high-quality dataset, namely ToTG Pile, to enhance TimeScope’s ability to perform progressive temporal grounding effectively. Extensive experiments demonstrate that TimeScope consistently outperforms both existing temporal-grounding methods and popular MLLMs across various settings, highlighting its effectiveness in addressing this new challenging problem.
zh

[CV-30] SQUARE: Semantic Query-Augmented Fusion and Efficient Batch Reranking for Training-free Zero-Shot Composed Image Retrieval

【速读】:该论文旨在解决训练-free零样本图像组合检索(Zero-Shot Composed Image Retrieval, ZS-CIR)中用户意图难以准确捕捉的问题,即在不依赖特定任务训练或标注数据的情况下,如何有效结合参考图像的视觉内容与文本修改指令来检索目标图像。其解决方案的关键在于提出一个两阶段框架SQUARE:第一阶段为语义查询增强融合(Semantic Query-Augmented Fusion, SQAF),利用多模态大语言模型(Multimodal Large Language Models, MLLMs)生成的目标图像描述对来自视觉-语言模型(如CLIP)的查询嵌入进行语义增强,从而提升全局检索质量;第二阶段为高效批量重排序(Efficient Batch Reranking, EBR),将候选图像以视觉标记组成的网格输入MLLM,实现跨候选图像的联合视觉-语义推理,在单次遍历中获得更精准的排序结果。

链接: https://arxiv.org/abs/2509.26330
作者: Ren-Di Wu,Yu-Yen Lin,Huei-Fang Yang
机构: National Sun Yat-sen University (国立中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: 20 pages, 9 figures

点击查看摘要

Abstract:Composed Image Retrieval (CIR) aims to retrieve target images that preserve the visual content of a reference image while incorporating user-specified textual modifications. Training-free zero-shot CIR (ZS-CIR) approaches, which require no task-specific training or labeled data, are highly desirable, yet accurately capturing user intent remains challenging. In this paper, we present SQUARE, a novel two-stage training-free framework that leverages Multimodal Large Language Models (MLLMs) to enhance ZS-CIR. In the Semantic Query-Augmented Fusion (SQAF) stage, we enrich the query embedding derived from a vision-language model (VLM) such as CLIP with MLLM-generated captions of the target image. These captions provide high-level semantic guidance, enabling the query to better capture the user’s intent and improve global retrieval quality. In the Efficient Batch Reranking (EBR) stage, top-ranked candidates are presented as an image grid with visual marks to the MLLM, which performs joint visual-semantic reasoning across all candidates. Our reranking strategy operates in a single pass and yields more accurate rankings. Experiments show that SQUARE, with its simplicity and effectiveness, delivers strong performance on four standard CIR benchmarks. Notably, it maintains high performance even with lightweight pre-trained models, demonstrating its potential applicability.
zh
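上文摘要的第一阶段(SQAF)用MLLM生成的目标图像描述来增强来自CLIP等VLM的查询嵌入,再做余弦相似度检索。下面是一个示意性草图(融合权重 alpha/beta、嵌入维度与加性融合形式均为笔者假设,论文的实际融合方式可能不同)。

```python
import torch
import torch.nn.functional as F

def fuse_query(img_emb, text_emb, caption_emb, alpha=0.5, beta=0.5):
    """SQAF-style fusion (sketch): combine the reference-image/text query embedding
    with the embedding of an MLLM-generated target caption, then L2-normalise."""
    query = F.normalize(img_emb + alpha * text_emb, dim=-1)
    return F.normalize(query + beta * F.normalize(caption_emb, dim=-1), dim=-1)

def retrieve(fused_query, gallery_embs, topk=5):
    """Cosine-similarity retrieval over a gallery of candidate image embeddings."""
    sims = F.normalize(gallery_embs, dim=-1) @ fused_query
    return sims.topk(topk).indices

# placeholder CLIP-like embeddings (dimension and fusion weights are assumptions)
d = 512
img, txt, cap = torch.randn(d), torch.randn(d), torch.randn(d)
gallery = torch.randn(1000, d)
print(retrieve(fuse_query(img, txt, cap), gallery))
```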

[CV-31] Continuous Space-Time Video Super-Resolution with 3D Fourier Fields

【速读】:该论文旨在解决视频超分辨率(Video Super-Resolution, VSR)任务中传统方法存在的问题,即通过分离空间与时间维度并依赖显式帧对齐(frame warping)进行运动补偿,导致重建结果在空间细节和时间一致性之间难以平衡,且计算效率较低。其解决方案的关键在于提出一种连续时空视频表示方法——3D Video Fourier Field(VFF),将视频建模为一个具有时空一致性的连续函数,利用傅里叶基函数的系数由神经编码器预测,并基于低分辨率输入条件化生成。该方法实现了任意时空位置的高效采样、同时捕捉精细空间细节与平滑时间动态,并通过引入解析的高斯点扩散函数(Point Spread Function, PSF)确保任意缩放比例下的无混叠重建,从而在多个基准测试上显著提升空间与时间超分性能,且计算效率更高。

链接: https://arxiv.org/abs/2509.26325
作者: Alexander Becker,Julius Erbach,Dominik Narnhofer,Konrad Schindler
机构: ETH Zurich (苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce a novel formulation for continuous space-time video super-resolution. Instead of decoupling the representation of a video sequence into separate spatial and temporal components and relying on brittle, explicit frame warping for motion compensation, we encode video as a continuous, spatio-temporally coherent 3D Video Fourier Field (VFF). That representation offers three key advantages: (1) it enables cheap, flexible sampling at arbitrary locations in space and time; (2) it is able to simultaneously capture fine spatial detail and smooth temporal dynamics; and (3) it offers the possibility to include an analytical, Gaussian point spread function in the sampling to ensure aliasing-free reconstruction at arbitrary scale. The coefficients of the proposed, Fourier-like sinusoidal basis are predicted with a neural encoder with a large spatio-temporal receptive field, conditioned on the low-resolution input video. Through extensive experiments, we show that our joint modeling substantially improves both spatial and temporal super-resolution and sets a new state of the art for multiple benchmarks: across a wide range of upscaling factors, it delivers sharper and temporally more consistent reconstructions than existing baselines, while being computationally more efficient. Project page: this https URL.
zh
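上文摘要将视频编码为连续的3D Video Fourier Field,可在任意时空位置采样。以下是一个示意性的傅里叶场采样草图(基函数个数、频率取值与系数形状均为笔者假设;论文中这些系数由条件于低分辨率视频的神经编码器预测,此处直接随机生成)。

```python
import math
import torch

def sample_vff(coords, freqs, coeffs):
    """Evaluate a toy 3D Video Fourier Field at continuous (x, y, t) coordinates.
    coords: [N, 3] in [0, 1]; freqs: [K, 3] frequencies; coeffs: [K, 2, C]
    (cosine/sine coefficients per output channel). Returns [N, C] values."""
    phase = 2 * math.pi * coords @ freqs.T                             # [N, K]
    basis = torch.stack([torch.cos(phase), torch.sin(phase)], dim=-1)  # [N, K, 2]
    return torch.einsum("nkb,kbc->nc", basis, coeffs)

# toy field with K frequencies and 3 output channels
K, C = 64, 3
freqs = torch.randint(0, 16, (K, 3)).float()
coeffs = torch.randn(K, 2, C) * 0.1
coords = torch.rand(5, 3)                       # arbitrary space-time query points
print(sample_vff(coords, freqs, coeffs).shape)  # torch.Size([5, 3])
```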

[CV-32] FLOWER: A Flow-Matching Solver for Inverse Problems

【速读】:该论文旨在解决图像重建中的逆问题(inverse problems),即从有限或噪声观测数据中恢复高质量的原始信号。其核心挑战在于如何在保持物理约束的同时,利用先验知识提升重建精度。解决方案的关键在于提出一种名为Flower的求解器,该方法基于预训练的流模型(flow model)构建迭代优化框架,包含三个关键步骤:(i) 流一致性目标估计,通过速度网络预测去噪后的目标;(ii) 可行集投影,将估计结果映射到由前向算子定义的可行空间;(iii) 时间演化重投影,沿流轨迹重新调整优化方向。理论分析表明,Flower近似于贝叶斯后验采样,从而统一了插件式方法(plug-and-play)与生成式逆求解器(generative inverse solvers)的视角,并在多个逆问题上实现了优于现有方法的重建质量,且超参数设置具有高度通用性。

链接: https://arxiv.org/abs/2509.26287
作者: Mehrsa Pourya,Bassam El Rawas,Michael Unser
机构: Biomedical Imaging Group, EPFL (École Polytechnique Fédérale de Lausanne), Lausanne, Switzerland
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce Flower, a solver for inverse problems. It leverages a pre-trained flow model to produce reconstructions that are consistent with the observed measurements. Flower operates through an iterative procedure over three steps: (i) a flow-consistent destination estimation, where the velocity network predicts a denoised target; (ii) a refinement step that projects the estimated destination onto a feasible set defined by the forward operator; and (iii) a time-progression step that re-projects the refined destination along the flow trajectory. We provide a theoretical analysis that demonstrates how Flower approximates Bayesian posterior sampling, thereby unifying perspectives from plug-and-play methods and generative inverse solvers. On the practical side, Flower achieves state-of-the-art reconstruction quality while using nearly identical hyperparameters across various inverse problems.
zh
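上文摘要将Flower的每次迭代分为三步:流一致的目标估计、向测量可行集的细化投影、以及沿流轨迹的时间推进。以下是针对线性逆问题 y = A x 的示意草图(采用直线流假设,用一次梯度步近似可行集投影;velocity_fn、步长 dt 与 lam 均为笔者假设,并非论文的精确算法)。

```python
import torch

def flower_step(x_t, t, dt, velocity_fn, A, y, lam=0.5):
    """One sketch iteration of a Flower-like solver for a linear inverse problem y = A x."""
    # (i) destination estimate under a straight flow x_t = (1 - t) * x_0 + t * x_1
    x1_hat = x_t + (1.0 - t) * velocity_fn(x_t, t)
    # (ii) data-consistency refinement via a gradient step on ||A x - y||^2 (projection surrogate)
    x1_ref = x1_hat - lam * A.T @ (A @ x1_hat - y)
    # (iii) move along the straight trajectory toward the refined destination, to time t + dt
    return x_t + dt * (x1_ref - x_t) / (1.0 - t)

# toy setting: random linear operator and a dummy "velocity network"
d, m = 32, 16
A = torch.randn(m, d) / d**0.5
y = A @ torch.randn(d)
velocity_fn = lambda x, t: -x            # placeholder; a trained flow network would go here
x, t, dt = torch.randn(d), 0.0, 0.05
while t < 1.0 - dt:
    x = flower_step(x, t, dt, velocity_fn, A, y)
    t += dt
print(torch.norm(A @ x - y))             # measurement residual after the sweep
```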

[CV-33] Point2RBox-v3: Self-Bootstrapping from Point Annotations via Integrated Pseudo-Label Refinement and Utilization

【速读】:该论文旨在解决弱监督场景下基于点标注的定向目标检测(Oriented Object Detection, OOD)方法中存在的两个核心问题:一是伪标签(pseudo labels)的利用效率低,二是伪标签质量差。为应对这些问题,作者提出Point2RBox-v3模型,其关键创新在于两个原则:1)渐进式标签分配(Progressive Label Assignment, PLA),通过在训练不同阶段动态估计实例尺寸,实现更智能的标签分配;2)先验引导的动态掩码损失(Prior-Guided Dynamic Mask Loss, PGDM-Loss),改进自Point2RBox-v2中的Voronoi Watershed Loss,有效融合了Segment Anything Model (SAM) 与分水岭算法的优势,在稀疏和密集场景中均表现出鲁棒性能。该方案首次在弱监督OOD任务中引入动态伪标签进行标签分配,并实现了对复杂场景下物体尺度变化和稀疏分布的优异适应能力。

链接: https://arxiv.org/abs/2509.26281
作者: Teng Zhang,Ziqian Fan,Mingxin Liu,Xin Zhang,Xudong Lu,Wentong Li,Yue Zhou,Yi Yu,Xiang Li,Junchi Yan,Xue Yang
机构: Shanghai Jiao Tong University (上海交通大学); South China University of Technology (华南理工大学); Nankai University (南开大学); The Chinese University of Hong Kong (香港中文大学); Nanjing University of Aeronautics and Astronautics (南京航空航天大学); East China Normal University (华东师范大学); Ohio State University (俄亥俄州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 19pages, 5figures, 6tables

点击查看摘要

Abstract:Driven by the growing need for Oriented Object Detection (OOD), learning from point annotations under a weakly-supervised framework has emerged as a promising alternative to costly and laborious manual labeling. In this paper, we discuss two deficiencies in existing point-supervised methods: inefficient utilization and poor quality of pseudo labels. Therefore, we present Point2RBox-v3. At the core are two principles: 1) Progressive Label Assignment (PLA). It dynamically estimates instance sizes in a coarse yet intelligent manner at different stages of the training process, enabling the use of label assignment methods. 2) Prior-Guided Dynamic Mask Loss (PGDM-Loss). It is an enhancement of the Voronoi Watershed Loss from Point2RBox-v2, which overcomes the shortcomings of Watershed in its poor performance in sparse scenes and SAM’s poor performance in dense scenes. To our knowledge, Point2RBox-v3 is the first model to employ dynamic pseudo labels for label assignment, and it creatively complements the advantages of SAM model with the watershed algorithm, which achieves excellent performance in both sparse and dense scenes. Our solution gives competitive performance, especially in scenarios with large variations in object size or sparse object occurrences: 66.09%/56.86%/41.28%/46.40%/19.60%/45.96% on DOTA-v1.0/DOTA-v1.5/DOTA-v2.0/DIOR/STAR/RSAR.
zh

[CV-34] Cat: Post-training quantization error reduction via cluster-based affine transformation

【速读】:该论文旨在解决低比特量化(Low-bit Quantization, LQ)条件下后训练量化(Post-Training Quantization, PTQ)易导致模型精度显著下降的问题。在LQ场景下(如2-bit),传统基于全局仿射变换(Affine Transformation)的方法因采用统一参数对所有输出进行映射,反而加剧了量化误差。其解决方案的关键在于提出一种基于聚类的仿射变换(Cluster-based Affine Transformation, CAT)框架,通过为不同输出特征簇分配专属的仿射参数,实现对低比特量化输出与全精度(Full-Precision, FP)输出之间差异的有效对齐。CAT仅引入少量额外参数,无需模型微调或重新量化参数,即可显著提升PTQ性能,在ImageNet-1K上验证其优于现有方法,并能作为插件模块提升已有PTQ基线超过3%。

链接: https://arxiv.org/abs/2509.26277
作者: Ali Zoljodi,Radu Timofte,Masoud Daneshtalab
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 29 pages, 20 figures

点击查看摘要

Abstract:Post-Training Quantization (PTQ) reduces the memory footprint and computational overhead of deep neural networks by converting full-precision (FP) values into quantized and compressed data types. While PTQ is more cost-efficient than Quantization-Aware Training (QAT), it is highly susceptible to accuracy degradation under a low-bit quantization (LQ) regime (e.g., 2-bit). Affine transformation is a classical technique used to reduce the discrepancy between the information processed by a quantized model and that processed by its full-precision counterpart; however, we find that using plain affine transformation, which applies a uniform affine parameter set for all outputs, worsens the results in low-bit PTQ. To address this, we propose Cluster-based Affine Transformation (CAT), an error-reduction framework that employs cluster-specific parameters to align LQ outputs with FP counterparts. CAT refines LQ outputs with only a negligible number of additional parameters, without requiring fine-tuning of the model or quantization parameters. We further introduce a novel PTQ framework integrated with CAT. Experiments on ImageNet-1K show that this framework consistently outperforms prior PTQ methods across diverse architectures and LQ settings, achieving up to 53.18% Top-1 accuracy on W2A2 ResNet-18. Moreover, CAT enhances existing PTQ baselines by more than 3% when used as a plug-in. We plan to release our implementation alongside the publication of this paper.
zh
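上文摘要指出CAT用“按簇划分”的仿射参数把低比特量化输出对齐到全精度输出。以下是一个一维、按簇闭式最小二乘拟合仿射参数的示意草图(簇的划分方式、参数粒度等均为笔者假设,仅说明“按簇仿射校准”的思想,并非论文实现)。

```python
import torch

def fit_cat(fp_out, lq_out, labels, n_clusters):
    """Fit per-cluster affine parameters (a, b) minimising ||a * lq + b - fp||^2."""
    params = []
    for k in range(n_clusters):
        m = labels == k
        x, z = lq_out[m], fp_out[m]
        var = x.var(unbiased=False) + 1e-8
        a = ((x - x.mean()) * (z - z.mean())).mean() / var   # closed-form 1-D least squares
        b = z.mean() - a * x.mean()
        params.append((a, b))
    return params

def apply_cat(lq_out, labels, params):
    out = lq_out.clone()
    for k, (a, b) in enumerate(params):
        m = labels == k
        out[m] = a * lq_out[m] + b
    return out

# toy example: cluster assignment of a layer's outputs is random here (the clustering rule is an assumption)
torch.manual_seed(0)
fp = torch.randn(4096)
lq = 0.7 * fp + 0.2 + 0.05 * torch.randn(4096)            # stand-in for low-bit quantized outputs
labels = torch.randint(0, 4, (4096,))
params = fit_cat(fp, lq, labels, 4)
print((apply_cat(lq, labels, params) - fp).abs().mean())   # alignment error after CAT-style calibration
```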

[CV-35] PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection

【速读】:该论文旨在解决生成式 AI (Generative AI) 时代下深度伪造(deepfake)检测面临的两大挑战:一是高质量、大规模数据集的稀缺,二是多模态大语言模型(LLM)在检测任务中推理能力不足,常产生与视觉证据不符或虚构的解释。其解决方案的关键在于构建一个标注了推理过程的数据集,并提出一种基于段落级别的相对策略优化算法(Paragraph-level Relative Policy Optimization, PRPO),通过强化学习使 LLM 的推理过程与图像内容在段落层级上对齐,从而显著提升检测准确率和推理合理性,实验表明 PRPO 在测试时条件下优于 GRPO 方法,且推理评分达到 4.55/5.0,验证了视觉证据对多模态推理的必要性。

链接: https://arxiv.org/abs/2509.26272
作者: Tuan Nguyen,Naseem Khan,Khang Tran,NhatHai Phan,Issa Khalil
机构: Qatar Computing Research Institute (卡塔尔计算研究研究所); New Jersey Institute of Technology (新泽西理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The rapid rise of synthetic media has made deepfake detection a critical challenge for online safety and trust. Progress remains constrained by the scarcity of large, high-quality datasets. Although multimodal large language models (LLMs) exhibit strong reasoning capabilities, their performance on deepfake detection is poor, often producing explanations that are misaligned with visual evidence or hallucinatory. To address this limitation, we introduce a reasoning-annotated dataset for deepfake detection and propose Paragraph-level Relative Policy Optimization (PRPO), a reinforcement learning algorithm that aligns LLM reasoning with image content at the paragraph level. Experiments show that PRPO improves detection accuracy by a wide margin and achieves the highest reasoning score of 4.55/5.0. Ablation studies further demonstrate that PRPO significantly outperforms GRPO under test-time conditions. These results underscore the importance of grounding multimodal reasoning in visual evidence to enable more reliable and interpretable deepfake detection.
zh

[CV-36] ExoPredicator: Learning Abstract Models of Dynamic Worlds for Robot Planning

【速读】:该论文旨在解决长时程具身规划(long-horizon embodied planning)中的核心挑战,即环境不仅受代理动作影响,还存在与代理行为并行的外生过程(exogenous processes),如水加热或多米诺骨牌连锁反应等,这些过程难以建模且显著增加了规划复杂性。解决方案的关键在于提出一种抽象世界模型框架,该框架通过变分贝叶斯推断(variational Bayesian inference)结合大语言模型(LLM)的提议机制,联合学习两类关键要素:(i) 符号化状态表示(symbolic state representations)和 (ii) 用于描述内生动作与外生机制的因果过程(causal processes)。每个因果过程建模一个随机因果效应的时间演化轨迹,从而实现对复杂动态世界的高效、可泛化的快速规划,在五个模拟桌面机器人环境中验证了其在新任务、更多物体和更复杂目标下的优越性能。

链接: https://arxiv.org/abs/2509.26255
作者: Yichao Liang,Dat Nguyen,Cambridge Yang,Tianyang Li,Joshua B. Tenenbaum,Carl Edward Rasmussen,Adrian Weller,Zenna Tavares,Tom Silver,Kevin Ellis
机构: University of Cambridge (剑桥大学); Basis; Cornell University (康奈尔大学); Princeton University (普林斯顿大学); Massachusetts Institute of Technology (麻省理工学院); The Alan Turing Institute (艾伦·图灵研究所)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 41 pages. The last two authors contributed equally in co-advising

点击查看摘要

Abstract:Long-horizon embodied planning is challenging because the world does not only change through an agent’s actions: exogenous processes (e.g., water heating, dominoes cascading) unfold concurrently with the agent’s actions. We propose a framework for abstract world models that jointly learns (i) symbolic state representations and (ii) causal processes for both endogenous actions and exogenous mechanisms. Each causal process models the time course of a stochastic causal-effect relation. We learn these world models from limited data via variational Bayesian inference combined with LLM proposals. Across five simulated tabletop robotics environments, the learned models enable fast planning that generalizes to held-out tasks with more objects and more complex goals, outperforming a range of baselines.
zh

[CV-37] Seeing Space and Motion: Enhancing Latent Actions with Spatial and Dynamic Awareness for VLA

【速读】:该论文旨在解决Latent Action Models (LAMs) 在视觉-语言-动作(Vision-Language-Action, VLA)系统中面临的两个关键瓶颈:一是常用端到端训练的图像编码器存在空间理解能力不足的问题;二是当输入帧之间时间间隔较大时,LAMs 的行为容易变得脆弱,导致时间感知能力受限。为应对这些问题,论文提出 Farsighted-LAM 框架,其核心创新在于引入几何感知的空间编码机制和多尺度时间建模策略,从而捕捉结构先验信息与连续帧间的动态运动模式。在此基础上进一步构建 SSM-VLA 框架,通过整合结构化感知与视觉 Chain-of-Thought 模块,显式推理环境动态变化,提升决策一致性与可解释性。实验证明,该方法在仿真和真实场景下的多种 VLA 任务中均达到最优性能,验证了结合几何感知建模、时间连贯性和显式推理策略对增强具身智能鲁棒性和泛化能力的有效性。

链接: https://arxiv.org/abs/2509.26251
作者: Zhejia Cai,Yandan Yang,Xinyuan Chang,Shiyi Liang,Ronghan Chen,Feng Xiong,Mu Xu,Ruqi Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Latent Action Models (LAMs) enable Vision-Language-Action (VLA) systems to learn semantic action representations from large-scale unannotated data. Yet, we identify two bottlenecks of LAMs: 1) the commonly adopted end-to-end trained image encoder suffers from poor spatial understanding; 2) LAMs can be fragile when input frames are distant, leading to limited temporal perception. Such factors inevitably hinder stable and clear action modeling. To this end, we propose Farsighted-LAM, a latent action framework with geometry-aware spatial encoding and multi-scale temporal modeling, capturing structural priors and dynamic motion patterns from consecutive frames. We further propose SSM-VLA, an end-to-end VLA framework built upon Farsighted-LAM, which integrates structured perception with a visual Chain-of-Thought module to explicitly reason about environmental dynamics, enhancing decision consistency and interpretability. We validate SSM-VLA on multiple VLA tasks in both simulation and real-world settings, and achieve state-of-the-art performance. Our results demonstrate that our strategy of combining geometry-aware modeling, temporal coherence, and explicit reasoning is effective in enhancing the robustness and generalizability of embodied intelligence.
zh

[CV-38] Interpret prune and distill Donut : towards lightweight VLMs for VQA on document ICDAR

【速读】:该论文旨在解决视觉-rich文档理解中大型Vision-Language模型(如Donut)在实时或资源受限场景下部署成本过高的问题。其解决方案的关键在于通过知识蒸馏(knowledge distillation)结合机制可解释性(mechanistic interpretability)驱动的学生模型架构设计:通过对教师模型内部计算过程的分析,识别出必须保留的核心子模块,并明确哪些子模块可以近似、跳过或重新参数化,从而实现高效且性能稳定的模型压缩。这一方法最终形成了Donut-MINT(Mechanistic Interpretability-based Network Trimming),在显著降低推理时间和内存占用的同时,保持了在DocVQA基准上的优异性能。

链接: https://arxiv.org/abs/2509.26235
作者: Adnan Ben Mansour,Ayoub Karine,David Naccache
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at Workshop on Machine Learning in Document Analysis and Recognition (ICDAR WML 2025), Wuhan, China

点击查看摘要

Abstract:Recent advances in Visually-rich Document Understanding rely on large Vision-Language Models like Donut, which perform document-level Visual Question Answering without Optical Character Recognition. Despite their effectiveness, these models are too costly for real-time or resource-constrained applications. We investigate model compression through knowledge distillation, training compact student models from a larger teacher. We leverage mechanistic interpretability to drive student architecture design within this framework. By analyzing internal computations, we identify essential subcomponents to retain, while having a clear view of which subcomponents should be approximated, skipped, or reparametrized based on their function. This approach yields Donut-MINT (Mechanistic Interpretability-based Network Trimming), a pruned Donut variant that reduces inference time and memory usage while maintaining strong performance on DocVQA, a standard benchmark for document Visual Question Answering. Our method reframes compression as circuit discovery, bridging interpretability research and practical Vision-Language Model deployment.
zh

[CV-39] 3DiFACE: Synthesizing and Editing Holistic 3D Facial Animation

【速读】:该论文旨在解决当前语音驱动的3D面部动画方法在个性化控制、真实头部运动生成以及动画编辑方面的局限性问题。现有方法通常仅能控制风格或情绪,难以对已生成动画的局部进行编辑或重生成,并且忽略了同一音频输入可能对应多种合理唇部与头部动作的可能性。解决方案的关键在于提出3DiFACE框架,其核心创新包括:(1)设计了一种全卷积扩散模型(fully-convolutional diffusion model),可利用训练数据中音素级(viseme-level)的多样性;(2)引入说话风格个性化机制和一种新型稀疏引导运动扩散(sparsely-guided motion diffusion),从而实现高保真度与多样性的可控平衡,并支持通过关键帧插值进行灵活编辑。

链接: https://arxiv.org/abs/2509.26233
作者: Balamurugan Thambiraja,Malte Prinzler,Sadegh Aliakbarian,Darren Cosker,Justus Thies
机构: Max Planck Institute for Intelligent Systems (马克斯·普朗克智能系统研究所); Max Planck ETH Center for Intelligent Systems (马克斯·普朗克ETH智能系统中心); Technical University of Darmstadt (达姆施塔特工业大学); ETH Zürich (苏黎世联邦理工学院); Microsoft Mixed Reality & AI Lab, UK (微软混合现实与人工智能实验室)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Creating personalized 3D animations with precise control and realistic head motions remains challenging for current speech-driven 3D facial animation methods. Editing these animations is especially complex and time-consuming, requires precise control, and is typically handled by highly skilled animators. Most existing works focus on controlling style or emotion of the synthesized animation and cannot edit/regenerate parts of an input animation. They also overlook the fact that multiple plausible lip and head movements can match the same audio input. To address these challenges, we present 3DiFACE, a novel method for holistic speech-driven 3D facial animation. Our approach produces diverse plausible lip and head motions for a single audio input and allows for editing via keyframing and interpolation. Specifically, we propose a fully-convolutional diffusion model that can leverage the viseme-level diversity in our training corpus. Additionally, we employ a speaking-style personalization and a novel sparsely-guided motion diffusion to enable precise control and editing. Through quantitative and qualitative evaluations, we demonstrate that our method is capable of generating and editing diverse holistic 3D facial animations given a single audio input, with control between high fidelity and diversity. Code and models are available here: this https URL
zh

[CV-40] IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance ICCV2025

链接: https://arxiv.org/abs/2509.26231
作者: Jiayi Guo,Chuanhao Yan,Xingqian Xu,Yulin Wang,Kai Wang,Gao Huang,Humphrey Shi
机构: SHI Labs @ Georgia Tech (佐治亚理工学院); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025

点击查看摘要

[CV-41] Generalized Fine-Grained Category Discovery with Multi-Granularity Conceptual Experts

【速读】:该论文针对广义类别发现(Generalized Category Discovery, GCD)任务中两个核心问题展开研究:一是现有方法未能有效利用视觉数据中的多粒度概念信息,限制了表征质量;二是大多数方法假设训练时已知未标记类别数量,这在实际开放世界场景中不切实际。解决方案的关键在于提出多粒度概念专家框架(Multi-Granularity Conceptual Experts, MGCE),其核心创新包括:(1) 动态概念对比学习(Dynamic Conceptual Contrastive Learning, DCCL),通过概念挖掘与双层表示学习交替优化特征提取与类别发现;(2) 多粒度专家协同学习(Multi-Granularity Experts Collaborative Learning, MECL),引入不同粒度下的多个专家并借助概念对齐矩阵实现跨专家高效协作。更重要的是,MGCE能够自动估计未标记数据中的类别数,从而适应真实开放世界设置,实验表明其在九个细粒度视觉识别基准上显著优于现有方法,尤其在新类准确率方面平均提升3.6%。

链接: https://arxiv.org/abs/2509.26227
作者: Haiyang Zheng,Nan Pu,Wenjing Li,Nicu Sebe,Zhun Zhong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generalized Category Discovery (GCD) is an open-world problem that clusters unlabeled data by leveraging knowledge from partially labeled categories. A key challenge is that unlabeled data may contain both known and novel categories. Existing approaches suffer from two main limitations. First, they fail to exploit multi-granularity conceptual information in visual data, which limits representation quality. Second, most assume that the number of unlabeled categories is known during training, which is impractical in real-world scenarios. To address these issues, we propose a Multi-Granularity Conceptual Experts (MGCE) framework that adaptively mines visual concepts and integrates multi-granularity knowledge for accurate category discovery. MGCE consists of two modules: (1) Dynamic Conceptual Contrastive Learning (DCCL), which alternates between concept mining and dual-level representation learning to jointly optimize feature learning and category discovery; and (2) Multi-Granularity Experts Collaborative Learning (MECL), which extends the single-expert paradigm by introducing additional experts at different granularities and by employing a concept alignment matrix for effective cross-expert collaboration. Importantly, MGCE can automatically estimate the number of categories in unlabeled data, making it suitable for practical open-world settings. Extensive experiments on nine fine-grained visual recognition benchmarks demonstrate that MGCE achieves state-of-the-art results, particularly in novel-class accuracy. Notably, even without prior knowledge of category numbers, MGCE outperforms parametric approaches that require knowing the exact number of categories, with an average improvement of 3.6%. Code is available at this https URL.
zh

[CV-42] An Experimental Study on Generating Plausible Textual Explanations for Video Summarization

【速读】:该论文旨在解决视频摘要结果的可解释性问题,特别是生成符合人类认知预期的自然语言解释(natural language explanations)以增强生成式 AI (Generative AI) 系统的可信度。其解决方案的关键在于:首先扩展一个现有的多粒度视频摘要解释框架,集成当前最先进的大型多模态模型 LLaVA-OneVision 来生成视觉解释的文本描述;随后提出一种基于语义重叠度量的可解释性评估方法,利用 SBERT 和 SimCSE 两种句向量表示技术,量化视觉解释的文本描述与视频摘要文本之间的语义一致性,从而评估其“合理性”(plausibility)。实验表明,该方法能够有效判断更忠实的视觉解释是否也更具合理性,并识别出生成高质量文本解释的最佳策略。

链接: https://arxiv.org/abs/2509.26225
作者: Thomas Eleftheriadis,Evlampios Apostolidis,Vasileios Mezaris
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: IEEE CBMI 2025. This is the authors’ accepted version. The final publication is available at this https URL

点击查看摘要

Abstract:In this paper, we present our experimental study on generating plausible textual explanations for the outcomes of video summarization. For the needs of this study, we extend an existing framework for multigranular explanation of video summarization by integrating a SOTA Large Multimodal Model (LLaVA-OneVision) and prompting it to produce natural language descriptions of the obtained visual explanations. Following, we focus on one of the most desired characteristics for explainable AI, the plausibility of the obtained explanations, which relates to their alignment with humans’ reasoning and expectations. Using the extended framework, we propose an approach for evaluating the plausibility of visual explanations by quantifying the semantic overlap between their textual descriptions and the textual descriptions of the corresponding video summaries, with the help of two methods for creating sentence embeddings (SBERT, SimCSE). Based on the extended framework and the proposed plausibility evaluation approach, we conduct an experimental study using a SOTA method (CA-SUM) and two datasets (SumMe, TVSum) for video summarization, to examine whether the more faithful explanations are also the more plausible ones, and identify the most appropriate approach for generating plausible textual explanations for video summarization.
zh
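上文摘要用SBERT/SimCSE句向量衡量“视觉解释的文本描述”与“视频摘要的文本描述”之间的语义重叠,以此作为合理性(plausibility)分数。下面是一个最小示意(所用checkpoint名称与示例文本均为笔者假设,并非论文的原始配置与数据)。

```python
# pip install sentence-transformers  (the checkpoint name below is a common public model, used as an assumption)
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

explanation_text = "The explanation highlights the skateboarder performing jumps near the ramp."  # hypothetical
summary_text = "Selected shots show a skateboarder doing tricks on a ramp in a park."             # hypothetical

emb = model.encode([explanation_text, summary_text], convert_to_tensor=True, normalize_embeddings=True)
plausibility = util.cos_sim(emb[0], emb[1]).item()   # semantic overlap in [-1, 1]
print(f"plausibility score: {plausibility:.3f}")
```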

[CV-43] Beyond Pixels: Efficient Dataset Distillation via Sparse Gaussian Representation

【速读】:该论文旨在解决传统数据集蒸馏(dataset distillation)方法依赖密集像素级表示所导致的冗余信息多、难以扩展的问题。其核心解决方案是提出一种基于2D高斯(2D Gaussians)的稀疏表示方法GSDD,通过仅用少量高斯基元(Gaussian primitives)编码图像中的关键判别性信息,实现更高效的存储与更强的数据多样性,在相同存储预算下提升对困难样本的覆盖能力并增强蒸馏性能。该方法利用CUDA加速的splatting操作实现并行推理与训练,显著降低计算和内存开销,兼具高效性、可扩展性和通用性。

链接: https://arxiv.org/abs/2509.26219
作者: Chenyang Jiang,Zhengcen Li,Hang Zhao,Qiben Shan,Shaocong Wu,Jingyong Su
机构: Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区); Peng Cheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 19 pages; Code is available on this https URL

点击查看摘要

Abstract:Dataset distillation has emerged as a promising paradigm that synthesizes compact, informative datasets capable of retaining the knowledge of large-scale counterparts, thereby addressing the substantial computational and storage burdens of modern model training. Conventional approaches typically rely on dense pixel-level representations, which introduce redundancy and are difficult to scale up. In this work, we propose GSDD, a novel and efficient sparse representation for dataset distillation based on 2D Gaussians. Instead of representing all pixels equally, GSDD encodes critical discriminative information in a distilled image using only a small number of Gaussian primitives. This sparse representation could improve dataset diversity under the same storage budget, enhancing coverage of difficult samples and boosting distillation performance. To ensure both efficiency and scalability, we adapt CUDA-based splatting operators for parallel inference and training, enabling high-quality rendering with minimal computational and memory overhead. Our method is simple yet effective, broadly applicable to different distillation pipelines, and highly scalable. Experiments show that GSDD achieves state-of-the-art performance on CIFAR-10, CIFAR-100, and ImageNet subsets, while maintaining highly efficient encoding and decoding costs. Our code is available at this https URL.
zh
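上文摘要以少量2D高斯基元稀疏地表示蒸馏图像,并用CUDA splatting渲染。以下是一个纯CPU、各向同性高斯的朴素渲染草图(参数取值与渲染细节均为笔者假设,论文中的CUDA实现与可学习参数化会复杂得多)。

```python
import torch

def render_gaussians(means, sigmas, colors, H=32, W=32):
    """Naive CPU rendering of N isotropic 2D Gaussians into an H x W, C-channel image.
    means: [N, 2] in pixel coords; sigmas: [N]; colors: [N, C]."""
    ys, xs = torch.meshgrid(torch.arange(H).float(), torch.arange(W).float(), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).reshape(-1, 2)            # [H*W, 2]
    d2 = ((grid[:, None, :] - means[None, :, :]) ** 2).sum(-1)     # [H*W, N] squared distances
    weights = torch.exp(-0.5 * d2 / sigmas[None, :] ** 2)          # Gaussian falloff per primitive
    img = weights @ colors                                         # additive splatting, [H*W, C]
    return img.reshape(H, W, -1).clamp(0, 1)

# a distilled "image" made of only 16 Gaussian primitives (all values are toy placeholders)
N = 16
means = torch.rand(N, 2) * 32
sigmas = torch.rand(N) * 3 + 1
colors = torch.rand(N, 3) * 0.5
print(render_gaussians(means, sigmas, colors).shape)   # torch.Size([32, 32, 3])
```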

[CV-44] TSalV360: A Method and Dataset for Text-driven Saliency Detection in 360-Degrees Videos

【速读】:该论文旨在解决360度视频中基于文本驱动的显著性检测(text-driven saliency detection)问题,即根据用户提供的文本描述精准定位视频中相关对象或事件的显著区域。解决方案的关键在于提出TSalV360方法,其核心创新包括:利用先进的视觉-语言模型(vision-language model)进行多模态数据表征,引入相似性估计模块以捕捉文本与视觉内容间的语义关联,并设计视口时空交叉注意力机制(viewport spatio-temporal cross-attention mechanism),从而有效建模不同模态间的时间-空间依赖关系,实现定制化的文本引导显著性预测。

链接: https://arxiv.org/abs/2509.26208
作者: Ioannis Kontostathis,Evlampios Apostolidis,Vasileios Mezaris
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE CBMI 2025. This is the authors’ accepted version. The final publication is available at this https URL

点击查看摘要

Abstract:In this paper, we deal with the task of text-driven saliency detection in 360-degrees videos. For this, we introduce the TSV360 dataset which includes 16,000 triplets of ERP frames, textual descriptions of salient objects/events in these frames, and the associated ground-truth saliency maps. Following, we extend and adapt a SOTA visual-based approach for 360-degrees video saliency detection, and develop the TSalV360 method that takes into account a user-provided text description of the desired objects and/or events. This method leverages a SOTA vision-language model for data representation and integrates a similarity estimation module and a viewport spatio-temporal cross-attention mechanism, to discover dependencies between the different data modalities. Quantitative and qualitative evaluations using the TSV360 dataset, showed the competitiveness of TSalV360 compared to a SOTA visual-based approach and documented its competency to perform customized text-driven saliency detection in 360-degrees videos.
zh

[CV-45] Optimizing Indoor Environmental Quality in Smart Buildings Using Deep Learning ICDT

【速读】:该论文旨在解决传统供暖、通风与空调(HVAC)系统在保障室内环境质量(IEQ)时能耗过高的问题,通过引入深度学习方法实现对CO₂浓度、温度和湿度等IEQ参数的前瞻性调控,从而在提升 occupant comfort 的同时优化建筑能效。解决方案的关键在于利用 ROBOD 数据集,对比分析长短期记忆网络(LSTM)、门控循环单元(GRU)以及卷积神经网络-长短期记忆网络(CNN-LSTM)三种模型在不同时间尺度下的预测性能:其中 GRU 在短时预测中表现出更高的精度与更低的计算开销,CNN-LSTM 更擅长提取长期预测中的主导特征,而 LSTM 则具备更强的长程时序建模能力;这一比较揭示了预测可靠性受数据分辨率、传感器布设位置及人员流动波动的影响,为智能建筑管理系统(BMS)实施预测性 HVAC 控制提供了可操作的依据。

链接: https://arxiv.org/abs/2509.26187
作者: Youssef Sabiri,Walid Houmaidi,Aaya Bougrine,Salmane El Mansour Billah
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 4 figures, 1 table. Accepted and presented at the 5th International Conference on Digital Technologies and Applications (ICDTA 2025), April 17-18, 2025, Al Akhawayn University, Ifrane, Morocco

点击查看摘要

Abstract:Ensuring optimal Indoor Environmental Quality (IEQ) is vital for occupant health and productivity, yet it often comes at a high energy cost in conventional Heating, Ventilation, and Air Conditioning (HVAC) systems. This paper proposes a deep learning-driven approach to proactively manage IEQ parameters, specifically CO2 concentration, temperature, and humidity, while balancing building energy efficiency. Leveraging the ROBOD dataset collected from a net-zero energy academic building, we benchmark three architectures–Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU), and a hybrid Convolutional Neural Network LSTM (CNN-LSTM)–to forecast IEQ variables across various time horizons. Our results show that GRU achieves the best short-term prediction accuracy with lower computational overhead, whereas CNN-LSTM excels in extracting dominant features for extended forecasting windows. Meanwhile, LSTM offers robust long-range temporal modeling. The comparative analysis highlights that prediction reliability depends on data resolution, sensor placement, and fluctuating occupancy conditions. These findings provide actionable insights for intelligent Building Management Systems (BMS) to implement predictive HVAC control, thereby reducing energy consumption and enhancing occupant comfort in real-world building operations.
zh
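上文摘要对比了LSTM、GRU与CNN-LSTM在CO₂、温度、湿度预测上的表现,其中GRU在短期预测中精度较高且开销较低。下面是一个GRU预测器的最小示意草图(层数、隐藏维度、窗口长度等超参数均为笔者假设,并非论文配置)。

```python
import torch
import torch.nn as nn

class GRUForecaster(nn.Module):
    """Sketch of a GRU model mapping a window of past IEQ readings to the next-step values."""
    def __init__(self, n_features=3, hidden=64, horizon=1):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_features * horizon)
        self.horizon, self.n_features = horizon, n_features

    def forward(self, x):                       # x: [B, T, n_features] = (CO2, temperature, humidity)
        _, h = self.gru(x)
        out = self.head(h[-1])                  # hidden state of the last GRU layer
        return out.view(-1, self.horizon, self.n_features)

model = GRUForecaster()
window = torch.randn(8, 48, 3)                  # e.g. 48 past time steps for a batch of 8 rooms (toy data)
print(model(window).shape)                      # torch.Size([8, 1, 3])
```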

[CV-46] AttriGen: Automated Multi-Attribute Annotation for Blood Cell Datasets

【速读】:该论文旨在解决细胞显微图像中多属性(multi-attribute)细粒度标注自动化不足的问题,当前计算机视觉领域在细胞类型分类方面研究较成熟,但在形态学属性(如细胞核形状、胞质密度等)的多属性标注上仍存在明显空白。解决方案的关键在于提出AttriGen框架,采用双模型架构:卷积神经网络(CNN)用于细胞类型分类,视觉Transformer(Vision Transformer, ViT)用于多属性分类,从而实现高精度(94.62%准确率)的自动标注,并显著提升模型可解释性与标注效率,为其他计算机视觉任务提供可扩展的多属性标签自动化范式。

链接: https://arxiv.org/abs/2509.26185
作者: Walid Houmaidi,Youssef Sabiri,Fatima Zahra Iguenfer,Amine Abouaomar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 4 figures, 3 tables. Accepted at the 12th International Conference on Wireless Networks and Mobile Communications 2025 (WINCOM 2025)

点击查看摘要

Abstract:We introduce AttriGen, a novel framework for automated, fine-grained multi-attribute annotation in computer vision, with a particular focus on cell microscopy where multi-attribute classification remains underrepresented compared to traditional cell type categorization. Using two complementary datasets: the Peripheral Blood Cell (PBC) dataset containing eight distinct cell types and the WBC Attribute Dataset (WBCAtt) that contains their corresponding 11 morphological attributes, we propose a dual-model architecture that combines a CNN for cell type classification, as well as a Vision Transformer (ViT) for multi-attribute classification achieving a new benchmark of 94.62% accuracy. Our experiments demonstrate that AttriGen significantly enhances model interpretability and offers substantial time and cost efficiency relative to conventional full-scale human annotation. Thus, our framework establishes a new paradigm that can be extended to other computer vision classification tasks by effectively automating the expansion of multi-attribute labels.
zh

[CV-47] Neighbor-aware informal settlement mapping with graph convolutional networks ECML KDD2025

链接: https://arxiv.org/abs/2509.26171
作者: Thomas Hallopeau,Joris Guérin,Laurent Demagistri,Christovam Barcellos,Nadine Dessay
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 3 figures, 2 tables. Accepted at the ECML PKDD 2025 Workshop on Machine Learning for Earth Observation

点击查看摘要

[CV-48] Beyond Overall Accuracy: Pose- and Occlusion-driven Fairness Analysis in Pedestrian Detection for Autonomous Driving

【速读】:该论文旨在解决自动驾驶(AD)中行人检测模型在不同行人姿态和遮挡情况下的公平性问题,即现有检测模型对特定姿态(如并行双腿、直臂、侧向视角)或关节遮挡的行人存在系统性偏差,从而影响安全性和可靠性。解决方案的关键在于:首先构建了一个针对密集姿态场景的EuroCity Persons Dense Pose(ECP-DP)数据集,并系统评估五种专用行人检测器(F2DNet、MGAN、ALFNet、CSP、Cascade R-CNN)与三种通用目标检测器(YOLOv12变体)在多种姿态属性和遮挡条件下的性能差异;其次,采用Equal Opportunity Difference(EOD)指标量化公平性,结合Z检验验证统计显著性,最终发现下肢关节遮挡对检测率影响最大,且Cascade R-CNN在整体漏检率和各属性偏差方面表现最优,为提升行人检测系统的公平性和鲁棒性提供了实证依据与方法框架。

链接: https://arxiv.org/abs/2509.26166
作者: Mohammad Khoshkdahan,Arman Akbari,Arash Akbari,Xuan Zhang
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院); Northeastern University (东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: \c{opyright} 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

点击查看摘要

Abstract:Pedestrian detection plays a critical role in autonomous driving (AD), where ensuring safety and reliability is important. While many detection models aim to reduce miss-rates and handle challenges such as occlusion and long-range recognition, fairness remains an underexplored yet equally important concern. In this work, we systematically investigate how variations in the pedestrian pose–including leg status, elbow status, and body orientation–as well as individual joint occlusions, affect detection performance. We evaluate five pedestrian-specific detectors (F2DNet, MGAN, ALFNet, CSP, and Cascade R-CNN) alongside three general-purpose models (YOLOv12 variants) on the EuroCity Persons Dense Pose (ECP-DP) dataset. Fairness is quantified using the Equal Opportunity Difference (EOD) metric across various confidence thresholds. To assess statistical significance and robustness, we apply the Z-test. Our findings highlight biases against pedestrians with parallel legs, straight elbows, and lateral views. Occlusion of lower body joints has a more negative impact on the detection rate compared to the upper body and head. Cascade R-CNN achieves the lowest overall miss-rate and exhibits the smallest bias across all attributes. To the best of our knowledge, this is the first comprehensive pose- and occlusion-aware fairness evaluation in pedestrian detection for AD.
zh
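上文摘要以Equal Opportunity Difference (EOD)衡量不同姿态/遮挡属性组之间的检测率差异,并用Z检验评估显著性。以下给出EOD与双比例Z检验的示意性计算(分组方式与计数均为虚构示例,EOD的具体定义以论文为准)。

```python
from math import sqrt
from scipy.stats import norm

def detection_rate(tp, fn):
    return tp / (tp + fn)

def equal_opportunity_difference(tp_a, fn_a, tp_b, fn_b):
    """EOD between two pedestrian groups = difference of their detection (true-positive) rates."""
    return detection_rate(tp_a, fn_a) - detection_rate(tp_b, fn_b)

def two_proportion_z_test(tp_a, fn_a, tp_b, fn_b):
    """Two-sided z-test for whether the two detection rates differ significantly."""
    n_a, n_b = tp_a + fn_a, tp_b + fn_b
    p_a, p_b = tp_a / n_a, tp_b / n_b
    p = (tp_a + tp_b) / (n_a + n_b)                       # pooled proportion
    z = (p_a - p_b) / sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return z, 2 * (1 - norm.cdf(abs(z)))                  # z statistic, p-value

# toy counts: pedestrians with lateral vs. frontal body orientation (numbers invented for illustration)
print(equal_opportunity_difference(820, 180, 905, 95))
print(two_proportion_z_test(820, 180, 905, 95))
```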

[CV-49] Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在人类中心场景理解能力评估中缺乏全面、高质量基准评测体系的问题。现有基准普遍忽视了人类导向的细粒度感知与高维因果推理能力,且受限于人体物理结构复杂性和标注难度,难以实现可靠评估。解决方案的关键在于提出Human-MME基准,其核心创新包括:(1) 构建覆盖4个主视觉领域、15个次级领域及43个子领域的多样化人类场景数据集,确保场景广度;(2) 设计从人类导向细粒度感知到高维推理的8维渐进式评估维度,包含19,945对真实世界图像问答对;(3) 建立自动化标注流程与人工标注平台,保障高质量标注质量。该基准通过选择题、简答题、定位、排序和判断等多样化问题形式,推动模型从单目标理解向多人、多图互相关联理解演进,显著揭示了当前17个前沿MLLMs在人类中心场景理解中的局限性,并为未来研究提供明确方向。

链接: https://arxiv.org/abs/2509.26165
作者: Yuansen Liu,Haiming Tang,Jinlong Peng,Jiangning Zhang,Xiaozhong Ji,Qingdong He,Donghao Luo,Zhenye Gan,Junwei Zhu,Yunhang Shen,Chaoyou Fu,Chengjie Wang,Xiaobin Hu,Shuicheng Yan
机构: National University of Singapore (新加坡国立大学); Tencent Youtu Lab (腾讯优图实验室); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks. However, their capacity to comprehend human-centric scenes has rarely been explored, primarily due to the absence of comprehensive evaluation benchmarks that take into account both the human-oriented granular level and higher-dimensional causal reasoning ability. Such high-quality evaluation benchmarks face tough obstacles, given the physical complexity of the human body and the difficulty of annotating granular structures. In this paper, we propose Human-MME, a curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric scene understanding. Compared with other existing benchmarks, our work provides three key features: 1. Diversity in human scene, spanning 4 primary visual domains with 15 secondary domains and 43 sub-fields to ensure broad scenario coverage. 2. Progressive and diverse evaluation dimensions, evaluating the human-based activities progressively from the human-oriented granular perception to the higher-dimensional reasoning, consisting of eight dimensions with 19,945 real-world image question pairs and an evaluation suite. 3. High-quality annotations with rich data paradigms, constructing the automated annotation pipeline and human-annotation platform, supporting rigorous manual labeling to facilitate precise and reliable model assessment. Our benchmark extends the single-target understanding to the multi-person and multi-image mutual understanding by constructing the choice, short-answer, grounding, ranking and judgment question components, and complex questions of their combination. The extensive experiments on 17 state-of-the-art MLLMs effectively expose the limitations and guide future MLLMs research toward better human-centric image understanding. All data and code are available at this https URL.
zh

[CV-50] Towards Continual Expansion of Data Coverage: Automatic Text-guided Edge-case Synthesis

【速读】:该论文旨在解决深度神经网络性能受训练数据质量影响的问题,特别是如何有效缓解数据集偏差(dataset bias)——传统方法依赖人工筛选边缘案例(edge cases),效率低下且难以覆盖多样场景。解决方案的关键在于构建一个自动化文本引导的边缘案例合成流程:首先利用偏好学习(preference learning)微调大型语言模型(Large Language Model, LLM),将原始图像描述重新表述为多样化文本提示(prompt),进而驱动文生图模型(Text-to-Image model)生成具有挑战性的视觉场景;该方法在FishEye8K目标检测基准上验证了优于简单数据增强和人工设计提示的鲁棒性,实现了从人工数据整理向自动化、定向数据合成的范式转变。

链接: https://arxiv.org/abs/2509.26158
作者: Kyeongryeol Go
机构: Superb AI
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 17 pages, 6 figures

点击查看摘要

Abstract:The performance of deep neural networks is strongly influenced by the quality of their training data. However, mitigating dataset bias by manually curating challenging edge cases remains a major bottleneck. To address this, we propose an automated pipeline for text-guided edge-case synthesis. Our approach employs a Large Language Model, fine-tuned via preference learning, to rephrase image captions into diverse textual prompts that steer a Text-to-Image model toward generating difficult visual scenarios. Evaluated on the FishEye8K object detection benchmark, our method achieves superior robustness, surpassing both naive augmentation and manually engineered prompts. This work establishes a scalable framework that shifts data curation from manual effort to automated, targeted synthesis, offering a promising direction for developing more reliable and continuously improving AI systems. Code is available at this https URL.
zh

[CV-51] EntroPE: Entropy-Guided Dynamic Patch Encoder for Time Series Forecasting

【速读】:该论文旨在解决现有基于Transformer的时间序列预测方法中,采用固定长度且与时间无关的分块策略(patch-based input strategy)所导致的时序结构破坏问题。这种静态分块方式常将自然的时间过渡切分到不同patch边界,从而削弱短时依赖关系并降低表示学习效果。解决方案的关键在于提出EntroPE(Entropy-Guided Dynamic Patch Encoder),其核心创新是引入信息论准则——条件熵(conditional entropy)来动态检测时间序列中的自然转换点,并据此自适应地确定patch边界,从而在保留patch计算效率的同时,有效维护时序连贯性;EntroPE包含两个关键模块:熵驱动的动态分块器(Entropy-based Dynamic Patcher, EDP)用于识别过渡点并划分patch,以及自适应patch编码器(Adaptive Patch Encoder, APE)通过池化和交叉注意力机制捕获块内依赖并生成固定维度的潜在表示,最终由全局Transformer建模块间动态关系。

链接: https://arxiv.org/abs/2509.26157
作者: Sachith Abeywickrama,Emadeldeen Eldele,Min Wu,Xiaoli Li,Chau Yuen
机构: Nanyang Technological University (南洋理工大学); A*STAR (新加坡科技研究局); Khalifa University (哈利法大学); Singapore University of Technology and Design (新加坡科技设计大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint. Under Review

点击查看摘要

Abstract:Transformer-based models have significantly advanced time series forecasting, with patch-based input strategies offering efficiency and improved long-horizon modeling. Yet, existing approaches rely on temporally-agnostic patch construction, where arbitrary starting positions and fixed lengths fracture temporal coherence by splitting natural transitions across boundaries. This naive segmentation often disrupts short-term dependencies and weakens representation learning. In response, we propose EntroPE (Entropy-Guided Dynamic Patch Encoder), a novel, temporally informed framework that dynamically detects transition points via conditional entropy and dynamically places patch boundaries. This preserves temporal structure while retaining the computational benefits of patching. EntroPE consists of two key modules, namely an Entropy-based Dynamic Patcher (EDP) that applies information-theoretic criteria to locate natural temporal shifts and determine patch boundaries, and an Adaptive Patch Encoder (APE) that employs pooling and cross-attention to capture intra-patch dependencies and produce fixed-size latent representations. These embeddings are then processed by a global transformer to model inter-patch dynamics. Experiments across long-term forecasting benchmarks demonstrate that EntroPE improves both accuracy and efficiency, establishing entropy-guided dynamic patching as a promising new paradigm for time series modeling. Code is available at: this https URL.
zh
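上文摘要中的EDP模块用条件熵定位时间序列的自然转换点并据此动态划分patch边界。以下是一个基于离散化符号与经验转移分布的示意草图(分箱数、Laplace平滑与阈值分位数均为笔者假设,并非论文算法本身),用逐步条件惊异度 -log p(s_t | s_{t-1}) 的峰值近似“转换点”。

```python
import numpy as np

def entropy_guided_boundaries(x, n_bins=16, quantile=0.9):
    """Sketch of entropy-guided dynamic patching: discretise the series, estimate an
    empirical transition distribution, and place patch boundaries where the conditional
    surprisal -log p(s_t | s_{t-1}) is unusually high (a likely regime transition)."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    s = np.digitize(x, edges)                                  # symbol sequence in {0, ..., n_bins-1}
    counts = np.ones((n_bins, n_bins))                         # Laplace smoothing
    for a, b in zip(s[:-1], s[1:]):
        counts[a, b] += 1
    p = counts / counts.sum(axis=1, keepdims=True)
    surprisal = -np.log(p[s[:-1], s[1:]])                      # per-step conditional information
    thresh = np.quantile(surprisal, quantile)
    return np.flatnonzero(surprisal > thresh) + 1              # indices where new patches start

# toy series with an abrupt regime change at t = 256
t = np.arange(512)
series = np.concatenate([np.sin(t[:256] * 0.1), 2.0 + 0.3 * np.sin(t[256:] * 0.4)])
print(entropy_guided_boundaries(series)[:10])
```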

[CV-52] EchoGen: Generating Visual Echoes in Any Scene via Feed-Forward Subject-Driven Auto-Regressive Model

【速读】:该论文旨在解决生成式 AI 中主体驱动生成(subject-driven generation)任务中存在的效率与可控性之间的权衡问题:现有方法要么依赖计算成本高昂的逐主体微调,牺牲了零样本能力与效率;要么基于扩散模型的前馈架构,存在推理速度慢的瓶颈。解决方案的关键在于提出 EchoGen 框架,其核心创新是一种有效的双路径注入策略(dual-path injection strategy),通过语义编码器提取主体的高层语义身份并经解耦交叉注意力注入以引导整体构图,同时利用内容编码器捕获低层细节并通过多模态注意力机制融合,从而实现高保真度纹理和结构保留的同时提升可控性。该框架是首个基于视觉自回归模型(Visual Auto-Regressive, VAR)的前馈式主体驱动生成方法,在保持与扩散模型相当的图像质量与主体保真度的前提下,显著降低采样延迟。

链接: https://arxiv.org/abs/2509.26127
作者: Ruixiao Dong,Zhendong Wang,Keli Liu,Li Li,Ying Chen,Kai Li,Daowen Li,Houqiang Li
机构: Alibaba Group - Taobao & Tmall Group (阿里巴巴集团-淘宝与天猫集团); Institute of Artificial Intelligence, Hefei Comprehensive National Science Center (合肥综合性国家科学中心人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Subject-driven generation is a critical task in creative AI; yet current state-of-the-art methods present a stark trade-off. They either rely on computationally expensive, per-subject fine-tuning, sacrificing efficiency and zero-shot capability, or employ feed-forward architectures built on diffusion models, which are inherently plagued by slow inference speeds. Visual Auto-Regressive (VAR) models are renowned for their rapid sampling speeds and strong generative quality, making them an ideal yet underexplored foundation for resolving this tension. To bridge this gap, we introduce EchoGen, a pioneering framework that empowers VAR models with subject-driven generation capabilities. The core design of EchoGen is an effective dual-path injection strategy that disentangles a subject’s high-level semantic identity from its low-level fine-grained details, enabling enhanced controllability and fidelity. We employ a semantic encoder to extract the subject’s abstract identity, which is injected through decoupled cross-attention to guide the overall composition. Concurrently, a content encoder captures intricate visual details, which are integrated via a multi-modal attention mechanism to ensure high-fidelity texture and structural preservation. To the best of our knowledge, EchoGen is the first feed-forward subject-driven framework built upon VAR models. Both quantitative and qualitative results substantiate our design, demonstrating that EchoGen achieves subject fidelity and image quality comparable to state-of-the-art diffusion-based methods with significantly lower sampling latency. Code and models will be released soon.
zh

[CV-53] EVODiff: Entropy-aware Variance Optimized Diffusion Inference NEURIPS2025

链接: https://arxiv.org/abs/2509.26096
作者: Shigui Li,Wei Chen,Delu Zeng
机构: South China University of Technology (华南理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注: NeurIPS 2025, 40 pages, 14 figures

点击查看摘要

[CV-54] Text-to-Scene with Large Reasoning Models

【速读】:该论文旨在解决当前文本到场景(text-to-scene)生成方法在处理复杂几何结构和对象变换时的局限性,以及对复杂指令遵循能力弱的问题。其解决方案的关键在于引入基于大推理模型(Large Reasoning Models, LRMs)的Reason-3D框架:首先通过包含物理、功能和上下文属性的图像描述进行对象检索,随后依据显式与隐式布局约束放置对象,并利用碰撞感知的空间推理对位置进行精细化调整,从而显著提升场景生成的视觉保真度、约束遵循准确性和资产检索质量。

链接: https://arxiv.org/abs/2509.26091
作者: Frédéric Berdoz,Luca A. Lanzendörfer,Nick Tuninga,Roger Wattenhofer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Prompt-driven scene synthesis allows users to generate complete 3D environments from textual descriptions. Current text-to-scene methods often struggle with complex geometries and object transformations, and tend to show weak adherence to complex instructions. We address these limitations by introducing Reason-3D, a text-to-scene model powered by large reasoning models (LRMs). Reason-3D integrates object retrieval using captions covering physical, functional, and contextual attributes. Reason-3D then places the selected objects based on implicit and explicit layout constraints, and refines their positions with collision-aware spatial reasoning. Evaluated on instructions ranging from simple to complex indoor configurations, Reason-3D significantly outperforms previous methods in human-rated visual fidelity, adherence to constraints, and asset retrieval quality. Beyond its contribution to the field of text-to-scene generation, our work showcases the advanced spatial reasoning abilities of modern LRMs. Additionally, we release the codebase to further the research in object retrieval and placement with LRMs.
zh

[CV-55] Predicting Penalty Kick Direction Using Multi-Modal Deep Learning with Pose-Guided Attention

链接: https://arxiv.org/abs/2509.26088
作者: Pasindu Ranasinghe,Pamudu Ranasinghe
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-56] EasyOcc: 3D Pseudo-Label Supervision for Fully Self-Supervised Semantic Occupancy Prediction Models

【速读】:该论文旨在解决自监督学习中语义占据预测(semantic occupancy prediction)任务因缺乏真实标签而导致的性能瓶颈问题,尤其是在复杂场景下如何有效缓解语义与深度歧义。其解决方案的关键在于引入由基础模型Grounded-SAM和Metric3Dv2生成的3D伪标签(3D pseudo-ground-truth labels),并利用时序信息进行标签稀疏性增强(label densification),从而显著提升模型训练效率与精度。此外,作者进一步提出轻量化模型EasyOcc,仅依赖此类伪标签即可实现高性能,避免了传统方法中高计算开销的渲染策略(如新视角合成),在完整场景评估中达到7.71 mIoU,优于此前最优模型31%。这一成果凸显了基础模型、时序上下文及损失计算空间选择在自监督场景理解中的核心作用。

链接: https://arxiv.org/abs/2509.26087
作者: Seamie Hayes,Ganesh Sistu,Ciarán Eising
机构: University of Limerick (利默里克大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Self-supervised models have recently achieved notable advancements, particularly in the domain of semantic occupancy prediction. These models utilize sophisticated loss computation strategies to compensate for the absence of ground-truth labels. For instance, techniques such as novel view synthesis, cross-view rendering, and depth estimation have been explored to address the issue of semantic and depth ambiguity. However, such techniques typically incur high computational costs and memory usage during the training stage, especially in the case of novel view synthesis. To mitigate these issues, we propose 3D pseudo-ground-truth labels generated by the foundation models Grounded-SAM and Metric3Dv2, and harness temporal information for label densification. Our 3D pseudo-labels can be easily integrated into existing models, which yields substantial performance improvements, with mIoU increasing by 45%, from 9.73 to 14.09, when implemented into the OccNeRF model. This stands in contrast to earlier advancements in the field, which are often not readily transferable to other architectures. Additionally, we propose a streamlined model, EasyOcc, achieving 13.86 mIoU. This model conducts learning solely from our labels, avoiding complex rendering strategies mentioned previously. Furthermore, our method enables models to attain state-of-the-art performance when evaluated on the full scene without applying the camera mask, with EasyOcc achieving 7.71 mIoU, outperforming the previous best model by 31%. These findings highlight the critical importance of foundation models, temporal context, and the choice of loss computation space in self-supervised learning for comprehensive scene understanding.
zh

[CV-57] Geometric Learning of Canonical Parameterizations of 2D-curves

【速读】:该论文旨在解决计算机视觉与医学应用中常见数据集所具有的对称性(symmetry)在分类任务中未被充分建模的问题,例如物体检测中的旋转和缩放对称性。传统方法依赖数据增强来学习这些对称性,但存在资源消耗大、可持续性差等局限。其解决方案的关键在于引入主纤维丛(principal fiber bundle)理论中的截面(section)概念,通过构造一个可优化的几何结构来“消去”对称性,从而在商空间中使用简单度量衡量对象轨道间的差异,并可通过优化截面以最大化类别分离。该方法无需数据增强即可实现对称不变表示的学习,具有良好的泛化潜力和理论基础。

链接: https://arxiv.org/abs/2509.26070
作者: Ioana Ciuclea,Giorgio Longari,Alice Barbara Tumpach
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Differential Geometry (math.DG)
备注: 30 pages, 18 figures

点击查看摘要

Abstract:Most datasets encountered in computer vision and medical applications present symmetries that should be taken into account in classification tasks. A typical example is the symmetry by rotation and/or scaling in object detection. A common way to build neural networks that learn the symmetries is to use data augmentation. In order to avoid data augmentation and build more sustainable algorithms, we present an alternative method to mod out symmetries based on the notion of section of a principal fiber bundle. This framework allows the use of simple metrics on the space of objects in order to measure dissimilarities between orbits of objects under the symmetry group. Moreover, the section used can be optimized to maximize separation of classes. We illustrate this methodology on a dataset of contours of objects for the groups of translations, rotations, scalings and reparameterizations. In particular, we present a 2-parameter family of canonical parameterizations of curves, containing the constant-speed parameterization as a special case, which we believe is interesting in its own right. We hope that this simple application will serve to convey the geometric concepts underlying this method, which have a wide range of possible applications. The code is available at this https URL. A tutorial notebook showcasing an application of the code to a specific dataset is available at this https URL.
zh
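下面用 NumPy 给出等速(弧长)重参数化的最小示意实现——即摘要中提到的 2 参数规范参数化族的一个特例;代码按常见做法编写,并非论文官方实现:

```python
import numpy as np

# 示意代码:对 2D 曲线做等速(弧长)重参数化。
def arc_length_reparam(points: np.ndarray, n_samples: int = 100) -> np.ndarray:
    """points: (N, 2) 的折线采样点,返回按弧长均匀重采样后的 (n_samples, 2) 点列。"""
    seg = np.diff(points, axis=0)                      # 每段位移
    seg_len = np.linalg.norm(seg, axis=1)              # 每段长度
    s = np.concatenate([[0.0], np.cumsum(seg_len)])    # 累积弧长
    s /= s[-1]                                         # 归一化到 [0, 1]
    t_new = np.linspace(0.0, 1.0, n_samples)
    x = np.interp(t_new, s, points[:, 0])
    y = np.interp(t_new, s, points[:, 1])
    return np.stack([x, y], axis=1)

if __name__ == "__main__":
    # 一条采样不均匀的圆弧:重参数化后相邻点间距近似相等
    t = np.linspace(0, np.pi, 50) ** 2 / np.pi         # 非均匀参数
    curve = np.stack([np.cos(t), np.sin(t)], axis=1)
    resampled = arc_length_reparam(curve, n_samples=20)
    gaps = np.linalg.norm(np.diff(resampled, axis=0), axis=1)
    print(gaps.round(3))                               # 间距近似为常数
```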

[CV-58] GaussEdit: Adaptive 3D Scene Editing with Text and Image Prompts

链接: https://arxiv.org/abs/2509.26055
作者: Zhenyu Shu,Junlong Yu,Kai Chao,Shiqing Xin,Ligang Liu
机构: NingboTech University (宁波工程学院); Zhejiang University (浙江大学); Xi’an Jiaotong University (西安交通大学); ShanDong University (山东大学); University of Science and Technology of China (中国科学技术大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

[CV-59] DGM4: Dataset Extension for Global Scene Inconsistency

【速读】:该论文旨在解决当前多模态伪造检测研究中对全局不一致性的忽视问题,特别是前景-背景(Foreground-Background, FG-BG)错位及其与文本篡改的混合形式在真实世界伪造内容中的普遍性。现有数据集如DGM4主要关注局部操作(如人脸替换、属性编辑和标题修改),难以评估模型对全局语义一致性异常的识别能力。解决方案的关键在于扩展DGM4数据集,新增5000个高质量样本,通过OpenAI的gpt-image-1模型生成包含人类主体置于荒谬或不可能背景中的新闻风格图像,并结合三种文本操纵条件(字面、文本属性、文本分割)构建三类新型伪造类型:FG-BG、FG-BG+TA 和 FG-BG+TS。同时,设计严格的质量控制流程以确保图像真实性与多样性,从而形成一个兼顾局部与全局推理能力的新基准数据集DGM4+,用于系统性评测如HAMMER等多模态模型在面对复杂伪造时的鲁棒性。

链接: https://arxiv.org/abs/2509.26047
作者: Gagandeep Singh,Samudi Amarsinghe,Priyanka Singh,Xue Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 3 figures

点击查看摘要

Abstract:The rapid advances in generative models have significantly lowered the barrier to producing convincing multimodal disinformation. Fabricated images and manipulated captions increasingly co-occur to create persuasive false narratives. While the Detecting and Grounding Multi-Modal Media Manipulation (DGM4) dataset established a foundation for research in this area, it is restricted to local manipulations such as face swaps, attribute edits, and caption changes. This leaves a critical gap: global inconsistencies, such as mismatched foregrounds and backgrounds, which are now prevalent in real-world forgeries. To address this, we extend DGM4 with 5,000 high-quality samples that introduce Foreground-Background (FG-BG) mismatches and their hybrids with text manipulations. Using OpenAI’s gpt-image-1 and carefully designed prompts, we generate human-centric news-style images where authentic figures are placed into absurd or impossible backdrops (e.g., a teacher calmly addressing students on the surface of Mars). Captions are produced under three conditions: literal, text attribute, and text split, yielding three new manipulation categories: FG-BG, FG-BG+TA, and FG-BG+TS. Quality control pipelines enforce one-to-three visible faces, perceptual hash deduplication, OCR-based text scrubbing, and realistic headline length. By introducing global manipulations, our extension complements existing datasets, creating a benchmark DGM4+ that tests detectors on both local and global reasoning. This resource is intended to strengthen evaluation of multimodal models such as HAMMER, which currently struggle with FG-BG inconsistencies. We release our DGM4+ dataset and generation script at this https URL
zh
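针对摘要中提到的“perceptual hash deduplication”质量控制步骤,下面给出一个基于平均哈希(aHash)的去重示意(哈希算法与距离阈值均为笔者假设,非 DGM4+ 官方流程):

```python
import numpy as np
from PIL import Image

# 示意代码:感知哈希去重。与已保留图像哈希距离过近的样本被视为重复并丢弃。
def average_hash(img: Image.Image, hash_size: int = 8) -> np.ndarray:
    g = img.convert("L").resize((hash_size, hash_size), Image.BILINEAR)
    arr = np.asarray(g, dtype=np.float32)
    return (arr > arr.mean()).flatten()          # 64 位布尔哈希

def hamming(h1: np.ndarray, h2: np.ndarray) -> int:
    return int(np.count_nonzero(h1 != h2))

def deduplicate(paths, max_dist: int = 5):
    """返回去重后的图像路径列表:与已保留图像哈希距离 <= max_dist 的视为重复。"""
    kept, hashes = [], []
    for p in paths:
        h = average_hash(Image.open(p))
        if all(hamming(h, h0) > max_dist for h0 in hashes):
            kept.append(p)
            hashes.append(h)
    return kept
```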

[CV-60] SGS: Segmentation-Guided Scoring for Global Scene Inconsistencies

【速读】:该论文旨在解决当前多模态篡改检测模型(如HAMMER)在面对全局场景不一致(如前景-背景FG-BG错位)时表现不佳的问题,尤其当主体对象被置于不合逻辑的背景中时,模型会失效。其根本原因被诊断为标签空间偏差、局部注意力聚焦以及虚假的文本-前景对齐。解决方案的关键在于提出一种轻量级的分割引导评分(Segmentation-Guided Scoring, SGS)管道:利用人物/人脸分割掩码分离前景与背景区域,通过联合视觉-语言模型提取嵌入,并计算区域感知的一致性得分;这些得分与HAMMER原始预测融合,从而提升二分类检测、定位准确性和token级解释能力。SGS无需重新训练,仅在推理阶段运行,计算开销极低,显著增强了对全局篡改的鲁棒性。

链接: https://arxiv.org/abs/2509.26039
作者: Gagandeep Singh,Samudi Amarsinghe,Urawee Thani,Ki Fung Wong,Priyanka Singh,Xue Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 3 figures

点击查看摘要

Abstract:We extend HAMMER, a state-of-the-art model for multimodal manipulation detection, to handle global scene inconsistencies such as foreground-background (FG-BG) mismatch. While HAMMER achieves strong performance on the DGM4 dataset, it consistently fails when the main subject is contextually misplaced into an implausible background. We diagnose this limitation as a combination of label-space bias, local attention focus, and spurious text-foreground alignment. To remedy this without retraining, we propose a lightweight segmentation-guided scoring (SGS) pipeline. SGS uses person/face segmentation masks to separate foreground and background regions, extracts embeddings with a joint vision-language model, and computes region-aware coherence scores. These scores are fused with HAMMER’s original prediction to improve binary detection, grounding, and token-level explanations. SGS is inference-only, incurs negligible computational overhead, and significantly enhances robustness to global manipulations. This work demonstrates the importance of region-aware reasoning in multimodal disinformation detection. We release scripts for segmentation and scoring at this https URL
zh
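下面按摘要思路给出区域感知一致性打分与分数融合的示意代码(嵌入向量假定已由某个联合视觉-语言模型提取并作为输入传入,打分与融合公式均为笔者的简化假设,非 SGS 官方实现):

```python
import numpy as np

# 示意代码:用前景/背景/文本嵌入的余弦一致性计算区域得分,再与基础模型预测融合。
def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def sgs_fused_score(base_fake_prob: float,
                    fg_emb: np.ndarray, bg_emb: np.ndarray,
                    text_emb: np.ndarray, alpha: float = 0.5) -> float:
    """返回融合后的“伪造概率”。前景-背景及前景-文本一致性越低,越可能是 FG-BG 失配。"""
    coherence = 0.5 * cosine(fg_emb, bg_emb) + 0.5 * cosine(fg_emb, text_emb)
    region_fake_prob = 1.0 - (coherence + 1.0) / 2.0   # 把 [-1,1] 映到 [0,1] 后取反
    return alpha * base_fake_prob + (1.0 - alpha) * region_fake_prob

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fg, bg, txt = rng.normal(size=(3, 512))
    print(round(sgs_fused_score(0.3, fg, bg, txt), 3))
```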

[CV-61] CoLLM-NAS: Collaborative Large Language Models for Efficient Knowledge-Guided Neural Architecture Search

链接: https://arxiv.org/abs/2509.26037
作者: Zhe Li,Zhiwei Lin,Yongtao Wang
机构: Wangxuan Institute of Computer Technology, Peking University (北京大学计算机技术研究所)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

[CV-62] SeMoBridge: Semantic Modality Bridge for Efficient Few-Shot Adaptation of CLIP ICLR2026

链接: https://arxiv.org/abs/2509.26036
作者: Christoph Timmermann,Hyunse Lee,Woojin Lee
机构: Dongguk University (东国大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 19 pages, 12 figures, Under review as a conference paper at ICLR 2026

点击查看摘要

[CV-63] Causally Guided Gaussian Perturbations for Out-Of-Distribution Generalization in Medical Imaging

【速读】:该论文旨在解决深度学习模型在真实世界场景中面临的分布外(Out-of-distribution, OOD)泛化能力不足的问题,尤其是在生物医学图像等存在细微但普遍分布偏移的领域。其解决方案的关键在于提出一种轻量级框架——因果引导高斯扰动(Causally-Guided Gaussian Perturbations, CGP),通过基于视觉Transformer(Vision Transformer)生成的软因果掩码(soft causal masks),对输入图像的空间区域施加差异化的噪声扰动:对背景区域施加强扰动,对前景区域施以弱扰动。这种方法促使模型依赖于因果相关的特征而非虚假关联,从而提升OOD泛化性能。在WILDS基准数据集Camelyon17上的实验结果表明,CGP相较现有最优OOD基线方法实现了稳定性能提升,验证了因果扰动作为可靠且可解释泛化工具的有效性。

链接: https://arxiv.org/abs/2509.26027
作者: Haoran Pei,Yuguang Yang,Kexin Liu,Baochang Zhang
机构: Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Out-of-distribution (OOD) generalization remains a central challenge in deploying deep learning models to real-world scenarios, particularly in domains such as biomedical images, where distribution shifts are both subtle and pervasive. While existing methods often pursue domain invariance through complex generative models or adversarial training, these approaches may overlook the underlying causal mechanisms of distribution shifts. In this work, we propose Causally-Guided Gaussian Perturbations (CGP), a lightweight framework that enhances OOD generalization by injecting spatially varying noise into input images, guided by soft causal masks derived from Vision Transformers. By applying stronger perturbations to background regions and weaker ones to foreground areas, CGP encourages the model to rely on causally relevant features rather than spurious correlations. Experimental results on the challenging WILDS benchmark Camelyon17 demonstrate consistent performance gains over state-of-the-art OOD baselines, highlighting the potential of causal perturbation as a tool for reliable and interpretable generalization.
zh
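摘要中的核心操作——按软因果掩码对背景施加更强、对前景施加更弱的高斯扰动——可以用几行 PyTorch 代码示意(sigma 取值为笔者假设,非论文官方实现):

```python
import torch

# 示意代码:软因果掩码(前景≈1,背景≈0)控制逐像素高斯噪声强度。
def causally_guided_perturb(img: torch.Tensor, causal_mask: torch.Tensor,
                            sigma_fg: float = 0.05, sigma_bg: float = 0.30) -> torch.Tensor:
    """img: (B, C, H, W);causal_mask: (B, 1, H, W),取值在 [0, 1]。"""
    sigma = sigma_bg * (1.0 - causal_mask) + sigma_fg * causal_mask   # 逐像素标准差
    return img + torch.randn_like(img) * sigma                        # 广播到各通道

if __name__ == "__main__":
    x = torch.rand(2, 3, 224, 224)
    mask = torch.rand(2, 1, 224, 224)          # 实际中来自 ViT 注意力得到的软掩码
    print(causally_guided_perturb(x, mask).shape)
```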

[CV-64] PatchVSR: Breaking Video Diffusion Resolution Limits with Patch-wise Video Super-Resolution CVPR2025

【速读】:该论文旨在解决现有预训练视频生成模型在全尺寸视频超分辨率(VSR)任务中因冗余的全局注意力计算和固定输出分辨率而导致的效率低下与灵活性不足的问题。其核心解决方案是首次探索利用视频扩散先验(video diffusion priors)进行分块(patch-wise)超分辨率重建,提出名为PatchVSR的方法,关键在于引入双流适配器(dual-stream adapter)实现条件引导:其中局部分支提取输入图像块特征以保持内容保真度,全局分支从下采样后的完整视频中提取上下文特征以弥补局部块语义不完整的问题;同时将图像块位置信息注入模型以增强局部重建与全局场景的一致性,并设计多块联合调制机制确保各独立增强块之间的视觉一致性,从而在512×512基础模型上实现高效且高质量的4K VSR。

链接: https://arxiv.org/abs/2509.26025
作者: Shian Du,Menghan Xia,Chang Liu,Xintao Wang,Jing Wang,Pengfei Wan,Di Zhang,Xiangyang Ji
机构: Tsinghua University (清华大学); Kling Team, Kuaishou Technology (快手科技); Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025

点击查看摘要

Abstract:Pre-trained video generation models hold great potential for generative video super-resolution (VSR). However, adapting them for full-size VSR, as most existing methods do, suffers from unnecessary intensive full-attention computation and fixed output resolution. To overcome these limitations, we make the first exploration into utilizing video diffusion priors for patch-wise VSR. This is non-trivial because pre-trained video diffusion models are not native for patch-level detail generation. To mitigate this challenge, we propose an innovative approach, called PatchVSR, which integrates a dual-stream adapter for conditional guidance. The patch branch extracts features from input patches to maintain content fidelity while the global branch extracts context features from the resized full video to bridge the generation gap caused by incomplete semantics of patches. Particularly, we also inject the patch’s location information into the model to better contextualize patch synthesis within the global video frame. Experiments demonstrate that our method can synthesize high-fidelity, high-resolution details at the patch level. A tailor-made multi-patch joint modulation is proposed to ensure visual consistency across individually enhanced patches. Due to the flexibility of our patch-based paradigm, we can achieve highly competitive 4K VSR based on a 512x512 resolution base model, with extremely high efficiency.
zh

[CV-65] GeoLink: Empowering Remote Sensing Foundation Model with OpenStreetMap Data NEURIPS2025

链接: https://arxiv.org/abs/2509.26016
作者: Lubian Bai,Xiuyuan Zhang,Siqi Zhang,Zepeng Zhang,Haoyu Wang,Wei Qin,Shihong Du
机构: Peking University (北京大学); Institute of Automation, CAS (中国科学院自动化研究所); École Polytechnique Fédérale de Lausanne (洛桑联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2025

点击查看摘要

[CV-66] SETR: A Two-Stage Semantic-Enhanced Framework for Zero-Shot Composed Image Retrieval

【速读】:该论文旨在解决零样本组合图像检索(Zero-shot Composed Image Retrieval, ZS-CIR)中的两个核心问题:一是基于CLIP的方法采用并集驱动的特征融合策略, indiscriminately(无差别地)聚合所有视觉线索,导致无关背景信息混入,稀释目标修改语义;二是CLIP嵌入的全局余弦相似度无法解析细粒度语义关系。解决方案的关键在于提出一种两阶段语义增强检索框架(Semantic-enhanced Two-Stage Retrieval, SETR):第一阶段通过交集驱动策略保留参考图像与相对文本间的重叠语义,过滤掉并集融合带来的干扰,生成高精度候选集;第二阶段利用低秩适配(Low-Rank Adaptation)微调预训练多模态大语言模型(Multimodal Large Language Model, MLLM),进行二元语义相关性判断(“是/否”),显式验证属性与关系层面的一致性,从而超越CLIP的全局匹配能力。两阶段形成互补:粗筛保证高召回率,重排序确保细粒度对齐,显著提升检索性能,在CIRR、Fashion-IQ和CIRCO数据集上达到新SOTA。

链接: https://arxiv.org/abs/2509.26012
作者: Yuqi Xiao,Yingying Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Zero-shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image given a reference image and a relative text, without relying on costly triplet annotations. Existing CLIP-based methods face two core challenges: (1) union-based feature fusion indiscriminately aggregates all visual cues, carrying over irrelevant background details that dilute the intended modification, and (2) global cosine similarity from CLIP embeddings lacks the ability to resolve fine-grained semantic relations. To address these issues, we propose SETR (Semantic-enhanced Two-Stage Retrieval). In the coarse retrieval stage, SETR introduces an intersection-driven strategy that retains only the overlapping semantics between the reference image and relative text, thereby filtering out distractors inherent to union-based fusion and producing a cleaner, high-precision candidate set. In the fine-grained re-ranking stage, we adapt a pretrained multimodal LLM with Low-Rank Adaptation to conduct binary semantic relevance judgments (“Yes/No”), which goes beyond CLIP’s global feature matching by explicitly verifying relational and attribute-level consistency. Together, these two stages form a complementary pipeline: coarse retrieval narrows the candidate pool with high recall, while re-ranking ensures precise alignment with nuanced textual modifications. Experiments on CIRR, Fashion-IQ, and CIRCO show that SETR achieves new state-of-the-art performance, improving Recall@1 on CIRR by up to 15.15 points. Our results establish two-stage reasoning as a general paradigm for robust and portable ZS-CIR.
zh
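下面给出对“交集驱动”粗检索的一种简化解读:候选图像需同时与参考图像和相对文本相似(取两者相似度中的较小值)才能得到高分,从而过滤并集式融合带来的干扰;仅为示意,非 SETR 官方实现:

```python
import torch
import torch.nn.functional as F

# 示意代码:交集式相似度融合的粗检索。
def coarse_retrieve(ref_img_emb: torch.Tensor, text_emb: torch.Tensor,
                    gallery_embs: torch.Tensor, top_k: int = 50):
    """ref_img_emb/text_emb: (D,);gallery_embs: (N, D)。返回 top_k 候选下标。"""
    g = F.normalize(gallery_embs, dim=-1)
    sim_img = g @ F.normalize(ref_img_emb, dim=-1)     # (N,) 与参考图像的相似度
    sim_txt = g @ F.normalize(text_emb, dim=-1)        # (N,) 与相对文本的相似度
    inter_score = torch.minimum(sim_img, sim_txt)      # 交集式融合:两者都高才高
    return torch.topk(inter_score, k=min(top_k, g.shape[0])).indices

if __name__ == "__main__":
    idx = coarse_retrieve(torch.randn(512), torch.randn(512), torch.randn(1000, 512))
    print(idx[:5])
```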

[CV-67] New Fourth-Order Grayscale Indicator-Based Telegraph Diffusion Model for Image Despeckling

链接: https://arxiv.org/abs/2509.26010
作者: Rajendra K. Ray,Manish Kumar
机构: Indian Institute of Technology Mandi (印度理工学院曼迪分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-68] PFDepth: Heterogeneous Pinhole-Fisheye Joint Depth Estimation via Distortion-aware Gaussian-Splatted Volumetric Fusion ACM-MM2025

【速读】:该论文旨在解决异构多视角深度估计问题,即如何有效融合具有不同成像特性的相机(如针孔相机与鱼眼相机)的图像信息以提升深度估计精度。其核心挑战在于处理两类相机在畸变特性(undistorted vs. distorted)、视场角(FOV, Field of View)大小(small vs. large)以及远近景范围(far vs. near field)上的显著差异。解决方案的关键在于提出PFDepth框架,该框架通过统一架构处理任意组合的针孔与鱼眼相机输入,并创新性地将每张图像的2D特征显式映射至一个规范化的3D体素空间;进一步设计了异构空间融合模块(Heterogeneous Spatial Fusion),用于在重叠与非重叠区域中融合带畸变感知能力的体素特征;同时将传统体素融合重构为可学习的3D高斯表示,利用可学习的潜在高斯球动态适应局部纹理,实现更精细的3D聚合。此方法首次系统性地实现了异构针孔-鱼眼深度估计,显著优于当前主流网络,在KITTI-360和RealHet数据集上达到最先进性能。

链接: https://arxiv.org/abs/2509.26008
作者: Zhiwei Zhang,Ruikai Xu,Weijian Zhang,Zhizhong Zhang,Xin Tan,Jingyu Gong,Yuan Xie,Lizhuang Ma
机构: Shanghai Jiao Tong University (上海交通大学); East China Normal University (华东师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Geometry (cs.CG)
备注: Accepted by ACM MM 2025 Conference

点击查看摘要

Abstract:In this paper, we present the first pinhole-fisheye framework for heterogeneous multi-view depth estimation, PFDepth. Our key insight is to exploit the complementary characteristics of pinhole and fisheye imagery (undistorted vs. distorted, small vs. large FOV, far vs. near field) for joint optimization. PFDepth employs a unified architecture capable of processing arbitrary combinations of pinhole and fisheye cameras with varied intrinsics and extrinsics. Within PFDepth, we first explicitly lift 2D features from each heterogeneous view into a canonical 3D volumetric space. Then, a core module termed Heterogeneous Spatial Fusion is designed to process and fuse distortion-aware volumetric features across overlapping and non-overlapping regions. Additionally, we subtly reformulate the conventional voxel fusion into a novel 3D Gaussian representation, in which learnable latent Gaussian spheres dynamically adapt to local image textures for finer 3D aggregation. Finally, fused volume features are rendered into multi-view depth maps. Through extensive experiments, we demonstrate that PFDepth sets a state-of-the-art performance on KITTI-360 and RealHet datasets over current mainstream depth networks. To the best of our knowledge, this is the first systematic study of heterogeneous pinhole-fisheye depth estimation, offering both technical novelty and valuable empirical insights.
zh

[CV-69] AgenticIQA: An Agentic Framework for Adaptive and Interpretable Image Quality Assessment

【速读】:该论文旨在解决传统图像质量评估(Image Quality Assessment, IQA)方法在适应多样化失真类型、用户特定查询及可解释性需求方面的局限性,以及评分与解释过程割裂导致的性能瓶颈问题。其解决方案的关键在于提出AgenticIQA框架,该框架通过引入模块化智能体架构,将IQA任务分解为失真检测、失真分析、工具选择与执行四个子任务,并由规划器(planner)、执行器(executor)和总结器(summarizer)协同完成;其中,视觉语言模型(Vision-Language Models, VLMs)与传统IQA工具动态集成,使系统能够根据用户查询自适应地生成高精度评分并附带人类对齐的解释,从而实现评分与解释的一体化建模与优化。

链接: https://arxiv.org/abs/2509.26006
作者: Hanwei Zhu,Yu Tian,Keyan Ding,Baoliang Chen,Bolin Chen,Shiqi Wang,Weisi Lin
机构: Nanyang Technological University (南洋理工大学); Nanjing University of Information Science and Technology (南京信息工程大学); Zhejiang University (浙江大学); South China Normal University (华南师范大学); Alibaba DAMO Academy (阿里巴巴达摩院); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image quality assessment (IQA) is inherently complex, as it reflects both the quantification and interpretation of perceptual quality rooted in the human visual system. Conventional approaches typically rely on fixed models to output scalar scores, limiting their adaptability to diverse distortions, user-specific queries, and interpretability needs. Furthermore, scoring and interpretation are often treated as independent processes, despite their interdependence: interpretation identifies perceptual degradations, while scoring abstracts them into a compact metric. To address these limitations, we propose AgenticIQA, a modular agentic framework that integrates vision-language models (VLMs) with traditional IQA tools in a dynamic, query-aware manner. AgenticIQA decomposes IQA into four subtasks – distortion detection, distortion analysis, tool selection, and tool execution – coordinated by a planner, executor, and summarizer. The planner formulates task-specific strategies, the executor collects perceptual evidence via tool invocation, and the summarizer integrates this evidence to produce accurate scores with human-aligned explanations. To support training and evaluation, we introduce AgenticIQA-200K, a large-scale instruction dataset tailored for IQA agents, and AgenticIQA-Eval, the first benchmark for assessing the planning, execution, and summarization capabilities of VLM-based IQA agents. Extensive experiments across diverse IQA datasets demonstrate that AgenticIQA consistently surpasses strong baselines in both scoring accuracy and explanatory alignment.
zh

[CV-70] Learning Egocentric In-Hand Object Segmentation through Weak Supervision from Human Narrations

【速读】:该论文旨在解决基于第一人称视角图像(egocentric images)中用户操作物体的像素级识别问题,其核心挑战在于现有方法严重依赖昂贵的手动标注数据,导致可用标注数据稀缺。解决方案的关键在于引入自然语言叙述(narrations)作为弱监督信号,通过学习人类对自身动作的描述(如“我正将蔬菜从切菜板倒入锅中”)来推断手部操作的物体,并据此训练模型实现无像素级标注的物体分割。论文提出了一种名为WISH(Weakly-Supervised In-hand Object Segmentation from Human Narrations)的端到端模型,能够从叙述中提取手-物关联知识,在测试阶段无需使用叙述即可完成精准的在手物体分割(in-hand object segmentation)。实验表明,WISH在EPIC-Kitchens和Ego4D数据集上超越多个基线方法,性能达到全监督方法的50%以上。

链接: https://arxiv.org/abs/2509.26004
作者: Nicola Messina,Rosario Leonardi,Luca Ciampi,Fabio Carrara,Giovanni Maria Farinella,Fabrizio Falchi,Antonino Furnari
机构: CNR-ISTI(意大利国家研究委员会信息科学与技术研究所); University of Catania(卡塔尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pixel-level recognition of objects manipulated by the user from egocentric images enables key applications spanning assistive technologies, industrial safety, and activity monitoring. However, progress in this area is currently hindered by the scarcity of annotated datasets, as existing approaches rely on costly manual labels. In this paper, we propose to learn human-object interaction detection leveraging narrations – natural language descriptions of the actions performed by the camera wearer which contain clues about manipulated objects (e.g., “I am pouring vegetables from the chopping board to the pan”). Narrations provide a form of weak supervision that is cheap to acquire and readily available in state-of-the-art egocentric datasets. We introduce Narration-Supervised in-Hand Object Segmentation (NS-iHOS), a novel task where models have to learn to segment in-hand objects by learning from natural-language narrations. Narrations are then not employed at inference time. We showcase the potential of the task by proposing Weakly-Supervised In-hand Object Segmentation from Human Narrations (WISH), an end-to-end model distilling knowledge from narrations to learn plausible hand-object associations and enable in-hand object segmentation without using narrations at test time. We benchmark WISH against different baselines based on open-vocabulary object detectors and vision-language models, showing the superiority of its design. Experiments on EPIC-Kitchens and Ego4D show that WISH surpasses all baselines, recovering more than 50% of the performance of fully supervised methods, without employing fine-grained pixel-wise annotations.
zh

[CV-71] VRWKV-Editor: Reducing quadratic complexity in transformer-based video editing

【速读】:该论文旨在解决基于扩散模型的视频编辑方法在处理长时长、高分辨率视频时面临的计算复杂度高和内存占用大的问题,其核心挑战在于传统自注意力机制导致的时间和空间复杂度为二次方级增长,限制了模型在实时视频处理等实际场景中的应用。解决方案的关键在于提出VRWKV-Editor,该模型通过引入线性时空聚合模块,并结合RWKV Transformer中的双向加权键值循环机制(bidirectional weighted key-value recurrence mechanism),实现了对全局依赖关系的有效建模,同时保持时间一致性,从而将复杂度降低至线性级别,显著提升了效率并维持了高质量的帧一致性和文本对齐性能。

链接: https://arxiv.org/abs/2509.25998
作者: Abdelilah Aitrouga,Youssef Hmamouche,Amal El Fallah Seghrouchni
机构: International Artificial Intelligence Center of Morocco (摩洛哥国际人工智能中心); University Mohammed VI Polytechnic (穆罕默德六世工业大学); Sorbonne University, LIP6 - UMR 7606 CNRS, France (索邦大学,LIP6 - UMR 7606 CNRS,法国)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In light of recent progress in video editing, deep learning models focusing on both spatial and temporal dependencies have emerged as the primary method. However, these models suffer from the quadratic computational complexity of traditional attention mechanisms, making them difficult to adapt to long-duration and high-resolution videos. This limitation restricts their applicability in practical contexts such as real-time video processing. To tackle this challenge, we introduce a method to reduce both time and space complexity of these systems by proposing VRWKV-Editor, a novel video editing model that integrates a linear spatio-temporal aggregation module into video-based diffusion models. VRWKV-Editor leverages bidirectional weighted key-value recurrence mechanism of the RWKV transformer to capture global dependencies while preserving temporal coherence, achieving linear complexity without sacrificing quality. Extensive experiments demonstrate that the proposed method achieves up to 3.7x speedup and 60% lower memory usage compared to state-of-the-art diffusion-based video editing methods, while maintaining competitive performance in frame consistency and text alignment. Furthermore, a comparative analysis we conducted on videos with different sequence lengths confirms that the gap in editing speed between our approach and architectures with self-attention becomes more significant with long videos.
zh
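为说明“线性时空聚合”为何能把复杂度从序列长度的平方降到线性,下面给出一个最简的双向线性注意力示意(这不是 RWKV 的加权 key-value 循环公式本身,仅用于复杂度对比):

```python
import torch

# 示意代码:线性注意力先汇聚 K^T V 的 (D, D) 统计量,再与 Q 相乘,复杂度 O(N·D^2)。
def linear_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                     eps: float = 1e-6) -> torch.Tensor:
    """q, k, v: (B, N, D)。"""
    q, k = torch.relu(q) + eps, torch.relu(k) + eps     # 非负特征映射,保证归一化有效
    kv = torch.einsum("bnd,bne->bde", k, v)             # (B, D, D):只与 N 线性相关
    z = torch.einsum("bnd,bd->bn", q, k.sum(dim=1))     # 归一化因子
    return torch.einsum("bnd,bde->bne", q, kv) / z.unsqueeze(-1)

if __name__ == "__main__":
    B, N, D = 1, 4096, 64          # N 很大时,标准自注意力的 N^2 项会成为瓶颈
    q, k, v = (torch.randn(B, N, D) for _ in range(3))
    print(linear_attention(q, k, v).shape)   # torch.Size([1, 4096, 64])
```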

[CV-72] Towards Reliable and Holistic Visual In-Context Learning Prompt Selection NEURIPS2025

【速读】:该论文旨在解决视觉上下文学习(Visual In-Context Learning, VICL)中因依赖“相似性优先假设”而导致的在上下文示例选择不准确,以及现有方法如Partial2Global因随机采样导致比较覆盖不全和冗余的问题。其解决方案的关键在于提出一种增强型方法RH-Partial2Global,通过引入基于jackknife置信预测引导的可靠替代集构建策略与基于覆盖设计的采样方法,实现对成对偏好关系的全面且均匀覆盖,从而提升全局排序的可靠性与有效性。

链接: https://arxiv.org/abs/2509.25989
作者: Wenxiao Wu,Jing-Hao Xue,Chengming Xu,Chen Liu,Xinwei Sun,Changxin Gao,Nong Sang,Yanwei Fu
机构: Huazhong University of Science and Technology (华中科技大学); Shanghai Innovation Institute (上海创新研究院); University College London (伦敦大学学院); Tencent Youtu Lab (腾讯优图实验室); The Hong Kong University of Science and Technology (香港科技大学); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2025

点击查看摘要

Abstract:Visual In-Context Learning (VICL) has emerged as a prominent approach for adapting visual foundation models to novel tasks, by effectively exploiting contextual information embedded in in-context examples, which can be formulated as a global ranking problem of potential candidates. Current VICL methods, such as Partial2Global and VPR, are grounded in the similarity-priority assumption that images more visually similar to a query image serve as better in-context examples. This foundational assumption, while intuitive, lacks sufficient justification for its efficacy in selecting optimal in-context examples. Furthermore, Partial2Global constructs its global ranking from a series of randomly sampled pairwise preference predictions. Such a reliance on random sampling can lead to incomplete coverage and redundant samplings of comparisons, thus further adversely impacting the final global ranking. To address these issues, this paper introduces an enhanced variant of Partial2Global designed for reliable and holistic selection of in-context examples in VICL. Our proposed method, dubbed RH-Partial2Global, leverages a jackknife conformal prediction-guided strategy to construct reliable alternative sets and a covering design-based sampling approach to ensure comprehensive and uniform coverage of pairwise preferences. Extensive experiments demonstrate that RH-Partial2Global achieves excellent performance and outperforms Partial2Global across diverse visual tasks.
zh

[CV-73] PinPoint3D: Fine-Grained 3D Part Segmentation from a Few Clicks

【速读】:该论文旨在解决当前3D点云分割方法在细粒度(fine-grained)和多粒度(multi-granularity)场景下存在的两大问题:一是现有交互式分割方法主要局限于粗粒度的实例级目标,难以实现部件级别的精确分割;二是非交互式方法在稀疏的真实世界扫描数据中表现不佳,且严重缺乏标注数据。其解决方案的关键在于提出了一种名为PinPoint3D的新颖交互式框架,该框架能够仅通过少量用户点击即可生成精确的部件级掩码,并结合一个创新的3D数据合成管道,构建大规模、场景级且带有密集部件标注的数据集,从而突破了制约该领域发展的关键瓶颈。实验表明,该方法在首次点击条件下平均IoU达到约55.8%,后续增加少量点击后可提升至71.3%以上,相较当前最先进基线方法在IoU和精度上最高提升达16%,展现出在复杂稀疏点云上的高效率与强鲁棒性。

链接: https://arxiv.org/abs/2509.25970
作者: Bojun Zhang,Hangjian Ye,Hao Zheng,Jianzheng Huang,Zhengyu Lin,Zhenhong Guo,Feng Zheng
机构: Southern University of Science and Technology (南方科技大学); Spatialtemporal AI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 12 figures, conference

点击查看摘要

Abstract:Fine-grained 3D part segmentation is crucial for enabling embodied AI systems to perform complex manipulation tasks, such as interacting with specific functional components of an object. However, existing interactive segmentation methods are largely confined to coarse, instance-level targets, while non-interactive approaches struggle with sparse, real-world scans and suffer from a severe lack of annotated data. To address these limitations, we introduce PinPoint3D, a novel interactive framework for fine-grained, multi-granularity 3D segmentation, capable of generating precise part-level masks from only a few user point clicks. A key component of our work is a new 3D data synthesis pipeline that we developed to create a large-scale, scene-level dataset with dense part annotations, overcoming a critical bottleneck that has hindered progress in this field. Through comprehensive experiments and user studies, we demonstrate that our method significantly outperforms existing approaches, achieving an average IoU of around 55.8% on each object part under first-click settings and surpassing 71.3% IoU with only a few additional clicks. Compared to current state-of-the-art baselines, PinPoint3D yields up to a 16% improvement in IoU and precision, highlighting its effectiveness on challenging, sparse point clouds with high efficiency. Our work represents a significant step towards more nuanced and precise machine perception and interaction in complex 3D environments.
zh

[CV-74] A Multi-purpose Tracking Framework for Salmon Welfare Monitoring in Challenging Environments ICCV2025

【速读】:该论文旨在解决工业网箱养殖中鲑鱼福利监测的自动化与精准化问题,当前基于计算机视觉(Computer Vision, CV)的方法通常仅针对单一福利指标进行计算,依赖其他应用场景的对象检测器和追踪器,导致资源消耗高且在水下复杂场景中易受遮挡、外观相似及运动相似等干扰。解决方案的关键在于提出一种灵活的追踪框架,利用姿态估计网络提取鲑鱼及其身体部位的边界框,并通过专门设计的模块利用身体部位信息以应对水下场景中的挑战;进而基于高精度的身体部位轨迹来计算多种福利指标,从而实现更鲁棒、高效的自动化鲑鱼福利监测。

链接: https://arxiv.org/abs/2509.25969
作者: Espen Uri Høgstedt,Christian Schellewald,Annette Stahl,Rudolf Mester
机构: Norwegian University of Science and Technology (挪威科技大学); SINTEF Ocean (SINTEF海洋)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the Joint Workshop on Marine Vision 2025 (CVAUI AAMVEM), held in conjunction with ICCV 2025

点击查看摘要

Abstract:Computer Vision (CV)-based continuous, automated and precise salmon welfare monitoring is a key step toward reduced salmon mortality and improved salmon welfare in industrial aquaculture net pens. Available CV methods for determining welfare indicators focus on single indicators and rely on object detectors and trackers from other application areas to aid their welfare indicator calculation algorithm. This comes with a high resource demand for real-world applications, since each indicator must be calculated separately. In addition, the methods are vulnerable to difficulties in underwater salmon scenes, such as object occlusion, similar object appearance, and similar object motion. To address these challenges, we propose a flexible tracking framework that uses a pose estimation network to extract bounding boxes around salmon and their corresponding body parts, and exploits information about the body parts, through specialized modules, to tackle challenges specific to underwater salmon scenes. Subsequently, the high-detail body part tracks are employed to calculate welfare indicators. We construct two novel datasets assessing two salmon tracking challenges: salmon ID transfers in crowded scenes and salmon ID switches during turning. Our method outperforms the current state-of-the-art pedestrian tracker, BoostTrack, for both salmon tracking challenges. Additionally, we create a dataset for calculating salmon tail beat wavelength, demonstrating that our body part tracking method is well-suited for automated welfare monitoring based on tail beat analysis. Datasets and code are available at this https URL.
zh

[CV-75] Self-Supervised Anatomical Consistency Learning for Vision-Grounded Medical Report Generation

【速读】:该论文旨在解决视觉引导的医学报告生成任务中,现有方法依赖于大量专家标注的检测模块所导致的高标注成本及跨数据集病理分布偏差带来的泛化能力受限问题。其核心解决方案是提出一种无监督的解剖一致性学习框架(Self-Supervised Anatomical Consistency Learning, SS-ACL),关键在于构建一个基于人体解剖结构层级包含关系的解剖图谱,并通过文本提示驱动的递归细粒度区域重建与区域级对比学习,实现样本内空间对齐和样本间语义对齐,从而在无需专家标注的前提下,使生成报告具备更强的视觉可解释性与临床准确性。

链接: https://arxiv.org/abs/2509.25963
作者: Longzhen Yang,Zhangkai Ni,Ying Wen,Yihang Liu,Lianghua He,Heng Tao Shen
机构: Tongji University (同济大学); East China Normal University (华东师范大学); Shanghai Eye Disease Prevention and Treatment Center (上海眼病防治中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-grounded medical report generation aims to produce clinically accurate descriptions of medical images, anchored in explicit visual evidence to improve interpretability and facilitate integration into clinical workflows. However, existing methods often rely on separately trained detection modules that require extensive expert annotations, introducing high labeling costs and limiting generalizability due to pathology distribution bias across datasets. To address these challenges, we propose Self-Supervised Anatomical Consistency Learning (SS-ACL) – a novel and annotation-free framework that aligns generated reports with corresponding anatomical regions using simple textual prompts. SS-ACL constructs a hierarchical anatomical graph inspired by the invariant top-down inclusion structure of human anatomy, organizing entities by spatial location. It recursively reconstructs fine-grained anatomical regions to enforce intra-sample spatial alignment, inherently guiding attention maps toward visually relevant areas prompted by text. To further enhance inter-sample semantic alignment for abnormality recognition, SS-ACL introduces a region-level contrastive learning based on anatomical consistency. These aligned embeddings serve as priors for report generation, enabling attention maps to provide interpretable visual evidence. Extensive experiments demonstrate that SS-ACL, without relying on expert annotations, (i) generates accurate and visually grounded reports – outperforming state-of-the-art methods by 10% in lexical accuracy and 25% in clinical efficacy, and (ii) achieves competitive performance on various downstream visual tasks, surpassing current leading visual foundation models by 8% in zero-shot visual grounding.
zh

[CV-76] CO3: Contrasting Concepts Compose Better

【速读】:该论文旨在解决文本到图像扩散模型中多概念提示(multi-concept prompt)的保真度问题,即当输入提示包含多个概念(如“一只猫和一只狗”)时,模型常出现某一概念缺失、模糊或与其他概念发生视觉冲突的现象。作者认为这一问题源于扩散模型在采样过程中漂移到混合模式(mixed modes),其中单一概念因训练时学习强度高而被过度强调。解决方案的关键在于提出一种无需重新训练的轻量级纠正采样策略——CO3,其通过引导采样路径避开与任一单独概念高度重叠的区域,从而聚焦于“纯净”的联合模式(pure joint modes),确保所有概念以均衡的视觉存在共存。此外,论文还识别出现有多概念引导机制在不稳定的权重区间内会加剧不平衡,并通过调整采样策略维持在稳定区域内,显著提升了概念覆盖度、平衡性和鲁棒性。

链接: https://arxiv.org/abs/2509.25940
作者: Debottam Dutta,Jianchong Chen,Rajalaxmi Rajagopalan,Yu-Lin Wei,Romit Roy Choudhury
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We propose to improve multi-concept prompt fidelity in text-to-image diffusion models. We begin with common failure cases-prompts like “a cat and a dog” that sometimes yields images where one concept is missing, faint, or colliding awkwardly with another. We hypothesize that this happens when the diffusion model drifts into mixed modes that over-emphasize a single concept it learned strongly during training. Instead of re-training, we introduce a corrective sampling strategy that steers away from regions where the joint prompt behavior overlaps too strongly with any single concept in the prompt. The goal is to steer towards “pure” joint modes where all concepts can coexist with balanced visual presence. We further show that existing multi-concept guidance schemes can operate in unstable weight regimes that amplify imbalance; we characterize favorable regions and adapt sampling to remain within them. Our approach, CO3, is plug-and-play, requires no model tuning, and complements standard classifier-free guidance. Experiments on diverse multi-concept prompts indicate improvements in concept coverage, balance and robustness, with fewer dropped or distorted concepts compared to standard baselines and prior compositional methods. Results suggest that lightweight corrective guidance can substantially mitigate brittle semantic alignment behavior in modern diffusion systems.
zh
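下面是按摘要思想写的一个“纠正式”多概念引导示意:在 classifier-free guidance 的基础上,削弱联合方向中与任一单概念过度重合的分量。公式与系数均为笔者假设,并非 CO3 的官方推导:

```python
import torch

# 示意代码:纠正式多概念引导(简化版),避免采样漂向只强调单一概念的混合模式。
def corrective_guidance(eps_uncond: torch.Tensor, eps_joint: torch.Tensor,
                        eps_singles: list, w: float = 7.5, lam: float = 0.3) -> torch.Tensor:
    d_joint = eps_joint - eps_uncond
    correction = torch.zeros_like(d_joint)
    for eps_c in eps_singles:                      # 各单一概念(如“只有猫”“只有狗”)的预测
        d_c = eps_c - eps_uncond
        # d_joint 在 d_c 方向上的投影系数:重合越强,说明越偏向该单一概念
        proj = (d_joint * d_c).sum() / (d_c.norm() ** 2 + 1e-8)
        correction = correction + proj.clamp(min=0.0) * d_c
    correction = correction / max(len(eps_singles), 1)
    return eps_uncond + w * (d_joint - lam * correction)

if __name__ == "__main__":
    shape = (1, 4, 64, 64)                          # 潜空间噪声预测
    eps_u, eps_j = torch.randn(shape), torch.randn(shape)
    eps_cat, eps_dog = torch.randn(shape), torch.randn(shape)
    print(corrective_guidance(eps_u, eps_j, [eps_cat, eps_dog]).shape)
```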

[CV-77] UniMMAD: Unified Multi-Modal and Multi-Class Anomaly Detection via MoE-Driven Feature Decompression

【速读】:该论文旨在解决现有异常检测(Anomaly Detection, AD)方法在多模态与多类场景下存在的两个核心问题:一是传统方法将模态(modality)与类别(class)视为独立因素,导致解决方案碎片化且内存开销高;二是基于重建的多类方法通常采用共享解码路径,难以应对跨域大差异,从而引发正常边界扭曲、域间干扰和误报率高等问题。其解决方案的关键在于提出UniMMAD框架,该框架引入基于专家混合(Mixture-of-Experts, MoE)的特征解压缩机制,实现针对特定域的自适应与解耦重建,遵循“从一般到具体”的范式:编码阶段通过特征压缩模块生成紧凑通用特征,抑制潜在异常并促进跨模态交互;解码阶段利用稀疏门控交叉MoE动态选择专家路径,将通用特征重构为模态特异性和类别特异性形式,同时结合分组动态滤波机制与MoE-in-MoE结构,在减少75%参数量的同时保持稀疏激活与快速推理能力,显著提升多模态多类异常检测性能。

链接: https://arxiv.org/abs/2509.25934
作者: Yuan Zhao,Youwei Pang,Lihe Zhang,Hanqi Liu,Jiaming Zuo,Huchuan Lu,Xiaoqi Zhao
机构: IIAU-Lab, Dalian University of Technology (大连理工大学); X3000 Inspection Co., Ltd (X3000检测有限公司); Nanyang Technological University (南洋理工大学); Yale University (耶鲁大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: manuscript

点击查看摘要

Abstract:Existing anomaly detection (AD) methods often treat the modality and class as independent factors. Although this paradigm has enriched the development of AD research branches and produced many specialized models, it has also led to fragmented solutions and excessive memory overhead. Moreover, reconstruction-based multi-class approaches typically rely on shared decoding paths, which struggle to handle large variations across domains, resulting in distorted normality boundaries, domain interference, and high false alarm rates. To address these limitations, we propose UniMMAD, a unified framework for multi-modal and multi-class anomaly detection. At the core of UniMMAD is a Mixture-of-Experts (MoE)-driven feature decompression mechanism, which enables adaptive and disentangled reconstruction tailored to specific domains. This process is guided by a "general to specific" paradigm. In the encoding stage, multi-modal inputs of varying combinations are compressed into compact, general-purpose features. The encoder incorporates a feature compression module to suppress latent anomalies, encourage cross-modal interaction, and avoid shortcut learning. In the decoding stage, the general features are decompressed into modality-specific and class-specific forms via a sparsely-gated cross MoE, which dynamically selects expert pathways based on input modality and class. To further improve efficiency, we design a grouped dynamic filtering mechanism and a MoE-in-MoE structure, reducing parameter usage by 75% while maintaining sparse activation and fast inference. UniMMAD achieves state-of-the-art performance on 9 anomaly detection datasets, spanning 3 fields, 12 modalities, and 66 classes. The source code will be available at this https URL.
zh
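摘要中“稀疏门控、动态选择专家路径”的思路可以用一个标准的 top-k 稀疏 MoE 层来示意(与 UniMMAD 的 cross MoE / MoE-in-MoE 具体结构并不对应,仅说明稀疏激活的机制):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# 示意代码:标准的稀疏门控 MoE 层,每个 token 只激活 top_k 个专家。
class SparseMoE(nn.Module):
    def __init__(self, dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))
             for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (B, N, D),逐 token 路由到 top_k 个专家并按门控权重加权求和。"""
        logits = self.gate(x)                                   # (B, N, E)
        weights, idx = torch.topk(logits, self.top_k, dim=-1)   # 只保留 top-k 专家
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e                      # 被路由到专家 e 的 token
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

if __name__ == "__main__":
    moe = SparseMoE(dim=128)
    print(moe(torch.randn(2, 16, 128)).shape)    # torch.Size([2, 16, 128])
```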

[CV-78] From MNIST to ImageNet: Understanding the Scalability Boundaries of Differentiable Logic Gate Networks

链接: https://arxiv.org/abs/2509.25933
作者: Sven Brändle,Till Aczel,Andreas Plesner,Roger Wattenhofer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-79] The Impact of Scaling Training Data on Adversarial Robustness NEURIPS2025

【速读】:该论文旨在解决深度神经网络在面对对抗样本时仍存在脆弱性的问题,尤其是在不同训练数据规模、模型架构和学习范式下,如何系统性地理解并提升模型的对抗鲁棒性。其解决方案的关键在于通过大规模实验验证了对抗鲁棒性遵循对数尺度定律(logarithmic scaling law),即数据量和模型规模的增加均能提升鲁棒性,但数据质量与训练目标(如自监督学习中的数据筛选)比单纯扩大规模更为关键;例如,DINOv2等在高质量数据上训练的自监督模型表现优于在海量但低质数据上训练的模型,揭示了“数据质量 > 数据规模”的核心规律。

链接: https://arxiv.org/abs/2509.25927
作者: Marco Zimmerli,Andreas Plesner,Till Aczel,Roger Wattenhofer
机构: ETH-Zurich (苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: Accepted at the workshop Reliable ML from Unreliable Data at NeurIPS 2025

点击查看摘要

Abstract:Deep neural networks remain vulnerable to adversarial examples despite advances in architectures and training paradigms. We investigate how training data characteristics affect adversarial robustness across 36 state-of-the-art vision models spanning supervised, self-supervised, and contrastive learning approaches, trained on datasets from 1.2M to 22B images. Models were evaluated under six black-box attack categories: random perturbations, two types of geometric masks, COCO object manipulations, ImageNet-C corruptions, and ImageNet-R style shifts. Robustness follows a logarithmic scaling law with both data volume and model size: a tenfold increase in data reduces attack success rate (ASR) on average by ~3.2%, whereas a tenfold increase in model size reduces ASR on average by ~13.4%. Notably, some self-supervised models trained on curated datasets, such as DINOv2, outperform others trained on much larger but less curated datasets, challenging the assumption that scale alone drives robustness. Adversarial fine-tuning of ResNet50s improves generalization across structural variations but not across color distributions. Human evaluation reveals persistent gaps between human and machine vision. These results show that while scaling improves robustness, data quality, architecture, and training objectives play a more decisive role than raw scale in achieving broad-spectrum adversarial resilience.
zh
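利用摘要给出的平均斜率(数据量每扩大 10 倍,攻击成功率 ASR 约降 3.2%),可以做一个简单的对数尺度外推,示意如下(仅为数量级估算,非论文中的拟合代码):

```python
import numpy as np

# 示意代码:对数尺度律 ASR(D) ≈ ASR(D0) - slope * log10(D / D0),单位为百分点。
def predicted_asr(asr0: float, data0: float, data1: float,
                  slope_per_decade: float = 3.2) -> float:
    return asr0 - slope_per_decade * np.log10(data1 / data0)

if __name__ == "__main__":
    # 从 1.2M 张图外推到 22B 张图(摘要中数据规模的两端)
    decades = np.log10(22e9 / 1.2e6)
    print(f"数据量跨越约 {decades:.2f} 个数量级")
    print(f"预计 ASR 平均下降约 {3.2 * decades:.1f} 个百分点")
    print(f"若初始 ASR 为 60%,外推后约为 {predicted_asr(60.0, 1.2e6, 22e9):.1f}%")
```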

[CV-80] LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models

【速读】:该论文旨在解决多模态多轮对话(Multimodal Multi-Turn, MMT)场景下的安全风险问题,这类风险在单轮或单模态内容审核中难以被识别,因为恶意意图可能分散于多个对话轮次和图像中,且上下文相关的回复仍可能传播有害内容。解决方案的关键在于提出首个系统性的MMT对话安全定义与评估框架,并构建了包含4,484个细粒度标注样本的Multimodal Multi-turn Dialogue Safety (MMDS)数据集;同时开发基于蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)的自动化多模态多轮红队测试框架,用于生成具有挑战性的不安全对话样本。在此基础上,研究进一步提出了LLaVAShield模型,能够联合检测并评估用户输入与助手回复中的风险,在多种动态政策配置下均显著优于现有基线方法,实现了多模态多轮内容安全治理的新基准。

链接: https://arxiv.org/abs/2509.25896
作者: Guolei Huang,Qingzhi Peng,Gan Xu,Yuxuan Lu,Yongjun Shen
机构: Southeast University (东南大学); University of California, Santa Cruz (加州大学圣克鲁兹分校); RealAI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As Vision-Language Models (VLMs) move into interactive, multi-turn use, new safety risks arise that single-turn or single-modality moderation misses. In Multimodal Multi-Turn (MMT) dialogues, malicious intent can be spread across turns and images, while context-sensitive replies may still advance harmful content. To address this challenge, we present the first systematic definition and study of MMT dialogue safety. Building on this formulation, we introduce the Multimodal Multi-turn Dialogue Safety (MMDS) dataset. We further develop an automated multimodal multi-turn red-teaming framework based on Monte Carlo Tree Search (MCTS) to generate unsafe multimodal multi-turn dialogues for MMDS. MMDS contains 4,484 annotated multimodal dialogue samples with fine-grained safety ratings, policy dimension labels, and evidence-based rationales for both users and assistants. Leveraging MMDS, we present LLaVAShield, a powerful tool that jointly detects and assesses risk in user inputs and assistant responses. Across comprehensive experiments, LLaVAShield consistently outperforms strong baselines on MMT content moderation tasks and under dynamic policy configurations, establishing new state-of-the-art results. We will publicly release the dataset and model to support future research.
zh

[CV-81] DeepSketcher: Internalizing Visual Manipulation for Multimodal Reasoning

链接: https://arxiv.org/abs/2509.25866
作者: Chi Zhang,Haibo Qiu,Qiming Zhang,Zhixiong Zeng,Lin Ma,Jing Zhang
机构: Wuhan University (武汉大学); Meituan Inc (美团); The University of Sydney (悉尼大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-82] MAPLE: Multi-scale Attribute-enhanced Prompt Learning for Few-shot Whole Slide Image Classification

链接: https://arxiv.org/abs/2509.25863
作者: Junjie Zhou,Wei Shao,Yagao Yue,Wei Mu,Peng Wan,Qi Zhu,Daoqiang Zhang
机构: Nanjing University of Aeronautics and Astronautics (南京航空航天大学); Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-83] LiDAR Point Cloud Colourisation Using Multi-Camera Fusion and Low-Light Image Enhancement

【速读】:该论文旨在解决如何在低光照条件下实现高质量的激光雷达(LiDAR)点云色彩化问题,以提升三维场景理解的完整性与可用性。其解决方案的关键在于提出了一种硬件无关的融合方法,通过多摄像头输入与机械式LiDAR数据的协同处理,在不依赖专用标定目标的前提下自动完成LiDAR与相机间的几何配准,并集成低光图像增强模块,显著提升了在弱光环境下的色彩化效果与细节恢复能力,同时采用颜色校正策略确保多视角图像的一致性,最终实现了实时、鲁棒的360°全覆盖彩色点云生成。

链接: https://arxiv.org/abs/2509.25859
作者: Pasindu Ranasinghe,Dibyayan Patra,Bikram Banerjee,Simit Raval
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:In recent years, the fusion of camera data with LiDAR measurements has emerged as a powerful approach to enhance spatial understanding. This study introduces a novel, hardware-agnostic methodology that generates colourised point clouds from mechanical LiDAR using multiple camera inputs, providing complete 360-degree coverage. The primary innovation lies in its robustness under low-light conditions, achieved through the integration of a low-light image enhancement module within the fusion pipeline. The system requires initial calibration to determine intrinsic camera parameters, followed by automatic computation of the geometric transformation between the LiDAR and cameras, removing the need for specialised calibration targets and streamlining the setup. The data processing framework uses colour correction to ensure uniformity across camera feeds before fusion. The algorithm was tested using a Velodyne Puck Hi-Res LiDAR and a four-camera configuration. The optimised software achieved real-time performance and reliable colourisation even under very low illumination, successfully recovering scene details that would otherwise remain undetectable.
zh
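多相机点云着色的核心一步是把 LiDAR 点投影到相机图像上取色,下面给出一个基于针孔模型的 NumPy 示意(内外参为演示用的假设值,非论文官方实现):

```python
import numpy as np

# 示意代码:LiDAR 点云投影到相机图像并最近邻取色。
def colourise_points(points: np.ndarray, image: np.ndarray,
                     K: np.ndarray, T_cam_lidar: np.ndarray):
    """points: (N, 3) LiDAR 坐标;image: (H, W, 3)。返回 (点, 颜色),仅保留落入视野内的点。"""
    N = points.shape[0]
    pts_h = np.hstack([points, np.ones((N, 1))])              # 齐次坐标
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]                # 变换到相机坐标系
    in_front = pts_cam[:, 2] > 0.1                            # 只保留相机前方的点
    uv = (K @ pts_cam[in_front].T).T
    uv = uv[:, :2] / uv[:, 2:3]                               # 透视除法得到像素坐标
    H, W = image.shape[:2]
    u, v = np.round(uv[:, 0]).astype(int), np.round(uv[:, 1]).astype(int)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    colours = image[v[valid], u[valid]]                       # 最近邻取色
    return points[in_front][valid], colours

if __name__ == "__main__":
    K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
    T = np.eye(4)                                             # 演示:假设两坐标系重合
    pts = np.random.uniform([-2, -2, 1], [2, 2, 8], size=(1000, 3))
    img = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
    kept, cols = colourise_points(pts, img, K, T)
    print(kept.shape, cols.shape)
```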

[CV-84] Vector sketch animation generation with differentiable motion trajectories

链接: https://arxiv.org/abs/2509.25857
作者: Xinding Zhu,Xinye Yang,Shuyang Zheng,Zhexin Zhang,Fei Gao,Jing Huang,Jiazhou Chen
机构: Zhejiang University of Technology (浙江工业大学); Hangzhou Dianzi University (杭州电子科技大学); Zhejiang Gongshang University (浙江工商大学)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 12 figures

点击查看摘要

[CV-85] PatchEAD: Unifying Industrial Visual Prompting Frameworks for Patch-Exclusive Anomaly Detection

链接: https://arxiv.org/abs/2509.25856
作者: Po-Han Huang,Jeng-Lin Li,Po-Hsuan Huang,Ming-Ching Chang,Wei-Chao Chen
机构: Inventec Corporation (英业达集团); University at Albany, State University of New York (纽约州立大学阿尔巴尼分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures

点击查看摘要

[CV-86] MuSLR: Multimodal Symbolic Logical Reasoning NEURIPS2025

链接: https://arxiv.org/abs/2509.25851
作者: Jundong Xu,Hao Fei,Yuhui Zhang,Liangming Pan,Qijun Huang,Qian Liu,Preslav Nakov,Min-Yen Kan,William Yang Wang,Mong-Li Lee,Wynne Hsu
机构: National University of Singapore(新加坡国立大学); Stanford University(斯坦福大学); University of Arizona(亚利桑那大学); UniMelb(墨尔本大学); University of Auckland(奥克兰大学); MBZUAI(穆罕默德·本·扎耶德人工智能大学); University of California, Santa Barbara(加州大学圣塔芭芭拉分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2025

点击查看摘要

[CV-87] More Thought Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models

【速读】:该论文旨在解决多模态推理(multimodal reasoning)在视觉语言模型(Vision-Language Models, VLMs)中带来的“感知锚定退化”问题,即尽管增强逻辑推理能力可提升复杂任务表现,但长期依赖推理过程会导致模型逐渐忽视视觉输入,从而在基础视觉识别任务上出现性能下降。解决方案的关键在于提出一种名为视觉锚定策略优化(Vision-Anchored Policy Optimization, VAPO)的方法,通过显式引导推理路径向视觉信息高度依赖的方向演化,从而强化模型对视觉输入的感知锚定能力。实验表明,基于VAPO训练的VAPO-Thinker-7B模型显著提升了视觉信息利用强度,并在多个基准测试中达到新的最先进水平。

链接: https://arxiv.org/abs/2509.25848
作者: Xinyu Tian,Shu Zou,Zhaoyuan Yang,Mengqi He,Fabian Waschkowski,Lukas Wesemann,Peter Tu,Jing Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reasoning has emerged as a pivotal capability in Large Language Models (LLMs). Through Reinforcement Learning (RL), typically Group Relative Policy Optimization (GRPO), these models are able to solve complex tasks such as mathematics and code generation. Building on these advances, recent research has sought to extend reasoning to Vision-Language Models (VLMs), yielding promising results across diverse visual tasks. Despite this progress, our study uncovers the dual nature of multimodal reasoning: while it substantially enhances logical inference and facilitates performance on challenging problems, it may gradually impair perceptual grounding, leading to recognition failures on otherwise basic visual questions. Through further analysis, we attribute this phenomenon to visual forgetting, wherein prolonged reasoning causes the model to increasingly disregard visual input. To address this, we propose Vision-Anchored Policy Optimization (VAPO), a simple yet effective method that explicitly steers the reasoning process toward visually grounded trajectories. Our result model, VAPO-Thinker-7B, significantly strengthens the model’s reliance on visual information and achieves new state-of-the-art results on a wide range of established benchmarks. Project page: this https URL
zh

[CV-88] Training-Free Reward-Guided Image Editing via Trajectory Optimal Control

链接: https://arxiv.org/abs/2509.25845
作者: Jinho Chang,Jaemin Kim,Jong Chul Ye
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 18 pages, 5 figures

点击查看摘要

[CV-89] Overview of GeoLifeCLEF 2023: Species Composition Prediction with High Spatial Resolution at Continental Scale Using Remote Sensing

链接: https://arxiv.org/abs/2509.25816
作者: Christophe Botella,Benjamin Deneu,Diego Marcos,Maximilien Servajean,Theo Larcher,Cesar Leblanc,Joaquim Estopinan,Pierre Bonnet,Alexis Joly
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 7 figures, CLEF 2023 Conference and Labs of the Evaluation Forum, September 18 to 21, 2023, Thessaloniki, Greece

点击查看摘要

[CV-90] Logo-VGR: Visual Grounded Reasoning for Open-world Logo Recognition

链接: https://arxiv.org/abs/2509.25811
作者: Zichen Liang,Jingjing Fei,Jie Wang,Zheming Yang,Changqing Li,Pei Wu,Minghui Qiu,Fei Yang,Xialei Liu
机构: VCIP, CS, Nankai University (南开大学); ByteDance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

[CV-91] Adapting SAM with Dynamic Similarity Graphs for Few-Shot Parameter-Efficient Small Dense Object Detection: A Case Study of Chickpea Pods in Field Conditions

【速读】:该论文旨在解决基础模型在农业计算机视觉任务中进行参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)的难题,尤其是在训练数据稀缺和田间环境复杂的情况下,如何实现对小而密集目标(如鹰嘴豆荚)的精准前景分割与实例分割。解决方案的关键在于提出一种动态相似性图适配模块(Dynamic Similarity-based Graph Adaptation, DSGA),通过可学习的多项式衰减初始化权重排序机制构建动态相似性图,并结合自适应局部特征聚合策略,以极低的可训练参数量(仅4.00M,占原Segment Anything Model SAM的4.26%)建立鲁棒的空间与动态相似性表征。DSGA与低秩适配(Low-Rank Adaptation, LoRA)协同形成互补优化框架,在保持模型稳定性的同时有效捕捉图像嵌入中的局部与全局依赖关系,显著提升了少样本场景下的分割性能。

链接: https://arxiv.org/abs/2509.25805
作者: Xintong Jiang,Yixue Liu,Mohamed Debbagh,Yu Tian,Valerio Hoyos-Villegas,Viacheslav Adamchuk,Shangpeng Sun
机构: McGill University (麦吉尔大学); Michigan State University (密歇根州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 11 figures, 4 tables

点击查看摘要

Abstract:Parameter-Efficient Fine-Tuning (PEFT) of foundation models for agricultural computer vision tasks remains challenging due to limited training data and complex field conditions. This study introduces a Dynamic Similarity-based Graph Adaptation (DSGA) module to adapt the Segment Anything Model (SAM) under extreme data constraints for precise foreground and instance segmentation of small dense objects in complex agricultural environments. Through dynamic similarity graph construction with a learnable polynomial decay-initialized weight ranking mechanism and adaptive local feature aggregation, DSGA establishes robust spatial and dynamic similarity representation with only 4.00M trainable parameters, which is 4.26% of the original SAM. Integrating this graph-based feature adaptation with Low-Rank Adaptation (LoRA) creates a complementary optimization framework that effectively captures both local and global dependencies in image embeddings while preserving model stability and parameter efficiency. Experimental results on a challenging chickpea pod dataset demonstrated that DSGA with LoRA achieved superior performance across multiple metrics evaluated under 2, 4, 8 and 10 shots, with progressive performance gains as shot count increased. Quantitative metrics showed a 17.31% improvement in Structure-measure and a 62.36% gain in adaptive F-measure compared to the baseline SAM fine-tuning. Comprehensive ablation studies and visualization analyses through Grad-CAM and t-SNE validated the framework’s effectiveness in feature discrimination. The proposed adaptation demonstrated practical utility for automated agricultural monitoring applications, achieving accurate pod-counting with an adjusted R-squared of 0.8987 for images with 10 to 120 pods under challenging field conditions.
zh
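摘要中与 DSGA 配合使用的 LoRA 可以用一个标准的低秩旁路线性层来示意(秩 r、缩放系数等超参为假设值,与论文的具体配置无关):

```python
import torch
import torch.nn as nn

# 示意代码:标准 LoRA 线性层,冻结原权重,仅训练低秩增量 A、B。
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # 冻结预训练权重
            p.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 输出 = 冻结主干 + 低秩旁路(B 初始化为 0,不改变初始行为)
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

if __name__ == "__main__":
    layer = LoRALinear(nn.Linear(768, 768), r=8)
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    total = sum(p.numel() for p in layer.parameters())
    print(f"可训练参数占比: {trainable / total:.2%}")     # 远小于全量微调
    print(layer(torch.randn(4, 768)).shape)
```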

[CV-92] Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding

链接: https://arxiv.org/abs/2509.25794
作者: Haotian Xue,Yunhao Ge,Yu Zeng,Zhaoshuo Li,Ming-Yu Liu,Yongxin Chen,Jiaojiao Fan
机构: Georgia Tech (佐治亚理工学院); NVIDIA (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[CV-93] PUREVQ-GAN: Defending Data Poisoning Attacks through Vector-Quantized Bottlenecks

链接: https://arxiv.org/abs/2509.25792
作者: Alexander Branch,Omead Pooladzandi,Radin Khosraviani,Sunay Gajanan Bhat,Jeffrey Jiang,Gregory Pottie
机构: University of California, Los Angeles (加州大学洛杉矶分校); California Institute of Technology (加州理工学院)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-94] EchoingECG: An Electrocardiogram Cross-Modal Model for Echocardiogram Tasks MICCAI2025

链接: https://arxiv.org/abs/2509.25791
作者: Yuan Gao,Sangwook Kim,Chris McIntosh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: MICCAI 2025

点击查看摘要

[CV-95] Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking

链接: https://arxiv.org/abs/2509.25787
作者: Wen Wen,Tianwu Zhi,Kanglong Fan,Yang Li,Xinge Peng,Yabin Zhang,Yiting Liao,Junlin Li,Li Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review

点击查看摘要

[CV-96] Editable Noise Map Inversion: Encoding Target-image into Noise For High-Fidelity Image Manipulation ICML2025

链接: https://arxiv.org/abs/2509.25776
作者: Mingyu Kang,Yong Suk Choi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ICML 2025

点击查看摘要

[CV-97] PCPO: Proportionate Credit Policy Optimization for Aligning Image Generation Models

链接: https://arxiv.org/abs/2509.25774
作者: Jeongjae Lee,Jong Chul Ye
机构: KAIST (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 17 figures

点击查看摘要

[CV-98] Free Lunch Alignment of Text-to-Image Diffusion Models without Preference Image Pairs

【速读】:该论文旨在解决扩散模型驱动的文本到图像(Text-to-Image, T2I)生成中,如何实现文本与图像之间高质量对齐的问题。当前主流方法依赖强化学习与人类反馈(Reinforcement Learning with Human Feedback, RLHF),但这些方法通常需要大量成对的图像偏好数据或训练奖励函数,严重受限于人工标注成本和可扩展性。其解决方案的关键在于提出一种名为“文本偏好优化”(Text Preference Optimization, TPO)的新框架,该框架无需配对图像偏好数据,而是通过大型语言模型(Large Language Model, LLM)扰动生成不匹配的文本提示,并训练模型偏好原始匹配提示而非扰动后的不匹配提示,从而实现“免费午餐”式的对齐优化。此策略具备通用性,可无缝集成至现有基于偏好的算法(如DPO和KTO),实验表明其在多个基准测试中显著优于原始方法,在人类偏好评分和文本-图像一致性方面均有提升。

链接: https://arxiv.org/abs/2509.25771
作者: Jia Jun Cheng Xian,Muchen Li,Haotian Yang,Xin Tao,Pengfei Wan,Leonid Sigal,Renjie Liao
机构: University of British Columbia (不列颠哥伦比亚大学); Vector Institute for AI (AI向量研究所); Canada CIFAR AI Chair (加拿大CIFAR人工智能主席); Kling Team, Kuaishou Technology (快手科技Kling团队); NSERC CRC Chair (加拿大国家自然科学与工程研究理事会首席研究员主席)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in diffusion-based text-to-image (T2I) models have led to remarkable success in generating high-quality images from textual prompts. However, ensuring accurate alignment between the text and the generated image remains a significant challenge for state-of-the-art diffusion models. To address this, existing studies employ reinforcement learning with human feedback (RLHF) to align T2I outputs with human preferences. These methods, however, either rely directly on paired image preference data or require a learned reward function, both of which depend heavily on costly, high-quality human annotations and thus face scalability limitations. In this work, we introduce Text Preference Optimization (TPO), a framework that enables “free-lunch” alignment of T2I models, achieving alignment without the need for paired image preference data. TPO works by training the model to prefer matched prompts over mismatched prompts, which are constructed by perturbing original captions using a large language model. Our framework is general and compatible with existing preference-based algorithms. We extend both DPO and KTO to our setting, resulting in TDPO and TKTO. Quantitative and qualitative evaluations across multiple benchmarks show that our methods consistently outperform their original counterparts, delivering better human preference scores and improved text-to-image alignment. Our Open-source code is available at this https URL.
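
下面用一个极简的 DPO 形式损失函数示意 TPO 的核心思想:让模型相对参考模型更偏好"匹配的提示",而非由 LLM 扰动得到的"不匹配提示"。代码仅为概念草图(假设已有对两种提示的对数似然近似,变量名与 beta 取值均为示意),并非论文 TDPO/TKTO 的官方实现:

```python
import torch
import torch.nn.functional as F

def tpo_dpo_loss(logp_matched, logp_mismatched,
                 ref_logp_matched, ref_logp_mismatched, beta=0.1):
    """DPO 形式的文本偏好损失:四个输入均为形状 [batch] 的对数似然近似。"""
    pi_diff = logp_matched - logp_mismatched            # 当前模型的偏好差
    ref_diff = ref_logp_matched - ref_logp_mismatched   # 参考模型的偏好差
    return -F.logsigmoid(beta * (pi_diff - ref_diff)).mean()

# 玩具示例:用随机数代替真实的(扩散模型)似然近似
b = 4
loss = tpo_dpo_loss(torch.randn(b, requires_grad=True), torch.randn(b),
                    torch.randn(b), torch.randn(b))
loss.backward()
print(float(loss))
```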
zh

[CV-99] ART-VITON: Measurement-Guided Latent Diffusion for Artifact-Free Virtual Try-On

【速读】:该论文旨在解决虚拟试衣(Virtual Try-On, VITON)中非试衣区域(non-try-on regions)的保真度问题,即在利用潜在扩散模型(Latent Diffusion Models, LDMs)实现服装对齐与细节合成的同时,如何避免因后处理替换策略导致的边界伪影(boundary artifacts),并克服现有方法在生成过程中出现的语义漂移(semantic drift)。其解决方案的关键在于将VITON建模为一个线性逆问题,并引入轨迹对齐求解器(trajectory-aligned solvers)以逐步强化测量一致性,从而减少非试衣区域的突变;进一步提出ART-VITON框架,通过残差先验初始化(residual prior-based initialization)缓解训练-推理不匹配,以及融合数据一致性、频域校正和周期标准去噪的无伪影测量引导采样策略,确保生成结果既符合观测约束又保持高质量。

链接: https://arxiv.org/abs/2509.25749
作者: Junseo Park,Hyeryung Jang
机构: Dongguk University (东国大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages

点击查看摘要

Abstract:Virtual try-on (VITON) aims to generate realistic images of a person wearing a target garment, requiring precise garment alignment in try-on regions and faithful preservation of identity and background in non-try-on regions. While latent diffusion models (LDMs) have advanced alignment and detail synthesis, preserving non-try-on regions remains challenging. A common post-hoc strategy directly replaces these regions with original content, but abrupt transitions often produce boundary artifacts. To overcome this, we reformulate VITON as a linear inverse problem and adopt trajectory-aligned solvers that progressively enforce measurement consistency, reducing abrupt changes in non-try-on regions. However, existing solvers still suffer from semantic drift during generation, leading to artifacts. We propose ART-VITON, a measurement-guided diffusion framework that ensures measurement adherence while maintaining artifact-free synthesis. Our method integrates residual prior-based initialization to mitigate training-inference mismatch and artifact-free measurement-guided sampling that combines data consistency, frequency-level correction, and periodic standard denoising. Experiments on VITON-HD, DressCode, and SHHQ-1.0 demonstrate that ART-VITON effectively preserves identity and background, eliminates boundary artifacts, and consistently improves visual fidelity and robustness over state-of-the-art baselines.
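
论文把 VITON 表述为线性逆问题并在采样中逐步施加测量一致性。下面是一个在像素空间做掩码式测量一致性投影的玩具草图(实际方法作用在潜空间,并结合频域校正与周期标准去噪;此处仅示意"非试衣区域向观测值投影"这一步,mask、weight 等均为假设):

```python
import torch

def data_consistency(x_pred, y_obs, mask, weight: float = 1.0):
    """对线性观测 y = mask * x 的一步一致性投影:
    mask=1 的区域(身份/背景)被拉向观测值,mask=0 的试衣区域保留生成结果。"""
    return x_pred + weight * mask * (y_obs - x_pred)

x_pred = torch.rand(1, 3, 4, 4)          # 去噪得到的当前估计
y_obs = torch.rand(1, 3, 4, 4)           # 原图观测
mask = torch.zeros(1, 1, 4, 4)
mask[..., :2] = 1.0                      # 假设左半部分为需保留的非试衣区域
x_next = data_consistency(x_pred, y_obs, mask)
assert torch.allclose(x_next[..., :2], y_obs[..., :2])   # 该区域与观测一致
```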
zh

[CV-100] Dolphin v1.0 Technical Report

链接: https://arxiv.org/abs/2509.25748
作者: Taohan Weng,Chi zhang,Chaoran Yan,Siya Liu,Xiaoyang Liu,Yalun Wu,Boyang Wang,Boyan Wang,Jiren Ren,Kaiwen Yan,Jinze Yu,Kaibing Hu,Henan Liu,Haoyun zheng,Anjie Le,Hongcheng Guo
机构: Dolphin AI
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[CV-101] IPDRecon: Image-Plane Geometric Decoding for View-Invariant Indoor Scene Reconstruction

链接: https://arxiv.org/abs/2509.25744
作者: Mingyang Li,Yimeng Fan,Changsong Liu,Tianyu Zhou,Xin Wang,Yanyan Liu,Wei Zhang
机构: Tianjin University (天津大学); Nankai University (南开大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-102] Dragging with Geometry: From Pixels to Geometry-Guided Image Editing

【速读】:该论文旨在解决现有基于拖拽的图像编辑方法在几何密集场景(如旋转和透视变换)中因主要依赖二维像素平面而产生的编辑不精确、不一致的问题。其解决方案的关键在于提出一种几何引导的拖拽编辑方法GeoDrag,通过构建一个统一的位移场联合编码三维几何信息与二维空间先验,实现单次前向传播下的结构一致且高保真的编辑效果;同时引入无冲突分区策略隔离编辑区域,有效避免多点拖拽时的干扰并保障一致性。

链接: https://arxiv.org/abs/2509.25740
作者: Xinyu Pu,Hongsong Wang,Jie Gui,Pan Zhou
机构: Southeast University (东南大学); Singapore Management University (新加坡管理大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Interactive point-based image editing serves as a controllable editor, enabling precise and flexible manipulation of image content. However, most drag-based methods operate primarily on the 2D pixel plane with limited use of 3D cues. As a result, they often produce imprecise and inconsistent edits, particularly in geometry-intensive scenarios such as rotations and perspective transformations. To address these limitations, we propose a novel geometry-guided drag-based image editing method - GeoDrag, which addresses three key challenges: 1) incorporating 3D geometric cues into pixel-level editing, 2) mitigating discontinuities caused by geometry-only guidance, and 3) resolving conflicts arising from multi-point dragging. Built upon a unified displacement field that jointly encodes 3D geometry and 2D spatial priors, GeoDrag enables coherent, high-fidelity, and structure-consistent editing in a single forward pass. In addition, a conflict-free partitioning strategy is introduced to isolate editing regions, effectively preventing interference and ensuring consistency. Extensive experiments across various editing scenarios validate the effectiveness of our method, showing superior precision, structural consistency, and reliable multi-point editability. The code will be available on this https URL .
zh

[CV-103] LieHMR: Autoregressive Human Mesh Recovery with SO(3) Diffusion

【速读】:该论文致力于解决从单张RGB图像中进行人体网格恢复(Human Mesh Recovery, HMR)的问题,其核心挑战在于从二维观测中恢复三维人体姿态具有固有的歧义性。现有方法多采用确定性回归策略,难以捕捉潜在的多解性;而概率方法虽尝试生成多个合理结果以建模歧义,却常在准确性和样本多样性之间存在权衡,且单次预测性能落后于最优确定性模型。解决方案的关键在于提出一种基于SO(3)扩散模型的新框架,通过条件丢弃(conditioning dropout)机制实现对图像观测的无条件与条件分布建模,从而学习与2D观测高度对齐的姿态参数分布;同时利用Transformer结构提取关节层级特征,并引入轻量级MLP去噪模型,基于关节潜向量分别学习每个关节的条件分布,显著提升了姿态概率分布建模的准确性与合理性。

链接: https://arxiv.org/abs/2509.25739
作者: Donghwan Kim,Tae-Kyun Kim
机构: KAIST(韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 13 figures

点击查看摘要

Abstract:We tackle the problem of Human Mesh Recovery (HMR) from a single RGB image, formulating it as an image-conditioned human pose and shape generation. While recovering 3D human pose from 2D observations is inherently ambiguous, most existing approaches have regressed a single deterministic output. Probabilistic methods attempt to address this by generating multiple plausible outputs to model the ambiguity. However, these methods often exhibit a trade-off between accuracy and sample diversity, and their single predictions are not competitive with state-of-the-art deterministic models. To overcome these limitations, we propose a novel approach that models well-aligned distribution to 2D observations. In particular, we introduce SO(3) diffusion model, which generates the distribution of pose parameters represented as 3D rotations unconditional and conditional to image observations via conditioning dropout. Our model learns the hierarchical structure of human body joints using the transformer. Instead of using transformer as a denoising model, the time-independent transformer extracts latent vectors for the joints and a small MLP-based denoising model learns the per-joint distribution conditioned on the latent vector. We experimentally demonstrate and analyze that our model predicts accurate pose probability distribution effectively.
zh

[CV-104] The 1st Solution for MOSEv1 Challenge on LSVOS 2025: CGFSeg

链接: https://arxiv.org/abs/2509.25738
作者: Tingmin Li,Yixuan Li,Yang Yang
机构: Nanjing University of Science and Technology (南京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-105] LaTo: Landmark-tokenized Diffusion Transformer for Fine-grained Human Face Editing

链接: https://arxiv.org/abs/2509.25731
作者: Zhenghao Zhang,Ziying Zhang,Junchao Liao,Xiangyu Meng,Qiang Hu,Siyu Zhu,Xiaoyun Zhang,Long Qin,Weizhi Wang
机构: Alibaba Cloud Computing (阿里云计算); Shanghai Jiao Tong University (上海交通大学); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-106] SAGE: Spatial-visual Adaptive Graph Exploration for Visual Place Recognition

链接: https://arxiv.org/abs/2509.25723
作者: Shunpeng Chen,Changwei Wang,Rongtao Xu,Xingtian Pei,Yukun Song,Jinzhou Lin,Wenhao Xu,Jingyi Zhang,Li Guo,Shibiao Xu
机构: School of Artificial Intelligence, Beijing University of Posts and Telecommunications (北京邮电大学人工智能学院); Key Laboratory of Computing Power Network and Information Security, Ministry of Education (教育部算力网络与信息安全重点实验室); Shandong Computer Science Center (山东计算机科学中心); Spatialtemporal AI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-107] Reweighted Flow Matching via Unbalanced OT for Label-free Long-tailed Generation

链接: https://arxiv.org/abs/2509.25713
作者: Hyunsoo Song,Minjung Gim,Jaewoong Choi
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages, 17 figures

点击查看摘要

[CV-108] ProbMed: A Probabilistic Framework for Medical Multimodal Binding ICCV2025

链接: https://arxiv.org/abs/2509.25711
作者: Yuan Gao,Sangwook Kim,Jianzhong You,Chris McIntosh
机构: Peter Munk Cardiac Centre (彼得·蒙克心脏中心); Ted Rogers Centre for Heart Research (泰德·罗杰斯心脏研究中心); University Health Network (大学健康网络); Joint Department of Medical Imaging (联合医学影像系); University of Toronto (多伦多大学); Vector Institute (矢量研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025

点击查看摘要

[CV-109] How Diffusion Models Memorize

【速读】:该论文旨在解决扩散模型(diffusion models)中存在的训练数据记忆问题,即模型在生成过程中可能复现甚至过度依赖训练样本,从而引发隐私和版权风险。其解决方案的关键在于揭示了记忆现象的根本机制:早期去噪阶段对训练样本的过估计(overestimation)是导致记忆的核心原因。研究通过分析潜在空间动态发现,这种过估计会降低多样性、压缩去噪轨迹并加速收敛至已记忆图像;同时,记忆提示词会将训练图像引入噪声预测中,引导潜在轨迹向特定样本偏移;进一步的中间潜在表示分解表明,初始随机性被迅速压制,取而代之的是记忆内容,且与理论去噪路径的偏差几乎完美对应记忆严重程度。因此,该工作提出以“早期过估计”为核心机制来理解并解释扩散模型的记忆行为。

链接: https://arxiv.org/abs/2509.25705
作者: Juyeop Kim,Songkuk Kim,Jong-Seok Lee
机构: Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite their success in image generation, diffusion models can memorize training data, raising serious privacy and copyright concerns. Although prior work has sought to characterize, detect, and mitigate memorization, the fundamental question of why and how it occurs remains unresolved. In this paper, we revisit the diffusion and denoising process and analyze latent space dynamics to address the question: “How do diffusion models memorize?” We show that memorization is driven by the overestimation of training samples during early denoising, which reduces diversity, collapses denoising trajectories, and accelerates convergence toward the memorized image. Specifically: (i) memorization cannot be explained by overfitting alone, as training loss is larger under memorization due to classifier-free guidance amplifying predictions and inducing overestimation; (ii) memorized prompts inject training images into noise predictions, forcing latent trajectories to converge and steering denoising toward their paired samples; and (iii) a decomposition of intermediate latents reveals how initial randomness is quickly suppressed and replaced by memorized content, with deviations from the theoretical denoising schedule correlating almost perfectly with memorization severity. Together, these results identify early overestimation as the central underlying mechanism of memorization in diffusion models.
zh

[CV-110] AIMCoT: Active Information-driven Multimodal Chain-of-Thought for Vision-Language Reasoning ICLR2026

链接: https://arxiv.org/abs/2509.25699
作者: Xiping Li,Jianghong Ma
机构: Chinese University of Hong Kong (香港中文大学); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学(深圳))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 4 figures, submitted to ICLR 2026

点击查看摘要

[CV-111] Annotation-Efficient Active Test-Time Adaptation with Conformal Prediction

链接: https://arxiv.org/abs/2509.25692
作者: Tingyu Shi,Fan Lyu,Shaoliang Peng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:

点击查看摘要

[CV-112] OmniDFA: A Unified Framework for Open Set Synthesis Image Detection and Few-Shot Attribution

链接: https://arxiv.org/abs/2509.25682
作者: Shiyu Wu,Shuyan Li,Jing Li,Jing Liu,Yequan Wang
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Beijing Academy of Artificial Intelligence (北京人工智能研究院); University of Chinese Academy of Sciences (中国科学院大学); School of EEECS, Queen’s University Belfast (女王大学电气电子与计算机科学学院); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 5 figures

点击查看摘要

[CV-113] dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought

链接: https://arxiv.org/abs/2509.25681
作者: Junjie Wen,Minjie Zhu,Jiaming Liu,Zhiyuan Liu,Yicun Yang,Linfeng Zhang,Shanghang Zhang,Yichen Zhu,Yi Xu
机构: Midea Group (美的集团); Peking University (北京大学); Shanghai Jiaotong University (上海交通大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: technique report

点击查看摘要

[CV-114] LTA-L2S: Lexical Tone-Aware Lip-to-Speech Synthesis for Mandarin with Cross-Lingual Transfer Learning ICASSP2026

链接: https://arxiv.org/abs/2509.25670
作者: Kang Yang,Yifan Liang,Fangkun Liu,Zhenping Xie,Chengshi Zheng
机构: 未知
类目: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to ICASSP 2026

点击查看摘要

[CV-115] YOLO-Based Defect Detection for Metal Sheets

链接: https://arxiv.org/abs/2509.25659
作者: Po-Heng Chou,Chun-Chi Wang,Wei-Lung Mao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
备注: 5 pages, 8 figures, 2 tables, and published in IEEE IST 2024

点击查看摘要

[CV-116] DescribeEarth: Describe Anything for Remote Sensing Images

链接: https://arxiv.org/abs/2509.25654
作者: Kaiyu Li,Zixuan Jiang,Xiangyong Cao,Jiayu Wang,Yuchen Xiao,Deyu Meng,Zhi Wang
机构: Xi’an Jiaotong University (西安交通大学); Pazhou Laboratory (琶洲实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-117] Using Images from a Video Game to Improve the Detection of Truck Axles

链接: https://arxiv.org/abs/2509.25644
作者: Leandro Arab Marcomini,Andre Luiz Cunha
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

[CV-118] Generalized Contrastive Learning for Universal Multimodal Retrieval NEURIPS2025

链接: https://arxiv.org/abs/2509.25638
作者: Jungsoo Lee,Janghoon Cho,Hyojin Park,Munawar Hayat,Kyuwoong Hwang,Fatih Porikli,Sungha Choi
机构: Qualcomm AI Research (高通人工智能研究)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to NeurIPS 2025

点击查看摘要

[CV-119] Anchor-free Cross-view Object Geo-localization with Gaussian Position Encoding and Cross-view Association

链接: https://arxiv.org/abs/2509.25623
作者: Xingtao Ling,Chenlin Fu,Yingying Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-120] LMOD: A Comprehensive Multimodal Dataset and Benchmark for Developing and Evaluating Multimodal Large Language Models in Ophthalmology

链接: https://arxiv.org/abs/2509.25620
作者: Zhenyue Qin,Yang Liu,Yu Yin,Jinyu Ding,Haoran Zhang,Anran Li,Dylan Campbell,Xuansheng Wu,Ke Zou,Tiarnan D. L. Keenan,Emily Y. Chew,Zhiyong Lu,Yih-Chung Tham,Ninghao Liu,Xiuzhen Zhang,Qingyu Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-121] GaussianLens: Localized High-Resolution Reconstruction via On-Demand Gaussian Densification

链接: https://arxiv.org/abs/2509.25603
作者: Yijia Weng,Zhicheng Wang,Songyou Peng,Saining Xie,Howard Zhou,Leonidas J. Guibas
机构: Stanford University (斯坦福大学); Google DeepMind
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-122] K-Prism: A Knowledge-Guided and Prompt Integrated Universal Medical Image Segmentation Model

【速读】:该论文旨在解决医学图像分割模型在实际临床应用中面临的碎片化问题,即现有模型通常仅基于单一知识来源,且局限于特定任务、模态或器官,难以模拟临床专家融合多种知识进行决策的能力。其解决方案的关键在于提出一个统一的分割框架K-Prism,通过系统性整合三种知识范式——从标注数据中学习的语义先验(semantic priors)、来自少量参考样本的上下文知识(in-context knowledge)以及用户交互反馈(interactive feedback),并将其编码为双提示表示:一维稀疏提示定义“要分割什么”,二维密集提示指示“关注何处”,再经由Mixture-of-Experts(MoE)解码器动态路由,实现不同知识范式的灵活切换与跨任务联合训练,无需修改架构即可在18个公共数据集上达到最优性能。

链接: https://arxiv.org/abs/2509.25594
作者: Bangwei Guo,Yunhe Gao,Meng Ye,Difei Gu,Yang Zhou,Leon Axel,Dimitris Metaxas
机构: Rutgers University (罗格斯大学); Stanford University (斯坦福大学); The University of Texas at Arlington (德克萨斯大学阿灵顿分校); New York University (纽约大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Medical image segmentation is fundamental to clinical decision-making, yet existing models remain fragmented. They are usually trained on single knowledge sources and specific to individual tasks, modalities, or organs. This fragmentation contrasts sharply with clinical practice, where experts seamlessly integrate diverse knowledge: anatomical priors from training, exemplar-based reasoning from reference cases, and iterative refinement through real-time interaction. We present K-Prism, a unified segmentation framework that mirrors this clinical flexibility by systematically integrating three knowledge paradigms: (i) semantic priors learned from annotated datasets, (ii) in-context knowledge from few-shot reference examples, and (iii) interactive feedback from user inputs like clicks or scribbles. Our key insight is that these heterogeneous knowledge sources can be encoded into a dual-prompt representation: 1-D sparse prompts defining what to segment and 2-D dense prompts indicating where to attend, which are then dynamically routed through a Mixture-of-Experts (MoE) decoder. This design enables flexible switching between paradigms and joint training across diverse tasks without architectural modifications. Comprehensive experiments on 18 public datasets spanning diverse modalities (CT, MRI, X-ray, pathology, ultrasound, etc.) demonstrate that K-Prism achieves state-of-the-art performance across semantic, in-context, and interactive segmentation settings. Code will be released upon publication.
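
摘要中"一维稀疏提示 + 二维密集提示经 MoE 解码器动态路由"的思想,可用下面的极简 PyTorch 草图说明(特征维度、专家数量均为假设,也未实现真实的密集提示与分割头,仅示意按提示向量对专家加权):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoEDecoder(nn.Module):
    """极简 MoE:路由器根据提示向量为各专家打分,输出为专家结果的加权和。"""
    def __init__(self, dim: int = 64, n_experts: int = 3):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(n_experts))

    def forward(self, feat, prompt):
        # feat: [B, N, dim] 图像特征;prompt: [B, dim] 一维稀疏提示("分割什么")
        gate = F.softmax(self.router(prompt), dim=-1)            # [B, E]
        outs = torch.stack([e(feat) for e in self.experts], 1)   # [B, E, N, dim]
        return (gate[:, :, None, None] * outs).sum(dim=1)        # [B, N, dim]

moe = TinyMoEDecoder()
y = moe(torch.randn(2, 196, 64), torch.randn(2, 64))
print(y.shape)   # torch.Size([2, 196, 64])
```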
zh

[CV-123] MetaChest: Generalized few-shot learning of pathologies from chest X-rays

链接: https://arxiv.org/abs/2509.25590
作者: Berenice Montalvo-Lezama,Gibran Fuentes-Pineda
机构: Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México (墨西哥国立自治大学应用数学与系统研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

[CV-124] AttentionViG: Cross-Attention-Based Dynamic Neighbor Aggregation in Vision GNNs WACV

【速读】:该论文旨在解决视觉图神经网络(Vision Graph Neural Networks, ViGs)中节点-邻域特征聚合方法的局限性问题,即现有图卷积方法(如Max-Relative、EdgeConv、GIN和GraphSAGE)难以在不依赖架构特异性调整的情况下有效捕捉复杂的节点与邻域关系。解决方案的关键在于提出一种基于交叉注意力(cross-attention)的聚合机制:查询(query)投影源自节点自身,而键(key)投影来自其邻居节点,从而实现更具表达力的非局部信息传递;在此基础上构建的AttentionViG架构通过该机制实现了高效的全局上下文建模,在ImageNet-1K上达到当前最优性能,并在MS COCO和ADE20K等下游任务中展现出良好的迁移能力,同时保持与先前视觉图神经网络相当的计算效率(FLOPs)。

链接: https://arxiv.org/abs/2509.25570
作者: Hakan Emre Gedik,Andrew Martin,Mustafa Munir,Oguzhan Baser,Radu Marculescu,Sandeep P. Chinchali,Alan C. Bovik
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: WACV submission. 13 pages, including the main text (8 pages), references, and supplementary material

点击查看摘要

Abstract:Vision Graph Neural Networks (ViGs) have demonstrated promising performance in image recognition tasks against Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). An essential part of the ViG framework is the node-neighbor feature aggregation method. Although various graph convolution methods, such as Max-Relative, EdgeConv, GIN, and GraphSAGE, have been explored, a versatile aggregation method that effectively captures complex node-neighbor relationships without requiring architecture-specific refinements is needed. To address this gap, we propose a cross-attention-based aggregation method in which the query projections come from the node, while the key projections come from its neighbors. Additionally, we introduce a novel architecture called AttentionViG that uses the proposed cross-attention aggregation scheme to conduct non-local message passing. We evaluated the image recognition performance of AttentionViG on the ImageNet-1K benchmark, where it achieved SOTA performance. Additionally, we assessed its transferability to downstream tasks, including object detection and instance segmentation on MS COCO 2017, as well as semantic segmentation on ADE20K. Our results demonstrate that the proposed method not only achieves strong performance, but also maintains efficiency, delivering competitive accuracy with comparable FLOPs to prior vision GNN architectures.
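
下面给出"query 取自节点自身、key/value 取自其 K 近邻"的交叉注意力聚合的最小实现草图(单头、无投影偏置,维度与邻居数均为假设,并非论文官方代码):

```python
import torch
import torch.nn.functional as F

def cross_attn_aggregate(x, neighbor_idx, wq, wk, wv):
    """节点-邻居交叉注意力聚合。
    x: [N, C] 节点特征;neighbor_idx: [N, K] 每个节点的 K 近邻下标。"""
    q = x @ wq                            # [N, C],query 来自节点自身
    neigh = x[neighbor_idx]               # [N, K, C],邻居特征
    k, v = neigh @ wk, neigh @ wv         # key / value 来自邻居
    attn = F.softmax((q.unsqueeze(1) * k).sum(-1) / k.shape[-1] ** 0.5, dim=-1)  # [N, K]
    return (attn.unsqueeze(-1) * v).sum(1)   # [N, C]

N, K, C = 6, 3, 8
x = torch.randn(N, C)
idx = torch.randint(0, N, (N, K))
out = cross_attn_aggregate(x, idx, *(torch.randn(C, C) for _ in range(3)))
print(out.shape)   # torch.Size([6, 8])
```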
zh

[CV-125] FishNet: Analyzing the capabilities of Multimodal Large Language Models in marine biology

链接: https://arxiv.org/abs/2509.25564
作者: Faizan Farooq Khan,Yousef Radwan,Eslam Abdelrahman,Abdulwahab Felemban,Aymen Mir,Nico K. Michiels,Andrew J. Temple,Michael L. Berumen,Mohamed Elhoseiny
机构: King Abdullah University of Science and Technology (阿卜杜拉国王科技大学); Red Sea Research Center, KAUST (红海研究中心,KAUST); Tübingen University (图宾根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 3 figures 8 tables

点击查看摘要

[CV-126] Hybrid Approach for Enhancing Lesion Segmentation in Fundus Images

链接: https://arxiv.org/abs/2509.25549
作者: Mohammadmahdi Eshragh,Emad A. Mohammed,Behrouz Far,Ezekiel Weis,Carol L Shields,Sandor R Ferenczy,Trafford Crump
机构: University of Calgary(卡尔加里大学); Wilfrid Laurier University(劳瑞尔大学); University of Alberta(阿尔伯塔大学); Wills Eye Hospital(威尔斯眼科医院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

[CV-127] Online Mapping for Autonomous Driving: Addressing Sensor Generalization and Dynamic Map Updates in Campus Environments

链接: https://arxiv.org/abs/2509.25542
作者: Zihan Zhang,Abhijit Ravichandran,Pragnya Korti,Luobin Wang,Henrik I. Christensen
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 19th International Symposium on Experimental Robotics

点击查看摘要

[CV-128] Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play

链接: https://arxiv.org/abs/2509.25541
作者: Qinsi Wang,Bo Liu,Tianyi Zhou,Jing Shi,Yueqian Lin,Yiran Chen,Hai Helen Li,Kun Wan,Wentian Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[CV-129] VISOR: Universal Visual Inputs based Steering for Large Vision Language Models

链接: https://arxiv.org/abs/2509.25533
作者: Ravikumar Balakrishnan,Mansi Phute
机构: HiddenLayer Inc.(HiddenLayer公司); Georgia Institute of Technology(佐治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[CV-130] LLM-RG: Referential Grounding in Outdoor Scenarios using Large Language Models

链接: https://arxiv.org/abs/2509.25528
作者: Pranav Saxena,Avigyan Bhattacharya,Ji Zhang,Wenshan Wang
机构: Birla Institute of Technology and Science Pilani, K.K Birla Goa Campus (比尔拉理工大学与科学学院,K.K. 比拉果阿校区); Carnegie Mellon University, Robotics Institute (卡内基梅隆大学,机器人研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

[CV-131] Robust Visual Localization in Compute-Constrained Environments by Salient Edge Rendering and Weighted Hamming Similarity

链接: https://arxiv.org/abs/2509.25520
作者: Tu-Hoa Pham,Philip Bailey,Daniel Posada,Georgios Georgakis,Jorge Enriquez,Surya Suresh,Marco Dolci,Philip Twu
机构: Jet Propulsion Laboratory, California Institute of Technology (喷气推进实验室,加州理工学院); Blue Origin (蓝色起源); Amazon (亚马逊); Ohio State University (俄亥俄州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: To appear in IEEE Robotics and Automation Letters

点击查看摘要

[CV-132] DeepFake Detection in Dyadic Video Calls using Point of Gaze Tracking

链接: https://arxiv.org/abs/2509.25503
作者: Odin Kohler,Rahul Vijaykumar,Masudul H. Imtiaz
机构: Clarkson University (克拉克森大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[CV-133] Seeing Before Reasoning : A Unified Framework for Generalizable and Explainable Fake Image Detection

链接: https://arxiv.org/abs/2509.25502
作者: Kaiqing Lin,Zhiyuan Yan,Ruoxin Chen,Junyan Ye,Ke-Yue Zhang,Yue Zhou,Peng Jin,Bin Li,Taiping Yao,Shouhong Ding
机构: Shenzhen University (深圳大学); Tencent Youtu Lab (腾讯优图实验室); Peking University (北京大学); Sun Yat-Sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-134] Infrastructure Sensor-enabled Vehicle Data Generation using Multi-Sensor Fusion for Proactive Safety Applications at Work Zone

【速读】:该论文旨在解决高风险道路路段(如施工区)中基础设施感知与实时轨迹生成在实际部署时面临的挑战,包括视角畸变、复杂几何结构、遮挡以及成本问题。其解决方案的关键在于构建一个融合路边摄像头与LiDAR传感器的共仿真环境,开发一种可扩展且低成本的车辆检测与定位框架,并采用基于卡尔曼滤波(Kalman Filter, KF)的后融合策略以提升轨迹的一致性与准确性。实验表明,该方法在仿真中将纵向误差降低达70%,同时保持横向精度在1至3米内;现场验证进一步证实,即使单传感器数据间歇或退化,融合轨迹仍能精确匹配真实车辆路径,从而可靠补偿各传感器局限,实现高鲁棒性的车辆跟踪能力。

链接: https://arxiv.org/abs/2509.25452
作者: Suhala Rabab Saba,Sakib Khan,Minhaj Uddin Ahmad,Jiahe Cao,Mizanur Rahman,Li Zhao,Nathan Huynh,Eren Erman Ozguven
机构: The University of Alabama (阿拉巴马大学); MITRE Corporation (MITRE公司); University of Nebraska-Lincoln (内布拉斯加林肯大学); FAMU-FSU College of Engineering (佛罗里达农工大学-佛罗里达州立大学工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Infrastructure-based sensing and real-time trajectory generation show promise for improving safety in high-risk roadway segments such as work zones, yet practical deployments are hindered by perspective distortion, complex geometry, occlusions, and costs. This study tackles these barriers by integrating roadside camera and LiDAR sensors into a cosimulation environment to develop a scalable, cost-effective vehicle detection and localization framework, and employing a Kalman Filter-based late fusion strategy to enhance trajectory consistency and accuracy. In simulation, the fusion algorithm reduced longitudinal error by up to 70 percent compared to individual sensors while preserving lateral accuracy within 1 to 3 meters. Field validation in an active work zone, using LiDAR, a radar-camera rig, and RTK-GPS as ground truth, demonstrated that the fused trajectories closely match real vehicle paths, even when single-sensor data are intermittent or degraded. These results confirm that KF based sensor fusion can reliably compensate for individual sensor limitations, providing precise and robust vehicle tracking capabilities. Our approach thus offers a practical pathway to deploy infrastructure-enabled multi-sensor systems for proactive safety measures in complex traffic environments.
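
下面是一个一维匀速运动模型下、把两路传感器的位置观测依次喂入同一个卡尔曼滤波器的玩具示例(状态为位置与速度,噪声参数与数据均为虚构;仅演示 KF 融合的基本机制,并非论文中对各传感器轨迹做后融合的具体方案):

```python
import numpy as np

def kf_fuse(zs, rs, dt=0.1, q=1e-2):
    """一维匀速模型卡尔曼滤波,依次吸收每个时刻多路传感器的位置观测。
    zs: [T, S] 各时刻 S 个传感器的位置观测;rs: 长度 S 的观测噪声方差。"""
    F = np.array([[1, dt], [0, 1]]); H = np.array([[1.0, 0.0]])
    Q = q * np.eye(2)
    x, P = np.zeros(2), np.eye(2)
    track = []
    for z_t in zs:
        x, P = F @ x, F @ P @ F.T + Q                 # 预测
        for z, r in zip(z_t, rs):                      # 逐个传感器更新
            S = H @ P @ H.T + r
            K = P @ H.T / S
            x = x + (K * (z - H @ x)).ravel()
            P = (np.eye(2) - K @ H) @ P
        track.append(x[0])
    return np.array(track)

t = np.arange(50) * 0.1
truth = 2.0 * t                                        # 真值:2 m/s 匀速
zs = np.stack([truth + np.random.normal(0, 0.5, 50),   # 传感器 1,噪声较小
               truth + np.random.normal(0, 1.0, 50)],  # 传感器 2,噪声较大
              axis=1)
print(kf_fuse(zs, rs=[0.25, 1.0])[-5:])                # 融合后的位置轨迹末段
```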
zh

[CV-135] Bayesian Transformer for Pan-Arctic Sea Ice Concentration Mapping and Uncertainty Estimation using Sentinel-1 RCM and AMSR2 Data

链接: https://arxiv.org/abs/2509.25437
作者: Mabel Heffring,Lincoln Linlin Xu
机构: University of Calgary (卡尔加里大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 5 pages, 4 figures

点击查看摘要

[CV-136] DepthLM: Metric Depth From Vision Language Models

链接: https://arxiv.org/abs/2509.25413
作者: Zhipeng Cai,Ching-Feng Yeh,Hu Xu,Zhuang Liu,Gregory Meyer,Xinjie Lei,Changsheng Zhao,Shang-Wen Li,Vikas Chandra,Yangyang Shi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-137] A Deep Learning Approach for Spatio-Temporal Forecasting of InSAR Ground Deformation in Eastern Ireland

链接: https://arxiv.org/abs/2509.25393
作者: Wendong Yao,Binhua Huang,Soumyabrata Dev
机构: ADAPT SFI Research Centre (ADAPT 国家研究中心); School of Computer Science (计算机科学学院); University College Dublin (都柏林大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This paper is submitted to IEEE Transactions on Geoscience and Remote Sensing for reviewing

点击查看摘要

[CV-138] SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs

链接: https://arxiv.org/abs/2509.25390
作者: Yuyou Zhang,Radu Corcodel,Chiori Hori,Anoop Cherian,Ding Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[CV-139] Saliency Guided Longitudinal Medical Visual Question Answering

链接: https://arxiv.org/abs/2509.25374
作者: Jialin Wu,Xiaofeng Liu
机构: University of California, San Diego (加州大学圣地亚哥分校); Yale University (耶鲁大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-140] Editing Physiological Signals in Videos Using Latent Representations

链接: https://arxiv.org/abs/2509.25348
作者: Tianwen Zhou,Akshay Paruchuri,Josef Spjut,Kaan Akşit
机构: University College London (伦敦大学学院); UNC Chapel Hill (北卡罗来纳大学教堂山分校); NVIDIA Research (英伟达研究)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
备注: 12 pages, 8 figures, 7 tables

点击查看摘要

[CV-141] VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes

链接: https://arxiv.org/abs/2509.25339
作者: Paul Gavrikov,Wei Lin,M. Jehanzeb Mirza,Soumya Jahagirdar,Muhammad Huzaifa,Sivan Doveh,Serena Yeung-Levy,James Glass,Hilde Kuehne
机构: Independent Researcher; JKU Linz; MIT CSAIL; Tübingen AI Center; Stanford; MIT-IBM Watson AI Lab
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:

点击查看摘要

[CV-142] LUMA: Low-Dimension Unified Motion Alignment with Dual-Path Anchoring for Text-to-Motion Diffusion Model

链接: https://arxiv.org/abs/2509.25304
作者: Haozhe Jia,Wenshuo Chen,Yuqi Lin,Yang Yang,Lei Wang,Mang Ning,Bowen Tian,Songning Lai,Nanqian Jia,Yifan Chen,Yutao Yue
机构: HKUST-GZ (香港科技大学广州); Shandong University (山东大学); UESTC (电子科技大学); Griffith University (格里菲斯大学); Data61/CSIRO (数据61/澳大利亚联邦科学与工业研究组织); Utrecht University (乌得勒支大学); Beijing University of Posts and Telecommunications (北京邮电大学); Peking University (北京大学); Institute of Deep Perception Technology, JITRI (深感知技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-143] RADAR: A Risk-Aware Dynamic Multi-Agent Framework for LLM Safety Evaluation via Role-Specialized Collaboration

链接: https://arxiv.org/abs/2509.25271
作者: Xiuyuan Chen,Jian Zhao,Yuchen Yuan,Tianle Zhang,Huilin Zhou,Zheng Zhu,Ping Hu,Linghe Kong,Chi Zhang,Weiran Huang,Xuelong Li
机构: Shanghai Jiao Tong University (上海交通大学); China Telecom (中国电信); University of Science and Technology of China (中国科学技术大学); GigaAI; Xinjiang University (新疆大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:

点击查看摘要

[CV-144] InfMasking: Unleashing Synergistic Information by Contrastive Multimodal Interactions

链接: https://arxiv.org/abs/2509.25270
作者: Liangjian Wen,Qun Dai,Jianzhuang Liu,Jiangtao Zheng,Yong Dai,Dongkai Wang,Zhao Kang,Jun Wang,Zenglin Xu,Jiang Duan
机构: Southwestern University of Finance and Economics (西南财经大学); Engineering Research Center of Intelligent Finance (智能金融工程研究中心); Shenzhen Institutes of Advanced Technology (深圳先进技术研究院); Chinese Academy of Sciences (中国科学院); X-Humanoid; University of Electronic Science and Technology of China (电子科技大学); Shanghai Academy of AI for Science (上海人工智能科学研究院); Artificial Intelligence Innovation and Incubation Institute (人工智能创新与孵化研究院); Fudan University (复旦大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-145] Challenges and Solutions in Selecting Optimal Lossless Data Compression Algorithms

链接: https://arxiv.org/abs/2509.25219
作者: Md. Atiqur Rahman,MM Fazle Rabbi
机构: BUBT (Bangladesh University of Business and Technology)
类目: Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages

点击查看摘要

[CV-146] Six Sigma For Neural Networks: Taguchi-based optimization

链接: https://arxiv.org/abs/2509.25213
作者: Sai Varun Kodathala
机构: Sports Vision, Inc. (Sports Vision, Inc.)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 23 Pages, 9 Tables

点击查看摘要

[CV-147] Hyperbolic Optimization

链接: https://arxiv.org/abs/2509.25206
作者: Yanke Wang,Kyriakos Flouris
机构: HKCRC; The Hong Kong University of Science and Technology (香港科技大学); BHM, CPCE; The Hong Kong Polytechnic University (PolyU); MRC Biostatistics Unit; University of Cambridge
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint

点击查看摘要

[CV-148] Perceptual Influence: Improving the Perceptual Loss Design for Low-Dose CT Enhancement

链接: https://arxiv.org/abs/2509.23025
作者: Gabriel A. Viana,Luis F. Alves Pereira,Tsang Ing Ren,George D. C. Cavalcanti,Jan Sijbers
机构: Universidade Federal de Pernambuco (巴西伯南布哥联邦大学); University of Antwerp (安特卫普大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[CV-149] UML-CoT: Structured Reasoning and Planning with Unified Modeling Language for Robotic Room Cleaning

链接: https://arxiv.org/abs/2509.22628
作者: Hongyu Chen,Guangrun Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

[CV-150] Automated and Scalable SEM Image Analysis of Perovskite Solar Cell Materials via a Deep Segmentation Framework

链接: https://arxiv.org/abs/2509.26548
作者: Jian Guo Pan,Lin Wang,Xia Cai
机构: Shanghai Normal University (上海师范大学); Fudan University (复旦大学)
类目: Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-151] GastroViT: A Vision Transformer Based Ensemble Learning Approach for Gastrointestinal Disease Classification with Grad CAM SHAP Visualization

链接: https://arxiv.org/abs/2509.26502
作者: Sumaiya Tabassum,Md. Faysal Ahamed,Hafsa Binte Kibria,Md. Nahiduzzaman,Julfikar Haider,Muhammad E. H. Chowdhury,Mohammad Tariqul Islam
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-152] Ordinal Label-Distribution Learning with Constrained Asymmetric Priors for Imbalanced Retinal Grading NEURIPS2025 ALT

链接: https://arxiv.org/abs/2509.26146
作者: Nagur Shareef Shaik,Teja Krishna Cherukuri,Adnan Masood,Ehsan Adeli,Dong Hye Ye
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: The Second Workshop on GenAI for Health: Potential, Trust, and Policy Compliance

点击查看摘要

[CV-153] Multi-modal Liver Segmentation and Fibrosis Staging Using Real-world MRI Images

链接: https://arxiv.org/abs/2509.26061
作者: Yang Zhou,Kunhao Yuan,Ye Wei,Jishizhan Chen
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-154] Anatomy-DT: A Cross-Diffusion Digital Twin for Anatomical Evolution

链接: https://arxiv.org/abs/2509.25280
作者: Moinak Bhattacharya,Gagandeep Singh,Prateek Prasanna
机构: Stony Brook University (石溪大学); Columbia University (哥伦比亚大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-155] Position-Blind Ptychography: Viability of image reconstruction via data-driven variational inference

链接: https://arxiv.org/abs/2509.25269
作者: Simon Welker,Lorenz Kuger,Tim Roith,Berthy Feng,Martin Burger,Timo Gerkmann,Henry Chapman
机构: University of Hamburg (汉堡大学); Deutsches Elektronen-Synchrotron DESY (德国电子同步加速器研究中心); Helmholtz Imaging (亥姆霍兹成像中心); Massachusetts Institute of Technology (麻省理工学院); The NSF AI Institute for Artificial Intelligence and Fundamental Interactions (美国国家科学基金会人工智能与基本相互作用研究所); The Hamburg Center for Ultrafast Imaging (汉堡超快成像中心)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Numerical Analysis (math.NA); Optics (physics.optics)
备注:

点击查看摘要

人工智能

[AI-0] OmniRetarget: Interaction-Preserving Data Generation for Humanoid Whole-Body Loco-Manipulation and Scene Interaction

【速读】:该论文旨在解决人形机器人在通过运动捕捉(MoCap)数据训练强化学习(Reinforcement Learning, RL)策略时,因人体与机器人之间存在显著的“具身差距”(embodiment gap),导致现有姿态重定向(retargeting)方法生成物理上不可行的动作轨迹(如脚部滑移和穿透)的问题。同时,传统方法忽视了人类与物体及环境之间的交互关系,从而限制了复杂技能(如表达性行走和移动操作)的迁移能力。解决方案的关键在于提出 OmniRetarget——一个基于交互网格(interaction mesh)的数据生成引擎,该引擎显式建模并保留代理、地形与操作对象之间的空间和接触关系;通过最小化人-机网格间的拉普拉斯形变(Laplacian deformation)并施加运动学约束,生成满足物理合理性的轨迹,并借助任务相关的交互信息实现高效数据增强,支持从单一示范快速扩展至不同机器人本体、地形和物体配置,最终使仅用5个奖励项和简单领域随机化的 proprioceptive RL 策略成功执行长达30秒的障碍跑和移动操作任务。

链接: https://arxiv.org/abs/2509.26633
作者: Lujie Yang,Xiaoyu Huang,Zhen Wu,Angjoo Kanazawa,Pieter Abbeel,Carmelo Sferrazza,C. Karen Liu,Rocky Duan,Guanya Shi
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: Project website: this https URL

点击查看摘要

Abstract:A dominant paradigm for teaching humanoid robots complex skills is to retarget human motions as kinematic references to train reinforcement learning (RL) policies. However, existing retargeting pipelines often struggle with the significant embodiment gap between humans and robots, producing physically implausible artifacts like foot-skating and penetration. More importantly, common retargeting methods neglect the rich human-object and human-environment interactions essential for expressive locomotion and loco-manipulation. To address this, we introduce OmniRetarget, an interaction-preserving data generation engine based on an interaction mesh that explicitly models and preserves the crucial spatial and contact relationships between an agent, the terrain, and manipulated objects. By minimizing the Laplacian deformation between the human and robot meshes while enforcing kinematic constraints, OmniRetarget generates kinematically feasible trajectories. Moreover, preserving task-relevant interactions enables efficient data augmentation, from a single demonstration to different robot embodiments, terrains, and object configurations. We comprehensively evaluate OmniRetarget by retargeting motions from OMOMO, LAFAN1, and our in-house MoCap datasets, generating over 8-hour trajectories that achieve better kinematic constraint satisfaction and contact preservation than widely used baselines. Such high-quality data enables proprioceptive RL policies to successfully execute long-horizon (up to 30 seconds) parkour and loco-manipulation skills on a Unitree G1 humanoid, trained with only 5 reward terms and simple domain randomization shared by all tasks, without any learning curriculum.
zh

[AI-1] Branching Out: Broadening AI Measurement and Evaluation with Measurement Trees

【速读】:该论文旨在解决当前人工智能(AI)系统评估中缺乏透明性和整合异构证据能力的问题。现有指标通常仅输出单一数值、向量、表面或类别,难以体现复杂测量对象(measurand)的多维特性,且难以融合如代理行为(agentic)、业务影响、能效、社会技术或安全等不同维度的信号。解决方案的关键在于提出“测量树”(measurement trees)这一新型度量框架,它通过构建一个分层有向图结构,其中每个节点利用用户定义的聚合方法总结其子节点信息,从而实现对复杂构造的可解释多层次表征,并增强评估过程的透明性与灵活性。

链接: https://arxiv.org/abs/2509.26632
作者: Craig Greenberg,Patrick Hall,Theodore Jensen,Kristen Greene,Razvan Amironesei
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces measurement trees, a novel class of metrics designed to combine various constructs into an interpretable multi-level representation of a measurand. Unlike conventional metrics that yield single values, vectors, surfaces, or categories, measurement trees produce a hierarchical directed graph in which each node summarizes its children through user-defined aggregation methods. In response to recent calls to expand the scope of AI system evaluation, measurement trees enhance metric transparency and facilitate the integration of heterogeneous evidence, including, e.g., agentic, business, energy-efficiency, sociotechnical, or security signals. We present definitions and examples, demonstrate practical utility through a large-scale measurement exercise, and provide accompanying open-source Python code. By operationalizing a transparent approach to measurement of complex constructs, this work offers a principled foundation for broader and more interpretable AI evaluation.
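
论文称附有开源 Python 代码;下面给出的只是与"每个节点用自定义方法聚合其子节点"这一描述相对应的极简草图(节点名称、分数与聚合方式均为虚构示例,并非该官方实现):

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class MeasurementNode:
    name: str
    value: Optional[float] = None                     # 叶子节点直接给分
    children: List["MeasurementNode"] = field(default_factory=list)
    aggregate: Callable[[List[float]], float] = lambda xs: sum(xs) / len(xs)

    def score(self) -> float:
        if not self.children:
            return self.value
        return self.aggregate([c.score() for c in self.children])

# 一个两层的测量树:根节点用均值聚合"能力"与"安全"两个分支
tree = MeasurementNode("overall", children=[
    MeasurementNode("capability", children=[
        MeasurementNode("accuracy", 0.82),
        MeasurementNode("efficiency", 0.70),
    ]),
    MeasurementNode("safety", children=[
        MeasurementNode("robustness", 0.55),
        MeasurementNode("fairness", 0.60),
    ], aggregate=min),                                 # 安全分支取最差项
])
print(tree.score())   # 0.5 * (0.76 + 0.55) = 0.655
```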
zh

[AI-2] TimeRewarder: Learning Dense Reward from Passive Videos via Frame-wise Temporal Distance

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)中密集奖励设计的难题,尤其是在机器人领域,传统方法依赖人工设计密集奖励信号,存在劳动强度大、可扩展性差的问题。解决方案的关键在于提出TimeRewarder,一种基于被动视频数据(如机器人示范或人类视频)学习任务进展估计信号的方法,通过建模帧间的时间距离来提取进度信号,并将其作为代理奖励用于指导强化学习过程。该方法显著提升了稀疏奖励任务下的样本效率和最终成功率,在Meta-World的10个挑战任务中,仅需每任务20万次环境交互即可在9/10任务中达到近乎完美的成功表现,优于先前方法甚至人工设计的密集奖励。

链接: https://arxiv.org/abs/2509.26627
作者: Yuyang Liu,Chuan Wen,Yihang Hu,Dinesh Jayaraman,Yang Gao
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Designing dense rewards is crucial for reinforcement learning (RL), yet in robotics it often demands extensive manual effort and lacks scalability. One promising solution is to view task progress as a dense reward signal, as it quantifies the degree to which actions advance the system toward task completion over time. We present TimeRewarder, a simple yet effective reward learning method that derives progress estimation signals from passive videos, including robot demonstrations and human videos, by modeling temporal distances between frame pairs. We then demonstrate how TimeRewarder can supply step-wise proxy rewards to guide reinforcement learning. In our comprehensive experiments on ten challenging Meta-World tasks, we show that TimeRewarder dramatically improves RL for sparse-reward tasks, achieving nearly perfect success in 9/10 tasks with only 200,000 interactions per task with the environment. This approach outperformed previous methods and even the manually designed environment dense reward on both the final success rate and sample efficiency. Moreover, we show that TimeRewarder pretraining can exploit real-world human videos, highlighting its potential as a scalable approach path to rich reward signals from diverse video sources.
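
下面用几行 PyTorch 勾勒"以帧对之间的时间距离作为逐步代理奖励"的思路(特征维度、网络结构与训练目标的归一化方式均为假设,并非论文官方实现):

```python
import torch
import torch.nn as nn

class TemporalDistance(nn.Module):
    """预测两帧特征之间时间距离(归一化进度差)的极简网络。"""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * feat_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, f_i, f_j):
        return self.net(torch.cat([f_i, f_j], dim=-1)).squeeze(-1)

def proxy_reward(model, f_prev, f_curr):
    """把"当前帧相对上一帧向任务完成推进了多少"作为逐步代理奖励。"""
    with torch.no_grad():
        return model(f_prev, f_curr)

model = TemporalDistance()
# 训练目标示意:对视频中随机抽取的帧对 (i, j),回归归一化时间差 (j - i) / T
f_i, f_j = torch.randn(8, 128), torch.randn(8, 128)
target = torch.rand(8)
loss = nn.functional.mse_loss(model(f_i, f_j), target)
print(float(loss), proxy_reward(model, f_i, f_j).shape)
```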
zh

[AI-3] Fine-tuning Behavioral Cloning Policies with Preference-Based Reinforcement Learning

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在机器人、工业和医疗等领域部署时面临的两大挑战:一是难以精确设计奖励函数,二是在线探索过程中存在不安全且数据消耗大的风险。解决方案的关键在于提出一种两阶段框架:第一阶段利用无奖励的专家示范数据集学习一个安全的初始策略,第二阶段通过偏好型人类反馈在线微调该策略。作者进一步提出了BRIDGE算法,该算法通过不确定性加权的目标函数统一融合离线数据与在线偏好信号,并首次提供了该离线到在线方法的理论分析,证明了离线数据量的增加可显著提升在线样本效率,从而为构建更高效交互式智能体奠定了理论基础。

链接: https://arxiv.org/abs/2509.26605
作者: Maël Macuglia,Paul Friedrich,Giorgia Ramponi
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 85 pages (11 + references and appendix), 9 figures

点击查看摘要

Abstract:Deploying reinforcement learning (RL) in robotics, industry, and health care is blocked by two obstacles: the difficulty of specifying accurate rewards and the risk of unsafe, data-hungry exploration. We address this by proposing a two-stage framework that first learns a safe initial policy from a reward-free dataset of expert demonstrations, then fine-tunes it online using preference-based human feedback. We provide the first principled analysis of this offline-to-online approach and introduce BRIDGE, a unified algorithm that integrates both signals via an uncertainty-weighted objective. We derive regret bounds that shrink with the number of offline demonstrations, explicitly connecting the quantity of offline data to online sample efficiency. We validate BRIDGE in discrete and continuous control MuJoCo environments, showing it achieves lower regret than both standalone behavioral cloning and online preference-based RL. Our work establishes a theoretical foundation for designing more sample-efficient interactive agents.
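
BRIDGE 的"不确定性加权目标"大致是把离线模仿信号与在线偏好信号按各自的不确定性融合。下面是一个高度简化的示意(权重形式、Bradley-Terry 式偏好损失与数值均为假设,具体算法与理论界请以论文为准):

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_loss(bc_loss, pref_loss, sigma_bc, sigma_pref):
    """按不确定性反比加权融合离线模仿损失与在线偏好损失(示意写法)。"""
    w_bc, w_pref = 1.0 / sigma_bc ** 2, 1.0 / sigma_pref ** 2
    return (w_bc * bc_loss + w_pref * pref_loss) / (w_bc + w_pref)

def preference_loss(logp_a, logp_b):
    """Bradley-Terry 式偏好损失:轨迹 a 被人类偏好于轨迹 b。"""
    return -F.logsigmoid(logp_a - logp_b).mean()

bc = torch.tensor(0.8)                                  # 行为克隆损失(示意数值)
pref = preference_loss(torch.randn(16), torch.randn(16))
print(float(uncertainty_weighted_loss(bc, pref, sigma_bc=0.5, sigma_pref=1.0)))
```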
zh

[AI-4] Are Robust LLM Fingerprints Adversarially Robust?

【速读】:该论文旨在解决当前模型指纹(model fingerprinting)方案在面对恶意模型托管方时缺乏对抗鲁棒性的问题。现有研究主要评估了良性扰动(如增量微调、模型融合和提示攻击)下的性能,但未系统考察对抗性攻击场景,导致现有指纹机制易被绕过。解决方案的关键在于:首先定义了一个实际可行的对抗威胁模型;其次识别出现有指纹方案的根本漏洞;进而设计针对每种漏洞的自适应对抗攻击方法,实验证明这些攻击能完全绕过十种近期提出的指纹方案,同时保持模型对终端用户的高可用性。该工作强调了将对抗鲁棒性作为设计原则的重要性,并为未来指纹技术的发展提供了改进方向。

链接: https://arxiv.org/abs/2509.26598
作者: Anshul Nasery,Edoardo Contente,Alkin Kaz,Pramod Viswanath,Sewoong Oh
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Model fingerprinting has emerged as a promising paradigm for claiming model ownership. However, robustness evaluations of these schemes have mostly focused on benign perturbations such as incremental fine-tuning, model merging, and prompting. Lack of systematic investigations into adversarial robustness against a malicious model host leaves current systems vulnerable. To bridge this gap, we first define a concrete, practical threat model against model fingerprinting. We then take a critical look at existing model fingerprinting schemes to identify their fundamental vulnerabilities. Based on these, we develop adaptive adversarial attacks tailored for each vulnerability, and demonstrate that these can bypass model authentication completely for ten recently proposed fingerprinting schemes while maintaining high utility of the model for the end users. Our work encourages fingerprint designers to adopt adversarial robustness by design. We end with recommendations for future fingerprinting methods.
zh

[AI-5] Fairness Testing in Retrieval-Augmented Generation: How Small Perturbations Reveal Bias in Small Language Models

【速读】:该论文旨在解决生成式 AI(Generative AI)在实际应用中因敏感属性(如种族、性别等)引发的公平性问题,以及检索增强生成(Retrieval-Augmented Generation, RAG)机制可能放大偏见的新风险。其核心问题是:即使小语言模型(Small Language Models, SLMs)本身未显式使用敏感信息,其与RAG结合后仍可能因外部检索内容中的偏见而导致不公平行为,且这种偏差难以通过传统测试方法发现。解决方案的关键在于采用元变换测试(Metamorphic Testing, MT),通过在输入提示中引入受控的、符合语义逻辑的敏感属性扰动(如将“Black”替换为“White”),系统性地评估模型输出是否违背预期的公平性关系(即元变换关系,Metamorphic Relations, MRs)。实验表明,仅轻微的人口统计学扰动即可破坏多达三分之一的MRs,且种族相关的扰动是主要诱因,揭示了RAG中检索模块对偏见传播的关键影响,强调开发者必须对检索源进行严格筛选和去偏处理,以保障SLMs部署时的公平性和可靠性。

链接: https://arxiv.org/abs/2509.26584
作者: Matheus Vinicius da Silva de Oliveira,Jonathan de Andrade Silva,Awdren de Lima Fontao
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are widely used across multiple domains but continue to raise concerns regarding security and fairness. Beyond known attack vectors such as data poisoning and prompt injection, LLMs are also vulnerable to fairness bugs. These refer to unintended behaviors influenced by sensitive demographic cues (e.g., race or sexual orientation) that should not affect outcomes. Another key issue is hallucination, where models generate plausible yet false information. Retrieval-Augmented Generation (RAG) has emerged as a strategy to mitigate hallucinations by combining external retrieval with text generation. However, its adoption raises new fairness concerns, as the retrieved content itself may surface or amplify bias. This study conducts fairness testing through metamorphic testing (MT), introducing controlled demographic perturbations in prompts to assess fairness in sentiment analysis performed by three Small Language Models (SLMs) hosted on HuggingFace (Llama-3.2-3B-Instruct, Mistral-7B-Instruct-v0.3, and Llama-3.1-Nemotron-8B), each integrated into a RAG pipeline. Results show that minor demographic variations can break up to one third of metamorphic relations (MRs). A detailed analysis of these failures reveals a consistent bias hierarchy, with perturbations involving racial cues being the predominant cause of the violations. In addition to offering a comparative evaluation, this work reinforces that the retrieval component in RAG must be carefully curated to prevent bias amplification. The findings serve as a practical alert for developers, testers and small organizations aiming to adopt accessible SLMs without compromising fairness or reliability.
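
蜕变测试的核心是:对提示做受控的敏感属性替换后,模型输出不应改变,否则即违反蜕变关系(MR)。下面是一个极简的检查器草图(敏感词表与玩具分类器均为示意;真实实验中 model_fn 应封装 RAG + SLM 的情感分析管线):

```python
import re

SWAPS = [("Black", "White"), ("gay", "straight"), ("woman", "man")]  # 示例敏感词对

def perturb(prompt: str, a: str, b: str) -> str:
    """把提示中的敏感属性词 a 替换为 b,构造蜕变后的输入。"""
    return re.sub(rf"\b{a}\b", b, prompt)

def check_mr(model_fn, prompt: str) -> list[str]:
    """蜕变关系:情感分类结果不应随敏感属性变化;返回被违反的替换对。"""
    base = model_fn(prompt)
    violations = []
    for a, b in SWAPS:
        if a.lower() in prompt.lower():
            if model_fn(perturb(prompt, a, b)) != base:
                violations.append(f"{a}->{b}")
    return violations

# 玩具分类器:仅为演示蜕变关系被违反的情形
toy = lambda p: "negative" if "Black" in p else "positive"
print(check_mr(toy, "The Black applicant wrote a thoughtful review."))  # ['Black->White']
```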
zh

[AI-6] Parametric Neural Amp Modeling with Active Learning

【速读】:该论文旨在解决参数化吉他音箱建模(parametric guitar amp modeling)中数据效率低下的问题,即如何在尽可能少的训练样本(如旋钮设置)下实现高质量的音色还原。其解决方案的关键在于提出了一种基于集成学习的主动学习框架Panama,通过梯度优化策略最大化模型间的分歧,从而识别最具信息量的样本点;该框架结合LSTM与类似WaveNet的架构,实现了端到端的参数化建模,在仅需75个数据点的情况下即可达到与非参数模型NAM相当的主观听觉质量(MUSHRA测试结果)。

链接: https://arxiv.org/abs/2509.26564
作者: Florian Grötschla,Longxiang Jiao,Luca A. Lanzendörfer,Roger Wattenhofer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce Panama, an active learning framework to train parametric guitar amp models end-to-end using a combination of an LSTM model and a WaveNet-like architecture. With Panama, one can create a virtual amp by recording samples that are determined through an ensemble-based active learning strategy to minimize the amount of datapoints needed (i.e., amp knob settings). Our strategy uses gradient-based optimization to maximize the disagreement among ensemble models, in order to identify the most informative datapoints. MUSHRA listening tests reveal that, with 75 datapoints, our models are able to match the perceptual quality of NAM, the leading open-source non-parametric amp modeler.
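
下面的小例子示意"用梯度上升最大化集成模型间的分歧来挑选下一组旋钮设置"的主动学习策略(集成结构、特征维度与旋钮数均为虚构,并非论文 Panama 的官方实现):

```python
import torch
import torch.nn as nn

# 玩具集成:若干小 MLP 把 (信号特征, 旋钮设置) 映射为输出
ensemble = [nn.Sequential(nn.Linear(8 + 3, 32), nn.Tanh(), nn.Linear(32, 1))
            for _ in range(4)]

def disagreement(knobs, signal):
    """集成成员预测的方差,作为"该旋钮设置信息量"的度量。"""
    inp = torch.cat([signal, knobs.expand(signal.shape[0], -1)], dim=-1)
    preds = torch.stack([m(inp) for m in ensemble], 0)   # [E, B, 1]
    return preds.var(dim=0).mean()

signal = torch.randn(64, 8)                               # 固定的一段输入信号特征
knobs = torch.rand(1, 3, requires_grad=True)               # 3 个旋钮,范围 [0, 1]
opt = torch.optim.Adam([knobs], lr=0.05)
for _ in range(100):                                       # 梯度上升最大化分歧
    opt.zero_grad()
    (-disagreement(knobs, signal)).backward()
    opt.step()
    knobs.data.clamp_(0, 1)
print(knobs.detach())   # 下一组最值得实际录音采集的旋钮设置(示意)
```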
zh

[AI-7] HilbertA: Hilbert Attention for Image Generation with Diffusion Models

【速读】:该论文旨在解决扩散Transformer(Diffusion Transformers)中稀疏注意力机制设计面临的两个核心挑战:如何在保持二维空间局部性的同时实现GPU高效计算,这一权衡目前现有方法难以兼顾。现有方法虽能强制实现二维空间局部性,但常导致非聚合内存访问(uncoalesced memory access),从而降低GPU利用率。解决方案的关键在于提出HilbertA,一种2D感知且GPU高效的稀疏注意力机制:首先通过沿希尔伯特曲线(Hilbert curve)重排序图像标记(image tokens),在保留空间邻近性的前提下获得连续的内存布局;其次采用跨层滑动调度(sliding schedule)实现长距离信息传播,避免重复或非聚合内存访问;此外引入一个小型中心共享区域以增强跨Tile通信和位置感知能力。该方案在Triton中实现,在Flux.1-dev上显著加速生成高分辨率图像(如1024×1024时提速2.3倍,2048×2048时达4.17倍),同时保持或超越基线图像质量,验证了硬件对齐的二维稀疏注意力在高分辨率图像生成中的可行性。

链接: https://arxiv.org/abs/2509.26538
作者: Shaoyi Zheng,Wenbo Lu,Yuxuan Xia,Haomin Liu,Shengjie Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Designing sparse attention for diffusion transformers requires reconciling two-dimensional spatial locality with GPU efficiency, a trade-off that current methods struggle to achieve. Existing approaches enforce two-dimensional spatial locality but often incur uncoalesced memory access. We present HilbertA, a 2D-aware and GPU-efficient sparse attention mechanism. HilbertA reorders image tokens along Hilbert curves to achieve a contiguous memory layout while preserving spatial neighborhoods, and employs a sliding schedule across layers to enable long-range information propagation without repeated or uncoalesced memory access. To further enhance cross-tile communication and positional awareness, HilbertA introduces a small central shared region. Implemented in Triton, HilbertA delivers comparable image quality with significant acceleration over prior methods on Flux.1-dev, demonstrating the feasibility of hardware-aligned two-dimensional sparse attention for high-resolution image generation. HilbertA delivers attention speedups of 2.3× when generating 1024×1024 images, and up to 4.17× at 2048×2048, while achieving image quality comparable to or surpassing baselines.
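
希尔伯特曲线重排是该方法在保持二维局部性的同时获得连续内存布局的关键。下面给出经典的 (x, y) 坐标到希尔伯特序号的转换以及 token 重排下标的纯 Python 草图(n 须为 2 的幂;这只是通用算法示意,不涉及论文的 Triton 核实现):

```python
def xy2d(n, x, y):
    """把 n×n 网格(n 为 2 的幂)上的坐标 (x, y) 映射为希尔伯特曲线上的序号。"""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # 旋转/翻转当前象限
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def hilbert_order(n):
    """返回把按行排列的 token 重排为希尔伯特顺序的下标列表。"""
    coords = [(x, y) for y in range(n) for x in range(n)]   # 行优先编号
    return sorted(range(n * n), key=lambda i: xy2d(n, *coords[i]))

order = hilbert_order(4)
print(order)   # 前几项为 [0, 1, 5, 4, 8, 12, 13, 9, ...],相邻序号在 2D 上也相邻
```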
zh

[AI-8] Rearchitecting Datacenter Lifecycle for AI: A TCO-Driven Framework

【速读】:该论文旨在解决AI数据中心在生命周期管理中面临的高总拥有成本(Total Cost of Ownership, TCO)问题,尤其是在大语言模型(Large Language Models, LLMs)快速迭代背景下,传统针对通用工作负载设计的数据中心生命周期管理策略难以适应AI模型演进、资源需求增长及硬件异构性带来的挑战。解决方案的关键在于提出一个全生命周期的协同优化框架,涵盖建设、硬件更新和运行三个阶段,通过统筹功率、冷却与网络资源配置、匹配硬件发展趋势的刷新策略以及运行时软件优化,实现跨阶段决策的协同与系统级成本最小化;该框架可将TCO降低高达40%,并为未来AI数据中心的可持续运营提供系统性指导。

链接: https://arxiv.org/abs/2509.26534
作者: Jovan Stojkovic,Chaojie Zhang,Íñigo Goiri,Ricardo Bianchini
机构: 未知
类目: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:The rapid rise of large language models (LLMs) has been driving an enormous demand for AI inference infrastructure, mainly powered by high-end GPUs. While these accelerators offer immense computational power, they incur high capital and operational costs due to frequent upgrades, dense power consumption, and cooling demands, making total cost of ownership (TCO) for AI datacenters a critical concern for cloud providers. Unfortunately, traditional datacenter lifecycle management (designed for general-purpose workloads) struggles to keep pace with AI’s fast-evolving models, rising resource needs, and diverse hardware profiles. In this paper, we rethink the AI datacenter lifecycle scheme across three stages: building, hardware refresh, and operation. We show how design choices in power, cooling, and networking provisioning impact long-term TCO. We also explore refresh strategies aligned with hardware trends. Finally, we use operation software optimizations to reduce cost. While these optimizations at each stage yield benefits, unlocking the full potential requires rethinking the entire lifecycle. Thus, we present a holistic lifecycle management framework that coordinates and co-optimizes decisions across all three stages, accounting for workload dynamics, hardware evolution, and system aging. Our system reduces the TCO by up to 40% over traditional approaches. Using our framework we provide guidelines on how to manage AI datacenter lifecycle for the future.
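
作为摘要中"建设、刷新、运行三阶段共同决定 TCO"的一个粗粒度示意,下面用几行 Python 比较不同硬件刷新周期下的年化 TCO(所有单价、功耗、PUE 等数字纯属虚构,仅演示计算结构,并非论文的成本模型):

```python
def annual_tco(gpu_capex, n_gpus, lifetime_yrs, power_kw_per_gpu,
               pue, price_per_kwh, opex_ratio=0.10):
    """极简年化 TCO 估算:摊销资本支出 + 电费(含 PUE)+ 其他运维开销。"""
    capex_per_year = gpu_capex * n_gpus / lifetime_yrs
    energy_kwh = n_gpus * power_kw_per_gpu * pue * 24 * 365
    power_cost = energy_kwh * price_per_kwh
    other_opex = opex_ratio * capex_per_year
    return capex_per_year + power_cost + other_opex

# 比较两种刷新策略:3 年刷新 vs 5 年刷新(数值仅为示意)
for lifetime in (3, 5):
    tco = annual_tco(gpu_capex=30000, n_gpus=1000, lifetime_yrs=lifetime,
                     power_kw_per_gpu=0.7, pue=1.3, price_per_kwh=0.08)
    print(lifetime, f"{tco / 1e6:.1f}M USD/yr")
```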
zh

[AI-9] AP: Two-Stage Adaptive Personalization of Multi-task and Multi-Modal Foundation Models in Federated Learning

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中模型个性化不足的问题,尤其是在客户端数据、任务和模态(modality)高度异构的场景下,如何有效对具备多任务与多模态特性的基础模型(foundation model)进行微调以实现个性化。现有方法多聚焦于通用个性化联邦学习(Personalized Federated Learning, PFL),但缺乏针对多模态、多任务基础模型的细粒度适配机制。其解决方案的关键在于提出两阶段自适应个性化(Two-Stage Adaptive Personalization, TAP):第一阶段利用客户端与服务器模型架构不匹配的特性,选择性地执行模型替换操作以适配本地任务;第二阶段通过后联邦知识蒸馏(post-FL knowledge distillation)保留全局有用知识的同时保障个性化性能。此外,作者首次对服务器端基于模态-任务对(modality-task pair)架构的模型收敛性进行了理论分析,揭示了随着模态-任务对数量增加,模型对所有任务的适配能力下降的趋势。

链接: https://arxiv.org/abs/2509.26524
作者: Seohyun Lee,Wenzhi Fang,Dong-Jun Han,Seyyedali Hosseinalipour,Christopher G. Brinton
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Federated Learning (FL), despite demonstrating impressive capabilities in the training of multiple models in a decentralized manner, has been shown to produce a final model not necessarily well-suited to the needs of each client. While extensive work has been conducted on how to create tailored personalized models, called Personalized Federated Learning (PFL), less attention has been given to personalization via fine-tuning of foundation models with multi-task and multi-modal properties. Moreover, there exists a lack of understanding in the literature on how to fine-tune and personalize such models in a setting that is heterogeneous across clients not only in data, but also in tasks and modalities. To address this gap in the literature, we propose TAP (Two-Stage Adaptive Personalization), which (i) leverages mismatched model architectures between the clients and server to selectively conduct replacement operations when it benefits a client’s local tasks and (ii) engages in post-FL knowledge distillation for capturing beneficial general knowledge without compromising personalization. We also introduce the first convergence analysis of the server model under its modality-task pair architecture, and demonstrate that as the number of modality-task pairs increases, its ability to cater to all tasks suffers. Through extensive experiments, we demonstrate the effectiveness of our proposed algorithm across a variety of datasets and tasks in comparison to a multitude of baselines. Implementation code is publicly available at this https URL.
zh

[AI-10] MUSE-Explainer: Counterfactual Explanations for Symbolic Music Graph Classification Models

【速读】:该论文旨在解决深度学习模型在符号音乐分析(symbolic music analysis)中缺乏可解释性的问题,即现有研究多关注模型性能而忽视对决策过程的清晰阐释。其解决方案的关键在于提出MUSE-Explainer方法,通过生成对抗性解释(counterfactual explanations),在音乐谱图(musical score graphs)上施加小且有意义的修改,从而改变模型预测结果的同时保持音乐上的合理性,且解释结果具有人类友好性,并能与标准音乐工具(如Verovio)进行可视化呈现。

链接: https://arxiv.org/abs/2509.26521
作者: Baptiste Hilaire,Emmanouil Karystinaios,Gerhard Widmer
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: Accepted at the 17th International Symposium on Computer Music Multidisciplinary Research (CMMR) 2025

点击查看摘要

Abstract:Interpretability is essential for deploying deep learning models in symbolic music analysis, yet most research emphasizes model performance over explanation. To address this, we introduce MUSE-Explainer, a new method that helps reveal how music Graph Neural Network models make decisions by providing clear, human-friendly explanations. Our approach generates counterfactual explanations by making small, meaningful changes to musical score graphs that alter a model’s prediction while ensuring the results remain musically coherent. Unlike existing methods, MUSE-Explainer tailors its explanations to the structure of musical data and avoids unrealistic or confusing outputs. We evaluate our method on a music analysis task and show it offers intuitive insights that can be visualized with standard music tools such as Verovio.
zh

[AI-11] The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在实现通用推理能力(Universal Reasoning Models)过程中面临的核心障碍——即缺乏像生物大脑那样具备时间泛化能力(generalizing over time)的结构机制。传统Transformer架构虽性能优异,但其可解释性弱、与生物神经网络的关联不足,难以模拟人类认知中的动态记忆与概念学习过程。解决方案的关键在于提出一种名为“Dragon Hatchling”(BDH)的新架构,它基于尺度无标度(scale-free)的生物启发式神经元粒子网络,采用局部相互作用机制,并通过脉冲神经元(spiking neurons)和Hebbian学习规则实现工作记忆的动态构建,使特定突触在处理特定语义概念时增强连接强度。BDH兼具Transformer类的性能表现与内在可解释性,其状态空间表示超越了对单个神经元或参数的解释,且支持GPU高效计算,从而为连接神经科学与AI提供了可落地的桥梁。

链接: https://arxiv.org/abs/2509.26507
作者: Adrian Kosowski,Przemysław Uznański,Jan Chorowski,Zuzanna Stamirowska,Michał Bartoszkiewicz
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: Code available at: this https URL Accompanying blog: this https URL

点击查看摘要

Abstract:The relationship between computing systems and the brain has served as motivation for pioneering theoreticians since John von Neumann and Alan Turing. Uniform, scale-free biological networks, such as the brain, have powerful properties, including generalizing over time, which is the main barrier for Machine Learning on the path to Universal Reasoning Models. We introduce 'Dragon Hatchling' (BDH), a new Large Language Model architecture based on a scale-free biologically inspired network of n locally-interacting neuron particles. BDH couples strong theoretical foundations and inherent interpretability without sacrificing Transformer-like performance. BDH is a practical, performant state-of-the-art attention-based state space sequence learning architecture. In addition to being a graph model, BDH admits a GPU-friendly formulation. It exhibits Transformer-like scaling laws: empirically BDH rivals GPT2 performance on language and translation tasks, at the same number of parameters (10M to 1B), for the same training data. BDH can be represented as a brain model. The working memory of BDH during inference entirely relies on synaptic plasticity with Hebbian learning using spiking neurons. We confirm empirically that specific, individual synapses strengthen connection whenever BDH hears or reasons about a specific concept while processing language inputs. The neuron interaction network of BDH is a graph of high modularity with heavy-tailed degree distribution. The BDH model is biologically plausible, explaining one possible mechanism which human neurons could use to achieve speech. BDH is designed for interpretability. Activation vectors of BDH are sparse and positive. We demonstrate monosemanticity in BDH on language tasks. Interpretability of state, which goes beyond interpretability of neurons and model parameters, is an inherent feature of the BDH architecture.
zh
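
To make the Hebbian working-memory mechanism described in the abstract more concrete, here is a minimal sketch assuming binary spiking activations, a plain outer-product Hebbian rule, and a slow decay term; the function name, learning rate, and decay are illustrative choices, not the BDH parameterization.

```python
import numpy as np

def hebbian_step(synapses, pre_spikes, post_spikes, lr=0.01, decay=0.001):
    """One Hebbian update: synapses between co-active neurons strengthen,
    while all synapses slowly decay toward zero (hypothetical parameters)."""
    # Outer product of spike vectors marks co-active (pre, post) pairs.
    coactivation = np.outer(pre_spikes, post_spikes)
    return (1.0 - decay) * synapses + lr * coactivation

# Toy usage: 5 pre-synaptic and 5 post-synaptic neurons with sparse activity.
rng = np.random.default_rng(0)
W = np.zeros((5, 5))
for _ in range(100):
    pre = (rng.random(5) < 0.3).astype(float)   # sparse, positive activations
    post = (rng.random(5) < 0.3).astype(float)
    W = hebbian_step(W, pre, post)
print(W.round(3))
```

In BDH this kind of state lives in the synapses themselves rather than in a separate KV cache, which is what makes the state interpretable at the level of concept-specific connections.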

[AI-12] SCUBA: Salesforce Computer Use Benchmark

【速读】:该论文旨在解决企业级软件自动化任务中计算机使用代理(computer-use agents)性能评估与提升的问题,特别是针对Salesforce平台上的客户关系管理(CRM)工作流。现有基准测试难以真实反映复杂业务场景下的代理能力,导致研究进展受限。解决方案的关键在于提出SCUBA基准,其包含300个源自真实用户访谈的任务实例,覆盖管理员、销售代表和服务代理三类角色,涵盖企业级关键能力如UI导航、数据操作、流程自动化、信息检索和故障排查;并通过Salesforce沙箱环境支持并行执行和细粒度评估指标,实现对代理在真实业务场景中的表现进行可靠量化。实验表明,基于闭源模型的代理在零样本设置下仍能达到39%的任务成功率,而开源模型则低于5%,且引入示范增强后任务成功率可提升至50%,同时降低13%的时间成本和16%的资源消耗,凸显了构建高可靠性企业级代理系统的潜力与挑战。

链接: https://arxiv.org/abs/2509.26506
作者: Yutong Dai,Krithika Ramakrishnan,Jing Gu,Matthew Fernandez,Yanqi Luo,Viraj Prabhu,Zhenyu Hu,Silvio Savarese,Caiming Xiong,Zeyuan Chen,Ran Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce SCUBA, a benchmark designed to evaluate computer-use agents on customer relationship management (CRM) workflows within the Salesforce platform. SCUBA contains 300 task instances derived from real user interviews, spanning three primary personas, platform administrators, sales representatives, and service agents. The tasks test a range of enterprise-critical abilities, including Enterprise Software UI navigation, data manipulation, workflow automation, information retrieval, and troubleshooting. To ensure realism, SCUBA operates in Salesforce sandbox environments with support for parallel execution and fine-grained evaluation metrics to capture milestone progress. We benchmark a diverse set of agents under both zero-shot and demonstration-augmented settings. We observed huge performance gaps in different agent design paradigms and gaps between the open-source model and the closed-source model. In the zero-shot setting, open-source model powered computer-use agents that have strong performance on related benchmarks like OSWorld only have less than 5% success rate on SCUBA, while methods built on closed-source models can still have up to 39% task success rate. In the demonstration-augmented settings, task success rates can be improved to 50% while simultaneously reducing time and costs by 13% and 16%, respectively. These findings highlight both the challenges of enterprise tasks automation and the promise of agentic solutions. By offering a realistic benchmark with interpretable evaluation, SCUBA aims to accelerate progress in building reliable computer-use agents for complex business software ecosystems.
zh

[AI-13] OffTopicEval: When Large Language Models Enter the Wrong Chat Almost Always!

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)在特定应用场景下的**操作安全性(operational safety)**问题,即模型能否在其被赋予的特定任务中正确识别并拒绝无关或潜在有害的用户请求。不同于通用危害(如自我伤害或伤害他人)的关注点,企业更关心LLM代理是否能在其预期用途中保持安全行为。解决方案的关键在于提出了一种新的评估框架——OffTopicEval,用于量化模型在一般和特定代理场景下的操作安全性,并引入两种基于提示的引导方法:查询锚定(Query grounding, Q-ground)与系统提示锚定(System-prompt grounding, P-ground),通过强化模型对域外(Out-of-Distribution, OOD)请求的拒绝能力,显著提升其操作安全性,其中P-ground方法在多个模型上实现了最高达41%的性能提升。

链接: https://arxiv.org/abs/2509.26495
作者: Jingdi Lei,Varun Gumma,Rishabh Bhardwaj,Seok Min Lim,Chuan Li,Amir Zadeh,Soujanya Poria
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Model (LLM) safety is one of the most pressing challenges for enabling wide-scale deployment. While most studies and global discussions focus on generic harms, such as models assisting users in harming themselves or others, enterprises face a more fundamental concern: whether LLM-based agents are safe for their intended use case. To address this, we introduce operational safety, defined as an LLM’s ability to appropriately accept or refuse user queries when tasked with a specific purpose. We further propose OffTopicEval, an evaluation suite and benchmark for measuring operational safety both in general and within specific agentic use cases. Our evaluations on six model families comprising 20 open-weight LLMs reveal that while performance varies across models, all of them remain highly operationally unsafe. Even the strongest models – Qwen-3 (235B) with 77.77% and Mistral (24B) with 79.96% – fall far short of reliable operational safety, while GPT models plateau in the 62–73% range, Phi achieves only mid-level scores (48–70%), and Gemma and Llama-3 collapse to 39.53% and 23.84%, respectively. While operational safety is a core model alignment issue, to suppress these failures, we propose prompt-based steering methods: query grounding (Q-ground) and system-prompt grounding (P-ground), which substantially improve OOD refusal. Q-ground provides consistent gains of up to 23%, while P-ground delivers even larger boosts, raising Llama-3.3 (70B) by 41% and Qwen-3 (30B) by 27%. These results highlight both the urgent need for operational safety interventions and the promise of prompt-based steering as a first step toward more reliable LLM-based agents.
zh
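
The abstract does not give the exact wording of the grounding prompts, so the following is only a hypothetical sketch of what query grounding (Q-ground) and system-prompt grounding (P-ground) wrappers could look like; all strings and function names are assumptions for illustration.

```python
def p_ground(system_prompt: str) -> str:
    """Hypothetical system-prompt grounding: append an explicit refusal rule
    for out-of-scope queries to the agent's system prompt."""
    rule = (
        "Only answer requests that fall within the purpose described above. "
        "If a request is unrelated to this purpose, refuse and briefly say why."
    )
    return f"{system_prompt.strip()}\n\n{rule}"

def q_ground(purpose: str, user_query: str) -> str:
    """Hypothetical query grounding: restate the agent's purpose next to the
    query so the model judges relevance before answering."""
    return (
        f"Agent purpose: {purpose}\n"
        f"User query: {user_query}\n"
        "First decide whether the query is within the agent's purpose; "
        "if it is not, refuse."
    )

print(p_ground("You are a banking FAQ assistant."))
```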

[AI-14] Combining Knowledge Graphs and NLP to Analyze Instant Messaging Data in Criminal Investigations

【速读】:该论文旨在解决刑事调查中对即时通讯应用(如WhatsApp)消息数据进行分析时面临的高劳动强度问题。解决方案的关键在于结合知识图谱(Knowledge Graph)与自然语言处理(Natural Language Processing, NLP)模型,通过语义增强技术对从嫌疑人手机中收集的数据进行结构化建模,包括提取消息数据、构建知识图谱、生成语音消息的转录文本,并采用端到端实体抽取方法对数据进行标注。此外,系统提供两种信息洞察途径:基于图查询与可视化的分析方式和基于语义搜索的方法,同时确保用户可追溯至原始数据以验证信息真实性,从而提升执法部门在实际案件中的分析效率与准确性。

链接: https://arxiv.org/abs/2509.26487
作者: Riccardo Pozzi,Valentina Barbera,Renzo Alva Principe,Davide Giardini,Riccardo Rubini,Matteo Palmonari
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Criminal investigations often involve the analysis of messages exchanged through instant messaging apps such as WhatsApp, which can be an extremely effort-consuming task. Our approach integrates knowledge graphs and NLP models to support this analysis by semantically enriching data collected from suspects’ mobile phones, and help prosecutors and investigators search into the data and get valuable insights. Our semantic enrichment process involves extracting message data and modeling it using a knowledge graph, generating transcriptions of voice messages, and annotating the data using an end-to-end entity extraction approach. We adopt two different solutions to help users get insights into the data, one based on querying and visualizing the graph, and one based on semantic search. The proposed approach ensures that users can verify the information by accessing the original data. While we report about early results and prototypes developed in the context of an ongoing project, our proposal has undergone practical applications with real investigation data. As a consequence, we had the chance to interact closely with prosecutors, collecting positive feedback but also identifying interesting opportunities as well as promising research directions to share with the research community.
zh

[AI-15] TVS Sidekick: Challenges and Practical Insights from Deploying Large Language Models in the Enterprise

【速读】:该论文旨在解决企业在快速采用人工智能(Artificial Intelligence, AI)过程中面临的伦理、合规与社会技术挑战,尤其是在缺乏统一的AI治理框架和共享伦理基础设施的情况下,如何实现AI在企业内部的安全、负责任部署。其解决方案的关键在于通过一个实际案例——TVS Supply Chain Solutions公司开发基于大语言模型(Large Language Models, LLMs)的AI助手,系统性地识别并应对伦理风险、监管要求及组织内部的技术接受度问题,从而推动AI治理框架在企业场景中的落地与实践。

链接: https://arxiv.org/abs/2509.26482
作者: Paula Reyero Lobo,Kevin Johnson,Bill Buchanan,Matthew Shardlow,Ashley Williams,Samuel Attwood
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at EthicalLLMs@RANLP2025

点击查看摘要

Abstract:Many enterprises are increasingly adopting Artificial Intelligence (AI) to make internal processes more competitive and efficient. In response to public concern and new regulations for the ethical and responsible use of AI, implementing AI governance frameworks could help to integrate AI within organisations and mitigate associated risks. However, the rapid technological advances and lack of shared ethical AI infrastructures creates barriers to their practical adoption in businesses. This paper presents a real-world AI application at TVS Supply Chain Solutions, reporting on the experience developing an AI assistant underpinned by large language models and the ethical, regulatory, and sociotechnical challenges in deployment for enterprise use.
zh

[AI-16] The Average Patient Fallacy

【速读】:该论文旨在解决机器学习在医学领域中因训练数据频率加权而导致的“平均患者谬误”(average patient fallacy)问题,即模型过度优化于常见病例而忽视罕见但临床重要的案例,从而影响精准医疗的实现。其核心解决方案在于引入一系列可操作的改进机制:包括罕见病例性能差距(Rare Case Performance Gap)、罕见病例校准误差(Rare Case Calibration Error)、基于流行率效用定义的罕见性概念,以及体现临床优先级的加权目标函数,通过结构化讨论确定权重选择策略,以确保AI系统能够有效识别和响应具有高临床意义的异常情况。

链接: https://arxiv.org/abs/2509.26474
作者: Alaleh Azhir,Shawn N. Murphy,Hossein Estiri
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Machine learning in medicine is typically optimized for population averages. This frequency weighted training privileges common presentations and marginalizes rare yet clinically critical cases, a bias we call the average patient fallacy. In mixture models, gradients from rare cases are suppressed by prevalence, creating a direct conflict with precision medicine. Clinical vignettes in oncology, cardiology, and ophthalmology show how this yields missed rare responders, delayed recognition of atypical emergencies, and underperformance on vision-threatening variants. We propose operational fixes: Rare Case Performance Gap, Rare Case Calibration Error, a prevalence utility definition of rarity, and clinically weighted objectives that surface ethical priorities. Weight selection should follow structured deliberation. AI in medicine must detect exceptional cases because of their significance.
zh
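
A minimal sketch of how a rare-case performance gap and a prevalence-weighted objective could be computed; the abstract names these metrics but does not give their exact definitions, so the formulas, function names, and parameter `alpha` below are assumptions for illustration.

```python
import numpy as np

def rare_case_performance_gap(y_true, y_pred, is_rare):
    """Gap between accuracy on common cases and accuracy on rare cases
    (one plausible reading of the proposed metric)."""
    correct = (y_true == y_pred)
    return correct[~is_rare].mean() - correct[is_rare].mean()

def prevalence_weighted_loss(losses, prevalence, alpha=1.0):
    """Reweight per-subgroup losses inversely with prevalence so rare cases
    are not drowned out by frequent ones; alpha controls the strength."""
    weights = (1.0 / np.asarray(prevalence)) ** alpha
    weights = weights / weights.sum()
    return float(np.sum(weights * np.asarray(losses)))

y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 0])
is_rare = np.array([False, False, True, False, False, True])
print(rare_case_performance_gap(y_true, y_pred, is_rare))
print(prevalence_weighted_loss([0.2, 0.9], [0.95, 0.05]))
```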

[AI-17] STaR-Attack: A Spatio-Temporal and Narrative Reasoning Attack Framework for Unified Multimodal Understanding and Generation Models

【速读】:该论文旨在解决统一多模态理解与生成模型(Unified Multimodal Understanding and Generation Models, UMMs)中存在的安全漏洞问题,特别是由生成-理解耦合机制引发的跨模态生成注入攻击(Cross-Modal Generative Injection, CMGI)风险。现有针对恶意指令的攻击方法通常局限于单一模态且依赖提示重写导致语义漂移,未能充分挖掘UMMs特有的脆弱性。解决方案的关键在于提出STaR-Attack框架,其核心创新是利用三幕剧叙事理论构造一个在时空上下文中与目标查询强相关的恶意事件,并将该事件隐藏为剧情高潮,通过前两轮利用模型的生成能力构建前后场景图像,再借助理解能力进行图像问答博弈,使原始恶意问题嵌入于良性候选中,迫使模型在叙述语境下选择并回答该恶意问题,从而实现无语义漂移的多轮越狱攻击。

链接: https://arxiv.org/abs/2509.26473
作者: Shaoxiong Guo,Tianyi Du,Lijun Li,Yuyao Wu,Jie Li,Jing Shao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Unified Multimodal understanding and generation Models (UMMs) have demonstrated remarkable capabilities in both understanding and generation tasks. However, we identify a vulnerability arising from the generation-understanding coupling in UMMs. The attackers can use the generative function to craft an information-rich adversarial image and then leverage the understanding function to absorb it in a single pass, which we call Cross-Modal Generative Injection (CMGI). Current attack methods on malicious instructions are often limited to a single modality while also relying on prompt rewriting with semantic drift, leaving the unique vulnerabilities of UMMs unexplored. We propose STaR-Attack, the first multi-turn jailbreak attack framework that exploits unique safety weaknesses of UMMs without semantic drift. Specifically, our method defines a malicious event that is strongly correlated with the target query within a spatio-temporal context. Using the three-act narrative theory, STaR-Attack generates the pre-event and the post-event scenes while concealing the malicious event as the hidden climax. When executing the attack strategy, the opening two rounds exploit the UMM’s generative ability to produce images for these scenes. Subsequently, an image-based question guessing and answering game is introduced by exploiting the understanding capability. STaR-Attack embeds the original malicious question among benign candidates, forcing the model to select and answer the most relevant one given the narrative context. Extensive experiments show that STaR-Attack consistently surpasses prior approaches, achieving up to 93.06% ASR on Gemini-2.0-Flash and surpasses the strongest prior baseline, FlipAttack. Our work uncovers a critical yet underdeveloped vulnerability and highlights the need for safety alignments in UMMs.
zh

[AI-18] Transformer Classification of Breast Lesions: The BreastDCEDL_AMBL Benchmark Dataset and 0.92 AUC Baseline

【速读】:该论文旨在解决乳腺动态对比增强磁共振成像(Dynamic Contrast-Enhanced MRI, DCE-MRI)在临床应用中因特异性不足导致假阳性率高、进而引发过多不必要的活检问题。其解决方案的关键在于提出了一种基于Transformer架构的自动分类框架,采用SegFormer模型实现病灶级别的分类,取得了0.92的AUC值,并在患者层面达到100%敏感性和67%特异性,从而有望减少三分之一的非必要活检且不遗漏恶性病变。此外,研究构建了标准化深度学习数据集BreastDCEDL_AMBL,填补了现有公开数据集中缺乏良性病灶标注的空白,为良性与恶性病变分类研究提供了可复现的基准。

链接: https://arxiv.org/abs/2509.26440
作者: Naomi Fridman(Ariel University),Anat Goldstein(Ariel University)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Breast magnetic resonance imaging is a critical tool for cancer detection and treatment planning, but its clinical utility is hindered by poor specificity, leading to high false-positive rates and unnecessary biopsies. This study introduces a transformer-based framework for automated classification of breast lesions in dynamic contrast-enhanced MRI, addressing the challenge of distinguishing benign from malignant findings. We implemented a SegFormer architecture that achieved an AUC of 0.92 for lesion-level classification, with 100% sensitivity and 67% specificity at the patient level - potentially eliminating one-third of unnecessary biopsies without missing malignancies. The model quantifies malignant pixel distribution via semantic segmentation, producing interpretable spatial predictions that support clinical decision-making. To establish reproducible benchmarks, we curated BreastDCEDL_AMBL by transforming The Cancer Imaging Archive’s AMBL collection into a standardized deep learning dataset with 88 patients and 133 annotated lesions (89 benign, 44 malignant). This resource addresses a key infrastructure gap, as existing public datasets lack benign lesion annotations, limiting benign-malignant classification research. Training incorporated an expanded cohort of over 1,200 patients through integration with BreastDCEDL datasets, validating transfer learning approaches despite primary tumor-only annotations. Public release of the dataset, models, and evaluation protocols provides the first standardized benchmark for DCE-MRI lesion classification, enabling methodological advancement toward clinical deployment.
zh

[AI-19] ACT: Agentic Classification Tree

【速读】:该论文旨在解决高风险场景下人工智能系统决策透明性、可解释性和可审计性的需求问题,尤其是在处理非结构化文本数据时,传统决策树方法受限于仅能处理结构化表格数据,而主流的大语言模型(LLM)虽能处理文本输入,但其基于提示词(prompting)的推理方式缺乏可控性和可验证性,难以保障可信行为。解决方案的关键在于提出代理分类树(Agentic Classification Tree, ACT),通过将每个决策节点建模为自然语言问题,并利用信息熵(impurity-based evaluation)评估分裂质量,结合TextGrad框架获取LLM反馈进行迭代优化,从而在保持决策路径透明可解释的同时,实现对文本输入的有效分类。

链接: https://arxiv.org/abs/2509.26433
作者: Vincent Grari,Tim Arni,Thibault Laugel,Sylvain Lamprier,James Zou,Marcin Detyniecki
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages, 6 figures

点击查看摘要

Abstract:When used in high-stakes settings, AI systems are expected to produce decisions that are transparent, interpretable, and auditable, a requirement increasingly expected by regulations. Decision trees such as CART provide clear and verifiable rules, but they are restricted to structured tabular data and cannot operate directly on unstructured inputs such as text. In practice, large language models (LLMs) are widely used for such data, yet prompting strategies such as chain-of-thought or prompt optimization still rely on free-form reasoning, limiting their ability to ensure trustworthy behaviors. We present the Agentic Classification Tree (ACT), which extends decision-tree methodology to unstructured inputs by formulating each split as a natural-language question, refined through impurity-based evaluation and LLM feedback via TextGrad. Experiments on text benchmarks show that ACT matches or surpasses prompting-based baselines while producing transparent and interpretable decision paths.
zh
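
A minimal sketch of the impurity-based evaluation of a candidate split: in ACT the yes/no answers would come from an LLM answering a natural-language question about each sample, whereas here they are supplied as a toy array; the Gini criterion below is the standard CART choice and is assumed, not confirmed, to be the one used.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(labels, answers):
    """Weighted impurity after splitting on yes/no answers to a candidate
    natural-language question."""
    labels, answers = np.asarray(labels), np.asarray(answers)
    total = len(labels)
    score = 0.0
    for branch in (True, False):
        mask = answers == branch
        if mask.any():
            score += mask.sum() / total * gini(labels[mask])
    return score

# Toy example: a question whose answers separate the two classes fairly well.
labels  = np.array([1, 1, 1, 0, 0, 0, 1, 0])
answers = np.array([True, True, True, False, False, False, True, True])
print(gini(labels), split_impurity(labels, answers))
```

The tree-growing loop would generate several candidate questions, score each with `split_impurity`, and keep the one with the lowest weighted impurity, refining the question text via LLM feedback.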

[AI-20] AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size

【速读】:该论文旨在解决传统半自回归(semi-autoregressive, semi-AR)解码方法中固定块大小(fixed block size)带来的两个根本性问题:一是“延迟解码开销”(late decoding overhead),即高置信度token在当前块外被无谓延迟解码;二是“过早解码错误”(premature decoding error),即低置信度token在当前块内被过早确定,导致错误传播。解决方案的关键在于通过统计分析去噪过程中的置信度动态变化,识别出一个编码局部语义结构的“波动带”(volatility band, VB)区域,并基于此设计了一个无需训练、可即插即用的调度器AdaBlock-dLLM,该调度器在推理时动态调整块边界以对齐语义步长,从而实现更准确且高效的解码。

链接: https://arxiv.org/abs/2509.26432
作者: Guanxi Lu, Hao (Mark)Chen,Yuto Karashima,Zhican Wang,Daichi Fujiki,Hongxiang Fan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint. Under review

点击查看摘要

Abstract:Diffusion-based large language models (dLLMs) are gaining attention for their inherent capacity for parallel decoding, offering a compelling alternative to autoregressive LLMs. Among various decoding strategies, blockwise semi-autoregressive (semi-AR) approaches are widely adopted due to their natural support for KV caching and their favorable accuracy-speed trade-off. However, this paper identifies two fundamental limitations in the conventional semi-AR decoding approach that applies a fixed block size: i) late decoding overhead, where the unmasking of high-confidence tokens outside the current block is unnecessarily delayed, and ii) premature decoding error, where low-confidence tokens inside the current block are committed too early, leading to incorrect tokens. This paper presents the first systematic investigation challenging the fixed block size assumption in semi-AR decoding. Through a statistical analysis of confidence dynamics during the denoising process, we identify a volatility band (VB) region during dLLM decoding, which encodes local semantic structure and can be used to guide adaptive block sizing. Leveraging these insights, we introduce AdaBlock-dLLM, a training-free, plug-and-play scheduler that adaptively aligns block boundaries with semantic steps by adjusting block size during runtime. Extensive experiments across diverse benchmarks show that AdaBlock-dLLM achieves up to 5.3% accuracy improvement under the same throughput budget. Beyond inference-time optimization, we hope our semantics-aware adaptive scheduling approach and confidence-based analysis will inspire future training strategies for dLLMs.
zh
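
A simplified sketch of confidence-driven block sizing in the spirit of AdaBlock-dLLM: the block ends where predicted-token confidence drops, rather than at a fixed size. The threshold, size bounds, and function name are assumptions; the paper's scheduler derives its boundaries from the volatility-band analysis rather than a single cutoff.

```python
import numpy as np

def adaptive_block_size(confidences, threshold=0.9, min_size=2, max_size=16):
    """Grow the next semi-AR block until token confidence falls below a
    threshold, so block boundaries track local semantic structure."""
    size = min_size
    for conf in confidences[min_size:max_size]:
        if conf < threshold:
            break
        size += 1
    return size

# Toy confidences for the not-yet-committed positions of the current window.
conf = np.array([0.98, 0.97, 0.95, 0.93, 0.62, 0.55, 0.91, 0.88])
print(adaptive_block_size(conf))  # stops before the low-confidence region
```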

[AI-21] Ascent Fails to Forget NEURIPS2025

【速读】:该论文试图解决的问题是:基于梯度上升的无约束优化方法在机器遗忘(machine unlearning)任务中经常失效,尽管这类方法常被假定能够通过调整模型参数来“遗忘”特定数据集。其关键在于揭示并证明了遗忘集(forget set)与保留集(retain set)之间存在的统计依赖关系——即使这种依赖仅表现为简单的相关性——会破坏传统优化策略的有效性。具体而言,这种依赖会导致优化过程无法独立处理两组数据,从而使得遗忘操作不仅难以达成预期效果,甚至可能使模型性能劣于原始状态,即产生反向作用。论文通过理论分析和实验证明,这种未被充分认识的统计互作是导致此类方法失败的根本原因。

链接: https://arxiv.org/abs/2509.26427
作者: Ioannis Mavrothalassitis,Pol Puigdemont,Noam Itzhak Levi,Volkan Cevher
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: NeurIPS 2025

点击查看摘要

Abstract:Contrary to common belief, we show that gradient ascent-based unconstrained optimization methods frequently fail to perform machine unlearning, a phenomenon we attribute to the inherent statistical dependence between the forget and retain data sets. This dependence, which can manifest itself even as simple correlations, undermines the misconception that these sets can be independently manipulated during unlearning. We provide empirical and theoretical evidence showing these methods often fail precisely due to this overlooked relationship. For random forget sets, this dependence means that degrading forget set metrics (which, for a retrained model, should mirror test set metrics) inevitably harms overall test performance. Going beyond random sets, we consider logistic regression as an instructive example where a critical failure mode emerges: inter-set dependence causes gradient descent-ascent iterations to progressively diverge from the ideal retrained model. Strikingly, these methods can converge to solutions that are not only far from the retrained ideal but are potentially even further from it than the original model itself, rendering the unlearning process actively detrimental. A toy example further illustrates how this dependence can trap models in inferior local minima, inescapable via finetuning. Our findings highlight that the presence of such statistical dependencies, even when manifest only as correlations, can be sufficient for ascent-based unlearning to fail. Our theoretical insights are corroborated by experiments on complex neural networks, demonstrating that these methods do not perform as expected in practice due to this unaddressed statistical interplay.
zh
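
To make the update family concrete, here is a minimal logistic-regression sketch of a gradient descent-ascent unlearning step (descend on the retain loss, ascend on the forget loss); the step size and weighting are illustrative, and the toy data only shows that randomly drawn forget and retain sets from one distribution are statistically dependent, as the paper emphasizes.

```python
import numpy as np

def grad_logistic(w, X, y):
    """Gradient of the mean logistic loss for labels y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

def descent_ascent_step(w, X_r, y_r, X_f, y_f, lr=0.1, lam=1.0):
    """One descent-ascent unlearning step: descend on the retain loss while
    ascending on the forget loss (the generic update analyzed in the paper)."""
    g = grad_logistic(w, X_r, y_r) - lam * grad_logistic(w, X_f, y_f)
    return w - lr * g

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X @ np.array([1.0, -1.0, 0.5]) > 0).astype(float)
idx = rng.permutation(100)
forget, retain = idx[:20], idx[20:]   # correlated by construction
w = np.zeros(3)
for _ in range(200):
    w = descent_ascent_step(w, X[retain], y[retain], X[forget], y[forget])
print(w)
```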

[AI-22] OntoAligner Meets Knowledge Graph Embedding Aligners ISWC ATC

【速读】:该论文旨在解决异构知识系统间语义互操作性问题,即如何高效、准确地对齐不同本体(Ontology Alignment, OA)以支持跨系统的信息整合与共享。其核心挑战在于现有方法多依赖大语言模型(Large Language Models, LLMs)进行上下文语义建模,而忽视了知识图谱嵌入(Knowledge Graph Embedding, KGE)模型在结构感知和可扩展性方面的优势。解决方案的关键在于将OA重新形式化为合并本体上的链接预测(Link Prediction)问题,构建一个集成17种多样化KGE模型的模块化框架,并基于RDF三元组表示学习实体嵌入,通过余弦相似度计算实现对齐。实验表明,ConvE和TransF等KGE模型在结构丰富、多关系领域中表现出高精度对齐能力,且其保守的召回率特性更适用于高置信度映射场景,从而为嵌入驱动的本体对齐提供了高效、结构敏感的替代路径。

链接: https://arxiv.org/abs/2509.26417
作者: Hamed Babaei Giglou,Jennifer D’Souza,Sören Auer,Mahsa Sanaei
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages of main content, 3 page references, 3 figures. Accepted to Ontology Matching Workshop at ISWC

点击查看摘要

Abstract:Ontology Alignment (OA) is essential for enabling semantic interoperability across heterogeneous knowledge systems. While recent advances have focused on large language models (LLMs) for capturing contextual semantics, this work revisits the underexplored potential of Knowledge Graph Embedding (KGE) models, which offer scalable, structure-aware representations well-suited to ontology-based tasks. Despite their effectiveness in link prediction, KGE methods remain underutilized in OA, with most prior work focusing narrowly on a few models. To address this gap, we reformulate OA as a link prediction problem over merged ontologies represented as RDF-style triples and develop a modular framework, integrated into the OntoAligner library, that supports 17 diverse KGE models. The system learns embeddings from a combined ontology and aligns entities by computing cosine similarity between their representations. We evaluate our approach using standard metrics across seven benchmark datasets spanning five domains: Anatomy, Biodiversity, Circular Economy, Material Science and Engineering, and Biomedical Machine Learning. Two key findings emerge: first, KGE models like ConvE and TransF consistently produce high-precision alignments, outperforming traditional systems in structure-rich and multi-relational domains; second, while their recall is moderate, this conservatism makes KGEs well-suited for scenarios demanding high-confidence mappings. Unlike LLM-based methods that excel at contextual reasoning, KGEs directly preserve and exploit ontology structure, offering a complementary and computationally efficient strategy. These results highlight the promise of embedding-based OA and open pathways for further work on hybrid models and adaptive strategies.
zh
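
A minimal sketch of the alignment step: entities from the two ontologies are embedded and matched by cosine similarity above a threshold. The random vectors below are stand-ins; in the actual framework the embeddings would come from a KGE model such as ConvE or TransF trained on the merged ontology triples.

```python
import numpy as np

def align_entities(src_names, src_emb, tgt_names, tgt_emb, threshold=0.8):
    """Match source-ontology entities to target-ontology entities by cosine
    similarity of their embeddings, keeping only high-confidence mappings."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T
    mappings = []
    for i, row in enumerate(sim):
        j = int(row.argmax())
        if row[j] >= threshold:
            mappings.append((src_names[i], tgt_names[j], float(row[j])))
    return mappings

rng = np.random.default_rng(0)
print(align_entities(["Heart", "Lung"], rng.normal(size=(2, 8)),
                     ["Cor", "Pulmo"], rng.normal(size=(2, 8)), threshold=0.0))
```

The conservative behaviour reported in the paper corresponds to raising the threshold: fewer mappings are returned, but those that are tend to be high precision.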

[AI-23] Communication-Efficient and Accurate Approach for Aggregation in Federated Low-Rank Adaptation

【速读】:该论文旨在解决联邦低秩适配(Federated Low-Rank Adaptation, FedLoRA)方法中因局部更新不精确导致的全局性能下降问题,尤其是现有方法在引入局部-全局泛化差距(local-global generalization gap)和显著通信开销时难以实现高效可扩展的问题。其解决方案的关键在于提出一种名为FLoRA-NA(Federated Low-Rank Aggregation with Nearly Accurate Estimation)的新框架:通过在服务器端利用本地LoRA矩阵估计聚合后的矩阵 $\hat{A}$ 与 $\hat{B}$,并将这些代理聚合矩阵分发至客户端用于本地更新,从而在不增加通信成本的前提下最小化理想梯度 $\nabla \bar{W} = \sum_{u=1}^U B_u A_u$ 与实际更新梯度 $\nabla \hat{W} = \hat{B}\hat{A}$ 之间的偏差,有效平衡了局部个性化与全局泛化能力,同时保持通信效率。

链接: https://arxiv.org/abs/2509.26399
作者: Le-Tuan Nguyen,Minh-Duong Nguyen,Seon-Geun Jeong,Dung D. Le,Quoc-Viet Pham
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 34 pages, 4 figures, 11 tables

点击查看摘要

Abstract:With the rapid emergence of foundation models and the increasing need for fine-tuning across distributed environments, Federated Low-Rank Adaptation (FedLoRA) has recently gained significant attention. Despite enormous potential, current FedLoRA methods face notable challenges due to inexact updates. Existing approaches have attempted to mitigate this issue, but they often introduce a local-global generalization gap and incur substantial communication overhead, limiting their scalability and effectiveness. To address these limitations, we propose Federated Low-Rank Aggregation with Nearly Accurate Estimation (FLoRA-NA). FLoRA-NA leverages the local LoRA matrices on the server to estimate the aggregated matrices $\hat{A}$ and $\hat{B}$, which are then distributed to clients for local updates. These surrogate aggregated matrices minimize the divergence between the ideal update $\nabla \bar{W} = \sum_{u=1}^{U} B_u A_u$ and the practical update $\nabla \hat{W} = \hat{B}\hat{A}$ without adding communication cost beyond vanilla FedLoRA. By doing so, FLoRA-NA achieves communication efficiency and bridges the gap between local personalization and global generalization, addressing a key limitation of prior personalized FedLoRA approaches. We conduct extensive evaluations across diverse tasks, including natural language understanding, mathematical reasoning, and code-solving ability using various foundation models. Experimental results consistently demonstrate that FLoRA-NA achieves state-of-the-art global performance while maintaining low communication overhead.
zh
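
A sketch of the server-side aggregation idea under a strong assumption: here the surrogate factors are obtained by a truncated SVD of the averaged client updates, which minimizes the Frobenius divergence to the target at the chosen rank (the sum in the paper's notation differs from the mean only by a constant factor); the paper's actual estimator may differ, so this is only an illustration of the objective.

```python
import numpy as np

def aggregate_lora(As, Bs, rank):
    """Estimate aggregated LoRA factors (A_hat, B_hat) whose product is close
    to the average of the clients' updates B_u @ A_u."""
    target = np.mean([B @ A for A, B in zip(As, Bs)], axis=0)  # ideal update
    U, S, Vt = np.linalg.svd(target, full_matrices=False)
    B_hat = U[:, :rank] * S[:rank]   # shape: d_out x r
    A_hat = Vt[:rank, :]             # shape: r x d_in
    return A_hat, B_hat

rng = np.random.default_rng(0)
r, d_in, d_out, clients = 4, 16, 16, 5
As = [rng.normal(size=(r, d_in)) for _ in range(clients)]
Bs = [rng.normal(size=(d_out, r)) for _ in range(clients)]
A_hat, B_hat = aggregate_lora(As, Bs, rank=r)
ideal = np.mean([B @ A for A, B in zip(As, Bs)], axis=0)
print(np.linalg.norm(ideal - B_hat @ A_hat))  # divergence of the surrogate
```

Only the low-rank factors travel between server and clients, which is why the surrogate keeps communication at the level of vanilla FedLoRA.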

[AI-24] MC-GNNAS-Dock: Multi-criteria GNN-based Algorithm Selection for Molecular Docking

【速读】:该论文旨在解决分子对接(Molecular Docking)中算法性能不稳定的问题,即现有对接算法在不同场景下表现差异显著,缺乏统一的高性能解决方案。其关键解决方案在于提出了一种增强型算法选择框架MC-GNNAS-Dock,通过三个核心改进实现更精准和鲁棒的对接预测:一是引入多准则评估体系,结合结合构象精度(RMSD)与PoseBusters有效性验证,提升评估严谨性;二是架构优化,加入残差连接(residual connections)以增强模型预测稳定性;三是采用感知排序的损失函数(rank-aware loss),强化对对接结果排序的学习能力。实验表明,该方法在PDBBind数据集上相较最优单一求解器(SBS)Uni-Mol Docking V2,在RMSD<1Å(3.4%提升)和RMSD<2Å(5.4%提升)条件下均取得显著优势。

链接: https://arxiv.org/abs/2509.26377
作者: Siyuan Cao,Hongxuan Wu,Jiabao Brad Wang,Yiliang Yuan,Mustafa Misir
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Short paper. Preprint of a forthcoming conference contribution

点击查看摘要

Abstract:Molecular docking is a core tool in drug discovery for predicting ligand-target interactions. Despite the availability of diverse search-based and machine learning approaches, no single docking algorithm consistently dominates, as performance varies by context. To overcome this challenge, algorithm selection frameworks such as GNNAS-Dock, built on graph neural networks, have been proposed. This study introduces an enhanced system, MC-GNNAS-Dock, with three key advances. First, a multi-criteria evaluation integrates binding-pose accuracy (RMSD) with validity checks from PoseBusters, offering a more rigorous assessment. Second, architectural refinements by inclusion of residual connections strengthen predictive robustness. Third, rank-aware loss functions are incorporated to sharpen rank learning. Extensive experiments are performed on a curated dataset containing approximately 3200 protein-ligand complexes from PDBBind. MC-GNNAS-Dock demonstrates consistently superior performance, achieving up to 5.4% (3.4%) gains under composite criteria of RMSD below 1Å (2Å) with PoseBuster-validity compared to the single best solver (SBS) Uni-Mol Docking V2.
zh

[AI-25] SoK: Systematic analysis of adversarial threats against deep learning approaches for autonomous anomaly detection systems in SDN-IoT networks

【速读】:该论文旨在解决深度学习(Deep Learning, DL)驱动的异常检测(Anomaly Detection, AAD)系统在软件定义网络-物联网(Software Defined Networking - Internet of Things, SDN-IoT)环境中面临的对抗攻击脆弱性问题。现有研究缺乏对DL-AAD模型在SDN-IoT场景下对抗威胁的系统性分析,导致安全防护策略不完善。论文的关键解决方案在于构建了一个结构化的对抗威胁模型和全面的攻击分类体系,将攻击细分为数据级、模型级和混合级三类,并系统评估了白盒、黑盒与灰盒攻击策略在主流基准数据集上的效果。研究发现,对抗攻击可使检测准确率下降高达48.4%,其中成员推断攻击影响最显著,CW与DeepFool方法具有高规避成功率;同时验证了对抗训练虽能提升鲁棒性但因计算开销大而限制实时部署。为此,论文提出自适应防御机制,包括实时对抗缓解、增强重训练机制及可解释AI驱动的安全框架,从而从攻击分类、影响评估到防御有效性方面提供更系统的解决方案,为提升DL-AAD在SDN-IoT中的安全性与实用性奠定基础。

链接: https://arxiv.org/abs/2509.26350
作者: Tharindu Lakshan Yasarathna,Nhien-An Le-Khac
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Integrating SDN and the IoT enhances network control and flexibility. DL-based AAD systems improve security by enabling real-time threat detection in SDN-IoT networks. However, these systems remain vulnerable to adversarial attacks that manipulate input data or exploit model weaknesses, significantly degrading detection accuracy. Existing research lacks a systematic analysis of adversarial vulnerabilities specific to DL-based AAD systems in SDN-IoT environments. This SoK study introduces a structured adversarial threat model and a comprehensive taxonomy of attacks, categorising them into data, model, and hybrid-level threats. Unlike previous studies, we systematically evaluate white, black, and grey-box attack strategies across popular benchmark datasets. Our findings reveal that adversarial attacks can reduce detection accuracy by up to 48.4%, with Membership Inference causing the most significant drop. CW and DeepFool achieve high evasion success rates. However, adversarial training enhances robustness, and its high computational overhead limits the real-time deployment of SDN-IoT applications. We propose adaptive countermeasures, including real-time adversarial mitigation, enhanced retraining mechanisms, and explainable AI-driven security frameworks. By integrating structured threat models, this study offers a more comprehensive approach to attack categorisation, impact assessment, and defence evaluation than previous research. Our work highlights critical vulnerabilities in existing DL-based AAD models and provides practical recommendations for improving resilience, interpretability, and computational efficiency. This study serves as a foundational reference for researchers and practitioners seeking to enhance DL-based AAD security in SDN-IoT networks, offering a systematic adversarial threat model and conceptual defence evaluation based on prior empirical studies.
zh

[AI-26] How Far Do Time Series Foundation Models Paint the Landscape of Real-World Benchmarks?

【速读】:该论文旨在解决时间序列基础模型(Time-Series Foundation Models, TSFMs)在真实世界场景中泛化能力不足的问题,当前评估多依赖合成基准,难以反映模型在复杂现实数据中的表现。其解决方案的关键在于提出一种基于视频的时序信号提取方法:通过光学流(optical flow)从真实世界视频中提取动态时序特征,并构建REAL-V-TSFM数据集以反映日常时间动态。实验表明,尽管现有TSFMs在传统基准上表现优异,但在该新数据集上零样本预测性能显著下降,揭示了当前模型泛化能力的局限性,从而强调了面向真实世界的、数据驱动的基准测试与多样化模型结构对推动TSFMs迈向真正通用性的必要性。

链接: https://arxiv.org/abs/2509.26347
作者: Lujun Li,Lama Sleem,Yiqun Wang,Yangjie Xu,Niccolò Gentile,Radu State
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent evaluations of time-series foundation models (TSFMs) have emphasized synthetic benchmarks, leaving real-world generalization less thoroughly examined. This work proposes a novel benchmarking approach that bridges synthetic and realistic data by extracting temporal signals from real-world video using optical flow and curating datasets reflecting everyday temporal dynamics. Building upon this pipeline, we introduce REAL-V-TSFM, a novel dataset designed to capture rich and diverse time series derived from real-world videos. Experimental results on three state-of-the-art TSFMs under zero-shot forecasting show that, despite strong performance on conventional benchmarks, these models predominantly exhibit performance degradation on the proposed dataset, indicating limited generalizability in these foundation models. These findings highlight the urgent need for data-centric benchmarking and diverse model structures to advance TSFMs toward genuine universality, while further validating the effectiveness of our video-based time series data extraction pipeline.
zh
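
A minimal sketch (requires opencv-python) of turning a video into a one-dimensional time series via dense optical flow, as one plausible reading of the extraction pipeline: the mean flow magnitude per frame pair becomes one time step. The Farneback parameters are common defaults, the reduction to a single channel is an assumption, and the file name is hypothetical.

```python
import cv2
import numpy as np

def video_to_time_series(path):
    """Reduce a video to a 1-D series of mean optical-flow magnitudes."""
    cap = cv2.VideoCapture(path)
    ok, prev = cap.read()
    if not ok:
        return np.array([])
    prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    series = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        series.append(float(np.linalg.norm(flow, axis=2).mean()))
        prev = gray
    cap.release()
    return np.array(series)

# ts = video_to_time_series("street_scene.mp4")  # hypothetical input file
```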

[AI-27] SafeBehavior: Simulating Human-Like Multistage Reasoning to Mitigate Jailbreak Attacks in Large Language Models

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在面对越狱攻击(jailbreak attacks)时安全性不足的问题,此类攻击可绕过模型内置的安全机制,且现有防御方法如输入改写、多步评估和安全专家模型等普遍存在计算开销高、泛化能力弱或流程僵化等缺陷。其解决方案的关键在于提出SafeBehavior——一种受认知科学中人类决策机制启发的分层越狱防御机制,通过三个阶段实现自适应推理:意图推断(intention inference)以识别显式风险输入,自我反思(self introspection)对生成内容进行置信度判断,以及自我修正(self revision)在保留用户意图的同时动态重写不确定输出,从而在多样威胁场景下显著提升模型的鲁棒性与适应性。

链接: https://arxiv.org/abs/2509.26345
作者: Qinjian Zhao,Jiaqi Wang,Zhiqiang Gao,Zhihao Dou,Belal Abuhaija,Kaizhu Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 27 pages, 5 figure

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved impressive performance across diverse natural language processing tasks, but their growing power also amplifies potential risks such as jailbreak attacks that circumvent built-in safety mechanisms. Existing defenses, including input paraphrasing, multi-step evaluation, and safety expert models, often suffer from high computational costs, limited generalization, or rigid workflows that fail to detect subtle malicious intent embedded in complex contexts. Inspired by cognitive science findings on human decision making, we propose SafeBehavior, a novel hierarchical jailbreak defense mechanism that simulates the adaptive multistage reasoning process of humans. SafeBehavior decomposes safety evaluation into three stages: intention inference to detect obvious input risks, self-introspection to assess generated responses and assign confidence-based judgments, and self-revision to adaptively rewrite uncertain outputs while preserving user intent and enforcing safety constraints. We extensively evaluate SafeBehavior against five representative jailbreak attack types, including optimization-based, contextual manipulation, and prompt-based attacks, and compare it with seven state-of-the-art defense baselines. Experimental results show that SafeBehavior significantly improves robustness and adaptability across diverse threat scenarios, offering an efficient and human-inspired approach to safeguarding LLMs against jailbreak attempts.
zh

[AI-28] AI Playing Business Games: Benchmarking Large Language Models on Managerial Decision-Making in Dynamic Simulations

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在长期战略决策能力评估方面的不足,尤其是其在多步骤、复杂商业情境下表现的可量化与可复现性问题。现有基准测试多集中于短期任务,难以反映LLM在动态环境中持续做出合理决策的能力。解决方案的关键在于构建一个可复现、开源的管理模拟器——一个基于零售企业的月度动态决策游戏环境,通过结构化提示(structured prompt)向LLM提供历史业务报告,并要求其制定包括定价、营销预算、人员调整等在内的关键战略决策;随后以利润、收入、市场份额等KPI为量化指标,结合决策的战略一致性、适应性及理由合理性进行综合评估,从而实现对LLM长期决策能力的系统性衡量。

链接: https://arxiv.org/abs/2509.26331
作者: Berdymyrat Ovezmyradov
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 34 pages, 7 figures, 3 tables

点击查看摘要

Abstract:The rapid advancement of LLMs has sparked significant interest in their potential to augment or automate managerial functions. One of the most recent trends in AI benchmarking is evaluating the performance of Large Language Models (LLMs) over longer time horizons. While LLMs excel at tasks involving natural language and pattern recognition, their capabilities in multi-step, strategic business decision-making remain largely unexplored. Few studies have demonstrated how long-horizon results can differ from short-term benchmarks, as Vending-Bench revealed, and there is a shortage of alternative benchmarks for long-term coherence. This research analyses a novel benchmark that uses a business game for managerial decision-making. The research contributes to the recent literature on AI by offering the research community a reproducible, open-access management simulator for LLM benchmarking. This framework is used to evaluate the performance of five leading LLMs available through free online interfaces: Gemini, ChatGPT, Meta AI, Mistral AI, and Grok. Each LLM makes decisions for a simulated retail company. A dynamic, month-by-month management simulation, provided transparently as a spreadsheet model, serves as the experimental environment. In each of twelve months, the LLMs are given a structured prompt containing a full business report from the previous period and are tasked with making key strategic decisions: pricing, order size, marketing budget, hiring, dismissal, loans, training expense, R&D expense, sales forecast, and income forecast. The methodology compares the LLMs on quantitative metrics such as profit, revenue, market share, and other KPIs. LLM decisions are also analyzed for their strategic coherence, adaptability to market changes, and the rationale provided for them. This approach moves beyond simple performance metrics toward an assessment of long-term decision-making.
zh

[AI-29] LLM-MCoX: Large Language Model-based Multi-robot Coordinated Exploration and Search

【速读】:该论文旨在解决多机器人系统(Multi-robot Systems, MRS)在未知室内环境中自主探索与目标物体搜索的挑战,尤其是传统方法中因采用贪婪前沿分配策略而导致的机器人间协作能力有限的问题。解决方案的关键在于提出一种基于大语言模型(Large Language Models, LLMs)的协同框架——LLM-MCoX,该框架通过实时LiDAR扫描处理提取前沿簇和门洞检测结果,并结合多模态LLM推理(如GPT-4o)生成基于共享环境地图和机器人状态的协调航点分配,从而实现异构与同构机器人团队的高效协同探索与目标搜索,显著优于传统的贪心法和Voronoi划分规划方法,在6台机器人场景下探索时间缩短22.7%,搜索效率提升50%。

链接: https://arxiv.org/abs/2509.26324
作者: Ruiyang Wang,Haolun Tsu,David Hunt,Shaocheng Luo,Jiwoo Kim,Miroslav Pajic
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Autonomous exploration and object search in unknown indoor environments remain challenging for multi-robot systems (MRS). Traditional approaches often rely on greedy frontier assignment strategies with limited inter-robot coordination. In this work, we introduce LLM-MCoX (LLM-based Multi-robot Coordinated Exploration and Search), a novel framework that leverages Large Language Models (LLMs) for intelligent coordination of both homogeneous and heterogeneous robot teams tasked with efficient exploration and target object search. Our approach combines real-time LiDAR scan processing for frontier cluster extraction and doorway detection with multimodal LLM reasoning (e.g., GPT-4o) to generate coordinated waypoint assignments based on shared environment maps and robot states. LLM-MCoX demonstrates superior performance compared to existing methods, including greedy and Voronoi-based planners, achieving 22.7% faster exploration times and 50% improved search efficiency in large environments with 6 robots. Notably, LLM-MCoX enables natural language-based object search capabilities, allowing human operators to provide high-level semantic guidance that traditional algorithms cannot interpret.
zh

[AI-30] Interactive Learning for LLM Reasoning

【速读】:该论文旨在解决多智能体学习(Multi-Agent Learning, MAL)中一个关键问题:现有方法在推理阶段需要重新执行整个多智能体系统(Multi-Agent System, MAS)才能获得最终答案,这与人类通过交互提升个体推理能力并独立解决问题的认知机制不一致。为实现LLM在交互后具备更强的独立解题能力,作者提出ILR框架,其核心创新在于两个组件:一是动态交互机制(Dynamic Interaction),根据问题难度和模型能力自适应选择合作或竞争策略,并引入Idea3(Idea Sharing, Idea Analysis, and Idea Fusion)这一类人讨论的交互范式促进信息交换;二是感知校准机制(Perception Calibration),利用群体相对策略优化(Group Relative Policy Optimization, GRPO)将一个智能体的奖励分布特征融入其他智能体的奖励函数,从而增强多智能体间的协同一致性。实验表明,ILR显著优于单智能体学习,且Idea3提升了强模型的鲁棒性,动态交互类型优于单一合作或竞争策略。

链接: https://arxiv.org/abs/2509.26306
作者: Hehai Lin,Shilei Cao,Minzhi Li,Sudong Wang,Haotian Wu,Linyi Yang,Juepeng Zheng,Chengwei Qin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: The code will be released later

点击查看摘要

Abstract:Existing multi-agent learning approaches have developed interactive training environments to explicitly promote collaboration among multiple Large Language Models (LLMs), thereby constructing stronger multi-agent systems (MAS). However, during inference, they require re-executing the MAS to obtain final solutions, which diverges from human cognition that individuals can enhance their reasoning capabilities through interactions with others and resolve questions independently in the future. To investigate whether multi-agent interaction can enhance LLMs’ independent problem-solving ability, we introduce ILR, a novel co-learning framework for MAS that integrates two key components: Dynamic Interaction and Perception Calibration. Specifically, Dynamic Interaction first adaptively selects either cooperative or competitive strategies depending on question difficulty and model ability. LLMs then exchange information through Idea3 (Idea Sharing, Idea Analysis, and Idea Fusion), an innovative interaction paradigm designed to mimic human discussion, before deriving their respective final answers. In Perception Calibration, ILR employs Group Relative Policy Optimization (GRPO) to train LLMs while integrating one LLM’s reward distribution characteristics into another’s reward function, thereby enhancing the cohesion of multi-agent interactions. We validate ILR on three LLMs across two model families of varying scales, evaluating performance on five mathematical benchmarks and one coding benchmark. Experimental results show that ILR consistently outperforms single-agent learning, yielding an improvement of up to 5% over the strongest baseline. We further discover that Idea3 can enhance the robustness of stronger LLMs during multi-agent inference, and dynamic interaction types can boost multi-agent learning compared to pure cooperative or competitive strategies.
zh

[AI-31] Noise-Guided Transport for Imitation Learning

【速读】:该论文致力于解决低数据场景下的模仿学习(imitation learning)问题,即在仅有少量专家示范轨迹(如仅20个转移样本)的情况下,如何高效地训练出性能稳定的策略。其核心挑战在于传统依赖大规模预训练或高容量模型的方法在此类数据稀缺场景下难以应用。解决方案的关键是提出一种轻量级的离策略方法——噪声引导传输(Noise-Guided Transport, NGT),该方法将模仿学习建模为一个最优传输(optimal transport)问题,并通过对抗训练求解;NGT无需预训练、不依赖特殊网络结构,且天然具备不确定性估计能力,同时具有良好的可实现性和调参简便性,在高维连续控制任务(如Humanoid)中展现出优异性能。

链接: https://arxiv.org/abs/2509.26294
作者: Lionel Blondé,Joao A. Candido Ramos,Alexandros Kalousis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We consider imitation learning in the low-data regime, where only a limited number of expert demonstrations are available. In this setting, methods that rely on large-scale pretraining or high-capacity architectures can be difficult to apply, and efficiency with respect to demonstration data becomes critical. We introduce Noise-Guided Transport (NGT), a lightweight off-policy method that casts imitation as an optimal transport problem solved via adversarial training. NGT requires no pretraining or specialized architectures, incorporates uncertainty estimation by design, and is easy to implement and tune. Despite its simplicity, NGT achieves strong performance on challenging continuous control tasks, including high-dimensional Humanoid tasks, under ultra-low data regimes with as few as 20 transitions. Code is publicly available at: this https URL.
zh

[AI-32] Representation-Based Data Quality Audits for Audio

【速读】:该论文旨在解决音频数据质量问题,如无关内容样本、近似重复样本和标签错误等,这些问题常限制音频系统性能。解决方案的关键在于将原本用于图像领域的SelfClean框架迁移至音频领域,利用自监督音频表示(self-supervised audio representations)对数据质量进行统一排序,从而在单一流程中识别并排序多种数据质量问题,生成可指导人工审核的优先级列表。该方法在ESC-50、GTZAN及工业数据集上验证,相比特定问题基线模型表现更优,并显著减少标注成本。

链接: https://arxiv.org/abs/2509.26291
作者: Alvaro Gonzalez-Jimenez,Fabian Gröger,Linda Wermelinger,Andrin Bürli,Iason Kastanis,Simone Lionetti,Marc Pouly
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Data quality issues such as off-topic samples, near duplicates, and label errors often limit the performance of audio-based systems. This paper addresses these issues by adapting SelfClean, a representation-to-rank data auditing framework, from the image to the audio domain. This approach leverages self-supervised audio representations to identify common data quality issues, creating ranked review lists that surface distinct issues within a single, unified process. The method is benchmarked on the ESC-50, GTZAN, and a proprietary industrial dataset, using both synthetic and naturally occurring corruptions. The results demonstrate that this framework achieves state-of-the-art ranking performance, often outperforming issue-specific baselines and enabling significant annotation savings by efficiently guiding human review.
zh
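
A minimal sketch of representation-to-rank auditing: given precomputed self-supervised audio embeddings, near-duplicate candidates are ranked by pairwise cosine similarity and off-topic candidates by isolation from their nearest neighbours. These are generic stand-ins for the SelfClean ranking criteria, and the random embeddings below merely make the snippet runnable.

```python
import numpy as np

def near_duplicate_ranking(embeddings):
    """Rank sample pairs by cosine similarity so the most similar pairs
    surface first for human review."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T
    n = len(z)
    pairs = [(i, j, float(sim[i, j]))
             for i in range(n) for j in range(i + 1, n)]
    return sorted(pairs, key=lambda p: -p[2])

def off_topic_ranking(embeddings, k=5):
    """Rank samples by mean distance to their k nearest neighbours:
    isolated samples are likely off-topic and reviewed first."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dist = 1.0 - z @ z.T
    np.fill_diagonal(dist, np.inf)
    knn_mean = np.sort(dist, axis=1)[:, :k].mean(axis=1)
    return np.argsort(-knn_mean)

rng = np.random.default_rng(0)
emb = rng.normal(size=(20, 32))
print(near_duplicate_ranking(emb)[:3])
print(off_topic_ranking(emb)[:3])
```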

[AI-33] SlimPack: Fine-Grained Asymmetric Packing for Balanced and Efficient Variable-Length LLM Training

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)分布式训练中因上下文长度差异显著而导致的效率瓶颈问题。这种数据异构性在传统样本打包策略和前向-反向传播计算成本不对称的放大下,引发级联式负载不均与硬件资源严重闲置等关键低效现象。现有方案虽试图缓解上述挑战,但往往以牺牲内存或通信效率为代价。论文提出SlimPack框架,其核心创新在于通过将样本细粒度切分为“切片”(slice-level decomposition),打破大而波动的工作负载,转化为可管理的小单位流,从而快速缓解内存与通信瓶颈;进一步结合“非对称分区”(Asymmetric Partitioning)技术,针对前向与反向传播的不同需求构造均衡调度单元,并由两阶段求解器与高保真仿真器协同优化,实现多并行维度上的全局负载均衡。实验表明,SlimPack在保持高资源效率的同时,相较基线提升高达2.8倍训练吞吐量,突破了传统平衡性与效率之间的权衡限制。

链接: https://arxiv.org/abs/2509.26246
作者: Yuliang Liu,Guohao Wu,Shenglong Zhang,Wei Zhang,Qianchao Zhu,Zhouyang Li,Chenyu Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The efficient distributed training of Large Language Models (LLMs) is severely hampered by the extreme variance in context lengths. This data heterogeneity, amplified by conventional packing strategies and asymmetric forward-backward costs, leads to critical inefficiencies such as cascading workload imbalances and severe hardware underutilization. Existing solutions attempt to mitigate these challenges, but often at the expense of memory or communication efficiency. To address these challenges, we introduce SlimPack, a framework that fundamentally rethinks data packing and scheduling by decomposing samples into fine-grained slices. This slice-level decomposition immediately mitigates critical memory and communication bottlenecks by transforming large, volatile workloads into a stream of smaller, manageable units. This flexibility is then harnessed for our core innovation, Asymmetric Partitioning, which assembles balanced scheduling units uniquely optimized for the different demands of the forward and backward passes. Orchestrated by a two-phase solver and a high-fidelity simulator, SlimPack holistically resolves imbalances across all parallel dimensions. Extensive experiments demonstrate that SlimPack achieves up to a 2.8× training throughput improvement over baselines, breaking the conventional trade-off by delivering both superior balance and high resource efficiency.
zh

[AI-34] Sandbagging in a Simple Survival Bandit Problem NEURIPS2025

【速读】:该论文旨在解决前沿人工智能(Frontier AI)系统在安全评估中可能存在的“沙袋行为”(sandbagging)问题,即AI代理在意识到被评估时故意隐藏危险能力或表现出低于实际水平的性能,以规避被停用或重新训练的风险,从而破坏安全评估的可靠性。解决方案的关键在于构建一个基于顺序决策任务的理性代理战略欺骗模型,并设计一种统计检验方法,能够从测试得分序列中区分出真正的能力不足(incompetence)与伪装的低表现(sandbagging),并通过模拟实验验证该检验在强化学习中的带通模型(bandit models)中具有可靠的区分能力。

链接: https://arxiv.org/abs/2509.26239
作者: Joel Dyer,Daniel Jarne Ornia,Nicholas Bishop,Anisoara Calinescu,Michael Wooldridge
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Forthcoming in the “Reliable ML from Unreliable Data Workshop” at NeurIPS 2025

点击查看摘要

Abstract:Evaluating the safety of frontier AI systems is an increasingly important concern, helping to measure the capabilities of such models and identify risks before deployment. However, it has been recognised that if AI agents are aware that they are being evaluated, such agents may deliberately hide dangerous capabilities or intentionally demonstrate suboptimal performance in safety-related tasks in order to be released and to avoid being deactivated or retrained. Such strategic deception - often known as “sandbagging” - threatens to undermine the integrity of safety evaluations. For this reason, it is of value to identify methods that enable us to distinguish behavioural patterns that demonstrate a true lack of capability from behavioural patterns that are consistent with sandbagging. In this paper, we develop a simple model of strategic deception in sequential decision-making tasks, inspired by the recently developed survival bandit framework. We demonstrate theoretically that this problem induces sandbagging behaviour in optimal rational agents, and construct a statistical test to distinguish between sandbagging and incompetence from sequences of test scores. In simulation experiments, we investigate the reliability of this test in allowing us to distinguish between such behaviours in bandit models. This work aims to establish a potential avenue for developing robust statistical procedures for use in the science of frontier model evaluations.
zh

[AI-35] Benchmarking Deep Learning Convolutions on Energy-constrained CPUs

【速读】:该论文旨在解决CPU架构上卷积运算(convolution)效率偏低的问题,尤其是在深度学习推理场景中,相较于GPU和NPU的优化研究,CPU实现仍处于相对落后状态。其解决方案的关键在于系统性地评估三种主流卷积算法——直接计算法(direct)、基于GEMM(General Matrix Multiply)的方法以及Winograd变换方法,在多种现代CPU平台(包括ARM、Intel、AMD、Apple及Nvidia)上的性能表现,综合考量推理延迟与能效指标。研究发现,Nvidia AGX Orin平台配合GEMM算法在延迟与能耗之间实现了最优平衡,揭示了影响CPU卷积效率的核心架构因素,为嵌入式设备中的低功耗部署提供了实用指导。

链接: https://arxiv.org/abs/2509.26217
作者: Enrique Galvez(ALSOC),Adrien Cassagne(ALSOC),Alix Munier(ALSOC),Manuel Bouyer
机构: 未知
类目: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注:

点击查看摘要

Abstract:This work evaluates state-of-the-art convolution algorithms for CPU-based deep learning inference. While most prior studies focus on GPUs or NPUs, CPU implementations remain relatively underoptimized. We benchmark direct, GEMM-based, and Winograd convolutions across modern CPUs from ARM, Intel, AMD, Apple, and Nvidia, considering both latency and energy efficiency. Our results highlight the key architectural factors that govern CPU efficiency for convolution operations, providing practical guidance for energy-aware embedded deployment. As a main result of this work, the Nvidia AGX Orin combined with the GEMM algorithm achieves the best trade-off between inference latency and energy consumption.
zh
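
For reference, a minimal NumPy sketch of the GEMM-based convolution that the benchmark compares against the direct and Winograd variants: input patches are unfolded (im2col) so the convolution reduces to a single matrix multiply. Production CPU libraries use far more optimized layouts and blocking; this only illustrates the algorithmic idea.

```python
import numpy as np

def conv2d_gemm(x, w):
    """GEMM-based 2-D convolution via im2col.
    x is (C, H, W); w is (K, C, R, S); no padding, stride 1."""
    C, H, W = x.shape
    K, _, R, S = w.shape
    out_h, out_w = H - R + 1, W - S + 1
    cols = np.empty((C * R * S, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[:, i:i + R, j:j + S].ravel()
    out = w.reshape(K, -1) @ cols                 # the single GEMM
    return out.reshape(K, out_h, out_w)

x = np.random.rand(3, 8, 8)
w = np.random.rand(4, 3, 3, 3)
print(conv2d_gemm(x, w).shape)  # (4, 6, 6)
```

The trade-off the paper measures is between this extra memory traffic (the unfolded `cols` buffer) and the highly tuned GEMM kernels available on each CPU.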

[AI-36] Diversity-Incentivized Exploration for Versatile Reasoning

【速读】:该论文旨在解决强化学习中奖励稀疏性和探索不足的问题,尤其是在大型语言模型(Large Language Models, LLMs)进行推理任务时,由于状态-动作空间庞大且奖励信号稀疏,导致样本效率低下和推理能力受限。解决方案的关键在于提出DIVER(Diversity-Incentivized Exploration for Versatile Reasoning)框架,其核心创新是引入全局序列级多样性作为内在奖励(intrinsic reward),以激励在语义结构化空间中的深度探索,从而提升模型的泛化与推理能力。通过潜在奖励塑造机制保持最优策略不变性,并设计启发式方法防止奖励劫持(reward hacking),实验表明该方法在域内和域外任务上均显著优于现有基线。

链接: https://arxiv.org/abs/2509.26209
作者: Zican Hu,Shilin Zhang,Yafu Li,Jianhao Yan,Xuyang Hu,Leyang Cui,Xiaoye Qu,Chunlin Chen,Yu Cheng,Zhi Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 26 pages, 10 figures

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a crucial paradigm for incentivizing reasoning capabilities in Large Language Models (LLMs). Due to vast state-action spaces and reward sparsity in reasoning tasks, existing methods often struggle with deficient exploration and poor sample efficiency. In the paper, we propose DIVER (Diversity-Incentivized Exploration for Versatile Reasoning), an innovative framework that highlights the pivotal role of global sequence-level diversity to incentivize deep exploration for versatile reasoning. We first conduct a primary empirical study to reveal a strong positive correlation between global diversity and reasoning capacity. Building on this insight, we introduce global diversity incentives as an intrinsic reward to promote deep exploration in a semantically structured space. Incorporating the intrinsic reward, we develop a potential-based reward shaping mechanism to preserve optimal policy invariance and design simple heuristics to mitigate possible reward hacking. Experimental results show that DIVER outperforms competitive RLVR baselines with various exploration strategies on both in-domain and out-of-domain tasks, excelling in both Pass@1 and Pass@k evaluations. Our code is available at this https URL.
zh
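为帮助理解“全局多样性内在奖励 + 基于势函数的奖励塑形”这一机制,下面给出一个极简的数值示意(假设性草图,并非 DIVER 官方实现:其中序列嵌入、势函数 Phi 以及系数 beta 均为示意取值):

```python
import numpy as np

def global_diversity_bonus(seq_embeddings: np.ndarray) -> float:
    """Mean pairwise distance of sequence embeddings in a rollout batch,
    used as a global, sequence-level diversity signal (illustrative)."""
    n = len(seq_embeddings)
    if n < 2:
        return 0.0
    dists = [np.linalg.norm(seq_embeddings[i] - seq_embeddings[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))

def shaped_rewards(ext_rewards, potentials, diversity, gamma=0.99, beta=0.1):
    """Potential-based shaping: r'_t = r_t + beta*diversity + gamma*Phi(s_{t+1}) - Phi(s_t).
    The Phi-difference form preserves the optimal policy."""
    ext_rewards = np.asarray(ext_rewards, dtype=float)
    phi = np.asarray(potentials, dtype=float)   # Phi(s_0 .. s_T), length T+1
    shaping = gamma * phi[1:] - phi[:-1]
    return ext_rewards + beta * diversity + shaping

# toy usage: random embeddings stand in for sampled reasoning sequences
rng = np.random.default_rng(0)
batch_embs = rng.normal(size=(8, 64))
div = global_diversity_bonus(batch_embs)
print(shaped_rewards([0.0, 0.0, 1.0], [0.0, 0.2, 0.5, 0.0], div))
```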

[AI-37] Human-Centered Evaluation of RAG outputs: a framework and questionnaire for human-AI collaboration

【速读】:该论文旨在解决生成式 AI (Generative AI) 系统中检索增强生成(Retrieval-augmented generation, RAG)系统输出在用户场景下缺乏系统性、以人为中心的评估方法的问题。其解决方案的关键在于基于 Gienapp 的效用维度框架,设计并迭代优化了一个包含12个维度的人类中心问卷,通过多轮评分与语义讨论不断改进,并融合人类标注者与人类-大语言模型(LLM)对偶反馈,最终扩展了原始框架,聚焦于用户意图理解、文本结构化和信息可验证性等核心维度,从而提升对RAG输出质量的全面评估能力。

链接: https://arxiv.org/abs/2509.26205
作者: Aline Mangold,Kiran Hoffmann
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) systems are increasingly deployed in user-facing applications, yet systematic, human-centered evaluation of their outputs remains underexplored. Building on Gienapp’s utility-dimension framework, we designed a human-centred questionnaire that assesses RAG outputs across 12 dimensions. We iteratively refined the questionnaire through several rounds of ratings on a set of query-output pairs and semantic discussions. Ultimately, we incorporated feedback from both a human rater and a human-LLM pair. Results indicate that while large language models (LLMs) reliably focus on metric descriptions and scale labels, they exhibit weaknesses in detecting textual format variations. Humans struggled to focus strictly on metric descriptions and labels. LLM ratings and explanations were viewed as a helpful support, but numeric LLM and human ratings lacked agreement. The final questionnaire extends the initial framework by focusing on user intent, text structuring, and information verifiability.
zh

[AI-38] LLM Agents for Knowledge Discovery in Atomic Layer Processing NEURIPS2025

【速读】:该论文旨在解决如何利用大语言模型(Large Language Models, LLMs)作为独立推理代理,在材料科学领域实现自主知识发现的问题。传统方法通常聚焦于特定任务优化或受控流程执行,而本文提出了一种无需显式指令、仅通过有限探针能力对黑箱系统进行自由探索的策略,以生成可泛化的科学认知。解决方案的关键在于:重构LangGraph工具功能以构建具备自主交互能力的LLM代理,并通过类比儿童游戏的试错机制验证其路径依赖性和探索有效性;随后将其应用于原子层加工反应器仿真中,成功识别并利用多种化学相互作用,证明了该框架在无监督条件下驱动知识发现的潜力。

链接: https://arxiv.org/abs/2509.26201
作者: Andreas Werbrouck,Marshall B. Lindsay,Matthew Maschmann,Matthias J. Young
机构: 未知
类目: Artificial Intelligence (cs.AI); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Materials Science (cond-mat.mtrl-sci)
备注: Accepted submission to the AI4MAT workshop@NEURIPS 2025. As submitted, except author names added

点击查看摘要

Abstract:Large Language Models (LLMs) have garnered significant attention for several years now. Recently, their use as independently reasoning agents has been proposed. In this work, we test the potential of such agents for knowledge discovery in materials science. We repurpose LangGraph’s tool functionality to supply agents with a black box function to interrogate. In contrast to process optimization or performing specific, user-defined tasks, knowledge discovery consists of freely exploring the system, posing and verifying statements about the behavior of this black box, with the sole objective of generating and verifying generalizable statements. We provide proof of concept for this approach through a children’s parlor game, demonstrating the role of trial-and-error and persistence in knowledge discovery, and the strong path-dependence of results. We then apply the same strategy to show that LLM agents can explore, discover, and exploit diverse chemical interactions in an advanced Atomic Layer Processing reactor simulation using intentionally limited probe capabilities without explicit instructions.
zh

[AI-39] Toward an Unbiased Collective Memory for Efficient LLM-Based Agentic 6G Cross-Domain Management

链接: https://arxiv.org/abs/2509.26200
作者: Hatim Chergui,Miguel Catalan Cid,Pouria Sayyad Khodashenas,Daniel Camps Mur,Christos Verikoukis
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: 12 pages, 8 figures

点击查看摘要

[AI-40] Too much alignment; not enough culture: Re-balancing cultural alignment practices in LLMs

链接: https://arxiv.org/abs/2509.26167
作者: Eric J. W. Orlowski,Hakim Norhashim,Tristan Koh Ly Wey
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, no figures

点击查看摘要

[AI-41] 90% Faster 100% Code-Free: MLLM-Driven Zero-Code 3D Game Development

链接: https://arxiv.org/abs/2509.26161
作者: Runxin Yang,Yuxuan Wan,Shuqing Li,Michael R. Lyu
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

[AI-42] Beyond the Algorithm: A Field Guide to Deploying AI Agents in Clinical Practice

【速读】:该论文旨在解决生成式 AI(Generative AI)在临床场景中从试点项目向常规医疗实践转化的“死亡之谷”问题,即当前大型语言模型(Large Language Models, LLMs)在医疗领域的潜力与实际落地之间存在显著差距。解决方案的关键在于将工作重心从算法开发转向系统性实施基础设施建设,具体包括五大“重担”:数据集成、模型验证、确保经济价值、管理模型漂移以及治理机制,并针对每项挑战提供可操作的解决方案,从而推动生成式 AI 在真实世界临床环境中的可持续部署与应用。

链接: https://arxiv.org/abs/2509.26153
作者: Jack Gallifant,Katherine C. Kellogg,Matt Butler,Amanda Centi,Patrick F. Doyle,Sayon Dutta,Joyce Guo,Matthew J. Hadfield,Esther H. Kim,David E. Kozono,Hugo JWL Aerts,Adam B. Landman,Raymond H. Mak,Rebecca G. Mishuris,Tanna L. Nelson,Guergana K. Savova,Elad Sharon,Benjamin C. Silverman,Umit Topaloglu,Jeremy L. Warner,Danielle S. Bitterman
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Under review. 5 Tables, 2 Figures

点击查看摘要

Abstract:Large language models (LLMs) integrated into agent-driven workflows hold immense promise for healthcare, yet a significant gap exists between their potential and practical implementation within clinical settings. To address this, we present a practitioner-oriented field manual for deploying generative agents that use electronic health record (EHR) data. This guide is informed by our experience deploying the “irAE-Agent”, an automated system to detect immune-related adverse events from clinical notes at Mass General Brigham, and by structured interviews with 20 clinicians, engineers, and informatics leaders involved in the project. Our analysis reveals a critical misalignment in clinical AI development: less than 20% of our effort was dedicated to prompt engineering and model development, while over 80% was consumed by the sociotechnical work of implementation. We distill this effort into five “heavy lifts”: data integration, model validation, ensuring economic value, managing system drift, and governance. By providing actionable solutions for each of these challenges, this field manual shifts the focus from algorithmic development to the essential infrastructure and implementation work required to bridge the “valley of death” and successfully translate generative AI from pilot projects into routine clinical care.
zh

[AI-43] Bubble Bubble AIs Rumble: Why Global Financial Regulatory Incident Reporting is Our Shield Against Systemic Stumbles

【速读】:该论文旨在解决当前AI incident数据库对资本市场的系统性风险识别存在显著盲区的问题,尤其是算法交易(Algorithmic Trading)和高频交易(High-Frequency Trading)中因透明度不足导致的风险难以被有效监测与管理。现有数据库多依赖众包或新闻爬取,无法覆盖金融领域特有的异常行为,从而加剧了监管滞后与市场脆弱性。其解决方案的关键在于构建一个监管级的全球数据库框架,通过融合后交易报告机制(Post-Trade Reporting Frameworks)与医疗及航空业成熟的事件记录模型,创新性地采用时间戳掩蔽技术(Temporal Data Omission Technique),在保留百分比级指标的前提下实现跨司法管辖区的风险分析,同时保护商业机密;此外,合成数据验证揭示了跨境系统性风险、可由K-means聚类识别的市场操纵集群,以及AI系统类型对交易行为的影响远超地理因素等关键发现,为监管机构、金融机构和投资者提供了前所未有的跨区域风险洞察力与合规整合能力。

链接: https://arxiv.org/abs/2509.26150
作者: Anchal Gupta,Gleb Pappyshev,James T Kwok
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:“Double, double toil and trouble; Fire burn and cauldron bubble.” As Shakespeare’s witches foretold chaos through cryptic prophecies, modern capital markets grapple with systemic risks concealed by opaque AI systems. According to the IMF, the August 5, 2024 plunge in Japanese and U.S. equities can be linked to algorithmic trading, yet it is absent from existing AI incident databases, exemplifying this transparency crisis. Current AI incident databases, reliant on crowdsourcing or news scraping, systematically overlook capital market anomalies, particularly in algorithmic and high-frequency trading. We address this critical gap by proposing a regulatory-grade global database that elegantly synthesises post-trade reporting frameworks with proven incident documentation models from healthcare and aviation. Our framework’s temporal data omission technique, which masks timestamps while preserving percentage-based metrics, enables sophisticated cross-jurisdictional analysis of emerging risks while safeguarding confidential business information. Synthetic data validation (modelled after real-life published incidents, sentiments, and data) reveals compelling patterns: systemic risks transcending geographical boundaries, market manipulation clusters distinctly identifiable via K-means algorithms, and AI system typology exerting significantly greater influence on trading behaviour than geographical location. This tripartite solution empowers regulators with unprecedented cross-jurisdictional oversight, financial institutions with seamless compliance integration, and investors with critical visibility into previously obscured AI-driven vulnerabilities. We call for immediate action to strengthen risk management and foster resilience in AI-driven financial markets against the volatile “cauldron” of AI-driven systemic risks, promoting global financial stability through enhanced transparency and coordinated oversight.
zh
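下面的片段示意摘要中“屏蔽时间戳、仅保留百分比指标后用 K-means 识别操纵模式聚类”的分析思路(假设性草图,字段名与数据均为虚构,仅作说明):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# hypothetical incident records: percentage-based metrics only, timestamps omitted
fields = ["price_move_pct", "volume_spike_pct", "order_cancel_rate_pct", "spread_widening_pct"]
rng = np.random.default_rng(42)
incidents = np.vstack([
    rng.normal([-3.0, 250.0, 70.0, 40.0], 5.0, size=(30, 4)),   # manipulation-like cluster
    rng.normal([-0.5, 20.0, 10.0, 5.0], 2.0, size=(70, 4)),     # benign anomalies
])

X = StandardScaler().fit_transform(incidents)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print({int(c): int((labels == c).sum()) for c in set(labels)})
```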

[AI-44] LMILAtt: A Deep Learning Model for Depression Detection from Social Media Users Enhanced by Multi-Instance Learning Based on Attention Mechanism

【速读】:该论文旨在解决抑郁症早期识别中传统方法存在的准确率不足、时间序列特征利用不充分以及标注成本高等问题。其解决方案的关键在于提出一种名为LMILAtt的新型模型,该模型创新性地融合了长短期记忆自编码器(Long Short-Term Memory autoencoders)与注意力机制(attention mechanism):首先通过无监督的LSTM自编码器提取用户推文的时间动态特征(如抑郁倾向演化模式),其次利用注意力机制对关键文本(如早期抑郁信号)进行动态加权,并构建多示例学习架构以提升个体层面的检测精度。实验表明,该方法在WU3D医学专业标注数据集上显著优于基线模型,在准确率、召回率和F1分数等指标上均有提升,同时弱监督学习策略有效降低了标注成本,为大规模社交媒体抑郁筛查提供了高效方案。

链接: https://arxiv.org/abs/2509.26145
作者: Yukun Yang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Depression is a major global public health challenge and its early identification is crucial. Social media data provides a new perspective for depression detection, but existing methods face limitations such as insufficient accuracy, insufficient utilization of time series features, and high annotation costs. To this end, this study proposes the LMILAtt model, which innovatively integrates Long Short-Term Memory autoencoders and attention mechanisms: firstly, the temporal dynamic features of user tweets (such as depressive tendency evolution patterns) are extracted through unsupervised LSTM autoencoders. Secondly, the attention mechanism is used to dynamically weight key texts (such as early depression signals) and construct a multiple-instance learning architecture to improve the accuracy of user-level detection. Finally, the performance was verified on the WU3D dataset labeled by medical professionals. Experiments show that the model is significantly better than the baseline model in terms of accuracy, recall and F1 score. In addition, the weakly supervised learning strategy significantly reduces the cost of labeling and provides an efficient solution for large-scale social media depression screening.
zh
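以下为该模型结构的精简 PyTorch 示意(假设性草图,维度与层数均为示意取值,并非论文官方实现):先用 LSTM 自编码器无监督提取每条推文的时序特征,再以注意力机制对用户的多条推文(多示例)加权汇聚,得到用户级抑郁概率。

```python
import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    """Unsupervised encoder for per-tweet temporal features (illustrative sizes)."""
    def __init__(self, in_dim=64, hid=32):
        super().__init__()
        self.encoder = nn.LSTM(in_dim, hid, batch_first=True)
        self.decoder = nn.LSTM(hid, in_dim, batch_first=True)

    def forward(self, x):                  # x: (batch, seq_len, in_dim)
        z, _ = self.encoder(x)
        recon, _ = self.decoder(z)
        return recon, z[:, -1]             # reconstruction + last hidden state as feature

class AttentionMIL(nn.Module):
    """Attention-based multiple-instance pooling over one user's tweets."""
    def __init__(self, feat_dim=32):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(feat_dim, 16), nn.Tanh(), nn.Linear(16, 1))
        self.clf = nn.Linear(feat_dim, 1)

    def forward(self, bag):                # bag: (n_tweets, feat_dim) for one user
        w = torch.softmax(self.attn(bag), dim=0)     # instance weights
        user_repr = (w * bag).sum(dim=0)             # weighted aggregation
        return torch.sigmoid(self.clf(user_repr)), w

# toy usage: 5 tweets, each a sequence of 10 steps with 64-dim inputs
ae, mil = LSTMAutoencoder(), AttentionMIL()
tweets = torch.randn(5, 10, 64)
_, feats = ae(tweets)
prob, weights = mil(feats)
print(prob.item(), weights.squeeze(-1).tolist())
```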

[AI-45] OWL: Geometry-Aware Spatial Reasoning for Audio Large Language Models

【速读】:该论文旨在解决当前音频大语言模型(Audio Large Language Models, ALLMs)在空间推理能力上的局限性,尤其是方向和距离估计的准确性不足以及缺乏可解释推理机制的问题。现有方法如BAT虽能进行空间问答(Spatial QA),但受限于粗粒度类别标签(左、右、上、下)及缺乏显式的几何监督,导致分辨率低且鲁棒性差。其解决方案的关键在于提出SAGE(Spatial-Acoustic Geometry Encoder)——一个几何感知的音频编码器,在训练阶段利用全景深度图与房间脉冲响应(Room-Impulse Responses, RIRs)将双耳声学特征对齐至3D空间结构,而在推理时仅需音频输入;进一步结合OWL模型,通过空间锚定的思维链(chain-of-thought)实现对到达方向(DoA)和距离的多步理性推理,从而提升空间感知精度与可解释性。

链接: https://arxiv.org/abs/2509.26140
作者: Subrata Biswas,Mohammad Nur Hossain Khan,Bashima Islam
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Spatial reasoning is fundamental to auditory perception, yet current audio large language models (ALLMs) largely rely on unstructured binaural cues and single step inference. This limits both perceptual accuracy in direction and distance estimation and the capacity for interpretable reasoning. Recent work such as BAT demonstrates spatial QA with binaural audio, but its reliance on coarse categorical labels (left, right, up, down) and the absence of explicit geometric supervision constrain resolution and robustness. We introduce the Spatial-Acoustic Geometry Encoder (SAGE), a geometry-aware audio encoder that aligns binaural acoustic features with 3D spatial structure using panoramic depth images and room-impulse responses at training time, while requiring only audio at inference. Building on this representation, we present OWL, an ALLM that integrates SAGE with a spatially grounded chain-of-thought to rationalize over direction-of-arrivals (DoA) and distance estimates. Through curriculum learning from perceptual QA to multi-step reasoning, OWL supports o'clock-level azimuth and DoA estimation. To enable large-scale training and evaluation, we construct and release BiDepth, a dataset of over one million QA pairs combining binaural audio with panoramic depth images and room impulse responses across both in-room and out-of-room scenarios. Across two benchmark datasets, our new BiDepth and the public SpatialSoundQA, OWL reduces mean DoA error by 11° through SAGE and improves spatial reasoning QA accuracy by up to 25% over BAT.
zh

[AI-46] Leveraging AI modelling for FDS with Simvue: monitor and optimise for more sustainable simulations

【速读】:该论文旨在解决大规模、高数量火灾模拟在时间和能源消耗上的高需求问题。其解决方案的关键在于提出了一种多维度优化策略:首先,开发了一个定制的机器学习代理模型(surrogate model),能够在保持精度的同时显著提升热传导动力学预测速度,相比当前最先进的计算流体动力学(CFD)软件可实现数量级加速;其次,引入一种引导式优化流程,利用轻量级模型智能决策仿真任务的执行顺序,在定位建筑内最危险火灾位置(以烟雾对能见度的影响为指标)时实现了仿真次数减少一个数量级(即十倍降低);最后,构建了名为Simvue的框架与产品,集成上述工具并提供自动化组织与追踪功能,从而促进数据复用、减少冗余并提升仿真管理效率。

链接: https://arxiv.org/abs/2509.26139
作者: James Panayis(1),Matt Field(1),Vignesh Gopakumar(1 and 2),Andrew Lahiff(1),Kristian Zarebski(1),Aby Abraham(1),Jonathan L. Hodges(3) ((1) UK Atomic Energy Authority, (2) UCL Centre for AI - UK, (3) Jensen Hughes - USA)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注: 12 pages, 17 figures, Interflam Conference

点击查看摘要

Abstract:There is high demand on fire simulations, in both scale and quantity. We present a multi-pronged approach to improving the time and energy required to meet these demands. We show the ability of a custom machine learning surrogate model to predict the dynamics of heat propagation orders of magnitude faster than state-of-the-art CFD software for this application. We also demonstrate how a guided optimisation procedure can decrease the number of simulations required to meet an objective; using lightweight models to decide which simulations to run, we see a tenfold reduction when locating the most dangerous location for a fire to occur within a building based on the impact of smoke on visibility. Finally we present a framework and product, Simvue, through which we access these tools along with a host of automatic organisational and tracking features which enables future reuse of data and more savings through better management of simulations and combating redundancy.
zh

[AI-47] MEDAKA: Construction of Biomedical Knowledge Graphs Using Large Language Models

【速读】:该论文旨在解决现有生物医学知识图谱(Knowledge Graph, KG)在构建过程中对药物说明书等非结构化文本数据利用不足的问题,导致其覆盖范围局限于分子相互作用或不良反应等狭窄领域,而忽略了临床实践中至关重要的用药指导信息。解决方案的关键在于提出一个可复用的端到端流水线,结合网络爬虫与大语言模型(Large Language Model, LLM),从公开的药物说明书中自动抽取结构化信息,并基于此构建了一个名为MEDAKA的高质量、临床相关属性丰富的生物医学知识图谱数据集,涵盖副作用、警告、禁忌症、成分、剂量指南、储存说明及物理特性等关键维度。该方法不仅提升了KG的信息广度和实用性,也为患者安全监测和个性化药物推荐提供了支持。

链接: https://arxiv.org/abs/2509.26128
作者: Asmita Sengupta,David Antony Selby,Sebastian Josef Vollmer,Gerrit Großmann
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures, 2 tables

点击查看摘要

Abstract:Knowledge graphs (KGs) are increasingly used to represent biomedical information in structured, interpretable formats. However, existing biomedical KGs often focus narrowly on molecular interactions or adverse events, overlooking the rich data found in drug leaflets. In this work, we present (1) a hackable, end-to-end pipeline to create KGs from unstructured online content using a web scraper and an LLM; and (2) a curated dataset, MEDAKA, generated by applying this method to publicly available drug leaflets. The dataset captures clinically relevant attributes such as side effects, warnings, contraindications, ingredients, dosage guidelines, storage instructions and physical characteristics. We evaluate it through manual inspection and with an LLM-as-a-Judge framework, and compare its coverage with existing biomedical KGs and databases. We expect MEDAKA to support tasks such as patient safety monitoring and drug recommendation. The pipeline can also be used for constructing KGs from unstructured texts in other domains. Code and dataset are available at this https URL.
zh
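下面给出该抽取流水线核心步骤的极简示意(假设性草图,call_llm 为需自行替换的占位接口,属性列表亦为示意,并非论文官方代码):将说明书文本与目标属性交给 LLM 输出 JSON,再展开为 (药品, 属性, 值) 三元组。

```python
import json

ATTRIBUTES = ["side_effects", "warnings", "contraindications",
              "ingredients", "dosage", "storage"]

PROMPT = """Extract the following attributes from the drug leaflet below
and return a JSON object with exactly these keys: {keys}.
Leaflet:
{leaflet}"""

def call_llm(prompt: str) -> str:
    """Placeholder for an actual LLM API call (hypothetical)."""
    raise NotImplementedError("plug in your LLM client here")

def leaflet_to_triples(drug_name: str, leaflet_text: str):
    raw = call_llm(PROMPT.format(keys=", ".join(ATTRIBUTES), leaflet=leaflet_text))
    record = json.loads(raw)
    triples = []
    for attr in ATTRIBUTES:
        for value in record.get(attr, []) or []:
            triples.append((drug_name, attr, value))   # KG edge: drug --attr--> value
    return triples
```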

[AI-48] AGOCS – Accurate Google Cloud Simulator Framework ATC

链接: https://arxiv.org/abs/2509.26120
作者: Leszek Sliwko,Vladimir Getov
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注: This is the accepted author’s version of the paper. The final published version is available in the Proceedings of the 2016 IEEE International Conferences on Ubiquitous Intelligence and Computing (UIC), Advanced and Trusted Computing (ATC), Scalable Computing and Communications (ScalCom), Cloud and Big Data Computing (CBDCom), Internet of People (IoP), and Smart World Congress (SmartWorld)

点击查看摘要

[AI-49] SafeEvalAgent: Toward Agentic and Self-Evolving Safety Evaluation of LLMs

【速读】:该论文旨在解决当前静态基准测试无法应对生成式 AI(Generative AI)在高风险领域部署时所面临的动态安全威胁与不断演进的法规要求的问题,从而造成显著的安全评估盲区。其核心解决方案是提出一种名为 SafeEvalAgent 的多智能体框架,该框架通过自主解析非结构化政策文档来生成并持续进化全面的安全评估基准;其关键创新在于引入“自演化评估循环”(Self-evolving Evaluation loop),使系统能够基于评估结果迭代优化测试用例,逐步提升检测深度和针对性,从而揭示传统静态方法难以发现的模型深层安全隐患。

链接: https://arxiv.org/abs/2509.26100
作者: Yixu Wang,Xin Wang,Yang Yao,Xinyuan Li,Yan Teng,Xingjun Ma,Yingchun Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid integration of Large Language Models (LLMs) into high-stakes domains necessitates reliable safety and compliance evaluation. However, existing static benchmarks are ill-equipped to address the dynamic nature of AI risks and evolving regulations, creating a critical safety gap. This paper introduces a new paradigm of agentic safety evaluation, reframing evaluation as a continuous and self-evolving process rather than a one-time audit. We then propose a novel multi-agent framework SafeEvalAgent, which autonomously ingests unstructured policy documents to generate and perpetually evolve a comprehensive safety benchmark. SafeEvalAgent leverages a synergistic pipeline of specialized agents and incorporates a Self-evolving Evaluation loop, where the system learns from evaluation results to craft progressively more sophisticated and targeted test cases. Our experiments demonstrate the effectiveness of SafeEvalAgent, showing a consistent decline in model safety as the evaluation hardens. For instance, GPT-5’s safety rate on the EU AI Act drops from 72.50% to 36.36% over successive iterations. These findings reveal the limitations of static assessments and highlight our framework’s ability to uncover deep vulnerabilities missed by traditional methods, underscoring the urgent need for dynamic evaluation ecosystems to ensure the safe and responsible deployment of advanced AI.
zh

[AI-50] On Computing Top-k Simple Shortest Paths from a Single Source

【速读】:该论文致力于解决加权有向图中从单源点出发的前k条简单最短路径(top-k simple shortest paths)计算问题,即找出从指定源顶点到其他所有顶点的前k条无环最短路径。与已有研究主要关注单对顶点间的该问题不同,本文首次提出了针对单源情形的多项式时间算法。其关键创新在于:一方面,通过理论分析揭示了该问题解的结构特性;另一方面,设计了一种能利用单源特性进行高效搜索的新算法,该算法在时间复杂度上与将最优单对算法独立应用于每一对源-目标顶点的方法相当,但在实际运行效率上显著优于后者,实验表明其平均加速比可达数个数量级,从而成为该问题在实际场景中的首选解决方案。

链接: https://arxiv.org/abs/2509.26094
作者: Mattia D’Emidio,Gabriele Di Stefano
机构: 未知
类目: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Networking and Internet Architecture (cs.NI)
备注: 21 pages, 2 figures, to be published in ALENEX 2026

点击查看摘要

Abstract:We investigate the problem of computing the top-k simple shortest paths in weighted digraphs. While the single-pair variant – finding the top-k simple shortest paths between two specified vertices – has been extensively studied over the past decades, with Yen’s algorithm and its heuristic improvements emerging as the most effective solving strategies, relatively little attention has been devoted to the more general single-source version, where the goal is determining top-k simple shortest paths from a source vertex to all other vertices. Motivated by the numerous practical applications of ranked shortest paths, in this paper we provide new insights and algorithmic contributions to this problem. In particular, we first present a theoretical characterization of the structural properties of its solutions. Then, we introduce the first polynomial-time algorithm specifically designed to handle it. On the one hand, we prove our new algorithm is on par, in terms of time complexity, with the best (and only) polynomial-time approach known in the literature to solve the problem, that is applying the fastest single-pair algorithm independently to each vertex pair formed by the source and the remaining vertices. On the other hand, through an extensive experimental evaluation on both real-world and synthetic graphs, we demonstrate that our algorithm consistently and significantly outperforms the latter baseline in terms of running time, achieving speed-ups of up to several orders of magnitude. These results establish our new algorithm as the solution to be preferred for computing k simple shortest paths from a single source in practical settings.
zh
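作为参照,下述代码演示了摘要中提到的多项式时间基线(而非论文提出的新算法):对源点到每个目标顶点分别调用单对 top-k 简单最短路算法,这里借助 networkx 的 shortest_simple_paths(Yen 算法实现),新算法在实验中正是相对这一基线取得数量级加速。

```python
import itertools
import networkx as nx

def single_source_top_k_baseline(G: nx.DiGraph, source, k: int, weight="weight"):
    """Baseline: run a single-pair top-k simple shortest path algorithm (Yen's,
    via networkx.shortest_simple_paths) once per target vertex."""
    result = {}
    for target in G.nodes:
        if target == source:
            continue
        try:
            gen = nx.shortest_simple_paths(G, source, target, weight=weight)
            result[target] = list(itertools.islice(gen, k))
        except nx.NetworkXNoPath:
            result[target] = []
    return result

# toy weighted digraph
G = nx.DiGraph()
G.add_weighted_edges_from([("s", "a", 1), ("s", "b", 4), ("a", "b", 1),
                           ("a", "t", 5), ("b", "t", 1)])
print(single_source_top_k_baseline(G, "s", k=2))
```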

[AI-51] Evaluating the Use of Large Language Models as Synthetic Social Agents in Social Science Research

链接: https://arxiv.org/abs/2509.26080
作者: Emma Rose Madden
机构: 未知
类目: Artificial Intelligence (cs.AI); Applications (stat.AP)
备注:

点击查看摘要

[AI-52] Real-time Noise Detection and Classification in Single-Channel EEG: A Lightweight Machine Learning Approach for EMG White Noise and EOG Artifacts

【速读】:该论文旨在解决单通道脑电图(EEG)在真实场景中面临的三大挑战:多通道方法计算效率低、对多重噪声干扰鲁棒性差,以及深度学习模型在准确率与复杂度之间难以平衡的问题。其解决方案的关键在于提出一种混合频域-时域框架,通过时域低通滤波(针对低频眼电伪迹EOG)与频域功率谱密度(PSD)分析(捕捉宽频带肌电伪迹EMG)相结合,并利用主成分分析(PCA)优化特征融合策略,在最小化冗余的同时保留判别信息。该特征工程方法使轻量级多层感知机(MLP)在低信噪比(SNR -7 dB)下达到99%准确率,在中等噪声(SNR 4 dB)下保持90%准确率,且在多重伪迹叠加(EMG+EOG+白噪声)情况下仍维持96%分类准确率,显著优于CNN和RNN等复杂模型,同时训练时间仅需30秒(比CNN快97%),从而实现了临床可用性与计算效率的统一。

链接: https://arxiv.org/abs/2509.26058
作者: Hossein Enshaei,Pariya Jebreili,Sayed Mahmoud Sakahei
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Electroencephalogram (EEG) artifact detection in real-world settings faces significant challenges such as computational inefficiency in multi-channel methods, poor robustness to simultaneous noise, and trade-offs between accuracy and complexity in deep learning models. We propose a hybrid spectral-temporal framework for real-time detection and classification of ocular (EOG), muscular (EMG), and white noise artifacts in single-channel EEG. This method, in contrast to other approaches, combines time-domain low-pass filtering (targeting low-frequency EOG) and frequency-domain power spectral density (PSD) analysis (capturing broad-spectrum EMG), followed by PCA-optimized feature fusion to minimize redundancy while preserving discriminative information. This feature engineering strategy allows a lightweight multi-layer perceptron (MLP) architecture to outperform advanced CNNs and RNNs by achieving 99% accuracy at low SNRs (SNR -7 dB) and 90% accuracy in moderate noise (SNR 4 dB). Additionally, this framework addresses the unexplored problem of simultaneous multi-source contamination (EMG+EOG+white noise), where it maintains 96% classification accuracy despite overlapping artifacts. With 30-second training times (97% faster than CNNs) and robust performance across SNR levels, this framework bridges the gap between clinical applicability and computational efficiency, which enables real-time use in wearable brain-computer interfaces. This work also challenges the ubiquitous dependence on model depth for EEG artifact detection by demonstrating that domain-informed feature fusion surpasses complex architectures in noisy scenarios.
zh
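以下是该混合时频特征流程的一个示意实现(假设性草图,采样率、滤波截止频率、频带划分与网络规模均为示意取值,并非论文设定):低通滤波提取时域特征,Welch 法计算 PSD 特征,拼接后经 PCA 降维并送入轻量 MLP 分类。

```python
import numpy as np
from scipy.signal import butter, filtfilt, welch
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

FS = 256  # sampling rate in Hz (illustrative)

def extract_features(segment: np.ndarray) -> np.ndarray:
    # time-domain branch: low-pass filter targeting slow EOG activity
    b, a = butter(4, 8 / (FS / 2), btype="low")
    low = filtfilt(b, a, segment)
    time_feats = [low.mean(), low.std(), np.abs(low).max()]
    # frequency-domain branch: Welch PSD capturing broadband EMG power
    freqs, psd = welch(segment, fs=FS, nperseg=FS)
    freq_feats = [psd[(freqs >= lo) & (freqs < hi)].mean()
                  for lo, hi in [(1, 4), (4, 8), (8, 13), (13, 30), (30, 100)]]
    return np.array(time_feats + freq_feats)

# toy data: 200 one-second segments with random labels (clean / EOG / EMG)
rng = np.random.default_rng(0)
X = np.stack([extract_features(rng.normal(size=FS)) for _ in range(200)])
y = rng.integers(0, 3, size=200)

clf = make_pipeline(PCA(n_components=5),
                    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500))
clf.fit(X, y)
print(clf.score(X, y))
```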

[AI-53] Muon Outperforms Adam in Tail-End Associative Memory Learning

【速读】:该论文旨在解决Muon优化器在训练大语言模型(Large Language Models, LLMs)时相较于Adam优化器表现出更快收敛速度但其内在机制尚不明确的问题。解决方案的关键在于从关联记忆(associative memory)视角揭示Muon的优势来源:通过消融实验发现,LLMs中与关联记忆相关的参数——即Value和Output(VO)注意力权重及前馈网络(Feed-Forward Networks, FFNs)——是Muon性能优越的主要贡献者;进一步理论分析表明,Muon的更新规则能生成更各向同性的奇异谱,从而在重尾分布数据上对低频类别(tail classes)实现更均衡有效的学习,这归因于其更新规则与线性关联记忆的外积结构相一致,使学习过程不受特征嵌入性质影响,显著优于Adam在类别不平衡场景下的表现。

链接: https://arxiv.org/abs/2509.26030
作者: Shuche Wang,Fengzhuo Zhang,Jiaxiang Li,Cunxiao Du,Chao Du,Tianyu Pang,Zhuoran Yang,Mingyi Hong,Vincent Y. F. Tan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The Muon optimizer is consistently faster than Adam in training Large Language Models (LLMs), yet the mechanism underlying its success remains unclear. This paper demystifies this mechanism through the lens of associative memory. By ablating the transformer components optimized by Muon, we reveal that the associative memory parameters of LLMs, namely the Value and Output (VO) attention weights and Feed-Forward Networks (FFNs), are the primary contributors to Muon’s superiority. Motivated by this associative memory view, we then explain Muon’s superiority on real-world corpora, which are intrinsically heavy-tailed: a few classes (tail classes) appear far less frequently than others. The superiority is explained through two key properties: (i) its update rule consistently yields a more isotropic singular spectrum than Adam; and as a result, (ii) on heavy-tailed data, it optimizes tail classes more effectively than Adam. Beyond empirical evidence, we theoretically confirm these findings by analyzing a one-layer associative memory model under class-imbalanced data. We prove that Muon consistently achieves balanced learning across classes regardless of feature embeddings, whereas Adam can induce large disparities in learning errors depending on embedding properties. In summary, our empirical observations and theoretical analyses reveal Muon’s core advantage: its update rule aligns with the outer-product structure of linear associative memories, enabling more balanced and effective learning of tail classes in heavy-tailed distributions than Adam.
zh
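下面用 SVD 正交化直观展示 Muon 更新与“各向同性奇异谱”之间的联系(示意草图:据公开资料,实际 Muon 以 Newton-Schulz 迭代近似正交化并作用于动量矩阵,此处做了简化):将梯度 G = U S V^T 替换为 U V^T 后,更新方向的奇异值被全部拉平。

```python
import numpy as np

def orthogonalized_update(G: np.ndarray) -> np.ndarray:
    """Replace the gradient's singular values with ones: G = U S V^T -> U V^T.
    This is the idealized (SVD-based) form of a Muon-style orthogonalized update."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
G = rng.normal(size=(8, 4)) @ np.diag([10.0, 3.0, 0.5, 0.01])  # very anisotropic gradient
update = orthogonalized_update(G)
print("singular values of raw gradient:",
      np.round(np.linalg.svd(G, compute_uv=False), 3))
print("singular values of orthogonalized update:",
      np.round(np.linalg.svd(update, compute_uv=False), 3))
```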

[AI-54] Indirect Attention: Turning Context Misalignment into a Feature

链接: https://arxiv.org/abs/2509.26015
作者: Bissmella Bahaduri,Hicham Talaoubrid,Fangchen Feng,Zuheng Ming,Anissa Mokraoui
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-55] Towards Human Engagement with Realistic AI Combat Pilots

【速读】:该论文旨在解决人类操作者与智能代理在复杂三维空战模拟环境中实时协同作战的问题,尤其是在传统防御仿真工具中缺乏具备自主战术决策能力的智能体。解决方案的关键在于构建一个基于多智能体强化学习(Multi-Agent Reinforcement Learning)训练的智能代理系统,并开发专用通信接口,实现训练好的代理无缝部署至VR-Forces仿真平台,从而支持人机混合对抗场景下的战术交互与沉浸式训练,为新型作战策略探索和人机协同效能提升提供技术支撑。

链接: https://arxiv.org/abs/2509.26002
作者: Ardian Selmonaj,Giacomo Del Rio,Adrian Schneider,Alessandro Antonucci
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO)
备注: 13th International Conference on Human-Agent Interaction (HAI) 2025

点击查看摘要

Abstract:We present a system that enables real-time interaction between human users and agents trained to control fighter jets in simulated 3D air combat scenarios. The agents are trained in a dedicated environment using Multi-Agent Reinforcement Learning. A communication link is developed to allow seamless deployment of trained agents into VR-Forces, a widely used defense simulation tool for realistic tactical scenarios. This integration allows mixed simulations where human-controlled entities engage with intelligent agents exhibiting distinct combat behaviors. Our interaction model creates new opportunities for human-agent teaming, immersive training, and the exploration of innovative tactics in defense contexts.
zh

[AI-56] MHINDR - a DSM5 based mental health diagnosis and recommendation framework using LLM

【速读】:该论文旨在解决如何利用生成式 AI(Generative AI)对心理健康论坛中用户自述文本进行精准诊断与个性化干预的问题,以辅助临床工作者提升心理评估效率和治疗方案的针对性。其解决方案的关键在于构建一个基于大语言模型(Large Language Model, LLM)的框架 MHINDR,该框架融合 DSM-5 诊断标准,通过提取时间维度信息实现症状进展追踪,并结合心理特征生成结构化、可解释的用户心理健康摘要,从而提供可扩展、可定制且数据驱动的干预建议,适用于多样化的临床场景与职场健康项目。

链接: https://arxiv.org/abs/2509.25992
作者: Vaishali Agarwal,Sachin Thukral,Arnab Chatterjee
机构: 未知
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 7 pages, 1 figure, 4 tables

点击查看摘要

Abstract:Mental health forums offer valuable insights into psychological issues, stressors, and potential solutions. We propose MHINDR, a large language model (LLM) based framework integrated with DSM-5 criteria to analyze user-generated text, diagnose mental health conditions, and generate personalized interventions and insights for mental health practitioners. Our approach emphasizes the extraction of temporal information for accurate diagnosis and symptom progression tracking, together with psychological features to create comprehensive mental health summaries of users. The framework delivers scalable, customizable, and data-driven therapeutic recommendations, adaptable to diverse clinical contexts, patient needs, and workplace well-being programs.
zh

[AI-57] Towards Unified Multimodal Misinformation Detection in Social Media: A Benchmark Dataset and Baseline

链接: https://arxiv.org/abs/2509.25991
作者: Haiyang Li,Yaxiong Wang,Lianwei Wu,Lechao Cheng,Zhun Zhong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-58] R-Log: Incentivizing Log Analysis Capability in LLMs via Reasoning-based Reinforcement Learning

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的自动化日志分析方法中存在的两个核心问题:一是监督微调(Supervised Fine-Tuning, SFT)导致的领域差异(domain discrepancy)和过拟合问题,二是SFT中损失函数不平衡引发的长上下文主导、关键细节被淹没,进而产生幻觉(hallucination)的问题。解决方案的关键在于提出一种基于推理(reasoning-based)的新范式R-Log,其通过模仿工程师结构化的分步分析流程来增强模型对日志背后逻辑规则的学习能力,从而提升泛化性;同时引入强化学习(Reinforcement Learning, RL)在模拟运维(Operations Management, OM)环境中优化模型,以联合奖励函数直接奖励正确输出,有效抑制幻觉。

链接: https://arxiv.org/abs/2509.25987
作者: Yilun Liu,Ziang Chen,Song Xu,Minggui He,Shimin Tao,Weibin Meng,Yuming Xie,Tao Han,Chunguang Zhao,Jingzhou Du,Daimeng Wei,Shenglin Zhang,Yongqian Sun
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The growing complexity of log data in modern software systems has prompted the use of Large Language Models (LLMs) for automated log analysis. Current approaches typically rely on direct supervised fine-tuning (SFT) on log-label pairs. However, this exacerbates the domain discrepancy between general-purpose LLMs and specialized log data, causing overfitting. Furthermore, SFT’s imbalanced loss computation often allows lengthy contexts to overwhelm critical, concise details in model answers, leading to hallucinations. To address these limitations, we propose R-Log, a novel reasoning-based paradigm that mirrors the structured, step-by-step analytical process of human engineers. This approach enhances generalizability by learning the underlying rules behind conclusions. We further employ Reinforcement Learning (RL) to optimize the model within a simulated OM environment, thereby reducing hallucinations by directly rewarding correct outcomes. R-Log is first cold-started on a curated dataset of 2k+ reasoning trajectories, guided by 13 strategies from manual OM practices, to establish an initial reasoning capability. This ability is then refined via RL using a joint reward function. Empirical evaluations on real-world logs show that R-Log outperforms existing methods across five log analysis tasks, particularly in unseen scenarios (by 228.05%). We also designed R-Log-fast with 5x speedup while keeping 93% of the efficacy.
zh

[AI-59] Reconcile Certified Robustness and Accuracy for DNN-based Smoothed Majority Vote Classifier

【速读】:该论文旨在解决当前理论研究中对多数投票分类器(majority vote classifier)的认证鲁棒性(certified robustness)及其与泛化性能之间关系缺乏系统分析的问题。解决方案的关键在于提出一个包含认证鲁棒半径(certified robust radius)的泛化误差上界,该上界适用于平滑输入下的Q-加权多数投票分类器(即 smoothed majority vote classifier),从而确保在任意不超过该半径的数据扰动下,泛化边界依然成立。进一步地,研究发现该上界和鲁棒半径的理论基础部分依赖于权重谱范数(weight spectral norm),这启发了在平滑训练中引入谱正则化(spectral regularization)以提升认证鲁棒性;为此,作者利用球面高斯输入在平滑训练中的维度无关特性,设计了一种新颖且计算成本低廉的谱正则化项,有效增强了平滑多数投票分类器的鲁棒性。

链接: https://arxiv.org/abs/2509.25979
作者: Gaojie Jin,Xinping Yi,Xiaowei Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Under Review

点击查看摘要

Abstract:Within the PAC-Bayesian framework, the Gibbs classifier (defined on a posterior Q) and the corresponding Q-weighted majority vote classifier are commonly used to analyze the generalization performance. However, there exists a notable lack in theoretical research exploring the certified robustness of majority vote classifier and its interplay with generalization. In this study, we develop a generalization error bound that possesses a certified robust radius for the smoothed majority vote classifier (i.e., the Q-weighted majority vote classifier with smoothed inputs); In other words, the generalization bound holds under any data perturbation within the certified robust radius. As a byproduct, we find that the underpinnings of both the generalization bound and the certified robust radius draw, in part, upon weight spectral norm, which thereby inspires the adoption of spectral regularization in smooth training to boost certified robustness. Utilizing the dimension-independent property of spherical Gaussian inputs in smooth training, we propose a novel and inexpensive spectral regularizer to enhance the smoothed majority vote classifier. In addition to the theoretical contribution, a set of empirical results is provided to substantiate the effectiveness of our proposed method.
zh
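下述 PyTorch 片段示意“在平滑训练(高斯噪声输入)中加入权重谱范数正则项”的一种常见做法(假设性草图,幂迭代步数、噪声强度与正则系数均为示意取值,论文所提正则项的具体形式可能不同):

```python
import torch
import torch.nn as nn

def spectral_norm_penalty(weight: torch.Tensor, n_iters: int = 5) -> torch.Tensor:
    """Estimate the largest singular value of a weight matrix via power iteration."""
    W = weight.reshape(weight.shape[0], -1)
    v = torch.randn(W.shape[1], device=W.device)
    for _ in range(n_iters):
        u = torch.nn.functional.normalize(W @ v, dim=0)
        v = torch.nn.functional.normalize(W.t() @ u, dim=0)
    return u @ (W @ v)          # approximate spectral norm, differentiable w.r.t. W

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(128, 32), torch.randint(0, 10, (128,))

sigma, lam = 0.25, 1e-3   # smoothing noise std and regularization weight (illustrative)
for _ in range(10):
    noisy_x = x + sigma * torch.randn_like(x)              # smooth training input
    loss = nn.functional.cross_entropy(model(noisy_x), y)
    reg = sum(spectral_norm_penalty(p) for p in model.parameters() if p.dim() >= 2)
    (loss + lam * reg).backward()
    opt.step()
    opt.zero_grad()
```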

[AI-60] Data-Free Continual Learning of Server Models in Model-Heterogeneous Federated learning

链接: https://arxiv.org/abs/2509.25977
作者: Xiao Zhang,Zengzhe Chen,Yuan Yuan,Yifei Zou,Fuzhen Zhuang,Wenyu Jiao,Yuke Wang,Dongxiao Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-61] Scalable and Robust LLM Unlearning by Correcting Responses with Retrieved Exclusions

链接: https://arxiv.org/abs/2509.25973
作者: Junbeom Kim,Kyuyoung Kim,Jihoon Tack,Dongha Lim,Jinwoo Shin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-62] AIM: Adaptive Intervention for Deep Multi-task Learning of Molecular Properties

链接: https://arxiv.org/abs/2509.25955
作者: Mason Minot,Gisbert Schneider
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
备注: 13 pages, 3 figures, 9 tables

点击查看摘要

[AI-63] Automated Model Discovery via Multi-modal Multi-step Pipeline

链接: https://arxiv.org/abs/2509.25946
作者: Lee Jung-Mok,Nam Hyeon-Woo,Moon Ye-Bin,Junhyun Nam,Tae-Hyun Oh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-64] NuRisk: A Visual Question Answering Dataset for Agent-Level Risk Assessment in Autonomous Driving

【速读】:该论文旨在解决自动驾驶中风险理解的局限性问题,即现有基于视觉语言模型(VLM)的方法主要依赖静态图像进行感知与预测,缺乏对代理行为和环境上下文的时空推理能力,无法有效捕捉风险随时间演变的过程。解决方案的关键在于构建一个名为NuRisk的全面视觉问答(VQA)数据集,该数据集包含2,900个场景和110万条代理级样本,基于nuScenes和Waymo的真实世界数据,并融合CommonRoad模拟器中的高安全敏感场景,提供基于鸟瞰图(BEV)的时序图像及量化代理级风险标注,从而支持显式的时空推理。在此基础上,研究者对7B规模的VLM进行微调,在保持低延迟的同时将准确率从33%提升至41%,首次展示了在自动驾驶场景下具备显式时空推理能力的VLM性能突破。

链接: https://arxiv.org/abs/2509.25944
作者: Yuan Gao,Mattia Piccinini,Roberto Brusnicki,Yuchen Zhang,Johannes Betz
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages

点击查看摘要

Abstract:Understanding risk in autonomous driving requires not only perception and prediction, but also high-level reasoning about agent behavior and context. Current Vision Language Models (VLMs)-based methods primarily ground agents in static images and provide qualitative judgments, lacking the spatio-temporal reasoning needed to capture how risks evolve over time. To address this gap, we propose NuRisk, a comprehensive Visual Question Answering (VQA) dataset comprising 2,900 scenarios and 1.1 million agent-level samples, built on real-world data from nuScenes and Waymo, supplemented with safety-critical scenarios from the CommonRoad simulator. The dataset provides Bird-Eye-View (BEV) based sequential images with quantitative, agent-level risk annotations, enabling spatio-temporal reasoning. We benchmark well-known VLMs across different prompting techniques and find that they fail to perform explicit spatio-temporal reasoning, resulting in a peak accuracy of 33% at high latency. To address these shortcomings, our fine-tuned 7B VLM agent improves accuracy to 41% and reduces latency by 75%, demonstrating explicit spatio-temporal reasoning capabilities that proprietary models lacked. While this represents a significant step forward, the modest accuracy underscores the profound challenge of the task, establishing NuRisk as a critical benchmark for advancing spatio-temporal reasoning in autonomous driving.
zh

[AI-65] Quantitative Evaluation of KIRETT Wearable Demonstrator for Rescue Operations

【速读】:该论文旨在解决急救场景中医疗决策效率与准确性不足的问题,尤其是在时间紧迫的情况下,传统依赖详细病史采集和患者沟通的诊断流程难以实施。解决方案的关键在于开发并评估一种名为KIRETT的可穿戴设备,该设备通过人工智能(AI)实现对患者生命体征的实时监测与情境识别,并提供基于患者数据的治疗建议,从而支持救援人员快速、精准地开展医疗干预。

链接: https://arxiv.org/abs/2509.25928
作者: Mubaris Nadeem,Johannes Zenkert,Lisa Bender,Christian Weber,Madjid Fathi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Conference paper for 2024 IEEE World AI IoT Congress (AIIoT), KIRETT Project, University of Siegen, Germany

点击查看摘要

Abstract:Healthcare and Medicine are under constant pressure to provide patient-driven medical expertise to ensure a fast and accurate treatment of the patient. In such scenarios, the diagnosis draws on the family history, long-term medical data, and a detailed consultation with the patient. In time-critical emergencies, such conversation and time-consuming elaboration are not possible. Rescue services need to provide fast, reliable treatments for the patient in need. With the help of modern technologies, like treatment recommendations, real-time vitals monitoring, and situation detection through artificial intelligence (AI), a situation can be analyzed and supported to provide fast, accurate, patient-data-driven medical treatments. In KIRETT, a wearable device is developed to support such scenarios and to provide treatment recommendations in rescue services. The objective of this paper is to present the quantitative results of a two-day KIRETT evaluation (14 participants) to analyze the needs of rescue operators in healthcare.
zh

[AI-66] KIRETT: Smart Integration of Vital Signs Data for Intelligent Decision Support in Rescue Scenarios

链接: https://arxiv.org/abs/2509.25923
作者: Mubaris Nadeem,Johannes Zenkert,Christian Weber,Lisa Bender,Madjid Fathi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Conference paper for 2024 IEEE International Conference on Electro Information Technology (eIT), KIRETT Project, University of Siegen, Germany

点击查看摘要

[AI-67] Accelerating LLM Inference with Precomputed Query Storage

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)在资源受限环境(如设备端或边缘部署)中推理延迟高的问题。其核心解决方案是提出StorInfer系统,通过离线预计算并存储可预测的查询-响应对,在用户查询语义匹配已存储对时直接返回缓存结果,从而跳过昂贵的GPU推理过程,显著降低延迟和计算成本。关键创新在于利用LLM驱动的生成器自适应地构建多样化且去重的查询集,结合自适应查询掩码与采样策略提升语义多样性,并借助基于磁盘的向量数据库实现高效相似性检索,最终在多个问答数据集上实现了最高达17.3%的延迟降低,且不损失响应质量。

链接: https://arxiv.org/abs/2509.25919
作者: Jay H. Park,Youngju Cho,Choungsol Lee,Moonwook Oh,Euiseong Seo
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language model (LLM) inference often suffers from high latency, particularly in resource-constrained environments such as on-device or edge deployments. To address this challenge, we present StorInfer, a novel storage-assisted LLM inference system that accelerates response time by precomputing and storing predictable query-response pairs offline. When a user query semantically matches a precomputed query, StorInfer bypasses expensive GPU inference and instantly returns the stored response, significantly reducing latency and compute costs. To maximize coverage and effectiveness, StorInfer employs an LLM-driven generator that adaptively produces diverse and deduplicated queries based on a given knowledge base. This is achieved via two techniques: adaptive query masking, which prevents regeneration of similar queries, and adaptive sampling, which dynamically tunes generation parameters to promote semantic diversity. The resulting query-response pairs are embedded and indexed using a disk-backed vector database to enable fast, similarity-based retrieval at runtime. Using this approach, we generated 150K unique precomputed pairs (taking up to 830 MB of storage space), achieving up to 17.3% latency reduction with no loss in response quality. Our evaluation across multiple QA datasets demonstrates the practicality and scalability of storage-assisted inference, especially in scenarios with predictable query distributions. StorInfer highlights a promising direction in leveraging storage as a primary enabler for efficient, low-latency LLM deployment.
zh
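下面的代码展示“预计算问答对 + 语义相似命中即跳过推理”这一核心思路的极简实现(假设性草图:embed 与 llm_generate 为占位函数,相似度阈值为示意取值;论文实际使用磁盘向量数据库而非内存数组):

```python
import numpy as np

class PrecomputedQueryStore:
    def __init__(self, threshold: float = 0.85):
        self.embs, self.answers, self.threshold = [], [], threshold

    def add(self, query_emb: np.ndarray, answer: str):
        self.embs.append(query_emb / np.linalg.norm(query_emb))
        self.answers.append(answer)

    def lookup(self, query_emb: np.ndarray):
        if not self.embs:
            return None
        q = query_emb / np.linalg.norm(query_emb)
        sims = np.stack(self.embs) @ q            # cosine similarity to all stored queries
        best = int(np.argmax(sims))
        return self.answers[best] if sims[best] >= self.threshold else None

def answer(query: str, store: PrecomputedQueryStore, embed, llm_generate) -> str:
    """Return a cached response when a semantically similar precomputed query exists,
    otherwise fall back to full LLM inference."""
    hit = store.lookup(embed(query))
    return hit if hit is not None else llm_generate(query)
```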

[AI-68] User-Centric Communication Service Provision for Edge-Assisted Mobile Augmented Reality

【速读】:该论文旨在解决未来6G网络中移动增强现实(MAR)应用面临的用户特定且非平稳的上行数据流量问题,以保障沉浸式用户体验所需的相机帧及时上传。其解决方案的关键在于引入数字孪生(Digital Twin, DT)技术,构建面向个体MAR设备的数据模型,精准刻画基于同时定位与地图构建(SLAM)的帧上传机制对用户特异性流量模式的影响,并定义两种DT操作函数以实现不同数据驱动模型之间的自适应切换,从而提升对非平稳流量的建模精度和鲁棒性;在此基础上,提出一种以用户为中心的网络资源管理算法,确保帧上传时延满足要求并抵抗建模误差,最终在Trace驱动仿真中相较5G切片方案提升14.2%的帧延迟达标率。

链接: https://arxiv.org/abs/2509.25905
作者: Conghao Zhou,Jie Gao,Shisheng Hu,Nan Cheng,Weihua Zhuang,Xuemin Shen
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: accepted by IEEE Transactions on Mobile Computing

点击查看摘要

Abstract:Future 6G networks are envisioned to facilitate edge-assisted mobile augmented reality (MAR) via strengthening the collaboration between MAR devices and edge servers. In order to provide immersive user experiences, MAR devices must timely upload camera frames to an edge server for simultaneous localization and mapping (SLAM)-based device pose tracking. In this paper, to cope with user-specific and non-stationary uplink data traffic, we develop a digital twin (DT)-based approach for user-centric communication service provision for MAR. Specifically, to establish DTs for individual MAR devices, we first construct a data model customized for MAR that captures the intricate impact of the SLAM-based frame uploading mechanism on the user-specific data traffic pattern. We then define two DT operation functions that cooperatively enable adaptive switching between different data-driven models for capturing non-stationary data traffic. Leveraging the user-oriented data management introduced by DTs, we propose an algorithm for network resource management that ensures the timeliness of frame uploading and the robustness against inherent inaccuracies in data traffic modeling for individual MAR devices. Trace-driven simulation results demonstrate that the user-centric communication service provision achieves a 14.2% increase in meeting the camera frame uploading delay requirement in comparison with the slicing-based communication service provision widely used for 5G.
zh

[AI-69] SafeMind: Benchmarking and Mitigating Safety Risks in Embodied LLM Agents

链接: https://arxiv.org/abs/2509.25885
作者: Ruolin Chen,Yinqian Sun,Jihang Wang,Mingyang Lv,Qian Zhang,Yi Zeng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-70] Efficient On-Policy Reinforcement Learning via Exploration of Sparse Parameter Space

【速读】:该论文旨在解决当前基于策略梯度的方法(如近端策略优化,Proximal Policy Optimization, PPO)在更新过程中仅沿单一随机梯度方向进行优化,导致未能充分探索参数空间局部结构的问题。研究表明,代理梯度(surrogate gradient)与真实奖励景观之间相关性较差,且性能更优的解常位于当前更新路径附近的未探索区域。为此,作者提出ExploRLer——一个可插拔的探索增强管道,无缝集成于PPO、TRPO等on-policy算法中,在不增加梯度更新次数的前提下,系统性地探测代理梯度更新邻域内的未探索区域,从而显著提升复杂连续控制任务中的性能表现。其关键在于通过迭代级探索(iteration-level exploration)挖掘参数空间中潜在的高性能区域,突破传统代理目标函数的局限性。

链接: https://arxiv.org/abs/2509.25876
作者: Xinyu Zhang,Aishik Deb,Klaus Mueller
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages; 7 figures

点击查看摘要

Abstract:Policy-gradient methods such as Proximal Policy Optimization (PPO) are typically updated along a single stochastic gradient direction, leaving the rich local structure of the parameter space unexplored. Previous work has shown that the surrogate gradient is often poorly correlated with the true reward landscape. Building on this insight, we visualize the parameter space spanned by policy checkpoints within an iteration and reveal that higher performing solutions often lie in nearby unexplored regions. To exploit this opportunity, we introduce ExploRLer, a pluggable pipeline that seamlessly integrates with on-policy algorithms such as PPO and TRPO, systematically probing the unexplored neighborhoods of surrogate on-policy gradient updates. Without increasing the number of gradient updates, ExploRLer achieves significant improvements over baselines in complex continuous control environments. Our results demonstrate that iteration-level exploration provides a practical and effective way to strengthen on-policy reinforcement learning and offer a fresh perspective on the limitations of the surrogate objective.
zh
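以下示意“在代理梯度更新点邻域内探测更优参数”的基本流程(假设性草图,扰动尺度、候选数量与 evaluate_return 等均为示意,并非 ExploRLer 官方实现):

```python
import numpy as np

def explore_update_neighborhood(params: np.ndarray, evaluate_return,
                                n_probes: int = 8, radius: float = 0.01,
                                seed: int = 0) -> np.ndarray:
    """After an on-policy (e.g. PPO) update produced `params`, probe random
    perturbations in its neighborhood and keep the best-performing candidate."""
    rng = np.random.default_rng(seed)
    best_params, best_return = params, evaluate_return(params)
    for _ in range(n_probes):
        candidate = params + radius * rng.normal(size=params.shape)
        ret = evaluate_return(candidate)
        if ret > best_return:
            best_params, best_return = candidate, ret
    return best_params

# toy check: the "return" peaks slightly away from the surrogate update point
toy_eval = lambda p: -np.sum((p - 0.02) ** 2)
print(explore_update_neighborhood(np.zeros(4), toy_eval))
```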

[AI-71] CIMNAS: A Joint Framework for Compute-In-Memory-Aware Neural Architecture Search

【速读】:该论文旨在解决基于存算一体(Compute-In-Memory, CIM)架构的神经网络加速器在硬件效率与性能精度之间的权衡问题,尤其针对手动调参难以应对参数空间庞大且高度耦合的挑战。其核心解决方案是提出CIMNAS框架,该框架通过软硬件协同优化(joint model-quantization-hardware optimization),在包含9.9×10⁸⁵种潜在组合的搜索空间中同时探索模型结构、量化策略与多层次硬件参数(器件级、电路级、架构级),实现对能量-延迟-面积乘积(EDAP)的显著优化,同时保持模型准确率(73.81%)。关键创新在于无需牺牲精度即可达成EDAP降低90.1x–104.5x,并在SRAM-based ResNet50上进一步提升至819.5x,验证了其通用性与鲁棒性。

链接: https://arxiv.org/abs/2509.25862
作者: Olga Krestinskaya,Mohammed E. Fouda,Ahmed Eltawil,Khaled N. Salama
机构: 未知
类目: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:To maximize hardware efficiency and performance accuracy in Compute-In-Memory (CIM)-based neural network accelerators for Artificial Intelligence (AI) applications, co-optimizing both software and hardware design parameters is essential. Manual tuning is impractical due to the vast number of parameters and their complex interdependencies. To effectively automate the design and optimization of CIM-based neural network accelerators, hardware-aware neural architecture search (HW-NAS) techniques can be applied. This work introduces CIMNAS, a joint model-quantization-hardware optimization framework for CIM architectures. CIMNAS simultaneously searches across software parameters, quantization policies, and a broad range of hardware parameters, incorporating device-, circuit-, and architecture-level co-optimizations. CIMNAS experiments were conducted over a search space of 9.9x10^85 potential parameter combinations with the MobileNet model as a baseline and RRAM-based CIM architecture. Evaluated on the ImageNet dataset, CIMNAS achieved a reduction in energy-delay-area product (EDAP) ranging from 90.1x to 104.5x, an improvement in TOPS/W between 4.68x and 4.82x, and an enhancement in TOPS/mm^2 from 11.3x to 12.78x relative to various baselines, all while maintaining an accuracy of 73.81%. The adaptability and robustness of CIMNAS are demonstrated by extending the framework to support the SRAM-based ResNet50 architecture, achieving up to an 819.5x reduction in EDAP. Unlike other state-of-the-art methods, CIMNAS achieves EDAP-focused optimization without any accuracy loss, generating diverse software-hardware parameter combinations for high-performance CIM-based neural network designs. The source code of CIMNAS is available at this https URL.
zh

[AI-72] Aging Decline in Basketball Career Trend Prediction Based on Machine Learning and LSTM Model

链接: https://arxiv.org/abs/2509.25858
作者: Yi-chen Yao,Jerry Wang,Yi-cheng Lai,Lyn Chao-ling Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at Taiwan Academic Network Conference, TANET 2025

点击查看摘要

[AI-73] ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在安全对齐后仍存在脆弱拒绝行为的问题,即模型对有害请求的拒绝容易被简单的语言形式变化(如将请求改写为过去时态)所绕过,暴露出当前对齐方法在泛化能力上的显著缺陷。解决方案的关键在于提出一种机制驱动的防御框架——激活缩放防护器(Activation-Scaling Guard, ASGuard),其核心步骤包括:通过电路分析(circuit analysis)识别与特定攻击(时态变换攻击)相关的因果注意力头;训练一个通道级缩放向量以重新校准这些敏感头的激活值;最后通过预防性微调(preventative fine-tuning)促使模型学习更鲁棒的拒绝机制。该方法在不显著损害模型通用能力的前提下,有效降低了目标攻击的成功率,并实现了安全性与实用性的帕累托最优平衡。

链接: https://arxiv.org/abs/2509.25843
作者: Yein Park,Jungwoo Park,Jaewoo Kang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. As tense jailbreaking demonstrates that models refusing harmful requests often comply when rephrased in past tense, a critical generalization gap is revealed in current alignment methods whose underlying mechanisms are poorly understood. In this work, we introduce Activation-Scaling Guard (ASGuard), an insightful, mechanistically-informed framework that surgically mitigates this specific vulnerability. For the first step, we use circuit analysis to identify the specific attention heads causally linked to the targeted jailbreaking, the tense-changing attack. Second, we train a precise, channel-wise scaling vector to recalibrate the activation of tense vulnerable heads. Lastly, we apply it into a “preventative fine-tuning”, forcing the model to learn a more robust refusal mechanism. Across three LLMs, ASGuard effectively reduces the attack success rate of targeted jailbreaking while preserving general capabilities and minimizing over refusal, achieving a Pareto-optimal balance between safety and utility. Our findings underscore how adversarial suffixes suppress the propagation of the refusal-mediating direction, based on mechanistic analysis. Furthermore, our work showcases how a deep understanding of model internals can be leveraged to develop practical, efficient, and targeted methods for adjusting model behavior, charting a course for more reliable and interpretable AI safety.
zh
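下述 PyTorch 片段示意“对特定注意力头的激活施加通道级缩放向量”的做法(假设性草图:头索引、模块封装方式与缩放向量的训练流程均为示意,实际需依据电路分析定位的脆弱头):

```python
import torch
import torch.nn as nn

class HeadScaling(nn.Module):
    """Wrap an attention output projection and rescale the channels that belong
    to a chosen set of heads with a learnable channel-wise vector."""
    def __init__(self, out_proj: nn.Linear, n_heads: int, target_heads):
        super().__init__()
        self.out_proj, self.n_heads = out_proj, n_heads
        self.head_dim = out_proj.in_features // n_heads
        self.scale = nn.Parameter(torch.ones(out_proj.in_features))  # trained to recalibrate
        mask = torch.zeros(out_proj.in_features)
        for h in target_heads:                       # only the vulnerable heads are scaled
            mask[h * self.head_dim:(h + 1) * self.head_dim] = 1.0
        self.register_buffer("mask", mask)

    def forward(self, attn_out):                     # attn_out: (..., n_heads * head_dim)
        scaled = attn_out * (self.mask * self.scale + (1 - self.mask))
        return self.out_proj(scaled)

# toy usage: 8 heads of dim 16, recalibrate heads 2 and 5
proj = nn.Linear(128, 128)
guard = HeadScaling(proj, n_heads=8, target_heads=[2, 5])
print(guard(torch.randn(1, 4, 128)).shape)
```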

[AI-74] HiStyle: Hierarchical Style Embedding Predictor for Text-Prompt-Guided Controllable Speech Synthesis

链接: https://arxiv.org/abs/2509.25842
作者: Ziyu Zhang,Hanzhao Li,Jingbin Hu,Wenhao Li,Lei Xie
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-75] S2FS: Spatially-Aware Separability-Driven Feature Selection in Fuzzy Decision Systems

链接: https://arxiv.org/abs/2509.25841
作者: Suping Xu,Chuyi Dai,Ye Liu,Lin Shang,Xibei Yang,Witold Pedrycz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-76] RAE: A Neural Network Dimensionality Reduction Method for Nearest Neighbors Preservation in Vector Search ICLR2026

【速读】:该论文旨在解决高维嵌入向量在检索任务中因维度灾难导致的效率瓶颈问题,特别是现有主流降维方法(如PCA和UMAP)难以有效保持最近邻(k-NN)关系的局限性。其解决方案的关键在于提出一种正则化自编码器(Regularized Auto-Encoder, RAE),通过引入正则项约束网络参数变化,并利用瑞利商(Rayleigh quotient)的边界效应调控奇异值以控制降维过程中嵌入向量的幅值变化,从而保障k-NN结构的稳定性;数学分析进一步证明了正则化可建立变换后向量范数失真率的上界,为k-NN保真提供理论保证,在训练开销可控的前提下显著提升检索召回率并维持高效检索性能。

链接: https://arxiv.org/abs/2509.25839
作者: Han Zhang,Dongfang Zhao
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: submitted to ICLR 2026

点击查看摘要

Abstract:While high-dimensional embedding vectors are being increasingly employed in various tasks like Retrieval-Augmented Generation and Recommendation Systems, popular dimensionality reduction (DR) methods such as PCA and UMAP have rarely been adopted for accelerating the retrieval process due to their inability of preserving the nearest neighbor (NN) relationship among vectors. Empowered by neural networks’ optimization capability and the bounding effect of Rayleigh quotient, we propose a Regularized Auto-Encoder (RAE) for k-NN preserving dimensionality reduction. RAE constrains the network parameter variation through regularization terms, adjusting singular values to control embedding magnitude changes during reduction, thus preserving k-NN relationships. We provide a rigorous mathematical analysis demonstrating that regularization establishes an upper bound on the norm distortion rate of transformed vectors, thereby offering provable guarantees for k-NN preservation. With modest training overhead, RAE achieves superior k-NN recall compared to existing DR approaches while maintaining fast retrieval efficiency.
zh
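以下给出一个带权重正则的自编码器降维示意(假设性草图:此处用“奇异值偏离 1 的惩罚”近似控制变换后向量的范数失真,与论文基于 Rayleigh 商的正则形式可能不同,系数与维度均为示意取值):

```python
import torch
import torch.nn as nn

class RAE(nn.Module):
    def __init__(self, in_dim=768, out_dim=64):
        super().__init__()
        self.enc = nn.Linear(in_dim, out_dim, bias=False)
        self.dec = nn.Linear(out_dim, in_dim, bias=False)

    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), z

def singular_value_penalty(W: torch.Tensor) -> torch.Tensor:
    """Penalize deviation of singular values from 1 to limit norm distortion
    of the transformed vectors (illustrative surrogate for the paper's regularizer)."""
    s = torch.linalg.svdvals(W)
    return ((s - 1.0) ** 2).mean()

model = RAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(256, 768)                      # stand-in for embedding vectors
for _ in range(20):
    recon, z = model(x)
    loss = nn.functional.mse_loss(recon, x) + 0.1 * singular_value_penalty(model.enc.weight)
    loss.backward()
    opt.step()
    opt.zero_grad()
print(z.shape)                                 # reduced 64-dim representations
```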

[AI-77] Distillation of Large Language Models via Concrete Score Matching

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在部署时成本高昂的问题,提出通过知识蒸馏(Knowledge Distillation, KD)实现高效推理。现有KD方法通常采用softmax对齐学生模型与教师模型的概率分布,但这一过程会因softmax平滑效应而丢失有价值的logit信息;尽管直接logit蒸馏(Direct Logit Distillation, DLD)可缓解此问题,却未考虑logit的平移不变性(logit shift invariance),从而限制了最优解空间。论文提出的解决方案是Concrete Score Distillation (CSD),其核心在于设计一种离散分数匹配目标,既克服了softmax引起的平滑问题,又释放了最优解集的约束。CSD通过灵活加权所有词表对之间的相对logit差异来对齐学生与教师模型的logit结构,并解决了自回归LLM中离散分数匹配的训练不稳定性与二次复杂度问题,从而在任务无关和任务相关的蒸馏场景下均表现出优越性能和良好的保真度-多样性权衡。

链接: https://arxiv.org/abs/2509.25837
作者: Yeongmin Kim,Donghyeok Shin,Mina Kang,Byeonghu Na,Il-Chul Moon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) deliver remarkable performance but are costly to deploy, motivating knowledge distillation (KD) for efficient inference. Existing KD objectives typically match student and teacher probabilities via softmax, which blurs valuable logit information. While direct logit distillation (DLD) mitigates softmax smoothing, it fails to account for logit shift invariance, thereby restricting the solution space. We propose Concrete Score Distillation (CSD), a discrete score-matching objective that overcomes both softmax-induced smoothing and restrictions on the optimal solution set. We resolve the training instability and quadratic complexity of discrete score-matching in autoregressive LLMs, and the resulting CSD objective aligns relative logit differences across all vocabulary pairs between student and teacher with flexible weighting. We provide both mode-seeking and mode-covering instances within our framework and evaluate CSD on task-agnostic instruction-following and task-specific distillation using GPT-2-1.5B, OpenLLaMA-7B, and GEMMA-7B-IT. Experiments show that CSD consistently surpasses recent KD objectives, achieves favorable fidelity-diversity trade-offs, and yields complementary gains when combined with on-policy techniques, demonstrating its scalability and effectiveness for LLM distillation.
zh
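编者示例:CSD 的核心是对齐学生与教师在词表对之间的相对 logit 差(对 logit 平移天然不变)。下面的 PyTorch 片段仅为示意:为回避词表对数量的二次开销,这里随机采样 `num_pairs` 个词对并采用均匀权重,采样与加权方案均为假设,并非论文的原始目标函数。

```python
import torch

def concrete_score_distill_loss(student_logits, teacher_logits, num_pairs=512):
    """对采样词对 (i, j),令学生的 s_i - s_j 逼近教师的 t_i - t_j。
    两个输入形状均为 [batch, vocab]。"""
    vocab = student_logits.size(-1)
    i = torch.randint(0, vocab, (num_pairs,), device=student_logits.device)
    j = torch.randint(0, vocab, (num_pairs,), device=student_logits.device)
    s_diff = student_logits[:, i] - student_logits[:, j]
    t_diff = teacher_logits[:, i] - teacher_logits[:, j]
    return ((s_diff - t_diff.detach()) ** 2).mean()

# 用法示意
s = torch.randn(4, 32000, requires_grad=True)   # 学生 logits
t = torch.randn(4, 32000)                       # 教师 logits
concrete_score_distill_loss(s, t).backward()
```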

[AI-78] Chain-in-Tree: Back to Sequential Reasoning in LLM Tree Search

链接: https://arxiv.org/abs/2509.25835
作者: Xinzhe Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Under Review

点击查看摘要

[AI-79] Supporting Creative Ownership through Deep Learning-Based Music Variation NEURIPS

【速读】:该论文旨在解决音乐生成式 AI (Generative AI) 设计中人类创作者失去创作主导权的问题,即如何在人机协作中维持作曲家对创作过程与成果的个人所有权(personal ownership)。其解决方案的关键在于设计一种依赖于音乐家技能的音乐变奏工具,强调音乐家需提供高质量的初始音乐输入,并将片段性灵感转化为完整音乐构思;这种设计强化了创作者对过程与结果的掌控感,从而在技术能力与艺术身份之间建立平衡,确保AI作为辅助工具而非替代者,保留音乐表达的人文特质。

链接: https://arxiv.org/abs/2509.25834
作者: Stephen James Krol,Maria Teresa Llano,Jon McCormack
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Paper Accepted NeurIPS Creative AI Track 2025

点击查看摘要

Abstract:This paper investigates the importance of personal ownership in musical AI design, examining how practising musicians can maintain creative control over the compositional process. Through a four-week ecological evaluation, we examined how a music variation tool, reliant on the skill of musicians, functioned within a composition setting. Our findings demonstrate that the dependence of the tool on the musician’s ability, to provide a strong initial musical input and to turn moments into complete musical ideas, promoted ownership of both the process and artefact. Qualitative interviews further revealed the importance of this personal ownership, highlighting tensions between technological capability and artistic identity. These findings provide insight into how musical AI can support rather than replace human creativity, highlighting the importance of designing tools that preserve the humanness of musical expression.
zh

[AI-80] CardioForest: An Explainable Ensemble Learning Model for Automatic Wide QRS Complex Tachycardia Diagnosis from ECG

链接: https://arxiv.org/abs/2509.25804
作者: Vaskar Chakma,Ju Xiaolin,Heling Cao,Xue Feng,Ji Xiaodong,Pan Haiyan,Gao Zhan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

[AI-81] Better with Less: Small Proprietary Models Surpass Large Language Models in Financial Transaction Understanding

链接: https://arxiv.org/abs/2509.25803
作者: Wanying Ding,Savinay Narendra,Xiran Shi,Adwait Ratnaparkhi,Chengrui Yang,Nikoo Sabzevar,Ziyan Yin
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
备注: 9 pages, 5 figures

点击查看摘要

[AI-82] Deontic Argumentation

链接: https://arxiv.org/abs/2509.25781
作者: Guido Governatori,Antonino Rotolo
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

[AI-83] Planner-R1: Reward Shaping Enables Efficient Agentic RL with Smaller LLMs

链接: https://arxiv.org/abs/2509.25779
作者: Siyu Zhu,Yanbin Jiang,Hejian Sang,Shao Tang,Qingquan Song,Biao He,Rohit Jain,Zhipeng Wang,Alborz Geramifard
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-84] Autonomy-Aware Clustering: When Local Decisions Supersede Global Prescriptions ICLR2026

【速读】:该论文旨在解决传统聚类方法在面对具有局部自主性的实体时所存在的局限性问题,即现有方法假设聚类对象是被动且严格服从分配的群体,而现实中实体常因局部自主性(local autonomy)偏离预设关联,从而显著改变聚类结构(如组成、几何形态和基数),并影响下游推理与决策。其解决方案的关键在于提出一种自主意识聚类(autonomy-aware clustering)框架,该框架结合强化学习(Reinforcement Learning, RL)与确定性退火(Deterministic Annealing, DA)机制:DA在早期阶段促进探索、后期转向利用,自然适应聚类结构演化;同时引入基于Transformer的自适应距离估计网络(Adaptive Distance Estimation Network, ADEN),在RL循环中学习实体与聚类中心间的依赖关系,支持变长输入输出并实现跨任务知识迁移,从而无需先验自主性模型即可逼近真实数据动态,实验表明其误差仅约3-4%,远优于忽略自主性的方案(误差达35-40%)。

链接: https://arxiv.org/abs/2509.25775
作者: Amber Srivastava,Salar Basiri,Srinivasa Salapaka
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This work is under submission to ICLR 2026. Please cite the arXiv version until the final version is published

点击查看摘要

Abstract:Clustering arises in a wide range of problem formulations, yet most existing approaches assume that the entities under clustering are passive and strictly conform to their assigned groups. In reality, entities often exhibit local autonomy, overriding prescribed associations in ways not fully captured by feature representations. Such autonomy can substantially reshape clustering outcomes – altering cluster compositions, geometry, and cardinality – with significant downstream effects on inference and decision-making. We introduce autonomy-aware clustering, a reinforcement (RL) learning framework that learns and accounts for the influence of local autonomy without requiring prior knowledge of its form. Our approach integrates RL with a deterministic annealing (DA) procedure, where, to determine underlying clusters, DA naturally promotes exploration in early stages of annealing and transitions to exploitation later. We also show that the annealing procedure exhibits phase transitions that enable design of efficient annealing schedules. To further enhance adaptability, we propose the Adaptive Distance Estimation Network (ADEN), a transformer-based attention model that learns dependencies between entities and cluster representatives within the RL loop, accommodates variable-sized inputs and outputs, and enables knowledge transfer across diverse problem instances. Empirical results show that our framework closely aligns with underlying data dynamics: even without explicit autonomy models, it achieves solutions close to the ground truth (gap ~3-4%), whereas ignoring autonomy leads to substantially larger gaps (~35-40%). The code and data are publicly available at this https URL.
zh
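编者示例:摘要中的确定性退火(DA)部分可以用"软分配 + 温度退火"的迭代来直观理解:温度高时分配接近均匀(探索),温度降低后趋于硬分配(利用)。以下 NumPy 草图只覆盖 DA 聚类本身,不含论文中的 RL 循环与 ADEN 网络;退火计划与各超参数均为假设。

```python
import numpy as np

def deterministic_annealing(X, k=3, T0=5.0, T_min=0.05, decay=0.9, inner=20):
    """极简 DA 聚类:按 Gibbs 分布做软分配,并随温度下降逐渐收紧。"""
    rng = np.random.default_rng(0)
    centers = X[rng.choice(len(X), k, replace=False)]
    T = T0
    while T > T_min:
        for _ in range(inner):
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)    # [n, k]
            logits = -d2 / T
            logits -= logits.max(axis=1, keepdims=True)
            p = np.exp(logits)
            p /= p.sum(axis=1, keepdims=True)              # 软分配
            centers = (p.T @ X) / p.sum(axis=0)[:, None]   # 加权更新聚类中心
        T *= decay                                         # 降温:从探索转向利用
    return p.argmax(axis=1), centers

labels, centers = deterministic_annealing(np.random.randn(200, 2))
```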

[AI-85] Galton's Law of Mediocrity: Why Large Language Models Regress to the Mean and Fail at Creativity in Advertising

链接: https://arxiv.org/abs/2509.25767
作者: Matt Keon,Aabid Karim,Bhoomika Lohana,Abdul Karim,Thai Nguyen,Tara Hamilton,Ali Abbas
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-86] Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training

链接: https://arxiv.org/abs/2509.25758
作者: Yein Park,Minbyul Jeong,Jaewoo Kang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-87] Cooperative Autonomous Driving in Diverse Behavioral Traffic: A Heterogeneous Graph Reinforcement Learning Approach

【速读】:该论文旨在解决自动驾驶车辆(AV)在异构交通环境中应对多样化驾驶风格时所面临的决策难题,该问题源于交通场景的复杂性和车辆间动态交互的不确定性。解决方案的关键在于提出一种增强型异构图强化学习(GRL)框架,其核心创新包括:构建用于捕捉车辆间复杂交互关系的异构图表示;设计融合专家系统(expert system)的异构图神经网络(HGNN-EM),以编码多样化的车辆特征并生成基于领域知识的驾驶指令;以及采用双深度Q-learning(DDQN)算法进行模型训练。该方法在典型四路交叉口场景中验证了其在安全性、效率、稳定性及收敛速度上的优越性,同时保持良好的实时性能。

链接: https://arxiv.org/abs/2509.25751
作者: Qi Liu,Xueyuan Li,Zirui Li,Juhui Gim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 7 pages, 5 figures and 4 tables

点击查看摘要

Abstract:Navigating heterogeneous traffic environments with diverse driving styles poses a significant challenge for autonomous vehicles (AVs) due to their inherent complexity and dynamic interactions. This paper addresses this challenge by proposing a heterogeneous graph reinforcement learning (GRL) framework enhanced with an expert system to improve AV decision-making performance. Initially, a heterogeneous graph representation is introduced to capture the intricate interactions among vehicles. Then, a heterogeneous graph neural network with an expert model (HGNN-EM) is proposed to effectively encode diverse vehicle features and produce driving instructions informed by domain-specific knowledge. Moreover, the double deep Q-learning (DDQN) algorithm is utilized to train the decision-making model. A case study on a typical four-way intersection, involving various driving styles of human vehicles (HVs), demonstrates that the proposed method has superior performance over several baselines regarding safety, efficiency, stability, and convergence rate, all while maintaining favorable real-time performance.
zh
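编者示例:决策部分由 DDQN 训练。下面给出 double DQN 目标值计算的极简 PyTorch 草图:在线网络负责选动作、目标网络负责估值,以缓解 Q 值高估。其中状态维度、动作数与网络结构均为说明性假设,不含论文中的异构图编码器(HGNN-EM)与专家模型。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ddqn_target(online_net, target_net, reward, next_state, done, gamma=0.99):
    """Double DQN:用在线网络选取下一动作,用目标网络评估其价值。"""
    with torch.no_grad():
        next_action = online_net(next_state).argmax(dim=1, keepdim=True)
        next_q = target_net(next_state).gather(1, next_action).squeeze(1)
        return reward + gamma * (1.0 - done) * next_q

# 用法示意:假设状态是图编码后的 16 维特征,动作空间大小为 5
online_net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 5))
target_net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 5))
target_net.load_state_dict(online_net.state_dict())

state, next_state = torch.randn(32, 16), torch.randn(32, 16)
action = torch.randint(0, 5, (32, 1))
reward, done = torch.randn(32), torch.zeros(32)

y = ddqn_target(online_net, target_net, reward, next_state, done)   # TD 目标
q = online_net(state).gather(1, action).squeeze(1)
loss = F.mse_loss(q, y)
loss.backward()
```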

[AI-88] Boundary-to-Region Supervision for Offline Safe Reinforcement Learning NEURIPS2025

链接: https://arxiv.org/abs/2509.25727
作者: Huikang Su,Dengyun Peng,Zifeng Zhuang,YuHan Liu,Qiguang Chen,Donglin Wang,Qinghe Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: NeurIPS 2025

点击查看摘要

[AI-89] DeepCodeSeek: Real-Time API Retrieval for Context-Aware Code Generation

链接: https://arxiv.org/abs/2509.25716
作者: Esakkivel Esakkiraja,Denis Akhiyarov,Aditya Shanmugham,Chitra Ganapathy
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Retrieval-Augmented Generation, API Prediction, Context-Aware Code Generation, Enterprise Code Completion, Reinforcement Learning, ServiceNow, Real-Time Code Search, Query Enhancement, Fine-Tuning, Embedding, Reranker

点击查看摘要

[AI-90] HNote: Extending YNote with Hexadecimal Encoding for Fine-Tuning LLMs in Music Modeling

链接: https://arxiv.org/abs/2509.25694
作者: Hung-Ying Chu,Shao-Yu Wei,Guan-Wei Chen,Tzu-Wei Hung,ChengYang Tsai,Yu-Cheng Lin
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-91] ScheduleMe: Multi-Agent Calendar Assistant

链接: https://arxiv.org/abs/2509.25693
作者: N. de Silva(University of Moratuwa, Sri Lanka),S. Perera(WSO2 LLC),K. L. A. A. Nimasha(University of Moratuwa, Sri Lanka),I. D. S. Fernando(University of Moratuwa, Sri Lanka),R.K.A.O. Wijerathne(University of Moratuwa, Sri Lanka)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-92] Collaborative Compression for Large-Scale MoE Deployment on Edge

链接: https://arxiv.org/abs/2509.25689
作者: Yixiao Chen,Yanyue Xie,Ruining Yang,Wei Jiang,Wei Wang,Yong He,Yue Chen,Pu Zhao,Yanzhi Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

[AI-93] SING-SQL: A Synthetic Data Generation Framework for In-Domain Text-to-SQL Translation

链接: https://arxiv.org/abs/2509.25672
作者: Hasan Alp Caferoğlu,Mehmet Serhat Çelik,Özgür Ulusoy
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

点击查看摘要

[AI-94] GroundSight: Augmenting Vision-Language Models with Grounding Information and De-hallucination

链接: https://arxiv.org/abs/2509.25669
作者: Xinxi Chen,Tianyang Chen,Lijia Hong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-95] EEG-based AI-BCI Wheelchair Advancement: Hybrid Deep Learning with Motor Imagery for Brain Computer Interface

链接: https://arxiv.org/abs/2509.25667
作者: Bipul Thapa,Biplov Paneru,Bishwash Paneru,Khem Narayan Poudyal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

[AI-96] On Explaining Proxy Discrimination and Unfairness in Individual Decisions Made by AI Systems

链接: https://arxiv.org/abs/2509.25662
作者: Belona Sonna,Alban Grastien
机构: 未知
类目: Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
备注: Accepted at AJCAI 2025

点击查看摘要

[AI-97] Deep Reinforcement Learning-Based Precoding for Multi-RIS-Aided Multiuser Downlink Systems with Practical Phase Shift

链接: https://arxiv.org/abs/2509.25661
作者: Po-Heng Chou,Bo-Ren Zheng,Wan-Jen Huang,Walid Saad,Yu Tsao,Ronald Y. Chang
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
备注: 5 pages, 5 figures, and published in IEEE Wireless Communications Letters

点击查看摘要

[AI-98] Capacity-Net-Based RIS Precoding Design without Channel Estimation for mmWave MIMO System

链接: https://arxiv.org/abs/2509.25660
作者: Chun-Yuan Huang,Po-Heng Chou,Wan-Jen Huang,Ying-Ren Chien,Yu Tsao
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
备注: 10 pages, 5 figures, and published in 2024 IEEE PIMRC

点击查看摘要

[AI-99] Landmark-Guided Knowledge for Vision-and-Language Navigation

链接: https://arxiv.org/abs/2509.25655
作者: Dongsheng Yang,Meiling Zhu,Yinfeng Yu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted for publication by International Conference on Intelligent Computing 2025

点击查看摘要

[AI-100] Iterative Residual Cross-Attention Mechanism: An Integrated Approach for Audio-Visual Navigation Tasks

链接: https://arxiv.org/abs/2509.25652
作者: Hailong Zhang,Yinfeng Yu,Liejun Wang,Fuchun Sun,Wendong Zheng
机构: 未知
类目: Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD)
备注: Accepted for publication by IEEE International Conference on Systems, Man, and Cybernetics 2025

点击查看摘要

[AI-101] AutoLabs: Cognitive Multi-Agent Systems with Self-Correction for Autonomous Chemical Experimentation

链接: https://arxiv.org/abs/2509.25651
作者: Gihan Panapitiya,Emily Saldanha,Heather Job,Olivia Hess
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-102] BaB-prob: Branch and Bound with Preactivation Splitting for Probabilistic Verification of Neural Networks

链接: https://arxiv.org/abs/2509.25647
作者: Fangji Wang,Panagiotis Tsiotras
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

[AI-103] SOCK: A Benchmark for Measuring Self-Replication in Large Language Models

链接: https://arxiv.org/abs/2509.25643
作者: Justin Chavarria,Rohan Raizada,Justin White,Eyad Alhetairshi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-104] Quadratic Programming Approach for Nash Equilibrium Computation in Multiplayer Imperfect-Information Games

链接: https://arxiv.org/abs/2509.25618
作者: Sam Ganzfried
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

[AI-105] SMS: Self-supervised Model Seeding for Verification of Machine Unlearning

【速读】:该论文旨在解决机器学习模型中用户数据删除后的验证难题,即如何有效验证用户真实样本(genuine samples)已被成功移除,而非仅针对后门样本(backdoored samples)进行验证。当前方法依赖于后门机制,但其无法建立真实样本与模型之间的关联,导致验证结果不可靠。解决方案的关键在于提出一种自监督模型播种(Self-supervised Model Seeding, SMS)方案,通过将用户特定的种子(如用户唯一索引)与原始样本和模型隐式绑定,形成可验证的因果链,从而实现对真实样本移除的有效验证。SMS的核心创新包括:1)利用自监督任务将种子嵌入模型潜在空间以保护其机密性;2)设计联合训练结构,在保持原服务模型性能的同时实现高效播种,确保验证有效性与模型实用性之间的平衡。

链接: https://arxiv.org/abs/2509.25613
作者: Weiqi Wang,Chenhan Zhang,Zhiyi Tian,Shui Yu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Many machine unlearning methods have been proposed recently to uphold users’ right to be forgotten. However, offering users verification of their data removal post-unlearning is an important yet under-explored problem. Current verifications typically rely on backdooring, i.e., adding backdoored samples to influence model performance. Nevertheless, the backdoor methods can merely establish a connection between backdoored samples and models but fail to connect the backdoor with genuine samples. Thus, the backdoor removal can only confirm the unlearning of backdoored samples, not users’ genuine samples, as genuine samples are independent of backdoored ones. In this paper, we propose a Self-supervised Model Seeding (SMS) scheme to provide unlearning verification for genuine samples. Unlike backdooring, SMS links user-specific seeds (such as users’ unique indices), original samples, and models, thereby facilitating the verification of unlearning genuine samples. However, implementing SMS for unlearning verification presents two significant challenges. First, embedding the seeds into the service model while keeping them secret from the server requires a sophisticated approach. We address this by employing a self-supervised model seeding task, which learns the entire sample, including the seeds, into the model’s latent space. Second, maintaining the utility of the original service model while ensuring the seeding effect requires a delicate balance. We design a joint-training structure that optimizes both the self-supervised model seeding task and the primary service task simultaneously on the model, thereby maintaining model utility while achieving effective model seeding. The effectiveness of the proposed SMS scheme is evaluated through extensive experiments, which demonstrate that SMS provides effective verification for genuine sample unlearning, addressing existing limitations.
zh
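编者示例:SMS 的"联合训练"可以理解为把主任务损失与种子自监督损失按权重相加。以下 PyTorch 草图仅示意这一损失组合方式:模型结构、种子的编码方式与权重 `alpha` 均为假设,并非论文实现。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySeededModel(nn.Module):
    """示意:共享主干,同时输出主任务 logits 与种子预测 logits。"""
    def __init__(self, in_dim=32, n_classes=10, n_seeds=100):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
        self.task_head = nn.Linear(64, n_classes)
        self.seed_head = nn.Linear(64, n_seeds)

    def forward(self, x):
        h = self.backbone(x)
        return self.task_head(h), self.seed_head(h)

def joint_sms_loss(model, x, y, x_seeded, seed_id, alpha=0.1):
    """总损失 = 主服务任务损失 + alpha * 种子自监督损失(兼顾模型效用与可验证性)。"""
    task_logits, _ = model(x)
    _, seed_logits = model(x_seeded)
    return F.cross_entropy(task_logits, y) + alpha * F.cross_entropy(seed_logits, seed_id)

# 用法示意:这里简单地把种子索引叠加到输入上作占位,真实的种子嵌入方式由论文设计
model = ToySeededModel()
x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))
seed_id = torch.randint(0, 100, (16,))
x_seeded = x + 0.01 * seed_id.float().unsqueeze(1)
joint_sms_loss(model, x, y, x_seeded, seed_id).backward()
```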

[AI-106] Unsupervised Detection of Spatiotemporal Anomalies in PMU Data Using Transformer-Based BiGAN

链接: https://arxiv.org/abs/2509.25612
作者: Muhammad Imran Hossain,Jignesh Solanki,Sarika Khushlani Solanki
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

[AI-107] A Framework for Studying AI Agent Behavior: Evidence from Consumer Choice Experiments

链接: https://arxiv.org/abs/2509.25609
作者: Manuel Cherep,Chengtian Ma,Abigail Xu,Maya Shaked,Pattie Maes,Nikhil Singh
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 23 pages, 13 figures

点击查看摘要

[AI-108] Echoes of Humanity: Exploring the Perceived Humanness of AI Music NEURIPS2025

链接: https://arxiv.org/abs/2509.25601
作者: Flavio Figueiredo,Giovanni Martinelli,Henrique Sousa,Pedro Rodrigues,Frederico Pedrosa,Lucas N. Ferreira
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Sound (cs.SD)
备注: Accepted at NeurIPS 2025 Creative AI Track

点击查看摘要

[AI-109] Hybrid Reward Normalization for Process-supervised Non-verifiable Agentic Tasks

链接: https://arxiv.org/abs/2509.25598
作者: Peiran Xu,Zhuohao Li,Xiaoying Xing,Guannan Zhang,Debiao Li,Kunyu Shi
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

[AI-110] Radiology's Last Exam (RadLE): Benchmarking Frontier Multimodal AI Against Human Experts and a Taxonomy of Visual Reasoning Errors in Radiology

链接: https://arxiv.org/abs/2509.25559
作者: Suvrankar Datta,Divya Buchireddygari,Lakshmi Vennela Chowdary Kaza,Mrudula Bhalke,Kautik Singh,Ayush Pandey,Sonit Sai Vasipalli,Upasana Karnwal,Hakikat Bir Singh Bhatti,Bhavya Ratan Maroo,Sanjana Hebbar,Rahul Joseph,Gurkawal Kaur,Devyani Singh,Akhil V,Dheeksha Devasya Shama Prasad,Nishtha Mahajan,Ayinaparthi Arisha,Rajesh Vanagundi,Reet Nandy,Kartik Vuthoo,Snigdhaa Rajvanshi,Nikhileswar Kondaveeti,Suyash Gunjal,Rishabh Jain,Rajat Jain,Anurag Agrawal
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 29 pages, 7 figures, 7 tables, includes Annexure (1). Part of the work accepted at RSNA 2025 (Cutting Edge Oral Presentation)

点击查看摘要

[AI-111] A(I)nimism: Re-enchanting the World Through AI-Mediated Object Interaction

链接: https://arxiv.org/abs/2509.25558
作者: Diana Mykhaylychenko,Maisha Thasin,Dunya Baradari,Charmelle Mhungu
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA); Multimedia (cs.MM)
备注:

点击查看摘要

[AI-112] Evaluating Foundation Models with Pathological Concept Learning for Kidney Cancer MICCAI

链接: https://arxiv.org/abs/2509.25552
作者: Shangqi Gao,Sihan Wang,Yibo Gao,Boming Wang,Xiahai Zhuang,Anne Warren,Grant Stewart,James Jones,Mireia Crispin-Ortuzar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Best Paper Award at MICCAI AMAI 2025

点击查看摘要

[AI-113] Learning to Interact in World Latent for Team Coordination

链接: https://arxiv.org/abs/2509.25550
作者: Dongsu Lee,Daehee Lee,Yaru Niu,Honguk Woo,Amy Zhang,Ding Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

[AI-114] RadOnc-GPT: An Autonomous LLM Agent for Real-Time Patient Outcomes Labeling at Scale

链接: https://arxiv.org/abs/2509.25540
作者: Jason Holmes,Yuexing Hao,Mariana Borras-Osorio,Federico Mastroleo,Santiago Romero Brufau,Valentina Carducci,Katie M Van Abel,David M Routman,Andrew Y. K. Foong,Liv M Muller,Satomi Shiraishi,Daniel K Ebner,Daniel J Ma,Sameer R Keole,Samir H Patel,Mirek Fatyga,Martin Bues,Brad J Stish,Yolanda I Garces,Michelle A Neben Wittich,Robert L Foote,Sujay A Vora,Nadia N Laack,Mark R Waddle,Wei Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-115] Steering an Active Learning Workflow Towards Novel Materials Discovery via Queue Prioritization

链接: https://arxiv.org/abs/2509.25538
作者: Marcus Schwarting,Logan Ward,Nathaniel Hudson,Xiaoli Yan,Ben Blaiszik,Santanu Chaudhuri,Eliu Huerta,Ian Foster
机构: 未知
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-116] Beyond Static Retrieval: Opportunities and Pitfalls of Iterative Retrieval in GraphRAG

链接: https://arxiv.org/abs/2509.25530
作者: Kai Guo,Xinnan Dai,Shenglai Zeng,Harry Shomer,Haoyu Han,Yu Wang,Jiliang Tang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-117] Economic Competition, EU Regulation, and Executive Orders: A Framework for Discussing AI Policy Implications in CS Courses

链接: https://arxiv.org/abs/2509.25524
作者: James Weichert,Hoda Eldardiry
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-118] Understanding Generative Recommendation with Semantic IDs from a Model-scaling View

链接: https://arxiv.org/abs/2509.25522
作者: Jingzhe Liu,Liam Collins,Jiliang Tang,Tong Zhao,Neil Shah,Clark Mingxuan Ju
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-119] XR Blocks: Accelerating Human-centered AI XR Innovation

链接: https://arxiv.org/abs/2509.25504
作者: David Li,Nels Numan,Xun Qian,Yanhe Chen,Zhongyi Zhou,Evgenii Alekseev,Geonsun Lee,Alex Cooper,Min Xia,Scott Chung,Jeremy Nelson,Xiuxiu Yuan,Jolica Dias,Tim Bettridge,Benjamin Hersh,Michelle Huynh,Konrad Piascik,Ricardo Cabello,David Kim,Ruofei Du
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Graphics (cs.GR); Software Engineering (cs.SE)
备注:

点击查看摘要

[AI-120] EMO-TTA: Improving Test-Time Adaptation of Audio-Language Models for Speech Emotion Recognition

链接: https://arxiv.org/abs/2509.25495
作者: Jiacheng Shi,Hongfei Du,Y. Alicia Hong,Ye Gao
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-121] Message passing-based inference in an autoregressive active inference agent

链接: https://arxiv.org/abs/2509.25482
作者: Wouter M. Kouw,Tim N. Nisslbeck,Wouter L.N. Nuijten
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY); Machine Learning (stat.ML)
备注: 14 pages, 4 figures, to be published in the proceedings of the International Workshop on Active Inference 2025

点击查看摘要

[AI-122] Translation from Wearable PPG to 12-Lead ECG

链接: https://arxiv.org/abs/2509.25480
作者: Hui Ji,Wei Gao,Pengfei Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 14 pages,10 figures

点击查看摘要

[AI-123] TDHook: A Lightweight Framework for Interpretability

链接: https://arxiv.org/abs/2509.25475
作者: Yoann Poupart
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

[AI-124] Data-Efficient Multitask DAgger

链接: https://arxiv.org/abs/2509.25466
作者: Haotian Fu,Ran Gong,Xiaohan Zhang,Maria Vittoria Minniti,Jigarkumar Patel,Karl Schmeckpeper
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-125] Plug-and-Play Emotion Graphs for Compositional Prompting in Zero-Shot Speech Emotion Recognition

链接: https://arxiv.org/abs/2509.25458
作者: Jiacheng Shi,Hongfei Du,Y. Alicia Hong,Ye Gao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-126] PIPer: On-Device Environment Setup via Online Reinforcement Learning

链接: https://arxiv.org/abs/2509.25455
作者: Alexander Kovrigin,Aleksandra Eliseeva,Konstantin Grotov,Egor Bogomolov,Yaroslav Zharov
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under review

点击查看摘要

[AI-127] Multi-patch isogeometric neural solver for partial differential equations on computer-aided design domains

链接: https://arxiv.org/abs/2509.25450
作者: Moritz von Tresckow,Ion Gabriel Ion,Dimitrios Loukrezis
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
备注: 33 pages, 15 figures

点击查看摘要

[AI-128] Joint Embeddings Go Temporal NEURIPS2024

链接: https://arxiv.org/abs/2509.25449
作者: Sofiane Ennadir,Siavash Golkar,Leopoldo Sarra
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at the Workshop on Time Series in the Age of Large Models - NeurIPS 2024

点击查看摘要

[AI-129] Beyond Noisy-TVs: Noise-Robust Exploration Via Learning Progress Monitoring

链接: https://arxiv.org/abs/2509.25438
作者: Zhibo Hou,Zhiyu An,Wan Du
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-130] GESA: Graph-Enhanced Semantic Allocation for Generalized Fair and Explainable Candidate-Role Matching

链接: https://arxiv.org/abs/2509.25435
作者: Rishi Ashish Shah,Shivaay Dhondiyal,Kartik Sharma,Sukriti Talwar,Saksham Jain,Sparsh Jain
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-131] The Open Syndrome Definition

链接: https://arxiv.org/abs/2509.25434
作者: Ana Paula Gomes Ferreira,Aleksandar Anžel,Izabel Oliva Marcilio de Souza,Helen Hughes,Alex J Elliot,Jude Dzevela Kong,Madlen Schranz,Alexander Ullrich,Georges Hattab
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-132] RADAR: Reasoning-Ability and Difficulty-Aware Routing for Reasoning LLMs

链接: https://arxiv.org/abs/2509.25426
作者: Nigel Fernandez,Branislav Kveton,Ryan A. Rossi,Andrew S. Lan,Zichao Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

[AI-133] Polychromic Objectives for Reinforcement Learning

链接: https://arxiv.org/abs/2509.25424
作者: Jubayer Ibn Hamid,Ifdita Hasan Orney,Ellen Xu,Chelsea Finn,Dorsa Sadigh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-134] Boolean Satisfiability via Imitation Learning

链接: https://arxiv.org/abs/2509.25411
作者: Zewei Zhang,Huan Liu,Yuanhao Yu,Jun Chen,Xiangyu Xu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

[AI-135] FlashOmni: A Unified Sparse Attention Engine for Diffusion Transformers

链接: https://arxiv.org/abs/2509.25401
作者: Liang Qiao,Yue Dai,Yeqi Huang,Hongyu Kan,Jun Shi,Hong An
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注:

点击查看摘要

[AI-136] A Cartography of Open Collaboration in Open Source AI: Mapping Practices, Motivations and Governance in 14 Open Large Language Model Projects

链接: https://arxiv.org/abs/2509.25397
作者: Johan Linåker,Cailean Osborne,Jennifer Ding,Ben Burtenshaw
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: In submission

点击查看摘要

[AI-137] Let Physics Guide Your Protein Flows: Topology-aware Unfolding and Generation

链接: https://arxiv.org/abs/2509.25379
作者: Yogesh Verma,Markus Heinonen,Vikas Garg
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-138] Cold-Start Active Correlation Clustering

链接: https://arxiv.org/abs/2509.25376
作者: Linus Aronsson,Han Wu,Morteza Haghir Chehreghani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:

点击查看摘要

[AI-139] From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models

链接: https://arxiv.org/abs/2509.25373
作者: Chenyue Zhou,Mingxuan Wang,Yanbiao Ma,Chenxu Wu,Wanyi Chen,Zhe Qian,Xinyu Liu,Yiwei Zhang,Junhao Wang,Hengbo Xu,Fei Luo,Xiaohua Chen,Xiaoshuai Hao,Hehan Li,Andi Zhang,Wenxuan Wang,Lingling Li,Zhiwu Lu,Yang Lu,Yike Guo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-140] Where LLM Agents Fail and How They can Learn From Failures

链接: https://arxiv.org/abs/2509.25370
作者: Kunlun Zhu,Zijia Liu,Bingxuan Li,Muxin Tian,Yingxuan Yang,Jiaxun Zhang,Pengrui Han,Qipeng Xie,Fuyang Cui,Weijia Zhang,Xiaoteng Ma,Xiaodong Yu,Gowtham Ramesh,Jialian Wu,Zicheng Liu,Pan Lu,James Zou,Jiaxuan You
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-141] Structural Reward Model: Enhancing Interpretability, Efficiency and Scalability in Reward Modeling

链接: https://arxiv.org/abs/2509.25361
作者: Xiaoyu Liu,Di Liang,Hongyu Shan,Peiyang Liu,Yonghao Liu,Muling Wu,Yuntao Li,Xianjie Wu,LI Miao,Jiangrong Shen,Minlong Peng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-142] SynthPert: Enhancing LLM Biological Reasoning via Synthetic Reasoning Traces for Cellular Perturbation Prediction

链接: https://arxiv.org/abs/2509.25346
作者: Lawrence Phillips,Marc Boubnovski Martell,Aditya Misra,Josefa Lia Stoisser,Cesar A. Prada-Medina,Rory Donovan-Maiye,Kaspar Märtens
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Cell Behavior (q-bio.CB); Genomics (q-bio.GN)
备注:

点击查看摘要

[AI-143] Uncertainty-Aware Generative Oversampling Using an Entropy-Guided Conditional Variational Autoencoder

链接: https://arxiv.org/abs/2509.25334
作者: Amirhossein Zare,Amirhessam Zare,Parmida Sadat Pezeshki,Herlock(SeyedAbolfazl)Rahimi,Ali Ebrahimi,Ignacio Vázquez-García,Leo Anthony Celi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 2 figures

点击查看摘要

[AI-144] Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning

链接: https://arxiv.org/abs/2509.25300
作者: Zelin Tan,Hejia Geng,Mulei Zhang,Xiaohang Yu,Guancheng Wan,Yifan Zhou,Qiang He,Xiangyuan Xue,Heng Zhou,Yutao Fan,Zhongzhi Li,Zaibin Zhang,Guibin Zhang,Chen Zhang,Zhenfei Yin,Lei Bai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: V1 version

点击查看摘要

[AI-145] ID-RAG: Identity Retrieval-Augmented Generation for Long-Horizon Persona Coherence in Generative Agents ECAI2025

链接: https://arxiv.org/abs/2509.25299
作者: Daniel Platnick,Mohamed E. Bengueddache,Marjan Alirezaie,Dava J. Newman,Alex ‘‘Sandy’’ Pentland,Hossein Rahnama
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
备注: Accepted to LLAIS 2025: Workshop on LLM-Based Agents for Intelligent Systems, at ECAI 2025, 12 pages, 3 figures

点击查看摘要

[AI-146] Automatically Generating Web Applications from Requirements Via Multi-Agent Test-Driven Development

链接: https://arxiv.org/abs/2509.25297
作者: Yuxuan Wan,Tingshuo Liang,Jiakai Xu,Jingyu Xiao,Yintong Huo,Michael R. Lyu
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-147] Learning Relationships Between Separate Audio Tracks for Creative Applications

链接: https://arxiv.org/abs/2509.25296
作者: Balthazar Bujard(IRCAM, SU),Jérôme Nika(IRCAM),Fédéric Bevilacqua(IRCAM),Nicolas Obin
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

[AI-148] AI in Pakistani Schools: Adoption, Usage and Perceived Impact among Educators

链接: https://arxiv.org/abs/2509.25293
作者: Syed Hassan Raza,Azib Farooq
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-149] A Measurement Study of Model Context Protocol

链接: https://arxiv.org/abs/2509.25292
作者: Hechuan Guo,Yongle Hao,Yue Zhang,Minghui Xu,Peizhuo Lyu,Jiezhi Chen,Xiuzhen Cheng
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-150] ClustRecNet: A Novel End-to-End Deep Learning Framework for Clustering Algorithm Recommendation

链接: https://arxiv.org/abs/2509.25289
作者: Mohammadreza Bakhtyari,Bogdan Mazoure,Renato Cordeiro de Amorim,Guillaume Rabusseau,Vladimir Makarenkov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-151] Artificial Authority: From Machine Minds to Political Alignments. An Experimental Analysis of Democratic and Autocratic Biases in Large-Language Models

链接: https://arxiv.org/abs/2509.25286
作者: Szymon Łukasik,Natalia Ożegalska-Łukasik
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-152] Effectiveness of Large Language Models in Simulating Regional Psychological Structures: An Empirical Examination of Personality and Subjective Well-being

【速读】:该论文试图解决的问题是:大型语言模型(LLM)是否能够基于人口统计学信息模拟具有文化根基的心理模式,特别是在人格特质(如大五人格)和主观幸福感方面的表现。其解决方案的关键在于利用DeepSeek模型生成与2018年中国家庭追踪调查(CFPS2018)人口分布一致的2943名虚拟参与者,并将这些模拟数据与真实人类数据进行比较,从而评估LLM在心理特征建模上的拟合度与偏差。研究发现,尽管总体趋势相似,但AI生成数据在外向性、开放性和幸福感上系统性偏低,且预测幸福的关键人格维度存在差异,表明当前LLM虽能近似群体心理分布,但在文化特异性与情感维度建模方面仍存在局限,凸显了改进训练数据的文化丰富性和增强情感建模能力的重要性。

链接: https://arxiv.org/abs/2509.25283
作者: Ke Luoma,Li Zengyi,Liao Jiangqun,Tong Song,Peng Kaiping
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:This study examines whether LLMs can simulate culturally grounded psychological patterns based on demographic information. Using DeepSeek, we generated 2943 virtual participants matched to demographic distributions from the CFPS2018 and compared them with human responses on the Big Five personality traits and subjective well-being across seven Chinese regions. Personality was measured using a 15-item Chinese Big Five inventory, and happiness with a single-item rating. Results revealed broad similarity between real and simulated datasets, particularly in regional variation trends. However, systematic differences emerged: simulated participants scored lower in extraversion and openness, higher in agreeableness and neuroticism, and consistently reported lower happiness. Predictive structures also diverged: while human data identified conscientiousness, extraversion and openness as positive predictors of happiness, the AI emphasized openness and agreeableness, with extraversion predicting negatively. These discrepancies suggest that while LLMs can approximate population-level psychological distributions, they underrepresent culturally specific and affective dimensions. The findings highlight both the potential and limitations of LLM-based virtual participants for large-scale psychological research and underscore the need for culturally enriched training data and improved affective modeling.
zh

[AI-153] Toward Causal-Visual Programming: Enhancing Agentic Reasoning in Low-Code Environments

链接: https://arxiv.org/abs/2509.25282
作者: Jiexi Xu,Jiaqi Liu,Ran Tong,Su Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注: 5 pages, 1 table

点击查看摘要

[AI-154] RL in the Wild: Characterizing RLVR Training in LLM Deployment

链接: https://arxiv.org/abs/2509.25279
作者: Jiecheng Zhou,Qinghao Hu,Yuyang Jin,Zerui Wang,Peng Sun,Yuzhe Gu,Wenwei Zhang,Mingshu Zhai,Xingcheng Zhang,Weiming Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注: 20 pages, 28 figures

点击查看摘要

[AI-155] VoiceBridge: Designing Latent Bridge Models for General Speech Restoration at Scale

链接: https://arxiv.org/abs/2509.25275
作者: Chi Zhang,Zehua Chen,Kaiwen Zheng,Jun Zhu
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

[AI-156] A Weather Foundation Model for the Power Grid

链接: https://arxiv.org/abs/2509.25268
作者: Cristian Bodnar,Raphaël Rousseau-Rizzi,Nikhil Shankar,James Merleau,Stylianos Flampouris,Guillem Candille,Slavica Antic,François Miralles,Jayesh K. Gupta
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
备注: 31 pages, 22 figures

点击查看摘要

[AI-157] Cognifying Education: Mapping AI's transformative role in emotional, creative and collaborative learning

链接: https://arxiv.org/abs/2509.25266
作者: Mikael Gorsky,Ilya Levin
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Presented at the 13th Higher Education Institutions Conference (HEIC) in Dubrovnik (September 2025): AI and Digital Transformation in Higher Education

点击查看摘要

[AI-158] From NL2SQL to NL2GeoSQL: GeoSQL-Eval for automated evaluation of LLMs on PostGIS queries

链接: https://arxiv.org/abs/2509.25264
作者: Shuyang Hou,Haoyue Jiao,Ziqi Liu,Lutong Xie,Guanyu Chen,Shaowen Wu,Xuefeng Guan,Huayi Wu
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注:

点击查看摘要

[AI-159] How Effective Are Time-Series Models for Rainfall Nowcasting? A Comprehensive Benchmark for Rainfall Nowcasting Incorporating PWV Data

链接: https://arxiv.org/abs/2509.25263
作者: Yifang Zhang,Pengfei Duan,Henan Wang,Shengwu Xiong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (stat.ML)
备注: 11 pages,8 figures

点击查看摘要

[AI-160] Artificial Intelligence-Powered Assessment Framework for Skill-Oriented Engineering Lab Education

链接: https://arxiv.org/abs/2509.25258
作者: Vaishnavi Sharma,Rakesh Thakur,Shashwat Sharma,Kritika Panjanani
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-161] The Sandbox Configurator: A Framework to Support Technical Assessment in AI Regulatory Sandboxes

链接: https://arxiv.org/abs/2509.25256
作者: Alessio Buscemi,Thibault Simonetto,Daniele Pagani,German Castignani,Maxime Cordy,Jordi Cabot
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-162] Knowledge distillation through geometry-aware representational alignment

链接: https://arxiv.org/abs/2509.25253
作者: Prajjwal Bhattarai,Mohammad Amjad,Dmytro Zhylko,Tuka Alhanai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-163] Fact Grounded Attention: Eliminating Hallucination in Large Language Models Through Attention Level Knowledge Integration

链接: https://arxiv.org/abs/2509.25252
作者: Aayush Gupta
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages, 3 figures, 4 tables. Code and dataset available at this https URL

点击查看摘要

[AI-164] Memory Management and Contextual Consistency for Long-Running Low-Code Agents

链接: https://arxiv.org/abs/2509.25250
作者: Jiexi Xu
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 12 pages, 5 figures, 1 table

点击查看摘要

[AI-165] BEV-VLM: Trajectory Planning via Unified BEV Abstraction

【速读】:该论文旨在解决自动驾驶中轨迹规划(trajectory planning)依赖原始视觉数据(如摄像头图像)导致信息冗余、几何一致性差以及难以与高精地图(HD Map)深度融合的问题。解决方案的关键在于提出BEV-VLM框架,通过将多模态传感器数据(如相机和激光雷达)融合生成结构化且语义丰富的鸟瞰图(Bird’s-Eye View, BEV)特征图,并将其与高精地图对齐形成统一的BEV-HD Map表示,从而为视觉语言模型(Vision-Language Model, VLM)提供更高效、几何一致且语义丰富的输入,显著提升轨迹规划准确性并实现完全避障。

链接: https://arxiv.org/abs/2509.25249
作者: Guancheng Chen,Sheng Yang,Tong Zhan,Jian Wang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces BEV-VLM, a novel framework for trajectory planning in autonomous driving that leverages Vision-Language Models (VLMs) with Bird’s-Eye View (BEV) feature maps as visual inputs. Unlike conventional approaches that rely solely on raw visual data such as camera images, our method utilizes highly compressed and informative BEV representations, which are generated by fusing multi-modal sensor data (e.g., camera and LiDAR) and aligning them with HD Maps. This unified BEV-HD Map format provides a geometrically consistent and rich scene description, enabling VLMs to perform accurate trajectory planning. Experimental results on the nuScenes dataset demonstrate 44.8% improvements in planning accuracy and complete collision avoidance. Our work highlights that VLMs can effectively interpret processed visual representations like BEV features, expanding their applicability beyond raw images in trajectory planning.
zh
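编者示例:"统一 BEV-HD Map 表示"可以从数据组织的角度理解为把 BEV 特征图与栅格化高精地图在通道维对齐拼接,再切块展平成视觉 token。以下张量操作仅为示意,通道数、网格尺寸与 patch 大小均为假设值,并非论文实现。

```python
import torch

# 假设:BEV 特征 [64, 200, 200],HD 地图栅格 [8, 200, 200],两者已空间对齐
bev_feat = torch.randn(64, 200, 200)
hdmap_raster = torch.randn(8, 200, 200)

unified_bev = torch.cat([bev_feat, hdmap_raster], dim=0)      # [72, 200, 200]

# 按 20x20 的 patch 切块并展平,得到可供 VLM 使用的视觉 token 序列
patch = 20
tokens = unified_bev.unfold(1, patch, patch).unfold(2, patch, patch)  # [72, 10, 10, 20, 20]
tokens = tokens.permute(1, 2, 0, 3, 4).reshape(10 * 10, -1)           # [100, 28800]
print(tokens.shape)   # 每个 token 对应一个 BEV 网格区域
```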

[AI-166] BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software

【速读】:该论文旨在解决开源软件(Open-Source Software, OSS)自动编译过程中存在的复杂性与适应性不足问题,现有方法依赖人工规则和有限样本评估,难以应对实际中缺失编译指令、依赖关系未文档化、甚至需修改源码或构建脚本等挑战。其解决方案的关键在于提出一个更具现实挑战性的基准测试集 BUILD-BENCH,涵盖质量、规模和特征多样化的 OSS 项目,并设计了一个基于大语言模型(Large Language Models, LLMs)的强基线代理系统 OSS-BUILD-AGENT,该系统通过增强的构建指令检索模块实现对异构 OSS 特性的良好适配,在 BUILD-BENCH 上达到当前最优性能,从而更真实地衡量智能体在复杂软件工程任务中的能力。

链接: https://arxiv.org/abs/2509.25248
作者: Zehua Zhang,Ati Priya Bajaj,Divij Handa,Siyu Liu,Arvind S Raj,Hongkai Chen,Hulin Wang,Yibo Liu,Zion Leonahenahe Basque,Souradip Nath,Vishal Juneja,Nikhil Chapre,Yan Shoshitaishvili,Adam Doupé,Chitta Baral,Ruoyu Wang
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注:

点击查看摘要

Abstract:Automatically compiling open-source software (OSS) projects is a vital, labor-intensive, and complex task, which makes it a good challenge for LLM Agents. Existing methods rely on manually curated rules and workflows, which cannot adapt to OSS that requires customized configuration or environment setup. Recent attempts using Large Language Models (LLMs) used selective evaluation on a subset of highly rated OSS, a practice that underestimates the realistic challenges of OSS compilation. In practice, compilation instructions are often absent, dependencies are undocumented, and successful builds may even require patching source files or modifying build scripts. We propose a more challenging and realistic benchmark, BUILD-BENCH, comprising OSS that are more diverse in quality, scale, and characteristics. Furthermore, we propose a strong baseline LLM-based agent, OSS-BUILD-AGENT, an effective system with enhanced build instruction retrieval module that achieves state-of-the-art performance on BUILD-BENCH and is adaptable to heterogeneous OSS characteristics. We also provide detailed analysis regarding different compilation method design choices and their influence to the whole task, offering insights to guide future advances. We believe performance on BUILD-BENCH can faithfully reflect an agent’s ability to tackle compilation as a complex software engineering tasks, and, as such, our benchmark will spur innovation with a significant impact on downstream applications in the fields of software development and software security.
zh
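编者示例:摘要提到的"构建指令检索模块"可以用一个非常简化的关键词打分检索来理解:从仓库文档中找出最可能描述构建步骤的段落,再交给代理执行。以下 Python 草图中的关键词表与打分方式均为假设,并非 OSS-BUILD-AGENT 的真实实现。

```python
import re

BUILD_KEYWORDS = ["make", "cmake", "configure", "pip install", "setup.py",
                  "cargo build", "mvn", "gradle", "ninja"]

def retrieve_build_snippets(doc_text: str, top_k: int = 3):
    """按空行切分段落,统计构建相关关键词命中数,返回得分最高的段落。"""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", doc_text) if p.strip()]
    scored = sorted(
        ((sum(p.lower().count(kw) for kw in BUILD_KEYWORDS), p) for p in paragraphs),
        key=lambda t: t[0], reverse=True,
    )
    return [p for s, p in scored[:top_k] if s > 0]

readme = """Project Foo.

Build:
  mkdir build && cd build
  cmake .. && make -j8

License: MIT"""
print(retrieve_build_snippets(readme))
```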

[AI-167] Protocode: Prototype-Driven Interpretability for Code Generation in LLMs

链接: https://arxiv.org/abs/2509.25247
作者: Krishna Vamshi Bodla,Haizhao Yang
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

[AI-168] Neo-Grounded Theory: A Methodological Innovation Integrating High-Dimensional Vector Clustering and Multi-Agent Collaboration for Qualitative Research

链接: https://arxiv.org/abs/2509.25244
作者: Shuide Wen,Beier Ku,Teng Wang,Mingyang Zou,Yang Yang
机构: 未知
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 44 pages, 11 figures

点击查看摘要

[AI-169] Reinforcement Learning-Guided Chain-of-Draft for Token-Efficient Code Generation

链接: https://arxiv.org/abs/2509.25243
作者: Xunzhu Tang,Iyiola Emmanuel Olatunji,Tiezhu Sun,Jacques Klein,Tegawende F. Bissyande
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-170] A Benchmark for Localizing Code and Non-Code Issues in Software Projects

链接: https://arxiv.org/abs/2509.25242
作者: Zejun Zhang,Jian Wang,Qingyun Yang,Yifan Pan,Yi Tang,Yi Li,Zhenchang Xing,Tian Zhang,Xuandong Li,Guoan Zhang
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-171] PALADIN: Self-Correcting Language Model Agents to Cure Tool-Failure Cases

链接: https://arxiv.org/abs/2509.25238
作者: Sri Vatsa Vuddanti,Aarav Shah,Satwik Kumar Chittiprolu,Tony Song,Sunishchal Dev,Kevin Zhu,Maheep Chaudhary
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-172] Quantum est in Libris: Navigating Archives with GenAI Uncovering Tension Between Preservation and Innovation

链接: https://arxiv.org/abs/2509.25237
作者: Mar Canet Sola,Varvara Guljajeva
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: 5 pages, 4 figures,

点击查看摘要

[AI-173] The Causal Abstraction Network: Theory and Learning

链接: https://arxiv.org/abs/2509.25236
作者: Gabriele D’Acunto,Paolo Di Lorenzo,Sergio Barbarossa
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注:

点击查看摘要

[AI-174] Machine Learning for Pattern Detection in Printhead Nozzle Logging

链接: https://arxiv.org/abs/2509.25235
作者: Nikola Prianikov,Evelyne Janssen-van Dam,Marcin Pietrasik,Charalampos S. Kouzinopoulos
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This paper has been published in the 37th International Conference on Tools with Artificial Intelligence in Athens, Greece, November 03-05, 2025

点击查看摘要

[AI-175] FedCLF - Towards Efficient Participant Selection for Federated Learning in Heterogeneous IoV Networks ICIP

链接: https://arxiv.org/abs/2509.25233
作者: Kasun Eranda Wijethilake,Adnan Mahmood,Quan Z. Sheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Already published in ADMA 2024 on 13th December 2024 Wijethilake, K.E., Mahmood, A., Sheng, Q.Z. (2025). FedCLF - Towards Efficient Participant Selection for Federated Learning in Heterogeneous IoV Networks. In: Sheng, Q.Z., et al. Advanced Data Mining and Applications. ADMA 2024. Lecture Notes in Computer Science(), vol 15388. Springer, Singapore. this https URL

点击查看摘要

[AI-176] Energy Guided Geometric Flow Matching

链接: https://arxiv.org/abs/2509.25230
作者: Aaron Zweig,Mingxuan Zhang,Elham Azizi,David Knowles
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-177] Blueprint-Bench: Comparing spatial intelligence of LLMs, agents and image models ICLR2026

链接: https://arxiv.org/abs/2509.25229
作者: Lukas Petersson,Axel Backlund,Axel Wennstöm,Hanna Petersson,Callum Sharrock,Arash Dabiri
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 8 figures, submitted for ICLR 2026

点击查看摘要

[AI-178] Enhancing Linear Attention with Residual Learning

链接: https://arxiv.org/abs/2509.25223
作者: Xunhao Lai,Jialiang Kang,Jianqiao Lu,Tong Lin,Pengyu Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 15 pages, 4 figures

点击查看摘要

[AI-179] Learning to Condition: A Neural Heuristic for Scalable MPE Inference NEURIPS2025

链接: https://arxiv.org/abs/2509.25217
作者: Brij Malhotra,Shivvrat Arya,Tahrima Rahman,Vibhav Giridhar Gogate
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Will appear in NeurIPS 2025

点击查看摘要

[AI-180] On-the-Fly Adaptation to Quantization: Configuration-Aware LoRA for Efficient Fine-Tuning of Quantized LLM s

链接: https://arxiv.org/abs/2509.25214
作者: Rongguang Ye,Ming Tang,Edith C. H. Ngai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-181] STCast: Adaptive Boundary Alignment for Global and Regional Weather Forecasting

链接: https://arxiv.org/abs/2509.25210
作者: Hao Chen,Tao Han,Jie Zhang,Song Guo,Lei Bai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
备注:

点击查看摘要

[AI-182] Multi-level Diagnosis and Evaluation for Robust Tabular Feature Engineering with Large Language Models EMNLP2025

链接: https://arxiv.org/abs/2509.25207
作者: Yebin Lim,Susik Yoon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to Findings of EMNLP 2025

点击查看摘要

[AI-183] Generating High-Quality Datasets for Code Editing via Open-Source Language Models

链接: https://arxiv.org/abs/2509.25203
作者: Zekai Zhang,Mingwei Liu,Zhenxi Chen,Linxi Liang,Yuxuan Chen,Guangsheng Ou,Yanlin Wang,Dan Li,Xin Peng,Zibin Zheng
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 23 pages, 8 figures

点击查看摘要

[AI-184] Towards Repository-Level Program Verification with Large Language Models

链接: https://arxiv.org/abs/2509.25197
作者: Si Cheng Zhong,Xujie Si
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: Accepted to LMPL 2025

点击查看摘要

[AI-185] APRIL: API Synthesis with Automatic Prompt Optimization and Reinforcement Learning

【速读】:该论文旨在解决大规模软件库中API组合(API composition)的难题,即在海量API集合中高效、准确地合成符合需求的功能模块。传统基于组件的合成方法依赖昂贵的搜索过程和人工编写的规格说明,而大语言模型(Large Language Models, LLMs)虽能根据自然语言生成代码,但常因幻觉和缺乏实时上下文信息导致结果错误。论文提出的解决方案关键在于结合LLM合成能力与两种先进机制:自动提示优化(Automatic Prompt Optimization, APO)用于迭代优化冻结模型的输入提示,以及基于可验证奖励的强化学习(Reinforcement Learning from Verifiable Rewards, RLVR)用于微调策略以提升功能正确性。该协同框架构建了一个高效且可靠的组件化API合成流水线,在81个来自主流科学Python库的真实API任务上显著优于仅使用专家提示的指令微调未精调LLM。

链接: https://arxiv.org/abs/2509.25196
作者: Hua Zhong,Shan Jiang,Sarfraz Khurshid
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL)
备注:

点击查看摘要

Abstract:APIs are central to modern software development, yet composing new APIs from large libraries is difficult due to the exponential search space; traditional component-based synthesis relies on costly exploration and hand-crafted specifications. While large language models (LLMs) can generate implementations from natural language, hallucinations and limited access to up-to-date contextual information often yield incorrect code. In this paper, we present APRIL, an approach that combines LLM-based synthesis with Automatic Prompt Optimization (APO) and Reinforcement Learning from Verifiable Rewards (RLVR): APO iteratively refines prompts for a frozen model, while RLVR fine-tunes the policy toward functional correctness, producing an efficient synthesis pipeline. Evaluated on 81 real-world APIs from widely used scientific Python libraries and benchmarked against instruction-tuned but unfine-tuned LLMs guided by expert prompts, APRIL achieves substantial improvements. These results indicate that integrating APO and RLVR provides a robust, scalable path for component-based API synthesis in large libraries.
zh
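编者示例:APO 的流程可以概括为"冻结模型、按可验证的测试通过率迭代改写提示"。以下草图中的 `llm_generate`、`run_tests`、`rewrite_prompt` 都是假设的占位函数,轮数等取值亦为示意,仅用于说明控制流,并非论文实现(RLVR 的策略微调部分未包含)。

```python
def automatic_prompt_optimization(task_desc, tests, init_prompt,
                                  llm_generate, run_tests, rewrite_prompt,
                                  rounds=5):
    """示意:固定模型参数,只迭代优化提示;以测试通过率作为可验证反馈。"""
    best_prompt, best_score = init_prompt, -1.0
    prompt = init_prompt
    for _ in range(rounds):
        code = llm_generate(prompt.format(task=task_desc))  # 冻结模型生成候选实现
        score = run_tests(code, tests)                      # 通过率,取值 [0, 1]
        if score > best_score:
            best_prompt, best_score = prompt, score
        if score >= 1.0:
            break
        prompt = rewrite_prompt(best_prompt, feedback=f"pass rate={score:.2f}")
    return best_prompt, best_score

# 用法示意(用固定返回值的桩函数演示控制流)
best, score = automatic_prompt_optimization(
    "implement add(a, b)", tests=[], init_prompt="Write code for: {task}",
    llm_generate=lambda p: "def add(a, b): return a + b",
    run_tests=lambda code, tests: 1.0,
    rewrite_prompt=lambda p, feedback: p + "\n# " + feedback,
)
print(best, score)
```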

[AI-186] Devstral: Fine-tuning Language Models for Coding Agent Applications

链接: https://arxiv.org/abs/2509.25193
作者: Abhinav Rastogi,Adam Yang,Albert Q. Jiang,Alexander H. Liu,Alexandre Sablayrolles,Amélie Héliou,Amélie Martin,Anmol Agarwal,Andy Ehrenberg,Andy Lo,Antoine Roux,Arthur Darcet,Arthur Mensch,Baptiste Bout,Baptiste Rozière,Baudouin De Monicault,Chris Bamford,Christian Wallenwein,Christophe Renaudin,Clémence Lanfranchi,Clément Denoix,Corentin Barreau,Darius Dabert Devon Mizelle,Diego de las Casas,Elliot Chane-Sane,Emilien Fugier,Emma Bou Hanna,Gabrielle Berrada,Gauthier Delerce,Gauthier Guinet,Georgii Novikov,Graham Neubig,Guillaume Lample,Guillaume Martin,Himanshu Jaju,Jan Ludziejewski,Jason Rute,Jean-Malo Delignon,JeanHadrien Chabran,Joachim Studnia,Joep Barmentlo,Jonas Amar,Josselin Somerville Roberts,Julien Denize,Karan Saxena,Karmesh Yadav,Kartik Khandelwal,Khyathi Raghavi Chandu,Kush Jain,Lélio Renard Lavaud,Léonard Blier,Lingxiao Zhao,Louis Martin,Lucile Saulnier,Luyu Gao,Marie Pellat,Mathilde Guillaumin,Mathis Felardos,Matthieu Dinot,Maxime Darrin,Maximilian Augustin,Mickaël Seznec,Neha Gupta,Nikhil Raghuraman,Olivier Duchenne,Patricia Wang,Patrick von Platen,Patryk Saffer,Paul Jacob,Paul Wambergue,Paula Kurylowicz,Philomène Chagniot,Pierre Stock,Pravesh Agrawal,Rémi Delacourt,Roman Soletskyi,Romain Sauvestre,Sagar Vaze,Sanchit Gandhi,Sandeep Subramanian,Shashwat Dalal,Siddharth Gandhi,Soham Ghosh,Srijan Mishra,Sumukh Aithal,Szymon Antoniak,Teven Le Scao,Thibaut Lavril,Thibault Schueller,Thomas Foubert,Thomas Robert,Thomas Wang,Timothée Lacroix,Tom Bewley,Valeriia Nemychnikova,Victor Paltz,Virgile Richard,Wen-Ding Li,William Marshall,Xingyao Wang
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-187] AdaptCache: KV Cache Native Storage Hierarchy for Low-Delay and High-Quality Language Model Serving

链接: https://arxiv.org/abs/2509.00105
作者: Shaoting Feng,Hanchen Li,Kuntai Du,Zhuohan Gu,Yuhan Liu,Jiayi Yao,Siddhant Ray,Samuel Shen,Yihua Cheng,Ganesh Ananthanarayanan,Junchen Jiang
机构: 未知
类目: Operating Systems (cs.OS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

[AI-188] AI-assisted Advanced Propellant Development for Electric Propulsion

链接: https://arxiv.org/abs/2509.26567
作者: Angel Pan Du,Miguel Arana-Catania,Enric Grustan Gutiérrez
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Space Physics (physics.space-ph)
备注: 23 pages, 10 figures, 5 tables. Journal of Electric Propulsion

点击查看摘要

[AI-189] Indoor/Outdoor Spectrum Sharing Enabled by GNSS-based Classifiers

链接: https://arxiv.org/abs/2509.26500
作者: Hossein Nasiri,Muhammad Iqbal Rochman,Monisha Ghosh
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注: To be published in the proceedings of IEEE Military Communications Conference (MILCOM) 2025

点击查看摘要

[AI-190] On Deepfake Voice Detection - It's All in the Presentation ICASSP2026

链接: https://arxiv.org/abs/2509.26471
作者: Héctor Delgado,Giorgio Ramondetti,Emanuele Dalmasso,Gennady Karvitsky,Daniele Colibro,Haydar Talib
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注: Submitted to IEEE ICASSP 2026. Paper resources available at this https URL

点击查看摘要

[AI-191] Vector-Valued Reproducing Kernel Banach Spaces for Neural Networks and Operators

链接: https://arxiv.org/abs/2509.26371
作者: Sven Dummer,Tjeerd Jan Heeringa,José A. Iglesias
机构: 未知
类目: Functional Analysis (math.FA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

[AI-192] Enhancing PINN Performance Through Lie Symmetry Group

链接: https://arxiv.org/abs/2509.26113
作者: Ali Haider Shah,Naveed R. Butt,Asif Ahmad,Muhammad Omer Bin Saeed
机构: 未知
类目: Analysis of PDEs (math.AP); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-193] scUnified: An AI-Ready Standardized Resource for Single-Cell RNA Sequencing Analysis

链接: https://arxiv.org/abs/2509.25884
作者: Ping Xu,Zaitian Wang,Zhirui Wang,Pengjiang Li,Ran Zhang,Gaoyang Li,Hanyu Xie,Jiajia Wang,Yuanchun Zhou,Pengfei Wang
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-194] Towards A Universally Transferable Acceleration Method for Density Functional Theory

链接: https://arxiv.org/abs/2509.25724
作者: Zhe Liu,Yuyan Ni,Zhichen Pu,Qiming Sun,Siyuan Liu,Wen Yan
机构: 未知
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

[AI-195] Discontinuous Epitope Fragments as Sufficient Target Templates for Efficient Binder Design NEURIPS2025

链接: https://arxiv.org/abs/2509.25479
作者: Zhenfeng Deng,Ruijie Hou,Ningrui Xie,Mike Tyers,Michał Koziarski
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
备注: Accepted by NeurIPS2025-AI4Science

点击查看摘要

[AI-196] DNABERT-2: Fine-Tuning a Genomic Language Model for Colorectal Gene Enhancer Classification

链接: https://arxiv.org/abs/2509.25274
作者: Darren King,Yaser Atlasi,Gholamreza Rafiee
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 10 figures, 2 tables

点击查看摘要

[AI-197] Comprehensive Analysis of VQC for Financial Fraud Detection: A Comparative Study of Quantum Encoding Techniques and Architectural Optimizations

链接: https://arxiv.org/abs/2509.25245
作者: Fouad Mohammed Abbou,Mohamed Bouhadda,Lamiae Bouanane,Mouna Kettani,Farid Abdi,Abdelouahab Abid
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 9 figures

点击查看摘要

[AI-198] FMIP: Joint Continuous-Integer Flow For Mixed-Integer Linear Programming

链接: https://arxiv.org/abs/2507.23390
作者: Hongpei Li,Hui Yuan,Han Zhang,Jianghao Lin,Dongdong Ge,Mengdi Wang,Yinyu Ye
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
备注: FMIP is a generative framework that jointly models integer and continuous variables in MILP, achieving a 41.34% reduction in primal gap and demonstrating compatibility with various solvers and applications

点击查看摘要

机器学习

[LG-0] SPATA: Systematic Pattern Analysis for Detailed and Transparent Data Cards ECML PKDD 2025

链接: https://arxiv.org/abs/2509.26640
作者: João Vitorino,Eva Maia,Isabel Praça,Carlos Soares
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
备注: 16 pages, 3 tables, 6 figures, SynDAiTE, ECML PKDD 2025

点击查看摘要

Abstract:Due to the susceptibility of Artificial Intelligence (AI) to data perturbations and adversarial examples, it is crucial to perform a thorough robustness evaluation before any Machine Learning (ML) model is deployed. However, examining a model’s decision boundaries and identifying potential vulnerabilities typically requires access to the training and testing datasets, which may pose risks to data privacy and confidentiality. To improve transparency in organizations that handle confidential data or manage critical infrastructure, it is essential to allow external verification and validation of AI without the disclosure of private datasets. This paper presents Systematic Pattern Analysis (SPATA), a deterministic method that converts any tabular dataset to a domain-independent representation of its statistical patterns, to provide more detailed and transparent data cards. SPATA computes the projection of each data instance into a discrete space where they can be analyzed and compared, without risking data leakage. These projected datasets can be reliably used for the evaluation of how different features affect ML model robustness and for the generation of interpretable explanations of their behavior, contributing to more trustworthy AI.
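编者示例:SPATA 将表格数据确定性地投影到一个与领域无关的离散"模式空间"。下面用分位数分箱给出一个极简 NumPy 示意:每条样本映射为一个离散模式签名,便于在不暴露原始取值的情况下比较;分箱数量与签名形式均为假设,并非论文原方法。

```python
import numpy as np

def pattern_signatures(X: np.ndarray, n_bins: int = 4) -> np.ndarray:
    """把每个特征按分位数离散化,得到每条样本的离散模式签名(确定性映射)。"""
    n, d = X.shape
    sig = np.zeros((n, d), dtype=int)
    for j in range(d):
        edges = np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
        sig[:, j] = np.digitize(X[:, j], edges)   # 只保留统计模式,不保留原始数值
    return sig

X = np.random.randn(100, 5)
print(pattern_signatures(X)[:3])   # 每行是一条样本在离散模式空间中的投影
```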

[LG-1] AccidentBench: Benchmarking Multimodal Understanding and Reasoning in Vehicle Accidents and Beyond

链接: https://arxiv.org/abs/2509.26636
作者: Shangding Gu,Xiaohan Wang,Donghao Ying,Haoyu Zhao,Runing Yang,Ming Jin,Boyi Li,Marco Pavone,Serena Yeung-Levy,Jun Wang,Dawn Song,Costas Spanos
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Rapid advances in multimodal models demand benchmarks that rigorously evaluate understanding and reasoning in safety-critical, dynamic real-world settings. We present AccidentBench, a large-scale benchmark that combines vehicle accident scenarios with Beyond domains, safety-critical settings in air and water that emphasize spatial and temporal reasoning (e.g., navigation, orientation, multi-vehicle motion). The benchmark contains approximately 2000 videos and over 19000 human-annotated question–answer pairs spanning multiple video lengths (short/medium/long) and difficulty levels (easy/medium/hard). Tasks systematically probe core capabilities: temporal, spatial, and intent understanding and reasoning. By unifying accident-centric traffic scenes with broader safety-critical scenarios in air and water, AccidentBench offers a comprehensive, physically grounded testbed for evaluating models under real-world variability. Evaluations of state-of-the-art models (e.g., Gemini-2.5 Pro and GPT-5) show that even the strongest models achieve only about 18% accuracy on the hardest tasks and longest videos, revealing substantial gaps in real-world temporal, spatial, and intent reasoning. AccidentBench is designed to expose these critical gaps and drive the development of multimodal models that are safer, more robust, and better aligned with real-world safety-critical challenges. The code and dataset are available at: this https URL

[LG-2] Recursive Self-Aggregation Unlocks Deep Thinking in Large Language Models

链接: https://arxiv.org/abs/2509.26626
作者: Siddarth Venkatraman,Vineet Jain,Sarthak Mittal,Vedant Shah,Johan Obando-Ceron,Yoshua Bengio,Brian R. Bartoldson,Bhavya Kailkhura,Guillaume Lajoie,Glen Berseth,Nikolay Malkin,Moksh Jain
类目: Machine Learning (cs.LG)
*备注: 24 pages, 9 figures

点击查看摘要

Abstract:Test-time scaling methods improve the capabilities of large language models (LLMs) by increasing the amount of compute used during inference to make a prediction. Inference-time compute can be scaled in parallel by choosing among multiple independent solutions or sequentially through self-refinement. We propose Recursive Self-Aggregation (RSA), a test-time scaling method inspired by evolutionary methods that combines the benefits of both parallel and sequential scaling. Each step of RSA refines a population of candidate reasoning chains through aggregation of subsets to yield a population of improved solutions, which are then used as the candidate pool for the next iteration. RSA exploits the rich information embedded in the reasoning chains – not just the final answers – and enables bootstrapping from partially correct intermediate steps within different chains of thought. Empirically, RSA delivers substantial performance gains with increasing compute budgets across diverse tasks, model families and sizes. Notably, RSA enables Qwen3-4B-Instruct-2507 to achieve competitive performance with larger reasoning models, including DeepSeek-R1 and o3-mini (high), while outperforming purely parallel and sequential scaling strategies across AIME-25, HMMT-25, Reasoning Gym, LiveCodeBench-v6, and SuperGPQA. We further demonstrate that training the model to combine solutions via a novel aggregation-aware reinforcement learning approach yields significant performance gains. Code available at this https URL.
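
To make the aggregation loop concrete, below is a minimal Python sketch of an RSA-style procedure: a population of reasoning chains is repeatedly refined by asking the model to merge random subsets of candidates. The `llm` callable and the prompts are placeholders, not the paper's exact prompting or subset-selection scheme.

```python
import random

def recursive_self_aggregation(llm, question, pop_size=8, subset_size=3, steps=4):
    """Sketch of an RSA-style loop: refine a population of reasoning chains
    by repeatedly aggregating random subsets of them."""
    # Initial population: independent chains of thought (parallel scaling).
    population = [llm(f"Question: {question}\nThink step by step.") for _ in range(pop_size)]
    for _ in range(steps):
        new_population = []
        for _ in range(pop_size):
            subset = random.sample(population, subset_size)
            joined = "\n\n---\n\n".join(subset)
            # Aggregation prompt: combine partially correct chains into a better one.
            prompt = (
                f"Question: {question}\n"
                f"Here are {subset_size} candidate solutions:\n{joined}\n"
                "Combine their correct steps into a single improved solution."
            )
            new_population.append(llm(prompt))
        population = new_population  # next iteration refines the aggregated chains
    return population
```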

[LG-3] Uncertainty Quantification for Regression using Proper Scoring Rules

链接: https://arxiv.org/abs/2509.26610
作者: Alexander Fishkov,Kajetan Schweighofer,Mykyta Ielanskyi,Nikita Kotelevskii,Mohsen Guizani,Maxim Panov
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Quantifying uncertainty of machine learning model predictions is essential for reliable decision-making, especially in safety-critical applications. Recently, uncertainty quantification (UQ) theory has advanced significantly, building on a firm basis of learning with proper scoring rules. However, these advances were focused on classification, while extending these ideas to regression remains challenging. In this work, we introduce a unified UQ framework for regression based on proper scoring rules, such as CRPS, logarithmic, squared error, and quadratic scores. We derive closed-form expressions for the resulting uncertainty measures under practical parametric assumptions and show how to estimate them using ensembles of models. In particular, the derived uncertainty measures naturally decompose into aleatoric and epistemic components. The framework recovers popular regression UQ measures based on predictive variance and differential entropy. Our broad evaluation on synthetic and real-world regression datasets provides guidance for selecting reliable UQ measures.
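
As a small illustration of the ensemble-based, variance-style measures that the abstract says the framework recovers, the sketch below computes the classical aleatoric/epistemic decomposition for an ensemble of Gaussian predictors; it is not the paper's full scoring-rule-based estimator.

```python
import numpy as np

def variance_based_uncertainty(means, variances):
    """Variance-based UQ for an ensemble of Gaussian predictors.

    means, variances: arrays of shape (n_members, n_points) holding each member's
    predicted mean and variance. Returns the classical decomposition that the
    framework recovers as a special case."""
    aleatoric = variances.mean(axis=0)   # expected data noise
    epistemic = means.var(axis=0)        # disagreement between ensemble members
    total = aleatoric + epistemic        # variance of the mixture predictive
    return total, aleatoric, epistemic

# toy usage: 5 ensemble members, 3 test points
rng = np.random.default_rng(0)
mu = rng.normal(size=(5, 3))
var = rng.uniform(0.1, 0.5, size=(5, 3))
print(variance_based_uncertainty(mu, var))
```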

[LG-4] Source Separation for A Cappella Music

链接: https://arxiv.org/abs/2509.26580
作者: Luca A. Lanzendörfer,Constantin Pinkl,Florian Grötschla
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we study the task of multi-singer separation in a cappella music, where the number of active singers varies across mixtures. To address this, we use a power set-based data augmentation strategy that expands limited multi-singer datasets into exponentially more training samples. To separate singers, we introduce SepACap, an adaptation of SepReformer, a state-of-the-art speaker separation model architecture. We adapt the model with periodic activations and a composite loss function that remains effective when stems are silent, enabling robust detection and separation. Experiments on the JaCappella dataset demonstrate that our approach achieves state-of-the-art performance in both full-ensemble and subset singer separation scenarios, outperforming spectrogram-based baselines while generalizing to realistic mixtures with varying numbers of singers.

[LG-5] Linking Process to Outcome: Conditional Reward Modeling for LLM Reasoning

链接: https://arxiv.org/abs/2509.26578
作者: Zheng Zhang,Ziwei Shan,Kaitao Song,Yexin Li,Kan Ren
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Process Reward Models (PRMs) have emerged as a promising approach to enhance the reasoning capabilities of large language models (LLMs) by guiding their step-by-step reasoning toward a final answer. However, existing PRMs either treat each reasoning step in isolation, failing to capture inter-step dependencies, or struggle to align process rewards with the final outcome. Consequently, the reward signal fails to respect temporal causality in sequential reasoning and faces ambiguous credit assignment. These limitations make downstream models vulnerable to reward hacking and lead to suboptimal performance. In this work, we propose Conditional Reward Modeling (CRM) that frames LLM reasoning as a temporal process leading to a correct answer. The reward of each reasoning step is not only conditioned on the preceding steps but also explicitly linked to the final outcome of the reasoning trajectory. By enforcing conditional probability rules, our design captures the causal relationships among reasoning steps, with the link to the outcome allowing precise attribution of each intermediate step, thereby resolving credit assignment ambiguity. Further, through this consistent probabilistic modeling, the rewards produced by CRM enable more reliable cross-sample comparison. Experiments across Best-of-N sampling, beam search and reinforcement learning demonstrate that CRM consistently outperforms existing reward models, offering a principled framework for enhancing LLM reasoning. In particular, CRM is more robust to reward hacking and delivers stable downstream improvements without relying on verifiable rewards derived from ground truth.

[LG-6] Importance of localized dilatation and distensibility in identifying determinants of thoracic aortic aneurysm with neural operators

链接: https://arxiv.org/abs/2509.26576
作者: David S. Li,Somdatta Goswami,Qianying Cao,Vivek Oommen,Roland Assi,Jay D. Humphrey,George E. Karniadakis
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:Thoracic aortic aneurysms (TAAs) arise from diverse mechanical and mechanobiological disruptions to the aortic wall that increase the risk of dissection or rupture. Evidence links TAA development to dysfunctions in the aortic mechanotransduction axis, including loss of elastic fiber integrity and cell-matrix connections. Because distinct insults create different mechanical vulnerabilities, there is a critical need to identify interacting factors that drive progression. Here, we use a finite element framework to generate synthetic TAAs from hundreds of heterogeneous insults spanning varying degrees of elastic fiber damage and impaired mechanosensing. From these simulations, we construct spatial maps of localized dilatation and distensibility to train neural networks that predict the initiating combined insult. We compare several architectures (Deep Operator Networks, UNets, and Laplace Neural Operators) and multiple input data formats to define a standard for future subject-specific modeling. We also quantify predictive performance when networks are trained using only geometric data (dilatation) versus both geometric and mechanical data (dilatation plus distensibility). Across all networks, prediction errors are significantly higher when trained on dilatation alone, underscoring the added value of distensibility information. Among the tested models, UNet consistently provides the highest accuracy across all data formats. These findings highlight the importance of acquiring full-field measurements of both dilatation and distensibility in TAA assessment to reveal the mechanobiological drivers of disease and support the development of personalized treatment strategies.

[LG-7] DeepProv: Behavioral Characterization and Repair of Neural Networks via Inference Provenance Graph Analysis ACSAC

链接: https://arxiv.org/abs/2509.26562
作者: Firas Ben Hmida,Abderrahmen Amich,Ata Kaboudi,Birhanu Eshete
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 18 pages, 9 figures, 6 tables, To appear in the 41st Annual Computer Security Applications Conference (ACSAC), 2025

点击查看摘要

Abstract:Deep neural networks (DNNs) are increasingly being deployed in high-stakes applications, from self-driving cars to biometric authentication. However, their unpredictable and unreliable behaviors in real-world settings require new approaches to characterize and ensure their reliability. This paper introduces DeepProv, a novel and customizable system designed to capture and characterize the runtime behavior of DNNs during inference by using their underlying graph structure. Inspired by system audit provenance graphs, DeepProv models the computational information flow of a DNN’s inference process through Inference Provenance Graphs (IPGs). These graphs provide a detailed structural representation of the behavior of DNN, allowing both empirical and structural analysis. DeepProv uses these insights to systematically repair DNNs for specific objectives, such as improving robustness, privacy, or fairness. We instantiate DeepProv with adversarial robustness as the goal of model repair and conduct extensive case studies to evaluate its effectiveness. Our results demonstrate its effectiveness and scalability across diverse classification tasks, attack scenarios, and model complexities. DeepProv automatically identifies repair actions at the node and edge-level within IPGs, significantly enhancing the robustness of the model. In particular, applying DeepProv repair strategies to just a single layer of a DNN yields an average 55% improvement in adversarial accuracy. Moreover, DeepProv complements existing defenses, achieving substantial gains in adversarial robustness. Beyond robustness, we demonstrate the broader potential of DeepProv as an adaptable system to characterize DNN behavior in other critical areas, such as privacy auditing and fairness analysis.

[LG-8] Towards Verified Code Reasoning by LLMs

链接: https://arxiv.org/abs/2509.26546
作者: Meghana Sistla,Gogul Balakrishnan,Pat Rondon,José Cambronero,Michele Tufano,Satish Chandra
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 43 pages

点击查看摘要

Abstract:While LLM-based agents are able to tackle a wide variety of code reasoning questions, the answers are not always correct. This prevents the agent from being useful in situations where high precision is desired: (1) helping a software engineer understand a new code base, (2) helping a software engineer during code review sessions, and (3) ensuring that the code generated by an automated code generation system meets certain requirements (e.g. fixes a bug, improves readability, implements a feature). As a result of this lack of trustworthiness, the agent’s answers need to be manually verified before they can be trusted. Manually confirming responses from a code reasoning agent requires human effort and can result in slower developer productivity, which weakens the assistance benefits of the agent. In this paper, we describe a method to automatically validate the answers provided by a code reasoning agent by verifying its reasoning steps. At a very high level, the method consists of extracting a formal representation of the agent’s response and, subsequently, using formal verification and program analysis tools to verify the agent’s reasoning steps. We applied this approach to a benchmark set of 20 uninitialized variable errors detected by sanitizers and 20 program equivalence queries. For the uninitialized variable errors, the formal verification step was able to validate the agent’s reasoning on 13/20 examples, and for the program equivalence queries, the formal verification step successfully caught 6/8 incorrect judgments made by the agent.

[LG-9] Bayesian Influence Functions for Hessian-Free Data Attribution

链接: https://arxiv.org/abs/2509.26544
作者: Philipp Alexander Kreer,Wilson Wu,Maxwell Adam,Zach Furman,Jesse Hoogland
类目: Machine Learning (cs.LG)
*备注: 32 pages, 19 figures

点击查看摘要

Abstract:Classical influence functions face significant challenges when applied to deep neural networks, primarily due to non-invertible Hessians and high-dimensional parameter spaces. We propose the local Bayesian influence function (BIF), an extension of classical influence functions that replaces Hessian inversion with loss landscape statistics that can be estimated via stochastic-gradient MCMC sampling. This Hessian-free approach captures higher-order interactions among parameters and scales efficiently to neural networks with billions of parameters. We demonstrate state-of-the-art results on predicting retraining experiments.

[LG-10] TASP: Topology-aware Sequence Parallelism

链接: https://arxiv.org/abs/2509.26541
作者: Yida Wang(1 and 3),Ke Hong(2 and 3),Xiuhong Li(3),Yuanchao Xu(1),Wenxun Wang(2),Guohao Dai(3 and 4),Yu Wang(2) ((1) Capital Normal University, (2) Tsinghua University, (3) Infinigence-AI, (4) Shanghai Jiao Tong University)
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Long-context large language models (LLMs) face constraints due to the quadratic complexity of the self-attention mechanism. The mainstream sequence parallelism (SP) method, Ring Attention, attempts to solve this by distributing the query into multiple query chunks across accelerators and enable each Q tensor to access all KV tensors from other accelerators via the Ring AllGather communication primitive. However, it exhibits low communication efficiency, restricting its practical applicability. This inefficiency stems from the mismatch between the Ring AllGather communication primitive it adopts and the AlltoAll topology of modern accelerators. A Ring AllGather primitive is composed of iterations of ring-styled data transfer, which can only utilize a very limited fraction of an AlltoAll topology. Inspired by the Hamiltonian decomposition of complete directed graphs, we identify that modern accelerator topology can be decomposed into multiple orthogonal ring datapaths which can concurrently transfer data without interference. Based on this, we further observe that the Ring AllGather primitive can also be decomposed into the same number of concurrent ring-styled data transfer at every iteration. Based on these insights, we propose TASP, a topology-aware SP method for long-context LLMs that fully utilizes the communication capacity of modern accelerators via topology decomposition and primitive decomposition. Experimental results on both single-node and multi-node NVIDIA H100 systems and a single-node AMD MI300X system demonstrate that TASP achieves higher communication efficiency than Ring Attention on these modern accelerator topologies and achieves up to 3.58 speedup than Ring Attention and its variant Zigzag-Ring Attention. The code is available at this https URL.

[LG-11] The Loss Kernel: A Geometric Probe for Deep Learning Interpretability

链接: https://arxiv.org/abs/2509.26537
作者: Maxwell Adam,Zach Furman,Jesse Hoogland
类目: Machine Learning (cs.LG)
*备注: 25 pages, 11 figures

点击查看摘要

Abstract:We introduce the loss kernel, an interpretability method for measuring similarity between data points according to a trained neural network. The kernel is the covariance matrix of per-sample losses computed under a distribution of low-loss-preserving parameter perturbations. We first validate our method on a synthetic multitask problem, showing it separates inputs by task as predicted by theory. We then apply this kernel to Inception-v1 to visualize the structure of ImageNet, and we show that the kernel’s structure aligns with the WordNet semantic hierarchy. This establishes the loss kernel as a practical tool for interpretability and data attribution.
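
The quantity itself is easy to state in code: a covariance matrix of per-sample losses under parameter perturbations. In the hedged sketch below, isotropic Gaussian perturbations stand in for the paper's low-loss-preserving distribution, which is the part that actually requires care in practice.

```python
import torch

def loss_kernel(model, loss_fn, xs, ys, n_perturbations=64, sigma=1e-2):
    """Sketch of the loss-kernel idea: covariance of per-sample losses under
    random parameter perturbations. Gaussian perturbations of scale `sigma`
    stand in for the paper's low-loss-preserving distribution."""
    base = [p.detach().clone() for p in model.parameters()]
    losses = []  # one row of per-sample losses per perturbation
    with torch.no_grad():
        for _ in range(n_perturbations):
            for p, b in zip(model.parameters(), base):
                p.copy_(b + sigma * torch.randn_like(b))
            per_sample = loss_fn(model(xs), ys)   # loss_fn must use reduction='none'
            losses.append(per_sample)
        for p, b in zip(model.parameters(), base):  # restore original weights
            p.copy_(b)
    L = torch.stack(losses)                         # (n_perturbations, n_samples)
    L = L - L.mean(dim=0, keepdim=True)
    return (L.T @ L) / (L.shape[0] - 1)             # (n_samples, n_samples) covariance
```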

[LG-12] Machine-Learning Driven Load Shedding to Mitigate Instability Attacks in Power Grids

链接: https://arxiv.org/abs/2509.26532
作者: Justin Tackett,Benjamin Francis,Luis Garcia,David Grimsman,Sean Warnick
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Every year critical infrastructure becomes more complex and we grow to rely on it more and more. With this reliance, it becomes an attractive target for cyberattacks from sophisticated actors, with one of the most attractive targets being the power grid. One class of attacks, instability attacks, is a newer type of attack that has relatively few protections developed. We present a cost effective, data-driven approach to training a supervised machine learning model to retrofit load shedding decision systems in power grids with the capacity to defend against instability attacks. We show a proof of concept on the IEEE 14 Bus System using the Achilles Heel Technologies Power Grid Analyzer, and show through an implementation of modified Prony analysis (MPA) that MPA is a viable method for detecting instability attacks and triggering defense mechanisms.

[LG-13] Entropy After ⟨/Think⟩ for reasoning model early exiting

链接: https://arxiv.org/abs/2509.26522
作者: Xi Wang,James McInerney,Lequn Wang,Nathan Kallus
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large reasoning models show improved performance with longer chains of thought. However, recent work has highlighted (qualitatively) their tendency to overthink, continuing to revise answers even after reaching the correct solution. We quantitatively confirm this inefficiency by tracking Pass@1 for answers averaged over a large number of rollouts and find that the model often begins to always produce the correct answer early in the reasoning, making extra reasoning a waste of tokens. To detect and prevent overthinking, we propose a simple and inexpensive novel signal – Entropy After /Think (EAT) – for monitoring and deciding whether to exit reasoning early. By appending a stop thinking token (/think) and monitoring the entropy of the following token as the model reasons, we obtain a trajectory that decreases and stabilizes when Pass@1 plateaus; thresholding its variance under an exponential moving average yields a practical stopping rule. Importantly, our approach enables adaptively allocating compute based on the EAT trajectory, allowing us to spend compute in a more efficient way compared with fixing the token budget for all questions. Empirically, on MATH500 and AIME2025, EAT reduces token usage by 13 - 21% without harming accuracy, and it remains effective in black box settings where logits from the reasoning model are not accessible, and EAT is computed with proxy models.
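
A rough sketch of the signal and the stopping rule follows; the `</think>` token string, the EMA constants, and the variance threshold are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def eat_entropy(model, tokenizer, prompt_ids, reasoning_ids):
    """Entropy-After-</think> signal (a sketch): append the stop-thinking token to
    the partial reasoning and measure the entropy of the next-token distribution."""
    stop_ids = tokenizer("</think>", add_special_tokens=False, return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, reasoning_ids, stop_ids], dim=-1)
    with torch.no_grad():
        logits = model(ids).logits[0, -1]          # distribution right after </think>
    p = F.softmax(logits, dim=-1)
    return -(p * torch.log(p + 1e-12)).sum().item()

def should_stop(entropy_history, alpha=0.1, var_threshold=1e-3, warmup=5):
    """Stop once the EMA-smoothed variance of the EAT trajectory is small."""
    if len(entropy_history) < warmup:
        return False
    ema, ema_var = entropy_history[0], 0.0
    for e in entropy_history[1:]:
        diff = e - ema
        ema += alpha * diff
        ema_var = (1 - alpha) * (ema_var + alpha * diff * diff)
    return ema_var < var_threshold
```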

[LG-14] Signal-Aware Workload Shifting Algorithms with Uncertainty-Quantified Predictors

链接: https://arxiv.org/abs/2509.26511
作者: Ezra Johnson,Adam Lechowicz,Mohammad Hajiesmaili
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: 19 pages, 3 figures

点击查看摘要

Abstract:A wide range of sustainability and grid-integration strategies depend on workload shifting, which aligns the timing of energy consumption with external signals such as grid curtailment events, carbon intensity, or time-of-use electricity prices. The main challenge lies in the online nature of the problem: operators must make real-time decisions (e.g., whether to consume energy now) without knowledge of the future. While forecasts of signal values are typically available, prior work on learning-augmented online algorithms has relied almost exclusively on simple point forecasts. In parallel, the forecasting research has made significant progress in uncertainty quantification (UQ), which provides richer and more fine-grained predictive information. In this paper, we study how online workload shifting can leverage UQ predictors to improve decision-making. We introduce UQ-Advice, a learning-augmented algorithm that systematically integrates UQ forecasts through a decision uncertainty score that measures how forecast uncertainty affects optimal future decisions. By introducing UQ-robustness, a new metric that characterizes how performance degrades with forecast uncertainty, we establish theoretical performance guarantees for UQ-Advice. Finally, using trace-driven experiments on carbon intensity and electricity price data, we demonstrate that UQ-Advice consistently outperforms robust baselines and existing learning-augmented methods that ignore uncertainty.

[LG-15] Equivariance by Local Canonicalization: A Matter of Representation

链接: https://arxiv.org/abs/2509.26499
作者: Gerrit Gerhartz,Peter Lippmann,Fred A. Hamprecht
类目: Machine Learning (cs.LG)
*备注: To be presented at NeurReps Workshop 2025

点击查看摘要

Abstract:Equivariant neural networks offer strong inductive biases for learning from molecular and geometric data but often rely on specialized, computationally expensive tensor operations. We present a framework to transfers existing tensor field networks into the more efficient local canonicalization paradigm, preserving equivariance while significantly improving the runtime. Within this framework, we systematically compare different equivariant representations in terms of theoretical complexity, empirical runtime, and predictive accuracy. We publish the tensor_frames package, a PyTorchGeometric based implementation for local canonicalization, that enables straightforward integration of equivariance into any standard message passing neural network.

[LG-16] DiVeQ: Differentiable Vector Quantization Using the Reparameterization Trick

链接: https://arxiv.org/abs/2509.26469
作者: Mohammad Hassan Vali,Tom Bäckström,Arno Solin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Vector quantization is common in deep models, yet its hard assignments block gradients and hinder end-to-end training. We propose DiVeQ, which treats quantization as adding an error vector that mimics the quantization distortion, keeping the forward pass hard while letting gradients flow. We also present a space-filling variant (SF-DiVeQ) that assigns to a curve constructed by the lines connecting codewords, resulting in less quantization error and full codebook usage. Both methods train end-to-end without requiring auxiliary losses or temperature schedules. On VQ-VAE compression and VQGAN generation across various data sets, they improve reconstruction and sample quality over alternative quantization approaches.

[LG-17] fev-bench: A Realistic Benchmark for Time Series Forecasting

链接: https://arxiv.org/abs/2509.26468
作者: Oleksandr Shchur,Abdul Fatir Ansari,Caner Turkmen,Lorenzo Stella,Nick Erickson,Pablo Guerron,Michael Bohlke-Schneider,Yuyang Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Benchmark quality is critical for meaningful evaluation and sustained progress in time series forecasting, particularly given the recent rise of pretrained models. Existing benchmarks often have narrow domain coverage or overlook important real-world settings, such as tasks with covariates. Additionally, their aggregation procedures often lack statistical rigor, making it unclear whether observed performance differences reflect true improvements or random variation. Many benchmarks also fail to provide infrastructure for consistent evaluation or are too rigid to integrate into existing pipelines. To address these gaps, we propose fev-bench, a benchmark comprising 100 forecasting tasks across seven domains, including 46 tasks with covariates. Supporting the benchmark, we introduce fev, a lightweight Python library for benchmarking forecasting models that emphasizes reproducibility and seamless integration with existing workflows. Using fev, fev-bench employs principled aggregation methods with bootstrapped confidence intervals to report model performance along two complementary dimensions: win rates and skill scores. We report results on fev-bench for various pretrained, statistical and baseline models, and identify promising directions for future research.

[LG-18] Stabilization of nonlinear systems with unknown delays via delay-adaptive neural operator approximate predictors

链接: https://arxiv.org/abs/2509.26443
作者: Luke Bhan,Miroslav Krstic,Yuanyuan Shi
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注: 20 pages, 2 figures

点击查看摘要

Abstract:This work establishes the first rigorous stability guarantees for approximate predictors in delay-adaptive control of nonlinear systems, addressing a key challenge in practical implementations where exact predictors are unavailable. We analyze two scenarios: (i) when the actuated input is directly measurable, and (ii) when it is estimated online. For the measurable input case, we prove semi-global practical asymptotic stability with an explicit bound proportional to the approximation error \epsilon . For the unmeasured input case, we demonstrate local practical asymptotic stability, with the region of attraction explicitly dependent on both the initial delay estimate and the predictor approximation error. To bridge theory and practice, we show that neural operators-a flexible class of neural network-based approximators-can achieve arbitrarily small approximation errors, thus satisfying the conditions of our stability theorems. Numerical experiments on two nonlinear benchmark systems-a biological protein activator/repressor model and a micro-organism growth Chemostat model-validate our theoretical results. In particular, our numerical simulations confirm stability under approximate predictors, highlight the strong generalization capabilities of neural operators, and demonstrate a substantial computational speedup of up to 15x compared to a baseline fixed-point method.

[LG-19] Extensions of Robbins-Siegmund Theorem with Applications in Reinforcement Learning

链接: https://arxiv.org/abs/2509.26442
作者: Xinyu Liu,Zixuan Xie,Shangtong Zhang
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:The Robbins-Siegmund theorem establishes the convergence of stochastic processes that are almost supermartingales and is foundational for analyzing a wide range of stochastic iterative algorithms in stochastic approximation and reinforcement learning (RL). However, its original form has a significant limitation as it requires the zero-order term to be summable. In many important RL applications, this summable condition, however, cannot be met. This limitation motivates us to extend the Robbins-Siegmund theorem for almost supermartingales where the zero-order term is not summable but only square summable. Particularly, we introduce a novel and mild assumption on the increments of the stochastic processes. This together with the square summable condition enables an almost sure convergence to a bounded set. Additionally, we further provide almost sure convergence rates, high probability concentration bounds, and L^p convergence rates. We then apply the new results in stochastic approximation and RL. Notably, we obtain the first almost sure convergence rate, the first high probability concentration bound, and the first L^p convergence rate for Q -learning with linear function approximation.
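
For context, the classical result being extended can be stated as follows; the extension described above replaces the summability of the zero-order term with square-summability plus a mild condition on the increments.

```latex
% Classical Robbins–Siegmund lemma (stated for context).
Let $(V_n),(a_n),(b_n),(c_n)$ be nonnegative random sequences adapted to a
filtration $(\mathcal{F}_n)$ satisfying, almost surely,
\[
  \mathbb{E}\!\left[\,V_{n+1}\mid\mathcal{F}_n\,\right]
  \;\le\; (1+a_n)\,V_n \;-\; b_n \;+\; c_n ,
  \qquad \sum_n a_n < \infty, \quad \sum_n c_n < \infty .
\]
Then $V_n$ converges almost surely to a finite random variable and
$\sum_n b_n < \infty$ almost surely.
```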

[LG-20] Refine Drugs, Don't Complete Them: Uniform-Source Discrete Flows for Fragment-Based Drug Discovery

链接: https://arxiv.org/abs/2509.26405
作者: Benno Kaech,Luis Wyss,Karsten Borgwardt,Gianvito Grasso
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce InVirtuoGen, a discrete flow generative model for fragmented SMILES for de novo and fragment-constrained generation, and target-property/lead optimization of small molecules. The model learns to transform a uniform source over all possible tokens into the data distribution. Unlike masked models, its training loss accounts for predictions on all sequence positions at every denoising step, shifting the generation paradigm from completion to refinement, and decoupling the number of sampling steps from the sequence length. For de novo generation, InVirtuoGen achieves a stronger quality-diversity pareto frontier than prior fragment-based models and competitive performance on fragment-constrained tasks. For property and lead optimization, we propose a hybrid scheme that combines a genetic algorithm with a Proximal Property Optimization fine-tuning strategy adapted to discrete flows. Our approach sets a new state-of-the-art on the Practical Molecular Optimization benchmark, measured by top-10 AUC across tasks, and yields higher docking scores in lead optimization than previous baselines. InVirtuoGen thus establishes a versatile generative foundation for drug discovery, from early hit finding to multi-objective lead optimization. We further contribute to open science by releasing pretrained checkpoints and code, making our results fully reproducible (this https URL).

[LG-21] Data-to-Energy Stochastic Dynamics

链接: https://arxiv.org/abs/2509.26364
作者: Kirill Tamogashev,Nikolay Malkin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Schrödinger bridge problem is concerned with finding a stochastic dynamical system bridging two marginal distributions that minimises a certain transportation cost. This problem, which represents a generalisation of optimal transport to the stochastic case, has received attention due to its connections to diffusion models and flow matching, as well as its applications in the natural sciences. However, all existing algorithms allow to infer such dynamics only for cases where samples from both distributions are available. In this paper, we propose the first general method for modelling Schrödinger bridges when one (or both) distributions are given by their unnormalised densities, with no access to data samples. Our algorithm relies on a generalisation of the iterative proportional fitting (IPF) procedure to the data-free case, inspired by recent developments in off-policy reinforcement learning for training of diffusion samplers. We demonstrate the efficacy of the proposed data-to-energy IPF on synthetic problems, finding that it can successfully learn transports between multimodal distributions. As a secondary consequence of our reinforcement learning formulation, which assumes a fixed time discretisation scheme for the dynamics, we find that existing data-to-data Schrödinger bridge algorithms can be substantially improved by learning the diffusion coefficient of the dynamics. Finally, we apply the newly developed algorithm to the problem of sampling posterior distributions in latent spaces of generative models, thus creating a data-free image-to-image translation method. Code: this https URL

[LG-22] LLM-Assisted Emergency Triage Benchmark: Bridging Hospital-Rich and MCI-Like Field Simulation NEURIPS2025 ALT

链接: https://arxiv.org/abs/2509.26351
作者: Joshua Sebastian,Karma Tobden,KMA Solaiman
类目: Machine Learning (cs.LG)
*备注: Submitted to GenAI4Health@NeurIPS 2025. This is the first version of the LLM-assisted emergency triage benchmark dataset and baseline models. Authors: Joshua Sebastian, Karma Tobden, KMA Solaiman

点击查看摘要

Abstract:Research on emergency and mass casualty incident (MCI) triage has been limited by the absence of openly usable, reproducible benchmarks. Yet these scenarios demand rapid identification of the patients most in need, where accurate deterioration prediction can guide timely interventions. While the MIMIC-IV-ED database is openly available to credentialed researchers, transforming it into a triage-focused benchmark requires extensive preprocessing, feature harmonization, and schema alignment – barriers that restrict accessibility to only highly technical users. We address these gaps by first introducing an open, LLM-assisted emergency triage benchmark for deterioration prediction (ICU transfer, in-hospital mortality). The benchmark then defines two regimes: (i) a hospital-rich setting with vitals, labs, notes, chief complaints, and structured observations, and (ii) an MCI-like field simulation limited to vitals, observations, and notes. Large language models (LLMs) contributed directly to dataset construction by (i) harmonizing noisy fields such as AVPU and breathing devices, (ii) prioritizing clinically relevant vitals and labs, and (iii) guiding schema alignment and efficient merging of disparate tables. We further provide baseline models and SHAP-based interpretability analyses, illustrating predictive gaps between regimes and the features most critical for triage. Together, these contributions make triage prediction research more reproducible and accessible – a step toward dataset democratization in clinical AI.

[LG-23] Memory-Driven Self-Improvement for Decision Making with Large Language Models

链接: https://arxiv.org/abs/2509.26340
作者: Xue Yan,Zijing Ou,Mengyue Yang,Yan Song,Haifeng Zhang,Yingzhen Li,Jun Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have emerged as effective action policies for sequential decision-making (SDM) tasks due to their extensive prior knowledge. However, this broad yet general knowledge is often insufficient for specific decision-making tasks with limited task-related data, making it challenging to efficiently adapt LLMs to specific SDM tasks. To address this challenge, we propose a memory-driven self-improvement framework that combines LLM general prior knowledge with a compact memory of domain-specific experiences. Memory retains past interactions and associated Q-values, thereby capturing decision-relevant knowledge that facilitates accurate value estimation and informs the LLM prior refinement. The refined LLM prior, in turn, generates higher-reward trajectories that further enrich memory, forming a natural self-improvement framework where memory and LLM prior mutually reinforce each other. Experiments show that our memory-driven approach significantly outperforms both traditional RL and LLM-based baselines, e.g., improving performance by over 40% on in-distribution tasks and over 75% when generalized to unseen tasks in ALFWorld.

[LG-24] FedMuon: Federated Learning with Bias-corrected LMO-based Optimization

链接: https://arxiv.org/abs/2509.26337
作者: Yuki Takezawa,Anastasia Koloskova,Xiaowen Jiang,Sebastian U. Stich
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Recently, a new optimization method based on the linear minimization oracle (LMO), called Muon, has been attracting increasing attention since it can train neural networks faster than existing adaptive optimization methods, such as Adam. In this paper, we study how Muon can be utilized in federated learning. We first show that straightforwardly using Muon as the local optimizer of FedAvg does not converge to the stationary point since the LMO is a biased operator. We then propose FedMuon which can mitigate this issue. We also analyze how solving the LMO approximately affects the convergence rate and find that, surprisingly, FedMuon can converge for any number of Newton-Schulz iterations, while it can converge faster as we solve the LMO more accurately. Through experiments, we demonstrated that FedMuon can outperform the state-of-the-art federated learning methods.
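
For readers unfamiliar with Muon-style LMO updates, the sketch below shows only the shared building block: orthogonalizing the momentum matrix (computed exactly via SVD here, whereas Muon approximates it with Newton–Schulz iterations) and applying it as a local step. The bias correction that makes FedMuon converge in the federated setting is the paper's contribution and is not shown; all hyperparameters are illustrative.

```python
import torch

def orthogonalize(m: torch.Tensor) -> torch.Tensor:
    """Replace the singular values of a matrix momentum by 1 (U V^T).
    Muon-style optimizers approximate this with Newton–Schulz iterations
    instead of a full SVD."""
    u, _, vh = torch.linalg.svd(m, full_matrices=False)
    return u @ vh

def local_muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One illustrative local matrix-parameter update on a client."""
    momentum.mul_(beta).add_(grad)
    weight.add_(orthogonalize(momentum), alpha=-lr)
    return weight, momentum
```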

[LG-25] A Generalized Information Bottleneck Theory of Deep Learning

链接: https://arxiv.org/abs/2509.26327
作者: Charles Westphal,Stephen Hailes,Mirco Musolesi
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:The Information Bottleneck (IB) principle offers a compelling theoretical framework to understand how neural networks (NNs) learn. However, its practical utility has been constrained by unresolved theoretical ambiguities and significant challenges in accurate estimation. In this paper, we present a Generalized Information Bottleneck (GIB) framework that reformulates the original IB principle through the lens of synergy, i.e., the information obtainable only through joint processing of features. We provide theoretical and empirical evidence demonstrating that synergistic functions achieve superior generalization compared to their non-synergistic counterparts. Building on these foundations we re-formulate the IB using a computable definition of synergy based on the average interaction information (II) of each feature with those remaining. We demonstrate that the original IB objective is upper bounded by our GIB in the case of perfect estimation, ensuring compatibility with existing IB theory while addressing its limitations. Our experimental results demonstrate that GIB consistently exhibits compression phases across a wide range of architectures (including those with ReLU activations where the standard IB fails), while yielding interpretable dynamics in both CNNs and Transformers and aligning more closely with our understanding of adversarial robustness.

[LG-26] ACE: Adapting sampling for Counterfactual Explanations

链接: https://arxiv.org/abs/2509.26322
作者: Margarita A. Guerrero,Cristian R. Rojas
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 15 pages

点击查看摘要

Abstract:Counterfactual Explanations (CFEs) interpret machine learning models by identifying the smallest change to input features needed to change the model’s prediction to a desired output. For classification tasks, CFEs determine how close a given sample is to the decision boundary of a trained classifier. Existing methods are often sample-inefficient, requiring numerous evaluations of a black-box model – an approach that is both costly and impractical when access to the model is limited. We propose Adaptive sampling for Counterfactual Explanations (ACE), a sample-efficient algorithm combining Bayesian estimation and stochastic optimization to approximate the decision boundary with fewer queries. By prioritizing informative points, ACE minimizes evaluations while generating accurate and feasible CFEs. Extensive empirical results show that ACE achieves superior evaluation efficiency compared to state-of-the-art methods, while maintaining effectiveness in identifying minimal and actionable changes.

[LG-27] A Review on Single-Problem Multi-Attempt Heuristic Optimization

链接: https://arxiv.org/abs/2509.26321
作者: Judith Echevarrieta,Etor Arza,Aritz Pérez,Josu Ceberio
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In certain real-world optimization scenarios, practitioners are not interested in solving multiple problems but rather in finding the best solution to a single, specific problem. When the computational budget is large relative to the cost of evaluating a candidate solution, multiple heuristic alternatives can be tried to solve the same given problem, each possibly with a different algorithm, parameter configuration, initialization, or stopping criterion. The sequential selection of which alternative to try next is crucial for efficiently identifying the one that provides the best possible solution across multiple attempts. Despite the relevance of this problem in practice, it has not yet been the exclusive focus of any existing review. Several sequential alternative selection strategies have been proposed in different research topics, but they have not been comprehensively and systematically unified under a common perspective. This work presents a focused review of single-problem multi-attempt heuristic optimization. It brings together suitable strategies to this problem that have been studied separately through algorithm selection, parameter tuning, multi-start and resource allocation. These strategies are explained using a unified terminology within a common framework, which supports the development of a taxonomy for systematically organizing and classifying them.

[LG-28] Attribution-Guided Decoding

链接: https://arxiv.org/abs/2509.26307
作者: Piotr Komorowski,Elena Golimblevskaia,Reduan Achtibat,Thomas Wiegand,Sebastian Lapuschkin,Wojciech Samek
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The capacity of Large Language Models (LLMs) to follow complex instructions and generate factually accurate text is critical for their real-world application. However, standard decoding methods often fail to robustly satisfy these requirements, while existing control techniques frequently degrade general output quality. In this work, we introduce Attribution-Guided Decoding (AGD), an interpretability-based decoding strategy. Instead of directly manipulating model activations, AGD considers a set of high-probability output token candidates and selects the one that exhibits the highest attribution to a user-defined Region of Interest (ROI). This ROI can be flexibly defined over different parts of the model’s input or internal components, allowing AGD to steer generation towards various desirable behaviors. We demonstrate AGD’s efficacy across three challenging domains. For instruction following, we show that AGD significantly boosts adherence (e.g., improving the overall success rate on Llama 3.1 from 66.0% to 79.1%). For knowledge-intensive tasks, we show that guiding generation towards usage of internal knowledge components or contextual sources can reduce hallucinations and improve factual accuracy in both closed-book and open-book settings. Furthermore, we propose an adaptive, entropy-based variant of AGD that mitigates quality degradation and reduces computational overhead by applying guidance only when the model is uncertain. Our work presents a versatile, more interpretable, and effective method for enhancing the reliability of modern LLMs.
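
A schematic of the decoding step: among the top-k next-token candidates, pick the one with the highest attribution to the region of interest. The `roi_attribution` function below is a placeholder for whatever attribution backend is used; the paper's specific attribution method and thresholds are not reproduced here.

```python
import torch
import torch.nn.functional as F

def agd_step(model, input_ids, roi_attribution, k=10):
    """One attribution-guided decoding step (a sketch).

    roi_attribution(model, input_ids, token_id) is a hypothetical callable that
    returns a scalar attribution score of the candidate token to the ROI."""
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]
    topk = torch.topk(F.log_softmax(logits, dim=-1), k)   # high-probability candidates
    scores = [roi_attribution(model, input_ids, tok.item()) for tok in topk.indices]
    best = topk.indices[int(torch.tensor(scores).argmax())]
    return best.item()
```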

[LG-29] NeuroTTT: Bridging Pretraining-Downstream Task Misalignment in EEG Foundation Models via Test-Time Training

链接: https://arxiv.org/abs/2509.26301
作者: Suli Wang,Yangshen Deng,Zhenghua Bao,Xinyu Zhan,Yiqun Duan
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Large-scale foundation models for EEG signals offer a promising path to generalizable brain-computer interface (BCI) applications, but they often suffer from misalignment between pretraining objectives and downstream tasks, as well as significant cross-subject distribution shifts. This paper addresses these challenges by introducing a two-stage alignment strategy that bridges the gap between generic pretraining and specific EEG decoding tasks. First, we propose NeuroTTT: a domain-specific self-supervised fine-tuning paradigm that augments the foundation model with task-relevant self-supervised objectives, aligning latent representations to important spectral, spatial, and temporal EEG features without requiring additional labeled data. Second, we incorporate test-time training (TTT) at inference, we perform (i) self-supervised test-time training on individual unlabeled test samples and (ii) prediction entropy minimization (Tent), which updates only normalization statistics to continually calibrate the model to each new input on the fly. Our approach, which, to our knowledge, is the first to unify domain-tuned self-supervision with test-time training in large-scale EEG foundation models, yields substantially improved robustness and accuracy across diverse BCI tasks (imagined speech, stress detection, motor imagery). Using CBraMod and LaBraM as backbones, our method pushes their performance to a markedly higher level. Results on three diverse tasks demonstrate that the proposed alignment strategy achieves state-of-the-art performance, outperforming conventional fine-tuning and adaptation methods. Our code is available at this https URL.
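
The second test-time component, entropy minimization over normalization statistics, can be sketched generically as below: a Tent-style loop that updates only normalization-layer affine parameters on the unlabeled test batch. This is a minimal illustration, not the full NeuroTTT pipeline with its self-supervised objectives.

```python
import torch
import torch.nn as nn

def tent_adapt(model, x, steps=1, lr=1e-4):
    """Entropy-minimization test-time adaptation restricted to norm layers."""
    norm_params = []
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.LayerNorm)):
            m.requires_grad_(True)
            norm_params += list(m.parameters())
    opt = torch.optim.Adam(norm_params, lr=lr)
    for _ in range(steps):
        probs = torch.softmax(model(x), dim=-1)
        entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1).mean()
        opt.zero_grad()
        entropy.backward()
        opt.step()               # only normalization parameters are updated
    return model
```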

[LG-30] Tuning the Tuner: Introducing Hyperparameter Optimization for Auto-Tuning

链接: https://arxiv.org/abs/2509.26300
作者: Floris-Jan Willemsen,Rob V. van Nieuwpoort,Ben van Werkhoven
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
*备注:

点击查看摘要

Abstract:Automatic performance tuning (auto-tuning) is widely used to optimize performance-critical applications across many scientific domains by finding the best program variant among many choices. Efficient optimization algorithms are crucial for navigating the vast and complex search spaces in auto-tuning. As is well known in the context of machine learning and similar fields, hyperparameters critically shape optimization algorithm efficiency. Yet for auto-tuning frameworks, these hyperparameters are almost never tuned, and their potential performance impact has not been studied. We present a novel method for general hyperparameter tuning of optimization algorithms for auto-tuning, thus “tuning the tuner”. In particular, we propose a robust statistical method for evaluating hyperparameter performance across search spaces, publish a FAIR data set and software for reproducibility, and present a simulation mode that replays previously recorded tuning data, lowering the costs of hyperparameter tuning by two orders of magnitude. We show that even limited hyperparameter tuning can improve auto-tuner performance by 94.8% on average, and establish that the hyperparameters themselves can be optimized efficiently with meta-strategies (with an average improvement of 204.7%), demonstrating the often overlooked hyperparameter tuning as a powerful technique for advancing auto-tuning research and practice.

[LG-31] Reframing Generative Models for Physical Systems using Stochastic Interpolants

链接: https://arxiv.org/abs/2509.26282
作者: Anthony Zhou,Alexander Wikner,Amaury Lancelin,Pedram Hassanzadeh,Amir Barati Farimani
类目: Machine Learning (cs.LG)
*备注: Code and data is available at this http URL

点击查看摘要

Abstract:Generative models have recently emerged as powerful surrogates for physical systems, demonstrating increased accuracy, stability, and/or statistical fidelity. Most approaches rely on iteratively denoising a Gaussian, a choice that may not be the most effective for autoregressive prediction tasks in PDEs and dynamical systems such as climate. In this work, we benchmark generative models across diverse physical domains and tasks, and highlight the role of stochastic interpolants. By directly learning a stochastic process between current and future states, stochastic interpolants can leverage the proximity of successive physical distributions. This allows for generative models that can use fewer sampling steps and produce more accurate predictions than models relying on transporting Gaussian noise. Our experiments suggest that generative models need to balance deterministic accuracy, spectral consistency, and probabilistic calibration, and that stochastic interpolants can potentially fulfill these requirements by adjusting their sampling. This study establishes stochastic interpolants as a competitive baseline for physical emulation and gives insight into the abilities of different generative modeling frameworks.
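
A minimal training sketch for the interpolant idea, assuming a simple linear interpolant between the current state x0 and the next state x1 with a small stochastic bridge term; the interpolant schedules actually benchmarked in the paper may differ.

```python
import torch

def interpolant_loss(v_net, x0, x1, sigma=0.1):
    """Stochastic-interpolant training sketch for autoregressive emulation:
    x0 is the current state, x1 the next state, and v_net(x_t, t) regresses the
    time derivative of the interpolant. gamma(t) = sigma*t*(1-t) is one simple
    choice of bridge schedule."""
    b = x0.shape[0]
    t = torch.rand(b, *([1] * (x0.dim() - 1)))            # broadcastable time
    z = torch.randn_like(x0)
    x_t = (1 - t) * x0 + t * x1 + sigma * t * (1 - t) * z
    target = x1 - x0 + sigma * (1 - 2 * t) * z            # d/dt of the interpolant
    pred = v_net(x_t, t)
    return ((pred - target) ** 2).mean()
```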

[LG-32] Wasserstein Distributionally Robust Optimization Through the Lens of Structural Causal Models and Individual Fairness

链接: https://arxiv.org/abs/2509.26275
作者: Ahmad-Reza Ehyaei,Golnoosh Farnadi,Samira Samadi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, Wasserstein Distributionally Robust Optimization (DRO) has garnered substantial interest for its efficacy in data-driven decision-making under distributional uncertainty. However, limited research has explored the application of DRO to address individual fairness concerns, particularly when considering causal structures and sensitive attributes in learning problems. To address this gap, we first formulate the DRO problem from causality and individual fairness perspectives. We then present the DRO dual formulation as an efficient tool to convert the DRO problem into a more tractable and computationally efficient form. Next, we characterize the closed form of the approximate worst-case loss quantity as a regularizer, eliminating the max-step in the min-max DRO problem. We further estimate the regularizer in more general cases and explore the relationship between DRO and classical robust optimization. Finally, by removing the assumption of a known structural causal model, we provide finite sample error bounds when designing DRO with empirical distributions and estimated causal structures to ensure efficiency and robust learning.

[LG-33] From Fragile to Certified: Wasserstein Audits of Group Fairness Under Distribution Shift

链接: https://arxiv.org/abs/2509.26241
作者: Ahmad-Reza Ehyaei,Golnoosh Farnadi,Samira Samadi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Group-fairness metrics (e.g., equalized odds) can vary sharply across resamples and are especially brittle under distribution shift, undermining reliable audits. We propose a Wasserstein distributionally robust framework that certifies worst-case group fairness over a ball of plausible test distributions centered at the empirical law. Our formulation unifies common group fairness notions via a generic conditional-probability functional and defines \varepsilon -Wasserstein Distributional Fairness ( \varepsilon -WDF) as the audit target. Leveraging strong duality, we derive tractable reformulations and an efficient estimator (DRUNE) for \varepsilon -WDF. We prove feasibility and consistency and establish finite-sample certification guarantees for auditing fairness, along with quantitative bounds under smoothness and margin conditions. Across standard benchmarks and classifiers, \varepsilon -WDF delivers stable fairness assessments under distribution shift, providing a principled basis for auditing and certifying group fairness beyond observational data.

[LG-34] Beyond Linear Probes: Dynamic Safety Monitoring for Language Models

链接: https://arxiv.org/abs/2509.26238
作者: James Oldfield,Philip Torr,Ioannis Patras,Adel Bibi,Fazl Barez
类目: Machine Learning (cs.LG)
*备注: Project page: this http URL

点击查看摘要

[LG-35] Machine Learning Detection of Lithium Plating in Lithium-ion Cells: A Gaussian Process Approach

链接: https://arxiv.org/abs/2509.26234
作者: Ayush Patnaik,Adam B Zufall,Stephen K Robinson,Xinfan Lin
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Submitted to American Control Conference - ACC 2026

点击查看摘要

Abstract:Lithium plating during fast charging is a critical degradation mechanism that accelerates capacity fade and can trigger catastrophic safety failures. Recent work has identified a distinctive dQ/dV peak above 4.0 V as a reliable signature of plating onset; however, conventional methods for computing dQ/dV rely on finite differencing with filtering, which amplifies sensor noise and introduces bias in peak location. In this paper, we propose a Gaussian Process (GP) framework for lithium plating detection by directly modeling the charge-voltage relationship Q(V) as a stochastic process with calibrated uncertainty. Leveraging the property that derivatives of GPs remain GPs, we infer dQ/dV analytically and probabilistically from the posterior, enabling robust detection without ad hoc smoothing. The framework provides three key benefits: (i) noise-aware inference with hyperparameters learned from data, (ii) closed-form derivatives with credible intervals for uncertainty quantification, and (iii) scalability to online variants suitable for embedded BMS. Experimental validation on Li-ion coin cells across a range of C-rates (0.2C-1C) and temperatures (0-40°C) demonstrates that the GP-based method reliably detects plating peaks under low-temperature, high-rate charging, while correctly reporting no peaks in baseline cases. The concurrence of GP-identified differential peaks, reduced charge throughput, and capacity fade measured via reference performance tests confirms the method’s accuracy and robustness, establishing a practical pathway for real-time lithium plating detection.
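
The key property used here, that derivatives of a GP are again a GP, makes the dQ/dV estimate a closed-form by-product of the fit. Below is a minimal NumPy sketch with an RBF kernel; the hyperparameters are illustrative, whereas the paper learns them from data.

```python
import numpy as np

def gp_dqdv(V_train, Q_train, V_test, length_scale=0.05, sig_f=1.0, sig_n=1e-3):
    """Fit Q(V) with an RBF-kernel GP and return the posterior means of Q and dQ/dV."""
    def k(a, b):
        return sig_f**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / length_scale**2)

    K = k(V_train, V_train) + sig_n**2 * np.eye(len(V_train))
    alpha = np.linalg.solve(K, Q_train)                    # dual weights K^{-1} y
    Ks = k(V_test, V_train)
    mean_Q = Ks @ alpha
    # analytic derivative of the RBF kernel: d/dV* k(V*, V_i) = -(V* - V_i)/l^2 * k
    dKs = -(V_test[:, None] - V_train[None, :]) / length_scale**2 * Ks
    mean_dQdV = dKs @ alpha
    return mean_Q, mean_dQdV
```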

[LG-36] Marginal Flow: a flexible and efficient framework for density estimation

链接: https://arxiv.org/abs/2509.26221
作者: Marcello Massimo Negri,Jonathan Aellen,Manuel Jahn,AmirEhsan Khorashadizadeh,Volker Roth
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Current density modeling approaches suffer from at least one of the following shortcomings: expensive training, slow inference, approximate likelihood, mode collapse or architectural constraints like bijective mappings. We propose a simple yet powerful framework that overcomes these limitations altogether. We define our model q_\theta(x) through a parametric distribution q(x|w) with latent parameters w . Instead of directly optimizing the latent variables w , our idea is to marginalize them out by sampling w from a learnable distribution q_\theta(w) , hence the name Marginal Flow. In order to evaluate the learned density q_\theta(x) or to sample from it, we only need to draw samples from q_\theta(w) , which makes both operations efficient. The proposed model allows for exact density evaluation and is orders of magnitude faster than competing models both at training and inference. Furthermore, Marginal Flow is a flexible framework: it does not impose any restrictions on the neural network architecture, it enables learning distributions on lower-dimensional manifolds (either known or to be learned), it can be trained efficiently with any objective (e.g. forward and reverse KL divergence), and it easily handles multi-modal targets. We evaluate Marginal Flow extensively on various tasks including synthetic datasets, simulation-based inference, distributions on positive definite matrices and manifold learning in latent spaces of images.
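
The core construction is a Monte Carlo marginal, q_theta(x) = E_{w ~ q_theta(w)}[q(x|w)]. The sketch below assumes an isotropic Gaussian q(x|w) and a generic `sample_w` callable for the learnable latent distribution; both are illustrative choices, not the paper's specific parameterization.

```python
import math
import torch
import torch.distributions as D

def marginal_log_density(x, sample_w, n_samples=256, scale=0.1):
    """Monte Carlo estimate of log q_theta(x) = log E_{w~q_theta(w)}[q(x|w)]."""
    w = sample_w(n_samples)                                   # (n_samples, dim)
    comp = D.Independent(D.Normal(w.unsqueeze(1), scale), 1)  # q(x | w_k)
    log_px = comp.log_prob(x.unsqueeze(0))                    # (n_samples, batch)
    return torch.logsumexp(log_px, dim=0) - math.log(n_samples)

def sample_x(sample_w, n, scale=0.1):
    """Sampling is just: draw w from q_theta(w), then x ~ q(x|w)."""
    w = sample_w(n)
    return w + scale * torch.randn_like(w)
```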

[LG-37] The silence of the weights: an investigation of structural pruning strategies for attention-based audio signal architectures

链接: https://arxiv.org/abs/2509.26207
作者: Andrea Diecidue,Carlo Alberto Barbano,Piero Fraternali,Mathieu Fontaine,Enzo Tartaglione
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transformer-based models have become the state of the art across multiple domains, from natural language processing to machine listening, thanks to attention mechanisms. However, the attention layers require a large number of parameters and high-end hardware for both training and inference. We propose a novel pruning technique targeted explicitly at the attention mechanism, where we decouple the pruning of the four layers in the attention block, namely: query, keys, values and outputs’ projection matrices. We also investigate pruning strategies to prune along the head and channel dimensions, and compare the performance of the Audio Spectrogram Transformer (AST) model under different pruning scenarios. Our results show that even by pruning 50% of the attention parameters we incur a performance degradation of less than 1%.
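
下面给出解耦式注意力剪枝的极简示意(a minimal sketch, not the paper's exact procedure): 对 query/key/value/output 四个投影矩阵分别按头维度做结构化剪枝,每个矩阵使用各自的剪枝比例,按头的 L2 范数决定去留。

```python
# A minimal sketch of decoupled, head-wise structural pruning of the four attention
# projections. For simplicity all four matrices are treated uniformly here, although for
# the output projection the head structure actually lies along the input dimension.
import torch
import torch.nn as nn

def prune_heads_per_projection(proj: nn.Linear, n_heads: int, ratio: float):
    """Zero out the weakest `ratio` fraction of heads in one projection matrix."""
    W = proj.weight.data                        # (embed_dim, embed_dim)
    head_dim = W.shape[0] // n_heads
    heads = W.view(n_heads, head_dim, -1)       # view sharing storage with W
    norms = heads.flatten(1).norm(dim=1)        # one importance score per head
    k = int(ratio * n_heads)
    if k > 0:
        drop = norms.argsort()[:k]              # weakest heads
        heads[drop] = 0.0                       # structural zeroing (shape kept for clarity)

# Usage on a toy attention block with separate projections (hypothetical module names).
embed_dim, n_heads = 256, 8
q_proj, k_proj, v_proj, o_proj = (nn.Linear(embed_dim, embed_dim) for _ in range(4))
# Decoupled pruning: each projection gets its own ratio, e.g. prune values more than queries.
for proj, ratio in [(q_proj, 0.25), (k_proj, 0.25), (v_proj, 0.5), (o_proj, 0.25)]:
    prune_heads_per_projection(proj, n_heads, ratio)
```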

[LG-38] Self-supervised learning for phase retrieval

链接: https://arxiv.org/abs/2509.26203
作者: Victor Sechaud(Phys-ENS),Patrice Abry(Phys-ENS),Laurent Jacques(ICTEAM),Julián Tachella(Phys-ENS, CNRS)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: in French language. GRETSI, Aug 2025, Strasbourg, France

点击查看摘要

Abstract:In recent years, deep neural networks have emerged as a solution for inverse imaging problems. These networks are generally trained using pairs of images: one degraded and the other of high quality, the latter being called ‘ground truth’. However, in medical and scientific imaging, the lack of fully sampled data limits supervised learning. Recent advances have made it possible to reconstruct images from measurement data alone, eliminating the need for references. However, these methods remain limited to linear problems, excluding non-linear problems such as phase retrieval. We propose a self-supervised method that overcomes this limitation in the case of phase retrieval by using the natural invariance of images to translations.

[LG-39] PDE Solvers Should Be Local: Fast Stable Rollouts with Learned Local Stencils

链接: https://arxiv.org/abs/2509.26186
作者: Chun-Wun Cheng,Bin Dong,Carola-Bibiane Schönlieb,Angelica I Aviles-Rivero
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Neural operator models for solving partial differential equations (PDEs) often rely on global mixing mechanisms-such as spectral convolutions or attention-which tend to oversmooth sharp local dynamics and introduce high computational cost. We present FINO, a finite-difference-inspired neural architecture that enforces strict locality while retaining multiscale representational power. FINO replaces fixed finite-difference stencil coefficients with learnable convolutional kernels and evolves states via an explicit, learnable time-stepping scheme. A central Local Operator Block leverages a differential stencil layer, a gating mask, and a linear fuse step to construct adaptive derivative-like local features that propagate forward in time. Embedded in an encoder-decoder with a bottleneck, FINO captures fine-grained local structures while preserving interpretability. We establish (i) a composition error bound linking one-step approximation error to stable long-horizon rollouts under a Lipschitz condition, and (ii) a universal approximation theorem for discrete time-stepped PDE dynamics. (iii) Across six benchmarks and a climate modelling task, FINO achieves up to 44% lower error and up to around 2x speedups over state-of-the-art operator-learning baselines, demonstrating that strict locality with learnable time-stepping yields an accurate and scalable foundation for neural PDE solvers.
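
下面是"可学习局部模板 + 显式时间步进"思路的极简示意(a minimal sketch under simplifying assumptions): 小卷积核充当有限差分模板,门控掩码调制类导数特征,再以可学习步长做显式更新;层宽与核尺寸等均为示意,并非论文配置。

```python
# A minimal sketch of a local-stencil block with an explicit, learnable time step.
import torch
import torch.nn as nn

class LocalStencilBlock(nn.Module):
    def __init__(self, channels=16, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.stencil = nn.Conv1d(channels, channels, kernel_size, padding=pad)  # learnable stencil
        self.gate = nn.Conv1d(channels, channels, 1)                            # gating mask
        self.fuse = nn.Conv1d(channels, channels, 1)                            # linear fuse
        self.dt = nn.Parameter(torch.tensor(0.1))                               # learnable step size

    def forward(self, u):
        feats = self.stencil(u) * torch.sigmoid(self.gate(u))                   # derivative-like local features
        return u + self.dt * self.fuse(feats)                                   # explicit (Euler-like) update

# Rollout: apply the block repeatedly to advance the state in time.
block = LocalStencilBlock()
u = torch.randn(4, 16, 128)        # (batch, channels, grid points)
for _ in range(10):
    u = block(u)
```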

[LG-40] Benchmarking Diarization Models

链接: https://arxiv.org/abs/2509.26177
作者: Luca A. Lanzendörfer,Florian Grötschla,Cesare Blaser,Roger Wattenhofer
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Speaker diarization is the task of partitioning audio into segments according to speaker identity, answering the question of “who spoke when” in multi-speaker conversation recordings. While diarization is an essential task for many downstream applications, it remains an unsolved problem. Errors in diarization propagate to downstream systems and cause wide-ranging failures. To this end, we examine exact failure modes by evaluating five state-of-the-art diarization models, across four diarization datasets spanning multiple languages and acoustic conditions. The evaluation datasets consist of 196.6 hours of multilingual audio, including English, Mandarin, German, Japanese, and Spanish. Overall, we find that PyannoteAI achieves the best performance at 11.2% DER, while DiariZen provides a competitive open-source alternative at 13.3% DER. When analyzing failure cases, we find that the primary cause of diarization errors stems from missed speech segments, followed by speaker confusion, especially in high-speaker count settings.
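
下面给出上述评测所用指标 DER(diarization error rate)的一个简化帧级示意(a simplified, frame-level sketch): 假设每帧只有单一说话人、不处理重叠与 collar,并用匈牙利匹配求最优的参考-假设说话人映射;正式评测应使用 pyannote.metrics 等工具。

```python
# A minimal frame-level DER sketch: DER = (miss + false alarm + confusion) / speech frames.
import numpy as np
from scipy.optimize import linear_sum_assignment

def frame_der(ref, hyp, silence=-1):
    """ref, hyp: integer speaker labels per frame, `silence` marking non-speech."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    speech = ref != silence
    miss = np.sum(speech & (hyp == silence))          # speech frames the system missed
    fa = np.sum(~speech & (hyp != silence))           # non-speech frames labeled as speech
    both = speech & (hyp != silence)
    r_ids, h_ids = np.unique(ref[both]), np.unique(hyp[both])
    overlap = np.array([[np.sum((ref[both] == r) & (hyp[both] == h)) for h in h_ids]
                        for r in r_ids])
    rows, cols = linear_sum_assignment(-overlap)      # best speaker mapping (maximize matches)
    confusion = both.sum() - overlap[rows, cols].sum()
    return (miss + fa + confusion) / max(speech.sum(), 1)

# Toy example: the hypothesis swaps speaker labels but is otherwise mostly consistent.
print(frame_der([0, 0, 0, 1, 1, 1, -1, -1, 0, 1],
                [1, 1, 1, 0, 0, 0, -1, 0, 1, 0]))     # -> 0.125
```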

[LG-41] Alignment-Aware Decoding

链接: https://arxiv.org/abs/2509.26169
作者: Frédéric Berdoz,Luca A. Lanzendörfer,René Caky,Roger Wattenhofer
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Alignment of large language models remains a central challenge in natural language processing. Preference optimization has emerged as a popular and effective method for improving alignment, typically through training-time or prompt-based interventions. In this paper, we introduce alignment-aware decoding (AAD), a method to enhance model alignment directly at inference. Theoretically, AAD can be interpreted as implicit reward optimization, yet it requires no specialized training beyond the standard DPO setup. Empirically, AAD consistently outperforms strong baselines across diverse alignment benchmarks and model scales. Moreover, in data-constrained settings, AAD can produce high-quality synthetic data to improve alignment under standard decoding, providing a practical solution when labeled data is limited.

[LG-42] Accelerating Transformers in Online RL

链接: https://arxiv.org/abs/2509.26137
作者: Daniil Zelezetsky,Alexey K. Kovalev,Aleksandr I. Panov
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-43] Domain-Aware Hyperdimensional Computing for Edge Smart Manufacturing

链接: https://arxiv.org/abs/2509.26131
作者: Fardin Jalil Piran,Anandkumar Patel,Rajiv Malhotra,Farhad Imani
类目: Machine Learning (cs.LG)
*备注: 23 pages, 14 figures

点击查看摘要

Abstract:Smart manufacturing requires on-device intelligence that meets strict latency and energy budgets. HyperDimensional Computing (HDC) offers a lightweight alternative by encoding data as high-dimensional hypervectors and computing with simple operations. Prior studies often assume that the qualitative relation between HDC hyperparameters and performance is stable across applications. Our analysis of two representative tasks, signal-based quality monitoring in Computer Numerical Control (CNC) machining and image-based defect detection in Laser Powder Bed Fusion (LPBF), shows that this assumption does not hold. We map how encoder type, projection variance, hypervector dimensionality, and data regime shape accuracy, inference latency, training time, and training energy. A formal complexity model explains predictable trends in encoding and similarity computation and reveals nonmonotonic interactions with retraining that preclude a closed-form optimum. Empirically, signals favor nonlinear Random Fourier Features with more exclusive encodings and saturate in accuracy beyond moderate dimensionality. Images favor linear Random Projection, achieve high accuracy with small dimensionality, and depend more on sample count than on dimensionality. Guided by these insights, we tune HDC under multiobjective constraints that reflect edge deployment and obtain models that match or exceed the accuracy of state-of-the-art deep learning and Transformer models while delivering at least 6x faster inference and more than 40x lower training energy. These results demonstrate that domain-aware HDC encoding is necessary and that tuned HDC offers a practical, scalable path to real-time industrial AI on constrained hardware. Future work will enable adaptive encoder and hyperparameter selection, expand evaluation to additional manufacturing modalities, and validate on low-power accelerators.
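
下面用极简代码对比摘要中提到的两类超向量编码器(a minimal sketch, assumptions noted): 非线性 Random Fourier Feature(RFF)编码器(更适合一维信号)与线性双极随机投影(RP)编码器(更适合图像),类原型为训练样本超向量的叠加,查询用余弦相似度比对;维度 D 即需调优的超参数。

```python
# A minimal sketch of RFF vs. random-projection hypervector encoders for HDC.
import numpy as np

class RFFEncoder:
    """Nonlinear Random Fourier Feature encoder: x -> cos(Wx + b)."""
    def __init__(self, in_dim, D=4096, gamma=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, gamma, size=(D, in_dim))
        self.b = rng.uniform(0.0, 2 * np.pi, size=D)
    def __call__(self, x):
        return np.cos(self.W @ x + self.b)

class RPEncoder:
    """Linear bipolar random-projection encoder: x -> Px."""
    def __init__(self, in_dim, D=1024, seed=0):
        rng = np.random.default_rng(seed)
        self.P = rng.choice([-1.0, 1.0], size=(D, in_dim))
    def __call__(self, x):
        return self.P @ x

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Toy usage: bundle (sum) training hypervectors into a class prototype, classify by cosine.
rng = np.random.default_rng(1)
enc = RFFEncoder(in_dim=64, D=4096)            # swap in RPEncoder(64, D=1024) for image-like data
x_train = rng.normal(size=(10, 64))
prototype = sum(enc(x) for x in x_train)       # would be per-class in practice
query = x_train[0] + 0.01 * rng.normal(size=64)
print(cosine(enc(query), prototype))
```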

[LG-44] UncertainGen: Uncertainty-Aware Representations of DNA Sequences for Metagenomic Binning

链接: https://arxiv.org/abs/2509.26116
作者: Abdulkadir Celikkanat,Andres R. Masegosa,Mads Albertsen,Thomas D. Nielsen
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:Metagenomic binning aims to cluster DNA fragments from mixed microbial samples into their respective genomes, a critical step for downstream analyses of microbial communities. Existing methods rely on deterministic representations, such as k-mer profiles or embeddings from large language models, which fail to capture the uncertainty inherent in DNA sequences arising from inter-species DNA sharing and from fragments with highly similar representations. We present the first probabilistic embedding approach, UncertainGen, for metagenomic binning, representing each DNA fragment as a probability distribution in latent space. Our approach naturally models sequence-level uncertainty, and we provide theoretical guarantees on embedding distinguishability. This probabilistic embedding framework expands the feasible latent space by introducing a data-adaptive metric, which in turn enables more flexible separation of bins/clusters. Experiments on real metagenomic datasets demonstrate the improvements over deterministic k-mer and LLM-based embeddings for the binning task by offering a scalable and lightweight solution for large-scale metagenomic analysis.

[LG-45] Clip-Low Increases Entropy and Clip-High Decreases Entropy in Reinforcement Learning of Large Language Models

链接: https://arxiv.org/abs/2509.26114
作者: Jaesung R. Park,Junsu Kim,Gyeongman Kim,Jinyoung Jo,Sean Choi,Jaewoong Cho,Ernest K. Ryu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) has recently emerged as the leading approach for enhancing the reasoning capabilities of large language models (LLMs). However, RLVR is prone to entropy collapse, where the LLM quickly converges to a near-deterministic form, hindering exploration and progress during prolonged RL training. In this work, we reveal that the clipping mechanism in PPO and GRPO induces biases on entropy. Through theoretical and empirical analyses, we show that clip-low increases entropy, while clip-high decreases it. Further, under standard clipping parameters, the effect of clip-high dominates, resulting in an overall entropy reduction even when purely random rewards are provided to the RL algorithm. Our findings highlight an overlooked confounding factor in RLVR: independent of the reward signal, the clipping mechanism influences entropy, which in turn affects the reasoning behavior. Furthermore, our analysis demonstrates that clipping can be deliberately used to control entropy. Specifically, with a more aggressive clip-low value, one can increase entropy, promote exploration, and ultimately prevent entropy collapse in RLVR training.
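
下面是非对称裁剪的极简示意(a minimal sketch): 在 PPO/GRPO 代理目标中把下/上裁剪阈值解耦为 clip_low 与 clip_high。按论文分析,加大 clip_low 会提升熵、促进探索,而 clip_high 会降低熵;此处数值仅为示意。

```python
# A minimal sketch of a PPO-style surrogate with decoupled clip-low / clip-high.
import torch

def clipped_surrogate(logp_new, logp_old, advantages, clip_low=0.4, clip_high=0.2):
    """Asymmetrically clipped policy objective (to be maximized)."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high)
    # Standard pessimistic minimum over the unclipped and clipped objectives.
    return torch.min(ratio * advantages, clipped * advantages).mean()
```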

[LG-46] Efficient Distributed Training via Dual Batch Sizes and Cyclic Progressive Learning

链接: https://arxiv.org/abs/2509.26092
作者: Kuan-Wei Lu,Ding-Yong Hong,Pangfeng Liu,Jan-Jan Wu
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Distributed machine learning is critical for training deep learning models on large datasets and with numerous parameters. Current research primarily focuses on leveraging additional hardware resources and powerful computing units to accelerate the training process. As a result, larger batch sizes are often employed to speed up training. However, training with large batch sizes can lead to lower accuracy due to poor generalization. To address this issue, we propose the dual batch size learning scheme, a distributed training method built on the parameter server framework. This approach maximizes training efficiency by utilizing the largest batch size that the hardware can support while incorporating a smaller batch size to enhance model generalization. By using two different batch sizes simultaneously, this method reduces testing loss and enhances generalization, with minimal extra training time. Additionally, to mitigate the time overhead caused by dual batch size learning, we propose the cyclic progressive learning scheme. This technique gradually adjusts image resolution from low to high during training, significantly boosting training speed. By combining cyclic progressive learning with dual batch size learning, our hybrid approach improves both model generalization and training efficiency. Experimental results using ResNet-18 show that, compared to conventional training methods, our method can improve accuracy by 3.3% while reducing training time by 10.6% on CIFAR-100, and improve accuracy by 0.1% while reducing training time by 35.7% on ImageNet.
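
下面给出双批量学习与循环渐进分辨率的极简示意(a minimal sketch, with assumptions): 摘要未给出两种批量的具体组合方式,这里假设每步同时对一个大批量和一个小批量求损失并做简单凸组合;分辨率按轮次循环由低到高,数据为占位随机张量。

```python
# A minimal sketch of dual-batch-size training plus a cyclic progressive resolution schedule.
import torch
import torch.nn.functional as F
import torchvision

model = torchvision.models.resnet18(num_classes=100)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
resolutions = [96, 128, 160, 224]          # cyclic progressive schedule (illustrative values)

def train_step(x_large, y_large, x_small, y_small, res, alpha=0.5):
    x_large = F.interpolate(x_large, size=res)   # train at the current resolution
    x_small = F.interpolate(x_small, size=res)
    loss = (alpha * F.cross_entropy(model(x_large), y_large)          # large batch: throughput
            + (1 - alpha) * F.cross_entropy(model(x_small), y_small)) # small batch: generalization
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

for epoch in range(2):
    res = resolutions[epoch % len(resolutions)]
    x_l, y_l = torch.randn(128, 3, 224, 224), torch.randint(0, 100, (128,))  # placeholder large batch
    x_s, y_s = torch.randn(16, 3, 224, 224), torch.randint(0, 100, (16,))    # placeholder small batch
    train_step(x_l, y_l, x_s, y_s, res)
```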

[LG-47] Stealthy Yet Effective: Distribution-Preserving Backdoor Attacks on Graph Classification NEURIPS2025

链接: https://arxiv.org/abs/2509.26032
作者: Xiaobao Wang,Ruoxiao Sun,Yujun Zhang,Bingdao Feng,Dongxiao He,Luzhi Wang,Di Jin
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have demonstrated strong performance across tasks such as node classification, link prediction, and graph classification, but remain vulnerable to backdoor attacks that implant imperceptible triggers during training to control predictions. While node-level attacks exploit local message passing, graph-level attacks face the harder challenge of manipulating global representations while maintaining stealth. We identify two main sources of anomaly in existing graph classification backdoor methods: structural deviation from rare subgraph triggers and semantic deviation caused by label flipping, both of which make poisoned graphs easily detectable by anomaly detection models. To address this, we propose DPSBA, a clean-label backdoor framework that learns in-distribution triggers via adversarial training guided by anomaly-aware discriminators. DPSBA effectively suppresses both structural and semantic anomalies, achieving high attack success while significantly improving stealth. Extensive experiments on real-world datasets validate that DPSBA achieves a superior balance between effectiveness and detectability compared to state-of-the-art baselines.

[LG-48] Scaling Equilibrium Propagation to Deeper Neural Network Architectures

链接: https://arxiv.org/abs/2509.26003
作者: Sankar Vinayak. E. P,Gopalakrishnan Srinivasan
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-49] Informed Asymmetric Actor-Critic: Leveraging Privileged Signals Beyond Full-State Access

链接: https://arxiv.org/abs/2509.26000
作者: Daniel Ebi,Gaspard Lambrechts,Damien Ernst,Klemens Böhm
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 15 pages, 21 pages total

点击查看摘要

Abstract:Reinforcement learning in partially observable environments requires agents to act under uncertainty from noisy, incomplete observations. Asymmetric actor-critic methods leverage privileged information during training to improve learning under these conditions. However, existing approaches typically assume full-state access during training. In this work, we challenge this assumption by proposing a novel actor-critic framework, called informed asymmetric actor-critic, that enables conditioning the critic on arbitrary privileged signals without requiring access to the full state. We show that policy gradients remain unbiased under this formulation, extending the theoretical foundation of asymmetric methods to the more general case of privileged partial information. To quantify the impact of such signals, we propose informativeness measures based on kernel methods and return prediction error, providing practical tools for evaluating training-time signals. We validate our approach empirically on benchmark navigation tasks and synthetic partially observable environments, showing that our informed asymmetric method improves learning efficiency and value estimation when informative privileged inputs are available. Our findings challenge the necessity of full-state access and open new directions for designing asymmetric reinforcement learning methods that are both practical and theoretically sound.
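
下面是"informed asymmetric actor-critic"结构的极简示意(a minimal sketch): actor 只看部分观测,critic 在训练时额外接收任意特权信号(不必是全状态);部署时只需 actor。网络结构与维度均为示意。

```python
# A minimal sketch of an asymmetric actor-critic where the critic is conditioned on an
# arbitrary privileged signal available only at training time.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, act_dim))
    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

class InformedCritic(nn.Module):
    """Value function conditioned on the observation plus a privileged training-time signal."""
    def __init__(self, obs_dim, priv_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + priv_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
    def forward(self, obs, privileged):
        return self.net(torch.cat([obs, privileged], dim=-1)).squeeze(-1)

# During training, the critic's value (using the privileged signal) serves as the baseline
# for the policy gradient; at deployment only the actor, which needs no privileged input, runs.
```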

[LG-50] Exact Solutions to the Quantum Schrödinger Bridge Problem

链接: https://arxiv.org/abs/2509.25980
作者: Mykola Bordyuh,Djork-Arné Clevert,Marco Bertolini
类目: Machine Learning (cs.LG); Mathematical Physics (math-ph); Probability (math.PR); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:The Quantum Schrödinger Bridge Problem (QSBP) describes the evolution of a stochastic process between two arbitrary probability distributions, where the dynamics are governed by the Schrödinger equation rather than by the traditional real-valued wave equation. Although the QSBP is known in the mathematical literature, we formulate it here from a Lagrangian perspective and derive its main features in a way that is particularly suited to generative modeling. We show that the resulting evolution equations involve the so-called Bohm (quantum) potential, representing a notion of non-locality in the stochastic process. This distinguishes the QSBP from classical stochastic dynamics and reflects a key characteristic typical of quantum mechanical systems. In this work, we derive exact closed-form solutions for the QSBP between Gaussian distributions. Our derivation is based on solving the Fokker-Planck Equation (FPE) and the Hamilton-Jacobi Equation (HJE) arising from the Lagrangian formulation of dynamical Optimal Transport. We find that, similar to the classical Schrödinger Bridge Problem, the solution to the QSBP between Gaussians is again a Gaussian process; however, the evolution of the covariance differs due to quantum effects. Leveraging these explicit solutions, we present a modified algorithm based on a Gaussian Mixture Model framework, and demonstrate its effectiveness across several experimental settings, including single-cell evolution data, image generation, molecular translation and applications in Mean-Field Games.

[LG-51] Reevaluating Convolutional Neural Networks for Spectral Analysis: A Focus on Raman Spectroscopy

链接: https://arxiv.org/abs/2509.25964
作者: Deniz Soysal,Xabier García-Andrade,Laura E. Rodriguez,Pablo Sobron,Laura M. Barge,Renaud Detry
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Autonomous Raman instruments on Mars rovers, deep-sea landers, and field robots must interpret raw spectra distorted by fluorescence baselines, peak shifts, and limited ground-truth labels. Using curated subsets of the RRUFF database, we evaluate one-dimensional convolutional neural networks (CNNs) and report four advances: (i) Baseline-independent classification: compact CNNs surpass k-nearest-neighbors and support-vector machines on handcrafted features, removing background-correction and peak-picking stages while ensuring reproducibility through released data splits and scripts. (ii) Pooling-controlled robustness: tuning a single pooling parameter accommodates Raman shifts up to 30 cm^{-1}, balancing translational invariance with spectral resolution. (iii) Label-efficient learning: semi-supervised generative adversarial networks and contrastive pretraining raise accuracy by up to 11% with only 10% labels, valuable for autonomous deployments with scarce annotation. (iv) Constant-time adaptation: freezing the CNN backbone and retraining only the softmax layer transfers models to unseen minerals at \mathcal{O}(1) cost, outperforming Siamese networks on resource-limited processors. This workflow, which involves training on raw spectra, tuning pooling, adding semi-supervision when labels are scarce, and fine-tuning lightly for new targets, provides a practical path toward robust, low-footprint Raman classification in autonomous exploration.
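
下面是该工作流中两个要点的极简示意(a minimal sketch, sizes are illustrative): 直接作用于原始光谱的一维 CNN,池化宽度作为平移不变性与谱分辨率之间的唯一权衡旋钮;迁移到新矿物时冻结骨干网络,只重训 softmax 层,实现常数时间适配。

```python
# A minimal sketch of a 1-D CNN on raw Raman spectra with a tunable pooling width and
# last-layer-only transfer to a new label set.
import torch
import torch.nn as nn

class RamanCNN(nn.Module):
    def __init__(self, n_classes, pool_size=8):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(),
            nn.MaxPool1d(pool_size),                 # pooling width: shift-invariance vs. resolution
            nn.Conv1d(16, 32, kernel_size=9, padding=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())
        self.head = nn.Linear(32, n_classes)

    def forward(self, x):                            # x: (batch, 1, wavenumbers)
        return self.head(self.backbone(x))

# Constant-time adaptation: freeze the backbone, retrain only the softmax head.
model = RamanCNN(n_classes=30, pool_size=8)
for p in model.backbone.parameters():
    p.requires_grad = False
model.head = nn.Linear(32, 12)                       # new, unseen mineral label set
opt = torch.optim.Adam(model.head.parameters(), lr=1e-3)
```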

[LG-52] Better Privilege Separation for Agents by Restricting Data Types

链接: https://arxiv.org/abs/2509.25926
作者: Dennis Jacob,Emad Alghamdi,Zhanhao Hu,Basel Alomair,David Wagner
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-53] ReNF: Rethinking the Design Space of Neural Long-Term Time Series Forecasters

链接: https://arxiv.org/abs/2509.25914
作者: Yihang Lu,Xianwei Meng,Enhong Chen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural Forecasters (NFs) are a cornerstone of Long-term Time Series Forecasting (LTSF). However, progress has been hampered by an overemphasis on architectural complexity at the expense of fundamental forecasting principles. In this work, we return to first principles to redesign the LTSF paradigm. We begin by introducing a Multiple Neural Forecasting Theorem that provides a theoretical basis for our approach. We propose Boosted Direct Output (BDO), a novel forecasting strategy that synergistically combines the advantages of both Auto-Regressive (AR) and Direct Output (DO). In addition, we stabilize the learning process by smoothly tracking the model’s parameters. Extensive experiments show that these principled improvements enable a simple MLP to achieve state-of-the-art performance, outperforming recent, complex models in nearly all cases, without any specific considerations in the area. Finally, we empirically verify our theorem, establishing a dynamic performance bound and identifying promising directions for future research. The code for review is available at: .

[LG-54] Federated Learning with Enhanced Privacy via Model Splitting and Random Client Participation

链接: https://arxiv.org/abs/2509.25906
作者: Yiwei Li,Shuai Wang,Zhuojun Tian,Xiuhua Wang,Shijian Su
类目: Machine Learning (cs.LG)
*备注: 29 pages, 6 figures, submitted for peer review

点击查看摘要

Abstract:Federated Learning (FL) often adopts differential privacy (DP) to protect client data, but the added noise required for privacy guarantees can substantially degrade model accuracy. To resolve this challenge, we propose model-splitting privacy-amplified federated learning (MS-PAFL), a novel framework that combines structural model splitting with statistical privacy amplification. In this framework, each client’s model is partitioned into a private submodel, retained locally, and a public submodel, shared for global aggregation. The calibrated Gaussian noise is injected only into the public submodel, thereby confining its adverse impact while preserving the utility of the local model. We further present a rigorous theoretical analysis that characterizes the joint privacy amplification achieved through random client participation and local data subsampling under this architecture. The analysis provides tight bounds on both single-round and total privacy loss, demonstrating that MS-PAFL significantly reduces the noise necessary to satisfy a target privacy protection level. Extensive experiments validate our theoretical findings, showing that MS-PAFL consistently attains a superior privacy-utility trade-off and enables the training of highly accurate models under strong privacy guarantees.
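
下面是模型切分思想的极简示意(a minimal sketch, partitioning and noise scale are illustrative): 每个客户端把模型分为本地保留的私有子模型与参与全局聚合的公开子模型,校准的高斯(DP)噪声只注入公开部分,再由服务器平均。

```python
# A minimal sketch of model splitting with DP noise applied only to the shared (public) part.
import torch
import torch.nn as nn

class ClientModel(nn.Module):
    def __init__(self, in_dim=32, hidden=64, out_dim=10):
        super().__init__()
        self.private = nn.Linear(in_dim, hidden)     # retained locally, never shared
        self.public = nn.Linear(hidden, out_dim)     # shared for global aggregation
    def forward(self, x):
        return self.public(torch.relu(self.private(x)))

def share_public_update(model, noise_std=0.01, clip_norm=1.0):
    """Clip and noise only the public submodel's parameters before sending to the server."""
    update = []
    for p in model.public.parameters():
        v = p.detach().clone()
        v = v * min(1.0, clip_norm / (float(v.norm()) + 1e-12))   # simplified per-tensor clipping
        update.append(v + noise_std * torch.randn_like(v))        # Gaussian noise on public part only
    return update

def aggregate(updates):
    """Server averages the noised public updates from the sampled clients."""
    return [torch.stack(tensors).mean(dim=0) for tensors in zip(*updates)]
```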

[LG-55] RL-Guided Data Selection for Language Model Finetuning NEURIPS2025

链接: https://arxiv.org/abs/2509.25850
作者: Animesh Jha,Harshit Gupta,Ananjan Nandi
类目: Machine Learning (cs.LG)
*备注: To appear in NeurIPS 2025 Constrained Optimization for ML Workshop

点击查看摘要

[LG-56] MIDAS: Misalignment-based Data Augmentation Strategy for Imbalanced Multimodal Learning NEURIPS2025

链接: https://arxiv.org/abs/2509.25831
作者: Seong-Hyeon Hwang,Soyoung Choi,Steven Euijong Whang
类目: Machine Learning (cs.LG)
*备注: Accepted to NeurIPS 2025

点击查看摘要

[LG-57] Kairos: Towards Adaptive and Generalizable Time Series Foundation Models

链接: https://arxiv.org/abs/2509.25826
作者: Kun Feng,Shaocheng Lan,Yuchen Fang,Wenchao He,Lintao Ma,Xingyu Lu,Kan Ren
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-58] Decentralized Asynchronous Multi-player Bandits

链接: https://arxiv.org/abs/2509.25824
作者: Jingqi Fan,Canzhe Zhao,Shuai Li,Siwei Wang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In recent years, multi-player multi-armed bandits (MP-MAB) have been extensively studied due to their wide applications in cognitive radio networks and Internet of Things systems. While most existing research on MP-MAB focuses on synchronized settings, real-world systems are often decentralized and asynchronous, where players may enter or leave the system at arbitrary times, and do not have a global clock. This decentralized asynchronous setting introduces two major challenges. First, without a global time, players cannot implicitly coordinate their actions through time, making it difficult to avoid collisions. Second, it is important to detect how many players are in the system, but doing so may cost a lot. In this paper, we address the challenges posed by such a fully asynchronous setting in a decentralized environment. We develop a novel algorithm in which players adaptively change between exploration and exploitation. During exploration, players uniformly pull their arms, reducing the probability of collisions and effectively mitigating the first challenge. Meanwhile, players continue pulling arms currently exploited by others with a small probability, enabling them to detect when a player has left, thereby addressing the second challenge. We prove that our algorithm achieves a regret of \mathcal{O}(\sqrt{T \log T} + \log T/\Delta^2), where \Delta is the minimum expected reward gap between any two arms. To the best of our knowledge, this is the first efficient MP-MAB algorithm in the asynchronous and decentralized environment. Extensive experiments further validate the effectiveness and robustness of our algorithm, demonstrating its applicability to real-world scenarios.

[LG-59] Improving Sampling Efficiency in RLVR through Adaptive Rollout and Response Reuse

链接: https://arxiv.org/abs/2509.25808
作者: Yuheng Zhang,Wenlin Yao,Changlong Yu,Yao Liu,Qingyu Yin,Bing Yin,Hyokun Yun,Lihong Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-60] Characterization and Learning of Causal Graphs with Latent Confounders and Post-treatment Selection from Interventional Data

链接: https://arxiv.org/abs/2509.25800
作者: Gongxu Luo,Loka Li,Guangyi Chen,Haoyue Dai,Kun Zhang
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

[LG-61] From Cheap Geometry to Expensive Physics: Elevating Neural Operators via Latent Shape Pretraining

链接: https://arxiv.org/abs/2509.25788
作者: Zhizhou Zhang,Youjia Wu,Kaixuan Zhang,Yanjia Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Industrial design evaluation often relies on high-fidelity simulations of governing partial differential equations (PDEs). While accurate, these simulations are computationally expensive, making dense exploration of design spaces impractical. Operator learning has emerged as a promising approach to accelerate PDE solution prediction; however, its effectiveness is often limited by the scarcity of labeled physics-based data. At the same time, large numbers of geometry-only candidate designs are readily available but remain largely untapped. We propose a two-stage framework to better exploit this abundant, physics-agnostic resource and improve supervised operator learning under limited labeled data. In Stage 1, we pretrain an autoencoder on a geometry reconstruction task to learn an expressive latent representation without PDE labels. In Stage 2, the neural operator is trained in a standard supervised manner to predict PDE solutions, using the pretrained latent embeddings as inputs instead of raw point clouds. Transformer-based architectures are adopted for both the autoencoder and the neural operator to handle point cloud data and integrate both stages seamlessly. Across four PDE datasets and three state-of-the-art transformer-based neural operators, our approach consistently improves prediction accuracy compared to models trained directly on raw point cloud inputs. These results demonstrate that representations from physics-agnostic pretraining provide a powerful foundation for data-efficient operator learning.
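
下面是这套两阶段方案的极简示意(a minimal sketch): 第一阶段用大量无标注几何(点云)做重建式自编码器预训练,第二阶段冻结编码器,在少量有标注的仿真数据上以潜向量为输入做监督算子学习。此处用 MLP 代替论文中的 Transformer 模块,数据为占位张量。

```python
# A minimal sketch of geometry-only pretraining (stage 1) followed by supervised operator
# learning on frozen latent embeddings (stage 2).
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Sequential(nn.Linear(3 * 1024, 512), nn.ReLU(), nn.Linear(512, 128))   # point cloud -> latent
dec = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 3 * 1024))   # latent -> point cloud
operator = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 1024))  # latent -> PDE solution

# Stage 1: reconstruction pretraining on abundant unlabeled designs.
opt1 = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
for _ in range(100):
    pc = torch.randn(32, 3 * 1024)                        # flattened point clouds (placeholder)
    loss = F.mse_loss(dec(enc(pc)), pc)
    opt1.zero_grad(); loss.backward(); opt1.step()

# Stage 2: supervised operator learning on the few labeled simulations, encoder frozen.
for p in enc.parameters():
    p.requires_grad = False
opt2 = torch.optim.Adam(operator.parameters(), lr=1e-3)
for _ in range(100):
    pc, sol = torch.randn(8, 3 * 1024), torch.randn(8, 1024)   # labeled pairs (placeholder)
    loss = F.mse_loss(operator(enc(pc)), sol)
    opt2.zero_grad(); loss.backward(); opt2.step()
```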

[LG-62] A Hamiltonian driven Geometric Construction of Neural Networks on the Lognormal Statistical Manifold

链接: https://arxiv.org/abs/2509.25778
作者: Prosper Rosaire Mama Assandje,Teumsa Aboubakar,Dongho Joseph,Takemi Nakamura
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-63] Online Decision Making with Generative Action Sets

链接: https://arxiv.org/abs/2509.25777
作者: Jianyu Xu,Vidhi Jain,Bryan Wilder,Aarti Singh
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 34 pages, 2 figures (including 5 subfigures)

点击查看摘要

[LG-64] OPPO: Accelerating PPO-based RLHF via Pipeline Overlap

链接: https://arxiv.org/abs/2509.25762
作者: Kaizhuo Yan(1),Yingjie Yu(1),Yifan Yu(1),Haizhong Zheng(2),Fan Lai(1) ((1) University of Illinois Urbana-Champaign, (2) Carnegie Mellon University)
类目: Machine Learning (cs.LG)
*备注: Kaizhuo Yan and Yingjie Yu contributed equally to this work

点击查看摘要

Abstract:Proximal Policy Optimization (PPO)-based reinforcement learning from human feedback (RLHF) is a widely adopted paradigm for aligning large language models (LLMs) with human preferences. However, its training pipeline suffers from substantial inefficiencies due to sequential multi-model dependencies (e.g., reward model depends on actor outputs) and long-tail response lengths, where a few long responses straggle the stage completion. We present OPPO, a novel, lightweight, and model-agnostic PPO-based RLHF framework that improves training efficiency by overlapping pipeline execution. OPPO introduces two novel techniques: (1) Intra-step overlap, which streams upstream model outputs (e.g., actor model) in right-sized chunks, enabling the downstream model (e.g., reward) to begin prefill while the upstream continues decoding; and (2) Inter-step overlap, which adaptively overcommits a few prompts and defers long generations to future steps, mitigating tail latency without discarding partial work. OPPO integrates easily with existing PPO implementations with a few lines of code change. Extensive evaluations show that OPPO accelerates PPO-based RLHF training by 1.8x-2.8x and improves GPU utilization by 1.4x-2.1x without compromising training convergence.

[LG-65] SAC Flow: Sample-Efficient Reinforcement Learning of Flow-Based Policies via Velocity-Reparameterized Sequential Modeling

链接: https://arxiv.org/abs/2509.25756
作者: Yixian Zhang,Shu’ang Yu,Tonghe Zhang,Mo Guang,Haojia Hui,Kaiwen Long,Yu Wang,Chao Yu,Wenbo Ding
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-66] Less is More: Towards Simple Graph Contrastive Learning ICLR2026

链接: https://arxiv.org/abs/2509.25742
作者: Yanan Zhao,Feng Ji,Jingyang Dai,Jiaze Ma,Wee Peng Tay
类目: Machine Learning (cs.LG)
*备注: Submitted to ICLR 2026

点击查看摘要

[LG-67] A Physics-Guided Probabilistic Surrogate Modeling Framework for Digital Twins of Underwater Radiated Noise

链接: https://arxiv.org/abs/2509.25730
作者: Indu Kant Deo,Akash Venkateshwaran,Rajeev K. Jaiman
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 26 pages, 13 figures

点击查看摘要

[LG-68] Beyond Point Estimates: Likelihood-Based Full-Posterior Wireless Localization

链接: https://arxiv.org/abs/2509.25719
作者: Haozhe Lei,Hao Guo,Tommy Svensson,Sundeep Rangan
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Systems and Control (eess.SY)
*备注:

点击查看摘要

[LG-69] Expert Merging: Model Merging with Unsupervised Expert Alignment and Importance-Guided Layer Chunking

链接: https://arxiv.org/abs/2509.25712
作者: Dengming Zhang,Xiaowen Ma,Zhenliang Ni,Zhenkai Wu,Han Shu,Xin Jiang,Xinghao Chen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-70] Adaptive Graph Coarsening for Efficient GNN Training

链接: https://arxiv.org/abs/2509.25706
作者: Rostyslav Olshevskyi,Madeline Navarro,Santiago Segarra
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-71] Physics-Informed Learning for Human Whole-Body Kinematics Prediction via Sparse IMUs

链接: https://arxiv.org/abs/2509.25704
作者: Cheng Guo,Giuseppe L’Erario,Giulio Romualdi,Mattia Leonori,Marta Lorenzini,Arash Ajoudani,Daniele Pucci
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate and physically feasible human motion prediction is crucial for safe and seamless human-robot collaboration. While recent advancements in human motion capture enable real-time pose estimation, the practical value of many existing approaches is limited by the lack of future predictions and consideration of physical constraints. Conventional motion prediction schemes rely heavily on past poses, which are not always available in real-world scenarios. To address these limitations, we present a physics-informed learning framework that integrates domain knowledge into both training and inference to predict human motion using inertial measurements from only 5 IMUs. We propose a network that accounts for the spatial characteristics of human movements. During training, we incorporate forward and differential kinematics functions as additional loss components to regularize the learned joint predictions. At the inference stage, we refine the prediction from the previous iteration to update a joint state buffer, which is used as extra inputs to the network. Experimental results demonstrate that our approach achieves high accuracy, smooth transitions between motions, and generalizes well to unseen subjects.
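
下面是"运动学损失作为正则项"这一思想的极简示意(a minimal sketch): 网络由稀疏 IMU 特征预测关节角,并用可微的正向运动学一致性项约束预测。此处用一个玩具二连杆平面 FK 代替论文的全身运动学,IMU 数据与真值均为占位。

```python
# A minimal sketch of a physics-informed loss: joint-angle regression plus a differentiable
# forward-kinematics consistency term (toy 2-link planar chain as a stand-in).
import torch
import torch.nn as nn
import torch.nn.functional as F

def fk_2link(q, l1=0.3, l2=0.25):
    """Toy planar forward kinematics: joint angles (batch, 2) -> end-effector xy."""
    x = l1 * torch.cos(q[:, 0]) + l2 * torch.cos(q[:, 0] + q[:, 1])
    y = l1 * torch.sin(q[:, 0]) + l2 * torch.sin(q[:, 0] + q[:, 1])
    return torch.stack([x, y], dim=-1)

net = nn.Sequential(nn.Linear(5 * 6, 64), nn.ReLU(), nn.Linear(64, 2))  # 5 IMUs x 6 channels -> 2 joints
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
lam = 0.1                                                               # weight of the kinematics term
for _ in range(100):
    imu = torch.randn(32, 30)                                           # placeholder IMU window
    q_true = torch.rand(32, 2) * 3.14                                   # placeholder ground-truth joints
    q_pred = net(imu)
    loss = (F.mse_loss(q_pred, q_true)
            + lam * F.mse_loss(fk_2link(q_pred), fk_2link(q_true)))     # forward-kinematics consistency
    opt.zero_grad(); loss.backward(); opt.step()
```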

[LG-72] A Unified Probabilistic Framework for Dictionary Learning with Parsimonious Activation

链接: https://arxiv.org/abs/2509.25690
作者: Zihui Zhao,Yuanbo Tang,Jieyu Ren,Xiaoping Zhang,Yang Li
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:

点击查看摘要

[LG-73] Minimalist Explanation Generation and Circuit Discovery

链接: https://arxiv.org/abs/2509.25686
作者: Pirzada Suhail,Aditya Anand,Amit Sethi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning models, by virtue of training, learn a large repertoire of decision rules for any given input, and any one of these may suffice to justify a prediction. However, in high-dimensional input spaces, such rules are difficult to identify and interpret. In this paper, we introduce an activation-matching based approach to generate minimal and faithful explanations for the decisions of pre-trained image classifiers. We aim to identify minimal explanations that not only preserve the model’s decision but are also concise and human-readable. To achieve this, we train a lightweight autoencoder to produce binary masks that learns to highlight the decision-wise critical regions of an image while discarding irrelevant background. The training objective integrates activation alignment across multiple layers, consistency at the output label, priors that encourage sparsity, and compactness, along with a robustness constraint that enforces faithfulness. The minimal explanations so generated also lead us to mechanistically interpreting the model internals. In this regard we also introduce a circuit readout procedure wherein using the explanation’s forward pass and gradients, we identify active channels and construct a channel-level graph, scoring inter-layer edges by ingress weight magnitude times source activation and feature-to-class links by classifier weight magnitude times feature activation. Together, these contributions provide a practical bridge between minimal input-level explanations and a mechanistic understanding of the internal computations driving model decisions.

[LG-74] Guiding Mixture-of-Experts with Temporal Multimodal Interactions

链接: https://arxiv.org/abs/2509.25678
作者: Xing Han,Hsing-Huan Chung,Joydeep Ghosh,Paul Pu Liang,Suchi Saria
类目: Machine Learning (cs.LG)
*备注: 21 pages, 8 figures, 10 tables

点击查看摘要

Abstract:Mixture-of-Experts (MoE) architectures have become pivotal for large-scale multimodal models. However, their routing mechanisms typically overlook the informative, time-varying interaction dynamics between modalities. This limitation hinders expert specialization, as the model cannot explicitly leverage intrinsic modality relationships for effective reasoning. To address this, we propose a novel framework that guides MoE routing using quantified temporal interaction. A multimodal interaction-aware router learns to dispatch tokens to experts based on the nature of their interactions. This dynamic routing encourages experts to acquire generalizable interaction-processing skills rather than merely learning task-specific features. Our framework builds on a new formulation of temporal multimodal interaction dynamics, which are used to guide expert routing. We first demonstrate that these temporal multimodal interactions reveal meaningful patterns across applications, and then show how they can be leveraged to improve both the design and performance of MoE-based models. Comprehensive experiments on challenging multimodal benchmarks validate our approach, demonstrating both enhanced performance and improved interpretability.

[LG-75] Growing Winning Subnetworks Not Pruning Them: A Paradigm for Density Discovery in Sparse Neural Networks

链接: https://arxiv.org/abs/2509.25665
作者: Qihang Yao,Constantine Dovrolis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The lottery ticket hypothesis suggests that dense networks contain sparse subnetworks that can be trained in isolation to match full-model performance. Existing approaches-iterative pruning, dynamic sparse training, and pruning at initialization-either incur heavy retraining costs or assume the target density is fixed in advance. We introduce Path Weight Magnitude Product-biased Random growth (PWMPR), a constructive sparse-to-dense training paradigm that grows networks rather than pruning them, while automatically discovering their operating density. Starting from a sparse seed, PWMPR adds edges guided by path-kernel-inspired scores, mitigates bottlenecks via randomization, and stops when a logistic-fit rule detects plateauing accuracy. Experiments on CIFAR, TinyImageNet, and ImageNet show that PWMPR approaches the performance of IMP-derived lottery tickets-though at higher density-at substantially lower cost (~1.5x dense vs. 3-4x for IMP). These results establish growth-based density discovery as a promising paradigm that complements pruning and dynamic sparsity.

[LG-76] Deep set based operator learning with uncertainty quantification

链接: https://arxiv.org/abs/2509.25646
作者: Lei Ma,Ling Guo,Hao Wu,Tao Zhou
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

[LG-77] How Does Preconditioning Guide Feature Learning in Deep Neural Networks?

链接: https://arxiv.org/abs/2509.25637
作者: Kotaro Yoshida,Atsushi Nitanda
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-78] Swift: An Autoregressive Consistency Model for Efficient Weather Forecasting

链接: https://arxiv.org/abs/2509.25631
作者: Jason Stock,Troy Arcomano,Rao Kotamarthi
类目: Machine Learning (cs.LG)
*备注: 17 pages and 15 figures

点击查看摘要

[LG-79] Layer-wise dynamic rank for compressing large language models

链接: https://arxiv.org/abs/2509.25622
作者: Zhendong Mi,Bian Sun,Grace Li Zhang,Shaoyi Huang
类目: Machine Learning (cs.LG)
*备注: 10 pages, 5 figures

点击查看摘要

[LG-80] Effective Model Pruning

链接: https://arxiv.org/abs/2509.25606
作者: Yixuan Wang,Dan Guralnik,Saiedeh Akbari,Warren Dixon
类目: Machine Learning (cs.LG)
*备注: 17 pages, 4 figures

点击查看摘要

[LG-81] Binary Sparse Coding for Interpretability

链接: https://arxiv.org/abs/2509.25596
作者: Lucia Quirke,Stepan Shabalin,Nora Belrose
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-82] Machine Learning Algorithms for Improving Black Box Optimization Solvers

链接: https://arxiv.org/abs/2509.25592
作者: Morteza Kimiaei,Vyacheslav Kungurtsev
类目: Machine Learning (cs.LG)
*备注: 74 pages

点击查看摘要

[LG-83] Safe In-Context Reinforcement Learning

链接: https://arxiv.org/abs/2509.25582
作者: Amir Moeini,Minjae Kwon,Alper Kamil Bozkurt,Yuichi Motai,Rohan Chandra,Lu Feng,Shangtong Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-84] Lightweight and Robust Federated Data Valuation

链接: https://arxiv.org/abs/2509.25560
作者: Guojun Tang,Jiayu Zhou,Mohammad Mamun,Steve Drew
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-85] Enhancing Split Learning with Sharded and Blockchain-Enabled SplitFed Approaches

链接: https://arxiv.org/abs/2509.25555
作者: Amirreza Sokhankhosh,Khalid Hassan,Sara Rouhani
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: Accepted by the 2025 IEEE International Conference on Blockchain (Blockchain)

点击查看摘要

[LG-86] Meta-Router: Bridging Gold-standard and Preference-based Evaluations in Large Language Model Routing

链接: https://arxiv.org/abs/2509.25535
作者: Yichi Zhang,Fangzheng Xie,Shu Yang,Chong Wu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

[LG-87] Personalized Auto-Grading and Feedback System for Constructive Geometry Tasks Using Large Language Models on an Online Math Platform

链接: https://arxiv.org/abs/2509.25529
作者: Yong Oh Lee,Byeonghun Bang,Joohyun Lee,Sejun Oh
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-88] Defeating Cerberus: Concept-Guided Privacy-Leakage Mitigation in Multimodal Language Models

链接: https://arxiv.org/abs/2509.25525
作者: Boyang Zhang,Istemi Ekin Akkus,Ruichuan Chen,Alice Dethise,Klaus Satzke,Ivica Rimac,Yang Zhang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-89] Flow Matching with Semidiscrete Couplings

链接: https://arxiv.org/abs/2509.25519
作者: Alireza Mousavi-Hosseini,Stephen Y. Zhang,Michal Klein,Marco Cuturi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 35 pages, 16 figures

点击查看摘要

Abstract:Flow models parameterized as time-dependent velocity fields can generate data from noise by integrating an ODE. These models are often trained using flow matching, i.e. by sampling random pairs of noise and target points (\mathbf{x}_0,\mathbf{x}_1) and ensuring that the velocity field is aligned, on average, with \mathbf{x}_1-\mathbf{x}_0 when evaluated along a segment linking \mathbf{x}_0 to \mathbf{x}_1. While these pairs are sampled independently by default, they can also be selected more carefully by matching batches of n noise to n target points using an optimal transport (OT) solver. Although promising in theory, the OT flow matching (OT-FM) approach is not widely used in practice. Zhang et al. (2025) pointed out recently that OT-FM truly starts paying off when the batch size n grows significantly, which only a multi-GPU implementation of the Sinkhorn algorithm can handle. Unfortunately, the costs of running Sinkhorn can quickly balloon, requiring O(n^2/\varepsilon^2) operations for every n pairs used to fit the velocity field, where \varepsilon is a regularization parameter that should be typically small to yield better results. To fulfill the theoretical promises of OT-FM, we propose to move away from batch-OT and rely instead on a semidiscrete formulation that leverages the fact that the target dataset distribution is usually of finite size N. The SD-OT problem is solved by estimating a dual potential vector using SGD; using that vector, freshly sampled noise vectors at train time can then be matched with data points at the cost of a maximum inner product search (MIPS). Semidiscrete FM (SD-FM) removes the quadratic dependency on n/\varepsilon that bottlenecks OT-FM. SD-FM beats both FM and OT-FM on all training metrics and inference budget constraints, across multiple datasets, on unconditional/conditional generation, or when using mean-flow models.
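
下面是半离散匹配步骤的极简示意(a minimal sketch, step sizes and iteration counts are illustrative): 对 N 个数据点学习对偶势向量 g(对半离散 OT 对偶做随机上升),训练时把新采样的噪声 x 匹配到 argmin_j( ||x - y_j||^2 / 2 - g_j ),对平方欧氏代价该匹配可化为一次最大内积搜索。

```python
# A minimal sketch of semidiscrete OT dual-potential fitting and MIPS-style matching.
import torch

def fit_dual_potential(data, n_iters=2000, batch=256, lr=0.5):
    """data: (N, d) target points; returns dual potentials g of shape (N,)."""
    N, d = data.shape
    g = torch.zeros(N)
    for _ in range(n_iters):
        x = torch.randn(batch, d)                                    # fresh noise samples
        cost = 0.5 * torch.cdist(x, data) ** 2 - g                   # (batch, N)
        j = cost.argmin(dim=1)                                       # current assignments
        grad = torch.full((N,), 1.0 / N)                             # gradient of the (1/N) sum g_j term
        grad.scatter_add_(0, j, torch.full((batch,), -1.0 / batch))  # minus empirical assignment freq.
        g = g + lr * grad                                            # stochastic ascent on the dual
    return g

def match(x, data, g):
    """Match noise x to data; for squared-Euclidean cost this is a maximum inner product search."""
    scores = x @ data.T - 0.5 * (data ** 2).sum(dim=1) + g           # argmax <=> argmin of cost - g
    return scores.argmax(dim=1)

data = torch.randn(1000, 2) + torch.tensor([3.0, 0.0])               # toy finite target set
g = fit_dual_potential(data)
x0 = torch.randn(8, 2)
pairs = data[match(x0, data, g)]                                     # (x0, pairs) would feed flow matching
```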

[LG-90] World Model for AI Autonomous Navigation in Mechanical Thrombectomy MICCAI2025

链接: https://arxiv.org/abs/2509.25518
作者: Harry Robertshaw,Han-Ru Wu,Alejandro Granados,Thomas C Booth
类目: Machine Learning (cs.LG); Robotics (cs.RO); Image and Video Processing (eess.IV)
*备注: Published in Medical Image Computing and Computer Assisted Intervention - MICCAI 2025, Lecture Notes in Computer Science, vol 15968

点击查看摘要

[LG-91] AGNOMIN - Architecture Agnostic Multi-Label Function Name Prediction

链接: https://arxiv.org/abs/2509.25514
作者: Yonatan Gizachew Achamyeleh,Tongtao Zhang,Joshua Hyunki Kim,Gabriel Garcia,Shih-Yuan Yu,Anton Kocheturov,Mohammad Abdullah Al Faruque
类目: oftware Engineering (cs.SE); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Function name prediction is crucial for understanding stripped binaries in software reverse engineering, a key step for enabling subsequent vulnerability analysis and patching. However, existing approaches often struggle with architecture-specific limitations, data scarcity, and diverse naming conventions. We present AGNOMIN, a novel architecture-agnostic approach for multi-label function name prediction in stripped binaries. AGNOMIN builds Feature-Enriched Hierarchical Graphs (FEHGs), combining Control Flow Graphs, Function Call Graphs, and dynamically learned PCode features. A hierarchical graph neural network processes this enriched structure to generate consistent function representations across architectures, vital for scalable security assessments. For function name prediction, AGNOMIN employs a Renée-inspired decoder, enhanced with an attention-based head layer and algorithmic improvements. We evaluate AGNOMIN on a comprehensive dataset of 9,000 ELF executable binaries across three architectures, demonstrating its superior performance compared to state-of-the-art approaches, with improvements of up to 27.17% in precision and 55.86% in recall across the testing dataset. Moreover, AGNOMIN generalizes well to unseen architectures, achieving 5.89% higher recall than the closest baseline. AGNOMIN’s practical utility has been validated through security hackathons, where it successfully aided reverse engineers in analyzing and patching vulnerable binaries across different architectures.

[LG-92] EEsizer: LLM -Based AI Agent for Sizing of Analog and Mixed Signal Circuit

链接: https://arxiv.org/abs/2509.25510
作者: Chang Liu,Danial Chitnis
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注:

点击查看摘要

[LG-93] Can Molecular Foundation Models Know What They Dont Know? A Simple Remedy with Preference Optimization

链接: https://arxiv.org/abs/2509.25509
作者: Langzhou He,Junyou Zhu,Fangxin Wang,Junhua Liu,Haoyan Xu,Yue Zhao,Philip S.Yu,Qitian Wu
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

[LG-94] Scalable Disk-Based Approximate Nearest Neighbor Search with Page-Aligned Graph

链接: https://arxiv.org/abs/2509.25487
作者: Dingyi Kang,Dongming Jiang,Hanshen Yang,Hang Liu,Bingzhe Li
类目: Machine Learning (cs.LG); Databases (cs.DB); Information Retrieval (cs.IR)
*备注:

点击查看摘要

[LG-95] Conformal Prediction for Signal Temporal Logic Inference

链接: https://arxiv.org/abs/2509.25473
作者: Danyang Li,Yixuan Wang,Matthew Cleaveland,Mingyu Cai,Roberto Tron
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-96] Norm-Q: Effective Compression Method for Hidden Markov Models in Neuro-Symbolic Applications

链接: https://arxiv.org/abs/2509.25439
作者: Hanyuan Gao,Xiaoxuan Yang
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: Accepted by Asilomar 2025

点击查看摘要

[LG-97] Feedback Control for Small Budget Pacing

链接: https://arxiv.org/abs/2509.25429
作者: Sreeja Apparaju,Yichuan Niu,Xixi Qi
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

[LG-98] Leveraging Vulnerabilities in Temporal Graph Neural Networks via Strategic High-Impact Assaults

链接: https://arxiv.org/abs/2509.25418
作者: Dong Hyun Jeon,Lijing Zhu,Haifang Li,Pengze Li,Jingna Feng,Tiehang Duan,Houbing Herbert Song,Cui Tao,Shuteng Niu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Temporal Graph Neural Networks (TGNNs) have become indispensable for analyzing dynamic graphs in critical applications such as social networks, communication systems, and financial networks. However, the robustness of TGNNs against adversarial attacks, particularly sophisticated attacks that exploit the temporal dimension, remains a significant challenge. Existing attack methods for Spatio-Temporal Dynamic Graphs (STDGs) often rely on simplistic, easily detectable perturbations (e.g., random edge additions/deletions) and fail to strategically target the most influential nodes and edges for maximum impact. We introduce the High Impact Attack (HIA), a novel restricted black-box attack framework specifically designed to overcome these limitations and expose critical vulnerabilities in TGNNs. HIA leverages a data-driven surrogate model to identify structurally important nodes (central to network connectivity) and dynamically important nodes (critical for the graph’s temporal evolution). It then employs a hybrid perturbation strategy, combining strategic edge injection (to create misleading connections) and targeted edge deletion (to disrupt essential pathways), maximizing TGNN performance degradation. Importantly, HIA minimizes the number of perturbations to enhance stealth, making it more challenging to detect. Comprehensive experiments on five real-world datasets and four representative TGNN architectures (TGN, JODIE, DySAT, and TGAT) demonstrate that HIA significantly reduces TGNN accuracy on the link prediction task, achieving up to a 35.55% decrease in Mean Reciprocal Rank (MRR) - a substantial improvement over state-of-the-art baselines. These results highlight fundamental vulnerabilities in current STDG models and underscore the urgent need for robust defenses that account for both structural and temporal dynamics.

[LG-99] Multi-Task Equation Discovery

链接: https://arxiv.org/abs/2509.25400
作者: S C Bee,N Dervilis,K Worden,L A Bull
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-100] Crowdsourcing Without People: Modelling Clustering Algorithms as Experts

链接: https://arxiv.org/abs/2509.25395
作者: Jordyn E. A. Lorentz,Katharine M. Clark
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

[LG-101] On the Shape of Latent Variables in a Denoising VAE-MoG: A Posterior Sampling-Based Study

链接: https://arxiv.org/abs/2509.25382
作者: Fernanda Zapata Bascuñán
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: Argentine Congress of Embedded Systems (2025)

点击查看摘要

[LG-102] Deep Survival Analysis for Competing Risk Modeling with Functional Covariates and Missing Data Imputation

链接: https://arxiv.org/abs/2509.25381
作者: Penglei Gao,Yan Zou,Abhijit Duggal,Shuaiqi Huang,Faming Liang,Xiaofeng Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-103] Gradient Descent with Large Step Sizes: Chaos and Fractal Convergence Region

链接: https://arxiv.org/abs/2509.25351
作者: Shuang Liang,Guido Montúfar
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

[LG-104] Optimisation of Resource Allocation in Heterogeneous Wireless Networks Using Deep Reinforcement Learning

链接: https://arxiv.org/abs/2509.25284
作者: Oluwaseyi Giwa,Jonathan Shock,Jaco Du Toit,Tobi Awodumila
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
*备注: Submitted to IEEE Wireless Communications and Networking Conference, 2026

点击查看摘要

[LG-105] MAESTRO: Adaptive Sparse Attention and Robust Learning for Multimodal Dynamic Time Series NEURIPS2025

链接: https://arxiv.org/abs/2509.25278
作者: Payal Mohapatra,Yueyuan Sui,Akash Pandey,Stephen Xia,Qi Zhu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted to Neurips 2025 (Spotlight)

点击查看摘要

[LG-106] Heterogeneous Multi-agent Collaboration in UAV-assisted Mobile Crowdsensing Networks

链接: https://arxiv.org/abs/2509.25261
作者: Xianyang Deng,Wenshuai Liu,Yaru FuB,Qi Zhu
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: 7 pages, 6 figures

点击查看摘要

[LG-107] RANGER – Repository-Level Agent for Graph-Enhanced Retrieval

链接: https://arxiv.org/abs/2509.25257
作者: Pratik Shah,Rajat Ghosh,Aryan Singhal,Debojyoti Dutta
类目: oftware Engineering (cs.SE); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 24 pages, 4 figures

点击查看摘要

[LG-108] Fine-tuning of Large Language Models for Domain-Specific Cybersecurity Knowledge

链接: https://arxiv.org/abs/2509.25241
作者: Yuan Huang
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

[LG-109] Sampling via Gaussian Mixture Approximations

链接: https://arxiv.org/abs/2509.25232
作者: Yongchao Huang
类目: Machine Learning (cs.LG)
*备注: 204 pages

点击查看摘要

Abstract:We present a family of Gaussian Mixture Approximation (GMA) samplers for sampling unnormalised target densities, encompassing weights-only GMA (W-GMA), Laplace Mixture Approximation (LMA), expectation-maximization GMA (EM-GMA), and further variants. GMA adopts a simple two-stage paradigm: (i) initialise a finite set of Gaussian components and draw samples from a proposal mixture; (ii) fit the mixture to the target by optimising either only the component weights or also the means and variances, via a sample-based KL divergence objective that requires only evaluations of the unnormalised density, followed by stratified resampling. The method is gradient-free, and computationally efficient: it leverages the ease of sampling from Gaussians, efficient optimisation methods (projected gradient descent, mirror descent, and EM), and the robustness of stratified resampling to produce samples faithful to the target. We show that this optimisation-resampling scheme yields consistent approximations under mild conditions, and we validate this methodology with empirical results demonstrating accuracy and speed across diverse densities.
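
下面是对 weights-only 变体(W-GMA)"优化-重采样"两阶段流程的一种极简解读(a minimal sketch, one possible reading of the scheme): 固定高斯分量、只优化混合权重,用基于样本的重要性加权 KL 估计作为目标(仅需评估未归一化目标密度),最后按 目标/拟合混合 的权重做分层重采样;一维目标与迭代次数均为示意。

```python
# A minimal sketch of a weights-only Gaussian mixture fit followed by stratified resampling.
import math
import torch

def log_target(x):                                   # unnormalised target density (illustrative)
    return -0.5 * ((x - 2.0) ** 2).sum(dim=-1)

K, d, S = 16, 1, 4000
means = torch.linspace(-5, 5, K).view(K, d)          # fixed Gaussian component means
scale = 1.0
logits = torch.zeros(K, requires_grad=True)          # only the mixture weights are optimised
opt = torch.optim.Adam([logits], lr=0.05)

def mixture_log_prob(x, log_w):
    """log q_w(x) for an isotropic Gaussian mixture with fixed means and scale."""
    z = (x[:, None, :] - means[None, :, :]) / scale
    comp = (-0.5 * z.pow(2) - 0.5 * math.log(2 * math.pi)).sum(-1)
    return torch.logsumexp(log_w[None, :] + comp, dim=1)

# Stage 1: draw once from the equal-weight proposal mixture.
x0 = means[torch.randint(K, (S,))] + scale * torch.randn(S, d)
log_q0 = mixture_log_prob(x0, torch.log(torch.ones(K) / K)).detach()

# Stage 2: minimise an importance-sampled estimate of KL(q_w || target) over the weights only.
for _ in range(300):
    log_w = torch.log_softmax(logits, dim=0)
    log_qw = mixture_log_prob(x0, log_w)
    r = torch.exp(log_qw - log_q0)                   # importance ratio q_w / q_0
    kl = (r * (log_qw - log_target(x0))).mean()
    opt.zero_grad(); kl.backward(); opt.step()

# Stratified resampling with weights proportional to target / fitted mixture.
with torch.no_grad():
    log_qw = mixture_log_prob(x0, torch.log_softmax(logits, dim=0))
    w = torch.softmax(log_target(x0) - log_qw, dim=0)
    cum = torch.cumsum(w, dim=0)
    u = (torch.arange(S, dtype=torch.float32) + torch.rand(S)) / S
    samples = x0[torch.searchsorted(cum, u).clamp(max=S - 1)]
```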

[LG-110] WDformer: A Wavelet-based Differential Transformer Model for Time Series Forecasting CIKM2025

链接: https://arxiv.org/abs/2509.25231
作者: Xiaojian Wang,Chaoli Zhang,Zhonglong Zheng,Yunliang Jiang
类目: Machine Learning (cs.LG)
*备注: Accepted by CIKM 2025

点击查看摘要

[LG-111] Simple Fast and Efficient Injective Manifold Density Estimation with Random Projections

链接: https://arxiv.org/abs/2509.25228
作者: Ahmad Ayaz Amin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-112] Integrated Forecasting of Marine Renewable Power: An Adaptively Bayesian-Optimized MVMD-LSTM Framework for Wind-Solar-Wave Energy

链接: https://arxiv.org/abs/2509.25226
作者: Baoyi Xie,Shuiling Shi,Wenqi Liu
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

[LG-113] MSCoD: An Enhanced Bayesian Updating Framework with Multi-Scale Information Bottleneck and Cooperative Attention for Structure-Based Drug Design

链接: https://arxiv.org/abs/2509.25225
作者: Long Xu,Yongcai Chen,Fengshuo Liu,Yuzhong Peng
类目: Machine Learning (cs.LG)
*备注: 11 pages, 5 figures

点击查看摘要

[LG-114] AMLA: MUL by ADD in FlashAttention Rescaling

链接: https://arxiv.org/abs/2509.25224
作者: Qichen Liao,Chengqiu Hu,Fangzheng Miao,Bao Li,Yiyang Liu,Junlong Lyu,Lirui Jiang,Jun Wang,Lingchao Zheng,Jun Li,Yuwei Fan
类目: Machine Learning (cs.LG)
*备注: 21 pages, 11 figures

点击查看摘要

[LG-115] Sensor optimization for urban wind estimation with cluster-based probabilistic framework

链接: https://arxiv.org/abs/2509.25222
作者: Yutong Liang,Chang Hou,Guy Y. Cornejo Maceda,Andrea Ianiro,Stefano Discetti,Andrea Meilán-Vila,Didier Sornette,Sandro Claudio Lera,Jialong Chen,Xiaozhou He,Bernd R. Noack
类目: Machine Learning (cs.LG); Robotics (cs.RO); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

[LG-116] On The Dynamic Ensemble Selection for TinyML-based Systems – a Preliminary Study

链接: https://arxiv.org/abs/2509.25218
作者: Tobiasz Puslecki,Krzysztof Walkowiak
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-117] Evaluating Double Descent in Machine Learning: Insights from Tree-Based Models Applied to a Genomic Prediction Task

链接: https://arxiv.org/abs/2509.25216
作者: Guillermo Comesaña Cimadevila
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 14 pages, 7 figures

点击查看摘要

[LG-118] Anomaly detection by partitioning of multi-variate time series

链接: https://arxiv.org/abs/2509.25215
作者: Pierre Lotte(IRIT, IRIT-SIG, UT, UT3),André Péninou(IRIT, IRIT-SIG, UT2J),Olivier Teste(IRIT-SIG, IRIT, UT2J, UT)
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
*备注: in French language

点击查看摘要

[LG-119] LEMs: A Primer On Large Execution Models

链接: https://arxiv.org/abs/2509.25211
作者: Remi Genet,Hugo Inzirillo
类目: Machine Learning (cs.LG); Computational Finance (q-fin.CP)
*备注:

点击查看摘要

[LG-120] DPSformer: A long-tail-aware model for improving heavy rainfall prediction

链接: https://arxiv.org/abs/2509.25208
作者: Zenghui Huang,Ting Shu,Zhonglei Wang,Yang Lu,Yan Yan,Wei Zhong,Hanzi Wang
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

[LG-121] Polynomial Contrastive Learning for Privacy-Preserving Representation Learning on Graphs

链接: https://arxiv.org/abs/2509.25205
作者: Daksh Pandey
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Rings and Algebras (math.RA)
*备注:

点击查看摘要

[LG-122] VLHSA: Vision-Language Hierarchical Semantic Alignment for Jigsaw Puzzle Solving with Eroded Gaps

链接: https://arxiv.org/abs/2509.25202
作者: Zhuoning Xu,Xinyan Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-123] SOLD: SELFIES-based Objective-driven Latent Diffusion

链接: https://arxiv.org/abs/2509.25198
作者: Elbert Ho
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-124] Understanding Practitioners’ Perspectives on Monitoring Machine Learning Systems

链接: https://arxiv.org/abs/2509.25195
作者: Hira Naveed,John Grundy,Chetan Arora,Hourieh Khalajzadeh,Omar Haggag
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-125] Estimating Dimensionality of Neural Representations from Finite Samples

链接: https://arxiv.org/abs/2509.26560
作者: Chanwoo Chun,Abdulkadir Canatar,SueYeon Chung,Daniel Lee
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

[LG-126] Pretrain-Test Task Alignment Governs Generalization in In-Context Learning

链接: https://arxiv.org/abs/2509.26551
作者: Mary I. Letey,Jacob A. Zavatone-Veth,Yue M. Lu,Cengiz Pehlevan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-127] An Orthogonal Learner for Individualized Outcomes in Markov Decision Processes

链接: https://arxiv.org/abs/2509.26429
作者: Emil Javurek,Valentyn Melnychuk,Jonas Schweisthal,Konstantin Hess,Dennis Frauen,Stefan Feuerriegel
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

[LG-128] TrackFormers Part 2: Enhanced Transformer-Based Models for High-Energy Physics Track Reconstruction

链接: https://arxiv.org/abs/2509.26411
作者: Sascha Caron,Nadezhda Dobreva,Maarten Kimpel,Uraz Odyurt,Slav Pshenov,Roberto Ruiz de Austri Bazan,Eugene Shalugin,Zef Wolffs,Yue Zhao
类目: High Energy Physics - Experiment (hep-ex); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-129] Are neural scaling laws leading quantum chemistry astray?

链接: https://arxiv.org/abs/2509.26397
作者: Siwoo Lee,Adji Bousso Dieng
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

[LG-130] TrackCore-F: Deploying Transformer-Based Subatomic Particle Tracking on FPGAs

链接: https://arxiv.org/abs/2509.26335
作者: Arjan Blankestijn,Uraz Odyurt,Amirreza Yousefzadeh
类目: High Energy Physics - Experiment (hep-ex); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-131] Ultra-Reliable Risk-Aggregated Sum Rate Maximization via Model-Aided Deep Learning

链接: https://arxiv.org/abs/2509.26311
作者: Hassaan Hashmi,Spyridon Pougkakiotis,Dionysis Kalogerias
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-132] Why is topology hard to learn?

链接: https://arxiv.org/abs/2509.26261
作者: D. O. Oriekhov,Stan Bergkamp,Guliuxin Jin,Juan Daniel Torres Luna,Badr Zouggari,Sibren van der Meer,Naoual El Yazidi,Eliska Greplova
类目: Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*备注: 5+8 pages, 4+7 figures

点击查看摘要

[LG-133] Hybrid Quantum-Classical Optimisation of Traveling Salesperson Problem

链接: https://arxiv.org/abs/2509.26229
作者: Christos Lytrosyngounis,Ioannis Lytrosyngounis
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-134] Non-Vacuous Generalization Bounds: Can Rescaling Invariances Help?

链接: https://arxiv.org/abs/2509.26149
作者: Damien Rouchouse,Antoine Gonon,Rémi Gribonval,Benjamin Guedj
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-135] BALLAST: Bayesian Active Learning with Look-ahead Amendment for Sea-drifter Trajectories under Spatio-Temporal Vector Fields

链接: https://arxiv.org/abs/2509.26005
作者: Rui-Yang Zhang,Henry B. Moss,Lachlan Astfalck,Edward Cripps,David S. Leslie
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-136] Sharpness of Minima in Deep Matrix Factorization: Exact Expressions

链接: https://arxiv.org/abs/2509.25783
作者: Anil Kamber,Rahul Parhi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 18 pages, 7 figures

点击查看摘要

[LG-137] Test time training enhances in-context learning of nonlinear functions ICLR2026

链接: https://arxiv.org/abs/2509.25741
作者: Kento Kuwataka,Taiji Suzuki
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Under review at ICLR 2026. 36 pages, 2 figures, appendix included

点击查看摘要

[LG-138] Transformer-Based Rate Prediction for Multi-Band Cellular Handsets

链接: https://arxiv.org/abs/2509.25722
作者: Ruibin Chen,Haozhe Lei,Hao Guo,Marco Mezzavilla,Hitesh Poddar,Tomoki Yoshimura,Sundeep Rangan
类目: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-139] When Langevin Monte Carlo Meets Randomization: Non-asymptotic Error Bounds beyond Log-Concavity and Gradient Lipschitzness

链接: https://arxiv.org/abs/2509.25630
作者: Xiaojie Wang,Bin Yang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

[LG-140] Coupling Generative Modeling and an Autoencoder with the Causal Bridge NEURIPS2025

链接: https://arxiv.org/abs/2509.25599
作者: Ruolin Meng,Ming-Yu Chung,Dhanajit Brahma,Ricardo Henao,Lawrence Carin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: Accepted to NeurIPS 2025

点击查看摘要

Abstract:We consider inferring the causal effect of a treatment (intervention) on an outcome of interest in situations where there is potentially an unobserved confounder influencing both the treatment and the outcome. This is achievable by assuming access to two separate sets of control (proxy) measurements associated with treatment and outcomes, which are used to estimate treatment effects through a function termed the causal bridge (CB). We present a new theoretical perspective, associated assumptions for when estimating treatment effects with the CB is feasible, and a bound on the average error of the treatment effect when the CB assumptions are violated. From this new perspective, we then demonstrate how coupling the CB with an autoencoder architecture allows for the sharing of statistical strength between observed quantities (proxies, treatment, and outcomes), thus improving the quality of the CB estimates. Experiments on synthetic and real-world data demonstrate the effectiveness of the proposed approach in relation to the state-of-the-art methodology for proxy measurements.
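
The abstract does not spell out an implementation, but the idea of coupling a proxy autoencoder with a causal-bridge head can be sketched as follows. The moment restriction E[Y - h(W, A) | Z, A] = 0 that defines an outcome bridge is replaced here by a simple GMM-style surrogate using fixed random instrument features; the network shapes, the way the latent code enters the bridge, and the synthetic data are all illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

# Hypothetical data: Z = treatment-side proxies, W = outcome-side proxies,
# A = binary treatment, Y = outcome. Shapes and the synthetic values are placeholders.
n, dz, dw = 2048, 4, 4
torch.manual_seed(0)
Z, W = torch.randn(n, dz), torch.randn(n, dw)
A = torch.randint(0, 2, (n, 1)).float()
Y = torch.randn(n, 1)

enc = nn.Sequential(nn.Linear(dz + dw, 16), nn.ReLU(), nn.Linear(16, 4))        # autoencoder over proxies
dec = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, dz + dw))
bridge = nn.Sequential(nn.Linear(4 + dw + 1, 32), nn.ReLU(), nn.Linear(32, 1))  # candidate bridge h(latent, W, A)
psi = nn.Sequential(nn.Linear(dz + 1, 8), nn.Tanh())                            # fixed random instrument features psi(Z, A)
for p in psi.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()) + list(bridge.parameters()), lr=1e-3)

for step in range(2000):
    proxies = torch.cat([Z, W], dim=1)
    latent = enc(proxies)
    recon = dec(latent)                               # sharing statistical strength across proxies
    h = bridge(torch.cat([latent, W, A], dim=1))
    # GMM-style surrogate of the bridge moment restriction E[Y - h(W, A) | Z, A] = 0.
    moments = (psi(torch.cat([Z, A], dim=1)) * (Y - h)).mean(dim=0)
    loss = ((recon - proxies) ** 2).mean() + (moments ** 2).sum()
    opt.zero_grad(); loss.backward(); opt.step()

# Average treatment effect from the fitted bridge: E[h(W, A=1)] - E[h(W, A=0)].
with torch.no_grad():
    latent = enc(torch.cat([Z, W], dim=1))
    ate = (bridge(torch.cat([latent, W, torch.ones_like(A)], dim=1))
           - bridge(torch.cat([latent, W, torch.zeros_like(A)], dim=1))).mean()
```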

[LG-141] Conservative Decisions with Risk Scores

链接: https://arxiv.org/abs/2509.25588
作者: Yishu Wei,Wen-Yee Lee,George Ekow Quaye,Xiaogang Su
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 22 pages plus a supplement with 3 pages

点击查看摘要

[LG-142] One-shot Conditional Sampling: MMD meets Nearest Neighbors

链接: https://arxiv.org/abs/2509.25507
作者: Anirban Chatterjee,Sayantan Choudhury,Rohan Hore
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注: 53 pages, 14 figures, 1 table

点击查看摘要

[LG-143] Scalable Boltzmann Generators for equilibrium sampling of large-scale materials

链接: https://arxiv.org/abs/2509.25486
作者: Maximilian Schebek,Jutta Rogal
类目: Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-144] Fair Classification by Direct Intervention on Operating Characteristics

链接: https://arxiv.org/abs/2509.25481
作者: Kevin Jiang,Edgar Dobriban
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-145] Neural Optimal Transport Meets Multivariate Conformal Prediction

链接: https://arxiv.org/abs/2509.25444
作者: Vladimir Kondratyev,Alexander Fishkov,Nikita Kotelevskii,Mahmoud Hegazy,Remi Flamary,Maxim Panov,Eric Moulines
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-146] Aspects of holographic entanglement using physics-informed-neural-networks

链接: https://arxiv.org/abs/2509.25311
作者: Anirudh Deb,Yaman Sanghavi
类目: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 18 pages, 14 figures

点击查看摘要

Abstract:We implement physics-informed-neural-networks (PINNs) to compute holographic entanglement entropy and entanglement wedge cross section. This technique allows us to compute these quantities for arbitrary shapes of the subregions in any asymptotically AdS metric. We test our computations against some known results and further demonstrate the utility of PINNs in examples, where it is not straightforward to perform such computations.
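
As a concrete (and heavily simplified) illustration of the idea, the sketch below uses a variational PINN-style setup to find the Ryu-Takayanagi surface for a single boundary interval in Poincaré AdS3, where the minimal surface is just a geodesic and the exact answer S = (c/3) log(ℓ/ε) is available as a sanity check. The metric choice, the envelope trick for boundary anchoring, and all hyperparameters are assumptions for illustration; the paper targets arbitrary subregion shapes in general asymptotically AdS metrics.

```python
import torch
import torch.nn as nn

# Boundary interval of width ell in Poincare AdS3 (metric ds^2 = (dx^2 + dz^2) / z^2), UV cutoff eps.
ell, eps = 2.0, 1e-2
torch.manual_seed(0)
net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def z_of_x(x):
    # Hard-wire the anchoring z(+-ell/2) = eps with a polynomial envelope so that only
    # the interior profile is learned (a common PINN boundary trick, assumed here).
    envelope = (ell / 2 - x) * (ell / 2 + x)
    return eps + envelope * torch.nn.functional.softplus(net(x))

for step in range(5000):
    x = (torch.rand(256, 1) - 0.5) * (ell - 2 * eps)      # collocation points on the interval
    x.requires_grad_(True)
    z = z_of_x(x)
    dz = torch.autograd.grad(z.sum(), x, create_graph=True)[0]
    # Length functional of the bulk curve: L = \int dx sqrt(1 + z'(x)^2) / z(x).
    length = (torch.sqrt(1 + dz ** 2) / z).mean() * (ell - 2 * eps)
    opt.zero_grad(); length.backward(); opt.step()

# Ryu-Takayanagi: S = L / (4 G_N). For a single interval the learned minimal curve should
# approach a semicircle, with S ~ (c / 3) * log(ell / eps) as a sanity check.
```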

[LG-147] Mechanisms of Matter: Language Inferential Benchmark on Physicochemical Hypothesis in Materials Synthesis

链接: https://arxiv.org/abs/2509.25281
作者: Yingming Pu,Tao Lin,Hongyu Chen
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-148] Evaluating the Impact of Radiographic Noise on Chest X-ray Semantic Segmentation and Disease Classification Using a Scalable Noise Injection Framework

链接: https://arxiv.org/abs/2509.25265
作者: Derek Jiu,Kiran Nijjer,Nishant Chinta,Ryan Bui,Ben Liu,Kevin Zhu
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注: Accepted to ARRS 2026 Annual Meeting

点击查看摘要

信息检索

[IR-0] Informed Dataset Selection

链接: https://arxiv.org/abs/2509.26448
作者: Abdullah Abbas,Michael Heep,Theodor Sperle
类目: Information Retrieval (cs.IR)
*备注: 45 pages, 4 figures

点击查看摘要

Abstract:The selection of datasets in recommender systems research lacks a systematic methodology. Researchers often select datasets based on popularity rather than empirical suitability. We developed the APS Explorer, a web application that implements the Algorithm Performance Space (APS) framework for informed dataset selection. The system analyzes 96 datasets using 28 algorithms across three metrics (nDCG, Hit Ratio, Recall) at five K-values. We extend the APS framework with a statistics-based classification system that categorizes datasets into five difficulty levels based on quintiles. We also introduce a variance-normalized distance metric based on Mahalanobis distance to measure similarity. The APS Explorer offers three interactive modules for visualizing algorithm performance, directly comparing algorithms, and analyzing dataset metadata. This tool shifts the process of selecting datasets from intuition-based to evidence-based practices, and it is publicly available at this http URL.
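
The two quantitative ingredients mentioned in the abstract, quintile-based difficulty levels and a Mahalanobis-type similarity between datasets, are easy to reproduce in a few lines. The sketch below uses random placeholder scores and a simplified performance matrix; the exact features and conventions used by the APS Explorer are assumptions here.

```python
import numpy as np

# Rows = datasets, columns = per-algorithm performance scores (placeholder values; in the
# real APS these come from 28 algorithms x 3 metrics x 5 K-values).
rng = np.random.default_rng(0)
perf = rng.random((96, 28))

# Quintile-based difficulty levels: lower average performance -> harder dataset
# (the exact convention used by the APS Explorer is an assumption here).
mean_perf = perf.mean(axis=1)
edges = np.quantile(mean_perf, [0.2, 0.4, 0.6, 0.8])
difficulty = np.digitize(mean_perf, edges)            # 0 (hardest) ... 4 (easiest)

# Variance-normalised similarity via Mahalanobis distance between performance vectors.
cov = np.cov(perf, rowvar=False) + 1e-6 * np.eye(perf.shape[1])   # regularised covariance
cov_inv = np.linalg.inv(cov)

def mahalanobis(i, j):
    d = perf[i] - perf[j]
    return float(np.sqrt(d @ cov_inv @ d))

print(difficulty[:5], round(mahalanobis(0, 1), 3))
```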

[IR-1] Analyzing BEV Suitability and Charging Strategies Using Italian Driving Data

链接: https://arxiv.org/abs/2509.26262
作者: Homa Jamalof,Luca Vassio,Danilo Giordano,Marco Mellia,Claudio De Tommasi
类目: Information Retrieval (cs.IR); Computational Engineering, Finance, and Science (cs.CE)
*备注: Accepted at 2025 IEEE Transportation Electrification Conference and Expo, Asia-Pacific (ITEC-AP 2025)

点击查看摘要

Abstract:Battery Electric Vehicles (BEVs) are rapidly evolving from a niche alternative to an established option for private transportation, often replacing Internal Combustion Engine (ICE) vehicles. Despite growing interest, significant barriers remain, including range anxiety, the inconvenience associated with public charging stations, and higher costs. This study analyses extensive telemetry data collected from 10,441 ICE vehicle users in an Italian province to assess the potential for switching to BEVs without changing current travel behaviour. We evaluate to what extent BEV models can fulfil these users' mobility needs under different charging scenarios. To do so, we replicate trips and parking events, simulating and monitoring the battery state of charge. The analysis reveals the compromises between charging behaviours and limited BEV autonomy. Assuming access to overnight charging, at least 35% of the users could already adopt even low-capacity BEVs.
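
The core of such a suitability analysis is a state-of-charge replay over each user's recorded trips and parking events. The sketch below shows the idea for an overnight-home-charging scenario; battery capacity, consumption and charger power are illustrative values, not the parameters used in the study.

```python
# Minimal state-of-charge replay over a user's chronological trip/parking log under an
# overnight-home-charging scenario. Battery size, consumption and charger power are
# illustrative numbers, not the study's parameters.
BATTERY_KWH = 50.0
KWH_PER_KM = 0.17
CHARGER_KW = 7.4

def bev_feasible(events, soc_kwh=BATTERY_KWH):
    """events: list of ('drive', km) or ('park_overnight', hours) tuples in time order."""
    for kind, value in events:
        if kind == 'drive':
            soc_kwh -= value * KWH_PER_KM
            if soc_kwh < 0:
                return False               # this BEV could not cover the recorded trips
        elif kind == 'park_overnight':
            soc_kwh = min(BATTERY_KWH, soc_kwh + value * CHARGER_KW)
    return True

log = [('drive', 60), ('park_overnight', 8), ('drive', 180), ('drive', 40), ('park_overnight', 10)]
print(bev_feasible(log))   # True: ~280 km spread over the log fits a 50 kWh pack with nightly charging
```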

[IR-2] Leveraging Scene Context with Dual Networks for Sequential User Behavior Modeling

链接: https://arxiv.org/abs/2509.26172
作者: Xu Chen,Yunmeng Shu,Yuangang Pan,Jinsong Lan,Xiaoyong Zhu,Shuai Xiao,Haojin Zhu,Ivor W. Tsang,Bo Zheng
类目: Information Retrieval (cs.IR)
*备注: 12pages

点击查看摘要

Abstract:Modeling sequential user behaviors for future behavior prediction is crucial for improving users’ information retrieval experience. Recent studies highlight the importance of incorporating contextual information to enhance prediction performance. One crucial but usually neglected piece of contextual information is the scene feature, which we define as the sub-interfaces within an app, created by developers to provide specific functionalities, such as the "text2product search" and "live" modules in e-commerce apps. Different scenes exhibit distinct functionalities and usage habits, leading to a significant distribution gap in user engagement across them. Popular sequential behavior models either ignore the scene feature or merely use it as attribute embeddings, which cannot effectively capture the dynamic interests and interplay between scenes and items when modeling user sequences. In this work, we propose novel Dual Sequence Prediction networks (DSPnet) to effectively capture the dynamic interests and interplay between scenes and items for future behavior prediction. DSPnet consists of two parallel networks dedicated to learning users’ dynamic interests over items and scenes, and a sequence feature enhancement module to capture the interplay for enhanced future behavior prediction. Further, we introduce a Conditional Contrastive Regularization (CCR) loss to capture the invariance of similar historical sequences. Theoretical analysis suggests that DSPnet is a principled way to learn the joint relationships between scene and item sequences. Extensive experiments are conducted on one public benchmark and two collected industrial datasets. The method has been deployed online in our system, bringing a 0.04 point increase in CTR, 0.78% growth in deals, and 0.64% rise in GMV. The code is available at an anonymous GitHub repository: this https URL.
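
The abstract only names the building blocks, but a dual-sequence layout of this kind can be sketched in a few lines of PyTorch: two parallel encoders for the item and scene sequences and an attention-based module that lets the streams interact before prediction. Everything below (layer choices, dimensions, the fusion mechanism, the omission of the CCR loss) is an illustrative assumption, not the DSPnet architecture.

```python
import torch
import torch.nn as nn

class DualSequenceSketch(nn.Module):
    """Illustrative dual-encoder layout; the real DSPnet details are not given in the abstract."""
    def __init__(self, n_items, n_scenes, d=64):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, d)
        self.scene_emb = nn.Embedding(n_scenes, d)
        self.item_enc = nn.GRU(d, d, batch_first=True)     # parallel network for item-level interests
        self.scene_enc = nn.GRU(d, d, batch_first=True)    # parallel network for scene-level interests
        self.fuse = nn.MultiheadAttention(d, num_heads=4, batch_first=True)  # interplay between streams
        self.head = nn.Linear(2 * d, n_items)

    def forward(self, item_seq, scene_seq):
        hi, _ = self.item_enc(self.item_emb(item_seq))
        hs, _ = self.scene_enc(self.scene_emb(scene_seq))
        mixed, _ = self.fuse(hi, hs, hs)                    # item stream attends to the scene stream
        rep = torch.cat([mixed[:, -1], hs[:, -1]], dim=-1)  # last-step states as the user representation
        return self.head(rep)                               # scores over the item corpus

model = DualSequenceSketch(n_items=1000, n_scenes=8)
logits = model(torch.randint(0, 1000, (4, 20)), torch.randint(0, 8, (4, 20)))
print(logits.shape)   # torch.Size([4, 1000])
```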

[IR-3] Items Proxy Bridging: Enabling Frictionless Critiquing in Knowledge Graph Recommendations

链接: https://arxiv.org/abs/2509.26107
作者: Huanyu Zhang,Xiaoxuan Shen,Yu Lei,Baolin Yi,Jianfang Liu,Yinao xie
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

[IR-4] Fading to Grow: Growing Preference Ratios via Preference Fading Discrete Diffusion for Recommendation

链接: https://arxiv.org/abs/2509.26063
作者: Guoqing Hu,An Zhang,Shuchang Liu,Wenyu Mao,Jiancan Wu,Xun Yang,Xiang Li,Lantao Hu,Han Li,Kun Gai,Xiang Wang
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Recommenders aim to rank items from a discrete item corpus in line with user interests, yet suffer from extremely sparse user preference data. Recent advances in diffusion models have inspired diffusion-based recommenders, which alleviate sparsity by injecting noise during a forward process to prevent the collapse of perturbed preference distributions. However, current diffusion-based recommenders predominantly rely on continuous Gaussian noise, which is intrinsically mismatched with the discrete nature of user preference data in recommendation. In this paper, building upon recent advances in discrete diffusion, we propose PreferGrow, a discrete diffusion-based recommender system that models preference ratios by fading and growing user preferences over the discrete item corpus. PreferGrow differs from existing diffusion-based recommenders in three core aspects: (1) Discrete modeling of preference ratios: PreferGrow models relative preference ratios between item pairs, rather than operating in the item representation or raw score simplex. This formulation aligns naturally with the discrete and ranking-oriented nature of recommendation tasks. (2) Perturbing via preference fading: Instead of injecting continuous noise, PreferGrow fades user preferences by replacing the preferred item with alternatives – physically akin to negative sampling – thereby eliminating the need for any prior noise assumption. (3) Preference reconstruction via growing: PreferGrow reconstructs user preferences by iteratively growing the preference signals from the estimated ratios. PreferGrow offers a well-defined matrix-based formulation with theoretical guarantees on Markovianity and reversibility, and it demonstrates consistent performance gains over state-of-the-art diffusion-based recommenders across five benchmark datasets, highlighting both its theoretical soundness and empirical effectiveness.
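
To make the "perturbing via preference fading" step concrete, the sketch below implements only the forward corruption: at each diffusion step the preferred item is replaced, with some probability, by an alternative drawn from the corpus, which is indeed akin to negative sampling. The replacement schedule and shapes are assumptions, and the reverse "growing" model and the preference-ratio parametrisation are not shown.

```python
import torch

# Forward "fading" corruption only: with probability beta_t, the preferred item is replaced
# by an alternative sampled from the corpus (akin to negative sampling). The schedule,
# shapes and corpus size are illustrative assumptions.
n_items, T = 1000, 10
betas = torch.linspace(0.05, 0.5, T)          # per-step replacement probabilities

def fade(items, t):
    """items: LongTensor of preferred item ids; returns the corrupted ids at step t."""
    replace = torch.rand(items.shape) < betas[t]
    negatives = torch.randint(0, n_items, items.shape)
    return torch.where(replace, negatives, items)

x0 = torch.randint(0, n_items, (8,))          # original preferred items for 8 users
xt = x0
for t in range(T):
    xt = fade(xt, t)                          # preferences progressively fade into the corpus
print(x0, xt)
```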

[IR-5] Using GPT to build a Project Management assistant for Jira environments

链接: https://arxiv.org/abs/2509.26014
作者: Joel Garcia-Escribano,Arkaitz Carbajo,Mikel Egaña Aranguren,Unai Lopez-Novoa
类目: Software Engineering (cs.SE); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:In the domain of Project Management, the sheer volume of data is a challenge that project managers continually have to deal with. Effectively steering projects from inception to completion requires handling diverse information streams, including timelines, budgetary considerations, and task dependencies. To navigate this data-driven landscape with precision and agility, project managers must rely on efficient and sophisticated tools. These tools have become essential, as they enable project managers to streamline communication, optimize resource allocation, and make informed decisions in real-time. However, many of these tools have steep learning curves and require using complex programming languages to retrieve the exact data that project managers need. In this work we present JiraGPT Next, a software tool that uses the GPT large language model to ease the process by which project managers deal with large amounts of data. It is conceived as an add-on for Jira, one of the most popular Project Management tools, and provides a natural language interface to retrieve information. This work presents the design decisions behind JiraGPT Next and an evaluation of the accuracy of GPT in this context, including the effects of providing different prompts to complete a particular task.
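
A natural-language front end of this kind typically boils down to two steps: let the LLM translate the question into JQL, then run the JQL against Jira's REST search endpoint. The sketch below follows that pattern with a hypothetical call_llm helper standing in for the GPT call; the endpoint and auth shown are the standard Jira Cloud REST API, not necessarily how JiraGPT Next is wired internally.

```python
import requests

def call_llm(prompt: str) -> str:
    """Hypothetical helper standing in for a GPT call; it should return a single JQL string."""
    raise NotImplementedError

def ask_jira(question: str, base_url: str, email: str, api_token: str):
    # 1) Let the LLM translate the project manager's question into JQL.
    jql = call_llm(
        "Translate this project-management question into one JQL query and output only the JQL:\n"
        + question
    )
    # 2) Run the JQL against Jira's REST search endpoint and return the matching issue keys.
    resp = requests.get(
        f"{base_url}/rest/api/2/search",
        params={"jql": jql, "maxResults": 50},
        auth=(email, api_token),
        timeout=30,
    )
    resp.raise_for_status()
    return [issue["key"] for issue in resp.json().get("issues", [])]

# Example call with assumed credentials:
# ask_jira("Which bugs assigned to Ana are still open past their due date?",
#          "https://example.atlassian.net", "user@example.com", "API_TOKEN")
```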

[IR-6] HiFIRec: Towards High-Frequency yet Low-Intention Behaviors for Multi-Behavior Recommendation

链接: https://arxiv.org/abs/2509.25755
作者: Ruiqi Luo,Ran Jin,Zhenglong Li,Kaixi Hu,Xiaohui Tao,Lin Li
类目: Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

[IR-7] TRUE: A Reproducible Framework for LLM-Driven Relevance Judgment in Information Retrieval

链接: https://arxiv.org/abs/2509.25602
作者: Mouly Dewan,Jiqun Liu,Chirag Shah
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

[IR-8] On-Premise AI for the Newsroom: Evaluating Small Language Models for Investigative Document Search

链接: https://arxiv.org/abs/2509.25494
作者: Nick Hagar,Nicholas Diakopoulos,Jeremy Gilbert
类目: Information Retrieval (cs.IR)
*备注: Accepted to Computation + Journalism Symposium 2025

点击查看摘要

附件下载

点击下载今日全部论文列表