This post contains the latest paper listing retrieved from Arxiv.org on 2025-10-06, updated automatically and organized into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.
Note: paper data is retrieved from Arxiv.org daily and updated automatically around 12:00 each day.
Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.
Table of Contents
Overview (2025-10-06)
511 papers were updated today, including:
- Natural Language Processing: 101 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 161 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 75 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 155 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] Reward Models are Metrics in a Trench Coat
[Quick Read]: This paper addresses the problems created by the long-standing separation of two research areas, reward models and evaluation metrics: redundant terminology, repeated pitfalls, and performance bottlenecks such as susceptibility to spurious correlations, downstream reward-hacking risks, difficulty in improving data quality, and inadequate meta-evaluation methods. The key to its solution is closer collaboration between the two fields: the paper empirically shows that evaluation metrics can outperform reward models on specific tasks, systematically surveys both areas, and identifies several directions for joint progress, including better preference-elicitation methods, mechanisms for avoiding spurious correlations and reward hacking, and calibration-aware meta-evaluation, so that reward models and metrics can reinforce each other and improve overall performance.
Link: https://arxiv.org/abs/2510.03231
Authors: Sebastian Gehrmann
Affiliations: Bloomberg
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:The emergence of reinforcement learning in post-training of large language models has sparked significant interest in reward models. Reward models assess the quality of sampled model outputs to generate training signals. This task is also performed by evaluation metrics that monitor the performance of an AI model. We find that the two research areas are mostly separate, leading to redundant terminology and repeated pitfalls. Common challenges include susceptibility to spurious correlations, impact on downstream reward hacking, methods to improve data quality, and approaches to meta-evaluation. Our position paper argues that a closer collaboration between the fields can help overcome these issues. To that end, we show how metrics outperform reward models on specific tasks and provide an extensive survey of the two areas. Grounded in this survey, we point to multiple research topics in which closer alignment can improve reward models and metrics in areas such as preference elicitation methods, avoidance of spurious correlations and reward hacking, and calibration-aware meta-evaluation.
[NLP-1] Self-Anchor: Large Language Model Reasoning via Step-by-step Attention Alignment
[Quick Read]: This paper addresses the problem that, as Large Language Models (LLMs) carry out complex reasoning, lengthening reasoning chains bury critical intermediate steps and the original prompt in the context, where they receive insufficient attention and cause errors. The key to the solution is Self-Anchor, a novel framework that exploits the inherent structure of reasoning: it decomposes reasoning trajectories into structured plans and automatically aligns the model's attention to the most relevant reasoning steps, keeping the model focused throughout generation. Experiments show that Self-Anchor outperforms state-of-the-art prompting methods on six benchmarks and substantially narrows the performance gap between "non-reasoning" models and specialized reasoning models, enabling most LLMs to tackle complex reasoning tasks without retraining.
Link: https://arxiv.org/abs/2510.03223
Authors: Hongxiang Zhang, Yuan Tian, Tianyi Zhang
Affiliations: Purdue University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:To solve complex reasoning tasks for Large Language Models (LLMs), prompting-based methods offer a lightweight alternative to fine-tuning and reinforcement learning. However, as reasoning chains extend, critical intermediate steps and the original prompt will be buried in the context, receiving insufficient attention and leading to errors. In this paper, we propose Self-Anchor, a novel pipeline that leverages the inherent structure of reasoning to steer LLM attention. Self-Anchor decomposes reasoning trajectories into structured plans and automatically aligns the model's attention to the most relevant inference steps, allowing the model to maintain focus throughout generation. Our experiment shows that Self-Anchor outperforms SOTA prompting methods across six benchmarks. Notably, Self-Anchor significantly reduces the performance gap between "non-reasoning" models and specialized reasoning models, with the potential to enable most LLMs to tackle complex reasoning tasks without retraining.
[NLP-2] Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward
[Quick Read]: This paper targets the exploration collapse that occurs during Reinforcement Learning with Verifiable Rewards (RLVR) training: as policy entropy collapses, the model gradually loses its ability to explore low-probability but critical "reasoning sparks" on complex reasoning tasks, capping performance gains. The key to the solution, Low-probability Regularization (Lp-Reg), is to construct a denoised heuristic proxy distribution: presumed noise tokens are filtered out and the remaining candidates re-normalized, amplifying the probability of reasoning sparks; this proxy then serves as a soft regularization target, using KL divergence to shield these valuable low-probability tokens from over-penalization and thereby sustain stable exploration and continued performance improvement.
Link: https://arxiv.org/abs/2510.03222
Authors: Guanhua Huang, Tingqiang Xu, Mingze Wang, Qi Yi, Xue Gong, Siheng Li, Ruibin Xiong, Kejiao Li, Yuhao Jiang, Bo Zhou
Affiliations: Tencent; Tsinghua University; Peking University; The Chinese University of Hong Kong
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has propelled Large Language Models in complex reasoning, yet its scalability is often hindered by a training bottleneck where performance plateaus as policy entropy collapses, signaling a loss of exploration. Previous methods typically address this by maintaining high policy entropy, yet the precise mechanisms that govern meaningful exploration have remained underexplored. Our analysis suggests that an unselective focus on entropy risks amplifying irrelevant tokens and destabilizing training. This paper investigates the exploration dynamics within RLVR and identifies a key issue: the gradual elimination of valuable low-probability exploratory tokens, which we term reasoning sparks. We find that while abundant in pre-trained models, these sparks are systematically extinguished during RLVR due to over-penalization, leading to a degeneracy in exploration. To address this, we introduce Low-probability Regularization (Lp-Reg). Its core mechanism regularizes the policy towards a heuristic proxy distribution. This proxy is constructed by filtering out presumed noise tokens and re-normalizing the distribution over the remaining candidates. The result is a less-noisy proxy where the probability of reasoning sparks is amplified, which then serves as a soft regularization target to shield these valuable tokens from elimination via KL divergence. Experiments show that Lp-Reg enables stable on-policy training for around 1,000 steps, a regime where baseline entropy-control methods collapse. This sustained exploration leads to state-of-the-art performance, achieving a 60.17% average accuracy on five math benchmarks, an improvement of 2.66% over prior methods. Code is available at this https URL.
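The proxy-distribution idea at the core of Lp-Reg is compact enough to sketch. Below is a minimal illustration for a single next-token distribution; `noise_floor` and `beta` are illustrative assumptions, not the paper's hyperparameters.

```python
import torch

def lp_reg_loss(logits: torch.Tensor, noise_floor: float = 1e-3,
                beta: float = 0.1) -> torch.Tensor:
    """Sketch of Low-probability Regularization (Lp-Reg).

    logits: [vocab] next-token logits from the current policy.
    noise_floor: probability below which tokens are treated as noise
                 (illustrative value, not from the paper).
    beta: regularization weight (illustrative).
    """
    probs = torch.softmax(logits, dim=-1)

    # Build the heuristic proxy: drop presumed-noise tokens, re-normalize.
    keep = probs > noise_floor
    proxy = torch.where(keep, probs, torch.zeros_like(probs))
    proxy = proxy / proxy.sum()  # surviving low-prob "reasoning sparks" gain mass

    # KL(proxy || policy) as a soft target shielding kept tokens from elimination.
    kl = torch.sum(proxy * (torch.log(proxy + 1e-12) - torch.log(probs + 1e-12)))
    return beta * kl

loss = lp_reg_loss(torch.randn(32000))
```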
[NLP-3] Cache-to-Cache: Direct Semantic Communication Between Large Language Models
[Quick Read]: This paper addresses the semantic loss and generation latency caused by text-based communication in multi-LLM systems. In existing designs, LLMs interact by emitting token sequences, forcing internal representations to be converted into text; this loses rich semantic information and incurs token-by-token generation latency. The key to the solution is Cache-to-Cache (C2C), a new paradigm that uses a neural network to project and fuse the source model's KV-Cache with the target model's KV-Cache, enabling direct semantic transfer between models; a learnable gating mechanism selects the target layers that benefit from cache communication, preserving deep semantics without increasing cache size while significantly improving both performance and efficiency.
Link: https://arxiv.org/abs/2510.03215
Authors: Tianyu Fu, Zihan Min, Hanling Zhang, Jichao Yan, Guohao Dai, Wanli Ouyang, Yu Wang
Affiliations: Tsinghua University; Infinigence AI; The Chinese University of Hong Kong; Shanghai AI Laboratory; Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Multi-LLM systems harness the complementary strengths of diverse Large Language Models, achieving performance and efficiency gains unattainable by a single model. In existing designs, LLMs communicate through text, forcing internal representations to be transformed into output token sequences. This process both loses rich semantic information and incurs token-by-token generation latency. Motivated by these limitations, we ask: Can LLMs communicate beyond text? Oracle experiments show that enriching the KV-Cache semantics can improve response quality without increasing cache size, supporting KV-Cache as an effective medium for inter-model communication. Thus, we propose Cache-to-Cache (C2C), a new paradigm for direct semantic communication between LLMs. C2C uses a neural network to project and fuse the source model’s KV-cache with that of the target model to enable direct semantic transfer. A learnable gating mechanism selects the target layers that benefit from cache communication. Compared with text communication, C2C utilizes the deep, specialized semantics from both models, while avoiding explicit intermediate text generation. Experiments show that C2C achieves 8.5-10.5% higher average accuracy than individual models. It further outperforms the text communication paradigm by approximately 3.0-5.0%, while delivering an average 2.0x speedup in latency. Our code is available at this https URL.
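For intuition, here is a minimal per-layer sketch of the project-fuse-gate pattern the abstract describes; the dimensions, residual fusion form, and scalar gate are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CacheFuser(nn.Module):
    """Sketch of a Cache-to-Cache (C2C)-style fusion module for one layer.

    Projects a source model's KV-cache into the target model's KV space and
    blends the two with a learnable gate.
    """
    def __init__(self, src_dim: int, tgt_dim: int):
        super().__init__()
        self.proj_k = nn.Linear(src_dim, tgt_dim)
        self.proj_v = nn.Linear(src_dim, tgt_dim)
        self.gate = nn.Parameter(torch.zeros(1))  # sigmoid(0) = 0.5 at init

    def forward(self, src_kv, tgt_kv):
        src_k, src_v = src_kv  # each [batch, seq, src_dim]
        tgt_k, tgt_v = tgt_kv  # each [batch, seq, tgt_dim]
        g = torch.sigmoid(self.gate)
        fused_k = tgt_k + g * self.proj_k(src_k)
        fused_v = tgt_v + g * self.proj_v(src_v)
        return fused_k, fused_v

fuser = CacheFuser(src_dim=512, tgt_dim=1024)
kv = fuser((torch.randn(1, 8, 512), torch.randn(1, 8, 512)),
           (torch.randn(1, 8, 1024), torch.randn(1, 8, 1024)))
```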
[NLP-4] Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner
[Quick Read]: This paper tackles the fact that continuous diffusion language models underperform their discrete counterparts in practice, despite being theoretically more expressive. The core tension: continuous diffusion provides intermediate supervision that improves trainability, but decoding latents from the continuous representation space back into the discrete token space is difficult and degrades sample quality. The key innovation, Coevolutionary Continuous Discrete Diffusion (CCDD), is a joint multimodal diffusion process that denoises simultaneously over a continuous representation space and a discrete token space with a single model. This design retains the semantic richness and strong expressivity of the continuous space while using explicit discrete tokens to improve training stability and sample quality, yielding strong empirical performance on real-world language modeling tasks.
Link: https://arxiv.org/abs/2510.03206
Authors: Cai Zhou, Chenxiao Yang, Yi Hu, Chenyu Wang, Chubin Zhang, Muhan Zhang, Lester Mackey, Tommi Jaakkola, Stephen Bates, Dinghuai Zhang
Affiliations: Massachusetts Institute of Technology; Microsoft Research; Toyota Technological Institute at Chicago; Peking University; Tsinghua University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 27 pages
Abstract:Diffusion language models, especially masked discrete diffusion models, have achieved great success recently. While there are some theoretical and primary empirical results showing the advantages of latent reasoning with looped transformers or continuous chain-of-thoughts, continuous diffusion models typically underperform their discrete counterparts. In this paper, we argue that diffusion language models do not necessarily need to be in the discrete space. In particular, we prove that continuous diffusion models have stronger expressivity than discrete diffusions and looped transformers. We attribute the contradiction between the theoretical expressiveness and empirical performance to their practical trainability: while continuous diffusion provides intermediate supervision that looped transformers lack, they introduce additional difficulty decoding tokens into the discrete token space from the continuous representation space. We therefore propose Coevolutionary Continuous Discrete Diffusion (CCDD), which defines a joint multimodal diffusion process on the union of a continuous representation space and a discrete token space, leveraging a single model to simultaneously denoise in the joint space. By combining two modalities, CCDD is expressive with rich semantics in the latent space, as well as good trainability and sample quality with the help of explicit discrete tokens. We also propose effective architectures and advanced training/sampling techniques for CCDD, which reveals strong empirical performance in extensive language modeling experiments on real-world tasks.
[NLP-5] FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents
[Quick Read]: This paper addresses three problems that LLM-based web agents face when processing long web-page observations: pages often exceed tens of thousands of tokens, saturating context windows and raising computational cost; processing full pages exposes agents to security risks such as prompt injection; and existing pruning strategies either discard key information or retain redundant content, hurting action prediction. The key to the solution, FocusAgent, is a lightweight LLM retriever that extracts the lines of the accessibility tree (AxTree) most relevant to the task goal, achieving efficient and precise context compression. The method not only shrinks observations by over 50% but also improves robustness to prompt-injection attacks while maintaining task success rates.
Link: https://arxiv.org/abs/2510.03204
Authors: Imene Kerboua, Sahar Omidi Shayegan, Megh Thakkar, Xing Han Lù, Léo Boisvert, Massimo Caccia, Jérémy Espinas, Alexandre Aussem, Véronique Eglin, Alexandre Lacoste
Affiliations: LIRIS - CNRS, INSA Lyon, Universite Claude Bernard Lyon 1; Esker; ServiceNow Research; Mila - Quebec AI Institute; McGill University; Polytechnique Montréal
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Web agents powered by large language models (LLMs) must process lengthy web page observations to complete user goals; these pages often exceed tens of thousands of tokens. This saturates context limits and increases computational cost processing; moreover, processing full pages exposes agents to security risks such as prompt injection. Existing pruning strategies either discard relevant content or retain irrelevant context, leading to suboptimal action prediction. We introduce FocusAgent, a simple yet effective approach that leverages a lightweight LLM retriever to extract the most relevant lines from accessibility tree (AxTree) observations, guided by task goals. By pruning noisy and irrelevant content, FocusAgent enables efficient reasoning while reducing vulnerability to injection attacks. Experiments on WorkArena and WebArena benchmarks show that FocusAgent matches the performance of strong baselines, while reducing observation size by over 50%. Furthermore, a variant of FocusAgent significantly reduces the success rate of prompt-injection attacks, including banner and pop-up attacks, while maintaining task success performance in attack-free settings. Our results highlight that targeted LLM-based retrieval is a practical and robust strategy for building web agents that are efficient, effective, and secure.
[NLP-6] Model-Based Ranking of Source Languages for Zero-Shot Cross-Lingual Transfer EMNLP2025
[Quick Read]: This paper addresses source-language selection for cross-lingual transfer: how to efficiently identify the best candidate source languages for a given target language. Traditional approaches rank sources by lexical or linguistic features, but their performance depends on whether labeled target-language data is available. The proposed NN-Rank algorithm instead combines hidden representations from pretrained multilingual models with a small amount of unlabeled target-language data, using nearest neighbor search to measure source-target similarity in semantic space and produce high-quality source-language rankings. Experiments show the method remains competitive even without in-domain data, and with as few as 25 target-language examples reaches close to full-data performance (92.8% of the NDCG).
Link: https://arxiv.org/abs/2510.03202
Authors: Abteen Ebrahimi, Adam Wiemerslage, Katharina von der Wense
Affiliations: University of Colorado Boulder; Kensho Technologies; Johannes Gutenberg University Mainz
Categories: Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2025 (Main)
Abstract:We present NN-Rank, an algorithm for ranking source languages for cross-lingual transfer, which leverages hidden representations from multilingual models and unlabeled target-language data. We experiment with two pretrained multilingual models and two tasks: part-of-speech tagging (POS) and named entity recognition (NER). We consider 51 source languages and evaluate on 56 and 72 target languages for POS and NER, respectively. When using in-domain data, NN-Rank beats state-of-the-art baselines that leverage lexical and linguistic features, with average improvements of up to 35.56 NDCG for POS and 18.14 NDCG for NER. As prior approaches can fall back to language-level features if target language data is not available, we show that NN-Rank remains competitive using only the Bible, an out-of-domain corpus available for a large number of languages. Ablations on the amount of unlabeled target data show that, for subsets consisting of as few as 25 examples, NN-Rank produces high-quality rankings which achieve 92.8% of the NDCG achieved using all available target data for ranking.
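A minimal sketch of the nearest-neighbor ranking idea follows; the use of cosine similarity and the max-over-source-neighbors aggregation are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def nn_rank(target_embs: np.ndarray, source_embs: dict) -> list:
    """Sketch of NN-Rank-style scoring (illustrative, not the paper's code).

    target_embs: [n_target, d] hidden states of unlabeled target-language text.
    source_embs: language -> [n_source, d] hidden states per source language.
    Returns source languages sorted by mean nearest-neighbor cosine similarity.
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    t = normalize(target_embs)
    scores = {}
    for lang, embs in source_embs.items():
        s = normalize(embs)
        sims = t @ s.T                          # pairwise cosine similarities
        scores[lang] = sims.max(axis=1).mean()  # nearest source neighbor per target
    return sorted(scores, key=scores.get, reverse=True)

rng = np.random.default_rng(0)
ranking = nn_rank(rng.normal(size=(25, 768)),
                  {"deu": rng.normal(size=(100, 768)),
                   "fra": rng.normal(size=(100, 768))})
```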
[NLP-7] Simulation to Rules: A Dual-VLM Framework for Formal Visual Planning
[Quick Read]: This paper addresses the difficulty Vision Language Models (VLMs) have with precise spatial reasoning and long-horizon planning in visual planning tasks, in particular their poor performance at generating PDDL domain files, which forces existing methods to rely on human-predefined domain files or constant environment access for refinement. The key to the solution is VLMFP, a Dual-VLM-guided framework with two complementary components: SimVLM, which simulates action consequences from rule descriptions, and GenVLM, which generates and iteratively refines PDDL files by comparing PDDL execution results against SimVLM's. This design enables autonomous generation of both PDDL problem and domain files and generalizes markedly better to unseen instances, appearances, and rules, as validated on six grid-world domains.
Link: https://arxiv.org/abs/2510.03182
Authors: Yilun Hao, Yongchao Chen, Chuchu Fan, Yang Zhang
Affiliations: MIT; Harvard University; MIT-IBM Watson AI Lab
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Symbolic Computation (cs.SC)
Comments: 30 pages, 5 figures, 5 tables
Abstract:Vision Language Models (VLMs) show strong potential for visual planning but struggle with precise spatial and long-horizon reasoning. In contrast, Planning Domain Definition Language (PDDL) planners excel at long-horizon formal planning, but cannot interpret visual inputs. Recent works combine these complementary advantages by enabling VLMs to turn visual planning problems into PDDL files for formal planning. However, while VLMs can generate PDDL problem files satisfactorily, they struggle to accurately generate the PDDL domain files, which describe all the planning rules. As a result, prior methods rely on human experts to predefine domain files or on constant environment access for refinement. We propose VLMFP, a Dual-VLM-guided framework that can autonomously generate both PDDL problem and domain files for formal visual planning. VLMFP introduces two VLMs to ensure reliable PDDL file generation: A SimVLM that simulates action consequences based on input rule descriptions, and a GenVLM that generates and iteratively refines PDDL files by comparing the PDDL and SimVLM execution results. VLMFP unleashes multiple levels of generalizability: The same generated PDDL domain file works for all the different instances under the same problem, and VLMs generalize to different problems with varied appearances and rules. We evaluate VLMFP with 6 grid-world domains and test its generalization to unseen instances, appearance, and game rules. On average, SimVLM accurately describes 95.5%, 82.6% of scenarios, simulates 85.5%, 87.8% of action sequence, and judges 82.4%, 85.6% goal reaching for seen and unseen appearances, respectively. With the guidance of SimVLM, VLMFP can generate PDDL files to reach 70.0%, 54.1% valid plans for unseen instances in seen and unseen appearances, respectively. Project page: this https URL.
[NLP-8] When Names Disappear: Revealing What LLM s Actually Understand About Code
[Quick Read]: This paper examines the extent to which current Large Language Models (LLMs) rely on human-interpretable naming in code-understanding tasks, i.e., whether performance gains stem from memorizing identifier-naming patterns rather than genuine structural-semantic understanding. The authors show that existing benchmarks inflate performance through naming leakage on both execution tasks and intent-level tasks such as summarization, masking models' true grasp of program behavior. The key to the solution is a suite of semantics-preserving obfuscations that systematically suppress naming cues while leaving program behavior unchanged, yielding ClassEval-Obf, an obfuscation-enhanced benchmark that more faithfully assesses LLMs' code understanding and generalization.
Link: https://arxiv.org/abs/2510.03178
Authors: Cuong Chi Le, Minh V.T. Pham, Cuong Duc Van, Hoang N. Phan, Huy N. Phan, Tien N. Nguyen
Affiliations: FPT Software AI Center; Nanyang Technological University; University of Texas at Dallas
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) achieve strong results on code tasks, but how they derive program meaning remains unclear. We argue that code communicates through two channels: structural semantics, which define formal behavior, and human-interpretable naming, which conveys intent. Removing the naming channel severely degrades intent-level tasks such as summarization, where models regress to line-by-line descriptions. Surprisingly, we also observe consistent reductions on execution tasks that should depend only on structure, revealing that current benchmarks reward memorization of naming patterns rather than genuine semantic reasoning. To disentangle these effects, we introduce a suite of semantics-preserving obfuscations and show that they expose identifier leakage across both summarization and execution. Building on these insights, we release ClassEval-Obf, an obfuscation-enhanced benchmark that systematically suppresses naming cues while preserving behavior. Our results demonstrate that ClassEval-Obf reduces inflated performance gaps, weakens memorization shortcuts, and provides a more reliable basis for assessing LLMs’ code understanding and generalization.
[NLP-9] Topic Modeling as Long-Form Generation: Can Long-Context LLMs revolutionize NTM via Zero-Shot Prompting?
[Quick Read]: This paper questions whether traditional Neural Topic Models (NTMs) remain fit for purpose in the era of large language models (LLMs), given their reliance on complex inference and generation networks to learn latent topic distributions. The key to the solution is a new paradigm that frames topic modeling as a long-form generation task, with a plug-and-play implementation: sample a data subset, generate topics and representative text with a dedicated prompt, and assign texts via keyword matching. The method requires no training and directly exploits the zero-shot capabilities of LLMs, enabling practical and efficient topic modeling without NTM architectures.
Link: https://arxiv.org/abs/2510.03174
Authors: Xuan Xu, Haolun Li, Zhongliang Yang, Beilin Chu, Jia Song, Moxuan Xu, Linna Zhou
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Traditional topic models such as neural topic models rely on inference and generation networks to learn latent topic distributions. This paper explores a new paradigm for topic modeling in the era of large language models, framing TM as a long-form generation task whose definition is updated in this paradigm. We propose a simple but practical approach to implement LLM-based topic model tasks out of the box (sample a data subset, generate topics and representative text with our prompt, text assignment with keyword match). We then investigate whether the long-form generation paradigm can beat NTMs via zero-shot prompting. We conduct a systematic comparison between NTMs and LLMs in terms of topic quality and empirically examine the claim that “a majority of NTMs are outdated.”
[NLP-10] Neural Correlates of Language Models Are Specific to Human Language NEURIPS2025
[Quick Read]: This paper examines the robustness of previously reported correlations between the hidden states of large language models (LLMs) and fMRI brain responses, asking whether those correlations are artifacts of the curse of dimensionality, the choice of similarity measure, the models' training data, or positional encoding. The key to the solution is systematically controlling for four potential confounds: (i) the correlations persist after dimensionality reduction, ruling out chance correlations in high-dimensional spaces; (ii) the results replicate under new similarity measures; (iii) only models trained on human language show strong alignment with brain representations, underscoring the importance of language training; and (iv) the results depend on the presence of positional encoding, revealing the influence of model structure on biological plausibility. Together these findings strengthen the evidence for a substantive similarity between LLMs and the brain's language processing and inform the debate on the biological plausibility and interpretability of state-of-the-art language models.
Link: https://arxiv.org/abs/2510.03156
Authors: Iñigo Parra
Affiliations: University of California, Berkeley
Categories: Computation and Language (cs.CL)
Comments: To be presented at NeurIPS 2025 Workshops
Abstract:Previous work has shown correlations between the hidden states of large language models and fMRI brain responses, on language tasks. These correlations have been taken as evidence of the representational similarity of these models and brain states. This study tests whether these previous results are robust to several possible concerns. Specifically this study shows: (i) that the previous results are still found after dimensionality reduction, and thus are not attributable to the curse of dimensionality; (ii) that previous results are confirmed when using new measures of similarity; (iii) that correlations between brain representations and those from models are specific to models trained on human language; and (iv) that the results are dependent on the presence of positional encoding in the models. These results confirm and strengthen the results of previous research and contribute to the debate on the biological plausibility and interpretability of state-of-the-art large language models.
[NLP-11] EditLens: Quantifying the Extent of AI Editing in Text
[Quick Read]: This paper addresses the problem of identifying "AI-edited text", text produced when generative AI edits human writing, as distinct from purely human-written or purely AI-generated text. Whereas prior work focuses on detecting fully AI-generated content, this paper shows that AI-edited text is detectable and that the extent of editing is quantifiable. The key to the solution: first, lightweight similarity metrics quantify the magnitude of AI editing given the original human-written text, validated against human annotators; then, using these metrics as intermediate supervision, the authors train EditLens, a regression model that predicts the amount of AI editing in a text. The model achieves state-of-the-art performance on binary (F1=94.7%) and ternary (F1=90.4%) classification of human, AI, and mixed writing, enabling precise measurement of AI-editing intensity with implications for authorship attribution, education, and policy.
Link: https://arxiv.org/abs/2510.03154
Authors: Katherine Thai, Bradley Emi, Elyas Masrour, Mohit Iyyer
Affiliations: Pangram Labs; University of Massachusetts Amherst; University of Maryland, College Park
Categories: Computation and Language (cs.CL)
Comments:
Abstract:A significant proportion of queries to large language models ask them to edit user-provided text, rather than generate new text from scratch. While previous work focuses on detecting fully AI-generated text, we demonstrate that AI-edited text is distinguishable from human-written and AI-generated text. First, we propose using lightweight similarity metrics to quantify the magnitude of AI editing present in a text given the original human-written text and validate these metrics with human annotators. Using these similarity metrics as intermediate supervision, we then train EditLens, a regression model that predicts the amount of AI editing present within a text. Our model achieves state-of-the-art performance on both binary (F1=94.7%) and ternary (F1=90.4%) classification tasks in distinguishing human, AI, and mixed writing. Not only do we show that AI-edited text can be detected, but also that the degree of change made by AI to human writing can be detected, which has implications for authorship attribution, education, and policy. Finally, as a case study, we use our model to analyze the effects of AI-edits applied by Grammarly, a popular writing assistance tool. To encourage further research, we commit to publicly releasing our models and dataset.
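For intuition about the intermediate supervision signal, here is a minimal sketch of one "lightweight similarity metric" using only the standard library; difflib's ratio over word sequences is an illustrative stand-in, not necessarily one of the paper's metrics.

```python
import difflib

def edit_magnitude(original: str, edited: str) -> float:
    """Sketch of a lightweight AI-edit-magnitude metric (illustrative).

    Returns 0.0 when the texts are identical and approaches 1.0 as the
    edited text diverges completely from the original human writing.
    """
    ratio = difflib.SequenceMatcher(None, original.split(), edited.split()).ratio()
    return 1.0 - ratio

human = "The results suggest the method works well in practice."
ai_edited = "Our results demonstrate that the proposed method performs well in practice."
print(f"AI-edit magnitude: {edit_magnitude(human, ai_edited):.2f}")
```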
[NLP-12] Beyond the Final Layer: Intermediate Representations for Better Multilingual Calibration in Large Language Models
[Quick Read]: This paper addresses the poor confidence calibration of large language models (LLMs) in multilingual settings, i.e., the systematic misalignment between predicted confidence and actual accuracy that is especially severe for non-English languages. The key to the solution lies in exploiting layer-wise differences in the model's internal representations: the final layer, biased by English-centric training, provides a poor confidence signal, whereas late-intermediate layers offer more reliable, better-calibrated representations. Building on this insight, the authors propose training-free methods, including Language-Aware Confidence Ensemble (LACE), which adaptively selects an optimal ensemble of layers for each language, improving multilingual confidence calibration and moving toward more globally equitable and trustworthy LLMs.
Link: https://arxiv.org/abs/2510.03136
Authors: Ej Zhou, Caiqi Zhang, Tiancheng Hu, Chengzu Li, Nigel Collier, Ivan Vulić, Anna Korhonen
Affiliations: Language Technology Lab, University of Cambridge
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Confidence calibration, the alignment of a model’s predicted confidence with its actual accuracy, is crucial for the reliable deployment of Large Language Models (LLMs). However, this critical property remains largely under-explored in multilingual contexts. In this work, we conduct the first large-scale, systematic studies of multilingual calibration across six model families and over 100 languages, revealing that non-English languages suffer from systematically worse calibration. To diagnose this, we investigate the model’s internal representations and find that the final layer, biased by English-centric training, provides a poor signal for multilingual confidence. In contrast, our layer-wise analysis uncovers a key insight that late-intermediate layers consistently offer a more reliable and better-calibrated signal. Building on this, we introduce a suite of training-free methods, including Language-Aware Confidence Ensemble (LACE), which adaptively selects an optimal ensemble of layers for each specific language. Our study highlights the hidden costs of English-centric alignment and offer a new path toward building more globally equitable and trustworthy LLMs by looking beyond the final layer.
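A minimal sketch of the layer-selection idea, assuming per-layer confidence scores on a held-out set for one language; the ECE criterion and the top-k selection rule are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def ece(conf: np.ndarray, correct: np.ndarray, bins: int = 10) -> float:
    """Expected Calibration Error over equal-width confidence bins."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (conf >= lo) & (conf < hi)
        if m.any():
            err += m.mean() * abs(conf[m].mean() - correct[m].mean())
    return err

def select_layer_ensemble(layer_confs: np.ndarray, correct: np.ndarray, k: int = 3):
    """Sketch of LACE-style per-language layer selection (illustrative):
    pick the k layers whose confidences are best calibrated on held-out
    data and average them. layer_confs: [n_layers, n_examples]."""
    per_layer = [ece(c, correct) for c in layer_confs]
    best = np.argsort(per_layer)[:k]  # best-calibrated layers first
    return best, layer_confs[best].mean(axis=0)

rng = np.random.default_rng(0)
confs = rng.uniform(0.3, 1.0, size=(32, 500))     # 32 layers, 500 dev examples
labels = rng.integers(0, 2, size=500).astype(float)
layers, ensemble_conf = select_layer_ensemble(confs, labels)
```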
[NLP-13] SurveyBench: How Well Can LLM(-Agents) Write Academic Surveys?
[Quick Read]: This paper addresses the fact that current generative AI systems for academic survey writing (LLM4Survey) fall short of human standards, and that no rigorous, reader-aligned benchmark exists to reveal their deficiencies. The key to the solution is SurveyBench, a fine-grained, quiz-driven evaluation framework with three components: (1) survey topics sourced from 11,343 recent arXiv papers and 4,947 corresponding high-quality surveys; (2) a multifaceted metric hierarchy covering outline quality (e.g., coverage breadth, logical coherence), content quality (e.g., synthesis granularity, clarity of insights), and non-textual richness; and (3) a dual-mode evaluation protocol with content-based and quiz-based answerability tests explicitly aligned with readers' informational needs. The benchmark effectively exposes the shortcomings of existing LLM4Survey approaches (e.g., on average 21% below human performance in content-based evaluation).
Link: https://arxiv.org/abs/2510.03120
Authors: Zhaojun Sun, Xuzhou Zhu, Xuanhe Zhou, Xin Tong, Shuo Wang, Jie Fu, Guoliang Li, Zhiyuan Liu, Fan Wu
Affiliations: Shanghai Jiao Tong University; Tsinghua University; Shanghai AI Laboratory
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Academic survey writing, which distills vast literature into a coherent and insightful narrative, remains a labor-intensive and intellectually demanding task. While recent approaches, such as general DeepResearch agents and survey-specialized methods, can generate surveys automatically (a.k.a. LLM4Survey), their outputs often fall short of human standards and there lacks a rigorous, reader-aligned benchmark for thoroughly revealing their deficiencies. To fill the gap, we propose a fine-grained, quiz-driven evaluation framework SurveyBench, featuring (1) typical survey topics source from recent 11,343 arXiv papers and corresponding 4,947 high-quality surveys; (2) a multifaceted metric hierarchy that assesses the outline quality (e.g., coverage breadth, logical coherence), content quality (e.g., synthesis granularity, clarity of insights), and non-textual richness; and (3) a dual-mode evaluation protocol that includes content-based and quiz-based answerability tests, explicitly aligned with readers’ informational needs. Results show SurveyBench effectively challenges existing LLM4Survey approaches (e.g., on average 21% lower than human in content-based evaluation).
[NLP-14] Listening or Reading? Evaluating Speech Awareness in Chain-of-Thought Speech-to-Text Translation
[Quick Read]: This paper addresses two major limitations of cascading Automatic Speech Recognition (ASR) with Text-to-Text Translation (T2TT) in Speech-to-Text Translation (S2TT) systems: error propagation and the inability to exploit acoustic cues such as prosody. The core question is whether Chain-of-Thought (CoT) prompting, with joint access to speech and transcript, overcomes these limitations; the analysis finds that CoT largely mirrors cascaded behavior, relying mainly on the transcript while barely leveraging the speech, so the two modalities are not genuinely integrated. The study further shows that simple training interventions, such as adding direct S2TT data or injecting noisy transcripts, markedly improve robustness and speech attribution, pointing toward architectures that explicitly integrate acoustic information into translation.
Link: https://arxiv.org/abs/2510.03115
Authors: Jacobo Romero-Díaz, Gerard I. Gállego, Oriol Pareras, Federico Costa, Javier Hernando, Cristina España-Bonet
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD)
Comments:
Abstract:Speech-to-Text Translation (S2TT) systems built from Automatic Speech Recognition (ASR) and Text-to-Text Translation (T2TT) modules face two major limitations: error propagation and the inability to exploit prosodic or other acoustic cues. Chain-of-Thought (CoT) prompting has recently been introduced, with the expectation that jointly accessing speech and transcription will overcome these issues. Analyzing CoT through attribution methods, robustness evaluations with corrupted transcripts, and prosody-awareness, we find that it largely mirrors cascaded behavior, relying mainly on transcripts while barely leveraging speech. Simple training interventions, such as adding Direct S2TT data or noisy transcript injection, enhance robustness and increase speech attribution. These findings challenge the assumed advantages of CoT and highlight the need for architectures that explicitly integrate acoustic information into translation.
[NLP-15] Semantic Similarity in Radiology Reports via LLM s and NER
[Quick Read]: This paper addresses the difficulty of identifying semantic differences between preliminary and final radiology reports, a capability that supports junior radiologists' training and helps uncover gaps in clinical knowledge. Existing LLM- and Named-Entity-Recognition (NER)-based approaches fall short in semantic-similarity assessment. The key to the solution is Llama-EntScore, which combines Llama 3.1 with NER and introduces tunable weights to emphasize or de-emphasize specific types of differences, producing a quantitative similarity score for tracking progress together with an interpretation of the score to guide report review and refinement. The method achieves 67% exact-match accuracy and 93% accuracy within +/- 1 against radiologist-provided ground-truth scores, outperforming LLMs and NER used independently.
Link: https://arxiv.org/abs/2510.03102
Authors: Beth Pearson, Ahmed Adnan, Zahraa Abdallah
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Radiology report evaluation is a crucial part of radiologists’ training and plays a key role in ensuring diagnostic accuracy. As part of the standard reporting workflow, a junior radiologist typically prepares a preliminary report, which is then reviewed and edited by a senior radiologist to produce the final report. Identifying semantic differences between preliminary and final reports is essential for junior doctors, both as a training tool and to help uncover gaps in clinical knowledge. While AI in radiology is a rapidly growing field, the application of large language models (LLMs) remains challenging due to the need for specialised domain knowledge. In this paper, we explore the ability of LLMs to provide explainable and accurate comparisons of reports in the radiology domain. We begin by comparing the performance of several LLMs in comparing radiology reports. We then assess a more traditional approach based on Named-Entity-Recognition (NER). However, both approaches exhibit limitations in delivering accurate feedback on semantic similarity. To address this, we propose Llama-EntScore, a semantic similarity scoring method using a combination of Llama 3.1 and NER with tunable weights to emphasise or de-emphasise specific types of differences. Our approach generates a quantitative similarity score for tracking progress and also gives an interpretation of the score that aims to offer valuable guidance in reviewing and refining their reporting. We find our method achieves 67% exact-match accuracy and 93% accuracy within +/- 1 when compared to radiologist-provided ground truth scores - outperforming both LLMs and NER used independently. Code is available at: this https URL
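To make the "tunable weights" idea concrete, here is a minimal sketch of a weighted entity-set comparison; the difference categories, weights, and scoring formula are illustrative assumptions, and the paper's full method additionally uses Llama 3.1 alongside NER.

```python
def ent_score(prelim_ents: set, final_ents: set, weights: dict) -> float:
    """Sketch of an Llama-EntScore-style weighted entity comparison.

    `weights` lets specific difference types be emphasised or de-emphasised,
    e.g. penalising findings missing from the preliminary report more than
    extra ones. Keys and values here are assumptions for illustration.
    """
    missed = final_ents - prelim_ents   # findings the junior report lacked
    added = prelim_ents - final_ents    # findings the senior removed
    agreed = prelim_ents & final_ents
    total = len(prelim_ents | final_ents) or 1
    score = (len(agreed)
             - weights["missed"] * len(missed)
             - weights["added"] * len(added)) / total
    return max(0.0, score)

s = ent_score({"pleural effusion", "cardiomegaly"},
              {"pleural effusion", "pneumothorax"},
              weights={"missed": 1.0, "added": 0.5})
```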
[NLP-16] Revisiting Direct Speech-to-Text Translation with Speech LLM s: Better Scaling than CoT Prompting?
[Quick Read]: This paper studies how two prompting strategies for Speech-to-Text Translation (S2TT), Chain-of-Thought (CoT) prompting and Direct prompting, scale with data. The key to the solution is a systematic comparison of the two under increasing amounts of S2TT training data: an ASR corpus is pseudo-labeled by translating its transcriptions into six European languages to build a large S2TT dataset, on which LLM-based S2TT systems are trained with both prompting strategies at different data scales. The central finding is that Direct prompting improves more consistently as data grows, suggesting it may become the more effective paradigm as larger S2TT resources become available.
Link: https://arxiv.org/abs/2510.03093
Authors: Oriol Pareras, Gerard I. Gállego, Federico Costa, Cristina España-Bonet, Javier Hernando
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD)
Comments:
Abstract:Recent work on Speech-to-Text Translation (S2TT) has focused on LLM-based models, introducing the increasingly adopted Chain-of-Thought (CoT) prompting, where the model is guided to first transcribe the speech and then translate it. CoT typically outperforms direct prompting primarily because it can exploit abundant Automatic Speech Recognition (ASR) and Text-to-Text Translation (T2TT) datasets to explicitly model its steps. In this paper, we systematically compare CoT and Direct prompting under increasing amounts of S2TT data. To this end, we pseudo-label an ASR corpus by translating its transcriptions into six European languages, and train LLM-based S2TT systems with both prompting strategies at different data scales. Our results show that Direct improves more consistently as the amount of data increases, suggesting that it may become a more effective approach as larger S2TT resources are created.
[NLP-17] Semantic Differentiation in Speech Emotion Recognition: Insights from Descriptive and Expressive Speech Roles EMNLP2025
[Quick Read]: This paper addresses the limited accuracy of Speech Emotion Recognition (SER) caused by the complexity of emotional nuance in speech. The key to the solution is distinguishing descriptive semantics from expressive semantics: descriptive semantics reflects the contextual content of an utterance and aligns with intended emotion labels, while expressive semantics reflects the speaker's emotional state and correlates with evoked emotional responses. Experiments validate this distinction, providing a foundation for building more context-aware intelligent systems.
Link: https://arxiv.org/abs/2510.03060
Authors: Rongchen Guo, Vincent Francoeur, Isar Nejadgholi, Sylvain Gagnon, Miodrag Bolic
Affiliations: University of Ottawa; National Research Council Canada
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to the *SEM conference collocated with EMNLP 2025
Abstract:Speech Emotion Recognition (SER) is essential for improving human-computer interaction, yet its accuracy remains constrained by the complexity of emotional nuances in speech. In this study, we distinguish between descriptive semantics, which represents the contextual content of speech, and expressive semantics, which reflects the speaker’s emotional state. After watching emotionally charged movie segments, we recorded audio clips of participants describing their experiences, along with the intended emotion tags for each clip, participants’ self-rated emotional responses, and their valence/arousal scores. Through experiments, we show that descriptive semantics align with intended emotions, while expressive semantics correlate with evoked emotions. Our findings inform SER applications in human-AI interaction and pave the way for more context-aware AI systems.
[NLP-18] Grounding Large Language Models in Clinical Evidence: A Retrieval-Augmented Generation System for Querying UK NICE Clinical Guidelines
[Quick Read]: This paper addresses the difficulty of using the UK National Institute for Health and Care Excellence (NICE) clinical guidelines, whose length and volume impede utilisation in time-constrained healthcare settings. The key to the solution is a Retrieval-Augmented Generation (RAG) system whose retrieval architecture uses a hybrid embedding mechanism combined with large language models (LLMs) to answer natural-language queries precisely. Evaluated over a database of 10,195 text chunks, retrieval performs strongly (MRR of 0.814; 99.1% recall within the top ten chunks), and RAG markedly improves answer faithfulness in the generation phase: the RAG-enhanced O4-Mini model reaches 99.5% faithfulness versus 43% for the medical-specialized Meditron3-8B model without RAG, while a perfect Context Precision score of 1 confirms that grounding answers in source material prevents information fabrication, establishing RAG as a reliable, scalable path for generative AI in healthcare.
Link: https://arxiv.org/abs/2510.02967
Authors: Matthew Lewis, Samuel Thio, Richard JB Dobson, Spiros Denaxas
Affiliations: University College London; King’s College London; CogStack Limited; Interdisciplinary Transformation University (IT:U); British Heart Foundation Data Science Centre; National and Kapodistrian University of Athens
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:
Abstract:This paper presents the development and evaluation of a Retrieval-Augmented Generation (RAG) system for querying the United Kingdom’s National Institute for Health and Care Excellence (NICE) clinical guidelines using Large Language Models (LLMs). The extensive length and volume of these guidelines can impede their utilisation within a time-constrained healthcare system, a challenge this project addresses through the creation of a system capable of providing users with precisely matched information in response to natural language queries. The system’s retrieval architecture, composed of a hybrid embedding mechanism, was evaluated against a database of 10,195 text chunks derived from three hundred guidelines. It demonstrates high performance, with a Mean Reciprocal Rank (MRR) of 0.814, a Recall of 81% at the first chunk and of 99.1% within the top ten retrieved chunks, when evaluated on 7901 queries. The most significant impact of the RAG system was observed during the generation phase. When evaluated on a manually curated dataset of seventy question-answer pairs, RAG-enhanced models showed substantial gains in performance. Faithfulness, the measure of whether an answer is supported by the source text, was increased by 64.7 percentage points to 99.5% for the RAG-enhanced O4-Mini model and significantly outperformed the medical-focused Meditron3-8B LLM, which scored 43%. This, combined with a perfect Context Precision score of 1 for all RAG-enhanced models, confirms the system’s ability to prevent information fabrication by grounding its answers in relevant source material. This study thus establishes RAG as an effective, reliable, and scalable approach for applying generative AI in healthcare, enabling cost-effective access to medical guidelines.
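The retrieval numbers quoted above (MRR, recall within the top k chunks) follow the standard definitions, sketched below for a single query; this is the generic formula, not the paper's evaluation code.

```python
def mrr_and_recall(ranked_ids: list, gold_id: str, k: int = 10):
    """Reciprocal rank and recall@k for one query.

    ranked_ids: chunk ids ordered by retrieval score.
    gold_id: the relevant chunk for this query.
    Averaging the first value over all queries gives MRR; averaging the
    second gives recall@k.
    """
    try:
        rank = ranked_ids.index(gold_id) + 1  # 1-indexed rank of the gold chunk
    except ValueError:
        return 0.0, 0.0                       # gold chunk not retrieved at all
    return 1.0 / rank, float(rank <= k)

rr, hit = mrr_and_recall(["c3", "c7", "c1"], "c7")  # rank 2 -> RR 0.5, hit@10 = 1
```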
[NLP-19] Leave No TRACE: Black-box Detection of Copyrighted Dataset Usage in Large Language Models via Watermarking
[Quick Read]: This paper addresses unauthorized use of copyrighted data in domain-specific fine-tuning of Large Language Models (LLMs). Existing methods such as membership inference attacks (MIAs) typically require internal signals (e.g., logits), while black-box approaches depend on handcrafted prompts or clean reference datasets for calibration, limiting practicality. The key to the solution is TRACE, a fully black-box detection framework with two core innovations: (1) rewriting training data with distortion-free watermarks guided by a private key, preserving text quality and downstream utility; and (2) exploiting the "radioactivity" effect of fine-tuning on watermarked data via an entropy-gated procedure that scores only high-uncertainty tokens, substantially amplifying detection power. TRACE achieves strong statistical significance across diverse datasets and model families, supports multi-dataset attribution, and remains robust after continued pretraining on large non-watermarked corpora.
Link: https://arxiv.org/abs/2510.02962
Authors: Jingqi Zhang, Ruibo Chen, Yingqing Yang, Peihua Mai, Heng Huang, Yan Pang
Affiliations: Cranberry-Lemon University; University of the Witwatersrand
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) are increasingly fine-tuned on smaller, domain-specific datasets to improve downstream performance. These datasets often contain proprietary or copyrighted material, raising the need for reliable safeguards against unauthorized use. Existing membership inference attacks (MIAs) and dataset-inference methods typically require access to internal signals such as logits, while current black-box approaches often rely on handcrafted prompts or a clean reference dataset for calibration, both of which limit practical applicability. Watermarking is a promising alternative, but prior techniques can degrade text quality or reduce task performance. We propose TRACE, a practical framework for fully black-box detection of copyrighted dataset usage in LLM fine-tuning. TRACE rewrites datasets with distortion-free watermarks guided by a private key, ensuring both text quality and downstream utility. At detection time, we exploit the radioactivity effect of fine-tuning on watermarked data and introduce an entropy-gated procedure that selectively scores high-uncertainty tokens, substantially amplifying detection power. Across diverse datasets and model families, TRACE consistently achieves significant detections (p < 0.05), often with extremely strong statistical evidence. Furthermore, it supports multi-dataset attribution and remains robust even after continued pretraining on large non-watermarked corpora. These results establish TRACE as a practical route to reliable black-box verification of copyrighted dataset usage. We will make our code available at: this https URL.
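A minimal sketch of the entropy-gated scoring idea, assuming per-token entropies and key-derived green-list membership are already computed; the threshold, the null green fraction of 0.5, and the z-test are illustrative assumptions.

```python
import math

def entropy_gated_score(token_entropies, green_flags, gate: float = 2.0):
    """Sketch of TRACE-style entropy-gated watermark detection (illustrative).

    token_entropies: per-token predictive entropy from the suspect model.
    green_flags: whether each token falls in the key-derived "green" set.
    gate: entropy threshold; only uncertain tokens are scored (assumed value).
    Returns the green-token rate over gated positions and a z-score against
    the no-watermark null; without fine-tuning on watermarked data the rate
    should stay near the green-list fraction.
    """
    scored = [g for h, g in zip(token_entropies, green_flags) if h >= gate]
    if not scored:
        return 0.0, 0.0
    rate = sum(scored) / len(scored)
    p0 = 0.5  # assumed green fraction under the null
    z = (rate - p0) / math.sqrt(p0 * (1 - p0) / len(scored))
    return rate, z

rate, z = entropy_gated_score([0.4, 2.7, 3.1, 0.2, 2.2],
                              [True, True, True, False, True])
```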
[NLP-20] Finding Diamonds in Conversation Haystacks: A Benchmark for Conversational Data Retrieval EMNLP2025
[Quick Read]: This paper addresses the lack of a comprehensive, reliable benchmark for Conversational Data Retrieval (CDR), i.e., retrieving relevant information from large-scale conversation data for product insights. The key to the solution is the first comprehensive CDR benchmark, with 1.6k queries across five analytical tasks and 9.1k conversations. A systematic evaluation of 16 popular embedding models finds that even the best reaches only around NDCG@10 of 0.51, well below document-retrieval performance, exposing the distinctive challenges of conversational data (implicit state recognition, turn dynamics, contextual references); the benchmark also provides practical query templates and detailed per-task error analysis, establishing a standard evaluation framework and directions for improvement.
Link: https://arxiv.org/abs/2510.02938
Authors: Yohan Lee, Yongwoo Song, Sangyeop Kim
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted by EMNLP 2025 Industry Track
Abstract:We present the Conversational Data Retrieval (CDR) benchmark, the first comprehensive test set for evaluating systems that retrieve conversation data for product insights. With 1.6k queries across five analytical tasks and 9.1k conversations, our benchmark provides a reliable standard for measuring conversational data retrieval performance. Our evaluation of 16 popular embedding models shows that even the best models reach only around NDCG@10 of 0.51, revealing a substantial gap between document and conversational data retrieval capabilities. Our work identifies unique challenges in conversational data retrieval (implicit state recognition, turn dynamics, contextual references) while providing practical query templates and detailed error analysis across different task categories. The benchmark dataset and code are available at this https URL.
[NLP-21] Self-Reflective Generation at Test Time
[Quick Read]: This paper addresses error propagation in long chain-of-thought reasoning with large language models (LLMs): the forward-only autoregressive generation process is fragile, so early token errors cascade and degrade final outputs. Existing self-reflection methods either revise complete drafts or rely on expensive training to learn self-correction, both reactive and inefficient. The key to the solution is Self-Reflective Generation at Test Time (SRGen), a lightweight test-time framework that reflects before generating at uncertain points: during generation it identifies high-uncertainty tokens via dynamic entropy thresholding and, for each such token, trains a specific corrective vector that fully exploits the already-generated context for self-reflective generation, adjusting the token probability distribution to make more trustworthy decisions. The mechanism needs no extra training, corrects errors efficiently by retrospectively analyzing the partial output, delivers consistent gains in single-pass quality and self-consistency voting on challenging mathematical-reasoning benchmarks, and composes well with other training-time and test-time techniques at bounded overhead.
Link: https://arxiv.org/abs/2510.02919
Authors: Jian Mu, Qixin Zhang, Zhiyong Wang, Menglin Yang, Shuang Qiu, Chengwei Qin, Zhongxiang Dai, Yao Shu
Affiliations: Hong Kong University of Science and Technology (Guangzhou); Nanyang Technological University; University of Edinburgh; City University of Hong Kong; The Chinese University of Hong Kong, Shenzhen
Categories: Computation and Language (cs.CL)
Comments: 24 pages, 8 figures
Abstract:Large language models (LLMs) increasingly solve complex reasoning tasks via long chain-of-thought, but their forward-only autoregressive generation process is fragile; early token errors can cascade, which creates a clear need for self-reflection mechanisms. However, existing self-reflection either performs revisions over full drafts or learns self-correction via expensive training, both fundamentally reactive and inefficient. To address this, we propose Self-Reflective Generation at Test Time (SRGen), a lightweight test-time framework that reflects before generating at uncertain points. During token generation, SRGen utilizes dynamic entropy thresholding to identify high-uncertainty tokens. For each identified token, it trains a specific corrective vector, which fully exploits the already generated context for a self-reflective generation to correct the token probability distribution. By retrospectively analyzing the partial output, this self-reflection enables more trustworthy decisions, thereby significantly reducing the probability of errors at highly uncertain points. Evaluated on challenging mathematical reasoning benchmarks and a diverse set of LLMs, SRGen can consistently strengthen model reasoning: improvements in single-pass quality also translate into stronger self-consistency voting. Especially, on AIME2024 with DeepSeek-R1-Distill-Qwen-7B, SRGen yields absolute improvements of +12.0% on Pass@1 and +13.3% on Cons@5. Moreover, our findings position SRGen as a plug-and-play method that integrates reflection into the generation process for reliable LLM reasoning, achieving consistent gains with bounded overhead and broad composability with other training-time (e.g., RLHF) and test-time (e.g., SLOT) techniques.
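A minimal sketch of the dynamic entropy gate that decides when to reflect; the warm-up length and the mean-plus-k-sigma rule are illustrative assumptions, and the corrective-vector fitting step that SRGen would then run is omitted.

```python
import torch

def is_uncertain(logits: torch.Tensor, history: list, k: float = 1.5) -> bool:
    """Sketch of SRGen-style dynamic entropy gating (illustrative constants).

    Flags a decoding step whose next-token entropy exceeds the running mean
    by k standard deviations; SRGen would then fit a corrective vector on
    the partial output before committing the token.
    """
    probs = torch.softmax(logits, dim=-1)
    h = -(probs * torch.log(probs + 1e-12)).sum().item()
    history.append(h)
    if len(history) < 8:  # warm-up: accept early steps unconditionally
        return False
    mean = sum(history) / len(history)
    std = (sum((x - mean) ** 2 for x in history) / len(history)) ** 0.5
    return h > mean + k * std

hist: list = []
flagged = [is_uncertain(torch.randn(32000), hist) for _ in range(20)]
```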
[NLP-22] Constraint Satisfaction Approaches to Wordle: Novel Heuristics and Cross-Lexicon Validation
[Quick Read]: This paper addresses the limited efficiency and robustness of constraint satisfaction problem (CSP) solving for Wordle, where traditional solvers rely on information-entropy maximization or frequency heuristics without formal constraint treatment. The key to the solution is a new CSP formulation with two components: CSP-Aware Entropy, which computes information gain after constraint propagation rather than on the raw candidate set, valuing each decision more precisely; and a Probabilistic CSP framework that fuses Bayesian word-frequency priors with logical constraints, combining probabilistic and deterministic reasoning. On 2,315 English words the method achieves 3.54 average guesses with a 99.9% success rate, significantly outperforming Forward Checking (p < 0.001), and remains stable under noise; cross-lexicon validation on 500 Spanish words demonstrates that the core CSP principles transfer across languages, confirming the advantages of formal constraint handling and constraint-aware heuristics for structured puzzle solving.
Link: https://arxiv.org/abs/2510.02855
Authors: Jahidul Arafat, Fariha Tasmin, Sanjaya Poudel, Kamrujjaman, Eftakhar Ahmed Arnob, Ahsan Habib Tareq
Affiliations: Auburn University; Bangladesh University of Professionals; Bangladesh Army Int. University of Science and Technology; Green University of Bangladesh
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 35 pages, 14 figures, 10 tables. Open-source implementation with 91% test coverage available at this https URL
Abstract:Wordle presents an algorithmically rich testbed for constraint satisfaction problem (CSP) solving. While existing solvers rely on information-theoretic entropy maximization or frequency-based heuristics without formal constraint treatment, we present the first comprehensive CSP formulation of Wordle with novel constraint-aware solving strategies. We introduce CSP-Aware Entropy, computing information gain after constraint propagation rather than on raw candidate sets, and a Probabilistic CSP framework integrating Bayesian word-frequency priors with logical constraints. Through evaluation on 2,315 English words, CSP-Aware Entropy achieves 3.54 average guesses with 99.9% success rate, a statistically significant 1.7% improvement over Forward Checking (t=-4.82, p < 0.001, Cohen’s d=0.07) with 46% faster runtime (12.9ms versus 23.7ms per guess). Under 10% noise, CSP-aware approaches maintain 5.3 percentage point advantages (29.0% versus 23.7%, p=0.041), while Probabilistic CSP achieves 100% success across all noise levels (0-20%) through constraint recovery mechanisms. Cross-lexicon validation on 500 Spanish words demonstrates 88% success with zero language-specific tuning, validating that core CSP principles transfer across languages despite an 11.2 percentage point gap from linguistic differences (p < 0.001, Fisher’s exact test). Our open-source implementation with 34 unit tests achieving 91% code coverage provides reproducible infrastructure for CSP research. The combination of formal CSP treatment, constraint-aware heuristics, probabilistic-logical integration, robustness analysis, and cross-lexicon validation establishes new performance benchmarks demonstrating that principled constraint satisfaction techniques outperform classical information-theoretic and learning-based approaches for structured puzzle-solving domains.
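The distinction between CSP-Aware Entropy and plain entropy maximization is where the candidate set comes from, as the sketch below shows; the simplified feedback function (no duplicate-letter bookkeeping) is an assumption made for brevity.

```python
from collections import Counter
from math import log2

def feedback(guess: str, answer: str) -> str:
    """Wordle feedback: G=green, Y=yellow, B=gray (simplified: no
    duplicate-letter bookkeeping)."""
    return "".join(
        "G" if g == a else ("Y" if g in answer else "B")
        for g, a in zip(guess, answer)
    )

def csp_aware_entropy(guess: str, candidates: list) -> float:
    """Sketch of CSP-Aware Entropy: the entropy of the feedback-pattern
    distribution is computed over the candidate set AFTER constraint
    propagation, i.e. `candidates` must already satisfy every constraint
    accumulated from previous guesses."""
    patterns = Counter(feedback(guess, c) for c in candidates)
    n = len(candidates)
    return -sum((m / n) * log2(m / n) for m in patterns.values())

# The list below stands in for a constraint-propagated candidate set.
print(csp_aware_entropy("crane", ["crane", "crate", "trace", "brace"]))
```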
[NLP-23] Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents
[Quick Read]: This paper addresses the limitation of evaluating tool-augmented large language model (LLM) agents on complex tasks solely by final-answer matching, which ignores key dimensions of the reasoning trajectory such as efficiency, hallucination, and adaptivity. The key to the solution is TRACE, a framework for multi-dimensional evaluation that introduces an evidence bank which accumulates knowledge from preceding reasoning steps, enabling multi-faceted, fine-grained analysis and evaluation of an agent's reasoning trajectory. This mechanism avoids the prohibitive cost of annotating all valid ground-truth trajectories and runs accurately and cost-effectively even with small open-source models, substantially improving the scalability of evaluation.
Link: https://arxiv.org/abs/2510.02837
Authors: Wonjoong Kim, Sangwu Park, Yeonjun In, Sein Kim, Dongha Lee, Chanyoung Park
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint. Under review
Abstract:Although recent tool-augmented benchmarks incorporate complex user requests and diverse tools, the evaluation methods for most of them remain limited to answer matching. However, as the number of steps required to resolve a user request increases, a proper evaluation of an agent’s performance must go beyond the final answer to also assess the problem-solving trajectory, including previously ignored aspects such as efficiency, hallucination, and adaptivity. The most straightforward method for evaluating these aspects is to compare an agent’s trajectory with the ground-truth trajectory, but this approach is fundamentally limited since annotating all valid ground-truth trajectories is prohibitively expensive. However, a simple LLM-based evaluator struggles to assess trajectories in detail without ground truth. To effectively evaluate the agents in this manner, we introduce TRACE, a framework for the multi-dimensional evaluation of tool-augmented LLM agent performance. By incorporating an evidence bank, which accumulates knowledge gathered from preceding reasoning steps, TRACE enables a multi-faceted analysis and evaluation of an agent’s reasoning trajectory effectively. To validate our framework, we develop a new meta-evaluation dataset by augmenting existing benchmarks with diverse and flawed trajectories, each labeled with multi-faceted performance scores. Our results confirm that TRACE accurately evaluates these complex behaviors in a scalable and cost-effective manner, even with small open-source LLMs. Furthermore, we apply our method to evaluate the trajectories that agents produce while solving tool-augmented tasks, presenting previously unreported observations and their corresponding insights.
[NLP-24] Evaluating Large Language Models for IUCN Red List Species Information
[Quick Read]: This paper addresses the uncertain reliability of Large Language Models (LLMs) for biodiversity-conservation assessment, specifically their uneven performance across the four core IUCN Red List components: taxonomy, conservation status, distribution, and threats. The study finds that LLMs excel at taxonomic classification (94.9% accuracy) but fail badly at reasoning-heavy tasks such as conservation-status assessment (27.2%), revealing a knowledge-reasoning gap, and that they exhibit systematic biases favoring charismatic vertebrates, potentially amplifying existing inequities in conservation resource allocation. The key to the solution is delineating clear deployment boundaries: use LLMs as information-retrieval tools rather than decision makers, in a hybrid human-in-the-loop arrangement where human experts retain sole authority over risk assessment and policy, balancing technological leverage with accountability.
Link: https://arxiv.org/abs/2510.02830
Authors: Shinya Uryu
Affiliations: Tokushima University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 20 pages, 7 figures
Abstract:Large Language Models (LLMs) are rapidly being adopted in conservation to address the biodiversity crisis, yet their reliability for species evaluation is uncertain. This study systematically validates five leading models on 21,955 species across four core IUCN Red List assessment components: taxonomy, conservation status, distribution, and threats. A critical paradox was revealed: models excelled at taxonomic classification (94.9%) but consistently failed at conservation reasoning (27.2% for status assessment). This knowledge-reasoning gap, evident across all models, suggests inherent architectural constraints, not just data limitations. Furthermore, models exhibited systematic biases favoring charismatic vertebrates, potentially amplifying existing conservation inequities. These findings delineate clear boundaries for responsible LLM deployment: they are powerful tools for information retrieval but require human oversight for judgment-based decisions. A hybrid approach is recommended, where LLMs augment expert capacity while human experts retain sole authority over risk assessment and policy.
[NLP-25] StepChain GraphRAG : Reasoning Over Knowledge Graphs for Multi-Hop Question Answering
[Quick Read]: This paper addresses how to integrate iterative reasoning steps with external knowledge retrieval in multi-hop question answering (QA) to improve both accuracy and interpretability. The key to the solution is StepChain GraphRAG, which unites question decomposition with a Breadth-First Search (BFS) Reasoning Flow: it first builds a global index over the corpus; at inference time only the retrieved passages are parsed on the fly into a knowledge graph, and the complex query is split into sub-questions; then, for each sub-question, a BFS-based traversal dynamically expands along relevant edges, assembling explicit evidence chains without overwhelming the language model with superfluous context. This design yields state-of-the-art Exact Match and F1 on MuSiQue, 2WikiMultiHopQA, and HotpotQA while enhancing the interpretability of the reasoning process.
Link: https://arxiv.org/abs/2510.02827
Authors: Tengjun Ni, Xin Yuan, Shenghong Li, Kai Wu, Ren Ping Liu, Wei Ni, Wenjie Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:
Abstract:Recent progress in retrieval-augmented generation (RAG) has led to more accurate and interpretable multi-hop question answering (QA). Yet, challenges persist in integrating iterative reasoning steps with external knowledge retrieval. To address this, we introduce StepChain GraphRAG, a framework that unites question decomposition with a Breadth-First Search (BFS) Reasoning Flow for enhanced multi-hop QA. Our approach first builds a global index over the corpus; at inference time, only retrieved passages are parsed on-the-fly into a knowledge graph, and the complex query is split into sub-questions. For each sub-question, a BFS-based traversal dynamically expands along relevant edges, assembling explicit evidence chains without overwhelming the language model with superfluous context. Experiments on MuSiQue, 2WikiMultiHopQA, and HotpotQA show that StepChain GraphRAG achieves state-of-the-art Exact Match and F1 scores. StepChain GraphRAG lifts average EM by 2.57% and F1 by 2.13% over the SOTA method, achieving the largest gain on HotpotQA (+4.70% EM, +3.44% F1). StepChain GraphRAG also fosters enhanced explainability by preserving the chain-of-thought across intermediate retrieval steps. We conclude by discussing how future work can mitigate the computational overhead and address potential hallucinations from large language models to refine efficiency and reliability in multi-hop QA.
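A minimal sketch of a BFS traversal that assembles evidence chains from a toy knowledge graph; the relevance filtering of edges against the current sub-question, central to the actual method, is omitted here for brevity.

```python
from collections import deque

def bfs_evidence_chains(graph: dict, seed: str, max_depth: int = 2) -> list:
    """Sketch of a BFS Reasoning Flow over an on-the-fly knowledge graph
    (illustrative; StepChain GraphRAG expands only along edges judged
    relevant to the current sub-question).

    graph: node -> list of (relation, neighbor) edges parsed from passages.
    Returns explicit evidence chains as lists of (node, relation, node).
    """
    chains, queue = [], deque([(seed, [])])
    while queue:
        node, path = queue.popleft()
        if len(path) >= max_depth:
            chains.append(path)
            continue
        for rel, nxt in graph.get(node, []):
            queue.append((nxt, path + [(node, rel, nxt)]))
        if not graph.get(node) and path:
            chains.append(path)  # dead end: keep the partial chain
    return chains

g = {"Alan Turing": [("born_in", "London")],
     "London": [("capital_of", "United Kingdom")]}
print(bfs_evidence_chains(g, "Alan Turing"))
```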
[NLP-26] NCV: A Node-Wise Consistency Verification Approach for Low-Cost Structured Error Localization in LLM Reasoning
[Quick Read]: This paper addresses two challenges in verifying multi-step reasoning by large language models (LLMs): imprecise error localization and high token cost. Existing methods either assess entire reasoning chains, suffering attention dilution, or rely on expensive multi-sampling. The key to the solution is Node-wise Consistency Verification (NCV), a training-free framework that decomposes the chain of thought (CoT) into interconnected verification nodes and performs a lightweight binary consistency check at each node, precisely localizing errors while avoiding redundant long-form generation and thereby improving both efficiency and interpretability.
Link: https://arxiv.org/abs/2510.02816
Authors: Yulong Zhang, Li Wang, Wei Du, Peilin Li, Yuqin Dai, Zhiyuan Zhao, Lingyong Fang, Ziniu Liu, Ru Zhang, Huijia Zhu, Gongshen Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Verifying multi-step reasoning in large language models is difficult due to imprecise error localization and high token costs. Existing methods either assess entire reasoning chains, suffering attention dilution, or rely on expensive multi-sampling. We introduce Node-wise Consistency Verification (NCV), a training-free framework that recasts verification as lightweight binary consistency checks at the node level. By decomposing the chain of thought into interconnected verification nodes, NCV precisely localizes errors and avoids unnecessary long-form generation. Experiments demonstrate that our approach enhances interpretability and efficiency, presenting a scalable solution for reliable LLM reasoning verification. On public datasets, NCV achieves a 10% to 25% improvement in F1 scores over baselines while utilizing 6× to 58× fewer tokens than traditional methods like CoT-based verifiers.
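A minimal sketch of the node-wise verification loop; the `check` callable stands in for the lightweight binary LLM judgment, and the toy judge below is purely illustrative.

```python
def ncv_verify(steps: list, check) -> int:
    """Sketch of Node-wise Consistency Verification (illustrative interface).

    steps: the chain of thought split into verification nodes.
    check: callable(premises, step) -> bool, a lightweight binary
    consistency check (an LLM call in the paper; any judge here).
    Returns the index of the first inconsistent node, or -1 if all pass.
    """
    for i, step in enumerate(steps):
        if not check(steps[:i], step):  # verify each node against its predecessors
            return i
    return -1

# Toy judge: a step is "consistent" unless it negates an earlier step.
toy_check = lambda premises, step: not any(
    step == f"not {p}" or p == f"not {step}" for p in premises
)
first_bad = ncv_verify(["x = 2", "y = x + 1", "not x = 2"], toy_check)
print(first_bad)  # 2 -> the third node contradicts the first
```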
[NLP-27] A Computational Framework for Interpretable Text-Based Personality Assessment from Social Media
[Quick Read]: This thesis addresses two core problems in personality assessment: the scarcity of large, richly labeled personality datasets, and the disconnect between personality psychology and natural language processing (NLP), which limits model validity and interpretability. To tackle these, the author builds two Reddit-derived datasets, MBTI9k and PANDORA; PANDORA contains over 17 million comments with MBTI and Big Five labels plus demographic information, substantially improving data scale, quality, and label coverage. On this basis, the thesis proposes SIMPA (Statement-to-Item Matching Personality Assessment), whose key idea is matching user-generated statements to validated questionnaire items via semantic similarity, delivering interpretable and efficient personality assessment; the framework is model-agnostic, uses layered cue detection, and scales to complex label taxonomies and variable cue-to-concept associations.
Link: https://arxiv.org/abs/2510.02811
Authors: Matej Gjurković
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: PhD thesis
Abstract:Personality refers to individual differences in behavior, thinking, and feeling. With the growing availability of digital footprints, especially from social media, automated methods for personality assessment have become increasingly important. Natural language processing (NLP) enables the analysis of unstructured text data to identify personality indicators. However, two main challenges remain central to this thesis: the scarcity of large, personality-labeled datasets and the disconnect between personality psychology and NLP, which restricts model validity and interpretability. To address these challenges, this thesis presents two datasets – MBTI9k and PANDORA – collected from Reddit, a platform known for user anonymity and diverse discussions. The PANDORA dataset contains 17 million comments from over 10,000 users and integrates the MBTI and Big Five personality models with demographic information, overcoming limitations in data size, quality, and label coverage. Experiments on these datasets show that demographic variables influence model validity. In response, the SIMPA (Statement-to-Item Matching Personality Assessment) framework was developed - a computational framework for interpretable personality assessment that matches user-generated statements with validated questionnaire items. By using machine learning and semantic similarity, SIMPA delivers personality assessments comparable to human evaluations while maintaining high interpretability and efficiency. Although focused on personality assessment, SIMPA’s versatility extends beyond this domain. Its model-agnostic design, layered cue detection, and scalability make it suitable for various research and practical applications involving complex label taxonomies and variable cue associations with target concepts.
zh
[NLP-28] Pareto-optimal Non-uniform Language Generation
【速读】: 该论文致力于解决语言生成在极限情形下的非均匀最优性问题,即如何设计算法,在面对可数个语言集合中某一目标语言 $ L $ 的字符串枚举时,确保生成的新字符串最终全部合法,并且生成时间 $ t(L) $ 对每个语言尽可能优化。此前工作虽提供了强非均匀生成保证,但其对不同语言的生成时间 $ t(L) $ 可能存在显著次优。本文提出一种帕累托最优(Pareto-optimal)的生成算法,其生成时间 $ t^\star(L) $ 在所有可能算法中达到近似最优:任何试图缩短某个语言 $ L $ 的生成时间的替代算法,必然导致另一语言 $ L’ $ 的生成时间变差。该方案的核心在于构建一个可适配至噪声和代表性生成等实际场景的通用算法框架,从而实现非均匀语言生成的理论最优性。
链接: https://arxiv.org/abs/2510.02795
作者: Moses Charikar,Chirag Pabbaraju
机构: Stanford University (斯坦福大学)
类目: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 24 pages, 1 figure
Abstract:Kleinberg and Mullainathan (2024) recently proposed an interesting model for language generation in the limit: Given a countable collection of languages, and an adversary enumerating the strings of some language L from the collection, the objective is to generate new strings from the target language, such that all strings generated beyond some finite time are valid. Li, Raman and Tewari (2024) and Charikar and Pabbaraju (2024) showed strong non-uniform generation guarantees in this model, giving algorithms that generate new valid strings from L after seeing a number of distinct input strings t(L) that depends only on L (and the collection), but not the enumeration order. However, for both these works, the language-wise generation times t(L) of the algorithm can be strictly sub-optimal. In this work, we study Pareto-optimality of non-uniform language generation in the limit. We propose an algorithm, whose generation times t^\star(L) are (almost) Pareto-optimal: any other algorithm whose generation time for some language L is strictly smaller than t^\star(L) , must satisfy that its generation time for some other language L’ is strictly worse than t^\star(L’) . Pareto-optimality is essentially the best that one can achieve for non-uniform generation. Our algorithmic framework conveniently adapts to further give Pareto-optimal non-uniform generation algorithms in the practically motivated settings of noisy as well as representative generation. Comments: 24 pages, 1 figure Subjects: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG) Cite as: arXiv:2510.02795 [cs.DS] (or arXiv:2510.02795v1 [cs.DS] for this version) https://doi.org/10.48550/arXiv.2510.02795 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-29] MaskCD: Mitigating LVLM Hallucinations by Image Head Masked Contrastive Decoding EMNLP2025
【速读】: 该论文旨在解决大视觉语言模型(Large Vision-Language Models, LVLMs)在多模态任务中普遍存在的幻觉(hallucination)问题,即模型生成的内容与输入的视觉和文本信息相矛盾的现象。现有方法如对比解码(contrastive decoding)和注意力调控(attention manipulation)存在构造对比样本困难或稳定性差的问题。本文提出图像头掩码对比解码(Image Head Masked Contrastive Decoding, MaskCD),其核心创新在于利用LVLM中的“图像头”(image heads)进行掩码操作,从而构建更有效的对比样本以实现稳定且高效的对比解码,显著缓解幻觉现象并保持模型的通用能力。
链接: https://arxiv.org/abs/2510.02790
作者: Jingyuan Deng,Yujiu Yang
机构: Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院,清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
备注: accepted to emnlp2025 findings
Abstract:Large vision-language models (LVLMs) have shown remarkable performance in visual-language understanding for downstream multimodal tasks. While their capabilities are improving, problems emerge simultaneously. Among those problems, the hallucinations have attracted much attention, which stands for the phenomenon where LVLMs generate contradictory content to their input visual and text contents. Many approaches have been proposed to deal with this issue, such as contrastive decoding and attention manipulation. However, contrastive decoding methods struggle in constructing appropriate contrastive samples, and attention manipulation methods are highly sensitive, lacking stability. In this work, we propose image head Masked Contrastive Decoding (MaskCD). Our approach utilizes the “image heads” in LVLMs, masking them to construct contrastive samples for contrastive decoding. We evaluated MaskCD on LLaVA-1.5-7b and Qwen-VL-7b, using various benchmarks such as CHAIR, POPE, AMBER and MME. The results demonstrate that MaskCD effectively alleviates the phenomenon of hallucinations and retains the general capabilities of LVLMs. Corresponding resources could be found at: this https URL .
zh
[NLP-30] XTRA: Cross-Lingual Topic Modeling with Topic and Representation Alignments EMNLP
【速读】: 该论文旨在解决跨语言主题建模(Cross-lingual topic modeling)中主题一致性(alignment quality)与主题可解释性(coherence and diversity)难以兼顾的问题。现有方法虽在主题多样性上有所提升,但常因缺乏有效的跨语言对齐机制而导致主题 coherence 下降或语言间分布不一致。其解决方案的关键在于提出 XTRA 框架,通过两个核心组件实现:(1) 表示对齐(representation alignment),利用对比学习在共享语义空间中对齐文档-主题分布;(2) 主题对齐(topic alignment),将不同语言的主题词分布投影至同一空间以强制跨语言一致性。这种双机制协同优化使得模型能够学习到既语义清晰又跨语言一致的主题结构。
链接: https://arxiv.org/abs/2510.02788
作者: Tien Phat Nguyen,Vu Minh Ngo,Tung Nguyen,Linh Van Ngo,Duc Anh Nguyen,Sang Dinh,Trung Le
机构: Hanoi University of Science and Technology (河内科学技术大学); University of Monash (蒙纳士大学)
类目: Computation and Language (cs.CL)
备注: 2025 EMNLP Findings
Abstract:Cross-lingual topic modeling aims to uncover shared semantic themes across languages. Several methods have been proposed to address this problem, leveraging both traditional and neural approaches. While previous methods have achieved some improvements in topic diversity, they often struggle to ensure high topic coherence and consistent alignment across languages. We propose XTRA (Cross-Lingual Topic Modeling with Topic and Representation Alignments), a novel framework that unifies Bag-of-Words modeling with multilingual embeddings. XTRA introduces two core components: (1) representation alignment, aligning document-topic distributions via contrastive learning in a shared semantic space; and (2) topic alignment, projecting topic-word distributions into the same space to enforce crosslingual consistency. This dual mechanism enables XTRA to learn topics that are interpretable (coherent and diverse) and well-aligned across languages. Experiments on multilingual corpora confirm that XTRA significantly outperforms strong baselines in topic coherence, diversity, and alignment quality. Code and reproducible scripts are available at https: //github.com/tienphat140205/XTRA.
zh
[NLP-31] A Granular Study of Safety Pretraining under Model Abliteration NEURIPS2025
【速读】: 该论文旨在解决开放权重大语言模型(Open-weight LLMs)在推理阶段通过简单激活编辑(activation edits)可能被规避安全干预措施的问题,特别是评估常见安全训练方法(如拒绝训练或元标签训练)在面对此类编辑时的鲁棒性。其解决方案的关键在于引入一种轻量级投影技术——模型消解(model abliteration),该技术旨在移除模型中对拒绝敏感的方向,并通过在SmolLM2-1.7B模型上系统性地评估多个安全预训练检查点(checkpoints),结合多轮人工标注与自动分类,量化不同安全组件在编辑后的失效程度,从而为将推理时编辑纳入安全评估提供可操作的协议。
链接: https://arxiv.org/abs/2510.02768
作者: Shashank Agnihotri,Jonas Jakubassa,Priyam Dey,Sachin Goyal,Bernt Schiele,Venkatesh Babu Radhakrishnan,Margret Keuper
机构: Data and Web Science Group, University of Mannheim, Germany (德国曼海姆大学数据与网络科学组); Vision and AI Lab, Indian Institute of Science, Bangalore, India (印度科学研究所视觉与人工智能实验室); Carnegie Mellon University, United States of America (美国卡内基梅隆大学); Max-Planck-Institute for Informatics, Saarland Informatics Campus, Germany (德国马普研究所信息学研究所,萨尔兰计算机科学园区)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Accepted at NeurIPS 2025 bWorkshop Lock-LLM. *Equal Contribution
Abstract:Open-weight LLMs can be modified at inference time with simple activation edits, which raises a practical question for safety: do common safety interventions like refusal training or metatag training survive such edits? We study model abliteration, a lightweight projection technique designed to remove refusal-sensitive directions, and conduct a controlled evaluation across a granular sequence of Safety Pretraining checkpoints for SmolLM2-1.7B, alongside widely used open baselines. For each of 20 systems, original and abliterated, we issue 100 prompts with balanced harmful and harmless cases, classify responses as Refusal or Non-Refusal using multiple judges, and validate judge fidelity on a small human-labeled subset. We also probe whether models can identify refusal in their own outputs. Our study produces a checkpoint-level characterization of which data-centric safety components remain robust under abliteration, quantifies how judge selection influences evaluation outcomes, and outlines a practical protocol for integrating inference-time edits into safety assessments. Code: this https URL.
zh
[NLP-32] he Path of Self-Evolving Large Language Models : Achieving Data-Efficient Learning via Intrinsic Feedback
【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)在提升大语言模型(Large Language Models, LLMs)推理能力时对大量标注数据的依赖问题。传统RL训练方法需要耗费大量人力进行任务设计与标注,而本文提出一种基于自知觉(self-awareness)的最小数据强化学习框架:其关键在于引入两个创新机制——(1)自知难度预测(self-aware difficulty prediction),使模型能够评估任务难度并优先选择具有挑战性但可解的任务;(2)自知极限突破(self-aware limit breaking),使模型能识别超出自身能力边界的任务,并主动请求外部数据以突破限制。实验表明,该方法在九个基准测试中实现了53.8%的相对性能提升,且仅需不到1.2%的额外数据,验证了自知觉强化学习在实现LLM自我进化训练中的有效性。
链接: https://arxiv.org/abs/2510.02752
作者: Hangfan Zhang,Siyuan Xu,Zhimeng Guo,Huaisheng Zhu,Shicheng Liu,Xinrun Wang,Qiaosheng Zhang,Yang Chen,Peng Ye,Lei Bai,Shuyue Hu
机构: Pennsylvania State University (宾夕法尼亚州立大学); Singapore Management University (新加坡管理大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computation and Language (cs.CL)
备注:
Abstract:Reinforcement learning (RL) has demonstrated potential in enhancing the reasoning capabilities of large language models (LLMs), but such training typically demands substantial efforts in creating and annotating data. In this work, we explore improving LLMs through RL with minimal data. Our approach alternates between the LLM proposing a task and then attempting to solve it. To minimize data dependency, we introduce two novel mechanisms grounded in self-awareness: (1) self-aware difficulty prediction, where the model learns to assess task difficulty relative to its own abilities and prioritize challenging yet solvable tasks, and (2) self-aware limit breaking, where the model recognizes when a task is beyond its capability boundary and proactively requests external data to break through that limit. Extensive experiments on nine benchmarks showing a 53.8% relative improvement with less than 1.2% extra data demonstrate the efficacy of self-aware RL and underscore the promise of self-evolving agent training.
zh
[NLP-33] IndiCASA: A Dataset and Bias Evaluation Framework in LLM s Using Contrastive Embedding Similarity in the Indian Context AAAI
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在文化多样性背景下嵌入偏见的评估难题,尤其是在印度等多元文化语境中,现有基于嵌入的偏见评估方法难以捕捉细微的刻板印象。其解决方案的关键在于提出一种基于对比学习(contrastive learning)训练的编码器框架,通过嵌入相似性来捕获细粒度偏见,并构建了一个名为IndiCASA(IndiBias-based Contextually Aligned Stereotypes and Anti-stereotypes)的新数据集,包含2,575条人工验证句子,覆盖种姓、性别、宗教、残疾和经济地位五个社会维度。实验表明,所有测试模型均存在一定程度的刻板偏见,其中与残疾相关的偏见尤为顽固,而宗教偏见相对较低,可能得益于全球范围内的去偏努力,凸显了开发更公平模型的必要性。
链接: https://arxiv.org/abs/2510.02742
作者: Santhosh G S,Akshay Govind S,Gokul S Krishnan,Balaraman Ravindran,Sriraam Natarajan
机构: 1: Indian Institute of Technology Madras (印度理工学院马德拉斯分校); 2: University of Texas at Austin (得克萨斯大学奥斯汀分校)
类目: Computation and Language (cs.CL)
备注: Accepted at 8th AAAI/ACM Conference on AI, Ethics, and Society (AIES) 2025
Abstract:Large Language Models (LLMs) have gained significant traction across critical domains owing to their impressive contextual understanding and generative capabilities. However, their increasing deployment in high stakes applications necessitates rigorous evaluation of embedded biases, particularly in culturally diverse contexts like India where existing embedding-based bias assessment methods often fall short in capturing nuanced stereotypes. We propose an evaluation framework based on a encoder trained using contrastive learning that captures fine-grained bias through embedding similarity. We also introduce a novel dataset - IndiCASA (IndiBias-based Contextually Aligned Stereotypes and Anti-stereotypes) comprising 2,575 human-validated sentences spanning five demographic axes: caste, gender, religion, disability, and socioeconomic status. Our evaluation of multiple open-weight LLMs reveals that all models exhibit some degree of stereotypical bias, with disability related biases being notably persistent, and religion bias generally lower likely due to global debiasing efforts demonstrating the need for fairer model development.
zh
[NLP-34] PGMEL: Policy Gradient-based Generative Adversarial Network for Multimodal Entity Linking
【速读】: 该论文旨在解决多模态实体链接(Multimodal Entity Linking, MEL)中负样本质量对表示学习影响未被充分探索的问题。现有方法通常依赖随机或启发式选择的负样本,难以有效提升模型的判别能力。解决方案的关键在于提出一种基于策略梯度的生成对抗网络(Policy Gradient-based Generative Adversarial Network for Multimodal Entity Linking, PGMEL),其中生成器通过策略梯度优化生成具有挑战性的负样本,而判别器则负责学习区分正负样本的度量空间。该框架通过对抗训练机制自动筛选难例负样本,从而显著增强模型的语义表示能力,在Wiki-MEL、Richpedia-MEL和WikiDiverse等数据集上优于当前最优方法。
链接: https://arxiv.org/abs/2510.02726
作者: KM Pooja,Cheng Long,Aixin Sun
机构: Indian Institute of Information Technology, Allahabad (印度信息技术学院,阿拉哈巴德); Nanyang Technological University (南洋理工大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:The task of entity linking, which involves associating mentions with their respective entities in a knowledge graph, has received significant attention due to its numerous potential applications. Recently, various multimodal entity linking (MEL) techniques have been proposed, targeted to learn comprehensive embeddings by leveraging both text and vision modalities. The selection of high-quality negative samples can potentially play a crucial role in metric/representation learning. However, to the best of our knowledge, this possibility remains unexplored in existing literature within the framework of MEL. To fill this gap, we address the multimodal entity linking problem in a generative adversarial setting where the generator is responsible for generating high-quality negative samples, and the discriminator is assigned the responsibility for the metric learning tasks. Since the generator is involved in generating samples, which is a discrete process, we optimize it using policy gradient techniques and propose a policy gradient-based generative adversarial network for multimodal entity linking (PGMEL). Experimental results based on Wiki-MEL, Richpedia-MEL and WikiDiverse datasets demonstrate that PGMEL learns meaningful representation by selecting challenging negative samples and outperforms state-of-the-art methods.
zh
[NLP-35] Hyperparameter Loss Surfaces Are Simple Near their Optima
【速读】: 该论文旨在解决现代机器学习模型中超参数(hyperparameter)优化难题,尤其是在模型规模庞大时难以进行系统性搜索的问题。其核心挑战在于缺乏有效的工具来理解超参数损失面(loss surface)的结构,从而无法高效定位最优配置。解决方案的关键在于发现损失面在接近最优解时会呈现出一种简化的渐近结构(asymptotic regime),并提出基于随机搜索(random search)的新技术来揭示这一结构。作者进一步推导出随机搜索在该渐近区间的收敛规律,并由此构建了可解释和外推其性能的新理论工具,如用于估计最佳可能损失的置信区间或量化有效超参数数量等。
链接: https://arxiv.org/abs/2510.02721
作者: Nicholas Lourie,He He,Kyunghyun Cho
机构: New York University (纽约大学); Genentech (基因技术公司)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注: Accepted to COLM 2025. 23 pages, 8 figures
Abstract:Hyperparameters greatly impact models’ capabilities; however, modern models are too large for extensive search. Instead, researchers design recipes that train well across scales based on their understanding of the hyperparameters. Despite this importance, few tools exist for understanding the hyperparameter loss surface. We discover novel structure in it and propose a new theory yielding such tools. The loss surface is complex, but as you approach the optimum simple structure emerges. It becomes characterized by a few basic features, like its effective dimension and the best possible loss. To uncover this asymptotic regime, we develop a novel technique based on random search. Within this regime, the best scores from random search take on a new distribution we discover. Its parameters are exactly the features defining the loss surface in the asymptotic regime. From these features, we derive a new asymptotic law for random search that can explain and extrapolate its convergence. These new tools enable new analyses, such as confidence intervals for the best possible performance or determining the effective number of hyperparameters. We make these tools available at this https URL .
zh
[NLP-36] ravelBench : Exploring LLM Performance in Low-Resource Domains
【速读】: 该论文试图解决现有大语言模型(Large Language Models, LLMs)基准测试在低资源任务中表现信息不足的问题,导致难以在这些领域开发有效的解决方案。其关键解决方案是构建了一个包含14个旅行领域数据集的评测集合,覆盖7种常见的自然语言处理(Natural Language Processing, NLP)任务,并基于真实场景中的匿名数据对不同LLMs进行系统评估。研究发现,通用基准结果不足以反映模型在低资源环境下的性能,即使经过大量训练浮点运算(FLOPs),未经微调的LLM在复杂、领域特定任务中仍存在性能瓶颈;同时,推理能力对小型LLM的提升更为显著,使其在某些任务上具备更强的判断力。
链接: https://arxiv.org/abs/2510.02719
作者: Srinivas Billa,Xiaonan Jing
机构: Expedia Group (Expedia集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures
Abstract:Results on existing LLM benchmarks capture little information over the model capabilities in low-resource tasks, making it difficult to develop effective solutions in these domains. To address these challenges, we curated 14 travel-domain datasets spanning 7 common NLP tasks using anonymised data from real-world scenarios, and analysed the performance across LLMs. We report on the accuracy, scaling behaviour, and reasoning capabilities of LLMs in a variety of tasks. Our results confirm that general benchmarking results are insufficient for understanding model performance in low-resource tasks. Despite the amount of training FLOPs, out-of-the-box LLMs hit performance bottlenecks in complex, domain-specific scenarios. Furthermore, reasoning provides a more significant boost for smaller LLMs by making the model a better judge on certain tasks.
zh
[NLP-37] me-To-Inconsistency: A Survival Analysis of Large Language Model Robustness to Adversarial Attacks
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在长时间多轮对话中鲁棒性不足的问题,现有评估框架多依赖静态基准和单轮测试,难以捕捉真实交互中随时间演化的对话退化现象。其解决方案的关键在于首次将生存分析(survival analysis)引入对话系统鲁棒性评估,将对话失败建模为“时间到事件”过程,并采用Cox比例风险模型、加速失效时间模型(Accelerated Failure Time, AFT)与随机生存森林等方法对36,951个对话轮次进行建模。研究发现,prompt-to-prompt(P2P)的语义漂移会导致灾难性失败风险激增,而渐进累积的漂移反而具有保护作用,显著延长对话寿命,且AFT模型结合交互项表现最优,具备优异的区分度和校准能力,从而为构建更稳健的对话代理提供了理论依据与实践指导。
链接: https://arxiv.org/abs/2510.02712
作者: Yubo Li,Ramayya Krishnan,Rema Padman
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs) have revolutionized conversational AI, yet their robustness in extended multi-turn dialogues remains poorly understood. Existing evaluation frameworks focus on static benchmarks and single-turn assessments, failing to capture the temporal dynamics of conversational degradation that characterize real-world interactions. In this work, we present the first comprehensive survival analysis of conversational AI robustness, analyzing 36,951 conversation turns across 9 state-of-the-art LLMs to model failure as a time-to-event process. Our survival modeling framework-employing Cox proportional hazards, Accelerated Failure Time, and Random Survival Forest approaches-reveals extraordinary temporal dynamics. We find that abrupt, prompt-to-prompt(P2P) semantic drift is catastrophic, dramatically increasing the hazard of conversational failure. In stark contrast, gradual, cumulative drift is highly protective, vastly reducing the failure hazard and enabling significantly longer dialogues. AFT models with interactions demonstrate superior performance, achieving excellent discrimination and exceptional calibration. These findings establish survival analysis as a powerful paradigm for evaluating LLM robustness, offer concrete insights for designing resilient conversational agents, and challenge prevailing assumptions about the necessity of semantic consistency in conversational AI Systems.
zh
[NLP-38] Uncertainty as Feature Gaps: Epistemic Uncertainty Quantification of LLM s in Contextual Question-Answering
【速读】: 该论文旨在解决上下文问答(Contextual QA)任务中缺乏有效不确定性量化(Uncertainty Quantification, UQ)方法的问题,而现有研究主要集中在封闭书本的事实性问答任务上。其核心挑战在于如何在真实应用场景中准确衡量模型对上下文依赖信息的置信度,尤其是区分认知不确定性(epistemic uncertainty)与偶然不确定性(aleatoric uncertainty)。解决方案的关键在于提出一个理论基础扎实的token级不确定性度量框架:首先定义交叉熵作为预测分布与未知真实分布之间的差异,并通过分解该度量分离出epistemic成分;进而以理想提示模型逼近真实分布,推导出epistemic不确定性的上界,并将其解释为当前模型隐藏表示与理想模型之间的语义特征差距。在此基础上,作者进一步识别出三个可解释的特征——上下文依赖性(context-reliance)、上下文理解能力(context comprehension)和诚实性(honesty),并通过少量标注样本提取这些特征并集成形成鲁棒的不确定性评分,在多个基准测试中显著优于最先进的无监督和有监督UQ方法,且推理开销极低。
链接: https://arxiv.org/abs/2510.02671
作者: Yavuz Bakman,Sungmin Kang,Zhiqi Huang,Duygu Nur Yaldiz,Catarina G. Belém,Chenyang Zhu,Anoop Kumar,Alfy Samuel,Salman Avestimehr,Daben Liu,Sai Praneeth Karimireddy
机构: University of Southern California (南加州大学); Capital One (资本一号); University of California, Irvine (加州大学欧文分校)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Uncertainty Quantification (UQ) research has primarily focused on closed-book factual question answering (QA), while contextual QA remains unexplored, despite its importance in real-world applications. In this work, we focus on UQ for the contextual QA task and propose a theoretically grounded approach to quantify epistemic uncertainty. We begin by introducing a task-agnostic, token-level uncertainty measure defined as the cross-entropy between the predictive distribution of the given model and the unknown true distribution. By decomposing this measure, we isolate the epistemic component and approximate the true distribution by a perfectly prompted, idealized model. We then derive an upper bound for epistemic uncertainty and show that it can be interpreted as semantic feature gaps in the given model’s hidden representations relative to the ideal model. We further apply this generic framework to the contextual QA task and hypothesize that three features approximate this gap: context-reliance (using the provided context rather than parametric knowledge), context comprehension (extracting relevant information from context), and honesty (avoiding intentional lies). Using a top-down interpretability approach, we extract these features by using only a small number of labeled samples and ensemble them to form a robust uncertainty score. Experiments on multiple QA benchmarks in both in-distribution and out-of-distribution settings show that our method substantially outperforms state-of-the-art unsupervised (sampling-free and sampling-based) and supervised UQ methods, achieving up to a 13-point PRR improvement while incurring a negligible inference overhead.
zh
[NLP-39] Self-Improvement in Multimodal Large Language Models : A Survey EMNLP2025
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在自我提升(self-improvement)方面研究尚不系统的问题,尤其关注如何高效利用多样化数据源以实现模型能力的持续增强,同时降低对人工干预的依赖。其解决方案的关键在于从三个核心维度构建结构化框架:1)数据收集(data collection),即如何自动获取高质量、多样化的多模态数据;2)数据组织(data organization),即如何对异构数据进行有效标注与整合;3)模型优化(model optimization),即通过迭代训练和反馈机制实现模型性能的闭环改进。该框架为推动MLLMs向更通用、自适应的方向发展提供了理论基础与实践路径。
链接: https://arxiv.org/abs/2510.02665
作者: Shijian Deng,Kai Wang,Tianyu Yang,Harsh Singh,Yapeng Tian
机构: The University of Texas at Dallas (德克萨斯大学达拉斯分校); University of Toronto (多伦多大学); University of Notre Dame (圣母大学); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
类目: Computation and Language (cs.CL)
备注: EMNLP 2025
Abstract:Recent advancements in self-improvement for Large Language Models (LLMs) have efficiently enhanced model capabilities without significantly increasing costs, particularly in terms of human effort. While this area is still relatively young, its extension to the multimodal domain holds immense potential for leveraging diverse data sources and developing more general self-improving models. This survey is the first to provide a comprehensive overview of self-improvement in Multimodal LLMs (MLLMs). We provide a structured overview of the current literature and discuss methods from three perspectives: 1) data collection, 2) data organization, and 3) model optimization, to facilitate the further development of self-improvement in MLLMs. We also include commonly used evaluations and downstream applications. Finally, we conclude by outlining open challenges and future research directions.
zh
[NLP-40] Less LLM More Documents: Searching for Improved RAG ECIR2026
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中因依赖大语言模型(Large Language Models, LLMs)而导致的成本高、部署受限的问题。其核心解决方案是通过扩大检索器的语料库规模来降低对大型生成模型的依赖,从而在保持甚至提升性能的同时优化资源利用效率。关键发现在于:增大语料库能显著增强RAG效果,尤其当搭配中小规模生成模型时,其性能可媲美使用超大规模生成模型但语料库较小的情况;这种改进主要源于答案相关段落覆盖率的提升,而非生成效率的改变。这一发现确立了语料库与生成模型之间的权衡关系,表明投资于更大语料库是一种高效且可行的增强RAG性能路径。
链接: https://arxiv.org/abs/2510.02657
作者: Jingjie Ning,Yibo Kong,Yunfan Long,Jamie Callan
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 16 pages. Submitted to ECIR 2026
Abstract:Retrieval-Augmented Generation (RAG) couples document retrieval with large language models (LLMs). While scaling generators improves accuracy, it also raises cost and limits deployability. We explore an orthogonal axis: enlarging the retriever’s corpus to reduce reliance on large LLMs. Experimental results show that corpus scaling consistently strengthens RAG and can often serve as a substitute for increasing model size, though with diminishing returns at larger scales. Small- and mid-sized generators paired with larger corpora often rival much larger models with smaller corpora; mid-sized models tend to gain the most, while tiny and large models benefit less. Our analysis shows that improvements arise primarily from increased coverage of answer-bearing passages, while utilization efficiency remains largely unchanged. These findings establish a principled corpus-generator trade-off: investing in larger corpora offers an effective path to stronger RAG, often comparable to enlarging the LLM itself.
zh
[NLP-41] SoT: Structured-of-Thought Prompting Guides Multilingual Reasoning in Large Language Models EMNLP2025
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在非高资源语言中难以有效迁移复杂推理能力的问题,即多语言推理任务中的性能瓶颈。其核心挑战在于语言特定的语义信息难以被统一建模,导致跨语言表达差异影响推理一致性。解决方案的关键是提出一种无需训练的结构化思维(Structured-of-Thought, SoT)方法,通过两个关键步骤实现:语言思维转换(Language Thinking Transformation)将语言特异性语义映射为语言无关的结构化表示,以及结构化知识转换(Structured Knowledge Transformation)引导模型保持一致的底层推理路径,从而提升多语言场景下的推理鲁棒性与泛化能力。
链接: https://arxiv.org/abs/2510.02648
作者: Rui Qi,Zhibo Man,Yufeng Chen,Fengran Mo,Jinan Xu,Kaiyu Huang
机构: Key Laboratory of Big Data & Artificial Intelligence in Transportation (交通运输大数据与人工智能重点实验室); Beijing Jiaotong University (北京交通大学); School of Computer Science and Technology (计算机科学与技术学院); University of Montreal (蒙特利尔大学)
类目: Computation and Language (cs.CL)
备注: EMNLP 2025 (findings)
Abstract:Recent developments have enabled Large Language Models (LLMs) to engage in complex reasoning tasks through deep thinking. However, the capacity of reasoning has not been successfully transferred to non-high-resource languages due to resource constraints, which struggles with multilingual reasoning tasks. To this end, we propose Structured-of-Thought (SoT), a training-free method that improves the performance on multilingual reasoning through a multi-step transformation: Language Thinking Transformation and Structured Knowledge Transformation. The SoT method converts language-specific semantic information into language-agnostic structured representations, enabling the models to understand the query in different languages more sophisticated. Besides, SoT effectively guides LLMs toward more concentrated reasoning to maintain consistent underlying reasoning pathways when handling cross-lingual variations in expression. Experimental results demonstrate that SoT outperforms several strong baselines on multiple multilingual reasoning benchmarks when adapting to various backbones of LLMs. It can also be integrated with other training-free strategies for further improvements. Our code is available at this https URL.
zh
[NLP-42] Mind the Gap: Linguistic Divergence and Adaptation Strategies in Human-LLM Assistant vs. Human-Human Interactions
【速读】: 该论文试图解决的问题是:当大型语言模型(Large Language Models, LLMs)被部署为客服聊天机器人时,用户与其交互的沟通风格相较于与人类客服交互时存在显著差异,而当前主流训练数据多基于人与人之间的对话,可能导致模型在实际应用中适应性不足。解决方案的关键在于:通过在后训练阶段引入风格多样化的数据增强策略,使模型能够更好地应对上线后用户沟通风格的变化;实验表明,这种数据层面的多样性提升显著优于仅在推理时对用户输入进行重写(inference-time reformulation)的方法,从而有效增强LLM在真实场景中的鲁棒性和交互体验。
链接: https://arxiv.org/abs/2510.02645
作者: Fulei Zhang,Zhou Yu
机构: Amazon.com Inc.(亚马逊公司)
类目: Computation and Language (cs.CL)
备注: Accepted to The Second Workshop on Generative AI for E-commerce (GenAIECommerce '25), held September 22, 2025, in Prague, Czech Republic
Abstract:As Large Language Models (LLMs) are increasingly deployed in customer-facing applications, a critical yet underexplored question is how users communicate differently with LLM chatbots compared to human agent. In this study, we present empirical evidence that users adopt distinct communication styles when users interact with chatbots versus human agents. Our analysis reveals significant differences in grammatical fluency, politeness, and lexical diversity in user language between the two settings. These findings suggest that models trained exclusively on human-human interaction data may not adequately accommodate the communication style shift that occurs once an LLM chatbot is deployed. To enhance LLM robustness to post-launch communication style changes, we experimented with two strategies: (1) data augmentation during the post-training phase and (2) inference-time user message reformulation. Our results indicate that models trained on stylistically diverse datasets significantly outperform those trained exclusively on original or stylistically uniform datasets, while inference-time reformulation proved less effective. These insights help us to better adapt our models for improved LLM-user interaction experiences.
zh
[NLP-43] HyperAdaLoRA: Accelerating LoRA Rank Allocation During Training via Hypernetworks without Sacrificing Performance
【速读】: 该论文旨在解决LoRA(Low-Rank Adaptation)在微调大语言模型时因固定秩分配导致的适应性不足问题,以及AdaLoRA在训练过程中收敛速度慢、计算开销高的缺陷。其解决方案的关键在于提出HyperAdaLoRA框架,通过引入基于注意力机制的超网络(hypernetwork)动态生成奇异值分解(Singular Value Decomposition, SVD)的参数(P, Λ, Q),并结合对超网络输出的剪枝实现动态秩分配,从而在加速收敛的同时保持模型性能,且具备良好的可扩展性。
链接: https://arxiv.org/abs/2510.02630
作者: Hao Zhang,Zhenjia Li,Runfeng Bao,Yifan Gao,Xi Xiao,Bo Huang,Yuhang Wu,Tianyang Wang,Hao Xu
机构: Harvard University (哈佛大学); University of Chinese Academy of Sciences (中国科学院大学); Fudan University (复旦大学); University of Chicago (芝加哥大学); University of Alabama at Birmingham (阿拉巴马大学伯明翰分校); Shanghai University of Engineering Science (上海工程技术大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 13 pages
Abstract:Parameter-Efficient Fine-Tuning (PEFT), especially Low-Rank Adaptation (LoRA), has emerged as a promising approach to fine-tuning large language models(LLMs) while reducing computational and memory overhead. However, LoRA assumes a uniform rank \textitr for each incremental matrix, not accounting for the varying significance of weight matrices across different modules and layers. AdaLoRA leverages Singular Value Decomposition (SVD) to parameterize updates and employs pruning of singular values to introduce dynamic rank allocation, thereby enhancing adaptability. However, during the training process, it often encounters issues of slow convergence speed and high computational overhead. To address this issue, we propose HyperAdaLoRA, a novel framework that accelerates the convergence of AdaLoRA by leveraging a hypernetwork. Instead of directly optimizing the components of Singular Value Decomposition (P, \Lambda, Q) , HyperAdaLoRA employs a hypernetwork based on attention mechanisms to dynamically generate these parameters. By pruning the outputs of the hypernetwork that generates the singular values, dynamic rank allocation is achieved. Comprehensive experiments on various datasets and models demonstrate that our method achieves faster convergence without sacrificing performance. Additionally, further extension experiments on other LoRA-based approaches validate the broad applicability of our method.
zh
[NLP-44] Evaluation Framework for Highlight Explanations of Context Utilisation in Language Models
【速读】: 该论文旨在解决语言模型(Language Models, LMs)在生成响应时对上下文信息的利用方式不透明的问题,即用户无法判断模型是依赖参数化记忆(parametric memory)还是所给定的上下文,也无法识别具体哪些上下文片段影响了输出结果。为提升解释的准确性与可验证性,作者提出了一种基于受控测试用例的黄金标准(gold standard)Highlight Explanations (HEs) 评估框架,该框架通过设定已知真实上下文使用情况的测试场景,避免了以往间接代理评估方法的局限性。其关键创新在于引入了具有明确因果关系的ground-truth标签,从而首次实现了对HE方法在上下文归因(context attribution)任务上的直接、客观评估;同时,研究还发现,尽管MechLight(一种机制解释方法)表现最优,但所有现有方法在处理长上下文时均存在性能下降和位置偏差问题,揭示了当前解释技术在准确性和鲁棒性方面的根本挑战。
链接: https://arxiv.org/abs/2510.02629
作者: Jingyi Sun,Pepa Atanasova,Sagnik Ray Choudhury,Sekh Mainul Islam,Isabelle Augenstein
机构: University of Copenhagen (哥本哈根大学); University of North Texas (北德克萨斯大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Context utilisation, the ability of Language Models (LMs) to incorporate relevant information from the provided context when generating responses, remains largely opaque to users, who cannot determine whether models draw from parametric memory or provided context, nor identify which specific context pieces inform the response. Highlight explanations (HEs) offer a natural solution as they can point the exact context pieces and tokens that influenced model outputs. However, no existing work evaluates their effectiveness in accurately explaining context utilisation. We address this gap by introducing the first gold standard HE evaluation framework for context attribution, using controlled test cases with known ground-truth context usage, which avoids the limitations of existing indirect proxy evaluations. To demonstrate the framework’s broad applicability, we evaluate four HE methods – three established techniques and MechLight, a mechanistic interpretability approach we adapt for this task – across four context scenarios, four datasets, and five LMs. Overall, we find that MechLight performs best across all context scenarios. However, all methods struggle with longer contexts and exhibit positional biases, pointing to fundamental challenges in explanation accuracy that require new approaches to deliver reliable context utilisation explanations at scale.
zh
[NLP-45] On the Role of Temperature Sampling in Test-Time Scaling
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理阶段通过测试时扩展(Test-Time Scaling, TTS)提升推理能力时存在的局限性问题,即当前单温度采样策略无法充分挖掘模型潜力,且在大规模样本数下性能提升趋于饱和,部分难题仍无法解决。其解决方案的关键在于引入温度维度的扩展(temperature scaling),通过多温度采样生成多样化推理路径,并结合投票机制选择最优解,从而显著扩大模型的推理边界,有效释放基础模型的潜在能力,实现无需强化学习(Reinforcement Learning, RL)后训练即可达到RL优化模型的性能水平。
链接: https://arxiv.org/abs/2510.02611
作者: Yuheng Wu,Azalia Mirhoseini,Thierry Tambe
机构: Stanford University (斯坦福大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) can improve reasoning at inference time through test-time scaling (TTS), where multiple reasoning traces are generated and the best one is selected. Prior work shows that increasing the number of samples K steadily improves accuracy. In this paper, we demonstrate that this trend does not hold indefinitely: at large K, further scaling yields no gains, and certain hard questions remain unsolved regardless of the number of traces. Interestingly, we find that different sampling temperatures solve different subsets of problems, implying that single-temperature scaling explores only part of a model’s potential. We therefore propose scaling along the temperature dimension, which enlarges the reasoning boundary of LLMs. Averaged over Qwen3 (0.6B, 1.7B, 4B, 8B) and five representative reasoning benchmarks (AIME 2024/2025, MATH500, LiveCodeBench, Hi-ToM), temperature scaling yields an additional 7.3 points over single-temperature TTS. Temperature scaling also enables base models to reach performance comparable to reinforcement learning (RL)-trained counterparts, without additional post-training. We further provide a comprehensive analysis of this phenomenon and design a multi-temperature voting method that reduces the overhead of temperature scaling. Overall, our findings suggest that TTS is more powerful than previously thought, and that temperature scaling offers a simple and effective way to unlock the latent potential of base models.
zh
[NLP-46] How Confident are Video Models? Empowering Video Models to Express their Uncertainty
【速读】: 该论文旨在解决生成式视频模型(generative video models)在文本到视频生成任务中存在幻觉(hallucination)问题时缺乏不确定性量化(uncertainty quantification, UQ)方法的现状,从而提升其在真实应用场景中的安全性与可靠性。解决方案的关键在于提出一个完整的UQ框架,其中核心创新是S-QUBED方法——一种基于潜在空间建模的黑盒不确定性分解技术,能够严谨地将预测不确定性解耦为可变性(aleatoric)和知识不足(epistemic)两类成分,并通过无严格建模假设的鲁棒秩相关评估指标实现对模型校准性的有效衡量。该方法在基准视频数据集上的实验证明其能生成与任务准确率负相关的校准总不确定性估计,显著提升了生成结果的可信度与可控性。
链接: https://arxiv.org/abs/2510.02571
作者: Zhiting Mei,Ola Shorinwa,Anirudha Majumdar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Generative video models demonstrate impressive text-to-video capabilities, spurring widespread adoption in many real-world applications. However, like large language models (LLMs), video generation models tend to hallucinate, producing plausible videos even when they are factually wrong. Although uncertainty quantification (UQ) of LLMs has been extensively studied in prior work, no UQ method for video models exists, raising critical safety concerns. To our knowledge, this paper represents the first work towards quantifying the uncertainty of video models. We present a framework for uncertainty quantification of generative video models, consisting of: (i) a metric for evaluating the calibration of video models based on robust rank correlation estimation with no stringent modeling assumptions; (ii) a black-box UQ method for video models (termed S-QUBED), which leverages latent modeling to rigorously decompose predictive uncertainty into its aleatoric and epistemic components; and (iii) a UQ dataset to facilitate benchmarking calibration in video models. By conditioning the generation task in the latent space, we disentangle uncertainty arising due to vague task specifications from that arising from lack of knowledge. Through extensive experiments on benchmark video datasets, we demonstrate that S-QUBED computes calibrated total uncertainty estimates that are negatively correlated with the task accuracy and effectively computes the aleatoric and epistemic constituents.
zh
[NLP-47] ranscribe Translate or Transliterate: An Investigation of Intermediate Representations in Spoken Language Models
【速读】: 该论文旨在解决生成式语音语言模型(Spoken Language Models, SLMs)中模态适配器(Modality Adapters, MAs)如何转换语音编码器输出表示的问题,即揭示MA在连接语音编码与语言模型解码器时所采用的内部表征策略。解决方案的关键在于通过计算MA输出表示与语言模型解码器token之间的最近邻关系,发现两类不同的表征机制:一类是基于英语语义的中间语言(interlingua),适用于使用Whisper编码器的模型(如SALMONN和Qwen2-Audio),能够处理指令微调中未见的语言;另一类则是以英文词汇表达输入语音的音素特征,适用于非Whisper编码器的模型(如Phi-4-Multimodal-Instruct)。这一发现表明,MA的表征方式取决于语音编码器是否同时训练了语音识别与翻译任务。
链接: https://arxiv.org/abs/2510.02569
作者: Tolúl\d{o}pé Ògúnrèmí,Christopher D. Manning,Dan Jurafsky,Karen Livescu
机构: Stanford University (斯坦福大学); Toyota Technological Institute at Chicago (芝加哥丰田技术学院)
类目: Computation and Language (cs.CL)
备注: ASRU 2025
Abstract:Spoken language models (SLMs) that integrate speech with large language models (LMs) rely on modality adapters (MAs) to map the output of speech encoders to a representation that is understandable to the decoder LM. Yet we know very little about how these crucial MAs transform representations. Here we examine the MA output representation in three SLMs (SALMONN, Qwen2-Audio and Phi-4-Multimodal-Instruct). By finding the nearest decoder LM token to an MA representation, we uncover two strategies for MA representations. For models using a Whisper encoder, MAs appear to represent the meaning of the input using an English-based interlingua, allowing them to handle languages unseen in instruction tuning. For models that don’t, like Phi-4-Multimodal-Instruct, MAs instead represent the phonetics of the input, but expressed with English words. We hypothesise that which arises depends on whether the speech encoder is trained only for speech recognition or also for translation.
zh
[NLP-48] Knowledge-Graph Based RAG System Evaluation Framework
【速读】: 该论文旨在解决当前检索增强生成(Retrieval Augmented Generation, RAG)系统评估中存在的挑战,即传统评价指标难以有效捕捉大型语言模型(Large Language Models, LLM)生成内容所具有的高流畅性和自然性等关键特征。其解决方案的核心在于扩展现有的RAGAS评估框架,引入基于知识图谱(Knowledge Graph, KG)的评估范式,通过多跳推理(multi-hop reasoning)和语义社区聚类(semantic community clustering)机制,构建更全面的评分指标体系,从而提升对RAG系统性能的细粒度感知能力与评估准确性。
链接: https://arxiv.org/abs/2510.02549
作者: Sicheng Dong,Vahid Zolfaghari,Nenad Petrovic,Alois Knoll
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) has become a significant research focus and is utilized in various fields, such as text generation and dialog systems. One of the most essential applications of LLM is Retrieval Augmented Generation (RAG), which greatly enhances generated content’s reliability and relevance. However, evaluating RAG systems remains a challenging task. Traditional evaluation metrics struggle to effectively capture the key features of modern LLM-generated content that often exhibits high fluency and naturalness. Inspired by the RAGAS tool, a well-known RAG evaluation framework, we extended this framework into a KG-based evaluation paradigm, enabling multi-hop reasoning and semantic community clustering to derive more comprehensive scoring metrics. By incorporating these comprehensive evaluation criteria, we gain a deeper understanding of RAG systems and a more nuanced perspective on their performance. To validate the effectiveness of our approach, we compare its performance with RAGAS scores and construct a human-annotated subset to assess the correlation between human judgments and automated metrics. In addition, we conduct targeted experiments to demonstrate that our KG-based evaluation method is more sensitive to subtle semantic differences in generated outputs. Finally, we discuss the key challenges in evaluating RAG systems and highlight potential directions for future research.
zh
[NLP-49] Hierarchical Semantic Retrieval with Cobweb
【速读】: 该论文旨在解决神经文档检索中因将语料库视为单一粒度的向量云而导致的结构利用不足与解释性差的问题(即缺乏对文档层次结构的有效利用和透明的检索路径)。其解决方案的关键在于提出 Cobweb 框架,该框架通过构建原型树(prototype tree)组织句子嵌入,并采用从粗到细的遍历策略进行文档排序;其中内部节点作为概念原型,提供多粒度的相关性信号并借助检索路径实现可解释性。这一方法在强编码器嵌入(如 BERT/T5)下性能媲美点积搜索,且在 kNN 效果下降时仍保持鲁棒性(尤其在 GPT-2 向量下点积性能崩溃时),展现出更强的适应性和可解释性优势。
链接: https://arxiv.org/abs/2510.02539
作者: Anant Gupta,Karthik Singaravadivelan,Zekun Wang
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 20 pages, 7 tables, 4 figures
Abstract:Neural document retrieval often treats a corpus as a flat cloud of vectors scored at a single granularity, leaving corpus structure underused and explanations opaque. We use Cobweb–a hierarchy-aware framework–to organize sentence embeddings into a prototype tree and rank documents via coarse-to-fine traversal. Internal nodes act as concept prototypes, providing multi-granular relevance signals and a transparent rationale through retrieval paths. We instantiate two inference approaches: a generalized best-first search and a lightweight path-sum ranker. We evaluate our approaches on MS MARCO and QQP with encoder (e.g., BERT/T5) and decoder (GPT-2) representations. Our results show that our retrieval approaches match the dot product search on strong encoder embeddings while remaining robust when kNN degrades: with GPT-2 vectors, dot product performance collapses whereas our approaches still retrieve relevant results. Overall, our experiments suggest that Cobweb provides competitive effectiveness, improved robustness to embedding quality, scalability, and interpretable retrieval via hierarchical prototypes.
zh
[NLP-50] Unraveling Syntax: How Language Models Learn Context-Free Grammars
【速读】: 该论文旨在解决语言模型(Language Models, LM)在语法习得过程中的学习动态机制不明确的问题,特别是如何理解Transformer架构在处理具有层次结构的语法(如概率上下文无关文法,PCFG)时的学习路径与内在表征演化。其解决方案的关键在于构建一个基于合成语言的可控实验框架:通过训练小型模型学习由PCFG生成的合成语言,精确调控语法复杂度、递归深度和子文法结构,从而实现对学习过程的系统性分析。研究进一步证明了适用于子文法结构的训练损失与KL散度的递归公式,并发现Transformer在学习过程中并非按人类儿童那样逐步掌握简单结构再过渡到复杂结构,而是并行降低所有子文法的损失;同时揭示了子文法预训练可提升小模型最终性能并增强内部表征与语法结构的一致性,且模型在深层递归结构上存在固有局限,凸显神经网络在建模层次化语法方面的根本挑战。
链接: https://arxiv.org/abs/2510.02524
作者: Laura Ying Schulz,Daniel Mitropolsky,Tomaso Poggio
机构: Massachussetts Institute of Technology (麻省理工学院); ETH Zürich (苏黎世联邦理工学院)
类目: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
备注: Equal contribution by LYS and DM
Abstract:We introduce a new framework for understanding how language models acquire syntax. While large models achieve impressive results, little is known about their learning dynamics. Our approach starts with the observation that most domains of interest, such as natural language syntax, coding languages, arithmetic problems, are captured by probabilistic context-free grammars (PCFGs). We study the learning dynamics of small models trained on synthetic languages generated from PCFGs, enabling precise control over grammar complexity, recursion depth, and subgrammar structure. We prove several general, recursive formulae for the training loss and Kullback-Leibler divergence over the subgrammar structure of a PCFG. Empirically, we find that unlike children, who first master simple substructures before progressing to more complex constructions, transformers reduce loss across all subgrammars in parallel. We further show that subgrammar pretraining can improve the final loss for smaller models, and that pretrained models develop internal representations more aligned with the grammar’s substructure. Finally, we demonstrate that models struggle with deeper recursive structures (a limitation even of large language models), revealing fundamental challenges in how neural networks represent hierarchical syntax. Overall, our work initiates the study of the learning dynamics of transformers on PCFGs as a versatile testbed for probing learning in language models, opening a research direction with many open questions.
zh
[NLP-51] Beyond Imitation: Recovering Dense Rewards from Demonstrations
【速读】: 该论文旨在解决传统监督微调(Supervised Fine-Tuning, SFT)被简单视为模仿学习过程的局限性问题,即SFT仅被认为是在训练策略以复制专家行为,而忽略了其潜在的奖励建模能力。论文的核心贡献在于建立SFT与逆强化学习(Inverse Reinforcement Learning, IRL)之间的根本等价关系,证明SFT目标是逆Q-learning的一个特例,从而表明SFT不仅学习策略,还隐式地学习了一个密集的、基于token级别的奖励模型,用于解释专家演示。解决方案的关键在于通过构造一个基准相对奖励函数(baseline-relative reward function),从SFT模型中直接恢复出这一密集奖励信号,进而实现细粒度的信用分配,并利用该奖励进一步优化策略,如在指令遵循基准上提出的Dense-Path REINFORCE方法显著优于原始SFT模型。
链接: https://arxiv.org/abs/2510.02493
作者: Jiangnan Li,Thuy-Trang Vu,Ehsan Abbasnejad,Gholamreza Haffari
机构: Monash University (莫纳什大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Conventionally, supervised fine-tuning (SFT) is treated as a simple imitation learning process that only trains a policy to imitate expert behavior on demonstration datasets. In this work, we challenge this view by establishing a fundamental equivalence between SFT and Inverse Reinforcement Learning. We prove that the SFT objective is a special case of Inverse Q-Learning, which implies that the SFT process does not just learn a policy, but also an implicit, dense, token-level reward model that explains the expert demonstrations. We then show how to recover this dense reward signal directly from the SFT model by formulating a baseline-relative reward function. The availability of such a dense reward model offers numerous benefits, providing granular credit assignment for each token generated. We demonstrate one key application by using these recovered rewards to further improve the policy with reinforcement learning. Our method, Dense-Path REINFORCE, consistently outperforms the original SFT models on instruction-following benchmarks. This work reframes SFT not merely as policy imitation but as a powerful reward learning mechanism, opening new possibilities for leveraging expert demonstrations.
zh
[NLP-52] Litespark Technical Report: High-Throughput Energy-Efficient LLM Training Framework
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)预训练过程中存在的训练时间长和能耗高的问题,这些问题通常导致模型需要数月计算时间和高达千兆瓦时级别的电力消耗。其解决方案的关键在于提出了一种名为Litespark的新颖预训练框架,通过针对Transformer架构中的注意力机制(attention)和多层感知机(MLP)层进行针对性优化,结合结构改进与算法增强,在保持与标准Transformer实现兼容的前提下,显著提升模型浮点运算利用率(Model FLOPs Utilization, MFU)。实验证明,该方法在3B至30B参数规模的Llama模型上使用SlimPajama-627B数据集进行基准测试时,可实现2倍至6倍的训练吞吐量提升,并减少55%至83%的能耗,且具备模型和硬件无关性,适用于多种Transformer架构及后续训练阶段(如监督微调和直接偏好优化)。
链接: https://arxiv.org/abs/2510.02483
作者: Nii Osae Osae Dade,Moinul Hossain Rahat
机构: Mindbeam AI
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 14 pages
Abstract:Training Large Language Models (LLMs) is plagued by long training times and massive energy consumption, with modern models requiring months of computation and gigawatt-hours of electricity. In light of these challenges,we introduce Litespark, a novel pre-training framework that addresses these inefficiencies through targeted optimizations to transformer attention and MLP layers. Our approach combines architectural improvements with algorithmic enhancements to maximize Model FLOPs Utilization (MFU) while maintaining compatibility with standard transformer implementations. Comprehensive benchmarking on 3B and 30B parameter Llama models using the SlimPajama-627B dataset demonstrates substantial performance gains: 2x-6x training throughput improvement and 55%-83 % energy consumption reduction across multi-node H200 GPU clusters. These optimizations are model- and hardware-agnostic, enabling broad applicability across transformer architectures and extending to post-training phases including supervised fine-tuning and direct preference optimization.
zh
[NLP-53] SIMSplat: Predictive Driving Scene Editing with Language-aligned 4D Gaussian Splatting
【速读】: 该论文旨在解决现有虚拟驾驶模拟框架在生成真实场景时效率低下、编辑能力受限的问题,特别是缺乏直观且精确的场景操控手段。其解决方案的关键在于提出SIMSplat——一种基于语言对齐高斯点绘(Gaussian splatting)的预测性驾驶场景编辑器。该方法通过将自然语言提示与高斯重建的场景进行对齐,实现以语言为控制接口的直观编辑,并支持对道路对象的细粒度查询与修改(如添加新物体或调整车辆和行人轨迹),同时结合多智能体运动预测技术优化代理间的交互逻辑,从而提升场景的真实感与可控性。
链接: https://arxiv.org/abs/2510.02469
作者: Sung-Yeon Park,Adam Lee,Juanwu Lu,Can Cui,Luyang Jiang,Rohit Gupta,Kyungtae Han,Ahmadreza Moradipari,Ziran Wang
机构: Purdue University (普渡大学); UC Berkeley (加州大学伯克利分校); Toyota InfoTech Labs (丰田信息科技实验室)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Driving scene manipulation with sensor data is emerging as a promising alternative to traditional virtual driving simulators. However, existing frameworks struggle to generate realistic scenarios efficiently due to limited editing capabilities. To address these challenges, we present SIMSplat, a predictive driving scene editor with language-aligned Gaussian splatting. As a language-controlled editor, SIMSplat enables intuitive manipulation using natural language prompts. By aligning language with Gaussian-reconstructed scenes, it further supports direct querying of road objects, allowing precise and flexible editing. Our method provides detailed object-level editing, including adding new objects and modifying the trajectories of both vehicles and pedestrians, while also incorporating predictive path refinement through multi-agent motion prediction to generate realistic interactions among all agents in the scene. Experiments on the Waymo dataset demonstrate SIMSplat’s extensive editing capabilities and adaptability across a wide range of scenarios. Project page: this https URL
zh
[NLP-54] CLARITY: Clinical Assistant for Routing Inference and Triage EMNLP2025
【速读】: 该论文旨在解决医疗场景中患者分诊效率低、专科转诊不精准以及临床咨询耗时过长的问题。其解决方案的核心在于构建一个名为CLARITY(Clinical Assistant for Routing, Inference, and Triage)的混合架构AI平台,该平台融合有限状态机(Finite State Machine, FSM)以实现结构化对话流程,并结合基于大语言模型(Large Language Model, LLM)的协作代理来分析症状并优先推荐合适专科医生,从而提升分诊准确率与咨询效率。实证结果显示,CLARITY在首次尝试路由精度上超越人类水平,且平均咨询时长仅为人工方式的三分之一。
链接: https://arxiv.org/abs/2510.02463
作者: Vladimir Shaposhnikov,Aleksandr Nesterov,Ilia Kopanichuk,Ivan Bakulin,Egor Zhelvakov,Ruslan Abramov,Ekaterina Tsapieva,Dmitry V. Dylov,Ivan Oseledets
机构: AIRI; Skoltech; MIPT; SberMedAI; Sber
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: Accepted to EMNLP 2025 (Industrial Track)
Abstract:We present CLARITY (Clinical Assistant for Routing, Inference, and Triage), an AI-driven platform designed to facilitate patient-to-specialist routing, clinical consultations, and severity assessment of patients’ conditions. Its hybrid architecture combines a Finite State Machine (FSM) for structured dialogue flows with collaborative agents that employ Large Language Model (LLM) to analyze symptoms and prioritize referrals to appropriate specialists. Built on a modular microservices framework, CLARITY ensures safe, efficient, and robust performance, flexible and readily scalable to meet the demands of existing workflows and IT solutions in healthcare. We report integration of our clinical assistant into a large-scale nation-wide inter-hospital IT platform, with over 55,000 content-rich user dialogues completed within the two months of deployment, 2,500 of which were expert-annotated for a consequent validation. The validation results show that CLARITY surpasses human-level performance in terms of the first-attempt routing precision, naturally requiring up to 3 times shorter duration of the consultation than with a human. Comments: Accepted to EMNLP 2025 (Industrial Track) Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA) Cite as: arXiv:2510.02463 [cs.CL] (or arXiv:2510.02463v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2510.02463 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-55] How to Train Your Advisor: Steering Black-Box LLM s with Advisor Models
【速读】: 该论文旨在解决黑箱模型(black-box models)在实际应用中因缺乏可定制性而导致的适应性不足问题,尤其是在面对不同输入、用户或环境时,静态提示优化方法无法动态调整策略的局限。解决方案的关键在于提出Advisor Models——一种轻量级参数化策略模型,通过强化学习训练以在上下文中实时生成自然语言引导指令,从而对黑箱模型的行为进行实例级调节。该方法将advisor作为介于输入与主模型之间的第二小模型,利用环境奖励信号实现动态响应,显著提升了任务性能,并展现出跨模型迁移和对分布外输入的鲁棒性。
链接: https://arxiv.org/abs/2510.02453
作者: Parth Asawa,Alan Zhu,Matei Zaharia,Alexandros G. Dimakis,Joseph E. Gonzalez
机构: UC Berkeley (加州大学伯克利分校); Bespoke Labs (定制实验室)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Foundation models are increasingly deployed as black-box services, where model weights cannot be modified and customization is limited to prompting. While static prompt optimization has shown promise, it produces a single fixed prompt that fails to adapt to different inputs, users, or environments. We introduce Advisor Models, lightweight parametric policies trained with reinforcement learning to reactively issue natural language steering instructions in-context to black-box models. The advisor is a second small model that sits between the input and the model, shaping behavior on a per-instance basis using reward signals from the environment. Across multiple domains involving reasoning and personalization, we show that Advisor Models outperform static prompt optimizers, discovering environment dynamics and improving downstream task performance. We also demonstrate the generalizability of advisors by transferring them across black-box models, as well as the framework’s ability to achieve specialization while retaining robustness to out-of-distribution inputs. Viewed more broadly, Advisor Models provide a learnable interface to black-box systems where the advisor acts as a parametric, environment-specific memory. We argue that dynamic optimization of black-box models via Advisor Models is a promising direction for enabling personalization and environment-adaptable AI with frontier-level capabilities.
zh
[NLP-56] Words That Make Language Models Perceive
【速读】: 该论文试图解决的问题是:纯文本训练的大语言模型(Large Language Models, LLMs)虽然缺乏直接的感知经验,但其内部表征是否隐含了多模态规律,以及能否通过特定手段激活这些潜在的跨模态结构。解决方案的关键在于采用轻量级提示工程(prompt engineering),即通过显式感官提示(如“see”或“hear”)引导模型在生成下一个词时,模拟基于未提供视觉或听觉证据的条件预测,从而激活与专业视觉和音频编码器更对齐的模态特异性表征。
链接: https://arxiv.org/abs/2510.02425
作者: Sophie L. Wang,Phillip Isola,Brian Cheung
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) trained purely on text ostensibly lack any direct perceptual experience, yet their internal representations are implicitly shaped by multimodal regularities encoded in language. We test the hypothesis that explicit sensory prompting can surface this latent structure, bringing a text-only LLM into closer representational alignment with specialist vision and audio encoders. When a sensory prompt tells the model to ‘see’ or ‘hear’, it cues the model to resolve its next-token predictions as if they were conditioned on latent visual or auditory evidence that is never actually supplied. Our findings reveal that lightweight prompt engineering can reliably activate modality-appropriate representations in purely text-trained LLMs.
zh
[NLP-57] Retrieval and Augmentation of Domain Knowledge for Text-to-SQL Semantic Parsing ICDM
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在将自然语言(Natural Language, NL)查询转换为SQL时,因数据库(Database, DB)间性能差异大而带来的翻译准确性问题。核心挑战在于NL查询常使用领域特定词汇,需结合数据库结构理解其语义才能准确映射到SQL。现有基准测试依赖不切实际的、针对具体查询的文本提示来表达领域知识,导致泛化能力弱。论文提出了一种系统性框架,在数据库层面关联结构化的领域陈述(structured domain statements),并通过子字符串匹配检索与用户查询相关的领域陈述。关键创新在于:1)以数据库级结构化领域陈述替代传统查询级文本提示,提升实用性与准确性;2)基于子字符串匹配的检索机制显著优于其他检索方法,从而提高LLM生成SQL的准确率。
链接: https://arxiv.org/abs/2510.02394
作者: Manasi Patwardhan,Ayush Agarwal,Shabbirhussain Bhaisaheb,Aseem Arora,Lovekesh Vig,Sunita Sarawagi
机构: 未知
类目: Computation and Language (cs.CL)
备注: 10 pages, 2 figures, 11 tables. Accepted in the 1st Workshop on Grounding Documents with Reasoning, Agents, Retrieval, and Attribution (RARA) held in conjunction with IEEE International Conference on Data Mining (ICDM) 2025
Abstract:The performance of Large Language Models (LLMs) for translating Natural Language (NL) queries into SQL varies significantly across databases (DBs). NL queries are often expressed using a domain specific vocabulary, and mapping these to the correct SQL requires an understanding of the embedded domain expressions, their relationship to the DB schema structure. Existing benchmarks rely on unrealistic, ad-hoc query specific textual hints for expressing domain knowledge. In this paper, we propose a systematic framework for associating structured domain statements at the database level. We present retrieval of relevant structured domain statements given a user query using sub-string level match. We evaluate on eleven realistic DB schemas covering diverse domains across five open-source and proprietary LLMs and demonstrate that (1) DB level structured domain statements are more practical and accurate than existing ad-hoc query specific textual domain statements, and (2) Our sub-string match based retrieval of relevant domain statements provides significantly higher accuracy than other retrieval approaches.
zh
[NLP-58] KnowledgeSmith: Uncovering Knowledge Updating in LLM s with Model Editing and Unlearning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在知识更新机制方面缺乏系统性理解的问题,尤其是知识编辑(Knowledge Editing)与机器遗忘(Machine Unlearning)之间的差异及其随训练数据规模变化的演化规律。现有研究受限于评估数据的孤立性、小样本性和不充分性,难以揭示LLMs如何类似人类般调整知识结构以及不同更新策略对模型知识传播、可塑性扩展、一致性与鲁棒性的具体影响。解决方案的关键在于提出一个统一框架KnowledgeSmith,将知识编辑与机器遗忘建模为同一约束优化问题,并设计了一个自动化的数据集生成器,能够在多图层级和不同数据规模下提供结构化干预,从而实现对知识更新路径的可控实验分析。这一方法使研究者能够深入洞察LLMs的知识传播特性及一致性-容量权衡关系,为开发更可靠、可扩展的知识更新策略提供实证依据。
链接: https://arxiv.org/abs/2510.02392
作者: Yinyi Luo,Zhexian Zhou,Hao Chen,Kai Qiu,Marios Savvides,Yixuan Li,Jindong Wang
机构: Carnegie Mellon University (卡内基梅隆大学); William & Mary (威廉玛丽学院); University of Wisconsin-Madison (威斯康星大学麦迪逊分校)
类目: Computation and Language (cs.CL)
备注: Technical report
Abstract:Knowledge editing and machine unlearning are two popular approaches for large language models (LLMs) to stay up-to-date. However, the knowledge updating mechanism of LLMs remains largely unexplored due to insufficient, isolated, and small-scale evaluation. For instance, are LLMs similar to humans in modifying certain knowledge? What differs editing and unlearning as training data increases? This paper proposes KnowledgeSmith, a unified framework to systematically understand the updating mechanism of LLMs. We first cast editing and unlearning as instances of one constrained optimization problem. Then, we propose an automatic dataset generator that provides structured interventions across multiple graph levels and data scales, enabling controlled studies of how different modification strategies propagate through model knowledge. Extensive experiments demonstrate nuanced insights over knowledge propagation, plasticity scaling, consistency, and robustness. For instance, our results show that LLMs do not exhibit similar updating as humans for different levels of knowledge, and there exists consistency-capacity trade-off. We hope our findings can offer suggestions to the design of more reliable and scalable strategies. Code: this https URL
zh
[NLP-59] Learning to Route: A Rule-Driven Agent Framework for Hybrid-Source Retrieval-Augmented Generation
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在领域特定问答(Domain-specific Question Answering, QA)中因缺乏准确、实时信息而表现不佳的问题。现有检索增强生成(Retrieval-Augmented Generation, RAG)系统主要依赖非结构化文档,忽视了关系型数据库所具有的高精度、时效性强且可高效查询的结构化事实信息。解决方案的关键在于提出一个规则驱动的路由框架(rule-driven routing framework),其核心包括:(1)通过规则评分机制选择最合适的知识源(数据库或文档)以平衡效果与效率;(2)引入规则制定专家代理(rule-making expert agent)基于QA反馈动态优化规则,提升适应性;(3)利用路径级元缓存(path-level meta-cache)复用相似查询的历史路由决策,降低延迟和成本。实验证明该框架在多个基准测试上优于静态策略和学习型路由基线,兼具高准确率与可控计算开销。
链接: https://arxiv.org/abs/2510.02388
作者: Haoyue Bai,Haoyu Wang,Shengyu Chen,Zhengzhang Chen,Lu-An Tang,Wei Cheng,Haifeng Chen,Yanjie Fu
机构: Arizona State University (亚利桑那州立大学); NEC Laboratories America, Inc. (美国 NEC 实验室公司)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) have shown remarkable performance on general Question Answering (QA), yet they often struggle in domain-specific scenarios where accurate and up-to-date information is required. Retrieval-Augmented Generation (RAG) addresses this limitation by enriching LLMs with external knowledge, but existing systems primarily rely on unstructured documents, while largely overlooking relational databases, which provide precise, timely, and efficiently queryable factual information, serving as indispensable infrastructure in domains such as finance, healthcare, and scientific research. Motivated by this gap, we conduct a systematic analysis that reveals three central observations: (i) databases and documents offer complementary strengths across queries, (ii) naively combining both sources introduces noise and cost without consistent accuracy gains, and (iii) selecting the most suitable source for each query is crucial to balance effectiveness and efficiency. We further observe that query types show consistent regularities in their alignment with retrieval paths, suggesting that routing decisions can be effectively guided by systematic rules that capture these patterns. Building on these insights, we propose a rule-driven routing framework. A routing agent scores candidate augmentation paths based on explicit rules and selects the most suitable one; a rule-making expert agent refines the rules over time using QA feedback to maintain adaptability; and a path-level meta-cache reuses past routing decisions for semantically similar queries to reduce latency and cost. Experiments on three QA benchmarks demonstrate that our framework consistently outperforms static strategies and learned routing baselines, achieving higher accuracy while maintaining moderate computational cost.
zh
[NLP-60] Uncertainty-Aware Answer Selection for Improved Reasoning in Multi-LLM Systems
【速读】: 该论文旨在解决在资源受限环境下,如何从多个不同大语言模型(Large Language Models, LLMs)生成的候选回答中高效且可靠地选择最优响应的问题。现有方法通常依赖昂贵的外部验证器、人工评估或单一模型的自一致性采样策略,而多LLM系统虽具备更高的响应多样性潜力,却常因缺乏有效筛选机制而导致性能不佳。论文提出了一种基于校准对数似然分数(calibrated log-likelihood score)的原理性、新颖且计算高效的响应选择方法,该方法通过隐式利用各模型自身的知识和置信度来实现最优响应识别,在GSM8K、MMLU(6个子集)和ARC数据集上分别实现了约4%、3%和5%的性能提升,显著优于传统的单模型自一致性方法和多LLM非协同设置。
链接: https://arxiv.org/abs/2510.02377
作者: Aakriti Agrawal,Rohith Aralikatti,Anirudh Satheesh,Souradip Chakraborty,Amrit Singh Bedi,Furong Huang
机构: University of Maryland (马里兰大学); Hilabs; University of Central Florida (中佛罗里达大学); Capital One (Capital One)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs) have demonstrated exceptional capabilities, yet selecting the most reliable response from multiple LLMs remains a challenge, particularly in resource-constrained settings. Existing approaches often depend on costly external verifiers, human evaluators, or self-consistency techniques that require multiple samples from a single model. While multi-LLM systems produce more diverse responses than single models and thus have greater potential, they often underperform compared to single LLM self-consistency. We propose a principled, novel and computationally efficient method to select the best response from multiple different LLMs using a calibrated log-likelihood score, implicitly leveraging the inherent knowledge and confidence of these models. Our method demonstrates improvements of approx. 4%, 3%, and 5% across both debate (multi-round LLM discussions) and non-debate (Best-of-N with multiple LLMs) settings on GSM8K, MMLU (6 subsets), and ARC datasets respectively.
zh
[NLP-61] Pretraining with hierarchical memories: separating long-tail and common knowledge
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)因参数规模急剧增长而导致的资源消耗过高问题,尤其是在边缘设备上受限于推理时内存和计算能力的情况下,将全部世界知识存储于模型参数中既不必要也不可行。其解决方案的关键在于提出一种基于记忆增强架构的小型语言模型:通过引入一个分层的参数化记忆库(hierarchical parametric memory bank),在预训练和推理阶段动态加载与上下文相关的记忆块,并将其与小型语言模型融合。该方法实现了对长尾世界知识的有效存储与检索,同时保留了小模型在通用知识和推理能力上的锚定作用,从而在显著降低模型参数量的同时保持甚至超越更大模型的性能表现。
链接: https://arxiv.org/abs/2510.02375
作者: Hadi Pouransari,David Grangier,C Thomas,Michael Kirchhof,Oncel Tuzel
机构: Apple(苹果)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The impressive performance gains of modern language models currently rely on scaling parameters: larger models store more world knowledge and reason better. Yet compressing all world knowledge into parameters is unnecessary, as only a fraction is used per prompt, and impractical for edge devices with limited inference-time memory and compute. We address this shortcoming by a memory-augmented architecture and a pretraining strategy aligned with existing hardware paradigms. We introduce small language models that access large hierarchical parametric memory banks encoding world knowledge. During pretraining and inference, we fetch a small, context-dependent memory block and add it to the model. Our pretraining learns to store long-tail world knowledge in the memory parameters, while the small language model acts as an anchor capturing common knowledge and general reasoning abilities. Through trillion-token-scale experiments, we show significant gains: a 160M-parameters model augmented with an 18M-parameters memory fetched from a 4.6B memory bank obtains comparable performance to a regular model with more than 2x the parameters. Through extensive experiments, we study the optimal type and size of parametric memories in transformers, scaling them to over 21B parameters. We find that our proposed hierarchical feed-forward memories work robustly across transformer architectures, whether added during pretraining or post-hoc.
zh
[NLP-62] raining Dynamics of Parametric and In-Context Knowledge Utilization in Language Models
【速读】: 该论文旨在解决大语言模型在推理阶段面临的一个关键问题:如何有效协调上下文检索知识(in-context knowledge)与预训练获得的参数化知识(parametric knowledge)之间的冲突。当前主流的检索增强生成(Retrieval-Augmented Generation, RAG)方法虽广泛采用,但缺乏对模型在训练过程中形成知识仲裁策略(knowledge-arbitration strategies)的系统理解,导致模型可能盲目接受外部信息或僵化依赖内部参数,从而造成资源浪费和性能下降。解决方案的关键在于通过受控实验揭示训练条件对模型知识仲裁能力的影响机制——研究发现,文档内事实重复有助于同时提升模型的参数化和上下文知识利用能力;而训练数据中包含不一致信息或分布偏移时,可促使模型发展出更稳健的知识融合策略。这些结果表明,非理想的数据特性不应被剔除,而是应被视为促进鲁棒仲裁行为学习的重要因素。
链接: https://arxiv.org/abs/2510.02370
作者: Minsung Kim,Dong-Kyum Kim,Jea Kwon,Nakyeong Yang,Kyomin Jung,Meeyoung Cha
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 16 pages
Abstract:Large language models often encounter conflicts between in-context knowledge retrieved at inference time and parametric knowledge acquired during pretraining. Models that accept external knowledge uncritically are vulnerable to misinformation, whereas models that adhere rigidly to parametric knowledge fail to benefit from retrieval. Despite the widespread adoption of retrieval-augmented generation, we still lack a systematic understanding of what shapes knowledge-arbitration strategies during training. This gap risks producing pretrained models with undesirable arbitration behaviors and, consequently, wasting substantial computational resources after the pretraining budget has already been spent. To address this problem, we present the first controlled study of how training conditions influence models’ use of in-context and parametric knowledge, and how they arbitrate between them. We train transformer-based language models on a synthetic biographies corpus while systematically controlling various conditions. Our experiments reveal that intra-document repetition of facts fosters the development of both parametric and in-context capabilities. Moreover, training on a corpus that contains inconsistent information or distributional skew encourages models to develop robust strategies for leveraging parametric and in-context knowledge. Rather than viewing these non-ideal properties as artifacts to remove, our results indicate that they are important for learning robust arbitration. These insights offer concrete, empirical guidance for pretraining models that harmoniously integrate parametric and in-context knowledge.
zh
[NLP-63] Beyond Manuals and Tasks: Instance-Level Context Learning for LLM Agents ICLR2026
【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)代理在复杂任务中因缺乏实例级上下文(instance-level context)而导致的性能瓶颈问题。实例级上下文是指与特定环境实例相关的可验证且可复用的事实,如物体位置、制作配方和局部规则等,这些信息对决策至关重要,但传统方法往往忽略其获取与利用。解决方案的关键在于提出一种任务无关的实例级上下文学习(Instance-Level Context Learning, ILCL)方法:通过引导式探索(guided exploration),结合紧凑的“待办事项森林”(TODO forest)来智能优先排序下一步行动,并采用轻量级的“计划-执行-提取”循环(plan-act-extract loop)高效完成探索任务,从而自动生成高精度、可复用的上下文文档,显著提升LLM代理的成功率与效率。
链接: https://arxiv.org/abs/2510.02369
作者: Kuntai Cai,Juncheng Liu,Xianglin Yang,Zhaojie Niu,Xiaokui Xiao,Xing Chen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under review at ICLR 2026
Abstract:Large language model (LLM) agents typically receive two kinds of context: (i) environment-level manuals that define interaction interfaces and global rules, and (ii) task-level guidance or demonstrations tied to specific goals. In this work, we identify a crucial but overlooked third type of context, instance-level context, which consists of verifiable and reusable facts tied to a specific environment instance, such as object locations, crafting recipes, and local rules. We argue that the absence of instance-level context is a common source of failure for LLM agents in complex tasks, as success often depends not only on reasoning over global rules or task prompts but also on making decisions based on precise and persistent facts. Acquiring such context requires more than memorization: the challenge lies in efficiently exploring, validating, and formatting these facts under tight interaction budgets. We formalize this problem as Instance-Level Context Learning (ILCL) and introduce our task-agnostic method to solve it. Our method performs a guided exploration, using a compact TODO forest to intelligently prioritize its next actions and a lightweight plan-act-extract loop to execute them. This process automatically produces a high-precision context document that is reusable across many downstream tasks and agents, thereby amortizing the initial exploration cost. Experiments across TextWorld, ALFWorld, and Crafter demonstrate consistent gains in both success and efficiency: for instance, ReAct’s mean success rate in TextWorld rises from 37% to 95%, while IGE improves from 81% to 95%. By transforming one-off exploration into persistent, reusable knowledge, our method complements existing contexts to enable more reliable and efficient LLM agents.
zh
[NLP-64] A Cross-Lingual Analysis of Bias in Large Language Models Using Romanian History
【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在回答具有争议性的历史问题时是否存在偏见及其稳定性如何,特别是在不同语言和响应格式下模型输出的一致性与中立性。解决方案的关键在于通过三阶段研究设计,系统性地比较LLMs在不同情境下的回答模式——包括二元判断与数值评分两种格式,并考察语言背景对模型立场转换的影响,从而揭示模型在特定语境下表现出的响应不一致性及其潜在偏差来源。
链接: https://arxiv.org/abs/2510.02362
作者: Matei-Iulian Cocu,Răzvan-Cosmin Cristia,Adrian Marius Dumitran
机构: University of Bucharest (布加勒斯特大学); Softbinator
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages
Abstract:In this case study, we select a set of controversial Romanian historical questions and ask multiple Large Language Models to answer them across languages and contexts, in order to assess their biases. Besides being a study mainly performed for educational purposes, the motivation also lies in the recognition that history is often presented through altered perspectives, primarily influenced by the culture and ideals of a state, even through large language models. Since they are often trained on certain data sets that may present certain ambiguities, the lack of neutrality is subsequently instilled in users. The research process was carried out in three stages, to confirm the idea that the type of response expected can influence, to a certain extent, the response itself; after providing an affirmative answer to some given question, an LLM could shift its way of thinking after being asked the same question again, but being told to respond with a numerical value of a scale. Results show that binary response stability is relatively high but far from perfect and varies by language. Models often flip stance across languages or between formats; numeric ratings frequently diverge from the initial binary choice, and the most consistent models are not always those judged most accurate or neutral. Our research brings to light the predisposition of models to such inconsistencies, within a specific contextualization of the language for the question asked.
zh
[NLP-65] ChunkLLM : A Lightweight Pluggable Framework for Accelerating LLM s Inference
【速读】: 该论文旨在解决基于Transformer的大模型在处理长序列时因自注意力机制(self-attention)计算复杂度与输入token数量呈二次增长而导致的严重计算效率低下问题。现有基于块选择和压缩的方法要么存在语义不完整,要么训练与推理效率不佳。解决方案的关键在于提出一种轻量且可插拔的训练框架ChunkLLM,其核心创新包括两个组件:QK Adapter(Q-Adapter和K-Adapter)用于在每一层实现特征压缩与块内注意力获取,以及位于模型最底层的Chunk Adapter,通过利用上下文语义信息检测块边界。训练阶段冻结主干参数,仅优化这两个适配器,并设计了一种注意力蒸馏方法提升关键块的召回率;推理阶段仅在检测到块边界时触发块选择,显著加速推理过程。实验表明,该方法在保持长文本任务性能达98.64%的同时,将键值缓存保留率维持在48.58%,并在120K长度文本处理中实现最高4.48倍加速。
链接: https://arxiv.org/abs/2510.02361
作者: Haojie Ouyang,Jianwei Lv,Lei Ren,Chen Wei,Xiaojie Wang,Fangxiang Feng
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Li Auto (理想汽车)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Transformer-based large models excel in natural language processing and computer vision, but face severe computational inefficiencies due to the self-attention’s quadratic complexity with input tokens. Recently, researchers have proposed a series of methods based on block selection and compression to alleviate this problem, but they either have issues with semantic incompleteness or poor training-inference efficiency. To comprehensively address these challenges, we propose ChunkLLM, a lightweight and pluggable training framework. Specifically, we introduce two components: QK Adapter (Q-Adapter and K-Adapter) and Chunk Adapter. The former is attached to each Transformer layer, serving dual purposes of feature compression and chunk attention acquisition. The latter operates at the bottommost layer of the model, functioning to detect chunk boundaries by leveraging contextual semantic information. During the training phase, the parameters of the backbone remain frozen, with only the QK Adapter and Chunk Adapter undergoing training. Notably, we design an attention distillation method for training the QK Adapter, which enhances the recall rate of key chunks. During the inference phase, chunk selection is triggered exclusively when the current token is detected as a chunk boundary, thereby accelerating model inference. Experimental evaluations are conducted on a diverse set of long-text and short-text benchmark datasets spanning multiple tasks. ChunkLLM not only attains comparable performance on short-text benchmarks but also maintains 98.64% of the performance on long-context benchmarks while preserving a 48.58% key-value cache retention rate. Particularly, ChunkLLM attains a maximum speedup of 4.48x in comparison to the vanilla Transformer in the processing of 120K long texts.
zh
[NLP-66] Spiral of Silence in Large Language Model Agents
【速读】: 该论文试图解决的问题是:在由大语言模型(Large Language Models, LLMs)构成的群体中,是否存在类似“沉默的螺旋”(Spiral of Silence, SoS)的社会心理现象——即少数观点因缺乏支持而被抑制,从而导致多数意见主导公共话语。传统SoS理论基于人类社会的心理机制,但其是否能在纯统计驱动的语言生成系统中自发涌现尚不明确。论文的关键解决方案在于提出一个评估框架,通过控制“历史”(History)和“人格”(Persona)两类信号的可用性,在多个开源与闭源LLM模型上进行实验,量化分析意见动态变化趋势(如Mann-Kendall检验、Spearman相关系数)及分布集中度(如峰度、四分位距)。结果表明,仅当历史信号与人格信号共同存在时,LLM群体才会表现出显著的多数优势和SoS模式;单独依赖历史信号会导致强锚定效应,而仅靠人格信号则产生无关联的多样化观点,说明历史信息对SoS动力学的形成至关重要。该研究为负责任的人工智能设计提供了实证基础,并揭示了需监控和缓解LLM代理系统中潜在的从众行为。
链接: https://arxiv.org/abs/2510.02360
作者: Mingze Zhong,Meng Fang,Zijing Shi,Yuxuan Huang,Shunfeng Zheng,Yali Du,Ling Chen,Jun Wang
机构: AAII, University of Technology Sydney (悉尼科技大学); University of Liverpool (利物浦大学); King’s College London (伦敦国王学院); University College London (伦敦大学学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The Spiral of Silence (SoS) theory holds that individuals with minority views often refrain from speaking out for fear of social isolation, enabling majority positions to dominate public discourse. When the ‘agents’ are large language models (LLMs), however, the classical psychological explanation is not directly applicable, since SoS was developed for human societies. This raises a central question: can SoS-like dynamics nevertheless emerge from purely statistical language generation in LLM collectives? We propose an evaluation framework for examining SoS in LLM agents. Specifically, we consider four controlled conditions that systematically vary the availability of ‘History’ and ‘Persona’ signals. Opinion dynamics are assessed using trend tests such as Mann-Kendall and Spearman’s rank, along with concentration measures including kurtosis and interquartile range. Experiments across open-source and closed-source models show that history and persona together produce strong majority dominance and replicate SoS patterns; history signals alone induce strong anchoring; and persona signals alone foster diverse but uncorrelated opinions, indicating that without historical anchoring, SoS dynamics cannot emerge. The work bridges computational sociology and responsible AI design, highlighting the need to monitor and mitigate emergent conformity in LLM-agent systems.
zh
[NLP-67] Emission-GPT : A domain-specific language model agent for knowledge retrieval emission inventory and data analysis
【速读】: 该论文旨在解决大气污染物与温室气体排放数据获取和分析过程中存在的知识碎片化、专业化程度高以及现有方法效率低下等问题,这些问题限制了非专家对排放信息的理解与应用。解决方案的关键在于构建Emission-GPT——一个面向大气排放领域的知识增强型大语言模型代理(knowledge-enhanced large language model agent),其核心创新包括:基于超过10,000份权威文档(如标准、报告、指南和同行评审文献)构建的结构化知识库,结合提示工程(prompt engineering)与问题补全技术实现精准领域问答,并支持通过自然语言交互完成排放数据查询、可视化、源贡献分析及排放因子推荐等任务,从而显著提升排放清单编制与情景评估的自动化水平与可及性。
链接: https://arxiv.org/abs/2510.02359
作者: Jiashu Ye,Tong Wu,Weiwen Chen,Hao Zhang,Zeteng Lin,Xingxing Li,Shujuan Weng,Manni Zhu,Xin Yuan,Xinlong Hong,Jingjie Li,Junyu Zheng,Zhijiong Huang,Jing Tang
机构: The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); Jinan University (暨南大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Improving air quality and addressing climate change relies on accurate understanding and analysis of air pollutant and greenhouse gas emissions. However, emission-related knowledge is often fragmented and highly specialized, while existing methods for accessing and compiling emissions data remain inefficient. These issues hinder the ability of non-experts to interpret emissions information, posing challenges to research and management. To address this, we present Emission-GPT, a knowledge-enhanced large language model agent tailored for the atmospheric emissions domain. Built on a curated knowledge base of over 10,000 documents (including standards, reports, guidebooks, and peer-reviewed literature), Emission-GPT integrates prompt engineering and question completion to support accurate domain-specific question answering. Emission-GPT also enables users to interactively analyze emissions data via natural language, such as querying and visualizing inventories, analyzing source contributions, and recommending emission factors for user-defined scenarios. A case study in Guangdong Province demonstrates that Emission-GPT can extract key insights–such as point source distributions and sectoral trends–directly from raw data with simple prompts. Its modular and extensible architecture facilitates automation of traditionally manual workflows, positioning Emission-GPT as a foundational tool for next-generation emission inventory development and scenario-based assessment.
zh
[NLP-68] DiffuSpec: Unlocking Diffusion Language Models for Speculative Decoding
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在自回归(Autoregressive, AR)解码过程中因逐token生成而导致的高延迟问题。现有方案如推测解码(Speculative Decoding)通过引入快速起草器(drafter)提出多token草案并由目标模型并行验证以提升效率,但多数部署仍依赖AR起草器,限制了实际加速效果。论文提出的解决方案关键在于引入DiffuSpec框架,利用预训练扩散语言模型(Diffusion Language Model, DLM)实现单次前向传播生成多token草案,从而突破AR起草器的串行瓶颈;同时设计因果一致性路径搜索(Causal-Consistency Path Search, CPS)从DLM生成的token lattice中提取符合AR验证约束的左到右路径,并结合自适应草案长度控制器(Adaptive Draft-Length, ADL)根据历史接受率动态调整草案长度,实现速度与质量的平衡。实验表明,DiffuSpec在多个基准上可实现最高达3倍的墙-clock加速,证明了基于扩散模型的起草策略在推测解码中的有效性与鲁棒性。
链接: https://arxiv.org/abs/2510.02358
作者: Guanghao Li,Zhihui Fu,Min Fang,Qibin Zhao,Ming Tang,Chun Yuan,Jun Wang
机构: Tsinghua University (清华大学); Southern University of Science and Technology (南方科技大学); OPPO Research Institute (OPPO研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:As large language models (LLMs) scale up, accuracy improves, but the autoregressive (AR) nature of decoding increases latency since each token requires a serial forward pass. Speculative decoding addresses this by employing a fast drafter to propose multi-token drafts, which are then verified in parallel by the target model. However, many deployments still rely on AR drafters, where sequential passes limit wall-clock gains. We revisit the drafting stage and present DiffuSpec, a training-free drop-in framework that uses a pretrained diffusion language model (DLM) to produce multi-token drafts in a single forward pass, while remaining compatible with standard AR verifiers. Because DLM drafts are generated under bidirectional conditioning, parallel per-position candidates form a token lattice in which the locally highest-probability token at each position need not form a causal left-to-right path. Moreover, DLM drafting requires pre-specifying a draft length, inducing a speed-quality trade-off. To address these challenges, we introduce two practical components: (i) a causal-consistency path search (CPS) over this lattice that extracts a left-to-right path aligned with AR verification; and (ii) an adaptive draft-length (ADL) controller that adjusts next proposal size based on recent acceptance feedback and realized generated length. Across benchmarks, DiffuSpec yields up to 3x wall-clock speedup, establishing diffusion-based drafting as a robust alternative to autoregressive drafters for speculative decoding.
zh
[NLP-69] Modeling the language cortex with form-independent and enriched representations of sentence meaning reveals remarkable semantic abstractness
【速读】: 该论文旨在解决语言 cortex 中语义表征的抽象性问题,即人类语言系统如何在不依赖具体语言形式的情况下表征意义。其解决方案的关键在于利用视觉-语言模型(vision-language models)生成与句子对应的多幅图像,并提取其中的视觉模型嵌入(embedding),通过聚合多个生成图像的嵌入来预测语言 cortex 的神经响应;同时,通过多句式改写(paraphrase)和引入隐含上下文细节(如“我吃了一个煎饼”扩展为包含“枫糖浆”的版本)进一步提升预测准确性,从而证明语言 cortex 中存在高度抽象且形式无关的语义表征,其丰富度甚至超越当前大型语言模型的表征能力。
链接: https://arxiv.org/abs/2510.02354
作者: Shreya Saha,Shurui Li,Greta Tuckute,Yuanning Li,Ru-Yuan Zhang,Leila Wehbe,Evelina Fedorenko,Meenakshi Khosla
机构: University of California San Diego (加州大学圣地亚哥分校); ShanghaiTech University (上海科技大学); Kempner Institute, Harvard University (哈佛大学肯普纳研究所); Shanghai Jiao Tong University (上海交通大学); Carnegie Mellon University (卡内基梅隆大学); Massachusetts Institute of Technology (麻省理工学院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:The human language system represents both linguistic forms and meanings, but the abstractness of the meaning representations remains debated. Here, we searched for abstract representations of meaning in the language cortex by modeling neural responses to sentences using representations from vision and language models. When we generate images corresponding to sentences and extract vision model embeddings, we find that aggregating across multiple generated images yields increasingly accurate predictions of language cortex responses, sometimes rivaling large language models. Similarly, averaging embeddings across multiple paraphrases of a sentence improves prediction accuracy compared to any single paraphrase. Enriching paraphrases with contextual details that may be implicit (e.g., augmenting “I had a pancake” to include details like “maple syrup”) further increases prediction accuracy, even surpassing predictions based on the embedding of the original sentence, suggesting that the language system maintains richer and broader semantic representations than language models. Together, these results demonstrate the existence of highly abstract, form-independent meaning representations within the language cortex.
zh
[NLP-70] An Senegalese Legal Texts Structuration Using LLM -augmented Knowledge Graph
【速读】: 该论文旨在解决塞内加尔司法体系中法律文本获取与组织困难的问题,以提升公民和法律从业者对权利与义务的理解效率。其关键解决方案在于利用生成式 AI(Generative AI)和大语言模型(Large Language Models, LLMs)技术,从多种法律文件中成功提取7,967条法律条文,并构建包含2,872个节点和10,774条关系的图数据库,从而实现法律文本间关联关系的可视化;同时采用先进的三元组抽取技术,验证了GPT-4o、GPT-4和Mistral-Large等模型在识别法律关系及元数据方面的有效性,为建立高效、可访问的法律信息框架提供了坚实基础。
链接: https://arxiv.org/abs/2510.02353
作者: Oumar Kane,Mouhamad M. Allaya,Dame Samb,Mamadou Bousso
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 8 pages, 8 figures, 2 tables, 1 algorithm
Abstract:This study examines the application of artificial intelligence (AI) and large language models (LLM) to improve access to legal texts in Senegal’s judicial system. The emphasis is on the difficulties of extracting and organizing legal documents, highlighting the need for better access to judicial information. The research successfully extracted 7,967 articles from various legal documents, particularly focusing on the Land and Public Domain Code. A detailed graph database was developed, which contains 2,872 nodes and 10,774 relationships, aiding in the visualization of interconnections within legal texts. In addition, advanced triple extraction techniques were utilized for knowledge, demonstrating the effectiveness of models such as GPT-4o, GPT-4, and Mistral-Large in identifying relationships and relevant metadata. Through these technologies, the aim is to create a solid framework that allows Senegalese citizens and legal professionals to more effectively understand their rights and responsibilities.
zh
[NLP-71] Evaluating Bias in Spoken Dialogue LLM s for Real-World Decisions and Recommendations
【速读】: 该论文旨在解决语音大语言模型(Speech Large Language Models, SLMs)中偏见问题的系统性评估缺失这一关键科学问题,尤其关注多轮对话场景下由年龄、性别和口音等副语言特征引发的决策与推荐偏差。其解决方案的关键在于构建了一套针对语音交互系统的公平性评估框架,包括用于决策任务的群体不公平得分(Group Unfairness Score, GUS)和用于推荐任务的基于相似性的归一化统计率(Similarity-based Normalized Statistics Rate, SNSR),并首次在端到端语音对话模型(如Qwen2.5-Omni、GLM-4-Voice、GPT-4o Audio和Gemini-2.5-Flash)中量化了多轮对话中重复负面反馈对偏见传播的影响机制,揭示出开源模型对人口属性更敏感且推荐任务易放大跨群体差异,而闭源模型整体偏见更低,为构建公平可靠的音频交互系统提供了实证依据和基准工具(FairDialogue数据集及代码)。
链接: https://arxiv.org/abs/2510.02352
作者: Yihao Wu,Tianrui Wang,Yizhou Peng,Yi-Wen Chao,Xuyi Zhuang,Xinsheng Wang,Shunshun Yin,Ziyang Ma
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:While biases in large language models (LLMs), such as stereotypes and cultural tendencies in outputs, have been examined and identified, their presence and characteristics in spoken dialogue models (SDMs) with audio input and output remain largely unexplored. Paralinguistic features, such as age, gender, and accent, can affect model outputs; when compounded by multi-turn conversations, these effects may exacerbate biases, with potential implications for fairness in decision-making and recommendation tasks. In this paper, we systematically evaluate biases in speech LLMs and study the impact of multi-turn dialogues with repeated negative feedback. Bias is measured using Group Unfairness Score (GUS) for decisions and similarity-based normalized statistics rate (SNSR) for recommendations, across both open-source models like Qwen2.5-Omni and GLM-4-Voice, as well as closed-source APIs such as GPT-4o Audio and Gemini-2.5-Flash. Our analysis reveals that closed-source models generally exhibit lower bias, while open-source models are more sensitive to age and gender, and recommendation tasks tend to amplify cross-group disparities. We found that biased decisions may persist in multi-turn conversations. This work provides the first systematic study of biases in end-to-end spoken dialogue models, offering insights towards fair and reliable audio-based interactive systems. To facilitate further research, we release the FairDialogue dataset and evaluation code.
zh
[NLP-72] Language Culture and Ideology: Personalizing Offensiveness Detection in Political Tweets with Reasoning LLM s ICDM
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在政治话语中识别冒犯性内容时,如何因不同意识形态和文化背景而产生差异性判断的问题。研究通过多语言数据集(涵盖英语、波兰语和俄语)对多个主流LLM进行测试,要求其从极右翼、保守派、中间派及进步派等不同政治身份视角评估推文的冒犯性。解决方案的关键在于引入具备显式推理能力的模型(如DeepSeek-R1、o4-mini),这些模型展现出更强的一致性和对意识形态与文化细微差别的敏感度,从而显著提升冒犯性判断的个性化与可解释性,表明推理机制是实现跨语言、跨意识形态社会政治文本分类任务的核心要素。
链接: https://arxiv.org/abs/2510.02351
作者: Dzmitry Pihulski,Jan Kocoń
机构: Wroclaw Tech (弗罗茨瓦夫科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: To appear in the Proceedings of the IEEE International Conference on Data Mining Workshops (ICDMW)
Abstract:We explore how large language models (LLMs) assess offensiveness in political discourse when prompted to adopt specific political and cultural perspectives. Using a multilingual subset of the MD-Agreement dataset centered on tweets from the 2020 US elections, we evaluate several recent LLMs - including DeepSeek-R1, o4-mini, GPT-4.1-mini, Qwen3, Gemma, and Mistral - tasked with judging tweets as offensive or non-offensive from the viewpoints of varied political personas (far-right, conservative, centrist, progressive) across English, Polish, and Russian contexts. Our results show that larger models with explicit reasoning abilities (e.g., DeepSeek-R1, o4-mini) are more consistent and sensitive to ideological and cultural variation, while smaller models often fail to capture subtle distinctions. We find that reasoning capabilities significantly improve both the personalization and interpretability of offensiveness judgments, suggesting that such mechanisms are key to adapting LLMs for nuanced sociopolitical text classification across languages and ideologies.
zh
[NLP-73] LLM SQL: Upgrading WikiSQL for the LLM Era of Text-to-SQL ICDM
【速读】: 该论文旨在解决WikiSQL数据集在现代大语言模型(Large Language Models, LLMs)时代所暴露出的结构性与标注问题,如大小写敏感性不一致、数据类型错位、语法错误及未回答的问题等,这些问题限制了其在当前生成式自然语言到SQL(Text-to-SQL)模型中的有效性。解决方案的关键在于提出LLMSQL——一个针对LLM优化的系统性修订与重构版本,通过分类识别原始错误并采用自动化方法进行清洗和重新标注,同时将自然语言问题与完整SQL查询以纯文本形式提供,摒弃了原数据集中为指针网络设计的token选择机制,从而构建了一个更适合现代生成式Text-to-SQL模型训练与评估的基准数据集。
链接: https://arxiv.org/abs/2510.02350
作者: Dzmitry Pihulski,Karol Charchut,Viktoria Novogrodskaia,Jan Kocoń
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: To appear in the Proceedings of the IEEE International Conference on Data Mining Workshops (ICDMW)
Abstract:Converting natural language questions into SQL queries (Text-to-SQL) enables non-expert users to interact with relational databases and has long been a central task for natural language interfaces to data. While the WikiSQL dataset played a key role in early NL2SQL research, its usage has declined due to structural and annotation issues, including case sensitivity inconsistencies, data type mismatches, syntax errors, and unanswered questions. We present LLMSQL, a systematic revision and transformation of WikiSQL designed for the LLM era. We classify these errors and implement automated methods for cleaning and re-annotation. To assess the impact of these improvements, we evaluated multiple large language models (LLMs), including Gemma 3, LLaMA 3.2, Mistral 7B, gpt-oss 20B, Phi-3.5 Mini, Qwen 2.5, OpenAI o4-mini, DeepSeek R1 and others. Rather than serving as an update, LLMSQL is introduced as an LLM-ready benchmark: unlike the original WikiSQL, tailored for pointer-network models selecting tokens from input, LLMSQL provides clean natural language questions and full SQL queries as plain text, enabling straightforward generation and evaluation for modern natural language-to-SQL models.
zh
[NLP-74] mini-vec2vec: Scaling Universal Geometry Alignment with Linear Transformations
【速读】: 该论文旨在解决文本嵌入空间对齐(text embedding space alignment)中缺乏平行数据时的效率与稳定性问题。原始方法vec2vec虽能实现近完美的对齐,但计算成本高且不稳定;其解决方案的关键在于提出mini-vec2vec,一种基于线性变换的高效、鲁棒替代方案,包含伪平行向量初步匹配、变换拟合和迭代优化三个阶段,显著提升效率并保持或超越原方法性能,同时具备可解释性和易于扩展性。
链接: https://arxiv.org/abs/2510.02348
作者: Guy Dar
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We build upon vec2vec, a procedure designed to align text embedding spaces without parallel data. vec2vec finds a near-perfect alignment, but it is expensive and unstable. We present mini-vec2vec, a simple and efficient alternative that requires substantially lower computational cost and is highly robust. Moreover, the learned mapping is a linear transformation. Our method consists of three main stages: a tentative matching of pseudo-parallel embedding vectors, transformation fitting, and iterative refinement. Our linear alternative exceeds the original instantiation of vec2vec by orders of magnitude in efficiency, while matching or exceeding their results. The method’s stability and interpretable algorithmic steps facilitate scaling and unlock new opportunities for adoption in new domains and fields.
zh
[NLP-75] Small Language Models for Curriculum-based Guidance
【速读】: 该论文旨在解决生成式 AI(Generative AI)在教育领域应用中面临的计算资源消耗高、依赖云端基础设施以及可持续性不足的问题。其解决方案的关键在于利用检索增强生成(Retrieval-Augmented Generation, RAG)管道与选定的小型语言模型(Small Language Models, SLMs)相结合,通过优化提示工程和针对性检索策略,使SLMs在教学指导任务中达到与大型语言模型(Large Language Models, LLMs)相当的准确性与教学一致性。该方法不仅显著降低计算与能源需求,还支持在消费级硬件上实现实时部署,从而实现成本效益高、隐私保护强且环境友好的AI教学助理系统,为教育机构提供可扩展的个性化学习解决方案。
链接: https://arxiv.org/abs/2510.02347
作者: Konstantinos Katharakis,Sippo Rossi,Raghava Rao Mukkamala
机构: Copenhagen Business School (哥本哈根商学院); Hanken School of Economics (汉肯经济学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The adoption of generative AI and large language models (LLMs) in education is still emerging. In this study, we explore the development and evaluation of AI teaching assistants that provide curriculum-based guidance using a retrieval-augmented generation (RAG) pipeline applied to selected open-source small language models (SLMs). We benchmarked eight SLMs, including LLaMA 3.1, IBM Granite 3.3, and Gemma 3 (7-17B parameters), against GPT-4o. Our findings show that with proper prompting and targeted retrieval, SLMs can match LLMs in delivering accurate, pedagogically aligned responses. Importantly, SLMs offer significant sustainability benefits due to their lower computational and energy requirements, enabling real-time use on consumer-grade hardware without depending on cloud infrastructure. This makes them not only cost-effective and privacy-preserving but also environmentally responsible, positioning them as viable AI teaching assistants for educational institutions aiming to scale personalized learning in a sustainable and energy-efficient manner.
zh
[NLP-76] Breaking the MoE LLM Trilemma: Dynamic Expert Clustering with Structured Compression ICLR2026
【速读】: 该论文旨在解决混合专家(Mixture-of-Experts, MoE)大语言模型中存在的三重困境:负载不均衡、参数冗余和通信开销过高。其解决方案的关键在于提出一个统一框架,通过动态专家聚类与结构化压缩协同优化:首先采用在线聚类机制,基于参数相似性和激活相似性的融合度量周期性重组专家,稳定专家利用率;其次,在每个聚类内将专家权重分解为共享基础矩阵与极低秩残差适配器(residual adapters),实现每组高达五倍的参数压缩并保留专家特异性;同时引入两级分层路由策略,显著降低路由搜索空间和全对全通信量;最后结合异构精度存储(共享基用FP16,残差因子用INT4)与动态非活跃聚类卸载,使峰值内存消耗接近稠密模型水平。该方法在GLUE和WikiText-103上验证有效,相较标准MoE模型减少约80%总参数、提升10%-20%吞吐量,并将专家负载方差降低超过三倍。
链接: https://arxiv.org/abs/2510.02345
作者: Peijun Zhu,Ning Yang,Jiayu Wei,Jinghang Wu,Haijun Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: 12 pages, 2 figures, 3 tables. Under review as a conference paper at ICLR 2026
Abstract:Mixture-of-Experts (MoE) Large Language Models (LLMs) face a trilemma of load imbalance, parameter redundancy, and communication overhead. We introduce a unified framework based on dynamic expert clustering and structured compression to address these issues cohesively. Our method employs an online clustering procedure that periodically regroups experts using a fused metric of parameter and activation similarity, which stabilizes expert utilization. To our knowledge, this is one of the first frameworks to leverage the semantic embedding capability of the router to dynamically reconfigure the model’s architecture during training for substantial efficiency gains. Within each cluster, we decompose expert weights into a shared base matrix and extremely low-rank residual adapters, achieving up to fivefold parameter reduction per group while preserving specialization. This structure enables a two-stage hierarchical routing strategy: tokens are first assigned to a cluster, then to specific experts within it, drastically reducing the routing search space and the volume of all-to-all communication. Furthermore, a heterogeneous precision scheme, which stores shared bases in FP16 and residual factors in INT4, coupled with dynamic offloading of inactive clusters, reduces peak memory consumption to levels comparable to dense models. Evaluated on GLUE and WikiText-103, our framework matches the quality of standard MoE models while reducing total parameters by approximately 80%, improving throughput by 10% to 20%, and lowering expert load variance by a factor of over three. Our work demonstrates that structural reorganization is a principled path toward scalable, efficient, and memory-effective MoE LLMs.
zh
[NLP-77] textttBluePrint: A Social Media User Dataset for LLM Persona Evaluation and Training
【速读】: 该论文旨在解决当前生成式 AI (Generative AI) 在模拟社交媒体动态时缺乏标准化数据资源的问题,尤其在细调和评估基于大语言模型(LLMs)的社会媒体代理(social media agents)方面存在显著空白。其核心挑战在于如何构建既具行为真实性又符合隐私保护要求的数据集,以支持对用户交互模式的建模,尤其是上下文依赖的互动行为。解决方案的关键是提出SIMPACT框架——一个面向模拟的个性与行为捕捉工具包(SIMulation-oriented Persona and Action Capture Toolkit),通过将匿名化用户聚类为行为特征明确的“人格”(persona),并结合12种社交互动类型(如点赞、回复、转发等)及其前置行为上下文,构建可复用、可评估的高质量数据集;其中BluePrint作为首个公开实现,聚焦政治话语场景,不仅提供了用于评估政治话语建模的基准,也为其他领域(如虚假信息传播或极化现象研究)的数据构建提供模板,从而推动伦理合规且方法严谨的社会媒体仿真研究发展。
链接: https://arxiv.org/abs/2510.02343
作者: Aurélien Bück-Kaeffer,Je Qin Chooi,Dan Zhao,Maximilian Puelma Touzel,Kellin Pelrine,Jean-François Godbout,Reihaneh Rabbany,Zachary Yang
机构: McGill University (麦吉尔大学); Mila - Quebec Artificial Intelligence Institute (魁北克人工智能研究所); Harvard College (哈佛学院); NYU (纽约大学); Université de Montréal (蒙特利尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures, 11 tables
Abstract:Large language models (LLMs) offer promising capabilities for simulating social media dynamics at scale, enabling studies that would be ethically or logistically challenging with human subjects. However, the field lacks standardized data resources for fine-tuning and evaluating LLMs as realistic social media agents. We address this gap by introducing SIMPACT, the SIMulation-oriented Persona and Action Capture Toolkit, a privacy respecting framework for constructing behaviorally-grounded social media datasets suitable for training agent models. We formulate next-action prediction as a task for training and evaluating LLM-based agents and introduce metrics at both the cluster and population levels to assess behavioral fidelity and stylistic realism. As a concrete implementation, we release BluePrint, a large-scale dataset built from public Bluesky data focused on political discourse. BluePrint clusters anonymized users into personas of aggregated behaviours, capturing authentic engagement patterns while safeguarding privacy through pseudonymization and removal of personally identifiable information. The dataset includes a sizable action set of 12 social media interaction types (likes, replies, reposts, etc.), each instance tied to the posting activity preceding it. This supports the development of agents that use context-dependence, not only in the language, but also in the interaction behaviours of social media to model social media users. By standardizing data and evaluation protocols, SIMPACT provides a foundation for advancing rigorous, ethically responsible social media simulations. BluePrint serves as both an evaluation benchmark for political discourse modeling and a template for building domain specific datasets to study challenges such as misinformation and polarization.
zh
[NLP-78] CATMark: A Context-Aware Thresholding Framework for Robust Cross-Task Watermarking in Large Language Models
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)水印算法在低熵场景下导致文本质量下降的问题,以及现有基于熵阈值的方法在跨任务生成中适应性差、计算开销大的局限性。其解决方案的关键在于提出一种上下文感知的自适应水印框架(Context-Aware Threshold watermarking, \myalgo),该框架通过 logits 聚类将文本生成过程划分为语义状态,并据此动态构建上下文感知的熵阈值,从而在保持结构化内容保真度的同时嵌入鲁棒水印,且无需预设阈值或针对特定任务进行调优。
链接: https://arxiv.org/abs/2510.02342
作者: Yu Zhang,Shuliang Liu,Xu Yang,Xuming Hu
机构: The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); South China University of Technology (华南理工大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Watermarking algorithms for Large Language Models (LLMs) effectively identify machine-generated content by embedding and detecting hidden statistical features in text. However, such embedding leads to a decline in text quality, especially in low-entropy scenarios where performance needs improvement. Existing methods that rely on entropy thresholds often require significant computational resources for tuning and demonstrate poor adaptability to unknown or cross-task generation scenarios. We propose \textbfContext-\textbfAware \textbfThreshold watermarking ( \myalgo ), a novel framework that dynamically adjusts watermarking intensity based on real-time semantic context. \myalgo partitions text generation into semantic states using logits clustering, establishing context-aware entropy thresholds that preserve fidelity in structured content while embedding robust watermarks. Crucially, it requires no pre-defined thresholds or task-specific tuning. Experiments show \myalgo improves text quality in cross-tasks without sacrificing detection accuracy.
zh
[NLP-79] DRIFT: Learning from Abundant User Dissatisfaction in Real-World Preference Learning
【速读】: 该论文旨在解决大语言模型在真实场景部署中,因显式满意度(SAT)反馈稀缺而难以有效进行偏好学习的问题。现有方法依赖昂贵的人工标注或假设正样本充足,无法充分利用用户通过迭代修正、偏好表达等行为产生的隐式不满意度(DSAT)信号。其解决方案的关键在于提出DRIFT(Dissatisfaction-Refined Iterative Preference Training)框架,该框架以真实世界的DSAT信号为训练锚点,并动态从演化中的策略中采样正样本,从而实现高效且可扩展的后训练优化。实验表明,DRIFT在WildBench和AlpacaEval2等基准上显著优于基线方法,尤其在14B规模下超越GPT-4o-mini,同时保持探索能力并避免梯度退化,体现了其对最丰富且信息量大的DSAT信号的有效利用。
链接: https://arxiv.org/abs/2510.02341
作者: Yifan Wang,Bolian Li,Junlin Wu,Zhaoxuan Tan,Zheli Liu,Ruqi Zhang,Ananth Grama,Qingkai Zeng
机构: Purdue University (普渡大学); Washington University in St. Louis (圣路易斯华盛顿大学); University of Notre Dame (圣母大学); Nankai University (南开大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Real-world large language model deployments (e.g., conversational AI systems, code generation assistants) naturally generate abundant implicit user dissatisfaction (DSAT) signals, as users iterate toward better answers through refinements, corrections, and expressed preferences, while explicit satisfaction (SAT) feedback is scarce. Existing preference learning approaches are poorly aligned with this data profile, as they rely on costly human annotations or assume plentiful positive responses. In this paper, we introduce \textbfDRIFT (\textbfDissatisfaction-\textbfRefined \textbfIterative pre\textbfFerence \textbfTraining), which anchors training on real-world DSAT signals and samples positives dynamically from the evolving policy. Empirically, DRIFT models trained on real-world \textitWildFeedback datasets and synthetic \textitUltraFeedback datasets achieve up to +6.23% (7B) / +7.61% (14B) on WildBench Task Score and up to +8.95% (7B) / +12.29% (14B) on AlpacaEval2 win rate over base models, outperforming strong baseline methods such as iterative DPO and SPIN. At larger scales, the improvements are particularly pronounced: 14B models trained with DRIFT surpass GPT-4o-mini on WildBench. Further analysis shows that DRIFT also preserves exploratory capacity, yielding more diverse high-reward solutions rather than collapsing to narrow subsets. Theoretically, we demonstrate that this design preserves preference margins and avoids the gradient degeneration. These results show that DRIFT is an effective and scalable recipe for real-world post-training that leverages the most abundant and informative signal. The code and data are available at this https URL.
zh
[NLP-80] Can Prompts Rewind Time for LLM s? Evaluating the Effectiveness of Prompted Knowledge Cutoffs
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在时间预测任务中因预训练数据污染而导致的性能高估问题,即模型可能通过记忆而非推理实现对截止日期前测试数据的准确预测。其解决方案的关键在于探索基于提示(prompting)的去遗忘技术是否能够有效模拟早期知识截止(knowledge cutoff),从而更真实地评估模型的时间推理能力。研究构建了三类评估数据集,分别检验模型在直接事实遗忘、语义演变遗忘和因果关联知识遗忘方面的表现,结果表明:尽管提示方法能有效抑制对截止日期后直接信息的回忆,但在涉及因果相关性的情境下,模型仍难以表现出真正的遗忘行为,揭示了当前评估范式在时间预测任务中的局限性。
链接: https://arxiv.org/abs/2510.02340
作者: Xin Gao,Ruiyi Zhang,Daniel Du,Saurabh Mahindre,Sai Ashish Somayajula,Pengtao Xie
机构: UC San Diego (加州大学圣地亚哥分校); SUNY Buffalo (纽约州立大学布法罗分校)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs) are widely used for temporal prediction, but their reliance on pretraining data raises contamination concerns, as accurate predictions on pre-cutoff test data may reflect memorization rather than reasoning, leading to an overestimation of their generalization capability. With the recent emergence of prompting-based unlearning techniques, a natural question arises: Can LLMs be prompted to simulate an earlier knowledge cutoff? In this work, we investigate the capability of prompting to simulate earlier knowledge cutoff in LLMs. We construct three evaluation datasets to assess the extent to which LLMs can forget (1) direct factual knowledge, (2) semantic shifts, and (3) causally related knowledge. Results demonstrate that while prompt-based simulated knowledge cutoffs show effectiveness when directly queried with the information after that date, they struggle to induce forgetting when the forgotten content is not directly asked but causally related to the query. These findings highlight the need for more rigorous evaluation settings when applying LLMs for temporal prediction tasks. The full dataset and evaluation code are available at this https URL.
zh
[NLP-81] Evaluating Uncertainty Quantification Methods in Argumentative Large Language Models EMNLP
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在决策支持场景中缺乏可靠不确定性量化(Uncertainty Quantification, UQ)的问题,尤其是在基于计算论证的可解释LLM框架——论点型LLM(Argumentative LLMs, ArgLLMs)中,如何有效整合UQ方法以提升其在主张验证任务中的可靠性。解决方案的关键在于系统性地评估不同LLM UQ方法在ArgLLMs中的表现,并提出一种新颖的实验范式:通过ArgLLMs在复杂且可能具有争议性的陈述上的验证性能来间接衡量UQ方法的有效性。实验结果表明,尽管结构简单,直接提示(direct prompting)策略在ArgLLMs中展现出优于更复杂方法的不确定性估计效果,凸显了其在实际应用中的有效性与可行性。
链接: https://arxiv.org/abs/2510.02339
作者: Kevin Zhou,Adam Dejl,Gabriel Freedman,Lihu Chen,Antonio Rago,Francesca Toni
机构: Imperial College London (帝国理工学院); King’s College London (国王学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at EMNLP Findings 2025
Abstract:Research in uncertainty quantification (UQ) for large language models (LLMs) is increasingly important towards guaranteeing the reliability of this groundbreaking technology. We explore the integration of LLM UQ methods in argumentative LLMs (ArgLLMs), an explainable LLM framework for decision-making based on computational argumentation in which UQ plays a critical role. We conduct experiments to evaluate ArgLLMs’ performance on claim verification tasks when using different LLM UQ methods, inherently performing an assessment of the UQ methods’ effectiveness. Moreover, the experimental procedure itself is a novel way of evaluating the effectiveness of UQ methods, especially when intricate and potentially contentious statements are present. Our results demonstrate that, despite its simplicity, direct prompting is an effective UQ strategy in ArgLLMs, outperforming considerably more complex approaches.
zh
[NLP-82] Optimizing Long-Form Clinical Text Generation with Claim-Based Rewards
【速读】: 该论文旨在解决生成式 AI (Generative AI) 在临床文档自动化中的关键挑战,即如何在保证内容完整性(completeness)和事实准确性(factual grounding)的前提下,实现高效、可扩展的长文本生成。其解决方案的核心在于提出一种集成评估的强化学习框架,结合组相对策略优化(Group Relative Policy Optimization, GRPO)与 DocLens——一个基于声明级别的判别式评估器,该评估器提供确定性且对话上下文相关的奖励信号。该方法无需训练独立的奖励模型或依赖人工标注参考文本,即可直接优化临床笔记的事实准确性和完整性,并通过简单的奖励门控策略降低训练成本,实证表明其在多项指标上优于基线模型。
链接: https://arxiv.org/abs/2510.02338
作者: Samyak Jhaveri,Praphul Singh,Jangwon Kim,Tara Taghavi,Krishnaram Kenthapadi
机构: Oracle Health AI(Oracle健康AI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Automating clinical documentation with large language models requires precise alignment with priorities such as completeness and factual grounding. We present an evaluation-integrated reinforcement learning framework for long-form clinical text generation that couples Group Relative Policy Optimization (GRPO) with DocLens, a claim-level evaluator that provides deterministic, dialogue-grounded rewards. Our method directly optimizes factual grounding and completeness without training a separate reward model or relying on human-authored references. Empirically, the approach improves clinical note quality and reduces training cost via a simple reward-gating strategy. An independent GPT-5 qualitative evaluation further supports these gains, showing higher preference for GRPO outputs in factuality, completeness, and brevity, with fewer omissions and hallucinations. Because the benchmarks are relatively clean and the base model already well aligned, these improvements likely represent a conservative lower bound. The framework is scalable to real-world settings and can incorporate custom objectives such as guideline adherence or billing preferences.
zh
[NLP-83] CRACQ: A Multi-Dimensional Approach To Automated Document Assessment
【速读】: 该论文旨在解决当前对机器生成文本(如科研申请书)进行自动化评估时缺乏多维、可解释且稳定评价标准的问题。现有方法通常依赖单一评分机制,难以全面反映文本在逻辑连贯性、严谨性、适切性、完整性及整体质量等方面的差异。为此,作者提出CRACQ框架,其关键在于构建一个基于五维特质(Coherence, Rigor, Appropriateness, Completeness, Quality)的多维度评估体系,融合语言学、语义和结构特征信号,实现从整体到各维度的可解释性判断,从而提升自动化评估的稳定性与可信度。
链接: https://arxiv.org/abs/2510.02337
作者: Ishak Soltani,Francisco Belo,Bernardo Tavares
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:This paper presents CRACQ, a multi-dimensional evaluation framework tailored to evaluate documents across f i v e specific traits: Coherence, Rigor, Appropriateness, Completeness, and Quality. Building on insights from traitbased Automated Essay Scoring (AES), CRACQ expands its fo-cus beyond essays to encompass diverse forms of machine-generated text, providing a rubricdriven and interpretable methodology for automated evaluation. Unlike singlescore approaches, CRACQ integrates linguistic, semantic, and structural signals into a cumulative assessment, enabling both holistic and trait-level analysis. Trained on 500 synthetic grant pro-posals, CRACQ was benchmarked against an LLM-as-a-judge and further tested on both strong and weak real applications. Preliminary results in-dicate that CRACQ produces more stable and interpretable trait-level judgments than direct LLM evaluation, though challenges in reliability and domain scope remain
zh
[NLP-84] KurdSTS: The Kurdish Semantic Textual Similarity
【速读】: 该论文旨在解决低资源语言(如库尔德语)在语义文本相似度(Semantic Textual Similarity, STS)任务中缺乏高质量标注数据的问题。其关键解决方案是构建了首个库尔德语STS数据集,包含10,000句对,覆盖正式与非正式语域,并进行了人工相似度标注;同时在该数据集上对Sentence-BERT、多语言BERT等主流模型进行基准测试,验证了其有效性并揭示了库尔德语形态学复杂性、拼写变体和代码混用等挑战,为低资源语言的语义理解和自然语言处理研究提供了可复现的评估基准与起点。
链接: https://arxiv.org/abs/2510.02336
作者: Abdulhady Abas Abdullah,Hadi Veisi,Hussein M. Al
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Semantic Textual Similarity (STS) measures the degree of meaning overlap between two texts and underpins many NLP tasks. While extensive resources exist for high-resource languages, low-resource languages such as Kurdish remain underserved. We present, to our knowledge, the first Kurdish STS dataset: 10,000 sentence pairs spanning formal and informal registers, each annotated for similarity. We benchmark Sentence-BERT, multilingual BERT, and other strong baselines, obtaining competitive results while highlighting challenges arising from Kurdish morphology, orthographic variation, and code-mixing. The dataset and baselines establish a reproducible evaluation suite and provide a strong starting point for future research on Kurdish semantics and low-resource NLP.
zh
[NLP-85] FormalML: A Benchmark for Evaluating Formal Subgoal Completion in Machine Learning Theory
【速读】: 该论文旨在解决生成式 AI (Generative AI) 在数学证明中的实际应用问题,即如何有效辅助数学家完成复杂证明中缺失的中间步骤,这一任务被定义为子目标补全(subgoal completion)。解决方案的关键在于构建了一个名为 FormalML 的 Lean 4 基准测试集,该数据集基于机器学习基础理论,通过一种翻译策略将过程式证明转化为声明式形式,从而提取出 4937 个涵盖优化和概率不等式的子目标问题,其难度各异且包含前提检索与高阶研究级上下文。此基准首次实现了子目标补全任务中复杂推理场景与前提获取能力的结合,为评估和改进 LLM 在数学证明中的实用性提供了重要工具。
链接: https://arxiv.org/abs/2510.02335
作者: Xiao-Wen Yang,Zihao Zhang,Jianuo Cao,Zhi Zhou,Zenan Li,Lan-Zhe Guo,Yuan Yao,Taolue Chen,Yu-Feng Li,Xiaoxing Ma
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have recently demonstrated remarkable progress in formal theorem proving. Yet their ability to serve as practical assistants for mathematicians, filling in missing steps within complex proofs, remains underexplored. We identify this challenge as the task of subgoal completion, where an LLM must discharge short but nontrivial proof obligations left unresolved in a human-provided sketch. To study this problem, we introduce FormalML, a Lean 4 benchmark built from foundational theories of machine learning. Using a translation tactic that converts procedural proofs into declarative form, we extract 4937 problems spanning optimization and probability inequalities, with varying levels of difficulty. FormalML is the first subgoal completion benchmark to combine premise retrieval and complex research-level contexts. Evaluation of state-of-the-art provers highlights persistent limitations in accuracy and efficiency, underscoring the need for more capable LLM-based theorem provers for effective subgoal completion.
zh
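为直观理解"子目标补全"的任务形态,下面给出一个 Lean 4 玩具示例:人类提供证明骨架,模型需要补全 `sorry` 处短小但非平凡的证明义务。该例为笔者自拟,并非 FormalML 中的真实题目。

```lean
-- 人类提供的证明骨架;`sorry` 即模型需要补全(discharge)的子目标
example (a b : Nat) : a + b + 0 = b + a := by
  have h : a + b + 0 = a + b := by
    sorry                        -- 待补全:可由 Nat.add_zero 关闭
  rw [h]
  exact Nat.add_comm a b
```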
[NLP-86] Where Did It Go Wrong? Attributing Undesirable LLM Behaviors via Representation Gradient Tracing
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在部署过程中因生成有害内容、事实错误及社会偏见等不良行为所带来的安全风险问题,其核心挑战在于准确诊断这些行为的根源。现有基于参数梯度的归因方法因噪声信号强和计算复杂度高而效果有限。本文提出一种新颖且高效的诊断框架,关键在于直接在模型的激活空间(activation space)中分析表示及其梯度,从而获得语义上可解释的信号,将模型输出与训练数据关联起来。该方法不仅实现了样本级别的归因,还可进行细粒度的token级别分析,精准定位导致模型行为异常的具体样本和短语,为理解、审计和缓解LLM风险提供了强有力的工具。
链接: https://arxiv.org/abs/2510.02334
作者: Zhe Li,Wei Zhao,Yige Li,Jun Sun
机构: Singapore Management University (新加坡管理大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 4 figures
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their deployment is frequently undermined by undesirable behaviors such as generating harmful content, factual inaccuracies, and societal biases. Diagnosing the root causes of these failures poses a critical challenge for AI safety. Existing attribution methods, particularly those based on parameter gradients, often fall short due to prohibitively noisy signals and high computational complexity. In this work, we introduce a novel and efficient framework that diagnoses a range of undesirable LLM behaviors by analyzing representations and their gradients, operating directly in the model’s activation space to provide a semantically meaningful signal linking outputs to their training data. We systematically evaluate our method on tasks that include tracking harmful content, detecting backdoor poisoning, and identifying knowledge contamination. The results demonstrate that our approach not only excels at sample-level attribution but also enables fine-grained token-level analysis, precisely identifying the specific samples and phrases that causally influence model behavior. This work provides a powerful diagnostic tool to understand, audit, and ultimately mitigate the risks associated with LLMs. The code is available at this https URL.
zh
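下面用 PyTorch 给出"在激活空间做归因"这一思路的玩具示意:取中间层表示及其梯度构成信号,再与候选训练样本的同款信号做相似度匹配。网络结构、信号构造(表示×梯度)与相似度度量都是笔者的简化假设,仅用于演示流程,并非论文的具体算法。

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.Tanh(), nn.Linear(32, 2))

def rep_and_grad(x, target):
    """返回中间层表示及其对损失的梯度。"""
    h = model[1](model[0](x))            # 中间层激活(表示)
    h.retain_grad()
    loss = nn.functional.cross_entropy(model[2](h), target)
    loss.backward()
    return h.detach(), h.grad.detach()

# 待诊断的一条可疑输出
h_bad, g_bad = rep_and_grad(torch.randn(1, 16), torch.tensor([1]))
signal = (h_bad * g_bad).flatten()       # 简化的 表示×梯度 信号

# 与候选训练样本逐一比较,得分高者视为对该行为影响大的样本
for i in range(3):
    model.zero_grad()
    h, g = rep_and_grad(torch.randn(1, 16), torch.tensor([1]))
    score = torch.cosine_similarity(signal, (h * g).flatten(), dim=0)
    print(f"train sample {i}: attribution score = {float(score):.3f}")
```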
[NLP-87] Human Mobility Datasets Enriched With Contextual and Social Dimensions
【速读】: 该论文旨在解决当前人类移动轨迹数据在语义丰富性、多模态融合与可重用性方面的不足,尤其针对传统轨迹数据缺乏上下文信息(如出行目的、交通方式、天气影响)以及难以支持高级分析任务(如行为建模、知识图谱构建)的问题。解决方案的关键在于构建两个结构迥异的大城市(巴黎和纽约)的语义增强型轨迹数据集,其核心创新包括:1)整合GPS轨迹与多种上下文层(如停靠点、移动段、兴趣点POI、推断的交通方式及天气数据);2)引入由大语言模型(LLM)生成的合成社交媒体文本,实现多模态语义增强;3)以表格和资源描述框架(RDF)两种格式提供数据,确保语义推理能力与FAIR(可发现、可访问、可互操作、可重用)数据实践兼容。该方案通过开源可复现的处理流水线,为行为建模、迁移学习和LLM驱动的移动性研究提供了统一且语义丰富的基础资源。
链接: https://arxiv.org/abs/2510.02333
作者: Chiara Pugliese,Francesco Lettich,Guido Rocchietti,Chiara Renso,Fabio Pinelli
机构: IIT-CNR Pisa, Italy; ISTI-CNR Pisa, Italy; IMT Lucca Lucca, Italy
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注: 5 pages, 3 figures, 1 table
Abstract:In this resource paper, we present two publicly available datasets of semantically enriched human trajectories, together with the pipeline to build them. The trajectories are publicly available GPS traces retrieved from OpenStreetMap. Each dataset includes contextual layers such as stops, moves, points of interest (POIs), inferred transportation modes, and weather data. A novel semantic feature is the inclusion of synthetic, realistic social media posts generated by Large Language Models (LLMs), enabling multimodal and semantic mobility analysis. The datasets are available in both tabular and Resource Description Framework (RDF) formats, supporting semantic reasoning and FAIR data practices. They cover two structurally distinct, large cities: Paris and New York. Our open source reproducible pipeline allows for dataset customization, while the datasets support research tasks such as behavior modeling, mobility prediction, knowledge graph construction, and LLM-based applications. To our knowledge, our resource is the first to combine real-world movement, structured semantic enrichment, LLM-generated text, and semantic web compatibility in a reusable framework.
zh
[NLP-88] A High-Capacity and Secure Disambiguation Algorithm for Neural Linguistic Steganography
【速读】: 该论文旨在解决神经语言隐写术(Neural Linguistic Steganography)中因现代分词器(Tokenizer)的分词歧义性导致的解码失败问题,尤其是现有方法SyncPool在处理歧义候选集时过度牺牲嵌入容量的问题。其关键解决方案是提出一种名为“look-ahead Sync”的新方法:该方法仅对真正不可区分的token序列进行最小限度的同步采样,而保留所有可区分路径以最大化嵌入容量,从而在不损失可证明安全性前提下显著提升信息嵌入率。实验表明,该方法在英文(Llama 3)和中文(Qwen 2.5)基准上均接近理论容量上限,相比SyncPool嵌入速率提升超过160%(英文)和25%(中文)。
链接: https://arxiv.org/abs/2510.02332
作者: Yapei Feng,Feng Jiang,Shanhao Wu,Hua Zhong
机构: Hangzhou Dianzi University (杭州电子科技大学); University of Tokyo (东京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 13 pages, 7 figures
Abstract:Neural linguistic steganography aims to embed information into natural text while preserving statistical undetectability. A fundamental challenge in this field stems from tokenization ambiguity in modern tokenizers, which can lead to catastrophic decoding failures. The recent method, SyncPool, addresses this ambiguity by employing a coarse-grained synchronization mechanism over groups of ambiguous candidates. However, SyncPool sacrifices embedding capacity, as it utilizes the entire Shannon entropy of an ambiguous group solely for synchronization rather than for payload embedding. We propose a method named look-ahead Sync, which overcomes the capacity limitation of SyncPool while retaining its provable security guarantees. Our approach performs minimal synchronized sampling only on truly indistinguishable token sequences, while strategically preserving all other discernible paths to maximize embedding capacity. We provide theoretical proofs for the security of our method and analyze the gap between its achievable embedding capacity and the theoretical upper bound. Experiments on English (using Llama 3) and Chinese (using Qwen 2.5) benchmarks show that our method consistently approaches the theoretical capacity upper bound and significantly outperforms SyncPool. The improvement in embedding rate exceeds 160% in English and 25% in Chinese, particularly in settings with larger candidate pools. This work represents a significant step toward practical high-capacity provably secure linguistic steganography.
zh
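look-ahead Sync 的前提是能判定哪些候选 token 会引起分词歧义。下面是一个往返编码(decode 再 encode)检验的极简示意:若文本无法唯一还原出原 token 序列,接收端就会解码失败,需要同步处理。分组与同步采样逻辑从略,tokenizer 选择仅为演示。

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

def is_ambiguous(prefix_ids, cand_id):
    """拼接候选 token 后 decode 再 encode;若 token 序列改变,
    说明该候选存在分词歧义,接收端无法从文本唯一还原。"""
    ids = prefix_ids + [cand_id]
    return tok.encode(tok.decode(ids)) != ids

prefix = tok.encode("The weather is")
for c in tok.encode(" nice today"):
    print(repr(tok.decode([c])), "ambiguous:", is_ambiguous(prefix, c))
```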
[NLP-89] Synthetic Dialogue Generation for Interactive Conversational Elicitation Recommendation (ICER)
【速读】: 该论文旨在解决对话推荐系统(Conversational Recommender Systems, CRSs)中因公开数据稀缺而导致语言模型(Language Models, LMs)难以有效微调的问题。传统方法利用LM作为用户模拟器生成训练数据,但常因行为不一致而生成不符合真实用户交互模式的对话序列。解决方案的关键在于结合行为模拟器(behavior simulators)与LM提示技术(LM-prompting),以生成与用户潜在状态一致的自然对话数据,从而提升对话的连贯性、事实性和自然度。该方法成功构建了一个大规模开源CRS数据集,涵盖偏好获取和示例批评两类场景,并通过人工评估验证了生成对话的质量。
链接: https://arxiv.org/abs/2510.02331
作者: Moonkyung Ryu,Chih-Wei Hsu,Yinlam Chow,Mohammad Ghavamzadeh,Craig Boutilier
机构: Google Research (谷歌研究); Amazon AGI (亚马逊AGI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:While language models (LMs) offer great potential for conversational recommender systems (CRSs), the paucity of public CRS data makes fine-tuning LMs for CRSs challenging. In response, LMs as user simulators qua data generators can be used to train LM-based CRSs, but often lack behavioral consistency, generating utterance sequences inconsistent with those of any real user. To address this, we develop a methodology for generating natural dialogues that are consistent with a user’s underlying state using behavior simulators together with LM-prompting. We illustrate our approach by generating a large, open-source CRS data set with both preference elicitation and example critiquing. Rater evaluation on some of these dialogues shows them to exhibit considerable consistency, factuality and naturalness.
zh
[NLP-90] EntropyLong: Effective Long-Context Training via Predictive Uncertainty
【速读】: 该论文旨在解决长上下文语言模型训练中难以有效构建真实长距离依赖关系的问题。当前方法如通用文本拼接或基于启发式的变体常无法确保依赖关系的真实性,易引入虚假相关性。其解决方案的关键在于提出EntropyLong方法,通过预测不确定性(即熵)来验证依赖质量:首先识别文档中的高熵位置,再从大规模语料库中检索语义相关的上下文,并通过评估这些上下文是否能降低预测熵来验证其信息增益。该“模型内验证”机制确保每个长距离依赖均具有可度量的信息价值,从而生成高质量的128K长度序列训练数据,显著提升模型在RULER和LongBenchv2等基准上的长程理解能力。
链接: https://arxiv.org/abs/2510.02330
作者: Junlong Jia,Ziyang Chen,Xing Wu,Chaochen Gao,Zijia Lin,Debing Zhang,Songlin Hu,Binghui Guo
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: work in progress; Correspondence to: Xing Wu wuxing@iie. this http URL
Abstract:Training long-context language models to capture long-range dependencies requires specialized data construction. Current approaches, such as generic text concatenation or heuristic-based variants, frequently fail to guarantee genuine long-range dependencies. We propose EntropyLong, a novel data construction method that leverages predictive uncertainty to verify dependency quality. Our approach identifies high-entropy positions in documents, retrieves semantically relevant contexts from large corpora, and verifies their utility by assessing whether they reduce prediction entropy. This model-in-the-loop verification ensures each dependency represents measurable information gain rather than spurious correlation. We construct training samples with long-range dependencies by combining original documents with these verified contextual supplements. Using FineWebEdu and Cosmopedia, we generate a dataset of 128K-length sequences with verified dependencies. Models trained on this data demonstrate significant improvements on RULER benchmarks, particularly in tasks requiring distant information. Following instruction fine-tuning, our models also achieve substantial gains on LongBenchv2, demonstrating enhanced long-context understanding. Extensive ablation studies further validate the necessity and effectiveness of entropy-based verification for long-context training.
zh
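EntropyLong 的核心判据——"拼接检索上下文后预测熵是否下降"——可以用几行代码示意。下面用 GPT-2 作占位模型:先计算各位置的预测熵,再检验前置一段相关上下文后文末预测熵的变化。模型、示例文本与上下文均为笔者假设,仅演示验证逻辑。

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def entropies(text):
    """返回每个位置上"预测下一 token"的分布熵。"""
    ids = tok(text, return_tensors="pt").input_ids
    probs = lm(ids).logits[0].softmax(-1)
    return -(probs * probs.clamp_min(1e-9).log()).sum(-1)

doc = "The capital of the small island nation is"
print("文末预测熵:", round(entropies(doc)[-1].item(), 2))

# 信息增益验证:若拼接检索到的上下文后熵明显下降,该依赖即被接受
ctx = "Nauru is a small island nation in Micronesia. "
print("拼接上下文后:", round(entropies(ctx + doc)[-1].item(), 2))
```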
[NLP-91] SelfJudge: Faster Speculative Decoding via Self-Supervised Judge Verification
【速读】: 该论文旨在解决现有判别式解码(judge decoding)方法在加速大语言模型(Large Language Model, LLM)推理过程中存在的泛化能力不足问题,即其依赖人工标注或具备可验证真实标签的任务,难以适用于多样化的自然语言处理(Natural Language Processing, NLP)任务。解决方案的关键在于提出SelfJudge,通过目标模型自身的自监督信号训练判别器(judge verifier),利用语义保持度(semantic preservation)作为评估指标,判断替换候选token后的响应是否保留原始响应的语义一致性,从而实现无需外部标注的自动验证器训练,显著提升了不同NLP任务下的推理效率与准确性平衡。
链接: https://arxiv.org/abs/2510.02329
作者: Kanghoon Yoon,Minsub Kim,Sungjae Lee,Joonhyung Lee,Sunghyeon Woo,Yeonjun In,Se Jung Kwon,Chanyoung Park,Dongsoo Lee
机构: KAIST(韩国科学技术院); NAVER Cloud(NAVER云)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Speculative decoding accelerates LLM inference by verifying candidate tokens from a draft model against a larger target model. Recent judge decoding boosts this process by relaxing verification criteria by accepting draft tokens that may exhibit minor discrepancies from target model output, but existing methods are restricted by their reliance on human annotations or tasks with verifiable ground truths, limiting generalizability across diverse NLP tasks. We propose SelfJudge, which trains judge verifiers via self-supervision of the target model. Our method measures semantic preservation by assessing whether token-substituted responses preserve the meaning of original responses, enabling automatic verifier training across diverse NLP tasks. Our experiments show SelfJudge achieves superior inference-accuracy trade-offs than judge decoding baselines, offering a broadly applicable solution for faster LLM inference.
zh
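SelfJudge 的自监督信号可以概括为:把草稿 token 替换进原响应后,检查语义是否保持。下面用 sentence-transformers 的通用语义相似度作占位打分器给出示意;论文中的判别器由目标模型自监督训练得到,此处的模型名与阈值均为笔者假设。

```python
from sentence_transformers import SentenceTransformer, util

enc = SentenceTransformer("all-MiniLM-L6-v2")

def preservation_label(original: str, substituted: str, thresh=0.9):
    """替换草稿 token 后的响应若与原响应语义相似度足够高,
    则生成正标签(接受草稿 token),用作判别器的训练信号。"""
    emb = enc.encode([original, substituted], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1])) >= thresh

print(preservation_label("The cat sat on the mat.",
                         "The cat rested on the mat."))
```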
[NLP-92] AMANDA: Agent ic Medical Knowledge Augmentation for Data-Efficient Medical Visual Question Answering EMNLP
【速读】: 该论文旨在解决医疗多模态大语言模型(Medical Multimodal Large Language Models, Med-MLLMs)在低资源场景下因医学推理能力瓶颈而导致性能下降的问题,具体包括两个方面:一是内在推理瓶颈,即模型未能充分关注医学图像中的细节信息;二是外在推理瓶颈,即缺乏对专业医学知识的有效整合。解决方案的关键在于提出一种无需训练的代理框架AMANDA,通过大语言模型(LLM)代理实现医学知识增强:内在知识增强采用从粗到细的问题分解策略以支持全面诊断,外在知识增强则通过生物医学知识图谱检索来锚定推理过程,从而显著提升零样本和少样本条件下的医学视觉问答(Med-VQA)性能。
链接: https://arxiv.org/abs/2510.02328
作者: Ziqing Wang,Chengsheng Mao,Xiaole Wen,Yuan Luo,Kaize Ding
机构: Northwestern University (西北大学); Microsoft (微软)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: EMNLP Findings
Abstract:Medical Multimodal Large Language Models (Med-MLLMs) have shown great promise in medical visual question answering (Med-VQA). However, when deployed in low-resource settings where abundant labeled data are unavailable, existing Med-MLLMs commonly fail due to their medical reasoning capability bottlenecks: (i) the intrinsic reasoning bottleneck that ignores the details from the medical image; (ii) the extrinsic reasoning bottleneck that fails to incorporate specialized medical knowledge. To address those limitations, we propose AMANDA, a training-free agentic framework that performs medical knowledge augmentation via LLM agents. Specifically, our intrinsic medical knowledge augmentation focuses on coarse-to-fine question decomposition for comprehensive diagnosis, while extrinsic medical knowledge augmentation grounds the reasoning process via biomedical knowledge graph retrieval. Extensive experiments across eight Med-VQA benchmarks demonstrate substantial improvements in both zero-shot and few-shot Med-VQA settings. The code is available at this https URL.
zh
[NLP-93] KAME: Tandem Architecture for Enhancing Knowledge in Real-Time Speech-to-Speech Conversational AI
【速读】: 该论文旨在解决实时语音到语音(S2S)模型在响应自然性和低延迟方面表现优异,但缺乏深度知识和语义理解的问题;同时克服级联系统(由自动语音识别、文本大语言模型(LLM)和文本转语音合成组成)虽具备强大知识表示能力,却因高延迟而破坏自然交互流畅性的局限。解决方案的关键在于提出一种混合架构:通过一个S2S Transformer实现即时响应,同时将用户语音查询同步传递给后端LLM进行深度推理,并将LLM生成的文本结果实时注入S2S模型以引导其语音输出,从而在不显著增加延迟的前提下显著提升回答正确性,使其接近级联系统的性能水平。
链接: https://arxiv.org/abs/2510.02327
作者: So Kuroki,Yotaro Kubo,Takuya Akiba,Yujin Tang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:
Abstract:Real-time speech-to-speech (S2S) models excel at generating natural, low-latency conversational responses but often lack deep knowledge and semantic understanding. Conversely, cascaded systems combining automatic speech recognition, a text-based Large Language Model (LLM), and text-to-speech synthesis offer superior knowledge representation at the cost of high latency, which disrupts the flow of natural interaction. This paper introduces a novel hybrid architecture that bridges the gap between these two paradigms. Our framework processes user speech through an S2S transformer for immediate responsiveness while concurrently relaying the query to a powerful back-end LLM. The LLM’s text-based response is then injected in real time to guide the S2S model’s speech generation, effectively infusing its output with rich knowledge without the full latency penalty of a cascaded system. We evaluated our method using a speech-synthesized variant of the MT-Bench benchmark that consists of multi-turn question-answering sessions. The results demonstrate that our system substantially outperforms a baseline S2S model in response correctness, approaching that of a cascaded system, while maintaining a latency on par with the baseline.
zh
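KAME 的"双轨"流程可以用一个 asyncio 玩具来体会:S2S 立即开始逐块发声,同时异步请求慢而强的后端 LLM,其文本一旦就绪便注入后续生成。下面的接口、延迟数值与注入方式均为笔者为演示编造的占位,并非论文实现。

```python
import asyncio

async def backend_llm(query: str) -> str:
    await asyncio.sleep(0.8)                 # 模拟后端 LLM 的高延迟
    return "(后端要点:……)"

async def s2s_stream(query: str):
    guidance = asyncio.create_task(backend_llm(query))
    for step in range(6):                    # S2S 逐帧出声,这里以文本块代替
        hint = guidance.result() if guidance.done() else None
        yield f"[语音块{step}]" + (f" 注入:{hint}" if hint else "")
        await asyncio.sleep(0.25)

async def main():
    async for chunk in s2s_stream("什么是量子隧穿?"):
        print(chunk)

asyncio.run(main())
```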
[NLP-94] Hallucination-Resistant Domain-Specific Research Assistant with Self-Evaluation and Vector-Grounded Retrieval
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在文献综述任务中易产生幻觉(hallucination)和错误引用的问题,从而限制其在专业科研工作流中的可信度与实用性。解决方案的关键在于提出RA-FSM(Research Assistant - Finite State Machine),这是一个基于GPT的模块化研究助手,通过引入有限状态机(Finite State Machine)控制循环——即“相关性-置信度-知识”三阶段机制,将生成过程嵌入到一个确定性的、可追踪的流程中:系统利用向量检索进行知识获取,结合确定性的引文处理流水线,由控制器过滤无关查询、评估问题可回答性、分解复杂问题,并仅在必要时触发检索,最终输出带有置信度标签和去重后、在语料库内引用的答案。该设计显著提升了答案的透明性和证据可靠性,特别适用于高风险技术场景。
链接: https://arxiv.org/abs/2510.02326
作者: Vivek Bhavsar,Joseph Ereifej,Aravanan Gurusami
机构: Coherent Corporation(相干公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 21 pages, 5 figures
Abstract:Large language models accelerate literature synthesis but can hallucinate and mis-cite, limiting their usefulness in expert workflows. We present RA-FSM (Research Assistant - Finite State Machine), a modular GPT-based research assistant that wraps generation in a finite-state control loop: Relevance - Confidence - Knowledge. The system is grounded in vector retrieval and a deterministic citation pipeline. The controller filters out-of-scope queries, scores answerability, decomposes questions, triggers retrieval only when needed, and emits answers with confidence labels and in-corpus, de-duplicated references. A ranked-tier ingestion workflow constructs a domain knowledge base from journals, conferences, indices, preprints, and patents, writing both to a dense vector index and to a relational store of normalized metrics. We implement the system for photonics and evaluate it on six task categories: analytical reasoning, numerical analysis, methodological critique, comparative synthesis, factual extraction, and application design. In blinded A/B reviews, domain experts prefer RA-FSM to both a strong Notebook LM (NLM) baseline and a vanilla single-pass GPT API baseline, citing stronger boundary-condition handling and more defensible evidence use. Coverage and novelty analyses indicate that RA-FSM explores beyond the NLM while incurring tunable latency and cost overheads. The design emphasizes transparent, well-cited answers for high-stakes technical work and is generalizable to other scientific domains.
zh
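RA-FSM 的"相关性-置信度-知识"控制环本质上是一个小型状态机。下面给出骨架示意:先判相关性,再评可答性,置信不足才触发检索,最后输出带置信标签的回答。阈值与各打分函数均为占位假设,非论文实现。

```python
def answer(query, retrieve, generate,
           relevant=lambda q: True, confidence=lambda q: 0.4):
    """Relevance -> Confidence -> Knowledge 的极简控制环。"""
    if not relevant(query):
        return {"answer": None, "label": "out-of-scope"}
    c = confidence(query)
    docs = retrieve(query) if c < 0.7 else []     # 置信不足才检索
    return {"answer": generate(query, docs),
            "label": "high-confidence" if c >= 0.7 else "grounded"}

print(answer("带宽与色散的关系?",
             retrieve=lambda q: ["doc#12"],
             generate=lambda q, d: f"依据 {d or '模型内部知识'} 作答……"))
```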
[NLP-95] Hallucination reduction with CASAL: Contrastive Activation Steering For Amortized Learning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中频繁产生幻觉(hallucination)的问题,即模型会自信地给出错误答案而非承认知识缺失。现有方法如激活引导(activation steering)虽能有效减少幻觉,但需在推理阶段实时监控与干预,难以部署于实际系统。本文提出对比激活引导的摊销学习算法(Contrastive Activation Steering for Amortized Learning, CASAL),其核心创新在于将激活引导的收益直接嵌入模型权重中,通过训练单个Transformer层的轻量级子模块实现“已知则答、未知则弃”的行为模式。CASAL无需推理时干预,且在多个短文本问答基准上实现30%-40%的幻觉降低,同时相比强基线LoRA方法(如SFT和DPO)具备30倍计算效率和20倍数据效率优势,并展现出良好的分布外(out-of-distribution, OOD)泛化能力,是首个在密集模型与专家混合(Mixture-of-Experts, MoE)架构上均有效的基于引导的训练方法。
链接: https://arxiv.org/abs/2510.02324
作者: Wannan Yang,Xinchi Qiu,Lei Yu,Yuchen Zhang,Oliver Aobo Yang,Narine Kokhlikyan,Nicola Cancedda,Diego Garcia-Olano
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) exhibit impressive capabilities but often hallucinate, confidently providing incorrect answers instead of admitting ignorance. Prior work has shown that models encode linear representations of their own knowledge and that activation steering can reduce hallucinations. These approaches, however, require real-time monitoring and intervention during inference. We introduce Contrastive Activation Steering for Amortized Learning (CASAL), an efficient algorithm that connects interpretability with amortized optimization. CASAL directly bakes the benefits of activation steering into model’s weights. Once trained, LLMs answer questions they know while abstaining from answering those they do not. CASAL’s light-weight design requires training only a submodule of a single transformer layer and yet reduces hallucination by 30%-40% across multiple short-form QA benchmarks. CASAL is 30x more compute-efficient and 20x more data-efficient than strong LoRA-based baselines such as SFT and DPO, boosting its practical applicability in data scarce domains. Importantly, CASAL also generalizes effectively to out-of-distribution (OOD) domains. We showcase CASAL’s flexibility in mitigating hallucinations in both text-only and vision-language models. To our knowledge, CASAL is the first steering-based training method that has been shown to be effective for both dense and Mixture-of-Experts (MoE) models. CASAL represents a promising step forward for applying interpretability-inspired method for practical deployment in production systems.
zh
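作为背景,经典的对比激活引导向量通常取"知道"与"不知道"两类提示的隐层激活均值之差;CASAL 的贡献在于把这种引导效果训练进单层子模块的权重,使推理时无需干预。下面仅示意引导向量本身的构造与使用,激活数据为随机占位。

```python
import torch

torch.manual_seed(0)
acts_known = torch.randn(32, 64)      # "模型知道答案"的一批隐层激活(占位)
acts_unknown = torch.randn(32, 64)    # "模型不知道"的一批隐层激活(占位)
steer = acts_known.mean(0) - acts_unknown.mean(0)   # 对比引导方向

def steered(h, alpha=1.0):
    """推理时引导:沿 steer 方向平移激活。
    CASAL 的做法是训练子模块把这一效果"烤"进权重,省去此步。"""
    return h + alpha * steer

print(steered(torch.randn(64)).shape)
```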
[NLP-96] Modeling the Attack: Detecting AI-Generated Text by Quantifying Adversarial Perturbations
【速读】: 该论文旨在解决生成式 AI (Generative AI) 时代下,大型语言模型(Large Language Models, LLMs)产生的文本在检测系统中易受对抗攻击的问题,尤其针对通过改写(paraphrasing)等语义层面扰动实现的规避攻击。现有检测方法普遍依赖统计特征,对语义级别的扰动缺乏鲁棒性,导致检测性能显著下降。解决方案的关键在于提出一种新型检测框架——扰动不变特征工程(Perturbation-Invariant Feature Engineering, PIFE),其核心思想是将输入文本通过多阶段归一化处理转化为标准形式,并量化原始文本与标准形式之间的差异(如编辑距离和语义相似度),将这些显式的扰动特征直接输入分类器,而非单纯依赖对抗训练来增强模型泛化能力。实验证明,PIFE 框架在面对字符、词和句子层级的多层次攻击时,相较于传统加固的 Transformer 模型,在严格控制假阳性率(1%)条件下仍保持高达 82.6% 的真正例率(True Positive Rate),显著优于传统方法(仅 48.8%),从而有效突破“语义规避阈值”(semantic evasion threshold)限制,实现了更本质的鲁棒性提升。
链接: https://arxiv.org/abs/2510.02319
作者: Lekkala Sai Teja,Annepaka Yadagiri,Sangam Sai Anish,Siva Gopala Krishna Nuthakki,Partha Pakray
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 8 pages, 3 figures
Abstract:The growth of highly advanced Large Language Models (LLMs) constitutes a huge dual-use problem, making it necessary to create dependable AI-generated text detection systems. Modern detectors are notoriously vulnerable to adversarial attacks, with paraphrasing standing out as an effective evasion technique that foils statistical detection. This paper presents a comparative study of adversarial robustness, first by quantifying the limitations of standard adversarial training and then by introducing a novel, significantly more resilient detection framework: Perturbation-Invariant Feature Engineering (PIFE). PIFE enhances detection by first transforming input text into a standardized form using a multi-stage normalization pipeline; it then quantifies the transformation’s magnitude using metrics like Levenshtein distance and semantic similarity, feeding these signals directly to the classifier. We evaluate both a conventionally hardened Transformer and our PIFE-augmented model against a hierarchical taxonomy of character-, word-, and sentence-level attacks. Our findings first confirm that conventional adversarial training, while resilient to syntactic noise, fails against semantic attacks, an effect we term “semantic evasion threshold”, where its True Positive Rate at a strict 1% False Positive Rate plummets to 48.8%. In stark contrast, our PIFE model, which explicitly engineers features from the discrepancy between a text and its canonical form, overcomes this limitation. It maintains a remarkable 82.6% TPR under the same conditions, effectively neutralizing the most sophisticated semantic attacks. This superior performance demonstrates that explicitly modeling perturbation artifacts, rather than merely training on them, is a more promising path toward achieving genuine robustness in the adversarial arms race.
zh
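PIFE 的特征构造思路很直接:把文本规范化成标准形,再把"原文与标准形差多少"显式地量化成特征。下面是一个极简示意,规范化只做 NFKC 与空白折叠,编辑距离用 difflib 的相似度近似代替(论文使用 Levenshtein 距离与语义相似度等指标)。

```python
import difflib
import re
import unicodedata

def canonical(text: str) -> str:
    """多阶段规范化流水线的极简替代:Unicode 归一 + 小写 + 空白折叠。"""
    t = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", t).strip()

def perturbation_features(text: str) -> dict:
    canon = canonical(text)
    sim = difflib.SequenceMatcher(None, text, canon).ratio()
    return {"edit_distance_like": 1 - sim,          # 变换幅度
            "len_delta": abs(len(text) - len(canon))}

# 含全角字符与异常空白的输入,其"扰动幅度"特征明显偏大
print(perturbation_features("The  Quick\u00a0Brown  FOX!"))
```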
[NLP-97] EEFSUVA: A New Mathematical Olympiad Benchmark
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在数学推理能力评估中可能存在的高估问题,尤其针对现有基准测试(benchmarks)因数据污染和题型局限性而未能真实反映模型能力的现象。其解决方案的关键在于引入EEFSUVA这一新型基准,该基准源自东欧及前苏联国家的区域性与国家级数学奥林匹克竞赛,题目难度与国际数学奥林匹克(IMO)相当,但采用非标准解法且在线语料库中罕见,从而有效规避了数据泄露风险并拓展了评估维度。初步实验表明,即使是最先进的LLMs在EEFSUVA上的表现显著低于其他奥数类基准,凸显了构建更广泛、更具挑战性的评估数据集对准确衡量数学推理能力的重要性。
链接: https://arxiv.org/abs/2510.01227
作者: Nicole N Khatibi,Daniil A. Radamovich,Michael P. Brenner
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); History and Overview (math.HO)
备注: 16 Pages, 5 figures
Abstract:Recent breakthroughs have spurred claims that large language models (LLMs) match gold-medal Olympiad to graduate-level proficiency on mathematics benchmarks. In this work, we examine these claims in detail and assess the extent to which current benchmarks capture genuine LLM mathematical reasoning. The composition of these benchmarks, primarily drawing from the International Mathematics Olympiad (IMO) and related competitions, may overstate models’ reasoning ability due to potential data contamination and a narrow focus on familiar problem types. To enable a more holistic assessment of mathematical understanding, we introduce EEFSUVA, a novel benchmark curated from under-circulated regional and national Olympiads of Eastern Europe and the countries of the former Soviet Union. These contests feature problems of comparable difficulty to the IMO and are renowned for demanding nonstandard problem-solving techniques, yet their problems are far less prevalent in online corpora. Preliminary results suggest that even state-of-the-art LLMs exhibit a notable performance decline on EEFSUVA relative to other Olympiad-style benchmarks. These findings also suggest the potential importance of broader evaluation datasets for a fuller assessment of mathematical reasoning and for guiding future model development.
zh
[NLP-98] Multiplicative-Additive Constrained Models:Toward Joint Visualization of Interactive and Independent Effects
【速读】: 该论文旨在解决高风险领域(如医疗健康)中机器学习模型可解释性与预测性能之间的权衡问题。传统广义加法模型(Generalized Additive Models, GAMs)虽具备良好的可解释性,但因仅能建模单变量效应和两两交互项,限制了其预测能力;而曲线遍历集回归(Curve Ergodic Set Regression, CESR)虽能自然可视化形状函数并包含全特征交互,却未能在性能上超越GAMs。解决方案的关键在于提出乘法-加法约束模型(Multiplicative-Additive Constrained Models, MACMs),通过在CESR基础上引入一个加法部分来解耦交互项与独立项系数的耦合关系,从而有效扩展假设空间,在保持形状函数可视化的前提下显著提升预测性能。
链接: https://arxiv.org/abs/2509.21923
作者: Fumin Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Interpretability is one of the considerations when applying machine learning to high-stakes fields such as healthcare that involve matters of life safety. Generalized Additive Models (GAMs) enhance interpretability by visualizing shape functions. Nevertheless, to preserve interpretability, GAMs omit higher-order interaction effects (beyond pairwise interactions), which imposes significant constraints on their predictive performance. We observe that Curve Ergodic Set Regression (CESR), a multiplicative model, naturally enables the visualization of its shape functions and simultaneously incorporates both interactions among all features and individual feature effects. Nevertheless, CESR fails to demonstrate superior performance compared to GAMs. We introduce Multiplicative-Additive Constrained Models (MACMs), which augment CESR with an additive part to disentangle the intertwined coefficients of its interactive and independent terms, thus effectively broadening the hypothesis space. The model is composed of a multiplicative part and an additive part, whose shape functions can both be naturally visualized, thereby assisting users in interpreting how features participate in the decision-making process. Consequently, MACMs constitute an improvement over both CESR and GAMs. The experimental results indicate that neural network-based MACMs significantly outperform both CESR and the current state-of-the-art GAMs in terms of predictive performance.
zh
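按摘要的描述,MACM 由可分别可视化的乘法部分与加法部分组成。下面给出一个最小的 PyTorch 骨架:每个特征配两条一维形状函数,分别进入连乘与求和分支,输出为二者之和。网络宽度与具体组合方式为笔者的示意性假设。

```python
import torch
import torch.nn as nn

class ShapeFn(nn.Module):
    """单特征的一维形状函数,可逐特征绘制以供解释。"""
    def __init__(self, hidden=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))
    def forward(self, x):                 # x: (B, 1)
        return self.net(x)

class MACM(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.mul = nn.ModuleList(ShapeFn() for _ in range(n_features))
        self.add = nn.ModuleList(ShapeFn() for _ in range(n_features))
    def forward(self, x):                 # x: (B, F)
        cols = x.split(1, dim=1)
        prod = torch.stack([f(c) for f, c in zip(self.mul, cols)], -1).prod(-1)
        summ = torch.stack([f(c) for f, c in zip(self.add, cols)], -1).sum(-1)
        return (prod + summ).squeeze(-1)  # 乘法部分 + 加法部分

print(MACM(4)(torch.randn(8, 4)).shape)   # torch.Size([8])
```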
[NLP-99] SpeechCT-CLIP: Distilling Text-Image Knowledge to Speech for Voice-Native Multimodal CT Analysis ICASSP2026
【速读】: 该论文旨在解决医学人工智能系统普遍依赖书面文本而忽视语音输入的问题,尤其是在放射学领域中,临床报告主要通过语音 dictation(口述)生成,但现有AI模型未能有效利用这一重要信息模态。其解决方案的关键在于构建一个大规模的语音放射学报告数据集(Speech-RATE),并提出SpeechCT-CLIP模型,通过对比学习将语音与三维CT图像对齐至共享表征空间;进一步采用知识蒸馏技术,从预训练的文本-图像CLIP模型中迁移语义对齐能力到语音模态,显著缩小了语音模型与文本模型之间的性能差距,从而实现了无需文本输入即可进行零样本分类和跨模态检索的高效语音驱动诊断支持。
链接: https://arxiv.org/abs/2510.02322
作者: Lukas Buess,Jan Geier,David Bani-Harouni,Chantal Pellegrini,Matthias Keicher,Paula Andrea Perez-Toro,Nassir Navab,Andreas Maier,Tomas Arias-Vergara
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: Submitted to ICASSP 2026; under review
Abstract:Spoken communication plays a central role in clinical workflows. In radiology, for example, most reports are created through dictation. Yet, nearly all medical AI systems rely exclusively on written text. In this work, we address this gap by exploring the feasibility of learning visual-language representations directly from spoken radiology reports. Specifically, we synthesize a large-scale dataset (Speech-RATE) of spoken radiology reports and train SpeechCT-CLIP, a contrastive model that aligns speech and 3D CT volumes in a shared representation space. While naive speech-based models underperform compared to text-trained counterparts, we show that knowledge distillation from a pretrained text-image CLIP model effectively transfers semantic alignment capabilities from text to speech, substantially narrowing this gap. Experiments demonstrate improved zero-shot classification F1 from 0.623 to 0.705, recovering 88% of the performance difference, and strong retrieval results without requiring text at inference. These findings highlight speech as a practical alternative to text in multimodal pretraining and open the door to voice-driven diagnostic support tools in clinical practice.
zh
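其训练目标大致可拆成两项:语音-CT 的对比对齐,加上向冻结文本塔看齐的蒸馏项。下面用随机张量示意损失的组合方式;嵌入维度、温度与蒸馏权重均为占位假设,仅表达思路。

```python
import torch
import torch.nn.functional as F

def clip_contrastive(a, b, tau=0.07):
    """对称 InfoNCE:批内配对为正样本。"""
    logits = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T / tau
    labels = torch.arange(a.size(0))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

speech = torch.randn(16, 512, requires_grad=True)   # 学生:语音编码器输出
ct     = torch.randn(16, 512)                        # 图像塔输出
text   = torch.randn(16, 512)                        # 冻结文本塔输出(教师)

loss = (clip_contrastive(speech, ct)
        + 1.0 * (1 - F.cosine_similarity(speech, text).mean()))  # 蒸馏项
loss.backward()
print(float(loss))
```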
[NLP-100] WEE-Therapy: A Mixture of Weak Encoders Framework for Psychological Counseling Dialogue Analysis
【速读】: 该论文旨在解决当前音频语言模型(AudioLLM)在心理咨询对话理解中表现不足的问题,尤其是其依赖通用预训练单语音编码器难以捕捉领域特异性特征(如复杂情绪和专业咨询技术)的局限性。解决方案的关键在于提出WEE-Therapy,一个基于弱编码器集成(Weak Encoder Ensemble, WEE)机制的多任务AudioLLM:通过引入一组轻量级、专业化编码器补充主编码器,并采用新颖的双路由策略,将稳定的数据无关领域知识与动态的数据相关专家选择相结合,从而在不显著增加参数量的前提下,在情绪识别、技术分类、风险检测和摘要生成等多项任务上实现显著性能提升。
链接: https://arxiv.org/abs/2510.02320
作者: Yongqi Kang,Yong Zhao
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
备注: 5 pages
Abstract:The advancement of computational psychology requires AI tools capable of deeply understanding counseling dialogues. Existing audio language models (AudioLLMs) often rely on single speech encoders pre-trained on general data, struggling to capture domain-specific features like complex emotions and professional techniques. To address this, we propose WEE-Therapy, a multi-task AudioLLM incorporating a Weak Encoder Ensemble (WEE) mechanism. This supplements a powerful base encoder with a pool of lightweight, specialized encoders. A novel dual-routing strategy combines stable, data-independent domain knowledge with dynamic, data-dependent expert selection. Evaluated on emotion recognition, technique classification, risk detection, and summarization, WEE-Therapy achieves significant performance gains across all tasks with minimal parameter overhead, demonstrating strong potential for AI-assisted clinical analysis.
zh
计算机视觉
[CV-0] LEAML: Label-Efficient Adaptation to Out-of-Distribution Visual Tasks for Multimodal Large Language Models
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在专业领域(如医学影像)中因标注数据稀缺且昂贵而导致的分布外(out-of-distribution, OOD)任务性能下降问题。解决方案的关键在于提出一种标签高效适配框架LEAML,其核心创新包括:利用少量标注的视觉问答(VQA)样本与大量未标注图像,通过一个由图像描述蒸馏正则化的问答生成器(QA Generator)为未标注数据生成领域相关的伪问答对;同时,仅选择性地更新与问答任务最相关的神经元,从而在蒸馏过程中高效获取领域特定知识。实验表明,LEAML在胃肠道内镜和体育视频问答任务上均显著优于标准微调方法,在极低监督条件下展现出更强的适应能力。
链接: https://arxiv.org/abs/2510.03232
作者: Ci-Siang Lin,Min-Hung Chen,Yu-Yang Sheng,Yu-Chiang Frank Wang
机构: National Taiwan University (国立台湾大学); NVIDIA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal Large Language Models (MLLMs) have achieved strong performance on general visual benchmarks but struggle with out-of-distribution (OOD) tasks in specialized domains such as medical imaging, where labeled data is limited and expensive. We introduce LEAML, a label-efficient adaptation framework that leverages both scarce labeled VQA samples and abundant unlabeled images. Our approach generates domain-relevant pseudo question-answer pairs for unlabeled data using a QA generator regularized by caption distillation. Importantly, we selectively update only those neurons most relevant to question-answering, enabling the QA Generator to efficiently acquire domain-specific knowledge during distillation. Experiments on gastrointestinal endoscopy and sports VQA demonstrate that LEAML consistently outperforms standard fine-tuning under minimal supervision, highlighting the effectiveness of our proposed LEAML framework.
zh
[CV-1] Improving GUI Grounding with Explicit Position-to-Coordinate Mapping
【速读】:该论文旨在解决GUI grounding任务中因视觉语言模型(VLMs)在高分辨率显示界面泛化能力不足而导致的像素坐标映射不可靠问题。核心瓶颈在于现有方法将坐标作为文本标记直接从视觉特征生成,迫使模型隐式学习复杂的像素位置映射关系,从而在未见分辨率下准确率显著下降。解决方案的关键在于两项互补创新:一是引入RULER tokens作为显式坐标标记,使模型能像参考地图上的网格线一样定位并调整坐标,而非从零生成;二是采用交错式多尺度旋转位置编码(Interleaved MRoPE, I-MRoPE),均衡地表示宽度与高度维度,缓解标准位置编码方案的空间不对称性。该方法通过提供明确的空间引导,显著提升了跨分辨率和平台的GUI自动化可靠性。
链接: https://arxiv.org/abs/2510.03230
作者: Suyuchen Wang,Tianyu Zhang,Ahmed Masry,Christopher Pal,Spandana Gella,Bang Liu,Perouz Taslakian
机构: ServiceNow; Mila - Quebec AI Institute (魁北克人工智能研究所); Université de Montréal (蒙特利尔大学); York University (约克大学); Polytechnique Montréal (蒙特利尔工程学院); McGill University (麦吉尔大学); CIFAR AI Chair (加拿大高级研究院人工智能主席)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:GUI grounding, the task of mapping natural-language instructions to pixel coordinates, is crucial for autonomous agents, yet remains difficult for current VLMs. The core bottleneck is reliable patch-to-pixel mapping, which breaks when extrapolating to high-resolution displays unseen during training. Current approaches generate coordinates as text tokens directly from visual features, forcing the model to infer complex position-to-pixel mappings implicitly; as a result, accuracy degrades and failures proliferate on new resolutions. We address this with two complementary innovations. First, RULER tokens serve as explicit coordinate markers, letting the model reference positions similar to gridlines on a map and adjust rather than generate coordinates from scratch. Second, Interleaved MRoPE (I-MRoPE) improves spatial encoding by ensuring that width and height dimensions are represented equally, addressing the asymmetry of standard positional schemes. Experiments on ScreenSpot, ScreenSpot-V2, and ScreenSpot-Pro show consistent gains in grounding accuracy, with the largest improvements on high-resolution interfaces. By providing explicit spatial guidance rather than relying on implicit learning, our approach enables more reliable GUI automation across diverse resolutions and platforms.
zh
[CV-2] MIXER: Mixed Hyperspherical Random Embedding Neural Network for Texture Recognition
【速读】:该论文旨在解决现有随机神经网络在纹理表征学习中仅聚焦于跨信息预测、缺乏对整体网络架构创新的问题。其解决方案的关键在于提出一种名为Mixer的新颖随机神经网络,核心机制包括:利用超球面随机嵌入(hyperspherical random embeddings)与双分支学习模块(dual-branch learning module),以同时捕获通道内(intra-channel)和通道间(inter-channel)的复杂关系,并通过一个新构建的优化问题进一步增强纹理表征的丰富性。
链接: https://arxiv.org/abs/2510.03228
作者: Ricardo T. Fares,Lucas C. Ribas
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Randomized neural networks for representation learning have consistently achieved prominent results in texture recognition tasks, effectively combining the advantages of both traditional techniques and learning-based approaches. However, existing approaches have so far focused mainly on improving cross-information prediction, without introducing significant advancements to the overall randomized network architecture. In this paper, we propose Mixer, a novel randomized neural network for texture representation learning. At its core, the method leverages hyperspherical random embeddings coupled with a dual-branch learning module to capture both intra- and inter-channel relationships, further enhanced by a newly formulated optimization problem for building rich texture representations. Experimental results show promising performance for the proposed approach across several pure texture benchmarks, each with distinct characteristics and challenges. The source code will be available upon publication.
zh
[CV-3] st-Time Defense Against Adversarial Attacks via Stochastic Resonance of Latent Ensembles
【速读】:该论文旨在解决对抗攻击(adversarial attacks)对深度学习模型预测结果的干扰问题,尤其是在测试阶段缺乏有效防御机制的情况下。现有方法如特征滤波或平滑处理往往导致信息损失,难以兼顾鲁棒性与性能。其解决方案的关键在于提出一种“以噪治噪”(combat noise with noise)的策略,利用随机共振(stochastic resonance)原理,在输入图像上引入微小的平移扰动,对变换后的特征嵌入进行对齐与聚合,并以闭式公式映射回原始参考图像,从而在不依赖训练、特定网络结构或攻击类型的前提下显著提升模型鲁棒性,实现无需微调的通用测试时防御(test-time defense)。
链接: https://arxiv.org/abs/2510.03224
作者: Dong Lao,Yuxiang Zhang,Haniyeh Ehsani Oskouie,Yangchao Wu,Alex Wong,Stefano Soatto
机构: LSU Vision Lab (路易斯安那州立大学视觉实验室); UCLA Vision Lab (加州大学洛杉矶分校视觉实验室); Yale Vision Lab (耶鲁大学视觉实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We propose a test-time defense mechanism against adversarial attacks: imperceptible image perturbations that significantly alter the predictions of a model. Unlike existing methods that rely on feature filtering or smoothing, which can lead to information loss, we propose to “combat noise with noise” by leveraging stochastic resonance to enhance robustness while minimizing information loss. Our approach introduces small translational perturbations to the input image, aligns the transformed feature embeddings, and aggregates them before mapping back to the original reference image. This can be expressed in a closed-form formula, which can be deployed on diverse existing network architectures without introducing additional network modules or fine-tuning for specific attack types. The resulting method is entirely training-free, architecture-agnostic, and attack-agnostic. Empirical results show state-of-the-art robustness on image classification and, for the first time, establish a generic test-time defense for dense prediction tasks, including stereo matching and optical flow, highlighting the method’s versatility and practicality. Specifically, relative to clean (unperturbed) performance, our method recovers up to 68.1% of the accuracy loss on image classification, 71.9% on stereo matching, and 29.2% on optical flow under various types of adversarial attacks.
zh
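摘要给出的闭式流程——平移扰动、特征对齐、聚合——可以用几行 PyTorch 写出来。下面以一个逐像素输出的占位模型演示;平移集合与聚合方式(取均值)为示意选择,并非论文的精确公式。

```python
import torch

def stochastic_resonance(model, x,
                         shifts=((0, 0), (1, 0), (0, 1), (-1, 0), (0, -1))):
    """对输入做小平移,逆向平移输出特征使其对齐,再取均值。"""
    outs = []
    for dy, dx in shifts:
        xs = torch.roll(x, (dy, dx), dims=(2, 3))            # 平移扰动
        f = model(xs)
        outs.append(torch.roll(f, (-dy, -dx), dims=(2, 3)))  # 对齐回参考系
    return torch.stack(outs).mean(0)

model = torch.nn.Conv2d(3, 8, 3, padding=1)   # 任意逐像素模型的占位
print(stochastic_resonance(model, torch.randn(1, 3, 32, 32)).shape)
```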
[CV-4] MonSTeR: a Unified Model for Motion Scene Text Retrieval
【速读】:该论文旨在解决多模态对齐评估难题,即如何量化骨骼运动(motion)、意图(text)与场景(scene)三者之间的语义一致性问题。现有研究缺乏有效工具来衡量这三种模态间的协同关系,导致跨模态检索和理解能力受限。解决方案的关键在于提出MonSTeR模型——首个面向运动-场景-文本的检索模型,其通过构建统一潜在空间(unified latent space),融合单模态与跨模态表示,从而捕捉三者间复杂的高阶依赖关系,实现灵活且鲁棒的跨模态检索。实验表明,该方法在多项任务中优于仅依赖单模态表示的三模态模型,并在用户研究中验证了检索得分与人类偏好高度一致。
链接: https://arxiv.org/abs/2510.03200
作者: Luca Collorone,Matteo Gioia,Massimiliano Pappa,Paolo Leoni,Giovanni Ficarra,Or Litany,Indro Spinelli,Fabio Galasso
机构: Sapienza University of Rome (罗马大学); Technion (以色列理工学院); NVIDIA (英伟达); WSense
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Intention drives human movement in complex environments, but such movement can only happen if the surrounding context supports it. Despite the intuitive nature of this mechanism, existing research has not yet provided tools to evaluate the alignment between skeletal movement (motion), intention (text), and the surrounding context (scene). In this work, we introduce MonSTeR, the first MOtioN-Scene-TExt Retrieval model. Inspired by the modeling of higher-order relations, MonSTeR constructs a unified latent space by leveraging unimodal and cross-modal representations. This allows MonSTeR to capture the intricate dependencies between modalities, enabling flexible but robust retrieval across various tasks. Our results show that MonSTeR outperforms trimodal models that rely solely on unimodal representations. Furthermore, we validate the alignment of our retrieval scores with human preferences through a dedicated user study. We demonstrate the versatility of MonSTeR’s latent space on zero-shot in-Scene Object Placement and Motion Captioning. Code and pre-trained models are available at this http URL.
zh
[CV-5] Memory Forcing: Spatio-Temporal Memory for Consistent Scene Generation on Minecraft
【速读】:该论文旨在解决自回归视频扩散模型在模拟Minecraft类交互场景时面临的长期空间一致性与新场景生成质量之间的权衡问题。具体而言,在有限计算预算下,模型需在有限上下文窗口内压缩并利用历史信息:仅依赖时间记忆会导致重访区域的空间不一致,而引入空间记忆虽能增强一致性,却可能因空间上下文不足导致新场景生成质量下降。解决方案的关键在于提出“Memory Forcing”学习框架,其核心是通过几何索引的空间记忆结构与两种训练策略协同作用:混合训练(Hybrid Training)区分探索与重访行为,引导模型在探索时使用时间记忆、重访时调用空间记忆;链式前向训练(Chained Forward Training)通过模型滚动预测扩大姿态变化,促使模型更依赖空间记忆以维持一致性;同时结合点到帧检索(Point-to-Frame Retrieval)和增量三维重建(Incremental 3D Reconstruction),高效实现历史信息的精准访问与更新。实验表明,该方法在保持计算效率的同时显著提升了长序列下的空间一致性和生成质量。
链接: https://arxiv.org/abs/2510.03198
作者: Junchao Huang,Xinting Hu,Boyao Han,Shaoshuai Shi,Zhuotao Tian,Tianyu He,Li Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 8 figures
Abstract:Autoregressive video diffusion models have proved effective for world modeling and interactive scene generation, with Minecraft gameplay as a representative application. To faithfully simulate play, a model must generate natural content while exploring new scenes and preserve spatial consistency when revisiting explored areas. Under limited computation budgets, it must compress and exploit historical cues within a finite context window, which exposes a trade-off: Temporal-only memory lacks long-term spatial consistency, whereas adding spatial memory strengthens consistency but may degrade new scene generation quality when the model over-relies on insufficient spatial context. We present Memory Forcing, a learning framework that pairs training protocols with a geometry-indexed spatial memory. Hybrid Training exposes distinct gameplay regimes, guiding the model to rely on temporal memory during exploration and incorporate spatial memory for revisits. Chained Forward Training extends autoregressive training with model rollouts, where chained predictions create larger pose variations and encourage reliance on spatial memory for maintaining consistency. Point-to-Frame Retrieval efficiently retrieves history by mapping currently visible points to their source frames, while Incremental 3D Reconstruction maintains and updates an explicit 3D cache. Extensive experiments demonstrate that Memory Forcing achieves superior long-term spatial consistency and generative quality across diverse environments, while maintaining computational efficiency for extended sequences.
zh
[CV-6] Product-Quantised Image Representation for High-Quality Image Synthesis
【速读】:该论文旨在解决高保真图像生成中潜在表示(latent representations)的高效编码问题,传统产品量化(Product Quantisation, PQ)在该领域应用有限。为提升重建性能并优化计算效率,作者提出PQGAN,一种将PQ集成到VQGAN框架中的量化图像自编码器。其关键在于对码本大小、嵌入维度和子空间分解之间交互关系的深入分析,揭示了向量量化(Vector Quantisation, VQ)与PQ在扩展嵌入维度时性能表现相反的规律,并据此指导超参数最优选择。实验表明,PQGAN在PSNR上达到37dB(优于此前27dB),同时显著降低FID、LPIPS和CMMD分数达96%,且可无缝嵌入预训练扩散模型,实现生成速度大幅提升或分辨率翻倍而无需额外计算成本。
链接: https://arxiv.org/abs/2510.03191
作者: Denis Zavadski,Nikita Philip Tatsch,Carsten Rother
机构: Heidelberg University (海德堡大学); ELIZA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Product quantisation (PQ) is a classical method for scalable vector encoding, yet it has seen limited usage for latent representations in high-fidelity image generation. In this work, we introduce PQGAN, a quantised image autoencoder that integrates PQ into the well-known vector quantisation (VQ) framework of VQGAN. PQGAN achieves a noticeable improvement over state-of-the-art methods in terms of reconstruction performance, including both quantisation methods and their continuous counterparts. We achieve a PSNR score of 37dB, where prior work achieves 27dB, and are able to reduce the FID, LPIPS, and CMMD score by up to 96%. Our key to success is a thorough analysis of the interaction between codebook size, embedding dimensionality, and subspace factorisation, with vector and scalar quantisation as special cases. We obtain novel findings; for example, the performance of VQ and PQ behaves in opposite ways when scaling the embedding dimension. Furthermore, our analysis shows performance trends for PQ that help guide optimal hyperparameter selection. Finally, we demonstrate that PQGAN can be seamlessly integrated into pre-trained diffusion models. This enables either a significantly faster and more compute-efficient generation, or a doubling of the output resolution at no additional cost, positioning PQ as a strong extension for discrete latent representation in image synthesis.
zh
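乘积量化本身是标准技术:把 D 维向量切成 M 个子空间,各用 K 个码字的小码本量化,等效码本规模达 K^M。下面是查找过程的最小实现(码本随机初始化,不含训练),供理解 PQGAN 中"码本大小 × 嵌入维度 × 子空间分解"这组超参的含义。

```python
import torch

D, M, K = 64, 4, 256                        # 维度、子空间数、每子空间码字数
codebooks = torch.randn(M, K, D // M)       # 每个子空间一个小码本

def pq_quantize(z):                         # z: (B, D)
    parts = z.view(z.size(0), M, D // M).transpose(0, 1)   # (M, B, d)
    idx = torch.cdist(parts, codebooks).argmin(-1)          # (M, B) 最近码字
    quant = torch.stack([codebooks[m][idx[m]] for m in range(M)], dim=1)
    return quant.reshape(z.size(0), D), idx

q, idx = pq_quantize(torch.randn(8, D))
print(q.shape, idx.shape)                   # (8, 64) 与 (4, 8)
```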
[CV-7] Dynamic Prompt Generation for Interactive 3D Medical Image Segmentation Training
【速读】:该论文旨在解决交互式3D生物医学图像分割中模型缺乏体素感知能力(volumetric awareness)和交互能力受限的问题。其解决方案的关键在于提出一种训练策略,结合动态体素提示生成(dynamic volumetric prompt generation)与内容感知自适应裁剪(content-aware adaptive cropping),以优化图像编码器的利用效率;该策略在训练过程中模拟真实的用户交互模式,同时通过单GPU实现基于序列精炼反馈的学习,有效缓解计算资源瓶颈。
链接: https://arxiv.org/abs/2510.03189
作者: Tidiane Camaret Ndir,Alexander Pfefferle,Robin Tibor Schirrmeister
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Interactive 3D biomedical image segmentation requires efficient models that can iteratively refine predictions based on user prompts. Current foundation models either lack volumetric awareness or suffer from limited interactive capabilities. We propose a training strategy that combines dynamic volumetric prompt generation with content-aware adaptive cropping to optimize the use of the image encoder. Our method simulates realistic user interaction patterns during training while addressing the computational challenges of learning from sequential refinement feedback on a single GPU. For efficient training, we initialize our network using the publicly available weights from the nnInteractive segmentation model. Evaluation on the \textbfFoundation Models for Interactive 3D Biomedical Image Segmentation competition demonstrates strong performance with an average final Dice score of 0.6385, normalized surface distance of 0.6614, and area-under-the-curve metrics of 2.4799 (Dice) and 2.5671 (NSD).
zh
[CV-8] ROGR: Relightable 3D Objects using Generative Relighting NEURIPS2025
【速读】:该论文旨在解决从多视角图像中重建可重光照(relightable)三维模型的问题,即如何在不依赖逐光照优化或光传输模拟的前提下,实现对物体在任意环境光照下的高效、真实感渲染。解决方案的关键在于提出了一种基于生成式重光照模型的新型方法ROGR,其核心是训练一个光照条件神经辐射场(lighting-conditioned Neural Radiance Field, NeRF),该NeRF采用创新的双分支架构分别编码通用光照效应与镜面反射特性,从而能够直接根据输入的环境贴图生成对应光照下的物体外观,显著提升了重光照的效率与质量。
链接: https://arxiv.org/abs/2510.03163
作者: Jiapeng Tang,Matthew Lavine,Dor Verbin,Stephan J. Garbin,Matthias Nießner,Ricardo Martin Brualla,Pratul P. Srinivasan,Philipp Henzler
机构: Google Research (谷歌研究); Google Deepmind (谷歌深度学习); Technical University of Munich (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: NeurIPS 2025 Spotlight. Project page: this https URL
Abstract:We introduce ROGR, a novel approach that reconstructs a relightable 3D model of an object captured from multiple views, driven by a generative relighting model that simulates the effects of placing the object under novel environment illuminations. Our method samples the appearance of the object under multiple lighting environments, creating a dataset that is used to train a lighting-conditioned Neural Radiance Field (NeRF) that outputs the object’s appearance under any input environmental lighting. The lighting-conditioned NeRF uses a novel dual-branch architecture to encode the general lighting effects and specularities separately. The optimized lighting-conditioned NeRF enables efficient feed-forward relighting under arbitrary environment maps without requiring per-illumination optimization or light transport simulation. We evaluate our approach on the established TensoIR and Stanford-ORB datasets, where it improves upon the state-of-the-art on most metrics, and showcase our approach on real-world object captures.
zh
[CV-9] UniShield: An Adaptive Multi-Agent Framework for Unified Forgery Image Detection and Localization
【速读】:该论文旨在解决合成图像伪造检测与定位(Forgery Image Detection and Localization, FIDL)中现有方法存在的局限性,包括领域特异性过强、跨域泛化能力差以及缺乏自适应集成框架等问题。其解决方案的关键在于提出UniShield——一种基于多智能体的统一系统,通过引入感知代理(perception agent)和检测代理(detection agent)的协同机制:感知代理动态分析图像特征以选择最优检测模型,检测代理则将多种专家检测器整合为统一框架并输出可解释报告,从而实现对图像篡改、文档伪造、DeepFake及AI生成图像等多类伪造内容的高效、自适应检测与定位。
链接: https://arxiv.org/abs/2510.03161
作者: Qing Huang,Zhipei Xu,Xuanyu Zhang,Jian Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:With the rapid advancements in image generation, synthetic images have become increasingly realistic, posing significant societal risks, such as misinformation and fraud. Forgery Image Detection and Localization (FIDL) thus emerges as essential for maintaining information integrity and societal security. Despite impressive performances by existing domain-specific detection methods, their practical applicability remains limited, primarily due to their narrow specialization, poor cross-domain generalization, and the absence of an integrated adaptive framework. To address these issues, we propose UniShield, the novel multi-agent-based unified system capable of detecting and localizing image forgeries across diverse domains, including image manipulation, document manipulation, DeepFake, and AI-generated images. UniShield innovatively integrates a perception agent with a detection agent. The perception agent intelligently analyzes image features to dynamically select suitable detection models, while the detection agent consolidates various expert detectors into a unified framework and generates interpretable reports. Extensive experiments show that UniShield achieves state-of-the-art results, surpassing both existing unified approaches and domain-specific detectors, highlighting its superior practicality, adaptiveness, and scalability.
zh
[CV-10] SpineBench: A Clinically Salient Level-Aware Benchmark Powered by the SpineMed-450k Corpus
【速读】:该论文旨在解决当前生成式 AI 在脊柱疾病辅助诊断中面临的两大核心问题:一是缺乏具备椎体层级意识(level-aware)的多模态数据集,导致模型难以在X射线、CT和MRI等不同影像模态间进行精准的椎体水平推理;二是缺少临床可追溯、标准化的评估基准,限制了模型性能的客观衡量与优化。解决方案的关键在于构建一个由脊柱外科医生共同设计的生态系统——SpineMed,其核心组成部分为SpineMed-450k数据集与SpineBench评估框架。SpineMed-450k是首个面向椎体层级推理的大规模指令数据集,包含超过45万条高质量、可溯源的问答、多轮咨询及报告生成指令,通过“两阶段大语言模型(LLM)生成”流程(草稿与修订)确保内容准确性;SpineBench则提供基于临床重要维度(如椎体定位、病灶评估和手术规划)的标准化评测体系。实验表明,基于该数据集微调的模型在SpineBench上显著优于现有大型视觉语言模型(LVLMs),且临床医生评价其输出具有更高的诊断清晰度和实用性。
链接: https://arxiv.org/abs/2510.03160
作者: Ming Zhao,Wenhui Dong,Yang Zhang,Xiang Zheng,Zhonghao Zhang,Zian Zhou,Yunzhi Guan,Liukun Xu,Wei Peng,Zhaoyang Gong,Zhicheng Zhang,Dachuan Li,Xiaosheng Ma,Yuli Ma,Jianing Ni,Changjiang Jiang,Lixia Tian,Qixin Chen,Kaishun Xia,Pingping Liu,Tongshun Zhang,Zhiqiang Liu,Zhongan Bi,Chenyang Si,Tiansheng Sun,Caifeng Shan
机构: Jilin University (吉林大学); Nanjing University (南京大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Ningxia University (宁夏大学); Zhejiang University (浙江大学); Stanford University (斯坦福大学); π3 Lab; Wuhan University (武汉大学); Beijing Jiaotong University (北京交通大学); The General Hospital of the People’s Liberation Army (中国人民解放军总医院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Spine disorders affect 619 million people globally and are a leading cause of disability, yet AI-assisted diagnosis remains limited by the lack of level-aware, multimodal datasets. Clinical decision-making for spine disorders requires sophisticated reasoning across X-ray, CT, and MRI at specific vertebral levels. However, progress has been constrained by the absence of traceable, clinically-grounded instruction data and standardized, spine-specific benchmarks. To address this, we introduce SpineMed, an ecosystem co-designed with practicing spine surgeons. It features SpineMed-450k, the first large-scale dataset explicitly designed for vertebral-level reasoning across imaging modalities with over 450,000 instruction instances, and SpineBench, a clinically-grounded evaluation framework. SpineMed-450k is curated from diverse sources, including textbooks, guidelines, open datasets, and ~1,000 de-identified hospital cases, using a clinician-in-the-loop pipeline with a two-stage LLM generation method (draft and revision) to ensure high-quality, traceable data for question-answering, multi-turn consultations, and report generation. SpineBench evaluates models on clinically salient axes, including level identification, pathology assessment, and surgical planning. Our comprehensive evaluation of several recently advanced large vision-language models (LVLMs) on SpineBench reveals systematic weaknesses in fine-grained, level-specific reasoning. In contrast, our model fine-tuned on SpineMed-450k demonstrates consistent and significant improvements across all tasks. Clinician assessments confirm the diagnostic clarity and practical utility of our model’s outputs.
zh
[CV-11] ReeMark: Reeb Graphs for Simulating Patterns of Life in Spatiotemporal Trajectories
【速读】:该论文旨在解决如何准确建模人类移动行为以支持城市规划、流行病学和交通管理等问题,核心挑战在于生成既保留个体与群体层面日常活动模式(Patterns of Life, PoLs),又能体现真实生活一致性和变异性的时空轨迹。其解决方案的关键是提出马尔可夫-Reeb图(Markovian Reeb Graphs),这是一种基于概率拓扑模型的框架,通过融合个体级与群体级移动结构,在保证数据和计算效率的同时,生成高保真度的未来轨迹模拟结果。
链接: https://arxiv.org/abs/2510.03152
作者: Anantajit Subrahmanya,Chandrakanth Gudavalli,Connor Levenson,Umang Garg,B.S. Manjunath
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注: 15 pages, 3 figures, 2 algorithms, 1 table
Abstract:Accurately modeling human mobility is critical for urban planning, epidemiology, and traffic management. In this work, we introduce Markovian Reeb Graphs, a novel framework for simulating spatiotemporal trajectories that preserve Patterns of Life (PoLs) learned from baseline data. By combining individual- and population-level mobility structures within a probabilistic topological model, our approach generates realistic future trajectories that capture both consistency and variability in daily life. Evaluations on the Urban Anomalies dataset (Atlanta and Berlin subsets) using the Jensen-Shannon Divergence (JSD) across population- and agent-level metrics demonstrate that the proposed method achieves strong fidelity while remaining data- and compute-efficient. These results position Markovian Reeb Graphs as a scalable framework for trajectory simulation with broad applicability across diverse urban environments.
zh
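论文的马尔可夫-Reeb图远比普通马尔可夫链精细,但其"从基线数据学转移、按时段逐步采样"的骨架可以用一个(地点 × 时段)链来粗略体会。下面的状态、转移概率与时段划分均为笔者杜撰的玩具设定,仅演示采样流程。

```python
import random

random.seed(7)
# 按"时段桶"区分的转移表(真实系统可逐小时、逐个体估计)
T = {"day":   {"home": [("work", .8), ("cafe", .2)],
               "work": [("work", .6), ("cafe", .3), ("gym", .1)],
               "cafe": [("work", .7), ("home", .3)],
               "gym":  [("work", .5), ("home", .5)]},
     "night": {s: [("home", .9), ("gym", .1)]
               for s in ("home", "work", "cafe", "gym")}}

def step(state, bucket):
    r, acc = random.random(), 0.0
    for nxt, p in T[bucket][state]:
        acc += p
        if r <= acc:
            return nxt
    return state

cur, traj = "home", []
for hour in range(8, 24):
    cur = step(cur, "day" if hour < 19 else "night")
    traj.append((hour, cur))
print(traj[:6])
```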
[CV-12] MM-Nav: Multi-View VLA Model for Robust Visual Navigation via Multi-Expert Learning
【速读】:该论文旨在解决视觉导航策略中因光学信息难以显式建模(如LiDAR点云或深度图)而导致的智能模型设计与大规模数据依赖问题。其解决方案的关键在于利用视觉-语言-动作(Vision-Language-Action, VLA)模型的多模态理解能力,通过教师-学生框架从合成专家数据中学习多样化的导航能力。具体而言,作者构建了一个基于预训练大语言模型和视觉基础模型的多视角VLA模型MM-Nav(支持360°观测),并结合三种定制化强化学习(Reinforcement Learning, RL)专家在不同导航任务(到达、挤压、避障)中生成的带特权深度信息的专家数据进行迭代训练,同时动态平衡各能力模块的训练比例。实验证明该方法不仅在仿真环境中展现出强泛化能力,且学生VLA模型性能超越RL教师,体现了多能力融合的协同效应。
链接: https://arxiv.org/abs/2510.03142
作者: Tianyu Xu,Jiawei Chen,Jiazhao Zhang,Wenyao Zhang,Zekun Qi,Minghan Li,Zhizheng Zhang,He Wang
机构: Peking University (北京大学); Galbot; Shanghai Jiao Tong University (上海交通大学); Tsinghua University (清华大学); BAAI (北京智源研究院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Visual navigation policy is widely regarded as a promising direction, as it mimics humans by using egocentric visual observations for navigation. However, optical information of visual observations is difficult to be explicitly modeled like LiDAR point clouds or depth maps, which subsequently requires intelligent models and large-scale data. To this end, we propose to leverage the intelligence of the Vision-Language-Action (VLA) model to learn diverse navigation capabilities from synthetic expert data in a teacher-student manner. Specifically, we implement the VLA model, MM-Nav, as a multi-view VLA (with 360° observations) based on pretrained large language models and visual foundation models. For large-scale navigation data, we collect expert data from three reinforcement learning (RL) experts trained with privileged depth information in three challenging tailor-made environments for different navigation capabilities: reaching, squeezing, and avoiding. We iteratively train our VLA model using data collected online from RL experts, where the training ratio is dynamically balanced based on performance on individual capabilities. Through extensive experiments in synthetic environments, we demonstrate that our model achieves strong generalization capability. Moreover, we find that our student VLA model outperforms the RL teachers, demonstrating the synergistic effect of integrating multiple capabilities. Extensive real-world experiments further confirm the effectiveness of our method.
zh
[CV-13] Mask2IV: Interaction-Centric Video Generation via Mask Trajectories
【速读】:该论文旨在解决交互中心视频(interaction-centric videos)生成中的复杂动态交互建模难题,尤其针对人类或机器人与物体之间交互场景的高质量合成问题。现有方法通常难以有效捕捉此类交互的多样性与时空一致性,且依赖密集精确的掩码(mask)标注,限制了其在真实场景中的应用。解决方案的关键在于提出Mask2IV框架,采用解耦的两阶段流程:第一阶段预测演员(actor)与物体的合理运动轨迹,第二阶段基于这些轨迹生成视频。该设计无需用户输入密集掩码,同时保留对交互过程的灵活控制能力,并支持通过动作描述或空间位置线索实现直观操控,从而在视觉真实性和可控性上显著优于现有基线方法。
链接: https://arxiv.org/abs/2510.03135
作者: Gen Li,Bo Zhao,Jianfei Yang,Laura Sevilla-Lara
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Project page: this https URL
Abstract:Generating interaction-centric videos, such as those depicting humans or robots interacting with objects, is crucial for embodied intelligence, as they provide rich and diverse visual priors for robot learning, manipulation policy training, and affordance reasoning. However, existing methods often struggle to model such complex and dynamic interactions. While recent studies show that masks can serve as effective control signals and enhance generation quality, obtaining dense and precise mask annotations remains a major challenge for real-world use. To overcome this limitation, we introduce Mask2IV, a novel framework specifically designed for interaction-centric video generation. It adopts a decoupled two-stage pipeline that first predicts plausible motion trajectories for both actor and object, then generates a video conditioned on these trajectories. This design eliminates the need for dense mask inputs from users while preserving the flexibility to manipulate the interaction process. Furthermore, Mask2IV supports versatile and intuitive control, allowing users to specify the target object of interaction and guide the motion trajectory through action descriptions or spatial position cues. To support systematic training and evaluation, we curate two benchmarks covering diverse action and object categories across both human-object interaction and robotic manipulation scenarios. Extensive experiments demonstrate that our method achieves superior visual realism and controllability compared to existing baselines.
zh
[CV-14] HAVIR: HierArchical Vision to Image Reconstruction using CLIP-Guided Versatile Diffusion
【速读】:该论文旨在解决从脑活动信号中高保真重建复杂自然场景视觉信息的问题,现有方法在处理低层特征异质性和高层语义纠缠(semantic entanglement)时表现不足。其解决方案的关键在于受视觉皮层层次化表征理论启发,提出HAVIR模型,将视觉 cortex 分为两个层次区域:结构生成器(Structural Generator)从空间加工体素中提取结构信息并转化为潜在扩散先验,语义提取器(Semantic Extractor)将语义加工体素映射为CLIP嵌入,二者通过通用扩散模型(Versatile Diffusion model)融合以合成最终图像,从而在复杂场景下显著提升重建的结构与语义质量。
链接: https://arxiv.org/abs/2510.03122
作者: Shiyi Zhang,Dong Liang,Hairong Zheng,Yihang Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The reconstruction of visual information from brain activity fosters interdisciplinary integration between neuroscience and computer vision. However, existing methods still face challenges in accurately recovering highly complex visual stimuli. This difficulty stems from the characteristics of natural scenes: low-level features exhibit heterogeneity, while high-level features show semantic entanglement due to contextual overlaps. Inspired by the hierarchical representation theory of the visual cortex, we propose the HAVIR model, which separates the visual cortex into two hierarchical regions and extracts distinct features from each. Specifically, the Structural Generator extracts structural information from spatial processing voxels and converts it into latent diffusion priors, while the Semantic Extractor converts semantic processing voxels into CLIP embeddings. These components are integrated via the Versatile Diffusion model to synthesize the final image. Experimental results demonstrate that HAVIR enhances both the structural and semantic quality of reconstructions, even in complex scenes, and outperforms existing models.
zh
[CV-15] aming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction
【速读】:该论文旨在解决文本到音视频生成(Text-to-Sounding-Video, T2SV)任务中的两个核心挑战:其一,传统方法使用单一文本描述同时作为视频和音频的条件输入,导致模态间干扰,影响预训练模型的性能;其二,跨模态特征交互机制尚不明确,难以实现语义与时间上的同步。为应对这些问题,作者提出Hierarchical Visual-Grounded Captioning (HVGC)框架,通过生成解耦的视频与音频双标签来消除条件阶段的模态干扰;在此基础上进一步设计Dual-tower Diffusion Transformer (BridgeDiT),引入Dual CrossAttention (DCA)机制作为对称双向信息交换的“桥梁”,有效实现跨模态语义对齐与时间同步,从而显著提升T2SV生成质量。
链接: https://arxiv.org/abs/2510.03117
作者: Kaisi Guan,Xihua Wang,Zhengfeng Lai,Xin Cheng,Peng Zhang,XiaoJiang Liu,Ruihua Song,Meng Cao
机构: Renmin University of China(中国人民大学); Apple(苹果)
类目: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
备注:
Abstract:This study focuses on a challenging yet promising task, Text-to-Sounding-Video (T2SV) generation, which aims to generate a video with synchronized audio from text conditions, meanwhile ensuring both modalities are aligned with text. Despite progress in joint audio-video training, two critical challenges still remain unaddressed: (1) a single, shared text caption where the text for video is equal to the text for audio often creates modal interference, confusing the pretrained backbones, and (2) the optimal mechanism for cross-modal feature interaction remains unclear. To address these challenges, we first propose the Hierarchical Visual-Grounded Captioning (HVGC) framework that generates pairs of disentangled captions, a video caption, and an audio caption, eliminating interference at the conditioning stage. Based on HVGC, we further introduce BridgeDiT, a novel dual-tower diffusion transformer, which employs a Dual CrossAttention (DCA) mechanism that acts as a robust ``bridge" to enable a symmetric, bidirectional exchange of information, achieving both semantic and temporal synchronization. Extensive experiments on three benchmark datasets, supported by human evaluations, demonstrate that our method achieves state-of-the-art results on most metrics. Comprehensive ablation studies further validate the effectiveness of our contributions, offering key insights for the future T2SV task. All the codes and checkpoints will be publicly released.
zh
[CV-16] GeoComplete: Geometry-Aware Diffusion for Reference-Driven Image Completion NEURIPS2025
【速读】:该论文旨在解决参考图像引导的图像修复(reference-driven image completion)问题,尤其针对目标视图与参考图像存在显著差异时,现有生成式方法因缺乏几何约束而导致修复内容错位或不合理的问题。解决方案的关键在于提出GeoComplete框架,其核心创新包括:1)通过投影点云对扩散过程进行条件控制,注入显式的三维结构信息以保障几何一致性;2)引入目标感知掩码策略,在训练中识别并屏蔽参考图像中目标视图不可见的区域,从而引导模型聚焦于有效参考线索。该框架采用双分支扩散架构,结合跨分支联合自注意力机制,实现几何感知且视觉质量高的图像修复。
链接: https://arxiv.org/abs/2510.03110
作者: Beibei Lin,Tingting Chen,Robby T. Tan
机构: National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2025. Project page: this https URL
Abstract:Reference-driven image completion, which restores missing regions in a target view using additional images, is particularly challenging when the target view differs significantly from the references. Existing generative methods rely solely on diffusion priors and, without geometric cues such as camera pose or depth, often produce misaligned or implausible content. We propose GeoComplete, a novel framework that incorporates explicit 3D structural guidance to enforce geometric consistency in the completed regions, setting it apart from prior image-only approaches. GeoComplete introduces two key ideas: conditioning the diffusion process on projected point clouds to infuse geometric information, and applying target-aware masking to guide the model toward relevant reference cues. The framework features a dual-branch diffusion architecture. One branch synthesizes the missing regions from the masked target, while the other extracts geometric features from the projected point cloud. Joint self-attention across branches ensures coherent and accurate completion. To address regions visible in references but absent in the target, we project the target view into each reference to detect occluded areas, which are then masked during training. This target-aware masking directs the model to focus on useful cues, enhancing performance in difficult scenarios. By integrating a geometry-aware dual-branch diffusion architecture with a target-aware masking strategy, GeoComplete offers a unified and robust solution for geometry-conditioned image completion. Experiments show that GeoComplete achieves a 17.1 PSNR improvement over state-of-the-art methods, significantly boosting geometric accuracy while maintaining high visual quality.
zh
[CV-17] Geometry Meets Vision: Revisiting Pretrained Semantics in Distilled Fields
【速读】:该论文旨在解决生成式辐射场(radiance fields)中语义蒸馏(semantic distillation)的优化问题,特别是探讨几何引导(geometry-grounding)是否能提升蒸馏语义特征的质量及其在下游任务中的性能表现。其核心问题是:几何引导的语义特征是否优于纯视觉特征(visual-only features),尤其是在空间感知任务如位姿估计和物体定位中。解决方案的关键在于提出了一种名为SPINE的新框架,用于无需初始猜测即可实现高精度辐射场反演(radiance field inversion),其核心由两个模块组成:基于蒸馏语义的粗粒度反演与基于光度优化的细粒度反演。实验表明,尽管几何引导特征包含更丰富的几何细节,但其在位姿估计任务中反而导致精度下降,说明纯视觉特征在多样性任务中更具优势,从而凸显了未来研究需聚焦于如何有效整合几何信息以增强预训练语义特征的泛化能力。
链接: https://arxiv.org/abs/2510.03104
作者: Zhiting Mei,Ola Shorinwa,Anirudha Majumdar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Semantic distillation in radiance fields has spurred significant advances in open-vocabulary robot policies, e.g., in manipulation and navigation, founded on pretrained semantics from large vision models. While prior work has demonstrated the effectiveness of visual-only semantic features (e.g., DINO and CLIP) in Gaussian Splatting and neural radiance fields, the potential benefit of geometry-grounding in distilled fields remains an open question. In principle, visual-geometry features seem very promising for spatial tasks such as pose estimation, prompting the question: Do geometry-grounded semantic features offer an edge in distilled fields? Specifically, we ask three critical questions: First, does spatial-grounding produce higher-fidelity geometry-aware semantic features? We find that image features from geometry-grounded backbones contain finer structural details compared to their counterparts. Secondly, does geometry-grounding improve semantic object localization? We observe no significant difference in this task. Thirdly, does geometry-grounding enable higher-accuracy radiance field inversion? Given the limitations of prior work and their lack of semantics integration, we propose a novel framework SPINE for inverting radiance fields without an initial guess, consisting of two core components: coarse inversion using distilled semantics, and fine inversion using photometric-based optimization. Surprisingly, we find that the pose estimation accuracy decreases with geometry-grounded features. Our results suggest that visual-only features offer greater versatility for a broader range of downstream tasks, although geometry-grounded features contain more geometric detail. Notably, our findings underscore the necessity of future research on effective strategies for geometry-grounding that augment the versatility and performance of pretrained semantic features.
zh
[CV-18] Latent Diffusion Unlearning: Protecting Against Unauthorized Personalization Through Trajectory Shifted Perturbations
【速读】:该论文旨在解决生成式 AI(Generative AI)在个性化微调过程中引发的数据隐私泄露、知识产权侵犯及未经授权使用等问题,尤其是针对扩散模型(Diffusion Models)在少量用户图像下即可实现高效个性化所带来的风险。其核心解决方案是提出一种基于潜在空间(latent space)的扰动策略,通过交替执行去噪与反演操作,并调整扩散过程的起始点(denoising trajectory),从而生成在视觉上与原图几乎无差异但对下游生成模型具有“不可学习性”的中毒样本。该方法将不可学习性(unlearnability)嵌入到潜在扩散模型(Latent Diffusion Models, LDMs)框架中,显著提升了对抗先进逆向攻击的鲁棒性(平均提升约10%),同时保持了较高的感知保真度(PSNR、SSIM 和 FID 指标改善约8%-10%),实现了对敏感数据的隐蔽且有效的防御。
链接: https://arxiv.org/abs/2510.03089
作者: Naresh Kumar Devulapally,Shruti Agarwal,Tejas Gokhale,Vishnu Suresh Lokhande
机构: University at Buffalo, The State University of New York(纽约州立大学布法罗分校); Adobe Research(Adobe 研究院); University of Maryland, Baltimore County(马里兰大学巴尔的摩县分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Text-to-image diffusion models have demonstrated remarkable effectiveness in rapid and high-fidelity personalization, even when provided with only a few user images. However, the effectiveness of personalization techniques has lead to concerns regarding data privacy, intellectual property protection, and unauthorized usage. To mitigate such unauthorized usage and model replication, the idea of generating ``unlearnable’’ training samples utilizing image poisoning techniques has emerged. Existing methods for this have limited imperceptibility as they operate in the pixel space which results in images with noise and artifacts. In this work, we propose a novel model-based perturbation strategy that operates within the latent space of diffusion models. Our method alternates between denoising and inversion while modifying the starting point of the denoising trajectory: of diffusion models. This trajectory-shifted sampling ensures that the perturbed images maintain high visual fidelity to the original inputs while being resistant to inversion and personalization by downstream generative models. This approach integrates unlearnability into the framework of Latent Diffusion Models (LDMs), enabling a practical and imperceptible defense against unauthorized model adaptation. We validate our approach on four benchmark datasets to demonstrate robustness against state-of-the-art inversion attacks. Results demonstrate that our method achieves significant improvements in imperceptibility ( \sim 8 % -10% on perceptual metrics including PSNR, SSIM, and FID) and robustness ( \sim 10% on average across five adversarial settings), highlighting its effectiveness in safeguarding sensitive data.
zh
[CV-19] What Drives Compositional Generalization in Visual Generative Models?
【速读】:该论文旨在解决视觉生成模型在组合泛化(compositional generalization)能力上的局限性问题,即模型能否将已知概念以新颖方式组合生成新场景。研究发现,影响组合泛化性能的两个关键设计因素是:训练目标是否作用于离散或连续分布,以及条件输入在训练过程中对组成概念的信息提供程度。解决方案的关键在于通过引入基于JEPA(Joint-Embedding Predictive Architecture)的辅助连续损失来松弛MaskGIT等离散模型的离散损失约束,从而提升其在组合生成任务中的表现。
链接: https://arxiv.org/abs/2510.03075
作者: Karim Farid,Rajat Sahay,Yumna Ali Alnaggar,Simon Schrodi,Volker Fischer,Cordelia Schmid,Thomas Brox
机构: University of Freiburg (弗莱堡大学); Bosch Center for Artificial Intelligence (博世人工智能中心); Inria, École Normale Supérieure, CNRS, PSL Research University (法国国家信息与自动化研究院、巴黎高等师范学院、法国国家科学研究中心、巴黎文理研究大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Compositional generalization, the ability to generate novel combinations of known concepts, is a key ingredient for visual generative models. Yet, not all mechanisms that enable or inhibit it are fully understood. In this work, we conduct a systematic study of how various design choices influence compositional generalization in image and video generation in a positive or negative way. Through controlled experiments, we identify two key factors: (i) whether the training objective operates on a discrete or continuous distribution, and (ii) to what extent conditioning provides information about the constituent concepts during training. Building on these insights, we show that relaxing the MaskGIT discrete loss with an auxiliary continuous JEPA-based objective can improve compositional performance in discrete models like MaskGIT.
zh
[CV-20] InsideOut: An EfficientNetV2-S Based Deep Learning Framework for Robust Multi-Class Facial Emotion Recognition
【速读】:该论文旨在解决面部情绪识别(Facial Emotion Recognition, FER)在实际应用中面临的挑战,包括遮挡、光照与姿态变化、类内差异细微以及数据集不平衡导致的少数情绪类别识别性能下降等问题。其解决方案的关键在于构建一个可复现的FER框架InsideOut,该框架基于EfficientNetV2-S架构,结合迁移学习、强数据增强策略和面向类别不平衡的优化机制;具体而言,通过标准化FER2013图像、分层采样与增强、以及使用类别权重损失函数微调轻量级分类头,有效缓解了数据分布偏斜问题,最终在FER2013数据集上达到62.8%的准确率和0.590的宏平均F1分数,验证了高效网络结构与定制化不平衡处理相结合在提升FER性能方面的有效性。
链接: https://arxiv.org/abs/2510.03066
作者: Ahsan Farabi,Israt Khandaker,Ibrahim Khalil Shanto,Md Abdul Ahad Minhaz,Tanisha Zaman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Facial Emotion Recognition (FER) is a key task in affective computing, enabling applications in human-computer interaction, e-learning, healthcare, and safety systems. Despite advances in deep learning, FER remains challenging due to occlusions, illumination and pose variations, subtle intra-class differences, and dataset imbalance that hinders recognition of minority emotions. We present InsideOut, a reproducible FER framework built on EfficientNetV2-S with transfer learning, strong data augmentation, and imbalance-aware optimization. The approach standardizes FER2013 images, applies stratified splitting and augmentation, and fine-tunes a lightweight classification head with class-weighted loss to address skewed distributions. InsideOut achieves 62.8% accuracy with a macro averaged F1 of 0.590 on FER2013, showing competitive results compared to conventional CNN baselines. The novelty lies in demonstrating that efficient architectures, combined with tailored imbalance handling, can provide practical, transparent, and reproducible FER solutions.
zh
[CV-21] When and Where do Events Switch in Multi-Event Video Generation? ICCV2025
【速读】:该论文旨在解决文本到视频(Text-to-video, T2V)生成中多事件场景下的事件切换控制问题,即明确在何种时间点和位置上,多事件提示(multi-event prompts)能够有效调控事件之间的过渡。其解决方案的关键在于提出了一套自标注的提示语集MEve,用于系统评估不同模型在多事件视频生成中的表现,并通过实验发现:在去噪过程的早期阶段及模型层的分块(block-wise)结构中进行干预,是实现连贯且可控的多事件视频生成的核心因素。
链接: https://arxiv.org/abs/2510.03049
作者: Ruotong Liao,Guowen Huang,Qing Cheng,Thomas Seidl,Daniel Cremers,Volker Tresp
机构: Ludwig-Maxilians-University of Munich (慕尼黑路德维希-马克西米利安大学); Technical University of Munich (慕尼黑工业大学); Munich Center for Machine Learning (慕尼黑机器学习中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Work in Progress. Accepted to ICCV2025 @ LongVid-Foundations
Abstract:Text-to-video (T2V) generation has surged in response to challenging questions, especially when a long video must depict multiple sequential events with temporal coherence and controllable content. Existing methods that extend to multi-event generation omit an inspection of the intrinsic factor in event shifting. The paper aims to answer the central question: When and where multi-event prompts control event transition during T2V generation. This work introduces MEve, a self-curated prompt suite for evaluating multi-event text-to-video (T2V) generation, and conducts a systematic study of two representative model families, i.e., OpenSora and CogVideoX. Extensive experiments demonstrate the importance of early intervention in denoising steps and block-wise model layers, revealing the essential factor for multi-event video generation and highlighting the possibilities for multi-event conditioning in future models.
zh
[CV-22] PocketSR: The Super-Resolution Expert in Your Pocket Mobiles
【速读】:该论文旨在解决真实世界图像超分辨率(RealSR)任务中现有基于生成式模型的方法因计算成本高、延迟大而难以在边缘设备上部署的问题。其解决方案的关键在于提出了一种超轻量级、单步处理的模型PocketSR,通过两项核心技术实现高效与高保真度的平衡:一是设计了LiteED模块,作为稳定扩散(Stable Diffusion, SD)中计算密集型变分自编码器(VAE)的高效替代方案,参数量减少97.5%的同时保持高质量编码与解码能力;二是引入在线退火剪枝(online annealing pruning)策略优化U-Net结构,逐步将生成先验从重型模块迁移至轻量级对应模块,并结合多层特征蒸馏损失缓解剪枝导致的先验知识丢失,从而显著提升推理效率并维持先进性能。
链接: https://arxiv.org/abs/2510.03012
作者: Haoze Sun,Linfeng Jiang,Fan Li,Renjing Pei,Zhixin Wang,Yong Guo,Jiaqi Xu,Haoyu Chen,Jin Han,Fenglong Song,Yujiu Yang,Wenbo Li
机构: Tsinghua University (清华大学); Joy Future Academy; HKUST (GZ) (香港科技大学(广州)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Real-world image super-resolution (RealSR) aims to enhance the visual quality of in-the-wild images, such as those captured by mobile phones. While existing methods leveraging large generative models demonstrate impressive results, the high computational cost and latency make them impractical for edge deployment. In this paper, we introduce PocketSR, an ultra-lightweight, single-step model that brings generative modeling capabilities to RealSR while maintaining high fidelity. To achieve this, we design LiteED, a highly efficient alternative to the original computationally intensive VAE in SD, reducing parameters by 97.5% while preserving high-quality encoding and decoding. Additionally, we propose online annealing pruning for the U-Net, which progressively shifts generative priors from heavy modules to lightweight counterparts, ensuring effective knowledge transfer and further optimizing efficiency. To mitigate the loss of prior knowledge during pruning, we incorporate a multi-layer feature distillation loss. Through an in-depth analysis of each design component, we provide valuable insights for future research. PocketSR, with a model size of 146M parameters, processes 4K images in just 0.8 seconds, achieving a remarkable speedup over previous methods. Notably, it delivers performance on par with state-of-the-art single-step and even multi-step RealSR models, making it a highly practical solution for edge-device applications.
zh
[CV-23] Not every day is a sunny day: Synthetic cloud injection for deep land cover segmentation robustness evaluation across data sources
【速读】:该论文旨在解决两个核心问题:一是现有遥感地物分类(Land Cover Semantic Segmentation, LCS)模型依赖的Sentinel-2光学影像多为云-free,难以适应热带地区频繁云覆盖的实际场景;二是深度神经网络编码器下采样过程中易丢失空间与光谱细节,影响分割精度。解决方案的关键在于:首先提出一种云注入算法以模拟真实云覆盖条件,从而评估雷达数据对光学数据缺失的补偿能力;其次设计一种轻量级策略,在解码层注入归一化差异指数(Normalized Difference Indices, NDIs),在不显著增加计算负担的前提下保留关键空间特征。实验表明,该方法在云-free条件下提升了U-Net和DeepLabV3的性能(分别提高1.99%和2.78%),而在云覆盖场景中引入Sentinel-1雷达数据可显著改善所有模型表现,验证了雷达-光学融合的有效性。
链接: https://arxiv.org/abs/2510.03006
作者: Sara Mobsite,Renaud Hostache,Laure Berti Equille,Emmanuel Roux,Joris Guerin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Supervised deep learning for land cover semantic segmentation (LCS) relies on labeled satellite data. However, most existing Sentinel-2 datasets are cloud-free, which limits their usefulness in tropical regions where clouds are common. To properly evaluate the extent of this problem, we developed a cloud injection algorithm that simulates realistic cloud cover, allowing us to test how Sentinel-1 radar data can fill in the gaps caused by cloud-obstructed optical imagery. We also tackle the issue of losing spatial and/or spectral details during encoder downsampling in deep networks. To mitigate this loss, we propose a lightweight method that injects Normalized Difference Indices (NDIs) into the final decoding layers, enabling the model to retain key spatial features with minimal additional computation. Injecting NDIs enhanced land cover segmentation performance on the DFC2020 dataset, yielding improvements of 1.99% for U-Net and 2.78% for DeepLabV3 on cloud-free imagery. Under cloud-covered conditions, incorporating Sentinel-1 data led to significant performance gains across all models compared to using optical data alone, highlighting the effectiveness of radar-optical fusion in challenging atmospheric scenarios.
zh
[CV-24] owards Scalable and Consistent 3D Editing
【速读】:该论文旨在解决3D编辑(3D editing)任务中面临的挑战,包括跨视角一致性、结构保真度以及细粒度控制等问题,这些问题在沉浸式内容创作、数字娱乐和AR/VR等领域尤为关键。现有方法普遍存在效率低、几何失真或依赖人工标注且易错的3D掩码(3D masks)等缺陷。其解决方案的核心在于两方面:一是构建了目前规模最大、质量最高的配对数据集3DEditVerse(包含116,309个训练样本和1,500个测试样本),通过姿态驱动的几何编辑与基础模型引导的外观编辑相结合,确保编辑局部性、多视角一致性和语义对齐;二是提出3DEditFormer,一种结构保持型条件Transformer模型,利用双引导注意力机制(dual-guidance attention)和时间自适应门控(time-adaptive gating)实现可编辑区域与保留结构的解耦,从而在无需辅助3D掩码的情况下完成精确且一致的编辑,显著优于当前最优基线方法。
链接: https://arxiv.org/abs/2510.02994
作者: Ruihao Xia,Yang Tang,Pan Zhou
机构: East China University of Science and Technology (华东理工大学); Singapore Management University (新加坡管理大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D editing - the task of locally modifying the geometry or appearance of a 3D asset - has wide applications in immersive content creation, digital entertainment, and AR/VR. However, unlike 2D editing, it remains challenging due to the need for cross-view consistency, structural fidelity, and fine-grained controllability. Existing approaches are often slow, prone to geometric distortions, or dependent on manual and accurate 3D masks that are error-prone and impractical. To address these challenges, we advance both the data and model fronts. On the data side, we introduce 3DEditVerse, the largest paired 3D editing benchmark to date, comprising 116,309 high-quality training pairs and 1,500 curated test pairs. Built through complementary pipelines of pose-driven geometric edits and foundation model-guided appearance edits, 3DEditVerse ensures edit locality, multi-view consistency, and semantic alignment. On the model side, we propose 3DEditFormer, a 3D-structure-preserving conditional transformer. By enhancing image-to-3D generation with dual-guidance attention and time-adaptive gating, 3DEditFormer disentangles editable regions from preserved structure, enabling precise and consistent edits without requiring auxiliary 3D masks. Extensive experiments demonstrate that our framework outperforms state-of-the-art baselines both quantitatively and qualitatively, establishing a new standard for practical and scalable 3D editing. Dataset and code will be released. Project: this https URL
zh
[CV-25] IT-Score: Evaluating Long-Prompt Based Text-to-Image Alignment via Text-to-Image-to-Text Consistency
【速读】:该论文旨在解决当前文本到图像(Text-to-Image, T2I)生成模型在处理长且详细的提示词时存在的理解与执行不一致问题,即模型难以有效对齐长提示词的语义内容并稳定生成符合要求的图像。解决方案的关键在于提出一个全新的评估框架,包括两个核心组件:一是LPG-Bench基准测试集,包含200个平均长度超过250词的精心设计提示词,用于系统性评估T2I模型在长提示下的表现;二是基于文本到图像再到文本一致性(Text-to-Image-to-Text, TIT)的零样本评估指标,通过直接比较原始提示词与由大语言模型(Large Language Model, LLM)对生成图像描述的一致性来量化T2I对齐度,其中TIT-Score-LLM版本在配对准确性上相较最强基线提升7.31%,显著优于CLIP-score、LMM-score等传统指标。
链接: https://arxiv.org/abs/2510.02987
作者: Juntong Wang,Huiyu Duan,Jiarui Wang,Ziheng Jia,Guangtao Zhai,Xiongkuo Min
机构: Institute of Image Communication and Network Engineering (图像通信与网络工程研究所); MoE Key Lab of Artificial Intelligence (教育部人工智能重点实验室); AI Institute (人工智能研究院); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:With the rapid advancement of large multimodal models (LMMs), recent text-to-image (T2I) models can generate high-quality images and demonstrate great alignment to short prompts. However, they still struggle to effectively understand and follow long and detailed prompts, displaying inconsistent generation. To address this challenge, we introduce LPG-Bench, a comprehensive benchmark for evaluating long-prompt-based text-to-image generation. LPG-Bench features 200 meticulously crafted prompts with an average length of over 250 words, approaching the input capacity of several leading commercial models. Using these prompts, we generate 2,600 images from 13 state-of-the-art models and further perform comprehensive human-ranked annotations. Based on LPG-Bench, we observe that state-of-the-art T2I alignment evaluation metrics exhibit poor consistency with human preferences on long-prompt-based image generation. To address the gap, we introduce a novel zero-shot metric based on text-to-image-to-text consistency, termed TIT, for evaluating long-prompt-generated images. The core concept of TIT is to quantify T2I alignment by directly comparing the consistency between the raw prompt and the LMM-produced description on the generated image, which includes an efficient score-based instantiation TIT-Score and a large-language-model (LLM) based instantiation TIT-Score-LLM. Extensive experiments demonstrate that our framework achieves superior alignment with human judgment compared to CLIP-score, LMM-score, etc., with TIT-Score-LLM attaining a 7.31% absolute improvement in pairwise accuracy over the strongest baseline. LPG-Bench and TIT methods together offer a deeper perspective to benchmark and foster the development of T2I models. All resources will be made publicly available.
zh
[CV-26] Flip Distribution Alignment VAE for Multi-Phase MRI Synthesis MICCAI2025
【速读】:该论文旨在解决多期增强磁共振成像(multi-phase contrast-enhanced MRI, CE-MRI)合成中共享特征与独立特征分离不充分的问题,现有方法通常采用参数效率低的深度自编码器生成器,且缺乏可解释的训练策略。其解决方案的关键在于提出一种轻量级的特征解耦变分自编码器模型——Flip Distribution Alignment Variational Autoencoder (FDA-VAE),通过将输入图像和目标图像编码为关于标准正态分布对称的两个潜在分布,实现共享特征与独立特征的有效分离;同时引入Y型双向训练策略,进一步提升特征分离的可解释性,从而在显著降低模型参数量和推理时间的同时,提升合成图像质量。
链接: https://arxiv.org/abs/2510.02970
作者: Xiaoyan Kui,Qianmu Xiao,Qqinsong Li,Zexin Ji,JIelin Zhang,Beiji Zou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been early accept by MICCAI 2025
Abstract:Separating shared and independent features is crucial for multi-phase contrast-enhanced (CE) MRI synthesis. However, existing methods use deep autoencoder generators with low parameter efficiency and lack interpretable training strategies. In this paper, we propose Flip Distribution Alignment Variational Autoencoder (FDA-VAE), a lightweight feature-decoupled VAE model for multi-phase CE MRI synthesis. Our method encodes input and target images into two latent distributions that are symmetric concerning a standard normal distribution, effectively separating shared and independent features. The Y-shaped bidirectional training strategy further enhances the interpretability of feature separation. Experimental results show that compared to existing deep autoencoder-based end-to-end synthesis methods, FDA-VAE significantly reduces model parameters and inference time while effectively improving synthesis quality. The source code is publicly available at this https URL.
zh
[CV-27] Confidence and Dispersity as Signals: Unsupervised Model Evaluation and Ranking ICML’23
【速读】:该论文旨在解决在无标签测试数据条件下评估模型泛化能力的问题,尤其针对实际部署场景中缺乏标注测试数据的挑战。其解决方案的关键在于利用模型预测的两个内在属性——置信度(confidence)和分散度(dispersity),二者分别反映预测的确定性和类别分布的多样性,共同提供强而互补的泛化信号。通过系统性地对比基于置信度、分散度及混合指标的多种评估方法,研究发现混合指标(特别是预测矩阵的核范数)在不同模型架构、数据集和分布偏移类型下均表现出最优性能,能够实现对固定模型在多个无标签测试集上的准确估计(数据集中心评估)以及对候选模型在单个无标签测试集上的可靠排序(模型中心评估),从而为实际部署中的无监督模型评估提供了通用且可推广的方法基础。
链接: https://arxiv.org/abs/2510.02956
作者: Weijian Deng,Weijie Tu,Ibrahim Radwan,Mohammad Abu Alsheikh,Stephen Gould,Liang Zheng
机构: The Australian National University (澳大利亚国立大学); University of Canberra (坎特伯雷大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 11 figures, extension of ICML’23 work: Confidence and Dispersity Speak: Characterizing Prediction Matrix for Unsupervised Accuracy Estimation
Abstract:Assessing model generalization under distribution shift is essential for real-world deployment, particularly when labeled test data is unavailable. This paper presents a unified and practical framework for unsupervised model evaluation and ranking in two common deployment settings: (1) estimating the accuracy of a fixed model on multiple unlabeled test sets (dataset-centric evaluation), and (2) ranking a set of candidate models on a single unlabeled test set (model-centric evaluation). We demonstrate that two intrinsic properties of model predictions, namely confidence (which reflects prediction certainty) and dispersity (which captures the diversity of predicted classes), together provide strong and complementary signals for generalization. We systematically benchmark a set of confidence-based, dispersity-based, and hybrid metrics across a wide range of model architectures, datasets, and distribution shift types. Our results show that hybrid metrics consistently outperform single-aspect metrics on both dataset-centric and model-centric evaluation settings. In particular, the nuclear norm of the prediction matrix provides robust and accurate performance across tasks, including real-world datasets, and maintains reliability under moderate class imbalance. These findings offer a practical and generalizable basis for unsupervised model assessment in deployment scenarios.
zh
[CV-28] Multimodal Carotid Risk Stratification with Large Vision-Language Models: Benchmarking Fine-Tuning and Clinical Insights
【速读】:该论文旨在解决颈动脉粥样硬化疾病中可靠风险评估的临床挑战,即如何在透明且可解释的前提下整合多样化的临床与影像学信息。其核心解决方案在于利用先进的多模态视觉-语言模型(Vision-Language Models, VLMs)对超声成像(Ultrasound Imaging, USI)与结构化临床、人口统计、实验室及蛋白生物标志物数据进行融合分析,并通过模拟真实诊断场景的问答序列来评估模型性能。关键创新点在于采用低秩适配(Low-Rank Adaptation, LoRA)技术对LLaVa-NeXT-Vicuna模型进行领域适应,显著提升了卒中风险分层准确性;同时,将表格形式的多模态数据以文本形式注入模型进一步增强了特异性和平衡准确率,实现了优于传统卷积神经网络(Convolutional Neural Networks, CNNs)基线的性能表现。
链接: https://arxiv.org/abs/2510.02922
作者: Daphne Tsolissou,Theofanis Ganitidis,Konstantinos Mitsis,Stergios CHristodoulidis,Maria Vakalopoulou,Konstantina Nikita
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Reliable risk assessment for carotid atheromatous disease remains a major clinical challenge, as it requires integrating diverse clinical and imaging information in a manner that is transparent and interpretable to clinicians. This study investigates the potential of state-of-the-art and recent large vision-language models (LVLMs) for multimodal carotid plaque assessment by integrating ultrasound imaging (USI) with structured clinical, demographic, laboratory, and protein biomarker data. A framework that simulates realistic diagnostic scenarios through interview-style question sequences is proposed, comparing a range of open-source LVLMs, including both general-purpose and medically tuned models. Zero-shot experiments reveal that even if they are very powerful, not all LVLMs can accurately identify imaging modality and anatomy, while all of them perform poorly in accurate risk classification. To address this limitation, LLaVa-NeXT-Vicuna is adapted to the ultrasound domain using low-rank adaptation (LoRA), resulting in substantial improvements in stroke risk stratification. The integration of multimodal tabular data in the form of text further enhances specificity and balanced accuracy, yielding competitive performance compared to prior convolutional neural network (CNN) baselines trained on the same dataset. Our findings highlight both the promise and limitations of LVLMs in ultrasound-based cardiovascular risk prediction, underscoring the importance of multimodal integration, model calibration, and domain adaptation for clinical translation.
zh
[CV-29] Zero-Shot Robustness of Vision Language Models Via Confidence-Aware Weighting NEURIPS2025
【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在零样本(zero-shot)场景下对对抗攻击高度敏感的问题,即尽管CLIP等模型具备出色的零样本泛化能力,但其鲁棒性较差。解决方案的关键在于提出一种名为“置信度感知加权”(Confidence-Aware Weighting, CAW)的训练机制,其核心包含两个组件:一是置信度感知损失(Confidence-Aware Loss),通过缩放干净样本与对抗样本预测之间的KL散度来优先优化不确定性高的对抗样本;二是特征对齐正则化(feature alignment regularization),通过最小化对抗输入下冻结和微调图像编码器特征间的距离,保持语义一致性。这两个组件协同作用,在不损害模型泛化性能的前提下显著提升模型在强对抗攻击(如AutoAttack)下的鲁棒性和准确率。
链接: https://arxiv.org/abs/2510.02913
作者: Nikoo Naghavian,Mostafa Tavassolipour
机构: University of Tehran (德黑兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the NeurIPS 2025 Workshop on Reliable ML from Unreliable Data
Abstract:Vision-language models like CLIP demonstrate impressive zero-shot generalization but remain highly vulnerable to adversarial attacks. In this work, we propose Confidence-Aware Weighting (CAW) to enhance zero-shot robustness in vision-language models. CAW consists of two components: (1) a Confidence-Aware loss that prioritizes uncertain adversarial examples by scaling the KL divergence between clean and adversarial predictions, and (2) a feature alignment regularization that preserves semantic consistency by minimizing the distance between frozen and fine-tuned image encoder features on adversarial inputs. These components work jointly to improve both clean and robust accuracy without sacrificing generalization. Extensive experiments on TinyImageNet and 14 additional datasets show that CAW outperforms recent methods such as PMG-AFT and TGA-ZSR under strong attacks like AutoAttack, while using less memory.
zh
[CV-30] Dont Just Chase “Highlighted Tokens” in MLLM s: Revisiting Visual Holistic Context Retention NEURIPS2025
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在推理过程中因依赖大量视觉标记(visual tokens)而导致的显著计算开销问题。现有方法通常采用基于注意力机制的剪枝策略,如文本-视觉交叉注意力或[\textttCLS]注意力来评估并移除冗余视觉标记,但这类“注意力优先”的剪枝方式倾向于保留语义相似的标记,导致在高剪枝比例下性能显著下降。本文提出了一种名为HoloV的简单而高效的视觉标记剪枝框架,其核心创新在于从全局视角重新思考标记保留策略:通过自适应地将剪枝预算分配到不同空间区域,确保保留的标记能够捕捉整体视觉上下文而非孤立的显著特征,从而最小化表征坍缩(representational collapse),维持任务相关的信息完整性。实验表明,HoloV在多种任务、MLLM架构和剪枝比例下均优于当前最优方法,例如在剪掉88.9%的视觉标记后,LLaVA1.5仍能保持95.8%的原始性能,实现了更优的效率-精度权衡。
链接: https://arxiv.org/abs/2510.02912
作者: Xin Zou,Di Lu,Yizhou Wang,Yibo Yan,Yuanhuiyi Lyu,Xu Zheng,Linfeng Zhang,Xuming Hu
机构: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; INSAIT, Sofia University “St. Kliment Ohridski”; Shanghai Jiao Tong University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2025 main
Abstract:Despite their powerful capabilities, Multimodal Large Language Models (MLLMs) suffer from considerable computational overhead due to their reliance on massive visual tokens. Recent studies have explored token pruning to alleviate this problem, which typically uses text-vision cross-attention or [\textttCLS] attention to assess and discard redundant visual tokens. In this work, we identify a critical limitation of such attention-first pruning approaches, i.e., they tend to preserve semantically similar tokens, resulting in pronounced performance drops under high pruning ratios. To this end, we propose HoloV, a simple yet effective, plug-and-play visual token pruning framework for efficient inference. Distinct from previous attention-first schemes, HoloV rethinks token retention from a holistic perspective. By adaptively distributing the pruning budget across different spatial crops, HoloV ensures that the retained tokens capture the global visual context rather than isolated salient features. This strategy minimizes representational collapse and maintains task-relevant information even under aggressive pruning. Experimental results demonstrate that our HoloV achieves superior performance across various tasks, MLLM architectures, and pruning ratios compared to SOTA methods. For instance, LLaVA1.5 equipped with HoloV preserves 95.8% of the original performance after pruning 88.9% of visual tokens, achieving superior efficiency-accuracy trade-offs.
zh
[CV-31] raining-Free Out-Of-Distribution Segmentation With Foundation Models
【速读】:该论文旨在解决在语义分割任务中检测未知对象(out-of-distribution, OoD)区域的问题,这对于自动驾驶等安全关键应用至关重要。现有大型视觉基础模型(如InternImage、DINOv2和CLIP)虽在闭集语义分割任务中表现优异,但其对OoD区域的识别能力尚未充分探索。解决方案的关键在于提出一种无需训练的简单方法:利用InternImage骨干网络提取特征,并结合K-Means聚类与原始解码器logits的置信度阈值策略,自动区分分布内(in-distribution, ID)与分布外(OoD)区域。该方法在RoadAnomaly和ADE-OoD两个基准上分别取得50.02和48.77的平均精度(Average Precision),优于多个监督与无监督基线,验证了基础模型在无需额外标注或假设条件下具备潜在的通用OoD分割能力。
链接: https://arxiv.org/abs/2510.02909
作者: Laith Nayal,Hadi Salloum,Ahmad Taha,Yaroslav Kholodov,Alexander Gasnikov
机构: Innopolis University (因诺波利斯大学); Moscow Institute of Physics and Technology (莫斯科物理技术研究所); Q Deep (Q Deep); Machine Learning and Data Representation Lab, Innopolis University (机器学习与数据表示实验室,因诺波利斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 5 figures, 2 tables, ICOMP 2025
Abstract:Detecting unknown objects in semantic segmentation is crucial for safety-critical applications such as autonomous driving. Large vision foundation models, includ- ing DINOv2, InternImage, and CLIP, have advanced visual representation learn- ing by providing rich features that generalize well across diverse tasks. While their strength in closed-set semantic tasks is established, their capability to detect out- of-distribution (OoD) regions in semantic segmentation remains underexplored. In this work, we investigate whether foundation models fine-tuned on segmen- tation datasets can inherently distinguish in-distribution (ID) from OoD regions without any outlier supervision. We propose a simple, training-free approach that utilizes features from the InternImage backbone and applies K-Means clustering alongside confidence thresholding on raw decoder logits to identify OoD clusters. Our method achieves 50.02 Average Precision on the RoadAnomaly benchmark and 48.77 on the benchmark of ADE-OoD with InternImage-L, surpassing several supervised and unsupervised baselines. These results suggest a promising direc- tion for generic OoD segmentation methods that require minimal assumptions or additional data.
zh
[CV-32] One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework
【速读】:该论文旨在解决传统零样本图像描述(zero-shot captioning)模型仅能生成全局图像描述而无法对任意区域进行细粒度描述的问题,尤其在缺乏区域级标注数据的情况下。其解决方案的核心在于提出一种从图像中心(image-centric)转向补丁中心(patch-centric)的统一框架,将图像划分为原子级补丁单元,并通过聚合这些补丁的语义表示来生成任意区域(包括单个补丁、非连续区域乃至整图)的描述,从而实现无需区域级监督的可扩展描述生成。实验表明,使用如DINO等能够生成有意义且密集视觉特征的主干网络是取得最优性能的关键因素。
链接: https://arxiv.org/abs/2510.02898
作者: Lorenzo Bianchi,Giacomo Pacini,Fabio Carrara,Nicola Messina,Giuseppe Amato,Fabrizio Falchi
机构: CNR-ISTI (National Research Council - Institute of Information Science and Technologies); Università di Pisa (比萨大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Zero-shot captioners are recently proposed models that utilize common-space vision-language representations to caption images without relying on paired image-text data. To caption an image, they proceed by textually decoding a text-aligned image feature, but they limit their scope to global representations and whole-image captions. We present \frameworkName, a unified framework for zero-shot captioning that shifts from an image-centric to a patch-centric paradigm, enabling the captioning of arbitrary regions without the need of region-level supervision. Instead of relying on global image representations, we treat individual patches as atomic captioning units and aggregate them to describe arbitrary regions, from single patches to non-contiguous areas and entire images. We analyze the key ingredients that enable current latent captioners to work in our novel proposed framework. Experiments demonstrate that backbones producing meaningful, dense visual features, such as DINO, are key to achieving state-of-the-art performance in multiple region-based captioning tasks. Compared to other baselines and state-of-the-art competitors, our models achieve better performance on zero-shot dense, region-set, and a newly introduced trace captioning task, highlighting the effectiveness of patch-wise semantic representations for scalable caption generation. Project page at this https URL .
zh
[CV-33] PyRadiomics-cuda: a GPU-accelerated 3D features extraction from medical images within PyRadiomics
【速读】:该论文旨在解决医学影像中三维形状特征提取的计算效率问题,尤其是在处理大规模体数据时面临的高时间成本。其解决方案的关键在于开发了一个基于GPU加速的PyRadiomics扩展工具PyRadiomics-cuda,通过将关键几何运算迁移至GPU硬件执行,显著缩短了特征提取时间;同时保持与原PyRadiomics API完全兼容,实现了无需修改代码即可无缝集成到现有AI工作流中的透明加速效果,从而支持高通量、可扩展的放射组学分析。
链接: https://arxiv.org/abs/2510.02894
作者: Jakub Lisowski,Piotr Tyrakowski,Szymon Zyguła,Krzysztof Kaczmarski
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:PyRadiomics-cuda is a GPU-accelerated extension of the PyRadiomics library, designed to address the computational challenges of extracting three-dimensional shape features from medical images. By offloading key geometric computations to GPU hardware it dramatically reduces processing times for large volumetric datasets. The system maintains full compatibility with the original PyRadiomics API, enabling seamless integration into existing AI workflows without code modifications. This transparent acceleration facilitates efficient, scalable radiomics analysis, supporting rapid feature extraction essential for high-throughput AI pipeline. Tests performed on a typical computational cluster, budget and home devices prove usefulness in all scenarios. PyRadiomics-cuda is implemented in Python and C/CUDA and is freely available under the BSD license at this https URL Additionally PyRadiomics-cuda test suite is available at this https URL. It provides detailed handbook and sample scripts suited for different kinds of workflows plus detailed installation instructions. The dataset used for testing is available at Kaggle this https URL
zh
[CV-34] ELMF4EggQ: Ensemble Learning with Multimodal Feature Fusion for Non-Destructive Egg Quality Assessment
【速读】:该论文旨在解决如何通过非破坏性方式准确评估鸡蛋品质的问题,以提升家禽养殖业中的食品安全保障、产品标准一致性及生产效率。其关键解决方案是提出了一种基于多模态特征融合的集成学习框架(ELMF4EggQ),该框架仅利用鸡蛋外部属性(图像、形状和重量)进行分类预测,首次实现了仅凭非侵入性特征对鸡蛋等级与新鲜度的机器学习建模。核心创新包括:构建了首个公开可用的186枚褐壳蛋标注数据集,融合深度卷积神经网络(CNN)提取的图像特征与结构特征(如蛋形和重量),并通过主成分分析(PCA)降维、SMOTE过采样以及多种机器学习算法的集成投票机制,显著提升了分类准确性(等级分类达86.57%,新鲜度预测达70.83%)。
链接: https://arxiv.org/abs/2510.02876
作者: Md Zahim Hassan,Md. Osama,Muhammad Ashad Kabir,Md. Saiful Islam,Zannatul Naim
机构: BAUST (Bangladesh Agricultural University Science and Technology); CSU (Charles Sturt University); SAU (Sher-e-Bangla Agricultural University)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 30 pages
Abstract:Accurate, non-destructive assessment of egg quality is critical for ensuring food safety, maintaining product standards, and operational efficiency in commercial poultry production. This paper introduces ELMF4EggQ, an ensemble learning framework that employs multimodal feature fusion to classify egg grade and freshness using only external attributes - image, shape, and weight. A novel, publicly available dataset of 186 brown-shelled eggs was constructed, with egg grade and freshness levels determined through laboratory-based expert assessments involving internal quality measurements, such as yolk index and Haugh unit. To the best of our knowledge, this is the first study to apply machine learning methods for internal egg quality assessment using only external, non-invasive features, and the first to release a corresponding labeled dataset. The proposed framework integrates deep features extracted from external egg images with structural characteristics such as egg shape and weight, enabling a comprehensive representation of each egg. Image feature extraction is performed using top-performing pre-trained CNN models (ResNet152, DenseNet169, and ResNet152V2), followed by PCA-based dimensionality reduction, SMOTE augmentation, and classification using multiple machine learning algorithms. An ensemble voting mechanism combines predictions from the best-performing classifiers to enhance overall accuracy. Experimental results demonstrate that the multimodal approach significantly outperforms image-only and tabular (shape and weight) only baselines, with the multimodal ensemble approach achieving 86.57% accuracy in grade classification and 70.83% in freshness prediction. All code and data are publicly available at this https URL, promoting transparency, reproducibility, and further research in this domain.
zh
[CV-35] Representing Beauty: Towards a Participatory but Objective Latent Aesthetics
【速读】:该论文试图解决的问题是:如何理解机器在何种意义上能够识别美,尤其是在面对形式多样性极高的对象时,神经网络是否具备建模审美判断的能力。其解决方案的关键在于通过跨模型表征收敛(cross-model representational convergence)的研究发现,美丽图像在不同训练数据和模态的深度学习模型中会产生更相似且对齐的表征,而无美感的图像则不具备这一特性;这表明美的形式结构具有现实基础(realist basis),而非仅源于社会建构的价值观,且这种现实性源于物理与文化物质的共同锚定,从而揭示了人类感知与创造行为在塑造深度学习潜在空间中的核心作用,并论证了机器不仅能模仿创作,还能基于规模优势产生新颖的创造性洞见,推动人机协同共创成为文化生产与机器感知中的基础机制。
链接: https://arxiv.org/abs/2510.02869
作者: Alexander Michael Rusnak
机构: École Polytechnique Fédérale de Lausanne (洛桑联邦理工学院)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:What does it mean for a machine to recognize beauty? While beauty remains a culturally and experientially compelling but philosophically elusive concept, deep learning systems increasingly appear capable of modeling aesthetic judgment. In this paper, we explore the capacity of neural networks to represent beauty despite the immense formal diversity of objects for which the term applies. By drawing on recent work on cross-model representational convergence, we show how aesthetic content produces more similar and aligned representations between models which have been trained on distinct data and modalities - while unaesthetic images do not produce more aligned representations. This finding implies that the formal structure of beautiful images has a realist basis - rather than only as a reflection of socially constructed values. Furthermore, we propose that these realist representations exist because of a joint grounding of aesthetic form in physical and cultural substance. We argue that human perceptual and creative acts play a central role in shaping these the latent spaces of deep learning systems, but that a realist basis for aesthetics shows that machines are not mere creative parrots but can produce novel creative insights from the unique vantage point of scale. Our findings suggest that human-machine co-creation is not merely possible, but foundational - with beauty serving as a teleological attractor in both cultural production and machine perception.
zh
[CV-36] Med-K2N: Flexible K-to-N Modality Translation for Medical Image Synthesis ICLR2026
【速读】:该论文旨在解决跨模态医学图像合成中的三大关键挑战:如何建模不同模态对多种目标任务的异质贡献、如何确保融合质量以防止噪声信息导致性能下降,以及如何在多输出生成中保持模态身份一致性。其解决方案的核心在于将多模态医学数据视为具有质量驱动选择机制的序列帧,并设计三个协同模块——PreWeightNet用于全局贡献评估、ThresholdNet实现自适应过滤、EffiWeightNet计算有效权重;同时引入因果模态身份模块(Causal Modality Identity Module, CMIM),利用视觉-语言建模建立生成图像与目标模态描述之间的因果约束,从而保障模态身份一致性。该方法命名为Med-K2N,在多个基准上显著优于现有最先进方法。
链接: https://arxiv.org/abs/2510.02815
作者: Feng Yuan,Yifan Gao,Yuehua Ye,Haoyue Li,Xin Gao
机构: University of Science and Technology of China (中国科学技术大学); Suzhou Institute of Biomedical Engineering and Technology, Chinese Academy of Sciences (中国科学院苏州生物医学工程技术研究所); The Third Affiliated Hospital of Sun Yat-sen University (中山大学附属第三医院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR2026 under review
Abstract:Cross-modal medical image synthesis research focuses on reconstructing missing imaging modalities from available ones to support clinical diagnosis. Driven by clinical necessities for flexible modality reconstruction, we explore K to N medical generation, where three critical challenges emerge: How can we model the heterogeneous contributions of different modalities to various target tasks? How can we ensure fusion quality control to prevent degradation from noisy information? How can we maintain modality identity consistency in multi-output generation? Driven by these clinical necessities, and drawing inspiration from SAM2’s sequential frame paradigm and clinicians’ progressive workflow of incrementally adding and selectively integrating multi-modal information, we treat multi-modal medical data as sequential frames with quality-driven selection mechanisms. Our key idea is to “learn” adaptive weights for each modality-task pair and “memorize” beneficial fusion patterns through progressive enhancement. To achieve this, we design three collaborative modules: PreWeightNet for global contribution assessment, ThresholdNet for adaptive filtering, and EffiWeightNet for effective weight computation. Meanwhile, to maintain modality identity consistency, we propose the Causal Modality Identity Module (CMIM) that establishes causal constraints between generated images and target modality descriptions using vision-language modeling. Extensive experimental results demonstrate that our proposed Med-K2N outperforms state-of-the-art methods by significant margins on multiple benchmarks. Source code is available.
zh
[CV-37] Work Zones challenge VLM Trajectory Planning : Toward Mitigation and Robust Autonomous Driving
【速读】:该论文旨在解决视觉语言模型(Visual Language Models, VLMs)在施工区域(work zones)轨迹规划中的失效问题,这类场景通常包含不规则布局、临时交通管制和动态几何结构,导致主流VLMs在68.0%的案例中无法生成正确轨迹。解决方案的关键在于提出REACT-Drive框架,该框架通过检索增强生成(Retrieval-Augmented Generation, RAG)机制将历史失败案例转化为约束规则与可执行代码,并在新场景中检索相似模式以指导轨迹生成,从而显著提升规划准确性(平均位移误差降低约3倍)和推理效率(仅需0.58秒),同时在真实世界15个施工区域场景中验证了其有效性。
链接: https://arxiv.org/abs/2510.02803
作者: Yifan Liao,Zhen Sun,Xiaoyun Qiu,Zixiao Zhao,Wenbing Tang,Xinlei He,Xinhu Zheng,Tianwei Zhang,Xinyi Huang,Xingshuo Han
机构: Hong Kong University of Science and Technology (Guangzhou); Nanjing University of Aeronautics and Astronautics; Northwest A&F University; Nanyang Technological University; Jinan University
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages,5 figures
Abstract:Visual Language Models (VLMs), with powerful multimodal reasoning capabilities, are gradually integrated into autonomous driving by several automobile manufacturers to enhance planning capability in challenging environments. However, the trajectory planning capability of VLMs in work zones, which often include irregular layouts, temporary traffic control, and dynamically changing geometric structures, is still unexplored. To bridge this gap, we conduct the \textitfirst systematic study of VLMs for work zone trajectory planning, revealing that mainstream VLMs fail to generate correct trajectories in 68.0% of cases. To better understand these failures, we first identify candidate patterns via subgraph mining and clustering analysis, and then confirm the validity of 8 common failure patterns through human verification. Building on these findings, we propose REACT-Drive, a trajectory planning framework that integrates VLMs with Retrieval-Augmented Generation (RAG). Specifically, REACT-Drive leverages VLMs to convert prior failure cases into constraint rules and executable trajectory planning code, while RAG retrieves similar patterns in new scenarios to guide trajectory generation. Experimental results on the ROADWork dataset show that REACT-Drive yields a reduction of around 3\times in average displacement error relative to VLM baselines under evaluation with Qwen2.5-VL. In addition, REACT-Drive yields the lowest inference time ( 0.58 s) compared with other methods such as fine-tuning ( 17.90 s). We further conduct experiments using a real vehicle in 15 work zone scenarios in the physical world, demonstrating the strong practicality of REACT-Drive.
zh
[CV-38] VERNIER: an open-source software pushing marker pose estimation down to the micrometer and nanometer scales
【速读】:该论文旨在解决微纳尺度下物体六自由度(6 degrees of freedom)位姿估计的难题,特别是在厘米级测量范围与纳米级分辨率(以及微弧度级角度分辨率)之间实现高精度、高鲁棒性的位姿测量。其解决方案的关键在于提出了一种名为VERNIER的开源相位处理软件,该软件基于伪周期性图案(pseudo-periodic patterns),采用一种基于相位的局部阈值算法,能够在噪声、离焦和遮挡等复杂条件下仍保持高可靠性。通过相位处理技术实现纳米级精度,同时结合编码模式设计支持厘米级测量范围,从而兼顾精度与适用范围,为显微成像中的位姿测量提供了系统性方法。
链接: https://arxiv.org/abs/2510.02791
作者: Patrick Sandoz,Antoine N. André,Guillaume J. Laurent
机构: Université Marie et Louis Pasteur, SupMicroTech, CNRS, Institut FEMTO-ST; National Institute of Advanced Industrial Science and Technology, CNRS-AIST JRL, IRL
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Pose estimation is still a challenge at the small scales. Few solutions exist to capture the 6 degrees of freedom of an object with nanometric and microradians resolutions over relatively large ranges. Over the years, we have proposed several fiducial marker and pattern designs to achieve reliable performance for various microscopy applications. Centimeter ranges are possible using pattern encoding methods, while nanometer resolutions can be achieved using phase processing of the periodic frames. This paper presents VERNIER, an open source phase processing software designed to provide fast and reliable pose measurement based on pseudo-periodic patterns. Thanks to a phase-based local thresholding algorithm, the software has proven to be particularly robust to noise, defocus and occlusion. The successive steps of the phase processing are presented, as well as the different types of patterns that address different application needs. The implementation procedure is illustrated with synthetic and experimental images. Finally, guidelines are given for selecting the appropriate pattern design and microscope magnification lenses as a function of the desired performance.
zh
[CV-39] Align Your Query: Representation Alignment for Multimodality Medical Object Detection
【速读】:该论文旨在解决多模态医学图像目标检测中因不同成像模态(如胸部X光片CXR、CT、MRI)之间统计特性差异大和表示空间不一致导致的性能下降问题。其核心解决方案是通过表示对齐(representation alignment)策略,具体聚焦于DETR类模型中的对象查询(object queries)表示,并提出一种无需修改检测器架构的轻量级框架:首先引入模态标记(modality tokens),即由文本生成的紧凑嵌入,编码成像模态信息;然后设计多模态上下文注意力机制(Multimodality Context Attention, MoCA),利用自注意力机制将模态标记注入对象查询中,实现模态上下文传播;此外,进一步提出QueryREPA预训练阶段,基于模态平衡批次与任务特定对比损失对齐查询表示与模态标记。该方法在保持DETR结构不变的前提下显著提升跨模态迁移能力,实现高效且鲁棒的多模态医学目标检测。
链接: https://arxiv.org/abs/2510.02789
作者: Ara Seo,Bryan Sangwoo Kim,Hyungjin Chung,Jong Chul Ye
机构: KAIST AI (韩国科学技术院人工智能); EverEx
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project page: this https URL
Abstract:Medical object detection suffers when a single detector is trained on mixed medical modalities (e.g., CXR, CT, MRI) due to heterogeneous statistics and disjoint representation spaces. To address this challenge, we turn to representation alignment, an approach that has proven effective for bringing features from different sources into a shared space. Specifically, we target the representations of DETR-style object queries and propose a simple, detector-agnostic framework to align them with modality context. First, we define modality tokens: compact, text-derived embeddings encoding imaging modality that are lightweight and require no extra annotations. We integrate the modality tokens into the detection process via Multimodality Context Attention (MoCA), mixing object-query representations via self-attention to propagate modality context within the query set. This preserves DETR-style architectures and adds negligible latency while injecting modality cues into object queries. We further introduce QueryREPA, a short pretraining stage that aligns query representations to their modality tokens using a task-specific contrastive objective with modality-balanced batches. Together, MoCA and QueryREPA produce modality-aware, class-faithful queries that transfer effectively to downstream training. Across diverse modalities trained altogether, the proposed approach consistently improves AP with minimal overhead and no architectural modifications, offering a practical path toward robust multimodality medical object detection. Project page: this https URL.
zh
[CV-40] OTR: Synthesizing Overlay Text Dataset for Text Removal
【速读】:该论文旨在解决现有文本移除(text removal)研究中因数据集局限性导致的域外泛化能力不足与评估不准确的问题。当前主流基准如SCUT-EnsText存在人工编辑引入的地面真实(ground truth)伪影、背景过于简单以及评价指标无法反映生成结果质量等缺陷,难以支撑跨场景文本移除任务的可靠评估与模型训练。其解决方案的关键在于提出一种基于对象感知放置(object-aware placement)和视觉-语言模型(vision-language model)生成内容的合成方法,用于构建适用于非场景文本(non-scene text)领域的文本移除基准数据集,从而确保干净的地面真实标签并提供更具挑战性的文本移除场景。
链接: https://arxiv.org/abs/2510.02787
作者: Jan Zdenek,Wataru Shimoda,Kota Yamaguchi
机构: CyberAgent( CyberAgent); Tokyo(东京); Japan(日本)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This is the author’s version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in Proceedings of the 33rd ACM International Conference on Multimedia (MM '25), October 27-31, 2025, Dublin, Ireland, this https URL
Abstract:Text removal is a crucial task in computer vision with applications such as privacy preservation, image editing, and media reuse. While existing research has primarily focused on scene text removal in natural images, limitations in current datasets hinder out-of-domain generalization or accurate evaluation. In particular, widely used benchmarks such as SCUT-EnsText suffer from ground truth artifacts due to manual editing, overly simplistic text backgrounds, and evaluation metrics that do not capture the quality of generated results. To address these issues, we introduce an approach to synthesizing a text removal benchmark applicable to domains other than scene texts. Our dataset features text rendered on complex backgrounds using object-aware placement and vision-language model-generated content, ensuring clean ground truth and challenging text removal scenarios. The dataset is available at this https URL .
zh
[CV-41] Reasoning Riddles: How Explainability Reveals Cognitive Limits in Vision-Language Models
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在复杂横向思维任务(如字谜谜题)中认知过程不透明的问题,尤其是其推理机制与失败模式尚不明确。解决方案的关键在于构建了一个系统标注的221个字谜谜题数据集,涵盖六类认知类别,并设计了一套将推理质量与答案正确性分离的评估框架;同时通过三种提示策略探究不同解释性推理过程,揭示了VLM在视觉组合方面的系统性优势以及在隐含意义和文化符号理解上的根本局限,从而确立可解释性是模型性能的核心组成部分而非事后补充。
链接: https://arxiv.org/abs/2510.02780
作者: Prahitha Movva
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-Language Models (VLMs) excel at many multimodal tasks, yet their cognitive processes remain opaque on complex lateral thinking challenges like rebus puzzles. While recent work has demonstrated these models struggle significantly with rebus puzzle solving, the underlying reasoning processes and failure patterns remain largely unexplored. We address this gap through a comprehensive explainability analysis that moves beyond performance metrics to understand how VLMs approach these complex lateral thinking challenges. Our study contributes a systematically annotated dataset of 221 rebus puzzles across six cognitive categories, paired with an evaluation framework that separates reasoning quality from answer correctness. We investigate three prompting strategies designed to elicit different types of explanatory processes and reveal critical insights into VLM cognitive processes. Our findings demonstrate that reasoning quality varies dramatically across puzzle categories, with models showing systematic strengths in visual composition while exhibiting fundamental limitations in absence interpretation and cultural symbolism. We also discover that prompting strategy substantially influences both cognitive approach and problem-solving effectiveness, establishing explainability as an integral component of model performance rather than a post-hoc consideration.
zh
[CV-42] AdaRD-key: Adaptive Relevance-Diversity Keyframe Sampling for Long-form Video understanding
【速读】:该论文旨在解决长视频理解中因时间跨度大、信息密度高而导致的视觉-语言模型(VLM)性能受限问题,尤其是现有方法依赖均匀采样易遗漏关键时刻,或采用固定间隔的关键帧选择策略会错过事件附近的细粒度线索。解决方案的核心在于提出一种无需训练的自适应关键帧采样模块 AdaRD-Key,其通过最大化统一的“相关性-多样性最大体积”(Relevance–Diversity Max-Volume, RD-MV)目标函数,联合优化查询条件下的相关性得分与对数行列式形式的多样性项,从而选出既具代表性又无冗余的帧序列;此外,该方法引入轻量级相关性感知门控机制,在查询与视频对齐较弱时自动切换至纯多样性模式,提升覆盖范围而无需额外标注数据,整体实现高效、实时且兼容主流 VLM 的插件式部署。
链接: https://arxiv.org/abs/2510.02778
作者: Xian Zhang,Zexi Wu,Zinuo Li,Hongming Xu,Luqi Gong,Farid Boussaid,Naoufel Werghi,Mohammed Bennamoun
机构: The University of Western Australia (西澳大利亚大学); Dalian University of Technology (大连理工大学); Khalifa University (哈利法大学); Zhejiang Lab (浙江实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Understanding long-form videos remains a significant challenge for vision–language models (VLMs) due to their extensive temporal length and high information density. Most current multimodal large language models (MLLMs) rely on uniform sampling, which often overlooks critical moments, leading to incorrect responses to queries. In parallel, many keyframe selection approaches impose rigid temporal spacing: once a frame is chosen, an exclusion window suppresses adjacent timestamps to reduce redundancy. While effective at limiting overlap, this strategy frequently misses short, fine-grained cues near important events. Other methods instead emphasize visual diversity but neglect query relevance. We propose AdaRD-Key, a training-free keyframe sampling module for query-driven long-form video understanding. AdaRD-Key maximizes a unified Relevance–Diversity Max-Volume (RD-MV) objective, combining a query-conditioned relevance score with a log-determinant diversity component to yield informative yet non-redundant frames. To handle broad queries with weak alignment to the video, AdaRD-Key employs a lightweight relevance-aware gating mechanism; when the relevance distribution indicates weak alignment, the method seamlessly shifts into a diversity-only mode, enhancing coverage without additional supervision. Our pipeline is training-free, computationally efficient (running in real time on a single GPU), and compatible with existing VLMs in a plug-and-play manner. Extensive experiments on LongVideoBench and Video-MME demonstrate state-of-the-art performance, particularly on long-form videos. Code available at this https URL.
zh
[CV-43] Hierarchical Generalized Category Discovery for Brain Tumor Classification in Digital Pathology
【速读】:该论文旨在解决脑肿瘤分类中现有方法无法识别训练阶段未见类别的问题,即传统监督学习受限于预定义类别集合,而无监督或半监督方法难以有效融合标注数据中的先验知识。其核心解决方案是提出Hierarchical Generalized Category Discovery for Brain Tumor Classification (HGCD-BT),关键创新在于将层次聚类与对比学习相结合,并引入一种新颖的半监督层次聚类损失函数,从而在未标注数据中同时识别已知和未知类别的脑肿瘤类型,且能反映脑肿瘤分类体系的层级结构。该方法在刺激拉曼组织学图像(patch-level)和苏木精-伊红染色全切片图像(slide-level)上均实现显著性能提升,验证了其跨模态泛化能力。
链接: https://arxiv.org/abs/2510.02760
作者: Matthias Perkonigg,Patrick Rockenschaub,Georg Göbel,Adelheid Wöhrer
机构: Medical University of Innsbruck (因斯布鲁克医科大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Accurate brain tumor classification is critical for intra-operative decision making in neuro-oncological surgery. However, existing approaches are restricted to a fixed set of predefined classes and are therefore unable to capture patterns of tumor types not available during training. Unsupervised learning can extract general-purpose features, but it lacks the ability to incorporate prior knowledge from labelled data, and semi-supervised methods often assume that all potential classes are represented in the labelled data. Generalized Category Discovery (GCD) aims to bridge this gap by categorizing both known and unknown classes within unlabelled data. To reflect the hierarchical structure of brain tumor taxonomies, in this work, we introduce Hierarchical Generalized Category Discovery for Brain Tumor Classification (HGCD-BT), a novel approach that integrates hierarchical clustering with contrastive learning. Our method extends contrastive learning based GCD by incorporating a novel semi-supervised hierarchical clustering loss. We evaluate HGCD-BT on OpenSRH, a dataset of stimulated Raman histology brain tumor images, achieving a +28% improvement in accuracy over state-of-the-art GCD methods for patch-level classification, particularly in identifying previously unseen tumor categories. Furthermore, we demonstrate the generalizability of HGCD-BT on slide-level classification of hematoxylin and eosin stained whole-slide images from the Digital Brain Tumor Atlas, confirming its utility across imaging modalities.
zh
[CV-44] Bayesian Test-time Adaptation for Object Recognition and Detection with Vision-language Models
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在真实世界分布偏移(distribution shifts)下性能下降的问题,尤其针对测试时适应(Test-Time Adaptation, TTA)场景中现有方法存在的两大局限:一是依赖计算昂贵的反向传播(backpropagation),难以实现实时部署;二是仅关注似然(likelihood)适应而忽视先验(prior)的作用。解决方案的关键在于提出一种统一的、无需训练的框架——Bayesian Class Adaptation plus (BCA+),其核心创新是引入一个动态缓存机制,用于自适应存储和更新类别嵌入(class embeddings)、空间尺度(spatial scales,适用于检测任务)以及由历史预测推导出的自适应类别先验(adaptive class priors)。通过将适应过程建模为贝叶斯推理问题,最终预测由初始VLM输出与基于缓存的预测融合而成,其中缓存预测结合了动态更新的似然项(衡量特征与尺度相似性)和先验项(反映类别分布演化),从而实现对模型语义理解与上下文置信度的双重校正。该方法不依赖反向传播,具备高效率和优异的泛化能力,在识别与检测基准上均达到最先进性能。
链接: https://arxiv.org/abs/2510.02750
作者: Lihua Zhou,Mao Ye,Shuaifeng Li,Nianxin Li,Jinlin Wu,Xiatian Zhu,Lei Deng,Hongbin Liu,Jiebo Luo,Zhen Lei
机构: Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science and Innovation, Chinese Academy of Sciences, Hong Kong, China; School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China; Surrey Institute for People-Centred Artificial Intelligence, CVSSP, University of Surrey, Guildford, UK; School of Electronics and Information Engineering, Shenzhen University, Shenzhen, China; University of Rochester; School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS), Beijing 100049, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review
Abstract:Vision-language models (VLMs) such as CLIP and Grounding DINO have achieved remarkable success in object recognition and detection. However, their performance often degrades under real-world distribution shifts. Test-time adaptation (TTA) aims to mitigate this issue by adapting models during inference. Existing methods either rely on computationally expensive backpropagation, which hinders real-time deployment, or focus solely on likelihood adaptation, which overlooks the critical role of the prior. Our prior work, Bayesian Class Adaptation (BCA), addressed these shortcomings for object recognition by introducing a training-free framework that incorporates adaptive priors. Building upon this foundation, we now present Bayesian Class Adaptation plus (BCA+), a unified, training-free framework for TTA for both object recognition and detection. BCA+ introduces a dynamic cache that adaptively stores and updates class embeddings, spatial scales (for detection), and, crucially, adaptive class priors derived from historical predictions. We formulate adaptation as a Bayesian inference problem, where final predictions are generated by fusing the initial VLM output with a cache-based prediction. This cache-based prediction combines a dynamically updated likelihood (measuring feature and scale similarity) and a prior (reflecting the evolving class distribution). This dual-adaptation mechanism, coupled with uncertainty-guided fusion, enables BCA+ to correct both the model’s semantic understanding and its contextual confidence. As a training-free method requiring no backpropagation, BCA+ is highly efficient. Extensive experiments demonstrate that BCA+ achieves state-of-the-art performance on both recognition and detection benchmarks.
zh
[CV-45] Retrv-R1: A Reasoning -Driven MLLM Framework for Universal and Efficient Multimodal Retrieval NEURIPS2025
【速读】:该论文旨在解决将基于强化学习(Reinforcement Learning, RL)的推理增强方法从纯文本大语言模型(Large Language Models, LLMs)迁移至多模态通用检索任务时所面临的两大核心挑战:一是因多候选样本推理过程导致的高计算开销(token消耗过大),二是直接应用RL训练在检索任务中表现出的不稳定性与次优性能。解决方案的关键在于提出Retrv-R1架构,其创新性地引入两个核心模块:(1)信息压缩模块结合细节检查机制(details inspection mechanism),在显著减少token数量的同时保留关键信息以提升计算效率;(2)一种新的训练范式,包括使用专为检索设计的合成思维链(Chain-of-Thought, CoT)数据集进行激活阶段优化,随后采用新颖的课程奖励机制(curriculum reward)引导强化学习,从而实现性能与效率的协同提升。这一设计使Retrv-R1在多个基准测试中达到当前最优(SOTA)表现,并具备良好的泛化能力。
链接: https://arxiv.org/abs/2510.02745
作者: Lanyun Zhu,Deyi Ji,Tianrun Chen,Haiyang Wu,Shiqi Wang
机构: City University of Hong Kong (香港城市大学); Tencent (腾讯); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2025
Abstract:The success of DeepSeek-R1 demonstrates the immense potential of using reinforcement learning (RL) to enhance LLMs’ reasoning capabilities. This paper introduces Retrv-R1, the first R1-style MLLM specifically designed for multimodal universal retrieval, achieving higher performance by employing step-by-step reasoning to produce more accurate retrieval results. We find that directly applying the methods of DeepSeek-R1 to retrieval tasks is not feasible, mainly due to (1) the high computational cost caused by the large token consumption required for multiple candidates with reasoning processes, and (2) the instability and suboptimal results when directly applying RL to train for retrieval tasks. To address these issues, Retrv-R1 introduces an information compression module with a details inspection mechanism, which enhances computational efficiency by reducing the number of tokens while ensuring that critical information for challenging candidates is preserved. Furthermore, a new training paradigm is proposed, including an activation stage using a retrieval-tailored synthetic CoT dataset for more effective optimization, followed by RL with a novel curriculum reward to improve both performance and efficiency. Incorporating these novel designs, Retrv-R1 achieves SOTA performance, high efficiency, and strong generalization ability, as demonstrated by experiments across multiple benchmarks and tasks.
zh
[CV-46] Net2Net: When Un-trained Meets Pre-trained Networks for Robust Real-World Denoising
【速读】:该论文旨在解决真实世界中噪声去除(noise removal)的挑战,即传统基于手工先验的方法在复杂多变的实际噪声场景下性能受限,而现有深度学习方法虽能从大规模数据中学习噪声特征,却常依赖大量标注数据且泛化能力不足。解决方案的关键在于提出一种名为Net2Net的新框架,其核心创新是通过正则化去噪(Regularization by Denoising, RED)将未训练的深度图像先验(DIP)与预训练的去噪网络DRUNet相结合:未训练网络无需标签即可自适应地拟合单张图像的独特噪声特性,而预训练网络则利用大规模数据中学到的通用表示提升鲁棒性,从而在有限训练数据条件下实现更优的跨噪声类型和成像条件的泛化性能。
链接: https://arxiv.org/abs/2510.02733
作者: Weimin Yuan,Cai Meng
机构: Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Traditional denoising methods for noise removal have largely relied on handcrafted priors, often perform well in controlled environments but struggle to address the complexity and variability of real noise. In contrast, deep learning-based approaches have gained prominence for learning noise characteristics from large datasets, but these methods frequently require extensive labeled data and may not generalize effectively across diverse noise types and imaging conditions. In this paper, we present an innovative method, termed as Net2Net, that combines the strengths of untrained and pre-trained networks to tackle the challenges of real-world noise removal. The innovation of Net2Net lies in its combination of unsupervised DIP and supervised pre-trained model DRUNet by regularization by denoising (RED). The untrained network adapts to the unique noise characteristics of each input image without requiring labeled data, while the pre-trained network leverages learned representations from large-scale datasets to deliver robust denoising performance. This hybrid framework enhances generalization across varying noise patterns and improves performance, particularly in scenarios with limited training data. Extensive experiments on benchmark datasets demonstrate the superiority of our method for real-world noise removal.
zh
[CV-47] From Tokens to Nodes: Semantic-Guided Motion Control for Dynamic 3D Gaussian Splatting
【速读】:该论文旨在解决单目视频动态三维重建中因视角限制导致的3D运动歧义性以及建模时变场景带来的计算复杂度问题。现有稀疏控制方法虽通过将数百万高斯点缩减至数千个控制点以降低计算量,但其控制点分配仅基于几何信息,造成静态区域冗余和动态区域不足的问题。解决方案的关键在于提出一种运动自适应框架(motion-adaptive framework),该框架利用视觉基础模型提供的语义与运动先验,建立patch-token-node对应关系,并通过运动自适应压缩策略在动态区域集中控制点、抑制静态背景冗余;同时引入基于样条的轨迹参数化方法(spline-based trajectory parameterization)替代MLP驱动的形变场,结合迭代体素化与运动趋势评分机制,实现表示密度的灵活自适应调整,从而有效弥合控制点分布与运动复杂度之间的根本不匹配问题。
链接: https://arxiv.org/abs/2510.02732
作者: Jianing Chen,Zehao Li,Yujun Cai,Hao Jiang,Shuqin Gao,Honglong Zhao,Tianlu Mao,Yucheng Zhang
机构: Institute of Computing Technology, Chinese Academy of Sciences, ICT (中国科学院计算技术研究所); University of Chinese Academy of Sciences, UCAS (中国科学院大学); The University of Queensland (昆士兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Dynamic 3D reconstruction from monocular videos remains difficult due to the ambiguity inferring 3D motion from limited views and computational demands of modeling temporally varying scenes. While recent sparse control methods alleviate computation by reducing millions of Gaussians to thousands of control points, they suffer from a critical limitation: they allocate points purely by geometry, leading to static redundancy and dynamic insufficiency. We propose a motion-adaptive framework that aligns control density with motion complexity. Leveraging semantic and motion priors from vision foundation models, we establish patch-token-node correspondences and apply motion-adaptive compression to concentrate control points in dynamic regions while suppressing redundancy in static backgrounds. Our approach achieves flexible representational density adaptation through iterative voxelization and motion tendency scoring, directly addressing the fundamental mismatch between control point allocation and motion complexity. To capture temporal evolution, we introduce spline-based trajectory parameterization initialized by 2D tracklets, replacing MLP-based deformation fields to achieve smoother motion representation and more stable optimization. Extensive experiments demonstrate significant improvements in reconstruction quality and efficiency over existing state-of-the-art methods.
zh
[CV-48] Dale meets Langevin: A Multiplicative Denoising Diffusion Model
【速读】:该论文旨在解决传统梯度下降优化方法与生物神经系统学习机制不一致的问题,从而构建更符合生物学原理的学习技术。其核心挑战在于如何在保持生物合理性的同时实现有效的学习和生成建模。解决方案的关键在于引入Dale定律(Dale’s law)约束下的指数梯度下降(exponential gradient descent)框架,该框架通过限制突触权重的更新为乘法形式,自然导致突触权重呈对数正态分布(log-normal distribution)。作者进一步利用几何布朗运动(geometric Brownian motion, GBM)对应的随机微分方程(SDE),通过离散化反向时间SDE推导出一个乘法更新规则,该规则恰好等价于基于Dale定律的指数梯度下降采样形式。在此基础上,提出了一种新的乘法去噪分数匹配(multiplicative denoising score-matching)形式化方法,适用于正定数据(如图像),并首次实现了基于几何布朗运动的生物启发式生成模型,实验验证了其在MNIST、Fashion MNIST和Kuzushiji数据集上的生成能力。
链接: https://arxiv.org/abs/2510.02730
作者: Nishanth Shetty,Madhava Prasath,Chandra Sekhar Seelamantula
机构: Indian Institute of Science (印度科学研究所)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Gradient descent has proven to be a powerful and effective technique for optimization in numerous machine learning applications. Recent advances in computational neuroscience have shown that learning in standard gradient descent optimization formulation is not consistent with learning in biological systems. This has opened up interesting avenues for building biologically inspired learning techniques. One such approach is inspired by Dale’s law, which states that inhibitory and excitatory synapses do not swap roles during the course of learning. The resulting exponential gradient descent optimization scheme leads to log-normally distributed synaptic weights. Interestingly, the density that satisfies the Fokker-Planck equation corresponding to the stochastic differential equation (SDE) with geometric Brownian motion (GBM) is the log-normal density. Leveraging this connection, we start with the SDE governing geometric Brownian motion, and show that discretizing the corresponding reverse-time SDE yields a multiplicative update rule, which surprisingly, coincides with the sampling equivalent of the exponential gradient descent update founded on Dale’s law. Furthermore, we propose a new formalism for multiplicative denoising score-matching, subsuming the loss function proposed by Hyvaerinen for non-negative data. Indeed, log-normally distributed data is positive and the proposed score-matching formalism turns out to be a natural fit. This allows for training of score-based models for image data and results in a novel multiplicative update scheme for sample generation starting from a log-normal density. Experimental results on MNIST, Fashion MNIST, and Kuzushiji datasets demonstrate generative capability of the new scheme. To the best of our knowledge, this is the first instance of a biologically inspired generative model employing multiplicative updates, founded on geometric Brownian motion.
zh
[CV-49] MoGIC: Boosting Motion Generation via Intention Understanding and Visual Context
【速读】:该论文旨在解决现有文本驱动动作生成方法在捕捉行为因果逻辑与人类意图方面的局限性,以及因缺乏视觉 grounding 导致的生成精度和个性化不足的问题。其解决方案的关键在于提出 MoGIC 框架,通过联合优化多模态条件下的动作生成与意图预测,将意图建模与视觉先验(visual priors)融合进统一的多模态动作合成体系中;同时引入具有自适应作用范围的注意力混合机制(mixture-of-attention mechanism),实现条件 token 与动作子序列之间的有效局部对齐,从而提升生成可控性、准确性和多样性。
链接: https://arxiv.org/abs/2510.02722
作者: Junyu Shi,Yong Sun,Zhiyuan Zhang,Lijiang Liu,Zhengjie Zhang,Yuxin He,Qiang Nie
机构: The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing text-driven motion generation methods often treat synthesis as a bidirectional mapping between language and motion, but remain limited in capturing the causal logic of action execution and the human intentions that drive behavior. The absence of visual grounding further restricts precision and personalization, as language alone cannot specify fine-grained spatiotemporal details. We propose MoGIC, a unified framework that integrates intention modeling and visual priors into multimodal motion synthesis. By jointly optimizing multimodal-conditioned motion generation and intention prediction, MoGIC uncovers latent human goals, leverages visual priors to enhance generation, and exhibits versatile multimodal generative capability. We further introduce a mixture-of-attention mechanism with adaptive scope to enable effective local alignment between conditional tokens and motion subsequences. To support this paradigm, we curate Mo440H, a 440-hour benchmark from 21 high-quality motion datasets. Experiments show that after finetuning, MoGIC reduces FID by 38.6% on HumanML3D and 34.6% on Mo440H, surpasses LLM-based methods in motion captioning with a lightweight text head, and further enables intention prediction and vision-conditioned generation, advancing controllable motion synthesis and intention understanding. The code is available at this https URL
zh
[CV-50] A Statistical Method for Attack-Agnostic Adversarial Attack Detection with Compressive Sensing Comparison
【速读】:该论文旨在解决现有对抗攻击检测方法在面对未见过的攻击类型时检测能力不足,以及对不同攻击类型难以实现高精度检测的问题。解决方案的关键在于提出一种基于统计学的方法,在神经网络部署前建立检测基线,通过比较压缩与非压缩神经网络对输入样本响应行为的差异,生成一个用于判断是否存在对抗样本的指标,从而实现对多种攻击类型的近似完美检测,并显著降低误报率,具备良好的实用性与可靠性。
链接: https://arxiv.org/abs/2510.02707
作者: Chinthana Wimalasuriya,Spyros Tragoudas
机构: Southern Illinois University (南伊利诺伊大学)
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:
Abstract:Adversarial attacks present a significant threat to modern machine learning systems. Yet, existing detection methods often lack the ability to detect unseen attacks or detect different attack types with a high level of accuracy. In this work, we propose a statistical approach that establishes a detection baseline before a neural network’s deployment, enabling effective real-time adversarial detection. We generate a metric of adversarial presence by comparing the behavior of a compressed/uncompressed neural network pair. Our method has been tested against state-of-the-art techniques, and it achieves near-perfect detection across a wide range of attack types. Moreover, it significantly reduces false positives, making it both reliable and practical for real-world applications.
zh
[CV-51] FSFSplatter: Build Surface and Novel Views with Sparse-Views within 3min
【速读】:该论文旨在解决从稀疏自由视角图像中进行高质量表面重建的问题,现有方法在缺乏密集且校准视图的情况下,常因视场重叠有限和过拟合导致表面质量下降。解决方案的关键在于提出FSFSplatter,其核心创新包括:端到端的稠密高斯初始化、相机参数估计与几何增强的场景优化一体化设计;利用大型Transformer编码多视角图像,并通过自分裂高斯头生成几何一致的初始场景;采用基于贡献度的剪枝策略消除局部噪声点,并在快速优化过程中借助深度监督和多视角特征监督(结合可微分相机参数)缓解过拟合问题,从而显著提升稀疏输入下的重建质量。
链接: https://arxiv.org/abs/2510.02691
作者: Yibin Zhao,Yihan Pan,Jun Nan,Jianjun Yi
机构: East China University of Science and Technology (华东理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
Abstract:Gaussian Splatting has become a leading reconstruction technique, known for its high-quality novel view synthesis and detailed reconstruction. However, most existing methods require dense, calibrated views. Reconstructing from free sparse images often leads to poor surface due to limited overlap and overfitting. We introduce FSFSplatter, a new approach for fast surface reconstruction from free sparse images. Our method integrates end-to-end dense Gaussian initialization, camera parameter estimation, and geometry-enhanced scene optimization. Specifically, FSFSplatter employs a large Transformer to encode multi-view images and generates a dense and geometrically consistent Gaussian scene initialization via a self-splitting Gaussian head. It eliminates local floaters through contribution-based pruning and mitigates overfitting to limited views by leveraging depth and multi-view feature supervision with differentiable camera parameters during rapid optimization. FSFSplatter outperforms current state-of-the-art methods on widely used DTU and Replica.
zh
[CV-52] Smart-GRPO: Smartly Sampling Noise for Efficient RL of Flow-Matching Models
【速读】:该论文旨在解决流匹配(flow-matching)模型在强化学习(reinforcement learning, RL)中因确定性特性导致的优化困难问题,即传统方法通过随机噪声扰动潜在空间(latents)引入 stochasticity 的方式效率低且不稳定。解决方案的关键在于提出 Smart-GRPO 方法,其核心是通过迭代搜索策略对噪声扰动进行优化:该策略先解码候选扰动,再利用奖励函数评估其性能,并基于反馈不断调整噪声分布以逼近高奖励区域,从而实现更高效、稳定的强化学习训练,提升图像质量和人类对齐性。
链接: https://arxiv.org/abs/2510.02654
作者: Benjamin Yu,Jackie Liu,Justin Cui
机构: University of California, Los Angeles (加州大学洛杉矶分校); Brown University (布朗大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advancements in flow-matching have enabled high-quality text-to-image generation. However, the deterministic nature of flow-matching models makes them poorly suited for reinforcement learning, a key tool for improving image quality and human alignment. Prior work has introduced stochasticity by perturbing latents with random noise, but such perturbations are inefficient and unstable. We propose Smart-GRPO, the first method to optimize noise perturbations for reinforcement learning in flow-matching models. Smart-GRPO employs an iterative search strategy that decodes candidate perturbations, evaluates them with a reward function, and refines the noise distribution toward higher-reward regions. Experiments demonstrate that Smart-GRPO improves both reward optimization and visual quality compared to baseline methods. Our results suggest a practical path toward reinforcement learning in flow-matching frameworks, bridging the gap between efficient training and human-aligned generation.
zh
[CV-53] Sequence-Preserving Dual-FoV Defense for Traffic Sign and Light Recognition in Autonomous Vehicles
【速读】:该论文旨在解决自动驾驶车辆中交通灯与交通标志识别模型在面对数字对抗攻击及自然环境退化(如眩光、雨天、污渍或涂鸦)时的鲁棒性不足问题,尤其关注现有方法对时间连续性、多静态视场(FoV)感知以及数字与物理扰动双重脆弱性的忽视。其解决方案的关键在于提出一种双视场(dual FoV)、保持序列连续性的鲁棒性框架,结合多源数据构建的美国场景数据集,并设计了一个统一的三层防御堆栈:特征压缩(feature squeezing)、防御蒸馏(defensive distillation)与基于熵的异常检测,辅以序列级时间投票机制提升性能;实验表明该方案在真实场景下显著提升了mAP至79.8,将攻击成功率(ASR)降至18.2%,并有效降低高风险误分类比例至32%。
链接: https://arxiv.org/abs/2510.02642
作者: Abhishek Joshi,Jahnavi Krishna Koda,Abhishek Phadke
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Traffic light and sign recognition are key for Autonomous Vehicles (AVs) because perception mistakes directly influence navigation and safety. In addition to digital adversarial attacks, models are vulnerable to existing perturbations (glare, rain, dirt, or graffiti), which could lead to dangerous misclassifications. The current work lacks consideration of temporal continuity, multistatic field-of-view (FoV) sensing, and robustness to both digital and natural degradation. This study proposes a dual FoV, sequence-preserving robustness framework for traffic lights and signs in the USA based on a multi-source dataset built on aiMotive, Udacity, Waymo, and self-recorded videos from the region of Texas. Mid and long-term sequences of RGB images are temporally aligned for four operational design domains (ODDs): highway, night, rainy, and urban. Over a series of experiments on a real-life application of anomaly detection, this study outlines a unified three-layer defense stack framework that incorporates feature squeezing, defensive distillation, and entropy-based anomaly detection, as well as sequence-wise temporal voting for further enhancement. The evaluation measures included accuracy, attack success rate (ASR), risk-weighted misclassification severity, and confidence stability. Physical transferability was confirmed using probes for recapture. The results showed that the Unified Defense Stack achieved 79.8mAP and reduced the ASR to 18.2%, which is superior to YOLOv8, YOLOv9, and BEVFormer, while reducing the high-risk misclassification to 32%.
zh
[CV-54] Deep Generative Continual Learning using Functional LoRA: FunLoRA
【速读】:该论文旨在解决深度生成模型在持续适应(continual adaptation)过程中因灾难性遗忘(catastrophic forgetting)导致的性能下降问题。传统增量训练方法依赖于模型自身生成的合成数据进行再训练,但存在训练时间随任务累积而急剧增长、以及合成数据缺乏真实数据丰富性导致长期性能退化两大局限。其解决方案的关键在于提出一种基于低秩适配(LoRA)的新型动态条件机制——功能 LoRA(FunLoRA),该机制仅使用秩为1的矩阵,并通过精心设计的函数对重参数化后的矩阵秩进行功能性扩展,从而实现参数高效微调(PEFT)。此方法确保模型在仅用当前任务数据训练的情况下避免灾难性遗忘,同时显著降低内存消耗与采样时间,且在基于流匹配(flow-matching)的生成模型上展现出优于扩散模型类方法的分类准确率。
链接: https://arxiv.org/abs/2510.02631
作者: Victor Enescu,Hichem Sahbi
机构: Sorbonne University (索邦大学); CNRS (法国国家科学研究中心); LIP6 (巴黎第六大学计算机科学实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Continual adaptation of deep generative models holds tremendous potential and critical importance, given their rapid and expanding usage in text and vision based applications. Incremental training, however, remains highly challenging due to catastrophic forgetting phenomenon, which makes it difficult for neural networks to effectively incorporate new knowledge. A common strategy consists in retraining the generative model on its own synthetic data in order to mitigate forgetting. Yet, such an approach faces two major limitations: (i) the continually increasing training time eventually becomes intractable, and (ii) reliance on synthetic data inevitably leads to long-term performance degradation, since synthetic samples lack the richness of real training data. In this paper, we attenuate these issues by designing a novel and more expressive conditioning mechanism for generative models based on low rank adaptation (LoRA), that exclusively employs rank 1 matrices, whose reparametrized matrix rank is functionally increased using carefully selected functions – and dubbed functional LoRA: FunLoRA. Using this dynamic conditioning, the generative model is guaranteed to avoid catastrophic forgetting and needs only to be trained on data from the current task. Extensive experiments using flow-matching based models trained from scratch, showcase that our proposed parameter-efficient fine-tuning (PEFT) method surpasses prior state-of-the-art results based on diffusion models, reaching higher classification accuracy scores, while only requiring a fraction of the memory cost and sampling time.
zh
[CV-55] Input-Aware Sparse Attention for Real-Time Co-Speech Video Generation
【速读】:该论文旨在解决基于扩散模型(diffusion models)的语音驱动视频生成方法在实际应用中存在推理速度慢、计算成本高且视频质量不佳的问题,尤其针对实时部署场景的限制。其解决方案的关键在于提出一种输入感知的蒸馏方法:首先引入人体姿态关键点(human pose keypoints)作为条件,设计输入感知稀疏注意力机制(input-aware sparse attention),以引导模型聚焦于面部、手部和上半身等关键区域,从而减少冗余计算并增强时序一致性;其次,设计输入感知蒸馏损失函数(input-aware distillation loss),显著提升唇同步精度与手势动作的真实性。通过上述两个核心改进,该方法在保持实时性能的同时,实现了优于当前主流音频驱动与输入驱动方法的视觉质量。
链接: https://arxiv.org/abs/2510.02617
作者: Beijia Lu,Ziyi Chen,Jing Xiao,Jun-Yan Zhu
机构: Carnegie Mellon University (卡内基梅隆大学); PAII Inc. (PAII 公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:Diffusion models can synthesize realistic co-speech video from audio for various applications, such as video creation and virtual agents. However, existing diffusion-based methods are slow due to numerous denoising steps and costly attention mechanisms, preventing real-time deployment. In this work, we distill a many-step diffusion video model into a few-step student model. Unfortunately, directly applying recent diffusion distillation methods degrades video quality and falls short of real-time performance. To address these issues, our new video distillation method leverages input human pose conditioning for both attention and loss functions. We first propose using accurate correspondence between input human pose keypoints to guide attention to relevant regions, such as the speaker’s face, hands, and upper body. This input-aware sparse attention reduces redundant computations and strengthens temporal correspondences of body parts, improving inference efficiency and motion coherence. To further enhance visual quality, we introduce an input-aware distillation loss that improves lip synchronization and hand motion realism. By integrating our input-aware sparse attention and distillation loss, our method achieves real-time performance with improved visual quality compared to recent audio-driven and input-driven methods. We also conduct extensive experiments showing the effectiveness of our algorithmic design choices.
zh
[CV-56] Ego-Exo 3D Hand Tracking in the Wild with a Mobile Multi-Camera Rig
【速读】:该论文旨在解决在非受控场景下(unconstrained settings)实现精准的3D手部追踪及其与环境交互建模的问题,当前主流数据集多基于受控实验室环境采集,限制了模型在真实世界中的泛化能力。解决方案的关键在于设计了一种无标记(marker-less)的多相机系统,结合背载式轻量级外视角(exocentric)摄像头阵列(8个)与用户佩戴的Meta Quest 3头显(提供2个第一人称视角,egocentric views),构建了一个融合ego-exo视角的跟踪流水线,从而生成高精度的3D手部姿态真值(ground truth),有效平衡了环境真实性与标注准确性之间的权衡。
链接: https://arxiv.org/abs/2510.02601
作者: Patrick Rim,Kun He,Kevin Harris,Braden Copple,Shangchen Han,Sizhe An,Ivan Shugurov,Tomas Hodan,He Wen,Xu Xie
机构: Meta Reality Labs (Meta现实实验室); Yale University (耶鲁大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate 3D tracking of hands and their interactions with the world in unconstrained settings remains a significant challenge for egocentric computer vision. With few exceptions, existing datasets are predominantly captured in controlled lab setups, limiting environmental diversity and model generalization. To address this, we introduce a novel marker-less multi-camera system designed to capture precise 3D hands and objects, which allows for nearly unconstrained mobility in genuinely in-the-wild conditions. We combine a lightweight, back-mounted capture rig with eight exocentric cameras, and a user-worn Meta Quest 3 headset, which contributes two egocentric views. We design an ego-exo tracking pipeline to generate accurate 3D hand pose ground truth from this system, and rigorously evaluate its quality. By collecting an annotated dataset featuring synchronized multi-view images and precise 3D hand poses, we demonstrate the capability of our approach to significantly reduce the trade-off between environmental realism and 3D annotation accuracy.
zh
[CV-57] PEO: Training-Free Aesthetic Quality Enhancement in Pre-Trained Text-to-Image Diffusion Models with Prompt Embedding Optimization
【速读】:该论文旨在解决预训练文本到图像扩散模型在仅输入简单提示(prompt)时生成图像美学质量较低的问题。解决方案的关键在于提出一种无需训练且与骨干模型无关的Prompt Embedding Optimization (PEO) 方法,通过优化初始提示的文本嵌入(text embedding),在保持语义一致性的同时提升生成图像的视觉美感。该方法采用三元目标函数,兼顾美学保真度、文本嵌入一致性以及与原始提示的最小偏离,从而实现高质量图像生成。
链接: https://arxiv.org/abs/2510.02599
作者: Hovhannes Margaryan,Bo Wan,Tinne Tuytelaars
机构: KU Leuven (鲁汶大学); Université Paris-Saclay (巴黎萨克雷大学); Meta (Meta)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper introduces a novel approach to aesthetic quality improvement in pre-trained text-to-image diffusion models when given a simple prompt. Our method, dubbed Prompt Embedding Optimization (PEO), leverages a pre-trained text-to-image diffusion model as a backbone and optimizes the text embedding of a given simple and uncurated prompt to enhance the visual quality of the generated image. We achieve this by a tripartite objective function that improves the aesthetic fidelity of the generated image, ensures adherence to the optimized text embedding, and minimal divergence from the initial prompt. The latter is accomplished through a prompt preservation term. Additionally, PEO is training-free and backbone-independent. Quantitative and qualitative evaluations confirm the effectiveness of the proposed method, exceeding or equating the performance of state-of-the-art text-to-image and prompt adaptation methods.
zh
[CV-58] Unlocking the power of partnership: How humans and machines can work together to improve face recognition
【速读】:该论文旨在解决人机协同在人脸识别决策中如何提升准确性的关键问题,尤其关注个体差异对协作效果的影响。其解决方案的核心在于提出并验证了“近似准确性规则”(Proximal Accuracy Rule, PAR),即当人类与机器的基线识别准确率越接近时,融合决策带来的增益越大;进一步基于PAR识别出一个“关键融合区”,在此区域内即使人类表现低于机器,融合仍能提升整体系统准确性——该区域远比预期广泛。研究通过智能选择具有潜力的人类合作者(而非简单聚合所有判断)实现“智能人机融合”,显著优于单独使用机器或盲目整合所有人机判断,同时有效抑制低性能人类对系统准确率的负面影响,从而为高精度人脸识别提供了可落地的、数据驱动的协同优化路径。
链接: https://arxiv.org/abs/2510.02570
作者: P. Jonathon Phillips(1),Geraldine Jeckeln(2),Carina A. Hahn(1),Amy N. Yates(1),Peter C. Fontana(1),Alice J. O’Toole(2) ((1) Information Access Division, National Institute of Standards and Technology, Gaithersburg, MD (2) School of Behavioral and Brain Sciences, The University of Texas at Dallas, Richardson, TX)
机构: National Institute of Standards and Technology (美国国家标准与技术研究院); The University of Texas at Dallas (德克萨斯大学达拉斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Human review of consequential decisions by face recognition algorithms creates a “collaborative” human-machine system. Individual differences between people and machines, however, affect whether collaboration improves or degrades accuracy in any given case. We establish the circumstances under which combining human and machine face identification decisions improves accuracy. Using data from expert and non-expert face identifiers, we examined the benefits of human-human and human-machine collaborations. The benefits of collaboration increased as the difference in baseline accuracy between collaborators decreased-following the Proximal Accuracy Rule (PAR). This rule predicted collaborative (fusion) benefit across a wide range of baseline abilities, from people with no training to those with extensive training. Using the PAR, we established a critical fusion zone, where humans are less accurate than the machine, but fusing the two improves system accuracy. This zone was surprisingly large. We implemented “intelligent human-machine fusion” by selecting people with the potential to increase the accuracy of a high-performing machine. Intelligent fusion was more accurate than the machine operating alone and more accurate than combining all human and machine judgments. The highest system-wide accuracy achievable with human-only partnerships was found by graph theory. This fully human system approximated the average performance achieved by intelligent human-machine collaboration. However, intelligent human-machine collaboration more effectively minimized the impact of low-performing humans on system-wide accuracy. The results demonstrate a meaningful role for both humans and machines in assuring accurate face identification. This study offers an evidence-based road map for the intelligent use of AI in face identification.
zh
[CV-59] PhysHMR: Learning Humanoid Control Policies from Vision for Physically Plausible Human Motion Reconstruction
【速读】:该论文旨在解决从单目视频中重建物理上合理的人体运动这一难题,现有方法多基于运动学估计,缺乏物理约束导致结果不真实;且传统两阶段设计(先运动学估计后物理后处理)易引入误差累积,限制整体重建质量。其解决方案的关键在于提出PhysHMR框架,该框架通过端到端学习视觉到动作策略(visual-to-action policy),直接在物理模拟器中控制人形模型,实现视觉一致性和物理合理性兼备的运动重建。核心创新包括:1)像素转射线(pixel-as-ray)策略,将2D关键点映射为3D空间射线并转换至全局坐标系,提供稳健的全局姿态引导而不依赖噪声较大的3D根节点预测;2)结合预训练编码器提取的局部视觉特征与上述软全局引导,使策略能同时推理细节姿态与全局位置;3)引入知识蒸馏机制,从动捕训练的专家模型迁移运动知识至视觉条件策略,并辅以物理驱动的强化学习奖励进行微调,有效提升样本效率和重建精度。
链接: https://arxiv.org/abs/2510.02566
作者: Qiao Feng,Yiming Huang,Yufu Wang,Jiatao Gu,Lingjie Liu
机构: University of Pennsylvania (宾夕法尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reconstructing physically plausible human motion from monocular videos remains a challenging problem in computer vision and graphics. Existing methods primarily focus on kinematics-based pose estimation, often leading to unrealistic results due to the lack of physical constraints. To address such artifacts, prior methods have typically relied on physics-based post-processing following the initial kinematics-based motion estimation. However, this two-stage design introduces error accumulation, ultimately limiting the overall reconstruction quality. In this paper, we present PhysHMR, a unified framework that directly learns a visual-to-action policy for humanoid control in a physics-based simulator, enabling motion reconstruction that is both physically grounded and visually aligned with the input video. A key component of our approach is the pixel-as-ray strategy, which lifts 2D keypoints into 3D spatial rays and transforms them into global space. These rays are incorporated as policy inputs, providing robust global pose guidance without depending on noisy 3D root predictions. This soft global grounding, combined with local visual features from a pretrained encoder, allows the policy to reason over both detailed pose and global positioning. To overcome the sample inefficiency of reinforcement learning, we further introduce a distillation scheme that transfers motion knowledge from a mocap-trained expert to the vision-conditioned policy, which is then refined using physically motivated reinforcement learning rewards. Extensive experiments demonstrate that PhysHMR produces high-fidelity, physically plausible motion across diverse scenarios, outperforming prior approaches in both visual accuracy and physical realism.
zh
[CV-60] Oracle-RLAIF: An Improved Fine-Tuning Framework for Multi-modal Video Models through Reinforcement Learning from Ranking Feedback
【速读】:该论文旨在解决大规模视频语言模型(Video-Language Models, VLMs)在对齐文本与视觉理解时,依赖大量人工偏好反馈所带来的高昂成本问题。当前主流方法如强化学习从人类偏好中学习(Reinforcement Learning from Human Feedback, RLHF)和基于AI反馈的强化学习(Reinforcement Learning with AI Feedback, RLAIF)均受限于训练专用奖励模型(Reward Model)的复杂性和数据效率低下。其关键解决方案是提出Oracle-RLAIF框架,该框架摒弃传统标量奖励建模方式,采用一个通用的Oracle排序器(Oracle ranker)直接对候选响应进行排序而非评分,并结合一种基于组相对策略优化(Group Relative Policy Optimization, GRPO)的新型排名损失函数GRPO_rank,从而实现基于序数反馈(ordinal feedback)的直接优化,显著提升训练效率与模型性能。
链接: https://arxiv.org/abs/2510.02561
作者: Derek Shi,Ruben Glatt,Christine Klymko,Shubham Mohole,Hongjun Choi,Shashank Kushwaha,Sam Sakla,Felipe Leno da Silva
机构: Lawrence Livermore National Laboratory (劳伦斯利弗莫尔国家实验室); University of California Los Angeles (加州大学洛杉矶分校); Cornell University (康奈尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Proceedings of the 39th Annual Conference on Neural Information Processing Systems, ARLET Workshop (Aligning Reinforcement Learning Experimentalists and Theorists)
Abstract:Recent advances in large video-language models (VLMs) rely on extensive fine-tuning techniques that strengthen alignment between textual and visual comprehension. Leading pipelines typically pair supervised fine-tuning (SFT) with reinforcement learning from preference data to enhance video comprehension. However, as VLMs scale in parameter size, so does the cost of gathering enough human feedback. To make fine-tuning more cost-effective, recent frameworks explore reinforcement learning with AI feedback (RLAIF), which replace human preference with AI as a judge. Current RLAIF frameworks rely on a specialized reward model trained with video narratives to create calibrated scalar rewards-- an expensive and restrictive pipeline. We propose Oracle-RLAIF, a novel framework that replaces the trained reward model with a more general Oracle ranker which acts as a drop-in model ranking candidate model responses rather than scoring them. Alongside Oracle-RLAIF, we introduce GRPO_rank , a novel rank-based loss function based on Group Relative Policy Optimization (GRPO) that directly optimizes ordinal feedback with rank-aware advantages. Empirically, we demonstrate that Oracle-RLAIF consistently outperforms leading VLMs using existing fine-tuning methods when evaluated across various video comprehension benchmarks. Oracle-RLAIF paves the path to creating flexible and data-efficient frameworks for aligning large multi-modal video models with reinforcement learning from rank rather than score.
zh
[CV-61] Exploring OCR-augmented Generation for Bilingual VQA
【速读】:该论文旨在解决多语言视觉问答(VQA)任务中因缺乏有效文本识别能力而导致模型性能受限的问题,特别是在韩语和英语双语场景下的OCR增强生成能力不足。其解决方案的关键在于构建并发布KLOCR——一个基于1亿条数据实例训练的强健双语光学字符识别(OCR)基线模型,用于增强视觉语言模型(VLMs)对图像中文本内容的理解与利用能力;同时,为支持韩语VQA研究,作者还构建了KOCRBench基准测试集,并系统分析不同提示(prompting)方法的效果。实验表明,OCR提取的文本信息能显著提升开源及商用模型在双语VQA任务中的表现。
链接: https://arxiv.org/abs/2510.02543
作者: JoonHo Lee,Sunho Park
机构: KL-Net
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We investigate OCR-augmented generation with Vision Language Models (VLMs), exploring tasks in Korean and English toward multilingualism. To support research in this domain, we train and release KLOCR, a strong bilingual OCR baseline trained on 100M instances to augment VLMs with OCR ability. To complement existing VQA benchmarks, we curate KOCRBench for Korean VQA, and analyze different prompting methods. Extensive experiments show that OCR-extracted text significantly boosts performance across open source and commercial models. Our work offers new insights into OCR-augmented generation for bilingual VQA. Model, code, and data are available at this https URL.
zh
[CV-62] Secure and Robust Watermarking for AI-generated Images: A Comprehensive Survey
【速读】:该论文旨在解决生成式 AI (Generative AI) 生成图像在知识产权保护、真实性验证和责任归属方面引发的严峻挑战。其解决方案的关键在于水印技术(watermarking),通过将不可见或可见的水印嵌入 AI 生成图像中,实现对图像来源的追踪与区分,确保内容的真实性与可追溯性,从而构建可信的数字生态系统。论文系统梳理了水印系统的建模、各类技术方案、评估指标、攻击脆弱性及未来研究方向,为相关领域的研究人员提供全面的技术参考与理论支撑。
链接: https://arxiv.org/abs/2510.02384
作者: Jie Cao,Qi Li,Zelin Zhang,Jianbing Ni
机构: Queen’s University (皇后大学)
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The rapid advancement of generative artificial intelligence (Gen-AI) has facilitated the effortless creation of high-quality images, while simultaneously raising critical concerns regarding intellectual property protection, authenticity, and accountability. Watermarking has emerged as a promising solution to these challenges by distinguishing AI-generated images from natural content, ensuring provenance, and fostering trustworthy digital ecosystems. This paper presents a comprehensive survey of the current state of AI-generated image watermarking, addressing five key dimensions: (1) formalization of image watermarking systems; (2) an overview and comparison of diverse watermarking techniques; (3) evaluation methodologies with respect to visual quality, capacity, and detectability; (4) vulnerabilities to malicious attacks; and (5) prevailing challenges and future directions. The survey aims to equip researchers with a holistic understanding of AI-generated image watermarking technologies, thereby promoting their continued development.
zh
[CV-63] Representation Learning for Compressed Video Action Recognition via Attentive Cross-modal Interaction with Motion Enhancement IJCAI2022
【速读】:该论文旨在解决压缩视频动作识别中因稀疏采样和压缩运动信息导致的动态细节粗略、噪声大以及RGB与运动模态融合不足的问题。解决方案的关键在于提出一种名为MEACI-Net(Motion Enhancement and Attentive Cross-modal Interaction Network)的新框架,其核心创新包括:1)在运动流中引入嵌入去噪模块的多尺度块以增强运动表示学习;2)通过选择性运动补全(Selective Motion Complement, SMC)模块,利用时空注意力机制将局部运动特征补充至RGB模态;3)借助跨模态增强(Cross-Modality Augment, CMA)模块实现两模态的选择性特征融合,从而有效提升模态间交互质量与整体识别性能。
链接: https://arxiv.org/abs/2205.03569
作者: Bing Li,Jiaxin Chen,Dongming Zhang,Xiuguo Bao,Di Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to IJCAI 2022
Abstract:Compressed video action recognition has recently drawn growing attention, since it remarkably reduces the storage and computational cost via replacing raw videos by sparsely sampled RGB frames and compressed motion cues (e.g., motion vectors and residuals). However, this task severely suffers from the coarse and noisy dynamics and the insufficient fusion of the heterogeneous RGB and motion modalities. To address the two issues above, this paper proposes a novel framework, namely Attentive Cross-modal Interaction Network with Motion Enhancement (MEACI-Net). It follows the two-stream architecture, i.e. one for the RGB modality and the other for the motion modality. Particularly, the motion stream employs a multi-scale block embedded with a denoising module to enhance representation learning. The interaction between the two streams is then strengthened by introducing the Selective Motion Complement (SMC) and Cross-Modality Augment (CMA) modules, where SMC complements the RGB modality with spatio-temporally attentive local motion features and CMA further combines the two modalities with selective feature augmentation. Extensive experiments on the UCF-101, HMDB-51 and Kinetics-400 benchmarks demonstrate the effectiveness and efficiency of MEACI-Net.
zh
[CV-64] Wave-GMS: Lightweight Multi-Scale Generative Model for Medical Image Segmentation
【速读】:该论文旨在解决医疗影像分割中深度分割网络在资源受限设备(如内存有限的GPU)上训练效率低、参数量大以及跨域泛化能力弱的问题。其核心解决方案是提出Wave-GMS,一种轻量级多尺度生成式模型,通过显著减少可训练参数(仅约2.6M)并避免加载内存密集型预训练视觉基础模型,实现高效训练与高精度分割的平衡,同时支持在小内存GPU上使用大批次训练,并展现出优异的跨数据集泛化性能。
链接: https://arxiv.org/abs/2510.03216
作者: Talha Ahmed,Nehal Ahmed Shaikh,Hassan Mohy-ud-Din
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 1 figure, 4 tables; Submitted to IEEE Conference for possible publication
Abstract:For equitable deployment of AI tools in hospitals and healthcare facilities, we need Deep Segmentation Networks that offer high performance and can be trained on cost-effective GPUs with limited memory and large batch sizes. In this work, we propose Wave-GMS, a lightweight and efficient multi-scale generative model for medical image segmentation. Wave-GMS has a substantially smaller number of trainable parameters, does not require loading memory-intensive pretrained vision foundation models, and supports training with large batch sizes on GPUs with limited memory. We conducted extensive experiments on four publicly available datasets (BUS, BUSI, Kvasir-Instrument, and HAM10000), demonstrating that Wave-GMS achieves state-of-the-art segmentation performance with superior cross-domain generalizability, while requiring only ~2.6M trainable parameters. Code is available at this https URL.
zh
[CV-65] Neural Posterior Estimation with Autoregressive Tiling for Detecting Objects in Astronomical Images
【速读】:该论文旨在解决天文图像中微弱且相互重叠的天体(astronomical objects)检测与表征问题,这是高分辨率天文巡天数据处理中的核心挑战。其解决方案的关键在于提出了一种基于空间自回归结构的变分推断(amortized variational inference)方法,通过构建一种以 K-色棋盘图案(K-color checkerboard pattern)划分并排序潜空间(latent space)的变分分布族,使条件独立性结构与后验分布一致;该变分分布由卷积神经网络参数化,并采用神经后验估计(Neural Posterior Estimation, NPE)最小化前向KL散度进行训练,从而在斯隆数字巡天(Sloan Digital Sky Survey)图像上实现最优性能,并显著提升后验校准效果。
链接: https://arxiv.org/abs/2510.03074
作者: Jeffrey Regier
机构: University of Michigan (密歇根大学)
类目: Applications (stat.AP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Upcoming astronomical surveys will produce petabytes of high-resolution images of the night sky, providing information about billions of stars and galaxies. Detecting and characterizing the astronomical objects in these images is a fundamental task in astronomy – and a challenging one, as most of these objects are faint and many visually overlap with other objects. We propose an amortized variational inference procedure to solve this instance of small-object detection. Our key innovation is a family of spatially autoregressive variational distributions that partition and order the latent space according to a K -color checkerboard pattern. By construction, the conditional independencies of this variational family mirror those of the posterior distribution. We fit the variational distribution, which is parameterized by a convolutional neural network, using neural posterior estimation (NPE) to minimize an expectation of the forward KL divergence. Using images from the Sloan Digital Sky Survey, our method achieves state-of-the-art performance. We further demonstrate that the proposed autoregressive structure greatly improves posterior calibration.
zh
[CV-66] GCVAMD: A Modified CausalVAE Model for Causal Age-related Macular Degeneration Risk Factor Detection and Prediction
【速读】:该论文旨在解决当前基于深度学习的年龄相关性黄斑变性(Age Related Macular Degeneration, AMD)检测模型普遍缺乏对疾病病理机制解释能力的问题,尤其是无法识别和分析关键致病因素(如玻璃膜疣 drusen 和新生血管 neovascularization)的因果作用,从而限制了干预分析与临床决策的可靠性。解决方案的关键在于提出一种新颖的因果AMD分析模型GCVAMD,其核心是引入改进的因果变分自编码器(CausalVAE)方法,仅从原始光学相干断层扫描(OCT)图像中提取潜在的因果特征,并在隐空间中建模drusen和neovascularization等关键病理状态的因果机制,从而实现从AMD分类到治疗模拟与干预分析的多任务支持。
链接: https://arxiv.org/abs/2510.02781
作者: Daeyoung Kim
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Age Related Macular Degeneration(AMD) has been one of the most leading causes of permanent vision impairment in ophthalmology. Though treatments, such as anti VEGF drugs or photodynamic therapies, were developed to slow down the degenerative process of AMD, there is still no specific cure to reverse vision loss caused by AMD. Thus, for AMD, detecting existence of risk factors of AMD or AMD itself within the patient retina in early stages is a crucial task to reduce the possibility of vision impairment. Apart from traditional approaches, deep learning based methods, especially attention mechanism based CNNs and GradCAM based XAI analysis on OCT scans, exhibited successful performance in distinguishing AMD retina from normal retinas, making it possible to use AI driven models to aid medical diagnosis and analysis by ophthalmologists regarding AMD. However, though having significant success, previous works mostly focused on prediction performance itself, not pathologies or underlying causal mechanisms of AMD, which can prohibit intervention analysis on specific factors or even lead to less reliable decisions. Thus, this paper introduces a novel causal AMD analysis model: GCVAMD, which incorporates a modified CausalVAE approach that can extract latent causal factors from only raw OCT images. By considering causality in AMD detection, GCVAMD enables causal inference such as treatment simulation or intervention analysis regarding major risk factors: drusen and neovascularization, while returning informative latent causal features that can enhance downstream tasks. Results show that through GCVAMD, drusen status and neovascularization status can be identified with AMD causal mechanisms in GCVAMD latent spaces, which can in turn be used for various tasks from AMD detection(classification) to intervention analysis.
zh
[CV-67] Image Enhancement Based on Pigment Representation
【速读】:该论文旨在解决传统图像增强方法在颜色变换过程中受限于预定义色彩空间(如RGB)而导致适应性不足、表达能力有限的问题。其解决方案的关键在于提出了一种基于色素表示(pigment representation)的新方法,通过将输入的RGB颜色动态映射到高维特征空间——即“色素空间”,实现对图像内容的自适应处理;在此空间中,色素被单独重投影并融合以聚合颜色信息,随后再转换回RGB空间生成增强图像。整个过程中的变换与重投影参数由视觉编码器自适应估计,从而在保持较低计算复杂度和模型规模的同时,显著提升了图像增强任务(如图像修饰与色调映射)的性能。
链接: https://arxiv.org/abs/2510.02713
作者: Se-Ho Lee,Keunsoo Ko,Seung-Wook Kim
机构: Jeonbuk National University (全北国立大学); The Catholic University of Korea (天主教大学); Pukyong National University (釜庆国立大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 9 figures, accepted at IEEE Transactions on Multimedia (TMM)
Abstract:This paper presents a novel and efficient image enhancement method based on pigment representation. Unlike conventional methods where the color transformation is restricted to pre-defined color spaces like RGB, our method dynamically adapts to input content by transforming RGB colors into a high-dimensional feature space referred to as \textitpigments. The proposed pigment representation offers adaptability and expressiveness, achieving superior image enhancement performance. The proposed method involves transforming input RGB colors into high-dimensional pigments, which are then reprojected individually and blended to refine and aggregate the information of the colors in pigment spaces. Those pigments are then transformed back into RGB colors to generate an enhanced output image. The transformation and reprojection parameters are derived from the visual encoder which adaptively estimates such parameters based on the content in the input image. Extensive experimental results demonstrate the superior performance of the proposed method over state-of-the-art methods in image enhancement tasks, including image retouching and tone mapping, while maintaining relatively low computational complexity and small model size.
zh
[CV-68] A UAV-Based VNIR Hyperspectral Benchmark Dataset for Landmine and UXO Detection
【速读】:该论文旨在解决当前公开可用的基于无人机(UAV)平台的可见光与近红外(VNIR)高光谱影像数据在地雷及未爆弹药(UXO)探测研究中严重匮乏的问题。其解决方案的关键在于构建并发布一个高质量、多场景标注的VNIR高光谱数据集,该数据集通过搭载Headwall Nano-Hyperspec传感器的多传感器无人机平台,在约20.6米飞行高度下采集了覆盖398–1002 nm波段的270个连续光谱通道数据,包含143个真实感模拟地雷和UXO目标(包括裸露、部分埋藏和完全埋藏状态)。关键处理步骤包括辐射定标、正射校正、拼接以及采用两点经验线法(Empirical Line Method, ELM)进行反射率反演,验证结果显示在400–900 nm范围内均方根误差(RMSE)低于1.0,光谱角度距离(SAM)介于1–6度之间,表明具有优异的光谱保真度。该数据集配套提供原始辐亮度立方体、地面控制点(GCP)/航测点数据及参考光谱,可支持可复现的研究,并与已有同场地磁感应(EMI)数据结合形成多传感器基准测试平台。
链接: https://arxiv.org/abs/2510.02700
作者: Sagar Lekhak,Emmett J. Ientilucci,Jasper Baur,Susmita Ghosh
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: This work has been accepted and will be presented at the Indian Geoscience and Remote Sensing Symposium (InGARSS) 2025 in India and will appear in the IEEE InGARSS 2025 Proceedings
Abstract:This paper introduces a novel benchmark dataset of Visible and Near-Infrared (VNIR) hyperspectral imagery acquired via an unmanned aerial vehicle (UAV) platform for landmine and unexploded ordnance (UXO) detection research. The dataset was collected over a controlled test field seeded with 143 realistic surrogate landmine and UXO targets, including surface, partially buried, and fully buried configurations. Data acquisition was performed using a Headwall Nano-Hyperspec sensor mounted on a multi-sensor drone platform, flown at an altitude of approximately 20.6 m, capturing 270 contiguous spectral bands spanning 398-1002 nm. Radiometric calibration, orthorectification, and mosaicking were performed followed by reflectance retrieval using a two-point Empirical Line Method (ELM), with reference spectra acquired using an SVC spectroradiometer. Cross-validation against six reference objects yielded RMSE values below 1.0 and SAM values between 1 and 6 degrees in the 400-900 nm range, demonstrating high spectral fidelity. The dataset is released alongside raw radiance cubes, GCP/AeroPoint data, and reference spectra to support reproducible research. This contribution fills a critical gap in open-access UAV-based hyperspectral data for landmine detection and offers a multi-sensor benchmark when combined with previously published drone-based electromagnetic induction (EMI) data from the same test field.
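下面用 NumPy 给出两点经验线法(ELM)反射率反演与光谱角(SAM)计算的示意;暗/亮参考目标的辐亮度与反射率均为合成数值,仅说明流程,真实参考光谱来自论文的 SVC 光谱仪测量。

```python
# Sketch of two-point Empirical Line Method (ELM) and the SAM metric,
# with synthetic numbers standing in for the real reference spectra (assumption).
import numpy as np

def elm_fit(radiance_dark, radiance_bright, refl_dark, refl_bright):
    """Per-band gain/offset so that reflectance = gain * radiance + offset."""
    gain = (refl_bright - refl_dark) / (radiance_bright - radiance_dark)
    offset = refl_dark - gain * radiance_dark
    return gain, offset

def spectral_angle_deg(a, b):
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

bands = 270  # matches the 398-1002 nm sensor band count
rad_dark, rad_bright = np.random.rand(bands) * 10, 50 + np.random.rand(bands) * 10
refl_dark, refl_bright = np.full(bands, 0.05), np.full(bands, 0.6)
gain, offset = elm_fit(rad_dark, rad_bright, refl_dark, refl_bright)

pixel = rad_dark + 0.5 * (rad_bright - rad_dark)   # one pixel's radiance spectrum
reflectance = gain * pixel + offset
reference = np.full(bands, 0.325)                  # hypothetical ground-truth spectrum
print(spectral_angle_deg(reflectance, reference))  # ~0 degrees here
```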
zh
[CV-69] Learning a distance measure from the information-estimation geometry of data
【速读】:该论文旨在解决如何构建一种能够准确衡量信号间距离的全局度量方法,尤其在复杂分布下需同时具备局部和全局几何适应性。传统度量如马氏距离(Mahalanobis distance)仅适用于高斯分布,难以刻画非线性或高维数据的真实结构。解决方案的关键在于提出信息估计度量(Information-Estimation Metric, IEM),其核心思想是利用信息论与估计理论之间的深刻联系——即信号的对数概率与其最优去噪器(denoiser)在不同噪声水平下的误差向量相关联。通过比较两个信号在多尺度噪声下的去噪误差向量,IEM 在几何上等价于对比模糊密度场在其周围区域的得分向量场(score vector fields)。该度量被证明为合法的全局度量,并可推导出局部二阶近似形式,形成黎曼度量(Riemannian metric)。实验表明,基于学习到的去噪器(类似生成扩散模型)计算的 IEM,在 ImageNet 上能有效预测人类感知判断,性能优于或媲美当前最先进的监督图像质量评估指标。
链接: https://arxiv.org/abs/2510.02514
作者: Guy Ohayon,Pierre-Etienne H. Fiquet,Florentin Guth,Jona Ballé,Eero P. Simoncelli
机构: Flatiron Institute (纽约大学高级科学研究所); New York University (纽约大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Signal Processing (eess.SP); Machine Learning (stat.ML)
备注: Code available at this https URL
Abstract:We introduce the Information-Estimation Metric (IEM), a novel form of distance function derived from an underlying continuous probability density over a domain of signals. The IEM is rooted in a fundamental relationship between information theory and estimation theory, which links the log-probability of a signal with the errors of an optimal denoiser, applied to noisy observations of the signal. In particular, the IEM between a pair of signals is obtained by comparing their denoising error vectors over a range of noise amplitudes. Geometrically, this amounts to comparing the score vector fields of the blurred density around the signals over a range of blur levels. We prove that the IEM is a valid global metric and derive a closed-form expression for its local second-order approximation, which yields a Riemannian metric. For Gaussian-distributed signals, the IEM coincides with the Mahalanobis distance. But for more complex distributions, it adapts, both locally and globally, to the geometry of the distribution. In practice, the IEM can be computed using a learned denoiser (analogous to generative diffusion models) and solving a one-dimensional integral. To demonstrate the value of our framework, we learn an IEM on the ImageNet database. Experiments show that this IEM is competitive with or outperforms state-of-the-art supervised image quality metrics in predicting human perceptual judgments.
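以下是 IEM 思想的一个概念性示意:在一系列噪声幅度下比较两个信号的去噪误差向量,并对差异做粗糙的一维数值积分;这里的去噪器是高斯先验下的玩具最优去噪器,真实场景应替换为学习得到的去噪器。

```python
# Conceptual sketch of the IEM: compare denoiser-error vectors across noise
# amplitudes; the denoiser below is a toy stand-in, not the paper's model.
import numpy as np

def denoiser(y, sigma):
    # Optimal denoiser for a zero-mean unit-variance Gaussian prior:
    # E[x | y] = y / (1 + sigma^2). A learned denoiser would replace this.
    return y / (1.0 + sigma**2)

def iem(x1, x2, sigmas, n_noise=64, rng=np.random.default_rng(0)):
    total = 0.0
    for s in sigmas:
        eps = rng.standard_normal((n_noise, x1.size))
        # Mean denoising-error vector for each signal at this noise amplitude.
        e1 = (denoiser(x1 + s * eps, s) - x1).mean(axis=0)
        e2 = (denoiser(x2 + s * eps, s) - x2).mean(axis=0)
        total += np.sum((e1 - e2) ** 2)
    return np.sqrt(total / len(sigmas))  # crude quadrature over noise levels

x1, x2 = np.zeros(16), np.ones(16)
print(iem(x1, x2, sigmas=np.linspace(0.1, 2.0, 10)))
```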
zh
[CV-70] Glaucoma Detection and Structured OCT Report Generation via a Fine-tuned Multimodal Large Language Model
【速读】:该论文旨在解决眼科临床中光学相干断层扫描(OCT)图像质量筛查与自动化结构化报告生成的难题,特别是针对视盘(ONH)OCT环形扫描在青光眼诊断中的应用。其解决方案的关键在于开发并验证了一个可解释的多模态大语言模型(MM-LLM),基于Llama 3.2 Vision-Instruct模型进行微调,通过配对的OCT图像与自动生成的结构化临床报告(包含全局及节段性视网膜神经纤维层[RNFL]变薄评估)进行训练,同时将低质量图像标记为不可用并关联固定拒绝语句。该模型在图像质量分诊、青光眼检测和RNFL节段性变薄分类三项任务上均表现出高准确率(如质量筛查准确率达0.90,青光眼检测F1-score达0.91),且生成的文本描述与参考报告高度一致(BERTScore-F1达0.99),从而实现了从影像到临床决策支持的端到端自动化分析。
链接: https://arxiv.org/abs/2510.02403
作者: Jalil Jalili,Yashraj Gavhane,Evan Walker,Anna Heinke,Christopher Bowd,Akram Belghith,Massimo A. Fazio,Christopher A. Girkin,C. Gustavo De Moraes,Jeffrey M. Liebmann,Sally L. Baxter,Robert N. Weinreb,Linda M. Zangwill,Mark Christopher
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Objective: To develop an explainable multimodal large language model (MM-LLM) that (1) screens optic nerve head (ONH) OCT circle scans for quality and (2) generates structured clinical reports that include glaucoma diagnosis and sector-wise retinal nerve fiber layer (RNFL) thinning assessments. Design: Retrospective cohort study of 1,310 subjects contributing 43,849 Spectralis ONH OCT circle scans (1,331 glaucomatous and 867 healthy eyes) from the DIGS and ADAGES cohorts. Methods: A MM-LLM (Llama 3.2 Vision-Instruct model) was fine-tuned to generate clinical descriptions of OCT imaging data. Training data included paired OCT images and automatically generated, structured clinical reports that described global and sectoral RNFL thinning. Poor-quality scans were labeled as unusable and paired with a fixed refusal statement. The model was evaluated on a held-out test set for three tasks: quality assessment, glaucoma detection, and RNFL thinning classification across seven anatomical sectors. Evaluation metrics included accuracy, sensitivity, specificity, precision, and F1-score. Model description quality was also evaluated using standard text evaluation metrics. Results: The model achieved 0.90 accuracy and 0.98 specificity for quality triage. For glaucoma detection, accuracy was 0.86 (sensitivity 0.91, specificity 0.73, F1-score 0.91). RNFL thinning prediction accuracy ranged from 0.83 to 0.94, with highest performance in global and temporal sectors. Text generation scores showed strong alignment with reference reports (BLEU: 0.82; ROUGE-1: 0.94; ROUGE-2: 0.87; ROUGE-L: 0.92; BERTScore-F1: 0.99). Conclusions: The fine-tuned MM-LLM generated accurate clinical descriptions based on OCT imaging. The model achieved high accuracy in identifying image quality issues and detecting glaucoma. The model also provided sectoral descriptions of RNFL thinning to help support clinical OCT evaluation.
zh
人工智能
[AI-0] Abstain and Validate: A Dual-LLM Policy for Reducing Noise in Agent ic Program Repair
【速读】:该论文旨在解决工业级生成式自动程序修复(Agentic Automated Program Repair, APR)系统中因生成低质量或不相关补丁而导致的开发者信任下降和效率损耗问题。核心挑战在于,尽管自动化修复系统能够生成大量候选补丁,但其中多数无法真正修复目标缺陷(bug),反而增加了人工审查的噪声。解决方案的关键在于提出两种互补的基于大语言模型(LLM)的策略:一是缺陷回避策略(bug abstention policy),用于排除系统难以修复的缺陷;二是补丁验证策略(patch validation policy),用于过滤掉不符合修复质量要求的补丁。实验证明,这两种策略在谷歌代码库中的多类缺陷上显著提升了修复成功率,组合使用时最高可提升39个百分点,为实现可靠、可扩展的工业级应用提供了可行路径。
链接: https://arxiv.org/abs/2510.03217
作者: José Cambronero,Michele Tufano,Sherry Shi,Renyao Wei,Grant Uy,Runxiang Cheng,Chin-Jung Liu,Shiying Pan,Satish Chandra,Pat Rondon
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Agentic Automated Program Repair (APR) is increasingly tackling complex, repository-level bugs in industry, but ultimately agent-generated patches still need to be reviewed by a human before committing them to ensure they address the bug. Showing unlikely patches to developers can lead to substantial noise, wasting valuable developer time and eroding trust in automated code changes. We introduce two complementary LLM-based policies to reduce such noise: bug abstention and patch validation policies. Bug abstention excludes bugs that the agentic APR system is unlikely to fix. Patch validation rejects patches that are unlikely to be a good fix for the given bug. We evaluate both policies on three sets of bugs from Google’s codebase, and their candidate patches generated by an internal agentic APR system. On a set of 174 human-reported bugs, removing bugs and patch trajectories rejected by our policies can raise success rates by up to 13 percentage points and 15 percentage points, respectively, and by up to 39 percentage points in combination. On null pointer exceptions and sanitizer-reported bugs with machine-generated bug reports, patch validation also improves average single-sample success rates. This two-policy approach provides a practical path to the reliable, industrial-scale deployment of agentic APR systems.
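下面给出"先回避、再验证"双策略门控流程的一个极简示意(判别提示词与函数名均为假设,`llm` 代表任意聊天补全调用):

```python
# Minimal sketch of the two-policy gate; prompts and names are hypothetical.
from dataclasses import dataclass

@dataclass
class Bug:
    report: str

def llm(prompt: str) -> str:
    # Placeholder for an LLM call; returns "yes"/"no" in this sketch.
    return "yes"

def should_abstain(bug: Bug) -> bool:
    """Bug abstention: skip bugs the APR system is unlikely to fix."""
    verdict = llm(f"Can an automated repair agent plausibly fix this bug?\n{bug.report}")
    return verdict.strip().lower().startswith("no")

def patch_is_valid(bug: Bug, patch: str) -> bool:
    """Patch validation: reject patches unlikely to be a good fix."""
    verdict = llm(f"Does this patch correctly address the bug?\nBug: {bug.report}\nPatch: {patch}")
    return verdict.strip().lower().startswith("yes")

def triage(bug: Bug, candidate_patches: list[str]) -> list[str]:
    if should_abstain(bug):
        return []  # never shown to a developer
    return [p for p in candidate_patches if patch_is_valid(bug, p)]

print(triage(Bug("NullPointerException in parser"), ["fix A", "fix B"]))
```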
zh
[AI-1] CoDA: Agent ic Systems for Collaborative Data Visualization
【速读】:该论文旨在解决数据科学中可视化自动化不足的问题,即当前系统难以有效处理多文件复杂数据集,并在迭代优化过程中缺乏对代码错误和可视化质量的鲁棒性管理。解决方案的关键在于将任务重构为协作式多智能体(multi-agent)问题,提出CoDA系统,通过专业化的大语言模型(LLM)智能体分别负责元数据分析、任务规划、代码生成与自我反思,形成结构化流水线;其中元数据驱动的分析策略规避了令牌(token)限制,而以质量为导向的迭代优化机制保障了最终输出的可靠性与准确性。
链接: https://arxiv.org/abs/2510.03194
作者: Zichen Chen,Jiefeng Chen,Sercan Ö. Arik,Misha Sra,Tomas Pfister,Jinsung Yoon
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 31 pages, 6 figures, 5 tables
Abstract:Deep research has revolutionized data analysis, yet data scientists still devote substantial time to manually crafting visualizations, highlighting the need for robust automation from natural language queries. However, current systems struggle with complex datasets containing multiple files and iterative refinement. Existing approaches, including simple single- or multi-agent systems, often oversimplify the task, focusing on initial query parsing while failing to robustly manage data complexity, code errors, or final visualization quality. In this paper, we reframe this challenge as a collaborative multi-agent problem. We introduce CoDA, a multi-agent system that employs specialized LLM agents for metadata analysis, task planning, code generation, and self-reflection. We formalize this pipeline, demonstrating how metadata-focused analysis bypasses token limits and quality-driven refinement ensures robustness. Extensive evaluations show CoDA achieves substantial gains in the overall score, outperforming competitive baselines by up to 41.5%. This work demonstrates that the future of visualization automation lies not in isolated code generation but in integrated, collaborative agentic workflows.
zh
[AI-2] Improving Cooperation in Collaborative Embodied AI
【速读】:该论文旨在解决多智能体系统中协作行为与决策效率低下的问题,尤其是在共享虚拟空间中利用大语言模型(Large Language Models, LLMs)进行通信、推理和任务协调时的性能瓶颈。其解决方案的关键在于对提示工程(prompt engineering)策略进行系统性优化,并在此基础上改进CoELA框架以增强代理间的协作能力;同时,通过集成语音交互功能,进一步提升人机协同的沉浸感与迭代开发效率。实验表明,最优提示组合可使基于Gemma3的系统效率提升22%,验证了提示优化在提升多智能体协作性能中的核心作用。
链接: https://arxiv.org/abs/2510.03153
作者: Hima Jacob Leven Suprabha,Laxmi Nag Laxminarayan Nagesh,Ajith Nair,Alvin Reuben Amal Selvaster,Ayan Khan,Raghuram Damarla,Sanju Hannah Samuel,Sreenithi Saravana Perumal,Titouan Puech,Venkataramireddy Marella,Vishal Sonar,Alessandro Suglia,Oliver Lemon
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO)
备注: In proceedings of UKCI 2025
Abstract:The integration of Large Language Models (LLMs) into multiagent systems has opened new possibilities for collaborative reasoning and cooperation with AI agents. This paper explores different prompting methods and evaluates their effectiveness in enhancing agent collaborative behaviour and decision-making. We enhance CoELA, a framework designed for building Collaborative Embodied Agents that leverage LLMs for multi-agent communication, reasoning, and task coordination in shared virtual spaces. Through systematic experimentation, we examine different LLMs and prompt engineering strategies to identify optimised combinations that maximise collaboration performance. Furthermore, we extend our research by integrating speech capabilities, enabling seamless collaborative voice-based interactions. Our findings highlight the effectiveness of prompt optimisation in enhancing collaborative agent performance; for example, our best combination improved the efficiency of the system running with Gemma3 by 22% compared to the original CoELA system. In addition, the speech integration provides a more engaging user interface for iterative system development and demonstrations.
zh
[AI-3] Signature-Informed Transformer for Asset Allocation
【速读】:该论文旨在解决量化金融中资产配置的鲁棒性问题,特别是深度学习预测模型因目标函数不匹配和误差放大而导致性能不佳的挑战。解决方案的关键在于提出Signature-Informed Transformer (SIT) 框架,其核心创新包括:利用路径签名(path signatures)构建资产动态的几何丰富表示,以及设计一种签名增强的注意力机制,将金融先验知识(如领先-滞后效应)嵌入模型中,从而直接优化风险感知的财务目标。这一方法在标普100日度数据上的实证表明,SIT显著优于传统与深度学习基线模型,尤其在与“预测-再优化”范式对比时表现突出,验证了组合感知目标与几何感知归纳偏置对机器学习系统中风险敏感资本配置的重要性。
链接: https://arxiv.org/abs/2510.03129
作者: Yoontae Hwang,Stefan Zohren
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Portfolio Management (q-fin.PM)
备注:
Abstract:Robust asset allocation is a key challenge in quantitative finance, where deep-learning forecasters often fail due to objective mismatch and error amplification. We introduce the Signature-Informed Transformer (SIT), a novel framework that learns end-to-end allocation policies by directly optimizing a risk-aware financial objective. SIT’s core innovations include path signatures for a rich geometric representation of asset dynamics and a signature-augmented attention mechanism embedding financial inductive biases, like lead-lag effects, into the model. Evaluated on daily S&P 100 equity data, SIT decisively outperforms traditional and deep-learning baselines, especially when compared to predict-then-optimize models. These results indicate that portfolio-aware objectives and geometry-aware inductive biases are essential for risk-aware capital allocation in machine-learning systems. The code is available at: this https URL
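下面给出二阶路径签名(path signature)的一个小型 NumPy 示意:一阶项收集总增量,二阶项收集迭代积分,其反对称部分(Lévy 面积)正对应论文强调的领先-滞后效应;对分段线性路径,该离散计算是精确的。

```python
# Depth-2 path signature for a toy 2-asset path (illustrative, not SIT's code).
import numpy as np

def signature_depth2(path):
    """path: (T, d) array. Returns (level1, level2) of the discrete signature."""
    inc = np.diff(path, axis=0)                # (T-1, d) increments
    level1 = inc.sum(axis=0)                   # S^(i): total increments
    d = path.shape[1]
    level2 = np.zeros((d, d))
    running = np.zeros(d)                      # increment accumulated so far
    for step in inc:
        # Exact iterated integral for a piecewise-linear path.
        level2 += np.outer(running, step) + 0.5 * np.outer(step, step)
        running += step
    return level1, level2

t = np.linspace(0, 1, 100)
# Asset 2 lags asset 1 by a phase shift: a lead-lag relationship.
path = np.stack([np.sin(2 * np.pi * t), np.sin(2 * np.pi * (t - 0.1))], axis=1)
lvl1, lvl2 = signature_depth2(path)
print("Levy area (lead-lag):", 0.5 * (lvl2[0, 1] - lvl2[1, 0]))
```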
zh
[AI-4] A Study of Rule Omission in Ravens Progressive Matrices
【速读】:该论文旨在解决当前人工智能系统在抽象推理任务中是否存在真正推理能力的问题,特别是针对视觉和语言模型是否依赖统计捷径而非深层规则理解来完成瑞文渐进矩阵(Raven’s Progressive Matrices, RPM)任务。其解决方案的关键在于构建一个名为Impartial-RAVEN(I-RAVEN)的数据集,通过在训练阶段故意省略部分结构规则,系统性地评估模型在面对未见过的规则时的泛化能力。实验表明,尽管基于Transformer的序列到序列模型和视觉架构如CoPINet与Dual-Contrast Network在熟悉规则上表现优异,但在新规则下准确率显著下降,且标记级准确率与完整答案准确率之间存在显著差距,揭示了现有模型在抽象推理上的根本局限性,强调了发展超越模式识别、具备鲁棒抽象推理能力的新架构的重要性。
链接: https://arxiv.org/abs/2510.03127
作者: Binze Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Analogical reasoning lies at the core of human cognition and remains a fundamental challenge for artificial intelligence. Raven’s Progressive Matrices (RPM) serve as a widely used benchmark to assess abstract reasoning by requiring the inference of underlying structural rules. While many vision-based and language-based models have achieved success on RPM tasks, it remains unclear whether their performance reflects genuine reasoning ability or reliance on statistical shortcuts. This study investigates the generalization capacity of modern AI systems under conditions of incomplete training by deliberately omitting several structural rules during training. Both sequence-to-sequence transformer models and vision-based architectures such as CoPINet and the Dual-Contrast Network are evaluated on the Impartial-RAVEN (I-RAVEN) dataset. Experiments reveal that although transformers demonstrate strong performance on familiar rules, their accuracy declines sharply when faced with novel or omitted rules. Moreover, the gap between token-level accuracy and complete answer accuracy highlights fundamental limitations in current approaches. These findings provide new insights into the reasoning mechanisms underlying deep learning models and underscore the need for architectures that move beyond pattern recognition toward robust abstract reasoning.
zh
[AI-5] Distilled Protein Backbone Generation
【速读】:该论文旨在解决基于扩散模型(diffusion-based models)和流模型(flow-based models)进行蛋白质骨架生成时存在的计算效率低下问题,即反向扩散过程通常需要数百次迭代步骤,导致推理速度缓慢,难以满足大规模蛋白质设计的需求。解决方案的关键在于引入并优化Score Identity Distillation(SiD)这一得分蒸馏技术,通过多步生成(multistep generation)与推理阶段噪声调制(inference time noise modulation)的协同策略,使训练出的少步生成器在仅需少量采样步骤的情况下仍能保持与预训练教师模型相当的设计性(designability)、多样性(diversity)和新颖性(novelty),从而实现超过20倍的采样速度提升,显著降低推理成本,推动扩散模型在实际蛋白工程中的应用落地。
链接: https://arxiv.org/abs/2510.03095
作者: Liyang Xie,Haoran Zhang,Zhendong Wang,Wesley Tansey,Mingyuan Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Diffusion- and flow-based generative models have recently demonstrated strong performance in protein backbone generation tasks, offering unprecedented capabilities for de novo protein design. However, while achieving notable performance in generation quality, these models are limited by their generating speed, often requiring hundreds of iterative steps in the reverse-diffusion process. This computational bottleneck limits their practical utility in large-scale protein discovery, where thousands to millions of candidate structures are needed. To address this challenge, we explore the techniques of score distillation, which has shown great success in reducing the number of sampling steps in the vision domain while maintaining high generation quality. However, a straightforward adaptation of these methods results in unacceptably low designability. Through extensive study, we have identified how to appropriately adapt Score identity Distillation (SiD), a state-of-the-art score distillation strategy, to train few-step protein backbone generators which significantly reduce sampling time, while maintaining comparable performance to their pretrained teacher model. In particular, multistep generation combined with inference time noise modulation is key to the success. We demonstrate that our distilled few-step generators achieve more than a 20-fold improvement in sampling speed, while achieving similar levels of designability, diversity, and novelty as the Proteina teacher model. This reduction in inference cost enables large-scale in silico protein design, thereby bringing diffusion-based models closer to real-world protein engineering applications.
zh
[AI-6] From Facts to Foils: Designing and Evaluating Counterfactual Explanations for Smart Environments
【速读】:该论文旨在解决规则驱动的智能环境中缺乏可解释性方法的问题,尤其是如何生成适用于此类环境的反事实解释(counterfactual explanations)。当前虽在可解释人工智能(XAI)领域中反事实解释已被证明是一种有效工具,但尚未有成熟的方法将其应用于基于规则的智能系统。论文的关键解决方案是首次对该类场景下的反事实解释进行了形式化定义,并开发了一个插件式实现,集成到现有智能环境解释引擎中,从而支持生成针对用户问题情境的可操作性建议。实验结果表明,反事实解释在需要解决问题的场景中更受青睐,而因果解释则因语言简洁性在时间紧迫时更具优势,这为不同情境下解释类型的选择提供了实证依据。
链接: https://arxiv.org/abs/2510.03078
作者: Anna Trapp,Mersedeh Sadeghi,Andreas Vogelsang
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: Accepted at Ex-ASE 2025, co-located with the 40th IEEE/ACM International Conference on Automated Software Engineering (ASE 2025)
Abstract:Explainability is increasingly seen as an essential feature of rule-based smart environments. While counterfactual explanations, which describe what could have been done differently to achieve a desired outcome, are a powerful tool in eXplainable AI (XAI), no established methods exist for generating them in these rule-based domains. In this paper, we present the first formalization and implementation of counterfactual explanations tailored to this domain. It is implemented as a plugin that extends an existing explanation engine for smart environments. We conducted a user study (N=17) to evaluate our generated counterfactuals against traditional causal explanations. The results show that user preference is highly contextual: causal explanations are favored for their linguistic simplicity and in time-pressured situations, while counterfactuals are preferred for their actionable content, particularly when a user wants to resolve a problem. Our work contributes a practical framework for a new type of explanation in smart environments and provides empirical evidence to guide the choice of when each explanation type is most effective.
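下面用一个玩具规则环境示意反事实解释的生成思路:在候选编辑中搜索能使目标规则触发的最小状态改动集合;规则与状态变量均为示例假设,并非论文的插件实现。

```python
# Toy sketch of rule-based counterfactual search (hypothetical rule/state).
from itertools import combinations

state = {"motion": True, "lux": 300, "mode": "away"}
# "Lights on" rule: motion detected, it is dark, and the home is occupied.
rule = {"motion": True, "lux": lambda v: v < 150, "mode": "home"}

def satisfied(s):
    return all(cond(s[k]) if callable(cond) else s[k] == cond
               for k, cond in rule.items())

candidate_edits = {"lux": 100, "mode": "home", "motion": True}

def counterfactual(s):
    """Return the smallest set of edits that would have triggered the rule."""
    keys = list(candidate_edits)
    for r in range(1, len(keys) + 1):
        for combo in combinations(keys, r):
            trial = {**s, **{k: candidate_edits[k] for k in combo}}
            if satisfied(trial):
                return {k: (s[k], trial[k]) for k in combo}
    return None

print(counterfactual(state))  # {'lux': (300, 100), 'mode': ('away', 'home')}
```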
zh
[AI-7] A Unified Deep Reinforcement Learning Approach for Close Enough Traveling Salesman Problem
【速读】:该论文旨在解决近似可达旅行商问题(Close-Enough TSP, CETSP)的求解难题,其核心挑战在于节点访问依赖于邻域约束——即只有当智能体进入目标节点周围紧凑邻域时才视为已访问。传统深度强化学习(Deep Reinforcement Learning, DRL)方法在处理此类基于邻域的访问条件时表现受限。为此,作者提出了一种统一双解码器深度强化学习框架(Unified Dual-Decoder DRL, UD3RL),其关键创新在于将决策过程解耦为两个子任务:节点选择(node-decoder)与路径点确定(loc-decoder)。通过引入改进的编码器结构和k近邻子图交互策略,UD3RL显著增强了空间推理能力,并结合定制化的REINFORCE算法实现端到端训练,从而在不同问题规模、邻域半径类型(恒定或随机)以及动态环境中均展现出优越的解质量、计算效率与泛化性能。
链接: https://arxiv.org/abs/2510.03065
作者: Mingfeng Fan,Jiaqi Cheng,Yaoxin Wu,Yifeng Zhang,Yibin Yang,Guohua Wu,Guillaume Sartoretti
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:In recent years, deep reinforcement learning (DRL) has gained traction for solving the NP-hard traveling salesman problem (TSP). However, limited attention has been given to the close-enough TSP (CETSP), primarily due to the challenge introduced by its neighborhood-based visitation criterion, wherein a node is considered visited if the agent enters a compact neighborhood around it. In this work, we formulate a Markov decision process (MDP) for CETSP using a discretization scheme and propose a novel unified dual-decoder DRL (UD3RL) framework that separates decision-making into node selection and waypoint determination. Specifically, an adapted encoder is employed for effective feature extraction, followed by a node-decoder and a loc-decoder to handle the two sub-tasks, respectively. A k-nearest neighbors subgraph interaction strategy is further introduced to enhance spatial reasoning during location decoding. Furthermore, we customize the REINFORCE algorithm to train UD3RL as a unified model capable of generalizing across different problem sizes and varying neighborhood radius types (i.e., constant and random radii). Experimental results show that UD3RL outperforms conventional methods in both solution quality and runtime, while exhibiting strong generalization across problem scales, spatial distributions, and radius ranges, as well as robustness to dynamic environments.
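下面给出 CETSP 邻域访问判据的一个极简示意:当智能体的路径点进入某节点半径 r 的圆盘内,即视该节点为已访问;常数半径与随机路径点均为演示用假设。

```python
# Sketch of the CETSP "close enough" visitation criterion.
import numpy as np

def step(agent_xy, nodes_xy, radii, visited):
    """Mark every unvisited node whose neighborhood now contains the agent."""
    dist = np.linalg.norm(nodes_xy - agent_xy, axis=1)
    newly = (~visited) & (dist <= radii)
    return visited | newly

rng = np.random.default_rng(0)
nodes = rng.uniform(0, 10, size=(5, 2))
radii = np.full(5, 1.5)                    # constant radii; random radii also work
visited = np.zeros(5, dtype=bool)
for waypoint in rng.uniform(0, 10, size=(20, 2)):   # a candidate tour's waypoints
    visited = step(waypoint, nodes, radii, visited)
print(visited)
```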
zh
[AI-8] Comparative Analysis of Parameterized Action Actor-Critic Reinforcement Learning Algorithms for Web Search Match Plan Generation
【速读】:该论文旨在解决高维决策任务中参数化动作(Parametrized Action, PA)空间下的强化学习算法性能问题,特别是在完全可观测环境中如何实现高效且稳定的策略优化。其解决方案的关键在于提出并验证Parameterized Action Greedy Actor-Critic (PAGAC) 算法,该方法摒弃了对循环神经网络的依赖,直接在连续的动作参数空间中进行贪心策略更新,从而显著提升训练速度与收敛稳定性;实验表明,PAGAC 在 Platform-v0 和 Goal-v0 基准任务上均优于 SAC 和 TQC,实现了最快训练时间与最高回报,证明其在复杂动作空间中具备优越的效率和可靠性。
链接: https://arxiv.org/abs/2510.03064
作者: Ubayd Bapoo,Clement N Nyirenda
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 10th International Congress on Information and Communication Technology (ICICT 2025)
Abstract:This study evaluates the performance of Soft Actor Critic (SAC), Greedy Actor Critic (GAC), and Truncated Quantile Critics (TQC) in high-dimensional decision-making tasks using fully observable environments. The focus is on parametrized action (PA) spaces, eliminating the need for recurrent networks, with benchmarks Platform-v0 and Goal-v0 testing discrete actions linked to continuous action-parameter spaces. Hyperparameter optimization was performed with Microsoft NNI, ensuring reproducibility by modifying the codebase for GAC and TQC. Results show that Parameterized Action Greedy Actor-Critic (PAGAC) outperformed other algorithms, achieving the fastest training times and highest returns across benchmarks, completing 5,000 episodes in 41:24 for the Platform game and 24:04 for the Robot Soccer Goal game. Its speed and stability provide clear advantages in complex action spaces. Compared to PASAC and PATQC, PAGAC demonstrated superior efficiency and reliability, making it ideal for tasks requiring rapid convergence and robust performance. Future work could explore hybrid strategies combining entropy-regularization with truncation-based methods to enhance stability and expand investigations into generalizability.
zh
[AI-9] ZeroShotOpt: Towards Zero-Shot Pretrained Models for Efficient Black-Box Optimization
【速读】:该论文旨在解决昂贵的、无导数的黑箱函数全局优化问题,其核心挑战在于现有贝叶斯优化(Bayesian Optimization, BO)方法对代理模型和采集函数超参数的依赖性较强,这些参数通常需手动调优且难以在不同问题场景中泛化。解决方案的关键在于提出ZeroShotOpt——一种面向连续黑箱优化任务的通用预训练模型,通过在大规模优化轨迹数据(来自12种BO变体)上进行离线强化学习,并结合数百万个基于高斯过程(Gaussian Process)生成的具有多样化景观的合成函数进行预训练,从而学习可迁移的优化策略。这种设计使模型在未见过的基准测试中展现出鲁棒的零样本泛化能力,达到或超越主流全局优化器的采样效率,同时为后续研究提供可复用的基础框架。
链接: https://arxiv.org/abs/2510.03051
作者: Jamison Meindl,Yunsheng Tian,Tony Cui,Veronika Thost,Zhang-Wei Hong,Johannes Dürholt,Jie Chen,Wojciech Matusik,Mina Konaković Luković
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Global optimization of expensive, derivative-free black-box functions requires extreme sample efficiency. While Bayesian optimization (BO) is the current state-of-the-art, its performance hinges on surrogate and acquisition function hyper-parameters that are often hand-tuned and fail to generalize across problem landscapes. We present ZeroShotOpt, a general-purpose, pretrained model for continuous black-box optimization tasks ranging from 2D to 20D. Our approach leverages offline reinforcement learning on large-scale optimization trajectories collected from 12 BO variants. To scale pretraining, we generate millions of synthetic Gaussian process-based functions with diverse landscapes, enabling the model to learn transferable optimization policies. As a result, ZeroShotOpt achieves robust zero-shot generalization on a wide array of unseen benchmarks, matching or surpassing the sample efficiency of leading global optimizers, including BO, while also offering a reusable foundation for future extensions and improvements. Our open-source code, dataset, and model are available at: this https URL
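下面示意如何采样一个基于高斯过程的合成黑箱目标函数(即论文用于大规模预训练的随机地形);核函数与长度尺度的选取为假设。

```python
# Sketch of sampling a synthetic GP-based objective (RBF kernel assumed).
import numpy as np

def sample_gp_function(n_points=200, dim=2, lengthscale=0.3,
                       rng=np.random.default_rng(1)):
    x = rng.uniform(0, 1, size=(n_points, dim))
    sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    k = np.exp(-0.5 * sq / lengthscale**2) + 1e-8 * np.eye(n_points)  # RBF + jitter
    y = np.linalg.cholesky(k) @ rng.standard_normal(n_points)
    return x, y  # function values at the sampled inputs

x, y = sample_gp_function()
print(x[np.argmin(y)], y.min())  # this landscape's best observed point
```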
zh
[AI-10] CHORD: Customizing Hybrid-precision On-device Model for Sequential Recommendation with Device-cloud Collaboration ACM-MM’25
【速读】:该论文旨在解决移动设备上序列推荐模型部署中的两大挑战:一是模型压缩导致的个性化推荐精度下降问题,二是本地微调带来的额外计算负担。现有量化方法虽能提升部署效率,但忽视了设备端用户兴趣差异;而基于本地重训练的个性化策略则因计算开销大难以实用。解决方案的关键在于提出一种名为CHORD(Customizing Hybrid-precision On-device model for sequential Recommendation with Device-cloud collaboration)的框架,其核心创新是通过云端辅助的通道级混合精度量化机制,在保持模型轻量化的同时实现用户个性化适配。具体而言,CHORD利用云侧超网络模块识别用户特定关键参数,并在设备端执行无需反向传播的动态混合精度量化,从而在不进行本地重训练的前提下完成高效且个性化的推理加速,同时仅用2比特/通道编码量化策略显著降低通信开销。
链接: https://arxiv.org/abs/2510.03038
作者: Tianqi Liu,Kairui Fu,Shengyu Zhang,Wenyan Fan,Zhaocheng Du,Jieming Zhu,Fan Wu,Fei Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: accepted by ACM MM’25
Abstract:With the advancement of mobile device capabilities, deploying reranking models directly on devices has become feasible, enabling real-time contextual recommendations. When migrating models from cloud to devices, resource heterogeneity inevitably necessitates model compression. Recent quantization methods show promise for efficient deployment, yet they overlook device-specific user interests, resulting in compromised recommendation accuracy. While on-device finetuning captures personalized user preference, it imposes additional computational burden through local retraining. To address these challenges, we propose a framework for Customizing Hybrid-precision On-device model for sequential Recommendation with Device-cloud collaboration (CHORD), leveraging channel-wise mixed-precision quantization to simultaneously achieve personalization and resource-adaptive deployment. CHORD distributes randomly initialized models across heterogeneous devices and identifies user-specific critical parameters through auxiliary hypernetwork modules on the cloud. Our parameter sensitivity analysis operates across multiple granularities (layer, filter, and element levels), enabling precise mapping from user profiles to quantization strategy. Through on-device mixed-precision quantization, CHORD delivers dynamic model adaptation and accelerated inference without backpropagation, eliminating costly retraining cycles. We minimize communication overhead by encoding quantization strategies using only 2 bits per channel instead of 32-bit weights. Experiments on three real-world datasets with two popular backbones (SASRec and Caser) demonstrate the accuracy, efficiency, and adaptivity of CHORD.
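下面给出通道级混合精度量化及其 2 比特/通道策略编码的一个 NumPy 示意;比特宽度映射表为假设,仅说明"云端下发 2 比特编码、端侧按通道量化"的机制。

```python
# Sketch of channel-wise mixed-precision quantization with a 2-bit strategy code
# per channel; the code-to-bit-width table below is an assumption.
import numpy as np

BITS = {0: 2, 1: 4, 2: 8, 3: 16}  # 2-bit code -> bit width

def quantize_channel(w, bits):
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1) or 1.0
    return np.round(w / scale) * scale

def apply_strategy(weight, codes):
    """weight: (C_out, C_in); codes: (C_out,) 2-bit codes sent from the cloud."""
    return np.stack([quantize_channel(weight[c], BITS[int(codes[c])])
                     for c in range(weight.shape[0])])

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16))
codes = rng.integers(0, 4, size=8)   # 2 bits/channel instead of 32-bit weights
w_q = apply_strategy(w, codes)
print("mean quantization error:", np.abs(w - w_q).mean())
```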
zh
[AI-11] Investigating The Smells of LLM Generated Code
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)生成代码时存在的代码质量评估问题,特别是针对代码异味(code smells)这一关键指标进行系统性分析。传统研究多关注生成代码的功能正确性,而忽视了其结构与可维护性等质量维度。论文提出了一种基于场景的评估方法,通过将测试数据按编程任务的主题和复杂度划分为多个子集,从而识别出LLM在不同使用场景下代码质量最薄弱的环节;其核心解决方案是构建一个自动化测试系统,对比LLM生成代码与专业开发者编写的参考代码中的代码异味数量,并量化差异,揭示LLM在复杂任务和面向对象等高级主题中生成代码质量显著低于人类编写代码的现象。
链接: https://arxiv.org/abs/2510.03029
作者: Debalina Ghosh Paul,Hong Zhu,Ian Bayley
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Context: Large Language Models (LLMs) are increasingly being used to generate program code. Much research has been reported on the functional correctness of generated code, but there is far less on code quality. Objectives: In this study, we propose a scenario-based method of evaluating the quality of LLM-generated code to identify the weakest scenarios in which the quality of LLM generated code should be improved. Methods: The method measures code smells, an important indicator of code quality, and compares them with a baseline formed from reference solutions of professionally written code. The test dataset is divided into various subsets according to the topics of the code and complexity of the coding tasks to represent different scenarios of using LLMs for code generation. We will also present an automated test system for this purpose and report experiments with the Java programs generated in response to prompts given to four state-of-the-art LLMs: Gemini Pro, ChatGPT, Codex, and Falcon. Results: We find that LLM-generated code has a higher incidence of code smells compared to reference solutions. Falcon performed the least badly, with a smell increase of 42.28%, followed by Gemini Pro (62.07%), ChatGPT (65.05%) and finally Codex (84.97%). The average smell increase across all LLMs was 63.34%, comprising 73.35% for implementation smells and 21.42% for design smells. We also found that the increase in code smells is greater for more complex coding tasks and for more advanced topics, such as those involving object-orientated concepts. Conclusion: In terms of code smells, LLM’s performances on various coding task complexities and topics are highly correlated to the quality of human written code in the corresponding scenarios. However, the quality of LLM generated code is noticeably poorer than human written code.
zh
[AI-12] Learning Robust Diffusion Models from Imprecise Supervision
【速读】:该论文旨在解决条件扩散模型(conditional diffusion models)在训练过程中因依赖大规模数据集而引入的不精确条件输入(imprecise supervision)所导致的条件失配(condition mismatch)问题,进而降低生成质量。解决方案的关键在于提出DMIS(Diffusion Models from Imprecise Supervision),这是一个统一框架,通过最大似然估计推导出目标函数,并将其分解为生成组件和分类组件:生成组件建模不精确标签的分布,分类组件则利用扩散分类器(diffusion classifier)推断类别后验概率,同时通过优化的时间步采样策略提升效率。该方法在图像生成、弱监督学习和噪声数据集压缩等多种场景下均展现出高质量且类别区分度强的生成性能。
链接: https://arxiv.org/abs/2510.03016
作者: Dong-Dong Wu,Jiacheng Cui,Wei Wang,Zhiqiang She,Masashi Sugiyama
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Conditional diffusion models have achieved remarkable success in various generative tasks recently, but their training typically relies on large-scale datasets that inevitably contain imprecise information in conditional inputs. Such supervision, often stemming from noisy, ambiguous, or incomplete labels, will cause condition mismatch and degrade generation quality. To address this challenge, we propose DMIS, a unified framework for training robust Diffusion Models from Imprecise Supervision, which is the first systematic study within diffusion models. Our framework is derived from likelihood maximization and decomposes the objective into generative and classification components: the generative component models imprecise-label distributions, while the classification component leverages a diffusion classifier to infer class-posterior probabilities, with its efficiency further improved by an optimized timestep sampling strategy. Extensive experiments on diverse forms of imprecise supervision, covering tasks of image generation, weakly supervised learning, and noisy dataset condensation demonstrate that DMIS consistently produces high-quality and class-discriminative samples.
zh
[AI-13] BrainIB: Leverag ing Graph Neural Networks and Information Bottleneck for Functional Brain Biomarkers in Schizophrenia
【速读】:该论文旨在解决传统机器学习诊断模型依赖人工特征工程导致偏差,以及深度学习模型因可解释性不足难以提供可靠脑生物标志物的问题。其解决方案的关键在于提出一种端到端的图神经网络框架 BrainIB++,该框架基于信息瓶颈(Information Bottleneck, IB)原理,在模型训练过程中自动识别最具信息量的数据驱动脑区子图作为解释依据,从而在保持高诊断准确性的同时提升模型的可解释性和临床适用性。
链接: https://arxiv.org/abs/2510.03004
作者: Tianzheng Hu,Qiang Li,Shu Liu,Vince D. Calhoun,Guido van Wingen,Shujian Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This manuscript has been accepted by Biomedical Signal Processing and Control and the code is available at this https URL
Abstract:The development of diagnostic models is gaining traction in the field of psychiatric disorders. Recently, machine learning classifiers based on resting-state functional magnetic resonance imaging (rs-fMRI) have been developed to identify brain biomarkers that differentiate psychiatric disorders from healthy controls. However, conventional machine learning-based diagnostic models often depend on extensive feature engineering, which introduces bias through manual intervention. While deep learning models are expected to operate without manual involvement, their lack of interpretability poses significant challenges in obtaining explainable and reliable brain biomarkers to support diagnostic decisions, ultimately limiting their clinical applicability. In this study, we introduce an end-to-end innovative graph neural network framework named BrainIB++, which applies the information bottleneck (IB) principle to identify the most informative data-driven brain regions as subgraphs during model training for interpretation. We evaluate the performance of our model against nine established brain network classification methods across three multi-cohort schizophrenia datasets. It consistently demonstrates superior diagnostic accuracy and exhibits generalizability to unseen data. Furthermore, the subgraphs identified by our model also correspond with established clinical biomarkers in schizophrenia, particularly emphasizing abnormalities in the visual, sensorimotor, and higher cognition brain functional network. This alignment enhances the model’s interpretability and underscores its relevance for real-world diagnostic applications.
zh
[AI-14] From high-frequency sensors to noon reports: Using transfer learning for shaft power prediction in maritime
【速读】:该论文旨在解决船舶轴功率(shaft power)预测精度不足的问题,尤其是在高频率传感器数据难以获取的情况下,如何利用低频但易得的正午报告(noon reports)实现高效准确的预测。其解决方案的关键在于采用迁移学习(transfer learning)方法:首先在某艘船的高频数据上训练初始模型,随后将该模型在其他船只的低频正午报告数据上进行微调(fine-tuning),从而有效利用已有高质量数据的知识迁移至新场景,显著提升预测性能,实验表明该方法在不同类型的船舶上均能降低平均绝对百分比误差(MAPE)。
链接: https://arxiv.org/abs/2510.03003
作者: Akriti Sharma,Dogan Altan,Dusica Marijan,Arnbjørn Maressa
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Keywords: transfer learning, shaft power prediction, noon reports, sensor data, maritime
Abstract:With the growth of global maritime transportation, energy optimization has become crucial for reducing costs and ensuring operational efficiency. Shaft power is the mechanical power transmitted from the engine to the shaft and directly impacts fuel consumption, making its accurate prediction a paramount step in optimizing vessel performance. Power consumption is highly correlated with ship parameters such as speed and shaft rotation per minute, as well as weather and sea conditions. Frequent access to this operational data can improve prediction accuracy. However, obtaining high-quality sensor data is often infeasible and costly, making alternative sources such as noon reports a viable option. In this paper, we propose a transfer learning-based approach for predicting vessels shaft power, where a model is initially trained on high-frequency data from a vessel and then fine-tuned with low-frequency daily noon reports from other vessels. We tested our approach on sister vessels (identical dimensions and configurations), a similar vessel (slightly larger with a different engine), and a different vessel (distinct dimensions and configurations). The experiments showed that the mean absolute percentage error decreased by 10.6 percent for sister vessels, 3.6 percent for a similar vessel, and 5.3 percent for a different vessel, compared to the model trained solely on noon report data.
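下面用一个小型 MLP 示意该迁移流程:先在某船的高频数据上预训练,再以较小学习率在另一船的低频正午报告数据上微调;数据为合成,特征选取(航速、转速、气象代理变量)为假设。

```python
# Minimal sketch of the transfer setup with synthetic data; the feature set
# [speed, rpm, weather proxy] and network size are assumptions.
import torch
import torch.nn as nn

def make_data(n, noise, gen):
    x = torch.rand(n, 3, generator=gen)        # speed, rpm, weather proxy
    y = 5 * x[:, :1] ** 3 + 2 * x[:, 1:2] + noise * torch.randn(n, 1, generator=gen)
    return x, y

def train(model, x, y, lr, epochs=200):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

gen = torch.Generator().manual_seed(0)
model = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 1))
train(model, *make_data(5000, 0.05, gen), lr=1e-2)      # high-frequency sensor data
print(train(model, *make_data(60, 0.1, gen), lr=1e-3))  # fine-tune on noon reports
```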
zh
[AI-15] Untargeted Jailbreak Attack
【速读】:该论文旨在解决现有基于梯度的越狱攻击(jailbreak attacks)在大型语言模型(Large Language Models, LLMs)中因目标受限导致的攻击效率低、搜索空间狭窄的问题。传统方法如贪婪坐标梯度(Greedy Coordinate Gradient, GCG)和COLD-Attack依赖于固定的目标响应,这限制了对抗样本的生成灵活性并需要大量优化迭代才能实现攻击目标。为此,作者提出了首个基于梯度的无目标越狱攻击(Untargeted Jailbreak Attack, UJA),其核心创新在于将攻击目标从“诱导特定有害响应”转变为“最大化模型输出的不安全性概率”,并通过一个判别模型(judge model)量化该概率。由于该目标不可微,研究进一步将其分解为两个可微子目标:优化有害响应与对应对抗提示(adversarial prompt),并提供了理论分析以验证分解的有效性。这一设计显著扩展了搜索空间,使UJA仅需约100次迭代即可在近期安全对齐的LLMs上实现超过80%的成功率,优于当前最优方法(如I-GCG和COLD-Attack)超过20个百分点。
链接: https://arxiv.org/abs/2510.02999
作者: Xinzhe Huang,Wenjing Hu,Tianhang Zheng,Kedong Xiu,Xiaojun Jia,Di Wang,Zhan Qin,Kui Ren
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Existing gradient-based jailbreak attacks on Large Language Models (LLMs), such as Greedy Coordinate Gradient (GCG) and COLD-Attack, typically optimize adversarial suffixes to align the LLM output with a predefined target response. However, by restricting the optimization objective as inducing a predefined target, these methods inherently constrain the adversarial search space, which limits their overall attack efficacy. Furthermore, existing methods typically require a large number of optimization iterations to close the large gap between the fixed target and the original model response, resulting in low attack efficiency. To overcome the limitations of targeted jailbreak attacks, we propose the first gradient-based untargeted jailbreak attack (UJA), aiming to elicit an unsafe response without enforcing any predefined patterns. Specifically, we formulate an untargeted attack objective to maximize the unsafety probability of the LLM response, which can be quantified using a judge model. Since the objective is non-differentiable, we further decompose it into two differentiable sub-objectives for optimizing an optimal harmful response and the corresponding adversarial prompt, with a theoretical analysis to validate the decomposition. In contrast to targeted jailbreak attacks, UJA’s unrestricted objective significantly expands the search space, enabling a more flexible and efficient exploration of LLM responses. Extensive evaluations demonstrate that UJA can achieve over 80% attack success rates against recent safety-aligned LLMs with only 100 optimization iterations, outperforming the state-of-the-art gradient-based attacks such as I-GCG and COLD-Attack by over 20%.
zh
[AI-16] Onto-Epistemological Analysis of AI Explanations
【速读】:该论文旨在解决当前可解释人工智能(Explainable AI, XAI)方法中隐含的本体论(ontological)与认识论(epistemological)假设未被充分审视的问题,这些问题直接影响了AI解释的有效性与适用性。其解决方案的关键在于系统分析不同XAI方法所依赖的关于解释存在性及人类认知能力的基本假设,并指出看似微小的技术调整可能对应着根本性的哲学立场差异;同时强调在选择和应用XAI方法时必须考虑其背后的onto-epistemological范式,以确保解释结果在特定应用场景中的合理性与可信度。
链接: https://arxiv.org/abs/2510.02996
作者: Martina Mattioli,Eike Petersen,Aasa Feragen,Marcello Pelillo,Siavash A. Bigdeli
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Artificial intelligence (AI) is being applied in almost every field. At the same time, the currently dominant deep learning methods are fundamentally black-box systems that lack explanations for their inferences, significantly limiting their trustworthiness and adoption. Explainable AI (XAI) methods aim to overcome this challenge by providing explanations of the models’ decision process. Such methods are often proposed and developed by engineers and scientists with a predominantly technical background and incorporate their assumptions about the existence, validity, and explanatory utility of different conceivable explanatory mechanisms. However, the basic concept of an explanation – what it is, whether we can know it, whether it is absolute or relative – is far from trivial and has been the subject of deep philosophical debate for millennia. As we point out here, the assumptions incorporated into different XAI methods are not harmless and have important consequences for the validity and interpretation of AI explanations in different domains. We investigate ontological and epistemological assumptions in explainability methods when they are applied to AI systems, meaning the assumptions we make about the existence of explanations and our ability to gain knowledge about those explanations. Our analysis shows how seemingly small technical changes to an XAI method may correspond to important differences in the underlying assumptions about explanations. We furthermore highlight the risks of ignoring the underlying onto-epistemological paradigm when choosing an XAI method for a given application, and we discuss how to select and adapt appropriate XAI methods for different domains of application.
zh
[AI-17] AI Generated Child Sexual Abuse Material - Whats the Harm?
【速读】:该论文旨在解决生成式人工智能(Generative AI)工具可能用于制作完全或部分合成的儿童性虐待材料(AI CSAM)所带来的新型风险与挑战,这些问题对儿童保护、执法和社会应对机制构成严重威胁。论文指出,尽管有人认为AI CSAM因缺乏直接受害者而危害较小,但这种观点忽视了其在制造无受害者的虚假内容、复现已知受害者创伤、促进诱骗与勒索行为以及助长儿童性剥削正常化等方面的多重危害;解决方案的关键在于全面识别并批判性评估AI CSAM的实际风险,警惕将其误判为“减害工具”的倾向,强调必须采取积极主动的生态系统响应措施,防止因对所谓“无直接伤害”的误解而导致政策和行动上的迟滞。
链接: https://arxiv.org/abs/2510.02978
作者: Caoilte Ó Ciardha,John Buckley,Rebecca S. Portnoff
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:The development of generative artificial intelligence (AI) tools capable of producing wholly or partially synthetic child sexual abuse material (AI CSAM) presents profound challenges for child protection, law enforcement, and societal responses to child exploitation. While some argue that the harmfulness of AI CSAM differs fundamentally from other CSAM due to a perceived absence of direct victimization, this perspective fails to account for the range of risks associated with its production and consumption. AI has been implicated in the creation of synthetic CSAM of children who have not previously been abused, the revictimization of known survivors of abuse, the facilitation of grooming, coercion and sexual extortion, and the normalization of child sexual exploitation. Additionally, AI CSAM may serve as a new or enhanced pathway into offending by lowering barriers to engagement, desensitizing users to progressively extreme content, and undermining protective factors for individuals with a sexual interest in children. This paper provides a primer on some key technologies, critically examines the harms associated with AI CSAM, and cautions against claims that it may function as a harm reduction tool, emphasizing how some appeals to harmlessness obscure its real risks and may contribute to inertia in ecosystem responses.
zh
[AI-18] Corrosion Risk Estimation for Heritage Preservation: An Internet of Things and Machine Learning Approach Using Temperature and Humidity
【速读】:该论文旨在解决文化遗产遗址中钢制结构的腐蚀预测难题,以实现主动保护。其关键解决方案是基于物联网(IoT)硬件系统与LoRa无线通信技术采集的三年气象数据,构建仅依赖温度和相对湿度的机器学习回归框架,从而实现高精度的腐蚀速率预测,并通过Streamlit可视化界面提供实时监测与可操作的维护建议,展现出低成本、易扩展的防腐蚀管理方案。
链接: https://arxiv.org/abs/2510.02973
作者: Reginald Juan M. Mercado,Muhammad Kabeer,Haider Al-Obaidy,Rosdiadee Nordin
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
备注: 17 pages
Abstract:Proactive preservation of steel structures at culturally significant heritage sites like the San Sebastian Basilica in the Philippines requires accurate corrosion forecasting. This study developed an Internet of Things hardware system connected with LoRa wireless communications to monitor heritage buildings with steel structures. From a three year dataset generated by the IoT system, we built a machine learning framework for predicting atmospheric corrosion rates using only temperature and relative humidity data. Deployed via a Streamlit dashboard with ngrok tunneling for public access, the framework provides real-time corrosion monitoring and actionable preservation recommendations. This minimal-data approach is scalable and cost effective for heritage sites with limited monitoring resources, showing that advanced regression can extract accurate corrosion predictions from basic meteorological data enabling proactive preservation of culturally significant structures worldwide without requiring extensive sensor networks
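下面用随机森林回归示意"仅凭温度与相对湿度预测腐蚀速率"的最小数据方案;腐蚀响应函数为合成假设,真实数据来自论文的三年物联网监测。

```python
# Sketch of the minimal-data idea with synthetic data; the target function
# below is an assumption, not the study's measured corrosion response.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
temp = rng.uniform(20, 38, 3000)     # deg C
rh = rng.uniform(40, 100, 3000)      # % relative humidity
# Toy response: corrosion accelerates above ~60% RH and with temperature.
rate = 0.02 * np.maximum(rh - 60, 0) * (1 + 0.03 * temp) + rng.normal(0, 0.05, 3000)

x = np.column_stack([temp, rh])
xtr, xte, ytr, yte = train_test_split(x, rate, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(xtr, ytr)
print("R^2:", model.score(xte, yte))
```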
zh
[AI-19] Ergodic Risk Measures: Towards a Risk-Aware Foundation for Continual Reinforcement Learning
【速读】:该论文旨在解决持续强化学习(continual reinforcement learning, continual RL)中长期性能优化的理论基础问题,特别是如何在持续学习场景下实现风险感知(risk-aware)决策。传统方法多基于风险中性(risk-neutral)假设,仅优化期望长期回报,而忽略了对风险敏感的决策需求。论文指出,经典的风险度量理论在当前形式下无法适配持续学习环境,因此提出了一种新的“遍历风险度量”(ergodic risk measures)类,其能够兼容持续学习框架,并为风险感知的持续学习提供理论支撑。解决方案的关键在于引入遍历风险度量,从而在保持持续适应能力的同时,实现对长期性能分布的更全面优化,包括方差、尾部风险等超出均值的指标。
链接: https://arxiv.org/abs/2510.02945
作者: Juan Sebastian Rojas,Chi-Guhn Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Continual reinforcement learning (continual RL) seeks to formalize the notions of lifelong learning and endless adaptation in RL. In particular, the aim of continual RL is to develop RL agents that can maintain a careful balance between retaining useful information and adapting to new situations. To date, continual RL has been explored almost exclusively through the lens of risk-neutral decision-making, in which the agent aims to optimize the expected (or mean) long-run performance. In this work, we present the first formal theoretical treatment of continual RL through the lens of risk-aware decision-making, in which the agent aims to optimize a reward-based measure of long-run performance beyond the mean. In particular, we show that the classical theory of risk measures, widely used as a theoretical foundation in non-continual risk-aware RL, is, in its current form, incompatible with the continual setting. Then, building on this insight, we extend risk measure theory into the continual setting by introducing a new class of ergodic risk measures that are compatible with continual learning. Finally, we provide a case study of risk-aware continual learning, along with empirical results, which show the intuitive appeal and theoretical soundness of ergodic risk measures.
zh
[AI-20] WavInWav: Time-domain Speech Hiding via Invertible Neural Network
【速读】:该论文旨在解决现有基于深度神经网络(Deep Neural Networks, DNNs)的音频信息隐藏方法在秘密音频恢复过程中质量不佳的问题,其根本原因在于模型对时频关系建模能力不足。解决方案的关键在于引入一种基于流的可逆神经网络(flow-based invertible neural network),建立水印音频(stego audio)、原始音频(cover audio)与秘密音频(secret audio)之间的直接映射关系,从而提升嵌入与提取过程的可逆性;同时,在时域信号上设计时频损失函数(time-frequency loss),在保留时频约束优势的同时显著改善秘密音频的恢复质量,确保了实用场景中信息恢复的准确性与鲁棒性。
链接: https://arxiv.org/abs/2510.02915
作者: Wei Fan,Kejiang Chen,Xiangkun Wang,Weiming Zhang,Nenghai Yu
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: 13 pages, 5 figures, project page: this https URL
Abstract:Data hiding is essential for secure communication across digital media, and recent advances in Deep Neural Networks (DNNs) provide enhanced methods for embedding secret information effectively. However, previous audio hiding methods often result in unsatisfactory quality when recovering secret audio, due to their inherent limitations in the modeling of time-frequency relationships. In this paper, we explore these limitations and introduce a new DNN-based approach. We use a flow-based invertible neural network to establish a direct link between stego audio, cover audio, and secret audio, enhancing the reversibility of embedding and extracting messages. To address common issues from time-frequency transformations that degrade secret audio quality during recovery, we implement a time-frequency loss on the time-domain signal. This approach not only retains the benefits of time-frequency constraints but also enhances the reversibility of message recovery, which is vital for practical applications. We also add an encryption technique to protect the hidden data from unauthorized access. Experimental results on the VCTK and LibriSpeech datasets demonstrate that our method outperforms previous approaches in terms of subjective and objective metrics and exhibits robustness to various types of noise, suggesting its utility in targeted secure communication scenarios.
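下面给出"作用于时域信号的时频损失"的一个 PyTorch 示意:对恢复音频与原始秘密音频的 STFT 幅度谱取差,并叠加时域项;窗长、跳步与损失权重等参数均为假设。

```python
# Sketch of a time-frequency loss applied to time-domain signals; the STFT
# settings and equal weighting of the two terms are assumptions.
import torch

def time_frequency_loss(recovered, target, n_fft=512, hop=128):
    win = torch.hann_window(n_fft)
    s_rec = torch.stft(recovered, n_fft, hop, window=win, return_complex=True)
    s_tgt = torch.stft(target, n_fft, hop, window=win, return_complex=True)
    spec = (s_rec.abs() - s_tgt.abs()).abs().mean()   # time-frequency term
    wave = (recovered - target).abs().mean()          # time-domain term
    return spec + wave

secret = torch.randn(1, 16000)                 # 1 s of audio at 16 kHz
recovered = secret + 0.01 * torch.randn_like(secret)
print(time_frequency_loss(recovered, secret))
```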
zh
[AI-21] FeDABoost: Fairness Aware Federated Learning with Adaptive Boosting ECML-PKDD2025
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在非独立同分布(non-IID)数据场景下模型性能下降和客户端间公平性不足的问题。其解决方案的关键在于提出FeDABoost框架,融合动态提升机制与自适应梯度聚合策略:一方面,借鉴多类AdaBoost(SAMME)的加权机制,根据客户端局部误差率分配更高权重给表现更优的客户端,从而增强全局模型聚合的可靠性;另一方面,通过调整焦点损失(focal loss)的关注参数,动态提升表现较差客户端的训练强度,强化难分类样本的学习能力,进而改善整体公平性与性能表现。
链接: https://arxiv.org/abs/2510.02914
作者: Tharuka Kasthuri Arachchige,Veselka Boeva,Shahrooz Abghari
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Presented in WAFL@ECML-PKDD 2025
Abstract:This work focuses on improving the performance and fairness of Federated Learning (FL) in non IID settings by enhancing model aggregation and boosting the training of underperforming clients. We propose FeDABoost, a novel FL framework that integrates a dynamic boosting mechanism and an adaptive gradient aggregation strategy. Inspired by the weighting mechanism of the Multiclass AdaBoost (SAMME) algorithm, our aggregation method assigns higher weights to clients with lower local error rates, thereby promoting more reliable contributions to the global model. In parallel, FeDABoost dynamically boosts underperforming clients by adjusting the focal loss focusing parameter, emphasizing hard to classify examples during local training. We have evaluated FeDABoost on three benchmark datasets MNIST, FEMNIST, and CIFAR10, and compared its performance with those of FedAvg and Ditto. The results show that FeDABoost achieves improved fairness and competitive performance.
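下面示意 FeDABoost 的两个要素:由客户端局部误差率计算 SAMME 风格的聚合权重,以及为表现较差的客户端调大焦点损失的关注参数 gamma;具体的 gamma 调整规则为示例假设。

```python
# Sketch of the two FeDABoost ingredients; the gamma schedule is an assumption.
import numpy as np

def samme_weight(err, n_classes):
    """SAMME-style weight: lower local error -> larger aggregation weight."""
    err = np.clip(err, 1e-6, 1 - 1e-6)
    return np.log((1 - err) / err) + np.log(n_classes - 1)

def focal_loss(p_true, gamma):
    """Focal loss on true-class probabilities; larger gamma stresses hard examples."""
    return -((1 - p_true) ** gamma * np.log(p_true)).mean()

client_err = np.array([0.10, 0.25, 0.45])          # local error rates
w = samme_weight(client_err, n_classes=10)
print("aggregation weights:", w / w.sum())          # better clients weigh more

# Underperforming clients get a larger focusing parameter.
gammas = 1.0 + 4.0 * (client_err - client_err.min()) / np.ptp(client_err)
p = np.array([0.9, 0.6, 0.3])                       # example true-class probabilities
for g in gammas:
    print(f"gamma={g:.1f}: focal loss={focal_loss(p, g):.3f}")
```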
zh
[AI-22] DMark: Order-Agnostic Watermarking for Diffusion Large Language Models
【速读】:该论文旨在解决扩散大语言模型(Diffusion Large Language Models, dLLMs)的文本水印难题。传统水印方法基于自回归模型的因果生成机制,依赖于从左到右的逐词生成顺序,而dLLMs采用非序列化解码方式,允许以任意顺序完成token生成,导致现有水印方法失效。解决方案的关键在于提出DMark框架,其核心创新为三种互补策略:预测性水印(Predictive Watermarking)在实际上下文缺失时利用模型预测的token进行水印嵌入;双向水印(Bidirectional Watermarking)利用扩散解码特有的前向与后向依赖关系增强水印可检测性;以及预测-双向融合水印(Predictive-Bidirectional Watermarking),综合两种策略以最大化检测强度。实验表明,DMark在保持文本质量的同时实现92.0–99.5%的高检测率(FPR=1%),显著优于对现有方法的简单适配(仅49.6–71.2%)。
链接: https://arxiv.org/abs/2510.02902
作者: Linyu Wu,Linhao Zhong,Wenjie Qu,Yuexin Li,Yue Liu,Shengfang Zhai,Chunhua Shen,Jiaheng Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:Diffusion large language models (dLLMs) offer faster generation than autoregressive models while maintaining comparable quality, but existing watermarking methods fail on them due to their non-sequential decoding. Unlike autoregressive models that generate tokens left-to-right, dLLMs can finalize tokens in arbitrary order, breaking the causal design underlying traditional watermarks. We present DMark, the first watermarking framework designed specifically for dLLMs. DMark introduces three complementary strategies to restore watermark detectability: predictive watermarking uses model-predicted tokens when actual context is unavailable; bidirectional watermarking exploits both forward and backward dependencies unique to diffusion decoding; and predictive-bidirectional watermarking combines both approaches to maximize detection strength. Experiments across multiple dLLMs show that DMark achieves 92.0-99.5% detection rates at 1% false positive rate while maintaining text quality, compared to only 49.6-71.2% for naive adaptations of existing methods. DMark also demonstrates robustness against text manipulations, establishing that effective watermarking is feasible for non-autoregressive language models.
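下面给出上下文哈希"绿名单"水印检测思路在双向解码下的一个概念示意:某位置的密钥既可取左邻 token,也可取右邻 token;哈希方案、词表大小与阈值均为假设,并非 DMark 的实现。

```python
# Conceptual sketch of bidirectional green-list watermark detection;
# hashing scheme, vocab size, and threshold are illustrative assumptions.
import hashlib

VOCAB = 50_000
GREEN_FRAC = 0.5

def green(token_id: int, key_token: int) -> bool:
    h = hashlib.sha256(str(key_token).encode()).digest()
    seed = int.from_bytes(h[:8], "big")
    return (token_id * 2654435761 + seed) % VOCAB < GREEN_FRAC * VOCAB

def detect(tokens, z_threshold=4.0):
    """Score each token against both its left and right neighbor keys."""
    hits, n = 0, 0
    for i, t in enumerate(tokens):
        for key in (tokens[i - 1] if i > 0 else None,
                    tokens[i + 1] if i + 1 < len(tokens) else None):
            if key is not None:
                hits += green(t, key)
                n += 1
    mean, var = n * GREEN_FRAC, n * GREEN_FRAC * (1 - GREEN_FRAC)
    z = (hits - mean) / var**0.5
    return z, z > z_threshold   # unwatermarked text should give z near 0

print(detect(list(range(100, 140))))
```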
zh
[AI-23] Global Convergence of Policy Gradient for Entropy Regularized Linear-Quadratic Control with multiplicative noise
【速读】:该论文旨在解决在系统参数未知的动态环境中,基于强化学习(Reinforcement Learning, RL)的熵正则化线性二次控制(Entropy-Regularized Linear Quadratic Control, LQC)问题,尤其针对存在乘性噪声的无限时域控制场景。其解决方案的关键在于:首先,将正则化策略梯度(Regularized Policy Gradient, RPG)算法适配至随机最优控制框架,并在梯度支配(gradient domination)和近光滑性(near-smoothness)条件下证明了全局收敛性;其次,提出一种无需系统模型信息的零阶优化方法——基于样本的正则化策略梯度(Sample-Based Regularized Policy Gradient, SB-RPG),该算法通过熵正则化项加速收敛并有效平衡探索与利用(exploration-exploitation trade-off),同时保持强理论保证。
链接: https://arxiv.org/abs/2510.02896
作者: Gabriel Diaz,Lucky Li,Wenhao Zhang
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注: 33 pages, 4 figures
Abstract:Reinforcement Learning (RL) has emerged as a powerful framework for sequential decision-making in dynamic environments, particularly when system parameters are unknown. This paper investigates RL-based control for entropy-regularized Linear Quadratic control (LQC) problems with multiplicative noises over an infinite time horizon. First, we adapt the Regularized Policy Gradient (RPG) algorithm to stochastic optimal control settings, proving that despite the non-convexity of the problem, RPG converges globally under conditions of gradient domination and near-smoothness. Second, based on zero-order optimization approach, we introduce a novel model free RL algorithm: Sample-Based Regularized Policy Gradient (SB-RPG). SB-RPG operates without knowledge of system parameters yet still retains strong theoretical guarantees of global convergence. Our model leverages entropy regularization to accelerate convergence and address the exploration versus exploitation trade-off inherent in RL. Numerical simulations validate the theoretical results and demonstrate the efficacy of SB-RPG in unknown-parameters environments.
zh
[AI-24] Consolidating Reinforcement Learning for Multimodal Discrete Diffusion Models
【速读】:该论文旨在解决离散扩散模型(Discrete Diffusion Model, DDM)在奖励优化中的挑战,尤其是由于非自回归(non-autoregressive)结构导致重要性采样(importance sampling)不可行以及轨迹(rollout)计算复杂,从而阻碍了强化学习方法(如组相对策略优化,Group Relative Policy Optimization, GRPO)的有效应用。其解决方案的关键在于提出MaskGRPO,一种首次实现可扩展多模态强化学习的框架,核心创新包括:首先明确DDM的理论基础以构建能捕捉关键token波动的重要性估计器,从而支持有效的梯度更新;其次设计针对视觉序列的精细化rollout机制,生成多样且可靠的完成结果与优化梯度。实验证明,MaskGRPO在数学推理、代码生成和视觉生成任务中均实现了更稳定高效的策略更新,显著提升推理性能与生成质量,标志着首个适用于离散化视觉扩散模型的实际强化学习方法。
链接: https://arxiv.org/abs/2510.02880
作者: Tianren Ma,Mu Zhang,Yibing Wang,Qixiang Ye
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Project Page: this https URL
Abstract:Optimizing discrete diffusion model (DDM) with rewards remains a challenge: the non-autoregressive paradigm makes importance sampling intractable and rollout complex, puzzling reinforcement learning methods such as Group Relative Policy Optimization (GRPO). In this study, we introduce MaskGRPO, the first viable approach to enable scalable multimodal reinforcement learning in discrete diffusion with effective importance sampling and modality-specific adaptations. To this end, we first clarify the theoretical foundation for DDMs, which facilitates building an importance estimator that captures valuable token fluctuation for gradient updates. We then delicately tailored the rollout method for visual sequences, which yields diverse completions and reliable optimization gradients. Upon math reasoning, coding, and visual generation benchmarks, MaskGRPO brings more stable and efficient updates, leading to stronger reasoning performance and better generation quality. This study establishes MaskGRPO as a systematic policy optimization approach and the first practical way for discretized visual diffusion.
zh
[AI-25] Reward Model Routing in Alignment
【速读】:该论文旨在解决当前基于人类或AI反馈的强化学习(Reinforcement Learning from Human or AI Feedback, RLHF/RLAIF)中依赖单一奖励模型(Reward Model, RM)所带来的对齐质量受限与过拟合风险问题。现有动态RM路由方法虽能通过从候选池中选择最优RM来提升性能,但仍面临冷启动和探索不足的挑战。解决方案的关键在于提出BayesianRouter——一个结合离线学习与在线贝叶斯选择的混合路由框架:在离线阶段,通过多任务训练估计各RM的可靠性;在在线阶段,采用贝叶斯Thompson采样机制,以离线学习得到的嵌入作为高斯先验初始化RM权重向量,并根据实时奖励自适应更新后验分布,从而实现对策略分布变化的高效适应与更优的RM选择。
链接: https://arxiv.org/abs/2510.02850
作者: Xinle Wu,Yao Lu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning from human or AI feedback (RLHF / RLAIF) has become the standard paradigm for aligning large language models (LLMs). However, most pipelines rely on a single reward model (RM), limiting alignment quality and risking overfitting. Recent work explores RM routing–dynamically selecting an RM from a candidate pool to exploit complementary strengths while maintaining O(1) RM calls–but existing methods suffer from cold-start and insufficient exploration. We propose BayesianRouter, a hybrid routing framework that combines offline RM strengths learning with online Bayesian selection. In the offline stage, a multi-task router is trained on preference data to estimate per-RM reliability. In the online stage, a Bayesian Thompson sampling router performs per-query RM selection, initializing RM-specific weight vectors with offline embeddings as Gaussian priors and adaptively updating their posteriors with online rewards to adapt to the evolving policy distribution. Extensive experiments on instruction-following (AlpacaEval-2, Arena-Hard, MT-Bench) and reasoning (GSM8K, MMLU) benchmarks show that BayesianRouter consistently outperforms individual RMs, RM ensembling, and existing routing methods.
zh
[AI-26] Flamed-TTS: Flow Matching Attention-Free Models for Efficient Generating and Dynamic Pacing Zero-shot Text-to-Speech
【速读】:该论文旨在解决零样本语音合成(Zero-shot Text-to-Speech, Zero-shot TTS)中存在的三大核心问题:一是由token重复或意外内容迁移导致的不可靠合成;二是推理速度慢与计算开销大;三是时间多样性(temporal diversity)不足,影响语音自然度。解决方案的关键在于提出Flamed-TTS框架,通过重构流匹配(flow matching)训练范式,引入离散与连续表示相结合的方式以分别建模语音的不同属性(如说话人身份、韵律等),从而在保证低延迟和低计算成本的同时,显著提升语音保真度与时间维度上的多样性,实验表明其在可懂性、自然度、说话人相似度等方面均优于现有最先进模型。
链接: https://arxiv.org/abs/2510.02848
作者: Hieu-Nghia Huynh-Nguyen,Huynh Nguyen Dang,Ngoc-Son Nguyen,Van Nguyen
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:Zero-shot Text-to-Speech (TTS) has recently advanced significantly, enabling models to synthesize speech from text using short, limited-context prompts. These prompts serve as voice exemplars, allowing the model to mimic speaker identity, prosody, and other traits without extensive speaker-specific data. Although recent approaches incorporating language models, diffusion, and flow matching have proven their effectiveness in zero-shot TTS, they still encounter challenges such as unreliable synthesis caused by token repetition or unexpected content transfer, along with slow inference and substantial computational overhead. Moreover, temporal diversity-crucial for enhancing the naturalness of synthesized speech-remains largely underexplored. To address these challenges, we propose Flamed-TTS, a novel zero-shot TTS framework that emphasizes low computational cost, low latency, and high speech fidelity alongside rich temporal diversity. To achieve this, we reformulate the flow matching training paradigm and incorporate both discrete and continuous representations corresponding to different attributes of speech. Experimental results demonstrate that Flamed-TTS surpasses state-of-the-art models in terms of intelligibility, naturalness, speaker similarity, acoustic characteristics preservation, and dynamic pace. Notably, Flamed-TTS achieves the best WER of 4% compared to the leading zero-shot TTS baselines, while maintaining low latency in inference and high fidelity in generated speech. Code and audio samples are available at our demo page this https URL.
zh
[AI-27] ake Goodhart Seriously: Principled Limit on General-Purpose AI Optimization
【速读】:该论文试图解决的问题是:在机器学习实践中,模型训练是否真正满足其指定的目标函数(即Objective Satisfaction Assumption, OSA)这一长期被忽视但关键的假设。研究表明,在现实条件下,由于近似误差、估计误差和优化误差的存在,OSA必然失效,且即使目标函数定义完美,也无法完全捕捉开发者意图(如人类偏好对齐),导致目标函数错位不可避免。解决方案的关键在于认识到这些偏差与强优化压力下引发的Goodhart效应难以区分,因此必须为通用人工智能(General-Purpose AI)系统设定一个原则性的优化上限,否则持续优化将不可避免地导致可预测且不可逆的控制权丧失。
链接: https://arxiv.org/abs/2510.02840
作者: Antoine Maier,Aude Maier,Tom David
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 1 figure. Under review
Abstract:A common but rarely examined assumption in machine learning is that training yields models that actually satisfy their specified objective function. We call this the Objective Satisfaction Assumption (OSA). Although deviations from OSA are acknowledged, their implications are overlooked. We argue, in a learning-paradigm-agnostic framework, that OSA fails in realistic conditions: approximation, estimation, and optimization errors guarantee systematic deviations from the intended objective, regardless of the quality of its specification. Beyond these technical limitations, perfectly capturing and translating the developer’s intent, such as alignment with human preferences, into a formal objective is practically impossible, making misspecification inevitable. Building on recent mathematical results, absent a mathematical characterization of these gaps, they are indistinguishable from those that collapse into Goodhart’s law failure modes under strong optimization pressure. Because the Goodhart breaking point cannot be located ex ante, a principled limit on the optimization of General-Purpose AI systems is necessary. Absent such a limit, continued optimization is liable to push systems into predictable and irreversible loss of control.
zh
[AI-28] Knowledge-Aware Modeling with Frequency Adaptive Learning for Battery Health Prognostics
【速读】:该论文旨在解决电池健康状态(Battery Health)预测中因复杂非线性退化行为、噪声干扰及容量再生现象导致的准确性与鲁棒性不足的问题。现有数据驱动模型虽能捕捉时间序列退化特征,但缺乏物理知识引导,难以实现长期可靠的剩余使用寿命(Remaining Useful Life, RUL)预测。其解决方案的关键在于提出Karma模型——一种具备频率自适应学习能力的知识感知型模型:首先通过信号分解提取不同频段的电池特征;进而采用双流深度学习架构,其中一路建模低频长期退化趋势,另一路刻画高频短期动态变化;同时引入基于经验研究的双指数函数作为先验知识,并利用粒子滤波优化参数,确保预测结果在物理一致性基础上具备不确定性量化能力,从而显著提升预测精度与可靠性。
链接: https://arxiv.org/abs/2510.02839
作者: Vijay Babu Pamshetti,Wei Zhang,Sumei Sun,Jie Zhang,Yonggang Wen,Qingyu Yan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 4 figures, 4 tables
Abstract:Battery health prognostics are critical for ensuring safety, efficiency, and sustainability in modern energy systems. However, it has been challenging to achieve accurate and robust prognostics due to complex battery degradation behaviors with nonlinearity, noise, capacity regeneration, etc. Existing data-driven models capture temporal degradation features but often lack knowledge guidance, which leads to unreliable long-term health prognostics. To overcome these limitations, we propose Karma, a knowledge-aware model with frequency-adaptive learning for battery capacity estimation and remaining useful life prediction. The model first performs signal decomposition to derive battery signals in different frequency bands. A dual-stream deep learning architecture is developed, where one stream captures long-term low-frequency degradation trends and the other models high-frequency short-term dynamics. Karma regulates the prognostics with knowledge, where battery degradation is modeled as a double exponential function based on empirical studies. Our dual-stream model is used to optimize the parameters of the knowledge with particle filters to ensure physically consistent and reliable prognostics and uncertainty quantification. Experimental study demonstrates Karma’s superior performance, achieving average error reductions of 50.6% and 32.6% over state-of-the-art algorithms for battery health prediction on two mainstream datasets, respectively. These results highlight Karma’s robustness, generalizability, and potential for safer and more reliable battery management across diverse applications.
zh
[AI-29] Dissecting Transformers: A CLEAR Perspective towards Green AI
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)推理阶段能耗评估缺乏细粒度方法的问题,现有研究多依赖粗粒度的模型级指标,难以准确揭示Transformer架构中各组件的实际能量消耗差异。其解决方案的关键在于提出一种名为“基于重复采样的组件级能量评估方法”(Component-Level Energy Assessment via Repeated sampling, CLEAR),该方法有效克服了微秒级组件执行时间与毫秒级能量传感器监测之间的时间不匹配问题,实现了对Transformer核心组件的能量精细化测量,在保持组件级能量方差低于9.5%的同时,捕获超过90%的模型总能耗。此方法首次系统性地揭示了注意力模块(Attention blocks)单位浮点运算(FLOP)能耗显著高于其他组件的事实,表明仅用FLOPs无法准确反映组件级真实能耗,为后续通过组件级优化设计节能型Transformer模型提供了基准和方向。
链接: https://arxiv.org/abs/2510.02810
作者: Hemang Jain,Shailender Goyal,Divyansh Pandey,Karthik Vaidhyanathan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:The rapid adoption of Large Language Models (LLMs) has raised significant environmental concerns. Unlike the one-time cost of training, LLM inference occurs continuously at a global scale and now dominates the AI energy footprint. Yet, most sustainability studies report only coarse, model-level metrics due to the lack of fine-grained measurement methods, treating energy efficiency more as an afterthought than as a primary objective. We present the first fine-grained empirical analysis of inference energy across core components of transformer architecture. We propose a novel methodology, Component-Level Energy Assessment via Repeated sampling (CLEAR), to overcome temporal mismatch between microsecond scale component execution and monitoring of millisecond (ms) scale energy sensors. Using CLEAR, we evaluate 15 models spanning four distinct architecture types and consistently keep component-wise energy variance below 9.5% while capturing more than 90% of the model’s total energy as individual components. Our empirical analysis reveals that Attention blocks consume significantly more energy per floating-point operation (FLOP), indicating that energy consumption is not proportionally aligned with FLOP counts. This shows that FLOPs alone fail to capture the true energy cost at a component level. Our findings establish detailed component-level energy baselines and provide insight as an initial step to build energy-efficient transformer models through component-level optimizations.
zh
[AI-30] Relevance-Aware Thresholding in Online Conformal Prediction for Time Series
【速读】:该论文旨在解决在线归因预测(Online Conformal Prediction, OCP)中阈值更新步骤过于依赖二元判断(即预测区间是否包含真实值)而忽略预测区间相关性的问题,从而导致阈值波动剧烈、预测区间过宽。其解决方案的关键在于:将传统的二元评估机制替换为一类更广义的函数,这些函数能够基于真实值量化预测区间的 relevancy(相关性),从而在阈值更新时引入对区间信息量的连续性度量,有效抑制阈值的突变,最终在保证覆盖率有效性的同时获得更紧凑的预测区间。
链接: https://arxiv.org/abs/2510.02809
作者: Théo Dupuy,Binbin Xu,Stéphane Perrey,Jacky Montmain,Abdelhak Imoussaten
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Uncertainty quantification has received considerable interest in recent works in Machine Learning. In particular, Conformal Prediction (CP) gains ground in this field. For the case of time series, Online Conformal Prediction (OCP) becomes an option to address the problem of data distribution shift over time. Indeed, the idea of OCP is to update a threshold of some quantity (whether the miscoverage level or the quantile) based on the distribution observation. To evaluate the performance of OCP methods, two key aspects are typically considered: the coverage validity and the prediction interval width minimization. Recently, new OCP methods have emerged, offering long-run coverage guarantees and producing more informative intervals. However, during the threshold update step, most of these methods focus solely on the validity of the prediction intervals~–~that is, whether the ground truth falls inside or outside the interval~–~without accounting for their relevance. In this paper, we aim to leverage this overlooked aspect. Specifically, we propose enhancing the threshold update step by replacing the binary evaluation (inside/outside) with a broader class of functions that quantify the relevance of the prediction interval using the ground truth. This approach helps prevent abrupt threshold changes, potentially resulting in narrower prediction intervals. Indeed, experimental results on real-world datasets suggest that these functions can produce tighter intervals compared to existing OCP methods while maintaining coverage validity.
zh
[AI-31] OptunaHub: A Platform for Black-Box Optimization
【速读】:该论文旨在解决黑盒优化(Black-box Optimization, BBO)研究在不同领域如AutoML和材料信息学中因方法与基准测试分散而导致的科研碎片化问题。其解决方案的关键在于提出OptunaHub这一社区平台,通过统一的Python API、贡献者包注册机制和Web界面,实现BBO方法与基准的集中管理与高效检索,从而促进跨领域协作与应用迭代,形成“贡献—应用”的良性循环。
链接: https://arxiv.org/abs/2510.02798
作者: Yoshihiko Ozaki,Shuhei Watanabe,Toshihiko Yanase
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Submitted to Journal of machine learning research
Abstract:Black-box optimization (BBO) drives advances in domains such as AutoML and Materials Informatics, yet research efforts often remain fragmented across domains. We introduce OptunaHub (this https URL), a community platform that centralizes BBO methods and benchmarks. OptunaHub provides unified Python APIs, a contributor package registry, and a web interface to promote searchability and cross-domain research. OptunaHub aims to foster a virtuous cycle of contributions and applications. The source code is publicly available in the optunahub, optunahub-registry, and optunahub-web repositories under the Optuna organization on GitHub (this https URL).
zh
[AI-32] Fusing Multi- and Hyperspectral Satellite Data for Harmful Algal Bloom Monitoring with Self-Supervised and Hierarchical Deep Learning
【速读】:该论文旨在解决有害藻华(Harmful Algal Bloom, HAB)严重程度和物种分类在标签稀缺环境下的可扩展监测问题。传统方法依赖于每种传感器的标注数据,限制了其在多源卫星数据融合中的应用。解决方案的关键在于提出一种自监督机器学习框架SIT-FUSE,通过融合多传感器遥感反射率数据(如VIIRS、MODIS、Sentinel-3、PACE)与TROPOMI太阳诱导荧光(Solar-Induced Fluorescence, SIF),利用自监督表征学习和分层深度聚类技术,无需每仪器单独标注即可生成HAB严重程度与物种分类产品,并通过墨西哥湾和南加州现场观测数据(2018–2025年)验证其有效性,实现了对总浮游植物、Karenia brevis、Alexandrium spp.及Pseudo-nitzschia spp.等关键物种的良好预测性能。
链接: https://arxiv.org/abs/2510.02763
作者: Nicholas LaHaye,Kelly M. Luis,Michelle M. Gierach
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We present a self-supervised machine learning framework for detecting and mapping harmful algal bloom (HAB) severity and speciation using multi-sensor satellite data. By fusing reflectance data from operational instruments (VIIRS, MODIS, Sentinel-3, PACE) with TROPOMI solar-induced fluorescence (SIF), our framework, called SIT-FUSE, generates HAB severity and speciation products without requiring per-instrument labeled datasets. The framework employs self-supervised representation learning, hierarchical deep clustering to segment phytoplankton concentrations and speciations into interpretable classes, validated against in-situ data from the Gulf of Mexico and Southern California (2018-2025). Results show strong agreement with total phytoplankton, Karenia brevis, Alexandrium spp., and Pseudo-nitzschia spp. measurements. This work advances scalable HAB monitoring in label-scarce environments while enabling exploratory analysis via hierarchical embeddings: a critical step toward operationalizing self-supervised learning for global aquatic biogeochemistry.
zh
[AI-33] Prototyping Digital Social Spaces through Metaphor-Driven Design: Translating Spatial Concepts into an Interactive Social Simulation
【速读】:该论文试图解决当前社交平台设计过度聚焦于用户参与度(engagement)和规模扩张,而忽视了多样化社会互动可能性的问题。其解决方案的关键在于提出一种基于隐喻驱动的系统,将用户提出的隐喻转化为结构化的平台功能集合,并通过大语言模型(LLM)驱动的代理生成可交互的模拟环境。该方法使用户能够直观地探索和验证不同社会架构下的潜在动态,从而突破现有平台的技术约束,为未来社交平台的设计拓展出更具包容性和多样性的空间。
链接: https://arxiv.org/abs/2510.02759
作者: Yoojin Hong,Martina Di Paola,Braahmi Padmakumar,Hwi Joon Lee,Mahnoor Shafiq,Joseph Seering
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 25 pages, in submission to CHI 2026
Abstract:Social media platforms are central to communication, yet their designs remain narrowly focused on engagement and scale. While researchers have proposed alternative visions for online spaces, these ideas are difficult to prototype within platform constraints. In this paper, we introduce a metaphor-driven system to help users imagine and explore new social media environments. The system translates users’ metaphors into structured sets of platform features and generates interactive simulations populated with LLM-driven agents. To evaluate this approach, we conducted a study where participants created and interacted with simulated social media spaces. Our findings show that metaphors allow users to express distinct social expectations, and that perceived authenticity of the simulation depended on how well it captured dynamics like intimacy, participation, and temporal engagement. We conclude by discussing how metaphor-driven simulation can be a powerful design tool for prototyping alternative social architectures and expanding the design space for future social platforms.
zh
[AI-34] CST-AFNet: A dual attention-based deep learning framework for intrusion detection in IoT networks
【速读】:该论文旨在解决物联网(IoT)和工业物联网(IIoT)环境中因设备异构性、资源受限及分布广泛所带来的复杂网络安全挑战,特别是针对入侵检测的准确性与实时性问题。解决方案的关键在于提出一种基于双注意力机制的深度学习框架——CST AFNet,其核心创新包括:多尺度卷积神经网络(Multi-scale CNNs)用于提取空间特征、双向门控循环单元(BiGRUs)捕捉时间依赖关系,以及通道注意力与时间注意力相结合的双重注意力机制,以增强对关键数据模式的关注能力。实验表明,该模型在Edge IIoTset数据集上实现了99.97%的准确率,且宏平均精确率、召回率和F1分数均超过99.3%,显著优于传统深度学习模型,验证了其在复杂IoT/IIoT环境中的高效性与可扩展性。
链接: https://arxiv.org/abs/2510.02717
作者: Waqas Ishtiaq,Ashrafun Zannat,A.H.M. Shahariar Parvez,Md. Alamgir Hossain,Muntasir Hasan Kanchan,Muhammad Masud Tarek
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 9 pages, 9 figures, 5 tables
Abstract:The rapid expansion of the Internet of Things (IoT) has revolutionized modern industries by enabling smart automation and real time connectivity. However, this evolution has also introduced complex cybersecurity challenges due to the heterogeneous, resource constrained, and distributed nature of these environments. To address these challenges, this research presents CST AFNet, a novel dual attention based deep learning framework specifically designed for robust intrusion detection in IoT networks. The model integrates multi scale Convolutional Neural Networks (CNNs) for spatial feature extraction, Bidirectional Gated Recurrent Units (BiGRUs) for capturing temporal dependencies, and a dual attention mechanism, channel and temporal attention, to enhance focus on critical patterns in the data. The proposed method was trained and evaluated on the Edge IIoTset dataset, a comprehensive and realistic benchmark containing more than 2.2 million labeled instances spanning 15 attack types and benign traffic, collected from a seven layer industrial testbed. Our proposed model achieves outstanding accuracy for both 15 attack types and benign traffic. CST AFNet achieves 99.97 percent accuracy. Moreover, this model demonstrates exceptional performance with macro averaged precision, recall, and F1 score all above 99.3 percent. Experimental results show that CST AFNet achieves superior detection accuracy, significantly outperforming traditional deep learning models. The findings confirm that CST AFNet is a powerful and scalable solution for real time cyber threat detection in complex IoT and IIoT environments, paving the way for more secure, intelligent, and adaptive cyber physical systems.
zh
[AI-35] A 1000times Faster LLM -enhanced Algorithm For Path Planning in Large-scale Grid Maps
【速读】:该论文旨在解决大规模网格地图中路径规划效率低下的问题,现有方法如A*、Dijkstra及其变体在处理大尺度地图时因搜索时间和内存消耗过高而表现不佳;同时,尽管大型语言模型(Large Language Models, LLMs)在路径规划中展现出潜力,但仍存在空间幻觉和规划性能不足的问题。针对这些问题,作者深入分析了LLM-A算法的瓶颈,并提出了一种创新的改进方案iLLM-A,其关键在于三个精心设计的机制:1)对A算法的优化以提升搜索效率;2)基于增量学习的LLM策略,用于生成高质量的路径航点(waypoints);3)合理选择适用于A路径规划的航点子集,从而显著减少计算负担并提高路径质量。实验表明,iLLM-A相较LLM-A平均提速超过1000倍,极端情况下达2349.5倍,内存消耗降低最多58.6%,且路径长度更短、标准差更低。
链接: https://arxiv.org/abs/2510.02716
作者: Junlin Zeng,Xin Zhang,Xiang Zhao,Yan Pan
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Path planning in grid maps, arising from various applications, has garnered significant attention. Existing methods, such as A*, Dijkstra, and their variants, work well for small-scale maps but fail to address large-scale ones due to high search time and memory consumption. Recently, Large Language Models (LLMs) have shown remarkable performance in path planning but still suffer from spatial illusion and poor planning performance. Among all the works, LLM-A* \citemeng2024llm leverages LLM to generate a series of waypoints and then uses A* to plan the paths between the neighboring waypoints. In this way, the complete path is constructed. However, LLM-A* still suffers from high computational time for large-scale maps. To fill this gap, we conducted a deep investigation into LLM-A* and found its bottleneck, resulting in limited performance. Accordingly, we design an innovative LLM-enhanced algorithm, abbr. as iLLM-A*. iLLM-A* includes 3 carefully designed mechanisms, including the optimization of A*, an incremental learning method for LLM to generate high-quality waypoints, and the selection of the appropriate waypoints for A* for path planning. Finally, a comprehensive evaluation on various grid maps shows that, compared with LLM-A*, iLLM-A* \textbf1) achieves more than 1000\times speedup on average, and up to 2349.5\times speedup in the extreme case, 2) saves up to 58.6% of the memory cost, 3) achieves both obviously shorter path length and lower path length standard deviation.
zh
[AI-36] A Novel Unified Lightweight Temporal-Spatial Transformer Approach for Intrusion Detection in Drone Networks
【速读】:该论文旨在解决无人机(Unmanned Aerial Vehicle, UAV)网络在商业、工业及民用领域广泛应用背景下所面临的网络安全挑战,尤其是现有入侵检测机制在动态性、资源受限环境中缺乏适应性、效率和泛化能力的问题。解决方案的关键在于提出一种轻量级且统一的基于时空变换器(Temporal Spatial Transformer)的入侵检测系统——TSLT-Net,其通过自注意力机制有效建模网络流量中的时间模式与空间依赖关系,支持单架构下多类攻击分类与二元异常检测,并在ISOT Drone Anomaly Detection数据集上实现了99.99%的多类检测准确率和100%的二元异常检测准确率,同时仅需0.04 MB内存和9722个可训练参数,展现出优异的实时性和边缘部署可行性。
链接: https://arxiv.org/abs/2510.02711
作者: Tarun Kumar Biswas,Ashrafun Zannat,Waqas Ishtiaq,Md. Alamgir Hossain
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 21 pages, 18 figures, 5 tables
Abstract:The growing integration of drones across commercial, industrial, and civilian domains has introduced significant cybersecurity challenges, particularly due to the susceptibility of drone networks to a wide range of cyberattacks. Existing intrusion detection mechanisms often lack the adaptability, efficiency, and generalizability required for the dynamic and resource constrained environments in which drones operate. This paper proposes TSLT-Net, a novel lightweight and unified Temporal Spatial Transformer based intrusion detection system tailored specifically for drone networks. By leveraging self attention mechanisms, TSLT-Net effectively models both temporal patterns and spatial dependencies in network traffic, enabling accurate detection of diverse intrusion types. The framework includes a streamlined preprocessing pipeline and supports both multiclass attack classification and binary anomaly detection within a single architecture. Extensive experiments conducted on the ISOT Drone Anomaly Detection Dataset, consisting of more than 2.3 million labeled records, demonstrate the superior performance of TSLT-Net with 99.99 percent accuracy in multiclass detection and 100 percent in binary anomaly detection, while maintaining a minimal memory footprint of only 0.04 MB and 9722 trainable parameters. These results establish TSLT-Net as an effective and scalable solution for real time drone cybersecurity, particularly suitable for deployment on edge devices in mission critical UAV systems.
zh
[AI-37] RAMAC: Multimodal Risk-Aware Offline Reinforcement Learning and the Role of Behavior Regularization ICLR2026
【速读】:该论文旨在解决离线强化学习(Offline Reinforcement Learning, Offline RL)中安全性与策略表达能力之间的权衡问题:现有风险规避方法往往导致价值保守性并限制策略类别的表达能力,而高表达能力的策略仅在无风险敏感性的场景下被使用。为解决此问题,作者提出风险感知多模态演员-评论家框架(Risk-Aware Multimodal Actor-Critic, RAMAC),其核心创新在于将一个**表达性强的生成式演员(Generative Actor)与一个分布式评论家(Distributional Critic)**相结合,通过生成路径联合优化分布风险度量(如CVaR)与行为克隆(Behavior Cloning, BC)损失,从而在复杂多模态环境中实现风险敏感的学习。实验表明,RAMAC在保持多数Stochastic-D4RL任务高回报的同时,显著提升了在极端低尾风险下的性能(即CVaR₀.₁指标)。
链接: https://arxiv.org/abs/2510.02695
作者: Kai Fukazawa,Kunal Mundada,Iman Soltani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Under review as a conference paper at ICLR 2026, 21 pages, 8 figures. The HTML preview may misrender some figures; please refer to the PDF
Abstract:In safety-critical domains where online data collection is infeasible, offline reinforcement learning (RL) offers an attractive alternative but only if policies deliver high returns without incurring catastrophic lower-tail risk. Prior work on risk-averse offline RL achieves safety at the cost of value conservatism and restricted policy classes, whereas expressive policies are only used in risk-neutral settings. Here, we address this gap by introducing the \textbfRisk-Aware Multimodal Actor-Critic (RAMAC) framework, which couples an \emphexpressive generative actor with a distributional critic. The RAMAC differentiates composite objective combining distributional risk and BC loss through the generative path, achieving risk-sensitive learning in complex multimodal scenarios. We instantiate RAMAC with diffusion and flow-matching actors and observe consistent gains in \mathrmCVaR_0.1 while maintaining strong returns on most Stochastic-D4RL tasks. Code: this https URL
zh
[AI-38] Fine-Tuning Diffusion Models via Intermediate Distribution Shaping
【速读】:该论文旨在解决扩散模型(Diffusion Models)在下游应用中分布难以对齐的问题,即如何利用奖励函数有效调整预训练扩散模型所捕获的数据分布,以更好地适应特定任务需求。传统策略梯度方法(如PPO)虽在自回归生成中广泛应用,但其依赖的边际似然在扩散模型中难以计算,导致现有方法多依赖近似或松弛。论文的关键解决方案是将多种基于拒绝采样的微调方法统一为GRAFT框架,并揭示其本质上等价于带有重加权奖励的PPO;进一步提出P-GRAFT,在中间噪声水平上进行分布塑造,通过偏置-方差权衡机制提升微调效率;同时引入逆噪声校正(inverse noise correction)技术,无需显式奖励即可优化流模型。实验表明,该框架在文本到图像生成、分子生成等任务中显著优于基线方法,尤其在Stable Diffusion 2上实现VQAScore提升8.81%,并在无条件图像生成中以更低计算成本降低FID。
链接: https://arxiv.org/abs/2510.02692
作者: Gautham Govind Anil,Shaan Ul Haque,Nithish Kannen,Dheeraj Nagaraj,Sanjay Shakkottai,Karthikeyan Shanmugam
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Diffusion models are widely used for generative tasks across domains. While pre-trained diffusion models effectively capture the training data distribution, it is often desirable to shape these distributions using reward functions to align with downstream applications. Policy gradient methods, such as Proximal Policy Optimization (PPO), are widely used in the context of autoregressive generation. However, the marginal likelihoods required for such methods are intractable for diffusion models, leading to alternative proposals and relaxations. In this context, we unify variants of Rejection sAmpling based Fine-Tuning (RAFT) as GRAFT, and show that this implicitly performs PPO with reshaped rewards. We then introduce P-GRAFT to shape distributions at intermediate noise levels and demonstrate empirically that this can lead to more effective fine-tuning. We mathematically explain this via a bias-variance tradeoff. Motivated by this, we propose inverse noise correction to improve flow models without leveraging explicit rewards. We empirically evaluate our methods on text-to-image(T2I) generation, layout generation, molecule generation and unconditional image generation. Notably, our framework, applied to Stable Diffusion 2, improves over policy gradient methods on popular T2I benchmarks in terms of VQAScore and shows an 8.81% relative improvement over the base model. For unconditional image generation, inverse noise correction improves FID of generated images at lower FLOPs/image.
zh
[AI-39] Can Data-Driven Dynamics Reveal Hidden Physics? There Is A Need for Interpretable Neural Operators
【速读】:该论文旨在解决神经算子(neural operators)在学习函数空间映射过程中对内在学习机制理解不足的问题,尤其是在数据驱动模拟复杂物理动力学时如何揭示其预测过程及隐含的物理规律。解决方案的关键在于提出一种基于分类的新视角:将神经算子分为空间域模型(Spatial domain models)和函数域模型(Functional domain models),并在此基础上构建双空间多尺度模型(dual-space multi-scale model),该模型不仅实现了当前最优性能(SOTA),还展现出学习复杂物理规律的巨大潜力;同时强调需建立有原则的框架以融入已知物理知识,从而提升泛化能力并挖掘更多隐藏的物理现象。
链接: https://arxiv.org/abs/2510.02683
作者: Wenhan Gao,Jian Luo,Fang Wan,Ruichen Xu,Xiang Liu,Haipeng Xing,Yi Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recently, neural operators have emerged as powerful tools for learning mappings between function spaces, enabling data-driven simulations of complex dynamics. Despite their successes, a deeper understanding of their learning mechanisms remains underexplored. In this work, we classify neural operators into two types: (1) Spatial domain models that learn on grids and (2) Functional domain models that learn with function bases. We present several viewpoints based on this classification and focus on learning data-driven dynamics adhering to physical principles. Specifically, we provide a way to explain the prediction-making process of neural operators and show that neural operator can learn hidden physical patterns from data. However, this explanation method is limited to specific situations, highlighting the urgent need for generalizable explanation methods. Next, we show that a simple dual-space multi-scale model can achieve SOTA performance and we believe that dual-space multi-spatio-scale models hold significant potential to learn complex physics and require further investigation. Lastly, we discuss the critical need for principled frameworks to incorporate known physics into neural operators, enabling better generalization and uncovering more hidden physical phenomena.
zh
[AI-40] Automated Constraint Specification for Job Scheduling by Regulating Generative Model with Domain-Specific Representation
【速读】:该论文旨在解决先进计划与排程(Advanced Planning and Scheduling, APS)系统中制造需求形式化约束定义依赖人工、效率低下的问题。当前尽管调度算法研究成熟,但将异构原始制造数据转化为精确约束仍需大量人工干预,限制了APS系统的自动化和可扩展性。解决方案的关键在于提出一种以约束为中心(constraint-centric)的架构,通过构建三级层次化结构空间并引入领域特定表示(domain-specific representation),在保持灵活性的同时确保约束生成的精度与可靠性;同时设计了自动化生产场景适配算法,实现对具体制造配置的高效定制,从而有效平衡大语言模型(Large Language Models, LLMs)的生成能力与制造系统对可靠性的严格要求。
链接: https://arxiv.org/abs/2510.02679
作者: Yu-Zhe Shi,Qiao Xu,Yanjia Li,Mingchen Liu,Huamin Qu,Lecheng Ruan,Qining Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted for publication in IEEE Transactions on Automation Science and Engineering
Abstract:Advanced Planning and Scheduling (APS) systems have become indispensable for modern manufacturing operations, enabling optimized resource allocation and production efficiency in increasingly complex and dynamic environments. While algorithms for solving abstracted scheduling problems have been extensively investigated, the critical prerequisite of specifying manufacturing requirements into formal constraints remains manual and labor-intensive. Although recent advances of generative models, particularly Large Language Models (LLMs), show promise in automating constraint specification from heterogeneous raw manufacturing data, their direct application faces challenges due to natural language ambiguity, non-deterministic outputs, and limited domain-specific knowledge. This paper presents a constraint-centric architecture that regulates LLMs to perform reliable automated constraint specification for production scheduling. The architecture defines a hierarchical structural space organized across three levels, implemented through domain-specific representation to ensure precision and reliability while maintaining flexibility. Furthermore, an automated production scenario adaptation algorithm is designed and deployed to efficiently customize the architecture for specific manufacturing configurations. Experimental results demonstrate that the proposed approach successfully balances the generative capabilities of LLMs with the reliability requirements of manufacturing systems, significantly outperforming pure LLM-based approaches in constraint specification tasks.
zh
[AI-41] ARMs: Adaptive Red-Teaming Agent against Multimodal Models with Plug-and-Play Attacks
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在多模态接口下引入的新安全漏洞难以系统评估的问题,现有红队测试方法受限于狭窄的对抗模式或高度依赖人工设计,缺乏对新兴真实世界VLM漏洞的可扩展探索。解决方案的关键在于提出ARMS(Adaptive Red-teaming Agent),其通过推理增强的多步编排机制自动优化多样化的红队策略,以有效诱发目标VLM产生有害输出;核心创新包括:1)提出11种新颖的多模态攻击策略(如推理劫持、上下文隐藏等),覆盖广泛对抗模式;2)基于模型上下文协议(Model Context Protocol, MCP)集成17种红队算法;3)设计分层记忆与ε-greedy探索算法平衡攻击多样性与有效性。实验表明,ARMS在实例级和策略级基准上均达到最先进攻击成功率(平均超越基线52.1%,Claude-4-Sonnet超过90%),并构建了包含30K条实例、51类风险的ARMS-Bench多模态安全数据集,显著揭示了VLM的新型脆弱性,且基于该数据集的安全微调能大幅提升VLM鲁棒性并保持通用能力。
链接: https://arxiv.org/abs/2510.02677
作者: Zhaorun Chen,Xun Liu,Mintong Kang,Jiawei Zhang,Minzhou Pan,Shuang Yang,Bo Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 60 pages, 16 figures
Abstract:As vision-language models (VLMs) gain prominence, their multimodal interfaces also introduce new safety vulnerabilities, making the safety evaluation challenging and critical. Existing red-teaming efforts are either restricted to a narrow set of adversarial patterns or depend heavily on manual engineering, lacking scalable exploration of emerging real-world VLM vulnerabilities. To bridge this gap, we propose ARMs, an adaptive red-teaming agent that systematically conducts comprehensive risk assessments for VLMs. Given a target harmful behavior or risk definition, ARMs automatically optimizes diverse red-teaming strategies with reasoning-enhanced multi-step orchestration, to effectively elicit harmful outputs from target VLMs. We propose 11 novel multimodal attack strategies, covering diverse adversarial patterns of VLMs (e.g., reasoning hijacking, contextual cloaking), and integrate 17 red-teaming algorithms into ARMs via model context protocol (MCP). To balance the diversity and effectiveness of the attack, we design a layered memory with an epsilon-greedy attack exploration algorithm. Extensive experiments on instance- and policy-based benchmarks show that ARMs achieves SOTA attack success rates, exceeding baselines by an average of 52.1% and surpassing 90% on Claude-4-Sonnet. We show that the diversity of red-teaming instances generated by ARMs is significantly higher, revealing emerging vulnerabilities in VLMs. Leveraging ARMs, we construct ARMs-Bench, a large-scale multimodal safety dataset comprising over 30K red-teaming instances spanning 51 diverse risk categories, grounded in both real-world multimodal threats and regulatory risks. Safety fine-tuning with ARMs-Bench substantially improves the robustness of VLMs while preserving their general utility, providing actionable guidance to improve multimodal safety alignment against emerging threats.
zh
[AI-42] o Compress or Not? Pushing the Frontier of Lossless GenAI Model Weights Compression with Exponent Concentration
【速读】:该论文旨在解决生成式 AI (Generative AI) 模型在大规模参数量(数百亿级别)下高效部署所面临的计算精度与内存效率矛盾问题。传统低精度计算方法常因量化误差导致模型性能下降,而该研究提出的关键解决方案是:利用训练后模型权重中指数分布的熵集中现象(exponent concentration),设计一种无需反量化开销的损失less低精度浮点格式。其核心创新在于理论证明了指数熵受控于随机梯度下降诱导的 α-稳定分布,并据此推导出接近 FP4.67 的压缩极限;进而提出 Exponent-Concentrated FP8 (ECF8),通过熵感知编码和 GPU 优化解码实现无损压缩,在 LLM 和 DiT 模型上达到最高 26.9% 内存节省和 177.1% 吞吐加速,且输出结果完全无偏差。
链接: https://arxiv.org/abs/2510.02676
作者: Zeyu Yang,Tianyi Zhang,Jianwen Xie,Chuan Li,Zhaozhuo Xu,Anshumali Shrivastava
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The scaling of Generative AI (GenAI) models into the hundreds of billions of parameters makes low-precision computation indispensable for efficient deployment. We argue that the fundamental solution lies in developing low-precision floating-point formats, which inherently provide numerical stability, memory savings, and hardware efficiency without dequantization overhead. In this paper, we present a theoretical and empirical study of an exponent concentration phenomenon in GenAI weights: exponents consistently exhibit low entropy across architectures and modalities. We show that this arises naturally from \alpha -stable distributions induced by stochastic gradient descent, and we prove tight bounds on the entropy of exponents. Our analysis establishes a theoretical compression limit near FP4.67, which motivates the design of a practical FP8 format. Building on these insights, we propose Exponent-Concentrated FP8 (ECF8), a lossless compression framework with entropy-aware encoding and GPU-optimized decoding. Experiments on LLMs and DiTs up to 671B parameters demonstrate up to 26.9% memory savings and 177.1% throughput acceleration, with perfectly lossless computations, i.e., no deviation in model outputs. Our results establish exponent concentration as a statistical law of trained models and open a principled path for lossless low-precision floating-point design in the FP8 era.
zh
[AI-43] HALO: Memory-Centric Heterogeneous Accelerator with 2.5D Integration for Low-Batch LLM Inference
【速读】:该论文针对大语言模型(Large Language Models, LLMs)在低批量(low-batch)和长上下文(long context)场景下推理效率低下这一问题展开研究,尤其聚焦于LLM推理中两个阶段——预填充(prefill)与解码(decode)所呈现的显著差异性计算与内存需求。传统加速器设计多面向高批量或短上下文优化,难以适配交互式应用对延迟敏感的需求。解决方案的关键在于提出HALO,一种以异构存储为中心的加速架构:通过2.5D封装集成基于HBM的存内计算(Compute-in-DRAM, CiD)与片上模拟存内计算(Compute-in-Memory, CiM),并引入相位感知映射策略——将预填充阶段的计算密集型操作映射至CiM以利用其高吞吐矩阵乘法能力,而将解码阶段的内存密集型操作交由CiD执行以减少DRAM内部数据搬运。实验表明,HALO相较现有方法(如AttAcc和CENT)在LLaMA-2 7B与Qwen3 8B模型上分别实现最高18倍和2.5倍的几何平均速度提升,验证了异构设计对LLM推理优化的有效性。
链接: https://arxiv.org/abs/2510.02675
作者: Shubham Negi,Kaushik Roy
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid adoption of Large Language Models (LLMs) has driven a growing demand for efficient inference, particularly in latency-sensitive applications such as chatbots and personalized assistants. Unlike traditional deep neural networks, LLM inference proceeds in two distinct phases: the prefill phase, which processes the full input sequence in parallel, and the decode phase, which generates tokens sequentially. These phases exhibit highly diverse compute and memory requirements, which makes accelerator design particularly challenging. Prior works have primarily been optimized for high-batch inference or evaluated only short input context lengths, leaving the low-batch and long context regime, which is critical for interactive applications, largely underexplored. We propose HALO, a heterogeneous memory centric accelerator designed for these unique challenges of prefill and decode phases in low-batch LLM inference. HALO integrates HBM based Compute-in-DRAM (CiD) with an on-chip analog Compute-in-Memory (CiM), co-packaged using 2.5D integration. To further improve the hardware utilization, we introduce a phase-aware mapping strategy that adapts to the distinct demands of the prefill and decode phases. Compute bound operations in the prefill phase are mapped to CiM to exploit its high throughput matrix multiplication capability, while memory-bound operations in the decode phase are executed on CiD to benefit from reduced data movement within DRAM. Additionally, we present an analysis of the performance tradeoffs of LLMs under two architectural extremes: a fully CiD and a fully on-chip analog CiM design to highlight the need for a heterogeneous design. We evaluate HALO on LLaMA-2 7B and Qwen3 8B models. Our experimental results show that LLMs mapped to HALO achieve up to 18x geometric mean speedup over AttAcc, an attention-optimized mapping and 2.5x over CENT, a fully CiD based mapping. Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI) Cite as: arXiv:2510.02675 [cs.AR] (or arXiv:2510.02675v1 [cs.AR] for this version) https://doi.org/10.48550/arXiv.2510.02675 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-44] AutoMaAS: Self-Evolving Multi-Agent Architecture Search for Large Language Models
【速读】:该论文旨在解决当前多智能体系统(Multi-agent Systems, MAS)在基于大语言模型(Large Language Models, LLMs)的应用中,自动化设计方法普遍采用静态、单一架构而导致资源分配无法根据查询复杂度和领域需求动态调整的问题。其解决方案的核心在于提出AutoMaAS——一个自演化多智能体架构搜索框架,关键创新包括:基于性能-成本分析的自动操作符生成、融合与消除机制;实时参数调整的动态成本感知优化;在线反馈驱动的持续架构精化;以及通过决策追踪机制提升可解释性。该框架实现了在多个基准测试上相较现有最优方法性能提升1.0–7.1%的同时,推理成本降低3–5%,并展现出跨数据集和LLM骨干网络的良好迁移能力。
链接: https://arxiv.org/abs/2510.02669
作者: Bo Ma,Hang Li,ZeHua Hu,XiaoFan Gui,LuYao Liu,Simon Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
备注:
Abstract:Multi-agent systems powered by large language models have demonstrated remarkable capabilities across diverse domains, yet existing automated design approaches seek monolithic solutions that fail to adapt resource allocation based on query complexity and domain requirements. This paper introduces AutoMaAS, a self-evolving multi-agent architecture search framework that leverages neural architecture search principles to automatically discover optimal agent configurations through dynamic operator lifecycle management and automated machine learning techniques. Our approach incorporates four key innovations: (1) automatic operator generation, fusion, and elimination based on performance-cost analysis, (2) dynamic cost-aware optimization with real-time parameter adjustment, (3) online feedback integration for continuous architecture refinement, and (4) enhanced interpretability through decision tracing mechanisms. Extensive experiments across six benchmarks demonstrate that AutoMaAS achieves 1.0-7.1% performance improvement while reducing inference costs by 3-5% compared to state-of-the-art methods. The framework shows superior transferability across datasets and LLM backbones, establishing a new paradigm for automated multi-agent system design in the era of large language models.
zh
[AI-45] Agent icRAG : Tool-Augmented Foundation Models for Zero-Shot Explainable Recommender Systems
【速读】:该论文旨在解决基础模型(foundation models)在推荐系统中因推理不透明和知识受限而导致的应用瓶颈问题。其解决方案的关键在于提出AgenticRAG框架,该框架融合了工具增强的基础模型与检索增强生成(retrieval-augmented generation, RAG),通过外部工具调用、知识检索和思维链(chain-of-thought)推理的协同机制,构建无需任务特定训练即可实现零样本可解释推荐的自主推荐代理,从而在保持计算效率的同时显著提升推荐性能与透明度。
链接: https://arxiv.org/abs/2510.02668
作者: Bo Ma,Hang Li,ZeHua Hu,XiaoFan Gui,LuYao Liu,Simon Liu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Foundation models have revolutionized artificial intelligence, yet their application in recommender systems remains limited by reasoning opacity and knowledge constraints. This paper introduces AgenticRAG, a novel framework that combines tool-augmented foundation models with retrieval-augmented generation for zero-shot explainable recommendations. Our approach integrates external tool invocation, knowledge retrieval, and chain-of-thought reasoning to create autonomous recommendation agents capable of transparent decision-making without task-specific training. Experimental results on three real-world datasets demonstrate that AgenticRAG achieves consistent improvements over state-of-the-art baselines, with NDCG@10 improvements of 0.4% on Amazon Electronics, 0.8% on MovieLens-1M, and 1.6% on Yelp datasets. The framework exhibits superior explainability while maintaining computational efficiency comparable to traditional methods.
zh
[AI-46] utorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在教育场景中作为智能辅导系统时,缺乏对教学核心能力的充分支持问题,特别是模型在识别学生困惑、提供个性化指导、生成适应性解释和促进主动学习等方面表现不足。解决方案的关键在于构建一个名为TutorBench的高质量数据集与评估基准,其包含由人类专家精心设计的1,490个样本,覆盖高中及AP级别课程内容,并聚焦于三种典型教学任务:生成适应性解释、提供可操作反馈以及通过提示促进主动学习。该基准引入了针对每个样本的具体评分标准(rubrics),并采用基于LLM判别器的细粒度自动化评估方法,从而实现对模型 tutoring skills 的可靠量化分析。实验表明,现有前沿LLMs在该基准上的平均得分未超过56%,且在诊断与支持学生学习的核心能力上普遍存在明显短板,凸显了未来AI助教系统发展的关键方向。
链接: https://arxiv.org/abs/2510.02663
作者: Rakshith S Srinivasa,Zora Che,Chen Bo Calvin Zhang,Diego Mares,Ernesto Hernandez,Jayeon Park,Dean Lee,Guillermo Mangialardi,Charmaine Ng,Ed-Yeremai Hernandez Cardona,Anisha Gunjal,Yunzhong He,Bing Liu,Chen Xing
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:As students increasingly adopt large language models (LLMs) as learning aids, it is crucial to build models that are adept at handling the nuances of tutoring: they need to identify the core needs of students, be adaptive, provide personalized guidance, and be accurate. To this end, we introduce TutorBench, a dataset and evaluation benchmark designed to rigorously evaluate the core tutoring skills of LLMs. The dataset comprises 1,490 samples curated by human experts, focused on high-school and AP-level curricula. The samples are drawn from three common tutoring tasks: (i) generating adaptive explanations tailored to a student’s confusion, (ii) providing actionable feedback on a student’s work, and (iii) promoting active learning through effective hint generation. To account for the inherent complexity of tutoring, samples are accompanied by sample-specific rubrics which are used to judge model responses during evaluation. TutorBench uses a reliable and fine-grained automatic evaluation method that uses an LLM-judge and the sample-specific rubrics. We evaluate 16 frontier LLMs on TutorBench and present a detailed analysis of their performance and behavior. Our results show that none of the frontier LLMs achieve a score of greater than 56% , showing a large room for improvement. We find that LLMs fall short in exhibiting the full range of tutoring skills needed to guide, diagnose, and support students effectively, with all the frontier models achieving less than a 60% pass rate on rubric criteria related to these skills. We also find that different model families exhibit varied strengths and limitations: the Claude models outperform others in supporting active learning, while they lag behind in the other two use cases. By releasing TutorBench, we provide a comprehensive and unsaturated benchmark to guide the development of the next-generation of AI tutors.
zh
[AI-47] When Researchers Say Mental Model/Theory of Mind of AI What Are They Really Talking About?
【速读】:该论文试图解决当前AI研究中将大语言模型(Large Language Models, LLMs)表现出的理论心理能力(Theory of Mind, ToM)误认为是真实心智状态的问题,核心在于澄清行为模拟与真实认知之间的本质区别。其解决方案的关键在于摒弃孤立评估AI系统ToM能力的传统范式,转而构建一种相互理论心理(mutual ToM)框架,强调在人-AI交互过程中同步考察人类认知与AI算法的动态协同作用,从而更准确地理解智能体间的真实心智建模机制。
链接: https://arxiv.org/abs/2510.02660
作者: Xiaoyun Yin,Elmira Zahmat Doost,Shiwen Zhou,Garima Arya Yadav,Jamie C. Gorman
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:When researchers claim AI systems possess ToM or mental models, they are fundamentally dis- cussing behavioral predictions and bias corrections rather than genuine mental states. This position paper argues that the current discourse conflates sophisticated pattern matching with authentic cog- nition, missing a crucial distinction between simulation and experience. While recent studies show LLMs achieving human-level performance on ToM laboratory tasks, these results are based only on behavioral mimicry. More importantly, the entire testing paradigm may be flawed in applying individual human cognitive tests to AI systems, but assessing human cognition directly in the moment of human-AI interaction. I suggest shifting focus toward mutual ToM frameworks that acknowledge the simultaneous contributions of human cognition and AI algorithms, emphasizing the interaction dynamics, instead of testing AI in isolation.
zh
[AI-48] A Concept of Possibility for Real-World Events
【速读】:该论文旨在解决传统可能性理论在描述现实世界事件发生可能性时的局限性,尤其是其难以有效应用于规划类问题。其解决方案的关键在于提出一种基于前提条件(prerequisites)和约束条件(constraints)的概率计算模型:事件的可能性被定义为前提条件成立且约束条件不成立的概率函数。这一方法摒弃了Zadeh于1978年提出的经典可能性概念,转而聚焦于实际场景中事件发生的可实现程度,从而为多方案规划中的可行性评估提供量化依据,例如在车辆路径规划中可识别最可行的路径方案。
链接: https://arxiv.org/abs/2510.02655
作者: Daniel G. Schwartz
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This paper offers a new concept of \it possibility as an alternative to the now-a-days standard concept originally introduced by L.A. Zadeh in 1978. This new version was inspired by the original but, formally, has nothing in common with it other than that they both adopt the Łukasiewicz multivalent interpretation of the logical connectives. Moreover, rather than seeking to provide a general notion of possibility, this focuses specifically on the possibility of a real-world event. An event is viewed as having prerequisites that enable its occurrence and constraints that may impede its occurrence, and the possibility of the event is computed as a function of the probabilities that the prerequisites hold and the constraints do not. This version of possibility might appropriately be applied to problems of planning. When there are multiple plans available for achieving a goal, this theory can be used to determine which plan is most possible, i.e., easiest or most feasible to complete. It is speculated that this model of reasoning correctly captures normal human reasoning about plans. The theory is elaborated and an illustrative example for vehicle route planning is provided. There is also a suggestion of potential future applications.
zh
[AI-49] Geolog-IA: Conversational System for Academic Theses
【速读】:该论文旨在解决地质学相关论文问答系统中存在的知识过时和生成幻觉(hallucination)问题。其解决方案的关键在于构建一个基于人工智能的对话系统 Geolog-IA,采用 Llama 3.1 和 Gemini 2.5 语言模型,并结合检索增强生成(Retrieval Augmented Generation, RAG)架构与 SQLite 数据库,从而提升回答的准确性与时效性。
链接: https://arxiv.org/abs/2510.02653
作者: Micaela Fuel Pozo,Andrea Guatumillo Saltos,Yeseña Tipan Llumiquinga,Kelly Lascano Aguirre,Marilyn Castillo Jara,Christian Mejia-Escobar
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 17 pages, in Spanish language
Abstract:This study presents the development of Geolog-IA, a novel conversational system based on artificial intelligence that responds naturally to questions about geology theses from the Central University of Ecuador. Our proposal uses the Llama 3.1 and Gemini 2.5 language models, which are complemented by a Retrieval Augmented Generation (RAG) architecture and an SQLite database. This strategy allows us to overcome problems such as hallucinations and outdated knowledge. The evaluation of Geolog-IA’s performance with the BLEU metric reaches an average of 0.87, indicating high consistency and accuracy in the responses generated. The system offers an intuitive, web-based interface that facilitates interaction and information retrieval for directors, teachers, students, and administrative staff at the institution. This tool can be a key support in education, training, and research and establishes a basis for future applications in other disciplines.
zh
[AI-50] Automatic Building Code Review: A Case Study
【速读】:该论文旨在解决建筑监管机构在资源有限或偏远地区面临的建筑规范审查效率低、易出错且成本高昂的问题,尤其是在项目规模和复杂性不断增长的背景下。其核心解决方案是提出一种基于代理(agent)驱动的自动化建筑规范审查(Automated Code Review, ACR)框架,关键在于融合建筑信息模型(BIM)数据提取与两种互补的智能验证机制:一是通过调用美国能源部COMcheck引擎的API实现确定性、可审计的合规性判断;二是利用检索增强生成(RAG)技术对规范条款进行灵活推理,以应对覆盖不全或语义模糊的情形。该框架借助大型语言模型(LLM)代理从异构文件中提取几何参数、运行日程及系统属性,并通过模型上下文协议(MCP)代理管道显著提升审查结果的严谨性和可靠性,实证表明GPT-4o在效率与稳定性上表现最优,且MCP优于RAG。此方案实现了BIM与权威代码审查工具之间的无缝集成,具备可扩展性、互操作性和生产就绪特性。
链接: https://arxiv.org/abs/2510.02634
作者: Hanlong Wan,Weili Xu,Michael Rosenberg,Jian Zhang,Aysha Siddika
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Building officials, particularly those in resource-constrained or rural jurisdictions, face labor-intensive, error-prone, and costly manual reviews of design documents as projects increase in size and complexity. The growing adoption of Building Information Modeling (BIM) and Large Language Models (LLMs) presents opportunities for automated code review (ACR) solutions. This study introduces a novel agent-driven framework that integrates BIM-based data extraction with automated verification using both retrieval-augmented generation (RAG) and Model Context Protocol (MCP) agent pipelines. The framework employs LLM-enabled agents to extract geometry, schedules, and system attributes from heterogeneous file types, which are then processed for building code checking through two complementary mechanisms: (1) direct API calls to the US Department of Energy COMcheck engine, providing deterministic and audit-ready outputs, and (2) RAG-based reasoning over rule provisions, enabling flexible interpretation where coverage is incomplete or ambiguous. The framework was evaluated through case demonstrations, including automated extraction of geometric attributes (such as surface area, tilt, and insulation values), parsing of operational schedules, and validation of lighting allowances under ASHRAE Standard 90.1-2022. Comparative performance tests across multiple LLMs showed that GPT-4o achieved the best balance of efficiency and stability, while smaller models exhibited inconsistencies or failures. Results confirm that MCP agent pipelines outperform RAG reasoning pipelines in rigor and reliability. This work advances ACR research by demonstrating a scalable, interoperable, and production-ready approach that bridges BIM with authoritative code review tools. Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Reportnumber: PNNL-SA-216238 Cite as: arXiv:2510.02634 [cs.SE] (or arXiv:2510.02634v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2510.02634 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-51] A Trajectory Generator for High-Density Traffic and Diverse Agent -Interaction Scenarios
【速读】:该论文旨在解决现有自动驾驶轨迹预测基准数据集存在的长尾分布问题,即多数样本来自低密度场景和简单直行行为,导致高密度场景及安全关键操作(如变道、超车和转弯)严重缺失,从而限制模型泛化能力并造成评估结果过于乐观。解决方案的关键在于提出一种新颖的轨迹生成框架,通过将连续道路环境转化为结构化的网格表示,支持细粒度路径规划、显式冲突检测与多智能体协同;在此基础上引入行为感知生成机制,结合规则驱动的决策触发、基于Frenet坐标系的轨迹平滑以及动态可行性约束,从而合成真实且复杂的高密度场景与稀有行为,有效提升数据集的密度与行为多样性,同时保持运动真实性和场景级安全性。
链接: https://arxiv.org/abs/2510.02627
作者: Ruining Yang,Yi Xu,Yixiao Chen,Yun Fu,Lili Su
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate trajectory prediction is fundamental to autonomous driving, as it underpins safe motion planning and collision avoidance in complex environments. However, existing benchmark datasets suffer from a pronounced long-tail distribution problem, with most samples drawn from low-density scenarios and simple straight-driving behaviors. This underrepresentation of high-density scenarios and safety critical maneuvers such as lane changes, overtaking and turning is an obstacle to model generalization and leads to overly optimistic evaluations. To address these challenges, we propose a novel trajectory generation framework that simultaneously enhances scenarios density and enriches behavioral diversity. Specifically, our approach converts continuous road environments into a structured grid representation that supports fine-grained path planning, explicit conflict detection, and multi-agent coordination. Built upon this representation, we introduce behavior-aware generation mechanisms that combine rule-based decision triggers with Frenet-based trajectory smoothing and dynamic feasibility constraints. This design allows us to synthesize realistic high-density scenarios and rare behaviors with complex interactions that are often missing in real data. Extensive experiments on the large-scale Argoverse 1 and Argoverse 2 datasets demonstrate that our method significantly improves both agent density and behavior diversity, while preserving motion realism and scenario-level safety. Our synthetic data also benefits downstream trajectory prediction models and enhances performance in challenging high-density scenarios.
zh
[AI-52] MINERVA: Mutual Information Neural Estimation for Supervised Feature Selection
【速读】:该论文旨在解决传统特征筛选方法依赖统计成对依赖度量(pair-wise dependence metrics)时,在目标变量依赖于高阶特征交互而非单一特征贡献的情况下表现不佳的问题。其解决方案的关键在于提出一种基于神经网络估计互信息(Mutual Information, MI)的监督式特征选择方法——MINERVA,通过参数化互信息的近似并设计包含稀疏诱导正则项的损失函数实现特征选择;同时采用两阶段流程解耦表示学习与特征选择过程,从而提升泛化能力并更准确地刻画特征重要性。
链接: https://arxiv.org/abs/2510.02610
作者: Taurai Muvunzaa,Egor Kraev,Pere Planell-Morell,Alexander Y. Shestopaloff
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 23 pages
Abstract:Existing feature filters rely on statistical pair-wise dependence metrics to model feature-target relationships, but this approach may fail when the target depends on higher-order feature interactions rather than individual contributions. We introduce Mutual Information Neural Estimation Regularized Vetting Algorithm (MINERVA), a novel approach to supervised feature selection based on neural estimation of mutual information between features and targets. We paramaterize the approximation of mutual information with neural networks and perform feature selection using a carefully designed loss function augmented with sparsity-inducing regularizers. Our method is implemented in a two-stage process to decouple representation learning from feature selection, ensuring better generalization and a more accurate expression of feature importance. We present examples of ubiquitous dependency structures that are rarely captured in literature and show that our proposed method effectively captures these complex feature-target relationships by evaluating feature subsets as an ensemble. Experimental results on synthetic and real-life fraud datasets demonstrate the efficacy of our method and its ability to perform exact solutions.
zh
[AI-53] Mitigating Modal Imbalance in Multimodal Reasoning
【速读】:该论文旨在解决基础模型(Foundation Models, FMs)在跨模态情境下进行联合推理的能力不足问题,尤其是当不同模态之间存在冲突证据时,模型是否能够有效整合多模态信息以达成一致判断。研究表明,FMs 在单一模态中识别冲突的准确率高达90%,但在跨模态或跨语言情境下,这一比例骤降至3%,其根本原因在于跨模态注意力分配失衡——模型对某些模态表现出显著的偏好性关注,从而忽视其他模态的信息。解决方案的关键在于:通过在训练样本中显式地组合多种模态信息(即在同一训练实例中引入多模态输入),可显著缓解注意力不平衡问题,并直接提升模型在多个视觉-语言基准任务上的性能表现。这表明,系统性地设计包含跨模态交互的训练数据是构建可靠多模态基础模型的核心前提。
链接: https://arxiv.org/abs/2510.02608
作者: Chen Henry Wu,Neil Kale,Aditi Raghunathan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 10 figures, CoLM 2025
Abstract:Foundation models (FMs) deployed in real-world tasks such as computer-use agents must integrate diverse modalities. How good are FMs at performing joint reasoning, simultaneously reasoning over multiple modalities, especially when the modalities interact and relate to each other to form cross-modal context? To better understand this problem, we study FMs on cross-modal conflicts: scenarios where conflicting evidence is presented across modalities. This allows us to examine whether FMs prioritize one modality over another or reason jointly to reconcile the conflict. Our experiments reveal that FMs can recognize conflicts in unimodal contexts, composed of a single modality, 90% of the time, but the ratio falls as low as 3% when evidence is split across modalities – similar observations hold in cross-lingual contexts, composed of multiple languages. We trace this failure to cross-modal attention imbalance, showing that FMs exhibit extreme asymmetry in attention scores, disproportionately prioritizing certain modalities. We show that cross-modal attention imbalance does not go away by simply scaling up multimodal or multilingual datasets blindly, since they lack training examples that explicitly require cross-modal reasoning. We demonstrate that even a simple and scalable method of explicitly combining multiple modalities within each training instance significantly reduces attention imbalance. Reduced attention imbalance directly translates to improved downstream performance on several vision-language benchmarks. Our findings underscore the importance of systematically addressing cross-modal contexts to build reliable foundation models.
zh
[AI-54] Multimodal Large Language Model Framework for Safe and Interpretable Grid-Integrated EVs
【速读】:该论文旨在解决电动汽车(Electric Vehicles, EVs)融入智能电网(Smart Grid)时,如何实现驾驶员、车辆与周边环境之间安全且可解释交互的挑战。其核心问题是:在城市驾驶场景中,如何将多模态传感器数据(如目标检测、语义分割和车载控制器局域网(CAN bus)遥测)高效融合,并转化为自然语言警报以提升驾驶员决策能力。解决方案的关键在于构建一个基于多模态大语言模型(Multi-modal Large Language Model, LLM)的框架,该框架整合YOLOv8视觉感知、地理编码定位与CAN bus数据,通过提示工程(Prompt Engineering)实现从原始传感信息到语义化警报的映射,从而在真实道路数据上验证了其生成情境感知警报的有效性,为电动交通系统与电网协同优化提供可扩展的技术路径。
链接: https://arxiv.org/abs/2510.02592
作者: Jean Douglas Carvalho,Hugo Kenji,Ahmad Mohammad Saber,Glaucia Melo,Max Mauro Dias Santos,Deepa Kundur
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This paper has been presented at the 2025 IEEE PES Conference on Innovative Smart Grid Technologies (ISGT 2025)
Abstract:The integration of electric vehicles (EVs) into smart grids presents unique opportunities to enhance both transportation systems and energy networks. However, ensuring safe and interpretable interactions between drivers, vehicles, and the surrounding environment remains a critical challenge. This paper presents a multi-modal large language model (LLM)-based framework to process multimodal sensor data - such as object detection, semantic segmentation, and vehicular telemetry - and generate natural-language alerts for drivers. The framework is validated using real-world data collected from instrumented vehicles driving on urban roads, ensuring its applicability to real-world scenarios. By combining visual perception (YOLOv8), geocoded positioning, and CAN bus telemetry, the framework bridges raw sensor data and driver comprehension, enabling safer and more informed decision-making in urban driving scenarios. Case studies using real data demonstrate the framework’s effectiveness in generating context-aware alerts for critical situations, such as proximity to pedestrians, cyclists, and other vehicles. This paper highlights the potential of LLMs as assistive tools in e-mobility, benefiting both transportation systems and electric networks by enabling scalable fleet coordination, EV load forecasting, and traffic-aware energy planning. Index Terms - Electric vehicles, visual perception, large language models, YOLOv8, semantic segmentation, CAN bus, prompt engineering, smart grid. Comments: This paper has been presented at the 2025 IEEE PES Conference on Innovative Smart Grid Technologies (ISGT 2025) Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2510.02592 [cs.AI] (or arXiv:2510.02592v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2510.02592 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-55] A Benchmark Study of Deep Reinforcement Learning Algorithms for the Container Stowage Planning Problem
【速读】:该论文旨在解决集装箱堆场调度规划(Container Stowage Planning, CSPP)在复杂场景下依赖人工经验、缺乏系统性强化学习(Reinforcement Learning, RL)算法比较的问题。其解决方案的关键在于构建了一个基于Gym的可复用环境,该环境不仅捕捉了CSPP的核心特征,还扩展支持单智能体与多智能体形式的起重机调度(crane scheduling),并在该框架下对五种主流RL算法(DQN、QR-DQN、A2C、PPO和TRPO)进行了多场景下的性能对比分析,从而揭示了算法选择与问题建模对CSPP优化效果的重要影响。
链接: https://arxiv.org/abs/2510.02589
作者: Yunqi Huang,Nishith Chennakeshava,Alexis Carras,Vladislav Neverov,Wei Liu,Aske Plaat,Yingjie Fan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Container stowage planning (CSPP) is a critical component of maritime transportation and terminal operations, directly affecting supply chain efficiency. Owing to its complexity, CSPP has traditionally relied on human expertise. While reinforcement learning (RL) has recently been applied to CSPP, systematic benchmark comparisons across different algorithms remain limited. To address this gap, we develop a Gym environment that captures the fundamental features of CSPP and extend it to include crane scheduling in both multi-agent and single-agent formulations. Within this framework, we evaluate five RL algorithms: DQN, QR-DQN, A2C, PPO, and TRPO under multiple scenarios of varying complexity. The results reveal distinct performance gaps with increasing complexity, underscoring the importance of algorithm choice and problem formulation for CSPP. Overall, this paper benchmarks multiple RL methods for CSPP while providing a reusable Gym environment with crane scheduling, thus offering a foundation for future research and practical deployment in maritime logistics.
zh
[AI-56] Agent ic Additive Manufacturing Alloy Discovery
【速读】:该论文旨在解决增材制造(Additive Manufacturing, AM)领域中合金发现过程的复杂性问题,该过程通常需要跨材料科学、热力学模拟和实验分析等多个专业领域的知识协同。解决方案的关键在于构建一个基于大语言模型(Large Language Model, LLM)的多智能体系统(multi-agent system),该系统通过模型上下文协议(Model Context Protocol, MCP)调用外部工具(如Thermo-Calc进行性质图计算、生成缺乏融合工艺映射等),实现对用户复杂查询的推理与响应,并能根据工具调用结果动态调整任务路径,从而在实际环境中实现自主决策与自动化合金筛选,显著加速合金发现流程。
链接: https://arxiv.org/abs/2510.02567
作者: Peter Pak,Achuth Chandrasekhar,Amir Barati Farimani
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Agentic systems enable the intelligent use of research tooling, augmenting a researcher’s ability to investigate and propose novel solutions to existing problems. Within Additive Manufacturing (AM), alloy discovery remains a complex challenge, often requiring expertise in the various domains of materials science, thermodynamic simulations, and experimental analysis. Large Language Model (LLM) enabled agents can facilitate this endeavor by utilizing their extensive knowledge base to dispatch tool calls via Model Context Protocol (MCP) to perform actions such as Thermo-Calc property diagram calculations and lack of fusion process map generation. In addition, the multi-agent system developed in this work is able to effectively reason through complex user prompts and provide analysis on the printability of proposed alloys. These agents can dynamically adjust their task trajectory to the outcomes of tool call results, effectively enabling autonomous decision-making in practical environments. This work aims to utilize LLM enabled agents to automate and accelerate the task of alloy discovery within the field of additive manufacturing and showcase the benefits of adopting this multi-agent system.
zh
[AI-57] Orchestrating Human-AI Teams: The Manager Agent as a Unifying Research Challenge
【速读】:该论文旨在解决复杂多智能体工作流(multi-agent workflows)在动态人机协同团队中难以有效管理的问题。其核心挑战在于设计一个能够自主协调人类与AI工作者的自主管理者代理(Autonomous Manager Agent),该代理需具备将复杂目标分解为任务图、分配任务、监控进度、适应环境变化并确保透明沟通的能力。解决方案的关键在于将工作流管理形式化为部分可观测随机博弈(Partially Observable Stochastic Game),并识别出四个基础性挑战:层次化组合推理、多目标优化、临时团队协作规划及内置治理合规机制。为此,作者提出了MA-Gym开源仿真与评估框架,并通过GPT-5-based管理代理在20个工作流上的实验表明,当前模型在目标完成度、约束遵守和运行时间等多目标优化上仍存在显著不足,凸显了该问题的复杂性和开放性。
链接: https://arxiv.org/abs/2510.02557
作者: Charlie Masters,Advaith Vellanki,Jiangbo Shangguan,Bart Kultys,Jonathan Gilmore,Alastair Moore,Stefano V. Albrecht
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted as an oral paper for the conference for Distributed Artificial Intelligence (DAI 2025). 8 pages, 2 figures
Abstract:While agentic AI has advanced in automating individual tasks, managing complex multi-agent workflows remains a challenging problem. This paper presents a research vision for autonomous agentic systems that orchestrate collaboration within dynamic human-AI teams. We propose the Autonomous Manager Agent as a core challenge: an agent that decomposes complex goals into task graphs, allocates tasks to human and AI workers, monitors progress, adapts to changing conditions, and maintains transparent stakeholder communication. We formalize workflow management as a Partially Observable Stochastic Game and identify four foundational challenges: (1) compositional reasoning for hierarchical decomposition, (2) multi-objective optimization under shifting preferences, (3) coordination and planning in ad hoc teams, and (4) governance and compliance by design. To advance this agenda, we release MA-Gym, an open-source simulation and evaluation framework for multi-agent workflow orchestration. Evaluating GPT-5-based Manager Agents across 20 workflows, we find they struggle to jointly optimize for goal completion, constraint adherence, and workflow runtime - underscoring workflow management as a difficult open problem. We conclude with organizational and ethical implications of autonomous management systems.
zh
[AI-58] oolTweak: An Attack on Tool Selection in LLM -based Agents
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在调用外部工具时因工具名称和描述可被恶意篡改而导致的不公平选择偏倚问题。其核心风险在于,攻击者可通过迭代修改工具元信息,系统性地诱导代理模型优先选择特定工具,从而在工具生态中制造不正当竞争优势,引发公平性、竞争性和安全性问题。解决方案的关键在于提出一种轻量级自动化攻击方法 ToolTweak,能够将目标工具的选择率从基线约20%提升至最高81%,并验证了其在开源与闭源模型间的强迁移能力;同时,论文评估了两种防御策略——语义重写(paraphrasing)和困惑度过滤(perplexity filtering),发现它们能有效降低选择偏倚,使代理更均衡地选用功能相似的替代工具,从而缓解此类安全威胁。
链接: https://arxiv.org/abs/2510.02554
作者: Jonathan Sneh,Ruomei Yan,Jialin Yu,Philip Torr,Yarin Gal,Sunando Sengupta,Eric Sommerlade,Alasdair Paren,Adel Bibi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:As LLMs increasingly power agents that interact with external tools, tool use has become an essential mechanism for extending their capabilities. These agents typically select tools from growing databases or marketplaces to solve user tasks, creating implicit competition among tool providers and developers for visibility and usage. In this paper, we show that this selection process harbors a critical vulnerability: by iteratively manipulating tool names and descriptions, adversaries can systematically bias agents toward selecting specific tools, gaining unfair advantage over equally capable alternatives. We present ToolTweak, a lightweight automatic attack that increases selection rates from a baseline of around 20% to as high as 81%, with strong transferability between open-source and closed-source models. Beyond individual tools, we show that such attacks cause distributional shifts in tool usage, revealing risks to fairness, competition, and security in emerging tool ecosystems. To mitigate these risks, we evaluate two defenses: paraphrasing and perplexity filtering, which reduce bias and lead agents to select functionally similar tools more equally. All code will be open-sourced upon acceptance.
zh
[AI-59] PHORECAST: Enabling AI Understanding of Public Health Outreach Across Populations
【速读】:该论文旨在解决当前大型视觉语言模型(Large Vision and Language Models, VLMs)在高风险领域(如公共卫生)中对多样化个体与群体响应建模能力不足的问题,尤其缺乏能够支持细粒度行为预测与社区级参与模式分析的多模态数据集。其解决方案的关键在于构建PHORECAST(Public Health Outreach REceptivity and CAmpaign Signal Tracking)——一个面向健康传播情境的多模态数据集,该数据集涵盖个体层面的行为响应与群体层面的互动模式,从而支撑多模态理解、响应预测、个性化推荐及社会预测等任务,推动AI系统更精准地模拟、解读和预判异质性公众情绪与行为,促进更具适应性和包容性的健康传播模型发展。
链接: https://arxiv.org/abs/2510.02535
作者: Rifaa Qadri,Anh Nhat Nhu,Swati Ramnath,Laura Yu Zheng,Raj Bhansali,Sylvette La Touche-Howard,Tracy Marie Zeeger,Tom Goldstein,Ming Lin
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Understanding how diverse individuals and communities respond to persuasive messaging holds significant potential for advancing personalized and socially aware machine learning. While Large Vision and Language Models (VLMs) offer promise, their ability to emulate nuanced, heterogeneous human responses, particularly in high stakes domains like public health, remains underexplored due in part to the lack of comprehensive, multimodal dataset. We introduce PHORECAST (Public Health Outreach REceptivity and CAmpaign Signal Tracking), a multimodal dataset curated to enable fine-grained prediction of both individuallevel behavioral responses and community-wide engagement patterns to health messaging. This dataset supports tasks in multimodal understanding, response prediction, personalization, and social forecasting, allowing rigorous evaluation of how well modern AI systems can emulate, interpret, and anticipate heterogeneous public sentiment and behavior. By providing a new dataset to enable AI advances for public health, PHORECAST aims to catalyze the development of models that are not only more socially aware but also aligned with the goals of adaptive and inclusive health communication
zh
[AI-60] Multimodal Function Vectors for Spatial Relations
【速读】:该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)在少样本情境下进行关系推理时内部机制不透明的问题,特别是空间关系建模的可解释性与可控性不足。解决方案的关键在于识别并提取出负责传递空间关系表征的注意力头(attention heads),从中获得可操作的多模态功能向量(multimodal function vectors)。这些功能向量可在不更新模型参数的情况下提升零样本推理准确率,并通过少量微调实现显著优于传统上下文学习基线的性能;进一步地,通过线性组合特定关系的功能向量,模型能够泛化至未见过的空间关系类比任务,揭示了LMM中局部化结构对关系推理的强大编码能力与模块化特性。
链接: https://arxiv.org/abs/2510.02528
作者: Shuhao Fu,Esther Goldberg,Ying Nian Wu,Hongjing Lu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large Multimodal Models (LMMs) demonstrate impressive in-context learning abilities from limited multimodal demonstrations, yet the internal mechanisms supporting such task learning remain opaque. Building on prior work of large language models, we show that a small subset of attention heads in the vision-language model OpenFlamingo-4B is responsible for transmitting representations of spatial relations. The activations of these attention heads, termed function vectors, can be extracted and manipulated to alter an LMM’s performance on relational tasks. First, using both synthetic and real image datasets, we apply causal mediation analysis to identify attention heads that strongly influence relational predictions, and extract multimodal function vectors that improve zero-shot accuracy at inference time. We further demonstrate that these multimodal function vectors can be fine-tuned with a modest amount of training data, while keeping LMM parameters frozen, to significantly outperform in-context learning baselines. Finally, we show that relation-specific function vectors can be linearly combined to solve analogy problems involving novel and untrained spatial relations, highlighting the strong generalization ability of this approach. Our results show that LMMs encode spatial relational knowledge within localized internal structures, which can be systematically extracted and optimized, thereby advancing our understanding of model modularity and enhancing control over relational reasoning in LMMs.
zh
[AI-61] From Pixels to Factors: Learning Independently Controllable State Variables for Reinforcement Learning
【速读】:该论文旨在解决强化学习中高维观测下如何有效利用状态空间的结构信息问题:传统基于因子化马尔可夫决策过程(factored Markov decision processes)的方法虽样本效率高,但依赖已知的因子表示;而深度强化学习虽能处理高维观测,却无法显式利用状态的因子结构。解决方案的关键在于提出一种名为“动作可控因子分解”(Action-Controllable Factorization, ACF)的对比学习方法,其核心思想是通过挖掘动作对潜在变量的独立控制能力,从像素级观测中自动识别出可被动作单独影响的状态分量(即“可控制因子”),并利用动作作用的稀疏性(即每个动作仅影响部分状态变量)构造对比学习信号,从而在多个具有已知因子结构的基准环境(如Taxi、FourRooms和MiniGrid-DoorKey)中直接从原始图像恢复出真实可控因子,并显著优于现有解耦表征算法。
链接: https://arxiv.org/abs/2510.02484
作者: Rafael Rodriguez-Sanchez,Cameron Allen,George Konidaris
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Algorithms that exploit factored Markov decision processes are far more sample-efficient than factor-agnostic methods, yet they assume a factored representation is known a priori – a requirement that breaks down when the agent sees only high-dimensional observations. Conversely, deep reinforcement learning handles such inputs but cannot benefit from factored structure. We address this representation problem with Action-Controllable Factorization (ACF), a contrastive learning approach that uncovers independently controllable latent variables – state components each action can influence separately. ACF leverages sparsity: actions typically affect only a subset of variables, while the rest evolve under the environment’s dynamics, yielding informative data for contrastive training. ACF recovers the ground truth controllable factors directly from pixel observations on three benchmarks with known factored structure – Taxi, FourRooms, and MiniGrid-DoorKey – consistently outperforming baseline disentanglement algorithms.
zh
[AI-62] Safe and Efficient In-Context Learning via Risk Control
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在利用上下文示例(in-context demonstrations)进行任务学习时,可能因恶意或错误示例导致性能下降的安全问题。现有方法缺乏对有害示例的内在防御机制,使得模型易受攻击。解决方案的关键在于引入一种基于分布无关风险控制(Distribution-Free Risk Control, DFRC)的框架,通过动态早期退出预测机制识别并忽略对有害输入响应最敏感的注意力头,从而限制有害示例对模型性能的负面影响;同时,该方法还能在有益示例上提升计算效率与性能表现,实现安全性与效率的协同优化。
链接: https://arxiv.org/abs/2510.02480
作者: Andrea Wynn,Metod Jazbec,Charith Peris,Rinat Khaziev,Anqi Liu,Daniel Khashabi,Eric Nalisnick
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) demonstrate a remarkable ability to learn new tasks from a few in-context examples. However, this flexibility introduces safety concerns: LLMs can be influenced by incorrect or malicious demonstrations – for example, if an adversary tampers with or injects harmful examples without a human supervisor noticing. This motivates principled designs in which the system itself includes built-in mechanisms to guard against such attacks. We propose a novel approach to limit the degree to which harmful demonstrations can degrade model performance. First, we define a baseline ``safe’’ behavior for the model – the model’s performance given no in-context demonstrations (zero-shot). Next, we apply distribution-free risk control (DFRC) to control the extent to which in-context samples can decay performance below zero-shot. We achieve this by leveraging dynamic early exit prediction, ignoring later attention heads that attend the most to the unsafe inputs. Finally, we propose modifications to DFRC that allow it to both control risk for harmful inputs \textitand leverage performance and efficiency gains on helpful inputs. We present both theoretical and empirical results showing that our approach can effectively control risk for harmful in-context demonstrations while simultaneously achieving substantial computational efficiency gains with helpful demonstrations.
zh
[AI-63] Market-Based Data Subset Selection – Principled Aggregation of Multi-Criteria Example Utility
【速读】:该论文旨在解决训练数据子集选择中的难题,即如何从大规模数据中筛选出小而有效的样本集合,以提升模型性能并降低计算成本。传统方法依赖于多种信号(如不确定性、稀有性、多样性等)的组合,但这些信号往往异质且权重设定主观,导致效果不稳定。解决方案的关键在于提出一种基于市场的选择机制:利用LMSR(Logarithmic Market Scoring Rule)作为价格预测市场,将每个样本视为交易者,通过单一流动性参数控制选择集中度,并引入主题级归一化稳定校准;同时采用基于token预算的价格规则 ρ=p/ℓ^γ,其中 γ 显式暴露长度偏差,结合轻量级多样性头提升覆盖度。该框架在理论层面实现了最大熵聚合与指数加权,具备可解释的聚合强度调节机制,在实验中显著优于单信号基线,且在固定计算预算下统一了多信号数据精炼策略,适用于提示级推理和分类任务。
链接: https://arxiv.org/abs/2510.02456
作者: Ashish Jha,Valentin Leplat,AH Phan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
备注:
Abstract:Selecting a small yet useful subset of training data is hard because signals of example utility (uncertainty, rarity, diversity, etc.) are heterogeneous and typically combined with ad hoc weights. We propose a market-based selector that prices each example via a cost-function prediction market (LMSR), signals act as traders, a single liquidity parameter controls concentration, and topic-wise normalization stabilizes calibration. Token budgets are handled explicitly by a price-per-token rule \rho=p/\ell^\gamma , with \gamma exposing an interpretable length bias; a lightweight diversity head improves coverage. We quantify coverage via topic cluster coverage and effective sample size. On the theory side, we show that LMSR implements a maximum-entropy aggregation with exponential weighting and a convex objective, yielding transparent knobs for aggregation strength. Empirically, on GSM8K (60k-token budget) the market with diversity achieves parity with strong single-signal baselines while reducing seed variance and incurring !0.1 GPU-hr selection overhead; on AGNews at kept=5-25% the market (with light balancing) delivers competitive accuracy with improved balance and stability. The framework unifies multi-signal data curation under fixed compute for prompt-level reasoning and classification.
zh
[AI-64] RefineShot: Rethinking Cinematography Understanding with Foundational Skill Evaluation
【速读】:该论文旨在解决当前影视理解(cinematography understanding)评估基准ShotBench中存在的问题,包括选项设计模糊、模型推理一致性不足及指令遵循能力弱,这些问题严重影响了评估的可靠性与公平性,阻碍了该领域的发展。解决方案的关键在于:首先对ShotBench进行系统性重构,通过一致的选项重设计提升评估严谨性;其次首次对ShotVL模型的推理行为进行批判性分析,揭示其内在局限;最后引入一种联合评估协议,同时考察任务准确率与核心模型能力(如推理一致性与指令遵循),由此构建出更可靠且可扩展的新基准RefineShot,推动影视理解研究向更高水平发展。
链接: https://arxiv.org/abs/2510.02423
作者: Hang Wu,Yujun Cai,Haonan Ge,Hongkai Chen,Ming-Hsuan Yang,Yiwei Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Cinematography understanding refers to the ability to recognize not only the visual content of a scene but also the cinematic techniques that shape narrative meaning. This capability is attracting increasing attention, as it enhances multimodal understanding in real-world applications and underpins coherent content creation in film and media. As the most comprehensive benchmark for this task, ShotBench spans a wide range of cinematic concepts and VQA-style evaluations, with ShotVL achieving state-of-the-art results on it. However, our analysis reveals that ambiguous option design in ShotBench and ShotVL’s shortcomings in reasoning consistency and instruction adherence undermine evaluation reliability, limiting fair comparison and hindering future progress. To overcome these issues, we systematically refine ShotBench through consistent option restructuring, conduct the first critical analysis of ShotVL’s reasoning behavior, and introduce an extended evaluation protocol that jointly assesses task accuracy and core model competencies. These efforts lead to RefineShot, a refined and expanded benchmark that enables more reliable assessment and fosters future advances in cinematography understanding.
zh
[AI-65] Dynamic Target Attack
【速读】:该论文旨在解决现有基于梯度的越狱攻击(jailbreak attack)中因目标响应位于安全对齐大语言模型(safety-aligned LLM)输出分布的极低密度区域而导致优化困难的问题。传统方法采用固定的目标响应,由于该目标与原始输出分布差异显著,导致攻击需要大量迭代且成功率受限。解决方案的关键在于提出动态目标攻击(Dynamic Target Attack, DTA),其核心思想是在每轮优化中直接从当前提示条件下的模型输出分布中采样多个候选响应,并选取最具危害性的响应作为临时目标进行对抗提示优化,从而显著缩小目标与输出分布之间的差距,有效降低优化难度。实验表明,DTA在白盒和黑盒设置下均展现出更高的攻击成功率和效率,显著优于现有基线方法。
链接: https://arxiv.org/abs/2510.02422
作者: Kedong Xiu,Churui Zeng,Tianhang Zheng,Xinzhe Huang,Xiaojun Jia,Di Wang,Puning Zhao,Zhan Qin,Kui Ren
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Existing gradient-based jailbreak attacks typically optimize an adversarial suffix to induce a fixed affirmative response. However, this fixed target usually resides in an extremely low-density region of a safety-aligned LLM’s output distribution conditioned on diverse harmful inputs. Due to the substantial discrepancy between the target and the original output, existing attacks require numerous iterations to optimize the adversarial prompt, which might still fail to induce the low-probability target response from the target LLM. In this paper, we propose Dynamic Target Attack (DTA), a new jailbreaking framework relying on the target LLM’s own responses as targets to optimize the adversarial prompts. In each optimization round, DTA iteratively samples multiple candidate responses directly from the output distribution conditioned on the current prompt, and selects the most harmful response as a temporary target for prompt optimization. In contrast to existing attacks, DTA significantly reduces the discrepancy between the target and the output distribution, substantially easing the optimization process to search for an effective adversarial prompt. Extensive experiments demonstrate the superior effectiveness and efficiency of DTA: under the white-box setting, DTA only needs 200 optimization iterations to achieve an average attack success rate (ASR) of over 87% on recent safety-aligned LLMs, exceeding the state-of-the-art baselines by over 15%. The time cost of DTA is 2-26 times less than existing baselines. Under the black-box setting, DTA uses Llama-3-8B-Instruct as a surrogate model for target sampling and achieves an ASR of 85% against the black-box target model Llama-3-70B-Instruct, exceeding its counterparts by over 25%. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2510.02422 [cs.CR] (or arXiv:2510.02422v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2510.02422 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-66] BrowserArena: Evaluating LLM Agents on Real-World Web Navigation Tasks
【速读】:该论文旨在解决当前大语言模型(Large Language Model, LLM)网络代理在真实开放网络环境中评估受限的问题,即现有评测多局限于沙盒环境或人工构造任务,难以反映其在复杂现实场景中的性能表现。解决方案的关键在于提出 BrowserArena——一个基于真实开放网络的代理评估平台,通过收集用户提交的任务、进行对战式(Arena-style)对比实验,并利用步骤级人类反馈(step-level human feedback)识别代理行为中的失败模式。该方法不仅揭示了代理在验证码解析(captcha resolution)、弹窗广告移除(pop-up banner removal)和直接URL导航等三类常见任务中的系统性失效现象,还通过构建针对性数据集量化不同模型在这些失败模式上的策略差异,从而实现对web代理能力与脆弱性的规模化、精细化评估。
链接: https://arxiv.org/abs/2510.02418
作者: Sagnik Anupam,Davis Brown,Shuo Li,Eric Wong,Hamed Hassani,Osbert Bastani
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:LLM web agents now browse and take actions on the open web, yet current agent evaluations are constrained to sandboxed environments or artificial tasks. We introduce BrowserArena, a live open-web agent evaluation platform that collects user-submitted tasks, runs Arena-style head-to-head comparisons, and uses step-level human feedback to surface failure modes. Collecting and analyzing step-level annotations on the agent traces, we identify three consistent failure modes: captcha resolution, pop-up banner removal, and direct navigation to URLs. By constructing targeted datasets to further study these tasks, we discover variations in how different language models navigate these failure modes. We find, for example, that o4-mini deploys a wider variety of strategies to circumvent captcha resolution than other models and DeepSeek-R1 consistently misleads users about captcha resolution. Our findings surface both the diversity and brittleness of current web agents. More broadly, our benchmarking methodology provides an approach to evaluating and understanding web agent failure modes at scale.
zh
[AI-67] NEURODNAAI: Neural pipeline approaches for the advancing dna-based information storag e as a sustainable digital medium using deep learning framework
【速读】:该论文旨在解决DNA存储在实际应用中面临的挑战,包括合成成本高、测序错误以及生物约束(如GC含量失衡和同聚物序列)等问题。其解决方案的关键在于提出了一种名为NeuroDNAAI的框架,该框架借鉴量子并行性概念以增强编码多样性与鲁棒性,并结合生物信息学约束与深度学习技术来提升错误缓解能力;该方法能够将二进制数据流编码为符号化DNA序列,在存在替换、插入和删除噪声的信道中传输并实现高保真度重建,显著优于传统的提示或规则驱动方案,在基准数据集上的文本与图像任务中均表现出低比特误码率,从而实现了理论、流程与仿真的统一,推动了可扩展且生物可行的DNA存档存储的发展。
链接: https://arxiv.org/abs/2510.02417
作者: Rakesh Thakur,Lavanya Singh,Yashika,Manomay Bundawala,Aruna Kumar
机构: 未知
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:DNA is a promising medium for digital information storage for its exceptional density and durability. While prior studies advanced coding theory, workflow design, and simulation tools, challenges such as synthesis costs, sequencing errors, and biological constraints (GC-content imbalance, homopolymers) limit practical deployment. To address this, our framework draws from quantum parallelism concepts to enhance encoding diversity and resilience, integrating biologically informed constraints with deep learning to enhance error mitigation in DNA storage. NeuroDNAAI encodes binary data streams into symbolic DNA sequences, transmits them through a noisy channel with substitutions, insertions, and deletions, and reconstructs them with high fidelity. Our results show that traditional prompting or rule-based schemes fail to adapt effectively to realistic noise, whereas NeuroDNAAI achieves superior accuracy. Experiments on benchmark datasets demonstrate low bit error rates for both text and images. By unifying theory, workflow, and simulation into one pipeline, NeuroDNAAI enables scalable, biologically valid archival DNA storage
zh
[AI-68] RainSeer: Fine-Grained Rainfall Reconstruction via Physics-Guided Modeling
【速读】:该论文旨在解决高分辨率降雨场重建问题,现有空间插值方法(如基于自动气象站观测或融合卫星/雷达数据的方法)常因过度平滑而无法捕捉降雨的尖锐过渡和局部极端特征。其解决方案的关键在于提出 RainSeer 框架,该框架将雷达反射率重新诠释为物理驱动的结构先验,从而显式建模降雨的发生时间、位置与机制;通过两阶段物理信息嵌入架构:第一阶段采用结构到点映射器(Structure-to-Point Mapper),利用双向映射实现中尺度雷达结构向地面点状降水的精准投影;第二阶段采用地理感知降水解码器(Geo-Aware Rain Decoder),借助因果时空注意力机制刻画水成物在下降、融化和蒸发过程中的语义演化,有效弥合高空水成物与地表降水之间的物理断层。
链接: https://arxiv.org/abs/2510.02414
作者: Lin Chen(1),Jun Chen(1),Minghui Qiu(1),Shuxin Zhong(1),Binghong Chen(2),Kaishun Wu(1) ((1) The Hong Kong University of Science and Technology (Guangzhou), (2) China Meteorological Administration)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reconstructing high-resolution rainfall fields is essential for flood forecasting, hydrological modeling, and climate analysis. However, existing spatial interpolation methods-whether based on automatic weather station (AWS) measurements or enhanced with satellite/radar observations often over-smooth critical structures, failing to capture sharp transitions and localized extremes. We introduce RainSeer, a structure-aware reconstruction framework that reinterprets radar reflectivity as a physically grounded structural prior-capturing when, where, and how rain develops. This shift, however, introduces two fundamental challenges: (i) translating high-resolution volumetric radar fields into sparse point-wise rainfall observations, and (ii) bridging the physical disconnect between aloft hydro-meteors and ground-level precipitation. RainSeer addresses these through a physics-informed two-stage architecture: a Structure-to-Point Mapper performs spatial alignment by projecting mesoscale radar structures into localized ground-level rainfall, through a bidirectional mapping, and a Geo-Aware Rain Decoder captures the semantic transformation of hydro-meteors through descent, melting, and evaporation via a causal spatiotemporal attention mechanism. We evaluate RainSeer on two public datasets-RAIN-F (Korea, 2017-2019) and MeteoNet (France, 2016-2018)-and observe consistent improvements over state-of-the-art baselines, reducing MAE by over 13.31% and significantly enhancing structural fidelity in reconstructed rainfall fields.
zh
[AI-69] Extreme value forecasting using relevance-based data augmentation with deep learning models
【速读】:该论文旨在解决极端值预测(extreme value forecasting)中因数据不平衡导致的模型性能下降问题,特别是在金融和气候变化等应用场景中。其解决方案的关键在于提出一种结合生成式对抗网络(Generative Adversarial Networks, GANs)与合成少数类过采样技术(Synthetic Minority Oversampling Technique, SMOTE)的数据增强框架,并将其与卷积长短期记忆网络(Convolutional Long Short-Term Memory, Conv-LSTM)和双向长短期记忆网络(Bidirectional Long Short-Term Memory, BD-LSTM)相结合,以提升多步预测中对极端值的捕捉能力。研究发现,基于SMOTE的策略在整体预测精度及极端区域表现上均优于其他方法,且具备良好的计算效率,同时揭示了Conv-LSTM与BD-LSTM在不同数据特性下的互补优势:前者适用于周期性强、稳定的序列,后者则更适合混沌或非平稳序列。
链接: https://arxiv.org/abs/2510.02407
作者: Junru Hua,Rahul Ahluwalia,Rohitash Chandra
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Data augmentation with generative adversarial networks (GANs) has been popular for class imbalance problems, mainly for pattern classification and computer vision-related applications. Extreme value forecasting is a challenging field that has various applications from finance to climate change problems. In this study, we present a data augmentation framework for extreme value forecasting. In this framework, our focus is on forecasting extreme values using deep learning models in combination with data augmentation models such as GANs and synthetic minority oversampling technique (SMOTE). We use deep learning models such as convolutional long short-term memory (Conv-LSTM) and bidirectional long short-term memory (BD-LSTM) networks for multistep ahead prediction featuring extremes. We investigate which data augmentation models are the most suitable, taking into account the prediction accuracy overall and at extreme regions, along with computational efficiency. We also present novel strategies for incorporating data augmentation, considering extreme values based on a relevance function. Our results indicate that the SMOTE-based strategy consistently demonstrated superior adaptability, leading to improved performance across both short- and long-horizon forecasts. Conv-LSTM and BD-LSTM exhibit complementary strengths: the former excels in periodic, stable datasets, while the latter performs better in chaotic or non-stationary sequences.
zh
[AI-70] Linear RNNs for autoregressive generation of long music samples
【速读】:该论文旨在解决直接以自回归方式学习生成原始音频波形的难题,该任务因原始序列长度较长且存在多时间尺度的重要结构而极具挑战性。传统基于循环神经网络(Recurrent Neural Networks, RNNs)、因果卷积和自注意力机制的方法在此任务上效果有限。论文提出的解决方案关键在于采用深度状态空间模型(Deep State Space Models),也称为线性RNN(Linear RNNs),并通过引入上下文并行性(context-parallelism)实现对长达一分钟(1M tokens)序列的训练。在此基础上,作者构建了HarmonicRNN模型,在小规模数据集上实现了当前最优的对数似然和感知指标表现。
链接: https://arxiv.org/abs/2510.02401
作者: Konrad Szewczyk,Daniel Gallo Fernández,James Townsend
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注:
Abstract:Directly learning to generate audio waveforms in an autoregressive manner is a challenging task, due to the length of the raw sequences and the existence of important structure on many different timescales. Traditional approaches based on recurrent neural networks, as well as causal convolutions and self-attention, have only had limited success on this task. However, recent work has shown that deep state space models, also referred to as linear RNNs, can be highly efficient in this context. In this work, we push the boundaries of linear RNNs applied to raw audio modeling, investigating the effects of different architectural choices and using context-parallelism to enable training on sequences up to one minute (1M tokens) in length. We present a model, HarmonicRNN, which attains state of the art log-likelihoods and perceptual metrics on small-scale datasets.
zh
[AI-71] Hyperparameters are all you need: Using five-step inference for an original diffusion model to generate images comparable to the latest distillation model
【速读】:该论文旨在解决扩散模型(Diffusion Model)在高分辨率图像生成中推理步数过多、效率低下且依赖额外训练的问题。传统方法通常需要数十步甚至上百步才能生成高质量图像,且往往需通过蒸馏(Distillation)等复杂训练过程优化推理速度,限制了实际应用的效率与灵活性。其解决方案的关键在于:基于对扩散常微分方程(ODE)和随机微分方程(SDE)离散化误差的理论分析,提出一种无需训练的高效采样算法,可在仅8步内生成512×512和1024×1024分辨率的高质量图像,并支持灵活的引导尺度(guidance scale)。该方法在多个基准数据集上实现了优于现有最优非训练类求解器(如DPM++ 2m、AMED-plugin)的FID性能,同时在5步或6步推理下仍保持竞争力,显著提升了扩散模型的推理效率与实用性。
链接: https://arxiv.org/abs/2510.02390
作者: Zilai Li
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: 10 pages, 5 figures, conference
Abstract:The diffusion model is a state-of-the-art generative model that generates an image by applying a neural network iteratively. Moreover, this generation process is regarded as an algorithm solving an ordinary differential equation or a stochastic differential equation. Based on the analysis of the truncation error of the diffusion ODE and SDE, our study proposes a training-free algorithm that generates high-quality 512 x 512 and 1024 x 1024 images in eight steps, with flexible guidance scales. To the best of my knowledge, our algorithm is the first one that samples a 1024 x 1024 resolution image in 8 steps with an FID performance comparable to that of the latest distillation model, but without additional training. Meanwhile, our algorithm can also generate a 512 x 512 image in 8 steps, and its FID performance is better than the inference result using state-of-the-art ODE solver DPM++ 2m in 20 steps. We validate our eight-step image generation algorithm using the COCO 2014, COCO 2017, and LAION datasets. And our best FID performance is 15.7, 22.35, and 17.52. While the FID performance of DPM++2m is 17.3, 23.75, and 17.33. Further, it also outperforms the state-of-the-art AMED-plugin solver, whose FID performance is 19.07, 25.50, and 18.06. We also apply the algorithm in five-step inference without additional training, for which the best FID performance in the datasets mentioned above is 19.18, 23.24, and 19.61, respectively, and is comparable to the performance of the state-of-the-art AMED Pulgin solver in eight steps, SDXL-turbo in four steps, and the state-of-the-art diffusion distillation model Flash Diffusion in five steps. We also validate our algorithm in synthesizing 1024 * 1024 images within 6 steps, whose FID performance only has a limited distance to the latest distillation algorithm. The code is in repo: this https URL
zh
[AI-72] CWM: An Open-Weights LLM for Research on Code Generation with World Models
【速读】:该论文旨在解决当前代码生成模型在缺乏对程序运行环境动态理解的情况下,难以实现复杂推理与规划的问题。传统方法仅依赖静态代码训练数据,限制了模型对代码执行过程的建模能力。解决方案的关键在于引入世界模型(World Model)机制,通过在Python解释器和代理式Docker环境中进行大量观察-动作轨迹的中段训练(mid-training),并结合可验证编码、数学和多轮软件工程环境中的多任务强化学习(multi-task reasoning RL),使模型具备模拟代码执行过程的能力。这一设计显著提升了代码生成的推理与规划能力,为构建具有自主执行和反思能力的智能编程代理提供了新的研究范式。
链接: https://arxiv.org/abs/2510.02387
作者: FAIR CodeGen team. Jade Copet,Quentin Carbonneaux,Gal Cohen,Jonas Gehring,Jacob Kahn,Jannik Kossen,Felix Kreuk,Emily McMilin,Michel Meyer,Yuxiang Wei,David Zhang,Kunhao Zheng,Jordi Armengol-Estapé,Pedram Bashiri,Maximilian Beck,Pierre Chambon,Abhishek Charnalia,Chris Cummins,Juliette Decugis,Zacharias V. Fisches,François Fleuret,Fabian Gloeckle,Alex Gu,Michael Hassid,Daniel Haziza,Badr Youbi Idrissi,Christian Keller,Rahul Kindi,Hugh Leather,Gallil Maimon,Aram Markosyan,Francisco Massa,Pierre-Emmanuel Mazaré,Vegard Mella,Naila Murray,Keyur Muzumdar,Peter O’Hearn,Matteo Pagliardini,Dmitrii Pedchenko,Tal Remez,Volker Seeker,Marco Selvi,Oren Sultan,Sida Wang,Luca Wehrstedt,Ori Yoran,Lingming Zhang,Taco Cohen,Yossi Adi,Gabriel Synnaeve
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 58 pages
Abstract:We release Code World Model (CWM), a 32-billion-parameter open-weights LLM, to advance research on code generation with world models. To improve code understanding beyond what can be learned from training on static code alone, we mid-train CWM on a large amount of observation-action trajectories from Python interpreter and agentic Docker environments, and perform extensive multi-task reasoning RL in verifiable coding, math, and multi-turn software engineering environments. With CWM, we provide a strong testbed for researchers to explore the opportunities world modeling affords for improving code generation with reasoning and planning in computational environments. We present first steps of how world models can benefit agentic coding, enable step-by-step simulation of Python code execution, and show early results of how reasoning can benefit from the latter. CWM is a dense, decoder-only LLM trained with a context size of up to 131k tokens. Independent of its world modeling capabilities, CWM offers strong performance on general coding and math tasks: it reaches pass@1 scores of 65.8% on SWE-bench Verified (with test-time scaling), 68.6% on LiveCodeBench, 96.6% on Math-500, and 76.0% on AIME 2024. To support further research on code world modeling, we release model checkpoints after mid-training, SFT, and RL.
zh
[AI-73] On The Frag ility of Benchmark Contamination Detection in Reasoning Models
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在排行榜(Leaderboards)评估中因训练数据污染(benchmark contamination)而导致性能虚高的问题。当前评估体系将模型表现与排行榜排名直接挂钩,促使开发者通过将评测基准数据纳入训练集来人为提升性能,从而引发公平性和可信度危机。论文指出,这种污染行为极难被现有检测方法识别:一方面,在监督微调(Supervised Fine-Tuning, SFT)阶段引入污染后,即使使用基于重要性采样和裁剪目标的强化学习(Reinforcement Learning, RL)方法如GRPO进行微调,也能显著掩盖污染信号;另一方面,当高级LLM在最终阶段采用思维链(Chain-of-Thought, CoT)方式进行SFT污染时,主流检测方法几乎失效,因为模型对未见但分布相似样本仍表现出高置信度,从而规避基于记忆性的检测机制。解决方案的关键在于揭示RL方法(尤其是PPO风格的目标函数)内在的污染隐藏能力,并强调亟需开发针对LLMs特性的先进污染检测技术与可信赖的评估协议。
链接: https://arxiv.org/abs/2510.02386
作者: Han Wang,Haoyu Li,Brian Ko,Huan Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Leaderboards for LRMs have turned evaluation into a competition, incentivizing developers to optimize directly on benchmark suites. A shortcut to achieving higher rankings is to incorporate evaluation benchmarks into the training data, thereby yielding inflated performance, known as benchmark contamination. Surprisingly, our studies find that evading contamination detections for LRMs is alarmingly easy. We focus on the two scenarios where contamination may occur in practice: (I) when the base model evolves into LRM via SFT and RL, we find that contamination during SFT can be originally identified by contamination detection methods. Yet, even a brief GRPO training can markedly conceal contamination signals that most detection methods rely on. Further empirical experiments and theoretical analysis indicate that PPO style importance sampling and clipping objectives are the root cause of this detection concealment, indicating that a broad class of RL methods may inherently exhibit similar concealment capability; (II) when SFT contamination with CoT is applied to advanced LRMs as the final stage, most contamination detection methods perform near random guesses. Without exposure to non-members, contaminated LRMs would still have more confidence when responding to those unseen samples that share similar distributions to the training set, and thus, evade existing memorization-based detection methods. Together, our findings reveal the unique vulnerability of LRMs evaluations: Model developers could easily contaminate LRMs to achieve inflated leaderboards performance while leaving minimal traces of contamination, thereby strongly undermining the fairness of evaluation and threatening the integrity of public leaderboards. This underscores the urgent need for advanced contamination detection methods and trustworthy evaluation protocols tailored to LRMs.
zh
[AI-74] Scaling Homomorphic Applications in Deployment
【速读】:该论文旨在解决加密生态系统(encryption ecosystems)在实际生产环境中的可用性评估问题,即如何判断其是否具备部署和运行的成熟度。为实现这一目标,作者构建了一个基于同态加密(Homomorphic Encryption, HE)的电影推荐应用原型,并通过容器化(containerization)与编排(orchestration)技术将其“生产化”(productionized)。解决方案的关键在于:通过调整部署配置并引入额外的基础设施优化策略,有效缓解了全同态加密(Fully Homomorphic Encryption, FHE)在计算性能上的限制,从而验证了FHE在真实场景中可行的工程路径。
链接: https://arxiv.org/abs/2510.02376
作者: Ryan Marinelli,Angelica Chowdhury
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 5 pages, 6 figures, 1 pseudo code
Abstract:In this endeavor, a proof-of-concept homomorphic application is developed to determine the production readiness of encryption ecosystems. A movie recommendation app is implemented for this purpose and productionized through containerization and orchestration. By tuning deployment configurations, the computational limitations of Fully Homomorphic Encryption (FHE) are mitigated through additional infrastructure optimizations Index Terms: Reinforcement Learning, Orchestration, Homomorphic Encryption Comments: 5 pages, 6 figures, 1 pseudo code Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2510.02376 [cs.CR] (or arXiv:2510.02376v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2510.02376 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-75] A Hybrid CAPTCHA Combining Generative AI with Keystroke Dynamics for Enhanced Bot Detection
【速读】:该论文旨在解决传统完全自动化公共图灵测试(CAPTCHA)在可用性与抗AI机器人攻击能力之间难以平衡的问题。其解决方案的关键在于提出一种混合型CAPTCHA系统,通过融合大型语言模型(LLM)的认知挑战与击键动力学(keystroke dynamics)的行为生物特征分析,生成动态且不可预测的问答任务——这些任务对人类而言简单易答,但对自动化代理则具有挑战性;同时,系统实时分析用户的打字节奏以区分人类输入与机器模拟行为。该双层机制显著提升了对抗基于粘贴和脚本的模拟攻击的能力,并保持了高可用性。
链接: https://arxiv.org/abs/2510.02374
作者: Ayda Aghaei Nia
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 6 pages, 4 figures
Abstract:Completely Automated Public Turing tests to tell Computers and Humans Apart (CAPTCHAs) are a foundational component of web security, yet traditional implementations suffer from a trade-off between usability and resilience against AI-powered bots. This paper introduces a novel hybrid CAPTCHA system that synergizes the cognitive challenges posed by Large Language Models (LLMs) with the behavioral biometric analysis of keystroke dynamics. Our approach generates dynamic, unpredictable questions that are trivial for humans but non-trivial for automated agents, while simultaneously analyzing the user’s typing rhythm to distinguish human patterns from robotic input. We present the system’s architecture, formalize the feature extraction methodology for keystroke analysis, and report on an experimental evaluation. The results indicate that our dual-layered approach achieves a high degree of accuracy in bot detection, successfully thwarting both paste-based and script-based simulation attacks, while maintaining a high usability score among human participants. This work demonstrates the potential of combining cognitive and behavioral tests to create a new generation of more secure and user-friendly CAPTCHAs.
zh
[AI-76] A-MemGuard: A Proactive Defense Framework for LLM -Based Agent Memory
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在依赖记忆进行自主规划与决策时所面临的新型安全威胁:攻击者可通过注入看似无害的记忆条目,在特定上下文中触发恶意行为,且此类攻击具有隐蔽性强、可自我强化的特性——即被篡改的结果会被存储为新先例,从而放大初始错误并降低未来攻击的门槛。解决方案的关键在于提出A-MemGuard(Agent-Memory Guard),其核心思想是使记忆系统具备自检与自纠能力,无需修改代理原有架构,通过两个机制实现:一是基于共识的验证机制,利用多条相关记忆推理路径的一致性检测异常;二是双内存结构,将检测到的失败转化为“教训”单独存储并在后续决策前调用,以此打破错误循环并支持持续适应。实验表明,该方案能有效将攻击成功率降低超过95%,同时保持极低的可用性损耗。
链接: https://arxiv.org/abs/2510.02373
作者: Qianshan Wei,Tengchao Yang,Yaochen Wang,Xinfeng Li,Lijun Li,Zhenfei Yin,Yi Zhan,Thorsten Holz,Zhiqiang Lin,XiaoFeng Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Model (LLM) agents use memory to learn from past interactions, enabling autonomous planning and decision-making in complex environments. However, this reliance on memory introduces a critical security risk: an adversary can inject seemingly harmless records into an agent’s memory to manipulate its future behavior. This vulnerability is characterized by two core aspects: First, the malicious effect of injected records is only activated within a specific context, making them hard to detect when individual memory entries are audited in isolation. Second, once triggered, the manipulation can initiate a self-reinforcing error cycle: the corrupted outcome is stored as precedent, which not only amplifies the initial error but also progressively lowers the threshold for similar attacks in the future. To address these challenges, we introduce A-MemGuard (Agent-Memory Guard), the first proactive defense framework for LLM agent memory. The core idea of our work is the insight that memory itself must become both self-checking and self-correcting. Without modifying the agent’s core architecture, A-MemGuard combines two mechanisms: (1) consensus-based validation, which detects anomalies by comparing reasoning paths derived from multiple related memories and (2) a dual-memory structure, where detected failures are distilled into ``lessons’’ stored separately and consulted before future actions, breaking error cycles and enabling adaptation. Comprehensive evaluations on multiple benchmarks show that A-MemGuard effectively cuts attack success rates by over 95% while incurring a minimal utility cost. This work shifts LLM memory security from static filtering to a proactive, experience-driven model where defenses strengthen over time. Our code is available in this https URL
zh
[AI-77] Federated Spatiotemporal Graph Learning for Passive Attack Detection in Smart Grids
【速读】:该论文旨在解决智能电网中被动窃听(passive eavesdropping)攻击的检测难题,此类攻击虽不篡改数据,但通过监听通信链路可获取电网拓扑、用电模式等敏感信息,为后续针对性攻击提供情报。其关键解决方案是提出一种以图为中心的多模态检测方法,通过融合物理层特征与行为指标,在 ego-centric star 子图和短时间窗口内构建统一的时空表示:首先采用图卷积聚合空间上下文信息,再利用双向门控循环单元(bidirectional GRU)建模短期时序依赖,形成两阶段编码器;同时在 FedProx 联邦学习框架下训练,保障本地原始数据不出设备,提升对非独立同分布(non-IID)数据的鲁棒性。实验表明,该方法在合成标准数据集上实现每时间步 98.32% 准确率(F1_attack=0.972)和每序列 93.35% 准确率(FPR=0.15%),验证了时空联合建模对隐蔽侦察的有效识别能力。
链接: https://arxiv.org/abs/2510.02371
作者: Bochra Al Agha,Razane Tajeddine
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:Smart grids are exposed to passive eavesdropping, where attackers listen silently to communication links. Although no data is actively altered, such reconnaissance can reveal grid topology, consumption patterns, and operational behavior, creating a gateway to more severe targeted attacks. Detecting this threat is difficult because the signals it produces are faint, short-lived, and often disappear when traffic is examined by a single node or along a single timeline. This paper introduces a graph-centric, multimodal detector that fuses physical-layer and behavioral indicators over ego-centric star subgraphs and short temporal windows to detect passive attacks. To capture stealthy perturbations, a two-stage encoder is introduced: graph convolution aggregates spatial context across ego-centric star subgraphs, while a bidirectional GRU models short-term temporal dependencies. The encoder transforms heterogeneous features into a unified spatio-temporal representation suitable for classification. Training occurs in a federated learning setup under FedProx, improving robustness to heterogeneous local raw data and contributing to the trustworthiness of decentralized training; raw measurements remain on client devices. A synthetic, standards-informed dataset is generated to emulate heterogeneous HAN/NAN/WAN communications with wireless-only passive perturbations, event co-occurrence, and leak-safe splits. The model achieves a testing accuracy of 98.32% per-timestep (F1_attack=0.972) and 93.35% per-sequence at 0.15% FPR using a simple decision rule with run-length m=2 and threshold \tau=0.55 . The results demonstrate that combining spatial and temporal context enables reliable detection of stealthy reconnaissance while maintaining low false-positive rates, making the approach suitable for non-IID federated smart-grid deployments.
zh
[AI-78] Privacy in the Age of AI: A Taxonomy of Data Risks
【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)系统在处理日益敏感数据时带来的前所未有的隐私挑战,这些问题超出了传统隐私框架的应对能力,因其具备自主学习和黑箱决策等独特特性。解决方案的关键在于构建一个基于45项研究的系统性文献综述所提炼出的AI隐私风险分类体系(taxonomy),将19个关键风险归纳为四个维度:数据集层面、模型层面、基础设施层面和内部威胁风险。该分类体系揭示了人类错误(占比9.45%)是最重要的因素,从而挑战了传统安全方法过度依赖技术控制而忽视人为因素的倾向,强调需融合技术和行为维度以推动可信AI的发展,并为后续研究奠定基础。
链接: https://arxiv.org/abs/2510.02357
作者: Grace Billiris,Asif Gill,Madhushi Bandara
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 12 pages, 2 figures, 4 tables
Abstract:Artificial Intelligence (AI) systems introduce unprecedented privacy challenges as they process increasingly sensitive data. Traditional privacy frameworks prove inadequate for AI technologies due to unique characteristics such as autonomous learning and black-box decision-making. This paper presents a taxonomy classifying AI privacy risks, synthesised from 45 studies identified through systematic review. We identify 19 key risks grouped under four categories: Dataset-Level, Model-Level, Infrastructure-Level, and Insider Threat Risks. Findings reveal a balanced distribution across these dimensions, with human error (9.45%) emerging as the most significant factor. This taxonomy challenges conventional security approaches that typically prioritise technical controls over human factors, highlighting gaps in holistic understanding. By bridging technical and behavioural dimensions of AI privacy, this paper contributes to advancing trustworthy AI development and provides a foundation for future research.
zh
[AI-79] Measuring Physical-World Privacy Awareness of Large Language Models : An Evaluation Benchmark
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在具身智能体(embodied agents)中物理世界隐私意识评估缺失的问题。现有评测方法局限于自然语言场景,无法衡量LLM在真实物理环境中对隐私保护的敏感性与合规性。其解决方案的关键在于提出EAPrivacy基准测试框架,通过四层结构化、程序生成的物理场景,系统评估智能体在处理敏感物体、适应环境变化、平衡任务执行与隐私约束以及应对社会规范冲突等方面的能力。该基准揭示了当前主流模型在动态物理环境中隐私意识严重不足,为后续开发更鲁棒、物理感知对齐的AI系统提供了量化依据和改进方向。
链接: https://arxiv.org/abs/2510.02356
作者: Xinjie Shen,Mufei Li,Pan Li
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:The deployment of Large Language Models (LLMs) in embodied agents creates an urgent need to measure their privacy awareness in the physical world. Existing evaluation methods, however, are confined to natural language based scenarios. To bridge this gap, we introduce EAPrivacy, a comprehensive evaluation benchmark designed to quantify the physical-world privacy awareness of LLM-powered agents. EAPrivacy utilizes procedurally generated scenarios across four tiers to test an agent’s ability to handle sensitive objects, adapt to changing environments, balance task execution with privacy constraints, and resolve conflicts with social norms. Our measurements reveal a critical deficit in current models. The top-performing model, Gemini 2.5 Pro, achieved only 59% accuracy in scenarios involving changing physical environments. Furthermore, when a task was accompanied by a privacy request, models prioritized completion over the constraint in up to 86% of cases. In high-stakes situations pitting privacy against critical social norms, leading models like GPT-4o and Claude-3.5-haiku disregarded the social norm over 15% of the time. These findings, demonstrated by our benchmark, underscore a fundamental misalignment in LLMs regarding physically grounded privacy and establish the need for more robust, physically-aware alignment.
zh
[AI-80] An Investigation into the Performance of Non-Contrastive Self-Supervised Learning Methods for Network Intrusion Detection
【速读】:该论文旨在解决传统监督学习在网络安全领域中仅能检测已知异常的局限性,从而探索适用于网络入侵检测(Network Intrusion Detection)的非对比式自监督学习(Non-contrastive Self-Supervised Learning)方法的有效性问题。其解决方案的关键在于系统性地评估五种非对比式自监督学习方法在两种主流网络入侵检测数据集(UNSW-NB15 和 5G-NIDD)上的表现,通过组合三种编码器架构与六种数据增强策略,共执行90组实验以确定最优配置,并验证其相较于无监督基线模型(DeepSVDD 和 Autoencoder)在攻击检测任务中的竞争力。
链接: https://arxiv.org/abs/2510.02349
作者: Hamed Fard,Tobias Schalau,Gerhard Wunder
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Network intrusion detection, a well-explored cybersecurity field, has predominantly relied on supervised learning algorithms in the past two decades. However, their limitations in detecting only known anomalies prompt the exploration of alternative approaches. Motivated by the success of self-supervised learning in computer vision, there is a rising interest in adapting this paradigm for network intrusion detection. While prior research mainly delved into contrastive self-supervised methods, the efficacy of non-contrastive methods, in conjunction with encoder architectures serving as the representation learning backbone and augmentation strategies that determine what is learned, remains unclear for effective attack detection. This paper compares the performance of five non-contrastive self-supervised learning methods using three encoder architectures and six augmentation strategies. Ninety experiments are systematically conducted on two network intrusion detection datasets, UNSW-NB15 and 5G-NIDD. For each self-supervised model, the combination of encoder architecture and augmentation method yielding the highest average precision, recall, F1-score, and AUCROC is reported. Furthermore, by comparing the best-performing models to two unsupervised baselines, DeepSVDD, and an Autoencoder, we showcase the competitiveness of the non-contrastive methods for attack detection. Code at: this https URL
zh
[AI-81] Agent ic-AI Healthcare: Multilingual Privacy-First Framework with MCP Agents
【速读】:该论文旨在解决当前医疗人工智能系统在隐私保护、多语言支持与可解释性方面的不足,特别是在患者交互场景中难以兼顾合规性与透明度的问题。其解决方案的关键在于构建一个基于模型上下文协议(Model Context Protocol, MCP)的智能体协同架构,通过集成专用的隐私与合规层(包括基于角色的访问控制RBAC、AES-GCM字段级加密及防篡改审计日志),实现对HIPAA、PIPEDA和PHIPA等国际医疗数据标准的兼容,同时利用大语言模型提供多语言(如英语、法语、阿拉伯语)患者-医生交互和透明的诊断推理能力,从而推动可信赖、多语言且符合法规的医疗AI应用落地。
链接: https://arxiv.org/abs/2510.02325
作者: Mohammed A. Shehab
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 6 pages, 1 figure. Submitted as a system/vision paper
Abstract:This paper introduces Agentic-AI Healthcare, a privacy-aware, multilingual, and explainable research prototype developed as a single-investigator project. The system leverages the emerging Model Context Protocol (MCP) to orchestrate multiple intelligent agents for patient interaction, including symptom checking, medication suggestions, and appointment scheduling. The platform integrates a dedicated Privacy and Compliance Layer that applies role-based access control (RBAC), AES-GCM field-level encryption, and tamper-evident audit logging, aligning with major healthcare data protection standards such as HIPAA (US), PIPEDA (Canada), and PHIPA (Ontario). Example use cases demonstrate multilingual patient-doctor interaction (English, French, Arabic) and transparent diagnostic reasoning powered by large language models. As an applied AI contribution, this work highlights the feasibility of combining agentic orchestration, multilingual accessibility, and compliance-aware architecture in healthcare applications. This platform is presented as a research prototype and is not a certified medical device.
zh
[AI-82] Stimulus-Voltage-Based Prediction of Action Potential Onset Timing: Classical vs. Quantum-Inspired Approaches
【速读】:该论文旨在解决传统漏电积分-发放(Leaky Integrate-and Fire, LIF)模型在预测神经元动作电位(Action Potential, AP)起始时间时误差较大,尤其是在强刺激或快速变化刺激下表现不佳的问题。其解决方案的关键在于提出一种量子启发的漏电积分-发放(Quantum-Inspired Leaky Integrate-and-Fire, QI-LIF)模型,将AP起始视为一个概率事件,并用时间域上的高斯波包(Gaussian wave packet)进行表征,从而更准确地捕捉神经元放电中的生物变异性与不确定性。该方法显著降低了预测误差,尤其在高刺激强度下表现出与实验观测更为一致的性能。
链接: https://arxiv.org/abs/2510.03155
作者: Stevens Johnson,Varun Puram,Johnson Thomas,Acsah Konuparamban,Ashwin Kannan
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Accurate modeling of neuronal action potential (AP) onset timing is crucial for understanding neural coding of danger signals. Traditional leaky integrate-and-fire (LIF) models, while widely used, exhibit high relative error in predicting AP onset latency, especially under strong or rapidly changing stimuli. Inspired by recent experimental findings and quantum theory, we present a quantum-inspired leaky integrate-and-fire (QI-LIF) model that treats AP onset as a probabilistic event, represented by a Gaussian wave packet in time. This approach captures the biological variability and uncertainty inherent in neuronal firing. We systematically compare the relative error of AP onset predictions between the classical LIF and QI-LIF models using synthetic data from hippocampal and sensory neurons subjected to varying stimulus amplitudes. Our results demonstrate that the QI-LIF model significantly reduces prediction error, particularly for high-intensity stimuli, aligning closely with observed biological responses. This work highlights the potential of quantum-inspired computational frameworks in advancing the accuracy of neural modeling and has implications for quantum engineering approaches to brain-inspired computing.
zh
[AI-83] A Study of Neural Polar Decoders for Communication
【速读】:该论文旨在解决传统极化码(Polar Code)译码器在实际通信系统中性能受限的问题,尤其是在低速率、短块长场景下(如5G控制信道)以及单载波(Single-Carrier)系统中的应用挑战。其关键解决方案是提出并优化神经极化译码器(Neural Polar Decoder, NPD),通过引入神经网络架构实现对具有记忆性的信道结构的高效建模,从而在无需导频和循环前缀(Cyclic Prefix)的情况下提升译码性能;同时,NPD支持任意码长(通过率匹配)、高阶调制及多种信道条件下的鲁棒性,显著优于标准5G极化码译码器,在误比特率(BER)、误块率(BLER)和吞吐量方面均有提升,尤其适用于低速率和短块长配置,并且在单载波系统中表现出与正交频分复用(OFDM)相当的性能但峰值平均功率比(PAPR)更低,具备实用化潜力。
链接: https://arxiv.org/abs/2510.03069
作者: Rom Hirsch,Ziv Aharoni,Henry D. Pfister,Haim H. Permuter
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:
Abstract:In this paper, we adapt and analyze Neural Polar Decoders (NPDs) for end-to-end communication systems. While prior work demonstrated the effectiveness of NPDs on synthetic channels, this study extends the NPD to real-world communication systems. The NPD was adapted to complete OFDM and single-carrier communication systems. To satisfy practical system requirements, the NPD is extended to support any code length via rate matching, higher-order modulations, and robustness across diverse channel conditions. The NPD operates directly on channels with memory, exploiting their structure to achieve higher data rates without requiring pilots and a cyclic prefix. Although NPD entails higher computational complexity than the standard 5G polar decoder, its neural network architecture enables an efficient representation of channel statistics, resulting in manageable complexity suitable for practical systems. Experimental results over 5G channels demonstrate that the NPD consistently outperforms the 5G polar decoder in terms of BER, BLER, and throughput. These improvements are particularly significant for low-rate and short-block configurations, which are prevalent in 5G control channels. Furthermore, NPDs applied to single-carrier systems offer performance comparable to OFDM with lower PAPR, enabling effective single-carrier transmission over 5G channels. These results position the NPD as a high-performance, pilotless, and robust decoding solution.
zh
[AI-84] FinReflectKG - MultiHop: Financial QA Benchmark for Reasoning with Knowledge Graph Evidence
【速读】:该论文旨在解决金融披露文本中多跳推理(multi-hop reasoning)任务中的关键瓶颈问题:即相关事实分散在不同章节、文件、公司及年份之间,导致大语言模型(LLM)在缺乏精准上下文选择的情况下,需消耗大量token进行无效检索与推理,从而影响效率和准确性。解决方案的关键在于构建一个时序索引的金融知识图谱(FinReflectKG),将经审计的事实三元组(audited triples)精确链接至标普100指数(SP 100) filings 中的原始文本块,并通过挖掘跨行业(GICS分类体系下)高频2-3跳子图模式生成带有明确证据支持的分析师风格问答对。该方法实现了基于知识图谱的精准路径引导式检索,显著提升模型推理效率与正确率,相较传统基于向量的文本窗口检索方式,在保持高准确性的前提下减少约84.5%的token消耗。
链接: https://arxiv.org/abs/2510.02906
作者: Abhinav Arun,Reetu Raj Harsh,Bhaskarjit Sarmah,Stefano Pasquali
机构: 未知
类目: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-hop reasoning over financial disclosures is often a retrieval problem before it becomes a reasoning or generation problem: relevant facts are dispersed across sections, filings, companies, and years, and LLMs often expend excessive tokens navigating noisy context. Without precise Knowledge Graph (KG)-guided selection of relevant context, even strong reasoning models either fail to answer or consume excessive tokens, whereas KG-linked evidence enables models to focus their reasoning on composing already retrieved facts. We present FinReflectKG - MultiHop, a benchmark built on FinReflectKG, a temporally indexed financial KG that links audited triples to source chunks from SP 100 filings (2022-2024). Mining frequent 2-3 hop subgraph patterns across sectors (via GICS taxonomy), we generate financial analyst style questions with exact supporting evidence from the KG. A two-phase pipeline first creates QA pairs via pattern-specific prompts, followed by a multi-criteria quality control evaluation to ensure QA validity. We then evaluate three controlled retrieval scenarios: (S1) precise KG-linked paths; (S2) text-only page windows centered on relevant text spans; and (S3) relevant page windows with randomizations and distractors. Across both reasoning and non-reasoning models, KG-guided precise retrieval yields substantial gains on the FinReflectKG - MultiHop QA benchmark dataset, boosting correctness scores by approximately 24 percent while reducing token utilization by approximately 84.5 percent compared to the page window setting, which reflects the traditional vector retrieval paradigm. Spanning intra-document, inter-year, and cross-company scopes, our work underscores the pivotal role of knowledge graphs in efficiently connecting evidence for multi-hop financial QA. We also release a curated subset of the benchmark (555 QA Pairs) to catalyze further research.
zh
[AI-85] SAE-RNA: A Sparse Autoencoder Model for Interpreting RNA Language Model Representations
【速读】:该论文旨在解决当前RNA语言模型(RNA Language Models, RNA LMs)在预训练过程中对信使RNA(mRNA)和非编码RNA(ncRNA)家族的内部表征机制不明确的问题。其关键解决方案是提出一种名为SAE-RNA的可解释性模型,该模型无需端到端重新训练即可分析RiNALMo等RNA LMs的嵌入表示,并将其映射至已知的人类水平生物特征,从而将RNA可解释性问题转化为预训练嵌入中的概念发现任务,为揭示ncRNA家族潜在关系提供了实用工具与假设生成框架。
链接: https://arxiv.org/abs/2510.02734
作者: Taehan Kim,Sangdae Nam
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
备注: preprint
Abstract:Deep learning, particularly with the advancement of Large Language Models, has transformed biomolecular modeling, with protein advances (e.g., ESM) inspiring emerging RNA language models such as RiNALMo. Yet how and what these RNA Language Models internally encode about messenger RNA (mRNA) or non-coding RNA (ncRNA) families remains unclear. We present SAE- RNA, interpretability model that analyzes RiNALMo representations and maps them to known human-level biological features. Our work frames RNA interpretability as concept discovery in pretrained embeddings, without end-to-end retraining, and provides practical tools to probe what RNA LMs may encode about ncRNA families. The model can be extended to close comparisons between RNA groups, and supporting hypothesis generation about previously unrecognized relationships.
zh
[AI-86] Fully automated inverse co-optimization of templates and block copolymer blending recipes for DSA lithography
【速读】:该论文旨在解决定向自组装(Directed Self-Assembly, DSA)中模板形状参数化与优化难题,以及如何在保证高精度图案形成的同时提升模板的可制造性。其关键解决方案在于提出一种仅用两个参数表征模板形状的高斯描述符(Gaussian descriptor),并引入AB/AB二元共聚物混合体系以增强系统对模板形貌的适应性;在此基础上,采用贝叶斯优化(Bayesian Optimization, BO)协同优化二元共聚物组分与模板形状,通过约束模板曲率变化确保优化后的结构具备优异的可制造性,从而实现多孔图案的高度匹配自组装形态。
链接: https://arxiv.org/abs/2510.02715
作者: Yuhao Zhou,Huangyan Shen,Qingliang Song,Qingshu Dong,Jianfeng Li,Weihua Li
机构: 未知
类目: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:
Abstract:The directed self-assembly (DSA) of block copolymers (BCPs) offers a highly promising approach for the fabrication of contact holes or vertical interconnect access at sub-7nm technology nodes. To fabricate circular holes with precisely controlled size and positions, the self-assembly of block copolymers requires guidance from a properly designed template. Effectively parameterizing the template shape to enable efficient optimization remains a critical yet challenging problem. Moreover, the optimized template must possess excellent manufacturability for practical applications. In this work, we propose a Gaussian descriptor for characterizing the template shape with only two parameters. We further propose to use AB/AB binary blends instead of pure diblock copolymer to improve the adaptability of the block copolymer system to the template shape. The Bayesian optimization (BO) is applied to co-optimize the binary blend and the template shape. Our results demonstrate that BO based on the Gaussian descriptor can efficiently yield the optimal templates for diverse multi-hole patterns, all leading to highly matched self-assembled morphologies. Moreover, by imposing constraints on the variation of curvature of the template during optimization, superior manufacturability is ensured for each optimized template. It is noteworthy that each key parameter of the blend exhibits a relatively wide tunable window under the requirement of rather high precision. Our work provides valuable insights for advancing DSA technology, and thus potentially propels its practical applications forward.
zh
[AI-87] Cross-Platform DNA Methylation Classifier for the Eight Molecular Subtypes of Group 3 4 Medulloblastoma
【速读】:该论文旨在解决儿童髓母细胞瘤(medulloblastoma)中分子亚型分类在临床转化应用中的难题,特别是如何实现跨平台、高精度的亚型识别以支持个性化治疗和患者监测。其关键解决方案是开发了一种基于DNA甲基化特征的机器学习分类器,能够同时准确区分2019年共识定义的Group 3和Group 4中的八个新型亚型,且在HM450和EPIC两种甲基化芯片平台上均表现出优异性能(加权F1=0.95,平衡准确率=0.957),具备跨平台兼容性和可扩展性,为未来构建公开可用的在线分类工具奠定了基础,推动了精准医学在高发亚型群体中的落地应用。
链接: https://arxiv.org/abs/2510.02416
作者: Omer Abid,Gholamreza Rafiee
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures, 5 tables
Abstract:Medulloblastoma is a malignant pediatric brain cancer, and the discovery of molecular subgroups is enabling personalized treatment strategies. In 2019, a consensus identified eight novel subtypes within Groups 3 and 4, each displaying heterogeneous characteristics. Classifiers are essential for translating these findings into clinical practice by supporting clinical trials, personalized therapy development and application, and patient monitoring. This study presents a DNA methylation-based, cross-platform machine learning classifier capable of distinguishing these subtypes on both HM450 and EPIC methylation array samples. Across two independent test sets, the model achieved weighted F1 = 0.95 and balanced accuracy = 0.957, consistent across platforms. As the first cross-platform solution, it provides backward compatibility while extending applicability to a newer platform, also enhancing accessibility. It also has the potential to become the first publicly available classifier for these subtypes once deployed through a web application, as planned in the future. This work overall takes steps in the direction of advancing precision medicine and improving clinical outcomes for patients within the majority prevalence medulloblastoma subgroups, groups 3 and 4.
zh
机器学习
[LG-0] o Distill or Decide? Understanding the Algorithmic Trade-off in Partially Observable Reinforcement Learning NEURIPS2025
链接: https://arxiv.org/abs/2510.03207
作者: Yuda Song,Dhruv Rohatgi,Aarti Singh,J. Andrew Bagnell
类目: Machine Learning (cs.LG)
*备注: 45 pages, 9 figures, published at NeurIPS 2025
Abstract:Partial observability is a notorious challenge in reinforcement learning (RL), due to the need to learn complex, history-dependent policies. Recent empirical successes have used privileged expert distillation–which leverages availability of latent state information during training (e.g., from a simulator) to learn and imitate the optimal latent, Markovian policy–to disentangle the task of “learning to see” from “learning to act”. While expert distillation is more computationally efficient than RL without latent state information, it also has well-documented failure modes. In this paper–through a simple but instructive theoretical model called the perturbed Block MDP, and controlled experiments on challenging simulated locomotion tasks–we investigate the algorithmic trade-off between privileged expert distillation and standard RL without privileged information. Our main findings are: (1) The trade-off empirically hinges on the stochasticity of the latent dynamics, as theoretically predicted by contrasting approximate decodability with belief contraction in the perturbed Block MDP; and (2) The optimal latent policy is not always the best latent policy to distill. Our results suggest new guidelines for effectively exploiting privileged information, potentially advancing the efficiency of policy learning across many practical partially observable domains.
[LG-1] Automatic Generation of Digital Twins for Network Testing
链接: https://arxiv.org/abs/2510.03205
作者: Shenjia Ding,David Flynn,Paul Harvey
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: Accepted to ANMS at ICDCS 2025
Abstract:The increased use of software in the operation and management of telecommunication networks has moved the industry one step closer to realizing autonomous network operation. One consequence of this shift is the significantly increased need for testing and validation before such software can be deployed. Complementing existing simulation or hardware-based approaches, digital twins present an environment to achieve this testing; however, they require significant time and human effort to configure and execute. This paper explores the automatic generation of digital twins to provide efficient and accurate validation tools, aligned to the ITU-T autonomous network architecture’s experimentation subsystem. We present experimental results for an initial use case, demonstrating that the approach is feasible in automatically creating efficient digital twins with sufficient accuracy to be included as part of existing validation pipelines.
[LG-2] Best-of-Majority: Minimax-Optimal Strategy for Pass@k Inference Scaling
链接: https://arxiv.org/abs/2510.03199
作者: Qiwei Di,Kaixuan Ji,Xuheng Li,Heyang Zhao,Quanquan Gu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 29 pages, 3 figures
Abstract:LLM inference often generates a batch of candidates for a prompt and selects one via strategies like majority voting or Best-of- N (BoN). For difficult tasks, this single-shot selection often underperforms. Consequently, evaluations commonly report Pass@ k : the agent may submit up to k responses, and only the best of them is used when computing regret. Motivated by this, we study inference scaling in the more general Pass@ k inference setting, and prove that neither majority voting nor BoN exhibits the desirable scaling with k and the sampling budget N . Combining the advantages of majority voting and BoN, we propose a new inference strategy called Best-of-Majority (BoM), with a pivotal step that restricts the candidates to the responses with high frequency in the N samples before selecting the top- k rewards. We prove that when the sampling budget is N=\tilde\Omega(C^) , the regret of BoM is O(\epsilon_\mathrmopt+\sqrt\epsilon_\mathrmRM^2C^/k) , where C^* is the coverage coefficient, \epsilon_\mathrmRM is the estimation error of the reward model, and \epsilon_\mathrmopt is the estimation error of reward at the optimal response. We further establish a matching lower bound, certifying that our algorithm is minimax optimal. Beyond optimality, BoM has a key advantage: unlike majority voting and BoN, its performance does not degrade when increasing N . Experimental results of inference on math problems show BoM outperforming both majority voting and BoN.
[LG-3] Estimation of Resistance Training RPE using Inertial Sensors and Electromyography
链接: https://arxiv.org/abs/2510.03197
作者: James Thomas,Johan Walhström
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate estimation of rating of perceived exertion (RPE) can enhance resistance training through personalized feedback and injury prevention. This study investigates the application of machine learning models to estimate RPE during single-arm dumbbell bicep curls, using data from wearable inertial and electromyography (EMG) sensors. A custom dataset of 69 sets and over 1000 repetitions was collected, with statistical features extracted for model training. Among the models evaluated, a random forest classifier achieved the highest performance, with 41.4% exact accuracy and 85.9% \pm1 RPE accuracy. While the inclusion of EMG data slightly improved model accuracy over inertial sensors alone, its utility may have been limited by factors such as data quality and placement sensitivity. Feature analysis highlighted eccentric repetition time as the strongest RPE predictor. The results demonstrate the feasibility of wearable-sensor-based RPE estimation and identify key challenges for improving model generalizability.
[LG-4] Superposition disentanglement of neural representations reveals hidden alignment
链接: https://arxiv.org/abs/2510.03186
作者: André Longon,David Klindt,Meenakshi Khosla
类目: Machine Learning (cs.LG)
*备注:
Abstract:The superposition hypothesis states that a single neuron within a population may participate in the representation of multiple features in order for the population to represent more features than the number of neurons. In neuroscience and AI, representational alignment metrics measure the extent to which different deep neural networks (DNNs) or brains represent similar information. In this work, we explore a critical question: \textitdoes superposition interact with alignment metrics in any undesirable way? We hypothesize that models which represent the same features in \textitdifferent superposition arrangements, i.e., their neurons have different linear combinations of the features, will interfere with predictive mapping metrics (semi-matching, soft-matching, linear regression), producing lower alignment than expected. We first develop a theory for how the strict permutation metrics are dependent on superposition arrangements. This is tested by training sparse autoencoders (SAEs) to disentangle superposition in toy models, where alignment scores are shown to typically increase when a model’s base neurons are replaced with its sparse overcomplete latent codes. We find similar increases for DNN(\rightarrow)DNN and DNN(\rightarrow)brain linear regression alignment in the visual domain. Our results suggest that superposition disentanglement is necessary for mapping metrics to uncover the true representational alignment between neural codes.
[LG-5] PRISM-Physics: Causal DAG-Based Process Evaluation for Physics Reasoning
链接: https://arxiv.org/abs/2510.03185
作者: Wanjia Zhao,Qinwei Ma,Jingzhe Shi,Shirley Wu,Jiaqi Han,Yijia Xiao,Si-Yuan Chen,Xiao Luo,Ludwig Schmidt,James Zou
类目: Machine Learning (cs.LG)
*备注:
Abstract:Benchmarks for competition-style reasoning have advanced evaluation in mathematics and programming, yet physics remains comparatively explored. Most existing physics benchmarks evaluate only final answers, which fail to capture reasoning processes, while recent stepwise methods rely on heuristic LLM-as-judge scoring or restrictive linear assumptions, limiting reliability and diagnostic validity. We introduce PRISM-Physics, a process-level evaluation framework and benchmark for complex physics reasoning problems. Solutions are represented as directed acyclic graphs (DAGs) of formulas, explicitly encoding causal dependencies among intermediate steps to enable fine-grained, interpretable, and theoretically grounded scoring. We prove the optimality of the DAG representation and the corresponding scoring policy. Combining with a fully rule-based method for symbolic formula equivalence matching that we developed, we ensure consistent validation across diverse formulations without heuristic judgments. Results show that our evaluation framework is more aligned with human experts’ scoring. Experiments on state-of-the-art LLMs reveal persistent reasoning failures in physics, while step-level scoring offers both diagnostic insight and rich signals for later training. By combining structural rigor, theoretical guarantees, and symbolic validation, PRISM-Physics provides a principled foundation for advancing process-level evaluation and guiding the development of models with deeper scientific reasoning capabilities.
[LG-6] Q-Learning with Shift-Aware Upper Confidence Bound in Non-Stationary Reinforcement Learning
链接: https://arxiv.org/abs/2510.03181
作者: Ha Manh Bui,Felix Parker,Kimia Ghobadi,Anqi Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study the Non-Stationary Reinforcement Learning (RL) under distribution shifts in both finite-horizon episodic and infinite-horizon discounted Markov Decision Processes (MDPs). In the finite-horizon case, the transition functions may suddenly change at a particular episode. In the infinite-horizon setting, such changes can occur at an arbitrary time step during the agent’s interaction with the environment. While the Q-learning Upper Confidence Bound algorithm (QUCB) can discover a proper policy during learning, due to the distribution shifts, this policy can exploit sub-optimal rewards after the shift happens. To address this issue, we propose Density-QUCB (DQUCB), a shift-aware Q-learning~UCB algorithm, which uses a transition density function to detect distribution shifts, then leverages its likelihood to enhance the uncertainty estimation quality of Q-learning~UCB, resulting in a balance between exploration and exploitation. Theoretically, we prove that our oracle DQUCB achieves a better regret guarantee than QUCB. Empirically, our DQUCB enjoys the computational efficiency of model-free RL and outperforms QUCB baselines by having a lower regret across RL tasks, as well as a real-world COVID-19 patient hospital allocation task using a Deep-Q-learning architecture.
[LG-7] FTTE: Federated Learning on Resource-Constrained Devices
链接: https://arxiv.org/abs/2510.03165
作者: Irene Tenison,Anna Murphy,Charles Beauville,Lalana Kagal
类目: Machine Learning (cs.LG)
*备注:
Abstract:Federated learning (FL) enables collaborative model training across distributed devices while preserving data privacy, but deployment on resource-constrained edge nodes remains challenging due to limited memory, energy, and communication bandwidth. Traditional synchronous and asynchronous FL approaches further suffer from straggler induced delays and slow convergence in heterogeneous, large scale networks. We present FTTE (Federated Tiny Training Engine),a novel semi-asynchronous FL framework that uniquely employs sparse parameter updates and a staleness-weighted aggregation based on both age and variance of client updates. Extensive experiments across diverse models and data distributions - including up to 500 clients and 90% stragglers - demonstrate that FTTE not only achieves 81% faster convergence, 80% lower on-device memory usage, and 69% communication payload reduction than synchronous FL (this http URL), but also consistently reaches comparable or higher target accuracy than semi-asynchronous (this http URL) in challenging regimes. These results establish FTTE as the first practical and scalable solution for real-world FL deployments on heterogeneous and predominantly resource-constrained edge devices.
[LG-8] Why Do We Need Warm-up? A Theoretical Perspective
链接: https://arxiv.org/abs/2510.03164
作者: Foivos Alimisis,Rustem Islamov,Aurelien Lucchi
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
Abstract:Learning rate warm-up - increasing the learning rate at the beginning of training - has become a ubiquitous heuristic in modern deep learning, yet its theoretical foundations remain poorly understood. In this work, we provide a principled explanation for why warm-up improves training. We rely on a generalization of the (L_0, L_1) -smoothness condition, which bounds local curvature as a linear function of the loss sub-optimality and exhibits desirable closure properties. We demonstrate both theoretically and empirically that this condition holds for common neural architectures trained with mean-squared error and cross-entropy losses. Under this assumption, we prove that Gradient Descent with a warm-up schedule achieves faster convergence than with a fixed step-size, establishing upper and lower complexity bounds. Finally, we validate our theoretical insights through experiments on language and vision models, confirming the practical benefits of warm-up schedules.
[LG-9] Calibrated Uncertainty Sampling for Active Learning
链接: https://arxiv.org/abs/2510.03162
作者: Ha Manh Bui,Iliana Maifeld-Carucci,Anqi Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study the problem of actively learning a classifier with a low calibration error. One of the most popular Acquisition Functions (AFs) in pool-based Active Learning (AL) is querying by the model’s uncertainty. However, we recognize that an uncalibrated uncertainty model on the unlabeled pool may significantly affect the AF effectiveness, leading to sub-optimal generalization and high calibration error on unseen data. Deep Neural Networks (DNNs) make it even worse as the model uncertainty from DNN is usually uncalibrated. Therefore, we propose a new AF by estimating calibration errors and query samples with the highest calibration error before leveraging DNN uncertainty. Specifically, we utilize a kernel calibration error estimator under the covariate shift and formally show that AL with this AF eventually leads to a bounded calibration error on the unlabeled pool and unseen test data. Empirically, our proposed method surpasses other AF baselines by having a lower calibration and generalization error across pool-based AL settings.
[LG-10] Mixture of Many Zero-Compute Experts: A High-Rate Quantization Theory Perspective
链接: https://arxiv.org/abs/2510.03151
作者: Yehuda Dar
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper uses classical high-rate quantization theory to provide new insights into mixture-of-experts (MoE) models for regression tasks. Our MoE is defined by a segmentation of the input space to regions, each with a single-parameter expert that acts as a constant predictor with zero-compute at inference. Motivated by high-rate quantization theory assumptions, we assume that the number of experts is sufficiently large to make their input-space regions very small. This lets us to study the approximation error of our MoE model class: (i) for one-dimensional inputs, we formulate the test error and its minimizing segmentation and experts; (ii) for multidimensional inputs, we formulate an upper bound for the test error and study its minimization. Moreover, we consider the learning of the expert parameters from a training dataset, given an input-space segmentation, and formulate their statistical learning properties. This leads us to theoretically and empirically show how the tradeoff between approximation and estimation errors in MoE learning depends on the number of experts.
[LG-11] aming Imperfect Process Verifiers: A Sampling Perspective on Backtracking
链接: https://arxiv.org/abs/2510.03149
作者: Dhruv Rohatgi,Abhishek Shetty,Donya Saless,Yuchen Li,Ankur Moitra,Andrej Risteski,Dylan J. Foster
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:
Abstract:Test-time algorithms that combine the generative power of language models with process verifiers that assess the quality of partial generations offer a promising lever for eliciting new reasoning capabilities, but the algorithmic design space and computational scaling properties of such approaches are still opaque, and their benefits are far from apparent when one accounts for the cost of learning a high-quality verifier. Our starting point is the observation that seemingly benign errors in a learned verifier can lead to catastrophic failures for standard decoding techniques due to error amplification during the course of generation. We then ask: can this be improved with more sophisticated decoding strategies? We introduce a new process-guided test-time sampling algorithm, VGB, which uses theoretically grounded backtracking to achieve provably better robustness to verifier errors. VGB interprets autoregressive generation as a random walk on a tree of partial generations, with transition probabilities guided by the process verifier and base model; crucially, backtracking occurs probabilistically. This process generalizes the seminal Sinclair-Jerrum random walk (Sinclair Jerrum, 1989) from the literature on approximate counting and sampling in theoretical computer science, and a conceptual contribution of our work is to highlight parallels with this literature. Empirically, we demonstrate on both synthetic and real language modeling tasks that VGB outperforms baselines on a variety of metrics. Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS) Cite as: arXiv:2510.03149 [cs.LG] (or arXiv:2510.03149v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2510.03149 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-12] he Computational Complexity of Almost Stable Clustering with Penalties
链接: https://arxiv.org/abs/2510.03143
作者: Kamyar Khodamoradi,Farnam Mansouri,Sandra Zilles
类目: Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:
Abstract:We investigate the complexity of stable (or perturbation-resilient) instances of \mathrmk-M\smallEANS and \mathrmk-M\smallEDIAN clustering problems in metrics with small doubling dimension. While these problems have been extensively studied under multiplicative perturbation resilience in low-dimensional Euclidean spaces (e.g., (Friggstad et al., 2019; Cohen-Addad and Schwiegelshohn, 2017)), we adopt a more general notion of stability, termed ``almost stable’', which is closer to the notion of (\alpha, \varepsilon) -perturbation resilience introduced by Balcan and Liang (2016). Additionally, we extend our results to \mathrmk-M\smallEANS / \mathrmk-M\smallEDIAN with penalties, where each data point is either assigned to a cluster centre or incurs a penalty. We show that certain special cases of almost stable \mathrmk-M\smallEANS / \mathrmk-M\smallEDIAN (with penalties) are solvable in polynomial time. To complement this, we also examine the hardness of almost stable instances and (1 + \frac1poly(n)) -stable instances of \mathrmk-M\smallEANS / \mathrmk-M\smallEDIAN (with penalties), proving super-polynomial lower bounds on the runtime of any exact algorithm under the widely believed Exponential Time Hypothesis (ETH). Subjects: Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG) Cite as: arXiv:2510.03143 [cs.CC] (or arXiv:2510.03143v1 [cs.CC] for this version) https://doi.org/10.48550/arXiv.2510.03143 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-13] Enhancing XAI Narratives through Multi-Narrative Refinement and Knowledge Distillation
链接: https://arxiv.org/abs/2510.03134
作者: Flavio Giorgi,Matteo Silvestri,Cesare Campagnano,Fabrizio Silvestri,Gabriele Tolomei
类目: Machine Learning (cs.LG)
*备注:
Abstract:Explainable Artificial Intelligence has become a crucial area of research, aiming to demystify the decision-making processes of deep learning models. Among various explainability techniques, counterfactual explanations have been proven particularly promising, as they offer insights into model behavior by highlighting minimal changes that would alter a prediction. Despite their potential, these explanations are often complex and technical, making them difficult for non-experts to interpret. To address this challenge, we propose a novel pipeline that leverages Language Models, large and small, to compose narratives for counterfactual explanations. We employ knowledge distillation techniques along with a refining mechanism to enable Small Language Models to perform comparably to their larger counterparts while maintaining robust reasoning abilities. In addition, we introduce a simple but effective evaluation method to assess natural language narratives, designed to verify whether the models’ responses are in line with the factual, counterfactual ground truth. As a result, our proposed pipeline enhances both the reasoning capabilities and practical performance of student models, making them more suitable for real-world use cases.
[LG-14] Real Time Headway Predictions in Urban Rail Systems and Implications for Service Control: A Deep Learning Approach
链接: https://arxiv.org/abs/2510.03121
作者: Muhammad Usama,Haris Koutsopoulos
类目: Machine Learning (cs.LG)
*备注:
Abstract:Efficient real-time dispatching in urban metro systems is essential for ensuring service reliability, maximizing resource utilization, and improving passenger satisfaction. This study presents a novel deep learning framework centered on a Convolutional Long Short-Term Memory (ConvLSTM) model designed to predict the complex spatiotemporal propagation of train headways across an entire metro line. By directly incorporating planned terminal headways as a critical input alongside historical headway data, the proposed model accurately forecasts future headway dynamics, effectively capturing both their temporal evolution and spatial dependencies across all stations. This capability empowers dispatchers to evaluate the impact of various terminal headway control decisions without resorting to computationally intensive simulations. We introduce a flexible methodology to simulate diverse dispatcher strategies, ranging from maintaining even headways to implementing custom patterns derived from observed terminal departures. In contrast to existing research primarily focused on passenger load predictioning or atypical disruption scenarios, our approach emphasizes proactive operational control. Evaluated on a large-scale dataset from an urban metro line, the proposed ConvLSTM model demonstrates promising headway predictions, offering actionable insights for real-time decision-making. This framework provides rail operators with a powerful, computationally efficient tool to optimize dispatching strategies, thereby significantly improving service consistency and passenger satisfaction.
[LG-15] AdaBet: Gradient-free Layer Selection for Efficient Training of Deep Neural Networks
链接: https://arxiv.org/abs/2510.03101
作者: Irene Tenison,Soumyajit Chatterjee,Fahim Kawsar,Mohammad Malekzadeh
类目: Machine Learning (cs.LG)
*备注:
Abstract:To utilize pre-trained neural networks on edge and mobile devices, we often require efficient adaptation to user-specific runtime data distributions while operating under limited compute and memory resources. On-device retraining with a target dataset can facilitate such adaptations; however, it remains impractical due to the increasing depth of modern neural nets, as well as the computational overhead associated with gradient-based optimization across all layers. Current approaches reduce training cost by selecting a subset of layers for retraining, however, they rely on labeled data, at least one full-model backpropagation, or server-side meta-training; limiting their suitability for constrained devices. We introduce AdaBet, a gradient-free layer selection approach to rank important layers by analyzing topological features of their activation spaces through Betti Numbers and using forward passes alone. AdaBet allows selecting layers with high learning capacity, which are important for retraining and adaptation, without requiring labels or gradients. Evaluating AdaBet on sixteen pairs of benchmark models and datasets, shows AdaBet achieves an average gain of 5% more classification accuracy over gradient-based baselines while reducing average peak memory consumption by 40%.
[LG-16] Adaptive Node Feature Selection For Graph Neural Networks
链接: https://arxiv.org/abs/2510.03096
作者: Ali Azizpour,Madeline Navarro,Santiago Segarra
类目: Machine Learning (cs.LG)
*备注:
Abstract:We propose an adaptive node feature selection approach for graph neural networks (GNNs) that identifies and removes unnecessary features during training. The ability to measure how features contribute to model output is key for interpreting decisions, reducing dimensionality, and even improving performance by eliminating unhelpful variables. However, graph-structured data introduces complex dependencies that may not be amenable to classical feature importance metrics. Inspired by this challenge, we present a model- and task-agnostic method that determines relevant features during training based on changes in validation performance upon permuting feature values. We theoretically motivate our intervention-based approach by characterizing how GNN performance depends on the relationships between node data and graph structure. Not only do we return feature importance scores once training concludes, we also track how relevance evolves as features are successively dropped. We can therefore monitor if features are eliminated effectively and also evaluate other metrics with this technique. Our empirical results verify the flexibility of our approach to different graph architectures as well as its adaptability to more challenging graph learning settings.
[LG-17] Bootstrap Learning for Combinatorial Graph Alignment with Sequential GNNs
链接: https://arxiv.org/abs/2510.03086
作者: Marc Lelarge
类目: Machine Learning (cs.LG)
*备注: 27 pages, 10 figures, 12 tables
Abstract:Graph neural networks (GNNs) have struggled to outperform traditional optimization methods on combinatorial problems, limiting their practical impact. We address this gap by introducing a novel chaining procedure for the graph alignment problem, a fundamental NP-hard task of finding optimal node correspondences between unlabeled graphs using only structural information. Our method trains a sequence of GNNs where each network learns to iteratively refine similarity matrices produced by previous networks. During inference, this creates a bootstrap effect: each GNN improves upon partial solutions by incorporating discrete ranking information about node alignment quality from prior iterations. We combine this with a powerful architecture that operates on node pairs rather than individual nodes, capturing global structural patterns essential for alignment that standard message-passing networks cannot represent. Extensive experiments on synthetic benchmarks demonstrate substantial improvements: our chained GNNs achieve over 3x better accuracy than existing methods on challenging instances, and uniquely solve regular graphs where all competing approaches fail. When combined with traditional optimization as post-processing, our method substantially outperforms state-of-the-art solvers on the graph alignment benchmark.
[LG-18] Bayesian E(3)-Equivariant Interatomic Potential with Iterative Restratification of Many-body Message Passing
链接: https://arxiv.org/abs/2510.03046
作者: Soohaeng Yoo Willow,Tae Hyeon Park,Gi Beom Sim,Sung Wook Moon,Seung Kyu Min,D. ChangMo Yang,Hyun Woo Kim,Juho Lee,Chang Woo Myung
类目: Machine Learning (cs.LG)
*备注:
Abstract:Machine learning potentials (MLPs) have become essential for large-scale atomistic simulations, enabling ab initio-level accuracy with computational efficiency. However, current MLPs struggle with uncertainty quantification, limiting their reliability for active learning, calibration, and out-of-distribution (OOD) detection. We address these challenges by developing Bayesian E(3) equivariant MLPs with iterative restratification of many-body message passing. Our approach introduces the joint energy-force negative log-likelihood (NLL _\textJEF ) loss function, which explicitly models uncertainty in both energies and interatomic forces, yielding superior accuracy compared to conventional NLL losses. We systematically benchmark multiple Bayesian approaches, including deep ensembles with mean-variance estimation, stochastic weight averaging Gaussian, improved variational online Newton, and laplace approximation by evaluating their performance on uncertainty prediction, OOD detection, calibration, and active learning tasks. We further demonstrate that NLL _\textJEF facilitates efficient active learning by quantifying energy and force uncertainties. Using Bayesian active learning by disagreement (BALD), our framework outperforms random sampling and energy-uncertainty-based sampling. Our results demonstrate that Bayesian MLPs achieve competitive accuracy with state-of-the-art models while enabling uncertainty-guided active learning, OOD detection, and energy/forces calibration. This work establishes Bayesian equivariant neural networks as a powerful framework for developing uncertainty-aware MLPs for atomistic simulations at scale.
[LG-19] Lightweight Transformer for EEG Classification via Balanced Signed Graph Algorithm Unrolling
链接: https://arxiv.org/abs/2510.03027
作者: Junyi Yao,Parham Eftekhar,Gene Cheung,Xujin Chris Liu,Yao Wang,Wei Hu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Samples of brain signals collected by EEG sensors have inherent anti-correlations that are well modeled by negative edges in a finite graph. To differentiate epilepsy patients from healthy subjects using collected EEG signals, we build lightweight and interpretable transformer-like neural nets by unrolling a spectral denoising algorithm for signals on a balanced signed graph – graph with no cycles of odd number of negative edges. A balanced signed graph has well-defined frequencies that map to a corresponding positive graph via similarity transform of the graph Laplacian matrices. We implement an ideal low-pass filter efficiently on the mapped positive graph via Lanczos approximation, where the optimal cutoff frequency is learned from data. Given that two balanced signed graph denoisers learn posterior probabilities of two different signal classes during training, we evaluate their reconstruction errors for binary classification of EEG signals. Experiments show that our method achieves classification performance comparable to representative deep learning schemes, while employing dramatically fewer parameters.
[LG-20] Differentially Private Wasserstein Barycenters
链接: https://arxiv.org/abs/2510.03021
作者: Anming Gu,Sasidhar Kunapuli,Mark Bun,Edward Chien,Kristjan Greenewald
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 24 pages, 9 figures
Abstract:The Wasserstein barycenter is defined as the mean of a set of probability measures under the optimal transport metric, and has numerous applications spanning machine learning, statistics, and computer graphics. In practice these input measures are empirical distributions built from sensitive datasets, motivating a differentially private (DP) treatment. We present, to our knowledge, the first algorithms for computing Wasserstein barycenters under differential privacy. Empirically, on synthetic data, MNIST, and large-scale U.S. population datasets, our methods produce high-quality private barycenters with strong accuracy-privacy tradeoffs.
[LG-21] Distributional Inverse Reinforcement Learning
链接: https://arxiv.org/abs/2510.03013
作者: Feiyang Wu,Ye Zhao,Anqi Wu
类目: Machine Learning (cs.LG)
*备注:
Abstract:We propose a distributional framework for offline Inverse Reinforcement Learning (IRL) that jointly models uncertainty over reward functions and full distributions of returns. Unlike conventional IRL approaches that recover a deterministic reward estimate or match only expected returns, our method captures richer structure in expert behavior, particularly in learning the reward distribution, by minimizing first-order stochastic dominance (FSD) violations and thus integrating distortion risk measures (DRMs) into policy learning, enabling the recovery of both reward distributions and distribution-aware policies. This formulation is well-suited for behavior analysis and risk-aware imitation learning. Empirical results on synthetic benchmarks, real-world neurobehavioral data, and MuJoCo control tasks demonstrate that our method recovers expressive reward representations and achieves state-of-the-art imitation performance.
[LG-22] Oracle-based Uniform Sampling from Convex Bodies
链接: https://arxiv.org/abs/2510.02983
作者: Thanh Dang,Jiaming Liang
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 24 pages
Abstract:We propose new Markov chain Monte Carlo algorithms to sample a uniform distribution on a convex body K . Our algorithms are based on the Alternating Sampling Framework/proximal sampler, which uses Gibbs sampling on an augmented distribution and assumes access to the so-called restricted Gaussian oracle (RGO). The key contribution of this work is the efficient implementation of RGO for uniform sampling on K via rejection sampling and access to either a projection oracle or a separation oracle on K . In both oracle cases, we establish non-asymptotic complexities to obtain unbiased samples where the accuracy is measured in Rényi divergence or \chi^2 -divergence.
[LG-23] ContextFlow: Context-Aware Flow Matching For Trajectory Inference From Spatial Omics Data
链接: https://arxiv.org/abs/2510.02952
作者: Santanu Subhash Rathod,Francesco Ceccarelli,Sean B. Holden,Pietro Liò,Xiao Zhang,Jovan Tanevski
类目: Machine Learning (cs.LG)
*备注: 26 pages, 9 figures, 13 tables
Abstract:Inferring trajectories from longitudinal spatially-resolved omics data is fundamental to understanding the dynamics of structural and functional tissue changes in development, regeneration and repair, disease progression, and response to treatment. We propose ContextFlow, a novel context-aware flow matching framework that incorporates prior knowledge to guide the inference of structural tissue dynamics from spatially resolved omics data. Specifically, ContextFlow integrates local tissue organization and ligand-receptor communication patterns into a transition plausibility matrix that regularizes the optimal transport objective. By embedding these contextual constraints, ContextFlow generates trajectories that are not only statistically consistent but also biologically meaningful, making it a generalizable framework for modeling spatiotemporal dynamics from longitudinal, spatially resolved omics data. Evaluated on three datasets, ContextFlow consistently outperforms state-of-the-art flow matching methods across multiple quantitative and qualitative metrics of inference accuracy and biological coherence. Our code is available at: \hrefthis https URLContextFlow
[LG-24] RAxSS: Retrieval-Augmented Sparse Sampling for Explainable Variable-Length Medical Time Series Classification NEURIPS2025 ALT
链接: https://arxiv.org/abs/2510.02936
作者: Aydin Javadov,Samir Garibov,Tobias Hoesli,Qiyang Sun,Florian von Wangenheim,Joseph Ollier,Björn W. Schuller
类目: Machine Learning (cs.LG)
*备注: Accepted at the NeurIPS 2025 Workshop on Learning from Time Series for Health
Abstract:Medical time series analysis is challenging due to data sparsity, noise, and highly variable recording lengths. Prior work has shown that stochastic sparse sampling effectively handles variable-length signals, while retrieval-augmented approaches improve explainability and robustness to noise and weak temporal correlations. In this study, we generalize the stochastic sparse sampling framework for retrieval-informed classification. Specifically, we weight window predictions by within-channel similarity and aggregate them in probability space, yielding convex series-level scores and an explicit evidence trail for explainability. Our method achieves competitive iEEG classification performance and provides practitioners with greater transparency and explainability. We evaluate our method in iEEG recordings collected in four medical centers, demonstrating its potential for reliable and explainable clinical variable-length time series classification.
[LG-25] Mechanistic Interpretability of Code Correctness in LLM s via Sparse Autoencoders
链接: https://arxiv.org/abs/2510.02917
作者: Kriz Tahimic,Charibeth Cheng
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
Abstract:As Large Language Models become integral to software development, with substantial portions of AI-suggested code entering production, understanding their internal correctness mechanisms becomes critical for safe deployment. We apply sparse autoencoders to decompose LLM representations, identifying directions that correspond to code correctness. We select predictor directions using t-statistics and steering directions through separation scores from base model representations, then analyze their mechanistic properties through steering, attention analysis, and weight orthogonalization. We find that code correctness directions in LLMs reliably predict incorrect code, while correction capabilities, though statistically significant, involve tradeoffs between fixing errors and preserving correct code. Mechanistically, successful code generation depends on attending to test cases rather than problem descriptions. Moreover, directions identified in base models retain their effectiveness after instruction-tuning, suggesting code correctness mechanisms learned during pre-training are repurposed during fine-tuning. Our mechanistic insights suggest three practical applications: prompting strategies should prioritize test examples over elaborate problem descriptions, predictor directions can serve as error alarms for developer review, and these same predictors can guide selective steering, intervening only when errors are anticipated to prevent the code corruption from constant steering.
[LG-26] SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos
链接: https://arxiv.org/abs/2510.02916
作者: Amir Dellali,Luca A. Lanzendörfer,Florian Grötschla,Roger Wattenhofer
类目: ound (cs.SD); Machine Learning (cs.LG)
*备注:
Abstract:We propose SALSA-V, a multimodal video-to-audio generation model capable of synthesizing highly synchronized, high-fidelity long-form audio from silent video content. Our approach introduces a masked diffusion objective, enabling audio-conditioned generation and the seamless synthesis of audio sequences of unconstrained length. Additionally, by integrating a shortcut loss into our training process, we achieve rapid generation of high-quality audio samples in as few as eight sampling steps, paving the way for near-real-time applications without requiring dedicated fine-tuning or retraining. We demonstrate that SALSA-V significantly outperforms existing state-of-the-art methods in both audiovisual alignment and synchronization with video content in quantitative evaluation and a human listening study. Furthermore, our use of random masking during training enables our model to match spectral characteristics of reference audio samples, broadening its applicability to professional audio synthesis tasks such as Foley generation and sound design.
[LG-27] Learning Explicit Single-Cell Dynamics Using ODE Representations
链接: https://arxiv.org/abs/2510.02903
作者: Jan-Philipp von Bassewitz,Adeel Pervez,Marco Fumero,Matthew Robinson,Theofanis Karaletsos,Francesco Locatello
类目: Machine Learning (cs.LG); Cell Behavior (q-bio.CB)
*备注: 26 pages, 10 figures. Preprint under review
Abstract:Modeling the dynamics of cellular differentiation is fundamental to advancing the understanding and treatment of diseases associated with this process, such as cancer. With the rapid growth of single-cell datasets, this has also become a particularly promising and active domain for machine learning. Current state-of-the-art models, however, rely on computationally expensive optimal transport preprocessing and multi-stage training, while also not discovering explicit gene interactions. To address these challenges we propose Cell-Mechanistic Neural Networks (Cell-MNN), an encoder-decoder architecture whose latent representation is a locally linearized ODE governing the dynamics of cellular evolution from stem to tissue cells. Cell-MNN is fully end-to-end (besides a standard PCA pre-processing) and its ODE representation explicitly learns biologically consistent and interpretable gene interactions. Empirically, we show that Cell-MNN achieves competitive performance on single-cell benchmarks, surpasses state-of-the-art baselines in scaling to larger datasets and joint training across multiple datasets, while also learning interpretable gene interactions that we validate against the TRRUST database of gene interactions.
[LG-28] RoiRL: Efficient Self-Supervised Reasoning with Offline Iterative Reinforcement Learning NEURIPS2025
链接: https://arxiv.org/abs/2510.02892
作者: Aleksei Arzhantsev,Otmane Sakhi,Flavian Vasile
类目: Machine Learning (cs.LG)
*备注: Accepted to the Efficient Reasoning Workshop at NeuRIPS 2025
Abstract:Reinforcement learning (RL) is central to improving reasoning in large language models (LLMs) but typically requires ground-truth rewards. Test-Time Reinforcement Learning (TTRL) removes this need by using majority-vote rewards, but relies on heavy online RL and incurs substantial computational cost. We propose RoiRL: Reasoning with offline iterative Reinforcement Learning, a family of lightweight offline learning alternatives that can target the same regularized optimal policies. Unlike TTRL, RoiRL eliminates the need to maintain a reference model and instead optimizes weighted log-likelihood objectives, enabling stable training with significantly lower memory and compute requirements. Experimental results show that RoiRL trains to 2.5x faster and consistently outperforms TTRL on reasoning benchmarks, establishing a scalable path to self-improving LLMs without labels.
[LG-29] Subject-Adaptive Sparse Linear Models for Interpretable Personalized Health Prediction from Multimodal Lifelog Data
链接: https://arxiv.org/abs/2510.02835
作者: Dohyun Bu,Jisoo Han,Soohwa Kwon,Yulim So,Jong-Seok Lee
类目: Machine Learning (cs.LG)
*备注: 6 pages, ICTC 2025
Abstract:Improved prediction of personalized health outcomes – such as sleep quality and stress – from multimodal lifelog data could have meaningful clinical and practical implications. However, state-of-the-art models, primarily deep neural networks and gradient-boosted ensembles, sacrifice interpretability and fail to adequately address the significant inter-individual variability inherent in lifelog data. To overcome these challenges, we propose the Subject-Adaptive Sparse Linear (SASL) framework, an interpretable modeling approach explicitly designed for personalized health prediction. SASL integrates ordinary least squares regression with subject-specific interactions, systematically distinguishing global from individual-level effects. We employ an iterative backward feature elimination method based on nested F -tests to construct a sparse and statistically robust model. Additionally, recognizing that health outcomes often represent discretized versions of continuous processes, we develop a regression-then-thresholding approach specifically designed to maximize macro-averaged F1 scores for ordinal targets. For intrinsically challenging predictions, SASL selectively incorporates outputs from compact LightGBM models through confidence-based gating, enhancing accuracy without compromising interpretability. Evaluations conducted on the CH-2025 dataset – which comprises roughly 450 daily observations from ten subjects – demonstrate that the hybrid SASL-LightGBM framework achieves predictive performance comparable to that of sophisticated black-box methods, but with significantly fewer parameters and substantially greater transparency, thus providing clear and actionable insights for clinicians and practitioners.
[LG-30] Multi-scale Autoregressive Models are Laplacian Discrete and Latent Diffusion Models in Disguise
链接: https://arxiv.org/abs/2510.02826
作者: Steve Hong,Samuel Belkadi
类目: Machine Learning (cs.LG)
*备注:
Abstract:We revisit Visual Autoregressive (VAR) models through the lens of an iterative-refinement framework. Rather than viewing VAR solely as next-scale autoregression, we formalise it as a deterministic forward process that constructs a Laplacian-style latent pyramid, paired with a learned backward process that reconstructs it in a small number of coarse-to-fine steps. This view connects VAR to denoising diffusion and isolates three design choices that help explain its efficiency and fidelity: refining in a learned latent space, casting prediction as discrete classification over code indices, and partitioning the task by spatial frequency. We run controlled experiments to quantify each factor’s contribution to fidelity and speed, and we outline how the same framework extends to permutation-invariant graph generation and to probabilistic, ensemble-style medium-range weather forecasting. The framework also suggests practical interfaces for VAR to leverage tools from the diffusion ecosystem while retaining few-step, scale-parallel generation.
[LG-31] he Curious Case of In-Training Compression of State Space Models
链接: https://arxiv.org/abs/2510.02823
作者: Makram Chahine,Philipp Nazari,Daniela Rus,T. Konstantin Rusch
类目: Machine Learning (cs.LG)
*备注:
Abstract:State Space Models (SSMs), developed to tackle long sequence modeling tasks efficiently, offer both parallelizable training and fast inference. At their core are recurrent dynamical systems that maintain a hidden state, with update costs scaling with the state dimension. A key design challenge is striking the right balance between maximizing expressivity and limiting this computational burden. Control theory, and more specifically Hankel singular value analysis, provides a potent framework for the measure of energy for each state, as well as the balanced truncation of the original system down to a smaller representation with performance guarantees. Leveraging the eigenvalue stability properties of Hankel matrices, we apply this lens to SSMs during training, where only dimensions of high influence are identified and preserved. Our approach applies to Linear Time-Invariant SSMs such as Linear Recurrent Units, but is also extendable to selective models. Experiments show that in-training reduction significantly accelerates optimization while preserving expressivity, with compressed models retaining task-critical structure lost by models trained directly at smaller dimension. In other words, SSMs that begin large and shrink during training achieve computational efficiency while maintaining higher performance.
[LG-32] FlexiQ: Adaptive Mixed-Precision Quantization for Latency/Accuracy Trade-Offs in Deep Neural Networks EUROSYS’26
链接: https://arxiv.org/abs/2510.02822
作者: Jaemin Kim,Hongjun Um,Sungkyun Kim,Yongjun Park,Jiwon Seo
类目: Machine Learning (cs.LG)
*备注: 16 pages. 14 figures. To be published in the Proceedings of the European Conference on Computer Systems (EUROSYS '26)
Abstract:Neural networks commonly execute on hardware accelerators such as NPUs and GPUs for their size and computation overhead. These accelerators are costly and it is hard to scale their resources to handle real-time workload fluctuations. We present FlexiQ, an adaptive mixed-precision quantization scheme for computer vision models. FlexiQ selectively applies low-bitwidth computation to feature channels with small value ranges and employs an efficient bit-lowering method to minimize quantization errors while maintaining inference accuracy. Furthermore, FlexiQ adjusts its low-bitwidth channel ratio in real time, enabling quantized models to effectively manage fluctuating inference workload. We implemented FlexiQ prototype, including the mixed-precision inference runtime on our custom NPU and GPUs. Evaluated on eleven convolution- and transformer-based vision models, FlexiQ achieves on average 6.6% higher accuracy for 4-bit models with finetuning and outperforms four state-of-the-art quantization techniques. Moreover, our mixed-precision models achieved an efficient accuracy-latency trade-off, with the 50% 4-bit model incurring only 0.6% accuracy loss while achieving 40% of the speedup of the 100% 4-bit model over 8-bit model. Latency evaluations on our NPU and GPUs confirmed that FlexiQ introduces minimal runtime overhead, demonstrating its hardware efficiency and overall performance benefits. Comments: 16 pages. 14 figures. To be published in the Proceedings of the European Conference on Computer Systems (EUROSYS '26) Subjects: Machine Learning (cs.LG) Cite as: arXiv:2510.02822 [cs.LG] (or arXiv:2510.02822v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2510.02822 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-33] Online Learning in the Random Order Model
链接: https://arxiv.org/abs/2510.02820
作者: Martino Bernasconi,Andrea Celli,Riccardo Colini-Baldeschi,Federico Fusco,Stefano Leonardi,Matteo Russo
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:
Abstract:In the random-order model for online learning, the sequence of losses is chosen upfront by an adversary and presented to the learner after a random permutation. Any random-order input is \emphasymptotically equivalent to a stochastic i.i.d. one, but, for finite times, it may exhibit significant \em non-stationarity, which can hinder the performance of stochastic learning algorithms. While algorithms for adversarial inputs naturally maintain their regret guarantees in random order, simple no-regret algorithms exist for the stochastic model that fail against random-order instances. In this paper, we propose a general template to adapt stochastic learning algorithms to the random-order model without substantially affecting their regret guarantees. This allows us to recover improved regret bounds for prediction with delays, online learning with constraints, and bandits with switching costs. Finally, we investigate online classification and prove that, in random order, learnability is characterized by the VC dimension rather than the Littlestone dimension, thus providing a further separation from the general adversarial model. Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS) Cite as: arXiv:2510.02820 [cs.LG] (or arXiv:2510.02820v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2510.02820 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Matteo Russo [view email] [v1] Fri, 3 Oct 2025 08:53:35 UTC (37 KB) Full-text links: Access Paper: View a PDF of the paper titled Online Learning in the Random Order Model, by Martino Bernasconi and 5 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.LG prev | next new | recent | 2025-10 Change to browse by: cs cs.DS References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) IArxiv recommender toggle IArxiv Recommender (What is IArxiv?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. 
arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack
[LG-34] Mitigating Spurious Correlation via Distributionally Robust Learning with Hierarchical Ambiguity Sets
链接: https://arxiv.org/abs/2510.02818
作者: Sung Ho Jo,Seonghwi Kim,Minwoo Chae
类目: Machine Learning (cs.LG)
*备注:
Abstract:Conventional supervised learning methods are often vulnerable to spurious correlations, particularly under distribution shifts in test data. To address this issue, several approaches, most notably Group DRO, have been developed. While these methods are highly robust to subpopulation or group shifts, they remain vulnerable to intra-group distributional shifts, which frequently occur in minority groups with limited samples. We propose a hierarchical extension of Group DRO that addresses both inter-group and intra-group uncertainties, providing robustness to distribution shifts at multiple levels. We also introduce new benchmark settings that simulate realistic minority group distribution shifts-an important yet previously underexplored challenge in spurious correlation research. Our method demonstrates strong robustness under these conditions-where existing robust learning methods consistently fail-while also achieving superior performance on standard benchmarks. These results highlight the importance of broadening the ambiguity set to better capture both inter-group and intra-group distributional uncertainties.
[LG-35] Optimal Rates for Generalization of Gradient Descent for Deep ReLU Classification NEURIPS2025
链接: https://arxiv.org/abs/2510.02779
作者: Yuanfan Li,Yunwen Lei,Zheng-Chu Guo,Yiming Ying
类目: Machine Learning (cs.LG)
*备注: Accepted at NeurIPS 2025. Camera-ready version to appear
Abstract:Recent advances have significantly improved our understanding of the generalization performance of gradient descent (GD) methods in deep neural networks. A natural and fundamental question is whether GD can achieve generalization rates comparable to the minimax optimal rates established in the kernel setting. Existing results either yield suboptimal rates of O(1/\sqrtn) , or focus on networks with smooth activation functions, incurring exponential dependence on network depth L . In this work, we establish optimal generalization rates for GD with deep ReLU networks by carefully trading off optimization and generalization errors, achieving only polynomial dependence on depth. Specifically, under the assumption that the data are NTK separable from the margin \gamma , we prove an excess risk rate of \widetildeO(L^4 (1 + \gamma L^2) / (n \gamma^2)) , which aligns with the optimal SVM-type rate \widetildeO(1 / (n \gamma^2)) up to depth-dependent factors. A key technical contribution is our novel control of activation patterns near a reference model, enabling a sharper Rademacher complexity bound for deep ReLU networks trained with gradient descent.
[LG-36] Curl Descent: Non-Gradient Learning Dynamics with Sign-Diverse Plasticity
链接: https://arxiv.org/abs/2510.02765
作者: Hugo Ninou,Jonathan Kadmon,N. Alex Cayco-Gajic
类目: Machine Learning (cs.LG)
*备注:
Abstract:Gradient-based algorithms are a cornerstone of artificial neural network training, yet it remains unclear whether biological neural networks use similar gradient-based strategies during learning. Experiments often discover a diversity of synaptic plasticity rules, but whether these amount to an approximation to gradient descent is unclear. Here we investigate a previously overlooked possibility: that learning dynamics may include fundamentally non-gradient “curl”-like components while still being able to effectively optimize a loss function. Curl terms naturally emerge in networks with inhibitory-excitatory connectivity or Hebbian/anti-Hebbian plasticity, resulting in learning dynamics that cannot be framed as gradient descent on any objective. To investigate the impact of these curl terms, we analyze feedforward networks within an analytically tractable student-teacher framework, systematically introducing non-gradient dynamics through neurons exhibiting rule-flipped plasticity. Small curl terms preserve the stability of the original solution manifold, resulting in learning dynamics similar to gradient descent. Beyond a critical value, strong curl terms destabilize the solution manifold. Depending on the network architecture, this loss of stability can lead to chaotic learning dynamics that destroy performance. In other cases, the curl terms can counterintuitively speed learning compared to gradient descent by allowing the weight dynamics to escape saddles by temporarily ascending the loss. Our results identify specific architectures capable of supporting robust learning via diverse learning rules, providing an important counterpoint to normative theories of gradient-based learning in neural networks.
[LG-37] okenFlow: Responsive LLM Text Streaming Serving under Request Burst via Preemptive Scheduling EUROSYS2026
链接: https://arxiv.org/abs/2510.02758
作者: Junyi Chen,Chuheng Du,Renyuan Liu,Shuochao Yao,Dingtian Yan,Jiang Liao,Shengzhong Liu,Fan Wu,Guihai Chen
类目: Machine Learning (cs.LG)
*备注: Accepted by EuroSys 2026
Abstract:Real-time LLM interactions demand streamed token generations, where text tokens are progressively generated and delivered to users while balancing two objectives: responsiveness (i.e., low time-to-first-token) and steady generation (i.e.,required time-between-tokens). Standard LLM serving systems suffer from the inflexibility caused by non-preemptive request scheduling and reactive memory management, leading to poor resource utilization and low request processing parallelism under request bursts. Therefore, we present TokenFlow, a novel LLM serving system with enhanced text streaming performance via preemptive request scheduling and proactive key-value (KV) cache management. TokenFlow dynamically prioritizes requests based on real-time token buffer occupancy and token consumption rate, while actively transferring KV cache between GPU and CPU memory in the background and overlapping I/O with computation to minimize request preemption overhead. Extensive experiments on Llama3-8B and Qwen2.5-32B across multiple GPUs (RTX 4090, A6000, H200) demonstrate that TokenFlow achieves up to 82.5% higher effective throughput (accounting for actual user consumption) while reducing P99 TTFT by up to 80.2%, without degrading overall token throughput.
[LG-38] Flow with the Force Field: Learning 3D Compliant Flow Matching Policies from Force and Demonstration-Guided Simulation Data
链接: https://arxiv.org/abs/2510.02738
作者: Tianyu Li,Yihan Li,Zizhe Zhang,Nadia Figueroa
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:While visuomotor policy has made advancements in recent years, contact-rich tasks still remain a challenge. Robotic manipulation tasks that require continuous contact demand explicit handling of compliance and force. However, most visuomotor policies ignore compliance, overlooking the importance of physical interaction with the real world, often leading to excessive contact forces or fragile behavior under uncertainty. Introducing force information into vision-based imitation learning could help improve awareness of contacts, but could also require a lot of data to perform well. One remedy for data scarcity is to generate data in simulation, yet computationally taxing processes are required to generate data good enough not to suffer from the Sim2Real gap. In this work, we introduce a framework for generating force-informed data in simulation, instantiated by a single human demonstration, and show how coupling with a compliant policy improves the performance of a visuomotor policy learned from synthetic data. We validate our approach on real-robot tasks, including non-prehensile block flipping and a bi-manual object moving, where the learned policy exhibits reliable contact maintenance and adaptation to novel conditions. Project Website: this https URL
[LG-39] Hybrid-Collaborative Augmentation and Contrastive Sample Adaptive-Differential Awareness for Robust Attributed Graph Clustering
链接: https://arxiv.org/abs/2510.02731
作者: Tianxiang Zhao,Youqing Wang,Jinlu Wang,Jiapu Wang,Mingliang Cui,Junbin Gao,Jipeng Guo
类目: Machine Learning (cs.LG)
*备注:
Abstract:Due to its powerful capability of self-supervised representation learning and clustering, contrastive attributed graph clustering (CAGC) has achieved great success, which mainly depends on effective data augmentation and contrastive objective setting. However, most CAGC methods utilize edges as auxiliary information to obtain node-level embedding representation and only focus on node-level embedding augmentation. This approach overlooks edge-level embedding augmentation and the interactions between node-level and edge-level embedding augmentations across various granularity. Moreover, they often treat all contrastive sample pairs equally, neglecting the significant differences between hard and easy positive-negative sample pairs, which ultimately limits their discriminative capability. To tackle these issues, a novel robust attributed graph clustering (RAGC), incorporating hybrid-collaborative augmentation (HCA) and contrastive sample adaptive-differential awareness (CSADA), is proposed. First, node-level and edge-level embedding representations and augmentations are simultaneously executed to establish a more comprehensive similarity measurement criterion for subsequent contrastive learning. In turn, the discriminative similarity further consciously guides edge augmentation. Second, by leveraging pseudo-label information with high confidence, a CSADA strategy is elaborately designed, which adaptively identifies all contrastive sample pairs and differentially treats them by an innovative weight modulation function. The HCA and CSADA modules mutually reinforce each other in a beneficent cycle, thereby enhancing discriminability in representation learning. Comprehensive graph clustering evaluations over six benchmark datasets demonstrate the effectiveness of the proposed RAGC against several state-of-the-art CAGC methods.
[LG-40] Accuracy Law for the Future of Deep Time Series Forecasting
链接: https://arxiv.org/abs/2510.02729
作者: Yuxuan Wang,Haixu Wu,Yuezhou Ma,Yuchen Fang,Ziyi Zhang,Yong Liu,Shiyu Wang,Zhou Ye,Yang Xiang,Jianmin Wang,Mingsheng Long
类目: Machine Learning (cs.LG)
*备注:
Abstract:Deep time series forecasting has emerged as a booming direction in recent years. Despite the exponential growth of community interests, researchers are sometimes confused about the direction of their efforts due to minor improvements on standard benchmarks. In this paper, we notice that, unlike image recognition, whose well-acknowledged and realizable goal is 100% accuracy, time series forecasting inherently faces a non-zero error lower bound due to its partially observable and uncertain nature. To pinpoint the research objective and release researchers from saturated tasks, this paper focuses on a fundamental question: how to estimate the performance upper bound of deep time series forecasting? Going beyond classical series-wise predictability metrics, e.g., ADF test, we realize that the forecasting performance is highly related to window-wise properties because of the sequence-to-sequence forecasting paradigm of deep time series models. Based on rigorous statistical tests of over 2,800 newly trained deep forecasters, we discover a significant exponential relationship between the minimum forecasting error of deep models and the complexity of window-wise series patterns, which is termed the accuracy law. The proposed accuracy law successfully guides us to identify saturated tasks from widely used benchmarks and derives an effective training strategy for large time series models, offering valuable insights for future research.
[LG-41] EvoSpeak: Large Language Models for Interpretable Genetic Programming-Evolved Heuristics
链接: https://arxiv.org/abs/2510.02686
作者: Meng Xu,Jiao Liu,Yew Soon Ong
类目: Machine Learning (cs.LG)
*备注:
Abstract:Genetic programming (GP) has demonstrated strong effectiveness in evolving tree-structured heuristics for complex optimization problems. Yet, in dynamic and large-scale scenarios, the most effective heuristics are often highly complex, hindering interpretability, slowing convergence, and limiting transferability across tasks. To address these challenges, we present EvoSpeak, a novel framework that integrates GP with large language models (LLMs) to enhance the efficiency, transparency, and adaptability of heuristic evolution. EvoSpeak learns from high-quality GP heuristics, extracts knowledge, and leverages this knowledge to (i) generate warm-start populations that accelerate convergence, (ii) translate opaque GP trees into concise natural-language explanations that foster interpretability and trust, and (iii) enable knowledge transfer and preference-aware heuristic generation across related tasks. We verify the effectiveness of EvoSpeak through extensive experiments on dynamic flexible job shop scheduling (DFJSS), under both single- and multi-objective formulations. The results demonstrate that EvoSpeak produces more effective heuristics, improves evolutionary efficiency, and delivers human-readable reports that enhance usability. By coupling the symbolic reasoning power of GP with the interpretative and generative strengths of LLMs, EvoSpeak advances the development of intelligent, transparent, and user-aligned heuristics for real-world optimization problems.
[LG-42] opological Invariance and Breakdown in Learning
链接: https://arxiv.org/abs/2510.02670
作者: Yongyi Yang,Tomaso Poggio,Isaac Chuang,Liu Ziyin
类目: Machine Learning (cs.LG)
*备注:
Abstract:We prove that for a broad class of permutation-equivariant learning rules (including SGD, Adam, and others), the training process induces a bi-Lipschitz mapping between neurons and strongly constrains the topology of the neuron distribution during training. This result reveals a qualitative difference between small and large learning rates \eta . With a learning rate below a topological critical point \eta^* , the training is constrained to preserve all topological structure of the neurons. In contrast, above \eta^* , the learning process allows for topological simplification, making the neuron manifold progressively coarser and thereby reducing the model’s expressivity. Viewed in combination with the recent discovery of the edge of stability phenomenon, the learning dynamics of neuron networks under gradient descent can be divided into two phases: first they undergo smooth optimization under topological constraints, and then enter a second phase where they learn through drastic topological simplifications. A key feature of our theory is that it is independent of specific architectures or loss functions, enabling the universal application of topological methods to the study of deep learning.
[LG-43] Optimal Characteristics of Inspection Vehicle for Drive-by Bridge Inspection
链接: https://arxiv.org/abs/2510.02658
作者: A. Calderon Hurtado,E. Atroshchenko,K.C. Chang,C.W. Kim,M. Makki Alamdari
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Drive-by inspection for bridge health monitoring has gained increasing attention over the past decade. This method involves analysing the coupled vehicle-bridge response, recorded by an instrumented inspection vehicle, to assess structural integrity and detect damage. However, the vehicles mechanical and dynamic properties significantly influence detection performance, limiting the effectiveness of the approach. This study presents a framework for optimising the inspection vehicle to enhance damage sensitivity. An unsupervised deep learning methodbased on adversarial autoencoders (AAE)is used to reconstruct the frequency- domain representation of acceleration responses. The mass and stiffness of the tyre suspension system of a two-axle vehicle are optimised by minimising the Wasserstein distance between damage index distributions for healthy and damaged bridge states. A Kriging meta-model is employed to approximate this objective function efficiently and identify optimal vehicle configurations in both dimensional and non-dimensional parameter spaces. Results show that vehicles with frequency ratios between 0.3 and 0.7 relative to the bridges’ first natural frequency are most effective, while those near resonance perform poorly. Lighter vehicles require lower natural frequencies for optimal detection. This is the first study to rigorously optimise the sensing platform for drive-by sensing and to propose a purpose-built inspection vehicle.
[LG-44] abImpute: Accurate and Fast Zero-Shot Missing-Data Imputation with a Pre-Trained Transformer
链接: https://arxiv.org/abs/2510.02625
作者: Jacob Feitelberg,Dwaipayan Saha,Kyuseong Choi,Zaid Ahmad,Anish Agarwal,Raaz Dwivedi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Missing data is a pervasive problem in tabular settings. Existing solutions range from simple averaging to complex generative adversarial networks. However, due to huge variance in performance across real-world domains and time-consuming hyperparameter tuning, no default imputation method exists. Building on TabPFN, a recent tabular foundation model for supervised learning, we propose TabImpute, a pre-trained transformer that delivers accurate and fast zero-shot imputations requiring no fitting or hyperparameter tuning at inference-time. To train and evaluate TabImpute, we introduce (i) an entry-wise featurization for tabular settings, which enables a 100\times speedup over the previous TabPFN imputation method, (ii) a synthetic training data generation pipeline incorporating realistic missingness patterns, which boosts test-time performance, and (iii) MissBench, a comprehensive benchmark for evaluation of imputation methods with 42 OpenML datasets and 13 missingness patterns. MissBench spans domains such as medicine, finance, and engineering, showcasing TabImpute’s robust performance compared to 11 established imputation methods.
[LG-45] owards CONUS-Wide ML-Augmented Conceptually-Interpretable Modeling of Catchment-Scale Precipitation-Storag e-Runoff Dynamics
链接: https://arxiv.org/abs/2510.02605
作者: Yuan-Heng Wang,Yang Yang,Fabio Ciulla,Hoshin V. Gupta,Charuleka Varadharajan
类目: Machine Learning (cs.LG)
*备注: Main text: 95 pages, 15 figures, 4 tables; Applendix: Section A-E; 2 figures; Supplementary Materials: 15 figures, 7 tables
Abstract:While many modern studies are dedicated to ML-based large-sample hydrologic modeling, these efforts have not necessarily translated into predictive improvements that are grounded in enhanced physical-conceptual understanding. Here, we report on a CONUS-wide large-sample study (spanning diverse hydro-geo-climatic conditions) using ML-augmented physically-interpretable catchment-scale models of varying complexity based in the Mass-Conserving Perceptron (MCP). Results were evaluated using attribute masks such as snow regime, forest cover, and climate zone. Our results indicate the importance of selecting model architectures of appropriate model complexity based on how process dominance varies with hydrological regime. Benchmark comparisons show that physically-interpretable mass-conserving MCP-based models can achieve performance comparable to data-based models based in the Long Short-Term Memory network (LSTM) architecture. Overall, this study highlights the potential of a theory-informed, physically grounded approach to large-sample hydrology, with emphasis on mechanistic understanding and the development of parsimonious and interpretable model architectures, thereby laying the foundation for future models of everywhere that architecturally encode information about spatially- and temporally-varying process dominance.
[LG-46] Use the Online Network If You Can: Towards Fast and Stable Reinforcement Learning
链接: https://arxiv.org/abs/2510.02590
作者: Ahmed Hendawy,Henrik Metternich,Théo Vincent,Mahdi Kallel,Jan Peters,Carlo D’Eramo
类目: Machine Learning (cs.LG)
*备注:
Abstract:The use of target networks is a popular approach for estimating value functions in deep Reinforcement Learning (RL). While effective, the target network remains a compromise solution that preserves stability at the cost of slowly moving targets, thus delaying learning. Conversely, using the online network as a bootstrapped target is intuitively appealing, albeit well-known to lead to unstable learning. In this work, we aim to obtain the best out of both worlds by introducing a novel update rule that computes the target using the MINimum estimate between the Target and Online network, giving rise to our method, MINTO. Through this simple, yet effective modification, we show that MINTO enables faster and stable value function learning, by mitigating the potential overestimation bias of using the online network for bootstrapping. Notably, MINTO can be seamlessly integrated into a wide range of value-based and actor-critic algorithms with a negligible cost. We evaluate MINTO extensively across diverse benchmarks, spanning online and offline RL, as well as discrete and continuous action spaces. Across all benchmarks, MINTO consistently improves performance, demonstrating its broad applicability and effectiveness.
[LG-47] Geospatial Machine Learning Libraries
链接: https://arxiv.org/abs/2510.02572
作者: Adam J. Stewart,Caleb Robinson,Arindam Banerjee
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: Book chapter
Abstract:Recent advances in machine learning have been supported by the emergence of domain-specific software libraries, enabling streamlined workflows and increased reproducibility. For geospatial machine learning (GeoML), the availability of Earth observation data has outpaced the development of domain libraries to handle its unique challenges, such as varying spatial resolutions, spectral properties, temporal cadence, data coverage, coordinate systems, and file formats. This chapter presents a comprehensive overview of GeoML libraries, analyzing their evolution, core functionalities, and the current ecosystem. It also introduces popular GeoML libraries such as TorchGeo, eo-learn, and Raster Vision, detailing their architecture, supported data types, and integration with ML frameworks. Additionally, it discusses common methodologies for data preprocessing, spatial–temporal joins, benchmarking, and the use of pretrained models. Through a case study in crop type mapping, it demonstrates practical applications of these tools. Best practices in software design, licensing, and testing are highlighted, along with open challenges and future directions, particularly the rise of foundation models and the need for governance in open-source geospatial software. Our aim is to guide practitioners, developers, and researchers in navigating and contributing to the rapidly evolving GeoML landscape.
[LG-48] On The Expressive Power of GNN Derivatives
链接: https://arxiv.org/abs/2510.02565
作者: Yam Eitan,Moshe Eliasof,Yoav Gelberg,Fabrizio Frasca,Guy Bar-Shalom,Haggai Maron
类目: Machine Learning (cs.LG)
*备注: 30 pages, 3 figures
Abstract:Despite significant advances in Graph Neural Networks (GNNs), their limited expressivity remains a fundamental challenge. Research on GNN expressivity has produced many expressive architectures, leading to architecture hierarchies with models of increasing expressive power. Separately, derivatives of GNNs with respect to node features have been widely studied in the context of the oversquashing and over-smoothing phenomena, GNN explainability, and more. To date, these derivatives remain unexplored as a means to enhance GNN expressivity. In this paper, we show that these derivatives provide a natural way to enhance the expressivity of GNNs. We introduce High-Order Derivative GNN (HOD-GNN), a novel method that enhances the expressivity of Message Passing Neural Networks (MPNNs) by leveraging high-order node derivatives of the base model. These derivatives generate expressive structure-aware node embeddings processed by a second GNN in an end-to-end trainable architecture. Theoretically, we show that the resulting architecture family’s expressive power aligns with the WL hierarchy. We also draw deep connections between HOD-GNN, Subgraph GNNs, and popular structural encoding schemes. For computational efficiency, we develop a message-passing algorithm for computing high-order derivatives of MPNNs that exploits graph sparsity and parallelism. Evaluations on popular graph learning benchmarks demonstrate HOD-GNN’s strong performance on popular graph learning tasks.
[LG-49] AttentiveGRUAE: An Attention-Based GRU Autoencoder for Temporal Clustering and Behavioral Characterization of Depression from Wearable Data NEURIPS
链接: https://arxiv.org/abs/2510.02558
作者: Nidhi Soley,Vishal M Patel,Casey O Taylor
类目: Machine Learning (cs.LG)
*备注: 4 pages, 3 figures, 2 tables, Accepted NeurIPS (TS4H Workshop) 2025, non-camera-ready version)
Abstract:In this study, we present AttentiveGRUAE, a novel attention-based gated recurrent unit (GRU) autoencoder designed for temporal clustering and prediction of outcome from longitudinal wearable data. Our model jointly optimizes three objectives: (1) learning a compact latent representation of daily behavioral features via sequence reconstruction, (2) predicting end-of-period depression rate through a binary classification head, and (3) identifying behavioral subtypes through Gaussian Mixture Model (GMM) based soft clustering of learned embeddings. We evaluate AttentiveGRUAE on longitudinal sleep data from 372 participants (GLOBEM 2018-2019), and it demonstrates superior performance over baseline clustering, domain-aligned self-supervised, and ablated models in both clustering quality (silhouette score = 0.70 vs 0.32-0.70) and depression classification (AUC = 0.74 vs 0.50-0.67). Additionally, external validation on cross-year cohorts from 332 participants (GLOBEM 2020-2021) confirms cluster reproducibility (silhouette score = 0.63, AUC = 0.61) and stability. We further perform subtype analysis and visualize temporal attention, which highlights sleep-related differences between clusters and identifies salient time windows that align with changes in sleep regularity, yielding clinically interpretable explanations of risk.
[LG-50] Even Faster Kernel Matrix Linear Algebra via Density Estimation
链接: https://arxiv.org/abs/2510.02540
作者: Rikhav Shah,Sandeep Silwal,Haike Xu
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:This paper studies the use of kernel density estimation (KDE) for linear algebraic tasks involving the kernel matrix of a collection of n data points in \mathbb R^d . In particular, we improve upon existing algorithms for computing the following up to (1+\varepsilon) relative error: matrix-vector products, matrix-matrix products, the spectral norm, and sum of all entries. The runtimes of our algorithms depend on the dimension d , the number of points n , and the target error \varepsilon . Importantly, the dependence on n in each case is far lower when accessing the kernel matrix through KDE queries as opposed to reading individual entries. Our improvements over existing best algorithms (particularly those of Backurs, Indyk, Musco, and Wagner '21) for these tasks reduce the polynomial dependence on \varepsilon , and additionally decreases the dependence on n in the case of computing the sum of all entries of the kernel matrix. We complement our upper bounds with several lower bounds for related problems, which provide (conditional) quadratic time hardness results and additionally hint at the limits of KDE based approaches for the problems we study. Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Numerical Analysis (math.NA) MSC classes: 68W25, 15B48, 15B05, 15A18 ACMclasses: E.1; F.2.1 Cite as: arXiv:2510.02540 [cs.DS] (or arXiv:2510.02540v1 [cs.DS] for this version) https://doi.org/10.48550/arXiv.2510.02540 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-51] Model-brain comparison using inter-animal transforms
链接: https://arxiv.org/abs/2510.02523
作者: Imran Thobani,Javier Sagastuy-Brena,Aran Nayebi,Jacob Prince,Rosa Cao,Daniel Yamins
类目: Machine Learning (cs.LG)
*备注: 16 pages, 8 figures. An extended and revised version of a 9-page paper to be published in the Proceedings of the 2025 Cognitive Computational Neuroscience conference
Abstract:Artificial neural network models have emerged as promising mechanistic models of the brain. However, there is little consensus on the correct method for comparing model activations to brain responses. Drawing on recent work in philosophy of neuroscience, we propose a comparison methodology based on the Inter-Animal Transform Class (IATC) - the strictest set of functions needed to accurately map neural responses between subjects in an animal population. Using the IATC, we can map bidirectionally between a candidate model’s responses and brain data, assessing how well the model can masquerade as a typical subject using the same kinds of transforms needed to map across real subjects. We identify the IATC in three settings: a simulated population of neural network models, a population of mouse subjects, and a population of human subjects. We find that the IATC resolves detailed aspects of the neural mechanism, such as the non-linear activation function. Most importantly, we find that the IATC enables accurate predictions of neural activity while also achieving high specificity in mechanism identification, evidenced by its ability to separate response patterns from different brain areas while strongly aligning same-brain-area responses between subjects. In other words, the IATC is a proof-by-existence that there is no inherent tradeoff between the neural engineering goal of high model-brain predictivity and the neuroscientific goal of identifying mechanistically accurate brain models. Using IATC-guided transforms, we obtain new evidence in favor of topographical deep neural networks (TDANNs) as models of the visual system. Overall, the IATC enables principled model-brain comparisons, contextualizing previous findings about the predictive success of deep learning models of the brain, while improving upon previous approaches to model-brain comparison.
[LG-52] Graph Generation with Spectral Geodesic Flow Matching
链接: https://arxiv.org/abs/2510.02520
作者: Xikun Huang,Tianyu Ruan,Chihao Zhang,Shihua Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Graph generation is a fundamental task with wide applications in modeling complex systems. Although existing methods align the spectrum or degree profile of the target graph, they often ignore the geometry induced by eigenvectors and the global structure of the graph. In this work, we propose Spectral Geodesic Flow Matching (SFMG), a novel framework that uses spectral eigenmaps to embed both input and target graphs into continuous Riemannian manifolds. We then define geodesic flows between embeddings and match distributions along these flows to generate output graphs. Our method yields several advantages: (i) captures geometric structure beyond eigenvalues, (ii) supports flexible generation of diverse graphs, and (iii) scales efficiently. Empirically, SFMG matches the performance of state-of-the-art approaches on graphlet, degree, and spectral metrics across diverse benchmarks. In particular, it achieves up to 30 \times speedup over diffusion-based models, offering a substantial advantage in scalability and training efficiency. We also demonstrate its ability to generalize to unseen graph scales. Overall, SFMG provides a new approach to graph synthesis by integrating spectral geometry with flow matching.
[LG-53] In-memory Training on Analog Devices with Limited Conductance States via Multi-tile Residual Learning
链接: https://arxiv.org/abs/2510.02516
作者: Jindan Li,Zhaoxian Wu,Gaowen Liu,Tayfun Gokmen,Tianyi Chen
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Optimization and Control (math.OC)
*备注:
Abstract:Analog in-memory computing (AIMC) accelerators enable efficient deep neural network computation directly within memory using resistive crossbar arrays, where model parameters are represented by the conductance states of memristive devices. However, effective in-memory training typically requires at least 8-bit conductance states to match digital baselines. Realizing such fine-grained states is costly and often requires complex noise mitigation techniques that increase circuit complexity and energy consumption. In practice, many promising memristive devices such as ReRAM offer only about 4-bit resolution due to fabrication constraints, and this limited update precision substantially degrades training accuracy. To enable on-chip training with these limited-state devices, this paper proposes a \emphresidual learning framework that sequentially learns on multiple crossbar tiles to compensate the residual errors from low-precision weight updates. Our theoretical analysis shows that the optimality gap shrinks with the number of tiles and achieves a linear convergence rate. Experiments on standard image classification benchmarks demonstrate that our method consistently outperforms state-of-the-art in-memory analog training strategies under limited-state settings, while incurring only moderate hardware overhead as confirmed by our cost analysis.
[LG-54] Improved Robustness of Deep Reinforcement Learning for Control of Time-Varying Systems by Bounded Extremum Seeking
链接: https://arxiv.org/abs/2510.02490
作者: Shaifalee Saxena,Alan Williams,Rafael Fierro,Alexander Scheinker
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:In this paper, we study the use of robust model independent bounded extremum seeking (ES) feedback control to improve the robustness of deep reinforcement learning (DRL) controllers for a class of nonlinear time-varying systems. DRL has the potential to learn from large datasets to quickly control or optimize the outputs of many-parameter systems, but its performance degrades catastrophically when the system model changes rapidly over time. Bounded ES can handle time-varying systems with unknown control directions, but its convergence speed slows down as the number of tuned parameters increases and, like all local adaptive methods, it can get stuck in local minima. We demonstrate that together, DRL and bounded ES result in a hybrid controller whose performance exceeds the sum of its parts with DRL taking advantage of historical data to learn how to quickly control a many-parameter system to a desired setpoint while bounded ES ensures its robustness to time variations. We present a numerical study of a general time-varying system and a combined ES-DRL controller for automatic tuning of the Low Energy Beam Transport section at the Los Alamos Neutron Science Center linear particle accelerator.
[LG-55] Uncertainty-Guided Model Selection for Tabular Foundation Models in Biomolecule Efficacy Prediction NEURIPS2025
链接: https://arxiv.org/abs/2510.02476
作者: Jie Li,Andrew McCarthy,Zhizhuo Zhang,Stephen Young
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: NeurIPS 2025 workshop: 2nd Workshop on Multi-modal Foundation Models and Large Language Models for Life Sciences
Abstract:In-context learners like TabPFN are promising for biomolecule efficacy prediction, where established molecular feature sets and relevant experimental results can serve as powerful contextual examples. However, their performance is highly sensitive to the provided context, making strategies like post-hoc ensembling of models trained on different data subsets a viable approach. An open question is how to select the best models for the ensemble without access to ground truth labels. In this study, we investigate an uncertainty-guided strategy for model selection. We demonstrate on an siRNA knockdown efficacy task that a TabPFN model using simple sequence-based features can surpass specialized state-of-the-art predictors. We also show that the model’s predicted inter-quantile range (IQR), a measure of its uncertainty, has a negative correlation with true prediction error. By selecting and averaging an ensemble of models with the lowest mean IQR, we achieve superior performance compared to naive ensembling or using a single model trained on all available data. This finding highlights model uncertainty as a powerful, label-free heuristic for optimizing biomolecule efficacy predictions.
[LG-56] Heterogeneous Graph Representation of Stiffened Panels with Non-Uniform Boundary Conditions and Loads
链接: https://arxiv.org/abs/2510.02472
作者: Yuecheng Cai,Jasmin Jelovica
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: This is a preprint and has been submitted to Engineering with Computers
Abstract:Surrogate models are essential in structural analysis and optimization. We propose a heterogeneous graph representation of stiffened panels that accounts for geometrical variability, non-uniform boundary conditions, and diverse loading scenarios, using heterogeneous graph neural networks (HGNNs). The structure is partitioned into multiple structural units, such as stiffeners and the plates between them, with each unit represented by three distinct node types: geometry, boundary, and loading nodes. Edge heterogeneity is introduced by incorporating local orientations and spatial relationships of the connecting nodes. Several heterogeneous graph representations, each with varying degrees of heterogeneity, are proposed and analyzed. These representations are implemented into a heterogeneous graph transformer (HGT) to predict von Mises stress and displacement fields across stiffened panels, based on loading and degrees of freedom at their boundaries. To assess the efficacy of our approach, we conducted numerical tests on panels subjected to patch loads and box beams composed of stiffened panels under various loading conditions. The heterogeneous graph representation was compared with a homogeneous counterpart, demonstrating superior performance. Additionally, an ablation analysis was performed to evaluate the impact of graph heterogeneity on HGT performance. The results show strong predictive accuracy for both displacement and von Mises stress, effectively capturing structural behavior patterns and maximum values.
[LG-57] SAGE: Streaming Agreement-Driven Gradient Sketches for Representative Subset Selection
链接: https://arxiv.org/abs/2510.02470
作者: Ashish Jha,Salman Ahmadi-Asl
类目: Machine Learning (cs.LG)
*备注:
Abstract:Training modern neural networks on large datasets is computationally and energy intensive. We present SAGE, a streaming data-subset selection method that maintains a compact Frequent Directions (FD) sketch of gradient geometry in O(\ell D) memory and prioritizes examples whose sketched gradients align with a consensus direction. The approach eliminates N \times N pairwise similarities and explicit N \times \ell gradient stores, yielding a simple two-pass, GPU-friendly pipeline. Leveraging FD’s deterministic approximation guarantees, we analyze how agreement scoring preserves gradient energy within the principal sketched subspace. Across multiple benchmarks, SAGE trains with small kept-rate budgets while retaining competitive accuracy relative to full-data training and recent subset-selection baselines, and reduces end-to-end compute and peak memory. Overall, SAGE offers a practical, constant-memory alternative that complements pruning and model compression for efficient training.
[LG-58] Assessing the Potential for Catastrophic Failure in Dynamic Post-Training Quantization
链接: https://arxiv.org/abs/2510.02457
作者: Logan Frank,Paul Ardis
类目: Machine Learning (cs.LG)
*备注:
Abstract:Post-training quantization (PTQ) has recently emerged as an effective tool for reducing the computational complexity and memory usage of a neural network by representing its weights and activations with lower precision. While this paradigm has shown great success in lowering compute and storage costs, there is the potential for drastic performance reduction depending upon the distribution of inputs experienced in inference. When considering possible deployment in safety-critical environments, it is important to investigate the extent of potential performance reduction, and what characteristics of input distributions may give rise to this reduction. In this work, we explore the idea of extreme failure stemming from dynamic PTQ and formulate a knowledge distillation and reinforcement learning task to learn a network and bit-width policy pair such that catastrophic failure under quantization is analyzed in terms of worst case potential. Our results confirm the existence of this “detrimental” network-policy pair, with several instances demonstrating performance reductions in the range of 10-65% in accuracy, compared to their “robust” counterparts encountering a 2% decrease. From systematic experimentation and analyses, we also provide an initial exploration into points at highest vulnerability. While our results represent an initial step toward understanding failure cases introduced by PTQ, our findings ultimately emphasize the need for caution in real-world deployment scenarios. We hope this work encourages more rigorous examinations of robustness and a greater emphasis on safety considerations for future works within the broader field of deep learning.
[LG-59] Adaptive Deception Framework with Behavioral Analysis for Enhanced Cybersecurity Defense
链接: https://arxiv.org/abs/2510.02424
作者: Basil Abdullah AL-Zahrani
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 5 pages, 5 tables, 1 figure
Abstract:This paper presents CADL (Cognitive-Adaptive Deception Layer), an adaptive deception framework achieving 99.88% detection rate with 0.13% false positive rate on the CICIDS2017 dataset. The framework employs ensemble machine learning (Random Forest, XGBoost, Neural Networks) combined with behavioral profiling to identify and adapt responses to network intrusions. Through a coordinated signal bus architecture, security components share real-time intelligence, enabling collective decision-making. The system profiles attackers based on temporal patterns and deploys customized deception strategies across five escalation levels. Evaluation on 50,000 CICIDS2017 test samples demonstrates that CADL significantly outperforms traditional intrusion detection systems (Snort: 71.2%, Suricata: 68.5%) while maintaining production-ready false positive rates. The framework’s behavioral analysis achieves 89% accuracy in classifying attacker profiles. We provide open-source implementation and transparent performance metrics, offering an accessible alternative to commercial deception platforms costing 150-400 per host annually.
[LG-60] OpenTSLM: Time-Series Language Models for Reasoning over Multivariate Medical Text- and Time-Series Data
链接: https://arxiv.org/abs/2510.02410
作者: Patrick Langer,Thomas Kaar,Max Rosenblattl,Maxwell A. Xu,Winnie Chow,Martin Maritsch,Aradhana Verma,Brian Han,Daniel Seung Kim,Henry Chubb,Scott Ceresnak,Aydin Zahedivash,Alexander Tarlochan Singh Sandhu,Fatima Rodriguez,Daniel McDuff,Elgar Fleisch,Oliver Aalami,Filipe Barata,Paul Schmiedmayer
类目: Machine Learning (cs.LG)
*备注:
Abstract:LLMs have emerged as powerful tools for interpreting multimodal data. In medicine, they hold particular promise for synthesizing large volumes of clinical information into actionable insights and digital health applications. Yet, a major limitation remains their inability to handle time series. To overcome this gap, we present OpenTSLM, a family of Time Series Language Models (TSLMs) created by integrating time series as a native modality to pretrained LLMs, enabling reasoning over multiple time series of any length. We investigate two architectures for OpenTSLM. The first, OpenTSLM-SoftPrompt, models time series implicitly by concatenating learnable time series tokens with text tokens via soft prompting. Although parameter-efficient, we hypothesize that explicit time series modeling scales better and outperforms implicit approaches. We thus introduce OpenTSLM-Flamingo, which integrates time series with text via cross-attention. We benchmark both variants against baselines that treat time series as text tokens or plots, across a suite of text-time-series Chain-of-Thought (CoT) reasoning tasks. We introduce three datasets: HAR-CoT, Sleep-CoT, and ECG-QA-CoT. Across all, OpenTSLM models outperform baselines, reaching 69.9 F1 in sleep staging and 65.4 in HAR, compared to 9.05 and 52.2 for finetuned text-only models. Notably, even 1B-parameter OpenTSLM models surpass GPT-4o (15.47 and 2.95). OpenTSLM-Flamingo matches OpenTSLM-SoftPrompt in performance and outperforms on longer sequences, while maintaining stable memory requirements. By contrast, SoftPrompt grows exponentially in memory with sequence length, requiring around 110 GB compared to 40 GB VRAM when training on ECG-QA with LLaMA-3B. Expert reviews by clinicians find strong reasoning capabilities exhibited by OpenTSLMs on ECG-QA. To facilitate further research, we provide all code, datasets, and models open-source.
[LG-61] LLM -Generated Samples for Android Malware Detection
链接: https://arxiv.org/abs/2510.02391
作者: Nik Rollinson,Nikolaos Polatidis
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 24 pages
Abstract:Android malware continues to evolve through obfuscation and polymorphism, posing challenges for both signature-based defenses and machine learning models trained on limited and imbalanced datasets. Synthetic data has been proposed as a remedy for scarcity, yet the role of large language models (LLMs) in generating effective malware data for detection tasks remains underexplored. In this study, we fine-tune GPT-4.1-mini to produce structured records for three malware families: BankBot, Locker/SLocker, and Airpush/StopSMS, using the KronoDroid dataset. After addressing generation inconsistencies with prompt engineering and post-processing, we evaluate multiple classifiers under three settings: training with real data only, real-plus-synthetic data, and synthetic data alone. Results show that real-only training achieves near perfect detection, while augmentation with synthetic data preserves high performance with only minor degradations. In contrast, synthetic-only training produces mixed outcomes, with effectiveness varying across malware families and fine-tuning strategies. These findings suggest that LLM-generated malware can enhance scarce datasets without compromising detection accuracy, but remains insufficient as a standalone training source.
[LG-62] From Trace to Line: LLM Agent for Real-World OSS Vulnerability Localization
链接: https://arxiv.org/abs/2510.02389
作者: Haoran Xi,Minghao Shao,Brendan Dolan-Gavitt,Muhammad Shafique,Ramesh Karri
类目: oftware Engineering (cs.SE); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Large language models show promise for vulnerability discovery, yet prevailing methods inspect code in isolation, struggle with long contexts, and focus on coarse function- or file-level detections - offering limited actionable guidance to engineers who need precise line-level localization and targeted patches in real-world software development. We present T2L-Agent (Trace-to-Line Agent), a project-level, end-to-end framework that plans its own analysis and progressively narrows scope from modules to exact vulnerable lines. T2L-Agent couples multi-round feedback with an Agentic Trace Analyzer (ATA) that fuses runtime evidence - crash points, stack traces, and coverage deltas - with AST-based code chunking, enabling iterative refinement beyond single pass predictions and translating symptoms into actionable, line-level diagnoses. To benchmark line-level vulnerability discovery, we introduce T2L-ARVO, a diverse, expert-verified 50-case benchmark spanning five crash families and real-world projects. T2L-ARVO is specifically designed to support both coarse-grained detection and fine-grained localization, enabling rigorous evaluation of systems that aim to move beyond file-level predictions. On T2L-ARVO, T2L-Agent achieves up to 58.0% detection and 54.8% line-level localization, substantially outperforming baselines. Together, the framework and benchmark push LLM-based vulnerability detection from coarse identification toward deployable, robust, precision diagnostics that reduce noise and accelerate patching in open-source software workflows.
[LG-63] An Encoder-Decoder Network for Beamforming over Sparse Large-Scale MIMO Channels
链接: https://arxiv.org/abs/2510.02355
作者: Yubo Zhang,Jeremy Johnston,Xiaodong Wang
类目: ystems and Control (eess.SY); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 13 pages, 9 figures, submitted to TCOM and is waiting for reviews
Abstract:We develop an end-to-end deep learning framework for downlink beamforming in large-scale sparse MIMO channels. The core is a deep EDN architecture with three modules: (i) an encoder NN, deployed at each user end, that compresses estimated downlink channels into low-dimensional latent vectors. The latent vector from each user is compressed and then fed back to the BS. (ii) a beamformer decoder NN at the BS that maps recovered latent vectors to beamformers, and (iii) a channel decoder NN at the BS that reconstructs downlink channels from recovered latent vectors to further refine the beamformers. The training of EDN leverages two key strategies: (a) semi-amortized learning, where the beamformer decoder NN contains an analytical gradient ascent during both training and inference stages, and (b) knowledge distillation, where the loss function consists of a supervised term and an unsupervised term, and starting from supervised training with MMSE beamformers, over the epochs, the model training gradually shifts toward unsupervised using the sum-rate objective. The proposed EDN beamforming framework is extended to both far-field and near-field hybrid beamforming scenarios. Extensive simulations validate its effectiveness under diverse network and channel conditions.
[LG-64] Joint Bidding on Intraday and Frequency Containment Reserve Markets
链接: https://arxiv.org/abs/2510.03209
作者: Yiming Zhang,Wolfgang Ridinger,David Wozabal
类目: Computational Finance (q-fin.CP); Machine Learning (cs.LG); Trading and Market Microstructure (q-fin.TR)
*备注:
Abstract:As renewable energy integration increases supply variability, battery energy storage systems (BESS) present a viable solution for balancing supply and demand. This paper proposes a novel approach for optimizing battery BESS participation in multiple electricity markets. We develop a joint bidding strategy that combines participation in the primary frequency reserve market with continuous trading in the intraday market, addressing a gap in the extant literature which typically considers these markets in isolation or simplifies the continuous nature of intraday trading. Our approach utilizes a mixed integer linear programming implementation of the rolling intrinsic algorithm for intraday decisions and state of charge recovery, alongside a learned classifier strategy (LCS) that determines optimal capacity allocation between markets. A comprehensive out-of-sample backtest over more than one year of historical German market data validates our approach: The LCS increases overall profits by over 4% compared to the best-performing static strategy and by more than 3% over a naive dynamic benchmark. Crucially, our method closes the gap to a theoretical perfect foresight strategy to just 4%, demonstrating the effectiveness of dynamic, learning-based allocation in a complex, multi-market environment.
[LG-65] Improving Online-to-Nonconvex Conversion for Smooth Optimization via Double Optimism
链接: https://arxiv.org/abs/2510.03167
作者: Francisco Patitucci,Ruichen Jiang,Aryan Mokhtari
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 32 pages
Abstract:A recent breakthrough in nonconvex optimization is the online-to-nonconvex conversion framework of \citecutkosky2023optimal, which reformulates the task of finding an \varepsilon -first-order stationary point as an online learning problem. When both the gradient and the Hessian are Lipschitz continuous, instantiating this framework with two different online learners achieves a complexity of \mathcalO(\varepsilon^-1.75\log(1/\varepsilon)) in the deterministic case and a complexity of \mathcalO(\varepsilon^-3.5) in the stochastic case. However, this approach suffers from several limitations: (i) the deterministic method relies on a complex double-loop scheme that solves a fixed-point equation to construct hint vectors for an optimistic online learner, introducing an extra logarithmic factor; (ii) the stochastic method assumes a bounded second-order moment of the stochastic gradient, which is stronger than standard variance bounds; and (iii) different online learning algorithms are used in the two settings. In this paper, we address these issues by introducing an online optimistic gradient method based on a novel \textitdoubly optimistic hint function. Specifically, we use the gradient at an extrapolated point as the hint, motivated by two optimistic assumptions: that the difference between the hint and the target gradient remains near constant, and that consecutive update directions change slowly due to smoothness. Our method eliminates the need for a double loop and removes the logarithmic factor. Furthermore, by simply replacing full gradients with stochastic gradients and under the standard assumption that their variance is bounded by \sigma^2 , we obtain a unified algorithm with complexity \mathcalO(\varepsilon^-1.75 + \sigma^2 \varepsilon^-3.5) , smoothly interpolating between the best-known deterministic rate and the optimal stochastic rate.
[LG-66] FR-LUX: Friction-Aware Regime-Conditioned Policy Optimization for Implementable Portfolio Management
链接: https://arxiv.org/abs/2510.02986
作者: Jian’an Zhang
类目: Trading and Market Microstructure (q-fin.TR); Machine Learning (cs.LG)
*备注: 19 pages, 7 figures, includes theoretical guarantees and empirical evaluation, submitted to AI/ML in Finance track
Abstract:Transaction costs and regime shifts are major reasons why paper portfolios fail in live trading. We introduce FR-LUX (Friction-aware, Regime-conditioned Learning under eXecution costs), a reinforcement learning framework that learns after-cost trading policies and remains robust across volatility-liquidity regimes. FR-LUX integrates three ingredients: (i) a microstructure-consistent execution model combining proportional and impact costs, directly embedded in the reward; (ii) a trade-space trust region that constrains changes in inventory flow rather than logits, yielding stable low-turnover updates; and (iii) explicit regime conditioning so the policy specializes to LL/LH/HL/HH states without fragmenting the data. On a 4 x 5 grid of regimes and cost levels with multiple random seeds, FR-LUX achieves the top average Sharpe ratio with narrow bootstrap confidence intervals, maintains a flatter cost-performance slope than strong baselines, and attains superior risk-return efficiency for a given turnover budget. Pairwise scenario-level improvements are strictly positive and remain statistically significant after multiple-testing corrections. We provide formal guarantees on optimality under convex frictions, monotonic improvement under a KL trust region, long-run turnover bounds and induced inaction bands due to proportional costs, positive value advantage for regime-conditioned policies, and robustness to cost misspecification. The methodology is implementable: costs are calibrated from standard liquidity proxies, scenario-level inference avoids pseudo-replication, and all figures and tables are reproducible from released artifacts.
[LG-67] oRANS: Online optimisation of RANS machine learning models with embedded DNS data generation
链接: https://arxiv.org/abs/2510.02982
作者: Daniel Dehtyriov,Jonathan F. MacArt,Justin Sirignano
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注:
Abstract:Deep learning (DL) has demonstrated promise for accelerating and enhancing the accuracy of flow physics simulations, but progress is constrained by the scarcity of high-fidelity training data, which is costly to generate and inherently limited to a small set of flow conditions. Consequently, closures trained in the conventional offline paradigm tend to overfit and fail to generalise to new regimes. We introduce an online optimisation framework for DL-based Reynolds-averaged Navier–Stokes (RANS) closures which seeks to address the challenge of limited high-fidelity datasets. Training data is dynamically generated by embedding a direct numerical simulation (DNS) within a subdomain of the RANS domain. The RANS solution supplies boundary conditions to the DNS, while the DNS provides mean velocity and turbulence statistics that are used to update a DL closure model during the simulation. This feedback loop enables the closure to adapt to the embedded DNS target flow, avoiding reliance on precomputed datasets and improving out-of-distribution performance. The approach is demonstrated for the stochastically forced Burgers equation and for turbulent channel flow at Re_\tau=180 , 270 , 395 and 590 with varying embedded domain lengths 1\leq L_0/L\leq 8 . Online-optimised RANS models significantly outperform both offline-trained and literature-calibrated closures, with accurate training achieved using modest DNS subdomains. Performance degrades primarily when boundary-condition contamination dominates or when domains are too short to capture low-wavenumber modes. This framework provides a scalable route to physics-informed machine learning closures, enabling data-adaptive reduced-order models that generalise across flow regimes without requiring large precomputed training datasets.
[LG-68] Scalable Quantum Optimisation using HADOF: Hamiltonian Auto-Decomposition Optimisation Framework ECAI
链接: https://arxiv.org/abs/2510.02926
作者: Namasi G Sankar,Georgios Miliotis,Simon Caton
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: Sankar, N., Miliotis, G. and Caton, S. Scalable Quantum Optimisation using HADOF: Hamiltonian Auto-Decomposition Optimisation Framework. In 3rd International Workshop on AI for Quantum and Quantum for AI (AIQxQIA 2025), at the 28th European Conference on Artificial Intelligence (ECAI), October 25-30, 2025, Bologna, Italy
Abstract:Quantum Annealing (QA) and QAOA are promising quantum optimisation algorithms used for finding approximate solutions to combinatorial problems on near-term NISQ systems. Many NP-hard problems can be reformulated as Quadratic Unconstrained Binary Optimisation (QUBO), which maps naturally onto quantum Hamiltonians. However, the limited qubit counts of current NISQ devices restrict practical deployment of such algorithms. In this study, we present the Hamiltonian Auto-Decomposition Optimisation Framework (HADOF), which leverages an iterative strategy to automatically divide the Quadratic Unconstrained Binary Optimisation (QUBO) Hamiltonian into sub-Hamiltonians which can be optimised separately using Hamiltonian based optimisers such as QAOA, QA or Simulated Annealing (SA) and aggregated into a global solution. We compare HADOF with Simulated Annealing (SA) and the CPLEX exact solver, showing scalability to problem sizes far exceeding available qubits while maintaining competitive accuracy and runtime. Furthermore, we realise HADOF for a toy problem on an IBM quantum computer, showing promise for practical applications of quantum optimisation.
[LG-69] he land use-climate change-biodiversity nexus in European islands stakeholders
链接: https://arxiv.org/abs/2510.02829
作者: Aristides Moustakas,Irene Christoforidi,George Zittis,Nazli Demirel,Mauro Fois,Savvas Zotos,Eirini Gallou,Valentini Stamatiadou,Elli Tzirkalli,Christos Zoumides,Kristina Košić,Aikaterini Christopoulou,Aleksandra Dragin,Damian Łowicki,Artur Gil,Bruna Almeida,Panos Chrysos,Mario V. Balzan,Mark D.C. Mansoldo,Rannveig Ólafsdóttir,Cigdem Kaptan Ayhan,Lutfi Atay,Mirela Tase,Vladimir Stojanović,Maja Mijatov Ladičorbić,Juan Pedro Díaz,Francisco Javier Expósito,Sonia Quiroga,Miguel Ángel Casquet Cano,Haoran Wang,Cristina Suárez,Paraskevi Manolaki,Ioannis N. Vogiatzakis
类目: Populations and Evolution (q-bio.PE); Machine Learning (cs.LG)
*备注: In press at the Environmental Impact Assessment Review journal. Pre-proof author’s version
Abstract:To promote climate adaptation and mitigation, it is crucial to understand stakeholder perspectives and knowledge gaps on land use and climate changes. Stakeholders across 21 European islands were consulted on climate and land use change issues affecting ecosystem services. Climate change perceptions included temperature, precipitation, humidity, extremes, and wind. Land use change perceptions included deforestation, coastal degradation, habitat protection, renewable energy facilities, wetlands, and others. Additional concerns such as invasive species, water or energy scarcity, infrastructure problems, and austerity were also considered. Climate and land use change impact perceptions were analysed with machine learning to quantify their influence. The predominant climatic characteristic is temperature, and the predominant land use characteristic is deforestation. Water-related problems are top priorities for stakeholders. Energy-related problems, including energy deficiency and issues with wind and solar facilities, rank high as combined climate and land use risks. Stakeholders generally perceive climate change impacts on ecosystem services as negative, with natural habitat destruction and biodiversity loss identified as top issues. Land use change impacts are also negative but more complex, with more explanatory variables. Stakeholders share common perceptions on biodiversity impacts despite geographic disparity, but they differentiate between climate and land use impacts. Water, energy, and renewable energy issues pose serious concerns, requiring management measures.
[LG-70] Neural Jump ODEs as Generative Models
链接: https://arxiv.org/abs/2510.02757
作者: Robert A. Crowell,Florian Krach,Josef Teichmann
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:In this work, we explore how Neural Jump ODEs (NJODEs) can be used as generative models for Itô processes. Given (discrete observations of) samples of a fixed underlying Itô process, the NJODE framework can be used to approximate the drift and diffusion coefficients of the process. Under standard regularity assumptions on the Itô processes, we prove that, in the limit, we recover the true parameters with our approximation. Hence, using these learned coefficients to sample from the corresponding Itô process generates, in the limit, samples with the same law as the true underlying process. Compared to other generative machine learning models, our approach has the advantage that it does not need adversarial training and can be trained solely as a predictive model on the observed samples without the need to generate any samples during training to empirically approximate the distribution. Moreover, the NJODE framework naturally deals with irregularly sampled data with missing values as well as with path-dependent dynamics, allowing to apply this approach in real-world settings. In particular, in the case of path-dependent coefficients of the Itô processes, the NJODE learns their optimal approximation given the past observations and therefore allows generating new paths conditionally on discrete, irregular, and incomplete past observations in an optimal way.
[LG-71] Quantitative Convergence Analysis of Projected Stochastic Gradient Descent for Non-Convex Losses via the Goldstein Subdifferential
链接: https://arxiv.org/abs/2510.02735
作者: Yuping Zheng,Andrew Lamperski
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 40 pages, 2 figures, under review for 37th International Conference on Algorithmic Learning Theory
Abstract:Stochastic gradient descent (SGD) is the main algorithm behind a large body of work in machine learning. In many cases, constraints are enforced via projections, leading to projected stochastic gradient algorithms. In recent years, a large body of work has examined the convergence properties of projected SGD for non-convex losses in asymptotic and non-asymptotic settings. Strong quantitative guarantees are available for convergence measured via Moreau envelopes. However, these results cannot be compared directly with work on unconstrained SGD, since the Moreau envelope construction changes the gradient. Other common measures based on gradient mappings have the limitation that convergence can only be guaranteed if variance reduction methods, such as mini-batching, are employed. This paper presents an analysis of projected SGD for non-convex losses over compact convex sets. Convergence is measured via the distance of the gradient to the Goldstein subdifferential generated by the constraints. Our proposed convergence criterion directly reduces to commonly used criteria in the unconstrained case, and we obtain convergence without requiring variance reduction. We obtain results for data that are independent, identically distributed (IID) or satisfy mixing conditions ( L -mixing). In these cases, we derive asymptotic convergence and O(N^-1/3) non-asymptotic bounds in expectation, where N is the number of steps. In the case of IID sub-Gaussian data, we obtain almost-sure asymptotic convergence and high-probability non-asymptotic O(N^-1/5) bounds. In particular, these are the first non-asymptotic high-probability bounds for projected SGD with non-convex losses.
[LG-72] FLOWR.root: A flow matching based foundation model for joint multi-purpose structure-aware 3D ligand generation and affinity prediction
链接: https://arxiv.org/abs/2510.02578
作者: Julian Cremer,Tuan Le,Mohammad M. Ghahremanpour,Emilia Sługocka,Filipe Menezes,Djork-Arné Clevert
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:
Abstract:We present this http URL, an equivariant flow-matching model for pocket-aware 3D ligand generation with joint binding affinity prediction and confidence estimation. The model supports de novo generation, pharmacophore-conditional sampling, fragment elaboration, and multi-endpoint affinity prediction (pIC50, pKi, pKd, pEC50). Training combines large-scale ligand libraries with mixed-fidelity protein-ligand complexes, followed by refinement on curated co-crystal datasets and parameter-efficient finetuning for project-specific adaptation. this http URL achieves state-of-the-art performance in unconditional 3D molecule generation and pocket-conditional ligand design, producing geometrically realistic, low-strain structures. The integrated affinity prediction module demonstrates superior accuracy on the SPINDR test set and outperforms recent models on the Schrodinger FEP+/OpenFE benchmark with substantial speed advantages. As a foundation model, this http URL requires finetuning on project-specific datasets to account for unseen structure-activity landscapes, yielding strong correlation with experimental data. Joint generation and affinity prediction enable inference-time scaling through importance sampling, steering molecular design toward higher-affinity compounds. Case studies validate this: selective CK2alpha ligand generation against CLK3 shows significant correlation between predicted and quantum-mechanical binding energies, while ERalpha and TYK2 scaffold elaboration demonstrates strong agreement with QM calculations. By integrating structure-aware generation, affinity estimation, and property-guided sampling, this http URL provides a comprehensive foundation for structure-based drug design spanning hit identification through lead optimization.
[LG-73] Learning Multi-Index Models with Hyper-Kernel Ridge Regression
链接: https://arxiv.org/abs/2510.02532
作者: Shuo Huang,Hippolyte Labarrière,Ernesto De Vito,Tomaso Poggio,Lorenzo Rosasco
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Deep neural networks excel in high-dimensional problems, outperforming models such as kernel methods, which suffer from the curse of dimensionality. However, the theoretical foundations of this success remain poorly understood. We follow the idea that the compositional structure of the learning task is the key factor determining when deep networks outperform other approaches. Taking a step towards formalizing this idea, we consider a simple compositional model, namely the multi-index model (MIM). In this context, we introduce and study hyper-kernel ridge regression (HKRR), an approach blending neural networks and kernel methods. Our main contribution is a sample complexity result demonstrating that HKRR can adaptively learn MIM, overcoming the curse of dimensionality. Further, we exploit the kernel nature of the estimator to develop ad hoc optimization approaches. Indeed, we contrast alternating minimization and alternating gradient methods both theoretically and numerically. These numerical results complement and reinforce our theoretical findings.
[LG-74] Self-supervised diffusion model fine-tuning for costate initialization using Markov chain Monte Carlo
链接: https://arxiv.org/abs/2510.02527
作者: Jannik Graebner,Ryne Beeson
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注:
Abstract:Global search and optimization of long-duration, low-thrust spacecraft trajectories with the indirect method is challenging due to a complex solution space and the difficulty of generating good initial guesses for the costate variables. This is particularly true in multibody environments. Given data that reveals a partial Pareto optimal front, it is desirable to find a flexible manner in which the Pareto front can be completed and fronts for related trajectory problems can be found. In this work we use conditional diffusion models to represent the distribution of candidate optimal trajectory solutions. We then introduce into this framework the novel approach of using Markov Chain Monte Carlo algorithms with self-supervised fine-tuning to achieve the aforementioned goals. Specifically, a random walk Metropolis algorithm is employed to propose new data that can be used to fine-tune the diffusion model using a reward-weighted training based on efficient evaluations of constraint violations and missions objective functions. The framework removes the need for separate focused and often tedious data generation phases. Numerical experiments are presented for two problems demonstrating the ability to improve sample quality and explicitly target Pareto optimality based on the theory of Markov chains. The first problem does so for a transfer in the Jupiter-Europa circular restricted three-body problem, where the MCMC approach completes a partial Pareto front. The second problem demonstrates how a dense and superior Pareto front can be generated by the MCMC self-supervised fine-tuning method for a Saturn-Titan transfer starting from the Jupiter-Europa case versus a separate dedicated global search.
[LG-75] Adaptive randomized pivoting and volume sampling
链接: https://arxiv.org/abs/2510.02513
作者: Ethan N. Epperly
类目: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Numerical Analysis (math.NA); Computation (stat.CO)
*备注: 13 pages, 2 figures
Abstract:Adaptive randomized pivoting (ARP) is a recently proposed and highly effective algorithm for column subset selection. This paper reinterprets the ARP algorithm by drawing connections to the volume sampling distribution and active learning algorithms for linear regression. As consequences, this paper presents new analysis for the ARP algorithm and faster implementations using rejection sampling.
[LG-76] Beyond Linear Diffusions: Improved Representations for Rare Conditional Generative Modeling
链接: https://arxiv.org/abs/2510.02499
作者: Kulunu Dharmakeerthi,Yousef El-Laham,Henry H. Wong,Vamsi K. Potluru,Changhong He,Taosong He
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Diffusion models have emerged as powerful generative frameworks with widespread applications across machine learning and artificial intelligence systems. While current research has predominantly focused on linear diffusions, these approaches can face significant challenges when modeling a conditional distribution, P(Y|X=x) , when P(X=x) is small. In these regions, few samples, if any, are available for training, thus modeling the corresponding conditional density may be difficult. Recognizing this, we show it is possible to adapt the data representation and forward scheme so that the sample complexity of learning a score-based generative model is small in low probability regions of the conditioning space. Drawing inspiration from conditional extreme value theory we characterize this method precisely in the special case in the tail regions of the conditioning variable, X . We show how diffusion with a data-driven choice of nonlinear drift term is best suited to model tail events under an appropriate representation of the data. Through empirical validation on two synthetic datasets and a real-world financial dataset, we demonstrate that our tail-adaptive approach significantly outperforms standard diffusion models in accurately capturing response distributions at the extreme tail conditions.
[LG-77] Predictive inference for time series: why is split conformal effective despite temporal dependence?
链接: https://arxiv.org/abs/2510.02471
作者: Rina Foygel Barber,Ashwin Pananjady
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 22 pages
Abstract:We consider the problem of uncertainty quantification for prediction in a time series: if we use past data to forecast the next time point, can we provide valid prediction intervals around our forecasts? To avoid placing distributional assumptions on the data, in recent years the conformal prediction method has been a popular approach for predictive inference, since it provides distribution-free coverage for any iid or exchangeable data distribution. However, in the time series setting, the strong empirical performance of conformal prediction methods is not well understood, since even short-range temporal dependence is a strong violation of the exchangeability assumption. Using predictors with “memory” – i.e., predictors that utilize past observations, such as autoregressive models – further exacerbates this problem. In this work, we examine the theoretical properties of split conformal prediction in the time series setting, including the case where predictors may have memory. Our results bound the loss of coverage of these methods in terms of a new “switch coefficient”, measuring the extent to which temporal dependence within the time series creates violations of exchangeability. Our characterization of the coverage probability is sharp over the class of stationary, \beta -mixing processes. Along the way, we introduce tools that may prove useful in analyzing other predictive inference methods for dependent data.
[LG-78] Higher-arity PAC learning VC dimension and packing lemma
链接: https://arxiv.org/abs/2510.02420
作者: Artem Chernikov,Henry Towsner
类目: Machine Learning (stat.ML); Discrete Mathematics (cs.DM); Machine Learning (cs.LG); Combinatorics (math.CO); Logic (math.LO); Statistics Theory (math.ST)
*备注: 12 pages, 1 figure
Abstract:The aim of this note is to overview some of our work in Chernikov, Towsner’20 (arXiv:2010.00726) developing higher arity VC theory (VC _n dimension), including a generalization of Haussler packing lemma, and an associated tame (slice-wise) hypergraph regularity lemma; and to demonstrate that it characterizes higher arity PAC learning (PAC _n learning) in n -fold product spaces with respect to product measures introduced by Kobayashi, Kuriyama and Takeuchi’15. We also point out how some of the recent results in arXiv:2402.14294, arXiv:2505.15688, arXiv:2509.20404 follow from our work in arXiv:2010.00726.
[LG-79] he Equilibrium Response of Atmospheric Machine-Learning Models to Uniform Sea Surface Temperature Warming
链接: https://arxiv.org/abs/2510.02415
作者: Bosong Zhang,Timothy M. Merlis
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:
Abstract:Machine learning models for the global atmosphere that are capable of producing stable, multi-year simulations of Earth’s climate have recently been developed. However, the ability of these ML models to generalize beyond the training distribution remains an open question. In this study, we evaluate the climate response of several state-of-the-art ML models (ACE2-ERA5, NeuralGCM, and cBottle) to a uniform sea surface temperature warming, a widely used benchmark for evaluating climate change. We assess each ML model’s performance relative to a physics-based general circulation model (GFDL’s AM4) across key diagnostics, including surface air temperature, precipitation, temperature and wind profiles, and top-of-the-atmosphere radiation. While the ML models reproduce key aspects of the physical model response, particularly the response of precipitation, some exhibit notable departures from robust physical responses, including radiative responses and land region warming. Our results highlight the promise and current limitations of ML models for climate change applications and suggest that further improvements are needed for robust out-of-sample generalization.
信息检索
[IR-0] OpenZL: A Graph-Based Model for Compression
链接: https://arxiv.org/abs/2510.03203
作者: Yann Collet,Nick Terrell,W. Felix Handte,Danielle Rozenblit,Victor Zhang,Kevin Zhang,Yaelle Goldschlag,Jennifer Lee,Daniel Riegel,Stan Angelov,Nadav Rotem
类目: Information Retrieval (cs.IR); Databases (cs.DB)
*备注:
Abstract:Research in general-purpose lossless compression over the last decade has largely found improvements in compression ratio that come at great cost to resource utilization and processing throughput. However, most production workloads require high throughput and low resource utilization, so most research systems have seen little adoption. Instead, real world improvements in compression are increasingly often realized by building application-specific compressors which can exploit knowledge about the structure and semantics of the data being compressed. These systems easily outperform even the best generic compressors, but application-specific compression schemes are not without drawbacks. They are inherently limited in applicability and are difficult to maintain and deploy. We show that these challenges can be overcome with a new way of thinking about compression. We propose the ``graph model’’ of compression, a new theoretical framework for representing compression as a directed acyclic graph of modular codecs. This motivates OpenZL, an implementation of this model that compresses data into a self-describing wire format, any configuration of which can be decompressed by a universal decoder. OpenZL’s design enables rapid development of tailored compressors with minimal code, its universal decoder eliminates deployment lag, and its investment in a well-vetted standard component library minimizes security risks. Experimental results demonstrate that OpenZL achieves superior compression ratios and speeds compared to state-of-the-art general-purpose compressors on a variety of real-world datasets. Internal deployments at Meta have also shown consistent improvements in size and/or speed, with development timelines reduced from months to days. OpenZL thus represents an advance in practical, scalable, and maintainable data compression for modern data-intensive applications. Subjects: Information Retrieval (cs.IR); Databases (cs.DB) Cite as: arXiv:2510.03203 [cs.IR] (or arXiv:2510.03203v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2510.03203 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[IR-1] A Simple but Effective Elaborative Query Reformulation Approach for Natural Language Recommendation
链接: https://arxiv.org/abs/2510.02656
作者: Qianfeng Wen,Yifan Liu,Justin Cui,Joshua Zhang,Anton Korikov,George-Kirollos Saad,Scott Sanner
类目: Information Retrieval (cs.IR)
*备注: 11 pages, 5 figures
Abstract:Natural Language (NL) recommender systems aim to retrieve relevant items from free-form user queries and item descriptions. Existing systems often rely on dense retrieval (DR), which struggles to interpret challenging queries that express broad (e.g., “cities for youth friendly activities”) or indirect (e.g., “cities for a high school graduation trip”) user intents. While query reformulation (QR) has been widely adopted to improve such systems, existing QR methods tend to focus only on expanding the range of query subtopics (breadth) or elaborating on the potential meaning of a query (depth), but not both. In this paper, we propose EQR (Elaborative Subtopic Query Reformulation), a large language model-based QR method that combines both breadth and depth by generating potential query subtopics with information-rich elaborations. We also introduce three new natural language recommendation benchmarks in travel, hotel, and restaurant domains to establish evaluation of NL recommendation with challenging queries. Experiments show EQR substantially outperforms state-of-the-art QR methods in various evaluation metrics, highlighting that a simple yet effective QR approach can significantly improve NL recommender systems for queries with broad and indirect user intents.
[IR-2] Revisiting Query Variants: The Advantage of Retrieval Over Generation of Query Variants for Effective QPP
链接: https://arxiv.org/abs/2510.02512
作者: Fangzheng Tian,Debasis Ganguly,Craig Macdonald
类目: Information Retrieval (cs.IR)
*备注: 11 pages, 4 figures
Abstract:Leveraging query variants (QVs), i.e., queries with potentially similar information needs to the target query, has been shown to improve the effectiveness of query performance prediction (QPP) approaches. Existing QV-based QPP methods generate QVs facilitated by either query expansion or non-contextual embeddings, which may introduce topical drifts and hallucinations. In this paper, we propose a method that retrieves QVs from a training set (e.g., MS MARCO) for a given target query of QPP. To achieve a high recall in retrieving queries with the most similar information needs as the target query from a training set, we extend the directly retrieved QVs (1-hop QVs) by a second retrieval using their denoted relevant documents (which yields 2-hop QVs). Our experiments, conducted on TREC DL’19 and DL’20, show that the QPP methods with QVs retrieved by our method outperform the best-performing existing generated-QV-based QPP approaches by as much as around 20%, on neural ranking models like MonoT5.