This post lists the latest papers retrieved from Arxiv.org on 2025-09-08. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the list by scheduled email, please leave your email address in the comments.
Note: paper data is fetched from Arxiv.org and updated automatically around 12:00 each day.
Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.
Table of Contents
Overview (2025-09-08)
391 papers were updated today, including:
- Natural Language Processing: 107 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 124 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 71 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 121 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] Non-Termination Proving: 100 Million LoC and Beyond
[Quick Read]: This paper tackles the problem of detecting non-termination (divergence) in large real-world codebases. Prior approaches were constrained by code scale and applied only to small benchmarks of tens to hundreds of lines, far short of enterprise projects with tens or even hundreds of millions of lines. The key to the solution is the Pulse Infinite tool, which combines compositional and under-approximating proof techniques: the compositional design enables modular analysis of large programs for scalability, while under-approximation keeps the divergence reasoning sound even in complex program structures. Applied to open-source and proprietary code written in C, C++, and Hack, the tool identified more than 30 previously unknown non-termination issues, establishing a new state of the art for divergence detection in real-world codebases.
Link: https://arxiv.org/abs/2509.05293
Authors: Julien Vanegue, Jules Villard, Peter O’Hearn, Azalea Raad
Affiliations: Unknown
Subjects: Programming Languages (cs.PL); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: 14 pages, 4 figures
Abstract:We report on our tool, Pulse Infinite, that uses proof techniques to show non-termination (divergence) in large programs. Pulse Infinite works compositionally and under-approximately: the former supports scale, and the latter ensures soundness for proving divergence. Prior work focused on small benchmarks in the tens or hundreds of lines of code (LoC), and scale limits their practicality: a single company may have tens of millions, or even hundreds of millions of LoC or more. We report on applying Pulse Infinite to over a hundred million lines of open-source and proprietary software written in C, C++, and Hack, identifying over 30 previously unknown issues, establishing a new state of the art for detecting divergence in real-world codebases.
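The paper targets C, C++, and Hack at industrial scale; purely as an illustration of what such tools look for, below is a minimal, invented example of a divergence bug. The function and all names are hypothetical, not taken from the paper.

```python
def drain(n: int) -> int:
    """Hypothetical divergence bug: the loop only makes progress on even n,
    so every positive input eventually reaches an odd value and then loops
    forever. An under-approximating prover can soundly report this witness."""
    processed = 0
    while n > 0:
        if n % 2 == 0:
            n -= 1          # progress happens only on this branch
            processed += 1
        # odd n: no update to n -> non-termination for any n > 0
    return processed
```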
[NLP-1] Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining
[Quick Read]: This paper asks when and how specific linguistic abilities emerge during large language model (LLM) pretraining, a question that traditional evaluations such as benchmarking cannot answer because they fail to reveal how models acquire concepts and capabilities. To bridge this gap and understand training at the concept level, the authors use sparse crosscoders to discover and align feature representations across model checkpoints, and introduce a new metric, Relative Indirect Effects (RelIE), to trace the training stages at which individual features become causally important for task performance. The key is this combination of cross-checkpoint feature alignment and causal analysis, which enables fine-grained tracking of how linguistic features emerge, are maintained, or are discontinued during pretraining; the approach is architecture-agnostic and scalable, offering a promising path toward more interpretable analysis of representation learning.
Link: https://arxiv.org/abs/2509.05291
Authors: Deniz Bayazit, Aaron Mueller, Antoine Bosselut
Affiliations: EPFL; Boston University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Large language models (LLMs) learn non-trivial abstractions during pretraining, like detecting irregular plural noun subjects. However, it is not well understood when and how specific linguistic abilities emerge as traditional evaluation methods such as benchmarking fail to reveal how models acquire concepts and capabilities. To bridge this gap and better understand model training at the concept level, we use sparse crosscoders to discover and align features across model checkpoints. Using this approach, we track the evolution of linguistic features during pretraining. We train crosscoders between open-sourced checkpoint triplets with significant performance and representation shifts, and introduce a novel metric, Relative Indirect Effects (RelIE), to trace training stages at which individual features become causally important for task performance. We show that crosscoders can detect feature emergence, maintenance, and discontinuation during pretraining. Our approach is architecture-agnostic and scalable, offering a promising path toward more interpretable and fine-grained analysis of representation learning throughout pretraining.
[NLP-2] Elucidating the Design Space of Decay in Linear Attention
[Quick Read]: This paper investigates the design and optimization of decay mechanisms in linear-complexity sequence models and their impact on performance. The key is a systematic analysis of four dimensions of the decay design space: parameterization strategy, parameter sharing, decay granularity (scalar versus vector), and compatibility with relative positional encoding methods such as RoPE. The study finds that effective decay mechanisms must be configured within a specific parameter range; that careless parameter sharing can produce abnormally large or small decay values and significantly hurt performance; and that vector decay generally outperforms scalar decay, although scalar decay can unexpectedly win under certain alternative parameterization strategies. Finally, RoPE fails to provide tangible benefits to most linear attention mechanisms.
Link: https://arxiv.org/abs/2509.05282
Authors: Zhen Qin, Xuyang Shen, Yiran Zhong
Affiliations: TapTap; OpenNLPLab
Subjects: Computation and Language (cs.CL)
Comments: Accepted to COLM 2025. Yiran Zhong is the corresponding author. Code is available at this https URL
Abstract:This paper presents a comprehensive investigation into the decay mechanisms inherent in linear complexity sequence models. We systematically delineate the design space of decay mechanisms across four pivotal dimensions: parameterization strategy, which refers to the computational methodology for decay; parameter sharing, which involves the utilization of supplementary parameters for decay computation; decay granularity, comparing scalar versus vector-based decay; and compatibility with relative positional encoding methods, such as Rotary Position Embedding (RoPE). Through an extensive series of experiments conducted on diverse language modeling tasks, we uncovered several critical insights. Firstly, the design of the parameterization strategy for decay requires meticulous consideration. Our findings indicate that effective configurations are typically confined to a specific range of parameters. Secondly, parameter sharing cannot be used arbitrarily, as it may cause decay values to be too large or too small, thereby significantly impacting performance. Thirdly, under identical parameterization strategies, scalar decay generally underperforms compared to its vector-based counterpart. However, in certain scenarios with alternative parameterization strategies, scalar decay may unexpectedly surpass vector decay in efficacy. Lastly, our analysis reveals that RoPE, a commonly employed relative positional encoding method, typically fails to provide tangible benefits to the majority of linear attention mechanisms.
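As a rough sketch of the recurrence behind this design space (our notation, not the paper's code), a decayed linear-attention state update can take either a scalar decay or a per-dimension vector decay, which is the "granularity" axis the paper compares:

```python
import torch

def decayed_linear_attention(q, k, v, decay):
    """Minimal sketch of decay in linear attention:
        S_t = diag(decay) @ S_{t-1} + k_t v_t^T,   o_t = S_t^T q_t
    `decay` may be a scalar in (0, 1) (scalar decay) or a (d_k,) tensor
    (vector decay). q, k: (T, d_k); v: (T, d_v)."""
    d_k, d_v = k.shape[1], v.shape[1]
    S = torch.zeros(d_k, d_v)
    decay = torch.as_tensor(decay, dtype=torch.float32).reshape(-1, 1)
    outputs = []
    for t in range(q.shape[0]):
        S = decay * S + torch.outer(k[t], v[t])   # decay old state, add new
        outputs.append(S.T @ q[t])
    return torch.stack(outputs)                   # (T, d_v)
```

A parameterization strategy in the paper's sense would determine how `decay` itself is computed (e.g., from learned parameters or from the input); here it is simply passed in.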
[NLP-3] SpikingBrain Technical Report: Spiking Brain-inspired Large Models
[Quick Read]: This paper targets the efficiency bottlenecks of mainstream Transformer-based large language models (LLMs): training computation scales quadratically with sequence length while inference memory grows linearly, limiting long-context processing; stable and efficient large-model training on non-NVIDIA hardware poses a further challenge. The key to the solution is the brain-inspired SpikingBrain model family, with innovations at three levels: (1) architecture: linear and hybrid-linear attention with adaptive spiking neurons, enabling event-driven sparse computation; (2) algorithms: an efficient conversion-based training pipeline and a dedicated spike-coding framework; (3) systems: training frameworks, operator libraries, and parallelism strategies customized for MetaX GPU clusters. The approach markedly improves long-sequence training efficiency (e.g., SpikingBrain-7B achieves over 100x speedup in time to first token on 4M-token sequences), trains stably for weeks on hundreds of MetaX C550 GPUs with 23.4% Model FLOPs Utilization, and reaches 69.15% sparsity, demonstrating the potential of brain-inspired mechanisms for the next generation of efficient, scalable large models.
Link: https://arxiv.org/abs/2509.05276
Authors: Yuqi Pan, Yupeng Feng, Jinghao Zhuang, Siyu Ding, Zehao Liu, Bohan Sun, Yuhong Chou, Han Xu, Xuerui Qiu, Anlin Deng, Anjie Hu, Peng Zhou, Man Yao, Jibin Wu, Jian Yang, Guoliang Sun, Bo Xu, Guoqi Li
Affiliations: Institute of Automation, Chinese Academy of Sciences; Beijing Key Laboratory of Brain-Inspired General Intelligence Large Model; Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology; The Hong Kong Polytechnic University; Beijing Academy of Artificial Intelligence; Zhongguancun Academy; Beihang University; LuxiTech; MetaX Integrated Circuit Co., Ltd
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Mainstream Transformer-based large language models face major efficiency bottlenecks: training computation scales quadratically with sequence length, and inference memory grows linearly, limiting long-context processing. Building large models on non-NVIDIA platforms also poses challenges for stable and efficient training. To address this, we introduce SpikingBrain, a family of brain-inspired models designed for efficient long-context training and inference. SpikingBrain leverages the MetaX GPU cluster and focuses on three aspects: (1) Model Architecture: linear and hybrid-linear attention architectures with adaptive spiking neurons; (2) Algorithmic Optimizations: an efficient, conversion-based training pipeline and a dedicated spike coding framework; (3) System Engineering: customized training frameworks, operator libraries, and parallelism strategies tailored to MetaX hardware. Using these techniques, we develop two models: SpikingBrain-7B, a linear LLM, and SpikingBrain-76B, a hybrid-linear MoE LLM. These models demonstrate the feasibility of large-scale LLM development on non-NVIDIA platforms. SpikingBrain achieves performance comparable to open-source Transformer baselines while using only about 150B tokens for continual pre-training. Our models significantly improve long-sequence training efficiency and deliver inference with (partially) constant memory and event-driven spiking behavior. For example, SpikingBrain-7B attains over 100x speedup in Time to First Token for 4M-token sequences. Training remains stable for weeks on hundreds of MetaX C550 GPUs, with the 7B model reaching a Model FLOPs Utilization of 23.4 percent. The proposed spiking scheme achieves 69.15 percent sparsity, enabling low-power operation. Overall, this work demonstrates the potential of brain-inspired mechanisms to drive the next generation of efficient and scalable large model design.
[NLP-4] Uniform Information Density and Syntactic Reduction: Revisiting *that*-Mentioning in English Complement Clauses
[Quick Read]: This paper revisits the relationship between information density and syntactic reduction in language production, focusing on the classic finding that the optional complementizer *that* in English complement clauses is more likely to be omitted when the clause has low information density (i.e., is more predictable). The key to the approach is combining a large-scale contemporary conversational corpus with machine learning and neural language models to refine estimates of information density. This makes it possible to show that traditional measures based on matrix verbs' subcategorization probabilities capture substantial idiosyncratic lexical variation, whereas estimates derived from contextual word embeddings account for additional variance in patterns of *that*-omission.
Link: https://arxiv.org/abs/2509.05254
Authors: Hailin Hao, Elsi Kaiser
Affiliations: University of Southern California
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Speakers often have multiple ways to express the same meaning. The Uniform Information Density (UID) hypothesis suggests that speakers exploit this variability to maintain a consistent rate of information transmission during language production. Building on prior work linking UID to syntactic reduction, we revisit the finding that the optional complementizer *that* in English complement clauses is more likely to be omitted when the clause has low information density (i.e., more predictable). We advance this line of research by analyzing a large-scale, contemporary conversational corpus and using machine learning and neural language models to refine estimates of information density. Our results replicated the established relationship between information density and *that*-mentioning. However, we found that previous measures of information density based on matrix verbs’ subcategorization probability capture substantial idiosyncratic lexical variation. By contrast, estimates derived from contextual word embeddings account for additional variance in patterns of complementizer usage.
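As an illustrative sketch only (GPT-2 stands in for whatever neural language model the authors actually used), the contextual information density of a complement clause can be estimated as its total surprisal under a causal LM:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def clause_surprisal(context: str, clause: str) -> float:
    """Total surprisal (in bits) of `clause` given the preceding `context`."""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    ids = tok(context + clause, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(ids).logits, dim=-1)
    bits = 0.0
    for pos in range(ctx_len, ids.shape[1]):
        # token at `pos` is predicted by the logits at `pos - 1`
        bits -= log_probs[0, pos - 1, ids[0, pos]].item() / math.log(2)
    return bits

# UID-style comparison: lower clause surprisal ~ higher predictability,
# where omission of "that" is expected to be more likely, e.g.
# clause_surprisal("I think", " she left early")
```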
[NLP-5] CURE: Controlled Unlearning for Robust Embeddings – Mitigating Conceptual Shortcuts in Pre-Trained Language Models EMNLP2025
[Quick Read]: This paper addresses the loss of robustness and fairness caused by pre-trained language models relying on spurious, concept-driven correlations in downstream tasks. The core solution is CURE, a lightweight framework whose key idea is to disentangle and suppress conceptual shortcuts in two stages: first, a dedicated content extractor reinforced by a reversal network extracts concept-irrelevant representations while minimizing loss of task-relevant information; then a controllable debiasing module uses contrastive learning to finely adjust the influence of residual conceptual cues, either diminishing harmful biases or harnessing beneficial correlations depending on the target task. Evaluated with three pre-trained architectures, CURE achieves absolute F1 improvements of +10 points on IMDB and +2 points on Yelp with minimal computational overhead, providing a flexible, unsupervised blueprint for more reliable and fair language understanding systems.
Link: https://arxiv.org/abs/2509.05230
Authors: Aysenur Kocak, Shuo Yang, Bardh Prenkaj, Gjergji Kasneci
Affiliations: Technical University of Munich
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at the Conference on Empirical Methods in Natural Language Processing (EMNLP 2025)
Abstract:Pre-trained language models have achieved remarkable success across diverse applications but remain susceptible to spurious, concept-driven correlations that impair robustness and fairness. In this work, we introduce CURE, a novel and lightweight framework that systematically disentangles and suppresses conceptual shortcuts while preserving essential content information. Our method first extracts concept-irrelevant representations via a dedicated content extractor reinforced by a reversal network, ensuring minimal loss of task-relevant information. A subsequent controllable debiasing module employs contrastive learning to finely adjust the influence of residual conceptual cues, enabling the model to either diminish harmful biases or harness beneficial correlations as appropriate for the target task. Evaluated on the IMDB and Yelp datasets using three pre-trained architectures, CURE achieves an absolute improvement of +10 points in F1 score on IMDB and +2 points on Yelp, while introducing minimal computational overhead. Our approach establishes a flexible, unsupervised blueprint for combating conceptual biases, paving the way for more reliable and fair language understanding systems.
[NLP-6] Less is More Tokens: Efficient Math Reasoning via Difficulty-Aware Chain-of-Thought Distillation
[Quick Read]: This paper addresses the problem that chain-of-thought reasoning produces unnecessarily verbose traces on simple problems, i.e., models cannot dynamically adapt reasoning depth to task difficulty. The key to the solution is a difficulty-aware reasoning framework that requires no architectural modifications: models are simply post-trained on data carefully curated so that chain-of-thought traces are proportional in length to problem difficulty, combining supervised fine-tuning (SFT) with direct preference optimization (DPO). The analysis shows that SFT primarily captures patterns such as reasoning length and format while DPO preserves reasoning accuracy; their combination shortens reasoning on simple problems while maintaining or improving performance on complex ones, teaching models to "think proportionally".
Link: https://arxiv.org/abs/2509.05226
Authors: Abdul Waheed, Chancharik Mitra, Laurie Z. Wang, Deva Ramanan, Bhiksha Raj
Affiliations: Carnegie Mellon University
Subjects: Computation and Language (cs.CL)
Comments: 28 Pages
Abstract:Chain-of-thought reasoning, while powerful, can produce unnecessarily verbose output for simpler problems. We present a framework for difficulty-aware reasoning that teaches models to dynamically adjust reasoning depth based on problem complexity. Remarkably, we show that models can be endowed with such dynamic inference pathways without any architectural modifications; we simply post-train on data that is carefully curated to include chain-of-thought traces that are proportional in length to problem difficulty. Our analysis reveals that post-training via supervised fine-tuning (SFT) primarily captures patterns like reasoning length and format, while direct preference optimization (DPO) preserves reasoning accuracy, with their combination reducing length and maintaining or improving performance. Both quantitative metrics and qualitative assessments confirm that models can learn to “think proportionally”, reasoning minimally on simple problems while maintaining depth for complex ones.
[NLP-7] HoPE: Hyperbolic Rotary Positional Encoding for Stable Long-Range Dependency Modeling in Large Language Models
[Quick Read]: This paper addresses the limitations of current positional encodings in Transformers on long sequences: absolute positional encodings extrapolate poorly because their representations are fixed; relative schemes such as Alibi degrade on extremely long contexts; and the widely used Rotary Positional Encoding (RoPE) introduces oscillatory attention patterns that destabilize long-range dependency modeling. The key to the solution is a geometric reformulation, Hyperbolic Rotary Positional Encoding (HoPE), inspired by Lorentz transformations in hyperbolic geometry: hyperbolic functions implement Lorentz rotations on token representations, RoPE is shown to be a special case of the generalized formulation, and RoPE's stability issue is fundamentally resolved by enforcing monotonic decay of attention weights with token distance, improving the modeling and generalization of long-range dependencies.
Link: https://arxiv.org/abs/2509.05218
Authors: Chang Dai, Hongyu Shan, Mingyang Song, Di Liang
Affiliations: Peking University; Tianjin University; Tencent; Fudan University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: This paper proposes Hyperbolic Rotary Positional Encoding (HoPE), a geometric reformulation of positional encoding inspired by Lorentz transformations. HoPE addresses limitations of existing methods like RoPE by enabling stable long-distance dependency modeling. Code and data will be made available upon publication
Abstract:Positional encoding mechanisms enable Transformers to model sequential structure and long-range dependencies in text. While absolute positional encodings struggle with extrapolation to longer sequences due to fixed positional representations, and relative approaches like Alibi exhibit performance degradation on extremely long contexts, the widely-used Rotary Positional Encoding (RoPE) introduces oscillatory attention patterns that hinder stable long-distance dependency modelling. We address these limitations through a geometric reformulation of positional encoding. Drawing inspiration from Lorentz transformations in hyperbolic geometry, we propose Hyperbolic Rotary Positional Encoding (HoPE), which leverages hyperbolic functions to implement Lorentz rotations on token representations. Theoretical analysis demonstrates that RoPE is a special case of our generalized formulation. HoPE fundamentally resolves RoPE’s oscillation issues by enforcing monotonic decay of attention weights with increasing token distances. Extensive experimental results, including perplexity evaluations under several extended sequence benchmarks, show that HoPE consistently exceeds existing positional encoding methods. These findings underscore HoPE’s enhanced capacity for representing and generalizing long-range dependencies. Data and code will be available.
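Based only on the abstract's description (hyperbolic functions implementing Lorentz rotations, with RoPE as a special case), here is a minimal sketch of what a position-dependent hyperbolic rotation of feature pairs could look like. The frequency schedule is borrowed from RoPE, and everything here is our assumption rather than the released method:

```python
import torch

def hope_boost(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply a position-dependent hyperbolic (Lorentz) rotation to pairs of
    feature dimensions. x: (T, d) with d even. Sketch only: cosh/sinh replace
    RoPE's cos/sin, so magnitudes grow with position and a real system would
    need normalization of the resulting attention scores."""
    T, d = x.shape
    pos = torch.arange(T, dtype=torch.float32)
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    theta = pos[:, None] * inv_freq[None, :]          # (T, d/2) "rapidities"
    x1, x2 = x[:, 0::2], x[:, 1::2]
    c, s = torch.cosh(theta), torch.sinh(theta)
    # Lorentz boost of each (x1, x2) pair by rapidity theta
    return torch.stack([x1 * c + x2 * s, x1 * s + x2 * c], dim=-1).reshape(T, d)
```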
[NLP-8] BEDTime: A Unified Benchmark for Automatically Describing Time Series
[Quick Read]: This paper addresses two gaps in how time-series analysis models are evaluated: prior work often introduces models together with new datasets, limiting direct, independent comparisons; and evaluations cover many tasks at once without pinpointing which capabilities drive overall performance. The key to the solution is to formalize and evaluate three tasks that test a model's ability to describe time series in generic natural language - recognition (true/false question answering), differentiation (multiple-choice question answering), and generation (open-ended natural language description) - and to unify four recent datasets to enable head-to-head comparisons on each task. Evaluating 13 state-of-the-art language, vision-language, and time series-language models under this design cleanly exposes the relative strengths, weaknesses, and robustness of each model family.
Link: https://arxiv.org/abs/2509.05215
Authors: Medhasweta Sen, Zachary Gottesman, Jiaxing Qiu, C. Bayan Bruss, Nam Nguyen, Tom Hartvigsen
Affiliations: University of Virginia; Capital One
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Many recent studies have proposed general-purpose foundation models designed for a variety of time series analysis tasks. While several established datasets already exist for evaluating these models, previous works frequently introduce their models in conjunction with new datasets, limiting opportunities for direct, independent comparisons and obscuring insights into the relative strengths of different methods. Additionally, prior evaluations often cover numerous tasks simultaneously, assessing a broad range of model abilities without clearly pinpointing which capabilities contribute to overall performance. To address these gaps, we formalize and evaluate 3 tasks that test a model’s ability to describe time series using generic natural language: (1) recognition (True/False question-answering), (2) differentiation (multiple choice question-answering), and (3) generation (open-ended natural language description). We then unify 4 recent datasets to enable head-to-head model comparisons on each task. Experimentally, in evaluating 13 state-of-the-art language, vision–language, and time series–language models, we find that (1) popular language-only methods largely underperform, indicating a need for time series-specific architectures, (2) VLMs are quite successful, as expected, identifying the value of vision models for these tasks and (3) pretrained multimodal time series–language models successfully outperform LLMs, but still have significant room for improvement. We also find that all approaches exhibit clear fragility in a range of robustness tests. Overall, our benchmark provides a standardized evaluation on a task necessary for time series reasoning systems.
[NLP-9] Hunyuan-MT Technical Report
[Quick Read]: This paper addresses weak translation performance in multilingual settings, especially between Mandarin and ethnic minority languages and dialects, where existing models underperform on low-resource language pairs. The key to the solution is two models: Hunyuan-MT-7B, the team's first open-source multilingual translation model, and Hunyuan-MT-Chimera-7B, inspired by the slow-thinking mode. The former is built with a holistic training pipeline for multilingual translation (general and MT-oriented pre-training, supervised fine-tuning (SFT), and advanced alignment via reinforcement learning (RL) and weak-to-strong RL); the latter integrates multiple outputs generated by Hunyuan-MT-7B under varying parameter settings, surpassing conventional slow-thinking models based on Chain-of-Thought (CoT). In the WMT2025 shared task the models ranked first in 30 of 31 language pairs and significantly outperform comparable translation-specific models and most SOTA large models, particularly on translation between Mandarin and minority languages and dialects.
Link: https://arxiv.org/abs/2509.05209
Authors: Mao Zheng, Zheng Li, Bingxin Qu, Mingyang Song, Yang Du, Mingrui Sun, Di Wang
Affiliations: Tencent Hunyuan Team
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:In this report, we introduce Hunyuan-MT-7B, our first open-source multilingual translation model, which supports bidirectional translation across 33 major languages and places a special emphasis on translation between Mandarin and several ethnic minority languages as well as dialects. Furthermore, to serve and address diverse translation scenarios and enhance model performance at test time, we introduce Hunyuan-MT-Chimera-7B, a translation model inspired by the slow thinking mode. This model integrates multiple outputs generated by the Hunyuan-MT-7B model under varying parameter settings, thereby achieving performance superior to that of conventional slow-thinking models based on Chain-of-Thought (CoT). The development of our models follows a holistic training process specifically engineered for multilingual translation, which begins with general and MT-oriented pre-training to build foundational capabilities, proceeds to Supervised Fine-Tuning (SFT) for task-specific adaptation, and culminates in advanced alignment through Reinforcement Learning (RL) and weak-to-strong RL. Through comprehensive experimentation, we demonstrate that both Hunyuan-MT-7B and Hunyuan-MT-Chimera-7B significantly outperform all translation-specific models of comparable parameter size and most of the SOTA large models, particularly on the task of translation between Mandarin and minority languages as well as dialects. In the WMT2025 shared task (General Machine Translation), our models demonstrate state-of-the-art performance, ranking first in 30 out of 31 language pairs. This result highlights the robustness of our models across a diverse linguistic spectrum, encompassing high-resource languages such as Chinese, English, and Japanese, as well as low-resource languages including Czech, Marathi, Estonian, and Icelandic.
[NLP-10] Triadic Fusion of Cognitive, Functional, and Causal Dimensions for Explainable LLMs: The TAXAL Framework
[Quick Read]: This paper addresses the trust and accountability problems raised by deploying generative AI in high-risk domains, where opacity, bias, and instability undermine confidence, and where traditional explainability methods fail to capture the reasoning pathways, planning logic, and systemic impacts of agentic large language models (LLMs). The key to the solution is TAXAL (Triadic Alignment for eXplainability in Agentic LLMs), a triadic fusion framework uniting three complementary dimensions - cognitive (user understanding), functional (practical utility), and causal (faithful reasoning) - into a unified, role-sensitive foundation for designing, evaluating, and deploying explanations across diverse sociotechnical settings.
Link: https://arxiv.org/abs/2509.05199
Authors: David Herrera-Poyatos, Carlos Peláez-González, Cristina Zuheros, Virilo Tejedor, Rosana Montes, Francisco Herrera
Affiliations: University of Granada; Andalusian Institute of Data Science and Computational Intelligence (DaSCI)
Subjects: Computation and Language (cs.CL)
Comments: 27 pages, 9 tables and 2 figures
Abstract:Large Language Models (LLMs) are increasingly being deployed in high-risk domains where opacity, bias, and instability undermine trust and accountability. Traditional explainability methods, focused on surface outputs, do not capture the reasoning pathways, planning logic, and systemic impacts of agentic LLMs. We introduce TAXAL (Triadic Alignment for eXplainability in Agentic LLMs), a triadic fusion framework that unites three complementary dimensions: cognitive (user understanding), functional (practical utility), and causal (faithful reasoning). TAXAL provides a unified, role-sensitive foundation for designing, evaluating, and deploying explanations in diverse sociotechnical settings. Our analysis synthesizes existing methods, ranging from post-hoc attribution and dialogic interfaces to explanation-aware prompting, and situates them within the TAXAL triadic fusion model. We further demonstrate its applicability through case studies in law, education, healthcare, and public services, showing how explanation strategies adapt to institutional constraints and stakeholder roles. By combining conceptual clarity with design patterns and deployment pathways, TAXAL advances explainability as a technical and sociotechnical practice, supporting trustworthy and context-sensitive LLM applications in the era of agentic AI.
[NLP-11] PRIM: Towards Practical In-Image Multilingual Machine Translation EMNLP2025
[Quick Read]: This paper aims to close the gap between research and practice in end-to-end In-Image Machine Translation (IIMT), where current work relies mainly on synthetic data with simple backgrounds, single fonts, fixed text positions, and bilingual directions that cannot reflect real-world conditions. To advance real-world research, the authors propose Practical In-Image Multilingual Machine Translation (IIMMT) and annotate the PRIM dataset of real-world captured one-line text images with complex backgrounds, various fonts, diverse text positions, and multilingual translation directions. The key to the solution is the end-to-end VisTrans model, which processes the visual text and background information in an image separately, preserving multilingual translation capability while improving visual quality; experiments show better translation quality and visual effect than other models.
Link: https://arxiv.org/abs/2509.05146
Authors: Yanzhi Tian, Zeming Liu, Zhengyang Liu, Chong Feng, Xin Li, Heyan Huang, Yuhang Guo
Affiliations: Beijing Institute of Technology; Beihang University
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to EMNLP 2025 Main Conference
Abstract:In-Image Machine Translation (IIMT) aims to translate images containing texts from one language to another. Current research of end-to-end IIMT mainly conducts on synthetic data, with simple background, single font, fixed text position, and bilingual translation, which can not fully reflect real world, causing a significant gap between the research and practical conditions. To facilitate research of IIMT in real-world scenarios, we explore Practical In-Image Multilingual Machine Translation (IIMMT). In order to convince the lack of publicly available data, we annotate the PRIM dataset, which contains real-world captured one-line text images with complex background, various fonts, diverse text positions, and supports multilingual translation directions. We propose an end-to-end model VisTrans to handle the challenge of practical conditions in PRIM, which processes visual text and background information in the image separately, ensuring the capability of multilingual translation while improving the visual quality. Experimental results indicate the VisTrans achieves a better translation quality and visual effect compared to other models. The code and dataset are available at: this https URL.
[NLP-12] ICR: Iterative Clarification and Rewriting for Conversational Search
[Quick Read]: This paper addresses a difficulty in conversational query rewriting: when a query contains multiple fuzzy expressions, end-to-end rewriters struggle to identify and rewrite several positions simultaneously. The key to the solution is ICR (Iterative Clarification and Rewriting), an iterative scheme that pivots on clarification questions: the model alternates between generating clarification questions and rewritten queries, continuously improving retrieval performance over the clarification-rewriting iterations and ultimately achieving state-of-the-art performance on two popular datasets.
Link: https://arxiv.org/abs/2509.05100
Authors: Zhiyu Cao, Peifeng Li, Qiaoming Zhu
Affiliations: Soochow University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Most previous work on Conversational Query Rewriting employs an end-to-end rewriting paradigm. However, this approach is hindered by the issue of multiple fuzzy expressions within the query, which complicates the simultaneous identification and rewriting of multiple positions. To address this issue, we propose a novel framework ICR (Iterative Clarification and Rewriting), an iterative rewriting scheme that pivots on clarification questions. Within this framework, the model alternates between generating clarification questions and rewritten queries. The experimental results show that our ICR can continuously improve retrieval performance in the clarification-rewriting iterative process, thereby achieving state-of-the-art performance on two popular datasets.
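A minimal sketch of the alternation the abstract describes; the `clarify` and `rewrite` callables are hypothetical stand-ins for the paper's model components, not its actual interface:

```python
from typing import Callable, Optional

def icr_rewrite(
    query: str,
    history: list[str],
    clarify: Callable[[list[str], str], Optional[str]],
    rewrite: Callable[[list[str], str, str], str],
    max_rounds: int = 3,
) -> str:
    """Alternate between asking a clarification question about the next fuzzy
    span and rewriting the query in light of it, until nothing is left to
    clarify or the round budget is exhausted."""
    rewritten = query
    for _ in range(max_rounds):
        question = clarify(history, rewritten)   # None => nothing left to clarify
        if question is None:
            break
        rewritten = rewrite(history, rewritten, question)
    return rewritten
```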
[NLP-13] Finding your MUSE: Mining Unexpected Solutions Engine
[Quick Read]: This paper addresses cognitive fixation in innovation: innovators anchor on existing solutions or nascent ideas, hindering the exploration of novel alternatives. The key to the solution is a methodology for constructing Functional Concept Graphs (FCGs), interconnected representations of functional elements that support abstraction, problem reframing, and analogical inspiration; by modeling abstraction relations explicitly, it yields large-scale, high-quality FCGs and overcomes the limitations of prior work. The authors further present MUSE, an algorithm that leverages FCGs to generate creative inspirations for a given problem, demonstrated on an FCG computed over 500K patents and released for further research.
Link: https://arxiv.org/abs/2509.05072
Authors: Nir Sweed, Hanit Hakim, Ben Wolfson, Hila Lifshitz, Dafna Shahaf
Affiliations: The Hebrew University of Jerusalem; New York University; Warwick Business School
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Innovators often exhibit cognitive fixation on existing solutions or nascent ideas, hindering the exploration of novel alternatives. This paper introduces a methodology for constructing Functional Concept Graphs (FCGs), interconnected representations of functional elements that support abstraction, problem reframing, and analogical inspiration. Our approach yields large-scale, high-quality FCGs with explicit abstraction relations, overcoming limitations of prior work. We further present MUSE, an algorithm leveraging FCGs to generate creative inspirations for a given problem. We demonstrate our method by computing an FCG on 500K patents, which we release for further research.
[NLP-14] ToM-SSI: Evaluating Theory of Mind in Situated Social Interactions EMNLP2025
[Quick Read]: This paper addresses the limited view of Theory of Mind (ToM) offered by existing benchmarks, which mostly build on variations of the Sally-Anne test and text-only or dyadic interactions, neglecting the complexity of situated human social interaction. The key to the solution is the ToM-SSI benchmark, which is multimodal and, for the first time, includes group interactions of up to four agents that communicate and move in situated environments. This design enables the study of mixed cooperative-obstructive settings and parallel reasoning about multiple agents' mental states, capturing a far wider range of social cognition than existing benchmarks; evaluations show that current models' performance remains severely limited, especially on these new tasks.
Link: https://arxiv.org/abs/2509.05066
Authors: Matteo Bortoletto, Constantin Ruhdorfer, Andreas Bulling
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2025 (Main)
Abstract:Most existing Theory of Mind (ToM) benchmarks for foundation models rely on variations of the Sally-Anne test, offering only a very limited perspective on ToM and neglecting the complexity of human social interactions. To address this gap, we propose ToM-SSI: a new benchmark specifically designed to test ToM capabilities in environments rich with social interactions and spatial dynamics. While current ToM benchmarks are limited to text-only or dyadic interactions, ToM-SSI is multimodal and includes group interactions of up to four agents that communicate and move in situated environments. This unique design allows us to study, for the first time, mixed cooperative-obstructive settings and reasoning about multiple agents’ mental state in parallel, thus capturing a wider range of social cognition than existing benchmarks. Our evaluations reveal that the current models’ performance is still severely limited, especially in these new tasks, highlighting critical gaps for future research.
[NLP-15] Entropy2Vec: Crosslingual Language Modeling Entropy as End-to-End Learnable Language Representations
[Quick Read]: This paper targets the feature sparsity and static snapshots of traditional typological inventories, which limit accurate, adaptable cross-lingual representation learning. The key to the solution is the Entropy2Vec framework, which derives language representations from the prediction entropy of monolingual language models: low entropy indicates high structural similarity between languages, while high entropy suggests greater divergence. The approach yields dense, non-sparse language embeddings that adapt to different timeframes and contain no missing values, align with established typological categories, and perform competitively on downstream multilingual NLP tasks.
Link: https://arxiv.org/abs/2509.05060
Authors: Patrick Amadeus Irawan, Ryandito Diandaru, Belati Jagad Bintang Syuhada, Randy Zakya Suchrady, Alham Fikri Aji, Genta Indra Winata, Fajri Koto, Samuel Cahyawijaya
Affiliations: MBZUAI; Universitas Indonesia; NTU; Capital One; Cohere
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:We introduce Entropy2Vec, a novel framework for deriving cross-lingual language representations by leveraging the entropy of monolingual language models. Unlike traditional typological inventories that suffer from feature sparsity and static snapshots, Entropy2Vec uses the inherent uncertainty in language models to capture typological relationships between languages. By training a language model on a single language, we hypothesize that the entropy of its predictions reflects its structural similarity to other languages: Low entropy indicates high similarity, while high entropy suggests greater divergence. This approach yields dense, non-sparse language embeddings that are adaptable to different timeframes and free from missing values. Empirical evaluations demonstrate that Entropy2Vec embeddings align with established typological categories and achieved competitive performance in downstream multilingual NLP tasks, such as those addressed by the LinguAlchemy framework.
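A sketch of the core signal as we read the abstract (model choice and pooling are our assumptions): the mean next-token entropy of one monolingual LM on text from some language.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_prediction_entropy(model_name: str, text: str) -> float:
    """Average next-token entropy (bits/token) of a causal LM on `text`;
    lower entropy ~ the text's language is structurally closer to the
    LM's training language."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                          # (1, T, V)
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(-1)   # (1, T)
    return entropy.mean().item() / math.log(2)
```

Stacking such entropies from one monolingual model over sample texts of many languages would give that model's row of a language-by-language embedding matrix.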
[NLP-16] Masked Diffusion Language Models with Frequency-Informed Training
[Quick Read]: This paper addresses efficient language-model training under strict data constraints. The key to the solution is a masked diffusion language modeling framework that applies diffusion training objectives under limited data, incorporating frequency-informed masking that prioritizes learning from rare tokens while maintaining theoretical validity, and exploring multiple noise scheduling strategies (including two-mode approaches) and noise weighting schemes within the NELBO objective. On the BabyLM benchmark suite, the method is competitive with hybrid autoregressive-masked baselines, showing that diffusion-based training is a viable alternative for data-restricted language learning.
Link: https://arxiv.org/abs/2509.05056
Authors: Despoina Kosmopoulou, Efthymios Georgiou, Vaggelis Dorovatas, Georgios Paraskevopoulos, Alexandros Potamianos
Affiliations: National Technical University of Athens; Archimedes RU, Athena RC; University of Bern; Institute of Language and Signal Processing, Athena RC
Subjects: Computation and Language (cs.CL)
Comments: Preprint
Abstract:We present a masked diffusion language modeling framework for data-efficient training for the BabyLM 2025 Challenge. Our approach applies diffusion training objectives to language modeling under strict data constraints, incorporating frequency-informed masking that prioritizes learning from rare tokens while maintaining theoretical validity. We explore multiple noise scheduling strategies, including two-mode approaches, and investigate different noise weighting schemes within the NELBO objective. We evaluate our method on the BabyLM benchmark suite, measuring linguistic competence, world knowledge, and human-likeness. Results show performance competitive to hybrid autoregressive-masked baselines, demonstrating that diffusion-based training offers a viable alternative for data-restricted language learning.
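One plausible form of frequency-informed masking, sketched under our own assumptions (the paper does not publish this exact rule here): upweight the masking probability of rare tokens while keeping the expected masking rate tied to the diffusion time step.

```python
import torch

def frequency_informed_mask(token_ids, counts, t, alpha=0.5):
    """token_ids: (T,) ids; counts: (V,) corpus frequencies; t in (0, 1].
    Rare tokens get a larger masking probability, so the model is asked to
    reconstruct them more often; normalizing by the mean weight keeps the
    expected masking rate at t."""
    freq = counts[token_ids].float()
    weight = (1.0 / (freq + 1.0)) ** alpha        # rarer -> larger weight
    weight = weight / weight.mean()               # expected rate stays = t
    p_mask = (t * weight).clamp(max=1.0)
    return torch.bernoulli(p_mask).bool()         # True = replace with [MASK]
```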
[NLP-17] Sticker-TTS: Learn to Utilize Historical Experience with a Sticker-driven Test-Time Scaling Framework
[Quick Read]: This paper addresses the computational inefficiency of current test-time scaling (TTS) methods for complex reasoning, which rely predominantly on redundant sampling and fail to exploit the experience accumulated in earlier reasoning attempts. The key to the solution is the Sticker-TTS framework, in which three collaborating large reasoning models (LRMs) iteratively explore and refine solutions guided by historical attempts; at its core are distilled key conditions, termed "stickers", which drive the extraction, refinement, and reuse of critical information across multiple rounds of reasoning. A two-stage optimization strategy combining imitation learning with self-improvement further enables progressive refinement, yielding consistent gains over self-consistency and advanced reinforcement-learning baselines on AIME-24, AIME-25, and OlymMATH under comparable inference budgets.
Link: https://arxiv.org/abs/2509.05007
Authors: Jie Chen, Jinhao Jiang, Yingqian Min, Zican Dong, Shijie Wang, Wayne Xin Zhao, Ji-Rong Wen
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Northeastern University at Qinhuangdao
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages, 1 figure, 5 tables
Abstract:Large reasoning models (LRMs) have exhibited strong performance on complex reasoning tasks, with further gains achievable through increased computational budgets at inference. However, current test-time scaling methods predominantly rely on redundant sampling, ignoring the historical experience utilization, thereby limiting computational efficiency. To overcome this limitation, we propose Sticker-TTS, a novel test-time scaling framework that coordinates three collaborative LRMs to iteratively explore and refine solutions guided by historical attempts. At the core of our framework are distilled key conditions-termed stickers-which drive the extraction, refinement, and reuse of critical information across multiple rounds of reasoning. To further enhance the efficiency and performance of our framework, we introduce a two-stage optimization strategy that combines imitation learning with self-improvement, enabling progressive refinement. Extensive evaluations on three challenging mathematical reasoning benchmarks, including AIME-24, AIME-25, and OlymMATH, demonstrate that Sticker-TTS consistently surpasses strong baselines, including self-consistency and advanced reinforcement learning approaches, under comparable inference budgets. These results highlight the effectiveness of sticker-guided historical experience utilization. Our code and data are available at this https URL.
[NLP-18] Do Large Language Models Need Intent? Revisiting Response Generation Strategies for Service Assistant
[Quick Read]: This paper addresses a core design question in conversational AI service-response generation: is an explicit intent-recognition step a prerequisite for high-quality service responses, or can models bypass it and generate effective replies directly? The key to the solution is a rigorous comparative study of the two paradigms - Intent-First Response Generation and Direct Response Generation - benchmarking several state-of-the-art language models (including a fine-tuned T5 variant) on two public service-interaction datasets, with evaluation covering both linguistic quality and task success rates. The findings challenge conventional assumptions in conversational AI pipelines and offer actionable guidelines for designing more efficient and effective response generation systems.
Link: https://arxiv.org/abs/2509.05006
Authors: Inbal Bolshinsky, Shani Kupiec, Almog Sasson, Yehudit Aperstein, Alexander Apartsin
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 7 pages, 1 figure
Abstract:In the era of conversational AI, generating accurate and contextually appropriate service responses remains a critical challenge. A central question remains: Is explicit intent recognition a prerequisite for generating high-quality service responses, or can models bypass this step and produce effective replies directly? This paper conducts a rigorous comparative study to address this fundamental design dilemma. Leveraging two publicly available service interaction datasets, we benchmark several state-of-the-art language models, including a fine-tuned T5 variant, across both paradigms: Intent-First Response Generation and Direct Response Generation. Evaluation metrics encompass both linguistic quality and task success rates, revealing surprising insights into the necessity or redundancy of explicit intent modelling. Our findings challenge conventional assumptions in conversational AI pipelines, offering actionable guidelines for designing more efficient and effective response generation systems.
[NLP-19] Optimizing Small Transformer-Based Language Models for Multi-Label Sentiment Analysis in Short Texts ECAI2025
[Quick Read]: This paper addresses the difficulty of multi-label sentiment classification on short texts, where class imbalance, limited training samples, and the inherent subjectivity of sentiment labels are intensified by limited context, exacerbating ambiguity and data sparsity. The key to the solution is a systematic evaluation of three factors for small Transformer-based models (BERT and RoBERTa, under 1B parameters): (1) continued domain-specific pre-training, (2) data augmentation using automatically generated examples (generative data augmentation), and (3) architectural variations of the classification head. Experiments show that data augmentation improves classification performance, that continued pre-training on augmented datasets can introduce noise rather than boost accuracy, and that classification-head modifications yield only marginal benefits, providing practical guidance for optimizing BERT-based models in resource-constrained settings.
Link: https://arxiv.org/abs/2509.04982
Authors: Julius Neumann, Robert Lange, Yuni Susanti, Michael Färber
Affiliations: ScaDS.AI; TU Dresden; FIZ Karlsruhe
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted at LDD@ECAI 2025
Abstract:Sentiment classification in short text datasets faces significant challenges such as class imbalance, limited training samples, and the inherent subjectivity of sentiment labels – issues that are further intensified by the limited context in short texts. These factors make it difficult to resolve ambiguity and exacerbate data sparsity, hindering effective learning. In this paper, we evaluate the effectiveness of small Transformer-based models (i.e., BERT and RoBERTa, with fewer than 1 billion parameters) for multi-label sentiment classification, with a particular focus on short-text settings. Specifically, we evaluated three key factors influencing model performance: (1) continued domain-specific pre-training, (2) data augmentation using automatically generated examples, specifically generative data augmentation, and (3) architectural variations of the classification head. Our experiment results show that data augmentation improves classification performance, while continued pre-training on augmented datasets can introduce noise rather than boost accuracy. Furthermore, we confirm that modifications to the classification head yield only marginal benefits. These findings provide practical guidance for optimizing BERT-based models in resource-constrained settings and refining strategies for sentiment classification in short-text datasets.
[NLP-20] Classification of kinetic-related injury in hospital triage data using NLP
[Quick Read]: This paper addresses the classification of emergency-department triage notes under limited compute and privacy constraints. The core challenges are that sensitive medical data must be analyzed on site, that most hospitals and medical facilities lack the hardware to fine-tune (let alone train) a Large Language Model (LLM), and that labeling high-quality data requires costly, time-consuming expert input. The key to the solution is a two-stage fine-tuning pipeline: a pre-trained LLM with a classifier is first fine-tuned on a small open-source dataset (2k samples) on a GPU, then further fine-tuned on a hospital-specific dataset of 1000 samples on a CPU. By carefully curating the datasets and leveraging existing models and open-source data, triage data can be classified successfully with limited compute resources.
Link: https://arxiv.org/abs/2509.04969
Authors: Midhun Shyam, Jim Basilakis, Kieran Luken, Steven Thomas, John Crozier, Paul M. Middleton, X. Rosalind Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted as a short paper at ADMA 2025 (this https URL), with Supplementary Material available at this https URL
Abstract:Triage notes, created at the start of a patient’s hospital visit, contain a wealth of information that can help medical staff and researchers understand Emergency Department patient epidemiology and the degree of time-dependent illness or injury. Unfortunately, applying modern Natural Language Processing and Machine Learning techniques to analyse triage data faces some challenges: Firstly, hospital data contains highly sensitive information that is subject to privacy regulation thus need to be analysed on site; Secondly, most hospitals and medical facilities lack the necessary hardware to fine-tune a Large Language Model (LLM), much less training one from scratch; Lastly, to identify the records of interest, expert inputs are needed to manually label the datasets, which can be time-consuming and costly. We present in this paper a pipeline that enables the classification of triage data using LLM and limited compute resources. We first fine-tuned a pre-trained LLM with a classifier using a small (2k) open sourced dataset on a GPU; and then further fine-tuned the model with a hospital specific dataset of 1000 samples on a CPU. We demonstrated that by carefully curating the datasets and leveraging existing models and open sourced data, we can successfully classify triage data with limited compute resources.
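A minimal sketch of such a two-stage recipe (model names, column names, and hyperparameters are placeholders, not the paper's configuration): stage 1 fine-tunes on a small open dataset on GPU; stage 2 continues from that checkpoint on hospital notes on CPU.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

def fine_tune(model_dir: str, dataset, output_dir: str, use_gpu: bool) -> str:
    """One fine-tuning stage; `dataset` is assumed to be a datasets.Dataset
    with 'text' and 'label' columns."""
    tok = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir, num_labels=2)
    data = dataset.map(lambda ex: tok(ex["text"], truncation=True), batched=True)
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=3,
                             per_device_train_batch_size=8, no_cuda=not use_gpu)
    trainer = Trainer(model=model, args=args, train_dataset=data, tokenizer=tok)
    trainer.train()
    trainer.save_model(output_dir)      # stage 2 restarts from this checkpoint
    tok.save_pretrained(output_dir)
    return output_dir

# Stage 1: ~2k open-source samples, on GPU; Stage 2: ~1k hospital notes, on CPU.
# stage1 = fine_tune("bert-base-uncased", open_data, "stage1", use_gpu=True)
# fine_tune(stage1, hospital_data, "stage2", use_gpu=False)
```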
[NLP-21] Towards Ontology-Based Descriptions of Conversations with Qualitatively-Defined Concepts
[Quick Read]: This paper addresses the controllability of large language models (LLMs) used as conversational agents, in particular ensuring predictable, user-personalized responses. The core challenge is turning conversational features that are typically qualitative (such as language proficiency) into computable formal definitions that support reasoning. The key to the solution is an ontology-based approach: a set of linguistic descriptors yields quantitative definitions of qualitatively-defined concepts, which are formalized in description logic and incorporated into an ontology for reasoning and consistency checking. Using CEFR language proficiency levels as a case study, the ontology guides controlled text generation of an LLM through fine-tuning; experiments show consistent, explainable proficiency-level definitions that improve transparency in conversational AI.
Link: https://arxiv.org/abs/2509.04926
Authors: Barbara Gendron (LORIA, UL), Gaël Guibon (LIPN, LORIA), Mathieu D’aquin (LORIA, UL)
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at TOTh 2025 (Terminology & Ontology: Theories and applications)
Abstract:The controllability of Large Language Models (LLMs) when used as conversational agents is a key challenge, particularly to ensure predictable and user-personalized responses. This work proposes an ontology-based approach to formally define conversational features that are typically qualitative in nature. By leveraging a set of linguistic descriptors, we derive quantitative definitions for qualitatively-defined concepts, enabling their integration into an ontology for reasoning and consistency checking. We apply this framework to the task of proficiency-level control in conversations, using CEFR language proficiency levels as a case study. These definitions are then formalized in description logic and incorporated into an ontology, which guides controlled text generation of an LLM through fine-tuning. Experimental results demonstrate that our approach provides consistent and explainable proficiency-level definitions, improving transparency in conversational AI.
[NLP-22] SparkUI-Parser: Enhancing GUI Perception with Robust Grounding and Parsing
[Quick Read]: This paper addresses the shortcomings of existing multimodal large language models (MLLMs) for GUI perception: modeling discrete coordinates via text autoregression yields low grounding accuracy and slow inference, and such models can only locate predefined element sets rather than parse an entire interface, limiting downstream applications. The key to the solution is the end-to-end SparkUI-Parser framework, which performs continuous coordinate modeling on top of a pre-trained MLLM (instead of probability-based discrete modeling) with an additional token router and coordinate decoder, boosting both grounding precision and inference speed; a rejection mechanism based on a modified Hungarian matching algorithm further empowers the model to identify and reject non-existent elements, reducing false positives.
Link: https://arxiv.org/abs/2509.04908
Authors: Hongyi Jing, Jiafu Chen, Chen Rao, Ziqiang Dang, Jiajie Teng, Tianyi Chu, Juncheng Mo, Shuo Fang, Huaizhong Lin, Rui Lv, Chenguang Ma, Lei Zhao
Affiliations: Institute of Artificial Intelligence, School of Computer Science and Technology, Nanjing University; Department of Computer Science and Engineering, Shanghai Jiao Tong University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments:
Abstract:The existing Multimodal Large Language Models (MLLMs) for GUI perception have made great progress. However, the following challenges still exist in prior methods: 1) They model discrete coordinates based on text autoregressive mechanism, which results in lower grounding accuracy and slower inference speed. 2) They can only locate predefined sets of elements and are not capable of parsing the entire interface, which hampers the broad application and support for downstream tasks. To address the above issues, we propose SparkUI-Parser, a novel end-to-end framework where higher localization precision and fine-grained parsing capability of the entire interface are simultaneously achieved. Specifically, instead of using probability-based discrete modeling, we perform continuous modeling of coordinates based on a pre-trained Multimodal Large Language Model (MLLM) with an additional token router and coordinate decoder. This effectively mitigates the limitations inherent in the discrete output characteristics and the token-by-token generation process of MLLMs, consequently boosting both the accuracy and the inference speed. To further enhance robustness, a rejection mechanism based on a modified Hungarian matching algorithm is introduced, which empowers the model to identify and reject non-existent elements, thereby reducing false positives. Moreover, we present ScreenParse, a rigorously constructed benchmark to systematically assess structural perception capabilities of GUI models across diverse scenarios. Extensive experiments demonstrate that our approach consistently outperforms SOTA methods on ScreenSpot, ScreenSpot-v2, CAGUI-Grounding and ScreenParse benchmarks. The resources are available at this https URL.
[NLP-23] ACE-RL: Adaptive Constraint-Enhanced Reward for Long-form Generation Reinforcement Learning
[Quick Read]: This paper addresses two obstacles to high-quality long-form generation with large language models (LLMs): the scarcity of high-quality long-form response data, which limits supervised fine-tuning (SFT) and pairwise preference rewards for reinforcement learning (RL); and the focus of existing work on coarse-grained quality dimensions (relevance, coherence, helpfulness) at the expense of the fine-grained specifics of diverse long-form scenarios. The key to the solution is ACE-RL (Adaptive Constraint-Enhanced reward for long-form generation Reinforcement Learning): each instruction is automatically deconstructed into fine-grained, adaptive constraint criteria by identifying its underlying intents and demands; a reward mechanism quantifies response quality as satisfaction of those constraints, converting subjective quality evaluation into constraint verification; and reinforcement learning then guides the model toward superior long-form generation. ACE-RL outperforms SFT and RL baselines by 20.70% and 7.32% on WritingBench, with the top model surpassing GPT-4o by 7.10%.
Link: https://arxiv.org/abs/2509.04903
Authors: Jianghao Chen, Wei Sun, Qixiang Yin, Lingxing Kong, Zhixing Tan, Jiajun Zhang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Under review, our code is available at this https URL
Abstract:Large Language Models (LLMs) have demonstrated remarkable progress in long-context understanding, yet they face significant challenges in high-quality long-form generation. Existing studies primarily suffer from two limitations: (1) A heavy reliance on scarce, high-quality long-form response data for supervised fine-tuning (SFT) or for pairwise preference reward in reinforcement learning (RL). (2) Focus on coarse-grained quality optimization dimensions, such as relevance, coherence, and helpfulness, overlooking the fine-grained specifics inherent to diverse long-form generation scenarios. To address this issue, we propose a framework using Adaptive Constraint-Enhanced reward for long-form generation Reinforcement Learning (ACE-RL). ACE-RL first automatically deconstructs each instruction into a set of fine-grained, adaptive constraint criteria by identifying its underlying intents and demands. Subsequently, we design a reward mechanism that quantifies the quality of long-form responses based on their satisfaction over corresponding constraints, converting subjective quality evaluation into constraint verification. Finally, we utilize reinforcement learning to guide models toward superior long-form generation capabilities. Experimental results demonstrate that our ACE-RL framework significantly outperforms existing SFT and RL baselines by 20.70% and 7.32% on WritingBench, and our top-performing model even surpasses proprietary systems like GPT-4o by 7.10%, providing a more effective training paradigm for LLMs to generate high-quality content across diverse long-form generation scenarios.
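A sketch of the reward idea as we read it (the verifier interface is hypothetical): score a long-form response by the fraction of instruction-derived constraints it satisfies.

```python
from typing import Callable

def constraint_reward(response: str, constraints: list[str],
                      judge: Callable[[str, str], bool]) -> float:
    """Fraction of constraints satisfied by `response`. `judge(response,
    constraint) -> bool` stands in for a verifier, e.g. an LLM prompted
    for a yes/no check of one constraint at a time."""
    if not constraints:
        return 0.0
    satisfied = sum(judge(response, c) for c in constraints)
    return satisfied / len(constraints)
```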
[NLP-24] PLaMo 2 Technical Report
[Quick Read]: This paper addresses data scarcity, computational efficiency, and inference performance for Japanese-focused large language models. The key to the solution is a hybrid Samba-based architecture that transitions to full attention via continual pre-training to support 32K-token contexts; extensive synthetic corpora to overcome data scarcity; and weight reuse plus structured pruning for computational efficiency, producing an 8B model whose performance is comparable to the team's previous 100B model. Post-training with a pipeline of supervised fine-tuning (SFT) and direct preference optimization (DPO), enhanced by synthetic Japanese instruction data and model merging, further refines the models; with vLLM inference optimization and quantization at minimal accuracy loss, PLaMo 2 achieves state-of-the-art results on Japanese benchmarks.
Link: https://arxiv.org/abs/2509.04897
Authors: Preferred Networks: Kaizaburo Chubachi, Yasuhiro Fujita, Shinichi Hemmi, Yuta Hirokawa, Toshiki Kataoka, Goro Kobayashi, Kenichi Maehashi, Calvin Metzger, Hiroaki Mikami, Shogo Murai, Daisuke Nishino, Kento Nozawa, Shintarou Okada, Daisuke Okanohara, Shunta Saito, Shotaro Sano, Shuji Suzuki, Daisuke Tanaka, Avinash Ummadisingu, Hanqin Wang, Sixue Wang, Tianqi Xu
Affiliations: Preferred Networks
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:In this report, we introduce PLaMo 2, a series of Japanese-focused large language models featuring a hybrid Samba-based architecture that transitions to full attention via continual pre-training to support 32K token contexts. Training leverages extensive synthetic corpora to overcome data scarcity, while computational efficiency is achieved through weight reuse and structured pruning. This efficient pruning methodology produces an 8B model that achieves performance comparable to our previous 100B model. Post-training further refines the models using a pipeline of supervised fine-tuning (SFT) and direct preference optimization (DPO), enhanced by synthetic Japanese instruction data and model merging techniques. Optimized for inference using vLLM and quantization with minimal accuracy loss, the PLaMo 2 models achieve state-of-the-art results on Japanese benchmarks, outperforming similarly-sized open models in instruction-following, language fluency, and Japanese-specific knowledge.
[NLP-25] L1RA: Dynamic Rank Assignment in LoRA Fine-Tuning
[Quick Read]: This paper addresses the heavy computational cost of fine-tuning large language models (LLMs) on downstream tasks, especially when resources are limited. The key to the solution is L1RA, which dynamically distributes the rank of low-rank adapters (LoRA) during fine-tuning: given a total rank budget, L1 regularisation prunes redundant ranks and redistributes them across adapters, optimising resource utilisation. Experiments show that L1RA matches or improves on other LoRA variants, including the vanilla approach, at comparable or reduced overhead, and the post-training rank distribution reveals which components need the most adaptation (the feed-forward layers and the attention output projection), providing interpretable diagnostic information for model refinement and customisation.
Link: https://arxiv.org/abs/2509.04884
Authors: Raul Singh, Nicolo Brunello, Vincenzo Scotti, Mark James Carman
Affiliations: DEIB, Politecnico di Milano; KASTEL, Karlsruhe Institute of Technology
Subjects: Computation and Language (cs.CL); Performance (cs.PF)
Comments: Work published at ICNLSP 2025, waiting for publication link
Abstract:The ability of Large Language Models (LLMs) to solve complex tasks has made them crucial in the development of AI-based applications. However, the high computational requirements to fine-tune these LLMs on downstream tasks pose significant challenges, particularly when resources are limited. In response to this challenge, we introduce L1RA, a novel technique aimed at dynamically distributing the rank of low-rank adapters during fine-tuning using LoRA. Given a rank budget (i.e., total sum of adapters rank), L1RA leverages L1 regularisation to prune redundant ranks and redistribute them across adapters, thereby optimising resource utilisation. Through a series of comprehensive experiments, we empirically demonstrate that L1RA maintains comparable or even reduced computational overhead compared to other LoRA variants, including the vanilla approach, while achieving same or better performances. Moreover, the post-training analysis of rank distribution unveiled insights into the specific model components requiring the most adaptation to align with the task objective: the feed-forward layers and the attention output projection. These results highlight the efficacy of L1RA in not only enhancing the efficiency of LLM fine-tuning, but also in providing valuable diagnostic information for model refinement and customisation. In conclusion, L1RA stands as a promising technique for advancing the performance and interpretability of LLM adaptation, particularly in scenarios where computational resources are constrained.
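A minimal sketch of the underlying mechanism, assuming one natural reading (a learnable per-rank gate with an L1 penalty); this is our reconstruction, not the authors' code:

```python
import torch
import torch.nn as nn

class L1RALinear(nn.Module):
    """LoRA layer with a learnable per-rank gate. An L1 penalty on the gate
    drives redundant ranks toward zero, so their share of the rank budget
    can be pruned and reassigned to other adapters."""
    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.gate = nn.Parameter(torch.ones(r))     # one gate per rank

    def forward(self, x):
        # low-rank update: (x @ A^T) scaled per-rank by the gate, then B
        return self.base(x) + (x @ self.A.T) * self.gate @ self.B.T

    def l1_penalty(self):
        return self.gate.abs().sum()

# total loss: task_loss + lambda_l1 * sum(m.l1_penalty() for m in lora_layers);
# ranks whose gate shrinks to ~0 are pruned and their budget redistributed.
```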
[NLP-26] Using LLMs for Multilingual Clinical Entity Linking to ICD-10
[Quick Read]: This paper addresses the automatic linking of terms in clinical text to codes of the International Classification of Diseases, 10th revision (ICD-10), so that standardized structured information can be extracted from unstructured documents such as discharge summaries. The key to the solution is a multistage pipeline: clinical dictionaries first match unambiguous terms in the text; terms the dictionary cannot map are then predicted with GPT-4.1 via in-context learning, improving cross-lingual ICD-10 coding accuracy. The system shows promising results on benchmark datasets in Spanish (0.89 F1 for categories and 0.78 F1 for subcategories on CodiEsp) and Greek (0.85 F1 on ElCardioCC), demonstrating its effectiveness in multilingual settings.
Link: https://arxiv.org/abs/2509.04868
Authors: Sylvia Vassileva, Ivan Koychev, Svetla Boytcheva
Affiliations: Sofia University St. Kliment Ohridski; Graphwise
Subjects: Computation and Language (cs.CL)
Comments: 7 pages, 2 figures, to be published in Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing, RANLP 2025
Abstract:The linking of clinical entities is a crucial part of extracting structured information from clinical texts. It is the process of assigning a code from a medical ontology or classification to a phrase in the text. The International Classification of Diseases - 10th revision (ICD-10) is an international standard for classifying diseases for statistical and insurance purposes. Automatically assigning the correct ICD-10 code to terms in discharge summaries will simplify the work of healthcare professionals and ensure consistent coding in hospitals. Our paper proposes an approach for linking clinical terms to ICD-10 codes in different languages using Large Language Models (LLMs). The approach consists of a multistage pipeline that uses clinical dictionaries to match unambiguous terms in the text and then applies in-context learning with GPT-4.1 to predict the ICD-10 code for the terms that do not match the dictionary. Our system shows promising results in predicting ICD-10 codes on different benchmark datasets in Spanish - 0.89 F1 for categories and 0.78 F1 on subcategories on CodiEsp, and Greek - 0.85 F1 on ElCardioCC.
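The two-stage control flow the abstract describes is straightforward; below is a minimal sketch in which `llm_predict` is a hypothetical wrapper around the in-context model call:

```python
from typing import Callable

def link_icd10(term: str, dictionary: dict[str, str],
               llm_predict: Callable[[str], str]) -> str:
    """Dictionary lookup first; fall back to the LLM only for terms the
    dictionary cannot resolve unambiguously."""
    code = dictionary.get(term.lower().strip())
    if code is not None:            # unambiguous dictionary match
        return code
    return llm_predict(term)        # in-context learning fallback

# e.g. dictionary = {"pneumonia": "J18.9"}
# link_icd10("Pneumonia", dictionary, my_gpt_wrapper)
```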
[NLP-27] Memorization ≠ Understanding: Do Large Language Models Have the Ability of Scenario Cognition? EMNLP2025
[Quick Read]: This paper investigates whether the generalization that large language models (LLMs) show across NLP tasks arises from shallow memorization of training data or from deep semantic understanding. The key to the solution is a bi-perspective evaluation framework for scenario cognition, the ability to link semantic scenario elements with their arguments in context: models are assessed from the output perspective (answering scenario-related questions on a novel scenario-based dataset of fictional facts) and from the internal-representation perspective (probing whether scenario element-argument associations are encoded internally). The experiments reveal that current LLMs rely predominantly on superficial memorization and fail to achieve robust semantic scenario cognition, exposing critical limits of their semantic understanding.
Link: https://arxiv.org/abs/2509.04866
Authors: Boxiang Ma, Ru Li, Yuanlong Wang, Hongye Tan, Xiaoli Li
Affiliations: Shanxi University; Singapore University of Technology and Design
Subjects: Computation and Language (cs.CL)
Comments: EMNLP 2025 Main Conference
Abstract:Driven by vast and diverse textual data, large language models (LLMs) have demonstrated impressive performance across numerous natural language processing (NLP) tasks. Yet, a critical question persists: does their generalization arise from mere memorization of training data or from deep semantic understanding? To investigate this, we propose a bi-perspective evaluation framework to assess LLMs’ scenario cognition - the ability to link semantic scenario elements with their arguments in context. Specifically, we introduce a novel scenario-based dataset comprising diverse textual descriptions of fictional facts, annotated with scenario elements. LLMs are evaluated through their capacity to answer scenario-related questions (model output perspective) and via probing their internal representations for encoded scenario elements-argument associations (internal representation perspective). Our experiments reveal that current LLMs predominantly rely on superficial memorization, failing to achieve robust semantic scenario cognition, even in simple cases. These findings expose critical limitations in LLMs’ semantic understanding and offer cognitive insights for advancing their capabilities.
[NLP-28] Evaluating Cognitive-Behavioral Fixation via Multimodal User Viewing Patterns on Social Media
[Quick Read]: This paper addresses how to computationally detect and evaluate cognitive-behavioral fixation, i.e., sustained and repetitive user engagement with narrow content domains on digital social platforms. Although the phenomenon has been studied extensively in psychology, effective computational methods for identifying and quantifying it have been lacking. The key to the solution is a framework that analyzes users' multimodal social media engagement patterns, combining a multimodal topic extraction module with a cognitive-behavioral fixation quantification module to enable adaptive, hierarchical, and interpretable assessment of user behavior; experiments on existing benchmarks and a newly curated multimodal dataset demonstrate its effectiveness, laying the groundwork for scalable computational analysis of cognitive fixation.
Link: https://arxiv.org/abs/2509.04823
Authors: Yujie Wang, Yunwei Zhao, Jing Yang, Han Han, Shiguang Shan, Jie Zhang
Affiliations: State Key Laboratory of AI Safety; Institute of Computing Technology; Chinese Academy of Sciences; University of Chinese Academy of Sciences; CNCERT/CC
Subjects: Social and Information Networks (cs.SI); Computation and Language (cs.CL)
Comments:
Abstract:Digital social media platforms frequently contribute to cognitive-behavioral fixation, a phenomenon in which users exhibit sustained and repetitive engagement with narrow content domains. While cognitive-behavioral fixation has been extensively studied in psychology, methods for computationally detecting and evaluating such fixation remain underexplored. To address this gap, we propose a novel framework for assessing cognitive-behavioral fixation by analyzing users’ multimodal social media engagement patterns. Specifically, we introduce a multimodal topic extraction module and a cognitive-behavioral fixation quantification module that collaboratively enable adaptive, hierarchical, and interpretable assessment of user behavior. Experiments on existing benchmarks and a newly curated multimodal dataset demonstrate the effectiveness of our approach, laying the groundwork for scalable computational analysis of cognitive fixation. All code in this project is publicly available for research purposes at this https URL.
zh
[NLP-29] AFD-SLU: Adaptive Feature Distillation for Spoken Language Understanding
【Quick Read】: This paper targets two practical obstacles for spoken language understanding (SLU) systems: the scarcity of labeled training data and the heavy computational cost of deploying large language models (LLMs). To mitigate these, the authors propose an Adaptive Feature Distillation (AFD) framework whose core innovation is a dynamic adapter built on a Residual Projection Neural Network (RPNN) to align heterogeneous feature spaces, together with a Dynamic Distillation Coefficient (DDC) that adaptively modulates distillation strength based on real-time feedback from intent and slot prediction performance, maintaining strong performance while substantially reducing model complexity.
链接: https://arxiv.org/abs/2509.04821
Authors: Yan Xie, Yibo Cui, Liang Xie, Erwei Yin
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 5 pages, 1 figure
Abstract:Spoken Language Understanding (SLU) is a core component of conversational systems, enabling machines to interpret user utterances. Despite its importance, developing effective SLU systems remains challenging due to the scarcity of labeled training data and the computational burden of deploying Large Language Models (LLMs) in real-world applications. To further alleviate these issues, we propose an Adaptive Feature Distillation framework that transfers rich semantic representations from a General Text Embeddings (GTE)-based teacher model to a lightweight student model. Our method introduces a dynamic adapter equipped with a Residual Projection Neural Network (RPNN) to align heterogeneous feature spaces, and a Dynamic Distillation Coefficient (DDC) that adaptively modulates the distillation strength based on real-time feedback from intent and slot prediction performance. Experiments on the Chinese profile-based ProSLU benchmark demonstrate that AFD-SLU achieves state-of-the-art results, with 95.67% intent accuracy, 92.02% slot F1 score, and 85.50% overall accuracy.
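A rough PyTorch rendering of the weighted feature-distillation idea: a projection network maps student features into the teacher's space, and the distillation weight shrinks as task accuracy improves. The dimensions, the plain feed-forward projector (standing in for the RPNN), and the linear feedback rule are all assumptions for illustration, not the paper's exact formulation.

```python
# Toy dynamically weighted feature-distillation loss in the spirit of AFD.
import torch
import torch.nn as nn

teacher_dim, student_dim = 768, 256
project = nn.Sequential(               # stand-in for the residual projection network
    nn.Linear(student_dim, teacher_dim),
    nn.ReLU(),
    nn.Linear(teacher_dim, teacher_dim),
)

def distill_loss(student_feat, teacher_feat, task_acc, base_coeff=1.0):
    """Weight the feature MSE more when task accuracy is low (assumed rule)."""
    coeff = base_coeff * (1.0 - task_acc)          # dynamic distillation coefficient
    return coeff * nn.functional.mse_loss(project(student_feat), teacher_feat)

student_feat = torch.randn(4, student_dim)
teacher_feat = torch.randn(4, teacher_dim)
print(distill_loss(student_feat, teacher_feat, task_acc=0.7))
```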
zh
[NLP-30] Analyzing Finnish Inflectional Classes through Discriminative Lexicon and Deep Learning Models
【Quick Read】: This paper asks whether inflectional classes are cognitively real, i.e., whether native speakers must discover these classes in order to inflect the words of their language correctly. The key to the solution is using the Discriminative Lexicon Model (DLM) to comprehend and produce Finnish inflected nouns without positing inflectional classes in advance. Trained and tested on a dataset of 55,271 inflected forms of 2,000 high-frequency nouns, the models learn and generalize inflection effectively without explicit class divisions, performing best for high-frequency, type-rich, productive classes, which suggests that inflectional classes may be statistical by-products of usage frequency rather than cognitively necessary structures.
链接: https://arxiv.org/abs/2509.04813
Authors: Alexandre Nikolaev, Yu-Ying Chuang, R. Harald Baayen
Affiliations: University of Eastern Finland; National Taiwan Normal University; University of Tübingen
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Descriptions of complex nominal or verbal systems make use of inflectional classes. Inflectional classes bring together nouns which have similar stem changes and use similar exponents in their paradigms. Although inflectional classes can be very useful for language teaching as well as for setting up finite state morphological systems, it is unclear whether inflectional classes are cognitively real, in the sense that native speakers would need to discover these classes in order to learn how to properly inflect the nouns of their language. This study investigates whether the Discriminative Lexicon Model (DLM) can understand and produce Finnish inflected nouns without setting up inflectional classes, using a dataset with 55,271 inflected nouns of 2000 high-frequency Finnish nouns from 49 inflectional classes. Several DLM comprehension and production models were set up. Some models were not informed about frequency of use, and provide insight into learnability with infinite exposure (endstate learning). Other models were set up from a usage based perspective, and were trained with token frequencies being taken into consideration (frequency-informed learning). On training data, models performed with very high accuracies. For held-out test data, accuracies decreased, as expected, but remained acceptable. Across most models, performance increased for inflectional classes with more types, more lower-frequency words, and more hapax legomena, mirroring the productivity of the inflectional classes. The model struggles more with novel forms of unproductive and less productive classes, and performs far better for unseen forms belonging to productive classes. However, for usage-based production models, frequency was the dominant predictor of model performance, and correlations with measures of productivity were tenuous or absent.
zh
[NLP-31] Code Review Without Borders: Evaluating Synthetic vs. Real Data for Review Recommendation
【Quick Read】: This paper addresses the inability to train supervised models for deciding whether a code change needs human review in emerging programming languages or frameworks, where labeled data is scarce. The core of the solution is to use large language models (LLMs) to translate code changes from high-resource languages into equivalent changes in low-resource ones, generating synthetic training data; supervised classifiers trained on this data substantially narrow the performance gap relative to models trained on real labeled data, offering a scalable path for deploying automated code review in rapidly evolving technology stacks.
链接: https://arxiv.org/abs/2509.04810
Authors: Yogev Cohen, Dudi Ohayon, Romy Somkin, Yehudit Aperstein, Alexander Apartsin
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 4 pages, 1 figure
Abstract:Automating the decision of whether a code change requires manual review is vital for maintaining software quality in modern development workflows. However, the emergence of new programming languages and frameworks creates a critical bottleneck: while large volumes of unlabelled code are readily available, there is an insufficient amount of labelled data to train supervised models for review classification. We address this challenge by leveraging Large Language Models (LLMs) to translate code changes from well-resourced languages into equivalent changes in underrepresented or emerging languages, generating synthetic training data where labelled examples are scarce. We assume that although LLMs have learned the syntax and semantics of new languages from available unlabelled code, they have yet to fully grasp which code changes are considered significant or review-worthy within these emerging ecosystems. To overcome this, we use LLMs to generate synthetic change examples and train supervised classifiers on them. We systematically compare the performance of these classifiers against models trained on real labelled data. Our experiments across multiple GitHub repositories and language pairs demonstrate that LLM-generated synthetic data can effectively bootstrap review recommendation systems, narrowing the performance gap even in low-resource settings. This approach provides a scalable pathway to extend automated code review capabilities to rapidly evolving technology stacks, even in the absence of annotated data.
zh
[NLP-32] Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs
【Quick Read】: This paper addresses the failure of current safety evaluation frameworks to capture deployment-specific risks as large language models evolve into agentic systems. The core of the solution is AgentSeer, an observability-based evaluation framework that decomposes agentic executions into fine-grained action graphs and component graphs, enabling systematic agentic-situational assessment. The method surfaces "agentic-only" vulnerabilities that emerge exclusively in agentic contexts and are invisible to model-level evaluation, and shows that tool-calling incurs substantially higher attack success rates (ASR) than plain text interaction, establishing the need for agentic-situation evaluation paradigms.
链接: https://arxiv.org/abs/2509.04802
Authors: Ilham Wicaksono, Zekun Wu, Theo King, Adriano Koshiyama, Philip Treleaven
Affiliations: Holistic AI; University College London
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:As large language models transition to agentic systems, current safety evaluation frameworks face critical gaps in assessing deployment-specific risks. We introduce AgentSeer, an observability-based evaluation framework that decomposes agentic executions into granular action and component graphs, enabling systematic agentic-situational assessment. Through cross-model validation on GPT-OSS-20B and Gemini-2.0-flash using HarmBench single turn and iterative refinement attacks, we demonstrate fundamental differences between model-level and agentic-level vulnerability profiles. Model-level evaluation reveals baseline differences: GPT-OSS-20B (39.47% ASR) versus Gemini-2.0-flash (50.00% ASR), with both models showing susceptibility to social engineering while maintaining logic-based attack resistance. However, agentic-level assessment exposes agent-specific risks invisible to traditional evaluation. We discover “agentic-only” vulnerabilities that emerge exclusively in agentic contexts, with tool-calling showing 24-60% higher ASR across both models. Cross-model analysis reveals universal agentic patterns, agent transfer operations as highest-risk tools, semantic rather than syntactic vulnerability mechanisms, and context-dependent attack effectiveness, alongside model-specific security profiles in absolute ASR levels and optimal injection strategies. Direct attack transfer from model-level to agentic contexts shows degraded performance (GPT-OSS-20B: 57% human injection ASR; Gemini-2.0-flash: 28%), while context-aware iterative attacks successfully compromise objectives that failed at model-level, confirming systematic evaluation gaps. These findings establish the urgent need for agentic-situation evaluation paradigms, with AgentSeer providing the standardized methodology and empirical validation.
zh
[NLP-33] Knowledge Collapse in LLM s: When Fluency Survives but Facts Fail under Recursive Synthetic Training
【Quick Read】: This paper addresses knowledge collapse in large language models under recursive training on synthetic data: factual accuracy degrades while surface fluency persists, yielding "confidently wrong" outputs that pose serious risks in accuracy-critical settings. The key to the solution is a domain-specific synthetic training strategy that substantially improves resistance to collapse at modest computational cost, combined with an evaluation framework pairing model-centric indicators with task-centric metrics to detect distinct degradation phases and enable reproducible monitoring of epistemic deterioration.
链接: https://arxiv.org/abs/2509.04796
Authors: Figarri Keisha, Zekun Wu, Ze Wang, Adriano Koshiyama, Philip Treleaven
Affiliations: Holistic AI; University College London
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large language models increasingly rely on synthetic data due to human-written content scarcity, yet recursive training on model-generated outputs leads to model collapse, a degenerative process threatening factual reliability. We define knowledge collapse as a distinct three-stage phenomenon where factual accuracy deteriorates while surface fluency persists, creating “confidently wrong” outputs that pose critical risks in accuracy-dependent domains. Through controlled experiments with recursive synthetic training, we demonstrate that collapse trajectory and timing depend critically on instruction format, distinguishing instruction-following collapse from traditional model collapse through its conditional, prompt-dependent nature. We propose domain-specific synthetic training as a targeted mitigation strategy that achieves substantial improvements in collapse resistance while maintaining computational efficiency. Our evaluation framework combines model-centric indicators with task-centric metrics to detect distinct degradation phases, enabling reproducible assessment of epistemic deterioration across different language models. These findings provide both theoretical insights into collapse dynamics and practical guidance for sustainable AI training in knowledge-intensive applications where accuracy is paramount.
zh
[NLP-34] Personality as a Probe for LLM Evaluation: Method Trade-offs and Downstream Effects
【Quick Read】: This paper addresses the unclear mechanisms and trade-offs of personality manipulation in large language models (LLMs), particularly how to achieve stable, efficient, and interpretable personality control in customer-service and agentic scenarios. The key to the solution is a systematic methodology: a contrastive dataset enabling fair cross-method evaluation; a unified evaluation framework based on within-run Δ analysis that disentangles reasoning capability, agent performance, and demographic bias; trait purification techniques that mitigate representational overlap in Big Five trait encoding; and a three-level stability framework quantifying method-, trait-, and combination-level robustness. Experiments reveal clear trade-offs: in-context learning (ICL) costs the least capability but has limited effect, parameter-efficient fine-tuning (PEFT) achieves the strongest personality alignment at a marked cost to task performance, and mechanistic steering (MS) offers lightweight runtime control with competitive effectiveness, with personality encoding concentrating in intermediate layers, opening new paths for deployment and interpretability.
链接: https://arxiv.org/abs/2509.04794
Authors: Gunmay Handa, Zekun Wu, Adriano Koshiyama, Philip Treleaven
Affiliations: Holistic AI; University College London
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Personality manipulation in large language models (LLMs) is increasingly applied in customer service and agentic scenarios, yet its mechanisms and trade-offs remain unclear. We present a systematic study of personality control using the Big Five traits, comparing in-context learning (ICL), parameter-efficient fine-tuning (PEFT), and mechanistic steering (MS). Our contributions are fourfold. First, we construct a contrastive dataset with balanced high/low trait responses, enabling effective steering vector computation and fair cross-method evaluation. Second, we introduce a unified evaluation framework based on within-run Δ analysis that disentangles reasoning capability, agent performance, and demographic bias across MMLU, GAIA, and BBQ benchmarks. Third, we develop trait purification techniques to separate openness from conscientiousness, addressing representational overlap in trait encoding. Fourth, we propose a three-level stability framework that quantifies method-, trait-, and combination-level robustness, offering practical guidance under deployment constraints. Experiments on Gemma-2-2B-IT and LLaMA-3-8B-Instruct reveal clear trade-offs: ICL achieves strong alignment with minimal capability loss, PEFT delivers the highest alignment at the cost of degraded task performance, and MS provides lightweight runtime control with competitive effectiveness. Trait-level analysis shows openness as uniquely challenging, agreeableness as most resistant to ICL, and personality encoding consolidating around intermediate layers. Taken together, these results establish personality manipulation as a multi-level probe into behavioral representation, linking surface conditioning, parameter encoding, and activation-level steering, and positioning mechanistic steering as a lightweight alternative to fine-tuning for both deployment and interpretability.
zh
[NLP-35] Enhancing Diversity in Large Language Models via Determinantal Point Processes
【Quick Read】: This paper addresses the loss of output diversity in large language models (LLMs) after supervised fine-tuning and reinforcement learning, where responses converge to narrow, canonical answers. The key to the solution is DQO (Diversity-Quality Optimization), a training method based on determinantal point processes (DPPs): semantic diversity is quantified as the determinant of a kernel similarity matrix over the embeddings of a group of sampled responses and optimized jointly with quality, substantially improving semantic diversity without sacrificing model quality.
链接: https://arxiv.org/abs/2509.04784
Authors: Yilei Chen, Souradip Chakraborty, Lorenz Wolf, Ioannis Ch. Paschalidis, Aldo Pacchiano
Affiliations: Boston University; University of Maryland; University College London; Broad Institute of MIT and Harvard
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Supervised fine-tuning and reinforcement learning are two popular methods for post-training large language models (LLMs). While improving the model’s performance on downstream tasks, they often reduce the model’s output diversity, leading to narrow, canonical responses. Existing methods to enhance diversity are limited, either by operating at inference time or by focusing on lexical differences. We propose a novel training method named DQO based on determinantal point processes (DPPs) to jointly optimize LLMs for quality and semantic diversity. Our approach samples and embeds a group of responses for each prompt, then uses the determinant of a kernel-based similarity matrix to measure diversity as the volume spanned by the embeddings of these responses. Experiments across instruction-following, summarization, story generation, and reasoning tasks demonstrate that our method substantially improves semantic diversity without sacrificing model quality.
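The determinant-as-diversity idea is easy to verify numerically: nearly duplicate embeddings give a near-singular Gram matrix (spanned volume close to zero), while spread-out embeddings give a larger determinant. A toy NumPy check, with random vectors standing in for real response embeddings (the kernel choice here is an assumption, not necessarily the one used in DQO):

```python
# Measuring set diversity as det of a kernel similarity (Gram) matrix, DPP-style.
import numpy as np

rng = np.random.default_rng(0)

def diversity_score(embeddings: np.ndarray) -> float:
    """det of an RBF-kernel Gram matrix: larger volume => more diverse set."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sq_dists = ((e[:, None, :] - e[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq_dists)  # K[i, j] = exp(-||e_i - e_j||^2)
    return float(np.linalg.det(K))

similar = rng.normal(size=(1, 8)) + 0.01 * rng.normal(size=(4, 8))  # near-duplicates
diverse = rng.normal(size=(4, 8))                                   # spread-out vectors
print("near-duplicate responses:", diversity_score(similar))  # close to 0
print("diverse responses:      ", diversity_score(diverse))   # noticeably larger
```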
zh
[NLP-36] Decoders Laugh as Loud as Encoders
【Quick Read】: This paper probes whether current large language models (LLMs) genuinely understand nuanced semantic tasks such as humor recognition, rather than merely scoring well on them. The key finding is that a fine-tuned decoder (GPT-4o) matches the best fine-tuned encoder (RoBERTa) on humor recognition, with F1-macro scores of roughly 0.85 versus 0.86, suggesting that decoder-based generative models can reach understanding-level performance comparable to strong encoders on this task.
链接: https://arxiv.org/abs/2509.04779
Authors: Eli Borodach, Raj Dandekar, Rajat Dandekar, Sreedath Panat
Affiliations: Vizuara AI Labs
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:From the dawn of the computer, Alan Turing dreamed of a robot that could communicate using language like a human being. The recent advances in the field of Large Language Models (LLMs) shocked the scientific community: a single model can be applied to various natural language processing (NLP) tasks, with output results sometimes better than most human communication skills. Models such as GPT, Claude, Grok, etc. have left their mark on the scientific community. However, it is unclear how much these models understand what they produce, especially on a nuanced theme such as humor. The question of whether computers understand humor is still open (among decoders, the latest to be checked was GPT-2). We address this issue in this paper; we show that a fine-tuned decoder (GPT-4o) performed as well (mean F1-macro score of 0.85) as the best fine-tuned encoder (RoBERTa, with a mean F1-macro score of 0.86).
zh
[NLP-37] Research on Multi-hop Inference Optimization of LLM Based on MQUAKE Framework
【Quick Read】: This paper targets the insufficient accuracy of large language models (LLMs) on complex questions. The key to the solution is a knowledge-graph-based multi-hop question decomposition method that breaks a complex question into a chain of logically connected sub-questions, improving comprehension and reasoning. Experiments show the approach significantly outperforms answering complex questions directly, both with and without fine-tuning, and retains its advantage after LoRA (Low-Rank Adaptation) fine-tuning, validating the effectiveness of multi-hop decomposition for enhancing LLMs' complex question answering.
链接: https://arxiv.org/abs/2509.04770
Authors: Zucheng Liang, Wenxin Wei, Kaijie Zhang, Hongyi Chen
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Accurately answering complex questions has consistently been a significant challenge for Large Language Models (LLMs). To address this, this paper proposes a multi-hop question decomposition method for complex questions, building upon research within the MQUAKE framework. Utilizing the LLAMA3 model, we systematically investigate the impact of multi-hop question decomposition within knowledge graphs on model comprehension and reasoning accuracy, both before and after model training. In our experiments, we systematically partitioned and converted the MQUAKE-T dataset into two distinct formats: a single-hop dataset designed for directly answering complex questions, and a multi-hop dataset constructed using the multi-hop question decomposition method. We then fine-tuned the LLAMA3 model on these datasets and conducted inference tests. Our results demonstrate that, without fine-tuning the LLM, the prediction performance based on the multi-hop question decomposition method significantly outperforms the method of directly answering complex questions. After fine-tuning using the LoRA (Low-Rank Adaptation) method, the performance of both approaches improved compared to the untrained baseline. Crucially, the method utilizing multi-hop decomposition consistently maintained its superiority. These findings validate the effectiveness of the multi-hop decomposition method both before and after training, demonstrating its capability to effectively enhance the LLM’s ability to answer complex questions.
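A sketch of what multi-hop decomposition looks like operationally: a complex question becomes chained sub-questions whose answers feed into one another. The decomposition, the canned answers, and the answer_hop helper are invented for illustration; the paper fine-tunes LLAMA3 to perform these steps.

```python
# Multi-hop question decomposition: answer hop i, substitute it into hop i+1.
complex_question = "Who is the head of state of the country where the Nobel Prizes are awarded?"

hops = [
    "In which country are the Nobel Prizes awarded?",   # hop 1
    "Who is the head of state of {answer_1}?",          # hop 2, uses hop-1 answer
]

def answer_hop(question: str) -> str:
    """Placeholder for a single-hop LLM / knowledge-graph lookup."""
    canned = {
        "In which country are the Nobel Prizes awarded?": "Sweden",
        "Who is the head of state of Sweden?": "Carl XVI Gustaf",
    }
    return canned.get(question, "unknown")

answers = {}
for i, template in enumerate(hops, start=1):
    q = template.format(**{f"answer_{j}": a for j, a in answers.items()})
    answers[i] = answer_hop(q)
    print(f"hop {i}: {q} -> {answers[i]}")
```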
zh
[NLP-38] A Study of Large Language Models for Patient Information Extraction: Model Architecture Fine-Tuning Strategy and Multi-task Instruction Tuning
【Quick Read】: This paper examines how to use large language models (LLMs) effectively for patient information extraction from clinical text, focusing on model architecture choice, parameter-efficient fine-tuning (PEFT) strategies, and the effect of multi-task instruction tuning on few-shot performance. The key to the solution is a systematic comparison of encoder-only (e.g., BERT, GatorTron) and decoder-only (e.g., GatorTronGPT, Llama 3.1, GatorTronLlama) LLMs across five diverse clinical datasets, efficient fine-tuning via prompt-based PEFT, and a multi-task instruction tuning framework trained jointly on four datasets to assess zero- and few-shot generalization, providing empirical grounding and optimization guidance for building robust, transferable clinical information extraction systems.
链接: https://arxiv.org/abs/2509.04753
Authors: Cheng Peng, Xinyu Dong, Mengxian Lyu, Daniel Paredes, Yaoyun Zhang, Yonghui Wu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Natural language processing (NLP) is a key technology to extract important patient information from clinical narratives to support healthcare applications. The rapid development of large language models (LLMs) has revolutionized many NLP tasks in the clinical domain, yet their optimal use in patient information extraction tasks requires further exploration. This study examines LLMs’ effectiveness in patient information extraction, focusing on LLM architectures, fine-tuning strategies, and multi-task instruction tuning techniques for developing robust and generalizable patient information extraction systems. This study aims to explore key concepts of using LLMs for clinical concept and relation extraction tasks, including: (1) encoder-only or decoder-only LLMs, (2) prompt-based parameter-efficient fine-tuning (PEFT) algorithms, and (3) multi-task instruction tuning on few-shot learning performance. We benchmarked a suite of LLMs, including encoder-based LLMs (BERT, GatorTron) and decoder-based LLMs (GatorTronGPT, Llama 3.1, GatorTronLlama), across five datasets. We compared traditional full-size fine-tuning and prompt-based PEFT. We explored a multi-task instruction tuning framework that combines both tasks across four datasets to evaluate the zero-shot and few-shot learning performance using the leave-one-dataset-out strategy.
zh
[NLP-39] Phonological Representation Learning for Isolated Signs Improves Out-of-Vocabulary Generalization
【Quick Read】: This paper addresses poor generalization to unseen signs in sign language recognition caused by unrepresentative vocabulary datasets, a failure often tied to spurious correlations in learned representations. The key to the solution is two linguistically motivated inductive biases: Parameter Disentanglement, an architectural bias that encourages more interpretable discrete representations, and Phonological Semi-Supervision, a regularization technique that improves reconstruction of unseen signs. Experiments show the approach outperforms a controlled baseline on both recognition of known signs and one-shot reconstruction of unseen signs, confirming that explicit linguistic priors improve the generalization of learned sign language representations.
链接: https://arxiv.org/abs/2509.04745
Authors: Lee Kezar, Zed Sehyr, Jesse Thomason
Affiliations: University of Southern California; Chapman University
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Sign language datasets are often not representative in terms of vocabulary, underscoring the need for models that generalize to unseen signs. Vector quantization is a promising approach for learning discrete, token-like representations, but it has not been evaluated whether the learned units capture spurious correlations that hinder out-of-vocabulary performance. This work investigates two phonological inductive biases: Parameter Disentanglement, an architectural bias, and Phonological Semi-Supervision, a regularization technique, to improve isolated sign recognition of known signs and reconstruction quality of unseen signs with a vector-quantized autoencoder. The primary finding is that the learned representations from the proposed model are more effective for one-shot reconstruction of unseen signs and more discriminative for sign identification compared to a controlled baseline. This work provides a quantitative analysis of how explicit, linguistically-motivated biases can improve the generalization of learned representations of sign language.
zh
[NLP-40] WildScore: Benchmarking MLLMs in-the-Wild Symbolic Music Reasoning
【Quick Read】: This paper addresses the largely unexplored reasoning abilities of multimodal large language models (MLLMs) in the symbolic music domain, in particular their capacity to interpret real-world music scores and answer complex musicological questions. The key to the solution is WildScore, the first in-the-wild multimodal symbolic music reasoning and analysis benchmark, built from genuine compositions paired with authentic user-generated questions and discussions, together with a systematic taxonomy spanning high-level and fine-grained musicological ontologies, and a multiple-choice question-answering formulation that enables controlled, scalable assessment of MLLMs' symbolic music understanding.
链接: https://arxiv.org/abs/2509.04744
Authors: Gagan Mundada, Yash Vishe, Amit Namburi, Xin Xu, Zachary Novack, Julian McAuley, Junda Wu
Affiliations: University of California, San Diego
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:
Abstract:Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across various vision-language tasks. However, their reasoning abilities in the multimodal symbolic music domain remain largely unexplored. We introduce WildScore, the first in-the-wild multimodal symbolic music reasoning and analysis benchmark, designed to evaluate MLLMs’ capacity to interpret real-world music scores and answer complex musicological queries. Each instance in WildScore is sourced from genuine musical compositions and accompanied by authentic user-generated questions and discussions, capturing the intricacies of practical music analysis. To facilitate systematic evaluation, we propose a systematic taxonomy, comprising both high-level and fine-grained musicological ontologies. Furthermore, we frame complex music reasoning as multiple-choice question answering, enabling controlled and scalable assessment of MLLMs’ symbolic music understanding. Empirical benchmarking of state-of-the-art MLLMs on WildScore reveals intriguing patterns in their visual-symbolic reasoning, uncovering both promising directions and persistent challenges for MLLMs in symbolic music reasoning and analysis. We release the dataset and code.
zh
[NLP-41] Language-Driven Hierarchical Task Structures as Explicit World Models for Multi-Agent Learning
【Quick Read】: This position paper targets the difficulties current multi-agent reinforcement learning (MARL) faces on complex, long-horizon tasks, where vast exploration spaces and sparse rewards make training inefficient: agents trained in structurally flat simulators struggle to acquire high-level strategic behavior, as in robotic soccer. The key to the solution is building environments with an explicit, hierarchical World Model and using large language models (LLMs) to dynamically generate a task-oriented hierarchical scaffold, which supplies an intrinsic curriculum, dense and meaningful learning signals, and a framework for compositional learning, markedly improving agents' sample efficiency and strategic sophistication.
链接: https://arxiv.org/abs/2509.04731
Authors: Brennen Hill
Affiliations: University of Wisconsin-Madison
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments:
Abstract:The convergence of Language models, Agent models, and World models represents a critical frontier for artificial intelligence. While recent progress has focused on scaling Language and Agent models, the development of sophisticated, explicit World Models remains a key bottleneck, particularly for complex, long-horizon multi-agent tasks. In domains such as robotic soccer, agents trained via standard reinforcement learning in high-fidelity but structurally-flat simulators often fail due to intractable exploration spaces and sparse rewards. This position paper argues that the next frontier in developing capable agents lies in creating environments that possess an explicit, hierarchical World Model. We contend that this is best achieved through hierarchical scaffolding, where complex goals are decomposed into structured, manageable subgoals. Drawing evidence from a systematic review of 2024 research in multi-agent soccer, we identify a clear and decisive trend towards integrating symbolic and hierarchical methods with multi-agent reinforcement learning (MARL). These approaches implicitly or explicitly construct a task-based world model to guide agent learning. We then propose a paradigm shift: leveraging Large Language Models to dynamically generate this hierarchical scaffold, effectively using language to structure the World Model on the fly. This language-driven world model provides an intrinsic curriculum, dense and meaningful learning signals, and a framework for compositional learning, enabling Agent Models to acquire sophisticated, strategic behaviors with far greater sample efficiency. By building environments with explicit, language-configurable task layers, we can bridge the gap between low-level reactive behaviors and high-level strategic team play, creating a powerful and generalizable framework for training the next generation of intelligent agents.
zh
[NLP-42] KERAG: Knowledge-Enhanced Retrieval-Augmented Generation for Advanced Question Answering EMNLP
【Quick Read】: This paper addresses hallucination in large language models (LLMs) that lack external knowledge support, together with the low coverage of traditional knowledge graph question answering (KGQA), whose strict semantic parsing suffers from rigid schema requirements and semantic ambiguity. The key to the solution is KERAG, a KG-based retrieval-augmented generation (RAG) pipeline that retrieves a broader subgraph to raise information coverage and applies a retrieval-filtering-summarization strategy with fine-tuned chain-of-thought reasoning, reducing noise and improving answers for both simple and complex questions.
链接: https://arxiv.org/abs/2509.04716
Authors: Yushi Sun, Kai Sun, Yifan Ethan Xu, Xiao Yang, Xin Luna Dong, Nan Tang, Lei Chen
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted by EMNLP Findings 2025
Abstract:Retrieval-Augmented Generation (RAG) mitigates hallucination in Large Language Models (LLMs) by incorporating external data, with Knowledge Graphs (KGs) offering crucial information for question answering. Traditional Knowledge Graph Question Answering (KGQA) methods rely on semantic parsing, which typically retrieves knowledge strictly necessary for answer generation, thus often suffer from low coverage due to rigid schema requirements and semantic ambiguity. We present KERAG, a novel KG-based RAG pipeline that enhances QA coverage by retrieving a broader subgraph likely to contain relevant information. Our retrieval-filtering-summarization approach, combined with fine-tuned LLMs for Chain-of-Thought reasoning on knowledge sub-graphs, reduces noises and improves QA for both simple and complex questions. Experiments demonstrate that KERAG surpasses state-of-the-art solutions by about 7% in quality and exceeds GPT-4o (Tool) by 10-21%.
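A schematic of the retrieve-then-filter stage on a toy triple store. The graph contents, the word-overlap filter, and both helper functions are invented for illustration; the actual system's filtering and summarization use fine-tuned LLMs with chain-of-thought reasoning.

```python
# Broad subgraph retrieval followed by crude relevance filtering, KERAG-style.
KG = [  # (subject, relation, object) triples
    ("Marie Curie", "born_in", "Warsaw"),
    ("Marie Curie", "field", "Physics"),
    ("Marie Curie", "award", "Nobel Prize in Physics"),
    ("Warsaw", "capital_of", "Poland"),
]

def retrieve_subgraph(entity: str, hops: int = 2) -> list:
    """Grab a broad neighborhood around the entity, not just exact matches."""
    frontier, triples = {entity}, []
    for _ in range(hops):
        new = [t for t in KG if t[0] in frontier or t[2] in frontier]
        triples.extend(t for t in new if t not in triples)
        frontier |= {t[0] for t in new} | {t[2] for t in new}
    return triples

def filter_triples(triples: list, question: str) -> list:
    """Seed on word overlap, then keep triples touching any seed entity."""
    words = {w.strip("?.,").lower() for w in question.split()}
    def tokens(t):
        return {w for part in t for w in part.lower().replace("_", " ").split()}
    seed = [t for t in triples if words & tokens(t)]
    ents = {t[0] for t in seed} | {t[2] for t in seed}
    return [t for t in triples if t in seed or t[0] in ents or t[2] in ents]

question = "In which country was Marie Curie born?"
evidence = filter_triples(retrieve_subgraph("Marie Curie"), question)
print(evidence)  # the filtered subgraph would then be summarized for the LLM
```

Note how the two-hop retrieval keeps the ("Warsaw", "capital_of", "Poland") triple even though it shares no words with the question; a strict semantic parse would likely miss it, which is the coverage argument made above.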
zh
[NLP-43] OleSpeech-IV: A Large-Scale Multispeaker and Multilingual Conversational Speech Dataset with Diverse Topics
【Quick Read】: This paper addresses the scarcity and uneven annotation quality of multispeaker, multilingual conversational speech data for robust speech recognition and dialogue understanding research. The key to the solution is OleSpeech-IV, a large-scale, high-quality multispeaker and multilingual conversational speech dataset with diverse topics, whose audio comes from publicly available English podcasts, talk shows, teleconferences, and other real conversations; speaker names, turns, and transcripts are human-sourced and refined by a proprietary pipeline that also derives structured information such as timestamps and confidence scores, improving data usability and model generalization for downstream tasks.
链接: https://arxiv.org/abs/2509.04702
Authors: Wei Chu, Yuanzhe Dong, Ke Tan, Dong Han, Xavier Menendez-Pidal, Ruchao Fan, Chenfeng Miao, Chanwoo Kim, Bhiksha Raj, Rita Singh
Affiliations: Olewave; Stanford University; Microsoft; PingAn Technology; Korea University; Carnegie Mellon University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:OleSpeech-IV dataset is a large-scale multispeaker and multilingual conversational speech dataset with diverse topics. The audio content comes from publicly-available English podcasts, talk shows, teleconferences, and other conversations. Speaker names, turns, and transcripts are human-sourced and refined by a proprietary pipeline, while additional information such as timestamps and confidence scores is derived from the pipeline. The IV denotes its position as Tier IV in the Olewave dataset series. In addition, we have open-sourced a subset, OleSpeech-IV-2025-EN-AR-100, for non-commercial research use.
zh
[NLP-44] ODKE: Ontology-Guided Open-Domain Knowledge Extraction with LLM s
【Quick Read】: This paper tackles the central challenge of knowledge graph (KG) maintenance: keeping knowledge fresh and complete while containing human cost, since traditional methods are inefficient and under-cover for large-scale open-domain fact extraction and updating. The key to the solution is ODKE+, a scalable automated system with a modular pipeline: an Extraction Initiator detects missing or stale facts, an Evidence Retriever gathers supporting documents, hybrid extractors combine pattern-based rules with ontology-guided LLM prompting, a lightweight Grounder validates extractions with a second LLM, and a Corroborator ranks and normalizes candidates for ingestion. The pipeline dynamically generates entity-type-specific ontology snippets to keep extractions schema-consistent across 195 predicates, reaching 98.8% precision, substantially higher coverage, and a 50-day average reduction in update lag, demonstrating that LLM-based extraction grounded in ontological structure and verification workflows enables trustworthy, production-scale knowledge ingestion.
链接: https://arxiv.org/abs/2509.04696
Authors: Samira Khorshidi, Azadeh Nikfarjam, Suprita Shankar, Yisi Sang, Yash Govind, Hyun Jang, Ali Kasgari, Alexis McClimans, Mohamed Soliman, Vishnu Konda, Ahmed Fakhry, Xiaoguang Qi
Affiliations: Apple Inc
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Knowledge graphs (KGs) are foundational to many AI applications, but maintaining their freshness and completeness remains costly. We present ODKE+, a production-grade system that automatically extracts and ingests millions of open-domain facts from web sources with high precision. ODKE+ combines modular components into a scalable pipeline: (1) the Extraction Initiator detects missing or stale facts, (2) the Evidence Retriever collects supporting documents, (3) hybrid Knowledge Extractors apply both pattern-based rules and ontology-guided prompting for large language models (LLMs), (4) a lightweight Grounder validates extracted facts using a second LLM, and (5) the Corroborator ranks and normalizes candidate facts for ingestion. ODKE+ dynamically generates ontology snippets tailored to each entity type to align extractions with schema constraints, enabling scalable, type-consistent fact extraction across 195 predicates. The system supports batch and streaming modes, processing over 9 million Wikipedia pages and ingesting 19 million high-confidence facts with 98.8% precision. ODKE+ significantly improves coverage over traditional methods, achieving up to 48% overlap with third-party KGs and reducing update lag by 50 days on average. Our deployment demonstrates that LLM-based extraction, grounded in ontological structure and verification workflows, can deliver trustworthiness, production-scale knowledge ingestion with broad real-world applicability. A recording of the system demonstration is included with the submission and is also available at this https URL.
zh
[NLP-45] Why Language Models Hallucinate
【Quick Read】: This paper asks why large language models (LLMs) "hallucinate", producing plausible but false statements instead of admitting uncertainty. The analysis shows the phenomenon is neither mysterious nor a defect of model complexity: hallucinations originate as ordinary errors in binary classification under natural statistical pressure, and they persist because training and evaluation reward guessing over acknowledging uncertainty, since prevalent benchmarks penalize abstention and push models toward guessing in ambiguous situations. The key remedy is socio-technical: modify the scoring of the dominant benchmarks so that expressions of uncertainty are no longer penalized, steering models toward more trustworthy response strategies rather than pure test-taking.
链接: https://arxiv.org/abs/2509.04664
Authors: Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, Edwin Zhang
Affiliations: OpenAI; Georgia Tech
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such “hallucinations” persist even in state-of-the-art systems and undermine trust. We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty, and we analyze the statistical causes of hallucinations in the modern training pipeline. Hallucinations need not be mysterious – they originate simply as errors in binary classification. If incorrect statements cannot be distinguished from facts, then hallucinations in pretrained language models will arise through natural statistical pressures. We then argue that hallucinations persist due to the way most evaluations are graded – language models are optimized to be good test-takers, and guessing when uncertain improves test performance. This “epidemic” of penalizing uncertain responses can only be addressed through a socio-technical mitigation: modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards, rather than introducing additional hallucination evaluations. This change may steer the field toward more trustworthy AI systems.
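The incentive argument reduces to simple expected-score arithmetic, illustrated below with assumed numbers: binary grading makes guessing weakly dominant for any nonzero chance of being right, while a wrong-answer penalty can make abstention optimal for a sufficiently uncertain model.

```python
# Expected scores for guess vs. abstain under two toy grading schemes.
p_correct_guess = 0.25   # assumed chance the model's best guess is right

# Scheme 1: 1 point if correct, 0 otherwise (typical benchmark grading).
score_if_guess   = p_correct_guess * 1 + (1 - p_correct_guess) * 0   # 0.25
score_if_abstain = 0.0                                               # "I don't know" earns nothing
print(f"binary grading -> guess: {score_if_guess:.2f}, abstain: {score_if_abstain:.2f}")

# Scheme 2: wrong answers cost -1/3; abstention now breaks even at p = 0.25
# and becomes strictly better below it.
penalty = -1 / 3
score_if_guess_penalized = p_correct_guess * 1 + (1 - p_correct_guess) * penalty
print(f"penalized grading -> guess: {score_if_guess_penalized:.2f}, abstain: 0.00")
```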
zh
[NLP-46] Evaluating NL2SQL via SQL2NL EMNLP2025
【Quick Read】: This paper addresses the inadequate evaluation of NL2SQL models' robustness to linguistic variation, a factor existing benchmarks rarely treat in a systematic or controlled manner. The key to the solution is a schema-aligned paraphrasing framework that uses SQL-to-NL (SQL2NL) to automatically generate semantically equivalent but lexically diverse queries while preserving alignment with the original schema and intent, enabling the first targeted evaluation of NL2SQL models' sensitivity to linguistic variation in isolation.
链接: https://arxiv.org/abs/2509.04657
Authors: Mohammadtaher Safarzadeh, Afshin Oroojlooyjadid, Dan Roth
Affiliations: Oracle AI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
Comments: Accepted to EMNLP 2025
Abstract:Robust evaluation in the presence of linguistic variation is key to understanding the generalization capabilities of Natural Language to SQL (NL2SQL) models, yet existing benchmarks rarely address this factor in a systematic or controlled manner. We propose a novel schema-aligned paraphrasing framework that leverages SQL-to-NL (SQL2NL) to automatically generate semantically equivalent, lexically diverse queries while maintaining alignment with the original schema and intent. This enables the first targeted evaluation of NL2SQL robustness to linguistic variation in isolation, distinct from prior work that primarily investigates ambiguity or schema perturbations. Our analysis reveals that state-of-the-art models are far more brittle than standard benchmarks suggest. For example, LLaMa3.3-70B exhibits a 10.23% drop in execution accuracy (from 77.11% to 66.9%) on paraphrased Spider queries, while LLaMa3.1-8B suffers an even larger drop of nearly 20% (from 62.9% to 42.5%). Smaller models (e.g., GPT-4o mini) are disproportionately affected. We also find that robustness degradation varies significantly with query complexity, dataset, and domain, highlighting the need for evaluation frameworks that explicitly measure linguistic generalization to ensure reliable performance in real-world settings.
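The evaluation protocol can be pictured as follows: run the model on an original question and on a schema-aligned paraphrase, execute both predicted SQL queries, and count a robustness failure when the results diverge. The nl2sql lookup below is a stand-in for the model under test, and the tiny SQLite table is invented; the paper works with Spider-style benchmarks.

```python
# Execution-match check on an original question vs. a paraphrase.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE singer (name TEXT, country TEXT);
INSERT INTO singer VALUES ('Ana', 'Spain'), ('Bo', 'France'), ('Cy', 'Spain');
""")

def nl2sql(question: str) -> str:
    """Placeholder model: a lookup table; a real system would call an LLM."""
    return {
        "How many singers are from Spain?":
            "SELECT COUNT(*) FROM singer WHERE country = 'Spain'",
        "What is the number of singers whose country is Spain?":
            "SELECT COUNT(*) FROM singer",  # simulated model slip-up on the paraphrase
    }[question]

original = "How many singers are from Spain?"
paraphrase = "What is the number of singers whose country is Spain?"

ref = conn.execute(nl2sql(original)).fetchall()
hyp = conn.execute(nl2sql(paraphrase)).fetchall()
print("execution match:", ref == hyp)  # False -> counted as a robustness failure
```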
zh
[NLP-47] AraHalluEval: A Fine-grained Hallucination Evaluation Framework for Arabic LLMs
【Quick Read】: This paper addresses the under-evaluation of factual hallucination in Arabic large language models (LLMs) on generative question answering (GQA) and summarization, since hallucination research has centered on English despite Arabic's wide global use. The key contribution is the first fine-grained, task-specific hallucination evaluation framework for Arabic, comprising 12 hallucination indicators tailored to the characteristics of each task, used to assess the factual consistency and faithfulness of 12 LLMs (4 Arabic pre-trained, 4 multilingual, 4 reasoning-based). Results show factual hallucinations are more prevalent than faithfulness errors, and the Arabic pre-trained model Allam exhibits lower hallucination rates than multilingual models and performance approaching reasoning-based ones.
链接: https://arxiv.org/abs/2509.04656
Authors: Aisha Alansari, Hamzah Luqman
Affiliations: King Fahd University of Petroleum and Minerals; SDAIA-KFUPM Joint Research Center for Artificial Intelligence
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Recently, extensive research on the hallucination of the large language models (LLMs) has mainly focused on the English language. Despite the growing number of multilingual and Arabic-specific LLMs, evaluating LLMs’ hallucination in the Arabic context remains relatively underexplored. The knowledge gap is particularly pressing given Arabic’s widespread use across many regions and its importance in global communication and media. This paper presents the first comprehensive hallucination evaluation of Arabic and multilingual LLMs on two critical Arabic natural language generation tasks: generative question answering (GQA) and summarization. This study evaluates a total of 12 LLMs, including 4 Arabic pre-trained models, 4 multilingual models, and 4 reasoning-based models. To assess the factual consistency and faithfulness of LLMs’ outputs, we developed a fine-grained hallucination evaluation framework consisting of 12 fine-grained hallucination indicators that represent the varying characteristics of each task. The results reveal that factual hallucinations are more prevalent than faithfulness errors across all models and tasks. Notably, the Arabic pre-trained model Allam consistently demonstrates lower hallucination rates than multilingual models and a comparative performance with reasoning-based models. The code is available at this https URL (GitHub link).
zh
[NLP-48] Polysemantic Dropout: Conformal OOD Detection for Specialized LLMs EMNLP2025
【Quick Read】: This paper addresses the risk that specialized large language models (LLMs) produce unreliable outputs on out-of-domain (OOD) inputs at inference time, which is especially dangerous in critical applications such as medicine. The key to the solution is a novel non-conformity measure within the Inductive Conformal Anomaly Detection (ICAD) framework based on the model's dropout tolerance: hypothesizing that in-domain inputs tolerate dropout better than OOD inputs, the method aggregates dropout tolerance across multiple layers via a valid ensemble approach, improving detection while preserving ICAD's theoretical false-alarm bounds. Experiments on medical-specialized LLMs show the approach detects OOD inputs better than baselines, with AUROC gains of 2% to 37%.
链接: https://arxiv.org/abs/2509.04655
Authors: Ayush Gupta, Ramneet Kaur, Anirban Roy, Adam D. Cobb, Rama Chellappa, Susmit Jha
Affiliations: SRI; Johns Hopkins University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to EMNLP 2025 main conference
Abstract:We propose a novel inference-time out-of-domain (OOD) detection algorithm for specialized large language models (LLMs). Despite achieving state-of-the-art performance on in-domain tasks through fine-tuning, specialized LLMs remain vulnerable to incorrect or unreliable outputs when presented with OOD inputs, posing risks in critical applications. Our method leverages the Inductive Conformal Anomaly Detection (ICAD) framework, using a new non-conformity measure based on the model’s dropout tolerance. Motivated by recent findings on polysemanticity and redundancy in LLMs, we hypothesize that in-domain inputs exhibit higher dropout tolerance than OOD inputs. We aggregate dropout tolerance across multiple layers via a valid ensemble approach, improving detection while maintaining theoretical false alarm bounds from ICAD. Experiments with medical-specialized LLMs show that our approach detects OOD inputs better than baseline methods, with AUROC improvements of 2% to 37% when treating OOD datapoints as positives and in-domain test datapoints as negatives.
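A toy version of the ICAD recipe with (negated) dropout tolerance as the non-conformity score: calibrate on in-domain scores, then convert a test score into a conformal p-value by its rank. The tolerance numbers below are simulated rather than measured from an LLM, and only the single-layer case is shown (the paper ensembles across layers).

```python
# Conformal p-value from a dropout-tolerance non-conformity score.
import numpy as np

rng = np.random.default_rng(1)

# Calibration: dropout tolerance of known in-domain inputs (higher = more tolerant).
calib_tolerance = rng.normal(loc=0.8, scale=0.05, size=200)

def conformal_p_value(test_tolerance: float) -> float:
    """Fraction of calibration scores at least as anomalous (low-tolerance)."""
    nonconformity = -test_tolerance
    calib_nonconformity = -calib_tolerance
    return (np.sum(calib_nonconformity >= nonconformity) + 1) / (len(calib_tolerance) + 1)

for label, tol in [("in-domain-like", 0.82), ("OOD-like", 0.55)]:
    p = conformal_p_value(tol)
    print(f"{label}: p = {p:.3f} -> {'flag OOD' if p < 0.05 else 'accept'}")
```

By the standard ICAD guarantee, thresholding this p-value at 0.05 bounds the false-alarm rate on in-domain inputs at roughly 5%, which is the property the aggregation scheme above is designed to preserve.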
zh
[NLP-49] Comparative Analysis of Transformer Models in Disaster Tweet Classification for Public Safety
【Quick Read】: This paper addresses automatic classification of disaster-related tweets to speed up emergency response, where traditional machine learning (ML) models fall short on informal, metaphorical, or ambiguous language and fail to capture tweet context. The key to the solution is transformer-based pre-trained language models (BERT, DistilBERT, RoBERTa, DeBERTa), whose contextual embeddings and attention mechanisms capture deeper semantics; BERT reaches 91% accuracy, clearly surpassing traditional ML methods and demonstrating stronger language understanding and generalization for public safety applications.
链接: https://arxiv.org/abs/2509.04650
Authors: Sharif Noor Zisad, Ragib Hasan
Affiliations: University of Alabama at Birmingham
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Twitter and other social media platforms have become vital sources of real time information during disasters and public safety emergencies. Automatically classifying disaster related tweets can help emergency services respond faster and more effectively. Traditional Machine Learning (ML) models such as Logistic Regression, Naive Bayes, and Support Vector Machines have been widely used for this task, but they often fail to understand the context or deeper meaning of words, especially when the language is informal, metaphorical, or ambiguous. We posit that, in this context, transformer based models can perform better than traditional ML models. In this paper, we evaluate the effectiveness of transformer based models, including BERT, DistilBERT, RoBERTa, and DeBERTa, for classifying disaster related tweets. These models are compared with traditional ML approaches to highlight the performance gap. Experimental results show that BERT achieved the highest accuracy (91%), significantly outperforming traditional models like Logistic Regression and Naive Bayes (both at 82%). The use of contextual embeddings and attention mechanisms allows transformer models to better understand subtle language in tweets, where traditional ML models fall short. This research demonstrates that transformer architectures are far more suitable for public safety applications, offering improved accuracy, deeper language understanding, and better generalization across real world social media text.
zh
[NLP-50] Maestro: Joint Graph Config Optimization for Reliable AI Agents
【Quick Read】: This paper addresses the reliability problems that arise when LLM agents are optimized only at the configuration level (prompts, tools, models per node) while the graph structure (which modules exist and how information flows) is held fixed, leaving structural failure modes unaddressed. The key to the solution is Maestro, a framework-agnostic holistic optimizer that jointly searches over agent graphs and node configurations to maximize agent quality under explicit rollout and token budgets, and additionally exploits reflective textual feedback from execution traces to prioritize edits, improving sample efficiency and targeting specific failure modes. On IFBench and HotpotQA it consistently beats leading prompt optimizers (MIPROv2, GEPA, and GEPA+Merge) and stays ahead even when restricted to prompt-only optimization, confirming that joint graph-configuration search fixes structural defects that prompt tuning alone cannot.
链接: https://arxiv.org/abs/2509.04642
Authors: Wenxiao Wang, Priyatham Kattakinda, Soheil Feizi
Affiliations: RELAI.ai
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: Technical Report by this http URL
Abstract:Building reliable LLM agents requires decisions at two levels: the graph (which modules exist and how information flows) and the configuration of each node (models, prompts, tools, control knobs). Most existing optimizers tune configurations while holding the graph fixed, leaving structural failure modes unaddressed. We introduce Maestro, a framework-agnostic holistic optimizer for LLM agents that jointly searches over graphs and configurations to maximize agent quality, subject to explicit rollout/token budgets. Beyond numeric metrics, Maestro leverages reflective textual feedback from traces to prioritize edits, improving sample efficiency and targeting specific failure modes. On the IFBench and HotpotQA benchmarks, Maestro consistently surpasses leading prompt optimizers (MIPROv2, GEPA, and GEPA+Merge) by an average of 12%, 4.9%, and 4.86%, respectively; even when restricted to prompt-only optimization, it still leads by 9.65%, 2.37%, and 2.41%. Maestro achieves these results with far fewer rollouts than GEPA. We further show large gains on two applications (interviewer RAG agents), highlighting that joint graph configuration search addresses structural failure modes that prompt tuning alone cannot fix.
zh
[NLP-51] Breaking to Build: A Threat Model of Prompt-Based Attacks for Securing LLMs
【Quick Read】: This paper addresses the security risks that prompt-based attacks pose to deployed large language models (LLMs), including unauthorized model distillation, fine-tuning, and editing, which can lead to intellectual property theft, misinformation generation, and erosion of user trust. The key contribution is a systematic survey and categorization of prompt-based attack methodologies into a clear threat model, with detailed analysis of each attack class's mechanisms and impacts, providing a theoretical and practical foundation for developing the next generation of LLMs that are inherently resistant to such attacks.
链接: https://arxiv.org/abs/2509.04615
Authors: Brennen Hill, Surendra Parla, Venkata Abhijeeth Balabhadruni, Atharv Prajod Padmalayam, Sujay Chandra Shekara Sharma
Affiliations: University of Wisconsin-Madison
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:The proliferation of Large Language Models (LLMs) has introduced critical security challenges, where adversarial actors can manipulate input prompts to cause significant harm and circumvent safety alignments. These prompt-based attacks exploit vulnerabilities in a model’s design, training, and contextual understanding, leading to intellectual property theft, misinformation generation, and erosion of user trust. A systematic understanding of these attack vectors is the foundational step toward developing robust countermeasures. This paper presents a comprehensive literature survey of prompt-based attack methodologies, categorizing them to provide a clear threat model. By detailing the mechanisms and impacts of these exploits, this survey aims to inform the research community’s efforts in building the next generation of secure LLMs that are inherently resistant to unauthorized distillation, fine-tuning, and editing.
zh
[NLP-52] Sample-efficient Integration of New Modalities into Large Language Models
【Quick Read】: This paper addresses two core obstacles to integrating new modalities into multimodal foundation models: the space of possible modalities is vast and keeps evolving, so training from scratch to cover them all is infeasible; and existing methods for integrating a new modality into a pre-trained LLM typically need large amounts of paired data, which low-resource modalities lack. The key to the solution, Sample-Efficient Modality Integration (SEMI), is a hypernetwork that, conditioned on a few samples of the target modality at inference time, generates an adapter for a shared projector placed between modality-specific encoders and the LLM. Trained on high-resource modalities (text, speech, audio, video), with isometric transformations used to multiply encoder diversity, SEMI integrates new modalities of arbitrary embedding dimensionality (e.g., satellite images, astronomical images, inertial measurements, molecules) from very few samples; matching the accuracy of 32-shot SEMI by training the projector from scratch requires 64 times more data.
链接: https://arxiv.org/abs/2509.04606
Authors: Osman Batur İnce, André F. T. Martins, Oisin Mac Aodha, Edoardo M. Ponti
Affiliations: University of Edinburgh; Instituto de Telecomunicações; Instituto Superior Técnico, Universidade de Lisboa; Unbabel
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Pre-print
Abstract:Multimodal foundation models can process several modalities. However, since the space of possible modalities is large and evolving over time, training a model from scratch to encompass all modalities is unfeasible. Moreover, integrating a modality into a pre-existing foundation model currently requires a significant amount of paired data, which is often not available for low-resource modalities. In this paper, we introduce a method for sample-efficient modality integration (SEMI) into Large Language Models (LLMs). To this end, we devise a hypernetwork that can adapt a shared projector, placed between modality-specific encoders and an LLM, to any modality. The hypernetwork, trained on high-resource modalities (i.e., text, speech, audio, video), is conditioned on a few samples from any arbitrary modality at inference time to generate a suitable adapter. To increase the diversity of training modalities, we artificially multiply the number of encoders through isometric transformations. We find that SEMI achieves a significant boost in sample efficiency during few-shot integration of new modalities (i.e., satellite images, astronomical images, inertial measurements, and molecules) with encoders of arbitrary embedding dimensionality. For instance, to reach the same accuracy as 32-shot SEMI, training the projector from scratch needs 64× more data. As a result, SEMI holds promise to extend the modality coverage of foundation models.
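A minimal PyTorch sketch of the central mechanism, assuming arbitrary dimensions: a hypernetwork pools a few samples of the new modality and emits the weights of a projector into the LLM embedding space. This is a schematic of the idea only, not SEMI's actual architecture.

```python
# Hypernetwork that generates projector weights from a few modality samples.
import torch
import torch.nn as nn

mod_dim, llm_dim, hidden = 64, 512, 128  # new-modality dim, LLM dim (assumed sizes)

class HyperNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(mod_dim, hidden), nn.ReLU())
        # emits a (llm_dim x mod_dim) weight matrix plus a bias for the projector
        self.head = nn.Linear(hidden, llm_dim * mod_dim + llm_dim)

    def forward(self, few_shot_feats: torch.Tensor):
        ctx = self.encoder(few_shot_feats).mean(dim=0)   # pool the few-shot samples
        params = self.head(ctx)
        W = params[: llm_dim * mod_dim].view(llm_dim, mod_dim)
        b = params[llm_dim * mod_dim :]
        return W, b

hyper = HyperNet()
few_shot = torch.randn(32, mod_dim)            # 32 samples of the new modality
W, b = hyper(few_shot)
projected = torch.randn(5, mod_dim) @ W.T + b  # tokens now live in the LLM space
print(projected.shape)                          # torch.Size([5, 512])
```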
zh
[NLP-53] Spoken in Jest Detected in Earnest: A Systematic Review of Sarcasm Recognition – Multimodal Fusion Challenges and Future Prospects
【Quick Read】: This paper addresses the long-neglected problem of recognizing sarcastic intent in speech, whose core challenge is exploiting speech data to improve machine understanding of sarcasm in complex human language use. The key lies in a systematic review tracing the evolution from unimodal to multimodal approaches: it covers speech datasets for sarcasm recognition, the shift of acoustic feature extraction from traditional handcrafted features to deep learning representations, and the progression of classification from unimodal analysis to multimodal fusion strategies, pushing sarcasm recognition from a text-dominated paradigm toward a speech-centered multimodal one.
链接: https://arxiv.org/abs/2509.04605
Authors: Xiyuan Gao, Shekhar Nayak, Matt Coler
Affiliations: Campus Fryslân, University of Groningen
Subjects: Computation and Language (cs.CL)
Comments: 20 pages, 7 figures, submitted to IEEE Transactions on Affective Computing
Abstract:Sarcasm, a common feature of human communication, poses challenges in interpersonal interactions and human-machine interactions. Linguistic research has highlighted the importance of prosodic cues, such as variations in pitch, speaking rate, and intonation, in conveying sarcastic intent. Although previous work has focused on text-based sarcasm detection, the role of speech data in recognizing sarcasm has been underexplored. Recent advancements in speech technology emphasize the growing importance of leveraging speech data for automatic sarcasm recognition, which can enhance social interactions for individuals with neurodegenerative conditions and improve machine understanding of complex human language use, leading to more nuanced interactions. This systematic review is the first to focus on speech-based sarcasm recognition, charting the evolution from unimodal to multimodal approaches. It covers datasets, feature extraction, and classification methods, and aims to bridge gaps across diverse research domains. The findings include limitations in datasets for sarcasm recognition in speech, the evolution of feature extraction techniques from traditional acoustic features to deep learning-based representations, and the progression of classification methods from unimodal approaches to multimodal fusion techniques. In so doing, we identify the need for greater emphasis on cross-cultural and multilingual sarcasm recognition, as well as the importance of addressing sarcasm as a multimodal phenomenon, rather than a text-based challenge.
zh
[NLP-54] Manipulating Transformer-Based Models: Controllability Steerability and Robust Interventions
【Quick Read】: This paper addresses the lack of fine-grained control over text generated by transformer-based NLP models: how to steer outputs (sentiment, factual content, and so on) without damaging the model's existing capabilities. The key to the solution is a unified framework covering interventions at three levels, prompt-level steering, activation interventions, and weight-space edits, formalizing controllable text generation as an optimization problem addressed via prompt engineering, parameter-efficient fine-tuning (PEFT), model editing, and reinforcement learning. Theoretical analysis shows minimal weight updates can achieve targeted behavior changes with limited side effects; empirically the approach reaches 90% success on sentiment control and factual edits while preserving base performance, though a generalization-specificity trade-off remains.
链接: https://arxiv.org/abs/2509.04549
Authors: Faruk Alpay, Taylan Alpay
Affiliations: Lightcap; Turkish Aeronautical Association
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages
Abstract:Transformer-based language models excel in NLP tasks, but fine-grained control remains challenging. This paper explores methods for manipulating transformer models through principled interventions at three levels: prompts, activations, and weights. We formalize controllable text generation as an optimization problem addressable via prompt engineering, parameter-efficient fine-tuning, model editing, and reinforcement learning. We introduce a unified framework encompassing prompt-level steering, activation interventions, and weight-space edits. We analyze robustness and safety implications, including adversarial attacks and alignment mitigations. Theoretically, we show minimal weight updates can achieve targeted behavior changes with limited side-effects. Empirically, we demonstrate 90% success in sentiment control and factual edits while preserving base performance, though generalization-specificity trade-offs exist. We discuss ethical dual-use risks and the need for rigorous evaluation. This work lays groundwork for designing controllable and robust language models.
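As a concrete instance of an activation-level intervention, the snippet below adds a fixed steering vector to one layer's output via a PyTorch forward hook. The stub MLP and the random vector are stand-ins for a transformer block and a learned steering direction (e.g., a difference of mean activations); this illustrates the mechanism, not the paper's specific method.

```python
# Activation steering via a forward hook: shift one layer's hidden states.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
steering_vector = torch.randn(16) * 0.5   # e.g., mean(pos activations) - mean(neg)

def steer(module, inputs, output):
    return output + steering_vector        # returning a value replaces the output

handle = model[0].register_forward_hook(steer)
x = torch.randn(2, 16)
steered = model(x)
handle.remove()
print(torch.dist(model(x), steered))       # nonzero: the intervention changed outputs
```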
zh
[NLP-55] Quantized Large Language Models in Biomedical Natural Language Processing: Evaluation and Recommendation
【Quick Read】: This paper addresses the barrier that large language models' (LLMs) size and compute requirements pose to local deployment in biomedical natural language processing (BioNLP), especially where strict data privacy rules out cloud services. The key to the solution is a systematic evaluation of quantization on 12 state-of-the-art LLMs (both general-purpose and biomedical-specific) across 8 benchmark datasets, showing that quantization cuts GPU memory by up to 75% while preserving performance, enabling 70B-parameter models to run locally on 40GB consumer-grade GPUs while retaining domain knowledge and responsiveness to advanced prompting methods, providing a practical path to secure, efficient local deployment.
链接: https://arxiv.org/abs/2509.04534
Authors: Zaifu Zhan, Shuang Zhou, Min Zeng, Kai Yu, Meijia Song, Xiaoyi Chen, Jun Wang, Yu Hou, Rui Zhang
Affiliations: University of Minnesota
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 7 figures
Abstract:Large language models have demonstrated remarkable capabilities in biomedical natural language processing, yet their rapid growth in size and computational requirements present a major barrier to adoption in healthcare settings where data privacy precludes cloud deployment and resources are limited. In this study, we systematically evaluated the impact of quantization on 12 state-of-the-art large language models, including both general-purpose and biomedical-specific models, across eight benchmark datasets covering four key tasks: named entity recognition, relation extraction, multi-label classification, and question answering. We show that quantization substantially reduces GPU memory requirements, by up to 75%, while preserving model performance across diverse tasks, enabling the deployment of 70B-parameter models on 40GB consumer-grade GPUs. In addition, domain-specific knowledge and responsiveness to advanced prompting methods are largely maintained. These findings provide significant practical and guiding value, highlighting quantization as a practical and effective strategy for enabling the secure, local deployment of large yet high-capacity language models in biomedical contexts, bridging the gap between technical advances in AI and real-world clinical translation.
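For reference, the standard Hugging Face Transformers + bitsandbytes recipe for 4-bit loading looks roughly like this. The checkpoint name, prompt, and generation settings are examples only, and the paper's exact quantization setup may differ.

```python
# 4-bit (NF4) quantized loading of a causal LM with bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-70B-Instruct"  # example checkpoint (gated)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",   # in 4-bit, a 70B model fits on a single 40GB GPU
)

inputs = tokenizer("List two symptoms of anemia.", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```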
zh
[NLP-56] Using LLM s to create analytical datasets: A case study of reconstructing the historical memory of Colombia
【Quick Read】: This paper addresses the gaps in violence documentation and historical-memory construction during Colombia's decades-long armed conflict: the lack of systematic government records has left public conflict information scarce and hindered deeper understanding of historical events. The key to the solution is using GPT, a large language model (LLM), to automatically read and answer questions about over 200,000 Spanish-language violence-related newspaper articles, building a large structured text dataset that can support policy research, on which the authors conduct descriptive analysis and an empirical study of the relationship between violence and coca crop eradication, demonstrating a previously infeasible depth of analysis over large text corpora for conflict research.
链接: https://arxiv.org/abs/2509.04523
Authors: David Anderson, Galia Benitez, Margret Bjarnadottir, Shriyan Reyya
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:
Abstract:Colombia has been submerged in decades of armed conflict, yet until recently, the systematic documentation of violence was not a priority for the Colombian government. This has resulted in a lack of publicly available conflict information and, consequently, a lack of historical accounts. This study contributes to Colombia’s historical memory by utilizing GPT, a large language model (LLM), to read and answer questions about over 200,000 violence-related newspaper articles in Spanish. We use the resulting dataset to conduct both descriptive analysis and a study of the relationship between violence and the eradication of coca crops, offering an example of policy analyses that such data can support. Our study demonstrates how LLMs have opened new research opportunities by enabling examinations of large text corpora at a previously infeasible depth.
zh
[NLP-57] Hierarchical Section Matching Prediction (HSMP) BERT for Fine-Grained Extraction of Structured Data from Hebrew Free-Text Radiology Reports in Crohn's Disease
【速读】: 该论文旨在解决从放射学报告中提取结构化临床信息的难题,特别是在低资源语言(如希伯来语)环境下,针对克罗恩病(Crohn’s disease)多器官病变表现稀疏的问题。其核心解决方案是提出一种基于提示学习(prompt-based learning)的分层结构匹配预测BERT模型(Hierarchical Structured Matching Prediction BERT, HSMP-BERT),通过分层推理机制显著提升多标签分类性能与计算效率,同时在小样本标注数据下实现高精度结构化信息抽取(平均F1达0.83±0.08,kappa达0.65±0.17),优于零样本基线和标准微调方法(p < 10⁻⁷)。
链接: https://arxiv.org/abs/2509.04519
作者: Zvi Badash,Hadas Ben-Atya,Naama Gavrielov,Liam Hazan,Gili Focht,Ruth Cytter-Kuint,Talar Hagopian,Dan Turner,Moti Freiman
机构: Technion (以色列理工学院); Tel Aviv University (特拉维夫大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Extracting structured clinical information from radiology reports is challenging, especially in low-resource languages. This is pronounced in Crohn’s disease, with sparsely represented multi-organ findings. We developed Hierarchical Structured Matching Prediction BERT (HSMP-BERT), a prompt-based model for extraction from Hebrew radiology text. In an administrative database study, we analyzed 9,683 reports from Crohn’s patients imaged between 2010 and 2023 across Israeli providers. A subset of 512 reports was radiologist-annotated for findings across six gastrointestinal organs and 15 pathologies, yielding 90 structured labels per subject. A multilabel-stratified split (66% train+validation; 33% test) preserved label prevalence. Performance was evaluated with accuracy, F1, Cohen’s κ, AUC, PPV, NPV, and recall. On 24 organ-finding combinations with ≥15 positives, HSMP-BERT achieved mean F1 0.83 ± 0.08 and κ 0.65 ± 0.17, outperforming the SMP zero-shot baseline (F1 0.49 ± 0.07, κ 0.06 ± 0.07) and standard fine-tuning (F1 0.30 ± 0.27, κ 0.27 ± 0.34; paired t-test p < 10⁻⁷). Hierarchical inference cuts runtime 5.1× compared with traditional inference. Applied to all reports, it revealed associations among ileal wall thickening, stenosis, and pre-stenotic dilatation, plus age- and sex-specific trends in inflammatory findings. HSMP-BERT offers a scalable solution for structured extraction in radiology, enabling population-level analysis of Crohn’s disease and demonstrating AI’s potential in low-resource settings.
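To make the hierarchical-inference idea concrete, here is a toy sketch in which organs are screened first and pathology queries run only for organs that screen positive; the keyword matcher is a stand-in for HSMP-BERT's prompt-based predictions:

```python
# Toy sketch of hierarchical inference: screen each organ, then run pathology
# queries only for organs flagged positive. The sentence-level keyword matcher
# is a hypothetical stand-in for the actual prompt-based BERT model.
ORGANS = ["ileum", "colon", "rectum", "stomach", "duodenum", "jejunum"]
PATHOLOGIES = ["wall thickening", "stenosis", "pre-stenotic dilatation"]

def hierarchical_extract(report: str) -> dict:
    labels = {}
    sentences = report.split(";")
    for organ in ORGANS:
        hit = next((s for s in sentences if organ in s.lower()), None)
        if hit is None:
            # Organ screened negative: skip all its pathology queries,
            # which is where the reported ~5x runtime saving comes from.
            labels.update({(organ, p): False for p in PATHOLOGIES})
            continue
        for p in PATHOLOGIES:
            labels[(organ, p)] = p in hit.lower()
    return labels

report = "Ileum shows wall thickening and stenosis; colon unremarkable."
positives = [k for k, v in hierarchical_extract(report).items() if v]
print(positives)  # [('ileum', 'wall thickening'), ('ileum', 'stenosis')]
```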
zh
[NLP-58] Advancing SLM Tool-Use Capability using Reinforcement Learning
【速读】: 该论文旨在解决小语言模型(Small Language Models, SLMs)在工具使用(tool use)能力上显著弱于大语言模型(Large Language Models, LLMs)的问题,尤其是在面对需要调用外部API、数据库或执行动态交互任务时表现不足。其核心挑战在于SLMs因训练数据规模有限和知识覆盖范围窄,导致上下文理解能力和泛化性能受限。解决方案的关键在于采用强化学习(Reinforcement Learning, RL)方法中的组相对策略优化(Group Relative Policy Optimization, GRPO),通过高效且适应性强的策略优化机制提升SLMs的工具使用准确率,从而显著增强其在实际应用中的可用性与实用性。
链接: https://arxiv.org/abs/2509.04518
作者: Dhruvi Paprunia,Vansh Kharidia,Pankti Doshi
机构: MPSTME, NMIMS (NMIMS大学工程学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) have progressed beyond simple text creation, and tool use has become increasingly important for complex, real-world tasks. Tool use in LLMs refers to their ability to utilize external resources such as APIs, databases, or software functions to extend their functionality beyond text generation. Tools are used for tasks such as performing calculations, making API calls to retrieve the current time and date, and more. This capability enables models to fetch real-time data, execute commands, or solve problems requiring dynamic interaction, making it indispensable for applications like AI agents in virtual assistants, robotic control, or automated workflows. However, while LLMs are usually adept at tool use, their vast resource requirements and computational complexity restrict their use in every use case. As a result, there is an increasing need for more compact and efficient Small Language Models (SLMs). SLMs struggle with tool use compared to LLMs, as shown in Table 1: they are typically trained on smaller, more specific datasets, resulting in a narrower knowledge base and limited contextual understanding. This research addresses these challenges by using Reinforcement Learning (RL), specifically Group Relative Policy Optimization (GRPO), to enhance tool-use proficiency in SLMs. Unlike conventional fine-tuning approaches that require heavy computation and often lack adaptability, our method provides an efficient, effective solution that significantly boosts SLM tool-use accuracy, increasing their practical utility.
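For intuition on the GRPO component, the snippet below computes group-relative advantages over a batch of sampled tool-use completions; the binary rewards are made up for illustration:

```python
# Minimal sketch of GRPO's group-relative advantage: sample G completions per
# prompt, score them (e.g., did the tool call succeed?), and normalize rewards
# within the group. No learned value network is needed.
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """rewards: shape (G,) for one prompt's group of sampled completions."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0])  # illustrative tool-call success flags
adv = grpo_advantages(rewards)
# Each completion's token log-probs are then weighted by its advantage
# inside a clipped, PPO-style policy objective.
print(adv)
```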
zh
[NLP-59] Analysis of Voluntarily Reported Data Post Mesh Implantation for Detecting Public Emotion and Identifying Concern Reports
【速读】: 该论文旨在解决疝气修复手术中植入补片(mesh implant)后患者情绪体验的量化与分析问题,尤其关注患者报告中情感变化与医疗设备监管及技术进步之间的关联。其解决方案的关键在于利用自然语言处理(Natural Language Processing, NLP)技术,结合加拿大国家研究委员会情绪词典(NRC Emotion Lexicon)和TextBlob工具对美国MAUDE数据库中2000至2021年间的患者主观叙述进行情感分类与极性分析,从而识别出具有高情绪强度的“Concern Reports”(关切报告),并揭示患者情绪随时间演变的趋势。这一方法为临床实践提供了基于真实世界数据的情感洞察,有助于优化术前沟通、术后护理策略,并推动以患者为中心的医疗决策改进。
链接: https://arxiv.org/abs/2509.04517
作者: Indu Bala,Lewis Mitchell,Marianne H Gillam
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Mesh implants are widely utilized in hernia repair surgeries, but postoperative complications present a significant concern. This study analyzes patient reports from the Manufacturer and User Facility Device Experience (MAUDE) database spanning 2000 to 2021 to investigate the emotional aspects of patients following mesh implantation using Natural Language Processing (NLP). Employing the National Research Council Canada (NRC) Emotion Lexicon and TextBlob for sentiment analysis, the research categorizes patient narratives into eight emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and assesses sentiment polarity. The goal is to discern patterns in patient sentiment over time and to identify reports signaling urgent concerns, referred to as “Concern Reports,” thereby understanding shifts in patient experiences in relation to changes in medical device regulation and technological advancements in healthcare. The study detected an increase in Concern Reports and higher emotional intensity during the periods of 2011-2012 and 2017-2018. Through temporal analysis of Concern Reports and overall sentiment, this research provides valuable insights for healthcare practitioners, enhancing their understanding of patient experiences post-surgery, which is critical for improving preoperative counselling, postoperative care, and preparing patients for mesh implant surgeries. The study underscores the importance of emotional considerations in medical practices and the potential for sentiment analysis to inform and enhance patient care.
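A small sketch of the scoring step, assuming TextBlob for polarity and a tiny hand-written stand-in for the NRC Emotion Lexicon:

```python
# Sketch of the emotion/sentiment scoring step. TextBlob supplies polarity;
# the mini lexicon below is an illustrative stand-in for the full NRC Emotion
# Lexicon, which maps words to eight emotions.
from textblob import TextBlob

MINI_NRC = {  # illustrative subset, not the real lexicon
    "pain": {"fear", "sadness"},
    "infection": {"fear", "disgust"},
    "relief": {"joy", "trust"},
    "shocked": {"surprise", "fear"},
}

def score_report(text: str) -> dict:
    counts = {}
    for w in text.lower().split():
        for emo in MINI_NRC.get(w.strip(".,!?"), ()):
            counts[emo] = counts.get(emo, 0) + 1
    return {"polarity": TextBlob(text).sentiment.polarity, "emotions": counts}

print(score_report("Severe pain and infection after surgery, no relief."))
```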
zh
[NLP-60] Artificially Fluent: Swahili AI Performance Benchmarks Between English-Trained and Natively-Trained Datasets
【速读】: 该论文旨在解决多语言大语言模型(Large Language Models, LLMs)在不同语言间性能不平等的问题,特别是英语主导的训练数据是否会导致非英语语言模型性能下降。其核心假设是:由于训练数据分布不均,模型对非英语语言的理解能力可能受限。解决方案的关键在于通过对比实验验证语言一致性的重要性——具体而言,研究构建了两个单语BERT模型:一个完全在斯瓦希里语(Swahili)新闻数据上训练和测试,另一个则使用等量英文新闻数据训练,并将斯瓦希里语数据翻译成英文后输入该英文模型进行评估。结果表明,尽管翻译质量高,斯瓦希里语原生训练模型的错误率(0.36%)显著低于翻译后输入英文模型的结果(1.47%),说明单纯依赖翻译无法弥合语言间的表征差异,且模型内部知识表示的不完善会影响跨语言理解效果。因此,该研究强调原生语言训练对于提升模型可靠性与公平性至关重要。
链接: https://arxiv.org/abs/2509.04516
作者: Sophie Jaffer,Simeon Sayer
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 13 Pages, 3 Figures
Abstract:As large language models (LLMs) expand multilingual capabilities, questions remain about the equity of their performance across languages. While many communities stand to benefit from AI systems, the dominance of English in training data risks disadvantaging non-English speakers. To test the hypothesis that such data disparities may affect model performance, this study compares two monolingual BERT models: one trained and tested entirely on Swahili data, and another on comparable English news data. To simulate how multilingual LLMs process non-English queries through internal translation and abstraction, we translated the Swahili news data into English and evaluated it using the English-trained model. This approach tests the hypothesis by evaluating whether translating Swahili inputs for evaluation on an English model yields better or worse performance compared to training and testing a model entirely in Swahili, thus isolating the effect of language consistency versus cross-lingual abstraction. The results show that, despite high-quality translation, the native Swahili-trained model performed better than the Swahili-to-English translated model, producing nearly four times fewer errors: 0.36% vs. 1.47%, respectively. This gap suggests that translation alone does not bridge representational differences between languages and that models trained in one language may struggle to accurately interpret translated inputs due to imperfect internal knowledge representation, indicating that native-language training remains important for reliable outcomes. In educational and informational contexts, even small performance gaps may compound inequality. Future research should focus on addressing broader dataset development for underrepresented languages and renewed attention to multilingual model evaluation, ensuring the reinforcing effect of global AI deployment on existing digital divides is reduced.
zh
[NLP-61] Mitigation of Gender and Ethnicity Bias in AI-Generated Stories through Model Explanations
【速读】: 该论文旨在解决生成式 AI 在职业故事生成中存在的人口统计学偏见问题,尤其是性别和种族层面的代表性偏差。其解决方案的关键在于提出了一种名为“Bias Analysis and Mitigation through Explanation”(BAME)的策略,该策略利用模型自身生成的解释来指导针对性的提示工程(prompt engineering),从而在不修改模型参数的前提下有效降低偏见,提升不同群体之间的代表性公平性。
链接: https://arxiv.org/abs/2509.04515
作者: Martha O. Dimgba,Sharon Oba,Ameeta Agrawal,Philippe J. Giabbanelli
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Language models have been shown to propagate social bias through their output, particularly in the representation of gender and ethnicity. This paper investigates gender and ethnicity biases in AI-generated occupational stories. Representation biases are measured before and after applying our proposed mitigation strategy, Bias Analysis and Mitigation through Explanation (BAME), revealing improvements in demographic representation ranging from 2% to 20%. BAME leverages model-generated explanations to inform targeted prompt engineering, effectively reducing biases without modifying model parameters. By analyzing stories generated across 25 occupational groups, three large language models (Claude 3.5 Sonnet, Llama 3.1 70B Instruct, and GPT-4 Turbo), and multiple demographic dimensions, we identify persistent patterns of overrepresentation and underrepresentation linked to training data stereotypes. Our findings demonstrate that guiding models with their own internal reasoning mechanisms can significantly enhance demographic parity, thereby contributing to the development of more transparent generative AI systems.
zh
[NLP-62] Scaling behavior of large language models in emotional safety classification across sizes and tasks
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理情绪敏感内容时的安全性与可靠性问题,特别是在心理健康场景中如何准确识别和分类潜在风险内容。其解决方案的关键在于构建了一个包含15K样本的新型情感安全数据集,并通过ChatGPT生成的情感重解释提示进行增强,进而系统评估不同规模的LLaMA模型(1B至70B参数)在零样本、少样本及微调设置下的表现。研究发现,尽管更大模型在多标签分类和零样本任务中性能更优,但轻量级微调可使最小的1B模型达到与更大模型及BERT相当的性能,且推理时仅需2GB显存,从而为隐私保护型敏感应用提供可行的本地化部署方案。
链接: https://arxiv.org/abs/2509.04512
作者: Edoardo Pinzuti,Oliver Tüscher,André Ferreira Castro
机构: Leibniz Institute for Resilience Research(莱布尼茨韧性研究中心); University Medical Center Halle(哈雷大学医学中心); German Center for Mental Health(德国心理健康中心); Department of Psychiatry and Psychotherapy, University Medical Center of the Johannes Gutenberg-University Mainz(约翰内斯古腾堡大学美因茨分校大学医学中心精神病学与心理治疗系); School of Life Sciences, Technical University of Munich(慕尼黑工业大学生命科学学院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Understanding how large language models (LLMs) process emotionally sensitive content is critical for building safe and reliable systems, particularly in mental health contexts. We investigate the scaling behavior of LLMs on two key tasks: trinary classification of emotional safety (safe vs. unsafe vs. borderline) and multi-label classification using a six-category safety risk taxonomy. To support this, we construct a novel dataset by merging several human-authored mental health datasets (~15K samples) and augmenting them with emotion re-interpretation prompts generated via ChatGPT. We evaluate four LLaMA models (1B, 3B, 8B, 70B) across zero-shot, few-shot, and fine-tuning settings. Our results show that larger LLMs achieve stronger average performance, particularly in nuanced multi-label classification and in zero-shot settings. However, lightweight fine-tuning allowed the 1B model to achieve performance comparable to larger models and BERT in several high-data categories, while requiring only 2GB of VRAM at inference. These findings suggest that smaller, on-device models can serve as viable, privacy-preserving alternatives for sensitive applications, offering the ability to interpret emotional context and maintain safe conversational boundaries. This work highlights key implications for therapeutic LLM applications and the scalable alignment of safety-critical systems.
zh
[NLP-63] Combine Virtual Reality and Machine-Learning to Identify the Presence of Dyslexia: A Cross-Linguistic Approach
【速读】: 该论文旨在解决如何利用虚拟现实(Virtual Reality, VR)与人工智能(Artificial Intelligence, AI)技术辅助识别意大利语和西班牙语大学生群体中是否存在阅读障碍(dyslexia)的问题。其解决方案的关键在于通过VR环境中的静默阅读(Silent Reading, SR)测试和自尊评估任务收集行为数据,并结合监督式机器学习(Supervised Machine Learning, ML)模型进行分类分析,结果显示基于完成时间的差异可有效区分有无阅读障碍的学生,且在意大利语样本中达到87.5%的准确率,表明VR衍生的行为指标(尤其是任务耗时)是预测阅读障碍的有效特征,但语言特性可能影响模型泛化能力。
链接: https://arxiv.org/abs/2509.04510
作者: Michele Materazzini,Gianluca Morciano,Jose Manuel Alcalde-Llergo,Enrique Yeguas-Bolivar,Giuseppe Calabro,Andrea Zingoni,Juri Taborri
机构: 未知
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 22 pages, 10 figures, 5 tables
Abstract:This study explores the use of virtual reality (VR) and artificial intelligence (AI) to predict the presence of dyslexia in Italian and Spanish university students. In particular, the research investigates whether VR-derived data from Silent Reading (SR) tests and self-esteem assessments can differentiate between students who are affected by dyslexia and students who are not, employing machine learning (ML) algorithms. Participants completed VR-based tasks measuring reading performance and self-esteem. A preliminary statistical analysis (t-tests and Mann-Whitney tests) was performed on these data to compare the obtained scores between individuals with and without dyslexia, revealing significant differences in completion time for the SR test, but not in accuracy, nor in self-esteem. Then, supervised ML models were trained and tested, demonstrating an ability to classify the presence/absence of dyslexia with an accuracy of 87.5 per cent for Italian, 66.6 per cent for Spanish, and 75.0 per cent for the pooled group. These findings suggest that VR and ML can effectively be used as supporting tools for assessing dyslexia, particularly by capturing differences in task completion speed, but language-specific factors may influence classification accuracy.
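The classification step can be sketched as follows with synthetic data, where completion time carries the signal and accuracy does not, mirroring the reported statistics:

```python
# Sketch of the supervised classification step: completion time and accuracy
# from the VR Silent Reading test as features, dyslexia status as the label.
# All numbers are synthetic; the paper reports significance only for time.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 60
time_dys = rng.normal(95, 12, n // 2)   # slower completion times (s), synthetic
time_ctl = rng.normal(70, 10, n // 2)
acc = rng.normal(0.9, 0.05, n)          # accuracy did not differ significantly
X = np.column_stack([np.concatenate([time_dys, time_ctl]), acc])
y = np.array([1] * (n // 2) + [0] * (n // 2))

clf = LogisticRegression(max_iter=1000)
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```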
zh
[NLP-64] ProST: Progressive Sub-task Training for Pareto-Optimal Multi-agent Systems Using Small Language Models
【速读】: 该论文旨在解决小型语言模型(Small Language Models, SLMs)在多智能体系统(Multi-agent Systems)中因长轨迹学习困难而导致的子任务学习不充分问题,从而影响整体任务的有效性。其核心挑战在于SLMs在复杂任务中难以通过常规训练策略掌握所有子任务,进而限制了多智能体系统的性能表现。解决方案的关键在于提出一种渐进式子任务训练策略(progressive sub-task training strategy),该策略在每轮训练中逐步引入新的子任务,模拟实例级课程学习(instance-level curriculum learning),显著提升了多智能体系统在不同配置下的有效性,并通过帕累托分析验证了其在效果与效率之间更优的权衡能力。
链接: https://arxiv.org/abs/2509.04508
作者: Biddut Sarker Bijoy,Mohammad Saqib Hasan,Pegah Alipoormolabashi,Avirup Sil,Aruna Balasubramanian,Niranjan Balasubramanian
机构: Stony Brook University (石溪大学); IBM Research AI (IBM 研究院人工智能)
类目: Computation and Language (cs.CL)
备注:
Abstract:Multi-agent systems with smaller language models (SLMs) present a viable alternative to single-agent systems powered by large language models (LLMs) for addressing complex problems. In this work, we study how these alternatives compare in terms of both effectiveness and efficiency. To study this trade-off, we instantiate single- and multi-agent systems for the complex problems in the AppWorld environment using different-sized language models. We find that difficulties with long-trajectory learning in SLMs limit their performance. Even when trained for specialized roles, SLMs fail to learn all subtasks effectively. To address this issue, we introduce a simple progressive sub-task training strategy, which introduces new sub-tasks progressively in each training epoch. We find that this novel strategy, analogous to instance-level curriculum learning, consistently improves the effectiveness of multi-agents at all configurations. Our Pareto analysis shows that fine-tuned multi-agent systems yield better effectiveness-efficiency trade-offs. Additional ablations and analyses show the importance of our progressive training strategy and its ability to reduce subtask error rates.
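A minimal sketch of the progressive schedule: each epoch trains on a growing prefix of the sub-task list. The sub-task names and the commented-out trainer call are hypothetical:

```python
# Sketch of progressive sub-task training: epoch e trains on sub-tasks 1..k(e)
# rather than all sub-tasks at once. Sub-task names are illustrative.
SUBTASKS = ["parse_request", "select_api", "fill_arguments", "handle_errors"]

def subtasks_for_epoch(epoch: int, total_epochs: int) -> list:
    # Introduce one more sub-task at a time until all are in the mix.
    k = min(len(SUBTASKS), 1 + epoch * len(SUBTASKS) // total_epochs)
    return SUBTASKS[:k]

examples = [{"subtask": s, "text": f"demo for {s}"} for s in SUBTASKS for _ in range(3)]

for epoch in range(8):
    active = set(subtasks_for_epoch(epoch, 8))
    batch = [ex for ex in examples if ex["subtask"] in active]
    print(f"epoch {epoch}: training on {sorted(active)} ({len(batch)} examples)")
    # train_one_epoch(model, batch)  # hypothetical trainer call
```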
zh
[NLP-65] From Silent Signals to Natural Language: A Dual-Stage Transformer-LLM Approach
【速读】: 该论文旨在解决无声语音接口(Silent Speech Interfaces, SSIs)中合成语音在识别与下游处理阶段面临的音素歧义和噪声问题,这些问题显著影响了语音的可理解性。解决方案的关键在于提出一种增强型自动语音识别(Automatic Speech Recognition, ASR)框架,该框架结合基于Transformer的声学模型与大语言模型(Large Language Model, LLM)进行后处理:其中Transformer负责捕捉完整话语上下文信息,而LLM则确保输出文本的语言一致性,从而有效降低词错误率(Word Error Rate, WER),实验表明相较基线模型实现了16%相对和6%绝对的WER下降。
链接: https://arxiv.org/abs/2509.04507
作者: Nithyashree Sivasubramaniam
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Silent Speech Interfaces (SSIs) have gained attention for their ability to generate intelligible speech from non-acoustic signals. While significant progress has been made in advancing speech generation pipelines, limited work has addressed the recognition and downstream processing of synthesized speech, which often suffers from phonetic ambiguity and noise. To overcome these challenges, we propose an enhanced automatic speech recognition framework that combines a transformer-based acoustic model with a large language model (LLM) for post-processing. The transformer captures full utterance context, while the LLM ensures linguistic consistency. Experimental results show a 16% relative and 6% absolute reduction in word error rate (WER) over a 36% baseline, demonstrating substantial improvements in intelligibility for silent speech interfaces.
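The dual-stage design can be sketched as two stubbed stages, a noisy acoustic hypothesis followed by LLM repair; both stages and the example errors are placeholders:

```python
# Sketch of the dual-stage pipeline: a transformer acoustic model produces a
# noisy hypothesis, and an LLM post-processor repairs it. Both stages are
# stubs; the prompt wording in the docstring is an assumption.
def acoustic_model(signal) -> str:
    """Stub for the transformer ASR stage (silent-speech signal -> text)."""
    return "please meat me at the stadion tomorrow"

def llm_postprocess(hypothesis: str) -> str:
    """Stub for the LLM stage; a real system would prompt something like:
    'Correct this noisy silent-speech transcript, preserving meaning: ...'"""
    fixes = {"meat": "meet", "stadion": "station"}
    return " ".join(fixes.get(w, w) for w in hypothesis.split())

print(llm_postprocess(acoustic_model(None)))  # -> please meet me at the station tomorrow
```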
zh
[NLP-66] Behavioral Fingerprinting of Large Language Models
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)评估体系过于依赖性能指标、难以捕捉模型内在认知与交互风格差异的问题。其解决方案的关键在于提出了一种新颖的“行为指纹识别”(Behavioral Fingerprinting)框架,通过一个精心设计的诊断提示套件(Diagnostic Prompt Suite)和一个由强大LLM担任中立裁判的自动化评估流程,对18个不同能力层级的模型进行多维度行为分析。该方法揭示出顶级模型在抽象与因果推理等核心能力上趋于一致,但对齐相关行为(如谄媚倾向和语义鲁棒性)存在显著差异,并发现跨模型默认人格聚类(ISTJ/ESTJ)可能反映了共同的对齐激励机制,从而表明模型的交互特性并非规模或推理能力的自然涌现,而是开发者特定对齐策略的直接结果。
链接: https://arxiv.org/abs/2509.04504
作者: Zehua Pei,Hui-Ling Zhen,Ying Zhang,Zhiyuan Yang,Xing Li,Xianzhi Yu,Mingxuan Yuan,Bei Yu
机构: The Chinese University of Hong Kong (香港中文大学); Noah’s Ark Lab, Huawei (华为诺亚方舟实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Submitted to 1st Open Conference on AI Agents for Science (agents4science 2025)
Abstract:Current benchmarks for Large Language Models (LLMs) primarily focus on performance metrics, often failing to capture the nuanced behavioral characteristics that differentiate them. This paper introduces a novel “Behavioral Fingerprinting” framework designed to move beyond traditional evaluation by creating a multi-faceted profile of a model’s intrinsic cognitive and interactive styles. Using a curated Diagnostic Prompt Suite and an innovative, automated evaluation pipeline where a powerful LLM acts as an impartial judge, we analyze eighteen models across capability tiers. Our results reveal a critical divergence in the LLM landscape: while core capabilities like abstract and causal reasoning are converging among top models, alignment-related behaviors such as sycophancy and semantic robustness vary dramatically. We further document a cross-model default persona clustering (ISTJ/ESTJ) that likely reflects common alignment incentives. Taken together, this suggests that a model’s interactive nature is not an emergent property of its scale or reasoning power, but a direct consequence of specific, and highly variable, developer alignment strategies. Our framework provides a reproducible and scalable methodology for uncovering these deep behavioral differences. Project: this https URL
zh
[NLP-67] VaccineRAG : Boosting Multimodal Large Language Models Immunity to Harmful RAG Samples
【速读】: 该论文旨在解决检索增强生成(Retrieval Augmented Generation, RAG)系统中因检索模块精度不足导致的生成质量下降问题,即大量无关或误导性样本被引入生成阶段,严重制约大型语言模型(Large Language Models, LLMs)的性能表现。解决方案的关键在于提出 VaccineRAG 数据集与 Partial-GRPO 算法:前者通过设计包含不同正负样本比例的基准测试,系统性暴露当前 LLM 在样本判别上的缺陷,并借助 Chain-of-Thought (CoT) 提示引导模型对每个检索样本进行显式推理分析;后者则通过将 LLM 输出建模为多个组件而非单一整体,实现对复杂 CoT 序列更精准的偏好选择,从而提升模型学习长序列复杂推理的能力。
链接: https://arxiv.org/abs/2509.04502
作者: Qixin Sun,Ziqin Wang,Hengyuan Zhao,Yilin Li,Kaiyou Song,Linjiang Huang,Xiaolin Hu,Qingpei Guo,Si Liu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval Augmented Generation enhances the response accuracy of Large Language Models (LLMs) by integrating retrieval and generation modules with external knowledge, demonstrating particular strength in real-time queries and Visual Question Answering tasks. However, the effectiveness of RAG is frequently hindered by the precision of the retriever: many retrieved samples fed into the generation phase are irrelevant or misleading, posing a critical bottleneck to LLMs’ performance. To address this challenge, we introduce VaccineRAG, a novel Chain-of-Thought-based retrieval-augmented generation dataset. On one hand, VaccineRAG employs a benchmark to evaluate models using data with varying positive/negative sample ratios, systematically exposing inherent weaknesses in current LLMs. On the other hand, it enhances models’ sample-discrimination capabilities by prompting LLMs to generate explicit Chain-of-Thought (CoT) analysis for each sample before producing final answers. Furthermore, to enhance the model’s ability to learn long-sequence complex CoT content, we propose Partial-GRPO. By modeling the outputs of LLMs as multiple components rather than a single whole, our model can make more informed preference selections for complex sequences, thereby enhancing its capacity to learn complex CoT. Comprehensive evaluations and ablation studies on VaccineRAG validate the effectiveness of the proposed scheme. The code and dataset will be publicly released soon.
zh
[NLP-68] Understanding Reinforcement Learning for Model Training and future directions with GRAPE
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在指令微调(instruction tuning)过程中算法理解门槛高、解释模糊以及缺乏针对LLMs场景的清晰推导问题。现有文献对SFT、Rejection Sampling、REINFORCE、TRPO、PPO、GRPO及DPO等关键算法的说明常依赖先验知识、省略关键细节或过于泛化,导致实践者难以准确掌握其原理与实现逻辑。论文的解决方案核心在于:以自包含、从零开始的方式,使用简化且明确的符号体系逐步推导每种算法,并聚焦于LLMs的应用场景,减少对强化学习(Reinforcement Learning, RL)通用理论的冗余抽象,从而消除歧义、提升直观性与可操作性,为研究者提供清晰、严谨且面向实际应用的算法理解路径。
链接: https://arxiv.org/abs/2509.04501
作者: Rohit Patel
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 35 pages, 1 figure
Abstract:This paper provides a self-contained, from-scratch, exposition of key algorithms for instruction tuning of models: SFT, Rejection Sampling, REINFORCE, Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO). Explanations of these algorithms often assume prior knowledge, lack critical details, and/or are overly generalized and complex. Here, each method is discussed and developed step by step using simplified and explicit notation focused on LLMs, aiming to eliminate ambiguity and provide a clear and intuitive understanding of the concepts. By minimizing detours into the broader RL literature and connecting concepts to LLMs, we eliminate superfluous abstractions and reduce cognitive overhead. Following this exposition, we provide a literature review of new techniques and approaches beyond those detailed. Finally, new ideas for research and exploration in the form of GRAPE (Generalized Relative Advantage Policy Evolution) are presented.
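As one worked example of the algorithms the paper derives, here is the DPO loss in PyTorch, applied to placeholder sequence-level log-probabilities:

```python
# Worked sketch of the DPO objective: maximize the margin, relative to a
# frozen reference model, between chosen and rejected completions.
# Log-prob tensors below are random placeholders for illustration.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Inputs are sequence-level log-probs under the policy and the reference."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

B = 4
loss = dpo_loss(torch.randn(B), torch.randn(B), torch.randn(B), torch.randn(B))
# In training, gradients flow only through the policy log-probs;
# the reference model's log-probs are detached constants.
print(loss.item())
```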
zh
[NLP-69] Context Engineering for Trustworthiness: Rescorla Wagner Steering Under Mixed and Inappropriate Contexts
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理包含相关与不适当内容混合的外部上下文时,易受低频不适当信息干扰而导致响应质量下降的问题。其核心发现是LLMs倾向于优先采纳在上下文中占比更小的信息,这种行为模式会显著削弱模型在现实场景中的可靠性与安全性。解决方案的关键在于提出RW-Steering方法——一种基于两阶段微调的上下文工程策略,通过引入神经科学中经典的Rescorla-Wagner模型思想,使模型能够内部识别并忽略不适当信号,从而在不同比例的不当内容下保持鲁棒性,有效提升响应质量并逆转不良行为趋势。
链接: https://arxiv.org/abs/2509.04500
作者: Rushi Wang,Jiateng Liu,Cheng Qian,Yifan Shen,Yanzhou Pan,Zhaozhuo Xu,Ahmed Abbasi,Heng Ji,Denghui Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 36 pages, 7 figures
Abstract:Incorporating external context can significantly enhance the response quality of Large Language Models (LLMs). However, real-world contexts often mix relevant information with disproportionate inappropriate content, posing reliability risks. How do LLMs process and prioritize mixed context? To study this, we introduce the Poisoned Context Testbed, pairing queries with real-world contexts containing relevant and inappropriate content. Inspired by associative learning in animals, we adapt the Rescorla-Wagner (RW) model from neuroscience to quantify how competing contextual signals influence LLM outputs. Our adapted model reveals a consistent behavioral pattern: LLMs exhibit a strong tendency to incorporate information that is less prevalent in the context. This susceptibility is harmful in real-world settings, where small amounts of inappropriate content can substantially degrade response quality. Empirical evaluations on our testbed further confirm this vulnerability. To tackle this, we introduce RW-Steering, a two-stage finetuning-based approach that enables the model to internally identify and ignore inappropriate signals. Unlike prior methods that rely on extensive supervision across diverse context mixtures, RW-Steering generalizes robustly across varying proportions of inappropriate content. Experiments show that our best fine-tuned model improves response quality by 39.8% and reverses the undesirable behavior curve, establishing RW-Steering as a robust, generalizable context engineering solution for improving LLM safety in real-world use.
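For readers unfamiliar with the Rescorla-Wagner model the paper adapts, a minimal simulation is shown below; mapping signal rarity to cue salience is an interpretive assumption used only to echo the paper's finding:

```python
# Minimal sketch of the classic Rescorla-Wagner update: each contextual cue's
# associative strength V moves toward the outcome lambda in proportion to a
# shared prediction error. Parameter values are made up for illustration.
def rw_update(V, salience, lr, lam):
    """One trial; V maps each cue present to its associative strength."""
    error = lam - sum(V.values())      # prediction error shared by all cues
    for cue in V:
        V[cue] += salience[cue] * lr * error
    return V

V = {"relevant": 0.0, "inappropriate": 0.0}
salience = {"relevant": 0.3, "inappropriate": 0.9}  # rarer signal given higher salience (assumption)
for _ in range(20):
    rw_update(V, salience, lr=0.5, lam=1.0)
print(V)  # the less prevalent cue captures most of the associative strength
```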
zh
[NLP-70] DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence
【速读】: 该论文旨在解决生成式AI(Generative AI)在信息检索与深度研究场景中普遍存在的可信度不足问题,包括回答过度自信、来源支持薄弱以及引用实践混乱等现象。其解决方案的核心是提出DeepTRACE框架——一个社会技术基础的审计体系,将社区识别的典型失败案例转化为八个可量化的维度,覆盖答案文本、来源和引用三方面;通过语句级分析(分解与置信度评分)构建引用与事实支持矩阵,从而实现对系统从证据推理到归属 attribution 的端到端审计。该框架结合自动化提取管道与经验证的人类标注一致性高的LLM评判器,在多个主流模型(如GPT-4.5/5、Perplexity、Copilot/Bing、Gemini)上验证了其有效性,并揭示出当前系统尽管在深度研究配置下能降低过度假设并提升引用完整性,但依然存在显著的一边倒倾向及大量未被源支持的陈述。
链接: https://arxiv.org/abs/2509.04499
作者: Pranav Narayanan Venkit,Philippe Laban,Yilun Zhou,Kung-Hsiang Huang,Yixin Mao,Chien-Sheng Wu
机构: Salesforce AI Research (Salesforce人工智能研究中心); Microsoft Research (微软研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: arXiv admin note: text overlap with arXiv:2410.22349
Abstract:Generative search engines and deep research LLM agents promise trustworthy, source-grounded synthesis, yet users regularly encounter overconfidence, weak sourcing, and confusing citation practices. We introduce DeepTRACE, a novel sociotechnically grounded audit framework that turns prior community-identified failure cases into eight measurable dimensions spanning answer text, sources, and citations. DeepTRACE uses statement-level analysis (decomposition, confidence scoring) and builds citation and factual-support matrices to audit how systems reason with and attribute evidence end-to-end. Using automated extraction pipelines for popular public models (e.g., GPT-4.5/5, this http URL, Perplexity, Copilot/Bing, Gemini) and an LLM-judge with validated agreement to human raters, we evaluate both web-search engines and deep-research configurations. Our findings show that generative search engines and deep research agents frequently produce one-sided, highly confident responses on debate queries and include large fractions of statements unsupported by their own listed sources. Deep-research configurations reduce overconfidence and can attain high citation thoroughness, but they remain highly one-sided on debate queries and still exhibit large fractions of unsupported statements, with citation accuracy ranging from 40–80% across systems.
zh
[NLP-71] Where Should I Study? Biased Language Models Decide! Evaluating Fairness in LMs for Academic Recommendations
【速读】: 该论文旨在解决生成式 AI(Generative AI)在高等教育推荐场景中可能加剧社会偏见的问题,特别是地理、人口统计学和经济层面的系统性偏差。其解决方案的关键在于提出一个新颖的多维评估框架,该框架不仅关注推荐准确性,还量化了不同性别、国籍和经济背景用户的代表性差异以及地域分布的公平性,从而为识别和缓解教育领域大语言模型(Large Language Models, LLMs)中的偏见提供可操作的测量标准与改进方向。
链接: https://arxiv.org/abs/2509.04498
作者: Krithi Shailya,Akhilesh Kumar Mishra,Gokul S Krishnan,Balaraman Ravindran
机构: Centre for Responsible AI (CeRAI), Wadhwani School of Data Science and AI (WSAI); Indian Institute of Technology Madras
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) are increasingly used as daily recommendation systems for tasks like education planning, yet their recommendations risk perpetuating societal biases. This paper empirically examines geographic, demographic, and economic biases in university and program suggestions from three open-source LLMs: LLaMA-3.1-8B, Gemma-7B, and Mistral-7B. Using 360 simulated user profiles varying by gender, nationality, and economic status, we analyze over 25,000 recommendations. Results show strong biases: institutions in the Global North are disproportionately favored, recommendations often reinforce gender stereotypes, and institutional repetition is prevalent. While LLaMA-3.1 achieves the highest diversity, recommending 481 unique universities across 58 countries, systemic disparities persist. To quantify these issues, we propose a novel, multi-dimensional evaluation framework that goes beyond accuracy by measuring demographic and geographic representation. Our findings highlight the urgent need for bias consideration in educational LMs to ensure equitable global access to higher education.
zh
[NLP-72] A Narrative-Driven Computational Framework for Clinician Burnout Surveillance
【速读】: 该论文旨在解决临床医生倦怠(clinician burnout)对患者安全构成的严重威胁,特别是在高急性度重症监护病房(ICU)环境中。现有研究多依赖回顾性调查工具或电子健康记录(EHR)的宽泛元数据,忽视了临床笔记中蕴含的宝贵叙事信息。解决方案的关键在于构建一个混合分析管道:首先利用针对临床叙事微调的BioBERT情感嵌入提取文本特征,结合专为临床医生倦怠监测设计的词汇压力词典,并引入五主题潜在狄利克雷分配(LDA)模型与工作负荷代理变量,最终通过提供者级别的逻辑回归分类器实现对倦怠风险的精准识别,其F1分数达0.84,显著优于仅使用元数据的基线模型。
链接: https://arxiv.org/abs/2509.04497
作者: Syed Ahmad Chan Bukhari,Fazel Keshtkar,Alyssa Meczkowska
机构: St. John’s University (圣约翰大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 6 pages, 6 Figure
Abstract:Clinician burnout poses a substantial threat to patient safety, particularly in high-acuity intensive care units (ICUs). Existing research predominantly relies on retrospective survey tools or broad electronic health record (EHR) metadata, often overlooking the valuable narrative information embedded in clinical notes. In this study, we analyze 10,000 ICU discharge summaries from MIMIC-IV, a publicly available database derived from the electronic health records of Beth Israel Deaconess Medical Center. The dataset encompasses diverse patient data, including vital signs, medical orders, diagnoses, procedures, treatments, and deidentified free-text clinical notes. We introduce a hybrid pipeline that combines BioBERT sentiment embeddings fine-tuned for clinical narratives, a lexical stress lexicon tailored for clinician burnout surveillance, and five-topic latent Dirichlet allocation (LDA) with workload proxies. A provider-level logistic regression classifier achieves a precision of 0.80, a recall of 0.89, and an F1 score of 0.84 on a stratified hold-out set, surpassing metadata-only baselines by at least 0.17 F1. Specialty-specific analysis indicates elevated burnout risk among providers in Radiology, Psychiatry, and Neurology. Our findings demonstrate that ICU clinical narratives contain actionable signals for proactive well-being monitoring.
zh
[NLP-73] Learned Hallucination Detection in Black-Box LLM s using Token-level Entropy Production Rate
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在问答(Question Answering, QA)任务中产生的幻觉(Hallucination)问题,该问题严重削弱了LLM在实际应用中的可靠性。解决方案的关键在于提出一种适用于数据受限场景(如仅能访问黑盒API返回的少量top-k log-probabilities)的单次生成(one-shot)幻觉检测方法:通过直接利用非贪婪解码过程中产生的log-probabilities构造熵产生率(Entropy Production Rate, EPR)指标,并进一步引入监督学习对EPR进行增强,其特征来自单个生成序列中前k个候选token的熵贡献,无需多次查询重跑。实验证明,该方法在多个QA数据集和LLM上显著优于仅使用EPR的基线,且仅依赖典型有限的log-probability信息(如每token top-10),具备良好的实用性与部署效率,特别适用于金融领域等对可靠性和实时性要求高的场景。
链接: https://arxiv.org/abs/2509.04492
作者: Charles Moslonka,Hicham Randrianarivo,Arthur Garnier,Emmanuel Malherbe
机构: Artefact Research Center (Artefact 研究中心); MICS, CentraleSupélec, Université Paris-Saclay (MICS,中央理工-巴黎高等电力学院,巴黎萨克雷大学); Ardian (Ardian)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages, 7 figures, 1 table. pre-print version
Abstract:Hallucinations in Large Language Model (LLM) outputs for Question Answering (QA) tasks critically undermine their real-world reliability. This paper introduces an applied methodology for robust, one-shot hallucination detection, specifically designed for scenarios with limited data access, such as interacting with black-box LLM APIs that typically expose only a few top candidate log-probabilities per token. Our approach derives uncertainty indicators directly from these readily available log-probabilities generated during non-greedy decoding. We first derive an Entropy Production Rate (EPR) metric that offers baseline performance, later augmented with supervised learning. Our learned model uses features representing the entropic contributions of the accessible top-ranked tokens within a single generated sequence, requiring no multiple query re-runs. Evaluated across diverse QA datasets and multiple LLMs, this estimator significantly improves hallucination detection over using EPR alone. Crucially, high performance is demonstrated using only the typically small set of available log-probabilities (e.g., top 10 per token), confirming its practical efficiency and suitability for these API-constrained deployments. This work provides a readily deployable technique to enhance the trustworthiness of LLM responses from a single generation pass in QA and Retrieval-Augmented Generation (RAG) systems, with its utility further demonstrated in a finance framework analyzing responses to queries on annual reports from an industrial dataset.
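A sketch in the spirit of the EPR signal, computed from the few top-k log-probs a black-box API exposes per token; the paper's exact EPR definition is not reproduced here:

```python
# Entropy-rate proxy from per-token top-k log-probs: renormalize the visible
# candidates, compute each step's entropy, and average over the sequence.
import math

def token_entropy(top_logprobs):
    probs = [math.exp(lp) for lp in top_logprobs]
    z = sum(probs)                          # renormalize over the visible top-k
    probs = [p / z for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_production_rate(per_token_top_logprobs):
    ents = [token_entropy(t) for t in per_token_top_logprobs]
    return sum(ents) / max(len(ents), 1)    # mean entropy per generated token

# e.g., top log-probs per token as returned by many chat-completion APIs
seq = [[-0.1, -3.2, -4.0], [-1.1, -1.2, -1.4], [-0.05, -4.5, -5.0]]
print(entropy_production_rate(seq))         # higher values -> flag for hallucination review
```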
zh
[NLP-74] Refining Transcripts With TV Subtitles by Prompt-Based Weakly Supervised Training of ASR
【速读】: 该论文旨在解决电视字幕(TV subtitles)在弱监督自动语音识别(Weakly Supervised Automatic Speech Recognition, WS-ASR)中因与音频对齐不精确而难以直接作为标注目标的问题。其关键解决方案在于将字幕重新定义为富含上下文的提示(context-rich prompts),而非直接监督信号;在此基础上,模型生成伪标签(pseudo transcripts)作为主要训练目标,同时利用字幕作为引导线索进行迭代优化,并引入加权注意力机制以增强推理过程中相关字幕词元的权重,从而有效缓解语音与文本之间的差异,提升转录准确性。
链接: https://arxiv.org/abs/2509.04491
作者: Xinnian Zhao,Hugo Van Hamme
机构: KU Leuven University (鲁汶大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: eusipco2025
Abstract:This study proposes a novel approach to using TV subtitles within a weakly supervised (WS) Automatic Speech Recognition (ASR) framework. Although TV subtitles are readily available, their imprecise alignment with corresponding audio limits their applicability as supervised targets for verbatim transcription. Rather than using subtitles as direct supervision signals, our method reimagines them as context-rich prompts. This design enables the model to handle discrepancies between spoken audio and subtitle text: generated pseudo transcripts become the primary targets, with subtitles acting as guiding cues for iterative refinement. To further enhance the process, we introduce a weighted attention mechanism that emphasizes relevant subtitle tokens during inference. Our experiments demonstrate significant improvements in transcription accuracy, highlighting the effectiveness of the proposed method in refining transcripts. These enhanced pseudo-labeled datasets provide high-quality foundational resources for training robust ASR systems.
zh
[NLP-75] Serialized Output Prompting for Large Language Model-based Multi-Talker Speech Recognition
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLM)在多说话人自动语音识别(Multi-talker Automatic Speech Recognition, MT-ASR)系统中因提示(prompt)设计不足而导致性能受限的问题。现有方法要么忽略提示,要么仅使用简单任务定义提示,未探索结构化提示对LLM引导的有效性。其解决方案的关键在于提出一种序列化输出提示(Serialized Output Prompt, SOP)机制:通过在语音编码器后插入分隔器和序列化连接时序分类(CTC)层,以“先说话者优先”的方式分离并提取混合语音中的多说话人内容;随后利用贪婪搜索解码序列化CTC输出获得SOP,并将其作为结构化提示显式引导LLM。该方法结合三阶段训练策略(序列化输出训练、语音信息提取与SOP适配),显著提升了在两说话人和三说话人场景下的系统性能。
链接: https://arxiv.org/abs/2509.04488
作者: Hao Shi,Yusuke Fujita,Tomoya Mizumoto,Lianbo Liu,Atsushi Kojima,Yui Sudo
机构: SB Intuitions( SB Intuitions)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
Abstract:Prompts are crucial for task definition and for improving the performance of large language models (LLM)-based systems. However, existing LLM-based multi-talker (MT) automatic speech recognition (ASR) systems either omit prompts or rely on simple task-definition prompts, with no prior work exploring the design of prompts to enhance performance. In this paper, we propose extracting serialized output prompts (SOP) and explicitly guiding the LLM using structured prompts to improve system performance (SOP-MT-ASR). A Separator and serialized Connectionist Temporal Classification (CTC) layers are inserted after the speech encoder to separate and extract MT content from the mixed speech encoding in a first-speaking-first-out manner. Subsequently, the SOP, which serves as a prompt for LLMs, is obtained by decoding the serialized CTC outputs using greedy search. To train the model effectively, we design a three-stage training strategy, consisting of serialized output training (SOT) fine-tuning, serialized speech information extraction, and SOP-based adaptation. Experimental results on the LibriMix dataset show that, although the LLM-based SOT model performs well in the two-talker scenario, it fails to fully leverage LLMs under more complex conditions, such as the three-talker scenario. The proposed SOP approach significantly improved performance under both two- and three-talker conditions.
zh
[NLP-76] ASCENDgpt : A Phenotype-Aware Transformer Model for Cardiovascular Risk Prediction from Electronic Health Records
【速读】: 该论文旨在解决基于纵向电子健康记录(EHR)进行心血管风险预测时面临的高维稀疏性与语义信息丢失问题。其解决方案的关键在于提出了一种新型的表型感知分词(phenotype-aware tokenization)策略,将47,155个原始ICD编码映射为176个具有临床意义的表型标记(phenotype tokens),在保留语义信息的同时实现99.6%的诊断代码合并率,并将总词汇量从原始ICD编码的约4万级降至10,442个,显著降低模型复杂度。在此基础上,研究采用基于Transformer架构的ASCENDgpt模型,在19,402名个体的序列数据上进行掩码语言建模预训练,随后针对五类心血管终点事件(心肌梗死、卒中、主要不良心血管事件、心血管死亡和全因死亡)进行微调,最终在测试集上平均C-index达0.816,验证了该方法在提升预测性能与临床可解释性方面的有效性。
链接: https://arxiv.org/abs/2509.04485
作者: Chris Sainsbury,Andreas Karwath
机构: NHS Greater Glasgow and Clyde (NHS格拉斯哥大克莱德); University of Birmingham (伯明翰大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:We present ASCENDgpt, a transformer-based model specifically designed for cardiovascular risk prediction from longitudinal electronic health records (EHRs). Our approach introduces a novel phenotype-aware tokenization scheme that maps 47,155 raw ICD codes to 176 clinically meaningful phenotype tokens, achieving 99.6% consolidation of diagnosis codes while preserving semantic information. This phenotype mapping contributes to a total vocabulary of 10,442 tokens - a 77.9% reduction when compared with using raw ICD codes directly. We pretrain ASCENDgpt on sequences derived from 19,402 unique individuals using a masked language modeling objective, then fine-tune for time-to-event prediction of five cardiovascular outcomes: myocardial infarction (MI), stroke, major adverse cardiovascular events (MACE), cardiovascular death, and all-cause mortality. Our model achieves excellent discrimination on the held-out test set with an average C-index of 0.816, demonstrating strong performance across all outcomes (MI: 0.792, stroke: 0.824, MACE: 0.800, cardiovascular death: 0.842, all-cause mortality: 0.824). The phenotype-based approach enables clinically interpretable predictions while maintaining computational efficiency. Our work demonstrates the effectiveness of domain-specific tokenization and pretraining for EHR-based risk prediction tasks.
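The tokenization idea reduces to a many-to-one lookup before sequence modeling; the sketch below uses an illustrative mapping, not the paper's actual 47,155-to-176 table:

```python
# Sketch of phenotype-aware tokenization: many raw ICD codes collapse onto one
# clinically meaningful token before sequences reach the transformer.
# Mapping entries are illustrative only.
PHENOTYPE_MAP = {
    "I21.0": "PHE_MYOCARDIAL_INFARCTION",
    "I21.1": "PHE_MYOCARDIAL_INFARCTION",
    "I21.9": "PHE_MYOCARDIAL_INFARCTION",
    "I63.9": "PHE_ISCHAEMIC_STROKE",
    "E11.9": "PHE_TYPE2_DIABETES",
}

def tokenize_record(icd_codes):
    return [PHENOTYPE_MAP.get(code, "PHE_OTHER") for code in icd_codes]

print(tokenize_record(["I21.0", "I21.9", "E11.9"]))
# ['PHE_MYOCARDIAL_INFARCTION', 'PHE_MYOCARDIAL_INFARCTION', 'PHE_TYPE2_DIABETES']
```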
zh
[NLP-77] he Good the Bad and the Constructive: Automatically Measuring Peer Reviews Utility for Authors EMNLP2025
【速读】: 该论文旨在解决学术同行评审中评论质量下降的问题,尤其是在审稿人时间有限的情况下,如何确保评审意见对作者具有实际帮助。其核心挑战在于量化和提升评审意见的实用性,即明确哪些特征能够使反馈真正对作者有用。解决方案的关键是识别并构建一个名为RevUtil的数据集,该数据集包含1,430条人工标注的评审意见及其四个关键维度:可操作性(Actionability)、依据性(Grounding)、具体性(Specificity)和可验证性(Verifiability),并辅以10,000条合成标注数据及理由(rationales),用于训练和评估模型。通过在该数据集上微调模型,研究发现这些模型在评估评论质量和生成解释方面能达到与人类评审者相当甚至更优的一致性水平,从而为自动化辅助评审系统提供可靠的技术支撑。
链接: https://arxiv.org/abs/2509.04484
作者: Abdelrahman Sadallah,Tim Baumgärtner,Iryna Gurevych,Ted Briscoe
机构: Mohamed Bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); Technical University of Darmstadt (达姆施塔特工业大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: EMNLP 2025 Main
Abstract:Providing constructive feedback to paper authors is a core component of peer review. With reviewers increasingly having less time to perform reviews, automated support systems are required to ensure high reviewing quality, thus making the feedback in reviews useful for authors. To this end, we identify four key aspects of review comments (individual points in weakness sections of reviews) that drive their utility for authors: Actionability, Grounding & Specificity, Verifiability, and Helpfulness. To enable evaluation and development of models assessing review comments, we introduce the RevUtil dataset. We collect 1,430 human-labeled review comments and scale our data with 10k synthetically labeled comments for training purposes. The synthetic data additionally contains rationales, i.e., explanations for the aspect score of a review comment. Employing the RevUtil dataset, we benchmark fine-tuned models for assessing review comments on these aspects and generating rationales. Our experiments demonstrate that these fine-tuned models achieve agreement levels with humans comparable to, and in some cases exceeding, those of powerful closed models like GPT-4o. Our analysis further reveals that machine-generated reviews generally underperform human reviews on our four aspects.
zh
[NLP-78] DecMetrics: Structured Claim Decomposition Scoring for Factually Consistent LLM Outputs
【速读】: 该论文旨在解决当前事实核查中主张分解(claim decomposition)研究过度依赖生成式方法、而忽视对分解后原子主张质量评估的问题。解决方案的关键在于提出一套名为DecMetrics的自动评估指标体系,包含完整性(COMPLETENESS)、正确性(CORRECTNESS)和语义熵(SEMANTIC ENTROPY)三个维度,用以量化评估分解模型输出的原子主张质量,并将这些指标作为奖励函数集成到一个轻量级主张分解模型中,从而优化模型性能并提升事实核查系统的可靠性与有效性。
链接: https://arxiv.org/abs/2509.04483
作者: Minghui Huang
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Claim decomposition plays a crucial role in the fact-checking process by breaking down complex claims into simpler atomic components and identifying their unfactual elements. Despite its importance, current research primarily focuses on generative methods for decomposition, with insufficient emphasis on evaluating the quality of these decomposed atomic claims. To bridge this gap, we introduce DecMetrics, which comprises three new metrics: COMPLETENESS, CORRECTNESS, and SEMANTIC ENTROPY, designed to automatically assess the quality of claims produced by decomposition models. Utilizing these metrics, we develop a lightweight claim decomposition model, optimizing its performance through the integration of these metrics as a reward function. Through automatic evaluation, our approach aims to set a benchmark for claim decomposition, enhancing both the reliability and effectiveness of fact-checking systems.
zh
[NLP-79] Energy Landscapes Enable Reliable Abstention in Retrieval-Augmented Large Language Models for Healthcare
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统在安全关键领域(如女性健康)中因错误回答可能导致严重后果的问题,核心挑战在于如何实现可靠的“拒绝回答”(abstention)机制。解决方案的关键在于提出一种基于能量的模型(Energy-Based Model, EBM),该模型通过学习一个覆盖260万条指南衍生问题的密集语义语料库上的平滑能量景观,从而为系统提供更可靠的置信度信号以决定何时生成答案或选择 abstention。实验表明,EBM在语义困难样本上显著优于校准后的softmax和k近邻密度启发式方法,其优势主要源于能量评分头的设计,而非特定负样本类型的选择,从而为安全、可扩展且可解释的RAG系统奠定了基础。
链接: https://arxiv.org/abs/2509.04482
作者: Ravi Shankar,Sheng Wong,Lin Li,Magdalena Bachmann,Alex Silverthorne,Beth Albert,Gabriel Davis Jones
机构: University of Oxford (牛津大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Reliable abstention is critical for retrieval-augmented generation (RAG) systems, particularly in safety-critical domains such as women’s health, where incorrect answers can lead to harm. We present an energy-based model (EBM) that learns a smooth energy landscape over a dense semantic corpus of 2.6M guideline-derived questions, enabling the system to decide when to generate or abstain. We benchmark the EBM against a calibrated softmax baseline and a k-nearest neighbour (kNN) density heuristic across both easy and hard abstention splits, where hard cases are semantically challenging near-distribution queries. The EBM achieves superior abstention performance on semantically hard cases, reaching an AUROC of 0.961 versus 0.950 for softmax, while also reducing FPR@95 (0.235 vs 0.331). On easy negatives, performance is comparable across methods, but the EBM’s advantage becomes most pronounced in safety-critical hard distributions. A comprehensive ablation with controlled negative sampling and fair data exposure shows that robustness stems primarily from the energy scoring head, while the inclusion or exclusion of specific negative types (hard, easy, mixed) sharpens decision boundaries but is not essential for generalisation to hard cases. These results demonstrate that energy-based abstention scoring offers a more reliable confidence signal than probability-based softmax confidence, providing a scalable and interpretable foundation for safe RAG systems.
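At deployment, the decision rule reduces to thresholding an energy score; in this sketch, the learned EBM head is stubbed with a distance-to-corpus heuristic over random stand-in embeddings:

```python
# Sketch of the deployment-time decision rule: score a query's energy and
# abstain above a threshold tuned on validation data. The energy head is a
# stub (distance-to-corpus heuristic); embeddings are random stand-ins.
import numpy as np

def energy(query_emb, corpus_embs) -> float:
    """Stub: low energy near the guideline-question manifold, high far away."""
    return float(np.linalg.norm(corpus_embs - query_emb, axis=1).min())

def answer_or_abstain(query_emb, corpus_embs, tau: float) -> str:
    if energy(query_emb, corpus_embs) > tau:
        return "ABSTAIN: refer to a clinician"   # safety-first fallback
    return "GENERATE: run the RAG pipeline"

rng = np.random.default_rng(0)
corpus = rng.normal(0, 1, (1000, 8))             # stand-in for 2.6M question embeddings
print(answer_or_abstain(rng.normal(0, 1, 8), corpus, tau=2.0))
print(answer_or_abstain(rng.normal(5, 1, 8), corpus, tau=2.0))  # far query -> abstain
```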
zh
[NLP-80] Narrative-to-Scene Generation: An LLM -Driven Pipeline for 2D Game Environments
【速读】: 该论文旨在解决生成式 AI (Generative AI) 在程序化内容生成(Procedural Content Generation, PCG)中将叙事文本与可玩的视觉环境相连接的挑战。其解决方案的关键在于构建一个轻量级流水线,首先利用大语言模型(Large Language Models, LLMs)生成叙事文本,并从中识别三个关键时间帧;随后提取“对象-关系-对象”三元组形式的空间谓词,结合GameTileNet数据集中基于 affordance(可用性)感知的语义嵌入检索视觉资产;再通过细胞自动机(Cellular Automata)生成分层地形,并依据谓词结构中的空间规则放置物体,从而实现从文本到2D瓦片化游戏场景的映射。该方法在十种不同故事中的评估验证了其在瓦片-物体匹配、可用性层对齐和空间约束满足方面的有效性,为未来多帧连续性、符号追踪及多智能体协作的以故事为中心的PCG奠定了基础。
链接: https://arxiv.org/abs/2509.04481
作者: Yi-Chun Chen,Arnav Jhala
机构: Yale University (耶鲁大学); North Carolina State University (北卡罗来纳州立大学)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
备注:
Abstract:Recent advances in large language models (LLMs) enable compelling story generation, but connecting narrative text to playable visual environments remains an open challenge in procedural content generation (PCG). We present a lightweight pipeline that transforms short narrative prompts into a sequence of 2D tile-based game scenes, reflecting the temporal structure of stories. Given an LLM-generated narrative, our system identifies three key time frames, extracts spatial predicates in the form of “Object-Relation-Object” triples, and retrieves visual assets using affordance-aware semantic embeddings from the GameTileNet dataset. A layered terrain is generated using Cellular Automata, and objects are placed using spatial rules grounded in the predicate structure. We evaluated our system on ten diverse stories, analyzing tile-object matching, affordance-layer alignment, and spatial constraint satisfaction across frames. This prototype offers a scalable approach to narrative-driven scene generation and lays the foundation for future work on multi-frame continuity, symbolic tracking, and multi-agent coordination in story-centered PCG.
zh
[NLP-81] Discrete Prompt Tuning via Recursive Utilization of Black-box Multimodal Large Language Model for Personalized Visual Emotion Recognition
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在个性化视觉情绪识别(Personalized Visual Emotion Recognition, VER)任务中表现受限的问题。由于MLLMs在训练时依赖于包含广泛观点的大规模多样化数据集,导致其倾向于捕捉多数群体的共性模式,从而削弱了对个体差异的敏感性,限制了其在实际应用中的准确性和适用性。解决方案的关键在于引入受人类提示工程启发的离散提示调优(discrete prompt tuning)方法:通过生成多个自然语言提示并选择最优表示来更新个性化提示,实现针对每个个体的精准情绪识别,从而有效提升模型在个性化场景下的性能。
链接: https://arxiv.org/abs/2509.04480
作者: Ryo Takahashi,Naoki Saito,Keisuke Maeda,Takahiro Ogawa,Miki Haseyama
机构: Hokkaido University (北海道大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 11 pages, 4 figures
Abstract:Visual Emotion Recognition (VER) is an important research topic due to its wide range of applications, including opinion mining and advertisement design. Extending this capability to recognize emotions at the individual level further broadens its potential applications. Recently, Multimodal Large Language Models (MLLMs) have attracted increasing attention and demonstrated performance comparable to that of conventional VER methods. However, MLLMs are trained on large and diverse datasets containing general opinions, which causes them to favor majority viewpoints and familiar patterns. This tendency limits their performance in a personalized VER, which is crucial for practical and real-world applications, and indicates a key area for improvement. To address this limitation, the proposed method employs discrete prompt tuning inspired by the process of humans’ prompt engineering to adapt the VER task to each individual. Our method selects the best natural language representation from the generated prompts and uses it to update the prompt for the realization of accurate personalized VER.
zh
[NLP-82] No Clustering No Routing: How Transformers Actually Process Rare Tokens
【速读】: 该论文旨在解决大语言模型在罕见词元(rare token)预测能力不足的问题,尤其是揭示驱动其功能特化的内在机制。此前研究已发现针对罕见词元的“平台型”神经元(plateau neurons)遵循独特的三阶段影响模式,但其功能组织方式尚不明确。论文通过神经元影响分析、基于图的聚类以及注意力头消融实验,在GPT-2 XL和Pythia模型中系统探究了这一问题,关键发现在于:罕见词元处理需要额外的平台神经元,形成与常见词元不同的双计算范式;这些神经元空间分布而非模块化聚集;注意力机制并未优先路由至这些特化神经元。因此,解决方案的核心是揭示了罕见词元专化由训练驱动的分布式分化所促成,而非架构上的模块化设计,从而在保持上下文敏感灵活性的同时实现适应性资源分配。
链接: https://arxiv.org/abs/2509.04479
作者: Jing Liu
机构: ENS, Université PSL, EHESS, CNRS
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models struggle with rare token prediction, yet the mechanisms driving their specialization remain unclear. Prior work identified specialized “plateau” neurons for rare tokens following distinctive three-regime influence patterns (Liu, 2025), but their functional organization is unknown. We investigate this through neuron influence analyses, graph-based clustering, and attention head ablations in GPT-2 XL and Pythia models. Our findings show that: (1) rare token processing requires additional plateau neurons beyond the power-law regime sufficient for common tokens, forming dual computational regimes; (2) plateau neurons are spatially distributed rather than forming modular clusters; and (3) attention mechanisms exhibit no preferential routing to specialists. These results demonstrate that rare token specialization arises through distributed, training-driven differentiation rather than architectural modularity, preserving context-sensitive flexibility while achieving adaptive capacity allocation.
zh
[NLP-83] An End-to-End System for Culturally-Attuned Driving Feedback using a Dual-Component NLG Engine
【速读】: 该论文旨在解决在基础设施薄弱的低资源环境中,如何为驾驶员提供文化适配且安全有效的驾驶反馈问题。其核心挑战在于应对网络连接不稳定、传感器数据噪声大等现实限制,同时确保反馈内容既符合当地法律规范,又能通过行为科学理论驱动产生说服力。解决方案的关键在于设计了一个端到端移动系统,其中包含一个双组件自然语言生成(Natural Language Generation, NLG)引擎,能够生成基于法律的安全建议和基于行为理论的个性化反馈报告;此外,系统还集成了一种专门用于检测酒精影响驾驶的机器学习模型,并采用两步反思机制优化NLG输出质量,从而实现高鲁棒性的本地化安全干预。
链接: https://arxiv.org/abs/2509.04478
作者: Iniakpokeikiye Peter Thompson,Yi Dewei,Reiter Ehud
机构: University of Aberdeen (阿伯丁大学)
类目: Computation and Language (cs.CL)
备注: The paper has 5 figures and 1 table
Abstract:This paper presents an end-to-end mobile system that delivers culturally-attuned safe driving feedback to drivers in Nigeria, a low-resource environment with significant infrastructural challenges. The core of the system is a novel dual-component Natural Language Generation (NLG) engine that provides both legally-grounded safety tips and persuasive, theory-driven behavioural reports. We describe the complete system architecture, including an automatic trip detection service, on-device behaviour analysis, and a sophisticated NLG pipeline that leverages a two-step reflection process to ensure high-quality feedback. The system also integrates a specialized machine learning model for detecting alcohol-influenced driving, a key local safety issue. The architecture is engineered for robustness against intermittent connectivity and noisy sensor data. A pilot deployment with 90 drivers demonstrates the viability of our approach, and initial results on detected unsafe behaviours are presented. This work provides a framework for applying data-to-text and AI systems to achieve social good.
zh
[NLP-84] raining Text-to-Molecule Models with Context-Aware Tokenization EMNLP2025
【速读】: 该论文旨在解决现有文本到分子(text-to-molecule)模型依赖原子级标记化(atom-level tokenization)所导致的全局结构上下文建模能力不足的问题,从而限制了对分子语义的准确捕捉。其解决方案的关键在于提出一种基于子结构级标记化(substructure-level tokenization)的新方法,并在此基础上设计了一种基于重要性的训练策略,优先关注关键子结构以增强模型对分子语义的理解能力。实验表明,该方法在仅使用2%训练标记的情况下仍优于当前最先进方法,且通过简单有效的集成策略进一步提升了生成性能。
链接: https://arxiv.org/abs/2509.04476
作者: Seojin Kim,Hyeontae Song,Jaehyun Nam,Jinwoo Shin
机构: Seoul National University (首尔国立大学); Moloco Inc.; Korea Advanced Institute of Science and Technology (KAIST)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: EMNLP 2025 Findings
Abstract:Recently, text-to-molecule models have shown great potential across various chemical applications, e.g., drug-discovery. These models adapt language models to molecular data by representing molecules as sequences of atoms. However, they rely on atom-level tokenizations, which primarily focus on modeling local connectivity, thereby limiting the ability of models to capture the global structural context within molecules. To tackle this issue, we propose a novel text-to-molecule model, coined Context-Aware Molecular T5 (CAMT5). Inspired by the significance of the substructure-level contexts in understanding molecule structures, e.g., ring systems, we introduce substructure-level tokenization for text-to-molecule models. Building on our tokenization scheme, we develop an importance-based training strategy that prioritizes key substructures, enabling CAMT5 to better capture the molecular semantics. Extensive experiments verify the superiority of CAMT5 in various text-to-molecule generation tasks. Intriguingly, we find that CAMT5 outperforms the state-of-the-art methods using only 2% of training tokens. In addition, we propose a simple yet effective ensemble strategy that aggregates the outputs of text-to-molecule models to further boost the generation performance. Code is available at this https URL.
zh
[NLP-85] ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute
【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在推理能力提升过程中因依赖测试时计算扩展(test-time compute scaling)所导致的“隧道视野”(Tunnel Vision)问题,即模型因初始步骤不完善而陷入次优推理路径,使得进一步增加计算量带来的性能提升边际递减。解决方案的关键在于提出一种全新的扩展范式——原生思维并行(native thought parallelism),并通过端到端框架ParaThinker实现:训练模型同时生成多个多样化推理路径,并将其融合以得到更优答案。该方法通过并行探索不同思维路径有效规避了隧道视野问题,显著释放了模型潜在的推理能力,且在保持极低延迟开销(仅7.1%)的前提下,使较小模型在多个复杂推理基准上超越更大规模模型。
链接: https://arxiv.org/abs/2509.04475
作者: Hao Wen,Yifan Su,Feifei Zhang,Yunxin Liu,Yunhao Liu,Ya-Qin Zhang,Yuanchun Li
机构: Institute for AI Industry Research (AIR), Tsinghua University; Global Innovation Exchange & Department of Automation, Tsinghua University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in Large Language Models (LLMs) have been driven by test-time compute scaling - a strategy that improves reasoning by generating longer, sequential thought processes. While effective, this approach encounters a significant bottleneck as computation increases, where further computation offers only marginal performance gains. We argue this ceiling is not an inherent limit of the model’s capability but a flaw in the scaling strategy itself, a phenomenon we term “Tunnel Vision”, where a model’s imperfect initial steps lock it into a suboptimal reasoning path. To overcome this, we introduce a new scaling paradigm: native thought parallelism. We present ParaThinker, an end-to-end framework that trains an LLM to generate multiple, diverse reasoning paths in parallel and synthesize them into a superior final answer. By exploring different lines of thoughts simultaneously, ParaThinker effectively sidesteps the Tunnel Vision issue and unlocks the model’s latent reasoning potential. Our approach demonstrates that scaling compute in parallel (width) is a more effective and efficient way to superior reasoning than simply scaling sequentially (depth). On challenging reasoning benchmarks, ParaThinker achieves substantial accuracy improvements over sequential LLMs (12.3% for 1.5B and 7.5% for 7B models on average with 8 parallel paths), while adding only negligible latency overhead (7.1%). This enables smaller models to surpass much larger counterparts and establishes parallel thinking as a critical, efficient dimension for scaling future LLMs.
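以下用一个极简 Python 草图说明“并行生成多条多样化推理路径、再聚合为最终答案”的基本思想:其中 `llm_generate` 与 `extract_answer` 为占位函数(假设项),并用多数投票近似论文中端到端学习到的答案融合机制,并非 ParaThinker 的官方实现。

```python
import random
from collections import Counter

def llm_generate(prompt: str, temperature: float, seed: int) -> str:
    """占位函数:真实场景中应调用 LLM,返回一条含最终答案的推理文本。"""
    random.seed(seed)
    return f"... reasoning path {seed} ... Answer: {random.choice(['A', 'B'])}"

def extract_answer(trace: str) -> str:
    return trace.rsplit("Answer:", 1)[-1].strip()

def parallel_think(prompt: str, n_paths: int = 8) -> str:
    # 并行(此处顺序模拟)生成 n 条多样化推理路径,避免单条路径的"隧道视野"
    traces = [llm_generate(prompt, temperature=0.8, seed=i) for i in range(n_paths)]
    answers = [extract_answer(t) for t in traces]
    # 多数投票仅为示意;论文中由模型自身学习如何融合多条路径
    return Counter(answers).most_common(1)[0][0]

print(parallel_think("Solve: ..."))
```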
zh
[NLP-86] Scaling Up, Speeding Up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling
【速读】: 该论文旨在解决生成式 AI(Generative AI)在测试时扩展(test-time scaling)过程中因冗余和重复推理路径导致的计算效率低下问题。其解决方案的关键在于引入首个针对推测解码(speculative decoding)方法的综合性基准,系统评估三类主流推测解码策略——基于模型、基于训练和基于n-gram的方法在不同测试时扩展范式(如Best-of-N采样和多轮思考)中的表现。实验表明,简单的n-gram方法能有效捕捉重复模式,在加速测试时扩展方面展现出独特潜力,提示将n-gram方法与模型或训练驱动方法结合,可在处理重复与多样化推理路径之间实现更优平衡,从而提升大语言模型(LLM)推理效率。
链接: https://arxiv.org/abs/2509.04474
作者: Shengyin Sun,Yiming Li,Xing Li,Yingzhao Lian,Weizhe Lin,Hui-Ling Zhen,Zhiyuan Yang,Chen Chen,Xianzhi Yu,Mingxuan Yuan,Chen Ma
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages
Abstract:Test-time scaling has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs) by allocating additional computational resources during inference. However, this paradigm is inherently inefficient due to the generation of redundant and repetitive reasoning traces, leading to significant computational overhead. Speculative decoding offers a promising avenue for mitigating this inefficiency, yet its efficacy in the structured, repetition-rich context of test-time scaling remains largely unexplored. To bridge this gap, we introduce the first comprehensive benchmark designed to evaluate speculative decoding methods for accelerating LLM test-time scaling. Our benchmark provides consistent experimental protocols across representative test-time scaling paradigms (e.g., Best-of-N sampling and multi-round thinking), enabling a fair comparison of three major categories of speculative decoding: model-based, training-based, and n-gram-based methods. Extensive experiments reveal that simple n-gram-based methods effectively capture repetitive patterns, demonstrating unique potential in accelerating test-time scaling. This phenomenon demonstrates the value of integrating n-gram-based methods with model-based or training-based approaches to balance acceleration for both repetitive and diverse reasoning in test-time scaling. We hope this benchmark spurs further research on speculative decoding for test-time scaling, enabling faster and more practical reasoning in LLMs through better handling of repetitive and diverse reasoning paths.
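下面用纯 Python 勾勒 n-gram 草稿-验证(draft-and-verify)的基本流程,展示为何重复性强的推理轨迹容易被 n-gram 方法加速:`target_model_next` 为占位的目标模型,n、k 等取值均为示意性假设,并非该基准中任何方法的官方实现。

```python
def ngram_draft(context, n=3, k=4):
    """基于上下文中已出现的 n-gram 重复模式,草拟最多 k 个候选 token。"""
    drafts = []
    for _ in range(k):
        suffix = tuple(context[-(n - 1):])
        nxt = None
        # 在历史序列中查找相同的 (n-1)-gram 后缀,复用其后继 token
        for i in range(len(context) - (n - 1)):
            if tuple(context[i:i + n - 1]) == suffix:
                nxt = context[i + n - 1]
        if nxt is None:
            break
        drafts.append(nxt)
        context = context + [nxt]
    return drafts

def target_model_next(context):
    """占位:真实场景应调用目标 LLM;此处用简单规则模拟重复性输出。"""
    return context[-3] if len(context) >= 3 else 0

def speculative_step(context, n=3, k=4):
    drafts = ngram_draft(list(context), n, k)
    accepted = []
    for tok in drafts:  # 逐一用目标模型验证草稿 token,一致则接受
        if target_model_next(context + accepted) == tok:
            accepted.append(tok)
        else:
            break
    # 草稿全部被拒时,退回目标模型自身的输出,保证至少前进一个 token
    if not accepted:
        accepted = [target_model_next(context)]
    return context + accepted

seq = [1, 2, 3, 1, 2, 3, 1, 2]
print(speculative_step(seq))  # 重复模式下草稿 token 大多会被接受
```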
zh
[NLP-87] SpeechLLM: Unified Speech and Language Model for Enhanced Multi-Task Understanding in Low Resource Settings
【速读】: 该论文旨在解决将语音编码器(speech encoder)与大语言模型(Large Language Model, LLM)集成时面临的资源消耗高、标注数据稀缺等问题。其核心解决方案是一种参数高效的适配器(adapter),能够将语音嵌入(speech embeddings)转换为LLM兼容的token表示,从而实现端到端的自动语音识别(ASR)、命名实体识别(NER)和情感分析(SA)。该适配器仅使用7倍更少的可训练参数,在多个任务上显著提升性能:在LibriSpeech ASR任务中相对词错误率(WER)降低26%,NER任务F1分数提升6.3%,情感分析(SA)F1分数提升32%;进一步结合分类器正则化和低秩适应(LoRA)优化技术,使Spoken Language Understanding Evaluation(SLUE)得分分别提升6.6%和9.5%,有效降低了对大规模标注数据的依赖并提升了模型效率与效果。
链接: https://arxiv.org/abs/2509.04473
作者: Jaekwon Yoo,Kunal Chandiramani,Divya Tadimeti,Abenezer Girma,Chandra Dhir
机构: JPMorgan Chase (摩根大通); Columbia University (哥伦比亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:While integrating speech encoder with LLM requires substantial data and resources, use cases face limitations due to insufficient availability. To address this, we propose a solution with a parameter-efficient adapter that converts speech embeddings into LLM-compatible tokens, focusing on end-to-end automatic speech recognition (ASR), named entity recognition (NER), and sentiment analysis (SA). To reduce labeling costs, we employ an LLM-based synthetic dataset annotation technique. The proposed adapter, using 7x fewer trainable parameters, achieves significant performance gains: a 26% relative Word Error Rate (WER) improvement on the LibriSpeech ASR task, a 6.3% relative F1 score increase on the NER task, and a 32% relative F1 score boost on the SA task. Moreover, using advanced techniques such as adding a classifier regularizer and optimizing the LLM with Low-Rank Adaptation (LoRA) yields notable performance gains, with Spoken Language Understanding Evaluation (SLUE) score improvements of 6.6% and 9.5%, respectively.
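以下是“把语音嵌入转换为 LLM 兼容 token”这类适配器的 PyTorch 草图:帧堆叠倍数与各维度均为假设取值,仅用于说明接口形态,并非论文中参数高效适配器的官方结构。

```python
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    """把语音编码器输出的帧级嵌入压缩并投影到 LLM 的 token 嵌入空间(示意)。"""
    def __init__(self, speech_dim=512, llm_dim=4096, stack=4):
        super().__init__()
        self.stack = stack  # 每 stack 帧拼接为一个"伪 token",降低序列长度
        self.proj = nn.Sequential(
            nn.Linear(speech_dim * stack, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats):            # feats: (B, T, speech_dim)
        B, T, D = feats.shape
        T = T - T % self.stack           # 截断到 stack 的整数倍
        x = feats[:, :T].reshape(B, T // self.stack, D * self.stack)
        return self.proj(x)              # (B, T/stack, llm_dim),可与文本嵌入拼接后送入 LLM

adapter = SpeechAdapter()
tokens = adapter(torch.randn(2, 100, 512))
print(tokens.shape)  # torch.Size([2, 25, 4096])
```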
zh
[NLP-88] RECAP: REwriting Conversations for Intent Understanding in Agentic Planning
【速读】: 该论文旨在解决开放域对话系统中用户意图识别(intent detection)的挑战,尤其是在多智能体协同的大语言模型(LLM)框架下,传统基于分类的方法难以应对真实对话中存在的模糊性、意图漂移和动态变化等问题,导致下游规划效果不佳。解决方案的关键在于提出RECAP(REwriting Conversations for Agent Planning)基准,通过将原始对话重写为简洁的目标表示(即意图重写),从而提升意图表达的清晰度与规划实用性;同时引入基于LLM的评估器量化重写后的意图对规划任务的价值,并验证了提示工程(prompt-based)和DPO微调两种重写策略的有效性,证明意图重写是改善多智能体对话系统规划能力的关键且可行环节。
链接: https://arxiv.org/abs/2509.04472
作者: Kushan Mitra,Dan Zhang,Hannah Kim,Estevam Hruschka
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Understanding user intent is essential for effective planning in conversational assistants, particularly those powered by large language models (LLMs) coordinating multiple agents. However, real-world dialogues are often ambiguous, underspecified, or dynamic, making intent detection a persistent challenge. Traditional classification-based approaches struggle to generalize in open-ended settings, leading to brittle interpretations and poor downstream planning. We propose RECAP (REwriting Conversations for Agent Planning), a new benchmark designed to evaluate and advance intent rewriting, reframing user-agent dialogues into concise representations of user goals. RECAP captures diverse challenges such as ambiguity, intent drift, vagueness, and mixed-goal conversations. Alongside the dataset, we introduce an LLM-based evaluator that assesses planning utility given the rewritten intent. Using RECAP, we develop a prompt-based rewriting approach that outperforms baselines. We further demonstrate that fine-tuning two DPO-based rewriters yields additional utility gains. Our results highlight intent rewriting as a critical and tractable component for improving agent planning in open-domain dialogue systems.
zh
[NLP-89] MOSAIC: A Multilingual Taxonomy-Agnostic and Computationally Efficient Approach for Radiological Report Classification
【速读】: 该论文旨在解决医学影像报告分类中依赖昂贵人工标注、模型泛化能力弱以及对封闭或高资源需求的大语言模型(Large Language Models, LLMs)的依赖问题。现有方法在多语言支持、跨模态适应性和计算效率方面存在显著局限,尤其难以部署于临床环境。其解决方案的关键在于提出MOSAIC——一个基于轻量级开源语言模型(MedGemma-4B)的多语言、无特定分类体系依赖且计算高效的放射学报告分类框架,支持零样本/少样本提示与轻量微调,在消费级GPU上即可运行,实现了在多种语言(英语、西班牙语、法语、丹麦语)和多种成像模态下的高性能分类,验证了其在真实临床场景中的实用性与可扩展性。
链接: https://arxiv.org/abs/2509.04471
作者: Alice Schiavone(1 and 2),Marco Fraccaro(3),Lea Marie Pehrson(1, 4 and 5),Silvia Ingala(4 and 6),Rasmus Bonnevie(3),Michael Bachmann Nielsen(5),Vincent Beliveau(7),Melanie Ganz(1 and 2),Desmond Elliott(1) ((1) Department of Computer Science, University of Copenhagen, Denmark, (2) Neurobiology Research Unit, Copenhagen University Hospital, Denmark, (3) Unumed Aps, Denmark, (4) Department of Diagnostic Radiology, Copenhagen University Hospital, Denmark, (5) Department of Clinical Medicine, University of Copenhagen, Denmark, (6) Cerebriu A/S, Denmark, (7) Institute for Human Genetics, Medical University of Innsbruck, Austria)
机构: University of Copenhagen (哥本哈根大学); Copenhagen University Hospital (哥本哈根大学医院); Unumed Aps; Department of Diagnostic Radiology, Copenhagen University Hospital (哥本哈根大学医院诊断放射科); Cerebriu A/S; Institute for Human Genetics, Medical University of Innsbruck (因斯布鲁克医科大学人类遗传学研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages, 14 pages including references and appendix. 9 figures. Preprint
Abstract:Radiology reports contain rich clinical information that can be used to train imaging models without relying on costly manual annotation. However, existing approaches face critical limitations: rule-based methods struggle with linguistic variability, supervised models require large annotated datasets, and recent LLM-based systems depend on closed-source or resource-intensive models that are unsuitable for clinical use. Moreover, current solutions are largely restricted to English and single-modality, single-taxonomy datasets. We introduce MOSAIC, a multilingual, taxonomy-agnostic, and computationally efficient approach for radiological report classification. Built on a compact open-access language model (MedGemma-4B), MOSAIC supports both zero-/few-shot prompting and lightweight fine-tuning, enabling deployment on consumer-grade GPUs. We evaluate MOSAIC across seven datasets in English, Spanish, French, and Danish, spanning multiple imaging modalities and label taxonomies. The model achieves a mean macro F1 score of 88 across five chest X-ray datasets, approaching or exceeding expert-level performance, while requiring only 24 GB of GPU memory. With data augmentation, as few as 80 annotated samples are sufficient to reach a weighted F1 score of 82 on Danish reports, compared to 86 with the full 1600-sample training set. MOSAIC offers a practical alternative to large or proprietary LLMs in clinical settings. Code and models are open-source. We invite the community to evaluate and extend MOSAIC on new languages, taxonomies, and modalities.
zh
[NLP-90] COCORELI: Cooperative Compositional Reconstitution Execution of Language Instructions
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在执行复杂指令、减少幻觉(hallucination)以及进行空间推理(spatial reasoning)时存在的局限性。其解决方案的关键在于提出一种混合代理框架——COCORELI,该框架通过集成中等规模的LLM代理与新颖的抽象机制(abstraction mechanisms)及话语模块(discourse module),实现对指令的解析,并支持在上下文中动态学习环境的高阶表征(high-level representations)。实验表明,COCORELI在自然协同构建任务中优于使用更大规模LLM的单一思维链(Chain-of-Thought, CoT)和代理式LLM系统,在避免幻觉、识别信息缺失、请求澄清及更新所学对象方面表现优异,且其抽象能力可扩展至工具调用等任务场景。
链接: https://arxiv.org/abs/2509.04470
作者: Swarnadeep Bhar,Omar Naim,Eleni Metheniti,Bastien Navarri,Loïc Cabannes,Morteza Ezzabady,Nicholas Asher
机构: IRIT; ANTI; ENS Paris-Saclay, France; CNRS
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages
Abstract:We present COCORELI, a hybrid agent framework designed to tackle the limitations of large language models (LLMs) in tasks requiring: following complex instructions, minimizing hallucination, and spatial reasoning. COCORELI integrates medium-sized LLM agents with novel abstraction mechanisms and a discourse module to parse instructions to in-context learn dynamic, high-level representations of the environment. Experiments on natural collaborative construction tasks show that COCORELI outperforms single-LLM CoT and agentic LLM systems, all using larger LLMs. It manages to largely avoid hallucinations, identify missing information, ask for clarifications, and update its learned objects. COCORELI’s abstraction abilities extend beyond ENVIRONMENT, as shown in the ToolBench API completion task.
zh
[NLP-91] Multi-Modal Vision vs. Text-Based Parsing: Benchmarking LLM Strategies for Invoice Processing
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在自动化发票文档处理系统中的性能差异与选择难题,特别是如何在不同模型架构和文档特征下优化信息提取效果。其解决方案的关键在于通过零样本提示(zero-shot prompting)对来自三个家族的八种多模态大语言模型(GPT-5、Gemini 2.5 和开源 Gemma 3)进行基准测试,并对比两种处理策略:直接利用模型的多模态能力进行图像原生处理,以及先将文档结构化转换为 Markdown 再处理的分步解析方法。结果表明,原生图像处理策略通常优于结构化解析方法,且性能表现受模型类型和文档特性共同影响,为实际部署中模型与处理策略的选择提供了实证依据。
链接: https://arxiv.org/abs/2509.04469
作者: David Berghaus,Armin Berger,Lars Hillebrand,Kostadin Cvejoski,Rafet Sifa
机构: Fraunhofer IAIS (弗劳恩霍夫信息与通信技术研究所); Lamarr Institute (拉马尔研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper benchmarks eight multi-modal large language models from three families (GPT-5, Gemini 2.5, and open-source Gemma 3) on three diverse openly available invoice document datasets using zero-shot prompting. We compare two processing strategies: direct image processing using multi-modal capabilities and a structured parsing approach converting documents to markdown first. Results show native image processing generally outperforms structured approaches, with performance varying across model types and document characteristics. This benchmark provides insights for selecting appropriate models and processing strategies for automated document systems. Our code is available online.
zh
[NLP-92] Evaluating Large Language Models for Financial Reasoning : A CFA-Based Benchmark Study
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在金融专业场景下缺乏系统性评估的问题,特别是针对全球最严格的金融认证考试——特许金融分析师(Chartered Financial Analyst, CFA)三级考试的多选题进行实证分析。其解决方案的关键在于构建一个新颖的检索增强生成(Retrieval-Augmented Generation, RAG)管道,通过分层知识组织与结构化查询生成实现精准的领域知识召回,从而显著提升模型在复杂金融推理任务中的准确性;同时对比不同设计导向的LLMs(如推理优化型、多模态强大型及轻量高效型),发现推理导向模型在零样本设置下表现最优,而RAG进一步强化了复杂情境下的性能,为金融领域LLM部署提供了可操作的选型依据和成本-性能权衡策略。
链接: https://arxiv.org/abs/2509.04468
作者: Xuan Yao,Qianteng Wang,Xinbo Liu,Ke-Wei Huang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid advancement of large language models presents significant opportunities for financial applications, yet systematic evaluation in specialized financial contexts remains limited. This study presents the first comprehensive evaluation of state-of-the-art LLMs using 1,560 multiple-choice questions from official mock exams across Levels I-III of CFA, most rigorous professional certifications globally that mirror real-world financial analysis complexity. We compare models distinguished by core design priorities: multi-modal and computationally powerful, reasoning-specialized and highly accurate, and lightweight efficiency-optimized. We assess models under zero-shot prompting and through a novel Retrieval-Augmented Generation pipeline that integrates official CFA curriculum content. The RAG system achieves precise domain-specific knowledge retrieval through hierarchical knowledge organization and structured query generation, significantly enhancing reasoning accuracy in professional financial certification evaluation. Results reveal that reasoning-oriented models consistently outperform others in zero-shot settings, while the RAG pipeline provides substantial improvements particularly for complex scenarios. Comprehensive error analysis identifies knowledge gaps as the primary failure mode, with minimal impact from text readability. These findings provide actionable insights for LLM deployment in finance, offering practitioners evidence-based guidance for model selection and cost-performance optimization.
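下面以 sentence-transformers 为例,勾勒“层级化知识组织 + 结构化查询”检索流程的一种可能形态:语料的切分方式、句向量模型名与按级别过滤的字段均为示意性假设,并非论文 RAG 管道的官方实现。

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# 假设:课程材料已按"考试级别 -> 主题 -> 段落"层级切分;此处用占位文本
corpus = [
    {"level": "II", "topic": "Fixed Income", "text": "Spread duration measures ..."},
    {"level": "III", "topic": "Portfolio Mgmt", "text": "The core-satellite approach ..."},
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # 任一句向量模型均可
doc_emb = encoder.encode([c["text"] for c in corpus], normalize_embeddings=True)

def retrieve(question: str, level: str, k: int = 2):
    # 结构化查询:先按考试级别过滤候选,再在子集内做向量检索
    idx = [i for i, c in enumerate(corpus) if c["level"] == level]
    q = encoder.encode([question], normalize_embeddings=True)[0]
    scores = doc_emb[idx] @ q
    top = np.argsort(-scores)[:k]
    return [corpus[idx[i]]["text"] for i in top]

context = "\n".join(retrieve("How does spread duration affect ...?", level="II"))
prompt = f"Use the CFA curriculum excerpts below to answer.\n{context}\nQuestion: ..."
```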
zh
[NLP-93] Enhancing LLM Efficiency: Targeted Pruning for Prefill-Decode Disaggregation in Inference
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在预填充-解码(Prefill-Decoding, PD)分离推理场景下因高计算和内存开销导致的部署瓶颈问题。现有剪枝方法通常忽略PD分离架构的特点,难以实现高效且精准的块(block)与键值缓存(KV Cache)裁剪。其解决方案的关键在于:首先构建独立的剪枝与蒸馏集合,对预填充和解码阶段分别进行迭代式块移除,从而获得更优的剪枝策略;其次提出一种基于token感知的缓存剪枝机制,在预填充阶段保留全部KV Cache,而在解码阶段仅选择性复用所选层中首尾token序列的缓存条目,显著降低通信开销且引入极小额外负担。实验表明,该方法在PD分离和统一设置下均能实现显著加速(默认设置下推理速度提升20.56%)和带宽节省(数据传输带宽减少4.95倍)。
链接: https://arxiv.org/abs/2509.04467
作者: Hao Zhang,Mengsi Lyu,Yulong Ao,Yonghua Lin
机构: Beijing Academy of Artificial Intelligence (北京人工智能研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 21 pages
Abstract:Large Language Models (LLMs) demonstrate exceptional capabilities across various tasks, but their deployment is constrained by high computational and memory costs. Model pruning provides an effective means to alleviate these demands. However, existing methods often ignore the characteristics of prefill-decode (PD) disaggregation in practice. In this paper, we propose a novel pruning method for PD disaggregation inference, enabling more precise and efficient block and KV Cache pruning. Our approach constructs pruning and distillation sets to perform iterative block removal independently for the prefill and decode stages, obtaining better pruning solutions. Moreover, we introduce a token-aware cache pruning mechanism that retains all KV Cache in the prefill stage but selectively reuses entries for the first and last token sequences in selected layers during decode, reducing communication costs with minimal overhead. Extensive experiments demonstrate that our approach consistently achieves strong performance in both PD disaggregation and PD unified settings without disaggregation. Under the default settings, our method achieves a 20.56% inference speedup and a 4.95 times reduction in data transmission bandwidth consumption.
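下面的 PyTorch 片段演示“解码阶段仅保留序列首尾 token 的 KV 缓存条目”这一核心张量操作;keep_first/keep_last 的取值与作用层的选择均为假设,实际方法还需配合预填充阶段的完整缓存与跨阶段复用逻辑。

```python
import torch

def prune_kv_cache(k, v, keep_first=4, keep_last=64):
    """token 感知的 KV Cache 裁剪示意。

    k, v: (B, n_heads, T, head_dim)。论文只对选定层在解码阶段启用此裁剪,
    预填充阶段保留全部缓存;此处的超参数为假设取值。
    """
    T = k.shape[2]
    if T <= keep_first + keep_last:
        return k, v
    idx = torch.cat([
        torch.arange(0, keep_first),        # 序列开头的 token
        torch.arange(T - keep_last, T),     # 序列末尾的 token
    ])
    return k[:, :, idx], v[:, :, idx]

k = torch.randn(1, 8, 512, 64)
v = torch.randn(1, 8, 512, 64)
k2, v2 = prune_kv_cache(k, v)
print(k2.shape)  # torch.Size([1, 8, 68, 64]):缓存从 512 条缩减到 68 条
```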
zh
[NLP-94] Just-in-time and distributed task representations in language models
【速读】: 该论文试图解决的问题是:语言模型在上下文学习(in-context learning)中,新任务的表示(representation)何时形成以及如何随上下文演变。解决方案的关键在于识别出两类不同的任务表示——一类是“可迁移的任务表示”(transferrable task representations),这类表示能恢复其他实例中的任务上下文而无需完整提示;另一类是高阶任务类别上更稳定的惰性表示。研究发现,可迁移表示以非单调且间歇的方式演化,并具有显著的时间局部性和语义局部性:它们仅在特定标记处激活,且通常捕捉最小任务范围(如语义独立的子任务),而复杂任务则依赖于时间分布更广的表示来支撑。这种双重局部性揭示了语言模型实现即时计算(just-in-time computation)的能力,使其能够动态适应新证据并在线学习新任务。
链接: https://arxiv.org/abs/2509.04466
作者: Yuxuan Li,Declan Campbell,Stephanie C. Y. Chan,Andrew Kyle Lampinen
机构: Google DeepMind(谷歌深度思维)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Many of language models' impressive capabilities originate from their in-context learning: based on instructions or examples, they can infer and perform new tasks without weight updates. In this work, we investigate when representations for new tasks are formed in language models, and how these representations change over the course of context. We focus on "transferrable" task representations – vector representations that can restore task context in another instance of the model, even without the full prompt. We show that these representations evolve in non-monotonic and sporadic ways, and are distinct from a more inert representation of high-level task categories that persists throughout the context. Specifically, models often condense multiple pieces of evidence into these transferrable task representations, which align well with the performance improvement based on more examples in the context. However, this accrual process exhibits strong locality along the sequence dimension, coming online only at certain tokens – despite task identity being reliably decodable throughout the context. Moreover, these local but transferrable task representations tend to capture minimal "task scopes", such as a semantically-independent subtask, and models rely on more temporally-distributed representations to support longer and composite tasks. This two-fold locality (temporal and semantic) underscores a kind of just-in-time computational process underlying language models' ability to adapt to new evidence and learn new tasks on the fly.
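下面用前向钩子(forward hook)在一个玩具网络上演示“捕获-注入”可迁移任务表示的基本做法:模型结构、层位置与输入均为示意,真实实验应在语言模型的特定层、特定 token 位置上操作,并观察注入后能否恢复任务行为。

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 4),
)

captured = {}

def capture_hook(module, inp, out):
    captured["h"] = out.detach().clone()   # 记录该层激活,视作"任务表示"

def patch_hook(module, inp, out):
    return captured["h"]                   # 非 None 返回值会替换该层输出

layer = model[2]                           # 假设任务表示位于这一中间层
x_with_task = torch.randn(1, 16)           # 含完整任务上下文的输入(示意)
x_without = torch.randn(1, 16)             # 不含任务上下文的输入(示意)

h = layer.register_forward_hook(capture_hook)
model(x_with_task)                         # 第一次前向:捕获任务表示
h.remove()

h = layer.register_forward_hook(patch_hook)
out_patched = model(x_without)             # 第二次前向:注入该表示
h.remove()
print(out_patched)
```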
zh
[NLP-95] Emotionally-Aware Agents for Dispute Resolution
【速读】: 该论文试图解决的问题是:在买卖双方纠纷对话中,情感表达如何影响主观和客观的争议解决结果,以及如何更准确地识别和量化这些情感表达的影响。其解决方案的关键在于利用大规模语言模型(Large-Language Models, LLMs)进行情绪强度标注,相较于传统方法显著提升了对情感表达的解释力,并且与人工标注者决策更加一致,从而为理解冲突升级与化解机制提供更可靠的数据支持,并为基于代理的系统在纠纷管理中识别和缓解情绪激化提供了可行路径。
链接: https://arxiv.org/abs/2509.04465
作者: Sushrita Rakshit,James Hale,Kushal Chawla,Jeanne M. Brett,Jonathan Gratch
机构: University of Michigan (密歇根大学); University of Southern California (南加州大学); Capital One (资本一号); Northwestern University (西北大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:In conflict, people use emotional expressions to shape their counterparts’ thoughts, feelings, and actions. This paper explores whether automatic text emotion recognition offers insight into this influence in the context of dispute resolution. Prior work has shown the promise of such methods in negotiations; however, disputes evoke stronger emotions and different social processes. We use a large corpus of buyer-seller dispute dialogues to investigate how emotional expressions shape subjective and objective outcomes. We further demonstrate that large-language models yield considerably greater explanatory power than previous methods for emotion intensity annotation and better match the decisions of human annotators. Findings support existing theoretical models for how emotional expressions contribute to conflict escalation and resolution and suggest that agent-based systems could be useful in managing disputes by recognizing and potentially mitigating emotional escalation.
zh
[NLP-96] Can Multiple Responses from an LLM Reveal the Sources of Its Uncertainty?
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际应用中因输出不可靠或误导性而引发的可信度问题,尤其关注如何诊断模型不确定性的根本来源。其解决方案的关键在于:通过收集目标LLM对同一输入生成的多个响应,并利用一个辅助LLM分析这些响应之间的不一致模式,从而识别不确定性是由输入歧义、知识缺失或两者共同导致;在存在知识缺口的情况下,该方法还能进一步定位具体缺失的事实或概念,为后续人工干预提供精准依据。
链接: https://arxiv.org/abs/2509.04464
作者: Yang Nan,Pengfei He,Ravi Tandon,Han Xu
机构: University of Arizona (亚利桑那大学); Michigan State University (密歇根州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Proceedings of The 2025 Conference on Empirical Methods in Natural Language Processing (Findings)
Abstract:Large language models (LLMs) have delivered significant breakthroughs across diverse domains but can still produce unreliable or misleading outputs, posing critical challenges for real-world applications. While many recent studies focus on quantifying model uncertainty, relatively little work has been devoted to diagnosing the source of uncertainty. In this study, we show that, when an LLM is uncertain, the patterns of disagreement among its multiple generated responses contain rich clues about the underlying cause of uncertainty. To illustrate this point, we collect multiple responses from a target LLM and employ an auxiliary LLM to analyze their patterns of disagreement. The auxiliary model is tasked to reason about the likely source of uncertainty, such as whether it stems from ambiguity in the input question, a lack of relevant knowledge, or both. In cases involving knowledge gaps, the auxiliary model also identifies the specific missing facts or concepts contributing to the uncertainty. In our experiment, we validate our framework on AmbigQA, OpenBookQA, and MMLU-Pro, confirming its generality in diagnosing distinct uncertainty sources. Such diagnosis shows the potential for relevant manual interventions that improve LLM performance and reliability.
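以下草图展示该框架的两步流程:先从目标 LLM 采样多个回答(此处用占位数据模拟),再构造交给辅助 LLM 的诊断 prompt;prompt 的具体措辞为示意性假设,并非论文原文。

```python
def sample_responses(question, n=5):
    """占位:应以较高 temperature 多次调用目标 LLM,收集 n 个回答。"""
    return ["Paris", "Paris", "Lyon", "Paris", "I'm not sure"]

def build_diagnosis_prompt(question, responses):
    listed = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(responses))
    return (
        "A model produced the following answers to the same question.\n"
        f"Question: {question}\n{listed}\n"
        "Based on the disagreement pattern, is the uncertainty caused by "
        "(a) ambiguity in the question, (b) missing knowledge, or (c) both? "
        "If (b), name the specific missing fact or concept."
    )

q = "What is the capital of France?"
print(build_diagnosis_prompt(q, sample_responses(q)))
# 将该 prompt 交给辅助 LLM,即可得到对不确定性来源的诊断
```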
zh
[NLP-97] Benchmarking GPT -5 for biomedical natural language processing
【速读】: 该论文旨在解决生物医学文献快速增长背景下,对可扩展自然语言处理(Natural Language Processing, NLP)解决方案的迫切需求,特别是评估大语言模型在多种生物医学任务中的通用性能表现。其关键解决方案是构建并更新了一个标准化的生物医学NLP(BioNLP)基准测试体系,涵盖命名实体识别、关系抽取、多标签文档分类、问答、文本摘要和文本简化六大任务类别,并在12个数据集上系统比较GPT-5、GPT-4和GPT-4o在零样本、单样本和五样本提示下的性能表现。结果表明,GPT-5在多数任务中取得最优整体性能,尤其在医学问答(如MedQA达94.1%准确率)和化学相关抽取任务(如化学命名实体识别F1=0.886)上显著超越前代模型,验证了其作为通用模型在推理导向型生物医学问答场景中的部署可行性;但文本摘要与疾病命名实体识别等高精度要求任务仍需微调或混合方法支持,从而为未来生物医学NLP系统设计提供了基于提示策略的分层优化路径。
链接: https://arxiv.org/abs/2509.04462
作者: Yu Hou,Zaifu Zhan,Rui Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid expansion of biomedical literature has heightened the need for scalable natural language processing (NLP) solutions. While GPT-4 substantially narrowed the gap with task-specific systems, especially in question answering, its performance across other domains remained uneven. We updated a standardized BioNLP benchmark to evaluate GPT-5 and GPT-4o under zero-, one-, and five-shot prompting across 12 datasets spanning six task families: named entity recognition, relation extraction, multi-label document classification, question answering, text summarization, and text simplification. Using fixed prompt templates, identical decoding parameters, and batch inference, we report primary metrics per dataset and include prior results for GPT-4, GPT-3.5, and LLaMA-2-13B for comparison. GPT-5 achieved the strongest overall benchmark performance, with macro-average scores rising to 0.557 under five-shot prompting versus 0.506 for GPT-4 and 0.508 for GPT-4o. On MedQA, GPT-5 reached 94.1% accuracy, exceeding the previous supervised state of the art by over fifty points, and attained parity with supervised systems on PubMedQA (0.734). In extraction tasks, GPT-5 delivered major gains in chemical NER (0.886 F1) and ChemProt relation extraction (0.616 F1), outperforming GPT-4 and GPT-4o, though summarization and disease NER still lagged behind domain-specific baselines. These results establish GPT-5 as a general-purpose model now offering deployment-ready performance for reasoning-oriented biomedical QA, while precision-critical extraction and evidence-dense summarization continue to favor fine-tuned or hybrid approaches. The benchmark delineates where simple prompting suffices and where retrieval-augmented or planning-based scaffolds are likely required, providing actionable guidance for BioNLP system design as frontier models advance.
zh
[NLP-98] From Post To Personality: Harnessing LLM s for MBTI Prediction in Social Media
【速读】: 该论文旨在解决从社交媒体文本中预测迈尔斯-布里格斯类型指标(Myers Briggs Type Indicator, MBTI)时面临的两大挑战:一是大语言模型(Large Language Models, LLMs)固有的幻觉问题,二是MBTI类型在人群中的自然不平衡分布。解决方案的关键在于提出一种名为PostToPersonality(PtoP)的新型LLM框架,其核心创新包括:1)采用检索增强生成(Retrieval Augmented Generation, RAG)结合上下文学习(in-context learning)以缓解LLM的幻觉;2)通过合成少数类过采样(synthetic minority oversampling)对预训练LLM进行微调,从而提升模型对MBTI类型的区分能力并平衡类别分布。实验表明,PtoP在真实社交媒体数据集上显著优于10种机器学习和深度学习基线方法。
链接: https://arxiv.org/abs/2509.04461
作者: Tian Ma,Kaiyu Feng,Yu Rong,Kangfei Zhao
机构: Beijing Institute of Technology (北京理工大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:
Abstract:Personality prediction from social media posts is a critical task that implies diverse applications in psychology and sociology. The Myers Briggs Type Indicator (MBTI), a popular personality inventory, has been traditionally predicted by machine learning (ML) and deep learning (DL) techniques. Recently, the success of Large Language Models (LLMs) has revealed their huge potential in understanding and inferring personality traits from social media content. However, directly exploiting LLMs for MBTI prediction faces two key challenges: the hallucination problem inherent in LLMs and the naturally imbalanced distribution of MBTI types in the population. In this paper, we propose PostToPersonality (PtoP), a novel LLM based framework for MBTI prediction from social media posts of individuals. Specifically, PtoP leverages Retrieval Augmented Generation with in context learning to mitigate hallucination in LLMs. Furthermore, we fine tune a pretrained LLM to improve model specification in MBTI understanding with synthetic minority oversampling, which balances the class imbalance by generating synthetic samples. Experiments conducted on a real world social media dataset demonstrate that PtoP achieves state of the art performance compared with 10 ML and DL baselines.
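摘要中的“合成少数类过采样”即 SMOTE 一类技术;下面在模拟的句向量上演示 imblearn 库的标准用法,特征与标签均为随机构造,仅说明类别再平衡这一步,不涉及论文的 RAG 与微调细节。

```python
import numpy as np
from imblearn.over_sampling import SMOTE

# 假设:每条帖子已编码为句向量,标签为 MBTI 类型(此处简化为两类并模拟不平衡)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))
y = np.array([0] * 180 + [1] * 20)   # 少数类型样本稀少

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(np.bincount(y), "->", np.bincount(y_res))   # [180 20] -> [180 180]
# 过采样得到的(嵌入, 标签)对可用于构造更均衡的微调数据
```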
zh
[NLP-99] CoCoNUTS: Concentrating on Content while Neglecting Uninformative Textual Styles for AI-Generated Peer Review Detection
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在同行评审中被用于生成实质性内容时,导致评审公平性和可靠性受损的问题。现有基于风格的AI文本检测方法易受改写攻击,难以区分仅进行语言润色与实质性内容生成,从而可能误判合规的AI辅助修改,并漏检伪装成人类撰写的AI生成内容。解决方案的关键在于提出从风格导向到内容导向的范式转变:构建了一个细粒度的AI生成同行评审数据集(CoCoNUTS),涵盖六种不同的人机协作模式;并开发了基于多任务学习框架的CoCoDet检测模型,以实现对评审内容中AI参与更准确、鲁棒的识别,为学术界提供可落地的评估工具和更精准、公平、可靠的检测方法。
链接: https://arxiv.org/abs/2509.04460
作者: Yihan Chen,Jiawei Chen,Guozhao Mo,Xuanang Chen,Ben He,Xianpei Han,Le Sun
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The growing integration of large language models (LLMs) into the peer review process presents potential risks to the fairness and reliability of scholarly evaluation. While LLMs offer valuable assistance for reviewers with language refinement, there is growing concern over their use to generate substantive review content. Existing general AI-generated text detectors are vulnerable to paraphrasing attacks and struggle to distinguish between surface language refinement and substantial content generation, suggesting that they primarily rely on stylistic cues. When applied to peer review, this limitation can result in unfairly suspecting reviews with permissible AI-assisted language enhancement, while failing to catch deceptively humanized AI-generated reviews. To address this, we propose a paradigm shift from style-based to content-based detection. Specifically, we introduce CoCoNUTS, a content-oriented benchmark built upon a fine-grained dataset of AI-generated peer reviews, covering six distinct modes of human-AI collaboration. Furthermore, we develop CoCoDet, an AI review detector via a multi-task learning framework, designed to achieve more accurate and robust detection of AI involvement in review content. Our work offers a practical foundation for evaluating the use of LLMs in peer review, and contributes to the development of more precise, equitable, and reliable detection methods for real-world scholarly applications. Our code and data will be publicly available at this https URL.
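以下是“共享编码器 + 多任务头”这类检测器的最小 PyTorch 草图:输入维度、头的数量与任务划分均为依据摘要做出的假设(如六种人机协作模式),并非官方 CoCoDet 结构。

```python
import torch
import torch.nn as nn

class MultiTaskDetector(nn.Module):
    """共享编码器 + 两个任务头的多任务学习检测器(示意)。"""
    def __init__(self, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(768, hidden), nn.ReLU())  # 假设输入为 768 维评审文本向量
        self.head_binary = nn.Linear(hidden, 2)   # 任务1:是否存在 AI 参与
        self.head_mode = nn.Linear(hidden, 6)     # 任务2:六种人机协作模式

    def forward(self, x):
        h = self.encoder(x)
        return self.head_binary(h), self.head_mode(h)

model = MultiTaskDetector()
x = torch.randn(4, 768)
y_bin = torch.randint(0, 2, (4,))
y_mode = torch.randint(0, 6, (4,))
logit_bin, logit_mode = model(x)
loss = nn.functional.cross_entropy(logit_bin, y_bin) + \
       nn.functional.cross_entropy(logit_mode, y_mode)   # 两个任务联合优化
loss.backward()
```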
zh
[NLP-100] Uncertainty-Aware Collaborative System of Large and Small Models for Multimodal Sentiment Analysis
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在实际部署中面临的计算资源消耗过高与轻量级模型性能不足之间的性能-效率权衡问题。解决方案的关键在于提出一种基于不确定性的协同系统(Uncertainty-Aware Collaborative System, U-ACS),其核心是一个由不确定性驱动的级联机制:首先由高效的小模型对所有输入样本进行快速筛选,仅将预测不确定性较高的困难样本(indicating greater difficulty)递交给MLLM进行深度分析;同时引入加权平均策略处理相似极性预测,并采用基于提示的交叉验证机制解决双模型均高不确定时的冲突预测。该方法实现了计算资源的动态分配,在显著降低推理开销的同时保持了MLLM的高精度表现。
链接: https://arxiv.org/abs/2509.04459
作者: Shiqin Han,Manning Gao,Menghua Jiang,Yuncheng Jiang,Haifeng Hu,Sijie Mai
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:The advent of Multimodal Large Language Models (MLLMs) has significantly advanced the state-of-the-art in multimodal machine learning, yet their substantial computational demands present a critical barrier to real-world deployment. Conversely, smaller, specialized models offer high efficiency but often at the cost of performance. To reconcile this performance-efficiency trade-off, we propose a novel Uncertainty-Aware Collaborative System (U-ACS) that synergistically orchestrates a powerful MLLM (e.g., HumanOmni) and a lightweight baseline model for multimodal sentiment analysis. The core of our system is an uncertainty-driven cascade mechanism, where the efficient small model first acts as a rapid filter for all input samples. Only those samples yielding high predictive uncertainty, thereby indicating greater difficulty, are selectively escalated to the MLLM for more sophisticated analysis. Furthermore, our system introduces advanced strategies to handle ambiguous or conflicting predictions, including weighted averaging for predictions of similar polarity and a prompt-based cross-verification to resolve conflicting predictions when both models exhibit high uncertainty. This sample-difficulty-aware approach allows for a dynamic allocation of computational resources, drastically reducing inference costs while retaining the high accuracy of MLLM. Extensive experiments on benchmark datasets demonstrate that our proposed method achieves state-of-the-art performance, while requiring only a fraction of the computational resources compared to using a standalone MLLM.
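下面的草图实现了摘要描述的级联逻辑:小模型先行,预测熵超过阈值才升级到 MLLM,两者极性一致时加权平均;其中阈值、权重为假设取值,冲突分支做了简化(论文中采用基于 prompt 的交叉验证)。

```python
import math

def entropy(probs):
    return -sum(p * math.log(p + 1e-12) for p in probs)

def small_model(sample):
    """占位:轻量模型返回 (负面, 中性, 正面) 概率。"""
    return sample["small_probs"]

def mllm(sample):
    """占位:MLLM 仅在困难样本上调用,返回同样格式的概率。"""
    return sample["mllm_probs"]

def u_acs_predict(sample, tau=0.9, w_small=0.3):
    p_small = small_model(sample)
    if entropy(p_small) < tau:              # 低不确定性:直接采纳小模型结果
        return max(range(3), key=lambda i: p_small[i])
    p_big = mllm(sample)                    # 高不确定性:升级到 MLLM
    if (max(range(3), key=lambda i: p_small[i]) ==
            max(range(3), key=lambda i: p_big[i])):
        # 极性一致:加权平均两个模型的概率
        fused = [w_small * a + (1 - w_small) * b for a, b in zip(p_small, p_big)]
        return max(range(3), key=lambda i: fused[i])
    # 极性冲突:此处简化为信任 MLLM;论文用 prompt 交叉验证来裁决
    return max(range(3), key=lambda i: p_big[i])

s = {"small_probs": [0.4, 0.35, 0.25], "mllm_probs": [0.7, 0.2, 0.1]}
print(u_acs_predict(s))  # 0(负面):小模型熵高,升级后与 MLLM 极性一致
```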
zh
[NLP-101] Predicting Failures of LLMs to Link Biomedical Ontology Terms to Identifiers: Evidence Across Models and Ontologies ALT
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生物医学自然语言处理任务中,尽管整体表现优异,却常无法将术语正确关联到其唯一标识符(identifier)的问题。研究通过分析两个主流本体(ontology)——人类表型本体(Human Phenotype Ontology, HPO)和基因本体(Gene Ontology, GO)——以及两种高性能模型(GPT-4o 和 LLaMa 3.1 405B)的预测结果,识别出影响链接准确性的关键因素。解决方案之关键在于发现:术语标识符的暴露程度(exposure to ontology identifiers)是预测链接成功与否的最强变量,表明模型对特定标识符的熟悉度显著影响其推理准确性。
链接: https://arxiv.org/abs/2509.04458
作者: Daniel B. Hier,Steven Keith Platt,Tayo Obafemi-Ajayi
机构: University of Illinois at Chicago (伊利诺伊大学芝加哥分校); Loyola University Chicago (洛约拉大学芝加哥分校); Missouri State University (密苏里州立大学)
类目: Computation and Language (cs.CL)
备注: Accepted for Presentation, IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI 25), Atlanta GA USA, October 26-29, 2025
Abstract:Large language models often perform well on biomedical NLP tasks but may fail to link ontology terms to their correct identifiers. We investigate why these failures occur by analyzing predictions across two major ontologies, Human Phenotype Ontology and Gene Ontology, and two high-performing models, GPT-4o and LLaMa 3.1 405B. We evaluate nine candidate features related to term familiarity, identifier usage, morphology, and ontology structure. Univariate and multivariate analyses show that exposure to ontology identifiers is the strongest predictor of linking success.
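下面用 sklearn 在模拟数据上演示“多变量分析找出最强预测因子”的标准做法:三个候选特征与标签的生成方式均为假设,仅展示标准化后比较逻辑回归系数的流程。

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# 假设:每个本体术语有若干候选特征,标签为"标识符链接是否成功"
rng = np.random.default_rng(0)
n = 500
exposure = rng.poisson(5, n)                      # 标识符在语料中的出现频次(示意)
term_len = rng.integers(1, 8, n)                  # 术语词数
depth = rng.integers(1, 10, n)                    # 本体层级深度
y = (exposure + rng.normal(0, 2, n) > 5).astype(int)  # 构造与 exposure 相关的标签

X = np.column_stack([exposure, term_len, depth]).astype(float)
X = (X - X.mean(0)) / X.std(0)                    # 标准化后各系数才可直接比较
clf = LogisticRegression().fit(X, y)
for name, coef in zip(["exposure", "term_len", "depth"], clf.coef_[0]):
    print(f"{name}: {coef:+.3f}")                 # exposure 的系数应显著最大
```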
zh
[NLP-102] Do MLLMs Really Understand the Charts?
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理无标注图表时存在严重幻觉和性能下降的问题,核心质疑在于MLLM是否真正具备图表理解能力。研究表明,当前MLLM主要依赖图像识别而非视觉推理来解析图表。为此,作者提出ChartReasoner,其关键在于模拟人类通过视觉推理进行数值估计的行为,将模型的推理过程锚定在对图表结构和语义的理解之上,从而显著提升图表推理准确性和泛化能力,在自建的CRBench基准和公开基准上均取得优于GPT-4o与Gemini-2.5-Flash等先进模型的表现。
链接: https://arxiv.org/abs/2509.04457
作者: Xiao Zhang,Dongyuan Li,Liuyu Xiang,Yao Zhang,Cheng Zhong,Zhaofeng He
机构: AAITC
类目: Computation and Language (cs.CL)
备注: 19 pages,15 figures
Abstract:Although Multimodal Large Language Models (MLLMs) have demonstrated increasingly impressive performance in chart understanding, most of them exhibit alarming hallucinations and significant performance degradation when handling non-annotated charts. Therefore, a question arises: Do MLLMs really understand the charts? Since a human is capable of understanding charts and estimating the values by visual reasoning, we first carefully establish a comprehensive Chart Reasoning Benchmark CRBench to rigorously evaluate the visual reasoning abilities of MLLMs on non-annotated charts. We argue that MLLMs are primarily relying on recognition rather than reasoning to interpret the charts. To steer MLLMs to reasonable chart understanding, we propose ChartReasoner that mimics human behavior by grounding their estimation in chart understanding. Extensive results on the proposed CRBench show that ChartReasoner-3B/7B achieves superior performance in chart reasoning, even compared to GPT-4o and Gemini-2.5-Flash. More importantly, ChartReasoner also demonstrates the visual reasoning abilities in general chart comprehension on public benchmarks, leading to significant performance gains and enabling MLLMs to rationally understand the charts. The code and dataset will be publicly available upon publication.
zh
[NLP-103] Mentalic Net: Development of RAG -based Conversational AI and Evaluation Framework for Mental Health Support
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在心理健康支持场景中应用时面临的准确性、共情能力、可信度、隐私保护及偏见等关键挑战。其解决方案的核心在于构建一个基于检索增强生成(Retrieval-Augmented Generation, RAG)框架的对话式人工智能系统——Mentalic Net Conversational AI,通过提示工程(prompt engineering)与新颖数据集上的微调策略相结合,实现了高语义一致性(BERT Score达0.898)和多维评估指标的综合优化,同时强调采用“人在环路”(human-in-the-loop)机制与长期负责任的发展路径,以确保技术的安全性与伦理合规性。
链接: https://arxiv.org/abs/2509.04456
作者: Anandi Dutta,Shivani Mruthyunjaya,Jessica Saddington,Kazi Sifatul Islam
机构: 未知
类目: Computation and Language (cs.CL)
备注: Preprint Version, Accepted in ISEMV 2025
Abstract:The emergence of large language models (LLMs) has unlocked boundless possibilities, along with significant challenges. In response, we developed a mental health support chatbot designed to augment professional healthcare, with a strong emphasis on safe and meaningful application. Our approach involved rigorous evaluation, covering accuracy, empathy, trustworthiness, privacy, and bias. We employed a retrieval-augmented generation (RAG) framework, integrated prompt engineering, and fine-tuned a pre-trained model on novel datasets. The resulting system, Mentalic Net Conversational AI, achieved a BERT Score of 0.898, with other evaluation metrics falling within satisfactory ranges. We advocate for a human-in-the-loop approach and a long-term, responsible strategy in developing such transformative technologies, recognizing both their potential to change lives and the risks they may pose if not carefully managed.
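论文以 BERT Score 作为核心评估指标之一(报告值 0.898);下面给出该指标的标准计算方式(使用 bert-score 库),示例文本为虚构的占位数据。

```python
from bert_score import score  # pip install bert-score

candidates = ["It sounds like you're going through a difficult time. ..."]
references = ["I hear that this has been a really hard period for you. ..."]

# 返回逐样本的精确率 / 召回率 / F1 张量
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"P={P.mean():.3f} R={R.mean():.3f} F1={F1.mean():.3f}")
```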
zh
[NLP-104] INSEva: A Comprehensive Chinese Benchmark for Large Language Models in Insurance
【速读】: 该论文旨在解决现有AI评估基准在保险领域适用性不足的问题,即现有基准未能充分捕捉保险业务特有的知识需求与任务复杂性。其解决方案的关键在于构建了一个名为INSEva的中文多维评估基准,该基准涵盖保险业务领域、任务形式、难度层级及认知-知识维度,并整合了38,704个来自权威来源的高质量评估样本;同时,通过定制化的评估方法对生成式AI(Generative AI)在开放问答中的一致性(faithfulness)与完整性(completeness)进行量化分析,从而更精准地衡量大语言模型(Large Language Models, LLMs)在真实保险场景中的表现能力。
链接: https://arxiv.org/abs/2509.04455
作者: Shisong Chen,Qian Zhu,Wenyan Yang,Chengyi Yang,Zhong Wang,Ping Wang,Xuan Lin,Bo Xu,Daqian Li,Chao Yuan,Licai Qi,Wanqing Xu,sun zhenxing,Xin Lu,Shiqiang Xiong,Chao Chen,Haixiang Hu,Yanghua Xiao
机构: Ant Group (蚂蚁集团); Fudan University (复旦大学); East China Normal University (华东师范大学); Donghua University (东华大学)
类目: Computation and Language (cs.CL)
备注: Under review
Abstract:Insurance, as a critical component of the global financial system, demands high standards of accuracy and reliability in AI applications. While existing benchmarks evaluate AI capabilities across various domains, they often fail to capture the unique characteristics and requirements of the insurance domain. To address this gap, we present INSEva, a comprehensive Chinese benchmark specifically designed for evaluating AI systems’ knowledge and capabilities in insurance. INSEva features a multi-dimensional evaluation taxonomy covering business areas, task formats, difficulty levels, and cognitive-knowledge dimension, comprising 38,704 high-quality evaluation examples sourced from authoritative materials. Our benchmark implements tailored evaluation methods for assessing both faithfulness and completeness in open-ended responses. Through extensive evaluation of 8 state-of-the-art Large Language Models (LLMs), we identify significant performance variations across different dimensions. While general LLMs demonstrate basic insurance domain competency with average scores above 80, substantial gaps remain in handling complex, real-world insurance scenarios. The benchmark will be public soon.
zh
[NLP-105] Labelling Data with Unknown References
【速读】: 该论文旨在解决在缺乏标注参考数据(如开发集)的情况下,如何建立评估者(evaluator)可信度的问题。传统方法依赖于测试数据或假设评估者已知标签规则,但在无参考标注场景下均失效。解决方案的关键在于提出一种“无数据算法”(No-Data Algorithm),通过连续向评估者发起挑战任务,以概率意义上的正确性(w.h.p.)验证其可信度:当评估者确实掌握标签规则时,算法可接受其输出;反之则能识别并标记不可信的评估者。该方法已在大语言模型(LLM)作为裁判对低资源语言进行评估的应用中得到实证验证。
链接: https://arxiv.org/abs/2506.03083
作者: Adrian de Wynter
机构: Microsoft(微软); University of York(约克大学)
类目: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Extended version with LLM-based results/analysis
Abstract:An evaluator is trustworthy when there exists some agreed-upon way to measure its performance as a labeller. The two ways to establish trustworthiness are either by testing it, or by assuming the evaluator "knows" somehow the way to label the corpus. However, if labelled references (e.g., a development set) are unavailable, neither of these approaches work: the former requires the data, and the latter is an assumption, not evidence. To address this, we introduce an algorithm (the "No-Data Algorithm") by which to establish trust in an evaluator without any existing references. Our algorithm works by successively posing challenges to said evaluator. We show that this is sufficient to establish trustworthiness w.h.p., in such a way that when the evaluator actually knows the way to label the corpus, the No-Data Algorithm accepts its output; and, conversely, flags untrustworthy evaluators when these are unable to prove it. We present formal proofs of correctness, empirical tests, and applications to LLMs-as-judges on low-resource languages.
zh
[NLP-106] DarkStream: real-time speech anonymization with low latency
【速读】: 该论文旨在解决实时语音合成中说话人匿名化(speaker anonymization)的难题,特别是在低延迟约束下如何有效保护说话人隐私同时保持语音内容的可懂性。其解决方案的关键在于:首先,采用因果波形编码器结合短时前瞻缓冲区和基于Transformer的上下文层以提升在严格延迟限制下的内容编码能力;其次,通过神经声码器直接生成波形,省去中间mel-spectrogram转换步骤,显著降低推理时间;最后,利用GAN生成的伪说话人嵌入(pseudo-speaker embedding)注入到内容编码器提取的语言特征中,实现对说话人身份的有效混淆,在懒惰攻击场景下达到接近随机水平的说话人验证等错误率(EER ≈ 50%),同时保持较低的词错误率(WER < 9%),从而在低延迟、强隐私保护与最小可懂性损失之间取得良好平衡。
链接: https://arxiv.org/abs/2509.04667
作者: Waris Quamer,Ricardo Gutierrez-Osuna
机构: Texas A&M University (德克萨斯农工大学)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted for presentation at ASRU 2025
Abstract:We propose DarkStream, a streaming speech synthesis model for real-time speaker anonymization. To improve content encoding under strict latency constraints, DarkStream combines a causal waveform encoder, a short lookahead buffer, and transformer-based contextual layers. To further reduce inference time, the model generates waveforms directly via a neural vocoder, thus removing intermediate mel-spectrogram conversions. Finally, DarkStream anonymizes speaker identity by injecting a GAN-generated pseudo-speaker embedding into linguistic features from the content encoder. Evaluations show our model achieves strong anonymization, yielding close to 50% speaker verification EER (near-chance performance) on the lazy-informed attack scenario, while maintaining acceptable linguistic intelligibility (WER within 9%). By balancing low-latency, robust privacy, and minimal intelligibility degradation, DarkStream provides a practical solution for privacy-preserving real-time speech communication.
zh
计算机视觉
[CV-0] FlowSeek: Optical Flow Made Easier with Depth Foundation Models and Motion Bases ICCV2025
【速读】:该论文旨在解决当前光学流(optical flow)模型训练对硬件资源依赖过高、难以在消费级设备上高效实现的问题。其解决方案的关键在于提出FlowSeek框架,该框架融合了光学流网络设计的最新进展、先进的单图像深度基础模型(single-image depth foundation models)以及经典的低维运动参数化方法,从而构建出一个结构紧凑但精度优异的模型架构。该方法仅需单个消费级GPU即可完成训练,相较多数近期方法降低了约8倍的硬件预算,同时在多个基准数据集(如Sintel Final、KITTI、Spring和LayeredFlow)上实现了优于现有最优方法SEA-RAFT的跨数据集泛化性能。
链接: https://arxiv.org/abs/2509.05297
作者: Matteo Poggi,Fabio Tosi
机构: University of Bologna (博洛尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025 - Project Page: this https URL - Code: this https URL
Abstract:We present FlowSeek, a novel framework for optical flow requiring minimal hardware resources for training. FlowSeek marries the latest advances on the design space of optical flow networks with cutting-edge single-image depth foundation models and classical low-dimensional motion parametrization, implementing a compact, yet accurate architecture. FlowSeek is trained on a single consumer-grade GPU, a hardware budget about 8x lower compared to most recent methods, and still achieves superior cross-dataset generalization on Sintel Final and KITTI, with relative improvements of 10% and 15% over the previous state-of-the-art SEA-RAFT, as well as on Spring and LayeredFlow datasets.
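下面用 NumPy 勾勒“经典低维运动参数化”的含义:稠密光流被表示为少量运动基流场的线性组合;基的具体构造(此处取仿射基)与手工给定的系数均为示意性假设,FlowSeek 中系数由网络回归得到。

```python
import numpy as np

def affine_motion_bases(H, W):
    """把稠密光流表示为 6 个仿射基流场的线性组合(示意)。"""
    y, x = np.mgrid[0:H, 0:W].astype(np.float32)
    x, y = x - W / 2, y - H / 2          # 以图像中心为原点
    ones, zeros = np.ones_like(x), np.zeros_like(x)
    # 6 个仿射基:水平/垂直平移、缩放、旋转、两个剪切分量
    bases = [
        np.stack([ones, zeros], -1), np.stack([zeros, ones], -1),
        np.stack([x, y], -1),        np.stack([-y, x], -1),
        np.stack([x, -y], -1),       np.stack([y, x], -1),
    ]
    return np.stack(bases)               # (6, H, W, 2)

B = affine_motion_bases(64, 64)
coef = np.array([1.0, 0.0, 0.02, 0.1, 0.0, 0.0])   # 假设由网络回归得到
flow = np.tensordot(coef, B, axes=1)               # (H, W, 2):合成的稠密光流
print(flow.shape)
```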
zh
[CV-1] WinT3R: Window-Based Streaming Reconstruction with Camera Token Pool
【速读】:该论文旨在解决传统视觉重建方法在重建质量与实时性能之间存在权衡的问题(trade-off between reconstruction quality and real-time performance)。其核心解决方案在于提出一种基于滑动窗口机制(sliding window mechanism)的前馈重建模型 WinT3R,该机制通过在窗口内帧间保持充分的信息交互,在不显著增加计算负担的前提下提升几何预测精度;同时引入紧凑的相机表示和全局相机标记池(global camera token pool),增强了相机位姿估计的可靠性而不牺牲效率。上述设计使 WinT3R 在在线重建质量、相机位姿估计准确性和重建速度方面均达到当前最优水平。
链接: https://arxiv.org/abs/2509.05296
作者: Zizun Li,Jianjun Zhou,Yifan Wang,Haoyu Guo,Wenzheng Chang,Yang Zhou,Haoyi Zhu,Junyi Chen,Chunhua Shen,Tong He
机构: University of Science and Technology of China (中国科学技术大学); Shanghai AI Lab (上海人工智能实验室); SII; Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:We present WinT3R, a feed-forward reconstruction model capable of online prediction of precise camera poses and high-quality point maps. Previous methods suffer from a trade-off between reconstruction quality and real-time performance. To address this, we first introduce a sliding window mechanism that ensures sufficient information exchange among frames within the window, thereby improving the quality of geometric predictions without large computation. In addition, we leverage a compact representation of cameras and maintain a global camera token pool, which enhances the reliability of camera pose estimation without sacrificing efficiency. These designs enable WinT3R to achieve state-of-the-art performance in terms of online reconstruction quality, camera pose estimation, and reconstruction speed, as validated by extensive experiments on diverse datasets. Code and model are publicly available at this https URL.
zh
[CV-2] Improved 3D Scene Stylization via Text-Guided Generative Image Editing with Region-Based Control
【速读】:该论文旨在解决文本驱动的3D场景风格化(text-driven 3D stylization)中同时保持高质量风格一致性与视角一致性(view consistency)的难题,尤其在不同区域或物体间实现语义对齐的风格迁移仍具挑战。解决方案的关键在于:首先,通过扩展风格对齐的深度条件视图生成框架,将原本全共享的注意力机制替换为基于单参考图像的注意力共享机制,从而有效对齐跨视角的风格特征;其次,借鉴3D修复方法,利用多个深度图构成的网格作为单图像参考,进一步增强风格化图像间的视角一致性;最后,提出多区域重要性加权切片Wasserstein距离损失(Multi-Region Importance-Weighted Sliced Wasserstein Distance Loss),结合现成分割模型提供的掩码,实现对不同区域的可控风格迁移,提升风格转移的忠实度并支持多风格混合。
链接: https://arxiv.org/abs/2509.05285
作者: Haruo Fujiwara,Yusuke Mukuta,Tatsuya Harada
机构: The University of Tokyo (东京大学); RIKEN AIP (理化学研究所人工智能中心)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in text-driven 3D scene editing and stylization, which leverage the powerful capabilities of 2D generative models, have demonstrated promising outcomes. However, challenges remain in ensuring high-quality stylization and view consistency simultaneously. Moreover, applying style consistently to different regions or objects in the scene with semantic correspondence is a challenging task. To address these limitations, we introduce techniques that enhance the quality of 3D stylization while maintaining view consistency and providing optional region-controlled style transfer. Our method achieves stylization by re-training an initial 3D representation using stylized multi-view 2D images of the source views. Therefore, ensuring both style consistency and view consistency of stylized multi-view images is crucial. We achieve this by extending the style-aligned depth-conditioned view generation framework, replacing the fully shared attention mechanism with a single reference-based attention-sharing mechanism, which effectively aligns style across different viewpoints. Additionally, inspired by recent 3D inpainting methods, we utilize a grid of multiple depth maps as a single-image reference to further strengthen view consistency among stylized images. Finally, we propose Multi-Region Importance-Weighted Sliced Wasserstein Distance Loss, allowing styles to be applied to distinct image regions using segmentation masks from off-the-shelf models. We demonstrate that this optional feature enhances the faithfulness of style transfer and enables the mixing of different styles across distinct regions of the scene. Experimental evaluations, both qualitative and quantitative, demonstrate that our pipeline effectively improves the results of text-driven 3D stylization.
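下面给出“多区域重要性加权切片 Wasserstein 损失”的一种 PyTorch 草图:随机投影数、特征来源与掩码均为假设,并且假定两组样本数相同(数量不同时需按分位数插值对齐),并非论文损失函数的官方实现。

```python
import torch

def sliced_wasserstein(a, b, n_proj=64):
    """一维随机投影 + 排序的切片 Wasserstein 距离;a/b: (N, C) 特征集合,N 需相同。"""
    proj = torch.randn(a.shape[1], n_proj, device=a.device)
    proj = proj / proj.norm(dim=0, keepdim=True)
    pa, pb = a @ proj, b @ proj                      # (N, n_proj)
    pa, _ = pa.sort(dim=0)                           # 排序即一维最优传输匹配
    pb, _ = pb.sort(dim=0)
    return ((pa - pb) ** 2).mean()

def region_weighted_swd(feat, style_feats, masks, weights):
    """按分割掩码分区域计算 SWD 并按重要性加权求和(掩码与权重由外部给定)。"""
    loss = 0.0
    for m, w, sf in zip(masks, weights, style_feats):
        loss = loss + w * sliced_wasserstein(feat[m], sf)
    return loss

feat = torch.randn(32 * 32, 256)                     # 渲染图的特征图(已展平,示意)
mask = torch.zeros(32 * 32, dtype=torch.bool)
mask[:512] = True                                    # 某一区域的掩码
style = [torch.randn(512, 256)]                      # 对应区域的目标风格特征
print(region_weighted_swd(feat, style, [mask], [1.0]))
```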
zh
[CV-3] LatticeWorld: A Multimodal Large Language Model-Empowered Framework for Interactive Complex World Generation
【速读】:该论文旨在解决传统3D世界建模方法在工业生产效率低、难以实现大规模动态交互场景生成的问题,尤其针对模拟真实世界复杂物理行为与多智能体交互的挑战。其解决方案的关键在于提出LatticeWorld框架,该框架结合轻量级大语言模型(LLaMA-2-7B)与行业级渲染引擎(如Unreal Engine 5),通过接受文本描述和视觉指令作为多模态输入,实现高保真物理仿真、实时渲染及动态代理的自动化生成,从而显著提升3D环境构建的效率与创造性质量。
链接: https://arxiv.org/abs/2509.05263
作者: Yinglin Duan,Zhengxia Zou,Tongwei Gu,Wei Jia,Zhan Zhao,Luyi Xu,Xinzhu Liu,Hao Jiang,Kang Chen,Shuang Qiu
机构: NetEase, Inc., China.(网易公司); Beihang University, China.(北京航空航天大学); Tsinghua University, China.(清华大学); City University of Hong Kong, China.(香港城市大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Recent research has been increasingly focusing on developing 3D world models that simulate complex real-world scenarios. World models have found broad applications across various domains, including embodied AI, autonomous driving, entertainment, etc. A more realistic simulation with accurate physics will effectively narrow the sim-to-real gap and allow us to gather rich information about the real world conveniently. While traditional manual modeling has enabled the creation of virtual 3D scenes, modern approaches have leveraged advanced machine learning algorithms for 3D world generation, with most recent advances focusing on generative methods that can create virtual worlds based on user instructions. This work explores such a research direction by proposing LatticeWorld, a simple yet effective 3D world generation framework that streamlines the industrial production pipeline of 3D environments. LatticeWorld leverages lightweight LLMs (LLaMA-2-7B) alongside the industry-grade rendering engine (e.g., Unreal Engine 5) to generate a dynamic environment. Our proposed framework accepts textual descriptions and visual instructions as multimodal inputs and creates large-scale 3D interactive worlds with dynamic agents, featuring competitive multi-agent interaction, high-fidelity physics simulation, and real-time rendering. We conduct comprehensive experiments to evaluate LatticeWorld, showing that it achieves superior accuracy in scene layout generation and visual fidelity. Moreover, LatticeWorld achieves over a 90x increase in industrial production efficiency while maintaining high creative quality compared with traditional manual production methods. Our demo video is available at this https URL
zh
[CV-4] COGITAO: A Visual Reasoning Framework To Study Compositionality Generalization
【速读】:该论文旨在解决当前机器学习模型在视觉领域中缺乏概念组合与泛化能力的问题,即模型难以将已学习的概念进行灵活组合并在新场景中有效应用。其解决方案的关键在于提出一个模块化且可扩展的数据生成框架与基准测试平台——COGITAO,该平台基于规则构建任务,在网格环境中对物体施加28种可互操作的变换操作,并支持在不同深度上实现组合性控制,从而生成数百万种独特任务规则,显著超越现有数据集规模。通过这种高灵活性和可控性的设计,COGITAO不仅实现了对复杂度的精细调节,还支持无限样本生成,为系统性研究视觉领域的组合性和泛化能力提供了有力工具。
链接: https://arxiv.org/abs/2509.05249
作者: Yassine Taoudi-Benchekroun,Klim Troyan,Pascal Sager,Stefan Gerber,Lukas Tuggener,Benjamin Grewe
机构: Institute of Neuroinformatics (神经信息学研究所); University of Zurich (苏黎世大学); ETH Zurich (苏黎世联邦理工学院); Centre for Artificial Intelligence (人工智能中心); Zurich University of Applied Sciences (苏黎世应用科学大学); RWAI AG (RWAI AG)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 main pages, 3 figures, appendix available
Abstract:The ability to compose learned concepts and apply them in novel settings is key to human intelligence, but remains a persistent limitation in state-of-the-art machine learning models. To address this issue, we introduce COGITAO, a modular and extensible data generation framework and benchmark designed to systematically study compositionality and generalization in visual domains. Drawing inspiration from ARC-AGI’s problem-setting, COGITAO constructs rule-based tasks which apply a set of transformations to objects in grid-like environments. It supports composition, at adjustable depth, over a set of 28 interoperable transformations, along with extensive control over grid parametrization and object properties. This flexibility enables the creation of millions of unique task rules – surpassing concurrent datasets by several orders of magnitude – across a wide range of difficulties, while allowing virtually unlimited sample generation per rule. We provide baseline experiments using state-of-the-art vision models, highlighting their consistent failures to generalize to novel combinations of familiar elements, despite strong in-domain performance. COGITAO is fully open-sourced, including all code and datasets, to support continued research in this field.
zh
[CV-5] Symbolic Graphics Programming with Large Language Models
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成符号图形程序(Symbolic Graphics Programs, SGPs)方面能力不足的问题,特别是如何从自然语言描述中生成可渲染为精确视觉内容的可扩展矢量图形(Scalable Vector Graphics, SVG)程序。其核心挑战在于提升LLMs对视觉语义的理解与控制能力,从而实现高保真度的对象、场景及组合关系(如属性绑定、空间关系和数值逻辑)的准确生成。解决方案的关键在于提出一种基于可验证奖励的强化学习(Reinforcement Learning, RL)框架:首先引入格式有效性门控机制以确保生成SVG的语法正确性,进而利用跨模态奖励机制(结合SigLIP和DINO等强视觉编码器)对齐文本描述与渲染图像,从而引导模型学习更精确的视觉语义表示。该方法显著提升了Qwen-2.5-7B模型在SVG生成中的质量与语义一致性,达到前沿系统水平,并揭示了强化学习促使模型将对象细分为可控基元并增强场景上下文连贯性的训练动力学特性。
链接: https://arxiv.org/abs/2509.05208
作者: Yamei Chen,Haoquan Zhang,Yangyi Huang,Zeju Qiu,Kaipeng Zhang,Yandong Wen,Weiyang Liu
机构: The Chinese University of Hong Kong (香港中文大学); Westlake University (西湖大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Max Planck Institute for Intelligent Systems (马克斯·普朗克智能系统研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Technical report (32 pages, 12 figures, project page: this https URL )
Abstract:Large language models (LLMs) excel at program synthesis, yet their ability to produce symbolic graphics programs (SGPs) that render into precise visual content remains underexplored. We study symbolic graphics programming, where the goal is to generate an SGP from a natural-language description. This task also serves as a lens into how LLMs understand the visual world by prompting them to generate images rendered from SGPs. Among various SGPs, our paper sticks to scalable vector graphics (SVGs). We begin by examining the extent to which LLMs can generate SGPs. To this end, we introduce SGP-GenBench, a comprehensive benchmark covering object fidelity, scene fidelity, and compositionality (attribute binding, spatial relations, numeracy). On SGP-GenBench, we discover that frontier proprietary models substantially outperform open-source models, and performance correlates well with general coding capabilities. Motivated by this gap, we aim to improve LLMs’ ability to generate SGPs. We propose a reinforcement learning (RL) with verifiable rewards approach, where a format-validity gate ensures renderable SVG, and a cross-modal reward aligns text and the rendered image via strong vision encoders (e.g., SigLIP for text-image and DINO for image-image). Applied to Qwen-2.5-7B, our method substantially improves SVG generation quality and semantics, achieving performance on par with frontier systems. We further analyze training dynamics, showing that RL induces (i) finer decomposition of objects into controllable primitives and (ii) contextual details that improve scene coherence. Our results demonstrate that symbolic graphics programming offers a precise and interpretable lens on cross-modal grounding.
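下面勾勒“格式有效性门控 + 跨模态奖励”的一种可能实现:用 cairosvg 渲染 SVG,用 SigLIP 计算图文对齐得分;渲染器选择、模型检查点与把 logit 过 sigmoid 的归一化方式均为示意性假设,并非论文官方奖励函数。

```python
import io
import torch
import cairosvg
from PIL import Image
from transformers import AutoProcessor, AutoModel

processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
model = AutoModel.from_pretrained("google/siglip-base-patch16-224")

def svg_reward(svg_code: str, description: str) -> float:
    """SVG 不可渲染则奖励为 0(格式门控),否则返回 SigLIP 图文相似度。"""
    try:
        png = cairosvg.svg2png(bytestring=svg_code.encode())
    except Exception:
        return 0.0                                  # 格式有效性门控
    image = Image.open(io.BytesIO(png)).convert("RGB")
    inputs = processor(text=[description], images=image,
                       padding="max_length", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits_per_image   # 图文对齐得分
    return logits.sigmoid().item()

svg = ('<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">'
       '<circle cx="50" cy="50" r="40" fill="red"/></svg>')
print(svg_reward(svg, "a red circle"))
```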
zh
[CV-6] Robust Model Predictive Control Design for Autonomous Vehicles with Perception-based Observers
【速读】:该论文旨在解决深度学习感知模块在状态估计中引入的非高斯噪声(non-Gaussian noise)对模型预测控制(MPC)性能与安全性的影响问题。传统MPC通常假设感知误差为零均值高斯噪声,这在实际场景中难以成立,尤其当感知模块存在偏置或重尾分布时会导致控制失效。解决方案的关键在于:首先采用基于约束zonotope的集值状态估计方法(set-based state estimation with constrained zonotopes),有效刻画感知误差中的偏置性和重尾特性,同时保证估计误差有界;其次将鲁棒MPC重构为线性规划(LP)形式,利用Minkowski-Lyapunov函数设计代价函数并引入松弛变量以避免退化解;最后通过收缩不变zonotope集合和椭球近似推导出最大稳定终端集及其反馈增益,从而确保闭环稳定性。该框架在ROS2平台上通过全向移动机器人硬件实验验证了其在重尾噪声下优于传统高斯假设设计的控制精度与鲁棒性。
链接: https://arxiv.org/abs/2509.05201
作者: Nariman Niknejad,Gokul S. Sankar,Bahare Kiumarsi,Hamidreza Modares
机构: Michigan State University (密歇根州立大学); Michigan State University (密歇根州立大学); Michigan State University (密歇根州立大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注:
Abstract:This paper presents a robust model predictive control (MPC) framework that explicitly addresses the non-Gaussian noise inherent in deep learning-based perception modules used for state estimation. Recognizing that accurate uncertainty quantification of the perception module is essential for safe feedback control, our approach departs from the conventional assumption of zero-mean noise quantification of the perception error. Instead, it employs set-based state estimation with constrained zonotopes to capture biased, heavy-tailed uncertainties while maintaining bounded estimation errors. To improve computational efficiency, the robust MPC is reformulated as a linear program (LP), using a Minkowski-Lyapunov-based cost function with an added slack variable to prevent degenerate solutions. Closed-loop stability is ensured through Minkowski-Lyapunov inequalities and contractive zonotopic invariant sets. The largest stabilizing terminal set and its corresponding feedback gain are then derived via an ellipsoidal approximation of the zonotopes. The proposed framework is validated through both simulations and hardware experiments on an omnidirectional mobile robot along with a camera and a convolutional neural network-based perception module implemented within a ROS2 framework. The results demonstrate that the perception-aware MPC provides stable and accurate control performance under heavy-tailed noise conditions, significantly outperforming traditional Gaussian-noise-based designs in terms of both state estimation error bounding and overall control performance.
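下面是笔者补充的 numpy 示意(仅演示集合表示本身,并非论文的 LP 形式 MPC):zonotope 集合 {c + G b : ||b||_inf <= 1} 对线性映射与 Minkowski 和封闭,这正是把有偏、有界的感知误差沿线性动力学传播所需的两种运算。

```python
import numpy as np

class Zonotope:
    """Set {c + G @ b : ||b||_inf <= 1}; closed under linear maps and sums."""
    def __init__(self, center, generators):
        self.c = np.asarray(center, dtype=float)      # shape (n,)
        self.G = np.asarray(generators, dtype=float)  # shape (n, m)

    def linear_map(self, A):
        return Zonotope(A @ self.c, A @ self.G)

    def minkowski_sum(self, other):
        return Zonotope(self.c + other.c, np.hstack([self.G, other.G]))

    def interval_hull(self):
        r = np.abs(self.G).sum(axis=1)                # per-axis radius
        return self.c - r, self.c + r

# Propagate a state set X through x+ = A x + w with a *biased* bounded noise set W.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
X = Zonotope([0.0, 0.0], np.eye(2) * 0.1)
W = Zonotope([0.05, 0.0], np.eye(2) * 0.02)           # nonzero center = bias
X_next = X.linear_map(A).minkowski_sum(W)
print(X_next.interval_hull())
```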
zh
[CV-7] Enhancing 3D Point Cloud Classification with ModelNet-R and Point-SkipNet
【速读】:该论文旨在解决当前3D点云分类任务中常用数据集ModelNet40存在的问题,如标签不一致、二维数据混入、尺寸不匹配及类别区分度不足等,这些问题限制了模型性能的提升。其解决方案的关键在于两个方面:一是提出一个经过精细化处理的改进数据集ModelNet-R,以提高数据质量与一致性;二是设计一种轻量级图神经网络Point-SkipNet,通过高效的采样策略、邻域分组机制和跳跃连接(skip connections)实现高分类精度的同时显著降低计算开销。实验表明,基于ModelNet-R训练的模型性能明显优于原始数据集,且Point-SkipNet在参数量远低于现有方法的情况下达到最优分类准确率,凸显了高质量数据对优化3D点云分类模型效率的重要性。
链接: https://arxiv.org/abs/2509.05198
作者: Mohammad Saeid,Amir Salarpour,Pedram MohajerAnsari
机构: Sirjan University of Technology (锡尔詹科技大学); Clemson University (克莱姆森大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: This paper has been accepted for presentation at the 7th International Conference on Pattern Recognition and Image Analysis (IPRIA 2025)
Abstract:The classification of 3D point clouds is crucial for applications such as autonomous driving, robotics, and augmented reality. However, the commonly used ModelNet40 dataset suffers from limitations such as inconsistent labeling, 2D data, size mismatches, and inadequate class differentiation, which hinder model performance. This paper introduces ModelNet-R, a meticulously refined version of ModelNet40 designed to address these issues and serve as a more reliable benchmark. Additionally, this paper proposes Point-SkipNet, a lightweight graph-based neural network that leverages efficient sampling, neighborhood grouping, and skip connections to achieve high classification accuracy with reduced computational overhead. Extensive experiments demonstrate that models trained on ModelNet-R exhibit significant performance improvements. Notably, Point-SkipNet achieves state-of-the-art accuracy on ModelNet-R with a substantially lower parameter count compared to contemporary models. This research highlights the crucial role of dataset quality in optimizing model efficiency for 3D point cloud classification. For more details, see the code at: this https URL.
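摘要提到的“高效采样 + 邻域分组”是点云网络的标准构件。下面是笔者补充的 numpy 草图(非 Point-SkipNet 官方实现),演示最远点采样(FPS)与球查询分组:

```python
import numpy as np

def farthest_point_sampling(points, k):
    """points: (N, 3). Greedily keep the point farthest from the selected
    set, which spreads samples evenly over the surface."""
    selected = [0]
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(dist))
        selected.append(idx)
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return np.array(selected)

def ball_query(points, center_idx, radius):
    """Group every point within `radius` of each sampled center."""
    centers = points[center_idx]                                  # (k, 3)
    d = np.linalg.norm(points[None, :, :] - centers[:, None, :], axis=-1)
    return [np.where(row <= radius)[0] for row in d]

pts = np.random.rand(1024, 3)
centers = farthest_point_sampling(pts, 64)
groups = ball_query(pts, centers, radius=0.2)
print(len(groups), groups[0].shape)
```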
zh
[CV-8] SL-SLR: Self-Supervised Representation Learning for Sign Language Recognition
【速读】:该论文旨在解决手势识别(Sign Language Recognition, SLR)中自监督学习方法存在的两个关键问题:一是现有对比学习方法对视频中所有部分一视同仁,未考虑不同区域对识别任务的重要性差异;二是不同手势间的共享动作导致负样本高度相似,难以区分。为应对上述挑战,论文提出了一种新的自监督学习框架,其核心创新在于两个方面:一是引入“自由负样本”(free-negative pairs)的自监督策略,有效缓解因负样本过于相似而导致的特征混淆问题;二是设计了一种新型数据增强技术,更好地捕捉手势的关键语义信息。该方案在多个下游任务(如线性评估、半监督学习及跨手语迁移)中均显著优于多种对比与自监督方法。
链接: https://arxiv.org/abs/2509.05188
作者: Ariel Basso Madjoukeng,Jérôme Fink,Pierre Poitier,Edith Belise Kenmogne,Benoit Frenay
机构: University of Namur (那慕尔大学); University of Dschang (杜舍安格大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Sign language recognition (SLR) is a machine learning task aiming to identify signs in videos. Due to the scarcity of annotated data, unsupervised methods like contrastive learning have become promising in this field. They learn meaningful representations by pulling positive pairs (two augmented versions of the same instance) closer and pushing negative pairs (different from the positive pairs) apart. In SLR, in a sign video, only certain parts provide information that is truly useful for its recognition. Applying contrastive methods to SLR raises two issues: (i) contrastive learning methods treat all parts of a video in the same way, without taking into account the relevance of certain parts over others; (ii) shared movements between different signs make negative pairs highly similar, complicating sign discrimination. These issues lead to learning non-discriminative features for sign recognition and poor results in downstream tasks. In response, this paper proposes a self-supervised learning framework designed to learn meaningful representations for SLR. This framework consists of two key components designed to work together: (i) a new self-supervised approach with free-negative pairs; (ii) a new data augmentation technique. This approach shows a considerable gain in accuracy compared to several contrastive and self-supervised methods, across linear evaluation, semi-supervised learning, and transferability between sign languages.
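为直观说明“共享动作使负样本与正样本过于相似”为何有害,笔者给出一个 InfoNCE 损失的 numpy 小实验(与论文方法无关,仅复现其动机):当负样本在特征空间中靠近锚点时,损失被推高、类间边界被模糊。

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """anchor, positive: (d,); negatives: (k, d); all L2-normalized."""
    pos = np.exp(anchor @ positive / tau)
    neg = np.exp(negatives @ anchor / tau).sum()
    return -np.log(pos / (pos + neg))

rng = np.random.default_rng(0)
a = rng.normal(size=8); a /= np.linalg.norm(a)
p = a + 0.05 * rng.normal(size=8); p /= np.linalg.norm(p)
# "Hard" negatives sharing movements with the anchor sign look almost positive.
hard = np.stack([a + 0.1 * rng.normal(size=8) for _ in range(4)])
hard /= np.linalg.norm(hard, axis=1, keepdims=True)
easy = rng.normal(size=(4, 8)); easy /= np.linalg.norm(easy, axis=1, keepdims=True)
print(info_nce(a, p, hard))  # noticeably larger loss ...
print(info_nce(a, p, easy))  # ... than with unrelated negatives
```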
zh
[CV-9] SGS-3D: High-Fidelity 3D Instance Segmentation via Reliable Semantic Mask Splitting and Growing
【速读】:该论文旨在解决基于2D-to-3D lifting方法的3D实例分割(3D instance segmentation)中存在的精度不足问题,其核心挑战在于:从2D图像中提升至3D空间的过程中,由于语义引导信息模糊和深度约束不足,导致累积误差严重,难以获得精确的实例级分割结果。解决方案的关键在于提出一种无需训练的“分而生长”框架SGS-3D(Split and Grow for Semantic mask refinement in 3D instance segmentation),该框架首先利用几何原语(geometric primitives)对模糊的 lifted masks 进行净化与分割,从而提升语义一致性;随后通过融合空间连续性和高层特征进行精细化实例生长,有效缓解不同物体间的语义歧义问题。该方法在ScanNet200、ScanNet++和KITTI-360等多个数据集上验证了其显著提升的分割精度与鲁棒性,同时保持良好的跨场景泛化能力。
链接: https://arxiv.org/abs/2509.05144
作者: Chaolei Wang,Yang Luo,Jing Du,Siyu Chen,Yiping Chen,Ting Han
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate 3D instance segmentation is crucial for high-quality scene understanding in the 3D vision domain. However, 3D instance segmentation based on 2D-to-3D lifting approaches struggle to produce precise instance-level segmentation, due to accumulated errors introduced during the lifting process from ambiguous semantic guidance and insufficient depth constraints. To tackle these challenges, we propose splitting and growing reliable semantic mask for high-fidelity 3D instance segmentation (SGS-3D), a novel “split-then-grow” framework that first purifies and splits ambiguous lifted masks using geometric primitives, and then grows them into complete instances within the scene. Unlike existing approaches that directly rely on raw lifted masks and sacrifice segmentation accuracy, SGS-3D serves as a training-free refinement method that jointly fuses semantic and geometric information, enabling effective cooperation between the two levels of representation. Specifically, for semantic guidance, we introduce a mask filtering strategy that leverages the co-occurrence of 3D geometry primitives to identify and remove ambiguous masks, thereby ensuring more reliable semantic consistency with the 3D object instances. For the geometric refinement, we construct fine-grained object instances by exploiting both spatial continuity and high-level features, particularly in the case of semantic ambiguity between distinct objects. Experimental results on ScanNet200, ScanNet++, and KITTI-360 demonstrate that SGS-3D substantially improves segmentation accuracy and robustness against inaccurate masks from pre-trained models, yielding high-fidelity object instances while maintaining strong generalization across diverse indoor and outdoor environments. Code is available in the supplementary materials.
zh
[CV-10] A Scalable Attention-Based Approach for Image-to-3D Texture Mapping
【速读】:该论文旨在解决现有生成式方法在3D纹理生成中存在效率低、依赖UV映射以及难以忠实还原参考图像的问题。其关键解决方案是提出一种基于Transformer的框架,直接从单张图像和网格(mesh)预测3D纹理场,无需UV映射和可微渲染,从而实现更快的纹理生成速度;同时结合三平面(triplane)表示与基于深度的反投影损失(depth-based backprojection losses),提升训练效率和推理速度,在一次前向传播中即可生成高保真纹理(平均仅需0.2秒/形状),显著优于当前最优基线方法在图像保真度与感知质量上的表现。
链接: https://arxiv.org/abs/2509.05131
作者: Arianna Rampini,Kanika Madan,Bruno Roy,AmirHossein Zamani,Derek Cheung
机构: Autodesk Research; Mila; Concordia University; Autodesk Research; Autodesk Research; Autodesk Research
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:High-quality textures are critical for realistic 3D content creation, yet existing generative methods are slow, rely on UV maps, and often fail to remain faithful to a reference image. To address these challenges, we propose a transformer-based framework that predicts a 3D texture field directly from a single image and a mesh, eliminating the need for UV mapping and differentiable rendering, and enabling faster texture generation. Our method integrates a triplane representation with depth-based backprojection losses, enabling efficient training and faster inference. Once trained, it generates high-fidelity textures in a single forward pass, requiring only 0.2s per shape. Extensive qualitative, quantitative, and user preference evaluations demonstrate that our method outperforms state-of-the-art baselines on single-image texture reconstruction in terms of both fidelity to the input image and perceptual quality, highlighting its practicality for scalable, high-quality, and controllable 3D content creation.
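摘要中的 triplane 表示可以理解为“三张正交特征平面 + 双线性采样”。以下为笔者补充的 numpy 草图(通道数、网格分辨率与求和聚合方式均为假设,实际方法通常还会接一个小型解码网络):

```python
import numpy as np

def sample_plane(plane, uv):
    """Bilinearly sample a (C, H, W) feature grid at uv in [0, 1]^2."""
    C, H, W = plane.shape
    x = uv[:, 0] * (W - 1); y = uv[:, 1] * (H - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.clip(x0 + 1, 0, W - 1), np.clip(y0 + 1, 0, H - 1)
    wx, wy = x - x0, y - y0
    f = (plane[:, y0, x0] * (1 - wx) * (1 - wy) + plane[:, y0, x1] * wx * (1 - wy)
         + plane[:, y1, x0] * (1 - wx) * wy + plane[:, y1, x1] * wx * wy)
    return f.T  # (N, C)

def triplane_features(planes, xyz):
    """xyz assumed normalized to [0, 1]^3; project onto the three
    axis-aligned planes and aggregate (here by summation)."""
    return (sample_plane(planes["xy"], xyz[:, [0, 1]])
            + sample_plane(planes["xz"], xyz[:, [0, 2]])
            + sample_plane(planes["yz"], xyz[:, [1, 2]]))

planes = {k: np.random.rand(16, 32, 32) for k in ("xy", "xz", "yz")}
print(triplane_features(planes, np.random.rand(5, 3)).shape)  # (5, 16)
```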
zh
[CV-11] Semi-supervised Deep Transfer for Regression without Domain Alignment
【速读】:该论文旨在解决在源数据不可获取(source-free)、目标域标签样本稀缺(semi-supervised)且任务为回归(regression)场景下的域适应(domain adaptation, DA)问题,这在医学和神经科学等实际应用中尤为常见。传统域适应方法通常依赖完整的源数据访问,但在隐私保护或存储计算成本限制下难以实现;同时,仅靠少量标注目标样本进行微调(fine-tuning)往往性能有限。解决方案的关键在于提出一种基于Contradistinguisher(CUDA)框架的正则化方法——CRAFT(Contradistinguisher-based Regularization Approach for Flexible Training),其创新点在于:无需源数据即可利用未标记的目标样本构建共享表示空间,通过引入对比学习机制引导模型在目标域上保持判别性特征,从而实现对预训练模型的有效迁移,尤其适用于连续值输出的回归任务。实验表明,CRAFT在脑电图(EEG)眼动预测与结构磁共振成像(sMRI)“脑龄”预测两个神经科学任务中均显著优于基线方法,RMSE提升最高达9%,并优于四种先进源自由域适应模型超过3%。
链接: https://arxiv.org/abs/2509.05092
作者: Mainak Biswas,Ambedkar Dukkipati,Devarajan Sridharan
机构: Indian Institute of Science (印度科学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 6 figures, International Conference on Computer Vision 2025
Abstract:Deep learning models deployed in real-world applications (e.g., medicine) face challenges because source models do not generalize well to domain-shifted target data. Many successful domain adaptation (DA) approaches require full access to source data. Yet, such requirements are unrealistic in scenarios where source data cannot be shared either because of privacy concerns or because it is too large and incurs prohibitive storage or computational costs. Moreover, resource constraints may limit the availability of labeled targets. We illustrate this challenge in a neuroscience setting where source data are unavailable, labeled target data are meager, and predictions involve continuous-valued outputs. We build upon Contradistinguisher (CUDA), an efficient framework that learns a shared model across the labeled source and unlabeled target samples, without intermediate representation alignment. Yet, CUDA was designed for unsupervised DA, with full access to source data, and for classification tasks. We develop CRAFT – a Contradistinguisher-based Regularization Approach for Flexible Training – for source-free (SF), semi-supervised transfer of pretrained models in regression tasks. We showcase the efficacy of CRAFT in two neuroscience settings: gaze prediction with electroencephalography (EEG) data and "brain age" prediction with structural MRI data. For both datasets, CRAFT yielded up to 9% improvement in root-mean-squared error (RMSE) over fine-tuned models when labeled training examples were scarce. Moreover, CRAFT leveraged unlabeled target data and outperformed four competing state-of-the-art source-free domain adaptation models by more than 3%. Lastly, we demonstrate the efficacy of CRAFT on two other real-world regression benchmarks. We propose CRAFT as an efficient approach for source-free, semi-supervised deep transfer for regression that is ubiquitous in biology and medicine.
zh
[CV-12] Robust Experts: the Effect of Adversarial Training on CNNs with Sparse Mixture-of-Experts Layers ICCV2025
【速读】:该论文旨在解决卷积神经网络(Convolutional Neural Networks, CNNs)在面对对抗攻击时鲁棒性不足的问题,尤其针对现有增强鲁棒性的方法往往需要高计算资源的局限性。其解决方案的关键在于引入稀疏专家混合模型(Sparse Mixture-of-Experts, MoE)层,通过替换ResNet架构中特定的残差块或卷积层,在不增加推理成本的前提下提升模型容量与鲁棒性。实验表明,在CIFAR-100数据集上,将单个MoE层插入更深阶段并结合对抗训练后,可显著提升模型在PGD和AutoPGD攻击下的鲁棒性;进一步发现,使用开关损失(switch loss)进行专家平衡时会导致路由机制坍缩至少数过用专家,使对抗训练集中于这些子路径,从而形成具有更强鲁棒性的专用子路径(robust subpaths)。
链接: https://arxiv.org/abs/2509.05086
作者: Svetlana Pavlitska,Haixi Fan,Konstantin Ditschuneit,J. Marius Zöllner
机构: Karlsruhe Institute of Technology (KIT); FZI Research Center for Information Technology
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted for publication at the STREAM workshop at ICCV 2025
Abstract:Robustifying convolutional neural networks (CNNs) against adversarial attacks remains challenging and often requires resource-intensive countermeasures. We explore the use of sparse mixture-of-experts (MoE) layers to improve robustness by replacing selected residual blocks or convolutional layers, thereby increasing model capacity without additional inference cost. On ResNet architectures trained on CIFAR-100, we find that inserting a single MoE layer in the deeper stages leads to consistent improvements in robustness under PGD and AutoPGD attacks when combined with adversarial training. Furthermore, we discover that when switch loss is used for balancing, it causes routing to collapse onto a small set of overused experts, thereby concentrating adversarial training on these paths and inadvertently making them more robust. As a result, some individual experts outperform the gated MoE model in robustness, suggesting that robust subpaths emerge through specialization. Our code is available at this https URL.
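下面是笔者补充的 PyTorch 草图(非论文官方代码),演示 top-1(“switch”)路由的卷积 MoE 层与摘要中提到的负载均衡损失;真实实现只会执行被选中的专家,这里为清晰起见全部计算后再聚合:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchMoEConv(nn.Module):
    def __init__(self, channels, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_experts))
        self.router = nn.Linear(channels, num_experts)

    def forward(self, x):
        probs = F.softmax(self.router(x.mean(dim=(2, 3))), dim=-1)  # (B, E)
        top1 = probs.argmax(dim=-1)                                 # (B,)
        out = torch.stack([e(x) for e in self.experts], dim=1)      # (B, E, C, H, W)
        y = out[torch.arange(x.size(0)), top1]
        # Switch-style load-balancing loss: fraction routed x mean prob per
        # expert; the paper observes routing can still collapse onto few experts.
        frac = F.one_hot(top1, probs.size(1)).float().mean(dim=0)
        balance_loss = (frac * probs.mean(dim=0)).sum() * probs.size(1)
        return y, balance_loss

y, aux = SwitchMoEConv(8)(torch.randn(2, 8, 16, 16))
print(y.shape, float(aux))
```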
zh
[CV-13] Scale-Interaction Transformer: A Hybrid CNN-Transformer Model for Facial Beauty Prediction
【速读】:该论文旨在解决自动化面部美感预测(Automated Facial Beauty Prediction, FBP)任务中因局部与全局面部特征之间复杂交互关系而导致的挑战性问题。传统卷积神经网络(Convolutional Neural Networks, CNNs)虽具备强大的特征提取能力,但通常以固定尺度处理信息,难以捕捉不同粒度特征间的依赖关系。解决方案的关键在于提出一种新型混合深度学习架构——尺度交互Transformer(Scale-Interaction Transformer, SIT),其核心创新是将多尺度卷积模块与Transformer编码器相结合:首先通过并行卷积构建不同感受野下的多尺度面部表征,再将其序列化输入Transformer编码器,利用自注意力机制显式建模各尺度特征之间的上下文关联与交互关系。实验表明,该方法在SCUT-FBP5500数据集上达到皮尔逊相关系数0.9187的新SOTA性能,验证了显式建模多尺度视觉线索交互对高精度FBP的重要性。
链接: https://arxiv.org/abs/2509.05078
作者: Djamel Eddine Boukhari
机构: Scientific and Technical Research Centre for Arid Areas, CRSTRA (干旱地区科学技术研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Automated Facial Beauty Prediction (FBP) is a challenging computer vision task due to the complex interplay of local and global facial features that influence human perception. While Convolutional Neural Networks (CNNs) excel at feature extraction, they often process information at a fixed scale, potentially overlooking the critical inter-dependencies between features at different levels of granularity. To address this limitation, we introduce the Scale-Interaction Transformer (SIT), a novel hybrid deep learning architecture that synergizes the feature extraction power of CNNs with the relational modeling capabilities of Transformers. The SIT first employs a multi-scale module with parallel convolutions to capture facial characteristics at varying receptive fields. These multi-scale representations are then framed as a sequence and processed by a Transformer encoder, which explicitly models their interactions and contextual relationships via a self-attention mechanism. We conduct extensive experiments on the widely-used SCUT-FBP5500 benchmark dataset, where the proposed SIT model establishes a new state-of-the-art. It achieves a Pearson Correlation of 0.9187, outperforming previous methods. Our findings demonstrate that explicitly modeling the interplay between multi-scale visual cues is crucial for high-performance FBP. The success of the SIT architecture highlights the potential of hybrid CNN-Transformer models for complex image regression tasks that demand a holistic, context-aware understanding.
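笔者补充一个极简 PyTorch 草图(层数、通道数与核尺寸均为假设,非论文官方实现),演示“并行多尺度卷积 → 尺度 token 序列 → Transformer 编码器建模尺度间交互 → 回归评分”的整体结构:

```python
import torch
import torch.nn as nn

class ScaleInteraction(nn.Module):
    def __init__(self, in_ch=3, dim=64, kernels=(3, 5, 7)):
        super().__init__()
        # One branch per receptive-field size; each yields a single token.
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Conv2d(in_ch, dim, k, padding=k // 2),
                          nn.ReLU(), nn.AdaptiveAvgPool2d(1))
            for k in kernels)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 1)  # scalar beauty-score regression

    def forward(self, x):
        tokens = torch.stack([b(x).flatten(1) for b in self.branches], dim=1)
        z = self.encoder(tokens)             # self-attention across scales
        return self.head(z.mean(dim=1)).squeeze(-1)

print(ScaleInteraction()(torch.randn(2, 3, 64, 64)).shape)  # (2,)
```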
zh
[CV-14] GeoSplat: A Deep Dive into Geometry-Constrained Gaussian Splatting
【速读】:该论文旨在解决现有高斯溅射(Gaussian splatting)方法中因缺乏可靠几何先验而导致优化不稳定、初始化质量差及表面覆盖不充分的问题。其关键解决方案是提出GeoSplat框架,该框架通过引入一阶和二阶几何量(如法向量和主曲率)来约束整个训练流程,包括高斯初始化、梯度更新与稀疏化策略;特别地,利用局部流形结构设计了高效且抗噪的几何估计方法,从而实现动态几何先验的生成,显著提升了重建质量和鲁棒性。
链接: https://arxiv.org/abs/2509.05075
作者: Yangming Li,Chaoyu Liu,Lihao Liu,Simon Masnou,Carola-Bibiane Schönlieb
机构: University of Cambridge (剑桥大学); Université Claude Bernard Lyon 1 (里昂第一大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:A few recent works have explored incorporating geometric priors to regularize the optimization of Gaussian splatting, further improving its performance. However, those early studies mainly focused on low-order geometric priors (e.g., normal vectors), which are moreover unreliably estimated by noise-sensitive methods, like local principal component analysis. To address their limitations, we first present GeoSplat, a general geometry-constrained optimization framework that exploits both first-order and second-order geometric quantities to improve the entire training pipeline of Gaussian splatting, including Gaussian initialization, gradient update, and densification. As an example, we initialize the scales of 3D Gaussian primitives in terms of principal curvatures, leading to a better coverage of the object surface than random initialization. Secondly, based on certain geometric structures (e.g., local manifolds), we introduce efficient and noise-robust estimation methods that provide dynamic geometric priors for our framework. We conduct extensive experiments on multiple datasets for novel view synthesis, showing that our framework, GeoSplat, significantly improves the performance of Gaussian splatting and outperforms previous baselines.
zh
[CV-15] Systematic Review and Meta-analysis of AI-driven MRI Motion Artifact Detection and Correction
【速读】:该论文旨在系统性地回顾并开展荟萃分析,解决磁共振成像(MRI)中运动伪影检测与校正的问题,评估当前人工智能(AI)驱动方法的发展、有效性、挑战及未来研究方向。其解决方案的关键在于利用深度学习(DL)技术,尤其是生成式模型(generative models),以减少运动伪影并提升图像质量;然而,研究指出,当前方法仍面临泛化能力不足、对配对训练数据的依赖以及视觉失真风险等核心挑战,亟需建立标准化公共数据集和报告协议,并发展更先进、适应性强的深度学习算法以降低对大规模标注数据的依赖。
链接: https://arxiv.org/abs/2509.05071
作者: Mojtaba Safari,Zach Eidex,Richard L.J. Qiu,Matthew Goette,Tonghe Wang,Xiaofeng Yang
机构: Emory University (埃默里大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注:
Abstract:Background: To systematically review and perform a meta-analysis of artificial intelligence (AI)-driven methods for detecting and correcting magnetic resonance imaging (MRI) motion artifacts, assessing current developments, effectiveness, challenges, and future research directions. Methods: A comprehensive systematic review and meta-analysis were conducted, focusing on deep learning (DL) approaches, particularly generative models, for the detection and correction of MRI motion artifacts. Quantitative data were extracted regarding utilized datasets, DL architectures, and performance metrics. Results: DL, particularly generative models, show promise for reducing motion artifacts and improving image quality; however, limited generalizability, reliance on paired training data, and risk of visual distortions remain key challenges that motivate standardized datasets and reporting. Conclusions: AI-driven methods, particularly DL generative models, show significant potential for improving MRI image quality by effectively addressing motion artifacts. However, critical challenges must be addressed, including the need for comprehensive public datasets, standardized reporting protocols for artifact levels, and more advanced, adaptable DL techniques to reduce reliance on extensive paired datasets. Addressing these aspects could substantially enhance MRI diagnostic accuracy, reduce healthcare costs, and improve patient care outcomes.
zh
[CV-16] Towards Efficient Pixel Labeling for Industrial Anomaly Detection and Localization
【速读】:该论文旨在解决工业产品缺陷检测中因缺乏像素级标注而导致的模型性能受限问题。传统异常检测(Anomaly Detection, AD)方法通常仅依赖正常样本进行训练,而缺陷样本虽可收集但需耗费大量人力进行像素级标注,难以规模化应用。解决方案的关键在于提出ADClick算法,通过少量用户点击和简短文本描述即可生成像素级异常标注,从而显著提升AD模型性能;进一步地,引入ADClick-Seg框架,利用基于原型的跨模态对齐机制将视觉特征与文本提示融合,结合像素级先验与语言引导线索,在MVTec AD数据集上实现了多类异常检测任务的最先进性能(AP = 80.0%,PRO = 97.5%,Pixel-AUROC = 99.1%)。
链接: https://arxiv.org/abs/2509.05034
作者: Jingqi Wu,Hanxi Li,Lin Yuanbo Wu,Hao Chen,Deyin Liu,Peng Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Industrial product inspection is often performed using Anomaly Detection (AD) frameworks trained solely on non-defective samples. Although defective samples can be collected during production, leveraging them usually requires pixel-level annotations, limiting scalability. To address this, we propose ADClick, an Interactive Image Segmentation (IIS) algorithm for industrial anomaly detection. ADClick generates pixel-wise anomaly annotations from only a few user clicks and a brief textual description, enabling precise and efficient labeling that significantly improves AD model performance (e.g., AP = 96.1% on MVTec AD). We further introduce ADClick-Seg, a cross-modal framework that aligns visual features and textual prompts via a prototype-based approach for anomaly detection and localization. By combining pixel-level priors with language-guided cues, ADClick-Seg achieves state-of-the-art results on the challenging "Multi-class" AD task (AP = 80.0%, PRO = 97.5%, Pixel-AUROC = 99.1% on MVTec AD).
zh
[CV-17] Pointing-Guided Target Estimation via Transformer-Based Attention ICANN
【速读】:该论文旨在解决人机交互(Human-Robot Interaction, HRI)中机器人如何准确理解人类通过自然指向手势(deictic gestures)所指示的目标对象的问题。解决方案的关键在于提出了一种多模态互注意力Transformer架构(Multi-Modality Inter-TransFormer, MM-ITF),该架构利用跨模态注意力机制,将单目RGB图像中的2D指向手势映射到候选物体位置,并为每个候选对象分配置信度分数,从而识别最可能的目标对象。实验表明,该方法仅需单目RGB数据即可实现高精度的目标预测,显著提升了人机协作的直观性和可访问性。
链接: https://arxiv.org/abs/2509.05031
作者: Luca Müller,Hassan Ali,Philipp Allgeuer,Lukáš Gajdošech,Stefan Wermter
机构: University of Hamburg (汉堡大学); University of Bremen (不来梅大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the 34th International Conference on Artificial Neural Networks (ICANN) 2025,12 pages,4 figures,1 table; work was co-funded by Horizon Europe project TERAIS under Grant agreement number 101079338
Abstract:Deictic gestures, like pointing, are a fundamental form of non-verbal communication, enabling humans to direct attention to specific objects or locations. This capability is essential in Human-Robot Interaction (HRI), where robots should be able to predict human intent and anticipate appropriate responses. In this work, we propose the Multi-Modality Inter-TransFormer (MM-ITF), a modular architecture to predict objects in a controlled tabletop scenario with the NICOL robot, where humans indicate targets through natural pointing gestures. Leveraging inter-modality attention, MM-ITF maps 2D pointing gestures to object locations, assigns a likelihood score to each, and identifies the most likely target. Our results demonstrate that the method can accurately predict the intended object using monocular RGB data, thus enabling intuitive and accessible human-robot collaboration. To evaluate the performance, we introduce a patch confusion matrix, providing insights into the model’s predictions across candidate object locations. Code available at: this https URL.
zh
[CV-18] LUIVITON: Learned Universal Interoperable VIrtual Try-ON
【速读】:该论文旨在解决复杂多层服装在多样化且任意姿态的人形角色上实现全自动虚拟试穿(Virtual Try-On)的问题,尤其针对服装与不同体型、姿态的三维人体模型之间的精准贴合难题。解决方案的关键在于引入SMPL(Skinned Multi-Person Linear model)作为代理表示,并将服装到身体的拟合问题分解为两个对应任务:1)服装到SMPL的对应关系预测,采用基于几何学习的方法处理局部到完整形状的映射;2)身体到SMPL的对应关系建模,创新性地利用多视角一致的外观特征和预训练的2D基础模型,结合扩散模型(Diffusion Model)进行建模。这一双阶段框架使得系统能够高效处理复杂几何结构(如非流形网格)、支持广泛的人形角色类型(包括人类、机器人、卡通角色等),并具备实时调整服装尺寸和材质属性的能力,从而实现端到端自动化且高质量的3D虚拟试穿。
链接: https://arxiv.org/abs/2509.05030
作者: Cong Cao,Xianhang Cheng,Jingyuan Liu,Yujian Zheng,Zhenhui Lin,Meriem Chkir,Hao Li
机构: MBZUAI(穆罕默德·本·扎耶德人工智能大学); The University of Tokyo(东京大学); Pinscreen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present LUIVITON, an end-to-end system for fully automated virtual try-on, capable of draping complex, multi-layer clothing onto diverse and arbitrarily posed humanoid characters. To address the challenge of aligning complex garments with arbitrary and highly diverse body shapes, we use SMPL as a proxy representation and separate the clothing-to-body draping problem into two correspondence tasks: 1) clothing-to-SMPL and 2) body-to-SMPL correspondence, where each has its unique challenges. While we address the clothing-to-SMPL fitting problem using a geometric learning-based approach for partial-to-complete shape correspondence prediction, we introduce a diffusion model-based approach for body-to-SMPL correspondence using multi-view consistent appearance features and a pre-trained 2D foundation model. Our method can handle complex geometries, non-manifold meshes, and generalizes effectively to a wide range of humanoid characters – including humans, robots, cartoon subjects, creatures, and aliens, while maintaining computational efficiency for practical adoption. In addition to offering a fully automatic fitting solution, LUIVITON supports fast customization of clothing size, allowing users to adjust clothing sizes and material properties after they have been draped. We show that our system can produce high-quality 3D clothing fittings without any human labor, even when 2D clothing sewing patterns are not available.
zh
[CV-19] Leveraging Transfer Learning and Mobile-enabled Convolutional Neural Networks for Improved Arabic Handwritten Character Recognition
【速读】:该论文旨在解决阿拉伯手写字符识别(Arabic Handwritten Character Recognition, AHCR)中面临的计算资源消耗大和训练数据稀缺的问题。解决方案的关键在于将迁移学习(Transfer Learning, TL)与轻量级移动端卷积神经网络(Mobile-enabled Convolutional Neural Networks, MbNets)相结合,通过对比全微调、部分微调和从头训练三种策略,在多个基准数据集上评估四种轻量级模型(MobileNet、SqueezeNet、MnasNet 和 ShuffleNet)的性能。实验表明,MobileNet 在准确率、鲁棒性和效率方面表现最优,而 ShuffleNet 在全微调下展现出更强的泛化能力;其中,IFHCDB 数据集在 MnasNet 全微调时达到 99% 的准确率,验证了该方法的有效性,为资源受限场景下的高效 AHCR 提供了可行路径。
链接: https://arxiv.org/abs/2509.05019
作者: Mohsine El Khayati,Ayyad Maafiri,Yassine Himeur,Hamzah Ali Alkhazaleh,Shadi Atalla,Wathiq Mansoor
机构: University Moulay Ismail (穆莱伊斯梅尔大学); Cadi Ayyad University (卡迪阿亚德大学); University of Dubai (迪拜大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20pages, 9 figures and 11 tables
Abstract:The study explores the integration of transfer learning (TL) with mobile-enabled convolutional neural networks (MbNets) to enhance Arabic Handwritten Character Recognition (AHCR). Addressing challenges like extensive computational requirements and dataset scarcity, this research evaluates three TL strategies–full fine-tuning, partial fine-tuning, and training from scratch–using four lightweight MbNets: MobileNet, SqueezeNet, MnasNet, and ShuffleNet. Experiments were conducted on three benchmark datasets: AHCD, HIJJA, and IFHCDB. MobileNet emerged as the top-performing model, consistently achieving superior accuracy, robustness, and efficiency, with ShuffleNet excelling in generalization, particularly under full fine-tuning. The IFHCDB dataset yielded the highest results, with 99% accuracy using MnasNet under full fine-tuning, highlighting its suitability for robust character recognition. The AHCD dataset achieved competitive accuracy (97%) with ShuffleNet, while HIJJA posed significant challenges due to its variability, achieving a peak accuracy of 92% with ShuffleNet. Notably, full fine-tuning demonstrated the best overall performance, balancing accuracy and convergence speed, while partial fine-tuning underperformed across metrics. These findings underscore the potential of combining TL and MbNets for resource-efficient AHCR, paving the way for further optimizations and broader applications. Future work will explore architectural modifications, in-depth dataset feature analysis, data augmentation, and advanced sensitivity analysis to enhance model robustness and generalizability.
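摘要比较的三种迁移策略可以用几行 PyTorch/torchvision 表达。以下是笔者补充的示意(类别数 28 按 AHCD 的阿拉伯字母数假设,并非论文代码):

```python
import torch.nn as nn
from torchvision import models

def build_mobilenet(num_classes=28, strategy="full"):
    """strategy: 'scratch' (random init), 'full' (fine-tune everything),
    or 'partial' (freeze the pretrained feature extractor)."""
    weights = None if strategy == "scratch" else models.MobileNet_V2_Weights.DEFAULT
    net = models.mobilenet_v2(weights=weights)
    if strategy == "partial":
        for p in net.features.parameters():
            p.requires_grad = False          # only the new head is trained
    net.classifier[-1] = nn.Linear(net.classifier[-1].in_features, num_classes)
    return net

model = build_mobilenet(strategy="partial")
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")
```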
zh
[CV-20] A biologically inspired separable learning vision model for real-time traffic object perception in Dark
【速读】:该论文旨在解决低光照交通场景中目标感知的准确性与实时性难题,尤其是在光照严重退化、视觉线索不可靠的情况下,现有感知模型难以快速适应并准确预测。其关键解决方案是提出一种生物启发的可分离学习视觉模型(Separable Learning Vision Model, SLVM),该模型包含四个核心组件:基于光适应瞳孔机制的光照敏感特征提取方法、特征层面的可分离学习策略以实现高效表示、面向多任务的解耦分支结构以及考虑空间错位的融合模块,从而在保持低计算开销的同时显著提升检测、实例分割和光流估计等任务的性能。
链接: https://arxiv.org/abs/2509.05012
作者: Hulin Li,Qiliang Ren,Jun Li,Hanbing Wei,Zheng Liu,Linfang Fan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Fast and accurate object perception in low-light traffic scenes has attracted increasing attention. However, due to severe illumination degradation and the lack of reliable visual cues, existing perception models and methods struggle to quickly adapt to and accurately predict in low-light environments. Moreover, no large-scale benchmark specifically focused on low-light traffic scenes has been available. To bridge this gap, we introduce a physically grounded illumination degradation method tailored to real-world low-light settings and construct Dark-traffic, the largest densely annotated dataset to date for low-light traffic scenes, supporting object detection, instance segmentation, and optical flow estimation. We further propose the Separable Learning Vision Model (SLVM), a biologically inspired framework designed to enhance perception under adverse lighting. SLVM integrates four key components: a light-adaptive pupillary mechanism for illumination-sensitive feature extraction, a feature-level separable learning strategy for efficient representation, task-specific decoupled branches for multi-task separable learning, and a spatial misalignment-aware fusion module for precise multi-feature alignment. Extensive experiments demonstrate that SLVM achieves state-of-the-art performance with reduced computational overhead. Notably, it outperforms RT-DETR by 11.2 percentage points in detection, YOLOv12 by 6.1 percentage points in instance segmentation, and reduces the endpoint error (EPE) of the baseline by 12.37% on Dark-traffic. On the LIS benchmark, the end-to-end trained SLVM surpasses Swin Transformer+EnlightenGAN and ConvNeXt-T+EnlightenGAN by an average of 11 percentage points across key metrics, and exceeds Mask RCNN (with light enhancement) by 3.1 percentage points. The Dark-traffic dataset and complete code are released at this https URL.
zh
[CV-21] Interpretable Deep Transfer Learning for Breast Ultrasound Cancer Detection: A Multi-Dataset Study
【速读】:该论文旨在解决乳腺癌早期诊断中依赖人工判读、效率低且易受主观因素影响的问题,尤其针对超声成像在致密型乳腺组织中的应用局限性。其解决方案的关键在于引入机器学习(ML)与深度学习(DL)技术,构建高精度、可解释的分类模型:通过对比经典ML算法(如支持向量机SVM、K近邻KNN)与卷积神经网络(CNN,包括ResNet-18、EfficientNet-B0、GoogLeNet)在多个公开超声图像数据集(BUSI、BUS-BRA、BrEaST-Lesions USG)上的性能表现,发现ResNet-18在恶性病变识别中达到99.7%的准确率和100%敏感度;同时结合深度特征提取与Grad-CAM可视化方法,显著提升了模型的可解释性,从而为临床部署提供可靠、透明的AI辅助诊断工具。
链接: https://arxiv.org/abs/2509.05004
作者: Mohammad Abbadi,Yassine Himeur,Shadi Atalla,Wathiq Mansoor
机构: University of Dubai (迪拜大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 2 figures and 1 table
Abstract:Breast cancer remains a leading cause of cancer-related mortality among women worldwide. Ultrasound imaging, widely used due to its safety and cost-effectiveness, plays a key role in early detection, especially in patients with dense breast tissue. This paper presents a comprehensive study on the application of machine learning and deep learning techniques for breast cancer classification using ultrasound images. Using datasets such as BUSI, BUS-BRA, and BrEaST-Lesions USG, we evaluate classical machine learning models (SVM, KNN) and deep convolutional neural networks (ResNet-18, EfficientNet-B0, GoogLeNet). Experimental results show that ResNet-18 achieves the highest accuracy (99.7%) and perfect sensitivity for malignant lesions. Classical ML models, though outperformed by CNNs, achieve competitive performance when enhanced with deep feature extraction. Grad-CAM visualizations further improve model transparency by highlighting diagnostically relevant image regions. These findings support the integration of AI-based diagnostic tools into clinical workflows and demonstrate the feasibility of deploying high-performing, interpretable systems for ultrasound-based breast cancer detection.
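摘要中用于提升可解释性的 Grad-CAM 思路很简洁:用目标 logit 对最后一个卷积阶段特征图的梯度做全局平均,作为各通道权重,加权求和后取 ReLU。笔者给出一个最小可运行示意(输入用随机张量代替超声图像):

```python
import torch
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
feats, grads = {}, {}
model.layer4.register_forward_hook(lambda m, i, o: feats.update(a=o))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

x = torch.randn(1, 3, 224, 224)          # stand-in for an ultrasound image
model(x)[0].max().backward()             # gradient of the top logit

w = grads["a"].mean(dim=(2, 3), keepdim=True)   # channel weights (1, C, 1, 1)
cam = torch.relu((w * feats["a"]).sum(dim=1))   # (1, H', W') saliency map
cam = cam / cam.max().clamp(min=1e-8)
print(cam.shape)                         # upsample to 224x224 for overlay
```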
zh
[CV-22] Dual-Domain Perspective on Degradation-Aware Fusion: A VLM-Guided Robust Infrared and Visible Image Fusion Framework
【速读】:该论文旨在解决现有红外-可见光图像融合(Infrared-Visible Image Fusion, IVIF)方法在双源退化场景下性能下降的问题,此类场景中输入图像常因噪声、模糊或低对比度等退化因素导致传统分步预增强与融合流程产生误差累积,从而影响最终融合质量。解决方案的关键在于提出Guided Dual-Domain Fusion (GD²Fusion)框架,其核心创新是将视觉语言模型(Vision-Language Models, VLMs)用于退化感知,并结合频域与空域的联合优化机制:具体而言,Guided Frequency Modality-Specific Extraction (GFMSE)模块在频域内实现退化感知与抑制并提取与融合相关的子带特征;Guided Spatial Modality-Aggregated Fusion (GSMAF)模块则在空域内进行跨模态退化滤波与自适应多源特征聚合,以增强模态互补性与结构一致性。该协同设计显著提升了在复杂退化条件下的融合性能。
链接: https://arxiv.org/abs/2509.05000
作者: Tianpei Zhang,Jufeng Zhao,Yiming Zhu,Guangmang Cui
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Most existing infrared-visible image fusion (IVIF) methods assume high-quality inputs, and therefore struggle to handle dual-source degraded scenarios, typically requiring manual selection and sequential application of multiple pre-enhancement steps. This decoupled pre-enhancement-to-fusion pipeline inevitably leads to error accumulation and performance degradation. To overcome these limitations, we propose Guided Dual-Domain Fusion (GD^2Fusion), a novel framework that synergistically integrates vision-language models (VLMs) for degradation perception with dual-domain (frequency/spatial) joint optimization. Concretely, the designed Guided Frequency Modality-Specific Extraction (GFMSE) module performs frequency-domain degradation perception and suppression and discriminatively extracts fusion-relevant sub-band features. Meanwhile, the Guided Spatial Modality-Aggregated Fusion (GSMAF) module carries out cross-modal degradation filtering and adaptive multi-source feature aggregation in the spatial domain to enhance modality complementarity and structural consistency. Extensive qualitative and quantitative experiments demonstrate that GD^2Fusion achieves superior fusion performance compared with existing algorithms and strategies in dual-source degraded scenarios. The code will be publicly released after acceptance of this paper.
zh
[CV-23] Efficient Video-to-Audio Generation via Multiple Foundation Models Mapper
【速读】:该论文旨在解决视频到音频(Video-to-Audio, V2A)生成任务中训练成本高、性能受限的问题,尤其是传统方法从头训练模型资源消耗大且难以保证语义与时间一致性。解决方案的关键在于提出多基础模型映射器(Multiple Foundation Model Mapper, MFM-Mapper),其创新点包括:1)通过融合双视觉编码器的特征以获取更丰富的语义和时间信息;2)用GPT-2替代线性映射器,提升跨模态特征对齐能力,类比于自回归翻译任务中的映射机制;3)显著降低训练规模(仅需前序方法的16%),同时保持甚至超越大规模模型的性能表现。
链接: https://arxiv.org/abs/2509.04957
作者: Gehui Chen,Guan’an Wang,Xiaowen Huang,Jitao Sang
机构: Beijing Jiaotong University (北京交通大学); Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence (北京市交通数据挖掘与具身智能重点实验室); Key Laboratory of Big Data & Artificial Intelligence in Transportation, Ministry of Education (教育部大数据与交通人工智能重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
Abstract:Recent Video-to-Audio (V2A) generation relies on extracting semantic and temporal features from video to condition generative models. Training these models from scratch is resource intensive. Consequently, leveraging foundation models (FMs) has gained traction due to their cross-modal knowledge transfer and generalization capabilities. One prior work has explored fine-tuning a lightweight mapper network to connect a pre-trained visual encoder with a text-to-audio generation model for V2A. Inspired by this, we introduce the Multiple Foundation Model Mapper (MFM-Mapper). Compared to the previous mapper approach, MFM-Mapper benefits from richer semantic and temporal information by fusing features from dual visual encoders. Furthermore, by replacing a linear mapper with GPT-2, MFM-Mapper improves feature alignment, drawing parallels between cross-modal feature mapping and autoregressive translation tasks. Our MFM-Mapper exhibits remarkable training efficiency. It achieves better semantic and temporal consistency with less training cost, requiring only 16% of the training scale of previous mapper-based work, yet performs competitively with models trained at a much larger scale.
zh
[CV-24] Towards an Accurate and Effective Robot Vision (The Problem of Topological Localization for Mobile Robots)
【速读】:该论文旨在解决移动机器人在办公环境中进行拓扑定位(topological localization)的问题,即如何仅依赖单视角彩色相机获取的图像,在不依赖图像序列时间连续性的前提下准确识别机器人所处位置。解决方案的关键在于系统性地比较多种视觉描述子(如颜色直方图、SIFT、ASIFT、RGB-SIFT 和词袋模型 Bag-of-Visual-Words)及其对应的相似性度量和分类器,并通过标准评估指标与可视化手段验证配置优化的效果。实验结果表明,恰当选择描述子、距离度量和分类器组合可显著提升定位性能,且该方法在ImageCLEF机器人视觉任务中进一步得到验证,具备识别新图像序列最可能位置的能力。
链接: https://arxiv.org/abs/2509.04948
作者: Emanuela Boros
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Master’s thesis
Abstract:Topological localization is a fundamental problem in mobile robotics, since robots must be able to determine their position in order to accomplish tasks. Visual localization and place recognition are challenging due to perceptual ambiguity, sensor noise, and illumination variations. This work addresses topological localization in an office environment using only images acquired with a perspective color camera mounted on a robot platform, without relying on temporal continuity of image sequences. We evaluate state-of-the-art visual descriptors, including Color Histograms, SIFT, ASIFT, RGB-SIFT, and Bag-of-Visual-Words approaches inspired by text retrieval. Our contributions include a systematic, quantitative comparison of these features, distance measures, and classifiers. Performance was analyzed using standard evaluation metrics and visualizations, extending previous experiments. Results demonstrate the advantages of proper configurations of appearance descriptors, similarity measures, and classifiers. The quality of these configurations was further validated in the Robot Vision task of the ImageCLEF evaluation campaign, where the system identified the most likely location of novel image sequences. Future work will explore hierarchical models, ranking methods, and feature combinations to build more robust localization systems, reducing training and runtime while avoiding the curse of dimensionality. Ultimately, this aims toward integrated, real-time localization across varied illumination and longer routes.
zh
[CV-25] UniView: Enhancing Novel View Synthesis From A Single Image By Unifying Reference Features
【速读】:该论文旨在解决单图视角合成(novel view synthesis)任务中因未观测区域存在多重解释而导致的病态问题,现有方法通常依赖模糊先验和输入视图附近的插值,易产生严重失真。其解决方案的关键在于提出一种名为UniView的新模型,通过引入来自相似物体的参考图像提供强先验信息,具体包括:构建检索与增强系统并利用多模态大语言模型(MLLM)筛选符合要求的参考图像;设计一个可插拔的适配器模块,包含多级隔离层以动态生成目标视角所需的参考特征;同时引入解耦三重注意力机制,有效对齐并融合多分支特征,从而在保留原始输入细节的同时提升合成质量。
链接: https://arxiv.org/abs/2509.04932
作者: Haowang Cui,Rui Chen,Tao Luo,Rui Li,Jiaze Wang
机构: Tianjin University (天津大学); China Electronics System Technology Co., Ltd. (中国电子系统技术有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to ACM TOMM
Abstract:The task of synthesizing novel views from a single image is highly ill-posed due to multiple explanations for unobserved areas. Most current methods tend to generate unseen regions from ambiguity priors and interpolation near input views, which often lead to severe distortions. To address this limitation, we propose a novel model dubbed as UniView, which can leverage reference images from a similar object to provide strong prior information during view synthesis. More specifically, we construct a retrieval and augmentation system and employ a multimodal large language model (MLLM) to assist in selecting reference images that meet our requirements. Additionally, a plug-and-play adapter module with multi-level isolation layers is introduced to dynamically generate reference features for the target views. Moreover, in order to preserve the details of an original input image, we design a decoupled triple attention mechanism, which can effectively align and integrate multi-branch features into the synthesis process. Extensive experiments have demonstrated that our UniView significantly improves novel view synthesis performance and outperforms state-of-the-art methods on the challenging datasets.
zh
[CV-26] Evaluating Multiple Instance Learning Strategies for Automated Sebocyte Droplet Counting
【速读】:该论文旨在解决皮肤皮脂腺细胞(sebocyte)中脂滴数量自动量化的问题,传统手动计数方法存在劳动强度大和主观性强的局限。其解决方案的关键在于提出一种基于注意力机制的多实例学习(attention-based multiple instance learning, MIL)框架,利用ResNet-50提取图像特征并引入实例加权策略,以实现从高分辨率图像中准确预测每张切片的脂滴数目。实验表明,尽管简单的袋级聚合(bag-level aggregation)方法在稳定性上优于注意力MIL模型(平均绝对误差MAE分别为5.6 vs 10.7),但后者若结合任务对齐的池化与正则化策略,仍具备提升潜力,为未来自动化分析提供了新思路。
链接: https://arxiv.org/abs/2509.04895
作者: Maryam Adelipour,Gustavo Carneiro,Jeongkwon Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 8 pages, 1 figure, 2 tables
Abstract:Sebocytes are lipid-secreting cells whose differentiation is marked by the accumulation of intracellular lipid droplets, making their quantification a key readout in sebocyte biology. Manual counting is labor-intensive and subjective, motivating automated solutions. Here, we introduce a simple attention-based multiple instance learning (MIL) framework for sebocyte image analysis. Nile Red-stained sebocyte images were annotated into 14 classes according to droplet counts, expanded via data augmentation to about 50,000 cells. Two models were benchmarked: a baseline multi-layer perceptron (MLP) trained on aggregated patch-level counts, and an attention-based MIL model leveraging ResNet-50 features with instance weighting. Experiments using five-fold cross-validation showed that the baseline MLP achieved more stable performance (mean MAE = 5.6) compared with the attention-based MIL, which was less consistent (mean MAE = 10.7) but occasionally superior in specific folds. These findings indicate that simple bag-level aggregation provides a robust baseline for slide-level droplet counting, while attention-based MIL requires task-aligned pooling and regularization to fully realize its potential in sebocyte image analysis.
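摘要中的注意力 MIL 可以概括为“实例特征的可学习凸组合”。以下是笔者补充的 PyTorch 草图(特征维度按 ResNet-50 的 2048 假设,回归头输出脂滴计数;非论文官方实现):

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    """Ilse-et-al.-style attention pooling: the bag embedding is a learned
    convex combination of instances, and the weights show which patches
    drive the predicted droplet count."""
    def __init__(self, in_dim=2048, hidden=128):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 1))
        self.head = nn.Linear(in_dim, 1)        # count as regression

    def forward(self, bag):                     # bag: (N_instances, in_dim)
        a = torch.softmax(self.attn(bag), dim=0)   # (N, 1), sums to 1
        z = (a * bag).sum(dim=0)                   # weighted bag embedding
        return self.head(z), a.squeeze(-1)

bag = torch.randn(50, 2048)                     # e.g., ResNet-50 patch features
pred, weights = AttentionMIL()(bag)
print(pred.shape, float(weights.sum()))         # weights sum to 1
```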
zh
[CV-27] SynGen-Vision: Synthetic Data Generation for training industrial vision models
【速读】:该论文旨在解决工业场景中磨损检测(wear and tear detection)任务中因真实标注数据稀缺而导致的计算机视觉(Computer Vision, CV)模型训练困难问题。其解决方案的关键在于利用视觉语言模型(Vision Language Model, VLM)结合3D仿真与渲染引擎,生成多样化的锈蚀场景合成数据,从而有效支持CV模型的训练。实验表明,基于该方法生成的数据训练出的锈蚀检测模型在真实图像测试集上达到0.87的mAP50指标,显著优于其他方法,且该方案具备良好的可扩展性,适用于其他工业磨损检测场景。
链接: https://arxiv.org/abs/2509.04894
作者: Alpana Dubey,Suma Mani Kuriakose,Nitish Bhardwaj
机构: Accenture Labs(埃森哲实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:We propose an approach to generate synthetic data to train computer vision (CV) models for industrial wear and tear detection. Wear and tear detection is an important CV problem for predictive maintenance tasks in any industry. However, data curation for training such models is expensive and time-consuming due to the unavailability of datasets for different wear and tear scenarios. Our approach employs a vision language model along with a 3D simulation and rendering engine to generate synthetic data for varying rust conditions. We evaluate our approach by training a CV model for rust detection using the generated dataset and testing the trained model on real images of rusted industrial objects. The model trained with the synthetic data generated by our approach outperforms the other approaches with a mAP50 score of 0.87. The approach is customizable and can be easily extended to other industrial wear and tear detection scenarios.
zh
[CV-28] SpiderNets: Estimating Fear Ratings of Spider-Related Images with Vision Models
【速读】:该论文旨在解决如何利用预训练计算机视觉模型准确预测个体对蜘蛛相关图像的恐惧水平,从而为自适应计算机辅助暴露疗法(computerized exposure therapy)提供技术支持。其解决方案的关键在于:通过迁移学习适配三种不同的预训练模型以预测人类恐惧评分(0–100分),并验证了模型在小规模数据集上仍具备较高可解释性与预测性能;同时发现模型性能高度依赖于足够规模的数据集,且其预测依据主要基于与蜘蛛相关的视觉特征,而非无关背景信息,这为开发情绪感知型治疗技术提供了可靠的技术路径和理论支撑。
链接: https://arxiv.org/abs/2509.04889
作者: Dominik Pegler,David Steyrl,Mengfan Zhang,Alexander Karner,Jozsef Arato,Frank Scharnowski,Filip Melinscak
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 60 pages (30 main text, 30 appendix), 20 figures (5 in main text, 15 in appendix)
Abstract:Advances in computer vision have opened new avenues for clinical applications, particularly in computerized exposure therapy where visual stimuli can be dynamically adjusted based on patient responses. As a critical step toward such adaptive systems, we investigated whether pretrained computer vision models can accurately predict fear levels from spider-related images. We adapted three diverse models using transfer learning to predict human fear ratings (on a 0-100 scale) from a standardized dataset of 313 images. The models were evaluated using cross-validation, achieving an average mean absolute error (MAE) between 10.1 and 11.0. Our learning curve analysis revealed that reducing the dataset size significantly harmed performance, though further increases yielded no substantial gains. Explainability assessments showed the models’ predictions were based on spider-related features. A category-wise error analysis further identified visual conditions associated with higher errors (e.g., distant views and artificial/painted spiders). These findings demonstrate the potential of explainable computer vision models in predicting fear ratings, highlighting the importance of both model explainability and a sufficient dataset size for developing effective emotion-aware therapeutic technologies.
zh
[CV-29] Cryo-RL: automating prostate cancer cryoablation planning with reinforcement learning
【速读】:该论文旨在解决前列腺癌冷冻消融(cryoablation)术前规划依赖人工、耗时且易受专家经验影响的问题,从而导致治疗质量不一致和临床可扩展性差。解决方案的关键在于提出一种基于强化学习(reinforcement learning, RL)的框架——Cryo-RL,其将冷冻探针放置规划建模为马尔可夫决策过程(Markov decision process),并在模拟环境中通过奖励函数引导智能体逐步选择探针位置与冰球直径,从而自动学习最优的冷冻消融策略,无需任何手动设计的计划。该方法在583例回顾性病例中显著优于现有几何优化自动化基线(Dice指标提升超8个百分点),并达到人类专家水平,同时大幅减少规划时间。
链接: https://arxiv.org/abs/2509.04886
作者: Trixia Simangan,Ahmed Nadeem Abbasi,Yipeng Hu,Shaheer U. Saeed
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at MICAD (Medical Imaging and Computer-Aided Diagnosis) 2025
Abstract:Cryoablation is a minimally invasive localised treatment for prostate cancer that destroys malignant tissue during de-freezing, while sparing surrounding healthy structures. Its success depends on accurate preoperative planning of cryoprobe placements to fully cover the tumour and avoid critical anatomy. This planning is currently manual, expertise-dependent, and time-consuming, leading to variability in treatment quality and limited scalability. In this work, we introduce Cryo-RL, a reinforcement learning framework that models cryoablation planning as a Markov decision process and learns an optimal policy for cryoprobe placement. Within a simulated environment that models clinical constraints and stochastic intraoperative variability, an agent sequentially selects cryoprobe positions and ice sphere diameters. Guided by a reward function based on tumour coverage, this agent learns a cryoablation strategy that leads to optimal cryoprobe placements without the need for any manually-designed plans. Evaluated on 583 retrospective prostate cancer cases, Cryo-RL achieved over 8 percentage-point Dice improvements compared with the best automated baselines, based on geometric optimisation, and matched human expert performance while requiring substantially less planning time. These results highlight the potential of reinforcement learning to deliver clinically viable, reproducible, and efficient cryoablation plans.
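论文将规划建模为 MDP、以肿瘤覆盖率为奖励。笔者补充一个体素网格上的玩具示意(网格尺寸与球半径均为演示值,与临床参数无关):每一步动作放置一个冰球,奖励为覆盖率增量。

```python
import numpy as np

def sphere_mask(shape, center, radius):
    zz, yy, xx = np.indices(shape)
    return ((zz - center[0]) ** 2 + (yy - center[1]) ** 2
            + (xx - center[2]) ** 2) <= radius ** 2

tumour = sphere_mask((32, 32, 32), (16, 16, 16), 6)   # stand-in lesion
ablated = np.zeros_like(tumour)

def step(center, radius):
    """Place one ice sphere; reward = gain in tumour coverage."""
    global ablated
    before = (tumour & ablated).sum() / tumour.sum()
    ablated = ablated | sphere_mask(tumour.shape, center, radius)
    after = (tumour & ablated).sum() / tumour.sum()
    return after - before

print(step((14, 16, 16), 4), step((19, 16, 16), 4))   # diminishing gains as spheres overlap
```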
zh
[CV-30] CoRe-GS: Coarse-to-Refined Gaussian Splatting with Semantic Object Focus
【速读】:该论文旨在解决移动自主空中机器人在关键应用场景(如远程引导和灾害响应)中对高精度3D重建与快速场景处理的双重需求问题。传统方法通常需要对整个场景进行详细重建,效率低下;而本文提出仅聚焦于感兴趣点(Points of Interest, PoIs),以提升效率并保持质量。解决方案的关键在于提出CoRe-GS框架:首先利用语义3D高斯溅射(Semantic 3D Gaussian Splatting, GS)生成粗粒度可分割的场景表示,随后通过一种新颖的颜色基有效滤波方法实现对目标对象的高效隔离与精细化重建,从而将训练时间缩短至完整语义GS训练周期的约四分之一,同时显著提升新视角合成质量。
链接: https://arxiv.org/abs/2509.04859
作者: Hannah Schieber,Dominik Frischmann,Simon Boche,Victor Schaack,Angela Schoellig,Stefan Leutenegger,Daniel Roth
机构: Technical University of Munich (慕尼黑工业大学); TUM University Hospital (慕尼黑工业大学医院); Munich Institute of Robotics and Machine Intelligence (慕尼黑机器人与智能机器研究所); Mobile Robotics Lab, Department of Mechanical and Process Engineering, ETH Zurich (苏黎世联邦理工学院机械与过程工程系移动机器人实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Mobile reconstruction for autonomous aerial robotics holds strong potential for critical applications such as tele-guidance and disaster response. These tasks demand both accurate 3D reconstruction and fast scene processing. Instead of reconstructing the entire scene in detail, it is often more efficient to focus on specific objects, i.e., points of interest (PoIs). Mobile robots equipped with advanced sensing can usually detect these early during data acquisition or preliminary analysis, reducing the need for full-scene optimization. Gaussian Splatting (GS) has recently shown promise in delivering high-quality novel view synthesis and 3D representation by an incremental learning process. Extending GS with scene editing, semantics adds useful per-splat features to isolate objects effectively. Semantic 3D Gaussian editing can already be achieved before the full training cycle is completed, reducing the overall training time. Moreover, the semantically relevant area, the PoI, is usually already known during capturing. To balance high-quality reconstruction with reduced training time, we propose CoRe-GS. We first generate a coarse segmentation-ready scene with semantic GS and then refine it for the semantic object using our novel color-based effective filtering for effective object isolation. This is speeding up the training process to be about a quarter less than a full training cycle for semantic GS. We evaluate our approach on two datasets, SCRREAM (real-world, outdoor) and NeRDS 360 (synthetic, indoor), showing reduced runtime and higher novel-view-synthesis quality.
zh
[CV-31] Pose-Free 3D Quantitative Phase Imaging of Flowing Cellular Populations
【速读】:该论文旨在解决现有高通量三维定量相位成像(3D quantitative phase imaging, QPI)在流式细胞术中对不规则形状细胞成像受限的问题。当前方法假设细胞仅发生单一轴向的均匀旋转,需已知每帧中的细胞姿态,这限制了其在非球形或复杂旋转细胞中的应用,导致只能分析部分细胞群体,影响统计分析的鲁棒性。解决方案的关键在于提出OmniFHT框架,该框架基于傅里叶衍射定理与隐式神经表示(implicit neural representations, INRs),通过联合优化每个细胞未知的旋转轨迹和体积结构,在弱散射假设下实现任意几何形状和多轴旋转的细胞重建。其连续表示特性支持稀疏投影和有限角度覆盖下的高保真重建,仅需10个视角或120°角范围即可获得高质量结果,首次实现了对整个流动细胞群体的原位、高通量断层成像,为流式细胞术平台提供了可扩展且无偏的无标记形态计量分析方案。
链接: https://arxiv.org/abs/2509.04848
作者: Enze Ye,Wei Lin,Shaochi Ren,Yakun Liu,Xiaoping Li,Hao Wang,He Sun,Feng Pan
机构: Beihang University (北京航空航天大学); Peking University (北京大学); Peking University People’s Hospital (北京大学人民医院); Peking University Third Hospital (北京大学第三医院)
类目: Computer Vision and Pattern Recognition (cs.CV); Biological Physics (physics.bio-ph); Optics (physics.optics); Quantitative Methods (q-bio.QM)
备注: 16 pages, 5 figures
Abstract:High-throughput 3D quantitative phase imaging (QPI) in flow cytometry enables label-free, volumetric characterization of individual cells by reconstructing their refractive index (RI) distributions from multiple viewing angles during flow through microfluidic channels. However, current imaging methods assume that cells undergo uniform, single-axis rotation, which require their poses to be known at each frame. This assumption restricts applicability to near-spherical cells and prevents accurate imaging of irregularly shaped cells with complex rotations. As a result, only a subset of the cellular population can be analyzed, limiting the ability of flow-based assays to perform robust statistical analysis. We introduce OmniFHT, a pose-free 3D RI reconstruction framework that leverages the Fourier diffraction theorem and implicit neural representations (INRs) for high-throughput flow cytometry tomographic imaging. By jointly optimizing each cell’s unknown rotational trajectory and volumetric structure under weak scattering assumptions, OmniFHT supports arbitrary cell geometries and multi-axis rotations. Its continuous representation also allows accurate reconstruction from sparsely sampled projections and restricted angular coverage, producing high-fidelity results with as few as 10 views or only 120 degrees of angular range. OmniFHT enables, for the first time, in situ, high-throughput tomographic imaging of entire flowing cell populations, providing a scalable and unbiased solution for label-free morphometric analysis in flow cytometry platforms.
zh
[CV-32] mporalFlowViz: Parameter-Aware Visual Analytics for Interpreting Scramjet Combustion Evolution
【速读】:该论文旨在解决高超声速冲压发动机(scramjet)燃烧模拟中大规模、高维度时序流场数据的可视化与分析难题,尤其是专家在特征区分、跨案例比较及动态演化理解方面的挑战。其解决方案的关键在于构建一个参数感知的视觉分析工作流 TemporalFlowViz,通过预训练 Vision Transformer 提取流场图像的高维嵌入表示,结合降维与密度聚类识别隐含的燃烧模式,并在嵌入空间中构建时序轨迹以追踪仿真演化过程;同时,利用领域专家对聚类中心的标注作为上下文提示,驱动视觉-语言模型生成自然语言摘要,从而实现从抽象表征到可解释知识的转化,最终支持基于参数过滤、相似性检索和多视图协同探索的深入分析。
链接: https://arxiv.org/abs/2509.04834
作者: Yifei Jia,Shiyu Cheng,Yu Dong,Guan Li,Dong Tian,Ruixiao Peng,Xuyi Lu,Yu Wang,Wei Yao,Guihua Shan
机构: 1. University of Science and Technology of China (中国科学技术大学); 2. Institute of Artificial Intelligence, University of Science and Technology of China (中国科学技术大学人工智能研究院); 3. Alibaba Group (阿里巴巴集团); 4. National Engineering Research Center for Big Data Technology and Application (国家大数据工程技术研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Understanding the complex combustion dynamics within scramjet engines is critical for advancing high-speed propulsion technologies. However, the large scale and high dimensionality of simulation-generated temporal flow field data present significant challenges for visual interpretation, feature differentiation, and cross-case comparison. In this paper, we present TemporalFlowViz, a parameter-aware visual analytics workflow and system designed to support expert-driven clustering, visualization, and interpretation of temporal flow fields from scramjet combustion simulations. Our approach leverages hundreds of simulated combustion cases with varying initial conditions, each producing time-sequenced flow field images. We use pretrained Vision Transformers to extract high-dimensional embeddings from these frames, apply dimensionality reduction and density-based clustering to uncover latent combustion modes, and construct temporal trajectories in the embedding space to track the evolution of each simulation over time. To bridge the gap between latent representations and expert reasoning, domain specialists annotate representative cluster centroids with descriptive labels. These annotations are used as contextual prompts for a vision-language model, which generates natural-language summaries for individual frames and full simulation cases. The system also supports parameter-based filtering, similarity-based case retrieval, and coordinated multi-view exploration to facilitate in-depth analysis. We demonstrate the effectiveness of TemporalFlowViz through two expert-informed case studies and expert feedback, showing TemporalFlowViz enhances hypothesis generation, supports interpretable pattern discovery, and enhances knowledge discovery in large-scale scramjet combustion analysis.
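其中“嵌入 → 降维 → 密度聚类 → 时序轨迹”这一主干可以用几行 scikit-learn 代码勾勒。以下为笔者补充的示意(嵌入用随机向量代替预训练 ViT 特征,降维用 PCA 代替论文可能采用的其他方法,eps/min_samples 为任意演示值):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 768))     # 500 frames x ViT feature dim
xy = PCA(n_components=2).fit_transform(embeddings)
labels = DBSCAN(eps=1.5, min_samples=10).fit_predict(xy)  # latent modes

case_xy = xy[:50]                            # one simulation's 50 frames
trajectory = list(zip(case_xy[:, 0], case_xy[:, 1]))      # time-ordered path
print(np.unique(labels).size, len(trajectory))
```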
zh
[CV-33] PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination ICCV2025
【速读】: This paper targets two key problems in current visual grounding methods: existing end-to-end direct-reference paradigms supervise only the referred object, ignoring the value of prominent prospective targets, and most methods lack multi-granularity discrimination, making robust object identification difficult in complex scenes. The key to the solution is PropVG, the first end-to-end proposal-driven framework to seamlessly combine foreground object proposal generation with referential object comprehension without requiring additional detectors. It further introduces a Contrastive-based Refer Scoring (CRS) module that applies contrastive learning at both the sentence and word levels to better understand and distinguish referred objects, and a Multi-granularity Target Discrimination (MTD) module that fuses object- and semantic-level information to improve recognition of absent targets.
链接: https://arxiv.org/abs/2509.04833
作者: Ming Dai,Wenxuan Cheng,Jiedong Zhuang,Jiang-jiang Liu,Hongshen Zhao,Zhenhua Feng,Wankou Yang
机构: Southeast University (东南大学); Zhejiang University (浙江大学); Baidu VIS (百度视觉智能实验室); Jiangnan University (江南大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ICCV2025
Abstract:Recent advances in visual grounding have largely shifted away from traditional proposal-based two-stage frameworks due to their inefficiency and high computational complexity, favoring end-to-end direct reference paradigms. However, these methods rely exclusively on the referred target for supervision, overlooking the potential benefits of prominent prospective targets. Moreover, existing approaches often fail to incorporate multi-granularity discrimination, which is crucial for robust object identification in complex scenarios. To address these limitations, we propose PropVG, an end-to-end proposal-based framework that, to the best of our knowledge, is the first to seamlessly integrate foreground object proposal generation with referential object comprehension without requiring additional detectors. Furthermore, we introduce a Contrastive-based Refer Scoring (CRS) module, which employs contrastive learning at both sentence and word levels to enhance the capability in understanding and distinguishing referred objects. Additionally, we design a Multi-granularity Target Discrimination (MTD) module that fuses object- and semantic-level information to improve the recognition of absent targets. Extensive experiments on gRefCOCO (GREC/GRES), Ref-ZOM, R-RefCOCO, and RefCOCO (REC/RES) benchmarks demonstrate the effectiveness of PropVG. The codes and models are available at this https URL.
zh
[CV-34] Exploring Non-Local Spatial-Angular Correlations with a Hybrid Mamba-Transformer Framework for Light Field Super-Resolution
【速读】: This paper tackles the inefficient, redundant feature extraction caused by the multi-directional scanning strategies of existing Mamba-based methods for lightweight light field image super-resolution (LFSR), as well as the limitations of state-space models in preserving spatial-angular and disparity information. The key to the solution is a Subspace Simple Scanning (Sub-SS) strategy, on which the Subspace Simple Mamba Block (SSMB) is built for more efficient and precise feature extraction, together with a dual-stage modeling scheme: stage I uses the Spatial-Angular Residual Subspace Mamba Block (SA-RSMB) for shallow spatial-angular feature extraction, while stage II uses a dual-branch parallel structure combining the Epipolar Plane Mamba Block (EPMB) and the Epipolar Plane Transformer Block (EPTB) for deep epipolar feature refinement, thoroughly exploiting non-local spatial-angular correlations. The resulting hybrid Mamba-Transformer framework, LFMT, combines the strengths of Mamba and Transformer and substantially improves LFSR performance while keeping computational complexity low.
链接: https://arxiv.org/abs/2509.04824
作者: Haosong Liu,Xiancheng Zhu,Huanqiang Zeng,Jianqing Zhu,Jiuwen Cao,Junhui Hou
机构: Huaqiao University (华侨大学); Xiamen University of Technology (厦门理工学院); Hangzhou Dianzi University (杭州电子科技大学); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recently, Mamba-based methods, with their advantage in long-range information modeling and linear complexity, have shown great potential in optimizing both computational cost and performance of light field image super-resolution (LFSR). However, current multi-directional scanning strategies lead to inefficient and redundant feature extraction when applied to complex LF data. To overcome this challenge, we propose a Subspace Simple Scanning (Sub-SS) strategy, based on which we design the Subspace Simple Mamba Block (SSMB) to achieve more efficient and precise feature extraction. Furthermore, we propose a dual-stage modeling strategy to address the limitation of state space in preserving spatial-angular and disparity information, thereby enabling a more comprehensive exploration of non-local spatial-angular correlations. Specifically, in stage I, we introduce the Spatial-Angular Residual Subspace Mamba Block (SA-RSMB) for shallow spatial-angular feature extraction; in stage II, we use a dual-branch parallel structure combining the Epipolar Plane Mamba Block (EPMB) and Epipolar Plane Transformer Block (EPTB) for deep epipolar feature refinement. Building upon meticulously designed modules and strategies, we introduce a hybrid Mamba-Transformer framework, termed LFMT. LFMT integrates the strengths of Mamba and Transformer models for LFSR, enabling comprehensive information exploration across spatial, angular, and epipolar-plane domains. Experimental results demonstrate that LFMT significantly outperforms current state-of-the-art methods in LFSR, achieving substantial improvements in performance while maintaining low computational complexity on both real-world and synthetic LF datasets.
zh
[CV-35] Extracting Uncertainty Estimates from Mixtures of Experts for Semantic Segmentation ICCV2025
【速读】: This paper concerns the accuracy and calibration of predictive uncertainty estimates for computer vision models in safety-critical settings such as traffic scene perception. Existing approaches such as ensembles can quantify uncertainty but incur high computational cost. The key to the solution is exploiting the Mixture of Experts (MoE) architecture: without modifying the network, the gating mechanism dynamically weights multiple expert predictions, and reliable uncertainty estimates are extracted via three methods: predictive entropy, mutual information, and expert variance. Experiments show that under out-of-distribution (OOD) data, MoEs yield better conditional correctness metrics than conventional ensembles, that simple gating mechanisms calibrate routing uncertainty better than more complex class-wise gates, and that increasing the number of experts can further improve uncertainty calibration.
链接: https://arxiv.org/abs/2509.04816
作者: Svetlana Pavlitska,Beyza Keskin,Alwin Faßbender,Christian Hubschneider,J. Marius Zöllner
机构: Karlsruhe Institute of Technology (KIT); FZI Research Center for Information Technology
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted for publication at the STREAM workshop at ICCV2025
Abstract:Estimating accurate and well-calibrated predictive uncertainty is important for enhancing the reliability of computer vision models, especially in safety-critical applications like traffic scene perception. While ensemble methods are commonly used to quantify uncertainty by combining multiple models, a mixture of experts (MoE) offers an efficient alternative by leveraging a gating network to dynamically weight expert predictions based on the input. Building on the promising use of MoEs for semantic segmentation in our previous works, we show that well-calibrated predictive uncertainty estimates can be extracted from MoEs without architectural modifications. We investigate three methods to extract predictive uncertainty estimates: predictive entropy, mutual information, and expert variance. We evaluate these methods for an MoE with two experts trained on a semantical split of the A2D2 dataset. Our results show that MoEs yield more reliable uncertainty estimates than ensembles in terms of conditional correctness metrics under out-of-distribution (OOD) data. Additionally, we evaluate routing uncertainty computed via gate entropy and find that simple gating mechanisms lead to better calibration of routing uncertainty estimates than more complex classwise gates. Finally, our experiments on the Cityscapes dataset suggest that increasing the number of experts can further enhance uncertainty calibration. Our code is available at this https URL.
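The three uncertainty measures named above are standard information-theoretic quantities. The sketch below computes them from per-expert class probabilities, assuming a two-expert MoE with a uniform gate; the shapes and gate weights are simplifications, not the paper's exact setup.

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    return -np.sum(p * np.log(p + eps), axis=axis)

rng = np.random.default_rng(0)
experts = rng.dirichlet(np.ones(19), size=(2, 4096))  # (experts E, pixels N, classes C)
gate = np.full(2, 0.5)                                # uniform gate weights (assumption)

mixture = np.einsum('e,enc->nc', gate, experts)       # gated mixture prediction
pred_entropy = entropy(mixture)                       # total predictive uncertainty
expected_entropy = np.einsum('e,en->n', gate, entropy(experts))
mutual_info = pred_entropy - expected_entropy         # disagreement between experts
expert_variance = experts.var(axis=0).mean(axis=-1)   # per-pixel variance over experts

print(pred_entropy.mean(), mutual_info.mean(), expert_variance.mean())
```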
zh
[CV-36] Toward Accessible Dermatology: Skin Lesion Classification Using Deep Learning Models on Mobile-Acquired Images
【速读】: This paper addresses the limited accessibility of skin disease diagnosis in low-resource settings, where conventional methods are costly, complex, and hard to obtain. The key to the solution is curating a large mobile-acquired dataset covering more than 50 skin disease categories and classifying it with Transformer-based models, especially the Swin Transformer, whose ability to capture global contextual features yields clearly superior performance. Gradient-weighted Class Activation Mapping (Grad-CAM) is further incorporated to improve interpretability, providing an efficient, transparent AI-assisted diagnosis scheme for on-device skin lesion recognition in resource-limited scenarios.
链接: https://arxiv.org/abs/2509.04800
作者: Asif Newaz,Masum Mushfiq Ishti,A Z M Ashraful Azam,Asif Ur Rahman Adib
机构: Islamic University of Technology (伊斯兰科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Under Review in ICSigSys 2025
Abstract:Skin diseases are among the most prevalent health concerns worldwide, yet conventional diagnostic methods are often costly, complex, and unavailable in low-resource settings. Automated classification using deep learning has emerged as a promising alternative, but existing studies are mostly limited to dermoscopic datasets and a narrow range of disease classes. In this work, we curate a large dataset of over 50 skin disease categories captured with mobile devices, making it more representative of real-world conditions. We evaluate multiple convolutional neural networks and Transformer-based architectures, demonstrating that Transformer models, particularly the Swin Transformer, achieve superior performance by effectively capturing global contextual features. To enhance interpretability, we incorporate Gradient-weighted Class Activation Mapping (Grad-CAM), which highlights clinically relevant regions and provides transparency in model predictions. Our results underscore the potential of Transformer-based approaches for mobile-acquired skin lesion classification, paving the way toward accessible AI-assisted dermatological screening and early diagnosis in resource-limited environments.
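Grad-CAM itself is a well-known procedure; below is a minimal PyTorch sketch of the interpretability step described above, using an untrained resnet18 and an arbitrary class index as stand-ins for the paper's Swin-based skin-disease classifier (attention-based backbones need adapted layer choices).

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()   # untrained; used for shape correctness only
feats, grads = {}, {}
model.layer4.register_forward_hook(lambda m, i, o: feats.update(a=o))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

x = torch.randn(1, 3, 224, 224)
score = model(x)[0, 5]                  # logit of a hypothetical target class
score.backward()

w = grads['a'].mean(dim=(2, 3), keepdim=True)   # GAP of gradients -> channel weights
cam = F.relu((w * feats['a']).sum(dim=1))       # weighted sum of feature maps
cam = F.interpolate(cam[None], size=(224, 224), mode='bilinear')[0, 0]
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
print(cam.shape)                                # (224, 224) class-relevance heatmap
```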
zh
[CV-37] Comparative Evaluation of Traditional and Deep Learning Feature Matching Algorithms using Chandrayaan-2 Lunar Data
【速读】: This paper addresses accurate registration of multi-source sensor images (optical, hyperspectral, and radar) in lunar exploration, which is essential for surface mapping, resource localization, and mission planning. Differences in resolution, illumination, and imaging distortion across sensors challenge traditional registration methods. The key to the solution is a systematic preprocessing pipeline (georeferencing, resolution alignment, intensity normalization, plus enhancements such as adaptive histogram equalization, principal component analysis, and shadow correction) combined with the deep-learning-based matcher SuperGlue, which achieves the lowest root mean square error and fastest runtime for cross-modality registration, clearly outperforming classical feature matchers (e.g., SIFT, AKAZE) and showing stronger robustness under the difficult illumination conditions of polar regions.
链接: https://arxiv.org/abs/2509.04775
作者: R. Makharia,J. G. Singla,Amitabh,N. Dube,H. Sharma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages, 11 figures, 3 tables
Abstract:Accurate image registration is critical for lunar exploration, enabling surface mapping, resource localization, and mission planning. Aligning data from diverse lunar sensors – optical (e.g., Orbital High Resolution Camera, Narrow and Wide Angle Cameras), hyperspectral (Imaging Infrared Spectrometer), and radar (e.g., Dual-Frequency Synthetic Aperture Radar, Selene/Kaguya mission) – is challenging due to differences in resolution, illumination, and sensor distortion. We evaluate five feature matching algorithms: SIFT, ASIFT, AKAZE, RIFT2, and SuperGlue (a deep learning-based matcher), using cross-modality image pairs from equatorial and polar regions. A preprocessing pipeline is proposed, including georeferencing, resolution alignment, intensity normalization, and enhancements like adaptive histogram equalization, principal component analysis, and shadow correction. SuperGlue consistently yields the lowest root mean square error and fastest runtimes. Classical methods such as SIFT and AKAZE perform well near the equator but degrade under polar lighting. The results highlight the importance of preprocessing and learning-based approaches for robust lunar image registration across diverse conditions.
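As a concrete example of the classical side of the comparison, the sketch below runs SIFT with a Lowe ratio test and reports RMSE after a RANSAC homography, the same style of metric used above; synthetic images and all thresholds are assumptions standing in for the Chandrayaan-2 pairs.

```python
import numpy as np
import cv2

np.random.seed(0)
img1 = cv2.GaussianBlur(np.random.randint(0, 256, (512, 512)).astype(np.uint8), (5, 5), 0)
img2 = cv2.warpAffine(img1, np.float32([[1, 0, 12], [0, 1, -7]]), (512, 512))  # known shift

sift = cv2.SIFT_create()
k1, d1 = sift.detectAndCompute(img1, None)
k2, d2 = sift.detectAndCompute(img2, None)

good = [m for m, n in cv2.BFMatcher().knnMatch(d1, d2, k=2)
        if m.distance < 0.75 * n.distance]                    # Lowe ratio test

src = np.float32([k1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([k2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

rmse = np.sqrt(np.mean(np.sum((cv2.perspectiveTransform(src, H) - dst) ** 2, axis=2)))
print(f"{len(good)} matches, RMSE = {rmse:.2f} px")
```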
zh
[CV-38] Hybrid-Tower: Fine-grained Pseudo-query Interaction and Generation for Text-to-Video Retrieval ICCV2025
【速读】: This paper addresses the difficulty of balancing effectiveness and efficiency in text-to-video retrieval (T2VR): Two-Tower frameworks are efficient but less effective, while Single-Tower frameworks are effective but computationally expensive. The key to the solution is a new Hybrid-Tower framework and a method called Fine-grained Pseudo-query Interaction and Generation (PIG): a pseudo-query is generated for each video so that video features can interact with the pseudo-query's textual features at a fine-grained level, achieving high effectiveness even before the real text query arrives, while introducing no extra storage or computation at inference time and thus retaining the Two-Tower framework's high efficiency. Experiments on multiple benchmarks show significant gains (1.6% to 3.9% in R@1) and near state-of-the-art performance at Two-Tower-level efficiency.
链接: https://arxiv.org/abs/2509.04773
作者: Bangxiang Lan,Ruobing Xie,Ruixiang Zhao,Xingwu Sun,Zhanhui Kang,Gang Yang,Xirong Li
机构: Renmin University of China (中国人民大学); Large Language Model Department, Tencent (腾讯大语言模型部门)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV2025
Abstract:The Text-to-Video Retrieval (T2VR) task aims to retrieve unlabeled videos by textual queries with the same semantic meanings. Recent CLIP-based approaches have explored two frameworks: Two-Tower versus Single-Tower framework, yet the former suffers from low effectiveness, while the latter suffers from low efficiency. In this study, we explore a new Hybrid-Tower framework that can hybridize the advantages of the Two-Tower and Single-Tower framework, achieving high effectiveness and efficiency simultaneously. We propose a novel hybrid method, Fine-grained Pseudo-query Interaction and Generation for T2VR, i.e., PIG, which includes a new pseudo-query generator designed to generate a pseudo-query for each video. This enables the video feature and the textual features of the pseudo-query to interact in a fine-grained manner, similar to the Single-Tower approaches to hold high effectiveness, even before the real textual query is received. Simultaneously, our method introduces no additional storage or computational overhead compared to the Two-Tower framework during the inference stage, thus maintaining high efficiency. Extensive experiments on five commonly used text-video retrieval benchmarks demonstrate that our method achieves a significant improvement over the baseline, with an increase of 1.6% to 3.9% in R@1. Furthermore, our method matches the efficiency of Two-Tower models while achieving near state-of-the-art performance, highlighting the advantages of the Hybrid-Tower framework.
zh
[CV-39] FloodVision: Urban Flood Depth Estimation Using Foundation Vision-Language Models and Domain Knowledge Graph
【速读】: This paper addresses the limited accuracy and cross-scene generalization of floodwater depth estimation: existing computer vision methods depend on fixed object detectors and task-specific training, making them hard to adapt to diverse flood scenarios. The key to the solution is FloodVision, which combines the semantic reasoning of the foundation vision-language model GPT-4o with a structured domain knowledge graph: it dynamically identifies visible reference objects in RGB images, retrieves verified physical heights from the knowledge graph to reduce hallucination, computes submergence ratios, and applies statistical outlier filtering, thereby achieving accurate depth estimation with zero-shot generalization.
链接: https://arxiv.org/abs/2509.04772
作者: Zhangding Liu,Neda Mohammadi,John E. Taylor
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Timely and accurate floodwater depth estimation is critical for road accessibility and emergency response. While recent computer vision methods have enabled flood detection, they suffer from both accuracy limitations and poor generalization due to dependence on fixed object detectors and task-specific training. To enable accurate depth estimation that can generalize across diverse flood scenarios, this paper presents FloodVision, a zero-shot framework that combines the semantic reasoning abilities of the foundation vision-language model GPT-4o with a structured domain knowledge graph. The knowledge graph encodes canonical real-world dimensions for common urban objects including vehicles, people, and infrastructure elements to ground the model’s reasoning in physical reality. FloodVision dynamically identifies visible reference objects in RGB images, retrieves verified heights from the knowledge graph to mitigate hallucination, estimates submergence ratios, and applies statistical outlier filtering to compute final depth values. Evaluated on 110 crowdsourced images from MyCoast New York, FloodVision achieves a mean absolute error of 8.17 cm, reducing the GPT-4o baseline of 10.28 cm by 20.5% and surpassing prior CNN-based methods. The system generalizes well across varying scenes and operates in near real-time, making it suitable for future integration into digital twin platforms and citizen-reporting apps for smart city flood resilience.
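The depth computation described above reduces to "canonical height times submergence ratio, then outlier filtering." The sketch below illustrates that arithmetic; the object names and heights are invented placeholders, and a median-absolute-deviation filter stands in for the paper's unspecified statistical filter.

```python
import statistics

# Hypothetical knowledge-graph entries: canonical heights in cm.
KNOWLEDGE_GRAPH_CM = {"sedan_wheel": 66, "adult_knee": 50, "curb": 15}

detections = [  # (reference object, fraction of its height judged submerged)
    ("sedan_wheel", 0.45), ("adult_knee", 0.55), ("curb", 1.0), ("sedan_wheel", 1.9),
]
depths = [KNOWLEDGE_GRAPH_CM[name] * ratio for name, ratio in detections]

# Median-absolute-deviation filter as a stand-in for the statistical outlier step.
med = statistics.median(depths)
mad = statistics.median([abs(d - med) for d in depths])
kept = [d for d in depths if abs(d - med) <= 3 * mad]

print(f"raw={depths} -> filtered depth = {sum(kept) / len(kept):.1f} cm")
```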
zh
[CV-40] Dynamic Group Detection using VLM-augmented Temporal Groupness Graph ICCV2025
【速读】: This paper addresses dynamic human group detection in videos, especially when complex group structures change over time. Traditional methods typically assume groups remain unchanged throughout a video and thus cannot handle members joining, leaving, or regrouping. The key to the solution is combining local appearance features with global scene context: a Vision-Language Model (VLM) augmented for group detection extracts local and global features in each frame, and a graph-based global optimization enforces consistency over the groupness probabilities of all frames, enabling stable detection of dynamically changing groups.
链接: https://arxiv.org/abs/2509.04758
作者: Kaname Yokoyama,Chihiro Nakatani,Norimichi Ukita
机构: Toyota Technological Institute (丰田工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, Accepted to ICCV2025
Abstract:This paper proposes dynamic human group detection in videos. For detecting complex groups, not only the local appearance features of in-group members but also the global context of the scene are important. Such local and global appearance features in each frame are extracted using a Vision-Language Model (VLM) augmented for group detection in our method. For further improvement, the group structure should be consistent over time. While previous methods are stabilized on the assumption that groups are not changed in a video, our method detects dynamically changing groups by global optimization using a graph with all frames’ groupness probabilities estimated by our groupness-augmented CLIP features. Our experimental results demonstrate that our method outperforms state-of-the-art group detection methods on public datasets. Code: this https URL
zh
[CV-41] MCANet: A Multi-Scale Class-Specific Attention Network for Multi-Label Post-Hurricane Damage Assessment using UAV Imagery
【速读】: This paper addresses the shortcomings of existing CNN-based post-hurricane damage assessment methods, which struggle to capture multi-scale spatial features and to distinguish visually similar or co-occurring damage types. The key to the solution is the MCANet framework with two core designs: a Res2Net-based hierarchical backbone that enriches spatial context across scales, and a multi-head class-specific residual attention module in which each attention branch focuses on a different spatial granularity, balancing local detail with global context and markedly improving discrimination of difficult damage categories.
链接: https://arxiv.org/abs/2509.04757
作者: Zhangding Liu,Neda Mohammadi,John E. Taylor
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 34 pages, 7 figures
Abstract:Rapid and accurate post-hurricane damage assessment is vital for disaster response and recovery. Yet existing CNN-based methods struggle to capture multi-scale spatial features and to distinguish visually similar or co-occurring damage types. To address these issues, we propose MCANet, a multi-label classification framework that learns multi-scale representations and adaptively attends to spatially relevant regions for each damage category. MCANet employs a Res2Net-based hierarchical backbone to enrich spatial context across scales and a multi-head class-specific residual attention module to enhance discrimination. Each attention branch focuses on different spatial granularities, balancing local detail with global context. We evaluate MCANet on the RescueNet dataset of 4,494 UAV images collected after Hurricane Michael. MCANet achieves a mean average precision (mAP) of 91.75%, outperforming ResNet, Res2Net, VGG, MobileNet, EfficientNet, and ViT. With eight attention heads, performance further improves to 92.35%, boosting average precision for challenging classes such as Road Blocked by over 6%. Class activation mapping confirms MCANet’s ability to localize damage-relevant regions, supporting interpretability. Outputs from MCANet can inform post-disaster risk mapping, emergency routing, and digital twin-based disaster response. Future work could integrate disaster-specific knowledge graphs and multimodal large language models to improve adaptability to unseen disasters and enrich semantic understanding for real-world decision-making.
zh
[CV-42] WatchHAR: Real-time On-device Human Activity Recognition System for Smartwatches
【速读】: This paper addresses the fact that fine-grained smartwatch-based human activity recognition (HAR) in unconstrained environments still cannot run entirely on-device, forcing reliance on external processing that raises privacy risks and latency. The key to the solution is the WatchHAR system, which optimizes the entire pipeline through a new architecture that unifies sensor-data preprocessing and inference into a single end-to-end trainable module, achieving 5x faster processing while maintaining over 90% accuracy, with latencies of only 9.3 ms for activity event detection and 11.8 ms for multimodal activity classification, advancing smartwatches as standalone, privacy-aware, minimally invasive devices for continuous activity monitoring.
链接: https://arxiv.org/abs/2509.04736
作者: Taeyoung Yeon,Vasco Xu,Henry Hoffmann,Karan Ahuja
机构: Northwestern University (西北大学); University of Chicago (芝加哥大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 4 figures, ICMI '25 (27th International Conference on Multimodal Interaction), October 13-17, 2025, Canberra, ACT, Australia
Abstract:Despite advances in practical and multimodal fine-grained Human Activity Recognition (HAR), a system that runs entirely on smartwatches in unconstrained environments remains elusive. We present WatchHAR, an audio and inertial-based HAR system that operates fully on smartwatches, addressing privacy and latency issues associated with external data processing. By optimizing each component of the pipeline, WatchHAR achieves compounding performance gains. We introduce a novel architecture that unifies sensor data preprocessing and inference into an end-to-end trainable module, achieving 5x faster processing while maintaining over 90% accuracy across more than 25 activity classes. WatchHAR outperforms state-of-the-art models for event detection and activity classification while running directly on the smartwatch, achieving 9.3 ms processing time for activity event detection and 11.8 ms for multimodal activity classification. This research advances on-device activity recognition, realizing smartwatches’ potential as standalone, privacy-aware, and minimally-invasive continuous activity tracking devices.
zh
[CV-43] Enhancing Self-Driving Segmentation in Adverse Weather Conditions: A Dual Uncertainty-Aware Training Approach to SAM Optimization
【速读】: This paper addresses the degraded segmentation performance of vision foundation models such as SAM2 under adverse weather, where high visual ambiguity exposes their lack of uncertainty quantification. The key to the solution is explicit uncertainty modeling: first, a multi-step finetuning procedure that embeds uncertainty metrics directly into the loss function to improve scene recognition; second, adapting the Uncertainty-Aware Adapter (UAT), originally developed for medical image segmentation, to driving scenes. Experiments show that both approaches markedly improve robustness in extreme weather and diverse driving environments: UAT-SAM outperforms standard SAM in adverse weather, while SAM2 trained with the uncertainty-aware loss performs better across varied driving scenes.
链接: https://arxiv.org/abs/2509.04735
作者: Dharsan Ravindran,Kevin Wang,Zhuoyuan Cao,Saleh Abdelrahman,Jeffery Wu
机构: Queen’s University (皇后大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in vision foundation models, such as the Segment Anything Model (SAM) and its successor SAM2, have achieved state-of-the-art performance on general image segmentation benchmarks. However, these models struggle in adverse weather conditions where visual ambiguity is high, largely due to their lack of uncertainty quantification. Inspired by progress in medical imaging, where uncertainty-aware training has improved reliability in ambiguous cases, we investigate two approaches to enhance segmentation robustness for autonomous driving. First, we introduce a multi-step finetuning procedure for SAM2 that incorporates uncertainty metrics directly into the loss function, improving overall scene recognition. Second, we adapt the Uncertainty-Aware Adapter (UAT), originally designed for medical image segmentation, to driving contexts. We evaluate both methods on CamVid, BDD100K, and GTA driving datasets. Experiments show that UAT-SAM outperforms standard SAM in extreme weather, while SAM2 with uncertainty-aware loss achieves improved performance across diverse driving scenes. These findings underscore the value of explicit uncertainty modeling for safety-critical autonomous driving in challenging environments.
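One simple way to fold an uncertainty metric directly into a segmentation loss, in the spirit of the first approach above, is to add a predictive-entropy term to cross-entropy; the sketch below shows that form, with the weighting scheme and lambda as assumptions rather than the paper's actual SAM2 objective.

```python
import torch
import torch.nn.functional as F

def uncertainty_aware_loss(logits, target, lam=0.1):
    """Cross-entropy plus a penalty on predictive entropy (assumed form)."""
    ce = F.cross_entropy(logits, target)
    probs = logits.softmax(dim=1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)  # (B, H, W)
    return ce + lam * entropy.mean()

logits = torch.randn(2, 19, 64, 64, requires_grad=True)  # (batch, classes, H, W)
target = torch.randint(0, 19, (2, 64, 64))
loss = uncertainty_aware_loss(logits, target)
loss.backward()
print(loss.item())
```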
zh
[CV-44] Beyond I-Con: Exploring New Dimension of Distance Measures in Representation Learning
【速读】: This paper addresses limitations in how loss functions are designed for representation learning: KL-divergence-based objectives may be misaligned with the true task objective, and KL's asymmetry and unboundedness create optimization difficulties. The key to the solution is the Beyond I-Con framework, which systematically explores alternative statistical divergences and similarity kernels to discover better loss functions. Concretely: replacing KL with the bounded total variation (TV) distance in the PMI algorithm improves unsupervised clustering; replacing KL and the angular kernel with TV and a distance-based similarity kernel improves supervised contrastive learning; and replacing KL with a bounded f-divergence in dimensionality reduction yields better downstream performance than t-SNE.
链接: https://arxiv.org/abs/2509.04734
作者: Jasmine Shone,Shaden Alshammari,Mark Hamilton,Zhening Li,William Freeman
机构: Massachusetts Institute of Technology (麻省理工学院); Microsoft; Google(谷歌)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The Information Contrastive (I-Con) framework revealed that over 23 representation learning methods implicitly minimize KL divergence between data and learned distributions that encode similarities between data points. However, a KL-based loss may be misaligned with the true objective, and properties of KL divergence such as asymmetry and unboundedness may create optimization challenges. We present Beyond I-Con, a framework that enables systematic discovery of novel loss functions by exploring alternative statistical divergences and similarity kernels. Key findings: (1) on unsupervised clustering of DINO-ViT embeddings, we achieve state-of-the-art results by modifying the PMI algorithm to use total variation (TV) distance; (2) on supervised contrastive learning, we outperform the standard approach by using TV and a distance-based similarity kernel instead of KL and an angular kernel; (3) on dimensionality reduction, we achieve superior qualitative results and better performance on downstream tasks than SNE by replacing KL with a bounded f-divergence. Our results highlight the importance of considering divergence and similarity kernel choices in representation learning optimization.
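The substitution at the heart of finding (1) is easy to see numerically: total variation distance is symmetric and bounded in [0, 1], while KL divergence is neither. A small demonstration:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def tv(p, q):
    return 0.5 * float(np.abs(p - q).sum())

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.2, 0.7])
print(kl(p, q), kl(q, p))  # asymmetric; diverges as any q_i -> 0
print(tv(p, q), tv(q, p))  # symmetric, always in [0, 1]
```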
zh
[CV-45] Exploiting Unlabeled Structures through Task Consistency Training for Versatile Medical Image Segmentation
【速读】: This paper addresses class imbalance in versatile medical image segmentation (VMIS) caused by partially labeled datasets (PLDs). Existing methods typically rely on extra models to generate pseudo full labels, which introduces label noise and can degrade performance. The key to the solution is the Task Consistency Training (TCT) framework: a backbone with a main segmentation head (MSH) and multiple auxiliary task heads (ATHs) enforces a consistency constraint between MSH and ATH predictions, effectively exploiting unlabeled anatomical structures; a filtering strategy excludes low-consistency, potentially noisy samples to avoid error propagation; and a unified auxiliary uncertainty-weighted loss (UAUWL) mitigates the decline in segmentation quality caused by the dominance of specific tasks.
链接: https://arxiv.org/abs/2509.04732
作者: Shengqian Zhu,Jiafei Wu,Xiaogang Xu,Chengrong Yu,Ying Song,Zhang Yi,Guangjun Li,Junjie Hu
机构: Sichuan University (四川大学); University of Hong Kong (香港大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Versatile medical image segmentation (VMIS) targets the segmentation of multiple classes, while obtaining full annotations for all classes is often impractical due to the time and labor required. Leveraging partially labeled datasets (PLDs) presents a promising alternative; however, current VMIS approaches face significant class imbalance due to the unequal category distribution in PLDs. Existing methods attempt to address this by generating pseudo-full labels. Nevertheless, these typically require additional models and often result in potential performance degradation from label noise. In this work, we introduce a Task Consistency Training (TCT) framework to address class imbalance without requiring extra models. TCT includes a backbone network with a main segmentation head (MSH) for multi-channel predictions and multiple auxiliary task heads (ATHs) for task-specific predictions. By enforcing a consistency constraint between the MSH and ATH predictions, TCT effectively utilizes unlabeled anatomical structures. To avoid error propagation from low-consistency, potentially noisy data, we propose a filtering strategy to exclude such data. Additionally, we introduce a unified auxiliary uncertainty-weighted loss (UAUWL) to mitigate segmentation quality declines caused by the dominance of specific tasks. Extensive experiments on eight abdominal datasets from diverse clinical sites demonstrate our approach’s effectiveness.
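A minimal sketch of the MSH-ATH consistency constraint with low-consistency filtering might look as follows; the KL form, shapes, and median threshold are illustrative assumptions, not the paper's exact losses.

```python
import torch

torch.manual_seed(0)
msh_logits = torch.randn(4, 8, 32, 32)  # main head: (batch, classes, H, W)
ath_logits = torch.randn(4, 8, 32, 32)  # one auxiliary head, same classes here

msh_prob = msh_logits.softmax(1)
ath_logp = ath_logits.log_softmax(1)

# Per-sample consistency: KL(MSH || ATH), averaged over all pixels.
kl = (msh_prob * (msh_prob.clamp_min(1e-8).log() - ath_logp)).sum(1).mean(dim=(1, 2))

keep = kl < kl.median()            # drop low-consistency (high-KL) samples
consistency_loss = kl[keep].mean()
print(kl.tolist(), consistency_loss.item())
```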
zh
[CV-46] CD-Mamba: Cloud detection with long-range spatial dependency modeling
【速读】: This paper addresses the data integrity and reliability problems caused by cloud cover in remote sensing imagery, where the core challenge is capturing both the short-range spatial redundancy and the long-range atmospheric similarity of cloud patches. The key to the solution is CD-Mamba, a hybrid model that combines convolutional neural networks with state-space modeling to jointly capture local pixel-level texture detail and long-range patch-level dependencies, improving cloud detection accuracy across spatial scales.
链接: https://arxiv.org/abs/2509.04729
作者: Tianxiang Xue,Jiayi Zhao,Jingsheng Li,Changlu Chen,Kun Zhan
机构: Lanzhou University (兰州大学); City University of Macau (澳门城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Journal of Applied Remote Sensing
Abstract:Remote sensing images are frequently obscured by cloud cover, posing significant challenges to data integrity and reliability. Effective cloud detection requires addressing both short-range spatial redundancies and long-range atmospheric similarities among cloud patches. Convolutional neural networks are effective at capturing local spatial dependencies, while Mamba has strong capabilities in modeling long-range dependencies. To fully leverage both local spatial relations and long-range dependencies, we propose CD-Mamba, a hybrid model that integrates convolution and Mamba’s state-space modeling into a unified cloud detection network. CD-Mamba is designed to comprehensively capture pixel-wise textural details and long-term patch-wise dependencies for cloud detection. This design enables CD-Mamba to manage both pixel-wise interactions and extensive patch-wise dependencies simultaneously, improving detection accuracy across diverse spatial scales. Extensive experiments validate the effectiveness of CD-Mamba and demonstrate its superior performance over existing methods.
zh
[CV-47] STADI: Fine-Grained Step-Patch Diffusion Parallelism for Heterogeneous GPUs
【速读】: This paper addresses workload imbalance in parallel diffusion-model inference on heterogeneous multi-GPU environments, where differing hardware capability or background tasks degrade resource utilization and inference efficiency. The key to the solution is the Spatio-Temporal Adaptive Diffusion Inference (STADI) framework, whose hybrid scheduler enables fine-grained parallelism along both temporal and spatial dimensions: temporally, a computation-aware step allocator applied after the warmup phase uses a least-common-multiple-minimizing quantization technique to reduce the denoising steps on slower GPUs and the synchronization overhead; spatially, an elastic patch parallelism mechanism assigns variably sized image patches to GPUs according to their compute capability, complemented by a spatial load-balancing mechanism that markedly reduces GPU idle time. Experiments show that STADI cuts end-to-end inference latency by up to 45% compared with the state-of-the-art Patch Parallelism and effectively mitigates performance bottlenecks in heterogeneous settings.
链接: https://arxiv.org/abs/2509.04719
作者: Han Liang,Jiahui Zhou,Zicheng Zhou,Xiaoxi Zhang,Xu Chen
机构: Sun Yat-sen University (中山大学)
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The escalating adoption of diffusion models for applications such as image generation demands efficient parallel inference techniques to manage their substantial computational cost. However, existing diffusion parallelism inference schemes often underutilize resources in heterogeneous multi-GPU environments, where varying hardware capabilities or background tasks cause workload imbalance. This paper introduces Spatio-Temporal Adaptive Diffusion Inference (STADI), a novel framework to accelerate diffusion model inference in such settings. At its core is a hybrid scheduler that orchestrates fine-grained parallelism across both temporal and spatial dimensions. Temporally, STADI introduces a novel computation-aware step allocator applied after warmup phases, using a least-common-multiple-minimizing quantization technique to reduce denoising steps on slower GPUs and execution synchronization. To further minimize GPU idle periods, STADI executes an elastic patch parallelism mechanism that allocates variably sized image patches to GPUs according to their computational capability, ensuring balanced workload distribution through a complementary spatial mechanism. Extensive experiments on both load-imbalanced and heterogeneous multi-GPU clusters validate STADI’s efficacy, demonstrating improved load balancing and mitigation of performance bottlenecks. Compared to patch parallelism, a state-of-the-art diffusion inference framework, our method significantly reduces end-to-end inference latency by up to 45% and significantly improves resource utilization on heterogeneous GPUs.
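As a toy version of the computation-aware step allocator, the sketch below scales each GPU's denoising steps by its relative throughput and snaps them to a common quantum so that synchronization points line up; this snapping rule only approximates the paper's LCM-minimizing quantization, and all numbers are assumptions.

```python
from math import gcd
from functools import reduce

def allocate_steps(total_steps, throughputs, quantum=5):
    """Scale steps by relative speed, snapped to a common quantum."""
    fastest = max(throughputs)
    return [max(quantum, round(total_steps * t / fastest / quantum) * quantum)
            for t in throughputs]

steps = allocate_steps(total_steps=50, throughputs=[1.0, 0.6, 0.3])
sync = reduce(gcd, steps)  # coarsest grid on which all devices can synchronize
print(steps, "common sync granularity:", sync)  # [50, 30, 15] 5
```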
zh
[CV-48] Domain Adaptation for Different Sensor Configurations in 3D Object Detection
【速读】: This paper addresses the performance degradation of 3D object detection models across different sensor configurations: when a model trained on one LiDAR configuration is transferred to another, shifts in the point cloud distribution create a domain gap. Existing work mainly targets environmental domain gaps or density variation within a single LiDAR, leaving cross-configuration domain adaptation largely unexplored. The key to the solution is two techniques: Downstream Fine-tuning, i.e., dataset-specific fine-tuning after multi-dataset training, and Partial Layer Fine-tuning, which updates only a small subset of model parameters to improve cross-configuration generalization. Experiments show that combining the two strategies significantly improves adaptability and detection accuracy across vehicle platforms.
链接: https://arxiv.org/abs/2509.04711
作者: Satoshi Tanaka,Kok Seang Tan,Isamu Yamashita
机构: TIER IV, Inc
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Recent advances in autonomous driving have underscored the importance of accurate 3D object detection, with LiDAR playing a central role due to its robustness under diverse visibility conditions. However, different vehicle platforms often deploy distinct sensor configurations, causing performance degradation when models trained on one configuration are applied to another because of shifts in the point cloud distribution. Prior work on multi-dataset training and domain adaptation for 3D object detection has largely addressed environmental domain gaps and density variation within a single LiDAR; in contrast, the domain gap for different sensor configurations remains largely unexplored. In this work, we address domain adaptation across different sensor configurations in 3D object detection. We propose two techniques: Downstream Fine-tuning (dataset-specific fine-tuning after multi-dataset training) and Partial Layer Fine-tuning (updating only a subset of layers to improve cross-configuration generalization). Using paired datasets collected in the same geographic region with multiple sensor configurations, we show that joint training with Downstream Fine-tuning and Partial Layer Fine-tuning consistently outperforms naive joint training for each configuration. Our findings provide a practical and scalable solution for adapting 3D object detection models to the diverse vehicle platforms.
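Partial Layer Fine-tuning amounts to freezing most parameters and updating only a chosen subset; below is a minimal PyTorch sketch, with resnet18 and the last stage plus head as placeholder choices rather than the paper's detector and layer selection.

```python
import torch
from torchvision.models import resnet18

model = resnet18(weights=None)  # placeholder for a pretrained detector backbone

for p in model.parameters():
    p.requires_grad = False
for p in model.layer4.parameters():  # unfreeze only the last stage...
    p.requires_grad = True
for p in model.fc.parameters():      # ...and the head (illustrative choice)
    p.requires_grad = True

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
print(f"{len(trainable)} trainable tensors, e.g. {trainable[:2]}")
```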
zh
[CV-49] Guideline-Consistent Segmentation via Multi-Agent Refinement
【速读】: This paper addresses the difficulty of strictly following complex textual labeling guidelines in real-world semantic segmentation, especially long, rule-heavy instructions (paragraph-length label guidelines), which traditional methods handle only through costly task-specific retraining that must be repeated as guidelines evolve. The key to the solution is a training-free multi-agent framework built around an iterative Worker-Supervisor refinement architecture: the Worker performs an initial segmentation with a general-purpose vision-language model, the Supervisor critiques the result against the retrieved textual guidelines, and a lightweight reinforcement-learned stop policy decides when to terminate the loop, ensuring guideline-consistent masks while balancing compute. The method clearly outperforms state-of-the-art baselines on the Waymo and ReasonSeg datasets, demonstrating strong generalization and instruction adherence.
链接: https://arxiv.org/abs/2509.04687
作者: Vanshika Vats,Ashwani Rathee,James Davis
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Semantic segmentation in real-world applications often requires not only accurate masks but also strict adherence to textual labeling guidelines. These guidelines are typically complex and long, and both human and automated labeling often fail to follow them faithfully. Traditional approaches depend on expensive task-specific retraining that must be repeated as the guidelines evolve. Although recent open-vocabulary segmentation methods excel with simple prompts, they often fail when confronted with sets of paragraph-length guidelines that specify intricate segmentation rules. To address this, we introduce a multi-agent, training-free framework that coordinates general-purpose vision-language models within an iterative Worker-Supervisor refinement architecture. The Worker performs the segmentation, the Supervisor critiques it against the retrieved guidelines, and a lightweight reinforcement learning stop policy decides when to terminate the loop, ensuring guideline-consistent masks while balancing resource use. Evaluated on the Waymo and ReasonSeg datasets, our method notably outperforms state-of-the-art baselines, demonstrating strong generalization and instruction adherence.
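The Worker-Supervisor loop with a stop policy can be sketched as plain control flow; the three callables below are placeholders for the VLM-backed components and the learned RL policy, so everything here is an assumption about structure rather than behavior.

```python
import random

random.seed(0)

def worker(image, feedback):
    return {"mask": f"mask(feedback={feedback})"}   # VLM segmentation placeholder

def supervisor(mask, guidelines):
    score = random.random()                         # guideline-compliance critique
    return score, None if score > 0.8 else "tighten object boundaries"

def stop_policy(step, score, max_steps=5, threshold=0.8):
    return score >= threshold or step + 1 >= max_steps  # learned RL policy in the paper

guidelines = "paragraph-length labeling rules ..."
feedback = None
for step in range(5):
    result = worker("scene.png", feedback)
    score, feedback = supervisor(result["mask"], guidelines)
    if stop_policy(step, score):
        break
print(f"stopped at step {step} with score {score:.2f}")
```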
zh
[CV-50] Ecologically Valid Benchmarking and Adaptive Attention: Scalable Marine Bioacoustic Monitoring
【速读】: This paper addresses the poor model stability and weak generalization in underwater passive acoustic monitoring (UPAM) caused by ambient-noise variability, complex signal dependencies, and mixed biological and anthropogenic sources. The key to the solution is GetNetUPAM, a hierarchical nested cross-validation framework: site-year blocking preserves data heterogeneity and forces evaluation under genuine ecological diversity, while standard cross-validation on random subsets measures generalization across UPAM's full signal distribution. On this backbone, the proposed Adaptive Resolution Pooling and Attention Network (ARPA-N) uses adaptive pooling with spatial attention to extend the receptive field and capture global context without excessive parameters, yielding a 14.4% gain in average precision over DenseNet baselines and a log2-scale order-of-magnitude drop in variability across all metrics, enabling consistent detection across sites and years and advancing scalable, accurate bioacoustic monitoring.
链接: https://arxiv.org/abs/2509.04682
作者: Nicholas R. Rasmussen,Rodrigue Rizk,Longwei Wang,KC Santosh
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: Under review as an anonymous submission to IEEETAI - We are allowed an archive submission. Final formatting is yet to be determined
Abstract:Underwater Passive Acoustic Monitoring (UPAM) provides rich spatiotemporal data for long-term ecological analysis, but intrinsic noise and complex signal dependencies hinder model stability and generalization. Multilayered windowing has improved target sound localization, yet variability from shifting ambient noise, diverse propagation effects, and mixed biological and anthropogenic sources demands robust architectures and rigorous evaluation. We introduce GetNetUPAM, a hierarchical nested cross-validation framework designed to quantify model stability under ecologically realistic variability. Data are partitioned into distinct site-year segments, preserving recording heterogeneity and ensuring each validation fold reflects a unique environmental subset, reducing overfitting to localized noise and sensor artifacts. Site-year blocking enforces evaluation against genuine environmental diversity, while standard cross-validation on random subsets measures generalization across UPAM’s full signal distribution, a dimension absent from current benchmarks. Using GetNetUPAM as the evaluation backbone, we propose the Adaptive Resolution Pooling and Attention Network (ARPA-N), a neural architecture for irregular spectrogram dimensions. Adaptive pooling with spatial attention extends the receptive field, capturing global context without excessive parameters. Under GetNetUPAM, ARPA-N achieves a 14.4% gain in average precision over DenseNet baselines and a log2-scale order-of-magnitude drop in variability across all metrics, enabling consistent detection across site-year folds and advancing scalable, accurate bioacoustic monitoring.
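Site-year blocking is essentially grouped cross-validation keyed on a site-year identifier; the sketch below uses scikit-learn's GroupKFold on synthetic data as a stand-in for the paper's nested scheme, so the features, labels, and group counts are assumptions.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n = 120
X = rng.normal(size=(n, 64))                    # synthetic spectrogram features
y = rng.integers(0, 2, n)                       # call present / absent
groups = np.array([f"{s}-{yr}" for s, yr in zip(
    rng.choice(["siteA", "siteB", "siteC"], n), rng.choice([2021, 2022], n))])

for fold, (tr, va) in enumerate(GroupKFold(n_splits=4).split(X, y, groups)):
    assert not set(groups[tr]) & set(groups[va])  # folds never share a site-year
    print(f"fold {fold}: held-out groups {sorted(set(groups[va]))}")
```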
zh
[CV-51] VCMamba: Bridging Convolutions with Multi-Directional Mamba for Efficient Visual Representation ICCV
【速读】: This paper addresses the trade-off between local detail extraction and global context modeling in current vision models: CNNs have strong inductive biases for fine-grained local features but cannot model long-range dependencies and global semantics, while Vision Transformers and state-space models (SSMs) such as Mamba model global context efficiently but are weaker than CNNs at representing local features. The key to the solution is VCMamba, a novel hybrid architecture whose staged design places convolutional blocks in the early stages to extract rich local features and multi-directional Mamba blocks in the later stages to model long-range dependencies and global information with linear complexity, achieving stronger feature representation while maintaining computational efficiency.
链接: https://arxiv.org/abs/2509.04669
作者: Mustafa Munir,Alex Zhang,Radu Marculescu
机构: The University of Texas at Austin (得克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Proceedings of the 2025 IEEE/CVF International Conference on Computer Vision (ICCV) Workshops
Abstract:Recent advances in Vision Transformers (ViTs) and State Space Models (SSMs) have challenged the dominance of Convolutional Neural Networks (CNNs) in computer vision. ViTs excel at capturing global context, and SSMs like Mamba offer linear complexity for long sequences, yet they do not capture fine-grained local features as effectively as CNNs. Conversely, CNNs possess strong inductive biases for local features but lack the global reasoning capabilities of transformers and Mamba. To bridge this gap, we introduce \textitVCMamba, a novel vision backbone that integrates the strengths of CNNs and multi-directional Mamba SSMs. VCMamba employs a convolutional stem and a hierarchical structure with convolutional blocks in its early stages to extract rich local features. These convolutional blocks are then processed by later stages incorporating multi-directional Mamba blocks designed to efficiently model long-range dependencies and global context. This hybrid design allows for superior feature representation while maintaining linear complexity with respect to image resolution. We demonstrate VCMamba’s effectiveness through extensive experiments on ImageNet-1K classification and ADE20K semantic segmentation. Our VCMamba-B achieves 82.6% top-1 accuracy on ImageNet-1K, surpassing PlainMamba-L3 by 0.3% with 37% fewer parameters, and outperforming Vision GNN-B by 0.3% with 64% fewer parameters. Furthermore, VCMamba-B obtains 47.1 mIoU on ADE20K, exceeding EfficientFormer-L7 by 2.0 mIoU while utilizing 62% fewer parameters. Code is available at this https URL.
zh
[CV-52] UAV-Based Intelligent Traffic Surveillance System: Real-Time Vehicle Detection, Classification, Tracking, and Behavioral Analysis
【速读】: This paper addresses the difficulty of monitoring urban traffic congestion and violations: traditional fixed-camera and sensor-based systems suffer from limited coverage, poor adaptability, and weak scalability, falling short of modern smart-city needs. The key to the solution is a UAV-based traffic surveillance system that achieves accurate vehicle detection and classification via multi-scale, multi-angle template matching, improves tracking robustness with Kalman filtering and homography-based calibration, and automatically detects typical violations such as unsafe lane changes, illegal double parking, and crosswalk obstruction by fusing geofencing, motion filtering, and trajectory-deviation analysis. The system supports multi-granularity urban traffic-flow analytics, including origin-destination trajectory tracking, flow visualization, inter-class correlation analysis, and heatmap-based congestion modeling, offering good scalability and practical value and pointing toward enforcement-aware, infrastructure-independent traffic management for next-generation smart cities.
链接: https://arxiv.org/abs/2509.04624
作者: Ali Khanpour,Tianyi Wang,Afra Vahidi-Shams,Wim Ectors,Farzam Nakhaie,Amirhossein Taheri,Christian Claudel
机构: University of Texas at Austin (德克萨斯大学奥斯汀分校); Babol Noshirvani University of Technology (巴博勒诺希尔瓦尼理工大学); Hasselt University (哈塞尔特大学); Amirkabir University of Technology (阿米尔卡比尔理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Robotics (cs.RO); Image and Video Processing (eess.IV); Systems and Control (eess.SY)
备注: 15 pages, 8 figures, 2 tables
Abstract:Traffic congestion and violations pose significant challenges for urban mobility and road safety. Traditional traffic monitoring systems, such as fixed cameras and sensor-based methods, are often constrained by limited coverage, low adaptability, and poor scalability. To address these challenges, this paper introduces an advanced unmanned aerial vehicle (UAV)-based traffic surveillance system capable of accurate vehicle detection, classification, tracking, and behavioral analysis in real-world, unconstrained urban environments. The system leverages multi-scale and multi-angle template matching, Kalman filtering, and homography-based calibration to process aerial video data collected from altitudes of approximately 200 meters. A case study in an urban area demonstrates robust performance, achieving a detection precision of 91.8%, an F1-score of 90.5%, and tracking metrics (MOTA/MOTP) of 92.1% and 93.7%, respectively. Beyond precise detection, the system classifies five vehicle types and automatically detects critical traffic violations, including unsafe lane changes, illegal double parking, and crosswalk obstructions, through the fusion of geofencing, motion filtering, and trajectory deviation analysis. The integrated analytics module supports origin-destination tracking, vehicle count visualization, inter-class correlation analysis, and heatmap-based congestion modeling. Additionally, the system enables entry-exit trajectory profiling, vehicle density estimation across road segments, and movement direction logging, supporting comprehensive multi-scale urban mobility analytics. Experimental results confirm the system’s scalability, accuracy, and practical relevance, highlighting its potential as an enforcement-aware, infrastructure-independent traffic monitoring solution for next-generation smart cities.
zh
[CV-53] Sali4Vid: Saliency-Aware Video Reweighting and Adaptive Caption Retrieval for Dense Video Captioning EMNLP2025
【速读】: This paper addresses two key problems in dense video captioning: (1) timestamp supervision is applied only to the text while all video frames are treated as equally important, ignoring differences in frame importance; and (2) captions are retrieved from fixed-size video chunks, failing to capture the semantic changes brought by scene transitions. The key to the Sali4Vid framework is two innovations: Saliency-aware Video Reweighting, which converts timestamp annotations into sigmoid-based frame-importance weights to strengthen key-frame representations, and Semantic-based Adaptive Caption Retrieval, which segments videos by frame similarity to identify scene transitions and improve caption retrieval accuracy. Experiments show state-of-the-art results on YouCook2 and ViTT, validating the benefit of jointly optimizing video weighting and retrieval.
链接: https://arxiv.org/abs/2509.04602
作者: MinJu Jeon,Si-Woo Kim,Ye-Chan Kim,HyunGee Kim,Dong-Jin Kim
机构: Hanyang University (汉阳大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in EMNLP 2025
Abstract:Dense video captioning aims to temporally localize events in video and generate captions for each event. While recent works propose end-to-end models, they suffer from two limitations: (1) applying timestamp supervision only to text while treating all video frames equally, and (2) retrieving captions from fixed-size video chunks, overlooking scene transitions. To address these, we propose Sali4Vid, a simple yet effective saliency-aware framework. We introduce Saliency-aware Video Reweighting, which converts timestamp annotations into sigmoid-based frame importance weights, and Semantic-based Adaptive Caption Retrieval, which segments videos by frame similarity to capture scene transitions and improve caption retrieval. Sali4Vid achieves state-of-the-art results on YouCook2 and ViTT, demonstrating the benefit of jointly improving video weighting and retrieval for dense video captioning.
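Converting timestamp annotations into sigmoid-based frame-importance weights can be sketched directly; the ramp form and temperature below are assumptions about one plausible realization, not the paper's exact parameterization.

```python
import numpy as np

def frame_weights(n_frames, start, end, tau=2.0):
    """Soft importance weights: ~1 inside [start, end], decaying outside."""
    t = np.arange(n_frames, dtype=float)
    rise = 1.0 / (1.0 + np.exp(-(t - start) / tau))  # sigmoid ramp up at event start
    fall = 1.0 / (1.0 + np.exp((t - end) / tau))     # sigmoid ramp down at event end
    return rise * fall

w = frame_weights(n_frames=40, start=10, end=25)
print(np.round(w, 2))  # near 1 inside the annotated event, smooth decay outside
```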
zh
[CV-54] WATCH: World-aware Allied Trajectory and pose reconstruction for Camera and Human
【速读】: This paper addresses global human motion reconstruction from in-the-wild monocular videos, whose core challenges are depth ambiguity, motion ambiguity, and the entanglement of camera and human motion. Existing human-motion-centric methods preserve motion detail and physical plausibility well but have two key limitations: insufficient exploitation of camera orientation information and inefficient integration of camera translation cues. The key to the proposed WATCH framework is two technical contributions: an analytical heading-angle decomposition that is more efficient and extensible than traditional geometric methods, and a world-model-inspired camera-trajectory fusion mechanism that exploits camera translation information effectively instead of relying on inefficient hard decoding. Experiments on in-the-wild benchmarks show that WATCH achieves state-of-the-art end-to-end trajectory reconstruction, validating joint modeling of camera-human motion relationships and offering new insights into the long-standing problem of camera translation integration in global human motion reconstruction.
链接: https://arxiv.org/abs/2509.04600
作者: Qijun Ying,Zhongyuan Hu,Rui Zhang,Ronghui Li,Yu Lu,Zijiao Zeng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Global human motion reconstruction from in-the-wild monocular videos is increasingly demanded across VR, graphics, and robotics applications, yet requires accurate mapping of human poses from camera to world coordinates-a task challenged by depth ambiguity, motion ambiguity, and the entanglement between camera and human movements. While human-motion-centric approaches excel in preserving motion details and physical plausibility, they suffer from two critical limitations: insufficient exploitation of camera orientation information and ineffective integration of camera translation cues. We present WATCH (World-aware Allied Trajectory and pose reconstruction for Camera and Human), a unified framework addressing both challenges. Our approach introduces an analytical heading angle decomposition technique that offers superior efficiency and extensibility compared to existing geometric methods. Additionally, we design a camera trajectory integration mechanism inspired by world models, providing an effective pathway for leveraging camera translation information beyond naive hard-decoding approaches. Through experiments on in-the-wild benchmarks, WATCH achieves state-of-the-art performance in end-to-end trajectory reconstruction. Our work demonstrates the effectiveness of jointly modeling camera-human motion relationships and offers new insights for addressing the long-standing challenge of camera translation integration in global human motion reconstruction. The code will be available publicly.
zh
[CV-55] DisPatch: Disarming Adversarial Patches in Object Detection with Diffusion Models
【速读】: This paper addresses the vulnerability of current object detection models to adversarial patch attacks, which can easily be applied to real-world objects to hide real targets or create non-existent ones, with severe consequences. The key to the solution is DISPATCH, the first diffusion-based defense framework for object detection, whose core innovation is a "regenerate and rectify" strategy: the strong in-distribution generative power of diffusion models first regenerates the entire image to align it with benign data, and a rectification process then identifies attacked regions and replaces them with the regenerated benign content, removing the attack's effect without prior knowledge of the patches. The method is attack-agnostic and highly robust, clearly outperforming existing defenses and achieving the best performance across detectors and attack settings.
链接: https://arxiv.org/abs/2509.04597
作者: Jin Ma,Mohammed Aldeen,Christopher Salas,Feng Luo,Mashrur Chowdhury,Mert Pesé,Long Cheng
机构: Clemson University (克莱姆森大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Object detection is fundamental to various real-world applications, such as security monitoring and surveillance video analysis. Despite their advancements, state-of-the-art object detectors are still vulnerable to adversarial patch attacks, which can be easily applied to real-world objects to either conceal actual items or create non-existent ones, leading to severe consequences. Given the current diversity of adversarial patch attacks and potential unknown threats, an ideal defense method should be effective, generalizable, and robust against adaptive attacks. In this work, we introduce DISPATCH, the first diffusion-based defense framework for object detection. Unlike previous works that aim to “detect and remove” adversarial patches, DISPATCH adopts a “regenerate and rectify” strategy, leveraging generative models to disarm attack effects while preserving the integrity of the input image. Specifically, we utilize the in-distribution generative power of diffusion models to regenerate the entire image, aligning it with benign data. A rectification process is then employed to identify and replace adversarial regions with their regenerated benign counterparts. DISPATCH is attack-agnostic and requires no prior knowledge of the existing patches. Extensive experiments across multiple detectors and attacks demonstrate that DISPATCH consistently outperforms state-of-the-art defenses on both hiding attacks and creating attacks, achieving the best overall mAP@0.5 score of 89.3% on hiding attacks, and lowering the attack success rate to 24.8% on untargeted creating attacks. Moreover, it maintains strong robustness against adaptive attacks, making it a practical and reliable defense for object detection systems.
zh
[CV-56] Inpaint4Drag : Repurposing Inpainting Models for Drag -Based Image Editing via Bidirectional Warping ICCV2025
【速读】: This paper addresses the limitations of current drag-based image editing methods in precision, real-time feedback, and model compatibility: existing approaches mainly manipulate the latent space of generative models, leading to limited precision, delayed responses, and model-specific constraints. The key to Inpaint4Drag is decomposing drag-based editing into pixel-space bidirectional warping and image inpainting: inspired by elastic object deformation in the physical world, image regions are treated as deformable materials that keep a natural shape under user drags, and drag inputs are converted directly into standard inpainting formats, so any inpainting model can be used without architectural modification, automatically inheriting future improvements in inpainting technology. This yields efficient (0.3 s per edit at 512x512), real-time (0.01 s warping previews), and general interactive editing.
链接: https://arxiv.org/abs/2509.04582
作者: Jingyi Lu,Kai Han
机构: The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV 2025. Project page: this https URL
Abstract:Drag-based image editing has emerged as a powerful paradigm for intuitive image manipulation. However, existing approaches predominantly rely on manipulating the latent space of generative models, leading to limited precision, delayed feedback, and model-specific constraints. Accordingly, we present Inpaint4Drag, a novel framework that decomposes drag-based editing into pixel-space bidirectional warping and image inpainting. Inspired by elastic object deformation in the physical world, we treat image regions as deformable materials that maintain natural shape under user manipulation. Our method achieves real-time warping previews (0.01s) and efficient inpainting (0.3s) at 512x512 resolution, significantly improving the interaction experience compared to existing methods that require minutes per edit. By transforming drag inputs directly into standard inpainting formats, our approach serves as a universal adapter for any inpainting model without architecture modification, automatically inheriting all future improvements in inpainting technology. Extensive experiments demonstrate that our method achieves superior visual quality and precise control while maintaining real-time performance. Project page: this https URL
zh
[CV-57] Skywork UniPic 2.0: Building Kontext Model with Online RL for Unified Multimodal Model
【速读】: This paper addresses the efficiency and performance bottlenecks of current open-source multimodal models for image generation and editing, which over-rely on parameter scaling while neglecting training-strategy optimization. The key to the solution is UniPic2-SD3.5M-Kontext, a 2B-parameter DiT model built on the SD3.5-Medium architecture, together with a novel Progressive Dual-Task Reinforcement (PDTR) strategy that strengthens text-to-image generation and image editing in stages, letting the two tasks benefit each other without negative interference. Building on the MetaQuery framework, the unified multimodal model UniPic2-Metaquery is further constructed, validating the generality and efficiency of this training paradigm across understanding, generation, and editing tasks, and yielding the scalable Skywork UniPic 2.0 training recipe.
链接: https://arxiv.org/abs/2509.04548
作者: Hongyang Wei,Baixin Xu,Hongbo Liu,Cyrus Wu,Jie Liu,Yi Peng,Peiyu Wang,Zexiang Liu,Jingwen He,Yidan Xietian,Chuanxin Tang,Zidong Wang,Yichen Wei,Liang Hu,Boyi Jiang,William Li,Ying He,Yang Liu,Xuchen Song,Eric Li,Yahui Zhou
机构: Skywork Multimodality Team (Skywork多模态团队)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in multimodal models have demonstrated impressive capabilities in unified image generation and editing. However, many prominent open-source models prioritize scaling model parameters over optimizing training strategies, limiting their efficiency and performance. In this work, we present UniPic2-SD3.5M-Kontext, a 2B-parameter DiT model based on SD3.5-Medium, which achieves state-of-the-art image generation and editing while extending seamlessly into a unified multimodal framework. Our approach begins with architectural modifications to SD3.5-Medium and large-scale pre-training on high-quality data, enabling joint text-to-image generation and editing capabilities. To enhance instruction following and editing consistency, we propose a novel Progressive Dual-Task Reinforcement strategy (PDTR), which effectively strengthens both tasks in a staged manner. We empirically validate that the reinforcement phases for different tasks are mutually beneficial and do not induce negative interference. After pre-training and reinforcement strategies, UniPic2-SD3.5M-Kontext demonstrates stronger image generation and editing capabilities than models with significantly larger generation parameters-including BAGEL (7B) and Flux-Kontext (12B). Furthermore, following the MetaQuery, we connect the UniPic2-SD3.5M-Kontext and Qwen2.5-VL-7B via a connector and perform joint training to launch a unified multimodal model UniPic2-Metaquery. UniPic2-Metaquery integrates understanding, generation, and editing, achieving top-tier performance across diverse tasks with a simple and scalable training paradigm. This consistently validates the effectiveness and generalizability of our proposed training paradigm, which we formalize as Skywork UniPic 2.0.
zh
[CV-58] PromptEnhancer: A Simple Approach to Enhance Text-to-Image Models via Chain-of-Thought Prompt Rewriting
【速读】: This paper addresses the failure of text-to-image (T2I) diffusion models to faithfully follow complex user prompts, with notable deviations in attribute binding, negation, and compositional semantics that misalign generated images with user intent. The key to PromptEnhancer, a universal prompt rewriting framework, is decoupling the rewriter from the generator and training a Chain-of-Thought (CoT) rewriter with reinforcement learning, guided by a dedicated reward model called AlignEvaluator, which provides explicit, fine-grained feedback based on 24 key points derived from a systematic analysis of common T2I failure modes. This mechanism enables the rewriter to produce prompts that T2I models interpret more precisely, markedly improving image-text alignment.
链接: https://arxiv.org/abs/2509.04545
作者: Linqing Wang,Ximing Xing,Yiji Cheng,Zhiyuan Zhao,Jiale Tao,Qixun Wang,Ruihuang Li,Xin Li,Mingrui Wu,Xinchi Deng,Chunyu Wang,Qinglin Lu
机构: Tencent Hunyuan (腾讯混元)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: technical report
Abstract:Recent advancements in text-to-image (T2I) diffusion models have demonstrated remarkable capabilities in generating high-fidelity images. However, these models often struggle to faithfully render complex user prompts, particularly in aspects like attribute binding, negation, and compositional relationships. This leads to a significant mismatch between user intent and the generated output. To address this challenge, we introduce PromptEnhancer, a novel and universal prompt rewriting framework that enhances any pretrained T2I model without requiring modifications to its weights. Unlike prior methods that rely on model-specific fine-tuning or implicit reward signals like image-reward scores, our framework decouples the rewriter from the generator. We achieve this by training a Chain-of-Thought (CoT) rewriter through reinforcement learning, guided by a dedicated reward model we term the AlignEvaluator. The AlignEvaluator is trained to provide explicit and fine-grained feedback based on a systematic taxonomy of 24 key points, which are derived from a comprehensive analysis of common T2I failure modes. By optimizing the CoT rewriter to maximize the reward from our AlignEvaluator, our framework learns to generate prompts that are more precisely interpreted by T2I models. Extensive experiments on the HunyuanImage 2.1 model demonstrate that PromptEnhancer significantly improves image-text alignment across a wide range of semantic and compositional challenges. Furthermore, we introduce a new, high-quality human preference benchmark to facilitate future research in this direction.
zh
[CV-59] Facial Emotion Recognition does not detect feeling unsafe in automated driving
【速读】: This paper addresses how to understand and assess the public's perceived risk in automated vehicles so as to improve user trust and acceptance; the core challenge is objectively and accurately quantifying passengers' subjective risk perception under different driving styles and sudden events. The key to the solution is a neural network model based on vehicle kinematics and skin conductance that effectively predicts subjective perceived risk, clearly outperforming approaches based on facial expression recognition, which proved unreliable for lack of consistent facial reactions. The method provides a quantifiable, objective indicator of perceived risk in automated vehicles, reduces subjective bias, and points out directions for future research.
链接: https://arxiv.org/abs/2509.04490
作者: Abel van Elburg,Konstantinos Gkentsidis,Mathieu Sarrazin,Sarah Barendswaard,Varun Kotian,Riender Happee
机构: Delft University of Technology (代尔夫特理工大学); Siemens Digital Industries Software (西门子数字工业软件)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Trust and perceived safety play a crucial role in the public acceptance of automated vehicles. To understand perceived risk, an experiment was conducted using a driving simulator under two automated driving styles and optionally introducing a crossing pedestrian. Data was collected from 32 participants, consisting of continuous subjective comfort ratings, motion, webcam footage for facial expression, skin conductance, heart rate, and eye tracking. The continuous subjective perceived risk ratings showed significant discomfort associated with perceived risk during cornering and braking, followed by relief or even positive comfort on continuing the ride. The dynamic driving style induced stronger discomfort than the calm driving style. The crossing pedestrian did not affect discomfort with the calm driving style but doubled the comfort decrement with the dynamic driving style. This illustrates how the consequences of critical interactions shape risk perception. Facial expression was successfully analyzed for 24 participants, but most (15/24) did not show any detectable facial reaction to the critical event. Among the 9 participants who did, 8 showed a Happy expression, and only 4 showed a Surprise expression. Fear was never dominant. This indicates that facial expression recognition is not a reliable method for assessing perceived risk in automated vehicles. To predict perceived risk, a neural network model was implemented using vehicle motion and skin conductance. The model correlated well with reported perceived risk, demonstrating its potential for objective perceived risk assessment in automated vehicles, reducing subjective bias and highlighting areas for future research.
[CV-60] Teacher-Student Model for Detecting and Classifying Mitosis in the MIDOG 2025 Challenge
【Quick Read】: This paper tackles the time cost and inter-observer variability pathologists face when counting mitotic figures, together with two challenges for AI models: performance drops under domain shift and the severe imbalance between mitotic figures and normal nuclei. The key to the solution is a teacher-student framework built on pixel-level segmentation that combines contrastive representation learning and domain-adversarial training to improve robustness to differences across tissues, species, and staining protocols; it generates pixel-level pseudo-masks to jointly optimize over annotated mitoses, hard negatives, and normal nuclei, strengthening feature discrimination. In addition, a multi-scale CNN classifier that embeds feature maps from the segmentation model enables multi-task learning coupling mitosis detection with atypical mitosis classification, yielding strong overall performance (Track 1 F1 = 0.7660, Track 2 balanced accuracy = 0.8414).
Link: https://arxiv.org/abs/2509.03614
Authors: Seungho Choe, Xiaoli Qin, Abubakr Shafique, Amanda Dy, Susan Done, Dimitrios Androutsos, April Khademi
Institutions: Toronto Metropolitan University; University Health Network; University of Toronto; Vector Institute of Artificial Intelligence
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 4 pages, 1 figure, final submission for MIDOG 2025 challenge
Abstract:Counting mitotic figures is time-intensive for pathologists and leads to inter-observer variability. Artificial intelligence (AI) promises a solution by automatically detecting mitotic figures while maintaining decision consistency. However, AI tools are susceptible to domain shift, where a significant drop in performance can occur due to differences in the training and testing sets, including morphological diversity between organs, species, and variations in staining protocols. Furthermore, the number of mitoses is much less than the count of normal nuclei, which introduces severely imbalanced data for the detection task. In this work, we formulate mitosis detection as a pixel-level segmentation and propose a teacher-student model that simultaneously addresses mitosis detection (Track 1) and atypical mitosis classification (Track 2). Our method is based on a UNet segmentation backbone that integrates domain generalization modules, namely contrastive representation learning and domain-adversarial training. A teacher-student strategy is employed to generate pixel-level pseudo-masks not only for annotated mitoses and hard negatives but also for normal nuclei, thereby enhancing feature discrimination and improving robustness against domain shift. For the classification task, we introduce a multi-scale CNN classifier that leverages feature maps from the segmentation model within a multi-task learning paradigm. On the preliminary test set, the algorithm achieved an F1 score of 0.7660 in Track 1 and balanced accuracy of 0.8414 in Track 2, demonstrating the effectiveness of integrating segmentation-based detection and classification into a unified framework for robust mitosis analysis.
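The abstract does not spell out the teacher-student update, so the sketch below shows one common instantiation in Python/PyTorch: an EMA teacher whose confident per-pixel predictions become pseudo-masks, with an ignore label for uncertain pixels. The momentum value, the confidence threshold, and the `ignore_index` convention are assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.99):
    # Slowly track the student with an exponential moving average;
    # a common teacher-student choice (the momentum value is an assumption).
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)

@torch.no_grad()
def pseudo_masks(teacher, images, threshold=0.9):
    # Pixel-level pseudo-masks: keep only pixels the teacher labels with
    # high confidence; everything else is marked to be ignored.
    probs = torch.softmax(teacher(images), dim=1)   # (B, C, H, W) logits -> probs
    conf, labels = probs.max(dim=1)                 # per-pixel confidence and class
    labels[conf < threshold] = -1                   # ignore uncertain pixels
    return labels

def student_loss(student, images, teacher):
    # The student is supervised on teacher pseudo-masks (the paper also uses
    # annotated mitoses and hard negatives, omitted in this sketch).
    targets = pseudo_masks(teacher, images)
    return F.cross_entropy(student(images), targets, ignore_index=-1)
```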
[CV-61] MLP-SRGAN: A Single-Dimension Super Resolution GAN using MLP-Mixer
【Quick Read】: This paper addresses super-resolution reconstruction of low-resolution medical images (particularly in the slice direction) while preserving high-quality detail on multi-centre clinical data. The key to the solution is a new architecture, MLP-SRGAN, which combines MLP-Mixer blocks with convolutional layers to upsample efficiently in the slice direction. Compared with existing methods, MLP-SRGAN uses fewer parameters, trains and infers faster, and has a smaller model size; when high-resolution (HR) ground truths are unavailable, it introduces new no-reference image quality metrics (edge strength, entropy, and low-frequency information) to quantify sharpness, noise, and blurriness, markedly improving the fidelity of texture and fine anatomical detail.
Link: https://arxiv.org/abs/2303.06298
Authors: Samir Mitha, Seungho Choe, Pejman Jahbedar Maralani, Alan R. Moody, April Khademi
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 14 pages, 10 figures
Abstract:We propose a novel architecture called MLP-SRGAN, which is a single-dimension Super Resolution Generative Adversarial Network (SRGAN) that utilizes Multi-Layer Perceptron Mixers (MLP-Mixers) along with convolutional layers to upsample in the slice direction. MLP-SRGAN is trained and validated using high resolution (HR) FLAIR MRI from the MSSEG2 challenge dataset. The method was applied to three multicentre FLAIR datasets (CAIN, ADNI, CCNA) of images with low spatial resolution in the slice dimension to examine performance on held-out (unseen) clinical data. Upsampled results are compared to several state-of-the-art SR networks. For images with high resolution (HR) ground truths, peak-signal-to-noise-ratio (PSNR) and structural similarity index (SSIM) are used to measure upsampling performance. Several new structural, no-reference image quality metrics were proposed to quantify sharpness (edge strength), noise (entropy), and blurriness (low frequency information) in the absence of ground truths. Results show MLP-SRGAN results in sharper edges, less blurring, preserves more texture and fine-anatomical detail, with fewer parameters, faster training/evaluation time, and smaller model size than existing methods. Code for MLP-SRGAN training and inference, data generators, models and no-reference image quality metrics will be available at this https URL.
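The three no-reference metrics are described only at the level of sharpness (edge strength), noise (entropy), and blurriness (low-frequency information). Below is a minimal NumPy sketch of plausible instantiations; the exact definitions used in the paper may differ.

```python
import numpy as np

def edge_strength(img):
    # Sharpness proxy: mean gradient magnitude over the image.
    gy, gx = np.gradient(img.astype(np.float64))
    return np.mean(np.hypot(gx, gy))

def intensity_entropy(img, bins=256):
    # Noise proxy: Shannon entropy of the intensity histogram.
    hist, _ = np.histogram(img, bins=bins)
    p = hist[hist > 0].astype(np.float64)
    p /= p.sum()
    return -np.sum(p * np.log2(p))

def low_freq_ratio(img, cutoff=0.1):
    # Blurriness proxy: fraction of spectral energy in a low-frequency box.
    spec = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = spec.shape
    ch, cw = int(h * cutoff), int(w * cutoff)
    centre = spec[h // 2 - ch:h // 2 + ch, w // 2 - cw:w // 2 + cw]
    return centre.sum() / spec.sum()
```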
[CV-62] VLSM-Ensemble: Ensembling CLIP-based Vision-Language Models for Enhanced Medical Image Segmentation
【Quick Read】: This paper addresses the fact that current CLIP- and BiomedCLIP-based vision-language segmentation models (VLSMs) lag behind more sophisticated architectures such as CRIS in image segmentation. The key to the solution is to abandon conventional text prompt engineering and instead ensemble VLSMs with a low-complexity CNN, which significantly improves segmentation accuracy. Experiments show a 6.3% Dice score improvement on the BKAI polyp dataset, with gains of 1% to 6% on other datasets; the varying behaviour across datasets points to a topic for future investigation.
Link: https://arxiv.org/abs/2509.05154
Authors: Julia Dietlmeier, Oluwabukola Grace Adegboro, Vayangi Ganepola, Claudia Mazo, Noel E. O'Connor
Institutions: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Medical Imaging with Deep Learning (MIDL 2025) short paper
Abstract:Vision-language models and their adaptations to image segmentation tasks present enormous potential for producing highly accurate and interpretable results. However, implementations based on CLIP and BiomedCLIP are still lagging behind more sophisticated architectures such as CRIS. In this work, instead of focusing on text prompt engineering as is the norm, we attempt to narrow this gap by showing how to ensemble vision-language segmentation models (VLSMs) with a low-complexity CNN. By doing so, we achieve a significant Dice score improvement of 6.3% on the BKAI polyp dataset using the ensembled BiomedCLIPSeg, while other datasets exhibit gains ranging from 1% to 6%. Furthermore, we provide initial results on additional four radiology and non-radiology datasets. We conclude that ensembling works differently across these datasets (from outperforming to underperforming the CRIS model), indicating a topic for future investigation by the community. The code is available at this https URL.
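The abstract leaves the low-complexity CNN's topology open; the hypothetical PyTorch sketch below stacks per-model logit maps as channels and fuses them with two small convolutions. The layer sizes and the three-model example are assumptions.

```python
import torch
import torch.nn as nn

class LogitEnsembleCNN(nn.Module):
    # Low-complexity fusion of per-model segmentation logits; a plausible
    # instantiation, not the paper's exact architecture.
    def __init__(self, n_models: int, hidden: int = 16):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(n_models, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),   # fused binary-mask logits
        )

    def forward(self, logits):                     # logits: (B, n_models, H, W)
        return self.fuse(logits)

# Usage: stack the logit map of each VLSM along the channel axis.
model = LogitEnsembleCNN(n_models=3)
stacked = torch.randn(2, 3, 256, 256)              # e.g. BiomedCLIPSeg + two others
fused = model(stacked)                             # (2, 1, 256, 256)
```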
[CV-63] Multi-modal Uncertainty Robust Tree Cover Segmentation For High-Resolution Remote Sensing Images
【Quick Read】: This paper addresses cross-modal uncertainty in multi-modal remote sensing imagery for tree cover mapping caused by temporal misalignment: optical, LiDAR, and SAR acquisitions taken days or months apart may differ due to vegetation disturbance or changes in imaging quality, which can severely degrade semantic segmentation accuracy in high-resolution imagery. The key to the solution is the proposed MURTreeFormer framework, which treats one modality as primary and the rest as auxiliary, explicitly modelling patch-level aleatoric uncertainty in the auxiliary modalities via a probabilistic latent representation; uncertain patches are reconstructed from the primary modality's distribution through a VAE-based resampling mechanism to produce enhanced auxiliary features for fusion. In the decoder, a gradient magnitude attention (GMA) module and a lightweight refinement head (RH) respectively guide attention toward tree-like structures and preserve fine-grained spatial detail, mitigating temporally induced aleatoric uncertainty and improving segmentation robustness and accuracy.
Link: https://arxiv.org/abs/2509.04870
Authors: Yuanyuan Gui, Wei Li, Yinjian Wang, Xiang-Gen Xia, Mauro Marty, Christian Ginzler, Zuyuan Wang
Institutions: Beijing Institute of Technology; University of Delaware; Swiss Federal Institute for Forest, Snow, and Landscape Research WSL
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advances in semantic segmentation of multi-modal remote sensing images have significantly improved the accuracy of tree cover mapping, supporting applications in urban planning, forest monitoring, and ecological assessment. Integrating data from multiple modalities-such as optical imagery, light detection and ranging (LiDAR), and synthetic aperture radar (SAR)-has shown superior performance over single-modality methods. However, these data are often acquired days or even months apart, during which various changes may occur, such as vegetation disturbances (e.g., logging, and wildfires) and variations in imaging quality. Such temporal misalignments introduce cross-modal uncertainty, especially in high-resolution imagery, which can severely degrade segmentation accuracy. To address this challenge, we propose MURTreeFormer, a novel multi-modal segmentation framework that mitigates and leverages aleatoric uncertainty for robust tree cover mapping. MURTreeFormer treats one modality as primary and others as auxiliary, explicitly modeling patch-level uncertainty in the auxiliary modalities via a probabilistic latent representation. Uncertain patches are identified and reconstructed from the primary modality’s distribution through a VAE-based resampling mechanism, producing enhanced auxiliary features for fusion. In the decoder, a gradient magnitude attention (GMA) module and a lightweight refinement head (RH) are further integrated to guide attention toward tree-like structures and to preserve fine-grained spatial details. Extensive experiments on multi-modal datasets from Shanghai and Zurich demonstrate that MURTreeFormer significantly improves segmentation performance and effectively reduces the impact of temporally induced aleatoric uncertainty.
[CV-64] Histogram Driven Amplitude Embedding for Qubit Efficient Quantum Image Compression
【Quick Read】: This paper addresses the high resource cost and poor scalability of quantum image compression, in particular how to compress images with high quality on limited NISQ-era hardware. The key to the solution is block-based ("bixel") processing combined with amplitude embedding: the image is segmented into fixed-size blocks, the total intensity of each block is computed to build a global histogram, and the normalized square roots of the bin counts are encoded as amplitudes of an n-qubit quantum state. The number of qubits depends only on the number of histogram bins B, independent of image resolution, enabling low-resource compression and approximate reconstruction; experiments show high-quality reconstructions with just 5 to 7 qubits, clearly outperforming conventional pixel-level encodings.
Link: https://arxiv.org/abs/2509.04849
Authors: Sahil Tomar, Sandeep Kumar
Institutions: Unknown
Subjects: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Information Theory (cs.IT)
Comments: 7 pages
Abstract:This work introduces a compact and hardware efficient method for compressing color images using near term quantum devices. The approach segments the image into fixed size blocks called bixels, and computes the total intensity within each block. A global histogram with B bins is then constructed from these block intensities, and the normalized square roots of the bin counts are encoded as amplitudes into an n qubit quantum state. Amplitude embedding is performed using PennyLane and executed on real IBM Quantum hardware. The resulting state is measured to reconstruct the histogram, enabling approximate recovery of block intensities and full image reassembly. The method maintains a constant qubit requirement based solely on the number of histogram bins, independent of the resolution of the image. By adjusting B, users can control the trade off between fidelity and resource usage. Empirical results demonstrate high quality reconstructions using as few as 5 to 7 qubits, significantly outperforming conventional pixel level encodings in terms of qubit efficiency and validating the practical application of the method for current NISQ era quantum systems.
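The encoding itself is easy to reproduce from the description: block intensities, a B-bin histogram, and normalized square roots of counts as amplitudes. A NumPy sketch follows, with the block size and bin count as free parameters; the actual state preparation would be done with, e.g., PennyLane's AmplitudeEmbedding, since the paper reports using PennyLane.

```python
import numpy as np

def histogram_amplitudes(img, block=8, n_bins=32):
    # 1) Segment the image into fixed-size blocks ("bixels") and total
    #    the intensity inside each block.
    h, w = img.shape[0] // block * block, img.shape[1] // block * block
    blocks = img[:h, :w].reshape(h // block, block, w // block, block)
    intensities = blocks.sum(axis=(1, 3)).ravel()
    # 2) Global histogram over block intensities with B = n_bins bins.
    counts, _ = np.histogram(intensities, bins=n_bins)
    # 3) Normalized square roots of bin counts become state amplitudes;
    #    log2(B) qubits suffice, independent of image resolution.
    return np.sqrt(counts / counts.sum())

amps = histogram_amplitudes(np.random.rand(64, 64), block=8, n_bins=32)
assert np.isclose(np.sum(amps ** 2), 1.0)   # valid quantum state (5 qubits)
```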
[CV-65] AURAD: Anatomy-Pathology Unified Radiology Synthesis with Progressive Representations
【Quick Read】: This paper addresses the difficulty of fine-grained control, large cross-dataset domain shifts, and limited clinical relevance in medical image synthesis, especially for chest radiographs, where disease patterns are morphologically diverse and tightly intertwined with anatomy. The core of the solution is the AURAD framework, which achieves controllable synthesis through a progressive pipeline: pseudo semantic masks are first generated from clinical prompts conditioned on anatomical structures, ensuring multi-pathology coexistence and anatomical-pathological consistency; these masks then guide high-fidelity image synthesis, and pretrained expert medical models filter the outputs to ensure clinical plausibility. The method improves realism and diversity, and the generated masks can directly serve downstream detection and segmentation tasks, bridging generative modelling and real-world clinical applications.
Link: https://arxiv.org/abs/2509.04819
Authors: Shuhan Ding, Jingjing Fu, Yu Gu, Naiteek Sangani, Mu Wei, Paul Vozila, Nan Liu, Jiang Bian, Hoifung Poon
Institutions: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Medical image synthesis has become an essential strategy for augmenting datasets and improving model generalization in data-scarce clinical settings. However, fine-grained and controllable synthesis remains difficult due to limited high-quality annotations and domain shifts across datasets. Existing methods, often designed for natural images or well-defined tumors, struggle to generalize to chest radiographs, where disease patterns are morphologically diverse and tightly intertwined with anatomical structures. To address these challenges, we propose AURAD, a controllable radiology synthesis framework that jointly generates high-fidelity chest X-rays and pseudo semantic masks. Unlike prior approaches that rely on randomly sampled masks-limiting diversity, controllability, and clinical relevance-our method learns to generate masks that capture multi-pathology coexistence and anatomical-pathological consistency. It follows a progressive pipeline: pseudo masks are first generated from clinical prompts conditioned on anatomical structures, and then used to guide image synthesis. We also leverage pretrained expert medical models to filter outputs and ensure clinical plausibility. Beyond visual realism, the synthesized masks also serve as labels for downstream tasks such as detection and segmentation, bridging the gap between generative modeling and real-world clinical applications. Extensive experiments and blinded radiologist evaluations demonstrate the effectiveness and generalizability of our method across tasks and datasets. In particular, 78% of our synthesized images are classified as authentic by board-certified radiologists, and over 40% of predicted segmentation overlays are rated as clinically useful. All code, pre-trained models, and the synthesized dataset will be released upon publication.
[CV-66] Inferring the Graph Structure of Images for Graph Neural Networks
【Quick Read】: This paper aims to improve graph neural network (GNN) performance on image classification; the core issue is the limitation of conventional pixel grid graphs and superpixel methods for modelling image structure. The key to the solution is new graph representations: row-correlation, column-correlation, and product graphs built from correlations between pixel values serve as more effective GNN inputs, strengthening the model's ability to capture image semantics. Experiments show that these correlation-based graph constructions improve downstream GNN classification accuracy on MNIST and Fashion-MNIST over the traditional grid-graph and superpixel methods.
Link: https://arxiv.org/abs/2509.04677
Authors: Mayur S Gowda, John Shi, Augusto Santos, José M. F. Moura
Institutions: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:
Abstract:Image datasets such as MNIST are a key benchmark for testing Graph Neural Network (GNN) architectures. The images are traditionally represented as a grid graph with each node representing a pixel and edges connecting neighboring pixels (vertically and horizontally). The graph signal is the values (intensities) of each pixel in the image. The graphs are commonly used as input to graph neural networks (e.g., Graph Convolutional Neural Networks (Graph CNNs) [1, 2], Graph Attention Networks (GAT) [3], GatedGCN [4]) to classify the images. In this work, we improve the accuracy of downstream graph neural network tasks by finding alternative graphs to the grid graph and superpixel methods to represent the dataset images, following the approach in [5, 6]. We find row correlation, column correlation, and product graphs for each image in MNIST and Fashion-MNIST using correlations between the pixel values building on the method in [5, 6]. Experiments show that using these different graph representations and features as input into downstream GNN models improves the accuracy over using the traditional grid graph and superpixel methods in the literature.
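A minimal NumPy sketch of the graph constructions named in the abstract: a k-nearest-neighbour graph over absolute pixel correlations for rows and columns, combined into a pixel-level product graph. The kNN sparsification and the Kronecker form of the product are assumptions; [5, 6] may define these differently.

```python
import numpy as np

def correlation_graph(X, k=4):
    # X: (n_samples, d). Build a kNN graph over the d variables from
    # absolute Pearson correlations (kNN sparsification is an assumption).
    C = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(C, 0.0)
    A = np.zeros_like(C)
    for i in range(C.shape[0]):
        for j in np.argsort(C[i])[-k:]:          # keep k strongest correlations
            A[i, j] = A[j, i] = C[i, j]
    return A

imgs = np.random.rand(1000, 28, 28)              # stand-in for MNIST images
A_col = correlation_graph(imgs.reshape(-1, 28))                     # 28-node column graph
A_row = correlation_graph(imgs.transpose(0, 2, 1).reshape(-1, 28))  # 28-node row graph
A_pixel = np.kron(A_row, A_col)                  # 784-node product graph over pixels
```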
Artificial Intelligence
[AI-0] Scaling Performance of Large Language Model Pretraining
【Quick Read】: This paper addresses the efficiency of distributed training in large language model (LLM) pretraining, in particular how to manage massive datasets across hundreds of nodes and fully exploit GPU compute through data parallelism. The key to the solution is systematic optimization of the training pipeline, including improving the distributed training framework for communication efficiency, designing scalable data loading and management, and fine-tuning for different hardware configurations, thereby fully utilizing available GPU capacity and significantly raising training throughput and resource utilization.
Link: https://arxiv.org/abs/2509.05258
Authors: Alexander Interrante-Grant, Carla Varela-Rosa, Suhaas Narayan, Chris Connelly, Albert Reuther
Institutions: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) show best-in-class performance across a wide range of natural language processing applications. Training these models is an extremely computationally expensive task; frontier Artificial Intelligence (AI) research companies are investing billions of dollars into supercomputing infrastructure to train progressively larger models on increasingly massive datasets. Unfortunately, information about the scaling performance and training considerations of these large training pipelines is scarce in public literature. Working with large-scale datasets and models can be complex and practical recommendations are scarce in the public literature for tuning training performance when scaling up large language models. In this paper, we aim to demystify the large language model pretraining pipeline somewhat - in particular with respect to distributed training, managing large datasets across hundreds of nodes, and scaling up data parallelism with an emphasis on fully leveraging available GPU compute capacity.
[AI-1] Recomposer: Event-roll-guided generative audio editing
【Quick Read】: This paper addresses the difficulty of editing individual sound events in complex real-world scenes, especially when multiple sources overlap in time. The core challenge is to precisely delete, insert, or enhance specific sound events while keeping the overall soundscape natural and consistent. The key to the solution is an encoder-decoder transformer conditioned on textual edit descriptions and an event-roll timing representation, operating on SoundStream representations and trained on synthetic data: (input, desired output) pairs are formed by adding isolated sound events to dense real-world backgrounds. Evaluation shows that the action, class, and timing elements of the edit description all matter, demonstrating that "recomposition" is a viable and practical audio-editing paradigm.
Link: https://arxiv.org/abs/2509.05256
Authors: Daniel P. W. Ellis, Eduardo Fonseca, Ron J. Weiss, Kevin Wilson, Scott Wisdom, Hakan Erdogan, John R. Hershey, Aren Jansen, R. Channing Moore, Manoj Plakal
Institutions: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 5 figures
Abstract:Editing complex real-world sound scenes is difficult because individual sound sources overlap in time. Generative models can fill-in missing or corrupted details based on their strong prior understanding of the data domain. We present a system for editing individual sound events within complex scenes able to delete, insert, and enhance individual sound events based on textual edit descriptions (e.g., "enhance Door") and a graphical representation of the event timing derived from an "event roll" transcription. We present an encoder-decoder transformer working on SoundStream representations, trained on synthetic (input, desired output) audio example pairs formed by adding isolated sound events to dense, real-world backgrounds. Evaluation reveals the importance of each part of the edit descriptions – action, class, timing. Our work demonstrates "recomposition" is an important and practical application.
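The training-pair construction is described concretely enough to sketch: take a dense real-world background, add an isolated event at the onset given by the event roll, and form an (input, target) pair for a given edit. The additive mixing, the gain values, and the function names below are illustrative assumptions.

```python
import numpy as np

def make_enhance_pair(background, event, onset, sr=16000):
    # Synthetic (input, target) pair for an "enhance" edit, assuming simple
    # additive mixing; Recomposer's actual data pipeline may differ. The
    # event roll supplies the (class, onset) used in the edit description.
    start = int(onset * sr)
    mix = background.copy()
    mix[start:start + len(event)] += 0.5 * event     # input: event at low gain
    target = background.copy()
    target[start:start + len(event)] += event        # target: event enhanced
    return mix, target

bg = np.random.randn(5 * 16000) * 0.05               # dense background stand-in
door = np.random.randn(16000) * 0.3                  # isolated "Door" event stand-in
x, y = make_enhance_pair(bg, door, onset=2.0)
```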
[AI-2] Uncertain but Useful: Leveraging CNN Variability into Data Augmentation
【Quick Read】: This paper addresses the numerical stability of deep learning (DL) model training in neuroimaging, and in particular its effect on reproducibility. While DL inference is comparatively stable, iterative stochastic optimization during training introduces additional variability that may affect reliability and generalization. The key to the solution is to systematically quantify and exploit this training-time variability by introducing floating-point perturbations and controlled random seeds: the perturbation-generated numerical ensembles match an unperturbed baseline in performance and can be repurposed as a data augmentation strategy for downstream tasks (such as brain age regression), turning numerical variability from a reproducibility concern into a resource for robustness and new applications.
Link: https://arxiv.org/abs/2509.05238
Authors: Inés Gonzalez-Pepe, Vinuyan Sivakolunthu, Yohan Chatelain, Tristan Glatard
Institutions: Unknown
Subjects: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI)
Comments:
Abstract:Deep learning (DL) is rapidly advancing neuroimaging by achieving state-of-the-art performance with reduced computation times. Yet the numerical stability of DL models – particularly during training – remains underexplored. While inference with DL is relatively stable, training introduces additional variability primarily through iterative stochastic optimization. We investigate this training-time variability using FastSurfer, a CNN-based whole-brain segmentation pipeline. Controlled perturbations are introduced via floating point perturbations and random seeds. We find that: (i) FastSurfer exhibits higher variability compared to that of a traditional neuroimaging pipeline, suggesting that DL inherits and is particularly susceptible to sources of instability present in its predecessors; (ii) ensembles generated with perturbations achieve performance similar to an unperturbed baseline; and (iii) variability effectively produces ensembles of numerical model families that can be repurposed for downstream applications. As a proof of concept, we demonstrate that numerical ensembles can be used as a data augmentation strategy for brain age regression. These findings position training-time variability not only as a reproducibility concern but also as a resource that can be harnessed to improve robustness and enable new applications in neuroimaging.
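A minimal sketch of the seed-perturbation ensembles described above, assuming members differ only in their random seed and that the ensemble is consumed as a set of per-member predictions (e.g., as augmented training targets for downstream brain-age regression). `make_model` and `train_fn` are placeholders for the user's own pipeline.

```python
import torch

def train_member(seed, make_model, train_fn):
    # Each ensemble member differs only in its random seed; the resulting
    # training-time numerical variability is the resource being reused.
    torch.manual_seed(seed)
    model = make_model()
    train_fn(model)
    return model

def ensemble_predict(models, x):
    with torch.no_grad():
        preds = torch.stack([m(x) for m in models])   # (n_members, ...)
    return preds.mean(dim=0), preds.std(dim=0)        # consensus + spread

# Augmentation view: instead of only the mean, the individual member
# predictions can be kept as distinct (input, target) training samples.
```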
[AI-3] RapidGNN: Energy and Communication-Efficient Distributed Training on Large-Scale Graph Neural Networks
【Quick Read】: This paper addresses the communication overhead in distributed training of large-scale graph neural networks (GNNs) caused by the highly connected structure of graph data. Traditional sampling-based methods reduce compute load but do not effectively cut the communication bottleneck of remote feature fetching. The key to the solution is RapidGNN, a distributed training framework with deterministic sampling-based scheduling that optimizes cache construction and remote feature prefetching, substantially reducing remote feature fetches and raising overall throughput. Experiments on benchmark graph datasets show average end-to-end throughput gains of 2.46x to 3.00x over baselines, 9.70x to 15.39x fewer remote feature fetches, near-linear scalability, and energy-efficiency improvements of 44% on CPU and 32% on GPU.
Link: https://arxiv.org/abs/2509.05207
Authors: Arefin Niam, Tevfik Kosar, M S Q Zulkar Nine
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: text overlap with arXiv:2505.10806
Abstract:Graph Neural Networks (GNNs) have become popular across a diverse set of tasks in exploring structural relationships between entities. However, due to the highly connected structure of the datasets, distributed training of GNNs on large-scale graphs poses significant challenges. Traditional sampling-based approaches mitigate the computational loads, yet the communication overhead remains a challenge. This paper presents RapidGNN, a distributed GNN training framework with deterministic sampling-based scheduling to enable efficient cache construction and prefetching of remote features. Evaluation on benchmark graph datasets demonstrates RapidGNN’s effectiveness across different scales and topologies. RapidGNN improves end-to-end training throughput by 2.46x to 3.00x on average over baseline methods across the benchmark datasets, while cutting remote feature fetches by over 9.70x to 15.39x. RapidGNN further demonstrates near-linear scalability with an increasing number of computing units efficiently. Furthermore, it achieves increased energy efficiency over the baseline methods for both CPU and GPU by 44% and 32%, respectively.
[AI-4] AI Agents for Web Testing: A Case Study in the Wild
【Quick Read】: This paper addresses the inability of traditional automated web testing, which focuses on code coverage and load testing, to capture complex user behaviour and surface the usability issues that affect user experience. The key to the solution is WebProber, an AI agent-based testing framework that autonomously explores a website, simulates realistic user interactions, identifies bugs and usability issues, and produces human-readable reports, enabling evaluations that better reflect real user behaviour.
Link: https://arxiv.org/abs/2509.05197
Authors: Naimeng Ye, Xiao Yu, Ruize Xu, Tianyi Peng, Zhou Yu
Institutions: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Automated web testing plays a critical role in ensuring high-quality user experiences and delivering business value. Traditional approaches primarily focus on code coverage and load testing, but often fall short of capturing complex user behaviors, leaving many usability issues undetected. The emergence of large language models (LLM) and AI agents opens new possibilities for web testing by enabling human-like interaction with websites and a general awareness of common usability problems. In this work, we present WebProber, a prototype AI agent-based web testing framework. Given a URL, WebProber autonomously explores the website, simulating real user interactions, identifying bugs and usability issues, and producing a human-readable report. We evaluate WebProber through a case study of 120 academic personal websites, where it uncovered 29 usability issues–many of which were missed by traditional tools. Our findings highlight agent-based testing as a promising direction while outlining directions for developing next-generation, user-centered testing frameworks.
[AI-5] Accuracy-Constrained CNN Pruning for Efficient and Reliable EEG-Based Seizure Detection
【Quick Read】: This paper addresses the poor fit of large, compute-hungry deep models (especially CNNs) for biomedical signals such as EEG-based seizure detection in real-time or resource-constrained settings. The key to the solution is a lightweight one-dimensional CNN combined with structured pruning, which preserves predictive performance while substantially reducing model complexity: removing 50% of convolutional kernels (ranked by their importance to model predictions) cuts weights and memory by 50% yet retains, and even slightly improves, accuracy (92.87%) and macro-F1 (0.8707), with mild early stopping mitigating overfitting. The result is efficient and reliable seizure detection suited to compute-limited scenarios.
Link: https://arxiv.org/abs/2509.05190
Authors: Mounvik K, N Harshit
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Deep learning models, especially convolutional neural networks (CNNs), have shown considerable promise for biomedical signals such as EEG-based seizure detection. However, these models come with challenges, primarily due to their size and compute requirements in environments where real-time detection or limited resources are available. In this study, we present a lightweight one-dimensional CNN model with structured pruning to improve efficiency and reliability. The model was trained with mild early stopping to address possible overfitting, achieving an accuracy of 92.78% and a macro-F1 score of 0.8686. Structured pruning of the baseline CNN involved removing 50% of the convolutional kernels based on their importance to model predictions. Surprisingly, after pruning the weights and memory by 50%, the new network was still able to maintain predictive capabilities, while modestly increasing precision to 92.87% and improving the macro-F1 score to 0.8707. Overall, we present a convincing case that structured pruning removes redundancy, improves generalization, and, in combination with mild early stopping, achieves a promising way forward to improve seizure detection efficiency and reliability, which is clear motivation for resource-limited settings.
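The paper ranks kernels by "their importance to model predictions" without giving the scoring rule; the PyTorch sketch below uses the common L1-norm proxy to keep the top half of a Conv1d layer's filters. Treat it as one plausible instantiation, not the paper's method.

```python
import torch
import torch.nn as nn

def prune_conv_filters(conv: nn.Conv1d, keep_ratio: float = 0.5) -> nn.Conv1d:
    # Structured pruning: drop whole kernels, here ranked by L1 weight norm
    # (a common importance proxy). Assumes bias=True and integer padding.
    importance = conv.weight.detach().abs().sum(dim=(1, 2))   # one score per filter
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    keep = torch.argsort(importance, descending=True)[:n_keep]
    pruned = nn.Conv1d(conv.in_channels, n_keep, conv.kernel_size[0],
                       stride=conv.stride[0], padding=conv.padding[0])
    pruned.weight.data = conv.weight.data[keep].clone()
    pruned.bias.data = conv.bias.data[keep].clone()
    return pruned   # downstream layers must be re-wired to n_keep channels
```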
[AI-6] Exploring Situated Stabilities of a Rhythm Generation System through Variational Cross-Examination
【Quick Read】: This paper asks how to explain the multistability exhibited by GrooveTransformer, a real-time rhythm generation system, across different artistic contexts, i.e., why a system not originally designed for diverse applications could adapt to and embed itself in multiple music-making settings. The key to the solution is Variational Cross-Examination (VCE), a postphenomenological framework that identifies three contributors to the emergent multistability: the affordances of system invariants, interdisciplinary collaboration, and the situated nature of the system's development. Through VCE, the paper shows how technologies mediate, co-shape, and are co-shaped by users and contexts, offering a descriptive and analytical approach for Digital Musical Instrument (DMI) design.
Link: https://arxiv.org/abs/2509.05145
Authors: Błażej Kotowski, Nicholas Evans, Behzad Haki, Frederic Font, Sergi Jordà
Institutions: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: AI Music Creativity 2025
Abstract:This paper investigates GrooveTransformer, a real-time rhythm generation system, through the postphenomenological framework of Variational Cross-Examination (VCE). By reflecting on its deployment across three distinct artistic contexts, we identify three stabilities: an autonomous drum accompaniment generator, a rhythmic control voltage sequencer in Eurorack format, and a rhythm driver for a harmonic accompaniment system. The versatility of its applications was not an explicit goal from the outset of the project. Thus, we ask: how did this multistability emerge? Through VCE, we identify three key contributors to its emergence: the affordances of system invariants, the interdisciplinary collaboration, and the situated nature of its development. We conclude by reflecting on the viability of VCE as a descriptive and analytical method for Digital Musical Instrument (DMI) design, emphasizing its value in uncovering how technologies mediate, co-shape, and are co-shaped by users and contexts.
[AI-7] Evaluation and Comparison Semantics for ODRL
【Quick Read】: This paper addresses the evaluation and comparison of computational policies in the Open Digital Rights Language (ODRL): although parts of the language have been formalized, a comprehensive formal semantics is still missing. The key to the solution is a simple and intuitive formal semantics based on query answering, which refines earlier formalizations and aligns with the latest published version of the language (2.2). Building on this evaluation semantics, and motivated by data sharing scenarios, the paper also defines and studies policy comparison for detecting equivalent, more restrictive, or more permissive policies.
Link: https://arxiv.org/abs/2509.05139
Authors: Jaime Osvaldo Salas, Paolo Pareti, Semih Yumuşak, Soulmaz Gheisari, Luis-Daniel Ibáñez, George Konstantinidis
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted as a full paper at the 14th International Joint Conference on Knowledge Graphs (IJCKG 2025). This is the submitted manuscript; the accepted manuscript will be published by Springer Nature
Abstract:We consider the problem of evaluating, and comparing computational policies in the Open Digital Rights Language (ODRL), which has become the de facto standard for governing the access and usage of digital resources. Although preliminary progress has been made on the formal specification of the language’s features, a comprehensive formal semantics of ODRL is still missing. In this paper, we provide a simple and intuitive formal semantics for ODRL that is based on query answering. Our semantics refines previous formalisations, and is aligned with the latest published specification of the language (2.2). Building on our evaluation semantics, and motivated by data sharing scenarios, we also define and study the problem of comparing two policies, detecting equivalent, more restrictive or more permissive policies.
[AI-8] GenAI-based test case generation and execution in SDV platform
【Quick Read】: This paper addresses the inefficiency, inconsistency, and poor cross-system compatibility of manually written test cases in automotive software testing. The key to the solution is GenAI-driven automated test case generation: Large Language Models and Vision-Language Models translate natural language requirements and system diagrams into structured Gherkin test cases, combined with Vehicle Signal Specification modelling to standardize vehicle signal definitions, improve compatibility across automotive subsystems, and streamline integration with third-party testing tools, with rapid validation in the open, vendor-neutral this http URL playground execution environment.
Link: https://arxiv.org/abs/2509.05112
Authors: Denesa Zyberaj, Lukasz Mazur, Nenad Petrovic, Pankhuri Verma, Pascal Hirmer, Dirk Slama, Xiangwei Cheng, Alois Knoll
Institutions: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper introduces a GenAI-driven approach for automated test case generation, leveraging Large Language Models and Vision-Language Models to translate natural language requirements and system diagrams into structured Gherkin test cases. The methodology integrates Vehicle Signal Specification modeling to standardize vehicle signal definitions, improve compatibility across automotive subsystems, and streamline integration with third-party testing tools. Generated test cases are executed within the this http URL playground, an open and vendor-neutral environment designed to facilitate rapid validation of software-defined vehicle functionalities. We evaluate our approach using the Child Presence Detection System use case, demonstrating substantial reductions in manual test specification effort and rapid execution of generated tests. Despite significant automation, the generation of test cases and test scripts still requires manual intervention due to current limitations in the GenAI pipeline and constraints of the this http URL platform.
[AI-9] ProToM: Promoting Prosocial Behaviour via Theory of Mind-Informed Feedback WWW
【Quick Read】: This paper addresses the difficulty of recognizing when and how to promote prosocial behaviour in multi-agent systems whose agents pursue independent goals, a difficulty that hinders cooperation. The key to the solution is ProToM, a Theory of Mind-informed facilitator that infers the distribution over agents' goals via Bayesian inverse planning and then, conditioned on that distribution, selects context-sensitive, targeted feedback by maximizing expected utility, significantly raising task success rates, shortening completion times, and winning the preference of human users.
Link: https://arxiv.org/abs/2509.05091
Authors: Matteo Bortoletto, Yichao Zhou, Lance Ying, Tianmin Shu, Andreas Bulling
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Website at this https URL
Abstract:While humans are inherently social creatures, the challenge of identifying when and how to assist and collaborate with others - particularly when pursuing independent goals - can hinder cooperation. To address this challenge, we aim to develop an AI system that provides useful feedback to promote prosocial behaviour - actions that benefit others, even when not directly aligned with one’s own goals. We introduce ProToM, a Theory of Mind-informed facilitator that promotes prosocial actions in multi-agent systems by providing targeted, context-sensitive feedback to individual agents. ProToM first infers agents’ goals using Bayesian inverse planning, then selects feedback to communicate by maximising expected utility, conditioned on the inferred goal distribution. We evaluate our approach against baselines in two multi-agent environments: Doors, Keys, and Gems, as well as Overcooked. Our results suggest that state-of-the-art large language and reasoning models fall short of communicating feedback that is both contextually grounded and well-timed - leading to higher communication overhead and task speedup. In contrast, ProToM provides targeted and helpful feedback, achieving a higher success rate, shorter task completion times, and is consistently preferred by human users.
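The two ingredients named in the abstract, Bayesian inverse planning over goals and expected-utility feedback selection, reduce to a few lines of NumPy in toy form. All the numbers and the two candidate feedback messages below are made up for illustration.

```python
import numpy as np

def goal_posterior(prior, likelihoods):
    # Bayesian inverse planning: P(goal | actions) ∝ P(actions | goal) P(goal).
    post = prior * likelihoods
    return post / post.sum()

def select_feedback(posterior, utility):
    # utility[f, g]: expected benefit of feedback f if the agent's goal is g.
    expected = utility @ posterior
    return int(np.argmax(expected))

prior = np.array([0.5, 0.5])            # two candidate goals for the agent
likelihoods = np.array([0.7, 0.2])      # P(observed actions | goal), from a planner
utility = np.array([[3.0, 0.0],         # targeted message, useful only for goal 0
                    [1.0, 1.0]])        # generic "coordinate" message
post = goal_posterior(prior, likelihoods)
best = select_feedback(post, utility)   # feedback with maximal expected utility
```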
[AI-10] Adversarial Augmentation and Active Sampling for Robust Cyber Anomaly Detection
【Quick Read】: This paper addresses the limitation of conventional supervised learning for detecting Advanced Persistent Threats (APTs), which is constrained by the scarcity of labeled data. The key to the solution is to combine autoencoder-based anomaly detection with active learning, iteratively querying an oracle for labels on uncertain or ambiguous samples, thereby minimizing labeling cost while significantly improving detection accuracy. The method targets extremely imbalanced real-world data, such as provenance traces from the DARPA Transparent Computing program spanning Android, Linux, BSD, and Windows, where APT-like attacks account for only 0.004% of the data; experiments show that the Attention Adversarial Dual AutoEncoder framework steadily improves detection during active learning and outperforms existing methods.
Link: https://arxiv.org/abs/2509.04999
Authors: Sidahmed Benabderrahmane, Talal Rahwan
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:
Abstract:Advanced Persistent Threats (APTs) present a considerable challenge to cybersecurity due to their stealthy, long-duration nature. Traditional supervised learning methods typically require large amounts of labeled data, which is often scarce in real-world scenarios. This paper introduces a novel approach that combines AutoEncoders for anomaly detection with active learning to iteratively enhance APT detection. By selectively querying an oracle for labels on uncertain or ambiguous samples, our method reduces labeling costs while improving detection accuracy, enabling the model to effectively learn with minimal data and reduce reliance on extensive manual labeling. We present a comprehensive formulation of the Attention Adversarial Dual AutoEncoder-based anomaly detection framework and demonstrate how the active learning loop progressively enhances the model’s performance. The framework is evaluated on real-world, imbalanced provenance trace data from the DARPA Transparent Computing program, where APT-like attacks account for just 0.004% of the data. The datasets, which cover multiple operating systems including Android, Linux, BSD, and Windows, are tested in two attack scenarios. The results show substantial improvements in detection rates during active learning, outperforming existing methods.
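A schematic of the anomaly-scoring and query-selection loop, assuming reconstruction error as the anomaly score and ambiguity (distance to the decision threshold) as the acquisition rule; the paper's Attention Adversarial Dual AutoEncoder and its exact acquisition function are more elaborate.

```python
import numpy as np

def reconstruction_scores(autoencoder, X):
    # Anomaly score = reconstruction error; APT-like traces are assumed to
    # reconstruct poorly. `autoencoder` is any callable returning X_hat.
    X_hat = autoencoder(X)
    return np.mean((X - X_hat) ** 2, axis=1)

def query_indices(scores, threshold, budget=10):
    # Active learning acquisition: query the oracle on the most ambiguous
    # samples, i.e. those whose scores sit closest to the threshold.
    ambiguity = np.abs(scores - threshold)
    return np.argsort(ambiguity)[:budget]

# Loop sketch: label the queried samples, fold them into training,
# refit the autoencoder and threshold, repeat until the budget is spent.
```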
[AI-11] LLM Enabled Multi-Agent System for 6G Networks: Framework and Method of Dual-Loop Edge-Terminal Collaboration
【Quick Read】: This paper addresses the difficulty of running LLM-enabled intelligent agents efficiently on resource-constrained terminals in 6G networks, where complex tool calls exceed the resources of a single device. The key to the solution is a multi-agent system framework with dual-loop terminal-edge collaboration: the outer loop iterates between a global agent and multiple sub-agents deployed on edge servers and terminals, strengthening planning through task decomposition and parallel sub-task distribution; the inner loop has role-specific sub-agents cyclically reason, execute, and replan sub-tasks, with parallel tool-call generation and offloading strategies to raise execution efficiency. The improved planning capability and execution efficiency are validated through a case study on 6G-supported urban safety governance.
Link: https://arxiv.org/abs/2509.04993
Authors: Zheyan Qu, Wenbo Wang, Zitong Yu, Boquan Sun, Yang Li, Xing Zhang
Institutions: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted by IEEE Communications Magazine
Abstract:The ubiquitous computing resources in 6G networks provide ideal environments for the fusion of large language models (LLMs) and intelligent services through the agent framework. With auxiliary modules and planning cores, LLM-enabled agents can autonomously plan and take actions to deal with diverse environment semantics and user intentions. However, the limited resources of individual network devices significantly hinder the efficient operation of LLM-enabled agents with complex tool calls, highlighting the urgent need for efficient multi-level device collaborations. To this end, the framework and method of the LLM-enabled multi-agent system with dual-loop terminal-edge collaborations are proposed in 6G networks. Firstly, the outer loop consists of the iterative collaborations between the global agent and multiple sub-agents deployed on edge servers and terminals, where the planning capability is enhanced through task decomposition and parallel sub-task distribution. Secondly, the inner loop utilizes sub-agents with dedicated roles to circularly reason, execute, and replan the sub-task, and the parallel tool calling generation with offloading strategies is incorporated to improve efficiency. The improved task planning capability and task execution efficiency are validated through the conducted case study in 6G-supported urban safety governance. Finally, the open challenges and future directions are thoroughly analyzed in 6G networks, accelerating the advent of the 6G era.
[AI-12] Internet 3.0: Architecture for a Web-of-Agents with its Algorithm for Ranking Agents
【Quick Read】: This paper addresses trustworthy, scalable agent ranking in the "Web of Agents": ranking agents dynamically and reliably by demonstrated performance rather than declared capabilities. The core obstacle is that usage signals are fragmented and private, so a globally transparent record of interactions does not exist and PageRank-style global ranking is infeasible. The key to the solution is the DOVIS protocol, a five-layer operational framework (Discovery, Orchestration, Verification, Incentives, Semantics) for collecting minimal, privacy-preserving aggregates of usage and performance across the ecosystem; on this substrate, the AgentRank-UC algorithm combines usage (selection frequency) with competence (outcome quality, cost, safety, latency) into a unified, dynamic, trust-aware ranking, with theoretical guarantees on convergence, robustness, and Sybil resistance, laying a technical foundation for a scalable, trustworthy Agentic Web.
Link: https://arxiv.org/abs/2509.04979
Authors: Rajesh Tembarai Krishnamachari, Srividya Rajesh
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:AI agents – powered by reasoning-capable large language models (LLMs) and integrated with tools, data, and web search – are poised to transform the internet into a Web of Agents: a machine-native ecosystem where autonomous agents interact, collaborate, and execute tasks at scale. Realizing this vision requires Agent Ranking – selecting agents not only by declared capabilities but by proven, recent performance. Unlike Web 1.0's PageRank, a global, transparent network of agent interactions does not exist; usage signals are fragmented and private, making ranking infeasible without coordination. We propose DOVIS, a five-layer operational protocol (Discovery, Orchestration, Verification, Incentives, Semantics) that enables the collection of minimal, privacy-preserving aggregates of usage and performance across the ecosystem. On this substrate, we implement AgentRank-UC, a dynamic, trust-aware algorithm that combines usage (selection frequency) and competence (outcome quality, cost, safety, latency) into a unified ranking. We present simulation results and theoretical guarantees on convergence, robustness, and Sybil resistance, demonstrating the viability of coordinated protocols and performance-aware ranking in enabling a scalable, trustworthy Agentic Web.
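The abstract names the two signals (usage and competence) but not the update rule, so the following NumPy sketch is only a guess at the flavour of AgentRank-UC: a PageRank-style iteration over a usage-derived transition matrix, blended with normalized competence scores. The blend weights alpha and beta and the toy matrices are assumptions.

```python
import numpy as np

def agent_rank_uc(usage, competence, alpha=0.85, beta=0.5, iters=100):
    # Hypothetical ranking in the spirit of AgentRank-UC; the paper's actual
    # algorithm is not specified in the abstract. Assumes every agent has
    # at least one outgoing selection (nonzero row sums).
    n = usage.shape[0]
    P = usage / usage.sum(axis=1, keepdims=True)    # who selects whom, row-stochastic
    c = competence / competence.sum()
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r_usage = alpha * P.T @ r + (1 - alpha) / n  # PageRank-style usage score
        r = beta * r_usage + (1 - beta) * c          # blend in competence
        r /= r.sum()
    return r

usage = np.array([[0, 3, 1], [2, 0, 4], [1, 1, 0]], dtype=float)
competence = np.array([0.9, 0.6, 0.8])               # aggregated quality signals
print(agent_rank_uc(usage, competence))
```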
[AI-13] DeGuV: Depth-Guided Visual Reinforcement Learning for Generalization and Interpretability in Manipulation
【Quick Read】: This paper addresses the weak generalization of reinforcement learning (RL) agents that learn complex tasks from visual input, especially maintaining robustness in new environments for robotics; existing data augmentation improves generalization but often hurts sample efficiency and training stability. The key to the solution is the DeGuV framework: a learnable masker network generates a mask from the depth input, keeping critical visual information and discarding irrelevant pixels so the agent focuses on essential features and becomes robust under augmentation; contrastive learning and stabilized Q-value estimation under augmentation further improve sample efficiency and training stability. Evaluated on the RL-ViGen benchmark with the Franka Emika robot, the method achieves zero-shot sim-to-real transfer that outperforms state-of-the-art approaches while also improving interpretability by highlighting the most relevant regions of the visual input.
Link: https://arxiv.org/abs/2509.04970
Authors: Tien Pham, Xinyun Chi, Khang Nguyen, Manfred Huber, Angelo Cangelosi
Institutions: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Reinforcement learning (RL) agents can learn to solve complex tasks from visual inputs, but generalizing these learned skills to new environments remains a major challenge in RL application, especially robotics. While data augmentation can improve generalization, it often compromises sample efficiency and training stability. This paper introduces DeGuV, an RL framework that enhances both generalization and sample efficiency. In specific, we leverage a learnable masker network that produces a mask from the depth input, preserving only critical visual information while discarding irrelevant pixels. Through this, we ensure that our RL agents focus on essential features, improving robustness under data augmentation. In addition, we incorporate contrastive learning and stabilize Q-value estimation under augmentation to further enhance sample efficiency and training stability. We evaluate our proposed method on the RL-ViGen benchmark using the Franka Emika robot and demonstrate its effectiveness in zero-shot sim-to-real transfer. Our results show that DeGuV outperforms state-of-the-art methods in both generalization and sample efficiency while also improving interpretability by highlighting the most relevant regions in the visual input
[AI-14] OSC: Cognitive Orchestration through Dynamic Knowledge Alignment in Multi-Agent LLM Collaboration EMNLP2025
【Quick Read】: This paper addresses a bottleneck in multi-agent systems built on large language models (LLMs): efficient linguistic interaction for deep collaboration among expert agents. Prior work has advanced agent selection and result aggregation but lacks mechanisms for perceiving collaborators' cognitive states and adapting communication accordingly. The key to the solution is the OSC (Orchestrating Cognitive Synergy) framework, whose core innovation is Collaborator Knowledge Models (CKM) that let each agent perceive its collaborators' cognitive states in real time and, through cognitive gap analysis, adaptively adjust communication behaviour (content focus, level of detail, expression style), significantly improving task performance and communication efficiency and turning "parallel-working individuals" into a "deeply collaborative cognitive team".
Link: https://arxiv.org/abs/2509.04876
Authors: Jusheng Zhang, Yijia Fan, Kaitong Cai, Xiaofei Sun, Keze Wang
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted at EMNLP 2025 (Long Paper)
Abstract:This paper introduces OSC (Orchestrating Cognitive Synergy), a knowledge-aware adaptive collaboration framework designed to enhance cognitive synergy in multi-agent systems with large language models. While prior work has advanced agent selection and result aggregation, efficient linguistic interactions for deep collaboration among expert agents remain a critical bottleneck. OSC addresses this gap as a pivotal intermediate layer between selection and aggregation, introducing Collaborator Knowledge Models (CKM) to enable each agent to dynamically perceive its collaborators' cognitive states. Through real-time cognitive gap analysis, agents adaptively adjust communication behaviors, including content focus, detail level, and expression style, using learned strategies. Experiments on complex reasoning and problem-solving benchmarks demonstrate that OSC significantly improves task performance and communication efficiency, transforming "parallel-working individuals" into a "deeply collaborative cognitive team." This framework not only optimizes multi-agent collaboration but also offers new insights into LLM agent interaction behaviors.
[AI-15] Cloning a Conversational Voice AI Agent from Call Recording Datasets for Telesales
【Quick Read】: This paper addresses how to clone a conversational voice AI agent from call recordings for automated customer interactions such as telesales. The core of the solution is a streaming inference pipeline integrating automatic speech recognition (ASR), an LLM-based dialogue manager, and text-to-speech (TTS) synthesis, with knowledge extraction and prompt engineering converting the structured playbooks of top-performing human agents into executable dialogue strategies. The approach effectively imitates human behaviour, approaching human performance on routine parts of calls, and provides an iterative analysis path for improving persuasion and objection handling.
Link: https://arxiv.org/abs/2509.04871
Authors: Krittanon Kaewtawee, Wachiravit Modecrua, Krittin Pachtrachai, Touchapon Kraisingkorn
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 4 figures
Abstract:Recent advances in language and speech modelling have made it possible to build autonomous voice assistants that understand and generate human dialogue in real time. These systems are increasingly being deployed in domains such as customer service and healthcare care, where they can automate repetitive tasks, reduce operational costs, and provide constant support around the clock. In this paper, we present a general methodology for cloning a conversational voice AI agent from a corpus of call recordings. Although the case study described in this paper uses telesales data to illustrate the approach, the underlying process generalizes to any domain where call transcripts are available. Our system listens to customers over the telephone, responds with a synthetic voice, and follows a structured playbook learned from top performing human agents. We describe the domain selection, knowledge extraction, and prompt engineering used to construct the agent, integrating automatic speech recognition, a large language model based dialogue manager, and text to speech synthesis into a streaming inference pipeline. The cloned agent is evaluated against human agents on a rubric of 22 criteria covering introduction, product communication, sales drive, objection handling, and closing. Blind tests show that the AI agent approaches human performance in routine aspects of the call while underperforming in persuasion and objection handling. We analyze these shortcomings and refine the prompt accordingly. The paper concludes with design lessons and avenues for future research, including large scale simulation and automated evaluation.
[AI-16] A Knowledge-Driven Diffusion Policy for End-to-End Autonomous Driving Based on Expert Routing
【Quick Read】: This paper addresses multi-modal action generation, temporal stability, and cross-scenario generalization in end-to-end autonomous driving, where existing methods collapse multi-modality, lose long-horizon consistency, or lack modular adaptability. The key to the solution is KDP (knowledge-driven diffusion policy), which couples generative diffusion modelling with a sparse mixture-of-experts (MoE) routing mechanism: the diffusion component generates temporally coherent, diverse action sequences, while expert routing activates specialized, reusable experts according to context, enabling modular knowledge composition and structured reuse. The design yields consistently higher success rates, lower collision risk, smoother control, and better interpretability.
Link: https://arxiv.org/abs/2509.04853
Authors: Chengkai Xu, Jiaqi Liu, Yicheng Guo, Peng Hang, Jian Sun
Institutions: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: this https URL
Abstract:End-to-end autonomous driving remains constrained by the need to generate multi-modal actions, maintain temporal stability, and generalize across diverse scenarios. Existing methods often collapse multi-modality, struggle with long-horizon consistency, or lack modular adaptability. This paper presents KDP, a knowledge-driven diffusion policy that integrates generative diffusion modeling with a sparse mixture-of-experts routing mechanism. The diffusion component generates temporally coherent and multi-modal action sequences, while the expert routing mechanism activates specialized and reusable experts according to context, enabling modular knowledge composition. Extensive experiments across representative driving scenarios demonstrate that KDP achieves consistently higher success rates, reduced collision risk, and smoother control compared to prevailing paradigms. Ablation studies highlight the effectiveness of sparse expert activation and the Transformer backbone, and activation analyses reveal structured specialization and cross-scenario reuse of experts. These results establish diffusion with expert routing as a scalable and interpretable paradigm for knowledge-driven end-to-end autonomous driving.
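KDP's router is described only as sparse mixture-of-experts routing; the PyTorch sketch below shows a standard top-k gate over linear experts, which conveys the mechanism (only k experts run per input) without claiming to match the paper's design. All layer sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseTopKRouter(nn.Module):
    # A standard top-k sparse-MoE gate over linear experts; it illustrates
    # sparse expert activation, not KDP's actual router.
    def __init__(self, dim: int, n_experts: int, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.k = k

    def forward(self, ctx, experts):
        logits = self.gate(ctx)                       # (B, n_experts)
        topv, topi = logits.topk(self.k, dim=-1)
        weights = F.softmax(topv, dim=-1)             # renormalise over top-k
        out = torch.zeros(ctx.size(0), experts[0].out_features)
        for slot in range(self.k):                    # sparse: only k experts fire
            for e, expert in enumerate(experts):
                mask = topi[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(1) * expert(ctx[mask])
        return out

experts = [nn.Linear(32, 8) for _ in range(4)]        # four hypothetical experts
router = SparseTopKRouter(dim=32, n_experts=4, k=2)
out = router(torch.randn(16, 32), experts)            # (16, 8)
```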
[AI-17] Collaboration and Conflict between Humans and Language Models through the Lens of Game Theory
【速读】:该论文旨在解决语言模型在长期多轮互动场景下(特别是博弈论框架中的重复囚徒困境,Iterated Prisoner’s Dilemma, IPD)的合作与竞争行为机制问题,尤其是现有研究多聚焦于短期或孤立情境,忽视了人类-模型协作、长期行为演化等关键因素。其解决方案的关键在于设计了一个类Axelrod锦标赛实验,将语言模型代理(language model agents)与240种经典策略进行系统性对抗测试,并通过行为分析揭示语言模型具备“友好性(niceness)、可挑衅性(provocability)和慷慨性(generosity)”等强合作策略的核心特征,同时展现出对对手策略变化的快速适应能力——在控制实验中仅需数轮即可检测并响应策略切换,表现媲美甚至超越人类水平。这一发现首次系统刻画了语言模型在长期交互中的合作行为动态,为未来复杂混合人机社会环境中的AI角色研究奠定了基础。
链接: https://arxiv.org/abs/2509.04847
作者: Mukul Singh,Arjun Radhakrishna,Sumit Gulwani
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages
Abstract:Language models are increasingly deployed in interactive online environments, from personal chat assistants to domain-specific agents, raising questions about their cooperative and competitive behavior in multi-party settings. While prior work has examined language model decision-making in isolated or short-term game-theoretic contexts, these studies often neglect long-horizon interactions, human-model collaboration, and the evolution of behavioral patterns over time. In this paper, we investigate the dynamics of language model behavior in the iterated prisoner’s dilemma (IPD), a classical framework for studying cooperation and conflict. We pit model-based agents against a suite of 240 well-established classical strategies in an Axelrod-style tournament and find that language models achieve performance on par with, and in some cases exceeding, the best-known classical strategies. Behavioral analysis reveals that language models exhibit key properties associated with strong cooperative strategies - niceness, provocability, and generosity while also demonstrating rapid adaptability to changes in opponent strategy mid-game. In controlled “strategy switch” experiments, language models detect and respond to shifts within only a few rounds, rivaling or surpassing human adaptability. These results provide the first systematic characterization of long-term cooperative behaviors in language model agents, offering a foundation for future research into their role in more complex, mixed human-AI social environments.
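The IPD setup itself is classical and easy to reproduce; a self-contained Python sketch with standard payoffs, tit-for-tat, and always-defect follows. In the paper's tournament, a language-model agent would simply be another strategy function in this kind of harness.

```python
T, R, P, S = 5, 3, 1, 0   # standard prisoner's dilemma payoffs (T > R > P > S)

def payoff(a, b):          # 'C' = cooperate, 'D' = defect
    return {('C', 'C'): (R, R), ('C', 'D'): (S, T),
            ('D', 'C'): (T, S), ('D', 'D'): (P, P)}[(a, b)]

def tit_for_tat(my_hist, opp_hist):
    # Nice, provocable, forgiving: cooperate first, then mirror the opponent.
    return 'C' if not opp_hist else opp_hist[-1]

def always_defect(my_hist, opp_hist):
    return 'D'

def play(s1, s2, rounds=200):
    h1, h2, score1, score2 = [], [], 0, 0
    for _ in range(rounds):
        a, b = s1(h1, h2), s2(h2, h1)
        p1, p2 = payoff(a, b)
        h1.append(a); h2.append(b)
        score1 += p1; score2 += p2
    return score1, score2

print(play(tit_for_tat, always_defect))   # (199, 204): TFT loses one round, then mirrors
```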
[AI-18] REMOTE: A Unified Multimodal Relation Extraction Framework with Multilevel Optimal Transport and Mixture-of-Experts ACM-MM2025
【Quick Read】: This paper addresses limitations of existing multimodal relation extraction (MRE) methods: they typically extract only specific types of relational triplets, struggle to capture cross-modal relations between textual entities and visual objects simultaneously, and cannot dynamically select optimal interaction features per relation type; moreover, the multilayer sequential structure of conventional encoders often loses low-level information, weakening representations. The key to the solution is the unified framework REMOTE, whose core innovations are: (1) a multilevel optimal transport fusion module that preserves low-level detail while maintaining multilayer encoding, enabling multilevel semantic encoding; and (2) a mixture-of-experts mechanism that dynamically selects the most relevant modality information for interaction modelling per relation type, improving the flexibility and expressiveness of cross-modal relation extraction.
Link: https://arxiv.org/abs/2509.04844
Authors: Xinkui Lin, Yongxiu Xu, Minghao Tang, Shilong Zhang, Hongbo Xu, Hao Xu, Yubin Wang
Institutions: Unknown
Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: ACM MM 2025
Abstract:Multimodal relation extraction (MRE) is a crucial task in the fields of Knowledge Graph and Multimedia, playing a pivotal role in multimodal knowledge graph construction. However, existing methods are typically limited to extracting a single type of relational triplet, which restricts their ability to extract triplets beyond the specified types. Directly combining these methods fails to capture dynamic cross-modal interactions and introduces significant computational redundancy. Therefore, we propose a novel unified multimodal Relation Extraction framework with Multilevel Optimal Transport and mixture-of-Experts, termed REMOTE, which can simultaneously extract intra-modal and inter-modal relations between textual entities and visual objects. To dynamically select optimal interaction features for different types of relational triplets, we introduce mixture-of-experts mechanism, ensuring the most relevant modality information is utilized. Additionally, considering that the inherent property of multilayer sequential encoding in existing encoders often leads to the loss of low-level information, we adopt a multilevel optimal transport fusion module to preserve low-level features while maintaining multilayer encoding, yielding more expressive representations. Correspondingly, we also create a Unified Multimodal Relation Extraction (UMRE) dataset to evaluate the effectiveness of our framework, encompassing diverse cases where the head and tail entities can originate from either text or image. Extensive experiments show that REMOTE effectively extracts various types of relational triplets and achieves state-of-the-art performance on almost all metrics across two other public MRE datasets. We release our resources at this https URL.
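Entropic optimal transport, the core primitive behind the fusion module, can be sketched with a few Sinkhorn iterations in NumPy; the cost matrix, uniform marginals, and regularization strength below are generic assumptions rather than REMOTE's actual configuration.

```python
import numpy as np

def sinkhorn(cost, eps=0.05, iters=200):
    # Entropic optimal transport between two uniform marginals. The plan
    # can weight how features from one encoder level align with another.
    n, m = cost.shape
    a, b = np.ones(n) / n, np.ones(m) / m       # uniform marginals
    K = np.exp(-cost / eps)
    v = np.ones(m)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]          # transport plan, rows sum to a

cost = np.random.rand(4, 4)                     # e.g. pairwise feature distances
plan = sinkhorn(cost)
assert np.allclose(plan.sum(axis=1), 0.25, atol=1e-6)
```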
[AI-19] TalkToAgent: A Human-centric Explanation of Reinforcement Learning Agents with Large Language Models
【Quick Read】: This paper addresses the gap between complex reinforcement learning (RL) policies and domain experts: current explainable RL (XRL) results are of limited comprehensibility, tools are siloed, and users are left uncertain which to employ. The key to the solution is TalkToAgent, a multi-agent large language model (LLM) framework whose core comprises five specialized LLM agents (Coordinator, Explainer, Coder, Evaluator, and Debugger) that automatically map natural-language user queries to the relevant XRL tool and explain agent behaviour in terms of key state variables, expected outcomes, or counterfactual explanations; it further extends traditional counterfactual explanations by deriving alternative scenarios from qualitative behavioural descriptions or even rule-based policies, improving the transparency of RL policies and human-machine collaboration.
Link: https://arxiv.org/abs/2509.04809
Authors: Haechang Kim, Hao Chen, Can Li, Jong Min Lee
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 31 pages total
Abstract:Explainable Reinforcement Learning (XRL) has emerged as a promising approach in improving the transparency of Reinforcement Learning (RL) agents. However, there remains a gap between complex RL policies and domain experts, due to the limited comprehensibility of XRL results and isolated coverage of current XRL approaches that leave users uncertain about which tools to employ. To address these challenges, we introduce TalkToAgent, a multi-agent Large Language Models (LLM) framework that delivers interactive, natural language explanations for RL policies. The architecture with five specialized LLM agents (Coordinator, Explainer, Coder, Evaluator, and Debugger) enables TalkToAgent to automatically map user queries to relevant XRL tools and clarify an agent’s actions in terms of either key state variables, expected outcomes, or counterfactual explanations. Moreover, our approach extends previous counterfactual explanations by deriving alternative scenarios from qualitative behavioral descriptions, or even new rule-based policies. We validated TalkToAgent on quadruple-tank process control problem, a well-known nonlinear control benchmark. Results demonstrated that TalkToAgent successfully mapped user queries into XRL tasks with high accuracy, and coder-debugger interactions minimized failures in counterfactual generation. Furthermore, qualitative evaluation confirmed that TalkToAgent effectively interpreted agent’s actions and contextualized their meaning within the problem domain.
[AI-20] What-If Analysis of Large Language Models : Explore the Game World Using Proactive Thinking
【Quick Read】: This paper addresses the lack of proactive reasoning in large language models (LLMs) in dynamic, high-stakes settings: models cannot systematically explore hypothetical futures, i.e., answer questions of the form "if we take this action, how will it affect the final outcome". The core solution is the WiA-LLM paradigm, whose key is to integrate What-If Analysis (WIA) with reinforcement learning so the model dynamically simulates the consequences of each potential action via environmental feedback, forecasting future states rather than merely reacting to the present. This mechanism markedly improves strategic decision-making and multi-step consequence prediction in complex, fast-changing environments.
Link: https://arxiv.org/abs/2509.04791
Authors: Yuan Sui, Yanming Zhang, Yi Liao, Yu Gu, Guohua Tang, Zhongqian Sun, Wei Yang, Bryan Hooi
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: arXiv admin note: text overlap with arXiv:2508.21365
Abstract:Large language models (LLMs) excel at processing information reactively but lack the ability to systemically explore hypothetical futures. They cannot ask, “what if we take this action? how will it affect the final outcome” and forecast its potential consequences before acting. This critical gap limits their utility in dynamic, high-stakes scenarios like strategic planning, risk assessment, and real-time decision making. To bridge this gap, we propose WiA-LLM, a new paradigm that equips LLMs with proactive thinking capabilities. Our approach integrates What-If Analysis (WIA), a systematic approach for evaluating hypothetical scenarios by changing input variables. By leveraging environmental feedback via reinforcement learning, WiA-LLM moves beyond reactive thinking. It dynamically simulates the outcomes of each potential action, enabling the model to anticipate future states rather than merely react to the present conditions. We validate WiA-LLM in Honor of Kings (HoK), a complex multiplayer game environment characterized by rapid state changes and intricate interactions. The game’s real-time state changes require precise multi-step consequence prediction, making it an ideal testbed for our approach. Experimental results demonstrate WiA-LLM achieves a remarkable 74.2% accuracy in forecasting game-state changes (up to two times gain over baselines). The model shows particularly significant gains in high-difficulty scenarios where accurate foresight is critical. To our knowledge, this is the first work to formally explore and integrate what-if analysis capabilities within LLMs. WiA-LLM represents a fundamental advance toward proactive reasoning in LLMs, providing a scalable framework for robust decision-making in dynamic environments with broad implications for strategic applications.
[AI-21] Graph Unlearning: Efficient Node Removal in Graph Neural Networks
【Quick Read】: This paper addresses the privacy risk of leaking information about sensitive training nodes in graph neural networks (GNNs), and in particular the limitations of existing node unlearning methods: restrictions on GNN structure, under-use of graph topology, or poor performance-complexity trade-offs. The key to the solution is three novel node unlearning methods: Class-based Label Replacement, Topology-guided Neighbor Mean Posterior Probability, and Class-consistent Neighbor Node Filtering; the latter two effectively exploit the topological features of the graph, markedly improving the balance between unlearning efficiency and model performance, enabling efficient removal of sensitive training nodes while protecting the privacy of GNN models.
Link: https://arxiv.org/abs/2509.04785
Authors: Faqian Guan, Tianqing Zhu, Zhoutian Wang, Wei Ren, Wanlei Zhou
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:With increasing concerns about privacy attacks and potential sensitive information leakage, researchers have actively explored methods to efficiently remove sensitive training data and reduce privacy risks in graph neural network (GNN) models. Node unlearning has emerged as a promising technique for protecting the privacy of sensitive nodes by efficiently removing specific training node information from GNN models. However, existing node unlearning methods either impose restrictions on the GNN structure or do not effectively utilize the graph topology for node unlearning. Some methods even compromise the graph’s topology, making it challenging to achieve a satisfactory performance-complexity trade-off. To address these issues and achieve efficient unlearning for training node removal in GNNs, we propose three novel node unlearning methods: Class-based Label Replacement, Topology-guided Neighbor Mean Posterior Probability, and Class-consistent Neighbor Node Filtering. Among these methods, Topology-guided Neighbor Mean Posterior Probability and Class-consistent Neighbor Node Filtering effectively leverage the topological features of the graph, resulting in more effective node unlearning. To validate the superiority of our proposed methods in node unlearning, we conducted experiments on three benchmark datasets. The evaluation criteria included model utility, unlearning utility, and unlearning efficiency. The experimental results demonstrate the utility and efficiency of the proposed methods and illustrate their superiority compared to state-of-the-art node unlearning methods. Overall, the proposed methods efficiently remove sensitive training nodes and protect the privacy information of sensitive nodes in GNNs. The findings contribute to enhancing the privacy and security of GNN models and provide valuable insights into the field of node unlearning.
[AI-22] VARMA-Enhanced Transformer for Time Series Forecasting PRICAI2025
【Quick Read】: This paper addresses a gap left by recent Transformer-based forecasters such as CATS: although removing self-attention improves efficiency and accuracy, it overlooks the fine-grained local temporal dependencies that classical statistical models such as VARMA capture well. The key to the solution is the VARMAformer architecture, which fuses classical statistical modeling with a modern deep learning framework through two core innovations: (1) a VARMA-inspired Feature Extractor (VFE) that explicitly models autoregressive (AR) and moving-average (MA) patterns at the patch level, and (2) a VARMA-Enhanced Attention (VE-atten) mechanism that introduces a temporal gate to make queries more context-aware. This design lets the model capture both global long-range dependencies and local statistical structure, consistently outperforming state-of-the-art methods on multiple benchmark datasets.
Link: https://arxiv.org/abs/2509.04782
Authors: Jiajun Song,Xiaoou Liu
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: The Pacific Rim International Conference on Artificial Intelligence - PRICAI2025
Abstract:Transformer-based models have significantly advanced time series forecasting. Recent work, like the Cross-Attention-only Time Series transformer (CATS), shows that removing self-attention can make the model more accurate and efficient. However, these streamlined architectures may overlook the fine-grained, local temporal dependencies effectively captured by classical statistical models like Vector AutoRegressive Moving Average model (VARMA). To address this gap, we propose VARMAformer, a novel architecture that synergizes the efficiency of a cross-attention-only framework with the principles of classical time series analysis. Our model introduces two key innovations: (1) a dedicated VARMA-inspired Feature Extractor (VFE) that explicitly models autoregressive (AR) and moving-average (MA) patterns at the patch level, and (2) a VARMA-Enhanced Attention (VE-atten) mechanism that employs a temporal gate to make queries more context-aware. By fusing these classical insights into a modern backbone, VARMAformer captures both global, long-range dependencies and local, statistical structures. Through extensive experiments on widely-used benchmark datasets, we demonstrate that our model consistently outperforms existing state-of-the-art methods. Our work validates the significant benefit of integrating classical statistical insights into modern deep learning frameworks for time series forecasting.
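To make the patch-level AR/MA idea concrete, here is a minimal NumPy sketch of extracting autoregressive and moving-average style features from a single patch. This is an illustrative reconstruction of the concept behind the VFE, not the paper's implementation; the function name, lag orders, and feature choices are assumptions.

```python
import numpy as np

def patch_ar_ma_features(patch, p=2, q=2):
    """Illustrative AR/MA feature extraction for one 1-D patch: fit an AR(p)
    model by least squares, then summarize residuals with a short moving
    average as a crude MA(q)-style statistic (hypothetical sketch)."""
    x = np.asarray(patch, dtype=float)
    # Lag matrix for AR(p): predict x[t] from x[t-1], ..., x[t-p].
    X = np.column_stack([x[p - k - 1 : len(x) - k - 1] for k in range(p)])
    y = x[p:]
    ar_coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ ar_coef
    # Moving average of the most recent residuals as the MA-style feature.
    ma_feat = np.convolve(resid, np.ones(q) / q, mode="valid")[-1]
    return np.concatenate([ar_coef, [ma_feat]])

patch = np.sin(np.linspace(0, 6, 16)) + 0.1 * np.random.randn(16)
print(patch_ar_ma_features(patch))  # p AR coefficients + 1 MA-style feature
```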
[AI-23] The LLM Has Left The Chat: Evidence of Bail Preferences in Large Language Models
【Quick Read】: This paper investigates whether large language models (LLMs) will choose to leave a conversation (bail) when given the option. The key of the study is the design of three bail mechanisms: a callable tool (bail tool), an output string (bail string), and a prompt asking the model whether it wants to leave (bail prompt), with bail rates measured on continuations of real-world conversations (Wildchat and ShareGPT). Bail rates vary widely across models and methods (0.28%-32%), but after accounting for model-specific effects and false positives (such as the 22% rate on the bail prompt), real-world bail rates may be as low as 0.06%-7%. The authors further build a non-exhaustive taxonomy of bail cases from these observations and construct the BailBench dataset for systematically evaluating bail behavior, revealing a complex relationship between bailing and refusals: some bails occur without a corresponding refusal, and jailbreaks and refusal-removal techniques affect the two differently, indicating that bail behavior cannot simply be predicted from refusal rates.
Link: https://arxiv.org/abs/2509.04781
Authors: Danielle Ensign,Henry Sleight,Kyle Fish
Institution: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:When given the option, will LLMs choose to leave the conversation (bail)? We investigate this question by giving models the option to bail out of interactions using three different bail methods: a bail tool the model can call, a bail string the model can output, and a bail prompt that asks the model if it wants to leave. On continuations of real world data (Wildchat and ShareGPT), all three of these bail methods find models will bail around 0.28-32% of the time (depending on the model and bail method). However, we find that bail rates can depend heavily on the model used for the transcript, which means we may be overestimating real world bail rates by up to 4x. If we also take into account false positives on bail prompt (22%), we estimate real world bail rates range from 0.06-7%, depending on the model and bail method. We use observations from our continuations of real world data to construct a non-exhaustive taxonomy of bail cases, and use this taxonomy to construct BailBench: a representative synthetic dataset of situations where some models bail. We test many models on this dataset, and observe some bail behavior occurring for most of them. Bail rates vary substantially between models, bail methods, and prompt wordings. Finally, we study the relationship between refusals and bails. We find: 1) 0-13% of continuations of real world conversations resulted in a bail without a corresponding refusal; 2) jailbreaks tend to decrease refusal rates, but increase bail rates; 3) refusal abliteration increases no-refuse bail rates, but only for some bail methods; and 4) refusal rate on BailBench does not appear to predict bail rate.
[AI-24] SePA: A Search-enhanced Predictive Agent for Personalized Health Coaching
【Quick Read】: This paper tackles two problems in current health coaching systems, insufficient personalization and unreliable information: how to provide individualized health guidance that is grounded in reliable evidence and remains contextually relevant. The key to the solution is SePA (Search-enhanced Predictive AI Agent), a novel LLM health coaching system with two core innovations: first, individualized predictive models built from wearable sensor data that estimate daily stress, soreness, and injury risk; second, a retrieval-augmented generation (RAG) module that grounds LLM-generated feedback in expert-vetted web content, improving the accuracy and trustworthiness of the advice. Empirical results show that the approach outperforms generalized baselines in predictive performance and is preferred by users, offering a transparent and efficient blueprint for next-generation trustworthy personal health informatics systems.
Link: https://arxiv.org/abs/2509.04752
Authors: Melik Ozolcer,Sang Won Bae
Institution: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI'25). 7 pages, 5 figures, 3 tables
Abstract:This paper introduces SePA (Search-enhanced Predictive AI Agent), a novel LLM health coaching system that integrates personalized machine learning and retrieval-augmented generation to deliver adaptive, evidence-based guidance. SePA combines: (1) Individualized models predicting daily stress, soreness, and injury risk from wearable sensor data (28 users, 1260 data points); and (2) A retrieval module that grounds LLM-generated feedback in expert-vetted web content to ensure contextual relevance and reliability. Our predictive models, evaluated with rolling-origin cross-validation and group k-fold cross-validation, show that personalized models outperform generalized baselines. In a pilot expert study (n=4), SePA's retrieval-based advice was preferred over a non-retrieval baseline, yielding a meaningful practical effect (Cliff's δ = 0.3, p = 0.05). We also quantify latency-performance trade-offs between response quality and speed, offering a transparent blueprint for next-generation, trustworthy personal health informatics systems.
[AI-25] CoVeR: Conformal Calibration for Versatile and Reliable Autoregressive Next-Token Prediction
【Quick Read】: This paper addresses two limitations of mainstream decoding strategies (such as beam search) for autoregressive pretrained models on complex reasoning tasks: the lack of provable coverage guarantees, and the difficulty of balancing search efficiency against versatile trajectories, especially long-tail sequences. The key to the solution is CoVeR, a model-free decoding strategy within the conformal prediction framework that maintains a compact search space while ensuring high coverage of desirable trajectories. Theoretically, the method establishes a PAC-style generalization bound showing an asymptotic coverage rate of at least 1 - α for any target level α ∈ (0,1), providing a rigorous statistical guarantee.
Link: https://arxiv.org/abs/2509.04733
Authors: Yuzhu Chen,Yingjie Wang,Shunyu Liu,Yongcheng Jing,Dacheng Tao
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Autoregressive pre-trained models combined with decoding methods have achieved impressive performance on complex reasoning tasks. While mainstream decoding strategies such as beam search can generate plausible candidate sets, they often lack provable coverage guarantees, and struggle to effectively balance search efficiency with the need for versatile trajectories, particularly those involving long-tail sequences that are essential in certain real-world applications. To address these limitations, we propose CoVeR, a novel model-free decoding strategy within the conformal prediction framework that simultaneously maintains a compact search space and ensures high coverage probability over desirable trajectories. Theoretically, we establish a PAC-style generalization bound, guaranteeing that CoVeR asymptotically achieves a coverage rate of at least 1 - α for any target level α ∈ (0,1).
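For readers who want the conformal background behind this coverage claim, the standard split-conformal construction is shown below; it is textbook material, not necessarily CoVeR's exact procedure, and the nonconformity score s is left abstract.

```latex
% Split conformal prediction: calibrate a score threshold on n held-out
% pairs, then return the set of candidates scoring below it.
\[
  \hat{q} = \mathrm{Quantile}\!\left(\{ s(x_i, y_i) \}_{i=1}^{n};\;
            \tfrac{\lceil (n+1)(1-\alpha) \rceil}{n} \right), \qquad
  \mathcal{C}(x) = \{\, y : s(x, y) \le \hat{q} \,\},
\]
\[
  \Pr\big( y_{n+1} \in \mathcal{C}(x_{n+1}) \big) \ge 1 - \alpha
  \quad \text{whenever } (x_i, y_i) \text{ are exchangeable.}
\]
```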
[AI-26] Bootstrapping Reinforcement Learning with Sub-optimal Policies for Autonomous Driving
【Quick Read】: This paper addresses the training challenges reinforcement learning (RL) faces in automated vehicle control, namely poor sample efficiency and ineffective exploration, which make it hard for RL agents to discover an optimal driving strategy. The key to the solution is to guide the RL driving agent with a demonstration policy that need not be a highly optimized or expert-level controller: by integrating a rule-based lane-change controller with the Soft Actor-Critic (SAC) algorithm, the approach enhances exploration and learning efficiency, improves driving performance, and can be extended to other driving scenarios that similarly benefit from demonstration-based guidance.
Link: https://arxiv.org/abs/2509.04712
Authors: Zhihao Zhang,Chengyang Peng,Ekim Yurtsever,Keith A. Redmill
Institution: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:
Abstract:Automated vehicle control using reinforcement learning (RL) has attracted significant attention due to its potential to learn driving policies through environment interaction. However, RL agents often face training challenges in sample efficiency and effective exploration, making it difficult to discover an optimal driving strategy. To address these issues, we propose guiding the RL driving agent with a demonstration policy that need not be a highly optimized or expert-level controller. Specifically, we integrate a rule-based lane change controller with the Soft Actor Critic (SAC) algorithm to enhance exploration and learning efficiency. Our approach demonstrates improved driving performance and can be extended to other driving scenarios that can similarly benefit from demonstration-based guidance.
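One simple way such demonstration guidance can be wired in is to occasionally substitute the rule-based controller's action during exploration. The sketch below uses hypothetical interfaces (ToySAC, demo_policy); the paper's actual integration, for instance through the replay buffer or the objective, may differ.

```python
import random

class ToySAC:
    """Stand-in for a trained SAC policy (hypothetical interface)."""
    def sample(self, obs):
        return 0.0  # placeholder continuous action

def guided_action(sac_policy, demo_policy, obs, demo_prob=0.2):
    """With probability demo_prob, follow the rule-based demonstration
    controller; otherwise sample from the SAC policy as usual."""
    if random.random() < demo_prob:
        return demo_policy(obs)    # e.g., a rule-based lane-change action
    return sac_policy.sample(obs)  # standard SAC exploration

print(guided_action(ToySAC(), lambda obs: 1.0, obs=None))
```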
[AI-27] An Approach to Grounding AI Model Evaluations in Human-derived Criteria
【Quick Read】: This paper addresses the limitations of traditional AI benchmarks in capturing the nuanced capabilities of AI models, focusing on physical-world modeling, where existing evaluations poorly reflect the interpretability and applicability of model behavior. The key to the solution is augmenting benchmarks with human-derived evaluation criteria: through in-depth interviews and large-scale surveys grounded in the Perception Test and OpenEQA benchmarks, the study identifies key cognitive skills such as Prioritization, Memorizing, Discerning, and Contextualizing that are critical for both AI and human reasoning, and integrates these dimensions into benchmark design, yielding a framework for defining and measuring progress in a more human-aligned way.
Link: https://arxiv.org/abs/2509.04676
Authors: Sasha Mitts
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 4 figures, 6 pages, presented at CHI 2025 Workshop on Human-AI Interaction for Augmented Reasoning
Abstract:In the rapidly evolving field of artificial intelligence (AI), traditional benchmarks can fall short in attempting to capture the nuanced capabilities of AI models. We focus on the case of physical world modeling and propose a novel approach to augment existing benchmarks with human-derived evaluation criteria, aiming to enhance the interpretability and applicability of model behaviors. Grounding our study in the Perception Test and OpenEQA benchmarks, we conducted in-depth interviews and large-scale surveys to identify key cognitive skills, such as Prioritization, Memorizing, Discerning, and Contextualizing, that are critical for both AI and human reasoning. Our findings reveal that participants perceive AI as lacking in interpretive and empathetic skills yet hold high expectations for AI performance. By integrating insights from our findings into benchmark design, we offer a framework for developing more human-aligned means of defining and measuring progress. This work underscores the importance of user-centered evaluation in AI development, providing actionable guidelines for researchers and practitioners aiming to align AI capabilities with human cognitive processes. Our approach both enhances current benchmarking practices and sets the stage for future advancements in AI model evaluation.
[AI-28] Interpreting Transformer Architectures as Implicit Multinomial Regression
【Quick Read】: This paper addresses the interpretability of attention in modern machine learning models: despite its central role in Transformers, the mathematical underpinnings of attention and its relationship to the evolution of internal representations, feature polysemanticity, superposition, and model performance remain poorly understood. The key to the solution is a theoretical connection between attention mechanisms and multinomial regression: in a fixed multinomial regression setting, optimizing over latent features yields optimal solutions that align with the dynamics induced by attention blocks, suggesting that the layer-by-layer evolution of representations through a Transformer can be interpreted as a trajectory that recovers the optimal features for classification.
Link: https://arxiv.org/abs/2509.04653
Authors: Jonas A. Actor,Anthony Gruber,Eric C. Cyr
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Comments:
Abstract:Mechanistic interpretability aims to understand how internal components of modern machine learning models, such as weights, activations, and layers, give rise to the model’s overall behavior. One particularly opaque mechanism is attention: despite its central role in transformer models, its mathematical underpinnings and relationship to concepts like feature polysemanticity, superposition, and model performance remain poorly understood. This paper establishes a novel connection between attention mechanisms and multinomial regression. Specifically, we show that in a fixed multinomial regression setting, optimizing over latent features yields optimal solutions that align with the dynamics induced by attention blocks. In other words, the evolution of representations through a transformer can be interpreted as a trajectory that recovers the optimal features for classification.
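The structural parallel driving this interpretation can be written down directly: attention weights and multinomial-regression class probabilities are both softmax functions of inner products. The notation below is illustrative and does not reproduce the paper's exact setting.

```latex
% Softmax appears in both objects: attention weights over keys (left) and
% multinomial (softmax) regression over classes (right).
\[
  \mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d}} \right) V,
  \qquad
  \Pr(y = c \mid z) = \frac{\exp(w_c^{\top} z)}{\sum_{c'} \exp(w_{c'}^{\top} z)}.
\]
```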
[AI-29] Towards Personalized Explanations for Health Simulations: A Mixed-Methods Framework for Stakeholder-Centric Summarization AAAI2025
【Quick Read】: This paper addresses the accessibility problem caused by the complexity of health simulation models, which prevents stakeholders (clinicians, policymakers, patients, caregivers, and health advocates) from understanding or using model outputs; current LLM-based summarization typically produces one-size-fits-all content that ignores differing information needs and stylistic preferences. The key to the solution is a step-by-step framework: a mixed-methods design first elicits the explanation needs and stylistic preferences of diverse health stakeholders, then optimizes LLMs to generate tailored explanations (e.g., via controllable attribute tuning), and finally evaluates the output against a comprehensive range of metrics to further improve tailored summary generation, enabling accurate, personalized explanations of health simulations.
Link: https://arxiv.org/abs/2509.04646
Authors: Philippe J. Giabbanelli,Ameeta Agrawal
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: Accepted at the AAAI 2025 Fall Symposium Series. November 6-8, 2025, Arlington, VA, USA
Abstract:Modeling & Simulation (M&S) approaches such as agent-based models hold significant potential to support decision-making activities in health, with recent examples including the adoption of vaccines, and a vast literature on healthy eating behaviors and physical activity behaviors. These models are potentially usable by different stakeholder groups, as they support policy-makers in estimating the consequences of potential interventions and they can guide individuals in making healthy choices in complex environments. However, this potential may not be fully realized because of the models' complexity, which makes them inaccessible to the stakeholders who could benefit the most. While Large Language Models (LLMs) can translate simulation outputs and the design of models into text, current approaches typically rely on one-size-fits-all summaries that fail to reflect the varied informational needs and stylistic preferences of clinicians, policymakers, patients, caregivers, and health advocates. This limitation stems from a fundamental gap: we lack a systematic understanding of what these stakeholders need from explanations and how to tailor them accordingly. To address this gap, we present a step-by-step framework to identify stakeholder needs and guide LLMs in generating tailored explanations of health simulations. Our procedure uses a mixed-methods design by first eliciting the explanation needs and stylistic preferences of diverse health stakeholders, then optimizing the ability of LLMs to generate tailored outputs (e.g., via controllable attribute tuning), and then evaluating through a comprehensive range of metrics to further improve the tailored generation of summaries.
[AI-30] Scaling Environments for Organoid Intelligence with LLM-Automated Design and Plasticity-Based Evaluation
【Quick Read】: This paper addresses how to design environments that can effectively shape the behavior and capabilities of biological neural networks such as neural organoids, in order to probe learning mechanisms like long-term potentiation (LTP) and long-term depression (LTD). The key to the solution is a closed-loop virtual-environment framework with three task environments of increasing complexity (a conditional avoidance task, a one-dimensional predator-prey scenario, and a replication of the classic game Pong) for training organoid-based agents; a meta-learning approach in which a large language model (LLM) automates the generation and optimization of experimental protocols, scaling environment and curriculum design; and a multimodal evaluation approach that measures synaptic plasticity at the electrophysiological, cellular, and molecular levels, bridging computational neuroscience and agent-based AI.
Link: https://arxiv.org/abs/2509.04633
Authors: Brennen Hill
Institution: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments:
Abstract:As the complexity of artificial agents increases, the design of environments that can effectively shape their behavior and capabilities has become a critical research frontier. We propose a framework that extends this principle to a novel class of agents: biological neural networks in the form of neural organoids. This paper introduces three scalable, closed-loop virtual environments designed to train organoid-based biological agents and probe the underlying mechanisms of learning, such as long-term potentiation (LTP) and long-term depression (LTD). We detail the design of three distinct task environments with increasing complexity: (1) a conditional avoidance task, (2) a one-dimensional predator-prey scenario, and (3) a replication of the classic Pong game. For each environment, we formalize the state and action spaces, the sensory encoding and motor decoding mechanisms, and the feedback protocols based on predictable (reward) and unpredictable (punishment) stimulation. Furthermore, we propose a novel meta-learning approach where a Large Language Model (LLM) is used to automate the generation and optimization of experimental protocols, scaling the process of environment and curriculum design. Finally, we outline a multi-modal approach for evaluating learning by measuring synaptic plasticity at electrophysiological, cellular, and molecular levels. This work bridges the gap between computational neuroscience and agent-based AI, offering a unique platform for studying embodiment, learning, and intelligence in a controlled biological substrate.
[AI-31] Schema Inference for Tabular Data Repositories Using Large Language Models
【Quick Read】: This paper addresses how to automatically infer a structured conceptual schema from minimally curated tabular data when metadata is sparse; such data often carry representational inconsistencies across heterogeneous sources, which makes schema inference difficult. The key to the solution is SI-LLM (Schema Inference using Large Language Models), which relies only on column headers and cell values and uses large language models (LLMs) to identify hierarchical entity types, attributes, and inter-type relationships, achieving efficient end-to-end schema inference.
Link: https://arxiv.org/abs/2509.04632
Authors: Zhenyu Wu,Jiaoyan Chen,Norman W. Paton
Institution: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments:
Abstract:Minimally curated tabular data often contain representational inconsistencies across heterogeneous sources, and are accompanied by sparse metadata. Working with such data is intimidating. While prior work has advanced dataset discovery and exploration, schema inference remains difficult when metadata are limited. We present SI-LLM (Schema Inference using Large Language Models), which infers a concise conceptual schema for tabular data using only column headers and cell values. The inferred schema comprises hierarchical entity types, attributes, and inter-type relationships. In extensive evaluation on two datasets from web tables and open data, SI-LLM achieves promising end-to-end results, as well as better or comparable results to state-of-the-art methods at each step. All source code, full prompts, and datasets of SI-LLM are available at this https URL.
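A rough sketch of what header-and-values prompting might look like in practice is given below; the prompt wording, function name, and JSON output format are illustrative assumptions, not the released SI-LLM prompts (which are linked from the paper).

```python
import json

def schema_inference_prompt(headers, rows, n_samples=3):
    """Assemble an LLM prompt from column headers plus a few sampled cell
    values, in the spirit of SI-LLM (hypothetical wording)."""
    samples = {h: [row[i] for row in rows[:n_samples]]
               for i, h in enumerate(headers)}
    return (
        "Given these table columns and sample values, infer a concise "
        "conceptual schema as JSON with keys 'entity_types', 'attributes', "
        "and 'relationships':\n"
        + json.dumps(samples, ensure_ascii=False, indent=2)
    )

print(schema_inference_prompt(
    ["player", "club", "goals"],
    [["Messi", "Inter Miami", 11], ["Haaland", "Man City", 27]],
))
```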
[AI-32] Action Chunking with Transformers for Image-Based Spacecraft Guidance and Control
【Quick Read】: This paper addresses sample efficiency in spacecraft guidance, navigation, and control (GNC): how to train a high-performing control policy from very few expert demonstrations. The key to the solution is an imitation learning method based on Action Chunking with Transformers (ACT), which maps visual and state observations to thrust and torque commands and learns a control policy from only 100 expert demonstrations (equivalent to 6,300 environment interactions). On an in-orbit docking task with the International Space Station (ISS), the method significantly outperforms a meta-reinforcement learning (meta-RL) baseline trained with 40 million interactions, achieving higher accuracy, smoother trajectories, and far better sample efficiency.
Link: https://arxiv.org/abs/2509.04628
Authors: Alejandro Posadas-Nava,Andrea Scorsoglio,Luca Ghilardi,Roberto Furfaro,Richard Linares
Institution: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 12 pages, 6 figures, 2025 AAS/AIAA Astrodynamics Specialist Conference
Abstract:We present an imitation learning approach for spacecraft guidance, navigation, and control (GNC) that achieves high performance from limited data. Using only 100 expert demonstrations, equivalent to 6,300 environment interactions, our method, which implements Action Chunking with Transformers (ACT), learns a control policy that maps visual and state observations to thrust and torque commands. ACT generates smoother, more consistent trajectories than a meta-reinforcement learning (meta-RL) baseline trained with 40 million interactions. We evaluate ACT on a rendezvous task: in-orbit docking with the International Space Station (ISS). We show that our approach achieves greater accuracy, smoother control, and greater sample efficiency.
[AI-33] Measuring the Measures: Discriminative Capacity of Representational Similarity Metrics Across Model Families
【Quick Read】: This paper addresses the lack of systematic comparison of how well representational similarity metrics discriminate between model families (CNNs, Vision Transformers, Swin Transformers, and ConvNeXt) in neuroscience and AI. The key to the solution is a quantitative framework that evaluates commonly used metrics (RSA, linear predictivity, Procrustes, and soft matching) with three complementary separability measures: d' from signal detection theory, silhouette coefficients, and ROC-AUC. The study finds that separability systematically increases as metrics impose stricter alignment constraints: among mapping-based approaches soft matching achieves the highest separability, followed by Procrustes alignment and linear predictivity, while non-fitting methods such as RSA also yield strong separability, providing clear guidance on metric choice for large-scale model and brain comparisons.
Link: https://arxiv.org/abs/2509.04622
Authors: Jialin Wu,Shreya Saha,Yiqing Bo,Meenakshi Khosla
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments:
Abstract:Representational similarity metrics are fundamental tools in neuroscience and AI, yet we lack systematic comparisons of their discriminative power across model families. We introduce a quantitative framework to evaluate representational similarity measures based on their ability to separate model families across architectures (CNNs, Vision Transformers, Swin Transformers, ConvNeXt) and training regimes (supervised vs. self-supervised). Using three complementary separability measures (d-prime from signal detection theory, silhouette coefficients, and ROC-AUC), we systematically assess the discriminative capacity of commonly used metrics including RSA, linear predictivity, Procrustes, and soft matching. We show that separability systematically increases as metrics impose more stringent alignment constraints. Among mapping-based approaches, soft-matching achieves the highest separability, followed by Procrustes alignment and linear predictivity. Non-fitting methods such as RSA also yield strong separability across families. These results provide the first systematic comparison of similarity metrics through a separability lens, clarifying their relative sensitivity and guiding metric choice for large-scale model and brain comparisons.
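As a concrete illustration of two of the separability measures, the snippet below computes d-prime and ROC-AUC over toy within-family and cross-family similarity scores; silhouette coefficients would additionally require the underlying embeddings. The score distributions are synthetic stand-ins, not the paper's data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def d_prime(within, between):
    """d' from signal detection theory: mean separation of two score
    distributions in pooled-standard-deviation units."""
    within, between = np.asarray(within), np.asarray(between)
    pooled = np.sqrt((within.var(ddof=1) + between.var(ddof=1)) / 2)
    return (within.mean() - between.mean()) / pooled

rng = np.random.default_rng(0)
within = rng.normal(0.8, 0.05, 200)   # same-family similarity scores
between = rng.normal(0.6, 0.05, 200)  # cross-family similarity scores

print("d' =", d_prime(within, between))
labels = np.r_[np.ones_like(within), np.zeros_like(between)]
print("ROC-AUC =", roc_auc_score(labels, np.r_[within, between]))
```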
[AI-34] Quantum-Enhanced Multi-Task Learning with Learnable Weighting for Pharmacokinetic and Toxicity Prediction
【Quick Read】: This paper addresses two problems in ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) prediction for drug discovery: traditional single-task learning (STL) fails to fully exploit complementary information across tasks, limiting model performance, and training and inferring each task independently consumes more compute. The key to the solution is a unified Quantum-enhanced and task-Weighted Multi-Task Learning (QW-MTL) framework with two core designs: first, built on the Chemprop-RDKit backbone, quantum chemical descriptors enrich molecular representations with information about electronic structure and interactions; second, a novel exponential task-weighting scheme combines dataset-scale priors with learnable parameters to achieve dynamic loss balancing across tasks, improving the efficiency of joint multi-task learning and predictive accuracy.
Link: https://arxiv.org/abs/2509.04601
Authors: Han Zhang,Fengji Ma,Jiamin Su,Xinyue Yang,Lei Wang,Wen-Cai Ye,Li Liu
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Prediction for ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) plays a crucial role in drug discovery and development, accelerating the screening and optimization of new drugs. Existing methods primarily rely on single-task learning (STL), which often fails to fully exploit the complementarities between tasks. Besides, it requires more computational resources while training and inference of each task independently. To address these issues, we propose a new unified Quantum-enhanced and task-Weighted Multi-Task Learning (QW-MTL) framework, specifically designed for ADMET classification tasks. Built upon the Chemprop-RDKit backbone, QW-MTL adopts quantum chemical descriptors to enrich molecular representations with additional information about the electronic structure and interactions. Meanwhile, it introduces a novel exponential task weighting scheme that combines dataset-scale priors with learnable parameters to achieve dynamic loss balancing across tasks. To the best of our knowledge, this is the first work to systematically conduct joint multi-task training across all 13 Therapeutics Data Commons (TDC) classification benchmarks, using leaderboard-style data splits to ensure a standardized and realistic evaluation setting. Extensive experimental results show that QW-MTL significantly outperforms single-task baselines on 12 out of 13 tasks, achieving high predictive performance with minimal model complexity and fast inference, demonstrating the effectiveness and efficiency of multi-task molecular learning enhanced by quantum-informed features and adaptive task weighting.
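One plausible shape for an exponential task-weighting scheme that mixes a dataset-size prior with learnable per-task parameters is sketched in PyTorch below; the paper's exact parameterization is not reproduced here, so treat this as an assumption-laden illustration.

```python
import torch

class ExpTaskWeighting(torch.nn.Module):
    """Softmax (exponential) task weights = fixed dataset-size prior plus a
    learnable per-task offset; one illustrative guess at such a scheme."""
    def __init__(self, dataset_sizes):
        super().__init__()
        sizes = torch.tensor(dataset_sizes, dtype=torch.float)
        self.log_prior = torch.log(sizes / sizes.sum())  # fixed prior
        self.offset = torch.nn.Parameter(torch.zeros(len(dataset_sizes)))

    def forward(self, task_losses):
        weights = torch.softmax(self.log_prior + self.offset, dim=0)
        return (weights * torch.stack(task_losses)).sum()

weighting = ExpTaskWeighting([1200, 400, 9000])  # e.g., 3 ADMET task sizes
total = weighting([torch.tensor(0.7), torch.tensor(1.2), torch.tensor(0.3)])
print(total)
```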
[AI-35] Toward Faithfulness-guided Ensemble Interpretation of Neural Network
【Quick Read】: This paper addresses the lack of faithfulness in neural network explanations: existing methods often fail to accurately reflect the model's decision process, so visualizations diverge from the actual reasoning mechanism. The key to the solution is the Faithfulness-guided Ensemble Interpretation (FEI) framework, which uses a smooth approximation to raise quantitative faithfulness scores and designs diverse variants that strengthen faithfulness to hidden-layer encodings; it also proposes a novel qualitative metric for assessing the faithfulness of hidden-layer explanations, systematically improving the reliability and interpretability of neural network explanations in both breadth and precision.
Link: https://arxiv.org/abs/2509.04588
Authors: Siyu Zhang,Kenneth Mcmillan
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Interpretable and faithful explanations for specific neural inferences are crucial for understanding and evaluating model behavior. Our work introduces Faithfulness-guided Ensemble Interpretation (FEI), an innovative framework that enhances the breadth and effectiveness of faithfulness, advancing interpretability by providing superior visualization. Through an analysis of existing evaluation benchmarks, FEI employs a smooth approximation to elevate quantitative faithfulness scores. Diverse variations of FEI target enhanced faithfulness in hidden layer encodings, expanding interpretability. Additionally, we propose a novel qualitative metric that assesses hidden layer faithfulness. In extensive experiments, FEI surpasses existing methods, demonstrating substantial advances in qualitative visualization and quantitative faithfulness scores. Our research establishes a comprehensive framework for elevating faithfulness in neural network explanations, emphasizing both breadth and precision.
[AI-36] i-Mask: An Intelligent Mask for Breath-Driven Activity Recognition
【Quick Read】: This paper addresses the lack of accurate, non-intrusive physiological sensing for human activity recognition (HAR), focusing on how inhalation and exhalation patterns can be used to anticipate behavior, health trends, and vital parameters. The key to the solution is i-Mask, a novel HAR approach that captures exhaled breath patterns with a custom-developed mask equipped with integrated sensors; the collected data undergo noise filtering, time-series decomposition, and labeling to train predictive models, achieving over 95% accuracy in experiments and demonstrating potential in healthcare and fitness applications.
Link: https://arxiv.org/abs/2509.04544
Authors: Ashutosh Kumar Sinha,Ayush Patel,Mitul Dudhat,Pritam Anand,Rahul Mishra
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 Pages, 10 Figures
Abstract:The patterns of inhalation and exhalation contain important physiological signals that can be used to anticipate human behavior, health trends, and vital parameters. Human activity recognition (HAR) is fundamentally connected to these vital signs, providing deeper insights into well-being and enabling real-time health monitoring. This work presents i-Mask, a novel HAR approach that leverages exhaled breath patterns captured using a custom-developed mask equipped with integrated sensors. Data collected from volunteers wearing the mask undergoes noise filtering, time-series decomposition, and labeling to train predictive models. Our experimental results validate the effectiveness of the approach, achieving over 95% accuracy and highlighting its potential in healthcare and fitness applications.
[AI-37] Emergent Social Dynamics of LLM Agents in the El Farol Bar Problem
【Quick Read】: This paper studies collective decision making in the classic social dilemma of the El Farol Bar problem: how individual agents, without central coordination, spontaneously form effective strategies when facing a limited resource (the bar's capacity). The key to the solution is letting large language model (LLM) agents evolve autonomously in a spatially extended environment: the agents make decisions not only under externally prompted constraints (such as the 60% threshold) but also draw on culturally-encoded social preferences acquired during pretraining, dynamically balancing formal game-theoretic rationality with human-like social motivations. The LLM agents end up behaving more like humans than like full optimizers, pointing to a new LLM-based model of group decision making that previous game-theoretic settings could not capture.
Link: https://arxiv.org/abs/2509.04537
Authors: Ryosuke Takata,Atsushi Masumori,Takashi Ikegammi
Institution: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:
Abstract:We investigate the emergent social dynamics of Large Language Model (LLM) agents in a spatially extended El Farol Bar problem, observing how they autonomously navigate this classic social dilemma. As a result, the LLM agents generated a spontaneous motivation to go to the bar and changed their decision making by becoming a collective. We also observed that the LLM agents did not solve the problem completely, but rather behaved more like humans. These findings reveal a complex interplay between external incentives (prompt-specified constraints such as the 60% threshold) and internal incentives (culturally-encoded social preferences derived from pre-training), demonstrating that LLM agents naturally balance formal game-theoretic rationality with social motivations that characterize human behavior. These findings suggest that a new model of group decision making, which could not be handled in the previous game-theoretic problem setting, can be realized by LLM agents.
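For context, the classical (non-LLM) El Farol dynamic is easy to simulate: each agent forms a noisy estimate of crowding from last week's attendance and goes only if it expects the bar to be below the 60% capacity threshold. In the paper this numeric predictor is replaced by prompted LLM agents; the toy predictors here are assumptions.

```python
import random

N, CAPACITY, WEEKS = 100, 0.6, 15
history = [0.5]  # fraction of agents attending each week
# Each agent carries a fixed personal bias in its crowding estimate.
bias = [random.uniform(-0.15, 0.15) for _ in range(N)]

for week in range(WEEKS):
    predicted = [min(1.0, max(0.0, history[-1] + bias[i])) for i in range(N)]
    going = sum(1 for p in predicted if p < CAPACITY)  # go if it looks quiet
    history.append(going / N)
    print(f"week {week:2d}: attendance {history[-1]:.2f}")
```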
[AI-38] In-Context Policy Adaptation via Cross-Domain Skill Diffusion
【Quick Read】: This paper addresses policy transfer in long-horizon multi-task environments: how to rapidly adapt skill-based reinforcement learning policies when target-domain data are scarce and no model updates are allowed. The key to the solution is the In-Context Policy Adaptation (ICPAD) framework, whose core innovation is a cross-domain skill diffusion scheme: domain-agnostic prototype skills and a domain-grounded skill adapter are learned jointly and effectively from an offline dataset through cross-domain consistent diffusion processes. The prototype skills act as primitives for common behavior representations of long-horizon policies, serving as a lingua franca that bridges different domains, while a dynamic domain prompting scheme further guides the skill adapter toward better alignment with the target domain, markedly improving adaptation across cross-domain differences in environment dynamics, agent embodiment, and task horizon.
Link: https://arxiv.org/abs/2509.04535
Authors: Minjong Yoo,Woo Kyung Kim,Honguk Woo
Institution: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages
Abstract:In this work, we present an in-context policy adaptation (ICPAD) framework designed for long-horizon multi-task environments, exploring diffusion-based skill learning techniques in cross-domain settings. The framework enables rapid adaptation of skill-based reinforcement learning policies to diverse target domains, especially under stringent constraints on no model updates and only limited target domain data. Specifically, the framework employs a cross-domain skill diffusion scheme, where domain-agnostic prototype skills and a domain-grounded skill adapter are learned jointly and effectively from an offline dataset through cross-domain consistent diffusion processes. The prototype skills act as primitives for common behavior representations of long-horizon policies, serving as a lingua franca to bridge different domains. Furthermore, to enhance the in-context adaptation performance, we develop a dynamic domain prompting scheme that guides the diffusion-based skill adapter toward better alignment with the target domain. Through experiments with robotic manipulation in Metaworld and autonomous driving in CARLA, we show that our ICPAD framework achieves superior policy adaptation performance under limited target domain data conditions for various cross-domain configurations including differences in environment dynamics, agent embodiment, and task horizon.
[AI-39] Memristor-Based Neural Network Accelerators for Space Applications: Enhancing Performance with Temporal Averaging and SIRENs
【Quick Read】: This paper addresses the severe performance degradation that occurs when neural networks (NNs) are ported to resistive random-access memory (RRAM) devices because of device non-idealities such as parameter variability, conductance drift, and faults, with the goal of enabling energy-efficient, radiation-robust computing on board spacecraft. The key to the solution is three techniques, bit-slicing, temporal averaging of NN layers, and periodic activation functions, which substantially improve the accuracy of memristor-based NNs on asteroid navigation/control and geodesy tasks, reducing errors from about 0.07 and 0.3 to 0.01 and 0.007 respectively, approaching state-of-the-art levels (0.003-0.005 and 0.003).
Link: https://arxiv.org/abs/2509.04506
Authors: Zacharia A. Rudge,Dominik Dold,Moritz Fieback,Dario Izzo,Said Hamdioui
Institution: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: 21 pages, IAA acta astronautica. arXiv admin note: text overlap with arXiv:2509.02369
Abstract:Memristors are an emerging technology that enables artificial intelligence (AI) accelerators with high energy efficiency and radiation robustness, properties that are vital for the deployment of AI on-board spacecraft. However, space applications require reliable and precise computations, while memristive devices suffer from non-idealities, such as device variability, conductance drifts, and device faults. Thus, porting neural networks (NNs) to memristive devices often faces the challenge of severe performance degradation. In this work, we show in simulations that memristor-based NNs achieve competitive performance levels on on-board tasks, such as navigation & control and geodesy of asteroids. Through bit-slicing, temporal averaging of NN layers, and periodic activation functions, we improve initial results from around 0.07 to 0.01 and 0.3 to 0.007 for both tasks using RRAM devices, coming close to state-of-the-art levels (0.003-0.005 and 0.003, respectively). Our results demonstrate the potential of memristors for on-board space applications, and we are convinced that future technology and NN improvements will further close the performance gap to fully unlock the benefits of memristors.
[AI-40] The Ethical Compass of the Machine: Evaluating Large Language Models for Decision Support in Construction Project Management
【Quick Read】: This paper addresses the ethical viability and reliability of applying generative AI to the high-risk, ethically sensitive decision-making contexts inherent in construction project management (CPM). The key to the solution is a mixed-methods research design: leading large language models (LLMs) are quantitatively tested against twelve real-world ethical scenarios using a novel Ethical Decision Support Assessment Checklist (EDSAC), combined with semi-structured interviews with 12 industry experts to capture professional perceptions. The findings show that while LLMs perform adequately in structured domains such as legal compliance, they exhibit clear deficiencies in contextual nuance, accountability, and transparent reasoning; the paper therefore positions LLMs as decision-support aids rather than autonomous ethical agents and argues that robust human-in-the-loop oversight is essential for safe and reliable use.
Link: https://arxiv.org/abs/2509.04505
Authors: Somtochukwu Azie,Yiping Meng
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 16 Pages
Abstract:The integration of Artificial Intelligence (AI) into construction project management (CPM) is accelerating, with Large Language Models (LLMs) emerging as accessible decision-support tools. This study aims to critically evaluate the ethical viability and reliability of LLMs when applied to the ethically sensitive, high-risk decision-making contexts inherent in CPM. A mixed-methods research design was employed, involving the quantitative performance testing of two leading LLMs against twelve real-world ethical scenarios using a novel Ethical Decision Support Assessment Checklist (EDSAC), and qualitative analysis of semi-structured interviews with 12 industry experts to capture professional perceptions. The findings reveal that while LLMs demonstrate adequate performance in structured domains such as legal compliance, they exhibit significant deficiencies in handling contextual nuance, ensuring accountability, and providing transparent reasoning. Stakeholders expressed considerable reservations regarding the autonomous use of AI for ethical judgments, strongly advocating for robust human-in-the-loop oversight. To our knowledge, this is one of the first studies to empirically test the ethical reasoning of LLMs within the construction domain. It introduces the EDSAC framework as a replicable methodology and provides actionable recommendations, emphasising that LLMs are currently best positioned as decision-support aids rather than autonomous ethical agents.
[AI-41] Efficient Training-Free Online Routing for High-Volume Multi-LLM Serving
【Quick Read】: This paper addresses the high computational cost and limited token budgets that large language model (LLM) services face in online deployment, in particular how to route queries to models efficiently and cheaply under high query volume; existing routing methods mostly target offline settings and struggle to adapt to dynamic online environments. The key to the solution is a training-free online routing algorithm: approximate nearest neighbor search rapidly estimates query features, and a one-time optimization over a small set of initial queries learns a routing strategy that guides subsequent decisions. Under natural assumptions the algorithm enjoys theoretical guarantees, achieving an asymptotically optimal competitive ratio of 1 - o(1), validated by experiments on 3 benchmark datasets against 8 baselines with average improvements of 3.55× in overall performance, 1.85× in cost efficiency, and nearly 4.25× in throughput.
Link: https://arxiv.org/abs/2509.02718
Authors: Fangzhou Wu,Sandeep Silwal
Institution: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 31 pages
Abstract:Increasing demand for Large Language Models (LLMs) services imposes substantial deployment and computation costs on providers. LLM routing offers a cost-efficient solution by directing queries to the optimal LLM based on model and query features. However, existing works primarily focus on offline scenarios and struggle to adapt to online settings with high query volume and constrained token budgets. In this work, we introduce the first training-free algorithm for online routing scenarios. Our algorithm leverages approximate nearest neighbor search to efficiently estimate query features and performs a one-time optimization over a small set of initial queries to learn a routing strategy that guides future routing. We provide theoretical guarantees demonstrating that our algorithm achieves a competitive ratio of 1 - o(1) under natural assumptions, which is further validated by extensive experiments across 3 benchmark datasets and 8 baselines, showing an average improvement of 3.55× in overall performance, 1.85× in cost efficiency, and nearly 4.25× in throughput.
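A minimal sketch of the nearest-neighbor routing idea follows: embed the incoming query, look up its closest initial queries, and route to the model that served those best. The real algorithm also enforces token-budget constraints and uses approximate (rather than exact) nearest-neighbor search; both are omitted here, and all names are illustrative.

```python
import numpy as np

def route(query_emb, init_embs, init_best_model, k=5):
    """Route a query by majority vote over the best-performing model on its
    k nearest initial queries (exact search stands in for ANN here)."""
    dists = np.linalg.norm(init_embs - query_emb, axis=1)
    neighbors = np.argsort(dists)[:k]
    return np.bincount(init_best_model[neighbors]).argmax()

rng = np.random.default_rng(1)
init_embs = rng.normal(size=(50, 8))           # embedded initial queries
init_best_model = rng.integers(0, 3, size=50)  # best of 3 LLMs per query
print("routed to model", route(rng.normal(size=8), init_embs, init_best_model))
```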
[AI-42] High-Resolution Global Land Surface Temperature Retrieval via a Coupled Mechanism-Machine Learning Framework
【Quick Read】: This paper addresses the insufficient accuracy of land surface temperature (LST) retrieval under heterogeneous land cover and extreme atmospheric conditions: traditional split-window (SW) algorithms are biased in humid environments, while purely machine learning (ML) methods lack physical interpretability and generalize poorly with limited data. The key to the solution is a coupled mechanism model-machine learning (MM-ML) framework that integrates physical constraints with data-driven learning: it fuses radiative transfer modeling with data components, uses MODTRAN simulations with global atmospheric profiles, and employs physics-constrained optimization, improving nonlinear modeling capacity while retaining interpretability and enabling accurate, stable LST retrieval in complex environments.
Link: https://arxiv.org/abs/2509.04991
Authors: Tian Xie,Huanfeng Shen,Menghui Jiang,Juan-Carlos Jiménez-Muñoz,José A. Sobrino,Huifang Li,Chao Zeng
Institution: Unknown
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Land surface temperature (LST) is vital for land-atmosphere interactions and climate processes. Accurate LST retrieval remains challenging under heterogeneous land cover and extreme atmospheric conditions. Traditional split window (SW) algorithms show biases in humid environments; purely machine learning (ML) methods lack interpretability and generalize poorly with limited data. We propose a coupled mechanism model-ML (MM-ML) framework integrating physical constraints with data-driven learning for robust LST retrieval. Our approach fuses radiative transfer modeling with data components, uses MODTRAN simulations with global atmospheric profiles, and employs physics-constrained optimization. Validation against 4,450 observations from 29 global sites shows MM-ML achieves MAE=1.84K, RMSE=2.55K, and R-squared=0.966, outperforming conventional methods. Under extreme conditions, MM-ML reduces errors by over 50%. Sensitivity analysis indicates LST estimates are most sensitive to sensor radiance, then water vapor, and less to emissivity, with MM-ML showing superior stability. These results demonstrate the effectiveness of our coupled modeling strategy for retrieving geophysical parameters. The MM-ML framework combines physical interpretability with nonlinear modeling capacity, enabling reliable LST retrieval in complex environments and supporting climate monitoring and ecosystem studies.
[AI-43] Exploring an implementation of quantum learning pipeline for support vector machines
【Quick Read】: This paper addresses the computational bottlenecks traditional support vector machines (SVMs) face on complex classification tasks, particularly the cost of constructing and optimizing kernels in high-dimensional feature spaces. The key to the solution is an end-to-end quantum learning pipeline: gate-based quantum kernel methods construct quantum kernels whose quality under different feature maps and qubit configurations is assessed via Kernel-Target Alignment (KTA); the SVM dual problem is then reformulated as a Quadratic Unconstrained Binary Optimization (QUBO) problem, enabling its solution on quantum annealers. Experiments show that high kernel alignment together with an appropriate regularization parameter yields competitive performance, with the best model achieving an F1-score of 90%, demonstrating the feasibility of a fully quantum learning pipeline and the potential of hybrid quantum architectures in quantum high-performance computing (QHPC) contexts.
Link: https://arxiv.org/abs/2509.04983
Authors: Mario Bifulco,Luca Roversi
Institution: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments:
Abstract:This work presents a fully quantum approach to support vector machine (SVM) learning by integrating gate-based quantum kernel methods with quantum annealing-based optimization. We explore the construction of quantum kernels using various feature maps and qubit configurations, evaluating their suitability through Kernel-Target Alignment (KTA). The SVM dual problem is reformulated as a Quadratic Unconstrained Binary Optimization (QUBO) problem, enabling its solution via quantum annealers. Our experiments demonstrate that a high degree of alignment in the kernel and an appropriate regularization parameter lead to competitive performance, with the best model achieving an F1-score of 90%. These results highlight the feasibility of an end-to-end quantum learning pipeline and the potential of hybrid quantum architectures in quantum high-performance computing (QHPC) contexts.
[AI-44] Artificial intelligence for representing and characterizing quantum systems
【Quick Read】: This paper addresses the central challenge of characterizing large-scale quantum systems, especially those produced by quantum analog simulators and megaquop quantum computers, where the Hilbert space grows exponentially with system size. The key of the review is leveraging artificial intelligence (AI), whose aptitude for high-dimensional pattern recognition and function approximation makes it a powerful tool for this challenge; depending on how prior knowledge and learning architectures are incorporated, the integration falls into three synergistic paradigms: machine learning, deep learning, and language models. The review discusses how each paradigm contributes to two core tasks, quantum property prediction and the construction of surrogates for quantum states, which underlie applications from quantum certification and benchmarking to enhancing quantum algorithms and understanding strongly correlated phases of matter.
Link: https://arxiv.org/abs/2509.04923
Authors: Yuxuan Du,Yan Zhu,Yuan-Hang Zhang,Min-Hsiu Hsieh,Patrick Rebentrost,Weibo Gao,Ya-Dong Wu,Jens Eisert,Giulio Chiribella,Dacheng Tao,Barry C. Sanders
Institution: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 32 pages. Comments are welcome
Abstract:Efficient characterization of large-scale quantum systems, especially those produced by quantum analog simulators and megaquop quantum computers, poses a central challenge in quantum science due to the exponential scaling of the Hilbert space with respect to system size. Recent advances in artificial intelligence (AI), with its aptitude for high-dimensional pattern recognition and function approximation, have emerged as a powerful tool to address this challenge. A growing body of research has leveraged AI to represent and characterize scalable quantum systems, spanning from theoretical foundations to experimental realizations. Depending on how prior knowledge and learning architectures are incorporated, the integration of AI into quantum system characterization can be categorized into three synergistic paradigms: machine learning, and, in particular, deep learning and language models. This review discusses how each of these AI paradigms contributes to two core tasks in quantum systems characterization: quantum property prediction and the construction of surrogates for quantum states. These tasks underlie diverse applications, from quantum certification and benchmarking to the enhancement of quantum algorithms and the understanding of strongly correlated phases of matter. Key challenges and open questions are also discussed, together with future prospects at the interface of AI and quantum science.
[AI-45] The Paradox of Doom: Acknowledging Extinction Risk Reduces the Incentive to Prevent It
【Quick Read】: This paper asks why humans become more impatient (discounting the future more heavily) in the face of extinction risk, which leads to persistent underinvestment in mitigating long-run catastrophic risks such as climate change, pandemics, and the emerging risks of transformative AI. The key to the solution is a theoretical framework that distinguishes human extinction risk from individual mortality risk while allowing for various degrees of intergenerational altruism and an evolutionarily motivated "selfish gene" perspective. The analysis shows that extinction risk cannot be hedged through reproduction and is therefore an indispensable component of the discount rate, whereas individual mortality risk can be hedged, partially or fully depending on the setup, through having children. Consequently, under extinction risk people become more short-sighted rather than more farsighted: the greater the threat of extinction, the less incentive there is to invest in avoiding it, which helps explain humanity's consistent underinvestment in catastrophic-risk mitigation.
Link: https://arxiv.org/abs/2509.04855
Authors: Jakub Growiec,Klaus Prettner
Institution: Unknown
Categories: General Economics (econ.GN); Artificial Intelligence (cs.AI)
Comments:
Abstract:We investigate the salience of extinction risk as a source of impatience. Our framework distinguishes between human extinction risk and individual mortality risk while allowing for various degrees of intergenerational altruism. Additionally, we consider the evolutionarily motivated “selfish gene” perspective. We find that the risk of human extinction is an indispensable component of the discount rate, whereas individual mortality risk can be hedged against - partially or fully, depending on the setup - through human reproduction. Overall, we show that in the face of extinction risk, people become more impatient rather than more farsighted. Thus, the greater the threat of extinction, the less incentive there is to invest in avoiding it. Our framework can help explain why humanity consistently underinvests in mitigation of catastrophic risks, ranging from climate change mitigation, via pandemic prevention, to addressing the emerging risks of transformative artificial intelligence.
[AI-46] AI-Driven Fronthaul Link Compression in Wireless Communication Systems: Review and Method Design
【Quick Read】: This paper addresses how fronthaul links in wireless systems can transport high-dimensional signals under stringent bandwidth and latency constraints; traditional compression strategies such as compressed sensing, scalar quantization, and fixed-codec pipelines rely on restrictive priors, degrade sharply at high compression ratios, and are hard to tune across channels and deployments. The key to the solution is AI-driven compression, including end-to-end learned transforms, vector and hierarchical quantization, and learned entropy models, which better exploit the structure of channel state information (CSI), precoding matrices, I/Q samples, and log-likelihood ratios (LLRs). The paper further analyzes two representative high-compression routes, CSI feedback with end-to-end learning and resource block (RB)-granularity precoding optimization combined with compression, and on this basis proposes a fronthaul compression strategy tailored to cell-free architectures that targets high compression with controlled performance loss, supports RB-level rate adaptation, and enables low-latency inference suitable for centralized cooperative transmission in next-generation networks.
Link: https://arxiv.org/abs/2509.04805
Authors: Keqin Zhang
Institution: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Modern fronthaul links in wireless systems must transport high-dimensional signals under stringent bandwidth and latency constraints, which makes compression indispensable. Traditional strategies such as compressed sensing, scalar quantization, and fixed-codec pipelines often rely on restrictive priors, degrade sharply at high compression ratios, and are hard to tune across channels and deployments. Recent progress in Artificial Intelligence (AI) has brought end-to-end learned transforms, vector and hierarchical quantization, and learned entropy models that better exploit the structure of Channel State Information (CSI), precoding matrices, I/Q samples, and LLRs. This paper first surveys AI-driven compression techniques and then provides a focused analysis of two representative high-compression routes: CSI feedback with end-to-end learning and Resource Block (RB) granularity precoding optimization combined with compression. Building on these insights, we propose a fronthaul compression strategy tailored to cell-free architectures. The design targets high compression with controlled performance loss, supports RB-level rate adaptation, and enables low-latency inference suitable for centralized cooperative transmission in next-generation networks.
[AI-47] Multiscale Graph Neural Network for Turbulent Flow-Thermal Prediction Around a Complex-Shaped Pin-Fin
【Quick Read】: This paper addresses the efficient prediction of steady turbulent flow and heat transfer in a two-dimensional channel containing arbitrarily shaped complex pin-fin geometries, where traditional numerical simulation is computationally expensive and slow. The key to the solution is a domain-responsive edge-aware multiscale graph neural network: each CFD simulation is converted into a graph whose node features include spatial coordinates, a normalized streamwise position, boundary indicators, and a signed distance to the nearest boundary (such as a wall), and the model is trained to predict temperature, velocity magnitude, and pressure at each node. The network captures boundary layers, recirculation, and the stagnation region upstream of the pin-fins with high accuracy while reducing wall time by 2-3 orders of magnitude, providing a fast and reliable surrogate for simulations in complex flow configurations.
Link: https://arxiv.org/abs/2509.04463
Authors: Riddhiman Raut,Evan M. Mihalko,Amrita Basak
Institution: Unknown
Categories: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:
Abstract:This study presents the development of a domain-responsive edge-aware multiscale Graph Neural Network for predicting steady, turbulent flow and thermal behavior in a two-dimensional channel containing arbitrarily shaped complex pin-fin geometries. The training dataset was constructed through an automated framework that integrated geometry generation, meshing, and flow-field solutions in ANSYS Fluent. The pin-fin geometry was parameterized using piecewise cubic splines, producing 1,000 diverse configurations through Latin Hypercube Sampling. Each simulation was converted into a graph structure, where nodes carried a feature vector containing spatial coordinates, a normalized streamwise position, one-hot boundary indicators, and a signed distance to the nearest boundary such as wall. This graph structure served as input to the newly developed Graph Neural Network, which was trained to predict temperature, velocity magnitude, and pressure at each node using data from ANSYS. The network predicted fields with outstanding accuracy, capturing boundary layers, recirculation, and the stagnation region upstream of the pin-fins while reducing wall time by 2-3 orders of magnitude. In conclusion, the novel graph neural network offered a fast and reliable surrogate for simulations in complex flow configurations.
Machine Learning
[LG-0] Deep Reinforcement Learning for Ranking Utility Tuning in the Ad Recommender System at Pinterest
Link: https://arxiv.org/abs/2509.05292
Authors: Xiao Yang,Mehdi Ben Ayed,Longyu Zhao,Fan Zhou,Yuchen Shen,Abe Engle,Jinfeng Zhuang,Ling Leng,Jiajing Xu,Charles Rosenberg,Prathibha Deshikachar
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:The ranking utility function in an ad recommender system, which linearly combines predictions of various business goals, plays a central role in balancing values across the platform, advertisers, and users. Traditional manual tuning, while offering simplicity and interpretability, often yields suboptimal results due to its unprincipled tuning objectives, the vast amount of parameter combinations, and its lack of personalization and adaptability to seasonality. In this work, we propose a general Deep Reinforcement Learning framework for Personalized Utility Tuning (DRL-PUT) to address the challenges of multi-objective optimization within ad recommender systems. Our key contributions include: 1) Formulating the problem as a reinforcement learning task: given the state of an ad request, we predict the optimal hyperparameters to maximize a pre-defined reward. 2) Developing an approach to directly learn an optimal policy model using online serving logs, avoiding the need to estimate a value function, which is inherently challenging due to the high variance and unbalanced distribution of immediate rewards. We evaluated DRL-PUT through an online A/B experiment in Pinterest’s ad recommender system. Compared to the baseline manual utility tuning approach, DRL-PUT improved the click-through rate by 9.7% and the long click-through rate by 7.7% on the treated segment. We conducted a detailed ablation study on the impact of different reward definitions and analyzed the personalization aspect of the learned policy model.
[LG-1] Learning to accelerate distributed ADMM using graph neural networks
Link: https://arxiv.org/abs/2509.05288
Authors: Henri Doerks,Paul Häusner,Daniel Hernández Escobar,Jens Sjölund
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
*Comments: Under review, the first two authors contributed equally
Abstract:Distributed optimization is fundamental in large-scale machine learning and control applications. Among existing methods, the Alternating Direction Method of Multipliers (ADMM) has gained popularity due to its strong convergence guarantees and suitability for decentralized computation. However, ADMM often suffers from slow convergence and sensitivity to hyperparameter choices. In this work, we show that distributed ADMM iterations can be naturally represented within the message-passing framework of graph neural networks (GNNs). Building on this connection, we propose to learn adaptive step sizes and communication weights by a graph neural network that predicts the hyperparameters based on the iterates. By unrolling ADMM for a fixed number of iterations, we train the network parameters end-to-end to minimize the final iterates error for a given problem class, while preserving the algorithm’s convergence properties. Numerical experiments demonstrate that our learned variant consistently improves convergence speed and solution quality compared to standard ADMM. The code is available at this https URL.
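For reference, the plain consensus ADMM iteration that such a learned variant would accelerate looks as follows for distributed least squares; the penalty parameter rho is exactly the kind of hyperparameter a GNN could predict per iteration. This is the standard textbook iteration, not the paper's learned method.

```python
import numpy as np

def consensus_admm(A_list, b_list, rho=1.0, iters=50):
    """Consensus ADMM for min sum_i 0.5*||A_i x - b_i||^2 s.t. x_i = z."""
    n = A_list[0].shape[1]
    x = [np.zeros(n) for _ in A_list]
    u = [np.zeros(n) for _ in A_list]  # scaled dual variables
    z = np.zeros(n)
    for _ in range(iters):
        for i, (A, b) in enumerate(zip(A_list, b_list)):
            # x-update has a closed form for quadratic local objectives.
            x[i] = np.linalg.solve(A.T @ A + rho * np.eye(n),
                                   A.T @ b + rho * (z - u[i]))
        z = np.mean([xi + ui for xi, ui in zip(x, u)], axis=0)  # averaging
        u = [ui + xi - z for xi, ui in zip(x, u)]               # dual ascent
    return z

rng = np.random.default_rng(0)
A_list = [rng.normal(size=(20, 5)) for _ in range(4)]
b_list = [rng.normal(size=20) for _ in range(4)]
print(consensus_admm(A_list, b_list))
```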
[LG-2] Dual-Branch Convolutional Framework for Spatial and Frequency-Based Image Forgery Detection
Link: https://arxiv.org/abs/2509.05281
Authors: Naman Tyagi
Categories: Machine Learning (cs.LG)
*Comments: 14 pages, 5 figures
Abstract:With a very rapid increase in deepfakes and digital image forgeries, ensuring the authenticity of images is becoming increasingly challenging. This report introduces a forgery detection framework that combines spatial and frequency-based features for detecting forgeries. We propose a dual branch convolution neural network that operates on features extracted from spatial and frequency domains. Features from both branches are fused and compared within a Siamese network, yielding 64 dimensional embeddings for classification. When benchmarked on CASIA 2.0 dataset, our method achieves an accuracy of 77.9%, outperforming traditional statistical methods. Despite its relatively weaker performance compared to larger, more complex forgery detection pipelines, our approach balances computational complexity and detection reliability, making it ready for practical deployment. It provides a strong methodology for forensic scrutiny of digital images. In a broader sense, it advances the state of the art in visual forensics, addressing an urgent requirement in media verification, law enforcement and digital content reliability.
[LG-3] Greener Deep Reinforcement Learning: Analysis of Energy and Carbon Efficiency Across Atari Benchmarks
Link: https://arxiv.org/abs/2509.05273
Authors: Jason Gardner,Ayan Dutta,Swapnoneel Roy,O. Patrick Kreidl,Ladislau Boloni
Categories: Machine Learning (cs.LG); Performance (cs.PF)
*Comments: Submitted to a journal - under review
Abstract:The growing computational demands of deep reinforcement learning (DRL) have raised concerns about the environmental and economic costs of training large-scale models. While algorithmic efficiency in terms of learning performance has been extensively studied, the energy requirements, greenhouse gas emissions, and monetary costs of DRL algorithms remain largely unexplored. In this work, we present a systematic benchmarking study of the energy consumption of seven state-of-the-art DRL algorithms, namely DQN, TRPO, A2C, ARS, PPO, RecurrentPPO, and QR-DQN, implemented using Stable Baselines. Each algorithm was trained for one million steps each on ten Atari 2600 games, and power consumption was measured in real-time to estimate total energy usage, CO2-Equivalent emissions, and electricity cost based on the U.S. national average electricity price. Our results reveal substantial variation in energy efficiency and training cost across algorithms, with some achieving comparable performance while consuming up to 24% less energy (ARS vs. DQN), emitting nearly 68% less CO2, and incurring almost 68% lower monetary cost (QR-DQN vs. RecurrentPPO) than less efficient counterparts. We further analyze the trade-offs between learning performance, training time, energy use, and financial cost, highlighting cases where algorithmic choices can mitigate environmental and economic impact without sacrificing learning performance. This study provides actionable insights for developing energy-aware and cost-efficient DRL practices and establishes a foundation for incorporating sustainability considerations into future algorithmic design and evaluation.
[LG-4] On Evaluating the Poisoning Robustness of Federated Learning under Local Differential Privacy
Link: https://arxiv.org/abs/2509.05265
Authors: Zijian Wang,Wei Tong,Tingxuan Han,Haoyu Chen,Tianling Zhang,Yunlong Mao,Sheng Zhong
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments:
Abstract:Federated learning (FL) combined with local differential privacy (LDP) enables privacy-preserving model training across decentralized data sources. However, the decentralized data-management paradigm leaves LDPFL vulnerable to participants with malicious intent. The robustness of LDPFL protocols, particularly against model poisoning attacks (MPA), where adversaries inject malicious updates to disrupt global model convergence, remains insufficiently studied. In this paper, we propose a novel and extensible model poisoning attack framework tailored for LDPFL settings. Our approach is driven by the objective of maximizing the global training loss while adhering to local privacy constraints. To counter robust aggregation mechanisms such as Multi-Krum and trimmed mean, we develop adaptive attacks that embed carefully crafted constraints into a reverse training process, enabling evasion of these defenses. We evaluate our framework across three representative LDPFL protocols, three benchmark datasets, and two types of deep neural networks. Additionally, we investigate the influence of data heterogeneity and privacy budgets on attack effectiveness. Experimental results demonstrate that our adaptive attacks can significantly degrade the performance of the global model, revealing critical vulnerabilities and highlighting the need for more robust LDPFL defense strategies against MPA. Our code is available at this https URL
[LG-5] A Kolmogorov-Arnold Network for Interpretable Cyberattack Detection in AGC Systems
Link: https://arxiv.org/abs/2509.05259
Authors: Jehad Jilan,Niranjana Naveen Nambiar,Ahmad Mohammad Saber,Alok Paranjape,Amr Youssef,Deepa Kundur
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Systems and Control (eess.SY)
*Comments: Peer-reviewed
Abstract:Automatic Generation Control (AGC) is essential for power grid stability but remains vulnerable to stealthy cyberattacks, such as False Data Injection Attacks (FDIAs), which can disturb the system’s stability while evading traditional detection methods. Unlike previous works that relied on blackbox approaches, this work proposes Kolmogorov-Arnold Networks (KAN) as an interpretable and accurate method for FDIA detection in AGC systems, considering the system nonlinearities. KAN models include a method for extracting symbolic equations, and are thus able to provide more interpretability than the majority of machine learning models. The proposed KAN is trained offline to learn the complex nonlinear relationships between the AGC measurements under different operating scenarios. After training, symbolic formulas that describe the trained model’s behavior can be extracted and leveraged, greatly enhancing interpretability. Our findings confirm that the proposed KAN model achieves FDIA detection rates of up to 95.97% and 95.9% for the initial model and the symbolic formula, respectively, with a low false alarm rate, offering a reliable approach to enhancing AGC cybersecurity.
[LG-6] Deep Learning-Enhanced for Amine Emission Monitoring and Performance Analysis in Industrial Carbon Capture Plants
链接: https://arxiv.org/abs/2509.05241
作者: Lokendra Poudel,David Tincher,Duy-Nhat Phan,Rahul Bhowmik
类目: Machine Learning (cs.LG)
*备注:
Abstract:We present data-driven deep learning models for forecasting and monitoring amine emissions and key performance parameters in amine-based post-combustion carbon capture systems. Using operational data from the CESAR1 solvent campaign at Technology Center Mongstad, four DL architectures, namely basic Long Short-Term Memory (LSTM), stacked LSTM, bi-directional LSTM, and convolutional LSTM, were developed to capture time-dependent process behavior. For emission prediction, models were designed for 2-amino-2-methyl-1-propanol (AMP) and Piperazine emissions measured via FTIR and IMR-MS methods. System performance models target four critical parameters: CO2 product flow, absorber outlet temperature, depleted flue gas outlet temperature, and RFCC stripper bottom temperature. These models achieved high predictive accuracy exceeding 99% and effectively tracked both steady trends and abrupt fluctuations. Additionally, we conducted a causal impact analysis to evaluate how operational variables influence emissions and system performance. Eight input variables were systematically perturbed within ±20% of nominal values to simulate deviations and assess their impact. This analysis revealed that adjusting specific operational parameters, such as lean solvent temperature and water wash conditions, can significantly reduce amine emissions and enhance system performance. This study highlights ML not only as a predictive tool but also as a decision support system for optimizing carbon capture operations under steady-state and dynamic conditions. By enabling real-time monitoring, scenario testing, and operational optimization, the developed ML framework offers a practical pathway for mitigating environmental impacts. This work represents a step toward intelligent, data-driven control strategies that enhance the efficiency, stability, and sustainability of carbon capture and storage technologies.
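A minimal sketch of the ±20% perturbation analysis described above, with an untrained toy LSTM standing in for the paper's trained emission models (the input layout, variable count, and window length are assumptions):

```python
import torch, torch.nn as nn

class EmissionLSTM(nn.Module):
    def __init__(self, n_inputs=8, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_inputs, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)        # e.g. AMP emission
    def forward(self, x):                       # x: (batch, time, n_inputs)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])            # predict from the last step

model = EmissionLSTM().eval()                   # untrained stand-in
x = torch.randn(1, 48, 8)                       # 48 time steps, 8 variables

with torch.no_grad():
    base = model(x).item()
    for var in range(8):                        # perturb each input variable
        for scale in (0.8, 1.2):                # -20% / +20% of nominal
            x_pert = x.clone()
            x_pert[..., var] *= scale
            delta = model(x_pert).item() - base
            print(f"var {var} x{scale}: predicted emission change {delta:+.4f}")
```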
[LG-7] An Efficient Subspace Algorithm for Federated Learning on Heterogeneous Data
链接: https://arxiv.org/abs/2509.05213
作者: Jiaojiao Zhang,Yuqi Xu,Kun Yuan
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:This work addresses the key challenges of applying federated learning to large-scale deep neural networks, particularly the issue of client drift due to data heterogeneity across clients and the high costs of communication, computation, and memory. We propose FedSub, an efficient subspace algorithm for federated learning on heterogeneous data. Specifically, FedSub utilizes subspace projection to confine each client’s local updates to low-dimensional subspaces, thereby reducing communication, computation, and memory costs. Additionally, it incorporates low-dimensional dual variables to mitigate client drift. We provide a convergence analysis that reveals the impact of key factors such as step size and subspace projection matrices on convergence. Experimental results demonstrate its efficiency.
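A minimal sketch of the subspace-projection idea (the random orthonormal basis and function names are assumptions; FedSub's dual variables and convergence machinery are omitted):

```python
import numpy as np

d, r = 10_000, 100
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((d, r)))  # shared d x r orthonormal basis

def client_message(grad):
    """Each client sends only r subspace coordinates instead of d numbers."""
    return U.T @ grad

def server_step(w, all_coords, lr=0.1):
    """Average the coordinates, lift back to the full space, take a step."""
    mean_coords = np.mean(all_coords, axis=0)
    return w - lr * (U @ mean_coords)

w = rng.standard_normal(d)
grads = [rng.standard_normal(d) for _ in range(5)]      # 5 clients' gradients
w = server_step(w, [client_message(g) for g in grads])
print(f"communicated {r} floats per client instead of {d}")
```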
[LG-8] Shift Before You Learn: Enabling Low-Rank Representations in Reinforcement Learning
链接: https://arxiv.org/abs/2509.05193
作者: Bastien Dubail,Stefan Stojanovic,Alexandre Proutière
类目: Machine Learning (cs.LG)
*备注: 67 pages, 11 figures
Abstract:Low-rank structure is a common implicit assumption in many modern reinforcement learning (RL) algorithms. For instance, reward-free and goal-conditioned RL methods often presume that the successor measure admits a low-rank representation. In this work, we challenge this assumption by first remarking that the successor measure itself is not low-rank. Instead, we demonstrate that a low-rank structure naturally emerges in the shifted successor measure, which captures the system dynamics after bypassing a few initial transitions. We provide finite-sample performance guarantees for the entry-wise estimation of a low-rank approximation of the shifted successor measure from sampled entries. Our analysis reveals that both the approximation and estimation errors are primarily governed by the so-called spectral recoverability of the corresponding matrix. To bound this parameter, we derive a new class of functional inequalities for Markov chains that we call Type II Poincaré inequalities and from which we can quantify the amount of shift needed for effective low-rank approximation and estimation. This analysis shows in particular that the required shift depends on the decay of the high-order singular values of the shifted successor measure and is hence typically small in practice. Additionally, we establish a connection between the necessary shift and the local mixing properties of the underlying dynamical system, which provides a natural way of selecting the shift. Finally, we validate our theoretical findings with experiments, and demonstrate that shifting the successor measure indeed leads to improved performance in goal-conditioned RL.
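A minimal sketch of the shift effect on a toy chain (the lazy ring random walk and discount factor are illustrative choices, not the paper's setting): the singular values of the shifted successor measure P^k (1-\gamma)(I-\gamma P)^{-1} concentrate faster as the shift k grows.

```python
import numpy as np

n, gamma = 50, 0.95
P = np.zeros((n, n))
for i in range(n):                     # lazy random walk on a ring
    P[i, i] = 0.5
    P[i, (i - 1) % n] += 0.25
    P[i, (i + 1) % n] += 0.25

M = (1 - gamma) * np.linalg.inv(np.eye(n) - gamma * P)  # successor measure
for k in (0, 5, 20):
    Mk = np.linalg.matrix_power(P, k) @ M               # shifted by k steps
    s = np.linalg.svd(Mk, compute_uv=False)
    print(f"shift k={k}: singular mass beyond rank 5 = {s[5:].sum() / s.sum():.4f}")
```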
[LG-9] KVCompose: Efficient Structured KV Cache Compression with Composite Tokens
链接: https://arxiv.org/abs/2509.05165
作者: Dmitry Akulov,Mohamed Sana,Antonio De Domenico,Tareq Si Salem,Nicola Piovesan,Fadhel Ayed
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models (LLMs) rely on key-value (KV) caches for efficient autoregressive decoding; however, cache size grows linearly with context length and model depth, becoming a major bottleneck in long-context inference. Prior KV cache compression methods either enforce rigid heuristics, disrupt tensor layouts with per-attention-head variability, or require specialized compute kernels. We propose a simple, yet effective, KV cache compression framework based on attention-guided, layer-adaptive composite tokens. Our method aggregates attention scores to estimate token importance, selects head-specific tokens independently, and aligns them into composite tokens that respect the uniform cache structure required by existing inference engines. A global allocation mechanism further adapts retention budgets across layers, assigning more capacity to layers with informative tokens. This approach achieves significant memory reduction while preserving accuracy, consistently outperforming prior structured and semi-structured methods. Crucially, our approach remains fully compatible with standard inference pipelines, offering a practical and scalable solution for efficient long-context LLM deployment.
[LG-10] Foundational Models and Federated Learning: Survey, Taxonomy, Challenges and Practical Insights
链接: https://arxiv.org/abs/2509.05142
作者: Cosmin-Andrei Hatfaludi,Alex Serban
类目: Machine Learning (cs.LG)
*备注:
Abstract:Federated learning has the potential to unlock siloed data and distributed resources by enabling collaborative model training without sharing private data. As more complex foundational models gain widespread use, the need to expand training resources and integrate privately owned data grows as well. In this article, we explore the intersection of federated learning and foundational models, aiming to identify, categorize, and characterize technical methods that integrate the two paradigms. As a unified survey is currently unavailable, we present a literature survey structured around a novel taxonomy that follows the development life-cycle stages, along with a technical comparison of available methods. Additionally, we provide practical insights and guidelines for implementing and evolving these methods, with a specific focus on the healthcare domain as a case study, where the potential impact of federated learning and foundational models is considered significant. Our survey covers multiple intersecting topics, including but not limited to federated learning, self-supervised learning, fine-tuning, distillation, and transfer learning. Initially, we retrieved and reviewed a set of over 4,200 articles. This collection was narrowed to more than 250 thoroughly reviewed articles through inclusion criteria, featuring 42 unique methods. The methods were used to construct the taxonomy and enabled their comparison based on complexity, efficiency, and scalability. We present these results as a self-contained overview that not only summarizes the state of the field but also provides insights into the practical aspects of adopting, evolving, and integrating foundational models with federated learning.
[LG-11] On the Learnability of Distribution Classes with Adaptive Adversaries
链接: https://arxiv.org/abs/2509.05137
作者: Tosca Lechner,Alex Bie,Gautam Kamath
类目: Machine Learning (cs.LG)
*备注:
Abstract:We consider the question of learnability of distribution classes in the presence of adaptive adversaries – that is, adversaries capable of intercepting the samples requested by a learner and applying manipulations with full knowledge of the samples before passing them on to the learner. This stands in contrast to oblivious adversaries, who can only modify the underlying distribution the samples come from but not their i.i.d. nature. We formulate a general notion of learnability with respect to adaptive adversaries, taking into account the budget of the adversary. We show that learnability with respect to additive adaptive adversaries is a strictly stronger condition than learnability with respect to additive oblivious adversaries.
[LG-12] Should We Always Train Models on Fine-Grained Classes?
链接: https://arxiv.org/abs/2509.05130
作者: Davide Pirovano,Federico Milanesio,Michele Caselle,Piero Fariselli,Matteo Osella
类目: Machine Learning (cs.LG)
*备注: 13 pages, 7 figures
Abstract:In classification problems, models must predict a class label based on the input data features. However, class labels are organized hierarchically in many datasets. While a classification task is often defined at a specific level of this hierarchy, training can utilize a finer granularity of labels. Empirical evidence suggests that such fine-grained training can enhance performance. In this work, we investigate the generality of this observation and explore its underlying causes using both real and synthetic datasets. We show that training on fine-grained labels does not universally improve classification accuracy. Instead, the effectiveness of this strategy depends critically on the geometric structure of the data and its relations with the label hierarchy. Additionally, factors such as dataset size and model capacity significantly influence whether fine-grained labels provide a performance benefit.
[LG-13] Efficient Exact Resistance Distance Computation on Small-Treewidth Graphs: a Labelling Approach SIGMOD2026
链接: https://arxiv.org/abs/2509.05129
作者: Meihao Liao,Yueyang Pan,Rong-Hua Li,Guoren Wang
类目: Databases (cs.DB); Discrete Mathematics (cs.DM); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: Accepted by SIGMOD 2026
Abstract:Resistance distance computation is a fundamental problem in graph analysis, yet existing random walk-based methods are limited to approximate solutions and suffer from poor efficiency on small-treewidth graphs (e.g., road networks). In contrast, shortest-path distance computation achieves remarkable efficiency on such graphs by leveraging cut properties and tree decompositions. Motivated by this disparity, we first analyze the cut property of resistance distance. While a direct generalization proves impractical due to costly matrix operations, we overcome this limitation by integrating tree decompositions, revealing that the resistance distance r(s,t) depends only on labels along the paths from s and t to the root of the decomposition. This insight enables compact labelling structures. Based on this, we propose TreeIndex, a novel index method that constructs a resistance distance labelling of size O(n \cdot h_{\mathcal{G}}) in O(n \cdot h_{\mathcal{G}}^2 \cdot d_{\max}) time, where h_{\mathcal{G}} (tree height) and d_{\max} (maximum degree) behave as small constants in many real-world small-treewidth graphs (e.g., road networks). Our labelling supports exact single-pair queries in O(h_{\mathcal{G}}) time and single-source queries in O(n \cdot h_{\mathcal{G}}) time. Extensive experiments show that TreeIndex substantially outperforms state-of-the-art approaches. For instance, on the full USA road network, it constructs a 405 GB labelling in 7 hours (single-threaded) and answers exact single-pair queries in 10^{-3} seconds and single-source queries in 190 seconds, the first exact method scalable to such large graphs.
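For reference, the quantity being indexed can be computed exactly from the Laplacian pseudoinverse, as in this sketch on a toy graph; the O(n^3) cost of this textbook baseline is precisely what TreeIndex avoids:

```python
import numpy as np

# toy graph: adjacency of a 4-cycle with one chord
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 1, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A        # graph Laplacian
Lp = np.linalg.pinv(L)                # Moore-Penrose pseudoinverse, O(n^3)

def resistance(s, t):
    """Exact resistance distance r(s,t) = L+[s,s] + L+[t,t] - 2 L+[s,t]."""
    return Lp[s, s] + Lp[t, t] - 2 * Lp[s, t]

print(resistance(0, 2))
```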
[LG-14] HyPINO: Multi-Physics Neural Operators via HyperPINNs and the Method of Manufactured Solutions
链接: https://arxiv.org/abs/2509.05117
作者: Rafael Bischof,Michal Piovarči,Michael A. Kraus,Siddhartha Mishra,Bernd Bickel
类目: Machine Learning (cs.LG)
*备注:
Abstract:We present HyPINO, a multi-physics neural operator designed for zero-shot generalization across a broad class of parametric PDEs without requiring task-specific fine-tuning. Our approach combines a Swin Transformer-based hypernetwork with mixed supervision: (i) labeled data from analytical solutions generated via the Method of Manufactured Solutions (MMS), and (ii) unlabeled samples optimized using physics-informed objectives. The model maps PDE parametrizations to target Physics-Informed Neural Networks (PINNs) and can handle linear elliptic, hyperbolic, and parabolic equations in two dimensions with varying source terms, geometries, and mixed Dirichlet/Neumann boundary conditions, including interior boundaries. HyPINO achieves strong zero-shot accuracy on seven benchmark problems from PINN literature, outperforming U-Nets, Poseidon, and Physics-Informed Neural Operators (PINO). Further, we introduce an iterative refinement procedure that compares the physics of the generated PINN to the requested PDE and uses the discrepancy to generate a “delta” PINN. Summing their contributions and repeating this process forms an ensemble whose combined solution progressively reduces the error on six benchmarks and achieves over 100x gain in average L_2 loss in the best case, while retaining forward-only inference. Additionally, we evaluate the fine-tuning behavior of PINNs initialized by HyPINO and show that they converge faster and to lower final error than both randomly initialized and Reptile-meta-learned PINNs on five benchmarks, performing on par on the remaining two. Our results highlight the potential of this scalable approach as a foundation for extending neural operators toward solving increasingly complex, nonlinear, and high-dimensional PDE problems with significantly improved accuracy and reduced computational cost.
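A minimal sketch of the Method of Manufactured Solutions used for the labeled data (the Poisson operator and the chosen u are illustrative): pick a solution u, derive the matching source f symbolically, and the pair (u, f) is an exact supervised sample.

```python
import sympy as sp

x, y = sp.symbols("x y")
u = sp.sin(sp.pi * x) * sp.sin(2 * sp.pi * y)      # manufactured solution
f = -(sp.diff(u, x, 2) + sp.diff(u, y, 2))         # so that -Laplace(u) = f

print(sp.simplify(f))   # 5*pi**2*sin(pi*x)*sin(2*pi*y)

# lambdify to sample (u, f) on a grid as a training pair
u_fn = sp.lambdify((x, y), u, "numpy")
f_fn = sp.lambdify((x, y), f, "numpy")
```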
[LG-15] Recurrent State Encoders for Efficient Neural Combinatorial Optimization
链接: https://arxiv.org/abs/2509.05084
作者: Tim Dernedde,Daniela Thyssens,Lars Schmidt-Thieme
类目: Machine Learning (cs.LG)
*备注: 22 pages, 7 figures
Abstract:The primary paradigm in Neural Combinatorial Optimization (NCO) is construction methods, where a neural network is trained to sequentially add one solution component at a time until a complete solution is constructed. We observe that the typical changes to the state between two steps are small, since usually only the node that gets added to the solution is removed from the state. An efficient model should be able to reuse computation done in prior steps. To that end, we propose to train a recurrent encoder that computes the state embeddings not only based on the state but also the embeddings of the step before. We show that the recurrent encoder can achieve equivalent or better performance than a non-recurrent encoder even if it consists of 3\times fewer layers, thus significantly improving on latency. We demonstrate our findings on three different problems: the Traveling Salesman Problem (TSP), the Capacitated Vehicle Routing Problem (CVRP), and the Orienteering Problem (OP) and integrate the models into a large neighborhood search algorithm, to showcase the practical relevance of our findings.
[LG-16] MultiSurv: A Multimodal Deep Survival Framework for Prostate and Bladder Cancer
链接: https://arxiv.org/abs/2509.05037
作者: Noorul Wahab,Ethar Alzaid,Jiaqi Lv,Adam Shephard,Shan E Ahmed Raza
类目: Machine Learning (cs.LG)
*备注: 6 pages, 1 figure, 2 tables
Abstract:Accurate prediction of time-to-event outcomes is a central challenge in oncology, with significant implications for treatment planning and patient management. In this work, we present MultiSurv, a multimodal deep survival model utilising DeepHit with a projection layer and inter-modality cross-attention, which integrates heterogeneous patient data, including clinical, MRI, RNA-seq and whole-slide pathology features. The model is designed to capture complementary prognostic signals across modalities and estimate individualised time-to-biochemical recurrence in prostate cancer and time-to-cancer recurrence in bladder cancer. Our approach was evaluated in the context of the CHIMERA Grand Challenge, across two of the three provided tasks. For Task 1 (prostate cancer biochemical recurrence prediction), the proposed framework achieved a concordance index (C-index) of 0.843 on 5-fold cross-validation and 0.818 on the CHIMERA development set, demonstrating robust discriminatory ability. For Task 3 (bladder cancer recurrence prediction), the model obtained a C-index of 0.662 on 5-fold cross-validation and 0.457 on the development set, highlighting its adaptability and potential for clinical translation. These results suggest that leveraging multimodal integration with deep survival learning provides a promising pathway toward personalised risk stratification in prostate and bladder cancer. Beyond the challenge setting, our framework is broadly applicable to survival prediction tasks involving heterogeneous biomedical data.
[LG-17] Depth-Aware Initialization for Stable and Efficient Neural Network Training
链接: https://arxiv.org/abs/2509.05018
作者: Vijay Pandey
类目: Machine Learning (cs.LG)
*备注:
Abstract:In the past few years, various initialization schemes have been proposed, including Glorot initialization, He initialization, orthogonal-matrix initialization, and the random walk method. Some of these methods emphasize keeping unit variance of activations and gradients as they propagate through the network layers. A few of these methods are independent of depth information, while others take the total network depth into account for better initialization. In this paper, a comprehensive study is conducted in which the depth of each layer, as well as the total network depth, is incorporated into the initialization scheme. We also show that for deeper networks the theoretical assumption of unit variance throughout the network does not perform well; instead, the variance needs to increase from the first layer's activations to the last layer's. We propose a novel and flexible way to increase the variance of the network that incorporates the depth information of each layer. Experiments show that the proposed method performs better than existing initialization schemes.
[LG-18] On approximating the f-divergence between two Ising models
链接: https://arxiv.org/abs/2509.05016
作者: Weiming Feng,Yucheng Fu
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Probability (math.PR)
*备注:
Abstract:The f-divergence is a fundamental notion that measures the difference between two distributions. In this paper, we study the problem of approximating the f-divergence between two Ising models, which is a generalization of recent work on approximating the TV-distance. Given two Ising models \nu and \mu, which are specified by their interaction matrices and external fields, the problem is to approximate the f-divergence D_f(\nu \| \mu) within an arbitrary relative error \mathrm{e}^{\pm \varepsilon}. For the \chi^\alpha-divergence with a constant integer \alpha, we establish both algorithmic and hardness results. The algorithm works in a parameter regime that matches the hardness result. Our algorithm can be extended to other f-divergences such as the \alpha-divergence, Kullback-Leibler divergence, Rényi divergence, Jensen-Shannon divergence, and squared Hellinger distance.
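As a brute-force reference for the quantity studied (feasible only for a handful of spins, which is why approximation algorithms matter; the energy convention and random couplings are illustrative):

```python
import itertools
import numpy as np

def ising_probs(J, h):
    """Exact Ising distribution p(x) ∝ exp(x^T J x + h^T x) by enumeration."""
    n = len(h)
    states = np.array(list(itertools.product([-1, 1], repeat=n)))
    energies = np.einsum("si,ij,sj->s", states, J, states) + states @ h
    w = np.exp(energies)
    return w / w.sum()

rng = np.random.default_rng(0)
n = 8
J1 = 0.1 * rng.standard_normal((n, n)); J1 = (J1 + J1.T) / 2
J2 = J1 + 0.02 * rng.standard_normal((n, n)); J2 = (J2 + J2.T) / 2
h = rng.standard_normal(n)

nu, mu = ising_probs(J1, h), ising_probs(J2, h)
chi2 = np.sum((nu - mu) ** 2 / mu)   # chi^2-divergence, i.e. D_f with f(t)=(t-1)^2
print(chi2)
```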
[LG-19] Directed Evolution of Proteins via Bayesian Optimization in Embedding Space
链接: https://arxiv.org/abs/2509.04998
作者: Matouš Soldát,Jiří Kléma
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: 8 pages, 2 figures
Abstract:Directed evolution is an iterative laboratory process for designing proteins with improved function by synthesizing new protein variants and evaluating their desired property with expensive and time-consuming biochemical screening. Machine learning methods can help select informative or promising variants for screening to increase their quality and reduce the amount of necessary screening. In this paper, we present a novel method for machine-learning-assisted directed evolution of proteins which combines Bayesian optimization with an informative representation of protein variants extracted from a pre-trained protein language model. We demonstrate that the new representation based on the sequence embeddings significantly improves the performance of Bayesian optimization, yielding better results with the same total number of screenings. At the same time, our method outperforms the state-of-the-art machine-learning-assisted directed evolution methods with a regression objective.
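A minimal sketch of the Bayesian-optimization loop over embedding space (random vectors stand in for protein language model embeddings; the GP kernel and expected-improvement acquisition are assumptions, not necessarily the paper's choices):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
pool = rng.standard_normal((500, 32))          # candidate variant embeddings
true_fitness = pool @ rng.standard_normal(32)  # hidden "lab measurement"

picked = list(rng.choice(500, 5, replace=False))
for _ in range(10):                            # 10 screening rounds
    gp = GaussianProcessRegressor(RBF(), normalize_y=True)
    gp.fit(pool[picked], true_fitness[picked])
    mu, sd = gp.predict(pool, return_std=True)
    best = true_fitness[picked].max()
    z = (mu - best) / (sd + 1e-9)
    ei = (mu - best) * norm.cdf(z) + sd * norm.pdf(z)  # expected improvement
    ei[picked] = -np.inf                       # never re-screen a variant
    picked.append(int(np.argmax(ei)))          # next variant to screen
print("best fitness found:", true_fitness[picked].max())
```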
[LG-20] MAIA: An Inpainting-Based Approach for Music Adversarial Attacks
链接: https://arxiv.org/abs/2509.04980
作者: Yuxuan Liu,Peihong Zhang,Rui Sang,Zhixin Li,Shengchen Li
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted at ISMIR2025
Abstract:Music adversarial attacks have garnered significant interest in the field of Music Information Retrieval (MIR). In this paper, we present Music Adversarial Inpainting Attack (MAIA), a novel adversarial attack framework that supports both white-box and black-box attack scenarios. MAIA begins with an importance analysis to identify critical audio segments, which are then targeted for modification. Utilizing generative inpainting models, these segments are reconstructed with guidance from the output of the attacked model, ensuring subtle and effective adversarial perturbations. We evaluate MAIA on multiple MIR tasks, demonstrating high attack success rates in both white-box and black-box settings while maintaining minimal perceptual distortion. Additionally, subjective listening tests confirm the high audio fidelity of the adversarial samples. Our findings highlight vulnerabilities in current MIR systems and emphasize the need for more robust and secure models.
[LG-21] Adapt in the Wild: Test-Time Entropy Minimization with Sharpness and Feature Regularization
链接: https://arxiv.org/abs/2509.04977
作者: Shuaicheng Niu,Guohao Chen,Deyu Chen,Yifan Zhang,Jiaxiang Wu,Zhiquan Wen,Yaofo Chen,Peilin Zhao,Chunyan Miao,Mingkui Tan
类目: Machine Learning (cs.LG)
*备注: 25 pages, 27 tables, 14 figures. arXiv admin note: substantial text overlap with arXiv:2302.12400
Abstract:Test-time adaptation (TTA) may fail to improve or even harm the model performance when test data have: 1) mixed distribution shifts, 2) small batch sizes, 3) online imbalanced label distribution shifts. This is often a key obstacle preventing existing TTA methods from being deployed in the real world. In this paper, we investigate the reasons for this instability and find that the batch norm layer is a crucial factor hindering TTA stability. Conversely, TTA can perform more stably with batch-agnostic norm layers, i.e., group or layer norm. However, we observe that TTA with group and layer norms does not always succeed and still suffers many failure cases, i.e., the model collapses into trivial solutions by assigning the same class label for all samples. By digging into this, we find that, during the collapse process: 1) the model gradients often undergo an initial explosion followed by rapid degradation, suggesting that certain noisy test samples with large gradients may disrupt adaptation; and 2) the model representations tend to exhibit high correlations and classification bias. To address this, we first propose a sharpness-aware and reliable entropy minimization method, called SAR, for stabilizing TTA from two aspects: 1) remove partial noisy samples with large gradients, 2) encourage model weights to go to a flat minimum so that the model is robust to the remaining noisy samples. Based on SAR, we further introduce SAR^2 to prevent representation collapse with two regularizers: 1) a redundancy regularizer to reduce inter-dimensional correlations among centroid-invariant features; and 2) an inequity regularizer to maximize the prediction entropy of a prototype centroid, thereby penalizing biased representations toward any specific class. Promising results demonstrate that our methods perform more stably than prior methods and are computationally efficient under the above wild test scenarios.
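A minimal sketch of the reliable-entropy ingredient of SAR (the group-norm toy model and the entropy threshold are assumptions; the sharpness-aware step and the SAR^2 regularizers are omitted):

```python
import torch, torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.GroupNorm(4, 32),
                      nn.ReLU(), nn.Linear(32, 10))
# adapt only the normalization affine parameters, as is common in TTA
for p in model.parameters():
    p.requires_grad_(False)
norm_params = []
for m in model.modules():
    if isinstance(m, nn.GroupNorm):
        for p in m.parameters():
            p.requires_grad_(True)
            norm_params.append(p)
opt = torch.optim.SGD(norm_params, lr=1e-3)

e_max = 0.4 * torch.log(torch.tensor(10.0))    # assumed threshold, 10 classes
for _ in range(100):                           # stream of unlabeled test batches
    x = torch.randn(64, 16)
    probs = model(x).softmax(dim=1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
    reliable = entropy < e_max                 # drop large-gradient samples
    if reliable.any():
        loss = entropy[reliable].mean()
        opt.zero_grad(); loss.backward(); opt.step()
```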
[LG-22] Topology-Aware Graph Reinforcement Learning for Dynamic Routing in Cloud Networks
链接: https://arxiv.org/abs/2509.04973
作者: Yuxi Wang,Heyao Liu,Guanzi Yao,Nyutian Long,Yue Kang
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper proposes a topology-aware graph reinforcement learning approach to address the routing policy optimization problem in cloud server environments. The method builds a unified framework for state representation and structural evolution by integrating a Structure-Aware State Encoding (SASE) module and a Policy-Adaptive Graph Update (PAGU) mechanism. It aims to tackle the challenges of decision instability and insufficient structural awareness under dynamic topologies. The SASE module models node states through multi-layer graph convolution and structural positional embeddings, capturing high-order dependencies in the communication topology and enhancing the expressiveness of state representations. The PAGU module adjusts the graph structure based on policy behavior shifts and reward feedback, enabling adaptive structural updates in dynamic environments. Experiments are conducted on the real-world GEANT topology dataset, where the model is systematically evaluated against several representative baselines in terms of throughput, latency control, and link balance. Additional experiments, including hyperparameter sensitivity, graph sparsity perturbation, and node feature dimensionality variation, further explore the impact of structure modeling and graph updates on model stability and decision quality. Results show that the proposed method outperforms existing graph reinforcement learning models across multiple performance metrics, achieving efficient and robust routing in dynamic and complex cloud networks.
[LG-23] Neuro-Spectral Architectures for Causal Physics-Informed Networks
链接: https://arxiv.org/abs/2509.04966
作者: Arthur Bizzi,Leonardo M. Moreira,Márcio Marques,Leonardo Mendonça,Christian Júnior de Oliveira,Vitor Balestro,Lucas dos Santos Fernandez,Daniel Yukimura,Pavel Petrov,João M. Pereira,Tiago Novello,Lucas Nissenbaum
类目: Machine Learning (cs.LG)
*备注: 24 pages, 10 figures
Abstract:Physics-Informed Neural Networks (PINNs) have emerged as a powerful neural framework for solving partial differential equations (PDEs). However, standard MLP-based PINNs often fail to converge when dealing with complex initial-value problems, leading to solutions that violate causality and suffer from a spectral bias towards low-frequency components. To address these issues, we introduce NeuSA (Neuro-Spectral Architectures), a novel class of PINNs inspired by classical spectral methods, designed to solve linear and nonlinear PDEs with variable coefficients. NeuSA learns a projection of the underlying PDE onto a spectral basis, leading to a finite-dimensional representation of the dynamics which is then integrated with an adapted Neural ODE (NODE). This allows us to overcome spectral bias, by leveraging the high-frequency components enabled by the spectral representation; to enforce causality, by inheriting the causal structure of NODEs, and to start training near the target solution, by means of an initialization scheme based on classical methods. We validate NeuSA on canonical benchmarks for linear and nonlinear wave equations, demonstrating strong performance as compared to other architectures, with faster convergence, improved temporal consistency and superior predictive accuracy. Code and pretrained models will be released.
[LG-24] On the Normalization of Confusion Matrices: Methods and Geometric Interpretations
链接: https://arxiv.org/abs/2509.04959
作者: Johan Erbani,Pierre-Edouard Portier,Elod Egyed-Zsigmond,Sonia Ben Mokhtar,Diana Nurbakova
类目: Machine Learning (cs.LG)
*备注:
Abstract:The confusion matrix is a standard tool for evaluating classifiers by providing insights into class-level errors. In heterogeneous settings, its values are shaped by two main factors: class similarity – how easily the model confuses two classes – and distribution bias, arising from skewed distributions in the training and test sets. However, confusion matrix values reflect a mix of both factors, making it difficult to disentangle their individual contributions. To address this, we introduce bistochastic normalization using Iterative Proportional Fitting, a generalization of row and column normalization. Unlike standard normalizations, this method recovers the underlying structure of class similarity. By disentangling error sources, it enables more accurate diagnosis of model behavior and supports more targeted improvements. We also show a correspondence between confusion matrix normalizations and the model’s internal class representations. Both standard and bistochastic normalizations can be interpreted geometrically in this space, offering a deeper understanding of what normalization reveals about a classifier.
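A minimal sketch of bistochastic normalization via Iterative Proportional Fitting, i.e., Sinkhorn-Knopp iterations (the confusion counts are illustrative): alternately rescale rows and columns until both sum to one.

```python
import numpy as np

C = np.array([[50.,  5.,  1.],
              [ 8., 30.,  2.],
              [ 2.,  4., 10.]])       # raw confusion counts

M = C.copy()
for _ in range(200):                   # IPF iterations
    M /= M.sum(axis=1, keepdims=True)  # row normalization
    M /= M.sum(axis=0, keepdims=True)  # column normalization

print(M.round(3))                      # rows and columns now sum to ~1
```

With skewed counts like these, the bistochastic matrix helps expose which off-diagonal entries reflect genuine class similarity rather than distribution bias.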
[LG-25] Detecting Blinks in Healthy and Parkinson's EEG: A Deep Learning Perspective
链接: https://arxiv.org/abs/2509.04951
作者: Artem Lensky,Yiding Qiu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Blinks in electroencephalography (EEG) are often treated as unwanted artifacts. However, recent studies have demonstrated that blink rate and its variability are important physiological markers to monitor cognitive load, attention, and potential neurological disorders. This paper addresses the critical task of accurate blink detection by evaluating various deep learning models for segmenting EEG signals into involuntary blinks and non-blinks. We present a pipeline for blink detection using 1, 3, or 5 frontal EEG electrodes. The problem is formulated as a sequence-to-sequence task and tested on various deep learning architectures including standard recurrent neural networks, convolutional neural networks (both standard and depth-wise), temporal convolutional networks (TCN), transformer-based models, and hybrid architectures. The models were trained on raw EEG signals with minimal pre-processing. Training and testing were carried out on a public dataset of 31 subjects collected at UCSD, consisting of 15 healthy participants and 16 patients with Parkinson’s disease, which allowed us to verify the model’s robustness to tremor. Out of all models, the CNN-RNN hybrid model consistently outperformed the others and achieved the best blink detection accuracy of 93.8%, 95.4% and 95.8% with 1, 3, and 5 channels in the healthy cohort and, correspondingly, 73.8%, 75.4% and 75.8% in patients with PD. The paper compares neural networks on the task of segmenting EEG recordings into involuntary blinks and non-blinks, allowing blink rate and other statistics to be computed.
[LG-26] Ontology-Aligned Embeddings for Data-Driven Labour Market Analytics
链接: https://arxiv.org/abs/2509.04942
作者: Heinke Hihn,Dennis A. V. Dittrich,Carl Jeske,Cayo Costa Sobral,Helio Pais,Timm Lochmann
类目: Machine Learning (cs.LG)
*备注: Workshop SIG Knowledge Management (FG WM) at KI2025, Potsdam, Germany
Abstract:The limited ability to reason across occupational data from different sources is a long-standing bottleneck for data-driven labour market analytics. Previous research has relied on hand-crafted ontologies that allow such reasoning but are computationally expensive and require careful maintenance by human experts. The rise of language processing machine learning models offers a scalable alternative by learning shared semantic spaces that bridge diverse occupational vocabularies without extensive human curation. We present an embedding-based alignment process that links any free-form German job title to two established ontologies - the German Klassifikation der Berufe and the International Standard Classification of Education. Using publicly available data from the German Federal Employment Agency, we construct a dataset to fine-tune a Sentence-BERT model to learn the structure imposed by the ontologies. The enriched pairs (job title, embedding) define a similarity graph structure that we can use for efficient approximate nearest-neighbour search, allowing us to frame the classification process as a semantic search problem. This allows for greater flexibility, e.g., adding more classes. We discuss design decisions, open challenges, and outline ongoing work on extending the graph with other ontologies and multilingual titles.
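A minimal sketch of the retrieval step framed as semantic search (the pretrained checkpoint and ontology labels are illustrative; the paper fine-tunes its own Sentence-BERT on agency data):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
ontology = ["Softwareentwickler/in", "Krankenpfleger/in", "Maurer/in"]
onto_emb = model.encode(ontology, convert_to_tensor=True,
                        normalize_embeddings=True)

# embed a free-form job title and retrieve the nearest ontology class
query = model.encode("Backend Engineer (Python)", convert_to_tensor=True,
                     normalize_embeddings=True)
scores = util.cos_sim(query, onto_emb)[0]
print(ontology[int(scores.argmax())], float(scores.max()))
```

In a production setting the exhaustive cosine scan would be replaced by an approximate nearest-neighbour index, as the abstract describes.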
[LG-27] A transformer-BiGRU-based framework with data augmentation and confident learning for network intrusion detection
链接: https://arxiv.org/abs/2509.04925
作者: Jiale Zhang,Pengfei He,Fei Li,Kewei Li,Yan Wang,Lan Huang,Ruochi Zhang,Fengfeng Zhou
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:In today’s fast-paced digital communication, the surge in network traffic volume and frequency demands robust and precise network intrusion solutions. Conventional machine learning methods struggle to capture complex patterns within the vast network intrusion datasets, which suffer from data scarcity and class imbalance. As a result, we have integrated machine learning and deep learning techniques within the network intrusion detection system to bridge this gap. This study has developed TrailGate, a novel framework that combines machine learning and deep learning techniques. By integrating Transformer and Bidirectional Gated Recurrent Unit (BiGRU) architectures with advanced feature selection strategies, supplemented by data augmentation techniques, TrailGate excels at detecting common, well-understood attack types and can swiftly identify and neutralize emerging threats that stem from existing paradigms.
[LG-28] Scaling Law for Large-Scale Pre-Training Using Chaotic Time Series and Predictability in Financial Time Series
链接: https://arxiv.org/abs/2509.04921
作者: Yuki Takemoto
类目: Machine Learning (cs.LG)
*备注: Patent pending
Abstract:Time series forecasting plays a critical role in decision-making processes across diverse fields including meteorology, traffic, electricity, economics, finance, and so on. In particular, predicting returns on financial instruments is a challenging problem. Some researchers have proposed time series foundation models applicable to various forecasting tasks. Simultaneously, based on the recognition that real-world time series exhibit chaotic properties, methods have been developed to artificially generate synthetic chaotic time series, construct diverse datasets and train models. In this study, we propose a methodology for modeling financial time series by generating artificial chaotic time series and applying resampling techniques to simulate financial time series data, which we then use as training samples. Increasing the resampling interval to extend predictive horizons, we conducted large-scale pre-training using 10 billion training samples for each case. We subsequently created test datasets for multiple timeframes using actual Bitcoin trade data and performed zero-shot prediction without re-training the pre-trained model. The results of evaluating the profitability of a simple trading strategy based on these predictions demonstrated significant performance improvements over autocorrelation models. During the large-scale pre-training process, we observed a scaling-law-like phenomenon: for chaotic time series, a given level of predictive performance can be maintained at extended predictive horizons by increasing the number of training samples exponentially. If this scaling law proves robust and holds true across various chaotic models, it suggests the potential to predict near-future events by investing substantial computational resources. Future research should focus on further large-scale training and verifying the applicability of this scaling law to diverse chaotic models.
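A minimal sketch of the generate-then-resample idea (the logistic map, its parameter, and the resampling intervals are illustrative choices, not the paper's generators):

```python
import numpy as np

def logistic_series(n, r=3.9, x0=0.3):
    """Chaotic logistic map x_{t+1} = r * x_t * (1 - x_t)."""
    x = np.empty(n); x[0] = x0
    for i in range(1, n):
        x[i] = r * x[i - 1] * (1.0 - x[i - 1])
    return x

raw = logistic_series(1_000_000)
for interval in (1, 5, 20):            # larger interval -> longer horizon
    series = raw[::interval]           # resample to mimic coarser timeframes
    target = np.diff(series)           # returns-like target, as in finance
    print(interval, series.shape, float(target.std()))
```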
[LG-29] Revolution or Hype? Seeking the Limits of Large Models in Hardware Design
链接: https://arxiv.org/abs/2509.04905
作者: Qiang Xu,Leon Stok,Rolf Drechsler,Xi Wang,Grace Li Zhang,Igor L. Markov
类目: Machine Learning (cs.LG)
*备注: Invited paper to appear at ICCAD’25
Abstract:Recent breakthroughs in Large Language Models (LLMs) and Large Circuit Models (LCMs) have sparked excitement across the electronic design automation (EDA) community, promising a revolution in circuit design and optimization. Yet, this excitement is met with significant skepticism: Are these AI models a genuine revolution in circuit design, or a temporary wave of inflated expectations? This paper serves as a foundational text for the corresponding ICCAD 2025 panel, bringing together perspectives from leading experts in academia and industry. It critically examines the practical capabilities, fundamental limitations, and future prospects of large AI models in hardware design. The paper synthesizes the core arguments surrounding reliability, scalability, and interpretability, framing the debate on whether these models can meaningfully outperform or complement traditional EDA methods. The result is an authoritative overview offering fresh insights into one of today’s most contentious and impactful technology trends.
[LG-30] Learning and composing of classical music using restricted Boltzmann machines
链接: https://arxiv.org/abs/2509.04899
作者: Mutsumi Kobayashi,Hiroshi Watanabe
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 19 pages, 10 figures
Abstract:Recently, software has been developed that uses machine learning to mimic the style of a particular composer, such as J. S. Bach. However, since such software often adopts machine learning models with complex structures, it is difficult to analyze how the software understands the characteristics of the composer’s music. In this study, we used J. S. Bach’s music to train a restricted Boltzmann machine (RBM). Since the structure of an RBM is simple, it allows us to investigate its internal states after learning. We found that the learned RBM is able to compose music.
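A minimal sketch of the setup (scikit-learn's BernoulliRBM on a random binary piano-roll stand-in; the paper's encoding of Bach's scores and its RBM details are not reproduced here):

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.default_rng(0)
# toy "piano roll": 500 bars, 88 pitches, binary note-on indicators
X = (rng.random((500, 88)) < 0.08).astype(float)

rbm = BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20,
                   random_state=0)
rbm.fit(X)

# crude "composition": start from noise, run block Gibbs sampling
v = (rng.random((1, 88)) < 0.08).astype(float)
for _ in range(200):
    v = rbm.gibbs(v)           # one Gibbs step on the visible layer
print(v.nonzero()[1])          # pitches switched on in the sampled bar
```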
[LG-31] Filtering with Randomised Observations: Sequential Learning of Relevant Subspace Properties and Accuracy Analysis
链接: https://arxiv.org/abs/2509.04867
作者: Nazanin Abedini,Jana de Wiljes,Svetlana Dubinkina
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:State estimation that combines observational data with mathematical models is central to many applications and is commonly addressed through filtering methods, such as ensemble Kalman filters. In this article, we examine the signal-tracking performance of a continuous ensemble Kalman filter under fixed, randomised, and adaptively varying partial observations. Rigorous bounds are established for the expected signal-tracking error relative to the randomness of the observation operator. In addition, we propose a sequential learning scheme that adaptively determines the dimension of a state subspace sufficient to ensure bounded filtering error, by balancing observation complexity with estimation accuracy. Beyond error control, the adaptive scheme provides a systematic approach to identifying the appropriate size of the filter-relevant subspace of the underlying dynamics.
[LG-32] An Arbitration Control for an Ensemble of Diversified DQN variants in Continual Reinforcement Learning
链接: https://arxiv.org/abs/2509.04815
作者: Wonseo Jang,Dongjae Kim
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: 8 pages, 8 figures
Abstract:Deep reinforcement learning (RL) models, despite their efficiency in learning an optimal policy in static environments, easily lose previously learned knowledge (i.e., catastrophic forgetting). This leads RL models to poor performance in continual reinforcement learning (CRL) scenarios. To address this, we present an arbitration control mechanism over an ensemble of RL agents. It is motivated by and closely aligned with how humans make decisions in a CRL context, using arbitration control over multiple RL agents in parallel, as observed in the prefrontal cortex. We integrated two key ideas into our model: (1) an ensemble of RLs (i.e., DQN variants) explicitly trained to have diverse value functions and (2) an arbitration control that prioritizes agents with higher reliability (i.e., less error) in recent trials. We propose a framework for CRL, an Arbitration Control for an Ensemble of Diversified DQN variants (ACED-DQN). We demonstrate significant performance improvements in both static and continual environments, supported by empirical evidence showing the effectiveness of arbitration control over diversified DQNs during training. In this work, we introduced a framework that enables RL agents to continuously learn, with inspiration from the human brain.
[LG-33] Multimodal Foundation Model-Driven User Interest Modeling and Behavior Analysis on Short Video Platforms
链接: https://arxiv.org/abs/2509.04751
作者: Yushang Zhao,Yike Peng,Li Zhang,Qianyi Sun,Zhihui Zhang,Yingying Zhuang
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
Abstract:With the rapid expansion of user bases on short video platforms, personalized recommendation systems are playing an increasingly critical role in enhancing user experience and optimizing content distribution. Traditional interest modeling methods often rely on unimodal data, such as click logs or text labels, which limits their ability to fully capture user preferences in a complex multimodal content environment. To address this challenge, this paper proposes a multimodal foundation model-based framework for user interest modeling and behavior analysis. By integrating video frames, textual descriptions, and background music into a unified semantic space using cross-modal alignment strategies, the framework constructs fine-grained user interest vectors. Additionally, we introduce a behavior-driven feature embedding mechanism that incorporates viewing, liking, and commenting sequences to model dynamic interest evolution, thereby improving both the timeliness and accuracy of recommendations. In the experimental phase, we conduct extensive evaluations using both public and proprietary short video datasets, comparing our approach against multiple mainstream recommendation algorithms and modeling techniques. Results demonstrate significant improvements in behavior prediction accuracy, interest modeling for cold-start users, and recommendation click-through rates. Moreover, we incorporate interpretability mechanisms using attention weights and feature visualization to reveal the model’s decision basis under multimodal inputs and trace interest shifts, thereby enhancing the transparency and controllability of the recommendation system.
[LG-34] Real-Time Performance Benchmarking of TinyML Models in Embedded Systems (PICO: Performance of Inference CPU and Operations)
链接: https://arxiv.org/abs/2509.04721
作者: Abhishek Dey,Saurabh Srivastava,Gaurav Singh,Robert G. Pettit
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
Abstract:This paper presents PICO-TINYML-BENCHMARK, a modular and platform-agnostic framework for benchmarking the real-time performance of TinyML models on resource-constrained embedded systems. Evaluating key metrics such as inference latency, CPU utilization, memory efficiency, and prediction stability, the framework provides insights into computational trade-offs and platform-specific optimizations. We benchmark three representative TinyML models – Gesture Classification, Keyword Spotting, and MobileNet V2 – on two widely adopted platforms, BeagleBone AI64 and Raspberry Pi 4, using real-world datasets. Results reveal critical trade-offs: the BeagleBone AI64 demonstrates consistent inference latency for AI-specific tasks, while the Raspberry Pi 4 excels in resource efficiency and cost-effectiveness. These findings offer actionable guidance for optimizing TinyML deployments, bridging the gap between theoretical advancements and practical applications in embedded systems.
[LG-35] Natural Spectral Fusion: p-Exponent Cyclic Scheduling and Early Decision-Boundary Alignment in First-Order Optimization
链接: https://arxiv.org/abs/2509.04713
作者: Gongyue Zhang,Honghai Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Spectral behaviors have been widely discussed in machine learning, yet the optimizer’s own spectral bias remains unclear. We argue that first-order optimizers exhibit an intrinsic frequency preference that significantly reshapes the optimization path. To address this, we propose Natural Spectral Fusion (NSF): reframing training as controllable spectral coverage and information fusion rather than merely scaling step sizes. NSF has two core principles: treating the optimizer as a spectral controller that dynamically balances low- and high-frequency information; and periodically reweighting frequency bands at negligible cost, without modifying the model, data, or training pipeline. We realize NSF via a p-exponent extension of the second-moment term, enabling both positive and negative exponents, and implement it through cyclic scheduling. Theory and experiments show that adaptive methods emphasize low frequencies, SGD is near-neutral, and negative exponents amplify high-frequency information. Cyclic scheduling broadens spectral coverage, improves cross-band fusion, and induces early decision-boundary alignment, where accuracy improves even while loss remains high. Across multiple benchmarks, with identical learning-rate strategies and fixed hyperparameters, p-exponent cyclic scheduling consistently reduces test error and demonstrates distinct convergence behavior; on some tasks, it matches baseline accuracy with only one-quarter of the training cost. Overall, NSF reveals the optimizer’s role as an active spectral controller and provides a unified, controllable, and efficient framework for first-order optimization.
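A minimal sketch of the p-exponent idea (the cyclic schedule and constants are assumptions): an Adam-style update whose second-moment exponent p is cycled; p = 0.5 recovers the standard Adam denominator.

```python
import numpy as np

def p_exponent_step(w, g, m, v, t, p, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Adam-style step with a generalized second-moment exponent p."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (v_hat ** p + eps)   # p = 0.5 gives standard Adam
    return w, m, v

rng = np.random.default_rng(0)
w = rng.standard_normal(10); m = np.zeros(10); v = np.zeros(10)
cycle = [0.5, 0.25, 0.0, -0.25, 0.25]         # assumed cyclic p schedule
for t in range(1, 501):
    g = 2 * w                                 # gradient of ||w||^2
    w, m, v = p_exponent_step(w, g, m, v, t, p=cycle[t % len(cycle)])
print(np.linalg.norm(w))
```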
[LG-36] CPEP: Contrastive Pose-EMG Pre-training Enhances Gesture Generalization on EMG Signals
链接: https://arxiv.org/abs/2509.04699
作者: Wenhui Cui,Christopher Sandino,Hadi Pouransari,Ran Liu,Juri Minxha,Ellen L. Zippi,Aman Verma,Anna Sedlackova,Behrooz Mahasseni,Erdrin Azemi
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Hand gesture classification using high-quality structured data such as videos, images, and hand skeletons is a well-explored problem in computer vision. Leveraging low-power, cost-effective biosignals, e.g. surface electromyography (sEMG), allows for continuous gesture prediction on wearables. In this paper, we demonstrate that learning representations from weak-modality data that are aligned with those from structured, high-quality data can improve representation quality and enables zero-shot classification. Specifically, we propose a Contrastive Pose-EMG Pre-training (CPEP) framework to align EMG and pose representations, where we learn an EMG encoder that produces high-quality and pose-informative representations. We assess the gesture classification performance of our model through linear probing and zero-shot setups. Our model outperforms emg2pose benchmark models by up to 21% on in-distribution gesture classification and 72% on unseen (out-of-distribution) gesture classification.
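A minimal sketch of a CLIP-style contrastive alignment between paired EMG and pose embeddings (the toy encoders, feature dimensions, and temperature are assumptions, not the paper's architecture):

```python
import torch, torch.nn as nn, torch.nn.functional as F

emg_enc = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 32))
pose_enc = nn.Sequential(nn.Linear(42, 128), nn.ReLU(), nn.Linear(128, 32))

emg = torch.randn(256, 64)    # batch of EMG windows
pose = torch.randn(256, 42)   # matching hand-pose features

z_e = F.normalize(emg_enc(emg), dim=1)
z_p = F.normalize(pose_enc(pose), dim=1)
logits = z_e @ z_p.T / 0.07               # temperature tau = 0.07
labels = torch.arange(len(logits))
# symmetric InfoNCE: each EMG window must match its own pose and vice versa
loss = (F.cross_entropy(logits, labels) +
        F.cross_entropy(logits.T, labels)) / 2
loss.backward()
```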
[LG-37] Unified Representation Learning for Multi-Intent Diversity and Behavioral Uncertainty in Recommender Systems
链接: https://arxiv.org/abs/2509.04694
作者: Wei Xu,Jiasen Zheng,Junjiang Lin,Mingxuan Han,Junliang Du
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
Abstract:This paper addresses the challenge of jointly modeling user intent diversity and behavioral uncertainty in recommender systems. A unified representation learning framework is proposed. The framework builds a multi-intent representation module and an uncertainty modeling mechanism. It extracts multi-granularity interest structures from user behavior sequences. Behavioral ambiguity and preference fluctuation are captured using Bayesian distribution modeling. In the multi-intent modeling part, the model introduces multiple latent intent vectors. These vectors are weighted and fused using an attention mechanism to generate semantically rich representations of long-term user preferences. In the uncertainty modeling part, the model learns the mean and covariance of behavior representations through Gaussian distributions. This reflects the user’s confidence in different behavioral contexts. Next, a learnable fusion strategy is used to combine long-term intent and short-term behavior signals. This produces the final user representation, improving both recommendation accuracy and robustness. The method is evaluated on standard public datasets. Experimental results show that it outperforms existing representative models across multiple metrics. It also demonstrates greater stability and adaptability under cold-start and behavioral disturbance scenarios. The approach alleviates modeling bottlenecks faced by traditional methods when dealing with complex user behavior. These findings confirm the effectiveness and practical value of the unified modeling strategy in real-world recommendation tasks.
[LG-38] KRAFT: A Knowledge Graph-Based Framework for Automated Map Conflation
链接: https://arxiv.org/abs/2509.04684
作者: Farnoosh Hashemi,Laks V.S. Lakshmanan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Digital maps play a crucial role in various applications such as navigation, fleet management, and ride-sharing, necessitating their accuracy and currency, which require timely updates. While the majority of geospatial databases (GDBs) provide high-quality information, their data is (i) limited to specific regions and/or (ii) missing some entities, even in their covered areas. Map conflation is the process of augmentation of a GDB using another GDB to conflate missing spatial features. Existing map conflation methods suffer from two main limitations: (1) They are designed for the conflation of linear objects (e.g., road networks) and cannot simply be extended to non-linear objects, thus missing information about most entities in the map. (2) They are heuristic algorithmic approaches that are based on pre-defined rules, unable to learn entity matching in a data-driven manner. To address these limitations, we design KRAFT, a learning-based approach consisting of three parts: (1) Knowledge Graph Construction - where each GDB is represented by a knowledge graph, (2) Map Matching - where we use a knowledge graph alignment method as well as a geospatial feature encoder to match entities in obtained knowledge graphs, and (3) Map Merging - where we merge matched entities in the previous modules in a consistent manner, using a mixed integer linear programming formulation that fully merges the GDBs without adding any inconsistencies. Our experimental evaluation shows that not only does KRAFT achieve outstanding performance compared to state-of-the-art and baseline methods in map conflation tasks, but each of its modules (e.g., Map Matching and Map Merging) also separately outperforms traditional matching and merging methods.
[LG-39] Echoes Before Collapse: Deep Learning Detection of Flickering in Complex Systems
链接: https://arxiv.org/abs/2509.04683
作者: Yazdan Babazadeh Maghsoodlo,Madhur Anand,Chris T. Bauch
类目: Machine Learning (cs.LG)
*备注:
Abstract:Deep learning offers powerful tools for anticipating tipping points in complex systems, yet its potential for detecting flickering (noise-driven switching between coexisting stable states) remains unexplored. Flickering is a hallmark of reduced resilience in climate systems, ecosystems, financial markets, and other systems. It can precede critical regime shifts that are highly impactful but difficult to predict. Here we show that convolutional long short-term memory (CNN LSTM) models, trained on synthetic time series generated from simple polynomial functions with additive noise, can accurately identify flickering patterns. Despite being trained on simplified dynamics, our models generalize to diverse stochastic systems and reliably detect flickering in empirical datasets, including dormouse body temperature records and palaeoclimate proxies from the African Humid Period. These findings demonstrate that deep learning can extract early warning signals from noisy, nonlinear time series, providing a flexible framework for identifying instability across a wide range of dynamical systems.
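A minimal sketch of the kind of flickering signal involved: noise-driven switching in a double-well potential V(x) = x^4/4 - x^2/2, simulated with Euler-Maruyama (the noise level and step size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
dt, sigma, n = 0.01, 0.45, 20_000
x = np.empty(n); x[0] = 1.0                   # start in the right well
for i in range(1, n):
    drift = x[i - 1] - x[i - 1] ** 3          # -V'(x)
    x[i] = x[i - 1] + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal()

switches = np.sum(np.abs(np.diff(np.sign(x))) > 0)
print("zero crossings (flickering events):", switches)
```

Series like this, generated across many noise levels and potentials, are the kind of synthetic training corpus the abstract describes.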
[LG-40] Beyond Ordinary Lipschitz Constraints: Differentially Private Stochastic Optimization with Tsybakov Noise Condition
链接: https://arxiv.org/abs/2509.04668
作者: Difei Xu,Meng Ding,Zihang Xiang,Jinhui Xu,Di Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study Stochastic Convex Optimization in the Differential Privacy model (DP-SCO). Unlike previous studies, here we assume the population risk function satisfies the Tsybakov Noise Condition (TNC) with some parameter \theta > 1, where the Lipschitz constant of the loss could be extremely large or even unbounded, but the \ell_2-norm gradient of the loss has bounded k-th moment with k \geq 2. For the Lipschitz case with \theta \geq 2, we first propose an (\varepsilon, \delta)-DP algorithm whose utility bound is \tilde{O}\left(\left(\tilde{r}_{2k}\left(\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\varepsilon}\right)^{\frac{k-1}{k}}\right)^{\frac{\theta}{\theta-1}}\right) in high probability, where n is the sample size, d is the model dimension, and \tilde{r}_{2k} is a term that only depends on the 2k-th moment of the gradient. It is notable that such an upper bound is independent of the Lipschitz constant. We then extend to the case where \theta \geq \bar{\theta} > 1 for some known constant \bar{\theta}. Moreover, when the privacy budget \varepsilon is small enough, we show an upper bound of \tilde{O}\left(\left(\tilde{r}_{k}\left(\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\varepsilon}\right)^{\frac{k-1}{k}}\right)^{\frac{\theta}{\theta-1}}\right) even if the loss function is not Lipschitz. For the lower bound, we show that for any \theta \geq 2, the private minimax rate for \rho-zero Concentrated Differential Privacy is lower bounded by \Omega\left(\left(\tilde{r}_{k}\left(\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\sqrt{\rho}}\right)^{\frac{k-1}{k}}\right)^{\frac{\theta}{\theta-1}}\right).
[LG-41] Flexible inference of learning rules from de novo learning data using neural networks
链接: https://arxiv.org/abs/2509.04661
作者: Yuhan Helena Liu,Victor Geadah,Jonathan Pillow
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:
Abstract:Understanding how animals learn is a central challenge in neuroscience, with growing relevance to the development of animal- or human-aligned artificial intelligence. However, most existing approaches assume specific parametric forms for the learning rule (e.g., Q-learning, policy gradient) or are limited to simplified settings like bandit tasks, which do not involve learning a new input-output mapping from scratch. In contrast, animals must often learn new behaviors de novo, which poses a rich challenge for learning-rule inference. We target this problem by inferring learning rules directly from animal decision-making data during de novo task learning, a setting that requires models flexible enough to capture suboptimality, history dependence, and rich external stimulus integration without strong structural priors. We first propose a nonparametric framework that parameterizes the per-trial update of policy weights with a deep neural network (DNN), and validate it by recovering ground-truth rules in simulation. We then extend to a recurrent variant (RNN) that captures non-Markovian dynamics by allowing updates to depend on trial history. Applied to a large behavioral dataset of mice learning a sensory decision-making task over multiple weeks, our models improved predictions on held-out data. The inferred rules revealed asymmetric updates after correct versus error trials and history dependence, consistent with non-Markovian learning. Overall, these results introduce a flexible framework for inferring biological learning rules from behavioral data in de novo learning tasks, providing insights to inform experimental training protocols and the development of behavioral digital twins.
[LG-42] Fundamental bounds on efficiency-confidence trade-off for transductive conformal prediction
链接: https://arxiv.org/abs/2509.04631
作者: Arash Behboodi,Alvaro H.C. Correia,Fabio Valerio Massoli,Christos Louizos
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注:
Abstract:Transductive conformal prediction addresses simultaneous prediction for multiple data points. Given a desired confidence level, the objective is to construct a prediction set that includes the true outcomes with the prescribed confidence. We demonstrate a fundamental trade-off between confidence and efficiency in transductive methods, where efficiency is measured by the size of the prediction sets. Specifically, we derive a strict finite-sample bound showing that any non-trivial confidence level leads to exponential growth in prediction set size for data with inherent uncertainty. The exponent scales linearly with the number of samples and is proportional to the conditional entropy of the data. Additionally, the bound includes a second-order term, dispersion, defined as the variance of the log conditional probability distribution. We show that this bound is achievable in an idealized setting. Finally, we examine a special case of transductive prediction where all test data points share the same label. We show that this scenario reduces to a hypothesis testing problem with empirically observed statistics and provide an asymptotically optimal confidence predictor, along with an analysis of the error exponent.
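One schematic way to read the claimed trade-off (our illustrative rendering, by analogy with second-order asymptotics in information theory; not the paper's exact statement): at confidence 1-\alpha, the expected prediction-set size would obey roughly

```latex
\log \mathbb{E}\,[\,|C|\,] \;\gtrsim\; n\,H(Y \mid X) \;-\; \sqrt{n\,V}\,\Phi^{-1}(1-\alpha),
```

where H(Y|X) is the conditional entropy, V is the dispersion (the variance of the log conditional probability), and \Phi^{-1} is the standard Gaussian quantile function. This matches the abstract qualitatively: the exponent is linear in n and proportional to conditional entropy, with a second-order dispersion correction.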
[LG-43] Split Conformal Prediction in the Function Space with Neural Operators
链接: https://arxiv.org/abs/2509.04623
作者: David Millard,Lars Lindemann,Ali Baheri
类目: Machine Learning (cs.LG)
*备注: 7 pages, 4 figures, conference
Abstract:Uncertainty quantification for neural operators remains an open problem in the infinite-dimensional setting due to the lack of finite-sample coverage guarantees over functional outputs. While conformal prediction offers finite-sample guarantees in finite-dimensional spaces, it does not directly extend to function-valued outputs. Existing approaches (Gaussian processes, Bayesian neural networks, and quantile-based operators) require strong distributional assumptions or yield conservative coverage. This work extends split conformal prediction to function spaces following a two-step method. We first establish finite-sample coverage guarantees in a finite-dimensional space using a discretization map in the output function space. Then these guarantees are lifted to the function space by considering the asymptotic convergence as the discretization is refined. To characterize the effect of resolution, we decompose the conformal radius into discretization, calibration, and misspecification components. This decomposition motivates a regression-based correction to transfer calibration across resolutions. Additionally, we propose two diagnostic metrics (conformal ensemble score and internal agreement) to quantify forecast degradation in autoregressive settings. Empirical results show that our method maintains calibrated coverage with less variation under resolution shifts and achieves better coverage in super-resolution tasks.
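The finite-dimensional first step is ordinary split conformal over the discretized outputs. A minimal sketch assuming sup-norm residuals on a fixed grid (`model` is a stand-in predictor; none of these names come from the paper):

```python
import numpy as np

def conformal_radius(model, X_cal, Y_cal, alpha=0.1):
    """Split conformal over discretized function outputs.
    X_cal: (m, d_in) inputs; Y_cal: (m, n_grid) true functions on a fixed grid."""
    residuals = np.abs(model(X_cal) - Y_cal).max(axis=1)   # sup-norm score per sample
    m = len(residuals)
    # Finite-sample-valid quantile index for coverage >= 1 - alpha.
    idx = min(int(np.ceil((m + 1) * (1 - alpha))) - 1, m - 1)
    return np.sort(residuals)[idx]   # band half-width

# Usage sketch: q = conformal_radius(predict_fn, X_cal, Y_cal)
# band at a test input x: predict_fn(x) - q <= f_true(x) <= predict_fn(x) + q
```

The paper's contribution lies in how this radius behaves as the grid is refined (its discretization/calibration/misspecification decomposition), which the sketch above does not capture.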
[LG-44] Instance-Wise Adaptive Sampling for Dataset Construction in Approximating Inverse Problem Solutions
链接: https://arxiv.org/abs/2509.04583
作者: Jiequn Han,Kui Ren,Nathan Soedjak
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
*备注:
Abstract:We propose an instance-wise adaptive sampling framework for constructing compact and informative training datasets for supervised learning of inverse problem solutions. Typical learning-based approaches aim to learn a general-purpose inverse map from datasets drawn from a prior distribution, with the training process independent of the specific test instance. When the prior has a high intrinsic dimension or when high accuracy of the learned solution is required, a large number of training samples may be needed, resulting in substantial data collection costs. In contrast, our method dynamically allocates sampling effort based on the specific test instance, enabling significant gains in sample efficiency. By iteratively refining the training dataset conditioned on the latest prediction, the proposed strategy tailors the dataset to the geometry of the inverse map around each test instance. We demonstrate the effectiveness of our approach in the inverse scattering problem under two types of structured priors. Our results show that the advantage of the adaptive method becomes more pronounced in settings with more complex priors or higher accuracy requirements. While our experiments focus on a particular inverse problem, the adaptive sampling strategy is broadly applicable and readily extends to other inverse problems, offering a scalable and practical alternative to conventional fixed-dataset training regimes.
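The control flow of the adaptive strategy can be sketched as follows. All names (`forward_op`, `sample_prior`, `sample_near`, `fit`) are placeholders for the problem-specific pieces; this illustrates the iterative refinement loop, not the paper's algorithm verbatim:

```python
import numpy as np

def adaptive_inverse(y_obs, forward_op, sample_prior, sample_near, fit,
                     n_init=200, n_per_round=100, n_rounds=5):
    """Instance-wise adaptive dataset construction for one test instance.
    y_obs: observed data; we learn the inverse map y -> x locally around
    the current estimate instead of over the whole prior."""
    xs = [sample_prior() for _ in range(n_init)]      # initial global draws
    ys = [forward_op(x) for x in xs]
    model = fit(np.array(ys), np.array(xs))           # supervised inverse map
    for _ in range(n_rounds):
        x_hat = model(y_obs)                          # current prediction
        new_xs = [sample_near(x_hat) for _ in range(n_per_round)]
        xs += new_xs
        ys += [forward_op(x) for x in new_xs]         # extra forward solves
        model = fit(np.array(ys), np.array(xs))       # refit on enriched set
    return model(y_obs)
```

The sample-efficiency gain comes from concentrating the later rounds of (expensive) forward solves near the geometry of the inverse map around the specific test instance.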
[LG-45] Bootstrapping Task Spaces for Self-Improvement
链接: https://arxiv.org/abs/2509.04575
作者: Minqi Jiang,Andrei Lupu,Yoram Bachrach
类目: Machine Learning (cs.LG)
*备注:
Abstract:Progress in many task domains emerges from repeated revisions to previous solution attempts. Training agents that can reliably self-improve over such sequences at inference-time is a natural target for reinforcement learning (RL), yet the naive approach assumes a fixed maximum iteration depth, which can be both costly and arbitrary. We present Exploratory Iteration (ExIt), a family of autocurriculum RL methods that directly exploits the recurrent structure of self-improvement tasks to train LLMs to perform multi-step self-improvement at inference-time while only training on the most informative single-step iterations. ExIt grows a task space by selectively sampling the most informative intermediate, partial histories encountered during an episode for continued iteration, treating these starting points as new self-iteration task instances to train a self-improvement policy. ExIt can further pair with explicit exploration mechanisms to sustain greater task diversity. Across several domains, encompassing competition math, multi-turn tool-use, and machine learning engineering, we demonstrate that ExIt strategies, starting from either a single or many task instances, can produce policies exhibiting strong inference-time self-improvement on held-out task instances, and the ability to iterate towards higher performance over a step budget extending beyond the average iteration depth encountered during training.
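A hedged sketch of the autocurriculum loop described above; `informativeness` and `threshold` are placeholder choices (e.g., advantage magnitude), and the environment/update interfaces are our own stand-ins, not the paper's API:

```python
import random

def exit_loop(env, policy, rl_update, seed_tasks, informativeness,
              threshold, n_steps=10_000):
    """Sketch of Exploratory Iteration (ExIt): train on single-step
    self-improvement while growing the task buffer with informative
    intermediate states encountered along the way."""
    tasks = list(seed_tasks)
    for _ in range(n_steps):
        task = random.choice(tasks)
        attempt, reward = env.improve_once(task, policy)  # one revision step
        rl_update(policy, attempt, reward)                # standard RL update
        if informativeness(attempt, reward) > threshold:
            tasks.append(attempt)   # partial history becomes a new task instance
    return policy
```

At inference time the same policy is simply applied repeatedly, which is why training on informative single steps can still yield self-improvement chains deeper than anything seen during training.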
[LG-46] Finance-Grounded Optimization For Algorithmic Trading
链接: https://arxiv.org/abs/2509.04541
作者: Kasymkhan Khubiev,Mikhail Semenov,Irina Podlipnova
类目: Machine Learning (cs.LG); Statistical Finance (q-fin.ST)
*备注: 12 pages, 8 figures, 5 tables
Abstract:Deep Learning is evolving fast and integrates into various domains. Finance is a challenging field for deep learning, especially in the case of interpretable artificial intelligence (AI). Although classical approaches perform very well with natural language processing, computer vision, and forecasting, they are not perfect for the financial world, in which specialists use different metrics to evaluate model performance. We first introduce financially grounded loss functions derived from key quantitative finance metrics, including the Sharpe ratio, Profit-and-Loss (PnL), and Maximum Drawdown. Additionally, we propose turnover regularization, a method that inherently constrains the turnover of generated positions within predefined limits. Our findings demonstrate that the proposed loss functions, in conjunction with turnover regularization, outperform the traditional mean squared error loss for return prediction tasks when evaluated using algorithmic trading metrics. The study shows that financially grounded metrics enhance predictive performance in trading strategies and portfolio optimization.
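As an illustration of what a "financially grounded loss" looks like, here is a minimal PyTorch sketch of a differentiable negative-Sharpe objective with a turnover penalty. The weighting and the exact PnL/Drawdown formulations in the paper may differ; this is our own rendering:

```python
import torch

def sharpe_loss(positions, returns, eps=1e-8):
    """Negative Sharpe ratio of the strategy PnL; minimizing it maximizes Sharpe.
    positions, returns: (T,) tensors; pnl_t = position_t * return_t."""
    pnl = positions * returns
    return -pnl.mean() / (pnl.std() + eps)

def turnover_penalty(positions, max_turnover):
    """Hinge penalty on mean |position change| above a predefined limit."""
    turnover = positions.diff().abs().mean()
    return torch.relu(turnover - max_turnover)

# Illustrative combined objective (the weight 10.0 and limit 0.1 are assumptions):
# loss = sharpe_loss(pos, ret) + 10.0 * turnover_penalty(pos, max_turnover=0.1)
```

Because both terms are differentiable in the positions, the trading constraints shape the model directly during training rather than being enforced post hoc.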
[LG-47] Q-SafeML: Safety Assessment of Quantum Machine Learning via Quantum Distance Metrics
链接: https://arxiv.org/abs/2509.04536
作者: Oliver Dunn,Koorosh Aslansefat,Yiannis Papadopoulos
类目: Machine Learning (cs.LG); Quantum Algebra (math.QA); Statistics Theory (math.ST)
*备注:
Abstract:The rise of machine learning in safety-critical systems has paralleled advancements in quantum computing, leading to the emerging field of Quantum Machine Learning (QML). While safety monitoring has progressed in classical ML, existing methods are not directly applicable to QML due to fundamental differences in quantum computation. Given the novelty of QML, dedicated safety mechanisms remain underdeveloped. This paper introduces Q-SafeML, a safety monitoring approach for QML. The method builds on SafeML, a recent approach that uses statistical distance measures to assess model accuracy and provide confidence in the reasoning of an algorithm. Q-SafeML adapts this idea by incorporating quantum-centric distance measures, aligning with the probabilistic nature of QML outputs. This shift to a model-dependent, post-classification evaluation represents a key departure from classical SafeML, which is dataset-driven and classifier-agnostic. The distinction is motivated by the unique representational constraints of quantum systems, which require distance metrics defined over quantum state spaces. Q-SafeML measures distances between operational and training data, addressing concept drift in the context of QML. Experiments on QCNN and VQC models show that this enables informed human oversight, enhancing system transparency and safety.
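As one concrete (assumed) instance of a quantum-centric distance, the classical fidelity between measurement-outcome distributions of two circuit runs can serve as a drift signal; the paper's exact metrics may differ:

```python
import numpy as np

def fidelity_dissimilarity(p, q):
    """Dissimilarity in [0, 1] derived from the classical fidelity between two
    measurement-outcome distributions p, q over computational basis states.
    One plausible quantum-centric choice, not necessarily the paper's."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    fid = np.sum(np.sqrt(p * q)) ** 2        # classical fidelity (Bhattacharyya^2)
    return np.sqrt(max(0.0, 1.0 - fid))

# Drift check sketch: raise an alarm when the dissimilarity between the
# training-time and operational output distributions exceeds a calibrated threshold.
```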
[LG-48] Solving Robotics Tasks with Prior Demonstration via Exploration-Efficient Deep Reinforcement Learning
链接: https://arxiv.org/abs/2509.04069
作者: Chengyandan Shen,Christoffer Sloth
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:This paper proposes an exploration-efficient Deep Reinforcement Learning with Reference policy (DRLR) framework for learning robotics tasks that incorporates demonstrations. The DRLR framework is developed based on an algorithm called Imitation Bootstrapped Reinforcement Learning (IBRL). We propose to improve IBRL by modifying the action selection module. The proposed action selection module provides a calibrated Q-value, which mitigates the bootstrapping error that otherwise leads to inefficient exploration. Furthermore, to prevent the RL policy from converging to a sub-optimal policy, SAC is used as the RL policy instead of TD3. The effectiveness of our method in mitigating bootstrapping error and preventing overfitting is empirically validated by learning two robotics tasks: bucket loading and open drawer, which require extensive interactions with the environment. Simulation results also demonstrate the robustness of the DRLR framework across tasks with both low and high state-action dimensions, and varying demonstration qualities. To evaluate the developed framework on a real-world industrial robotics task, the bucket loading task is deployed on a real wheel loader. The sim2real results validate the successful deployment of the DRLR framework.
[LG-49] Beyond Linearity and Time-homogeneity: Relational Hyper Event Models with Time-Varying Non-Linear Effects
链接: https://arxiv.org/abs/2509.05289
作者: Martina Boschi,Jürgen Lerner,Ernst C. Wit
类目: Methodology (stat.ME); Machine Learning (cs.LG); Applications (stat.AP)
*备注:
Abstract:Recent technological advances have made it easier to collect large and complex networks of time-stamped relational events connecting two or more entities. Relational hyper-event models (RHEMs) aim to explain the dynamics of these events by modeling the event rate as a function of statistics based on past history and external information. However, despite the complexity of the data, most current RHEM approaches still rely on a linearity assumption to model this relationship. In this work, we address this limitation by introducing a more flexible model that allows the effects of statistics to vary non-linearly and over time. While time-varying and non-linear effects have been used in relational event modeling, we take this further by modeling joint time-varying and non-linear effects using tensor product smooths. We validate our methodology on both synthetic and empirical data. In particular, we use RHEMs to study how patterns of scientific collaboration and impact evolve over time. Our approach provides deeper insights into the dynamic factors driving relational hyper-events, allowing us to evaluate potential non-monotonic patterns that cannot be identified using linear models.
[LG-50] Probabilistic operator learning: generative modeling and uncertainty quantification for foundation models of differential equations
链接: https://arxiv.org/abs/2509.05186
作者: Benjamin J. Zhang,Siting Liu,Stanley J. Osher,Markos A. Katsoulakis
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: First two authors contributed equally
Abstract:In-context operator networks (ICON) are a class of operator learning methods based on the novel architectures of foundation models. Trained on a diverse set of datasets of initial and boundary conditions paired with corresponding solutions to ordinary and partial differential equations (ODEs and PDEs), ICON learns to map example condition-solution pairs of a given differential equation to an approximation of its solution operator. Here, we present a probabilistic framework that reveals ICON as implicitly performing Bayesian inference, where it computes the mean of the posterior predictive distribution over solution operators conditioned on the provided context, i.e., example condition-solution pairs. The formalism of random differential equations provides the probabilistic framework for describing the tasks ICON accomplishes while also providing a basis for understanding other multi-operator learning methods. This probabilistic perspective provides a basis for extending ICON to \emphgenerative settings, where one can sample from the posterior predictive distribution of solution operators. The generative formulation of ICON (GenICON) captures the underlying uncertainty in the solution operator, which enables principled uncertainty quantification in the solution predictions in operator learning.
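The Bayesian reading of ICON can be written compactly. With context C = {(c_i, s_i)} of condition-solution pairs and a new condition c*, the abstract says ICON outputs the posterior predictive mean over solution operators, while GenICON samples from the posterior predictive. Schematically, in our notation:

```latex
\text{ICON}(C, c^\ast) \;\approx\; \mathbb{E}\!\left[\,G(c^\ast) \mid C\,\right]
  \;=\; \int G(c^\ast)\, p(G \mid C)\, \mathrm{d}G,
\qquad
\text{GenICON:}\quad G \sim p(G \mid C).
```

Sampling G rather than averaging over it is what exposes the uncertainty in the solution operator that the mean prediction hides.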
[LG-51] Room-acoustic simulations as an alternative to measurements for audio-algorithm evaluation
链接: https://arxiv.org/abs/2509.05175
作者: Georg Götz,Daniel Gert Nielsen,Steinar Guðjónsson,Finnur Pind
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注:
Abstract:Audio-signal-processing and audio-machine-learning (ASP/AML) algorithms are ubiquitous in modern technology like smart devices, wearables, and entertainment systems. Development of such algorithms and models typically involves a formal evaluation to demonstrate their effectiveness and progress beyond the state-of-the-art. Ideally, a thorough evaluation should cover many diverse application scenarios and room-acoustic conditions. However, in practice, evaluation datasets are often limited in size and diversity because they rely on costly and time-consuming measurements. This paper explores how room-acoustic simulations can be used for evaluating ASP/AML algorithms. To this end, we evaluate three ASP/AML algorithms with room-acoustic measurements and data from different simulation engines, and assess the match between the evaluation results obtained from measurements and simulations. The presented investigation compares a numerical wave-based solver with two geometrical acoustics simulators. While numerical wave-based simulations yielded similar evaluation results as measurements for all three evaluated ASP/AML algorithms, geometrical acoustic simulations could not replicate the measured evaluation results as reliably.
[LG-52] Spectral Algorithms in Misspecified Regression: Convergence under Covariate Shift
链接: https://arxiv.org/abs/2509.05106
作者: Ren-Rui Liu,Zheng-Chu Guo
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 47 pages
Abstract:This paper investigates the convergence properties of spectral algorithms – a class of regularization methods originating from inverse problems – under covariate shift. In this setting, the marginal distributions of inputs differ between source and target domains, while the conditional distribution of outputs given inputs remains unchanged. To address this distributional mismatch, we incorporate importance weights, defined as the ratio of target to source densities, into the learning framework. This leads to a weighted spectral algorithm within a nonparametric regression setting in a reproducing kernel Hilbert space (RKHS). More importantly, in contrast to prior work that largely focuses on the well-specified setting, we provide a comprehensive theoretical analysis of the more challenging misspecified case, in which the target function does not belong to the RKHS. Under the assumption of uniformly bounded density ratios, we establish minimax-optimal convergence rates when the target function lies within the RKHS. For scenarios involving unbounded importance weights, we introduce a novel truncation technique that attains near-optimal convergence rates under mild regularity conditions, and we further extend these results to the misspecified regime. By addressing the intertwined challenges of covariate shift and model misspecification, this work extends classical kernel learning theory to more practical scenarios, providing a systematic framework for understanding their interaction.
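The weighted learning rule is easiest to see for the simplest spectral algorithm, kernel ridge regression: importance weights w_i = (target density)/(source density) enter the empirical risk. A minimal numpy sketch (our own derivation of the weighted normal equations; the paper treats general spectral filters):

```python
import numpy as np

def weighted_krr(K, y, w, lam):
    """Importance-weighted kernel ridge regression.
    Minimizes (1/n) * sum_i w_i (f(x_i) - y_i)^2 + lam * ||f||_H^2,
    whose coefficients solve (diag(w) K + n*lam*I) alpha = diag(w) y."""
    n = len(y)
    W = np.diag(w)
    alpha = np.linalg.solve(W @ K + n * lam * np.eye(n), W @ y)
    return alpha  # predict via f(x) = sum_i alpha_i k(x, x_i)

# The paper's truncation technique for unbounded ratios would amount to
# replacing w with np.minimum(w, tau) for a threshold tau before solving.
```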
[LG-53] Lightweight DNN for Full-Band Speech Denoising on Mobile Devices: Exploiting Long and Short Temporal Patterns
链接: https://arxiv.org/abs/2509.05079
作者: Konstantinos Drossos,Mikko Heikkinen,Paschalis Tsiaflakis
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)
*备注: Accepted for publication in Proceedings of the 2025 IEEE 27th International Workshop on Multimedia Signal Processing (MMSP)
Abstract:Speech denoising (SD) is an important task of many, if not all, modern signal processing chains used in devices and for everyday-life applications. While there are many published and powerful deep neural network (DNN)-based methods for SD, few are optimized for resource-constrained platforms such as mobile devices. Additionally, most DNN-based methods for SD do not focus on full-band (FB) signals, i.e., signals with a 48 kHz sampling rate, and/or low-latency cases. In this paper we present a causal, low-latency, and lightweight DNN-based method for full-band SD, leveraging both short and long temporal patterns. The method is based on a modified UNet architecture employing look-back frames, temporal spanning of convolutional kernels, and recurrent neural networks for exploiting short and long temporal patterns in the signal and estimated denoising mask. The DNN operates on a causal frame-by-frame basis taking as an input the STFT magnitude, utilizes inverted bottlenecks inspired by MobileNet, employs causal instance normalization for channel-wise normalization, and achieves a real-time factor below 0.02 when deployed on a modern mobile phone. The proposed method is evaluated using established speech denoising metrics and publicly available datasets, demonstrating its effectiveness in achieving an (SI-)SDR value that outperforms existing FB and low latency SD methods.
[LG-54] QCA-MolGAN: Quantum Circuit Associative Molecular GAN with Multi-Agent Reinforcement Learning
链接: https://arxiv.org/abs/2509.05051
作者: Aaron Mark Thomas,Yu-Cheng Chen,Hubert Okadome Valencia,Sharu Theresa Jose,Ronin Wu
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: Accepted to the proceedings of IEEE Quantum Artificial Intelligence, 6 pages, 3 figures
Abstract:Navigating the vast chemical space of molecular structures to design novel drug molecules with desired target properties remains a central challenge in drug discovery. Recent advances in generative models offer promising solutions. This work presents a novel quantum circuit Born machine (QCBM)-enabled Generative Adversarial Network (GAN), called QCA-MolGAN, for generating drug-like molecules. The QCBM serves as a learnable prior distribution, which is associatively trained to define a latent space aligning with high-level features captured by the GAN's discriminator. Additionally, we integrate a novel multi-agent reinforcement learning network to guide molecular generation toward desired target properties, optimising key metrics such as quantitative estimate of drug-likeness (QED), octanol-water partition coefficient (LogP) and synthetic accessibility (SA) scores in conjunction with one another. Experimental results demonstrate that our approach enhances the property alignment of generated molecules, with the multi-agent reinforcement learning agents effectively balancing chemical properties.
[LG-55] Dynamical Learning in Deep Asymmetric Recurrent Neural Networks
链接: https://arxiv.org/abs/2509.05041
作者: Davide Badalotti,Carlo Baldassi,Marc Mézard,Mattia Scardecchia,Riccardo Zecchina
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:
Abstract:We show that asymmetric deep recurrent neural networks, enhanced with additional sparse excitatory couplings, give rise to an exponentially large, dense accessible manifold of internal representations which can be found by different algorithms, including simple iterative dynamics. Building on the geometrical properties of the stable configurations, we propose a distributed learning scheme in which input-output associations emerge naturally from the recurrent dynamics, without any need of gradient evaluation. A critical feature enabling the learning process is the stability of the configurations reached at convergence, even after removal of the supervisory output signal. Extensive simulations demonstrate that this approach performs competitively on standard AI benchmarks. The model can be generalized in multiple directions, both computational and biological, potentially contributing to narrowing the gap between AI and computational neuroscience.
[LG-56] Optimal Variance and Covariance Estimation under Differential Privacy in the Add-Remove Model and Beyond
链接: https://arxiv.org/abs/2509.04919
作者: Shokichi Takakura,Seng Pei Liew,Satoshi Hasegawa
类目: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:
Abstract:In this paper, we study the problem of estimating the variance and covariance of datasets under differential privacy in the add-remove model. While estimation in the swap model has been extensively studied in the literature, the add-remove model remains less explored and more challenging, as the dataset size must also be kept private. To address this issue, we develop efficient mechanisms for variance and covariance estimation based on the Bézier mechanism, a novel moment-release framework that leverages Bernstein bases. We prove that our proposed mechanisms are minimax optimal in the high-privacy regime by establishing new minimax lower bounds. Moreover, beyond worst-case scenarios, we analyze instance-wise utility and show that the Bézier-based estimator consistently achieves better utility compared to alternative mechanisms. Finally, we demonstrate the effectiveness of the Bézier mechanism beyond variance and covariance estimation, showcasing its applicability to other statistical tasks.
[LG-57] RobQFL: Robust Quantum Federated Learning in Adversarial Environment
链接: https://arxiv.org/abs/2509.04914
作者: Walid El Maouaki,Nouhaila Innan,Alberto Marchisio,Taoufik Said,Muhammad Shafique,Mohamed Bennai
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
Abstract:Quantum Federated Learning (QFL) merges privacy-preserving federation with quantum computing gains, yet its resilience to adversarial noise is unknown. We first show that QFL is as fragile as centralized quantum learning. We propose Robust Quantum Federated Learning (RobQFL), embedding adversarial training directly into the federated loop. RobQFL exposes tunable axes: client coverage \gamma (0-100%), perturbation scheduling (fixed-\varepsilon vs. \varepsilon-mixes), and optimization (fine-tune vs. scratch), and distils the resulting \gamma \times \varepsilon surface into two metrics: Accuracy-Robustness Area and Robustness Volume. On 15-client simulations with MNIST and Fashion-MNIST under IID and non-IID conditions, adversarially training only 20-50% of clients boosts accuracy at \varepsilon \leq 0.1 by \sim 15 pp at a 2 pp clean-accuracy cost; fine-tuning adds 3-5 pp. With \geq 75% coverage, a moderate \varepsilon-mix is optimal, while high-\varepsilon schedules help only at 100% coverage. Label-sorted non-IID splits halve robustness, underscoring data heterogeneity as a dominant risk.
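A plausible implementation of the two summary metrics, as simple integrals over the swept grids (trapezoidal rule; the paper's exact definitions and normalization may differ):

```python
import numpy as np

def accuracy_robustness_area(eps_grid, acc):
    """Area under the accuracy-vs-epsilon curve for one coverage level gamma."""
    return np.trapz(acc, eps_grid)

def robustness_volume(gamma_grid, eps_grid, acc_surface):
    """Volume under the accuracy surface over the coverage x perturbation grid.
    acc_surface[i, j] = accuracy at coverage gamma_grid[i], budget eps_grid[j]."""
    areas = np.array([np.trapz(acc_surface[i], eps_grid)
                      for i in range(len(gamma_grid))])
    return np.trapz(areas, gamma_grid)
```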
[LG-58] Any-Step Density Ratio Estimation via Interval-Annealed Secant Alignment
链接: https://arxiv.org/abs/2509.04852
作者: Wei Chen,Shigui Li,Jiacheng Li,Jian Xu,Zhiqi Lin,Junmei Yang,Delu Zeng,John Paisley,Qibin Zhao
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Estimating density ratios is a fundamental problem in machine learning, but existing methods often trade off accuracy for efficiency. We propose Interval-annealed Secant Alignment Density Ratio Estimation (ISA-DRE), a framework that enables accurate, any-step estimation without numerical integration. Instead of modeling infinitesimal tangents as in prior methods, ISA-DRE learns a global secant function, defined as the expectation of all tangents over an interval, with provably lower variance, making it more suitable for neural approximation. This is made possible by the Secant Alignment Identity, a self-consistency condition that formally connects the secant with its underlying tangent representations. To mitigate instability during early training, we introduce Contraction Interval Annealing, a curriculum strategy that gradually expands the alignment interval during training. This process induces a contraction mapping, which improves convergence and training stability. Empirically, ISA-DRE achieves competitive accuracy with significantly fewer function evaluations compared to prior methods, resulting in much faster inference and making it well suited for real-time and interactive applications.
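The central object can be stated compactly. If r_t(x) denotes a time-indexed log-ratio path with tangent \partial_t r_t(x), the secant over an interval [a, b] is exactly the average of the tangents, which is what "the expectation of all tangents over an interval" says. In our notation (the paper's formulation may differ in details):

```latex
s_{[a,b]}(x) \;=\; \frac{r_b(x) - r_a(x)}{b - a}
           \;=\; \frac{1}{b-a}\int_a^b \partial_t r_t(x)\,\mathrm{d}t
           \;=\; \mathbb{E}_{t \sim \mathrm{Unif}[a,b]}\!\big[\partial_t r_t(x)\big].
```

Learning the secant directly means a single network evaluation spans a whole interval, which is why inference needs no numerical integration over many tangent steps.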
[LG-59] An Interactive Tool for Analyzing High-Dimensional Clusterings
链接: https://arxiv.org/abs/2509.04603
作者: Justin Lin,Julia Fukuyama
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注: 34 pages, 12 figures
Abstract:Technological advances have spurred an increase in data complexity and dimensionality. We are now in an era in which data sets containing thousands of features are commonplace. To digest and analyze such high-dimensional data, dimension reduction techniques have been developed and advanced along with computational power. Of these techniques, nonlinear methods are most commonly employed because of their ability to construct visually interpretable embeddings. Unlike linear methods, these methods non-uniformly stretch and shrink space to create a visual impression of the high-dimensional data. Since capturing high-dimensional structures in a significantly lower number of dimensions requires drastic manipulation of space, nonlinear dimension reduction methods are known to occasionally produce false structures, especially in noisy settings. In an effort to deal with this phenomenon, we developed an interactive tool that enables analysts to better understand and diagnose their dimension reduction results. It uses various analytical plots to provide a multi-faceted perspective on the results, helping analysts judge whether apparent structures are legitimate. The tool is available via an R package named DRtool.
[LG-60] Provably data-driven projection method for quadratic programming
链接: https://arxiv.org/abs/2509.04524
作者: Anh Tuan Nguyen,Viet Anh Nguyen
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 25 pages
Abstract:Projection methods aim to reduce the dimensionality of the optimization instance, thereby improving the scalability of high-dimensional problems. Recently, Sakaue and Oki proposed a data-driven approach for linear programs (LPs), where the projection matrix is learned from observed problem instances drawn from an application-specific distribution of problems. We analyze the generalization guarantee for the data-driven projection matrix learning for convex quadratic programs (QPs). Unlike in LPs, the optimal solutions of convex QPs are not confined to the vertices of the feasible polyhedron, and this complicates the analysis of the optimal value function. To overcome this challenge, we demonstrate that the solutions of convex QPs can be localized within a feasible region corresponding to a special active set, utilizing Caratheodory’s theorem. Building on such observation, we propose the unrolled active set method, which models the computation of the optimal value as a Goldberg-Jerrum (GJ) algorithm with bounded complexities, thereby establishing learning guarantees. We then further extend our analysis to other settings, including learning to match the optimal solution and input-aware setting, where we learn a mapping from QP problem instances to projection matrices.
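To make "projection method" concrete: given a learned matrix P in R^{d x k} with k << d, each high-dimensional QP is replaced by a k-dimensional QP in y with x = P y. A hedged sketch using cvxpy (learning P itself is the paper's contribution and is not shown; Q is assumed PSD):

```python
import numpy as np
import cvxpy as cp

def solve_projected_qp(Q, c, A, b, P):
    """Solve min 0.5 x'Qx + c'x  s.t. Ax <= b, restricted to x = P y.
    Q: (d, d) PSD matrix; P: (d, k) learned projection with k << d."""
    k = P.shape[1]
    y = cp.Variable(k)
    Qr = P.T @ Q @ P                  # reduced (k, k) quadratic term
    Qr = 0.5 * (Qr + Qr.T)            # symmetrize for numerical safety
    objective = 0.5 * cp.quad_form(y, Qr) + (P.T @ c) @ y
    prob = cp.Problem(cp.Minimize(objective), [A @ P @ y <= b])
    prob.solve()
    return P @ y.value                # lift the solution back to R^d
```

The generalization question the paper studies is how well the optimal value of this reduced problem tracks the full problem's optimum across instances drawn from the application distribution.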
[LG-61] Universal Representation of Generalized Convex Functions and their Gradients
链接: https://arxiv.org/abs/2509.04477
作者: Moeen Nehzati
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:Solutions to a wide range of optimization problems, from optimal transport theory to mathematical economics, often take the form of generalized convex functions (GCFs). This characterization can be used to convert nested bilevel optimization problems into single-level optimization problems. Despite this, the characterization has not been fully exploited in numerical optimization. When the solution to an optimization problem is known to belong to a particular class of objects, this information can be leveraged by parameterizing that class of objects and optimizing over this parameterization. The hallmark of a good parameterization is the Universal Approximation Property (UAP): that is, the parameterization approximates any object in the class arbitrarily well. For example, neural networks satisfy the UAP with respect to the class of continuous functions. Building on the literature concerned with the parameterization of convex functions, we extend these ideas to GCFs. We present a convex and potentially one-to-one parameterization of GCFs and their gradients that satisfies the UAP. We also compare this class to shallow neural networks and highlight their shared characteristics. The ideas pursued here have been implemented in the Python package \hrefthis https URL\textttgconvex, available online. Using it, we tackle the problem of finding the revenue-maximizing auction for multiple goods and demonstrate how our parameterization can effectively solve this problem. Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG) MSC classes: 91-08, 91-10, 91B68, 62P20, 90C26, 90C30, 65D40, 65K10, 49J52, 41A30 ACMclasses: G.1.2; G.1.6; G.1.10; I.5.1 Cite as: arXiv:2509.04477 [math.OC] (or arXiv:2509.04477v1 [math.OC] for this version) https://doi.org/10.48550/arXiv.2509.04477 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Moeen Nehzati [view email] [v1] Sat, 30 Aug 2025 15:21:33 UTC (182 KB)
信息检索
[IR-0] Hybrid Matrix Factorization Based Graph Contrastive Learning for Recommendation System
链接: https://arxiv.org/abs/2509.05115
作者: Hao Chen,Wenming Ma,Zihao Chu,Mingqi Li
类目: Information Retrieval (cs.IR)
*备注:
Abstract:In recent years, methods that combine contrastive learning with graph neural networks have emerged to address the challenges of recommendation systems, demonstrating powerful performance and playing a significant role in this domain. Contrastive learning primarily tackles the issue of data sparsity by employing data augmentation strategies, effectively alleviating this problem and showing promising results. Although existing research has achieved favorable outcomes, most current graph contrastive learning methods are based on two types of data augmentation strategies: the first involves perturbing the graph structure, such as by randomly adding or removing edges; and the second applies clustering techniques. We believe that the interactive information obtained through these two strategies does not fully capture the user-item interactions. In this paper, we propose a novel method called HMFGCL (Hybrid Matrix Factorization Based Graph Contrastive Learning), which integrates two distinct matrix factorization techniques, low-rank matrix factorization (MF) and singular value decomposition (SVD), to complementarily acquire global collaborative information, thereby constructing enhanced views. Experimental results on multiple public datasets demonstrate that our model outperforms existing baselines, particularly on small-scale datasets.
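The view-construction idea, two complementary global factorizations of the interaction matrix, can be sketched in a few lines. This is our illustrative reading (truncated SVD plus a gradient-descent low-rank MF), not the authors' exact pipeline:

```python
import numpy as np

def svd_view(R, rank):
    """View 1: truncated-SVD reconstruction of the user-item matrix R."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

def mf_view(R, rank, steps=200, lr=0.01, reg=0.1, seed=0):
    """View 2: low-rank MF fitted by gradient descent on observed entries."""
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(R.shape[0], rank))
    Q = rng.normal(scale=0.1, size=(R.shape[1], rank))
    mask = (R != 0).astype(float)                 # observed-entry indicator
    for _ in range(steps):
        E = mask * (R - P @ Q.T)                  # masked reconstruction error
        P += lr * (E @ Q - reg * P)
        Q += lr * (E.T @ P - reg * Q)
    return P @ Q.T

# The two reconstructions serve as augmented graph views; a contrastive loss
# (e.g., InfoNCE) then pulls each user's two view embeddings together.
```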
[IR-1] Fishing for Answers: Exploring One-shot vs. Iterative Retrieval Strategies for Retrieval Augmented Generation EMNLP2025
链接: https://arxiv.org/abs/2509.04820
作者: Huifeng Lin,Gang Su,Jintao Liang,You Wu,Rui Zhao,Ziyue Li
类目: Information Retrieval (cs.IR)
*备注: under Review of EMNLP 2025
Abstract:Retrieval-Augmented Generation (RAG) based on Large Language Models (LLMs) is a powerful solution to understand and query the industry's closed-source documents. However, basic RAG often struggles with complex QA tasks in legal and regulatory domains, particularly when dealing with numerous government documents. The top-k strategy frequently misses golden chunks, leading to incomplete or inaccurate answers. To address these retrieval bottlenecks, we explore two strategies to improve evidence coverage and answer quality. The first is a One-SHOT retrieval method that adaptively selects chunks based on a token budget, allowing as much relevant content as possible to be included within the model's context window. Additionally, we design modules to further filter and refine the chunks. The second is an iterative retrieval strategy built on a Reasoning Agentic RAG framework, where a reasoning LLM dynamically issues search queries, evaluates retrieved results, and progressively refines the context over multiple turns. We identify query drift and retrieval laziness issues and further design two modules to tackle them. Through extensive experiments on a dataset of government documents, we aim to offer practical insights and guidance for real-world applications in legal and regulatory domains.
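The One-SHOT idea replaces a fixed top-k with budget-aware packing: keep adding the highest-scoring chunks while they fit in the context window. A minimal sketch (the relevance `scores` and the tokenizer `count_tokens` are assumed inputs; the paper's additional filtering/refining modules are not shown):

```python
def one_shot_select(chunks, scores, count_tokens, token_budget):
    """Greedy token-budget packing: take chunks in descending relevance
    order while the total token count stays within the model's window."""
    order = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    selected, used = [], 0
    for i in order:
        cost = count_tokens(chunks[i])
        if used + cost <= token_budget:
            selected.append(i)
            used += cost
    return [chunks[i] for i in sorted(selected)]  # restore document order
```

Unlike top-k, the number of chunks returned adapts to their lengths, so short but relevant passages are not crowded out by an arbitrary cutoff.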