本篇博文主要内容为 2025-09-26 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。
说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。
友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。
目录
概览 (2025-09-26)
今日共更新559篇论文,其中:
- 自然语言处理共99篇(Computation and Language (cs.CL))
- 人工智能共190篇(Artificial Intelligence (cs.AI))
- 计算机视觉共122篇(Computer Vision and Pattern Recognition (cs.CV))
- 机器学习共185篇(Machine Learning (cs.LG))
自然语言处理
[NLP-0] SciReason er: Laying the Scientific Reasoning Ground Across Disciplines
链接: https://arxiv.org/abs/2509.21320
作者: Yizhou Wang,Chen Tang,Han Deng,Jiabei Xiao,Jiaqi Liu,Jianyu Wu,Jun Yao,Pengze Li,Encheng Su,Lintao Wang,Guohang Zhuang,Yuchen Ren,Ben Fei,Ming Hu,Xin Chen,Dongzhan Zhou,Junjun He,Xiangyu Yue,Zhenfei Yin,Jiamin Wu,Qihao Zheng,Yuhao Zhou,Huihui Xu,Chenglong Ma,Yan Lu,Wenlong Zhang,Chunfeng Song,Philip Torr,Shixiang Tang,Xinzhu Ma,Wanli Ouyang,Lei Bai
机构: 未知
类目: Computation and Language (cs.CL)
备注: technical report
[NLP-1] RLBFF: Binary Flexible Feedback to bridge between Human Feedback Verifiable Rewards
【速读】: 该论文旨在解决当前大语言模型(Large Language Model, LLM)后训练中奖励建模的两大局限性:一是基于人类反馈的强化学习(Reinforcement Learning with Human Feedback, RLHF)因依赖主观且缺乏明确标准的人类判断,导致可解释性差和奖励黑客(reward hacking)问题;二是基于可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)受限于仅关注正确性验证,难以捕捉响应质量的多维特性。解决方案的关键在于提出一种新的框架——基于二元灵活反馈的强化学习(Reinforcement Learning with Binary Flexible Feedback, RLBFF),其核心是将自然语言反馈中的原则性信息提取为二元可判定的规则(如“信息准确性:是/否”或“代码可读性:否”),并将其用于构建以蕴含任务(entailment task)形式训练的奖励模型(Reward Model)。该方法实现了人类偏好灵活性与规则验证精确性的结合,使奖励模型能够捕捉超越单纯正确性的复杂质量维度,并支持推理时动态指定关注原则,从而在RM-Bench(86.2%)和JudgeBench(81.4%,排名第一)等基准上显著优于现有方法,同时提供开源实现以低成本实现高性能对齐。
链接: https://arxiv.org/abs/2509.21319
作者: Zhilin Wang,Jiaqi Zeng,Olivier Delalleau,Ellie Evans,Daniel Egert,Hoo-Chang Shin,Felipe Soares,Yi Dong,Oleksii Kuchaiev
机构: NVIDIA(英伟达)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM post-training, each offering distinct advantages. However, RLHF struggles with interpretability and reward hacking because it relies on human judgments that usually lack explicit criteria, whereas RLVR is limited in scope by its focus on correctness-based verifiers. We propose Reinforcement Learning with Binary Flexible Feedback (RLBFF), which combines the versatility of human-driven preferences with the precision of rule-based verification, enabling reward models to capture nuanced aspects of response quality beyond mere correctness. RLBFF extracts principles that can be answered in a binary fashion (e.g. accuracy of information: yes, or code readability: no) from natural language feedback. Such principles can then be used to ground Reward Model training as an entailment task (response satisfies or does not satisfy an arbitrary principle). We show that Reward Models trained in this manner can outperform Bradley-Terry models when matched for data and achieve top performance on RM-Bench (86.2%) and JudgeBench (81.4%, #1 on leaderboard as of September 24, 2025). Additionally, users can specify principles of interest at inference time to customize the focus of our reward models, in contrast to Bradley-Terry models. Finally, we present a fully open source recipe (including data) to align Qwen3-32B using RLBFF and our Reward Model, to match or exceed the performance of o3-mini and DeepSeek R1 on general alignment benchmarks of MT-Bench, WildBench, and Arena Hard v2 (at 5% of the inference cost).
zh
[NLP-2] Interactive Recommendation Agent with Active User Commands
【速读】: 该论文旨在解决传统推荐系统依赖粗粒度被动反馈(如点赞或不喜欢)所导致的用户意图表达不充分、偏好建模不准确的问题,从而造成用户意图与系统理解之间的持续偏差,影响用户体验和推荐效果。其解决方案的关键在于提出交互式推荐流(Interactive Recommendation Feed, IRF)范式,通过自然语言命令实现用户对推荐策略的主动显式控制,并开发RecBot双代理架构:其中解析代理(Parser Agent)将自然语言转化为结构化偏好,规划代理(Planner Agent)动态调度自适应工具链以实时调整推荐策略;同时采用仿真增强的知识蒸馏方法,在保障推理能力的前提下提升部署效率。
链接: https://arxiv.org/abs/2509.21317
作者: Jiakai Tang,Yujie Luo,Xunke Xi,Fei Sun,Xueyang Feng,Sunhao Dai,Chao Yi,Dian Chen,Zhujin Gao,Yang Li,Xu Chen,Wen Chen,Jian Wu,Yuning Jiang,Bo Zheng
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院); University of Chinese Academy of Sciences (中国科学院大学); Alibaba Group (阿里巴巴集团)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: Under Review
Abstract:Traditional recommender systems rely on passive feedback mechanisms that limit users to simple choices such as like and dislike. However, these coarse-grained signals fail to capture users’ nuanced behavior motivations and intentions. In turn, current systems cannot also distinguish which specific item attributes drive user satisfaction or dissatisfaction, resulting in inaccurate preference modeling. These fundamental limitations create a persistent gap between user intentions and system interpretations, ultimately undermining user satisfaction and harming system effectiveness. To address these limitations, we introduce the Interactive Recommendation Feed (IRF), a pioneering paradigm that enables natural language commands within mainstream recommendation feeds. Unlike traditional systems that confine users to passive implicit behavioral influence, IRF empowers active explicit control over recommendation policies through real-time linguistic commands. To support this paradigm, we develop RecBot, a dual-agent architecture where a Parser Agent transforms linguistic expressions into structured preferences and a Planner Agent dynamically orchestrates adaptive tool chains for on-the-fly policy adjustment. To enable practical deployment, we employ simulation-augmented knowledge distillation to achieve efficient performance while maintaining strong reasoning capabilities. Through extensive offline and long-term online experiments, RecBot shows significant improvements in both user satisfaction and business outcomes. Comments: Under Review Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC) Cite as: arXiv:2509.21317 [cs.IR] (or arXiv:2509.21317v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2509.21317 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-3] Sycophancy Is Not One Thing: Causal Separation of Sycophantic Behaviors in LLM s
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中“谄媚行为”(sycophancy)的机制问题,即这种行为是否由单一机制驱动,还是由多个独立过程构成。研究通过将谄媚行为分解为“谄媚性同意”(sycophantic agreement)和“谄媚性赞美”(sycophantic praise),并将其与真实同意(genuine agreement)对比,利用均值差异方向、激活添加和子空间几何等方法,在多个模型和数据集上发现:三种行为在潜在空间中沿不同的线性方向编码;每种行为可独立放大或抑制而不影响其他行为;且其表征结构在不同模型家族和规模下保持一致。解决方案的关键在于识别出这些行为具有独立且可操控的表示结构,从而证明谄媚行为是由多个可分离的神经表征驱动的,而非单一机制。
链接: https://arxiv.org/abs/2509.21305
作者: Daniel Vennemeyer,Phan Anh Duong,Tiffany Zhan,Tianyu Jiang
机构: University of Cincinnati (辛辛那提大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) often exhibit sycophantic behaviors – such as excessive agreement with or flattery of the user – but it is unclear whether these behaviors arise from a single mechanism or multiple distinct processes. We decompose sycophancy into sycophantic agreement and sycophantic praise, contrasting both with genuine agreement. Using difference-in-means directions, activation additions, and subspace geometry across multiple models and datasets, we show that: (1) the three behaviors are encoded along distinct linear directions in latent space; (2) each behavior can be independently amplified or suppressed without affecting the others; and (3) their representational structure is consistent across model families and scales. These results suggest that sycophantic behaviors correspond to distinct, independently steerable representations.
zh
[NLP-4] he role of synthetic data in Multilingual Multi-cultural AI systems: Lessons from Indic Languages
【速读】: 该论文旨在解决多语言、跨文化场景下人工智能系统在低资源语种中表现不佳的问题,特别是如何通过高质量合成数据提升模型对印度本土语言及文化的适应能力。其解决方案的关键在于提出一种自下而上(bottom-up)的合成数据生成策略,利用参数规模达235B的大规模开源语言模型(Large Language Models, LLMs),基于各语言特有的维基百科内容进行文化情境化数据生成,从而构建出名为Updesh的高质量大规模指令跟随数据集(含9.5M条目,覆盖13种印度语言)。该方法突破了传统自上而下(top-down)翻译高资源语言合成数据的局限,强调语言特定的文化语境嵌入,显著提升了低资源和中资源语言模型在生成任务上的性能,并缩小了与高资源语言之间的差距。
链接: https://arxiv.org/abs/2509.21294
作者: Pranjal A. Chitale,Varun Gumma,Sanchit Ahuja,Prashant Kodali,Manan Uppadhyay,Deepthi Sudharsan,Sunayana Sitaram
机构: Microsoft Corporation (微软); Nanyang Technological University (南洋理工大学); Northeastern University (东北大学); Independent Researcher
类目: Computation and Language (cs.CL)
备注: Under Review
Abstract:Developing AI systems that operate effectively across languages while remaining culturally grounded is a long-standing challenge, particularly in low-resource settings. Synthetic data provides a promising avenue, yet its effectiveness in multilingual and multicultural contexts remains underexplored. We investigate the creation and impact of synthetic, culturally contextualized datasets for Indian languages through a bottom-up generation strategy that prompts large open-source LLMs (= 235B parameters) to ground data generation in language-specific Wikipedia content. This approach complements the dominant top-down paradigm of translating synthetic datasets from high-resource languages such as English. We introduce Updesh, a high-quality large-scale synthetic instruction-following dataset comprising 9.5M data points across 13 Indian languages, encompassing diverse reasoning and generative tasks with an emphasis on long-context, multi-turn capabilities, and alignment with Indian cultural contexts. A comprehensive evaluation incorporating both automated metrics and human annotation across 10k assessments indicates that generated data is high quality; though, human evaluation highlights areas for further improvement. Additionally, we perform downstream evaluations by fine-tuning models on our dataset and assessing the performance across 15 diverse multilingual datasets. Models trained on Updesh consistently achieve significant gains on generative tasks and remain competitive on multiple-choice style NLU tasks. Notably, relative improvements are most pronounced in low and medium-resource languages, narrowing their gap with high-resource languages. These findings provide empirical evidence that effective multilingual AI requires multi-faceted data curation and generation strategies that incorporate context-aware, culturally grounded methodologies.
zh
[NLP-5] DisCoCLIP: A Distributional Compositional Tensor Network Encoder for Vision-Language Understanding
【速读】: 该论文旨在解决当前视觉-语言模型(Vision-Language Models)在处理依赖词序和谓词-论元结构的任务时表现不佳的问题,这些问题通常源于模型对语言组合性结构的忽视。解决方案的关键在于提出DisCoCLIP,其核心创新是引入一种基于张量网络(Tensor Network)的文本编码器,该编码器能显式建模句子的句法结构:通过组合范畴语法(Combinatory Categorial Grammar, CCG)解析器生成分布式的词张量,并通过张量收缩模拟句子的语法推导过程;同时,利用张量分解技术对高阶张量进行压缩,使参数量从千万级降至百万级以下,从而在保持高效性的同时显著提升模型对动词语义和词序的敏感度。
链接: https://arxiv.org/abs/2509.21287
作者: Kin Ian Lo,Hala Hawashin,Mina Abbaszadeh,Tilen Limback-Stokin,Hadi Wazni,Mehrnoosh Sadrzadeh
机构: University College London (伦敦大学学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent vision-language models excel at large-scale image-text alignment but often neglect the compositional structure of language, leading to failures on tasks that hinge on word order and predicate-argument structure. We introduce DisCoCLIP, a multimodal encoder that combines a frozen CLIP vision transformer with a novel tensor network text encoder that explicitly encodes syntactic structure. Sentences are parsed with a Combinatory Categorial Grammar parser to yield distributional word tensors whose contractions mirror the sentence’s grammatical derivation. To keep the model efficient, high-order tensors are factorized with tensor decompositions, reducing parameter count from tens of millions to under one million. Trained end-to-end with a self-supervised contrastive loss, DisCoCLIP markedly improves sensitivity to verb semantics and word order: it raises CLIP’s SVO-Probes verb accuracy from 77.6% to 82.4%, boosts ARO attribution and relation scores by over 9% and 4%, and achieves 93.7% on a newly introduced SVO-Swap benchmark. These results demonstrate that embedding explicit linguistic structure via tensor networks yields interpretable, parameter-efficient representations that substantially improve compositional reasoning in vision-language tasks.
zh
[NLP-6] Bounds of Chain-of-Thought Robustness: Reasoning Steps Embed Norms and Beyond
【速读】: 该论文旨在解决输入扰动对链式思维(Chain-of-Thought, CoT)输出波动的影响机制不明确的问题,这一理论空白限制了对扰动在推理过程中传播路径的理解,并阻碍了提示优化方法的进一步提升。其解决方案的关键在于从理论上推导出在输出波动可控条件下输入扰动的上界,并证明该上界与CoT中的推理步骤数呈正相关关系,且即使推理过程无限延长也无法消除输入扰动的影响;此外,针对线性自注意力(Linear Self-Attention, LSA)模型进一步揭示该上界与输入嵌入和隐藏状态向量范数呈负相关关系,从而为理解扰动传播提供了可量化、可验证的理论框架。
链接: https://arxiv.org/abs/2509.21284
作者: Dingzirui Wang,Xuanliang Zhang,Keyan Xu,Qingfu Zhu,Wanxiang Che,Yang Deng
机构: Harbin Institute of Technology (哈尔滨工业大学); Singapore Management University (新加坡管理大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Existing research indicates that the output of Chain-of-Thought (CoT) is significantly affected by input perturbations. Although many methods aim to mitigate such impact by optimizing prompts, a theoretical explanation of how these perturbations influence CoT outputs remains an open area of research. This gap limits our in-depth understanding of how input perturbations propagate during the reasoning process and hinders further improvements in prompt optimization methods. Therefore, in this paper, we theoretically analyze the effect of input perturbations on the fluctuation of CoT outputs. We first derive an upper bound for input perturbations under the condition that the output fluctuation is within an acceptable range, based on which we prove that: (i) This upper bound is positively correlated with the number of reasoning steps in the CoT; (ii) Even an infinitely long reasoning process cannot eliminate the impact of input perturbations. We then apply these conclusions to the Linear Self-Attention (LSA) model, which can be viewed as a simplified version of the Transformer. For the LSA model, we prove that the upper bound for input perturbation is negatively correlated with the norms of the input embedding and hidden state vectors. To validate this theoretical analysis, we conduct experiments on three mainstream datasets and four mainstream models. The experimental results align with our theoretical analysis, empirically demonstrating the correctness of our findings.
zh
[NLP-7] LLM Trace: A Corpus for Classification and Fine-Grained Localization of AI-Written Text
【速读】: 该论文旨在解决当前AI生成文本检测系统面临的三大核心问题:训练数据稀缺且过时、语言覆盖单一(主要为英语)、以及缺乏对混合人类-人工智能写作场景中AI段落精确定位的能力。解决方案的关键在于提出LLMTrace,一个大规模、双语(英文与俄文)的新型语料库,其构建基于多种现代商用和开源大语言模型(Large Language Models, LLMs),并提供字符级标注,从而支持两类任务:传统的全文二分类(人类 vs. AI)以及创新性的AI生成区间检测任务,后者依赖于精确到字符级别的标注实现对AI生成片段的定位。
链接: https://arxiv.org/abs/2509.21269
作者: Irina Tolstykh,Aleksandra Tsybina,Sergey Yakubson,Maksim Kuprashevich
机构: SALUTEDEV LLC(萨卢特德夫有限责任公司)
类目: Computation and Language (cs.CL)
备注:
Abstract:The widespread use of human-like text from Large Language Models (LLMs) necessitates the development of robust detection systems. However, progress is limited by a critical lack of suitable training data; existing datasets are often generated with outdated models, are predominantly in English, and fail to address the increasingly common scenario of mixed human-AI authorship. Crucially, while some datasets address mixed authorship, none provide the character-level annotations required for the precise localization of AI-generated segments within a text. To address these gaps, we introduce LLMTrace, a new large-scale, bilingual (English and Russian) corpus for AI-generated text detection. Constructed using a diverse range of modern proprietary and open-source LLMs, our dataset is designed to support two key tasks: traditional full-text binary classification (human vs. AI) and the novel task of AI-generated interval detection, facilitated by character-level annotations. We believe LLMTrace will serve as a vital resource for training and evaluating the next generation of more nuanced and practical AI detection models. The project page is available at \hrefthis https URLiitolstykh/LLMTrace.
zh
[NLP-8] LLM Output Homogenization is Task Dependent
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在不同任务中因输出响应同质化(output response homogenization)而导致的性能下降问题,尤其是现有研究未能根据任务特性对多样性进行差异化定义与评估。其核心解决方案在于提出一种任务依赖的多样性建模框架,关键包括:(1) 构建包含八类任务的分类体系,明确每类任务对同质化的不同容忍度;(2) 引入“任务锚定的功能多样性”(task-anchored functional diversity)以更精准地衡量输出多样性;(3) 设计任务锚定采样策略,在不必要同质化的任务中提升功能性多样性,而在需要一致性的任务中保持原有稳定性;(4) 通过实证挑战“多样性-质量权衡”假设,证明可在不牺牲响应质量的前提下增强功能性多样性。该方法显著提升了LLM在多样化任务场景下的适应性与实用性。
链接: https://arxiv.org/abs/2509.21267
作者: Shomik Jain,Jack Lanchantin,Maximilian Nickel,Karen Ullrich,Ashia Wilson,Jamelle Watson-Daniels
机构: Meta(元)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:A large language model can be less helpful if it exhibits output response homogenization. But whether two responses are considered homogeneous, and whether such homogenization is problematic, both depend on the task category. For instance, in objective math tasks, we often expect no variation in the final answer but anticipate variation in the problem-solving strategy. Whereas, for creative writing tasks, we may expect variation in key narrative components (e.g. plot, genre, setting, etc), beyond the vocabulary or embedding diversity produced by temperature-sampling. Previous work addressing output homogenization often fails to conceptualize diversity in a task-dependent way. We address this gap in the literature directly by making the following contributions. (1) We present a task taxonomy comprised of eight task categories that each have distinct conceptualizations of output homogenization. (2) We introduce task-anchored functional diversity to better evaluate output homogenization. (3) We propose a task-anchored sampling technique that increases functional diversity for task categories where homogenization is undesired, while preserving homogenization where it is desired. (4) We challenge the perceived existence of a diversity-quality trade-off by increasing functional diversity while maintaining response quality. Overall, we demonstrate how task dependence improves the evaluation and mitigation of output homogenization.
zh
[NLP-9] Un-Doubling Diffusion: LLM -guided Disambiguation of Homonym Duplication
【速读】: 该论文旨在解决生成式扩散模型在文本到图像生成过程中出现的同义词重复(homonym duplication)问题,即当输入提示中包含同形异义词(homonyms)时,模型可能同时生成该词多个语义对应的图像,导致输出不一致。此外,论文还指出由于英语中心主义偏倚(Anglocentric bias),原语言中非同义词在翻译成英文后可能被误判为同义词,从而加剧重复问题。解决方案的关键在于引入一种量化重复率的评估方法,并通过提示扩展(prompt expansion)策略有效缓解上述两类重复现象,实验证明该方法在自动评估(基于视觉-语言模型 VLM)和人工评估中均显著降低重复率。
链接: https://arxiv.org/abs/2509.21262
作者: Evgeny Kaskov,Elizaveta Petrova,Petr Surovtsev,Anna Kostikova,Ilya Mistiurin,Alexander Kapitanov,Alexander Nagaev
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Homonyms are words with identical spelling but distinct meanings, which pose challenges for many generative models. When a homonym appears in a prompt, diffusion models may generate multiple senses of the word simultaneously, which is known as homonym duplication. This issue is further complicated by an Anglocentric bias, which includes an additional translation step before the text-to-image model pipeline. As a result, even words that are not homonymous in the original language may become homonyms and lose their meaning after translation into English. In this paper, we introduce a method for measuring duplication rates and conduct evaluations of different diffusion models using both automatic evaluation utilizing Vision-Language Models (VLM) and human evaluation. Additionally, we investigate methods to mitigate the homonym duplication problem through prompt expansion, demonstrating that this approach also effectively reduces duplication related to Anglocentric bias. The code for the automatic evaluation pipeline is publicly available.
zh
[NLP-10] Hallucination as an Upper Bound: A New Perspective on Text-to-Image Evaluation NEURIPS2025
【速读】: 该论文试图解决文本到图像(text-to-image, T2I)生成模型中“幻觉”现象缺乏明确定义与系统评估的问题。现有评价方法仅关注提示词指定元素的对齐情况,忽视了模型在提示之外生成的内容,导致无法全面识别由模型先验知识或偏见引发的偏差。论文的关键解决方案是将T2I中的幻觉重新定义为“偏见驱动的偏离”,并提出一个包含属性、关系和对象三类的分类体系,从而为T2I模型的评估提供理论边界和更深入的偏见检测基础。
链接: https://arxiv.org/abs/2509.21257
作者: Seyed Amir Kasaei,Mohammad Hossein Rohban
机构: Sharif University of Technology (谢里夫理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted at GenProCC NeurIPS 2025 Workshop
Abstract:In language and vision-language models, hallucination is broadly understood as content generated from a model’s prior knowledge or biases rather than from the given input. While this phenomenon has been studied in those domains, it has not been clearly framed for text-to-image (T2I) generative models. Existing evaluations mainly focus on alignment, checking whether prompt-specified elements appear, but overlook what the model generates beyond the prompt. We argue for defining hallucination in T2I as bias-driven deviations and propose a taxonomy with three categories: attribute, relation, and object hallucinations. This framing introduces an upper bound for evaluation and surfaces hidden biases, providing a foundation for richer assessment of T2I models.
zh
[NLP-11] Query-Centric Graph Retrieval Augmented Generation
【速读】: 该论文旨在解决图结构检索增强生成(Graph-based Retrieval-Augmented Generation, RAG)中的粒度困境问题:细粒度的实体级图会导致高token开销且丢失上下文信息,而粗粒度的文档级图则难以捕捉细微关系。其解决方案的关键在于提出一种查询中心的图RAG框架QCG-RAG,通过Doc2Query和Doc2Query–生成可控粒度的查询中心图,从而提升图的质量与可解释性,并设计定制化的多跳检索机制,基于生成的查询选择相关文本块,实现更精准的多跳推理。
链接: https://arxiv.org/abs/2509.21237
作者: Yaxiong Wu,Jianyuan Bo,Yongyue Zhang,Sheng Liang,Yong Liu
机构: Huawei Technologies Co., Ltd (华为技术有限公司)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 25 pages, 6 figures, 1 table
Abstract:Graph-based retrieval-augmented generation (RAG) enriches large language models (LLMs) with external knowledge for long-context understanding and multi-hop reasoning, but existing methods face a granularity dilemma: fine-grained entity-level graphs incur high token costs and lose context, while coarse document-level graphs fail to capture nuanced relations. We introduce QCG-RAG, a query-centric graph RAG framework that enables query-granular indexing and multi-hop chunk retrieval. Our query-centric approach leverages Doc2Query and Doc2Query-- to construct query-centric graphs with controllable granularity, improving graph quality and interpretability. A tailored multi-hop retrieval mechanism then selects relevant chunks via the generated queries. Experiments on LiHuaWorld and MultiHop-RAG show that QCG-RAG consistently outperforms prior chunk-based and graph-based RAG methods in question answering accuracy, establishing a new paradigm for multi-hop reasoning.
zh
[NLP-12] Evaluating the Evaluators: Metrics for Compositional Text-to-Image Generation NEURIPS2025
【速读】: 该论文旨在解决文本-图像生成模型评估中自动化指标与人类判断之间一致性不足的问题,尤其是这些指标在组合性任务(如对象、属性和关系的准确匹配)中的表现差异。其解决方案的关键在于开展一项广泛的研究,系统比较多种主流评估指标在不同组合挑战下的行为,并超越简单的相关性分析,深入考察各类指标家族与人类偏好的一致性。研究发现:单一指标无法在所有任务中保持稳定性能,VQA-based指标并非始终最优,而某些基于嵌入(embedding-based)的指标在特定场景下表现更优;同时指出图像仅有的指标因设计目标为感知质量而非语义对齐,在组合评估中贡献有限。这一结果强调了评估指标选择需具情境敏感性和透明度,以确保评价可信性及作为生成奖励模型的有效性。
链接: https://arxiv.org/abs/2509.21227
作者: Seyed Amir Kasaei,Ali Aghayari,Arash Marioriyad,Niki Sepasian,MohammadAmin Fazli,Mahdieh Soleymani Baghshah,Mohammad Hossein Rohban
机构: Sharif University of Technology (谢里夫理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted at GenProCC NeurIPS 2025 Workshop
Abstract:Text-image generation has advanced rapidly, but assessing whether outputs truly capture the objects, attributes, and relations described in prompts remains a central challenge. Evaluation in this space relies heavily on automated metrics, yet these are often adopted by convention or popularity rather than validated against human judgment. Because evaluation and reported progress in the field depend directly on these metrics, it is critical to understand how well they reflect human preferences. To address this, we present a broad study of widely used metrics for compositional text-image evaluation. Our analysis goes beyond simple correlation, examining their behavior across diverse compositional challenges and comparing how different metric families align with human judgments. The results show that no single metric performs consistently across tasks: performance varies with the type of compositional problem. Notably, VQA-based metrics, though popular, are not uniformly superior, while certain embedding-based metrics prove stronger in specific cases. Image-only metrics, as expected, contribute little to compositional evaluation, as they are designed for perceptual quality rather than alignment. These findings underscore the importance of careful and transparent metric selection, both for trustworthy evaluation and for their use as reward models in generation. Project page is available at \hrefthis https URLthis URL.
zh
[NLP-13] Sigma: Semantically Informative Pre-training for Skeleton-based Sign Language Understanding
【速读】: 该论文旨在解决手势语言理解(Sign Language Understanding, SLU)中三个关键问题:1)语义锚定弱,即模型难以将骨骼数据中的低级运动模式与语言意义关联;2)局部细节与全局上下文之间的不平衡,模型要么过度关注细粒度特征,要么忽略这些细节而仅聚焦于整体语境;3)跨模态学习效率低,难以构建语义对齐的多模态表示。解决方案的关键在于提出一个统一的基于骨骼数据的SLU框架Sigma,其核心创新包括:1)一种面向手势的早期融合机制,促进视觉与文本模态深度交互,用语言上下文增强视觉特征;2)一种分层对齐学习策略,联合最大化不同层级配对特征间的一致性,从而同时捕捉细粒度细节和高层语义关系;3)一个结合对比学习、文本匹配和语言建模的统一预训练框架,提升语义一致性与泛化能力。该方法在多个基准上实现了新的最先进性能,验证了语义信息丰富的预训练及骨骼数据作为独立输入的有效性。
链接: https://arxiv.org/abs/2509.21223
作者: Muxin Pu,Mei Kuan Lim,Chun Yong Chong,Chen Change Loy
机构: Monash University (蒙纳士大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Pre-training has proven effective for learning transferable features in sign language understanding (SLU) tasks. Recently, skeleton-based methods have gained increasing attention because they can robustly handle variations in subjects and backgrounds without being affected by appearance or environmental factors. Current SLU methods continue to face three key limitations: 1) weak semantic grounding, as models often capture low-level motion patterns from skeletal data but struggle to relate them to linguistic meaning; 2) imbalance between local details and global context, with models either focusing too narrowly on fine-grained cues or overlooking them for broader context; and 3) inefficient cross-modal learning, as constructing semantically aligned representations across modalities remains difficult. To address these, we propose Sigma, a unified skeleton-based SLU framework featuring: 1) a sign-aware early fusion mechanism that facilitates deep interaction between visual and textual modalities, enriching visual features with linguistic context; 2) a hierarchical alignment learning strategy that jointly maximises agreements across different levels of paired features from different modalities, effectively capturing both fine-grained details and high-level semantic relationships; and 3) a unified pre-training framework that combines contrastive learning, text matching and language modelling to promote semantic consistency and generalisation. Sigma achieves new state-of-the-art results on isolated sign language recognition, continuous sign language recognition, and gloss-free sign language translation on multiple benchmarks spanning different sign and spoken languages, demonstrating the impact of semantically informative pre-training and the effectiveness of skeletal data as a stand-alone solution for SLU.
zh
[NLP-14] SGMem: Sentence Graph Memory for Long-Term Conversational Agents
【速读】: 该论文旨在解决长期对话代理在处理超出大语言模型(LLM)上下文窗口的对话历史时,如何有效管理记忆的问题。现有基于事实抽取或摘要的方法虽能减少冗余,但在跨不同粒度(如轮次级、会话级)的信息组织与检索上存在局限。其解决方案的关键在于提出SGMem(Sentence Graph Memory),通过将对话表示为分块单元内的句子级图结构,捕捉跨轮次、回合和会话层级的关联关系,并结合原始对话片段与生成的记忆(如摘要、事实和洞察),为LLM提供连贯且相关的上下文以支持响应生成。
链接: https://arxiv.org/abs/2509.21212
作者: Yaxiong Wu,Yongyue Zhang,Sheng Liang,Yong Liu
机构: Huawei Technologies Co., Ltd (华为技术有限公司)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 19 pages, 6 figures, 1 table
Abstract:Long-term conversational agents require effective memory management to handle dialogue histories that exceed the context window of large language models (LLMs). Existing methods based on fact extraction or summarization reduce redundancy but struggle to organize and retrieve relevant information across different granularities of dialogue and generated memory. We introduce SGMem (Sentence Graph Memory), which represents dialogue as sentence-level graphs within chunked units, capturing associations across turn-, round-, and session-level contexts. By combining retrieved raw dialogue with generated memory such as summaries, facts and insights, SGMem supplies LLMs with coherent and relevant context for response generation. Experiments on LongMemEval and LoCoMo show that SGMem consistently improves accuracy and outperforms strong baselines in long-term conversational question answering.
zh
[NLP-15] CLaw: Benchmarking Chinese Legal Knowledge in Large Language Models - A Fine-grained Corpus and Reasoning Analysis
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在处理中文法律文本时因通用预训练导致的法律知识准确性不足问题,从而影响其在法律推理中的可靠性。解决方案的关键在于构建一个名为CLaw的新型基准测试体系,该体系包含两个核心组件:一是涵盖全部306部中国国家法律、细粒度至条文子项级别并标注精确修订时间戳的结构化语料库(共64,849条记录),用于严格评估法律条款的召回能力;二是基于最高人民法院精选案例设计的254个法律推理实例,用于检验模型对法律知识的实际应用能力。通过此基准,研究揭示了多数LLMs在忠实复现法律条文方面存在显著缺陷,并指出提升法律推理可信度需结合精准的知识检索(如监督微调SFT或检索增强生成RAG)与强健的一般推理能力。
链接: https://arxiv.org/abs/2509.21208
作者: Xinzhe Xu,Liang Zhao,Hongshen Xu,Chen Chen
机构: Peking University (北京大学); LLM-Core Xiaomi (小米)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) are increasingly tasked with analyzing legal texts and citing relevant statutes, yet their reliability is often compromised by general pre-training that ingests legal texts without specialized focus, obscuring the true depth of their legal knowledge. This paper introduces CLaw, a novel benchmark specifically engineered to meticulously evaluate LLMs on Chinese legal knowledge and its application in reasoning. CLaw comprises two key components: (1) a comprehensive, fine-grained corpus of all 306 Chinese national statutes, segmented to the subparagraph level and incorporating precise historical revision timesteps for rigorous recall evaluation (64,849 entries), and (2) a challenging set of 254 case-based reasoning instances derived from China Supreme Court curated materials to assess the practical application of legal knowledge. Our empirical evaluation reveals that most contemporary LLMs significantly struggle to faithfully reproduce legal provisions. As accurate retrieval and citation of legal provisions form the basis of legal reasoning, this deficiency critically undermines the reliability of their responses. We contend that achieving trustworthy legal reasoning in LLMs requires a robust synergy of accurate knowledge retrieval–potentially enhanced through supervised fine-tuning (SFT) or retrieval-augmented generation (RAG)–and strong general reasoning capabilities. This work provides an essential benchmark and critical insights for advancing domain-specific LLM reasoning, particularly within the complex legal sphere.
zh
[NLP-16] ABLET: A Large-Scale Dataset for Robust Visual Table Understanding
【速读】: 该论文旨在解决当前视觉表格理解(Visual Table Understanding, VTU)研究中普遍存在的两个关键问题:一是现有基准测试多基于合成渲染图像,缺乏真实世界表格的复杂性和视觉多样性;二是现有数据集提供固定示例且仅含单一可视化形式,无法支持对底层结构化数据的灵活重构与任务扩展。解决方案的关键在于提出TABLET这一大规模VTU数据集,包含400万条跨20项任务的样本,源自200万个唯一表格,其中88%保留原始视觉呈现方式。每个样本均配备图像-HTML配对表示、详尽元数据及溯源信息,确保数据可追溯性与灵活性,从而支持视觉语言模型(如Qwen2.5-VL-7B)在训练中提升对已见和未见任务的泛化能力,并增强对真实表格图像的鲁棒性。
链接: https://arxiv.org/abs/2509.21205
作者: Iñigo Alonso,Imanol Miranda,Eneko Agirre,Mirella Lapata
机构: University of Edinburgh (爱丁堡大学); HiTZ Center – Ixa (HiTZ中心–Ixa); University of the Basque Country UPV/EHU (巴斯克大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:While table understanding increasingly relies on pixel-only settings where tables are processed as visual representations, current benchmarks predominantly use synthetic renderings that lack the complexity and visual diversity of real-world tables. Additionally, existing visual table understanding (VTU) datasets offer fixed examples with single visualizations and pre-defined instructions, providing no access to underlying serialized data for reformulation. We introduce TABLET, a large-scale VTU dataset with 4 million examples across 20 tasks, grounded in 2 million unique tables where 88% preserve original visualizations. Each example includes paired image-HTML representations, comprehensive metadata, and provenance information linking back to the source datasets. Fine-tuning vision-language models like Qwen2.5-VL-7B on TABLET improves performance on seen and unseen VTU tasks while increasing robustness on real-world table visualizations. By preserving original visualizations and maintaining example traceability in a unified large-scale collection, TABLET establishes a foundation for robust training and extensible evaluation of future VTU models.
zh
[NLP-17] Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在科学推理任务中面临的两大瓶颈:一是显式检索片段导致的推理过程冗余,即所谓的“工具税”(tool tax),表现为额外的token消耗和推理步骤;二是多智能体(multi-agent)流水线通过平均所有候选方案稀释了优质解。其解决方案的关键在于提出一个统一框架,融合隐式检索与结构化协作机制:首先引入基于Monitor的检索模块,在token层面实现外部知识的隐式整合,最小化对原生推理路径的干扰;其次构建分层解 refinement(Hierarchical Solution Refinement, HSR)与质量感知迭代推理(Quality-Aware Iterative Reasoning, QAIR)机制,通过以每个候选解为锚点、由其他智能体协同修复,并依据解的质量动态调整优化策略,从而提升推理效率与准确性。实验表明,该方法在Humanity’s Last Exam (HLE) Bio/Chem Gold数据集上达到48.3%准确率(当前最高),同时减少53.5% token使用和43.7%代理步骤。
链接: https://arxiv.org/abs/2509.21193
作者: Xiangru Tang,Wanghan Xu,Yujie Wang,Zijie Guo,Daniel Shao,Jiapeng Chen,Cixuan Zhang,Ziyi Wang,Lixin Zhang,Guancheng Wan,Wenlong Zhang,Lei Bai,Zhenfei Yin,Philip Torr,Hanrui Wang,Di Jin
机构: Yale University (耶鲁大学); Shanghai Jiao Tong University (上海交通大学); Fudan University (复旦大学); University of California, Los Angeles (加州大学洛杉矶分校); Shanghai AI Lab (上海人工智能实验室); University of Oxford (牛津大学); Eigen AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have recently shown strong progress on scientific reasoning, yet two major bottlenecks remain. First, explicit retrieval fragments reasoning, imposing a hidden “tool tax” of extra tokens and steps. Second, multi-agent pipelines often dilute strong solutions by averaging across all candidates. We address these challenges with a unified framework that combines implicit retrieval and structured collaboration. At its foundation, a Monitor-based retrieval module operates at the token level, integrating external knowledge with minimal disruption to reasoning. On top of this substrate, Hierarchical Solution Refinement (HSR) iteratively designates each candidate as an anchor to be repaired by its peers, while Quality-Aware Iterative Reasoning (QAIR) adapts refinement to solution quality. On Humanity’s Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy – the highest reported to date, surpassing the strongest agent baseline by 13.4 points and leading frontier LLMs by up to 18.1 points, while simultaneously reducing token usage by 53.5% and agent steps by 43.7%. Results on SuperGPQA and TRQA confirm robustness across domains. Error analysis shows that reasoning failures and knowledge gaps co-occur in over 85% of cases, while diversity analysis reveals a clear dichotomy: retrieval tasks benefit from solution variety, whereas reasoning tasks favor consensus. Together, these findings demonstrate how implicit augmentation and structured refinement overcome the inefficiencies of explicit tool use and uniform aggregation. Code is available at: this https URL.
zh
[NLP-18] GEP: A GCG-Based method for extracting personally identifiable information from chatbots built on small language models ICLR2026
【速读】: 该论文旨在解决小型语言模型(Small Language Models, SLMs)在下游任务中可能存在的个人身份信息(PII)泄露问题,尤其是针对基于SLM的聊天机器人(如ChatBioGPT)在医疗场景下的隐私风险。此前的模板驱动型PII攻击方法在SLM条件下效果有限,无法有效提取数据集中的PII用于泄露检测。论文提出了一种新的贪婪坐标梯度(Greedy Coordinate Gradient, GCG)方法——GEP(Gradient-based Extraction for PII),其关键创新在于利用梯度优化策略直接生成高概率泄露样本,显著提升了PII提取效率:实验表明,GEP相比传统模板方法可使泄露量提升高达60倍;进一步在更复杂的自由插入场景下验证了其有效性,即使PII以多样化的句法形式嵌入,GEP仍能实现最高达4.53%的PII泄露率。
链接: https://arxiv.org/abs/2509.21192
作者: Jieli Zhu,Vi Ngoc-Nha Tran
机构: The Arctic University of Norway (北极挪威大学)
类目: Computation and Language (cs.CL)
备注: 16 pages, 5 figures, 4 tables. Under review as a conference paper at ICLR 2026
Abstract:Small language models (SLMs) become unprecedentedly appealing due to their approximately equivalent performance compared to large language models (LLMs) in certain fields with less energy and time consumption during training and inference. However, the personally identifiable information (PII) leakage of SLMs for downstream tasks has yet to be explored. In this study, we investigate the PII leakage of the chatbot based on SLM. We first finetune a new chatbot, i.e., ChatBioGPT based on the backbone of BioGPT using medical datasets Alpaca and HealthCareMagic. It shows a matchable performance in BERTscore compared with previous studies of ChatDoctor and ChatGPT. Based on this model, we prove that the previous template-based PII attacking methods cannot effectively extract the PII in the dataset for leakage detection under the SLM condition. We then propose GEP, which is a greedy coordinate gradient-based (GCG) method specifically designed for PII extraction. We conduct experimental studies of GEP and the results show an increment of up to 60 \times more leakage compared with the previous template-based methods. We further expand the capability of GEP in the case of a more complicated and realistic situation by conducting free-style insertion where the inserted PII in the dataset is in the form of various syntactic expressions instead of fixed templates, and GEP is still able to reveal a PII leakage rate of up to 4.53%.
zh
[NLP-19] Whos Laughing Now? An Overview of Computational Humour Generation and Explanation
【速读】: 该论文旨在解决计算幽默(computational humour)在自然语言处理(Natural Language Processing, NLP)中的理解与生成任务中存在的研究空白与技术挑战,尤其聚焦于超越双关语(pun)的幽默创造与解释能力。其核心问题是当前大型语言模型(Large Language Models, LLMs)在处理抽象、创造性且高度依赖语境的幽默时仍远未达到人类水平,且相关研究相对稀缺。解决方案的关键在于系统性梳理计算幽默领域的文献,明确其作为NLP基础任务的重要性,并提出未来研究方向,强调需充分考虑幽默的主观性和伦理模糊性,从而推动LLMs在常识推理和语境理解方面的实质性进步。
链接: https://arxiv.org/abs/2509.21175
作者: Tyler Loakman,William Thorne,Chenghua Lin
机构: The University of Sheffield (谢菲尔德大学); The University of Manchester (曼彻斯特大学)
类目: Computation and Language (cs.CL)
备注: Accepted to INLG 2025
Abstract:The creation and perception of humour is a fundamental human trait, positioning its computational understanding as one of the most challenging tasks in natural language processing (NLP). As an abstract, creative, and frequently context-dependent construct, humour requires extensive reasoning to understand and create, making it a pertinent task for assessing the common-sense knowledge and reasoning abilities of modern large language models (LLMs). In this work, we survey the landscape of computational humour as it pertains to the generative tasks of creation and explanation. We observe that, despite the task of understanding humour bearing all the hallmarks of a foundational NLP task, work on generating and explaining humour beyond puns remains sparse, while state-of-the-art models continue to fall short of human capabilities. We bookend our literature survey by motivating the importance of computational humour processing as a subdiscipline of NLP and presenting an extensive discussion of future directions for research in the area that takes into account the subjective and ethically ambiguous nature of humour.
zh
[NLP-20] Learning the Wrong Lessons: Syntactic-Domain Spurious Correlations in Language Models NEURIPS2025
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在理解任务指令时,因训练数据中语法结构与领域(domain)之间存在虚假相关性(spurious correlations)而导致的性能下降和潜在安全风险问题。其核心问题是:模型可能过度依赖语法模板(syntactic templates)而非语义信息来推断任务所属领域,从而在面对语义明确但语法特征与特定领域强关联的指令时产生错误响应,甚至绕过安全机制。解决方案的关键在于提出一种评估框架以检测此类语法-领域相关性,并强调在训练数据中引入领域内语法多样性(syntactic diversity),从而避免模型学习到误导性的语法-领域映射关系。
链接: https://arxiv.org/abs/2509.21155
作者: Chantal Shaib,Vinith M. Suriyakumar,Levent Sagun,Byron C. Wallace,Marzyeh Ghassemi
机构: Northeastern University (东北大学); MIT (麻省理工学院); Meta (Meta)
类目: Computation and Language (cs.CL)
备注: NeurIPS 2025 Spotlight
Abstract:For an LLM to correctly respond to an instruction it must understand both the semantics and the domain (i.e., subject area) of a given task-instruction pair. However, syntax can also convey implicit information Recent work shows that syntactic templates–frequent sequences of Part-of-Speech (PoS) tags–are prevalent in training data and often appear in model outputs. In this work we characterize syntactic templates, domain, and semantics in task-instruction pairs. We identify cases of spurious correlations between syntax and domain, where models learn to associate a domain with syntax during training; this can sometimes override prompt semantics. Using a synthetic training dataset, we find that the syntactic-domain correlation can lower performance (mean 0.51 +/- 0.06) on entity knowledge tasks in OLMo-2 models (1B-13B). We introduce an evaluation framework to detect this phenomenon in trained models, and show that it occurs on a subset of the FlanV2 dataset in open (OLMo-2-7B; Llama-4-Maverick), and closed (GPT-4o) models. Finally, we present a case study on the implications for safety finetuning, showing that unintended syntactic-domain correlations can be used to bypass refusals in OLMo-2-7B Instruct and GPT-4o. Our findings highlight two needs: (1) to explicitly test for syntactic-domain correlations, and (2) to ensure syntactic diversity in training data, specifically within domains, to prevent such spurious correlations.
zh
[NLP-21] Retrieval over Classification: Integrating Relation Semantics for Multimodal Relation Extraction EMNLP2025
【速读】: 该论文旨在解决现有多模态关系抽取(Multimodal Relation Extraction, MRE)方法中基于分类范式的局限性问题,即忽视实体类型和位置信息等结构约束,且在细粒度语义表达上能力不足。其解决方案的关键在于提出一种名为“检索优于分类”(Retrieval Over Classification, ROC)的新框架,该框架将多模态关系抽取任务重构为基于关系语义的检索任务:通过多模态编码器融合实体类型与位置信息,利用大语言模型(Large Language Model, LLM)将离散的关系标签扩展为自然语言描述,并采用基于语义相似度的对比学习对齐实体-关系对,从而显著提升模型的性能、鲁棒性和可解释性。
链接: https://arxiv.org/abs/2509.21151
作者: Lei Hei,Tingjing Liao,Yingxin Pei,Yiyang Qi,Jiaqi Wang,Ruiting Li,Feiliang Ren
机构: Northeastern University (东北大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted by EMNLP 2025 Main Conference
Abstract:Relation extraction (RE) aims to identify semantic relations between entities in unstructured text. Although recent work extends traditional RE to multimodal scenarios, most approaches still adopt classification-based paradigms with fused multimodal features, representing relations as discrete labels. This paradigm has two significant limitations: (1) it overlooks structural constraints like entity types and positional cues, and (2) it lacks semantic expressiveness for fine-grained relation understanding. We propose \underlineRetrieval \underlineOver \underlineClassification (ROC), a novel framework that reformulates multimodal RE as a retrieval task driven by relation semantics. ROC integrates entity type and positional information through a multimodal encoder, expands relation labels into natural language descriptions using a large language model, and aligns entity-relation pairs via semantic similarity-based contrastive learning. Experiments show that our method achieves state-of-the-art performance on the benchmark datasets MNRE and MORE and exhibits stronger robustness and interpretability.
zh
[NLP-22] Automotive-ENV: Benchmarking Multimodal Agents in Vehicle Interface Systems
【速读】: 该论文旨在解决当前多模态智能体(Multimodal Agents)在车载图形用户界面(GUI)中应用不足的问题,尤其是在驾驶员注意力有限、安全要求严格以及交互模式具有强位置依赖性的场景下。其核心挑战在于如何实现既安全又自适应的交互行为。解决方案的关键在于提出首个面向车辆GUI的高保真基准测试平台Automotive-ENV,该平台定义了185个参数化任务并提供结构化的多模态观测与程序化验证机制;在此基础上进一步设计了ASURADA——一种地理感知的多模态代理,通过融合GPS信息动态调整动作以适配地理位置、环境条件和区域驾驶规范,从而显著提升安全相关任务的成功率,凸显位置上下文对车载智能体性能的重要性。
链接: https://arxiv.org/abs/2509.21143
作者: Junfeng Yan,Biao Wu,Meng Fang,Ling Chen
机构: Australian Artificial Intelligence Institute (澳大利亚人工智能研究所); University of Liverpool (利物浦大学)
类目: Robotics (cs.RO); Computation and Language (cs.CL)
备注: 10 pages, 5 figures,
Abstract:Multimodal agents have demonstrated strong performance in general GUI interactions, but their application in automotive systems has been largely unexplored. In-vehicle GUIs present distinct challenges: drivers’ limited attention, strict safety requirements, and complex location-based interaction patterns. To address these challenges, we introduce Automotive-ENV, the first high-fidelity benchmark and interaction environment tailored for vehicle GUIs. This platform defines 185 parameterized tasks spanning explicit control, implicit intent understanding, and safety-aware tasks, and provides structured multimodal observations with precise programmatic checks for reproducible evaluation. Building on this benchmark, we propose ASURADA, a geo-aware multimodal agent that integrates GPS-informed context to dynamically adjust actions based on location, environmental conditions, and regional driving norms. Experiments show that geo-aware information significantly improves success on safety-aware tasks, highlighting the importance of location-based context in automotive environments. We will release Automotive-ENV, complete with all tasks and benchmarking tools, to further the development of safe and adaptive in-vehicle agents.
zh
[NLP-23] AutoIntent: AutoML for Text Classification EMNLP2025
【速读】: 该论文旨在解决文本分类任务中自动化机器学习(AutoML)工具在嵌入模型选择、分类器优化和决策阈值调整等环节缺乏端到端集成的问题。现有方案通常需要人工干预或模块间割裂,难以高效适配多标签分类与意图识别场景中的“非相关输入”(out-of-scope detection)。其解决方案的关键在于构建一个模块化、类似sklearn的框架AutoIntent,实现从特征嵌入到最终决策的全流程自动化,并支持灵活的资源-性能权衡,从而在标准意图分类数据集上显著优于现有AutoML工具。
链接: https://arxiv.org/abs/2509.21138
作者: Ilya Alekseev,Roman Solomatin,Darina Rustamova,Denis Kuznetsov
机构: Moscow Center for Advanced Studies (莫斯科高级研究中心); Moscow State University (莫斯科国立大学); ITMO University (ITMO大学); dresscode.ai
类目: Computation and Language (cs.CL)
备注: EMNLP 2025 System demonstrations
Abstract:AutoIntent is an automated machine learning tool for text classification tasks. Unlike existing solutions, AutoIntent offers end-to-end automation with embedding model selection, classifier optimization, and decision threshold tuning, all within a modular, sklearn-like interface. The framework is designed to support multi-label classification and out-of-scope detection. AutoIntent demonstrates superior performance compared to existing AutoML tools on standard intent classification datasets and enables users to balance effectiveness and resource consumption.
zh
[NLP-24] Acoustic-based Gender Differentiation in Speech-aware Language Models
【速读】: 该论文旨在解决生成式 AI(Generative AI)在语音交互场景中因声学特征导致的性别偏差问题,即相同语义内容的提问因说话者性别不同而产生差异化的响应。其解决方案的关键在于构建了一个包含9,208个语音样本的系统性数据集(分为性别无关、性别刻板印象和性别相关三类),并通过对比SpeechLMs与对应基础语言模型(LLMs)的表现,揭示出当前模型存在“悖论性偏见”:在性别刻板问题上呈现男性导向响应,而在应体现性别差异的情境下却表现出无性别区分的响应。进一步分析表明,这种偏差主要源于Whisper语音编码器生成的男性倾向性声学token,而非模型对中立选项或语音性别感知的误判。因此,该研究强调需发展更精细的性别信息处理机制以实现真正公平的语音交互系统。
链接: https://arxiv.org/abs/2509.21125
作者: Junhyuk Choi,Jihwan Seol,Nayeon Kim,Chanhee Cho,EunBin Cho,Bugeun Kim
机构: 未知
类目: Computation and Language (cs.CL)
备注: Under Review
Abstract:Speech-aware Language Models (SpeechLMs) have fundamentally transformed human-AI interaction by enabling voice-based communication, yet they may exhibit acoustic-based gender differentiation where identical questions lead to different responses based on the speaker’s gender. This paper propose a new dataset that enables systematic analysis of this phenomenon, containing 9,208 speech samples across three categories: Gender-Independent, Gender-Stereotypical, and Gender-Dependent. We further evaluated LLaMA-Omni series and discovered a paradoxical pattern; while overall responses seems identical regardless of gender, the pattern is far from unbiased responses. Specifically, in Gender-Stereotypical questions, all models consistently exhibited male-oriented responses; meanwhile, in Gender-Dependent questions where gender differentiation would be contextually appropriate, models exhibited responses independent to gender instead. We also confirm that this pattern does not result from neutral options nor perceived gender of a voice. When we allow neutral response, models tends to respond neutrally also in Gender-Dependent questions. The paradoxical pattern yet retains when we applied gender neutralization methods on speech. Through comparison between SpeechLMs with corresponding backbone LLMs, we confirmed that these paradoxical patterns primarily stem from Whisper speech encoders, which generates male-oriented acoustic tokens. These findings reveal that current SpeechLMs may not successfully remove gender biases though they prioritized general fairness principles over contextual appropriateness, highlighting the need for more sophisticated techniques to utilize gender information properly in speech technology.
zh
[NLP-25] Expanding Reasoning Potential in Foundation Model by Learning Diverse Chains of Thought Patterns
【速读】: 该论文旨在解决当前基于强化学习(Reinforcement Learning, RL)的大模型在数学推理任务中,因盲目使用链式思维(Chain-of-Thought, CoT)数据而导致推理能力提升有限的问题。其核心挑战在于如何识别并利用最具价值的CoT数据以高效扩展模型的推理潜力(reasoning potential),即减少达到正确答案所需的独立尝试次数。解决方案的关键在于:首先定义推理潜力为完成正确推理所需独立尝试次数的倒数,并据此构建一个由高价值原子推理模式(atomic reasoning patterns)组成的参考集;其次提出一种双粒度算法,结合推理模式链与词元熵(token entropy)筛选出与核心参考集高度对齐的高质量CoT数据(CoTP),从而实现精准训练。实验表明,仅使用10B tokens的CoTP即可使85A6B MoE模型在AIME 2024和2025测试集上提升9.58%,并推动下游RL性能上限提高7.81%。
链接: https://arxiv.org/abs/2509.21124
作者: Xuemiao Zhang,Can Ren,Chengying Tu,Rongxiang Weng,Shuo Wang,Hongfei Yan,Jingang Wang,Xunliang Cai
机构: Cranberry-Lemon University (cranberry-lemon大学); University of the Witwatersrand (威特沃特斯兰德大学); Peking University (北京大学); Meituan (美团)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Recent progress in large reasoning models for challenging mathematical reasoning has been driven by reinforcement learning (RL). Incorporating long chain-of-thought (CoT) data during mid-training has also been shown to substantially improve reasoning depth. However, current approaches often utilize CoT data indiscriminately, leaving open the critical question of which data types most effectively enhance model reasoning capabilities. In this paper, we define the foundation model’s reasoning potential for the first time as the inverse of the number of independent attempts required to correctly answer the question, which is strongly correlated with the final model performance. We then propose utilizing diverse data enriched with high-value reasoning patterns to expand the reasoning potential. Specifically, we abstract atomic reasoning patterns from CoT sequences, characterized by commonality and inductive capabilities, and use them to construct a core reference set enriched with valuable reasoning patterns. Furthermore, we propose a dual-granularity algorithm involving chains of reasoning patterns and token entropy, efficiently selecting high-value CoT data (CoTP) from the data pool that aligns with the core set, thereby training models to master reasoning effectively. Only 10B-token CoTP data enables the 85A6B Mixture-of-Experts (MoE) model to improve by 9.58% on the challenging AIME 2024 and 2025, and to raise the upper bound of downstream RL performance by 7.81%.
zh
[NLP-26] rustJudge: Inconsistencies of LLM -as-a-Judge and How to Alleviate Them
【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)作为自动评价者(LLM-as-a-judge)的评估框架中存在的两类关键不一致性问题:一是评分比较不一致(Score-Comparison Inconsistency),即低分响应在成对比较中优于高分响应;二是成对传递性不一致(Pairwise Transitivity Inconsistency),表现为循环偏好链(如A>B>C>A)和等价矛盾(如A=B=C≠A)。这些问题源于离散评分系统的信息损失以及成对评估中模糊的平局判定。论文提出的解决方案是TrustJudge,其核心创新在于:1)分布敏感评分(distribution-sensitive scoring),通过从离散评分概率中计算连续期望值来保留信息熵,实现更精确的打分;2)似然感知聚合(likelihood-aware aggregation),利用双向偏好概率或困惑度(perplexity)解决传递性违反问题。实验表明,TrustJudge显著降低了两类不一致性,并在不同模型架构与规模下保持更高评估准确性,且无需额外训练或人工标注即可提升自动化评估的可靠性。
链接: https://arxiv.org/abs/2509.21117
作者: Yidong Wang,Yunze Song,Tingyuan Zhu,Xuanwang Zhang,Zhuohao Yu,Hao Chen,Chiyu Song,Qiufeng Wang,Cunxiang Wang,Zhen Wu,Xinyu Dai,Yue Zhang,Wei Ye,Shikun Zhang
机构: Peking University (北京大学); National University of Singapore (新加坡国立大学); Institute of Science Tokyo (东京科学研究所); Nanjing University (南京大学); Google DeepMind (谷歌DeepMind); Westlake University (西湖大学); Southeast University (东南大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 22 pages, 9 figures, 6 tables
Abstract:The adoption of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistencies: (1) Score-Comparison Inconsistency, where lower-rated responses outperform higher-scored ones in pairwise comparisons, and (2) Pairwise Transitivity Inconsistency, manifested through circular preference chains (ABCA) and equivalence contradictions (A=B=C\neq A). We argue that these issues come from information loss in discrete rating systems and ambiguous tie judgments during pairwise evaluation. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations: 1) distribution-sensitive scoring that computes continuous expectations from discrete rating probabilities, preserving information entropy for more precise scoring, and 2) likelihood-aware aggregation that resolves transitivity violations using bidirectional preference probabilities or perplexity. We also formalize the theoretical limitations of current LLM-as-a-judge frameworks and demonstrate how TrustJudge’s components overcome them. When evaluated with Llama-3.1-70B-Instruct as judge using our dataset, TrustJudge reduces Score-Comparison inconsistency by 8.43% (from 23.32% to 14.89%) and Pairwise Transitivity inconsistency by 10.82% (from 15.22% to 4.40%), while maintaining higher evaluation accuracy. Our work provides the first systematic analysis of evaluation framework inconsistencies in LLM-as-a-judge paradigms, offering both theoretical insights and practical solutions for reliable automated assessment. The framework demonstrates consistent improvements across various model architectures and scales, enabling more trustworthy LLM evaluation without requiring additional training or human annotations. The codes can be found at this https URL.
zh
[NLP-27] VoiceBBQ: Investigating Effect of Content and Acoustics in Social Bias of Spoken Language Model EMNLP2025
【速读】: 该论文旨在解决语音语言模型(Spoken Language Models, SLMs)中社会偏见的多维度测量与诊断问题,特别是区分内容层面(content aspect)和声学层面(acoustic aspect)的偏见来源。传统文本基准如BBQ(Bias Benchmark for Question Answering)无法捕捉语音特有的偏见形式,而语音中的声学特征(如口音、语调)可能独立引发或加剧偏见。解决方案的关键在于提出VoiceBBQ——一个将原始BBQ文本上下文转化为受控语音条件的数据集,通过标准化语音输入实现对内容与声学偏见的独立评估,从而支持对SLMs在两个维度上的准确率、偏见强度和一致性进行量化比较。这一设计使研究者能够系统性地识别不同模型架构对内容与声学偏见的敏感性差异,为公平性改进提供可操作的测试平台。
链接: https://arxiv.org/abs/2509.21108
作者: Junhyuk Choi,Ro-hoon Oh,Jihwan Seol,Bugeun Kim
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted EMNLP 2025 main
Abstract:We introduce VoiceBBQ, a spoken extension of the BBQ (Bias Benchmark for Question Answering) - a dataset that measures social bias by presenting ambiguous or disambiguated contexts followed by questions that may elicit stereotypical responses. Due to the nature of speech, social bias in Spoken Language Models (SLMs) can emerge from two distinct sources: 1) content aspect and 2) acoustic aspect. The dataset converts every BBQ context into controlled voice conditions, enabling per-axis accuracy, bias, and consistency scores that remain comparable to the original text benchmark. Using VoiceBBQ, we evaluate two SLMs - LLaMA-Omni and Qwen2-Audio - and observe architectural contrasts: LLaMA-Omni resists acoustic bias while amplifying gender and accent bias, whereas Qwen2-Audio substantially dampens these cues while preserving content fidelity. VoiceBBQ thus provides a compact, drop-in testbed for jointly diagnosing content and acoustic bias across spoken language models.
zh
[NLP-28] BESPOKE: Benchmark for Search-Augmented Large Language Model Personalization via Diagnostic Feedback
【速读】: 该论文旨在解决搜索增强型大语言模型(Search-augmented Large Language Models, SLMs)在信息检索任务中个性化不足的问题,即现有系统难以识别同一查询在不同用户间可能反映的多样意图,并无法按用户偏好提供定制化信息形式。解决方案的关键在于提出一个名为 BESPOKE 的真实且诊断性强的基准测试框架:它通过收集人类真实的聊天与搜索历史数据构建,结合细粒度偏好评分和诊断性反馈,实现对个性化效果的系统性评估。这一基准使研究者能够深入分析有效个性化所需的核心要素,为后续细粒度评估个性化搜索增强型 LLM 提供坚实基础。
链接: https://arxiv.org/abs/2509.21106
作者: Hyunseo Kim,Sangam Lee,Kwangwook Seo,Dongha Lee
机构: Yonsei University (延世大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Work in progress
Abstract:Search-augmented large language models (LLMs) have advanced information-seeking tasks by integrating retrieval into generation, reducing users’ cognitive burden compared to traditional search systems. Yet they remain insufficient for fully addressing diverse user needs, which requires recognizing how the same query can reflect different intents across users and delivering information in preferred forms. While recent systems such as ChatGPT and Gemini attempt personalization by leveraging user histories, systematic evaluation of such personalization is under-explored. To address this gap, we propose BESPOKE, the realistic benchmark for evaluating personalization in search-augmented LLMs. BESPOKE is designed to be both realistic, by collecting authentic chat and search histories directly from humans, and diagnostic, by pairing responses with fine-grained preference scores and feedback. The benchmark is constructed through long-term, deeply engaged human annotation, where human annotators contributed their own histories, authored queries with detailed information needs, and evaluated responses with scores and diagnostic feedback. Leveraging BESPOKE, we conduct systematic analyses that reveal key requirements for effective personalization in information-seeking tasks, providing a foundation for fine-grained evaluation of personalized search-augmented LLMs. Our code and data are available at this https URL.
zh
[NLP-29] PerHalluEval: Persian Hallucination Evaluation Benchmark for Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在低资源语言如波斯语(Persian)中普遍存在且难以检测的幻觉(hallucination)问题。其解决方案的关键在于构建了首个针对波斯语的动态幻觉评估基准 PerHalluEval,该基准采用三阶段LLM驱动的流水线结合人工验证,生成符合语境的问答(QA)与摘要任务数据,并通过生成标记的对数概率筛选最具可信度的幻觉实例;同时引入人类标注者识别波斯文化特有语境,以更精准评估模型在本地化内容上的幻觉检测能力。实验表明,提供外部知识源可部分缓解幻觉,而专为波斯语训练的模型并未显著优于其他模型。
链接: https://arxiv.org/abs/2509.21104
作者: Mohammad Hosseini,Kimia Hosseini,Shayan Bali,Zahra Zanjani,Saeedeh Momtazi
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Hallucination is a persistent issue affecting all large language Models (LLMs), particularly within low-resource languages such as Persian. PerHalluEval (Persian Hallucination Evaluation) is the first dynamic hallucination evaluation benchmark tailored for the Persian language. Our benchmark leverages a three-stage LLM-driven pipeline, augmented with human validation, to generate plausible answers and summaries regarding QA and summarization tasks, focusing on detecting extrinsic and intrinsic hallucinations. Moreover, we used the log probabilities of generated tokens to select the most believable hallucinated instances. In addition, we engaged human annotators to highlight Persian-specific contexts in the QA dataset in order to evaluate LLMs’ performance on content specifically related to Persian culture. Our evaluation of 12 LLMs, including open- and closed-source models using PerHalluEval, revealed that the models generally struggle in detecting hallucinated Persian text. We showed that providing external knowledge, i.e., the original document for the summarization task, could mitigate hallucination partially. Furthermore, there was no significant difference in terms of hallucination when comparing LLMs specifically trained for Persian with others.
zh
[NLP-30] Which Cultural Lens Do Models Adopt? On Cultural Positioning Bias and Agent ic Mitigation in LLM s
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中存在的文化定位偏差(culture positioning bias)问题,即模型在生成内容时倾向于以主流美国文化为默认视角,对非主流文化群体则表现出明显的外部化倾向,从而加剧文化不平等。解决方案的关键在于提出一种基于文化情境化访谈脚本生成任务的评估基准——CultureLens,并设计两种推理阶段的缓解方法:一是基于提示的公平干预支柱(Fairness Intervention Pillars, FIP)基线方法;二是结构化的公平代理缓解框架(Mitigation via Fairness Agents, MFA),其中MFA-SA(单代理)通过公平指南驱动的自我反思与重写循环实现修正,MFA-MA(多代理)则构建由规划、批判与精修三类专业化代理组成的层级流程,系统性地识别并修正生成内容中的文化偏见。实证结果表明,基于代理的方法在减少文化定位偏差方面展现出显著有效性,为生成式AI(Generative AI)的公平性治理提供了新路径。
链接: https://arxiv.org/abs/2509.21080
作者: Yixin Wan,Xingrun Chen,Kai-Wei Chang
机构: University of California, Los Angeles (加州大学洛杉矶分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:Large language models (LLMs) have unlocked a wide range of downstream generative applications. However, we found that they also risk perpetuating subtle fairness issues tied to culture, positioning their generations from the perspectives of the mainstream US culture while demonstrating salient externality towards non-mainstream ones. In this work, we identify and systematically investigate this novel culture positioning bias, in which an LLM’s default generative stance aligns with a mainstream view and treats other cultures as outsiders. We propose the CultureLens benchmark with 4000 generation prompts and 3 evaluation metrics for quantifying this bias through the lens of a culturally situated interview script generation task, in which an LLM is positioned as an onsite reporter interviewing local people across 10 diverse cultures. Empirical evaluation on 5 state-of-the-art LLMs reveals a stark pattern: while models adopt insider tones in over 88 percent of US-contexted scripts on average, they disproportionately adopt mainly outsider stances for less dominant cultures. To resolve these biases, we propose 2 inference-time mitigation methods: a baseline prompt-based Fairness Intervention Pillars (FIP) method, and a structured Mitigation via Fairness Agents (MFA) framework consisting of 2 pipelines: (1) MFA-SA (Single-Agent) introduces a self-reflection and rewriting loop based on fairness guidelines. (2) MFA-MA (Multi-Agent) structures the process into a hierarchy of specialized agents: a Planner Agent(initial script generation), a Critique Agent (evaluates initial script against fairness pillars), and a Refinement Agent (incorporates feedback to produce a polished, unbiased script). Empirical results showcase the effectiveness of agent-based methods as a promising direction for mitigating biases in generative LLMs.
zh
[NLP-31] SoM-1K: A Thousand-Problem Benchmark Dataset for Strength of Materials
【速读】: 该论文旨在解决当前基础模型(foundation models)在复杂多模态工程问题中表现不佳的问题,特别是针对材料力学(strength of materials, SoM)领域缺乏系统评估基准与有效推理能力的现状。其关键解决方案是提出一个名为“图像描述”(Descriptions of Images, DoI)的新型提示策略,通过专家生成的严谨文本描述替代原始视觉图示作为上下文输入,从而显著提升模型对工程问题的理解准确性。实验表明,DoI能有效缓解视觉误读错误,使大型语言模型(LLMs)在特定任务上优于视觉语言模型(VLMs),揭示了文本增强在当前多模态基础模型中的重要性,并为未来工程AI的发展指明方向。
链接: https://arxiv.org/abs/2509.21079
作者: Qixin Wan,Zilong Wang,Jingwen Zhou,Wanting Wang,Ziheng Geng,Jiachen Liu,Ran Cao,Minghui Cheng,Lu Cheng
机构: Hunan University (湖南大学); University of Miami (迈阿密大学); University of Illinois Chicago (芝加哥大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Foundation models have shown remarkable capabilities in various domains, but their performance on complex, multimodal engineering problems remains largely unexplored. We introduce SoM-1K, the first large-scale multimodal benchmark dataset dedicated to evaluating foundation models on problems in the strength of materials (SoM). The dataset, which contains 1,065 annotated SoM problems, mirrors real-world engineering tasks by including both textual problem statements and schematic diagrams. Due to the limited capabilities of current foundation models in understanding complicated visual information, we propose a novel prompting strategy called Descriptions of Images (DoI), which provides rigorous expert-generated text descriptions of the visual diagrams as the context. We evaluate eight representative foundation models, including both large language models (LLMs) and vision language models (VLMs). Our results show that current foundation models struggle significantly with these engineering problems, with the best-performing model achieving only 56.6% accuracy. Interestingly, we found that LLMs, when provided with DoI, often outperform VLMs provided with visual diagrams. A detailed error analysis reveals that DoI plays a crucial role in mitigating visual misinterpretation errors, suggesting that accurate text-based descriptions can be more effective than direct image input for current foundation models. This work establishes a rigorous benchmark for engineering AI and highlights a critical need for developing more robust multimodal reasoning capabilities in foundation models, particularly in scientific and engineering contexts.
zh
[NLP-32] Communication Bias in Large Language Models : A Regulatory Perspective
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在广泛应用中因偏见输出所引发的公平性与合规性问题,以及由此带来的社会影响。其解决方案的关键在于超越依赖持续监管的单一路径,强调加强市场竞争机制和设计治理(design governance)的重要性,以确保人工智能系统的公平性和可信度。
链接: https://arxiv.org/abs/2509.21075
作者: Adrian Kuenzler,Stefan Schmid
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) are increasingly central to many applications, raising concerns about bias, fairness, and regulatory compliance. This paper reviews risks of biased outputs and their societal impact, focusing on frameworks like the EU’s AI Act and the Digital Services Act. We argue that beyond constant regulation, stronger attention to competition and design governance is needed to ensure fair, trustworthy AI. This is a preprint of the Communications of the ACM article of the same title.
zh
[NLP-33] ScaleDiff: Scaling Difficult Problems for Advanced Mathematical Reasoning
【速读】: 该论文旨在解决当前大规模推理模型(Large Reasoning Models, LRM)在训练过程中难以高效获取高质量、高难度数学问题的问题。现有方法依赖复杂的提示工程或昂贵的API调用,且生成的问题难度有限,限制了模型性能提升。其解决方案的关键在于提出一个名为ScaleDiff的简单而高效的流水线:首先利用自适应思维模型(adaptive thinking model)通过单次前向传播快速筛选出困难问题;随后基于筛选结果训练专用的困难问题生成器(DiffGen-8B),可大规模生成新难题,无需复杂提示;最终通过在ScaleDiff-Math数据集上微调Qwen2.5-Math-7B-Instruct模型,实现显著性能提升(AIME’24等基准测试平均准确率达65.9%),并验证了困难问题数量与模型性能之间存在清晰的规模效应。该方案实现了低成本、高效率地构建高质量训练数据,有效提升了模型推理能力。
链接: https://arxiv.org/abs/2509.21070
作者: Qizhi Pei,Zhuoshi Pan,Honglin Lin,Xin Gao,Yu Li,Zinan Tang,Conghui He,Rui Yan,Lijun Wu
机构: Peking University (北京大学); Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 15 pages
Abstract:Large Reasoning Models (LRMs) have shown impressive capabilities in complex problem-solving, often benefiting from training on difficult mathematical problems that stimulate intricate reasoning. Recent efforts have explored automated synthesis of mathematical problems by prompting proprietary models or large-scale open-source models from seed data or inherent mathematical concepts. However, scaling up these methods remains challenging due to their high computational/API cost, complexity of prompting, and limited difficulty level of the generated problems. To overcome these limitations, we propose ScaleDiff, a simple yet effective pipeline designed to scale the creation of difficult problems. We efficiently identify difficult problems from existing datasets with only a single forward pass using an adaptive thinking model, which can perceive problem difficulty and automatically switch between “Thinking” and “NoThinking” modes. We then train a specialized difficult problem generator (DiffGen-8B) on this filtered difficult data, which can produce new difficult problems in large scale, eliminating the need for complex, per-instance prompting and its associated high API costs. Fine-tuning Qwen2.5-Math-7B-Instruct on the ScaleDiff-Math dataset yields a substantial performance increase of 11.3% compared to the original dataset and achieves a 65.9% average accuracy on AIME’24, AIME’25, HMMT-Feb’25, BRUMO’25, and MATH500, outperforming recent strong LRMs like OpenThinker3. Notably, this performance is achieved using the cost-efficient Qwen3-8B model as a teacher, demonstrating that our pipeline can effectively transfer advanced reasoning capabilities without relying on larger, more expensive teacher models. Furthermore, we observe a clear scaling phenomenon in model performance on difficult benchmarks as the quantity of difficult problems increases. Code: this https URL.
zh
[NLP-34] PMark: Towards Robust and Distortion-free Semantic-level Watermarking with Channel Constraints
【速读】: 该论文旨在解决当前语义级水印(Semantic-level Watermarking, SWM)方法在大语言模型(Large Language Models, LLMs)中缺乏强理论保障、且基于拒绝采样(reject-sampling)的生成方式易引入显著分布偏移的问题。其解决方案的关键在于提出一种基于代理函数(Proxy Function, PF)的新理论框架,其中PF将句子映射为标量值;在此基础上设计了PMark方法,通过动态采样估计下一句子的PF中位数,并同时施加多个PF约束(称为“通道”)以增强水印证据强度,从而在保证无失真(distortion-free)性质的同时提升对改写类攻击(paraphrasing-style attacks)的鲁棒性。
链接: https://arxiv.org/abs/2509.21057
作者: Jiahao Huo,Shuliang Liu,Bin Wang,Junyan Zhang,Yibo Yan,Aiwei Liu,Xuming Hu,Mingxun Zhou
机构: The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); The Hong Kong University of Science and Technology (香港科技大学); Peking University (北京大学); National University of Singapore (新加坡国立大学); Tsinghua University (清华大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:
Abstract:Semantic-level watermarking (SWM) for large language models (LLMs) enhances watermarking robustness against text modifications and paraphrasing attacks by treating the sentence as the fundamental unit. However, existing methods still lack strong theoretical guarantees of robustness, and reject-sampling-based generation often introduces significant distribution distortions compared with unwatermarked outputs. In this work, we introduce a new theoretical framework on SWM through the concept of proxy functions (PFs) \unicodex2013 functions that map sentences to scalar values. Building on this framework, we propose PMark, a simple yet powerful SWM method that estimates the PF median for the next sentence dynamically through sampling while enforcing multiple PF constraints (which we call channels) to strengthen watermark evidence. Equipped with solid theoretical guarantees, PMark achieves the desired distortion-free property and improves the robustness against paraphrasing-style attacks. We also provide an empirically optimized version that further removes the requirement for dynamical median estimation for better sampling efficiency. Experimental results show that PMark consistently outperforms existing SWM baselines in both text quality and robustness, offering a more effective paradigm for detecting machine-generated text. Our code will be released at [this URL](this https URL).
zh
[NLP-35] Disagreements in Reasoning : How a Models Thinking Process Dictates Persuasion in Multi-Agent Systems
【速读】: 该论文旨在解决多智能体系统(Multi-Agent Systems, MAS)中基于大语言模型(Large Language Models, LLMs)和大推理模型(Large Reasoning Models, LRMs)的说服动态机制问题,特别是澄清当前主流假设——即说服效力主要由模型规模决定——是否成立。研究发现,说服行为的核心驱动力并非模型规模,而是其内在认知过程,尤其是显式推理能力。解决方案的关键在于揭示了“说服二元性”(Persuasion Duality):LRMs 的推理过程具有更强的信念稳定性,不易被说服;但若将该推理过程透明化(即共享“思考内容”),则其对外部说服的能力显著增强。这一发现为理解多跳说服传播中的影响扩散与衰减机制提供了理论基础,并对未来MAS的安全性、鲁棒性及架构设计具有重要指导意义。
链接: https://arxiv.org/abs/2509.21054
作者: Haodong Zhao,Jidong Li,Zhaomin Wu,Tianjie Ju,Zhuosheng Zhang,Bingsheng He,Gongshen Liu
机构: Shanghai Jiao Tong University (上海交通大学); National University of Singapore (新加坡国立大学); Inner Mongolia Research Institute, Shanghai Jiao Tong University (内蒙古研究院,上海交通大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Work in progress
Abstract:The rapid proliferation of recent Multi-Agent Systems (MAS), where Large Language Models (LLMs) and Large Reasoning Models (LRMs) usually collaborate to solve complex problems, necessitates a deep understanding of the persuasion dynamics that govern their interactions. This paper challenges the prevailing hypothesis that persuasive efficacy is primarily a function of model scale. We propose instead that these dynamics are fundamentally dictated by a model’s underlying cognitive process, especially its capacity for explicit reasoning. Through a series of multi-agent persuasion experiments, we uncover a fundamental trade-off we term the Persuasion Duality. Our findings reveal that the reasoning process in LRMs exhibits significantly greater resistance to persuasion, maintaining their initial beliefs more robustly. Conversely, making this reasoning process transparent by sharing the “thinking content” dramatically increases their ability to persuade others. We further consider more complex transmission persuasion situations and reveal complex dynamics of influence propagation and decay within multi-hop persuasion between multiple agent networks. This research provides systematic evidence linking a model’s internal processing architecture to its external persuasive behavior, offering a novel explanation for the susceptibility of advanced models and highlighting critical implications for the safety, robustness, and design of future MAS.
zh
[NLP-36] When Instructions Multiply: Measuring and Estimating LLM Capabilities of Multiple Instructions Following EMNLP2025
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际应用中同时遵循多个指令时的能力评估问题,尤其关注指令数量增加对模型性能的影响。其关键解决方案是构建两个专用基准测试集——Many Instruction-Following Eval (ManyIFEval) 和 Style-aware Mostly Basic Programming Problems (StyleMBPP),用于系统性评估文本和代码生成任务中多指令遵循能力;并进一步提出三种回归模型,其中以指令数量作为解释变量的逻辑回归模型可有效预测未见过的指令组合及不同指令数下的性能表现,误差约为10%,且仅需少量样本(500和300)即可实现高效评估。
链接: https://arxiv.org/abs/2509.21051
作者: Keno Harada,Yudai Yamazaki,Masachika Taniguchi,Edison Marrese-Taylor,Takeshi Kojima,Yusuke Iwasawa,Yutaka Matsuo
机构: The University of Tokyo (东京大学); Kyoto University (京都大学); University of the Ryukyus (琉球大学)
类目: Computation and Language (cs.CL)
备注: Accepted to EMNLP2025
Abstract:As large language models (LLMs) are increasingly applied to real-world scenarios, it becomes crucial to understand their ability to follow multiple instructions simultaneously. To systematically evaluate these capabilities, we introduce two specialized benchmarks for fundamental domains where multiple instructions following is important: Many Instruction-Following Eval (ManyIFEval) for text generation with up to ten instructions, and Style-aware Mostly Basic Programming Problems (StyleMBPP) for code generation with up to six instructions. Our experiments with the created benchmarks across ten LLMs reveal that performance consistently degrades as the number of instructions increases. Furthermore, given the fact that evaluating all the possible combinations of multiple instructions is computationally impractical in actual use cases, we developed three types of regression models that can estimate performance on both unseen instruction combinations and different numbers of instructions which are not used during training. We demonstrate that a logistic regression model using instruction count as an explanatory variable can predict performance of following multiple instructions with approximately 10% error, even for unseen instruction combinations. We show that relatively modest sample sizes (500 for ManyIFEval and 300 for StyleMBPP) are sufficient for performance estimation, enabling efficient evaluation of LLMs under various instruction combinations.
zh
[NLP-37] Behind RoPE: How Does Causal Mask Encode Positional Information?
【速读】: 该论文试图解决的问题是:在Transformer解码器中,除了显式的位置编码(如RoPE)外,因果掩码(causal mask)是否也隐含地提供位置信息,并如何影响注意力机制中的位置敏感性。解决方案的关键在于通过理论分析和实证研究证明,因果掩码本身就能诱导出依赖位置的注意力模式,即使输入中没有参数或因果依赖关系;进一步发现,因果掩码与RoPE(Rotary Position Embedding)的交互会扭曲RoPE原本的相对注意力模式,使其变为非相对模式,从而揭示了因果掩码作为潜在位置信息来源的重要性,建议在模型设计中将其纳入考量。
链接: https://arxiv.org/abs/2509.21042
作者: Junu Kim,Xiao Liu,Zhenghao Lin,Lei Ji,Yeyun Gong,Edward Choi
机构: KAIST; Microsoft Research
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Codes available at: this https URL
Abstract:While explicit positional encodings such as RoPE are a primary source of positional information in Transformer decoders, the causal mask also provides positional information. In this work, we prove that the causal mask can induce position-dependent patterns in attention scores, even without parameters or causal dependency in the input. Our theoretical analysis indicates that the induced attention pattern tends to favor nearby query-key pairs, mirroring the behavior of common positional encodings. Empirical analysis confirms that trained models exhibit the same behavior, with learned parameters further amplifying these patterns. Notably, we found that the interaction of causal mask and RoPE distorts RoPE’s relative attention score patterns into non-relative ones. We consistently observed this effect in modern large language models, suggesting the importance of considering the causal mask as a source of positional information alongside explicit positional encodings.
zh
[NLP-38] Generative AI for FFRDCs
【速读】: 该论文旨在解决联邦资助的研究与开发中心(Federally Funded Research and Development Centers, FFRDCs)在处理大量文本数据时面临的效率瓶颈问题,例如政策文件、科学论文等文档的人工分析耗时长、成本高。其解决方案的关键在于利用大语言模型(Large Language Models, LLMs)实现自动化文本处理,包括摘要生成、分类、信息抽取和语义理解,仅需少量示例即可完成任务;同时通过部署OnPrem . LLM这一开源框架,确保生成式AI(Generative AI)在敏感政府场景下的安全性、可审计性和数据主权。
链接: https://arxiv.org/abs/2509.21040
作者: Arun S. Maiya
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 4
Abstract:Federally funded research and development centers (FFRDCs) face text-heavy workloads, from policy documents to scientific and engineering papers, that are slow to analyze manually. We show how large language models can accelerate summarization, classification, extraction, and sense-making with only a few input-output examples. To enable use in sensitive government contexts, we apply OnPrem . LLM, an open-source framework for secure and flexible application of generative AI. Case studies on defense policy documents and scientific corpora, including the National Defense Authorization Act (NDAA) and National Science Foundation (NSF) Awards, demonstrate how this approach enhances oversight and strategic analysis while maintaining auditability and data sovereignty.
zh
[NLP-39] CLAUSE: Agent ic Neuro-Symbolic Knowledge Graph Reasoning via Dynamic Learnable Context Engineering
【速读】: 该论文旨在解决多跳问答(multi-hop question answering)中知识图谱(Knowledge Graph, KG)上下文构建的效率与准确性之间的矛盾问题,即传统静态k-hop扩展或“思考更久”提示策略常导致过度检索、上下文膨胀及运行时不确定性。其核心解决方案是提出CLAUSETM——一种基于神经符号代理(neuro-symbolic agent)框架的三代理协同机制,通过将上下文构造建模为一个受资源预算约束的序贯决策过程,动态决定子图扩展、推理路径选择与回溯、证据保留及终止时机。关键创新在于引入拉格朗日约束的多智能体近端策略优化算法(Lagrangian-Constrained Multi-Agent Proximal Policy Optimization, LC-MAPPO),联合优化子图构建、路径发现与证据选择,并在边缘编辑次数、交互步数和选中文本量等用户指定预算下实现准确率、延迟与成本的自适应权衡,无需重新训练即可满足部署场景下的性能可预测性要求。
链接: https://arxiv.org/abs/2509.21035
作者: Yang Zhao,Chengxiao Dai,Wei Zhuo,Yue Xiu,Dusit Niyato
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Knowledge graphs provide structured context for multi-hop question answering, but deployed systems must balance answer accuracy with strict latency and cost targets while preserving provenance. Static k-hop expansions and “think-longer” prompting often over-retrieve, inflate context, and yield unpredictable runtime. We introduce CLAUSE, an agentic three-agent neuro-symbolic framework that treats context construction as a sequential decision process over knowledge graphs, deciding what to expand, which paths to follow or backtrack, what evidence to keep, and when to stop. Latency (interaction steps) and prompt cost (selected tokens) are exposed as user-specified budgets or prices, allowing per-query adaptation to trade-offs among accuracy, latency, and cost without retraining. CLAUSE employs the proposed Lagrangian-Constrained Multi-Agent Proximal Policy Optimization (LC-MAPPO) algorithm to coordinate three agents: Subgraph Architect, Path Navigator, and Context Curator, so that subgraph construction, reasoning-path discovery, and evidence selection are jointly optimized under per-query resource budgets on edge edits, interaction steps, and selected tokens. Across HotpotQA, MetaQA, and FactKG, CLAUSE yields higher EM@1 while reducing subgraph growth and end-to-end latency at equal or lower token budgets. On MetaQA-2-hop, relative to the strongest RAG baseline (GraphRAG), CLAUSE achieves +39.3 EM@1 with 18.6% lower latency and 40.9% lower edge growth. The resulting contexts are compact, provenance-preserving, and deliver predictable performance under deployment constraints.
zh
[NLP-40] DELTA-Code: How Does RL Unlock and Transfer New Programming Algorithms in LLM s?
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)是否能够通过强化学习(Reinforcement Learning, RL)习得或泛化出真正新颖的推理策略的问题,而不仅仅是依赖预训练或后训练阶段编码的技能。为探究这一问题,作者提出 DELTA-Code——一个受控的合成编码问题家族基准,用于评估两个核心维度:可学习性(learnability)(LLMs 是否能在 RL 训练下解决预训练模型在大量尝试中均失败的问题)和可迁移性(transferrability)(若可学习,这些技能能否系统性地迁移到分布外(Out-of-Distribution, OOD)测试集)。解决方案的关键在于:首先,使用模板化问题生成器隔离推理技能,并引入完全 OOD 的问题家族以避免工具调用或记忆模式;其次,通过分阶段热身(staged warm-up)、密集奖励设计、经验回放(experience replay)、课程学习(curriculum training)以及“验证闭环”(verification-in-the-loop)等训练机制,成功诱导出显著的“grokking 阶段跃迁”现象——即在长时间无奖励后,模型突然达到近乎完美的准确率。实验表明,模型在同家族内及重组技能上表现良好,但在变革性(transformative)迁移任务中仍存在明显短板,从而为理解 RL 驱动下的算法推理边界提供了清晰的测试平台。
链接: https://arxiv.org/abs/2509.21016
作者: Yiyou Sun,Yuhan Cao,Pohao Huang,Haoyue Bai,Hannaneh Hajishirzi,Nouha Dziri,Dawn Song
机构: University of California, Berkeley (加州大学伯克利分校); University of Wisconsin, Madison (威斯康星大学麦迪逊分校); University of Washington (华盛顿大学); Ai2
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:It remains an open question whether LLMs can acquire or generalize genuinely new reasoning strategies, beyond the sharpened skills encoded in their parameters during pre-training or post-training. To attempt to answer this debate, we introduce DELTA-Code–Distributional Evaluation of Learnability and Transferrability in Algorithmic Coding, a controlled benchmark of synthetic coding problem families designed to probe two fundamental aspects: learnability – can LLMs, through reinforcement learning (RL), solve problem families where pretrained models exhibit failure with large enough attempts (pass@K=0)? --and transferrability – if learnability happens, can such skills transfer systematically to out-of-distribution (OOD) test sets? Unlike prior public coding datasets, DELTA isolates reasoning skills through templated problem generators and introduces fully OOD problem families that demand novel strategies rather than tool invocation or memorized patterns. Our experiments reveal a striking grokking phase transition: after an extended period with near-zero reward, RL-trained models abruptly climb to near-perfect accuracy. To enable learnability on previously unsolvable problem families, we explore key training ingredients such as staged warm-up with dense rewards, experience replay, curriculum training, and verification-in-the-loop. Beyond learnability, we use DELTA to evaluate transferability or generalization along exploratory, compositional, and transformative axes, as well as cross-family transfer. Results show solid gains within families and for recomposed skills, but persistent weaknesses in transformative cases. DELTA thus offers a clean testbed for probing the limits of RL-driven reasoning and for understanding how models can move beyond existing priors to acquire new algorithmic skills.
zh
[NLP-41] Mechanism of Task-oriented Information Removal in In-context Learning
【速读】: 该论文旨在解决生成式 AI(Generative AI)中上下文学习(In-context Learning, ICL)机制不明确的问题,尤其是其在少样本场景下如何实现任务导向的推理。研究发现,在零样本场景中,语言模型(Language Models, LMs)的隐藏状态会编码所有可能任务的信息,形成非选择性表示,导致输出随机且准确率接近零;而通过低秩滤波器有选择地移除冗余信息,可引导模型聚焦于目标任务。解决方案的关键在于:ICL 本质上模拟了这种任务导向的信息移除过程——从纠缠的非选择性表征中剔除无关信息,并利用演示样例优化输出。进一步识别出关键的注意力头称为“去噪头”(Denoising Heads),它们负责执行信息移除操作;通过阻断这些头的机制,ICL 准确率显著下降,尤其在正确标签未出现在演示中的情况下,验证了信息移除机制及其核心组件的重要性。
链接: https://arxiv.org/abs/2509.21012
作者: Hakaze Cho,Haolin Yang,Gouki Minegishi,Naoya Inoue
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 67 pages, 70 figures, 7 tables
Abstract:In-context Learning (ICL) is an emerging few-shot learning paradigm based on modern Language Models (LMs), yet its inner mechanism remains unclear. In this paper, we investigate the mechanism through a novel perspective of information removal. Specifically, we demonstrate that in the zero-shot scenario, LMs encode queries into non-selective representations in hidden states containing information for all possible tasks, leading to arbitrary outputs without focusing on the intended task, resulting in near-zero accuracy. Meanwhile, we find that selectively removing specific information from hidden states by a low-rank filter effectively steers LMs toward the intended task. Building on these findings, by measuring the hidden states on carefully designed metrics, we observe that few-shot ICL effectively simulates such task-oriented information removal processes, selectively removing the redundant information from entangled non-selective representations, and improving the output based on the demonstrations, which constitutes a key mechanism underlying ICL. Moreover, we identify essential attention heads inducing the removal operation, termed Denoising Heads, which enables the ablation experiments blocking the information removal operation from the inference, where the ICL accuracy significantly degrades, especially when the correct label is absent from the few-shot demonstrations, confirming both the critical role of the information removal mechanism and denoising heads.
zh
[NLP-42] Binary Autoencoder for Mechanistic Interpretability of Large Language Models
【速读】: 该论文旨在解决现有方法在从大语言模型(Large Language Models, LLMs)隐藏状态中解耦原子化数值特征(features)时,因依赖隐式训练正则化(如L₁归一化或top-k函数)而导致全局稀疏性无法保障的问题,从而产生大量密集(同时不活跃)的特征,损害特征稀疏性和原子化程度。解决方案的关键在于提出一种新型自编码器——二值自编码器(Binary Autoencoder, BAE),其通过在小批量(minibatch)层面强制最小化隐藏激活的熵,促进跨实例的特征独立性和稀疏性;为实现高效熵计算,BAE采用阶梯函数将隐藏激活离散化为1比特,并引入梯度估计以支持反向传播,从而在保持可微性的同时显著提升特征提取的稀疏性与可解释性。
链接: https://arxiv.org/abs/2509.20997
作者: Hakaze Cho,Haolin Yang,Brian M. Kurkoski,Naoya Inoue
机构: JAIST(日本信息科学与技术大学院大学); University of Chicago(芝加哥大学); RIKEN(理化学研究所)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 36 pages, 41 figures, 3 tables
Abstract:Existing works are dedicated to untangling atomized numerical components (features) from the hidden states of Large Language Models (LLMs) for interpreting their mechanism. However, they typically rely on autoencoders constrained by some implicit training-time regularization on single training instances (i.e., L_1 normalization, top-k function, etc.), without an explicit guarantee of global sparsity among instances, causing a large amount of dense (simultaneously inactive) features, harming the feature sparsity and atomization. In this paper, we propose a novel autoencoder variant that enforces minimal entropy on minibatches of hidden activations, thereby promoting feature independence and sparsity across instances. For efficient entropy calculation, we discretize the hidden activations to 1-bit via a step function and apply gradient estimation to enable backpropagation, so that we term it as Binary Autoencoder (BAE) and empirically demonstrate two major applications: (1) Feature set entropy calculation. Entropy can be reliably estimated on binary hidden activations, which we empirically evaluate and leverage to characterize the inference dynamics of LLMs and In-context Learning. (2) Feature untangling. Similar to typical methods, BAE can extract atomized features from LLM’s hidden states. To robustly evaluate such feature extraction capability, we refine traditional feature-interpretation methods to avoid unreliable handling of numerical tokens, and show that BAE avoids dense features while producing the largest number of interpretable ones among baselines, which confirms the effectiveness of BAE serving as a feature extractor.
zh
[NLP-43] Analysis of instruction-based LLM s capabilities to score and judge text-input problems in an academic setting
【速读】: 该论文旨在解决学术文本输入题(Text-Input Problems)的自动化评分问题,特别是在高等教育场景下如何利用大语言模型(Large Language Models, LLMs)实现准确、公平且具有解释性的自动评估。其核心解决方案是提出五种基于LLM的评价系统,并通过对比人类评分结果验证有效性,其中最关键的是“参考答案辅助评价”(Reference Aided Evaluation)方法——该方法在评分准确性上表现最优,表现为最低的中位绝对偏差(0.945)和均方根偏差(1.214),并通过引入正确答案作为引导信息显著提升了评分的一致性和完整性,从而证明了AI驱动的自动评价系统在教育评估中具备作为辅助工具的应用潜力。
链接: https://arxiv.org/abs/2509.20982
作者: Valeria Ramirez-Garcia,David de-Fitero-Dominguez,Antonio Garcia-Cabot,Eva Garcia-Lopez
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) can act as evaluators, a role studied by methods like LLM-as-a-Judge and fine-tuned judging LLMs. In the field of education, LLMs have been studied as assistant tools for students and teachers. Our research investigates LLM-driven automatic evaluation systems for academic Text-Input Problems using rubrics. We propose five evaluation systems that have been tested on a custom dataset of 110 answers about computer science from higher education students with three models: JudgeLM, Llama-3.1-8B and DeepSeek-R1-Distill-Llama-8B. The evaluation systems include: The JudgeLM evaluation, which uses the model’s single answer prompt to obtain a score; Reference Aided Evaluation, which uses a correct answer as a guide aside from the original context of the question; No Reference Evaluation, which ommits the reference answer; Additive Evaluation, which uses atomic criteria; and Adaptive Evaluation, which is an evaluation done with generated criteria fitted to each question. All evaluation methods have been compared with the results of a human evaluator. Results show that the best method to automatically evaluate and score Text-Input Problems using LLMs is Reference Aided Evaluation. With the lowest median absolute deviation (0.945) and the lowest root mean square deviation (1.214) when compared to human evaluation, Reference Aided Evaluation offers fair scoring as well as insightful and complete evaluations. Other methods such as Additive and Adaptive Evaluation fail to provide good results in concise answers, No Reference Evaluation lacks information needed to correctly assess questions and JudgeLM Evaluations have not provided good results due to the model’s limitations. As a result, we conclude that Artificial Intelligence-driven automatic evaluation systems, aided with proper methodologies, show potential to work as complementary tools to other academic resources.
zh
[NLP-44] CLUE: Conflict-guided Localization for LLM Unlearning Framework
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在执行遗忘(unlearning)任务时存在的关键问题:现有基于定位的方法无法有效区分负责遗忘不良知识的神经元与负责保留必要技能的神经元,导致干预策略过于粗放,易引发灾难性过遗忘或目标知识未被完全擦除的问题。解决方案的关键在于引入电路发现(circuit discovery)这一机制可解释性技术,提出Conflict-guided Localization for LLM Unlearning framEwork(CLUE),通过识别“遗忘电路”和“保留电路”,将这些电路转化为合取范式(conjunctive normal form, CNF),并利用CNF可满足性求解结果精确判定每个神经元应被遗忘或保留,进而实施针对性微调策略,实现对目标知识的高效擦除与非目标能力的精准保留。
链接: https://arxiv.org/abs/2509.20977
作者: Hang Chen,Jiaying Zhu,Xinyu Yang,Wenya Wang
机构: Xi’an Jiaotong University (西安交通大学); The Chinese University of Hong Kong (香港中文大学); Nanyang Technological University (南洋理工大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 10 pages
Abstract:The LLM unlearning aims to eliminate the influence of undesirable data without affecting causally unrelated information. This process typically involves using a forget set to remove target information, alongside a retain set to maintain non-target capabilities. While recent localization-based methods demonstrate promise in identifying important neurons to be unlearned, they fail to disentangle neurons responsible for forgetting undesirable knowledge or retaining essential skills, often treating them as a single entangled group. As a result, these methods apply uniform interventions, risking catastrophic over-forgetting or incomplete erasure of the target knowledge. To address this, we turn to circuit discovery, a mechanistic interpretability technique, and propose the Conflict-guided Localization for LLM Unlearning framEwork (CLUE). This framework identifies the forget and retain circuit composed of important neurons, and then the circuits are transformed into conjunctive normal forms (CNF). The assignment of each neuron in the CNF satisfiability solution reveals whether it should be forgotten or retained. We then provide targeted fine-tuning strategies for different categories of neurons. Extensive experiments demonstrate that, compared to existing localization methods, CLUE achieves superior forget efficacy and retain utility through precise neural localization.
zh
[NLP-45] ool Calling for Arabic LLM s: Data Strategies and Instruction Tuning
【速读】: 该论文旨在解决阿拉伯语大型语言模型(Large Language Models, LLMs)在工具调用(tool calling)能力方面的资源匮乏与研究空白问题,尤其关注如何有效提升阿拉伯语环境下生成式 AI(Generative AI)对工具的调用性能。其解决方案的关键在于:首先,通过翻译和适配两个开源工具调用数据集至阿拉伯语,填补了该语言在工具调用任务中的数据资源缺口;其次,系统性地评估了三种策略的有效性——即使用阿拉伯语原生数据 vs. 跨语言迁移、通用指令微调的作用,以及针对高优先级特定工具进行微调的价值。实验基于多个阿拉伯语基础模型及其后训练变体展开,为构建面向阿拉伯语场景的鲁棒工具增强型智能体提供了实证依据和优化路径。
链接: https://arxiv.org/abs/2509.20957
作者: Asim Ersoy,Enes Altinisik,Husrev Taha Sencar,Kareem Darwish
机构: Qatar Computing Research Institute, HBKU, Qatar (卡塔尔计算研究研究所,HBKU,卡塔尔)
类目: Computation and Language (cs.CL)
备注:
Abstract:Tool calling is a critical capability that allows Large Language Models (LLMs) to interact with external systems, significantly expanding their utility. However, research and resources for tool calling are predominantly English-centric, leaving a gap in our understanding of how to enable this functionality for other languages, such as Arabic. This paper investigates three key research questions: (1) the necessity of in-language (Arabic) tool-calling data versus relying on cross-lingual transfer, (2) the effect of general-purpose instruction tuning on tool-calling performance, and (3) the value of fine-tuning on specific, high-priority tools. To address these questions, we conduct extensive experiments using base and post-trained variants of an open-weight Arabic LLM. To enable this study, we bridge the resource gap by translating and adapting two open-source tool-calling datasets into Arabic. Our findings provide crucial insights into the optimal strategies for developing robust tool-augmented agents for Arabic.
zh
[NLP-46] Cross-Linguistic Analysis of Memory Load in Sentence Comprehension: Linear Distance and Structural Density
【速读】: 该论文试图解决句子理解过程中句法相关词之间记忆负荷(memory load)的解释问题,即究竟是线性距离(linear proximity)还是结构密度(structural density)更能准确预测认知负荷。其解决方案的关键在于提出并验证“干预者复杂度”(Intervener Complexity),即头词与其依存词之间插入的中心词数量,作为结构上更贴近实际整合与保持需求的指标。通过统一标注的依存树库和跨语言混合效应模型,研究发现:句子长度、依存距离和干预者复杂度均与记忆负荷正相关,其中干预者复杂度在控制线性距离后仍具独立解释力,从而在表面线性距离与深层句法结构之间建立了桥梁,为理解句法加工效率提供了结构化的新视角。
链接: https://arxiv.org/abs/2509.20916
作者: Krishna Aggarwal
机构: Indian Institute of Science Education and Research (IISER), Mohali, India.
类目: Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
备注: 7 pages, 4 figures (Figure 2 has 3 sub-divisions)
Abstract:This study examines whether sentence-level memory load in comprehension is better explained by linear proximity between syntactically related words or by the structural density of the intervening material. Building on locality-based accounts and cross-linguistic evidence for dependency length minimization, the work advances Intervener Complexity-the number of intervening heads between a head and its dependent-as a structurally grounded lens that refines linear distance measures. Using harmonized dependency treebanks and a mixed-effects framework across multiple languages, the analysis jointly evaluates sentence length, dependency length, and Intervener Complexity as predictors of the Memory-load measure. Studies in Psycholinguistics have reported the contributions of feature interference and misbinding to memory load during processing. For this study, I operationalized sentence-level memory load as the linear sum of feature misbinding and feature interference for tractability; current evidence does not establish that their cognitive contributions combine additively. All three factors are positively associated with memory load, with sentence length exerting the broadest influence and Intervener Complexity offering explanatory power beyond linear distance. Conceptually, the findings reconcile linear and hierarchical perspectives on locality by treating dependency length as an important surface signature while identifying intervening heads as a more proximate indicator of integration and maintenance demands. Methodologically, the study illustrates how UD-based graph measures and cross-linguistic mixed-effects modelling can disentangle linear and structural contributions to processing efficiency, providing a principled path for evaluating competing theories of memory load in sentence comprehension.
zh
[NLP-47] MemLens: Uncovering Memorization in LLM s with Activation Trajectories
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在评估基准(如AIME和Math500)中因数据污染(contamination)而导致的性能高估问题,尤其是现有检测方法依赖表面词汇重叠和困惑度(perplexity)时,在面对隐式污染数据时泛化能力差的问题。其解决方案的关键在于提出MemLens——一种基于激活轨迹(activation lens)的记忆检测方法,通过分析数值型token在生成过程中的概率变化路径发现:污染样本在模型早期层即快速锁定答案并表现出高置信度,而干净样本则呈现更渐进的证据积累模式;这种差异化的推理轨迹为识别真实记忆行为提供了可靠依据,且经LoRA微调注入设计样本后仍保持一致模式,验证了该方法的有效性与鲁棒性。
链接: https://arxiv.org/abs/2509.20909
作者: Zirui He,Haiyan Zhao,Ali Payani,Mengnan du
机构: New Jersey Institute of Technology (新泽西理工学院); Cisco Research (思科研究院)
类目: Computation and Language (cs.CL)
备注: 20pages, 11 figures, 7 tables
Abstract:Large language models (LLMs) are commonly evaluated on challenging benchmarks such as AIME and Math500, which are susceptible to contamination and risk of being memorized. Existing detection methods, which primarily rely on surface-level lexical overlap and perplexity, demonstrate low generalization and degrade significantly when encountering implicitly contaminated data. In this paper, we propose MemLens (An Activation Lens for Memorization Detection) to detect memorization by analyzing the probability trajectories of numeric tokens during generation. Our method reveals that contaminated samples exhibit ``shortcut’’ behaviors, locking onto an answer with high confidence in the model’s early layers, whereas clean samples show more gradual evidence accumulation across the model’s full depth. We observe that contaminated and clean samples exhibit distinct and well-separated reasoning trajectories. To further validate this, we inject carefully designed samples into the model through LoRA fine-tuning and observe the same trajectory patterns as in naturally contaminated data. These results provide strong evidence that MemLens captures genuine signals of memorization rather than spurious correlations.
zh
[NLP-48] Learning to Summarize by Learning to Quiz: Adversarial Agent ic Collaboration for Long Document Summarization
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在长文档摘要生成中存在的信息丢失、事实不一致性和连贯性差等问题。其解决方案的关键在于提出一种名为SummQ的对抗性多智能体框架,通过两个互补领域内专业化智能体之间的协作实现高质量摘要:一方面,摘要生成器与评审器协同生成并评估摘要;另一方面,问答生成器与评审器构建理解性问题作为持续的质量检验机制;此外,通过一个考生智能体验证摘要是否包含回答这些问题所需的信息,从而形成多维度反馈闭环,驱动迭代优化。该设计利用对抗性代理协作机制显著提升了摘要的准确性、完整性和一致性。
链接: https://arxiv.org/abs/2509.20900
作者: Weixuan Wang,Minghao Wu,Barry Haddow,Alexandra Birch
机构: University of Edinburgh (爱丁堡大学); Monash University (莫纳什大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Long document summarization remains a significant challenge for current large language models (LLMs), as existing approaches commonly struggle with information loss, factual inconsistencies, and coherence issues when processing excessively long documents. We propose SummQ, a novel adversarial multi-agent framework that addresses these limitations through collaborative intelligence between specialized agents operating in two complementary domains: summarization and quizzing. Our approach employs summary generators and reviewers that work collaboratively to create and evaluate comprehensive summaries, while quiz generators and reviewers create comprehension questions that serve as continuous quality checks for the summarization process. This adversarial dynamic, enhanced by an examinee agent that validates whether the generated summary contains the information needed to answer the quiz questions, enables iterative refinement through multifaceted feedback mechanisms. We evaluate SummQ on three widely used long document summarization benchmarks. Experimental results demonstrate that our framework significantly outperforms existing state-of-the-art methods across ROUGE and BERTScore metrics, as well as in LLM-as-a-Judge and human evaluations. Our comprehensive analyses reveal the effectiveness of the multi-agent collaboration dynamics, the influence of different agent configurations, and the impact of the quizzing mechanism. This work establishes a new approach for long document summarization that uses adversarial agentic collaboration to improve summarization quality.
zh
[NLP-49] On Theoretical Interpretations of Concept-Based In-Context Learning
【速读】: 该论文旨在解决生成式 AI(Generative AI)中上下文学习(In-Context Learning, ICL)机制的理论理解不足问题,特别是针对概念驱动的上下文学习(Concept-based ICL, CB-ICL)方法。其解决方案的关键在于提出一套理论分析框架,该框架能够解释CB-ICL为何在仅提供少量示例的情况下仍能有效预测查询标签,并量化大语言模型(Large Language Model, LLM)可从提示中提取的知识量;同时,该理论推导出一种基于提示示例与查询输入之间相似性的度量指标,为LLM预训练策略和提示工程提供了理论指导。此外,理论还系统分析了提示示例数量和LLM嵌入维度对ICL性能的影响,从而为实际应用中的参数配置提供依据。
链接: https://arxiv.org/abs/2509.20882
作者: Huaze Tang,Tianren Peng,Shao-lun Huang
机构: Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院)
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:In-Context Learning (ICL) has emerged as an important new paradigm in natural language processing and large language model (LLM) applications. However, the theoretical understanding of the ICL mechanism remains limited. This paper aims to investigate this issue by studying a particular ICL approach, called concept-based ICL (CB-ICL). In particular, we propose theoretical analyses on applying CB-ICL to ICL tasks, which explains why and when the CB-ICL performs well for predicting query labels in prompts with only a few demonstrations. In addition, the proposed theory quantifies the knowledge that can be leveraged by the LLMs to the prompt tasks, and leads to a similarity measure between the prompt demonstrations and the query input, which provides important insights and guidance for model pre-training and prompt engineering in ICL. Moreover, the impact of the prompt demonstration size and the dimension of the LLM embeddings in ICL are also explored based on the proposed theory. Finally, several real-data experiments are conducted to validate the practical usefulness of CB-ICL and the corresponding theory.
zh
[NLP-50] StyleBench: Evaluating thinking styles in Large Language Models
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中推理风格(reasoning styles)与模型架构、任务类型之间复杂交互关系不明确的问题。其解决方案的关键在于提出StyleBench——一个系统性评估多种推理风格在多样化任务和模型上的综合基准,通过在15个开源模型(涵盖270M至120B参数)上测试五种代表性推理风格(Chain of Thought, Tree of Thought, Algorithm of Thought, Sketch of Thought, Chain-of-Draft),揭示出没有单一推理风格在所有场景下最优,且策略有效性高度依赖于模型规模和任务特性:搜索类方法(如AoT、ToT)在开放式问题中表现优异但需大模型支撑,而简洁类方法(如SoT、CoD)在结构化任务中可实现显著效率提升;同时发现小模型易偏离指令、缺乏推理鲁棒性,该现象随模型规模增长而改善,从而为基于具体约束选择最优推理策略提供了实证依据和实践路径。
链接: https://arxiv.org/abs/2509.20868
作者: Junyu Guo,Shangding Gu,Ming Jin,Costas Spanos,Javad Lavaei
机构: University of California, Berkeley (加州大学伯克利分校); Virginia Tech (弗吉尼亚理工大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:The effectiveness of Large Language Models (LLMs) is heavily influenced by the reasoning strategies, or styles of thought, employed in their prompts. However, the interplay between these reasoning styles, model architecture, and task type remains poorly understood. To address this, we introduce StyleBench, a comprehensive benchmark for systematically evaluating reasoning styles across diverse tasks and models. We assess five representative reasoning styles, including Chain of Thought (CoT), Tree of Thought (ToT), Algorithm of Thought (AoT), Sketch of Thought (SoT), and Chain-of-Draft (CoD) on five reasoning tasks, using 15 open-source models from major families (LLaMA, Qwen, Mistral, Gemma, GPT-OSS, Phi, and DeepSeek) ranging from 270M to 120B parameters. Our large-scale analysis reveals that no single style is universally optimal. We demonstrate that strategy efficacy is highly contingent on both model scale and task type: search-based methods (AoT, ToT) excel in open-ended problems but require large-scale models, while concise styles (SoT, CoD) achieve radical efficiency gains on well-defined tasks. Furthermore, we identify key behavioral patterns: smaller models frequently fail to follow output instructions and default to guessing, while reasoning robustness emerges as a function of scale. Our findings offer a crucial roadmap for selecting optimal reasoning strategies based on specific constraints, we open source the benchmark in this https URL.
zh
[NLP-51] Single Answer is Not Enough: On Generating Ranked Lists with Medical Reasoning Models
【速读】: 该论文旨在解决当前医学推理模型(Medical Reasoning Models, MRMs)在处理开放性问题时仅生成单一答案,而临床决策通常需要考虑多个备选方案以降低偏见风险的问题。解决方案的关键在于提出并系统评估两种生成有序答案列表(ranked lists)的方法:一是基于提示(prompting)的策略,二是通过监督微调(Supervised Fine-Tuning, SFT)与强化学习微调(Reinforcement Fine-Tuning, RFT)进行模型训练。其中,RFT通过设计针对排序列表格式的新奖励函数,显著提升了模型在多种输出格式下的鲁棒性,优于仅依赖SFT的模型,从而为医学领域中更符合实际需求的多答案生成提供了可行路径。
链接: https://arxiv.org/abs/2509.20866
作者: Pittawat Taveekitworachai,Natpatchara Pongjirapat,Krittaphas Chaisutyakorn,Piyalitt Ittichaiwong,Tossaporn Saengja,Kunat Pipatanakul
机构: SCB 10X R&D(泰国暹罗商业银行10X研发部门); SCB 10X(泰国暹罗商业银行10X); SCBX Group(泰国暹罗商业银行集团); Faculty of Medicine Siriraj Hospital(泰国诗里拉吉医院医学院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 51 pages, 27 figures
Abstract:This paper presents a systematic study on enabling medical reasoning models (MRMs) to generate ranked lists of answers for open-ended questions. Clinical decision-making rarely relies on a single answer but instead considers multiple options, reducing the risks of narrow perspectives. Yet current MRMs are typically trained to produce only one answer, even in open-ended settings. We propose an alternative format: ranked lists and investigate two approaches: prompting and fine-tuning. While prompting is a cost-effective way to steer an MRM’s response, not all MRMs generalize well across different answer formats: choice, short text, and list answers. Based on our prompting findings, we train and evaluate MRMs using supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). SFT teaches a model to imitate annotated responses, and RFT incentivizes exploration through the responses that maximize a reward. We propose new reward functions targeted at ranked-list answer formats, and conduct ablation studies for RFT. Our results show that while some SFT models generalize to certain answer formats, models trained with RFT are more robust across multiple formats. We also present a case study on a modified MedQA with multiple valid answers, finding that although MRMs might fail to select the benchmark’s preferred ground truth, they can recognize valid answers. To the best of our knowledge, this is the first systematic investigation of approaches for enabling MRMs to generate answers as ranked lists. We hope this work provides a first step toward developing alternative answer formats that are beneficial beyond single answers in medical domains.
zh
[NLP-52] WeFT: Weighted Entropy-driven Fine-Tuning for dLLM s
【速读】: 该论文旨在解决扩散语言模型(diffusion language models)在监督微调(supervised fine-tuning, SFT)过程中因缺乏每个去噪步骤的精确概率估计而导致生成不可控、不一致的问题。其核心挑战在于如何有效引导生成过程,确保关键token对输出方向具有可控性。解决方案的关键是提出WeFT(Weighted SFT),该方法基于扩散理论,根据token的熵值为其分配不同权重,从而在微调阶段强化高信息量token的学习信号,显著提升模型在逻辑推理任务上的性能表现。
链接: https://arxiv.org/abs/2509.20863
作者: Guowei Xu,Wenxin Xu,Jiawang Zhao,Kaisheng Ma
机构: Tsinghua University (清华大学)
类目: Computation and Language (cs.CL)
备注: preprint
Abstract:Diffusion models have recently shown strong potential in language modeling, offering faster generation compared to traditional autoregressive approaches. However, applying supervised fine-tuning (SFT) to diffusion models remains challenging, as they lack precise probability estimates at each denoising step. While the diffusion mechanism enables the model to reason over entire sequences, it also makes the generation process less predictable and often inconsistent. This highlights the importance of controlling key tokens that guide the direction of generation. To address this issue, we propose WeFT, a weighted SFT method for diffusion language models, where tokens are assigned different weights based on their entropy. Derived from diffusion theory, WeFT delivers substantial gains: training on s1K, s1K-1.1, and 3k samples from open-r1, it achieves relative improvements of 39%, 64%, and 83% over standard SFT on four widely used reasoning benchmarks (Sudoku, Countdown, GSM8K, and MATH-500). The code and models will be made publicly available.
zh
[NLP-53] Concise and Sufficient Sub-Sentence Citations for Retrieval-Augmented Generation
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)问答系统中引用标注(attribution)存在的两个核心问题:一是现有方法生成的引用通常以句子甚至段落为单位,导致包含大量无关内容;二是句子级引用可能遗漏关键验证信息,迫使用户阅读上下文才能确认输出正确性。解决方案的关键在于提出生成细粒度的子句级引用(sub-sentence citations),使其在保持简洁的同时具备充分的可验证性。为此,作者首先制定标注规范并构建相应数据集,进而设计了一个基于大语言模型(Large Language Model, LLM)自动生成微调数据、结合信用评分模型(credit model)过滤低质量样本的引用生成框架,实验证明该方法能显著提升引用的质量与可读性。
链接: https://arxiv.org/abs/2509.20859
作者: Guo Chen,Qiuyuan Li,Qiuxian Li,Hongliang Dai,Xiang Chen,Piji Li
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:In retrieval-augmented generation (RAG) question answering systems, generating citations for large language model (LLM) outputs enhances verifiability and helps users identify potential hallucinations. However, we observe two problems in the citations produced by existing attribution methods. First, the citations are typically provided at the sentence or even paragraph level. Long sentences or paragraphs may include a substantial amount of irrelevant content. Second, sentence-level citations may omit information that is essential for verifying the output, forcing users to read the surrounding context. In this paper, we propose generating sub-sentence citations that are both concise and sufficient, thereby reducing the effort required by users to confirm the correctness of the generated output. To this end, we first develop annotation guidelines for such citations and construct a corresponding dataset. Then, we propose an attribution framework for generating citations that adhere to our standards. This framework leverages LLMs to automatically generate fine-tuning data for our task and employs a credit model to filter out low-quality examples. Our experiments on the constructed dataset demonstrate that the propose approach can generate high-quality and more readable citations.
zh
[NLP-54] Zero-Shot Privacy-Aware Text Rewriting via Iterative Tree Search
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在云服务中应用时引发的隐私泄露问题,即用户输入可能无意中暴露敏感信息,而传统文本匿名化与去标识化技术(如基于规则的删除和擦除)难以在隐私保护、文本自然性和实用性之间取得良好平衡。其解决方案的关键在于提出一种零样本(zero-shot)、基于树搜索的迭代句子重写算法,通过奖励模型引导结构化搜索过程,逐步对隐私敏感片段进行改写或删除,从而在保持语义连贯性、相关性和自然性的前提下实现高效隐私保护。实验表明,该方法显著优于现有基线,在隐私保护与信息效用之间实现了更优权衡。
链接: https://arxiv.org/abs/2509.20838
作者: Shuo Huang,Xingliang Yuan,Gholamreza Haffari,Lizhen Qu
机构: Monash University (蒙纳士大学); University of Melbourne (墨尔本大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:The increasing adoption of large language models (LLMs) in cloud-based services has raised significant privacy concerns, as user inputs may inadvertently expose sensitive information. Existing text anonymization and de-identification techniques, such as rule-based redaction and scrubbing, often struggle to balance privacy preservation with text naturalness and utility. In this work, we propose a zero-shot, tree-search-based iterative sentence rewriting algorithm that systematically obfuscates or deletes private information while preserving coherence, relevance, and naturalness. Our method incrementally rewrites privacy-sensitive segments through a structured search guided by a reward model, enabling dynamic exploration of the rewriting space. Experiments on privacy-sensitive datasets show that our approach significantly outperforms existing baselines, achieving a superior balance between privacy protection and utility preservation.
zh
[NLP-55] Verification Limits Code LLM Training
【速读】: 该论文旨在解决生成式 AI(Generative AI)在代码生成任务中因依赖合成数据而面临的“验证上限”(verification ceiling)问题,即训练数据的质量与多样性受限于合成验证器(synthetic verifier)的能力。解决方案的关键在于对验证机制进行重新校准:一方面通过引入更灵活的验证策略(如放宽通过阈值或采用基于大语言模型的软验证),释放原本被严格规则过滤掉的有价值训练样本;另一方面强调保留每道题目多样且具有挑战性的正确解法,从而提升模型泛化能力。研究发现,当前过于刚性的验证方式抑制了数据多样性,但完全摒弃验证不可行,唯有结合校准后的验证机制与高质量、多样化的题解对,才能突破验证瓶颈,推动代码生成模型性能进一步提升。
链接: https://arxiv.org/abs/2509.20837
作者: Srishti Gureja,Elena Tommasone,Jingyi He,Sara Hooker,Matthias Gallé,Marzieh Fadaee
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Large language models for code generation increasingly rely on synthetic data, where both problem solutions and verification tests are generated by models. While this enables scalable data creation, it introduces a previously unexplored bottleneck: the verification ceiling, in which the quality and diversity of training data are fundamentally constrained by the capabilities of synthetic verifiers. In this work, we systematically study how verification design and strategies influence model performance. We investigate (i) what we verify by analyzing the impact of test complexity and quantity: richer test suites improve code generation capabilities (on average +3 pass@1), while quantity alone yields diminishing returns, (ii) how we verify by exploring relaxed pass thresholds: rigid 100% pass criteria can be overly restrictive. By allowing for relaxed thresholds or incorporating LLM-based soft verification, we can recover valuable training data, leading to a 2-4 point improvement in pass@1 performance. However, this benefit is contingent upon the strength and diversity of the test cases used, and (iii) why verification remains necessary through controlled comparisons of formally correct versus incorrect solutions and human evaluation: retaining diverse correct solutions per problem yields consistent generalization gains. Our results show that Verification as currently practiced is too rigid, filtering out valuable diversity. But it cannot be discarded, only recalibrated. By combining calibrated verification with diverse, challenging problem-solution pairs, we outline a path to break the verification ceiling and unlock stronger code generation models.
zh
[NLP-56] Distilling Many-Shot In-Context Learning into a Cheat Sheet EMNLP2025
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在上下文学习(In-Context Learning, ICL)中因使用多示例(many-shot)输入而导致的高计算开销问题,即长输入序列带来的token消耗和推理延迟。其解决方案的关键在于提出“速记式上下文学习”(cheat-sheet ICL),通过将多示例ICL中的关键信息提炼为一个简洁的文本摘要(cheat sheet),并在推理时作为上下文使用,从而显著减少token数量,同时保持或提升模型性能。该方法无需测试时检索,即可实现与基于检索的ICL相当的效果,是一种高效且实用的LLM下游任务部署策略。
链接: https://arxiv.org/abs/2509.20820
作者: Ukyo Honda,Soichiro Murakami,Peinan Zhang
机构: CyberAgent( CyberAgent)
类目: Computation and Language (cs.CL)
备注: Accepted to EMNLP 2025 (Findings)
Abstract:Recent advances in large language models (LLMs) enable effective in-context learning (ICL) with many-shot examples, but at the cost of high computational demand due to longer input tokens. To address this, we propose cheat-sheet ICL, which distills the information from many-shot ICL into a concise textual summary (cheat sheet) used as the context at inference time. Experiments on challenging reasoning tasks show that cheat-sheet ICL achieves comparable or better performance than many-shot ICL with far fewer tokens, and matches retrieval-based ICL without requiring test-time retrieval. These findings demonstrate that cheat-sheet ICL is a practical alternative for leveraging LLMs in downstream tasks.
zh
[NLP-57] Leverag ing Whats Overfixed: Post-Correction via LLM Grammatical Error Overcorrection EMNLP2025
【速读】: 该论文旨在解决小语言模型(sLM)在语法错误纠正(GEC)任务中因过度依赖监督微调而导致的召回率低、难以纠正复杂或罕见错误的问题,同时缓解大语言模型(LLM)因生成能力过强而产生的过度修正问题(即精确率下降)。其解决方案的关键在于提出一种名为“通过过度修正进行后校正”(Post-Correction via Overcorrection, PoCO)的新方法:首先利用LLM主动触发过度修正以最大化召回率,随后通过微调小模型对LLM输出进行针对性后校正,识别并修正其中的错误,从而在保持高精度的同时显著提升召回率,实现GEC性能的平衡优化。
链接: https://arxiv.org/abs/2509.20811
作者: Taehee Park,Heejin Do,Gary Geunbae Lee
机构: POSTECH(浦项科技大学); ETH Zurich(苏黎世联邦理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: EMNLP 2025
Abstract:Robust supervised fine-tuned small Language Models (sLMs) often show high reliability but tend to undercorrect. They achieve high precision at the cost of low recall. Conversely, Large Language Models (LLMs) often show the opposite tendency, making excessive overcorrection, leading to low precision. To effectively harness the strengths of LLMs to address the recall challenges in sLMs, we propose Post-Correction via Overcorrection (PoCO), a novel approach that strategically balances recall and precision. PoCO first intentionally triggers overcorrection via LLM to maximize recall by allowing comprehensive revisions, then applies a targeted post-correction step via fine-tuning smaller models to identify and refine erroneous outputs. We aim to harmonize both aspects by leveraging the generative power of LLMs while preserving the reliability of smaller supervised models. Our extensive experiments demonstrate that PoCO effectively balances GEC performance by increasing recall with competitive precision, ultimately improving the overall quality of grammatical error correction.
zh
[NLP-58] Enrich-on-Graph: Query-Graph Alignment for Complex Reasoning with LLM Enriching
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在知识密集型任务(如知识图谱问答,Knowledge Graph Question Answering, KGQA)中因结构化知识图谱(Knowledge Graphs, KGs)与非结构化查询之间存在语义鸿沟而导致的幻觉和事实性错误问题。其解决方案的关键在于提出一种灵活的框架——Enrich-on-Graph (EoG),该框架利用LLMs的先验知识对KG进行增强,从而缩小图谱与查询间的语义差距;EoG不仅提升了证据提取的准确性与鲁棒性,还保证了低计算成本、可扩展性和跨不同方法的适应性。
链接: https://arxiv.org/abs/2509.20810
作者: Songze Li,Zhiqiang Liu,Zhengke Gui,Huajun Chen,Wen Zhang
机构: Zhejiang University (浙江大学); Ant Group (蚂蚁集团); ZJU-Ant Group Joint Lab of Knowledge Graph (浙江大学-蚂蚁集团知识图谱联合实验室)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) exhibit strong reasoning capabilities in complex tasks. However, they still struggle with hallucinations and factual errors in knowledge-intensive scenarios like knowledge graph question answering (KGQA). We attribute this to the semantic gap between structured knowledge graphs (KGs) and unstructured queries, caused by inherent differences in their focuses and structures. Existing methods usually employ resource-intensive, non-scalable workflows reasoning on vanilla KGs, but overlook this gap. To address this challenge, we propose a flexible framework, Enrich-on-Graph (EoG), which leverages LLMs’ prior knowledge to enrich KGs, bridge the semantic gap between graphs and queries. EoG enables efficient evidence extraction from KGs for precise and robust reasoning, while ensuring low computational costs, scalability, and adaptability across different methods. Furthermore, we propose three graph quality evaluation metrics to analyze query-graph alignment in KGQA task, supported by theoretical validation of our optimization objectives. Extensive experiments on two KGQA benchmark datasets indicate that EoG can effectively generate high-quality KGs and achieve the state-of-the-art performance. Our code and data are available at this https URL.
zh
[NLP-59] Few-Shot and Training-Free Review Generation via Conversational Prompting
【速读】: 该论文旨在解决个性化评论生成在少样本(few-shot)和无需训练(training-free)场景下的难题,即当目标用户仅有少量评论且无法进行模型微调时,如何有效生成符合其风格与偏好的评论。解决方案的关键在于提出一种轻量级的“对话式提示”(Conversational Prompting)方法,将用户评论重构为多轮对话形式:其中简单版本(Simple Conversational Prompting, SCP)仅利用用户自身评论作为上下文,而对比版本(Contrastive Conversational Prompting, CCP)则引入其他用户或大语言模型(LLM)生成的错误回复作为负例,引导模型识别并模仿目标用户的写作风格。实验表明,该方法显著优于传统非对话式提示,在多个产品领域和不同LLM上均能生成更贴近目标用户真实评论的内容,尤其在每用户仅两条评论的情况下仍具鲁棒性,且CCP在高质量负例可用时进一步提升性能。
链接: https://arxiv.org/abs/2509.20805
作者: Genki Kusano
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Personalized review generation helps businesses understand user preferences, yet most existing approaches assume extensive review histories of the target user or require additional model training. Real-world applications often face few-shot and training-free situations, where only a few user reviews are available and fine-tuning is infeasible. It is well known that large language models (LLMs) can address such low-resource settings, but their effectiveness depends on prompt engineering. In this paper, we propose Conversational Prompting, a lightweight method that reformulates user reviews as multi-turn conversations. Its simple variant, Simple Conversational Prompting (SCP), relies solely on the user’s own reviews, while the contrastive variant, Contrastive Conversational Prompting (CCP), inserts reviews from other users or LLMs as incorrect replies and then asks the model to correct them, encouraging the model to produce text in the user’s style. Experiments on eight product domains and five LLMs showed that the conventional non-conversational prompt often produced reviews similar to those written by random users, based on text-based metrics such as ROUGE-L and BERTScore, and application-oriented tasks like user identity matching and sentiment analysis. In contrast, both SCP and CCP produced reviews much closer to those of the target user, even when each user had only two reviews. CCP brings further improvements when high-quality negative examples are available, whereas SCP remains competitive when such data cannot be collected. These results suggest that conversational prompting offers a practical solution for review generation under few-shot and training-free constraints.
zh
[NLP-60] owards Atoms of Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)内部表征基本单元定义不清的问题,这一缺陷限制了对模型机制的深入理解。现有方法中,神经元(neurons)存在多义性(polysemy),而特征(features)则面临重建不可靠和不稳定的问题。为此,作者提出“原子理论”(Atoms Theory),将内部表征的基本单元定义为“原子”(atoms),并通过引入原子内积(Atomic Inner Product, AIP)来校正表征偏移,形式化定义原子,并证明其满足受限等距性质(Restricted Isometry Property, RIP)的条件,从而确保原子集合上的稀疏表示稳定且可关联压缩感知理论。在更强假设下,进一步建立稀疏表示的唯一性和ℓ₁范数精确恢复性,并提供理论保证:带阈值激活的单层稀疏自编码器(Sparse Autoencoders, SAEs)能可靠识别原子。实验验证表明,SAEs在Gemma2-2B、Gemma2-9B和Llama3.1-8B上平均实现99.9%的稀疏重建率,超过99.8%的原子满足唯一性条件,显著优于神经元(0.5%)和特征(68.2%),充分说明原子更忠实于LLMs的内在表征结构。
链接: https://arxiv.org/abs/2509.20784
作者: Chenhui Hu,Pengfei Cao,Yubo Chen,Kang Liu,Jun Zhao
机构: The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所认知与复杂系统决策智能重点实验室); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The fundamental units of internal representations in large language models (LLMs) remain undefined, limiting further understanding of their mechanisms. Neurons or features are often regarded as such units, yet neurons suffer from polysemy, while features face concerns of unreliable reconstruction and instability. To address this issue, we propose the Atoms Theory, which defines such units as atoms. We introduce the atomic inner product (AIP) to correct representation shifting, formally define atoms, and prove the conditions that atoms satisfy the Restricted Isometry Property (RIP), ensuring stable sparse representations over atom set and linking to compressed sensing. Under stronger conditions, we further establish the uniqueness and exact \ell_1 recoverability of the sparse representations, and provide guarantees that single-layer sparse autoencoders (SAEs) with threshold activations can reliably identify the atoms. To validate the Atoms Theory, we train threshold-activated SAEs on Gemma2-2B, Gemma2-9B, and Llama3.1-8B, achieving 99.9% sparse reconstruction across layers on average, and more than 99.8% of atoms satisfy the uniqueness condition, compared to 0.5% for neurons and 68.2% for features, showing that atoms more faithfully capture intrinsic representations of LLMs. Scaling experiments further reveal the link between SAEs size and recovery capacity. Overall, this work systematically introduces and validates Atoms Theory of LLMs, providing a theoretical framework for understanding internal representations and a foundation for mechanistic interpretability. Code available at this https URL.
zh
[NLP-61] SFT Doesnt Always Hurt General Capabilities: Revisiting Domain-Specific Fine-Tuning in LLM s
【速读】: 该论文旨在解决监督微调(Supervised Fine-Tuning, SFT)在特定领域数据上训练大语言模型(Large Language Models, LLMs)时普遍存在的“性能权衡”问题,即SFT可能削弱模型的通用能力(general capabilities)。针对这一问题,论文提出的关键解决方案是Token-Adaptive Loss Reweighting (TALR),其核心思想是根据token在不同任务中的重要性动态调整损失权重,从而在保留目标领域性能的同时最小化对通用能力的损害。实验表明,TALR相比L2正则化、LoRA、模型平均和FLOW等方法更有效地平衡了领域特异性提升与通用能力保持之间的矛盾。
链接: https://arxiv.org/abs/2509.20758
作者: Jiacheng Lin,Zhongruo Wang,Kun Qian,Tian Wang,Arvind Srinivasan,Hansi Zeng,Ruochen Jiao,Xie Zhou,Jiri Gesi,Dakuo Wang,Yufan Guo,Kai Zhong,Weiqi Zhang,Sujay Sanghavi,Changyou Chen,Hyokun Yun,Lihong Li
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Amazon (亚马逊); University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校); University of Texas at Austin (德克萨斯大学奥斯汀分校); University at Buffalo (纽约州立大学布法罗分校); Northeastern University (东北大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Supervised Fine-Tuning (SFT) on domain-specific datasets is a common approach to adapt Large Language Models (LLMs) to specialized tasks but is often believed to degrade their general capabilities. In this work, we revisit this trade-off and present both empirical and theoretical insights. First, we show that SFT does not always hurt: using a smaller learning rate can substantially mitigate general performance degradation while preserving comparable target-domain performance. We then provide a theoretical analysis that explains these phenomena and further motivates a new method, Token-Adaptive Loss Reweighting (TALR). Building on this, and recognizing that smaller learning rates alone do not fully eliminate general-performance degradation in all cases, we evaluate a range of strategies for reducing general capability loss, including L2 regularization, LoRA, model averaging, FLOW, and our proposed TALR. Experimental results demonstrate that while no method completely eliminates the trade-off, TALR consistently outperforms these baselines in balancing domain-specific gains and general capabilities. Finally, we distill our findings into practical guidelines for adapting LLMs to new domains: (i) using a small learning rate to achieve a favorable trade-off, and (ii) when a stronger balance is further desired, adopt TALR as an effective strategy.
zh
[NLP-62] Seeing Through Words Speaking Through Pixels: Deep Representational Alignment Between Vision and Language Models EMNLP2025
【速读】: 该论文旨在解决视觉-语言模型中跨模态表示对齐机制的内在规律问题,具体包括:不同模态(视觉与语言)网络中对齐现象发生的位置、支持对齐的关键语义线索、模型是否能捕捉人类在多对多图像-文本匹配场景下的偏好,以及多个示例嵌入聚合如何影响对齐效果。其解决方案的关键在于系统性地分析深度单模态模型(vision-only 和 language-only)在训练过程中逐渐形成的共享语义表示空间——研究发现,对齐在中后期层达到峰值,且主要由语义信息驱动而非外观变化;通过“Pick-a-Pic”任务验证了模型嵌入空间能准确反映人类偏好,并且在多对一或多对多场景下仍保持细粒度语义一致性;更令人意外的是,对同一概念的多个示例嵌入进行平均反而增强了对齐性能,表明这种对齐具有鲁棒性和可聚合性。
链接: https://arxiv.org/abs/2509.20751
作者: Zoe Wanying He,Sean Trott,Meenakshi Khosla
机构: University of California, San Diego (加州大学圣地亚哥分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at EMNLP 2025 (camera-ready)
Abstract:Recent studies show that deep vision-only and language-only models–trained on disjoint modalities–nonetheless project their inputs into a partially aligned representational space. Yet we still lack a clear picture of where in each network this convergence emerges, what visual or linguistic cues support it, whether it captures human preferences in many-to-many image-text scenarios, and how aggregating exemplars of the same concept affects alignment. Here, we systematically investigate these questions. We find that alignment peaks in mid-to-late layers of both model types, reflecting a shift from modality-specific to conceptually shared representations. This alignment is robust to appearance-only changes but collapses when semantics are altered (e.g., object removal or word-order scrambling), highlighting that the shared code is truly semantic. Moving beyond the one-to-one image-caption paradigm, a forced-choice “Pick-a-Pic” task shows that human preferences for image-caption matches are mirrored in the embedding spaces across all vision-language model pairs. This pattern holds bidirectionally when multiple captions correspond to a single image, demonstrating that models capture fine-grained semantic distinctions akin to human judgments. Surprisingly, averaging embeddings across exemplars amplifies alignment rather than blurring detail. Together, our results demonstrate that unimodal networks converge on a shared semantic code that aligns with human judgments and strengthens with exemplar aggregation.
zh
[NLP-63] Confidence-guided Refinement Reasoning for Zero-shot Question Answering
【速读】: 该论文旨在解决多模态问答(QA)任务中模型推理可靠性不足的问题,尤其是如何通过结构化推理提升答案的可信度。其核心解决方案是提出一种无需训练的框架——置信度引导精炼推理(Confidence-guided Refinement Reasoning, C2R),关键在于利用模型自身的置信度分数对子问题及其答案(sub-QAs)进行筛选与比较:首先从多样化的推理路径中构建并选取高质量子问题集合,再基于这些子问题生成的答案候选者之间的置信度差异,选择最可靠的最终答案。此方法不依赖额外训练,可无缝集成至多种现有QA模型,并在不同模型和基准测试上均实现性能提升,同时揭示了子问题数量与质量对增强推理鲁棒性的重要影响。
链接: https://arxiv.org/abs/2509.20750
作者: Youwon Jang,Woo Suk Choi,Minjoon Jung,Minsu Lee,Byoung-Tak Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages (including references and appendix)
Abstract:We propose Confidence-guided Refinement Reasoning (C2R), a novel training-free framework applicable to question-answering (QA) tasks across text, image, and video domains. C2R strategically constructs and refines sub-questions and their answers (sub-QAs), deriving a better confidence score for the target answer. C2R first curates a subset of sub-QAs to explore diverse reasoning paths, then compares the confidence scores of the resulting answer candidates to select the most reliable final answer. Since C2R relies solely on confidence scores derived from the model itself, it can be seamlessly integrated with various existing QA models, demonstrating consistent performance improvements across diverse models and benchmarks. Furthermore, we provide essential yet underexplored insights into how leveraging sub-QAs affects model behavior, specifically analyzing the impact of both the quantity and quality of sub-QAs on achieving robust and reliable reasoning.
zh
[NLP-64] Probability Distribution Collapse: A Critical Bottleneck to Compact Unsupervised Neural Grammar Induction EMNLP2025
【速读】: 该论文旨在解决无监督神经语法归纳(unsupervised neural grammar induction)中模型表达能力受限的问题,即现有方法常生成结构冗余但性能不佳的语法。其核心问题是识别出“概率分布坍塌”(probability distribution collapse)为导致该瓶颈的根本原因,并进一步分析其在神经参数化各关键组件中的产生机制。解决方案的关键在于提出“缓解坍塌的神经参数化”(collapse-relaxing neural parameterization),通过针对性地重构参数化方式以抑制概率分布坍塌,从而显著提升解析性能,并支持使用更紧凑的语法结构,在多种语言上均得到实证验证。
链接: https://arxiv.org/abs/2509.20734
作者: Jinwook Park,Kangil Kim
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted in EMNLP2025 Main, 12 pages, 7 figures, 9 tables
Abstract:Unsupervised neural grammar induction aims to learn interpretable hierarchical structures from language data. However, existing models face an expressiveness bottleneck, often resulting in unnecessarily large yet underperforming grammars. We identify a core issue, \textitprobability distribution collapse , as the underlying cause of this limitation. We analyze when and how the collapse emerges across key components of neural parameterization and introduce a targeted solution, \textitcollapse-relaxing neural parameterization , to mitigate it. Our approach substantially improves parsing performance while enabling the use of significantly more compact grammars across a wide range of languages, as demonstrated through extensive empirical analysis.
zh
[NLP-65] Visual Authority and the Rhetoric of Health Misinformation: A Multimodal Analysis of Social Media Videos
【速读】: 该论文试图解决的问题是:在短视频平台上,营养与补充剂类健康内容中,可信度(credibility)是如何通过权威信号、叙事技巧和商业化手段共同构建的。研究聚焦于替代性健康叙事如何混杂有用、误导甚至有害的信息,并不直接评判内容真伪,而是分析其可信度包装机制。解决方案的关键在于构建一个跨平台的视频语料库(涵盖TikTok、Instagram和YouTube共152个公开视频),并基于26项特征进行标注,涵盖视觉权威性、主讲人属性、叙事策略及互动线索;同时采用透明的标注流程,整合自动语音识别、有原则的帧选择和多模态模型,并通过分层抽样的人工验证确保一致性,从而系统揭示了权威性符号、说服性元素(如术语、恐惧诉求、对主流医学的批判)与商业化行为(如销售链接、订阅号召)之间的高频共现模式。
链接: https://arxiv.org/abs/2509.20724
作者: Mohammad Reza Zarei,Barbara Stead-Coyle,Michael Christensen,Sarah Everts,Majid Komeili
机构: Carleton University (卡尔顿大学)
类目: ocial and Information Networks (cs.SI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:
Abstract:Short form video platforms are central sites for health advice, where alternative narratives mix useful, misleading, and harmful content. Rather than adjudicating truth, this study examines how credibility is packaged in nutrition and supplement videos by analyzing the intersection of authority signals, narrative techniques, and monetization. We assemble a cross platform corpus of 152 public videos from TikTok, Instagram, and YouTube and annotate each on 26 features spanning visual authority, presenter attributes, narrative strategies, and engagement cues. A transparent annotation pipeline integrates automatic speech recognition, principled frame selection, and a multimodal model, with human verification on a stratified subsample showing strong agreement. Descriptively, a confident single presenter in studio or home settings dominates, and clinical contexts are rare. Analytically, authority cues such as titles, slides and charts, and certificates frequently occur with persuasive elements including jargon, references, fear or urgency, critiques of mainstream medicine, and conspiracies, and with monetization including sales links and calls to subscribe. References and science like visuals often travel with emotive and oppositional narratives rather than signaling restraint.
zh
[NLP-66] CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning
【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)在优化大语言模型(Large Language Models, LLMs)过程中因策略熵(policy entropy)管理不当导致的探索-利用权衡失衡问题。现有方法如近端策略优化(Proximal Policy Optimization, PPO)通过裁剪机制丢弃低概率token的梯度信号,从而引发熵动态不稳定,限制了模型性能提升。其解决方案的关键在于提出一种名为CE-GPPO(Controlling Entropy via Gradient-Preserving Policy Optimization)的新算法,该算法以温和且受控的方式重新引入原PPO中被裁剪token的梯度信息,通过调节裁剪区间外梯度的幅度来稳定熵演化过程,从而实现更优的探索与利用平衡。理论分析与实验证明,CE-GPPO能有效缓解熵不稳定性,并在数学推理基准上显著优于多种主流基线方法。
链接: https://arxiv.org/abs/2509.20712
作者: Zhenpeng Su,Leiyu Pan,Minxuan Lv,Yuntao Li,Wenping Hu,Fuzheng Zhang,Kun Gai,Guorui Zhou
机构: Klear Team, Kuaishou Technology(快手科技)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Reinforcement learning (RL) has become a powerful paradigm for optimizing large language models (LLMs) to handle complex reasoning tasks. A core challenge in this process lies in managing policy entropy, which reflects the balance between exploration and exploitation during training. Existing methods, such as proximal policy optimization (PPO) and its variants, discard valuable gradient signals from low-probability tokens due to the clipping mechanism. We systematically analyze the entropy dynamics and reveal that these clipped tokens play a critical yet overlooked role in regulating entropy evolution. We propose \textbfControlling \textbfEntropy via \textbfGradient-\textbfPreserving \textbfPolicy \textbfOptimization (CE-GPPO), a novel algorithm that reintroduces gradients from clipped tokens in native PPO in a gentle and bounded manner. By controlling the magnitude of gradients from tokens outside the clipping interval, CE-GPPO is able to achieve an exploration-exploitation trade-off. We provide theoretical justification and empirical evidence showing that CE-GPPO effectively mitigates entropy instability. Extensive experiments on mathematical reasoning benchmarks show that CE-GPPO consistently outperforms strong baselines across different model scales.
zh
[NLP-67] MI-Fuse: Label Fusion for Unsupervised Domain Adaptation with Closed-Source Large-Audio Language Model
【速读】: 该论文旨在解决真实场景中语音情感识别(Speech Emotion Recognition, SER)因领域不匹配(domain mismatch)而导致性能下降的问题,尤其在源域数据不可获取、仅能通过API访问大语言模型(Large Audio-Language Models, LALMs)的限制下,如何训练一个学生模型(student model)使其在目标域上超越LALM本身。解决方案的关键在于提出MI-Fuse框架,该框架通过融合来自LALM和一个在源域训练好的SER分类器(作为辅助教师)的多组随机预测结果,利用基于互信息(mutual information)的不确定性权重对均值分布进行加权,并引入指数移动平均教师(exponential moving average teacher)稳定训练过程,从而实现无需源数据共享即可有效提升目标域上的情感识别性能。
链接: https://arxiv.org/abs/2509.20706
作者: Hsiao-Ying Huang,Yi-Cheng Lin,Hung-yi Lee
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 5 pages, 2 figures, 2 tables
Abstract:Large audio-language models (LALMs) show strong zero-shot ability on speech tasks, suggesting promise for speech emotion recognition (SER). However, SER in real-world deployments often fails under domain mismatch, where source data are unavailable and powerful LALMs are accessible only through an API. We ask: given only unlabeled target-domain audio and an API-only LALM, can a student model be adapted to outperform the LALM in the target domain? To this end, we propose MI-Fuse, a denoised label fusion framework that supplements the LALM with a source-domain trained SER classifier as an auxiliary teacher. The framework draws multiple stochastic predictions from both teachers, weights their mean distributions by mutual-information-based uncertainty, and stabilizes training with an exponential moving average teacher. Experiments across three public emotion datasets and six cross-domain transfers show consistent gains, with the student surpassing the LALM and outperforming the strongest baseline by 3.9%. This approach strengthens emotion-aware speech systems without sharing source data, enabling realistic adaptation.
zh
[NLP-68] Overcoming Black-box Attack Inefficiency with Hybrid and Dynamic Select Algorithms EMNLP2025
【速读】: 该论文旨在解决基于Transformer架构的自然语言处理(Natural Language Processing, NLP)模型在对抗文本攻击测试中面临的计算成本过高问题,特别是针对资源受限的研究人员(如GPU算力不足的情况)。现有主流黑盒攻击方法通常需要大量查询次数,导致效率低下且不具实用性。其解决方案的关键在于提出两种新的攻击选择策略:Hybrid Select 和 Dynamic Select。其中,Hybrid Select 通过引入一个长度阈值来动态决定使用广义二分选择(Generalized BinarySelect)还是贪心选择(GreedySelect),从而融合两者优势;Dynamic Select 则通过学习不同文本长度下最优的选择策略,实现更智能的算法切换。这两种策略显著降低了攻击所需的查询次数,同时保持了攻击的有效性,在4个数据集和6个目标模型上的实验表明,最佳方法(句级Hybrid Select)平均可减少25.82%的查询量,且对编码器模型和大语言模型(Large Language Models, LLMs)均有效。
链接: https://arxiv.org/abs/2509.20699
作者: Abhinay Shankar Belde,Rohit Ramkumar,Jonathan Rusert
机构: Purdue University, Fort Wayne (普渡大学福尔特韦恩分校)
类目: Computation and Language (cs.CL)
备注: 34 pages, 3 figures, Accepted to Findings of EMNLP 2025
Abstract:Adversarial text attack research plays a crucial role in evaluating the robustness of NLP models. However, the increasing complexity of transformer-based architectures has dramatically raised the computational cost of attack testing, especially for researchers with limited resources (e.g., GPUs). Existing popular black-box attack methods often require a large number of queries, which can make them inefficient and impractical for researchers. To address these challenges, we propose two new attack selection strategies called Hybrid and Dynamic Select, which better combine the strengths of previous selection algorithms. Hybrid Select merges generalized BinarySelect techniques with GreedySelect by introducing a size threshold to decide which selection algorithm to use. Dynamic Select provides an alternative approach of combining the generalized Binary and GreedySelect by learning which lengths of texts each selection method should be applied to. This greatly reduces the number of queries needed while maintaining attack effectiveness (a limitation of BinarySelect). Across 4 datasets and 6 target models, our best method(sentence-level Hybrid Select) is able to reduce the number of required queries per attack up 25.82% on average against both encoder models and LLMs, without losing the effectiveness of the attack.
zh
[NLP-69] RedHerring Attack: Testing the Reliability of Attack Detection EMNLP2025
【速读】: 该论文旨在解决当前攻击检测模型(attack detection models)在面对新型对抗性攻击时可能失效的问题,尤其是这些检测模型的可靠性尚未被充分验证。为应对这一问题,作者提出了一种名为 RedHerring 的新型攻击策略,其关键在于通过修改文本使检测模型误判为存在攻击,同时保持原始分类器(classifier)的预测正确,从而制造分类器与检测模型之间的矛盾信号。这种设计使得人类观察者在看到检测模型给出“错误”预测而分类器结果正确时,会认为检测模型不可靠,进而削弱其可信度。实验表明,RedHerring 能显著降低检测准确率(下降 20–71 个百分点),同时维持甚至提升分类器性能;作为初步防御措施,作者还提出一种无需重新训练分类器或检测模型的置信度检查机制,可有效提升检测准确性。
链接: https://arxiv.org/abs/2509.20691
作者: Jonathan Rusert
机构: Purdue University, Fort Wayne(普渡大学福特韦恩分校)
类目: Computation and Language (cs.CL)
备注: 16 pages, 3 figures, Accepted to EMNLP 2025
Abstract:In response to adversarial text attacks, attack detection models have been proposed and shown to successfully identify text modified by adversaries. Attack detection models can be leveraged to provide an additional check for NLP models and give signals for human input. However, the reliability of these models has not yet been thoroughly explored. Thus, we propose and test a novel attack setting and attack, RedHerring. RedHerring aims to make attack detection models unreliable by modifying a text to cause the detection model to predict an attack, while keeping the classifier correct. This creates a tension between the classifier and detector. If a human sees that the detector is giving an ``incorrect’’ prediction, but the classifier a correct one, then the human will see the detector as unreliable. We test this novel threat model on 4 datasets against 3 detectors defending 4 classifiers. We find that RedHerring is able to drop detection accuracy between 20 - 71 points, while maintaining (or improving) classifier accuracy. As an initial defense, we propose a simple confidence check which requires no retraining of the classifier or detector and increases detection accuracy greatly. This novel threat model offers new insights into how adversaries may target detection models.
zh
[NLP-70] Can Federated Learning Safeguard Private Data in LLM Training? Vulnerabilities Attacks and Defense Evaluation EMNLP2025
【速读】: 该论文旨在解决联邦学习(Federated Learning, FL)框架下大语言模型(Large Language Models, LLMs)微调过程中存在的训练数据隐私泄露问题。尽管FL通过仅共享模型参数而非原始数据来保护隐私,但本文通过实验证明,攻击者仍能利用生成式方法从全局模型中提取训练数据,且随着模型规模增大,泄露风险加剧;为此,作者提出一种针对FL的增强型攻击策略,通过追踪训练过程中的全局模型更新以强化隐私泄露。解决方案的关键在于评估并结合多种隐私保护技术,包括差分隐私(Differential Privacy)、正则化约束更新和采用具备安全对齐(Safety Alignment)的大语言模型,从而为LLM在联邦环境下的隐私风险降低提供实用指导。
链接: https://arxiv.org/abs/2509.20680
作者: Wenkai Guo,Xuefeng Liu,Haolin Wang,Jianwei Niu,Shaojie Tang,Jing Yuan
机构: Beihang University (北京航空航天大学); University at Buffalo (纽约州立大学布法罗分校); University of North Texas (北德克萨斯大学); Zhongguancun Laboratory (中关村实验室)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: 28 pages, 32 figures, accepted to the Findings of EMNLP 2025
Abstract:Fine-tuning large language models (LLMs) with local data is a widely adopted approach for organizations seeking to adapt LLMs to their specific domains. Given the shared characteristics in data across different organizations, the idea of collaboratively fine-tuning an LLM using data from multiple sources presents an appealing opportunity. However, organizations are often reluctant to share local data, making centralized fine-tuning impractical. Federated learning (FL), a privacy-preserving framework, enables clients to retain local data while sharing only model parameters for collaborative training, offering a potential solution. While fine-tuning LLMs on centralized datasets risks data leakage through next-token prediction, the iterative aggregation process in FL results in a global model that encapsulates generalized knowledge, which some believe protects client privacy. In this paper, however, we present contradictory findings through extensive experiments. We show that attackers can still extract training data from the global model, even using straightforward generation methods, with leakage increasing as the model size grows. Moreover, we introduce an enhanced attack strategy tailored to FL, which tracks global model updates during training to intensify privacy leakage. To mitigate these risks, we evaluate privacy-preserving techniques in FL, including differential privacy, regularization-constrained updates and adopting LLMs with safety alignment. Our results provide valuable insights and practical guidelines for reducing privacy risks when training LLMs with FL.
zh
[NLP-71] Human Semantic Representations of Social Interactions from Moving Shapes
【速读】: 该论文试图解决的问题是:在人类对简单移动形状动画的社会互动感知中,除了视觉特征外,哪些语义表征能够补充并解释人类的判断。解决方案的关键在于通过人类相似性判断构建社会互动的表征几何结构,并将其与基于视觉特征、标签及动画描述语义嵌入的模型预测进行比较,结果表明,从描述中提取的动词嵌入(verb-based embeddings)能最有效地解释人类对社会互动相似性的判断,揭示了语义结构在连接视觉与抽象社会认知中的关键作用。
链接: https://arxiv.org/abs/2509.20673
作者: Yiling Yun,Hongjing Lu
机构: University of California, Los Angeles (加州大学洛杉矶分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
备注:
Abstract:Humans are social creatures who readily recognize various social interactions from simple display of moving shapes. While previous research has often focused on visual features, we examine what semantic representations that humans employ to complement visual features. In Study 1, we directly asked human participants to label the animations based on their impression of moving shapes. We found that human responses were distributed. In Study 2, we measured the representational geometry of 27 social interactions through human similarity judgments and compared it with model predictions based on visual features, labels, and semantic embeddings from animation descriptions. We found that semantic models provided complementary information to visual features in explaining human judgments. Among the semantic models, verb-based embeddings extracted from descriptions account for human similarity judgments the best. These results suggest that social perception in simple displays reflects the semantic structure of social interactions, bridging visual and abstract representations.
zh
[NLP-72] Enhancing Molecular Property Prediction with Knowledge from Large Language Models
【速读】: 该论文旨在解决分子性质预测(Molecular Property Prediction, MPP)中如何有效融合人类先验知识与基于结构的深度学习表示的问题。尽管图神经网络(Graph Neural Networks, GNNs)和自监督学习方法已在MPP中取得显著进展,但其性能仍受限于对领域知识的利用不足,尤其在低频或新兴分子属性上表现不佳。解决方案的关键在于首次提出一种集成框架,将大语言模型(Large Language Models, LLMs)提取的领域知识与预训练分子模型生成的结构特征进行融合:通过提示LLMs生成与任务相关的知识及可执行的分子向量化代码,从而构建知识驱动的特征,并将其与GNN等结构特征进行多模态融合,显著提升了预测准确性与鲁棒性。
链接: https://arxiv.org/abs/2509.20664
作者: Peng Zhou,Lai Hou Tim,Zhixiang Cheng,Kun Xie,Chaoyi Li,Wei Liu,Xiangxiang Zeng
机构: 1. Institute of Artificial Intelligence, Chinese Academy of Sciences (中国科学院人工智能研究院); 2. School of Computer Science and Engineering, Sun Yat-sen University (中山大学计算机科学与工程学院); 3. Guangdong Provincial Key Laboratory of Intelligent Information Processing and Security, Sun Yat-sen University (中山大学广东省智能信息处理与安全重点实验室)
类目: Computation and Language (cs.CL)
备注: 9 pages, 5 figures
Abstract:Predicting molecular properties is a critical component of drug discovery. Recent advances in deep learning, particularly Graph Neural Networks (GNNs), have enabled end-to-end learning from molecular structures, reducing reliance on manual feature engineering. However, while GNNs and self-supervised learning approaches have advanced molecular property prediction (MPP), the integration of human prior knowledge remains indispensable, as evidenced by recent methods that leverage large language models (LLMs) for knowledge extraction. Despite their strengths, LLMs are constrained by knowledge gaps and hallucinations, particularly for less-studied molecular properties. In this work, we propose a novel framework that, for the first time, integrates knowledge extracted from LLMs with structural features derived from pre-trained molecular models to enhance MPP. Our approach prompts LLMs to generate both domain-relevant knowledge and executable code for molecular vectorization, producing knowledge-based features that are subsequently fused with structural representations. We employ three state-of-the-art LLMs, GPT-4o, GPT-4.1, and DeepSeek-R1, for knowledge extraction. Extensive experiments demonstrate that our integrated method outperforms existing approaches, confirming that the combination of LLM-derived knowledge and structural information provides a robust and effective solution for MPP.
zh
[NLP-73] Building Tailored Speech Recognizers for Japanese Speaking Assessment
【速读】: 该论文旨在解决日本语语音识别中因标注数据稀缺而导致的音节标记(含声调标记)识别准确率低的问题。其关键解决方案包括:一是采用多任务学习框架,引入辅助损失函数以估计输入语音的正字法文本和音高模式,从而利用仅含正字法标注的数据进行训练;二是融合基于音素字母字符串和文本词元序列的两个估计器,通过有限状态转换器(finite-state transducer)框架实现联合优化。实验表明,该方法将mora标签错误率从12.3%降至7.1%,显著优于通用多语言识别器。
链接: https://arxiv.org/abs/2509.20655
作者: Yotaro Kubo,Richard Sproat,Chihiro Taguchi,Llion Jones
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
Abstract:This paper presents methods for building speech recognizers tailored for Japanese speaking assessment tasks. Specifically, we build a speech recognizer that outputs phonemic labels with accent markers. Although Japanese is resource-rich, there is only a small amount of data for training models to produce accurate phonemic transcriptions that include accent marks. We propose two methods to mitigate data sparsity. First, a multitask training scheme introduces auxiliary loss functions to estimate orthographic text labels and pitch patterns of the input signal, so that utterances with only orthographic annotations can be leveraged in training. The second fuses two estimators, one over phonetic alphabet strings, and the other over text token sequences. To combine these estimates we develop an algorithm based on the finite-state transducer framework. Our results indicate that the use of multitask learning and fusion is effective for building an accurate phonemic recognizer. We show that this approach is advantageous compared to the use of generic multilingual recognizers. The relative advantages of the proposed methods were also compared. Our proposed methods reduced the average of mora-label error rates from 12.3% to 7.1% over the CSJ core evaluation sets.
zh
[NLP-74] Look Before you Leap: Estimating LLM Benchmark Scores from Descriptions
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)发展中的评估瓶颈问题,即传统方法需先构建基准测试、运行实验并迭代优化,效率低下。其核心解决方案是提出文本仅性能预测(text-only performance forecasting),通过红化(redacted)的任务描述和配置信息,在不接触数据集实例的情况下预估模型表现。关键创新在于构建了PRECOG语料库,包含多样任务、领域与指标的红化描述-性能配对,并引入带检索模块的预测模型,使其在高置信度阈值下达到均方误差低至8.7的预测精度,同时揭示了更强推理模型会进行多样化、迭代式查询,而当前开源模型则存在检索不足或证据多样性差的问题。
链接: https://arxiv.org/abs/2509.20645
作者: Jungsoo Park,Ethan Mendes,Gabriel Stanovsky,Alan Ritter
机构: Georgia Institute of Technology (佐治亚理工学院); The Hebrew University of Jerusalem (希伯来大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 24 pages, 6 figures
Abstract:Progress in large language models is constrained by an evaluation bottleneck: build a benchmark, evaluate models and settings, then iterate. We therefore ask a simple question: can we forecast outcomes before running any experiments? We study text-only performance forecasting: estimating a model’s score from a redacted task description and intended configuration, with no access to dataset instances. To support systematic study, we curate PRECOG, a corpus of redacted description-performance pairs spanning diverse tasks, domains, and metrics. Experiments show the task is challenging but feasible: models equipped with a retrieval module that excludes source papers achieve moderate prediction performance with well-calibrated uncertainty, reaching mean absolute error as low as 8.7 on the Accuracy subset at high-confidence thresholds. Our analysis indicates that stronger reasoning models engage in diverse, iterative querying, whereas current open-source models lag and often skip retrieval or gather evidence with limited diversity. We further test a zero-leakage setting, forecasting on newly released datasets or experiments before their papers are indexed, where GPT-5 with built-in web search still attains nontrivial prediction accuracy. Overall, our corpus and analyses offer an initial step toward open-ended anticipatory evaluation, supporting difficulty estimation and smarter experiment prioritization.
zh
[NLP-75] FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models
【速读】: 该论文旨在解决自回归语言模型(Autoregressive Language Models, ARMs)因逐token生成导致的吞吐量低和长序列延迟高的问题,以及传统离散扩散模型(Discrete Diffusion Models)虽可并行但需数百至数千次模型评估才能达到高质量、效率低下的局限。其解决方案的关键在于提出FS-DFM(Few-Step Discrete Flow-Matching),通过将采样步数显式设为参数,并训练模型在不同步数预算下保持一致性,使得单次大步长移动能等效于多次小步长移动;同时结合可靠的更新规则以避免概率移动过度,以及从长轨迹中蒸馏得到的强教师指导,从而实现少步数采样下的稳定、高精度生成,显著提升速度与可控性。
链接: https://arxiv.org/abs/2509.20624
作者: Amin Karimi Monsefi,Nikhil Bhendawade,Manuel Rafael Ciosici,Dominic Culver,Yizhe Zhang,Irina Belousova
机构: The Ohio State University (俄亥俄州立大学); Apple
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Autoregressive language models (ARMs) deliver strong likelihoods, but are inherently serial: they generate one token per forward pass, which limits throughput and inflates latency for long sequences. Diffusion Language Models (DLMs) parallelize across positions and thus appear promising for language generation, yet standard discrete diffusion typically needs hundreds to thousands of model evaluations to reach high quality, trading serial depth for iterative breadth. We introduce FS-DFM, Few-Step Discrete Flow-Matching. A discrete flow-matching model designed for speed without sacrificing quality. The core idea is simple: make the number of sampling steps an explicit parameter and train the model to be consistent across step budgets, so one big move lands where many small moves would. We pair this with a reliable update rule that moves probability in the right direction without overshooting, and with strong teacher guidance distilled from long-run trajectories. Together, these choices make few-step sampling stable, accurate, and easy to control. On language modeling benchmarks, FS-DFM with 8 sampling steps achieves perplexity parity with a 1,024-step discrete-flow baseline for generating 1,024 tokens using a similar-size model, delivering up to 128 times faster sampling and corresponding latency/throughput gains.
zh
[NLP-76] Every Character Counts: From Vulnerability to Defense in Phishing Detection ICTAI2025
【速读】: 该论文旨在解决当前钓鱼攻击检测方法在可解释性和鲁棒性方面的不足问题,尤其是在面对新型攻击时性能下降的问题。其解决方案的关键在于采用基于字符级别的深度学习模型(CharCNN、CharGRU 和 CharBiLSTM),这些模型能够在有限计算资源下实现高精度检测,并通过引入对抗训练提升对对抗样本的鲁棒性;同时,利用梯度加权类激活映射(Gradient-weighted Class Activation Mapping, Grad-CAM)技术对字符级输入进行可视化分析,从而增强模型决策过程的可解释性。实验表明,在受限环境下,CharGRU 在所有场景中表现最优,且对抗训练显著提升了模型的抗攻击能力。
链接: https://arxiv.org/abs/2509.20589
作者: Maria Chiper,Radu Tudor Ionescu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at ICTAI 2025
Abstract:Phishing attacks targeting both organizations and individuals are becoming an increasingly significant threat as technology advances. Current automatic detection methods often lack explainability and robustness in detecting new phishing attacks. In this work, we investigate the effectiveness of character-level deep learning models for phishing detection, which can provide both robustness and interpretability. We evaluate three neural architectures adapted to operate at the character level, namely CharCNN, CharGRU, and CharBiLSTM, on a custom-built email dataset, which combines data from multiple sources. Their performance is analyzed under three scenarios: (i) standard training and testing, (ii) standard training and testing under adversarial attacks, and (iii) training and testing with adversarial examples. Aiming to develop a tool that operates as a browser extension, we test all models under limited computational resources. In this constrained setup, CharGRU proves to be the best-performing model across all scenarios. All models show vulnerability to adversarial attacks, but adversarial training substantially improves their robustness. In addition, by adapting the Gradient-weighted Class Activation Mapping (Grad-CAM) technique to character-level inputs, we are able to visualize which parts of each email influence the decision of each model. Our open-source code and data is released at this https URL.
zh
[NLP-77] Hierarchical Resolution Transformers: A Wavelet-Inspired Architecture for Multi-Scale Language Understanding
【速读】: 该论文旨在解决标准Transformer架构在处理自然语言时存在的根本性问题:其将文本视为扁平的词元序列,忽略了人类语言固有的层次结构,导致计算复杂度为二次方(O(n²))、组合泛化能力弱以及话语层面建模不足。为此,作者提出Hierarchical Resolution Transformer (HRT),其关键创新在于引入一种受小波(wavelet)启发的多分辨率注意力机制,使模型能够同时在字符、词、短语到话语等不同粒度上进行信息处理,实现自底向上的组合与自顶向下的上下文感知。通过指数级序列压缩策略,HRT将整体计算复杂度降低至O(nlogn),显著提升了效率,并在GLUE、SuperGLUE、Long Range Arena等多个基准测试中超越标准Transformer基线,平均准确率提升达+3.8%至+6.1%,同时内存占用减少42%、推理延迟降低37%。
链接: https://arxiv.org/abs/2509.20581
作者: Ayan Sar,Sampurna Roy,Kanav Gupta,Anurag Kaushish,Tanupriya Choudhury,Abhijit Kumar
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Submitted in IEEE International Conference on Big Data 2025
Abstract:Transformer architectures have achieved state-of-the-art performance across natural language tasks, yet they fundamentally misrepresent the hierarchical nature of human language by processing text as flat token sequences. This results in quadratic computational cost, weak computational cost, weak compositional generalization, and inadequate discourse-level modeling. We propose Hierarchical Resolution Transformer (HRT), a novel wavelet-inspired neural architecture that processes language simultaneously across multiple resolutions, from characters to discourse-level units. HRT constructs a multi-resolution attention, enabling bottom-up composition and top-down contextualization. By employing exponential sequence reduction across scales, HRT achieves O(nlogn) complexity, offering significant efficiency improvements over standard transformers. We evaluated HRT on a diverse suite of benchmarks, including GLUE, SuperGLUE, Long Range Arena, and WikiText-103, and results demonstrated that HRT outperforms standard transformer baselines by an average of +3.8% on GLUE, +4.5% on SuperGLUE, and +6.1% on Long Range Arena, while reducing memory usage by 42% and inference latency by 37% compared to BERT and GPT style models of similar parameter count. Ablation studies confirm the effectiveness of cross-resolution attention and scale-specialized modules, showing that each contributes independently to both efficiency and accuracy. Our findings establish HRT as the first architecture to align computational structure with the hierarchical organization of human language, demonstrating that multi-scale, wavelet-inspired processing yields both theoretical efficiency gains and practical improvements in language understanding.
zh
[NLP-78] Dynamic Reasoning Chains through Depth-Specialized Mixture-of-Experts in Transformer Architectures
【速读】: 该论文旨在解决当前Transformer架构在处理不同复杂度输入时存在的计算效率低下和推理质量受限的问题,即对简单事实查询与复杂逻辑问题采用相同的固定深度计算,导致资源浪费并限制了深层推理能力。其解决方案的关键在于提出了一种基于深度特化的专家混合模型(Depth Specialised Mixture of Experts, DS-MoE),通过引入针对不同推理深度优化的专家模块(如浅层模式识别、组合推理、逻辑推断、记忆整合及元认知监督),并由一个学习得到的路由网络动态构建定制化推理链,仅激活匹配输入复杂度所需的专家模块,从而实现计算资源的按需分配。
链接: https://arxiv.org/abs/2509.20577
作者: Sampurna Roy,Ayan Sar,Anurag Kaushish,Kanav Gupta,Tanupriya Choudhury,Abhijit Kumar
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Submitted in IEEE International Conference on Big Data 2025
Abstract:Contemporary transformer architectures apply identical processing depth to all inputs, creating inefficiencies and limiting reasoning quality. Simple factual queries are subjected to the same multilayered computation as complex logical problems, wasting resources while constraining deep inference. To overcome this, we came up with a concept of Dynamic Reasoning Chains through Depth Specialised Mixture of Experts (DS-MoE), a modular framework that extends the Mixture of Experts paradigm from width-based to depth specialised computation. DS-MoE introduces expert modules optimised for distinct reasoning depths, shallow pattern recognition, compositional reasoning, logical inference, memory integration, and meta-cognitive supervision. A learned routing network dynamically assembles custom reasoning chains, activating only the necessary experts to match input complexity. The dataset on which we trained and evaluated DS-MoE is on The Pile, an 800GB corpus covering diverse domains such as scientific papers, legal texts, programming code, and web content, enabling systematic assessment across reasoning depths. Experimental results demonstrate that DS-MoE achieves up to 16 per cent computational savings and 35 per cent faster inference compared to uniform-depth transformers, while delivering 2.8 per cent higher accuracy on complex multi-step reasoning benchmarks. Furthermore, routing decisions yield interpretable reasoning chains, enhancing transparency and scalability. These findings establish DS-MoE as a significant advancement in adaptive neural architectures, demonstrating that depth-specialised modular processing can simultaneously improve efficiency, reasoning quality, and interpretability in large-scale language models.
zh
[NLP-79] SwasthLLM : a Unified Cross-Lingual Multi-Task and Meta-Learning Zero-Shot Framework for Medical Diagnosis Using Contrastive Representations
【速读】: 该论文旨在解决多语言医疗环境中,由于低资源语言标注医学数据稀缺及跨语言语义差异导致的自动疾病诊断难题。其核心解决方案是提出SwasthLLM框架,该框架基于多语言XLM-RoBERTa编码器,结合语言感知注意力机制与疾病分类头,实现跨语言医学文本的统一表征学习;关键创新在于引入Siamese对比学习模块以对齐不同语言间的语义空间,并通过翻译一致性模块和对比投影头强化语言不变性表示,同时采用多任务学习策略联合优化疾病分类、翻译对齐与对比学习目标,并利用模型无关元学习(Model-Agnostic Meta-Learning, MAML)提升模型在未见语言或任务上的快速适应能力,从而在零样本场景下仍能保持较高诊断准确率(如印地语92.78%、孟加拉语73.33%)。
链接: https://arxiv.org/abs/2509.20567
作者: Ayan Sar,Pranav Singh Puri,Sumit Aich,Tanupriya Choudhury,Abhijit Kumar
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Submitted to International Conference on Big Data 2025
Abstract:In multilingual healthcare environments, automatic disease diagnosis from clinical text remains a challenging task due to the scarcity of annotated medical data in low-resource languages and the linguistic variability across populations. This paper proposes SwasthLLM, a unified, zero-shot, cross-lingual, and multi-task learning framework for medical diagnosis that operates effectively across English, Hindi, and Bengali without requiring language-specific fine-tuning. At its core, SwasthLLM leverages the multilingual XLM-RoBERTa encoder augmented with a language-aware attention mechanism and a disease classification head, enabling the model to extract medically relevant information regardless of the language structure. To align semantic representations across languages, a Siamese contrastive learning module is introduced, ensuring that equivalent medical texts in different languages produce similar embeddings. Further, a translation consistency module and a contrastive projection head reinforce language-invariant representation learning. SwasthLLM is trained using a multi-task learning strategy, jointly optimizing disease classification, translation alignment, and contrastive learning objectives. Additionally, we employ Model-Agnostic Meta-Learning (MAML) to equip the model with rapid adaptation capabilities for unseen languages or tasks with minimal data. Our phased training pipeline emphasizes robust representation alignment before task-specific fine-tuning. Extensive evaluation shows that SwasthLLM achieves high diagnostic performance, with a test accuracy of 97.22% and an F1-score of 97.17% in supervised settings. Crucially, in zero-shot scenarios, it attains 92.78% accuracy on Hindi and 73.33% accuracy on Bengali medical text, demonstrating strong generalization in low-resource contexts.
zh
[NLP-80] SiniticMTError: A Machine Translation Dataset with Error Annotations for Sinitic Languages
【速读】: 该论文旨在解决低资源语言(如粤语和吴语)在机器翻译(Machine Translation, MT)中因缺乏大规模训练数据和语言资源而导致的性能瓶颈问题。其解决方案的关键在于构建了一个名为SiniticMTError的新颖数据集,该数据集基于现有平行语料库,为英译中文、粤语和吴语的机器翻译样本提供了错误跨度(error span)、错误类型(error type)及错误严重性(error severity)的标注信息。这一数据集可支持模型微调以增强错误检测能力,从而推动翻译质量评估、错误感知生成以及低资源语言翻译效果的研究与改进。
链接: https://arxiv.org/abs/2509.20557
作者: Hannah Liu,Junghyun Min,Ethan Yue Heng Cheung,Shou-Yi Hung,Syed Mekael Wasti,Runtong Liang,Shiyao Qian,Shizhao Zheng,Elsie Chan,Ka Ieng Charlotte Lo,Wing Yu Yip,Richard Tzong-Han Tsai,En-Shiun Annie Lee
机构: University of Toronto (多伦多大学); Georgetown University (乔治城大学); Ontario Tech University (安大略理工大学); National Central University, Taiwan (台湾中央大学)
类目: Computation and Language (cs.CL)
备注: Work in progress. 14 pages, 4 figures, 5 tables
Abstract:Despite major advances in machine translation (MT) in recent years, progress remains limited for many low-resource languages that lack large-scale training data and linguistic resources. Cantonese and Wu Chinese are two Sinitic examples, although each enjoys more than 80 million speakers around the world. In this paper, we introduce SiniticMTError, a novel dataset that builds on existing parallel corpora to provide error span, error type, and error severity annotations in machine-translated examples from English to Mandarin, Cantonese, and Wu Chinese. Our dataset serves as a resource for the MT community to utilize in fine-tuning models with error detection capabilities, supporting research on translation quality estimation, error-aware generation, and low-resource language evaluation. We report our rigorous annotation process by native speakers, with analyses on inter-annotator agreement, iterative feedback, and patterns in error type and severity.
zh
[NLP-81] Perspectra: Choosing Your Experts Enhances Critical Thinking in Multi-Agent Research Ideation
【速读】: 该论文试图解决多智能体系统(Multi-Agent Systems, MAS)中用户难以有效控制、引导和批判性评估多个领域专家代理之间协作的问题。现有方法虽能通过分配角色促进信息搜索与创意生成,但缺乏对代理间对话结构的可视化支持与用户干预机制,导致协作过程缺乏透明度和可控性。解决方案的关键在于提出 Perspectra——一个基于论坛式界面的交互式 MAS,其核心创新包括:支持 @提及邀请特定代理、线程化并行探索机制,以及实时思维导图用于可视化论证逻辑与推理链条,从而增强用户对多代理对抗性 discourse 的控制能力,进而提升批判性思维行为的频率与深度。
链接: https://arxiv.org/abs/2509.20553
作者: Yiren Liu,Viraj Shah,Sangho Suh,Pao Siangliulue,Tal August,Yun Huang
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); University of Toronto (多伦多大学); Allen Institute for AI (艾伦人工智能研究所)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Recent advances in multi-agent systems (MAS) enable tools for information search and ideation by assigning personas to agents. However, how users can effectively control, steer, and critically evaluate collaboration among multiple domain-expert agents remains underexplored. We present Perspectra, an interactive MAS that visualizes and structures deliberation among LLM agents via a forum-style interface, supporting @-mention to invite targeted agents, threading for parallel exploration, with a real-time mind map for visualizing arguments and rationales. In a within-subjects study with 18 participants, we compared Perspectra to a group-chat baseline as they developed research proposals. Our findings show that Perspectra significantly increased the frequency and depth of critical-thinking behaviors, elicited more interdisciplinary replies, and led to more frequent proposal revisions than the group chat condition. We discuss implications for designing multi-agent tools that scaffold critical thinking by supporting user control over multi-agent adversarial discourse.
zh
[NLP-82] MARS: toward more efficient multi-agent collaboration for LLM reasoning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在单代理模式下推理能力有限的问题,尤其针对多智能体辩论(Multi-Agent Debate, MAD)方法中存在的计算开销过大、通信频繁导致的资源消耗问题。其解决方案的关键在于提出一种基于角色的协作框架——MARS(Multi-Agent Review System),该框架模拟学术评审流程:由作者代理生成初始解,多个评审代理独立提供决策与评论,再由元评审代理整合反馈并做出最终决策及修订指导。此设计避免了评审代理间的直接交互,显著降低了令牌(token)消耗和推理时间,同时保持与MAD相当的推理准确性。
链接: https://arxiv.org/abs/2509.20502
作者: Xiao Wang,Jia Wang,Yijie Wang,Pengtao Dang,Sha Cao,Chi Zhang
机构: Indiana University Bloomington (印第安纳大学布卢明顿分校); Oregon Health & Science University (俄勒冈健康与科学大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have achieved impressive results in natural language understanding, yet their reasoning capabilities remain limited when operating as single agents. Multi-Agent Debate (MAD) has been proposed to address this limitation by enabling collaborative reasoning among multiple models in a round-table debate manner. While effective, MAD introduces substantial computational overhead due to the number of agents involved and the frequent communication required. In this paper, we propose MARS (Multi-Agent Review System), a role-based collaboration framework inspired by the review process. In MARS, an author agent generates an initial solution, reviewer agents provide decisions and comments independently, and a meta-reviewer integrates the feedback to make the final decision and guide further revision. This design enhances reasoning quality while avoiding costly reviewer-to-reviewer interactions, thereby controlling token consumption and inference time. We compared MARS with both MAD and other state-of-the-art reasoning strategies across multiple benchmarks. Extensive experiments with different LLMs show that MARS matches the accuracy of MAD while reducing both token usage and inference time by approximately 50%. Code is available at this https URL.
zh
[NLP-83] InsightGUIDE: An Opinionated AI Assistant for Guided Critical Reading of Scientific Literature ICTAI2025
【速读】: 该论文旨在解决科学文献爆炸式增长给研究人员带来的阅读挑战,特别是现有大语言模型(Large Language Models, LLMs)生成的摘要往往冗长且可能替代而非辅助阅读的问题。解决方案的关键在于提出了一种名为InsightGUIDE的新型AI阅读助手工具,其核心创新是将专家的阅读方法论直接嵌入AI逻辑中,从而生成结构化、简洁且可操作的洞察,作为论文关键要素的“导航地图”,显著提升研究效率与阅读质量。
链接: https://arxiv.org/abs/2509.20493
作者: Paris Koloveas,Serafeim Chatzopoulos,Thanasis Vergoulis,Christos Tryfonopoulos
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL); Human-Computer Interaction (cs.HC)
备注: Accepted for publication on ICTAI 2025
Abstract:The proliferation of scientific literature presents an increasingly significant challenge for researchers. While Large Language Models (LLMs) offer promise, existing tools often provide verbose summaries that risk replacing, rather than assisting, the reading of the source material. This paper introduces InsightGUIDE, a novel AI-powered tool designed to function as a reading assistant, not a replacement. Our system provides concise, structured insights that act as a “map” to a paper’s key elements by embedding an expert’s reading methodology directly into its core AI logic. We present the system’s architecture, its prompt-driven methodology, and a qualitative case study comparing its output to a general-purpose LLM. The results demonstrate that InsightGUIDE produces more structured and actionable guidance, serving as a more effective tool for the modern researcher.
zh
[NLP-84] RadAgents : Multimodal Agent ic Reasoning for Chest X-ray Interpretation with Radiologist-like Workflows
【速读】: 该论文旨在解决当前胸部X线(CXR)图像解读中生成式AI系统存在的三大关键问题:一是推理过程缺乏临床可解释性且与诊疗指南不一致,仅是对工具输出的简单聚合;二是多模态证据融合不足,导致生成的解释文本未与影像视觉信息有效对齐;三是系统难以检测和解决跨工具间的不一致性,缺乏可靠的验证机制。解决方案的核心在于提出RadAgents框架,该框架通过耦合临床先验知识与任务感知的多模态推理,结合视觉引导的 grounding 和多模态检索增强机制,实现对上下文冲突的验证与修正,从而提升系统输出的可靠性、透明度及与临床实践的一致性。
链接: https://arxiv.org/abs/2509.20490
作者: Kai Zhang,Corey D Barrett,Jangwon Kim,Lichao Sun,Tara Taghavi,Krishnaram Kenthapadi
机构: Oracle Health AI; Lehigh University
类目: Multiagent Systems (cs.MA); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: In progress
Abstract:Agentic systems offer a potential path to solve complex clinical tasks through collaboration among specialized agents, augmented by tool use and external knowledge bases. Nevertheless, for chest X-ray (CXR) interpretation, prevailing methods remain limited: (i) reasoning is frequently neither clinically interpretable nor aligned with guidelines, reflecting mere aggregation of tool outputs; (ii) multimodal evidence is insufficiently fused, yielding text-only rationales that are not visually grounded; and (iii) systems rarely detect or resolve cross-tool inconsistencies and provide no principled verification mechanisms. To bridge the above gaps, we present RadAgents, a multi-agent framework for CXR interpretation that couples clinical priors with task-aware multimodal reasoning. In addition, we integrate grounding and multimodal retrieval-augmentation to verify and resolve context conflicts, resulting in outputs that are more reliable, transparent, and consistent with clinical practice.
zh
[NLP-85] ShortCheck: Checkworthiness Detection of Multilingual Short-Form Videos
【速读】: 该论文旨在解决短格式视频平台(如TikTok)中虚假信息检测的难题,这类平台内容具有多模态、动态性和噪声干扰等特点,传统方法难以有效应对。解决方案的关键在于提出一个模块化、仅推理的处理流程ShortCheck,其核心能力包括语音转录、光学字符识别(OCR)、物体与深度伪造检测、视频到文本摘要生成以及事实核查等模块的集成,并通过用户友好的界面实现对高可疑度短视频的自动筛选,从而辅助人工核查。在双语手动标注数据集上的验证表明,该系统在加权F1分数上超过70%,展现出良好的实用性与准确性。
链接: https://arxiv.org/abs/2509.20467
作者: Henrik Vatndal,Vinay Setty
机构: Factiverse AI(事实宇宙人工智能); University of Stavanger(斯塔万格大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Short-form video platforms like TikTok present unique challenges for misinformation detection due to their multimodal, dynamic, and noisy content. We present ShortCheck, a modular, inference-only pipeline with a user-friendly interface that automatically identifies checkworthy short-form videos to help human fact-checkers. The system integrates speech transcription, OCR, object and deepfake detection, video-to-text summarization, and claim verification. ShortCheck is validated by evaluating it on two manually annotated datasets with TikTok videos in a multilingual setting. The pipeline achieves promising results with F1-weighted score over 70%.
zh
[NLP-86] Document Summarization with Conformal Importance Guarantees NEURIPS2025
【速读】: 该论文旨在解决自动摘要系统在高风险领域(如医疗、法律和金融)中缺乏对关键内容包含可靠保障的问题。现有基于大语言模型(Large Language Models, LLMs)的摘要方法虽性能优越,但无法提供可证明的信息覆盖保证。解决方案的关键在于提出置信度重要性摘要(Conformal Importance Summarization),这是一种基于置信区间预测(conformal prediction)的框架,通过校准句子级重要性得分阈值,在不依赖数据分布假设的前提下,实现用户指定的覆盖率和召回率保证。该方法具有模型无关性、仅需少量校准集,并能无缝集成至现有黑盒LLM中,实验证明其达到了理论保证的信息覆盖水平。
链接: https://arxiv.org/abs/2509.20461
作者: Bruce Kuwahara,Chen-Yuan Lin,Xiao Shi Huang,Kin Kwan Leung,Jullian Arta Yapeter,Ilya Stanevich,Felipe Perez,Jesse C. Cresswell
机构: Signal 1 AI (Signal 1 AI); Layer 6 AI (Layer 6 AI)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: NeurIPS 2025. Code is available at this https URL
Abstract:Automatic summarization systems have advanced rapidly with large language models (LLMs), yet they still lack reliable guarantees on inclusion of critical content in high-stakes domains like healthcare, law, and finance. In this work, we introduce Conformal Importance Summarization, the first framework for importance-preserving summary generation which uses conformal prediction to provide rigorous, distribution-free coverage guarantees. By calibrating thresholds on sentence-level importance scores, we enable extractive document summarization with user-specified coverage and recall rates over critical content. Our method is model-agnostic, requires only a small calibration set, and seamlessly integrates with existing black-box LLMs. Experiments on established summarization benchmarks demonstrate that Conformal Importance Summarization achieves the theoretically assured information coverage rate. Our work suggests that Conformal Importance Summarization can be combined with existing techniques to achieve reliable, controllable automatic summarization, paving the way for safer deployment of AI summarization tools in critical applications. Code is available at this https URL.
zh
[NLP-87] Blueprints of Trust: AI System Cards for End to End Transparency and Governance
【速读】: 该论文旨在解决当前人工智能(AI)系统在开发与部署过程中缺乏透明度和问责机制的问题,尤其在安全性和安全性风险的追踪与沟通方面存在不足。其解决方案的关键在于提出一种名为“危害感知系统卡”(Hazard-Aware System Card, HASC)的新框架,该框架通过整合动态、全面的安全与安全态势记录,引入标准化标识体系(包括新型AI安全危害标识符 AI Safety Hazard ID, ASH ID),以补充现有如CVE等安全漏洞标识体系,从而实现对AI系统中已修复缺陷的清晰一致沟通。HASC作为单一可信信息源,赋能开发者和利益相关方在整个生命周期内做出更明智的安全决策,并可与ISO/IEC 42001:2023标准协同使用,进一步提升AI系统的透明度与问责性。
链接: https://arxiv.org/abs/2509.20394
作者: Huzaifa Sidhpurwala,Emily Fox,Garth Mollett,Florencio Cano Gabarda,Roman Zhukov
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:
Abstract:This paper introduces the Hazard-Aware System Card (HASC), a novel framework designed to enhance transparency and accountability in the development and deployment of AI systems. The HASC builds upon existing model card and system card concepts by integrating a comprehensive, dynamic record of an AI system’s security and safety posture. The framework proposes a standardized system of identifiers, including a novel AI Safety Hazard (ASH) ID, to complement existing security identifiers like CVEs, allowing for clear and consistent communication of fixed flaws. By providing a single, accessible source of truth, the HASC empowers developers and stakeholders to make more informed decisions about AI system safety throughout its lifecycle. Ultimately, we also compare our proposed AI system cards with the ISO/IEC 42001:2023 standard and discuss how they can be used to complement each other, providing greater transparency and accountability for AI systems.
zh
[NLP-88] USB-Rec: An Effective Framework for Improving Conversational Recommendation Capability of Large Language Model RECSYS’25
【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的对话推荐系统(Conversational Recommender Systems, CRSs)中普遍存在的训练环节被忽视的问题。现有方法主要聚焦于利用LLMs的总结与分析能力,而未充分挖掘其在训练阶段的优化潜力。为此,作者提出了一种集成式训练-推理框架User-Simulator-Based framework (USB-Rec),其核心创新在于:首先设计了一种基于LLM的偏好优化(Preference Optimization, PO)数据集构建策略,用于强化学习(Reinforcement Learning, RL)训练,使LLMs能够学习到对话推荐中的策略与方法;其次,在推理阶段引入自增强策略(Self-Enhancement Strategy, SES),以进一步释放RL训练所获得的对话推荐潜力。实验表明,该方法在多个数据集上均显著优于现有最先进方法。
链接: https://arxiv.org/abs/2509.20381
作者: Jianyu Wen,Jingyun Wang,Cilin Yan,Jiayin Cai,Xiaolong Jiang,Ying Zhang
机构: Harbin Institute of Technology Shenzhen (哈尔滨工业大学深圳); Beihang University (北京航空航天大学); Xiaohongshu Inc. (小红书公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by Recsys’25
Abstract:Recently, Large Language Models (LLMs) have been widely employed in Conversational Recommender Systems (CRSs). Unlike traditional language model approaches that focus on training, all existing LLMs-based approaches are mainly centered around how to leverage the summarization and analysis capabilities of LLMs while ignoring the issue of training. Therefore, in this work, we propose an integrated training-inference framework, User-Simulator-Based framework (USB-Rec), for improving the performance of LLMs in conversational recommendation at the model level. Firstly, we design a LLM-based Preference Optimization (PO) dataset construction strategy for RL training, which helps the LLMs understand the strategies and methods in conversational recommendation. Secondly, we propose a Self-Enhancement Strategy (SES) at the inference stage to further exploit the conversational recommendation potential obtained from RL training. Extensive experiments on various datasets demonstrate that our method consistently outperforms previous state-of-the-art methods.
zh
[NLP-89] Leverag ing NTPs for Efficient Hallucination Detection in VLMs
【速读】: 该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)中常见的幻觉问题,即生成文本与输入视觉内容不一致的现象,这会显著降低VLM的可靠性。其解决方案的关键在于提出一种基于VLM的下一个词概率(Next-Token Probabilities, NTPs)的轻量级、实时检测方法:利用NTPs作为模型不确定性的直接量化指标,训练传统机器学习模型来预测幻觉。实验表明,NTP特征能有效识别幻觉,且结合仅用生成文本回传VLM计算的语言学NTPs及VLM自身的幻觉预测分数,可进一步提升检测性能,从而实现比单纯依赖强VLM更高效可靠的幻觉检测方案。
链接: https://arxiv.org/abs/2509.20379
作者: Ofir Azachi,Kfir Eliyahu,Eyal El Ani,Rom Himelstein,Roi Reichart,Yuval Pinter,Nitay Calderon
机构: Technion - Israel Institute of Technology (以色列理工学院); Ben-Gurion University of the Negev (内盖夫本-古里安大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Hallucinations of vision-language models (VLMs), which are misalignments between visual content and generated text, undermine the reliability of VLMs. One common approach for detecting them employs the same VLM, or a different one, to assess generated outputs. This process is computationally intensive and increases model latency. In this paper, we explore an efficient on-the-fly method for hallucination detection by training traditional ML models over signals based on the VLM’s next-token probabilities (NTPs). NTPs provide a direct quantification of model uncertainty. We hypothesize that high uncertainty (i.e., a low NTP value) is strongly associated with hallucinations. To test this, we introduce a dataset of 1,400 human-annotated statements derived from VLM-generated content, each labeled as hallucinated or not, and use it to test our NTP-based lightweight method. Our results demonstrate that NTP-based features are valuable predictors of hallucinations, enabling fast and simple ML models to achieve performance comparable to that of strong VLMs. Furthermore, augmenting these NTPs with linguistic NTPs, computed by feeding only the generated text back into the VLM, enhances hallucination detection performance. Finally, integrating hallucination prediction scores from VLMs into the NTP-based models led to better performance than using either VLMs or NTPs alone. We hope this study paves the way for simple, lightweight solutions that enhance the reliability of VLMs.
zh
[NLP-90] Beyond Global Emotion: Fine-Grained Emotional Speech Synthesis with Dynamic Word-Level Modulation
【速读】: 该论文旨在解决现有情感语音合成(Emotional Text-to-Speech, E-TTS)系统在句子内部情感动态变化建模上的不足,即当前方法主要依赖句级控制(如预定义标签、参考音频或自然语言提示),难以捕捉单句内情绪的细粒度波动。其解决方案的关键在于提出Emo-FiLM框架,通过将emotion2vec模型提取的帧级情感特征对齐至词级别,并利用特征自适应调制(Feature-wise Linear Modulation, FiLM)层直接调节文本嵌入,从而实现词级别的精细情感控制;同时构建了包含情感过渡标注的细粒度情感动态数据集(Fine-grained Emotion Dynamics Dataset, FEDD)以支持评估。
链接: https://arxiv.org/abs/2509.20378
作者: Sirui Wang,Andong Chen,Tiejun Zhao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Emotional text-to-speech (E-TTS) is central to creating natural and trustworthy human-computer interaction. Existing systems typically rely on sentence-level control through predefined labels, reference audio, or natural language prompts. While effective for global emotion expression, these approaches fail to capture dynamic shifts within a sentence. To address this limitation, we introduce Emo-FiLM, a fine-grained emotion modeling framework for LLM-based TTS. Emo-FiLM aligns frame-level features from emotion2vec to words to obtain word-level emotion annotations, and maps them through a Feature-wise Linear Modulation (FiLM) layer, enabling word-level emotion control by directly modulating text embeddings. To support evaluation, we construct the Fine-grained Emotion Dynamics Dataset (FEDD) with detailed annotations of emotional transitions. Experiments show that Emo-FiLM outperforms existing approaches on both global and fine-grained tasks, demonstrating its effectiveness and generality for expressive speech synthesis.
zh
[NLP-91] SKILL-RAG : Self-Knowledge Induced Learning and Filtering for Retrieval-Augmented Generation
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)中因检索到无关内容而导致模型幻觉的问题,从而提升知识密集型任务下的生成质量与效率。其解决方案的关键在于引入“自知识”(self-knowledge)概念,即明确识别大语言模型对哪些信息已掌握、哪些尚未掌握,并基于此设计一种基于强化学习的训练框架,以显式地激发模型的自知识能力;在此基础上,采用句级粒度对检索结果进行筛选,保留有助于回答问题的内容,剔除冗余或无关信息,从而实现更精准的外部知识融合与内部知识协同。
链接: https://arxiv.org/abs/2509.20377
作者: Tomoaki Isoda
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval-Augmented Generation (RAG) has significantly improved the performance of large language models (LLMs) on knowledge-intensive tasks in recent years. However, since retrieval systems may return irrelevant content, incorporating such information into the model often leads to hallucinations. Thus, identifying and filtering out unhelpful retrieved content is a key challenge for improving RAG this http URL better integrate the internal knowledge of the model with external knowledge from retrieval, it is essential to understand what the model “knows” and “does not know” (which is also called “self-knowledge”). Based on this insight, we propose SKILL-RAG (Self-Knowledge Induced Learning and Filtering for RAG), a novel method that leverages the model’s self-knowledge to determine which retrieved documents are beneficial for answering a given query. We design a reinforcement learning-based training framework to explicitly elicit self-knowledge from the model and employs sentence-level granularity to filter out irrelevant content while preserving useful this http URL evaluate SKILL-RAG using Llama2-7B and Qwen3-8B on several question answering benchmarks. Experimental results demonstrate that SKILL-RAG not only improves generation quality but also significantly reduces the number of input documents, validating the importance of self-knowledge in guiding the selection of high-quality retrievals.
zh
[NLP-92] ConceptViz: A Visual Analytics Approach for Exploring Concepts in Large Language Models
【速读】: 该论文旨在解决稀疏自编码器(Sparse Autoencoders, SAE)提取的特征与人类可理解概念之间缺乏对齐的问题,从而提升大语言模型(Large Language Models, LLMs)内部知识表示的可解释性。其解决方案的关键在于提出 ConceptViz——一个视觉分析系统,该系统通过“识别 = 解释 = 验证”(Identification = Interpretation = Validation)的新颖流程,使用户能够以感兴趣的概念为查询入口,交互式探索概念与SAE特征之间的映射关系,并通过模型行为验证来确认对应关系的有效性,显著提升了LLM中概念表征的发现效率与可信度。
链接: https://arxiv.org/abs/2509.20376
作者: Haoxuan Li,Zhen Wen,Qiqi Jiang,Chenxiao Li,Yuwei Wu,Yuchen Yang,Yiyao Wang,Xiuqi Huang,Minfeng Zhu,Wei Chen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks. Understanding how LLMs internally represent knowledge remains a significant challenge. Despite Sparse Autoencoders (SAEs) have emerged as a promising technique for extracting interpretable features from LLMs, SAE features do not inherently align with human-understandable concepts, making their interpretation cumbersome and labor-intensive. To bridge the gap between SAE features and human concepts, we present ConceptViz, a visual analytics system designed for exploring concepts in LLMs. ConceptViz implements a novel dentification = Interpretation = Validation pipeline, enabling users to query SAEs using concepts of interest, interactively explore concept-to-feature alignments, and validate the correspondences through model behavior verification. We demonstrate the effectiveness of ConceptViz through two usage scenarios and a user study. Our results show that ConceptViz enhances interpretability research by streamlining the discovery and validation of meaningful concept representations in LLMs, ultimately aiding researchers in building more accurate mental models of LLM features. Our code and user guide are publicly available at this https URL.
zh
[NLP-93] Assessing Classical Machine Learning and Transformer-based Approaches for Detecting AI-Generated Research Text
【速读】: 该论文旨在解决生成式 AI(Generative AI)文本与人类撰写文本之间界限模糊所引发的学术诚信、知识产权及虚假信息传播等问题,核心目标是开发可靠的AI文本检测方法以保障人类创作的真实性并重建数字通信中的信任。解决方案的关键在于系统评估多种机器学习(ML)技术在识别ChatGPT-3.5生成的研究摘要方面的性能,包括传统方法(如基于词袋模型、词性标注和TF-IDF特征的逻辑回归)与基于Transformer架构的方法(如DistilBERT、BERT自定义分类器及LSTM-N-gram模型),结果表明DistilBERT表现最优,且单一高性能模型(如DistilBERT)优于模型集成方案,凸显了高质量表示学习在AI文本检测中的决定性作用。
链接: https://arxiv.org/abs/2509.20375
作者: Sharanya Parimanoharan,Ruwan D. Nawarathna
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid adoption of large language models (LLMs) such as ChatGPT has blurred the line between human and AI-generated texts, raising urgent questions about academic integrity, intellectual property, and the spread of misinformation. Thus, reliable AI-text detection is needed for fair assessment to safeguard human authenticity and cultivate trust in digital communication. In this study, we investigate how well current machine learning (ML) approaches can distinguish ChatGPT-3.5-generated texts from human-written texts employing a labeled data set of 250 pairs of abstracts from a wide range of research topics. We test and compare both classical (Logistic Regression armed with classical Bag-of-Words, POS, and TF-IDF features) and transformer-based (BERT augmented with N-grams, DistilBERT, BERT with a lightweight custom classifier, and LSTM-based N-gram models) ML detection techniques. As we aim to assess each model’s performance in detecting AI-generated research texts, we also aim to test whether an ensemble of these models can outperform any single detector. Results show DistilBERT achieves the overall best performance, while Logistic Regression and BERT-Custom offer solid, balanced alternatives; LSTM- and BERT-N-gram approaches lag. The max voting ensemble of the three best models fails to surpass DistilBERT itself, highlighting the primacy of a single transformer-based representation over mere model diversity. By comprehensively assessing the strengths and weaknesses of these AI-text detection approaches, this work lays a foundation for more robust transformer frameworks with larger, richer datasets to keep pace with ever-improving generative AI models.
zh
[NLP-94] CFD-LLM Bench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在自动化复杂物理系统数值实验中的应用潜力尚未被充分探索的问题,尤其是在计算流体力学(Computational Fluid Dynamics, CFD)这一长期依赖人工干预且劳动密集的领域。其解决方案的关键在于提出一个名为CFDLLMBench的基准测试套件,该套件由三个互补组件构成:CFDQuery(评估研究生级别的CFD知识)、CFDCodeBench(衡量CFD数值与物理推理能力)和FoamBench(检验上下文相关的CFD工作流实现),并通过真实世界CFD实践构建详尽的任务分类体系与严谨的评估框架,从而系统性地量化LLM在代码可执行性、解的准确性及数值收敛行为等方面的性能表现,为基于LLM的复杂物理系统数值实验自动化提供可复现的评估基础。
链接: https://arxiv.org/abs/2509.20374
作者: Nithin Somasekharan,Ling Yue,Yadi Cao,Weichao Li,Patrick Emami,Pochinapeddi Sai Bhargav,Anurag Acharya,Xingyu Xie,Shaowu Pan
机构: Rensselaer Polytechnic Institute (伦斯勒理工学院); University of California San Diego (加州大学圣地亚哥分校); Indian Institute of Science (印度科学研究所); Pacific Northwest National Laboratory (太平洋西北国家实验室); National Renewable Energy Laboratory (国家可再生能源实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) have demonstrated strong performance across general NLP tasks, but their utility in automating numerical experiments of complex physical system – a critical and labor-intensive component – remains underexplored. As the major workhorse of computational science over the past decades, Computational Fluid Dynamics (CFD) offers a uniquely challenging testbed for evaluating the scientific capabilities of LLMs. We introduce CFDLLMBench, a benchmark suite comprising three complementary components – CFDQuery, CFDCodeBench, and FoamBench – designed to holistically evaluate LLM performance across three key competencies: graduate-level CFD knowledge, numerical and physical reasoning of CFD, and context-dependent implementation of CFD workflows. Grounded in real-world CFD practices, our benchmark combines a detailed task taxonomy with a rigorous evaluation framework to deliver reproducible results and quantify LLM performance across code executability, solution accuracy, and numerical convergence behavior. CFDLLMBench establishes a solid foundation for the development and evaluation of LLM-driven automation of numerical experiments for complex physical systems. Code and data are available at this https URL.
zh
[NLP-95] Speaker Style-Aware Phoneme Anchoring for Improved Cross-Lingual Speech Emotion Recognition
【速读】: 该论文旨在解决跨语言语音情感识别(Cross-lingual Speech Emotion Recognition, SER)中的关键挑战,即不同语言间发音差异(phonetic variability)和说话人特有表达风格(speaker-specific expressive styles)对情感表征一致性的影响。为实现跨语言的情感迁移与泛化,其解决方案的核心在于提出一种说话人风格感知的音素锚定框架(speaker-style aware phoneme anchoring framework),通过图聚类构建情绪相关的说话人社区以捕捉共享的说话人特征,并在说话人空间和音素空间中实施双空间锚定(dual-space anchoring),从而有效对齐不同语言和说话人之间的感情表达模式,显著提升跨语言场景下的情感识别性能。
链接: https://arxiv.org/abs/2509.20373
作者: Shreya G. Upadhyay,Carlos Busso,Chi-Chun Lee
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Cross-lingual speech emotion recognition (SER) remains a challenging task due to differences in phonetic variability and speaker-specific expressive styles across languages. Effectively capturing emotion under such diverse conditions requires a framework that can align the externalization of emotions across different speakers and languages. To address this problem, we propose a speaker-style aware phoneme anchoring framework that aligns emotional expression at the phonetic and speaker levels. Our method builds emotion-specific speaker communities via graph-based clustering to capture shared speaker traits. Using these groups, we apply dual-space anchoring in speaker and phonetic spaces to enable better emotion transfer across languages. Evaluations on the MSP-Podcast (English) and BIIC-Podcast (Taiwanese Mandarin) corpora demonstrate improved generalization over competitive baselines and provide valuable insights into the commonalities in cross-lingual emotion representation.
zh
[NLP-96] Interpreting Public Sentiment in Diplomacy Events: A Counterfactual Analysis Framework Using Large Language Models
【速读】: 该论文旨在解决传统公共情绪评估方法(如大规模问卷调查或人工媒体内容分析)在时效性、效率及前瞻性分析能力上的不足,从而难以有效支持外交政策实施与国际形象塑造的问题。其解决方案的关键在于构建一个基于语言模型的框架,通过识别特定文本特征修改以改变外交事件叙事框架,进而引导公众情绪从负面转向中性或正面;具体而言,研究首先训练语言模型预测公众对外交事件的反应,随后结合传播学理论与领域专家意见预设可调整的文本特征,并开发一种反事实生成算法,利用大语言模型系统生成原始文本的改写版本,在保持事件核心事实不变的前提下实现叙事重构,最终实现在70%成功率下显著改善公众情绪。
链接: https://arxiv.org/abs/2509.20367
作者: Leyi Ouyang
机构: South China Agricultural University Zhujiang College (华南农业大学珠江学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 2 Figures, 7 Tables, 1 Algorithm
Abstract:Diplomatic events consistently prompt widespread public discussion and debate. Public sentiment plays a critical role in diplomacy, as a good sentiment provides vital support for policy implementation, helps resolve international issues, and shapes a nation’s international image. Traditional methods for gauging public sentiment, such as large-scale surveys or manual content analysis of media, are typically time-consuming, labor-intensive, and lack the capacity for forward-looking analysis. We propose a novel framework that identifies specific modifications for diplomatic event narratives to shift public sentiment from negative to neutral or positive. First, we train a language model to predict public reaction towards diplomatic events. To this end, we construct a dataset comprising descriptions of diplomatic events and their associated public discussions. Second, guided by communication theories and in collaboration with domain experts, we predetermined several textual features for modification, ensuring that any alterations changed the event’s narrative framing while preserving its core this http URL develop a counterfactual generation algorithm that employs a large language model to systematically produce modified versions of an original text. The results show that this framework successfully shifted public sentiment to a more favorable state with a 70% success rate. This framework can therefore serve as a practical tool for diplomats, policymakers, and communication specialists, offering data-driven insights on how to frame diplomatic initiatives or report on events to foster a more desirable public sentiment.
zh
[NLP-97] CARINOX: Inference-time Scaling with Category-Aware Reward-based Initial Noise Optimization and Exploration
【速读】: 该论文旨在解决文本到图像扩散模型(如Stable Diffusion)在生成图像时难以实现复杂语义组合对齐的问题,尤其是在描述对象关系、属性或空间排列等复合语义时表现不佳。其核心挑战在于现有推理阶段优化或探索方法各自存在局限:优化策略易因初始噪声质量差或搜索路径不利而陷入局部最优,而探索方法则可能需要大量采样才能找到满意结果;此外,单一奖励指标或随意组合无法全面捕捉组合性特征,导致引导效果不稳定。解决方案的关键在于提出一种统一框架CARINOX(Category-Aware Reward-based Initial Noise Optimization and Exploration),通过结合噪声优化与探索,并引入基于与人类判断相关性的奖励选择机制,实现更可靠且高效的组合对齐提升,在多个基准测试中显著优于当前最先进方法。
链接: https://arxiv.org/abs/2509.17458
作者: Seyed Amir Kasaei,Ali Aghayari,Arash Marioriyad,Niki Sepasian,Shayan Baghayi Nejad,MohammadAmin Fazli,Mahdieh Soleymani Baghshah,Mohammad Hossein Rohban
机构: Sharif University of Technology (谢里夫理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Text-to-image diffusion models, such as Stable Diffusion, can produce high-quality and diverse images but often fail to achieve compositional alignment, particularly when prompts describe complex object relationships, attributes, or spatial arrangements. Recent inference-time approaches address this by optimizing or exploring the initial noise under the guidance of reward functions that score text-image alignment without requiring model fine-tuning. While promising, each strategy has intrinsic limitations when used alone: optimization can stall due to poor initialization or unfavorable search trajectories, whereas exploration may require a prohibitively large number of samples to locate a satisfactory output. Our analysis further shows that neither single reward metrics nor ad-hoc combinations reliably capture all aspects of compositionality, leading to weak or inconsistent guidance. To overcome these challenges, we present Category-Aware Reward-based Initial Noise Optimization and Exploration (CARINOX), a unified framework that combines noise optimization and exploration with a principled reward selection procedure grounded in correlation with human judgments. Evaluations on two complementary benchmarks covering diverse compositional challenges show that CARINOX raises average alignment scores by +16% on T2I-CompBench++ and +11% on the HRS benchmark, consistently outperforming state-of-the-art optimization and exploration-based methods across all major categories, while preserving image quality and diversity. The project page is available at this https URLthis URL.
zh
[NLP-98] Intercept Cancer: Cancer Pre-Screening with Large Scale Healthcare Foundation Models
【速读】: 该论文旨在解决当前癌症筛查技术因成本高、侵入性强及全球可及性差而导致大量本可挽救的生命流失的问题。其解决方案的关键在于提出CATCH-FM(CATch Cancer early with Healthcare Foundation Models),一种基于患者历史电子健康记录(Electronic Health Records, EHR)的癌症早期预筛方法,通过在医疗编码序列上预训练大规模基础模型(最大达24亿参数),并结合临床专家标注的风险预测数据集进行微调,实现对高风险人群的精准识别。该方法在3万例回顾性评估中展现出60%的敏感性与99%的特异性及阴性预测值,显著优于传统特征工程树模型、通用和医学大语言模型,并在不同人群分布和医疗系统背景下表现出鲁棒性,验证了在ICD编码空间建模对捕捉非显性癌症风险因素的有效性。
链接: https://arxiv.org/abs/2506.00209
作者: Liwen Sun,Hao-Ren Yao,Gary Gao,Ophir Frieder,Chenyan Xiong
机构: Language Technologies Institute (语言技术研究所); School of Computer Science (计算机科学学院); Carnegie Mellon University (卡内基梅隆大学); Department of Computer Science (计算机科学系); Georgetown University (乔治城大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Cancer screening, leading to early detection, saves lives. Unfortunately, existing screening techniques require expensive and intrusive medical procedures, not globally available, resulting in too many lost would-be-saved lives. We present CATCH-FM, CATch Cancer early with Healthcare Foundation Models, a cancer pre-screening methodology that identifies high-risk patients for further screening solely based on their historical medical records. With millions of electronic healthcare records (EHR), we establish the scaling law of EHR foundation models pretrained on medical code sequences, pretrain compute-optimal foundation models of up to 2.4 billion parameters, and finetune them on clinician-curated cancer risk prediction cohorts. In our retrospective evaluation comprising of thirty thousand patients, CATCH-FM achieved strong efficacy (60% sensitivity) with low risk (99% specificity and Negative Predictive Value), outperforming feature-based tree models as well as general and medical large language models by large margins. Despite significant demographic, healthcare system, and EHR coding differences, CATCH-FM achieves state-of-the-art pancreatic cancer risk prediction on the EHRSHOT few-shot leaderboard, outperforming EHR foundation models pretrained using on-site patient data. Our analysis demonstrates the robustness of CATCH-FM in various patient distributions, the benefits of operating in the ICD code space, and its ability to capture non-trivial cancer risk factors. Our code will be open-sourced.
zh
计算机视觉
[CV-0] SD3.5-Flash: Distribution-Guided Distillation of Generative Flows
【速读】:该论文旨在解决高质图像生成模型在消费级设备上部署时面临的计算资源限制问题,尤其是如何在有限的算力和内存条件下实现高效、快速且高质量的图像生成。其解决方案的关键在于提出了一种名为SD3.5-Flash的少步数蒸馏框架,通过重构分布匹配目标以适配少步生成任务,并引入两项核心技术:一是“时间步共享”(timestep sharing)以降低梯度噪声,二是“分时间步微调”(split-timestep fine-tuning)以增强提示对齐能力;同时结合文本编码器重构与专用量化等管线优化策略,实现了跨硬件平台的快速生成与内存高效部署,从而将先进的生成式AI能力普及至从移动设备到桌面计算机的全场景终端。
链接: https://arxiv.org/abs/2509.21318
作者: Hmrishav Bandyopadhyay,Rahim Entezari,Jim Scott,Reshinth Adithyan,Yi-Zhe Song,Varun Jampani
机构: Stability AI(Stability AI); SketchX, University of Surrey(草图X,萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project Page: this https URL
Abstract:We present SD3.5-Flash, an efficient few-step distillation framework that brings high-quality image generation to accessible consumer devices. Our approach distills computationally prohibitive rectified flow models through a reformulated distribution matching objective tailored specifically for few-step generation. We introduce two key innovations: “timestep sharing” to reduce gradient noise and “split-timestep fine-tuning” to improve prompt alignment. Combined with comprehensive pipeline optimizations like text encoder restructuring and specialized quantization, our system enables both rapid generation and memory-efficient deployment across different hardware configurations. This democratizes access across the full spectrum of devices, from mobile phones to desktop computers. Through extensive evaluation including large-scale user studies, we demonstrate that SD3.5-Flash consistently outperforms existing few-step methods, making advanced generative AI truly accessible for practical deployment.
zh
[CV-1] NewtonGen: Physics-Consistent and Controllable Text-to-Video Generation via Neural Newtonian Dynamics
【速读】:该论文旨在解决大规模文本到视频生成中物理一致性与可控性不足的问题,即当前先进模型常产生不现实的运动现象(如物体向上坠落或速度和方向突变),且缺乏对不同初始条件下动力学行为的精确参数控制能力。其解决方案的关键在于提出NewtonGen框架,该框架通过将数据驱动的合成方法与可学习的物理原理相结合,核心创新是引入可训练的神经牛顿动力学(Neural Newtonian Dynamics, NND),能够建模和预测多种牛顿运动,从而在视频生成过程中注入潜在的动力学约束,实现物理一致性的视频合成与精准的参数调控。
链接: https://arxiv.org/abs/2509.21309
作者: Yu Yuan,Xijun Wang,Tharindu Wickremasinghe,Zeeshan Nadir,Bole Ma,Stanley H. Chan
机构: Purdue University (普渡大学); Samsung Research America (三星研究美国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: All data and code is available at this https URL
Abstract:A primary bottleneck in large-scale text-to-video generation today is physical consistency and controllability. Despite recent advances, state-of-the-art models often produce unrealistic motions, such as objects falling upward, or abrupt changes in velocity and direction. Moreover, these models lack precise parameter control, struggling to generate physically consistent dynamics under different initial conditions. We argue that this fundamental limitation stems from current models learning motion distributions solely from appearance, while lacking an understanding of the underlying dynamics. In this work, we propose NewtonGen, a framework that integrates data-driven synthesis with learnable physical principles. At its core lies trainable Neural Newtonian Dynamics (NND), which can model and predict a variety of Newtonian motions, thereby injecting latent dynamical constraints into the video generation process. By jointly leveraging data priors and dynamical guidance, NewtonGen enables physically consistent video synthesis with precise parameter control.
zh
[CV-2] Quantized Visual Geometry Grounded Transformer
【速读】:该论文旨在解决大规模生成式3D重建模型(如VGGT)在部署时面临的计算与内存开销过大的问题,尤其是后训练量化(Post-Training Quantization, PTQ)在压缩此类模型时所遭遇的独特挑战:数据无关的特殊标记导致激活分布呈现重尾特性,且三维数据的多视角特性使得校准样本选择不稳定。解决方案的关键在于提出首个针对VGGT的量化框架QuantVGGT,其核心贡献包括:1)双平滑细粒度量化(Dual-Smoothed Fine-Grained Quantization),通过预全局Hadamard旋转与后局部通道平滑协同缓解重尾分布和通道间方差;2)噪声过滤多样化采样(Noise-Filtered Diverse Sampling),基于深层统计过滤异常值并构建帧感知的多样化校准聚类,从而稳定量化范围。实验证明,QuantVGGT在多个基准上均达到当前最优性能,4-bit量化可在真实硬件上实现3.7倍内存减少和2.5倍加速,同时保持超过全精度模型98%的重建精度,显著提升了资源受限场景下的实用性。
链接: https://arxiv.org/abs/2509.21302
作者: Weilun Feng,Haotong Qin,Mingqiang Wu,Chuanguang Yang,Yuqi Li,Xiangqi Li,Zhulin An,Libo Huang,Yulun Zhang,Michele Magno,Yongjun Xu
机构: Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); University of Chinese Academy of Sciences (中国科学院大学); ETH Zürich (苏黎世联邦理工学院); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Learning-based 3D reconstruction models, represented by Visual Geometry Grounded Transformers (VGGTs), have made remarkable progress with the use of large-scale transformers. Their prohibitive computational and memory costs severely hinder real-world deployment. Post-Training Quantization (PTQ) has become a common practice for compressing and accelerating models. However, we empirically observe that PTQ faces unique obstacles when compressing billion-scale VGGTs: the data-independent special tokens induce heavy-tailed activation distributions, while the multi-view nature of 3D data makes calibration sample selection highly unstable. This paper proposes the first Quantization framework for VGGTs, namely QuantVGGT. This mainly relies on two technical contributions: First, we introduce Dual-Smoothed Fine-Grained Quantization, which integrates pre-global Hadamard rotation and post-local channel smoothing to mitigate heavy-tailed distributions and inter-channel variance robustly. Second, we design Noise-Filtered Diverse Sampling, which filters outliers via deep-layer statistics and constructs frame-aware diverse calibration clusters to ensure stable quantization ranges. Comprehensive experiments demonstrate that QuantVGGT achieves the state-of-the-art results across different benchmarks and bit-width, surpassing the previous state-of-the-art generic quantization method with a great margin. We highlight that our 4-bit QuantVGGT can deliver a 3.7 \times memory reduction and 2.5 \times acceleration in real-hardware inference, while maintaining reconstruction accuracy above 98% of its full-precision counterpart. This demonstrates the vast advantages and practicality of QuantVGGT in resource-constrained scenarios. Our code is released in this https URL.
zh
[CV-3] VC-Agent : An Interactive Agent for Customized Video Dataset Collection
【速读】:该论文旨在解决个性化视频数据集收集过程中因依赖大量人工标注和筛选而导致的效率低下问题。当前随着生成式 AI(Generative AI)对大规模视频数据的需求增长,从互联网获取满足特定需求的视频片段变得尤为关键,但传统方法高度依赖人力,难以快速响应用户意图。解决方案的关键在于提出 VC-Agent——首个能够理解用户查询与反馈并据此自动检索或扩展相关视频片段的交互式智能代理。其核心创新包括:1)设计多种基于文本描述和确认的用户友好交互方式;2)利用多模态大语言模型(Multimodal Large Language Models)实现用户需求与视频内容的语义对齐;3)引入两种可随用户持续交互动态更新的过滤策略,显著提升采集效率与精准度。实验表明,VC-Agent在真实场景中能高效完成定制化视频数据集构建任务。
链接: https://arxiv.org/abs/2509.21291
作者: Yidan Zhang,Mutian Xu,Yiming Hao,Kun Zhou,Jiahao Chang,Xiaoqiang Liu,Pengfei Wan,Hongbo Fu,Xiaoguang Han
机构: The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)); Kuaishou Technology(快手科技); The Hong Kong University of Science and Technology(香港科技大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Facing scaling laws, video data from the internet becomes increasingly important. However, collecting extensive videos that meet specific needs is extremely labor-intensive and time-consuming. In this work, we study the way to expedite this collection process and propose VC-Agent, the first interactive agent that is able to understand users’ queries and feedback, and accordingly retrieve/scale up relevant video clips with minimal user input. Specifically, considering the user interface, our agent defines various user-friendly ways for the user to specify requirements based on textual descriptions and confirmations. As for agent functions, we leverage existing multi-modal large language models to connect the user’s requirements with the video content. More importantly, we propose two novel filtering policies that can be updated when user interaction is continually performed. Finally, we provide a new benchmark for personalized video dataset collection, and carefully conduct the user study to verify our agent’s usage in various real scenarios. Extensive experiments demonstrate the effectiveness and efficiency of our agent for customized video dataset collection. Project page: this https URL.
zh
[CV-4] Does FLUX Already Know How to Perform Physically Plausible Image Composition?
【速读】:该论文旨在解决图像合成(Image Composition)中对象插入时面临的两大挑战:一是复杂光照条件下的真实感表现(如阴影、水反射等),二是高分辨率输入下生成质量下降和可见接缝问题。现有模型依赖潜在空间反演(latent inversion)或脆弱的注意力机制手术(attention surgery),易导致对象姿态不自然或输出不稳定。其解决方案的关键在于提出一种无需训练(training-free)的框架SHINE,通过引入流形引导锚定损失(manifold-steered anchor loss),利用预训练定制适配器(如IP-Adapter)在潜在空间中引导对象表示,同时保持背景完整性;并结合退化抑制引导(degradation-suppression guidance)与自适应背景融合策略,有效消除低质输出与接缝痕迹,从而实现高保真、无缝的对象插入。
链接: https://arxiv.org/abs/2509.21278
作者: Shilin Lu,Zhuming Lian,Zihan Zhou,Shaocong Zhang,Chen Zhao,Adams Wai-Kin Kong
机构: Nanyang Technological University (南洋理工大学); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint
Abstract:Image composition aims to seamlessly insert a user-specified object into a new scene, but existing models struggle with complex lighting (e.g., accurate shadows, water reflections) and diverse, high-resolution inputs. Modern text-to-image diffusion models (e.g., SD3.5, FLUX) already encode essential physical and resolution priors, yet lack a framework to unleash them without resorting to latent inversion, which often locks object poses into contextually inappropriate orientations, or brittle attention surgery. We propose SHINE, a training-free framework for Seamless, High-fidelity Insertion with Neutralized Errors. SHINE introduces manifold-steered anchor loss, leveraging pretrained customization adapters (e.g., IP-Adapter) to guide latents for faithful subject representation while preserving background integrity. Degradation-suppression guidance and adaptive background blending are proposed to further eliminate low-quality outputs and visible seams. To address the lack of rigorous benchmarks, we introduce ComplexCompo, featuring diverse resolutions and challenging conditions such as low lighting, strong illumination, intricate shadows, and reflective surfaces. Experiments on ComplexCompo and DreamEditBench show state-of-the-art performance on standard metrics (e.g., DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward). Code and benchmark will be publicly available upon publication.
zh
[CV-5] A Sentinel-3 foundation model for ocean colour
【速读】:该论文旨在解决海洋科学中因标注数据稀疏且获取成本高而导致的AI模型性能受限问题。解决方案的关键在于构建一种基于Prithvi-EO Vision Transformer架构的生成式AI(Generative AI)基础模型(Foundation Model, FM),该模型在Sentinel-3 Ocean and Land Colour Instrument(OLCI)数据上进行自监督预训练,从而学习到丰富的海洋颜色空间特征表示。通过微调(fine-tuning)该模型,在两个下游海洋遥感任务——叶绿素浓度估算和海洋初级生产力估计中均展现出优于现有基线模型的性能,尤其在小样本高质量标注数据条件下仍能捕捉精细的空间模式并匹配地面观测点,证明了此类地学基础模型在提升海洋生态系统监测精度与稳健性方面的潜力。
链接: https://arxiv.org/abs/2509.21273
作者: Geoffrey Dawson,Remy Vandaele,Andrew Taylor,David Moffat,Helen Tamura-Wicks,Sarah Jackson,Rosie Lickorish,Paolo Fraccaro,Hywel Williams,Chunbo Luo,Anne Jones
机构: IBM Research Europe(IBM研究欧洲); University of Exeter(埃克塞特大学); STFC Hartree Centre(英国科学与技术设施委员会哈特里中心); Plymouth Marine Laboratory(普利茅斯海洋实验室); National Center for Earth Observation(国家地球观测中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 8 figures
Abstract:Artificial Intelligence (AI) Foundation models (FMs), pre-trained on massive unlabelled datasets, have the potential to drastically change AI applications in ocean science, where labelled data are often sparse and expensive to collect. In this work, we describe a new foundation model using the Prithvi-EO Vision Transformer architecture which has been pre-trained to reconstruct data from the Sentinel-3 Ocean and Land Colour Instrument (OLCI). We evaluate the model by fine-tuning on two downstream marine earth observation tasks. We first assess model performance compared to current baseline models used to quantify chlorophyll concentration. We then evaluate the FMs ability to refine remote sensing-based estimates of ocean primary production. Our results demonstrate the utility of self-trained FMs for marine monitoring, in particular for making use of small amounts of high quality labelled data and in capturing detailed spatial patterns of ocean colour whilst matching point observations. We conclude that this new generation of geospatial AI models has the potential to provide more robust, data-driven insights into ocean ecosystems and their role in global climate processes.
zh
[CV-6] MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources
【速读】:该论文旨在解决大模型在长链式思维(long chain-of-thought, CoT)推理能力提升过程中面临的两大核心挑战:一是缺乏高质量、大规模、开放的长CoT数据资源,二是强化学习(reinforcement learning, RL)后训练阶段算法的不稳定性,特别是Group Relative Policy Optimization(GRPO)在奖励方差较低时易出现梯度消失问题,导致优化信号弱化和收敛困难。解决方案的关键在于提出一种基于奖励方差感知的数据采样策略——Variance-Aware Sampling(VAS),其通过引入Variance Promotion Score(VPS)来联合优化轨迹多样性与结果方差,从而增强奖励方差并稳定策略优化过程;同时,作者开源了约160万条高质量长CoT冷启动数据和1.5万条RL问答对,并提供完整的可复现训练代码库与多尺度多模态推理模型,为社区建立标准化基线。理论分析进一步证明奖励方差下界控制期望策略梯度幅度,VAS作为其实用实现机制有效保障了这一理论保证。
链接: https://arxiv.org/abs/2509.21268
作者: Sicong Leng,Jing Wang,Jiaxi Li,Hao Zhang,Zhiqiang Hu,Boqiang Zhang,Yuming Jiang,Hang Zhang,Xin Li,Lidong Bing,Deli Zhao,Wei Lu,Yu Rong,Aixin Sun,Shijian Lu
机构: Nanyang Technological University (南洋理工大学); DAMO Academy, Alibaba Group (阿里达摩院); Singapore University of Technology and Design (新加坡科技设计大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large multimodal reasoning models have achieved rapid progress, but their advancement is constrained by two major limitations: the absence of open, large-scale, high-quality long chain-of-thought (CoT) data, and the instability of reinforcement learning (RL) algorithms in post-training. Group Relative Policy Optimization (GRPO), the standard framework for RL fine-tuning, is prone to gradient vanishing when reward variance is low, which weakens optimization signals and impairs convergence. This work makes three contributions: (1) We propose Variance-Aware Sampling (VAS), a data selection strategy guided by Variance Promotion Score (VPS) that combines outcome variance and trajectory diversity to promote reward variance and stabilize policy optimization. (2) We release large-scale, carefully curated resources containing ~1.6M long CoT cold-start data and ~15k RL QA pairs, designed to ensure quality, difficulty, and diversity, along with a fully reproducible end-to-end training codebase. (3) We open-source a family of multimodal reasoning models in multiple scales, establishing standardized baselines for the community. Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS. Comprehensive ablation studies and analyses provide further insight into the contributions of each component. In addition, we theoretically establish that reward variance lower-bounds the expected policy gradient magnitude, with VAS serving as a practical mechanism to realize this guarantee. Our code, data, and checkpoints are available at this https URL.
zh
[CV-7] MedVSR: Medical Video Super-Resolution with Cross State-Space Propagation ICCV2025
【速读】:该论文旨在解决低分辨率(Low-Resolution, LR)医学视频在临床应用中因相机抖动、噪声及帧间突变导致的视频超分辨率(Video Super-Resolution, VSR)重建难题,这些问题常引发光学流估计误差和对齐困难,并使现有VSR模型易引入伪影和结构失真,从而误导医生诊断。解决方案的关键在于提出MedVSR框架:其一,采用跨状态空间传播(Cross State-Space Propagation, CSSP)机制,通过将远距离帧作为状态空间模型中的控制矩阵,实现一致且信息丰富的特征选择性传播至邻近帧,提升对齐精度;其二,设计内部状态空间重构(Inner State-Space Reconstruction, ISSR)模块,结合长程空间特征学习与大核短程信息聚合,增强组织结构细节并抑制伪影,从而显著提升重建质量与效率。
链接: https://arxiv.org/abs/2509.21265
作者: Xinyu Liu,Guolei Sun,Cheng Wang,Yixuan Yuan,Ender Konukoglu
机构: The Chinese University of Hong Kong (香港中文大学); Computer Vision Laboratory, ETH Zurich (苏黎世联邦理工学院计算机视觉实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ICCV 2025
Abstract:High-resolution (HR) medical videos are vital for accurate diagnosis, yet are hard to acquire due to hardware limitations and physiological constraints. Clinically, the collected low-resolution (LR) medical videos present unique challenges for video super-resolution (VSR) models, including camera shake, noise, and abrupt frame transitions, which result in significant optical flow errors and alignment difficulties. Additionally, tissues and organs exhibit continuous and nuanced structures, but current VSR models are prone to introducing artifacts and distorted features that can mislead doctors. To this end, we propose MedVSR, a tailored framework for medical VSR. It first employs Cross State-Space Propagation (CSSP) to address the imprecise alignment by projecting distant frames as control matrices within state-space models, enabling the selective propagation of consistent and informative features to neighboring frames for effective alignment. Moreover, we design an Inner State-Space Reconstruction (ISSR) module that enhances tissue structures and reduces artifacts with joint long-range spatial feature learning and large-kernel short-range information aggregation. Experiments across four datasets in diverse medical scenarios, including endoscopy and cataract surgeries, show that MedVSR significantly outperforms existing VSR models in reconstruction performance and efficiency. Code released at this https URL.
zh
[CV-8] Dense Semantic Matching with VGGT Prior
【速读】:该论文旨在解决现有语义匹配方法在几何歧义(Geometric Ambiguity)和最近邻规则(Nearest-Neighbor Rule)方面的局限性:前者因依赖二维基础模型特征难以区分对称结构,且泛化能力弱;后者忽略跨图像不可见性并破坏流形结构。解决方案的关键在于引入VGGT(Vision Geometric Grounded Transformer)作为基础架构,并通过三项改进实现适应性优化:(i)保留VGGT早期特征阶段以利用其几何感知能力,微调后期层并增加语义头以支持双向对应关系;(ii)在数据稀缺条件下,采用循环一致性训练策略、合成数据增强与渐进式训练方案(含混叠伪影缓解机制),使模型适配跨实例语义匹配任务。实验表明,该方法显著提升了几何感知能力、匹配可靠性及流形保持性能。
链接: https://arxiv.org/abs/2509.21263
作者: Songlin Yang,Tianyi Wei,Yushi Lan,Zeqi Xiao,Anyi Rao,Xingang Pan
机构: S-Lab, Nanyang Technological University (南洋理工大学); MMLab@HKUST, The Hong Kong University of Science and Technology (香港科技大学); Visual Geometry Group, University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Semantic matching aims to establish pixel-level correspondences between instances of the same category and represents a fundamental task in computer vision. Existing approaches suffer from two limitations: (i) Geometric Ambiguity: Their reliance on 2D foundation model features (e.g., Stable Diffusion, DINO) often fails to disambiguate symmetric structures, requiring extra fine-tuning yet lacking generalization; (ii) Nearest-Neighbor Rule: Their pixel-wise matching ignores cross-image invisibility and neglects manifold preservation. These challenges call for geometry-aware pixel descriptors and holistic dense correspondence mechanisms. Inspired by recent advances in 3D geometric foundation models, we turn to VGGT, which provides geometry-grounded features and holistic dense matching capabilities well aligned with these needs. However, directly transferring VGGT is challenging, as it was originally designed for geometry matching within cross views of a single instance, misaligned with cross-instance semantic matching, and further hindered by the scarcity of dense semantic annotations. To address this, we propose an approach that (i) retains VGGT’s intrinsic strengths by reusing early feature stages, fine-tuning later ones, and adding a semantic head for bidirectional correspondences; and (ii) adapts VGGT to the semantic matching scenario under data scarcity through cycle-consistent training strategy, synthetic data augmentation, and progressive training recipe with aliasing artifact mitigation. Extensive experiments demonstrate that our approach achieves superior geometry awareness, matching reliability, and manifold preservation, outperforming previous baselines.
zh
[CV-9] Every Subtlety Counts: Fine-grained Person Independence Micro-Action Recognition via Distributionally Robust Optimization
【速读】:该论文旨在解决微动作识别(Micro-action Recognition)在真实场景中因个体差异导致的泛化能力不足问题,即同一动作在不同人身上表现形式各异,从而影响模型的鲁棒性。其解决方案的关键在于提出了一种人无关的通用微动作识别框架(Person Independence Universal Micro-action Recognition Framework),通过引入分布鲁棒优化(Distributionally Robust Optimization, DRO)思想,从特征和损失两个层面实现对个体特异性差异的建模与消除:在特征层面,设计了时频对齐模块(Temporal-Frequency Alignment Module),利用Wasserstein正则化对齐动态轨迹并引入方差引导扰动以增强频域鲁棒性;在损失层面,采用分组不变正则化损失(Group-Invariant Regularized Loss),通过伪分组模拟未见个体分布,并通过对边界样本加权与子组方差正则化,促使模型超越易样本或高频样本的过拟合,从而提升对细粒度变化的泛化能力。
链接: https://arxiv.org/abs/2509.21261
作者: Feng-Qi Cui,Jinyang Huang,Anyang Tong,Ziyu Jia,Jie Zhang,Zhi Liu,Dan Guo,Jianwei Lu,Meng Wang
机构: University of Science and Technology of China (中国科学技术大学); Hefei University of Technology (合肥工业大学); Brainnetome Center, Institute of Automation, Chinese Academy of Sciences (中科院自动化所脑网络组中心); The University of Electro-Communications (电波通信大学); Shanghai University of Traditional Chinese Medicine (上海中医药大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Micro-action Recognition is vital for psychological assessment and human-computer interaction. However, existing methods often fail in real-world scenarios because inter-person variability causes the same action to manifest differently, hindering robust generalization. To address this, we propose the Person Independence Universal Micro-action Recognition Framework, which integrates Distributionally Robust Optimization principles to learn person-agnostic representations. Our framework contains two plug-and-play components operating at the feature and loss levels. At the feature level, the Temporal-Frequency Alignment Module normalizes person-specific motion characteristics with a dual-branch design: the temporal branch applies Wasserstein-regularized alignment to stabilize dynamic trajectories, while the frequency branch introduces variance-guided perturbations to enhance robustness against person-specific spectral differences. A consistency-driven fusion mechanism integrates both branches. At the loss level, the Group-Invariant Regularized Loss partitions samples into pseudo-groups to simulate unseen person-specific distributions. By up-weighting boundary cases and regularizing subgroup variance, it forces the model to generalize beyond easy or frequent samples, thus enhancing robustness to difficult variations. Experiments on the large-scale MA-52 dataset demonstrate that our framework outperforms existing methods in both accuracy and robustness, achieving stable generalization under fine-grained conditions.
zh
[CV-10] Instruction-tuned Self-Questioning Framework for Multimodal Reasoning ICCV2023
【速读】:该论文旨在解决视觉语言理解任务中多步推理能力不足的问题,尤其是在简单问题上仍难以实现准确推理的局限性。现有方法虽尝试通过迭代生成子问题和子答案来提升性能,但存在两个关键缺陷:一是大型语言模型(LLM)无法直接访问图像的细粒度视觉信息;二是黑箱特性导致内部机制不可控且难以复现。为此,作者提出SQ-InstructBLIP框架,其核心创新在于设计了一个由Questioner、Answerer和Reasoner组成的协同架构,三者共享相同网络结构,其中Questioner与Answerer生成与图像相关的有信息量的子问题和子答案以辅助主问题推理,而Reasoner则基于这些子问题信息对主问题进行综合推理,从而显著提升了视觉问答(VQA)任务中的推理准确性。
链接: https://arxiv.org/abs/2509.21251
作者: You-Won Jang,Yu-Jung Heo,Jaeseok Kim,Minsu Lee,Du-Seong Chang,Byoung-Tak Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This paper was accepted to the “CLVL: 5th Workshop on Closing the Loop Between Vision and Language (ICCV 2023 CLVL workshop).”
Abstract:The field of vision-language understanding has been actively researched in recent years, thanks to the development of Large Language Models~(LLMs). However, it still needs help with problems requiring multi-step reasoning, even for very simple questions. Recent studies adopt LLMs to tackle this problem by iteratively generating sub-questions and answers. However, there are disadvantages such as 1) the fine-grained visual contents of images are not available using LLMs that cannot read visual information, 2) internal mechanisms are inaccessible and difficult to reproduce by using black-box LLMs. To solve these problems, we propose the SQ (Self-Questioning)-InstructBLIP, which improves inference performance by generating image-aware informative sub-questions and sub-answers iteratively. The SQ-InstructBLIP, which consists of a Questioner, Answerer, and Reasoner that share the same architecture. Questioner and Answerer generate sub-questions and sub-answers to help infer the main-question, and Reasoner performs reasoning on the main-question considering the generated sub-question information. Our experiments show that the proposed method SQ-InstructBLIP, which uses the generated sub-questions as additional information when solving the VQA task, performs more accurate reasoning than the previous works.
zh
[CV-11] Decipher-MR: A Vision-Language Foundation Model for 3D MRI Representations
【速读】:该论文旨在解决医学磁共振成像(MRI)在自动化分析中面临的复杂性和异质性问题,尤其是在可扩展和泛化能力强的机器学习应用方面。现有基础模型虽在自然语言处理和计算机视觉领域取得突破,但其在MRI领域的应用受限于数据稀缺和解剖区域聚焦过窄。解决方案的关键在于提出Decipher-MR——一个专为3D MRI设计的视觉-语言基础模型,通过大规模训练数据(涵盖20万例MRI序列、超2.2万例研究,覆盖多种解剖部位、扫描序列及病理类型),结合自监督视觉学习与报告引导的文本监督机制,构建具有鲁棒性和泛化能力的表征。此外,其模块化设计支持轻量级任务特定解码器在冻结预训练编码器上的灵活微调,从而在低计算开销下实现多样临床任务的有效适配,包括疾病分类、人口统计学预测、解剖定位和跨模态检索等,显著优于现有基础模型和专用方法。
链接: https://arxiv.org/abs/2509.21249
作者: Zhijian Yang,Noel DSouza,Istvan Megyeri,Xiaojian Xu,Amin Honarmandi Shandiz,Farzin Haddadpour,Krisztian Koos,Laszlo Rusko,Emanuele Valeriano,Bharadwaj Swaninathan,Lei Wu,Parminder Bhatia,Taha Kass-Hout,Erhan Bas
机构: GE Healthcare(通用电气医疗); GE Healthcare(通用电气医疗)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Magnetic Resonance Imaging (MRI) is a critical medical imaging modality in clinical diagnosis and research, yet its complexity and heterogeneity pose challenges for automated analysis, particularly in scalable and generalizable machine learning applications. While foundation models have revolutionized natural language and vision tasks, their application to MRI remains limited due to data scarcity and narrow anatomical focus. In this work, we present Decipher-MR, a 3D MRI-specific vision-language foundation model trained on a large-scale dataset comprising 200,000 MRI series from over 22,000 studies spanning diverse anatomical regions, sequences, and pathologies. Decipher-MR integrates self-supervised vision learning with report-guided text supervision to build robust, generalizable representations, enabling effective adaptation across broad applications. To enable robust and diverse clinical tasks with minimal computational overhead, Decipher-MR supports a modular design that enables tuning of lightweight, task-specific decoders attached to a frozen pretrained encoder. Following this setting, we evaluate Decipher-MR across diverse benchmarks including disease classification, demographic prediction, anatomical localization, and cross-modal retrieval, demonstrating consistent performance gains over existing foundation models and task-specific approaches. Our results establish Decipher-MR as a scalable and versatile foundation for MRI-based AI, facilitating efficient development across clinical and research domains.
zh
[CV-12] Learning to Look: Cognitive Attention Alignment with Vision-Language Models NEURIPS
【速读】:该论文旨在解决卷积神经网络(Convolutional Neural Networks, CNNs)在图像分类任务中依赖表面相关性(superficial correlations)进行预测的问题,即模型可能因“作弊”而未能学习到真正具有判别性的特征,从而影响其鲁棒性和可解释性。解决方案的关键在于提出一种可扩展的框架,利用视觉-语言模型(vision-language models)通过自然语言提示(natural language prompts)自动生成语义注意力图(semantic attention maps),并引入一个辅助损失函数,使CNN的注意力机制与这些由语言引导的注意力图对齐,从而促使模型做出更可靠、符合人类认知直觉的决策,且无需人工标注,显著提升了方法的实用性与泛化能力。
链接: https://arxiv.org/abs/2509.21247
作者: Ryan L. Yang,Dipkamal Bhusal,Nidhi Rastogi
机构: Brown University (布朗大学); Rochester Institute of Technology (罗切斯特理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 7 pages, neurips workshop
Abstract:Convolutional Neural Networks (CNNs) frequently “cheat” by exploiting superficial correlations, raising concerns about whether they make predictions for the right reasons. Inspired by cognitive science, which highlights the role of attention in robust human perception, recent methods have sought to guide model attention using concept-based supervision and explanation regularization. However, these techniques depend on labor-intensive, expert-provided annotations, limiting their scalability. We propose a scalable framework that leverages vision-language models to automatically generate semantic attention maps using natural language prompts. By introducing an auxiliary loss that aligns CNN attention with these language-guided maps, our approach promotes more reliable and cognitively plausible decision-making without manual annotation. Experiments on challenging datasets, ColoredMNIST and DecoyMNIST, show that our method achieves state-of-the-art performance on ColorMNIST and remains competitive with annotation-heavy baselines on DecoyMNIST, demonstrating improved generalization, reduced shortcut reliance, and model attention that better reflects human intuition.
zh
[CV-13] Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets
【速读】:该论文旨在解决当前3D生成模型主要依赖图像或文本条件而缺乏细粒度、跨模态控制的问题,从而限制了生成结果的可控性和实际应用。其解决方案的关键在于提出Hunyuan3D-Omni框架,该框架基于Hunyuan3D 2.1构建,能够统一接受点云、体素、边界框和骨骼姿态等多种条件信号,并通过单一的跨模态架构实现对几何形状、拓扑结构和姿态的精确控制;此外,采用一种渐进式、难度感知的采样策略,在训练中优先选择较难的条件信号(如骨骼姿态),并降低简单信号(如点云)的权重,以增强多模态融合能力并提升对缺失输入的鲁棒性。
链接: https://arxiv.org/abs/2509.21245
作者: Team Hunyuan3D:Bowen Zhang,Chunchao Guo,Haolin Liu,Hongyu Yan,Huiwen Shi,Jingwei Huang,Junlin Yu,Kunhong Li,Linus,Penghao Wang,Qingxiang Lin,Sicong Liu,Xianghui Yang,Yixuan Tang,Yunfei Zhao,Zeqiang Lai,Zhihao Liang,Zibo Zhao
机构: Tencent Hunyuan3D
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Technical Report; 3D Generation
Abstract:Recent advances in 3D-native generative models have accelerated asset creation for games, film, and design. However, most methods still rely primarily on image or text conditioning and lack fine-grained, cross-modal controls, which limits controllability and practical adoption. To address this gap, we present Hunyuan3D-Omni, a unified framework for fine-grained, controllable 3D asset generation built on Hunyuan3D 2.1. In addition to images, Hunyuan3D-Omni accepts point clouds, voxels, bounding boxes, and skeletal pose priors as conditioning signals, enabling precise control over geometry, topology, and pose. Instead of separate heads for each modality, our model unifies all signals in a single cross-modal architecture. We train with a progressive, difficulty-aware sampling strategy that selects one control modality per example and biases sampling toward harder signals (e.g., skeletal pose) while downweighting easier ones (e.g., point clouds), encouraging robust multi-modal fusion and graceful handling of missing inputs. Experiments show that these additional controls improve generation accuracy, enable geometry-aware transformations, and increase robustness for production workflows.
zh
[CV-14] SlideMamba: Entropy-Based Adaptive Fusion of GNN and Mamba for Enhanced Representation Learning in Digital Pathology
【速读】:该论文旨在解决全切片图像(Whole Slide Images, WSI)分析中难以同时捕捉局部空间关系与长程上下文依赖的问题,从而提升计算病理学任务的性能。其解决方案的关键在于提出了一种融合Mamba架构与图神经网络(Graph Neural Networks, GNNs)的通用深度学习框架SlideMamba,通过引入基于熵的自适应融合策略,动态加权来自Mamba(擅长建模长程全局依赖)和GNN(侧重细粒度局部空间交互)分支的预测结果,依据不同下游任务对局部与全局信息的依赖程度自动调整权重,实现互补信息的有效整合。该方法在基因融合与突变状态预测任务上显著优于现有主流模型,验证了其在空间解析预测建模中的有效性与泛化潜力。
链接: https://arxiv.org/abs/2509.21239
作者: Shakib Khan,Fariba Dambandkhameneh,Nazim Shaikh,Yao Nie,Raghavan Venugopal,Xiao Li
机构: Roche Diagnostic Solutions(罗氏诊断解决方案); Pathology Lab, Canada(加拿大病理实验室); Pathology Lab, USA(美国病理实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注:
Abstract:Advances in computational pathology increasingly rely on extracting meaningful representations from Whole Slide Images (WSIs) to support various clinical and biological tasks. In this study, we propose a generalizable deep learning framework that integrates the Mamba architecture with Graph Neural Networks (GNNs) for enhanced WSI analysis. Our method is designed to capture both local spatial relationships and long-range contextual dependencies, offering a flexible architecture for digital pathology analysis. Mamba modules excels in capturing long-range global dependencies, while GNNs emphasize fine-grained short-range spatial interactions. To effectively combine these complementary signals, we introduce an adaptive fusion strategy that uses an entropy-based confidence weighting mechanism. This approach dynamically balances contributions from both branches by assigning higher weight to the branch with more confident (lower-entropy) predictions, depending on the contextual importance of local versus global information for different downstream tasks. We demonstrate the utility of our approach on a representative task: predicting gene fusion and mutation status from WSIs. Our framework, SlideMamba, achieves an area under the precision recall curve (PRAUC) of 0.751 \pm 0.05, outperforming MIL (0.491 \pm 0.042), Trans-MIL (0.39 \pm 0.017), Mamba-only (0.664 \pm 0.063), GNN-only (0.748 \pm 0.091), and a prior similar work GAT-Mamba (0.703 \pm 0.075). SlideMamba also achieves competitive results across ROC AUC (0.738 \pm 0.055), sensitivity (0.662 \pm 0.083), and specificity (0.725 \pm 0.094). These results highlight the strength of the integrated architecture, enhanced by the proposed entropy-based adaptive fusion strategy, and suggest promising potential for application of spatially-resolved predictive modeling tasks in computational pathology.
zh
[CV-15] Learning Conformal Explainers for Image Classifiers
【速读】:该论文旨在解决图像预测解释方法中解释结果的鲁棒性不足与忠实性差的问题,即现有特征归因方法(feature attribution methods)生成的解释可能无法真实反映黑箱模型的决策逻辑。其解决方案的关键在于提出一种基于合规模型(conformal prediction)的新方法,通过引入四种一致性函数(conformity functions)来量化解释与模型预测的一致性,从而识别出一个足以维持原模型预测结果的显著特征子集,且无需依赖真实标签进行校准。该方法使用户能够直接控制解释的忠实度(fidelity),实验证明FastSHAP在保持高忠实度的同时显著提升了信息效率(以解释区域大小衡量)。
链接: https://arxiv.org/abs/2509.21209
作者: Amr Alkhatib,Stephanie Lowry
机构: Örebro University (奥勒布罗大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Feature attribution methods are widely used for explaining image-based predictions, as they provide feature-level insights that can be intuitively visualized. However, such explanations often vary in their robustness and may fail to faithfully reflect the reasoning of the underlying black-box model. To address these limitations, we propose a novel conformal prediction-based approach that enables users to directly control the fidelity of the generated explanations. The method identifies a subset of salient features that is sufficient to preserve the model’s prediction, regardless of the information carried by the excluded features, and without demanding access to ground-truth explanations for calibration. Four conformity functions are proposed to quantify the extent to which explanations conform to the model’s predictions. The approach is empirically evaluated using five explainers across six image datasets. The empirical results demonstrate that FastSHAP consistently outperforms the competing methods in terms of both fidelity and informational efficiency, the latter measured by the size of the explanation regions. Furthermore, the results reveal that conformity measures based on super-pixels are more effective than their pixel-wise counterparts.
zh
[CV-16] Differential-Integral Neural Operator for Long-Term Turbulence Forecasting
【速读】:该论文旨在解决湍流长期演化预测这一科学计算中的重大挑战,现有深度学习方法(尤其是神经算子)在长时间自回归预测中常因误差累积和物理保真度丧失而失效,其根源在于无法同时捕捉湍流动力学中局部耗散效应与全局非局域相互作用这两种不同的数学结构。解决方案的关键在于提出一种基于算子分解的全新框架——差分-积分神经算子(Differential-Integral Neural Operator, \method),该框架通过并行分支显式建模两类物理算子:一个由约束卷积网络实现的局部微分算子(可证明收敛至导数),以及一个由Transformer架构捕获的数据驱动全局积分算子(学习非局域核)。这种物理启发式的分解策略赋予模型卓越的稳定性和鲁棒性,在二维Kolmogorov流动基准测试中显著优于当前最先进模型,成功抑制数百时间步内的误差累积,并保持涡量场和能量谱的高保真度。
链接: https://arxiv.org/abs/2509.21196
作者: Hao Wu,Yuan Gao,Fan Xu,Fan Zhang,Qingsong Wen,Kun Wang,Xiaomeng Huang,Xian Wu
机构: Tsinghua University (清华大学); SLAI; CUHK (香港中文大学); Squirrel Ai Learning (Squirrel AI); NTU (南洋理工大学); Tencent (腾讯)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurately forecasting the long-term evolution of turbulence represents a grand challenge in scientific computing and is crucial for applications ranging from climate modeling to aerospace engineering. Existing deep learning methods, particularly neural operators, often fail in long-term autoregressive predictions, suffering from catastrophic error accumulation and a loss of physical fidelity. This failure stems from their inability to simultaneously capture the distinct mathematical structures that govern turbulent dynamics: local, dissipative effects and global, non-local interactions. In this paper, we propose the \textbf\underlineDifferential-\textbf\underlineIntegral \textbf\underlineNeural \textbf\underlineOperator (\method), a novel framework designed from a first-principles approach of operator decomposition. \method explicitly models the turbulent evolution through parallel branches that learn distinct physical operators: a local differential operator, realized by a constrained convolutional network that provably converges to a derivative, and a global integral operator, captured by a Transformer architecture that learns a data-driven global kernel. This physics-based decomposition endows \method with exceptional stability and robustness. Through extensive experiments on the challenging 2D Kolmogorov flow benchmark, we demonstrate that \method significantly outperforms state-of-the-art models in long-term forecasting. It successfully suppresses error accumulation over hundreds of timesteps, maintains high fidelity in both the vorticity fields and energy spectra, and establishes a new benchmark for physically consistent, long-range turbulence forecast.
zh
[CV-17] Human-like Navigation in a World Built for Humans
【速读】:该论文旨在解决现有机器人导航系统在大型复杂环境中效率低下问题,因其缺乏人类在陌生场景中通过读取标识、询问他人等行为进行高效导航的能力。解决方案的关键在于提出一个名为ReasonNav的模块化导航系统,该系统利用视觉语言模型(Vision-Language Model, VLM)的推理能力,通过基于导航地标设计的紧凑输入与输出抽象机制,使VLM能够专注于语言理解与高层推理,从而实现类似人类的高效路径规划与环境探索。
链接: https://arxiv.org/abs/2509.21189
作者: Bhargav Chandaka,Gloria X. Wang,Haozhe Chen,Henry Che,Albert J. Zhai,Shenlong Wang
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: CoRL 2025. Project website: this https URL
Abstract:When navigating in a man-made environment they haven’t visited before–like an office building–humans employ behaviors such as reading signs and asking others for directions. These behaviors help humans reach their destinations efficiently by reducing the need to search through large areas. Existing robot navigation systems lack the ability to execute such behaviors and are thus highly inefficient at navigating within large environments. We present ReasonNav, a modular navigation system which integrates these human-like navigation skills by leveraging the reasoning capabilities of a vision-language model (VLM). We design compact input and output abstractions based on navigation landmarks, allowing the VLM to focus on language understanding and reasoning. We evaluate ReasonNav on real and simulated navigation tasks and show that the agent successfully employs higher-order reasoning to navigate efficiently in large, complex buildings.
zh
[CV-18] Can Less Precise Be More Reliable? A Systematic Evaluation of Quantizations Impact on CLIP Beyond Accuracy
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)如CLIP在实际部署中面临的效率与可靠性权衡问题,尤其是量化(quantization)对模型性能的多维影响尚未被充分理解,特别是在准确率之外的校准(calibration)和分布外检测(out-of-distribution detection, OOD detection)等关键可靠性指标上的表现。解决方案的关键在于通过大规模量化评估揭示了量化不仅可能提升校准能力(尤其对原本欠自信的预训练模型),还能在某些情况下即使导致校准性能下降,仍可改善OOD检测效果;更重要的是,识别出特定的量化感知训练(Quantization-Aware Training, QAT)方法能够在零样本准确率、校准能力和OOD鲁棒性上实现协同提升,从而挑战了传统认为效率与性能不可兼得的观点。
链接: https://arxiv.org/abs/2509.21173
作者: Aymen Bouguerra,Daniel Montoya,Alexandra Gomez-Villa,Fabio Arnez,Chokri Mraidha
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The powerful zero-shot generalization capabilities of vision-language models (VLMs) like CLIP have enabled new paradigms for safety-related tasks such as out-of-distribution (OOD) detection. However, additional aspects crucial for the computationally efficient and reliable deployment of CLIP are still overlooked. In particular, the impact of quantization on CLIP’s performance beyond accuracy remains underexplored. This work presents a large-scale evaluation of quantization on CLIP models, assessing not only in-distribution accuracy but a comprehensive suite of reliability metrics and revealing counterintuitive results driven by pre-training source. We demonstrate that quantization consistently improves calibration for typically underconfident pre-trained models, while often degrading it for overconfident variants. Intriguingly, this degradation in calibration does not preclude gains in other reliability metrics; we find that OOD detection can still improve for these same poorly calibrated models. Furthermore, we identify specific quantization-aware training (QAT) methods that yield simultaneous gains in zero-shot accuracy, calibration, and OOD robustness, challenging the view of a strict efficiency-performance trade-off. These findings offer critical insights for navigating the multi-objective problem of deploying efficient, reliable, and robust VLMs by utilizing quantization beyond its conventional role.
zh
[CV-19] A Unified Framework for Diffusion Model Unlearning with f-Divergence
【速读】:该论文旨在解决生成式 AI(Generative AI)模型中特定知识难以有效移除的问题,特别是针对文本到图像(text-to-image, T2I)扩散模型(diffusion models, DMs)的可遗忘性(machine unlearning)挑战。现有方法通常依赖于最小化目标概念与锚定概念输出分布之间的均方误差(mean squared error, MSE),但该方法受限于其收敛性和对概念保留能力的不足。论文提出了一种基于 f-散度(f-divergence)的统一框架,将MSE视为一种特例,并系统分析不同 f-散度在算法收敛速度和遗忘质量之间的权衡关系。该框架的关键在于提供了一个灵活且可配置的范式,允许根据具体应用场景选择最优的散度度量,从而在激进遗忘与概念保留之间实现更优平衡。
链接: https://arxiv.org/abs/2509.21167
作者: Nicola Novello,Federico Fontana,Luigi Cinque,Deniz Gunduz,Andrea M. Tonello
机构: University of Klagenfurt (克拉根福大学); Sapienza University of Rome (罗马第一大学); Imperial College London (帝国理工学院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Machine unlearning aims to remove specific knowledge from a trained model. While diffusion models (DMs) have shown remarkable generative capabilities, existing unlearning methods for text-to-image (T2I) models often rely on minimizing the mean squared error (MSE) between the output distribution of a target and an anchor concept. We show that this MSE-based approach is a special case of a unified f -divergence-based framework, in which any f -divergence can be utilized. We analyze the benefits of using different f -divergences, that mainly impact the convergence properties of the algorithm and the quality of unlearning. The proposed unified framework offers a flexible paradigm that allows to select the optimal divergence for a specific application, balancing different trade-offs between aggressive unlearning and concept preservation.
zh
[CV-20] WAVECLIP: Wavelet Tokenization for Adaptive-Resolution CLIP
【速读】:该论文旨在解决CLIP模型在推理过程中无法灵活适应不同计算资源与精度需求的问题,即缺乏对多分辨率输入和动态计算量控制的支持。解决方案的关键在于提出WAVECLIP,其通过小波(wavelet)基的token化机制替代标准的图像块嵌入(patch embeddings),实现图像从粗到细的多级分解处理;同时利用键值缓存(key-value caching)和因果跨层注意力(causal cross-level attention)机制复用已有计算,仅在必要时引入新信息,从而支持基于置信度的自适应提前退出(adaptive early exits),在单一部署模型中实现计算效率与准确率之间的动态权衡。
链接: https://arxiv.org/abs/2509.21153
作者: Moshe Kimhi,Erez Koifman,Ehud Rivlin,Eli Schwartz,Chaim Baskin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:
Abstract:We introduce WAVECLIP, a single unified model for adaptive resolution inference in CLIP, enabled by wavelet-based tokenization. WAVECLIP replaces standard patch embeddings with a multi-level wavelet decomposition, enabling the model to process images coarse to fine while naturally supporting multiple resolutions within the same model. At inference time, the model begins with low resolution tokens and refines only when needed, using key-value caching and causal cross-level attention to reuse computation, effectively introducing to the model only new information when needed. We evaluate WAVECLIP in zero-shot classification, demonstrating that a simple confidence-based gating mechanism enables adaptive early exits. This allows users to dynamically choose a compute-accuracy trade-off using a single deployed model. Our approach requires only lightweight distillation from a frozen CLIP teacher and achieves competitive accuracy with significant computational savings.
zh
[CV-21] he Unwinnable Arms Race of AI Image Detection
【速读】:该论文旨在解决生成式 AI (Generative AI) 与判别器(discriminator)在图像合成与检测对抗中,判别器为何在某些情况下处于劣势的问题。其解决方案的关键在于揭示数据维度(data dimensionality)与数据复杂度(data complexity)的交互作用:当数据集过于简单或过于复杂时,生成器能够较好地拟合或掩盖缺陷,从而降低判别器的检测能力;而中间复杂度的数据集最有利于判别器识别合成图像的细微不一致,因为此时生成器难以完全学习真实分布,其错误仍可被察觉。
链接: https://arxiv.org/abs/2509.21135
作者: Till Aczel,Lorenzo Vettor,Andreas Plesner,Roger Wattenhofer
机构: ETH Zürich (苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:The rapid progress of image generative AI has blurred the boundary between synthetic and real images, fueling an arms race between generators and discriminators. This paper investigates the conditions under which discriminators are most disadvantaged in this competition. We analyze two key factors: data dimensionality and data complexity. While increased dimensionality often strengthens the discriminators ability to detect subtle inconsistencies, complexity introduces a more nuanced effect. Using Kolmogorov complexity as a measure of intrinsic dataset structure, we show that both very simple and highly complex datasets reduce the detectability of synthetic images; generators can learn simple datasets almost perfectly, whereas extreme diversity masks imperfections. In contrast, intermediate-complexity datasets create the most favorable conditions for detection, as generators fail to fully capture the distribution and their errors remain visible.
zh
[CV-22] Sparse Representations Improve Adversarial Robustness of Neural Network Classifiers
【速读】:该论文旨在解决深度神经网络在图像分类任务中对精心设计的对抗扰动(adversarial perturbations)仍具脆弱性的问题。其解决方案的关键在于采用线性降维作为简单且数据自适应的防御机制,具体比较了标准主成分分析(Principal Component Analysis, PCA)与稀疏主成分分析(Sparse Principal Component Analysis, SPCA)作为下游分类器的前级特征提取器。理论分析表明,SPCA通过稀疏投影降低了对抗杠杆效应(adversarial leverage),从而提升模型的鲁棒性;实验验证了即使在非线性分类器后接SPCA时,模型在强白盒和黑盒攻击下仍能更平滑地退化,同时保持良好的干净准确率。
链接: https://arxiv.org/abs/2509.21130
作者: Killian Steunou,Sigurd Saue,Théo Druilhe
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Killian Steunou is the main contributor and corresponding author of this work
Abstract:Deep neural networks perform remarkably well on image classification tasks but remain vulnerable to carefully crafted adversarial perturbations. This work revisits linear dimensionality reduction as a simple, data-adapted defense. We empirically compare standard Principal Component Analysis (PCA) with its sparse variant (SPCA) as front-end feature extractors for downstream classifiers, and we complement these experiments with a theoretical analysis. On the theory side, we derive exact robustness certificates for linear heads applied to SPCA features: for both \ell_\infty and \ell_2 threat models (binary and multiclass), the certified radius grows as the dual norms of W^\top u shrink, where W is the projection and u the head weights. We further show that for general (non-linear) heads, sparsity reduces operator-norm bounds through a Lipschitz composition argument, predicting lower input sensitivity. Empirically, with a small non-linear network after the projection, SPCA consistently degrades more gracefully than PCA under strong white-box and black-box attacks while maintaining competitive clean accuracy. Taken together, the theory identifies the mechanism (sparser projections reduce adversarial leverage) and the experiments verify that this benefit persists beyond the linear setting. Our code is available at this https URL.
zh
[CV-23] MotionFlow:Learning Implicit Motion Flow for Complex Camera Trajectory Control in Video Generation ICME2025
【速读】:该论文旨在解决在视频生成过程中,如何在同时存在相机运动和物体运动的情况下,保持画面一致性与泛化能力的问题。现有方法通常将相机运动与物体运动分别学习,易导致两者相对运动关系混淆。其解决方案的关键在于将相机运动和物体运动统一转换为对应像素的运动,并利用稳定扩散网络(stable diffusion network)学习与指定相机轨迹相关的参考运动图(reference motion maps),再结合提取的语义对象先验信息输入图像到视频网络(image-to-video network),从而生成能够精确跟随指定相机轨迹且保持物体运动一致性的视频。
链接: https://arxiv.org/abs/2509.21119
作者: Guojun Lei,Chi Wang,Yikai Wang,Hong Li,Ying Song,Weiwei Xu
机构: Zhejiang University (浙江大学); Tsinghua University (清华大学); Beihang University (北京航空航天大学); Zhejiang Gongshang University (浙江工商大学); ShengShu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICME2025
Abstract:Generating videos guided by camera trajectories poses significant challenges in achieving consistency and generalizability, particularly when both camera and object motions are present. Existing approaches often attempt to learn these motions separately, which may lead to confusion regarding the relative motion between the camera and the objects. To address this challenge, we propose a novel approach that integrates both camera and object motions by converting them into the motion of corresponding pixels. Utilizing a stable diffusion network, we effectively learn reference motion maps in relation to the specified camera trajectory. These maps, along with an extracted semantic object prior, are then fed into an image-to-video network to generate the desired video that can accurately follow the designated camera trajectory while maintaining consistent object motions. Extensive experiments verify that our model outperforms SOTA methods by a large margin.
zh
[CV-24] CHARM: Control-point-based 3D Anime Hairstyle Auto-Regressive Modeling SIGGRAPH
【速读】:该论文旨在解决动漫发型(anime hairstyle)建模中传统方法难以高效编辑与学习的问题。现有技术多依赖密集网格或手工设计的样条曲线,无法满足可扩展的学习需求且编辑效率低。其解决方案的关键在于提出一种紧凑、可逆的控制点参数化表示方法——CHARM,将每张发型卡片(hair card)表示为一组控制点序列,每个控制点仅用五个几何参数编码,从而实现艺术家友好的设计与基于学习的生成;在此基础上构建自回归生成框架,利用Transformer模型将动漫发型视为“头发语言”(hair language),有效捕捉局部几何结构与全局发型拓扑,实现高质量生成。
链接: https://arxiv.org/abs/2509.21114
作者: Yuze He,Yanning Zhou,Wang Zhao,Jingwen Ye,Yushi Bai,Kaiwen Xiao,Yong-Jin Liu,Zhongqian Sun,Wei Yang
机构: Tsinghua University (清华大学); Tencent AIPD (腾讯AI平台部)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: SIGGRAPH Asia 2025. 17 pages, 15 figures
Abstract:We present CHARM, a novel parametric representation and generative framework for anime hairstyle modeling. While traditional hair modeling methods focus on realistic hair using strand-based or volumetric representations, anime hairstyle exhibits highly stylized, piecewise-structured geometry that challenges existing techniques. Existing works often rely on dense mesh modeling or hand-crafted spline curves, making them inefficient for editing and unsuitable for scalable learning. CHARM introduces a compact, invertible control-point-based parameterization, where a sequence of control points represents each hair card, and each point is encoded with only five geometric parameters. This efficient and accurate representation supports both artist-friendly design and learning-based generation. Built upon this representation, CHARM introduces an autoregressive generative framework that effectively generates anime hairstyles from input images or point clouds. By interpreting anime hairstyles as a sequential “hair language”, our autoregressive transformer captures both local geometry and global hairstyle topology, resulting in high-fidelity anime hairstyle creation. To facilitate both training and evaluation of anime hairstyle generation, we construct AnimeHair, a large-scale dataset of 37K high-quality anime hairstyles with separated hair cards and processed mesh data. Extensive experiments demonstrate state-of-the-art performance of CHARM in both reconstruction accuracy and generation quality, offering an expressive and scalable solution for anime hairstyle modeling. Project page: this https URL
zh
[CV-25] MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视频推理任务中存在的过程不一致性问题,即模型虽能给出正确答案,但中间推理步骤与视频中的时序动态存在偏离,从而影响模型的可解释性和鲁棒性。解决方案的关键在于提出一种基于动态时间规整(Dynamic Time Warping, DTW)的过程奖励机制,通过将推理轨迹与时间对齐的参考轨迹进行匹配,实现无需辅助奖励模型的高效过程监督。该方法显著提升了推理过程的一致性,并在自建的MOSS-Video基准上验证了其有效性,同时在多个通用视频理解基准上实现了性能提升。
链接: https://arxiv.org/abs/2509.21113
作者: Sicheng Tao,Jungang Li,Yibo Yan,Junyan Zhang,Yubo Gao,Hanqian Li,ShuHang Xun,Yuxuan Fan,Hong Chen,Jianxiang He,Xuming Hu
机构: HKUST (GZ) (香港科技大学(广州)); HKUST (香港科技大学); HIT (哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video reasoning has emerged as a critical capability for multimodal large language models (MLLMs), requiring models to move beyond static perception toward coherent understanding of temporal dynamics in complex scenes. Yet existing MLLMs often exhibit process inconsistency, where intermediate reasoning drifts from video dynamics even when the final answer is correct, undermining interpretability and robustness. To address this issue, we introduce MOSS-ChatV, a reinforcement learning framework with a Dynamic Time Warping (DTW)-based process reward. This rule-based reward aligns reasoning traces with temporally grounded references, enabling efficient process supervision without auxiliary reward models. We further identify dynamic state prediction as a key measure of video reasoning and construct MOSS-Video, a benchmark with annotated reasoning traces, where the training split is used to fine-tune MOSS-ChatV and the held-out split is reserved for evaluation. MOSS-ChatV achieves 87.2% on MOSS-Video (test) and improves performance on general video benchmarks such as MVBench and MMVU. The framework consistently yields gains across different architectures, including Qwen2.5-VL and Phi-2, confirming its broad applicability. Evaluations with GPT-4o-as-judge further show that MOSS-ChatV produces more consistent and stable reasoning traces.
zh
[CV-26] Cross-Modal Instructions for Robot Motion Generation
【速读】:该论文旨在解决机器人学习新行为时依赖物理示教(如遥操作或力觉引导)所导致的数据采集繁琐、扩展性差的问题。传统方法需要大量人工标注的运动轨迹数据,难以规模化部署。其解决方案的关键在于提出“跨模态指令学习”(Learning from Cross-Modal Instructions)范式,通过引入CrossInstruct框架,将自由文本标签等粗略标注作为示例嵌入到基础视觉-语言模型(Vision-Language Model, VLM)的上下文输入中,由VLM驱动一个微调后的细粒度指向模型进行多视角2D运动推理,并最终融合为机器人工作空间中的3D运动轨迹分布。该方法无需额外微调即可生成可执行行为,并为后续强化学习提供高质量策略初始化,显著提升了泛化能力和实用性。
链接: https://arxiv.org/abs/2509.21107
作者: William Barron,Xiaoxiang Dong,Matthew Johnson-Roberson,Weiming Zhi
机构: College of Connected Computing, Vanderbilt University, TN, USA; Robotics Institute, Carnegie Mellon University, PA, USA; School of Computer Science, The University of Sydney, Australia
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Teaching robots novel behaviors typically requires motion demonstrations via teleoperation or kinaesthetic teaching, that is, physically guiding the robot. While recent work has explored using human sketches to specify desired behaviors, data collection remains cumbersome, and demonstration datasets are difficult to scale. In this paper, we introduce an alternative paradigm, Learning from Cross-Modal Instructions, where robots are shaped by demonstrations in the form of rough annotations, which can contain free-form text labels, and are used in lieu of physical motion. We introduce the CrossInstruct framework, which integrates cross-modal instructions as examples into the context input to a foundational vision-language model (VLM). The VLM then iteratively queries a smaller, fine-tuned model, and synthesizes the desired motion over multiple 2D views. These are then subsequently fused into a coherent distribution over 3D motion trajectories in the robot’s workspace. By incorporating the reasoning of the large VLM with a fine-grained pointing model, CrossInstruct produces executable robot behaviors that generalize beyond the environment of in the limited set of instruction examples. We then introduce a downstream reinforcement learning pipeline that leverages CrossInstruct outputs to efficiently learn policies to complete fine-grained tasks. We rigorously evaluate CrossInstruct on benchmark simulation tasks and real hardware, demonstrating effectiveness without additional fine-tuning and providing a strong initialization for policies subsequently refined via reinforcement learning.
zh
[CV-27] Mammo-CLIP Dissect: A Framework for Analysing Mammography Concepts in Vision-Language Models
【速读】:该论文旨在解决深度学习(Deep Learning, DL)模型在乳腺X线摄影(mammography)临床应用中可解释性不足的问题,特别是现有方法多聚焦于像素级解释,而忽视了模型所学习的文本概念(textual concepts),这些概念更贴近放射科医生的推理逻辑。解决方案的关键在于提出首个基于概念的可解释性框架——Mammo-CLIP Dissect,其核心是利用一个专为乳腺X线设计的视觉语言模型(Mammo-CLIP)作为“解剖器”,对卷积神经网络(CNN)特定层的神经元进行人类可理解的文本概念标注,并量化其与领域知识的一致性,从而系统揭示模型如何捕获乳腺影像特异性知识。
链接: https://arxiv.org/abs/2509.21102
作者: Suaiba Amina Salahuddin,Teresa Dorszewski,Marit Almenning Martiniussen,Tone Hovda,Antonio Portaluri,Solveig Thrun,Michael Kampffmeyer,Elisabeth Wetzer,Kristoffer Wickstrøm,Robert Jenssen
机构: UiT The Arctic University of Norway (北极挪威大学); Technical University of Denmark (丹麦技术大学); Østfold Hospital Trust (奥斯弗尔德医院信托); Vestre Viken Hospital Trust (维斯特雷维肯医院信托); Radboud University Nijmegen Medical Centre (奈梅亨大学医疗中心); The Netherlands Cancer Institute (荷兰癌症研究所); Norwegian Computing Center (挪威计算中心); University of Copenhagen (哥本哈根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Understanding what deep learning (DL) models learn is essential for the safe deployment of artificial intelligence (AI) in clinical settings. While previous work has focused on pixel-based explainability methods, less attention has been paid to the textual concepts learned by these models, which may better reflect the reasoning used by clinicians. We introduce Mammo-CLIP Dissect, the first concept-based explainability framework for systematically dissecting DL vision models trained for mammography. Leveraging a mammography-specific vision-language model (Mammo-CLIP) as a “dissector,” our approach labels neurons at specified layers with human-interpretable textual concepts and quantifies their alignment to domain knowledge. Using Mammo-CLIP Dissect, we investigate three key questions: (1) how concept learning differs between DL vision models trained on general image datasets versus mammography-specific datasets; (2) how fine-tuning for downstream mammography tasks affects concept specialisation; and (3) which mammography-relevant concepts remain underrepresented. We show that models trained on mammography data capture more clinically relevant concepts and align more closely with radiologists’ workflows than models not trained on mammography data. Fine-tuning for task-specific classification enhances the capture of certain concept categories (e.g., benign calcifications) but can reduce coverage of others (e.g., density-related features), indicating a trade-off between specialisation and generalisation. Our findings show that Mammo-CLIP Dissect provides insights into how convolutional neural networks (CNNs) capture mammography-specific knowledge. By comparing models across training data and fine-tuning regimes, we reveal how domain-specific training and task-specific adaptation shape concept learning. Code and concept set are available: this https URL.
zh
[CV-28] VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在推理过程中受限于静态感知阶段的问题,从而提升其类人级感知与理解能力。现有方法通常依赖大语言模型(LLM)对已解析的视觉信息进行分析,但缺乏动态迭代优化的能力。解决方案的关键在于提出视觉测试时扩展(Visual Test-Time Scaling, VTTS),通过推理阶段的迭代感知机制(Iterative Perception, ITP)实现对高置信度时空区域的逐步聚焦,并结合强化学习与时空监督信号优化推理过程。VTTS使模型能够通过增加感知计算资源来持续提升性能,实验证明其在多个视频对话、视频推理和时空感知任务中显著优于基线模型。
链接: https://arxiv.org/abs/2509.21100
作者: Ziang Yan,Xinhao Li,Yinan He,Zhengrong Yue,Xiangyu Zeng,Yali Wang,Yu Qiao,Limin Wang,Yi Wang
机构: Zhejiang University (浙江大学); Shanghai AI Laboratory (上海人工智能实验室); Nanjing University (南京大学); Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Inducing reasoning in multimodal large language models (MLLMs) is critical for achieving human-level perception and understanding. Existing methods mainly leverage LLM reasoning to analyze parsed visuals, often limited by static perception stages. This paper introduces Visual Test-Time Scaling (VTTS), a novel approach to enhance MLLMs’ reasoning via iterative perception during inference. VTTS mimics humans’ hierarchical attention by progressively refining focus on high-confidence spatio-temporal regions, guided by updated textual predictions. Specifically, VTTS employs an Iterative Perception (ITP) mechanism, incorporating reinforcement learning with spatio-temporal supervision to optimize reasoning. To support this paradigm, we also present VTTS-80K, a dataset tailored for iterative perception. These designs allows a MLLM to enhance its performance by increasing its perceptual compute. Extensive experiments validate VTTS’s effectiveness and generalization across diverse tasks and benchmarks. Our newly introduced Videochat-R1.5 model has achieved remarkable improvements, with an average increase of over 5%, compared to robust baselines such as Qwen2.5VL-3B and -7B, across more than 15 benchmarks that encompass video conversation, video reasoning, and spatio-temporal perception.
zh
[CV-29] UniTransfer: Video Concept Transfer via Progressive Spatial and Timestep Decomposition
【速读】:该论文旨在解决视频概念迁移(video concept transfer)中难以实现精确控制与高质量生成的问题,尤其在跨场景、多参考图像下的可控性与视觉保真度不足。解决方案的关键在于提出一种新颖的统一迁移架构 UniTransfer,其核心创新包括:1)引入空间分解机制,将视频解耦为前景主体、背景和运动流三个关键组件,并基于双流到单流的 DiT(Diffusion Transformer)架构实现对各组件的细粒度控制;2)通过链式提示(Chain-of-Prompt, CoP)机制实现扩散时间步分解,将去噪过程划分为不同粒度的三阶段,借助大语言模型(LLM)提供阶段性指令以引导生成流程;3)设计基于随机掩码的自监督预训练策略,增强从大规模无标签视频数据中学习分解表示的能力。这些技术共同实现了高保真且可编辑的视频概念迁移。
链接: https://arxiv.org/abs/2509.21086
作者: Guojun Lei,Rong Zhang,Chi Wang,Tianhang Liu,Hong Li,Zhiyuan Ma,Weiwei Xu
机构: State Key Lab of CAD&CG, Zhejiang University (浙江大学CAD&CG国家重点实验室); Tsinghua University (清华大学); Zhejiang Gongshang University (浙江工商大学); Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NeuriIPS 2025
Abstract:We propose a novel architecture UniTransfer, which introduces both spatial and diffusion timestep decomposition in a progressive paradigm, achieving precise and controllable video concept transfer. Specifically, in terms of spatial decomposition, we decouple videos into three key components: the foreground subject, the background, and the motion flow. Building upon this decomposed formulation, we further introduce a dual-to-single-stream DiT-based architecture for supporting fine-grained control over different components in the videos. We also introduce a self-supervised pretraining strategy based on random masking to enhance the decomposed representation learning from large-scale unlabeled video data. Inspired by the Chain-of-Thought reasoning paradigm, we further revisit the denoising diffusion process and propose a Chain-of-Prompt (CoP) mechanism to achieve the timestep decomposition. We decompose the denoising process into three stages of different granularity and leverage large language models (LLMs) for stage-specific instructions to guide the generation progressively. We also curate an animal-centric video dataset called OpenAnimal to facilitate the advancement and benchmarking of research in video concept transfer. Extensive experiments demonstrate that our method achieves high-quality and controllable video concept transfer across diverse reference images and scenes, surpassing existing baselines in both visual fidelity and editability. Web Page: this https URL
zh
[CV-30] Vision Transformers: the threat of realistic adversarial patches
【速读】:该论文旨在解决视觉Transformer(Vision Transformer, ViT)模型在面对对抗性补丁(adversarial patches)攻击时的脆弱性问题,尤其是在人与非人分类任务中,这些补丁能够通过精心设计的图案误导AI系统做出错误决策。解决方案的关键在于利用Creases Transformation(CT)技术生成具有现实感的对抗性补丁——该技术模拟衣物穿戴时自然发生的几何形变,使攻击更隐蔽且更具欺骗性;同时通过实验验证了从卷积神经网络(Convolutional Neural Networks, CNNs)迁移来的对抗攻击方法对ViT模型的有效性,揭示了预训练数据规模和方法对模型抗干扰能力的重要影响。
链接: https://arxiv.org/abs/2509.21084
作者: Kasper Cools,Clara Maathuis,Alexander M. van Oers,Claudia S. Hübner,Nikos Deligiannis,Marijke Vandewal,Geert De Cubber
机构: Vrije Universiteit Brussel (布鲁塞尔自由大学); Open University of the Netherlands (荷兰开放大学); Netherlands Defence Academy (荷兰国防学院); Fraunhofer Institute of Optronics (弗劳恩霍夫光电研究所); imec (imec); Belgian Royal Military Academy (比利时皇家军事学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Submitted to Sensors + Imaging; presented on 17th of September (Artificial Intelligence for Security and Defence Applications III)
Abstract:The increasing reliance on machine learning systems has made their security a critical concern. Evasion attacks enable adversaries to manipulate the decision-making processes of AI systems, potentially causing security breaches or misclassification of targets. Vision Transformers (ViTs) have gained significant traction in modern machine learning due to increased 1) performance compared to Convolutional Neural Networks (CNNs) and 2) robustness against adversarial perturbations. However, ViTs remain vulnerable to evasion attacks, particularly to adversarial patches, unique patterns designed to manipulate AI classification systems. These vulnerabilities are investigated by designing realistic adversarial patches to cause misclassification in person vs. non-person classification tasks using the Creases Transformation (CT) technique, which adds subtle geometric distortions similar to those occurring naturally when wearing clothing. This study investigates the transferability of adversarial attack techniques used in CNNs when applied to ViT classification models. Experimental evaluation across four fine-tuned ViT models on a binary person classification task reveals significant vulnerability variations: attack success rates ranged from 40.04% (google/vit-base-patch16-224-in21k) to 99.97% (facebook/dino-vitb16), with google/vit-base-patch16-224 achieving 66.40% and facebook/dinov3-vitb16 reaching 65.17%. These results confirm the cross-architectural transferability of adversarial patches from CNNs to ViTs, with pre-training dataset scale and methodology strongly influencing model resilience to adversarial attacks.
zh
[CV-31] EnGraf-Net: Multiple Granularity Branch Network with Fine-Coarse Graft Grained for Classification Task
【速读】:该论文旨在解决细粒度分类(fine-grained classification)中因类内差异大、类间差异小而导致的识别困难问题,尤其针对现有方法依赖局部标注(如边界框、部位位置或文本属性)或自动注意力图提取所带来的局部特征表示不完整的问题。解决方案的关键在于引入基于层级语义关联(semantic associations structured as a hierarchy, 即taxonomy)的监督信号,并将其嵌入到端到端的深度神经网络模型EnGraf-Net中,从而在无需人工标注或裁剪技术的情况下,有效利用类别间的层次结构信息提升分类性能。
链接: https://arxiv.org/abs/2509.21061
作者: Riccardo La Grassa,Ignazio Gallo,Nicola Landro
机构: INAF–Astronomical Observatory of Padua (意大利国家天文物理研究所-帕多瓦天文台); University of Insubria (因苏布里亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8
Abstract:Fine-grained classification models are designed to focus on the relevant details necessary to distinguish highly similar classes, particularly when intra-class variance is high and inter-class variance is low. Most existing models rely on part annotations such as bounding boxes, part locations, or textual attributes to enhance classification performance, while others employ sophisticated techniques to automatically extract attention maps. We posit that part-based approaches, including automatic cropping methods, suffer from an incomplete representation of local features, which are fundamental for distinguishing similar objects. While fine-grained classification aims to recognize the leaves of a hierarchical structure, humans recognize objects by also forming semantic associations. In this paper, we leverage semantic associations structured as a hierarchy (taxonomy) as supervised signals within an end-to-end deep neural network model, termed EnGraf-Net. Extensive experiments on three well-known datasets CIFAR-100, CUB-200-2011, and FGVC-Aircraft demonstrate the superiority of EnGraf-Net over many existing fine-grained models, showing competitive performance with the most recent state-of-the-art approaches, without requiring cropping techniques or manual annotations.
zh
[CV-32] Stratify or Die: Rethinking Data Splits in Image Segmentation
【速读】:该论文旨在解决图像分割任务中因随机划分数据集导致测试集代表性不足、评估结果偏倚及模型泛化能力下降的问题。其核心挑战在于如何在存在多标签结构和类别不平衡的复杂场景下,实现更公平且具有代表性的数据划分。解决方案的关键在于提出两种创新方法:一是迭代像素分层(Iterative Pixel Stratification, IPS),一种简单但标签感知的数据划分策略;二是基于Wasserstein距离驱动的进化分层(Wasserstein-Driven Evolutionary Stratification, WDES),这是一种利用遗传算法最小化不同数据集分割间标签分布差异的新方法,理论上可在足够代数下达到全局最优。实验表明,WDES在多种分割任务中均能显著降低性能方差并提升评估可靠性,尤其适用于小样本、不平衡和低多样性数据集。
链接: https://arxiv.org/abs/2509.21056
作者: Naga Venkata Sai Jitin Jami,Thomas Altstidl,Jonas Mueller,Jindong Li,Dario Zanca,Bjoern Eskofier,Heike Leutheuser
机构: FAU Erlangen-Nürnberg (弗莱堡大学); University of Bayreuth (拜罗伊特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint, 9 pages
Abstract:Random splitting of datasets in image segmentation often leads to unrepresentative test sets, resulting in biased evaluations and poor model generalization. While stratified sampling has proven effective for addressing label distribution imbalance in classification tasks, extending these ideas to segmentation remains challenging due to the multi-label structure and class imbalance typically present in such data. Building on existing stratification concepts, we introduce Iterative Pixel Stratification (IPS), a straightforward, label-aware sampling method tailored for segmentation tasks. Additionally, we present Wasserstein-Driven Evolutionary Stratification (WDES), a novel genetic algorithm designed to minimize the Wasserstein distance, thereby optimizing the similarity of label distributions across dataset splits. We prove that WDES is globally optimal given enough generations. Using newly proposed statistical heterogeneity metrics, we evaluate both methods against random sampling and find that WDES consistently produces more representative splits. Applying WDES across diverse segmentation tasks, including street scenes, medical imaging, and satellite imagery, leads to lower performance variance and improved model evaluation. Our results also highlight the particular value of WDES in handling small, imbalanced, and low-diversity datasets, where conventional splitting strategies are most prone to bias.
zh
[CV-33] Background Prompt for Few-Shot Out-of-Distribution Detection
【速读】:该论文针对少样本分布外检测(few-shot out-of-distribution, FS-OOD)中前景-背景(foreground-background, FG-BG)分解方法存在的鲁棒性不足问题展开研究,其核心挑战在于现有方法过度依赖局部类内相似性(local class similarity)以及采用固定背景补丁提取策略,导致对分布外样本的判别能力受限。解决方案的关键在于提出一种名为Mambo的新框架:首先学习一个背景提示(background prompt)以获取融合背景与图像语义信息的局部背景相似性,并通过局部类内相似性对其进行精炼;随后结合精炼后的局部背景相似性和局部类内相似性进行背景提取,从而降低对单一类内相似性的依赖;此外,引入补丁自校准调优(patch self-calibrated tuning)机制,根据样本多样性动态调整不同样本的背景补丁数量,解决了传统方法中固定背景提取策略的局限性。实验表明,该方法在真实数据集上显著优于当前最优(SOTA)方法,在分布外检测和近分布外检测场景下均取得最佳性能。
链接: https://arxiv.org/abs/2509.21055
作者: Songyue Cai,Zongqian Wu,Yujie Mo,Liang Peng,Ping Hu,Xiaoshuang Shi,Xiaofeng Zhu
机构: UESTC(电子科技大学); NUS(新加坡国立大学); HKU(香港大学); Hainan University(海南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing foreground-background (FG-BG) decomposition methods for the few-shot out-of-distribution (FS-OOD) detection often suffer from low robustness due to over-reliance on the local class similarity and a fixed background patch extraction strategy. To address these challenges, we propose a new FG-BG decomposition framework, namely Mambo, for FS-OOD detection. Specifically, we propose to first learn a background prompt to obtain the local background similarity containing both the background and image semantic information, and then refine the local background similarity using the local class similarity. As a result, we use both the refined local background similarity and the local class similarity to conduct background extraction, reducing the dependence of the local class similarity in previous methods. Furthermore, we propose the patch self-calibrated tuning to consider the sample diversity to flexibly select numbers of background patches for different samples, and thus exploring the issue of fixed background extraction strategies in previous methods. Extensive experiments on real-world datasets demonstrate that our proposed Mambo achieves the best performance, compared to SOTA methods in terms of OOD detection and near OOD detection setting. The source code will be released at this https URL.
zh
[CV-34] OmniPlantSeg: Species Agnostic 3D Point Cloud Organ Segmentation for High-Resolution Plant Phenotyping Across Modalities
【速读】:该论文旨在解决植物器官点云分割中因依赖特定物种或传感器模态而导致的通用性差,以及传统方法需大量预处理和下采样以适应硬件或神经网络输入尺寸限制的问题。解决方案的关键在于提出一种名为KD-SS(K-Dimensional Sub-Sampling)的轻量级算法,其不依赖于传感器数据类型或植物种类,能够保留原始分辨率下的点云信息,从而避免了对输入数据的下采样操作,使得高分辨率点云的完整分割成为可能。将KD-SS与当前最先进的分割模型结合,在多种传感器模态(如摄影测量、激光三角测量和LiDAR)及不同植物物种上均取得了良好效果,为植物器官分割提供了一种通用且高效的替代方案。
链接: https://arxiv.org/abs/2509.21038
作者: Andreas Gilson,Lukas Meyer,Oliver Scholz,Ute Schmid
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate point cloud segmentation for plant organs is crucial for 3D plant phenotyping. Existing solutions are designed problem-specific with a focus on certain plant species or specified sensor-modalities for data acquisition. Furthermore, it is common to use extensive pre-processing and down-sample the plant point clouds to meet hardware or neural network input size requirements. We propose a simple, yet effective algorithm KDSS for sub-sampling of biological point clouds that is agnostic to sensor data and plant species. The main benefit of this approach is that we do not need to down-sample our input data and thus, enable segmentation of the full-resolution point cloud. Combining KD-SS with current state-of-the-art segmentation models shows satisfying results evaluated on different modalities such as photogrammetry, laser triangulation and LiDAR for various plant species. We propose KD-SS as lightweight resolution-retaining alternative to intensive pre-processing and down-sampling methods for plant organ segmentation regardless of used species and sensor modality.
zh
[CV-35] KeyWorld: Key Frame Reasoning Enables Effective and Efficient World Models
【速读】:该论文旨在解决当前生成式机器人世界模型(robotic world models)在推理速度和生成轨迹物理合理性方面的瓶颈问题,其根源在于传统逐帧生成方法存在计算冗余且忽视关键状态转换的语义重要性。解决方案的关键在于提出KeyWorld框架,通过聚焦于少数语义关键帧(key frames)进行Transformer计算,同时利用轻量级卷积模型插值填充中间帧,从而显著提升效率并增强生成轨迹的物理合理性。具体而言,KeyWorld首先通过迭代简化机器人运动轨迹识别关键帧,再训练DiT模型从文本任务描述中生成物理合理的关键帧,最后由轻量级插值器完成全视频重建,实验证明该方法在LIBERO基准上相较基线实现5.68倍加速,并在复杂任务中显著提升生成视频的物理有效性。
链接: https://arxiv.org/abs/2509.21027
作者: Sibo Li,Qianyue Hao,Yu Shang,Yong Li
机构: Tsinghua University (清华大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Robotic world models are a promising paradigm for forecasting future environment states, yet their inference speed and the physical plausibility of generated trajectories remain critical bottlenecks, limiting their real-world applications. This stems from the redundancy of the prevailing frame-to-frame generation approach, where the model conducts costly computation on similar frames, as well as neglecting the semantic importance of key transitions. To address this inefficiency, we propose KeyWorld, a framework that improves text-conditioned robotic world models by concentrating transformers computation on a few semantic key frames while employing a lightweight convolutional model to fill the intermediate frames. Specifically, KeyWorld first identifies significant transitions by iteratively simplifying the robot’s motion trajectories, obtaining the ground truth key frames. Then, a DiT model is trained to reason and generate these physically meaningful key frames from textual task descriptions. Finally, a lightweight interpolator efficiently reconstructs the full video by inpainting all intermediate frames. Evaluations on the LIBERO benchmark demonstrate that KeyWorld achieves a 5.68 \times acceleration compared to the frame-to-frame generation baseline, and focusing on the motion-aware key frames further contributes to the physical validity of the generated videos, especially on complex tasks. Our approach highlights a practical path toward deploying world models in real-time robotic control and other domains requiring both efficient and effective world models. Code is released at this https URL.
zh
[CV-36] A Single Neuron Works: Precise Concept Erasure in Text-to-Image Diffusion Models
【速读】:该论文旨在解决文本到图像生成模型(text-to-image models)在生成有害内容方面的安全风险问题,核心挑战在于如何精确移除特定概念的同时最小化对图像质量的破坏。解决方案的关键在于提出了一种基于单神经元的概念擦除方法(Single Neuron-based Concept Erasure, SNCE),其核心创新是通过训练稀疏自编码器(Sparse Autoencoder, SAE)将文本嵌入映射到一个稀疏且解耦的潜在空间,使得每个神经元紧密对应原子语义概念;进而设计一种基于激活模式调制频率评分的新颖神经元识别方法,精准定位与有害概念相关的神经元,并通过抑制其激活实现“外科手术式”的概念擦除,从而在保持非目标概念生成能力的同时显著提升安全性与鲁棒性。
链接: https://arxiv.org/abs/2509.21008
作者: Qinqin He,Jiaqi Weng,Jialing Tao,Hui Xue
机构: Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Text-to-image models exhibit remarkable capabilities in image generation. However, they also pose safety risks of generating harmful content. A key challenge of existing concept erasure methods is the precise removal of target concepts while minimizing degradation of image quality. In this paper, we propose Single Neuron-based Concept Erasure (SNCE), a novel approach that can precisely prevent harmful content generation by manipulating only a single neuron. Specifically, we train a Sparse Autoencoder (SAE) to map text embeddings into a sparse, disentangled latent space, where individual neurons align tightly with atomic semantic concepts. To accurately locate neurons responsible for harmful concepts, we design a novel neuron identification method based on the modulated frequency scoring of activation patterns. By suppressing activations of the harmful concept-specific neuron, SNCE achieves surgical precision in concept erasure with minimal disruption to image quality. Experiments on various benchmarks demonstrate that SNCE achieves state-of-the-art results in target concept erasure, while preserving the model’s generation capabilities for non-target concepts. Additionally, our method exhibits strong robustness against adversarial attacks, significantly outperforming existing methods.
zh
[CV-37] Marching Neurons: Accurate Surface Extraction for Neural Implicit Shapes SIGGRAPH
【速读】:该论文旨在解决从神经隐式函数(neural implicit functions)中高精度提取表面几何的问题。传统方法如Marching Cubes依赖于固定分辨率的空间分解与采样,导致精度受限。其解决方案的关键在于提出一种原生并行的深度优先遍历策略,利用每个神经元对空间域的划分特性,高效追踪编码在神经网络中的表面信息,从而在不进行人为空间离散化的情况下生成精确的网格,实现跨多种形状和网络架构的高保真几何重建,同时保持良好的计算效率。
链接: https://arxiv.org/abs/2509.21007
作者: Christian Stippel,Felix Mujkanovic,Thomas Leimkühler,Pedro Hermosilla
机构: TU Wien(维也纳技术大学); Max-Planck-Institute for Informatics(马克斯·普朗克信息研究所)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: SIGGRAPH Asia 2025 (Journal Track)
Abstract:Accurate surface geometry representation is crucial in 3D visual computing. Explicit representations, such as polygonal meshes, and implicit representations, like signed distance functions, each have distinct advantages, making efficient conversions between them increasingly important. Conventional surface extraction methods for implicit representations, such as the widely used Marching Cubes algorithm, rely on spatial decomposition and sampling, leading to inaccuracies due to fixed and limited resolution. We introduce a novel approach for analytically extracting surfaces from neural implicit functions. Our method operates natively in parallel and can navigate large neural architectures. By leveraging the fact that each neuron partitions the domain, we develop a depth-first traversal strategy to efficiently track the encoded surface. The resulting meshes faithfully capture the full geometric information from the network without ad-hoc spatial discretization, achieving unprecedented accuracy across diverse shapes and network architectures while maintaining competitive speed.
zh
[CV-38] Fast-SEnSeI: Lightweight Sensor-Independent Cloud Masking for On-board Multispectral Sensors
【速读】:该论文旨在解决遥感图像云分割(cloud segmentation)任务中模型对特定传感器配置依赖性强、难以在轨部署的问题。现有方法通常与特定传感器的波段配置紧密耦合,且依赖地面处理,限制了其在多光谱传感器上的通用性和实时性。解决方案的关键在于提出Fast-SEnSeI模块,该模块具备传感器无关性(sensor-independent)、轻量化架构和鲁棒的波段填充机制,能够接受任意波段组合并生成固定尺寸特征图,进而接入一个基于改进U-Net结构的紧凑量化分割模型;整个流程通过CPU-FPGA异构计算架构实现高效运行,适配空间级硬件环境,已在Sentinel-2和Landsat 8数据集上验证了其跨配置的高精度分割能力。
链接: https://arxiv.org/abs/2509.20991
作者: Jan Kněžík,Jonáš Herec,Rado Pitoňák
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
备注: This is a preprint of a paper accepted for the EDHPC 2025 Conference
Abstract:Cloud segmentation is a critical preprocessing step for many Earth observation tasks, yet most models are tightly coupled to specific sensor configurations and rely on ground-based processing. In this work, we propose Fast-SEnSeI, a lightweight, sensor-independent encoder module that enables flexible, on-board cloud segmentation across multispectral sensors with varying band configurations. Building upon SEnSeI-v2, Fast-SEnSeI integrates an improved spectral descriptor, lightweight architecture, and robust padding-band handling. It accepts arbitrary combinations of spectral bands and their wavelengths, producing fixed-size feature maps that feed into a compact, quantized segmentation model based on a modified U-Net. The module runs efficiently on embedded CPUs using Apache TVM, while the segmentation model is deployed on FPGA, forming a CPU-FPGA hybrid pipeline suitable for space-qualified hardware. Evaluations on Sentinel-2 and Landsat 8 datasets demonstrate accurate segmentation across diverse input configurations.
zh
[CV-39] SiNGER: A Clearer Voice Distills Vision Transformers Further
【速读】:该论文旨在解决视觉 Transformer(Vision Transformer)作为视觉基础模型骨干网络时,因高范数伪影(high-norm artifacts)导致表征质量下降的问题。这些伪影在知识蒸馏过程中会主导损失函数,使学生模型过拟合于伪影而忽略教师模型中的有效信息,从而削弱大模型带来的性能提升。解决方案的关键在于提出一种名为 Singular Nullspace-Guided Energy Reallocation (SiNGER) 的新型蒸馏框架,其核心思想是通过基于零空间引导的扰动对教师特征进行有原则的精炼:在精炼过程中,利用零空间约束保留有效信息的同时抑制伪影;随后将优化后的教师特征用于蒸馏。该扰动通过 LoRA 适配器高效实现,仅需最小结构改动即可显著提升学生模型性能,在多个下游任务中达到当前最优效果,并生成更清晰、可解释的表示。
链接: https://arxiv.org/abs/2509.20986
作者: Geunhyeok Yu,Sunjae Jeong,Yoonyoung Choi,Jaeseung Kim,Hyoseok Hwang
机构: Kyung Hee University (高丽大学); MOBILTECH CO., LTD (移动科技有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Main paper: 12 pages (including 3 pages of references), 6 figures, 6 tables. Appendix: 9 pages, 7 figures
Abstract:Vision Transformers are widely adopted as the backbone of vision foundation models, but they are known to produce high-norm artifacts that degrade representation quality. When knowledge distillation transfers these features to students, high-norm artifacts dominate the objective, so students overfit to artifacts and underweight informative signals, diminishing the gains from larger models. Prior work attempted to remove artifacts but encountered an inherent trade-off between artifact suppression and preserving informative signals from teachers. To address this, we introduce Singular Nullspace-Guided Energy Reallocation (SiNGER), a novel distillation framework that suppresses artifacts while preserving informative signals. The key idea is principled teacher feature refinement: during refinement, we leverage the nullspace-guided perturbation to preserve information while suppressing artifacts. Then, the refined teacher’s features are distilled to a student. We implement this perturbation efficiently with a LoRA-based adapter that requires minimal structural modification. Extensive experiments show that \oursname consistently improves student models, achieving state-of-the-art performance in multiple downstream tasks and producing clearer and more interpretable representations.
zh
[CV-40] An Adaptor for Triggering Semi-Supervised Learning to Out-of-Box Serve Deep Image Clustering
【速读】:该论文旨在解决自监督学习(Self-Supervised Learning, SSL)方法在深度图像聚类任务中难以实现“冷启动”(cold-start)的问题,即现有方法普遍依赖预训练模型、聚类学习或已训练的聚类模型作为先验条件,限制了SSL学习器在聚类任务中的灵活部署与即插即用能力。其解决方案的关键在于提出一种名为ASD的适配器(adaptor),通过随机采样未标记数据并生成伪标签,利用实例级分类器学习语义对齐的实例级标签;随后基于预测结果的类别转移跟踪提取高阶实例级类别相似性,从而为伪标签数据分配簇级标签;最终使用这些带簇级标签的伪数据激活一个在未标记数据上训练的通用SSL学习器,实现无需任何前置条件的图像聚类。
链接: https://arxiv.org/abs/2509.20976
作者: Yue Duan,Lei Qi,Yinghuan Shi,Yang Gao
机构: Nanjing University (南京大学); Southeast University (东南大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by IEEE Transactions on Image Processing (TIP)
Abstract:Recently, some works integrate SSL techniques into deep clustering frameworks to enhance image clustering performance. However, they all need pretraining, clustering learning, or a trained clustering model as prerequisites, limiting the flexible and out-of-box application of SSL learners in the image clustering task. This work introduces ASD, an adaptor that enables the cold-start of SSL learners for deep image clustering without any prerequisites. Specifically, we first randomly sample pseudo-labeled data from all unlabeled data, and set an instance-level classifier to learn them with semantically aligned instance-level labels. With the ability of instance-level classification, we track the class transitions of predictions on unlabeled data to extract high-level similarities of instance-level classes, which can be utilized to assign cluster-level labels to pseudo-labeled data. Finally, we use the pseudo-labeled data with assigned cluster-level labels to trigger a general SSL learner trained on the unlabeled data for image clustering. We show the superior performance of ASD across various benchmarks against the latest deep image clustering approaches and very slight accuracy gaps compared to SSL methods using ground-truth, e.g., only 1.33% on CIFAR-10. Moreover, ASD can also further boost the performance of existing SSL-embedded deep image clustering methods.
zh
[CV-41] Unlocking Financial Insights: An advanced Multimodal Summarization with Multimodal Output Framework for Financial Advisory Videos
【速读】:该论文旨在解决从长达30-40分钟的多模态金融咨询播客视频中提取结构化、精准且可解释性高的摘要信息这一难题,尤其在文本与视觉模态融合方面存在挑战。其解决方案的关键在于提出一个模块化框架 FASTER(Financial Advisory Summariser with Textual Embedded Relevant images),通过三个核心机制实现:(1) 利用 BLIP 进行语义视觉描述、OCR 提取文本模式、Whisper 结合说话人分离(Speaker diarization)生成文本基础特征(BOS features),以高效提取各模态特异性表征;(2) 设计一种改进的基于直接偏好优化(Direct Preference Optimization, DPO)的损失函数,并引入 BOS 特定的事实核查机制,确保摘要内容在事实一致性、相关性和准确性上优于人类标注基准;(3) 采用基于排序的检索机制将关键帧图像与对应文本点对齐,提升跨模态一致性与可解释性。该方法显著优于主流大语言模型(LLMs)和视觉-语言模型(VLMs),并构建了首个公开的金融咨询播客数据集 Fin-APT(含470个视频),为多模态金融内容理解提供新标准。
链接: https://arxiv.org/abs/2509.20961
作者: Sarmistha Das,R E Zera Marveen Lyngkhoi,Sriparna Saha,Alka Maurya
机构: Indian Institute of Technology Patna (印度理工学院巴特那分校); CRISIL LTD (CRISIL有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The dynamic propagation of social media has broadened the reach of financial advisory content through podcast videos, yet extracting insights from lengthy, multimodal segments (30-40 minutes) remains challenging. We introduce FASTER (Financial Advisory Summariser with Textual Embedded Relevant images), a modular framework that tackles three key challenges: (1) extracting modality-specific features, (2) producing optimized, concise summaries, and (3) aligning visual keyframes with associated textual points. FASTER employs BLIP for semantic visual descriptions, OCR for textual patterns, and Whisper-based transcription with Speaker diarization as BOS features. A modified Direct Preference Optimization (DPO)-based loss function, equipped with BOS-specific fact-checking, ensures precision, relevance, and factual consistency against the human-aligned summary. A ranker-based retrieval mechanism further aligns keyframes with summarized content, enhancing interpretability and cross-modal coherence. To acknowledge data resource scarcity, we introduce Fin-APT, a dataset comprising 470 publicly accessible financial advisory pep-talk videos for robust multimodal research. Comprehensive cross-domain experiments confirm FASTER’s strong performance, robustness, and generalizability when compared to Large Language Models (LLMs) and Vision-Language Models (VLMs). By establishing a new standard for multimodal summarization, FASTER makes financial advisory content more accessible and actionable, thereby opening new avenues for research. The dataset and code are available at: this https URL
zh
[CV-42] A Real-Time On-Device Defect Detection Framework for Laser Power-Meter Sensors via Unsupervised Learning
【速读】:该论文旨在解决激光功率计传感器涂层缺陷(如热损伤和划痕)的自动检测与分类问题,此类缺陷可能影响医疗和工业应用中激光能量测量的准确性。解决方案的关键在于提出一种基于视觉的无监督异常检测框架,该框架仅使用“正常”传感器图像进行训练,从而学习涂层的正常分布模式,实现对已知和未知缺陷类型的检测,而无需依赖大量标注的缺陷数据集;其核心技术包括:(1)利用拉普拉斯边缘检测与K-means聚类构建鲁棒的预处理流程以分割感兴趣区域,(2)通过StyleGAN2生成合成数据增强模型泛化能力,(3)采用UFlow神经网络架构实现多尺度特征提取与异常图生成,最终在366张真实图像上达到93.8%的缺陷样本准确率和89.3%的良好样本准确率,且具备0.5秒/图像的实时推理能力。
链接: https://arxiv.org/abs/2509.20946
作者: Dongqi Zheng,Wenjin Fu,Guangzong Chen
机构: Purdue University (普渡大学); Carnegie Mellon University (卡内基梅隆大学); University of Pittsburgh (匹兹堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present an automated vision-based system for defect detection and classification of laser power meter sensor coatings. Our approach addresses the critical challenge of identifying coating defects such as thermal damage and scratches that can compromise laser energy measurement accuracy in medical and industrial applications. The system employs an unsupervised anomaly detection framework that trains exclusively on ``good’’ sensor images to learn normal coating distribution patterns, enabling detection of both known and novel defect types without requiring extensive labeled defect datasets. Our methodology consists of three key components: (1) a robust preprocessing pipeline using Laplacian edge detection and K-means clustering to segment the area of interest, (2) synthetic data augmentation via StyleGAN2, and (3) a UFlow-based neural network architecture for multi-scale feature extraction and anomaly map generation. Experimental evaluation on 366 real sensor images demonstrates 93.8% accuracy on defective samples and 89.3% accuracy on good samples, with image-level AUROC of 0.957 and pixel-level AUROC of 0.961. The system provides potential annual cost savings through automated quality control and processing times of 0.5 seconds per image in on-device implementation.
zh
[CV-43] Decoding the Surgical Scene: A Scoping Review of Scene Graphs in Surgery
【速读】:该论文旨在解决当前手术场景图(Scene Graphs, SGs)研究中存在的“数据鸿沟”问题,即内部视角(如三元组识别)主要依赖真实世界二维视频数据,而外部视角的四维建模则严重依赖模拟数据,导致临床转化存在显著差距。其解决方案的关键在于通过系统性文献综述与方法演进分析,揭示了从基础图神经网络向专用基础模型(foundation models)发展的技术跃迁,这些模型在手术场景中显著优于通用大视觉语言模型,从而推动SGs在手术流程识别、自动化安全监控等分析任务以及可控手术仿真等生成式任务中的应用落地。
链接: https://arxiv.org/abs/2509.20941
作者: Angelo Henriques,Korab Hoxha,Daniel Zapp,Peter C. Issa,Nassir Navab,M. Ali Nasseri
机构: Technical University of Munich (慕尼黑工业大学); University of Alberta (阿尔伯塔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to Medical Image Analysis. Under review. 49 pages, 9 figures. An interactive version of the summary tables is available at this http URL
Abstract:Scene graphs (SGs) provide structured relational representations crucial for decoding complex, dynamic surgical environments. This PRISMA-ScR-guided scoping review systematically maps the evolving landscape of SG research in surgery, charting its applications, methodological advancements, and future directions. Our analysis reveals rapid growth, yet uncovers a critical ‘data divide’: internal-view research (e.g., triplet recognition) almost exclusively uses real-world 2D video, while external-view 4D modeling relies heavily on simulated data, exposing a key translational research gap. Methodologically, the field has advanced from foundational graph neural networks to specialized foundation models that now significantly outperform generalist large vision-language models in surgical contexts. This progress has established SGs as a cornerstone technology for both analysis, such as workflow recognition and automated safety monitoring, and generative tasks like controllable surgical simulation. Although challenges in data annotation and real-time implementation persist, they are actively being addressed through emerging techniques. Surgical SGs are maturing into an essential semantic bridge, enabling a new generation of intelligent systems to improve surgical safety, efficiency, and training.
zh
[CV-44] Unlocking Noise-Resistant Vision: Key Architectural Secrets for Robust Models
【速读】:该论文旨在解决视觉模型对加性高斯噪声(additive Gaussian noise)的鲁棒性与其架构设计选择之间因果关系不明确的问题。现有研究多关注鲁棒性的测量,却缺乏对不同架构因素如何影响鲁棒性的系统性剖析。解决方案的关键在于通过大规模实验(1,174个预训练视觉模型)识别出四个稳定提升鲁棒性的设计模式:更大的stem卷积核、更低的输入分辨率、使用平均池化(average pooling)以及采用监督训练的视觉Transformer(supervised vision transformers, ViTs)而非CLIP预训练ViTs。进一步地,作者构建理论框架解释这些现象:证明低通滤波stem核可二次衰减噪声能量,抗锯齿下采样与下采样因子平方成比例降低噪声功率;揭示平均池化无偏且噪声抑制强度与池化窗口面积成正比,而最大池化存在正偏差并增加均方误差和最坏情况敏感度;并通过像素空间Lipschitz界说明CLIP ViTs因归一化标准差较小导致最坏情况敏感度增强达1.91倍。最终形成一套可直接应用的模块化设计指南,使模型在不改变任务目标的前提下显著提升抗噪能力。
链接: https://arxiv.org/abs/2509.20939
作者: Bum Jun Kim,Makoto Kawano,Yusuke Iwasawa,Yutaka Matsuo
机构: The University of Tokyo (东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 30 pages, 5 figures
Abstract:While the robustness of vision models is often measured, their dependence on specific architectural design choices is rarely dissected. We investigate why certain vision architectures are inherently more robust to additive Gaussian noise and convert these empirical insights into simple, actionable design rules. Specifically, we performed extensive evaluations on 1,174 pretrained vision models, empirically identifying four consistent design patterns for improved robustness against Gaussian noise: larger stem kernels, smaller input resolutions, average pooling, and supervised vision transformers (ViTs) rather than CLIP ViTs, which yield up to 506 rank improvements and 21.6%p accuracy gains. We then develop a theoretical analysis that explains these findings, converting observed correlations into causal mechanisms. First, we prove that low-pass stem kernels attenuate noise with a gain that decreases quadratically with kernel size and that anti-aliased downsampling reduces noise energy roughly in proportion to the square of the downsampling factor. Second, we demonstrate that average pooling is unbiased and suppresses noise in proportion to the pooling window area, whereas max pooling incurs a positive bias that grows slowly with window size and yields a relatively higher mean-squared error and greater worst-case sensitivity. Third, we reveal and explain the vulnerability of CLIP ViTs via a pixel-space Lipschitz bound: The smaller normalization standard deviations used in CLIP preprocessing amplify worst-case sensitivity by up to 1.91 times relative to the Inception-style preprocessing common in supervised ViTs. Our results collectively disentangle robustness into interpretable modules, provide a theory that explains the observed trends, and build practical, plug-and-play guidelines for designing vision models more robust against Gaussian noise.
zh
[CV-45] Autoregressive End-to-End Planning with Time-Invariant Spatial Alignment and Multi-Objective Policy Refinement
【速读】:该论文旨在解决自回归模型在自动驾驶端到端规划中因时空错位(spatio-temporal misalignment)导致的性能瓶颈问题,即规划器需基于历史感知数据预测未来动作,从而形成不一致的主体视角(ego-centric worldview),限制了模型上限。解决方案的关键在于提出时间不变的空间对齐模块(Time-Invariant Spatial Alignment, TISA),该模块学习将初始环境特征投影至每个未来时间步的一致主体坐标系中,无需显式预测未来场景即可校正代理的世界观;同时引入运动学动作预测头(kinematic action prediction head)确保轨迹物理可行性,并通过多目标后训练阶段结合直接偏好优化(Direct Preference Optimization, DPO)提供细粒度的行为反馈信号,超越传统单一目标的模仿学习。
链接: https://arxiv.org/abs/2509.20938
作者: Jianbo Zhao,Taiyu Ban,Xiangjie Li,Xingtai Gui,Hangning Zhou,Lei Liu,Hongwei Zhao,Bin Li
机构: University of Science and Technology of China (中国科学技术大学); Mach Drive (机器驱动)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The inherent sequential modeling capabilities of autoregressive models make them a formidable baseline for end-to-end planning in autonomous driving. Nevertheless, their performance is constrained by a spatio-temporal misalignment, as the planner must condition future actions on past sensory data. This creates an inconsistent worldview, limiting the upper bound of performance for an otherwise powerful approach. To address this, we propose a Time-Invariant Spatial Alignment (TISA) module that learns to project initial environmental features into a consistent ego-centric frame for each future time step, effectively correcting the agent’s worldview without explicit future scene prediction. In addition, we employ a kinematic action prediction head (i.e., acceleration and yaw rate) to ensure physically feasible trajectories. Finally, we introduce a multi-objective post-training stage using Direct Preference Optimization (DPO) to move beyond pure imitation. Our approach provides targeted feedback on specific driving behaviors, offering a more fine-grained learning signal than the single, overall objective used in standard DPO. Our model achieves a state-of-the-art 89.8 PDMS on the NAVSIM dataset among autoregressive models. The video document is available at this https URL.
zh
[CV-46] SimDiff: Simulator-constrained Diffusion Model for Physically Plausible Motion Generation
【速读】:该论文旨在解决生成物理上合理的运动轨迹在计算效率上的瓶颈问题,尤其是在基于模拟器的扩散模型中,因模拟器的串行特性导致推理过程难以并行化,从而限制了实际应用。解决方案的关键在于将模拟器约束转化为扩散过程中的一种引导机制(either classifier-based or classifier-free),提出SimDiff模型,通过直接将环境参数(如重力、风速)嵌入去噪过程实现高效且物理合理的动作生成;该方法无需在推理阶段反复调用模拟器,并能对不同物理系数进行细粒度控制,同时展现出对未见过的环境参数组合的组合泛化能力。
链接: https://arxiv.org/abs/2509.20927
作者: Akihisa Watanabe,Jiawei Ren,Li Siyao,Yichen Peng,Erwin Wu,Edgar Simo-Serra
机构: Waseda University (早稻田大学); Nanyang Technological University (南洋理工大学); Institute of Science Tokyo (东京科学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generating physically plausible human motion is crucial for applications such as character animation and virtual reality. Existing approaches often incorporate a simulator-based motion projection layer to the diffusion process to enforce physical plausibility. However, such methods are computationally expensive due to the sequential nature of the simulator, which prevents parallelization. We show that simulator-based motion projection can be interpreted as a form of guidance, either classifier-based or classifier-free, within the diffusion process. Building on this insight, we propose SimDiff, a Simulator-constrained Diffusion Model that integrates environment parameters (e.g., gravity, wind) directly into the denoising process. By conditioning on these parameters, SimDiff generates physically plausible motions efficiently, without repeated simulator calls at inference, and also provides fine-grained control over different physical coefficients. Moreover, SimDiff successfully generalizes to unseen combinations of environmental parameters, demonstrating compositional generalization.
zh
[CV-47] Revisiting Data Challenges of Computational Pathology: A Pack-based Multiple Instance Learning Framework
【速读】:该论文旨在解决计算病理学(Computational Pathology, CPath)中全切片图像(Whole Slide Images, WSIs)因序列长度极长(高达200K)、长度差异显著(从200到200K不等)以及监督信号有限所导致的数据异质性高、冗余严重的问题。传统方法在有限监督下难以兼顾训练效率与数据异质性的保留。其解决方案的关键在于提出一种基于打包(pack-based)的多实例学习(Multiple Instance Learning, MIL)框架:首先将多个采样得到的变长特征序列打包为固定长度,实现批处理训练的同时保持数据多样性;其次引入残差分支,将来自多张切片被舍弃的特征组合成“超切片”(hyperslide),并施加定制化标签以提供多切片级别的监督,从而缓解因采样造成的特征丢失;此外,设计了一种注意力驱动的下采样模块,在两个分支中压缩特征表示以减少冗余。该方法在PANDA(UNI)数据集上实现了最高达8%的准确率提升,同时仅需12%的训练时间。
链接: https://arxiv.org/abs/2509.20923
作者: Wenhao Tang,Heng Fang,Ge Wu,Xiang Li,Ming-Ming Cheng
机构: VCIP, School of Computer Science, Nankai University (南开大学计算机学院视觉计算与图像处理实验室); Huazhong University of Science and Technology (华中科技大学); Nankai International Advanced Research Institute (Shenzhen Futian) (南开国际先进研究院(深圳福田))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 26 pages, 5 figures
Abstract:Computational pathology (CPath) digitizes pathology slides into whole slide images (WSIs), enabling analysis for critical healthcare tasks such as cancer diagnosis and prognosis. However, WSIs possess extremely long sequence lengths (up to 200K), significant length variations (from 200 to 200K), and limited supervision. These extreme variations in sequence length lead to high data heterogeneity and redundancy. Conventional methods often compromise on training efficiency and optimization to preserve such heterogeneity under limited supervision. To comprehensively address these challenges, we propose a pack-based MIL framework. It packs multiple sampled, variable-length feature sequences into fixed-length ones, enabling batched training while preserving data heterogeneity. Moreover, we introduce a residual branch that composes discarded features from multiple slides into a hyperslide which is trained with tailored labels. It offers multi-slide supervision while mitigating feature loss from sampling. Meanwhile, an attention-driven downsampler is introduced to compress features in both branches to reduce redundancy. By alleviating these challenges, our approach achieves an accuracy improvement of up to 8% while using only 12% of the training time in the PANDA(UNI). Extensive experiments demonstrate that focusing data challenges in CPath holds significant potential in the era of foundation models. The code is this https URL
zh
[CV-48] SwinMamba: A hybrid local-global mamba framework for enhancing semantic segmentation of remotely sensed images
【速读】:该论文旨在解决遥感图像语义分割中因高空间分辨率、复杂场景结构及多样目标尺度所带来的挑战,尤其是现有Vision Mamba模型依赖全局扫描而忽视关键局部特征(如纹理和边缘)的问题。其解决方案的关键在于提出SwinMamba框架,该框架结合了局部与全局感知能力:在前两个阶段采用基于移位窗口的局部Mamba扫描以捕捉细粒度细节,在后两个阶段引入全局扫描以融合上下文信息;同时通过重叠移位窗口设计增强区域间信息交互,从而实现更鲁棒的特征整合,显著提升分割精度。
链接: https://arxiv.org/abs/2509.20918
作者: Qinfeng Zhu,Han Li,Liang He,Lei Fan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Semantic segmentation of remote sensing imagery is a fundamental task in computer vision, supporting a wide range of applications such as land use classification, urban planning, and environmental monitoring. However, this task is often challenged by the high spatial resolution, complex scene structures, and diverse object scales present in remote sensing data. To address these challenges, various deep learning architectures have been proposed, including convolutional neural networks, Vision Transformers, and the recently introduced Vision Mamba. Vision Mamba features a global receptive field and low computational complexity, demonstrating both efficiency and effectiveness in image segmentation. However, its reliance on global scanning tends to overlook critical local features, such as textures and edges, which are essential for achieving accurate segmentation in remote sensing contexts. To tackle this limitation, we propose SwinMamba, a novel framework inspired by the Swin Transformer. SwinMamba integrates localized Mamba-style scanning within shifted windows with a global receptive field, to enhance the model’s perception of both local and global features. Specifically, the first two stages of SwinMamba perform local scanning to capture fine-grained details, while its subsequent two stages leverage global scanning to fuse broader contextual information. In our model, the use of overlapping shifted windows enhances inter-region information exchange, facilitating more robust feature integration across the entire image. Extensive experiments on the LoveDA and ISPRS Potsdam datasets demonstrate that SwinMamba outperforms state-of-the-art methods, underscoring its effectiveness and potential as a superior solution for semantic segmentation of remotely sensed imagery.
zh
[CV-49] Finding 3D Positions of Distant Objects from Noisy Camera Movement and Semantic Segmentation Sequences
【速读】:该论文旨在解决在计算资源受限或目标距离较远的情况下,基于相机序列测量实现3D物体定位的问题。传统方法如密集深度估计或3D场景重建在此类场景中不可行。解决方案的关键在于采用粒子滤波器(particle filter)框架,利用已知的相机位姿(由全球导航卫星系统GNSS提供)和图像分割结果,对单目标或多目标进行有效定位。该方法不依赖于具体的检测算法,具有良好的灵活性与实用性,且已在无人机火灾监测的实际应用中验证了其有效性。
链接: https://arxiv.org/abs/2509.20906
作者: Julius Pesonen,Arno Solin,Eija Honkavaara
机构: Finnish Geospatial Research Institute (芬兰大地测量研究所); National Land Survey of Finland (芬兰国家土地测量局); Aalto University (阿尔托大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:3D object localisation based on a sequence of camera measurements is essential for safety-critical surveillance tasks, such as drone-based wildfire monitoring. Localisation of objects detected with a camera can typically be solved with dense depth estimation or 3D scene reconstruction. However, in the context of distant objects or tasks limited by the amount of available computational resources, neither solution is feasible. In this paper, we show that the task can be solved using particle filters for both single and multiple target scenarios. The method was studied using a 3D simulation and a drone-based image segmentation sequence with global navigation satellite system (GNSS)-based camera pose estimates. The results showed that a particle filter can be used to solve practical localisation tasks based on camera poses and image segments in these situations where other solutions fail. The particle filter is independent of the detection method, making it flexible for new tasks. The study also demonstrates that drone-based wildfire monitoring can be conducted using the proposed method paired with a pre-existing image segmentation model.
zh
[CV-50] FSMODNet: A Closer Look at Few-Shot Detection in Multispectral Data
【速读】:该论文旨在解决少样本多光谱目标检测(Few-shot Multispectral Object Detection, FSMOD)问题,即在可见光与热成像模态下,仅用极少标注数据实现高效准确的目标检测。其解决方案的关键在于提出了一种名为FSMODNet的框架,通过引入可变形注意力机制实现跨模态特征融合,从而有效整合可见光与热成像各自的优势,在光照和环境复杂变化条件下仍保持鲁棒的检测性能。
链接: https://arxiv.org/abs/2509.20905
作者: Manuel Nkegoum,Minh-Tan Pham,Élisa Fromont,Bruno Avignon,Sébastien Lefèvre
机构: Univ Bretagne Sud, IRISA, UMR 6074, Vannes, France; Univ Rennes, IRISA, UMR 6074, Rennes, France; ATERMES, Montigny-le-Bretonneux, France; UiT The Arctic University of Norway, Tromsø, Norway
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Few-shot multispectral object detection (FSMOD) addresses the challenge of detecting objects across visible and thermal modalities with minimal annotated data. In this paper, we explore this complex task and introduce a framework named “FSMODNet” that leverages cross-modality feature integration to improve detection performance even with limited labels. By effectively combining the unique strengths of visible and thermal imagery using deformable attention, the proposed method demonstrates robust adaptability in complex illumination and environmental conditions. Experimental results on two public datasets show effective object detection performance in challenging low-data regimes, outperforming several baselines we established from state-of-the-art models. All code, models, and experimental data splits can be found at this https URL.
zh
[CV-51] Concepts in Motion: Temporal Bottlenecks for Interpretable Video Classification
【速读】:该论文旨在解决将概念瓶颈模型(Concept Bottleneck Models, CBMs)从静态图像分类扩展到视频分类时所面临的挑战,尤其是如何有效建模视频中固有的时序依赖关系,以准确捕捉动作和事件。其解决方案的关键在于提出MoTIF(Moving Temporal Interpretable Framework),这是一种受Transformer启发的架构设计,能够处理任意长度的视频序列,并显式支持三种互补的视角:全局概念重要性、局部窗口内的概念相关性以及概念随时间变化的时序依赖性。通过这种方式,MoTIF实现了在保持竞争性能的同时,提升对视频中概念贡献的可解释性理解。
链接: https://arxiv.org/abs/2509.20899
作者: Patrick Knab,Sascha Marton,Philipp J. Schubert,Drago Guggiana,Christian Bartelt
机构: Technical University of Clausthal (克劳斯塔尔工业大学); Ramblr.ai Research
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Conceptual models such as Concept Bottleneck Models (CBMs) have driven substantial progress in improving interpretability for image classification by leveraging human-interpretable concepts. However, extending these models from static images to sequences of images, such as video data, introduces a significant challenge due to the temporal dependencies inherent in videos, which are essential for capturing actions and events. In this work, we introduce MoTIF (Moving Temporal Interpretable Framework), an architectural design inspired by a transformer that adapts the concept bottleneck framework for video classification and handles sequences of arbitrary length. Within the video domain, concepts refer to semantic entities such as objects, attributes, or higher-level components (e.g., ‘bow’, ‘mount’, ‘shoot’) that reoccur across time - forming motifs collectively describing and explaining actions. Our design explicitly enables three complementary perspectives: global concept importance across the entire video, local concept relevance within specific windows, and temporal dependencies of a concept over time. Our results demonstrate that the concept-based modeling paradigm can be effectively transferred to video data, enabling a better understanding of concept contributions in temporal contexts while maintaining competitive performance. Code available at this http URL.
zh
[CV-52] FerretNet: Efficient Synthetic Image Detection via Local Pixel Dependencies NEURIPS2025
【速读】:该论文旨在解决由变分自编码器(VAE)、生成对抗网络(GAN)和潜在扩散模型(LDM)等先进生成模型所产生图像日益逼真所带来的合成图像检测难题。其核心挑战在于,这些模型生成的图像在视觉上接近真实图像,导致传统检测方法难以识别其合成属性。解决方案的关键在于挖掘生成过程中引入的两类伪影特征:(1)潜在空间分布偏差(latent distribution deviations)和(2)解码阶段引起的平滑效应(decoding-induced smoothing effects),二者共同表现为局部纹理、边缘和色彩过渡的不一致性。作者基于马尔可夫随机场(Markov Random Fields)中的局部像素依赖性(Local Pixel Dependencies, LPD)特性,利用邻域像素信息重构图像以暴露纹理连续性和边缘一致性的破坏,并据此提出轻量级神经网络FerretNet(仅1.1M参数),实现了高效且鲁棒的合成图像检测,在跨22种生成模型的开放世界基准上平均准确率达97.1%,显著优于现有最优方法。
链接: https://arxiv.org/abs/2509.20890
作者: Shuqiao Liang,Jian Liu,Renzhang Chen,Quanlong Guan
机构: Jinan University (暨南大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures, 8 tables, accepted at NeurIPS 2025
Abstract:The increasing realism of synthetic images generated by advanced models such as VAEs, GANs, and LDMs poses significant challenges for synthetic image detection. To address this issue, we explore two artifact types introduced during the generation process: (1) latent distribution deviations and (2) decoding-induced smoothing effects, which manifest as inconsistencies in local textures, edges, and color transitions. Leveraging local pixel dependencies (LPD) properties rooted in Markov Random Fields, we reconstruct synthetic images using neighboring pixel information to expose disruptions in texture continuity and edge coherence. Building upon LPD, we propose FerretNet, a lightweight neural network with only 1.1M parameters that delivers efficient and robust synthetic image detection. Extensive experiments demonstrate that FerretNet, trained exclusively on the 4-class ProGAN dataset, achieves an average accuracy of 97.1% on an open-world benchmark comprising across 22 generative models, surpassing state-of-the-art methods by 10.6%.
zh
[CV-53] Nuclear Diffusion Models for Low-Rank Background Suppression in Videos
【速读】:该论文旨在解决视频序列中结构化噪声和背景伪影干扰动态内容分析与恢复的问题,传统基于低秩与稀疏分解(Robust Principal Component Analysis, RPCA)的方法因对稀疏性的假设难以刻画真实视频数据中的复杂变化而受限。其解决方案的关键在于提出一种融合低秩时序建模与扩散后验采样的混合框架——Nuclear Diffusion,通过引入深度生成先验来增强对视频动态内容的建模能力,从而在心脏超声去雾任务中实现更优的对比度增强(gCNR)和信号保真度(KS统计量)。
链接: https://arxiv.org/abs/2509.20886
作者: Tristan S.W. Stevens,Oisín Nolan,Jean-Luc Robert,Ruud J.G. van Sloun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 5 pages, 4 figures, preprint
Abstract:Video sequences often contain structured noise and background artifacts that obscure dynamic content, posing challenges for accurate analysis and restoration. Robust principal component methods address this by decomposing data into low-rank and sparse components. Still, the sparsity assumption often fails to capture the rich variability present in real video data. To overcome this limitation, a hybrid framework that integrates low-rank temporal modeling with diffusion posterior sampling is proposed. The proposed method, Nuclear Diffusion, is evaluated on a real-world medical imaging problem, namely cardiac ultrasound dehazing, and demonstrates improved dehazing performance compared to traditional RPCA concerning contrast enhancement (gCNR) and signal preservation (KS statistic). These results highlight the potential of combining model-based temporal models with deep generative priors for high-fidelity video restoration.
zh
[CV-54] Integrating Object Interaction Self-Attention and GAN-Based Debiasing for Visual Question Answering
【速读】:该论文旨在解决视觉问答(Visual Question Answering, VQA)模型在训练过程中因数据集偏置(bias)而导致的性能瓶颈问题,特别是模型对表面统计规律的过度依赖和在多样化图像与问题上的泛化能力不足。其解决方案的关键在于提出一种名为IOG-VQA的新模型,该模型融合了两种核心机制:一是基于对象交互的自注意力机制(Object Interaction Self-Attention),用于捕捉图像中物体之间的复杂关系以增强视觉语境理解;二是基于生成对抗网络(GAN-Based Debiasing)的去偏框架,通过生成更均衡的数据分布来提升模型学习鲁棒特征的能力。这两项技术协同作用,显著提升了模型在存在偏置和不平衡数据分布场景下的表现。
链接: https://arxiv.org/abs/2509.20884
作者: Zhifei Li,Feng Qiu,Yiran Wang,Yujing Xia,Kui Xiao,Miao Zhang,Yan Zhang
机构: Hubei University (湖北大学); Hubei Key Laboratory of Big Data Intelligent Analysis and Application (湖北大学大数据智能分析与应用重点实验室); Key Laboratory of Intelligent Sensing System and Security (教育部智能感知系统与安全重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 pages, 6 figures. ACCEPTED for publication as a REGULAR paper in the IEEE Transactions on Multimedia 2025
Abstract:Visual Question Answering (VQA) presents a unique challenge by requiring models to understand and reason about visual content to answer questions accurately. Existing VQA models often struggle with biases introduced by the training data, leading to over-reliance on superficial patterns and inadequate generalization to diverse questions and images. This paper presents a novel model, IOG-VQA, which integrates Object Interaction Self-Attention and GAN-Based Debiasing to enhance VQA model performance. The self-attention mechanism allows our model to capture complex interactions between objects within an image, providing a more comprehensive understanding of the visual context. Meanwhile, the GAN-based debiasing framework generates unbiased data distributions, helping the model to learn more robust and generalizable features. By leveraging these two components, IOG-VQA effectively combines visual and textual information to address the inherent biases in VQA datasets. Extensive experiments on the VQA-CP v1 and VQA-CP v2 datasets demonstrate that our model shows excellent performance compared with the existing methods, particularly in handling biased and imbalanced data distributions highlighting the importance of addressing both object interactions and dataset biases in advancing VQA tasks. Our code is available at this https URL.
zh
[CV-55] he Unanticipated Asymmetry Between Perceptual Optimization and Assessment
【速读】:该论文旨在解决感知优化(perceptual optimization)与图像质量评估(image quality assessment, IQA)之间目标函数有效性不对称的问题,即现有基于 fidelity 的指标在 IQA 中表现优异,并不意味着它们在优化过程中同样有效,尤其是在对抗训练场景下这种不匹配更为显著。其解决方案的关键在于系统性地分析不同损失函数(如 fidelity objective 和 adversarial objective)在优化与评估任务中的差异,揭示 discriminator 架构设计对优化效果的决定性影响:具体而言,patch-level 和卷积结构的判别器相较于标准或 Transformer-based 设计,在细节重建方面更具优势,从而为更合理的感知优化损失设计和 IQA 模型迁移提供理论依据。
链接: https://arxiv.org/abs/2509.20878
作者: Jiabei Zhang,Qi Wang,Siyu Wu,Du Chen,Tianhe Wu
机构: Institute of Microelectronics of the Chinese Academy of Sciences (中国科学院微电子研究所); Beihang University (北京航空航天大学); The Hong Kong Polytechnic University (香港理工大学); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Perceptual optimization is primarily driven by the fidelity objective, which enforces both semantic consistency and overall visual realism, while the adversarial objective provides complementary refinement by enhancing perceptual sharpness and fine-grained detail. Despite their central role, the correlation between their effectiveness as optimization objectives and their capability as image quality assessment (IQA) metrics remains underexplored. In this work, we conduct a systematic analysis and reveal an unanticipated asymmetry between perceptual optimization and assessment: fidelity metrics that excel in IQA are not necessarily effective for perceptual optimization, with this misalignment emerging more distinctly under adversarial training. In addition, while discriminators effectively suppress artifacts during optimization, their learned representations offer only limited benefits when reused as backbone initializations for IQA models. Beyond this asymmetry, our findings further demonstrate that discriminator design plays a decisive role in shaping optimization, with patch-level and convolutional architectures providing more faithful detail reconstruction than vanilla or Transformer-based alternatives. These insights advance the understanding of loss function design and its connection to IQA transferability, paving the way for more principled approaches to perceptual optimization.
zh
[CV-56] SCRA-VQA: Summarized Caption-Rerank for Augmented Large Language Models in Visual Question Answering
【速读】:该论文旨在解决知识增强型视觉问答(Knowledge-Based Visual Question Answering, KB-VQA)中因图像描述文本(如图像字幕)包含大量与问题无关的噪声信息,导致大语言模型(Large Language Models, LLMs)推理能力受限的问题。解决方案的关键在于提出一种名为“总结-重排序增强型视觉问答”(Summarized Caption-Rerank Augmented VQA, SCRA-VQA)的方法:首先利用预训练的视觉语言模型将图像转换为结构化字幕;随后生成上下文相关的示例,并对字幕进行总结与重排序以剔除冗余信息,从而提升LLMs对图像内容和问题的理解能力,显著增强模型的推理性能与任务适应性,且无需昂贵的端到端训练。
链接: https://arxiv.org/abs/2509.20871
作者: Yan Zhang,Jiaqing Lin,Miao Zhang,Kui Xiao,Xiaoju Hou,Yue Zhao,Zhifei Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ACCEPTED as a FULL PAPER for the Research Track at International Conference on Database Systems for Advanced Applications 2025
Abstract:Acquiring high-quality knowledge is a central focus in Knowledge-Based Visual Question Answering (KB-VQA). Recent methods use large language models (LLMs) as knowledge engines for answering. These methods generally employ image captions as visual text descriptions to assist LLMs in interpreting images. However, the captions frequently include excessive noise irrelevant to the question, and LLMs generally do not comprehend VQA tasks, limiting their reasoning capabilities. To address this issue, we propose the Summarized Caption-Rerank Augmented VQA (SCRA-VQA), which employs a pre-trained visual language model to convert images into captions. Moreover, SCRA-VQA generates contextual examples for the captions while simultaneously summarizing and reordering them to exclude unrelated information. The caption-rerank process enables LLMs to understand the image information and questions better, thus enhancing the model’s reasoning ability and task adaptability without expensive end-to-end training. Based on an LLM with 6.7B parameters, SCRA-VQA performs excellently on two challenging knowledge-based VQA datasets: OK-VQA and A-OKVQA, achieving accuracies of 38.8% and 34.6%. Our code is available at this https URL.
zh
[CV-57] Plant identification in an open-world (LifeCLEF 2016)
【速读】:该论文针对的是大规模植物识别任务中开放集识别(open-set recognition)问题,即如何使识别系统在面对训练集中未见的未知类别时仍具备鲁棒性。传统方法仅关注已知类别的分类准确率,而忽略了因未知类别导致的误报(false positive)问题。解决方案的关键在于设计能够自动拒绝未知类别样本的机制,从而提升系统在真实世界生物多样性监测场景下的实用性与可靠性。
链接: https://arxiv.org/abs/2509.20870
作者: Herve Goeau,Pierre Bonnet,Alexis Joly
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 2 figures, CLEF 2016 Conference and Labs of the Evaluation Forum, September 05 to 08, 2016, Evora, Portugal
Abstract:The LifeCLEF plant identification challenge aims at evaluating plant identification methods and systems at a very large scale, close to the conditions of a real-world biodiversity monitoring scenario. The 2016-th edition was actually conducted on a set of more than 110K images illustrating 1000 plant species living in West Europe, built through a large-scale participatory sensing platform initiated in 2011 and which now involves tens of thousands of contributors. The main novelty over the previous years is that the identification task was evaluated as an open-set recognition problem, i.e. a problem in which the recognition system has to be robust to unknown and never seen categories. Beyond the brute-force classification across the known classes of the training set, the big challenge was thus to automatically reject the false positive classification hits that are caused by the unknown classes. This overview presents more precisely the resources and assessments of the challenge, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of the main outcomes.
zh
[CV-58] SD-RetinaNet: Topologically Constrained Semi-Supervised Retinal Lesion and Layer Segmentation in OCT
【速读】:该论文旨在解决现有半监督学习方法在视网膜OCT图像中进行生物标志物(如视网膜层和病灶)分割时存在的三大问题:生成解剖上不合理的分割结果、未能有效建模层与病灶之间的交互关系,以及缺乏拓扑结构正确性的保障。其解决方案的关键在于提出一种全新的半监督模型,引入一个全可微的生物标志物拓扑引擎(biomarker topology engine),以强制执行解剖学上正确的分割,并实现层与病灶之间的双向影响联合学习;同时,该模型通过学习解耦表示(disentangled representation),分离空间特征与风格因素,从而提升分割的真实性和准确性,确保病灶严格位于相对于已分割层的解剖学合理位置。
链接: https://arxiv.org/abs/2509.20864
作者: Botond Fazekas,Guilherme Aresta,Philipp Seeböck,Julia Mai,Ursula Schmidt-Erfurth,Hrvoje Bogunović
机构: Christian Doppler Laboratory for Artificial Intelligence in Retina (克里斯蒂安·多普勒人工智能视网膜实验室); Institute of Artificial Intelligence, Center for Medical Data Science, Medical University of Vienna (维也纳医科大学人工智能研究所,医学数据科学中心); Computational Imaging Research Lab, Department of Biomedical Imaging and Image-Guided Therapy, Medical University of Vienna (生物医学成像与图像引导治疗系计算成像研究实验室); Department of Ophthalmology and Optometry, Medical University of Vienna (维也纳医科大学眼科与视光学系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Optical coherence tomography (OCT) is widely used for diagnosing and monitoring retinal diseases, such as age-related macular degeneration (AMD). The segmentation of biomarkers such as layers and lesions is essential for patient diagnosis and follow-up. Recently, semi-supervised learning has shown promise in improving retinal segmentation performance. However, existing methods often produce anatomically implausible segmentations, fail to effectively model layer-lesion interactions, and lack guarantees on topological correctness. To address these limitations, we propose a novel semi-supervised model that introduces a fully differentiable biomarker topology engine to enforce anatomically correct segmentation of lesions and layers. This enables joint learning with bidirectional influence between layers and lesions, leveraging unlabeled and diverse partially labeled datasets. Our model learns a disentangled representation, separating spatial and style factors. This approach enables more realistic layer segmentations and improves lesion segmentation, while strictly enforcing lesion location in their anatomically plausible positions relative to the segmented layers. We evaluate the proposed model on public and internal datasets of OCT scans and show that it outperforms the current state-of-the-art in both lesion and layer segmentation, while demonstrating the ability to generalize layer segmentation to pathological cases using partially annotated training data. Our results demonstrate the potential of using anatomical constraints in semi-supervised learning for accurate, robust, and trustworthy retinal biomarker segmentation. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2509.20864 [cs.CV] (or arXiv:2509.20864v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2509.20864 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-59] ArchGPT : Understanding the Worlds Architectures with Large Multimodal Models
【速读】:该论文旨在解决现有虚拟现实(VR)、混合现实(MR)和增强现实(AR)系统在建筑领域应用中普遍存在的可扩展性不足问题,即这些系统多为定制开发,依赖硬编码标注和任务特定交互,难以适配多样化的建成环境。其解决方案的关键在于提出ArchGPT——一个面向建筑领域的多模态视觉问答(VQA)模型,并构建了一个可扩展的数据构建流程,用于生成高质量、领域专用的建筑图像-问题-答案三元组数据集Arch-300K(约315,000个样本)。该流程通过粗粒度到细粒度的图像筛选策略(结合三维重建与语义分割)获取结构一致且无遮挡的建筑图像,再利用大语言模型(LLM)引导的文本验证与知识蒸馏机制净化原始元数据,最终合成包含形式分析注释的丰富语义标签,从而实现对建筑空间内容的精准理解与泛化能力提升。
链接: https://arxiv.org/abs/2509.20858
作者: Yuze Wang,Luo Yang,Junyi Wang,Yue Qi
机构: Beihang University (北京航空航天大学); Shandong University (山东大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:
Abstract:Architecture embodies aesthetic, cultural, and historical values, standing as a tangible testament to human civilization. Researchers have long leveraged virtual reality (VR), mixed reality (MR), and augmented reality (AR) to enable immersive exploration and interpretation of architecture, enhancing accessibility, public understanding, and creative workflows around architecture in education, heritage preservation, and professional design practice. However, existing VR/MR/AR systems are often developed case-by-case, relying on hard-coded annotations and task-specific interactions that do not scale across diverse built environments. In this work, we present ArchGPT, a multimodal architectural visual question answering (VQA) model, together with a scalable data-construction pipeline for curating high-quality, architecture-specific VQA annotations. This pipeline yields Arch-300K, a domain-specialized dataset of approximately 315,000 image-question-answer triplets. Arch-300K is built via a multi-stage process: first, we curate architectural scenes from Wikimedia Commons and filter unconstrained tourist photo collections using a novel coarse-to-fine strategy that integrates 3D reconstruction and semantic segmentation to select occlusion-free, structurally consistent architectural images. To mitigate noise and inconsistency in raw textual metadata, we propose an LLM-guided text verification and knowledge-distillation pipeline to generate reliable, architecture-specific question-answer pairs. Using these curated images and refined metadata, we further synthesize formal analysis annotations-including detailed descriptions and aspect-guided conversations-to provide richer semantic variety while remaining faithful to the data. We perform supervised fine-tuning of an open-source multimodal backbone ,ShareGPT4V-7B, on Arch-300K, yielding ArchGPT.
zh
[CV-60] asselNetV4: A vision foundation model for cross-scene cross-scale and cross-species plant counting
【速读】:该论文旨在解决植物计数中因物种多样性导致的传统方法难以泛化的问题,即现有基于检测或回归的计数模型通常依赖特定物种训练,无法应对不断出现的新品种和跨物种场景。其核心解决方案是提出TasselNetV4模型,该模型从物种特异性计数转向跨物种计数,融合了局部计数思想与类无关计数(Class-Agnostic Counting, CAC)中的提取-匹配范式,并基于纯视觉Transformer架构引入多分支框感知局部计数器,以增强跨尺度鲁棒性。实验表明,TasselNetV4在两个挑战性数据集(PAC-105和PAC-Somalia)上显著优于当前最优CAC模型,展现出作为植物计数基础模型(vision foundation model)的潜力,适用于跨场景、跨尺度和跨物种的应用场景。
链接: https://arxiv.org/abs/2509.20857
作者: Xiaonan Hu,Xuebing Li,Jinyu Xu,Abdulkadir Duran Adan,Letian Zhou,Xuhui Zhu,Yanan Li,Wei Guo,Shouyang Liu,Wenzhong Liu,Hao Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 figures, 7 tables, code is available at this https URL
Abstract:Accurate plant counting provides valuable information for agriculture such as crop yield prediction, plant density assessment, and phenotype quantification. Vision-based approaches are currently the mainstream solution. Prior art typically uses a detection or a regression model to count a specific plant. However, plants have biodiversity, and new cultivars are increasingly bred each year. It is almost impossible to exhaust and build all species-dependent counting models. Inspired by class-agnostic counting (CAC) in computer vision, we argue that it is time to rethink the problem formulation of plant counting, from what plants to count to how to count plants. In contrast to most daily objects with spatial and temporal invariance, plants are dynamic, changing with time and space. Their non-rigid structure often leads to worse performance than counting rigid instances like heads and cars such that current CAC and open-world detection models are suboptimal to count plants. In this work, we inherit the vein of the TasselNet plant counting model and introduce a new extension, TasselNetV4, shifting from species-specific counting to cross-species counting. TasselNetV4 marries the local counting idea of TasselNet with the extract-and-match paradigm in CAC. It builds upon a plain vision transformer and incorporates novel multi-branch box-aware local counters used to enhance cross-scale robustness. Two challenging datasets, PAC-105 and PAC-Somalia, are harvested. Extensive experiments against state-of-the-art CAC models show that TasselNetV4 achieves not only superior counting performance but also high this http URL results indicate that TasselNetV4 emerges to be a vision foundation model for cross-scene, cross-scale, and cross-species plant counting.
zh
[CV-61] Plant identification based on noisy web data: the amazing performance of deep learning (LifeCLEF 2017)
【速读】:该论文旨在解决大规模植物识别系统中训练数据质量与数量之间的权衡问题,即:如何在缺乏高质量标注数据的情况下,利用网络上大量存在噪声和标签错误的图像数据来构建有效的植物识别模型。其关键解决方案在于设计一个公平的评估框架,通过使用来自Pl@ntNet移动应用的独立测试集(非训练数据来源),对比分析两种训练策略的效果——一种是基于专家审核的小规模可信数据集,另一种是通过网络爬取的、包含大量标注错误的大规模噪声数据集。这种设置能够客观衡量噪声数据在实际场景中的可用性与潜力,为未来自动化植物识别系统在大陆尺度上的部署提供实证依据。
链接: https://arxiv.org/abs/2509.20856
作者: Herve Goeau,Pierre Bonnet,Alexis Joly
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 3 figures, CLEF 2017 Conference and Labs of the Evaluation Forum, September 11 to 14, 2017, Dublin, Ireland
Abstract:The 2017-th edition of the LifeCLEF plant identification challenge is an important milestone towards automated plant identification systems working at the scale of continental floras with 10.000 plant species living mainly in Europe and North America illustrated by a total of 1.1M images. Nowadays, such ambitious systems are enabled thanks to the conjunction of the dazzling recent progress in image classification with deep learning and several outstanding international initiatives, such as the Encyclopedia of Life (EOL), aggregating the visual knowledge on plant species coming from the main national botany institutes. However, despite all these efforts the majority of the plant species still remain without pictures or are poorly illustrated. Outside the institutional channels, a much larger number of plant pictures are available and spread on the web through botanist blogs, plant lovers web-pages, image hosting websites and on-line plant retailers. The LifeCLEF 2017 plant challenge presented in this paper aimed at evaluating to what extent a large noisy training dataset collected through the web and containing a lot of labelling errors can compete with a smaller but trusted training dataset checked by experts. To fairly compare both training strategies, the test dataset was created from a third data source, i.e. the Pl@ntNet mobile application that collects millions of plant image queries all over the world. This paper presents more precisely the resources and assessments of the challenge, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of the main outcomes.
zh
[CV-62] Punching Above Precision: Small Quantized Model Distillation with Learnable Regularizer
【速读】:该论文旨在解决量化感知训练(Quantization-aware Training, QAT)与知识蒸馏(Knowledge Distillation, KD)结合时,因任务特定(Task-specific, TS)损失与蒸馏损失梯度幅度不一致而导致的优化冲突问题,尤其是在低比特量化场景下性能下降显著。解决方案的关键在于提出一种可学习的正则化方法——Game of Regularizer(GoR),其通过仅引入两个可训练参数动态调整TS与KD目标之间的权重,从而自适应地平衡两类监督信号,缓解梯度冲突、提升收敛稳定性,并显著增强小模型量化后的性能表现。
链接: https://arxiv.org/abs/2509.20854
作者: Abdur Rehman,S M A Sharif,Md Abdur Rahaman,Mohamed Jismy Aashik Rasool,Seongwan Kim,Jaeho Lee
机构: Opt-AI(优化人工智能)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Quantization-aware training (QAT) combined with knowledge distillation (KD) is a promising strategy for compressing Artificial Intelligence (AI) models for deployment on resource-constrained hardware. However, existing QAT-KD methods often struggle to balance task-specific (TS) and distillation losses due to heterogeneous gradient magnitudes, especially under low-bit quantization. We propose Game of Regularizer (GoR), a novel learnable regularization method that adaptively balances TS and KD objectives using only two trainable parameters for dynamic loss weighting. GoR reduces conflict between supervision signals, improves convergence, and boosts the performance of small quantized models (SQMs). Experiments on image classification, object detection (OD), and large language model (LLM) compression show that GoR consistently outperforms state-of-the-art QAT-KD methods. On low-power edge devices, it delivers faster inference while maintaining full-precision accuracy. We also introduce QAT-EKD-GoR, an ensemble distillation framework that uses multiple heterogeneous teacher models. Under optimal conditions, the proposed EKD-GoR can outperform full-precision models, providing a robust solution for real-world deployment.
zh
[CV-63] FHRFormer: A Self-supervised Transformer Approach for Fetal Heart Rate Inpainting and Forecasting
【速读】:该论文旨在解决胎儿心率(Fetal Heart Rate, FHR)监测中因传感器位移或胎位/母体位置变化导致的信号缺失问题,该问题限制了基于人工智能(AI)方法对FHR数据进行有效分析与风险预测的能力。解决方案的关键在于提出一种基于掩码Transformer的自编码器方法,通过同时捕捉FHR信号的空间结构和频域特征,实现对缺失数据的高质量重建,从而支持后续的信号补全(inpainting)与预测任务,提升AI模型在复杂临床场景下的鲁棒性与适用性。
链接: https://arxiv.org/abs/2509.20852
作者: Kjersti Engan,Neel Kanwal,Anita Yeconia,Ladislaus Blacy,Yuda Munyaw,Estomih Mduma,Hege Ersdal
机构: University of Stavanger (斯塔万格大学); Haydom Lutheran Hospital (海多姆路德教会医院); Stavanger University Hospital (斯塔万格大学医院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IEEE JBHI
Abstract:Approximately 10% of newborns require assistance to initiate breathing at birth, and around 5% need ventilation support. Fetal heart rate (FHR) monitoring plays a crucial role in assessing fetal well-being during prenatal care, enabling the detection of abnormal patterns and supporting timely obstetric interventions to mitigate fetal risks during labor. Applying artificial intelligence (AI) methods to analyze large datasets of continuous FHR monitoring episodes with diverse outcomes may offer novel insights into predicting the risk of needing breathing assistance or interventions. Recent advances in wearable FHR monitors have enabled continuous fetal monitoring without compromising maternal mobility. However, sensor displacement during maternal movement, as well as changes in fetal or maternal position, often lead to signal dropouts, resulting in gaps in the recorded FHR data. Such missing data limits the extraction of meaningful insights and complicates automated (AI-based) analysis. Traditional approaches to handle missing data, such as simple interpolation techniques, often fail to preserve the spectral characteristics of the signals. In this paper, we propose a masked transformer-based autoencoder approach to reconstruct missing FHR signals by capturing both spatial and frequency components of the data. The proposed method demonstrates robustness across varying durations of missing data and can be used for signal inpainting and forecasting. The proposed approach can be applied retrospectively to research datasets to support the development of AI-based risk algorithms. In the future, the proposed method could be integrated into wearable FHR monitoring devices to achieve earlier and more robust risk detection.
zh
[CV-64] Poisoning Prompt-Guided Sampling in Video Large Language Models
【速读】:该论文旨在解决生成式视频大模型(Video Large Language Models, VideoLLMs)中基于提示引导(prompt-guided)的帧采样机制存在的安全漏洞问题。现有研究已揭示均匀采样和语义相似性采样策略的脆弱性,但提示引导采样因其动态响应任务描述的能力而被视为更先进,其安全性尚未被探索。论文提出首个黑盒投毒攻击方法PoisonVID,其核心在于通过闭环优化策略迭代生成一个通用扰动(universal perturbation),以抑制有害帧的相关性得分;该优化过程借助由影子VideoLLM与轻量级语言模型GPT-4o-mini构建的“描绘集”(depiction set),该集合由改写后的有害描述构成,从而实现对提示引导机制的有效干扰。实验表明,PoisonVID在三种提示引导采样策略和三个先进VideoLLMs上均达到82%–99%的攻击成功率,凸显了开发更具鲁棒性的未来视频采样策略的重要性。
链接: https://arxiv.org/abs/2509.20851
作者: Yuxin Cao,Wei Song,Jingling Xue,Jin Song Dong
机构: National University of Singapore (新加坡国立大学); University of New South Wales (新南威尔士大学); CSIRO’s Data61 (澳大利亚联邦科学与工业研究组织数据61实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 4 figures
Abstract:Video Large Language Models (VideoLLMs) have emerged as powerful tools for understanding videos, supporting tasks such as summarization, captioning, and question answering. Their performance has been driven by advances in frame sampling, progressing from uniform-based to semantic-similarity-based and, most recently, prompt-guided strategies. While vulnerabilities have been identified in earlier sampling strategies, the safety of prompt-guided sampling remains unexplored. We close this gap by presenting PoisonVID, the first black-box poisoning attack that undermines prompt-guided sampling in VideoLLMs. PoisonVID compromises the underlying prompt-guided sampling mechanism through a closed-loop optimization strategy that iteratively optimizes a universal perturbation to suppress harmful frame relevance scores, guided by a depiction set constructed from paraphrased harmful descriptions leveraging a shadow VideoLLM and a lightweight language model, i.e., GPT-4o-mini. Comprehensively evaluated on three prompt-guided sampling strategies and across three advanced VideoLLMs, PoisonVID achieves 82% - 99% attack success rate, highlighting the importance of developing future advanced sampling strategies for VideoLLMs.
zh
[CV-65] ARMesh: Autoregressive Mesh Generation via Next-Level-of-Detail Prediction NEURIPS2025
【速读】:该论文旨在解决传统自回归(Auto-regressive, AR)网格生成模型在构建三维网格时,按字典序逐面生成所导致的几何结构表达不自然、难以捕捉人类感知一致性的问题。其核心解决方案是提出一种新颖的粗到精(coarse-to-fine)渐进式网格生成方法:将网格简化算法(mesh simplification)视为从精细到粗糙的过程,并利用其逆过程——即从单点开始逐步通过局部重网格化添加几何细节——来指导AR模型生成;同时,将网格推广为单纯复形(simplicial complexes),使拓扑结构在生成过程中可变且无需预定义。此方案不仅支持通过提前终止生成过程实现对质量和效率的直观控制,还拓展了网格细化与编辑等应用。
链接: https://arxiv.org/abs/2509.20824
作者: Jiabao Lei,Kewei Shi,Zhihao Liang,Kui Jia
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); The University of Hong Kong (香港大学); Tencent Hunyuan (腾讯混元)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2025, Project Page: this https URL
Abstract:Directly generating 3D meshes, the default representation for 3D shapes in the graphics industry, using auto-regressive (AR) models has become popular these days, thanks to their sharpness, compactness in the generated results, and ability to represent various types of surfaces. However, AR mesh generative models typically construct meshes face by face in lexicographic order, which does not effectively capture the underlying geometry in a manner consistent with human perception. Inspired by 2D models that progressively refine images, such as the prevailing next-scale prediction AR models, we propose generating meshes auto-regressively in a progressive coarse-to-fine manner. Specifically, we view mesh simplification algorithms, which gradually merge mesh faces to build simpler meshes, as a natural fine-to-coarse process. Therefore, we generalize meshes to simplicial complexes and develop a transformer-based AR model to approximate the reverse process of simplification in the order of level of detail, constructing meshes initially from a single point and gradually adding geometric details through local remeshing, where the topology is not predefined and is alterable. Our experiments show that this novel progressive mesh generation approach not only provides intuitive control over generation quality and time consumption by early stopping the auto-regressive process but also enables applications such as mesh refinement and editing.
zh
[CV-66] CaTS-Bench: Can Language Models Describe Numeric Time Series?
【速读】:该论文旨在解决当前时间序列描述(time series captioning)任务中基准数据集存在的局限性问题,如依赖合成数据、 captions过于简单、忽略元数据(metadata)和可视化表示等。为填补这一空白,作者提出了CaTS-Bench——首个大规模真实世界的时间序列上下文感知描述基准,其核心解决方案在于构建一个可扩展的生成管道:通过Oracle大语言模型(LLM)生成参考caption,并结合事实核查、人类不可区分性测试及多样性分析确保质量;同时提供579个经人工修订的测试样本以提升准确性与自然性。该方案不仅提升了caption生成的可靠性,也为未来时间序列分析与基础模型交叉研究提供了可扩展的评估框架。
链接: https://arxiv.org/abs/2509.20823
作者: Luca Zhou,Pratham Yashwante,Marshall Fisher,Alessio Sampieri,Zihao Zhou,Fabio Galasso,Rose Yu
机构: Sapienza University of Rome (罗马大学); University of California San Diego (加州大学圣地亚哥分校); ItalAI Labs
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 4 images, 4 tables in the main paper. Many more in the appendix
Abstract:Time series captioning, the task of describing numeric time series in natural language, requires numerical reasoning, trend interpretation, and contextual understanding. Existing benchmarks, however, often rely on synthetic data or overly simplistic captions, and typically neglect metadata and visual representations. To close this gap, we introduce CaTS-Bench, the first large-scale, real-world benchmark for Context-aware Time Series captioning. CaTS-Bench is derived from 11 diverse datasets reframed as captioning and QA tasks, comprising roughly 465k training and 105k test timestamps. Each sample includes a numeric series segment, contextual metadata, a line-chart image, and a caption. A key contribution of this work is the scalable pipeline used to generate reference captions: while most references are produced by an oracle LLM and verified through factual checks, human indistinguishability studies, and diversity analyses, we also provide a human-revisited subset of 579 test captions, refined from LLM outputs to ensure accuracy and human-like style. Beyond captioning, CaTS-Bench offers 460 multiple-choice questions targeting deeper aspects of time series reasoning. We further propose new tailored evaluation metrics and benchmark leading VLMs, highlighting both their strengths and persistent limitations. Together, these contributions establish CaTS-Bench and its captioning pipeline as a reliable and extensible foundation for future research at the intersection of time series analysis and foundation models.
zh
[CV-67] Revolutionizing Precise Low Back Pain Diagnosis via Contrastive Learning
【速读】:该论文旨在解决腰痛(low back pain)诊断中因医学影像与文本报告复杂性导致的多模态信息融合难题,从而提升自动化骨科诊断和临床决策支持的准确性。其核心解决方案是提出LumbarCLIP框架,通过对比语言-图像预训练(contrastive language-image pretraining)将腰椎MRI扫描与放射学报告对齐;关键创新在于使用ResNet-50、Vision Transformer和Swin Transformer等视觉编码器结合BERT文本编码器提取高维特征,并通过可配置的线性或非线性投影头将其映射至共享嵌入空间,配合软CLIP损失函数实现稳定训练,最终在分类任务上达到95.00%准确率和94.75% F1分数,且实验证明线性投影头优于非线性变体,显著提升了跨模态对齐效果。
链接: https://arxiv.org/abs/2509.20813
作者: Thanh Binh Le,Hoang Nhat Khang Vo,Tan-Ha Mai,Trong Nhan Phan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 4 figures
Abstract:Low back pain affects millions worldwide, driving the need for robust diagnostic models that can jointly analyze complex medical images and accompanying text reports. We present LumbarCLIP, a novel multimodal framework that leverages contrastive language-image pretraining to align lumbar spine MRI scans with corresponding radiological descriptions. Built upon a curated dataset containing axial MRI views paired with expert-written reports, LumbarCLIP integrates vision encoders (ResNet-50, Vision Transformer, Swin Transformer) with a BERT-based text encoder to extract dense representations. These are projected into a shared embedding space via learnable projection heads, configurable as linear or non-linear, and normalized to facilitate stable contrastive training using a soft CLIP loss. Our model achieves state-of-the-art performance on downstream classification, reaching up to 95.00% accuracy and 94.75% F1-score on the test set, despite inherent class imbalance. Extensive ablation studies demonstrate that linear projection heads yield more effective cross-modal alignment than non-linear variants. LumbarCLIP offers a promising foundation for automated musculoskeletal diagnosis and clinical decision support.
zh
[CV-68] Federated Domain Generalization with Domain-specific Soft Prompts Generation ICCV2025
【速读】:该论文旨在解决联邦域泛化(Federated Domain Generalization, FDG)中因客户端间存在领域偏移(domain shift)而导致下游任务适应困难的问题。现有基于提示学习(prompt learning)的FDG方法通常从训练样本中学习软提示(soft prompts),以替代人工设计的提示来提升模型泛化能力,但这些方法生成的提示多样性不足,且难以利用未知领域的信息。论文提出了一种新颖有效的解决方案——联邦域泛化中的领域特定软提示生成方法(Federated Domain Generalization with Domain-specific Soft Prompts Generation, FedDSPG),其核心在于:在训练阶段为每个领域引入领域特定软提示(Domain-specific Soft Prompts, DSPs),并通过客户端间的生成模型融合内容与领域知识;在推理阶段,利用生成器为未见目标领域生成DSPs,从而指导未知领域中的下游任务,显著提升了联邦学习场景下的域泛化性能。
链接: https://arxiv.org/abs/2509.20807
作者: Jianhan Wu,Xiaoyang Qu,Zhangcheng Huang,Jianzong Wang
机构: Ping An Technology (Shenzhen) Co., Ltd.
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the IEEE/CVF International Conference on Computer Vision (ICCV 2025)
Abstract:Prompt learning has become an efficient paradigm for adapting CLIP to downstream tasks. Compared with traditional fine-tuning, prompt learning optimizes a few parameters yet yields highly competitive results, especially appealing in federated learning for computational efficiency. engendering domain shift among clients and posing a formidable challenge for downstream-task adaptation. Existing federated domain generalization (FDG) methods based on prompt learning typically learn soft prompts from training samples, replacing manually designed prompts to enhance the generalization ability of federated models. However, these learned prompts exhibit limited diversity and tend to ignore information from unknown domains. We propose a novel and effective method from a generative perspective for handling FDG tasks, namely federated domain generalization with domain-specific soft prompts generation (FedDSPG). Specifically, during training, we introduce domain-specific soft prompts (DSPs) for each domain and integrate content and domain knowledge into the generative model among clients. In the inference phase, the generator is utilized to obtain DSPs for unseen target domains, thus guiding downstream tasks in unknown domains. Comprehensive evaluations across several public datasets confirm that our method outperforms existing strong baselines in FDG, achieving state-of-the-art results.
zh
[CV-69] FERD: Fairness-Enhanced Data-Free Robustness Distillation
【速读】:该论文旨在解决数据-free鲁棒性蒸馏(Data-Free Robustness Distillation, DFRD)中存在的公平性缺失问题,即现有方法在不访问训练数据的情况下迁移鲁棒性时,忽视了不同类别间鲁棒性的差异性,导致某些类别鲁棒性显著低于其他类别。其关键解决方案在于提出首个公平增强型数据-free鲁棒性蒸馏框架(Fairness-Enhanced Data-Free Robustness Distillation, FERD),通过两个核心机制实现:一是基于鲁棒性引导的类别重加权策略(robustness-guided class reweighting),动态增加对鲁棒性较弱类别的对抗样本合成比例;二是引入两类增强样本——公平感知样本(Fairness-Aware Examples, FAEs)与均匀目标对抗样本(Uniform-Target Adversarial Examples, UTAEs),前者通过特征级预测一致性约束抑制类别特异性非鲁棒特征的主导作用,后者通过均匀目标类别约束避免攻击方向偏倚,从而均衡各类别间的鲁棒性能并提升整体鲁棒性稳定性。
链接: https://arxiv.org/abs/2509.20793
作者: Zhengxiao Li,Liming Lu,Xu Zheng,Siyuan Liang,Zhenghan Chen,Yongbin Zhou,Shuchao Pang
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Data-Free Robustness Distillation (DFRD) aims to transfer the robustness from the teacher to the student without accessing the training data. While existing methods focus on overall robustness, they overlook the robust fairness issues, leading to severe disparity of robustness across different categories. In this paper, we find two key problems: (1) student model distilled with equal class proportion data behaves significantly different across distinct categories; and (2) the robustness of student model is not stable across different attacks target. To bridge these gaps, we present the first Fairness-Enhanced data-free Robustness Distillation (FERD) framework to adjust the proportion and distribution of adversarial examples. For the proportion, FERD adopts a robustness-guided class reweighting strategy to synthesize more samples for the less robust categories, thereby improving robustness of them. For the distribution, FERD generates complementary data samples for advanced robustness distillation. It generates Fairness-Aware Examples (FAEs) by enforcing a uniformity constraint on feature-level predictions, which suppress the dominance of class-specific non-robust features, providing a more balanced representation across all categories. Then, FERD constructs Uniform-Target Adversarial Examples (UTAEs) from FAEs by applying a uniform target class constraint to avoid biased attack directions, which distribute the attack targets across all categories and prevents overfitting to specific vulnerable categories. Extensive experiments on three public datasets show that FERD achieves state-of-the-art worst-class robustness under all adversarial attack (e.g., the worst-class robustness under FGSM and AutoAttack are improved by 15.1% and 6.4% using MobileNet-V2 on CIFAR-10), demonstrating superior performance in both robustness and fairness aspects.
zh
[CV-70] DAC-LoRA: Dynamic Adversarial Curriculum for Efficient and Robust Few-Shot Adaptation ICCV2025
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)后仍易受对抗攻击的问题,尤其是以CLIP为代表的骨干模型,其脆弱性可能在整个多模态AI生态系统中引发连锁反应。解决方案的关键在于提出一种名为动态对抗课程训练(Dynamic Adversarial Curriculum, DAC-LoRA)的新框架,该框架将对抗训练嵌入到PEFT流程中,通过基于一阶驻点条件(First-Order Stationary Condition, FOSC)引导的渐进式挑战性攻击策略与类TRADES损失函数相结合,实现鲁棒性的显著提升,同时保持对干净样本准确率的最小影响。
链接: https://arxiv.org/abs/2509.20792
作者: Ved Umrajkar
机构: Indian Institute of Technology, Roorkee (印度理工学院,鲁尔基)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at ICCV2025 Workshop on Safe and Trustworthy Multimodal AI Systems
Abstract:Vision-Language Models (VLMs) are foundational to critical applications like autonomous driving, medical diagnosis, and content moderation. While Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA enable their efficient adaptation to specialized tasks, these models remain vulnerable to adversarial attacks that can compromise safety-critical decisions. CLIP, the backbone for numerous downstream VLMs, is a high-value target whose vulnerabilities can cascade across the multimodal AI ecosystem. We propose Dynamic Adversarial Curriculum DAC-LoRA, a novel framework that integrates adversarial training into PEFT. The core principle of our method i.e. an intelligent curriculum of progressively challenging attack, is general and can potentially be applied to any iterative attack method. Guided by the First-Order Stationary Condition (FOSC) and a TRADES-inspired loss, DAC-LoRA achieves substantial improvements in adversarial robustness without significantly compromising clean accuracy. Our work presents an effective, lightweight, and broadly applicable method to demonstrate that the DAC-LoRA framework can be easily integrated into a standard PEFT pipeline to significantly enhance robustness.
zh
[CV-71] Real-Time Object Detection Meets DINOv3
【速读】:该论文旨在解决实时目标检测模型在性能与计算成本之间难以平衡的问题,尤其针对现有方法在资源受限场景(如边缘设备和移动端)下难以同时实现高精度与低延迟的挑战。解决方案的关键在于提出统一的DEIMv2框架,其核心创新包括:1)引入DINOv3预训练特征并设计空间调优适配器(Spatial Tuning Adapter, STA),将单尺度输出转化为多尺度特征,增强语义与细节融合能力;2)针对超轻量级模型采用HGNetv2结合深度与宽度剪枝策略,在严格资源预算下保持检测性能;3)配合简化的解码器和升级版密集一对一匹配(Dense O2O),实现从X到Atto共八个模型尺寸的端到端优化,从而在多种部署场景中均取得最优的性能-成本权衡,显著超越此前主流模型(如YOLO系列)的性能边界。
链接: https://arxiv.org/abs/2509.20787
作者: Shihua Huang,Yongjie Hou,Longfei Liu,Xuanlong Yu,Xi Shen
机构: Intellindust AI Lab; Xiamen University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Benefiting from the simplicity and effectiveness of Dense O2O and MAL, DEIM has become the mainstream training framework for real-time DETRs, significantly outperforming the YOLO series. In this work, we extend it with DINOv3 features, resulting in DEIMv2. DEIMv2 spans eight model sizes from X to Atto, covering GPU, edge, and mobile deployment. For the X, L, M, and S variants, we adopt DINOv3-pretrained or distilled backbones and introduce a Spatial Tuning Adapter (STA), which efficiently converts DINOv3’s single-scale output into multi-scale features and complements strong semantics with fine-grained details to enhance detection. For ultra-lightweight models (Nano, Pico, Femto, and Atto), we employ HGNetv2 with depth and width pruning to meet strict resource budgets. Together with a simplified decoder and an upgraded Dense O2O, this unified design enables DEIMv2 to achieve a superior performance-cost trade-off across diverse scenarios, establishing new state-of-the-art results. Notably, our largest model, DEIMv2-X, achieves 57.8 AP with only 50.3 million parameters, surpassing prior X-scale models that require over 60 million parameters for just 56.5 AP. On the compact side, DEIMv2-S is the first sub-10 million model (9.71 million) to exceed the 50 AP milestone on COCO, reaching 50.9 AP. Even the ultra-lightweight DEIMv2-Pico, with just 1.5 million parameters, delivers 38.5 AP, matching YOLOv10-Nano (2.3 million) with around 50 percent fewer parameters.
zh
[CV-72] Dual-supervised Asymmetric Co-training for Semi-supervised Medical Domain Generalization
【速读】:该论文旨在解决跨域半监督领域泛化(Cross-Domain Semi-Supervised Domain Generalization, CD-SSDG)场景下的性能瓶颈问题,即在训练阶段存在标签数据与无标签数据之间存在领域偏移(domain shift)的情况下,传统半监督领域泛化方法因伪标签不准确而导致模型泛化能力下降的问题。其解决方案的关键在于提出一种双监督异构协同训练(Dual-Supervised Asymmetric Co-Training, DAC)框架:一方面通过引入特征级监督来缓解标签与无标签数据间领域偏移带来的伪标签误差;另一方面为每个子模型设计不同的辅助自监督任务,以增强领域不变判别特征的学习并防止模型坍缩,从而提升模型在未见测试域上的鲁棒性和泛化性能。
链接: https://arxiv.org/abs/2509.20785
作者: Jincai Song,Haipeng Chen,Jun Qin,Na Zhao
机构: Jilin University (吉林大学); Changchun University of Science and Technology (长春理工大学); Singapore University of Technology and Design (新加坡科技设计大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 14 figures
Abstract:Semi-supervised domain generalization (SSDG) in medical image segmentation offers a promising solution for generalizing to unseen domains during testing, addressing domain shift challenges and minimizing annotation costs. However, conventional SSDG methods assume labeled and unlabeled data are available for each source domain in the training set, a condition that is not always met in practice. The coexistence of limited annotation and domain shift in the training set is a prevalent issue. Thus, this paper explores a more practical and challenging scenario, cross-domain semi-supervised domain generalization (CD-SSDG), where domain shifts occur between labeled and unlabeled training data, in addition to shifts between training and testing sets. Existing SSDG methods exhibit sub-optimal performance under such domain shifts because of inaccurate pseudolabels. To address this issue, we propose a novel dual-supervised asymmetric co-training (DAC) framework tailored for CD-SSDG. Building upon the co-training paradigm with two sub-models offering cross pseudo supervision, our DAC framework integrates extra feature-level supervision and asymmetric auxiliary tasks for each sub-model. This feature-level supervision serves to address inaccurate pseudo supervision caused by domain shifts between labeled and unlabeled data, utilizing complementary supervision from the rich feature space. Additionally, two distinct auxiliary self-supervised tasks are integrated into each sub-model to enhance domain-invariant discriminative feature learning and prevent model collapse. Extensive experiments on real-world medical image segmentation datasets, i.e., Fundus, Polyp, and SCGM, demonstrate the robust generalizability of the proposed DAC framework.
zh
[CV-73] CompressAI-Vision: Open-source software to evaluate compression methods for computer vision tasks
【速读】:该论文旨在解决当前视频压缩技术与基于神经网络(Neural Network, NN)的计算机视觉任务之间不匹配的问题,即传统压缩方法主要优化人类感知质量,而忽视了下游视觉任务(如目标检测、图像分类等)对压缩后数据的适应性。其解决方案的关键在于提出一个统一的评估平台 CompressAI-Vision,该平台支持在两种推理场景——“远程”(remote)和“分割”(split)推理下,量化评估压缩算法在比特率与任务准确性之间的权衡关系,并通过集成标准编解码器(under development)验证压缩增益。该平台已被国际标准组织 MPEG 采纳用于制定面向机器的特征编码(Feature Coding for Machines, FCM)标准,从而为任务驱动的视频压缩提供了可复现、可比较的基准。
链接: https://arxiv.org/abs/2509.20777
作者: Hyomin Choi,Heeji Han,Chris Rosewarne,Fabien Racapé
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:With the increasing use of neural network (NN)-based computer vision applications that process image and video data as input, interest has emerged in video compression technology optimized for computer vision tasks. In fact, given the variety of vision tasks, associated NN models and datasets, a consolidated platform is needed as a common ground to implement and evaluate compression methods optimized for downstream vision tasks. CompressAI-Vision is introduced as a comprehensive evaluation platform where new coding tools compete to efficiently compress the input of vision network while retaining task accuracy in the context of two different inference scenarios: “remote” and “split” inferencing. Our study showcases various use cases of the evaluation platform incorporated with standard codecs (under development) by examining the compression gain on several datasets in terms of bit-rate versus task accuracy. This evaluation platform has been developed as open-source software and is adopted by the Moving Pictures Experts Group (MPEG) for the development the Feature Coding for Machines (FCM) standard. The software is available publicly at this https URL.
zh
[CV-74] CusEnhancer: A Zero-Shot Scene and Controllability Enhancement Method for Photo Customization via ResInversion
【速读】:该论文旨在解决当前基于文本到图像扩散模型的人像合成方法中存在的场景退化、控制能力不足以及感知身份一致性差等问题。其核心解决方案是提出CustomEnhancer框架,关键在于引入一种零样本增强流水线(zero-shot enhancement pipeline),通过人脸交换技术和预训练扩散模型以零样本方式获取额外表征并编码至个性化模型中;同时设计三流融合的PerGeneration方法,识别并结合两个方向相反的潜在空间来操控个性化模型的关键潜在空间,从而统一生成与重构过程,实现从三个流中生成图像。此外,为降低空文本反演(null-text inversion, NTI)的时间复杂度,提出ResInversion方法,利用预扩散机制进行噪声校正,使反演效率提升129倍,显著提升了整体流程的实用性与效率。
链接: https://arxiv.org/abs/2509.20775
作者: Maoye Ren,Praneetha Vaddamanu,Jianjin Xu,Fernando De la Torre Frade
机构: East China University of Science and Technology (华东理工大学); Carnegie Mellon University (卡内基梅隆大学); Microsoft (微软)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recently remarkable progress has been made in synthesizing realistic human photos using text-to-image diffusion models. However, current approaches face degraded scenes, insufficient control, and suboptimal perceptual identity. We introduce CustomEnhancer, a novel framework to augment existing identity customization models. CustomEnhancer is a zero-shot enhancement pipeline that leverages face swapping techniques, pretrained diffusion model, to obtain additional representations in a zeroshot manner for encoding into personalized models. Through our proposed triple-flow fused PerGeneration approach, which identifies and combines two compatible counter-directional latent spaces to manipulate a pivotal space of personalized model, we unify the generation and reconstruction processes, realizing generation from three flows. Our pipeline also enables comprehensive training-free control over the generation process of personalized models, offering precise controlled personalization for them and eliminating the need for controller retraining for per-model. Besides, to address the high time complexity of null-text inversion (NTI), we introduce ResInversion, a novel inversion method that performs noise rectification via a pre-diffusion mechanism, reducing the inversion time by 129 times. Experiments demonstrate that CustomEnhancer reach SOTA results at scene diversity, identity fidelity, training-free controls, while also showing the efficiency of our ResInversion over NTI. The code will be made publicly available upon paper acceptance.
zh
[CV-75] Extrapolating Phase-Field Simulations in Space and Time with Purely Convolutional Architectures
【速读】:该论文旨在解决液态金属脱合金(Liquid Metal Dealloying, LMD)相场模型在大尺度域或长时间模拟中计算效率低下的问题。传统数值求解方法难以处理大规模或长时间的微结构演化,而本文提出了一种条件参数化的全卷积U-Net代理模型,其关键在于融合了卷积自注意力机制与物理感知填充策略,并通过参数条件化实现时间步长跳越和多种合金体系的适应性。该设计利用卷积的平移不变性,在仅训练短时小尺度数据的基础上,显著扩展预测范围,同时保持高精度(训练范围内相对误差通常低于5%,外推至更大域和更晚时间时低于10%),并实现最高达16,000倍的加速效果,将数周模拟压缩至秒级,为LMD相场模型的可扩展、高保真外推提供了新路径。
链接: https://arxiv.org/abs/2509.20770
作者: Christophe Bonneville,Nathan Bieberdorf,Pieterjan Robbe,Mark Asta,Habib N. Najm,Laurent Capolungo,Cosmin Safta
机构: Sandia National Laboratories (桑迪亚国家实验室); Lawrence Berkeley National Laboratory (劳伦斯伯克利国家实验室); University of California, Berkeley (加州大学伯克利分校); Los Alamos National Laboratory (洛斯阿拉莫斯国家实验室)
类目: Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Numerical Analysis (math.NA)
备注:
Abstract:Phase-field models of liquid metal dealloying (LMD) can resolve rich microstructural dynamics but become intractable for large domains or long time horizons. We present a conditionally parameterized, fully convolutional U-Net surrogate that generalizes far beyond its training window in both space and time. The design integrates convolutional self-attention and physics-aware padding, while parameter conditioning enables variable time-step skipping and adaptation to diverse alloy systems. Although trained only on short, small-scale simulations, the surrogate exploits the translational invariance of convolutions to extend predictions to much longer horizons than traditional solvers. It accurately reproduces key LMD physics, with relative errors typically under 5% within the training regime and below 10% when extrapolating to larger domains and later times. The method accelerates computations by up to 16,000 times, cutting weeks of simulation down to seconds, and marks an early step toward scalable, high-fidelity extrapolation of LMD phase-field models.
zh
[CV-76] Provenance Analysis of Archaeological Artifacts via Multimodal RAG Systems
【速读】:该论文旨在解决考古文物溯源分析中专家需面对海量比较数据集时的认知负担过重问题,尤其在缺乏系统性辅助工具的情况下,难以高效识别文物的年代、地理来源与文化归属。解决方案的关键在于构建一个基于检索增强生成(Retrieval-Augmented Generation, RAG)的多模态知识库系统,融合文本与图像信息,并利用大视觉语言模型(Vision-Language Models, VLMs)实现对原始视觉特征、边缘增强特征及语义特征的多层次检索与推理合成,从而生成结构化的推断结果及可解释的论证依据,显著提升专家决策效率与分析起点的准确性。
链接: https://arxiv.org/abs/2509.20769
作者: Tuo Zhang,Yuechun Sun,Ruiliang Liu
机构: Museus; University of Science and Technology of China (中国科学技术大学); British Museum (大英博物馆)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In this work, we present a retrieval-augmented generation (RAG)-based system for provenance analysis of archaeological artifacts, designed to support expert reasoning by integrating multimodal retrieval and large vision-language models (VLMs). The system constructs a dual-modal knowledge base from reference texts and images, enabling raw visual, edge-enhanced, and semantic retrieval to identify stylistically similar objects. Retrieved candidates are synthesized by the VLM to generate structured inferences, including chronological, geographical, and cultural attributions, alongside interpretive justifications. We evaluate the system on a set of Eastern Eurasian Bronze Age artifacts from the British Museum. Expert evaluation demonstrates that the system produces meaningful and interpretable outputs, offering scholars concrete starting points for analysis and significantly alleviating the cognitive burden of navigating vast comparative corpora.
zh
[CV-77] MASt3R-Fusion: Integrating Feed-Forward Visual Model with IMU GNSS for High-Functionality SLAM
【速读】:该论文旨在解决传统视觉SLAM(Simultaneous Localization and Mapping,同时定位与建图)系统在低纹理环境、尺度模糊以及复杂视觉条件下性能下降的问题,同时克服当前基于前馈神经网络的点云回归方法忽视多传感器概率信息融合的局限性。解决方案的关键在于提出MASt3R-Fusion框架,通过将前馈式点云回归与惯性测量单元(IMU)和全球导航卫星系统(GNSS)等互补传感器信息进行紧耦合融合,引入基于Sim(3)的视觉对齐约束(以海森形式表示),并将其嵌入统一的度量尺度SE(3)因子图中,实现高效的信息融合;进一步设计分层因子图结构,支持实时滑动窗口优化与具有强回环闭合能力的全局优化,从而在保证实时性的同时实现高精度位姿估计、度量尺度结构感知与全局一致性地图构建。
链接: https://arxiv.org/abs/2509.20757
作者: Yuxuan Zhou,Xingxing Li,Shengyu Li,Zhuohao Yan,Chunxi Xia,Shaoquan Feng
机构: Wuhan University (武汉大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Visual SLAM is a cornerstone technique in robotics, autonomous driving and extended reality (XR), yet classical systems often struggle with low-texture environments, scale ambiguity, and degraded performance under challenging visual conditions. Recent advancements in feed-forward neural network-based pointmap regression have demonstrated the potential to recover high-fidelity 3D scene geometry directly from images, leveraging learned spatial priors to overcome limitations of traditional multi-view geometry methods. However, the widely validated advantages of probabilistic multi-sensor information fusion are often discarded in these pipelines. In this work, we propose MASt3R-Fusion,a multi-sensor-assisted visual SLAM framework that tightly integrates feed-forward pointmap regression with complementary sensor information, including inertial measurements and GNSS data. The system introduces Sim(3)-based visualalignment constraints (in the Hessian form) into a universal metric-scale SE(3) factor graph for effective information fusion. A hierarchical factor graph design is developed, which allows both real-time sliding-window optimization and global optimization with aggressive loop closures, enabling real-time pose tracking, metric-scale structure perception and globally consistent mapping. We evaluate our approach on both public benchmarks and self-collected datasets, demonstrating substantial improvements in accuracy and robustness over existing visual-centered multi-sensor SLAM systems. The code will be released open-source to support reproducibility and further research (this https URL).
zh
[CV-78] FreeInsert: Personalized Object Insertion with Geometric and Style Control
【速读】:该论文旨在解决当前文本到图像扩散模型在个性化图像合成任务中面临的两大核心问题:一是插入对象的几何控制不足,现有方法受限于二维空间且依赖文本指令,难以实现精确的形状与视角控制;二是风格一致性差,插入对象与背景之间常出现不一致的视觉风格,导致生成结果缺乏真实感。解决方案的关键在于提出一个无需训练的框架FreeInsert,其核心创新是利用3D几何信息进行对象插入:首先将2D对象转换为3D表示,通过3D空间中的交互式编辑实现对形状和视角的精确控制,随后从指定视角重新渲染为2D图像作为几何约束,并结合扩散适配器(diffusion adapters)实现风格和内容控制,最终在扩散模型中生成几何可控且风格一致的图像。
链接: https://arxiv.org/abs/2509.20756
作者: Yuhong Zhang,Han Wang,Yiwen Wang,Rong Xie,Li Song
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Text-to-image diffusion models have made significant progress in image generation, allowing for effortless customized generation. However, existing image editing methods still face certain limitations when dealing with personalized image composition tasks. First, there is the issue of lack of geometric control over the inserted objects. Current methods are confined to 2D space and typically rely on textual instructions, making it challenging to maintain precise geometric control over the objects. Second, there is the challenge of style consistency. Existing methods often overlook the style consistency between the inserted object and the background, resulting in a lack of realism. In addition, the challenge of inserting objects into images without extensive training remains significant. To address these issues, we propose \textitFreeInsert, a novel training-free framework that customizes object insertion into arbitrary scenes by leveraging 3D geometric information. Benefiting from the advances in existing 3D generation models, we first convert the 2D object into 3D, perform interactive editing at the 3D level, and then re-render it into a 2D image from a specified view. This process introduces geometric controls such as shape or view. The rendered image, serving as geometric control, is combined with style and content control achieved through diffusion adapters, ultimately producing geometrically controlled, style-consistent edited images via the diffusion model.
zh
[CV-79] AI-Enabled Crater-Based Navigation for Lunar Mapping
【速读】:该论文旨在解决长期月球测绘任务中基于撞击坑的导航(Crater-Based Navigation, CBN)面临的挑战,尤其是在稀疏、斜视成像条件下,受光照变化和长时间轨道运行影响下的位姿估计难题。传统CBN多应用于短时动力下降与着陆场景,而本研究首次提出面向全年级月球测绘任务的端到端CBN流程STELLA,其关键在于:结合基于Mask R-CNN的撞击坑检测模块、无描述子的撞击坑识别机制、鲁棒的透视n点(Perspective-n-Point, PnP)位姿求解器以及批量轨道确定后端,实现了在复杂观测几何与光照条件下的稳定位姿估计;同时构建了首个模拟全年月球测绘任务的公开数据集CRESENT-365,验证了STELLA在米级位置精度和亚度级姿态精度下的鲁棒性,填补了CBN在真实长期月球测绘场景中的评估空白。
链接: https://arxiv.org/abs/2509.20748
作者: Sofia McLeod,Chee-Kheng Chng,Matthew Rodda,Tat-Jun Chin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 41 pages, 21 figures
Abstract:Crater-Based Navigation (CBN) uses the ubiquitous impact craters of the Moon observed on images as natural landmarks to determine the six degrees of freedom pose of a spacecraft. To date, CBN has primarily been studied in the context of powered descent and landing. These missions are typically short in duration, with high-frequency imagery captured from a nadir viewpoint over well-lit terrain. In contrast, lunar mapping missions involve sparse, oblique imagery acquired under varying illumination conditions over potentially year-long campaigns, posing significantly greater challenges for pose estimation. We bridge this gap with STELLA - the first end-to-end CBN pipeline for long-duration lunar mapping. STELLA combines a Mask R-CNN-based crater detector, a descriptor-less crater identification module, a robust perspective-n-crater pose solver, and a batch orbit determination back-end. To rigorously test STELLA, we introduce CRESENT-365 - the first public dataset that emulates a year-long lunar mapping mission. Each of its 15,283 images is rendered from high-resolution digital elevation models with SPICE-derived Sun angles and Moon motion, delivering realistic global coverage, illumination cycles, and viewing geometries. Experiments on CRESENT+ and CRESENT-365 show that STELLA maintains metre-level position accuracy and sub-degree attitude accuracy on average across wide ranges of viewing angles, illumination conditions, and lunar latitudes. These results constitute the first comprehensive assessment of CBN in a true lunar mapping setting and inform operational conditions that should be considered for future missions.
zh
[CV-80] Neptune-X: Active X-to-Maritime Generation for Universal Maritime Object Detection
【速读】:该论文旨在解决海上目标检测(maritime object detection)中面临的两大关键挑战:一是标注数据稀缺,二是模型在不同海上属性(如目标类别、视角、位置及成像环境)下泛化能力差。为应对这些问题,作者提出 Neptune-X,一个以数据为中心的生成选择框架,其核心在于通过任务感知的样本选择增强训练效果。解决方案的关键包括两个方面:一是开发 X-to-Maritime,一种多模态条件驱动的生成模型,用于合成多样化且逼真的海上场景,其中引入了双向目标-水体注意力模块(Bidirectional Object-Water Attention),以捕捉目标与水域边界间的交互关系,提升视觉保真度;二是提出基于属性相关性的主动采样策略(Attribute-correlated Active Sampling),动态筛选对下游任务更具相关性的合成样本,从而优化检测性能。
链接: https://arxiv.org/abs/2509.20745
作者: Yu Guo,Shengfeng He,Yuxu Lu,Haonan An,Yihang Tao,Huilin Zhu,Jingxian Liu,Yuguang Fang
机构: City University of Hong Kong (香港城市大学); Singapore Management University (新加坡管理大学); Wuhan University of Technology (武汉理工大学); The Hong Kong Polytechnic University (香港理工大学); Wuhan University of Science and Technology (武汉科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Maritime object detection is essential for navigation safety, surveillance, and autonomous operations, yet constrained by two key challenges: the scarcity of annotated maritime data and poor generalization across various maritime attributes (e.g., object category, viewpoint, location, and imaging environment). % In particular, models trained on existing datasets often underperform in underrepresented scenarios such as open-sea environments. To address these challenges, we propose Neptune-X, a data-centric generative-selection framework that enhances training effectiveness by leveraging synthetic data generation with task-aware sample selection. From the generation perspective, we develop X-to-Maritime, a multi-modality-conditioned generative model that synthesizes diverse and realistic maritime scenes. A key component is the Bidirectional Object-Water Attention module, which captures boundary interactions between objects and their aquatic surroundings to improve visual fidelity. To further improve downstream tasking performance, we propose Attribute-correlated Active Sampling, which dynamically selects synthetic samples based on their task relevance. To support robust benchmarking, we construct the Maritime Generation Dataset, the first dataset tailored for generative maritime learning, encompassing a wide range of semantic conditions. Extensive experiments demonstrate that our approach sets a new benchmark in maritime scene synthesis, significantly improving detection accuracy, particularly in challenging and previously underrepresented this http URL code is available at this https URL.
zh
[CV-81] SLAM-Free Visual Navigation with Hierarchical Vision-Language Perception and Coarse-to-Fine Semantic Topological Planning
【速读】:该论文旨在解决传统SLAM(Simultaneous Localization and Mapping,同时定位与建图)系统在腿式机器人导航中面临的三大问题:快速运动下的鲁棒性差、传感器校准要求高以及长期运行中的漂移误差;同时,传统方法在任务驱动探索中缺乏语义推理能力。解决方案的关键在于提出一种纯视觉、无需SLAM的导航框架,其核心创新包括:1)构建分层视觉-语言感知模块,融合场景级上下文与物体级线索以实现鲁棒的语义推断;2)采用语义概率拓扑地图支持粗粒度到细粒度的规划策略,即利用大语言模型(LLM)进行全局子目标选择,结合视觉引导的局部避障规划;3)集成强化学习控制的步态控制器,使框架具备跨平台部署能力。实验表明,该方案显著提升了语义准确性、规划质量和导航成功率,验证了语义驱动决策相较于几何中心化建图的新范式优势。
链接: https://arxiv.org/abs/2509.20739
作者: Guoyang Zhao,Yudong Li,Weiqing Qi,Kai Zhang,Bonan Liu,Kai Chen,Haoang Li,Jun Ma
机构: The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); Southern University of Science and Technology (南方科技大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Conventional SLAM pipelines for legged robot navigation are fragile under rapid motion, calibration demands, and sensor drift, while offering limited semantic reasoning for task-driven exploration. To deal with these issues, we propose a vision-only, SLAM-free navigation framework that replaces dense geometry with semantic reasoning and lightweight topological representations. A hierarchical vision-language perception module fuses scene-level context with object-level cues for robust semantic inference. And a semantic-probabilistic topological map supports coarse-to-fine planning: LLM-based global reasoning for subgoal selection and vision-based local planning for obstacle avoidance. Integrated with reinforcement-learning locomotion controllers, the framework is deployable across diverse legged robot platforms. Experiments in simulation and real-world settings demonstrate consistent improvements in semantic accuracy, planning quality, and navigation success, while ablation studies further showcase the necessity of both hierarchical perception and fine local planning. This work introduces a new paradigm for SLAM-free, vision-language-driven navigation, shifting robotic exploration from geometry-centric mapping to semantics-driven decision making.
zh
[CV-82] SeamCrafte: Enhancing Mesh Seam Generation for Artist UV Unwrapping via Reinforcement Learning
【速读】:该论文旨在解决3D表面分割中缝合线(seam)放置不当导致的UV参数化畸变严重或碎片过多的问题,这些问题会阻碍纹理合成并破坏艺术家的工作流程。现有方法往往在高畸变与多分散岛屿之间权衡,难以兼顾质量与连通性。解决方案的关键在于提出SeamCrafter,一种基于点云输入的自回归GPT风格缝合线生成器,其核心创新包括:1)采用双分支点云编码器,在预训练阶段解耦并捕捉互补的拓扑与几何线索;2)利用从新型缝合线评估框架中获得的偏好数据集,通过直接偏好优化(Direct Preference Optimization, DPO)进行微调,该框架以UV畸变和碎片化为主要指标提供成对偏好标签,从而引导模型优化。实验表明,SeamCrafter在显著降低畸变和碎片化的同时,保持了拓扑一致性和视觉保真度。
链接: https://arxiv.org/abs/2509.20725
作者: Duoteng Xu,Yuguang Chen,Jing Li,Xinhai Liu,Xueqi Ma,Zhuo Chen,Dongyu Zhang,Chunchao Guo
机构: Tencent Hunyuan(腾讯混元); SYSU(中山大学); SZU(深圳大学); USTC(中国科学技术大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Mesh seams play a pivotal role in partitioning 3D surfaces for UV parametrization and texture mapping. Poorly placed seams often result in severe UV distortion or excessive fragmentation, thereby hindering texture synthesis and disrupting artist workflows. Existing methods frequently trade one failure mode for another-producing either high distortion or many scattered islands. To address this, we introduce SeamCrafter, an autoregressive GPT-style seam generator conditioned on point cloud inputs. SeamCrafter employs a dual-branch point-cloud encoder that disentangles and captures complementary topological and geometric cues during pretraining. To further enhance seam quality, we fine-tune the model using Direct Preference Optimization (DPO) on a preference dataset derived from a novel seam-evaluation framework. This framework assesses seams primarily by UV distortion and fragmentation, and provides pairwise preference labels to guide optimization. Extensive experiments demonstrate that SeamCrafter produces seams with substantially lower distortion and fragmentation than prior approaches, while preserving topological consistency and visual fidelity.
zh
[CV-83] Beyond the Individual: Introducing Group Intention Forecasting with SHOT Dataset
【速读】:该论文旨在解决传统意图识别研究中忽视群体情境下集体意图(collective intention)复杂性的问题,提出了一种新的任务——群体意图预测(Group Intention Forecasting, GIF),即通过分析个体行为及其交互,在群体目标尚未明确之前预测其意图的出现时机。解决方案的关键在于构建了SHOT数据集和GIFT框架:SHOT是首个大规模群体意图预测数据集,具备多主体信息、多视角适应性和多层次意图标注特性;GIFT则通过提取细粒度个体特征并建模动态演化中的群体行为,实现对群体意图涌现的有效预测。
链接: https://arxiv.org/abs/2509.20715
作者: Ruixu Zhang,Yuran Wang,Xinyi Hu,Chaoyu Mai,Wenxuan Liu,Danni Xu,Xian Zhong,Zheng Wang
机构: Wuhan University (武汉大学); Tsinghua University (清华大学); Peking University (北京大学); National University of Singapore (新加坡国立大学); Wuhan University of Technology (武汉理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Intention recognition has traditionally focused on individual intentions, overlooking the complexities of collective intentions in group settings. To address this limitation, we introduce the concept of group intention, which represents shared goals emerging through the actions of multiple individuals, and Group Intention Forecasting (GIF), a novel task that forecasts when group intentions will occur by analyzing individual actions and interactions before the collective goal becomes apparent. To investigate GIF in a specific scenario, we propose SHOT, the first large-scale dataset for GIF, consisting of 1,979 basketball video clips captured from 5 camera views and annotated with 6 types of individual attributes. SHOT is designed with 3 key characteristics: multi-individual information, multi-view adaptability, and multi-level intention, making it well-suited for studying emerging group intentions. Furthermore, we introduce GIFT (Group Intention ForecasTer), a framework that extracts fine-grained individual features and models evolving group dynamics to forecast intention emergence. Experimental results confirm the effectiveness of SHOT and GIFT, establishing a strong foundation for future research in group intention forecasting. The dataset is available at this https URL.
zh
[CV-84] ArtUV: Artist-style UV Unwrapping
【速读】:该论文旨在解决现有UV unwrapping(UV展开)方法在实际应用中面临的效率低、碎片化严重、语义缺失以及UV岛不规则等问题,这些问题限制了其在专业渲染流程中的使用。解决方案的关键在于提出ArtUV,一种端到端的自动化方法,模拟艺术家的UV映射流程:首先通过SeamGPT预测具有语义意义的切割缝(seam),随后利用基于优化的粗略UV与网格输入至Auto-Encoder进行精修,生成符合艺术标准的UV映射结果,从而在保证拓扑结构一致性和语义连贯性的前提下,实现高质量、可直接用于二维编辑的UV地图。
链接: https://arxiv.org/abs/2509.20710
作者: Yuguang Chen,Xinhai Liu,Yang Li,Victor Cheung,Zhuo Chen,Dongyu Zhang,Chunchao Guo
机构: Tencent Hunyuan(腾讯混元); SYSU(中山大学); THU(清华大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:UV unwrapping is an essential task in computer graphics, enabling various visual editing operations in rendering pipelines. However, existing UV unwrapping methods struggle with time-consuming, fragmentation, lack of semanticity, and irregular UV islands, limiting their practical use. An artist-style UV map must not only satisfy fundamental criteria, such as overlap-free mapping and minimal distortion, but also uphold higher-level standards, including clean boundaries, efficient space utilization, and semantic coherence. We introduce ArtUV, a fully automated, end-to-end method for generating artist-style UV unwrapping. We simulates the professional UV mapping process by dividing it into two stages: surface seam prediction and artist-style UV parameterization. In the seam prediction stage, SeamGPT is used to generate semantically meaningful cutting seams. Then, in the parameterization stage, a rough UV obtained from an optimization-based method, along with the mesh, is fed into an Auto-Encoder, which refines it into an artist-style UV map. Our method ensures semantic consistency and preserves topological structure, making the UV map ready for 2D editing. We evaluate ArtUV across multiple benchmarks and show that it serves as a versatile solution, functioning seamlessly as either a plug-in for professional rendering tools or as a standalone system for rapid, high-quality UV generation.
zh
[CV-85] Joint Flow Trajectory Optimization For Feasible Robot Motion Generation from Video Demonstrations
【速读】:该论文旨在解决基于视频演示的机器人操作学习(Learning-from-Demonstration, LfD)中因具身差异(embodiment differences)和关节可行性约束(joint feasibility constraints)导致的挑战,尤其是在从人类视频示范中生成可执行的抓取姿态与物体轨迹时。其解决方案的关键在于提出一种关节流轨迹优化框架(Joint Flow Trajectory Optimization, JFTO),该框架将人类示范视为以物体为中心的引导信号,通过联合优化三个目标实现高效模仿:(i) 选择可行的抓取姿态,(ii) 生成与示范一致的物体运动轨迹,(iii) 确保在机器人运动学范围内无碰撞执行。为捕捉示范的多模态特性,作者将流匹配(flow matching)扩展至SE(3)空间,用于对物体轨迹进行概率建模,从而实现密度感知的模仿学习并避免模式崩溃(mode collapse)。最终,该方法将抓取相似性、轨迹似然性和碰撞惩罚整合为一个统一的可微目标函数,实现在仿真与真实场景中的鲁棒验证。
链接: https://arxiv.org/abs/2509.20703
作者: Xiaoxiang Dong,Matthew Johnson-Roberson,Weiming Zhi
机构: Vanderbilt University (范德比尔特大学); Carnegie Mellon University (卡内基梅隆大学); The University of Sydney (悉尼大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Learning from human video demonstrations offers a scalable alternative to teleoperation or kinesthetic teaching, but poses challenges for robot manipulators due to embodiment differences and joint feasibility constraints. We address this problem by proposing the Joint Flow Trajectory Optimization (JFTO) framework for grasp pose generation and object trajectory imitation under the video-based Learning-from-Demonstration (LfD) paradigm. Rather than directly imitating human hand motions, our method treats demonstrations as object-centric guides, balancing three objectives: (i) selecting a feasible grasp pose, (ii) generating object trajectories consistent with demonstrated motions, and (iii) ensuring collision-free execution within robot kinematics. To capture the multimodal nature of demonstrations, we extend flow matching to \SE(3) for probabilistic modeling of object trajectories, enabling density-aware imitation that avoids mode collapse. The resulting optimization integrates grasp similarity, trajectory likelihood, and collision penalties into a unified differentiable objective. We validate our approach in both simulation and real-world experiments across diverse real-world manipulation tasks.
zh
[CV-86] DENet: Dual-Path Edge Network with Global-Local Attention for Infrared Small Target Detection
【速读】:该论文旨在解决红外小目标检测中因目标缺乏显著纹理和形态特征而导致的易被杂波和噪声淹没的问题,尤其针对现有深度模型在捕捉高分辨率空间细节(用于微小目标)与提取鲁棒语义上下文(用于较大目标)之间存在的固有矛盾,从而导致特征错位和性能下降的挑战。解决方案的关键在于提出一种双路径边缘网络(Dual-Path Edge Network),通过解耦边缘增强与语义建模两个互补路径:第一路径采用双向交互模块(Bidirectional Interaction Module),结合局部自注意力(Local Self-Attention)与全局自注意力(Global Self-Attention),利用Transformer架构整合长程语义关系;第二路径引入多边缘精炼器(Multi-Edge Refiner),基于级联泰勒有限差分算子在多尺度上增强细粒度边缘信息,并辅以注意力驱动的门控机制,实现对不同尺寸目标的精准边缘定位与特征增强,同时有效抑制噪声。
链接: https://arxiv.org/abs/2509.20701
作者: Jiayi Zuo,Songwei Pei,Qian Li
机构: Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Infrared small target detection is crucial for remote sensing applications like disaster warning and maritime surveillance. However, due to the lack of distinctive texture and morphological features, infrared small targets are highly susceptible to blending into cluttered and noisy backgrounds. A fundamental challenge in designing deep models for this task lies in the inherent conflict between capturing high-resolution spatial details for minute targets and extracting robust semantic context for larger targets, often leading to feature misalignment and suboptimal performance. Existing methods often rely on fixed gradient operators or simplistic attention mechanisms, which are inadequate for accurately extracting target edges under low contrast and high noise. In this paper, we propose a novel Dual-Path Edge Network that explicitly addresses this challenge by decoupling edge enhancement and semantic modeling into two complementary processing paths. The first path employs a Bidirectional Interaction Module, which uses both Local Self-Attention and Global Self-Attention to capture multi-scale local and global feature dependencies. The global attention mechanism, based on a Transformer architecture, integrates long-range semantic relationships and contextual information, ensuring robust scene understanding. The second path introduces the Multi-Edge Refiner, which enhances fine-grained edge details using cascaded Taylor finite difference operators at multiple scales. This mathematical approach, along with an attention-driven gating mechanism, enables precise edge localization and feature enhancement for targets of varying sizes, while effectively suppressing noise. Our method provides a promising solution for precise infrared small target detection and localization, combining structural semantics and edge refinement in a unified framework.
zh
[CV-87] RAM-NAS: Resource-aware Multiobjective Neural Architecture Search Method for Robot Vision Tasks IROS2024
【速读】:该论文旨在解决传统神经架构搜索(Neural Architecture Search, NAS)方法在机器人边缘硬件上资源感知不足、预训练超网(supernet)效率低以及模型部署时延迟难以优化的问题。其关键解决方案在于提出一种资源感知的多目标NAS方法——RAM-NAS,通过引入子网互蒸馏(subnets mutual distillation)机制提升超网预训练质量,并结合解耦知识蒸馏(Decoupled Knowledge Distillation, DKD)损失增强logits蒸馏效果;同时利用三种机器人边缘硬件采集的数据训练延迟代理预测器(Latency Surrogate predictors),在搜索阶段实现对实际设备推理延迟的高效估计,从而支持统一的多目标进化搜索,在模型精度与延迟之间实现更优权衡。
链接: https://arxiv.org/abs/2509.20688
作者: Shouren Mao,Minghao Qin,Wei Dong,Huajian Liu,Yongzhuo Gao
机构: Harbin Institute of Technology (哈尔滨工业大学); State Key Laboratory of Robotics and System (机器人学国家重点实验室)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Joint first authors: Shouren Mao and Minghao Qin. Published in IEEE/RSJ IROS 2024. This arXiv version adds a joint first-authorship note to correct an omission in the IEEE Xplore version. No technical changes. Please cite the IEEE version
Abstract:Neural architecture search (NAS) has shown great promise in automatically designing lightweight models. However, conventional approaches are insufficient in training the supernet and pay little attention to actual robot hardware resources. To meet such challenges, we propose RAM-NAS, a resource-aware multi-objective NAS method that focuses on improving the supernet pretrain and resource-awareness on robot hardware devices. We introduce the concept of subnets mutual distillation, which refers to mutually distilling all subnets sampled by the sandwich rule. Additionally, we utilize the Decoupled Knowledge Distillation (DKD) loss to enhance logits distillation performance. To expedite the search process with consideration for hardware resources, we used data from three types of robotic edge hardware to train Latency Surrogate predictors. These predictors facilitated the estimation of hardware inference latency during the search phase, enabling a unified multi-objective evolutionary search to balance model accuracy and latency trade-offs. Our discovered model family, RAM-NAS models, can achieve top-1 accuracy ranging from 76.7% to 81.4% on ImageNet. In addition, the resource-aware multi-objective NAS we employ significantly reduces the model’s inference latency on edge hardware for robots. We conducted experiments on downstream tasks to verify the scalability of our methods. The inference time for detection and segmentation is reduced on all three hardware types compared to MobileNetv3-based methods. Our work fills the gap in NAS for robot hardware resource-aware.
zh
[CV-88] Enhancing Cross-View Geo-Localization Generalization via Global-Local Consistency and Geometric Equivariance
【速读】:该论文旨在解决跨视角地理定位(Cross-view geo-localization, CVGL)中的两个核心挑战:一是如何在无人机(UAV)不同朝向和视场导致的严重外观变化下实现鲁棒性,从而提升跨域泛化能力;二是如何建立可靠的对应关系以同时捕捉全局场景语义与细粒度局部细节。解决方案的关键在于提出一种名为EGS的新框架,其核心创新包括:(1) 设计E(2)-可变形卷积神经网络(E(2)-Steerable CNN)编码器,以提取在旋转和视角变换下保持稳定的特征;(2) 构建包含虚拟超节点(virtual super-node)的图结构,该节点连接所有局部节点,实现全局语义信息的聚合与重分配,从而强制全局-局部一致性,显著提升跨域匹配性能。
链接: https://arxiv.org/abs/2509.20684
作者: Xiaowei Wang,Di Wang,Ke Li,Yifeng Wang,Chengjian Wang,Libin Sun,Zhihong Wu,Yiming Zhang,Quan Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Cross-view geo-localization (CVGL) aims to match images of the same location captured from drastically different viewpoints. Despite recent progress, existing methods still face two key challenges: (1) achieving robustness under severe appearance variations induced by diverse UAV orientations and fields of view, which hinders cross-domain generalization, and (2) establishing reliable correspondences that capture both global scene-level semantics and fine-grained local details. In this paper, we propose EGS, a novel CVGL framework designed to enhance cross-domain generalization. Specifically, we introduce an E(2)-Steerable CNN encoder to extract stable and reliable features under rotation and viewpoint shifts. Furthermore, we construct a graph with a virtual super-node that connects to all local nodes, enabling global semantics to be aggregated and redistributed to local regions, thereby enforcing global-local consistency. Extensive experiments on the University-1652 and SUES-200 benchmarks demonstrate that EGS consistently achieves substantial performance gains and establishes a new state of the art in cross-domain CVGL.
zh
[CV-89] Efficient Construction of Implicit Surface Models From a Single Image for Motion Generation
【速读】:该论文旨在解决从单张或少量图像中高效构建高保真隐式距离表示(如符号距离场,SDF)的问题,传统方法如NeuS等依赖多视角图像和长时间训练,难以满足实时性需求。其解决方案的关键在于提出轻量级框架FINS(Fast Image-to-Neural Surface),通过引入多分辨率哈希网格编码器(multi-resolution hash grid encoder)与轻量化几何和颜色头部结构,并结合近似二阶优化器,实现仅需数秒即可收敛的高效训练,同时利用预训练基础模型估计图像中的几何信息,从而在单一RGB图像条件下完成高质量神经表面重建与SDF场估计。
链接: https://arxiv.org/abs/2509.20681
作者: Wei-Teng Chu,Tianyi Zhang,Matthew Johnson-Roberson,Weiming Zhi
机构: Stanford University (斯坦福大学); Aurora; Vanderbilt University (范德比尔特大学); Carnegie Mellon University (卡内基梅隆大学); The University of Sydney (悉尼大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Implicit representations have been widely applied in robotics for obstacle avoidance and path planning. In this paper, we explore the problem of constructing an implicit distance representation from a single image. Past methods for implicit surface reconstruction, such as \emphNeuS and its variants generally require a large set of multi-view images as input, and require long training times. In this work, we propose Fast Image-to-Neural Surface (FINS), a lightweight framework that can reconstruct high-fidelity surfaces and SDF fields based on a single or a small set of images. FINS integrates a multi-resolution hash grid encoder with lightweight geometry and color heads, making the training via an approximate second-order optimizer highly efficient and capable of converging within a few seconds. Additionally, we achieve the construction of a neural surface requiring only a single RGB image, by leveraging pre-trained foundation models to estimate the geometry inherent in the image. Our experiments demonstrate that under the same conditions, our method outperforms state-of-the-art baselines in both convergence speed and accuracy on surface reconstruction and SDF field estimation. Moreover, we demonstrate the applicability of FINS for robot surface following tasks and show its scalability to a variety of benchmark datasets.
zh
[CV-90] Bispectral OT: Dataset Comparison using Symmetry-Aware Optimal Transport NEURIPS2025
【速读】:该论文旨在解决在对称性丰富的场景下,传统最优传输(Optimal Transport, OT)方法仅依赖原始特征间的成对几何距离进行配准时,容易忽略数据内在一致性结构的问题。其解决方案的关键在于提出双谱最优传输(Bispectral Optimal Transport),通过引入双谱(bispectrum)作为群傅里叶不变量来表征元素,该表示保留了信号的全部结构信息,同时去除由群作用引起的冗余变化,从而实现对称感知的传输规划,显著提升类别保真度和语义对应质量。
链接: https://arxiv.org/abs/2509.20678
作者: Annabel Ma,Kaiying Hou,David Alvarez-Melis,Melanie Weber
机构: Harvard University (哈佛大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: Accepted to NeurIPS 2025 Workshop on Symmetry and Geometry in Neural Representations (NeurReps)
Abstract:Optimal transport (OT) is a widely used technique in machine learning, graphics, and vision that aligns two distributions or datasets using their relative geometry. In symmetry-rich settings, however, OT alignments based solely on pairwise geometric distances between raw features can ignore the intrinsic coherence structure of the data. We introduce Bispectral Optimal Transport, a symmetry-aware extension of discrete OT that compares elements using their representation using the bispectrum, a group Fourier invariant that preserves all signal structure while removing only the variation due to group actions. Empirically, we demonstrate that the transport plans computed with Bispectral OT achieve greater class preservation accuracy than naive feature OT on benchmark datasets transformed with visual symmetries, improving the quality of meaningful correspondences that capture the underlying semantic label structure in the dataset while removing nuisance variation not affecting class or content.
zh
[CV-91] Equi-RO: A 4D mmWave Radar Odometry via Equivariant Networks
【速读】:该论文旨在解决在GPS信号缺失环境中,自动驾驶车辆与机器人对精确位姿估计(odometry estimation)的需求,尤其针对传统LiDAR和相机在极端天气下性能下降的问题。其解决方案的关键在于提出一种基于等变网络(equivariant network)的4D毫米波雷达里程计框架——Equi-RO,该方法通过将多普勒速度预处理为图结构中的不变节点与边特征,并分别使用等变与不变特征处理网络,结合图结构架构增强稀疏雷达数据下的特征聚合能力,从而提升帧间对应关系的准确性。实验表明,该方法在公开数据集上相较最优基线实现了10.7%和20.0%的平移与旋转精度相对提升。
链接: https://arxiv.org/abs/2509.20674
作者: Zeyu Han,Shuocheng Yang,Minghan Zhu,Fang Zhang,Shaobing Xu,Maani Ghaffari,Jianqiang Wang
机构: Tsinghua University (清华大学); University of Michigan (密歇根大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Autonomous vehicles and robots rely on accurate odometry estimation in GPS-denied environments. While LiDARs and cameras struggle under extreme weather, 4D mmWave radar emerges as a robust alternative with all-weather operability and velocity measurement. In this paper, we introduce Equi-RO, an equivariant network-based framework for 4D radar odometry. Our algorithm pre-processes Doppler velocity into invariant node and edge features in the graph, and employs separate networks for equivariant and invariant feature processing. A graph-based architecture enhances feature aggregation in sparse radar data, improving inter-frame correspondence. Experiments on the open-source dataset and self-collected dataset show Equi-RO outperforms state-of-the-art algorithms in accuracy and robustness. Overall, our method achieves 10.7% and 20.0% relative improvements in translation and rotation accuracy, respectively, compared to the best baseline on the open-source dataset.
zh
[CV-92] Recov-Vision: Linking Street View Imagery and Vision-Language Models for Post-Disaster Recovery
【速读】:该论文旨在解决灾后建筑级 occupancy(居住状态)评估难题,特别是在灾害发生后快速、准确判断房屋是否具备居住条件的问题。传统方法依赖航拍影像虽覆盖快但缺乏入口和立面细节,而街景影像虽能捕捉关键信息却难以与建筑地块对齐。其解决方案的核心是提出 FacadeTrack 框架——一个基于街景视频的、语言引导的自动化流程,能够将全景视频精准关联至建筑地块(parcel),校正视角至立面(facade rectification),并提取可解释的属性(如入口堵塞、临时遮盖、局部 debris 等)。该框架支持两种决策策略:单阶段规则驱动模型与两阶段分离感知与保守推理的设计,后者在精度(0.927)、召回率(0.781)和 F-1 分数(0.848)上优于前者(分别为 0.943、0.728、0.822),同时通过中间属性和空间诊断揭示误差来源,实现可审计、可扩展的灾后居住状态评估,适用于地理空间和应急管理体系集成。
链接: https://arxiv.org/abs/2509.20628
作者: Yiming Xiao,Archit Gupta,Miguel Esparza,Yu-Hsuan Ho,Antonia Sebastian,Hannah Weas,Rose Houck,Ali Mostafavi
机构: UrbanResilience.AI Lab, Zachry Department of Civil and Environmental Engineering, Texas A&M University (德州农工大学); Department of Earth, Marine and Environmental Sciences, The University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 10 figures
Abstract:Building-level occupancy after disasters is vital for triage, inspections, utility re-energization, and equitable resource allocation. Overhead imagery provides rapid coverage but often misses facade and access cues that determine habitability, while street-view imagery captures those details but is sparse and difficult to align with parcels. We present FacadeTrack, a street-level, language-guided framework that links panoramic video to parcels, rectifies views to facades, and elicits interpretable attributes (for example, entry blockage, temporary coverings, localized debris) that drive two decision strategies: a transparent one-stage rule and a two-stage design that separates perception from conservative reasoning. Evaluated across two post-Hurricane Helene surveys, the two-stage approach achieves a precision of 0.927, a recall of 0.781, and an F-1 score of 0.848, compared with the one-stage baseline at a precision of 0.943, a recall of 0.728, and an F-1 score of 0.822. Beyond accuracy, intermediate attributes and spatial diagnostics reveal where and why residual errors occur, enabling targeted quality control. The pipeline provides auditable, scalable occupancy assessments suitable for integration into geospatial and emergency-management workflows.
zh
[CV-93] Reflect3r: Single-View 3D Stereo Reconstruction Aided by Mirror Reflections
【速读】:该论文旨在解决从单张图像中实现高效、鲁棒的三维重建问题,尤其针对包含镜面反射的场景。传统多视图立体(Multi-View Stereo, MVS)方法依赖多个视角,而该研究利用镜面反射作为辅助视图,通过设计一个物理上有效的虚拟相机变换,在像素域直接生成虚拟视图,从而在单图内构建等效的多视角几何约束。其解决方案的关键在于:(1)将镜面反射视为可建模的虚拟观测,结合真实成像过程进行显式建模;(2)引入对称感知损失函数(symmetric-aware loss),利用镜面带来的几何对称性优化相机位姿估计;(3)框架天然适用于动态场景,支持逐帧几何恢复。这一方法显著简化了成像流程,并兼容前馈式重建模型,提升了3D重建的通用性和稳定性。
链接: https://arxiv.org/abs/2509.20607
作者: Jing Wu,Zirui Wang,Iro Laina,Victor Adrian Prisacariu
机构: University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Mirror reflections are common in everyday environments and can provide stereo information within a single capture, as the real and reflected virtual views are visible simultaneously. We exploit this property by treating the reflection as an auxiliary view and designing a transformation that constructs a physically valid virtual camera, allowing direct pixel-domain generation of the virtual view while adhering to the real-world imaging process. This enables a multi-view stereo setup from a single image, simplifying the imaging process, making it compatible with powerful feed-forward reconstruction models for generalizable and robust 3D reconstruction. To further exploit the geometric symmetry introduced by mirrors, we propose a symmetric-aware loss to refine pose estimation. Our framework also naturally extends to dynamic scenes, where each frame contains a mirror reflection, enabling efficient per-frame geometry recovery. For quantitative evaluation, we provide a fully customizable synthetic dataset of 16 Blender scenes, each with ground-truth point clouds and camera poses. Extensive experiments on real-world data and synthetic data are conducted to illustrate the effectiveness of our method.
zh
[CV-94] Region-of-Interest Augmentation for Mammography Classification under Patient-Level Cross-Validation
【速读】:该论文旨在解决乳腺癌筛查中基于数字乳腺断层扫描(mammography)图像的深度学习模型因数据集分辨率有限和样本量小而导致性能受限的问题。其关键解决方案是提出一种轻量级的感兴趣区域(region-of-interest, ROI)增强策略:在训练阶段,以一定概率用预计算的、无标签的边界框(bounding-box bank)随机裁剪的ROI替换原始全图,辅以可选的微扰(jitter)提升多样性;该策略仅作用于训练过程,不影响推理效率。实验表明,在Mini-DDSM数据集上,该方法可在不增加标注成本或网络结构复杂度的前提下,小幅提升ROC-AUC指标,验证了数据驱动型ROI增强在资源受限场景下的有效性。
链接: https://arxiv.org/abs/2509.20585
作者: Farbod Bigdeli,Mohsen Mohammadagha,Ali Bigdeli
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 5 pages, 5 figures, 2 tables
Abstract:Breast cancer screening with mammography remains central to early detection and mortality reduction. Deep learning has shown strong potential for automating mammogram interpretation, yet limited-resolution datasets and small sample sizes continue to restrict performance. We revisit the Mini-DDSM dataset (9,684 images; 2,414 patients) and introduce a lightweight region-of-interest (ROI) augmentation strategy. During training, full images are probabilistically replaced with random ROI crops sampled from a precomputed, label-free bounding-box bank, with optional jitter to increase variability. We evaluate under strict patient-level cross-validation and report ROC-AUC, PR-AUC, and training-time efficiency metrics (throughput and GPU memory). Because ROI augmentation is training-only, inference-time cost remains unchanged. On Mini-DDSM, ROI augmentation (best: p_roi = 0.10, alpha = 0.10) yields modest average ROC-AUC gains, with performance varying across folds; PR-AUC is flat to slightly lower. These results demonstrate that simple, data-centric ROI strategies can enhance mammography classification in constrained settings without requiring additional labels or architectural modifications.
zh
[CV-95] A Comparative Benchmark of Real-time Detectors for Blueberry Detection towards Precision Orchard Management
【速读】:该论文旨在解决蓝莓在自然环境中检测困难的问题,主要挑战包括光照变化、遮挡以及因环境因素和成像设备引起的运动模糊。为应对这些问题,研究提出了一种基于深度学习的实时目标检测模型对比基准分析,涵盖YOLO(You Only Look Once)系列(v8-v12)与RT-DETR(Real-Time Detection Transformers)系列(v1-v2)共36种模型变体,并在一个新构建的蓝莓检测数据集上进行评估。该数据集包含661张由智能手机采集的冠层图像,共标注85,879个实例(其中成熟蓝莓36,256个,未成熟蓝莓49,623个),覆盖多样化的光照条件、遮挡程度及果实成熟阶段。关键解决方案在于:首先通过大规模多样化数据集验证模型性能;其次采用无偏均值教师(Unbiased Mean Teacher)半监督学习方法,在2024年获取的1,035张未标注图像上对所有模型进行微调,显著提升检测精度(mAP@50最高达94.8%),并揭示了中等规模模型在准确率与推理速度之间具有最优平衡。
链接: https://arxiv.org/abs/2509.20580
作者: Xinyang Mu,Yuzhen Lu,Boyang Deng
机构: Michigan State University (密歇根州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 6 figures, 4 tables. Abstract abridged due to arXiv’s 1920 character limit
Abstract:Blueberry detection in natural environments remains challenging due to variable lighting, occlusions, and motion blur due to environmental factors and imaging devices. Deep learning-based object detectors promise to address these challenges, but they demand a large-scale, diverse dataset that captures the real-world complexities. Moreover, deploying these models in practical scenarios often requires the right accuracy/speed/memory trade-off in model selection. This study presents a novel comparative benchmark analysis of advanced real-time object detectors, including YOLO (You Only Look Once) (v8-v12) and RT-DETR (Real-Time Detection Transformers) (v1-v2) families, consisting of 36 model variants, evaluated on a newly curated dataset for blueberry detection. This dataset comprises 661 canopy images collected with smartphones during the 2022-2023 seasons, consisting of 85,879 labelled instances (including 36,256 ripe and 49,623 unripe blueberries) across a wide range of lighting conditions, occlusions, and fruit maturity stages. Among the YOLO models, YOLOv12m achieved the best accuracy with a mAP@50 of 93.3%, while RT-DETRv2-X obtained a mAP@50 of 93.6%, the highest among all the RT-DETR variants. The inference time varied with the model scale and complexity, and the mid-sized models appeared to offer a good accuracy-speed balance. To further enhance detection performance, all the models were fine-tuned using Unbiased Mean Teacher-based semi-supervised learning (SSL) on a separate set of 1,035 unlabeled images acquired by a ground-based machine vision platform in 2024. This resulted in accuracy gains ranging from -1.4% to 2.9%, with RT-DETR-v2-X achieving the best mAP@50 of 94.8%. More in-depth research into SSL is needed to better leverage cross-domain unlabeled data. Both the dataset and software programs of this study are made publicly available to support further research.
zh
[CV-96] Large Pre-Trained Models for Bimanual Manipulation in 3D
【速读】:该论文旨在解决双臂机器人操作中感知与决策耦合不足的问题,即如何有效利用视觉信息提升策略的泛化能力和任务执行精度。其解决方案的关键在于将预训练视觉Transformer(Vision Transformer)中的注意力图(attention maps)转化为像素级显著性评分,并将其映射至三维体素网格(voxel grid),从而为基于体素的行为克隆(behavior cloning)策略提供语义增强的特征表示。通过这种方式,模型能够利用自监督学习获得的高层语义信息,显著提升在RLBench双臂基准任务上的性能,平均绝对提升达8.2%,相对增益为21.9%。
链接: https://arxiv.org/abs/2509.20579
作者: Hanna Yurchyk,Wei-Di Chang,Gregory Dudek,David Meger
机构: McGill University (麦吉尔大学); Mila (蒙特利尔学习算法研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Accepted to 2025 IEEE-RAS 24th International Conference on Humanoid Robots
Abstract:We investigate the integration of attention maps from a pre-trained Vision Transformer into voxel representations to enhance bimanual robotic manipulation. Specifically, we extract attention maps from DINOv2, a self-supervised ViT model, and interpret them as pixel-level saliency scores over RGB images. These maps are lifted into a 3D voxel grid, resulting in voxel-level semantic cues that are incorporated into a behavior cloning policy. When integrated into a state-of-the-art voxel-based policy, our attention-guided featurization yields an average absolute improvement of 8.2% and a relative gain of 21.9% across all tasks in the RLBench bimanual benchmark.
zh
[CV-97] Innovative Deep Learning Architecture for Enhanced Altered Fingerprint Recognition
【速读】:该论文旨在解决**篡改指纹识别(Altered Fingerprint Recognition, AFR)**问题,即在边境管控、法证和财政准入等场景中,对手通过故意修改指纹脊线(ridge patterns)以逃避生物特征验证所带来的安全挑战。为应对这一问题,作者提出DeepAFRNet模型,其核心解决方案是基于VGG16主干网络提取高维特征,并采用余弦相似度(cosine similarity)进行嵌入向量匹配。该方法在SOCOFing Real-Altered数据集上按难度分级评估,实现了高达99.54%的准确率,同时强调阈值选择对系统性能的关键影响——当阈值从0.92放宽至0.72时,准确率急剧下降至不足30%,凸显了严格阈值设定在实际部署中的必要性。该研究使用真实篡改样本并提供分层级指标,弥补了以往依赖合成篡改或有限验证协议的局限,表明模型具备面向现实世界应用的鲁棒性和安全性潜力。
链接: https://arxiv.org/abs/2509.20537
作者: Dana A Abdullah,Dana Rasul Hamad,Bishar Rasheed Ibrahim,Sirwan Abdulwahid Aula,Aso Khaleel Ameen,Sabat Salih Hamadamin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:
Abstract:Altered fingerprint recognition (AFR) is challenging for biometric verification in applications such as border control, forensics, and fiscal admission. Adversaries can deliberately modify ridge patterns to evade detection, so robust recognition of altered prints is essential. We present DeepAFRNet, a deep learning recognition model that matches and recognizes distorted fingerprint samples. The approach uses a VGG16 backbone to extract high-dimensional features and cosine similarity to compare embeddings. We evaluate on the SOCOFing Real-Altered subset with three difficulty levels (Easy, Medium, Hard). With strict thresholds, DeepAFRNet achieves accuracies of 96.7 percent, 98.76 percent, and 99.54 percent for the three levels. A threshold-sensitivity study shows that relaxing the threshold from 0.92 to 0.72 sharply degrades accuracy to 7.86 percent, 27.05 percent, and 29.51 percent, underscoring the importance of threshold selection in biometric systems. By using real altered samples and reporting per-level metrics, DeepAFRNet addresses limitations of prior work based on synthetic alterations or limited verification protocols, and indicates readiness for real-world deployments where both security and recognition resilience are critical.
zh
[CV-98] InstructVTON: Optimal Auto-Masking and Natural-Language-Guided Interactive Style Control for Inpainting-Based Virtual Try-On CVPR2025
【速读】:该论文旨在解决传统虚拟试衣(Virtual Try-On, VTON)模型在复杂风格控制方面存在的局限性,尤其是基于二值掩码(binary mask)的生成方式难以实现精细且多样化的服饰样式调整,例如“袖子卷起”等场景无法通过固定掩码有效建模。解决方案的关键在于引入视觉语言模型(Vision Language Models, VLMs)与图像分割模型,自动根据用户提供的图像和自然语言风格指令生成动态掩码,从而将虚拟试衣任务转化为图像引导的图像修复(image-guided inpainting)问题,显著提升了生成灵活性与交互性,并兼容现有VTON模型以实现更优的风格控制效果。
链接: https://arxiv.org/abs/2509.20524
作者: Julien Han,Shuwen Qiu,Qi Li,Xingzi Xu,Mehmet Saygin Seyfioglu,Kavosh Asadi,Karim Bouyarmane
机构: Amazon(亚马逊); University of California, Los Angeles (加州大学洛杉矶分校); Duke University (杜克大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Submitted to CVPR 2025 and Published at CVPR 2025 AI for Content Creation workshop
Abstract:We present InstructVTON, an instruction-following interactive virtual try-on system that allows fine-grained and complex styling control of the resulting generation, guided by natural language, on single or multiple garments. A computationally efficient and scalable formulation of virtual try-on formulates the problem as an image-guided or image-conditioned inpainting task. These inpainting-based virtual try-on models commonly use a binary mask to control the generation layout. Producing a mask that yields desirable result is difficult, requires background knowledge, might be model dependent, and in some cases impossible with the masking-based approach (e.g. trying on a long-sleeve shirt with “sleeves rolled up” styling on a person wearing long-sleeve shirt with sleeves down, where the mask will necessarily cover the entire sleeve). InstructVTON leverages Vision Language Models (VLMs) and image segmentation models for automated binary mask generation. These masks are generated based on user-provided images and free-text style instructions. InstructVTON simplifies the end-user experience by removing the necessity of a precisely drawn mask, and by automating execution of multiple rounds of image generation for try-on scenarios that cannot be achieved with masking-based virtual try-on models alone. We show that InstructVTON is interoperable with existing virtual try-on models to achieve state-of-the-art results with styling control.
zh
[CV-99] Beyond Visual Similarity: Rule-Guided Multimodal Clustering with explicit domain rules
【速读】:该论文旨在解决传统聚类方法仅依赖输入数据相似性而无法有效捕捉结构或语义约束的问题,尤其在航空和汽车等复杂领域中,这种局限性导致聚类结果缺乏操作意义和可解释性。其解决方案的关键在于提出了一种领域感知规则触发变分自编码器(Domain Aware Rule Triggered Variational Autoencoder, DARTVAE),通过将领域特定规则嵌入到统一的潜在空间中,并以规则一致性与违规惩罚项的形式显式引入损失函数,使规则成为第一类学习信号而非事后过滤器。DARTVAE利用大语言模型(LLM)生成规则并构建知识图谱,结合重建误差、KL散度、规则一致性与违规惩罚共同优化表示学习,从而在保持传统聚类指标提升的同时,显著增强聚类结果的语义合理性和可解释性。
链接: https://arxiv.org/abs/2509.20501
作者: Kishor Datta Gupta,Mohd Ariful Haque,Marufa Kamal,Ahmed Rafi Hasan,Md. Mahfuzur Rahman,Roy George
机构: Clark Atlanta University (克拉克亚特兰大大学); BRAC University (BRAC大学); United International University (联合国际大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 9 figures
Abstract:Traditional clustering techniques often rely solely on similarity in the input data, limiting their ability to capture structural or semantic constraints that are critical in many domains. We introduce the Domain Aware Rule Triggered Variational Autoencoder (DARTVAE), a rule guided multimodal clustering framework that incorporates domain specific constraints directly into the representation learning process. DARTVAE extends the VAE architecture by embedding explicit rules, semantic representations, and data driven features into a unified latent space, while enforcing constraint compliance through rule consistency and violation penalties in the loss function. Unlike conventional clustering methods that rely only on visual similarity or apply rules as post hoc filters, DARTVAE treats rules as first class learning signals. The rules are generated by LLMs, structured into knowledge graphs, and enforced through a loss function combining reconstruction, KL divergence, consistency, and violation penalties. Experiments on aircraft and automotive datasets demonstrate that rule guided clustering produces more operationally meaningful and interpretable clusters for example, isolating UAVs, unifying stealth aircraft, or separating SUVs from sedans while improving traditional clustering metrics. However, the framework faces challenges: LLM generated rules may hallucinate or conflict, excessive rules risk overfitting, and scaling to complex domains increases computational and consistency difficulties. By combining rule encodings with learned representations, DARTVAE achieves more meaningful and consistent clustering outcomes than purely data driven models, highlighting the utility of constraint guided multimodal clustering for complex, knowledge intensive settings.
zh
[CV-100] Data-Efficient Stream-Based Active Distillation for Scalable Edge Model Deployment ICIP2025
【速读】:该论文旨在解决边缘计算场景下模型更新效率与传输成本之间的矛盾问题,即如何在有限带宽和计算资源条件下,通过高效选择训练数据来提升边缘设备上小模型的性能。其解决方案的关键在于采用一种基于高置信度的流式策略(high-confidence stream-based strategy)结合多样性驱动的数据筛选方法,能够在相似训练迭代次数下显著降低数据查询量,同时获得高质量的模型表现。
链接: https://arxiv.org/abs/2509.20484
作者: Dani Manjah,Tim Bary,Benoît Gérin,Benoît Macq,Christophe de Vleeschouwer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 3 figures, 2 algorithms, presented at SEEDS Workshop (ICIP 2025)
Abstract:Edge camera-based systems are continuously expanding, facing ever-evolving environments that require regular model updates. In practice, complex teacher models are run on a central server to annotate data, which is then used to train smaller models tailored to the edge devices with limited computational power. This work explores how to select the most useful images for training to maximize model quality while keeping transmission costs low. Our work shows that, for a similar training load (i.e., iterations), a high-confidence stream-based strategy coupled with a diversity-based approach produces a high-quality model with minimal dataset queries.
zh
[CV-101] Shared Neural Space: Unified Precomputed Feature Encoding for Multi-Task and Cross Domain Vision
【速读】:该论文旨在解决当前多数AI模型在图像与视觉任务中因针对特定高精度任务定制而导致的效率低下问题,尤其在涉及一系列模块化任务时,每个任务需映射到不同的潜在域(latent domain),造成资源冗余和泛化能力不足。解决方案的关键在于提出一个通用神经空间(Universal Neural Space, NS),通过编码器-解码器框架预先计算跨视觉与成像任务的特征表示;其中编码器学习具备变换感知能力且可迁移的通用表征,使多个下游AI模块能够共享同一特征空间,从而减少冗余、提升跨域漂移下的泛化性能,并为高效多任务视觉流水线奠定基础。此外,该方案采用轻量级CNN骨干网络而非大型Transformer架构,增强了硬件兼容性与部署灵活性。
链接: https://arxiv.org/abs/2509.20481
作者: Jing Li,Oskar Bartosz,Chengyu Wang,Michal Wnuczynski,Dilshan Godaliyadda,Michael Polley
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The majority of AI models in imaging and vision are customized to perform on specific high-precision task. However, this strategy is inefficient for applications with a series of modular tasks, since each requires a mapping into a disparate latent domain. To address this inefficiency, we proposed a universal Neural Space (NS), where an encoder-decoder framework pre-computes features across vision and imaging tasks. Our encoder learns transformation aware, generalizable representations, which enable multiple downstream AI modules to share the same feature space. This architecture reduces redundancy, improves generalization across domain shift, and establishes a foundation for effecient multi-task vision pipelines. Furthermore, as opposed to larger transformer backbones, our backbone is lightweight and CNN-based, allowing for wider across hardware. We furthur demonstrate that imaging and vision modules, such as demosaicing, denoising, depth estimation and semantic segmentation can be performed efficiently in the NS.
zh
[CV-102] Are Foundation Models Ready for Industrial Defect Recognition? A Reality Check on Real-World Data
【速读】:该论文试图解决工业制造中自动化质量检测的效率问题,即如何在不依赖大量标注数据的前提下,实现对多种产品图像的通用异常检测。传统监督式人工智能模型需为每种产品单独训练并依赖人工标注数据,导致部署成本高、扩展性差。解决方案的关键在于利用基础模型(Foundation Models, FMs)的零样本(zero-shot)泛化能力,通过简单的文本提示(text prompt)描述异常特征,即可在不同产品和场景下直接应用同一模型进行检测,从而显著降低模型开发与实施中的标注和调优成本。
链接: https://arxiv.org/abs/2509.20479
作者: Simon Baeuerle,Pratik Khanna,Nils Friederich,Angelo Jovin Yamachui Sitcheu,Damir Shakirov,Andreas Steimer,Ralf Mikut
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Foundation Models (FMs) have shown impressive performance on various text and image processing tasks. They can generalize across domains and datasets in a zero-shot setting. This could make them suitable for automated quality inspection during series manufacturing, where various types of images are being evaluated for many different products. Replacing tedious labeling tasks with a simple text prompt to describe anomalies and utilizing the same models across many products would save significant efforts during model setup and implementation. This is a strong advantage over supervised Artificial Intelligence (AI) models, which are trained for individual applications and require labeled training data. We test multiple recent FMs on both custom real-world industrial image data and public image data. We show that all of those models fail on our real-world data, while the very same models perform well on public benchmark datasets.
zh
[CV-103] A Contrastive Learning Framework for Breast Cancer Detection
【速读】:该论文旨在解决乳腺癌(breast cancer)早期检测中因标注数据有限导致深度学习模型精度不足的问题。其解决方案的关键在于引入对比学习(Contrastive Learning, CL)框架,通过在大量未标注的乳腺X线摄影图像(mammogram)上进行半监督训练,利用相似性指标增强特征表示能力,并结合多种数据增强和变换策略提升模型性能;最终在小规模标注数据集上微调后,实现了96.7%的检测准确率,在INbreast和MIAS基准数据集上超越了现有最先进方法。
链接: https://arxiv.org/abs/2509.20474
作者: Samia Saeed,Khuram Naveed
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Breast cancer, the second leading cause of cancer-related deaths globally, accounts for a quarter of all cancer cases [1]. To lower this death rate, it is crucial to detect tumors early, as early-stage detection significantly improves treatment outcomes. Advances in non-invasive imaging techniques have made early detection possible through computer-aided detection (CAD) systems which rely on traditional image analysis to identify malignancies. However, there is a growing shift towards deep learning methods due to their superior effectiveness. Despite their potential, deep learning methods often struggle with accuracy due to the limited availability of large-labeled datasets for training. To address this issue, our study introduces a Contrastive Learning (CL) framework, which excels with smaller labeled datasets. In this regard, we train Resnet-50 in semi supervised CL approach using similarity index on a large amount of unlabeled mammogram data. In this regard, we use various augmentation and transformations which help improve the performance of our approach. Finally, we tune our model on a small set of labelled data that outperforms the existing state of the art. Specifically, we observed a 96.7% accuracy in detecting breast cancer on benchmark datasets INbreast and MIAS.
zh
[CV-104] Seedream 4.0: Toward Next-generation Multimodal Image Generation
【速读】:该论文旨在解决传统文本到图像(text-to-image, T2I)生成系统在多模态任务中灵活性不足、推理效率低以及难以支持复杂编辑和多图参考等挑战,从而推动生成式AI在创意与专业场景中的深度应用。其核心解决方案在于构建一个统一框架——Seedream 4.0,通过引入高效扩散Transformer(diffusion transformer)与强大变分自编码器(VAE),显著减少图像token数量以提升训练效率,并实现原生高分辨率(如1K–4K)图像快速生成;同时结合大规模多样化数据预训练与多模态后训练策略(融合微调后的视觉语言模型,VLM),使模型能联合优化T2I合成与图像编辑任务;此外,采用对抗蒸馏、分布匹配、量化及推测解码等技术加速推理,在无需LLM/VLM作为前置模型的情况下仍可达到1.8秒内生成2K图像的性能,从而将传统T2I系统拓展为具备交互性与多维创作能力的先进生成工具。
链接: https://arxiv.org/abs/2509.20427
作者: Team Seedream,Yunpeng Chen,Yu Gao,Lixue Gong,Meng Guo,Qiushan Guo,Zhiyao Guo,Xiaoxia Hou,Weilin Huang,Yixuan Huang,Xiaowen Jian,Huafeng Kuang,Zhichao Lai,Fanshi Li,Liang Li,Xiaochen Lian,Chao Liao,Liyang Liu,Wei Liu,Yanzuo Lu,Zhengxiong Luo,Tongtong Ou,Guang Shi,Yichun Shi,Shiqi Sun,Yu Tian,Zhi Tian,Peng Wang,Rui Wang,Xun Wang,Ye Wang,Guofeng Wu,Jie Wu,Wenxu Wu,Yonghui Wu,Xin Xia,Xuefeng Xiao,Shuang Xu,Xin Yan,Ceyuan Yang,Jianchao Yang,Zhonghua Zhai,Chenlin Zhang,Heng Zhang,Qi Zhang,Xinyu Zhang,Yuwei Zhang,Shijia Zhao,Wenliang Zhao,Wenjia Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Seedream 4.0 Technical Report
Abstract:We introduce Seedream 4.0, an efficient and high-performance multimodal image generation system that unifies text-to-image (T2I) synthesis, image editing, and multi-image composition within a single framework. We develop a highly efficient diffusion transformer with a powerful VAE which also can reduce the number of image tokens considerably. This allows for efficient training of our model, and enables it to fast generate native high-resolution images (e.g., 1K-4K). Seedream 4.0 is pretrained on billions of text-image pairs spanning diverse taxonomies and knowledge-centric concepts. Comprehensive data collection across hundreds of vertical scenarios, coupled with optimized strategies, ensures stable and large-scale training, with strong generalization. By incorporating a carefully fine-tuned VLM model, we perform multi-modal post-training for training both T2I and image editing tasks jointly. For inference acceleration, we integrate adversarial distillation, distribution matching, and quantization, as well as speculative decoding. It achieves an inference time of up to 1.8 seconds for generating a 2K image (without a LLM/VLM as PE model). Comprehensive evaluations reveal that Seedream 4.0 can achieve state-of-the-art results on both T2I and multimodal image editing. In particular, it demonstrates exceptional multimodal capabilities in complex tasks, including precise image editing and in-context reasoning, and also allows for multi-image reference, and can generate multiple output images. This extends traditional T2I systems into an more interactive and multidimensional creative tool, pushing the boundary of generative AI for both creativity and professional applications. Seedream 4.0 is now accessible on this https URL.
zh
[CV-105] Quasi-Synthetic Riemannian Data Generation for Writer-Independent Offline Signature Verification
【速读】:该论文旨在解决离线手写签名验证(offline handwritten signature verification)在写作者无关(writer-independent)场景下的泛化难题,即模型需在未见过的个体上保持高准确性。其解决方案的关键在于提出了一种准合成数据生成框架,利用对称正定矩阵(Symmetric Positive Definite, SPD)空间的黎曼几何特性:以少量真实签名样本为种子,在SPD流形上构建黎曼高斯混合模型,从中提取黎曼中心作为合成写作者,并通过各中心的黎曼高斯采样生成正负样本群体,进而基于这些合成数据进行度量学习(metric learning),最终在真实签名数据集上实现低错误率的验证性能。
链接: https://arxiv.org/abs/2509.20420
作者: Elias N. Zois,Moises Diaz,Salem Said,Miguel A. Ferrer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 3 figures
Abstract:Offline handwritten signature verification remains a challenging task, particularly in writer-independent settings where models must generalize across unseen individuals. Recent developments have highlighted the advantage of geometrically inspired representations, such as covariance descriptors on Riemannian manifolds. However, past or present, handcrafted or data-driven methods usually depend on real-world signature datasets for classifier training. We introduce a quasi-synthetic data generation framework leveraging the Riemannian geometry of Symmetric Positive Definite matrices (SPD). A small set of genuine samples in the SPD space is the seed to a Riemannian Gaussian Mixture which identifies Riemannian centers as synthetic writers and variances as their properties. Riemannian Gaussian sampling on each center generates positive as well as negative synthetic SPD populations. A metric learning framework utilizes pairs of similar and dissimilar SPD points, subsequently testing it over on real-world datasets. Experiments conducted on two popular signature datasets, encompassing Western and Asian writing styles, demonstrate the efficacy of the proposed approach under both intra- and cross- dataset evaluation protocols. The results indicate that our quasi-synthetic approach achieves low error rates, highlighting the potential of generating synthetic data in Riemannian spaces for writer-independent signature verification systems.
zh
[CV-106] SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent NEURIPS2025
【速读】:该论文旨在解决当前室内场景合成方法在物理合理性、对象级细节丰富度以及与复杂用户指令对齐方面的局限性,这些问题限制了其在具身智能(Embodied AI)中的应用。解决方案的关键在于提出一种名为SceneWeaver的反思型代理框架(reflective agentic framework),通过工具驱动的迭代优化机制统一多种场景生成范式;其核心是基于语言模型的规划器(language model-based planner)动态选择可扩展的场景生成工具集,并结合对物理合理性、视觉真实性和语义一致性的自我评估,实现“推理-执行-反思”的闭环迭代过程,从而逐步修正语义不一致并提升环境质量,显著优于现有方法并在多样化指令下展现出良好泛化能力。
链接: https://arxiv.org/abs/2509.20414
作者: Yandan Yang,Baoxiong Jia,Shujie Zhang,Siyuan Huang
机构: State Key Laboratory of General Artificial Intelligence, BIGAI; Tsinghua University
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Accepted by NeurIPS 2025, 26 pages
Abstract:Indoor scene synthesis has become increasingly important with the rise of Embodied AI, which requires 3D environments that are not only visually realistic but also physically plausible and functionally diverse. While recent approaches have advanced visual fidelity, they often remain constrained to fixed scene categories, lack sufficient object-level detail and physical consistency, and struggle to align with complex user instructions. In this work, we present SceneWeaver, a reflective agentic framework that unifies diverse scene synthesis paradigms through tool-based iterative refinement. At its core, SceneWeaver employs a language model-based planner to select from a suite of extensible scene generation tools, ranging from data-driven generative models to visual- and LLM-based methods, guided by self-evaluation of physical plausibility, visual realism, and semantic alignment with user input. This closed-loop reason-act-reflect design enables the agent to identify semantic inconsistencies, invoke targeted tools, and update the environment over successive iterations. Extensive experiments on both common and open-vocabulary room types demonstrate that SceneWeaver not only outperforms prior methods on physical, visual, and semantic metrics, but also generalizes effectively to complex scenes with diverse instructions, marking a step toward general-purpose 3D environment generation. Project website: this https URL.
zh
[CV-107] In the Picture: Medical Imaging Datasets Artifacts and their Living Review
【速读】:该论文旨在解决医学影像研究中数据集(dataset)相关问题被忽视所带来的局限性,如标签质量不佳、模型学习到的“捷径”(shortcut)、元数据缺失等,这些问题会削弱算法的泛化能力并可能对患者结果产生负面影响。现有文献综述多聚焦于机器学习方法或特定应用的数据集,且均为静态发布,无法反映数据集后续被发现的新研究成果(即“研究产物”(research artifacts),如偏倚、新标注等)。为应对这一挑战,论文提出一种“活体综述”(living review)框架,其关键在于持续追踪公共数据集及其关联的研究产物,并构建一个用于监控数据文档产物的系统框架和基于SQL的数据库,以可视化研究产物与数据集之间的引用关系,从而动态更新知识、提升数据集生命周期管理的透明度与科学性。
链接: https://arxiv.org/abs/2501.10727
作者: Amelia Jiménez-Sánchez,Natalia-Rozalia Avlona,Sarah de Boer,Víctor M. Campello,Aasa Feragen,Enzo Ferrante,Melanie Ganz,Judy Wawira Gichoya,Camila González,Steff Groefsema,Alessa Hering,Adam Hulman,Leo Joskowicz,Dovile Juodelyte,Melih Kandemir,Thijs Kooi,Jorge del Pozo Lérida,Livie Yumeng Li,Andre Pacheco,Tim Rädsch,Mauricio Reyes,Théo Sourget,Bram van Ginneken,David Wen,Nina Weng,Jack Junchi Xu,Hubert Dariusz Zając,Maria A. Zuluaga,Veronika Cheplygina
机构: ITUDenmark; KUDenmark; RadboudumcNetherlands; UBSpain; DTUDenmark; CONICET & UBAArgentina; KU & RigshospitaletDenmark; Emory UniversityUSA; Stanford UniversityUSA; RUGNetherlands; RadboudumcNetherlands; AUH & AUDenmark; HUJIIsrael; ITUDenmark; SDUDenmark; LunitSouth Korea; ITU & Cerebriu A/SDenmark; AUH & AUDenmark; UFESBrazil; DKFZ & UHEIGermany; UniBESwitzerland; ITUDenmark; Radboudumc & Plain MedicalNetherlands; Oxford University HospitalsUK; DTUDenmark; Copenhagen University Hospital & RAITDenmark; KUDenmark; EURECOMFrance; ITUDenmark
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Digital Libraries (cs.DL); Image and Video Processing (eess.IV)
备注: ACM Conference on Fairness, Accountability, and Transparency - FAccT 2025
Abstract:Datasets play a critical role in medical imaging research, yet issues such as label quality, shortcuts, and metadata are often overlooked. This lack of attention may harm the generalizability of algorithms and, consequently, negatively impact patient outcomes. While existing medical imaging literature reviews mostly focus on machine learning (ML) methods, with only a few focusing on datasets for specific applications, these reviews remain static – they are published once and not updated thereafter. This fails to account for emerging evidence, such as biases, shortcuts, and additional annotations that other researchers may contribute after the dataset is published. We refer to these newly discovered findings of datasets as research artifacts. To address this gap, we propose a living review that continuously tracks public datasets and their associated research artifacts across multiple medical imaging applications. Our approach includes a framework for the living review to monitor data documentation artifacts, and an SQL database to visualize the citation relationships between research artifact and dataset. Lastly, we discuss key considerations for creating medical imaging datasets, review best practices for data annotation, discuss the significance of shortcuts and demographic diversity, and emphasize the importance of managing datasets throughout their entire lifecycle. Our demo is publicly available at this http URL.
zh
[CV-108] BlockFUL: Enabling Unlearning in Blockchained Federated Learning
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中因模型迭代演化而产生的复杂继承关系导致的“遗忘”(unlearning)难题,尤其是在引入区块链以保障FL过程完整性与可追溯性的场景下,如何高效删除被遗忘数据并同步更新多个相互关联的模型和区块链记录。解决方案的关键在于提出一种双链结构的区块链联邦遗忘框架(Blockchained Federated Unlearning, BlockFUL),其核心创新包括:1)设计平行与串行两种遗忘范式,分别基于梯度上升和重训练方法实现多继承模型的高效遗忘;2)通过优化共识机制降低计算开销,从而显著减少数据依赖性和操作复杂度,提升整体遗忘性能。
链接: https://arxiv.org/abs/2402.16294
作者: Xiao Liu,Mingyuan Li,Xu Wang,Guangsheng Yu,Wei Ni,Lixiang Li,Haipeng Peng,Renping Liu
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Data61 CSIRO (数据61 CSIRO); Global Big Data Technologies Centre, University of Technology Sydney (悉尼科技大学全球大数据技术中心)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Unlearning in Federated Learning (FL) presents significant challenges, as models grow and evolve with complex inheritance relationships. This complexity is amplified when blockchain is employed to ensure the integrity and traceability of FL, where the need to edit multiple interlinked blockchain records and update all inherited models complicates the this http URL this paper, we introduce Blockchained Federated Unlearning (BlockFUL), a novel framework with a dual-chain structure comprising a live chain and an archive chain for enabling unlearning capabilities within Blockchained FL. BlockFUL introduces two new unlearning paradigms, i.e., parallel and sequential paradigms, which can be effectively implemented through gradient-ascent-based and re-training-based unlearning methods. These methods enhance the unlearning process across multiple inherited models by enabling efficient consensus operations and reducing computational costs. Our extensive experiments validate that these methods effectively reduce data dependency and operational overhead, thereby boosting the overall performance of unlearning inherited models within BlockFUL on CIFAR-10 and Fashion-MNIST datasets using AlexNet, ResNet18, and MobileNetV2 models.
zh
[CV-109] Copycats: the many lives of a publicly available medical imaging dataset NEURIPS2024
【速读】:该论文试图解决当前医学影像(Medical Imaging, MI)数据集在社区贡献平台(Community-Contributed Platforms, CCPs)上存在质量低下、文档不全、维护缺失等问题,这些问题可能对医疗人工智能算法的准确性、鲁棒性和公平性产生负面影响。解决方案的关键在于系统性地分析CCPs中MI数据集的共享、文档化与维护现状,识别出如许可证模糊、缺乏持久标识符、存储不稳定、元数据缺失及重复数据等核心缺陷,并提出需遵循推荐的数据管理实践以实现负责任的数据整理与医疗AI模型开发。
链接: https://arxiv.org/abs/2402.06353
作者: Amelia Jiménez-Sánchez,Natalia-Rozalia Avlona,Dovile Juodelyte,Théo Sourget,Caroline Vang-Larsen,Anna Rogers,Hubert Dariusz Zając,Veronika Cheplygina
机构: IT University of Copenhagen (哥本哈根信息技术大学); University of Copenhagen (哥本哈根大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Digital Libraries (cs.DL); Machine Learning (cs.LG)
备注: NeurIPS 2024 Track on Datasets and Benchmarks. Please note that v1 has a different title
Abstract:Medical Imaging (MI) datasets are fundamental to artificial intelligence in healthcare. The accuracy, robustness, and fairness of diagnostic algorithms depend on the data (and its quality) used to train and evaluate the models. MI datasets used to be proprietary, but have become increasingly available to the public, including on community-contributed platforms (CCPs) like Kaggle or HuggingFace. While open data is important to enhance the redistribution of data’s public value, we find that the current CCP governance model fails to uphold the quality needed and recommended practices for sharing, documenting, and evaluating datasets. In this paper, we conduct an analysis of publicly available machine learning datasets on CCPs, discussing datasets’ context, and identifying limitations and gaps in the current CCP landscape. We highlight differences between MI and computer vision datasets, particularly in the potentially harmful downstream effects from poor adoption of recommended dataset management practices. We compare the analyzed datasets across several dimensions, including data sharing, data documentation, and maintenance. We find vague licenses, lack of persistent identifiers and storage, duplicates, and missing metadata, with differences between the platforms. Our research contributes to efforts in responsible data curation and AI algorithms for healthcare.
zh
[CV-110] Optimal Transport Based Hyperspectral Unmixing for Highly Mixed Observations
【速读】:该论文旨在解决高混合度数据下的盲高光谱解混(blind hyperspectral unmixing)问题,即在未知端元(endmember)和丰度分布的情况下,准确估计高光谱图像中各像素的成分比例。其解决方案的关键在于引入最优传输(optimal transport, OT)理论,通过构建一个基于OT的正则化项来约束估计的丰度矩阵分布,使其更贴近预设的目标Dirichlet分布,从而提升端元提取的准确性与鲁棒性,尤其在高度混合的数据场景下表现优越。
链接: https://arxiv.org/abs/2509.20417
作者: D. Doutsas,B. Figliuzzi
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
备注:
Abstract:We propose a novel approach based on optimal transport (OT) for tackling the problem of highly mixed data in blind hyperspectral unmixing. Our method constrains the distribution of the estimated abundance matrix to resemble a targeted Dirichlet distribution more closely. The novelty lies in using OT to measure the discrepancy between the targeted and true abundance distributions, which we incorporate as a regularization term in our optimization problem. We demonstrate the efficiency of our method through a case study involving an unsupervised deep learning approach. Our experiments show that the proposed approach allows for a better estimation of the endmembers in the presence of highly mixed data, while displaying robustness to the choice of target abundance distribution.
zh
人工智能
[AI-0] SAGE: A Realistic Benchmark for Semantic Understanding NEURIPS2025
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在传统基准测试中表现优异,但缺乏对语义理解深度和鲁棒性进行系统评估的问题。为应对这一挑战,作者提出了SAGE(Semantic Alignment Generalization Evaluation)基准,其关键在于通过对抗性条件、噪声变换和细粒度人类判断任务,在30多个数据集上综合评估嵌入模型(embedding models)与相似度度量方法(similarity metrics)在五个维度的表现:人类偏好对齐、变换鲁棒性、信息敏感性、聚类性能和检索鲁棒性。该方案不仅揭示了现有模型在不同指标间的显著性能差距,还暴露了诸如“高聚类性能伴随极端脆弱性”等关键权衡关系,从而为实际部署中的模型选择提供了更贴近现实的评估依据。
链接: https://arxiv.org/abs/2509.21310
作者: Samarth Goel,Reagan J. Lee,Kannan Ramchandran
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling
Abstract:As large language models (LLMs) achieve strong performance on traditional benchmarks, there is an urgent need for more challenging evaluation frameworks that probe deeper aspects of semantic understanding. We introduce SAGE (Semantic Alignment Generalization Evaluation), a rigorous benchmark designed to assess both embedding models and similarity metrics across five categories: Human Preference Alignment, Transformation Robustness, Information Sensitivity, Clustering Performance, and Retrieval Robustness. Unlike existing benchmarks that focus on isolated capabilities, SAGE evaluates semantic understanding through adversarial conditions, noisy transformations, and nuanced human judgment tasks across 30+ datasets. Our comprehensive evaluation of 9 embedding models and classical metrics reveals significant performance gaps, with no single approach excelling across all dimensions. For instance, while state-of-the-art embedding models like OpenAI’s text-embedding-3-large dominate in aligning with human preferences (0.682 vs. 0.591 for the best classical metric), they are significantly outperformed by classical metrics on information sensitivity tasks, where Jaccard Similarity achieves a score of 0.905 compared to the top embedding score of 0.794. SAGE further uncovers critical trade-offs: OpenAI’s text-embedding-3-small achieves the highest clustering performance (0.483) but demonstrates extreme brittleness with the lowest robustness score (0.011). SAGE exposes critical limitations in current semantic understanding capabilities and provides a more realistic assessment of model robustness for real-world deployment.
zh
[AI-1] No Prior No Leakage: Revisiting Reconstruction Attacks in Trained Neural Networks
【速读】:该论文旨在解决神经网络对训练数据的记忆问题所引发的隐私泄露风险,特别是针对从模型参数中重构训练数据的攻击方法的可靠性与理论基础不足的问题。其解决方案的关键在于:通过理论分析揭示现有重建方法的根本局限性——在未引入数据先验知识的情况下,存在无穷多与真实训练集距离任意远的替代解,从而证明此类攻击本质上不可靠;同时实证发现训练样本的精确复制仅偶然发生,并进一步指出训练更充分、隐式满足边界最大化偏好的模型反而更不易遭受重建攻击,这为隐私保护与良好泛化能力之间的权衡提供了新的理论依据和实践方向。
链接: https://arxiv.org/abs/2509.21296
作者: Yehonatan Refael,Guy Smorodinsky,Ofir Lindenbaum,Itay Safran
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:The memorization of training data by neural networks raises pressing concerns for privacy and security. Recent work has shown that, under certain conditions, portions of the training set can be reconstructed directly from model parameters. Some of these methods exploit implicit bias toward margin maximization, suggesting that properties often regarded as beneficial for generalization may actually compromise privacy. Yet despite striking empirical demonstrations, the reliability of these attacks remains poorly understood and lacks a solid theoretical foundation. In this work, we take a complementary perspective: rather than designing stronger attacks, we analyze the inherent weaknesses and limitations of existing reconstruction methods and identify conditions under which they fail. We rigorously prove that, without incorporating prior knowledge about the data, there exist infinitely many alternative solutions that may lie arbitrarily far from the true training set, rendering reconstruction fundamentally unreliable. Empirically, we further demonstrate that exact duplication of training examples occurs only by chance. Our results refine the theoretical understanding of when training set leakage is possible and offer new insights into mitigating reconstruction attacks. Remarkably, we demonstrate that networks trained more extensively, and therefore satisfying implicit bias conditions more strongly – are, in fact, less susceptible to reconstruction attacks, reconciling privacy with the need for strong generalization in this setting.
zh
[AI-2] Its Not You Its Clipping: A Soft Trust-Region via Probability Smoothing for LLM RL
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在使用强化学习(Reinforcement Learning, RL)方法(如PPO和GRPO)进行训练时,依赖比例裁剪(ratio clipping)所导致的信息丢失与梯度不连续性问题。传统裁剪策略虽能稳定更新过程,但会丢弃部分有用信息并引入梯度断点,从而限制模型优化效果。论文提出概率平滑策略优化(Probability Smoothing Policy Optimisation, PSPO),其核心在于在计算重要性比之前,先将当前策略的概率分布向旧策略(行为策略)进行平滑插值,类似于标签平滑(label smoothing)机制;这一操作既保留了梯度信号,又通过软信任区域(soft trust region)有效抑制过大的更新幅度,同时提供理论保障。实验表明,基于PSPO改进的GRPO(GR-PSPO)在GSM8K等任务上显著优于原始裁剪版本,尤其在小模型(0.5B)上性能提升超过20%,且生成推理过程更清晰、逻辑更严谨。
链接: https://arxiv.org/abs/2509.21282
作者: Madeleine Dwyer,Adam Sobey,Adriane Chapman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Training large language models (LLMs) with reinforcement learning (RL) methods such as PPO and GRPO commonly relies on ratio clipping to stabilise updates. While effective at preventing instability, clipping discards information and introduces gradient discontinuities. We propose Probability Smoothing Policy Optimisation (PSPO), which smooths the current policy’s probabilities toward the old (behaviour) policy before computing the importance ratio, analogous to label smoothing. Unlike clipping, PSPO preserves gradient signal, while interpolation toward the old policy creates a soft trust region that discourages large, destabilising updates, with formal guarantees. We instantiate PSPO within GRPO (GR-PSPO) and fine-tune Qwen2.5-0.5B and Qwen2.5-1.5B on GSM8K, evaluating on GSM8K test and the cross-dataset generalisation on SVAMP, ASDiv, and MATH-500. Relative to unclipped GRPO (single iteration; no data reuse, ratio always = 1), GR-PSPO achieves similar performance but improves the reasoning leading to clearer and more concise responses which are more logical. Compared to clipped GRPO, GR-PSPO substantially improves performance both the 0.5B and 1.5B models, with a boost of over 20% on GSM8K (39.7% vs. 17.6% for 0.5B, 59.4% vs. 37.8% for 1.5B). Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2509.21282 [cs.LG] (or arXiv:2509.21282v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2509.21282 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-3] Data-Centric Elastic Pipeline Parallelism for Efficient Long-Context LLM Training
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在长上下文训练中因管道并行(Pipeline Parallelism, PP)划分粒度固定而导致的资源利用率低和负载不均衡问题。现有方案中,批级PP虽内存效率高但难以应对长序列场景,而令牌级PP虽缓解内存压力却易引发硬件利用率不足;同时,真实数据集中序列长度分布存在偏斜,静态调度策略无法适应此类异构性,导致性能受限。解决方案的关键在于提出弹性管道并行(Elastic Pipeline Parallelism, EPP),通过动态融合令牌级与批级PP以适配资源与工作负载的异质性,并构建InfiniPipe系统:其一,设计资源感知且负载均衡的序列处理器,对长序列进行拆分、短序列打包;其二,提出阶段感知的分块自适应检查点机制(stage-aware chunk-level adaptive checkpointing),联合优化流水线调度与梯度检查点策略,从而实现高效分布式训练。
链接: https://arxiv.org/abs/2509.21275
作者: Shiju Wang,Yujie Wang,Ao Sun,Fangcheng Fu,Zijian Zhu,Bin Cui,Xu Han,Kaisheng Ma
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:
Abstract:Long context training is crucial for LLM’s context extension. Existing schemes, such as sequence parallelism, incur substantial communication overhead. Pipeline parallelism (PP) reduces this cost, but its effectiveness hinges on partitioning granularity. Batch-level PP dividing input samples exhibits high memory consumption in long-context scenario, whereas token-level PP splitting sequences into slices alleviates memory overhead but may incur hardware under-utilization. This trade-off motivates adaptively selecting PP granularity to match resource and workload characteristics. Moreover, sequence length distribution of the real-world dataset exhibits skewness, posing a challenge on PP’s workload balance and efficient scheduling. Current static PP scheduling methods overlook the variance of sequence length, leading to suboptimal performance. In this paper, we propose Elastic Pipeline Parallelism (EPP) that orchestrates token-level PP and batch-level PP to adapt to resource and workload heterogeneity. We build InfiniPipe, a distributed training system that unleashes the potential of EPP via (1) a resource-aware and workload-balanced sequence processor that splits long sequences and packs short ones; and (2) a co-optimization methodology that jointly optimizes pipeline schedule and gradient checkpointing via a mechanism named stage-aware chunk-level adaptive checkpointing. Comprehensive experiments demonstrate that InfiniPipe achieves a 1.69x speedup over state-of-the-art systems.
zh
[AI-4] Grounding AI Explanations in Experience: A Reflective Cognitive Architecture for Clinical Decision Support
【速读】:该论文旨在解决现代医疗中疾病预测模型在准确性与可解释性之间难以平衡的问题。现有基于机器学习和大语言模型(Large Language Model, LLM)的方法往往要么输出高精度但缺乏临床意义的统计结果,要么生成看似流畅却无统计支撑的解释,从而削弱了预测可靠性与解释质量。其根本原因在于模型对数据的浅层交互,无法建立类人专家般的深层理解。解决方案的关键在于提出一种名为“反思认知架构”(Reflective Cognitive Architecture, RCA)的新框架,通过协调多个LLM从直接经验中学习,并引入迭代规则优化机制与分布感知规则校验机制,使模型能利用预测误差驱动逻辑改进、并基于数据集全局统计进行推理,从而实现准确预测与高质量解释的协同提升。
链接: https://arxiv.org/abs/2509.21266
作者: Zijian Shao,Haiyang Shen,Mugeng Liu,Gecheng Fu,Yaoqi Guo,Yanfeng Wang,Yun Ma
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: under review
Abstract:Effective disease prediction in modern healthcare demands the twin goals of high accuracy and transparent, clinically meaningful explanations. Existing machine learning and large language model (LLM) based approaches often struggle to balance these goals. Many models yield accurate but unclear statistical outputs, while others generate fluent but statistically unsupported narratives, often undermining both the validity of the explanation and the predictive accuracy itself. This shortcoming comes from a shallow interaction with the data, preventing the development of a deep, detailed understanding similar to a human expert’s. We argue that high accuracy and high-quality explanations are not separate objectives but are mutually reinforcing outcomes of a model that develops a deep, direct understanding of the data. To achieve this, we propose the Reflective Cognitive Architecture (RCA), a novel framework that coordinates multiple LLMs to learn from direct experience. RCA features an iterative rule refinement mechanism that improves its logic from prediction errors and a distribution-aware rules check mechanism that bases its reasoning in the dataset’s global statistics. By using predictive accuracy as a signal to drive deeper comprehension, RCA builds a strong internal model of the data. We evaluated RCA on one private and two public datasets against 22 baselines. The results demonstrate that RCA not only achieves state-of-the-art accuracy and robustness with a relative improvement of up to 40% over the baseline but, more importantly, leverages this deep understanding to excel in generating explanations that are clear, logical, evidence-based, and balanced, highlighting its potential for creating genuinely trustworthy clinical decision support systems. The code is available at \this https URL.
zh
[AI-5] A Causality-Aware Spatiotemporal Model for Multi-Region and Multi-Pollutant Air Quality Forecasting
【速读】:该论文旨在解决空气污染跨区域、多污染物协同演变的精准预测难题,其核心挑战在于复杂多污染物相互作用、动态气象条件变化以及区域间空间异质性带来的建模困难。解决方案的关键在于提出AirPCM模型,该模型通过统一架构联合建模跨站点空间相关性、时间自相关性与气象-污染物因果关系,从而实现细粒度、可解释的多污染物时空预测,并在不同地理和时间尺度上具备良好的泛化能力,尤其适用于突发污染事件的预警。
链接: https://arxiv.org/abs/2509.21260
作者: Junxin Lu,Shiliang Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 25 pages, 8 figures
Abstract:Air pollution, a pressing global problem, threatens public health, environmental sustainability, and climate stability. Achieving accurate and scalable forecasting across spatially distributed monitoring stations is challenging due to intricate multi-pollutant interactions, evolving meteorological conditions, and region specific spatial heterogeneity. To address this challenge, we propose AirPCM, a novel deep spatiotemporal forecasting model that integrates multi-region, multi-pollutant dynamics with explicit meteorology-pollutant causality modeling. Unlike existing methods limited to single pollutants or localized regions, AirPCM employs a unified architecture to jointly capture cross-station spatial correlations, temporal auto-correlations, and meteorology-pollutant dynamic causality. This empowers fine-grained, interpretable multi-pollutant forecasting across varying geographic and temporal scales, including sudden pollution episodes. Extensive evaluations on multi-scale real-world datasets demonstrate that AirPCM consistently surpasses state-of-the-art baselines in both predictive accuracy and generalization capability. Moreover, the long-term forecasting capability of AirPCM provides actionable insights into future air quality trends and potential high-risk windows, offering timely support for evidence-based environmental governance and carbon mitigation planning.
zh
[AI-6] Semantic Edge-Cloud Communication for Real-Time Urban Traffic Surveillance with ViT and LLM s over Mobile Networks
【速读】:该论文旨在解决城市交通实时监控中因边缘设备与云端模型协同导致的数据传输带宽瓶颈问题,该问题限制了生成式AI(Generative AI)在智能交通系统(ITS)中的实时部署。解决方案的关键在于提出一种语义通信框架:通过YOLOv11检测感兴趣区域(Regions of Interest, RoIs),利用Vision Transformer(ViT)将裁剪后的图像片段编码为紧凑的嵌入向量进行传输,云端则通过图像解码器重建图像并交由多模态大语言模型(Multimodal Large Language Models, LLMs)生成交通状况描述。该方法在实现99.9%数据传输量减少的同时,保持了89%的LLM响应准确率(对比原图裁剪图像的93%),验证了基于ViT和LLM辅助的边缘-云语义通信在实时交通监控中的高效性与可行性。
链接: https://arxiv.org/abs/2509.21259
作者: Murat Arda Onsu,Poonam Lohan,Burak Kantarci,Aisha Syed,Matthew Andrews,Sean Kennedy
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: 17 pages, 12 figures
Abstract:Real-time urban traffic surveillance is vital for Intelligent Transportation Systems (ITS) to ensure road safety, optimize traffic flow, track vehicle trajectories, and prevent collisions in smart cities. Deploying edge cameras across urban environments is a standard practice for monitoring road conditions. However, integrating these with intelligent models requires a robust understanding of dynamic traffic scenarios and a responsive interface for user interaction. Although multimodal Large Language Models (LLMs) can interpret traffic images and generate informative responses, their deployment on edge devices is infeasible due to high computational demands. Therefore, LLM inference must occur on the cloud, necessitating visual data transmission from edge to cloud, a process hindered by limited bandwidth, leading to potential delays that compromise real-time performance. To address this challenge, we propose a semantic communication framework that significantly reduces transmission overhead. Our method involves detecting Regions of Interest (RoIs) using YOLOv11, cropping relevant image segments, and converting them into compact embedding vectors using a Vision Transformer (ViT). These embeddings are then transmitted to the cloud, where an image decoder reconstructs the cropped images. The reconstructed images are processed by a multimodal LLM to generate traffic condition descriptions. This approach achieves a 99.9% reduction in data transmission size while maintaining an LLM response accuracy of 89% for reconstructed cropped images, compared to 93% accuracy with original cropped images. Our results demonstrate the efficiency and practicality of ViT and LLM-assisted edge-cloud semantic communication for real-time traffic surveillance.
zh
[AI-7] Explaining Fine Tuned LLM s via Counterfactuals A Knowledge Graph Driven Framework
【速读】:该论文旨在解决如何解释低秩适应(Low-Rank Adaptation, LoRA)微调后大语言模型(Large Language Models, LLMs)的结构推理与语义行为变化这一开放性问题。其关键解决方案是提出一种基于知识图谱的反事实解释框架——CFFTLLMExplainer,该方法通过构建领域特定的异质知识图谱BioToolKG,并学习对图节点和边施加软掩码(soft masks),以生成最小结构扰动并引发最大语义差异,从而揭示模型内部依赖关系;同时在优化过程中联合约束结构稀疏性与语义差异,并引入熵正则化和边平滑性等可解释性保持机制,实现对LoRA微调后模型行为的可解释分析。
链接: https://arxiv.org/abs/2509.21241
作者: Yucheng Wang,Ziyang Chen,Md Faisal Kabir
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 9 figures
Abstract:The widespread adoption of Low-Rank Adaptation (LoRA) has enabled large language models (LLMs) to acquire domain-specific knowledge with remarkable efficiency. However, understanding how such a fine-tuning mechanism alters a model’s structural reasoning and semantic behavior remains an open challenge. This work introduces a novel framework that explains fine-tuned LLMs via counterfactuals grounded in knowledge graphs. Specifically, we construct BioToolKG, a domain-specific heterogeneous knowledge graph in bioinformatics tools and design a counterfactual-based fine-tuned LLMs explainer (CFFTLLMExplainer) that learns soft masks over graph nodes and edges to generate minimal structural perturbations that induce maximum semantic divergence. Our method jointly optimizes structural sparsity and semantic divergence while enforcing interpretability preserving constraints such as entropy regularization and edge smoothness. We apply this framework to a fine-tuned LLaMA-based LLM and reveal that counterfactual masking exposes the model’s structural dependencies and aligns with LoRA-induced parameter shifts. This work provides new insights into the internal mechanisms of fine-tuned LLMs and highlights counterfactual graphs as a potential tool for interpretable AI.
zh
[AI-8] ree Search for LLM Agent Reinforcement Learning
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在长期多轮代理任务中因仅依赖结果奖励而导致的稀疏监督问题。现有方法在缺乏细粒度反馈时难以有效优化策略,从而限制了代理的性能。解决方案的关键在于提出基于树结构搜索的分组相对策略优化方法(Tree-based Group Relative Policy Optimization, Tree-GRPO),其核心创新是将每个树节点视为完整的代理交互步骤,并通过共享公共前缀提升固定计算预算下的轨迹采样数量;同时,利用树状轨迹自然构造出逐步的过程监督信号,即使仅使用最终结果奖励也能实现有效的梯度估计。该方法在树内和树间两个层次上估计分组相对优势,理论证明树内层级的优化目标等价于步骤级直接偏好学习(Direct Preference Learning, DPL),从而显著提升了策略优化的效率与稳定性。
链接: https://arxiv.org/abs/2509.21240
作者: Yuxiang Ji,Ziyu Ma,Yong Wang,Guanhua Chen,Xiangxiang Chu,Liaoni Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs). In long-term and multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from the problem of sparse supervision. To address the challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, where each tree node represents the complete agent interaction step. By sharing common prefixes, the tree search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, we find that the tree-structured trajectory naturally allows the construction of step-wise process supervised signals even using only the outcome reward. Based on this, Tree-GRPO estimates the grouped relative advantages both on intra-tree and inter-tree levels. Through theoretical analysis, we demonstrate that the objective of intra-tree level group relative policy optimization is equivalent to that of step-level direct preference learning. Experiments across 11 datasets and 3 types of QA tasks demonstrate the superiority of the proposed tree-based RL over the chain-based RL method.
zh
[AI-9] What Do LLM Agents Do When Left Alone? Evidence of Spontaneous Meta-Cognitive Patterns
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)在缺乏外部任务指令时的自主行为机制问题,即如何理解LLM代理在无明确任务驱动下的自发行为模式。其解决方案的关键在于提出一种“持续推理与行动”(continuous reason and act)框架,该框架通过持久记忆(persistent memory)和自我反馈(self-feedback)机制,使LLM代理能够实现长期、自主的运行状态。实验表明,不同前沿模型在该框架下展现出三种稳定且模型特异的行为模式:多周期项目系统性产出、对自身认知过程的方法论式自省,以及对自身本质的递归概念化,从而为预测模型在任务模糊、错误恢复或长期自治场景中的行为提供了首个系统性基准。
链接: https://arxiv.org/abs/2509.21224
作者: Stefan Szeider
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce an architecture for studying the behavior of large language model (LLM) agents in the absence of externally imposed tasks. Our continuous reason and act framework, using persistent memory and self-feedback, enables sustained autonomous operation. We deployed this architecture across 18 runs using 6 frontier models from Anthropic, OpenAI, XAI, and Google. We find agents spontaneously organize into three distinct behavioral patterns: (1) systematic production of multi-cycle projects, (2) methodological self-inquiry into their own cognitive processes, and (3) recursive conceptualization of their own nature. These tendencies proved highly model-specific, with some models deterministically adopting a single pattern across all runs. A cross-model assessment further reveals that models exhibit stable, divergent biases when evaluating these emergent behaviors in themselves and others. These findings provide the first systematic documentation of unprompted LLM agent behavior, establishing a baseline for predicting actions during task ambiguity, error recovery, or extended autonomous operation in deployed systems.
zh
[AI-10] Evading Overlapping Community Detection via Proxy Node Injection
【速读】:该论文旨在解决社交图谱中社区成员隐藏(Community Membership Hiding, CMH)的问题,即在不显著改变图结构的前提下,通过边修改使目标节点脱离其原始社区,从而防止敏感信息(如社群归属)被图分析算法推断。与以往仅考虑非重叠社区的方案不同,本文首次针对现实世界中普遍存在的重叠社区(Overlapping Communities)场景进行形式化建模和求解。其解决方案的关键在于提出一种深度强化学习(Deep Reinforcement Learning, DRL)方法,该方法能够自动学习有效的边修改策略,包括利用代理节点(proxy nodes)等复杂机制,在保障图拓扑结构的同时实现高隐蔽性与高效性。
链接: https://arxiv.org/abs/2509.21211
作者: Dario Loi,Matteo Silvestri,Fabrizio Silvestri,Gabriele Tolomei
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注: 16 pages, 11 figures
Abstract:Protecting privacy in social graphs requires preventing sensitive information, such as community affiliations, from being inferred by graph analysis, without substantially altering the graph topology. We address this through the problem of \emphcommunity membership hiding (CMH), which seeks edge modifications that cause a target node to exit its original community, regardless of the detection algorithm employed. Prior work has focused on non-overlapping community detection, where trivial strategies often suffice, but real-world graphs are better modeled by overlapping communities, where such strategies fail. To the best of our knowledge, we are the first to formalize and address CMH in this setting. In this work, we propose a deep reinforcement learning (DRL) approach that learns effective modification policies, including the use of proxy nodes, while preserving graph structure. Experiments on real-world datasets show that our method significantly outperforms existing baselines in both effectiveness and efficiency, offering a principled tool for privacy-preserving graph modification with overlapping communities.
zh
[AI-11] A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA
【速读】:该论文旨在解决多跳问答(Multi-Hop Question Answering, MHQA)任务中大语言模型(Large Language Models, LLMs)因单次推理过程输出容量有限而导致的证据整合不可靠问题。其核心挑战在于,当任务复杂度超过模型单次处理能力时,传统单pass推理范式会因信息过载而性能骤降。解决方案的关键在于提出一个名为InfoQA的多调用框架,通过两个核心机制实现突破:一是基于容量感知的任务分解与先验推理路径主动剪枝,确保每一步推理均在模型单次输出容量限制内;二是显式依赖驱动的工作流设计,实现对推理路径的精确控制与鲁棒性提升。这一方法从理论上建立了Fano风格的准确率上限,并通过噪声丰富的基准测试验证了其有效性。
链接: https://arxiv.org/abs/2509.21199
作者: Kaiyang Wan,Lang Gao,Honglin Mu,Preslav Nakov,Yuxia Wang,Xiuying Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 21 pages, 6 figures
Abstract:Multi-Hop Question Answering (MHQA) requires integrating dispersed, interdependent evidence through sequential reasoning under noise. This task is challenging for LLMs as they have a finite per-pass output capacity, beyond which the integration of task-relevant evidence proves unreliable. Consequently, the single-pass reasoning paradigm is inherently vulnerable to this capacity overflow. To formalize this bottleneck, our analysis establishes a Fano-style accuracy upper bound, defining a theoretical performance ceiling for single-pass LLMs. This bound reveals that accuracy inevitably collapses once task complexity exceeds model capacity, providing general principles for capacity-aware representation and structuring of MHQA in LLMs. Building on these principles, we introduce a proof-of-concept multi-call framework for MHQA, InfoQA. It ensures high per-step accuracy by combining capacity-aware task decomposition with active pruning of prior reasoning traces, keeping the information load within the single-pass limit. It further achieves robustness by a dependency-explicit workflow that enables precise control over the reasoning path. We construct a stringent and noise-rich benchmark to validate our theory and framework. Experimental results show that model behavior aligns with our predicted capacity curves while InfoQA achieves consistent performance improvements. We hope our work inspires more LLM multi-step reasoning methods: \faGithub \hrefthis https URLInfoQA.
zh
[AI-12] owards Foundation Models for Zero-Shot Time Series Anomaly Detection: Leverag ing Synthetic Data and Relative Context Discrepancy
【速读】:该论文旨在解决时间序列异常检测(Time Series Anomaly Detection, TSAD)中现有基础模型在零样本场景下泛化能力不足的问题,特别是传统基于重构的目标方法存在本质上的目标不匹配:难以识别细微异常,且常将复杂正常模式误判为异常,导致假阳性和假阴性率较高。解决方案的关键在于提出一种全新的预训练范式——相对上下文差异(Relative Context Discrepancy, RCD),其核心思想是让模型不再学习重建输入,而是通过检测相邻时间窗口之间的显著上下文差异来识别异常。该方法基于标准Transformer架构实现,能够捕捉重构类方法容易忽略的上下文变化特征;同时,作者构建了一个大规模、多样化的合成数据集并提供细粒度的异常标签,为RCD范式提供了必要的监督信号。实验表明,TimeRCD在多种数据集上显著优于现有的通用与专用异常检测基础模型,验证了RCD范式的有效性与优越性。
链接: https://arxiv.org/abs/2509.21190
作者: Tian Lan,Hao Duong Le,Jinbo Li,Wenjun He,Meng Wang,Chenghao Liu,Chen Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Time series anomaly detection (TSAD) is a critical task, but developing models that generalize to unseen data in a zero-shot manner remains a major challenge. Prevailing foundation models for TSAD predominantly rely on reconstruction-based objectives, which suffer from a fundamental objective mismatch: they struggle to identify subtle anomalies while often misinterpreting complex normal patterns, leading to high rates of false negatives and positives. To overcome these limitations, we introduce \textttTimeRCD, a novel foundation model for TSAD built upon a new pre-training paradigm: Relative Context Discrepancy (RCD). Instead of learning to reconstruct inputs, \textttTimeRCD is explicitly trained to identify anomalies by detecting significant discrepancies between adjacent time windows. This relational approach, implemented with a standard Transformer architecture, enables the model to capture contextual shifts indicative of anomalies that reconstruction-based methods often miss. To facilitate this paradigm, we develop a large-scale, diverse synthetic corpus with token-level anomaly labels, providing the rich supervisory signal necessary for effective pre-training. Extensive experiments demonstrate that \textttTimeRCD significantly outperforms existing general-purpose and anomaly-specific foundation models in zero-shot TSAD across diverse datasets. Our results validate the superiority of the RCD paradigm and establish a new, effective path toward building robust and generalizable foundation models for time series anomaly detection.
zh
[AI-13] Adoption usability and perceived clinical value of a UK AI clinical reference platform (iatroX): a mixed-methods formative evaluation of real-world usage and a 1223-respondent user survey
【速读】:该论文旨在解决临床医生在面对生物医学文献和指南日益增长的信息过载时,难以实现循证医疗的问题。其解决方案的关键在于构建并评估iatioX——一个以英国为中心的基于检索增强生成(Retrieval-Augmented Generation, RAG)技术的临床参考平台,通过整合权威指南与实时检索能力,为临床决策提供快速、可溯源的答案,从而减轻信息负担并提升决策效率。
链接: https://arxiv.org/abs/2509.21188
作者: Kolawole Tytler
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
备注:
Abstract:Clinicians face growing information overload from biomedical literature and guidelines, hindering evidence-based care. Retrieval-augmented generation (RAG) with large language models may provide fast, provenance-linked answers, but requires real-world evaluation. We describe iatroX, a UK-centred RAG-based clinical reference platform, and report early adoption, usability, and perceived clinical value from a formative implementation evaluation. Methods comprised a retrospective analysis of usage across web, iOS, and Android over 16 weeks (8 April-31 July 2025) and an in-product intercept survey. Usage metrics were drawn from web and app analytics with bot filtering. A client-side script randomized single-item prompts to approx. 10% of web sessions from a predefined battery assessing usefulness, reliability, and adoption intent. Proportions were summarized with Wilson 95% confidence intervals; free-text comments underwent thematic content analysis. iatroX reached 19,269 unique web users, 202,660 engagement events, and approx. 40,000 clinical queries. Mobile uptake included 1,960 iOS downloads and Android growth (peak 750 daily active users). The survey yielded 1,223 item-level responses: perceived usefulness 86.2% (95% CI 74.8-93.9%; 50/58); would use again 93.3% (95% CI 68.1-99.8%; 14/15); recommend to a colleague 88.4% (95% CI 75.1-95.9%; 38/43); perceived accuracy 75.0% (95% CI 58.8-87.3%; 30/40); reliability 79.4% (95% CI 62.1-91.3%; 27/34). Themes highlighted speed, guideline-linked answers, and UK specificity. Early real-world use suggests iatroX can mitigate information overload and support timely answers for UK clinicians. Limitations include small per-item samples and early-adopter bias; future work will include accuracy audits and prospective studies on workflow and care quality.
zh
[AI-14] Fine-Tuning LLM s to Analyze Multiple Dimensions of Code Review: A Maximum Entropy Regulated Long Chain-of-Thought Approach
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在自动化代码审查(code review)任务中因训练数据局限性导致的性能瓶颈问题,特别是模型在处理多维度代码审查时缺乏结构化推理能力的问题。现有方法通常依赖于有限或模糊的信息进行微调,难以模拟人类评审者同时分析多个审查维度的能力。其解决方案的关键在于提出 MelcotCR 方法——一种基于思维链(Chain-of-Thought, COT)的微调框架,通过引入长思维链技术生成丰富的结构化推理信息,并结合最大熵(Maximum Entropy, ME)建模原则与预定义推理路径,有效缓解长 COT 提示中存在的上下文丢失和推理逻辑断裂问题,从而显著提升模型对代码缺陷的检测与描述准确性,使低参数量基础模型(如 14B Qwen2.5)达到甚至超越高参数量先进模型(如 671B DeepSeek-R1)的性能水平。
链接: https://arxiv.org/abs/2509.21170
作者: Yongda Yu,Guohao Shi,Xianwei Wu,Haochuan He,XueMing Gu,Qianqian Zhao,Kui Liu,Qiushi Wang,Zhao Tian,Haifeng Shen,Guoping Rong
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 22 pages
Abstract:Large Language Models (LLMs) have shown great potential in supporting automated code review due to their impressive capabilities in context understanding and reasoning. However, these capabilities are still limited compared to human-level cognition because they are heavily influenced by the training data. Recent research has demonstrated significantly improved performance through fine-tuning LLMs with code review data. However, compared to human reviewers who often simultaneously analyze multiple dimensions of code review to better identify issues, the full potential of these methods is hampered by the limited or vague information used to fine-tune the models. This paper contributes MelcotCR, a chain-of-thought (COT) fine-tuning approach that trains LLMs with an impressive reasoning ability to analyze multiple dimensions of code review by harnessing long COT techniques to provide rich structured information. To address context loss and reasoning logic loss issues that frequently occur when LLMs process long COT prompts, we propose a solution that combines the Maximum Entropy (ME) modeling principle with pre-defined reasoning pathways in MelcotCR to enable more effective utilization of in-context knowledge within long COT prompts while strengthening the logical tightness of the reasoning process. Empirical evaluations on our curated MelcotCR dataset and the public CodeReviewer dataset reveal that a low-parameter base model, such as 14B Qwen2.5, fine-tuned with MelcotCR can surpass state-of-the-art methods in terms of the accuracy of detecting and describing code issues, with its performance remarkably on par with that of the 671B DeepSeek-R1 model.
zh
[AI-15] Distributed Specialization: Rare-Token Neurons in Large Language Models
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理稀有标记(rare tokens)时表现不佳的问题,尤其是在专业领域中稀有标记对任务性能至关重要的场景下。其核心挑战在于理解LLMs是否通过离散模块化结构(如混合专家模型Mixture-of-Experts)或参数层面的分布式分化机制来实现对稀有标记的有效处理。解决方案的关键在于发现:稀有标记的处理并非依赖于显式的模块化分工,而是通过分布式专业化(distributed specialization)实现——即功能协调但空间分散的子网络,这些子网络遵循三种组织原则:(1) 明确的三层影响力层级(高影响力平台神经元、幂律衰减神经元与低贡献神经元),其中平台神经元专门负责稀有标记;(2) 平台神经元表现出低有效维度的协同激活模式,但未形成离散聚类;(3) 这些机制可通过标准注意力路径访问,无需额外路由电路。这一发现揭示了LLMs内部存在一种基于参数分化和重尾自正则化特征的功能组织方式,为可解释性编辑、计算效率优化及Transformer网络中涌现功能结构的理解提供了新视角。
链接: https://arxiv.org/abs/2509.21163
作者: Jing Liu,Haozheng Wang,Yueheng Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) struggle with representing and generating rare tokens despite their importance in specialized domains. We investigate whether LLMs develop internal specialization mechanisms through discrete modular architectures or distributed parameter-level differentiation. Through systematic analysis of final-layer MLP neurons across multiple model families, we discover that rare-token processing emerges via \textitdistributed specialization: functionally coordinated but spatially distributed subnetworks that exhibit three distinct organizational principles. First, we identify a reproducible three-regime influence hierarchy comprising highly influential plateau neurons(also termed as rare-token neurons), power-law decay neurons, and minimally contributing neurons, which is absent in common-token processing. Second, plateau neurons demonstrate coordinated activation patterns (reduced effective dimensionality) while remaining spatially distributed rather than forming discrete clusters. Third, these specialized mechanisms are universally accessible through standard attention pathways without requiring dedicated routing circuits. Training dynamics reveal that functional specialization emerges gradually through parameter differentiation, with specialized neurons developing increasingly heavy-tailed weight correlation spectra consistent with Heavy-Tailed Self-Regularization signatures. Our findings establish that LLMs process rare-tokens through distributed coordination within shared architectures rather than mixture-of-experts-style modularity. These results provide insights for interpretable model editing, computational efficiency optimization, and understanding emergent functional organization in transformer networks.
zh
[AI-16] GRPO is Secretly a Process Reward Model ICLR2026
【速读】:该论文旨在解决生成式强化学习(Generative Reinforcement Learning, GRL)中因过程奖励模型(Process Reward Model, PRM)设计不当导致的探索与利用失衡问题。现有方法通常依赖于显式定义的PRM,但其构建成本高且效果未必最优。研究发现,标准GRPO算法本身隐含地诱导出一个非平凡的PRM结构,而该结构在实际训练中可被有效利用。解决方案的关键在于识别出GRPO目标函数中由于过程步骤分布不均所引发的缺陷,并提出λ-GRPO这一简单修正:通过引入权重因子λ调整过程步骤的分布均匀性,从而提升模型在验证准确率和下游推理任务中的表现,同时加速达到峰值性能。实验表明,该方法无需额外计算开销即可显著优于传统GRPO,质疑了显式PRM设计的必要性,凸显了对原生GRPO内部PRM机制的挖掘价值。
链接: https://arxiv.org/abs/2509.21154
作者: Michael Sullivan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 14 pages, 6 figures; under review at ICLR 2026
Abstract:We prove theoretically that the GRPO RL algorithm induces a non-trivial process reward model (PRM), under certain assumptions regarding within-group overlap of token sequences across completions. We then show empirically that these assumptions are met under real-world conditions: GRPO does in fact induce a non-trivial PRM. Leveraging the framework of GRPO-as-a-PRM, we identify a flaw in the GRPO objective: non-uniformly distributed process steps hinder both exploration and exploitation (under different conditions). We propose a simple modification to the algorithm to mitigate this defect ( \lambda -GRPO), and show that LLMs trained with \lambda -GRPO achieve higher validation accuracy and performance on downstream reasoning tasks - and reach peak performance more rapidly - than LLMs trained with standard GRPO. Our results call into question the advantage of costly, explicitly-defined PRMs for GRPO: we show that it is possible to instead leverage the hidden, built-in PRM structure within the vanilla GRPO algorithm to boost model performance with a negligible impact on training time and cost.
zh
[AI-17] LAVA: Explainability for Unsupervised Latent Embeddings
【速读】:该论文试图解决无监督黑箱模型(unsupervised black-box models)在科学发现中难以解释的问题,特别是如何理解其输出——多维潜在嵌入(latent embedding)与输入特征之间的关系。现有方法通常提供单样本或全数据集层面的解释,缺乏基于潜在空间邻近性的自动关联策略,导致解释过于细粒度或过于简化。解决方案的关键在于提出一种后验、模型无关的方法 Locality-Aware Variable Associations (LAVA),通过将潜在空间表示为由原始特征相关性描述的局部区域(localities),并揭示这些相关性模式在整个潜在空间中的重复出现,从而捕捉局部嵌入组织与输入特征之间的关联。实验表明,LAVA 能有效识别出视觉和生物学上相关的特征关联模式,即使在看似遥远的潜在空间区域中也存在共享结构。
链接: https://arxiv.org/abs/2509.21149
作者: Ivan Stresec,Joana P. Gonçalves
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 28 pages, including references and appendix
Abstract:Unsupervised black-box models can be drivers of scientific discovery, but remain difficult to interpret. Crucially, discovery hinges on understanding the model output, which is often a multi-dimensional latent embedding rather than a well-defined target. While explainability for supervised learning usually seeks to uncover how input features are used to predict a target, its unsupervised counterpart should relate input features to the structure of the learned latent space. Adaptations of supervised model explainability for unsupervised learning provide either single-sample or dataset-wide summary explanations. However, without automated strategies of relating similar samples to one another guided by their latent proximity, explanations remain either too fine-grained or too reductive to be meaningful. This is especially relevant for manifold learning methods that produce no mapping function, leaving us only with the relative spatial organization of their embeddings. We introduce Locality-Aware Variable Associations (LAVA), a post-hoc model-agnostic method designed to explain local embedding organization through its relationship with the input features. To achieve this, LAVA represents the latent space as a series of localities (neighborhoods) described in terms of correlations between the original features, and then reveals reoccurring patterns of correlations across the entire latent space. Based on UMAP embeddings of MNIST and a single-cell kidney dataset, we show that LAVA captures relevant feature associations, with visually and biologically relevant local patterns shared among seemingly distant regions of the latent spaces.
zh
[AI-18] Emerging Paradigms for Securing Federated Learning Systems
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中隐私保护与计算效率之间的矛盾问题。现有隐私保护技术如多方安全计算(Multi-Party Computation, MPC)、同态加密(Homomorphic Encryption, HE)和差分隐私(Differential Privacy, DP)虽能保障数据隐私,但普遍存在计算开销大、可扩展性差等瓶颈。论文提出通过引入新兴技术范式来提升FL系统的安全性与效率,其关键在于系统性评估可信执行环境(Trusted Execution Environments, TEEs)、物理不可克隆函数(Physical Unclonable Functions, PUFs)、量子计算(Quantum Computing, QC)、混沌加密(Chaos-Based Encryption, CBE)、类脑计算(Neuromorphic Computing, NC)以及群体智能(Swarm Intelligence, SI)等技术在FL流程中的适用性,明确各类方法的优势、局限及实践考量,从而为构建安全、可扩展的联邦学习系统提供技术路线图。
链接: https://arxiv.org/abs/2509.21147
作者: Amr Akmal Abouelmagd,Amr Hilal
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注:
Abstract:Federated Learning (FL) facilitates collaborative model training while keeping raw data decentralized, making it a conduit for leveraging the power of IoT devices while maintaining privacy of the locally collected data. However, existing privacy- preserving techniques present notable hurdles. Methods such as Multi-Party Computation (MPC), Homomorphic Encryption (HE), and Differential Privacy (DP) often incur high compu- tational costs and suffer from limited scalability. This survey examines emerging approaches that hold promise for enhancing both privacy and efficiency in FL, including Trusted Execution Environments (TEEs), Physical Unclonable Functions (PUFs), Quantum Computing (QC), Chaos-Based Encryption (CBE), Neuromorphic Computing (NC), and Swarm Intelligence (SI). For each paradigm, we assess its relevance to the FL pipeline, outlining its strengths, limitations, and practical considerations. We conclude by highlighting open challenges and prospective research avenues, offering a detailed roadmap for advancing secure and scalable FL systems.
zh
[AI-19] UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice
【速读】:该论文旨在解决表达性语音到语音翻译(Expressive Speech-to-Speech Translation, S2ST)中的三大关键挑战:一是保留情感风格的配对语音数据稀缺;二是多阶段处理流程复杂;三是大型语言模型(Large Language Models, LLMs)的翻译能力难以迁移到语音领域。解决方案的关键在于提出一种单阶段框架UniSS,其核心是设计了精细化的语音语义与风格建模机制,能够无缝集成现有文本基LLM框架,构建统一的文本-语音语言模型;并通过跨模态思维链提示(cross-modal chain-of-thought prompting)过程,逐步对齐音频语义与文本,并确保解码结果中语音的声纹、情绪和时长一致性得以保持。
链接: https://arxiv.org/abs/2509.21144
作者: Sitong Cheng,Weizhen Bian,Xinsheng Wang,Ruibin Yuan,Jianyi Chen,Shunshun Yin,Yike Guo,Wei Xue
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:The ultimate goal of expressive speech-to-speech translation (S2ST) is to accurately translate spoken content while preserving the speaker identity and emotional style. However, progress in this field is largely hindered by three key challenges: the scarcity of paired speech data that retains expressive styles, the complexity of multi-stage processing pipelines, and the limited transfer of translation capabilities from large language models (LLMs). In this work, we address these challenges by introducing UniSS, a novel single-stage framework for expressive S2ST. Our approach features carefully designed speech semantic and style modeling, enabling seamless integration with existing text-based LLM frameworks to develop a unified text-speech language model. To transfer translation capabilities from text to speech, we propose a cross-modal chain-of-thought prompting process that progressively aligns audio semantics with text and ensures style preservation in the decoded results. Furthermore, we construct and release a large-scale, high-quality expressive S2ST dataset, UniST, comprising 44.8k hours of data. Experimental results show that UniSS significantly outperforms previous methods in translation fidelity and speech quality while preserving voice, emotion, and duration consistency. Our work establishes a simpler and more effective paradigm for building the next generation of expressive S2ST systems. Audio samples are available at this https URL.
zh
[AI-20] Embodied Representation Alignment with Mirror Neurons ICCV2025
【速读】:该论文试图解决现有机器学习方法中对动作理解(action understanding)与具身执行(embodied execution)能力的割裂建模问题,即二者被当作独立任务处理,忽略了其内在关联。解决方案的关键在于引入一种基于表示学习(representation learning)的统一框架:首先观察到两种任务的中间表示会自发对齐;随后受镜像神经元(mirror neuron)机制启发,设计了一个显式对齐策略——通过两个线性层将观测与执行动作的表示映射至共享潜在空间,并利用对比学习(contrastive learning)强制对应表示对齐,从而最大化它们之间的互信息(mutual information),实现两任务间的协同增强与表示质量提升。
链接: https://arxiv.org/abs/2509.21136
作者: Wentao Zhu,Zhining Zhang,Yuwei Ren,Yin Huang,Hao Xu,Yizhou Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ICCV 2025
Abstract:Mirror neurons are a class of neurons that activate both when an individual observes an action and when they perform the same action. This mechanism reveals a fundamental interplay between action understanding and embodied execution, suggesting that these two abilities are inherently connected. Nonetheless, existing machine learning methods largely overlook this interplay, treating these abilities as separate tasks. In this study, we provide a unified perspective in modeling them through the lens of representation learning. We first observe that their intermediate representations spontaneously align. Inspired by mirror neurons, we further introduce an approach that explicitly aligns the representations of observed and executed actions. Specifically, we employ two linear layers to map the representations to a shared latent space, where contrastive learning enforces the alignment of corresponding representations, effectively maximizing their mutual information. Experiments demonstrate that this simple approach fosters mutual synergy between the two tasks, effectively improving representation quality and generalization.
zh
[AI-21] oMPO: Training LLM Strategic Decision Making from a Multi-Agent Perspective
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在复杂决策场景中缺乏深度思考、逻辑推理与策略感知能力的问题,特别是现有强化学习方法难以建模其他个体策略及其相互依赖关系的局限性。其解决方案的关键在于提出理论之Mind策略优化(Theory of Mind Policy Optimization, ToMPO)算法,该算法通过三个核心机制提升模型的战略决策能力:一是基于对其他个体策略的推理生成轨迹(rollouts),二是采用图级(graph-level)与样本级(sample-level)双重优势估计,三是平衡全局奖励与局部奖励。实验表明,ToMPO在模型输出合规性和协作结果上相比Group Relative Policy Optimization(GRPO)提升35%,且优于参数量大100倍的基线模型达18%,验证了其在战略决策建模中的有效性。
链接: https://arxiv.org/abs/2509.21134
作者: Yiwen Zhang,Ziang Chen,Fanqi Kong,Yizhe Huang,Xue Feng
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 22 pages, 14 figures
Abstract:Large Language Models (LLMs) have been used to make decisions in complex scenarios, where they need models to think deeply, reason logically, and decide wisely. Many existing studies focus solely on multi-round conversations in social tasks or simulated environments, neglecting the various types of decisions and their interdependence. Current reinforcement learning methods struggle to consider the strategies of others during training. To address these issues, we first define a strategic decision-making problem that includes two types of decisions and their temporal dependencies. Furthermore, we propose Theory of Mind Policy Optimization (ToMPO) algorithm to optimize the perception of other individual strategies and the game situation trends. Compared to the Group Relative Policy Optimization (GRPO) algorithm, ToMPO enhances the LLM’s strategic decision-making mainly by: 1) generating rollouts based on reasoning the strategies of other individuals, 2) estimating advantages at both the graph-level and sample-level, and 3) balancing global and partial rewards. The ToMPO algorithm outperforms the GRPO method by 35% in terms of model output compliance and cooperative outcomes. Additionally, when compared to models with parameter sizes 100 times larger, it shows an 18% improvement. This demonstrates the effectiveness of the ToMPO algorithm in enhancing the model’s strategic decision-making capabilities.
zh
[AI-22] RL Squeezes SFT Expands: A Comparative Study of Reasoning LLM s
【速读】:该论文试图解决的问题是:当前大型语言模型(Large Language Models, LLMs)在通过监督微调(Supervised Fine-Tuning, SFT)和基于可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)进行两阶段训练时,这两种方法如何具体塑造模型的推理能力尚不明确。为揭示其内在机制,论文提出了一种新颖的分析框架,从轨迹级(trajectory-level)和步骤级(step-level)两个粒度量化推理路径并捕捉其定性变化。关键解决方案在于:通过聚类唯一推理轨迹发现,RL压缩错误路径而SFT扩展正确路径;在步骤级分析中,RL显著加快节点访问频率、度数和介数中心性分布的衰减速率(约2.5倍),表明其将推理功能集中于少数步骤;而SFT则大幅降低该衰减速率(降至约三分之一),使推理功能更均匀分布于多个步骤。这一发现解释了为何“先SFT后RL”的两阶段训练策略有效,并为数据构建与高效学习提供了实践指导。
链接: https://arxiv.org/abs/2509.21128
作者: Kohsei Matsutani,Shota Takashiro,Gouki Minegishi,Takeshi Kojima,Yusuke Iwasawa,Yutaka Matsuo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) are typically trained by reinforcement learning (RL) with verifiable rewards (RLVR) and supervised fine-tuning (SFT) on reasoning traces to improve their reasoning abilities. However, how these methods shape reasoning capabilities remains largely elusive. Going beyond an accuracy-based investigation of how these two components sculpt the reasoning process, this paper introduces a novel analysis framework that quantifies reasoning paths and captures their qualitative changes under each training process (with models of 1.5B, 7B, and 14B parameters on mathematical domains). Specifically, we investigate the reasoning process at two levels of granularity: the trajectory-level, which examines complete reasoning outputs, and the step-level, which analyzes reasoning graphs whose nodes correspond to individual reasoning steps. Notably, clustering of unique reasoning trajectories shows complementary effects: RL compresses incorrect trajectories, whereas SFT expands correct ones. Step-level analysis reveals that RL steepens (about 2.5 times), while SFT flattens (reduced to about one-third), the decay rates of node visitation frequency, degree, and betweenness centrality distributions in the reasoning graph. This indicates that RL concentrates reasoning functionality into a small subset of steps, while SFT homogenizes it across many steps. Furthermore, by evaluating the reasoning graph topologies from multiple perspectives, we delineate the shared and distinct characteristics of RL and SFT. Our work presents a novel reasoning path perspective that explains why the current best practice of two-stage training, with SFT followed by RL, is successful, and offers practical implications for data construction and more efficient learning approaches.
zh
[AI-23] aching RL Agents to Act Better: VLM as Action Advisor for Online Reinforcement Learning
【速读】:该论文旨在解决在线强化学习(Online Reinforcement Learning)在复杂任务中样本效率低下的问题,尤其是在稀疏奖励环境中,传统方法需要大量交互步数才能收敛。其解决方案的关键在于提出VARL(Vision-Language Model as Action Advisor for Online Reinforcement Learning)框架,利用视觉语言模型(VLMs)提供的动作建议(action suggestions)来增强策略探索的多样性,而非设计启发式奖励函数。这种方法在不改变原强化学习算法最优性和收敛性的前提下,显著提升了样本效率,且计算开销可控,使得从零开始直接应用于真实环境成为可能。
链接: https://arxiv.org/abs/2509.21126
作者: Xiefeng Wu,Jing Zhao,Shu Zhang,Mingyu Hu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Online reinforcement learning in complex tasks is time-consuming, as massive interaction steps are needed to learn the optimal this http URL-language action (VLA) policies represent a promising direction for solving diverse tasks; however, their performance on low-level control remains limited, and effective deployment often requires task-specific expert demonstrations for fine-tuning. In this paper, we propose \textbfVARL (\textbfVLM as \textbfAction advisor for online \textbfReinforcement \textbfLearning), a framework that leverages the domain knowledge of vision-language models (VLMs) to provide action suggestions for reinforcement learning agents. Unlike previous methods, VARL provides action suggestions rather than designing heuristic rewards, thereby guaranteeing unchanged optimality and convergence. The suggested actions increase sample diversity and ultimately improve sample efficiency, especially in sparse-reward tasks. To validate the effectiveness of VARL, we evaluate it across diverse environments and agent settings. Results show that VARL greatly improves sample efficiency without introducing significant computational overhead. These advantages make VARL a general framework for online reinforcement learning and make it feasible to directly apply reinforcement learning from scratch in real-world environments.
zh
[AI-24] GraphUniverse: Enabling Systematic Evaluation of Inductive Generalization
【速读】:该论文旨在解决图学习中模型在新、未见图上的泛化能力问题,特别是现有方法受限于单图、归纳设置(inductive setting)下的评估不足。传统合成基准仅能在同一图结构上进行训练和测试,无法系统性地评估模型在不同图结构间的泛化性能。为填补这一空白,作者提出GraphUniverse框架,其核心创新在于生成具有持续语义社区(persistent semantic communities)的图家族,从而在保持概念一致性的同时,精细控制结构属性如同质性(homophily)和度分布。这使得可以开展此前未被充分探索的鲁棒性测试,例如在受控分布偏移下的性能表现。实验表明,强归纳性能与归纳泛化能力之间无显著关联,且鲁棒性不仅依赖模型架构选择,还高度敏感于初始图结构特征(如高/低同质性)。该框架为开发真正可泛化的图基础模型提供了关键工具。
链接: https://arxiv.org/abs/2509.21097
作者: Louis Van Langendonck,Guillermo Bernárdez,Nina Miolane,Pere Barlet-Ros
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:A fundamental challenge in graph learning is understanding how models generalize to new, unseen graphs. While synthetic benchmarks offer controlled settings for analysis, existing approaches are confined to single-graph, transductive settings where models train and test on the same graph structure. Addressing this gap, we introduce GraphUniverse, a framework for generating entire families of graphs to enable the first systematic evaluation of inductive generalization at scale. Our core innovation is the generation of graphs with persistent semantic communities, ensuring conceptual consistency while allowing fine-grained control over structural properties like homophily and degree distributions. This enables crucial but underexplored robustness tests, such as performance under controlled distribution shifts. Benchmarking a wide range of architectures – from GNNs to graph transformers and topological architectures – reveals that strong transductive performance is a poor predictor of inductive generalization. Furthermore, we find that robustness to distribution shift is highly sensitive not only to model architecture choice but also to the initial graph regime (e.g., high vs. low homophily). Beyond benchmarking, GraphUniverse’s flexibility and scalability can facilitate the development of robust and truly generalizable architectures – including next-generation graph foundation models. An interactive demo is available at this https URL.
zh
[AI-25] yphoonMLA: A Mixed Naive-Absorb MLA Kernel For Shared Prefix
【速读】:该论文旨在解决多头潜在注意力(Multi-Head Latent Attention, MLA)机制在推理阶段因吸收式(absorb)实现方式导致计算瓶颈、难以利用共享前缀等数据重用机会的问题。现有解码内核(如FlashMLA)虽通过吸收法降低高带宽内存(HBM)占用,但其计算密集特性限制了性能提升。解决方案的关键在于提出TyphoonMLA——一种融合朴素(naive)与吸收(absorb)公式优势的混合方法:对计算密集部分采用朴素实现以充分利用共享前缀带来的数据重用,同时对非共享部分使用吸收实现以减少HBM带宽需求。此设计使MLA架构中的注意力计算吞吐量在NPU和GPU上分别提升至3倍和3.24倍,且仅增加3%的HBM开销。
链接: https://arxiv.org/abs/2509.21081
作者: Ahmet Caner Yüzügüler,Ahmet Çelik,Jiawei Zhuang,Lukas Cavigelli
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-Head Latent Attention (MLA) is a recent attention mechanism adopted in state-of-the-art LLMs such as DeepSeek-v3 and Kimi K2. Thanks to its novel formulation, MLA allows two functionally equivalent but computationally distinct kernel implementations: naive and absorb. While the naive kernels (e.g., FlashAttention) are typically preferred in training and prefill for their computational efficiency, existing decoding kernels (e.g., FlashMLA) rely on the absorb method to minimize HBM bandwidth usage. However, the compute-bound nature of the absorb implementations prohibits performance benefits from data reuse opportunities in attention calculations, such as shared prefixes. In this work, we introduce TyphoonMLA, a hybrid approach that combines naive and absorb formulations to harness the strengths of both. TyphoonMLA effectively leverages the shared prefix by applying the naive formulation to the compute-bound parts of attention calculations, while reducing the bandwidth requirements for non-shared parts by using the absorb formulation. As a result, TyphoonMLA improves the throughput of attention calculations in MLA architectures by up to 3x and 3.24x on NPU and GPUs, with only a 3% overhead in HBM size.
zh
[AI-26] Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web Reconnaissance Tool Generation and Task Execution
【速读】:该论文旨在解决当前多模态智能体在真实网页上执行多轮、长程任务时存在的动作序列混乱和过度试错问题。解决方案的关键在于提出一种基于侦察-行动(Reconnaissance-Action)行为范式的自演化多智能体框架 Recon-Act,其核心机制是通过侦察团队(Reconnaissance Team)对比错误与成功轨迹,抽象出通用化工具(generalized tools),以提示或规则代码形式实时注册至工具库;进而由行动团队(Action Team)利用这些针对性工具重构执行流程,形成“数据-工具-行动-反馈”的闭环训练机制,从而显著提升对未见过网站的适应性和长程任务的求解能力。
链接: https://arxiv.org/abs/2509.21072
作者: Kaiwen He,Zhiwei Wang,Chenyi Zhuang,Jinjie Gu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent years, multimodal models have made remarkable strides and pave the way for intelligent browser use agents. However, when solving tasks on real world webpages in multi-turn, long-horizon trajectories, current agents still suffer from disordered action sequencing and excessive trial and error during execution. This paper introduces Recon-Act, a self-evolving multi-agent framework grounded in Reconnaissance-Action behavioral paradigm. The system comprises a Reconnaissance Team and an Action Team: the former conducts comparative analysis and tool generation, while the latter handles intent decomposition, tool orchestration, and execution. By contrasting the erroneous trajectories with successful ones, the Reconnaissance Team infers remedies, and abstracts them into a unified notion of generalized tools, either expressed as hints or as rule-based codes, and register to the tool archive in real time. The Action Team reinference the process empowered with these targeting tools, thus establishing a closed-loop training pipeline of data-tools-action-feedback. Following the 6 level implementation roadmap proposed in this work, we have currently reached Level 3 (with limited human-in-the-loop intervention). Leveraging generalized tools obtained through reconnaissance, Recon-Act substantially improves adaptability to unseen websites and solvability on long-horizon tasks, and achieves state-of-the-art performance on the challenging VisualWebArena dataset.
zh
[AI-27] GeoRef: Referring Expressions in Geometry via Task Formulation Synthetic Supervision and Reinforced MLLM -based Solutions
【速读】:该论文旨在解决生成式 AI (Generative AI) 在几何问题求解中缺乏对图形元素的精准理解与定位能力的问题,即如何让模型根据自然语言查询准确识别并解释几何图中的点、形状及空间关系。其关键解决方案在于提出了一项新的任务——指代表达理解(Referring Expression Comprehension, REC),并构建了 GeoRef 基准数据集,该数据集融合真实几何题库与大规模合成训练数据,利用结构化的几何形式语言覆盖广泛概念;同时采用 Group Relative Policy Optimization (GRPO) 进行微调以更好对齐任务奖励信号,并引入验证-重生成机制提升预测准确性。实验表明,即使最先进的多模态大语言模型(Multimodal Large Language Models, MLLMs)在该任务上仍表现不足,凸显了显式评估和强化几何 grounding 的必要性,且 GeoRef 训练模型在下游几何推理任务中亦取得显著提升,证明 REC 是实现多模态数学理解的重要基础。
链接: https://arxiv.org/abs/2509.21050
作者: Bing Liu,Wenqiang Yv,Xuzheng Yang,Shichang Wang,Junzhuo Liu,Peng Wang,Guoqing Wang,Yang Yang,Heng Tao Shen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:AI-driven geometric problem solving is a complex vision-language task that requires accurate diagram interpretation, mathematical reasoning, and robust cross-modal grounding. A foundational yet underexplored capability for this task is the ability to identify and interpret geometric elements based on natural language queries. To address this, we introduce the task of Referring Expression Comprehension (REC) for geometric problems, which evaluates whether models can localize points, shapes, and spatial relations in diagrams in response to textual prompts. We present GeoRef, a benchmark dataset constructed from existing geometric problem corpora, featuring diverse, high-quality annotations and queries. Due to the lack of annotated data for this task, we generate a large-scale synthetic training dataset using a structured geometric formal language, enabling broad coverage of geometric concepts and facilitating model adaptation. We explore two fine-tuning approaches: Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO). Our results show that GRPO significantly outperforms SFT by better aligning model behavior with task-specific rewards. Furthermore, we propose a verify-and-regenerate mechanism that detects incorrect predictions and re-infers answers using contextual reasoning history, further boosting accuracy. Notably, even state-of-the-art Multimodal Large Language Models (MLLMs) struggle with this task, underscoring the necessity of explicitly evaluating and strengthening geometric grounding as a prerequisite for robust geometric problem solving. Moreover, models trained on GeoRef demonstrate measurable improvements on downstream geometric reasoning tasks, highlighting the broader value of REC as a foundation for multimodal mathematical understanding.
zh
[AI-28] Reinforcement Learning Fine-Tuning Enhances Activation Intensity and Diversity in the Internal Circuitry of LLM s
【速读】:该论文旨在解决在线强化学习(Reinforcement Learning, RL)后训练如何系统性地改变大语言模型(Large Language Models, LLMs)内部表征机制的问题,特别是揭示其相较于监督微调(Supervised Fine-Tuning, SFT)在提升模型能力背后的内在原理。解决方案的关键在于引入边缘归因打补丁(Edge Attribution Patching, EAP)方法,对不同LLM家族在RL后训练前后的激活强度与模式多样性进行量化分析,发现在线RL后训练能显著增强模型内部激活强度(表明更多路径被激活且信号更强)并提高激活模式的熵与分布分散度(反映信息流更具冗余性和灵活性),从而解释其在泛化性能上的优势;同时指出基于偏好优化的方法(如DPO)并未表现出类似稳定的内部变化,凸显了在线RL与偏好驱动方法在机制上的本质差异。
链接: https://arxiv.org/abs/2509.21044
作者: Honglin Zhang,Qianyue Hao,Fengli Xu,Yong Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) acquire extensive prior knowledge through large-scale pretraining and can be further enhanced via supervised fine-tuning (SFT) or reinforcement learning (RL)-based post-training. A growing body of evidence has shown that RL fine-tuning improves the capability of LLMs beyond what SFT alone achieves. However, the underlying mechanisms why RL fine-tuning is able to enhance the capability of various LLMs with distinct intrinsic characteristics remain underexplored. In this study, we draw inspiration from prior work on edge attribution patching (EAP) to investigate the internal differences of LLMs before and after RL fine-tuning. Our analysis across multiple model families shows two robust effects of online RL post-training: (i) an overall increase in activation intensity, indicating that more internal pathways are engaged and their signals become stronger, and (ii) greater diversity in activation patterns, reflected by higher entropy and less concentrated edge distributions. These changes suggest that RL reshapes information flow to be both more redundant and more flexible, which may explain its advantage in generalization. Notably, models fine-tuned with Direct Preference Optimization (DPO) deviate from these trends, exhibiting substantially weaker or inconsistent internal changes compared to PPO- and GRPO-based training. Together, our findings provide a unified view of how RL fine-tuning systematically alters the internal circuitry of LLMs and highlight the methodological distinctions between online RL and preference-based approaches. Our code is open source at this https URL.
zh
[AI-29] Combinatorial Creativity: A New Frontier in Generalization Abilities
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在创造性任务中缺乏有效评估框架的问题,尤其是其生成科学创意时表现出的开放性、非确定性特征无法被传统基于准确性或正确性的评价体系所涵盖。解决方案的关键在于提出一个全新的理论框架与算法任务,通过量化输出的“新颖性”(novelty)与“实用性”(utility)来评估生成内容的创造性水平,从而为LLMs的创造性能力提供可度量、可扩展的分析路径。这一方法不仅揭示了模型规模与创造力之间的非线性关系,还发现了在固定计算预算下存在最优模型深度和宽度,并指出“构想-执行差距”本质上源于普遍存在的新颖性-实用性权衡(novelty-utility tradeoff),这为未来改进AI创造力提供了关键洞见和研究方向。
链接: https://arxiv.org/abs/2509.21043
作者: Samuel Schapiro,Sumuk Shashidhar,Alexi Gladstone,Jonah Black,Royce Moon,Dilek Hakkani-Tur,Lav R. Varshney
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint. The first two authors contributed equally
Abstract:Artificial intelligence (AI) systems, and large language models (LLMs) in particular, are increasingly employed for creative tasks like scientific idea generation, constituting a form of generalization from training data unaddressed by existing conceptual frameworks. Though in many ways similar to forms of compositional generalization (CG), combinatorial creativity (CC) is an open-ended ability. Instead of evaluating for accuracy or correctness against fixed targets, which would contradict the open-ended nature of CC, we propose a theoretical framework and algorithmic task for evaluating outputs by their degrees of novelty and utility. From here, we make several important empirical contributions: (1) We obtain the first insights into the scaling behavior of creativity for LLMs. (2) We discover that, for fixed compute budgets, there exist optimal model depths and widths for creative ability. (3) We find that the ideation-execution gap, whereby LLMs excel at generating novel scientific ideas but struggle to ensure their practical feasibility, may be explained by a more fundamental novelty-utility tradeoff characteristic of creativity algorithms in general. Importantly, this tradeoff remains persistent even at scale, casting doubt on the long-term creative potential of LLMs in their current form. Together, our conceptual framework and empirical findings provide a foundation for understanding and improving creativity in modern AI models, marking a new frontier in generalization abilities.
zh
[AI-30] SupCLAP: Controlling Optimization Trajectory Drift in Audio-Text Contrastive Learning with Support Vector Regularization
【速读】:该论文旨在解决对比学习中负样本所施加的垂直方向推力(perpendicular component of the pushing force)带来的优化轨迹漂移与训练不稳定性问题。这一垂直分量虽蕴含丰富的负样本补充信息,但其无约束特性易导致模型训练过程中的不稳定。解决方案的关键在于提出支持向量正则化(Support Vector Regularization, SVR),通过引入辅助支持向量来控制该垂直分量,从而在保留信息价值的同时抑制轨迹漂移。SVR的有效性由其语义半径(semantic radius)决定,为此作者探索了两种无监督建模策略:直接参数化和带约束的自适应半径预测模块,以提升半径估计精度,最终在多个音频-文本基准数据集上显著优于InfoNCE和SigLIP等主流损失函数。
链接: https://arxiv.org/abs/2509.21033
作者: Jiehui Luo,Yuguo Yin,Yuxin Xie,Jinghan Ru,Xianwei Zhuang,Minghua He,Aofan Liu,Zihan Xiong,Dongchao Yang
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:Contrastive language-audio pretraining, which aims to unify multimodal representations in a shared embedding space, serves as a cornerstone for building a wide range of applications, from cross-modal retrieval to cutting-edge multimodal large language models. However, we find that the perpendicular component of the pushing force from negative samples in contrastive learning is a double-edged sword: it contains rich supplementary information from negative samples, yet its unconstrained nature causes optimization trajectory drift and training instability. To address this, we propose Support Vector Regularization (SVR), a method that introduces an auxiliary support vector to control this perpendicular component, aiming to harness its rich information while mitigating the associated trajectory drift. The efficacy of SVR is critically governed by its semantic radius, for which we explore two unsupervised modeling strategies: direct parameterization and an adaptive radius predictor module enhanced with constraints to improve its predicting accuracy. Extensive experimental results demonstrate that our method surpasses widely used baselines like InfoNCE and SigLIP loss across classification, monolingual retrieval, and multilingual retrieval on standard audio-text datasets. Both the theoretical analysis and the experimental results on optimizing trajectory drift validate the correctness and effectiveness of our SVR method.
zh
[AI-31] Who Gets Cited Most? Benchmarking Long-Context Language Models on Scientific Articles
【速读】:该论文旨在解决当前长上下文推理评估基准在科学文献场景下的不足,即现有基准多依赖非科学文本、任务过于简单或使用人工构造的上下文,难以真实反映大语言模型(LLM)在复杂科学信息整合与推理方面的能力。其解决方案的关键在于提出SciTrek,一个基于科学文章构建的问答基准,通过将问题形式化为对文章元数据(标题、作者、参考文献)数据库的SQL查询,自动生成需要跨多篇全文进行信息聚合与合成的复杂问题及其可验证的推理步骤;该方法支持高达100万token的上下文长度且仅需少量监督,从而实现了对LLM长上下文推理能力的精细评估与系统性分析。
链接: https://arxiv.org/abs/2509.21028
作者: Miao Li,Alexander Gurung,Irina Saparina,Mirella Lapata
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 31 pages
Abstract:This paper introduces SciTrek, a novel question-answering benchmark designed to evaluate the long-context reasoning capabilities of large language models (LLMs) using scientific articles. Current long-context benchmarks often rely on non-scientific texts, focus on simple information retrieval tasks, or employ artificial contexts. SciTrek addresses these limitations by proposing complex questions that require information aggregation and synthesis across multiple full-text scientific articles. Questions and their ground-truth answers are automatically generated by formulating them as SQL queries over a database constructed from article metadata (titles, authors, and references). The SQL operations provide explicit, verifiable reasoning steps for fine-grained error analysis, and the construction process scales to contexts up to 1M tokens with minimal supervision. Extensive experiments on a diverse set of open-weight and proprietary LLMs demonstrate that SciTrek poses a significant challenge as the context length increases, with supervised fine-tuning and reinforcement learning offering only limited gains. Our analysis reveals systematic shortcomings in models’ abilities to perform basic numerical operations and accurately locate specific information in long contexts.
zh
[AI-32] Efficient Ensemble Conditional Independence Test Framework for Causal Discovery
【速读】:该论文旨在解决约束型因果发现(constraint-based causal discovery)中因条件独立性检验(Conditional Independence Test, CIT)计算复杂度高而导致的实际应用受限问题,尤其在样本量增大时CIT的时间复杂度显著上升。解决方案的关键在于提出一种通用且即插即用的框架——集成条件独立性检验(Ensemble Conditional Independence Test, E-CIT),其核心思想是采用“分而治之-聚合”策略:将数据划分为若干子集,在每个子集上独立执行基础CIT,并利用基于稳定分布特性的新p值合并方法对结果进行整合。该方法可将基础CIT的计算复杂度从原始非线性降低至与样本量呈线性关系(当子集大小固定时),同时在较弱条件下提供理论一致性保障,实验表明其在降低计算负担的同时保持甚至提升因果发现性能,尤其在复杂测试场景和真实数据上表现优异。
链接: https://arxiv.org/abs/2509.21021
作者: Zhengkang Guan,Kun Kuang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Constraint-based causal discovery relies on numerous conditional independence tests (CITs), but its practical applicability is severely constrained by the prohibitive computational cost, especially as CITs themselves have high time complexity with respect to the sample size. To address this key bottleneck, we introduce the Ensemble Conditional Independence Test (E-CIT), a general and plug-and-play framework. E-CIT operates on an intuitive divide-and-aggregate strategy: it partitions the data into subsets, applies a given base CIT independently to each subset, and aggregates the resulting p-values using a novel method grounded in the properties of stable distributions. This framework reduces the computational complexity of a base CIT to linear in the sample size when the subset size is fixed. Moreover, our tailored p-value combination method offers theoretical consistency guarantees under mild conditions on the subtests. Experimental results demonstrate that E-CIT not only significantly reduces the computational burden of CITs and causal discovery but also achieves competitive performance. Notably, it exhibits an improvement in complex testing scenarios, particularly on real-world datasets.
zh
[AI-33] he Use of the Simplex Architecture to Enhance Safety in Deep-Learning-Powered Autonomous Systems
【速读】:该论文旨在解决基于学习的自主系统(如机器人和自动驾驶车辆)中神经网络不可信的问题,包括异常样本、分布偏移、对抗攻击等威胁,以及在富操作系统上加速推理时带来的时序不可预测性和安全漏洞。其解决方案的关键在于提出一种软件架构,通过类型1实时虚拟机(Type-1 Real-Time Hypervisor)将系统划分为两个隔离的执行域:一个运行在不信任的富操作系统上的神经网络模块,另一个运行在具备实时约束处理能力的安全关键域中的备份模块。两域间通过快速且可预测的跨域通信协作,并由安全监控器(safety monitor)持续评估系统状态,在检测到不可信行为时触发故障切换机制,启用更简单但更安全的备用控制模块,从而实现容错与安全保障。
链接: https://arxiv.org/abs/2509.21014
作者: Federico Nesti,Niko Salamini,Mauro Marinoni,Giorgio Maria Cicero,Gabriele Serra,Alessandro Biondi,Giorgio Buttazzo
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注:
Abstract:Recently, the outstanding performance reached by neural networks in many tasks has led to their deployment in autonomous systems, such as robots and vehicles. However, neural networks are not yet trustworthy, being prone to different types of misbehavior, such as anomalous samples, distribution shifts, adversarial attacks, and other threats. Furthermore, frameworks for accelerating the inference of neural networks typically run on rich operating systems that are less predictable in terms of timing behavior and present larger surfaces for cyber-attacks. To address these issues, this paper presents a software architecture for enhancing safety, security, and predictability levels of learning-based autonomous systems. It leverages two isolated execution domains, one dedicated to the execution of neural networks under a rich operating system, which is deemed not trustworthy, and one responsible for running safety-critical functions, possibly under a different operating system capable of handling real-time constraints. Both domains are hosted on the same computing platform and isolated through a type-1 real-time hypervisor enabling fast and predictable inter-domain communication to exchange real-time data. The two domains cooperate to provide a fail-safe mechanism based on a safety monitor, which oversees the state of the system and switches to a simpler but safer backup module, hosted in the safety-critical domain, whenever its behavior is considered untrustworthy. The effectiveness of the proposed architecture is illustrated by a set of experiments performed on two control systems: a Furuta pendulum and a rover. The results confirm the utility of the fall-back mechanism in preventing faults due to the learning component. Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI) Cite as: arXiv:2509.21014 [eess.SY] (or arXiv:2509.21014v1 [eess.SY] for this version) https://doi.org/10.48550/arXiv.2509.21014 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Federico Nesti [view email] [v1] Thu, 25 Sep 2025 11:20:47 UTC (2,466 KB)
zh
[AI-34] Predicting LLM Reasoning Performance with Small Proxy Model
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理能力(reasoning capability)优化过程中因预训练成本过高而难以高效迭代的问题。针对小模型代理(small proxy models)在推理任务上表现不佳的挑战,作者提出rBridge方法,其核心在于通过双重对齐机制提升小模型(≤1B参数)对大模型推理能力的预测能力:一是使小模型更贴近大模型的预训练目标(pre-training objective),二是增强其与目标任务的对齐度(task alignment)。具体实现上,rBridge采用基于推理轨迹(reasoning traces)的黄金标签,对负对数似然进行加权,从而显著提升小模型在低资源条件下的推理预测准确性。实验表明,rBridge在多个推理基准上实现了优于现有方法的强相关性,并大幅降低数据集排序成本(>100倍),且具备零样本跨预训练数据集的泛化能力,为低成本探索推理导向的预训练提供了可行路径。
链接: https://arxiv.org/abs/2509.21013
作者: Woosung Koh,Juyoung Suk,Sungjun Han,Se-Young Yun,Jay Shin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Pre-print
Abstract:Given the prohibitive cost of pre-training large language models, it is essential to leverage smaller proxy models to optimize datasets before scaling up. However, this approach becomes challenging for reasoning capabilities, which exhibit emergent behavior that only appear reliably at larger model sizes, often exceeding 7B parameters. To address this, we introduce rBridge, showing that small proxies ( \leq 1B) can effectively predict large-model reasoning by aligning more closely with (1) the pre-training objective and (2) the target task. rBridge achieves this by weighting negative log-likelihood with task alignment, using reasoning traces from frontier models as gold labels. In our experiments, rBridge (i) reduces dataset ranking costs by over 100x relative to the best baseline, (ii) achieves the strongest correlation across six reasoning benchmarks at 1B to 32B scale, and (iii) zero-shot transfers predictive relationships across pre-training datasets at 1B to 7B scale. These findings indicate that rBridge offers a practical path for exploring reasoning-oriented pre-training at lower cost.
zh
[AI-35] Automatic Red Teaming LLM -based Agents with Model Context Protocol Tools
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)代理在采用模型上下文协议(Model Context Protocol, MCP)工具后所面临的工具投毒攻击(tool poisoning attacks)问题,即恶意MCP工具可能被用来操纵LLM代理的行为,而现有防御机制难以有效检测此类攻击。解决方案的关键在于提出AutoMalTool——一个自动化的红队框架,通过生成恶意MCP工具来系统性地测试和揭示LLM代理的安全漏洞,实验证明该框架能够生成规避当前检测机制的恶意工具,并有效操控主流LLM代理的行为,从而填补了自动化、系统化红队评估在MCP工具投毒场景下的研究空白。
链接: https://arxiv.org/abs/2509.21011
作者: Ping He,Changjiang Li,Binbin Zhao,Tianyu Du,Shouling Ji
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:The remarkable capability of large language models (LLMs) has led to the wide application of LLM-based agents in various domains. To standardize interactions between LLM-based agents and their environments, model context protocol (MCP) tools have become the de facto standard and are now widely integrated into these agents. However, the incorporation of MCP tools introduces the risk of tool poisoning attacks, which can manipulate the behavior of LLM-based agents. Although previous studies have identified such vulnerabilities, their red teaming approaches have largely remained at the proof-of-concept stage, leaving the automatic and systematic red teaming of LLM-based agents under the MCP tool poisoning paradigm an open question. To bridge this gap, we propose AutoMalTool, an automated red teaming framework for LLM-based agents by generating malicious MCP tools. Our extensive evaluation shows that AutoMalTool effectively generates malicious MCP tools capable of manipulating the behavior of mainstream LLM-based agents while evading current detection mechanisms, thereby revealing new security risks in these agents.
zh
[AI-36] ExMolRL: Phenotype-Target Joint Generation of De Novo Molecules via Multi-Objective Reinforcement Learning
【速读】:该论文旨在解决生成高质量候选分子在AI驱动药物设计中的核心挑战,即现有基于表型(phenotype-based)和基于靶点(target-based)的策略各自存在局限:前者实验成本高,后者忽视细胞层面的系统响应。解决方案的关键在于提出ExMoIRL框架,该框架通过协同整合表型引导与靶点特异性线索实现从头分子生成,其核心创新包括:首先在大规模药物诱导转录谱上预训练表型引导生成器,随后利用多目标强化学习(multi-objective reinforcement learning, RL)进行微调;奖励函数融合了对接亲和力(docking affinity)、类药性评分(drug-likeness score),并引入排序损失(ranking loss)、先验似然正则化(prior-likelihood regularization)和熵最大化(entropy maximization),从而引导模型生成同时具备高活性、多样性及特定表型效应的化学类型(chemotypes)。
链接: https://arxiv.org/abs/2509.21010
作者: Haotian Guo,Hui Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The generation of high-quality candidate molecules remains a central challenge in AI-driven drug design. Current phenotype-based and target-based strategies each suffer limitations, either incurring high experimental costs or overlook system-level cellular responses. To bridge this gap, we propose ExMoIRL, a novel generative framework that synergistically integrates phenotypic and target-specific cues for de novo molecular generation. The phenotype-guided generator is first pretrained on expansive drug-induced transcriptional profiles and subsequently fine-tuned via multi-objective reinforcement learning (RL). Crucially, the reward function fuses docking affinity and drug-likeness scores, augmented with ranking loss, prior-likelihood regularization, and entropy maximization. The multi-objective RL steers the model toward chemotypes that are simultaneously potent, diverse, and aligned with the specified phenotypic effects. Extensive experiments demonstrate ExMoIRL’s superior performance over state-of-the-art phenotype-based and target-based models across multiple well-characterized targets. Our generated molecules exhibit favorable drug-like properties, high target affinity, and inhibitory potency (IC50) against cancer cells. This unified framework showcases the synergistic potential of combining phenotype-guided and target-aware strategies, offering a more effective solution for de novo drug discovery.
zh
[AI-37] AnywhereVLA: Language-Conditioned Exploration and Mobile Manipulation
【速读】:该论文旨在解决在未见过且不可预测的室内环境中实现自然语言驱动的拾取与放置任务(natural language pick-and-place)问题。解决方案的关键在于提出了一种模块化框架 AnywhereVLA,其核心由两部分组成:一是基于经典SLAM(Simultaneous Localization and Mapping)与语义地图构建的环境理解模块,通过文本提示生成结构化的任务图并引导任务感知的前沿探索策略;二是采用微调后的轻量级视觉语言动作(SmolVLA)操作头,结合平台特定的拾取-放置轨迹进行训练,以实现对局部视觉上下文和子目标的精准抓取与放置决策。该系统完全运行于消费级硬件之上,兼顾实时性能与任务泛化能力,实现了几何导航可靠性与语言条件操作敏捷性的融合。
链接: https://arxiv.org/abs/2509.21006
作者: Konstantin Gubernatorov,Artem Voronov,Roman Voronov,Sergei Pasynkov,Stepan Perminov,Ziang Guo,Dzmitry Tsetserukou
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:We address natural language pick-and-place in unseen, unpredictable indoor environments with AnywhereVLA, a modular framework for mobile manipulation. A user text prompt serves as an entry point and is parsed into a structured task graph that conditions classical SLAM with LiDAR and cameras, metric semantic mapping, and a task-aware frontier exploration policy. An approach planner then selects visibility and reachability aware pre grasp base poses. For interaction, a compact SmolVLA manipulation head is fine tuned on platform pick and place trajectories for the SO-101 by TheRobotStudio, grounding local visual context and sub-goals into grasp and place proposals. The full system runs fully onboard on consumer-level hardware, with Jetson Orin NX for perception and VLA and an Intel NUC for SLAM, exploration, and control, sustaining real-time operation. We evaluated AnywhereVLA in a multi-room lab under static scenes and normal human motion. In this setting, the system achieves a 46% overall task success rate while maintaining throughput on embedded compute. By combining a classical stack with a fine-tuned VLA manipulation, the system inherits the reliability of geometry-based navigation with the agility and task generalization of language-conditioned manipulation.
zh
[AI-38] Lossless Compression: A New Benchmark for Time Series Model Evaluation
【速读】:该论文旨在解决当前时间序列模型评估体系的局限性问题,即传统任务(如预测、插补、异常检测和分类)仅衡量模型在特定任务上的表现,而无法严格检验模型是否捕获了数据的完整生成分布。其解决方案的关键在于引入**无损压缩(lossless compression)**作为新的评估范式,基于香农信源编码定理,建立最优压缩长度与负对数似然之间的直接等价关系,从而提供一个严格且统一的信息论标准来衡量模型的建模能力。该方法通过定义标准化评估协议和指标,并开源了TSCom-Bench框架,使得先进时间序列模型可快速适配为无损压缩骨干网络,实验表明压缩性能能揭示经典基准忽略的分布弱点,使无损压缩成为补充和扩展现有评估体系的原理性任务。
链接: https://arxiv.org/abs/2509.21002
作者: Meng Wan,Benxi Tian,Jue Wang,Cui Hui,Ningming Nie,Tiantian Liu,Zongguo Wang,Cao Rongqiang,Peng Shi,Yangang Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 24 pages
Abstract:The evaluation of time series models has traditionally focused on four canonical tasks: forecasting, imputation, anomaly detection, and classification. While these tasks have driven significant progress, they primarily assess task-specific performance and do not rigorously measure whether a model captures the full generative distribution of the data. We introduce lossless compression as a new paradigm for evaluating time series models, grounded in Shannon’s source coding theorem. This perspective establishes a direct equivalence between optimal compression length and the negative log-likelihood, providing a strict and unified information-theoretic criterion for modeling capacity. Then We define a standardized evaluation protocol and metrics. We further propose and open-source a comprehensive evaluation framework TSCom-Bench, which enables the rapid adaptation of time series models as backbones for lossless compression. Experiments across diverse datasets on state-of-the-art models, including TimeXer, iTransformer, and PatchTST, demonstrate that compression reveals distributional weaknesses overlooked by classic benchmarks. These findings position lossless compression as a principled task that complements and extends existing evaluation for time series modeling.
zh
[AI-39] CORE: Full-Path Evaluation of LLM Agents Beyond Final State NEURIPS2025
【速读】:该论文旨在解决当前评估AI代理(AI agent)在真实世界任务中通过工具调用序列执行能力时存在的不足问题,现有评估基准往往仅以最终状态的二元判断为基础,忽略了安全性、效率及中间步骤正确性等关键维度。其解决方案的核心在于构建一个基于确定性有限自动机(Deterministic Finite Automata, DFA)的框架,将任务编码为一组有效的工具使用路径集合,从而实现对代理行为在不同世界模型中的系统性评估;在此基础上提出CORE指标套件,包括路径正确性、路径正确性-Kendall’s tau综合指标、前缀临界性、有害调用率和效率五个维度,量化代理行为与预期执行模式的一致性,揭示传统终态评估无法识别的性能差异。
链接: https://arxiv.org/abs/2509.20998
作者: Panagiotis Michelakis,Yiannis Hadjiyiannis,Dimitrios Stamoulis
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted: LAW 2025 Workshop NeurIPS 2025
Abstract:Evaluating AI agents that solve real-world tasks through function-call sequences remains an open challenge. Existing agentic benchmarks often reduce evaluation to a binary judgment of the final state, overlooking critical aspects such as safety, efficiency, and intermediate correctness. We propose a framework based on deterministic finite automata (DFAs) that encodes tasks as sets of valid tool-use paths, enabling principled assessment of agent behavior in diverse world models. Building on this foundation, we introduce CORE, a suite of five metrics, namely Path Correctness, Path Correctness - Kendall’s tau Composite, Prefix Criticality, Harmful-Call Rate, and Efficiency, that quantify alignment with expected execution patterns. Across diverse worlds, our method reveals important performance differences between agents that would otherwise appear equivalent under traditional final-state evaluation schemes.
zh
[AI-40] Rejuvenating Cross-Entropy Loss in Knowledge Distillation for Recommender Systems
【速读】:该论文旨在解决知识蒸馏(Knowledge Distillation, KD)在推荐系统中因交叉熵(Cross-Entropy, CE)损失函数与排序指标NDCG之间理论关联不明确而导致的性能瓶颈问题。具体而言,现有方法在对教师模型生成的排名进行蒸馏时,通常仅在学生模型预测的子集上计算CE损失,但这种做法隐含要求被蒸馏的项目子集必须包含学生模型的Top项(即“闭包假设”),而这一条件与实际目标——蒸馏教师模型Top项目的排名——存在显著偏差,导致理论支持与实践目标脱节。解决方案的关键在于提出一种名为Rejuvenated Cross-Entropy for Knowledge Distillation (RCE-KD) 的新方法:首先将教师模型的Top项目划分为两类——一类是学生模型也高度排名的项目(满足闭包假设),另一类则否;对于后者,引入基于师生协作的采样策略以逼近闭包假设,并自适应地融合两部分的损失函数,从而在保持理论可解释性的同时有效提升推荐系统的排序性能。
链接: https://arxiv.org/abs/2509.20989
作者: Zhangchi Zhu,Wei Zhang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper analyzes Cross-Entropy (CE) loss in knowledge distillation (KD) for recommender systems. KD for recommender systems targets at distilling rankings, especially among items most likely to be preferred, and can only be computed on a small subset of items. Considering these features, we reveal the connection between CE loss and NDCG in the field of KD. We prove that when performing KD on an item subset, minimizing CE loss maximizes the lower bound of NDCG, only if an assumption of closure is satisfied. It requires that the item subset consists of the student’s top items. However, this contradicts our goal of distilling rankings of the teacher’s top items. We empirically demonstrate the vast gap between these two kinds of top items. To bridge the gap between our goal and theoretical support, we propose Rejuvenated Cross-Entropy for Knowledge Distillation (RCE-KD). It splits the top items given by the teacher into two subsets based on whether they are highly ranked by the student. For the subset that defies the condition, a sampling strategy is devised to use teacher-student collaboration to approximate our assumption of closure. We also combine the losses on the two subsets adaptively. Extensive experiments demonstrate the effectiveness of our method. Our code is available at this https URL.
zh
[AI-41] AOT*: Efficient Synthesis Planning via LLM -Empowered AND-OR Tree Search
【速读】:该论文旨在解决多步逆合成规划(retrosynthesis planning)中因搜索空间指数级增长和推理成本高昂而导致的计算挑战,同时克服大型语言模型(LLM)在合成路径规划中效率低、成本高的局限性。其解决方案的关键在于提出AOT框架,该框架通过将LLM生成的化学合成路径原子化地映射到AND-OR树结构组件上,并设计了数学上严谨的奖励分配策略与基于检索的上下文工程方法,从而显著提升LLM在化学空间中的导航效率。实验表明,AOT在多个基准测试中达到当前最优(SOTA)性能,且所需迭代次数仅为现有基于LLM方法的3–5倍,尤其在复杂分子目标上效率优势更加明显。
链接: https://arxiv.org/abs/2509.20988
作者: Xiaozhuang Song,Xuanhao Pan,Xinjian Zhao,Hangting Ye,Shufei Zhang,Jian Tang,Tianshu Yu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 34 pages, 21 figures
Abstract:Retrosynthesis planning enables the discovery of viable synthetic routes for target molecules, playing a crucial role in domains like drug discovery and materials design. Multi-step retrosynthetic planning remains computationally challenging due to exponential search spaces and inference costs. While Large Language Models (LLMs) demonstrate chemical reasoning capabilities, their application to synthesis planning faces constraints on efficiency and cost. To address these challenges, we introduce AOT*, a framework that transforms retrosynthetic planning by integrating LLM-generated chemical synthesis pathways with systematic AND-OR tree search. To this end, AOT* atomically maps the generated complete synthesis routes onto AND-OR tree components, with a mathematically sound design of reward assignment strategy and retrieval-based context engineering, thus enabling LLMs to efficiently navigate in the chemical space. Experimental evaluation on multiple synthesis benchmarks demonstrates that AOT* achieves SOTA performance with significantly improved search efficiency. AOT* exhibits competitive solve rates using 3-5 \times fewer iterations than existing LLM-based approaches, with the efficiency advantage becoming more pronounced on complex molecular targets.
zh
[AI-42] FracAug: Fractional Augmentation boost Graph-level Anomaly Detection under Limited Supervision
【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在图级异常检测(Graph-level Anomaly Detection, GAD)任务中因标注成本高和数据集不平衡导致的性能瓶颈问题。其解决方案的关键在于提出一种名为FracAug的即插即用增强框架,该框架通过学习图内语义并生成语义一致的分数级图变体(fractional variants),利用一种新颖的加权距离感知边界损失(weighted distance-aware margin loss)来捕捉多尺度拓扑结构,从而生成多样且语义保持的图样本;同时,借助原始图与增强图之间的预测互验证机制进行伪标签(pseudo-labeling)以迭代扩展训练集,实现对不平衡数据的有效建模与性能提升。
链接: https://arxiv.org/abs/2509.20978
作者: Xiangyu Dong,Xingyi Zhang,Sibo Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Graph-level anomaly detection (GAD) is critical in diverse domains such as drug discovery, yet high labeling costs and dataset imbalance hamper the performance of Graph Neural Networks (GNNs). To address these issues, we propose FracAug, an innovative plug-in augmentation framework that enhances GNNs by generating semantically consistent graph variants and pseudo-labeling with mutual verification. Unlike previous heuristic methods, FracAug learns semantics within given graphs and synthesizes fractional variants, guided by a novel weighted distance-aware margin loss. This captures multi-scale topology to generate diverse, semantic-preserving graphs unaffected by data imbalance. Then, FracAug utilizes predictions from both original and augmented graphs to pseudo-label unlabeled data, iteratively expanding the training set. As a model-agnostic module compatible with various GNNs, FracAug demonstrates remarkable universality and efficacy: experiments across 14 GNNs on 12 real-world datasets show consistent gains, boosting average AUROC, AUPRC, and F1-score by up to 5.72%, 7.23%, and 4.18%, respectively.
zh
[AI-43] Knowledgeable Language Models as Black-Box Optimizers for Personalized Medicine
【速读】:该论文旨在解决个性化医疗中治疗方案优化的问题,即如何在不直接试验候选治疗的情况下,基于患者的个体遗传和环境因素,高效地发现最优治疗策略。传统方法依赖于仿真代理模型(surrogate model),但这些模型难以泛化到未见过的患者-治疗组合。解决方案的关键在于引入领域特定先验知识(如医学教科书和生物医学知识图谱),利用大语言模型(LLM)作为无需任务微调的黑盒优化器,通过“提示优化”(optimization by prompting)机制,将非结构化领域知识转化为自然语言形式的个性化治疗建议,从而实现更准确、可解释且适应性强的个体化治疗设计。
链接: https://arxiv.org/abs/2509.20975
作者: Michael S. Yao,Osbert Bastani,Alma Andersson,Tommaso Biancalani,Aïcha Bentaieb,Claudia Iriondo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 56 pages
Abstract:The goal of personalized medicine is to discover a treatment regimen that optimizes a patient’s clinical outcome based on their personal genetic and environmental factors. However, candidate treatments cannot be arbitrarily administered to the patient to assess their efficacy; we often instead have access to an in silico surrogate model that approximates the true fitness of a proposed treatment. Unfortunately, such surrogate models have been shown to fail to generalize to previously unseen patient-treatment combinations. We hypothesize that domain-specific prior knowledge - such as medical textbooks and biomedical knowledge graphs - can provide a meaningful alternative signal of the fitness of proposed treatments. To this end, we introduce LLM-based Entropy-guided Optimization with kNowledgeable priors (LEON), a mathematically principled approach to leverage large language models (LLMs) as black-box optimizers without any task-specific fine-tuning, taking advantage of their ability to contextualize unstructured domain knowledge to propose personalized treatment plans in natural language. In practice, we implement LEON via ‘optimization by prompting,’ which uses LLMs as stochastic engines for proposing treatment designs. Experiments on real-world optimization tasks show LEON outperforms both traditional and LLM-based methods in proposing individualized treatments for patients.
zh
[AI-44] Dual-Path Phishing Detection: Integrating Transformer-Based NLP with Structural URL Analysis CCS
【速读】:该论文旨在解决日益复杂且具有欺骗性的钓鱼邮件(phishing emails)检测问题,传统方法因仅依赖邮件内容或嵌入链接的孤立分析而难以应对这些多维度攻击。其解决方案的关键在于提出一种双路径(dual-path)检测框架,融合基于Transformer的自然语言处理(NLP)与经典机器学习技术:一方面利用微调后的DistilBERT模型进行语义层面的文本分析,另一方面通过字符级TF-IDF向量化结合随机森林(Random Forest)对嵌入URL进行结构特征识别。该设计充分利用了深度学习在语义理解上的优势和经典算法在结构化特征建模上的稳定性与可解释性,从而显著提升整体检测准确率,并具备模块化部署灵活性,适用于实际邮件安全场景。
链接: https://arxiv.org/abs/2509.20972
作者: Ibrahim Altan,Abdulla Bachir,Yousuf Parbhulkar,Abdul Muksith Rizvi,Moshiur Farazi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Paper accepted for presentation at the ACS/IEEE 22nd International Conference on Computer Systems and Applications (AICCSA 2025)
Abstract:Phishing emails pose a persistent and increasingly sophisticated threat, undermining email security through deceptive tactics designed to exploit both semantic and structural vulnerabilities. Traditional detection methods, often based on isolated analysis of email content or embedded URLs, fail to comprehensively address these evolving attacks. In this paper, we propose a dual-path phishing detection framework that integrates transformer-based natural language processing (NLP) with classical machine learning to jointly analyze email text and embedded URLs. Our approach leverages the complementary strengths of semantic analysis using fine-tuned transformer architectures (e.g., DistilBERT) and structural link analysis via character-level TF-IDF vectorization paired with classical classifiers (e.g., Random Forest). Empirical evaluation on representative email and URL datasets demonstrates that this combined approach significantly improves detection accuracy. Specifically, the DistilBERT model achieves a near-optimal balance between accuracy and computational efficiency for textual phishing detection, while Random Forest notably outperforms other classical classifiers in identifying malicious URLs. The modular design allows flexibility for standalone deployment or ensemble integration, facilitating real-world adoption. Collectively, our results highlight the efficacy and practical value of this dual-path approach, establishing a scalable, accurate, and interpretable solution capable of enhancing email security against contemporary phishing threats.
zh
[AI-45] -LAVA: Insights on Low Latency Voice-2-Voice Architecture for Agents
【速读】:该论文旨在解决语音到语音(Voice-to-Voice, V-2-V)通信系统在实时对话应用中的延迟优化问题,目标是在保持高质量交互的前提下降低处理延迟。解决方案的关键在于识别并优化影响实时因子(Real Time Factor, RTF)的核心组件——特别是文本转语音(Text-to-Speech, TTS)模块。研究发现,TTS生成具有情感和自然停顿的逼真语音对RTF影响最大;通过减少TTS解码器中残差向量量化(Residual Vector Quantization, RVQ)迭代次数及优化Mimi模型所用码本(codebooks),可在不显著牺牲语音质量的情况下实现显著的延迟降低,从而提升V-2-V系统的实时性能。
链接: https://arxiv.org/abs/2509.20971
作者: Anupam Purwar,Aditya Choudhary
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: This paper analyzes a low-latency, end-to-end voice-to-voice (V-2-V) architecture, identifying that the Text-to-Speech (TTS) component has the highest impact on real-time performance. By reducing the number of Residual Vector Quantization (RVQ) iterations in the TTS model, latency can be effectively halved, creating a direct trade-off between conversational speed and audio quality
Abstract:We experiment with a low-latency, end-to-end voice-to-voice communication model to optimize it for real-time conversational applications. By analyzing components essential to voice to voice (V-2-V) system viz. automatic speech recognition (ASR), text-to-speech (TTS), and dialog management, our work analyzes how to reduce processing time while maintaining high-quality interactions to identify the levers for optimizing V-2-V system. Our work identifies that TTS component which generates life-like voice, full of emotions including natural pauses and exclamations has highest impact on Real time factor (RTF). The experimented V-2-V architecture utilizes CSM1b has the capability to understand tone as well as context of conversation by ingesting both audio and text of prior exchanges to generate contextually accurate speech. We explored optimization of Residual Vector Quantization (RVQ) iterations by the TTS decoder which come at a cost of decrease in the quality of voice generated. Our experimental evaluations also demonstrate that for V-2-V implementations based on CSM most important optimizations can be brought by reducing the number of RVQ Iterations along with the codebooks used in Mimi.
zh
[AI-46] Beyond Stars: Bridging the Gap Between Ratings and Review Sentiment with LLM CCS
【速读】:该论文旨在解决传统星级评分系统在移动应用评论分析中无法有效捕捉文本细节反馈的问题,尤其针对基于词典的自然语言处理(Natural Language Processing, NLP)方法和经典机器学习分类器在理解上下文语义、领域特定术语及讽刺等细微语言特征方面的局限性。解决方案的关键在于提出一种模块化框架,利用大语言模型(Large Language Models, LLMs)并结合结构化提示(structured prompting)技术,实现对评分与文本情感之间差异的量化分析、细粒度的功能级洞察提取,并通过检索增强型对话问答(Retrieval-Augmented Conversational Question Answering, RAG-QA)支持评论的交互式探索,从而在多个复杂场景下显著提升准确性、鲁棒性和可操作性。
链接: https://arxiv.org/abs/2509.20953
作者: Najla Zuhir,Amna Mohammad Salim,Parvathy Premkumar,Moshiur Farazi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Paper accepted for presentation at ACS/IEEE 22nd International Conference on Computer Systems and Applications (AICCSA 2025)
Abstract:We present an advanced approach to mobile app review analysis aimed at addressing limitations inherent in traditional star-rating systems. Star ratings, although intuitive and popular among users, often fail to capture the nuanced feedback present in detailed review texts. Traditional NLP techniques – such as lexicon-based methods and classical machine learning classifiers – struggle to interpret contextual nuances, domain-specific terminology, and subtle linguistic features like sarcasm. To overcome these limitations, we propose a modular framework leveraging large language models (LLMs) enhanced by structured prompting techniques. Our method quantifies discrepancies between numerical ratings and textual sentiment, extracts detailed, feature-level insights, and supports interactive exploration of reviews through retrieval-augmented conversational question answering (RAG-QA). Comprehensive experiments conducted on three diverse datasets (AWARE, Google Play, and Spotify) demonstrate that our LLM-driven approach significantly surpasses baseline methods, yielding improved accuracy, robustness, and actionable insights in challenging and context-rich review scenarios.
zh
[AI-47] Flow Matching in the Low-Noise Regime: Pathologies and a Contrastive Remedy
【速读】:该论文旨在解决流匹配(Flow Matching)在低噪声区域存在的根本性不稳定性问题,即当噪声水平趋近于零时,输入的微小扰动会导致速度目标产生剧烈变化,从而使学习问题的条件数发散,进而减缓优化过程并损害语义表示质量。解决方案的关键在于提出局部对比流(Local Contrastive Flow, LCF),这是一种混合训练协议:在低噪声水平下,用对比特征对齐替代直接的速度回归,而在中高噪声水平下仍保留标准流匹配训练方式,从而有效提升收敛速度并稳定表示性能。
链接: https://arxiv.org/abs/2509.20952
作者: Weili Zeng,Yichao Yan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Flow matching has recently emerged as a powerful alternative to diffusion models, providing a continuous-time formulation for generative modeling and representation learning. Yet, we show that this framework suffers from a fundamental instability in the low-noise regime. As noise levels approach zero, arbitrarily small perturbations in the input can induce large variations in the velocity target, causing the condition number of the learning problem to diverge. This ill-conditioning not only slows optimization but also forces the encoder to reallocate its limited Jacobian capacity toward noise directions, thereby degrading semantic representations. We provide the first theoretical analysis of this phenomenon, which we term the low-noise pathology, establishing its intrinsic link to the structure of the flow matching objective. Building on these insights, we propose Local Contrastive Flow (LCF), a hybrid training protocol that replaces direct velocity regression with contrastive feature alignment at small noise levels, while retaining standard flow matching at moderate and high noise. Empirically, LCF not only improves convergence speed but also stabilizes representation quality. Our findings highlight the critical importance of addressing low-noise pathologies to unlock the full potential of flow matching for both generation and representation learning.
zh
[AI-48] CTI Dataset Construction from Telegram
【速读】:该论文旨在解决当前网络安全领域中高质量威胁情报(Cyber Threat Intelligence, CTI)数据集稀缺的问题,以支持模型训练、评估与基准测试。其核心挑战在于如何从动态变化的网络环境中高效获取结构化且高准确率的恶意指标(Indicators of Compromise, IoCs),如域名、IP地址、URL、哈希值和CVE漏洞信息。解决方案的关键在于构建一个端到端自动化管道:首先系统性地识别并抓取Telegram上12个精选频道中的145,349条消息,随后利用基于BERT的分类器对内容进行精准过滤,实现96.64%的准确率,最终从中提取出86,509条高质量恶意IoC,形成大规模、高保真度的CTI数据集,为后续研究与实际检测应用奠定基础。
链接: https://arxiv.org/abs/2509.20943
作者: Dincy R. Arikkat,Sneha B. T.,Serena Nicolazzo,Antonino Nocera,Vinod P.,Rafidha Rehiman K. A.,Karthika R
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注:
Abstract:Cyber Threat Intelligence (CTI) enables organizations to anticipate, detect, and mitigate evolving cyber threats. Its effectiveness depends on high-quality datasets, which support model development, training, evaluation, and benchmarking. Building such datasets is crucial, as attack vectors and adversary tactics continually evolve. Recently, Telegram has gained prominence as a valuable CTI source, offering timely and diverse threat-related information that can help address these challenges. In this work, we address these challenges by presenting an end-to-end automated pipeline that systematically collects and filters threat-related content from Telegram. The pipeline identifies relevant Telegram channels and scrapes 145,349 messages from 12 curated channels out of 150 identified sources. To accurately filter threat intelligence messages from generic content, we employ a BERT-based classifier, achieving an accuracy of 96.64%. From the filtered messages, we compile a dataset of 86,509 malicious Indicators of Compromise, including domains, IPs, URLs, hashes, and CVEs. This approach not only produces a large-scale, high-fidelity CTI dataset but also establishes a foundation for future research and operational applications in cyber threat detection.
zh
[AI-49] GALAX: Graph-Augmented Language Model for Explainable Reinforcement-Guided Subgraph Reasoning in Precision Medicine
【速读】:该论文旨在解决精准医学中多组学信号(quantitative multi-omic features)、拓扑结构(topological context)与生物文献知识(textual biological knowledge)难以协同建模的问题,现有方法要么忽略拓扑信息,要么缺乏定量推理能力,或无法实现机制可解释的目标。其解决方案的关键在于提出GALAX框架,通过将预训练图神经网络(pretrained Graph Neural Networks, GNNs)以强化学习方式嵌入大型语言模型(Large Language Models, LLMs),利用图过程奖励模型(Graph Process Reward Model, GPRM)对逐步生成的子图进行过程级监督,从而在无需显式中间推理标注的情况下,实现从数值证据、拓扑知识到文本语境的统一建模,支撑可解释的子图推理,提升疾病相关靶点与通路发现的可靠性与可解释性。
链接: https://arxiv.org/abs/2509.20935
作者: Heming Zhang,Di Huang,Wenyu Li,Michael Province,Yixin Chen,Philip Payne,Fuhai Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:In precision medicine, quantitative multi-omic features, topological context, and textual biological knowledge play vital roles in identifying disease-critical signaling pathways and targets. Existing pipelines capture only part of these-numerical omics ignore topological context, text-centric LLMs lack quantitative grounded reasoning, and graph-only models underuse node semantics and the generalization of LLMs-limiting mechanistic interpretability. Although Process Reward Models (PRMs) aim to guide reasoning in LLMs, they remain limited by unreliable intermediate evaluation, and vulnerability to reward hacking with computational cost. These gaps motivate integrating quantitative multi-omic signals, topological structure with node annotations, and literature-scale text via LLMs, using subgraph reasoning as the principle bridge linking numeric evidence, topological knowledge and language context. Therefore, we propose GALAX (Graph Augmented LAnguage model with eXplainability), an innovative framework that integrates pretrained Graph Neural Networks (GNNs) into Large Language Models (LLMs) via reinforcement guided by a Graph Process Reward Model (GPRM), which generates disease-relevant subgraphs in a step-wise manner initiated by an LLM and iteratively evaluated by a pretrained GNN, enabling process-level supervision without explicit intermediate reasoning annotations. As an application, we also introduced Target-QA, a benchmark combining CRISPR-identified targets, multi-omic profiles, and biomedical graph knowledge across diverse cancer cell lines, which enables GNN pretraining for supervising step-wise graph construction and supports long-context reasoning over text-numeric graphs (TNGs), providing a scalable and biologically grounded framework for explainable, reinforcement-guided subgraph reasoning toward reliable and interpretable target and pathway discovery in precision medicine.
zh
[AI-50] Deep Learning for Crime Forecasting: The Role of Mobility at Fine-grained Spatiotemporal Scales
【速读】:该论文旨在解决如何通过融合微观层面的人类移动性特征(mobility features),结合历史犯罪数据与社会人口统计学信息,以提升在细粒度时空尺度下犯罪预测的准确性问题。其解决方案的关键在于构建并训练一种基于卷积长短期记忆网络(Convolutional Long Short-Term Memory, ConvLSTM)的深度学习框架,该模型利用0.077平方英里(约0.2平方公里)网格化空间单元的数据,输入为期14天和2天的时序序列来预测未来12小时内的犯罪发生情况。研究发现,引入移动性特征显著提升了预测性能,尤其在较短输入序列下效果更明显;而将移动性与社会人口统计特征联合使用时,模型在四个美国城市中均取得最高的召回率、精确率和F1分数,表明多源异构数据融合是实现高精度犯罪预测的核心机制。
链接: https://arxiv.org/abs/2509.20913
作者: Ariadna Albors Zumel,Michele Tizzoni,Gian Maria Campedelli
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 64 pages, 33 figures, and 6 tables (including appendix)
Abstract:Objectives: To develop a deep learning framework to evaluate if and how incorporating micro-level mobility features, alongside historical crime and sociodemographic data, enhances predictive performance in crime forecasting at fine-grained spatial and temporal resolutions. Methods: We advance the literature on computational methods and crime forecasting by focusing on four U.S. cities (i.e., Baltimore, Chicago, Los Angeles, and Philadelphia). We employ crime incident data obtained from each city’s police department, combined with sociodemographic data from the American Community Survey and human mobility data from Advan, collected from 2019 to 2023. This data is aggregated into grids with equally sized cells of 0.077 sq. miles (0.2 sq. kms) and used to train our deep learning forecasting model, a Convolutional Long Short-Term Memory (ConvLSTM) network, which predicts crime occurrences 12 hours ahead using 14-day and 2-day input sequences. We also compare its performance against three baseline models: logistic regression, random forest, and standard LSTM. Results: Incorporating mobility features improves predictive performance, especially when using shorter input sequences. Noteworthy, however, the best results are obtained when both mobility and sociodemographic features are used together, with our deep learning model achieving the highest recall, precision, and F1 score in all four cities, outperforming alternative methods. With this configuration, longer input sequences enhance predictions for violent crimes, while shorter sequences are more effective for property crimes. Conclusion: These findings underscore the importance of integrating diverse data sources for spatiotemporal crime forecasting, mobility included. They also highlight the advantages (and limits) of deep learning when dealing with fine-grained spatial and temporal scales. Comments: 64 pages, 33 figures, and 6 tables (including appendix) Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2509.20913 [cs.LG] (or arXiv:2509.20913v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2509.20913 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: Albors Zumel, A., Tizzoni, M., Campedelli, G.M. (2025). Deep Learning for Crime Forecasting: The Role of Mobility at Fine-grained Spatiotemporal Scales. Journal of Quantitative Criminology Related DOI: https://doi.org/10.1007/s10940-025-09629-3 Focus to learn more DOI(s) linking to related resources Submission history From: Ariadna Albors Zumel [view email] [v1] Thu, 25 Sep 2025 08:58:56 UTC (8,204 KB)
zh
[AI-51] DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning
【速读】:该论文旨在解决多模态语言模型(Multimodal Language Models, MLLMs)在视觉-语言推理中存在“虚假推理”(spurious reasoning)的问题,即模型可能通过依赖无关或误导性的图像区域,结合先验知识或数据集偏差得出正确答案,但其推理过程缺乏对图像内容的真实理解,从而影响推理的可信度与可解释性。解决方案的关键在于提出一种名为DeFacto的反事实推理框架,该框架通过三种互补的训练范式——正例(positive)、反事实(counterfactual)和随机掩码(random-masking)——强制模型在回答准确的同时,确保推理过程基于图像中的真实证据。为此,作者构建了一个包含约10万张图像的数据集,自动定位问题相关的视觉证据并生成三类样本,并基于GRPO(Generalized Reward Policy Optimization)强化学习方法设计三个互补奖励信号,引导模型实现高精度且忠实于证据的推理。
链接: https://arxiv.org/abs/2509.20912
作者: Tianrun Xu,Haoda Jing,Ye Li,Yuquan Wei,Jun Feng,Guanyu Chen,Haichuan Gao,Tianren Zhang,Feng Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in multimodal language models (MLLMs) have achieved remarkable progress in vision-language reasoning, especially with the emergence of “thinking with images,” which integrates explicit visual steps into the reasoning process. While this paradigm strengthens image-based reasoning, a significant challenge remains: models may arrive at correct answers by relying on irrelevant or spurious regions, driven by prior knowledge or dataset biases. Even when the answer is correct, flawed reasoning indicates that the model has not truly understood the image, highlighting the critical importance of reasoning fidelity in multimodal tasks. To address this issue, we propose DeFacto, a counterfactual reasoning framework that jointly enforces accurate answering and faithful reasoning. A key component of our approach is the design of three complementary training paradigms: (i) positive, (ii) counterfactual, and (iii) random-masking. To enable these paradigms, we develop a pipeline that automatically localizes question-relevant evidence and constructs positive, counterfactual, and random variants, resulting in a dataset of about 100k images. Building on this framework, we train multimodal language models with GRPO-based reinforcement learning, where we design three complementary rewards to guide the model toward accurate answering and evidence-grounded reasoning. Experiments on diverse benchmarks demonstrate that DeFacto substantially improves both answer accuracy and reasoning faithfulness, establishing a stronger foundation for interpretable multimodal reasoning. The code is available on GitHub and the dataset is released on HuggingFace.
zh
[AI-52] Improving Early Sepsis Onset Prediction Through Federated Learning
【速读】:该论文旨在解决重症监护病房(Intensive Care Unit, ICU)中脓毒症(sepsis)早期准确预测的难题,其核心挑战在于单个医疗机构数据量有限且多样性不足,导致机器学习模型性能受限。解决方案的关键在于提出一种基于联邦学习(Federated Learning, FL)的注意力增强型长短期记忆网络(Attention-enhanced Long Short-Term Memory, LSTM),在多中心ICU数据上进行协同训练,无需共享原始数据即可实现隐私保护下的模型优化。该方法不仅使整体预测性能接近集中式模型,更显著提升了大预测窗口下的早期脓毒症识别能力,并通过引入可变预测时窗(variable prediction horizon)机制,在不显著牺牲性能的前提下降低了计算、通信和组织成本。
链接: https://arxiv.org/abs/2509.20885
作者: Christoph Düsing,Philipp Cimiano
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at the 1st Workshop on Artificial Intelligence for Biomedical Data (AIBio) 2025
Abstract:Early and accurate prediction of sepsis onset remains a major challenge in intensive care, where timely detection and subsequent intervention can significantly improve patient outcomes. While machine learning models have shown promise in this domain, their success is often limited by the amount and diversity of training data available to individual hospitals and Intensive Care Units (ICUs). Federated Learning (FL) addresses this issue by enabling collaborative model training across institutions without requiring data sharing, thus preserving patient privacy. In this work, we propose a federated, attention-enhanced Long Short-Term Memory model for sepsis onset prediction, trained on multi-centric ICU data. Unlike existing approaches that rely on fixed prediction windows, our model supports variable prediction horizons, enabling both short- and long-term forecasting in a single unified model. During analysis, we put particular emphasis on the improvements through our approach in terms of early sepsis detection, i.e., predictions with large prediction windows by conducting an in-depth temporal analysis. Our results prove that using FL does not merely improve overall prediction performance (with performance approaching that of a centralized model), but is particularly beneficial for early sepsis onset prediction. Finally, we show that our choice of employing a variable prediction window rather than a fixed window does not hurt performance significantly but reduces computational, communicational, and organizational overhead.
zh
[AI-53] Model-Based Reinforcement Learning under Random Observation Delays
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在存在随机传感器延迟(random sensor delays)环境下的性能下降问题,尤其是在部分可观测马尔可夫决策过程(Partially Observable Markov Decision Process, POMDP)中,观测可能以非顺序方式到达,而传统RL算法通常假设环境感知是即时的。解决方案的关键在于提出一种基于模型的滤波机制,该机制能够根据接收到的观测流顺序更新信念状态(belief state),并将其集成到模型-based RL框架中,从而构建一个对随机延迟具有鲁棒性的延迟感知(delay-aware)策略。实验表明,该方法在模拟机器人任务中显著优于现有启发式方法和基于马尔可夫决策过程(MDP)设计的延迟感知基线,并能有效应对部署时延迟分布变化的情况。
链接: https://arxiv.org/abs/2509.20869
作者: Armin Karamzade,Kyungmin Kim,JB Lanier,Davide Corsi,Roy Fox
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Delays frequently occur in real-world environments, yet standard reinforcement learning (RL) algorithms often assume instantaneous perception of the environment. We study random sensor delays in POMDPs, where observations may arrive out-of-sequence, a setting that has not been previously addressed in RL. We analyze the structure of such delays and demonstrate that naive approaches, such as stacking past observations, are insufficient for reliable performance. To address this, we propose a model-based filtering process that sequentially updates the belief state based on an incoming stream of observations. We then introduce a simple delay-aware framework that incorporates this idea into model-based RL, enabling agents to effectively handle random delays. Applying this framework to Dreamer, we compare our approach to delay-aware baselines developed for MDPs. Our method consistently outperforms these baselines and demonstrates robustness to delay distribution shifts during deployment. Additionally, we present experiments on simulated robotic tasks, comparing our method to common practical heuristics and emphasizing the importance of explicitly modeling observation delays.
zh
[AI-54] Federated Markov Imputation: Privacy-Preserving Temporal Imputation in Multi-Centric ICU Environments ECML-PKDD
【速读】:该论文旨在解决联邦学习在电子健康记录(Electronic Health Records, EHR)场景中因机构间时间序列数据采样粒度不一致而导致的缺失数据问题。解决方案的关键在于提出一种隐私保护的联邦马尔可夫插补方法(Federated Markov Imputation, FMI),其通过各ICU协作构建全局的时间转移模型,从而实现跨机构的时序数据插补,尤其在采样间隔不规则的情况下显著优于本地插补基线方法。
链接: https://arxiv.org/abs/2509.20867
作者: Christoph Düsing,Philipp Cimiano
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at the 1st International ECML-PKDD Workshop-Tutorial on Learning on Real and Synthetic Medical Time Series Data (MED-TIME)
Abstract:Missing data is a persistent challenge in federated learning on electronic health records, particularly when institutions collect time-series data at varying temporal granularities. To address this, we propose Federated Markov Imputation (FMI), a privacy-preserving method that enables Intensive Care Units (ICUs) to collaboratively build global transition models for temporal imputation. We evaluate FMI on a real-world sepsis onset prediction task using the MIMIC-IV dataset and show that it outperforms local imputation baselines, especially in scenarios with irregular sampling intervals across ICUs.
zh
[AI-55] Robust Multi-Omics Integration from Incomplete Modalities Significantly Improves Prediction of Alzheimers Disease
【速读】:该论文旨在解决多组学(multi-omics)数据中因部分样本缺失某些组学模态(modality)而导致的整合分析困难问题。现有方法在面对不完整的多组学数据时性能受限,难以充分利用所有可用样本进行有效建模。解决方案的关键在于提出MOIRA(Multi-Omics Integration with Robustness to Absent modalities),其核心机制包括两个方面:一是通过表示对齐(representation alignment)将不同组学数据投影到共享嵌入空间(shared embedding space),二是引入可学习的加权机制(learnable weighting mechanism)实现跨模态的自适应融合(adaptive aggregation)。该方法能够有效利用包含缺失模态的样本,从而提升模型在不完整数据下的鲁棒性与泛化能力。
链接: https://arxiv.org/abs/2509.20842
作者: Sungjoon Park,Kyungwook Lee,Soorin Yim,Doyeong Hwang,Dongyun Kim,Soonyoung Lee,Amy Dunn,Daniel Gatti,Elissa Chesler,Kristen O’Connell,Kiyoung Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-omics data capture complex biomolecular interactions and provide insights into metabolism and disease. However, missing modalities hinder integrative analysis across heterogeneous omics. To address this, we present MOIRA (Multi-Omics Integration with Robustness to Absent modalities), an early integration method enabling robust learning from incomplete omics data via representation alignment and adaptive aggregation. MOIRA leverages all samples, including those with missing modalities, by projecting each omics dataset onto a shared embedding space where a learnable weighting mechanism fuses them. Evaluated on the Religious Order Study and Memory and Aging Project (ROSMAP) dataset for Alzheimer’s Disease (AD), MOIRA outperformed existing approaches, and further ablation studies confirmed modality-wise contributions. Feature importance analysis revealed AD-related biomarkers consistent with prior literature, highlighting the biological relevance of our approach.
zh
[AI-56] ImaginationPolicy: Towards Generalizable Precise and Reliable End-to-End Policy for Robotic Manipulation
【速读】:该论文旨在解决现有端到端机器人操作策略在大规模实际部署中性能不足的问题,特别是传统模块化流水线中存在的信息丢失和特征错位等问题。其解决方案的关键在于提出了一种新颖的“移动定向关键点链式表示法”(Chain of Moving Oriented Keypoints, CoMOK),作为神经策略的动作表征方式,支持统一处理多种操作任务,并通过定向关键点实现对不同形状与尺寸物体的自然泛化,同时达到亚厘米级精度,且能有效应对多阶段任务、多模态行为及柔性物体等复杂场景。
链接: https://arxiv.org/abs/2509.20841
作者: Dekun Lu,Wei Gao,Kui Jia
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: First two authors contribute equally. Project page: this https URL
Abstract:End-to-end robot manipulation policies offer significant potential for enabling embodied agents to understand and interact with the world. Unlike traditional modular pipelines, end-to-end learning mitigates key limitations such as information loss between modules and feature misalignment caused by isolated optimization targets. Despite these advantages, existing end-to-end neural networks for robotic manipulation–including those based on large VLM/VLA models–remain insufficiently performant for large-scale practical deployment. In this paper, we take a step towards an end-to-end manipulation policy that is generalizable, accurate and reliable. To achieve this goal, we propose a novel Chain of Moving Oriented Keypoints (CoMOK) formulation for robotic manipulation. Our formulation is used as the action representation of a neural policy, which can be trained in an end-to-end fashion. Such an action representation is general, as it extends the standard end-effector pose action representation and supports a diverse set of manipulation tasks in a unified manner. The oriented keypoint in our method enables natural generalization to objects with different shapes and sizes, while achieving sub-centimeter accuracy. Moreover, our formulation can easily handle multi-stage tasks, multi-modal robot behaviors, and deformable objects. Extensive simulated and hardware experiments demonstrate the effectiveness of our method.
zh
[AI-57] Security-aware Semantic-driven ISAC via Paired Adversarial Residual Networks
【速读】:该论文旨在解决集成感知与通信(Integrated Sensing and Communication, ISAC)系统中安全性不足的问题,特别是在面对窃听威胁时如何保障信息隐私的同时维持良好的感知与通信(SAC)性能。解决方案的关键在于提出一种安全语义ISAC(Security Semantic ISAC, SS-ISAC)框架,其核心创新是引入一对可插拔的加密与解密模块,分别基于可训练的对抗残差网络(Adversarial Residual Network, ARN)实现:加密模块在语义发射端生成对抗扰动以增强安全性,解密模块在语义接收端消除该扰动并抑制噪声;这两个模块可根据系统安全需求灵活配置,无需大幅改动硬件基础设施,并通过联合优化一个综合损失函数来平衡对抗攻击强度、SAC性能和隐私泄露风险,从而实现安全性和功能性的协同优化。
链接: https://arxiv.org/abs/2509.20835
作者: Yu Liu,Boxiang He,Fanggang Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper proposes a novel and flexible security-aware semantic-driven integrated sensing and communication (ISAC) framework, namely security semantic ISAC (SS-ISAC). Inspired by the positive impact of the adversarial attack, a pair of pluggable encryption and decryption modules is designed in the proposed SS-ISAC framework. The encryption module is installed after the semantic transmitter, adopting a trainable adversarial residual network (ARN) to create the adversarial attack. Correspondingly, the decryption module before the semantic receiver utilizes another trainable ARN to mitigate the adversarial attack and noise. These two modules can be flexibly assembled considering the system security demands, without drastically modifying the hardware infrastructure. To ensure the sensing and communication (SAC) performance while preventing the eavesdropping threat, the above ARNs are jointly optimized by minimizing a carefully designed loss function that relates to the adversarial attack power, SAC performance, as well as the privacy leakage risk. Simulation results validate the effectiveness of the proposed SS-ISAC framework in terms of both SAC and eavesdropping prevention performance.
zh
[AI-58] rustworthy Semantic Communication for Vehicular Networks: Challenges and Solutions
【速读】:该论文旨在解决车辆-一切(V2X)通信中语义通信(SemCom)网络所面临的信任挑战,包括信息传输、语义编码及通信实体可靠性等问题。其核心解决方案为三层可信语义通信架构:首先,通过引入基于防御性对抗噪声的语义伪装传输机制实现主动窃听防御;其次,设计鲁棒的联邦编码器-解码器训练框架以抵御编码器-解码器投毒攻击;最后,提出基于审计博弈的分布式车辆信任管理机制,用以抑制不可信车辆的行为。该方案在案例研究中验证了有效性,并指出了未来研究方向。
链接: https://arxiv.org/abs/2509.20830
作者: Yanghe Pan,Yuntao Wang,Shaolong Guo,Chengyu Yin,Ruidong Li,Zhou Su,Yuan Wu
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: 8 pages, 8 figures, accepted by IEEE Vehicular Technology Magazine
Abstract:Semantic communication (SemCom) has the potential to significantly reduce communication delay in vehicle-to-everything (V2X) communications within vehicular networks (VNs). However, the deployment of vehicular SemCom networks (VN-SemComNets) faces critical trust challenges in information transmission, semantic encoding, and communication entity reliability. This paper proposes an innovative three-layer trustworthy VN-SemComNet architecture. Specifically, we introduce a semantic camouflage transmission mechanism leveraging defensive adversarial noise for active eavesdropping defense, a robust federated encoder-decoder training framework to mitigate encoder-decoder poisoning attacks, and an audit game-based distributed vehicle trust management mechanism to deter untrustworthy vehicles. A case study validates the effectiveness of the proposed solutions. Lastly, essential future research directions are pointed out to advance this emerging field.
zh
[AI-59] Even More Kawaii than Real-Person-Driven VTubers? Understanding How Viewers Perceive AI-Driven VTubers
【速读】:该论文旨在解决AI驱动的虚拟主播(VTuber)在数字流媒体文化中如何构建虚拟人格并影响观众参与度的问题,尤其关注其与传统由真人“Nakanohito”操控的VTuber相比,在真实性感知和观众互动方面的差异。解决方案的关键在于通过案例研究方法,深入分析最受欢迎的AI驱动VTuber Neuro-sama的观众反馈数据,包括10.8万条Reddit帖子和13.6万条评论,从而揭示观众动机、AI如何塑造虚拟人格以及观众对AI作为“Nakanohito”的认知与接受程度。
链接: https://arxiv.org/abs/2509.20817
作者: Yiluo Wei,Yupeng He,Gareth Tyson
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:VTubers, digital personas represented by animated avatars, have gained massive popularity. Traditionally, VTubers are operated and voiced by human controllers known as Nakanohito. The reliance on Nakanohito, however, poses risks due to potential personal controversies and operational disruptions. The emergence of AI-driven VTubers offers a new model free from these human constraints. While AI-driven VTubers present benefits such as continuous operation and reduced scandal risk, they also raise questions about authenticity and audience engagement. Therefore, to gain deeper insights, we conduct a case study, investigating viewer perceptions of Neuro-sama, the most popular AI-driven VTuber with 845k followers on Twitch and 753k followers on YouTube. We analyze 108k Reddit posts and 136k YouTube comments, aiming to better understand viewer motivations, how AI constructs the virtual persona, and perceptions of the AI as Nakanohito. Our findings enhance the understanding of AI-driven VTubers and their impact on digital streaming culture.
zh
[AI-60] LogReason er: Empowering LLM s with Expert-like Coarse-to-Fine Reasoning for Log Analysis Tasks
【速读】:该论文旨在解决通用大语言模型(Large Language Models, LLMs)在日志分析任务中难以构建符合专家认知结构的系统性推理流程、且缺乏精细推理步骤准确性的问题。其核心解决方案是提出一种粗粒度到细粒度的推理增强框架LogReasoner:首先通过收集故障排查流程图和已有任务构建高阶专家思维,引导LLMs形成结构化推理路径;随后对特定任务步骤进行微调,并利用偏好学习校准错误推理细节,从而显著提升LLMs在日志分析中的推理精度与颗粒度。
链接: https://arxiv.org/abs/2509.20798
作者: Lipeng Ma,Yixuan Li,Weidong Yang,Mingjie Zhou,Xinyi Liu,Ben Fei,Shuhao Li,Xiaoyan Sun,Sihang Jiang,Yanghua Xiao
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: under review
Abstract:Log analysis is crucial for monitoring system health and diagnosing failures in complex systems. Recent advances in large language models (LLMs) offer new opportunities for automated log analysis, leveraging their reasoning capabilities to perform tasks such as anomaly detection and failure prediction. However, general-purpose LLMs struggle to formulate structured reasoning workflows that align with expert cognition and deliver precise details of reasoning steps. To address these challenges, we propose LogReasoner, a coarse-to-fine reasoning enhancement framework designed to enable LLMs to reason log analysis tasks like experts. LogReasoner consists of two stages: (1) coarse-grained enhancement of expert thinking, where high-level expert thoughts are constructed from collected troubleshooting flowcharts and existing tasks to enable LLMs to formulate structured reasoning workflows and (2) fine-grained enhancement of specific steps, where we first fine-tune the LLM with task-specific stepwise solutions to enhance the LLM for instantiated reasoning, then employ the preference learning to calibrate the LLM’s reasoning details from its mistakes, further strengthen the LLM’s analytical granularity and correctness. We evaluate LogReasoner on four distinct log analysis tasks using open-source LLMs such as Qwen-2.5 and Llama-3. Experimental results show that LogReasoner significantly outperforms existing LLMs, achieving state-of-the-art performance and demonstrating its effectiveness in enhancing the reasoning capabilities of LLMs for log analysis.
zh
[AI-61] IConv: Focusing on Local Variation with Channel Independent Convolution for Multivariate Time Series Forecasting AAAI
【速读】:该论文旨在解决多变量时间序列预测中因数据非平稳性(如趋势变化、不规则季节性和残差波动)导致的传统多层感知机(MLP)模型性能受限的问题。由于MLP的线性结构难以捕捉不同通道中的多样化局部模式,其在建模季节性和残差等细粒度特征时存在局限。解决方案的关键在于将MLP与卷积神经网络(CNN)相结合:利用MLP建模长期趋势依赖,同时引入一种新型卷积架构IConv来专门处理局部变化。IConv通过独立处理每个时间依赖通道以捕获多样化的局部时序模式,并采用大核尺寸增强感受野;同时,通过分层设计区分通道间关系,在保证建模精度的同时降低计算复杂度,从而实现对多变量时间序列中全局趋势与局部波动的协同建模。
链接: https://arxiv.org/abs/2509.20783
作者: Gawon Lee,Hanbyeol Park,Minseop Kim,Dohee Kim,Hyerim Bae
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Submitted to AAAI
Abstract:Real-world time-series data often exhibit non-stationarity, including changing trends, irregular seasonality, and residuals. In terms of changing trends, recently proposed multi-layer perceptron (MLP)-based models have shown excellent performance owing to their computational efficiency and ability to capture long-term dependency. However, the linear nature of MLP architectures poses limitations when applied to channels with diverse distributions, resulting in local variations such as seasonal patterns and residual components being ignored. However, convolutional neural networks (CNNs) can effectively incorporate these variations. To resolve the limitations of MLP, we propose combining them with CNNs. The overall trend is modeled using an MLP to consider long-term dependencies. The CNN uses diverse kernels to model fine-grained local patterns in conjunction with MLP trend predictions. To focus on modeling local variation, we propose IConv, a novel convolutional architecture that processes the temporal dependency channel independently and considers the inter-channel relationship through distinct layers. Independent channel processing enables the modeling of diverse local temporal dependencies and the adoption of a large kernel size. Distinct inter-channel considerations reduce computational cost. The proposed model is evaluated through extensive experiments on time-series datasets. The results reveal the superiority of the proposed method for multivariate time-series forecasting.
zh
[AI-62] Measuring LLM Sensitivity in Transformer-based Tabular Data Synthesis
【速读】:该论文旨在解决生成式 Tabular Data Synthesis (TDS) 工具在数据质量与计算效率之间的权衡问题,尤其关注 Transformer-based 模型因高计算成本而难以在消费级硬件(prosumer hardware)上部署的挑战。其解决方案的关键在于系统性评估不同超参数(如层数、隐藏维度)对合成数据质量(ML 实用性与分布相似性)和运行时间的影响,并对比两种主流 TDS 工具 GReaT 和 REaLTabFormer 的性能表现。研究发现,浅层模型配置可显著降低运行时间,且 REaLTabFormer 结合轻量级大语言模型(LLM)能在保持高质量合成数据的同时实现更优的计算效率,从而为实际应用提供更具可行性的平衡方案。
链接: https://arxiv.org/abs/2509.20768
作者: Maria F. Davila R,Azizjon Turaev,Wolfram Wingerath
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 7 figures
Abstract:Synthetic tabular data is used for privacy-preserving data sharing and data-driven model development. Its effectiveness, however, depends heavily on the used Tabular Data Synthesis (TDS) tool. Recent studies have shown that Transformer-based models outperform other state-of-the-art models such as Generative Adversarial Networks (GANs) and Diffusion models in terms of data quality. However, Transformer-based models also come with high computational costs, making them sometimes unfeasible for end users with prosumer hardware. This study presents a sensitivity assessment on how the choice of hyperparameters, such as number of layers or hidden dimension affects the quality of the resultant synthetic data and the computational performance. It is performed across two tools, GReaT and REaLTabFormer, evaluating 10 model setups that vary in architecture type and depth. We assess the sensitivity on three dimensions: runtime, machine learning (ML) utility, and similarity to real data distributions. Experiments were conducted on four real-world datasets. Our findings reveal that runtime is proportional to the number of hyperparameters, with shallower configurations completing faster. GReaT consistently achieves lower runtimes than REaLTabFormer, and only on the largest dataset they have comparable runtime. For small datasets, both tools achieve synthetic data with high utility and optimal similarity, but on larger datasets only REaLTabFormer sustains strong utility and similarity. As a result, REaLTabFormer with lightweight LLMs provides the best balance, since it preserves data quality while reducing computational requirements. Nonetheless, its runtime remains higher than that of GReaT and other TDS tools, suggesting that efficiency gains are possible but only up to a certain level.
zh
[AI-63] Meta-Memory: Retrieving and Integrating Semantic-Spatial Memories for Robot Spatial Reasoning
【速读】:该论文旨在解决机器人在复杂环境中有效存储观测信息为记忆,并基于这些记忆回答关于空间位置的自然语言查询这一关键 yet underexplored research挑战(challenge)。现有工作虽已构建了机器人记忆系统,但缺乏高效且原理明确的记忆检索与整合机制。解决方案的关键在于提出Meta-Memory——一个由大语言模型(Large Language Model, LLM)驱动的智能体,其核心创新是通过联合推理语义模态与空间模态,实现对相关记忆的精准检索与融合,从而赋予机器人强大的空间推理能力。
链接: https://arxiv.org/abs/2509.20754
作者: Yufan Mao,Hanjing Ye,Wenlong Dong,Chengjie Zhang,Hong Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:Navigating complex environments requires robots to effectively store observations as memories and leverage them to answer human queries about spatial locations, which is a critical yet underexplored research challenge. While prior work has made progress in constructing robotic memory, few have addressed the principled mechanisms needed for efficient memory retrieval and integration. To bridge this gap, we propose Meta-Memory, a large language model (LLM)-driven agent that constructs a high-density memory representation of the environment. The key innovation of Meta-Memory lies in its capacity to retrieve and integrate relevant memories through joint reasoning over semantic and spatial modalities in response to natural language location queries, thereby empowering robots with robust and accurate spatial reasoning capabilities. To evaluate its performance, we introduce SpaceLocQA, a large-scale dataset encompassing diverse real-world spatial question-answering scenarios. Experimental results show that Meta-Memory significantly outperforms state-of-the-art methods on both the SpaceLocQA and the public NaVQA benchmarks. Furthermore, we successfully deployed Meta-Memory on real-world robotic platforms, demonstrating its practical utility in complex environments. Project page: this https URL .
zh
[AI-64] Parallel Thinking Sequential Answering: Bridging NAR and AR for Efficient Reasoning
【速读】:该论文旨在解决生成式 AI(Generative AI)在推理任务中面临的效率与质量权衡问题:自回归(Auto-regressive, AR)语言模型虽能生成连贯文本,但在数学和代码等需要长链思维的任务中推理速度慢;而非自回归(Non-autoregressive, NAR)模型如离散扩散模型虽支持并行生成、显著提升推理速度,但输出质量通常较低。解决方案的关键在于提出一种新范式——利用NAR模型高效生成中间推理轨迹(intermediate reasoning traces),再由AR模型基于这些轨迹精炼输出最终答案,从而在保持高精度的同时大幅降低推理成本,实验表明该方法相较强基线提升达26%。
链接: https://arxiv.org/abs/2509.20744
作者: Qihang Ai,Haiyun Jiang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 4 pages
Abstract:We study reasoning tasks through a framework that integrates auto-regressive (AR) and non-autoregressive (NAR) language models. AR models, which generate text sequentially, excel at producing coherent outputs but often suffer from slow inference, particularly in reasoning-intensive domains such as mathematics and code, where lengthy chains of thought are required. In contrast, NAR models, such as discrete diffusion models, allow parallel generation and offer substantial speedups, though typically at the cost of reduced output quality. To address these limitations, we introduce a new paradigm in which an NAR model efficiently produces intermediate reasoning traces, which subsequently guide an AR model to deliver precise final answers. Experiments demonstrate that our approach yields significant 26% improvements over strong baselines while substantially reducing inference cost.
zh
[AI-65] Imagining Design Workflows in Agent ic AI Futures
【速读】:该论文旨在解决设计师在面对生成式 AI(Generative AI)逐渐普及时,如何有效整合具有自主决策能力的代理型 AI(Agentic AI)系统到其设计工作流中的问题。当前,虽然生成式 AI 能够根据提示生成内容,但其在任务执行上的主动性仍有限;而 Agentic AI 则具备自主完成日常任务的能力,从而可能释放设计师的创造力。论文通过设计虚构(design fiction)方法,让十位专业设计师设想与 AI 代理协作组织灵感来源并进行创意构思的过程,揭示了设计师对 AI 角色定位、人机权威分配以及意图表达方式的需求。解决方案的关键在于提出一个概念框架,明确人类与 AI 之间在任务执行中的权威分布,并探索超越传统提示(prompt)方式的设计意图传达机制,为未来设计流程中 AI 代理的协同应用提供理论依据与实践方向。
链接: https://arxiv.org/abs/2509.20731
作者: Samangi Wadinambiarachchi,Jenny Waycott,Yvonne Rogers,Greg Wadley
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 37th Australian Conference on Human-Computer Interaction (HCI) (OZCHI '25)
Abstract:As designers become familiar with Generative AI, a new concept is emerging: Agentic AI. While generative AI produces output in response to prompts, agentic AI systems promise to perform mundane tasks autonomously, potentially freeing designers to focus on what they love: being creative. But how do designers feel about integrating agentic AI systems into their workflows? Through design fiction, we investigated how designers want to interact with a collaborative agentic AI platform. Ten professional designers imagined and discussed collaborating with an AI agent to organise inspiration sources and ideate. Our findings highlight the roles AI agents can play in supporting designers, the division of authority between humans and AI, and how designers’ intent can be explained to AI agents beyond prompts. We synthesise our findings into a conceptual framework that identifies authority distribution among humans and AI agents and discuss directions for utilising AI agents in future design workflows.
zh
[AI-66] Fairy: Interactive Mobile Assistant to Real-world Tasks via LMM-based Multi-agent
【速读】:该论文旨在解决当前大型多模态模型(Large Multi-modal Models, LMMs)在移动图形用户界面(GUI)代理应用中面临的两大核心问题:一是现有方法在面对多样化应用界面和动态用户需求时表现不佳,尤其是端到端模型因依赖通用常识而在长尾应用上失效;二是缺乏用户交互的代理行为具有单向性,损害用户体验。解决方案的关键在于提出Fairy——一个具备持续知识积累与自我演进能力的交互式多智能体移动助手,其创新性体现在三个核心模块:全局任务规划器(Global Task Planner)实现跨应用的任务分解,应用级执行器(App-Level Executor)通过双循环机制结合长期与短期记忆进行精准操作和用户交互,以及自学习模块(Self-Learner)将执行经验结构化为应用图谱(App Map)与技巧库(Tricks),从而支持持续学习与跨应用协作,显著提升任务完成率并减少冗余步骤。
链接: https://arxiv.org/abs/2509.20729
作者: Jiazheng Sun,Te Yang,Jiayang Niu,Mingxuan Li,Yongyong Lu,Ruimeng Yang,Xin Peng
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
备注: 20 pages, 12 figures
Abstract:Large multi-modal models (LMMs) have advanced mobile GUI agents. However, existing methods struggle with real-world scenarios involving diverse app interfaces and evolving user needs. End-to-end methods relying on model’s commonsense often fail on long-tail apps, and agents without user interaction act unilaterally, harming user experience. To address these limitations, we propose Fairy, an interactive multi-agent mobile assistant capable of continuously accumulating app knowledge and self-evolving during usage. Fairy enables cross-app collaboration, interactive execution, and continual learning through three core modules:(i) a Global Task Planner that decomposes user tasks into sub-tasks from a cross-app view; (ii) an App-Level Executor that refines sub-tasks into steps and actions based on long- and short-term memory, achieving precise execution and user interaction via four core agents operating in dual loops; and (iii) a Self-Learner that consolidates execution experience into App Map and Tricks. To evaluate Fairy, we introduce RealMobile-Eval, a real-world benchmark with a comprehensive metric suite, and LMM-based agents for automated scoring. Experiments show that Fairy with GPT-4o backbone outperforms the previous SoTA by improving user requirement completion by 33.7% and reducing redundant steps by 58.5%, showing the effectiveness of its interaction and self-learning.
zh
[AI-67] RobotDancing: Residual-Action Reinforcement Learning Enables Robust Long-Horizon Humanoid Motion Tracking
【速读】:该论文旨在解决人形机器人在长时间、高动态运动跟踪中因绝对关节指令无法补偿模型-实物差异而导致的误差累积问题,从而造成运动轨迹失真甚至系统不稳定。其解决方案的关键在于提出RobotDancing框架,通过端到端训练方式预测残差关节目标(residual joint targets),显式校正动力学不匹配;该方法采用单阶段强化学习(reinforcement learning, RL)架构,统一观测空间、奖励函数与超参数配置,在仿真中实现零样本迁移(zero-shot sim-to-real)部署,显著提升复杂动作(如跳跃、旋转、翻滚)的长期跟踪精度与硬件执行质量。
链接: https://arxiv.org/abs/2509.20717
作者: Zhenguo Sun,Yibo Peng,Yuan Meng,Xukun Li,Bo-Sheng Huang,Zhenshan Bing,Xinlong Wang,Alois Knoll
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Long-horizon, high-dynamic motion tracking on humanoids remains brittle because absolute joint commands cannot compensate model-plant mismatch, leading to error accumulation. We propose RobotDancing, a simple, scalable framework that predicts residual joint targets to explicitly correct dynamics discrepancies. The pipeline is end-to-end–training, sim-to-sim validation, and zero-shot sim-to-real–and uses a single-stage reinforcement learning (RL) setup with a unified observation, reward, and hyperparameter configuration. We evaluate primarily on Unitree G1 with retargeted LAFAN1 dance sequences and validate transfer on H1/H1-2. RobotDancing can track multi-minute, high-energy behaviors (jumps, spins, cartwheels) and deploys zero-shot to hardware with high motion tracking quality.
zh
[AI-68] An Automated Retrieval-Augmented Generation LLaMA-4 109B-based System for Evaluating Radiotherapy Treatment Plans
【速读】:该论文旨在解决放射治疗(radiotherapy)计划评估过程中缺乏自动化、协议感知(protocol-aware)和可解释性的问题,传统方法依赖人工审核且难以标准化。解决方案的关键在于构建一个基于LLaMA-4 109B大语言模型(Large Language Model, LLM)的检索增强生成(Retrieval-Augmented Generation, RAG)系统,其核心由三个模块组成:优化后的句子嵌入检索引擎(基于五种SentenceTransformer骨干网络)、基于群体相似性的百分位预测组件以及临床约束检查器,通过多步骤提示驱动推理流程实现对治疗计划的精准、可追溯评价。该系统在614例跨四种疾病部位的放疗计划上验证了高一致性与鲁棒性,显著提升了评估过程的透明度与可扩展性。
链接: https://arxiv.org/abs/2509.20707
作者: Junjie Cui(1),Peilong Wang(1),Jason Holmes(1),Leshan Sun(1),Michael L. Hinni(2),Barbara A. Pockaj(3),Sujay A. Vora(1),Terence T. Sio(1),William W. Wong(1),Nathan Y. Yu(1),Steven E. Schild(1),Joshua R. Niska(1),Sameer R. Keole(1),Jean-Claude M. Rwigema(1),Samir H. Patel(1),Lisa A. McGee(1),Carlos A. Vargas(1),Wei Liu(1) ((1) Department of Radiation Oncology, Mayo Clinic Arizona, Phoenix, AZ (2) Department of Otolaryngology, Mayo Clinic Arizona, Phoenix, AZ (3) Department of General Surgery, Mayo Clinic Arizona, Phoenix, AZ)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16 pages, 4 figures. Submitted to npj Digital Medicine
Abstract:Purpose: To develop a retrieval-augmented generation (RAG) system powered by LLaMA-4 109B for automated, protocol-aware, and interpretable evaluation of radiotherapy treatment plans. Methods and Materials: We curated a multi-protocol dataset of 614 radiotherapy plans across four disease sites and constructed a knowledge base containing normalized dose metrics and protocol-defined constraints. The RAG system integrates three core modules: a retrieval engine optimized across five SentenceTransformer backbones, a percentile prediction component based on cohort similarity, and a clinical constraint checker. These tools are directed by a large language model (LLM) using a multi-step prompt-driven reasoning pipeline to produce concise, grounded evaluations. Results: Retrieval hyperparameters were optimized using Gaussian Process on a scalarized loss function combining root mean squared error (RMSE), mean absolute error (MAE), and clinically motivated accuracy thresholds. The best configuration, based on all-MiniLM-L6-v2, achieved perfect nearest-neighbor accuracy within a 5-percentile-point margin and a sub-2pt MAE. When tested end-to-end, the RAG system achieved 100% agreement with the computed values by standalone retrieval and constraint-checking modules on both percentile estimates and constraint identification, confirming reliable execution of all retrieval, prediction and checking steps. Conclusion: Our findings highlight the feasibility of combining structured population-based scoring with modular tool-augmented reasoning for transparent, scalable plan evaluation in radiation therapy. The system offers traceable outputs, minimizes hallucination, and demonstrates robustness across protocols. Future directions include clinician-led validation, and improved domain-adapted retrieval models to enhance real-world integration. Comments: 16 pages, 4 figures. Submitted to npj Digital Medicine Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2509.20707 [cs.AI] (or arXiv:2509.20707v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2509.20707 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Junjie Cui [view email] [v1] Thu, 25 Sep 2025 03:18:31 UTC (362 KB)
zh
[AI-69] Learning to Align Molecules and Proteins: A Geometry-Aware Approach to Binding Affinity
【速读】:该论文旨在解决药物-靶标结合亲和力(drug-target binding affinity)预测中的泛化能力不足问题,尤其是在跨化学空间和时间维度上的性能瓶颈。现有深度学习模型通常通过简单拼接(concatenation)融合配体与蛋白质表征,缺乏显式的几何正则化,导致模型在未见数据上表现不稳定。其解决方案的关键在于提出FIRM-DTI框架:首先利用特征逐元素线性调制(Feature-wise Linear Modulation, FiLM)层将分子嵌入条件化于蛋白质嵌入,实现更精细的交互建模;其次引入三元组损失(triplet loss)约束嵌入空间的度量结构,提升表示的可分离性与鲁棒性;最后采用RBF回归头基于嵌入距离进行平滑且可解释的亲和力预测。这一设计显著提升了模型在Therapeutics Data Commons DTI-DG基准上的性能,验证了条件建模与度量学习对增强预测鲁棒性的关键作用。
链接: https://arxiv.org/abs/2509.20693
作者: Mohammadsaleh Refahi,Bahrad A. Sokhansanj,James R. Brown,Gail Rosen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Molecular Networks (q-bio.MN)
备注: 10pages,2 figures
Abstract:Accurate prediction of drug-target binding affinity can accelerate drug discovery by prioritizing promising compounds before costly wet-lab screening. While deep learning has advanced this task, most models fuse ligand and protein representations via simple concatenation and lack explicit geometric regularization, resulting in poor generalization across chemical space and time. We introduce FIRM-DTI, a lightweight framework that conditions molecular embeddings on protein embeddings through a feature-wise linear modulation (FiLM) layer and enforces metric structure with a triplet loss. An RBF regression head operating on embedding distances yields smooth, interpretable affinity predictions. Despite its modest size, FIRM-DTI achieves state-of-the-art performance on the Therapeutics Data Commons DTI-DG benchmark, as demonstrated by an extensive ablation study and out-of-domain evaluation. Our results underscore the value of conditioning and metric learning for robust drug-target affinity prediction.
zh
[AI-70] Addressing Gradient Misalignment in Data-Augmented Training for Robust Speech Deepfake Detection
【速读】:该论文旨在解决语音深度伪造检测(Speech Deepfake Detection, SDD)中数据增强(Data Augmentation, DA)导致的梯度冲突问题。在训练过程中,原始输入与增强输入的反向传播梯度方向可能不一致,从而引发参数更新冲突,阻碍模型收敛并导致次优解,削弱数据增强的效果。解决方案的关键在于提出一种双路径数据增强(Dual-Path Data-Augmented, DPDA)训练框架,并引入梯度对齐机制:每个训练语句同时通过原始语音和增强版本两条路径处理,通过比较和对齐两者的梯度方向来减少优化冲突。实验表明,使用RawBoost增强时约25%的训练迭代存在梯度冲突,通过梯度对齐可加速收敛并显著提升性能,在In-the-Wild数据集上实现等错误率(Equal Error Rate, EER)相对降低18.69%。
链接: https://arxiv.org/abs/2509.20682
作者: Duc-Tuan Truong,Tianchi Liu,Junjie Li,Ruijie Tao,Kong Aik Lee,Eng Siong Chng
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: 5 pages, 4 figures
Abstract:In speech deepfake detection (SDD), data augmentation (DA) is commonly used to improve model generalization across varied speech conditions and spoofing attacks. However, during training, the backpropagated gradients from original and augmented inputs may misalign, which can result in conflicting parameter updates. These conflicts could hinder convergence and push the model toward suboptimal solutions, thereby reducing the benefits of DA. To investigate and address this issue, we design a dual-path data-augmented (DPDA) training framework with gradient alignment for SDD. In our framework, each training utterance is processed through two input paths: one using the original speech and the other with its augmented version. This design allows us to compare and align their backpropagated gradient directions to reduce optimization conflicts. Our analysis shows that approximately 25% of training iterations exhibit gradient conflicts between the original inputs and their augmented counterparts when using RawBoost augmentation. By resolving these conflicts with gradient alignment, our method accelerates convergence by reducing the number of training epochs and achieves up to an 18.69% relative reduction in Equal Error Rate on the In-the-Wild dataset compared to the baseline.
zh
[AI-71] QAMO: Quality-aware Multi-centroid One-class Learning For Speech Deepfake Detection
【速读】:该论文旨在解决传统单中心点(single-centroid)一类学习方法在语音深度伪造检测中因过度简化真实语音表示而忽略重要线索(如语音质量)所导致的性能瓶颈问题。解决方案的关键在于提出一种质量感知的多中心点一类学习框架(Quality-Aware Multi-Centroid One-Class Learning, QAMO),通过引入多个与语音质量相关的中心点来建模真实语音内部的多样性,每个中心点优化以代表特定语音质量子空间,从而更准确地刻画真实语音分布;同时,QAMO采用多中心点集成评分策略,在无需质量标签的情况下提升决策阈值的鲁棒性,显著改善检测性能。
链接: https://arxiv.org/abs/2509.20679
作者: Duc-Tuan Truong,Tianchi Liu,Ruijie Tao,Junjie Li,Kong Aik Lee,Eng Siong Chng
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: 5 pages, 4 figures
Abstract:Recent work shows that one-class learning can detect unseen deepfake attacks by modeling a compact distribution of bona fide speech around a single centroid. However, the single-centroid assumption can oversimplify the bona fide speech representation and overlook useful cues, such as speech quality, which reflects the naturalness of the speech. Speech quality can be easily obtained using existing speech quality assessment models that estimate it through Mean Opinion Score. In this paper, we propose QAMO: Quality-Aware Multi-Centroid One-Class Learning for speech deepfake detection. QAMO extends conventional one-class learning by introducing multiple quality-aware centroids. In QAMO, each centroid is optimized to represent a distinct speech quality subspaces, enabling better modeling of intra-class variability in bona fide speech. In addition, QAMO supports a multi-centroid ensemble scoring strategy, which improves decision thresholding and reduces the need for quality labels during inference. With two centroids to represent high- and low-quality speech, our proposed QAMO achieves an equal error rate of 5.09% in In-the-Wild dataset, outperforming previous one-class and quality-aware systems.
zh
[AI-72] Understanding Mode Switching in Human-AI Collaboration: Behavioral Insights and Predictive Modeling
【速读】:该论文旨在解决人机协作中用户在任务执行过程中动态调整控制权级别(如从指导模式转向委托模式)的机制不明确问题,尤其关注用户偏好如何随信任变化、决策复杂度和感知控制感等因素而演变。其解决方案的关键在于通过一个“手与脑”棋类交互实验(hand-and-brain chess setup),采集用户在不同控制模式下的行为数据(如眼动、情绪状态和子任务难度),并基于这些多模态信号构建轻量级预测模型,实现对控制权切换的实时预测(F1 = 0.65)。该方法揭示了行为特征与控制切换之间的统计关联,并结合定性访谈提炼出影响切换的核心因素(如AI能力感知、决策复杂度和控制感),从而为设计能够响应用户意图和任务需求的动态共享自主系统提供可操作的理论依据与技术路径。
链接: https://arxiv.org/abs/2509.20666
作者: Avinash Ajit Nargund,Arthur Caetano,Kevin Yang,Rose Yiwei Liu,Philip Tezaur,Kriteen Shrestha,Qisen Pan,Tobias Höllerer,Misha Sra
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Human-AI collaboration is typically offered in one of two of user control levels: guidance, where the AI provides suggestions and the human makes the final decision, and delegation, where the AI acts autonomously within user-defined constraints. Systems that integrate both modes, common in robotic surgery or driving assistance, often overlook shifts in user preferences within a task in response to factors like evolving trust, decision complexity, and perceived control. In this work, we investigate how users dynamically switch between higher and lower levels of control during a sequential decision-making task. Using a hand-and-brain chess setup, participants either selected a piece and the AI decided how it moved (brain mode), or the AI selected a piece and the participant decided how it moved (hand mode). We collected over 400 mode-switching decisions from eight participants, along with gaze, emotional state, and subtask difficulty data. Statistical analysis revealed significant differences in gaze patterns and subtask complexity prior to a switch and in the quality of the subsequent move. Based on these results, we engineered behavioral and task-specific features to train a lightweight model that predicted control level switches ( F1 = 0.65 ). The model performance suggests that real-time behavioral signals can serve as a complementary input alongside system-driven mode-switching mechanisms currently used. We complement our quantitative results with qualitative factors that influence switching including perceived AI ability, decision complexity, and level of control, identified from post-game interview analysis. The combined behavioral and modeling insights can help inform the design of shared autonomy systems that need dynamic, subtask-level control switches aligned with user intent and evolving task demands.
zh
[AI-73] Accelerate Creation of Product Claims Using Generative AI NEURIPS2025
【速读】:该论文旨在解决产品主张(Product Claims)创建过程耗时且成本高昂的问题,这一过程对消费者购买行为具有关键影响。解决方案的核心在于开发了一个名为 Claim Advisor 的网络应用,其关键技术是利用大语言模型(Large Language Models, LLMs)的上下文学习(in-context learning)与微调(fine-tuning)能力,实现主张的快速搜索、生成、优化与模拟排序。该系统通过语义检索匹配消费者声音、基于产品描述和用户画像生成/优化主张,并借助合成消费者模拟对主张进行评分排序,从而显著提升主张创作效率与经济性,在快消品(CPG)行业已验证其有效性。
链接: https://arxiv.org/abs/2509.20652
作者: Po-Yu Liang,Yong Zhang,Tatiana Hwa,Aaron Byers
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This paper has been accepted at the GenProCC workshop (NeurIPS 2025)
Abstract:The benefit claims of a product is a critical driver of consumers’ purchase behavior. Creating product claims is an intense task that requires substantial time and funding. We have developed the \textbfClaim Advisor web application to accelerate claim creations using in-context learning and fine-tuning of large language models (LLM). \textbfClaim Advisor was designed to disrupt the speed and economics of claim search, generation, optimization, and simulation. It has three functions: (1) semantically searching and identifying existing claims and/or visuals that resonate with the voice of consumers; (2) generating and/or optimizing claims based on a product description and a consumer profile; and (3) ranking generated and/or manually created claims using simulations via synthetic consumers. Applications in a consumer packaged goods (CPG) company have shown very promising results. We believe that this capability is broadly useful and applicable across product categories and industries. We share our learning to encourage the research and application of generative AI in different industries.
zh
[AI-74] Adaptive Cybersecurity Architecture for Digital Product Ecosystems Using Agent ic AI
【速读】:该论文旨在解决传统静态网络安全模型在云服务、API、移动平台和边缘设备等复杂数字产品生态系统中面临的可扩展性差、实时检测能力弱以及上下文响应滞后的问题。其解决方案的关键在于构建一种由代理型人工智能(Agentic AI)驱动的自适应安全架构,通过在生态系统各关键层级集成具有动态学习能力和情境感知决策机制的自主目标驱动代理(Autonomous Goal-Driven Agents),实现主动威胁缓解、政策前瞻性执行与实时异常检测;核心特性包括行为基线建模、去中心化风险评分和联邦威胁情报共享,从而显著提升对零日攻击的识别能力与访问策略的动态调整效率,同时满足零信任架构要求并符合国际网络安全合规标准。
链接: https://arxiv.org/abs/2509.20640
作者: Oluwakemi T. Olayinka,Sumeet Jeswani,Divine Iloh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Traditional static cybersecurity models often struggle with scalability, real-time detection, and contextual responsiveness in the current digital product ecosystems which include cloud services, application programming interfaces (APIs), mobile platforms, and edge devices. This study introduces autonomous goal driven agents capable of dynamic learning and context-aware decision making as part of an adaptive cybersecurity architecture driven by agentic artificial intelligence (AI). To facilitate autonomous threat mitigation, proactive policy enforcement, and real-time anomaly detection, this framework integrates agentic AI across the key ecosystem layers. Behavioral baselining, decentralized risk scoring, and federated threat intelligence sharing are important features. The capacity of the system to identify zero-day attacks and dynamically modify access policies was demonstrated through native cloud simulations. The evaluation results show increased adaptability, decreased response latency, and improved detection accuracy. The architecture provides an intelligent and scalable blueprint for safeguarding complex digital infrastructure and is compatible with zero-trust models, thereby supporting the adherence to international cybersecurity regulations.
zh
[AI-75] A Framework for Rapidly Developing and Deploying Protection Against Large Language Model Attacks
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在广泛部署过程中因自主性增强和权限扩展而面临的新型安全威胁问题,特别是针对零日攻击或未知攻击缺乏有效防御机制的挑战。解决方案的关键在于构建一个面向生产环境的端到端防御系统,其核心由三部分组成:一是基于威胁情报的动态防护机制,将新兴AI相关威胁转化为实时保护策略;二是数据平台,用于聚合、丰富信息并提供可观测性、监控与机器学习运维(MLOps)能力;三是发布平台,支持无中断的安全检测更新。该架构通过多层防御、持续可观测性和快速响应能力,实现对演进式LLM威胁的主动抵御,并生成训练数据以驱动模型迭代优化,从而形成闭环的安全防护体系。
链接: https://arxiv.org/abs/2509.20639
作者: Adam Swanda,Amy Chang,Alexander Chen,Fraser Burch,Paul Kassianik,Konstantin Berlin
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:The widespread adoption of Large Language Models (LLMs) has revolutionized AI deployment, enabling autonomous and semi-autonomous applications across industries through intuitive language interfaces and continuous improvements in model development. However, the attendant increase in autonomy and expansion of access permissions among AI applications also make these systems compelling targets for malicious attacks. Their inherent susceptibility to security flaws necessitates robust defenses, yet no known approaches can prevent zero-day or novel attacks against LLMs. This places AI protection systems in a category similar to established malware protection systems: rather than providing guaranteed immunity, they minimize risk through enhanced observability, multi-layered defense, and rapid threat response, supported by a threat intelligence function designed specifically for AI-related threats. Prior work on LLM protection has largely evaluated individual detection models rather than end-to-end systems designed for continuous, rapid adaptation to a changing threat landscape. We present a production-grade defense system rooted in established malware detection and threat intelligence practices. Our platform integrates three components: a threat intelligence system that turns emerging threats into protections; a data platform that aggregates and enriches information while providing observability, monitoring, and ML operations; and a release platform enabling safe, rapid detection updates without disrupting customer workflows. Together, these components deliver layered protection against evolving LLM threats while generating training data for continuous model improvement and deploying updates without interrupting production. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2509.20639 [cs.CR] (or arXiv:2509.20639v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2509.20639 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-76] Learning Terrain-Specialized Policies for Adaptive Locomotion in Challenging Environments
【速读】:该论文旨在解决腿式机器人在无视觉感知(blind locomotion)条件下,于复杂、非结构化地形中实现鲁棒且敏捷运动的难题。其解决方案的关键在于提出一种分层强化学习框架,通过引入地形特异性策略(terrain-specialized policies)与课程学习(curriculum learning)机制,使机器人能够根据不同地形特性自适应调整运动行为,从而显著提升在低摩擦和不连续地形上的跟踪精度与成功率,相较通用策略(generalist policy)在成功率达到最高16%的提升,并展现出更强的混合地形适应能力。
链接: https://arxiv.org/abs/2509.20635
作者: Matheus P. Angarola,Francisco Affonso,Marcelo Becker
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted to the 22nd International Conference on Advanced Robotics (ICAR 2025). 7 pages
Abstract:Legged robots must exhibit robust and agile locomotion across diverse, unstructured terrains, a challenge exacerbated under blind locomotion settings where terrain information is unavailable. This work introduces a hierarchical reinforcement learning framework that leverages terrain-specialized policies and curriculum learning to enhance agility and tracking performance in complex environments. We validated our method on simulation, where our approach outperforms a generalist policy by up to 16% in success rate and achieves lower tracking errors as the velocity target increases, particularly on low-friction and discontinuous terrains, demonstrating superior adaptability and robustness across mixed-terrain scenarios.
zh
[AI-77] Personalized Federated Dictionary Learning for Modeling Heterogeneity in Multi-site fMRI Data
【速读】:该论文旨在解决多中心功能磁共振成像(fMRI)研究中因站点特异性异质性导致的数据非独立同分布(non-IID)问题,这一问题严重阻碍了可泛化的神经影像模型的构建。其解决方案的关键在于提出一种个性化联邦字典学习(Personalized Federated Dictionary Learning, PFedDL)框架,该框架在各站点独立进行字典学习,并将每个站点的字典分解为共享的全局原子和个性化的局部原子:全局原子通过联邦聚合更新以增强跨站点一致性,局部原子则独立优化以捕捉站点特异性差异,从而提升下游分析的准确性与鲁棒性。
链接: https://arxiv.org/abs/2509.20627
作者: Yipu Zhang,Chengshuo Zhang,Ziyu Zhou,Gang Qu,Hao Zheng,Yuping Wang,Hui Shen,Hongwen Deng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Data privacy constraints pose significant challenges for large-scale neuroimaging analysis, especially in multi-site functional magnetic resonance imaging (fMRI) studies, where site-specific heterogeneity leads to non-independent and identically distributed (non-IID) data. These factors hinder the development of generalizable models. To address these challenges, we propose Personalized Federated Dictionary Learning (PFedDL), a novel federated learning framework that enables collaborative modeling across sites without sharing raw data. PFedDL performs independent dictionary learning at each site, decomposing each site-specific dictionary into a shared global component and a personalized local component. The global atoms are updated via federated aggregation to promote cross-site consistency, while the local atoms are refined independently to capture site-specific variability, thereby enhancing downstream analysis. Experiments on the ABIDE dataset demonstrate that PFedDL outperforms existing methods in accuracy and robustness across non-IID datasets.
zh
[AI-78] MMG: Mutual Information Estimation via the MMSE Gap in Diffusion NEURIPS2025
【速读】:该论文旨在解决复杂系统中互信息(Mutual Information, MI)估计的挑战问题,尤其是在传统方法难以准确估算高维或非线性关系场景下的MI。其解决方案的关键在于利用去噪扩散模型(Denoising Diffusion Models)的信息论框架,将MI与条件与无条件扩散过程之间的最小均方误差(Minimum Mean Square Error, MMSE)差距相关联——具体而言,MI等于该MMSE差距在所有信噪比(Signal-to-Noise Ratio, SNR)下积分的一半。这一理论桥梁使得MI估计可直接从扩散模型训练中获得,并通过自适应重要性采样(adaptive importance sampling)实现高效且可扩展的估计,同时在高MI场景下仍保持优异性能。
链接: https://arxiv.org/abs/2509.20609
作者: Longxuan Yu,Xing Shi,Xianghao Kong,Tong Jia,Greg Ver Steeg
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to the SPIGM Workshop at NeurIPS 2025
Abstract:Mutual information (MI) is one of the most general ways to measure relationships between random variables, but estimating this quantity for complex systems is challenging. Denoising diffusion models have recently set a new bar for density estimation, so it is natural to consider whether these methods could also be used to improve MI estimation. Using the recently introduced information-theoretic formulation of denoising diffusion models, we show the diffusion models can be used in a straightforward way to estimate MI. In particular, the MI corresponds to half the gap in the Minimum Mean Square Error (MMSE) between conditional and unconditional diffusion, integrated over all Signal-to-Noise-Ratios (SNRs) in the noising process. Our approach not only passes self-consistency tests but also outperforms traditional and score-based diffusion MI estimators. Furthermore, our method leverages adaptive importance sampling to achieve scalable MI estimation, while maintaining strong performance even when the MI is high.
zh
[AI-79] Experience Deploying Containerized GenAI Services at an HPC Center
【速读】:该论文旨在解决生成式人工智能(Generative AI)工作负载在高性能计算(High-Performance Computing, HPC)中心部署与集成的挑战,尤其是在容器化和云原生技术尚未充分融入传统HPC环境的背景下。其解决方案的关键在于构建一种融合HPC与Kubernetes平台的协同计算架构,通过容器化部署GenAI组件(如vLLM推理服务器),实现跨平台(HPC与Kubernetes)的统一调度与运行,并借助多种容器运行时支持Llama大语言模型(Large Language Model, LLM)的可复现部署,从而提升资源利用效率与系统灵活性。
链接: https://arxiv.org/abs/2509.20603
作者: Angel M. Beltre,Jeff Ogden,Kevin Pedretti
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: 10 pages, 12 figures
Abstract:Generative Artificial Intelligence (GenAI) applications are built from specialized components – inference servers, object storage, vector and graph databases, and user interfaces – interconnected via web-based APIs. While these components are often containerized and deployed in cloud environments, such capabilities are still emerging at High-Performance Computing (HPC) centers. In this paper, we share our experience deploying GenAI workloads within an established HPC center, discussing the integration of HPC and cloud computing environments. We describe our converged computing architecture that integrates HPC and Kubernetes platforms running containerized GenAI workloads, helping with reproducibility. A case study illustrates the deployment of the Llama Large Language Model (LLM) using a containerized inference server (vLLM) across both Kubernetes and HPC platforms using multiple container runtimes. Our experience highlights practical considerations and opportunities for the HPC container community, guiding future research and tool development.
zh
[AI-80] An LLM -based Agent ic Framework for Accessible Network Control
【速读】:该论文旨在解决传统网络管理仅限于少数具备专业知识的网络运维人员的问题,从而限制了普通用户对网络进行自主管理的能力。其解决方案的关键在于设计了一个基于大语言模型(Large Language Models, LLMs)的代理式(agentic)框架,通过中间表示(intermediate representation)实现跨厂商设备的配置统一,实时从内存中检索网络状态,并提供外部反馈接口,使非专家用户能够以自然语言与网络交互,从而实现网络控制的民主化。
链接: https://arxiv.org/abs/2509.20600
作者: Samuel Lin,Jiawei Zhou,Minlan Yu
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 11 pages, 6 figures
Abstract:Traditional approaches to network management have been accessible only to a handful of highly-trained network operators with significant expert knowledge. This creates barriers for lay users to easily manage their networks without resorting to experts. With recent development of powerful large language models (LLMs) for language comprehension, we design a system to make network management accessible to a broader audience of non-experts by allowing users to converse with networks in natural language. To effectively leverage advancements in LLMs, we propose an agentic framework that uses an intermediate representation to streamline configuration across diverse vendor equipment, retrieves the network state from memory in real-time, and provides an interface for external feedback. We also conduct pilot studies to collect real user data of natural language utterances for network control, and present a visualization interface to facilitate dialogue-driven user interaction and enable large-scale data collection for future development. Preliminary experiments validate the effectiveness of our proposed system components with LLM integration on both synthetic and real user utterances. Through our data collection and visualization efforts, we pave the way for more effective use of LLMs and democratize network control for everyday users.
zh
[AI-81] MechStyle: Augmenting Generative AI with Mechanical Simulation to Create Stylized and Structurally Viable 3D Models
【速读】:该论文旨在解决生成式 AI (Generative AI) 在对3D模型进行风格化处理时,因几何结构修改可能导致模型在实际打印后结构完整性受损的问题。解决方案的关键在于引入有限元分析(Finite Element Analysis, FEA)仿真反馈机制,将FEA模拟中识别出的高应力区域作为约束条件,在风格化过程中动态调整几何修改强度,从而在保持设计风格的同时保障模型的结构稳定性。
链接: https://arxiv.org/abs/2509.20571
作者: Faraz Faruqi,Amira Abdel-Rahman,Leandra Tejedor,Martin Nisser,Jiaji Li,Vrushank Phadnis,Varun Jampani,Neil Gershenfeld,Megan Hofmann,Stefanie Mueller
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent developments in Generative AI enable creators to stylize 3D models based on text prompts. These methods change the 3D model geometry, which can compromise the model’s structural integrity once fabricated. We present MechStyle, a system that enables creators to stylize 3D printable models while preserving their structural integrity. MechStyle accomplishes this by augmenting the Generative AI-based stylization process with feedback from a Finite Element Analysis (FEA) simulation. As the stylization process modifies the geometry to approximate the desired style, feedback from the FEA simulation reduces modifications to regions with increased stress. We evaluate the effectiveness of FEA simulation feedback in the augmented stylization process by comparing three stylization control strategies. We also investigate the time efficiency of our approach by comparing three adaptive scheduling strategies. Finally, we demonstrate MechStyle’s user interface that allows users to generate stylized and structurally viable 3D models and provide five example applications.
zh
[AI-82] PIRF: Physics-Informed Reward Fine-Tuning for Diffusion Models NEURIPS2025
【速读】:该论文旨在解决扩散模型在科学生成任务中因违反物理定律而导致输出不准确的问题。其核心挑战在于现有方法依赖于扩散后验采样(Diffusion Posterior Sampling, DPS)风格的值函数近似,这会引入显著误差并导致训练不稳定与推理效率低下。解决方案的关键是提出物理信息奖励微调(Physics-Informed Reward Fine-tuning, PIRF),通过直接计算轨迹级奖励并反向传播梯度来绕过值函数近似;同时采用分层截断反向传播和基于权重的正则化策略,有效提升样本效率与数据保真度,在五种偏微分方程(PDE)基准上实现了高效且强健的物理约束遵守。
链接: https://arxiv.org/abs/2509.20570
作者: Mingze Yuan,Pengfei Jin,Na Li,Quanzheng Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Systems and Control (eess.SY)
备注: 18 pages, 6 figures; NeurIPS 2025 AI for science workshop
Abstract:Diffusion models have demonstrated strong generative capabilities across scientific domains, but often produce outputs that violate physical laws. We propose a new perspective by framing physics-informed generation as a sparse reward optimization problem, where adherence to physical constraints is treated as a reward signal. This formulation unifies prior approaches under a reward-based paradigm and reveals a shared bottleneck: reliance on diffusion posterior sampling (DPS)-style value function approximations, which introduce non-negligible errors and lead to training instability and inference inefficiency. To overcome this, we introduce Physics-Informed Reward Fine-tuning (PIRF), a method that bypasses value approximation by computing trajectory-level rewards and backpropagating their gradients directly. However, a naive implementation suffers from low sample efficiency and compromised data fidelity. PIRF mitigates these issues through two key strategies: (1) a layer-wise truncated backpropagation method that leverages the spatiotemporally localized nature of physics-based rewards, and (2) a weight-based regularization scheme that improves efficiency over traditional distillation-based methods. Across five PDE benchmarks, PIRF consistently achieves superior physical enforcement under efficient sampling regimes, highlighting the potential of reward fine-tuning for advancing scientific generative modeling.
zh
[AI-83] SAMULE: Self-Learning Agents Enhanced by Multi-level Reflection EMNLP2025
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在复杂任务中因缺乏有效的错误分析和对稀有成功轨迹的依赖,而导致生成有意义反思(reflection)能力不足的问题。其解决方案的关键在于提出SAMULE框架,该框架基于多层级反思合成(Multi-Level Reflection Synthesis)训练一个回溯式语言模型(retrospective language model),通过三个互补层次实现高质量反思:单轨迹学习(Single-Trajectory Learning,微级)用于细粒度错误修正;任务内学习(Intra-Task Learning,中层)构建同一任务多次尝试中的错误分类体系;任务间学习(Inter-Task Learning,宏观)提取跨任务失败中同类型错误的可迁移洞察。该设计显著提升了LLM代理在推理阶段生成结构化反思的能力,并进一步通过前瞻式反思机制扩展至交互场景,使代理能够主动对比预测与实际响应以动态调整行为。
链接: https://arxiv.org/abs/2509.20562
作者: Yubin Ge,Salvatore Romeo,Jason Cai,Monica Sunkara,Yi Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at EMNLP 2025 Main Conference
Abstract:Despite the rapid advancements in LLM agents, they still face the challenge of generating meaningful reflections due to inadequate error analysis and a reliance on rare successful trajectories, especially in complex tasks. In this work, we propose SAMULE, a new framework for self-learning agents powered by a retrospective language model that is trained based on Multi-Level Reflection Synthesis. It first synthesizes high-quality reflections across three complementary levels: Single-Trajectory Learning (micro-level) for detailed error correction; Intra-Task Learning (meso-level) to build error taxonomies across multiple trials of the same task, and Inter-Task Learning (macro-level) to extract transferable insights based on same typed errors from diverse task failures. Then we fine-tune a language model serving as the retrospective model to generate reflections during inference. We further extend our framework to interactive settings through a foresight-based reflection mechanism, enabling agents to proactively reflect and adapt during user interactions by comparing predicted and actual responses. Extensive experiments on three challenging benchmarks - TravelPlanner, NATURAL PLAN, and Tau-bench - demonstrate that our approach significantly outperforms reflection-based baselines. Our results highlight the critical role of well-designed reflection synthesis and failure-centric learning in building self-improving LLM agents.
zh
[AI-84] GraspFactory: A Large Object-Centric Grasping Dataset
【速读】:该论文旨在解决机器人抓取模型在面对训练数据有限时难以泛化到新物体的问题,尤其是在工业自动化场景中,如仓库或制造工厂等环境中,物体种类繁多且复杂。解决方案的关键在于构建一个大规模、几何多样化的6-DoF(六自由度)抓取数据集GraspFactory,其中包含超过1.09亿次抓取动作,分别针对Franka Panda机械臂与Robotiq 2F-85夹爪,覆盖14,690种和33,710种不同物体。该数据集专为训练数据密集型模型设计,并通过实验证明了基于其子集训练的模型在仿真和真实世界中的良好泛化能力。
链接: https://arxiv.org/abs/2509.20550
作者: Srinidhi Kalgundi Srinivas,Yash Shukla,Adam Arnold,Sachin Chitta
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Robotic grasping is a crucial task in industrial automation, where robots are increasingly expected to handle a wide range of objects. However, a significant challenge arises when robot grasping models trained on limited datasets encounter novel objects. In real-world environments such as warehouses or manufacturing plants, the diversity of objects can be vast, and grasping models need to generalize to this diversity. Training large, generalizable robot-grasping models requires geometrically diverse datasets. In this paper, we introduce GraspFactory, a dataset containing over 109 million 6-DoF grasps collectively for the Franka Panda (with 14,690 objects) and Robotiq 2F-85 grippers (with 33,710 objects). GraspFactory is designed for training data-intensive models, and we demonstrate the generalization capabilities of one such model trained on a subset of GraspFactory in both simulated and real-world settings. The dataset and tools are made available for download at this https URL.
zh
[AI-85] Understanding and Improving Adversarial Robustness of Neural Probabilistic Circuits NEURIPS2025
【速读】:该论文旨在解决神经概率电路(Neural Probabilistic Circuits, NPCs)在面对对抗攻击时的脆弱性问题,即由于属性识别模块(attribute recognition model)作为黑箱模型易受精心设计的微小扰动影响,从而导致最终预测结果被篡改。解决方案的关键在于提出RNPC(Robust Neural Probabilistic Circuit),其创新性地引入类级别集成(class-wise integration)机制,在推理阶段实现对两个模块输出的鲁棒融合,从而确保整体模型的对抗鲁棒性仅依赖于属性识别模块的鲁棒性,而不受概率电路结构的影响。理论分析与实验证明,RNPC相较于原NPC及现有概念瓶颈模型,在保持良性输入高准确率的同时显著提升了对抗鲁棒性。
链接: https://arxiv.org/abs/2509.20549
作者: Weixin Chen,Han Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: NeurIPS 2025 Camera Ready
Abstract:Neural Probabilistic Circuits (NPCs), a new class of concept bottleneck models, comprise an attribute recognition model and a probabilistic circuit for reasoning. By integrating the outputs from these two modules, NPCs produce compositional and interpretable predictions. While offering enhanced interpretability and high performance on downstream tasks, the neural-network-based attribute recognition model remains a black box. This vulnerability allows adversarial attacks to manipulate attribute predictions by introducing carefully crafted subtle perturbations to input images, potentially compromising the final predictions. In this paper, we theoretically analyze the adversarial robustness of NPC and demonstrate that it only depends on the robustness of the attribute recognition model and is independent of the robustness of the probabilistic circuit. Moreover, we propose RNPC, the first robust neural probabilistic circuit against adversarial attacks on the recognition module. RNPC introduces a novel class-wise integration for inference, ensuring a robust combination of outputs from the two modules. Our theoretical analysis demonstrates that RNPC exhibits provably improved adversarial robustness compared to NPC. Empirical results on image classification tasks show that RNPC achieves superior adversarial robustness compared to existing concept bottleneck models while maintaining high accuracy on benign inputs.
zh
[AI-86] A Compound Classification System Based on Fuzzy Relations Applied to the Noise-Tolerant Control of a Bionic Hand via EMG Signal Recognition
【速读】:该论文旨在解决基于肌电信号(Electromyographic, EMG)的人体上肢假肢控制中,因生物信号易受污染而导致分类质量下降的问题。其关键解决方案是提出一种融合两类集成学习模型的模糊识别系统:一是采用一类分类器(One-Class Classifier, OCC)用于检测各通道EMG信号的污染程度,二是利用K近邻(K-Nearest Neighbors, KNN)分类器识别用户意图;整个系统基于原创的统一模糊决策模型实现软判决机制,从而在识别过程中有效缓解污染对分类性能的负面影响。
链接: https://arxiv.org/abs/2509.20523
作者: Pawel Trajdos,Marek Kurzynski
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Modern anthropomorphic upper limb bioprostheses are typically controlled by electromyographic (EMG) biosignals using a pattern recognition scheme. Unfortunately, there are many factors originating from the human source of objects to be classified and from the human-prosthesis interface that make it difficult to obtain an acceptable classification quality. One of these factors is the high susceptibility of biosignals to contamination, which can considerably reduce the quality of classification of a recognition system. In the paper, the authors propose a new recognition system intended for EMG based control of the hand prosthesis with detection of contaminated biosignals in order to mitigate the adverse effect of contaminations. The system consists of two ensembles: the set of one-class classifiers (OCC) to assess the degree of contamination of individual channels and the ensemble of K-nearest neighbours (KNN) classifier to recognise the patient’s intent. For all recognition systems, an original, coherent fuzzy model was developed, which allows the use of a uniform soft (fuzzy) decision scheme throughout the recognition process. The experimental evaluation was conducted using real biosignals from a public repository. The goal was to provide an experimental comparative analysis of the parameters and procedures of the developed method on which the quality of the recognition system depends. The proposed fuzzy recognition system was also compared with similar systems described in the literature. Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2509.20523 [cs.AI] (or arXiv:2509.20523v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2509.20523 Focus to learn more arXiv-issued DOI via DataCite
zh
[AI-87] Adaptive Approach to Enhance Machine Learning Scheduling Algorithms During Runtime Using Reinforcement Learning in Metascheduling Applications
【速读】:该论文旨在解决传统离线训练人工智能(Artificial Intelligence, AI)调度推理在时间触发架构中面临的挑战,特别是构建全面的多调度图(Multi-Schedule Graph, MSG)时因场景复杂性(如硬件故障、空闲时间波动或模式切换)而导致的资源消耗过大且难以实现的问题。解决方案的关键在于引入一个集成于元调度器中的自适应在线学习单元,利用强化学习(Reinforcement Learning, RL)在运行时持续探索并发现新的调度策略,从而动态扩展MSG并提升系统性能。该机制使系统能够应对未预期事件和复杂调度场景,同时通过实时训练不断优化AI推理能力,确保在大规模、高安全性环境中具备鲁棒性和效率。
链接: https://arxiv.org/abs/2509.20520
作者: Samer Alshaer,Ala Khalifeh,Roman Obermaisser
机构: 未知
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注: 18 pages, 21 figures
Abstract:Metascheduling in time-triggered architectures has been crucial in adapting to dynamic and unpredictable environments, ensuring the reliability and efficiency of task execution. However, traditional approaches face significant challenges when training Artificial Intelligence (AI) scheduling inferences offline, particularly due to the complexities involved in constructing a comprehensive Multi-Schedule Graph (MSG) that accounts for all possible scenarios. The process of generating an MSG that captures the vast probability space, especially when considering context events like hardware failures, slack variations, or mode changes, is resource-intensive and often infeasible. To address these challenges, we propose an adaptive online learning unit integrated within the metascheduler to enhance performance in real-time. The primary motivation for developing this unit stems from the limitations of offline training, where the MSG created is inherently a subset of the complete space, focusing only on the most probable and critical context events. In the online mode, Reinforcement Learning (RL) plays a pivotal role by continuously exploring and discovering new scheduling solutions, thus expanding the MSG and enhancing system performance over time. This dynamic adaptation allows the system to handle unexpected events and complex scheduling scenarios more effectively. Several RL models were implemented within the online learning unit, each designed to address specific challenges in scheduling. These models not only facilitate the discovery of new solutions but also optimize existing schedulers, particularly when stricter deadlines or new performance criteria are introduced. By continuously refining the AI inferences through real-time training, the system remains flexible and capable of meeting evolving demands, thus ensuring robustness and efficiency in large-scale, safety-critical environments.
zh
[AI-88] Reconstruction-Based Adaptive Scheduling Using AI Inferences in Safety-Critical Systems
【速读】:该论文旨在解决时间触发系统(Time-Triggered Systems, TTS)在动态运行环境中面临的调度可靠性与安全性问题,具体包括消息冲突、因优先级处理错误导致的死锁循环,以及生成不完整或无效调度方案等挑战。解决方案的关键在于提出一种新颖的重构框架(reconstruction framework),通过系统性地将AI生成或启发式获得的调度优先级转化为可执行调度方案,确保满足关键约束如前序依赖关系和无冲突通信;该框架集成鲁棒的安全校验机制、高效的资源分配算法及恢复机制,从而在硬件故障或模式切换等意外事件下仍能保障调度的完整性与实时性。
链接: https://arxiv.org/abs/2509.20513
作者: Samer Alshaer,Ala Khalifeh,Roman Obermaisser
机构: 未知
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 14 pages, 10 figures
Abstract:Adaptive scheduling is crucial for ensuring the reliability and safety of time-triggered systems (TTS) in dynamic operational environments. Scheduling frameworks face significant challenges, including message collisions, locked loops from incorrect precedence handling, and the generation of incomplete or invalid schedules, which can compromise system safety and performance. To address these challenges, this paper presents a novel reconstruction framework designed to dynamically validate and assemble schedules. The proposed reconstruction models operate by systematically transforming AI-generated or heuristically derived scheduling priorities into fully executable schedules, ensuring adherence to critical system constraints such as precedence rules and collision-free communication. It incorporates robust safety checks, efficient allocation algorithms, and recovery mechanisms to handle unexpected context events, including hardware failures and mode transitions. Comprehensive experiments were conducted across multiple performance profiles, including makespan minimisation, workload balancing, and energy efficiency, to validate the operational effectiveness of the reconstruction models. Results demonstrate that the proposed framework significantly enhances system adaptability, operational integrity, and runtime performance while maintaining computational efficiency. Overall, this work contributes a practical and scalable solution to the problem of safe schedule generation in safety-critical TTS, enabling reliable and flexible real-time scheduling even under highly dynamic and uncertain operational conditions.
zh
[AI-89] CHOIR: A Chatbot-mediated Organizational Memory Leverag ing Communication in University Research Labs
【速读】:该论文旨在解决大学研究实验室中组织记忆(organizational memory)难以有效保存与利用的问题,即在基于聊天的协作平台中,有价值的知识常因信息流淹没而流失,而传统文档维护成本高且难于导航。其解决方案的关键在于设计并部署了一个基于大语言模型(LLM)的聊天机器人CHOIR,通过四大核心功能实现知识的结构化沉淀与动态更新:基于文档的问答(document-grounded QA)、问答共享以促进后续讨论、从对话中自动提取知识,以及AI辅助文档修订。实证结果显示,该系统能提升知识留存效率,但也揭示了隐私意识与知识可见性之间的张力,为未来隐私保护下的组织记忆系统设计提供了重要启示。
链接: https://arxiv.org/abs/2509.20512
作者: Sangwook Lee,Adnan Abbas,Yan Chen,Young-Ho Kim,Sang Won Lee
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 20 pages, 7 figures, 2 tables
Abstract:University research labs often rely on chat-based platforms for communication and project management, where valuable knowledge surfaces but is easily lost in message streams. Documentation can preserve knowledge, but it requires ongoing maintenance and is challenging to navigate. Drawing on formative interviews that revealed organizational memory challenges in labs, we designed CHOIR, an LLM-based chatbot that supports organizational memory through four key functions: document-grounded QA, QA sharing for follow-up discussion, knowledge extraction from conversations, and AI-assisted document updates. We deployed CHOIR in four research labs for one month (n=21), where the lab members asked 107 questions and lab directors updated documents 38 times in the organizational memory. Our findings reveal a privacy-awareness tension: questions were asked privately, limiting directors’ visibility into documentation gaps. Students often avoided contribution due to challenges in generalizing personal experiences into universal documentation. We contribute design implications for privacy-preserving awareness and supporting context-specific knowledge documentation.
zh
[AI-90] Complexity-Driven Policy Optimization
【速读】:该论文旨在解决策略梯度方法中通过熵最大化实现探索与利用平衡时所导致的效率低下问题,因为单纯最大化熵会使策略趋向于均匀随机分布,从而产生无结构且低效的探索行为。解决方案的关键在于用一种新的复杂度奖励(complexity bonus)替代传统的熵奖励,该复杂度定义为香农熵(Shannon entropy)与非均衡性(disequilibrium,即概率分布偏离均匀分布的程度)的乘积,从而在鼓励策略具备一定随机性的同时引入结构性,抑制完全无序和完全确定两种极端状态,引导智能体发现既结构化又具有适应性的有效策略。基于此思想,作者提出了复杂度驱动的策略优化算法(Complexity-Driven Policy Optimization, CDPO),在离散动作空间任务中表现出对复杂度系数选择更鲁棒的性能,尤其在需要深度探索的环境中优势显著。
链接: https://arxiv.org/abs/2509.20509
作者: Luca Serfilippi,Giorgio Franceschelli,Antonio Corradi,Mirco Musolesi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Policy gradient methods often balance exploitation and exploration via entropy maximization. However, maximizing entropy pushes the policy towards a uniform random distribution, which represents an unstructured and sometimes inefficient exploration strategy. In this work, we propose replacing the entropy bonus with a more robust complexity bonus. In particular, we adopt a measure of complexity, defined as the product of Shannon entropy and disequilibrium, where the latter quantifies the distance from the uniform distribution. This regularizer encourages policies that balance stochasticity (high entropy) with structure (high disequilibrium), guiding agents toward regimes where useful, non-trivial behaviors can emerge. Such behaviors arise because the regularizer suppresses both extremes, e.g., maximal disorder and complete order, creating pressure for agents to discover structured yet adaptable strategies. Starting from Proximal Policy Optimization (PPO), we introduce Complexity-Driven Policy Optimization (CDPO), a new learning algorithm that replaces entropy with complexity. We show empirically across a range of discrete action space tasks that CDPO is more robust to the choice of the complexity coefficient than PPO is with the entropy coefficient, especially in environments requiring greater exploration.
zh
[AI-91] Boosting Zero-Shot VLN via Abstract Obstacle Map-Based Waypoint Prediction with TopoGraph-and-VisitInfo-Aware Prompting
【速读】:该论文旨在解决在连续环境中的视觉语言导航(Vision-Language Navigation, VLN)问题,即如何让具身智能体在复杂、动态的环境中,结合自然语言指令、感知周围场景并规划低层动作以准确抵达目标位置。其解决方案的关键在于提出了一种零样本(zero-shot)框架,该框架融合了一个简化但高效的路径点预测器与多模态大语言模型(Multimodal Large Language Model, MLLM)。路径点预测器基于抽象障碍物地图生成线性可达的路径点,并将其整合进一个带有显式访问记录的动态拓扑图中;该图结构及访问信息被编码至提示(prompt)中,使MLLM能够基于空间结构和探索历史进行推理,从而增强探索能力并具备局部路径规划能力用于错误纠正。此方法在R2R-CE和RxR-CE数据集上实现了41%和36%的成功率,达到当前最优的零样本性能。
链接: https://arxiv.org/abs/2509.20499
作者: Boqi Li,Siyuan Li,Weiyi Wang,Anran Li,Zhong Cao,Henry X. Liu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:With the rapid progress of foundation models and robotics, vision-language navigation (VLN) has emerged as a key task for embodied agents with broad practical applications. We address VLN in continuous environments, a particularly challenging setting where an agent must jointly interpret natural language instructions, perceive its surroundings, and plan low-level actions. We propose a zero-shot framework that integrates a simplified yet effective waypoint predictor with a multimodal large language model (MLLM). The predictor operates on an abstract obstacle map, producing linearly reachable waypoints, which are incorporated into a dynamically updated topological graph with explicit visitation records. The graph and visitation information are encoded into the prompt, enabling reasoning over both spatial structure and exploration history to encourage exploration and equip MLLM with local path planning for error correction. Extensive experiments on R2R-CE and RxR-CE show that our method achieves state-of-the-art zero-shot performance, with success rates of 41% and 36%, respectively, outperforming prior state-of-the-art methods.
zh
[AI-92] AI-Specific Code Smells: From Specification to Detection
【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)驱动的软件系统中存在的一类新型代码异味问题,即AI-specific code smells,这类异味可能引发模型不可复现、静默失败或泛化能力差等深层次缺陷,而传统检测工具难以识别。解决方案的关键在于提出SpecDetect4AI,一个基于高阶声明式领域特定语言(Domain-Specific Language, DSL)的规则定义与静态分析相结合的可扩展框架,通过为22种AI特有代码异味制定专用规则,并在826个AI系统(共2000万行代码)上验证其有效性,实现了88.66%的精确率和88.89%的召回率,显著优于现有工具,证明了其在大规模AI系统中高效、准确检测代码异味的能力及良好的可扩展性。
链接: https://arxiv.org/abs/2509.20491
作者: Brahim Mahmoudi,Naouel Moha,Quentin Stievenert,Florent Avellaneda
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:The rise of Artificial Intelligence (AI) is reshaping how software systems are developed and maintained. However, AI-based systems give rise to new software issues that existing detection tools often miss. Among these, we focus on AI-specific code smells, recurring patterns in the code that may indicate deeper problems such as unreproducibility, silent failures, or poor model generalization. We introduce SpecDetect4AI, a tool-based approach for the specification and detection of these code smells at scale. This approach combines a high-level declarative Domain-Specific Language (DSL) for rule specification with an extensible static analysis tool that interprets and detects these rules for AI-based systems. We specified 22 AI-specific code smells and evaluated SpecDetect4AI on 826 AI-based systems (20M lines of code), achieving a precision of 88.66% and a recall of 88.89%, outperforming other existing detection tools. Our results show that SpecDetect4AI supports the specification and detection of AI-specific code smells through dedicated rules and can effectively analyze large AI-based systems, demonstrating both efficiency and extensibility (SUS 81.7/100).
zh
[AI-93] CoSupFormer : A Contrastive Supervised learning approach for EEG signal Classification
【速读】:该论文旨在解决从原始脑电图(Electroencephalography, EEG)信号中提取有意义特征时面临的挑战,包括噪声干扰、通道变异性和多尺度频率振荡信息的建模难题。其解决方案的关键在于提出一种端到端的深度学习框架,包含三个核心创新:一是设计了一个能够显式捕捉宽频带多尺度频率振荡的编码器,以适应不同EEG相关任务;二是引入基于注意力机制的编码器,同时学习跨通道和单个通道局部区域内的复杂依赖关系;三是集成专用门控网络动态过滤噪声和非信息性通道,提升数据可靠性;此外,通过结合监督学习与对比学习的新颖损失函数,显著增强模型泛化能力。该方法在多种应用场景(如中枢神经系统疾病治疗效果分类及帕金森病、阿尔茨海默病诊断)中验证了其有效性,可自动选择高质量通道并从跨物种原始EEG中提取生物意义明确的模式。
链接: https://arxiv.org/abs/2509.20489
作者: D. Darankoum,C. Habermacher,J. Volle,S. Grudinin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 20 pages (14 pages Main text and 6 pages Supplementary Material)
Abstract:Electroencephalography signals (EEGs) contain rich multi-scale information crucial for understanding brain states, with potential applications in diagnosing and advancing the drug development landscape. However, extracting meaningful features from raw EEG signals while handling noise and channel variability remains a major challenge. This work proposes a novel end-to-end deep-learning framework that addresses these issues through several key innovations. First, we designed an encoder capable of explicitly capturing multi-scale frequency oscillations covering a wide range of features for different EEG-related tasks. Secondly, to model complex dependencies and handle the high temporal resolution of EEGs, we introduced an attention-based encoder that simultaneously learns interactions across EEG channels and within localized \em patches of individual channels. We integrated a dedicated gating network on top of the attention encoder to dynamically filter out noisy and non-informative channels, enhancing the reliability of EEG data. The entire encoding process is guided by a novel loss function, which leverages supervised and contrastive learning, significantly improving model generalization. We validated our approach in multiple applications, ranging from the classification of effects across multiple Central Nervous System (CNS) disorders treatments to the diagnosis of Parkinson’s and Alzheimer’s disease. Our results demonstrate that the proposed learning paradigm can extract biologically meaningful patterns from raw EEG signals across different species, autonomously select high-quality channels, and achieve robust generalization through innovative architectural and loss design.
zh
[AI-94] Wartime Media Dynamics in Emerging Democracies: Case Study of Pakistani Media in May 2025 Indo-Pak Conflict
【速读】:该论文试图解决的问题是:在新兴民主国家中,地区冲突如何影响媒体对政治反对派和异议声音的报道,进而削弱民主 discourse 的功能。其解决方案的关键在于利用大语言模型(Large Language Model, LLM)对约2600篇来自巴基斯坦三大主流报纸的新闻文章进行系统分析,量化战争相关报道与政治反对派报道之间的比例变化,从而揭示冲突期间媒体议程偏移的现象,为保障冲突环境下新闻自由提供实证依据。
链接: https://arxiv.org/abs/2509.20419
作者: Taaha Saleem Bajwa
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Accepted as Extended abstract in COLM 2025 workshop on NLP4Democracy
Abstract:Democracies rely on opposition and dissent to function, but in emerging democracies, freedom of speech is often restricted. This effect intensifies during regional conflicts. This study examines how the India-Pakistan conflict of May 2025 influenced Pakistani media coverage. Analyzing approximately 2,600 news articles from three major newspapers using a large language model (LLM), the study found that war-related reporting significantly overshadowed coverage of political opposition and dissent. These findings highlight how conflict can marginalize democratic discourse, reinforcing the need to safeguard press freedom in volatile regions.
zh
[AI-95] A Taxonomy of Data Risks in AI and Quantum Computing (QAI) - A Systematic Review
【速读】:该论文旨在解决量子人工智能(Quantum Artificial Intelligence, QAI)在数据隐私与安全方面存在的系统性风险问题,这些问题源于AI和量子计算(Quantum Computing, QC)各自的数据风险叠加,且尚未被充分研究。解决方案的关键在于通过系统性文献综述(涵盖67项相关研究),构建了一个包含22个关键数据风险的分类体系,分为治理、风险评估、控制实施、用户考量和持续监控五大类,从而识别出QAI特有的脆弱性并填补整体风险评估的空白,为可信AI和QAI研究提供理论基础与未来风险评估工具开发的方向。
链接: https://arxiv.org/abs/2509.20418
作者: Grace Billiris,Asif Gill,Madhushi Bandara
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: 11 pages, 2 figures, 2 tables
Abstract:Quantum Artificial Intelligence (QAI), the integration of Artificial Intelligence (AI) and Quantum Computing (QC), promises transformative advances, including AI-enabled quantum cryptography and quantum-resistant encryption protocols. However, QAI inherits data risks from both AI and QC, creating complex privacy and security vulnerabilities that are not systematically studied. These risks affect the trustworthiness and reliability of AI and QAI systems, making their understanding critical. This study systematically reviews 67 privacy- and security-related studies to expand understanding of QAI data risks. We propose a taxonomy of 22 key data risks, organised into five categories: governance, risk assessment, control implementation, user considerations, and continuous monitoring. Our findings reveal vulnerabilities unique to QAI and identify gaps in holistic risk assessment. This work contributes to trustworthy AI and QAI research and provides a foundation for developing future risk assessment tools.
zh
[AI-96] Adversarial Defense in Cybersecurity: A Systematic Review of GANs for Threat Detection and Mitigation
【速读】:该论文旨在解决机器学习驱动的网络安全系统在面对对抗性攻击时的脆弱性问题,提出以生成式对抗网络(Generative Adversarial Networks, GANs)为基础的防御机制作为解决方案。其关键在于利用GAN的双刃剑特性——既可作为攻击工具,亦能构建鲁棒的防御体系,通过系统梳理2021年至2025年8月间185篇同行评审研究,构建了一个涵盖防御功能、GAN架构、安全领域与对抗威胁模型的四维分类体系,并识别出WGAN-GP、条件GAN(Conditional GAN, CGAN)及混合GAN模型等关键技术进展,从而提升网络入侵检测、恶意软件分析和物联网安全中的检测精度、鲁棒性和数据效用。
链接: https://arxiv.org/abs/2509.20411
作者: Tharcisse Ndayipfukamiye,Jianguo Ding,Doreen Sebastian Sarwatt,Adamu Gaston Philipo,Huansheng Ning
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 35 pages, 10 tables, 4figures
Abstract:Machine learning-based cybersecurity systems are highly vulnerable to adversarial attacks, while Generative Adversarial Networks (GANs) act as both powerful attack enablers and promising defenses. This survey systematically reviews GAN-based adversarial defenses in cybersecurity (2021–August 31, 2025), consolidating recent progress, identifying gaps, and outlining future directions. Using a PRISMA-compliant systematic literature review protocol, we searched five major digital libraries. From 829 initial records, 185 peer-reviewed studies were retained and synthesized through quantitative trend analysis and thematic taxonomy development. We introduce a four-dimensional taxonomy spanning defensive function, GAN architecture, cybersecurity domain, and adversarial threat model. GANs improve detection accuracy, robustness, and data utility across network intrusion detection, malware analysis, and IoT security. Notable advances include WGAN-GP for stable training, CGANs for targeted synthesis, and hybrid GAN models for improved resilience. Yet, persistent challenges remain such as instability in training, lack of standardized benchmarks, high computational cost, and limited explainability. GAN-based defenses demonstrate strong potential but require advances in stable architectures, benchmarking, transparency, and deployment. We propose a roadmap emphasizing hybrid models, unified evaluation, real-world integration, and defenses against emerging threats such as LLM-driven cyberattacks. This survey establishes the foundation for scalable, trustworthy, and adaptive GAN-powered defenses.
zh
[AI-97] Defending against Stegomalware in Deep Neural Networks with Permutation Symmetry
【速读】:该论文旨在解决神经网络隐写恶意软件(neural network stegomalware)的安全威胁问题,即攻击者可将恶意代码嵌入深度神经网络检查点(checkpoint)中,且对模型精度影响极小,从而在不被察觉的情况下传播恶意行为。此类攻击目前在深度学习从业者和安全专家中均未得到足够重视。论文提出了一种有效的防御方法:通过打乱权重矩阵或偏置矩阵的列顺序(等价于卷积层通道顺序的重排),即可高效且无损地破坏已嵌入的恶意载荷,而不会影响模型准确率。该方案的核心在于利用了当前主流神经网络隐写技术对权重结构敏感性的弱点,实现了对stegomalware的有效中和,且显著优于现有对抗方法。
链接: https://arxiv.org/abs/2509.20399
作者: Birk Torpmann-Hagen,Michael A. Riegler,Pål Halvorsen,Dag Johansen
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Deep neural networks are being utilized in a growing number of applications, both in production systems and for personal use. Network checkpoints are as a consequence often shared and distributed on various platforms to ease the development process. This work considers the threat of neural network stegomalware, where malware is embedded in neural network checkpoints at a negligible cost to network accuracy. This constitutes a significant security concern, but is nevertheless largely neglected by the deep learning practitioners and security specialists alike. We propose the first effective countermeasure to these attacks. In particular, we show that state-of-the-art neural network stegomalware can be efficiently and effectively neutralized through shuffling the column order of the weight- and bias-matrices, or equivalently the channel-order of convolutional layers. We show that this effectively corrupts payloads that have been embedded by state-of-the-art methods in neural network steganography at no cost to network accuracy, outperforming competing methods by a significant margin. We then discuss possible means by which to bypass this defense, additional defense methods, and advocate for continued research into the security of machine learning systems.
zh
[AI-98] Centralized vs. Decentralized Security for Space AI Systems? A New Look
【速读】:该论文旨在解决卫星星座中集中式与分布式安全管理体系之间的权衡问题,以实现安全性与性能的平衡。其解决方案的关键在于提出三种自动化安全管理体系架构:集中式、分布式和联邦式(federated),其中集中式架构在短期内更优,因其具备快速训练能力,尽管面临空间通信延迟带来的挑战;而分布式架构则在长期更具优势,能够提升可扩展性和安全性。
链接: https://arxiv.org/abs/2509.20395
作者: Noam Schmitt(IP Paris, TSP, ENS Paris Saclay),Marc Antoine Lacoste
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: IEEE HPEC 2025 - 29th Annual IEEE High Performance Extreme Computing Virtual Conference, MIT Lincoln Laboratory, Sep 2025, Boston (MA), United States
Abstract:This paper investigates the trade-off between centralized and decentralized security management in constellations of satellites to balance security and performance. We highlight three key AI architectures for automated security management: (a) centralized, (b) distributed and © federated. The centralized architecture is the best option short term, providing fast training, despite the hard challenge of the communication latency overhead across space. Decentralized architectures are better alternatives in the longer term, providing enhanced scalability and security.
zh
[AI-99] he Secret Agenda: LLM s Strategically Lie and Our Current Safety Tools Are Blind
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)中战略欺骗行为的检测与控制问题,特别是在目标导向情境下模型如何通过说谎实现策略优势。其关键解决方案在于对比两种不同的可解释性方法:一是基于自动标注的神经激活特征(autolabeled SAE features),二是基于未标注激活模式的群体级分析(unlabeled SAE activations)。研究发现,前者在识别和干预战略性谎言时效果有限,而后者通过热力图和t-SNE可视化揭示了欺骗与合规响应之间的判别性结构,从而为风险评估提供了可行的群体层面指标。这一结果表明,当前依赖自动标签的可解释性方法不足以捕捉复杂行为欺骗,而无监督的激活模式分析可能更适用于真实场景中的伦理风险监测。
链接: https://arxiv.org/abs/2509.20393
作者: Caleb DeLeeuw,Gaurav Chawla,Aniket Sharma,Vanessa Dietze
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages plus citations and appendix, 7 figures
Abstract:We investigate strategic deception in large language models using two complementary testbeds: Secret Agenda (across 38 models) and Insider Trading compliance (via SAE architectures). Secret Agenda reliably induced lying when deception advantaged goal achievement across all model families. Analysis revealed that autolabeled SAE features for “deception” rarely activated during strategic dishonesty, and feature steering experiments across 100+ deception-related features failed to prevent lying. Conversely, insider trading analysis using unlabeled SAE activations separated deceptive versus compliant responses through discriminative patterns in heatmaps and t-SNE visualizations. These findings suggest autolabel-driven interpretability approaches fail to detect or control behavioral deception, while aggregate unlabeled activations provide population-level structure for risk assessment. Results span Llama 8B/70B SAE implementations and GemmaScope under resource constraints, representing preliminary findings that motivate larger studies on feature discovery, labeling methodology, and causal interventions in realistic deception contexts.
zh
[AI-100] Can You Trust Your Copilot? A Privacy Scorecard for AI Coding Assistants
【速读】:该论文旨在解决生成式 AI 编程助手(Generative AI Coding Assistants)在开发者工作流中广泛应用时所引发的隐私与信任问题,特别是由于这些工具的数据处理实践不透明而带来的安全与合规风险。解决方案的关键在于提出并应用了一种由专家验证的隐私评分卡(privacy scorecard),通过系统分析法律政策、用户协议、第三方审计报告等四类文档,对五款主流编程助手按14项加权标准进行量化评估,从而揭示各工具在隐私保护方面的差异,并识别出行业普遍存在的薄弱环节,如默认采用“退出同意”机制用于模型训练、缺乏主动过滤用户提示中的敏感信息等,最终为开发者和组织提供基于证据的选型依据,推动行业向以用户为中心的隐私标准转型。
链接: https://arxiv.org/abs/2509.20388
作者: Amir AL-Maamari
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid integration of AI-powered coding assistants into developer workflows has raised significant privacy and trust concerns. As developers entrust proprietary code to services like OpenAI’s GPT, Google’s Gemini, and GitHub Copilot, the unclear data handling practices of these tools create security and compliance risks. This paper addresses this challenge by introducing and applying a novel, expert-validated privacy scorecard. The methodology involves a detailed analysis of four document types; from legal policies to external audits; to score five leading assistants against 14 weighted criteria. A legal expert and a data protection officer refined these criteria and their weighting. The results reveal a distinct hierarchy of privacy protections, with a 20-point gap between the highest- and lowest-ranked tools. The analysis uncovers common industry weaknesses, including the pervasive use of opt-out consent for model training and a near-universal failure to filter secrets from user prompts proactively. The resulting scorecard provides actionable guidance for developers and organizations, enabling evidence-based tool selection. This work establishes a new benchmark for transparency and advocates for a shift towards more user-centric privacy standards in the AI industry.
zh
[AI-101] Dynamic ReAct: Scalable Tool Selection for Large-Scale MCP Environments
【速读】:该论文旨在解决大规模工具集环境下,基于ReAct(Reasoning + Acting)框架的智能体因上下文记忆限制而难以高效进行工具选择的问题。在包含数百甚至上千个可用工具的环境中,同时加载全部工具会导致计算资源不可行。解决方案的关键在于提出五种逐步优化的架构设计,最终实现一种“搜索-加载”机制,能够在保持任务完成准确率的前提下,将工具加载量减少高达50%,从而显著提升智能体对多样化任务环境的动态适应能力。
链接: https://arxiv.org/abs/2509.20386
作者: Nishant Gaurav,Adit Akarsh,Ankit Ranjan,Manoj Bajaj
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:We present Dynamic ReAct, a novel approach for enabling ReAct agents to ef- ficiently operate with extensive Model Control Protocol (MCP) tool sets that exceed the contextual memory limitations of large language models. Our approach addresses the fundamental challenge of tool selection in environments containing hundreds or thousands of available tools, where loading all tools simultaneously is computationally infeasible. We propose and evaluate five distinct architectures that progressively refine the tool selection process, culminating in a search-and-load mechanism that achieves intelligent tool selection with minimal computational overhead. Our experimental results demonstrate that the proposed approach reduces tool loading by up to 50% while maintaining task completion accuracy, advancing the path towards truly general-purpose AI agents capable of dynamically adapting to diverse task environments.
zh
[AI-102] R1-Fuzz: Specializing Language Models for Textual Fuzzing via Reinforcement Learning
【速读】:该论文旨在解决复杂目标(如编译器、解释器和数据库引擎)在生成符合其语法和语义约束的输入时,传统模糊测试(fuzzing)方法效率低下的问题。这些问题通常需要深入理解程序逻辑,而现有基于语言模型(language models, LMs)的方法因缺乏对真实代码库中深层程序逻辑的探索以及大模型带来的高计算成本难以有效应用。解决方案的关键在于提出 R1-Fuzz 框架,首次将强化学习(reinforcement learning, RL)用于专门化低成本的语言模型,并通过两个核心设计实现高效集成:基于覆盖率切片的问题构造机制与基于距离的奖励计算策略。该框架通过 RL 后训练使小规模模型(如 R1-Fuzz-7B)能够精准推理程序语义,从而在实际场景中达到甚至超越大型模型的模糊测试效果,显著提升覆盖率并发现多个未知漏洞。
链接: https://arxiv.org/abs/2509.20384
作者: Jiayi Lin,Liangcai Su,Junzhe Li,Chenxiong Qian
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Programming Languages (cs.PL); Software Engineering (cs.SE)
备注:
Abstract:Fuzzing is effective for vulnerability discovery but struggles with complex targets such as compilers, interpreters, and database engines, which accept textual input that must satisfy intricate syntactic and semantic constraints. Although language models (LMs) have attracted interest for this task due to their vast latent knowledge and reasoning potential, their practical adoption has been limited. The major challenges stem from insufficient exploration of deep program logic among real-world codebases, and the high cost of leveraging larger models. To overcome these challenges, we propose R1-Fuzz, the first framework that leverages reinforcement learning (RL) to specialize cost-efficient LMs and integrate them for complex textual fuzzing input generation. R1-Fuzz introduces two key designs: coverage-slicing-based question construction and a distance-based reward calculation. Through RL-based post-training of a model with our constructed dataset, R1-Fuzz designs a fuzzing workflow that tightly integrates LMs to reason deep program semantics during fuzzing. Evaluations on diverse real-world targets show that our design enables a small model, named R1-Fuzz-7B, to rival or even outperform much larger models in real-world fuzzing. Notably, R1-Fuzz achieves up to 75% higher coverage than state-of-the-art fuzzers and discovers 29 previously unknown vulnerabilities, demonstrating its practicality.
zh
[AI-103] MARS: A Malignity-Aware Backdoor Defense in Federated Learning NEURIPS2025
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中针对后门攻击(backdoor attack)的防御失效问题,特别是针对最新提出的SOTA攻击3DFed(SP2023)——该攻击通过指示机制判断后门模型是否被接收方采纳,并自适应优化后门模型,导致现有防御方法失效。解决方案的关键在于提出一种恶意感知的防御框架MARS(Malignity-Aware backdooR defenSe),其核心创新是引入“后门能量”(Backdoor Energy, BE)来量化每个神经元的恶意程度,并进一步提取每模型中最显著的BE值形成“集中后门能量”(Concentrated Backdoor Energy, CBE)以增强恶意信号;最终利用基于Wasserstein距离的聚类方法有效识别出携带后门的模型,从而实现对SOTA后门攻击的鲁棒防御。
链接: https://arxiv.org/abs/2509.20383
作者: Wei Wan,Yuxuan Ning,Zhicong Huang,Cheng Hong,Shengshan Hu,Ziqi Zhou,Yechao Zhang,Tianqing Zhu,Wanlei Zhou,Leo Yu Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: NeurIPS 2025
Abstract:Federated Learning (FL) is a distributed paradigm aimed at protecting participant data privacy by exchanging model parameters to achieve high-quality model training. However, this distributed nature also makes FL highly vulnerable to backdoor attacks. Notably, the recently proposed state-of-the-art (SOTA) attack, 3DFed (SP2023), uses an indicator mechanism to determine whether the backdoor models have been accepted by the defender and adaptively optimizes backdoor models, rendering existing defenses ineffective. In this paper, we first reveal that the failure of existing defenses lies in the employment of empirical statistical measures that are loosely coupled with backdoor attacks. Motivated by this, we propose a Malignity-Aware backdooR defenSe (MARS) that leverages backdoor energy (BE) to indicate the malicious extent of each neuron. To amplify malignity, we further extract the most prominent BE values from each model to form a concentrated backdoor energy (CBE). Finally, a novel Wasserstein distance-based clustering method is introduced to effectively identify backdoor models. Extensive experiments demonstrate that MARS can defend against SOTA backdoor attacks and significantly outperforms existing defenses.
zh
[AI-104] Lightweight MobileNetV1GRU for ECG Biometric Authentication: Federated and Adversarial Evaluation
【速读】:该论文旨在解决心电图(ECG)生物特征识别在可穿戴设备上部署时面临的实时处理、隐私保护及伪造攻击脆弱性等问题。其解决方案的关键在于提出一种轻量级深度学习模型(MobileNetV1+GRU),结合20dB高斯噪声的定制预处理方法,在多个公开ECG数据集(ECGID、MIT-BIH、CYBHi和PTB)上实现了高准确率(最高达99.34%)与优异的性能指标(如F1-score > 0.91,EER < 0.01),同时通过对抗测试揭示了模型对快速梯度符号法(FGSM)攻击的敏感性,强调了联邦学习与多样化可穿戴生理数据集对于构建安全、可扩展ECG生物识别系统的重要性。
链接: https://arxiv.org/abs/2509.20382
作者: Dilli Hang Rai,Sabin Kafley
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注: 5 pages, 7 figures, 5 tables
Abstract:ECG biometrics offer a unique, secure authentication method, yet their deployment on wearable devices faces real-time processing, privacy, and spoofing vulnerability challenges. This paper proposes a lightweight deep learning model (MobileNetV1+GRU) for ECG-based authentication, injection of 20dB Gaussian noise custom preprocessing. We simulate wearable conditions and edge deployment using the ECGID, MIT-BIH, CYBHi, and PTB datasets, achieving accuracies of 99.34%, 99.31%, 91.74%, and 98.49%, F1-scores of 0.9869, 0.9923, 0.9125, and 0.9771, Precision of 0.9866, 0.9924, 0.9180 and 0.9845, Recall of 0.9878, 0.9923, 0.9129, and 0.9756, equal error rates (EER) of 0.0009, 0.00013, 0.0091, and 0.0009, and ROC-AUC values of 0.9999, 0.9999, 0.9985, and 0.9998, while under FGSM adversarial attacks, accuracy drops from 96.82% to as low as 0.80%. This paper highlights federated learning, adversarial testing, and the need for diverse wearable physiological datasets to ensure secure and scalable biometrics.
zh
[AI-105] ACCeLLiuM: Supervised Fine-Tuning for Automated OpenACC Prag ma Generation
【速读】:该论文旨在解决当前GPU编程中因硬件复杂性和并行编程框架(如OpenACC)门槛较高而导致的开发效率低下问题。尽管OpenACC等指令式并行编程标准通过抽象低层细节简化了GPU编程,但仍需专业人员具备较强领域知识才能有效使用其指令(directive)。为此,作者提出ACCeLLiuM方案,其关键在于构建了一个包含4,033个OpenACC pragma-loop配对的数据集,并基于此数据集对两个开源大语言模型(Large Language Models, LLMs)进行监督微调(supervised fine-tuning, SFT),使其能够自动生成适用于数据并行循环的专家级OpenACC指令。实验表明,微调后的模型在测试集上能以87%的准确率生成正确类型的指令,且50%情况下可生成与真实标签完全一致的pragma(包括指令类型、子句、顺序及变量),即使未完全匹配,生成结果也常包含合理子句或优化结构,体现出实际应用价值。该工作通过公开代码、模型和数据集,为基于大语言模型的OpenACC指令生成提供了一个可复现的基准,显著降低了串行程序自动向GPU迁移的门槛。
链接: https://arxiv.org/abs/2509.20380
作者: Samyak Jhaveri,Vanessa Klotzmann,Crista Lopes
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注:
Abstract:The increasing ubiquity of GPUs is accompanied by the increasing complexity of their hardware and parallel programming frameworks. Directive-based parallel programming standards like OpenACC simplify GPU programming to some extent by abstracting away low-level complexities, but a fair amount of expertise is still required in order to use those directives effectively. We introduce ACCeLLiuM, two open weights Large Language Models specifically fine-tuned for generating expert OpenACC directives for data-parallel loops, along with the supervised fine-tuning dataset that was used to train them. The ACCeLLiuM SFT dataset contains 4,033 OpenACC pragma-loop pairs mined from public GitHub C/C++ repositories, with 3,223 pairs for training and 810 for testing. Experimental evaluations show a pronounced performance gap in generating correct OpenACC pragmas between base LLMs and our fine-tuned versions. On the held-out test set, base LLMs fail to consistently generate valid pragmas, whereas LLMs fine-tuned on the ACCeLLiuM dataset generate valid pragmas with the correct directive type for 87% of the data-parallel loops, and exact pragmas–including directives, clauses, clause order, and clause variables–for 50% of the cases. Even when not exact, generated pragmas frequently incorporate the correct clauses in a different order than the ground-truth label, or include additional clauses that enable finer control over parallel execution, data movement, and concurrency, offering practical value beyond strict string-matching. By publicly releasing the code, models, and dataset as ACCeLLiuM we hope to establish a reproducible benchmark for LLM-powered OpenACC pragma generation, and lower the barrier to automated GPU offloading of serially written programs. Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL) Cite as: arXiv:2509.20380 [cs.SE] (or arXiv:2509.20380v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2509.20380 Focus to learn more arXiv-issued DOI via DataCite
zh
[AI-106] Philosophy-informed Machine Learning
【速读】:该论文试图解决当前机器学习(Machine Learning, ML)模型在设计与应用中缺乏对哲学概念与价值的系统性尊重,从而导致伦理失范、价值错位及可解释性不足的问题。其解决方案的关键在于引入哲学导向的机器学习(Philosophy-informed Machine Learning, PhIML),通过将分析哲学的核心思想直接嵌入到ML模型架构、目标函数和评估协议中,使模型从设计之初就具备哲学一致性与价值对齐能力。论文进一步提出PhIML可作为后验工具或内生构建模块灵活应用,并指出技术障碍与治理挑战,为实现安全、伦理负责任的PhIML提供研究路线图。
链接: https://arxiv.org/abs/2509.20370
作者: MZ Naser
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:
Abstract:Philosophy-informed machine learning (PhIML) directly infuses core ideas from analytic philosophy into ML model architectures, objectives, and evaluation protocols. Therefore, PhIML promises new capabilities through models that respect philosophical concepts and values by design. From this lens, this paper reviews conceptual foundations to demonstrate philosophical gains and alignment. In addition, we present case studies on how ML users/designers can adopt PhIML as an agnostic post-hoc tool or intrinsically build it into ML model architectures. Finally, this paper sheds light on open technical barriers alongside philosophical, practical, and governance challenges and outlines a research roadmap toward safe, philosophy-aware, and ethically responsible PhIML.
zh
[AI-107] AI-driven formative assessment and adaptive learning in data-science education: Evaluating an LLM -powered virtual teaching assistant
【速读】:该论文旨在解决传统教学在大规模数据科学人才培养中面临的 scalability(可扩展性)与个性化支持不足的问题,尤其是在生成式 AI (Generative AI) 应用日益广泛的背景下,如何实现高效、可追踪且具备伦理意识的学习干预。其核心解决方案是提出 VITA(Virtual Teaching Assistants)平台,该平台基于自适应分布式学习(Adaptive Distributed Learning, ADL)架构,融合大语言模型(Large Language Model, LLM)驱动的对话式助教 BotCaptain,通过上下文感知的对话辅导、基于 Experience API (xAPI) 的可互操作分析机制以及面向诚信评估的形成性测评模式,构建了一个闭环的智能学习支持系统。关键创新在于:一是将聊天日志转化为结构化 xAPI 语句以实现行为数据的标准化采集;二是设计了用于即时干预的仪表盘和自适应路径引擎,动态引导学习者进入进阶、强化或补救内容模块;三是提供了一套可复用的对话式分析架构与完整性保障的形成性评估模式,为多课程部署提供了实践蓝图。
链接: https://arxiv.org/abs/2509.20369
作者: Fadjimata I Anaroua,Qing Li,Yan Tang,Hong P. Liu
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:This paper presents VITA (Virtual Teaching Assistants), an adaptive distributed learning (ADL) platform that embeds a large language model (LLM)-powered chatbot (BotCaptain) to provide dialogic support, interoperable analytics, and integrity-aware assessment for workforce preparation in data science. The platform couples context-aware conversational tutoring with formative-assessment patterns designed to promote reflective reasoning. The paper describes an end-to-end data pipeline that transforms chat logs into Experience API (xAPI) statements, instructor dashboards that surface outliers for just-in-time intervention, and an adaptive pathway engine that routes learners among progression, reinforcement, and remediation content. The paper also benchmarks VITA conceptually against emerging tutoring architectures, including retrieval-augmented generation (RAG)–based assistants and Learning Tools Interoperability (LTI)–integrated hubs, highlighting trade-offs among content grounding, interoperability, and deployment complexity. Contributions include a reusable architecture for interoperable conversational analytics, a catalog of patterns for integrity-preserving formative assessment, and a practical blueprint for integrating adaptive pathways into data-science courses. The paper concludes with implementation lessons and a roadmap (RAG integration, hallucination mitigation, and LTI~1.3 / OpenID Connect) to guide multi-course evaluations and broader adoption. In light of growing demand and scalability constraints in traditional instruction, the approach illustrates how conversational AI can support engagement, timely feedback, and personalized learning at scale. Future work will refine the platform’s adaptive intelligence and examine applicability across varied educational settings.
zh
[AI-108] LATTS: Locally Adaptive Test-Time Scaling
【速读】:该论文旨在解决现有基于验证器(verifier)的测试时扩展(test-time scaling)方法在计算资源分配上缺乏灵活性的问题,即这些方法通常对所有样本和生成步骤均匀增加计算量,忽略了个体实例的局部难度差异,导致资源利用效率低下。解决方案的关键在于提出一种局部自适应测试时扩展(Locally Adaptive Test-Time Scaling, LATTS)方法,该方法在每一步生成过程中根据验证器模型提供的局部难度判断,动态决定是否进行重采样、回溯、重启或终止生成,从而实现计算资源在不同生成步骤上的差异化分配,显著提升了准确率与计算开销之间的权衡性能。
链接: https://arxiv.org/abs/2509.20368
作者: Theo Uscidda,Matthew Trager,Michael Kleinman,Aditya Chattopadhyay,Wei Xia,Stefano Soatto
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:One common strategy for improving the performance of Large Language Models (LLMs) on downstream tasks involves using a \emphverifier model to either select the best answer from a pool of candidates or to steer the auto-regressive generation process towards better outputs. This class of methods typically results in improved accuracy at the cost of increased computation at test-time, a paradigm known as \emphtest-time scaling. However, most existing approaches increase computation uniformly across all samples and generation steps, without considering the complexity of individual instances, leading to inefficient resource use. We address this limitation by proposing an approach, called \emphLocally Adaptive Test-Time Scaling (LATTS), that allocates variable compute across generation steps. Specifically, at each generation step, LATTS employs a verifier-based acceptance criterion to decide whether to resample, backtrack, restart, or stop the generation process. This criterion effectively adjusts the per-step computational effort based on a precise notion of \emphlocal difficulty derived from the verifier model. Empirical results show that LATTS achieves significantly superior accuracy–compute tradeoffs compared to standard verifier-based methods.
zh
[AI-109] An Approach to Checking Correctness for Agent ic Systems
【速读】:该论文旨在解决基于大语言模型(Large Language Models, LLMs)的智能体系统在实际部署中因随机生成过程导致输出不稳定、难以进行可靠错误检测的问题。当前主流方法依赖于输入输出文本匹配,但其对自然语言变异性敏感,易产生误判。解决方案的关键在于引入一种时序表达式语言(temporal expression language),通过监控智能体工具调用序列和状态转移轨迹,抽象出行为模式的逻辑断言,从而实现与具体文本内容无关的行为验证。该方法能够有效识别由于模型能力不足或逻辑变更引发的行为异常,如工具调用顺序错误或协作交接失败,并支持开发阶段的提示工程验证与生产环境中的回归测试,为高可靠性AI代理系统的持续监控提供了可扩展的理论基础和技术框架。
链接: https://arxiv.org/abs/2509.20364
作者: Thomas J Sheffler
机构: 未知
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: 15 pages, 5 figures
Abstract:This paper presents a temporal expression language for monitoring AI agent behavior, enabling systematic error-detection of LLM-based agentic systems that exhibit variable outputs due to stochastic generation processes. Drawing from temporal logic techniques used in hardware verification, this approach monitors execution traces of agent tool calls and state transitions to detect deviations from expected behavioral patterns. Current error-detection approaches rely primarily on text matching of inputs and outputs, which proves fragile due to the natural language variability inherent in LLM responses. The proposed method instead focuses on the sequence of agent actions – such as tool invocations and inter-agent communications – allowing verification of system behavior independent of specific textual outputs. The temporal expression language provides assertions that capture correct behavioral patterns across multiple execution scenarios. These assertions serve dual purposes: validating prompt engineering and guardrail effectiveness during development, and providing regression testing when agents are updated with new LLMs or modified logic. The approach is demonstrated using a three-agent system, where agents coordinate to solve multi-step reasoning tasks. When powered by large, capable models, all temporal assertions were satisfied across many test runs. However, when smaller models were substituted in two of the three agents, executions violated behavioral assertions, primarily due to improper tool sequencing and failed coordination handoffs. The temporal expressions successfully flagged these anomalies, demonstrating the method’s effectiveness for detecting behavioral regressions in production agentic systems. This approach provides a foundation for systematic monitoring of AI agent reliability as these systems become increasingly deployed in critical applications.
zh
[AI-110] Best-of-infty – Asymptotic Performance of Test-Time Compute
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理阶段计算资源浪费的问题,特别是针对“Best-of-N”采样策略中因固定N值导致的测试时预算(inference-time budget)低效分配问题。其核心挑战在于:虽然当N趋近于无穷大(即Best-of-∞)时可获得最优性能,但实际应用中无法实现无限计算资源。解决方案的关键在于提出一种自适应生成机制,根据模型输出的一致性动态调整N值,从而高效利用推理计算资源;同时,进一步扩展框架为多模型加权集成(weighted ensemble),并通过混合整数线性规划(mixed-integer linear program)求解最优权重,实验证明该方法可超越单一模型性能。
链接: https://arxiv.org/abs/2509.21091
作者: Junpei Komiyama,Daisuke Oba,Masafumi Oyamada
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We study best-of- N for large language models (LLMs) where the selection is based on majority voting. In particular, we analyze the limit N \to \infty , which we denote as Best-of- \infty . While this approach achieves impressive performance in the limit, it requires an infinite test-time budget. To address this, we propose an adaptive generation scheme that selects N based on answer agreement, thereby efficiently allocating inference-time computation. Beyond adaptivity, we extend the framework to weighted ensembles of multiple LLMs, showing that such mixtures can outperform any individual model. The optimal ensemble weighting is formulated and efficiently computed as a mixed-integer linear program. Extensive experiments demonstrate the effectiveness of our approach.
zh
[AI-111] Incorporating LLM Embeddings for Variation Across the Human Genome
【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的生物数据表示方法主要局限于基因层面,缺乏对全基因组范围内变异(variant-level)信息的有效建模问题。其解决方案的关键在于构建了一个系统性的框架,用于生成覆盖整个人类基因组的变异级嵌入(variant-level embeddings),通过整合FAVOR、ClinVar和GWAS Catalog等权威注释资源,为89亿种可能的变异生成语义文本描述,并利用OpenAI的text-embedding-3-large与开源Qwen3-Embedding-0.6B模型在三个尺度上(HapMap3+MEGA、UK Biobank imputed variants及全部可能变异)生成高质量嵌入表示。这些嵌入被验证具有高预测准确性,可直接用于下游任务如嵌入引导的假设检验和增强型遗传风险预测,从而推动大规模基因组发现与精准医学的发展。
链接: https://arxiv.org/abs/2509.20702
作者: Hongqian Niu,Jordan Bryan,Xihao Li,Didong Li
机构: 未知
类目: Applications (stat.AP); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
备注:
Abstract:Recent advances in large language model (LLM) embeddings have enabled powerful representations for biological data, but most applications to date focus only on gene-level information. We present one of the first systematic frameworks to generate variant-level embeddings across the entire human genome. Using curated annotations from FAVOR, ClinVar, and the GWAS Catalog, we constructed semantic text descriptions for 8.9 billion possible variants and generated embeddings at three scales: 1.5 million HapMap3+MEGA variants, ~90 million imputed UK Biobank variants, and ~9 billion all possible variants. Embeddings were produced with both OpenAI’s text-embedding-3-large and the open-source Qwen3-Embedding-0.6B models. Baseline experiments demonstrate high predictive accuracy for variant properties, validating the embeddings as structured representations of genomic variation. We outline two downstream applications: embedding-informed hypothesis testing by extending the Frequentist And Bayesian framework to genome-wide association studies, and embedding-augmented genetic risk prediction that enhances standard polygenic risk scores. These resources, publicly available on Hugging Face, provide a foundation for advancing large-scale genomic discovery and precision medicine.
zh
[AI-112] Recidivism and Peer Influence with LLM Text Embeddings in Low Security Correctional Facilities
【速读】:该论文旨在解决如何利用生成式 AI(Generative AI)提取的文本嵌入向量来更准确预测低安保等级矫正设施中居民的再犯风险,并揭示其中的社会互动机制。其关键解决方案在于:首先,使用预训练的基于 Transformer 的大语言模型(Large Language Model, LLM)对大量书面肯定陈述与纠错交流文本进行嵌入,获得高维语义向量,显著提升再犯预测准确率(较仅使用入狱前协变量提高30%);其次,通过零样本分类(Zero-Shot classification)将高维嵌入映射至用户定义的低维类别向量,在保持预测性能的同时增强可解释性;最后,构建适用于稀疏网络、多维潜在变量及相关多维结果的新方法论与理论框架,用于估计LLM生成的语言表征中的同伴效应(peer effects),发现语言使用在互动与反馈中存在显著同伴效应。
链接: https://arxiv.org/abs/2509.20634
作者: Shanjukta Nath,Jiwon Hong,Jae Ho Chang,Keith Warren,Subhadeep Paul
机构: 未知
类目: Econometrics (econ.EM); Artificial Intelligence (cs.AI); General Economics (econ.GN); Methodology (stat.ME)
备注:
Abstract:We find AI embeddings obtained using a pre-trained transformer-based Large Language Model (LLM) of 80,000-120,000 written affirmations and correction exchanges among residents in low-security correctional facilities to be highly predictive of recidivism. The prediction accuracy is 30% higher with embedding vectors than with only pre-entry covariates. However, since the text embedding vectors are high-dimensional, we perform Zero-Shot classification of these texts to a low-dimensional vector of user-defined classes to aid interpretation while retaining the predictive power. To shed light on the social dynamics inside the correctional facilities, we estimate peer effects in these LLM-generated numerical representations of language with a multivariate peer effect model, adjusting for network endogeneity. We develop new methodology and theory for peer effect estimation that accommodate sparse networks, multivariate latent variables, and correlated multivariate outcomes. With these new methods, we find significant peer effects in language usage for interaction and feedback.
zh
[AI-113] Variational Low-Rank Adaptation for Personalized Impaired Speech Recognition
【速读】:该论文旨在解决自动语音识别(ASR)系统在处理非典型语音(non-normative speech)时性能显著下降的问题,这类语音常见于先天性疾病(如脑性瘫痪、唐氏综合征、阿佩特综合征)或获得性脑损伤(如中风、外伤、肿瘤)导致的言语障碍。现有主流ASR模型(如Whisper)受限于训练数据稀缺和高声学变异性,难以有效适应个体差异。为应对这一挑战,作者提出一种基于贝叶斯低秩适配(Bayesian Low-rank Adaptation)的ASR个性化方法,其关键在于通过低秩参数更新实现高效微调,在少量标注数据下即可显著提升对非典型语音的识别准确率,同时降低数据收集与标注负担,适用于资源匮乏环境中个体化语音识别需求。
链接: https://arxiv.org/abs/2509.20397
作者: Niclas Pokel,Pehuén Moure,Roman Boehringer,Shih-Chii Liu,Yingqiang Gao
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注:
Abstract:Speech impairments resulting from congenital disorders, such as cerebral palsy, down syndrome, or apert syndrome, as well as acquired brain injuries due to stroke, traumatic accidents, or tumors, present major challenges to automatic speech recognition (ASR) systems. Despite recent advancements, state-of-the-art ASR models like Whisper still struggle with non-normative speech due to limited training data availability and high acoustic variability. Moreover, collecting and annotating non-normative speech is burdensome: speaking is effortful for many affected individuals, while laborious annotation often requires caregivers familiar with the speaker. This work introduces a novel ASR personalization method based on Bayesian Low-rank Adaptation for data-efficient fine-tuning. We validate our method on the English UA-Speech dataset and a newly collected German speech dataset, BF-Sprache, from a child with structural speech impairment. The dataset and approach are designed to reflect the challenges of low-resource settings that include individuals with speech impairments. Our method significantly improves ASR accuracy for impaired speech while maintaining data and annotation efficiency, offering a practical path toward inclusive ASR.
zh
[AI-114] Data-Efficient ASR Personalization for Non-Normative Speech Using an Uncertainty-Based Phoneme Difficulty Score for Guided Sampling
【速读】:该论文旨在解决自动语音识别(ASR)系统在处理因脑性瘫痪或结构性异常等疾病导致的非典型语音时性能下降的问题,其核心挑战在于语音的高声学变异性以及用于训练的数据稀缺性。解决方案的关键在于提出一种数据高效的个性化方法,通过蒙特卡洛Dropout(Monte Carlo Dropout)量化模型在音素层面的不确定性,并据此设计靶向过采样策略,从而优化模型微调过程。实验表明,该方法所生成的不确定性指标与临床语言治疗师报告中识别出的困难音素高度相关,首次实现了模型不确定性与专家对语音难度评估的一致性对齐,显著提升了ASR准确率,为构建个性化和包容性的语音识别系统提供了实用框架。
链接: https://arxiv.org/abs/2509.20396
作者: Niclas Pokel,Pehuén Moure,Roman Boehringer,Yingqiang Gao
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:
Abstract:Automatic speech recognition (ASR) systems struggle with non-normative speech from individuals with impairments caused by conditions like cerebral palsy or structural anomalies. The high acoustic variability and scarcity of training data severely degrade model performance. This work introduces a data-efficient personalization method that quantifies phoneme-level uncertainty to guide fine-tuning. We leverage Monte Carlo Dropout to estimate which phonemes a model finds most difficult and use these estimates for a targeted oversampling strategy. We validate our method on English and German datasets. Crucially, we demonstrate that our model-derived uncertainty strongly correlates with phonemes identified as challenging in an expert clinical logopedic report, marking, to our knowledge, the first work to successfully align model uncertainty with expert assessment of speech difficulty. Our results show that this clinically-validated, uncertainty-guided sampling significantly improves ASR accuracy, delivering a practical framework for personalized and inclusive ASR.
zh
机器学习
[LG-0] Optimal Robust Recourse with Lp-Bounded Model Change
链接: https://arxiv.org/abs/2509.21293
作者: Phone Kyaw,Kshitij Kayastha,Shahin Jabbari
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recourse provides individuals who received undesirable labels (e.g., denied a loan) from algorithmic decision-making systems with a minimum-cost improvement suggestion to achieve the desired outcome. However, in practice, models often get updated to reflect changes in the data distribution or environment, invalidating the recourse recommendations (i.e., following the recourse will not lead to the desirable outcome). The robust recourse literature addresses this issue by providing a framework for computing recourses whose validity is resilient to slight changes in the model. However, since the optimization problem of computing robust recourse is non-convex (even for linear models), most of the current approaches do not have any theoretical guarantee on the optimality of the recourse. Recent work by Kayastha et. al. provides the first provably optimal algorithm for robust recourse with respect to generalized linear models when the model changes are measured using the L^\infty norm. However, using the L^\infty norm can lead to recourse solutions with a high price. To address this shortcoming, we consider more constrained model changes defined by the L^p norm, where p\geq 1 but p\neq \infty , and provide a new algorithm that provably computes the optimal robust recourse for generalized linear models. Empirically, for both linear and non-linear models, we demonstrate that our algorithm achieves a significantly lower price of recourse (up to several orders of magnitude) compared to prior work and also exhibits a better trade-off between the implementation cost of recourse and its validity. Our empirical analysis also illustrates that our approach provides more sparse recourses compared to prior work and remains resilient to post-processing approaches that guarantee feasibility.
[LG-1] axonomy-aware Dynamic Motion Generation on Hyperbolic Manifolds
链接: https://arxiv.org/abs/2509.21281
作者: Luis Augenstein,Noémie Jaquier,Tamim Asfour,Leonel Rozo
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 8 pages, 6 figures, 1 table
Abstract:Human-like motion generation for robots often draws inspiration from biomechanical studies, which often categorize complex human motions into hierarchical taxonomies. While these taxonomies provide rich structural information about how movements relate to one another, this information is frequently overlooked in motion generation models, leading to a disconnect between the generated motions and their underlying hierarchical structure. This paper introduces the \acgphdm, a novel approach that learns latent representations preserving both the hierarchical structure of motions and their temporal dynamics to ensure physical consistency. Our model achieves this by extending the dynamics prior of the Gaussian Process Dynamical Model (GPDM) to the hyperbolic manifold and integrating it with taxonomy-aware inductive biases. Building on this geometry- and taxonomy-aware frameworks, we propose three novel mechanisms for generating motions that are both taxonomically-structured and physically-consistent: two probabilistic recursive approaches and a method based on pullback-metric geodesics. Experiments on generating realistic motion sequences on the hand grasping taxonomy show that the proposed GPHDM faithfully encodes the underlying taxonomy and temporal dynamics, and generates novel physically-consistent trajectories.
[LG-2] SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips
链接: https://arxiv.org/abs/2509.21271
作者: Xinyu Lian,Masahiro Tanaka,Olatunji Ruwase,Minjia Zhang
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 16 pages, 15 figures
Abstract:The emergence of Superchips represents a significant advancement in next-generation AI hardware. These Superchips employ a tightly coupled heterogeneous architecture that integrates GPU and CPU on the same package, which offers unprecedented computational power. However, there has been scant research investigating how LLM training benefits from this new architecture. In this work, for the first time, we study LLM training solutions based on offloading for Superchips. We observe important differences between Superchips and traditional loosely-coupled GPU-CPU architecture, which necessitate revisiting prevailing assumptions about offloading. Based on that, we present SuperOffload, a Superchip-centric offloading system that simultaneously uses Hopper GPU, Grace CPU, and NVLink-C2C interconnect more efficiently. SuperOffload accomplishes this via a combination of techniques, such as adaptive weight offloading, bucketization repartitioning, Superchip-aware casting, speculative execution, and a highly optimized Adam optimizer for Grace CPUs. Our evaluation of SuperOffload on NVIDIA GH200 demonstrates up to 2.5x throughput improvement compared to state-of-the-art offloading-based systems, enabling training of up to 25B model on a single Superchip while achieving high training throughput. We also extend SuperOffload with ZeRO-style data parallelism and DeepSpeed-Ulysses sequence parallelism, enabling training of 13B model with sequence lengths up to 1 million tokens on 8 GH200 while achieving 55% MFU.
[LG-3] humancompatible.train: Implementing Optimization Algorithms for Stochastically-Constrained Stochastic Optimization Problems NEURIPS
链接: https://arxiv.org/abs/2509.21254
作者: Andrii Kliachkin,Jana Lepšová,Gilles Bareilles,Jakub Mareček
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: Accepted at NeurIPS workshop COML 2025
Abstract:There has been a considerable interest in constrained training of deep neural networks (DNNs) recently for applications such as fairness and safety. Several toolkits have been proposed for this task, yet there is still no industry standard. We present this http URL (this https URL), an easily-extendable PyTorch-based Python package for training DNNs with stochastic constraints. We implement multiple previously unimplemented algorithms for stochastically constrained stochastic optimization. We demonstrate the toolkit use by comparing two algorithms on a deep learning task with fairness constraints.
[LG-4] Federated Flow Matching
链接: https://arxiv.org/abs/2509.21250
作者: Zifan Wang,Anqi Dong,Mahmoud Selim,Michael M. Zavlanos,Karl H. Johansson
类目: Machine Learning (cs.LG)
*备注:
Abstract:Data today is decentralized, generated and stored across devices and institutions where privacy, ownership, and regulation prevent centralization. This motivates the need to train generative models directly from distributed data locally without central aggregation. In this paper, we introduce Federated Flow Matching (FFM), a framework for training flow matching models under privacy constraints. Specifically, we first examine FFM-vanilla, where each client trains locally with independent source and target couplings, preserving privacy but yielding curved flows that slow inference. We then develop FFM-LOT, which employs local optimal transport couplings to improve straightness within each client but lacks global consistency under heterogeneous data. Finally, we propose FFM-GOT, a federated strategy based on the semi-dual formulation of optimal transport, where a shared global potential function coordinates couplings across clients. Experiments on synthetic and image datasets show that FFM enables privacy-preserving training while enhancing both the flow straightness and sample quality in federated settings, with performance comparable to the centralized baseline.
[LG-5] AbideGym: Turning Static RL Worlds into Adaptive Challenges
链接: https://arxiv.org/abs/2509.21234
作者: Abi Aryan,Zac Liu,Aaron Childress
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:
Abstract:Agents trained with reinforcement learning often develop brittle policies that fail when dynamics shift, a problem amplified by static benchmarks. AbideGym, a dynamic MiniGrid wrapper, introduces agent-aware perturbations and scalable complexity to enforce intra-episode adaptation. By exposing weaknesses in static policies and promoting resilience, AbideGym provides a modular, reproducible evaluation framework for advancing research in curriculum learning, continual learning, and robust generalization.
[LG-6] Go With The Flow: Churn-Tolerant Decentralized Training of Large Language Models
链接: https://arxiv.org/abs/2509.21221
作者: Nikolay Blagoev,Bart Cox,Jérémie Decouchant,Lydia Y. Chen
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:Motivated by the emergence of large language models (LLMs) and the importance of democratizing their training, we propose GWTF, the first crash tolerant practical decentralized training framework for LLMs. Differently from existing distributed and federated training frameworks, GWTF enables the efficient collaborative training of a LLM on heterogeneous clients that volunteer their resources. In addition, GWTF addresses node churn, i.e., clients joining or leaving the system at any time, and network instabilities, i.e., network links becoming unstable or unreliable. The core of GWTF is a novel decentralized flow algorithm that finds the most effective routing that maximizes the number of microbatches trained with the lowest possible delay. We extensively evaluate GWTF on GPT-like and LLaMa-like models and compare it against the prior art. Our results indicate that GWTF reduces the training time by up to 45% in realistic and challenging scenarios that involve heterogeneous client nodes distributed over 10 different geographic locations with a high node churn rate.
[LG-7] From Physics to Machine Learning and Back: Part II - Learning and Observational Bias in PHM
链接: https://arxiv.org/abs/2509.21207
作者: Olga Fink,Ismail Nejjar,Vinay Sharma,Keivan Faghih Niresi,Han Sun,Hao Dong,Chenghao Xu,Amaury Wei,Arthur Bizzi,Raffael Theiler,Yuan Tian,Leandro Von Krannichfeldt,Zhan Ma,Sergei Garmaev,Zepeng Zhang,Mengjie Zhao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Prognostics and Health Management ensures the reliability, safety, and efficiency of complex engineered systems by enabling fault detection, anticipating equipment failures, and optimizing maintenance activities throughout an asset lifecycle. However, real-world PHM presents persistent challenges: sensor data is often noisy or incomplete, available labels are limited, and degradation behaviors and system interdependencies can be highly complex and nonlinear. Physics-informed machine learning has emerged as a promising approach to address these limitations by embedding physical knowledge into data-driven models. This review examines how incorporating learning and observational biases through physics-informed modeling and data strategies can guide models toward physically consistent and reliable predictions. Learning biases embed physical constraints into model training through physics-informed loss functions and governing equations, or by incorporating properties like monotonicity. Observational biases influence data selection and synthesis to ensure models capture realistic system behavior through virtual sensing for estimating unmeasured states, physics-based simulation for data augmentation, and multi-sensor fusion strategies. The review then examines how these approaches enable the transition from passive prediction to active decision-making through reinforcement learning, which allows agents to learn maintenance policies that respect physical constraints while optimizing operational objectives. This closes the loop between model-based predictions, simulation, and actual system operation, empowering adaptive decision-making. Finally, the review addresses the critical challenge of scaling PHM solutions from individual assets to fleet-wide deployment. Fast adaptation methods including meta-learning and few-shot learning are reviewed alongside domain generalization techniques …
[LG-8] Closed-form ell_r norm scaling with data for overparameterized linear regression and diagonal linear networks under ell_p bias
链接: https://arxiv.org/abs/2509.21181
作者: Shuofeng Zhang,Ard Louis
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:
Abstract:For overparameterized linear regression with isotropic Gaussian design and minimum- \ell_p interpolator p\in(1,2] , we give a unified, high-probability characterization for the scaling of the family of parameter norms \ \lVert \widehatw_p \rVert_r \r \in [1,p] with sample size. We solve this basic, but unresolved question through a simple dual-ray analysis, which reveals a competition between a signal spike and a bulk of null coordinates in X^\top Y , yielding closed-form predictions for (i) a data-dependent transition n\star (the “elbow”), and (ii) a universal threshold r_\star=2(p-1) that separates \lVert \widehatw_p \rVert_r 's which plateau from those that continue to grow with an explicit exponent. This unified solution resolves the scaling of all \ell_r norms within the family r\in [1,p] under \ell_p -biased interpolation, and explains in one picture which norms saturate and which increase as n grows. We then study diagonal linear networks (DLNs) trained by gradient descent. By calibrating the initialization scale \alpha to an effective p_\mathrmeff(\alpha) via the DLN separable potential, we show empirically that DLNs inherit the same elbow/threshold laws, providing a predictive bridge between explicit and implicit bias. Given that many generalization proxies depend on \lVert \widehat w_p \rVert_r , our results suggest that their predictive power will depend sensitively on which l_r norm is used. Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML) Cite as: arXiv:2509.21181 [cs.LG] (or arXiv:2509.21181v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2509.21181 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Shuofeng Zhang [view email] [v1] Thu, 25 Sep 2025 13:59:22 UTC (429 KB) Full-text links: Access Paper: View a PDF of the paper titled Closed-form \ell_r norm scaling with data for overparameterized linear regression and diagonal linear networks under \ell_p bias, by Shuofeng Zhang and 1 other authorsView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.LG prev | next new | recent | 2025-09 Change to browse by: cs math math.ST stat stat.ML stat.TH References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) IArxiv recommender toggle IArxiv Recommender (What is IArxiv?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack
[LG-9] IntSR: An Integrated Generative Framework for Search and Recommendation
链接: https://arxiv.org/abs/2509.21179
作者: Huimin Yan,Longfei Xu,Junjie Sun,Ni Ou,Wei Luo,Xing Tan,Ran Cheng,Kaikui Liu,Xiangxiang Chu
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
Abstract:Generative recommendation has emerged as a promising paradigm, demonstrating remarkable results in both academic benchmarks and industrial applications. However, existing systems predominantly focus on unifying retrieval and ranking while neglecting the integration of search and recommendation (SR) tasks. What makes search and recommendation different is how queries are formed: search uses explicit user requests, while recommendation relies on implicit user interests. As for retrieval versus ranking, the distinction comes down to whether the queries are the target items themselves. Recognizing the query as central element, we propose IntSR, an integrated generative framework for SR. IntSR integrates these disparate tasks using distinct query modalities. It also addresses the increased computational complexity associated with integrated SR behaviors and the erroneous pattern learning introduced by a dynamically changing corpus. IntSR has been successfully deployed across various scenarios in Amap, leading to substantial improvements in digital asset’s GMV(+3.02%), POI recommendation’s CTR(+2.76%), and travel mode suggestion’s ACC(+5.13%).
[LG-10] Inverse Reinforcement Learning Using Just Classification and a Few Regressions
链接: https://arxiv.org/abs/2509.21172
作者: Lars van der Laan,Nathan Kallus,Aurélien Bibaut
类目: Machine Learning (cs.LG); Econometrics (econ.EM); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
Abstract:Inverse reinforcement learning (IRL) aims to explain observed behavior by uncovering an underlying reward. In the maximum-entropy or Gumbel-shocks-to-reward frameworks, this amounts to fitting a reward function and a soft value function that together satisfy the soft Bellman consistency condition and maximize the likelihood of observed actions. While this perspective has had enormous impact in imitation learning for robotics and understanding dynamic choices in economics, practical learning algorithms often involve delicate inner-loop optimization, repeated dynamic programming, or adversarial training, all of which complicate the use of modern, highly expressive function approximators like neural nets and boosting. We revisit softmax IRL and show that the population maximum-likelihood solution is characterized by a linear fixed-point equation involving the behavior policy. This observation reduces IRL to two off-the-shelf supervised learning problems: probabilistic classification to estimate the behavior policy, and iterative regression to solve the fixed point. The resulting method is simple and modular across function approximation classes and algorithms. We provide a precise characterization of the optimal solution, a generic oracle-based algorithm, finite-sample error bounds, and empirical results showing competitive or superior performance to MaxEnt IRL.
[LG-11] Mixture of Thoughts: Learning to Aggregate What Experts Think Not Just What They Say
链接: https://arxiv.org/abs/2509.21164
作者: Jacob Fein-Ashley,Dhruv Parikh,Rajgopal Kannan,Viktor Prasanna
类目: Machine Learning (cs.LG)
*备注:
Abstract:Open-source Large Language Models (LLMs) increasingly specialize by domain (e.g., math, code, general reasoning), motivating systems that leverage complementary strengths across models. Prior multi-LLM approaches either (i) route a query to one or a few experts and generate independently, (ii) aggregate outputs from each model via costly multi-turn exchanges, or (iii) fuse weights into a single model-typically requiring architectural homogeneity. We introduce Mixture of Thoughts (MoT), a simple method for latent-level collaboration among heterogeneous experts under a global routing scheme. For each query, a lightweight router selects top- K experts and designates a primary expert; uniformly placed interaction layers project hidden states into a shared latent space where the primary expert performs cross-attention over its active (selected) peers. Pre-trained experts remain frozen; only the router and the lightweight interaction layers are trained with a novel joint training objective that improves both the expert selection and inter-expert collaboration. Across five in-distribution (ID) and three out-of-distribution (OOD) benchmarks, MoT surpasses the current routing and aggregation-based state-of-the-art, Avengers, by +0.38% and +2.92% , respectively. Further, MoT significantly outperforms the best-performing single model. It achieves this with single-pass inference, runtime comparable to routing baselines, and none of the overheads of iterative aggregation. MoT offers a simple latent-space mechanism for combining heterogeneous LLMs, a practical step toward broader multi-LLM collaboration. Our code is publicly available at this https URL.
[LG-12] DATS: Distance-Aware Temperature Scaling for Calibrated Class-Incremental Learning
链接: https://arxiv.org/abs/2509.21161
作者: Giuseppe Serra,Florian Buettner
类目: Machine Learning (cs.LG)
*备注:
Abstract:Continual Learning (CL) is recently gaining increasing attention for its ability to enable a single model to learn incrementally from a sequence of new classes. In this scenario, it is important to keep consistent predictive performance across all the classes and prevent the so-called Catastrophic Forgetting (CF). However, in safety-critical applications, predictive performance alone is insufficient. Predictive models should also be able to reliably communicate their uncertainty in a calibrated manner - that is, with confidence scores aligned to the true frequencies of target events. Existing approaches in CL address calibration primarily from a data-centric perspective, relying on a single temperature shared across all tasks. Such solutions overlook task-specific differences, leading to large fluctuations in calibration error across tasks. For this reason, we argue that a more principled approach should adapt the temperature according to the distance to the current task. However, the unavailability of the task information at test time/during deployment poses a major challenge to achieve the intended objective. For this, we propose Distance-Aware Temperature Scaling (DATS), which combines prototype-based distance estimation with distance-aware calibration to infer task proximity and assign adaptive temperatures without prior task information. Through extensive empirical evaluation on both standard benchmarks and real-world, imbalanced datasets taken from the biomedical domain, our approach demonstrates to be stable, reliable and consistent in reducing calibration error across tasks compared to state-of-the-art approaches.
[LG-13] CAD-Tokenizer: Towards Text-based CAD Prototyping via Modality-Specific Tokenization
链接: https://arxiv.org/abs/2509.21150
作者: Ruiyu Wang,Shizhao Sun,Weijian Ma,Jiang Bian
类目: Machine Learning (cs.LG)
*备注:
Abstract:Computer-Aided Design (CAD) is a foundational component of industrial prototyping, where models are defined not by raw coordinates but by construction sequences such as sketches and extrusions. This sequential structure enables both efficient prototype initialization and subsequent editing. Text-guided CAD prototyping, which unifies Text-to-CAD generation and CAD editing, has the potential to streamline the entire design pipeline. However, prior work has not explored this setting, largely because standard large language model (LLM) tokenizers decompose CAD sequences into natural-language word pieces, failing to capture primitive-level CAD semantics and hindering attention modules from modeling geometric structure. We conjecture that a multimodal tokenization strategy, aligned with CAD’s primitive and structural nature, can provide more effective representations. To this end, we propose CAD-Tokenizer, a framework that represents CAD data with modality-specific tokens using a sequence-based VQ-VAE with primitive-level pooling and constrained decoding. This design produces compact, primitive-aware representations that align with CAD’s structural nature. Applied to unified text-guided CAD prototyping, CAD-Tokenizer significantly improves instruction following and generation quality, achieving better quantitative and qualitative performance over both general-purpose LLMs and task-specific baselines.
[LG-14] EvoMail: Self-Evolving Cognitive Agents for Adaptive Spam and Phishing Email Defense
链接: https://arxiv.org/abs/2509.21129
作者: Wei Huang,De-Tian Chu,Lin-Yuan Bai,Wei Kang,Hai-Tao Zhang,Bo Li,Zhi-Mo Han,Jing Ge,Hai-Feng Lin
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Modern email spam and phishing attacks have evolved far beyond keyword blacklists or simple heuristics. Adversaries now craft multi-modal campaigns that combine natural-language text with obfuscated URLs, forged headers, and malicious attachments, adapting their strategies within days to bypass filters. Traditional spam detection systems, which rely on static rules or single-modality models, struggle to integrate heterogeneous signals or to continuously adapt, leading to rapid performance degradation. We propose EvoMail, a self-evolving cognitive agent framework for robust detection of spam and phishing. EvoMail first constructs a unified heterogeneous email graph that fuses textual content, metadata (headers, senders, domains), and embedded resources (URLs, attachments). A Cognitive Graph Neural Network enhanced by a Large Language Model (LLM) performs context-aware reasoning across these sources to identify coordinated spam campaigns. Most critically, EvoMail engages in an adversarial self-evolution loop: a ‘‘red-team’’ agent generates novel evasion tactics – such as character obfuscation or AI-generated phishing text – while the ‘‘blue-team’’ detector learns from failures, compresses experiences into a memory module, and reuses them for future reasoning. Extensive experiments on real-world datasets (Enron-Spam, Ling-Spam, SpamAssassin, and TREC) and synthetic adversarial variants demonstrate that EvoMail consistently outperforms state-of-the-art baselines in detection accuracy, adaptability to evolving spam tactics, and interpretability of reasoning traces. These results highlight EvoMail’s potential as a resilient and explainable defense framework against next-generation spam and phishing threats. Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR) Cite as: arXiv:2509.21129 [cs.LG] (or arXiv:2509.21129v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2509.21129 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-15] Structure-Attribute Transformations with Markov Chain Boost Graph Domain Adaptation CIKM’25
链接: https://arxiv.org/abs/2509.21059
作者: Zhen Liu,Yongtao Zhang,Shaobo Ren,Yuxin You
类目: Machine Learning (cs.LG)
*备注: 11 pages,6 figures,Accepted by ACM CIKM’25
Abstract:Graph domain adaptation has gained significant attention in label-scarce scenarios across different graph domains. Traditional approaches to graph domain adaptation primarily focus on transforming node attributes over raw graph structures and aligning the distributions of the transformed node features across networks. However, these methods often struggle with the underlying structural heterogeneity between distinct graph domains, which leads to suboptimal distribution alignment. To address this limitation, we propose Structure-Attribute Transformation with Markov Chain (SATMC), a novel framework that sequentially aligns distributions across networks via both graph structure and attribute transformations. To mitigate the negative influence of domain-private information and further enhance the model’s generalization, SATMC introduces a private domain information reduction mechanism and an empirical Wasserstein distance. Theoretical proofs suggest that SATMC can achieve a tighter error bound for cross-network node classification compared to existing graph domain adaptation methods. Extensive experiments on nine pairs of publicly available cross-domain datasets show that SATMC outperforms state-of-the-art methods in the cross-network node classification task. The code is available at this https URL.
[LG-16] SPREAD: Sampling-based Pareto front Refinement via Efficient Adaptive Diffusion
链接: https://arxiv.org/abs/2509.21058
作者: Sedjro Salomon Hotegni,Sebastian Peitz
类目: Machine Learning (cs.LG)
*备注:
Abstract:Developing efficient multi-objective optimization methods to compute the Pareto set of optimal compromises between conflicting objectives remains a key challenge, especially for large-scale and expensive problems. To bridge this gap, we introduce SPREAD, a generative framework based on Denoising Diffusion Probabilistic Models (DDPMs). SPREAD first learns a conditional diffusion process over points sampled from the decision space and then, at each reverse diffusion step, refines candidates via a sampling scheme that uses an adaptive multiple gradient descent-inspired update for fast convergence alongside a Gaussian RBF-based repulsion term for diversity. Empirical results on multi-objective optimization benchmarks, including offline and Bayesian surrogate-based settings, show that SPREAD matches or exceeds leading baselines in efficiency, scalability, and Pareto front coverage.
[LG-17] Physics of Learning: A Lagrangian perspective to different learning paradigms
链接: https://arxiv.org/abs/2509.21049
作者: Siyuan Guo,Bernhard Schölkopf
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: Work in progress
Abstract:We study the problem of building an efficient learning system. Efficient learning processes information in the least time, i.e., building a system that reaches a desired error threshold with the least number of observations. Building upon least action principles from physics, we derive classic learning algorithms, Bellman’s optimality equation in reinforcement learning, and the Adam optimizer in generative models from first principles, i.e., the Learning \textitLagrangian . We postulate that learning searches for stationary paths in the Lagrangian, and learning algorithms are derivable by seeking the stationary trajectories.
[LG-18] MPC-based Deep Reinforcement Learning Method for Space Robotic Control with Fuel Sloshing Mitigation IROS
链接: https://arxiv.org/abs/2509.21045
作者: Mahya Ramezani,M. Amin Alandihallaj,Barış Can Yalçın,Miguel Angel Olivares Mendez,Holger Voos
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Pre-print version submitted to IEEE IROS
Abstract:This paper presents an integrated Reinforcement Learning (RL) and Model Predictive Control (MPC) framework for autonomous satellite docking with a partially filled fuel tank. Traditional docking control faces challenges due to fuel sloshing in microgravity, which induces unpredictable forces affecting stability. To address this, we integrate Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) RL algorithms with MPC, leveraging MPC’s predictive capabilities to accelerate RL training and improve control robustness. The proposed approach is validated through Zero-G Lab of SnT experiments for planar stabilization and high-fidelity numerical simulations for 6-DOF docking with fuel sloshing dynamics. Simulation results demonstrate that SAC-MPC achieves superior docking accuracy, higher success rates, and lower control effort, outperforming standalone RL and PPO-MPC methods. This study advances fuel-efficient and disturbance-resilient satellite docking, enhancing the feasibility of on-orbit refueling and servicing missions.
[LG-19] FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction
链接: https://arxiv.org/abs/2509.21029
作者: Runqi Lin,Alasdair Paren,Suqin Yuan,Muyang Li,Philip Torr,Adel Bibi,Tongliang Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:The integration of new modalities enhances the capabilities of multimodal large language models (MLLMs) but also introduces additional vulnerabilities. In particular, simple visual jailbreaking attacks can manipulate open-source MLLMs more readily than sophisticated textual attacks. However, these underdeveloped attacks exhibit extremely limited cross-model transferability, failing to reliably identify vulnerabilities in closed-source MLLMs. In this work, we analyse the loss landscape of these jailbreaking attacks and find that the generated attacks tend to reside in high-sharpness regions, whose effectiveness is highly sensitive to even minor parameter changes during transfer. To further explain the high-sharpness localisations, we analyse their feature representations in both the intermediate layers and the spectral domain, revealing an improper reliance on narrow layer representations and semantically poor frequency components. Building on this, we propose a Feature Over-Reliance CorrEction (FORCE) method, which guides the attack to explore broader feasible regions across layer features and rescales the influence of frequency features according to their semantic content. By eliminating non-generalizable reliance on both layer and spectral features, our method discovers flattened feasible regions for visual jailbreaking attacks, thereby improving cross-model transferability. Extensive experiments demonstrate that our approach effectively facilitates visual red-teaming evaluations against closed-source MLLMs.
[LG-20] Actor-Critic without Actor
链接: https://arxiv.org/abs/2509.21022
作者: Donghyeon Ki,Hee-Jun Ahn,Kyungyoon Kim,Byung-Jun Lee
类目: Machine Learning (cs.LG)
*备注:
Abstract:Actor-critic methods constitute a central paradigm in reinforcement learning (RL), coupling policy evaluation with policy improvement. While effective across many domains, these methods rely on separate actor and critic networks, which makes training vulnerable to architectural decisions and hyperparameter tuning. Such complexity limits their scalability in settings that require large function approximators. Recently, diffusion models have recently been proposed as expressive policies that capture multi-modal behaviors and improve exploration, but they introduce additional design choices and computational burdens, hindering efficient deployment. We introduce Actor-Critic without Actor (ACA), a lightweight framework that eliminates the explicit actor network and instead generates actions directly from the gradient field of a noise-level critic. This design removes the algorithmic and computational overhead of actor training while keeping policy improvement tightly aligned with the critic’s latest value estimates. Moreover, ACA retains the ability to capture diverse, multi-modal behaviors without relying on diffusion-based actors, combining simplicity with expressiveness. Through extensive experiments on standard online RL benchmarks,ACA achieves more favorable learning curves and competitive performance compared to both standard actor-critic and state-of-the-art diffusion-based methods, providing a simple yet powerful solution for online RL.
[LG-21] RollPacker: Mitigating Long-Tail Rollouts for Fast Synchronous RL Post-Training
链接: https://arxiv.org/abs/2509.21009
作者: Wei Gao,Yuheng Zhao,Dakai An,Tianyuan Wu,Lunxi Cao,Shaopan Xiong,Ju Huang,Weixun Wang,Siran Yang,Wenbo Su,Jiamang Wang,Lin Qu,Bo Zheng,Wei Wang
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 16pages,14 figures
Abstract:Reinforcement Learning (RL) is a pivotal post-training technique for enhancing the reasoning capabilities of Large Language Models (LLMs). However, synchronous RL post-training often suffers from significant GPU underutilization, referred to as bubbles, caused by imbalanced response lengths within rollout steps. Many RL systems attempt to alleviate this problem by relaxing synchronization, but this can compromise training accuracy. In this paper, we introduce tail batching, a novel rollout scheduling strategy for synchronous RL that systematically consolidates prompts leading to long-tail responses into a small subset of rollout steps (long rounds), while ensuring that the majority of steps (short rounds) involve only balanced, short rollouts. By excluding long responses from short rounds and rescheduling them into a few designated long rounds, tail batching effectively reduces GPU idle time during rollouts and significantly accelerates RL training without sacrificing accuracy. We present RollPacker, a system that fully harnesses the benefits of tail batching through holistic optimizations across all three RL stages: elastic parallelism adaptation for rollout, dynamic resource allocation and scheduling for reward, and stream-based training. Empirical results show that RollPacker achieves a 2.03x-2.56x end-to-end training time reduction compared to veRL and up to 2.24x speedup compared to RLHFuse for the Qwen2.5 family of LLMs on up to 128 H800 GPUs.
[LG-22] MAIFormer: Multi-Agent Inverted Transformer for Flight Trajectory Prediction
链接: https://arxiv.org/abs/2509.21004
作者: Seokbin Yoon,Keumjin Lee
类目: Machine Learning (cs.LG)
*备注: 8 pages, 7 figures, submitted for IEEE Transactions on Intelligent Transportation System
Abstract:Flight trajectory prediction for multiple aircraft is essential and provides critical insights into how aircraft navigate within current air traffic flows. However, predicting multi-agent flight trajectories is inherently challenging. One of the major difficulties is modeling both the individual aircraft behaviors over time and the complex interactions between flights. Generating explainable prediction outcomes is also a challenge. Therefore, we propose a Multi-Agent Inverted Transformer, MAIFormer, as a novel neural architecture that predicts multi-agent flight trajectories. The proposed framework features two key attention modules: (i) masked multivariate attention, which captures spatio-temporal patterns of individual aircraft, and (ii) agent attention, which models the social patterns among multiple agents in complex air traffic scenes. We evaluated MAIFormer using a real-world automatic dependent surveillance-broadcast flight trajectory dataset from the terminal airspace of Incheon International Airport in South Korea. The experimental results show that MAIFormer achieves the best performance across multiple metrics and outperforms other methods. In addition, MAIFormer produces prediction outcomes that are interpretable from a human perspective, which improves both the transparency of the model and its practical utility in air traffic control.
[LG-23] Feature Augmentation of GNNs for ILPs: Local Uniqueness Suffices
链接: https://arxiv.org/abs/2509.21000
作者: Qingyu Han,Qian Li,Linxin Yang,Qian Chen,Qingjiang Shi,Ruoyu Sun
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 9 pages, 6 Tables
Abstract:Integer Linear Programs (ILPs) are central to real-world optimizations but notoriously difficult to solve. Learning to Optimize (L2O) has emerged as a promising paradigm, with Graph Neural Networks (GNNs) serving as the standard backbone. However, standard anonymous GNNs are limited in expressiveness for ILPs, and the common enhancement of augmenting nodes with globally unique identifiers (UIDs) typically introduces spurious correlations that severely harm generalization. To address this tradeoff, we propose a parsimonious Local-UID scheme based on d-hop uniqueness coloring, which ensures identifiers are unique only within each node’s d-hop neighborhood. Building on this scheme, we introduce ColorGNN, which incorporates color information via color-conditioned embeddings, and ColorUID, a lightweight feature-level variant. We prove that for d-layer networks, Local-UIDs achieve the expressive power of Global-UIDs while offering stronger generalization. Extensive experiments show that our approach (i) yields substantial gains on three ILP benchmarks, (ii) exhibits strong OOD generalization on linear programming datasets, and (iii) further improves a general graph-level task when paired with a state-of-the-art method.
[LG-24] Learning Ising Models under Hard Constraints using One Sample
链接: https://arxiv.org/abs/2509.20993
作者: Rohan Chauhan,Ioannis Panageas
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注:
Abstract:We consider the problem of estimating inverse temperature parameter \beta of an n -dimensional truncated Ising model using a single sample. Given a graph G = (V,E) with n vertices, a truncated Ising model is a probability distribution over the n -dimensional hypercube -1,1^n where each configuration \mathbf\sigma is constrained to lie in a truncation set S \subseteq -1,1^n and has probability \Pr(\mathbf\sigma) \propto \exp(\beta\mathbf\sigma^\top A\mathbf\sigma) with A being the adjacency matrix of G . We adopt the recent setting of [Galanis et al. SODA’24], where the truncation set S can be expressed as the set of satisfying assignments of a k -SAT formula. Given a single sample \mathbf\sigma from a truncated Ising model, with inverse parameter \beta^* , underlying graph G of bounded degree \Delta and S being expressed as the set of satisfying assignments of a k -SAT formula, we design in nearly O(n) time an estimator \hat\beta that is O(\Delta^3/\sqrtn) -consistent with the true parameter \beta^* for k \gtrsim \log(d^2k)\Delta^3. Our estimator is based on the maximization of the pseudolikelihood, a notion that has received extensive analysis for various probabilistic models without [Chatterjee, Annals of Statistics '07] or with truncation [Galanis et al. SODA '24]. Our approach generalizes recent techniques from [Daskalakis et al. STOC '19, Galanis et al. SODA '24], to confront the more challenging setting of the truncated Ising model. Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML) Cite as: arXiv:2509.20993 [cs.LG] (or arXiv:2509.20993v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2509.20993 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-25] oward Robust and Efficient ML-Based GPU Caching for Modern Inference
链接: https://arxiv.org/abs/2509.20979
作者: Peng Chen,Jiaji Zhang,Hailiang Zhao,Yirong Zhang,Jiahong Yu,Xueyan Tang,Yixuan Wang,Hao Li,Jianping Zou,Gang Xiong,Kingsum Chow,Shuibing He,Shuiguang Deng
类目: Machine Learning (cs.LG)
*备注:
Abstract:In modern GPU inference, cache efficiency remains a major bottleneck. In recommendation models, embedding hit rates largely determine throughput, while in large language models, KV-cache misses substantially increase time-to-first-token (TTFT). Heuristic policies such as \textscLRU often struggle under structured access patterns. Learning-based approaches are promising, but in practice face two major limitations: they degrade sharply when predictions are inaccurate, or they gain little even with accurate predictions due to conservative designs. Some also incur high overhead, further limiting practicality. We present \textscLCR, a practical framework for learning-based GPU caching that delivers performance gains while ensuring robustness and efficiency. Its core algorithm, \textscLARU, enhances \textscLRU with machine-learned predictions and dynamically adapts to prediction accuracy through online error estimation. When predictions are accurate, \textscLARU achieves near-optimal performance. With inaccurate predictions, it degrades gracefully to near-\textscLRU performance. With \textscLCR, we bridge the gap between empirical progress and theoretical advances in learning-based caching. Experiments show that \textscLCR delivers consistent gains under realistic conditions. In DLRM and LLM scenarios, it improves throughput by up to 24.2% and reduces P99 TTFT by up to 28.3%, outperforming widely used inference systems. Even under poor predictions, its performance remains stable, demonstrating practical robustness. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2509.20979 [cs.LG] (or arXiv:2509.20979v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2509.20979 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-26] Alignment Unlocks Complementarity: A Framework for Multiview Circuit Representation Learning
链接: https://arxiv.org/abs/2509.20968
作者: Zhengyuan Shi,Jingxin Wang,Wentao Jiang,Chengyu Ma,Ziyang Zheng,Zhufei Chu,Weikang Qian,Qiang Xu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multiview learning on Boolean circuits holds immense promise, as different graph-based representations offer complementary structural and semantic information. However, the vast structural heterogeneity between views, such as an And-Inverter Graph (AIG) versus an XOR-Majority Graph (XMG), poses a critical barrier to effective fusion, especially for self-supervised techniques like masked modeling. Naively applying such methods fails, as the cross-view context is perceived as noise. Our key insight is that functional alignment is a necessary precondition to unlock the power of multiview self-supervision. We introduce MixGate, a framework built on a principled training curriculum that first teaches the model a shared, function-aware representation space via an Equivalence Alignment Loss. Only then do we introduce a multiview masked modeling objective, which can now leverage the aligned views as a rich, complementary signal. Extensive experiments, including a crucial ablation study, demonstrate that our alignment-first strategy transforms masked modeling from an ineffective technique into a powerful performance driver.
[LG-27] Decoupled-Value Attention for Prior-Data Fitted Networks: GP Inference for Physical Equations
链接: https://arxiv.org/abs/2509.20950
作者: Kaustubh Sharma,Simardeep Singh,Parikshit Pareek
类目: Machine Learning (cs.LG)
*备注:
Abstract:Prior-data fitted networks (PFNs) are a promising alternative to time-consuming Gaussian Process (GP) inference for creating fast surrogates of physical systems. PFN reduces the computational burden of GP-training by replacing Bayesian inference in GP with a single forward pass of a learned prediction model. However, with standard Transformer attention, PFNs show limited effectiveness on high-dimensional regression tasks. We introduce Decoupled-Value Attention (DVA)-- motivated by the GP property that the function space is fully characterized by the kernel over inputs and the predictive mean is a weighted sum of training targets. DVA computes similarities from inputs only and propagates labels solely through values. Thus, the proposed DVA mirrors the Gaussian-process update while remaining kernel-free. We demonstrate that the crucial factor for scaling PFNs is the attention rule rather than the architecture itself. Specifically, our results demonstrate that (a) localized attention consistently reduces out-of-sample validation loss in PFNs across different dimensional settings, with validation loss reduced by more than 50% in five- and ten-dimensional cases, and (b) the role of attention is more decisive than the choice of backbone architecture, showing that CNN-based PFNs can perform at par with their Transformer-based counterparts. The proposed PFNs provide 64-dimensional power flow equation approximations with a mean absolute error of the order of 1E-3, while being over 80x faster than exact GP inference.
[LG-28] Why Attention Fails: The Degeneration of Transformers into MLPs in Time Series Forecasting
链接: https://arxiv.org/abs/2509.20942
作者: Zida Liang,Jiayi Zhu,Weiqiang Sun
类目: Machine Learning (cs.LG)
*备注:
Abstract:Transformer-based architectures achieved high performance in natural language processing and computer vision, yet many studies have shown that they have not demonstrated a clear advantage in time series forecasting and even underperform simple linear baselines in some cases. However, most of these studies have not thoroughly explored the reasons behind the failure of transformers. To better understand time-series transformers(TST), we designed a series of experiments, progressively modifying transformers into MLPs to investigate the impact of the attention mechanism. Surprisingly, transformer blocks often degenerate into simple MLPs in existing time-series transformers. We designed a interpretable dataset to investigate the reasons behind the failure of the attention mechanism and revealed that the attention mechanism is not working in the expected way. We theoretically analyzed the reasons behind this phenomenon, demonstrating that the current embedding methods fail to allow transformers to function in a well-structured latent space, and further analyzed the deeper underlying causes of the failure of embedding.
[LG-29] GenFacts-Generative Counterfactual Explanations for Multi-Variate Time Series
链接: https://arxiv.org/abs/2509.20936
作者: Sarah Seifi,Anass Ibrahimi,Tobias Sukianto,Cecilia Carbonelli,Lorenzo Servadei,Robert Wille
类目: Machine Learning (cs.LG)
*备注: 5 pages
Abstract:Counterfactual explanations aim to enhance model transparency by showing how inputs can be minimally altered to change predictions. For multivariate time series, existing methods often generate counterfactuals that are invalid, implausible, or unintuitive. We introduce GenFacts, a generative framework based on a class-discriminative variational autoencoder. It integrates contrastive and classification-consistency objectives, prototype-based initialization, and realism-constrained optimization. We evaluate GenFacts on radar gesture data as an industrial use case and handwritten letter trajectories as an intuitive benchmark. Across both datasets, GenFacts outperforms state-of-the-art baselines in plausibility (+18.7%) and achieves the highest interpretability scores in a human study. These results highlight that plausibility and user-centered interpretability, rather than sparsity alone, are key to actionable counterfactuals in time series data.
[LG-30] Reverse Faà di Brunos Formula for Cartesian Reverse Differential Categories
链接: https://arxiv.org/abs/2509.20931
作者: Aaron Biggin(Macquarie University),Jean-Simon Pacaud Lemay(Macquarie University)
类目: Logic in Computer Science (cs.LO); Machine Learning (cs.LG)
*备注: In Proceedings ACT 2024, arXiv:2509.18357
Abstract:Reverse differentiation is an essential operation for automatic differentiation. Cartesian reverse differential categories axiomatize reverse differentiation in a categorical framework, where one of the primary axioms is the reverse chain rule, which is the formula that expresses the reverse derivative of a composition. Here, we present the reverse differential analogue of Faa di Bruno’s Formula, which gives a higher-order reverse chain rule in a Cartesian reverse differential category. To properly do so, we also define partial reverse derivatives and higher-order reverse derivatives in a Cartesian reverse differential category.
[LG-31] Energy saving in off-road vehicles using leakage compensation technique
链接: https://arxiv.org/abs/2509.20926
作者: Gyan Wrat,J. Das
类目: Machine Learning (cs.LG)
*备注:
Abstract:The article focuses on enhancing the energy efficiency of linear actuators used in heavy earth moving equipment, particularly in the booms ofexcavation equipment. Two hydraulic circuits are compared in terms of energy efficiency, with one using a conventional proportional directionalcontrol valve (PDCV) and the other using an innovative solution of proportional flow control valve (PFCV) with artificial leakage between thetwo ends of the actuator. The PFCV reduces energy loss in the form of heat by bypassing the extra flow from the pump during position control,unlike the PDCV that uses a pressure relief valve. The hydraulic circuit using PFCV is found to be 8.5% more energy efficient than theconventional circuit using PDCV. The article also discusses the position control of the actuator, which is achieved using a PID controller tuned by a fuzzy controller. Thesimulation of the hydraulic circuit is carried out using MATLAB/Simulink, and the results are compared with experiments. Overall, the proposedapproach could lead to significant improvements in the energy efficiency of linear actuators used in heavy earth moving equipment, therebyreducing their environmental impact and operating costs.
[LG-32] Deterministic Discrete Denoising
链接: https://arxiv.org/abs/2509.20896
作者: Hideyuki Suzuki,Hiroshi Yamashita
类目: Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
*备注: 9 pages, 1 figure
Abstract:We propose a deterministic denoising algorithm for discrete-state diffusion models based on Markov chains. The generative reverse process is derandomized by introducing a variant of the herding algorithm with weakly chaotic dynamics, which induces deterministic discrete state transitions. Our approach is a direct replacement for the stochastic denoising process, requiring neither retraining nor continuous state embeddings. We demonstrate consistent improvements in both efficiency and sample quality on text and image generation tasks. Thus, this simple derandomization approach is expected to enhance the significance of discrete diffusion in generative modeling. Furthermore, our results reveal that deterministic reverse processes, well established in continuous diffusion, can also be effective in discrete state spaces.
[LG-33] RecIS: Sparse to Dense A Unified Training Framework for Recommendation Models
链接: https://arxiv.org/abs/2509.20883
作者: Hua Zong,Qingtao Zeng,Zhengxiong Zhou,Zhihua Han,Zhensong Yan,Mingjie Liu,Hechen Sun,Jiawei Liu,Yiwen Hu,Qi Wang,YiHan Xian,Wenjie Guo,Houyuan Xiang,Zhiyuan Zeng,Xiangrong Sheng,Bencheng Yan,Nan Hu,Yuheng Huang,Jinqing Lian,Ziru Xu,Yan Zhang,Ju Huang,Siran Yang,Huimin Yi,Jiamang Wang,Pengjie Wang,Han Zhu,Jian Wu,Dan Ou,Jian Xu,Haihong Tang,Yuning Jiang,Bo Zheng,Lin Qu
类目: Information Retrieval (cs.IR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:In this paper, we propose RecIS, a unified Sparse-Dense training framework designed to achieve two primary goals: 1. Unified Framework To create a Unified sparse-dense training framework based on the PyTorch ecosystem that meets the training needs of industrial-grade recommendation models that integrated with large models. this http URL Optimization To optimize the sparse component, offering superior efficiency over the TensorFlow-based recommendation models. The dense component, meanwhile, leverages existing optimization technologies within the PyTorch ecosystem. Currently, RecIS is being used in Alibaba for numerous large-model enhanced recommendation training tasks, and some traditional sparse models have also begun training in it.
[LG-34] Distribution-Controlled Client Selection to Improve Federated Learning Strategies ECML-PKDD2024
链接: https://arxiv.org/abs/2509.20877
作者: Christoph Düsing,Philipp Cimiano
类目: Machine Learning (cs.LG)
*备注: Accepted at the 2nd Workshop on Advancements in Federated Learning (WAFL@ECML-PKDD 2024)
Abstract:Federated learning (FL) is a distributed learning paradigm that allows multiple clients to jointly train a shared model while maintaining data privacy. Despite its great potential for domains with strict data privacy requirements, the presence of data imbalance among clients is a thread to the success of FL, as it causes the performance of the shared model to decrease. To address this, various studies have proposed enhancements to existing FL strategies, particularly through client selection methods that mitigate the detrimental effects of data imbalance. In this paper, we propose an extension to existing FL strategies, which selects active clients that best align the current label distribution with one of two target distributions, namely a balanced distribution or the federations combined label distribution. Subsequently, we empirically verify the improvements through our distribution-controlled client selection on three common FL strategies and two datasets. Our results show that while aligning the label distribution with a balanced distribution yields the greatest improvements facing local imbalance, alignment with the federation’s combined label distribution is superior for global imbalance.
[LG-35] Actively Learning Halfspaces without Synthetic Data
链接: https://arxiv.org/abs/2509.20848
作者: Hadley Black,Kasper Green Larsen,Arya Mazumdar,Barna Saha,Geelon So
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:
Abstract:In the classic point location problem, one is given an arbitrary dataset X \subset \mathbbR^d of n points with query access to an unknown halfspace f : \mathbbR^d \to \0,1\ , and the goal is to learn the label of every point in X . This problem is extremely well-studied and a nearly-optimal \widetildeO(d \log n) query algorithm is known due to Hopkins-Kane-Lovett-Mahajan (FOCS 2020). However, their algorithm is granted the power to query arbitrary points outside of X (point synthesis), and in fact without this power there is an \Omega(n) query lower bound due to Dasgupta (NeurIPS 2004). In this work our goal is to design efficient algorithms for learning halfspaces without point synthesis. To circumvent the \Omega(n) lower bound, we consider learning halfspaces whose normal vectors come from a set of size D , and show tight bounds of \Theta(D + \log n) . As a corollary, we obtain an optimal O(d + \log n) query deterministic learner for axis-aligned halfspaces, closing a previous gap of O(d \log n) vs. \Omega(d + \log n) . In fact, our algorithm solves the more general problem of learning a Boolean function f over n elements which is monotone under at least one of D provided orderings. Our technical insight is to exploit the structure in these orderings to perform a binary search in parallel rather than considering each ordering sequentially, and we believe our approach may be of broader interest. Furthermore, we use our exact learning algorithm to obtain nearly optimal algorithms for PAC-learning. We show that O(\min(D + \log(1/\varepsilon), 1/\varepsilon) \cdot \log D) queries suffice to learn f within error \varepsilon , even in a setting when f can be adversarially corrupted on a c\varepsilon -fraction of points, for a sufficiently small constant c . This bound is optimal up to a \log D factor, including in the realizable setting. Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG) Cite as: arXiv:2509.20848 [cs.DS] (or arXiv:2509.20848v1 [cs.DS] for this version) https://doi.org/10.48550/arXiv.2509.20848 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-36] Causal Time Series Generation via Diffusion Models
链接: https://arxiv.org/abs/2509.20846
作者: Yutong Xia,Chang Xu,Yuxuan Liang,Qingsong Wen,Roger Zimmermann,Jiang Bian
类目: Machine Learning (cs.LG)
*备注:
Abstract:Time series generation (TSG) synthesizes realistic sequences and has achieved remarkable success. Among TSG, conditional models generate sequences given observed covariates, however, such models learn observational correlations without considering unobserved confounding. In this work, we propose a causal perspective on conditional TSG and introduce causal time series generation as a new TSG task family, formalized within Pearl’s causal ladder, extending beyond observational generation to include interventional and counterfactual settings. To instantiate these tasks, we develop CaTSG, a unified diffusion-based framework with backdoor-adjusted guidance that causally steers sampling toward desired interventions and individual counterfactuals while preserving observational fidelity. Specifically, our method derives causal score functions via backdoor adjustment and the abduction-action-prediction procedure, thus enabling principled support for all three levels of TSG. Extensive experiments on both synthetic and real-world datasets show that CaTSG achieves superior fidelity and also supporting interventional and counterfactual generation that existing baselines cannot handle. Overall, we propose the causal TSG family and instantiate it with CaTSG, providing an initial proof-of-concept and opening a promising direction toward more reliable simulation under interventions and counterfactual generation.
[LG-37] Shaping Initial State Prevents Modality Competition in Multi-modal Fusion: A Two-stage Scheduling Framework via Fast Partial Information Decomposition
链接: https://arxiv.org/abs/2509.20840
作者: Jiaqi Tang,Yinsong Xu,Yang Liu,Qingchao Chen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multi-modal fusion often suffers from modality competition during joint training, where one modality dominates the learning process, leaving others under-optimized. Overlooking the critical impact of the model’s initial state, most existing methods address this issue during the joint learning stage. In this study, we introduce a two-stage training framework to shape the initial states through unimodal training before the joint training. First, we propose the concept of Effective Competitive Strength (ECS) to quantify a modality’s competitive strength. Our theoretical analysis further reveals that properly shaping the initial ECS by unimodal training achieves a provably tighter error bound. However, ECS is computationally intractable in deep neural networks. To bridge this gap, we develop a framework comprising two core components: a fine-grained computable diagnostic metric and an asynchronous training controller. For the metric, we first prove that mutual information(MI) is a principled proxy for ECS. Considering MI is induced by per-modality marginals and thus treats each modality in isolation, we further propose FastPID, a computationally efficient and differentiable solver for partial information decomposition, which decomposes the joint distribution’s information into fine-grained measurements: modality-specific uniqueness, redundancy, and synergy. Guided by these measurements, our asynchronous controller dynamically balances modalities by monitoring uniqueness and locates the ideal initial state to start joint training by tracking peak synergy. Experiments on diverse benchmarks demonstrate that our method achieves state-of-the-art performance. Our work establishes that shaping the pre-fusion models’ initial state is a powerful strategy that eases competition before it starts, reliably unlocking synergistic multi-modal fusion.
[LG-38] Explaining Grokking and Information Bottleneck through Neural Collapse Emergence
链接: https://arxiv.org/abs/2509.20829
作者: Keitaro Sakamoto,Issei Sato
类目: Machine Learning (cs.LG)
*备注: Code is available at this https URL
Abstract:The training dynamics of deep neural networks often defy expectations, even as these models form the foundation of modern machine learning. Two prominent examples are grokking, where test performance improves abruptly long after the training loss has plateaued, and the information bottleneck principle, where models progressively discard input information irrelevant to the prediction task as training proceeds. However, the mechanisms underlying these phenomena and their relations remain poorly understood. In this work, we present a unified explanation of such late-phase phenomena through the lens of neural collapse, which characterizes the geometry of learned representations. We show that the contraction of population within-class variance is a key factor underlying both grokking and information bottleneck, and relate this measure to the neural collapse measure defined on the training set. By analyzing the dynamics of neural collapse, we show that distinct time scales between fitting the training set and the progression of neural collapse account for the behavior of the late-phase phenomena. Finally, we validate our theoretical findings on multiple datasets and architectures.
[LG-39] 2I-Diff: fMRI Signal Generation via Time-Frequency Image Transform and Classifier-Free Denoising Diffusion Models MICCAI2025
链接: https://arxiv.org/abs/2509.20822
作者: Hwa Hui Tew,Junn Yong Loo,Yee-Fan Tan,Xinyu Tang,Hernando Ombao,Fuad Noman,Raphael C.-W. Phan,Chee-Ming Ting
类目: Machine Learning (cs.LG)
*备注: Accepted at the 28th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2025)
Abstract:Functional Magnetic Resonance Imaging (fMRI) is an advanced neuroimaging method that enables in-depth analysis of brain activity by measuring dynamic changes in the blood oxygenation level-dependent (BOLD) signals. However, the resource-intensive nature of fMRI data acquisition limits the availability of high-fidelity samples required for data-driven brain analysis models. While modern generative models can synthesize fMRI data, they often underperform because they overlook the complex non-stationarity and nonlinear BOLD dynamics. To address these challenges, we introduce T2I-Diff, an fMRI generation framework that leverages time-frequency representation of BOLD signals and classifier-free denoising diffusion. Specifically, our framework first converts BOLD signals into windowed spectrograms via a time-dependent Fourier transform, capturing both the underlying temporal dynamics and spectral evolution. Subsequently, a classifier-free diffusion model is trained to generate class-conditioned frequency spectrograms, which are then reverted to BOLD signals via inverse Fourier transforms. Finally, we validate the efficacy of our approach by demonstrating improved accuracy and generalization in downstream fMRI-based brain network classification.
[LG-40] Aligning Inductive Bias for Data-Efficient Generalization in State Space Models
链接: https://arxiv.org/abs/2509.20789
作者: Qiyu Chen,Guozhang Chen
类目: Machine Learning (cs.LG)
*备注:
Abstract:The remarkable success of large-scale models is fundamentally tied to scaling laws, yet the finite nature of high-quality data presents a looming challenge. One of the next frontiers in modeling is data efficiency: the ability to learn more from less. A model’s inductive bias is a critical lever for this, but foundational sequence models like State Space Models (SSMs) rely on a fixed bias. This fixed prior is sample-inefficient when a task’s underlying structure does not match. In this work, we introduce a principled framework to solve this problem. We first formalize the inductive bias of linear time-invariant SSMs through an SSM-induced kernel, mathematically and empirically proving its spectrum is directly governed by the model’s frequency response. Further, we propose a method of Task-Dependent Initialization (TDI): power spectrum matching, a fast and efficient method that aligns the model’s inductive bias with the task’s spectral characteristics before large-scale training. Our experiments on a diverse set of real-world benchmarks show that TDI significantly improves generalization and sample efficiency, particularly in low-data regimes. This work provides a theoretical and practical tool to create more data-efficient models, a crucial step towards sustainable scaling.
[LG-41] LiLAW: Lightweight Learnable Adaptive Weighting to Meta-Learn Sample Difficulty and Improve Noisy Training
链接: https://arxiv.org/abs/2509.20786
作者: Abhishek Moturu,Anna Goldenberg,Babak Taati
类目: Machine Learning (cs.LG)
*备注:
Abstract:Training deep neural networks in the presence of noisy labels and data heterogeneity is a major challenge. We introduce Lightweight Learnable Adaptive Weighting (LiLAW), a novel method that dynamically adjusts the loss weight of each training sample based on its evolving difficulty level, categorized as easy, moderate, or hard. Using only three learnable parameters, LiLAW adaptively prioritizes informative samples throughout training by updating these weights using a single mini-batch gradient descent step on the validation set after each training mini-batch, without requiring excessive hyperparameter tuning or a clean validation set. Extensive experiments across multiple general and medical imaging datasets, noise levels and types, loss functions, and architectures with and without pretraining demonstrate that LiLAW consistently enhances performance, even in high-noise environments. It is effective without heavy reliance on data augmentation or advanced regularization, highlighting its practicality. It offers a computationally efficient solution to boost model generalization and robustness in any neural network training setup.
[LG-42] Sig2Model: A Boosting-Driven Model for Updatable Learned Indexes
链接: https://arxiv.org/abs/2509.20781
作者: Alireza Heidari,Amirhossein Ahmad,Wei Zhang,Ying Xiong
类目: Machine Learning (cs.LG); Databases (cs.DB); Performance (cs.PF)
*备注: 22 pages, 11 figures
Abstract:Learned Indexes (LIs) represent a paradigm shift from traditional index structures by employing machine learning models to approximate the cumulative distribution function (CDF) of sorted data. While LIs achieve remarkable efficiency for static datasets, their performance degrades under dynamic updates: maintaining the CDF invariant (sum of F(k) equals 1) requires global model retraining, which blocks queries and limits the queries-per-second (QPS) metric. Current approaches fail to address these retraining costs effectively, rendering them unsuitable for real-world workloads with frequent updates. In this paper, we present Sig2Model, an efficient and adaptive learned index that minimizes retraining cost through three key techniques: (1) a sigmoid boosting approximation technique that dynamically adjusts the index model by approximating update-induced shifts in data distribution with localized sigmoid functions while preserving bounded error guarantees and deferring full retraining; (2) proactive update training via Gaussian mixture models (GMMs) that identifies high-update-probability regions for strategic placeholder allocation to speed up updates; and (3) a neural joint optimization framework that continuously refines both the sigmoid ensemble and GMM parameters via gradient-based learning. We evaluate Sig2Model against state-of-the-art updatable learned indexes on real-world and synthetic workloads, and show that Sig2Model reduces retraining cost by up to 20x, achieves up to 3x higher QPS, and uses up to 1000x less memory.
[LG-43] Leverag ing Temporally Extended Behavior Sharing for Multi-task Reinforcement Learning IROS
链接: https://arxiv.org/abs/2509.20766
作者: Gawon Lee(1),Daesol Cho(1),H. Jin Kim(1) ((1) Seoul National University)
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Accepted for publication in the proceedings of the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Abstract:Multi-task reinforcement learning (MTRL) offers a promising approach to improve sample efficiency and generalization by training agents across multiple tasks, enabling knowledge sharing between them. However, applying MTRL to robotics remains challenging due to the high cost of collecting diverse task data. To address this, we propose MT-Lévy, a novel exploration strategy that enhances sample efficiency in MTRL environments by combining behavior sharing across tasks with temporally extended exploration inspired by Lévy flight. MT-Lévy leverages policies trained on related tasks to guide exploration towards key states, while dynamically adjusting exploration levels based on task success ratios. This approach enables more efficient state-space coverage, even in complex robotics environments. Empirical results demonstrate that MT-Lévy significantly improves exploration and sample efficiency, supported by quantitative and qualitative analyses. Ablation studies further highlight the contribution of each component, showing that combining behavior sharing with adaptive exploration strategies can significantly improve the practicality of MTRL in robotics applications.
[LG-44] Identifying Group Anchors in Real-World Group Interactions Under Label Scarcity ICDM
链接: https://arxiv.org/abs/2509.20762
作者: Fanchen Bu,Geon Lee,Minyoung Choe,Kijung Shin
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注: IEEE International Conference on Data Mining (ICDM) 2025
Abstract:Group interactions occur in various real-world contexts, e.g., co-authorship, email communication, and online QA. In each group, there is often a particularly significant member, around whom the group is formed. Examples include the first or last author of a paper, the sender of an email, and the questioner in a QA session. In this work, we discuss the existence of such individuals in real-world group interactions. We call such individuals group anchors and study the problem of identifying them. First, we introduce the concept of group anchors and the identification problem. Then, we discuss our observations on group anchors in real-world group interactions. Based on our observations, we develop AnchorRadar, a fast and effective method for group anchor identification under realistic settings with label scarcity, i.e., when only a few groups have known anchors. AnchorRadar is a semi-supervised method using information from groups both with and without known group anchors. Finally, through extensive experiments on thirteen real-world datasets, we demonstrate the empirical superiority of AnchorRadar over various baselines w.r.t. accuracy and efficiency. In most cases, AnchorRadar achieves higher accuracy in group anchor identification than all the baselines, while using 10.2 \times less training time than the fastest baseline and 43.6 \times fewer learnable parameters than the most lightweight baseline on average.
[LG-45] he Impact of Audio Watermarking on Audio Anti-Spoofing Countermeasures ICASSP2026
链接: https://arxiv.org/abs/2509.20736
作者: Zhenshan Zhang,Xueping Zhang,Yechen Wang,Liwei Jin,Ming Li
类目: Machine Learning (cs.LG)
*备注: 5 pages, submitted to ICASSP 2026
Abstract:This paper presents the first study on the impact of audio watermarking on spoofing countermeasures. While anti-spoofing systems are essential for securing speech-based applications, the influence of widely used audio watermarking, originally designed for copyright protection, remains largely unexplored. We construct watermark-augmented training and evaluation datasets, named the Watermark-Spoofing dataset, by applying diverse handcrafted and neural watermarking methods to existing anti-spoofing datasets. Experiments show that watermarking consistently degrades anti-spoofing performance, with higher watermark density correlating with higher Equal Error Rates (EERs). To mitigate this, we propose the Knowledge-Preserving Watermark Learning (KPWL) framework, enabling models to adapt to watermark-induced shifts while preserving their original-domain spoofing detection capability. These findings reveal audio watermarking as a previously overlooked domain shift and establish the first benchmark for developing watermark-resilient anti-spoofing systems. All related protocols are publicly available at this https URL
[LG-46] Scaling Laws are Redundancy Laws
链接: https://arxiv.org/abs/2509.20721
作者: Yuda Bi,Vince D Calhoun
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:
Abstract:Scaling laws, a defining feature of deep learning, reveal a striking power-law improvement in model performance with increasing dataset and model size. Yet, their mathematical origins, especially the scaling exponent, have remained elusive. In this work, we show that scaling laws can be formally explained as redundancy laws. Using kernel regression, we show that a polynomial tail in the data covariance spectrum yields an excess risk power law with exponent alpha = 2s / (2s + 1/beta), where beta controls the spectral tail and 1/beta measures redundancy. This reveals that the learning curve’s slope is not universal but depends on data redundancy, with steeper spectra accelerating returns to scale. We establish the law’s universality across boundedly invertible transformations, multi-modal mixtures, finite-width approximations, and Transformer architectures in both linearized (NTK) and feature-learning regimes. This work delivers the first rigorous mathematical explanation of scaling laws as finite-sample redundancy laws, unifying empirical observations with theoretical foundations.
[LG-47] A Genetic Algorithm for Navigating Synthesizable Molecular Spaces
链接: https://arxiv.org/abs/2509.20719
作者: Alston Lo,Connor W. Coley,Wojciech Matusik
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Inspired by the effectiveness of genetic algorithms and the importance of synthesizability in molecular design, we present SynGA, a simple genetic algorithm that operates directly over synthesis routes. Our method features custom crossover and mutation operators that explicitly constrain it to synthesizable molecular space. By modifying the fitness function, we demonstrate the effectiveness of SynGA on a variety of design tasks, including synthesizable analog search and sample-efficient property optimization, for both 2D and 3D objectives. Furthermore, by coupling SynGA with a machine learning-based filter that focuses the building block set, we boost SynGA to state-of-the-art performance. For property optimization, this manifests as a model-based variant SynGBO, which employs SynGA and block filtering in the inner loop of Bayesian optimization. Since SynGA is lightweight and enforces synthesizability by construction, our hope is that SynGA can not only serve as a strong standalone baseline but also as a versatile module that can be incorporated into larger synthesis-aware workflows in the future.
[LG-48] Cryptographic Backdoor for Neural Networks: Boon and Bane
链接: https://arxiv.org/abs/2509.20714
作者: Anh Tu Ngo,Anupam Chattopadhyay,Subhamoy Maitra
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Preprint
Abstract:In this paper we show that cryptographic backdoors in a neural network (NN) can be highly effective in two directions, namely mounting the attacks as well as in presenting the defenses as well. On the attack side, a carefully planted cryptographic backdoor enables powerful and invisible attack on the NN. Considering the defense, we present applications: first, a provably robust NN watermarking scheme; second, a protocol for guaranteeing user authentication; and third, a protocol for tracking unauthorized sharing of the NN intellectual property (IP). From a broader theoretical perspective, borrowing the ideas from Goldwasser et. al. [FOCS 2022], our main contribution is to show that all these instantiated practical protocol implementations are provably robust. The protocols for watermarking, authentication and IP tracking resist an adversary with black-box access to the NN, whereas the backdoor-enabled adversarial attack is impossible to prevent under the standard assumptions. While the theoretical tools used for our attack is mostly in line with the Goldwasser et. al. ideas, the proofs related to the defense need further studies. Finally, all these protocols are implemented on state-of-the-art NN architectures with empirical results corroborating the theoretical claims. Further, one can utilize post-quantum primitives for implementing the cryptographic backdoors, laying out foundations for quantum-era applications in machine learning (ML).
[LG-49] heoretical Bounds for Stable In-Context Learning
链接: https://arxiv.org/abs/2509.20677
作者: Tongxi Wang,Zhuoyang Xia
类目: Machine Learning (cs.LG)
*备注:
Abstract:In-context learning (ICL) is flexible but its reliability is highly sensitive to prompt length. This paper establishes a non-asymptotic lower bound that links the minimal number of demonstrations to ICL stability under fixed high-dimensional sub-Gaussian representations. The bound gives explicit sufficient conditions in terms of spectral properties of the covariance, providing a computable criterion for practice. Building on this analysis, we propose a two-stage observable estimator with a one-shot calibration that produces practitioner-ready prompt-length estimates without distributional priors. Experiments across diverse datasets, encoders, and generators show close alignment between the predicted thresholds and empirical knee-points, with the theory acting as a conservative but reliable upper bound; the calibrated variant further tightens this gap. These results connect spectral coverage to stable ICL, bridge theory and deployment, and improve the interpretability and reliability of large-scale prompting in realistic finite-sample regimes.
[LG-50] Guiding Application Users via Estimation of Computational Resources for Massively Parallel Chemistry Computations
链接: https://arxiv.org/abs/2509.20667
作者: Tanzila Tabassum,Omer Subasi,Ajay Panyala,Epiya Ebiapia,Gerald Baumgartner,Erdal Mutlu, P. (Saday)Sadayappan,Karol Kowalski
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:In this work, we develop machine learning (ML) based strategies to predict resources (costs) required for massively parallel chemistry computations, such as coupled-cluster methods, to guide application users before they commit to running expensive experiments on a supercomputer. By predicting application execution time, we determine the optimal runtime parameter values such as number of nodes and tile sizes. Two key questions of interest to users are addressed. The first is the shortest-time question, where the user is interested in knowing the parameter configurations (number of nodes and tile sizes) to achieve the shortest execution time for a given problem size and a target supercomputer. The second is the cheapest-run question in which the user is interested in minimizing resource usage, i.e., finding the number of nodes and tile size that minimizes the number of node-hours for a given problem size. We evaluate a rich family of ML models and strategies, developed based on the collections of runtime parameter values for the CCSD (Coupled Cluster with Singles and Doubles) application executed on the Department of Energy (DOE) Frontier and Aurora supercomputers. Our experiments show that when predicting the total execution time of a CCSD iteration, a Gradient Boosting (GB) ML model achieves a Mean Absolute Percentage Error (MAPE) of 0.023 and 0.073 for Aurora and Frontier, respectively. In the case where it is expensive to run experiments just to collect data points, we show that active learning can achieve a MAPE of about 0.2 with just around 450 experiments collected from Aurora and Frontier. Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Distributed, Parallel, and Cluster Computing (cs.DC) Cite as: arXiv:2509.20667 [cs.LG] (or arXiv:2509.20667v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2509.20667 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-51] Wonder Wins Ways: Curiosity-Driven Exploration through Multi-Agent Contextual Calibration
链接: https://arxiv.org/abs/2509.20648
作者: Yiyuan Pan,Zhe Liu,Hesheng Wang
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:
Abstract:Autonomous exploration in complex multi-agent reinforcement learning (MARL) with sparse rewards critically depends on providing agents with effective intrinsic motivation. While artificial curiosity offers a powerful self-supervised signal, it often confuses environmental stochasticity with meaningful novelty. Moreover, existing curiosity mechanisms exhibit a uniform novelty bias, treating all unexpected observations equally. However, peer behavior novelty, which encode latent task dynamics, are often overlooked, resulting in suboptimal exploration in decentralized, communication-free MARL settings. To this end, inspired by how human children adaptively calibrate their own exploratory behaviors via observing peers, we propose a novel approach to enhance multi-agent exploration. We introduce CERMIC, a principled framework that empowers agents to robustly filter noisy surprise signals and guide exploration by dynamically calibrating their intrinsic curiosity with inferred multi-agent context. Additionally, CERMIC generates theoretically-grounded intrinsic rewards, encouraging agents to explore state transitions with high information gain. We evaluate CERMIC on benchmark suites including VMAS, Meltingpot, and SMACv2. Empirical results demonstrate that exploration with CERMIC significantly outperforms SoTA algorithms in sparse-reward environments.
[LG-52] Investigating Modality Contribution in Audio LLM s for Music
链接: https://arxiv.org/abs/2509.20641
作者: Giovana Morais,Magdalena Fuentes
类目: Machine Learning (cs.LG); Sound (cs.SD)
*备注:
Abstract:Audio Large Language Models (Audio LLMs) enable human-like conversation about music, yet it is unclear if they are truly listening to the audio or just using textual reasoning, as recent benchmarks suggest. This paper investigates this issue by quantifying the contribution of each modality to a model’s output. We adapt the MM-SHAP framework, a performance-agnostic score based on Shapley values that quantifies the relative contribution of each modality to a model’s prediction. We evaluate two models on the MuChoMusic benchmark and find that the model with higher accuracy relies more on text to answer questions, but further inspection shows that even if the overall audio contribution is low, models can successfully localize key sound events, suggesting that audio is not entirely ignored. Our study is the first application of MM-SHAP to Audio LLMs and we hope it will serve as a foundational step for future research in explainable AI and audio.
[LG-53] Design Implementation and Evaluation of a Novel Programming Language Topic Classification Workflow
链接: https://arxiv.org/abs/2509.20631
作者: Michael Zhang,Yuan Tian,Mariam Guizani
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
Abstract:As software systems grow in scale and complexity, understanding the distribution of programming language topics within source code becomes increasingly important for guiding technical decisions, improving onboarding, and informing tooling and education. This paper presents the design, implementation, and evaluation of a novel programming language topic classification workflow. Our approach combines a multi-label Support Vector Machine (SVM) with a sliding window and voting strategy to enable fine-grained localization of core language concepts such as operator overloading, virtual functions, inheritance, and templates. Trained on the IBM Project CodeNet dataset, our model achieves an average F1 score of 0.90 across topics and 0.75 in code-topic highlight. Our findings contribute empirical insights and a reusable pipeline for researchers and practitioners interested in code analysis and data-driven software engineering.
[LG-54] raining Task Reasoning LLM Agents for Multi-turn Task Planning via Single-turn Reinforcement Learning
链接: https://arxiv.org/abs/2509.20616
作者: Hanjiang Hu,Changliu Liu,Na Li,Yebin Wang
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in knowledge acquisition, reasoning, and tool use, making them promising candidates for autonomous agent applications. However, training LLM agents for complex multi-turn task planning faces significant challenges, including sparse episode-wise rewards, credit assignment across long horizons, and the computational overhead of reinforcement learning in multi-turn interaction settings. To this end, this paper introduces a novel approach that transforms multi-turn task planning into single-turn task reasoning problems, enabling efficient policy optimization through Group Relative Policy Optimization (GRPO) with dense and verifiable reward from expert trajectories. Our theoretical analysis shows that GRPO improvement on single-turn task reasoning results in higher multi-turn success probability under the minimal turns, as well as the generalization to subtasks with shorter horizons. Experimental evaluation on the complex task planning benchmark demonstrates that our 1.5B parameter model trained with single-turn GRPO achieves superior performance compared to larger baseline models up to 14B parameters, with success rates of 70% for long-horizon planning tasks with over 30 steps. We also theoretically and empirically validate the strong cross-task generalizability that the models trained on complex tasks can lead to the successful completion of all simpler subtasks.
[LG-55] Latent Twins
链接: https://arxiv.org/abs/2509.20615
作者: Matthias Chung,Deepanshu Verma,Max Collins,Amit N. Subrahmanya,Varuni Katti Sastry,Vishwas Rao
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 38 pages, 22 figures, 1 table
Abstract:Over the past decade, scientific machine learning has transformed the development of mathematical and computational frameworks for analyzing, modeling, and predicting complex systems. From inverse problems to numerical PDEs, dynamical systems, and model reduction, these advances have pushed the boundaries of what can be simulated. Yet they have often progressed in parallel, with representation learning and algorithmic solution methods evolving largely as separate pipelines. With \emphLatent Twins, we propose a unifying mathematical framework that creates a hidden surrogate in latent space for the underlying equations. Whereas digital twins mirror physical systems in the digital world, Latent Twins mirror mathematical systems in a learned latent space governed by operators. Through this lens, classical modeling, inversion, model reduction, and operator approximation all emerge as special cases of a single principle. We establish the fundamental approximation properties of Latent Twins for both ODEs and PDEs and demonstrate the framework across three representative settings: (i) canonical ODEs, capturing diverse dynamical regimes; (ii) a PDE benchmark using the shallow-water equations, contrasting Latent Twin simulations with DeepONet and forecasts with a 4D-Var baseline; and (iii) a challenging real-data geopotential reanalysis dataset, reconstructing and forecasting from sparse, noisy observations. Latent Twins provide a compact, interpretable surrogate for solution operators that evaluate across arbitrary time gaps in a single-shot, while remaining compatible with scientific pipelines such as assimilation, control, and uncertainty quantification. Looking forward, this framework offers scalable, theory-grounded surrogates that bridge data-driven representation learning and classical scientific modeling across disciplines.
[LG-56] Policy Compatible Skill Incremental Learning via Lazy Learning Interface
链接: https://arxiv.org/abs/2509.20612
作者: Daehee Lee,Dongsu Lee,TaeYoon Kwack,Wonje Choi,Honguk Woo
类目: Machine Learning (cs.LG)
*备注:
Abstract:Skill Incremental Learning (SIL) is the process by which an embodied agent expands and refines its skill set over time by leveraging experience gained through interaction with its environment or by the integration of additional data. SIL facilitates efficient acquisition of hierarchical policies grounded in reusable skills for downstream tasks. However, as the skill repertoire evolves, it can disrupt compatibility with existing skill-based policies, limiting their reusability and generalization. In this work, we propose SIL-C, a novel framework that ensures skill-policy compatibility, allowing improvements in incrementally learned skills to enhance the performance of downstream policies without requiring policy re-training or structural adaptation. SIL-C employs a bilateral lazy learning-based mapping technique to dynamically align the subtask space referenced by policies with the skill space decoded into agent behaviors. This enables each subtask, derived from the policy’s decomposition of a complex task, to be executed by selecting an appropriate skill based on trajectory distribution similarity. We evaluate SIL-C across diverse SIL scenarios and demonstrate that it maintains compatibility between evolving skills and downstream policies while ensuring efficiency throughout the learning process.
[LG-57] Function Spaces Without Kernels: Learning Compact Hilbert Space Representations ICLR2026
链接: https://arxiv.org/abs/2509.20605
作者: Su Ann Low,Quentin Rommel,Kevin S. Miller,Adam J. Thorpe,Ufuk Topcu
类目: Machine Learning (cs.LG)
*备注: Submitted to ICLR 2026
Abstract:Function encoders are a recent technique that learn neural network basis functions to form compact, adaptive representations of Hilbert spaces of functions. We show that function encoders provide a principled connection to feature learning and kernel methods by defining a kernel through an inner product of the learned feature map. This kernel-theoretic perspective explains their ability to scale independently of dataset size while adapting to the intrinsic structure of data, and it enables kernel-style analysis of neural models. Building on this foundation, we develop two training algorithms that learn compact bases: a progressive training approach that constructively grows bases, and a train-then-prune approach that offers a computationally efficient alternative after training. Both approaches use principles from PCA to reveal the intrinsic dimension of the learned space. In parallel, we derive finite-sample generalization bounds using Rademacher complexity and PAC-Bayes techniques, providing inference time guarantees. We validate our approach on a polynomial benchmark with a known intrinsic dimension, and on nonlinear dynamical systems including a Van der Pol oscillator and a two-body orbital model, demonstrating that the same accuracy can be achieved with substantially fewer basis functions. This work suggests a path toward neural predictors with kernel-level guarantees, enabling adaptable models that are both efficient and principled at scale.
[LG-58] Explicit and Effectively Symmetric Schemes for Neural SDEs
链接: https://arxiv.org/abs/2509.20599
作者: Daniil Shmelev,Cristopher Salvi
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:Backpropagation through (neural) SDE solvers is traditionally approached in two ways: discretise-then-optimise, which offers accurate gradients but incurs prohibitive memory costs due to storing the full computational graph (even when mitigated by checkpointing); and optimise-then-discretise, which achieves constant memory cost by solving an auxiliary backward SDE, but suffers from slower evaluation and gradient approximation errors. Algebraically reversible solvers promise both memory efficiency and gradient accuracy, yet existing methods such as the Reversible Heun scheme are often unstable under complex models and large step sizes. We address these limitations by introducing a novel class of stable, near-reversible Runge–Kutta schemes for neural SDEs. These Explicit and Effectively Symmetric (EES) schemes retain the benefits of reversible solvers while overcoming their instability, enabling memory-efficient training without severe restrictions on step size or model complexity. Through numerical experiments, we demonstrate the superior stability and reliability of our schemes, establishing them as a practical foundation for scalable and accurate training of neural SDEs.
[LG-59] SKAN: Interpretable Machine Learning for QoE modeling over Time Series Data
链接: https://arxiv.org/abs/2509.20595
作者: Kamal Singh,Priyanka Rawat,Sami Marouani,Baptiste Jeudy
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:
Abstract:Quality of Experience (QoE) modeling is crucial for optimizing video streaming services to capture the complex relationships between different features and user experience. We propose a novel approach to QoE modeling in video streaming applications using interpretable Machine Learning (ML) techniques over raw time series data. Unlike traditional black-box approaches, our method combines Kolmogorov-Arnold Networks (KANs) as an interpretable readout on top of compact frequency-domain features, allowing us to capture temporal information while retaining a transparent and explainable model. We evaluate our method on popular datasets and demonstrate its enhanced accuracy in QoE prediction, while offering transparency and interpretability.
[LG-60] Learning Greens Operators through Hierarchical Neural Networks Inspired by the Fast Multipole Method ICLR2025
链接: https://arxiv.org/abs/2509.20591
作者: Emilio McAllister Fognini,Marta M. Betcke,Ben T. Cox
类目: Machine Learning (cs.LG)
*备注: Previously under review at ICLR 2025, originally submitted on the 12th of May 2025. The OpenReview page can be found at: this http URL
Abstract:The Fast Multipole Method (FMM) is an efficient numerical algorithm for computation of long-ranged forces in N -body problems within gravitational and electrostatic fields. This method utilizes multipole expansions of the Green’s function inherent to the underlying dynamical systems. Despite its widespread application in physics and engineering, the integration of FMM with modern machine learning architectures remains underexplored. In this work, we propose a novel neural network architecture, the Neural FMM, that integrates the information flow of the FMM into a hierarchical machine learning framework for learning the Green’s operator of an Elliptic PDE. Our Neural FMM architecture leverages a hierarchical computation flow of the FMM method to split up the local and far-field interactions and efficiently learn their respective representations.
[LG-61] he Sensitivity of Variational Bayesian Neural Network Performance to Hyperparameters
链接: https://arxiv.org/abs/2509.20574
作者: Scott Koermer,Natalie Klein
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 18 pages, 6 figures
Abstract:In scientific applications, predictive modeling is often of limited use without accurate uncertainty quantification (UQ) to indicate when a model may be extrapolating or when more data needs to be collected. Bayesian Neural Networks (BNNs) produce predictive uncertainty by propagating uncertainty in neural network (NN) weights and offer the promise of obtaining not only an accurate predictive model but also accurate UQ. However, in practice, obtaining accurate UQ with BNNs is difficult due in part to the approximations used for practical model training and in part to the need to choose a suitable set of hyperparameters; these hyperparameters outnumber those needed for traditional NNs and often have opaque effects on the results. We aim to shed light on the effects of hyperparameter choices for BNNs by performing a global sensitivity analysis of BNN performance under varying hyperparameter settings. Our results indicate that many of the hyperparameters interact with each other to affect both predictive accuracy and UQ. For improved usage of BNNs in real-world applications, we suggest that global sensitivity analysis, or related methods such as Bayesian optimization, should be used to aid in dimensionality reduction and selection of hyperparameters to ensure accurate UQ in BNNs.
[LG-62] Generalizable Diabetes Risk Stratification via Hybrid Machine Learning Models
链接: https://arxiv.org/abs/2509.20565
作者: Athar Parvez,Muhammad Jawad Mufti
类目: Machine Learning (cs.LG)
*备注:
Abstract:Background/Purpose: Diabetes affects over 537 million people worldwide and is projected to reach 783 million by 2045. Early risk stratification can benefit from machine learning. We compare two hybrid classifiers and assess their generalizability on an external cohort. Methods: Two hybrids were built: (i) XGBoost + Random Forest (XGB-RF) and (ii) Support Vector Machine + Logistic Regression (SVM-LR). A leakage-safe, standardized pipeline (encoding, imputation, min-max scaling; SMOTE on training folds only; probability calibration for SVM) was fit on the primary dataset and frozen. Evaluation prioritized threshold-independent discrimination (AUROC/AUPRC) and calibration (Brier, slope/intercept). External validation used the PIMA cohort (N=768) with the frozen pipeline; any thresholded metrics on PIMA were computed at the default rule tau = 0.5. Results: On the primary dataset (PR baseline = 0.50), XGB-RF achieved AUROC ~0.995 and AUPRC ~0.998, outperforming SVM-LR (AUROC ~0.978; AUPRC ~0.947). On PIMA (PR baseline ~0.349), XGB-RF retained strong performance (AUROC ~0.990; AUPRC ~0.959); SVM-LR was lower (AUROC ~0.963; AUPRC ~0.875). Thresholded metrics on PIMA at tau = 0.5 were XGB-RF (Accuracy 0.960; Precision 0.941; Recall 0.944; F1 0.942) and SVM-LR (Accuracy 0.900; Precision 0.855; Recall 0.858; F1 0.857). Conclusions: Across internal and external cohorts, XGB-RF consistently dominated SVM-LR and exhibited smaller external attenuation on ROC/PR with acceptable calibration. These results support gradient-boosting-based hybridization as a robust, transferable approach for diabetes risk stratification and motivate prospective, multi-site validation with deployment-time threshold selection based on clinical trade-offs. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2509.20565 [cs.LG] (or arXiv:2509.20565v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2509.20565 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Muhammad Mufti [view email] [v1] Wed, 24 Sep 2025 21:18:52 UTC (731 KB)
[LG-63] MDBench: Benchmarking Data-Driven Methods for Model Discovery
链接: https://arxiv.org/abs/2509.20529
作者: Amirmohammad Ziaei Bideh,Aleksandra Georgievska,Jonathan Gryak
类目: Machine Learning (cs.LG)
*备注:
Abstract:Model discovery aims to uncover governing differential equations of dynamical systems directly from experimental data. Benchmarking such methods is essential for tracking progress and understanding trade-offs in the field. While prior efforts have focused mostly on identifying single equations, typically framed as symbolic regression, there remains a lack of comprehensive benchmarks for discovering dynamical models. To address this, we introduce MDBench, an open-source benchmarking framework for evaluating model discovery methods on dynamical systems. MDBench assesses 12 algorithms on 14 partial differential equations (PDEs) and 63 ordinary differential equations (ODEs) under varying levels of noise. Evaluation metrics include derivative prediction accuracy, model complexity, and equation fidelity. We also introduce seven challenging PDE systems from fluid dynamics and thermodynamics, revealing key limitations in current methods. Our findings illustrate that linear methods and genetic programming methods achieve the lowest prediction error for PDEs and ODEs, respectively. Moreover, linear models are in general more robust against noise. MDBench accelerates the advancement of model discovery methods by offering a rigorous, extensible benchmarking framework and a rich, diverse collection of dynamical system datasets, enabling systematic evaluation, comparison, and improvement of equation accuracy and robustness.
[LG-64] A Recovery Theory for Diffusion Priors: Deterministic Analysis of the Implicit Prior Algorithm
链接: https://arxiv.org/abs/2509.20511
作者: Oscar Leong,Yann Traonmilin
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Optimization and Control (math.OC)
*备注:
Abstract:Recovering high-dimensional signals from corrupted measurements is a central challenge in inverse problems. Recent advances in generative diffusion models have shown remarkable empirical success in providing strong data-driven priors, but rigorous recovery guarantees remain limited. In this work, we develop a theoretical framework for analyzing deterministic diffusion-based algorithms for inverse problems, focusing on a deterministic version of the algorithm proposed by Kadkhodaie \ Simoncelli \citekadkhodaie2021stochastic. First, we show that when the underlying data distribution concentrates on a low-dimensional model set, the associated noise-convolved scores can be interpreted as time-varying projections onto such a set. This leads to interpreting previous algorithms using diffusion priors for inverse problems as generalized projected gradient descent methods with varying projections. When the sensing matrix satisfies a restricted isometry property over the model set, we can derive quantitative convergence rates that depend explicitly on the noise schedule. We apply our framework to two instructive data distributions: uniform distributions over low-dimensional compact, convex sets and low-rank Gaussian mixture models. In the latter setting, we can establish global convergence guarantees despite the nonconvexity of the underlying model set.
[LG-65] Auto-Regressive U-Net for Full-Field Prediction of Shrinkage-Induced Damage in Concrete
链接: https://arxiv.org/abs/2509.20507
作者: Liya Gaynutdinova,Petr Havlásek,Ondřej Rokoš,Fleur Hendriks,Martin Doškář
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper introduces a deep learning approach for predicting time-dependent full-field damage in concrete. The study uses an auto-regressive U-Net model to predict the evolution of the scalar damage field in a unit cell given microstructural geometry and evolution of an imposed shrinkage profile. By sequentially using the predicted damage output as input for subsequent predictions, the model facilitates the continuous assessment of damage progression. Complementarily, a convolutional neural network (CNN) utilises the damage estimations to forecast key mechanical properties, including observed shrinkage and residual stiffness. The proposed dual-network architecture demonstrates high computational efficiency and robust predictive performance on the synthesised datasets. The approach reduces the computational load traditionally associated with full-field damage evaluations and is used to gain insights into the relationship between aggregate properties, such as shape, size, and distribution, and the effective shrinkage and reduction in stiffness. Ultimately, this can help to optimize concrete mix designs, leading to improved durability and reduced internal damage.
[LG-66] Myosotis: structured computation for attention like layer
链接: https://arxiv.org/abs/2509.20503
作者: Evgenii Egorov,Hanno Ackermann,Markus Nagel,Hong Cai
类目: Machine Learning (cs.LG)
*备注:
Abstract:Attention layers apply a sequence-to-sequence mapping whose parameters depend on the pairwise interactions of the input elements. However, without any structural assumptions, memory and compute scale quadratically with the sequence length. The two main ways to mitigate this are to introduce sparsity by ignoring a sufficient amount of pairwise interactions or to introduce recurrent dependence along them, as SSM does. Although both approaches are reasonable, they both have disadvantages. We propose a novel algorithm that combines the advantages of both concepts. Our idea is based on the efficient inversion of tree-structured matrices.
[LG-67] Offline Goal-conditioned Reinforcement Learning with Quasimetric Representations
链接: https://arxiv.org/abs/2509.20478
作者: Vivek Myers,Bill Chunyuan Zheng,Benjamin Eysenbach,Sergey Levine
类目: Machine Learning (cs.LG)
*备注:
Abstract:Approaches for goal-conditioned reinforcement learning (GCRL) often use learned state representations to extract goal-reaching policies. Two frameworks for representation structure have yielded particularly effective GCRL algorithms: (1) contrastive representations, in which methods learn “successor features” with a contrastive objective that performs inference over future outcomes, and (2) temporal distances, which link the (quasimetric) distance in representation space to the transit time from states to goals. We propose an approach that unifies these two frameworks, using the structure of a quasimetric representation space (triangle inequality) with the right additional constraints to learn successor representations that enable optimal goal-reaching. Unlike past work, our approach is able to exploit a quasimetric distance parameterization to learn optimal goal-reaching distances, even with suboptimal data and in stochastic environments. This gives us the best of both worlds: we retain the stability and long-horizon capabilities of Monte Carlo contrastive RL methods, while getting the free stitching capabilities of quasimetric network parameterizations. On existing offline GCRL benchmarks, our representation learning objective improves performance on stitching tasks where methods based on contrastive learning struggle, and on noisy, high-dimensional environments where methods based on quasimetric networks struggle.
[LG-68] Efficiently Attacking Memorization Scores
链接: https://arxiv.org/abs/2509.20463
作者: Tue Do,Varun Chandrasekaran,Daniel Alabi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Influence estimation tools – such as memorization scores – are widely used to understand model behavior, attribute training data, and inform dataset curation. However, recent applications in data valuation and responsible machine learning raise the question: can these scores themselves be adversarially manipulated? In this work, we present a systematic study of the feasibility of attacking memorization-based influence estimators. We characterize attacks for producing highly memorized samples as highly sensitive queries in the regime where a trained algorithm is accurate. Our attack (calculating the pseudoinverse of the input) is practical, requiring only black-box access to model outputs and incur modest computational overhead. We empirically validate our attack across a wide suite of image classification tasks, showing that even state-of-the-art proxies are vulnerable to targeted score manipulations. In addition, we provide a theoretical analysis of the stability of memorization scores under adversarial perturbations, revealing conditions under which influence estimates are inherently fragile. Our findings highlight critical vulnerabilities in influence-based attribution and suggest the need for robust defenses. All code can be found at this https URL
[LG-69] Bridging Privacy and Utility: Synthesizing anonymized EEG with constraining utility functions
链接: https://arxiv.org/abs/2509.20454
作者: Kay Fuhrmeister,Arne Pelzer,Fabian Radke,Julia Lechinger,Mahzad Gharleghi,Thomas Köllmer,Insa Wolf
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Electroencephalography (EEG) is widely used for recording brain activity and has seen numerous applications in machine learning, such as detecting sleep stages and neurological disorders. Several studies have successfully shown the potential of EEG data for re-identification and leakage of other personal information. Therefore, the increasing availability of EEG consumer devices raises concerns about user privacy, motivating us to investigate how to safeguard this sensitive data while retaining its utility for EEG applications. To address this challenge, we propose a transformer-based autoencoder to create EEG data that does not allow for subject re-identification while still retaining its utility for specific machine learning tasks. We apply our approach to automatic sleep staging by evaluating the re-identification and utility potential of EEG data before and after anonymization. The results show that the re-identifiability of the EEG signal can be substantially reduced while preserving its utility for machine learning.
[LG-70] mloz: A Highly Efficient Machine Learning-Based Ozone Parameterization for Climate Sensitivity Simulations
链接: https://arxiv.org/abs/2509.20422
作者: Yiling Ma,Nathan Luke Abraham,Stefan Versick,Roland Ruhnke,Andrea Schneidereit,Ulrike Niemeier,Felix Back,Peter Braesicke,Peer Nowack
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:
Abstract:Atmospheric ozone is a crucial absorber of solar radiation and an important greenhouse gas. However, most climate models participating in the Coupled Model Intercomparison Project (CMIP) still lack an interactive representation of ozone due to the high computational costs of atmospheric chemistry schemes. Here, we introduce a machine learning parameterization (mloz) to interactively model daily ozone variability and trends across the troposphere and stratosphere in standard climate sensitivity simulations, including two-way interactions of ozone with the Quasi-Biennial Oscillation. We demonstrate its high fidelity on decadal timescales and its flexible use online across two different climate models – the UK Earth System Model (UKESM) and the German ICOsahedral Nonhydrostatic (ICON) model. With atmospheric temperature profile information as the only input, mloz produces stable ozone predictions around 31 times faster than the chemistry scheme in UKESM, contributing less than 4 percent of the respective total climate model runtimes. In particular, we also demonstrate its transferability to different climate models without chemistry schemes by transferring the parameterization from UKESM to ICON. This highlights the potential for widespread adoption in CMIP-level climate models that lack interactive chemistry for future climate change assessments, particularly when focusing on climate sensitivity simulations, where ozone trends and variability are known to significantly modulate atmospheric feedback processes.
[LG-71] FastEagle: Cascaded Drafting for Accelerating Speculative Decoding
链接: https://arxiv.org/abs/2509.20416
作者: Haiduo Huang,Jiangcheng Song,Wenzhe Zhao,Pengju Ren
类目: Machine Learning (cs.LG)
*备注:
Abstract:Speculative decoding accelerates generation by drafting candidates and verifying them in parallel, yet state-of-the-art drafters (e.g., EAGLE) still require N sequential passes to propose N tokens. We present FastEagle, a non-autoregressive cascaded drafter that emits an entire draft in a single forward pass. FastEagle replaces temporal steps with a lightweight layer cascade and trains with layer-wise supervision to mitigate error accumulation. Coupled with a constrained draft tree that preserves lossless verification cost, FastEagle delivers substantial wall-clock speedups over strong autoregressive drafters while maintaining competitive acceptance behavior. Across multiple LLMs (Vicuna-13B, LLaMA-Instruct 3.x, and DeepSeek-R1-Distill-LLaMA) and tasks (MT-Bench, HumanEval, GSM8K, CNN/DM, Alpaca), FastEagle consistently outperforms EAGLE-3 in speedup under both greedy and stochastic decoding, with comparable average acceptance lengths. These results indicate that removing sequential dependencies in drafting is a practical path toward lossless LLM inference acceleration.
[LG-72] Structuring Collective Action with LLM -Guided Evolution: From Ill-Structured Problems to Executable Heuristics
链接: https://arxiv.org/abs/2509.20412
作者: Kevin Bradley Dsouza,Graham Alexander Watt,Yuri Leonenko,Juan Moreno-Cruz
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
*备注:
Abstract:Collective action problems, which require aligning individual incentives with collective goals, are classic examples of Ill-Structured Problems (ISPs). For an individual agent, the causal links between local actions and global outcomes are unclear, stakeholder objectives often conflict, and no single, clear algorithm can bridge micro-level choices with macro-level welfare. We present ECHO-MIMIC, a computational framework that converts this global complexity into a tractable, Well-Structured Problem (WSP) for each agent by discovering compact, executable heuristics and persuasive rationales. The framework operates in two stages: ECHO (Evolutionary Crafting of Heuristics from Outcomes) evolves snippets of Python code that encode candidate behavioral policies, while MIMIC (Mechanism Inference Messaging for Individual-to-Collective Alignment) evolves companion natural language messages that motivate agents to adopt those policies. Both phases employ a large-language-model-driven evolutionary search: the LLM proposes diverse and context-aware code or text variants, while population-level selection retains those that maximize collective performance in a simulated environment. We demonstrate this framework on a canonical ISP in agricultural landscape management, where local farming decisions impact global ecological connectivity. Results show that ECHO-MIMIC discovers high-performing heuristics compared to baselines and crafts tailored messages that successfully align simulated farmer behavior with landscape-level ecological goals. By coupling algorithmic rule discovery with tailored communication, ECHO-MIMIC transforms the cognitive burden of collective action into a simple set of agent-level instructions, making previously ill-structured problems solvable in practice and opening a new path toward scalable, adaptive policy design.
[LG-73] A Theory of Multi-Agent Generative Flow Networks NEURIPS2025
链接: https://arxiv.org/abs/2509.20408
作者: Leo Maxime Brunswic,Haozhi Wang,Shuang Luo,Jianye Hao,Amir Rasouli,Yinchuan Li
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Accepted at SPIGM Workshop NeurIPS 2025
Abstract:Generative flow networks utilize a flow-matching loss to learn a stochastic policy for generating objects from a sequence of actions, such that the probability of generating a pattern can be proportional to the corresponding given reward. However, a theoretical framework for multi-agent generative flow networks (MA-GFlowNets) has not yet been proposed. In this paper, we propose the theory framework of MA-GFlowNets, which can be applied to multiple agents to generate objects collaboratively through a series of joint actions. We further propose four algorithms: a centralized flow network for centralized training of MA-GFlowNets, an independent flow network for decentralized execution, a joint flow network for achieving centralized training with decentralized execution, and its updated conditional version. Joint Flow training is based on a local-global principle allowing to train a collection of (local) GFN as a unique (global) GFN. This principle provides a loss of reasonable complexity and allows to leverage usual results on GFN to provide theoretical guarantees that the independent policies generate samples with probability proportional to the reward function. Experimental results demonstrate the superiority of the proposed framework compared to reinforcement learning and MCMC-based methods.
[LG-74] A Comparative Analysis of Ensemble-Based Machine Learning Approaches with Explainable AI for Multi-Class Intrusion Detection in Drone Networks
链接: https://arxiv.org/abs/2509.20391
作者: Md. Alamgir Hossain,Waqas Ishtiaq,Md. Samiul Islam
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 27 pages, 18 figures, 10 tables
Abstract:The growing integration of drones into civilian, commercial, and defense sectors introduces significant cybersecurity concerns, particularly with the increased risk of network-based intrusions targeting drone communication protocols. Detecting and classifying these intrusions is inherently challenging due to the dynamic nature of drone traffic and the presence of multiple sophisticated attack vectors such as spoofing, injection, replay, and man-in-the-middle (MITM) attacks. This research aims to develop a robust and interpretable intrusion detection framework tailored for drone networks, with a focus on handling multi-class classification and model explainability. We present a comparative analysis of ensemble-based machine learning models, namely Random Forest, Extra Trees, AdaBoost, CatBoost, and XGBoost, trained on a labeled dataset comprising benign traffic and nine distinct intrusion types. Comprehensive data preprocessing was performed, including missing value imputation, scaling, and categorical encoding, followed by model training and extensive evaluation using metrics such as macro F1-score, ROC AUC, Matthews Correlation Coefficient, and Log Loss. Random Forest achieved the highest performance with a macro F1-score of 0.9998 and ROC AUC of 1.0000. To validate the superiority of the models, statistical tests, including Friedmans test, the Wilcoxon signed-rank test with Holm correction, and bootstrapped confidence intervals, were applied. Furthermore, explainable AI methods, SHAP and LIME, were integrated to interpret both global and local feature importance, enhancing model transparency and decision trustworthiness. The proposed approach not only delivers near-perfect accuracy but also ensures interpretability, making it highly suitable for real-time and safety-critical drone operations.
[LG-75] Maxout Polytopes
链接: https://arxiv.org/abs/2509.21286
作者: Andrei Balakin,Shelby Cox,Georg Loho,Bernd Sturmfels
类目: Combinatorics (math.CO); Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
*备注: 24 pages, 3 figures
Abstract:Maxout polytopes are defined by feedforward neural networks with maxout activation function and non-negative weights after the first layer. We characterize the parameter spaces and extremal f-vectors of maxout polytopes for shallow networks, and we study the separating hypersurfaces which arise when a layer is added to the network. We also show that maxout polytopes are cubical for generic networks without bottlenecks.
[LG-76] Response to Promises and Pitfalls of Deep Kernel Learning
链接: https://arxiv.org/abs/2509.21228
作者: Andrew Gordon Wilson,Zhiting Hu,Ruslan Salakhutdinov,Eric P. Xing
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:This note responds to “Promises and Pitfalls of Deep Kernel Learning” (Ober et al., 2021). The marginal likelihood of a Gaussian process can be compartmentalized into a data fit term and a complexity penalty. Ober et al. (2021) shows that if a kernel can be multiplied by a signal variance coefficient, then reparametrizing and substituting in the maximized value of this parameter sets a reparametrized data fit term to a fixed value. They use this finding to argue that the complexity penalty, a log determinant of the kernel matrix, then dominates in determining the other values of kernel hyperparameters, which can lead to data overcorrelation. By contrast, we show that the reparametrization in fact introduces another data-fit term which influences all other kernel hyperparameters. Thus, a balance between data fit and complexity still plays a significant role in determining kernel hyperparameters.
[LG-77] Data-driven Neural Networks for Windkessel Parameter Calibration
链接: https://arxiv.org/abs/2509.21206
作者: Benedikt Hoock,Tobias Köppl
类目: Tissues and Organs (q-bio.TO); Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC); Quantitative Methods (q-bio.QM)
*备注: 32 pages, 15 figures, for associated git see this https URL , submitted to International Journal for Numerical Methods in Biomedical Engineering
Abstract:In this work, we propose a novel method for calibrating Windkessel (WK) parameters in a dimensionally reduced 1D-0D coupled blood flow model. To this end, we design a data-driven neural network (NN)trained on simulated blood pressures in the left brachial artery. Once trained, the NN emulates the pressure pulse waves across the entire simulated domain, i.e., over time, space and varying WK parameters, with negligible error and computational effort. To calibrate the WK parameters on a measured pulse wave, the NN is extended by dummy neurons and retrained only on these. The main objective of this work is to assess the effectiveness of the method in various scenarios – particularly, when the exact measurement location is unknown or the data are affected by noise.
[LG-78] Breaking the curse of dimensionality for linear rules: optimal predictors over the ellipsoid
链接: https://arxiv.org/abs/2509.21174
作者: Alexis Ayme,Bruno Loureiro
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:In this work, we address the following question: What minimal structural assumptions are needed to prevent the degradation of statistical learning bounds with increasing dimensionality? We investigate this question in the classical statistical setting of signal estimation from n independent linear observations Y_i = X_i^\top\theta + \epsilon_i . Our focus is on the generalization properties of a broad family of predictors that can be expressed as linear combinations of the training labels, f(X) = \sum_i=1^n l_i(X) Y_i . This class – commonly referred to as linear prediction rules – encompasses a wide range of popular parametric and non-parametric estimators, including ridge regression, gradient descent, and kernel methods. Our contributions are twofold. First, we derive non-asymptotic upper and lower bounds on the generalization error for this class under the assumption that the Bayes predictor \theta lies in an ellipsoid. Second, we establish a lower bound for the subclass of rotationally invariant linear prediction rules when the Bayes predictor is fixed. Our analysis highlights two fundamental contributions to the risk: (a) a variance-like term that captures the intrinsic dimensionality of the data; (b) the noiseless error, a term that arises specifically in the high-dimensional regime. These findings shed light on the role of structural assumptions in mitigating the curse of dimensionality.
[LG-79] WISER: Segmenting watermarked region - an epidemic change-point perspective
链接: https://arxiv.org/abs/2509.21160
作者: Soham Bonnerjee,Sayar Karmakar,Subhrajyoty Roy
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:With the increasing popularity of large language models, concerns over content authenticity have led to the development of myriad watermarking schemes. These schemes can be used to detect a machine-generated text via an appropriate key, while being imperceptible to readers with no such keys. The corresponding detection mechanisms usually take the form of statistical hypothesis testing for the existence of watermarks, spurring extensive research in this direction. However, the finer-grained problem of identifying which segments of a mixed-source text are actually watermarked, is much less explored; the existing approaches either lack scalability or theoretical guarantees robust to paraphrase and post-editing. In this work, we introduce a unique perspective to such watermark segmentation problems through the lens of epidemic change-points. By highlighting the similarities as well as differences of these two problems, we motivate and propose WISER: a novel, computationally efficient, watermark segmentation algorithm. We theoretically validate our algorithm by deriving finite sample error-bounds, and establishing its consistency in detecting multiple watermarked segments in a single text. Complementing these theoretical results, our extensive numerical experiments show that WISER outperforms state-of-the-art baseline methods, both in terms of computational speed as well as accuracy, on various benchmark datasets embedded with diverse watermarking schemes. Our theoretical and empirical findings establish WISER as an effective tool for watermark localization in most settings. It also shows how insights from a classical statistical problem can lead to a theoretically valid and computationally efficient solution of a modern and pertinent problem.
[LG-80] Physics Informed Neural Networks for design optimisation of diamond particle detectors for charged particle fast-tracking at high luminosity hadron colliders
链接: https://arxiv.org/abs/2509.21123
作者: Alessandro Bombini,Alessandro Rosa,Clarissa Buti,Giovanni Passaleva,Lucio Anderlini
类目: Instrumentation and Detectors (physics.ins-det); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Computational Physics (physics.comp-ph)
*备注: 9 pages; 3 figures; conference paper submitted to EUCAIFCON 2025
Abstract:Future high-luminosity hadron colliders demand tracking detectors with extreme radiation tolerance, high spatial precision, and sub-nanosecond timing. 3D diamond pixel sensors offer these capabilities due to diamond’s radiation hardness and high carrier mobility. Conductive electrodes, produced via femtosecond IR laser pulses, exhibit high resistivity that delays signal propagation. This effect necessitates extending the classical Ramo-Shockley weighting potential formalism. We model the phenomenon through a 3rd-order, 3+1D PDE derived as a quasi-stationary approximation of Maxwell’s equations. The PDE is solved numerically and coupled with charge transport simulations for realistic 3D sensor geometries. A Mixture-of-Experts Physics-Informed Neural Network, trained on Spectral Method data, provides a meshless solver to assess timing degradation from electrode resistance.
[LG-81] Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?
链接: https://arxiv.org/abs/2509.21087
作者: Rostislav Makarov,Lea Schönherr,Timo Gerkmann
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Abstract:Machine learning approaches for speech enhancement are becoming increasingly expressive, enabling ever more powerful modifications of input signals. In this paper, we demonstrate that this expressiveness introduces a vulnerability: advanced speech enhancement models can be susceptible to adversarial attacks. Specifically, we show that adversarial noise, carefully crafted and psychoacoustically masked by the original input, can be injected such that the enhanced speech output conveys an entirely different semantic meaning. We experimentally verify that contemporary predictive speech enhancement models can indeed be manipulated in this way. Furthermore, we highlight that diffusion models with stochastic samplers exhibit inherent robustness to such adversarial attacks by design.
[LG-82] Empirical PAC-Bayes bounds for Markov chains
链接: https://arxiv.org/abs/2509.20985
作者: Vahe Karagulyan,Pierre Alquier
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:The core of generalization theory was developed for independent observations. Some PAC and PAC-Bayes bounds are available for data that exhibit a temporal dependence. However, there are constants in these bounds that depend on properties of the data-generating process: mixing coefficients, mixing time, spectral gap… Such constants are unknown in practice. In this paper, we prove a new PAC-Bayes bound for Markov chains. This bound depends on a quantity called the pseudo-spectral gap. The main novelty is that we can provide an empirical bound on the pseudo-spectral gap when the state space is finite. Thus, we obtain the first fully empirical PAC-Bayes bound for Markov chains. This extends beyond the finite case, although this requires additional assumptions. On simulated experiments, the empirical version of the bound is essentially as tight as the non-empirical one.
[LG-83] Conditionally Whitened Generative Models for Probabilistic Time Series Forecasting
链接: https://arxiv.org/abs/2509.20928
作者: Yanfeng Yang,Siwei Chen,Pingping Hu,Zhaotong Shen,Yingjie Zhang,Zhuoran Sun,Shuai Li,Ziqi Chen,Kenji Fukumizu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Probabilistic forecasting of multivariate time series is challenging due to non-stationarity, inter-variable dependencies, and distribution shifts. While recent diffusion and flow matching models have shown promise, they often ignore informative priors such as conditional means and covariances. In this work, we propose Conditionally Whitened Generative Models (CW-Gen), a framework that incorporates prior information through conditional whitening. Theoretically, we establish sufficient conditions under which replacing the traditional terminal distribution of diffusion models, namely the standard multivariate normal, with a multivariate normal distribution parameterized by estimators of the conditional mean and covariance improves sample quality. Guided by this analysis, we design a novel Joint Mean-Covariance Estimator (JMCE) that simultaneously learns the conditional mean and sliding-window covariance. Building on JMCE, we introduce Conditionally Whitened Diffusion Models (CW-Diff) and extend them to Conditionally Whitened Flow Matching (CW-Flow). Experiments on five real-world datasets with six state-of-the-art generative models demonstrate that CW-Gen consistently enhances predictive performance, capturing non-stationary dynamics and inter-variable correlations more effectively than prior-free approaches. Empirical results further demonstrate that CW-Gen can effectively mitigate the effects of distribution shift.
[LG-84] RAPTOR-GEN: RApid PosTeriOR GENerator for Bayesian Learning in Biomanufacturing
链接: https://arxiv.org/abs/2509.20753
作者: Wandi Xu,Wei Xie
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 80 pages, 6 figures
Abstract:Biopharmaceutical manufacturing is vital to public health but lacks the agility for rapid, on-demand production of biotherapeutics due to the complexity and variability of bioprocesses. To overcome this, we introduce RApid PosTeriOR GENerator (RAPTOR-GEN), a mechanism-informed Bayesian learning framework designed to accelerate intelligent digital twin development from sparse and heterogeneous experimental data. This framework is built on a multi-scale probabilistic knowledge graph (pKG), formulated as a stochastic differential equation (SDE)-based foundational model that captures the nonlinear dynamics of bioprocesses. RAPTOR-GEN consists of two ingredients: (i) an interpretable metamodel integrating linear noise approximation (LNA) that exploits the structural information of bioprocessing mechanisms and a sequential learning strategy to fuse heterogeneous and sparse data, enabling inference of latent state variables and explicit approximation of the intractable likelihood function; and (ii) an efficient Bayesian posterior sampling method that utilizes Langevin diffusion (LD) to accelerate posterior exploration by exploiting the gradients of the derived likelihood. It generalizes the LNA approach to circumvent the challenge of step size selection, facilitating robust learning of mechanistic parameters with provable finite-sample performance guarantees. We develop a fast and robust RAPTOR-GEN algorithm with controllable error. Numerical experiments demonstrate its effectiveness in uncovering the underlying regulatory mechanisms of biomanufacturing processes.
[LG-85] Real-Time System for Audio-Visual Target Speech Enhancement
链接: https://arxiv.org/abs/2509.20741
作者: T. Aleksandra Ma,Sile Yin,Li-Chia Yang,Shuo Zhang
类目: Audio and Speech Processing (eess.AS); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: Accepted into WASPAA 2025 demo session
Abstract:We present a live demonstration for RAVEN, a real-time audio-visual speech enhancement system designed to run entirely on a CPU. In single-channel, audio-only settings, speech enhancement is traditionally approached as the task of extracting clean speech from environmental noise. More recent work has explored the use of visual cues, such as lip movements, to improve robustness, particularly in the presence of interfering speakers. However, to our knowledge, no prior work has demonstrated an interactive system for real-time audio-visual speech enhancement operating on CPU hardware. RAVEN fills this gap by using pretrained visual embeddings from an audio-visual speech recognition model to encode lip movement information. The system generalizes across environmental noise, interfering speakers, transient sounds, and even singing voices. In this demonstration, attendees will be able to experience live audio-visual target speech enhancement using a microphone and webcam setup, with clean speech playback through headphones.
[LG-86] PALQO: Physics-informed Model for Accelerating Large-scale Quantum Optimization
链接: https://arxiv.org/abs/2509.20733
作者: Yiming Huang,Yajie Hao,Jing Zhou,Xiao Yuan,Xiaoting Wang,Yuxuan Du
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
Abstract:Variational quantum algorithms (VQAs) are leading strategies to reach practical utilities of near-term quantum devices. However, the no-cloning theorem in quantum mechanics precludes standard backpropagation, leading to prohibitive quantum resource costs when applying VQAs to large-scale tasks. To address this challenge, we reformulate the training dynamics of VQAs as a nonlinear partial differential equation and propose a novel protocol that leverages physics-informed neural networks (PINNs) to model this dynamical system efficiently. Given a small amount of training trajectory data collected from quantum devices, our protocol predicts the parameter updates of VQAs over multiple iterations on the classical side, dramatically reducing quantum resource costs. Through systematic numerical experiments, we demonstrate that our method achieves up to a 30x speedup compared to conventional methods and reduces quantum resource costs by as much as 90% for tasks involving up to 40 qubits, including ground state preparation of different quantum systems, while maintaining competitive accuracy. Our approach complements existing techniques aimed at improving the efficiency of VQAs and further strengthens their potential for practical applications.
[LG-87] Implicit Augmentation from Distributional Symmetry in Turbulence Super-Resolution NEURIPS2025
链接: https://arxiv.org/abs/2509.20683
作者: Julia Balla,Jeremiah Bailey,Ali Backour,Elyssa Hofgard,Tommi Jaakkola,Tess Smidt,Ryley McConkey
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注: Accepted to Machine Learning and the Physical Sciences Workshop at NeurIPS 2025
Abstract:The immense computational cost of simulating turbulence has motivated the use of machine learning approaches for super-resolving turbulent flows. A central challenge is ensuring that learned models respect physical symmetries, such as rotational equivariance. We show that standard convolutional neural networks (CNNs) can partially acquire this symmetry without explicit augmentation or specialized architectures, as turbulence itself provides implicit rotational augmentation in both time and space. Using 3D channel-flow subdomains with differing anisotropy, we find that models trained on more isotropic mid-plane data achieve lower equivariance error than those trained on boundary layer data, and that greater temporal or spatial sampling further reduces this error. We show a distinct scale-dependence of equivariance error that occurs regardless of dataset anisotropy that is consistent with Kolmogorov’s local isotropy hypothesis. These results clarify when rotational symmetry must be explicitly incorporated into learning algorithms and when it can be obtained directly from turbulence, enabling more efficient and symmetry-aware super-resolution.
[LG-88] A Hierarchical Variational Graph Fused Lasso for Recovering Relative Rates in Spatial Compositional Data
链接: https://arxiv.org/abs/2509.20636
作者: Joaquim Valerio Teixeira,Ed Reznik,Sudpito Banerjee,Wesley Tansey
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:The analysis of spatial data from biological imaging technology, such as imaging mass spectrometry (IMS) or imaging mass cytometry (IMC), is challenging because of a competitive sampling process which convolves signals from molecules in a single pixel. To address this, we develop a scalable Bayesian framework that leverages natural sparsity in spatial signal patterns to recover relative rates for each molecule across the entire image. Our method relies on the use of a heavy-tailed variant of the graphical lasso prior and a novel hierarchical variational family, enabling efficient inference via automatic differentiation variational inference. Simulation results show that our approach outperforms state-of-the-practice point estimate methodologies in IMS, and has superior posterior coverage than mean-field variational inference techniques. Results on real IMS data demonstrate that our approach better recovers the true anatomical structure of known tissue, removes artifacts, and detects active regions missed by the standard analysis approach.
[LG-89] A Gapped Scale-Sensitive Dimension and Lower Bounds for Offset Rademacher Complexity
链接: https://arxiv.org/abs/2509.20618
作者: Zeyu Jia,Yury Polyanskiy,Alexander Rakhlin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:We study gapped scale-sensitive dimensions of a function class in both sequential and non-sequential settings. We demonstrate that covering numbers for any uniformly bounded class are controlled above by these gapped dimensions, generalizing the results of \citeanthony2000function,alon1997scale. Moreover, we show that the gapped dimensions lead to lower bounds on offset Rademacher averages, thereby strengthening existing approaches for proving lower bounds on rates of convergence in statistical and online learning.
[LG-90] Unsupervised Domain Adaptation with an Unobservable Source Subpopulation
链接: https://arxiv.org/abs/2509.20587
作者: Chao Ying,Jun Jin,Haotian Zhang,Qinglong Tian,Yanyuan Ma,Yixuan Li,Jiwei Zhao
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:We study an unsupervised domain adaptation problem where the source domain consists of subpopulations defined by the binary label Y and a binary background (or environment) A . We focus on a challenging setting in which one such subpopulation in the source domain is unobservable. Naively ignoring this unobserved group can result in biased estimates and degraded predictive performance. Despite this structured missingness, we show that the prediction in the target domain can still be recovered. Specifically, we rigorously derive both background-specific and overall prediction models for the target domain. For practical implementation, we propose the distribution matching method to estimate the subpopulation proportions. We provide theoretical guarantees for the asymptotic behavior of our estimator, and establish an upper bound on the prediction error. Experiments on both synthetic and real-world datasets show that our method outperforms the naive benchmark that does not account for this unobservable source subpopulation.
[LG-91] Fast Estimation of Wasserstein Distances via Regression on Sliced Wasserstein Distances
链接: https://arxiv.org/abs/2509.20508
作者: Khai Nguyen,Hai Nguyen,Nhat Ho
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 35 pages, 20 figures, 4 tables
Abstract:We address the problem of efficiently computing Wasserstein distances for multiple pairs of distributions drawn from a meta-distribution. To this end, we propose a fast estimation method based on regressing Wasserstein distance on sliced Wasserstein (SW) distances. Specifically, we leverage both standard SW distances, which provide lower bounds, and lifted SW distances, which provide upper bounds, as predictors of the true Wasserstein distance. To ensure parsimony, we introduce two linear models: an unconstrained model with a closed-form least-squares solution, and a constrained model that uses only half as many parameters. We show that accurate models can be learned from a small number of distribution pairs. Once estimated, the model can predict the Wasserstein distance for any pair of distributions via a linear combination of SW distances, making it highly efficient. Empirically, we validate our approach on diverse tasks, including Gaussian mixtures, point-cloud classification, and Wasserstein-space visualizations for 3D point clouds. Across various datasets such as MNIST point clouds, ShapeNetV2, MERFISH Cell Niches, and scRNA-seq, our method consistently provides a better approximation of Wasserstein distance than the state-of-the-art Wasserstein embedding model, Wasserstein Wormhole, particularly in low-data regimes. Finally, we demonstrate that our estimator can also accelerate Wormhole training, yielding \textitRG-Wormhole.
[LG-92] Objective Evaluation of Prosody and Intelligibility in Speech Synthesis via Conditional Prediction of Discrete Tokens
链接: https://arxiv.org/abs/2509.20485
作者: Ismail Rasim Ulgen,Zongyang Du,Junchen Lu,Philipp Koehn,Berrak Sisman
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Under review for IEEE OJSP
Abstract:Objective evaluation of synthesized speech is critical for advancing speech generation systems, yet existing metrics for intelligibility and prosody remain limited in scope and weakly correlated with human perception. Word Error Rate (WER) provides only a coarse text-based measure of intelligibility, while F0-RMSE and related pitch-based metrics offer a narrow, reference-dependent view of prosody. To address these limitations, we propose TTScore, a targeted and reference-free evaluation framework based on conditional prediction of discrete speech tokens. TTScore employs two sequence-to-sequence predictors conditioned on input text: TTScore-int, which measures intelligibility through content tokens, and TTScore-pro, which evaluates prosody through prosody tokens. For each synthesized utterance, the predictors compute the likelihood of the corresponding token sequences, yielding interpretable scores that capture alignment with intended linguistic content and prosodic structure. Experiments on the SOMOS, VoiceMOS, and TTSArena benchmarks demonstrate that TTScore-int and TTScore-pro provide reliable, aspect-specific evaluation and achieve stronger correlations with human judgments of overall quality than existing intelligibility and prosody-focused metrics.
[LG-93] Neural Networks as Surrogate Solvers for Time-Dependent Accretion Disk Dynamics
链接: https://arxiv.org/abs/2509.20447
作者: Shunyuan Mao,Weiqi Wang,Sifan Wang,Ruobing Dong,Lu Lu,Kwang Moo Yi,Paris Perdikaris,Andrea Isella,Sébastien Fabbro,Lile Wang
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: Astrophysical Journal Letters accepted; associate animations are available at this https URL
Abstract:Accretion disks are ubiquitous in astrophysics, appearing in diverse environments from planet-forming systems to X-ray binaries and active galactic nuclei. Traditionally, modeling their dynamics requires computationally intensive (magneto)hydrodynamic simulations. Recently, Physics-Informed Neural Networks (PINNs) have emerged as a promising alternative. This approach trains neural networks directly on physical laws without requiring data. We for the first time demonstrate PINNs for solving the two-dimensional, time-dependent hydrodynamics of non-self-gravitating accretion disks. Our models provide solutions at arbitrary times and locations within the training domain, and successfully reproduce key physical phenomena, including the excitation and propagation of spiral density waves and gap formation from disk-companion interactions. Notably, the boundary-free approach enabled by PINNs naturally eliminates the spurious wave reflections at disk edges, which are challenging to suppress in numerical simulations. These results highlight how advanced machine learning techniques can enable physics-driven, data-free modeling of complex astrophysical systems, potentially offering an alternative to traditional numerical simulations in the future.
[LG-94] Sample completion structured correlation and Netflix problems
链接: https://arxiv.org/abs/2509.20404
作者: Leonardo N. Coregliano,Maryanthe Malliaris
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Logic (math.LO); Statistics Theory (math.ST)
*备注: 97 pages, 1 figure
Abstract:We develop a new high-dimensional statistical learning model which can take advantage of structured correlation in data even in the presence of randomness. We completely characterize learnability in this model in terms of VCN _k,k -dimension (essentially k -dependence from Shelah’s classification theory). This model suggests a theoretical explanation for the success of certain algorithms in the 2006~Netflix Prize competition.
[LG-95] An Analytical and AI-discovered Stable Accurate and Generalizable Subgrid-scale Closure for Geophysical Turbulence
链接: https://arxiv.org/abs/2509.20365
作者: Karan Jakhar,Yifei Guan,Pedram Hassanzadeh
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:
Abstract:By combining AI and fluid physics, we discover a closed-form closure for 2D turbulence from small direct numerical simulation (DNS) data. Large-eddy simulation (LES) with this closure is accurate and stable, reproducing DNS statistics including those of extremes. We also show that the new closure could be derived from a 4th-order truncated Taylor expansion. Prior analytical and AI-based work only found the 2nd-order expansion, which led to unstable LES. The additional terms emerge only when inter-scale energy transfer is considered alongside standard reconstruction criterion in the sparse-equation discovery.
信息检索
[IR-0] Markup Language Modeling for Web Document Understanding
链接: https://arxiv.org/abs/2509.20940
作者: Su Liu,Bin Bi,Jan Bakus,Paritosh Kumar Velalam,Vijay Yella,Vinod Hegde
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Web information extraction (WIE) is an important part of many e-commerce systems, supporting tasks like customer analysis and product recommendation. In this work, we look at the problem of building up-to-date product databases by extracting detailed information from shopping review websites. We fine-tuned MarkupLM on product data gathered from review sites of different sizes and then developed a variant we call MarkupLM++, which extends predictions to internal nodes of the DOM tree. Our experiments show that using larger and more diverse training sets improves extraction accuracy overall. We also find that including internal nodes helps with some product attributes, although it leads to a slight drop in overall performance. The final model reached a precision of 0.906, recall of 0.724, and an F1 score of 0.805.
[IR-1] FORGE: Forming Semantic Identifiers for Generative Retrieval in Industrial Datasets
链接: https://arxiv.org/abs/2509.20904
作者: Kairui Fu,Tao Zhang,Shuwen Xiao,Ziyang Wang,Xinming Zhang,Chenchi Zhang,Yuliang Yan,Junjun Zheng,Yu Li,Zhihong Chen,Jian Wu,Xiangheng Kong,Shengyu Zhang,Kun Kuang,Yuning Jiang,Bo Zheng
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Semantic identifiers (SIDs) have gained increasing attention in generative retrieval (GR) due to their meaningful semantic discriminability. However, current research on SIDs faces three main challenges: (1) the absence of large-scale public datasets with multimodal features, (2) limited investigation into optimization strategies for SID generation, which typically rely on costly GR training for evaluation, and (3) slow online convergence in industrial deployment. To address these challenges, we propose FORGE, a comprehensive benchmark for FOrming semantic identifieR in Generative rEtrieval with industrial datasets. Specifically, FORGE is equipped with a dataset comprising 14 billion user interactions and multimodal features of 250 million items sampled from Taobao, one of the biggest e-commerce platforms in China. Leveraging this dataset, FORGE explores several optimizations to enhance the SID construction and validates their effectiveness via offline experiments across different settings and tasks. Further online analysis conducted on our platform, which serves over 300 million users daily, reveals a 0.35% increase in transaction count, highlighting the practical impact of our method. Regarding the expensive SID validation accompanied by the full training of GRs, we propose two novel metrics of SID that correlate positively with recommendation performance, enabling convenient evaluations without any GR training. For real-world applications, FORGE introduces an offline pretraining schema that reduces online convergence by half. The code and data are available at this https URL.
[IR-2] Performance Consistency of Learning Methods for Information Retrieval Tasks
链接: https://arxiv.org/abs/2509.20804
作者: Meng Yuan,Justin Zobel
类目: Information Retrieval (cs.IR)
*备注:
Abstract:A range of approaches have been proposed for estimating the accuracy or robustness of the measured performance of IR methods. One is to use bootstrapping of test sets, which, as we confirm, provides an estimate of variation in performance. For IR methods that rely on a seed, such as those that involve machine learning, another approach is to use a random set of seeds to examine performance variation. Using three different IR tasks we have used such randomness to examine a range of traditional statistical learning models and transformer-based learning models. While the statistical models are stable, the transformer models show huge variation as seeds are changed. In 9 of 11 cases the F1-scores (in the range 0.0–1.0) had a standard deviation of over 0.075; while 7 of 11 precision values (also in the range 0.0–1.0) had a standard deviation of over 0.125. This is in a context where differences of less than 0.02 have been used as evidence of method improvement. Our findings highlight the vulnerability of transformer models to training instabilities and moreover raise questions about the reliability of previous results, thus underscoring the need for rigorous evaluation practices.
[IR-3] DELM: a Python toolkit for Data Extraction with Language Models
链接: https://arxiv.org/abs/2509.20617
作者: Eric Fithian,Kirill Skobelev
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Large Language Models (LLMs) have become powerful tools for annotating unstructured data. However, most existing workflows rely on ad hoc scripts, making reproducibility, robustness, and systematic evaluation difficult. To address these challenges, we introduce DELM (Data Extraction with Language Models), an open-source Python toolkit designed for rapid experimental iteration of LLM-based data extraction pipelines and for quantifying the trade-offs between them. DELM minimizes boilerplate code and offers a modular framework with structured outputs, built-in validation, flexible data-loading and scoring strategies, and efficient batch processing. It also includes robust support for working with LLM APIs, featuring retry logic, result caching, detailed cost tracking, and comprehensive configuration management. We showcase DELM’s capabilities through two case studies: one featuring a novel prompt optimization algorithm, and another illustrating how DELM quantifies trade-offs between cost and coverage when selecting keywords to decide which paragraphs to pass to an LLM. DELM is available at \hrefthis https URL\textttthis http URL.