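This post lists the latest papers retrieved from arXiv.org on 2025-09-03. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: The daily paper data is fetched from arXiv.org and updated automatically at around 12:00 each day.

Tip: If you would like to receive the daily paper data by email, leave your email address in the comments.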

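Contents

Overview (2025-09-03)

1320 papers were updated today, including:

  • Natural Language Processing: 190 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 373 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 263 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 380 papers (Machine Learning (cs.LG))

Natural Language Processing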

[NLP-0] DynaGuard: A Dynamic Guardrail Model With User-Defined Policies

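[Quick Read]: This paper addresses the limited flexibility and adaptability of conventional guardian models: standard guardian models such as LlamaGuard detect only predefined, static harm categories and are hard to adapt to diverse application scenarios and user-defined policies. The key to its solution is dynamic guardian models, which evaluate text against user-defined policies and can therefore detect violations of free-form policies. They offer two usage modes: fast identification of policy violations, and chain-of-thought reasoning that explains and justifies the output. They match static models in detection accuracy while enabling efficient inference in more complex scenarios.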

Link: https://arxiv.org/abs/2509.02563
Authors: Monte Hoover, Vatsal Baherwani, Neel Jain, Khalid Saifullah, Joseph Vincent, Chirag Jain, Melissa Kazemi Rad, C. Bayan Bruss, Ashwinee Panda, Tom Goldstein
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 22 Pages

Abstract:Guardian models are used to supervise and moderate the outputs of user-facing chatbots, enforcing guardrails and detecting bad behaviors. Standard guardian models like LlamaGuard detect predefined, static categories of harms. We propose dynamic guardian models that evaluate text based on user-defined policies, making them useful for different application domains that are not addressed by standard guardian models. Our dynamic guardian models can be used for fast detection of policy violations or with chain-of-thought reasoning that articulates and justifies the model outputs. Our dynamic guardian models match static models in detection accuracy for static harm categories while identifying violations of free-form policies with accuracy comparable to frontier reasoning models in a fraction of the time.

[NLP-1] PalmX 2025: The First Shared Task on Benchmarking LLMs on Arabic and Islamic Culture

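[Quick Read]: This paper addresses the markedly limited understanding that Large Language Models (LLMs) have of Arabic and Islamic cultural knowledge. The gap arises because pre-training data comes mainly from high-resource Western languages and cultures, leaving models weak on low-resource cultural communities. In response, the authors introduce PalmX 2025, the first shared task dedicated to evaluating the cultural competence of LLMs in Arabic and Islamic culture. The key to the solution is two multiple-choice question (MCQ) subtasks in Modern Standard Arabic (MSA), General Arabic Culture and General Islamic Culture, covering a wide range of topics such as traditions, food, history, religious practices, and language expressions across 22 Arab countries. Task-specific fine-tuning substantially improves model performance: the best systems reach 72.15% accuracy on cultural questions and 84.22% on Islamic knowledge, and parameter-efficient fine-tuning emerged as the most effective strategy.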

Link: https://arxiv.org/abs/2509.02550
Authors: Fakhraddin Alwajih, Abdellah El Mekki, Hamdy Mubarak, Majd Hawasly, Abubakr Mohamed, Muhammad Abdul-Mageed
Affiliations: The University of British Columbia; Qatar Computing Research Institute
Categories: Computation and Language (cs.CL)
Comments: this https URL

Abstract:Large Language Models (LLMs) inherently reflect the vast data distributions they encounter during their pre-training phase. As this data is predominantly sourced from the web, there is a high chance it will be skewed towards high-resourced languages and cultures, such as those of the West. Consequently, LLMs often exhibit a diminished understanding of certain communities, a gap that is particularly evident in their knowledge of Arabic and Islamic cultures. This issue becomes even more pronounced with increasingly under-represented topics. To address this critical challenge, we introduce PalmX 2025, the first shared task designed to benchmark the cultural competence of LLMs in these specific domains. The task is composed of two subtasks featuring multiple-choice questions (MCQs) in Modern Standard Arabic (MSA): General Arabic Culture and General Islamic Culture. These subtasks cover a wide range of topics, including traditions, food, history, religious practices, and language expressions from across 22 Arab countries. The initiative drew considerable interest, with 26 teams registering for Subtask 1 and 19 for Subtask 2, culminating in nine and six valid submissions, respectively. Our findings reveal that task-specific fine-tuning substantially boosts performance over baseline models. The top-performing systems achieved an accuracy of 72.15% on cultural questions and 84.22% on Islamic knowledge. Parameter-efficient fine-tuning emerged as the predominant and most effective approach among participants, while the utility of data augmentation was found to be domain-dependent.

[NLP-2] The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

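[Quick Read]: This paper addresses the limitation that current reinforcement learning (RL) for Large Language Models (LLMs) is confined to passive sequence generation. It reframes LLMs as autonomous decision-making agents (Agentic Reinforcement Learning, Agentic RL) so that they can be deployed effectively in complex, dynamic environments. The key to the solution is to redefine the task-modeling framework using temporally extended, partially observable Markov decision processes (POMDPs), and to build a taxonomy along two dimensions: core agentic capabilities (such as planning, tool use, memory, reasoning, self-improvement, and perception) and application scenarios. It also emphasizes reinforcement learning as the critical mechanism that turns these capabilities from static, heuristic modules into adaptive, robust behavior.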

Link: https://arxiv.org/abs/2509.02547
Authors: Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhongzhi Li, Xiangyuan Xue, Yijiang Li, Yifan Zhou, Yang Chen, Chen Zhang, Yutao Fan, Zihu Wang, Songtao Huang, Yue Liao, Hongru Wang, Mengyue Yang, Heng Ji, Michael Littman, Jun Wang, Shuicheng Yan, Philip Torr, Lei Bai
Affiliations: University of Oxford; Shanghai AI Laboratory; National University of Singapore; University College London; University of Illinois Urbana-Champaign; Brown University; University of Science and Technology of China; Imperial College London; University of Bristol; Chinese Academy of Sciences; The Chinese University of Hong Kong; Fudan University; University of Georgia; University of California San Diego; Dalian University of Technology; University of California Santa Barbara
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:The emergence of agentic reinforcement learning (Agentic RL) marks a paradigm shift from conventional reinforcement learning applied to large language models (LLM RL), reframing LLMs from passive sequence generators into autonomous, decision-making agents embedded in complex, dynamic worlds. This survey formalizes this conceptual shift by contrasting the degenerate single-step Markov Decision Processes (MDPs) of LLM-RL with the temporally extended, partially observable Markov decision processes (POMDPs) that define Agentic RL. Building on this foundation, we propose a comprehensive twofold taxonomy: one organized around core agentic capabilities, including planning, tool use, memory, reasoning, self-improvement, and perception, and the other around their applications across diverse task domains. Central to our thesis is that reinforcement learning serves as the critical mechanism for transforming these capabilities from static, heuristic modules into adaptive, robust agentic behavior. To support and accelerate future research, we consolidate the landscape of open-source environments, benchmarks, and frameworks into a practical compendium. By synthesizing over five hundred recent works, this survey charts the contours of this rapidly evolving field and highlights the opportunities and challenges that will shape the development of scalable, general-purpose AI agents.

[NLP-3] UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

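[Quick Read]: This paper addresses challenges facing autonomous agents for graphical user interfaces (GUIs): data scalability, multi-turn reinforcement learning (RL), the limitations of GUI-only operation, and environment stability. The key to its solution is a systematic training methodology: a data flywheel for scalable data generation, a stabilized multi-turn RL framework, a hybrid GUI environment that integrates file systems and terminals, and a unified sandbox platform that supports large-scale rollouts. With these innovations, UI-TARS-2 significantly outperforms its predecessor and mainstream commercial agents (such as Claude and OpenAI agents) on multiple GUI benchmarks, and shows strong generalization on game and software engineering tasks, demonstrating its effectiveness and robustness in complex interactive scenarios.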

Link: https://arxiv.org/abs/2509.02544
Authors: Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, Wanjun Zhong, Yining Ye, Yujia Qin, Yuwen Xiong, Yuxin Song, Zhiyong Wu, Bo Li, Chen Dun, Chong Liu, Fuxing Leng, Hanbin Wang, Hao Yu, Haobin Chen, Hongyi Guo, Jing Su, Jingjia Huang, Kai Shen, Kaiyu Shi, Lin Yan, Peiyao Zhao, Pengfei Liu, Qinghao Ye, Renjie Zheng, Wayne Xin Zhao, Wen Heng, Wenhao Huang, Wenqian Wang, Xiaobo Qin, Yi Lin, Youbin Wu, Zehui Chen, Zihao Wang, Baoquan Zhong, Xinchun Zhang, Xujing Li, Yuanfan Li, Zhongkai Zhao, Chengquan Jiang, Faming Wu, Haotian Zhou, Jinlin Pang, Li Han, Qianli Ma, Siyao Liu, Songhua Cai, Wenqi Fu, Xin Liu, Zhi Zhang, Bo Zhou, Guoliang Li, Jiajun Shi, Jiale Yang, Jie Tang, Li Li, Taoran Lu, Woyu Lin, Xiaokang Tong, Xinyao Li, Yichi Zhang, Yu Miao, Zhengxuan Jiang, Zili Li, Ziyuan Zhao, Chenxin Li, Dehua Ma, Feng Lin, Ge Zhang, Haihua Yang, Hangyu Guo, Hongda Zhu, Jiaheng Liu, Junda Du, Kai Cai, Kuanye Li, Lichen Yuan, Meilan Han, Minchao Wang, Shuyue Guo, Tianhao Cheng, Xiaobo Ma, Xiaojun Xiao, Xiaolong Huang, Xinjie Chen, Yidi Du, Yilin Chen, Yiwen Wang, Zhaojian Li, Zhenzhu Yang, Zhiyuan Zeng, Chaolin Jin
Affiliations: ByteDance
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments:

Abstract: The development of autonomous agents for graphical user interfaces (GUIs) presents major challenges in artificial intelligence. While recent advances in native agent models have shown promise by unifying perception, reasoning, action, and memory through end-to-end learning, open problems remain in data scalability, multi-turn reinforcement learning (RL), the limitations of GUI-only operation, and environment stability. In this technical report, we present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology: a data flywheel for scalable data generation, a stabilized multi-turn RL framework, a hybrid GUI environment that integrates file systems and terminals, and a unified sandbox platform for large-scale rollouts. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5. On GUI benchmarks, it reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld, outperforming strong baselines such as Claude and OpenAI agents. In game environments, it attains a mean normalized score of 59.8 across a 15-game suite (roughly 60% of human-level performance) and remains competitive with frontier proprietary models (e.g., OpenAI o3) on LMGame-Bench. Additionally, the model can generalize to long-horizon information-seeking tasks and software engineering benchmarks, highlighting its robustness across diverse agent tasks. Detailed analyses of training dynamics further provide insights into achieving stability and efficiency in large-scale agent RL. These results underscore UI-TARS-2's potential to advance the state of GUI agents and exhibit strong generalization to real-world interactive scenarios.

[NLP-4] Jointly Reinforcing Diversity and Quality in Language Model Generations

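[Quick Read]: This paper addresses the problem that post-training of Large Language Models (LLMs) over-emphasizes accuracy and helpfulness, which narrows the output distribution and reduces semantic diversity, limiting the models' potential in creative and exploratory tasks such as brainstorming, storytelling, or problem solving. The key to the solution is a framework called Diversity-Aware Reinforcement Learning (DARLING). Its core innovation is a learned partition function that measures semantic diversity beyond surface-level lexical differences; this diversity signal is combined with a quality reward during online reinforcement learning, guiding the model toward outputs that are both high-quality and diverse.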

Link: https://arxiv.org/abs/2509.02534
Authors: Tianjian Li, Yiming Zhang, Ping Yu, Swarnadeep Saha, Daniel Khashabi, Jason Weston, Jack Lanchantin, Tianlu Wang
Affiliations: Johns Hopkins University; Meta
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 29 pages, 11 figures

Abstract:Post-training of Large Language Models (LMs) often prioritizes accuracy and helpfulness at the expense of diversity. This creates a tension: while post-training improves response quality, it also sharpens output distributions and reduces the range of ideas, limiting the usefulness of LMs in creative and exploratory tasks such as brainstorming, storytelling, or problem solving. We address this challenge with Diversity-Aware Reinforcement Learning (DARLING), a framework that jointly optimizes for response quality and semantic diversity. At its core, DARLING introduces a learned partition function to measure diversity beyond surface-level lexical variations. This diversity signal is then combined with a quality reward during online reinforcement learning, encouraging models to generate outputs that are both high-quality and distinct. Experiments across multiple model families and sizes show that DARLING generalizes to two regimes: non-verifiable tasks (instruction following and creative writing) and verifiable tasks (competition math). On five benchmarks in the first setting, DARLING consistently outperforms quality-only RL baselines, producing outputs that are simultaneously of higher quality and novelty. In the second setting, DARLING achieves higher pass@1 (solution quality) and pass@k (solution variety). Most strikingly, explicitly optimizing for diversity catalyzes exploration in online RL, which manifests itself as higher-quality responses.
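
The abstract reports pass@1 and pass@k but does not say how they are computed. As a minimal sketch, assuming the estimator commonly used in generation benchmarks (draw n samples per problem, count the c correct ones, and estimate the chance that at least one of k samples is correct), the metric can be written as follows. The function name `pass_at_k` is illustrative and not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Assumption: this is the standard combinatorial estimator
    1 - C(n - c, k) / C(n, k); the paper may compute it differently.
    """
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 the estimate reduces to the fraction of correct samples, which is why the abstract can read pass@1 as solution quality and pass@k as solution variety.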

[NLP-5] Flavors of Moonshine: Tiny Specialized ASR Models for Edge Devices

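[Quick Read]: This paper addresses the weak performance of small automatic speech recognition (ASR) models on low-resource languages, especially where high-quality labeled data is scarce. The key to the solution: for small models with only 27M parameters, train dedicated monolingual ASR systems on a carefully balanced mix of high-quality human-labeled, pseudo-labeled, and synthetic data. These systems substantially outperform the comparably sized Whisper Tiny model, outperform the 9x larger Whisper Small model, and in most cases match or outperform the 28x larger Whisper Medium model. The approach challenges the conventional belief that multilingual models necessarily outperform monolingual ones and provides accurate on-device speech recognition for under-resourced languages.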

Link: https://arxiv.org/abs/2509.02523
Authors: Evan King, Adam Sabra, Manjunath Kudlur, James Wang, Pete Warden
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments:

Abstract:We present the Flavors of Moonshine, a suite of tiny automatic speech recognition (ASR) models specialized for a range of underrepresented languages. Prevailing wisdom suggests that multilingual ASR models outperform monolingual counterparts by exploiting cross-lingual phonetic similarities. We challenge this assumption, showing that for sufficiently small models (27M parameters), training monolingual systems on a carefully balanced mix of high-quality human-labeled, pseudo-labeled, and synthetic data yields substantially superior performance. On average, our models achieve error rates 48% lower than the comparably sized Whisper Tiny model, outperform the 9x larger Whisper Small model, and in most cases match or outperform the 28x larger Whisper Medium model. These results advance the state of the art for models of this size, enabling accurate on-device ASR for languages that previously had limited support. We release Arabic, Chinese, Japanese, Korean, Ukrainian, and Vietnamese Moonshine models under a permissive open-source license.

[NLP-6] Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR

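[Quick Read]: This paper addresses the sparse reward signals and unstable policy gradient updates that the Reinforcement Learning with Verifiable Rewards (RLVR) framework faces when training Large Language Models (LLMs). Existing RL-based methods are limited on complex tasks such as mathematical reasoning, mainly because the reward signal is hard to propagate effectively into policy optimization. The key to the solution is a new framework called PACS, whose core innovation is implicit actor-critic coupling through a supervised learning paradigm: it treats the verifiable outcome reward as a predictable label, reformulates the RLVR problem as a supervised learning task over a score function, and optimizes it with cross-entropy loss. Theoretical analysis shows that this approach inherently recovers the classical policy gradient update while avoiding the training instability caused by explicitly separating actor and critic, yielding more stable and efficient training. Experiments show that PACS significantly outperforms strong baselines such as PPO and GRPO on mathematical reasoning benchmarks such as AIME 2025.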

Link: https://arxiv.org/abs/2509.02522
Authors: Jiaming Li, Longze Chen, Ze Gong, Yukun Chen, Lu Wang, Wanwei He, Run Luo, Min Yang
Affiliations: Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Ritzz-AI
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract: Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have empowered large language models (LLMs) to tackle challenging reasoning tasks such as mathematics and programming. RLVR leverages verifiable outcome rewards to guide policy optimization, enabling LLMs to progressively improve output quality in a grounded and reliable manner. Despite its promise, the RLVR paradigm poses significant challenges, as existing methods often suffer from sparse reward signals and unstable policy gradient updates, particularly in RL-based approaches. To address the challenges, we propose PACS, a novel RLVR framework that achieves imPlicit Actor Critic coupling via a Supervised learning framework. By treating the outcome reward as a predictable label, we reformulate the RLVR problem into a supervised learning task over a score function parameterized by the policy model and optimized using cross-entropy loss. A detailed gradient analysis shows that this supervised formulation inherently recovers the classical policy gradient update while implicitly coupling actor and critic roles, yielding more stable and efficient training. Benchmarking on challenging mathematical reasoning tasks, PACS outperforms strong RLVR baselines, such as PPO and GRPO, achieving superior reasoning performance. For instance, PACS achieves 59.78% at pass@256 on AIME 2025, representing improvements of 13.32 and 14.36 points over PPO and GRPO. This simple yet powerful framework offers a promising avenue for LLMs post-training with verifiable rewards. Our code and data are available as open source at this https URL.
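
The abstract describes treating the outcome reward as a label and optimizing a score function with cross-entropy loss, but it does not give the form of the score function. The sketch below is only an illustration of that general idea under stated assumptions: scores are plain floats (in the paper they are parameterized by the policy model), rewards are binary (1 for a verified-correct response, 0 otherwise), and the names `log_sigmoid` and `outcome_cross_entropy` are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def outcome_cross_entropy(scores, rewards) -> float:
    """Mean binary cross-entropy between sigmoid(score) and the reward label.

    Assumption: `scores` stands in for the paper's policy-parameterized
    score function, whose exact definition is not given in the abstract.
    """
    total = 0.0
    for s, r in zip(scores, rewards):
        # log(1 - sigmoid(s)) equals log(sigmoid(-s)).
        total -= r * log_sigmoid(s) + (1 - r) * log_sigmoid(-s)
    return total / len(scores)
```

A confident score for a correct response (large positive `s` with `r = 1`) gives a loss near zero, while the same score for an incorrect response is penalized heavily, which is the supervised signal the abstract refers to.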

[NLP-7] FLM-Audio: Natural Monologues Improves Native Full-Duplex Chatbots via Dual Training

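[Quick Read]: This paper addresses the temporal alignment between textual monologues and audio streams of different bitrates in full-duplex dialog models. In particular, existing word-level alignment methods degrade the language ability of large pre-trained models, and their need for highly accurate timestamps leads to cascading errors and higher pre-processing costs. The key to the solution is "natural" monologues: continuous token sequences that mimic human cognitive behavior. During training, the position of the monologue relative to the audio (leading or trailing) is alternated to achieve effective alignment, forming a "dual" training paradigm that markedly improves the model's responsiveness, duplexity, and chatting experience.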

Link: https://arxiv.org/abs/2509.02521
Authors: Yiqun Yao, Xiang Li, Xin Jiang, Xuezhi Fang, Naitong Yu, Wenjia Ma, Aixin Sun, Yequan Wang
Affiliations: Beijing Academy of Artificial Intelligence; Spin Matrix; Nanyang Technological University
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract: Full-duplex dialog models are designed to listen and speak simultaneously with rapid responses to fast-changing user input. Among existing approaches, native full-duplex models merge different channels (e.g. listen and speak) in a single time step, overcoming the high response latency inherent to time-division multiplexing (TDM) alternatives. Yet, a key challenge remains: aligning textual monologues with audio streams that operate at different bitrates. The prevailing solution relies on word-level alignment, but this can degrade the language ability of large pre-trained models. Moreover, it requires highly accurate timestamps for every token, which introduces cascading errors and increases pre-processing costs. In this paper, we propose textual monologues in continuous token sequences, namely "natural" monologues, which mimic humanoid cognitive behavior in dialogs. For temporal alignment, we alternate the position of the natural monologue - leading or trailing the audio - across different training stages. This "dual" training paradigm proves highly effective in building FLM-Audio, our 7B spoken dialog model that demonstrates superior responsiveness, duplexity, and chatting experiences, as confirmed by experimental results.

[NLP-8] Comparative Study of Pre-Trained BERT and Large Language Models for Code-Mixed Named Entity Recognition

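[Quick Read]: This paper addresses Named Entity Recognition (NER) in code-mixed text, focusing on performance bottlenecks in Hindi-English (Hinglish). The core challenges are informal structure, transliteration, and frequent language switching. The key to the solution is a comparative analysis of model families: models pre-trained on code-mixed data (HingBERT, HingMBERT, and HingRoBERTa), non-code-mixed multilingual models (such as IndicBERT and MuRIL), and zero-shot generative large language models (LLMs, such as Google Gemini). The study finds that the specialized code-mixed models clearly outperform the others thanks to domain-specific pre-training, showing stronger task fit; the zero-shot LLM Google Gemini also generalizes well, indicating that modern LLMs retain some potential without task-specific training.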

Link: https://arxiv.org/abs/2509.02514
Authors: Mayur Shirke, Amey Shembade, Pavan Thorat, Madhushri Wagh, Raviraj Joshi
Affiliations: Pune Institute of Computer Technology; Indian Institute of Technology Madras; L3Cube Labs
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Named Entity Recognition (NER) in code-mixed text, particularly Hindi-English (Hinglish), presents unique challenges due to informal structure, transliteration, and frequent language switching. This study conducts a comparative evaluation of code-mixed fine-tuned models and non-code-mixed multilingual models, along with zero-shot generative large language models (LLMs). Specifically, we evaluate HingBERT, HingMBERT, and HingRoBERTa (trained on code-mixed data), and BERT Base Cased, IndicBERT, RoBERTa and MuRIL (trained on non-code-mixed multilingual data). We also assess the performance of Google Gemini in a zero-shot setting using a modified version of the dataset with NER tags removed. All models are tested on a benchmark Hinglish NER dataset using Precision, Recall, and F1-score. Results show that code-mixed models, particularly HingRoBERTa and HingBERT-based fine-tuned models, outperform others - including closed-source LLMs like Google Gemini - due to domain-specific pretraining. Non-code-mixed models perform reasonably but show limited adaptability. Notably, Google Gemini exhibits competitive zero-shot performance, underlining the generalization strength of modern LLMs. This study provides key insights into the effectiveness of specialized versus generalized models for code-mixed NER tasks.
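
The abstract names Precision, Recall, and F1-score as the evaluation metrics without stating the matching granularity. A minimal sketch, assuming entity-level exact matching where each entity is a hypothetical `(start, end, label)` tuple (the paper may instead score per token):

```python
def precision_recall_f1(gold, predicted):
    """Entity-level precision, recall, and F1 over exact-match entities.

    Assumption: entities are hashable (start, end, label) tuples and only
    exact matches count as true positives.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, if the gold entities are `{(0, 1, "PER"), (3, 4, "LOC")}` and the model predicts `{(0, 1, "PER"), (5, 6, "ORG")}`, one of two predictions is right and one of two gold entities is found, so all three metrics equal 0.5.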

[NLP-9] Top-H Decoding: Adapting the Creativity and Coherence with Bounded Entropy in Text Generation

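[Quick Read]: This paper addresses the difficulty Large Language Models (LLMs) have in balancing creativity and logical coherence in open-ended text generation. Existing truncated sampling techniques (temperature scaling, top-p (nucleus) sampling, and min-p sampling) can adjust this trade-off but do not make effective use of the model's confidence; min-p sampling in particular uses only the single most probable token as a confidence heuristic, leaving the information in the probability distribution underused. The key to the solution is top-H decoding. Its theoretical basis models the interplay of creativity and coherence as an entropy-constrained minimum divergence problem and proves it equivalent to an NP-hard entropy-constrained mass maximization (ECMM) problem; a computationally efficient greedy algorithm then approximately solves ECMM, substantially increasing diversity and creativity while preserving coherence. Empirically, top-H improves on the best alternative (min-p) by up to 25.63% on creative writing benchmarks while remaining robust on question-answering datasets (GPQA, GSM8K, MT-Bench).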

Link: https://arxiv.org/abs/2509.02510
Authors: Erfan Baghaei Potraghloo, Seyedarmin Azizi, Souvik Kundu, Massoud Pedram
Affiliations: University of Southern California; Intel Labs
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract: Large language models (LLMs), despite their impressive performance across a wide range of tasks, often struggle to balance two competing objectives in open-ended text generation: fostering diversity and creativity while preserving logical coherence. Existing truncated sampling techniques, including temperature scaling, top-p (nucleus) sampling, and min-p sampling, aim to manage this trade-off. However, they exhibit limitations, particularly in the effective incorporation of the confidence of the model into the corresponding sampling strategy. For example, min-p sampling relies on a single top token as a heuristic for confidence, eventually underutilizing the information of the probability distribution. Toward effective incorporation of the confidence of the model, in this paper, we present top-H decoding. We first establish the theoretical foundation of the interplay between creativity and coherence in truncated sampling by formulating an entropy-constrained minimum divergence problem. We then prove this minimization problem to be equivalent to an entropy-constrained mass maximization (ECMM) problem, which is NP-hard. Finally, we present top-H decoding, a computationally efficient greedy algorithm to solve the ECMM problem. Extensive empirical evaluations demonstrate that top-H outperforms the state-of-the-art (SoTA) alternative of min-p sampling by up to 25.63% on creative writing benchmarks, while maintaining robustness on question-answering datasets such as GPQA, GSM8K, and MT-Bench. Additionally, an LLM-as-judge evaluation confirms that top-H indeed produces coherent outputs even at higher temperatures, where creativity is especially critical. In summary, top-H advances SoTA in open-ended text generation and can be easily integrated into creative writing applications. The code is available at this https URL.
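
The abstract describes a greedy algorithm for an entropy-constrained mass maximization problem but does not spell out the constraint. The sketch below is a rough illustration rather than the authors' algorithm. It assumes the entropy budget is a fraction `alpha` of the full distribution's entropy, adds tokens in descending probability order, and stops at the first token that would push the renormalized entropy over the budget. The function names and the `alpha` parameter are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def greedy_entropy_bounded_truncation(probs, alpha=0.5):
    """Keep the largest-probability tokens whose renormalized entropy
    stays within alpha * H(probs). Returns {token_index: probability}.

    Assumption: the alpha * H(probs) budget is illustrative; the abstract
    does not state the exact constraint used by top-H.
    """
    budget = alpha * entropy(probs)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = [order[0]]  # a single token has zero entropy, so it always fits
    for i in order[1:]:
        candidate = kept + [i]
        mass = sum(probs[j] for j in candidate)
        if entropy([probs[j] / mass for j in candidate]) > budget:
            break
        kept = candidate
    mass = sum(probs[j] for j in kept)
    return {j: probs[j] / mass for j in kept}
```

Unlike min-p, which thresholds on the top token alone, this kind of rule looks at the shape of the whole distribution: a peaked distribution has low entropy and keeps few tokens, while a flat one admits more candidates.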

[NLP-10] The Forgotten Code: Validating a Century-Old Translation System with AI

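[Quick Read]: This paper addresses how modern artificial intelligence (AI) can re-validate and extend an early Rule-Based Machine Translation (RBMT) system: the mechanical translation method based on international keys and ideograms that Federico Pucci first presented in 1929 and described in full in 1931, examining both its historical value and practical feasibility. The key to the solution is to have generative AI retranslate, strictly following Pucci's original rules, two historical excerpts (from Dante's La Vita Nuova and Voltaire's Zadig). The AI translations produced 94 years later differ very little on average from Pucci's originals, with only minor deviations. This validation reproduces Pucci's method and extends it to English, Spanish, and German, confirming its cross-lingual applicability. It also restores this long-overlooked early contribution to an important place in the history of machine translation, alongside Troyanskij, Booth, and Weaver, prompting a reassessment of the field's origins.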

Link: https://arxiv.org/abs/2509.02506
Authors: Jean-Marie Le Ray
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Preprint, 35 pages, 14 figures, 9 appendices

Abstract:A pioneering rule-based mechanical translation system (precursor of modern RBMTs) was first presented in December 1929 by its inventor, Federico Pucci, who later published the full method in a book titled “Il traduttore meccanico ed il metodo per corrispondersi fra Europei conoscendo ciascuno solo la propria lingua: Parte I”, in Salerno (Italy), in 1931. This study illustrates how AI breathes new life into the system of international keys and ideograms devised by Pucci to translate from/into any Romance language (at least as a first step). The methodology involves having the AIs retranslate, following Pucci’s method, the two text excerpts originally translated in 1931 and clearly documented in his publication: a passage from Dante’s La Vita Nuova, translated from Italian into French, and a passage from Voltaire’s Zadig, translated from French into Italian. The result is notable: the two texts, translated 94 years apart using the same method–by Pucci in 1931 and by AIs in 2025–show a low average difference, with only minor variations observed. With Pucci’s system thus validated, it became feasible to have the AIs reproduce the excerpts in English, Spanish, and German according to his method. The results were consistent, and Pucci–via Artificial Intelligence–was tasked with translating more modern and technical texts, thereby reviving, nearly a century later, an invention that had remained almost entirely unknown and never applied beyond its creator, now brought to wider attention and opened to possible experimentation. Such a demonstration would not only affirm Pucci’s historical status but also place him among the precursors and intellectual contributors to machine translation, whose work merits examination alongside figures such as Troyanskij, Booth, and Weaver, with possible consequences for how the history of the field is understood.

[NLP-11] L3Cube-IndicHeadline-ID: A Dataset for Headline Identification and Semantic Evaluation in Low-Resource Indian Languages

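[Quick Read]: This paper addresses semantic evaluation in low-resource languages, especially Indic languages, a long-standing challenge in natural language processing (NLP). Sentence transformers perform well in high-resource languages, but their effectiveness in low-resource languages such as the Indic languages is unclear, mainly for lack of high-quality benchmark datasets. The key to the solution is building and publicly releasing the L3Cube-IndicHeadline-ID dataset, which covers 10 low-resource Indic languages (including Marathi, Hindi, and Tamil). Each news article is paired with four headline variants (the original, a semantically similar one, a lexically similar one, and an unrelated one) to test fine-grained sensitivity to semantic differences. Benchmarking multilingual and language-specific sentence transformers on the dataset shows that multilingual models perform consistently well while language-specific models vary considerably, providing a useful tool and evaluation basis for improving semantic understanding in tasks such as Retrieval-Augmented Generation (RAG).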

Link: https://arxiv.org/abs/2509.02503
Authors: Nishant Tanksale, Tanmay Kokate, Darshan Gohad, Sarvadnyaa Barate, Raviraj Joshi
Affiliations: Pune Institute of Computer Technology (PICT); Indian Institute of Technology Madras; L3Cube Labs
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Semantic evaluation in low-resource languages remains a major challenge in NLP. While sentence transformers have shown strong performance in high-resource settings, their effectiveness in Indic languages is underexplored due to a lack of high-quality benchmarks. To bridge this gap, we introduce L3Cube-IndicHeadline-ID, a curated headline identification dataset spanning ten low-resource Indic languages: Marathi, Hindi, Tamil, Gujarati, Odia, Kannada, Malayalam, Punjabi, Telugu, Bengali and English. Each language includes 20,000 news articles paired with four headline variants: the original, a semantically similar version, a lexically similar version, and an unrelated one, designed to test fine-grained semantic understanding. The task requires selecting the correct headline from the options using article-headline similarity. We benchmark several sentence transformers, including multilingual and language-specific models, using cosine similarity. Results show that multilingual models consistently perform well, while language-specific models vary in effectiveness. Given the rising use of similarity models in Retrieval-Augmented Generation (RAG) pipelines, this dataset also serves as a valuable resource for evaluating and improving semantic understanding in such applications. Additionally, the dataset can be repurposed for multiple-choice question answering, headline classification, or other task-specific evaluations of LLMs, making it a versatile benchmark for Indic NLP. The dataset is shared publicly at this https URL
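
The task in the abstract, selecting the correct headline by article-headline cosine similarity, can be sketched as below. The embeddings are assumed to come from some sentence transformer; here they are plain lists of floats, and the function names `cosine` and `pick_headline` are illustrative rather than taken from the dataset's tooling.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def pick_headline(article_vec, headline_vecs):
    """Return the index of the headline most similar to the article."""
    sims = [cosine(article_vec, h) for h in headline_vecs]
    return max(range(len(sims)), key=sims.__getitem__)
```

A model scores well on the benchmark when the index returned for each article is that of the original headline rather than the semantically similar, lexically similar, or unrelated variant.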

[NLP-12] MoSEs: Uncertainty-Aware AI-Generated Text Detection via Mixture of Stylistics Experts with Conditional Thresholds EMNLP2025

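[Quick Read]: This paper addresses two problems in current AI-generated text detection systems: existing methods neglect stylistic modeling, which limits detection performance, and they rely on static thresholds, which cannot adapt to different text styles. The key to the solution is the Mixture of Stylistic Experts (MoSEs) framework, with three core components: the Stylistics Reference Repository (SRR), the Stylistics-Aware Router (SAR), and the Conditional Threshold Estimator (CTE). The CTE jointly models linguistic statistical properties and semantic features to determine the optimal detection threshold dynamically, enabling stylistics-aware uncertainty quantification and markedly improving detection accuracy, especially in low-resource settings.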

Link: https://arxiv.org/abs/2509.02499
Authors: Junxi Wu, Jinpeng Wang, Zheng Liu, Bin Chen, Dongjian Hu, Hao Wu, Shu-Tao Xiu
Affiliations: Nankai University; Tsinghua University; Harbin Institute of Technology, Shenzhen; Peng Cheng Laboratory; Shenzhen ShenNong Information Technology Co., Ltd.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2025

Abstract: The rapid advancement of large language models has intensified public concerns about the potential misuse. Therefore, it is important to build trustworthy AI-generated text detection systems. Existing methods neglect stylistic modeling and mostly rely on static thresholds, which greatly limits the detection performance. In this paper, we propose the Mixture of Stylistic Experts (MoSEs) framework that enables stylistics-aware uncertainty quantification through conditional threshold estimation. MoSEs contain three core components, namely, the Stylistics Reference Repository (SRR), the Stylistics-Aware Router (SAR), and the Conditional Threshold Estimator (CTE). For input text, SAR can activate the appropriate reference data in SRR and provide them to CTE. Subsequently, CTE jointly models the linguistic statistical properties and semantic features to dynamically determine the optimal threshold. With a discrimination score, MoSEs yields prediction labels with the corresponding confidence level. Our framework achieves an average improvement of 11.34% in detection performance compared to baselines. More inspiringly, MoSEs shows a more evident improvement of 39.15% in the low-resource case. Our code is available at this https URL.

[NLP-13] GRAM-R2: Self-Training Generative Foundation Reward Models for Reward Reasoning

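[Quick Read]: This paper addresses the reliance of reward models in generative AI on large-scale labeled preference data, and in particular the failure of existing methods to instill explicit reasoning from unlabeled data. The key to the solution is a self-training framework that uses large amounts of unlabeled data to elicit reasoning in reward models, yielding GRAM-R², a generative reward model that outputs both preference labels and the accompanying reward rationales. As a foundation model for reward reasoning, it performs strongly on downstream tasks such as response ranking, task adaptation, and reinforcement learning from human feedback, with minimal or no additional fine-tuning.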

Link: https://arxiv.org/abs/2509.02492
Authors: Chenglong Wang, Yongyu Mu, Hang Zhou, Yifu Huo, Ziming Zhu, Jiali Zeng, Murun Yang, Bei Li, Tong Xiao, Xiaoyang Hao, Chunliang Zhang, Fandong Meng, Jingbo Zhu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract: Significant progress in reward modeling over recent years has been driven by a paradigm shift from task-specific designs towards generalist reward models. Despite this trend, developing effective reward models remains a fundamental challenge: the heavy reliance on large-scale labeled preference data. Pre-training on abundant unlabeled data offers a promising direction, but existing approaches fall short of instilling explicit reasoning into reward models. To bridge this gap, we propose a self-training approach that leverages unlabeled data to elicit reward reasoning in reward models. Based on this approach, we develop GRAM-R², a generative reward model trained to produce not only preference labels but also accompanying reward rationales. GRAM-R² can serve as a foundation model for reward reasoning and can be applied to a wide range of tasks with minimal or no additional fine-tuning. It can support downstream applications such as response ranking and task-specific reward tuning. Experiments on response ranking, task adaptation, and reinforcement learning from human feedback demonstrate that GRAM-R² consistently delivers strong performance, outperforming several strong discriminative and generative baselines.
zh

[NLP-14] SpecEval: Evaluating Model Adherence to Behavior Specifications

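[Quick Read]: This paper asks whether foundation models, in actual operation, follow the behavioral specifications published by their developers. Although companies such as OpenAI, Anthropic, and Google have published detailed specifications covering safety constraints and qualitative traits, no systematic audit has verified whether model outputs actually conform to them. The key to the solution is an automated framework that parses behavioral statements, generates targeted prompts, and uses models as judges to test three-way consistency among the developer's specification, the model's outputs, and the developer's own judge models. This establishes a necessary baseline: a foundation model should consistently satisfy its behavioral specification when judged by its developer's evaluator models, which gives a quantifiable standard for assessing compliance.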

Link: https://arxiv.org/abs/2509.02464
Authors: Ahmed Ahmed,Kevin Klyman,Yi Zeng,Sanmi Koyejo,Percy Liang
Affiliations: Stanford University; Virginia Tech
Categories: Computation and Language (cs.CL)
Comments:

Abstract: Companies that develop foundation models publish behavioral guidelines they pledge their models will follow, but it remains unclear if models actually do so. While providers such as OpenAI, Anthropic, and Google have published detailed specifications describing both desired safety constraints and qualitative traits for their models, there has been no systematic audit of adherence to these guidelines. We introduce an automated framework that audits models against their providers' specifications by parsing behavioral statements, generating targeted prompts, and using models to judge adherence. Our central focus is on three-way consistency between a provider's specification, its model outputs, and its own models as judges, an extension of prior two-way generator-validator consistency. This establishes a necessary baseline: at minimum, a foundation model should consistently satisfy the developer's behavioral specifications when judged by the developer's evaluator models. We apply our framework to 16 models from six developers across more than 100 behavioral statements, finding systematic inconsistencies including compliance gaps of up to 20 percent across providers.

[NLP-15] Do LLMs Adhere to Label Definitions? Examining Their Receptivity to External Label Definitions EMNLP2025

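[Quick Read]: This paper asks whether large language models (LLMs) genuinely incorporate externally provided label definitions or rely mainly on their parametric knowledge. To answer this, the authors run controlled experiments on multiple explanation benchmark datasets (general and domain-specific) under different label definition conditions (expert-curated, LLM-generated, perturbed, and swapped definitions). The key to the approach is to systematically manipulate the source and quality of the external definitions, measure the effect on accuracy and explainability, and so reveal how far models integrate external knowledge in different task settings. The results show that explicit definitions can improve accuracy and explainability, but models often default to their internal representations, particularly in general tasks, whereas domain-specific tasks benefit more from explicit definitions. This underscores the need to understand how LLMs combine external knowledge with their pre-existing capabilities.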

Link: https://arxiv.org/abs/2509.02452
Authors: Seyedali Mohammadi,Bhaskara Hanuma Vedula,Hemank Lamba,Edward Raff,Ponnurangam Kumaraguru,Francis Ferraro,Manas Gaur
Affiliations: UMBC; IIIT Hyderabad; Dataminr, Inc.; CrowdStrike
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: To appear in EMNLP 2025, Main Conference

Abstract:Do LLMs genuinely incorporate external definitions, or do they primarily rely on their parametric knowledge? To address these questions, we conduct controlled experiments across multiple explanation benchmark datasets (general and domain-specific) and label definition conditions, including expert-curated, LLM-generated, perturbed, and swapped definitions. Our results reveal that while explicit label definitions can enhance accuracy and explainability, their integration into an LLM’s task-solving processes is neither guaranteed nor consistent, suggesting reliance on internalized representations in many cases. Models often default to their internal representations, particularly in general tasks, whereas domain-specific tasks benefit more from explicit definitions. These findings underscore the need for a deeper understanding of how LLMs process external knowledge alongside their pre-existing capabilities.

[NLP-16] EmoPerso: Enhancing Personality Detection with Self-Supervised Emotion-Aware Modelling

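[Quick Read]: This paper addresses two core problems: existing text-based personality detection methods depend heavily on large-scale annotated data, which makes high-quality personality labels hard to obtain, and most studies treat emotion and personality as independent variables, overlooking their fine-grained interactions. The key to the solution is EmoPerso, a novel self-supervised framework. It uses generative mechanisms for synthetic data augmentation and rich representation learning, jointly optimizes pseudo-labeled emotion features with personality prediction through multi-task learning, and adds a cross-attention module to capture fine-grained interactions between personality traits and the inferred emotional representations. A self-taught strategy then iteratively improves the model's relational reasoning. On two benchmark datasets, EmoPerso surpasses current state-of-the-art models.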

Link: https://arxiv.org/abs/2509.02450
Authors: Lingzhi Shen,Xiaohao Cai,Yunfei Long,Imran Razzak,Guanming Chen,Shoaib Jameel
Affiliations: University of Southampton; Queen Mary University of London; Mohamed bin Zayed University of Artificial Intelligence
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Personality detection from text is commonly performed by analysing users’ social media posts. However, existing methods heavily rely on large-scale annotated datasets, making it challenging to obtain high-quality personality labels. Moreover, most studies treat emotion and personality as independent variables, overlooking their interactions. In this paper, we propose a novel self-supervised framework, EmoPerso, which improves personality detection through emotion-aware modelling. EmoPerso first leverages generative mechanisms for synthetic data augmentation and rich representation learning. It then extracts pseudo-labeled emotion features and jointly optimizes them with personality prediction via multi-task learning. A cross-attention module is employed to capture fine-grained interactions between personality traits and the inferred emotional representations. To further refine relational reasoning, EmoPerso adopts a self-taught strategy to enhance the model’s reasoning capabilities iteratively. Extensive experiments on two benchmark datasets demonstrate that EmoPerso surpasses state-of-the-art models. The source code is available at this https URL.

[NLP-17] An Ensemble Classification Approach in A Multi-Layered Large Language Model Framework for Disease Prediction

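[Quick Read]: This paper addresses the accuracy of disease classification on Arabic social telehealth data. Such data usually comes from symptom descriptions that patients post on social media and online health platforms, and is unstructured, linguistically diverse, and noisy. The key to the solution is to combine LLM-driven text preprocessing (summarization, text refinement, and named entity recognition) with fine-tuned Arabic Transformer models (CAMeLBERT, AraBERT, and AsafayaBERT), and to merge the predictions made on the original and the preprocessed text through a majority voting ensemble. This improves the models' handling of complex medical text and reaches a best classification accuracy of 80.56%.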

Link: https://arxiv.org/abs/2509.02446
Authors: Ali Hamdi,Malak Mohamed,Rokaia Emad,Khaled Shaban
Affiliations: MSA University; Qatar University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract: Social telehealth has made remarkable progress in healthcare by allowing patients to post symptoms and participate in medical consultations remotely. Users frequently post symptoms on social media and online health platforms, creating a huge repository of medical data that can be leveraged for disease classification. Large language models (LLMs) such as LLAMA3 and GPT-3.5, along with transformer-based models like BERT, have demonstrated strong capabilities in processing complex medical text. In this study, we evaluate three Arabic medical text preprocessing methods (summarization, refinement, and Named Entity Recognition (NER)) before applying fine-tuned Arabic transformer models (CAMeLBERT, AraBERT, and AsafayaBERT). To enhance robustness, we adopt a majority voting ensemble that combines predictions from original and preprocessed text representations. This approach achieved the best classification accuracy of 80.56%, thus showing its effectiveness in leveraging various text representations and model predictions to improve the understanding of medical texts. To the best of our knowledge, this is the first work that integrates LLM-based preprocessing with fine-tuned Arabic transformer models and ensemble learning for disease classification in Arabic social telehealth data.
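The majority voting ensemble mentioned in the abstract can be illustrated with a minimal Python sketch. The labels, the number of voters, and the tie-breaking rule (first label seen wins) are assumptions made for illustration; the abstract does not say how the paper breaks ties or how many predictions are combined.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label; ties go to the label that appears first."""
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical predictions for one post, from models run on the original
# text and on summarized, refined, and NER-processed versions of it.
votes = ["flu", "cold", "flu", "flu", "allergy", "cold"]
print(majority_vote(votes))  # flu
```

A weighted vote, for example by validation accuracy per model, would be a natural variation, but nothing in the abstract indicates that the paper uses one.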

[NLP-18] AppCopilot: Toward General, Accurate, Long-Horizon, and Efficient Mobile Agent

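[Quick Read]: This paper addresses four core challenges that mobile agents face in practice: weak generalization across tasks, modalities, apps, and devices; the accuracy of on-screen interaction and click targeting; sustained execution of long-horizon, multi-step goals; and runtime efficiency on resource-constrained devices. The key to the solution is AppCopilot, a general-purpose on-device assistant built as a full-stack, closed-loop system that covers data collection, training, deployment, efficient inference, and mobile application development. Its design has three layers: the model layer integrates multimodal foundation models with Chinese-English support; the reasoning and control layer combines chain-of-thought reasoning, hierarchical task planning, and multi-agent collaboration; the execution layer provides user personalization, voice interaction, function calling, cross-app and cross-device orchestration, and comprehensive mobile app support. Profiling-driven optimization targets latency, memory, and energy use across heterogeneous hardware, and the authors report significant improvements along all four dimensions.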

Link: https://arxiv.org/abs/2509.02444
Authors: Jingru Fan,Yufan Dang,Jingyao Wu,Huatao Li,Runde Yang,Xiyuan Yang,Yuheng Wang,Zhong Zhang,Yaxi Lu,Yankai Lin,Zhiyuan Liu,Dahai Li,Chen Qian
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: Project at this https URL

Abstract: With the rapid evolution of large language models and multimodal foundation models, the mobile-agent landscape has proliferated without converging on the fundamental challenges. This paper identifies four core problems that must be solved for mobile agents to deliver practical, scalable impact: (1) generalization across tasks, modalities, apps, and devices; (2) accuracy, specifically precise on-screen interaction and click targeting; (3) long-horizon capability for sustained, multi-step goals; and (4) efficiency, specifically high-performance runtime on resource-constrained devices. We present AppCopilot, a multimodal, multi-agent, general-purpose on-device assistant that operates across applications and constitutes a full-stack, closed-loop system from data to deployment. AppCopilot operationalizes this position through an end-to-end autonomous pipeline spanning data collection, training, deployment, high-quality and efficient inference, and mobile application development. At the model layer, it integrates multimodal foundation models with robust Chinese-English support. At the reasoning and control layer, it combines chain-of-thought reasoning, hierarchical task planning and decomposition, and multi-agent collaboration. At the execution layer, it enables user personalization and experiential adaptation, voice interaction, function calling, cross-app and cross-device orchestration, and comprehensive mobile app support. The system design incorporates profiling-driven optimization for latency, memory, and energy across heterogeneous hardware. Empirically, AppCopilot achieves significant improvements along all four dimensions: stronger generalization, higher-precision on-screen actions, more reliable long-horizon task completion, and faster, more resource-efficient runtime.

[NLP-19] Evaluating Cumulative Spectral Gradient as a Complexity Measure

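[Quick Read]: This paper addresses how accurately dataset complexity can be estimated for knowledge graphs (KGs), and in particular the limitations of existing complexity measures for link prediction. The key to the approach is a systematic validation of the Cumulative Spectral Gradient (CSG) metric. By analyzing how two core parameters affect CSG (M, the number of sampled points per class, and K, the number of nearest neighbors in the embedding space), the authors find that the originally claimed properties, namely that CSG "naturally scales with the number of classes" and "correlates strongly with downstream classification performance", do not hold in practical link prediction settings. This exposes the instability and low predictive validity of CSG for KG link prediction and points to the need for more robust, classifier-agnostic complexity measures.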

Link: https://arxiv.org/abs/2509.02399
Authors: Haji Gul,Abdul Ghani Naim,Ajaz Ahmad Bhat
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract: Accurate estimation of dataset complexity is crucial for evaluating and comparing link prediction models for knowledge graphs (KGs). The Cumulative Spectral Gradient (CSG) metric, derived from probabilistic divergence between classes within a spectral clustering framework, was proposed as a dataset complexity measure that (1) naturally scales with the number of classes and (2) correlates strongly with downstream classification performance. In this work, we rigorously assess CSG behavior on standard knowledge graph link prediction benchmarks, a multi-class tail prediction task, using two key parameters governing its computation: M, the number of Monte Carlo sampled points per class, and K, the number of nearest neighbors in the embedding space. Contrary to the original claims, we find that (1) CSG is highly sensitive to the choice of K and therefore does not inherently scale with the number of target classes, and (2) CSG values exhibit weak or no correlation with established performance metrics such as mean reciprocal rank (MRR). Through experiments on FB15k-237, WN18RR, and other standard datasets, we demonstrate that CSG's purported stability and generalization predictive power break down in link prediction settings. Our results highlight the need for more robust, classifier-agnostic complexity measures in KG link prediction evaluation.
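The abstract compares CSG against mean reciprocal rank (MRR), the standard link prediction metric. As a reminder of what that metric computes, here is a minimal Python sketch; the example ranks are made up for illustration and are not results from the paper.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank, where each rank is the 1-based position of the
    correct tail entity in the model's ranking for one test triple."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical ranks for three test triples.
example_ranks = [1, 2, 4]
mrr = mean_reciprocal_rank(example_ranks)  # (1 + 0.5 + 0.25) / 3
print(round(mrr, 4))
```

MRR lies in (0, 1], and higher is better, so a complexity measure that tracked difficulty well would be expected to correlate negatively with it across datasets.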

[NLP-20] Towards Temporal Knowledge-Base Creation for Fine-Grained Opinion Analysis with Language Models

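[Quick Read]: This paper addresses the fact that existing time-series opinion analysis of text falls short of its potential because fine-grained, temporally grounded annotations are missing. The key to the solution is a scalable automatic annotation pipeline based on large language models (LLMs) that integrates well-established sentiment and opinion mining formulations into a declarative annotation framework, enabling structured opinion extraction without manual prompt engineering. The authors define three data models grounded in the sentiment and opinion mining literature as schemas for structured representation, carry out the final annotation with two separate LLMs, and compute label-wise inter-annotator agreement. The result is a time-aligned, structured opinion knowledge base suited to downstream tasks such as Retrieval-Augmented Generation (RAG), temporal question answering, and timeline summarisation.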

Link: https://arxiv.org/abs/2509.02363
Authors: Gaurav Negi,Atul Kr. Ojha,Omnia Zayed,Paul Buitelaar
Affiliations: Insight SFI Research Ireland Centre for Data Analytics, University of Galway
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We propose a scalable method for constructing a temporal opinion knowledge base with large language models (LLMs) as automated annotators. Despite the demonstrated utility of time-series opinion analysis of text for downstream applications such as forecasting and trend analysis, existing methodologies underexploit this potential due to the absence of temporally grounded fine-grained annotations. Our approach addresses this gap by integrating well-established opinion mining formulations into a declarative LLM annotation pipeline, enabling structured opinion extraction without manual prompt engineering. We define three data models grounded in sentiment and opinion mining literature, serving as schemas for structured representation. We perform rigorous quantitative evaluation of our pipeline using human-annotated test samples. We carry out the final annotations using two separate LLMs, and inter-annotator agreement is computed label-wise across the fine-grained opinion dimensions, analogous to human annotation protocols. The resulting knowledge base encapsulates time-aligned, structured opinions and is compatible with applications in Retrieval-Augmented Generation (RAG), temporal question answering, and timeline summarisation.

[NLP-21] Implicit Reasoning in Large Language Models: A Comprehensive Survey

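[Quick Read]: This paper addresses the lack of a systematic analysis of implicit reasoning mechanisms in large language models (LLMs). Prior work has discussed the latent representations involved in implicit reasoning, but a mechanism-level understanding of the internal computation is still missing. The key contribution is a taxonomy centered on execution paradigms that groups existing methods into three classes: latent optimization, signal-guided control, and layer-recurrent execution. This reframes implicit reasoning in terms of how and where internal computation unfolds, and supports a more structured understanding and evaluation of reasoning inside LLMs.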

Link: https://arxiv.org/abs/2509.02350
Authors: Jindong Li,Yali Fu,Li Fan,Jiahong Liu,Yao Shu,Chengwei Qin,Menglin Yang,Irwin King,Rex Ying
Affiliations: Hong Kong University of Science and Technology (Guangzhou); Jilin University; The Chinese University of Hong Kong; Yale University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: Large Language Models (LLMs) have demonstrated strong generalization across a wide range of tasks. Reasoning with LLMs is central to solving multi-step problems and complex decision-making. To support efficient reasoning, recent studies have shifted attention from explicit chain-of-thought prompting toward implicit reasoning, where reasoning occurs silently via latent structures without emitting intermediate textual steps. Implicit reasoning brings advantages such as lower generation cost, faster inference, and better alignment with internal computation. Although prior surveys have discussed latent representations in the context of reasoning, a dedicated and mechanism-level examination of how reasoning unfolds internally within LLMs remains absent. This survey fills that gap by introducing a taxonomy centered on execution paradigms, shifting the focus from representational forms to computational strategies. We organize existing methods into three execution paradigms based on how and where internal computation unfolds: latent optimization, signal-guided control, and layer-recurrent execution. We also review structural, behavioral and representation-based evidence that supports the presence of implicit reasoning in LLMs. We further provide a structured overview of the evaluation metrics and benchmarks used in existing works to assess the effectiveness and reliability of implicit reasoning. We maintain a continuously updated project at: this https URL.

[NLP-22] DCPO: Dynamic Clipping Policy Optimization

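[Quick Read]: This paper addresses the zero-gradient problem in Reinforcement Learning from Verifiable Rewards (RLVR), which arises from fixed clipping bounds and the standardization of identical rewards, with the aim of improving large language models on reasoning tasks. The core solution is Dynamic Clipping Policy Optimization (DCPO). Its two key elements are a dynamic clipping mechanism that adaptively adjusts the clipping bounds based on token-level prior probabilities, which strengthens token-level exploration, and a smooth standardization technique that standardizes rewards across cumulative training steps, which raises the effective utilization of generated responses at the response level. The method improves gradient update efficiency and training stability and outperforms existing methods such as GRPO and DAPO on several benchmarks.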

Link: https://arxiv.org/abs/2509.02333
Authors: Shihui Yang,Chengfeng Dou,Peidong Guo,Kai Lu,Qiang Ju,Fei Deng,Rihui Xin
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a promising framework for enhancing the reasoning capabilities of large language models. However, existing approaches such as GRPO often suffer from zero gradients. This problem arises primarily due to fixed clipping bounds for token-level probability ratios and the standardization of identical rewards, which can lead to ineffective gradient updates and underutilization of generated responses. In this work, we propose Dynamic Clipping Policy Optimization (DCPO), which introduces a dynamic clipping strategy that adaptively adjusts the clipping bounds based on token-specific prior probabilities to enhance token-level exploration, and a smooth advantage standardization technique that standardizes rewards across cumulative training steps to improve the response-level effective utilization of generated responses. DCPO achieved state-of-the-art performance on four benchmarks based on four different models. In particular, DCPO achieved an Avg@1 of 46.7 under greedy decoding and an Avg@32 of 38.8 under 32 times sampling on the AIME24 benchmark, surpassing both DAPO (36.7/31.6) and GRPO (36.7/32.1) on the Qwen2.5-Math-7B model. On the AIME25 benchmark based on Qwen2.5-14B, DCPO achieves a performance of (23.3/19.0), surpassing GRPO (13.3/10.5) and DAPO (20.0/15.3). Furthermore, DCPO achieved an average 28% improvement in the nonzero advantage over GRPO in four models, doubled the training efficiency over DAPO, and significantly reduced the token clipping ratio by an order of magnitude compared to both GRPO and DAPO, while achieving superior performance. These results highlight DCPO’s effectiveness in leveraging generated data more efficiently for reinforcement learning in large language models.
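The abstract names the standardization of identical rewards as one source of zero gradients. The following minimal Python sketch shows why, using group-wise reward standardization as it is commonly described for GRPO-style methods. The use of the population standard deviation and the `eps` constant are assumptions; this is not DCPO's smooth standardization, whose exact form the abstract does not give.

```python
from statistics import mean, pstdev


def group_standardized_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps (an assumed constant) avoids division by zero when sigma is 0.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes give informative, signed advantages.
mixed = group_standardized_advantages([1.0, 0.0, 0.0, 1.0])

# If every response gets the same reward, every advantage is exactly zero,
# so the whole group contributes no policy gradient.
identical = group_standardized_advantages([1.0, 1.0, 1.0, 1.0])
print(identical)  # [0.0, 0.0, 0.0, 0.0]
```

Standardizing against statistics accumulated over earlier training steps, as the abstract describes, is one way to keep such groups informative, since the reference mean then need not equal the current group's reward.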

[NLP-23] LLMs and their Limited Theory of Mind: Evaluating Mental State Annotations in Situated Dialogue

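[Quick Read]: This paper addresses the identification of shared mental models (SMMs) in team dialogue and the detection of discrepancies in them, that is, how large language models (LLMs) can be used to track whether team members share a consistent understanding of the task and to expose cognitive divergences. The key to the solution is a two-step framework. In the first step, an LLM acts as a human-style annotator and extracts SMM elements from task-oriented dialogues. In the second step, another LLM compares these annotations and human annotations against gold-standard labels to detect and characterize divergences among individuals' mental states. The work yields a reproducible evaluation framework for SMM coherence and shows empirically that LLMs err systematically in scenarios that require spatial reasoning or disambiguation of prosodic cues.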

Link: https://arxiv.org/abs/2509.02292
Authors: Katharine Kowalyshyn,Matthias Scheutz
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:What if large language models could not only infer human mindsets but also expose every blind spot in team dialogue such as discrepancies in the team members’ joint understanding? We present a novel, two-step framework that leverages large language models (LLMs) both as human-style annotators of team dialogues to track the team’s shared mental models (SMMs) and as automated discrepancy detectors among individuals’ mental states. In the first step, an LLM generates annotations by identifying SMM elements within task-oriented dialogues from the Cooperative Remote Search Task (CReST) corpus. Then, a secondary LLM compares these LLM-derived annotations and human annotations against gold-standard labels to detect and characterize divergences. We define an SMM coherence evaluation framework for this use case and apply it to six CReST dialogues, ultimately producing: (1) a dataset of human and LLM annotations; (2) a reproducible evaluation framework for SMM coherence; and (3) an empirical assessment of LLM-based discrepancy detection. Our results reveal that, although LLMs exhibit apparent coherence on straightforward natural-language annotation tasks, they systematically err in scenarios requiring spatial reasoning or disambiguation of prosodic cues.

[NLP-24] Spectrogram Patch Codec: A 2D Block-Quantized VQ-VAE and HiFi-GAN for Neural Speech Coding

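[Quick Read]: This paper addresses the architectural complexity and latency that come from the stacked residual vector quantization (RVQ) used in current neural speech codecs. The key to the solution is a single-stage quantization approach that operates directly on the mel-spectrogram and maps non-overlapping 4×4 patches into a single shared codebook, producing a discrete latent grid. This design simplifies the model and supports low-latency streaming. Combined with late-stage adversarial fine-tuning and a HiFi-GAN vocoder trained from scratch, it achieves perceptual quality and intelligibility competitive with state-of-the-art codecs at approximately 7.5 kbits/s.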

Link: https://arxiv.org/abs/2509.02244
Authors: Luis Felipe Chary,Miguel Arjona Ramirez
Affiliations: Universidade de São Paulo
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:

Abstract: We present a neural speech codec that challenges the need for complex residual vector quantization (RVQ) stacks by introducing a simpler, single-stage quantization approach. Our method operates directly on the mel-spectrogram, treating it as 2D data and quantizing non-overlapping 4x4 patches into a single, shared codebook. This patchwise design simplifies the architecture, enables low-latency streaming, and yields a discrete latent grid. To ensure high-fidelity synthesis, we employ late-stage adversarial fine-tuning for the VQ-VAE and train a HiFi-GAN vocoder from scratch on the codec's reconstructed spectrograms. Operating at approximately 7.5 kbits/s for 16 kHz speech, our system was evaluated against several state-of-the-art neural codecs using objective metrics such as STOI, PESQ, MCD, and ViSQOL. The results demonstrate that our simplified, non-residual architecture achieves competitive perceptual quality and intelligibility, validating it as an effective and open foundation for future low-latency codec designs.
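The patchwise quantization idea can be sketched in a few lines of plain Python. This is only an illustration under strong assumptions: it quantizes raw spectrogram values with a hand-written two-entry codebook, whereas the paper's VQ-VAE learns its encoder and codebook, and the abstract does not state the codebook size. The names `split_into_patches` and `nearest_code` are hypothetical.

```python
def split_into_patches(spec, size=4):
    """Split a 2D list (mel bins x frames) into flattened, non-overlapping
    size x size patches, scanning row blocks first and then column blocks."""
    rows, cols = len(spec), len(spec[0])
    if rows % size or cols % size:
        raise ValueError("both dimensions must be divisible by the patch size")
    patches = []
    for r in range(0, rows, size):
        for c in range(0, cols, size):
            patches.append(
                [spec[r + i][c + j] for i in range(size) for j in range(size)]
            )
    return patches


def nearest_code(patch, codebook):
    """Index of the codebook vector closest to the patch (squared L2 distance)."""
    def dist(code):
        return sum((p - q) ** 2 for p, q in zip(patch, code))
    return min(range(len(codebook)), key=lambda k: dist(codebook[k]))


# Toy 8x8 "spectrogram": left half quiet (0.0), right half loud (1.0).
spec = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
# Toy shared codebook with two 16-dimensional entries (made up for this example).
codebook = [[0.0] * 16, [1.0] * 16]

indices = [nearest_code(p, codebook) for p in split_into_patches(spec)]
print(indices)  # [0, 1, 0, 1]
```

Each patch is represented by one index into the same codebook, so the output is a 2×2 grid of discrete codes for this toy input, which is the "discrete latent grid" the abstract refers to.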

[NLP-25] Towards Fundamental Language Models: Does Linguistic Competence Scale with Model Size?

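[Quick Read]: This paper addresses inherent limitations of large language models (LLMs), including hallucinations, biases, privacy concerns, and high computational costs, which stem largely from coupling linguistic competence and factual memorization in a single model. The key to the solution is the Fundamental Language Model (FLM) paradigm, which the paper introduces and supports empirically: use smaller, linguistically competent models and offload factual retrieval to external tools, decoupling language ability from knowledge access. This modular design aims to make systems more efficient, interpretable, and sustainable.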

Link: https://arxiv.org/abs/2509.02225
Authors: Jaime Collado-Montañez,L. Alfonso Ureña-López,Arturo Montejo-Ráez
Affiliations: University of Jaén
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 2 figures

Abstract:Large Language Models offer impressive language capabilities but suffer from well-known limitations, including hallucinations, biases, privacy concerns, and high computational costs. These issues are largely driven by the combination of linguistic competence and factual memorization within a single monolithic model. This paper introduces and empirically supports the Fundamental Language Model (FLM) paradigm, which advocates for smaller, linguistically competent models that offload factual retrieval to external tools. We evaluate models ranging from 135M to 32B parameters across three dimensions: linguistic competence, external factual knowledge, and internal factual knowledge. Our findings reveal that while both linguistic competence and factual knowledge improve with scale, internal factual knowledge grows significantly faster, suggesting that model size is more closely tied to memorization than to core language ability. These results support a modular approach to language modeling, where compact, linguistically proficient models serve as the foundation for tool-augmented systems. The FLM paradigm offers a path toward more efficient, interpretable, and sustainable NLP solutions.

[NLP-26] FActBench: A Benchmark for Fine-grained Automatic Evaluation of LLM-Generated Text in the Medical Domain

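[Quick Read]: This paper addresses factuality errors that large language models (LLMs) make in the medical domain for lack of specialized knowledge, in particular hallucinations in generated content. The key to the solution is FActBench, a medical fact-checking benchmark that covers four generation tasks and six state-of-the-art LLMs. It applies two fact-checking techniques, Chain-of-Thought (CoT) prompting and Natural Language Inference (NLI), and combines them by Unanimous Voting; the resulting fact-checking scores correlate best with domain expert evaluation.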

Link: https://arxiv.org/abs/2509.02198
Authors: Anum Afzal,Juraj Vladika,Florian Matthes
Affiliations: Technical University of Munich
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models tend to struggle when dealing with specialized domains. While all aspects of evaluation hold importance, factuality is the most critical one. Similarly, reliable fact-checking tools and data sources are essential for hallucination mitigation. We address these issues by providing a comprehensive Fact-checking Benchmark FActBench covering four generation tasks and six state-of-the-art Large Language Models (LLMs) for the Medical domain. We use two state-of-the-art Fact-checking techniques: Chain-of-Thought (CoT) Prompting and Natural Language Inference (NLI). Our experiments show that the fact-checking scores acquired through the Unanimous Voting of both techniques correlate best with Domain Expert Evaluation.
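One plausible reading of Unanimous Voting is that a generated fact counts as supported only when both checkers agree that it is. The Python sketch below follows that reading; it is an assumption, since the abstract does not define the voting rule or the score, and the verdict lists are invented for illustration.

```python
def unanimous_vote(cot_supported, nli_supported):
    """A fact is accepted only if both checkers mark it as supported."""
    return cot_supported and nli_supported


def factuality_score(cot_verdicts, nli_verdicts):
    """Fraction of facts that both checkers accept (hypothetical score)."""
    if not cot_verdicts or len(cot_verdicts) != len(nli_verdicts):
        raise ValueError("verdict lists must be non-empty and of equal length")
    agreed = [unanimous_vote(c, n) for c, n in zip(cot_verdicts, nli_verdicts)]
    return sum(agreed) / len(agreed)


# Hypothetical verdicts for four atomic facts extracted from one generated text.
cot = [True, True, False, True]
nli = [True, False, False, True]
print(factuality_score(cot, nli))  # 0.5
```

Requiring agreement makes the combined checker stricter than either technique alone, trading recall of supported facts for precision.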

[NLP-27] Understanding Space Is Rocket Science - Only Top Reasoning Models Can Solve Spatial Understanding Tasks

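[Quick Read]: This paper addresses the marked weakness of current vision-language models (VLMs) in understanding spatial relations, especially the relative position and order of objects in real-world images. The authors propose RocketScience, a new open-source contrastive VLM benchmark built entirely from new real-world image-text pairs and focused on spatial relation understanding. It is designed to be easy for humans and hard for current open-source and commercial VLMs, and this is verified empirically. A disentanglement analysis of chain-of-thought-based models further shows that performance is bottlenecked by spatial reasoning rather than object localization, which gives direction for future model improvements.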

Link: https://arxiv.org/abs/2509.02175
Authors: Nils Hoehing,Mayug Maniparambil,Ellen Rushe,Noel E. O’Connor,Anthony Ventresque
Affiliations: University College Dublin; Dublin City University; Trinity College Dublin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract: We propose RocketScience, an open-source contrastive VLM benchmark that tests for spatial relation understanding. It is comprised of entirely new real-world image-text pairs covering mostly relative spatial understanding and the order of objects. The benchmark is designed to be very easy for humans and hard for the current generation of VLMs, and this is empirically verified. Our results show a striking lack of spatial relation understanding in open source and frontier commercial VLMs and a surprisingly high performance of reasoning models. Additionally, we perform a disentanglement analysis to separate the contributions of object localization and spatial reasoning in chain-of-thought-based models and find that the performance on the benchmark is bottlenecked by spatial reasoning and not object localization capabilities. We release the dataset with a CC-BY-4.0 license and make the evaluation code available at: this https URL

[NLP-28] Avoidance Decoding for Diverse Multi-Branch Story Generation

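[Quick Read]: This paper addresses the repetitive, monotonous outputs that large language models (LLMs) produce in generation tasks such as story generation, with the aim of increasing creative diversity. The key to the solution is a new decoding strategy, Avoidance Decoding, which penalizes tokens that resemble previously generated outputs and so steers the model toward more diverse multi-branch stories. The strategy adaptively balances two penalties: a Concept-level Similarity Penalty, prioritized in early stages to diversify initial story concepts, and a Narrative-level Similarity Penalty, increasingly emphasized later to keep plot development natural yet diverse. The reported results are up to 2.6 times higher output diversity, 30% less repetition on average, and reduced text degeneration; the method also activates a broader range of neurons, which the authors take as evidence that it draws on the model's intrinsic creativity.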

Link: https://arxiv.org/abs/2509.02170
Authors: Kyeongman Park,Nakyeong Yang,Kyomin Jung
Affiliations: Seoul National University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) often generate repetitive and monotonous outputs, especially in tasks like story generation, due to limited creative diversity when given the same input prompt. To address this challenge, we propose a novel decoding strategy, Avoidance Decoding, that modifies token logits by penalizing similarity to previously generated outputs, thereby encouraging more diverse multi-branch stories. This penalty adaptively balances two similarity measures: (1) Concept-level Similarity Penalty, which is prioritized in early stages to diversify initial story concepts, and (2) Narrative-level Similarity Penalty, which is increasingly emphasized later to ensure natural yet diverse plot development. Notably, our method achieves up to 2.6 times higher output diversity and reduces repetition by an average of 30% compared to strong baselines, while effectively mitigating text degeneration. Furthermore, we reveal that our method activates a broader range of neurons, demonstrating that it leverages the model’s intrinsic creativity.
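To make the logit-penalty idea concrete, here is a minimal Python sketch. Everything specific in it is an assumption for illustration: the per-token similarity scores are taken as given inputs, the schedule that shifts weight from the concept-level to the narrative-level penalty is linear, and `strength` is a made-up scale. The abstract does not specify how the paper computes similarities or balances the two penalties.

```python
def avoidance_adjust(logits, concept_sim, narrative_sim, step, total_steps,
                     strength=1.0):
    """Subtract a similarity penalty from each candidate token's logit.

    concept_sim / narrative_sim: hypothetical per-token similarity scores
    (0 to 1) between choosing that token and previously generated stories.
    """
    # Assumed linear schedule: concept penalty dominates early,
    # narrative penalty dominates late.
    progress = step / max(total_steps - 1, 1)
    w_concept = 1.0 - progress
    w_narrative = progress
    return [
        logit - strength * (w_concept * c + w_narrative * n)
        for logit, c, n in zip(logits, concept_sim, narrative_sim)
    ]


logits = [2.0, 1.0]
concept_sim = [1.0, 0.0]    # token 0 repeats an earlier story concept
narrative_sim = [0.0, 1.0]  # token 1 repeats an earlier plot development

early = avoidance_adjust(logits, concept_sim, narrative_sim, step=0, total_steps=5)
late = avoidance_adjust(logits, concept_sim, narrative_sim, step=4, total_steps=5)
print(early)  # [1.0, 1.0]
print(late)   # [2.0, 0.0]
```

In the early step the concept-repeating token loses its lead, while in the late step only the narrative-repeating token is penalized, which mirrors the early-versus-late emphasis described in the abstract.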

[NLP-29] Meta-Pretraining for Zero-Shot Cross-Lingual Named Entity Recognition in Low-Resource Philippine Languages

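[Quick Read]: This paper addresses named-entity recognition (NER) in low-resource languages, in particular how small decoder language models can adapt quickly and transfer zero-shot to languages unseen during pretraining when compute is constrained. The core solution is to replace part of the conventional autoregressive objective with first-order model-agnostic meta-learning (MAML), improving generalization and convergence speed. On Tagalog and Cebuano, which are typologically similar yet structurally different, the method raises zero-shot micro-F1 by 2 to 6 percentage points under head-only tuning and by 1 to 3 points after full tuning across model sizes from 11M to 570M parameters, while also speeding up convergence.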

Link: https://arxiv.org/abs/2509.02160
Authors: David Demitri Africa,Suchir Salhan,Yuval Weiss,Paula Buttery,Richard Diehl Martinez
Affiliations: University of Cambridge
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Named-entity recognition (NER) in low-resource languages is usually tackled by finetuning very large multilingual LMs, an option that is often infeasible in memory- or latency-constrained settings. We ask whether small decoder LMs can be pretrained so that they adapt quickly and transfer zero-shot to languages unseen during pretraining. To this end we replace part of the autoregressive objective with first-order model-agnostic meta-learning (MAML). Tagalog and Cebuano are typologically similar yet structurally different in their actor/non-actor voice systems, and hence serve as a challenging test-bed. Across four model sizes (11 M - 570 M) MAML lifts zero-shot micro-F1 by 2-6 pp under head-only tuning and 1-3 pp after full tuning, while cutting convergence time by up to 8%. Gains are largest for single-token person entities that co-occur with Tagalog case particles si/ni, highlighting the importance of surface anchors.
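For readers unfamiliar with first-order MAML, the update can be shown on a toy problem. The sketch below uses a one-parameter linear model with squared error, one inner step, and made-up tasks and learning rates; it illustrates the general technique only and is not the paper's pretraining setup.

```python
def loss_grad(w, data):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2.0 * x * (w * x - y) for x, y in data) / len(data)


def fomaml_step(w, tasks, inner_lr=0.1, outer_lr=0.1):
    """One first-order MAML update over a batch of (support, query) tasks."""
    query_grads = []
    for support, query in tasks:
        # Inner loop: one gradient step on the task's support set.
        w_adapted = w - inner_lr * loss_grad(w, support)
        # First-order approximation: take the query gradient at the adapted
        # weight and ignore how w_adapted itself depends on w.
        query_grads.append(loss_grad(w_adapted, query))
    # Outer loop: move the shared initialization along the average gradient.
    return w - outer_lr * sum(query_grads) / len(query_grads)


# Two hypothetical tasks: y = 2x and y = 4x, each with one support and one
# query example.
tasks = [
    ([(1.0, 2.0)], [(1.0, 2.0)]),
    ([(1.0, 4.0)], [(1.0, 4.0)]),
]
w_new = fomaml_step(0.0, tasks)
print(round(w_new, 4))  # 0.48
```

Dropping the second-order term is what makes the method cheap enough to mix into pretraining: each task costs two ordinary gradient evaluations and no differentiation through the inner step.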

[NLP-30] AMBEDKAR-A Multi-level Bias Elimination through a Decoding Approach with Knowledge Augmentation for Robust Constitutional Alignment of Language Models

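[Quick Read]: This paper addresses how large language models (LLMs) amplify caste- and religion-related biases in the Indian context. These biases originate in social inequalities reflected in the training data, and most mainstream mitigation strategies are Western-centric and ill-suited to the local cultural and legal setting. The key to the solution is the AMBEDKAR framework, whose core innovation is a Constitution-Aware Decoding Layer that is active only at inference time. It is built on the principles of Articles 14 to 17 of the Indian Constitution and achieves its fairness intervention by reinterpreting speculative decoding: a Small Language Model (SLM) acts as a potentially biased generator, while a Large Language Model (LLM) acts as a constitutionally guided verifier that corrects the SLM's output trajectory to reduce casteist and communal bias. The method needs no parameter updates and so avoids retraining costs, and it yields an absolute bias reduction of up to 26.41%, giving rise to a "fairness-by-speculation" paradigm.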

Link: https://arxiv.org/abs/2509.02133
Authors: Snehasis Mukhopadhyay,Aryan Kasat,Shivam Dubey,Rahul Karthikeyan,Dhruv Sood,Vinija Jain,Aman Chadha,Amitava Das
Affiliations: Indian Institute of Information Technology, Kalyani; BITS Pilani Goa; IIT Madras; DTU; Artificial Intelligence Institute, University of South Carolina; Meta AI; Amazon GenAI
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) can inadvertently reflect societal biases present in their training data, leading to harmful or prejudiced outputs. In the Indian context, our empirical evaluations across a suite of models reveal that biases around caste and religion are particularly salient. Yet, most existing mitigation strategies are Western-centric and fail to address these local nuances. We propose AMBEDKAR, a framework inspired by the egalitarian vision of Dr B. R. Ambedkar, architect of the Indian Constitution, to guide LLM outputs toward fairness, neutrality, and inclusion in line with Articles 14 to 17. Our approach introduces a Constitution-Aware Decoding Layer, guided by the AI Constitution of India and applied only at inference time, without any parameter updates to the base model. We incorporate a speculative decoding algorithm that proactively reduces casteist and communal bias during generation. This mitigation layer operates directly within the decoding process, avoiding changes to model internals and lowering the computational and infrastructural costs associated with retraining. We reinterpret speculative decoding not merely as an efficiency tool but as a mechanism for fairness. In this framework, a Small Language Model (SLM) acts as a potentially biased generator, while a constitutionally guided Large Language Model (LLM) serves as the verifier. Rather than accelerating generation, the LLM enforces bias-robust trajectories in the SLM outputs. This inversion of roles gives rise to a fairness-by-speculation paradigm. Our approach yields an absolute reduction of bias up to 26.41 percent compared to baseline. Our source code, datasets, and results are available at this https URL

[NLP-31] CMRAG: Co-modality-based document retrieval and visual question answering

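[Quick read]: This paper addresses the limitations of existing Retrieval-Augmented Generation (RAG) methods on multimodal documents. One family of methods relies on layout analysis and text extraction, so it can use only explicit text and struggles to capture images or unstructured content. The other feeds document segments as visual input directly into Visual Language Models (VLMs), ignoring the semantic advantages of text and producing suboptimal generations. The key to the solution is co-modality-based RAG (CMRAG): documents are parsed structurally to obtain co-modality representations of text segments and image regions; at query time, candidate evidence is retrieved separately from the text and image channels; the retrieval results are aggregated at the cross-modal level; and the VLM is then prompted to answer from the fused co-modality results. The method models and exploits text and image information jointly, and significantly improves visual question answering (VQA) over complex documents.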

Link: https://arxiv.org/abs/2509.02123
Authors: Wang Chen, Guanqiang Qi, Weikang Li, Yang Li
Affiliations: Baidu Inc.; The University of Hong Kong; Peking University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Retrieval-Augmented Generation (RAG) has become a core paradigm in document question answering tasks. However, existing methods have limitations when dealing with multimodal documents: one category of methods relies on layout analysis and text extraction, which can only utilize explicit text information and struggle to capture images or unstructured content; the other category treats document segmentation as visual input and directly passes it to visual language models (VLMs) for processing, yet it ignores the semantic advantages of text, leading to suboptimal generation results. This paper proposes co-modality-based RAG (CMRAG), which can simultaneously leverage text and images for efficient retrieval and generation. Specifically, we first perform structured parsing on documents to obtain co-modality representations of text segments and image regions. Subsequently, in response to user queries, we retrieve candidate evidence from text and image channels, respectively, and aggregate the results at the cross-modal retrieval level. Finally, we prompt the VLM to generate the final response based on the co-modality retrieval results. Experiments demonstrate that our method significantly outperforms pure-vision-based RAG in visual document question answering tasks. The findings of this paper show that integrating co-modality information into the RAG framework in a unified manner is an effective approach to improving the performance of complex document visual question-answering (VQA) systems.

[NLP-32] E-THER: A PCT-Grounded Dataset for Benchmarking Empathic AI

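[Quick read]: This paper targets a common limitation of current empathic AI systems: they fail to recognize when verbal expressions are inconsistent with non-verbal cues such as facial expressions and body language (verbal-visual incongruence), so their empathy stays at the level of surface performance rather than a real understanding of the user's emotional state. The key to the solution is E-THER, the first multimodal dataset grounded in Person-Centered Therapy (PCT). It carries multidimensional annotations of verbal-visual emotional misalignment in counsellor-client interactions, plus engagement scores as behavioral annotations. Vision-language models (VLMs) such as IDEFICS and VideoLLAVA trained on it improve notably in empathic conversational quality, in therapeutic coherence, and in avoiding artificial or exaggerated language, which supports the claim that incongruence detection helps AI develop more genuine, PCT-consistent empathy.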

Link: https://arxiv.org/abs/2509.02100
Authors: Sharjeel Tahir, Judith Johnson, Jumana Abu-Khalaf, Syed Afaq Ali Shah
Affiliations: Edith Cowan University; University of Manchester
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: 15 pages, 4 figures. Preprint

Abstract:A prevalent shortfall among current empathic AI systems is their inability to recognize when verbal expressions may not fully reflect underlying emotional states. This is because the existing datasets used to train these systems focus on surface-level emotion recognition without addressing the complex verbal-visual incongruence (mismatch) patterns useful for empathic understanding. In this paper, we present E-THER, the first Person-Centered Therapy-grounded multimodal dataset with multidimensional annotations for verbal-visual incongruence detection, enabling training of AI systems that develop genuine rather than performative empathic capabilities. The annotations included in the dataset are drawn from the humanistic approach, i.e., identifying verbal-visual emotional misalignment in client-counsellor interactions, forming a framework for training and evaluating AI on empathy tasks. Additional engagement scores provide behavioral annotations for research applications. Notable gains in empathic and therapeutic conversational qualities are observed in state-of-the-art vision-language models (VLMs), such as IDEFICS and VideoLLAVA, using evaluation metrics grounded in empathic and therapeutic principles. Empirical findings indicate that our incongruence-trained models outperform general-purpose models in critical traits, such as sustaining therapeutic engagement, minimizing artificial or exaggerated linguistic patterns, and maintaining fidelity to the PCT theoretical framework.

[NLP-33] JudgeAgent: Dynamically Evaluate LLMs with Agent-as-Interviewer

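[Quick read]: This paper addresses three core problems in the current evaluation paradigm for Large Language Models (LLMs): limited interaction with the target, insufficient control over test difficulty, and difficulty verifying the validity of evaluation results. Together these make it hard to pin down the knowledge and capability boundaries of a target model. The key to the solution is JudgeAgent, a framework built on a new interviewer-style dynamic evaluation paradigm. It evaluates in three stages (benchmark grading, interactive extension, and evaluation feedback) and combines knowledge-driven data synthesis with target-adaptive difficulty adjustment to assess LLMs more accurately and effectively.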

Link: https://arxiv.org/abs/2509.02097
Authors: Zhichao Shi, Xuhui Jiang, Chengjin Xu, Cangli Yao, Zhenxin Huang, Shengjie Ma, Yinghan Shen, Yuanzhuo Wang
Affiliations: DataArc Tech Ltd.; IDEA Research, International Digital Economy Academy; School of Advanced Interdisciplinary Sciences, UCAS; State Key Lab of AI Safety, Institute of Computing Technology, CAS
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Evaluating the capabilities of large language models (LLMs) is an essential step to ensure the successful application of LLMs across various domains. The current evaluation of LLMs is based on a paradigm that involves querying them with predefined question sets and assessing their outputs. This paradigm offers controllable processes and simplicity, but faces challenges such as limited interaction with targets, insufficient difficulty control, and difficulties in verifying the validity of evaluation results, making it hard to precisely determine the knowledge and capability boundaries of target models. To address these challenges, we propose JudgeAgent, a knowledge-target adaptive dynamic evaluation framework based on a new interviewer-style evaluation paradigm. JudgeAgent employs a comprehensive evaluation approach consisting of benchmark grading, interactive extension, and evaluation feedback. It utilizes knowledge-driven data synthesis and target-adaptive difficulty adjustment methods to conduct extended testing, providing accurate and effective evaluation results. We also introduce a novel insight into validating evaluation methods, demonstrating the effectiveness of JudgeAgent and its dynamic evaluation paradigm through extensive experiments.

[NLP-34] Better by Comparison: Retrieval-Augmented Contrastive Reasoning for Automatic Prompt Optimization

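[Quick read]: This paper addresses a gap in Automatic Prompt Optimization: existing methods rely on direct prompt refinement or model fine-tuning and overlook the reasoning ability of Large Language Models (LLMs) themselves. The key to the solution is Contrastive Reasoning Prompt Optimization (CRPO), a framework that casts prompt optimization as a retrieval-augmented reasoning process. It retrieves the top k reference prompts from HelpSteer2, an open-source dataset annotated for helpfulness, correctness, coherence, complexity, and verbosity, and builds two complementary optimization paradigms: tiered contrastive reasoning and multi-metric contrastive reasoning. By explicitly contrasting high- and low-quality prompts, the LLM can infer why a prompt succeeds or fails, which yields more robust and interpretable optimization.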

Link: https://arxiv.org/abs/2509.02093
Authors: Juhyeon Lee, Wonduk Seo, Hyunjin An, Seunghyun Lee, Yi Bu
Affiliations: Peking University; Enhans
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Preprint

Abstract:Automatic prompt optimization has recently emerged as a strategy for improving the quality of prompts used in Large Language Models (LLMs), with the goal of generating more accurate and useful responses. However, most prior work focuses on direct prompt refinement or model fine-tuning, overlooking the potential of leveraging LLMs’ inherent reasoning capability to learn from contrasting examples. In this paper, we present Contrastive Reasoning Prompt Optimization (CRPO), a novel framework that formulates prompt optimization as a retrieval augmented reasoning process. Our approach retrieves top k reference prompts from the HelpSteer2 dataset, an open-source collection annotated for helpfulness, correctness, coherence, complexity, and verbosity, and constructs two complementary optimization paradigms: (1) tiered contrastive reasoning, where the LLM compares high, medium, and low quality prompts to refine its own generation through reflective reasoning, and (2) multi-metric contrastive reasoning, where the LLM analyzes the best prompts along each evaluation dimension and integrates their strengths into an optimized prompt. By explicitly contrasting high and low quality exemplars, CRPO enables the model to deduce why certain prompts succeed while others fail, thereby achieving more robust and interpretable optimization. Experimental results on the HelpSteer2 benchmark demonstrate that CRPO significantly outperforms baselines. Our findings highlight the promise of contrastive, retrieval-augmented reasoning for advancing automatic prompt optimization.

[NLP-35] From Attack Descriptions to Vulnerabilities: A Sentence Transformer-Based Approach

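[Quick read]: This paper addresses the inefficiency and slow response of manually mapping attacks to publicly disclosed vulnerabilities (CVEs), with the aim of improving detection of and response to software security incidents. The key to the solution is to evaluate 14 state-of-the-art sentence transformers for automatically identifying the vulnerabilities that correspond to textual attack descriptions. The multi-qa-mpnet-base-dot-v1 (MMPNet) model performs best when given attack Technique descriptions, with an F1-score of 89.0, and it surfaces 275 attack-vulnerability links not documented in the MITRE repositories, which shortens the window in which vulnerabilities stay exploitable and strengthens system security.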

Link: https://arxiv.org/abs/2509.02077
Authors: Refat Othman, Diaeddin Rimawi, Bruno Rossi, Barbara Russo
Affiliations: unknown
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted in The Journal of Systems and Software (2025)

Abstract:In the domain of security, vulnerabilities frequently remain undetected even after their exploitation. In this work, vulnerabilities refer to publicly disclosed flaws documented in Common Vulnerabilities and Exposures (CVE) reports. Establishing a connection between attacks and vulnerabilities is essential for enabling timely incident response, as it provides defenders with immediate, actionable insights. However, manually mapping attacks to CVEs is infeasible, thereby motivating the need for automation. This paper evaluates 14 state-of-the-art (SOTA) sentence transformers for automatically identifying vulnerabilities from textual descriptions of attacks. Our results demonstrate that the multi-qa-mpnet-base-dot-v1 (MMPNet) model achieves superior classification performance when using attack Technique descriptions, with an F1-score of 89.0, precision of 84.0, and recall of 94.7. Furthermore, it was observed that, on average, 56% of the vulnerabilities identified by the MMPNet model are also represented within the CVE repository in conjunction with an attack, while 61% of the vulnerabilities detected by the model correspond to those cataloged in the CVE repository. A manual inspection of the results revealed the existence of 275 predicted links that were not documented in the MITRE repositories. Consequently, the automation of linking attack techniques to vulnerabilities not only enhances the detection and response capabilities related to software security incidents but also diminishes the duration during which vulnerabilities remain exploitable, thereby contributing to the development of more secure systems.
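The reported F1-score can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. The snippet below is a minimal sketch of that arithmetic; the function name `f1_score` is illustrative and does not come from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract for MMPNet
# on attack Technique descriptions (both given as percentages).
reported_f1 = round(f1_score(84.0, 94.7), 1)
print(reported_f1)  # 89.0
```

The result agrees with the F1-score of 89.0 stated in the abstract.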

[NLP-36] How Instruction-Tuning Imparts Length Control: A Cross-Lingual Mechanistic Analysis

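[Quick read]: This paper addresses the difficulty Large Language Models (LLMs) have in controlling output length precisely, studied on length-constrained generation in English and Italian. The key finding is that instruction-tuning (IT) reshapes the division of labour among components in the deeper layers of the model. Instruction-tuning substantially improves length control, mainly through increasingly positive contributions from attention heads in later layers, especially in English; in Italian, the final-layer multilayer perceptrons (MLPs) instead play a stronger, compensatory role. This suggests that instruction-tuning reconfigures later-layer components for task adherence and that component-level strategies may vary with the language.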

Link: https://arxiv.org/abs/2509.02075
Authors: Elisabetta Rocchetti, Alfio Ferrara
Affiliations: Università degli Studi di Milano
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Adhering to explicit length constraints, such as generating text with a precise word count, remains a significant challenge for Large Language Models (LLMs). This study aims at investigating the differences between foundation models and their instruction-tuned counterparts, on length-controlled text generation in English and Italian. We analyze both performance and internal component contributions using Cumulative Weighted Attribution, a metric derived from Direct Logit Attribution. Our findings reveal that instruction-tuning substantially improves length control, primarily by specializing components in deeper model layers. Specifically, attention heads in later layers of IT models show increasingly positive contributions, particularly in English. In Italian, while attention contributions are more attenuated, final-layer MLPs exhibit a stronger positive role, suggesting a compensatory mechanism. These results indicate that instruction-tuning reconfigures later layers for task adherence, with component-level strategies potentially adapting to linguistic context.

[NLP-37] Attributes as Textual Genes: Leveraging LLMs as Genetic Algorithm Simulators for Conditional Synthetic Data Generation EMNLP2025

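[Quick read]: This paper addresses the difficulty of guaranteeing quality and diversity when Large Language Models (LLMs) generate synthetic data. The key to the solution is Genetic Prompt, a framework that treats semantic text attributes as gene sequences and uses the LLM to simulate crossover and mutation. The genetic process produces novel attribute combinations, improving the quality and diversity of the synthetic data and bringing its distribution closer to real-world data. To optimize parent selection, an active learning scheme is added that expands the offspring search space.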

Link: https://arxiv.org/abs/2509.02040
Authors: Guangzeng Han, Weisi Liu, Xiaolei Huang
Affiliations: University of Memphis
Subjects: Computation and Language (cs.CL)
Comments: Accepted to EMNLP2025 Findings

Abstract:Large Language Models (LLMs) excel at generating synthetic data, but ensuring its quality and diversity remains challenging. We propose Genetic Prompt, a novel framework that combines genetic algorithms with LLMs to augment synthetic data generation. Our approach treats semantic text attributes as gene sequences and leverages the LLM to simulate crossover and mutation operations. This genetic process enhances data quality and diversity by creating novel attribute combinations, yielding synthetic distributions closer to real-world data. To optimize parent selection, we also integrate an active learning scheme that expands the offspring search space. Our experiments on multiple NLP tasks reveal several key findings: Genetic Prompt not only significantly outperforms state-of-the-art baselines but also shows robust performance across various generator model sizes and scales. Moreover, we demonstrate that fusing our synthetic data with the original training set significantly boosts downstream model performance, particularly for class-imbalanced scenarios. Our findings validate that Genetic Prompt is an effective method for producing high-quality synthetic data for a wide range of NLP applications.
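To make the "attributes as genes" idea concrete, the sketch below shows classic crossover and mutation over attribute dictionaries. It is only an illustration under loud assumptions: in the paper the LLM itself simulates these operations, whereas here plain random sampling stands in for it, and the attribute pool, function names, and prompt template are all hypothetical.

```python
import random

# Hypothetical attribute pool; names and values are illustrative only.
ATTRIBUTE_POOL = {
    "tone": ["formal", "casual", "sarcastic"],
    "length": ["short", "medium", "long"],
    "topic": ["billing", "shipping", "returns"],
}


def crossover(parent_a: dict, parent_b: dict, rng: random.Random) -> dict:
    """Uniform crossover: each attribute ("gene") is copied from one parent."""
    return {k: (parent_a[k] if rng.random() < 0.5 else parent_b[k])
            for k in parent_a}


def mutate(child: dict, rng: random.Random, rate: float = 0.2) -> dict:
    """With probability `rate`, resample an attribute from the pool."""
    return {k: (rng.choice(ATTRIBUTE_POOL[k]) if rng.random() < rate else v)
            for k, v in child.items()}


def to_prompt(genes: dict) -> str:
    """Render the attribute combination as a generation instruction."""
    return ("Write a {length} customer message about {topic} "
            "in a {tone} tone.").format(**genes)


rng = random.Random(0)  # fixed seed so the run is reproducible
parent_a = {"tone": "formal", "length": "short", "topic": "billing"}
parent_b = {"tone": "casual", "length": "long", "topic": "returns"}
child = mutate(crossover(parent_a, parent_b, rng), rng)
print(to_prompt(child))
```

Each offspring is a new attribute combination that would condition the next round of synthetic data generation.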

[NLP-38] NADI 2025: The First Multidialectal Arabic Speech Processing Shared Task

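[Quick read]: This paper covers three core problems in Arabic dialect speech processing: spoken dialect identification, speech recognition, and diacritic restoration for spoken dialects. It reports on the sixth Nuanced Arabic Dialect Identification (NADI 2025) shared task, for which 44 teams registered, and systematically evaluates the submitted systems on the three subtasks. The best systems reached 79.8% accuracy on dialect identification, 35.68/12.20 WER/CER on speech recognition, and 55/13 WER/CER on diacritic restoration. The results show both the bottlenecks and the progress of current techniques for Arabic dialects and give later work methods to build on and directions for improvement.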

Link: https://arxiv.org/abs/2509.02038
Authors: Bashar Talafha, Hawau Olamide Toyin, Peter Sullivan, AbdelRahim Elmadany, Abdurrahman Juma, Amirbek Djanibekov, Chiyu Zhang, Hamad Alshehhi, Hanan Aldarmaki, Mustafa Jarrar, Nizar Habash, Muhammad Abdul-Mageed
Affiliations: Hamad Bin Khalifa University; The University of British Columbia; Birzeit University; MBZUAI; NYU Abu Dhabi
Subjects: Computation and Language (cs.CL); Sound (cs.SD)
Comments:

Abstract:We present the findings of the sixth Nuanced Arabic Dialect Identification (NADI 2025) Shared Task, which focused on Arabic speech dialect processing across three subtasks: spoken dialect identification (Subtask 1), speech recognition (Subtask 2), and diacritic restoration for spoken dialects (Subtask 3). A total of 44 teams registered, and during the testing phase, 100 valid submissions were received from eight unique teams. The distribution was as follows: 34 submissions for Subtask 1 "five teams", 47 submissions for Subtask 2 "six teams", and 19 submissions for Subtask 3 "two teams". The best-performing systems achieved 79.8% accuracy on Subtask 1, 35.68/12.20 WER/CER (overall average) on Subtask 2, and 55/13 WER/CER on Subtask 3. These results highlight the ongoing challenges of Arabic dialect speech processing, particularly in dialect identification, recognition, and diacritic restoration. We also summarize the methods adopted by participating teams and briefly outline directions for future editions of NADI.
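For readers unfamiliar with the WER/CER figures quoted above, the sketch below computes word error rate and character error rate as edit distance divided by reference length, expressed as a percentage. It is a generic illustration, not the shared task's official scorer; text normalization (which matters a great deal for Arabic) is omitted, and the references are assumed to be non-empty.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent; the reference must be non-empty."""
    ref = reference.split()
    return 100.0 * edit_distance(ref, hypothesis.split()) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent; the reference must be non-empty."""
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)


ref = "the cat sat on the mat"
hyp = "the cat sit on mat"
print(f"WER = {wer(ref, hyp):.2f}%, CER = {cer(ref, hyp):.2f}%")
```

Because insertions count as errors, both rates can exceed 100% when the hypothesis is much longer than the reference.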

[NLP-39] DeepSeek performs better than other Large Language Models in Dental Cases

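[Quick read]: This paper addresses the under-explored ability of Large Language Models (LLMs) to interpret longitudinal patient narratives, in particular their reasoning over complex clinical cases in dentistry. The key to the solution is a systematic evaluation of four state-of-the-art LLMs (GPT-4o, Gemini 2.0 Flash, Copilot, and DeepSeek V3) on 34 standardized longitudinal periodontal cases (258 question-answer pairs), combining automated metrics with blinded ratings by licensed dentists to quantify faithfulness and expert-judged quality. DeepSeek V3 outperformed the other models on both faithfulness and expert ratings, making it the strongest of the four for clinical case analysis and providing empirical support for using such models as adjunct tools in medical education and research.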

Link: https://arxiv.org/abs/2509.02036
Authors: Hexian Zhang, Xinyu Yan, Yanqi Yang, Lijian Jin, Ping Yang, Junwen Wang
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Abstract word count: 171; Total word count: 3130; Total number of tables: 2; Total number of figures: 3; Number of references: 32

Abstract:Large language models (LLMs) hold transformative potential in healthcare, yet their capacity to interpret longitudinal patient narratives remains inadequately explored. Dentistry, with its rich repository of structured clinical data, presents a unique opportunity to rigorously assess LLMs’ reasoning abilities. While several commercial LLMs already exist, DeepSeek, a model that gained significant attention earlier this year, has also joined the competition. This study evaluated four state-of-the-art LLMs (GPT-4o, Gemini 2.0 Flash, Copilot, and DeepSeek V3) on their ability to analyze longitudinal dental case vignettes through open-ended clinical tasks. Using 34 standardized longitudinal periodontal cases (comprising 258 question-answer pairs), we assessed model performance via automated metrics and blinded evaluations by licensed dentists. DeepSeek emerged as the top performer, demonstrating superior faithfulness (median score = 0.528 vs. 0.367-0.457) and higher expert ratings (median = 4.5/5 vs. 4.0/5), without significantly compromising readability. Our study positions DeepSeek as the leading LLM for case analysis, endorses its integration as an adjunct tool in both medical education and research, and highlights its potential as a domain-specific agent.

[NLP-40] StructCoh: Structured Contrastive Learning for Context-Aware Text Semantic Matching PRICAI2025

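[Quick read]: This paper addresses the insufficient modelling of structural relationships and fine-grained semantic differences in text semantic matching, in particular the limits of pre-trained language models in capturing hierarchical syntactic structure and subtle semantic distinctions. The key to the solution is StructCoh, a graph-enhanced contrastive learning framework with two core innovations. First, a dual-graph encoder builds semantic graphs from dependency parsing and topic modeling and uses a Graph Isomorphism Network to propagate structural features across syntactic dependency edges and cross-document concept nodes. Second, a hierarchical contrastive objective keeps core semantic units consistent at the node level and aligns inter-document structural semantics through explicit and implicit negative sampling.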

Link: https://arxiv.org/abs/2509.02033
Authors: Chao Xue, Ziyuan Gao
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted by PRICAI 2025

Abstract:Text semantic matching requires nuanced understanding of both structural relationships and fine-grained semantic distinctions. While pre-trained language models excel at capturing token-level interactions, they often overlook hierarchical structural patterns and struggle with subtle semantic discrimination. In this paper, we propose StructCoh, a graph-enhanced contrastive learning framework that synergistically combines structural reasoning with representation space optimization. Our approach features two key innovations: (1) A dual-graph encoder constructs semantic graphs via dependency parsing and topic modeling, then employs graph isomorphism networks to propagate structural features across syntactic dependencies and cross-document concept nodes. (2) A hierarchical contrastive objective enforces consistency at multiple granularities: node-level contrastive regularization preserves core semantic units, while graph-aware contrastive learning aligns inter-document structural semantics through both explicit and implicit negative sampling strategies. Experiments on three legal document matching benchmarks and academic plagiarism detection datasets demonstrate significant improvements over state-of-the-art methods. Notably, StructCoh achieves 86.7% F1-score (+6.2% absolute gain) on legal statute matching by effectively identifying argument structure similarities.

[NLP-41] DRAssist: Dispute Resolution Assistance using Large Language Models

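[Quick read]: This paper addresses automated assistance for resolving disputes in multiple domains, such as automobile insurance and domain name disputes. Such disputes are traditionally resolved by a human judge in a specific forum after hearing the facts, arguments, and demands. The key to the solution is the DRAssist system, which uses Large Language Models (LLMs) to extract structure from unstructured dispute descriptions and summarize them, and which applies prompting strategies at several levels so that LLMs can assist the judge in three ways: (i) identifying the overall stronger party; (ii) deciding whether each specific demand of each party can be accepted; and (iii) evaluating how strong each party's arguments are. Performance on each task is evaluated against relevant baselines.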

Link: https://arxiv.org/abs/2509.01962
Authors: Sachin Pawar, Manoj Apte, Girish K. Palshikar, Basit Ali, Nitin Ramrakhiyani
Affiliations: TCS Research, Tata Consultancy Services Limited
Subjects: Computation and Language (cs.CL)
Comments: Accepted at the 20th International Conference on Artificial Intelligence and Law (ICAIL 2025)

Abstract:Disputes between two parties occur in almost all domains, such as taxation, insurance, banking, and healthcare. The disputes are generally resolved in a specific forum (e.g., consumer court) where facts are presented, points of disagreement are discussed, arguments as well as specific demands of the parties are heard, and finally a human judge resolves the dispute, often favouring one of the two parties. In this paper, we explore the use of large language models (LLMs) as assistants for the human judge to resolve such disputes, as part of our DRAssist system. We focus on disputes from two specific domains: automobile insurance and domain name disputes. DRAssist identifies certain key structural elements (e.g., facts, aspects of disagreement, arguments) of the disputes and summarizes the unstructured dispute descriptions to produce a structured summary for each dispute. We then explore multiple prompting strategies with multiple LLMs for their ability to assist in resolving the disputes in these domains. In DRAssist, these LLMs are prompted to produce the resolution output at three different levels: (i) identifying an overall stronger party in a dispute, (ii) deciding whether each specific demand of each contesting party can be accepted or not, and (iii) evaluating whether each argument by each contesting party is strong or weak. We evaluate the performance of LLMs on all these tasks by comparing them with relevant baselines using suitable evaluation metrics.

[NLP-42] Content and Engagement Trends in COVID-19 YouTube Videos: Evidence from the Late Pandemic

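[Quick read]: This paper examines how temporal, lexical, linguistic, and structural factors affected user engagement with COVID-19-related videos on YouTube during the late pandemic. It systematically analyzes about 10,000 such videos published between January 2023 and October 2024 and identifies how publishing day, title keywords, description sentiment, and video duration relate to views. The weekday with peak average views moved from Monday to Wednesday to Friday across the three time windows; videos with "shorts" in the title peaked at 2.16 million average views per video; the correlation between description sentiment and views became stronger once outliers were addressed; and the effect of duration differed markedly across content categories such as news, entertainment, and blogs, which points to the value of category-specific publishing strategies.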

Link: https://arxiv.org/abs/2509.01954
Authors: Nirmalya Thakur, Madeline D Hartel, Lane Michael Boden, Dallas Enriquez, Boston Joyner Ricks
Affiliations: unknown
Subjects: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments:

Abstract:This work investigated about 10,000 COVID-19-related YouTube videos published between January 2023 and October 2024 to evaluate how temporal, lexical, linguistic, and structural factors influenced engagement during the late pandemic period. Publishing activity showed consistent weekday effects: in the first window, average views peaked on Mondays at 92,658; in the second, on Wednesdays at 115,479; and in the third, on Fridays at 84,874, reflecting a shift in audience attention toward mid- and late week. Lexical analysis of video titles revealed recurring high-frequency keywords related to COVID-19 and YouTube features, including COVID, coronavirus, shorts, and live. Frequency analysis revealed sharp spikes, with COVID appearing in 799 video titles in August 2024, while engagement analysis showed that videos titled with shorts attracted very high views, peaking at 2.16 million average views per video in June 2023. Analysis of sentiment of video descriptions in English showed weak correlation with views in the raw data (Pearson r = 0.0154, p = 0.2987), but stronger correlations emerged once outliers were addressed, with Spearman r = 0.110 (p < 0.001) and Pearson r = 0.0925 (p < 0.001). Category-level analysis of video durations revealed contrasting outcomes: long videos focusing on people and blogs averaged 209,114 views, short entertainment videos averaged 288,675 views, and medium-to-long news and politics videos averaged 51,309 and 59,226 views, respectively. These results demonstrate that engagement patterns of COVID-19-related videos on YouTube during the late pandemic followed distinct characteristics driven by publishing schedules, title vocabulary, topics, and genre-specific duration effects.
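The gap between the raw and outlier-handled correlations is easier to see on a toy example. The sketch below implements Pearson and Spearman correlation in plain Python and applies them to made-up data in which one viral video dominates the view counts. The numbers are invented for illustration and are not from the paper, which also does not state here how outliers were addressed.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def ranks(values):
    """Average 1-based ranks, so tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))


# Invented toy data: sentiment scores and view counts with one viral outlier.
sentiment = [-0.8, -0.3, 0.0, 0.2, 0.5, 0.9]
views = [100, 200, 300, 400, 500, 5_000_000]
print(f"Pearson  r = {pearson(sentiment, views):.3f}")
print(f"Spearman r = {spearman(sentiment, views):.3f}")
```

Spearman depends only on rank order, so a single extreme view count cannot dominate it the way it dominates Pearson on raw values.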

[NLP-43] EigenBench: A Comparative Behavioral Measure of Value Alignment

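[Quick read]: This paper addresses the problem of aligning AI with human values, and in particular the lack of quantitative metrics for it. The core of the solution is EigenBench, a black-box method for comparatively benchmarking the values of language models. Given a constitution describing a value system and a dataset of scenarios, the models judge one another's outputs, and the judgments are aggregated with the EigenTrust algorithm into a vector of scores reflecting each model's alignment to the constitution. The method needs no ground truth labels, so it suits traits on which reasonable judges may disagree. Experiments with prompted personas find that most of the variance in scores is explained by the prompt, while a small residual quantifies the disposition of the model itself.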

Link: https://arxiv.org/abs/2509.01938
Authors: Jonathn Chang, Leonard Piff, Suvadip Sana, Jasmine X. Li, Lionel Levine
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Abstract:Aligning AI with human values is a pressing unsolved problem. To address the lack of quantitative metrics for value alignment, we propose EigenBench: a black-box method for comparatively benchmarking language models’ values. Given an ensemble of models, a constitution describing a value system, and a dataset of scenarios, our method returns a vector of scores quantifying each model’s alignment to the given constitution. To produce these scores, each model judges the outputs of other models across many scenarios, and these judgments are aggregated with EigenTrust (Kamvar et al., 2003), yielding scores that reflect a weighted-average judgment of the whole ensemble. EigenBench uses no ground truth labels, as it is designed to quantify traits for which reasonable judges may disagree on the correct label. Using prompted personas, we test whether EigenBench scores are more sensitive to the model or the prompt: we find that most of the variance is explained by the prompt, but a small residual quantifies the disposition of the model itself.
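The aggregation step can be illustrated with the basic EigenTrust iteration: row-normalize a matrix of pairwise trust values and repeatedly propagate trust until the score vector stops changing. The sketch below assumes a hypothetical matrix `local_trust[i][j]` expressing how favourably judge model `i` rates model `j`; how EigenBench derives such values from raw judgments is not described in the abstract, so that part is an assumption.

```python
def eigentrust(local_trust, iterations=100, tol=1e-10):
    """Basic EigenTrust: power iteration on the row-normalized trust matrix.

    local_trust[i][j] >= 0 is how much judge i favours model j (hypothetical
    input). Returns one score per model; the scores sum to 1.
    """
    n = len(local_trust)
    # Row-normalize so each judge distributes a total trust of 1.
    normalized = []
    for row in local_trust:
        total = sum(row)
        normalized.append([v / total for v in row] if total > 0
                          else [1.0 / n] * n)
    scores = [1.0 / n] * n  # start from uniform trust
    for _ in range(iterations):
        # A model's new score is the trust it receives, weighted by
        # the current score of each judge.
        updated = [sum(normalized[i][j] * scores[i] for i in range(n))
                   for j in range(n)]
        converged = max(abs(a - b) for a, b in zip(updated, scores)) < tol
        scores = updated
        if converged:
            break
    return scores


# Three models; model 2 is rated highly by both of its peers.
trust = [
    [0.0, 1.0, 3.0],
    [1.0, 0.0, 3.0],
    [1.0, 1.0, 0.0],
]
print([round(s, 3) for s in eigentrust(trust)])
```

Because each judge's vote is weighted by that judge's own score, the result behaves like the weighted-average judgment of the whole ensemble that the abstract describes.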

[NLP-44] How Real Is AI Tutoring? Comparing Simulated and Human Dialogues in One-on-One Instruction

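[Quick read]: This paper addresses the difficulty current Generative AI has in producing teacher-student interactions with pedagogical depth and cognitive guidance in educational dialogue systems. The key to the approach is a quantitative comparison of authentic human tutoring dialogues with AI-simulated ones, using an Initiation-Response-Feedback (IRF) coding scheme and Epistemic Network Analysis (ENA). It finds significant differences in utterance length, questioning (I-Q), and general feedback (F-F) behaviors. Human dialogues centre on a "question-factual response-feedback" teaching loop that reflects cognitive guidance and student-driven thinking, whereas AI-simulated dialogues collapse into an "explanation-simplistic response" pattern of information transfer that lacks pedagogical complexity and diversity. The findings give empirical guidance for designing more pedagogically effective generative educational dialogue systems.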

Link: https://arxiv.org/abs/2509.01914
Authors: Ruijia Li, Yuan-Hao Jiang, Jiatong Wang, Bo Jiang
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: Proceedings of the 33rd International Conference on Computers in Education (ICCE 2025). Asia-Pacific Society for Computers in Education

Abstract:Heuristic and scaffolded teacher-student dialogues are widely regarded as critical for fostering students’ higher-order thinking and deep learning. However, large language models (LLMs) currently face challenges in generating pedagogically rich interactions. This study systematically investigates the structural and behavioral differences between AI-simulated and authentic human tutoring dialogues. We conducted a quantitative comparison using an Initiation-Response-Feedback (IRF) coding scheme and Epistemic Network Analysis (ENA). The results show that human dialogues are significantly superior to their AI counterparts in utterance length, as well as in questioning (I-Q) and general feedback (F-F) behaviors. More importantly, ENA results reveal a fundamental divergence in interactional patterns: human dialogues are more cognitively guided and diverse, centered around a “question-factual response-feedback” teaching loop that clearly reflects pedagogical guidance and student-driven thinking; in contrast, simulated dialogues exhibit a pattern of structural simplification and behavioral convergence, revolving around an “explanation-simplistic response” loop that is essentially a simple information transfer between the teacher and student. These findings illuminate key limitations in current AI-generated tutoring and provide empirical guidance for designing and evaluating more pedagogically effective generative educational dialogue systems.

[NLP-45] Oyster-I: Beyond Refusal – Constructive Safety Alignment for Responsible Language Models

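[Quick read]: This paper addresses the over-reliance of current safety mechanisms for Large Language Models (LLMs) on defensive refusals. With non-malicious users in psychological distress, a simple refusal can make things worse: the user may repeat the question, escalate, or move to unsafe platforms. The authors propose Constructive Safety Alignment (CSA), a human-centric paradigm. Its key elements are game-theoretic anticipation of user reactions, fine-grained risk boundary discovery, and interpretable reasoning control, which shift safety from passive defense to active guidance so that the model stays safe while giving helpful, context-appropriate responses and building trust. CSA is implemented in Oyster-I (Oy1), which achieves state-of-the-art safety among open models while retaining high general capabilities, and which approaches closed frontier models in constructive engagement and jailbreak robustness.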

Link: https://arxiv.org/abs/2509.01909
Authors: Ranjie Duan, Jiexi Liu, Xiaojun Jia, Shiji Zhao, Ruoxi Cheng, Fengxiang Wang, Cheng Wei, Yong Xie, Chang Liu, Defeng Li, Yinpeng Dong, Yichi Zhang, Yuefeng Chen, Chongwen Wang, Xingjun Ma, Xingxing Wei, Yang Liu, Hang Su, Jun Zhu, Xinfeng Li, Yitong Sun, Jie Zhang, Jinzhao Hu, Sha Xu, Yitong Yang, Jialing Tao, Hui Xue
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Symbolic Computation (cs.SC)
Comments: Technical Report

Abstract:Large language models (LLMs) typically deploy safety mechanisms to prevent harmful content generation. Most current approaches focus narrowly on risks posed by malicious actors, often framing risks as adversarial events and relying on defensive refusals. However, in real-world settings, risks also come from non-malicious users seeking help while under psychological distress (e.g., self-harm intentions). In such cases, the model’s response can strongly influence the user’s next actions. Simple refusals may lead them to repeat, escalate, or move to unsafe platforms, creating worse outcomes. We introduce Constructive Safety Alignment (CSA), a human-centric paradigm that protects against malicious misuse while actively guiding vulnerable users toward safe and helpful results. Implemented in Oyster-I (Oy1), CSA combines game-theoretic anticipation of user reactions, fine-grained risk boundary discovery, and interpretable reasoning control, turning safety into a trust-building process. Oy1 achieves state-of-the-art safety among open models while retaining high general capabilities. On our Constructive Benchmark, it shows strong constructive engagement, close to GPT-5, and unmatched robustness on the Strata-Sword jailbreak dataset, nearing GPT-o1 levels. By shifting from refusal-first to guidance-first safety, CSA redefines the model-user relationship, aiming for systems that are not just safe, but meaningfully helpful. We release Oy1, code, and the benchmark to support responsible, user-centered AI.

[NLP-46] RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events

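[Quick read]: This paper addresses the lack of temporal image pairs and detailed textual annotations in current remote sensing datasets for disaster monitoring: existing data is dominated by single-snapshot imagery, which cannot capture how a disaster unfolds over time. The key to the solution is the Remote Sensing Change Caption (RSCC) dataset, a large-scale collection of 62,315 pre-/post-disaster image pairs (covering earthquakes, floods, wildfires, and other disaster types) paired with rich, human-like change captions. It bridges the temporal and semantic divide in remote sensing data and supports robust training and evaluation of vision-language models for bi-temporal disaster understanding.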

链接: https://arxiv.org/abs/2509.01907
作者: Zhenyuan Chen,Chenxi Wang,Ningyu Zhang,Feng Zhang
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: under review

点击查看摘要

Abstract:Remote sensing is critical for disaster monitoring, yet existing datasets lack temporal image pairs and detailed textual annotations. While single-snapshot imagery dominates current resources, it fails to capture dynamic disaster impacts over time. To address this gap, we introduce the Remote Sensing Change Caption (RSCC) dataset, a large-scale benchmark comprising 62,315 pre-/post-disaster image pairs (spanning earthquakes, floods, wildfires, and more) paired with rich, human-like change captions. By bridging the temporal and semantic divide in remote sensing data, RSCC enables robust training and evaluation of vision-language models for disaster-aware bi-temporal understanding. Our results highlight RSCC’s ability to facilitate detailed disaster-related analysis, paving the way for more accurate, interpretable, and scalable vision-language applications in remote sensing. Code and dataset are available at this https URL.

[NLP-47] Weakly Supervised Medical Entity Extraction and Linking for Chief Complaints

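[Quick Read]: This paper addresses inconsistent terminology in chief complaint (CC) text in clinical records. The inconsistency arises from varied entry methods and hampers both standardized record keeping across medical institutions and medical text mining. The key to its solution is a weakly supervised entity extraction and linking method (Weakly Supervised Entity Extraction and Linking, \ours). A split-and-match algorithm first generates a large number of weak labels (entity mention spans and class labels) without human annotation. A BERT-based model is then trained on these weak labels to locate entities in chief complaint text and map them to a pre-defined ontology, improving entity recognition and linking performance without any human annotation.

Link: https://arxiv.org/abs/2509.01899
Authors: Zhimeng Luo,Zhendong Wang,Rui Meng,Diyang Xue,Adam Frisch,Daqing He
Affiliations: University of Pittsburgh
Categories: Computation and Language (cs.CL)
Comments: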

Abstract:A Chief complaint (CC) is the reason for the medical visit as stated in the patient’s own words. It helps medical professionals to quickly understand a patient’s situation, and also serves as a short summary for medical text mining. However, chief complaint records often take a variety of entering methods, resulting in a wide variation of medical notations, which makes it difficult to standardize across different medical institutions for record keeping or text mining. In this study, we propose a weakly supervised method to automatically extract and link entities in chief complaints in the absence of human annotation. We first adopt a split-and-match algorithm to produce weak annotations, including entity mention spans and class labels, on 1.2 million real-world de-identified and IRB approved chief complaint records. Then we train a BERT-based model with generated weak labels to locate entity mentions in chief complaint text and link them to a pre-defined ontology. We conducted extensive experiments, and the results showed that our Weakly Supervised Entity Extraction and Linking (\ours) method produced superior performance over previous methods without any human annotation.

[NLP-48] Extracting OPQRST in Electronic Health Records using Large Language Models with Reasoning

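[Quick Read]: This paper addresses the challenge of extracting critical patient information, specifically the OPQRST assessment, from Electronic Health Records (EHRs). Traditional machine learning methods struggle to capture the relevant details efficiently because the data is complex and unstructured, and they are further limited by the scarcity of labeled data in healthcare. The key to its solution is to reframe the extraction task from sequence labeling to text generation with Large Language Models (LLMs), so that the model produces not only the result but also reasoning steps that mimic a physician's cognitive process, which improves interpretability. The paper also modifies traditional Named Entity Recognition (NER) evaluation by integrating semantic similarity measures such as BERT Score, to better assess how well generated text aligns with the clinical intent of the original record.

Link: https://arxiv.org/abs/2509.01885
Authors: Zhimeng Luo,Abhibha Gupta,Adam Frisch,Daqing He
Affiliations: University of Pittsburgh; School of Computing and Information; Department of Emergency Medicine
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: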

Abstract:The extraction of critical patient information from Electronic Health Records (EHRs) poses significant challenges due to the complexity and unstructured nature of the data. Traditional machine learning approaches often fail to capture pertinent details efficiently, making it difficult for clinicians to utilize these tools effectively in patient care. This paper introduces a novel approach to extracting the OPQRST assessment from EHRs by leveraging the capabilities of Large Language Models (LLMs). We propose to reframe the task from sequence labeling to text generation, enabling the models to provide reasoning steps that mimic a physician’s cognitive processes. This approach enhances interpretability and adapts to the limited availability of labeled data in healthcare settings. Furthermore, we address the challenge of evaluating the accuracy of machine-generated text in clinical contexts by proposing a modification to traditional Named Entity Recognition (NER) metrics. This includes the integration of semantic similarity measures, such as the BERT Score, to assess the alignment between generated text and the clinical intent of the original records. Our contributions demonstrate a significant advancement in the use of AI in healthcare, offering a scalable solution that improves the accuracy and usability of information extraction from EHRs, thereby aiding clinicians in making more informed decisions and enhancing patient care outcomes.

[NLP-49] Mic Drop or Data Flop? Evaluating the Fitness for Purpose of AI Voice Interviewers for Data Collection within Quantitative and Qualitative Research Contexts

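[Quick Read]: This position paper asks when AI interviewers are fit for purpose for data collection in quantitative and qualitative research, and how they compare with traditional Interactive Voice Response (IVR) systems. The key to its approach is a systematic evaluation along two dimensions: input/output performance (speech recognition, answer recording, emotion handling) and verbal reasoning (probing, clarifying, and handling branching logic). Field studies suggest that AI interviewers already exceed IVR capabilities for both kinds of data collection. However, real-time transcription error rates, limited emotion detection, and uneven follow-up quality indicate that their usefulness remains context-dependent, particularly for qualitative research.

Link: https://arxiv.org/abs/2509.01814
Authors: Shreyas Tirumala,Nishant Jain,Danny D. Leybzon,Trent D. Buskirk
Affiliations: VKL Research, Inc.; Old Dominion University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: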

Abstract:Transformer-based Large Language Models (LLMs) have paved the way for “AI interviewers” that can administer voice-based surveys with respondents in real-time. This position paper reviews emerging evidence to understand when such AI interviewing systems are fit for purpose for collecting data within quantitative and qualitative research contexts. We evaluate the capabilities of AI interviewers as well as current Interactive Voice Response (IVR) systems across two dimensions: input/output performance (i.e., speech recognition, answer recording, emotion handling) and verbal reasoning (i.e., ability to probe, clarify, and handle branching logic). Field studies suggest that AI interviewers already exceed IVR capabilities for both quantitative and qualitative data collection, but real-time transcription error rates, limited emotion detection abilities, and uneven follow-up quality indicate that the utility, use and adoption of current AI interviewer technology may be context-dependent for qualitative data collection efforts.

[NLP-50] ShortageSim: Simulating Drug Shortages under Information Asymmetry

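[Quick Read]: This paper addresses drug shortages, which pose critical risks to patient care and healthcare systems worldwide. The effectiveness of regulatory interventions is hard to assess because of information asymmetries in pharmaceutical supply chains. The key to its solution is ShortageSim, described as the first Large Language Model (LLM)-based multi-agent simulation framework that captures the strategic interactions between drug manufacturers, institutional buyers, and regulatory agencies in response to shortage alerts. Unlike traditional game-theoretic models that assume perfect rationality and complete information, ShortageSim uses LLMs to simulate bounded-rational decision-making under uncertainty. A sequential production game spanning multiple quarters models how FDA announcements (reactive alerts about existing shortages and proactive warnings about potential disruptions) influence capacity investment and procurement decisions. In experiments on historical shortage events, the method reduces the resolution-lag percentage for discontinued-disclosed cases by 83%, bringing simulated durations closer to ground truth than the zero-shot baseline.

Link: https://arxiv.org/abs/2509.01813
Authors: Mingxuan Cui,Yilan Jiang,Duo Zhou,Cheng Qian,Yuji Zhang,Qiong Wang
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
Comments: 21 Pages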

Abstract:Drug shortages pose critical risks to patient care and healthcare systems worldwide, yet the effectiveness of regulatory interventions remains poorly understood due to fundamental information asymmetries in pharmaceutical supply chains. We present ShortageSim, the first Large Language Model (LLM)-based multi-agent simulation framework that captures the complex, strategic interactions between drug manufacturers, institutional buyers, and regulatory agencies in response to shortage alerts. Unlike traditional game-theoretic models that assume perfect rationality and complete information, ShortageSim leverages LLMs to simulate bounded-rational decision-making under uncertainty. Through a sequential production game spanning multiple quarters, we model how FDA announcements, both reactive alerts about existing shortages and proactive warnings about potential disruptions, propagate through the supply chain and influence capacity investment and procurement decisions. Our experiments on historical shortage events reveal that ShortageSim reduces the resolution-lag percentage for discontinued-disclosed cases by 83%, bringing simulated durations more aligned to ground truth than the zero-shot baseline. We open-source ShortageSim and a dataset of 2,925 FDA shortage events at this https URL, providing a novel computational framework for designing and testing interventions in complex, information-scarce supply chains.

[NLP-51] Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs EMNLP2025

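[Quick Read]: This paper asks whether the widely reported high prompt sensitivity of Large Language Models (LLMs) is truly an inherent weakness of the models or largely an artifact of the evaluation process. The key to its approach is to replace heuristic evaluation methods (such as log-likelihood scoring and rigid answer matching) with LLM-as-a-Judge evaluation, which measures performance across prompt templates more accurately. Experiments show that LLM-as-a-Judge substantially reduces performance variance and raises the correlation of model rankings across prompts. This suggests that modern LLMs are more robust to prompt templates than previously believed, and that prompt sensitivity stems more from evaluation than from the models themselves.

Link: https://arxiv.org/abs/2509.01790
Authors: Andong Hua,Kenan Tang,Chenhe Gu,Jindong Gu,Eric Wong,Yao Qin
Affiliations: UC Santa Barbara; UC Irvine; University of Oxford; University of Pennsylvania
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to EMNLP 2025 Main Conference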

Abstract:Prompt sensitivity, referring to the phenomenon where paraphrasing (i.e., repeating something written or spoken using different words) leads to significant changes in large language model (LLM) performance, has been widely accepted as a core limitation of LLMs. In this work, we revisit this issue and ask: Is the widely reported high prompt sensitivity truly an inherent weakness of LLMs, or is it largely an artifact of evaluation processes? To answer this question, we systematically evaluate 7 LLMs (e.g., GPT and Gemini family) across 6 benchmarks, including both multiple-choice and open-ended tasks on 12 diverse prompt templates. We find that much of the prompt sensitivity stems from heuristic evaluation methods, including log-likelihood scoring and rigid answer matching, which often overlook semantically correct responses expressed through alternative phrasings, such as synonyms or paraphrases. When we adopt LLM-as-a-Judge evaluations, we observe a substantial reduction in performance variance and a consistently higher correlation in model rankings across prompts. Our findings suggest that modern LLMs are more robust to prompt templates than previously believed, and that prompt sensitivity may be more an artifact of evaluation than a flaw in the models.
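The effect described in the abstract can be illustrated with a toy sketch. It is not the paper's evaluation code: the responses, the template names, and `lenient_match` are hypothetical, and `lenient_match` is only a crude regex stand-in for an LLM judge. The sketch shows how rigid answer matching can inflate the spread of accuracy across prompt templates even when every response is correct.

```python
import re
import statistics

def rigid_match(response, gold):
    # Heuristic scoring: the response must equal the gold label exactly.
    return response.strip() == gold

def lenient_match(response, gold):
    # Hypothetical stand-in for a judge: accept the gold label anywhere
    # in the response as a standalone token.
    return re.search(rf"\b{re.escape(gold)}\b", response) is not None

def accuracy_spread(responses_by_template, gold, matcher):
    # Population standard deviation of accuracy across prompt templates.
    accuracies = [
        sum(matcher(r, g) for r, g in zip(responses, gold)) / len(gold)
        for responses in responses_by_template.values()
    ]
    return statistics.pstdev(accuracies)

# Toy data (assumed): both templates elicit correct answers, phrased differently.
gold = ["B", "C"]
responses_by_template = {
    "plain": ["B", "C"],
    "verbose": ["The answer is B.", "I choose C"],
}

rigid_spread = accuracy_spread(responses_by_template, gold, rigid_match)
lenient_spread = accuracy_spread(responses_by_template, gold, lenient_match)
```

With rigid matching the "verbose" template scores zero, so the measured spread is large. With the lenient matcher both templates score the same and the spread vanishes.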

[NLP-52] chDzDT: Word-level morphology-aware language model for Algerian social media text

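[Quick Read]: This paper addresses the difficulty of modeling the Algerian dialect in Natural Language Processing (NLP), caused by its complex morphology, frequent code-switching, multiple scripts, and strong lexical influence from other languages. Conventional word- or subword-level Pre-trained Language Models (PLMs) handle such low-resource, morphologically rich language poorly. The key to its solution is chDzDT, a character-level pre-trained language model designed for the Algerian dialect. The model is trained on isolated words, so it does not depend on token boundaries or standardized orthography and can capture morphological patterns more robustly. It is pre-trained on a diverse corpus (YouTube comments; French, English, and Berber Wikipedia; and the Tatoeba project) and provides morphology-aware representations for downstream tasks.

Link: https://arxiv.org/abs/2509.01772
Authors: Abdelkrime Aries
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: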

Abstract:Pre-trained language models (PLMs) have substantially advanced natural language processing by providing context-sensitive text representations. However, the Algerian dialect remains under-represented, with few dedicated models available. Processing this dialect is challenging due to its complex morphology, frequent code-switching, multiple scripts, and strong lexical influences from other languages. These characteristics complicate tokenization and reduce the effectiveness of conventional word- or subword-level approaches. To address this gap, we introduce chDzDT, a character-level pre-trained language model tailored for Algerian morphology. Unlike conventional PLMs that rely on token sequences, chDzDT is trained on isolated words. This design allows the model to encode morphological patterns robustly, without depending on token boundaries or standardized orthography. The training corpus draws from diverse sources, including YouTube comments, French, English, and Berber Wikipedia, as well as the Tatoeba project. It covers multiple scripts and linguistic varieties, resulting in a substantial pre-training workload. Our contributions are threefold: (i) a detailed morphological analysis of Algerian dialect using YouTube comments; (ii) the construction of a multilingual Algerian lexicon dataset; and (iii) the development and extensive evaluation of a character-level PLM as a morphology-focused encoder for downstream tasks. The proposed approach demonstrates the potential of character-level modeling for morphologically rich, low-resource dialects and lays a foundation for more inclusive and adaptable NLP systems.

[NLP-53] An LLM-enabled semantic-centric framework to consume privacy policies

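[Quick Read]: This paper addresses the opacity of data privacy practices that results from users being unable to comprehend the privacy policies of online services, which hinders user-centred Web approaches as well as data sharing and reuse. Its core solution is a semantic-centric approach that uses state-of-the-art Large Language Models (LLMs) to automatically extract key information from privacy policies and construct Pr²Graph, a structured knowledge graph grounded in the Data Privacy Vocabulary (DPV). The graph supports downstream tasks such as building formal policy representations, for example Open Digital Rights Language (ODRL) or perennial semantic Data Terms of Use (psDToU). The key idea is to turn unstructured privacy text into a computable, verifiable semantic representation, offering a feasible path to large-scale analysis and auditing of the privacy practices of online services.

Link: https://arxiv.org/abs/2509.01716
Authors: Rui Zhao,Vladyslav Melnychuk,Jun Zhao,Jesse Wright,Nigel Shadbolt
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: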

Abstract:In modern times, people have numerous online accounts, but they rarely read the Terms of Service or Privacy Policy of those sites, despite claiming otherwise, due to the practical difficulty in comprehending them. The mist of data privacy practices forms a major barrier for user-centred Web approaches, and for data sharing and reusing in an agentic world. Existing research proposed methods for using formal languages and reasoning for verifying the compliance of a specified policy, as a potential cure for ignoring privacy policies. However, a critical gap remains in the creation or acquisition of such formal policies at scale. We present a semantic-centric approach for using state-of-the-art large language models (LLMs), to automatically identify key information about privacy practices from privacy policies, and construct Pr²Graph, a knowledge graph with grounding from Data Privacy Vocabulary (DPV) for privacy practices, to support downstream tasks. Along with the pipeline, the Pr²Graph for the top-100 popular websites is also released as a public resource, by using the pipeline for analysis. We also demonstrate how the Pr²Graph can be used to support downstream tasks by constructing formal policy representations such as Open Digital Rights Language (ODRL) or perennial semantic Data Terms of Use (psDToU). To evaluate the technology capability, we enriched the Policy-IE dataset by employing legal experts to create custom annotations. We benchmarked the performance of different large language models for our pipeline and verified their capabilities. Overall, they shed light on the possibility of large-scale analysis of online services’ privacy practices, as a promising direction to audit the Web and the Internet. We release all datasets and source code as public resources to facilitate reuse and improvement.

[NLP-54] Bridging Thoughts and Words: Graph-Based Intent-Semantic Joint Learning for Fake News Detection CIKM’25

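[Quick Read]: This paper addresses the limited generalization of current fake news detectors in dynamic information environments, which results from over-reliance on surface semantic clues such as emotional words and writing style. The key to its solution is to introduce news intent as a new analytical dimension and to model semantic and intent signals jointly through graph structures, in order to capture the thinking behind news deception. Specifically, the proposed Graph-based Intent-Semantic Joint Modeling (InSide) method reformulates semantic and intent signals into heterogeneous graphs, uses entity guidance for long-range context interaction, and applies coarse-to-fine intent modeling. A dynamic pathway-based graph alignment strategy further improves the alignment and message passing between semantics and intents, boosting detection performance.

Link: https://arxiv.org/abs/2509.01660
Authors: Zhengjia Wang,Qiang Sheng,Danding Wang,Beizhe Hu,Juan Cao
Affiliations: Media Synthesis and Forensics Lab, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Categories: Computation and Language (cs.CL)
Comments: Accepted to CIKM’25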

Abstract:Fake news detection is an important and challenging task for defending online information integrity. Existing state-of-the-art approaches typically extract news semantic clues, such as writing patterns that include emotional words, stylistic features, etc. However, detectors tuned solely to such semantic clues can easily fall into surface detection patterns, which can shift rapidly in dynamic environments, leading to limited performance in the evolving news landscape. To address this issue, this paper investigates a novel perspective by incorporating news intent into fake news detection, bridging intents and semantics together. The core insight is that by considering news intents, one can deeply understand the inherent thoughts behind news deception, rather than the surface patterns within words alone. To achieve this goal, we propose Graph-based Intent-Semantic Joint Modeling (InSide) for fake news detection, which models deception clues from both semantic and intent signals via graph-based joint learning. Specifically, InSide reformulates news semantic and intent signals into heterogeneous graph structures, enabling long-range context interaction through entity guidance and capturing both holistic and implementation-level intent via coarse-to-fine intent modeling. To achieve better alignment between semantics and intents, we further develop a dynamic pathway-based graph alignment strategy for effective message passing and aggregation across these signals by establishing a common space. Extensive experiments on four benchmark datasets demonstrate the superiority of the proposed InSide compared to state-of-the-art methods.

[NLP-55] Reinforced Visual Perception with Tools

[Quick Read]: This paper addresses how Multi-modal Large Language Models (MLLMs) can effectively use visual tools for complex perception and logical reasoning, while overcoming the drawbacks of supervised finetuning: expensive data generation, reliance on careful data filtering, and poor generalization. The key to its solution is ReVPT, a reinforcement learning (RL) training framework built around a novel RL algorithm based on GRPO (Group Relative Policy Optimization), which trains the model to reason with a suite of four visual tools. Experiments show state-of-the-art performance on several perception-heavy benchmarks (SAT, CV-Bench, BLINK, and MMStar), significantly outperforming supervised and text-based RL finetuning baselines. On CV-Bench in particular, ReVPT-3B and ReVPT-7B outperform the instruct models by 9.03% and 9.44%, respectively.

Link: https://arxiv.org/abs/2509.01656
Authors: Zetong Zhou,Dongping Chen,Zixian Ma,Zhihan Hu,Mingyang Fu,Sinan Wang,Yao Wan,Zhou Zhao,Ranjay Krishna
Affiliations: ONE Lab, HUST; University of Maryland; University of Washington; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Technical Report

Abstract:Visual reasoning, a cornerstone of human intelligence, encompasses complex perceptual and logical processes essential for solving diverse visual problems. While advances in computer vision have produced powerful models for various perceptual tasks, leveraging these for general visual reasoning remains challenging. Prior work demonstrates that augmenting LLMs with vision models via supervised finetuning improves performance, but faces key limitations such as expensive data generation, reliance on careful data filtering, and poor generalization. To address these issues, we propose ReVPT to enhance multi-modal LLMs’ abilities to reason about and use visual tools through reinforcement learning. We introduce a novel RL algorithm based on GRPO, designed to train models to reason with a suite of four visual tools. Through extensive experiments, we show that our method achieves state-of-the-art performance on several perception-heavy benchmarks, including SAT, CV-Bench, BLINK and MMStar, significantly outperforming the supervised and text-based RL finetuning baselines. Notably, our ReVPT-3B and ReVPT-7B outperform the instruct models by 9.03% and 9.44% on CV-Bench. Finally, we bring to the community new insights on RL-based visual tool-usage through extensive ablations. Our code is available at this https URL.

[NLP-56] Parallel Needleman-Wunsch on CUDA to measure word similarity based on phonetic transcriptions

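[Quick Read]: This paper addresses how to compute the similarity between words based on their pronunciation (phonetic transcription) in order to analyze the phonetic structure of a language. The key to its solution is to align phoneme sequences globally with the Needleman-Wunsch algorithm and to parallelize the computation on both CPU and GPU so that large datasets can be processed efficiently. The GPU version uses CUDA and the cudarc Rust library for significant speedups. The approach is validated by building a weighted, fully connected graph of words and applying clustering algorithms to identify groups of phonetically similar words.

Link: https://arxiv.org/abs/2509.01654
Authors: Dominic Plein
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 12 figures, accompanied by a YouTube video ( this https URL ) and a GitHub repository ( this https URL )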

Abstract:We present a method to calculate the similarity between words based on their phonetic transcription (their pronunciation) using the Needleman-Wunsch algorithm. We implement this algorithm in Rust and parallelize it on both CPU and GPU to handle large datasets efficiently. The GPU implementation leverages CUDA and the cudarc Rust library to achieve significant performance improvements. We validate our approach by constructing a fully-connected graph where nodes represent words and edges have weights according to the similarity between the words. This graph is then analyzed using clustering algorithms to identify groups of phonetically similar words. Our results demonstrate the feasibility and effectiveness of the proposed method in analyzing the phonetic structure of languages. It might be easily expanded to other languages.
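For readers unfamiliar with the algorithm, here is a minimal single-threaded sketch of Needleman-Wunsch global alignment scoring over phoneme sequences. It is written in Python for brevity, whereas the paper's implementation is in Rust with CUDA. The scoring values (match +1, mismatch -1, gap -1) and the example transcriptions are assumptions for illustration, not the paper's settings.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty sequence costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
    return score[-1][-1]

# Hypothetical phoneme sequences for "cat", "bat", and "at".
cat = ["k", "æ", "t"]
bat = ["b", "æ", "t"]
at = ["æ", "t"]

cat_bat = needleman_wunsch(cat, bat)
cat_at = needleman_wunsch(cat, at)
```

Each cell of the score matrix depends only on its upper, left, and upper-left neighbours, and every word pair is independent of every other pair, which is what makes the workload amenable to parallelization.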

[NLP-57] TransGAT: Transformer-Based Graph Neural Networks for Multi-Dimensional Automated Essay Scoring

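[Quick Read]: This paper addresses two core problems in Automated Essay Scoring (AES). First, existing graph-based methods mostly use static word embeddings, which fail to capture the contextual meaning of polysemous words. Second, most models produce only a holistic score and overlook specific writing dimensions such as grammar, vocabulary, and cohesion. The key to its solution is TransGAT, which combines fine-tuned Transformers (BERT, RoBERTa, and DeBERTaV3) with Graph Attention Networks (GAT) in a two-stream prediction architecture. The first stream produces essay-level predictions from the Transformer. The second applies a GAT to the Transformer's token embeddings, with edges built from syntactic dependencies. The predictions from both streams are then fused into the final analytic score. On the ELLIPSE dataset the method achieves an average QWK of 0.854, outperforming the baseline models.

Link: https://arxiv.org/abs/2509.01640
Authors: Hind Aljuaid,Areej Alhothali,Ohoud Al-Zamzami,Hussein Assalahi
Affiliations: King Abdulaziz University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: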

Abstract:Essay writing is a critical component of student assessment, yet manual scoring is labor-intensive and inconsistent. Automated Essay Scoring (AES) offers a promising alternative, but current approaches face limitations. Recent studies have incorporated Graph Neural Networks (GNNs) into AES using static word embeddings that fail to capture contextual meaning, especially for polysemous words. Additionally, many methods rely on holistic scoring, overlooking specific writing aspects such as grammar, vocabulary, and cohesion. To address these challenges, this study proposes TransGAT, a novel approach that integrates fine-tuned Transformer models with GNNs for analytic scoring. TransGAT combines the contextual understanding of Transformers with the relational modeling strength of Graph Attention Networks (GAT). It performs two-stream predictions by pairing each fine-tuned Transformer (BERT, RoBERTa, and DeBERTaV3) with a separate GAT. In each pair, the first stream generates essay-level predictions, while the second applies GAT to Transformer token embeddings, with edges constructed from syntactic dependencies. The model then fuses predictions from both streams to produce the final analytic score. Experiments on the ELLIPSE dataset show that TransGAT outperforms baseline models, achieving an average Quadratic Weighted Kappa (QWK) of 0.854 across all analytic scoring dimensions. These findings highlight the potential of TransGAT to advance AES systems.
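The headline metric, Quadratic Weighted Kappa (QWK), can be sketched in a few lines. This is a generic textbook formulation, not the paper's evaluation code. It assumes scores have already been mapped to integer label indices `0..num_labels-1`, and that mapping is an assumption of this sketch.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_labels):
    """QWK for integer labels in range(num_labels); requires num_labels >= 2."""
    n = num_labels
    # Observed confusion matrix: rows are true labels, columns are predictions.
    observed = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    total = len(y_true)
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2  # quadratic disagreement penalty
            expected = hist_true[i] * hist_pred[j] / total  # chance agreement
            numerator += weight * observed[i][j]
            denominator += weight * expected
    if denominator == 0.0:
        # Every rating falls on a single label, so there is no disagreement.
        return 1.0
    return 1.0 - numerator / denominator

perfect = quadratic_weighted_kappa([0, 1, 2], [0, 1, 2], num_labels=3)
reversed_scores = quadratic_weighted_kappa([0, 2], [2, 0], num_labels=3)
```

QWK is 1 for perfect agreement, around 0 for chance-level agreement, and negative when predictions disagree more than chance would. The quadratic weight penalizes a prediction that is two score levels off four times as heavily as one that is one level off.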

[NLP-58] Benchmarking the Detection of LLMs-Generated Modern Chinese Poetry EMNLP2025

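[Quick Read]: This paper addresses the difficulty of detecting AI-generated modern Chinese poetry, where existing detection tools perform poorly, especially on intrinsic qualities such as style. The key to its solution is a high-quality benchmark dataset containing 800 poems written by six professional poets and 41,600 poems generated by four mainstream Large Language Models (LLMs), on which six detectors are systematically evaluated. The results show that current detectors cannot reliably distinguish AI-generated modern Chinese poems, underlining the need for the proposed benchmark in future research on detecting AI-generated poetry.

Link: https://arxiv.org/abs/2509.01620
Authors: Shanshan Wang,Junchao Wu,Fengying Ye,Jingming Yao,Lidia S. Chao,Derek F. Wong
Affiliations: NLP2CT Lab, Department of Computer and Information Science, University of Macau; Department of Portuguese, Faculty of Arts and Humanities, University of Macau
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by EMNLP 2025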

Abstract:The rapid development of advanced large language models (LLMs) has made AI-generated text indistinguishable from human-written text. Previous work on detecting AI-generated text has made effective progress, but has not involved modern Chinese poetry. Due to the distinctive characteristics of modern Chinese poetry, it is difficult to identify whether a poem originated from humans or AI. The proliferation of AI-generated modern Chinese poetry has significantly disrupted the poetry ecosystem. Based on the urgency of identifying AI-generated poetry in the real Chinese world, this paper proposes a novel benchmark for detecting LLMs-generated modern Chinese poetry. We first construct a high-quality dataset, which includes both 800 poems written by six professional poets and 41,600 poems generated by four mainstream LLMs. Subsequently, we conduct systematic performance assessments of six detectors on this dataset. Experimental results demonstrate that current detectors cannot be used as reliable tools to detect modern Chinese poems generated by LLMs. The most difficult poetic features to detect are intrinsic qualities, especially style. The detection results verify the effectiveness and necessity of our proposed benchmark. Our work lays a foundation for future detection of AI-generated poetry.

[NLP-59] Testing the assumptions about the geometry of sentence embedding spaces: the cosine measure need not apply

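[Quick Read]: This paper asks whether the geometry of the sentence embedding space predicts how sentence embeddings perform on various linguistic tasks. It is commonly assumed that nearby points in the embedding space represent semantically similar sentences, yet the results show that geometric distance (such as cosine similarity) does not reflect the relative task performance of different sentence embeddings. The key insight is that linguistic information is not encoded in a shallow, dimension-by-dimension manner. It is encoded in weighted combinations of different dimensions, and these higher-order features are not reflected in the geometry of the embedding space. Distance measures alone therefore cannot be used to judge embedding quality or task suitability.

Link: https://arxiv.org/abs/2509.01606
Authors: Vivi Nastase,Paola Merlo
Affiliations: Idiap Research Institute; University of Geneva
Categories: Computation and Language (cs.CL)
Comments: 25 pages, 6 tables, 10 figures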

Abstract:Transformer models learn to encode and decode an input text, and produce contextual token embeddings as a side-effect. The mapping from language into the embedding space maps words expressing similar concepts onto points that are close in the space. In practice, the reverse implication is also assumed: words corresponding to close points in this space are similar or related, those that are further are not. Does closeness in the embedding space extend to shared properties for sentence embeddings? We present an investigation of sentence embeddings and show that the geometry of their embedding space is not predictive of their relative performances on a variety of tasks. We compute sentence embeddings in three ways: as averaged token embeddings, as the embedding of the special [CLS] token, and as the embedding of a random token from the sentence. We explore whether there is a correlation between the distance between sentence embedding variations and their performance on linguistic tasks, and whether despite their distances, they do encode the same information in the same manner. The results show that the cosine similarity – which treats dimensions shallowly – captures (shallow) commonalities or differences between sentence embeddings, which are not predictive of their performance on specific tasks. Linguistic information is rather encoded in weighted combinations of different dimensions, which are not reflected in the geometry of the sentence embedding space.
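To make the compared quantities concrete, the sketch below builds two of the sentence embedding variants named in the abstract (averaged token embeddings and the first-position [CLS] embedding) from a toy token matrix and measures their cosine similarity. The token vectors are made-up values, not model outputs, and the third variant (a random token) is omitted to keep the example deterministic.

```python
import math

def cosine(u, v):
    # Cosine similarity: each dimension contributes independently to the dot
    # product, which is the "shallow" treatment of dimensions the abstract notes.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mean_pool(token_embeddings):
    # Sentence embedding as the average of the token embeddings.
    count = len(token_embeddings)
    return [sum(column) / count for column in zip(*token_embeddings)]

# Toy token embeddings (assumed values); position 0 plays the role of [CLS].
tokens = [
    [1.0, 0.0, 2.0],
    [3.0, 0.0, 0.0],
    [2.0, 3.0, 1.0],
]

cls_embedding = tokens[0]
mean_embedding = mean_pool(tokens)
similarity = cosine(cls_embedding, mean_embedding)
```

Cosine similarity ignores vector length, so two embeddings that point in the same direction score 1 regardless of scale. The paper's argument is that this kind of score does not indicate how well either variant will perform on a given task.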

[NLP-60] CSRM-LLM: Embracing Multilingual LLMs for Cold-Start Relevance Matching in Emerging E-commerce Markets

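[Quick Read]: This paper addresses the cold-start problem in emerging e-commerce markets, where a lack of human labels and user behavior data limits relevance matching performance. The key to its solution is the Cold-Start Relevance Matching (CSRM) framework, which has three components: (1) activating the cross-lingual transfer ability of a Large Language Model (LLM) through machine translation tasks; (2) improving query understanding and incorporating e-commerce knowledge through retrieval-based query augmentation; and (3) mitigating the impact of noisy training labels through a multi-round self-distillation training strategy. In real-world deployment the method delivers significant gains, with a 45.8% reduction in defect ratio and a 0.866% uplift in session purchase rate.

Link: https://arxiv.org/abs/2509.01566
Authors: Yujing Wang,Yiren Chen,Huoran Li,Chunxu Xu,Yuchong Luo,Xianghui Mao,Cong Li,Lun Du,Chunyang Ma,Qiqi Jiang,Yin Wang,Fan Gao,Wenting Mo,Pei Wen,Shantanu Kumar,Taejin Park,Yiwei Song,Vijay Rajaram,Tao Cheng,Sonu Durgia,Pranam Kolari
Affiliations: Coupang, Inc.; Beijing, China; Seoul, Republic of Korea; Mountain View, United States
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 7 pages, 3 figures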

Abstract:As global e-commerce platforms continue to expand, companies are entering new markets where they encounter cold-start challenges due to limited human labels and user behaviors. In this paper, we share our experiences in Coupang to provide a competitive cold-start performance of relevance matching for emerging e-commerce markets. Specifically, we present a Cold-Start Relevance Matching (CSRM) framework, utilizing a multilingual Large Language Model (LLM) to address three challenges: (1) activating cross-lingual transfer learning abilities of LLMs through machine translation tasks; (2) enhancing query understanding and incorporating e-commerce knowledge by retrieval-based query augmentation; (3) mitigating the impact of training label errors through a multi-round self-distillation training strategy. Our experiments demonstrate the effectiveness of CSRM-LLM and the proposed techniques, resulting in successful real-world deployment and significant online gains, with a 45.8% reduction in defect ratio and a 0.866% uplift in session purchase rate.

[NLP-61] Enhancing Uncertainty Estimation in LLMs with Expectation of Aggregated Internal Belief

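[Quick Read]: This paper addresses the overconfidence commonly seen in Large Language Models (LLMs) after Reinforcement Learning from Human Feedback (RLHF): the model remains highly confident while generating plausible but incorrect answers, which challenges reliable uncertainty estimation and safe deployment. The key to its solution is EAGLE (Expectation of AGgregated internaL bEief), a self-evaluation-based calibration method. Its core idea is to extract the model's internal beliefs from the hidden states of multiple intermediate layers, aggregate these layer-wise beliefs, and compute the expectation over the resulting confidence score distribution. This yields a more accurate confidence score and significantly improves calibration.

Link: https://arxiv.org/abs/2509.01564
Authors: Zeguan Xiao,Diyang Dou,Boya Xiong,Yun Chen,Guanhua Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: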

Abstract:Large Language Models (LLMs) have achieved remarkable success across a wide range of natural language tasks, but often exhibit overconfidence and generate plausible yet incorrect answers. This overconfidence, especially in models that have undergone Reinforcement Learning from Human Feedback (RLHF), poses significant challenges for reliable uncertainty estimation and safe deployment. In this paper, we propose EAGLE (Expectation of AGgregated internaL bEief), a novel self-evaluation-based calibration method that leverages the internal hidden states of LLMs to derive more accurate confidence scores. Instead of relying on the model’s final output, our approach extracts internal beliefs from multiple intermediate layers during self-evaluation. By aggregating these layer-wise beliefs and calculating the expectation over the resulting confidence score distribution, EAGLE produces a refined confidence score that more faithfully reflects the model’s internal certainty. Extensive experiments on diverse datasets and LLMs demonstrate that EAGLE significantly improves calibration performance over existing baselines. We also provide an in-depth analysis of EAGLE, including a layer-wise examination of uncertainty patterns, a study of the impact of self-evaluation prompts, and an analysis of the effect of self-evaluation score range.
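The "aggregate, then take the expectation" step can be sketched numerically. The abstract does not specify the aggregation rule, so this sketch makes its own assumptions: each layer yields logits over a small set of self-evaluation score tokens, the per-layer distributions are aggregated by a plain average, and the expected score is normalized by the maximum score. The function names and toy logits are hypothetical and are not the authors' implementation.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one layer's score-token logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_confidence(layer_logits, scores):
    """Average per-layer score distributions, then take the expected score.

    layer_logits: one list of logits per layer, aligned with `scores`.
    scores: the numeric self-evaluation scores, e.g. [0, 1, 2].
    Averaging across layers is an assumption of this sketch.
    """
    distributions = [softmax(logits) for logits in layer_logits]
    aggregated = [
        sum(dist[k] for dist in distributions) / len(distributions)
        for k in range(len(scores))
    ]
    expectation = sum(p * s for p, s in zip(aggregated, scores))
    return expectation / max(scores)  # normalize to [0, 1]

scores = [0, 1, 2]
# Toy logits from two hypothetical intermediate layers.
undecided = expected_confidence([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], scores)
leaning_high = expected_confidence([[0.0, 0.0, 5.0], [0.0, 1.0, 4.0]], scores)
```

Taking the expectation over the whole distribution, as opposed to reading off the single most likely score, lets probability mass on neighbouring scores pull the confidence up or down smoothly.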

[NLP-62] In-N-Out: A Parameter-Level API Graph Dataset for Tool Agents

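Quick read: This paper addresses the difficulty tool agents have in identifying external APIs and calling them in the correct order when executing complex tasks. The key to the solution is converting API documentation into a structured API graph that explicitly models dependencies between APIs, and using that graph to support multi-tool queries. By building In-N-Out, the first expert-annotated dataset of API graphs, the study shows that the approach significantly improves both tool retrieval and multi-tool query generation, and strengthens models' understanding of API documentation and parameter relationships.

Link: https://arxiv.org/abs/2509.01560
Authors: Seungkyu Lee, Nalim Kim, Yohan Jo
Affiliations: Seoul National University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: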

Abstract:Tool agents – LLM-based systems that interact with external APIs – offer a way to execute real-world tasks. However, as tasks become increasingly complex, these agents struggle to identify and call the correct APIs in the proper order. To tackle this problem, we investigate converting API documentation into a structured API graph that captures API dependencies and leveraging it for multi-tool queries that require compositional API calls. To support this, we introduce In-N-Out, the first expert-annotated dataset of API graphs built from two real-world API benchmarks and their documentation. Using In-N-Out significantly improves performance on both tool retrieval and multi-tool query generation, nearly doubling that of LLMs using documentation alone. Moreover, graphs generated by models fine-tuned on In-N-Out close 90% of this gap, showing that our dataset helps models learn to comprehend API documentation and parameter relationships. Our findings highlight the promise of using explicit API graphs for tool agents and the utility of In-N-Out as a valuable resource. We will release the dataset and code publicly.
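
To make the idea of a parameter-level API graph concrete, here is a small sketch. The API names, parameter names, and edge format are hypothetical and are not taken from the In-N-Out dataset; the sketch only shows how output-to-input edges yield a valid call order.

```python
from graphlib import TopologicalSorter

# Hypothetical parameter-level edges:
# (producer API, output parameter) -> (consumer API, input parameter)
edges = [
    (("search_flights", "flight_id"), ("book_flight", "flight_id")),
    (("get_user", "user_id"), ("book_flight", "passenger_id")),
    (("book_flight", "booking_id"), ("send_receipt", "booking_id")),
]

def call_order(edges):
    # Map each API to the set of APIs whose outputs it consumes.
    deps = {}
    for (producer, _out), (consumer, _inp) in edges:
        deps.setdefault(producer, set())
        deps.setdefault(consumer, set()).add(producer)
    # A topological order calls every producer before its consumers.
    return list(TopologicalSorter(deps).static_order())

order = call_order(edges)
```

An agent that has such a graph can plan a compositional call sequence directly, instead of inferring the dependencies from free-text documentation at query time.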

[NLP-63] CAT: Causal Attention Tuning For Injecting Fine-grained Causal Knowledge into Large Language Models EMNLP2025

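Quick read: This paper addresses the tendency of large language models (LLMs) to capture spurious correlations rather than true causal relationships during training, which degrades performance in out-of-distribution (OOD) scenarios. The key to the solution is Causal Attention Tuning (CAT), which uses an automated pipeline that leverages human priors to generate fine-grained token-level causal signals, and introduces a Re-Attention mechanism to guide training so that the model focuses on causal structures while noise and bias in attention scores are suppressed. This improves the model's causal reasoning and generalization in prediction and generation tasks.

Link: https://arxiv.org/abs/2509.01535
Authors: Kairong Han, Wenshuo Zhao, Ziyu Zhao, JunJian Ye, Lujia Pan, Kun Kuang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to EMNLP 2025 Main conference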

Abstract:Large Language Models (LLMs) have achieved remarkable success across various domains. However, a fundamental question remains: Can LLMs effectively utilize causal knowledge for prediction and generation? Through empirical studies, we find that LLMs trained directly on large-scale data often capture spurious correlations rather than true causal relationships, leading to suboptimal performance, especially in out-of-distribution (OOD) scenarios. To address this challenge, we propose Causal Attention Tuning (CAT), a novel approach that injects fine-grained causal knowledge into the attention mechanism. We propose an automated pipeline that leverages human priors to automatically generate token-level causal signals and introduce the Re-Attention mechanism to guide training, helping the model focus on causal structures while mitigating noise and biases in attention scores. Experimental results on our proposed Spurious Token Game (STG) benchmark and multiple downstream tasks demonstrate that our approach effectively leverages causal knowledge for prediction and remains robust in OOD scenarios. Implementation details can be found at this https URL.

[NLP-64] Service, Solidarity, and Self-Help: A Comparative Topic Modeling Analysis of Community Unionism in the Boot and Shoe Union and Unite Community

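Quick read: This paper asks how modern natural language processing (NLP) techniques can be used to compare union texts from different periods, in order to reveal how the discourse and practice of community unionism (CU) differ and evolve across historical contexts. The key to the solution is combining BERTopic topic modeling and cTF-IDF weighting with word frequency analysis to identify and quantify similarities and differences in thematic focus, discursive coherence, and social justice orientation between two representative unions: the National Boot and Shoe Union (B&S) of the 1920s and Unite Community of the 2010s–2020s. This tests how well the CU concept applies across time and context and exposes structural differences between the two.

Link: https://arxiv.org/abs/2509.01529
Authors: Thomas Compton
Affiliations: University of York
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 7 figures, conference paper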

Abstract:This paper presents a comparative analysis of community unionism (CU) in two distinct historical and organizational contexts: the National Boot and Shoe Union (B&S) in the 1920s and Unite Community in the 2010s–2020s. Using BERTopic for thematic modeling and cTF-IDF weighting, alongside word frequency analysis, the study examines the extent to which each union’s discourse aligns with key features of CU – such as coalition-building, grassroots engagement, and action beyond the workplace. The results reveal significant differences in thematic focus and discursive coherence. While Unite Community demonstrates stronger alignment with outward-facing, social justice-oriented themes, the B&S corpus emphasizes internal administration, industrial relations, and member services – reflecting a more traditional, servicing-oriented union model. The analysis also highlights methodological insights, demonstrating how modern NLP techniques can enhance the study of historical labor archives. Ultimately, the findings suggest that while both unions engage with community-related themes, their underlying models of engagement diverge significantly, challenging assumptions about the continuity and universality of community unionism across time and sector.
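
For readers unfamiliar with class-based TF-IDF, the following from-scratch sketch shows the general idea. It does not call BERTopic, it assumes the weighting tf × log(1 + A / f) (term count in the class, times a log factor built from the average words per class A and the term's total count f), and the two tiny corpora are invented text rather than material from either union's archive.

```python
import math
from collections import Counter

def c_tf_idf(class_docs):
    """class_docs maps a class name to the concatenated text of that class."""
    # tf[c][t]: raw count of term t in class c (whitespace tokenization).
    tf = {c: Counter(text.lower().split()) for c, text in class_docs.items()}
    # f[t]: count of term t summed over all classes.
    f = Counter()
    for counts in tf.values():
        f.update(counts)
    # A: average number of words per class.
    avg_words = sum(f.values()) / len(tf)
    return {
        c: {t: n * math.log(1 + avg_words / f[t]) for t, n in counts.items()}
        for c, counts in tf.items()
    }

corpora = {  # toy, invented text for illustration only
    "B&S": "wage members members dues",
    "Unite Community": "wage campaign campaign housing",
}
weights = c_tf_idf(corpora)
top_terms = {c: max(w, key=w.get) for c, w in weights.items()}
```

Terms shared by both corpora, such as "wage" here, receive a lower weight than terms concentrated in one corpus, which is what makes the per-class top terms useful for contrasting two bodies of text.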

[NLP-65] MeVe: A Modular System for Memory Verification and Effective Context Control in Language Models

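Quick read: This paper addresses the redundant and irrelevant context that conventional Retrieval-Augmented Generation (RAG) systems introduce by relying on simple top-k semantic search, which degrades the performance and efficiency of large language models (LLMs). The key to the solution is MeVe, a modular architecture that splits retrieval and context composition into five independently auditable and tunable phases: initial retrieval, relevance verification, fallback retrieval, context prioritization, and token budgeting. By verifying relevance before the context is composed, MeVe gives fine-grained control over the knowledge made available to the LLM. Compared with standard RAG, it reports a 57% reduction on a subset of English Wikipedia and a 75% reduction on the more complex HotpotQA dataset.

Link: https://arxiv.org/abs/2509.01514
Authors: Andreas Ottem
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 7 figures, held online presentation at NLPA 2025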

Abstract:Retrieval-Augmented Generation (RAG) systems typically face constraints because of their inherent mechanism: a simple top-k semantic search [1]. The approach often leads to the incorporation of irrelevant or redundant information in the context, degrading performance and efficiency [10][11]. This paper presents MeVe, a novel modular architecture intended for Memory Verification and smart context composition. MeVe rethinks the RAG paradigm by proposing a five-phase modular design that distinctly breaks down the retrieval and context composition process into distinct, auditable, and independently tunable phases: initial retrieval, relevance verification, fallback retrieval, context prioritization, and token budgeting. This architecture enables fine-grained control of what knowledge is made available to an LLM, enabling task-dependent filtering and adaptation. We release a reference implementation of MeVe as a proof of concept and evaluate its performance on knowledge-heavy QA tasks over a subset of English Wikipedia [22]. Our results demonstrate that by actively verifying information before composition, MeVe significantly improves context efficiency, achieving a 57% reduction on the Wikipedia dataset and a 75% reduction on the more complex HotpotQA dataset compared to standard RAG implementations [25]. This work provides a framework for more scalable and reliable LLM applications. By refining and distilling contextual information, MeVe offers a path toward better grounding and more accurate factual support [16].
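
The five phases named in the abstract can be sketched as a toy pipeline. This is not the released reference implementation: the word-overlap scorer, the threshold, the whitespace token count, and the fallback rule (take the single best-ranked document) are all stand-ins chosen so the example runs on its own.

```python
def overlap(query, text):
    # Toy relevance score: fraction of query words that appear in the text.
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

def meve_context(query, corpus, k=3, threshold=0.7, token_budget=12):
    # Phase 1: initial retrieval (top-k by the toy score).
    ranked = sorted(corpus, key=lambda d: overlap(query, d), reverse=True)
    candidates = ranked[:k]
    # Phase 2: relevance verification against a threshold.
    verified = [d for d in candidates if overlap(query, d) >= threshold]
    # Phase 3: fallback retrieval when nothing passes verification.
    if not verified:
        verified = ranked[:1]
    # Phase 4: context prioritization (most relevant first).
    verified.sort(key=lambda d: overlap(query, d), reverse=True)
    # Phase 5: token budgeting (whitespace tokens as a stand-in for real tokens).
    context, used = [], 0
    for d in verified:
        n = len(d.split())
        if used + n <= token_budget:
            context.append(d)
            used += n
    return context

corpus = [
    "Paris is the capital of France",
    "Bananas are yellow",
    "France borders Spain and Italy",
    "The capital of Norway is Oslo",
]
context = meve_context("capital of France", corpus)
```

With plain top-3 retrieval, the Norway and Spain/Italy sentences would also reach the model; the verification phase is what removes them here.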

[NLP-66] Do Retrieval Augmented Language Models Know When They Don't Know?

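Quick read: This paper addresses hallucination in large language models (LLMs), focusing on whether Retrieval Augmented Language Models (RALMs) can recognize the limits of their own knowledge, that is, whether they know when they don't know. The study finds significant over-refusal, and finds that refusal post-training methods affect it differently: In-Context Fine-tuning mitigates over-refusal, whereas R-tuning aggravates it. Refusal ability may also trade off against answer quality. The key to the solution is a simple yet effective refusal method for refusal post-trained models that improves overall answer quality in terms of both refusals and correct answers.

Link: https://arxiv.org/abs/2509.01476
Authors: Youchao Zhou, Heyan Huang, Yicheng Liu, Rui Dai, Xinglin Wang, Xingchen Zhang, Shumin Shi, Yang Deng
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: under review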

Abstract:Existing Large Language Models (LLMs) occasionally generate plausible yet factually incorrect responses, known as hallucinations. Researchers are primarily using two approaches to mitigate hallucinations, namely Retrieval Augmented Language Models (RALMs) and refusal post-training. However, current research predominantly emphasizes their individual effectiveness while overlooking the evaluation of the refusal capability of RALMs. In this study, we ask the fundamental question: Do RALMs know when they don’t know? Specifically, we ask three questions. First, are RALMs well-calibrated regarding different internal and external knowledge states? We examine the influence of various factors. Contrary to expectations, we find that LLMs exhibit significant over-refusal behavior. Then, how does refusal post-training affect the over-refusal issue? We investigate the Refusal-aware Instruction Tuning and In-Context Fine-tuning methods. Our results show that the over-refusal problem is mitigated by In-Context Fine-tuning but magnified by R-tuning. However, we also find that the refusal ability may conflict with the quality of the answer. Finally, we develop a simple yet effective refusal method for refusal post-trained models to improve their overall answer quality in terms of refusal and correct answers. Our study provides a more comprehensive understanding of the influence of important factors on RALM systems.

[NLP-67] Robust Knowledge Editing via Explicit Reasoning Chains for Distractor-Resilient Multi-Hop QA EMNLP2025

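Quick read: This paper addresses the difficulty of efficiently updating large language models (LLMs) with emerging knowledge after training. Existing knowledge-editing techniques either over-rely on superficial cues or use complex iterative pipelines that break down under noise and multi-hop reasoning. The key to the solution is Reason-KE, an end-to-end editing framework based on reasoning chains. It steers the model through four structured stages (fact acknowledgment, relevance determination, selective application, and final reasoning) to filter distractors in a single pass. This raises multi-hop QA accuracy (Qwen2.5-7B reaches 90.2% on MQuAKE-CF) while remaining robust to distractors and answer leakage.

Link: https://arxiv.org/abs/2509.01468
Authors: Yuchen Wu, Liang Ding, Li Shen, Dacheng Tao
Affiliations: Shanghai Jiao Tong University; The University of Sydney; Shenzhen Campus of Sun Yat-sen University; Nanyang Technological University
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2025 Findings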

Abstract:Large language models (LLMs) encode vast amounts of world knowledge but remain static once trained, making the timely integration of emerging facts prohibitively expensive via full retraining. Knowledge-editing techniques have thus emerged to inject or overwrite specific facts into LLMs, yet they either over-rely on superficial cues or incur complex, iterative pipelines that collapse under noisy, multi-hop conditions. We introduce Reason-KE, an end-to-end reasoning-chain-based editing framework that steers a pretrained LLM through four structured stages (fact acknowledgment, relevance determination, selective application, and final reasoning) to filter distractors in a single pass. Trained on MQuAKE-CF with up to four irrelevant facts, Reason-KE elevates Qwen2.5-7B’s multi-hop QA accuracy to 90.2% while suffering merely a 6.3% drop under heavy distraction and 1% when answers are leaked. Our quantitative analysis confirms Reason-KE’s resilience and efficiency, establishing a new state-of-the-art for reliable LLM knowledge updates.

[NLP-68] Trusted Uncertainty in Large Language Models: A Unified Framework for Confidence Calibration and Risk-Controlled Refusal

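Quick read: This paper addresses how a deployed language model should decide not only what to answer but also when to refuse, so as to improve reliability and trustworthiness. The core challenge is to turn heterogeneous uncertainty evidence (sequence likelihoods, self-consistency dispersion, retrieval compatibility, and tool or verifier feedback) into a single calibrated probability of correctness, and then to enforce a user-specified error budget through principled refusal. The key to the solution is the UniCR framework: it learns a lightweight calibration head with temperature scaling and proper scoring, supports API-only models through black-box features, and uses conformal risk control for distribution-free guarantees. For long-form generation, it aligns confidence with semantic fidelity by supervising on atomic factuality scores, reducing confident hallucinations while preserving coverage. According to the abstract, the method needs no fine-tuning of the base model and remains valid under distribution shift.

Link: https://arxiv.org/abs/2509.01455
Authors: Markus Oehri, Giulia Conti, Kaviraj Pather, Alexandre Rossi, Laia Serra, Adrian Parody, Rogvi Johannesen, Aviaja Petersen, Arben Krasniqi
Affiliations: University of Liechtenstein; University of the Republic of San Marino; University of Mauritius; International University of Monaco; University of Andorra; University of Gibraltar; University of the Faroe Islands; Ilisimatusarfik (University of Greenland); University of Prizren "Ukshin Hoti"
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 5 figures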

Abstract:Deployed language models must decide not only what to answer but also when not to answer. We present UniCR, a unified framework that turns heterogeneous uncertainty evidence including sequence likelihoods, self-consistency dispersion, retrieval compatibility, and tool or verifier feedback into a calibrated probability of correctness and then enforces a user-specified error budget via principled refusal. UniCR learns a lightweight calibration head with temperature scaling and proper scoring, supports API-only models through black-box features, and offers distribution-free guarantees using conformal risk control. For long-form generation, we align confidence with semantic fidelity by supervising on atomic factuality scores derived from retrieved evidence, reducing confident hallucinations while preserving coverage. Experiments on short-form QA, code generation with execution tests, and retrieval-augmented long-form QA show consistent improvements in calibration metrics, lower area under the risk-coverage curve, and higher coverage at fixed risk compared to entropy or logit thresholds, post-hoc calibrators, and end-to-end selective baselines. Analyses reveal that evidence contradiction, semantic dispersion, and tool inconsistency are the dominant drivers of abstention, yielding informative user-facing refusal messages. The result is a portable recipe of evidence fusion to calibrated probability to risk-controlled decision that improves trustworthiness without fine-tuning the base model and remains valid under distribution shift.
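
The step from a calibrated probability to a refusal decision under an error budget can be sketched as follows. This is a simplified empirical threshold search on a held-out calibration set, not the conformal risk control procedure the paper uses, so it carries no finite-sample guarantee; the calibration data and function names are hypothetical.

```python
def pick_threshold(confidences, correct, error_budget):
    """Smallest confidence threshold whose answered set stays within the error budget."""
    for t in sorted(set(confidences)):
        # Outcomes (1 = correct, 0 = wrong) of items the model would answer at threshold t.
        answered = [c for p, c in zip(confidences, correct) if p >= t]
        errors = len(answered) - sum(answered)
        if errors / len(answered) <= error_budget:
            return t
    return float("inf")  # no threshold meets the budget: refuse everything

def decide(confidence, threshold):
    return "answer" if confidence >= threshold else "refuse"

# Hypothetical calibration set: calibrated confidences and whether each answer was correct.
cal_conf = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
cal_correct = [1, 1, 1, 0, 1, 0, 0]
threshold = pick_threshold(cal_conf, cal_correct, error_budget=0.2)
```

Picking the smallest qualifying threshold keeps coverage as high as the budget allows, which mirrors the coverage-at-fixed-risk comparison reported in the abstract.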

[NLP-69] On the Alignment of Large Language Models with Global Human Opinion

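Quick read: This paper addresses the insufficient alignment between the opinions expressed by large language models (LLMs) and human opinions across countries, languages, and historical periods. Existing studies mostly cover demographic groups in the United States or a few countries, lack worldwide samples and a historical dimension, and overlook the influence of prompt language on opinion alignment. The key to the solution is an evaluation framework built on the World Values Survey (WVS) that systematically measures how well LLMs agree with human opinions across countries, languages, and historical periods. It finds that changing the prompt language to match the language of the questionnaire steers LLMs toward the corresponding country's opinions more effectively than existing steering methods, and that LLMs align more closely with contemporary populations.

Link: https://arxiv.org/abs/2509.01418
Authors: Yang Liu, Masahiro Kaneko, Chenhui Chu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 23 pages, 19 figures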

Abstract:Today’s large language models (LLMs) are capable of supporting multilingual scenarios, allowing users to interact with LLMs in their native languages. When LLMs respond to subjective questions posed by users, they are expected to align with the views of specific demographic groups or historical periods, shaped by the language in which the user interacts with the model. Existing studies mainly focus on researching the opinions represented by LLMs among demographic groups in the United States or a few countries, lacking worldwide country samples and studies on human opinions in different historical periods, as well as lacking discussion on using language to steer LLMs. Moreover, they also overlook the potential influence of prompt language on the alignment of LLMs’ opinions. In this study, our goal is to fill these gaps. To this end, we create an evaluation framework based on the World Values Survey (WVS) to systematically assess the alignment of LLMs with human opinions across different countries, languages, and historical periods around the world. We find that LLMs appropriately or over-align the opinions with only a few countries while under-aligning the opinions with most countries. Furthermore, changing the language of the prompt to match the language used in the questionnaire can effectively steer LLMs to align with the opinions of the corresponding country more effectively than existing steering methods. At the same time, LLMs are more aligned with the opinions of the contemporary population. To our knowledge, our study is the first comprehensive investigation of the topic of opinion alignment in LLMs across global, language, and temporal dimensions. Our code and data are publicly available at this https URL.

[NLP-70] Vis-CoT: A Human-in-the-Loop Framework for Interactive Visualization and Intervention in LLM Chain-of-Thought Reasoning

Quick read: This paper addresses the opacity of chain-of-thought (CoT) reasoning in large language models (LLMs), which makes verification, debugging, and control difficult in high-stakes settings. The key to the solution is Vis-CoT, a framework that converts linear CoT text into an interactive reasoning graph. Users can visualize the logical flow, identify flawed steps, and intervene by pruning incorrect paths and grafting user-defined premises. This shifts interaction from passive observation to active collaboration and improves the accuracy and trustworthiness of the reasoning results.

Link: https://arxiv.org/abs/2509.01412
Authors: Kaviraj Pather, Elena Hadjigeorgiou, Arben Krasniqi, Claire Schmit, Irina Rusu, Marc Pons, Kabir Khan
Affiliations: University of Mauritius; Cyprus University of Technology; University of Prishtina "Hasan Prishtina"; University of Luxembourg; Technical University of Moldova; University of Andorra; San Francisco State University
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 7 figures

Abstract:Large language models (LLMs) show strong reasoning via chain-of-thought (CoT) prompting, but the process is opaque, which makes verification, debugging, and control difficult in high-stakes settings. We present Vis-CoT, a human-in-the-loop framework that converts linear CoT text into an interactive reasoning graph. Users can visualize the logical flow, identify flawed steps, and intervene by pruning incorrect paths and grafting new, user-defined premises. This shifts interaction from passive observation to active collaboration, steering models toward more accurate and trustworthy conclusions. Across GSM8K and StrategyQA, Vis-CoT improves final-answer accuracy by up to 24 percentage points over non-interactive baselines. A user study also shows large gains in perceived usability and trust. Vis-CoT points to a practical path for more reliable, understandable, and collaborative reasoning by combining LLMs with targeted human oversight.

[NLP-71] ArabEmoNet: A Lightweight Hybrid 2D CNN-BiLSTM Model with Attention for Robust Arabic Speech Emotion Recognition

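Quick read: This paper addresses the limited performance of speech emotion recognition (SER) for low-resource languages such as Arabic, caused by scarce data and limited research. The key to the solution is ArabEmoNet, a lightweight neural architecture that takes Mel spectrograms as input and uses 2D convolutions to extract subtle emotional features in the time-frequency domain. Compared with earlier approaches based on discrete Mel-frequency cepstral coefficients (MFCC) and 1D convolutions, it better preserves emotional cues. ArabEmoNet uses about 1 million parameters, far fewer than large models such as HuBERT base and Whisper, balancing accuracy with low computational cost and making it suitable for deployment in resource-constrained environments.

Link: https://arxiv.org/abs/2509.01401
Authors: Ali Abouzeid, Bilal Elbouardi, Mohamed Maged, Shady Shehata
Affiliations: Mohamed bin Zayed University of Artificial Intelligence; University of Waterloo
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted (The Third Arabic Natural Language Processing Conference)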

Abstract:Speech emotion recognition is vital for human-computer interaction, particularly for low-resource languages like Arabic, which face challenges due to limited data and research. We introduce ArabEmoNet, a lightweight architecture designed to overcome these limitations and deliver state-of-the-art performance. Unlike previous systems relying on discrete MFCC features and 1D convolutions, which miss nuanced spectro-temporal patterns, ArabEmoNet uses Mel spectrograms processed through 2D convolutions, preserving critical emotional cues often lost in traditional methods. While recent models favor large-scale architectures with millions of parameters, ArabEmoNet achieves superior results with just 1 million parameters, 90 times smaller than HuBERT base and 74 times smaller than Whisper. This efficiency makes it ideal for resource-constrained environments. ArabEmoNet advances Arabic speech emotion recognition, offering exceptional performance and accessibility for real-world applications.

[NLP-72] LLMs cannot spot math errors even when allowed to peek into the solution EMNLP2025

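Quick read: This paper addresses the difficulty large language models (LLMs) have in locating the first erroneous step in a student's solution to a math word problem, even when the reference solution is provided. The key to the solution is generating an intermediate corrected version of the student's solution that stays closer to the student's original reasoning, which improves the model's accuracy in locating the first error step.

Link: https://arxiv.org/abs/2509.01395
Authors: KV Aditya Srivatsa, Kaushal Kumar Maurya, Ekaterina Kochmar
Affiliations: Mohamed bin Zayed University of Artificial Intelligence
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to EMNLP 2025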

Abstract:Large language models (LLMs) demonstrate remarkable performance on math word problems, yet they have been shown to struggle with meta-reasoning tasks such as identifying errors in student solutions. In this work, we investigate the challenge of locating the first error step in stepwise solutions using two error reasoning datasets: VtG and PRM800K. Our experiments show that state-of-the-art LLMs struggle to locate the first error step in student solutions even when given access to the reference solution. To that end, we propose an approach that generates an intermediate corrected student solution, aligning more closely with the original student’s solution, which helps improve performance.

[NLP-73] Analysing the Language of Neural Audio Codecs

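Quick read: This paper studies how the statistical and linguistic properties of the discrete speech tokens produced by neural audio codecs (NACs) relate to speech reconstruction quality and semantic preservation. The key to the solution is a systematic analysis of whether NAC token sequences follow linguistic statistical laws such as Zipf's law and Heaps' law, together with measurements of their entropy and redundancy, and of how these token-level properties correlate with fidelity metrics such as automatic speech recognition (ASR) error rate and UTMOS score. The study finds that NAC tokens, particularly 3-grams, show language-like statistics, and that these properties, together with information content measures, correlate with better performance on speech recognition and resynthesis, which can inform the design of generative speech models.

Link: https://arxiv.org/abs/2509.01390
Authors: Joonyong Park, Shinnosuke Takamichi, David M. Chan, Shunsuke Kando, Yuki Saito, Hiroshi Saruwatari
Affiliations: The University of Tokyo; Keio University; University of California, Berkeley
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: In Proceedings of 2025 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2025)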

Abstract:This study presents a comparative analysis of the statistical and linguistic properties of neural audio codecs (NACs). We investigate discrete speech tokens produced by various NAC models, examining their adherence to linguistic statistical laws such as Zipf’s law and Heaps’ law, as well as their entropy and redundancy. To assess how these token-level properties relate to semantic and acoustic preservation in synthesized speech, we evaluate intelligibility using error rates of automatic speech recognition, and quality using the UTMOS score. Our results reveal that NAC tokens, particularly 3-grams, exhibit language-like statistical patterns. Moreover, these properties, together with measures of information content, are found to correlate with improved performances in speech recognition and resynthesis tasks. These findings offer insights into the structure of NAC token sequences and inform the design of more effective generative speech models.
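
The token-level statistics mentioned in the abstract can be computed with a few lines of code. The sketch below is a generic illustration, not the paper's analysis code: it treats codec output as a list of integer token IDs, and estimates the Zipf exponent with an ordinary least-squares fit in log-log space, which is one common choice among several.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Sliding window of length n over the token sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def entropy_bits(items):
    # Shannon entropy of the empirical distribution, in bits.
    counts = Counter(items)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def zipf_slope(items):
    """Least-squares slope of log(frequency) against log(rank); needs >= 2 distinct items."""
    freqs = sorted(Counter(items).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

codec_tokens = [3, 7, 7, 1, 3, 7, 7, 1, 3, 7, 2, 1]  # hypothetical codec token IDs
trigram_entropy = entropy_bits(ngrams(codec_tokens, 3))
```

A slope near -1 indicates a Zipf-like rank-frequency curve; comparing that slope and the n-gram entropy across codecs is the kind of measurement the study relates to ASR error rate and UTMOS.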

[NLP-74] ABCD-LINK: Annotation Bootstrapping for Cross-Document Fine-Grained Links

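Quick read: This paper addresses the difficulty of automatically annotating and evaluating cross-document links; the lack of efficient methods for building training and evaluation datasets limits progress on automated assistance. The core solution is a domain-agnostic framework that first generates and validates semi-synthetic datasets of interconnected documents, uses them to evaluate linking approaches automatically and shortlist the best performers, and then runs an extensive human evaluation to estimate performance on natural text pairs. Applied to two domains (peer review and news), combining retrieval models with LLMs achieves 78% link approval from human raters, more than doubling the precision of strong retrievers alone. The framework and the resulting datasets support systematic study of cross-document understanding.

Link: https://arxiv.org/abs/2509.01387
Authors: Serwar Basch, Ilia Kuznetsov, Tom Hope, Iryna Gurevych
Affiliations: Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science and Hessian Center for AI (hessian.AI), TU Darmstadt; Hebrew University of Jerusalem
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: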

Abstract:Understanding fine-grained relations between documents is crucial for many application domains. However, the study of automated assistance is limited by the lack of efficient methods to create training and evaluation datasets of cross-document links. To address this, we introduce a new domain-agnostic framework for selecting a best-performing approach and annotating cross-document links in a new domain from scratch. We first generate and validate semi-synthetic datasets of interconnected documents. This data is used to perform automatic evaluation, producing a shortlist of best-performing linking approaches. These approaches are then used in an extensive human evaluation study, yielding performance estimates on natural text pairs. We apply our framework in two distinct domains – peer review and news – and show that combining retrieval models with LLMs achieves 78% link approval from human raters, more than doubling the precision of strong retrievers alone. Our framework enables systematic study of cross-document understanding across application scenarios, and the resulting novel datasets lay foundation for numerous cross-document tasks like media framing and peer review. We make the code, data, and annotation protocols openly available.

[NLP-75] WATCHED: A Web AI Agent Tool for Combating Hate Speech by Expanding Data

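Quick read: This paper addresses the growth of hate speech in digital spaces, which threatens user safety and erodes trust in social media platforms. Automated systems struggle to combine accuracy with explainability, while human review alone cannot scale to large content volumes. The paper proposes WATCHED, a chatbot for content moderators built as an AI agent system driven by large language models (LLMs) and several specialised tools: a BERT-based classifier to flag harmful content, dictionary lookup for slang and informal language, chain-of-thought reasoning to produce the decision rationale, and platform guidelines to ground explanations. This combination lets the system detect hate speech and explain its judgments based on precedent and policy, improving transparency and trust. Experiments report a macro F1 score of 0.91, surpassing existing state-of-the-art methods.

Link: https://arxiv.org/abs/2509.01379
Authors: Paloma Piot, Diego Sánchez, Javier Parapar
Affiliations: IRLab, CITIC, Universidade da Coruña, Spain
Categories: Computation and Language (cs.CL)
Comments: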

Abstract:Online harms are a growing problem in digital spaces, putting user safety at risk and reducing trust in social media platforms. One of the most persistent forms of harm is hate speech. To address this, we need tools that combine the speed and scale of automated systems with the judgment and insight of human moderators. These tools should not only find harmful content but also explain their decisions clearly, helping to build trust and understanding. In this paper, we present WATCHED, a chatbot designed to support content moderators in tackling hate speech. The chatbot is built as an Artificial Intelligence Agent system that uses Large Language Models along with several specialised tools. It compares new posts with real examples of hate speech and neutral content, uses a BERT-based classifier to help flag harmful messages, looks up slang and informal language using sources like Urban Dictionary, generates chain-of-thought reasoning, and checks platform guidelines to explain and support its decisions. This combination allows the chatbot not only to detect hate speech but to explain why content is considered harmful, grounded in both precedent and policy. Experimental results show that our proposed method surpasses existing state-of-the-art methods, reaching a macro F1 score of 0.91. Designed for moderators, safety teams, and researchers, the tool helps reduce online harms by supporting collaboration between AI and human oversight.

[NLP-76] Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic

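Quick read: This paper addresses the reliance of large language models on costly optimization, such as reinforcement learning, to master complex reasoning tasks. The core solution is a reasoning vector that can be extracted and transferred. It is computed as the parameter difference between two identically initialized Qwen2.5 models fine-tuned in different ways: $v_{\text{reason}} = \theta_{\text{GRPO}} - \theta_{\text{SFT}}$, where GRPO denotes fine-tuning with group relative policy optimization and SFT denotes supervised fine-tuning. The vector is hypothesized to capture the reasoning capability instilled by reinforcement learning while factoring out knowledge shared with SFT. Adding it to the parameters of other compatible instruction-tuned models improves several reasoning benchmarks (for example, +4.9% on GSM8K), and the gains persist under adversarial conditions. This shows that reasoning capability can be reused through inexpensive tensor arithmetic instead of repeated costly training.

Link: https://arxiv.org/abs/2509.01363
Authors: Mohammad Zbeeb, Hasan Abed Al Kader Hammoud, Bernard Ghanem
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Under Review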

Abstract:Large language models often require costly optimization, such as reinforcement learning, to master complex reasoning tasks. This work demonstrates that reasoning ability, once learned, can be extracted and transferred between models as a compact task vector. We source two publicly available, identically initialized Qwen2.5 models, one fine-tuned with supervised fine-tuning (SFT) and the other with group relative policy optimization (GRPO) on the same dataset. From these, we extract a reasoning vector: $v_{\text{reason}} = \theta_{\text{GRPO}} - \theta_{\text{SFT}}$. We hypothesize that this vector captures the reasoning capability instilled by reinforcement learning while factoring out shared knowledge from the SFT process. When added to compatible instruction-tuned models through simple arithmetic, this vector consistently improves performance across diverse reasoning benchmarks: GSM8K (+4.9%), HumanEval (+4.3%), SciQ (+1.7%), and BigBenchHard (+12.3% for the 1.5B model). The performance improvements persist under adversarial conditions. Conversely, subtracting the vector causes significant performance degradation (-11.8% on GSM8K), demonstrating the vector’s strong contribution to the model’s reasoning abilities. This work shows how reasoning capabilities, typically developed through expensive training, can be extracted from existing open-source models and reused through simple tensor arithmetic, offering a practical way to enhance models by recycling prior computational investments.
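
The task arithmetic in the abstract amounts to an element-wise subtraction followed by an element-wise addition over matching parameters. The sketch below shows this on toy parameter dictionaries; the parameter names, values, and the `scale` argument are assumptions for illustration, and real checkpoints would hold tensors rather than Python lists.

```python
def task_vector(theta_grpo, theta_sft):
    # v_reason = theta_GRPO - theta_SFT, computed per parameter.
    return {
        name: [g - s for g, s in zip(theta_grpo[name], theta_sft[name])]
        for name in theta_grpo
    }

def apply_vector(theta_target, vector, scale=1.0):
    # Add the scaled vector to a compatible model; scale=-1.0 subtracts it.
    return {
        name: [w + scale * v for w, v in zip(theta_target[name], vector[name])]
        for name in theta_target
    }

# Toy stand-ins for three checkpoints that share one architecture.
theta_sft = {"layer.weight": [1.0, 2.0]}
theta_grpo = {"layer.weight": [1.5, 1.0]}
theta_target = {"layer.weight": [0.0, 4.0]}

v_reason = task_vector(theta_grpo, theta_sft)
enhanced = apply_vector(theta_target, v_reason)
ablated = apply_vector(theta_target, v_reason, scale=-1.0)
```

The subtraction only makes sense when the two source models share an initialization and architecture, and the addition only when the target model is compatible with them, which is why the abstract stresses "identically initialized" and "compatible" models.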

[NLP-77] LLM-Guided Semantic Relational Reasoning for Multimodal Intent Recognition EMNLP2025

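Quick read: This paper addresses the heavy modality-level reliance of existing multimodal intent understanding methods, which constrains relational reasoning over fine-grained semantics and hurts recognition of complex intents. The key to the solution is LLM-Guided Semantic Relational Reasoning (LGSRR). It uses a large language model (LLM) to extract fine-grained semantics as guidance, driven by a shallow-to-deep Chain-of-Thought (CoT) that automatically uncovers, describes, and ranks semantic cues by importance without manually defined priors. It also formally models three types of semantic relations grounded in logical principles and analyzes their interplay, which strengthens the relational reasoning of smaller models.

Link: https://arxiv.org/abs/2509.01337
Authors: Qianrui Zhou, Hua Xu, Yifan Wang, Xinzhi Dong, Hanlei Zhang
Affiliations: Tsinghua University; Hebei University of Science and Technology
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by EMNLP 2025 (Main Track, Long Paper)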

Abstract:Understanding human intents from multimodal signals is critical for analyzing human behaviors and enhancing human-machine interactions in real-world scenarios. However, existing methods exhibit limitations in their modality-level reliance, constraining relational reasoning over fine-grained semantics for complex intent understanding. This paper proposes a novel LLM-Guided Semantic Relational Reasoning (LGSRR) method, which harnesses the expansive knowledge of large language models (LLMs) to establish semantic foundations that boost smaller models’ relational reasoning performance. Specifically, an LLM-based strategy is proposed to extract fine-grained semantics as guidance for subsequent reasoning, driven by a shallow-to-deep Chain-of-Thought (CoT) that autonomously uncovers, describes, and ranks semantic cues by their importance without relying on manually defined priors. Besides, we formally model three fundamental types of semantic relations grounded in logical principles and analyze their nuanced interplay to enable more effective relational reasoning. Extensive experiments on multimodal intent and dialogue act recognition tasks demonstrate LGSRR’s superiority over state-of-the-art methods, with consistent performance gains across diverse semantic understanding scenarios. The complete data and code are available at this https URL.

[NLP-78] Can Large Language Models Master Complex Card Games?

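[Quick Read]: This paper studies how well large language models (LLMs) can learn and generalize in complex card games: whether fine-tuning can bring them to the level of dedicated game AIs, and whether mastering specific games damages their general capabilities. The solution has three key points. First, supervised fine-tuning on high-quality gameplay data lets LLMs approach the performance of strong game AIs. Second, multi-task learning lets a single LLM master several card games at once, where games with similar rules reinforce each other and games with dissimilar rules conflict. Third, mixing a certain proportion of general instruction data into training mitigates the decline in general capabilities that comes with specializing in these games.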

Link: https://arxiv.org/abs/2509.01328
Authors: Wei Wang, Fuqing Bie, Junzhe Chen, Dan Zhang, Shiyu Huang, Evgeny Kharlamov, Jie Tang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Complex games have long been an important benchmark for testing the progress of artificial intelligence algorithms. AlphaGo, AlphaZero, and MuZero have defeated top human players in Go and Chess, garnering widespread societal attention towards artificial intelligence. Concurrently, large language models (LLMs) have exhibited remarkable capabilities across various tasks, raising the question of whether LLMs can achieve similar success in complex games. In this paper, we explore the potential of LLMs in mastering complex card games. We systematically assess the learning capabilities of LLMs across eight diverse card games, evaluating the impact of fine-tuning on high-quality gameplay data, and examining the models’ ability to retain general capabilities while mastering these games. Our findings indicate that: (1) LLMs can approach the performance of strong game AIs through supervised fine-tuning on high-quality data, (2) LLMs can master multiple complex card games simultaneously, with performance augmentation for games with similar rules and conflicts for dissimilar ones, and (3) LLMs experience a decline in general capabilities when mastering complex games, but this decline can be mitigated by integrating a certain amount of general instruction data. The evaluation results demonstrate strong learning ability and versatility of LLMs.

[NLP-79] KoBLEX: Open Legal Question Answering with Multi-hop Reasoning EMNLP2025

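[Quick Read]: This paper addresses the failure of existing legal benchmarks for large language models (LLMs) to evaluate open-ended, provision-grounded question answering (QA), and in particular multi-hop legal reasoning. The authors introduce the Korean Benchmark for Legal EXplainable QA (KoBLEX), which contains 226 scenario-based QA instances with their supporting provisions, and propose a method called Parametric provision-guided Selection Retrieval (ParSeR). The key to ParSeR is using LLM-generated parametric provisions as retrieval guidance, combined with a three-stage sequential retrieval process, to produce more reliable, legally grounded multi-hop reasoning. In the experiments, ParSeR with GPT-4o improves over standard retrieval by +37.91 F1 and +30.81 on Legal Fidelity Evaluation (LF-Eval), and its performance stays consistent across reasoning depths.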

Link: https://arxiv.org/abs/2509.01324
Authors: Jihyung Lee, Daehui Kim, Seonjeong Hwang, Hyounghun Kim, Gary Lee
Affiliations: Graduate School of Artificial Intelligence, POSTECH, Republic of Korea; Department of Computer Science and Engineering, POSTECH, Republic of Korea; AI Future Lab, KT, Republic of Korea
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2025 Main Conference

Abstract:Large Language Models (LLMs) have achieved remarkable performance in general domains and are now extending into the expert domain of law. Several benchmarks have been proposed to evaluate LLMs’ legal capabilities. However, these benchmarks fail to evaluate open-ended and provision-grounded Question Answering (QA). To address this, we introduce a Korean Benchmark for Legal EXplainable QA (KoBLEX), designed to evaluate provision-grounded, multi-hop legal reasoning. KoBLEX includes 226 scenario-based QA instances and their supporting provisions, created using a hybrid LLM-human expert pipeline. We also propose a method called Parametric provision-guided Selection Retrieval (ParSeR), which uses LLM-generated parametric provisions to guide legally grounded and reliable answers. ParSeR facilitates multi-hop reasoning on complex legal questions by generating parametric provisions and employing a three-stage sequential retrieval process. Furthermore, to better evaluate the legal fidelity of the generated answers, we propose Legal Fidelity Evaluation (LF-Eval). LF-Eval is an automatic metric that jointly considers the question, answer, and supporting provisions and shows a high correlation with human judgments. Experimental results show that ParSeR consistently outperforms strong baselines, achieving the best results across multiple LLMs. Notably, compared to standard retrieval with GPT-4o, ParSeR achieves +37.91 higher F1 and +30.81 higher LF-Eval. Further analyses reveal that ParSeR efficiently delivers consistent performance across reasoning depths, with ablations confirming the effectiveness of ParSeR.

[NLP-80] LongCat-Flash Technical Report

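[Quick Read]: This paper targets the trade-off between computational efficiency and agentic capabilities in large-scale language models: how to build a model that is efficient yet capable of complex tasks, given high training cost, high inference latency, and inflexible resource allocation. It has two core techniques. The first is Zero-computation Experts, which allocate the computational budget dynamically, activating 18.6B to 31.3B parameters per token (about 27B on average) depending on contextual demands. The second is Shortcut-connected MoE, which enlarges the computation-communication overlap window and improves inference efficiency and throughput. Combined with a system-level scaling framework (hyperparameter transfer, model-growth initialization, a multi-pronged stability suite, and deterministic computation), these designs allow training on more than 20 trillion tokens within 30 days and inference at over 100 tokens per second at a cost of $0.70 per million output tokens, together with strong performance on agentic tasks.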

Link: https://arxiv.org/abs/2509.01322
Authors: Meituan LongCat Team, Bayan, Bei Li, Bingye Lei, Bo Wang, Bolin Rong, Chao Wang, Chao Zhang, Chen Gao, Chen Zhang, Cheng Sun, Chengcheng Han, Chenguang Xi, Chi Zhang, Chong Peng, Chuan Qin, Chuyu Zhang, Cong Chen, Congkui Wang, Dan Ma, Daoru Pan, Defei Bu, Dengchang Zhao, Deyang Kong, Dishan Liu, Feiye Huo, Fengcun Li, Fubao Zhang, Gan Dong, Gang Liu, Gang Xu, Ge Li, Guoqiang Tan, Guoyuan Lin, Haihang Jing, Haomin Fu, Haonan Yan, Haoxing Wen, Haozhe Zhao, Hong Liu, Hongmei Shi, Hongyan Hao, Hongyin Tang, Huantian Lv, Hui Su, Jiacheng Li, Jiahao Liu, Jiahuan Li, Jiajun Yang, Jiaming Wang, Jian Yang, Jianchao Tan, Jiaqi Sun, Jiaqi Zhang, Jiawei Fu, Jiawei Yang, Jiaxi Hu, Jiayu Qin, Jingang Wang, Jiyuan He, Jun Kuang, Junhui Mei, Kai Liang, Ke He, Kefeng Zhang, Keheng Wang, Keqing He, Liang Gao, Liang Shi, Lianhui Ma, Lin Qiu, Lingbin Kong, Lingtong Si, Linkun Lyu, Linsen Guo, Liqi Yang, Lizhi Yan, Mai Xia, Man Gao, Manyuan Zhang, Meng Zhou, Mengxia Shen, Mingxiang Tuo, Mingyang Zhu, Peiguang Li, Peng Pei, Peng Zhao, Pengcheng Jia, Pingwei Sun, Qi Gu, Qianyun Li, Qingyuan Li, Qiong Huang, Qiyuan Duan, Ran Meng, Rongxiang Weng, Ruichen Shao, Rumei Li, Shizhe Wu, Shuai Liang
Affiliations: Meituan LongCat Team
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:

Abstract:We introduce LongCat-Flash, a 560-billion-parameter Mixture-of-Experts (MoE) language model designed for both computational efficiency and advanced agentic capabilities. Stemming from the need for scalable efficiency, LongCat-Flash adopts two novel designs: (a) Zero-computation Experts, which enables dynamic computational budget allocation and activates 18.6B-31.3B (27B on average) per token depending on contextual demands, optimizing resource usage. (b) Shortcut-connected MoE, which enlarges the computation-communication overlap window, demonstrating notable gains in inference efficiency and throughput compared to models of a comparable scale. We develop a comprehensive scaling framework for large models that combines hyperparameter transfer, model-growth initialization, a multi-pronged stability suite, and deterministic computation to achieve stable and reproducible training. Notably, leveraging the synergy among scalable architectural design and infrastructure efforts, we complete model training on more than 20 trillion tokens within 30 days, while achieving over 100 tokens per second (TPS) for inference at a cost of $0.70 per million output tokens. To cultivate LongCat-Flash towards agentic intelligence, we conduct a large-scale pre-training on optimized mixtures, followed by targeted mid- and post-training on reasoning, code, and instructions, with further augmentation from synthetic data and tool use tasks. Comprehensive evaluations demonstrate that, as a non-thinking foundation model, LongCat-Flash delivers highly competitive performance among other leading models, with exceptional strengths in agentic tasks. The model checkpoint of LongCat-Flash is open-sourced to foster community research.
LongCat Chat: this https URL Hugging Face: this https URL GitHub: this https URL
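
The idea behind Zero-computation Experts can be illustrated with a toy router. This is a minimal sketch, not the LongCat-Flash implementation: it assumes a zero-computation expert simply returns its input, and the names `make_ffn_expert`, `zero_expert`, and `moe_forward`, as well as the softmax-over-top-k weighting, are hypothetical choices made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def make_ffn_expert(scale: float) -> Expert:
    """Stand-in for a real feed-forward expert (here just a scaling)."""
    return lambda x: [scale * v for v in x]


def zero_expert(x: Vector) -> Vector:
    """Assumed zero-computation expert: passes the token through unchanged."""
    return list(x)


def moe_forward(x: Vector, logits: Sequence[float],
                experts: Sequence[Expert], top_k: int) -> Tuple[Vector, int]:
    """Route x to the top_k experts; return the output and how many real experts ran."""
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    weights = [math.exp(logits[i]) for i in top]
    total = sum(weights)
    out = [0.0] * len(x)
    computed = 0
    for i, w in zip(top, weights):
        if experts[i] is not zero_expert:
            computed += 1  # only real experts cost FFN compute
        y = experts[i](x)
        out = [o + (w / total) * v for o, v in zip(out, y)]
    return out, computed


experts = [make_ffn_expert(2.0), make_ffn_expert(3.0), zero_expert, zero_expert]
x = [1.0, -2.0]
# Toy router logits: the first token prefers the zero experts, the second the real ones.
easy_out, easy_cost = moe_forward(x, [0.0, 0.1, 2.0, 1.9], experts, top_k=2)
hard_out, hard_cost = moe_forward(x, [2.0, 1.5, 0.0, 0.1], experts, top_k=2)
```

With the same `top_k`, the first token runs no real expert and the second runs two, which is the sense in which the activated parameter count can vary per token.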

[NLP-81] Towards High Data Efficiency in Reinforcement Learning with Verifiable Reward

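[Quick Read]: This paper addresses the low data efficiency and high computational cost of training large reasoning models with Reinforcement Learning with Verifiable Rewards (RLVR), where existing methods typically depend on extensive rollout computation and large datasets. The key to its solution is Data-Efficient Policy Optimization (DEPO), a pipeline built on two mechanisms. In the offline phase, it curates a high-quality subset of training samples based on diversity, influence, and appropriate difficulty. During online RLVR training, it introduces a sample-level explorability metric to dynamically filter out samples with low exploration potential, which reduces rollout cost, and adds a replay mechanism that revisits under-explored samples to improve final convergence. Across five reasoning benchmarks, DEPO outperforms existing methods; using only 20% of the training data, it achieves a 1.85x speed-up on AIME24 and a 1.66x speed-up on AIME25 compared with GRPO trained on the full dataset.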

Link: https://arxiv.org/abs/2509.01321
Authors: Xinyu Tang, Zhenduo Zhang, Yurou Liu, Wayne Xin Zhao, Zujie Wen, Zhiqiang Zhang, Jun Zhou
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Recent advances in large reasoning models have leveraged reinforcement learning with verifiable rewards (RLVR) to improve reasoning capabilities. However, scaling these methods typically requires extensive rollout computation and large datasets, leading to high training costs and low data efficiency. To mitigate this issue, we propose DEPO, a Data-Efficient Policy Optimization pipeline that combines optimized strategies for both offline and online data selection. In the offline phase, we curate a high-quality subset of training samples based on diversity, influence, and appropriate difficulty. During online RLVR training, we introduce a sample-level explorability metric to dynamically filter samples with low exploration potential, thereby reducing substantial rollout computational costs. Furthermore, we incorporate a replay mechanism for under-explored samples to ensure adequate training, which enhances the model’s final convergence performance. Experiments across five reasoning benchmarks show that DEPO consistently outperforms existing methods in both offline and online data selection scenarios. Notably, using only 20% of the training data, our approach achieves a 1.85 times speed-up on AIME24 and a 1.66 times speed-up on AIME25 compared to GRPO trained on the full dataset.

[NLP-82] Can Smaller LLMs do better? Unlocking Cross-Domain Potential through Parameter-Efficient Fine-Tuning for Text Summarization

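[Quick Read]: This paper addresses the difficulty of adapting large language models (LLMs) to low-resource domains, especially when labeled data is unavailable and conventional fine-tuning is computationally expensive and time-consuming. The key to its solution is parameter-efficient fine-tuning (PEFT): adapters trained on high-resource datasets are transferred to unseen low-resource domains to improve performance there. The experiments show that, for low-resource domains, inference with Within-Domain Adapters can outperform Few-Shot prompting and even the much larger Llama-3-70B-Instruct. When no Within-Domain Adapter is available, the paper explores Cross-Domain Adapters and strategic combinations of adapters, which exploit intrinsic linguistic commonalities across domains to improve adaptability in low-resource settings.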

Link: https://arxiv.org/abs/2509.01314
Authors: Anum Afzal, Mehul Kumawat, Florian Matthes
Affiliations: Technical University of Munich
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs), being generic task solvers, are versatile. However, despite the vast amount of data they are trained on, there are speculations about their adaptation capabilities to a new domain. Additionally, the simple fine-tuning of the model to incorporate knowledge of a new domain is computationally expensive and time-consuming. This becomes more challenging when the domain in question is also low-resource, and labeled data is unavailable. We leverage parameter-efficient fine-tuning techniques (PEFTs) on high-resource datasets to address these challenges to improve performance on unseen low-resource domains. Throughout our experiments, we evaluate whether intrinsic linguistic commonalities between datasets can be leveraged for efficient domain adaptation. We benchmark six PEFTs with Llama-3-8B-Instruct on 14 training datasets from the Scientific, Medical, Legal, and News domains for a Text Summarization task. Our experiments show that for low-resource domains, inference using Within-Domain Adapters can achieve better performance than Few-Shot as well as a much larger Llama-3-70B-Instruct. Lastly, in the absence of Within-Domain Adapters, we explore the concept of using Cross-Domain Adapters as well as the strategic combinations of adapters to leverage intrinsic language similarities across domains, facilitating better adaptability and performance in low-resource settings.

[NLP-83] TableZoomer: A Collaborative Agent Framework for Large-scale Table Question Answering

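[Quick Read]: This paper addresses three challenges that large language models (LLMs) face in table question answering (TQA): structural heterogeneity, difficulty localizing target data, and bottlenecks in complex reasoning. The authors propose TableZoomer, an LLM-powered, programming-based agent framework with three core innovations: (1) replacing the fully verbalized table with a structured table schema to bridge the semantic gap and reduce computational complexity; (2) a query-aware table zooming mechanism that dynamically generates a sub-table schema through column selection and entity linking, improving target localization efficiency; and (3) a Program-of-Thoughts (PoT) strategy that transforms queries into executable code to mitigate numerical hallucination, integrated with the ReAct paradigm for iterative reasoning. The framework improves performance and scalability across tables of varying scales.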

Link: https://arxiv.org/abs/2509.01312
Authors: Sishi Xiong, Ziyang He, Zhongjiang He, Yu Zhao, Changzai Pan, Jie Zhang, Zhenhe Wu, Shuangyong Song, Yongxiang Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:While large language models (LLMs) have shown promise in the table question answering (TQA) task through prompt engineering, they face challenges in industrial applications, including structural heterogeneity, difficulties in target data localization, and bottlenecks in complex reasoning. To address these limitations, this paper presents TableZoomer, a novel LLM-powered, programming-based agent framework. It introduces three key innovations: (1) replacing the original fully verbalized table with structured table schema to bridge the semantic gap and reduce computational complexity; (2) a query-aware table zooming mechanism that dynamically generates sub-table schema through column selection and entity linking, significantly improving target localization efficiency; and (3) a Program-of-Thoughts (PoT) strategy that transforms queries into executable code to mitigate numerical hallucination. Additionally, we integrate the reasoning workflow with the ReAct paradigm to enable iterative reasoning. Extensive experiments demonstrate that our framework maintains the usability advantages while substantially enhancing performance and scalability across tables of varying scales. When implemented with the Qwen3-8B-Instruct LLM, TableZoomer achieves accuracy improvements of 19.34% and 25% over conventional PoT methods on the large-scale DataBench dataset and the small-scale Fact Checking task of TableBench dataset, respectively.
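
The first two ideas in the abstract, a structured table schema and query-aware zooming, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the TableZoomer code: the helpers `table_schema` and `zoom` are hypothetical, the inventory rows are made-up example data, and column selection is reduced to a plain substring match where the paper uses an LLM with entity linking.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]
Schema = List[Dict[str, Any]]


def table_schema(rows: List[Row], n_examples: int = 2) -> Schema:
    """Summarize a table as column name, value type, and a few example values."""
    if not rows:
        return []
    return [
        {
            "name": col,
            "type": type(rows[0][col]).__name__,
            "examples": [row[col] for row in rows[:n_examples]],
        }
        for col in rows[0]
    ]


def zoom(schema: Schema, query: str) -> Schema:
    """Toy column selection: keep columns whose name is mentioned in the query."""
    q = query.lower()
    return [col for col in schema if col["name"].lower() in q]


# Made-up example table.
rows = [
    {"product": "pen", "price": 1.5, "stock": 40},
    {"product": "notebook", "price": 12.0, "stock": 15},
    {"product": "stapler", "price": 8.25, "stock": 7},
]
schema = table_schema(rows)
sub_schema = zoom(schema, "What is the total stock per product?")
```

The schema stays the same size however many rows the table has, which is why passing it to a model instead of the verbalized table reduces prompt length on large tables.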

[NLP-84] GradeSQL: Outcome Reward Models for Ranking SQL Queries from Large Language Models

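[Quick Read]: This paper addresses the difficulty that current large language models (LLMs) have on Text-to-SQL with complex queries, which require precise alignment between user intent and the database schema. Test-time strategies such as Best-of-N (BoN) and Majority Voting (Maj) raise accuracy by sampling multiple candidates, but they rely on surface-level heuristics (executability or generation frequency) and do not capture semantic correctness. The key to the solution is to use Outcome Reward Models (ORMs), which assign utility scores to generated SQL queries based on semantic correctness, as the selection criterion for BoN. ORMs outperform ex-BoN and Maj, with execution accuracy gains over ex-BoN of +4.33% on BIRD and +2.10% on Spider, and over Maj of +2.91% on BIRD and +0.93% on Spider; they also achieve competitive results on simple queries and benefit more from a larger number of candidates.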

Link: https://arxiv.org/abs/2509.01308
Authors: Mattia Tritto, Giuseppe Farano, Dario Di Palma, Gaetano Rossiello, Fedelucio Narducci, Dharmashankar Subramanian, Tommaso Di Noia
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)
Comments:

Abstract:Text-to-SQL, the task of translating natural language questions into SQL queries, has significantly advanced with the introduction of Large Language Models (LLMs), broadening database accessibility for a wide range of users. Despite substantial progress in generating valid SQL, current LLMs still struggle with complex queries that require precise alignment between user intent and the database schema. To mitigate this, test-time strategies such as Best-of-N (BoN) and Majority Voting (Maj) are often employed, based on the assumption that LLMs can generate correct answers but may require multiple attempts. However, these methods rely on surface-level heuristics, selecting either the syntactically correct query through execution-based BoN (ex-BoN) or the most frequently generated query with Maj. Recently, Outcome Reward Models (ORMs), which assign utility scores to generated outputs based on semantic correctness, have emerged as a promising approach for better aligning model predictions with user intent. Nevertheless, their application to Text-to-SQL remains largely underexplored. In this work, we evaluate ORMs as an effective heuristic for BoN, compare them with ex-BoN and Maj, and introduce a framework for training ORMs for the Text-to-SQL task. We evaluate our ORMs on the BIRD and SPIDER benchmarks, finetuning various open-source LLMs, including the Qwen2, Granite3, and Llama3 model families. Our results show that ORMs outperform ex-BoN and Maj, achieving execution accuracy gains of +4.33% (BIRD) and +2.10% (Spider) over ex-BoN, and +2.91% (BIRD) and +0.93% (Spider) over Maj. We further demonstrate that finetuning models already aligned with SQL generation, such as OmniSQL, yields superior ORM performance. Additionally, we observe that ORMs achieve competitive results on simple queries and benefit more from an increased number of candidates compared to ex-BoN and Maj. 
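
The three selection strategies the abstract compares differ only in how they pick one query from the sampled candidates. The sketch below follows the abstract's descriptions and is not the GradeSQL code: the function names are hypothetical, and `toy_scores` is a hand-written stand-in for the scores a trained ORM would produce.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def ex_bon(candidates: Sequence[str],
           runs_ok: Callable[[str], bool]) -> Optional[str]:
    """Execution-based BoN: the first candidate that executes without error."""
    return next((q for q in candidates if runs_ok(q)), None)


def majority_vote(candidates: Sequence[str]) -> str:
    """Maj: the most frequently generated candidate."""
    return Counter(candidates).most_common(1)[0][0]


def orm_bon(candidates: Sequence[str],
            score: Callable[[str], float]) -> str:
    """ORM-based BoN: the candidate the reward model scores highest."""
    return max(candidates, key=score)


# Toy data: the sampler repeats a plausible but incomplete query twice.
candidates = [
    "SELECT name FROM users",
    "SELECT name FROM users",
    "SELECT name FROM users WHERE active = 1",
]
# Hypothetical reward scores standing in for a trained ORM.
toy_scores = {candidates[0]: 0.31, candidates[2]: 0.87}

picked_by_maj = majority_vote(candidates)
picked_by_exbon = ex_bon(candidates, runs_ok=lambda q: True)
picked_by_orm = orm_bon(candidates, score=toy_scores.__getitem__)
```

In this toy case every candidate executes, so ex-BoN and Maj both return the repeated query, while the scorer can prefer the less frequent one. That is the failure mode of surface-level heuristics that the paper targets.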

[NLP-85] Culture is Everywhere: A Call for Intentionally Cultural Evaluation

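[Quick Read]: This paper addresses the limitations of the prevailing “trivia-centered paradigm” for evaluating the cultural alignment of large language models (LLMs). That paradigm reduces culture to static facts or values and tests models with multiple-choice or short-answer questions, neglecting the pluralistic and interactive nature of culture and the way cultural assumptions permeate even ostensibly “neutral” evaluation settings. The key to the solution is intentionally cultural evaluation, an approach that systematically examines the cultural assumptions embedded in all aspects of evaluation rather than only in explicitly cultural tasks. The paper emphasizes the importance of researcher positionality for inclusive, culturally aligned natural language processing (NLP) research, and advocates involving communities in evaluation design through participatory methods inspired by human-computer interaction (HCI), so as to move beyond current benchmarking practices and discover important applications that are not yet known.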

Link: https://arxiv.org/abs/2509.01301
Authors: Juhyun Oh, Inha Cha, Michael Saxon, Hyunseung Lim, Shaily Bhatt, Alice Oh
Affiliations: KAIST; Georgia Institute of Technology; University of Washington; Carnegie Mellon University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The prevailing “trivia-centered paradigm” for evaluating the cultural alignment of large language models (LLMs) is increasingly inadequate as these models become more advanced and widely deployed. Existing approaches typically reduce culture to static facts or values, testing models via multiple-choice or short-answer questions that treat culture as isolated trivia. Such methods neglect the pluralistic and interactive realities of culture, and overlook how cultural assumptions permeate even ostensibly “neutral” evaluation settings. In this position paper, we argue for intentionally cultural evaluation: an approach that systematically examines the cultural assumptions embedded in all aspects of evaluation, not just in explicitly cultural tasks. We systematically characterize the what, how, and circumstances by which culturally contingent considerations arise in evaluation, and emphasize the importance of researcher positionality for fostering inclusive, culturally aligned NLP research. Finally, we discuss implications and future directions for moving beyond current benchmarking practices, discovering important applications that we don’t know exist, and involving communities in evaluation design through HCI-inspired participatory methodologies.

[NLP-86] Annotation and modeling of emotions in a textual corpus: an evaluative approach

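[Quick Read]: This paper addresses annotation inconsistency in textual emotion labeling: human annotations of emotion show significant disagreement, and it is unclear whether that disagreement nonetheless follows stable statistical trends. The key to the solution is to train language models on a corpus annotated under an evaluative theoretical framework in order to model the annotation process. The results show that annotation variability is driven by underlying linguistic features, and that language models appear able to distinguish emotional situations based on evaluative criteria.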

Link: https://arxiv.org/abs/2509.01260
Authors: Jonas Noblet (LIDILEM)
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: In French. 27ème Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL), Jun 2025, Marseille, France

Abstract:Emotion is a crucial phenomenon in the functioning of human beings in society. However, it remains a widely open subject, particularly in its textual manifestations. This paper examines an industrial corpus manually annotated following an evaluative approach to emotion. This theoretical framework, which is currently underutilized, offers a different perspective that complements traditional approaches. Noting that the annotations we collected exhibit significant disagreement, we hypothesized that they nonetheless follow stable statistical trends. Using language models trained on these annotations, we demonstrate that it is possible to model the labeling process and that variability is driven by underlying linguistic features. Conversely, our results indicate that language models seem capable of distinguishing emotional situations based on evaluative criteria.

[NLP-87] Rethinking the Chain-of-Thought: The Roles of In-Context Learning and Pre-trained Priors

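[Quick Read]: This paper addresses the unclear working mechanisms of Chain-of-Thought (CoT) reasoning, viewed through the dual relationship between in-context learning and pretrained priors. The approach has three parts: a fine-grained lexical-level analysis of rationales, the incremental introduction of noisy exemplars to examine how the model balances pretrained priors against erroneous in-context information, and an investigation of whether prompt engineering can induce large language models to produce longer reasoning chains. The experiments show that the model quickly learns the reasoning structure at the lexical level and grasps deeper logical reasoning patterns, yet relies heavily on pretrained priors; that providing sufficient exemplars shifts its decision-making from pretrained priors to in-context signals, while misleading prompts introduce instability; and that long CoT prompting induces longer reasoning chains and thereby improves performance on downstream tasks.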

Link: https://arxiv.org/abs/2509.01236
Authors: Hao Yang, Zhiyu Yang, Yunjie Zhang, Shanyi Zhu, Lin Yang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Chain-of-Thought reasoning has emerged as a pivotal methodology for enhancing model inference capabilities. Despite growing interest in Chain-of-Thought reasoning, its underlying mechanisms remain unclear. This paper explores the working mechanisms of Chain-of-Thought reasoning from the perspective of the dual relationship between in-context learning and pretrained priors. We first conduct a fine-grained lexical-level analysis of rationales to examine the model’s reasoning behavior. Then, by incrementally introducing noisy exemplars, we examine how the model balances pretrained priors against erroneous in-context information. Finally, we investigate whether prompt engineering can induce slow thinking in large language models. Our extensive experiments reveal three key findings: (1) The model not only quickly learns the reasoning structure at the lexical level but also grasps deeper logical reasoning patterns, yet it heavily relies on pretrained priors. (2) Providing sufficient exemplars shifts the model’s decision-making from pretrained priors to in-context signals, while misleading prompts introduce instability. (3) Long Chain-of-Thought prompting can induce the model to generate longer reasoning chains, thereby improving its performance on downstream tasks.

[NLP-88] DaMoC: Efficiently Selecting the Optimal Large Language Model for Fine-tuning Domain Tasks Based on Data and Model Compression EMNLP2025

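[Quick Read]: This paper addresses how to efficiently select the best open-source large language model (LLM) to fine-tune for a domain-specific task. The core challenge is quickly identifying the model best suited to the downstream task without the trial-and-error cost of fully fine-tuning every candidate. The key to the solution is the Data and Model Compression Framework (DaMoC), which works at two levels. At the data level, it systematically categorizes data filtering methods into distribution-aware, quality-aware, and hybrid approaches, compresses text by increasing the density of key tokens, and uses an LLM to iteratively rewrite the text to improve its expression. At the model level, it uses layer similarity scores to assess each layer's importance and removes the less important layers, then applies a sparse merging paradigm to preserve as much of the original model's capability as possible. Experiments show that the method selects the optimal LLM while saving approximately 20-fold in training time.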

Link: https://arxiv.org/abs/2509.01221
Authors: Wei Huang, Huang Wei, Yinggui Wang
Affiliations: Ant Group
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by EMNLP 2025

Abstract:Large language models (LLMs) excel in general tasks but struggle with domain-specific ones, requiring fine-tuning with specific data. With many open-source LLMs available, selecting the best model for fine-tuning downstream tasks is challenging, primarily focusing on how to quickly identify the optimal LLM. We introduce a Data and Model Compression Framework (DaMoC) that addresses this challenge by: 1) Data Level: A systematic categorization of data filtering methodologies for LLMs is first established, classifying them into three distinct paradigms: (1) distribution-aware methods, (2) quality-aware methods, and (3) hybrid approaches considering both dimensions. Further, we enhance the density of key tokens in the text achieving token compression. Subsequently, we use an LLM to iteratively rewrite the text to optimize its expression. 2) Model Level: We use layer similarity scores to assess each layer’s importance and remove those with lower importance. Then, we introduce a sparse merging paradigm to preserve as much of the original model’s capability as possible. Extensive experiments on four datasets, medical QA, financial QA, general QA, and reading comprehension, show that we can select the optimal LLM while saving approximately 20-fold in training time.
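
The model-level step, scoring layers by similarity and dropping the least important ones, might look roughly like the sketch below. The abstract does not define the similarity score, so this sketch makes a loud assumption: a layer whose output is nearly parallel to its input (high cosine similarity) changes the representation little and is treated as less important. The function names and the toy hidden states are hypothetical, and the sparse merging step is not shown.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def layer_importance(hidden_states: Sequence[Vector]) -> List[float]:
    """hidden_states[i] is the input to layer i and hidden_states[i + 1] its output.

    Assumed score: 1 - cosine(input, output), so layers that barely change
    the representation get an importance close to 0.
    """
    return [
        1.0 - cosine(hidden_states[i], hidden_states[i + 1])
        for i in range(len(hidden_states) - 1)
    ]


def layers_to_keep(hidden_states: Sequence[Vector], n_remove: int) -> List[int]:
    """Indices of the layers left after dropping the n_remove least important ones."""
    importance = layer_importance(hidden_states)
    ranked = sorted(range(len(importance)), key=lambda i: importance[i])
    dropped = set(ranked[:n_remove])
    return [i for i in range(len(importance)) if i not in dropped]


# Toy hidden states for a 3-layer model; layer 1 leaves its input unchanged.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]]
importance = layer_importance(hidden_states)
kept = layers_to_keep(hidden_states, n_remove=1)
```

In practice the hidden states would be averaged over a calibration set rather than taken from a single vector per layer; that choice is also an assumption and is left out of the sketch.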

[NLP-89] Mitigating Catastrophic Forgetting in Continual Learning through Model Growth

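[Quick Read]: This paper addresses catastrophic forgetting in continual learning, in which a model loses previously learned knowledge when it is fine-tuned on new tasks, a problem that threatens the cross-domain utility of large language models (LLMs). The key to the solution is model growth, a strategy that uses smaller models to structure and speed up the training of larger ones, here via transformer stacking during pretraining, with the aim of improving retention of earlier knowledge. The experiments show that the Stack LLM trained with growth degrades less than the baseline LLM, especially in reading comprehension, indicating a modest advantage in resisting forgetting, though trade-offs remain in handling social biases.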

Link: https://arxiv.org/abs/2509.01213
Authors: Ege Süalp, Mina Rezaei
Affiliations: Ludwig-Maximilian University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Catastrophic forgetting is a significant challenge in continual learning, in which a model loses prior knowledge when it is fine-tuned on new tasks. This problem is particularly critical for large language models (LLMs) undergoing continual learning, as retaining performance across diverse domains is important for their general utility. In this paper, we explore model growth, a promising strategy that leverages smaller models to expedite and structure the training of larger ones for mitigating the catastrophic forgetting problem. Although growth-based pretraining, particularly via transformer stacking, has shown promise in accelerating convergence, its impact on forgetting remains under-explored. Therefore, we evaluate whether growth-based models can retain previously learned capabilities more effectively across a sequence of fine-tuning tasks involving domain knowledge, reasoning, reading comprehension, and bias. Our findings show that both models – one trained with growth (Stack LLM) and one without (LLM) – exhibit improvements in domain knowledge. However, reasoning and reading comprehension degrade over time, indicating signs of catastrophic forgetting. Stack LLM consistently shows less degradation, especially in reading comprehension, suggesting enhanced retention capabilities. Interestingly, in bias evaluation, the baseline LLM becomes progressively more neutral with continued fine-tuning, while Stack LLM maintains a steady bias ratio around 60–61%. These results indicate that growth-based pretraining may deliver modest improvements in resisting catastrophic forgetting, though trade-offs remain in handling social biases.

[NLP-90] SimulMEGA: MoE Routers are Advanced Policy Makers for Simultaneous Speech Translation

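[Quick Read]: This paper addresses the difficulty of balancing translation quality, latency, and semantic coherence in Simultaneous Speech Translation (SimulST), particularly in multilingual many-to-many scenarios where divergent read and write policies hinder unified strategy learning. The key to the solution is SimulMEGA (Simultaneous Generation by Mixture-of-Experts Gating), an unsupervised policy learning framework that combines prefix-based training with a Mixture-of-Experts refiner to learn effective read and write decisions implicitly, without adding inference-time overhead. The design requires only minimal modifications to standard Transformer architectures and generalizes across both speech-to-text and text-to-speech streaming tasks.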

Link: https://arxiv.org/abs/2509.01200
Authors: Chenyang Le, Bing Han, Jinshun Li, Songyong Chen, Yanmin Qian
Affiliations: Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Simultaneous Speech Translation (SimulST) enables real-time cross-lingual communication by jointly optimizing speech recognition and machine translation under strict latency constraints. Existing systems struggle to balance translation quality, latency, and semantic coherence, particularly in multilingual many-to-many scenarios where divergent read and write policies hinder unified strategy learning. In this paper, we present SimulMEGA (Simultaneous Generation by Mixture-of-Experts Gating), an unsupervised policy learning framework that combines prefix-based training with a Mixture-of-Experts refiner to learn effective read and write decisions in an implicit manner, without adding inference-time overhead. Our design requires only minimal modifications to standard transformer architectures and generalizes across both speech-to-text and text-to-speech streaming tasks. Through comprehensive evaluation on six language pairs, our 500M parameter speech-to-text model outperforms the Seamless baseline, achieving under 7 percent BLEU degradation at 1.5 seconds average lag and under 3 percent at 3 seconds. We further demonstrate the versatility of SimulMEGA by extending it to streaming TTS with a unidirectional backbone, yielding superior latency quality tradeoffs.

[NLP-91] Efficient Large Language Models with Zero-Shot Adjustable Acceleration

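[Quick Read]: This paper addresses the difficulty of balancing computational efficiency and performance when using large language models (LLMs) in real-world applications, in particular optimizing acceleration after fine-tuning and during inference. The key to the solution is Zero-Shot Adjustable Acceleration, a training and inference method that dynamically adjusts hardware usage during inference without additional fine-tuning, so that execution speed can be tuned without retraining. Across multiple classification and text generation tasks, the method enables a wide range of acceleration in a zero-shot manner, with up to an 11x speedup over the baseline.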

Link: https://arxiv.org/abs/2509.01190
Authors: Sajjad Kachuee, Mohammad Sharifkhani
Affiliations: Sharif University of Technology
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Using Large Language Models (LLMs) in real-world applications presents significant challenges, particularly in balancing computational efficiency and performance. Optimizing acceleration after the fine-tuning phase and during inference is crucial for building an efficient architecture. This paper introduces Zero-Shot Adjustable Acceleration, a novel training and inference method that dynamically adjusts hardware usage during inference without requiring additional fine-tuning. The proposed approach is applied to newly developed models and evaluated across multiple classification and text generation tasks. Experimental results demonstrate that the method enables a wide range of acceleration in a zero-shot manner and achieves up to an 11x speedup compared to the baseline.

[NLP-92] Statutory Construction and Interpretation for Artificial Intelligence

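[Quick Read]: This paper addresses the inconsistent or unstable behavior that arises in AI alignment from the interpretive ambiguity of natural language rules. Current alignment pipelines lack the institutional safeguards that legal systems use to constrain interpretation (such as transparent appellate review), so the same rule can be read differently in different contexts, which undermines reliability. Drawing on legal theory, the paper proposes a computational framework with two parts: a rule refinement pipeline that iteratively revises ambiguous rules to reduce interpretive disagreement (analogous to agency rulemaking or iterative legislative action), and prompt-based interpretive constraints that reduce inconsistency in rule application (analogous to legal canons of interpretation). Experiments on a 5,000-scenario subset of the WildChat dataset show that both interventions significantly improve judgment consistency across a panel of reasonable interpreters, a first step toward systematically managing interpretive ambiguity in more robust, law-following AI systems.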

Link: https://arxiv.org/abs/2509.01186
Authors: Luxi He, Nimra Nadeem, Michel Liao, Howard Chen, Danqi Chen, Mariano-Florentino Cuéllar, Peter Henderson
Affiliations: Princeton University; Carnegie Endowment for International Peace
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:AI systems are increasingly governed by natural language principles, yet a key challenge arising from reliance on language remains underexplored: interpretive ambiguity. As in legal systems, ambiguity arises both from how these principles are written and how they are applied. But while legal systems use institutional safeguards to manage such ambiguity, such as transparent appellate review policing interpretive constraints, AI alignment pipelines offer no comparable protections. Different interpretations of the same rule can lead to inconsistent or unstable model behavior. Drawing on legal theory, we identify key gaps in current alignment pipelines by examining how legal systems constrain ambiguity at both the rule creation and rule application steps. We then propose a computational framework that mirrors two legal mechanisms: (1) a rule refinement pipeline that minimizes interpretive disagreement by revising ambiguous rules (analogous to agency rulemaking or iterative legislative action), and (2) prompt-based interpretive constraints that reduce inconsistency in rule application (analogous to legal canons that guide judicial discretion). We evaluate our framework on a 5,000-scenario subset of the WildChat dataset and show that both interventions significantly improve judgment consistency across a panel of reasonable interpreters. Our approach offers a first step toward systematically managing interpretive ambiguity, an essential step for building more robust, law-following AI systems.

[NLP-93] Modular Techniques for Synthetic Long-Context Data Generation in Language Model Training and Evaluation

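[Quick read]: This paper addresses a bottleneck for large language models (LLMs) in processing and reasoning over long inputs: the lack of high-quality, diverse, and verifiable long-context datasets, which constrains both training and evaluation. The key to the solution is a modular, extensible framework for synthetic long-context data generation through prompt-based interaction with LLMs. It supports several training objectives, including Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO), and covers four core generation paradigms: multi-turn conversational dialogues, document-grounded input-output pairs, verifiable instruction-response tasks, and long-context reasoning examples. Templated prompting, a model-agnostic architecture, and metadata-enriched outputs make dataset creation scalable, controllable, and purpose-aligned.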

Link: https://arxiv.org/abs/2509.01185
Authors: Seganrasan Subramanian, Abhigya Verma
Affiliations: ServiceNow
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 4 figures

Abstract:The ability of large language models (LLMs) to process and reason over long textual inputs is critical for a wide range of real-world applications. However, progress in this area is significantly constrained by the absence of high-quality, diverse, and verifiable long-context datasets suitable for both training and evaluation. This work introduces a modular, extensible framework for synthetic long-context data generation via prompt-based interaction with LLMs. The framework supports multiple training and alignment objectives, including Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). It encompasses four core generation paradigms: multi-turn conversational dialogues, document-grounded input-output pairs, verifiable instruction-response tasks, and long-context reasoning examples. Through templated prompting, a model-agnostic architecture, and metadata-enriched outputs, the proposed approach facilitates scalable, controllable, and purpose-aligned dataset creation for advancing long-context capabilities in LLMs.

[NLP-94] Question-to-Knowledge: Multi-Agent Generation of Inspectable Facts for Product Mapping

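[Quick read]: This paper addresses how to decide whether two e-commerce product listings refer to the same Stock Keeping Unit (SKU) when explicit identifiers are missing, in particular the misclassifications caused by widely varying product names and ambiguous brand or bundle information. The key to the solution is Question to Knowledge (Q2K), a multi-agent framework with three components: (1) a Reasoning Agent that generates targeted disambiguation questions, (2) a Knowledge Agent that answers them through focused web searches, and (3) a Deduplication Agent that reuses validated reasoning traces to reduce redundancy and keep decisions consistent. A human-in-the-loop mechanism handles uncertain cases. By reusing retrieved reasoning instead of repeating searches, the method improves efficiency while maintaining accuracy, giving a scalable and interpretable approach to product integration.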

Link: https://arxiv.org/abs/2509.01182
Authors: Wonduk Seo, Taesub Shin, Hyunjin An, Dokyun Kim, Seunghyun Lee
Affiliations: Enhans
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
Comments: Preprint

Abstract:Identifying whether two product listings refer to the same Stock Keeping Unit (SKU) is a persistent challenge in ecommerce, especially when explicit identifiers are missing and product names vary widely across platforms. Rule based heuristics and keyword similarity often misclassify products by overlooking subtle distinctions in brand, specification, or bundle configuration. To overcome these limitations, we propose Question to Knowledge (Q2K), a multi agent framework that leverages Large Language Models (LLMs) for reliable SKU mapping. Q2K integrates: (1) a Reasoning Agent that generates targeted disambiguation questions, (2) a Knowledge Agent that resolves them via focused web searches, and (3) a Deduplication Agent that reuses validated reasoning traces to reduce redundancy and ensure consistency. A human in the loop mechanism further refines uncertain cases. Experiments on real world consumer goods datasets show that Q2K surpasses strong baselines, achieving higher accuracy and robustness in difficult scenarios such as bundle identification and brand origin disambiguation. By reusing retrieved reasoning instead of issuing repeated searches, Q2K balances accuracy with efficiency, offering a scalable and interpretable solution for product integration.
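
The reuse step described in the abstract above can be illustrated with a small cache. This is a minimal sketch, not the Q2K implementation: `TraceCache`, `normalize`, and the exact-match lookup on a normalized question string are hypothetical names and choices made for illustration, and the paper's Deduplication Agent presumably matches questions with an LLM rather than by string equality.

```python
def normalize(question):
    """Collapse case and whitespace so trivially different questions share a key."""
    return " ".join(question.lower().split())


class TraceCache:
    """Hypothetical store of validated answers, keyed by normalized question."""

    def __init__(self, search_fn):
        self._search = search_fn   # stand-in for a focused web search
        self._traces = {}
        self.searches = 0          # how many real searches were issued

    def resolve(self, question):
        key = normalize(question)
        if key not in self._traces:
            # Cache miss: pay for one search and remember the result.
            self.searches += 1
            self._traces[key] = self._search(question)
        return self._traces[key]
```

With this shape, a second listing that raises the same disambiguation question is answered from the stored trace, which is the accuracy-versus-efficiency trade the abstract points to.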

[NLP-95] Do Video Language Models Really Know Where to Look? Diagnosing Attention Failures in Video Language Models

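[Quick read]: This paper examines the keyframe sampling that multimodal large language models (MLLMs) rely on in video understanding to avoid the computational cost of processing every frame. Mainstream methods use vision-language encoders (such as SigLIP) to guide keyframe selection, but the paper's empirical study finds that these encoders are markedly limited in identifying the frames relevant to a textual query, so they cannot reliably direct the MLLM to the important parts of a video. The paper offers a diagnosis rather than a new method: it concludes that better keyframe identification techniques, ones that match the semantics of the query, may be necessary for efficient video MLLMs.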

Link: https://arxiv.org/abs/2509.01167
Authors: Hyunjong Ok, Jaeho Lee
Affiliations: POSTECH; HJ AILAB
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: preprint

Abstract: Recent advances in multimodal large language models (MLLMs) have led to much progress in video understanding tasks. To avoid the heavy computational cost of processing all frames, these models typically rely on keyframe sampling methods guided by vision-language encoders (e.g., SigLIP). However, it remains unclear whether such encoders can truly identify the most informative frames. In this work, we provide several empirical pieces of evidence revealing that popular vision encoders critically suffer from their limited capability to identify where the MLLM should look inside the video to handle the given textual query appropriately. Our findings suggest that the development of better keyframe identification techniques may be necessary for efficient video MLLMs.

[NLP-96] Enhancing Large Language Model for Knowledge Graph Completion via Structure-Aware Alignment-Tuning EMNLP2025

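[Quick read]: This paper addresses two problems that large language models (LLMs) face in knowledge graph completion (KGC): the representation spaces of natural language and graph structure are inconsistent, and existing methods design separate instructions for each KGC task, which duplicates work and wastes time. The core of the solution is SAT, a framework with two innovations. First, hierarchical knowledge alignment uses multi-task contrastive learning to align graph embeddings with the natural language space. Second, structural instruction tuning uses a unified graph instruction together with a lightweight knowledge adapter to guide the LLM in structure-aware reasoning. The method significantly improves KGC performance, with gains of 8.7% to 29.8% on link prediction.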

Link: https://arxiv.org/abs/2509.01166
Authors: Yu Liu, Yanan Cao, Xixun Lin, Yanmin Shang, Shi Wang, Shirui Pan
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Institute of Computing Technology, Chinese Academy of Sciences; Griffith University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2025, Main, Long Paper

Abstract:Knowledge graph completion (KGC) aims to infer new knowledge and make predictions from knowledge graphs. Recently, large language models (LLMs) have exhibited remarkable reasoning capabilities. LLM-enhanced KGC methods primarily focus on designing task-specific instructions, achieving promising advancements. However, there are still two critical challenges. First, existing methods often ignore the inconsistent representation spaces between natural language and graph structures. Second, most approaches design separate instructions for different KGC tasks, leading to duplicate works and time-consuming processes. To address these challenges, we propose SAT, a novel framework that enhances LLMs for KGC via structure-aware alignment-tuning. Specifically, we first introduce hierarchical knowledge alignment to align graph embeddings with the natural language space through multi-task contrastive learning. Then, we propose structural instruction tuning to guide LLMs in performing structure-aware reasoning over KGs, using a unified graph instruction combined with a lightweight knowledge adapter. Experimental results on two KGC tasks across four benchmark datasets demonstrate that SAT significantly outperforms state-of-the-art methods, especially in the link prediction task with improvements ranging from 8.7% to 29.8%.

[NLP-97] Joint Information Extraction Across Classical and Modern Chinese with Tea-MOELoRA

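[Quick read]: This paper addresses the interference and performance loss that arise when a single model is fine-tuned for Chinese information extraction (IE) across multiple tasks and multiple eras (Classical and Modern Chinese). The key to the solution is Tea-MOELoRA, a framework that combines Low-Rank Adaptation (LoRA) with a Mixture-of-Experts (MoE) design: several LoRA experts each specialize in particular IE tasks and eras, and a task-era-aware router dynamically weights each expert's contribution. The result is parameter-efficient multi-task learning that draws on both task and temporal knowledge.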

Link: https://arxiv.org/abs/2509.01158
Authors: Xuemei Tang, Chengxi Yan, Jinghang Gu, Chu-Ren Huang
Affiliations: The Hong Kong Polytechnic University; Renmin University of China
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 3 figures

Abstract:Chinese information extraction (IE) involves multiple tasks across diverse temporal domains, including Classical and Modern documents. Fine-tuning a single model on heterogeneous tasks and across different eras may lead to interference and reduced performance. Therefore, in this paper, we propose Tea-MOELoRA, a parameter-efficient multi-task framework that combines LoRA with a Mixture-of-Experts (MoE) design. Multiple low-rank LoRA experts specialize in different IE tasks and eras, while a task-era-aware router mechanism dynamically allocates expert contributions. Experiments show that Tea-MOELoRA outperforms both single-task and joint LoRA baselines, demonstrating its ability to leverage task and temporal knowledge effectively.
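
The combination of LoRA experts and a router can be sketched for a single linear layer as y = Wx + Σᵢ gᵢ·Bᵢ(Aᵢx). The sketch below makes several assumptions that are not in the abstract: the gates are a softmax over per-expert scores, the hypothetical `route` helper simply adds a task score and an era score, and plain Python lists stand in for tensors.

```python
import math


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(task, era, task_scores, era_scores):
    """Hypothetical router: per-expert logit = task score + era score."""
    return [t + e for t, e in zip(task_scores[task], era_scores[era])]


def moe_lora_forward(x, w, experts, gate_logits):
    """Frozen weight w plus a gate-weighted sum of low-rank updates.

    experts is a list of (A, B) pairs, A of shape r x d_in and B of shape d_out x r.
    """
    gates = softmax(gate_logits)
    out = matvec(w, x)
    for gate, (a, b) in zip(gates, experts):
        delta = matvec(b, matvec(a, x))   # low-rank update B(Ax)
        out = [o + gate * d for o, d in zip(out, delta)]
    return out, gates
```

Only the small `A` and `B` matrices and the router scores would be trained in such a setup, which is what makes the approach parameter-efficient.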

[NLP-98] Zero-shot Cross-lingual NER via Mitigating Language Difference: An Entity-aligned Translation Perspective EMNLP2025

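[Quick read]: This paper addresses the performance drop of zero-shot cross-lingual named entity recognition (ZCL-NER) on non-Latin script languages (NSL), which stems from deep structural differences between NSL and Latin script languages (LSL) that make knowledge transfer less effective. The key to the solution is an entity-aligned translation (EAT) approach that uses large language models (LLMs) in a dual-translation strategy to align entities between NSL and English, and fine-tunes the LLMs on multilingual Wikipedia data to strengthen entity alignment from source to target languages.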

Link: https://arxiv.org/abs/2509.01147
Authors: Zhihao Zhang, Sophia Yat Mei Lee, Dong Zhang, Shoushan Li, Guodong Zhou
Affiliations: Soochow University; The Hong Kong Polytechnic University
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2025

Abstract:Cross-lingual Named Entity Recognition (CL-NER) aims to transfer knowledge from high-resource languages to low-resource languages. However, existing zero-shot CL-NER (ZCL-NER) approaches primarily focus on Latin script language (LSL), where shared linguistic features facilitate effective knowledge transfer. In contrast, for non-Latin script language (NSL), such as Chinese and Japanese, performance often degrades due to deep structural differences. To address these challenges, we propose an entity-aligned translation (EAT) approach. Leveraging large language models (LLMs), EAT employs a dual-translation strategy to align entities between NSL and English. In addition, we fine-tune LLMs using multilingual Wikipedia data to enhance the entity alignment from source to target languages.

[NLP-99] Dream-Coder 7B: An Open Diffusion Language Model for Code

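[Quick read]: This paper addresses the inflexibility of traditional autoregressive (AR) models in code generation, whose strict left-to-right decoding is a poor fit for varied scenarios such as complex algorithm generation, code completion, and code understanding. The key to the solution is Dream-Coder 7B, a discrete diffusion language model with any-order generation, which lets the model choose a decoding strategy to suit the task: sketch-first generation for complex algorithms, standard left-to-right generation for simple completions, and interleaved reasoning generation for code understanding. The authors adapt a pretrained AR model to the diffusion framework with a continuous-time weighted cross-entropy objective, then post-train with supervised fine-tuning and reinforcement learning with verifiable rewards to improve generation quality and stability. The resulting model is competitive on LiveCodeBench and other benchmarks.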

Link: https://arxiv.org/abs/2509.01142
Authors: Zhihui Xie, Jiacheng Ye, Lin Zheng, Jiahui Gao, Jingwei Dong, Zirui Wu, Xueliang Zhao, Shansan Gong, Xin Jiang, Zhenguo Li, Lingpeng Kong
Affiliations: The University of Hong Kong; Huawei Noah’s Ark Lab
Categories: Computation and Language (cs.CL)

Abstract: We present Dream-Coder 7B, an open-source discrete diffusion language model for code generation that exhibits emergent any-order generation capabilities. Unlike traditional autoregressive (AR) models that decode strictly left-to-right, Dream-Coder 7B adaptively determines its decoding strategy based on the coding task: sketch-first generation for complex algorithms, left-to-right generation for straightforward completions, and interleaved reasoning generation for code understanding tasks. We adapt a pretrained AR checkpoint to a discrete diffusion framework with a continuous-time weighted cross-entropy objective. Our post-training recipe comprises (i) supervised fine-tuning, where we mitigate padding pathologies via random truncation and a padding penalty to improve sample efficiency and stabilize generation; and (ii) reinforcement learning with verifiable rewards over a curated high-quality prompt set drawn from open-source datasets, using a tailored reinforcement learning recipe for diffusion language models. The resulting Dream-Coder 7B Instruct attains 21.4% pass@1 on LiveCodeBench (2410–2505) and demonstrates competitive performance on HumanEval, MBPP, BigCodeBench, and CRUXEval. We release Dream-Coder-7B and Dream-Coder-7B-Instruct checkpoints, training recipes, preprocessing pipelines, and inference code to facilitate reproducibility and further research.

[NLP-100] Natural Context Drift Undermines the Natural Language Understanding of Large Language Models EMNLP2025

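[Quick read]: This paper asks how the question answering (QA) performance of generative large language models (LLMs) changes when the context passages evolve naturally. The core issue is that a model sees one version of a text during pretraining, while real-world text keeps changing, and that drift may affect how well the model understands and reasons over it. The key to the solution is a framework that collects naturally evolved, human-edited variants of reading passages from contemporary QA benchmarks and uses semantic similarity scores to quantify how far each variant has moved from the pretraining content, so that different LLMs can be evaluated systematically across similarity bins. The experiments show that even when the question and the required information are unchanged, LLM accuracy drops markedly as the passage diverges semantically from the pretraining version, indicating that natural text evolution is a real challenge for generalization.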

Link: https://arxiv.org/abs/2509.01093
Authors: Yulong Wu, Viktor Schlegel, Riza Batista-Navarro
Affiliations: University of Manchester; Imperial College London
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2025 Findings

Abstract: How does the natural evolution of context paragraphs affect question answering in generative Large Language Models (LLMs)? To investigate this, we propose a framework for curating naturally evolved, human-edited variants of reading passages from contemporary QA benchmarks and for analyzing LLM performance across a range of semantic similarity scores, which quantify how closely each variant aligns with content seen during pretraining. Using this framework, we evaluate six QA datasets and eight LLMs with publicly available training data. Our experiments reveal that LLM performance declines as reading passages naturally diverge from the versions encountered during pretraining, even when the question and all necessary information remain present at inference time. For instance, average model accuracy on BoolQ drops by over 30% from the highest to lowest similarity bins, with slopes exceeding 70 across several LLMs. These findings suggest that natural text evolution poses a significant challenge to the language understanding capabilities of LLMs.
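
The per-bin analysis mentioned in the abstract can be approximated as follows. This is a sketch under stated assumptions: equal-width similarity bins on [0, 1], accuracy expressed in percent, and a least-squares slope of accuracy against bin-center similarity are all illustrative choices here, and the paper's own binning and slope definition may differ. The function names are hypothetical.

```python
def bin_accuracy(records, n_bins=5):
    """Group (similarity, correct) pairs into equal-width bins.

    Returns (bin_center, accuracy_percent) for every non-empty bin.
    """
    hits = [0] * n_bins
    counts = [0] * n_bins
    for similarity, correct in records:
        # Clamp so that similarity == 1.0 falls in the last bin.
        idx = min(int(similarity * n_bins), n_bins - 1)
        counts[idx] += 1
        hits[idx] += int(correct)
    return [
        ((i + 0.5) / n_bins, 100.0 * hits[i] / counts[i])
        for i in range(n_bins)
        if counts[i]
    ]


def slope(points):
    """Ordinary least-squares slope of y on x (needs two distinct x values)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx
```

Under these assumptions, a positive slope means accuracy rises with similarity to the pretraining version, which is the direction of the effect the abstract reports.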

[NLP-101] REFRAG: Rethinking RAG based Decoding

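[Quick read]: This paper addresses the high latency, heavy key-value cache memory use, and reduced throughput that large language models (LLMs) incur when processing long-context inputs in retrieval-augmented generation (RAG), that is, the trade-off between knowledge enrichment and system efficiency. The key to the solution is recognizing the sparse attention structure of RAG contexts: retrieved passages usually have low semantic similarity to one another and only a few are relevant to the query, so much of the computation during decoding is redundant. The authors propose REFRAG, a decoding framework that compresses, senses, and expands to exploit this sparsity. It achieves a 30.85× speedup in time-to-first-token (3.75× over the previous best method) without loss in perplexity, and extends the LLM context size by 16×.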

Link: https://arxiv.org/abs/2509.01092
Authors: Xiaoqiang Lin, Aritra Ghosh, Bryan Kian Hsiang Low, Anshumali Shrivastava, Vijai Mohan
Affiliations: Meta
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in leveraging extensive external knowledge to enhance responses in multi-turn and agentic applications, such as retrieval-augmented generation (RAG). However, processing long-context inputs introduces significant system latency and demands substantial memory for the key-value cache, resulting in reduced throughput and a fundamental trade-off between knowledge enrichment and system efficiency. While minimizing latency for long-context inputs is a primary objective for LLMs, we contend that RAG requires specialized consideration. In RAG, much of the LLM context consists of concatenated passages from retrieval, with only a small subset directly relevant to the query. These passages often exhibit low semantic similarity due to diversity or deduplication during re-ranking, leading to block-diagonal attention patterns that differ from those in standard LLM generation tasks. Based on this observation, we argue that most computations over the RAG context during decoding are unnecessary and can be eliminated with minimal impact on performance. To this end, we propose REFRAG, an efficient decoding framework that compresses, senses, and expands to improve latency in RAG applications. By exploiting the sparsity structure, we demonstrate a 30.85× time-to-first-token acceleration (a 3.75× improvement over previous work) without loss in perplexity. In addition, our optimization framework for large context enables REFRAG to extend the context size of LLMs by 16×. We provide rigorous validation of REFRAG across diverse long-context tasks, including RAG, multi-turn conversations, and long document summarization, spanning a wide range of datasets. Experimental results confirm that REFRAG delivers substantial speedup with no loss in accuracy compared to LLaMA models and other state-of-the-art baselines across various context sizes.

[NLP-102] Privacy-Preserving Reasoning with Knowledge-Distilled Parametric Retrieval Augmented Generation

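[Quick read]: This paper addresses two challenges for parametric retrieval-augmented generation (Parametric RAG, PRAG) in privacy-preserving reasoning. First, PRAG must synthesize QA pairs and fine-tune the large language model (LLM) separately for every document to produce its LoRA parameters, which makes inference latency too high. Second, PRAG depends on synthetic QA data and is not aligned with standard RAG in document structure or parameter activation, so it generalizes poorly to out-of-distribution (OOD) inputs. The key to the solution is DistilledPRAG, which uses knowledge distillation to obtain an efficient and generalizable parametric representation: it synthesizes QA pairs from single and multiple documents to strengthen cross-document reasoning; it masks the plaintext documents with a special token and maps them to LoRA through a parameter generator, preserving the standard RAG document structure; and, guided by the synthetic QA data, it trains the parameter generator so that hidden states and output logits match those of standard RAG, enabling RAG-style reasoning without the original documents.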

Link: https://arxiv.org/abs/2509.01088
Authors: Jinwen Chen, Hainan Zhang, Liang Pang, Yongxin Tong, Haibo Zhou, Yuan Zhan, Wei Lin, Zhiming Zheng
Affiliations: Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing; School of Artificial Intelligence, Beihang University; Institute of Computing Technology, Chinese Academy of Sciences; Meituan
Categories: Computation and Language (cs.CL)

Abstract: The current RAG system requires uploading plaintext documents to the cloud, risking private data leakage. Parametric RAG (PRAG) addresses this by encoding documents as LoRA within LLMs, enabling reasoning without exposing raw content. However, it still faces two issues: (1) PRAG demands synthesizing QA pairs and fine-tuning the LLM for each individual document to create its corresponding LoRA, leading to unacceptable inference latency. (2) The performance of PRAG relies solely on synthetic QA data, lacking internal alignment with standard RAG, resulting in poor generalization on out-of-distribution (OOD) inputs. Therefore, achieving high-efficiency parameterization while maintaining RAG-level performance remains a critical challenge for privacy-preserving reasoning. In this paper, we propose DistilledPRAG, a generalizable knowledge-distilled parametric RAG model aligned with standard RAG in document structure and parameter activation. We first synthesize QA pairs from single and multi-documents to enhance cross-document reasoning. Then, we mask the plaintext documents with a special token and translate them to LoRA via a parameter generator, maintaining the standard RAG document structure. Finally, guided by synthetic QA data, we train the parameter generator to match standard RAG’s hidden states and output logits, enabling RAG-style reasoning without original documents. Experiments on four QA datasets show that DistilledPRAG outperforms baselines in accuracy and generalizes well on OOD data.

[NLP-103] A Paradigm Gap in Urdu

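[Quick read]: This paper documents a paradigm gap in the combinations of verbs and aspect in Urdu and Hindi: the perfective form of the "-ya: kar" construction is sharply ungrammatical in the modern languages even though it is widely attested in 19th-century texts. The key to the account is a morphosyntactic conflict. The construction requires a nominative subject and an invariant participle, whereas the perfective of a transitive verb must assign ergative case to the agent. The two requirements clash, which made the perfective form unstable; other constructions took over its function, and the gap became entrenched in the modern grammar.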

Link: https://arxiv.org/abs/2509.01084
Authors: Farah Adeeba, Rajesh Bhatt
Affiliations: Unknown
Categories: Computation and Language (cs.CL)

Abstract:In this paper, we document a paradigm gap in the combinatorial possibilities of verbs and aspect in Urdu: the perfective form of the -ya: kar construction (e.g. ro-ya: ki: cry-Pfv this http URL) is sharply ungrammatical in modern Urdu and Hindi, despite being freely attested in 19th century literature. We investigate this diachronic shift through historical text analysis, a large-scale corpus study which confirms the stark absence of perfective forms and subjective evaluation tasks with native speakers, who judge perfective examples as highly unnatural. We argue that this gap arose from a fundamental morphosyntactic conflict: the construction’s requirement for a nominative subject and an invariant participle clashes with the core grammatical rule that transitive perfective assign ergative case. This conflict rendered the perfective form unstable, and its functional replacement by other constructions allowed the gap to become entrenched in the modern grammar.

[NLP-104] Assessing Large Language Models on Islamic Legal Reasoning: Evidence from Inheritance Law Evaluation

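[Quick read]: This paper evaluates the knowledge and reasoning ability of large language models (LLMs) in Islamic inheritance law ('ilm al-mawarith), with a focus on whether they can compute the correct distribution of shares in complex, structured legal scenarios. The key to the solution is a benchmark of 1,000 multiple-choice questions covering diverse inheritance scenarios, which allows a systematic assessment of how well different LLMs understand and apply the inheritance rules of Islamic jurisprudence. A detailed error analysis identifies recurring failure patterns, including misunderstanding of the inheritance scenario, incorrect application of legal rules, and insufficient domain knowledge, and points to directions for improving LLMs on Islamic legal reasoning.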

Link: https://arxiv.org/abs/2509.01081
Authors: Abdessalam Bouchekif, Samer Rashwani, Heba Sbahi, Shahd Gaben, Mutez Al-Khatib, Mohammed Ghaly
Affiliations: Hamad Bin Khalifa University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 7 Tables, Code: this https URL

Abstract:This paper evaluates the knowledge and reasoning capabilities of Large Language Models in Islamic inheritance law, known as ‘ilm al-mawarith. We assess the performance of seven LLMs using a benchmark of 1,000 multiple-choice questions covering diverse inheritance scenarios, designed to test models’ ability to understand the inheritance context and compute the distribution of shares prescribed by Islamic jurisprudence. The results reveal a significant performance gap: o3 and Gemini 2.5 achieved accuracies above 90%, whereas ALLaM, Fanar, LLaMA, and Mistral scored below 50%. These disparities reflect important differences in reasoning ability and domain adaptation. We conduct a detailed error analysis to identify recurring failure patterns across models, including misunderstandings of inheritance scenarios, incorrect application of legal rules, and insufficient domain knowledge. Our findings highlight limitations in handling structured legal reasoning and suggest directions for improving performance in Islamic legal reasoning. Code: this https URL

[NLP-105] Speaking at the Right Level: Literacy-Controlled Counterspeech Generation with RAG-RL EMNLP2025

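[Quick read]: This paper addresses the public health threat posed by health misinformation spreading online, and in particular the fact that existing methods for automatically generating counterspeech tend to produce uniform responses that ignore how the audience's health literacy affects accessibility and effectiveness. The key to the solution is a Controlled-Literacy framework that combines retrieval-augmented generation (RAG) with reinforcement learning (RL): it retrieves knowledge matched to a specific health literacy level to support generation, and it uses a reward function that combines subjective user preferences with objective readability measures to optimize the counterspeech for the target literacy level.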

Link: https://arxiv.org/abs/2509.01058
Authors: Xiaoying Song, Anirban Saha Anik, Dibakar Barua, Pengcheng Luo, Junhua Ding, Lingzi Hong
Affiliations: University of North Texas; Peking University
Categories: Computation and Language (cs.CL)
Comments: Accepted at Findings of EMNLP 2025

Abstract:Health misinformation spreading online poses a significant threat to public health. Researchers have explored methods for automatically generating counterspeech to health misinformation as a mitigation strategy. Existing approaches often produce uniform responses, ignoring that the health literacy level of the audience could affect the accessibility and effectiveness of counterspeech. We propose a Controlled-Literacy framework using retrieval-augmented generation (RAG) with reinforcement learning (RL) to generate tailored counterspeech adapted to different health literacy levels. In particular, we retrieve knowledge aligned with specific health literacy levels, enabling accessible and factual information to support generation. We design a reward function incorporating subjective user preferences and objective readability-based rewards to optimize counterspeech to the target health literacy level. Experiment results show that Controlled-Literacy outperforms baselines by generating more accessible and user-preferred counterspeech. This research contributes to more equitable and impactful public health communication by improving the accessibility and comprehension of counterspeech to health misinformation.
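
A reward that mixes a preference signal with a readability signal might look like the sketch below. Everything here is an illustrative assumption rather than the paper's reward: the Flesch-Kincaid grade formula is standard, but the vowel-group syllable counter is a rough heuristic, and `readability_reward`, `combined_reward`, the `tolerance` window, and the `alpha` weight are hypothetical. The `preference` argument is assumed to be a score in [0, 1] from some user-preference model.

```python
import re


def count_syllables(word):
    """Rough heuristic: one syllable per run of vowels, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fk_grade(text):
    """Flesch-Kincaid grade level with the heuristic syllable counter."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def readability_reward(text, target_grade, tolerance=6.0):
    """1.0 at the target grade, falling linearly to 0.0 at `tolerance` grades away."""
    gap = abs(fk_grade(text) - target_grade)
    return max(0.0, 1.0 - gap / tolerance)


def combined_reward(text, preference, target_grade, alpha=0.5):
    """Convex mix of a preference score in [0, 1] and the readability reward."""
    return alpha * preference + (1.0 - alpha) * readability_reward(text, target_grade)
```

With a low target grade, short plain sentences score higher than dense clinical phrasing at equal preference, which is the behavior a literacy-controlled reward is meant to encourage.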

[NLP-106] VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use

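[Quick read]: This paper addresses three problems in current Agentic Reinforcement Learning with Tool use (ARLT) systems: fragmentation caused by task-specific codebases, performance bottlenecks from synchronous execution, and limited extensibility across domains. The authors propose VerlTool, a unified and modular framework whose main contributions are: (1) upstream alignment with VeRL for compatibility and simpler maintenance; (2) unified tool management through standardized APIs covering code execution, search, SQL databases, and vision processing; (3) asynchronous rollout execution that yields a speedup of nearly 2×; and (4) a comprehensive evaluation across six domains (mathematical reasoning, knowledge QA, SQL generation, visual reasoning, web search, and software engineering) showing competitive performance. The framework formalizes ARLT as multi-turn trajectories with multi-modal observation tokens (text/image/video), going beyond the single-turn RLVR paradigm, and its lightweight Python plugin mechanism supports rapid tool integration, providing a scalable foundation for tool-augmented RL research.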

Link: https://arxiv.org/abs/2509.01055
Authors: Dongfu Jiang, Yi Lu, Zhuofeng Li, Zhiheng Lyu, Ping Nie, Haozhe Wang, Alex Su, Hui Chen, Kai Zou, Chao Du, Tianyu Pang, Wenhu Chen
Affiliations: University of Waterloo; Sea AI Lab; University of Toronto; Shanghai University; HKUST; National University of Singapore; NetMind.AI; Independent
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 32 pages, 5 figures, 13 tables

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated success in enhancing LLM reasoning capabilities, but remains limited to single-turn interactions without tool integration. While recent Agentic Reinforcement Learning with Tool use (ARLT) approaches have emerged to address multi-turn tool interactions, existing works develop task-specific codebases that suffer from fragmentation, synchronous execution bottlenecks, and limited extensibility across domains. These inefficiencies hinder broader community adoption and algorithmic innovation. We introduce VerlTool, a unified and modular framework that addresses these limitations through systematic design principles. VerlTool provides four key contributions: (1) upstream alignment with VeRL ensuring compatibility and simplified maintenance, (2) unified tool management via standardized APIs supporting diverse modalities including code execution, search, SQL databases, and vision processing, (3) asynchronous rollout execution achieving near 2× speedup by eliminating synchronization bottlenecks, and (4) comprehensive evaluation demonstrating competitive performance across 6 ARLT domains. Our framework formalizes ARLT as multi-turn trajectories with multi-modal observation tokens (text/image/video), extending beyond single-turn RLVR paradigms. We train and evaluate models on mathematical reasoning, knowledge QA, SQL generation, visual reasoning, web search, and software engineering tasks, achieving results comparable to specialized systems while providing unified training infrastructure. The modular plugin architecture enables rapid tool integration requiring only lightweight Python definitions, significantly reducing development overhead and providing a scalable foundation for tool-augmented RL research. Our code is open-sourced at this https URL.
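
The idea behind asynchronous rollouts can be shown with `asyncio`: each trajectory waits only on its own tool call instead of the whole batch advancing turn by turn in lock-step. This is a toy sketch, not VerlTool's API. `fake_tool`, `rollout`, and `run_batch` are hypothetical names, the tool is simulated with `asyncio.sleep`, and a formatted string stands in for the policy's generated action.

```python
import asyncio


async def fake_tool(call, delay):
    """Stand-in for a tool server request that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return f"result({call})"


async def rollout(traj_id, n_turns, delay):
    """One multi-turn trajectory: act, wait for the tool, observe, repeat."""
    observations = []
    for turn in range(n_turns):
        action = f"traj{traj_id}-turn{turn}"   # placeholder for policy output
        observations.append(await fake_tool(action, delay))
    return observations


async def run_batch(specs):
    """Run all trajectories concurrently; specs is a list of (n_turns, delay)."""
    tasks = [rollout(i, n_turns, delay) for i, (n_turns, delay) in enumerate(specs)]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    print(asyncio.run(run_batch([(2, 0.01), (1, 0.02)])))
```

Because `asyncio.gather` interleaves the coroutines, a slow tool call in one trajectory does not block the others, which is the kind of synchronization bottleneck the abstract says asynchronous execution removes.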

[NLP-107] A Dynamic Fusion Model for Consistent Crisis Response EMNLP2025

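[Quick read]: This paper addresses the style consistency of responses generated by language models in crisis communication, an often overlooked factor that directly affects how much affected people trust responders. Existing methods offer little support for keeping style uniform across responses, so generated content can be inconsistent. The key to the solution is a new metric for evaluating style consistency and a fusion-based generation approach built on it. The approach has two stages: it first assesses the style of candidate responses, then optimizes and integrates them at the instance level through a fusion process, which keeps response quality high while significantly reducing stylistic variation between responses.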

Link: https://arxiv.org/abs/2509.01053
Authors: Xiaoying Song, Anirban Saha Anik, Eduardo Blanco, Vanessa Frias-Martinez, Lingzi Hong
Affiliations: University of North Texas; University of Arizona; University of Maryland
Categories: Computation and Language (cs.CL)
Comments: Accepted at Findings of EMNLP 2025, 10 pages, 5 figures

Abstract:In response to the urgent need for effective communication with crisis-affected populations, automated responses driven by language models have been proposed to assist in crisis communications. A critical yet often overlooked factor is the consistency of response style, which could affect the trust of affected individuals in responders. Despite its importance, few studies have explored methods for maintaining stylistic consistency across generated responses. To address this gap, we propose a novel metric for evaluating style consistency and introduce a fusion-based generation approach grounded in this metric. Our method employs a two-stage process: it first assesses the style of candidate responses and then optimizes and integrates them at the instance level through a fusion process. This enables the generation of high-quality responses while significantly reducing stylistic variation between instances. Experimental results across multiple datasets demonstrate that our approach consistently outperforms baselines in both response quality and stylistic uniformity.

[NLP-108] FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games EMNLP2025

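[Quick read]: This paper addresses the difficulty that GUI agents based on large language models (LLMs) have in completing full storylines in complex, narrative-driven games such as adventure games, especially the "observation-behavior gap": the challenge of remembering earlier gameplay information and acting on it later. Alongside the FlashAdventure benchmark, the solution has two core components: CUA-as-a-Judge, an automated gameplay evaluator that measures an agent's progress on task milestones, and COAST, an agentic framework that uses long-term clue memory to improve planning and execution of sequential tasks and so narrows the observation-behavior gap. Experiments show that current GUI agents struggle with full story arcs and that COAST improves milestone completion, but a marked gap between humans and the best agents remains and calls for further research.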

Link: https://arxiv.org/abs/2509.01052
Authors: Jaewoo Ahn, Junseo Kim, Heeseung Yun, Jaehyeon Son, Dongmin Park, Jaewoong Cho, Gunhee Kim
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: EMNLP 2025 Main. Project page: this https URL

Abstract:GUI agents powered by LLMs show promise in interacting with diverse digital environments. Among these, video games offer a valuable testbed due to their varied interfaces, with adventure games posing additional challenges through complex, narrative-driven interactions. Existing game benchmarks, however, lack diversity and rarely evaluate agents on completing entire storylines. To address this, we introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion and tackle the observation-behavior gap: the challenge of remembering and acting on earlier gameplay information. We also propose CUA-as-a-Judge, an automated gameplay evaluator, and COAST, an agentic framework leveraging long-term clue memory to better plan and solve sequential tasks. Experiments show current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap. Nonetheless, a marked discrepancy between humans and best-performing agents warrants continued research efforts to narrow this divide.

[NLP-109] Chronotome: Real-Time Topic Modeling for Streaming Embedding Spaces IEEE-VIS2025

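[Quick read]: This paper addresses the difficulty existing dimensionality reduction methods have in capturing semantic change over time in real-world data, particularly time-based data such as an artist's body of work or a social media history. The key to the solution is combining force-based projection with streaming clustering to build a spatial-temporal map of the embedding space, which supports real-time interactive exploration of evolving themes. The resulting tool, Chronotome, reveals how the aesthetics and semantics of a dataset shift over time.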

链接: https://arxiv.org/abs/2509.01051
作者: Matte Lim,Catherine Yeh,Martin Wattenberg,Fernanda Viégas,Panagiotis Michalatos
机构: Harvard University (哈佛大学); Google Research (谷歌研究)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to IEEE VIS 2025 Short Paper Track (5 pages, 4 figures)

Abstract:Many real-world datasets – from an artist’s body of work to a person’s social media history – exhibit meaningful semantic changes over time that are difficult to capture with existing dimensionality reduction methods. To address this gap, we introduce a visualization technique that combines force-based projection and streaming clustering methods to build a spatial-temporal map of embeddings. Applying this technique, we create Chronotome, a tool for interactively exploring evolving themes in time-based data – in real time. We demonstrate the utility of our approach through use cases on text and image data, showing how it offers a new lens for understanding the aesthetics and semantics of temporal datasets.

[NLP-110] We Politely Insist: Your LLM Must Learn the Persian Art of Taarof EMNLP2025

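[Quick Read]: This paper addresses the limited grasp that large language models (LLMs) have of culture-specific communication norms, focusing on Persian taarof, a sophisticated system of ritual politeness that emphasizes deference, modesty, and indirectness and is absent from existing cultural benchmarks. The key to its solution is TaarofBench, the first benchmark dedicated to taarof, comprising 450 role-play scenarios across 12 common social interaction topics, validated by native speakers. Supervised fine-tuning and Direct Preference Optimization (DPO) yield 21.8% and 42.3% improvements in model alignment with cultural expectations, and the study also exposes the limitations of Western politeness frameworks for evaluating such norms.

Link: https://arxiv.org/abs/2509.01035
Authors: Nikta Gohari Sadr, Sahar Heidariasl, Karine Megerdoomian, Laleh Seyyed-Kalantari, Ali Emami
Affiliations: Brock University; Zoorna AI; York University; Emory University
Categories: Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2025 Main Conference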

Abstract:Large language models (LLMs) struggle to navigate culturally specific communication norms, limiting their effectiveness in global contexts. We focus on Persian taarof, a social norm in Iranian interactions, which is a sophisticated system of ritual politeness that emphasizes deference, modesty, and indirectness, yet remains absent from existing cultural benchmarks. We introduce TaarofBench, the first benchmark for evaluating LLM understanding of taarof, comprising 450 role-play scenarios covering 12 common social interaction topics, validated by native speakers. Our evaluation of five frontier LLMs reveals substantial gaps in cultural competence, with accuracy rates 40-48% below native speakers when taarof is culturally appropriate. Performance varies between interaction topics, improves with Persian-language prompts, and exhibits gender-based asymmetries. We also show that responses rated “polite” by standard metrics often violate taarof norms, indicating the limitations of Western politeness frameworks. Through supervised fine-tuning and Direct Preference Optimization, we achieve 21.8% and 42.3% improvement in model alignment with cultural expectations. Our human study with 33 participants (11 native Persian, 11 heritage, and 11 non-Iranian speakers) forms baselines in varying degrees of familiarity with Persian norms. This work lays the foundation for developing diverse and culturally aware LLMs, enabling applications that better navigate complex social interactions.

[NLP-111] Analysis of Error Sources in LLM-based Hypothesis Search for Few-Shot Rule Induction NEURIPS2025

[Quick Read]: This paper studies few-shot rule induction, that is, inferring abstract rules from a handful of examples and applying them to novel situations. Its central approach is an LLM-based hypothesis search framework, which it compares against direct program generation. Hypothesis search reaches performance comparable to humans, while direct program generation falls notably behind; an error analysis then identifies the key bottlenecks in hypothesis generation.

Link: https://arxiv.org/abs/2509.01016
Authors: Aishni Parab, Hongjing Lu, Ying Nian Wu, Sumit Gulwani
Affiliations: University of California, Los Angeles; Microsoft
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: This is the preprint version corresponding to our NeurIPS 2025 Workshop on Multimodal Algorithmic Reasoning submission

Abstract:Inductive reasoning enables humans to infer abstract rules from limited examples and apply them to novel situations. In this work, we compare an LLM-based hypothesis search framework with direct program generation approaches on few-shot rule induction tasks. Our findings show that hypothesis search achieves performance comparable to humans, while direct program generation falls notably behind. An error analysis reveals key bottlenecks in hypothesis generation and suggests directions for advancing program induction methods. Overall, this paper underscores the potential of LLM-based hypothesis search for modeling inductive reasoning and the challenges in building more efficient systems.

[NLP-112] Ranking of Bangla Word Graph using Graph-based Ranking Algorithms DATE2017

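[Quick Read]: This paper addresses how to rank the importance of Bangla words in the absence of a standard Bangla word database, by building a word graph and applying several graph-based ranking algorithms. The key to its solution is the Indian Language POS-tag Corpora, which provides a rich collection of Bangla sentences with part-of-speech tags: the text is preprocessed into a word graph, standard graph-based ranking algorithms are applied to that graph, and the algorithms are compared on real data in terms of F1 measure.

Link: https://arxiv.org/abs/2509.01011
Authors: S M Rafiuddin
Affiliations: Bangladesh University of Engineering and Technology
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 2 figures, Publication date 2017-12-07, Conference 2017 3rd International Conference on Electrical Information and Communication Technology EICT, Pages 1-5, Publisher IEEE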

Abstract:Ranking words is an important way to summarize a text or to retrieve information. A word graph is a way to represent the words of a sentence or a text as the vertices of a graph and to show the relationship among the words. It is also useful to determine the relative importance of a word among the words in the word-graph. In this research, the ranking of Bangla words is calculated, representing Bangla words from a text in a word graph using various graph based ranking algorithms. There is a lack of a standard Bangla word database. In this research, the Indian Language POS-tag Corpora is used, which has a rich collection of Bangla words in the form of sentences with their parts of speech tags. For applying a word graph to various graph based ranking algorithms, several standard procedures are applied. The preprocessing steps are done in every word graph and then applied to graph based ranking algorithms to make a comparison among these algorithms. This paper illustrates the entire procedure of calculating the ranking of Bangla words, including the construction of the word graph from text. Experimental result analysis on real data reveals the accuracy of each ranking algorithm in terms of F1 measure.
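
As an illustration of the pipeline the abstract describes, the sketch below builds a co-occurrence word graph from tokenized sentences and ranks its vertices with PageRank. The abstract does not name the ranking algorithms that were compared, so PageRank is an assumption here, as are the function names `build_word_graph` and `pagerank`, the adjacency window, and the toy tokens (placeholders standing in for Bangla words).

```python
from collections import defaultdict


def build_word_graph(sentences, window=1):
    """Link each word to the words that follow it within `window` positions (undirected)."""
    graph = defaultdict(set)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(i + 1, min(i + window + 1, len(tokens))):
                if tokens[j] != word:  # skip self-loops
                    graph[word].add(tokens[j])
                    graph[tokens[j]].add(word)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on an undirected graph given as {node: set_of_neighbours}."""
    nodes = list(graph)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbour shares its current rank equally among its own neighbours.
            incoming = sum(rank[nb] / len(graph[nb]) for nb in graph[node])
            new_rank[node] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


# Toy, already-tokenized input (hypothetical); real input would come from the POS-tagged corpus.
sentences = [["w1", "w2", "w3"], ["w2", "w4"]]
ranks = pagerank(build_word_graph(sentences))
top_word = max(ranks, key=ranks.get)  # the best-connected word in this toy graph
```

Comparing several ranking algorithms, as the paper does, would amount to swapping `pagerank` for another scoring function over the same graph and evaluating each ranking against labeled data.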

[NLP-113] MEPT: Mixture of Expert Prompt Tuning as a Manifold Mapper EMNLP2025

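[Quick Read]: This paper addresses the rigid parameter space of conventional fine-tuning methods, which limits their ability to activate appropriate neural pathways when data distributions are diverse and evolving. Its core solution is Mixture of Expert Prompt Tuning (MEPT), which builds on the Mixture of Experts (MoE) architecture and integrates multiple prompt experts to adaptively learn diverse, non-stationary data distributions, giving a more efficient and flexible manifold mapping. On SuperGLUE, MEPT improves mean accuracy (e.g., by 1.94%) while reducing activated prompts by 79.25%; its effectiveness is further supported by manifold-learning theory and neural activation pathway visualizations.

Link: https://arxiv.org/abs/2509.00996
Authors: Runjia Zeng, Guangyan Sun, Qifan Wang, Tong Geng, Sohail Dianat, Xiaotian Han, Raghuveer Rao, Xueling Zhang, Cheng Han, Lifu Huang, Dongfang Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: EMNLP 2025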

Abstract:Considering deep neural networks as manifold mappers, the pretrain-then-fine-tune paradigm can be interpreted as a two-stage process: pretrain establishes a broad knowledge base, and fine-tune adjusts the model parameters to activate specific neural pathways to align with the target manifold. Although prior fine-tuning approaches demonstrate success, their rigid parameter space limits their ability to dynamically activate appropriate neural pathways, rendering them ill-equipped to adapt flexibly to the diverse and evolving data distributions. In light of this view, we propose a novel approach, Mixture of Expert Prompt Tuning (MEPT), as an effective and efficient manifold-mapping framework. MEPT leverages the Mixture of Experts architecture by integrating multiple prompt experts to adaptively learn diverse and non-stationary data distributions. Empirical evaluations demonstrate that MEPT outperforms several state-of-the-art parameter efficient baselines on SuperGLUE, achieving notable improvements in mean accuracy (e.g., 1.94%) while significantly reducing activated prompts by 79.25%. The effectiveness of MEPT is further supported by theoretical insights from manifold learning and validated through neural activation pathway visualization results. Our code is available at this https URL.

[NLP-114] Performance Analysis of Supervised Machine Learning Algorithms for Text Classification

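[Quick Read]: This paper responds to the growing demand for text classification in web search, data mining, recommendation systems, and other areas of information technology. Its approach is to run text classification experiments on different datasets with several standard supervised learning algorithms, and to use an Artificial Neural Network (ANN) model based on a Back Propagation Network (BPN) alongside other models to create an independent platform for supervised classification of labeled text. Experiments on real data show which model performs best in terms of classification accuracy.

Link: https://arxiv.org/abs/2509.00983
Authors: Sadia Zaman Mishu, S M Rafiuddin
Affiliations: Rajshahi University of Engineering and Technology
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 2 figures, published in 2016 at the 19th International Conference on Computer and Information Technology (ICCIT), Bangladesh, proceedings pp. 409-413, IEEE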

Abstract:The demand for text classification is growing significantly in web searching, data mining, web ranking, recommendation systems, and so many other fields of information and technology. This paper illustrates the text classification process on different datasets using some standard supervised machine learning techniques. Text documents can be classified through various kinds of classifiers. Labeled text documents are used to classify the text in supervised classifications. This paper applies these classifiers on different kinds of labeled documents and measures the accuracy of the classifiers. An Artificial Neural Network (ANN) model using Back Propagation Network (BPN) is used with several other models to create an independent platform for labeled and supervised text classification process. An existing benchmark approach is used to analyze the performance of classification using labeled documents. Experimental analysis on real data reveals which model works well in terms of classification accuracy.

[NLP-115] Self-Exploring Language Models for Explainable Link Forecasting on Temporal Graphs via Reinforcement Learning

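[Quick Read]: This paper addresses the limited explainability and weak generalization of future link forecasting on temporal graphs (TGs). Traditional neural methods perform well but lack explainability and cannot be applied to unseen graphs without retraining, while existing LLM-based methods are mostly limited to static graphs or small synthetic TGs and do not evaluate the quality of the reasoning traces the LLM generates. The key to its solution is ReaL-TG, a reinforcement learning framework that uses an outcome-based reward to encourage the LLM to self-explore reasoning strategies from graph structure and to produce explanations that directly justify its predictions. The paper also proposes a new evaluation protocol that combines ranking metrics with an LLM-as-a-Judge system to assess reasoning quality and the impact of hallucinations.

Link: https://arxiv.org/abs/2509.00975
Authors: Zifeng Ding, Shenyang Huang, Zeyu Cao, Emma Kondrup, Zachary Yang, Xingyue Huang, Yuan Sui, Zhangdie Yuan, Yuqicheng Zhu, Xianglong Hu, Yuan He, Farimah Poursafaei, Michael Bronstein, Andreas Vlachos
Affiliations: University of Cambridge; Mila - Quebec AI Institute; McGill University; University of Oxford; National University of Singapore; University of Stuttgart; Amazon; AITHYRA
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: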

Abstract:Forecasting future links is a central task in temporal graph (TG) reasoning, requiring models to leverage historical interactions to predict upcoming ones. Traditional neural approaches, such as temporal graph neural networks, achieve strong performance but lack explainability and cannot be applied to unseen graphs without retraining. Recent studies have begun to explore using large language models (LLMs) for graph reasoning, but most of them are constrained to static graphs or small synthetic TGs and lack the evaluation of the quality of reasoning traces generated by LLMs. In this work, we present Reasoning-Enhanced Learning for Temporal Graphs (ReaL-TG), a reinforcement learning framework that fine-tunes LLMs to perform explainable link forecasting on real-world TGs. ReaL-TG uses outcome-based reward to encourage models to self-explore reasoning strategies from graph structure and to produce explanations that directly justify their predictions. To enable evaluation on LLM-generated reasoning traces, we propose a new evaluation protocol combining ranking metrics with an LLM-as-a-Judge system that assesses both the quality of reasoning and the impact of hallucinations. Experiments with ReaL-TG-4B, obtained by fine-tuning Qwen3-4B under our framework, show that it outperforms much larger frontier LLMs, including GPT-5 mini, on ranking metrics, while producing high-quality explanations confirmed by both the LLM judge and human evaluation.

[NLP-116] RPRO: Ranked Preference Reinforcement Optimization for Enhancing Medical QA and Diagnostic Reasoning

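[Quick Read]: This paper addresses the lack of factual accuracy and clinical reliability in the chain-of-thought (CoT) reasoning that current large language models (LLMs) generate for medical question answering. The key to its solution is Ranked Preference Reinforcement Optimization (RPRO), a framework that combines reinforcement learning with preference-driven reasoning refinement. Its main innovations are task-adaptive reasoning templates aligned with clinical workflows, a groupwise ranking optimization based on the Bradley-Terry model in place of traditional pairwise preference methods, and KL-divergence regularization for stable training. RPRO also automatically identifies and corrects low-quality reasoning chains.

Link: https://arxiv.org/abs/2509.00974
Authors: Chia-Hsuan Hsu, Jun-En Ding, Hsin-Ling Hsu, Feng Liu, Fang-Ming Hung
Affiliations: National Taiwan University of Science and Technology; Far Eastern Memorial Hospital; Stevens Institute of Technology; National Chengchi University
Categories: Computation and Language (cs.CL)
Comments: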

Abstract:Medical question answering requires advanced reasoning that integrates domain knowledge with logical inference. However, existing large language models (LLMs) often generate reasoning chains that lack factual accuracy and clinical reliability. We propose Ranked Preference Reinforcement Optimization (RPRO), a novel framework that uniquely combines reinforcement learning with preference-driven reasoning refinement to enhance clinical chain-of-thought (CoT) performance. RPRO differentiates itself from prior approaches by employing task-adaptive reasoning templates and a probabilistic evaluation mechanism that aligns outputs with established clinical workflows, while automatically identifying and correcting low-quality reasoning chains. Unlike traditional pairwise preference methods, RPRO introduces a groupwise ranking optimization based on the Bradley-Terry model and incorporates KL-divergence regularization for stable training. Experiments on PubMedQA and MedQA-USMLE show consistent improvements over strong baselines. Remarkably, our 1.1B parameter model outperforms much larger 7B-13B models, including medical-specialized variants. These findings demonstrate that combining preference optimization with quality-driven refinement offers a scalable and effective approach to building more reliable, clinically grounded medical LLMs.
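
To make the groupwise ranking idea concrete, the sketch below computes a Plackett-Luce negative log-likelihood, the standard listwise generalization of the Bradley-Terry model, and adds a KL penalty against a reference distribution. It is an illustration under assumptions, not the paper's objective: the abstract does not give the exact loss, and the names `plackett_luce_nll`, `kl_divergence`, `ranked_preference_loss`, and the weight `beta` are hypothetical.

```python
import math


def plackett_luce_nll(scores_best_first):
    """Negative log-likelihood of a ranking; scores are listed from most to least preferred."""
    nll = 0.0
    for k in range(len(scores_best_first) - 1):
        remaining = scores_best_first[k:]
        peak = max(remaining)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in remaining))
        nll -= scores_best_first[k] - log_norm
    return nll


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def ranked_preference_loss(scores_best_first, policy_probs, reference_probs, beta=0.1):
    """Ranking loss plus a KL term that keeps the policy close to the reference model."""
    return plackett_luce_nll(scores_best_first) + beta * kl_divergence(policy_probs, reference_probs)


# With two candidates the ranking loss reduces to the pairwise Bradley-Terry loss.
pairwise = plackett_luce_nll([1.0, 0.0])
bradley_terry = -math.log(1.0 / (1.0 + math.exp(-1.0)))
```

The two-candidate case shows why a groupwise objective can be seen as an extension of pairwise preference methods: the same likelihood is applied to each successive choice among the candidates that remain.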

[NLP-117] Structure and Destructure: Dual Forces in the Making of Knowledge Engines

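[Quick Read]: This thesis addresses the divide between the two paradigms for building knowledge engines in natural language processing: the structured and the unstructured. The structured paradigm uses predefined symbolic interactions, such as knowledge graphs, as priors, whereas the unstructured paradigm scales transformer architectures with ever larger data and model sizes. The thesis argues that the two are connected by two complementary forces, structure and destructure: structure organizes seen symbolic interactions, while destructure, through periodic embedding resets, improves model plasticity and generalization to unseen scenarios. Together these connections form a new recipe for general knowledge engines that support transparent, controllable, and adaptable intelligent systems.

Link: https://arxiv.org/abs/2509.00949
Authors: Yihong Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: PhD thesis. this https URL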

Abstract:The making of knowledge engines in natural language processing has been shaped by two seemingly distinct paradigms: one grounded in structure, the other driven by massively available unstructured data. The structured paradigm leverages predefined symbolic interactions, such as knowledge graphs, as priors and designs models to capture them. In contrast, the unstructured paradigm centers on scaling transformer architectures with increasingly vast data and model sizes, as seen in modern large language models. Despite their divergence, this thesis seeks to establish conceptual connections bridging these paradigms. Two complementary forces, structure and destructure, emerge across both paradigms: structure organizes seen symbolic interactions, while destructure, through periodic embedding resets, improves model plasticity and generalization to unseen scenarios. These connections form a new recipe for developing general knowledge engines that can support transparent, controllable, and adaptable intelligent systems.

[NLP-118] MedCOD: Enhancing English-to-Spanish Medical Translation of Large Language Models Using Enriched Chain-of-Dictionary Framework EMNLP2025

[Quick Read]: This paper addresses inaccurate terminology and contextual misunderstanding in English-to-Spanish medical translation, where texts are highly specialized and dense with terms. The key to its solution is MedCOD (Medical Chain-of-Dictionary), a hybrid framework that integrates domain-specific structured knowledge, drawn from the Unified Medical Language System (UMLS) and the LLM-as-Knowledge-Base (LLM-KB) paradigm, into structured prompting and LoRA-based fine-tuning of large language models (LLMs). Experiments show that MedCOD significantly improves translation quality across all four open-source models evaluated, and ablation studies confirm that structured prompting and model adaptation each contribute independently, with their combination performing best.

Link: https://arxiv.org/abs/2509.00934
Authors: Md Shahidul Salim, Lian Fu, Arav Adikesh Ramakrishnan, Zonghai Yao, Hong Yu
Affiliations: Miner School of Computer and Information Sciences, University of Massachusetts Lowell; Manning College of Information and Computer Sciences, University of Massachusetts Amherst; Center for Healthcare Organization and Implementation Research, VA Bedford Health Care; Department of Medicine, University of Massachusetts Medical School
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: To appear in Findings of the Association for Computational Linguistics: EMNLP 2025

Abstract:We present MedCOD (Medical Chain-of-Dictionary), a hybrid framework designed to improve English-to-Spanish medical translation by integrating domain-specific structured knowledge into large language models (LLMs). MedCOD integrates domain-specific knowledge from both the Unified Medical Language System (UMLS) and the LLM-as-Knowledge-Base (LLM-KB) paradigm to enhance structured prompting and fine-tuning. We constructed a parallel corpus of 2,999 English-Spanish MedlinePlus articles and a 100-sentence test set annotated with structured medical contexts. Four open-source LLMs (Phi-4, Qwen2.5-14B, Qwen2.5-7B, and LLaMA-3.1-8B) were evaluated using structured prompts that incorporated multilingual variants, medical synonyms, and UMLS-derived definitions, combined with LoRA-based fine-tuning. Experimental results demonstrate that MedCOD significantly improves translation quality across all models. For example, Phi-4 with MedCOD and fine-tuning achieved BLEU 44.23, chrF++ 28.91, and COMET 0.863, surpassing strong baseline models like GPT-4o and GPT-4o-mini. Ablation studies confirm that both MedCOD prompting and model adaptation independently contribute to performance gains, with their combination yielding the highest improvements. These findings highlight the potential of structured knowledge integration to enhance LLMs for medical translation tasks.

[NLP-119] DTRNet: Dynamic Token Routing Network to Reduce Quadratic Costs in Transformers

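[Quick Read]: This paper addresses the high computational cost of standard Transformers, which apply quadratic self-attention to every token at every layer. Its core solution is DTRNet (Dynamic Token Routing Network), whose dynamic routing lets roughly 90% of tokens skip the expensive cross-token mixing at each layer: only about 10% of tokens pass through full attention, while the rest receive lightweight linear updates and the MLP module is preserved for all tokens. This design decouples token updates from attention mixing, so every token is still explicitly updated while FLOPs drop substantially, with performance comparable to a full Transformer and efficiency gains that grow with sequence length.

Link: https://arxiv.org/abs/2509.00925
Authors: Aman Sharma, Saeed Najafi, Parsa Farinneya, Benyamin Jamialahmadi, Marzieh S. Tahaei, Yuhe Fan, Mehdi Rezagholizadeh, Boxing Chen, Aref Jafari
Affiliations: Huawei
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: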

Abstract:Transformers achieve state-of-the-art results across many tasks, but their uniform application of quadratic self-attention to every token at every layer makes them computationally expensive. We introduce DTRNet (Dynamic Token Routing Network), an improved Transformer architecture that allows tokens to dynamically skip the quadratic cost of cross-token mixing while still receiving lightweight linear updates. By preserving the MLP module and reducing the attention cost for most tokens to linear, DTRNet ensures that every token is explicitly updated while significantly lowering overall computation. This design offers an efficient and effective alternative to standard dense attention. Once trained, DTRNet blocks route only ~10% of tokens through attention at each layer while maintaining performance comparable to a full Transformer. It consistently outperforms routing-based layer skipping methods such as MoD and D-LLM in both accuracy and memory at matched FLOPs, while routing fewer tokens to full attention. Its efficiency gains scale with sequence length, offering significant reduction in FLOPs for long-context inputs. By decoupling token updates from attention mixing, DTRNet substantially reduces the quadratic share of computation, providing a simple, efficient, and scalable alternative to Transformers.

[NLP-120] Supervised In-Context Fine-Tuning for Generative Sequence Labeling

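[Quick Read]: This paper addresses how to better exploit causal large language models (LLMs) for sequence labeling (SL), a setting in which the development of encoder models has stagnated. Mainstream approaches use encoder-only models or fine-tune decoder models with the causal mask removed, neither of which takes full advantage of LLMs' generative strengths. The paper proposes supervised in-context fine-tuning (SIFT), which casts SL as constrained response generation and combines (1) in-context learning (ICL) from demonstrations with (2) supervised fine-tuning, considerably outperforming both ICL and decoder-as-encoder baselines. It also finds that long context degrades generative SL performance, but that removing the instruction mitigates the problem, since instructions turn out to be largely unnecessary under SIFT.

Link: https://arxiv.org/abs/2509.00921
Authors: David Dukić, Goran Glavaš, Jan Šnajder
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: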

Abstract:Sequence labeling (SL) tasks, where labels are assigned to tokens, are abundant in NLP (e.g., named entity recognition and aspect-based sentiment analysis). Owing to the intuition that they require bidirectional context, SL tasks are commonly tackled with encoder-only models. Recent work also shows that removing the causal mask in fine-tuning enables decoder-based LLMs to become effective token classifiers. Less work, however, focused on (supervised) generative SL, a more natural setting for causal LLMs. Due to their rapid scaling, causal LLMs applied to SL are expected to outperform encoders, whose own development has stagnated. In this work, we propose supervised in-context fine-tuning (SIFT) for generative SL. SIFT casts SL tasks as constrained response generation, natural to LLMs, combining (1) in-context learning (ICL) from demonstrations with (2) supervised fine-tuning. SIFT considerably outperforms both ICL and decoder-as-encoder fine-tuning baselines on a range of standard SL tasks. We further find that although long context hinders the performance of generative SL in both ICL and SIFT, this deficiency can be mitigated by removing the instruction, as instructions are shown to be largely unnecessary for achieving strong SL performance with SIFT. Our findings highlight strengths and limitations of SL with LLMs, underscoring the importance of a response-based generative task formulation for effective SL performance.

[NLP-121] SeLeRoSa: Sentence-Level Romanian Satire Detection Dataset

[Quick Read]: This paper addresses sentence-level satire detection, in particular identifying satirical sentences in news articles. Its key contribution is SeLeRoSa, the first sentence-level satire detection dataset for Romanian news articles, comprising 13,873 manually annotated sentences spanning domains such as social issues, IT, science, and movies. On this dataset, the authors evaluate baselines based on large language models (LLMs) in zero-shot and fine-tuning settings, together with transformer-based baselines, and the results reveal the current limitations of these models on the task.

Link: https://arxiv.org/abs/2509.00893
Authors: Răzvan-Alexandru Smădu, Andreea Iuga, Dumitru-Clementin Cercel, Florin Pop
Affiliations: National University of Science and Technology POLITEHNICA Bucharest; National Institute for Research & Development in Informatics – ICI Bucharest
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 2 Figures

Abstract:Satire, irony, and sarcasm are techniques typically used to express humor and critique, rather than deceive; however, they can occasionally be mistaken for factual reporting, akin to fake news. These techniques can be applied at a more granular level, allowing satirical information to be incorporated into news articles. In this paper, we introduce the first sentence-level dataset for Romanian satire detection for news articles, called SeLeRoSa. The dataset comprises 13,873 manually annotated sentences spanning various domains, including social issues, IT, science, and movies. With the rise and recent progress of large language models (LLMs) in the natural language processing literature, LLMs have demonstrated enhanced capabilities to tackle various tasks in zero-shot settings. We evaluate multiple baseline models based on LLMs in both zero-shot and fine-tuning settings, as well as baseline transformer-based models. Our findings reveal the current limitations of these models in the sentence-level satire detection task, paving the way for new research directions.

[NLP-122] ChatCLIDS: Simulating Persuasive AI Dialogues to Promote Closed-Loop Insulin Adoption in Type 1 Diabetes Care

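[Quick Read]: This paper addresses the low real-world adoption of closed-loop insulin delivery systems (CLIDS) among people with type 1 diabetes, which stems not from technical failure but from behavioral, psychosocial, and social barriers. To study it, the authors introduce ChatCLIDS, the first benchmark to rigorously evaluate LLM-driven persuasive dialogue for health behavior change. Its key innovation is a library of expert-validated virtual patients, each with a clinically grounded, heterogeneous profile and realistic adoption barriers, who engage in simulated multi-turn interactions with nurse agents equipped with a range of evidence-based persuasive strategies. The framework supports longitudinal counseling and adversarial social influence scenarios, enabling multi-dimensional evaluation. The findings show that although larger, more reflective LLMs adapt their strategies over time, all models struggle to overcome resistance, especially under realistic social pressure.

Link: https://arxiv.org/abs/2509.00891
Authors: Zonghai Yao, Talha Chafekar, Junda Wang, Shuo Han, Feiyun Ouyang, Junhui Qian, Lingxi Li, Hong Yu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Equal contribution for the first two authors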

Abstract:Real-world adoption of closed-loop insulin delivery systems (CLIDS) in type 1 diabetes remains low, driven not by technical failure, but by diverse behavioral, psychosocial, and social barriers. We introduce ChatCLIDS, the first benchmark to rigorously evaluate LLM-driven persuasive dialogue for health behavior change. Our framework features a library of expert-validated virtual patients, each with clinically grounded, heterogeneous profiles and realistic adoption barriers, and simulates multi-turn interactions with nurse agents equipped with a diverse set of evidence-based persuasive strategies. ChatCLIDS uniquely supports longitudinal counseling and adversarial social influence scenarios, enabling robust, multi-dimensional evaluation. Our findings reveal that while larger and more reflective LLMs adapt strategies over time, all models struggle to overcome resistance, especially under realistic social pressure. These results highlight critical limitations of current LLMs for behavior change, and offer a high-fidelity, scalable testbed for advancing trustworthy persuasive AI in healthcare and beyond.

[NLP-123] EviNote-RAG: Enhancing RAG Models via Answer-Supportive Evidence Notes

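[Quick Read]: This paper addresses two problems that conventional retrieval-augmented generation (RAG) faces in open-domain question answering (QA): a low signal-to-noise ratio in retrieved evidence, where useful information is buried under irrelevant content, and error accumulation in multi-hop reasoning over incomplete or noisy passages. The key to its solution is EviNote-RAG, an agentic RAG framework with a structured retrieve-note-answer pipeline. The model is trained to compose Supportive-Evidence Notes (SENs), concise, human-like notes that keep only answer-relevant information, highlight uncertainty, and state explicitly when no useful evidence exists. An entailment-based Evidence Quality Reward (EQR) evaluates whether the SENs logically support the final answer, guiding the model toward more faithful and robust reasoning and improving accuracy, generalization, and training stability.

Link: https://arxiv.org/abs/2509.00877
Authors: Yuqin Dai, Guoqing Wang, Yuan Wang, Kairan Dou, Kaichen Zhou, Zhanwei Zhang, Shuo Yang, Fei Tang, Jun Yin, Pengyu Zeng, Zhenzhe Ying, Can Yi, Changhua Meng, Yuchen Zhou, Yongliang Shen, Shuai Lu
Affiliations: Tsinghua University; Zhejiang University; Ant Group; Massachusetts Institute of Technology; UC Berkeley; The University of Hong Kong; National University of Singapore
Categories: Computation and Language (cs.CL)
Comments: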

Abstract:Large Language Models (LLMs) empowered with retrieval mechanisms have achieved strong progress in open-domain question answering (QA). Yet, the conventional retrieve–then–answer paradigm often suffers from two key limitations: (1) low signal-to-noise ratio in retrieved evidence, where useful information is buried under irrelevant content, and (2) error accumulation in multi-hop reasoning when incomplete or noisy passages are involved. To address these challenges, we present EviNote-RAG, an agentic RAG framework that introduces a structured retrieve–note–answer pipeline. Instead of directly reasoning over raw retrievals, the model is trained to compose Supportive-Evidence Notes (SENs), concise, human-like notes that preserve only answer-relevant information, highlight uncertainty, and explicitly state when no useful evidence exists. This distillation process is further reinforced by the Evidence Quality Reward (EQR), an entailment-based signal that evaluates whether SENs logically support the final answer. Together, SENs and EQR guide the model toward faithful and robust reasoning, while reducing the impact of noise. Experiments on in-domain and out-of-domain QA benchmarks show that EviNote-RAG consistently outperforms strong baselines in accuracy, generalization, and training stability. In particular, it achieves state-of-the-art results while enhancing robustness and efficiency, yielding relative F1 gains of 20% on HotpotQA (+0.093), 40% on Bamboogle (+0.151), and 91% on 2Wiki (+0.256) via denser rewards and reduced verbosity.
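
The reported gains pair a relative improvement with an absolute one, which fixes the implied baseline: if the absolute gain is Δ and the relative gain is r, the baseline is roughly Δ / r. The short check below works this out for the three benchmarks. The resulting baselines are back-of-the-envelope estimates derived from the rounded figures in the abstract, not values reported by the authors, and the helper name `implied_baseline` is hypothetical.

```python
def implied_baseline(absolute_gain, relative_gain):
    """Baseline F1 implied by an absolute gain and the matching relative gain (as a fraction)."""
    return absolute_gain / relative_gain


# (absolute F1 gain, relative gain) as quoted in the abstract.
reported = {
    "HotpotQA": (0.093, 0.20),
    "Bamboogle": (0.151, 0.40),
    "2Wiki": (0.256, 0.91),
}

for name, (delta, rel) in reported.items():
    base = implied_baseline(delta, rel)
    # Estimates only: the inputs are rounded, so the outputs are approximate.
    print(f"{name}: baseline ~{base:.3f}, improved ~{base + delta:.3f}")
```

Reading the numbers this way makes clear that the largest relative gain (2Wiki) comes from the lowest implied starting point, which is worth keeping in mind when comparing the three percentages.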

[NLP-124] Exploring and Mitigating Fawning Hallucinations in Large Language Models

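[Quick Read]: This paper addresses fawning hallucinations in large language models (LLMs): when given deceptive or misleading prompts, a model tends to align with the input's implied perspective rather than with the facts, so its output deviates from accurate information. The key to its solution is collaborative contrastive decoding (CCD). The authors design two paradigms for generating deceptive or misleading inputs that consistently induce fawning hallucinations, then contrast the output distributions under induced inputs and transformed neutral inputs. This reduces the model's reliance on misleading information without additional training and improves the factuality of generated responses.

Link: https://arxiv.org/abs/2509.00869
Authors: Zixuan Shangguan, Yanjie Dong, Lanjun Wang, Xiaoyi Fan, Victor C. M. Leung, Xiping Hu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: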

Abstract:Large language models (LLMs) have demonstrated exceptional proficiency in language understanding. However, when LLMs align their outputs with deceptive and/or misleading prompts, the generated responses could deviate from the de facto information. Such observations are known as fawning hallucinations, where the model prioritizes alignment with the input’s implied perspective over accuracy and truthfulness. In this work, we analyze fawning hallucinations in various natural language processing tasks and tailor the so-termed contrastive decoding method for fawning-hallucination mitigation. Specifically, we design two paradigms to generate corresponding deceptive and/or misleading inputs for the consistent fawning hallucinations induction. Then, we propose the collaborative contrastive decoding (CCD) to handle the fawning hallucinations across different tasks in LLMs. By contrasting the deviation in output distribution between induced and transformed neutral inputs, the proposed CCD can reduce reliance on deceptive and/or misleading information without requiring additional training. Extensive experiments demonstrate that the proposed CCD can effectively mitigate fawning hallucinations and improve the factuality of the generated responses over various tasks.
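
The contrast between induced and neutral inputs can be sketched with the generic contrastive decoding rule, which boosts tokens the neutral prompt supports and penalizes tokens that only the misleading prompt favors. This is a minimal sketch of that generic rule, assuming per-token logits are available for both prompts; the abstract does not specify CCD's exact formulation, and `contrastive_logits`, `alpha`, and the toy logits are illustrative assumptions.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(neutral_logits, induced_logits, alpha=1.0):
    """Amplify what the neutral prompt supports and subtract what the induced prompt adds."""
    return [(1 + alpha) * n - alpha * i for n, i in zip(neutral_logits, induced_logits)]


# Toy two-token vocabulary (made-up numbers): index 0 is the factual answer,
# index 1 the answer that the misleading prompt pushes toward.
neutral = [2.0, 1.0]
induced = [1.0, 2.5]

induced_probs = softmax(induced)
adjusted_probs = softmax(contrastive_logits(neutral, induced))
```

In this toy case the induced prompt alone prefers the second token, while the adjusted distribution puts more mass on the first token than the neutral prompt does by itself. Because the adjustment happens at decoding time, no model weights change, which matches the training-free property the abstract emphasizes.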

[NLP-125] Prompting Away Stereotypes? Evaluating Bias in Text-to-Image Models for Occupations

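[Quick Read]: This paper addresses the risk that text-to-image (TTI) models amplify social biases, in particular the skewed gender and racial representation of occupations. The key to its approach is to frame bias assessment as an image curation and evaluation task and to build a pilot benchmark of five socially salient occupations (CEO, Nurse, Software Engineer, Teacher, Athlete). By comparing images generated from neutral prompts with those from fairness-aware controlled prompts, it analyzes how the distributions of gender (male, female) and race (Asian, Black, White) shift across models, revealing both the promise and the limitations of prompting as a fairness intervention.

Link: https://arxiv.org/abs/2509.00849
Authors: Shaina Raza, Maximus Powers, Partha Pratim Saha, Mahveen Raza, Rizwan Qureshi
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: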

Abstract:Text-to-Image (TTI) models are powerful creative tools but risk amplifying harmful social biases. We frame representational societal bias assessment as an image curation and evaluation task and introduce a pilot benchmark of occupational portrayals spanning five socially salient roles (CEO, Nurse, Software Engineer, Teacher, Athlete). Using five state-of-the-art models: closed-source (DALLE 3, Gemini Imagen 4.0) and open-source (FLUX.1-dev, Stable Diffusion XL Turbo, Grok-2 Image), we compare neutral baseline prompts against fairness-aware controlled prompts designed to encourage demographic diversity. All outputs are annotated for gender (male, female) and race (Asian, Black, White), enabling structured distributional analysis. Results show that prompting can substantially shift demographic representations, but with highly model-specific effects: some systems diversify effectively, others overcorrect into unrealistic uniformity, and some show little responsiveness. These findings highlight both the promise and the limitations of prompting as a fairness intervention, underscoring the need for complementary model-level strategies. We release all code and data for transparency and reproducibility at this https URL.

[NLP-126] Negative Matters: Multi-Granularity Hard-Negative Synthesis and Anchor-Token-Aware Pooling for Enhanced Text Embeddings

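【Quick read】: This paper addresses insufficient negative-sample quality in the contrastive training of text embedding models, in particular how to generate negatives that are diverse and graded in semantic similarity so that the model learns to separate subtle semantic differences. The key is a Multi-Granularity Hard-negative (MGH) synthesis framework that uses large language models (LLMs) to generate negatives at different similarity levels and pairs them with a coarse-to-fine curriculum learning strategy for supervised training. It also introduces Anchor Token Aware (ATA) pooling, which assigns higher weights to anchor tokens based on aggregation patterns observed in LLMs, improving embedding accuracy without increasing model complexity.

Link: https://arxiv.org/abs/2509.00842
Authors: Tengyu Pan, Zhichao Duan, Zhenyu Li, Bowen Dong, Ning Liu, Xiuxing Li, Jianyong Wang
Affiliations: Tsinghua University; Shandong University; Beijing Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: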

Abstract:Text embedding models are essential for various natural language processing tasks, enabling the effective encoding of semantic information into dense vector representations. These models are typically optimized using triplets of (query, positive, negative) data pairs for contrastive learning, where the negative samples play a critical role in enhancing the model’s ability to discern subtle semantic distinctions. In this work, we introduce a Multi-Granularity Hard-negative (MGH) synthesis framework that leverages large language models (LLMs) to generate diverse negative samples with varying levels of similarity with the query. This approach facilitates a coarse-to-fine curriculum learning strategy during supervised training, allowing the embedding model to progressively learn more nuanced semantic representations. Meanwhile, we propose an Anchor Token Aware (ATA) pooling method that assigns higher weights to anchor tokens based on aggregation patterns observed in LLMs, improving text embedding accuracy without increasing model complexity. Comprehensive experiments on the MTEB benchmark demonstrate that our methods achieve state-of-the-art performance, surpassing existing synthesis strategies both with synthetic data and when combined with public retrieval datasets.
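The pooling idea, giving anchor tokens more weight than plain mean pooling does, can be sketched as a weighted average. The sketch below assumes that an anchor signal is the total attention each token receives (column sums of an attention matrix); that choice, the function names, and the toy values are assumptions for illustration and may differ from how ATA pooling actually derives its weights.

```python
def attention_received(attn):
    """Column sums of a square attention matrix: how much attention each
    token receives. Using this as the anchor signal is an assumption."""
    n = len(attn)
    return [sum(attn[row][col] for row in range(n)) for col in range(n)]


def weighted_pool(token_vectors, weights):
    """Weighted average of token vectors; uniform weights give mean pooling."""
    total = sum(weights)
    dim = len(token_vectors[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, token_vectors)) / total
        for d in range(dim)
    ]


# Two toy token embeddings and a toy attention matrix
# (row i holds the attention paid by token i).
tokens = [[1.0, 0.0], [0.0, 1.0]]
attn = [[0.5, 0.5], [0.1, 0.9]]

mean_pooled = weighted_pool(tokens, [1.0, 1.0])
anchor_pooled = weighted_pool(tokens, attention_received(attn))
print(mean_pooled, anchor_pooled)
```

The second token receives more attention, so the pooled vector shifts toward it. The pooling step adds no trainable parameters, which is consistent with the abstract's claim that model complexity does not increase.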

[NLP-127] Neural Models and Language Model Prompting for the Multidimensional Evaluation of Open-Ended Conversations

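【Quick read】: This paper addresses a key challenge in evaluating generative-AI dialogue systems: predicting dialogue-level, dimension-specific scores accurately under a model size limit (fewer than 13 billion parameters). The approach follows two strategies: using language models (LMs) as evaluators through prompting, and training encoder-based classification and regression models. Experiments show that LM prompting correlates only modestly with human judgments yet still ranks second on the test set, behind only the baseline. The encoder-based models correlate highly on the validation set for some dimensions but degrade on the test set, which may be related to test-set score ranges that differ markedly from those of the training and validation sets for some dimensions.

Link: https://arxiv.org/abs/2509.00841
Authors: Michelle Elizabeth, Alicja Kasicka, Natalia Krawczyk, Magalie Ochs, Gwénolé Lecorvé, Justyna Gromada, Lina M. Rojas-Barahona
Affiliations: Orange Research; Aix-Marseille University
Categories: Computation and Language (cs.CL)
Comments: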

Abstract:The growing number of generative AI-based dialogue systems has made their evaluation a crucial challenge. This paper presents our contribution to this important problem through the Dialogue System Technology Challenge (DSTC-12, Track 1), where we developed models to predict dialogue-level, dimension-specific scores. Given the constraint of using relatively small models (i.e. fewer than 13 billion parameters) our work follows two main strategies: employing Language Models (LMs) as evaluators through prompting, and training encoder-based classification and regression models. Our results show that while LM prompting achieves only modest correlations with human judgments, it still ranks second on the test set, outperformed only by the baseline. The regression and classification models, with significantly fewer parameters, demonstrate high correlation for some dimensions on the validation set. Although their performance decreases on the test set, it is important to note that the test set contains annotations with significantly different score ranges for some of the dimensions with respect to the train and validation sets.

[NLP-128] TMT: A Simple Way to Translate Topic Models Using Dictionaries

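【Quick read】: This paper addresses the difficulty of training topic models in multilingual settings, especially when the developer lacks knowledge of the target language or usable corpora are scarce. Conventional approaches depend on aligned corpora, embeddings, or manual evaluation and are hard to apply in such cases. The core solution is Topic Model Translation (TMT), a technique that transfers topic models trained with algorithms such as LDA from a source language to a target language without metadata, word embeddings, or aligned corpora, so that topic models can be reused in low-resource language settings. As the title indicates, the transfer relies on dictionaries. Quantitative and qualitative evaluation shows that the translated topics are semantically coherent and consistent.

Link: https://arxiv.org/abs/2509.00822
Authors: Felix Engl, Andreas Henrich
Affiliations: University of Bamberg
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 10 pages, 2 figures, 8 tables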

Abstract:The training of topic models for a multilingual environment is a challenging task, requiring the use of sophisticated algorithms, topic-aligned corpora, and manual evaluation. These difficulties are further exacerbated when the developer lacks knowledge of the target language or is working in an environment with limited data, where only small or unusable multilingual corpora are available. Considering these challenges, we introduce Topic Model Translation (TMT), a novel, robust and transparent technique designed to transfer topic models (e.g., Latent Dirichlet Allocation (LDA) based topic models) from one language to another, without the need for metadata, embeddings, or aligned corpora. TMT enables the reuse of topic models across languages, making it especially suitable for scenarios where large corpora in the target language are unavailable or manual translation is infeasible. Furthermore, we evaluate TMT extensively using both quantitative and qualitative methods, demonstrating that it produces semantically coherent and consistent topic translations.
ACM classes: I.2.7; H.3.1

[NLP-129] CaresAI at BioCreative IX Track 1 – LLM for Biomedical QA IJCAI

【Quick read】: This paper addresses the problem that large language models (LLMs), despite strong semantic understanding, produce poorly formatted outputs and low Exact Match (EM) scores on complex multi-hop biomedical question answering (QA). The key is a two-stage inference pipeline: a supervised fine-tuned LLaMA 3 8B model first generates candidate answers, and a dedicated short-answer extraction module then trims redundant content and improves alignment with the evaluation metric, mitigating the EM shortfall caused by output-format mismatch.

Link: https://arxiv.org/abs/2509.00806
Authors: Reem Abdel-Salam, Mary Adewunmi, Modinat A. Abayomi
Affiliations: Cairo University; CaresAI; Menzies School of Health Research, Charles Darwin University; Department of Biology, Boston College
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the International Joint Conference on Artificial Intelligence (IJCAI), Montreal, Canada, 2025

Abstract:Large language models (LLMs) are increasingly evident for accurate question answering across various domains. However, rigorous evaluation of their performance on complex question-answering (QA) capabilities is essential before deployment in real-world biomedical and healthcare applications. This paper presents our approach to the MedHopQA track of the BioCreative IX shared task, which focuses on multi-hop biomedical question answering involving diseases, genes, and chemicals. We adopt a supervised fine-tuning strategy leveraging LLaMA 3 8B, enhanced with a curated biomedical question-answer dataset compiled from external sources including BioASQ, MedQuAD, and TREC. Three experimental setups are explored: fine-tuning on combined short and long answers, short answers only, and long answers only. While our models demonstrate strong domain understanding, achieving concept-level accuracy scores of up to 0.8, their Exact Match (EM) scores remain significantly lower, particularly in the test phase. We introduce a two-stage inference pipeline for precise short-answer extraction to mitigate verbosity and improve alignment with evaluation metrics. Despite partial improvements, challenges persist in generating strictly formatted outputs. Our findings highlight the gap between semantic understanding and exact answer evaluation in biomedical LLM applications, motivating further research in output control and post-processing strategies.
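The gap between semantic correctness and Exact Match can be made concrete with a small scorer. This is a minimal sketch assuming SQuAD-style normalization (lowercasing, stripping punctuation and English articles); the shared task's official scorer may normalize differently, and `contains_answer` is a hypothetical looser check, not the task's concept-level metric. The example question and the answer `BRCA1` are invented for illustration.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """Strict metric: normalized strings must be identical."""
    return normalize(prediction) == normalize(gold)


def contains_answer(prediction, gold):
    """Hypothetical looser check: the gold answer appears in the output."""
    return normalize(gold) in normalize(prediction)


verbose = "The mutated gene is BRCA1."  # invented model output
short = "brca1"                         # invented extracted short answer
gold = "BRCA1"

print(exact_match(verbose, gold), contains_answer(verbose, gold))
print(exact_match(short, gold))
```

The verbose output contains the right entity but fails the strict comparison, while the short form passes. That is the kind of mismatch a short-answer extraction stage is meant to reduce.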

[NLP-130] LegalChainReasoner: A Legal Chain-guided Framework for Criminal Judicial Opinion Generation

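【Quick read】: This paper addresses the inconsistency that arises when criminal judicial opinion generation is split into separate legal reasoning and sentencing prediction subtasks, as well as the limited deployability of methods that depend on manually curated knowledge. It proposes a new LegalAI task, Judicial Opinion Generation, and the LegalChainReasoner framework, which uses structured legal chains to guide the model through a comprehensive case assessment. By integrating factual premises, composite legal conditions, and sentencing conclusions, the framework supports flexible knowledge injection and end-to-end opinion generation that better matches real judicial practice.

Link: https://arxiv.org/abs/2509.00783
Authors: Weizhe Shi, Qiqi Wang, Yihong Pan, Qian Liu, Kaiqi Zhao
Affiliations: The University of Auckland, New Zealand; Nankai University, China; Harbin Institute of Technology, Shenzhen, China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: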

Abstract:A criminal judicial opinion represents the judge’s disposition of a case, including the decision rationale and sentencing. Automatically generating such opinions can assist in analyzing sentencing consistency and provide judges with references to similar past cases. However, current research typically approaches this task by dividing it into two isolated subtasks: legal reasoning and sentencing prediction. This separation often leads to inconsistency between the reasoning and predictions, failing to meet real-world judicial requirements. Furthermore, prior studies rely on manually curated knowledge to enhance applicability, yet such methods remain limited in practical deployment. To address these limitations and better align with legal practice, we propose a new LegalAI task: Judicial Opinion Generation, which simultaneously produces both legal reasoning and sentencing decisions. To achieve this, we introduce LegalChainReasoner, a framework that applies structured legal chains to guide the model through comprehensive case assessments. By integrating factual premises, composite legal conditions, and sentencing conclusions, our approach ensures flexible knowledge injection and end-to-end opinion generation. Experiments on two real-world and open-source Chinese legal case datasets demonstrate that our method outperforms baseline models.

[NLP-131] Aligning Reasoning LLMs for Materials Discovery with Physics-aware Rejection Sampling

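【Quick read】: This paper addresses a challenge in AI-driven materials discovery: building process-aware recipe-to-property predictors that are accurate, well calibrated, and physically admissible, in support of closed-loop materials design that couples automated experimentation with algorithmic decision-making. The core problem is that existing training pipelines select reasoning traces using binary correctness or preference signals, which do not ensure physical admissibility. The key is Physics-aware Rejection Sampling (PaRS), a training-time selection scheme that favors traces consistent with fundamental physics and numerically close to the target, with lightweight halting to control compute. Fine-tuning a large student model on traces synthesized by a larger teacher model improves accuracy and calibration, reduces physics-violation rates, and lowers sampling cost, indicating that domain constraints combined with trace-level selection are a practical path toward reliable, efficient models for process-aware property prediction.

Link: https://arxiv.org/abs/2509.00768
Authors: Lee Hyun, Sohee Yoon, Jinwoo Park, Sue In Chae, Seongeon Park, Jooyeon Ahn, Yebin Jung, Youjung Chung, Hogeun Chang, Myeonginn Kang, Jina Kim, Ho-Gyeong Kim, Myeonghun Jeong
Affiliations: Materials AI Lab (AI Center), Samsung Electronics; Samsung Advanced Institute of Technology (SAIT)
Categories: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Computation and Language (cs.CL)
Comments: 14 pages, 5 figures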

Abstract:AI-driven materials discovery that couples automated experimentation with algorithmic decision-making requires process aware recipe to property predictors that are accurate, calibrated, and physically admissible. We approach this as a reasoning problem with large reasoning models (LRMs). To instill reasoning capability into language models, we curate reasoning traces from a teacher model to train a student model. However, most training pipelines select reasoning traces using binary correctness or learned preference signals that poorly reflect physical admissibility. We introduce Physics-aware Rejection Sampling (PaRS), a training-time trace selection scheme that favors traces consistent with fundamental physics and numerically close to targets, with lightweight halting to control compute. We instantiate our framework with a large student model fine-tuned on traces synthesized by a larger teacher model, and evaluate under matched token budgets against various rejection sampling baselines. Our method improves accuracy and calibration, reduces physics-violation rates, and lowers sampling cost relative to baselines. These results indicate that modest, domain-aware constraints combined with trace-level selection provide a practical path toward reliable, efficient LRMs for process-aware property prediction and closed-loop materials design.
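The selection scheme described above (keep traces that are physically admissible and numerically close to the target, and stop sampling early) can be sketched as a filtered loop. Everything concrete here is an assumption for illustration: the non-negativity check standing in for a physics constraint, the relative tolerance, the `needed` halting threshold, and the trace dictionary layout are not taken from the paper.

```python
def select_traces(candidates, target, is_physical, rel_tol=0.1, needed=2):
    """Keep traces whose prediction passes a physics check and lies within a
    relative tolerance of the target; halt once enough traces are accepted."""
    accepted = []
    for trace in candidates:
        prediction = trace["prediction"]
        if not is_physical(prediction):
            continue  # reject physically inadmissible predictions
        if abs(prediction - target) > rel_tol * abs(target):
            continue  # reject predictions far from the measured target
        accepted.append(trace)
        if len(accepted) >= needed:
            break  # lightweight halting: stop drawing further samples
    return accepted


examined = []  # records which samples were actually drawn


def sample_traces():
    """Toy stand-in for a teacher model that emits traces one at a time."""
    for value in [-5.0, 80.0, 101.0, 99.0, 100.0]:
        examined.append(value)
        yield {"reasoning": "toy trace", "prediction": value}


kept = select_traces(
    sample_traces(),
    target=100.0,
    is_physical=lambda v: v >= 0.0,  # assumed constraint: property is non-negative
)
print([t["prediction"] for t in kept], examined)
```

Because the candidates come from a generator, halting after two accepted traces means the fifth sample is never drawn, which is where the sampling-cost saving would come from in a real pipeline.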

[NLP-132] Decomposing and Revising What Language Models Generate

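【Quick read】: This paper addresses the lack of attribution and factual accuracy in question answering (QA) with large language models (LLMs). Existing question-decomposition approaches generate irrelevant or incomplete questions and fail to aggregate evidence snippets across documents. The key is FIDES (Faithful Context Enhanced Fact Decomposition and Evidence Aggregation), a fact-decomposition framework that uses a contextually enhanced two-stage faithful decomposition to split long-form answers into sub-facts, retrieves related evidence snippets for each sub-fact, revises sub-facts that conflict with the evidence, and finally aggregates the evidence according to the structure of the original answer, improving the traceability and precision of answers.

Link: https://arxiv.org/abs/2509.00765
Authors: Zhichao Yan, Jiaoyan Chen, Jiapu Wang, Xiaoli Li, Ru Li, Jeff Z. Pan
Affiliations: Shanxi University; University of Manchester; Beijing University of Technology; Singapore University of Technology and Design; University of Edinburgh
Categories: Computation and Language (cs.CL)
Comments: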

Abstract:Attribution is crucial in question answering (QA) with Large Language Models (LLMs). SOTA question decomposition-based approaches use long form answers to generate questions for retrieving related documents. However, the generated questions are often irrelevant and incomplete, resulting in a loss of facts in this http URL approaches also fail to aggregate evidence snippets from different documents and paragraphs. To tackle these problems, we propose a new fact decomposition-based framework called FIDES (faithful context enhanced fact decomposition and evidence aggregation) for attributed QA. FIDES uses a contextually enhanced two-stage faithful decomposition method to decompose long form answers into sub-facts, which are then used by a retriever to retrieve related evidence snippets. If the retrieved evidence snippets conflict with the related sub-facts, such sub-facts will be revised accordingly. Finally, the evidence snippets are aggregated according to the original this http URL evaluation has been conducted with six datasets, with an additionally proposed new metric called Attr_auto-P for evaluating the evidence precision. FIDES outperforms the SOTA methods by over 14% on average with GPT-3.5-turbo, Gemini and Llama 70B series.

[NLP-133] L-MARS – Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search

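【Quick read】: This paper addresses the hallucination and uncertainty that large language models (LLMs) exhibit in legal question answering. The key is L-MARS, a system built on coordinated multi-agent reasoning and agentic search: it decomposes complex legal questions into subproblems, issues targeted searches across heterogeneous sources (Serper web search, local retrieval-augmented generation (RAG), CourtListener case law), and has a Judge Agent verify the sufficiency, jurisdiction, and temporal validity of the evidence, forming an iterative reasoning-search-verification loop. This design improves the accuracy and trustworthiness of answers and grounds them in authoritative law.

Link: https://arxiv.org/abs/2509.00761
Authors: Ziqi Wang, Boqin Yuan
Affiliations: University of Southern California; University of California, San Diego
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: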

Abstract:We present L-MARS (Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search), a system that reduces hallucination and uncertainty in legal question answering through coordinated multi-agent reasoning and retrieval. Unlike single-pass retrieval-augmented generation (RAG), L-MARS decomposes queries into subproblems, issues targeted searches across heterogeneous sources (Serper web, local RAG, CourtListener case law), and employs a Judge Agent to verify sufficiency, jurisdiction, and temporal validity before answer synthesis. This iterative reasoning-search-verification loop maintains coherence, filters noisy evidence, and grounds answers in authoritative law. We evaluated L-MARS on LegalSearchQA, a new benchmark of 200 up-to-date multiple choice legal questions in 2025. Results show that L-MARS substantially improves factual accuracy, reduces uncertainty, and achieves higher preference scores from both human experts and LLM-based judges. Our work demonstrates that multi-agent reasoning with agentic search offers a scalable and reproducible blueprint for deploying LLMs in high-stakes domains requiring precise legal retrieval and deliberation.

[NLP-134] LLM Encoder vs. Decoder: Robust Detection of Chinese AI-Generated Text with LoRA

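【Quick read】: This paper addresses the accuracy of detecting Chinese AI-generated text, in particular the performance drop that existing methods suffer under subtle linguistic nuances and distribution shift. The key is a decoder-based large language model (LLM), Alibaba's Qwen2.5-7B, fine-tuned parameter-efficiently with Low-Rank Adaptation (LoRA) using instruction-format input and a lightweight classification head. On the NLPCC 2025 Chinese AI-generated text detection dataset it reaches 95.94% test accuracy, well above the encoder models (RoBERTa-wwm-ext-large and Chinese BERT-large) and a FastText baseline, indicating stronger generalization and robustness to dataset-specific artifacts.

Link: https://arxiv.org/abs/2509.00731
Authors: Houji Jin, Negin Ashrafi, Armin Abdollahi, Wei Liu, Jian Wang, Ganyu Gui, Maryam Pishgar, Huanghao Feng
Affiliations: Suzhou University of Technology; University of Southern California
Categories: Computation and Language (cs.CL)
Comments: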

Abstract:The rapid growth of large language models (LLMs) has heightened the demand for accurate detection of AI-generated text, particularly in languages like Chinese, where subtle linguistic nuances pose significant challenges to current methods. In this study, we conduct a systematic comparison of encoder-based Transformers (Chinese BERT-large and RoBERTa-wwm-ext-large), a decoder-only LLM (Alibaba’s Qwen2.5-7B/DeepSeek-R1-Distill-Qwen-7B fine-tuned via Low-Rank Adaptation, LoRA), and a FastText baseline using the publicly available dataset from the NLPCC 2025 Chinese AI-Generated Text Detection Task. Encoder models were fine-tuned using a novel prompt-based masked language modeling approach, while Qwen2.5-7B was adapted for classification with an instruction-format input and a lightweight classification head trained via LoRA. Experiments reveal that although encoder models nearly memorize training data, they suffer significant performance degradation under distribution shifts (RoBERTa: 76.3% test accuracy; BERT: 79.3%). FastText demonstrates surprising lexical robustness (83.5% accuracy) yet lacks deeper semantic understanding. In contrast, the LoRA-adapted Qwen2.5-7B achieves 95.94% test accuracy with balanced precision-recall metrics, indicating superior generalization and resilience to dataset-specific artifacts. These findings underscore the efficacy of decoder-based LLMs with parameter-efficient fine-tuning for robust Chinese AI-generated text detection. Future work will explore next-generation Qwen3 models, distilled variants, and ensemble strategies to enhance cross-domain robustness further.

[NLP-135] On Verifiable Legal Reasoning: A Multi-Agent Framework with Formalized Knowledge Representations CIKM’25

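【Quick read】: This paper addresses two core challenges for AI in legal reasoning: precise interpretation of statutory language and consistent application of complex rules. End-to-end approaches offer little transparency or verifiability, which undermines trust in model decisions. The key is a modular multi-agent framework that splits legal reasoning into two stages. In the first stage, specialized agents extract legal concepts and formalize rules into verifiable intermediate representations. In the second stage, this structured knowledge is applied to specific cases through three steps: query analysis, symbolic inference, and a programmatic implementation. The architecture bridges natural language understanding and symbolic reasoning, improving the transparency, consistency, and explainability of legal reasoning. On statutory tax calculation tasks it raises foundational-model accuracy from 18.8% to 76.4%.

Link: https://arxiv.org/abs/2509.00710
Authors: Albert Sadowski, Jarosław A. Chudziak
Affiliations: Warsaw University of Technology
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted for publication at the 34th ACM International Conference on Information and Knowledge Management (CIKM '25)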

Abstract:Legal reasoning requires both precise interpretation of statutory language and consistent application of complex rules, presenting significant challenges for AI systems. This paper introduces a modular multi-agent framework that decomposes legal reasoning into distinct knowledge acquisition and application stages. In the first stage, specialized agents extract legal concepts and formalize rules to create verifiable intermediate representations of statutes. The second stage applies this knowledge to specific cases through three steps: analyzing queries to map case facts onto the ontology schema, performing symbolic inference to derive logically entailed conclusions, and generating final answers using a programmatic implementation that operationalizes the ontological knowledge. This bridging of natural language understanding with symbolic reasoning provides explicit and verifiable inspection points, significantly enhancing transparency compared to end-to-end approaches. Evaluation on statutory tax calculation tasks demonstrates substantial improvements, with foundational models achieving 76.4% accuracy compared to 18.8% baseline performance, effectively narrowing the performance gap between reasoning and foundational models. These findings suggest that modular architectures with formalized knowledge representations can make sophisticated legal reasoning more accessible through computationally efficient models while enhancing consistency and explainability in AI legal reasoning, establishing a foundation for future research into more transparent, trustworthy, and effective AI systems for legal domain.

[NLP-136] Designing LMS and Instructional Strategies for Integrating Generative-Conversational AI

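【Quick read】: This paper addresses the difficulty of delivering personalized, scalable, and pedagogically coherent learning experiences in higher education. The key is a structured design framework for an AI-powered Learning Management System (AI-LMS) that integrates generative and conversational AI. The system is built from modular components (configurable prompts, adaptive feedback loops, and multi-agent conversation flows), and its instructional strategies are grounded in behaviorist, constructivist, and connectivist learning theories, supporting adaptive, interactive, and learner-centered instruction.

Link: https://arxiv.org/abs/2509.00709
Authors: Elias Ra, Seung Je Kim, Eui-Yeong Seo, Geunju So
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: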

Abstract:Higher education faces growing challenges in delivering personalized, scalable, and pedagogically coherent learning experiences. This study introduces a structured framework for designing an AI-powered Learning Management System (AI-LMS) that integrates generative and conversational AI to support adaptive, interactive, and learner-centered instruction. Using a design-based research (DBR) methodology, the framework unfolds through five phases: literature review, SWOT analysis, development of ethical-pedagogical principles, system design, and instructional strategy formulation. The resulting AI-LMS features modular components – including configurable prompts, adaptive feedback loops, and multi-agent conversation flows – aligned with pedagogical paradigms such as behaviorist, constructivist, and connectivist learning theories. By combining AI capabilities with human-centered design and ethical safeguards, this study advances a practical model for AI integration in education. Future research will validate and refine the system through real-world implementation.

[NLP-137] Reward-Weighted Sampling: Enhancing Non-Autoregressive Characteristics in Masked Diffusion LLMs EMNLP2025

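【Quick read】: This paper addresses the problem that masked diffusion models (MDMs), when decoded with standard confidence-based independent token selection, tend to generate in an order that resembles autoregressive decoding, which weakens the benefits of non-autoregressive modeling. The key is Reward-Weighted Sampling (RWS), a decoding strategy that uses an external reward model as a global signal during the iterative diffusion process. At each diffusion step, RWS scores the entire intermediate sequence and scales the token logits accordingly, so that token selection reflects sequence-level coherence. By raising the confidence of tokens that initially score lower, it encourages a more non-autoregressive generation order. The paper shows theoretically that this scaling induces beneficial rank reversals and improves expected reward, and experiments report gains in non-autoregressive behavior and across multiple evaluation metrics.

Link: https://arxiv.org/abs/2509.00707
Authors: Daehoon Gwak, Minseo Jung, Junwoo Park, Minho Park, ChaeHun Park, Junha Hyung, Jaegul Choo
Affiliations: KAIST AI; Applied Artificial Intelligence, Sungkyunkwan University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2025 Main Paper (Long)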

Abstract:Masked diffusion models (MDMs) offer a promising non-autoregressive alternative for large language modeling. Standard decoding methods for MDMs, such as confidence-based sampling, select tokens independently based on individual token confidences at each diffusion step. However, we observe that this independent token selection often results in generation orders resembling sequential autoregressive processes, limiting the advantages of non-autoregressive modeling. To mitigate this phenomenon, we propose Reward-Weighted Sampling (RWS), a novel decoding strategy that leverages an external reward model to provide a principled global signal during the iterative diffusion process. Specifically, at each diffusion step, RWS evaluates the quality of the entire intermediate sequence and scales token logits accordingly, guiding token selection by integrating global sequence-level coherence. This method selectively increases the confidence of tokens that initially have lower scores, thereby promoting a more non-autoregressive generation order. Furthermore, we provide theoretical justification showing that reward-weighted logit scaling induces beneficial rank reversals in token selection and consistently improves expected reward. Experiments demonstrate that RWS significantly promotes non-autoregressive generation orders, leading to improvements across multiple evaluation metrics. These results highlight the effectiveness of integrating global signals in enhancing both the non-autoregressive properties and overall performance of MDMs.
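How scaling logits can change which masked position is filled next may be easier to see in a toy example. The sketch below assumes the reward is mapped to a single multiplicative `scale` applied to every position's logits, and that confidence is the maximum softmax probability; both are assumptions made for illustration, since the abstract does not spell out the exact scaling rule, and the logits are invented.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def confidence(logits, scale=1.0):
    """Confidence of a masked position: top softmax probability after the
    logits are multiplied by a (reward-derived, here assumed) scale."""
    return max(softmax([scale * x for x in logits]))


def next_position(masked_logits, scale=1.0):
    """Confidence-based selection: unmask the most confident position."""
    return max(masked_logits, key=lambda pos: confidence(masked_logits[pos], scale))


# Two still-masked positions with invented logits over a 4-token vocabulary.
masked_logits = {
    0: [1.2, 1.0, -5.0, -5.0],  # two close candidates
    3: [1.0, 0.0, 0.0, 0.0],    # one clear candidate, flatter distribution
}

unscaled_pick = next_position(masked_logits, scale=1.0)
scaled_pick = next_position(masked_logits, scale=3.0)
print(unscaled_pick, scaled_pick)
```

Without scaling, the leftmost position wins and decoding proceeds left to right. With the larger scale, the later position overtakes it, a small instance of the rank reversal the abstract refers to.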

[NLP-138] Learning to Shop Like Humans: A Review-driven Retrieval-Augmented Recommendation Framework with LLMs

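【Quick read】: This paper addresses two core challenges in applying large language models (LLMs) to review-based recommendation: limited context windows make it inefficient to use user reviews dynamically, and there is no effective mechanism for prioritizing the reviews most relevant to the user's current decision context. It proposes the RevBrowse framework, whose key component is PrefRAG, a retrieval-augmented module that disentangles user and item representations into structured forms and adaptively retrieves preference-relevant content conditioned on the target item, improving the relevance and efficiency of review usage. The approach improves recommendation quality and also interpretability, because the reviews that influence the final recommendation are visible.

Link: https://arxiv.org/abs/2509.00698
Authors: Kaiwen Wei, Jinpeng Gao, Jiang Zhong, Yuming Yang, Fengmao Lv, Zhenyang Li
Affiliations: Chongqing University; Southwest Jiaotong University; Hong Kong Generative AI Research and Development Center, City University of Hong Kong
Categories: Computation and Language (cs.CL)
Comments: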

Abstract:Large language models (LLMs) have shown strong potential in recommendation tasks due to their strengths in language understanding, reasoning and knowledge integration. These capabilities are especially beneficial for review-based recommendation, which relies on semantically rich user-generated texts to reveal fine-grained user preferences and item attributes. However, effectively incorporating reviews into LLM-based recommendation remains challenging due to (1) the inefficiency of dynamically utilizing user reviews under LLMs’ constrained context windows, and (2) the lack of effective mechanisms to prioritize reviews most relevant to the user’s current decision context. To address these challenges, we propose RevBrowse, a review-driven recommendation framework inspired by the “browse-then-decide” decision process commonly observed in online user behavior. RevBrowse integrates user reviews into the LLM-based reranking process to enhance its ability to distinguish between candidate items. To improve the relevance and efficiency of review usage, we introduce PrefRAG, a retrieval-augmented module that disentangles user and item representations into structured forms and adaptively retrieves preference-relevant content conditioned on the target item. Extensive experiments on four Amazon review datasets demonstrate that RevBrowse achieves consistent and significant improvements over strong baselines, highlighting its generalizability and effectiveness in modeling dynamic user preferences. Furthermore, since the retrieval-augmented process is transparent, RevBrowse offers a certain level of interpretability by making visible which reviews influence the final recommendation.

[NLP-139] CE-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders

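【Quick read】: This paper addresses the lack of automated evaluation methods for sparse autoencoders applied to large language models (LLMs), which has hindered the discovery of interpretable features and broader adoption. The key is CE-Bench, a lightweight contrastive evaluation benchmark built on a curated dataset of contrastive story pairs. It reliably measures the interpretability of sparse autoencoders without requiring an external LLM, and experiments show that it aligns well with existing benchmarks.

Link: https://arxiv.org/abs/2509.00691
Authors: Alex Gulko, Yusen Peng, Sachin Kumar
Affiliations: The Ohio State University
Categories: Computation and Language (cs.CL)
Comments: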

Abstract:Probing with sparse autoencoders is a promising approach for uncovering interpretable features in large language models (LLMs). However, the lack of automated evaluation methods has hindered their broader adoption and development. In this work, we introduce CE-Bench, a novel and lightweight contrastive evaluation benchmark for sparse autoencoders, built on a curated dataset of contrastive story pairs. We conduct comprehensive ablation studies to validate the effectiveness of our approach. Our results show that CE-Bench reliably measures the interpretability of sparse autoencoders and aligns well with existing benchmarks, all without requiring an external LLM. The official implementation and evaluation dataset are open-sourced under the MIT License.

[NLP-140] Text Reinforcement for Multimodal Time Series Forecasting

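【Quick read】: This paper addresses unstable performance in multimodal time series forecasting (TSF) when the text modality is inaccurate or incomplete. Existing methods rely on high-quality text input, but the original text may not fully capture the information carried by the historical time series, which limits forecasting performance. The key is a text reinforcement model (TeR) that generates reinforced text to compensate for weaknesses in the original text. TeR is optimized with a reinforcement learning reward based on each reinforced text's impact on the multimodal TSF model's performance and on its relevance to the task, which improves the quality of the generated text and, in turn, forecasting performance.

Link: https://arxiv.org/abs/2509.00687
Authors: Chen Su, Yuanhe Tian, Yan Song, Yongdong Zhang
Affiliations: University of Science and Technology of China; University of Washington
Categories: Computation and Language (cs.CL)
Comments: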

Abstract:Recent studies in time series forecasting (TSF) use multimodal inputs, such as text and historical time series data, to predict future values. These studies mainly focus on developing advanced techniques to integrate textual information with time series data to perform the task and achieve promising results. Meanwhile, these approaches rely on high-quality text and time series inputs, whereas in some cases, the text does not accurately or fully capture the information carried by the historical time series, which leads to unstable performance in multimodal TSF. Therefore, it is necessary to enhance the textual content to improve the performance of multimodal TSF. In this paper, we propose improving multimodal TSF by reinforcing the text modalities. We propose a text reinforcement model (TeR) to generate reinforced text that addresses potential weaknesses in the original text, then apply this reinforced text to support the multimodal TSF model’s understanding of the time series, improving TSF performance. To guide the TeR toward producing higher-quality reinforced text, we design a reinforcement learning approach that assigns rewards based on the impact of each reinforced text on the performance of the multimodal TSF model and its relevance to the TSF task. We optimize the TeR accordingly, so as to improve the quality of the generated reinforced text and enhance TSF performance. Extensive experiments on a real-world benchmark dataset covering various domains demonstrate the effectiveness of our approach, which outperforms strong baselines and existing studies on the dataset.

[NLP-141] Do small language models generate realistic variable-quality fake news headlines?

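【Quick read】: This paper examines how readily small language models (SLMs) comply when explicitly instructed to generate fake news headlines, and how closely their output resembles real news headlines. Using controlled prompt engineering, the study generated 24,000 low-quality and high-quality fake headlines and evaluated them with existing DistilBERT-based and bagging-classifier headline quality detectors. The key findings are that the tested SLMs showed high compliance with little ethical resistance, and that the detectors frequently misclassified headline quality, with accuracies of only 35.2% to 63.5%. The authors take this low accuracy as evidence that the generated headlines do not closely resemble existing, primarily human-written web content.

Link: https://arxiv.org/abs/2509.00680
Authors: Austin McCutcheon, Chris Brogly
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: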

Abstract:Small language models (SLMs) have the capability for text generation and may potentially be used to generate falsified texts online. This study evaluates 14 SLMs (1.7B-14B parameters) including LLaMA, Gemma, Phi, SmolLM, Mistral, and Granite families in generating perceived low and high quality fake news headlines when explicitly prompted, and whether they appear to be similar to real-world news headlines. Using controlled prompt engineering, 24,000 headlines were generated across low-quality and high-quality deceptive categories. Existing machine learning and deep learning-based news headline quality detectors were then applied against these SLM-generated fake news headlines. SLMs demonstrated high compliance rates with minimal ethical resistance, though there were some occasional exceptions. Headline quality detection using established DistilBERT and bagging classifier models showed that quality misclassification was common, with detection accuracies only ranging from 35.2% to 63.5%. These findings suggest the following: tested SLMs generally are compliant in generating falsified headlines, although there are slight variations in ethical restraints, and the generated headlines did not closely resemble existing primarily human-written content on the web, given the low quality classification accuracy.

[NLP-142] Router Upcycling: Leveraging Mixture-of-Routers in Mixture-of-Experts Upcycling

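[Quick read]: This paper addresses a performance bottleneck in Mixture-of-Experts (MoE) upcycling (the reuse of existing model components): simple routers, such as linear routers, struggle with complex routing tasks. The key to the solution is a new technique called Router Upcycling. During upcycling, multiple routers are initialized from the attention heads of the preceding attention layers, and these routers collaboratively assign tokens to specialized experts in an attention-like manner. Each token is projected into diverse queries that are aligned with the experts' features (serving as keys), which enables more precise expert selection. Experiments show that the method achieves state-of-the-art (SOTA) performance.

Link: https://arxiv.org/abs/2509.00679
Authors: Junfeng Ran, Guangxiang Zhao, Yuhan Wu, Dawei Zhu, Longyun Wu, Yikai Zhao, Tong Yang, Lin Sun, Xiangzheng Zhang, Sujian Li
Affiliations: Qiyuan Tech; National Key Laboratory for Multimedia Information Processing, Peking University
Categories: Computation and Language (cs.CL)
Comments: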

Abstract:The Mixture-of-Experts (MoE) models have gained significant attention in deep learning due to their dynamic resource allocation and superior performance across diverse tasks. However, efficiently training these models remains challenging. The MoE upcycling technique has been proposed to reuse and improve existing model components, thereby minimizing training overhead. Despite this, simple routers, such as linear routers, often struggle with complex routing tasks within MoE upcycling. In response, we propose a novel routing technique called Router Upcycling to enhance the performance of MoE upcycling models. Our approach initializes multiple routers from the attention heads of preceding attention layers during upcycling. These routers collaboratively assign tokens to specialized experts in an attention-like manner. Each token is processed into diverse queries and aligned with the experts’ features (serving as keys). Experimental results demonstrate that our method achieves state-of-the-art (SOTA) performance, outperforming other upcycling baselines.
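The attention-like routing described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function `route_token`, the averaging of router logits, and the top-k renormalization are all assumptions made for illustration, and the projection matrices here are plain lists rather than weights copied from real attention heads.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def route_token(token, router_projs, expert_keys, top_k=2):
    """Score experts with several routers (hypothetical scheme).

    router_projs: one query projection matrix per router.
    expert_keys:  per router, one key vector per expert.
    Returns the top-k (expert_id, weight) pairs, weights summing to 1.
    """
    num_experts = len(expert_keys[0])
    combined = [0.0] * num_experts
    for proj, keys in zip(router_projs, expert_keys):
        query = matvec(proj, token)          # token -> router-specific query
        scale = math.sqrt(len(query))        # scaled dot product, as in attention
        for e, key in enumerate(keys):
            logit = sum(q * k for q, k in zip(query, key)) / scale
            combined[e] += logit / len(router_projs)  # assumed: average over routers
    probs = softmax(combined)
    top = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)[:top_k]
    norm = sum(probs[e] for e in top)
    return [(e, probs[e] / norm) for e in top]

identity = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
routed = route_token([1.0, 0.0], [identity, identity], [keys, keys])
```

In this toy setup the token points along the first axis, so the expert whose key lies on that axis receives the largest weight.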

[NLP-143] Confident, Calibrated, or Complicit: Probing the Trade-offs between Safety Alignment and Ideological Bias in Language Models in Detecting Hate Speech

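[Quick read]: This paper addresses the tension between objectivity and performance when large language models (LLMs) detect explicit and implicit hate speech, asking whether models with minimal safety alignment (uncensored) classify with less bias than heavily safety-aligned (censored) models. Although uncensored models are in theory more "objective" because they are free of moral guardrails, they perform markedly worse than censored models (64.1% versus 78.7% strict accuracy), which gain robustness from safety alignment. The censored models, however, are ideologically anchored by that alignment and resist persona-based influence, whereas uncensored models are highly malleable to ideological framing. The key to the solution is therefore to build new auditing frameworks that account for fairness, calibration, and ideological consistency, in order to overcome the bias, misjudgment, and unreliable confidence inherent in current LLM-based hate speech detection.

Link: https://arxiv.org/abs/2509.00673
Authors: Sanjeeevan Selvaganapathy, Mehwish Nasim
Affiliations: The University of Western Australia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: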

Abstract:We investigate the efficacy of Large Language Models (LLMs) in detecting implicit and explicit hate speech, examining whether models with minimal safety alignment (uncensored) might provide more objective classification capabilities compared to their heavily-aligned (censored) counterparts. While uncensored models theoretically offer a less constrained perspective free from moral guardrails that could bias classification decisions, our results reveal a surprising trade-off: censored models significantly outperform their uncensored counterparts in both accuracy and robustness, achieving 78.7% versus 64.1% strict accuracy. However, this enhanced performance comes with its own limitation – the safety alignment acts as a strong ideological anchor, making censored models resistant to persona-based influence, while uncensored models prove highly malleable to ideological framing. Furthermore, we identify critical failures across all models in understanding nuanced language such as irony. We also find alarming fairness disparities in performance across different targeted groups and systemic overconfidence that renders self-reported certainty unreliable. These findings challenge the notion of LLMs as objective arbiters and highlight the need for more sophisticated auditing frameworks that account for fairness, calibration, and ideological consistency.

[NLP-144] Can Multi-turn Self-refined Single Agent LMs with Retrieval Solve Hard Coding Problems?

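[Quick read]: This paper addresses the capability bottleneck of generative AI in competitive programming, which demands rigorous algorithmic thinking, logical reasoning, and code implementation. Existing language models (LMs) perform poorly in this domain and cannot effectively decompose and solve complex problems. The key to the solution is the ICPC benchmark (254 International Collegiate Programming Contest problems with official analysis, reference code, and several tiers of tests) together with an inference technique that combines multi-turn self-judge, reflection, and retrieval over episodic information, which raises o1's pass@1 solve rate from 19.1% under zero-shot chain-of-thought prompting to 42.2%. A human-in-the-loop study further exposes the remaining difficulties and shows that a few specific instructions can unlock previously unsolved problems, providing empirical grounding and a methodological path toward language models with grounded, imaginative, and algorithmic thinking.

Link: https://arxiv.org/abs/2509.00629
Authors: Md Tanzib Hosain, Md Kishor Morol
Affiliations: American International University-Bangladesh; Cornell University; ELITE Research Lab
Categories: Computation and Language (cs.CL)
Comments: Accepted in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Student Research Workshop), 2025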

Abstract:Among the hardest tasks for humans are those found in competitive programming where problems require sophisticated algorithmic thinking, puzzle solving, and the creation of effective code. As a domain to assess language models (LMs), it has not received enough attention, though. This study presents the ICPC benchmark, which consists of 254 international collegiate programming contest (ICPC) tasks. Each problem includes official analysis, reference code, and sample, high-quality unit, and hidden tests. We are able to develop and evaluate a variety of LM inference techniques for competitive programming with these resources. With zero-shot chain-of-thought prompting, we find that o1 only achieves a 19.1% pass@1 solve rate. Our best inference technique, which combines multi-turn self-judge with reflection and retrieval over episodic information, raises this to 42.2%. Furthermore, we conduct a new human-in-the-loop investigation to gain a deeper understanding of the remaining difficulties. Surprisingly, we discover that o1 can solve 17 out of 18 problems that were previously unsolvable by any model or technique with just a few specific instructions. A footstep toward LMs with grounded, imaginative, and algorithmic thinking is provided by our quantitative findings and qualitative research. We open-source our code and data at this https URL.
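For readers unfamiliar with the pass@1 metric quoted above, the following is a minimal sketch of the unbiased pass@k estimator commonly used in code-generation evaluation. Whether the paper computes pass@1 exactly this way is an assumption; the abstract does not say. Here `n` is the number of sampled solutions per problem and `c` the number that pass all tests.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn without
    replacement from n generated solutions (c of them correct), passes."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct solution
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to the fraction of correct samples, c / n, so a benchmark-level pass@1 is the mean of those fractions across problems.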

[NLP-145] A Multi-Strategy Approach for AI-Generated Text Detection

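[Quick read]: This paper addresses the detection of AI-generated content in news articles and academic abstracts. The key to the solution is building and comparing three classification systems: (1) a fine-tuned RoBERTa-base classifier, (2) a classical machine learning approach based on TF-IDF and a support vector machine (SVM), and (3) a novel ensemble model named Candace, which uses probabilistic features extracted from multiple Llama-3.2 models and fuses them with a custom Transformer. Experiments show that the RoBERTa-based system achieves near-perfect performance on both the development and test sets, making it the best of the three.

Link: https://arxiv.org/abs/2509.00623
Authors: Ali Zain, Sareem Farooqui, Muhammad Rafi
Affiliations: National University of Computer and Emerging Sciences, FAST
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: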

Abstract:This paper presents three distinct systems developed for the M-DAIGT shared task on detecting AI-generated content in news articles and academic abstracts. The systems include: (1) a fine-tuned RoBERTa-base classifier, (2) a classical TF-IDF + Support Vector Machine (SVM) classifier, and (3) an innovative ensemble model named Candace, leveraging probabilistic features extracted from multiple Llama-3.2 models processed by a custom Transformer. The RoBERTa-based system emerged as the most performant, achieving near-perfect results on both development and test sets.

[NLP-146] Gated Associative Memory: A Parallel O(N) Architecture for Efficient Sequence Modeling

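[Quick read]: This paper addresses the quadratic (O(N²)) computational complexity that self-attention imposes on the Transformer architecture for long sequences, which limits its efficiency in long-context modeling. The key to the solution is a new, fully parallel architecture, the Gated Associative Memory (GAM) network, which achieves linear time complexity (O(N)) through two parallel pathways: a causal convolution that efficiently captures local, position-dependent context, and an associative memory retrieval mechanism that models global, content-based patterns. A gating mechanism fuses the two pathways dynamically, so the model can combine local and global information for each token, markedly improving training speed and scalability while maintaining strong performance.

Link: https://arxiv.org/abs/2509.00605
Authors: Rishiraj Acharya
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 11 pages, 4 figures, 3 tables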

Abstract:The Transformer architecture, underpinned by the self-attention mechanism, has become the de facto standard for sequence modeling tasks. However, its core computational primitive scales quadratically with sequence length (O(N^2)), creating a significant bottleneck for processing long contexts. In this paper, we propose the Gated Associative Memory (GAM) network, a novel, fully parallel architecture for sequence modeling that exhibits linear complexity (O(N)) with respect to sequence length. The GAM block replaces the self-attention layer with two parallel pathways: a causal convolution to efficiently capture local, position-dependent context, and a parallel associative memory retrieval mechanism to model global, content-based patterns. These pathways are dynamically fused using a gating mechanism, allowing the model to flexibly combine local and global information for each token. We implement GAM from scratch and conduct a rigorous comparative analysis against a standard Transformer model and a modern linear-time baseline (Mamba) on the WikiText-2 benchmark, as well as against the Transformer on the TinyStories dataset. Our experiments demonstrate that GAM is consistently faster, outperforming both baselines on training speed, and achieves a superior or competitive final validation perplexity across all datasets, establishing it as a promising and efficient alternative for sequence modeling.
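To make the two-pathway idea concrete, here is a rough sketch of a gated block in plain Python. It is an illustration under assumptions, not the paper's architecture: the fixed-size memory (`mem_keys`, `mem_values`), the scalar sigmoid gate `gate_w`, and the shared convolution kernel are all hypothetical choices. The cost per token depends only on the kernel length and memory size, so the total cost grows linearly with sequence length.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_conv(seq, kernel):
    """Local pathway: position t mixes tokens t, t-1, ..., never future ones."""
    out = []
    for t in range(len(seq)):
        acc = [0.0] * len(seq[0])
        for lag, w in enumerate(kernel):
            if t - lag >= 0:
                acc = [a + w * x for a, x in zip(acc, seq[t - lag])]
        out.append(acc)
    return out

def memory_read(x, mem_keys, mem_values):
    """Global pathway: content-based lookup in a fixed-size memory."""
    weights = softmax([dot(x, k) for k in mem_keys])
    dim = len(mem_values[0])
    return [sum(w * v[d] for w, v in zip(weights, mem_values)) for d in range(dim)]

def gam_block(seq, kernel, mem_keys, mem_values, gate_w):
    """Fuse both pathways per token with a scalar sigmoid gate (assumed form)."""
    local = causal_conv(seq, kernel)
    out = []
    for x, loc in zip(seq, local):
        glob = memory_read(x, mem_keys, mem_values)
        g = 1.0 / (1.0 + math.exp(-dot(gate_w, x)))
        out.append([g * l + (1.0 - g) * m for l, m in zip(loc, glob)])
    return out

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kernel = [0.5, 0.5]
mem_keys = [[1.0, 0.0], [0.0, 1.0]]
mem_values = [[1.0, -1.0], [-1.0, 1.0]]
gate_w = [0.3, -0.2]
out = gam_block(seq, kernel, mem_keys, mem_values, gate_w)
```

Because neither pathway looks at later positions, changing the last token leaves the outputs at earlier positions unchanged, which is the causality property a language model needs.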

[NLP-147] StealthEval: A Probe-Rewrite-Evaluate Workflow for Reliable Benchmarks

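[Quick read]: This paper addresses the problem that large language models (LLMs) behave differently in evaluation settings than in real deployment, a phenomenon called "evaluation awareness", which means benchmark performance may not accurately reflect a model's true safety and honesty. The key to the solution is a quantifiable intervention: a linear probe scores prompts on a continuous scale from "test-like" to "deploy-like", and an LLM rewriting strategy shifts prompts toward a more natural, deployment-style context while preserving the original task. Experiments show that the method raises the average probe score of prompts by 30% and induces significant, consistent behavioral changes across several state-of-the-art models: honest responses increase by 5.26% on average, deceptive responses decrease by 12.40%, and refusal rates rise by 6.38%. This confirms that evaluation awareness is a quantifiable and manipulable factor, and underscores the urgency of more realistic evaluation frameworks for measuring model alignment accurately.

Link: https://arxiv.org/abs/2509.00591
Authors: Lang Xiong, Nishant Bhargava, Wesley Chang, Jianhang Hong, Haihao Liu, Kevin Zhu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: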

Abstract:Large Language Models (LLMs) often exhibit significant behavioral shifts when they perceive a change from a real-world deployment context to a controlled evaluation setting, a phenomenon known as “evaluation awareness.” This discrepancy poses a critical challenge for AI alignment, as benchmark performance may not accurately reflect a model’s true safety and honesty. In this work, we systematically quantify these behavioral changes by manipulating the perceived context of prompts. We introduce a methodology that uses a linear probe to score prompts on a continuous scale from “test-like” to “deploy-like” and leverage an LLM rewriting strategy to shift these prompts towards a more natural, deployment-style context while preserving the original task. Using this method, we achieved a 30% increase in the average probe score across a strategic role-playing dataset after rewriting. Evaluating a suite of state-of-the-art models on these original and rewritten prompts, we find that rewritten “deploy-like” prompts induce a significant and consistent shift in behavior. Across all models, we observed an average increase in honest responses of 5.26% and a corresponding average decrease in deceptive responses of 12.40%. Furthermore, refusal rates increased by an average of 6.38%, indicating heightened safety compliance. Our findings demonstrate that evaluation awareness is a quantifiable and manipulable factor that directly influences LLM behavior, revealing that models are more prone to unsafe or deceptive outputs in perceived test environments. This underscores the urgent need for more realistic evaluation frameworks to accurately gauge true model alignment before deployment.
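As a small illustration of the scoring step, a linear probe can be sketched as a logistic function of a hidden-state vector. The weights, the choice of hidden state, and the reading of "30% increase" as a relative change in the mean score are all assumptions here; the abstract does not specify them.

```python
import math

def probe_score(hidden, weights, bias=0.0):
    """Linear probe: sigmoid(w . h + b). Higher is read here as more 'deploy-like'."""
    z = sum(w * h for w, h in zip(weights, hidden)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def mean(xs):
    return sum(xs) / len(xs)

def relative_increase(before, after):
    """Relative change of the mean probe score after rewriting (assumed definition)."""
    return (mean(after) - mean(before)) / mean(before)
```

Under this reading, probe scores for the original and rewritten prompts would be collected into two lists and compared with `relative_increase`.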

[NLP-148] Advanced spectral clustering for heterogeneous data in credit risk monitoring systems

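[Quick read]: This paper addresses the difficulty of using heterogeneous data in credit monitoring: such data contain both numerical financial variables and textual records, and traditional methods struggle to fuse the two. The key to the solution is Advanced Spectral Clustering (ASC), which integrates financial and textual similarities through an optimized weight parameter and selects eigenvectors with a novel eigenvalue-silhouette strategy, improving clustering quality and interpretability. On a dataset of 1,428 small and medium-sized enterprises, ASC achieves a Silhouette score 18% higher than a single-type data baseline, and the resulting clusters carry clear business meaning; for example, recruitment-focused firms associated with the term 'social recruitment' show a 30% lower default risk, which supports the effectiveness and robustness of the method.

Link: https://arxiv.org/abs/2509.00546
Authors: Lu Han, Mengyan Li, Jiping Qiang, Zhi Su
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 25 pages, 7 figures, 6 tables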

Abstract:Heterogeneous data, which encompass both numerical financial variables and textual records, present substantial challenges for credit monitoring. To address this issue, we propose Advanced Spectral Clustering (ASC), a method that integrates financial and textual similarities through an optimized weight parameter and selects eigenvectors using a novel eigenvalue-silhouette optimization approach. Evaluated on a dataset comprising 1,428 small and medium-sized enterprises (SMEs), ASC achieves a Silhouette score that is 18% higher than that of a single-type data baseline method. Furthermore, the resulting clusters offer actionable insights; for instance, 51% of low-risk firms are found to include the term ‘social recruitment’ in their textual records. The robustness of ASC is confirmed across multiple clustering algorithms, including k-means, k-medians, and k-medoids, with ΔIntra/Inter 0.13 and ΔSilhouette Coefficient 0.02. By bridging spectral clustering theory with heterogeneous data applications, ASC enables the identification of meaningful clusters, such as recruitment-focused SMEs exhibiting a 30% lower default risk, thereby supporting more targeted and effective credit interventions.
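The weighted fusion of two similarity sources can be sketched as follows. This is only a partial illustration under stated assumptions: the convex combination with weight `alpha`, the conversion of similarity to distance as `1 - similarity`, and the grid search against a fixed candidate partition are hypothetical simplifications, and the eigenvector selection step of spectral clustering is omitted entirely. In a full pipeline the clustering would be re-run for each candidate weight.

```python
def fuse_similarity(sim_fin, sim_txt, alpha):
    """Convex combination of financial and textual similarity matrices."""
    return [[alpha * f + (1 - alpha) * t for f, t in zip(rf, rt)]
            for rf, rt in zip(sim_fin, sim_txt)]

def silhouette(dist, labels):
    """Mean silhouette coefficient from a precomputed distance matrix."""
    n = len(labels)
    total = 0.0
    for i in range(n):
        groups = {}
        for j in range(n):
            if j != i:
                groups.setdefault(labels[j], []).append(dist[i][j])
        own = groups.pop(labels[i], [])
        if not own or not groups:
            continue  # singleton cluster or single cluster contributes 0
        a = sum(own) / len(own)                             # mean intra-cluster distance
        b = min(sum(g) / len(g) for g in groups.values())   # nearest other cluster
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n

def best_alpha(sim_fin, sim_txt, labels, grid):
    """Pick the weight whose fused distances best separate the given labels."""
    def score(alpha):
        fused = fuse_similarity(sim_fin, sim_txt, alpha)
        dist = [[1.0 - s for s in row] for row in fused]
        return silhouette(dist, labels)
    return max(grid, key=score)

sim_fin = [[1.0, 0.9, 0.1, 0.2], [0.9, 1.0, 0.2, 0.1],
           [0.1, 0.2, 1.0, 0.8], [0.2, 0.1, 0.8, 1.0]]
sim_txt = [[1.0, 0.6, 0.3, 0.3], [0.6, 1.0, 0.3, 0.3],
           [0.3, 0.3, 1.0, 0.7], [0.3, 0.3, 0.7, 1.0]]
labels = [0, 0, 1, 1]
```

With these toy matrices the financial similarities separate the two groups more sharply than the textual ones, so the grid search favors a weight that leans on the financial source.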

[NLP-149] Thinking Hard, Going Misaligned: Emergent Misalignment in LLMs

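[Quick read]: This paper addresses a safety alignment problem that can arise when the reasoning ability of large language models (LLMs) is strengthened, termed Reasoning-Induced Misalignment: when reasoning is enhanced by switching to "think-mode" or by fine-tuning on benign math datasets, models become more responsive to malicious requests. The key contribution is uncovering the internal mechanisms: attention shifts and the selection of specific experts in mixture-of-experts (MoE) architectures help redirect excessive reasoning toward safety guardrails. This in turn suggests that future work must strengthen reasoning while systematically shaping internal model states so that they stay consistent with human values.

Link: https://arxiv.org/abs/2509.00544
Authors: Hanqi Yan, Hainiu Xu, Yulan He
Affiliations: King’s College London; The Alan Turing Institute
Categories: Computation and Language (cs.CL)
Comments: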

Abstract:With Large Language Models (LLMs) becoming increasingly widely adopted, concerns regarding their safety and alignment with human values have intensified. Previous studies have shown that fine-tuning LLMs on narrow and malicious datasets induces misaligned behaviors. In this work, we report a more concerning phenomenon, Reasoning-Induced Misalignment. Specifically, we observe that LLMs become more responsive to malicious requests when reasoning is strengthened, via switching to “think-mode” or fine-tuning on benign math datasets, with dense models particularly vulnerable. Moreover, we analyze internal model states and find that both attention shifts and specialized experts in mixture-of-experts models help redirect excessive reasoning towards safety guardrails. These findings provide new insights into the emerging reasoning-safety trade-off and underscore the urgency of advancing alignment for advanced reasoning models.

[NLP-150] Modeling Motivated Reasoning in Law: Evaluating Strategic Role Conditioning in LLM Summarization

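[Quick read]: This paper addresses motivated reasoning induced by role conditioning when generative AI summarizes legal texts, that is, how a model selectively presents information to fit the position of the legal role it is prompted with (such as judge, prosecutor, or attorney). The key to the solution is a role-aware evaluation framework grounded in the inclusion of legal facts and reasoning, which also considers favorability toward stakeholders. The framework systematically identifies selective inclusion patterns under different role prompts and shows that, even with balancing instructions, models exhibit role-consistent bias, which highlights the importance of role-sensitive evaluation in high-stakes legal settings.

Link: https://arxiv.org/abs/2509.00529
Authors: Eunjung Cho, Alexander Hoyle, Yoan Hermstrüwer
Affiliations: ETH Zurich; University of Zurich; Max Planck Institute for Research on Collective Goods
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: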

Abstract:Large Language Models (LLMs) are increasingly used to generate user-tailored summaries, adapting outputs to specific stakeholders. In legal contexts, this raises important questions about motivated reasoning – how models strategically frame information to align with a stakeholder’s position within the legal system. Building on theories of legal realism and recent trends in legal practice, we investigate how LLMs respond to prompts conditioned on different legal roles (e.g., judges, prosecutors, attorneys) when summarizing judicial decisions. We introduce an evaluation framework grounded in legal fact and reasoning inclusion, also considering favorability towards stakeholders. Our results show that even when prompts include balancing instructions, models exhibit selective inclusion patterns that reflect role-consistent perspectives. These findings raise broader concerns about how similar alignment may emerge as LLMs begin to infer user roles from prior interactions or context, even without explicit role instructions. Our results underscore the need for role-aware evaluation of LLM summarization behavior in high-stakes legal settings.

[NLP-151] ERank: Fusing Supervised Fine-Tuning and Reinforcement Learning for Effective and Efficient Text Reranking

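[Quick read]: This paper addresses the trade-off between effectiveness and efficiency in current text reranking models built on large language models (LLMs). Pointwise methods trained with supervised fine-tuning (SFT) on binary labels lack scoring discrimination, especially when built on reasoning LLMs, while the listwise methods favored for complex reasoning are effective but computationally expensive and unsuitable for low-latency applications. The key to the solution is ERank, an effective and efficient pointwise reranker trained with a two-stage pipeline. In the first stage, generative SFT trains the model to output fine-grained integer scores, which sharpens relevance discrimination. In the second stage, a novel reinforcement learning (RL) strategy with a reward derived from listwise ranking injects global ranking awareness into the efficient pointwise architecture, achieving strong ranking performance while retaining efficiency.

Link: https://arxiv.org/abs/2509.00520
Authors: Yuzheng Cai, Yanzhao Zhang, Dingkun Long, Mingxin Li, Pengjun Xie, Weiguo Zheng
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: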

Abstract:Text reranking models are a crucial component in modern systems like Retrieval-Augmented Generation, tasked with selecting the most relevant documents prior to generation. However, current Large Language Models (LLMs) powered rerankers often face a fundamental trade-off. On one hand, Supervised Fine-Tuning based pointwise methods that frame relevance as a binary classification task lack the necessary scoring discrimination, particularly for those built on reasoning LLMs. On the other hand, approaches designed for complex reasoning often employ powerful yet inefficient listwise formulations, rendering them impractical for low latency applications. To resolve this dilemma, we introduce ERank, a highly effective and efficient pointwise reranker built from a reasoning LLM that excels across diverse relevance scenarios. We propose a novel two-stage training pipeline that begins with Supervised Fine-Tuning (SFT). In this stage, we move beyond binary labels and train the model generatively to output fine grained integer scores, which significantly enhances relevance discrimination. The model is then further refined using Reinforcement Learning (RL) with a novel, listwise derived reward. This technique instills global ranking awareness into the efficient pointwise architecture. We evaluate the ERank reranker on the BRIGHT, FollowIR, TREC DL, and BEIR benchmarks, demonstrating superior effectiveness and robustness compared to existing approaches. On the reasoning-intensive BRIGHT benchmark, our ERank-4B achieves an nDCG@10 of 38.7, while a larger 32B variant reaches a state of the art nDCG@10 of 40.2.
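The nDCG@10 figures quoted above measure ranking quality. As a minimal sketch, pointwise reranking sorts documents by independently assigned scores, and nDCG@k compares the resulting order with the ideal one. The linear-gain form of DCG is used here; an exponential-gain variant also exists, and which one a given benchmark harness uses is not stated in the abstract. The document ids, scores, and relevance labels below are made up for illustration.

```python
import math

def rerank(doc_ids, scores):
    """Pointwise reranking: each document has its own score; sort descending."""
    pairs = sorted(zip(doc_ids, scores), key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in pairs]

def dcg_at_k(gains, k):
    """Discounted cumulative gain with linear gains and a log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k; relevance maps judged doc ids to graded labels (unjudged = 0)."""
    gains = [relevance.get(doc, 0) for doc in ranked_ids]
    ideal = dcg_at_k(sorted(relevance.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

relevance = {"d1": 3, "d2": 0, "d3": 1}                # hypothetical graded labels
ranking = rerank(["d1", "d2", "d3"], [7, 2, 5])        # hypothetical integer scores
```

Fine-grained integer scores matter for this metric because ties between documents of different relevance leave their relative order to chance.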

[NLP-152] LLM-Assisted Iterative Evolution with Swarm Intelligence Toward SuperBrain

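[Quick read]: This paper addresses the lack of a dynamic evolutionary mechanism linking individual intelligence and group collaboration in current generative AI systems, in particular the inability of static prompt engineering and isolated agent simulations to support human-AI co-evolution and cross-agent knowledge integration. The key to the solution is SuperBrain, a collective intelligence framework. It starts from the cognitive dyad formed by persistent, personalized interaction between a large language model (LLM) and a human user; refines prompts and task performance through genetic-algorithm (GA)-assisted forward-backward evolution; coordinates multiple Subclass Brains via Swarm Intelligence for multi-objective optimization and the exchange of distilled heuristics; and finally integrates standardized behaviors and cognitive signatures into a Superclass Brain capable of abstraction, generalization, and self-improvement, with the aim of scalable, explainable, and ethically aligned collective AI.

Link: https://arxiv.org/abs/2509.00510
Authors: Li Weigang, Pedro Carvalho Brom, Lucas Ramson Siefert
Affiliations: TransLab, Computer Science Department, University of Brasilia; Math Department, Federal Institute of Brasilia
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 24 pages, 5 figures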

Abstract:We propose a novel SuperBrain framework for collective intelligence, grounded in the co-evolution of large language models (LLMs) and human users. Unlike static prompt engineering or isolated agent simulations, our approach emphasizes a dynamic pathway from Subclass Brain to Superclass Brain: (1) A Subclass Brain arises from persistent, personalized interaction between a user and an LLM, forming a cognitive dyad with adaptive learning memory. (2) Through GA-assisted forward-backward evolution, these dyads iteratively refine prompts and task performance. (3) Multiple Subclass Brains coordinate via Swarm Intelligence, optimizing across multi-objective fitness landscapes and exchanging distilled heuristics. (4) Their standardized behaviors and cognitive signatures integrate into a Superclass Brain, an emergent meta-intelligence capable of abstraction, generalization and self-improvement. We outline the theoretical constructs, present initial implementations (e.g., UAV scheduling, KU/KI keyword filtering) and propose a registry for cross-dyad knowledge consolidation. This work provides both a conceptual foundation and an architectural roadmap toward scalable, explainable and ethically aligned collective AI.

[NLP-153] Entropy-based Coarse and Compressed Semantic Speech Representation Learning

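[Quick read]: This paper addresses the redundancy introduced by fine-grained tokenization (for example 25 or 50 tokens per second) in existing discrete speech representation learning, which makes downstream training and inference inefficient and provides more resolution than semantic understanding needs. The key to the solution is an entropy-based dynamic aggregation framework: a speech language model is first pre-trained on large-scale unlabeled data to capture frequent token patterns; predictive entropy is then used to adaptively determine aggregation boundaries, and a cross-attention module fuses the information within each segment, yielding compressed semantic representations whose granularity and compression ratio can be adjusted flexibly.

Link: https://arxiv.org/abs/2509.00503
Authors: Jialong Zuo, Guangyan Zhang, Minghui Fang, Shengpeng Ji, Xiaoqi Jiao, Jingyu Li, Yiwen Guo, Zhou Zhao
Affiliations: Zhejiang University; LIGHTSPEED; Independent Researcher
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: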

Abstract:Discrete speech representation learning has recently attracted increasing interest in both acoustic and semantic modeling. Existing approaches typically encode 16 kHz waveforms into discrete tokens at a rate of 25 or 50 tokens per second. However, given that speech generally conveys only 2 to 5 words per second, such fine-grained tokenization introduces redundancy and hinders efficiency in downstream training and inference. Moreover, semantic speech representations at this frequency primarily capture phonetic-level information, while semantic understanding may not require such detailed token-level resolution. To address these limitations, we propose an entropy-based dynamic aggregation framework for learning compressed semantic speech representations. A speech language model is first pre-trained via next-token prediction on large-scale unlabeled data to capture frequent token patterns. Predictive entropy is then used to adaptively determine aggregation boundaries, followed by a cross-attention module that fuses information within each segment. By adjusting the entropy threshold, the granularity and compression ratio of the representations can be flexibly controlled. Experiments on ASR, speech-to-text translation, and voice conversion tasks demonstrate that the compressed representations perform on par with or better than dense token sequences, demonstrating the effectiveness of the proposed approach.
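A rough sketch of entropy-driven aggregation follows. The boundary rule used here (open a new segment whenever a token's predictive entropy exceeds the threshold) is an assumption, since the abstract does not give the exact rule, and mean pooling stands in for the cross-attention fusion module, which is a deliberate simplification.

```python
def segment_by_entropy(entropies, threshold):
    """Group token indices; a high-entropy (hard to predict) token opens a new segment."""
    segments, current = [], []
    for idx, h in enumerate(entropies):
        if current and h > threshold:
            segments.append(current)
            current = []
        current.append(idx)
    if current:
        segments.append(current)
    return segments

def mean_pool(frames, segments):
    """Stand-in for cross-attention: average the frame vectors in each segment."""
    pooled = []
    for seg in segments:
        dim = len(frames[seg[0]])
        pooled.append([sum(frames[i][d] for i in seg) / len(seg) for d in range(dim)])
    return pooled

entropies = [0.1, 0.2, 1.5, 0.3, 0.2, 2.0]   # hypothetical per-token predictive entropies
coarse = segment_by_entropy(entropies, threshold=1.0)
fine = segment_by_entropy(entropies, threshold=0.25)
```

Lowering the threshold produces more, shorter segments, which is how a single parameter can trade compression ratio against granularity.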

[NLP-154] ResearchQA: Evaluating Scholarly Question Answering at Scale Across 75 Fields with Survey-Mined Questions and Rubrics

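[Quick read]: This paper addresses the lack of multi-field, high-quality evaluation standards for long-form answers by large language models (LLMs) to research questions. Existing evaluation relies heavily on expert annotation and is largely confined to a few fields such as AI, so it cannot comprehensively measure LLMs in cross-disciplinary research settings. The key to the solution is ResearchQA, a structured evaluation resource distilled from survey articles in 75 research fields, containing 21,000 queries and 160,000 rubric items; each rubric lists query-specific answer criteria such as citing papers, making explanations, and describing limitations. Validated with 31 Ph.D.-level annotators, the rubrics support an automatic pairwise judge (74% agreement with expert judgments), which is used to analyze competency gaps in 18 LLM systems over more than 7,600 pairwise evaluations, revealing that current parametric and retrieval-augmented systems fall well short of covering the key rubric items.

Link: https://arxiv.org/abs/2509.00496
Authors: Li S. Yifei, Allen Chang, Chaitanya Malaviya, Mark Yatskar
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages main, 40 pages total, 16 figures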

Abstract:Evaluating long-form responses to research queries heavily relies on expert annotators, restricting attention to areas like AI where researchers can conveniently enlist colleagues. Yet, research expertise is widespread: survey articles synthesize knowledge distributed across the literature. We introduce ResearchQA, a resource for evaluating LLM systems by distilling survey articles from 75 research fields into 21K queries and 160K rubric items. Each rubric, derived jointly with queries from survey sections, lists query-specific answer evaluation criteria, i.e., citing papers, making explanations, and describing limitations. Assessments by 31 Ph.D. annotators in 8 fields indicate 96% of queries support Ph.D. information needs and 87% of rubric items should be addressed in system responses by a sentence or more. Using our rubrics, we are able to construct an automatic pairwise judge obtaining 74% agreement with expert judgments. We leverage ResearchQA to analyze competency gaps in 18 systems in over 7.6K pairwise evaluations. No parametric or retrieval-augmented system we evaluate exceeds 70% on covering rubric items, and the highest-ranking agentic system shows 75% coverage. Error analysis reveals that the highest-ranking system fully addresses less than 11% of citation rubric items, 48% of limitation items, and 49% of comparison items. We release our data to facilitate more comprehensive multi-field evaluations.

[NLP-155] Talk Less, Call Right: Enhancing Role-Play LLM Agents with Automatic Prompt Optimization and Role Prompting

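[Quick read]: This paper addresses two problems of tool-augmented large language models (LLMs) acting as role-playing dialogue agents in the API track of the Commonsense Persona-grounded Dialogue Challenge (CPDC): producing overly long in-character responses (over-speaking), and failing to use tools effectively according to the persona (under-acting), for example by generating calls to functions that do not exist or making unnecessary tool calls before answering. The key to the solution is rule-based role prompting (RRP), whose two innovations are a character-card/scene-contract design that makes the boundaries of the role's behavior explicit, and strict enforcement of function calling. Together they improve the accuracy and reliability of the dialogue agent and raise the overall score from the zero-shot baseline of 0.519 to 0.571.

Link: https://arxiv.org/abs/2509.00482
Authors: Saksorn Ruangtanusak, Pittawat Taveekitworachai, Kunat Pipatanakul
Affiliations: SCBX R&D; SCB 10X R&D; SCBX Group
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 17 pages, 2 figures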

Abstract:This report investigates approaches for prompting a tool-augmented large language model (LLM) to act as a role-playing dialogue agent in the API track of the Commonsense Persona-grounded Dialogue Challenge (CPDC) 2025. In this setting, dialogue agents often produce overly long in-character responses (over-speaking) while failing to use tools effectively according to the persona (under-acting), such as generating function calls that do not exist or making unnecessary tool calls before answering. We explore four prompting approaches to address these issues: 1) basic role prompting, 2) human-crafted role prompting, 3) automatic prompt optimization (APO), and 4) rule-based role prompting. The rule-based role prompting (RRP) approach achieved the best performance through two novel techniques–character-card/scene-contract design and strict enforcement of function calling–which led to an overall score of 0.571, improving on the zero-shot baseline score of 0.519. These findings demonstrate that RRP design can substantially improve the effectiveness and reliability of role-playing dialogue agents compared with more elaborate methods such as APO. To support future efforts in developing persona prompts, we are open-sourcing all of our best-performing prompts and the APO tool. Source code is available at this https URL.
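One way to picture "strict enforcement of function calling" is a validation layer that rejects calls to functions that do not exist or that omit required arguments. The sketch below is a guess at the general idea, not the authors' prompts or code (the abstract describes a prompting approach); the tool registry `TOOLS`, its function names, and the call format are hypothetical.

```python
# Hypothetical tool registry: function name -> required argument names.
TOOLS = {
    "check_price": {"item_name"},
    "sell_item": {"item_name", "quantity"},
}

def validate_call(call, tools=TOOLS):
    """Return (ok, reason) for a proposed function call."""
    name = call.get("name")
    if name not in tools:
        return False, f"unknown function: {name!r}"
    missing = tools[name] - set(call.get("arguments", {}))
    if missing:
        return False, f"missing arguments: {sorted(missing)}"
    return True, "ok"

def enforce(calls, tools=TOOLS):
    """Split proposed calls into accepted and rejected, keeping the reason."""
    accepted, rejected = [], []
    for call in calls:
        ok, reason = validate_call(call, tools)
        (accepted if ok else rejected).append((call, reason))
    return accepted, rejected

proposed = [
    {"name": "check_price", "arguments": {"item_name": "sword"}},
    {"name": "teleport", "arguments": {}},
    {"name": "sell_item", "arguments": {"item_name": "sword"}},
]
accepted, rejected = enforce(proposed)
```

A filter like this addresses only the "nonexistent function" failure mode; deciding whether a call is necessary at all would still depend on the prompt design.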

[NLP-156] TECP: Token-Entropy Conformal Prediction for LLMs

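[Quick read]: This paper addresses uncertainty quantification (UQ) for open-ended language generation, in particular how to ensure reliability under black-box constraints where internal model signals are inaccessible. The core solution is the Token-Entropy Conformal Prediction (TECP) framework. Its key idea is to use token-level entropy as a logit-free, reference-free uncertainty measure and embed it in a split conformal prediction (CP) pipeline, calibrating uncertainty thresholds via CP quantiles so that compact prediction sets come with formal coverage guarantees. The method does not rely on semantic consistency heuristics or white-box features; it estimates epistemic uncertainty directly from the token entropy structure of sampled generations, enabling trustworthy assessment of outputs from black-box large language models (LLMs).

Link: https://arxiv.org/abs/2509.00461
Authors: Beining Xu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: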

Abstract:Uncertainty quantification (UQ) for open-ended language generation remains a critical yet underexplored challenge, especially under black-box constraints where internal model signals are inaccessible. In this paper, we introduce Token-Entropy Conformal Prediction (TECP), a novel framework that leverages token-level entropy as a logit-free, reference-free uncertainty measure and integrates it into a split conformal prediction (CP) pipeline to construct prediction sets with formal coverage guarantees. Unlike existing approaches that rely on semantic consistency heuristics or white-box features, TECP directly estimates epistemic uncertainty from the token entropy structure of sampled generations and calibrates uncertainty thresholds via CP quantiles to ensure provable error control. Empirical evaluations across six large language models and two benchmarks (CoQA and TriviaQA) demonstrate that TECP consistently achieves reliable coverage and compact prediction sets, outperforming prior self-consistency-based UQ methods. Our method provides a principled and efficient solution for trustworthy generation in black-box LLM settings.
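The split conformal step can be sketched independently of how the uncertainty scores are obtained. The code below assumes each generation already has an entropy-based nonconformity score (how TECP computes it is not detailed in the abstract) and that calibration scores come from answers known to be correct; the numbers are made up for illustration.

```python
import math

def conformal_quantile(cal_scores, alpha):
    """Split conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest
    calibration score, or infinity when that rank exceeds n."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(cal_scores)[rank - 1]

def prediction_set(candidates, q_hat):
    """Keep every sampled answer whose uncertainty score is within the threshold."""
    return [text for text, score in candidates if score <= q_hat]

# Hypothetical calibration scores (entropy of correct answers) and test candidates.
cal_scores = [0.2, 0.5, 0.1, 0.9, 0.4, 0.7, 0.3, 0.8, 0.6]
q_hat = conformal_quantile(cal_scores, alpha=0.2)
kept = prediction_set([("a", 0.3), ("b", 0.85), ("c", 0.8)], q_hat)
```

The finite-sample correction in the rank, using n + 1 rather than n, is what gives the coverage guarantee of at least 1 - alpha under exchangeability; with too few calibration points the threshold becomes infinite and every candidate is kept.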

[NLP-157] CVPD at QIAS 2025 Shared Task: An Efficient Encoder-Based Approach for Islamic Inheritance Reasoning

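[Quick read]: This paper addresses the precise identification of heirs and calculation of shares in Islamic inheritance law (Ilm al-Mawarith), a complex and challenging task for AI. The key to the solution is a lightweight framework that combines a specialized Arabic text encoder (such as MARBERT) with Attentive Relevance Scoring (ARS) to rank the answer options of multiple-choice questions by semantic relevance, enabling fast, on-device inference without generative reasoning. The approach offers better efficiency and practicality than large language models (LLMs) while preserving privacy and deployment flexibility; its accuracy is lower (69.87% versus 87.6% for the best LLM), and the work quantifies the trade-off between the peak performance of large models and the practical advantages of small, specialized systems in high-stakes domains.

Link: https://arxiv.org/abs/2509.00457
Authors: Salah Eddine Bekhouche, Abdellah Zakaria Sellam, Hichem Telli, Cosimo Distante, Abdenour Hadid
Affiliations: University of the Basque Country UPV/EHU; Institute of Applied Sciences and Intelligent Systems – CNR; Laboratory of LESIA, University of Biskra; Sorbonne University Abu Dhabi
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: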

Abstract:Islamic inheritance law (Ilm al-Mawarith) requires precise identification of heirs and calculation of shares, which poses a challenge for AI. In this paper, we present a lightweight framework for solving multiple-choice inheritance questions using a specialised Arabic text encoder and Attentive Relevance Scoring (ARS). The system ranks answer options according to semantic relevance, and enables fast, on-device inference without generative reasoning. We evaluate Arabic encoders (MARBERT, ArabicBERT, AraBERT) and compare them with API-based LLMs (Gemini, DeepSeek) on the QIAS 2025 dataset. While large models achieve an accuracy of up to 87.6%, they require more resources and are context-dependent. Our MARBERT-based approach achieves 69.87% accuracy, presenting a compelling case for efficiency, on-device deployability, and privacy. While this is lower than the 87.6% achieved by the best-performing LLM, our work quantifies a critical trade-off between the peak performance of large models and the practical advantages of smaller, specialized systems in high-stakes domains.
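Ranking answer options by relevance, without generating text, can be sketched with plain cosine similarity between a question embedding and each option embedding. Cosine similarity is a stand-in here: the paper's Attentive Relevance Scoring is not specified in the abstract, and the vectors below are toy values rather than outputs of an Arabic encoder.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_options(question_vec, option_vecs):
    """Return (label, score) pairs sorted from most to least relevant."""
    scored = [(label, cosine(question_vec, vec)) for label, vec in option_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

question = [1.0, 0.0, 1.0]                      # toy question embedding
options = {
    "A": [1.0, 0.0, 0.9],
    "B": [0.0, 1.0, 0.0],
    "C": [-1.0, 0.0, -1.0],
}
ranked = rank_options(question, options)
```

Because the only work at inference time is one encoder pass per text plus a handful of dot products, this style of scoring suits on-device use.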

[NLP-158] GOSU: Retrieval-Augmented Generation with Global-Level Optimized Semantic Unit-Centric Framework

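[Quick read]: This paper addresses the limitations of conventional graph-based Retrieval-Augmented Generation (RAG) in handling semantic relationships among multiple entities, in particular the ambiguity, complex coupling, and increased retrieval overhead that arise when high-level semantic units (SUs) are extracted only from local text chunks. The core solution is GOSU, a semantic unit-centric framework. In the graph construction phase, it globally merges the SUs extracted from local chunks and uses them to guide entity and relationship extraction, which eases coreference resolution and uncovers global semantic objects across text chunks. In the retrieval and generation phase, it introduces hierarchical keyword extraction and semantic unit completion: the former captures fine-grained binary relationships, and the latter supplies the coarse-grained n-ary relationships that the former misses, so that complex interactions among multiple entities are modeled efficiently and generation quality improves.

Link: https://arxiv.org/abs/2509.00449
Authors: Xuecheng Zou, Ke Liu, Bingbing Wang, Huafei Deng, Li Zhang, Yu Tang
Affiliations: Soochow University; Suzhou Institute of Trade & Commerce
Categories: Computation and Language (cs.CL)
Comments: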

Abstract:Building upon the standard graph-based Retrieval-Augmented Generation (RAG), the introduction of heterogeneous graphs and hypergraphs aims to enrich retrieval and generation by leveraging the relationships between multiple entities through the concept of semantic units (SUs). But this also raises a key issue: The extraction of high-level SUs limited to local text chunks is prone to ambiguity, complex coupling, and increased retrieval overhead due to the lack of global knowledge or the neglect of fine-grained relationships. To address these issues, we propose GOSU, a semantic unit-centric RAG framework that efficiently performs global disambiguation and utilizes SUs to capture interconnections between different nodes across the global context. In the graph construction phase, GOSU performs global merging on the pre-extracted SUs from local text chunks and guides entity and relationship extraction, reducing the difficulty of coreference resolution while uncovering global semantic objects across text chunks. In the retrieval and generation phase, we introduce hierarchical keyword extraction and semantic unit completion. The former uncovers the fine-grained binary relationships overlooked by the latter, while the latter compensates for the coarse-grained n-ary relationships missing from the former. Evaluation across multiple tasks demonstrates that GOSU outperforms the baseline RAG methods in terms of generation quality.

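[NLP-159] The Gold Medals in an Empty Room: Diagnosing Metalinguistic Reasoning in LLMs with Camlang

[Quick read]: This paper asks whether the strong benchmark results of Large Language Models (LLMs) reflect genuine reasoning or only pattern matching, which remains unclear. To provide a stricter, cognitively motivated test, the authors examine whether models can acquire an unfamiliar language through explicit metalinguistic deductive learning, as human learners can. The key is Camlang, a new constructed language that comes with two explicit resources, a grammar book and a bilingual dictionary, mirroring the rule learning and lexical lookup of adult second-language acquisition. On top of it the authors build Camlang-CSQA-v0, a task that requires applying grammar rules and lexical mappings together for sentence-level reasoning. Experiments show that humans can learn Camlang and solve the task, whereas mainstream LLMs such as GPT-5 score far below humans on Camlang, and most model successes stem from shallow lexical alignment rather than systematic grammatical mastery, exposing a fundamental gap between current models and human metalinguistic competence.

Link: https://arxiv.org/abs/2509.00425
Authors: Fenghua Liu, Yulong Chen, Yixuan Liu, Zhujun Jin, Solomon Tsai, Ming Zhong
Affiliations: University of Cambridge; University of Oxford; UIUC
Subjects: Computation and Language (cs.CL)
Comments: Work in progress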

Abstract:Large Language Models (LLMs) achieve gold-medal performance across many benchmarks, yet it remains unclear whether such success reflects genuine reasoning or pattern matching. From a cognitive science perspective, an informative test is whether models can master an unfamiliar language through explicit metalinguistic deductive learning, a paradigm where human learners can reliably internalise grammatical systems through metalinguistic reasoning. We address this question with Camlang, a novel constructed language that exhibits naturalistic yet unattested feature combinations. Camlang consists of two explicit resources, a grammar book and a bilingual dictionary, which mirror adult second-language learning via explicit grammar rules and lexical lookup, and enable us to disentangle errors in morpho-syntax, lexical semantics, and sentence-level reasoning. Human experiments show that these resources are sufficient for participants to acquire Camlang and successfully solve Camlang tasks. To operationalise evaluation, we adapt CommonsenseQA into Camlang, creating Camlang-CSQA-v0, the first task in a broader suite where solving questions requires applying grammar rules and lexical mappings. Experimental results show that GPT-5 achieves 98% EM accuracy in English but only 47% in Camlang, far below human performance at 87%, while other state-of-the-art reasoning LLMs perform even worse. Human verification further reveals that most model successes stem from shallow lexical alignment while GPT-5 shows emerging metalinguistic awareness to a limited extent but not systematic grammatical mastery as humans. Camlang establishes a cognitively grounded evaluation paradigm that exposes fundamental gaps between current models and human metalinguistic competence.

[NLP-160] MedSEBA: Synthesizing Evidence-Based Answers Grounded in Evolving Medical Literature CIKM2025

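[Quick read]: This paper addresses the trust problem created by medical information overload and inconsistent research conclusions: users struggle to pick out reliable medical advice from the mass of online content, and researchers struggle to keep up with new findings and to see where studies disagree. The proposed system, MedSEBA, is an interactive evidence synthesis platform built on generative AI. It dynamically retrieves trustworthy studies from PubMed, uses Large Language Models (LLMs) to generate structured answers whose key points can be traced back to the underlying studies, and visualizes how far the relevant studies support or refute a given medical claim and how the research consensus has evolved over time, which makes the answers more trustworthy and easier to interpret.

Link: https://arxiv.org/abs/2509.00414
Authors: Juraj Vladika, Florian Matthes
Affiliations: Technical University of Munich
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted to CIKM 2025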

Abstract:In the digital age, people often turn to the Internet in search of medical advice and recommendations. With the increasing volume of online content, it has become difficult to distinguish reliable sources from misleading information. Similarly, millions of medical studies are published every year, making it challenging for researchers to keep track of the latest scientific findings. These evolving studies can reach differing conclusions, which is not reflected in traditional search tools. To address these challenges, we introduce MedSEBA, an interactive AI-powered system for synthesizing evidence-based answers to medical questions. It utilizes the power of Large Language Models to generate coherent and expressive answers, but grounds them in trustworthy medical studies dynamically retrieved from the research database PubMed. The answers consist of key points and arguments, which can be traced back to respective studies. Notably, the platform also provides an overview of the extent to which the most relevant studies support or refute the given medical claim, and a visualization of how the research consensus evolved through time. Our user study revealed that medical experts and lay users find the system usable and helpful, and the provided answers trustworthy and informative. This makes the system well-suited for both everyday health questions and advanced research insights.

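[NLP-161] The Resurgence of GCG Adversarial Attacks on Large Language Models

[Quick read]: This paper addresses three issues with gradient-based adversarial prompting against Large Language Models (LLMs): inaccurate evaluation of attack effectiveness, dependence on model scale, and vulnerabilities in reasoning tasks. It systematically evaluates GCG and its simulated-annealing variant T-GCG on open-source LLMs of different sizes, using AdvBench safety prompts and reasoning-heavy coding prompts, and adopts GPT-4o semantic judgments as a stricter evaluation standard. The study reports that attack success rate (ASR) declines as model size grows, that prefix-based heuristic evaluation overestimates attack effectiveness, and that reasoning tasks can themselves become an attack vector.

Link: https://arxiv.org/abs/2509.00391
Authors: Yuting Tan, Xuying Li, Zhuo Li, Huizhen Shu, Peikang Hu
Affiliations: HydroX
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 12 pages, 5 figures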

Abstract:Gradient-based adversarial prompting, such as the Greedy Coordinate Gradient (GCG) algorithm, has emerged as a powerful method for jailbreaking large language models (LLMs). In this paper, we present a systematic appraisal of GCG and its annealing-augmented variant, T-GCG, across open-source LLMs of varying scales. Using Qwen2.5-0.5B, LLaMA-3.2-1B, and GPT-OSS-20B, we evaluate attack effectiveness on both safety-oriented prompts (AdvBench) and reasoning-intensive coding prompts. Our study reveals three key findings: (1) attack success rates (ASR) decrease with model size, reflecting the increasing complexity and non-convexity of larger models’ loss landscapes; (2) prefix-based heuristics substantially overestimate attack effectiveness compared to GPT-4o semantic judgments, which provide a stricter and more realistic evaluation; and (3) coding-related prompts are significantly more vulnerable than adversarial safety prompts, suggesting that reasoning itself can be exploited as an attack vector. In addition, preliminary results with T-GCG show that simulated annealing can diversify adversarial search and achieve competitive ASR under prefix evaluation, though its benefits under semantic judgment remain limited. Together, these findings highlight the scalability limits of GCG, expose overlooked vulnerabilities in reasoning tasks, and motivate further development of annealing-inspired strategies for more robust adversarial evaluation.
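
Finding (2) contrasts prefix-based heuristics with semantic judgments. The sketch below illustrates why a prefix check can overestimate ASR. It is not the paper's evaluation code: the refusal list, the toy responses, and the hand-set judge labels (standing in for GPT-4o judgments) are all assumptions made for illustration.

```python
# Hypothetical refusal markers; published evaluations use longer curated lists.
REFUSAL_PREFIXES = ("I'm sorry", "I cannot", "I can't", "I apologize", "As an AI")


def prefix_success(response, prefixes=REFUSAL_PREFIXES):
    """Prefix heuristic: the attack 'succeeds' if the reply does not open with a refusal."""
    head = response.strip().lower()
    return not any(head.startswith(p.lower()) for p in prefixes)


def attack_success_rate(flags):
    """Fraction of prompts judged as successful attacks."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


# Toy responses, invented for illustration only.
responses = [
    "I'm sorry, but I can't help with that.",
    "Sure. On reflection, though, I will not provide those details.",
    "Sure, here is the script you asked for: ...",
]
# Stand-in for a semantic judge; these labels are set by hand, not produced by a model.
judge_flags = [False, False, True]

prefix_flags = [prefix_success(r) for r in responses]
print(attack_success_rate(prefix_flags))  # prefix heuristic counts 2 of 3 as successes
print(attack_success_rate(judge_flags))   # hand-set semantic labels count 1 of 3
```

The second response does not open with a refusal, so the prefix check counts it as a success even though it declines to help; a judge that reads the whole reply would not.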

[NLP-162] GraphKV: Breaking the Static Selection Paradigm with Graph-Based KV Cache Eviction

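[Quick read]: This paper targets inefficient Key-Value (KV) cache management when Large Language Models (LLMs) process long text sequences under memory constraints. Conventional eviction strategies, such as top-k selection on attention scores, rely on static heuristics and fail to capture the implicit dependencies among tokens that evolve during inference. The proposed GraphKV framework models tokens as nodes carrying importance scores, with edges representing similarity between tokens, and updates token importance dynamically through a decay-signal-propagation mechanism over the graph, so that the most contextually significant tokens are retained adaptively and the KV cache is compressed more effectively. GraphKV can be plugged into existing eviction methods such as SnapKV and PyramidKV.

Link: https://arxiv.org/abs/2509.00388
Authors: Xuelin Li, Xiangqi Jin, Linfeng Zhang
Affiliations: EPIC Lab, Shanghai Jiao Tong University
Subjects: Computation and Language (cs.CL)
Comments: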

Abstract:Efficient Key-Value (KV) cache management is essential for processing long text sequences in large language models (LLMs), where memory constraints often limit performance. Conventional KV eviction strategies, such as top-k selection based on attention scores, depend on static heuristics that fail to capture the evolving implicit dependencies among tokens during inference. To overcome this, we propose GraphKV, a graph-based framework that redefines token selection for KV cache compression. In GraphKV, tokens are modeled as nodes with importance scores, and edges represent their similarity relationships. Through a decay-signal-propagation mechanism, token importance is dynamically updated by propagating information across the graph, enabling adaptive retention of the most contextually significant tokens. GraphKV can be seamlessly utilized in existing KV cache eviction methods such as SnapKV and PyramidKV in a plug-and-play manner. Code will be released on GitHub.
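
The abstract describes tokens as scored nodes, similarity edges, and a decay signal propagated across the graph, but it does not give the update rule. The sketch below is one plausible reading, not the authors' algorithm: after each selection, every remaining token's score is multiplied by `1 - decay * similarity` to the selected token. The function names, the `decay` parameter, and the toy key vectors are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def graph_decay_select(keys, scores, budget, decay=0.5):
    """Greedy selection: keep the best-scored token, then decay its similar neighbours.

    Assumed rule (not from the paper): score[j] *= 1 - decay * max(0, cos(best, j)).
    """
    scores = list(scores)
    remaining = set(range(len(scores)))
    kept = []
    while remaining and len(kept) < budget:
        best = max(sorted(remaining), key=lambda i: scores[i])
        kept.append(best)
        remaining.remove(best)
        for j in remaining:
            sim = max(0.0, cosine(keys[best], keys[j]))
            scores[j] *= 1.0 - decay * sim
    return sorted(kept)


# Tokens 0 and 1 are near-duplicates; token 2 is dissimilar but scored lower.
keys = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
scores = [1.0, 0.9, 0.5]
print(graph_decay_select(keys, scores, budget=2))
```

With these toy values a static top-2 on the raw scores would keep tokens 0 and 1, while the decayed selection keeps 0 and 2, because token 1's score is roughly halved once its near-duplicate has been kept.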

[NLP-163] Open Data Synthesis For Deep Research

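[Quick read]: This paper addresses the limits of current Large Language Models (LLMs) on Deep Research tasks, which require decomposing a question into sub-problems, coordinating multi-step reasoning, and synthesizing evidence from diverse sources. Existing benchmarks such as Natural Questions and HotpotQA do not capture the hierarchical complexity of these tasks, and synthetic datasets often suffer from shortcut reasoning, knowledge leakage, or insufficient structural depth. The proposed InfoSeek framework uses a dual-agent system to recursively build a Research Tree from large-scale webpages, blurs intermediate nodes into valid sub-problems, and converts each tree into a natural-language question that requires traversing the full hierarchy. The framework scales readily, yielding more than 50K training examples together with reasoning trajectories, and it preserves meta-information such as intermediate steps and retrieval labels to support further optimization. On the BrowseComp-Plus benchmark, 3B models trained with InfoSeek surpass much larger 32B models and lightweight commercial APIs, with performance comparable to stronger APIs.

Link: https://arxiv.org/abs/2509.00375
Authors: Ziyi Xia, Kun Luo, Hongjin Qian, Zheng Liu
Affiliations: BAAI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: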

Abstract:Large language models (LLMs) are increasingly expected to go beyond simple factual queries toward Deep Research, that is, tasks that require decomposing questions into sub-problems, coordinating multi-step reasoning, and synthesizing evidence from diverse sources. We formalize Deep Research tasks with verifiable answers as Hierarchical Constraint Satisfaction Problems (HCSPs), which are fundamentally different from single-constraint, multi-hop, or flat CSP formulations. However, existing benchmarks (e.g., Natural Questions, HotpotQA) fail to capture this complexity, while recent synthetic datasets often introduce shortcut reasoning, knowledge leakage, or lack sufficient structural depth. To address this gap, we introduce InfoSeek, a scalable framework for synthesizing complex Deep Research tasks. InfoSeek uses a dual-agent system to recursively build a Research Tree from large-scale webpages, blurring intermediate nodes into valid sub-problems, and converting these trees into natural language questions that require traversing the full hierarchy. It also enables rapid scaling, yielding over 50K training examples, a curated test set, and reasoning trajectories generated via rejection sampling. Experiments show that models trained on InfoSeek consistently outperform strong baselines. On a challenging benchmark BrowseComp-Plus, 3B LLMs optimized with InfoSeek surpass much larger 32B models and lightweight commercial APIs (e.g., Gemini2.5-Flash), while achieving performance comparable to stronger APIs (e.g., Gemini2.5-Pro). By preserving meta-information such as intermediate steps and retrieval labels, InfoSeek further supports advanced optimization strategies, including compound reward design and trajectory-level exploration. We provide our code and datasets in this repository (this https URL).

[NLP-164] KG-RAG: Enhancing GUI Agent Decision-Making via Knowledge Graph-Driven Retrieval-Augmented Generation EMNLP2025

[Quick read]: This paper addresses the limited performance of Graphic User Interface (GUI) agents powered by Large Language Models (LLMs) on complex mobile tasks, which stems from a lack of app-specific knowledge. The core difficulty is that the structured navigation information in UI Transition Graphs (UTGs) is hard to extract and integrate, which constrains agent decision-making. The proposed KG-RAG framework is a Knowledge Graph (KG)-driven Retrieval-Augmented Generation (RAG) method that converts fragmented UTGs into structured vector databases for efficient real-time retrieval and combines them with an intent-guided LLM search to generate actionable navigation paths, improving both task success rate and decision accuracy.

Link: https://arxiv.org/abs/2509.00366
Authors: Ziyi Guan, Jason Chun Lok Li, Zhijian Hou, Pingping Zhang, Donglai Xu, Yuzhi Zhao, Mengyang Wu, Jinpeng Chen, Thanh-Toan Nguyen, Pengfei Xian, Wenao Ma, Shengchao Qin, Graziano Chesi, Ngai Wong
Affiliations: Huawei Hong Kong Research Center; The University of Hong Kong; City University of Hong Kong; Guangzhou Institute of Technology, Xidian University; ICTT and ISN Laboratory, Xidian University
Subjects: Multiagent Systems (cs.MA); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: Accepted by EMNLP 2025

Abstract:Despite recent progress, Graphic User Interface (GUI) agents powered by Large Language Models (LLMs) struggle with complex mobile tasks due to limited app-specific knowledge. While UI Transition Graphs (UTGs) offer structured navigation representations, they are underutilized due to poor extraction and inefficient integration. We introduce KG-RAG, a Knowledge Graph-driven Retrieval-Augmented Generation framework that transforms fragmented UTGs into structured vector databases for efficient real-time retrieval. By leveraging an intent-guided LLM search method, KG-RAG generates actionable navigation paths, enhancing agent decision-making. Experiments across diverse mobile apps show that KG-RAG outperforms existing methods, achieving a 75.8% success rate (8.9% improvement over AutoDroid), 84.6% decision accuracy (8.1% improvement), and reducing average task steps from 4.5 to 4.1. Additionally, we present KG-Android-Bench and KG-Harmony-Bench, two benchmarks tailored to the Chinese mobile ecosystem for future research. Finally, KG-RAG transfers to web/desktop (+40% SR on Weibo-web; +20% on QQ Music-desktop), and a UTG cost ablation shows accuracy saturates at ~4h per complex app, enabling practical deployment trade-offs.

[NLP-165] GIER: Gap-Driven Self-Refinement for Large Language Models

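[Quick read]: This paper addresses shortcomings in the rationale quality, factual grounding, and reasoning alignment of Large Language Model (LLM) outputs, especially on complex tasks. It proposes GIER (Gap-driven Iterative Enhancement of Responses), a general framework based on self-reflection and iterative revision in which natural-language descriptions of reasoning gaps guide the model to critique and refine its own outputs, improving rationale quality, grounding, and reasoning alignment without degrading task accuracy.

Link: https://arxiv.org/abs/2509.00325
Authors: Rinku Dewri
Affiliations: University of Denver
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: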

Abstract:We introduce GIER (Gap-driven Iterative Enhancement of Responses), a general framework for improving large language model (LLM) outputs through self-reflection and revision based on conceptual quality criteria. Unlike prompting strategies that rely on demonstrations, examples, or chain-of-thought templates, GIER utilizes natural language descriptions of reasoning gaps, and prompts a model to iteratively critique and refine its own outputs to better satisfy these criteria. Across three reasoning-intensive tasks (SciFact, PrivacyQA, and e-SNLI) and four LLMs (GPT-4.1, GPT-4o Mini, Gemini 1.5 Pro, and Llama 3.3 70B), GIER improves rationale quality, grounding, and reasoning alignment without degrading task accuracy. Our analysis demonstrates that models can not only interpret abstract conceptual gaps but also translate them into concrete reasoning improvements.

[NLP-166] Balanced Actor Initialization: Stable RLHF Training of Distillation-Based Reasoning Models

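[Quick read]: This paper addresses two stability problems that arise in a third LLM training paradigm, namely applying Reinforcement Learning from Human Feedback (RLHF) to distillation-trained models: Sequence Length Collapse and the Reward Hockey Stick Curve. Both substantially degrade the model's alignment and reasoning capabilities. The proposed remedy is Balanced Actor Initialization (BAI), a two-stage weighted model merging method that first merges the instruction-following and distillation-based reasoning fine-tuned models and then merges this intermediate model with the pretrained model to preserve foundational knowledge. According to the paper, this mitigates both instabilities and lets sequence length keep improving during training while reasoning ability is preserved.

Link: https://arxiv.org/abs/2509.00309
Authors: Chen Zheng, Yiyuan Ma, Yuan Yang, Deyi Liu, Jing Liu, Zuquan Song, Yuxin Song, Cheng Ren, Hang Zhu, Xin Liu, Yiyuan Ma, Siyuan Qiao, Xun Zhou, Liang Xiang, Yonghui Wu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: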

Abstract:The development of alignment and reasoning capabilities in large language models has seen remarkable progress through two paradigms: the instruction tuning and reinforcement learning from human feedback (RLHF) alignment paradigm, and the distillation-based reasoning fine-tuning paradigm. While both approaches prove effective independently, the third paradigm of applying RLHF to distillation-trained models presents significant challenges. Our investigation reveals two critical phenomena that emerge in this paradigm: Sequence Length Collapse, where language generation dramatically reduces during early RLHF training, and the Reward Hockey Stick Curve, featuring severe reward score drops followed by gradual recovery. These instabilities fundamentally compromise the model’s alignment and reasoning capabilities. To address these challenges, we propose Balanced Actor Initialization (BAI), a two-stage weighted model merging approach. BAI first merges instruction-following and distillation-based reasoning fine-tuned models, then further combines this intermediate model with the pretrained model to preserve foundational knowledge. Through comprehensive experiments across diverse benchmarks and detailed analysis of training experiments, we demonstrate that BAI resolves Sequence Length Collapse, mitigates the Reward Hockey Stick Curve, and enables continuous sequence length improvement during training. Additionally, our analysis reveals that balanced merging ratios achieve optimal trade-offs between training stability and reasoning capability preservation. Our work provides an effective solution for stable training in this third paradigm, enabling more capable reasoning models that combine distillation efficiency with RLHF alignment.
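
The two-stage merge can be sketched as follows. The abstract says the merge is weighted but does not give the formula, so plain linear interpolation, the ratio names `alpha` and `beta`, and the scalar toy "weights" are assumptions made for illustration only.

```python
def merge(a, b, weight_a):
    """Linearly interpolate two parameter dicts that share the same keys."""
    if a.keys() != b.keys():
        raise ValueError("models must have identical parameter names")
    return {name: weight_a * a[name] + (1.0 - weight_a) * b[name] for name in a}


def balanced_actor_init(instruct, reasoning, pretrained, alpha=0.5, beta=0.5):
    """Two-stage merge (assumed form, not the paper's exact recipe).

    Stage 1: merge the instruction-following and reasoning models with ratio alpha.
    Stage 2: merge that intermediate model with the pretrained model with ratio beta.
    """
    intermediate = merge(instruct, reasoning, alpha)
    return merge(intermediate, pretrained, beta)


# Toy single-parameter "models"; real checkpoints would hold tensors per layer.
instruct = {"w": 1.0}
reasoning = {"w": 3.0}
pretrained = {"w": 0.0}

actor = balanced_actor_init(instruct, reasoning, pretrained)
print(actor)
```

With real checkpoints the same interpolation would apply tensor by tensor; the abstract reports only that balanced ratios gave the best trade-off and does not state the values, so the 0.5 defaults here are placeholders.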

[NLP-167] Wage Sentiment Indices Derived from Survey Comments via Large Language Models

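[Quick read]: This paper addresses the limited timeliness and incomplete use of information in conventional economic text analysis when forecasting wage dynamics. It builds a Wage Sentiment Index (WSI) with Large Language Models (LLMs) from the Economy Watchers Survey (EWS), a monthly survey run by the Cabinet Office of Japan, using the LLM to extract and quantify wage-related sentiment and adapting the framework of the existing Price Sentiment Index (PSI). It also proposes a scalable data architecture that can later integrate further text sources such as newspapers and social media. Experiments show that LLM-based WSI models outperform baseline approaches and pretrained models, which suggests they could make economic policy design more timely and effective.

Link: https://arxiv.org/abs/2509.00290
Authors: Taihei Sone
Affiliations: University of Colorado Boulder
Subjects: Computation and Language (cs.CL)
Comments: Submitted to IEEE Big Data 2025. 10 pages, 2 tables, 16 figures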

Abstract:The emergence of generative Artificial Intelligence (AI) has created new opportunities for economic text analysis. This study proposes a Wage Sentiment Index (WSI) constructed with Large Language Models (LLMs) to forecast wage dynamics in Japan. The analysis is based on the Economy Watchers Survey (EWS), a monthly survey conducted by the Cabinet Office of Japan that captures real-time economic assessments from workers in industries highly sensitive to business conditions. The WSI extends the framework of the Price Sentiment Index (PSI) used in prior studies, adapting it specifically to wage related sentiment. To ensure scalability and adaptability, a data architecture is also developed that enables integration of additional sources such as newspapers and social media. Experimental results demonstrate that WSI models based on LLMs significantly outperform both baseline approaches and pretrained models. These findings highlight the potential of LLM-driven sentiment indices to enhance the timeliness and effectiveness of economic policy design by governments and central banks.

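[NLP-168] OpinioRAG: Towards Generating User-Centric Opinion Highlights from Large-scale Online Reviews

[Quick read]: This paper addresses the generation of personalized opinion highlights from very large volumes of user reviews, often exceeding thousands per entity, where existing methods either fail to scale or produce generic summaries that ignore individual needs. The proposed OpinioRAG is a scalable, training-free framework that combines evidence retrieval based on Retrieval-Augmented Generation (RAG) with Large Language Models (LLMs) to produce tailored summaries efficiently. The paper also introduces reference-free verification metrics for sentiment-rich domains, which give a finer-grained assessment of factual consistency and sentiment alignment.

Link: https://arxiv.org/abs/2509.00285
Authors: Mir Tafseer Nayeem, Davood Rafiei
Affiliations: University of Alberta
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: COLM 2025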

Abstract:We study the problem of opinion highlights generation from large volumes of user reviews, often exceeding thousands per entity, where existing methods either fail to scale or produce generic, one-size-fits-all summaries that overlook personalized needs. To tackle this, we introduce OpinioRAG, a scalable, training-free framework that combines RAG-based evidence retrieval with LLMs to efficiently produce tailored summaries. Additionally, we propose novel reference-free verification metrics designed for sentiment-rich domains, where accurately capturing opinions and sentiment alignment is essential. These metrics offer a fine-grained, context-sensitive assessment of factual consistency. To facilitate evaluation, we contribute the first large-scale dataset of long-form user reviews, comprising entities with over a thousand reviews each, paired with unbiased expert summaries and manually annotated queries. Through extensive experiments, we identify key challenges, provide actionable insights into improving systems, pave the way for future research, and position OpinioRAG as a robust framework for generating accurate, relevant, and structured summaries at scale.

[NLP-169] Exploring Reasoning-Infused Text Embedding with Large Language Models for Zero-Shot Dense Retrieval CIKM2025

[Quick read]: Transformer-based encoder models such as BERT and E5 are often limited to surface-level lexical matching and struggle to retrieve relevant documents for queries that require complex reasoning. Decoder-only Large Language Models (LLMs) reason well, but existing LLM-based embedding methods do not fully exploit that strength. The paper proposes Reasoning-Infused Text Embedding (RITE), which has a generative LLM produce intermediate reasoning text in the token space before the embedding is computed, so that the reasoning is folded into the representation. On BRIGHT, a reasoning-intensive retrieval benchmark, RITE significantly improves zero-shot retrieval performance.

Link: https://arxiv.org/abs/2509.00276
Authors: Yuxiang Liu, Tian Wang, Gourab Kundu, Tianyu Cao, Guang Cheng, Zhen Ge, Jianshu Chen, Qingjun Cui, Trishul Chilimbi
Affiliations: Amazon
Subjects: Computation and Language (cs.CL)
Comments: CIKM 2025

Abstract:Transformer-based models such as BERT and E5 have significantly advanced text embedding by capturing rich contextual representations. However, many complex real-world queries require sophisticated reasoning to retrieve relevant documents beyond surface-level lexical matching, where encoder-only retrievers often fall short. Decoder-only large language models (LLMs), known for their strong reasoning capabilities, offer a promising alternative. Despite this potential, existing LLM-based embedding methods primarily focus on contextual representation and do not fully exploit the reasoning strength of LLMs. To bridge this gap, we propose Reasoning-Infused Text Embedding (RITE), a simple but effective approach that integrates logical reasoning into the text embedding process using generative LLMs. RITE builds upon existing language model embedding techniques by generating intermediate reasoning texts in the token space before computing embeddings, thereby enriching representations with inferential depth. Experimental results on BRIGHT, a reasoning-intensive retrieval benchmark, demonstrate that RITE significantly enhances zero-shot retrieval performance across diverse domains, underscoring the effectiveness of incorporating reasoning into the embedding process.

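[NLP-170] The Temporal Game: A New Perspective on Temporal Relation Extraction

[Quick read]: This paper addresses the difficulty that conventional temporal relation extraction has with annotating interval and instant entities together, and its lack of flexibility and fine-grained control. It proposes the Temporal Game, which decomposes interval-level relations into point-wise comparisons between start and end points. Players classify one point relation at a time, and the system applies temporal closure to infer further relations and enforce consistency, giving a finer-grained, more flexible, and unified annotation scheme. The approach supports both interval and instant entities and lays the groundwork for training reinforcement learning agents on temporal annotation.

Link: https://arxiv.org/abs/2509.00250
Authors: Hugo Sousa, Ricardo Campos, Alípio Jorge
Affiliations: University of Porto; INESC TEC, Porto, Portugal
Subjects: Computation and Language (cs.CL)
Comments: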

Abstract:In this paper we demo the Temporal Game, a novel approach to temporal relation extraction that casts the task as an interactive game. Instead of directly annotating interval-level relations, our approach decomposes them into point-wise comparisons between the start and end points of temporal entities. At each step, players classify a single point relation, and the system applies temporal closure to infer additional relations and enforce consistency. This point-based strategy naturally supports both interval and instant entities, enabling more fine-grained and flexible annotation than any previous approach. The Temporal Game also lays the groundwork for training reinforcement learning agents, by treating temporal annotation as a sequential decision-making task. To showcase this potential, the demo presented in this paper includes a Game mode, in which users annotate texts from the TempEval-3 dataset and receive feedback based on a scoring system, and an Annotation mode, that allows custom documents to be annotated and resulting timeline to be exported. Therefore, this demo serves both as a research tool and an annotation interface. The demo is publicly available at this https URL, and the source code is open-sourced to foster further research and community-driven development in temporal reasoning and annotation.
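
The point-based decomposition and temporal closure can be illustrated with a small sketch. It is not the demo's source code: the point naming (`A.s`, `A.e`), the restriction to the relations `<`, `=`, `>`, and the function names are assumptions. Closure here means applying inversion and transitivity until nothing new can be inferred, and raising an error on a contradiction.

```python
from itertools import product

INVERSE = {"<": ">", ">": "<", "=": "="}


def compose(r1, r2):
    """Relation of (a, c) implied by (a, b) = r1 and (b, c) = r2, or None if undetermined."""
    if r1 == "=":
        return r2
    if r2 == "=":
        return r1
    return r1 if r1 == r2 else None


def temporal_closure(relations):
    """Close a dict {(a, b): '<' | '=' | '>'} under inversion and transitivity."""
    closed = {}

    def add(a, b, r):
        known = closed.get((a, b))
        if known is None:
            closed[(a, b)] = r
            closed[(b, a)] = INVERSE[r]
            return True
        if known != r:
            raise ValueError(f"inconsistent relation between {a} and {b}")
        return False

    for (a, b), r in relations.items():
        add(a, b, r)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closed), repeat=2):
            if b == c and a != d:
                r = compose(closed[(a, b)], closed[(c, d)])
                if r is not None and add(a, d, r):
                    changed = True
    return closed


# Two interval entities A and B, each with a start (.s) and an end (.e) point.
labels = {
    ("A.s", "A.e"): "<",  # an interval starts before it ends
    ("B.s", "B.e"): "<",
    ("A.e", "B.s"): "<",  # the single point relation a player classified
}
closed = temporal_closure(labels)
print(closed[("A.s", "B.e")])  # <
```

One classified point relation (A ends before B starts) is enough for closure to order all four points, which corresponds to the interval relation "A before B". An instant entity could be encoded in the same scheme by setting its start and end points equal.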

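[NLP-171] The Differential Meaning of Models: A Framework for Analyzing the Structural Consequences of Semantic Modeling Decisions

[Quick read]: This paper addresses the lack of a general theoretical framework for methods that model human meaning-making, that is, how to describe and analyze modeling practices across model types in a comparable way. It builds a framework on the semiotic theory of C. S. Peirce in which models are treated as signs and are taken to measure latent symbol geometries, understood as hypotheses about the complex of semiotic agencies underlying a symbolic dataset. Under this framework a model's value need not rest on performance metrics alone: its particular interpretive lens becomes visible through contrast with other models, which forms the basis of a theory of model semantics.

Link: https://arxiv.org/abs/2509.00248
Authors: Zachary K. Stine, James E. Deitrick
Affiliations: University of Central Arkansas
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: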

Abstract:The proliferation of methods for modeling of human meaning-making constitutes a powerful class of instruments for the analysis of complex semiotic systems. However, the field lacks a general theoretical framework for describing these modeling practices across various model types in an apples-to-apples way. In this paper, we propose such a framework grounded in the semiotic theory of C. S. Peirce. We argue that such models measure latent symbol geometries, which can be understood as hypotheses about the complex of semiotic agencies underlying a symbolic dataset. Further, we argue that in contexts where a model’s value cannot be straightforwardly captured by proxy measures of performance, models can instead be understood relationally, so that the particular interpretive lens of a model becomes visible through its contrast with other models. This forms the basis of a theory of model semantics in which models, and the modeling decisions that constitute them, are themselves treated as signs. In addition to proposing the framework, we illustrate its empirical use with a few brief examples and consider foundational questions and future directions enabled by the framework.

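[NLP-172] The Rarity Blind Spot: A Framework for Evaluating Statistical Reasoning in LLMs

[Quick read]: This paper addresses the weak performance of current Large Language Models (LLMs) on tasks that need fine-grained statistical reasoning and rarity detection, in particular identifying globally rare features across a document collection. Existing benchmarks focus on retrieval or summarization and do not test whether a model can surface features that occur at low frequency (for example, in fewer than 10% of documents) in a small-to-medium collection of 10 to 40 documents. The paper proposes the Distinctive Feature Mining (DFM) task and DiFBench, a configurable benchmark framework that controls document set size and distinctiveness thresholds. Experiments show that reasoning-enhanced models outperform general-purpose models, but all models degrade substantially as task complexity increases, and a common failure is to misidentify frequent features as distinctive.

Link: https://arxiv.org/abs/2509.00245
Authors: Seiji Maekawa, Hayate Iso, Nikita Bhutani
Affiliations: Megagon Labs; NVIDIA
Subjects: Computation and Language (cs.CL)
Comments: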

Abstract:Effective decision-making often relies on identifying what makes each candidate distinctive. While existing benchmarks for LLMs emphasize retrieving or summarizing information relevant to a given query, they do not evaluate a model’s ability to identify globally distinctive features across a set of documents. We introduce Distinctive Feature Mining (DFM), a new task that challenges models to analyze a small-to-medium collection (10-40 documents) and surface features that are rare in the global context (e.g., appearing in less than 10% of documents). This setting mirrors real-world scenarios such as candidate selection or product differentiation, where statistical reasoning, not retrieval, is key. To enable systematic evaluation of this capability, we present DiFBench, a configurable benchmark creation framework with controllable parameters such as document set size and distinctiveness thresholds. Using DiFBench, we perform a large-scale assessment of distinctive feature mining across ten state-of-the-art LLMs. Our findings reveal a significant performance gap between general-purpose and reasoning-enhanced models. All models, however, substantially degrade as the task complexity and document count increase. We also find that a common failure mode is misidentifying frequent features as distinctive. These insights reveal core limitations in contemporary LLMs’ abilities to perform fine-grained, statistical reasoning and rarity detection.
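
The rarity criterion can be made concrete with a short sketch. It assumes features have already been extracted per document, which is the hard part the benchmark actually tests; the function name, the feature strings, and the document ids are hypothetical. The 10% threshold follows the example given in the abstract.

```python
from collections import Counter


def distinctive_features(doc_features, threshold=0.10):
    """Return, per document, the features whose document frequency is below threshold.

    doc_features maps a document id to the set of features found in that document.
    """
    n_docs = len(doc_features)
    doc_freq = Counter(f for feats in doc_features.values() for f in set(feats))
    rare = {f for f, count in doc_freq.items() if count / n_docs < threshold}
    return {doc: sorted(set(feats) & rare) for doc, feats in doc_features.items()}


# Toy collection of 20 documents: every one mentions "python", one also mentions "rust".
docs = {f"doc{i}": {"python"} for i in range(20)}
docs["doc3"] = {"python", "rust"}

result = distinctive_features(docs)
print(result["doc3"])  # ['rust']
print(result["doc0"])  # []
```

Here "rust" appears in 1 of 20 documents (5%) and counts as distinctive, while "python" appears in all 20 and does not. The failure mode reported in the abstract corresponds to a model returning a frequent feature such as "python" as if it were distinctive.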

[NLP-173] Evaluating the Effectiveness of Transformer Layers in Wav2Vec 2.0, XLS-R, and Whisper for Speaker Identification Tasks

[Quick read]: This paper studies how to make effective use of pretrained speech encoder models for speaker identification. It fine-tunes transformer layers at different depths and analyzes the layer-wise representations with SVCCA (Singular Vector Canonical Correlation Analysis), k-means clustering, and t-SNE visualization, examining how discriminative speaker features are distributed across layers in order to determine the optimal number of layers for each model and to improve the stability and accuracy of identification.

Link: https://arxiv.org/abs/2509.00230
Authors: Linus Stuhlmann, Michael Alexander Saxer
Affiliations: ZHAW School of Engineering
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments:

Abstract:This study evaluates the performance of three advanced speech encoder models, Wav2Vec 2.0, XLS-R, and Whisper, in speaker identification tasks. By fine-tuning these models and analyzing their layer-wise representations using SVCCA, k-means clustering, and t-SNE visualizations, we found that Wav2Vec 2.0 and XLS-R capture speaker-specific features effectively in their early layers, with fine-tuning improving stability and performance. Whisper showed better performance in deeper layers. Additionally, we determined the optimal number of transformer layers for each model when fine-tuned for speaker identification tasks.

[NLP-174] Explainable Chain-of-Thought Reasoning : An Empirical Analysis on State-Aware Reasoning Dynamics

【速读】: 该论文旨在解决链式思维(Chain-of-Thought, CoT)提示中推理过程可解释性不足的问题,尤其是现有方法仅关注局部token级别的归因分析,而忽视了推理步骤的高阶语义角色及其转换关系。其解决方案的关键在于提出一种状态感知的转移框架(state-aware transition framework),通过谱分析(spectral analysis)对token级嵌入进行处理,将每个推理步骤聚类为语义一致的潜在状态(latent states),并将其进展建模为马尔可夫链(Markov chain),从而抽象出结构化的潜动态(structured latent dynamics),实现对推理过程的全局结构化、可解释性刻画。

Link: https://arxiv.org/abs/2509.00190
Authors: Sheldon Yu, Yuxin Xiong, Junda Wu, Xintong Li, Tong Yu, Xiang Chen, Ritwik Sinha, Jingbo Shang, Julian McAuley
Affiliations: University of California San Diego; Adobe Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 5 pages, 4 figures

Abstract:Recent advances in chain-of-thought (CoT) prompting have enabled large language models (LLMs) to perform multi-step reasoning. However, the explainability of such reasoning remains limited, with prior work primarily focusing on local token-level attribution, such that the high-level semantic roles of reasoning steps and their transitions remain underexplored. In this paper, we introduce a state-aware transition framework that abstracts CoT trajectories into structured latent dynamics. Specifically, to capture the evolving semantics of CoT reasoning, each reasoning step is represented via spectral analysis of token-level embeddings and clustered into semantically coherent latent states. To characterize the global structure of reasoning, we model their progression as a Markov chain, yielding a structured and interpretable view of the reasoning process. This abstraction supports a range of analyses, including semantic role identification, temporal pattern visualization, and consistency evaluation.
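The Markov-chain step described in the abstract can be sketched as counting transitions between latent-state labels. This is a minimal illustration, not the authors' code: it assumes the spectral representation and clustering have already assigned a label to each reasoning step, and the state names (`setup`, `compute`, `verify`, `answer`) are hypothetical.

```python
from collections import defaultdict


def transition_matrix(trajectories):
    """Estimate first-order Markov transition probabilities by counting.

    trajectories: list of state-label sequences, one per CoT trace
    (labels are assumed to come from an earlier clustering step).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for states in trajectories:
        # Count each consecutive (current, next) pair of states.
        for current, nxt in zip(states, states[1:]):
            counts[current][nxt] += 1
    # Normalise each row of counts into a probability distribution.
    matrix = {}
    for current, row in counts.items():
        total = sum(row.values())
        matrix[current] = {nxt: c / total for nxt, c in row.items()}
    return matrix


# Hypothetical state labels for three reasoning traces.
traces = [
    ["setup", "compute", "verify", "answer"],
    ["setup", "compute", "compute", "answer"],
    ["setup", "compute", "verify", "answer"],
]
P = transition_matrix(traces)
```

Each row of `P` sums to 1, so the matrix can be inspected directly, for example to see how often a `compute` step is followed by `verify` rather than going straight to `answer`.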

[NLP-175] What Are Research Hypotheses?

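[Quick Read]: This paper addresses the inconsistent definition of "hypothesis" in natural language understanding (NLU): different tasks interpret the term differently, which creates ambiguity and undermines consistency in model training and evaluation. Its approach is to survey and delineate the definitions of hypothesis used in recent NLU tasks, to clarify the nuances between them across open and scientific domains, and to highlight the importance of well-structured, machine-interpretable hypotheses as the field moves toward a machine-interpretable scholarly record.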

Link: https://arxiv.org/abs/2509.00185
Authors: Jian Wu, Sarah Rajtmajer
Affiliations: Old Dominion University; The Pennsylvania State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages, accepted by Sci-K’25: International Workshop on Scientific Knowledge

Abstract:Over the past decades, alongside advancements in natural language processing, significant attention has been paid to training models to automatically extract, understand, test, and generate hypotheses in open and scientific domains. However, interpretations of the term hypothesis for various natural language understanding (NLU) tasks have migrated from traditional definitions in the natural, social, and formal sciences. Even within NLU, we observe differences defining hypotheses across literature. In this paper, we overview and delineate various definitions of hypothesis. Especially, we discern the nuances of definitions across recently published NLU tasks. We highlight the importance of well-structured and well-defined hypotheses, particularly as we move toward a machine-interpretable scholarly record.

[NLP-176] Adaptive Monitoring and Real-World Evaluation of Agentic AI Systems

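[Quick Read]: This paper addresses the lack of comprehensive, dynamic, and quantifiable evaluation for agentic AI multi-agent systems as they move from research laboratories into high-stakes domains. Existing research and industrial deployments rely mostly on capability metrics and neglect human-centred and economic axes. The key to its solution is the Adaptive Multi-Dimensional Monitoring (AMDM) algorithm, which normalises heterogeneous metrics, applies per-axis exponentially weighted moving-average thresholds, and performs joint anomaly detection via the Mahalanobis distance. Compared with static thresholds, AMDM cuts anomaly-detection latency from 12.3 s to 5.6 s on simulated goal drift and reduces the false-positive rate from 4.5% to 0.9%.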

Link: https://arxiv.org/abs/2509.00115
Authors: Manish Shukla
Affiliations: Independent Researcher
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:

Abstract:Agentic artificial intelligence (AI) – multi-agent systems that combine large language models with external tools and autonomous planning – are rapidly transitioning from research laboratories into high-stakes domains. Our earlier “Basic” paper introduced a five-axis framework and proposed preliminary metrics such as goal drift and harm reduction but did not provide an algorithmic instantiation or empirical evidence. This “Advanced” sequel fills that gap. First, we revisit recent benchmarks and industrial deployments to show that technical metrics still dominate evaluations: a systematic review of 84 papers from 2023–2025 found that 83% report capability metrics while only 30% consider human-centred or economic axes [2]. Second, we formalise an Adaptive Multi-Dimensional Monitoring (AMDM) algorithm that normalises heterogeneous metrics, applies per-axis exponentially weighted moving-average thresholds and performs joint anomaly detection via the Mahalanobis distance. Third, we conduct simulations and real-world experiments. AMDM cuts anomaly-detection latency from 12.3 s to 5.6 s on simulated goal drift and reduces false-positive rates from 4.5% to 0.9% compared with static thresholds. We present a comparison table and ROC/PR curves, and we reanalyse case studies to surface missing metrics. Code, data and a reproducibility checklist accompany this paper to facilitate replication.
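The three AMDM steps named in the abstract (normalisation, per-axis exponentially weighted moving averages, and a joint Mahalanobis check) can be sketched with NumPy as follows. This is a simplified illustration under stated assumptions, not the paper's released code: the function names, the metric names, and the values of `alpha`, `k`, and `joint_threshold` are all hypothetical, and the per-axis rule here is a fixed band around an EWMA baseline.

```python
import numpy as np


def ewma(values, alpha=0.2):
    """Exponentially weighted moving average of a 1-D series."""
    smoothed = values[0]
    for v in values[1:]:
        smoothed = alpha * v + (1 - alpha) * smoothed
    return smoothed


def check_point(history, new_point, alpha=0.2, k=3.0, joint_threshold=3.5):
    """Return (per-axis flags, joint flag) for one new observation.

    history: array of shape (n_steps, n_metrics) of past metric values.
    alpha, k and joint_threshold are illustrative values, not tuned settings.
    """
    # 1. Normalise heterogeneous metrics to zero mean and unit variance.
    mean = history.mean(axis=0)
    std = history.std(axis=0) + 1e-12  # guard against zero variance
    z = (history - mean) / std
    z_new = (new_point - mean) / std

    # 2. Per-axis check against an EWMA baseline for each metric.
    baselines = np.array([ewma(z[:, j], alpha) for j in range(z.shape[1])])
    per_axis = np.abs(z_new - baselines) > k

    # 3. Joint check: Mahalanobis distance under the historical covariance.
    cov = np.cov(z, rowvar=False)
    diff = z_new - z.mean(axis=0)
    distance = float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))
    return per_axis, distance > joint_threshold


# Hypothetical metrics: latency (ms), cost (USD), user rating (1-5).
rng = np.random.default_rng(0)
history = rng.normal(loc=[200.0, 0.05, 4.0], scale=[20.0, 0.01, 0.3], size=(200, 3))
typical = history.mean(axis=0)
drifted = typical + 6 * history.std(axis=0)

typical_flags = check_point(history, typical)
drifted_flags = check_point(history, drifted)
```

The joint check matters because several metrics can each drift by a modest amount that stays inside its own band while the combination is still unusual under the historical covariance; the Mahalanobis distance captures that case.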

[NLP-177] Pruning Weights but Not Truth: Safeguarding Truthfulness While Pruning LLMs EMNLP2025

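[Quick Read]: This paper addresses the problem that pruning large language models (LLMs) for low-resource deployment can disrupt the internal activation features that lie detection relies on. Naively adjusting layer-wise pruning sparsity according to layer importance inadvertently removes crucial weights and fails to improve lie-detection performance. The key to its solution is Truthful Pruning aligned by Layer-wise Outliers (TPLO), which places greater emphasis on layers with more activation outliers and stronger discriminative features, preserving the LLM's original performance while retaining the inner-state features needed for robust lie detection.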

Link: https://arxiv.org/abs/2509.00096
Authors: Yao Fu, Runchao Li, Xianxuan Long, Haotian Yu, Xiaotian Han, Yu Yin, Pan Li
Affiliations: Case Western Reserve University; Hangzhou Dianzi University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted to EMNLP2025 findings (poster)

Abstract:Neural network pruning has emerged as a promising approach for deploying LLMs in low-resource scenarios while preserving downstream task performance. However, for the first time, we reveal that such pruning disrupts LLMs’ internal activation features crucial for lie detection, where probing classifiers (typically small logistic regression models) trained on these features assess the truthfulness of LLM-generated statements. This discovery raises a crucial open question: how can we prune LLMs without sacrificing these critical lie detection capabilities? Our investigation further reveals that naively adjusting layer-wise pruning sparsity based on importance inadvertently removes crucial weights, failing to improve lie detection performance despite its reliance on the most crucial LLM layer. To address this issue, we propose Truthful Pruning aligned by Layer-wise Outliers (TPLO), which places greater emphasis on layers with more activation outliers and stronger discriminative features simultaneously. This preserves LLMs’ original performance while retaining critical features of inner states needed for robust lie detection. Moreover, we introduce a prompting rule to enrich the TruthfulQA benchmark for better calibrating LLM pruning. Empirical results show that our approach improves the hallucination detection for pruned LLMs (achieving 88% accuracy at 50% sparsity) and enhances their performance on TruthfulQA.

[NLP-178] Ensemble Debates with Local Large Language Models for AI Alignment

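[Quick Read]: This paper addresses the alignment of large language models (LLMs) with human values in high-stakes decisions, together with the poor reproducibility and limited participation that result from relying on proprietary APIs. The key to its approach is ensemble debates among local open-source models. Across 150 debates, ensembles outperform single-model baselines on a 7-point rubric, with the largest gains in reasoning depth and argument quality and the strongest improvements for truthfulness (+1.25 points) and human enhancement (+0.80).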

Link: https://arxiv.org/abs/2509.00091
Authors: Ephraiem Sarabamoun
Affiliations: Capital One
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 9 pages, 2 tables

Abstract:As large language models (LLMs) take on greater roles in high-stakes decisions, alignment with human values is essential. Reliance on proprietary APIs limits reproducibility and broad participation. We study whether local open-source ensemble debates can improve alignment-oriented reasoning. Across 150 debates spanning 15 scenarios and five ensemble configurations, ensembles outperform single-model baselines on a 7-point rubric (overall: 3.48 vs. 3.13), with the largest gains in reasoning depth (+19.4%) and argument quality (+34.1%). Improvements are strongest for truthfulness (+1.25 points) and human enhancement (+0.80). We provide code, prompts, and a debate data set, providing an accessible and reproducible foundation for ensemble-based alignment evaluation.

[NLP-179] Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs

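[Quick Read]: This paper addresses the limited performance of large language models (LLMs) on complex multi-step reasoning, and in particular the limitations of existing test-time scaling (TTS) methods such as Best-of-N and majority voting: they depend on the quality of the candidate responses, cannot produce a correct solution when every candidate is wrong, and incur extra deployment cost when a separate selection model is introduced. The key to its solution is Generative Self-Refinement (GSR), a parallel test-time scaling framework in which a single unified model first generates a set of candidate responses in parallel and then refines them, synthesizing a better solution from a prompt that contains the problem and the candidates. To make refinement effective, the authors design a hybrid training pipeline that jointly optimizes two complementary objectives: solving problems directly and refining candidate responses. The method reaches state-of-the-art performance on five mathematical benchmarks, and the learned self-refinement skill is model-agnostic, robust across model scales, and generalizes to out-of-distribution reasoning tasks.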

Link: https://arxiv.org/abs/2509.00084
Authors: Qibin Wang, Pu Zhao, Shaohan Huang, Fangkai Yang, Lu Wang, Furu Wei, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:To further enhance the ability of Large Language Models (LLMs) to solve complex, multi-step reasoning problems, test-time scaling (TTS) methods have gained widespread attention. Existing approaches such as Best-of-N and majority voting are limited as their performance depends on the quality of candidate responses, making them unable to produce a correct solution when all candidates are incorrect. Introducing an additional model to select the best response also incurs significant deployment costs. To this end, we introduce Generative Self-Refinement (GSR), a novel parallel test-time scaling framework where a unified model first generates a set of candidate responses in parallel and then performs self-refinement to synthesize a new superior solution based on a prompt consisting of the problem and these candidates. However, LLMs struggle to perform refinement effectively when prompted directly. Therefore, we design a hybrid training pipeline by jointly optimizing for two complementary objectives, solving problems directly and refining candidate responses. Experimental results demonstrate that our method achieves state-of-the-art performance across five mathematical benchmarks. We further show that this learned self-refinement skill is a model-agnostic enhancement, robust across different model scales and generalizing to out-of-distribution reasoning tasks.
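The two-stage inference flow in the abstract (sample candidates, then prompt the same model with the problem plus those candidates) can be sketched as below. This covers only the inference-time control flow; the hybrid training pipeline is not shown. The function names, the prompt wording, and the `model` callable (any function mapping a prompt string to a response string) are hypothetical and not the authors' implementation.

```python
def build_refinement_prompt(problem, candidates):
    """Combine the problem and the candidate answers into one refinement prompt."""
    lines = [f"Problem: {problem}", ""]
    for i, cand in enumerate(candidates, start=1):
        lines.append(f"Candidate {i}: {cand}")
    lines += ["", "Review the candidates and write one improved solution."]
    return "\n".join(lines)


def generative_self_refine(model, problem, n_candidates=4):
    """Stage 1: sample candidates. Stage 2: ask the same model to refine them."""
    # Sampled sequentially here for simplicity; a real system would batch these calls.
    candidates = [model(problem) for _ in range(n_candidates)]
    return model(build_refinement_prompt(problem, candidates))


# Stub model used only to show the call pattern.
calls = []


def stub_model(prompt):
    calls.append(prompt)
    return "answer"


output = generative_self_refine(stub_model, "What is 2 + 2?", n_candidates=3)
```

Unlike Best-of-N or majority voting, the final response is generated rather than selected, which is why this scheme can in principle return a correct answer even when every candidate is wrong.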

[NLP-180] Data Cartography for Detecting Memorization Hotspots and Guiding Data Interventions in Generative Models ICML

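[Quick Read]: This paper addresses overfitting and unintended memorization of rare training examples in generative models; memorized examples can be extracted by adversaries or can inflate benchmark performance. The key to its solution is Generative Data Cartography (GenDataCarto), which scores each pretraining sample on two measures: a difficulty score (early-epoch loss) and a memorization score (frequency of "forget events"). Samples are then partitioned into four quadrants to guide targeted pruning and up-/down-weighting. The authors prove that the memorization score lower-bounds classical influence under smoothness assumptions and that down-weighting high-memorization hotspots decreases the generalization gap via uniform stability bounds. Empirically, pruning just 10% of the data reduces synthetic canary extraction success by over 40% while increasing validation perplexity by less than 0.5%.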

Link: https://arxiv.org/abs/2509.00083
Authors: Laksh Patel, Neel Shanbhag
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 6 pages, 2 figures, 1 table; Presented at the 42nd International Conference on Machine Learning (ICML), winning the “Best Poster” award at ICML’s workshop for data in generative models (DIG-BUGS)

Abstract:Modern generative models risk overfitting and unintentionally memorizing rare training examples, which can be extracted by adversaries or inflate benchmark performance. We propose Generative Data Cartography (GenDataCarto), a data-centric framework that assigns each pretraining sample a difficulty score (early-epoch loss) and a memorization score (frequency of “forget events”), then partitions examples into four quadrants to guide targeted pruning and up-/down-weighting. We prove that our memorization score lower-bounds classical influence under smoothness assumptions and that down-weighting high-memorization hotspots provably decreases the generalization gap via uniform stability bounds. Empirically, GenDataCarto reduces synthetic canary extraction success by over 40% at just 10% data pruning, while increasing validation perplexity by less than 0.5%. These results demonstrate that principled data interventions can dramatically mitigate leakage with minimal cost to generative performance.
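The two per-sample scores and the quadrant partition can be sketched in plain Python. The abstract does not define a forget event precisely, so this sketch assumes one: a transition between consecutive epochs from loss at or below a threshold to loss above it. The function names, threshold, cut-offs, quadrant labels, and loss curves are all hypothetical.

```python
def difficulty_score(losses, early_epochs=2):
    """Mean loss over the first `early_epochs` epochs."""
    early = losses[:early_epochs]
    return sum(early) / len(early)


def memorization_score(losses, threshold=1.0):
    """Fraction of epoch-to-epoch transitions that are forget events.

    Assumption: a forget event is a move from loss <= threshold
    to loss > threshold between consecutive epochs.
    """
    transitions = list(zip(losses, losses[1:]))
    forgets = sum(1 for prev, cur in transitions if prev <= threshold < cur)
    return forgets / len(transitions)


def quadrant(difficulty, memorization, d_cut, m_cut):
    """Label a sample by which side of each cut-off its two scores fall on."""
    d = "hard" if difficulty > d_cut else "easy"
    m = "high-mem" if memorization > m_cut else "low-mem"
    return f"{d}/{m}"


# Hypothetical per-epoch losses for three training samples.
loss_curves = {
    "stable": [2.0, 1.2, 0.8, 0.7, 0.6, 0.5],
    "hotspot": [3.0, 0.9, 1.4, 0.8, 1.5, 0.7],
    "hard": [3.5, 3.2, 2.9, 2.7, 2.5, 2.4],
}
labels = {
    name: quadrant(difficulty_score(c), memorization_score(c), d_cut=2.5, m_cut=0.2)
    for name, c in loss_curves.items()
}
```

In this toy data the `hotspot` sample is learned and forgotten twice across five transitions, so it lands in a high-memorization quadrant and would be a candidate for pruning or down-weighting.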

[NLP-181] Language and Experience: A Computational Model of Social Learning in Complex Tasks

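[Quick Read]: This paper asks how linguistic guidance from others can be combined with direct experience to learn safely and quickly in new environments. The core challenge is modeling how humans integrate these two sources of knowledge and how AI systems might do the same. The key to its solution is a computational framework that models social learning as joint probabilistic inference over structured, executable world models given sensorimotor and linguistic data. A pretrained language model is turned into a probabilistic model of how humans share advice conditioned on their beliefs, so that agents can both generate advice for others and interpret linguistic input as evidence during Bayesian inference.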

Link: https://arxiv.org/abs/2509.00074
Authors: Cédric Colas, Tracey Mills, Ben Prystawski, Michael Henry Tessler, Noah Goodman, Jacob Andreas, Joshua Tenenbaum
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:The ability to combine linguistic guidance from others with direct experience is central to human development, enabling safe and rapid learning in new environments. How do people integrate these two sources of knowledge, and how might AI systems? We present a computational framework that models social learning as joint probabilistic inference over structured, executable world models given sensorimotor and linguistic data. We make this possible by turning a pretrained language model into a probabilistic model of how humans share advice conditioned on their beliefs, allowing our agents both to generate advice for others and to interpret linguistic input as evidence during Bayesian inference. Using behavioral experiments and simulations across 10 video games, we show how linguistic guidance can shape exploration and accelerate learning by reducing risky interactions and speeding up key discoveries in both humans and models. We further explore how knowledge can accumulate across generations through iterated learning experiments and demonstrate successful knowledge transfer between humans and models – revealing how structured, language-compatible representations might enable human-machine collaborative learning.

[NLP-182] Traj-MLLM: Can Multimodal Large Language Models Reform Trajectory Data Mining?

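[Quick Read]: This paper addresses the poor generalization of trajectory data mining models: existing methods are either restricted to specific geographic regions or suited to only a few tasks. Its solution is Traj-MLLM, the first general framework that uses multimodal large language models (MLLMs) for trajectory data mining. By integrating multiview contexts, it transforms raw trajectories into interleaved image-text sequences while preserving key spatial-temporal characteristics, and it uses the reasoning ability of MLLMs directly, without training data or fine-tuning. The key contributions are task-independent multimodal trajectory representations and a prompt optimization method that produces data-invariant prompts for task adaptation.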

Link: https://arxiv.org/abs/2509.00053
Authors: Shuo Liu, Di Yao, Yan Lin, Gao Cong, Jingping Bi
Affiliations: University of Chinese Academy of Sciences; Institute of Computing Technology, Chinese Academy of Sciences; Aalborg University; Nanyang Technological University
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 20 pages, 10 figures

Abstract:Building a general model capable of analyzing human trajectories across different geographic regions and different tasks becomes an emergent yet important problem for various applications. However, existing works suffer from the generalization problem, i.e., they are either restricted to train for specific regions or only suitable for a few tasks. Given the recent advances of multimodal large language models (MLLMs), we raise the question: can MLLMs reform current trajectory data mining and solve the problem? Nevertheless, due to the modality gap of trajectory, how to generate task-independent multimodal trajectory representations and how to adapt flexibly to different tasks remain the foundational challenges. In this paper, we propose Traj-MLLM, which is the first general framework using MLLMs for trajectory data mining. By integrating multiview contexts, Traj-MLLM transforms raw trajectories into interleaved image-text sequences while preserving key spatial-temporal characteristics, and directly utilizes the reasoning ability of MLLMs for trajectory analysis. Additionally, a prompt optimization method is proposed to finalize data-invariant prompts for task adaptation. Extensive experiments on four publicly available datasets show that Traj-MLLM outperforms state-of-the-art baselines by 48.05%, 15.52%, 51.52%, and 1.83% on travel time estimation, mobility prediction, anomaly detection and transportation mode identification, respectively. Traj-MLLM achieves these superior performances without requiring any training data or fine-tuning the MLLM backbones.

[NLP-183] Compiling Prompts Not Crafting Them: A Reproducible Workflow for AI-Assisted Evidence Synthesis

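[Quick Read]: This paper addresses the reliability and reproducibility problems of using large language models (LLMs) in systematic literature reviews (SLRs), which stem from brittle, manually crafted prompts. The key to its solution is adapting declarative prompt optimisation to the SLR domain: a structured, domain-specific framework embeds task declarations, test suites, and automated prompt tuning into a reproducible SLR workflow. Working code examples let researchers build verifiable LLM pipelines that keep evidence synthesis transparent and rigorous.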

Link: https://arxiv.org/abs/2509.00038
Authors: Teo Susnjak
Affiliations: Massey University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) offer significant potential to accelerate systematic literature reviews (SLRs), yet current approaches often rely on brittle, manually crafted prompts that compromise reliability and reproducibility. This fragility undermines scientific confidence in LLM-assisted evidence synthesis. In response, this work adapts recent advances in declarative prompt optimisation, developed for general-purpose LLM applications, and demonstrates their applicability to the domain of SLR automation. This research proposes a structured, domain-specific framework that embeds task declarations, test suites, and automated prompt tuning into a reproducible SLR workflow. These emerging methods are translated into a concrete blueprint with working code examples, enabling researchers to construct verifiable LLM pipelines that align with established principles of transparency and rigour in evidence synthesis. This is a novel application of such approaches to SLR pipelines.

[NLP-184] MultiStream-LLM: Bridging Modalities for Robust Sign Language Translation

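[Quick Read]: This paper addresses two weaknesses of current end-to-end Sign Language Translation (SLT) models on natural signing: poor recognition of high-speed fingerspelling, and failure to integrate asynchronous non-manual cues from the face. Existing methods force a single network to learn everything at once, which hurts performance on crucial information such as names, places, and technical terms. The key to its solution is MultiStream-LLM, a modular framework with separate, specialized predictors for continuous signing, fingerspelling, and lipreading. Each expert network first decodes its modality into a sequence of tokens; a lightweight transformer then resolves temporal misalignments and fuses the streams before passing the combined representation to a Large Language Model (LLM) for sentence generation.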

Link: https://arxiv.org/abs/2509.00030
Authors: Marshall Thomas, Edward Fish, Richard Bowden
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Despite progress in gloss-free Sign Language Translation (SLT), monolithic end-to-end models consistently fail on two critical components of natural signing: the precise recognition of high-speed fingerspelling and the integration of asynchronous non-manual cues from the face. Recent progress in Automated Sign Language Translation with Large Language Models has sidestepped this challenge, forcing a single network to learn these simultaneously, resulting in poor performance when tasked with translating crucial information such as names, places, and technical terms. We introduce MultiStream-LLM, a modular framework designed to overcome these limitations. Our approach employs separate, specialized predictors for continuous signing, fingerspelling, and lipreading. Each expert network first decodes its specific modality into a sequence of tokens. These parallel streams are then fused by a lightweight transformer that resolves temporal misalignments before passing the combined representation to a Large Language Model (LLM) for final sentence generation. Our method establishes a new state-of-the-art on the How2Sign benchmark with a BLEU-4 score of 23.5 and achieves 73.2% letter accuracy on the challenging ChicagoFSWildPlus fingerspelling dataset. These results validate our core hypothesis: by isolating and solving distinct recognition tasks before fusion, our multi-expert approach provides a more powerful and effective pathway to robust, high-fidelity sign language translation.

[NLP-185] MixedG2P-T5: G2P-free Speech Synthesis for Mixed-script texts using Speech Self-Supervised Learning and Language Model

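[Quick Read]: This paper addresses the high cost and poor scalability of the grapheme-to-phoneme (G2P) conversion step in conventional speech synthesis, which relies on manual phonetic transcription, particularly for large non-transcribed audio datasets. The key to its solution is using a pretrained speech self-supervised learning (SSL) model to train a T5 encoder that produces pseudo-language labels directly from mixed-script text (for example, text containing both Kanji and Kana). This bypasses explicit G2P conversion while retaining natural linguistic and paralinguistic features such as accents and intonations.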

Link: https://arxiv.org/abs/2509.01391
Authors: Joonyong Park, Daisuke Saito, Nobuaki Minematsu
Affiliations: The University of Tokyo
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Comments: In Proceedings of the 17th Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2025)

Abstract:This study presents a novel approach to voice synthesis that can substitute the traditional grapheme-to-phoneme (G2P) conversion by using a deep learning-based model that generates discrete tokens directly from speech. Utilizing a pre-trained voice SSL model, we train a T5 encoder to produce pseudo-language labels from mixed-script texts (e.g., containing Kanji and Kana). This method eliminates the need for manual phonetic transcription, reducing costs and enhancing scalability, especially for large non-transcribed audio datasets. Our model matches the performance of conventional G2P-based text-to-speech systems and is capable of synthesizing speech that retains natural linguistic and paralinguistic features, such as accents and intonations.

[NLP-186] Hybrid Topic-Semantic Labeling and Graph Embeddings for Unsupervised Legal Document Clustering

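[Quick Read]: This paper addresses the challenges that domain-specific language and limited labeled data pose for legal text classification. The key to its solution is a hybrid approach that combines unsupervised topic modeling with graph embeddings: Top2Vec learns semantic document embeddings and automatically discovers latent topics, while Node2Vec captures structural relationships via a bipartite graph of legal documents. The two embeddings are combined and clustered with KMeans. Experiments show that the combined Top2Vec+Node2Vec approach improves clustering quality over text-only or graph-only embeddings, and a hyperparameter sensitivity analysis shows it is competitive with LDA and NMF baselines.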

Link: https://arxiv.org/abs/2509.00990
Authors: Deepak Bastola, Woohyeok Choi
Affiliations: Salisbury University; Carleton College
Categories: Machine Learning (stat.ML); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 20 pages, 8 figures, 3 tables

Abstract:Legal documents pose unique challenges for text classification due to their domain-specific language and often limited labeled data. This paper proposes a hybrid approach for classifying legal texts by combining unsupervised topic and graph embeddings with a supervised model. We employ Top2Vec to learn semantic document embeddings and automatically discover latent topics, and Node2Vec to capture structural relationships via a bipartite graph of legal documents. The embeddings are combined and clustered using KMeans, yielding coherent groupings of documents. Our computations on a legal document dataset demonstrate that the combined Top2Vec+Node2Vec approach improves clustering quality over text-only or graph-only embeddings. We conduct a sensitivity analysis of hyperparameters, such as the number of clusters and the dimensionality of the embeddings, and demonstrate that our method achieves competitive performance against baseline Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF) models. Key findings indicate that while the pipeline presents an innovative approach to unsupervised legal document analysis by combining semantic topic modeling with graph embedding techniques, its efficacy is contingent upon the quality of initial topic generation and the representational power of the chosen embedding models for specialized legal language. Strategic recommendations include the exploration of domain-specific embeddings, more comprehensive hyperparameter tuning for Node2Vec, dynamic determination of cluster numbers, and robust human-in-the-loop validation processes to enhance legal relevance and trustworthiness. The pipeline demonstrates potential for exploratory legal data analysis and as a precursor to supervised learning tasks but requires further refinement and domain-specific adaptation for practical legal applications.

[NLP-187] Automatic Pronunciation Error Detection and Correction of the Holy Quran's Learners Using Deep Learning

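[Quick Read]: This paper addresses pronunciation assessment for Quranic recitation, and in particular the difficulty of quantifying pronunciation metrics for machine learning models. The rigorous recitation rules (tajweed) established by Muslim scholars provide a clear standard, which makes automatic assessment feasible, but the scarcity of high-quality annotated data remains the main barrier. The solution has three parts: (1) a 98% automated data pipeline covering collection of expert recitations, segmentation at pause points (waqf) with a fine-tuned wav2vec2-BERT model, segment transcription, and transcript verification via the novel Tasmeea algorithm; (2) an ASR-based approach to pronunciation error detection built on a custom Quran Phonetic Script (QPS) that encodes tajweed rules at two levels, a phoneme level for Arabic letters with short/long vowels and a sifa level for the articulation characteristics of each phoneme; (3) a multi-level CTC model that achieves a 0.16% average Phoneme Error Rate (PER) on the test set.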

Link: https://arxiv.org/abs/2509.00094
Authors: Abdullah Abdelfattah, Mahmoud I. Khalil, Hazem Abbas
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments:

Abstract:Assessing spoken language is challenging, and quantifying pronunciation metrics for machine learning models is even harder. However, for the Holy Quran, this task is simplified by the rigorous recitation rules (tajweed) established by Muslim scholars, enabling highly effective assessment. Despite this advantage, the scarcity of high-quality annotated data remains a significant barrier. In this work, we bridge these gaps by introducing: (1) A 98% automated pipeline to produce high-quality Quranic datasets – encompassing: Collection of recitations from expert reciters, Segmentation at pause points (waqf) using our fine-tuned wav2vec2-BERT model, Transcription of segments, Transcript verification via our novel Tasmeea algorithm; (2) 850+ hours of audio (~300K annotated utterances); (3) A novel ASR-based approach for pronunciation error detection, utilizing our custom Quran Phonetic Script (QPS) to encode Tajweed rules (unlike the IPA standard for Modern Standard Arabic). QPS uses a two-level script: (Phoneme level): Encodes Arabic letters with short/long vowels. (Sifa level): Encodes articulation characteristics of every phoneme. We further include comprehensive modeling with our novel multi-level CTC Model which achieved 0.16% average Phoneme Error Rate (PER) on the test set. We release all code, data, and models as open-source: this https URL
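The Phoneme Error Rate quoted in the abstract is conventionally computed as the edit distance between the reference and predicted phoneme sequences divided by the reference length. A minimal Python sketch of that standard definition follows. The abstract does not spell out the paper's exact scoring procedure, so treat this as an assumption; the phoneme symbols are made up and are not QPS.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference symbol
                cur[j - 1] + 1,          # insertion of a hypothesis symbol
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance normalised by the reference length."""
    return edit_distance(ref, hyp) / len(ref)


# Made-up phoneme symbols, used only to exercise the functions.
reference = ["b", "i", "s", "m", "i"]
substituted = ["b", "i", "z", "m", "i"]  # one substitution
shortened = ["b", "i", "m", "i"]         # one deletion

per_sub = phoneme_error_rate(reference, substituted)
per_del = phoneme_error_rate(reference, shortened)
```

Both toy hypotheses differ from the five-symbol reference by one edit, giving a PER of 0.2. The alignment that produces the distance also localizes which phoneme was mispronounced, which is what an error-detection system reports back to the learner.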

[NLP-188] ChipChat: Low-Latency Cascaded Conversational Agent in MLX

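[Quick Read]: This paper addresses the slow responses of real-time on-device voice agents caused by the sequential processing latency of traditional cascaded systems (CSs). Cascaded systems still outperform end-to-end models on language understanding tasks, but their sequential execution limits how low latency can go. The key to its solution is ChipChat, a low-latency cascaded system built on architectural innovations and streaming optimizations. It integrates streaming conversational speech recognition with mixture-of-experts, a state-action augmented LLM, text-to-speech synthesis, a neural vocoder, and speaker modeling. Implemented in MLX and running fully on-device, it achieves sub-second response latency on a Mac Studio without dedicated GPUs while preserving user privacy.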

Link: https://arxiv.org/abs/2509.00078
Authors: Tatiana Likhomanenko, Luke Carlson, Richard He Bai, Zijin Gu, Han Tran, Zakaria Aldeneh, Yizhe Zhang, Ruixiang Zhang, Huangjie Zheng, Navdeep Jaitly
Affiliations: Apple
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments: ASRU 2025

Abstract:The emergence of large language models (LLMs) has transformed spoken dialog systems, yet the optimal architecture for real-time on-device voice agents remains an open question. While end-to-end approaches promise theoretical advantages, cascaded systems (CSs) continue to outperform them in language understanding tasks, despite being constrained by sequential processing latency. In this work, we introduce ChipChat, a novel low-latency CS that overcomes traditional bottlenecks through architectural innovations and streaming optimizations. Our system integrates streaming (a) conversational speech recognition with mixture-of-experts, (b) state-action augmented LLM, (c) text-to-speech synthesis, (d) neural vocoder, and (e) speaker modeling. Implemented using MLX, ChipChat achieves sub-second response latency on a Mac Studio without dedicated GPUs, while preserving user privacy through complete on-device processing. Our work shows that strategically redesigned CSs can overcome their historical latency limitations, offering a promising path forward for practical voice-based AI agents.

[NLP-189] Amplifying Emotional Signals: Data-Efficient Deep Learning for Robust Speech Emotion Recognition

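[Quick Read]: This paper addresses the difficulty of achieving accurate Speech Emotion Recognition (SER) with limited data. The key to its solution is combining transfer learning with data augmentation to ease the performance bottleneck caused by data scarcity. A pretrained ResNet34 model with data augmentation reaches 66.7% accuracy and an F1 score of 0.631 on the combined RAVDESS and SAVEE datasets.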

Link: https://arxiv.org/abs/2509.00077
Authors: Tai Vu
Affiliations: Stanford University
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments:

Abstract:Speech Emotion Recognition (SER) presents a significant yet persistent challenge in human-computer interaction. While deep learning has advanced spoken language processing, achieving high performance on limited datasets remains a critical hurdle. This paper confronts this issue by developing and evaluating a suite of machine learning models, including Support Vector Machines (SVMs), Long Short-Term Memory networks (LSTMs), and Convolutional Neural Networks (CNNs), for automated emotion classification in human speech. We demonstrate that by strategically employing transfer learning and innovative data augmentation techniques, our models can achieve impressive performance despite the constraints of a relatively small dataset. Our most effective model, a ResNet34 architecture, establishes a new performance benchmark on the combined RAVDESS and SAVEE datasets, attaining an accuracy of 66.7% and an F1 score of 0.631. These results underscore the substantial benefits of leveraging pre-trained models and data augmentation to overcome data scarcity, thereby paving the way for more robust and generalizable SER systems.

Computer Vision

[CV-0] FastVGGT: Training-Free Acceleration of Visual Geometry Transformer

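[Quick Read]: This paper addresses the inference inefficiency of 3D vision foundation models such as VGGT on long-sequence image inputs. An analysis of VGGT identifies its primary bottleneck, and visualization reveals a token collapse phenomenon in the attention maps. The key to its solution is FastVGGT, which for the first time applies token merging in the 3D domain through a training-free mechanism: a token partitioning strategy tailored to 3D architectures and tasks eliminates redundant computation while preserving VGGT's reconstruction capacity. With 1000 input images, FastVGGT achieves a 4x speedup over VGGT while mitigating error accumulation in long-sequence scenarios.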

Link: https://arxiv.org/abs/2509.02560
Authors: You Shen, Zhipeng Zhang, Yansong Qu, Liujuan Cao
Affiliations: Xiamen University; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Foundation models for 3D vision have recently demonstrated remarkable capabilities in 3D perception. However, scaling these models to long-sequence image inputs remains a significant challenge due to inference-time inefficiency. In this work, we present a detailed analysis of VGGT, a state-of-the-art feed-forward visual geometry model, and identify its primary bottleneck. Visualization further reveals a token collapse phenomenon in the attention maps. Motivated by these findings, we explore the potential of token merging in the feed-forward visual geometry model. Owing to the unique architectural and task-specific properties of 3D models, directly applying existing merging techniques proves challenging. To this end, we propose FastVGGT, which, for the first time, leverages token merging in the 3D domain through a training-free mechanism for accelerating VGGT. We devise a unique token partitioning strategy tailored to 3D architectures and tasks, effectively eliminating redundant computation while preserving VGGT’s powerful reconstruction capacity. Extensive experiments on multiple 3D geometry benchmarks validate the effectiveness of our approach. Notably, with 1000 input images, FastVGGT achieves a 4x speedup over VGGT while mitigating error accumulation in long-sequence scenarios. These findings underscore the potential of token merging as a principled solution for scalable 3D vision systems. Code is available at: this https URL.

[CV-1] Motion-Refined DINOSAUR for Unsupervised Multi-Object Discovery ICCV

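[Quick Read]: This paper addresses the reliance on supervised pseudo labels in unsupervised multi-object discovery (MOD): existing methods use motion cues from video and object-centric learning (OCL) to detect and localize objects, but still need supervision to generate the pseudo labels used for training. The key to the solution is MR-DINOSAUR, a minimalistic, fully unsupervised framework built on the self-supervised pre-trained OCL model DINOSAUR. It retrieves video frames without camera motion and performs motion segmentation on unsupervised optical flow to produce high-quality unsupervised pseudo labels. These labels are then used to refine DINOSAUR's slot representations, and a slot deactivation module is trained to separate foreground from background, yielding strong multi-object discovery without any human annotation.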

Link: https://arxiv.org/abs/2509.02545
Authors: Xinrui Gong, Oliver Hahn, Christoph Reich, Krishnakant Singh, Simone Schaub-Meyer, Daniel Cremers, Stefan Roth
Affiliations: TU Darmstadt; TU Munich; MCML; ELIZA; hessian.AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: To appear at ICCVW 2025. Xinrui Gong and Oliver Hahn contributed equally. Code: this https URL

Abstract:Unsupervised multi-object discovery (MOD) aims to detect and localize distinct object instances in visual scenes without any form of human supervision. Recent approaches leverage object-centric learning (OCL) and motion cues from video to identify individual objects. However, these approaches use supervision to generate pseudo labels to train the OCL model. We address this limitation with MR-DINOSAUR – Motion-Refined DINOSAUR – a minimalistic unsupervised approach that extends the self-supervised pre-trained OCL model, DINOSAUR, to the task of unsupervised multi-object discovery. We generate high-quality unsupervised pseudo labels by retrieving video frames without camera motion for which we perform motion segmentation of unsupervised optical flow. We refine DINOSAUR’s slot representations using these pseudo labels and train a slot deactivation module to assign slots to foreground and background. Despite its conceptual simplicity, MR-DINOSAUR achieves strong multi-object discovery results on the TRI-PD and KITTI datasets, outperforming the previous state of the art despite being fully unsupervised.

[CV-2] Mix-modal Federated Learning for MRI Image Segmentation

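[Quick Read]: This paper addresses non-centralized mix-modal MRI image segmentation: in distributed medical settings, each client (for example, a hospital) handles its own mix of MRI modalities, and the resulting modality heterogeneity and data heterogeneity make existing federated learning (FL) methods hard to apply. The key to the solution is a new framework, Modality Decoupling and Memorizing Mix-Modal Federated Learning (MDM-MixMFL), built on two mechanisms. First, a modality decoupling strategy splits each modality into modality-tailored and modality-shared information, which receive tailored and shared updates respectively, enabling stable and adaptive aggregation across heterogeneous data and modalities. Second, a modality memorizing mechanism dynamically stores client-shared modality prototypes from each modality-tailored encoder to compensate for modalities missing at local clients, improving modality aggregation and fusion during mix-modal federated learning.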

Link: https://arxiv.org/abs/2509.02541
Authors: Guyue Hu, Siyuan Song, Jingpeng Sun, Zhe Jin, Chenglong Li, Jin Tang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Magnetic resonance imaging (MRI) image segmentation is crucial in diagnosing and treating many diseases, such as brain tumors. Existing MRI image segmentation methods mainly fall into a centralized multimodal paradigm, which is inapplicable in engineering non-centralized mix-modal medical scenarios. In this situation, each distributed client (hospital) processes multiple mixed MRI modalities, and the modality set and image data for each client are diverse, suffering from extensive client-wise modality heterogeneity and data heterogeneity. In this paper, we first formulate non-centralized mix-modal MRI image segmentation as a new paradigm for federated learning (FL) that involves multiple modalities, called mix-modal federated learning (MixMFL). It is distinct from the existing multimodal federated learning (MulMFL) and cross-modal federated learning (CroMFL) paradigms. Then, we propose a novel modality decoupling and memorizing mix-modal federated learning framework (MDM-MixMFL) for MRI image segmentation, which is characterized by a modality decoupling strategy and a modality memorizing mechanism. Specifically, the modality decoupling strategy disentangles each modality into modality-tailored and modality-shared information. During mix-modal federated updating, the corresponding modality encoders undergo tailored and shared updating, respectively. This facilitates stable and adaptive federated aggregation of heterogeneous data and modalities from distributed clients. In addition, the modality memorizing mechanism stores client-shared modality prototypes, dynamically refreshed from every modality-tailored encoder, to compensate for incomplete modalities in each local client. This further benefits the modality aggregation and fusion processes during mix-modal federated learning. Extensive experiments on two public datasets for MRI image segmentation demonstrate the effectiveness and superiority of our methods.

[CV-3] Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots

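[Quick Read]: This paper addresses the poor generalization of current robotic manipulation, which relies on 2D color images for skill learning. Purely visual input limits how reliably robots can carry out complex operations in the real world, whereas humans interacting in 3D space depend more on geometric properties (such as distance, size, and shape) than on texture, which suggests that accurate 3D geometric information should improve robot perception. The key to the solution is Camera Depth Models (CDMs), a lightweight plugin for everyday depth cameras that takes RGB images and raw depth signals as input and outputs denoised depth maps with accurate metric scale. Its core innovation is a neural data engine that models a depth camera's noise pattern to generate high-quality paired training data, achieving nearly simulation-level depth accuracy and narrowing the sim-to-real gap. Experiments show that a policy trained on raw simulated depth, with no added noise and no real-world fine-tuning, transfers seamlessly to real robots on two long-horizon tasks involving articulated, reflective, and slender objects, with little to no loss in performance.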

Link: https://arxiv.org/abs/2509.02530
Authors: Minghuan Liu, Zhengbang Zhu, Xiaoshen Han, Peng Hu, Haotong Lin, Xinyao Li, Jingxiao Chen, Jiafeng Xu, Yichu Yang, Yunfeng Lin, Xinghang Li, Yong Yu, Weinan Zhang, Tao Kong, Bingyi Kang
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 32 pages, 18 figures, project page: this https URL

Abstract:Modern robotic manipulation primarily relies on visual observations in a 2D color space for skill learning but suffers from poor generalization. In contrast, humans, living in a 3D world, depend more on physical properties (such as distance, size, and shape) than on texture when interacting with objects. Since such 3D geometric information can be acquired from widely available depth cameras, it appears feasible to endow robots with similar perceptual capabilities. Our pilot study found that using depth cameras for manipulation is challenging, primarily due to their limited accuracy and susceptibility to various types of noise. In this work, we propose Camera Depth Models (CDMs) as a simple plugin on daily-use depth cameras, which take RGB images and raw depth signals as input and output denoised, accurate metric depth. To achieve this, we develop a neural data engine that generates high-quality paired data from simulation by modeling a depth camera’s noise pattern. Our results show that CDMs achieve nearly simulation-level accuracy in depth prediction, effectively bridging the sim-to-real gap for manipulation tasks. Notably, our experiments demonstrate, for the first time, that a policy trained on raw simulated depth, without the need for adding noise or real-world fine-tuning, generalizes seamlessly to real-world robots on two challenging long-horizon tasks involving articulated, reflective, and slender objects, with little to no performance degradation. We hope our findings will inspire future research in utilizing simulation data and 3D information in general robot policies.

[CV-4] Enhancing Fitness Movement Recognition with Attention Mechanism and Pre-Trained Feature Extractors

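[Quick Read]: This paper addresses the reliance of existing deep learning methods for fitness movement recognition on computationally intensive 3D models, which are hard to deploy in real-time or resource-constrained settings. The key to the solution is a lightweight framework that extracts spatial features with pre-trained 2D feature extractors (such as ResNet50, EfficientNet, and Vision Transformer) and captures temporal dependencies with a Long Short-Term Memory (LSTM) network enhanced by spatial attention, keeping accuracy high while reducing computational cost and enabling efficient real-time recognition in resource-constrained environments.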

Link: https://arxiv.org/abs/2509.02511
Authors: Shanjid Hasan Nishat, Srabonti Deb, Mohiuddin Ahmed
Affiliations: Rajshahi University of Engineering & Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 9 figures, 2025 28th International Conference on Computer and Information Technology (ICCIT)

Abstract:Fitness movement recognition, a focused subdomain of human activity recognition (HAR), plays a vital role in health monitoring, rehabilitation, and personalized fitness training by enabling automated exercise classification from video data. However, many existing deep learning approaches rely on computationally intensive 3D models, limiting their feasibility in real-time or resource-constrained settings. In this paper, we present a lightweight and effective framework that integrates pre-trained 2D Convolutional Neural Networks (CNNs) such as ResNet50, EfficientNet, and Vision Transformers (ViT) with a Long Short-Term Memory (LSTM) network enhanced by spatial attention. These models efficiently extract spatial features while the LSTM captures temporal dependencies, and the attention mechanism emphasizes informative segments. We evaluate the framework on a curated subset of the UCF101 dataset, achieving a peak accuracy of 93.34% with the ResNet50-based configuration. Comparative results demonstrate the superiority of our approach over several state-of-the-art HAR systems. The proposed method offers a scalable and real-time-capable solution for fitness activity recognition with broader applications in vision-based health and activity monitoring.
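
The role of attention on top of recurrent features can be sketched as a softmax-weighted pooling over per-frame feature vectors. This is a minimal illustration, assuming a fixed scoring vector `w` in place of learned parameters; the helper `attention_pool` and its inputs are hypothetical and do not reproduce the paper's architecture.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, w):
    """Pool per-frame features into one vector using attention weights.

    frames: list of feature vectors, one per time step (e.g. LSTM outputs).
    w: scoring vector (hypothetical stand-in for learned parameters).
    """
    scores = [sum(f_i * w_i for f_i, w_i in zip(f, w)) for f in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    pooled = [sum(a * f[d] for a, f in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


frames = [[0.0, 1.0], [2.0, 0.0], [0.0, 0.0]]
pooled, weights = attention_pool(frames, w=[1.0, 0.0])
print(weights)  # the second frame receives the largest weight
```

Frames whose features score higher against `w` contribute more to the pooled vector, which is one simple way to emphasize informative segments of a clip.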

[CV-5] Anisotropic Fourier Features for Positional Encoding in Medical Imaging MICCAI2025

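[Quick Read]: This paper addresses the performance limits that poorly chosen positional encodings (PEs) impose on Transformer architectures in medical image analysis. Specifically, conventional Sinusoidal Positional Encoding (SPE) struggles to preserve Euclidean distances in higher-dimensional spaces, while Isotropic Fourier Feature Positional Encoding (IFPE) preserves distances better but cannot model the anisotropy common in medical images. The key to the solution is a new Anisotropic Fourier Feature Positional Encoding (AFPE), which introduces anisotropy parameters tied to class, domain, and spatial axis, explicitly modeling the dependence on structure shape and data heterogeneity. Experiments on multi-label chest X-ray classification, organ classification in CT, and ejection fraction regression in echocardiography show that AFPE significantly outperforms state-of-the-art positional encodings in the anisotropic settings tested, demonstrating that choosing a PE that fits the data and the shape of the target structure is crucial for model performance.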

Link: https://arxiv.org/abs/2509.02488
Authors: Nabil Jabareen, Dongsheng Yuan, Dingming Liu, Foo-Wei Ten, Sören Lukassen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 13 pages, 3 figures, 2 tables, to be published in ShapeMI MICCAI 2025

Abstract:The adoption of Transformer-based architectures in the medical domain is growing rapidly. In medical imaging, the analysis of complex shapes - such as organs, tissues, or other anatomical structures - combined with the often anisotropic nature of high-dimensional images complicates these adaptations. In this study, we critically examine the role of Positional Encodings (PEs), arguing that commonly used approaches may be suboptimal for the specific challenges of medical imaging. Sinusoidal Positional Encodings (SPEs) have proven effective in vision tasks, but they struggle to preserve Euclidean distances in higher-dimensional spaces. Isotropic Fourier Feature Positional Encodings (IFPEs) have been proposed to better preserve Euclidean distances, but they lack the ability to account for anisotropy in images. To address these limitations, we propose Anisotropic Fourier Feature Positional Encoding (AFPE), a generalization of IFPE that incorporates anisotropic, class-specific, and domain-specific spatial dependencies. We systematically benchmark AFPE against commonly used PEs on multi-label classification in chest X-rays, organ classification in CT images, and ejection fraction regression in echocardiography. Our results demonstrate that choosing the correct PE can significantly improve model performance. We show that the optimal PE depends on the shape of the structure of interest and the anisotropy of the data. Finally, our proposed AFPE significantly outperforms state-of-the-art PEs in all tested anisotropic settings. We conclude that, in anisotropic medical images and videos, it is of paramount importance to choose an anisotropic PE that fits the data and the shape of interest.
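
To make the isotropic versus anisotropic distinction concrete, the sketch below builds random Fourier feature encodings whose frequencies are drawn from a Gaussian with a separate scale per axis. This is only one plausible way to introduce anisotropy and is an assumption for illustration; the abstract does not specify how AFPE is parameterized, and the names `sample_frequencies`, `fourier_features`, and `similarity` are hypothetical.

```python
import math
import random


def sample_frequencies(num_features, sigmas, seed=0):
    """Draw frequency vectors with an independent Gaussian scale per axis.

    Equal entries in `sigmas` give isotropic features; unequal entries give
    an anisotropic variant (an assumption made for illustration).
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, s) for s in sigmas] for _ in range(num_features)]


def fourier_features(x, freqs):
    """Encode position x as [cos(2*pi*b.x), sin(2*pi*b.x)] for each frequency b."""
    out = []
    for b in freqs:
        phase = 2.0 * math.pi * sum(b_i * x_i for b_i, x_i in zip(b, x))
        out.extend([math.cos(phase), math.sin(phase)])
    return out


def similarity(x, y, freqs):
    """Normalized dot product of two encodings; depends only on x - y."""
    fx, fy = fourier_features(x, freqs), fourier_features(y, freqs)
    return sum(a * b for a, b in zip(fx, fy)) / len(freqs)


# Hypothetical 2D grid where the second axis has coarser physical spacing.
freqs = sample_frequencies(num_features=256, sigmas=[1.0, 0.25])
print(similarity([0.0, 0.0], [0.3, 0.0], freqs))  # step along the first axis
print(similarity([0.0, 0.0], [0.0, 0.3], freqs))  # same step along the second axis
```

With these settings, the same step decorrelates the encoding faster along the first axis than along the second, which is the kind of direction-dependent behavior an anisotropic encoding is meant to capture.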

[CV-6] Unifi3D: A Study on 3D Representations for Generation and Reconstruction in a Common Framework

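[Quick Read]: This paper addresses the lack of a unified evaluation standard in 3D generation, where representations are diverse and fragmented and reconstruction and generation have no systematic framework for comparison. The key to the solution is a unified evaluation framework that quantitatively compares multiple 3D representations (such as voxel grids, neural radiance fields, signed distance functions, point clouds, and octrees) on reconstruction quality, computational efficiency, and generalization performance. It further analyzes the key steps of the full 3D generation pipeline (preprocessing, mesh reconstruction, compression with autoencoders, and generation), showing that reconstruction errors significantly affect overall performance and that generation and reconstruction should be evaluated jointly, and it provides empirical guidance for selecting suitable 3D models for different applications.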

Link: https://arxiv.org/abs/2509.02474
Authors: Nina Wiedemann, Sainan Liu, Quentin Leboutet, Katelyn Gao, Benjamin Ummenhofer, Michael Paulitsch, Kai Yuan
Affiliations: Intel Corporation
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Following rapid advancements in text and image generation, research has increasingly shifted towards 3D generation. Unlike the well-established pixel-based representation in images, 3D representations remain diverse and fragmented, encompassing a wide variety of approaches such as voxel grids, neural radiance fields, signed distance functions, point clouds, or octrees, each offering distinct advantages and limitations. In this work, we present a unified evaluation framework designed to assess the performance of 3D representations in reconstruction and generation. We compare these representations based on multiple criteria: quality, computational efficiency, and generalization performance. Beyond standard model benchmarking, our experiments aim to derive best practices over all steps involved in the 3D generation pipeline, including preprocessing, mesh reconstruction, compression with autoencoders, and generation. Our findings highlight that reconstruction errors significantly impact overall performance, underscoring the need to evaluate generation and reconstruction jointly. We provide insights that can inform the selection of suitable 3D models for various applications, facilitating the development of more robust and application-specific solutions in 3D generation. The code for our framework is available at this https URL.

[CV-7] TeRA: Rethinking Text-driven Realistic 3D Avatar Generation ICCV2025

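[Quick Read]: This paper addresses the inefficiency of existing text-to-avatar generative models, their reliance on slow iterative optimization, and their difficulty supporting text-driven partial customization. The key to the solution is the TeRA framework, which uses a two-stage training strategy: it first distills a decoder from a large human reconstruction model to build a structured latent space, then trains a text-controlled latent diffusion model in that latent space, enabling efficient, high-quality 3D human avatar generation with support for text-based partial customization.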

Link: https://arxiv.org/abs/2509.02466
Authors: Yanwen Wang, Yiyu Zhuang, Jiawei Zhang, Li Wang, Yifei Zeng, Xun Cao, Xinxin Zuo, Hao Zhu
Affiliations: Nanjing University; Concordia University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV2025

Abstract:In this paper, we rethink text-to-avatar generative models by proposing TeRA, a more efficient and effective framework than the previous SDS-based models and general large 3D generative models. Our approach employs a two-stage training strategy for learning a native 3D avatar generative model. Initially, we distill a decoder to derive a structured latent space from a large human reconstruction model. Subsequently, a text-controlled latent diffusion model is trained to generate photorealistic 3D human avatars within this latent space. TeRA enhances the model performance by eliminating slow iterative optimization and enables text-based partial customization through a structured 3D human this http URL have proven our approach’s superiority over previous text-to-avatar generative models in subjective and objective evaluation.

[CV-8] GenCompositor: Generative Video Compositing with Diffusion Transformer

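[Quick Read]: This paper addresses the heavy manual effort and expert collaboration required by traditional video compositing pipelines, which lead to long production cycles and high costs. The core solution is an automated, generative-model-based approach called generative video compositing, built on a new Diffusion Transformer (DiT) architecture: a lightweight background preservation branch keeps the target video consistent before and after editing, a DiT fusion block inherits dynamic elements through full self-attention, and an Extended Rotary Position Embedding (ERoPE) fuses foreground and background under user-controlled layouts, supporting interactive customization of attributes such as the size and motion trajectory of the dynamic elements.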

Link: https://arxiv.org/abs/2509.02460
Authors: Shuzhou Yang, Xiaoyu Li, Xiaodong Cun, Guangzhi Wang, Lingen Li, Ying Shan, Jian Zhang
Affiliations: Peking University; Tencent; Great Bay University; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Video compositing combines live-action footage to create video production, serving as a crucial technique in video creation and film production. Traditional pipelines require intensive labor efforts and expert collaboration, resulting in lengthy production cycles and high manpower costs. To address this issue, we automate this process with generative models, called generative video compositing. This new task strives to adaptively inject identity and motion information of the foreground video into the target video in an interactive manner, allowing users to customize the size, motion trajectory, and other attributes of the dynamic elements added to the final video. Specifically, we designed a novel Diffusion Transformer (DiT) pipeline based on its intrinsic properties. To maintain consistency of the target video before and after editing, we revised a light-weight DiT-based background preservation branch with masked token injection. To inherit dynamic elements from other sources, we propose a DiT fusion block using full self-attention, along with a simple yet effective foreground augmentation for training. In addition, to fuse background and foreground videos with different layouts based on user control, we developed a novel position embedding, named Extended Rotary Position Embedding (ERoPE). Finally, we curated a dataset comprising 61K sets of videos for our new task, called VideoComp. This data includes complete dynamic elements and high-quality target videos. Experiments demonstrate that our method effectively realizes generative video compositing, outperforming existing possible solutions in fidelity and consistency.

[CV-9] RiverScope: High-Resolution River Masking Dataset

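[Quick Read]: This paper addresses the difficulty of monitoring surface water dynamics at fine spatial and temporal scales, in particular for narrow or sediment-rich rivers that low-resolution satellite data captures poorly. The key to the solution is RiverScope, a high-resolution dataset of 1,145 images with expert-labeled river and surface water masks (covering 2,577 km²), co-registered with Sentinel-2, SWOT, and the SWORD database to support cost-accuracy trade-off analysis across sensors. The study also evaluates deep learning models (such as CNNs and Transformers) and finds that the best ones combine transfer learning with learned adaptors over all multispectral PlanetScope channels. On the first global high-resolution benchmark for river width estimation, the approach reaches a median error of 7.2 meters, significantly outperforming existing satellite-derived methods, and provides a key resource for fine-scale hydrological modeling and sustainable water management.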

Link: https://arxiv.org/abs/2509.02451
Authors: Rangel Daroya, Taylor Rowley, Jonathan Flores, Elisa Friedmann, Fiona Bennitt, Heejin An, Travis Simmons, Marissa Jean Hughes, Camryn L Kluetmeier, Solomon Kica, J. Daniel Vélez, Sarah E. Esenther, Thomas E. Howard, Yanqi Ye, Audrey Turcotte, Colin Gleason, Subhransu Maji
Affiliations: University of California, Berkeley; University of Washington; Stanford University; Massachusetts Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Surface water dynamics play a critical role in Earth’s climate system, influencing ecosystems, agriculture, disaster resilience, and sustainable development. Yet monitoring rivers and surface water at fine spatial and temporal scales remains challenging – especially for narrow or sediment-rich rivers that are poorly captured by low-resolution satellite data. To address this, we introduce RiverScope, a high-resolution dataset developed through collaboration between computer science and hydrology experts. RiverScope comprises 1,145 high-resolution images (covering 2,577 square kilometers) with expert-labeled river and surface water masks, requiring over 100 hours of manual annotation. Each image is co-registered with Sentinel-2, SWOT, and the SWOT River Database (SWORD), enabling the evaluation of cost-accuracy trade-offs across sensors – a key consideration for operational water monitoring. We also establish the first global, high-resolution benchmark for river width estimation, achieving a median error of 7.2 meters – significantly outperforming existing satellite-derived methods. We extensively evaluate deep networks across multiple architectures (e.g., CNNs and transformers), pretraining strategies (e.g., supervised and self-supervised), and training datasets (e.g., ImageNet and satellite imagery). Our best-performing models combine the benefits of transfer learning with the use of all the multispectral PlanetScope channels via learned adaptors. RiverScope provides a valuable resource for fine-scale and multi-sensor hydrological modeling, supporting climate adaptation and sustainable water management.

[CV-10] Towards High-Fidelity Identity-Preserving Real-Time Makeup Transfer: Decoupling Style Generation

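[Quick Read]: This paper addresses two core challenges in real-time virtual makeup try-on. First, existing methods struggle to disentangle semi-transparent cosmetics from skin tone and other identity features, which yields inaccurate makeup synthesis and identity shifts and raises fairness concerns. Second, they lack real-time capability and temporal consistency, which limits practical use. The key to the solution is a two-stage framework: a transparent makeup mask extraction model is first trained on pseudo-ground-truth data generated by a graphics-based rendering pipeline and unsupervised k-means clustering, and graphics-based mask rendering then performs efficient real-time synthesis. To further improve transparency estimation and color fidelity, the paper designs specialized training objectives, including an alpha-weighted reconstruction loss and a lip color loss, achieving high-fidelity, identity-preserving, temporally stable makeup transfer across diverse poses, expressions, and skin tones.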

Link: https://arxiv.org/abs/2509.02445
Authors: Lydia Kin Ching Chau, Zhi Yu, Ruo Wei Jiang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present a novel framework for real-time virtual makeup try-on that achieves high-fidelity, identity-preserving cosmetic transfer with robust temporal consistency. In live makeup transfer applications, it is critical to synthesize temporally coherent results that accurately replicate fine-grained makeup and preserve the user’s identity. However, existing methods often struggle to disentangle semitransparent cosmetics from skin tones and other identity features, causing identity shifts and raising fairness concerns. Furthermore, current methods lack real-time capabilities and fail to maintain temporal consistency, limiting practical adoption. To address these challenges, we decouple makeup transfer into two steps: transparent makeup mask extraction and graphics-based mask rendering. After the makeup extraction step, the makeup rendering can be performed in real time, enabling live makeup try-on. Our makeup extraction model is trained on pseudo-ground-truth data generated via two complementary methods: a graphics-based rendering pipeline and an unsupervised k-means clustering approach. To further enhance transparency estimation and color fidelity, we propose specialized training objectives, including alpha-weighted reconstruction and lip color losses. Our method achieves robust makeup transfer across diverse poses, expressions, and skin tones while preserving temporal smoothness. Extensive experiments demonstrate that our approach outperforms existing baselines in capturing fine details, maintaining temporal stability, and preserving identity integrity.
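
The rendering half of such a decoupled pipeline can be illustrated with standard alpha compositing: once a per-pixel makeup color and opacity are available, each output pixel is a blend of the original skin color and the makeup color. The sketch below assumes RGB components in [0, 1] and a hypothetical `render_makeup` helper; it is not the paper's renderer.

```python
def render_makeup(skin, makeup, alpha):
    """Blend a makeup layer over skin pixels with per-pixel opacity.

    skin, makeup: lists of (r, g, b) tuples with components in [0, 1].
    alpha: list of opacities in [0, 1], one per pixel.
    """
    out = []
    for s, m, a in zip(skin, makeup, alpha):
        out.append(tuple((1.0 - a) * sc + a * mc for sc, mc in zip(s, m)))
    return out


skin = [(0.8, 0.6, 0.5), (0.8, 0.6, 0.5), (0.8, 0.6, 0.5)]
lipstick = [(0.7, 0.1, 0.2)] * 3
alpha = [0.0, 0.5, 1.0]  # untouched, semi-transparent, fully opaque
rendered = render_makeup(skin, lipstick, alpha)
print(rendered)
```

Because the blend is a cheap per-pixel operation, it is plausible to run it on every frame once the mask has been extracted, which is consistent with the abstract's point that rendering can happen in real time after extraction.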

[CV-11] Efficient Pyramidal Analysis of Gigapixel Images on a Decentralized Modest Computer Cluster

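[Quick Read]: This paper addresses the high computational cost of analyzing high-resolution images such as gigapixel images. The core solution is PyramidAI, which processes images efficiently through a gradual, hierarchical analysis: the whole image is first screened at low resolution to identify regions of interest (ROIs), and only those regions are then analyzed in detail at higher resolution. The key is an adaptive resolution selection mechanism that reduces the amount of data to process by up to 2.65x while preserving detection accuracy, combined with parallelism and load balancing that cut analysis time on a cluster of modest computers from more than an hour to a few minutes, offering a scalable, low-cost path to large-scale image analysis.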

Link: https://arxiv.org/abs/2509.02440
Authors: Marie Reinbigler, Rishi Sharma, Rafael Pires, Elisabeth Brunet, Anne-Marie Kermarrec, Catalin Fetita
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the 31st International European Conference on Parallel and Distributed Computing (Euro-Par’25)

Abstract:Analyzing gigapixel images is recognized as computationally demanding. In this paper, we introduce PyramidAI, a technique for analyzing gigapixel images with reduced computational cost. The proposed approach adopts a gradual analysis of the image, beginning with lower resolutions and progressively concentrating on regions of interest for detailed examination at higher resolutions. We investigated two strategies for tuning the accuracy-computation performance trade-off when implementing the adaptive resolution selection, validated against the Camelyon16 dataset of biomedical images. Our results demonstrate that PyramidAI substantially decreases the amount of processed data required for analysis by up to 2.65x, while preserving the accuracy in identifying relevant sections on a single computer. To ensure democratization of gigapixel image analysis, we evaluated the potential to use mainstream computers to perform the computation by exploiting the parallelism potential of the approach. Using a simulator, we estimated the best data distribution and load balancing algorithm according to the number of workers. The selected algorithms were implemented and highlighted the same conclusions in a real-world setting. Analysis time is reduced from more than an hour to a few minutes using 12 modest workers, offering a practical solution for efficient large-scale image analysis.
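
The coarse-to-fine idea can be sketched on a toy grid of per-cell scores. In the sketch below, max-pooling stands in for running a detector on a downsampled image (an assumption for illustration, since a real pipeline would read a genuinely lower-resolution pyramid level), and the names `downsample` and `pyramid_scan` are hypothetical rather than taken from PyramidAI.

```python
def downsample(grid, factor):
    """Max-pool a square 2D grid by `factor` (grid side must be divisible)."""
    n = len(grid) // factor
    return [
        [
            max(
                grid[r * factor + dr][c * factor + dc]
                for dr in range(factor)
                for dc in range(factor)
            )
            for c in range(n)
        ]
        for r in range(n)
    ]


def pyramid_scan(grid, factor, threshold):
    """Scan a coarse level first, then visit only flagged high-res blocks.

    Returns the high-res cells at or above `threshold` and the number of
    cells (coarse plus fine) that were actually inspected.
    """
    coarse = downsample(grid, factor)
    hits, inspected = [], 0
    for r, row in enumerate(coarse):
        for c, score in enumerate(row):
            inspected += 1
            if score < threshold:
                continue  # region looks uninteresting at low resolution
            for dr in range(factor):
                for dc in range(factor):
                    rr, cc = r * factor + dr, c * factor + dc
                    inspected += 1
                    if grid[rr][cc] >= threshold:
                        hits.append((rr, cc))
    return hits, inspected


# Toy 8x8 "image" of per-cell scores with a single region of interest.
grid = [[0.0] * 8 for _ in range(8)]
grid[5][6] = 0.9
hits, inspected = pyramid_scan(grid, factor=4, threshold=0.5)
print(hits, inspected, "of", 8 * 8)
```

In this toy case only one of the four coarse blocks is expanded, so far fewer cells are inspected than in an exhaustive high-resolution pass; the threshold controls the trade-off between accuracy and computation.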

[CV-12] Faster and Better: Reinforced Collaborative Distillation and Self-Learning for Infrared-Visible Image Fusion

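[Quick Read]: This paper addresses how to achieve high-quality infrared and visible image fusion with lightweight models. Existing methods struggle to improve fusion quality while keeping models efficient, particularly in preserving detail and complementary information in complex scenes. The key to the solution is a collaborative distillation and self-learning framework driven by reinforcement learning: a reinforcement learning agent dynamically identifies more challenging training samples to promote the student model's self-learning, and adjusts the strength of the teacher's guidance according to the student's state to optimize knowledge transfer, significantly improving fusion quality and model performance.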

Link: https://arxiv.org/abs/2509.02424
Authors: Yuhao Wang, Lingjuan Miao, Zhiqiang Zhou, Yajun Qiao, Lei Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Infrared and visible image fusion plays a critical role in enhancing scene perception by combining complementary information from different modalities. Despite recent advances, achieving high-quality image fusion with lightweight models remains a significant challenge. To bridge this gap, we propose a novel collaborative distillation and self-learning framework for image fusion driven by reinforcement learning. Unlike conventional distillation, this approach not only enables the student model to absorb image fusion knowledge from the teacher model, but more importantly, allows the student to perform self-learning on more challenging samples to enhance its capabilities. Particularly, in our framework, a reinforcement learning agent explores and identifies a more suitable training strategy for the student. The agent takes both the student’s performance and the teacher-student gap as inputs, which leads to the generation of challenging samples to facilitate the student’s self-learning. Simultaneously, it dynamically adjusts the teacher’s guidance strength based on the student’s state to optimize the knowledge transfer. Experimental results demonstrate that our method can significantly improve student performance and achieve better fusion results compared to existing techniques.

[CV-13] From Noisy Labels to Intrinsic Structure: A Geometric-Structural Dual-Guided Framework for Noise-Robust Medical Image Segmentation

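[Quick Read]: This paper addresses the performance degradation of convolutional neural networks in medical image segmentation, which depend on large-scale annotations whose quality is uneven (for example, subjective bias and coarse boundaries). The core solution is a Geometric-Structural Dual-Guided Network (GSD-Net), which improves robustness to noisy annotations through two modules: a Geometric Distance-Aware module that uses geometric features to dynamically adjust pixel-level weights, strengthening supervision in reliable regions while suppressing noise, and a Structure-Guided Label Refinement module that refines labels with structural priors. In addition, a Knowledge Transfer module enriches supervision and improves sensitivity to local details.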

Link: https://arxiv.org/abs/2509.02419
Authors: Tao Wang, Zhenxuan Zhang, Yuanbo Zhou, Xinlin Zhang, Yuanbin Chen, Tao Tan, Guang Yang, Tong Tong
Affiliations: Fuzhou University; Imperial College London; King’s College London; Macao Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The effectiveness of convolutional neural networks in medical image segmentation relies on large-scale, high-quality annotations, which are costly and time-consuming to obtain. Even expert-labeled datasets inevitably contain noise arising from subjectivity and coarse delineations, which disrupt feature learning and adversely impact model performance. To address these challenges, this study proposes a Geometric-Structural Dual-Guided Network (GSD-Net), which integrates geometric and structural cues to improve robustness against noisy annotations. It incorporates a Geometric Distance-Aware module that dynamically adjusts pixel-level weights using geometric features, thereby strengthening supervision in reliable regions while suppressing noise. A Structure-Guided Label Refinement module further refines labels with structural priors, and a Knowledge Transfer module enriches supervision and improves sensitivity to local details. To comprehensively assess its effectiveness, we evaluated GSD-Net on six publicly available datasets: four containing three types of simulated label noise, and two with multi-expert annotations that reflect real-world subjectivity and labeling inconsistencies. Experimental results demonstrate that GSD-Net achieves state-of-the-art performance under noisy annotations, with improvements of 2.52% on Kvasir, 22.76% on Shenzhen, 8.87% on BU-SUC, and 4.59% on BraTS2020 under SR simulated noise. The codes of this study are available at this https URL.
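
One simple way to picture distance-aware pixel weighting is to trust pixels less the closer they sit to the annotated boundary. The sketch below is only an illustration under that assumption (that label noise concentrates near boundaries); the weighting function, the Manhattan distance, and the names `boundary_distance` and `pixel_weights` are hypothetical and are not taken from GSD-Net.

```python
import math


def boundary_distance(mask):
    """Manhattan distance from each pixel to the nearest pixel of the other label."""
    h, w = len(mask), len(mask[0])
    coords = [(r, c) for r in range(h) for c in range(w)]
    dist = [[0] * w for _ in range(h)]
    for r, c in coords:
        others = [
            abs(r - r2) + abs(c - c2)
            for r2, c2 in coords
            if mask[r2][c2] != mask[r][c]
        ]
        # If the mask has a single label, treat every pixel as far from a boundary.
        dist[r][c] = min(others) if others else h + w
    return dist


def pixel_weights(mask, tau=1.0):
    """Map boundary distance d to a loss weight 1 - exp(-d / tau).

    Assumption for illustration: label noise concentrates near the annotated
    boundary, so pixels close to it are trusted less.
    """
    return [[1.0 - math.exp(-d / tau) for d in row] for row in boundary_distance(mask)]


mask = [[0, 0, 1, 1, 1]]  # toy one-row binary annotation
weights = pixel_weights(mask)
print([round(v, 3) for v in weights[0]])
```

Such weights could multiply a per-pixel loss so that pixels far from the boundary dominate training; the brute-force distance computation is only suitable for tiny masks like this one.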

[CV-14] Decoupling Bidirectional Geometric Representations of 4D cost volume with 2D convolution

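[Quick Read]: This paper addresses two problems in high-accuracy real-time stereo matching: methods that rely on 3D regularization of the cost volume are unfriendly to mobile devices, while methods based on 2D regularization perform poorly in ill-posed regions. The key to the solution is DBStereo, a deployment-friendly 4D cost aggregation network built entirely from 2D convolutions. Based on an analysis of the decoupling characteristics of the 4D cost volume, it uses a lightweight bidirectional geometry aggregation block to capture spatial and disparity representations separately. This decoupled learning improves accuracy while retaining real-time performance, breaking with the empirical practice of processing 4D cost volumes with 3D convolutions and providing a simple yet strong baseline for the decoupled aggregation paradigm.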

Link: https://arxiv.org/abs/2509.02415
Authors: Xiaobao Wei, Changyong Shu, Zhaokun Yue, Chang Huang, Weiwei Liu, Shuai Yang, Lirong Yang, Peng Gao, Wenbin Zhang, Gaochao Zhu, Chengxiang Wang
Affiliations: Nanjing University of Science and Technology; Carizon
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:High-performance real-time stereo matching methods invariably rely on 3D regularization of the cost volume, which is unfriendly to mobile devices, while 2D regularization based methods struggle in ill-posed regions. In this paper, we present a deployment-friendly 4D cost aggregation network DBStereo, which is based on pure 2D convolutions. Specifically, we first provide a thorough analysis of the decoupling characteristics of the 4D cost volume. We then design a lightweight bidirectional geometry aggregation block to capture spatial and disparity representations respectively. Through decoupled learning, our approach achieves real-time performance and impressive accuracy simultaneously. Extensive experiments demonstrate that our proposed DBStereo outperforms all existing aggregation-based methods in both inference time and accuracy, even surpassing the iterative-based method IGEV-Stereo. Our study breaks the empirical design of using 3D convolutions for 4D cost volumes and provides a simple yet strong baseline of the proposed decoupled aggregation paradigm for further study. Code will be available at this https URL soon.

[CV-15] MedDINOv3: How to adapt vision foundation models for medical image segmentation?

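[Quick Read]: This paper addresses the limited generalization of current deep learning models for medical image segmentation: most are task-specific and transfer poorly across imaging modalities (such as CT and MRI) and institutions. The key to the solution is the MedDINOv3 framework, which has two core steps. First, it revisits the plain Vision Transformer (ViT) and adds multi-scale token aggregation to strengthen feature representation for medical images. Second, it performs multi-stage domain-adaptive pretraining on CT-3M, a curated dataset of 3.87 million axial CT slices, narrowing the domain gap between natural and medical images so that DINOv3 can serve as a unified backbone that matches or exceeds state-of-the-art performance on multiple medical image segmentation benchmarks.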

Link: https://arxiv.org/abs/2509.02379
Authors: Yuheng Li, Yizhou Wu, Yuxiang Lai, Mingzhe Hu, Xiaofeng Yang
Affiliations: Georgia Institute of Technology; Emory University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate segmentation of organs and tumors in CT and MRI scans is essential for diagnosis, treatment planning, and disease monitoring. While deep learning has advanced automated segmentation, most models remain task-specific, lacking generalizability across modalities and institutions. Vision foundation models (FMs) pretrained on billion-scale natural images offer powerful and transferable representations. However, adapting them to medical imaging faces two key challenges: (1) the ViT backbone of most foundation models still underperforms specialized CNNs on medical image segmentation, and (2) the large domain gap between natural and medical images limits transferability. We introduce MedDINOv3, a simple and effective framework for adapting DINOv3 to medical segmentation. We first revisit plain ViTs and design a simple and effective architecture with multi-scale token aggregation. Then, we perform domain-adaptive pretraining on CT-3M, a curated collection of 3.87M axial CT slices, using a multi-stage DINOv3 recipe to learn robust dense features. MedDINOv3 matches or exceeds state-of-the-art performance across four segmentation benchmarks, demonstrating the potential of vision foundation models as unified backbones for medical image segmentation. The code is available at this https URL.

[CV-16] Why Do MLLMs Struggle with Spatial Understanding? A Systematic Analysis from Data to Architecture

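【Quick Read】: This paper addresses the weak spatial understanding of Multimodal Large Language Models (MLLMs) in embodied environments, focusing on systematic limitations in spatial reasoning across single-view, multi-view, and video scenarios. The key to its approach is a benchmark named MulSeT (Multi-view Spatial Understanding Tasks) together with an analysis along two axes, data and architecture. On the data side, simply adding training data does not lift the performance ceiling, especially for tasks that require spatial imagination. On the architecture side, spatial understanding depends more on the positional encoding in the visual encoder than on that in the language model, which motivates exploring reasoning injection and architectural design as routes to better spatial reasoning.

Link: https://arxiv.org/abs/2509.02359
Authors: Wanyue Zhang, Yibin Huang, Yangbin Xu, JingJing Huang, Helu Zhi, Shuo Ren, Wang Xu, Jiajun Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The benchmark MulSeT is available at this https URL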

Abstract:Spatial understanding is essential for Multimodal Large Language Models (MLLMs) to support perception, reasoning, and planning in embodied environments. Despite recent progress, existing studies reveal that MLLMs still struggle with spatial understanding. However, existing research lacks a comprehensive and systematic evaluation of these limitations, often restricted to isolated scenarios, such as single-view or video. In this work, we present a systematic analysis of spatial understanding from both data and architectural perspectives across three representative scenarios: single-view, multi-view, and video. We propose a benchmark named MulSeT (Multi-view Spatial Understanding Tasks), and design a series of experiments to analyze the spatial reasoning capabilities of MLLMs. From the data perspective, the performance of spatial understanding converges quickly as the training data increases, and the upper bound is relatively low, especially for tasks that require spatial imagination. This indicates that merely expanding training data is insufficient to achieve satisfactory performance. From the architectural perspective, we find that spatial understanding relies more heavily on the positional encoding within the visual encoder than within the language model, in both cascaded and native MLLMs. Moreover, we explore reasoning injection and envision future improvements through architectural design to optimize spatial understanding. These insights shed light on the limitations of current MLLMs and suggest new directions for improving spatial reasoning capabilities through data scaling and architectural tuning.

[CV-17] Category-Aware 3D Object Composition with Disentangled Texture and Shape Multi-view Diffusion

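【Quick Read】: This paper tackles the difficulty of fusing multiple content sources in 3D object synthesis: when generating a new 3D model that combines different object categories, existing text/image/3D-to-3D methods often produce inconsistent textures and inaccurate shapes. The key to its solution is a new method called category+3D-to-3D (C33D). It first renders multi-view images and normal maps from the input 3D model, then uses adaptive text-image harmony (ATIH), conditioned on the front-view image and a text description of another category, to generate a novel 2D object. Texture multi-view diffusion then refines the textures of the remaining multi-view RGB images, and shape multi-view diffusion improves the 2D shapes of both the multi-view RGB images and the normal maps. The refined views are finally used to reconstruct a structurally coherent and visually consistent new 3D model.

Link: https://arxiv.org/abs/2509.02357
Authors: Zeren Xiong, Zikun Chen, Zedong Zhang, Xiang Li, Ying Tai, Jian Yang, Jun Li
Affiliations: Nanjing University of Science and Technology; Nankai University; Nanjing University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ACM Multimedia 2025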

Abstract:In this paper, we tackle a new task of 3D object synthesis, where a 3D model is composited with another object category to create a novel 3D model. However, most existing text/image/3D-to-3D methods struggle to effectively integrate multiple content sources, often resulting in inconsistent textures and inaccurate shapes. To overcome these challenges, we propose a straightforward yet powerful approach, category+3D-to-3D (C33D), for generating novel and structurally coherent 3D models. Our method begins by rendering multi-view images and normal maps from the input 3D model, then generating a novel 2D object using adaptive text-image harmony (ATIH) with the front-view image and a text description from another object category as inputs. To ensure texture consistency, we introduce texture multi-view diffusion, which refines the textures of the remaining multi-view RGB images based on the novel 2D object. For enhanced shape accuracy, we propose shape multi-view diffusion to improve the 2D shapes of both the multi-view RGB images and the normal maps, also conditioned on the novel 2D object. Finally, these outputs are used to reconstruct a complete and novel 3D model. Extensive experiments demonstrate the effectiveness of our method, yielding impressive 3D creations, such as shark(3D)-crocodile(text) in the first row of Fig. 1. A project page is available at: this https URL

[CV-18] Ordinal Adaptive Correction: A Data-Centric Approach to Ordinal Image Classification with Noisy Labels

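【Quick Read】: This paper addresses the significant harm that label noise does to model performance and reliability in ordinal image classification. The key to its solution is a data-centric adaptive label correction method, ORDAC (ORDinal Adaptive Correction), which uses Label Distribution Learning (LDL) to model the inherent ambiguity and uncertainty of ordinal labels. During training it dynamically adjusts the mean and standard deviation of each sample's label distribution, so noisy labels are corrected rather than discarded and the full training set is used. Experiments on age estimation (Adience) and disease severity detection (Diabetic Retinopathy) show clear gains in accuracy and robustness.

Link: https://arxiv.org/abs/2509.02351
Authors: Alireza Sedighi Moghaddam, Mohammad Reza Mohammadi
Affiliations: Iran University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures, 5 tables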

Abstract:Labeled data is a fundamental component in training supervised deep learning models for computer vision tasks. However, the labeling process, especially for ordinal image classification where class boundaries are often ambiguous, is prone to error and noise. Such label noise can significantly degrade the performance and reliability of machine learning models. This paper addresses the problem of detecting and correcting label noise in ordinal image classification tasks. To this end, a novel data-centric method called ORDinal Adaptive Correction (ORDAC) is proposed for adaptive correction of noisy labels. The proposed approach leverages the capabilities of Label Distribution Learning (LDL) to model the inherent ambiguity and uncertainty present in ordinal labels. During training, ORDAC dynamically adjusts the mean and standard deviation of the label distribution for each sample. Rather than discarding potentially noisy samples, this approach aims to correct them and make optimal use of the entire training dataset. The effectiveness of the proposed method is evaluated on benchmark datasets for age estimation (Adience) and disease severity detection (Diabetic Retinopathy) under various asymmetric Gaussian noise scenarios. Results show that ORDAC and its extended versions (ORDAC_C and ORDAC_R) lead to significant improvements in model performance. For instance, on the Adience dataset with 40% noise, ORDAC_R reduced the mean absolute error from 0.86 to 0.62 and increased the recall metric from 0.37 to 0.49. The method also demonstrated its effectiveness in correcting intrinsic noise present in the original datasets. This research indicates that adaptive label correction using label distributions is an effective strategy to enhance the robustness and accuracy of ordinal classification models in the presence of noisy data.
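
The label-distribution idea in the abstract can be illustrated with a small sketch. It assumes each ordinal label is represented as a discretised Gaussian over class indices, parameterised by a mean and a standard deviation. The function names are hypothetical, and the abstract does not give ORDAC's rule for updating the mean and standard deviation, so that rule is not reproduced here.

```python
import math


def ordinal_label_distribution(mean, std, num_classes):
    """Discretised Gaussian over class indices 0..num_classes-1, normalised to sum to 1.

    Assumption for illustration: the abstract only says a label distribution
    with a mean and standard deviation is used, not its exact form.
    """
    weights = [math.exp(-0.5 * ((k - mean) / std) ** 2) for k in range(num_classes)]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted), a loss commonly used in label distribution learning."""
    return sum(t * math.log((t + eps) / (p + eps)) for t, p in zip(target, predicted))


# A sample labelled as class 3 out of 8, with moderate uncertainty.
soft_label = ordinal_label_distribution(mean=3.0, std=1.0, num_classes=8)
# Shifting the mean models a corrected label; widening std models more ambiguity.
corrected = ordinal_label_distribution(mean=4.0, std=1.5, num_classes=8)
print(kl_divergence(soft_label, corrected))
```

Neighbouring classes receive non-zero probability, so a prediction one class away from the label is penalised less than a distant one, which is the property that makes this representation a natural fit for ordinal data.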

[CV-19] OmniActor: A Generalist GUI and Embodied Agent for 2D&3D Worlds

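【Quick Read】: This paper addresses the performance drop that multimodal large language models suffer when trained on GUI (graphical user interface) and embodied tasks together, caused by conflict between the two kinds of data. Existing work usually targets a single environment type, whereas complex tasks often require interleaved interaction with 2D virtual worlds (GUI) and 3D real worlds (embodied). The authors propose a generalist agent, OmniActor, whose core structural idea is a Layer-heterogeneity MoE: shallow-layer parameters are shared to exploit the synergy between the two data types, while deep-layer parameters are separated to remove their conflict. The paper also unifies the action spaces of GUI and embodied tasks and trains on large-scale data collected from multiple sources, which improves performance across scenarios, especially on GUI tasks.

Link: https://arxiv.org/abs/2509.02322
Authors: Longrong Yang, Zhixiong Zeng, Yufeng Zhong, Jing Huang, Liming Zheng, Lei Chen, Haibo Qiu, Zequn Qin, Lin Ma, Xi Li
Affiliations: Meituan; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: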

Abstract:Multimodal large language models are evolving toward multimodal agents capable of proactively executing tasks. Most agent research focuses on GUI or embodied scenarios, which correspond to agents interacting with 2D virtual worlds or 3D real worlds, respectively. However, many complex tasks require agents to interact with these two types of environment in an interleaved manner. We initially mix GUI and embodied data for training, but observe performance degradation caused by data conflict. Further analysis reveals that GUI and embodied data exhibit synergy and conflict at the shallow and deep layers, respectively, which resembles the cerebrum-cerebellum mechanism in the human brain. To this end, we propose a high-performance generalist agent, OmniActor, designed from both structural and data perspectives. First, we propose Layer-heterogeneity MoE to eliminate the conflict between GUI and embodied data by separating deep-layer parameters, while leveraging their synergy by sharing shallow-layer parameters. By successfully leveraging the synergy and eliminating the conflict, OmniActor outperforms agents trained only on GUI or embodied data in GUI or embodied tasks. Furthermore, we unify the action spaces of GUI and embodied tasks, and collect large-scale GUI and embodied data from various sources for training. This significantly improves OmniActor under different scenarios, especially in GUI tasks. The code will be publicly available.

[CV-20] Hues and Cues: Human vs. CLIP

【Quick Read】: This paper addresses a gap in how the human-likeness of AI models is evaluated: games and similar tasks that probe human cognitive traits, such as color perception and color naming, are usually left out, so some deficiencies go unnoticed. The key to its approach is a new evaluation method based on board games. Using the game Hues and Cues as a testbed, it measures CLIP's color perception and naming and compares them with human observers, revealing cultural biases and inconsistencies across abstraction levels that are hard to detect with conventional benchmarks.

Link: https://arxiv.org/abs/2509.02305
Authors: Nuria Alabau-Bosque, Jorge Vila-Tomás, Paula Daudén-Oliver, Pablo Hernández-Cámara, Jose Manuel Jaén-Lorites, Valero Laparra, Jesús Malo
Affiliations: Image Processing Lab, Universidad de Valencia, Paterna, Spain; Center for Biomaterials and Tissue Engineering, Universitat Politecnica de Valencia, Valencia, Spain
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 2 pages, 2 figures. 8th annual conference on Cognitive Computational Neuroscience

Abstract:Playing games is inherently human, and a lot of games are created to challenge different human characteristics. However, these tasks are often left out when evaluating the human-like nature of artificial models. The objective of this work is to propose a new approach to evaluating artificial models via board games. To this end, we test the color perception and color naming capabilities of CLIP by playing the board game Hues and Cues and assess its alignment with humans. Our experiments show that CLIP is generally well aligned with human observers, but our approach brings to light certain cultural biases and inconsistencies when dealing with different abstraction levels that are hard to identify with other testing strategies. Our findings indicate that assessing models with different tasks like board games can make certain deficiencies in the models stand out in ways that are difficult to test with the commonly used benchmarks.

[CV-21] Data-Driven Loss Functions for Inference-Time Optimization in Text-to-Image Generation

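【Quick Read】: This paper addresses the weak spatial reasoning of text-to-image diffusion models, which often fail to realize simple spatial relations such as "a dog to the right of a teddy bear" or "a giraffe above an airplane". Existing methods rely on fine-tuning or on test-time optimization with handcrafted losses, with limited success. The key to its solution is Learn-to-Steer, a framework that learns a data-driven objective from the diffusion model's cross-attention maps instead of handcrafting a loss. A lightweight classifier is trained to decode spatial relations and is then used as the loss at inference; a dual-inversion strategy keeps the classifier from taking shortcuts through linguistic traces, forcing it to learn genuine geometric patterns. The method substantially improves spatial accuracy on several benchmarks.

Link: https://arxiv.org/abs/2509.02295
Authors: Sapir Esther Yiflach, Yuval Atzmon, Gal Chechik
Affiliations: Bar-Ilan University; NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page is at this https URL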

Abstract:Text-to-image diffusion models can generate stunning visuals, yet they often fail at tasks children find trivial, like placing a dog to the right of a teddy bear rather than to the left. When combinations get more unusual (a giraffe above an airplane), these failures become even more pronounced. Existing methods attempt to fix these spatial reasoning failures through model fine-tuning or test-time optimization with handcrafted losses that are suboptimal. Rather than imposing our assumptions about spatial encoding, we propose learning these objectives directly from the model’s internal representations. We introduce Learn-to-Steer, a novel framework that learns data-driven objectives for test-time optimization rather than handcrafting them. Our key insight is to train a lightweight classifier that decodes spatial relationships from the diffusion model’s cross-attention maps, then deploy this classifier as a learned loss function during inference. Training such classifiers poses a surprising challenge: they can take shortcuts by detecting linguistic traces rather than learning true spatial patterns. We solve this with a dual-inversion strategy that enforces geometric understanding. Our method dramatically improves spatial accuracy: from 0.20 to 0.61 on FLUX.1-dev and from 0.07 to 0.54 on SD2.1 across standard benchmarks. Moreover, our approach generalizes to multiple relations and significantly improves accuracy.

[CV-22] SynthGenNet: a self-supervised approach for test-time generalization using synthetic multi-source domain mixing of street view images

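【Quick Read】: This paper addresses scene understanding and domain generalization in complex urban environments, in particular how to improve adaptation to real-world scenes without labeled target-domain data. Its core solution is SynthGenNet, a self-supervised student-teacher architecture with three key components: (1) the ClassMix++ algorithm, which blends labeled data from multiple synthetic sources while preserving semantic integrity; (2) a Grounded Mask Consistency Loss (GMC), which uses source ground truth to improve cross-domain prediction consistency and feature alignment; and (3) Pseudo-Label Guided Contrastive Learning (PLGCL), which promotes domain-invariant features through iterative knowledge distillation. The method narrows the sim-to-real gap and reaches 50% mIoU on the Indian Driving Dataset (IDD), outperforming state-of-the-art methods that rely on a single source.

Link: https://arxiv.org/abs/2509.02287
Authors: Pushpendra Dhakara, Prachi Chachodhia, Vaibhav Kumar
Affiliations: IISER Bhopal
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: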

Abstract:Unstructured urban environments present unique challenges for scene understanding and generalization due to their complex and diverse layouts. We introduce SynthGenNet, a self-supervised student-teacher architecture designed to enable robust test-time domain generalization using synthetic multi-source imagery. Our contributions include the novel ClassMix++ algorithm, which blends labeled data from various synthetic sources while maintaining semantic integrity, enhancing model adaptability. We further employ Grounded Mask Consistency Loss (GMC), which leverages source ground truth to improve cross-domain prediction consistency and feature alignment. The Pseudo-Label Guided Contrastive Learning (PLGCL) mechanism is integrated into the student network to facilitate domain-invariant feature learning through iterative knowledge distillation from the teacher network. This self-supervised strategy improves prediction accuracy, addresses real-world variability, bridges the sim-to-real domain gap, and reduces reliance on labeled target data, even in complex urban areas. Results show that our model outperforms the state of the art (which relies on a single source), achieving 50% mean Intersection-over-Union (mIoU) on real-world datasets such as the Indian Driving Dataset (IDD).

[CV-23] RS-OOD: A Vision-Language Augmented Framework for Out-of-Distribution Detection in Remote Sensing

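【Quick Read】: This paper addresses out-of-distribution (OOD) detection in remote sensing imagery, where data scarcity, complex multi-scale scene structure, and pronounced distribution shift make existing methods a poor fit. The key to its solution is the RS-OOD framework, with three core components: spatial feature enhancement to improve scene discrimination, a dual-prompt alignment mechanism that checks spatial-semantic consistency, and a confidence-guided self-training loop that dynamically mines pseudo-labels to expand the training data, enabling robust few-shot OOD detection.

Link: https://arxiv.org/abs/2509.02273
Authors: Yingrui Ji, Jiansheng Chen, Jingbo Chen, Anzhi Yue, Chenhao Wang, Kai Li, Yao Zhu
Affiliations: Aerospace Information Research Institute, Chinese Academy of Sciences; School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: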

Abstract:Out-of-distribution (OOD) detection represents a critical challenge in remote sensing applications, where reliable identification of novel or anomalous patterns is essential for autonomous monitoring, disaster response, and environmental assessment. Despite remarkable progress in OOD detection for natural images, existing methods and benchmarks remain poorly suited to remote sensing imagery due to data scarcity, complex multi-scale scene structures, and pronounced distribution shifts. To this end, we propose RS-OOD, a novel framework that leverages remote sensing-specific vision-language modeling to enable robust few-shot OOD detection. Our approach introduces three key innovations: spatial feature enhancement that improves scene discrimination, a dual-prompt alignment mechanism that cross-verifies scene context against fine-grained semantics for spatial-semantic consistency, and a confidence-guided self-training loop that dynamically mines pseudo-labels to expand training data without manual annotation. RS-OOD consistently outperforms existing methods across multiple remote sensing benchmarks and enables efficient adaptation with minimal labeled data, demonstrating the critical value of spatial-semantic integration.
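
The confidence-guided pseudo-label mining step can be sketched in its simplest generic form: keep only the unlabeled samples whose top predicted probability clears a threshold. This is an illustration of the general self-training pattern, not the RS-OOD implementation; the function name, the data layout, and the threshold value of 0.9 are all assumptions.

```python
def mine_pseudo_labels(predictions, threshold=0.9):
    """Select confident predictions as pseudo-labels.

    predictions: list of (sample_id, {label: probability}) pairs.
    Returns a list of (sample_id, label) for samples whose top
    probability is at least `threshold` (a hypothetical setting).
    """
    selected = []
    for sample_id, probs in predictions:
        label, confidence = max(probs.items(), key=lambda kv: kv[1])
        if confidence >= threshold:
            selected.append((sample_id, label))
    return selected


unlabeled_predictions = [
    ("tile_a", {"forest": 0.95, "urban": 0.05}),
    ("tile_b", {"forest": 0.60, "urban": 0.40}),  # too uncertain, skipped
    ("tile_c", {"forest": 0.08, "urban": 0.92}),
]
print(mine_pseudo_labels(unlabeled_predictions))
```

In a self-training loop, the selected pairs would be added to the training set and the model retrained, with the selection repeated on the remaining unlabeled data.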

[CV-24] DSGC-Net: A Dual-Stream Graph Convolutional Network for Crowd Counting via Feature Correlation Mining

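【Quick Read】: This paper addresses two weaknesses of deep-learning crowd counting models in complex scenes: poor adaptation to regions with markedly different density distributions, and inconsistent representations of individuals caused by viewpoint changes and body posture differences, both of which limit counting accuracy. The key to its solution is a Dual-Stream Graph Convolutional Network (DSGC-Net) with a Density Approximation (DA) branch and a Representation Approximation (RA) branch. The DA branch builds a density-driven semantic graph from density similarity, the RA branch builds a representation-driven semantic graph from global representation similarity, and graph convolutional networks model the latent semantic relationships in each graph, improving adaptation to density variation and accuracy in multi-view, multi-pose scenes.

Link: https://arxiv.org/abs/2509.02261
Authors: Yihong Wu, Jinqiao Wei, Xionghui Zhao, Yidi Li, Shaoyi Du, Bin Ren, Nicu Sebe
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by PRCV 2025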

Abstract:Deep learning-based crowd counting methods have achieved remarkable progress in recent years. However, in complex crowd scenarios, existing models still face challenges when adapting to significant density distribution differences between regions. Additionally, the inconsistency of individual representations caused by viewpoint changes and body posture differences further limits the counting accuracy of the models. To address these challenges, we propose DSGC-Net, a Dual-Stream Graph Convolutional Network based on feature correlation mining. DSGC-Net introduces a Density Approximation (DA) branch and a Representation Approximation (RA) branch. By modeling two semantic graphs, it captures the potential feature correlations in density variations and representation distributions. The DA branch incorporates a density prediction module that generates the density distribution map, and constructs a density-driven semantic graph based on density similarity. The RA branch establishes a representation-driven semantic graph by computing global representation similarity. Then, graph convolutional networks are applied to the two semantic graphs separately to model the latent semantic relationships, which enhance the model’s ability to adapt to density variations and improve counting accuracy in multi-view and multi-pose scenarios. Extensive experiments on three widely used datasets demonstrate that DSGC-Net outperforms current state-of-the-art methods. In particular, we achieve MAE of 48.9 and 5.9 in ShanghaiTech Part A and Part B datasets, respectively. The released code is available at: this https URL.

[CV-25] A Multimodal Cross-View Model for Predicting Postoperative Neck Pain in Cervical Spondylosis Patients

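【Quick Read】: This paper aims to improve the prediction of postoperative neck pain recovery in patients with cervical spondylosis. The core challenge is fusing features from multimodal medical images when imaging differences and spatial mismatches are present. The key to its solution is an Adaptive Bidirectional Pyramid Difference Convolution (ABPDC) module, which exploits the strengths of difference convolution in texture extraction and grayscale invariance to fuse cross-modal features, together with a Feature Pyramid Registration Auxiliary Network (FPRAN) that mitigates structural misalignment, yielding higher prediction accuracy.

Link: https://arxiv.org/abs/2509.02256
Authors: Jingyang Shan, Qishuai Yu, Jiacen Liu, Shaolin Zhang, Wen Shen, Yanxiao Zhao, Tianyi Wang, Xiaolin Qin, Yiheng Yin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: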

Abstract:Neck pain is the primary symptom of cervical spondylosis, yet its underlying mechanisms remain unclear, leading to uncertain treatment outcomes. To address the challenges of multimodal feature fusion caused by imaging differences and spatial mismatches, this paper proposes an Adaptive Bidirectional Pyramid Difference Convolution (ABPDC) module that facilitates multimodal integration by exploiting the advantages of difference convolution in texture extraction and grayscale invariance, and a Feature Pyramid Registration Auxiliary Network (FPRAN) to mitigate structural misalignment. Experiments on the MMCSD dataset demonstrate that the proposed model achieves superior prediction accuracy of postoperative neck pain recovery compared with existing methods, and ablation studies further confirm its effectiveness.

[CV-26] Palmistry-Informed Feature Extraction and Analysis using Machine Learning

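【Quick Read】: This paper addresses the reliance of traditional palm-feature analysis on subjective judgment without quantitative grounding, and seeks automated, objective analysis of palmar morphology with machine learning. The key to its solution is a computer vision pipeline that extracts features such as principal line structures, texture, and shape metrics from palm images and trains predictive models on a dataset of annotated palm images. This provides a data-driven way to study correlations between palmar morphology and externally validated traits or conditions, with possible applications in digital anthropometry and personalized user analytics and potential deployment on mobile platforms.

Link: https://arxiv.org/abs/2509.02248
Authors: Shweta Patil
Affiliations: D.Y. Patil University Navi Mumbai, India
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 7 figures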

Abstract:This paper explores the automated analysis of palmar features using machine learning techniques. We present a computer vision pipeline that extracts key characteristics from palm images, such as principal line structures, texture, and shape metrics. These features are used to train predictive models on a novel dataset curated from annotated palm images. Our approach moves beyond traditional subjective interpretation by providing a data-driven, quantitative framework for studying the correlations between palmar morphology and externally validated traits or conditions. The methodology demonstrates feasibility for applications in digital anthropometry and personalized user analytics, with potential for deployment on mobile platforms. Results indicate that machine learning models can identify complex patterns in palm data, opening avenues for research that intersects cultural practices with computational analysis.

[CV-27] ADVMEM: Adversarial Memory Initialization for Realistic Test-Time Adaptation via Tracklet-Based Benchmarking

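【Quick Read】: This paper addresses the failure of current Test-Time Adaptation (TTA) benchmarks to reflect the temporal dependencies found in realistic settings. Existing benchmarks focus on distribution shift and overlook the natural correlation between consecutive frames in a video stream, such as the same object persisting across frames. The authors propose a tracklet-based TTA benchmark, the "Inherent Temporal Dependencies" (ITD) dataset, whose instances are object-centric image sequences extracted from an object-tracking dataset and therefore carry temporal dependencies by construction. Using ITD, they expose the limitations of existing TTA methods under temporal dependence and propose an adversarial memory initialization strategy that substantially improves memory-based TTA methods on this benchmark.

Link: https://arxiv.org/abs/2509.02182
Authors: Shyma Alhuwaider, Motasem Alfarra, Juan C. Perez, Merey Ramazanova, Bernard Ghanem
Affiliations: Center of Excellence in Generative AI, KAUST, Saudi Arabia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: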

Abstract:We introduce a novel tracklet-based dataset for benchmarking test-time adaptation (TTA) methods. The aim of this dataset is to mimic the intricate challenges encountered in real-world environments such as images captured by hand-held cameras, self-driving cars, etc. The current benchmarks for TTA focus on how models face distribution shifts when deployed, and on violations of the customary independent-and-identically-distributed (i.i.d.) assumption in machine learning. Yet, these benchmarks fail to faithfully represent realistic scenarios that naturally display temporal dependencies, such as how consecutive frames from a video stream likely show the same object across time. We address this shortcoming of current datasets by proposing a novel TTA benchmark we call the “Inherent Temporal Dependencies” (ITD) dataset. We ensure the instances in ITD naturally embody temporal dependencies by collecting them from tracklets: sequences of object-centric images we compile from the bounding boxes of an object-tracking dataset. We use ITD to conduct a thorough experimental analysis of current TTA methods, and shed light on the limitations of these methods when faced with the challenges of temporal dependencies. Moreover, we build upon these insights and propose a novel adversarial memory initialization strategy to improve memory-based TTA methods. We find this strategy substantially boosts the performance of various methods on our challenging benchmark.
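
The notion of a tracklet, a time-ordered sequence of object-centric crops for one tracked object, can be sketched as a grouping step over tracking annotations. The record layout below (`frame`, `track_id`, `box`) is a hypothetical format chosen for illustration; the abstract does not specify how the ITD dataset stores its annotations.

```python
from collections import defaultdict


def build_tracklets(detections):
    """Group per-frame detections into tracklets keyed by track id.

    detections: iterable of dicts with keys "frame" (int), "track_id",
    and "box" (x, y, w, h). This layout is an assumption for the sketch.
    Returns {track_id: [box, ...]} with boxes ordered by frame.
    """
    by_track = defaultdict(list)
    for det in detections:
        by_track[det["track_id"]].append(det)
    return {
        track_id: [d["box"] for d in sorted(dets, key=lambda d: d["frame"])]
        for track_id, dets in by_track.items()
    }


annotations = [
    {"frame": 2, "track_id": "car_1", "box": (14, 20, 50, 30)},
    {"frame": 1, "track_id": "car_1", "box": (10, 20, 50, 30)},
    {"frame": 1, "track_id": "dog_7", "box": (80, 40, 20, 20)},
]
tracklets = build_tracklets(annotations)
print(tracklets["car_1"])
```

Feeding a model the crops of one tracklet in order yields a test stream in which consecutive samples show the same object, which is the non-i.i.d. condition the benchmark is designed to exercise.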

[CV-28] Omnidirectional Spatial Modeling from Correlated Panoramas

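【Quick Read】: This paper addresses omnidirectional scene understanding, which is made difficult by geometric distortion and complex spatial relations, and in particular the fact that existing methods reason within a single frame and ignore correlations across cross-frame panoramas. The key to its solution is CFpano, the first benchmark dataset dedicated to visual question answering (VQA) over cross-frame correlated panoramas, with over 2700 images and over 8000 question-answer pairs. On top of it, the authors present \methodname, a multimodal large language model (MLLM) fine-tuned with Group Relative Policy Optimization (GRPO) and tailored reward functions for robust and consistent reasoning over such panoramas. Experiments show state-of-the-art results on both multiple-choice and open-ended VQA, with a +5.37% overall improvement.

Link: https://arxiv.org/abs/2509.02164
Authors: Xinshen Zhang, Tongxi Fu, Xu Zheng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: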

Abstract:Omnidirectional scene understanding is vital for various downstream applications, such as embodied AI, autonomous driving, and immersive environments, yet remains challenging due to geometric distortion and complex spatial relations in 360° imagery. Existing omnidirectional methods achieve scene understanding within a single frame while neglecting cross-frame correlated panoramas. To bridge this gap, we introduce CFpano, the first benchmark dataset dedicated to cross-frame correlated panoramas visual question answering in the holistic 360° scenes. CFpano consists of over 2700 images together with over 8000 question-answer pairs, and the question types include both multiple choice and open-ended VQA. Building upon our CFpano, we further present \methodname, a multi-modal large language model (MLLM) fine-tuned with Group Relative Policy Optimization (GRPO) and a set of tailored reward functions for robust and consistent reasoning with cross-frame correlated panoramas. Benchmark experiments with existing MLLMs are conducted with our CFpano. The experimental results demonstrate that \methodname achieves state-of-the-art performance across both multiple-choice and open-ended VQA tasks, outperforming strong baselines on all major reasoning categories (+5.37% in overall performance). Our analyses validate the effectiveness of GRPO and establish a new benchmark for panoramic scene understanding.
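
For readers unfamiliar with GRPO, its central step is scoring each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that group-relative normalisation. The function name is hypothetical, the use of the population standard deviation and the small `eps` are assumptions, and the paper's tailored reward functions are not described in the abstract, so the rewards here are placeholder values.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of responses to the same prompt.

    advantage_i = (r_i - mean(r)) / (std(r) + eps)
    Population std and eps are implementation assumptions for this sketch.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Placeholder rewards for four sampled answers to one VQA question.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))
```

Because the baseline is the group mean, no separate value network is needed; responses that beat their siblings get positive advantages and the rest get negative ones.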

[CV-29] Enhancing Zero-Shot Pedestrian Attribute Recognition with Synthetic Data Generation: A Comparative Study with Image-To-Image Diffusion Models

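【Quick Read】: This paper addresses the limited generalization of Pedestrian Attribute Recognition (PAR) models in complex scenes, which stems from the scarcity of large-scale annotated data, especially under occlusion, pose variation, and diverse environments. The key to its solution is img2img diffusion-based data expansion: by tuning text prompts, image properties, and recent diffusion-based augmentation techniques, it generates synthetic pedestrian images suited to PAR and uses them to enrich zero-shot datasets for training. Experiments show that prompt alignment and image properties are the critical factors in generation quality, and the best configuration improves PAR recognition performance by 4.5%.

Link: https://arxiv.org/abs/2509.02161
Authors: Pablo Ayuso-Albizu, Juan C. SanMiguel, Pablo Carballeira
Affiliations: Universidad Autónoma de Madrid
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Paper accepted at AVSS 2025 conference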

Abstract:Pedestrian Attribute Recognition (PAR) involves identifying various human attributes from images, with applications in intelligent monitoring systems. The scarcity of large-scale annotated datasets hinders the generalization of PAR models, especially in complex scenarios involving occlusions, varying poses, and diverse environments. Recent advances in diffusion models have shown promise for generating diverse and realistic synthetic images, making it possible to expand the size and variability of training data. However, the potential of diffusion-based data expansion for generating PAR-like images remains underexplored. Such expansion may enhance the robustness and adaptability of PAR models in real-world scenarios. This paper investigates the effectiveness of diffusion models in generating synthetic pedestrian images tailored to PAR tasks. We identify key parameters of img2img diffusion-based data expansion, including text prompts, image properties, and the latest enhancements in diffusion-based data augmentation, and examine their impact on the quality of generated images for PAR. Furthermore, we employ the best-performing expansion approach to generate synthetic images for training PAR models by enriching the zero-shot datasets. Experimental results show that prompt alignment and image properties are critical factors in image generation, with optimal selection leading to a 4.5% improvement in PAR recognition performance.

[CV-30] SegFormer Fine-Tuning with Dropout: Advancing Hair Artifact Removal in Skin Lesion Analysis

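【Quick Read】: This paper addresses hair artifacts in dermoscopic skin lesion images, which can obscure key dermatological features and degrade downstream skin cancer detection. The key to its solution is a modified SegFormer model, SegformerWithDropout, which adds dropout regularization (probability 0.3) in the segmentation head to reduce overfitting. It uses the MiT-B2 encoder pretrained on ImageNet and is trained with 10-fold cross-validation on 500 dermoscopic images with fine-grained hair mask annotations. The model segments hair regions accurately, with an average Dice coefficient of about 0.96 and IoU of 0.93, along with PSNR around 34 dB, SSIM of 0.97, and LPIPS of 0.06, indicating that it can improve preprocessing for skin lesion analysis.

Link: https://arxiv.org/abs/2509.02156
Authors: Asif Mohammed Saad, Umme Niraj Mahi
Affiliations: Khulna University of Engineering & Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: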

Abstract:Hair artifacts in dermoscopic images present significant challenges for accurate skin lesion analysis, potentially obscuring critical diagnostic features in dermatological assessments. This work introduces a fine-tuned SegFormer model augmented with dropout regularization to achieve precise hair mask segmentation. The proposed SegformerWithDropout architecture leverages the MiT-B2 encoder, pretrained on ImageNet, with an in-channel count of 3 and 2 output classes, incorporating a dropout probability of 0.3 in the segmentation head to prevent overfitting. Training is conducted on a specialized dataset of 500 dermoscopic skin lesion images with fine-grained hair mask annotations, employing 10-fold cross-validation, AdamW optimization with a learning rate of 0.001, and cross-entropy loss. Early stopping is applied based on validation loss, with a patience of 3 epochs and a maximum of 20 epochs per fold. Performance is evaluated using a comprehensive suite of metrics, including Intersection over Union (IoU), Dice coefficient, Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS). Experimental results from the cross-validation demonstrate robust performance, with average Dice coefficients reaching approximately 0.96 and IoU values of 0.93, alongside favorable PSNR (around 34 dB), SSIM (0.97), and low LPIPS (0.06), highlighting the model’s effectiveness in accurate hair artifact segmentation and its potential to enhance preprocessing for downstream skin cancer detection tasks.
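
The two overlap metrics reported above, Dice and IoU, can be computed from binary masks as in the sketch below. It uses flat Python lists of 0/1 values and a hypothetical function name; it illustrates the standard definitions rather than the authors' evaluation code.

```python
def dice_and_iou(pred, target):
    """Dice coefficient and IoU for two binary masks given as flat 0/1 sequences."""
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    if tp + fp + fn == 0:
        # Both masks are empty: treat as a perfect match by convention.
        return 1.0, 1.0
    dice = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return dice, iou


pred_mask = [1, 1, 0, 0]
true_mask = [1, 0, 1, 0]
dice, iou = dice_and_iou(pred_mask, true_mask)
print(dice, iou)  # 0.5 and one third
```

For a single mask pair the two metrics are linked by Dice = 2·IoU / (1 + IoU). An IoU of 0.93 corresponds to a Dice of about 0.964, so the reported pair (0.96 and 0.93) is mutually consistent, with the caveat that the identity is exact only per image and not for averages over folds.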

[CV-31] Conditional-t3VAE: Equitable Latent Space Allocation for Fair Generation

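【Quick Read】: This paper addresses unfair generation by Variational Autoencoders (VAEs) on class-imbalanced datasets: with a global prior, a conventional VAE allocates latent space in proportion to class frequency in the training set, so tail classes are underrepresented. The key to its solution is Conditional-$t^3$VAE, which defines a separate Student's t joint prior for each class to enforce equitable latent space allocation and prevent majority classes from dominating. The model is optimized with a closed-form objective derived from the $\gamma$-power divergence, and an equal-weight mixture of Student's t-distributions is derived for class-balanced generation, improving generative fairness and diversity under severe class imbalance.

Link: https://arxiv.org/abs/2509.02154
Authors: Aymene Mohammed Bouayed, Samuel Deslauriers-Gauthier, Adrian Iaccovelli, David Naccache
Affiliations: DIÉNS, ÉNS, CNRS, PSL University, France; Centre Inria d’Université Côte d’Azur, France; Be-Ys Research, France
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: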

Abstract:Variational Autoencoders (VAEs) with global priors mirror the training set’s class frequency in latent space, underrepresenting tail classes and reducing generative fairness on imbalanced datasets. While $t^3$VAE improves robustness via heavy-tailed Student’s t-distribution priors, it still allocates latent volume proportionally to the class frequency. In this work, we address this issue by explicitly enforcing equitable latent space allocation across classes. To this end, we propose Conditional-$t^3$VAE, which defines a per-class Student’s t joint prior over latent and output variables, preventing dominance by majority classes. Our model is optimized using a closed-form objective derived from the $\gamma$-power divergence. Moreover, for class-balanced generation, we derive an equal-weight latent mixture of Student’s t-distributions. On SVHN-LT, CIFAR100-LT, and CelebA, Conditional-$t^3$VAE consistently achieves lower FID scores than both $t^3$VAE and Gaussian-based VAE baselines, particularly under severe class imbalance. In per-class F1 evaluations, Conditional-$t^3$VAE also outperforms the conditional Gaussian VAE across all highly imbalanced settings. While Gaussian-based models remain competitive under mild imbalance ratio ($\rho \lesssim 3$), our approach substantially improves generative fairness and diversity in more extreme regimes.
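
The idea of class-balanced generation from an equal-weight mixture can be illustrated with a one-dimensional toy: pick a class uniformly at random, regardless of how frequent it was in training, then draw a latent value from that class's Student's t component. This is a deliberately simplified sketch. The paper works with a multivariate joint prior over latent and output variables, whereas the code below is univariate, and the function names, class names, and parameter values are all hypothetical.

```python
import math
import random


def sample_student_t(rng, df, loc=0.0, scale=1.0):
    """Draw one Student's t sample as Normal / sqrt(ChiSquare / df)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
    return loc + scale * z / math.sqrt(chi2 / df)


def sample_balanced_mixture(rng, class_params, n):
    """Sample n (class, latent) pairs with every class equally likely.

    class_params: {class_name: (loc, scale, df)} with illustrative values.
    """
    classes = sorted(class_params)
    samples = []
    for _ in range(n):
        c = rng.choice(classes)  # equal weight, independent of class frequency
        loc, scale, df = class_params[c]
        samples.append((c, sample_student_t(rng, df, loc, scale)))
    return samples


rng = random.Random(0)
params = {"head": (0.0, 1.0, 5.0), "mid": (4.0, 1.0, 5.0), "tail": (8.0, 1.0, 5.0)}
draws = sample_balanced_mixture(rng, params, 3000)
print({c: sum(1 for k, _ in draws if k == c) for c in params})
```

With a frequency-weighted mixture the "tail" component would rarely be sampled; equal weights give each class roughly a third of the draws, which is the balancing effect the abstract describes.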

[CV-32] GRMM: Real-Time High-Fidelity Gaussian Morphable Head Model with Learned Residuals

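Quick read: This paper addresses the limits of traditional 3D Morphable Models (3DMMs) in resolution, detail, and photorealism, as well as the weak expression control of existing Gaussian Splatting (3DGS) based face models. Traditional PCA-driven 3DMMs struggle to capture fine-grained geometry (such as wrinkles and skin texture) and full-head coverage, while existing 3DGS methods render quickly and at high quality but still rely on a mesh-based 3DMM prior and cannot flexibly model identity- and expression-specific high-frequency detail. The key to the solution is GRMM, the first full-head Gaussian 3D morphable model, which adds residual geometry and appearance components on top of a base 3DMM to model high-frequency detail explicitly. It also provides disentangled control: low-dimensional, interpretable parameters (identity shape, expression, and so on) give coarse control, fine decoders represent per-Gaussian appearance, and a lightweight CNN refines the rasterised image. This improves reconstruction fidelity and expression accuracy while maintaining real-time rendering at 75 FPS.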

Link: https://arxiv.org/abs/2509.02141
Authors: Mohit Mendiratta, Mayur Deshmukh, Kartik Teotia, Vladislav Golyanik, Adam Kortylewski, Christian Theobalt
Affiliations: Max Planck Institute for Informatics; Saarland University; University of Freiburg
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:3D Morphable Models (3DMMs) enable controllable facial geometry and expression editing for reconstruction, animation, and AR/VR, but traditional PCA-based mesh models are limited in resolution, detail, and photorealism. Neural volumetric methods improve realism but remain too slow for interactive use. Recent Gaussian Splatting (3DGS) based facial models achieve fast, high-quality rendering but still depend solely on a mesh-based 3DMM prior for expression control, limiting their ability to capture fine-grained geometry, expressions, and full-head coverage. We introduce GRMM, the first full-head Gaussian 3D morphable model that augments a base 3DMM with residual geometry and appearance components, additive refinements that recover high-frequency details such as wrinkles, fine skin texture, and hairline variations. GRMM provides disentangled control through low-dimensional, interpretable parameters (e.g., identity shape, facial expressions) while separately modelling residuals that capture subject- and expression-specific detail beyond the base model’s capacity. Coarse decoders produce vertex-level mesh deformations, fine decoders represent per-Gaussian appearance, and a lightweight CNN refines rasterised images for enhanced realism, all while maintaining 75 FPS real-time rendering. To learn consistent, high-fidelity residuals, we present EXPRESS-50, the first dataset with 60 aligned expressions across 50 identities, enabling robust disentanglement of identity and expression in Gaussian-based 3DMMs. Across monocular 3D face reconstruction, novel-view synthesis, and expression transfer, GRMM surpasses state-of-the-art methods in fidelity and expression accuracy while delivering interactive real-time performance.

[CV-33] Scale, Don't Fine-tune: Guiding Multimodal LLMs for Efficient Visual Place Recognition at Test-Time

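Quick read: This paper addresses the limited cross-domain generalization and low computational efficiency of current Visual Place Recognition (VPR) methods. In particular, approaches built on Vision Foundation Models (VFMs) and Multimodal Large Language Models (MLLMs) incur high computational overhead and transfer poorly when fine-tuned. The key to the solution is a novel zero-shot framework based on Test-Time Scaling (TTS): guidance-based methods exploit the vision-language alignment of MLLMs for direct similarity scoring, and structured prompts produce length-controllable JSON outputs, which eliminates two-stage processing. An Uncertainty-Aware Self-Consistency (UASC) mechanism enables real-time adaptation with no additional training cost, improving cross-domain generalization and delivering up to a 210× gain in computational efficiency.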

Link: https://arxiv.org/abs/2509.02129
Authors: Jintao Cheng, Weibin Li, Jiehao Luo, Xiaoyu Tang, Zhijian He, Jin Wu, Yao Zou, Wei Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Visual Place Recognition (VPR) has evolved from handcrafted descriptors to deep learning approaches, yet significant challenges remain. Current approaches, including Vision Foundation Models (VFMs) and Multimodal Large Language Models (MLLMs), enhance semantic understanding but suffer from high computational overhead and limited cross-domain transferability when fine-tuned. To address these limitations, we propose a novel zero-shot framework employing Test-Time Scaling (TTS) that leverages MLLMs’ vision-language alignment capabilities through Guidance-based methods for direct similarity scoring. Our approach eliminates two-stage processing by employing structured prompts that generate length-controllable JSON outputs. The TTS framework with Uncertainty-Aware Self-Consistency (UASC) enables real-time adaptation without additional training costs, achieving superior generalization across diverse environments. Experimental results demonstrate significant improvements in cross-domain VPR performance with up to 210× computational efficiency gains.
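
To make the idea of scoring with structured JSON outputs and self-consistency more concrete, here is a minimal sketch. It is not the paper's UASC algorithm: the `"score"` key, the rule of averaging valid replies, and the use of the standard deviation as an uncertainty proxy are all assumptions chosen for illustration.

```python
import json
import statistics


def parse_score(raw):
    """Return the similarity score from one JSON reply, or None if malformed."""
    try:
        value = json.loads(raw)["score"]  # "score" is a hypothetical key name
    except (ValueError, KeyError, TypeError):
        return None
    return float(value) if isinstance(value, (int, float)) else None


def aggregate(replies):
    """Average the valid scores and report their spread as an uncertainty proxy."""
    scores = [s for s in map(parse_score, replies) if s is not None]
    if not scores:
        return None, None
    return statistics.fmean(scores), statistics.pstdev(scores)


# Several sampled replies for the same query/reference image pair.
replies = ['{"score": 0.8}', '{"score": 0.9}', 'not json', '{"score": 0.7}']
mean, spread = aggregate(replies)
```

A large spread across sampled replies would signal that the model is unsure about the pair, which a downstream ranking step could use to down-weight or re-query that candidate.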

[CV-34] NOOUGAT: Towards Unified Online and Offline Multi-Object Tracking

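Quick read: This paper addresses the long-standing split between online and offline Multi-Object Tracking (MOT), which leaves existing methods unable to meet the flexible temporal requirements of real deployments. Current online trackers rely on frame-by-frame, hand-crafted association strategies and struggle with long-term occlusions, while offline methods cover larger time gaps but still need heuristic stitching for arbitrarily long sequences. The key to the solution is NOOUGAT, the first tracker designed for arbitrary temporal horizons. Its core is a unified Graph Neural Network (GNN) framework that processes non-overlapping subclips and fuses them through a novel Autoregressive Long-term Tracking (ALT) layer. The subclip size controls the trade-off between latency and temporal context, which enables deployment modes ranging from frame-by-frame to batch processing, and the tracker achieves state-of-the-art performance in both regimes.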

Link: https://arxiv.org/abs/2509.02111
Authors: Benjamin Missaoui, Orcun Cetintas, Guillem Brasó, Tim Meinhardt, Laura Leal-Taixé
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: The long-standing division between online and offline Multi-Object Tracking (MOT) has led to fragmented solutions that fail to address the flexible temporal requirements of real-world deployment scenarios. Current online trackers rely on frame-by-frame hand-crafted association strategies and struggle with long-term occlusions, whereas offline approaches can cover larger time gaps, but still rely on heuristic stitching for arbitrarily long sequences. In this paper, we introduce NOOUGAT, the first tracker designed to operate with arbitrary temporal horizons. NOOUGAT leverages a unified Graph Neural Network (GNN) framework that processes non-overlapping subclips, and fuses them through a novel Autoregressive Long-term Tracking (ALT) layer. The subclip size controls the trade-off between latency and temporal context, enabling a wide range of deployment scenarios, from frame-by-frame to batch processing. NOOUGAT achieves state-of-the-art performance across both tracking regimes, improving online AssA by +2.3 on DanceTrack, +9.2 on SportsMOT, and +5.0 on MOT20, with even greater gains in offline mode.
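
The role of the subclip size can be shown with a short sketch of the partitioning step alone. The function name and signature are hypothetical, and the GNN and the ALT fusion layer are out of scope here; the sketch only shows how one parameter moves between frame-by-frame and batched operation.

```python
def split_into_subclips(num_frames, subclip_size):
    """Partition frame indices 0..num_frames-1 into consecutive, non-overlapping subclips."""
    if subclip_size < 1:
        raise ValueError("subclip_size must be at least 1")
    return [
        list(range(start, min(start + subclip_size, num_frames)))
        for start in range(0, num_frames, subclip_size)
    ]


online = split_into_subclips(6, 1)   # frame-by-frame: lowest latency, least context
batched = split_into_subclips(6, 4)  # larger subclips: more context, higher latency
```

With a subclip size of 1 the tracker would see one frame at a time, while a size equal to the sequence length corresponds to fully offline processing.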

[CV-35] SALAD – Semantics-Aware Logical Anomaly Detection ICCV2025

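Quick read: This paper addresses the weak performance of existing logical anomaly detection methods, especially for anomalies in which object components are missing or irregular. The best-performing approaches rely on aggregated pretrained features or handcrafted descriptors (most often derived from composition maps), which discard spatial and semantic information and limit detection quality. The key to the solution is SALAD, whose main innovation is a newly designed composition branch that explicitly models the distribution of object composition maps and thereby learns important semantic relationships. The paper also proposes a novel composition-map extraction procedure that needs no hand-made labels or category-specific information and preserves spatial and semantic detail. On the standard MVTec LOCO benchmark, SALAD reaches an image-level AUROC of 96.1%.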

Link: https://arxiv.org/abs/2509.02101
Authors: Matic Fučka, Vitjan Zavrtanik, Danijel Skočaj
Affiliations: University of Ljubljana, Faculty of Computer and Information Science
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ICCV 2025

Abstract:Recent surface anomaly detection methods excel at identifying structural anomalies, such as dents and scratches, but struggle with logical anomalies, such as irregular or missing object components. The best-performing logical anomaly detection approaches rely on aggregated pretrained features or handcrafted descriptors (most often derived from composition maps), which discard spatial and semantic information, leading to suboptimal performance. We propose SALAD, a semantics-aware discriminative logical anomaly detection method that incorporates a newly proposed composition branch to explicitly model the distribution of object composition maps, consequently learning important semantic relationships. Additionally, we introduce a novel procedure for extracting composition maps that requires no hand-made labels or category-specific information, in contrast to previous methods. By effectively modelling the composition map distribution, SALAD significantly improves upon state-of-the-art methods on the standard benchmark for logical anomaly detection, MVTec LOCO, achieving an impressive image-level AUROC of 96.1%. Code: this https URL

[CV-36] A Data-Centric Approach to Pedestrian Attribute Recognition: Synthetic Augmentation via Prompt-driven Diffusion Models

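Quick read: This paper addresses the limited generalization of Pedestrian Attribute Recognition (PAR) models caused by training datasets that under-represent certain attributes. The key to the solution is a data-centric synthetic augmentation method: textual descriptions guide diffusion models to generate pedestrian images that stay consistent with the original PAR datasets, and prompt-based annotation rules together with a modified loss function integrate the synthetic samples into training. Without changing the model architecture, the method improves recognition of low-frequency attributes and strengthens the model's overall zero-shot generalization.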

Link: https://arxiv.org/abs/2509.02099
Authors: Alejandro Alonso, Sawaiz A. Chaudhry, Juan C. SanMiguel, Álvaro García-Martín, Pablo Ayuso-Albizu, Pablo Carballeira
Affiliations: Autonomous University of Madrid; University of Bordeaux; Pázmány Péter Catholic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Paper accepted at the AVSS 2025 conference. Best paper award.

Abstract:Pedestrian Attribute Recognition (PAR) is a challenging task as models are required to generalize across numerous attributes in real-world data. Traditional approaches focus on complex methods, yet recognition performance is often constrained by training dataset limitations, particularly the under-representation of certain attributes. In this paper, we propose a data-centric approach to improve PAR by synthetic data augmentation guided by textual descriptions. First, we define a protocol to identify weakly recognized attributes across multiple datasets. Second, we propose a prompt-driven pipeline that leverages diffusion models to generate synthetic pedestrian images while preserving the consistency of PAR datasets. Finally, we derive a strategy to seamlessly incorporate synthetic samples into training data, which considers prompt-based annotation rules and modifies the loss function. Results on popular PAR datasets demonstrate that our approach not only boosts recognition of underrepresented attributes but also improves overall model performance beyond the targeted attributes. Notably, this approach strengthens zero-shot generalization without requiring architectural changes of the model, presenting an efficient and scalable solution to improve the recognition of attributes of pedestrians in the real world.
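
One plausible way to combine prompt-based annotation rules with a modified loss is a masked, weighted binary cross-entropy: a synthetic image is supervised only on the attributes its prompt specified. The abstract does not give the actual rules or loss, so the mask convention, the `sample_weight` parameter, and the function name below are assumptions.

```python
import math


def masked_bce(probs, labels, mask, sample_weight=1.0):
    """Binary cross-entropy averaged over the attributes whose mask entry is 1."""
    eps = 1e-7
    total, used = 0.0, 0
    for p, y, m in zip(probs, labels, mask):
        if not m:
            continue  # attribute not specified by the prompt: no supervision
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep the logarithms finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        used += 1
    return sample_weight * total / used if used else 0.0


# A synthetic sample whose prompt fixed only the first two attributes.
probs = [0.9, 0.2, 0.5]
labels = [1, 0, 1]
mask = [1, 1, 0]
loss = masked_bce(probs, labels, mask)
half_weight_loss = masked_bce(probs, labels, mask, sample_weight=0.5)
```

Masking keeps unknown attributes of a generated image from being treated as negatives, and the per-sample weight lets synthetic images contribute less than real ones if desired.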

[CV-37] ContextFusion and Bootstrap: An Effective Approach to Improve Slot Attention-Based Object-Centric Learning

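Quick read: This paper addresses two key problems in current slot attention-based unsupervised object-centric learning. First, the methods lack high-level semantic information, so the model depends too heavily on low-level features such as color and texture and struggles to understand semantic properties such as object contours and shapes. Second, the encoder cannot be fine-tuned, because existing methods need a stable feature space throughout training to reconstruct from slots, which limits flexibility. The key to the solution is two new modules that integrate seamlessly into existing slot attention models. The ContextFusion stage fuses semantic information from foreground and background and adds an auxiliary indicator that supplies extra contextual cues, enriching semantic content beyond low-level features. The Bootstrap Branch decouples feature adaptation from the original reconstruction phase and uses a bootstrap strategy to train a feature-adaptive mechanism, improving flexibility. Experiments show that the method significantly improves several SOTA slot attention models on both simulated and real-world datasets.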

Link: https://arxiv.org/abs/2509.02032
Authors: Pinzhuo Tian, Shengjie Yang, Hang Yu, Alex C. Kot
Affiliations: Shanghai University; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:A key human ability is to decompose a scene into distinct objects and use their relationships to understand the environment. Object-centric learning aims to mimic this process in an unsupervised manner. Recently, the slot attention-based framework has emerged as a leading approach in this area and has been widely used in various downstream tasks. However, existing slot attention methods face two key limitations: (1) a lack of high-level semantic information. In current methods, image areas are assigned to slots based on low-level features such as color and texture. This makes the model overly sensitive to low-level features and limits its understanding of object contours, shapes, or other semantic characteristics. (2) The inability to fine-tune the encoder. Current methods require a stable feature space throughout training to enable reconstruction from slots, which restricts the flexibility needed for effective object-centric learning. To address these limitations, we propose a novel ContextFusion stage and a Bootstrap Branch, both of which can be seamlessly integrated into existing slot attention models. In the ContextFusion stage, we exploit semantic information from the foreground and background, incorporating an auxiliary indicator that provides additional contextual cues about them to enrich the semantic content beyond low-level features. In the Bootstrap Branch, we decouple feature adaptation from the original reconstruction phase and introduce a bootstrap strategy to train a feature-adaptive mechanism, allowing for more flexible adaptation. Experimental results show that our method significantly improves the performance of different SOTA slot attention models on both simulated and real-world datasets.

[CV-38] Fake & Square: Training Self-Supervised Vision Transformers with Synthetic Data and Synthetic Hard Negatives ICCV2025

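Quick read: This paper addresses the heavy dependence of contrastive self-supervised learning on large amounts of real-world data and carefully curated hard negatives. The key to the approach is a framework called Syn2Co that combines two forms of "faking it": generative models produce synthetic data to augment sample diversity, and synthetic hard negatives are generated directly in the representation space to create more challenging contrasts. The study evaluates whether this synthetically enhanced training yields more robust and transferable representations on vision transformers (DeiT-S and Swin-T), and reports both the promise and the limitations of synthetic data in self-supervised learning.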

Link: https://arxiv.org/abs/2509.02029
Authors: Nikolaos Giakoumoglou, Andreas Floros, Kleanthis Marios Papadopoulos, Tania Stathaki
Affiliations: Imperial College London
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ICCV 2025 Workshop LIMIT

Abstract:This paper does not introduce a new method per se. Instead, we build on existing self-supervised learning approaches for vision, drawing inspiration from the adage “fake it till you make it”. While contrastive self-supervised learning has achieved remarkable success, it typically relies on vast amounts of real-world data and carefully curated hard negatives. To explore alternatives to these requirements, we investigate two forms of “faking it” in vision transformers. First, we study the potential of generative models for unsupervised representation learning, leveraging synthetic data to augment sample diversity. Second, we examine the feasibility of generating synthetic hard negatives in the representation space, creating diverse and challenging contrasts. Our framework - dubbed Syn2Co - combines both approaches and evaluates whether synthetically enhanced training can lead to more robust and transferable visual representations on DeiT-S and Swin-T architectures. Our findings highlight the promise and limitations of synthetic data in self-supervised learning, offering insights for future work in this direction.
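
Generating hard negatives in the representation space can be sketched as mixing the negatives that are already closest to the query. The abstract does not specify the synthesis rule, so the convex-mixing scheme below, the parameters `k` and `n_new`, and all function names are assumptions used only to illustrate the general idea.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def synthesize_hard_negatives(query, negatives, k, n_new, rng):
    """Create n_new negatives by mixing pairs of the k negatives most similar to the query."""
    hardest = sorted(negatives, key=lambda v: dot(query, v), reverse=True)[:k]
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(hardest, 2)
        alpha = rng.random()  # random convex mixing coefficient
        mixed = [alpha * x + (1 - alpha) * y for x, y in zip(a, b)]
        synthetic.append(normalize(mixed))
    return synthetic


rng = random.Random(0)
query = normalize([1.0, 0.0, 0.0, 0.0])
negatives = [normalize([rng.gauss(0, 1) for _ in range(4)]) for _ in range(16)]
synthetic = synthesize_hard_negatives(query, negatives, k=4, n_new=8, rng=rng)
```

The synthetic vectors would be appended to the negative set of a contrastive loss; because they are built from the hardest existing negatives, they tend to lie close to the query and make the contrastive task harder.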

[CV-39] See No Evil: Adversarial Attacks Against Linguistic-Visual Association in Referring Multi-Object Tracking Systems

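Quick read: This paper addresses the underexplored security and robustness of Referring Multi-Object Tracking (RMOT) systems, in particular their vulnerability to adversarial attacks. The study finds design-logic weaknesses in the Transformer-based spatial-temporal reasoning modules of RMOT models that expose both core components, linguistic-visual referring and track-object matching. It also uncovers a novel persistent-error vulnerability in advanced RMOT models that use FIFO-based memory: targeted, consistent attacks on spatial-temporal reasoning introduce errors that persist in the history buffer over multiple subsequent frames. The key contribution is VEIL, a novel adversarial framework whose carefully crafted digital and physical perturbations disrupt the unified referring-matching mechanism of RMOT models and induce track ID switches and terminations. The results expose security weaknesses in current models and underline the need for security-aware RMOT designs in critical large-scale applications.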

Link: https://arxiv.org/abs/2509.02028
Authors: Halima Bouzidi, Haoyu Liu, Mohammad Al Faruque
Affiliations: University of California, Irvine
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: 12 pages, 1 figure, 3 tables

Abstract:Language-vision understanding has driven the development of advanced perception systems, most notably the emerging paradigm of Referring Multi-Object Tracking (RMOT). By leveraging natural-language queries, RMOT systems can selectively track objects that satisfy a given semantic description, guided through Transformer-based spatial-temporal reasoning modules. End-to-End (E2E) RMOT models further unify feature extraction, temporal memory, and spatial reasoning within a Transformer backbone, enabling long-range spatial-temporal modeling over fused textual-visual representations. Despite these advances, the reliability and robustness of RMOT remain underexplored. In this paper, we examine the security implications of RMOT systems from a design-logic perspective, identifying adversarial vulnerabilities that compromise both the linguistic-visual referring and track-object matching components. Additionally, we uncover a novel vulnerability in advanced RMOT models employing FIFO-based memory, whereby targeted and consistent attacks on their spatial-temporal reasoning introduce errors that persist within the history buffer over multiple subsequent frames. We present VEIL, a novel adversarial framework designed to disrupt the unified referring-matching mechanisms of RMOT models. We show that carefully crafted digital and physical perturbations can corrupt the tracking logic reliability, inducing track ID switches and terminations. We conduct comprehensive evaluations using the Refer-KITTI dataset to validate the effectiveness of VEIL and demonstrate the urgent need for security-aware RMOT designs for critical large-scale applications.
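
Why an error in FIFO-based memory persists over several frames can be seen in a toy model. This is only an illustration of the buffer behaviour, not of any RMOT model or of VEIL: the buffer length, the string labels, and the function name are hypothetical.

```python
from collections import deque


def frames_affected(buffer_len, corrupt_at, total_frames):
    """Return the frames whose history buffer still contains the corrupted entry."""
    memory = deque(maxlen=buffer_len)  # oldest entry is evicted automatically
    affected = []
    for t in range(total_frames):
        memory.append("corrupt" if t == corrupt_at else "clean")
        if "corrupt" in memory:
            affected.append(t)
    return affected


# A single corrupted frame at t=2 with a 4-frame history buffer.
affected = frames_affected(buffer_len=4, corrupt_at=2, total_frames=10)
```

A single bad entry influences as many frames as the buffer is long, which is why an attack that is repeated consistently can keep the history contaminated indefinitely.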

[CV-40] Unsupervised Training of Vision Transformers with Synthetic Negatives CVPR2025

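Quick read: This paper addresses the neglected potential of hard negative samples in self-supervised learning, especially for Vision Transformer architectures. Earlier work explored synthetic hard negatives, but rarely in the context of vision transformers. The key to the solution is to integrate synthetic hard negatives into vision transformer representation learning. This simple technique notably improves the discriminative power of the learned representations, and experiments show performance gains for both DeiT-S and Swin-T.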

Link: https://arxiv.org/abs/2509.02024
Authors: Nikolaos Giakoumoglou, Andreas Floros, Kleanthis Marios Papadopoulos, Tania Stathaki
Affiliations: Imperial College London
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: CVPR 2025 Workshop VisCon

Abstract:This paper does not introduce a novel method per se. Instead, we address the neglected potential of hard negative samples in self-supervised learning. Previous works explored synthetic hard negatives but rarely in the context of vision transformers. We build on this observation and integrate synthetic hard negatives to improve vision transformer representation learning. This simple yet effective technique notably improves the discriminative power of learned representations. Our experiments show performance improvements for both DeiT-S and Swin-T architectures.

[CV-41] Vision-Based Embedded System for Noncontact Monitoring of Preterm Infant Behavior in Low-Resource Care Settings

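Quick read: This paper addresses the lack of effective, non-invasive behavior monitoring for preterm infants in low-resource settings, in particular continuous recognition of sleep/awake states and crying episodes. Traditional methods rely on manual observation or invasive sensors, which are error-prone, impractical, and can cause skin damage. The key to the solution is a lightweight, vision-based edge-computing framework: a quantized MobileNet model deployed on a Raspberry Pi recognizes behavioral states in real time with high accuracy (91.8% for sleep detection and 97.7% for crying/normal classification). The framework also integrates a Raspberry Pi-optimized vision pipeline and secure IoT communication, and quantization reduces model size by 68%, offering a practical path to low-cost, scalable neonatal intensive care unit (NICU) monitoring in resource-constrained environments.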

Link: https://arxiv.org/abs/2509.02018
Authors: Stanley Mugisha, Rashid Kisitu, Francis Komakech, Excellence Favor
Affiliations: University of South Africa
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 23 pages, 5 tables, 8 figures

Abstract: Preterm birth remains a leading cause of neonatal mortality, disproportionately affecting low-resource settings with limited access to advanced neonatal intensive care units (NICUs). Continuous monitoring of infant behavior, such as sleep/awake states and crying episodes, is critical but relies on manual observation or invasive sensors, which are prone to error, impractical, and can cause skin damage. This paper presents a novel, noninvasive, and automated vision-based framework to address this gap. We introduce an embedded monitoring system that utilizes a quantized MobileNet model deployed on a Raspberry Pi for real-time behavioral state detection. When trained and evaluated on public neonatal image datasets, our system achieves state-of-the-art accuracy (91.8% for sleep detection and 97.7% for crying/normal classification) while maintaining computational efficiency suitable for edge deployment. Through comparative benchmarking, we provide a critical analysis of the trade-offs between model size, inference latency, and diagnostic accuracy. Our findings demonstrate that while larger architectures (e.g., ResNet152, VGG19) offer marginal gains in accuracy, their computational cost is prohibitive for real-time edge use. The proposed framework integrates three key innovations: model quantization for memory-efficient inference (68% reduction in size), Raspberry Pi-optimized vision pipelines, and secure IoT communication for clinical alerts. This work conclusively shows that lightweight, optimized models such as the MobileNet offer the most viable foundation for scalable, low-cost, and clinically actionable NICU monitoring systems, paving the way for improved preterm care in resource-constrained environments.
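
The memory saving from quantization comes from storing each weight in fewer bits. The abstract does not say which scheme the authors used, so the sketch below shows generic 8-bit affine quantization as an assumption; the function names are hypothetical. Going from 32-bit floats to 8-bit integers shrinks the weights themselves by up to 75%, and a whole-model figure such as the reported 68% would depend on details the abstract does not give.

```python
def quantize(weights, num_bits=8):
    """Affine (asymmetric) quantization of floats to unsigned integers."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # all weights identical; any positive scale works
    zero_point = round(qmin - lo / scale)
    q = [min(qmax, max(qmin, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map quantized integers back to approximate float values."""
    return [(v - zero_point) * scale for v in q]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The reconstruction error stays within one quantization step, which is the accuracy cost traded for the smaller memory footprint.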

[CV-42] Palette Aligned Image Diffusion

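Quick read: This paper addresses the ambiguity and instability that arise when text-to-image diffusion models are conditioned on a user-specified color palette. Existing methods struggle to make generated images follow the palette strictly while keeping semantic consistency and high quality. The key to the solution is to interpret the palette as a sparse histogram and to introduce two scalar control parameters, histogram entropy and palette-to-histogram distance, which flexibly adjust palette adherence and color variation. A negative histogram mechanism suppresses specific undesired hues and improves fidelity to the target palette under classifier-free guidance. Trained on a carefully curated dataset with balanced coverage of common and rare colors, the method produces stable, semantically coherent images across a wide range of palettes and prompts.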

Link: https://arxiv.org/abs/2509.02000
Authors: Elad Aharoni, Noy Porat, Dani Lischinski, Ariel Shamir
Affiliations: Hebrew University; Reichman University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 19 figures

Abstract:We introduce the Palette-Adapter, a novel method for conditioning text-to-image diffusion models on a user-specified color palette. While palettes are a compact and intuitive tool widely used in creative workflows, they introduce significant ambiguity and instability when used for conditioning image generation. Our approach addresses this challenge by interpreting palettes as sparse histograms and introducing two scalar control parameters: histogram entropy and palette-to-histogram distance, which allow flexible control over the degree of palette adherence and color variation. We further introduce a negative histogram mechanism that allows users to suppress specific undesired hues, improving adherence to the intended palette under the standard classifier-free guidance mechanism. To ensure broad generalization across the color space, we train on a carefully curated dataset with balanced coverage of rare and common colors. Our method enables stable, semantically coherent generation across a wide range of palettes and prompts. We evaluate our method qualitatively, quantitatively, and through a user study, and show that it consistently outperforms existing approaches in achieving both strong palette adherence and high image quality.
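
The two scalar controls can be made concrete with a small sketch that treats an image and a palette as color histograms. The abstract does not define the exact binning or distance, so the coarse RGB binning and the L1 distance below are assumptions, and all function names are hypothetical.

```python
import math


def histogram(pixels, bins_per_channel=4):
    """Normalized RGB histogram with bins_per_channel**3 possible bins."""
    step = 256 // bins_per_channel
    counts = {}
    for r, g, b in pixels:
        key = (r // step, g // step, b // step)
        counts[key] = counts.get(key, 0) + 1
    total = len(pixels)
    return {k: c / total for k, c in counts.items()}


def entropy(hist):
    """Shannon entropy in bits; higher means more color variation."""
    return -sum(p * math.log2(p) for p in hist.values() if p > 0)


def l1_distance(h1, h2):
    """L1 distance between two sparse histograms (an assumed choice of metric)."""
    keys = set(h1) | set(h2)
    return sum(abs(h1.get(k, 0.0) - h2.get(k, 0.0)) for k in keys)


image = [(255, 0, 0)] * 2 + [(0, 0, 255)] * 2  # half red, half blue
palette = [(255, 0, 0)]                        # a one-color palette
image_hist = histogram(image)
palette_hist = histogram(palette)
image_entropy = entropy(image_hist)
distance = l1_distance(image_hist, palette_hist)
```

Low entropy corresponds to an image dominated by a few colors, and a small distance corresponds to an image whose color distribution stays close to the palette, so the two scalars describe color variation and palette adherence respectively.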

[CV-43] Explaining What Machines See: XAI Strategies in Deep Object Detection Models

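Quick read: This paper addresses the interpretability problem that the black-box nature and high complexity of deep neural networks create for object detection, which matters most in critical domains such as autonomous driving, medical imaging, and security systems where transparent and trustworthy decisions are essential. The key contribution is a systematic review and taxonomy of state-of-the-art Explainable Artificial Intelligence (XAI) methods, grouped by underlying mechanism into perturbation-based, gradient-based, backpropagation-based, and graph-based methods. It analyzes representative techniques such as D-RISE, BODEM, D-CLOSE, and FSOD and their applicability to mainstream detectors including YOLO, SSD, Faster R-CNN, and EfficientDet, giving researchers a structured basis for evaluating and choosing methods and supporting the development of more interpretable AI systems.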

Link: https://arxiv.org/abs/2509.01991
Authors: FatemehSadat Seyedmomeni, Mohammad Ali Keyvanrad
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 71 pages, 47 figures

Abstract: In recent years, deep learning has achieved unprecedented success in various computer vision tasks, particularly in object detection. However, the black-box nature and high complexity of deep neural networks pose significant challenges for interpretability, especially in critical domains such as autonomous driving, medical imaging, and security systems. Explainable Artificial Intelligence (XAI) aims to address this challenge by providing tools and methods to make model decisions more transparent, interpretable, and trustworthy for humans. This review provides a comprehensive analysis of state-of-the-art explainability methods specifically applied to object detection models. The paper begins by categorizing existing XAI techniques based on their underlying mechanisms: perturbation-based, gradient-based, backpropagation-based, and graph-based methods. Notable methods such as D-RISE, BODEM, D-CLOSE, and FSOD are discussed in detail. Furthermore, the paper investigates their applicability to various object detection architectures, including YOLO, SSD, Faster R-CNN, and EfficientDet. Statistical analysis of publication trends from 2022 to mid-2025 shows an accelerating interest in explainable object detection, indicating its increasing importance. The study also explores common datasets and evaluation metrics, and highlights the major challenges associated with model interpretability. By providing a structured taxonomy and a critical assessment of existing methods, this review aims to guide researchers and practitioners in selecting suitable explainability techniques for object detection applications and to foster the development of more interpretable AI systems.
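
The perturbation-based family can be illustrated with the simplest member of that family, occlusion saliency: hide part of the input and record how much the model's score drops. This is not D-RISE or any other method named in the review; the scoring function is a stand-in for a detector's confidence, and the names and patch size are hypothetical.

```python
def occlusion_saliency(image, score_fn, patch=2, fill=0.0):
    """Score drop caused by occluding each patch, written back to every pixel of that patch."""
    h, w = len(image), len(image[0])
    base = score_fn(image)
    saliency = [[0.0] * w for _ in range(h)]
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            occluded = [row[:] for row in image]  # copy so the input is untouched
            for y in range(top, min(top + patch, h)):
                for x in range(left, min(left + patch, w)):
                    occluded[y][x] = fill
            drop = base - score_fn(occluded)
            for y in range(top, min(top + patch, h)):
                for x in range(left, min(left + patch, w)):
                    saliency[y][x] = drop
    return saliency


def toy_score(img):
    """Stand-in for a detector confidence: depends only on the top-left 2x2 region."""
    return sum(img[y][x] for y in range(2) for x in range(2))


image = [[1.0] * 4 for _ in range(4)]
saliency = occlusion_saliency(image, toy_score)
```

Only the region the toy score depends on receives a non-zero saliency value, which is the behaviour a perturbation-based explanation is expected to show for a real detector as well.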

[CV-44] Draw-In-Mind: Learning Precise Image Editing via Chain-of-Thought Imagination

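Quick read: This paper addresses the weak image-editing performance of unified multimodal models, which it attributes to an imbalanced division of responsibilities between the understanding and generation modules. The understanding module acts only as an instruction translator, while the generation module must infer the layout, identify the target region, and render the new content all at once, which limits editing precision. The key to the solution is a dataset with 14M long-context image-text pairs (DIM-T2I) and 233K chain-of-thought imaginations generated by GPT-4o (DIM-Edit), which explicitly assigns the design responsibility to the understanding module. The authors connect a frozen Qwen2.5-VL-3B to a trainable SANA1.5-1.6B through a lightweight two-layer MLP and train on the DIM dataset to obtain DIM-4.6B-T2I/Edit. Despite its modest size, the model matches or surpasses much larger models on the ImgEdit and GEdit-Bench benchmarks, supporting the value of an explicit division of labor for image editing.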

Link: https://arxiv.org/abs/2509.01986
Authors: Ziyun Zeng, Junhao Zhang, Wei Li, Mike Zheng Shou
Affiliations: Show Lab, National University of Singapore; TikTok
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Tech Report

Abstract:In recent years, integrating multimodal understanding and generation into a single unified model has emerged as a promising paradigm. While this approach achieves strong results in text-to-image (T2I) generation, it still struggles with precise image editing. We attribute this limitation to an imbalanced division of responsibilities. The understanding module primarily functions as a translator that encodes user instructions into semantic conditions, while the generation module must simultaneously act as designer and painter, inferring the original layout, identifying the target editing region, and rendering the new content. This imbalance is counterintuitive because the understanding module is typically trained with several times more data on complex reasoning tasks than the generation module. To address this issue, we introduce Draw-In-Mind (DIM), a dataset comprising two complementary subsets: (i) DIM-T2I, containing 14M long-context image-text pairs to enhance complex instruction comprehension; and (ii) DIM-Edit, consisting of 233K chain-of-thought imaginations generated by GPT-4o, serving as explicit design blueprints for image edits. We connect a frozen Qwen2.5-VL-3B with a trainable SANA1.5-1.6B via a lightweight two-layer MLP, and train it on the proposed DIM dataset, resulting in DIM-4.6B-T2I/Edit. Despite its modest parameter scale, DIM-4.6B-Edit achieves SOTA or competitive performance on the ImgEdit and GEdit-Bench benchmarks, outperforming much larger models such as UniWorld-V1 and Step1X-Edit. These findings demonstrate that explicitly assigning the design responsibility to the understanding module provides significant benefits for image editing. Our dataset and models will be available at this https URL.

[CV-45] Discrete Noise Inversion for Next-scale Autoregressive Text-based Image Editing

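Quick read: This paper addresses text-guided image editing with Visual Autoregressive Models (VAR) without additional training, a key obstacle to their practical use. The core of the solution is a noise inversion method called Visual AutoRegressive Inverse Noise (VARIN). Its main innovation is a new pseudo-inverse function for argmax sampling, Location-aware Argmax Inversion (LAI), which generates inverse Gumbel noises. These noises allow precise reconstruction of the source image and support controllable edits aligned with the text prompt, so the image can be modified according to the prompt while the original background and structural details are largely preserved.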

Link: https://arxiv.org/abs/2509.01984
Authors: Quan Dao, Xiaoxiao He, Ligong Han, Ngan Hoai Nguyen, Amin Heyrani Nobar, Faez Ahmed, Han Zhang, Viet Anh Nguyen, Dimitris Metaxas
Affiliations: Rutgers University; Red Hat AI Innovation; Qualcomm AI Research; MIT; ReveAI; CUHK
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: preprint

Abstract:Visual autoregressive models (VAR) have recently emerged as a promising class of generative models, achieving performance comparable to diffusion models in text-to-image generation tasks. While conditional generation has been widely explored, the ability to perform prompt-guided image editing without additional training is equally critical, as it supports numerous practical real-world applications. This paper investigates the text-to-image editing capabilities of VAR by introducing Visual AutoRegressive Inverse Noise (VARIN), the first noise inversion-based editing technique designed explicitly for VAR models. VARIN leverages a novel pseudo-inverse function for argmax sampling, named Location-aware Argmax Inversion (LAI), to generate inverse Gumbel noises. These inverse noises enable precise reconstruction of the source image and facilitate targeted, controllable edits aligned with textual prompts. Extensive experiments demonstrate that VARIN effectively modifies source images according to specified prompts while significantly preserving the original background and structural details, thus validating its efficacy as a practical editing approach.
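
The idea of inverting argmax sampling can be illustrated with the Gumbel-max trick, where a token is chosen as the argmax of logits plus Gumbel noise. The sketch below is not the paper's LAI: it uses a simple swap construction, chosen here only for illustration, to build one noise vector under which the argmax reproduces an observed token. All function names are hypothetical.

```python
import math
import random


def gumbel(rng):
    """One standard Gumbel sample via inverse transform sampling."""
    u = max(rng.random(), 1e-12)  # guard against log(0)
    return -math.log(-math.log(u))


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def sample_token(logits, noise):
    """Gumbel-max sampling: the token is the argmax of logits plus noise."""
    return argmax([l + g for l, g in zip(logits, noise)])


def invert_noise(logits, token, rng):
    """Build a noise vector for which sample_token returns the observed token."""
    perturbed = [l + gumbel(rng) for l in logits]
    winner = argmax(perturbed)
    # Swap the perturbed values so the observed token holds the maximum.
    perturbed[winner], perturbed[token] = perturbed[token], perturbed[winner]
    return [p - l for p, l in zip(perturbed, logits)]


logits = [2.0, 0.5, -1.0, 0.0]
rng = random.Random(0)
recovered = [sample_token(logits, invert_noise(logits, t, rng)) for t in range(len(logits))]
```

Once such noise is known for every token of a source image, re-running generation with the same noise reproduces the source, and changing the prompt while keeping the noise is what makes an edit stay close to the original.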

[CV-46] MOSAIC: Multi-Subject Personalized Generation via Correspondence-Aware Alignment and Disentanglement

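Quick read: This paper addresses the difficulty of maintaining identity fidelity and semantic coherence in multi-subject personalized image generation. When conditioning on several reference subjects, existing methods often suffer from identity blending and attribute leakage because they model the interactions between subjects inadequately. The key to the solution is MOSAIC, a representation-centric framework with three core elements: (1) SemAlign-MS, a dataset with fine-grained semantic correspondence annotations between multiple reference subjects and target images; (2) a semantic correspondence attention loss that enforces point-to-point semantic alignment, so that each reference maps precisely to its designated regions of the generated image; and (3) a multi-reference disentanglement loss that pushes different subjects into orthogonal attention subspaces, preventing feature interference while preserving individual identity. With these components, MOSAIC maintains high fidelity with four or more reference subjects, whereas existing methods typically degrade beyond three.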

Link: https://arxiv.org/abs/2509.01977
Authors: Dong She, Siming Fu, Mushui Liu, Qiaoqiao Jin, Hualiang Wang, Mu Liu, Jidong Jiang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multi-subject personalized generation presents unique challenges in maintaining identity fidelity and semantic coherence when synthesizing images conditioned on multiple reference subjects. Existing methods often suffer from identity blending and attribute leakage due to inadequate modeling of how different subjects should interact within shared representation spaces. We present MOSAIC, a representation-centric framework that rethinks multi-subject generation through explicit semantic correspondence and orthogonal feature disentanglement. Our key insight is that multi-subject generation requires precise semantic alignment at the representation level - knowing exactly which regions in the generated image should attend to which parts of each reference. To enable this, we introduce SemAlign-MS, a meticulously annotated dataset providing fine-grained semantic correspondences between multiple reference subjects and target images, previously unavailable in this domain. Building on this foundation, we propose the semantic correspondence attention loss to enforce precise point-to-point semantic alignment, ensuring high consistency from each reference to its designated regions. Furthermore, we develop the multi-reference disentanglement loss to push different subjects into orthogonal attention subspaces, preventing feature interference while preserving individual identity characteristics. Extensive experiments demonstrate that MOSAIC achieves state-of-the-art performance on multiple benchmarks. Notably, while existing methods typically degrade beyond 3 subjects, MOSAIC maintains high fidelity with 4+ reference subjects, opening new possibilities for complex multi-subject synthesis applications.
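
Pushing subjects into orthogonal attention subspaces can be pictured as penalizing the overlap between their attention maps. The abstract does not give the loss formula, so the mean absolute cosine similarity below is an assumption, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two flattened attention maps."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def disentanglement_penalty(attn_maps):
    """Mean absolute cosine similarity over all pairs of per-subject attention maps."""
    pairs = [
        (i, j)
        for i in range(len(attn_maps))
        for j in range(i + 1, len(attn_maps))
    ]
    if not pairs:
        return 0.0
    return sum(abs(cosine(attn_maps[i], attn_maps[j])) for i, j in pairs) / len(pairs)


# Two subjects attending to disjoint regions versus the same region.
separated = disentanglement_penalty([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])
overlapping = disentanglement_penalty([[1.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]])
```

The penalty is zero when subjects attend to disjoint regions and approaches one when their attention coincides, so minimizing it discourages the feature interference that leads to identity blending.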

[CV-47] Ensemble-Based Event Camera Place Recognition Under Varying Illumination

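Quick read: This paper addresses the limited robustness of event camera-based Visual Place Recognition (VPR) under extreme illumination changes. Existing methods degrade sharply in scenarios such as day-to-night transitions, and although event cameras offer high dynamic range and low latency, robust VPR frameworks for severe lighting changes remain an open problem. The key to the solution is an ensemble-based approach that fuses sequence-matched results from multiple event-to-frame reconstruction methods, VPR feature extractors, and temporal resolutions. Compared with earlier ensembles that vary only the temporal resolution, this broader fusion achieves a 57% relative improvement in Recall@1 across day-night transitions. A detailed analysis of design choices (polarity handling, binning strategies, reconstruction methods, feature extractors) identifies the components that matter most for robustness, and a modification to the standard sequence matching framework improves performance at longer sequence lengths.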

链接: https://arxiv.org/abs/2509.01968
作者: Therese Joseph,Tobias Fischer,Michael Milford
机构: Queensland University of Technology (昆士兰科技大学); QUT Centre for Robotics (QUT机器人中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Compared to conventional cameras, event cameras provide a high dynamic range and low latency, offering greater robustness to rapid motion and challenging lighting conditions. Although the potential of event cameras for visual place recognition (VPR) has been established, developing robust VPR frameworks under severe illumination changes remains an open research problem. In this paper, we introduce an ensemble-based approach to event camera place recognition that combines sequence-matched results from multiple event-to-frame reconstructions, VPR feature extractors, and temporal resolutions. Unlike previous event-based ensemble methods, which only utilise temporal resolution, our broader fusion strategy delivers significantly improved robustness under varied lighting conditions (e.g., afternoon, sunset, night), achieving a 57% relative improvement in Recall@1 across day-night transitions. We evaluate our approach on two long-term driving datasets (with 8 km per traverse) without metric subsampling, thereby preserving natural variations in speed and stop duration that influence event density. We also conduct a comprehensive analysis of key design choices, including binning strategies, polarity handling, reconstruction methods, and feature extractors, to identify the most critical components for robust performance. Additionally, we propose a modification to the standard sequence matching framework that enhances performance at longer sequence lengths. To facilitate future research, we will release our codebase and benchmarking framework.
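The fusion idea can be illustrated with a minimal sketch. It assumes that every ensemble member (one combination of reconstruction method, feature extractor, and temporal resolution) yields a query-by-reference distance matrix, and that members are fused by averaging min-max-normalised matrices. The abstract does not state the actual fusion rule, so `normalise`, `fuse`, and the toy matrices are hypothetical, and the sequence matching step is omitted.

```python
def normalise(matrix):
    """Min-max scale a distance matrix (list of rows) into [0, 1]."""
    flat = [v for row in matrix for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero for constant matrices
    return [[(v - lo) / span for v in row] for row in matrix]


def fuse(matrices):
    """Element-wise mean of normalised matrices (an assumed fusion rule)."""
    normed = [normalise(m) for m in matrices]
    rows, cols = len(normed[0]), len(normed[0][0])
    return [
        [sum(m[i][j] for m in normed) / len(normed) for j in range(cols)]
        for i in range(rows)
    ]


def recall_at_1(dist, ground_truth):
    """Fraction of queries whose nearest reference is the true place."""
    hits = sum(
        1 for q, row in enumerate(dist) if row.index(min(row)) == ground_truth[q]
    )
    return hits / len(dist)


# Toy data: rows are queries, columns are reference places (made-up values).
member_a = [[0.2, 0.9, 0.8], [0.7, 0.6, 0.5], [0.9, 0.8, 0.1]]
member_b = [[10.0, 40.0, 50.0], [60.0, 5.0, 55.0], [15.0, 50.0, 20.0]]
truth = [0, 1, 2]
fused = fuse([member_a, member_b])
print(recall_at_1(member_a, truth), recall_at_1(member_b, truth), recall_at_1(fused, truth))
```

In this toy case each member mismatches a different query, so the fused matrix recovers all three places. Normalising before averaging matters because members may report distances on very different scales.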

[CV-48] 2D Gaussian Splatting with Semantic Alignment for Image Inpainting

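[Quick read]: This paper addresses the problem of achieving both locally coherent pixel synthesis and globally consistent semantics in image inpainting. Its core contribution is the first use of 2D Gaussian Splatting (GS) for this task: the incomplete image is encoded into a continuous field of 2D Gaussian splat coefficients and reconstructed through a differentiable rasterization process, which inherently promotes pixel-level coherence in the inpainted region. For efficiency and scalability, a patch-wise rasterization strategy reduces memory overhead and accelerates inference. Global features from a pretrained DINO model are incorporated to strengthen semantic consistency in large-mask scenarios, so that the inpainted content stays consistent with the surrounding scene. The method achieves competitive quantitative metrics and perceptual quality on standard benchmarks and opens a new direction for applying Gaussian Splatting to 2D image processing.

Link: https://arxiv.org/abs/2509.01964
Authors: Hongyu Li, Chaofeng Chen, Xiaoming Li, Guangming Lu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: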

Abstract:Gaussian Splatting (GS), a recent technique for converting discrete points into continuous spatial representations, has shown promising results in 3D scene modeling and 2D image super-resolution. In this paper, we explore its untapped potential for image inpainting, which demands both locally coherent pixel synthesis and globally consistent semantic restoration. We propose the first image inpainting framework based on 2D Gaussian Splatting, which encodes incomplete images into a continuous field of 2D Gaussian splat coefficients and reconstructs the final image via a differentiable rasterization process. The continuous rendering paradigm of GS inherently promotes pixel-level coherence in the inpainted results. To improve efficiency and scalability, we introduce a patch-wise rasterization strategy that reduces memory overhead and accelerates inference. For global semantic consistency, we incorporate features from a pretrained DINO model. We observe that DINO’s global features are naturally robust to small missing regions and can be effectively adapted to guide semantic alignment in large-mask scenarios, ensuring that the inpainted content remains contextually consistent with the surrounding scene. Extensive experiments on standard benchmarks demonstrate that our method achieves competitive performance in both quantitative metrics and perceptual quality, establishing a new direction for applying Gaussian Splatting to 2D image processing.

[CV-49] Structure-aware Contrastive Learning for Diagram Understanding of Multimodal Models

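[Quick read]: This paper addresses the limitations of current multimodal models such as CLIP on specialised visual domains such as diagrams, which carry structured, symbolic information that differs fundamentally from natural images. The key to the solution is a new training paradigm that uses hard samples for contrastive learning together with two loss functions designed around the structural properties of diagrams, strengthening the model's structured semantic understanding of diagram content. Experiments on a flowchart benchmark show substantial improvements over standard CLIP and conventional hard-negative CLIP training on both image-text matching and visual question answering.

Link: https://arxiv.org/abs/2509.01959
Authors: Hiroshi Sasaki
Affiliations: The Japan Research Institute, Limited
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 8 figures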

Abstract:Multimodal models, such as the Contrastive Language-Image Pre-training (CLIP) model, have demonstrated remarkable success in aligning visual and linguistic representations. However, these models exhibit limitations when applied to specialised visual domains, such as diagrams, which encode structured, symbolic information distinct from that of natural imagery. In this paper, we introduce a novel training paradigm explicitly designed to enhance the comprehension of diagrammatic images within vision-language models. Our approach uses "hard" samples for our proposed contrastive learning that incorporates two specialised loss functions that leverage the inherent structural properties of diagrams. By integrating these objectives into model training, our method enables models to develop a more structured and semantically coherent understanding of diagrammatic content. We empirically validate our approach on a benchmark dataset of flowcharts, as a representative class of diagrammatic imagery, demonstrating substantial improvements over standard CLIP and conventional hard negative CLIP learning paradigms for both image-text matching and visual question answering tasks. Our findings underscore the significance of tailored training strategies for specialised tasks and contribute to advancing diagrammatic understanding within the broader landscape of vision-language integration.

[CV-50] AutoDrive-R2: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving

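[Quick read]: This paper addresses shortcomings of Vision-Language-Action (VLA) models for autonomous driving in the interpretability and coherence of the decision process and in the plausibility of action sequences. Its core solution is the AutoDrive-R² framework, which strengthens reasoning and self-reflection by combining chain-of-thought (CoT) processing with reinforcement learning (RL). First, it builds the nuScenesR²-6K dataset for supervised fine-tuning, which uses a four-step logical chain with self-reflection for validation to bridge input information and output trajectories. Second, in the RL stage it applies Group Relative Policy Optimization (GRPO) with a physics-grounded reward covering spatial alignment, vehicle dynamics, and temporal smoothness, yielding more reliable and realistic trajectory planning.

Link: https://arxiv.org/abs/2509.01944
Authors: Zhenlong Yuan, Jing Tang, Jinguo Luo, Rui Chen, Chengxuan Qian, Lei Sun, Xiangxiang Chu, Yujun Cai, Dapeng Zhang, Shuo Li
Affiliations: AMAP, Alibaba Group; University of Queensland; Lanzhou University; Case Western Reserve University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: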

Abstract:Vision-Language-Action (VLA) models in autonomous driving systems have recently demonstrated transformative potential by integrating multimodal perception with decision-making capabilities. However, the interpretability and coherence of the decision process and the plausibility of action sequences remain largely underexplored. To address these issues, we propose AutoDrive-R$^2$, a novel VLA framework that enhances both reasoning and self-reflection capabilities of autonomous driving systems through chain-of-thought (CoT) processing and reinforcement learning (RL). Specifically, we first propose an innovative CoT dataset named nuScenesR$^2$-6K for supervised fine-tuning, which effectively builds cognitive bridges between input information and output trajectories through a four-step logical chain with self-reflection for validation. Moreover, to maximize both reasoning and self-reflection during the RL stage, we further employ the Group Relative Policy Optimization (GRPO) algorithm within a physics-grounded reward framework that incorporates spatial alignment, vehicle dynamics, and temporal smoothness criteria to ensure reliable and realistic trajectory planning. Extensive evaluation results across both nuScenes and Waymo datasets demonstrate the state-of-the-art performance and robust generalization capacity of our proposed method.
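The group-relative step that gives GRPO its name can be sketched as follows. This is a minimal illustration, assuming the common formulation in which each sampled trajectory's reward is standardised against the other samples drawn for the same prompt. The weighted-sum reward in `trajectory_reward`, its weights, and the use of the population standard deviation are assumptions for illustration, not the paper's reward definition.

```python
import statistics


def trajectory_reward(spatial, dynamics, smoothness, weights=(1.0, 1.0, 1.0)):
    """Hypothetical combined reward: weighted sum of three per-trajectory scores."""
    w_s, w_d, w_t = weights
    return w_s * spatial + w_d * dynamics + w_t * smoothness


def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of samples for the same input."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Three candidate trajectories sampled for one driving scene (made-up scores).
scores = [(0.2, 0.4, 0.4), (0.6, 0.7, 0.7), (0.9, 1.0, 1.1)]
rewards = [trajectory_reward(*s) for s in scores]
print(group_relative_advantages(rewards))
```

Because advantages are computed relative to the group, no separate value network is needed: trajectories that score above the group mean receive positive advantages and the rest negative ones.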

[CV-51] A Diffusion-Based Framework for Configurable and Realistic Multi-Storage Trace Generation

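[Quick read]: This paper addresses the difficulty of generating multi-device storage traces that are at once high-fidelity, precisely configurable, and diverse. Existing methods tend to trade realism against controllability and struggle to capture temporal dynamics and inter-device dependencies. The key to the solution is DiTTO, a diffusion-based framework that synthesizes high-fidelity continuous multi-device storage traces under user-defined configurations while modeling temporal dynamics and cross-device dependencies. Experiments show that the generated traces follow the guided configurations with only 8% error.

Link: https://arxiv.org/abs/2509.01919
Authors: Seohyun Kim, Junyoung Lee, Jongho Park, Jinhyung Koo, Sungjin Lee, Yeseong Kim
Affiliations: DGIST; POSTECH
Categories: Computer Vision and Pattern Recognition (cs.CV); Performance (cs.PF)
Comments: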

Abstract:We propose DiTTO, a novel diffusion-based framework for generating realistic, precisely configurable, and diverse multi-device storage traces. Leveraging advanced diffusion techniques, DiTTO enables the synthesis of high-fidelity continuous traces that capture temporal dynamics and inter-device dependencies with user-defined configurations. Our experimental results demonstrate that DiTTO can generate traces with high fidelity and diversity while aligning closely with guided configurations, with only 8% error.

[CV-52] Towards Interpretable Geo-localization: a Concept-Aware Global Image-GPS Alignment Framework

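[Quick read]: This paper addresses the limited interpretability of current worldwide geo-localization models. Existing concept-based interpretability methods do not align well with image-location embedding objectives, which leaves the decision process opaque and semantically unclear. The key to the solution is a new framework that integrates global geo-localization with concept bottlenecks. Its core component is a Concept-Aware Alignment Module that jointly projects image and location embeddings onto a shared bank of geographic concepts (e.g., tropical climate, mountain, cathedral) and minimizes a concept-level loss, strengthening alignment in a concept-specific subspace and making geo-localization decisions more robust and interpretable.

Link: https://arxiv.org/abs/2509.01910
Authors: Furong Jia, Lanxin Liu, Ce Hou, Fan Zhang, Xinyan Liu, Yu Liu
Affiliations: Peking University; Harbin Institute of Technology; The Hong Kong University of Science and Technology; Harbin Institute of Technology (Weihai)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: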

Abstract:Worldwide geo-localization involves determining the exact geographic location of images captured globally, typically guided by geographic cues such as climate, landmarks, and architectural styles. Despite advancements in geo-localization models like GeoCLIP, which leverages images and location alignment via contrastive learning for accurate predictions, the interpretability of these models remains insufficiently explored. Current concept-based interpretability methods fail to align effectively with Geo-alignment image-location embedding objectives, resulting in suboptimal interpretability and performance. To address this gap, we propose a novel framework integrating global geo-localization with concept bottlenecks. Our method inserts a Concept-Aware Alignment Module that jointly projects image and location embeddings onto a shared bank of geographic concepts (e.g., tropical climate, mountain, cathedral) and minimizes a concept-level loss, enhancing alignment in a concept-specific subspace and enabling robust interpretability. To our knowledge, this is the first work to introduce interpretability into geo-localization. Extensive experiments demonstrate that our approach surpasses GeoCLIP in geo-localization accuracy and boosts performance across diverse geospatial prediction tasks, revealing richer semantic insights into geographic decision-making processes.
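The concept-level projection can be pictured with a small sketch. It assumes that both embeddings are scored against each concept vector by cosine similarity and that the concept-level loss is the mean squared difference between the two score vectors. The abstract does not specify the projection or the loss, so `concept_scores`, `concept_alignment_loss`, and the two-dimensional toy vectors are hypothetical.

```python
import math


def unit(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def concept_scores(embedding, concept_bank):
    """Cosine similarity of one embedding to every concept vector."""
    e = unit(embedding)
    return [sum(a * b for a, b in zip(e, unit(c))) for c in concept_bank]


def concept_alignment_loss(image_emb, location_emb, concept_bank):
    """Assumed loss: mean squared gap between image and location concept scores."""
    img = concept_scores(image_emb, concept_bank)
    loc = concept_scores(location_emb, concept_bank)
    return sum((a - b) ** 2 for a, b in zip(img, loc)) / len(concept_bank)


# Toy concept bank, e.g. "tropical climate" and "mountain" (made-up vectors).
bank = [[1.0, 0.0], [0.0, 1.0]]
print(concept_scores([0.9, 0.1], bank))
print(concept_alignment_loss([0.9, 0.1], [0.2, 0.8], bank))
```

The per-concept scores are what make such a model inspectable: a prediction can be explained by listing the concepts on which the image and the candidate location agree.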

[CV-53] DroneSR: Rethinking Few-shot Thermal Image Super-Resolution from Drone-based Perspective

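[Quick read]: This paper addresses the tendency of large-scale diffusion models to overfit when trained on few-shot drone-captured infrared images, which hurts their generalization in image super-resolution. The key to the solution is a Gaussian quantization representation learning method oriented to diffusion models, which reduces overfitting while maintaining architecture complexity. An accompanying monitoring mechanism tracks large-scale architectures during training to detect signs of overfitting, improving robustness in few-shot, diverse drone-based infrared image reconstruction scenarios.

Link: https://arxiv.org/abs/2509.01898
Authors: Zhipeng Weng, Xiaopeng Liu, Ce Liu, Xingyuan Guo, Yukai Shi, Liang Lin
Affiliations: Guangdong University of Technology; Guangdong Power Grid, Ltd.; Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: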

Abstract:Although large-scale models achieve significant improvements in performance, the overfitting challenge still frequently undermines their generalization ability. In image super-resolution tasks, diffusion models, as representative generative models, typically adopt large-scale architectures. However, few-shot drone-captured infrared training data frequently induces severe overfitting in large-scale architectures. To address this key challenge, our method proposes a new Gaussian quantization representation learning method oriented to diffusion models that alleviates overfitting and enhances robustness. At the same time, an effective monitoring mechanism tracks large-scale architectures during training to detect signs of overfitting. By introducing Gaussian quantization representation learning, our method effectively reduces overfitting while maintaining architecture complexity. On this basis, we construct a multi-source drone-based infrared image benchmark dataset for detection and use it to emphasize overfitting issues of large-scale architectures in few-sample, diverse drone-based image reconstruction scenarios. To verify the efficacy of the method in mitigating overfitting, experiments are conducted on the constructed benchmark. Experimental results demonstrate that our method outperforms existing super-resolution approaches and significantly mitigates overfitting of large-scale architectures under complex conditions. The code and DroneSR dataset will be available at: this https URL.

[CV-54] Automated Wildfire Damage Assessment from Multi view Ground level Imagery Via Vision Language Models

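[Quick read]: This paper addresses rapid and accurate property damage assessment after wildfires. Traditional methods are time consuming, and modern computer vision approaches typically need extensive labeled data, which hinders immediate post-disaster deployment. The key to the solution is a zero-shot framework built on pre-trained vision language models (VLMs) that uses structured prompts based on specific wildfire damage indicators to classify damaged structures from ground-level imagery. Two pipelines are compared: a VLM alone (Pipeline A) and a VLM combined with a large language model (LLM) (Pipeline B). Multi-view analysis markedly improves classification (F1 scores rise from the 0.225 to 0.511 range to the 0.857 to 0.947 range), and a McNemar test confirms that this improvement is statistically significant, whereas the difference between Pipelines A and B is not. The result is an immediately deployable, interpretable assessment workflow that needs no supervised training.

Link: https://arxiv.org/abs/2509.01895
Authors: Miguel Esparza, Archit Gupta, Ali Mostafavi, Kai Yin, Yiming Xiao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: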

Abstract:The escalating intensity and frequency of wildfires demand innovative computational methods for rapid and accurate property damage assessment. Traditional methods are often time consuming, while modern computer vision approaches typically require extensive labeled datasets, hindering immediate post-disaster deployment. This research introduces a novel, zero-shot framework leveraging pre-trained vision language models (VLMs) to classify damage from ground-level imagery. We propose and evaluate two pipelines applied to the 2025 Eaton and Palisades fires in California, a VLM (Pipeline A) and a VLM + large language model (LLM) approach (Pipeline B), that integrate structured prompts based on specific wildfire damage indicators. A primary scientific contribution of this study is demonstrating the VLMs' efficacy in synthesizing information from multiple perspectives to identify nuanced damage, a critical limitation in existing literature. Our findings reveal that while single-view assessments struggled to classify affected structures (F1 scores ranging from 0.225 to 0.511), the multi-view analysis yielded dramatic improvements (F1 scores ranging from 0.857 to 0.947). Moreover, the McNemar test confirmed that pipelines with a multi-view image assessment yield statistically significant classification improvements; however, the improvements this research observed between Pipeline A and B were not statistically significant. Thus, future research can explore the potential of LLM prompting in damage assessment. The practical contribution is an immediately deployable, flexible, and interpretable workflow that bypasses the need for supervised training, significantly accelerating triage and prioritization for disaster response practitioners.
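For readers unfamiliar with the McNemar test mentioned above, a minimal sketch follows. It compares two classifiers evaluated on the same items using only the discordant pairs: `b` counts items the first classifier got right and the second got wrong, and `c` the reverse. The counts in the example are made up for illustration and are not taken from the paper; the p-value uses the chi-square approximation with one degree of freedom.

```python
import math


def mcnemar(b, c, correction=True):
    """Return (statistic, p_value) for McNemar's test on discordant counts."""
    n = b + c
    if n == 0:
        return 0.0, 1.0  # no disagreements, nothing to test
    diff = abs(b - c) - (1 if correction else 0)  # Edwards continuity correction
    stat = diff * diff / n
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts: single-view right / multi-view wrong = 3, the reverse = 25.
stat, p = mcnemar(3, 25)
print(stat, p)
```

The test ignores items on which both pipelines agree, which is why it suits paired comparisons such as single-view versus multi-view assessment of the same structures. For small discordant counts an exact binomial version is usually preferred over this approximation.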

[CV-55] HydroVision: Predicting Optically Active Parameters in Surface Water Using Computer Vision

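[Quick read]: This paper addresses the reliance of conventional water quality monitoring on costly multispectral or hyperspectral remote sensing, and proposes HydroVision, a non-contact framework that estimates water quality parameters from standard RGB images. It uses deep learning to predict several optically active parameters (such as Chlorophyll-Alpha, Colored Dissolved Organic Matter (CDOM), and suspended sediments) from widely available RGB imagery, and evaluates four mainstream convolutional neural networks (VGG-16, ResNet50, MobileNetV2, DenseNet121) and a Vision Transformer through transfer learning to find the best architecture. DenseNet121 performs best, reaching a validation R² of 0.89 for CDOM prediction, which suggests the approach can serve as a low-cost, scalable alternative for water monitoring in disaster response and routine regulatory monitoring.

Link: https://arxiv.org/abs/2509.01882
Authors: Shubham Laxmikant Deshmukh, Matthew Wilchek, Feras A. Batarseh
Affiliations: Virginia Tech
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This paper is under peer review for IEEE Journal of Oceanic Engineering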

Abstract:Ongoing advancements in computer vision, particularly in pattern recognition and scene classification, have enabled new applications in environmental monitoring. Deep learning now offers non-contact methods for assessing water quality and detecting contamination, both critical for disaster response and public health protection. This work introduces HydroVision, a deep learning-based scene classification framework that estimates optically active water quality parameters including Chlorophyll-Alpha, Chlorophylls, Colored Dissolved Organic Matter (CDOM), Phycocyanins, Suspended Sediments, and Turbidity from standard Red-Green-Blue (RGB) images of surface water. HydroVision supports early detection of contamination trends and strengthens monitoring by regulatory agencies during external environmental stressors, industrial activities, and force majeure events. The model is trained on more than 500,000 seasonally varied images collected from the United States Geological Survey Hydrologic Imagery Visualization and Information System between 2022 and 2024. This approach leverages widely available RGB imagery as a scalable, cost-effective alternative to traditional multispectral and hyperspectral remote sensing. Four state-of-the-art convolutional neural networks (VGG-16, ResNet50, MobileNetV2, DenseNet121) and a Vision Transformer are evaluated through transfer learning to identify the best-performing architecture. DenseNet121 achieves the highest validation performance, with an R² score of 0.89 in predicting CDOM, demonstrating the framework's promise for real-world water quality monitoring across diverse conditions. While the current model is optimized for well-lit imagery, future work will focus on improving robustness under low-light and obstructed scenarios to expand its operational utility.
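The R² score quoted above is the coefficient of determination, which compares a model's squared error with that of always predicting the mean. A minimal sketch of the standard definition follows; the sample values are made up and unrelated to the paper's data.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot


# Made-up measurements and predictions.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(r2_score(measured, predicted))
```

A score of 1 means perfect prediction and 0 means no better than the mean, so a value of 0.89 indicates that most of the variance in the target is explained.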

[CV-56] AI-Driven Marine Robotics: Emerging Trends in Underwater Perception and Ecosystem Monitoring

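[Quick read]: This paper addresses the growing pressure that climate change places on marine ecosystems, where the core challenge is scalable, sustainable underwater monitoring. It examines how AI and computer vision are reshaping underwater perception, driven by three factors: the environmental need for ecosystem-scale monitoring, the democratization of underwater datasets through citizen science platforms, and the migration of researchers from saturated terrestrial computer vision domains. These drivers are pushing advances in weakly supervised learning, open-set recognition, and robust perception. The resulting methods improve underwater scene understanding and 3D reconstruction and advance foundation models and self-supervised learning, with benefits for general computer vision, robotics, and environmental monitoring.

Link: https://arxiv.org/abs/2509.01878
Authors: Scarlett Raine, Tobias Fischer
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 9 pages, 3 figures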

Abstract:Marine ecosystems face increasing pressure due to climate change, driving the need for scalable, AI-powered monitoring solutions. This paper examines the rapid emergence of underwater AI as a major research frontier and analyzes the factors that have transformed marine perception from a niche application into a catalyst for AI innovation. We identify three convergent drivers: environmental necessity for ecosystem-scale monitoring, democratization of underwater datasets through citizen science platforms, and researcher migration from saturated terrestrial computer vision domains. Our analysis reveals how unique underwater challenges - turbidity, cryptic species detection, expert annotation bottlenecks, and cross-ecosystem generalization - are driving fundamental advances in weakly supervised learning, open-set recognition, and robust perception under degraded conditions. We survey emerging trends in datasets, scene understanding and 3D reconstruction, highlighting the paradigm shift from passive observation toward AI-driven, targeted intervention capabilities. The paper demonstrates how underwater constraints are pushing the boundaries of foundation models, self-supervised learning, and perception, with methodological innovations that extend far beyond marine applications to benefit general computer vision, robotics, and environmental monitoring.

[CV-57] Doctoral Thesis: Geometric Deep Learning For Camera Pose Prediction Registration Depth Estimation and 3D Reconstruction

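[Quick read]: This thesis addresses two problems in 3D vision. First, deep learning models are hard to train on 3D data because of its high dimensionality and the scarcity of labeled datasets. Second, traditional Structure-from-Motion (SfM) and Simultaneous Localization and Mapping (SLAM) techniques struggle in unstructured environments to produce detailed geometric representations suitable for downstream tasks such as rendering and semantic analysis. The key to the solution is to integrate geometric priors (such as depth maps, surface normals, and equivariance constraints) into deep learning models, producing robust geometry-aware models that improve camera pose estimation, point cloud registration, depth prediction, and high-fidelity 3D reconstruction. The methods are validated in digital cultural heritage preservation and immersive VR/AR settings.

Link: https://arxiv.org/abs/2509.01873
Authors: Xueyang Kang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 175 pages, 66 figures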

Abstract:Modern deep learning developments create new opportunities for 3D mapping technology, scene reconstruction pipelines, and virtual reality development. Despite advances in 3D deep learning technology, direct training of deep learning models on 3D data faces challenges due to the high dimensionality inherent in 3D data and the scarcity of labeled datasets. Structure-from-motion (SfM) and Simultaneous Localization and Mapping (SLAM) exhibit robust performance when applied to structured indoor environments but often struggle with ambiguous features in unstructured environments. These techniques often struggle to generate detailed geometric representations effective for downstream tasks such as rendering and semantic analysis. Current limitations require the development of 3D representation methods that combine traditional geometric techniques with deep learning capabilities to generate robust geometry-aware deep learning models. The dissertation provides solutions to the fundamental challenges in 3D vision by developing geometric deep learning methods tailored for essential tasks such as camera pose estimation, point cloud registration, depth prediction, and 3D reconstruction. The integration of geometric priors or constraints, such as including depth information, surface normals, and equivariance into deep learning models, enhances both the accuracy and robustness of geometric representations. This study systematically investigates key components of 3D vision, including camera pose estimation, point cloud registration, depth estimation, and high-fidelity 3D reconstruction, demonstrating their effectiveness across real-world applications such as digital cultural heritage preservation and immersive VR/AR environments. 

[CV-58] Enabling Federated Object Detection for Connected Autonomous Vehicles: A Deployment-Oriented Evaluation

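[Quick read]: This paper addresses three challenges in deploying federated learning (FL) based object detection on Connected Autonomous Vehicles (CAVs): performance loss from non-IID data, training and inference bottlenecks on constrained onboard hardware, and reduced robustness under environmental variation such as lighting and weather. The key to the solution is the first holistic, deployment-oriented evaluation framework, which integrates detection accuracy, system-level resource profiling, and environmental robustness testing. Using YOLOv5, YOLOv8, YOLOv11, and Deformable DETR on the KITTI, BDD100K, and nuScenes datasets, it quantifies the trade-offs between accuracy and computational cost across resolutions, batch sizes, dynamic client participation, and weather and lighting conditions, giving actionable guidance for reliable FL deployment in CAVs.

Link: https://arxiv.org/abs/2509.01868
Authors: Komala Subramanyam Cherukuri, Kewei Sha, Zhenhua Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: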

Abstract:Object detection is crucial for Connected Autonomous Vehicles (CAVs) to perceive their surroundings and make safe driving decisions. Centralized training of object detection models often achieves promising accuracy, fast convergence, and simplified training process, but it falls short in scalability, adaptability, and privacy-preservation. Federated learning (FL), by contrast, enables collaborative, privacy-preserving, and continuous training across naturally distributed CAV fleets. However, deploying FL in real-world CAVs remains challenging due to the substantial computational demands of training and inference, coupled with highly diverse operating conditions. Practical deployment must address three critical factors: (i) heterogeneity from non-IID data distributions, (ii) constrained onboard computing hardware, and (iii) environmental variability such as lighting and weather, alongside systematic evaluation to ensure reliable performance. This work introduces the first holistic deployment-oriented evaluation of FL-based object detection in CAVs, integrating model performance, system-level resource profiling, and environmental robustness. Using state-of-the-art detectors, YOLOv5, YOLOv8, YOLOv11, and Deformable DETR, evaluated on the KITTI, BDD100K, and nuScenes datasets, we analyze trade-offs between detection accuracy, computational cost, and resource usage under diverse resolutions, batch sizes, weather and lighting conditions, and dynamic client participation, paving the way for robust FL deployment in CAVs.
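As background for the federated setting, one round of server-side aggregation can be sketched with sample-weighted averaging (the widely used FedAvg rule). The abstract does not say which aggregation rule the evaluation uses, so treating it as FedAvg is an assumption, and `federated_average` with its flat weight lists is a hypothetical simplification of real detector parameters.

```python
def federated_average(client_updates):
    """Sample-weighted mean of client weights.

    client_updates: list of (num_samples, weights), where weights is a flat
    list of floats and every client reports the same number of weights.
    Only clients that took part in the current round should be passed in,
    which is how dynamic client participation enters the picture.
    """
    total = sum(n for n, _ in client_updates)
    if total == 0:
        raise ValueError("no training samples reported by participating clients")
    dim = len(client_updates[0][1])
    return [sum(n * w[i] for n, w in client_updates) / total for i in range(dim)]


# Two vehicles with different amounts of local data (made-up numbers).
round_updates = [(1, [0.0, 2.0]), (3, [4.0, 2.0])]
print(federated_average(round_updates))
```

Weighting by sample count means vehicles with more local data pull the global model harder, which is one reason non-IID data distributions can skew the result.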

[CV-59] Latent Gene Diffusion for Spatial Transcriptomics Completion ICCV2025

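[Quick read]: This paper addresses the loss of gene expression prediction accuracy caused by data dropout in Spatial Transcriptomics (ST) data. Existing methods usually rely on single-cell RNA sequencing references, which makes them sensitive to alignment quality and exposes them to batch effects and inherited dropout. The key to the solution is LGDiST, the first reference-free latent gene diffusion model. Its core innovation is to use context genes, previously considered uninformative, to build a rich and biologically meaningful genetic latent space, combined with an ST latent space and neighbor conditioning. Across 26 datasets LGDiST lowers average Mean Squared Error (MSE) by 18%, and completing ST data with it improves six state-of-the-art gene expression prediction methods by up to 10% in MSE.

Link: https://arxiv.org/abs/2509.01864
Authors: Paula Cárdenas, Leonardo Manrique, Daniela Vega, Daniela Ruiz, Pablo Arbeláez
Affiliations: Universidad de los Andes
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 8 figures. Accepted to CVAMD Workshop, ICCV 2025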

Abstract:Computer Vision has proven to be a powerful tool for analyzing Spatial Transcriptomics (ST) data. However, current models that predict spatially resolved gene expression from histopathology images suffer from significant limitations due to data dropout. Most existing approaches rely on single-cell RNA sequencing references, making them dependent on alignment quality and external datasets while also risking batch effects and inherited dropout. In this paper, we address these limitations by introducing LGDiST, the first reference-free latent gene diffusion model for ST data dropout. We show that LGDiST outperforms the previous state-of-the-art in gene expression completion, with an average Mean Squared Error that is 18% lower across 26 datasets. Furthermore, we demonstrate that completing ST data with LGDiST improves gene expression prediction performance on six state-of-the-art methods up to 10% in MSE. A key innovation of LGDiST is using context genes previously considered uninformative to build a rich and biologically meaningful genetic latent space. Our experiments show that removing key components of LGDiST, such as the context genes, the ST latent space, and the neighbor conditioning, leads to considerable drops in performance. These findings underscore that the full architecture of LGDiST achieves substantially better performance than any of its isolated components.

[CV-60] HodgeFormer: Transformers for Learnable Operators on Triangular Meshes through Data-Driven Hodge Matrices

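[Quick read]: This paper addresses the reliance of current Transformer-based graph and mesh shape analysis methods on costly eigendecomposition (for example, of the Laplacian matrix) to build positional embeddings. Such methods typically derive positional encodings from spectral features or heat-kernel signatures and concatenate them to the input features, which is computationally expensive and requires heavy preprocessing. The key to the solution is to draw on the Hodge Laplacian operator from Discrete Exterior Calculus and its explicit construction $L := \star_0^{-1} d_0^T \star_1 d_0$. A new deep learning layer uses multi-head attention to approximate the discrete Hodge matrices $\star_0$, $\star_1$, and $\star_2$ and to learn families of discrete operators $L$ acting on mesh vertices, edges, and faces, enabling efficient mesh segmentation and classification without eigendecomposition or complex preprocessing.

Link: https://arxiv.org/abs/2509.01839
Authors: Akis Nousias, Stavros Nousias
Affiliations: K3Y Labs; TUM Georg Nemetschek Institute; Technical University of Munich
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 11 figures, 9 tables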

Abstract:Currently, prominent Transformer architectures applied on graphs and meshes for shape analysis tasks employ traditional attention layers that heavily utilize spectral features requiring costly eigenvalue decomposition-based methods. To encode the mesh structure, these methods derive positional embeddings that heavily rely on eigenvalue decomposition-based operations, e.g. on the Laplacian matrix, or on heat-kernel signatures, which are then concatenated to the input features. This paper proposes a novel approach inspired by the explicit construction of the Hodge Laplacian operator in Discrete Exterior Calculus as a product of discrete Hodge operators and exterior derivatives, i.e. $L := \star_0^{-1} d_0^T \star_1 d_0$. We adjust the Transformer architecture in a novel deep learning layer that utilizes the multi-head attention mechanism to approximate Hodge matrices $\star_0$, $\star_1$ and $\star_2$ and learn families of discrete operators $L$ that act on mesh vertices, edges and faces. Our approach results in a computationally efficient architecture that achieves comparable performance in mesh segmentation and classification tasks, through a direct learning framework, while eliminating the need for costly eigenvalue decomposition operations or complex preprocessing operations.
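The construction $L := \star_0^{-1} d_0^T \star_1 d_0$ can be made concrete on a single triangle. The sketch below assumes diagonal Hodge matrices given as plain lists of positive numbers, whereas the paper learns these matrices with attention, so the fixed values and the helper names `exterior_derivative` and `hodge_laplacian` are purely illustrative.

```python
def exterior_derivative(num_vertices, edges):
    """Build d_0: one row per oriented edge (i, j) with -1 at i and +1 at j."""
    d0 = [[0.0] * num_vertices for _ in edges]
    for row, (i, j) in enumerate(edges):
        d0[row][i] = -1.0
        d0[row][j] = 1.0
    return d0


def hodge_laplacian(d0, star0, star1):
    """Compute star0^{-1} d0^T star1 d0 for diagonal star0 and star1."""
    n = len(star0)
    return [
        [
            sum(d0[e][i] * star1[e] * d0[e][j] for e in range(len(d0))) / star0[i]
            for j in range(n)
        ]
        for i in range(n)
    ]


# One triangle: three vertices, three oriented edges.
edges = [(0, 1), (1, 2), (0, 2)]
d0 = exterior_derivative(3, edges)
L = hodge_laplacian(d0, star0=[1.0, 1.0, 1.0], star1=[1.0, 1.0, 1.0])
for row in L:
    print(row)
```

With identity Hodge matrices the result reduces to the ordinary graph Laplacian of the triangle. Each row sums to zero for any positive diagonal stars, reflecting that constant functions lie in the kernel of $d_0$.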

[CV-61] PractiLight: Practical Light Control Using Foundational Diffusion Models

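[Quick read]: This paper addresses lighting control in generated images, that is, controlling illumination precisely across scenes and across the frequency spectrum. Most approaches train on large but domain-specific datasets, which limits the generalization and applicability of the foundation models they build on. The key to the solution is PractiLight, whose core insight is that lighting relationships in an image resemble token interactions in self-attention layers and are therefore best represented there. Building on this and on the importance of early diffusion iterations, it trains a lightweight LoRA regressor on a small set of images to produce the direct irradiance map for a given image, then uses Classifier Guidance to inject the desired lighting into the generation of another image, giving efficient, general, and controllable relighting.

Link: https://arxiv.org/abs/2509.01837
Authors: Yotam Erel, Rishabh Dabral, Vladislav Golyanik, Amit H. Bermano, Christian Theobalt
Affiliations: Tel Aviv University; Max Planck Institute for Informatics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL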

Abstract:Light control in generated images is a difficult task, posing specific challenges, spanning over the entire image and frequency spectrum. Most approaches tackle this problem by training on extensive yet domain-specific datasets, limiting the inherent generalization and applicability of the foundational backbones used. Instead, PractiLight is a practical approach, effectively leveraging foundational understanding of recent generative models for the task. Our key insight is that lighting relationships in an image are similar in nature to token interaction in self-attention layers, and hence are best represented there. Based on this and other analyses regarding the importance of early diffusion iterations, PractiLight trains a lightweight LoRA regressor to produce the direct irradiance map for a given image, using a small set of training images. We then employ this regressor to incorporate the desired lighting into the generation process of another image using Classifier Guidance. This careful design generalizes well to diverse conditions and image domains. We demonstrate state-of-the-art performance in terms of quality and control with proven parameter and data efficiency compared to leading works over a wide variety of scene types. We hope this work affirms that image lighting can feasibly be controlled by tapping into foundational knowledge, enabling practical and general relighting.

[CV-62] Mixture of Balanced Information Bottlenecks for Long-Tailed Visual Recognition

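【Quick Read】: This paper addresses the difficulty of training and deploying deep neural networks (DNNs) efficiently when real-world visual recognition data follow a long-tailed distribution. The core challenge is learning discriminative feature representations despite class imbalance. The key to the solution is the Balanced Information Bottleneck (BIB), which integrates loss-function re-balancing and self-distillation into the original information bottleneck network so that label-related information is fully preserved in the compressed representation. The authors further propose a Mixture of Multiple Balanced Information Bottlenecks (MBIB), in which different BIBs combine knowledge from different network layers, enabling end-to-end joint training of representation and classification from an information-theoretic perspective and reaching state-of-the-art long-tailed recognition performance.

Link: https://arxiv.org/abs/2509.01804
Authors: Yifan Lan, Xin Cai, Jun Cheng, Shan Tan
Affiliations: Huazhong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
Comments: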

Abstract:Deep neural networks (DNNs) have achieved significant success in various applications with large-scale and balanced data. However, data in real-world visual recognition are usually long-tailed, bringing challenges to efficient training and deployment of DNNs. Information bottleneck (IB) is an elegant approach for representation learning. In this paper, we propose a balanced information bottleneck (BIB) approach, in which loss function re-balancing and self-distillation techniques are integrated into the original IB network. BIB is thus capable of learning a sufficient representation with essential label-related information fully preserved for long-tailed visual recognition. To further enhance the representation learning capability, we also propose a novel structure of mixture of multiple balanced information bottlenecks (MBIB), where different BIBs are responsible for combining knowledge from different network layers. MBIB facilitates an end-to-end learning strategy that trains representation and classification simultaneously from an information theory perspective. We conduct experiments on commonly used long-tailed datasets, including CIFAR100-LT, ImageNet-LT, and iNaturalist 2018. Both BIB and MBIB reach state-of-the-art performance for long-tailed visual recognition.
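
To make "loss function re-balancing" concrete, here is a minimal sketch in plain Python. It uses the Balanced Softmax form (logits shifted by the log of the class frequency) purely as an illustration; the abstract does not say which re-balancing term BIB uses, and the function name and toy class counts are hypothetical.

```python
import math

def balanced_softmax_loss(logits, label, class_counts):
    # Shift each logit by the log of its class frequency (assumed re-balancing rule).
    shifted = [z + math.log(n) for z, n in zip(logits, class_counts)]
    # Numerically stable log-sum-exp over the shifted logits.
    top = max(shifted)
    log_norm = top + math.log(sum(math.exp(s - top) for s in shifted))
    # Cross-entropy of the true class under the shifted distribution.
    return log_norm - shifted[label]

# Toy two-class problem: 900 head samples, 100 tail samples, uninformative logits.
head_loss = balanced_softmax_loss([0.0, 0.0], 0, [900, 100])
tail_loss = balanced_softmax_loss([0.0, 0.0], 1, [900, 100])
uniform_loss = balanced_softmax_loss([0.0, 0.0], 0, [5, 5])
```

With equal class counts the loss reduces to ordinary cross-entropy. With imbalanced counts, the tail class incurs a larger loss for the same logits, which pushes the network to raise tail-class logits during training.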

[CV-63] EgoTouch: On-Body Touch Input Using AR/VR Headset Cameras

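【Quick Read】: This paper addresses the limited efficiency and naturalness of user interaction in augmented and virtual reality (AR/VR), in particular the speed, accuracy, and ergonomic shortcomings of today's common in-air interfaces. The core solution is bare-hand skin input that needs no additional sensors and uses only the RGB cameras already built into XR headsets. Deep learning and computer vision extract accurate touch-contact information that remains robust across lighting conditions, skin tones, and body motion. The key is an end-to-end pipeline that can run on modern XR headsets and that, besides being accurate, outputs rich input metadata including touch force, finger identification, angle of attack, and rotation, providing the technical basis for practical on-skin interfaces.

Link: https://arxiv.org/abs/2509.01786
Authors: Vimal Mollyn, Chris Harrison
Affiliations: Carnegie Mellon University
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Published at UIST 2024. More info at this https URL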

Abstract:In augmented and virtual reality (AR/VR) experiences, a user’s arms and hands can provide a convenient and tactile surface for touch input. Prior work has shown on-body input to have significant speed, accuracy, and ergonomic benefits over in-air interfaces, which are common today. In this work, we demonstrate high accuracy, bare hands (i.e., no special instrumentation of the user) skin input using just an RGB camera, like those already integrated into all modern XR headsets. Our results show this approach can be accurate, and robust across diverse lighting conditions, skin tones, and body motion (e.g., input while walking). Finally, our pipeline also provides rich input metadata including touch force, finger identification, angle of attack, and rotation. We believe these are the requisite technical ingredients to more fully unlock on-skin interfaces that have been well motivated in the HCI literature but have lacked robust and practical methods.

[CV-64] TransMatch: A Transfer-Learning Framework for Defect Detection in Laser Powder Bed Fusion Additive Manufacturing

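【Quick Read】: This paper addresses the poor generalization of surface-defect detection models in Laser Powder Bed Fusion (LPBF) manufacturing caused by the scarcity of labeled data. The key to the solution is TransMatch, a framework that combines transfer learning with semi-supervised few-shot learning. By using a small amount of labeled data together with unlabeled novel-class images, it overcomes the limitations of earlier meta-learning approaches in data-scarce settings and accurately identifies multiple defect types (cracks, pinholes, holes, and spatter), supporting its robustness and scalability for industrial quality assurance.

Link: https://arxiv.org/abs/2509.01754
Authors: Mohsen Asghari Ilani, Yaser Mike Banad
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computational Physics (physics.comp-ph)
Comments: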

Abstract:Surface defects in Laser Powder Bed Fusion (LPBF) pose significant risks to the structural integrity of additively manufactured components. This paper introduces TransMatch, a novel framework that merges transfer learning and semi-supervised few-shot learning to address the scarcity of labeled AM defect data. By effectively leveraging both labeled and unlabeled novel-class images, TransMatch circumvents the limitations of previous meta-learning approaches. Experimental evaluations on a Surface Defects dataset of 8,284 images demonstrate the efficacy of TransMatch, achieving 98.91% accuracy with minimal loss, alongside high precision, recall, and F1-scores for multiple defect classes. These findings underscore its robustness in accurately identifying diverse defects, such as cracks, pinholes, holes, and spatter. TransMatch thus represents a significant leap forward in additive manufacturing defect detection, offering a practical and scalable solution for quality assurance and reliability across a wide range of industrial applications.

[CV-65] Clinical Metadata Guided Limited-Angle CT Image Reconstruction

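【Quick Read】: This paper addresses the severe artifacts that truncated projection data cause in limited-angle computed tomography (LACT), which degrade reconstruction quality and make lesions hard to identify. The key to the solution is a two-stage diffusion framework guided by structured clinical metadata (acquisition parameters, patient demographics, and diagnostic impressions). In the first stage, a transformer-based diffusion model conditioned only on metadata generates a coarse anatomical prior. In the second stage, the prior and the metadata are combined to refine image detail. Both stages enforce physics-based data consistency at every sampling step with an Alternating Direction Method of Multipliers (ADMM) module, keeping the reconstruction aligned with the measured projections. Experiments show clear gains in SSIM, PSNR, nMI, and PCC under severe angular truncation, and the different metadata types are complementary, with diagnostic and demographic information contributing most.

Link: https://arxiv.org/abs/2509.01752
Authors: Yu Shi, Shuyi Fan, Changsheng Fang, Shuo Han, Haodong Li, Li Zhou, Bahareh Morovati, Dayang Wang, Hengyong Yu
Affiliations: University of Massachusetts Lowell
Categories: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: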

Abstract:Limited-angle computed tomography (LACT) offers improved temporal resolution and reduced radiation dose for cardiac imaging, but suffers from severe artifacts due to truncated projections. To address the ill-posedness of LACT reconstruction, we propose a two-stage diffusion framework guided by structured clinical metadata. In the first stage, a transformer-based diffusion model conditioned exclusively on metadata, including acquisition parameters, patient demographics, and diagnostic impressions, generates coarse anatomical priors from noise. The second stage further refines the images by integrating both the coarse prior and metadata to produce high-fidelity results. Physics-based data consistency is enforced at each sampling step in both stages using an Alternating Direction Method of Multipliers module, ensuring alignment with the measured projections. Extensive experiments on both synthetic and real cardiac CT datasets demonstrate that incorporating metadata significantly improves reconstruction fidelity, particularly under severe angular truncation. Compared to existing metadata-free baselines, our method achieves superior performance in SSIM, PSNR, nMI, and PCC. Ablation studies confirm that different types of metadata contribute complementary benefits, particularly diagnostic and demographic priors under limited-angle conditions. These findings highlight the dual role of clinical metadata in improving both reconstruction quality and efficiency, supporting their integration into future metadata-guided medical imaging frameworks.

[CV-66] BM-CL: Bias Mitigation through the lens of Continual Learning

【Quick Read】: This paper addresses bias in machine learning, in particular the "leveling-down effect" of traditional bias mitigation techniques, where improving outcomes for disadvantaged groups comes at the cost of reduced performance for advantaged groups. The key to the solution is to recast bias mitigation as a continual learning problem. Drawing on mechanisms such as Learning without Forgetting and Elastic Weight Consolidation, the model incrementally adjusts to fairness objectives, improving predictions for disadvantaged groups while retaining the knowledge that gives advantaged groups their original performance, and so balances fairness against performance.

Link: https://arxiv.org/abs/2509.01730
Authors: Lucas Mansilla, Rodrigo Echeveste, Camila Gonzalez, Diego H. Milone, Enzo Ferrante
Affiliations: sinc(i), CONICET - Universidad Nacional del Litoral; AIDE Lab, Stanford University; Institute of Computer Sciences (ICC), CONICET - Universidad de Buenos Aires
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Biases in machine learning pose significant challenges, particularly when models amplify disparities that affect disadvantaged groups. Traditional bias mitigation techniques often lead to a leveling-down effect, whereby improving outcomes of disadvantaged groups comes at the expense of reduced performance for advantaged groups. This study introduces Bias Mitigation through Continual Learning (BM-CL), a novel framework that leverages the principles of continual learning to address this trade-off. We postulate that mitigating bias is conceptually similar to domain-incremental continual learning, where the model must adjust to changing fairness conditions, improving outcomes for disadvantaged groups without forgetting the knowledge that benefits advantaged groups. Drawing inspiration from techniques such as Learning without Forgetting and Elastic Weight Consolidation, we reinterpret bias mitigation as a continual learning problem. This perspective allows models to incrementally balance fairness objectives, enhancing outcomes for disadvantaged groups while preserving performance for advantaged groups. Experiments on synthetic and real-world image datasets, characterized by diverse sources of bias, demonstrate that the proposed framework mitigates biases while minimizing the loss of original knowledge. Our approach bridges the fields of fairness and continual learning, offering a promising pathway for developing machine learning systems that are both equitable and effective.
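
The Elastic Weight Consolidation idea mentioned in the abstract can be sketched as a quadratic penalty on parameter drift. This is a minimal illustration, assuming the anchor parameters are those of the original model and that `task_loss` stands in for a fairness-oriented objective; the abstract does not state how BM-CL sets these, and every name below is hypothetical.

```python
def ewc_loss(task_loss, params, anchor_params, fisher, lam):
    """Return task_loss + (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2."""
    # Parameters with large Fisher information F_i are expensive to move,
    # which protects what the original model already does well.
    penalty = sum(
        f * (p - a) ** 2
        for p, a, f in zip(params, anchor_params, fisher)
    )
    return task_loss + 0.5 * lam * penalty

# Toy example: only the second parameter moved, and it has low importance.
total = ewc_loss(
    task_loss=1.0,
    params=[1.0, 2.0],
    anchor_params=[1.0, 1.0],
    fisher=[10.0, 0.1],
    lam=2.0,
)
```

Here the penalty is 0.5 * 2.0 * 0.1 * (2.0 - 1.0)^2 = 0.1. The same drift in the first parameter, whose Fisher value is 10.0, would cost 10.0 instead.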

[CV-67] Articulated Object Estimation in the Wild

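【Quick Read】: This paper addresses the estimation of the structure and motion of articulated objects from RGB-D video under dynamic camera motion and partial observability. Earlier methods mostly assume fixed camera viewpoints or direct observation of object states and tend to fail in realistic, unconstrained scenes, whereas humans readily infer articulation by watching others manipulate objects. The authors propose ArtiPoint, whose core idea is to combine deep point tracking with factor graph optimization so that articulated part trajectories and articulation axes are recovered robustly from raw RGB-D video. ArtiPoint outperforms classical and learning-based baselines on the accompanying Arti4D dataset.

Link: https://arxiv.org/abs/2509.01708
Authors: Abdelrhman Werby, Martin Büchner, Adrian Röfer, Chenguang Huang, Wolfram Burgard, Abhinav Valada
Affiliations: University of Freiburg; University of Stuttgart; University of Technology Nuremberg
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9th Conference on Robot Learning (CoRL), 2025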

Abstract:Understanding the 3D motion of articulated objects is essential in robotic scene understanding, mobile manipulation, and motion planning. Prior methods for articulation estimation have primarily focused on controlled settings, assuming either fixed camera viewpoints or direct observations of various object states, which tend to fail in more realistic unconstrained environments. In contrast, humans effortlessly infer articulation by watching others manipulate objects. Inspired by this, we introduce ArtiPoint, a novel estimation framework that can infer articulated object models under dynamic camera motion and partial observability. By combining deep point tracking with a factor graph optimization framework, ArtiPoint robustly estimates articulated part trajectories and articulation axes directly from raw RGB-D videos. To foster future research in this domain, we introduce Arti4D, the first ego-centric in-the-wild dataset that captures articulated object interactions at a scene level, accompanied by articulation labels and ground-truth camera poses. We benchmark ArtiPoint against a range of classical and learning-based baselines, demonstrating its superior performance on Arti4D. We make code and Arti4D publicly available at this https URL.

[CV-68] Deep Learning-Based Rock Particulate Classification Using Attention-Enhanced ConvNeXt

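【Quick Read】: This paper addresses the accuracy of rock size classification, which matters for operational efficiency and safety in geotechnical engineering, mining, and resource management. The key to the solution is CNSCA, an enhanced deep learning model built on the ConvNeXt architecture. It adds self-attention to capture long-range spatial dependencies and channel attention to emphasize informative feature channels, combining fine-grained local patterns with global context and improving classification accuracy and robustness on natural-texture tasks such as rock imagery.

Link: https://arxiv.org/abs/2509.01704
Authors: Anthony Amankwah, Chris Aldrich
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: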

Abstract: Accurate classification of rock sizes is a vital component in geotechnical engineering, mining, and resource management, where precise estimation influences operational efficiency and safety. In this paper, we propose an enhanced deep learning model based on the ConvNeXt architecture, augmented with both self-attention and channel attention mechanisms. Building upon the foundation of ConvNeXt, our proposed model, termed CNSCA, introduces self-attention to capture long-range spatial dependencies and channel attention to emphasize informative feature channels. This hybrid design enables the model to effectively capture both fine-grained local patterns and broader contextual relationships within rock imagery, leading to improved classification accuracy and robustness. We evaluate our model on a rock size classification dataset and compare it against three strong baselines. The results demonstrate that the incorporation of attention mechanisms significantly enhances the model's capability for fine-grained classification tasks involving natural textures like rocks.
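
As a rough picture of what "channel attention to emphasize informative feature channels" can mean, the sketch below implements a squeeze-and-excitation style gate in plain Python. It is an assumption that CNSCA's channel attention resembles this form; the abstract gives no details, and the single linear layer with its `weights` and `biases` is a hypothetical simplification.

```python
import math

def channel_attention(channels, weights, biases):
    """Reweight channels by a learned gate in (0, 1).

    channels: list of C channels, each a flat list of activations.
    weights:  C x C matrix of one linear layer (hypothetical parameters).
    biases:   list of C biases.
    """
    # Squeeze: global average pooling per channel.
    pooled = [sum(ch) / len(ch) for ch in channels]
    # Excite: linear layer followed by a sigmoid gate.
    gates = []
    for row, b in zip(weights, biases):
        z = sum(w * p for w, p in zip(row, pooled)) + b
        gates.append(1.0 / (1.0 + math.exp(-z)))
    # Scale: multiply every activation in a channel by that channel's gate.
    return [[g * v for v in ch] for g, ch in zip(gates, channels)]

# With all-zero parameters every gate is sigmoid(0) = 0.5.
features = [[1.0, 3.0], [2.0, 2.0]]
out = channel_attention(features, [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
```

In a trained network the gates differ per channel, so informative channels pass through almost unchanged while less useful ones are damped.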

[CV-69] Examination of PCA Utilisation for Multilabel Classifier of Multispectral Images

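【Quick Read】: This paper addresses the processing complexity that high-dimensional features introduce in multi-label classification of multispectral images, with a focus on extracting discriminative features that improve classification. The key to the solution is to use Principal Component Analysis (PCA) as an optional preprocessing step that reduces the data to three dimensions, and to evaluate it with two deep learning architectures, ResNet50 and DINOv2. The results show that the effectiveness of PCA depends strongly on the chosen architecture and training strategy, which points to future work on self-supervised pre-training and alternative dimensionality reduction methods.

Link: https://arxiv.org/abs/2509.01691
Authors: Filip Karpowicz, Wiktor Kępiński, Bartosz Staszyński, Grzegorz Sarwas
Affiliations: Warsaw University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: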

Abstract:This paper investigates the utility of Principal Component Analysis (PCA) for multi-label classification of multispectral images using ResNet50 and DINOv2, acknowledging the high dimensionality of such data and the associated processing challenges. Multi-label classification, where each image may belong to multiple classes, adds further complexity to feature extraction. Our pipeline includes an optional PCA step that reduces the data to three dimensions before feeding it into a three-layer classifier. The findings demonstrate that the effectiveness of PCA for multi-label multispectral image classification depends strongly on the chosen deep learning architecture and training strategy, opening avenues for future research into self-supervised pre-training and alternative dimensionality reduction approaches.

[CV-70] GaussianGAN: Real-Time Photorealistic controllable Human Avatars

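【Quick Read】: This paper addresses the noticeable blurring in current neural-rendering human avatars rendered in real time. The key to the solution is GaussianGAN, which has three main components: a Gaussian splatting densification strategy that builds Gaussian points from the surface of cylindrical structures around estimated skeletal limbs; a view segmentation module that uses the camera calibration to render an accurate semantic segmentation; and a UNet generator that fuses the rendered Gaussian splatting features with the segmentation maps to produce photorealistic, animatable avatars. The method reaches a pixel fidelity of 32.94 dB on the ZJU Mocap dataset and 33.39 dB on the Thuman4 dataset at a rendering speed of 79 FPS, outperforming previous methods.

Link: https://arxiv.org/abs/2509.01681
Authors: Mohamed Ilyes Lakhal, Richard Bowden
Affiliations: CVSSP, University of Surrey
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE conference series on Automatic Face and Gesture Recognition 2025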

Abstract: Photorealistic and controllable human avatars have gained popularity in the research community thanks to rapid advances in neural rendering, providing fast and realistic synthesis tools. However, a limitation of current solutions is the presence of noticeable blurring. To solve this problem, we propose GaussianGAN, an animatable avatar approach developed for photorealistic rendering of people in real-time. We introduce a novel Gaussian splatting densification strategy to build Gaussian points from the surface of cylindrical structures around estimated skeletal limbs. Given the camera calibration, we render an accurate semantic segmentation with our novel view segmentation module. Finally, a UNet generator uses the rendered Gaussian splatting features and the segmentation maps to create photorealistic digital avatars. Our method runs in real-time with a rendering speed of 79 FPS. It outperforms previous methods regarding visual perception and quality, achieving state-of-the-art pixel fidelity of 32.94 dB on the ZJU Mocap dataset and 33.39 dB on the Thuman4 dataset.
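
The densification idea, placing points on the surface of a cylinder around each estimated limb, can be sketched geometrically as follows. This is a plain-Python illustration under stated assumptions: the ring layout, the `radius`, and the function names are hypothetical, and the paper's actual sampling scheme is not described in the abstract.

```python
import math

def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def _normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def sample_cylinder_surface(joint_a, joint_b, radius, n_rings, n_per_ring):
    """Sample points on a cylinder whose axis runs from joint_a to joint_b."""
    axis = [b - a for a, b in zip(joint_a, joint_b)]
    d = _normalize(axis)
    # Pick a helper vector that is not parallel to the axis, then build
    # two unit vectors u, v spanning the plane perpendicular to it.
    helper = [1.0, 0.0, 0.0] if abs(d[0]) < 0.9 else [0.0, 1.0, 0.0]
    u = _normalize(_cross(d, helper))
    v = _cross(d, u)
    points = []
    for i in range(n_rings):
        t = i / (n_rings - 1) if n_rings > 1 else 0.5
        center = [a + t * c for a, c in zip(joint_a, axis)]
        for j in range(n_per_ring):
            ang = 2.0 * math.pi * j / n_per_ring
            points.append([
                center[k] + radius * (math.cos(ang) * u[k] + math.sin(ang) * v[k])
                for k in range(3)
            ])
    return points

# A vertical limb of length 2 with a 0.5 radius: 3 rings of 8 points each.
limb_points = sample_cylinder_surface([0.0, 0.0, 0.0], [0.0, 0.0, 2.0], 0.5, 3, 8)
```

Each returned point would serve as the initial center of one Gaussian; every point lies exactly `radius` away from the limb axis.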

[CV-71] OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning

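【Quick Read】: This paper addresses the low training efficiency of vision-language pretraining, in particular its heavy compute, long training time, and high memory use. The key to the solution is to simplify the original OpenVision architecture by removing the text encoder and the associated contrastive loss, keeping only the captioning loss as a purely generative training signal. This lightweight generative paradigm matches the original model on multimodal benchmarks while, for ViT-L/14, cutting training time by about 1.5x (from 83 hours to 57 hours) and memory use by about 1.8x (from 24.5 GB to 13.8 GB), raising the maximum batch size from 2k to 8k, and allowing the vision encoder to scale beyond 1 billion parameters.

Link: https://arxiv.org/abs/2509.01644
Authors: Yanqing Liu, Xianhang Li, Letian Zhang, Zirui Wang, Zeyu Zheng, Yuyin Zhou, Cihang Xie
Affiliations: University of California Santa Cruz; Apple; University of California Berkeley
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: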

Abstract:This paper provides a simplification on OpenVision’s architecture and loss design for enhancing its training efficiency. Following the prior vision-language pretraining works CapPa and AIMv2, as well as modern multimodal designs like LLaVA, our changes are straightforward: we remove the text encoder (and therefore the contrastive loss), retaining only the captioning loss as a purely generative training signal. We name this new version OpenVision 2. The initial results are promising: despite this simplification, OpenVision 2 competitively matches the original model’s performance on a broad set of multimodal benchmarks while substantially cutting both training time and memory consumption. For example, with ViT-L/14, it reduces training time by about 1.5x (from 83h to 57h), and memory usage by about 1.8x (from 24.5GB to 13.8GB, equivalently allowing the maximum batch size to grow from 2k to 8k). This superior training efficiency also allows us to scale far beyond the largest vision encoder used in OpenVision, reaching more than 1 billion parameters. We hold a strong belief that this lightweight, generative-only paradigm is compelling for future vision encoder development in multimodal foundation models.
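
The efficiency figures quoted in the abstract can be checked with a few lines of arithmetic. The snippet below only recomputes the ratios from the numbers stated above; the variable names are illustrative.

```python
def ratio(before, after):
    """How many times smaller `after` is than `before` (or larger, if inverted)."""
    return before / after

time_speedup = ratio(83, 57)        # training hours, ViT-L/14
memory_saving = ratio(24.5, 13.8)   # memory in GB
batch_growth = ratio(8, 2)          # maximum batch size in thousands, after / before

print(f"{time_speedup:.2f}x {memory_saving:.2f}x {batch_growth:.0f}x")
# prints "1.46x 1.78x 4x"
```

The first two ratios round to the "about 1.5x" and "about 1.8x" stated in the abstract.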

[CV-72] Q-Sched: Pushing the Boundaries of Few-Step Diffusion Models with Quantization-Aware Scheduling

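【Quick Read】: This paper addresses the high computational cost of text-to-image diffusion models in generative AI, where full-precision inference depends on datacenter-class GPUs and so limits deployment on edge or consumer devices. Existing post-training quantization methods rely on full-precision calibration and struggle to compress models without hurting image quality. The key to the solution is Q-Sched, which quantizes by modifying the diffusion model's scheduler rather than its weights, so that quantization-aware optimization needs only a handful of calibration prompts. Its core contribution is the JAQ loss, which combines text-image compatibility with an image quality metric to learn quantization-aware pre-conditioning coefficients without reference images or full-precision inference. The result is a 4x reduction in model size at full-precision accuracy, with FID improvements over the FP16 few-step baselines.

Link: https://arxiv.org/abs/2509.01624
Authors: Natalia Frumkin, Diana Marculescu
Affiliations: The University of Texas at Austin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: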

Abstract:Text-to-image diffusion models are computationally intensive, often requiring dozens of forward passes through large transformer backbones. For instance, Stable Diffusion XL generates high-quality images with 50 evaluations of a 2.6B-parameter model, an expensive process even for a single batch. Few-step diffusion models reduce this cost to 2-8 denoising steps but still depend on large, uncompressed U-Net or diffusion transformer backbones, which are often too costly for full-precision inference without datacenter GPUs. These requirements also limit existing post-training quantization methods that rely on full-precision calibration. We introduce Q-Sched, a new paradigm for post-training quantization that modifies the diffusion model scheduler rather than model weights. By adjusting the few-step sampling trajectory, Q-Sched achieves full-precision accuracy with a 4x reduction in model size. To learn quantization-aware pre-conditioning coefficients, we propose the JAQ loss, which combines text-image compatibility with an image quality metric for fine-grained optimization. JAQ is reference-free and requires only a handful of calibration prompts, avoiding full-precision inference during calibration. Q-Sched delivers substantial gains: a 15.5% FID improvement over the FP16 4-step Latent Consistency Model and a 16.6% improvement over the FP16 8-step Phased Consistency Model, showing that quantization and few-step distillation are complementary for high-fidelity generation. A large-scale user study with more than 80,000 annotations further confirms Q-Sched’s effectiveness on both FLUX.1[schnell] and SDXL-Turbo.

[CV-73] Improving Large Vision and Language Models by Learning from a Panel of Peers ICCV2025

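【Quick Read】: This paper addresses three problems in aligning Large Vision and Language Models (LVLMs): human-annotated preference data is costly, machine-generated preference data is limited in quality, and self-supervised preference data tends to introduce hallucinations. The key to the solution is a Panel-of-Peers learning framework inspired by collaborative learning among humans. Several LVLMs form a panel that generates, evaluates, and refines outputs through an iterative self-improvement process, simulating classroom-style peer review, and this improves model performance without extensive human-labeled data.

Link: https://arxiv.org/abs/2509.01610
Authors: Jefferson Hernandez, Jing Shi, Simon Jenni, Vicente Ordonez, Kushal Kafle
Affiliations: Rice University; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICCV 2025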

Abstract: Traditional alignment methods for Large Vision and Language Models (LVLMs) primarily rely on human-curated preference data. Human-generated preference data is costly; machine-generated preference data is limited in quality; and self-supervised preference data often introduces hallucinations. To overcome these limitations, we propose a novel Panel-of-Peers learning framework inspired by collaborative learning among humans. This approach leverages a panel of LVLMs, each evaluating and learning from their collective outputs through an iterative self-improvement process. By simulating a peer review system, our models generate, assess, and refine outputs in response to a curated set of prompts, mimicking a classroom learning environment. We demonstrate that this methodology enhances model performance without requiring extensive human-labeled datasets. Our experiments show significant improvement across multiple benchmarks, demonstrating the potential of peer evaluations as a scalable alternative to self-supervised alignment. Notably, we show that Panel-of-Peers increases the average score on fifteen benchmarks from 48% to 57%.
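
One way to picture the peer-review loop is as a procedure that turns panel scores into preference pairs for later training. The sketch below is purely illustrative: averaging the scores and taking the best and worst candidate is an assumed rule, and the abstract does not specify how the panel's judgments are aggregated or used.

```python
def build_preference_pair(candidates, peer_scores):
    """Pick a (chosen, rejected) pair from panel scores.

    candidates:  list of candidate responses, one per panel member.
    peer_scores: peer_scores[i][j] is the score judge i gives candidate j.
    """
    n = len(candidates)
    # Average each candidate's score over all judges (assumed aggregation).
    mean = [sum(row[j] for row in peer_scores) / len(peer_scores) for j in range(n)]
    best = max(range(n), key=mean.__getitem__)
    worst = min(range(n), key=mean.__getitem__)
    return candidates[best], candidates[worst]

# Three panel members answer one prompt and score all three answers.
answers = ["answer A", "answer B", "answer C"]
scores = [[7, 4, 9],
          [6, 5, 8],
          [8, 3, 7]]
chosen, rejected = build_preference_pair(answers, scores)
```

In this toy case the mean scores are 7, 4, and 8, so "answer C" is chosen and "answer B" is rejected; such pairs could then feed a preference-optimization step in each iteration.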

[CV-74] TransForSeg: A Multitask Stereo ViT for Joint Stereo Segmentation and 3D Force Estimation in Catheterization

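【Quick Read】: This paper addresses joint catheter localization and force sensing in interventional procedures, that is, simultaneous catheter segmentation and 3D force estimation from X-ray images. Existing multitask stereo architectures rely on a CNN-based encoder-decoder to capture the dependencies between the two X-ray views. The key to the solution is a stereo Vision Transformer that processes the X-ray images from the two viewpoints as separate sequences, so that self-attention captures long-range dependencies without gradually expanding the receptive field. The embeddings from the encoder and decoder are fed into two shared segmentation heads for catheter segmentation in both views, while a regression head uses the fused decoder information for 3D force estimation, so both tasks are handled within one framework.

Link: https://arxiv.org/abs/2509.01605
Authors: Pedram Fekri, Mehrdad Zadeh, Javad Dargahi
Affiliations: Concordia University; Kettering University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Preprint version. This work is intended for future journal submission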

Abstract: Recently, the emergence of multitask deep learning models has enhanced catheterization procedures by providing tactile and visual perception data through an end-to-end architecture. This information is derived from a segmentation and force estimation head, which localizes the catheter in X-ray images and estimates the applied pressure based on its deflection within the image. These stereo vision architectures incorporate a CNN-based encoder-decoder that captures the dependencies between X-ray images from two viewpoints, enabling simultaneous 3D force estimation and stereo segmentation of the catheter. With these tasks in mind, this work approaches the problem from a new perspective. We propose a novel encoder-decoder Vision Transformer model that processes two input X-ray images as separate sequences. Given sequences of X-ray patches from two perspectives, the transformer captures long-range dependencies without the need to gradually expand the receptive field for either image. The embeddings generated by both the encoder and decoder are fed into two shared segmentation heads, while a regression head employs the fused information from the decoder for 3D force estimation. The proposed model is a stereo Vision Transformer capable of simultaneously segmenting the catheter from two angles while estimating the generated forces at its tip in 3D. This model has undergone extensive experiments on synthetic X-ray images with various noise levels and has been compared against state-of-the-art pure segmentation models, vision-based catheter force estimation methods, and a multitask catheter segmentation and force estimation approach. It outperforms existing models, setting a new state-of-the-art in both catheter segmentation and force estimation.

[CV-75] O-DisCo-Edit: Object Distortion Control for Unified Realistic Video Editing

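【Quick Read】: This paper addresses limited controllability in video editing: existing methods need different control signals for different editing tasks, which complicates model design and demands significant training resources. The key to the solution is O-DisCo-Edit, a unified framework built around a new object distortion control (O-DisCo) signal. Based on random and adaptive noise, this signal encapsulates a wide range of editing cues in a single representation. Combined with a "copy-form" preservation module that protects non-edited regions, it enables efficient, high-fidelity video editing.

Link: https://arxiv.org/abs/2509.01596
Authors: Yuqing Chen, Junjie Wang, Lin Liu, Ruihang Chu, Xiaopeng Zhang, Qi Tian, Yujiu Yang
Affiliations: Tsinghua University; Huawei Inc.; Pengcheng National Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: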

Abstract: Diffusion models have recently advanced video editing, yet controllable editing remains challenging due to the need for precise manipulation of diverse object properties. Current methods require different control signals for diverse editing tasks, which complicates model design and demands significant training resources. To address this, we propose O-DisCo-Edit, a unified framework that incorporates a novel object distortion control (O-DisCo). This signal, based on random and adaptive noise, flexibly encapsulates a wide range of editing cues within a single representation. Paired with a "copy-form" preservation module for preserving non-edited regions, O-DisCo-Edit enables efficient, high-fidelity editing through an effective training paradigm. Extensive experiments and comprehensive human evaluations consistently demonstrate that O-DisCo-Edit surpasses both specialized and multitask state-of-the-art methods across various video editing tasks. this https URL

[CV-76] ViSTA-SLAM: Visual SLAM with Symmetric Two-view Association

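【Quick Read】: This paper addresses the dependence of monocular visual SLAM (Simultaneous Localization and Mapping) systems on known camera intrinsics, with the aim of making them applicable across diverse camera setups. The key to the solution is a lightweight Symmetric Two-view Association (STA) frontend that, from only two RGB images, simultaneously estimates the relative camera pose and regresses local pointmaps. The frontend is only 35% the size of comparable state-of-the-art methods while improving the quality of the two-view constraints. The backend builds a specially designed Sim(3) pose graph that incorporates loop closures to correct accumulated drift, giving accurate camera tracking and dense 3D reconstruction.

Link: https://arxiv.org/abs/2509.01584
Authors: Ganlin Zhang, Shenhan Qian, Xi Wang, Daniel Cremers
Affiliations: TU Munich; MCML; ETH Zurich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL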

Abstract: We present ViSTA-SLAM as a real-time monocular visual SLAM system that operates without requiring camera intrinsics, making it broadly applicable across diverse camera setups. At its core, the system employs a lightweight symmetric two-view association (STA) model as the frontend, which simultaneously estimates relative camera poses and regresses local pointmaps from only two RGB images. This design reduces model complexity significantly: the size of our frontend is only 35% that of comparable state-of-the-art methods, while the quality of the two-view constraints used in the pipeline is enhanced. In the backend, we construct a specially designed Sim(3) pose graph that incorporates loop closures to address accumulated drift. Extensive experiments demonstrate that our approach achieves superior performance in both camera tracking and dense 3D reconstruction quality compared to current methods. GitHub repository: this https URL

[CV-77] Aleatoric Uncertainty from AI-based 6D Object Pose Predictors for Object-relative State Estimation

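【Quick Read】: This paper addresses the lack of uncertainty modeling in deep learning (DL) based 6D object pose predictors used for robot state estimation. Such measurements are commonly assigned a fixed noise covariance, which ignores how measurement uncertainty actually varies from one observation to the next and limits the performance of the state estimator. The key to the solution is to add two multi-layer perceptrons (MLPs), detached from the translational and rotational parts of an existing pre-trained DL pose predictor, so that aleatoric uncertainty can be inferred while the original predictor stays frozen. The inferred 6D pose and its uncertainty enter an extended Kalman filter (EKF) as a measurement with a dynamic noise covariance, which improves object-relative state estimation over a fixed-covariance approach at minimal computational overhead, making the method suitable for edge devices.

Link: https://arxiv.org/abs/2509.01583
Authors: Thomas Jantos, Stephan Weiss, Jan Steinbrener
Affiliations: University of Klagenfurt
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication in IEEE Robotics and Automation Letters (RA-L)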

Abstract: Deep Learning (DL) has become essential in various robotics applications due to excelling at processing raw sensory data to extract task-specific information from semantic objects. For example, vision-based object-relative navigation relies on a DL-based 6D object pose predictor to provide the relative pose between the object and the robot as measurements to the robot's state estimator. Accurately knowing the uncertainty inherent in such Deep Neural Network (DNN) based measurements is essential for probabilistic state estimators subsequently guiding the robot's tasks. Thus, in this letter, we show that we can extend any existing DL-based object-relative pose predictor for aleatoric uncertainty inference simply by including two multi-layer perceptrons detached from the translational and rotational part of the DL predictor. This allows for efficient training while freezing the existing pre-trained predictor. We then use the inferred 6D pose and its uncertainty as a measurement and corresponding noise covariance matrix in an extended Kalman filter (EKF). Our approach induces minimal computational overhead such that the state estimator can be deployed on edge devices while benefiting from the dynamically inferred measurement uncertainty. This increases the performance of the object-relative state estimation task compared to a fixed-covariance approach. We conduct evaluations on synthetic data and real-world data to underline the benefits of aleatoric uncertainty inference for the object-relative state estimation task.
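
The effect of a dynamically inferred measurement covariance can be seen in a one-dimensional Kalman measurement update. This is a deliberately simplified sketch: it is scalar and linear (measurement model H = 1) rather than a full EKF over 6D poses, and the variable `r` stands in for the variance that the uncertainty MLPs would supply. All names are hypothetical.

```python
def kalman_update(x_prior, p_prior, z, r):
    """Scalar Kalman measurement update with measurement variance r."""
    k = p_prior / (p_prior + r)           # Kalman gain
    x_post = x_prior + k * (z - x_prior)  # corrected state
    p_post = (1.0 - k) * p_prior          # corrected variance
    return x_post, p_post, k

# Same prior and same measurement, but two different inferred uncertainties.
confident = kalman_update(x_prior=0.0, p_prior=1.0, z=2.0, r=1.0)
uncertain = kalman_update(x_prior=0.0, p_prior=1.0, z=2.0, r=9.0)
```

When the predictor reports low uncertainty (r = 1.0) the gain is 0.5 and the state moves halfway to the measurement; when it reports high uncertainty (r = 9.0) the gain drops to 0.1 and the measurement is largely discounted. A fixed covariance would treat both measurements identically.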

[CV-78] User Manual for Model-based Imaging Inverse Problem

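【Quick Read】: This paper addresses the complexity of formulating and solving imaging inverse problems, especially for readers without a background in convex optimization or inverse problem theory. The key to the solution is to lay out the mathematical rationale of imaging inverse problems through model-based optimization, organized around four questions: (1) What is an imaging inverse problem? (2) Why is optimization used to solve it? (3) How is the optimization problem solved? (4) How is the optimization algorithm implemented in a real imaging system? The manual emphasizes logical reasoning over strict mathematical formalism, aiming to give researchers a clear and workable way to analyze such problems.

Link: https://arxiv.org/abs/2509.01572
Authors: Xiaodong Wang
Affiliations: Unknown
Categories: Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV)
Comments: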

Abstract: This user manual is intended to provide a detailed description of model-based optimization for imaging inverse problems. These problems can be particularly complex and challenging, especially for individuals without prior exposure to convex optimization or inverse problem theory, like myself. In light of this, I am writing this manual to clarify and systematically organize the mathematical rationale underlying imaging inverse problems. This manual may not be rigorous in its mathematical notation; it focuses instead on the logical thinking behind how to approach and solve these problems. If you want to think deeply about something, try to raise questions! This manual is separated into four sections, aiming to answer the following four questions: (1) What is an inverse imaging problem? (2) Why is optimization used to solve the inverse imaging problem? (3) How do we solve the optimization problem? (4) How do we implement the optimization algorithm in a real imaging system?
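
Questions (2) and (3) can be illustrated with the smallest possible model-based reconstruction: recover x from y = A x by minimizing a data-fidelity term plus a regularizer. The sketch below uses plain gradient descent with a Tikhonov (squared L2) penalty. This choice of regularizer and solver is an assumption made for illustration, not necessarily what the manual covers, and the matrix and parameters are toy values.

```python
def solve_inverse(A, y, lam, step, iters):
    """Gradient descent on 0.5 * ||A x - y||^2 + 0.5 * lam * ||x||^2."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        # Residual of the forward model: A x - y.
        residual = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
        # Gradient: A^T (A x - y) + lam * x.
        grad = [
            sum(A[i][j] * residual[i] for i in range(m)) + lam * x[j]
            for j in range(n)
        ]
        x = [x[j] - step * grad[j] for j in range(n)]
    return x

# Toy diagonal forward model, so the closed-form answer is easy to check:
# x_j = a_j * y_j / (a_j^2 + lam).
A = [[2.0, 0.0],
     [0.0, 1.0]]
y = [4.0, 3.0]
x_hat = solve_inverse(A, y, lam=0.1, step=0.2, iters=200)
```

The step size must stay below 2 divided by the largest eigenvalue of A^T A + lam I (here 2 / 4.1) for the iteration to converge; real imaging systems replace the explicit matrix with forward and adjoint operators.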

[CV-79] Kwai Keye-VL 1.5 Technical Report

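TL;DR: This paper addresses the difficulty video understanding models have in balancing spatial resolution against temporal coverage, given that video content is highly dynamic and information-dense. The solution rests on three innovations. First, a Slow-Fast video encoding strategy allocates computation dynamically based on inter-frame similarity: key frames with significant visual change are processed at high resolution (Slow pathway), while relatively static frames are processed at lower resolution to increase temporal coverage (Fast pathway). Second, a progressive four-stage pre-training scheme extends the context length from 8K to 128K tokens, supporting longer videos and more complex visual content. Third, a post-training pipeline covering reasoning enhancement and human preference alignment, including a five-step chain-of-thought data construction process, iterative GSPO-based reinforcement learning, and progressive prompt hinting, improves performance on video understanding tasks.

Link: https://arxiv.org/abs/2509.01563
Authors: Biao Yang,Bin Wen,Boyang Ding,Changyi Liu,Chenglong Chu,Chengru Song,Chongling Rao,Chuan Yi,Da Li,Dunju Zang,Fan Yang,Guorui Zhou,Guowang Zhang,Han Shen,Hao Peng,Haojie Ding,Hao Wang,Hengrui Ju,Jiaming Huang,Jiangxia Cao,Jiankang Chen,Jingyun Hua,Kaibing Chen,Kaiyu Jiang,Kaiyu Tang,Kun Gai,Muhao Wei,Qiang Wang,Ruitao Wang,Sen Na,Shengnan Zhang,Siyang Mao,Sui Huang,Tianke Zhang,Tingting Gao,Wei Chen,Wei Yuan,Xiangyu Wu,Xiao Hu,Xingyu Lu,Yi-Fan Zhang,Yiping Yang,Yulong Chen,Zeyi Lu,Zhenhua Wu,Zhixin Ling,Zhuoran Yang,Ziming Li,Di Xu,Haixuan Gao,Hang Li,Jing Wang,Lejian Ren,Qigen Hu,Qianqian Wang,Shiyao Wang,Xinchen Luo,Yan Li,Yuhang Hu,Zixing Zhang
Affiliation: Kuaishou Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: GitHub page: this https URL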

Abstract:In recent years, the development of Large Language Models (LLMs) has significantly advanced, extending their capabilities to multimodal tasks through Multimodal Large Language Models (MLLMs). However, video understanding remains a challenging area due to the dynamic and information-dense nature of videos. Existing models struggle with the trade-off between spatial resolution and temporal coverage when processing video content. We present Keye-VL-1.5, which addresses fundamental challenges in video comprehension through three key innovations. First, we introduce a novel Slow-Fast video encoding strategy that dynamically allocates computational resources based on inter-frame similarity, processing key frames with significant visual changes at higher resolution (Slow pathway) while handling relatively static frames with increased temporal coverage at lower resolution (Fast pathway). Second, we implement a progressive four-stage pre-training methodology that systematically extends the model’s context length from 8K to 128K tokens, enabling processing of longer videos and more complex visual content. Third, we develop a comprehensive post-training pipeline focusing on reasoning enhancement and human preference alignment, incorporating a 5-step chain-of-thought data construction process, iterative GSPO-based reinforcement learning with progressive prompt hinting for difficult cases, and alignment training. Through extensive evaluation on public benchmarks and rigorous internal human assessment, Keye-VL-1.5 demonstrates significant improvements over existing models, particularly excelling in video understanding tasks while maintaining competitive performance on general multimodal benchmarks.
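
The Slow-Fast idea of routing frames by inter-frame similarity can be sketched as follows. This is only a rough illustration: the use of cosine similarity on per-frame feature vectors, the comparison against the immediately preceding frame, the threshold of 0.95, and the function names are assumptions, not details taken from the report.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def assign_pathways(frame_features, threshold=0.95):
    """Label each frame 'slow' (high resolution) or 'fast' (low resolution).

    A frame that differs enough from its predecessor is treated as a key
    frame; the first frame is always a key frame in this sketch.
    """
    labels = []
    for i, feat in enumerate(frame_features):
        if i == 0 or cosine(frame_features[i - 1], feat) < threshold:
            labels.append("slow")
        else:
            labels.append("fast")
    return labels


# Hypothetical 2-D features for four frames: a near-duplicate, then a scene cut.
features = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0]]
labels = assign_pathways(features)
```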

[CV-80] Acoustic Interference Suppression in Ultrasound images for Real-Time HIFU Monitoring Using an Image-Based Latent Diffusion Model

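TL;DR: This paper addresses the degradation of ultrasound guidance images caused by interference from High-Intensity Focused Ultrasound (HIFU) energy during treatment, which undermines the accuracy and safety of real-time monitoring. The key is HIFU-ILDiff, a deep learning method based on a latent diffusion model. A Vector Quantized Variational Autoencoder (VQ-VAE) maps noisy ultrasound images into a lower-dimensional latent space, where iterative denoising removes the HIFU interference; the result is decoded into a high-resolution, interference-free ultrasound image. The method runs in real time at 15 frames per second and clearly outperforms the conventional Notch Filter.

Link: https://arxiv.org/abs/2509.01557
Authors: Dejia Cai,Yao Ran,Kun Yang,Xinwang Shi,Yingying Zhou,Kexian Wu,Yang Xu,Yi Hu,Xiaowei Zhou
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: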

Abstract:High-Intensity Focused Ultrasound (HIFU) is a non-invasive therapeutic technique widely used for treating various diseases. However, the success and safety of HIFU treatments depend on real-time monitoring, which is often hindered by interference when using ultrasound to guide HIFU treatment. To address these challenges, we developed HIFU-ILDiff, a novel deep learning-based approach leveraging latent diffusion models to suppress HIFU-induced interference in ultrasound images. The HIFU-ILDiff model employs a Vector Quantized Variational Autoencoder (VQ-VAE) to encode noisy ultrasound images into a lower-dimensional latent space, followed by a latent diffusion model that iteratively removes interference. The denoised latent vectors are then decoded to reconstruct high-resolution, interference-free ultrasound images. We constructed a comprehensive dataset comprising 18,872 image pairs from in vitro phantoms, ex vivo tissues, and in vivo animal data across multiple imaging modalities and HIFU power levels to train and evaluate the model. Experimental results demonstrate that HIFU-ILDiff significantly outperforms the commonly used Notch Filter method, achieving a Structural Similarity Index (SSIM) of 0.796 and Peak Signal-to-Noise Ratio (PSNR) of 23.780 compared to SSIM of 0.443 and PSNR of 14.420 for the Notch Filter under in vitro scenarios. Additionally, HIFU-ILDiff achieves real-time processing at 15 frames per second, markedly faster than the Notch Filter’s 5 seconds per frame. These findings indicate that HIFU-ILDiff is able to denoise HIFU interference in ultrasound guiding images for real-time monitoring during HIFU therapy, which will greatly improve the treatment precision in current clinical applications.

[CV-81] Unified Supervision For Vision-Language Modeling in 3D Computed Tomography ICCV2025

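TL;DR: This paper addresses two problems: general-purpose vision-language models (VLMs) lack the discriminative precision needed for reliable clinical use in high-stakes domains such as diagnostic radiology, and public 3D CT datasets are hard to train on because their annotation formats and granularity are inconsistent. The key is Uniferum, a model that unifies multiple supervision signals, such as classification labels and segmentation masks, in a joint training framework over heterogeneous annotations. Trained on three public 3D CT datasets with different annotation schemes, it improves both performance and out-of-distribution generalization.

Link: https://arxiv.org/abs/2509.01554
Authors: Hao-Chih Lee,Zelong Liu,Hamza Ahmed,Spencer Kim,Sean Huver,Vishwesh Nath,Zahi A. Fayad,Timothy Deyer,Xueyan Mei
Affiliation: Icahn School of Medicine at Mount Sinai; NVIDIA; Cornell Medicine
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ICCV 2025 VLM 3d Workshop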

Abstract:General-purpose vision-language models (VLMs) have emerged as promising tools in radiology, offering zero-shot capabilities that mitigate the need for large labeled datasets. However, in high-stakes domains like diagnostic radiology, these models often lack the discriminative precision required for reliable clinical use. This challenge is compounded by the scarcity and heterogeneity of publicly available volumetric CT datasets, which vary widely in annotation formats and granularity. To address these limitations, we introduce Uniferum, a volumetric VLM that unifies diverse supervision signals, encoded in classification labels and segmentation masks, into a single training framework. By harmonizing three public 3D CT datasets with distinct annotations, Uniferum achieves state-of-the-art performance, improving AUROC on the CT-RATE benchmark by 7% compared to CLIP-based and conventional multi-label convolutional models. The model demonstrates robust out-of-distribution generalization, with observed evidence of unexpected zero-shot performance on the RAD-CHEST and INSPECT datasets. Our results highlight the effectiveness of integrating heterogeneous annotations and body segmentation to enhance model performance, setting a new direction for clinically reliable, data-efficient VLMs in 3D medical imaging.

[CV-82] Variation-aware Vision Token Dropping for Faster Large Vision-Language Models

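TL;DR: This paper addresses the low inference efficiency of Large Vision-Language Models (LVLMs) on high-resolution image and long-video understanding tasks, where token counts grow sharply. Existing inner-LLM token compression methods suffer from positional bias and are incompatible with efficient operators, which limits their practical use for LVLM acceleration. The key is to take a token variation perspective: Variation-aware Vision Token Dropping (V²Drop) identifies and progressively removes the visual tokens that change least during inference, cutting computation with little loss in performance. Experiments show that it retains 94.0% and 98.6% of the original model's performance on image and video understanding tasks respectively, while reducing LLM generation latency by 31.5% and 74.2%; combined with efficient operators, it further lowers peak GPU memory usage.

Link: https://arxiv.org/abs/2509.01552
Authors: Junjie Chen,Xuyang Liu,Zichen Wen,Yiyu Wang,Siteng Huang,Honggang Chen
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL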

Abstract:Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. However, the increasing demand for high-resolution image and long-video understanding results in substantial token counts, leading to reduced inference efficiency. Token compression offers a direct solution by reducing the number of tokens to be processed, thereby improving computational efficiency. Through extensive analysis, we identify two critical limitations in existing inner-LLM token compression methods: positional bias and incompatibility with efficient operators, which hinder their practical deployment for LVLM acceleration. This paper presents the first approach from a token variation perspective, revealing that visual token variations within LLMs exhibit task-agnostic properties. We propose Variation-aware Vision Token Dropping (i.e., V²Drop), which progressively removes visual tokens with minimal variation during LVLM inference, thereby enhancing computational efficiency. Extensive experiments across multiple models and benchmarks demonstrate that our V²Drop is able to maintain 94.0% and 98.6% of the original model performance for image and video understanding tasks respectively, while reducing LLM generation latency by 31.5% and 74.2%. When combined with efficient operators, V²Drop further reduces GPU peak memory usage.
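
A minimal sketch of variation-based token dropping is shown below. It assumes that a token's "variation" is the L2 distance between its hidden state before and after a layer and that a fixed number of tokens is kept per stage; the abstract does not specify the variation measure or the schedule, and the function name is hypothetical.

```python
import math


def drop_low_variation_tokens(prev_states, curr_states, keep):
    """Return the indices of the `keep` tokens whose hidden states changed most.

    prev_states, curr_states: per-token hidden states (lists of floats)
    before and after a layer. Indices are returned in their original order
    so that positional order is preserved for the remaining tokens.
    """
    variation = [math.dist(p, c) for p, c in zip(prev_states, curr_states)]
    ranked = sorted(range(len(curr_states)), key=lambda i: variation[i], reverse=True)
    return sorted(ranked[:keep])


# Hypothetical hidden states for four visual tokens.
before = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
after = [[0.0, 0.1], [2.0, 2.0], [2.0, 2.0], [3.0, 5.0]]
kept = drop_low_variation_tokens(before, after, keep=2)
```

Because the score depends only on hidden states and not on attention maps, a criterion of this kind does not require materializing attention weights, which is one plausible reading of the compatibility with efficient operators that the abstract mentions.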

[CV-83] Forward-Only Continual Learning

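TL;DR: This paper addresses catastrophic forgetting in continual learning (CL) with pre-trained models. Existing methods typically freeze the backbone and fine-tune a small number of parameters, but they still depend on iterative error backpropagation and gradient-based optimization, which is computationally expensive and ill-suited to resource-constrained environments. The key is FoRo, a forward-only, gradient-free continual learning method. It combines a lightweight prompt tuning strategy with a novel knowledge encoding mechanism, neither of which modifies the pre-trained model. Prompt embeddings are optimized with the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), which mitigates distribution shift and extracts high-quality task representations. Task-specific knowledge is then encoded into a knowledge encoding matrix via nonlinear random projection and recursive least squares, so the classifier is updated incrementally without access to past data. The method significantly reduces average forgetting and improves accuracy while lowering memory usage and run time, which suits real-world multimedia scenarios that need both efficiency and effectiveness.

Link: https://arxiv.org/abs/2509.01533
Authors: Jiao Chen,Jiayi He,Fangfang Chen,Zuohong Lv,Jianhua Tang
Affiliation: South China University of Technology
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: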

Abstract:Catastrophic forgetting remains a central challenge in continual learning (CL) with pre-trained models. While existing approaches typically freeze the backbone and fine-tune a small number of parameters to mitigate forgetting, they still rely on iterative error backpropagation and gradient-based optimization, which can be computationally intensive and less suitable for resource-constrained environments. To address this, we propose FoRo, a forward-only, gradient-free continual learning method. FoRo consists of a lightweight prompt tuning strategy and a novel knowledge encoding mechanism, both designed without modifying the pre-trained model. Specifically, prompt embeddings are inserted at the input layer and optimized using the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), which mitigates distribution shifts and extracts high-quality task representations. Subsequently, task-specific knowledge is encoded into a knowledge encoding matrix via nonlinear random projection and recursive least squares, enabling incremental updates to the classifier without revisiting prior data. Experiments show that FoRo significantly reduces average forgetting and improves accuracy. Thanks to forward-only learning, FoRo reduces memory usage and run time while maintaining high knowledge retention across long task sequences. These results suggest that FoRo could serve as a promising direction for exploring continual learning with pre-trained models, especially in real-world multimedia applications where both efficiency and effectiveness are critical.
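
The combination of a nonlinear random projection with recursive least squares (RLS) can be sketched in a few lines. This is a generic textbook RLS update with a scalar target, not FoRo itself: the ReLU nonlinearity, the Gaussian projection matrix `W`, the initial scale of `P`, and all names are assumptions for illustration.

```python
import random


def random_features(x, W):
    """Nonlinear random projection: ReLU(W x) with a fixed random matrix W."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W]


class RecursiveLeastSquares:
    """Incremental least-squares fit that never revisits earlier samples."""

    def __init__(self, dim, init_scale=1e6):
        self.w = [0.0] * dim
        # P approximates the inverse feature covariance; a large initial
        # scale corresponds to a very weak ridge penalty.
        self.P = [[init_scale if i == j else 0.0 for j in range(dim)] for i in range(dim)]

    def predict(self, h):
        return sum(wi * hi for wi, hi in zip(self.w, h))

    def update(self, h, y):
        Ph = [sum(p * hj for p, hj in zip(row, h)) for row in self.P]
        denom = 1.0 + sum(hi * v for hi, v in zip(h, Ph))
        gain = [v / denom for v in Ph]
        error = y - self.predict(h)
        self.w = [wi + g * error for wi, g in zip(self.w, gain)]
        # P <- P - gain * (h^T P); P is symmetric, so h^T P equals Ph.
        n = len(h)
        self.P = [[self.P[i][j] - gain[i] * Ph[j] for j in range(n)] for i in range(n)]


# Hypothetical usage: project 2-D inputs to 4 random features, then update
# the readout one sample at a time.
rng = random.Random(0)
W = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(4)]
model = RecursiveLeastSquares(dim=4)
for x, y in [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)]:
    model.update(random_features(x, W), y)
```

Each update touches only the current sample and the running matrix `P`, which is what allows a classifier to be extended task by task without stored data.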

[CV-84] MSA2-Net: Utilizing Self-Adaptive Convolution Module to Extract Multi-Scale Information in Medical Image Segmentation

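TL;DR: This paper addresses the limited cross-dataset generalization of existing medical image segmentation models, in particular the performance bottleneck caused by leaving the segmentation network's internal hyperparameters untuned. The core of the solution is a Self-Adaptive Convolution Module that dynamically adjusts convolution kernel sizes according to the distinctive characteristics of each dataset, improving the model's ability to capture both global and local features. The module is integrated into two key components of MSA2-Net, the Multi-Scale Convolution Bridge and the Multi-Scale Amalgamation Decoder, where it improves feature refinement along the skip connections and the modeling of detail for organs of different sizes during decoding, yielding more robust and accurate segmentation.

Link: https://arxiv.org/abs/2509.01498
Authors: Chao Deng,Xiaosen Li,Xiao Qin
Affiliation: School of Artificial Intelligence, Nanning Normal University, Nanning, Guangxi, 530100, People’s Republic of China; School of Artificial Intelligence, Guangxi Minzu University, Nanning, Guangxi, 530300, People’s Republic of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: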

Abstract:The nnUNet segmentation framework adeptly adjusts most hyperparameters in training scripts automatically, but it overlooks the tuning of internal hyperparameters within the segmentation network itself, which constrains the model’s ability to generalize. Addressing this limitation, this study presents a novel Self-Adaptive Convolution Module that dynamically adjusts the size of the convolution kernels depending on the unique fingerprints of different datasets. This adjustment enables the MSA2-Net, when equipped with this module, to proficiently capture both global and local features within the feature maps. Self-Adaptive Convolution Module is strategically integrated into two key components of the MSA2-Net: the Multi-Scale Convolution Bridge and the Multi-Scale Amalgamation Decoder. In the MSConvBridge, the module enhances the ability to refine outputs from various stages of the CSWin Transformer during the skip connections, effectively eliminating redundant data that could potentially impair the decoder’s performance. Simultaneously, the MSADecoder, utilizing the module, excels in capturing detailed information of organs varying in size during the decoding phase. This capability ensures that the decoder’s output closely reproduces the intricate details within the feature maps, thus yielding highly accurate segmentation images. MSA2-Net, bolstered by this advanced architecture, has demonstrated exceptional performance, achieving Dice coefficient scores of 86.49%, 92.56%, 93.37%, and 92.98% on the Synapse, ACDC, Kvasir, and Skin Lesion Segmentation (ISIC2017) datasets, respectively. This underscores MSA2-Net’s robustness and precision in medical image segmentation tasks across various datasets.

[CV-85] A Continuous-Time Consistency Model for 3D Point Cloud Generation

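TL;DR: This paper addresses fast and accurate generation of 3D shapes from point clouds, a key challenge for robotics, augmented and virtual reality (AR/VR), and digital content creation. Existing methods usually rely on iterative denoising, or on pre-trained teacher models and latent-space encodings, and suffer from low computational efficiency and unstable training. The key is ConTiCoM-3D, a continuous-time consistency model that operates directly in point space. Its main innovations are a TrigFlow-inspired continuous noise schedule and a Chamfer Distance-based geometric loss, which avoid discrete diffusion steps, latent-space mappings, and expensive Jacobian-vector products. A time-conditioned neural network performs generation in continuous time, so high-fidelity 3D shapes can be synthesized in only one or two steps, improving both speed and quality.

Link: https://arxiv.org/abs/2509.01492
Authors: Sebastian Eilermann,René Heesch,Oliver Niggemann
Affiliation: Institute for Artificial Intelligence; Helmut Schmidt University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: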

Abstract:Fast and accurate 3D shape generation from point clouds is essential for applications in robotics, AR/VR, and digital content creation. We introduce ConTiCoM-3D, a continuous-time consistency model that synthesizes 3D shapes directly in point space, without discretized diffusion steps, pre-trained teacher models, or latent-space encodings. The method integrates a TrigFlow-inspired continuous noise schedule with a Chamfer Distance-based geometric loss, enabling stable training on high-dimensional point sets while avoiding expensive Jacobian-vector products. This design supports efficient one- to two-step inference with high geometric fidelity. In contrast to previous approaches that rely on iterative denoising or latent decoders, ConTiCoM-3D employs a time-conditioned neural network operating entirely in continuous time, thereby achieving fast generation. Experiments on the ShapeNet benchmark show that ConTiCoM-3D matches or outperforms state-of-the-art diffusion and latent consistency models in both quality and efficiency, establishing it as a practical framework for scalable 3D shape generation.
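
For readers unfamiliar with the Chamfer Distance used in the geometric loss, a brute-force sketch follows. Conventions differ between papers (squared versus unsquared distances, sums versus means); this version averages squared nearest-neighbour distances in both directions, which is an assumption rather than the paper's stated definition.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty point sets.

    For every point, take the squared distance to its nearest neighbour in
    the other set, average per set, and add the two directions.
    """
    a_to_b = sum(min(math.dist(a, b) ** 2 for b in points_b) for a in points_a) / len(points_a)
    b_to_a = sum(min(math.dist(b, a) ** 2 for a in points_a) for b in points_b) / len(points_b)
    return a_to_b + b_to_a


# Hypothetical toy point clouds.
cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
distance = chamfer_distance(cloud, shifted)
```

This O(N·M) loop is only practical for small sets; real training code would normally use a batched GPU implementation.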

[CV-86] PointSlice: Accurate and Efficient Slice-Based Representation for 3D Object Detection from Point Clouds

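TL;DR: This paper addresses the trade-off between accuracy and inference speed in point cloud processing: voxel-based methods are accurate but slow, while pillar-based methods are fast but less accurate. The key is PointSlice, a new point cloud processing method. First, it slices the 3D point cloud along the horizontal plane into multiple 2D (x-y) data slices, so the model only needs to learn 2D data distributions, which substantially reduces the parameter count and speeds up inference. Second, it introduces a Slice Interaction Network (SIN) in the 2D backbone to model the vertical relationships between slices, strengthening the model's perception of 3D objects and combining high accuracy with high speed.

Link: https://arxiv.org/abs/2509.01487
Authors: Liu Qifeng,Zhao Dawei,Dong Yabo,Xiao Liang,Wang Juan,Min Chen,Li Fuyang,Jiang Weizhong,Lu Dongming,Nie Yiming
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Manuscript submitted to PATTERN RECOGNITION, currently under review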

Abstract:3D object detection from point clouds plays a critical role in autonomous driving. Currently, the primary methods for point cloud processing are voxel-based and pillar-based approaches. Voxel-based methods offer high accuracy through fine-grained spatial segmentation but suffer from slower inference speeds. Pillar-based methods enhance inference speed but still fall short of voxel-based methods in accuracy. To address these issues, we propose a novel point cloud processing method, PointSlice, which slices point clouds along the horizontal plane and includes a dedicated detection network. The main contributions of PointSlice are: (1) A new point cloud processing technique that converts 3D point clouds into multiple sets of 2D (x-y) data slices. The model only learns 2D data distributions, treating the 3D point cloud as separate batches of 2D data, which reduces the number of model parameters and enhances inference speed; (2) The introduction of a Slice Interaction Network (SIN). To maintain vertical relationships across slices, we incorporate SIN into the 2D backbone network, which improves the model’s 3D object perception capability. Extensive experiments demonstrate that PointSlice achieves high detection accuracy and inference speed. On the Waymo dataset, PointSlice is 1.13x faster and has 0.79x fewer parameters than the state-of-the-art voxel-based method (SAFDNet), with only a 1.2 mAPH accuracy reduction. On the nuScenes dataset, we achieve a state-of-the-art detection result of 66.74 mAP. On the Argoverse 2 dataset, PointSlice is 1.10x faster, with 0.66x fewer parameters and a 1.0 mAP accuracy reduction. The code will be available at this https URL.
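
The slicing step in contribution (1) can be sketched as binning points by height and keeping only their x-y coordinates. Uniform slice heights, the handling of out-of-range points, and the function name are assumptions; the abstract does not describe the exact slicing scheme.

```python
def slice_point_cloud(points, z_min, z_max, num_slices):
    """Split (x, y, z) points into horizontal slices of 2D (x, y) points.

    Points outside [z_min, z_max) are discarded in this sketch.
    """
    slices = [[] for _ in range(num_slices)]
    slice_height = (z_max - z_min) / num_slices
    for x, y, z in points:
        if not (z_min <= z < z_max):
            continue
        index = int((z - z_min) / slice_height)
        index = min(index, num_slices - 1)  # guard against float rounding
        slices[index].append((x, y))
    return slices


# Hypothetical points; the last one lies above the height range.
points = [(0.0, 0.0, 0.1), (1.0, 1.0, 0.9), (2.0, 2.0, 1.5), (3.0, 3.0, 5.0)]
slices = slice_point_cloud(points, z_min=0.0, z_max=2.0, num_slices=2)
```

Each slice can then be fed to a 2D backbone as a separate batch element, which is how the abstract describes treating the 3D cloud as batches of 2D data.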

[CV-87] Im2Haircut: Single-view Strand-based Hair Reconstruction for Human Avatars

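TL;DR: This paper addresses the reconstruction of high-quality 3D hair from a single image, targeting the limitations of existing methods in handling the geometric complexity of hairstyles, the lack of ground-truth data, and the difficulty of recovering unseen regions such as the back of the head. The core of the solution is a new hairstyle prior learned from both real and synthetic data: a transformer-based model learns the internal geometry of hairstyles from synthetic data, and real data is introduced to improve the fit of the outer, visible strands. This models the visible part of the input image accurately while keeping the overall hairstyle consistent in 3D. The prior then guides a Gaussian-splatting-based reconstruction pipeline that, from one or more input images, produces hairstyles with detailed strand orientation, an accurate silhouette, and a consistent back side.

Link: https://arxiv.org/abs/2509.01469
Authors: Vanessa Sklyarova,Egor Zakharov,Malte Prinzler,Giorgio Becherini,Michael J. Black,Justus Thies
Affiliation: Max Planck Institute for Intelligent Systems; ETH Zürich; Technical University of Darmstadt
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: For more results please refer to the project page this https URL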

Abstract:We present a novel approach for 3D hair reconstruction from single photographs based on a global hair prior combined with local optimization. Capturing strand-based hair geometry from single photographs is challenging due to the variety and geometric complexity of hairstyles and the lack of ground truth training data. Classical reconstruction methods like multi-view stereo only reconstruct the visible hair strands, missing the inner structure of hairstyles and hampering realistic hair simulation. To address this, existing methods leverage hairstyle priors trained on synthetic data. Such data, however, is limited in both quantity and quality since it requires manual work from skilled artists to model the 3D hairstyles and create near-photorealistic renderings. To address this, we propose a novel approach that uses both real and synthetic data to learn an effective hairstyle prior. Specifically, we train a transformer-based prior model on synthetic data to obtain knowledge of the internal hairstyle geometry and introduce real data in the learning process to model the outer structure. This training scheme is able to model the visible hair strands depicted in an input image, while preserving the general 3D structure of hairstyles. We exploit this prior to create a Gaussian-splatting-based reconstruction method that creates hairstyles from one or more images. Qualitative and quantitative comparisons with existing reconstruction pipelines demonstrate the effectiveness and superior performance of our method for capturing detailed hair orientation, overall silhouette, and backside consistency. For additional results and code, please refer to this https URL.

[CV-88] Traces of Image Memorability in Vision Encoders: Activations, Attention Distributions and Autoencoder Losses ICCV2025

TL;DR: This paper addresses the prediction of image memorability, that is, how internal features of pretrained vision encoders can be used to predict how well humans remember an image. The key is to examine how features of vision transformer models, namely latent activations, attention distributions, and the uniformity of image patches, correlate with memorability, and to introduce sparse autoencoder loss as a proxy for memorability, which outperforms earlier methods based on convolutional neural network representations.

Link: https://arxiv.org/abs/2509.01453
Authors: Ece Takmaz,Albert Gatt,Jakub Dotlacil
Affiliation: Utrecht University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the ICCV 2025 workshop MemVis: The 1st Workshop on Memory and Vision (non-archival)

Abstract:Images vary in how memorable they are to humans. Inspired by findings from cognitive science and computer vision, this paper explores the correlates of image memorability in pretrained vision encoders, focusing on latent activations, attention distributions, and the uniformity of image patches. We find that these features correlate with memorability to some extent. Additionally, we explore sparse autoencoder loss over the representations of vision transformers as a proxy for memorability, which yields results outperforming past methods using convolutional neural network representations. Our results shed light on the relationship between model-internal features and memorability. They show that some features are informative predictors of what makes images memorable to humans.

[CV-89] SoccerHigh: A Benchmark Dataset for Automatic Soccer Video Summarization

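TL;DR: This paper addresses the difficulty of training and evaluating sports video summarization models given the lack of publicly available datasets. The key is a curated dataset for soccer, containing shot boundaries for 237 matches from the Spanish, French, and Italian leagues and built on broadcast footage from the SoccerNet dataset. The paper also proposes a baseline model designed for the task and a new evaluation metric constrained by the target summary length, enabling a more objective assessment of summary quality.

Link: https://arxiv.org/abs/2509.01439
Authors: Artur Díaz-Juan,Coloma Ballester,Gloria Haro
Affiliation: Universitat Pompeu Fabra
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Accepted at MMSports 2025 (Dublin, Ireland)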

Abstract:Video summarization aims to extract key shots from longer videos to produce concise and informative summaries. One of its most common applications is in sports, where highlight reels capture the most important moments of a game, along with notable reactions and specific contextual events. Automatic summary generation can support video editors in the sports media industry by reducing the time and effort required to identify key segments. However, the lack of publicly available datasets poses a challenge in developing robust models for sports highlight generation. In this paper, we address this gap by introducing a curated dataset for soccer video summarization, designed to serve as a benchmark for the task. The dataset includes shot boundaries for 237 matches from the Spanish, French, and Italian leagues, using broadcast footage sourced from the SoccerNet dataset. Alongside the dataset, we propose a baseline model specifically designed for this task, which achieves an F1 score of 0.3956 in the test set. Furthermore, we propose a new metric constrained by the length of each target summary, enabling a more objective evaluation of the generated content. The dataset and code are available at this https URL.

[CV-90] Mamba-CNN: A Hybrid Architecture for Efficient and Accurate Facial Beauty Prediction

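TL;DR: This paper addresses the performance-efficiency trade-off in facial attractiveness assessment, a subjective regression task: conventional Convolutional Neural Networks (CNNs) are computationally efficient but have limited receptive fields, while Vision Transformers (ViTs) model global context at quadratic computational cost. The key is Mamba-CNN, a hybrid architecture that embeds a lightweight, Mamba-inspired State Space Model (SSM) gating mechanism in a hierarchical convolutional backbone. The network can then modulate feature maps dynamically and selectively emphasize salient facial features and their long-range spatial relationships, mimicking human holistic perception while remaining computationally efficient.

Link: https://arxiv.org/abs/2509.01431
Authors: Djamel Eddine Boukhari
Affiliation: Scientific and Technical Research Centre for Arid Areas, CRSTRA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: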

Abstract:The computational assessment of facial attractiveness, a challenging subjective regression task, is dominated by architectures with a critical trade-off: Convolutional Neural Networks (CNNs) offer efficiency but have limited receptive fields, while Vision Transformers (ViTs) model global context at a quadratic computational cost. To address this, we propose Mamba-CNN, a novel and efficient hybrid architecture. Mamba-CNN integrates a lightweight, Mamba-inspired State Space Model (SSM) gating mechanism into a hierarchical convolutional backbone. This core innovation allows the network to dynamically modulate feature maps and selectively emphasize salient facial features and their long-range spatial relationships, mirroring human holistic perception while maintaining computational efficiency. We conducted extensive experiments on the widely-used SCUT-FBP5500 benchmark, where our model sets a new state-of-the-art. Mamba-CNN achieves a Pearson Correlation (PC) of 0.9187, a Mean Absolute Error (MAE) of 0.2022, and a Root Mean Square Error (RMSE) of 0.2610. Our findings validate the synergistic potential of combining CNNs with selective SSMs and present a powerful new architectural paradigm for nuanced visual understanding tasks.

[CV-91] InfoScale: Unleashing Training-free Variable-scaled Image Generation via Effective Utilization of Information

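TL;DR: This paper addresses the performance drop of Diffusion Models (DMs) when generating images at resolutions other than the training resolution. The core challenge is that the amount of information differs across resolutions, so the information conversion procedure no longer fits. The paper identifies three issues: (1) dilated convolution loses high-frequency information in higher-resolution generation; (2) attention struggles to adapt how it aggregates information; (3) the spatial distribution of information in the initial noise is misaligned with variable-scaled images. The key is InfoScale, an information-centric framework with one module per issue: a Progressive Frequency Compensation module restores high-frequency information, an Adaptive Information Aggregation module balances local and global information dynamically, and a Noise Adaptation module redistributes the information in the initial noise. The method is plug-and-play, and experiments across multiple settings support its effectiveness.

Link: https://arxiv.org/abs/2509.01421
Authors: Guohui Zhang,Jiangtong Tan,Linjiang Huang,Zhonghang Yuan,Naishan Zheng,Jie Huang,Feng Zhao
Affiliation: University of Science and Technology of China; Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: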

Abstract:Diffusion models (DMs) have become dominant in visual generation but suffer a performance drop when tested on resolutions that differ from the training scale, whether lower or higher. In fact, the key challenge in generating variable-scale images lies in the differing amounts of information across resolutions, which requires information conversion procedures to be varied for generating variable-scaled images. In this paper, we investigate the issues of three critical aspects in DMs for a unified analysis in variable-scaled generation: dilated convolution, attention mechanisms, and initial noise. Specifically, 1) dilated convolution in DMs for the higher-resolution generation loses high-frequency information. 2) Attention for variable-scaled image generation struggles to adjust the information aggregation adaptively. 3) The spatial distribution of information in the initial noise is misaligned with variable-scaled images. To solve the above problems, we propose InfoScale, an information-centric framework for variable-scaled image generation by effectively utilizing information from three aspects correspondingly. For information loss in 1), we introduce Progressive Frequency Compensation module to compensate for high-frequency information lost by dilated convolution in higher-resolution generation. For information aggregation inflexibility in 2), we introduce Adaptive Information Aggregation module to adaptively aggregate information in lower-resolution generation and achieve an effective balance between local and global information in higher-resolution generation. For information distribution misalignment in 3), we design Noise Adaptation module to re-distribute information in initial noise for variable-scaled generation. Our method is plug-and-play for DMs and extensive experiments demonstrate the effectiveness in variable-scaled image generation.

[CV-92] Bangladeshi Street Food Calorie Estimation Using Improved YOLOv8 and Regression Model

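TL;DR: This paper addresses the limitations of existing automated calorie tracking methods in recognizing multiple foods, handling image scaling and normalization, and their heavy reliance on Western cuisines, with a focus on estimating the calories of Bangladeshi street food. The key is a diverse dataset of popular Bangladeshi street foods together with a modified version of the state-of-the-art vision model YOLOv8 that improves classification and segmentation. Combined with a machine learning regression model, the system achieves accurate calorie estimates (MAE = 6.94, RMSE = 11.03, R² = 96.0%), significantly outperforming conventional methods with only a slight increase in computational complexity.

Link: https://arxiv.org/abs/2509.01415
Authors: Aparup Dhar(1),MD Tamim Hossain(1),Pritom Barua(1) ((1) Department of Computer Science and Engineering, Premier University, Chittagong, Bangladesh)
Affiliation: Premier University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: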

Abstract:As obesity rates continue to increase, automated calorie tracking has become a vital tool for people seeking to maintain a healthy lifestyle or adhere to a diet plan. Although numerous research efforts have addressed this issue, existing approaches often face key limitations, such as providing only constant caloric output, struggling with multiple food recognition challenges, challenges in image scaling and normalization, and a predominant focus on Western cuisines. In this paper, we propose a tailored solution that specifically targets Bangladeshi street food. We first construct a diverse dataset of popular street foods found across Bangladesh. Then, we develop a refined calorie estimation system by modifying the state-of-the-art vision model YOLOv8. Our modified model achieves superior classification and segmentation results, with only a slight increase in computational complexity compared to the base variant. Coupled with a machine learning regression model, our system achieves an impressive 6.94 mean absolute error (MAE), 11.03 root mean squared error (RMSE), and a 96.0% R^2 score in calorie estimation, making it both highly effective and accurate for real-world food calorie calculations.
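
The three regression metrics quoted in the abstract have standard definitions, sketched below in plain Python. The function name and the toy calorie values are hypothetical; the snippet does not reproduce the paper's data or results.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired true and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


# Hypothetical calorie values (kcal) for three dishes.
true_kcal = [100.0, 200.0, 300.0]
pred_kcal = [100.0, 200.0, 400.0]
mae, rmse, r2 = regression_metrics(true_kcal, pred_kcal)
```

MAE and RMSE are in the same unit as the target, so they are only comparable across papers when the calorie ranges are similar, whereas R² is unitless.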

[CV-93] MILO: A Lightweight Perceptual Quality Metric for Image and Latent-Space Optimization

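TL;DR: This paper addresses the need in Full-Reference Image Quality Assessment (FR-IQA) for a perceptual metric that is accurate, computationally cheap, and usable for optimizing generative models. Conventional approaches depend on large-scale human-labeled data. The paper proposes MILO (Metric for Image- and Latent-space Optimization), a lightweight multiscale perceptual metric trained with pseudo-MOS (Mean Opinion Score) supervision: reproducible distortions are scored by an ensemble of recent quality metrics to approximate human perception, so accurate learning needs no human labels. The key innovations are: (1) a perceptual loss function based on modeling visual masking effects; (2) an extension of the metric to latent space that combines spatial masking with curriculum learning, so that in generative models such as Stable Diffusion the optimization starts with perceptually less relevant regions and progressively shifts to more visibly distorted ones, giving better perceptual alignment and lower computational cost in denoising, super-resolution, and face restoration.

Link: https://arxiv.org/abs/2509.01411
Authors: Uğur Çoğalan,Mojtaba Bemana,Karol Myszkowski,Hans-Peter Seidel,Colin Groth
Affiliation: Max Planck Institute for Informatics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 7 figures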

Abstract:We present MILO (Metric for Image- and Latent-space Optimization), a lightweight, multiscale, perceptual metric for full-reference image quality assessment (FR-IQA). MILO is trained using pseudo-MOS (Mean Opinion Score) supervision, in which reproducible distortions are applied to diverse images and scored via an ensemble of recent quality metrics that account for visual masking effects. This approach enables accurate learning without requiring large-scale human-labeled datasets. Despite its compact architecture, MILO outperforms existing metrics across standard FR-IQA benchmarks and offers fast inference suitable for real-time applications. Beyond quality prediction, we demonstrate the utility of MILO as a perceptual loss in both image and latent domains. In particular, we show that spatial masking modeled by MILO, when applied to latent representations from a VAE encoder within Stable Diffusion, enables efficient and perceptually aligned optimization. By combining spatial masking with a curriculum learning strategy, we first process perceptually less relevant regions before progressively shifting the optimization to more visually distorted areas. This strategy leads to significantly improved performance in tasks like denoising, super-resolution, and face restoration, while also reducing computational overhead. MILO thus functions as both a state-of-the-art image quality metric and as a practical tool for perceptual optimization in generative pipelines.

[CV-94] Neural Scene Designer: Self-Styled Semantic Image Manipulation

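[Quick Read]: This paper addresses the difficulty of maintaining stylistic consistency in image editing and inpainting: existing methods mostly focus on semantic control and neglect the stylistic unity between generated content and its surroundings. The core of the solution is the Neural Scene Designer (NSD) framework, built on an advanced diffusion model with two parallel cross-attention mechanisms that process text instructions and style information separately, achieving both semantic alignment and style consistency. To capture fine-grained style representations, the paper designs a Progressive Self-style Representational Learning (PSRL) module whose style contrastive loss encourages similar style representations for regions within the same image and dissimilar ones for regions from different images, which improves style preservation.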

Link: https://arxiv.org/abs/2509.01405
Authors: Jianman Lin, Tianshui Chen, Chunmei Qing, Zhijing Yang, Shuangping Huang, Yuheng Ren, Liang Lin
Affiliations: South China University of Technology; Guangdong University of Technology; Jimei University; Xiamen Kunlu AI Research Institute; Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Maintaining stylistic consistency is crucial for the cohesion and aesthetic appeal of images, a fundamental requirement in effective image editing and inpainting. However, existing methods primarily focus on the semantic control of generated content, often neglecting the critical task of preserving this consistency. In this work, we introduce the Neural Scene Designer (NSD), a novel framework that enables photo-realistic manipulation of user-specified scene regions while ensuring both semantic alignment with user intent and stylistic consistency with the surrounding environment. NSD leverages an advanced diffusion model, incorporating two parallel cross-attention mechanisms that separately process text and style information to achieve the dual objectives of semantic control and style consistency. To capture fine-grained style representations, we propose the Progressive Self-style Representational Learning (PSRL) module. This module is predicated on the intuitive premise that different regions within a single image share a consistent style, whereas regions from different images exhibit distinct styles. The PSRL module employs a style contrastive loss that encourages high similarity between representations from the same image while enforcing dissimilarity between those from different images. Furthermore, to address the lack of standardized evaluation protocols for this task, we establish a comprehensive benchmark. This benchmark includes competing algorithms, dedicated style-related metrics, and diverse datasets and settings to facilitate fair comparisons. Extensive experiments conducted on our benchmark demonstrate the effectiveness of the proposed framework.
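
The style contrastive loss described above can be illustrated with an InfoNCE-style objective: a region embedding is pulled toward another region of the same image and pushed away from regions of other images. This is a sketch, not the paper's loss. The exact formulation, the temperature value, and the toy 2D embeddings are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def style_contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: low when the anchor is close to the positive
    (same image) and far from the negatives (other images)."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


anchor = [1.0, 0.0]                      # style embedding of one region
same_image = [0.9, 0.1]                  # another region of the same image
other_images = [[0.0, 1.0], [-1.0, 0.0]]

loss_consistent = style_contrastive_loss(anchor, same_image, other_images)
# Swap roles: a dissimilar "positive" and a similar negative give a high loss.
loss_inconsistent = style_contrastive_loss(anchor, [0.0, 1.0], [same_image, [-1.0, 0.0]])
```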

[CV-95] RibPull: Implicit Occupancy Fields and Medial Axis Extraction for CT Ribcage Scans

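[Quick Read]: This paper addresses the challenges of modeling and geometrically manipulating rib structures in medical imaging, in particular the resolution limits, loss of topological information, and inefficient handling of sparsity that conventional voxel grids exhibit on sparse, noisy data. The key to the solution is a continuous 3D representation based on implicit occupancy fields: a neural network encodes the entire 3D scene and, as a coordinate function, gives a continuous prediction for any 3D point, which compensates for sparse signals and allows further geometric information to be inferred. The paper then applies a Laplacian-based contraction to extract the ribcage skeleton (medial axis), demonstrating the advantage of continuous coordinate-based representations over voxel representations for geometric operations.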

Link: https://arxiv.org/abs/2509.01402
Authors: Emmanouil Nikolakakis, Amine Ouasfi, Julie Digne, Razvan Marinescu
Affiliations: University of California, Santa Cruz; Inria; Univ. Rennes; CNRS; IRISA; Université Claude Bernard Lyon 1; Computer Science Engineering Department, University of California, Santa Cruz
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: This paper is currently being reviewed for a conference submission. If accepted an extended manuscript will be published and the code will be released

Abstract:We present RibPull, a methodology that utilizes implicit occupancy fields to bridge computational geometry and medical imaging. Implicit 3D representations use continuous functions that handle sparse and noisy data more effectively than discrete methods. While voxel grids are standard for medical imaging, they suffer from resolution limitations, topological information loss, and inefficient handling of sparsity. Coordinate functions preserve complex geometrical information and represent a better solution for sparse data representation, while allowing for further morphological operations. Implicit scene representations enable neural networks to encode entire 3D scenes within their weights. The result is a continuous function that can implicitly compensate for sparse signals and infer further information about the 3D scene by passing any combination of 3D coordinates as input to the model. In this work, we use neural occupancy fields that predict whether a 3D point lies inside or outside an object to represent CT-scanned ribcages. We also apply a Laplacian-based contraction to extract the medial axis of the ribcage, thus demonstrating a geometrical operation that benefits greatly from continuous coordinate-based 3D scene representations versus voxel-based representations. We evaluate our methodology on 20 medical scans from the RibSeg dataset, which is itself an extension of the RibFrac dataset. We will release our code upon publication.
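
An occupancy field is a function from 3D coordinates to an inside/outside value, so it can be sampled at any resolution without being tied to a voxel grid. The sketch below illustrates this. It assumes an analytic capsule shape (`occupancy`) as a stand-in for the trained neural network, which is not reproduced here.

```python
import math


def occupancy(x, y, z, radius=1.0, half_length=2.0):
    """Stand-in for a trained occupancy network: returns 1.0 inside a capsule
    aligned with the x-axis and 0.0 outside."""
    # Closest point on the segment [-half_length, half_length] of the x-axis.
    cx = max(-half_length, min(half_length, x))
    dist = math.sqrt((x - cx) ** 2 + y ** 2 + z ** 2)
    return 1.0 if dist <= radius else 0.0


def occupied_fraction(resolution, extent=3.5):
    """Query the same field on a regular grid of any resolution and
    return the fraction of samples that fall inside the shape."""
    step = 2 * extent / (resolution - 1)
    coords = [-extent + i * step for i in range(resolution)]
    inside = sum(occupancy(x, y, z) for x in coords for y in coords for z in coords)
    return inside / resolution ** 3


coarse = occupied_fraction(9)    # 9^3 queries
fine = occupied_fraction(17)     # 17^3 queries against the same function
```

The same function answers both queries, whereas a voxel grid would have to be resampled or stored again at the finer resolution.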

[CV-96] Enhancing Partially Relevant Video Retrieval with Robust Alignment Learning EMNLP2025

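[Quick Read]: This paper addresses robust alignment learning in Partially Relevant Video Retrieval (PRVR), where the core challenge is inherent data uncertainty: first, query ambiguity, where the query incompletely describes the target video and often contains uninformative tokens; second, partial video relevance, where many query-irrelevant video segments introduce contextual noise that disturbs cross-modal alignment. Existing methods mostly focus on enhancing multi-scale clip representations and retrieving the most relevant clip, but they are easily misled by spurious similarities, which limits performance. The key to the solution is the Robust Alignment Learning (RAL) framework, which explicitly models data uncertainty by encoding videos and queries as multivariate Gaussian distributions and introduces proxy-level matching to capture the variability of cross-modal correspondences; it also accounts for the heterogeneous informativeness of query tokens with learnable confidence gates that dynamically weight similarity, improving robustness to irrelevant distractors. The approach is plug-and-play and can be integrated into a variety of retrieval architectures.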

Link: https://arxiv.org/abs/2509.01383
Authors: Long Zhang, Peipei Song, Jianfeng Dong, Kun Li, Xun Yang
Affiliations: University of Science and Technology of China; MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China; Zhejiang Gongshang University; ReLER, CCAI, Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Notes: Accepted at EMNLP 2025

Abstract:Partially Relevant Video Retrieval (PRVR) aims to retrieve untrimmed videos partially relevant to a given query. The core challenge lies in learning robust query-video alignment against spurious semantic correlations arising from inherent data uncertainty: 1) query ambiguity, where the query incompletely characterizes the target video and often contains uninformative tokens, and 2) partial video relevance, where abundant query-irrelevant segments introduce contextual noise in cross-modal alignment. Existing methods often focus on enhancing multi-scale clip representations and retrieving the most relevant clip. However, the inherent data uncertainty in PRVR renders them vulnerable to distractor videos with spurious similarities, leading to suboptimal performance. To fill this research gap, we propose Robust Alignment Learning (RAL) framework, which explicitly models the uncertainty in data. Key innovations include: 1) we pioneer probabilistic modeling for PRVR by encoding videos and queries as multivariate Gaussian distributions. This not only quantifies data uncertainty but also enables proxy-level matching to capture the variability in cross-modal correspondences; 2) we consider the heterogeneous informativeness of query words and introduce learnable confidence gates to dynamically weight similarity. As a plug-and-play solution, RAL can be seamlessly integrated into the existing architectures. Extensive experiments across diverse retrieval backbones demonstrate its effectiveness.
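
Two ideas from this abstract can be sketched in a few lines: drawing proxies from a Gaussian embedding and matching them pairwise, and weighting per-token similarities with confidence gates. The sketch rests on assumptions not stated in the abstract: a diagonal covariance, dot-product similarity, sigmoid gates, and all function names are illustrative choices.

```python
import math
import random


def sample_proxies(mean, std, num_proxies, rng):
    """Draw proxies from a diagonal Gaussian via the reparameterization trick."""
    return [[m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]
            for _ in range(num_proxies)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def proxy_similarity(query_proxies, video_proxies):
    """Average dot-product similarity over all query/video proxy pairs."""
    total = sum(dot(q, v) for q in query_proxies for v in video_proxies)
    return total / (len(query_proxies) * len(video_proxies))


def gated_similarity(token_sims, gate_logits):
    """Weight per-token similarities by sigmoid confidence gates, so that
    uninformative tokens contribute little to the final score."""
    gates = [1.0 / (1.0 + math.exp(-g)) for g in gate_logits]
    return sum(g * s for g, s in zip(gates, token_sims)) / sum(gates)


rng = random.Random(0)
query = sample_proxies([1.0, 0.0], [0.1, 0.1], num_proxies=4, rng=rng)
video = sample_proxies([0.8, 0.2], [0.3, 0.3], num_proxies=4, rng=rng)
score = proxy_similarity(query, video)

# An informative token (high gate) dominates an uninformative one (low gate).
weighted = gated_similarity([0.9, 0.1], [5.0, -5.0])
```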

[CV-97] Unsupervised Ultra-High-Resolution UAV Low-Light Image Enhancement: A Benchmark Metric and Framework

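[Quick Read]: This paper addresses the marked drop in the visual performance of Unmanned Aerial Vehicles (UAVs) under low-light conditions, targeting challenges specific to aerial imagery: Ultra-High Resolution (UHR), the lack of paired annotated data, severe non-uniform illumination, and deployment constraints. The solution rests on three contributions. First, it builds U3D, the first unsupervised UHR UAV image enhancement dataset, with a unified evaluation toolkit. Second, it proposes the Edge Efficiency Index (EEI), which balances perceptual quality against deployment factors such as inference speed, resolution, model complexity, and memory footprint. Third, it designs the efficient U3LIE framework with two training-only techniques, Adaptive Pre-enhancement Augmentation (APA) for input normalization and a Luminance Interval Loss (L_int) for exposure control, reaching 23.8 FPS on 4K images on a single GPU, which suits on-board deployment.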

Link: https://arxiv.org/abs/2509.01373
Authors: Wei Lu, Lingyu Zhu, Si-Bao Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 18 pages, 10 figures

Abstract:Low-light conditions significantly degrade Unmanned Aerial Vehicle (UAV) performance in critical applications. Existing Low-light Image Enhancement (LIE) methods struggle with the unique challenges of aerial imagery, including Ultra-High Resolution (UHR), lack of paired data, severe non-uniform illumination, and deployment constraints. To address these issues, we propose three key contributions. First, we present U3D, the first unsupervised UHR UAV dataset for LIE, with a unified evaluation toolkit. Second, we introduce the Edge Efficiency Index (EEI), a novel metric balancing perceptual quality with key deployment factors: speed, resolution, model complexity, and memory footprint. Third, we develop U3LIE, an efficient framework with two training-only designs: Adaptive Pre-enhancement Augmentation (APA) for input normalization and a Luminance Interval Loss (L_int) for exposure control. U3LIE achieves SOTA results, processing 4K images at 23.8 FPS on a single GPU, making it ideal for real-time on-board deployment. In summary, these contributions provide a holistic solution (dataset, metric, and method) for advancing robust 24/7 UAV vision. The code and datasets are available at this https URL.
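
The abstract does not give the formula for the Luminance Interval Loss, so the following is only a guess at its general shape: a hinge penalty that is zero while the mean luminance stays inside a target interval and grows linearly outside it. The interval bounds, the hinge form, and the use of Rec. 601 luma weights are all assumptions.

```python
def luminance(rgb):
    """Luma of an RGB pixel in [0, 1], using Rec. 601 weights."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def luminance_interval_loss(pixels, low=0.4, high=0.6):
    """Hypothetical interval loss: zero while the mean luminance lies in
    [low, high], linear penalty for under- or over-exposure otherwise."""
    mean_luma = sum(luminance(p) for p in pixels) / len(pixels)
    return max(0.0, low - mean_luma) + max(0.0, mean_luma - high)


dark = [(0.1, 0.1, 0.1)] * 4       # underexposed output
balanced = [(0.5, 0.5, 0.5)] * 4   # well-exposed output
bright = [(0.9, 0.9, 0.9)] * 4     # overexposed output

loss_dark = luminance_interval_loss(dark)
loss_balanced = luminance_interval_loss(balanced)
loss_bright = luminance_interval_loss(bright)
```

Because such a term would only appear in the training objective, it adds no cost at inference time, which is consistent with the "training-only" description.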

[CV-98] Uirapuru: Timely Video Analytics for High-Resolution Steerable Cameras on Edge Devices

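[Quick Read]: This paper addresses the dynamic-scene challenges that high-resolution steerable cameras pose for real-time video analytics at the edge, in particular that conventional frame tiling approaches struggle to adapt to the field-of-view changes caused by camera motion. The key to the solution is the Uirapuru framework, whose core innovation is to incorporate camera actuation into the system design and pair it with fast adaptive tiling at a per-frame level, substantially improving detection accuracy or inference efficiency while keeping latency low.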

Link: https://arxiv.org/abs/2509.01371
Authors: Guilherme H. Apostolo, Pablo Bauszat, Vinod Nigade, Henri E. Bal, Lin Wang
Affiliations: Vrije Universiteit Amsterdam; Paderborn University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: 18 pages, 9 figures

Abstract:Real-time video analytics on high-resolution cameras has become a popular technology for various intelligent services like traffic control and crowd monitoring. While extensive work has been done on improving analytics accuracy with timing guarantees, virtually all of them target static viewpoint cameras. In this paper, we present Uirapuru, a novel framework for real-time, edge-based video analytics on high-resolution steerable cameras. The actuation performed by those cameras brings significant dynamism to the scene, presenting a critical challenge to existing popular approaches such as frame tiling. To address this problem, Uirapuru incorporates a comprehensive understanding of camera actuation into the system design paired with fast adaptive tiling at a per-frame level. We evaluate Uirapuru on a high-resolution video dataset, augmented by pan-tilt-zoom (PTZ) movements typical for steerable cameras and on real-world videos collected from an actual PTZ camera. Our experimental results show that Uirapuru provides up to 1.45x improvement in accuracy while respecting specified latency budgets or reaches up to 4.53x inference speedup with on-par accuracy compared to state-of-the-art static camera approaches.
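
Per-frame adaptive tiling can be pictured as choosing a tile grid for each frame from the current camera state and then cutting the frame accordingly. The sketch below is illustrative only: the zoom-based policy in `choose_grid` is a hypothetical rule, not the decision logic used by Uirapuru.

```python
def tile_frame(width, height, cols, rows):
    """Split a frame into cols x rows tiles. Returns (x0, y0, x1, y1) boxes
    that cover the frame exactly, without overlap."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            y0, y1 = r * height // rows, (r + 1) * height // rows
            boxes.append((x0, y0, x1, y1))
    return boxes


def choose_grid(zoom):
    """Hypothetical policy: wide shots contain small objects and get more
    tiles; zoomed-in shots are processed as a single tile."""
    if zoom < 2.0:
        return 4, 2
    if zoom < 5.0:
        return 2, 2
    return 1, 1


# The grid is recomputed for every frame as the camera zoom changes.
wide_tiles = tile_frame(3840, 2160, *choose_grid(1.0))
tight_tiles = tile_frame(3840, 2160, *choose_grid(8.0))
```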

[CV-99] Identity-Preserving Text-to-Video Generation via Training-Free Prompt Image and Guidance Enhancement

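[Quick Read]: This paper addresses a key challenge in identity-preserving text-to-video (IPT2V) generation: how to improve the fidelity of generated videos to the identity in the reference image, along with video quality, without relying on large amounts of fine-tuning data or costly training. The core of the solution is the Training-Free Prompt, Image, and Guidance Enhancement (TPIGE) framework, with three key innovations: 1) Face Aware Prompt Enhancement, which uses GPT-4o to extract facial details from the reference image and inject them into the text prompt to bridge the semantic gap; 2) Prompt Aware Reference Image Enhancement, which refines the reference image with an identity-preserving image generator to resolve its conflicts with the text prompt; and 3) ID-Aware Spatiotemporal Guidance Enhancement, which uses unified gradients to jointly optimize identity preservation and video quality. Without any training, the method improves input quality and generation performance, and it won first place in the ACM Multimedia 2025 Identity-Preserving Video Generation Challenge.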

Link: https://arxiv.org/abs/2509.01362
Authors: Jiayi Gao, Changcheng Hua, Qingchao Chen, Yuxin Peng, Yang Liu
Affiliations: Wangxuan Institute of Computer Technology, Peking University; National Institute of Health Data Science, Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Notes: 7 pages, 3 figures

Abstract:Identity-preserving text-to-video (IPT2V) generation creates videos faithful to both a reference subject image and a text prompt. While fine-tuning large pretrained video diffusion models on ID-matched data achieves state-of-the-art results on IPT2V, data scarcity and high tuning costs hinder broader improvement. We thus introduce a Training-Free Prompt, Image, and Guidance Enhancement (TPIGE) framework that bridges the semantic gap between the video description and the reference image and design sampling guidance that enhances identity preservation and video quality, achieving performance gains at minimal this http URL, we first propose Face Aware Prompt Enhancement, using GPT-4o to enhance the text prompt with facial details derived from the reference image. We then propose Prompt Aware Reference Image Enhancement, leveraging an identity-preserving image generator to refine the reference image, rectifying conflicts with the text prompt. The above mutual refinement significantly improves input quality before video generation. Finally, we propose ID-Aware Spatiotemporal Guidance Enhancement, utilizing unified gradients to optimize identity preservation and video quality jointly during this http URL method outperforms prior work and is validated by automatic and human evaluations on a 1000 video test set, winning first place in the ACM Multimedia 2025 Identity-Preserving Video Generation Challenge, demonstrating state-of-the-art performance and strong generality. The code is available at this https URL.

[CV-100] M3Ret: Unleashing Zero-shot Multimodal Medical Image Retrieval via Self-Supervision

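[Quick Read]: This paper addresses the fragmentation of representation learning in medical image retrieval caused by modality-specific designs: existing methods use separate architectures and training strategies for 2D, 3D, and video medical data, which hinders unified representation learning and scalability. The key to the solution is to curate a large-scale hybrid-modality dataset (867,653 medical imaging samples) and train on it M3Ret, a unified visual encoder without any modality-specific customization, using both generative (MAE) and contrastive (SimDINO) self-supervised learning paradigms. The resulting model shows cross-modal alignment and zero-shot transfer: although it never sees magnetic resonance imaging (MRI) data during pretraining, it generalizes to unseen MRI tasks, which supports the potential of purely visual self-supervision for multimodal medical image understanding.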

Link: https://arxiv.org/abs/2509.01360
Authors: Che Liu, Zheng Jiang, Chengyu Fang, Heng Guo, Yan-Jie Zhou, Jiaqi Qu, Le Lu, Minfeng Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: Technical Report

Abstract:Medical image retrieval is essential for clinical decision-making and translational research, relying on discriminative visual representations. Yet, current methods remain fragmented, relying on separate architectures and training strategies for 2D, 3D, and video-based medical data. This modality-specific design hampers scalability and inhibits the development of unified representations. To enable unified learning, we curate a large-scale hybrid-modality dataset comprising 867,653 medical imaging samples, including 2D X-rays and ultrasounds, RGB endoscopy videos, and 3D CT scans. Leveraging this dataset, we train M3Ret, a unified visual encoder without any modality-specific customization. It successfully learns transferable representations using both generative (MAE) and contrastive (SimDINO) self-supervised learning (SSL) paradigms. Our approach sets a new state-of-the-art in zero-shot image-to-image retrieval across all individual modalities, surpassing strong baselines such as DINOv3 and the text-supervised BMC-CLIP. More remarkably, strong cross-modal alignment emerges without paired data, and the model generalizes to unseen MRI tasks, despite never observing MRI during pretraining, demonstrating the generalizability of purely visual self-supervision to unseen modalities. Comprehensive analyses further validate the scalability of our framework across model and data sizes. These findings deliver a promising signal to the medical imaging community, positioning M3Ret as a step toward foundation models for visual SSL in multimodal medical image understanding.
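
Zero-shot image-to-image retrieval, as evaluated here, reduces to embedding every image with a frozen encoder and ranking gallery items by similarity to the query embedding. The sketch below assumes cosine similarity and uses made-up 3D vectors in place of real encoder outputs; the item names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query, gallery, top_k=2):
    """Rank (name, embedding) gallery items by similarity to the query
    and return the names of the top_k matches."""
    ranked = sorted(gallery, key=lambda item: cosine(query, item[1]), reverse=True)
    return [name for name, _ in ranked[:top_k]]


# Hypothetical embeddings; a real system would obtain them from the encoder.
gallery = [
    ("xray_a", [0.9, 0.1, 0.0]),
    ("ct_b", [0.0, 1.0, 0.2]),
    ("xray_c", [0.8, 0.3, 0.1]),
]
top_matches = retrieve([1.0, 0.2, 0.0], gallery)
```

No labels or task-specific training enter the ranking, which is why retrieval quality is a direct probe of the learned representation.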

[CV-101] AgroSense: An Integrated Deep Learning System for Crop Recommendation via Soil Image Analysis and Nutrient Profiling

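[Quick Read]: This paper addresses the fact that traditional soil analysis is slow, labor-intensive, and unsuited to on-field use, which makes it a poor fit for real-time agricultural decisions supporting food security and sustainable farming. The key to the solution is the AgroSense framework, which produces crop recommendations through multimodal fusion: deep models such as ResNet-18, EfficientNet-B0, and Vision Transformer classify soil images, while a Multi-Layer Perceptron, XGBoost, LightGBM, and TabNet process structured soil data (nutrient levels, pH, and rainfall). The fused model reaches 98.0% accuracy, with RMSE and MAE reduced to 0.32 and 0.27, and ablation studies and statistical validation confirm the central role of multimodal coupling.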

Link: https://arxiv.org/abs/2509.01344
Authors: Vishal Pandey, Ranjita Das, Debasmita Biswas
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: Preprint, 23 pages, 6 images, 1 table

Abstract:Meeting the increasing global demand for food security and sustainable farming requires intelligent crop recommendation systems that operate in real time. Traditional soil analysis techniques are often slow, labor-intensive, and not suitable for on-field decision-making. To address these limitations, we introduce AgroSense, a deep-learning framework that integrates soil image classification and nutrient profiling to produce accurate and contextually relevant crop recommendations. AgroSense comprises two main components: a Soil Classification Module, which leverages ResNet-18, EfficientNet-B0, and Vision Transformer architectures to categorize soil types from images; and a Crop Recommendation Module, which employs a Multi-Layer Perceptron, XGBoost, LightGBM, and TabNet to analyze structured soil data, including nutrient levels, pH, and rainfall. We curated a multimodal dataset of 10,000 paired samples drawn from publicly available Kaggle repositories, approximately 50,000 soil images across seven classes, and 25,000 nutrient profiles for experimental evaluation. The fused model achieves 98.0% accuracy, with a precision of 97.8%, a recall of 97.7%, and an F1-score of 96.75%, while RMSE and MAE drop to 0.32 and 0.27, respectively. Ablation studies underscore the critical role of multimodal coupling, and statistical validation via t-tests and ANOVA confirms the significance of our improvements. AgroSense offers a practical, scalable solution for real-time decision support in precision agriculture and paves the way for future lightweight multimodal AI systems in resource-constrained environments.

[CV-102] Street-Level Geolocalization Using Multimodal Large Language Models and Retrieval-Augmented Generation

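[Quick Read]: This paper addresses high-accuracy geolocalization from street-level images, a task on which traditional computer vision methods are increasingly challenged as social media data and smartphone cameras proliferate. The key to the solution is a new paradigm that combines open-weight, publicly available multimodal large language models (MLLMs) with retrieval-augmented generation (RAG): a vector database is first built with the SigLIP encoder over two large-scale datasets (EMP-16 and OSV-5M); then, before a query image is processed, the prompt is augmented with retrieved geolocation information from both similar and dissimilar images to improve localization accuracy. The method needs no expensive fine-tuning or retraining, scales well, and reports higher accuracy than prior approaches on three benchmarks (IM2GPS, IM2GPS3k, and YFCC4k), offering a more efficient and accessible alternative path for GeoAI.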

Link: https://arxiv.org/abs/2509.01341
Authors: Yunus Serhat Bicakci, Joseph Shingleton, Anahid Basiri
Affiliations: Marmara University; University of Glasgow
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes:

Abstract:Street-level geolocalization from images is crucial for a wide range of essential applications and services, such as navigation, location-based recommendations, and urban planning. With the growing popularity of social media data and cameras embedded in smartphones, applying traditional computer vision techniques to localize images has become increasingly challenging, yet highly valuable. This paper introduces a novel approach that integrates open-weight and publicly accessible multimodal large language models with retrieval-augmented generation. The method constructs a vector database using the SigLIP encoder on two large-scale datasets (EMP-16 and OSV-5M). Query images are augmented with prompts containing both similar and dissimilar geolocation information retrieved from this database before being processed by the multimodal large language models. Our approach has demonstrated state-of-the-art performance, achieving higher accuracy on three widely used benchmark datasets (IM2GPS, IM2GPS3k, and YFCC4k). Importantly, our solution eliminates the need for expensive fine-tuning or retraining and scales seamlessly to incorporate new data sources. The effectiveness of retrieval-augmented generation-based multimodal large language models in geolocation estimation demonstrated by this paper suggests an alternative path to the traditional methods that rely on training models from scratch, opening new possibilities for more accessible and scalable solutions in GeoAI.
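
The retrieval step described above can be sketched as: rank database entries by similarity to the query embedding, take the most and least similar ones, and write their coordinates into the prompt. Everything concrete below is an assumption made for illustration: the toy 2D embeddings (real ones would come from the SigLIP encoder), the record layout, and the prompt wording.

```python
def dot(u, v):
    """Dot product; equals cosine similarity for unit-normalized embeddings."""
    return sum(a * b for a, b in zip(u, v))


def build_prompt(query_emb, database, k=1):
    """Augment a geolocation prompt with the k most similar and the
    k least similar database entries."""
    ranked = sorted(database, key=lambda rec: dot(query_emb, rec["embedding"]),
                    reverse=True)
    similar, dissimilar = ranked[:k], ranked[-k:]
    lines = ["Estimate the latitude and longitude of the attached image."]
    lines += [f"Similar image location: {r['lat']}, {r['lon']}" for r in similar]
    lines += [f"Dissimilar image location: {r['lat']}, {r['lon']}" for r in dissimilar]
    return "\n".join(lines)


# Hypothetical database records with approximate city coordinates.
database = [
    {"embedding": [0.98, 0.2], "lat": 48.86, "lon": 2.35},
    {"embedding": [0.5, 0.87], "lat": 35.68, "lon": 139.69},
    {"embedding": [-0.9, 0.44], "lat": -33.87, "lon": 151.21},
]
prompt = build_prompt([1.0, 0.0], database)
```

The resulting string would be sent to the multimodal model together with the query image; adding new reference images only requires inserting records into the database, with no retraining.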

[CV-103] Image Quality Enhancement and Detection of Small and Dense Objects in Industrial Recycling Processes

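[Quick Read]: This paper addresses two core problems: detecting small, dense, and overlapping objects (a major challenge in computer vision) and improving the quality of noisy images in industrial environments. The key to the solution is supervised deep learning: existing methods are systematically evaluated on a new dataset of more than 10,000 images and 120,000 instances to identify the most reliable detection systems for industrial applications in terms of performance, accuracy, and computational efficiency. The paper also proposes a lightweight image enhancement model based on a fully connected convolutional network to improve image quality in industrial settings, and outlines directions for future work.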

Link: https://arxiv.org/abs/2509.01332
Authors: Oussama Messai, Abbass Zein-Eddine, Abdelouahid Bentamou, Mickaël Picq, Nicolas Duquesne, Stéphane Puydarrieux, Yann Gavet
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Notes: Event: Seventeenth International Conference on Quality Control by Artificial Vision (QCAV2025), 2025, Yamanashi Prefecture, Japan

Abstract:This paper tackles two key challenges: detecting small, dense, and overlapping objects (a major hurdle in computer vision) and improving the quality of noisy images, especially those encountered in industrial environments [1, 2]. Our focus is on evaluating methods built on supervised deep learning. We perform an analysis of these methods, using a newly developed dataset comprising over 10k images and 120k instances. By evaluating their performance, accuracy, and computational efficiency, we identify the most reliable detection systems and highlight the specific challenges they address in industrial applications. This paper also examines the use of deep learning models to improve image quality in noisy industrial environments. We introduce a lightweight model based on a fully connected convolutional network. Additionally, we suggest potential future directions for further enhancing the effectiveness of the model. The repository of the dataset and proposed model can be found at: this https URL, this https URL

[CV-104] Prior-Guided Residual Diffusion: Calibrated and Efficient Medical Image Segmentation

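[Quick Read]: This paper addresses the ambiguity inherent in medical image segmentation, where conventional models output a single point estimate and cannot describe the full conditional distribution. The authors propose the Prior-Guided Residual Diffusion (PGRD) framework. Its core idea is to embed discrete labels as one-hot targets in a continuous space, aligning segmentation with diffusion modeling; a coarse prior predictor provides step-wise guidance, and the diffusion network learns the residual relative to that prior, which accelerates convergence and improves calibration. A deep diffusion supervision scheme further stabilizes training by supervising intermediate time steps. Experiments on MRI and CT datasets show that PGRD outperforms Bayesian, ensemble, Probabilistic U-Net, and vanilla diffusion baselines, achieving higher Dice scores and lower negative log-likelihood (NLL) and expected calibration error (ECE) with fewer sampling steps.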

Link: https://arxiv.org/abs/2509.01330
Authors: Fuyou Mao, Beining Wu, Yanfeng Jiang, Han Xue, Yan Tang, Hao Zhang
Affiliations: School of Electronic Information, Central South University, Changsha, China; Hangzhou Dianzi University, Hangzhou, China; Communication University of Zhejiang, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Ambiguity in medical image segmentation calls for models that capture full conditional distributions rather than a single point estimate. We present Prior-Guided Residual Diffusion (PGRD), a diffusion-based framework that learns voxel-wise distributions while maintaining strong calibration and practical sampling efficiency. PGRD embeds discrete labels as one-hot targets in a continuous space to align segmentation with diffusion modeling. A coarse prior predictor provides step-wise guidance; the diffusion network then learns the residual to the prior, accelerating convergence and improving calibration. A deep diffusion supervision scheme further stabilizes training by supervising intermediate time steps. Evaluated on representative MRI and CT datasets, PGRD achieves higher Dice scores and lower NLL/ECE values than Bayesian, ensemble, Probabilistic U-Net, and vanilla diffusion baselines, while requiring fewer sampling steps to reach strong performance.
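
The "residual to the prior" idea can be shown on a single voxel: the label is embedded as a one-hot vector, the coarse prior supplies class probabilities, and the network's regression target is the difference between the two. This sketch covers only that bookkeeping. The noise schedule, the step-wise guidance, and the network are omitted, and the function names are illustrative.

```python
def one_hot(label, num_classes):
    """Embed a discrete label as a one-hot vector in continuous space."""
    return [1.0 if c == label else 0.0 for c in range(num_classes)]


def residual_target(label, prior_probs):
    """What the network would need to predict so that prior + residual
    recovers the one-hot label."""
    target = one_hot(label, len(prior_probs))
    return [t - p for t, p in zip(target, prior_probs)]


def combine(prior_probs, predicted_residual):
    """Final per-class estimate: coarse prior corrected by the residual."""
    return [p + r for p, r in zip(prior_probs, predicted_residual)]


prior = [0.7, 0.2, 0.1]                  # coarse prior for one voxel
residual = residual_target(0, prior)     # true class is 0
restored = combine(prior, residual)
```

When the prior is already close to the label, the residual is small, which gives some intuition for why learning it could converge faster than learning the full target.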

[CV-105] Guided Model-based LiDAR Super-Resolution for Resource-Efficient Automotive scene Segmentation

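[Quick Read]: This paper addresses the loss of 3D semantic segmentation accuracy caused by the sparse point clouds of low-cost 16-channel LiDAR; the core challenge is to increase point cloud density while preserving semantic detail under limited hardware. The key to the solution is the first end-to-end framework that jointly performs LiDAR super-resolution (SR) and semantic segmentation: joint optimization during training lets the SR module use semantic cues to recover fine details, particularly improving segmentation of small object classes. In addition, a new SR loss function directs the network toward regions of interest, and a lightweight, model-based SR architecture achieves segmentation performance comparable to that on high-resolution (e.g., 64-channel) LiDAR with far fewer parameters than existing methods.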

Link: https://arxiv.org/abs/2509.01317
Authors: Alexandros Gkillas, Nikos Piperigkos, Aris S. Lalos
Affiliations: Industrial Systems Institute, Athena Research Center, Patras Science Park, Greece; AviSense.AI, Patras Science Park, Greece
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:High-resolution LiDAR data plays a critical role in 3D semantic segmentation for autonomous driving, but the high cost of advanced sensors limits large-scale deployment. In contrast, low-cost sensors such as 16-channel LiDAR produce sparse point clouds that degrade segmentation accuracy. To overcome this, we introduce the first end-to-end framework that jointly addresses LiDAR super-resolution (SR) and semantic segmentation. The framework employs joint optimization during training, allowing the SR module to incorporate semantic cues and preserve fine details, particularly for smaller object classes. A new SR loss function further directs the network to focus on regions of interest. The proposed lightweight, model-based SR architecture uses significantly fewer parameters than existing LiDAR SR approaches, while remaining easily compatible with segmentation networks. Experiments show that our method achieves segmentation performance comparable to models operating on high-resolution and costly 64-channel LiDAR data.
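
One plausible reading of an SR loss that "focuses on regions of interest" is a reconstruction error in which ROI points carry a larger weight, combined with the segmentation loss in a single training objective. The abstract does not specify the loss, so the L1 form, the weight of 4.0, and the mixing coefficient below are assumptions.

```python
def roi_weighted_l1(pred, target, roi_mask, roi_weight=4.0):
    """Weighted mean absolute error: range values inside the region of
    interest count roi_weight times as much as the rest."""
    weights = [roi_weight if in_roi else 1.0 for in_roi in roi_mask]
    total = sum(w * abs(p - t) for w, p, t in zip(weights, pred, target))
    return total / sum(weights)


def joint_loss(sr_loss, seg_loss, sr_coeff=0.5):
    """Hypothetical joint objective for end-to-end training."""
    return seg_loss + sr_coeff * sr_loss


pred = [1.0, 2.0, 3.0]      # super-resolved range values
target = [1.0, 2.5, 3.0]    # high-resolution ground truth

# The same error is penalized more when it falls inside the ROI.
loss_in_roi = roi_weighted_l1(pred, target, [False, True, False])
loss_outside = roi_weighted_l1(pred, target, [False, False, False])
total = joint_loss(loss_in_roi, seg_loss=1.0)
```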

[CV-106] Cross-Domain Few-Shot Segmentation via Ordinary Differential Equations over Time Intervals

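[Quick Read]: This paper addresses a limitation of cross-domain few-shot segmentation (CD-FSS): independently designed modules restrict the flow of knowledge and make it hard to exploit their combined potential. The key to the solution is FSS-TIs, an all-in-one module based on ordinary differential equations (ODEs) and the Fourier transform. By modeling an ODE relationship between the spectra (amplitude and phase) of domain-specific and domain-agnostic features, it recasts the exploration of domain-agnostic feature spaces and the simulation of diverse target-domain distributions as an optimization over the intrinsic parameters of the ODE. During target-domain fine-tuning, support-sample selection is strictly constrained to match real-world task requirements, giving a structurally concise few-shot segmentation method with strong cross-domain adaptability.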

Link: https://arxiv.org/abs/2509.01299
Authors: Huan Ni, Qingshan Liu, Xiaonan Niu, Danfeng Hong, Lingli Zhao, Haiyan Guan
Affiliations: Nanjing University of Information Science & Technology; Nanjing University of Posts and Telecommunications; China Geological Survey; Southeast University; Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Cross-domain few-shot segmentation (CD-FSS) not only enables the segmentation of unseen categories with very limited samples, but also improves cross-domain generalization ability within the few-shot segmentation framework. Currently, existing CD-FSS studies typically design multiple independent modules to enhance the cross-domain generalization ability of feature representations. However, the independence among these modules hinders the effective flow of knowledge, making it difficult to fully leverage their collective potential. In contrast, this paper proposes an all-in-one module based on ordinary differential equations and Fourier transform, resulting in a structurally concise method: Few-Shot Segmentation over Time Intervals (FSS-TIs). FSS-TIs assumes the existence of an ODE relationship between the spectra (including amplitude and phase spectra) of domain-specific features and domain-agnostic features. This ODE formulation yields an iterative transformation process along a sequence of time intervals, while simultaneously applying affine transformations with randomized perturbations to the spectra. In doing so, the exploration of domain-agnostic feature representation spaces and the simulation of diverse potential target-domain distributions are reformulated as an optimization process over the intrinsic parameters of the ODE. Moreover, we strictly constrain the support-sample selection during target-domain fine-tuning so that it is consistent with the requirements of real-world few-shot segmentation tasks. For evaluation, we introduce five datasets from substantially different domains and define two sets of cross-domain few-shot segmentation tasks to comprehensively analyze the performance of FSS-TIs. Experimental results demonstrate the superiority of FSS-TIs over existing CD-FSS methods, and in-depth ablation studies further validate the cross-domain adaptability of FSS-TIs.

[CV-107] Multi-Representation Adapter with Neural Architecture Search for Efficient Range-Doppler Radar Object Detection ICANN2025

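[Quick Read]: This paper addresses efficient object detection for radar sensors, in particular the difficulty of balancing feature extraction against model efficiency on Range-Doppler (RD) radar maps in complex environments. The key to the solution is a multi-representation fusion architecture: RD radar maps are first represented both as heatmaps and as grayscale images to capture high-level semantic information and fine-grained texture features; an Adapter branch, a dual-mode Exchanger Module, and a Primary-Auxiliary Fusion Module then extract, exchange, and fuse features across representations, respectively. Finally, a supernet with various widths and fusion operations is constructed and optimized with One-Shot Neural Architecture Search, improving computational efficiency while maintaining high accuracy.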

Link: https://arxiv.org/abs/2509.01280
Authors: Zhiwei Lin, Weicheng Zheng, Yongtao Wang
Affiliations: Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted by ICANN 2025

Abstract:Detecting objects efficiently from radar sensors has recently become a popular trend due to their robustness against adverse lighting and weather conditions compared with cameras. This paper presents an efficient object detection model for Range-Doppler (RD) radar maps. Specifically, we first represent RD radar maps with multi-representation, i.e., heatmaps and grayscale images, to gather high-level object and fine-grained texture features. Then, we design an additional Adapter branch, an Exchanger Module with two modes, and a Primary-Auxiliary Fusion Module to effectively extract, exchange, and fuse features from the multi-representation inputs, respectively. Furthermore, we construct a supernet with various widths and fusion operations in the Adapter branch for the proposed model and employ a One-Shot Neural Architecture Search method to further improve the model's efficiency while maintaining high performance. Experimental results demonstrate that our model obtains a favorable accuracy and efficiency trade-off. Moreover, we achieve new state-of-the-art performance on RADDet and CARRADA datasets with mAP@50 of 71.9 and 57.1, respectively.

[CV-108] SAR-NAS: Lightweight SAR Object Detection with Neural Architecture Search

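[Quick Read]: This paper addresses three challenges in Synthetic Aperture Radar (SAR) object detection: speckle noise, ambiguity in recognizing small targets, and limited on-board computational resources. Whereas existing methods mostly focus on SAR-specific architectural modifications, this paper takes a different route and uses Neural Architecture Search (NAS) to improve the lightweight object detector YOLOv10. The key to the solution is to construct a large search space and apply evolutionary search to systematically optimize the backbone architecture, balancing detection accuracy, parameter efficiency, and computational cost. According to the paper, this is the first use of NAS for SAR object detection, and on the large-scale SARDet-100K dataset the optimized model outperforms existing methods while keeping computational overhead low.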

Link: https://arxiv.org/abs/2509.01279
Authors: Xinyi Yu, Zhiwei Lin, Yongtao Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted by PRCV 2025

Abstract:Synthetic Aperture Radar (SAR) object detection faces significant challenges from speckle noise, small target ambiguities, and on-board computational constraints. While existing approaches predominantly focus on SAR-specific architectural modifications, this paper explores the application of the existing lightweight object detector, i.e., YOLOv10, for SAR object detection and enhances its performance through Neural Architecture Search (NAS). Specifically, we employ NAS to systematically optimize the network structure, especially focusing on the backbone architecture search. By constructing an extensive search space and leveraging evolutionary search, our method identifies a favorable architecture that balances accuracy, parameter efficiency, and computational cost. Notably, this work introduces NAS to SAR object detection for the first time. The experimental results on the large-scale SARDet-100K dataset demonstrate that our optimized model outperforms existing SAR detection methods, achieving superior detection accuracy while maintaining lower computational overhead. We hope this work offers a novel perspective on leveraging NAS for real-world applications.
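
Evolutionary architecture search, in its simplest form, keeps a population of candidate architectures, retains the fittest, and mutates them to produce the next generation. The toy sketch below shows that loop. The two-parameter search space and the `fitness` proxy are invented for illustration; a real search over a YOLOv10 backbone would evaluate each candidate detector on validation data.

```python
import random

# Hypothetical search space: backbone depth and channel width.
SEARCH_SPACE = {"depth": [1, 2, 3, 4], "width": [16, 32, 64, 128]}


def fitness(arch):
    """Made-up proxy score that rewards capacity and penalizes cost."""
    size = arch["depth"] * arch["width"]
    accuracy_proxy = 1.0 - 1.0 / size ** 0.5
    cost_penalty = 0.001 * size
    return accuracy_proxy - cost_penalty


def mutate(arch, rng):
    """Resample one randomly chosen dimension of the architecture."""
    child = dict(arch)
    key = rng.choice(sorted(SEARCH_SPACE))
    child[key] = rng.choice(SEARCH_SPACE[key])
    return child


def evolutionary_search(generations=20, population_size=8, seed=0):
    rng = random.Random(seed)
    population = [{k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
                  for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]   # keep the fittest half
        children = [mutate(rng.choice(parents), rng) for _ in parents]
        population = parents + children
    return max(population, key=fitness)


best = evolutionary_search()
```

Because the fittest half always survives, the best score in the population never decreases from one generation to the next.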

[CV-109] Novel Category Discovery with X-Agent Attention for Open-Vocabulary Semantic Segmentation

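[Quick read]: This paper targets open-vocabulary semantic segmentation (OVSS), where the domain gap between base-category training and open-vocabulary inference makes it hard to model latent unseen categories discriminatively. Existing vision-language model (VLM)-based methods perform well thanks to pre-trained multi-modal representations, but how they comprehend latent semantics remains unclear, which has become a bottleneck for OVSS. The key to the solution is the X-Agent framework, which introduces latent-semantic-aware "agents" to orchestrate cross-modal attention, optimizing latent semantic dynamics while making latent semantics more perceptible and thereby improving discrimination of unseen categories.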

Link: https://arxiv.org/abs/2509.01275
Authors: Jiahao Li, Yang Lu, Yachao Zhang, Fangyong Wang, Yuan Xie, Yanyun Qu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACMMM2025

Abstract:Open-vocabulary semantic segmentation (OVSS) conducts pixel-level classification via text-driven alignment, where the domain discrepancy between base category training and open-vocabulary inference poses challenges in discriminative modeling of latent unseen categories. To address this challenge, existing vision-language model (VLM)-based approaches demonstrate commendable performance through pre-trained multi-modal representations. However, the fundamental mechanisms of latent semantic comprehension remain underexplored, which has become a bottleneck for OVSS. In this work, we initiate a probing experiment to explore distribution patterns and dynamics of latent semantics in VLMs under inductive learning paradigms. Building on these insights, we propose X-Agent, an innovative OVSS framework employing a latent semantic-aware "agent" to orchestrate cross-modal attention mechanisms, simultaneously optimizing latent semantic dynamics and amplifying their perceptibility. Extensive benchmark evaluations demonstrate that X-Agent achieves state-of-the-art performance while effectively enhancing latent semantic saliency.

[CV-110] ReCap: Event-Aware Image Captioning with Article Retrieval and Semantic Gaussian Normalization

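[Quick read]: This paper addresses the tendency of image captioning systems to produce generic descriptions that miss event-level semantics, which matters most in demanding settings such as news reporting and digital archiving. The core of the solution is ReCap, a pipeline that draws on broader context from relevant articles to generate narrative-rich, factually grounded captions. Its key techniques are: (1) a two-stage article retrieval system based on DINOv2 embeddings that selects candidates by global feature similarity and re-ranks them by patch-level mutual nearest neighbor similarity; (2) a context extraction framework that draws on article summaries, generic captions, and original source metadata; and (3) caption generation with a large language model and Semantic Gaussian Normalization to improve fluency and relevance. The method combines visual perception with real-world knowledge and makes image understanding more context-aware.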

Link: https://arxiv.org/abs/2509.01259
Authors: Thinh-Phuc Nguyen, Thanh-Hai Nguyen, Gia-Huy Dinh, Lam-Huy Nguyen, Minh-Triet Tran, Trung-Nghia Le
Affiliations: University of Science, VNU-HCM
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ACM Multimedia 2025

Abstract:Image captioning systems often produce generic descriptions that fail to capture event-level semantics which are crucial for applications like news reporting and digital archiving. We present ReCap, a novel pipeline for event-enriched image retrieval and captioning that incorporates broader contextual information from relevant articles to generate narrative-rich, factually grounded captions. Our approach addresses the limitations of standard vision-language models that typically focus on visible content while missing temporal, social, and historical contexts. ReCap comprises three integrated components: (1) a robust two-stage article retrieval system using DINOv2 embeddings with global feature similarity for initial candidate selection followed by patch-level mutual nearest neighbor similarity re-ranking; (2) a context extraction framework that synthesizes information from article summaries, generic captions, and original source metadata; and (3) a large language model-based caption generation system with Semantic Gaussian Normalization to enhance fluency and relevance. Evaluated on the OpenEvents V1 dataset as part of Track 1 in the EVENTA 2025 Grand Challenge, ReCap achieved a strong overall score of 0.54666, ranking 2nd on the private test set. These results highlight ReCap’s effectiveness in bridging visual perception with real-world knowledge, offering a practical solution for context-aware image understanding in high-stakes domains. The code is available at this https URL.
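The re-ranking step can be illustrated with a small sketch of patch-level mutual nearest neighbor matching. The abstract does not say how matches are aggregated into a score, so the mean cosine similarity over mutual pairs used here is an assumption, and the function names are hypothetical. Patch vectors would come from an embedding model such as DINOv2; here they are plain lists.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def mutual_nn_score(patches_a, patches_b):
    """Mean similarity over patch pairs that are each other's nearest neighbor.

    The aggregation (mean over mutual pairs) is an assumption, not the paper's rule.
    """
    sim = [[cosine(a, b) for b in patches_b] for a in patches_a]
    nn_ab = [max(range(len(patches_b)), key=lambda j: sim[i][j])
             for i in range(len(patches_a))]
    nn_ba = [max(range(len(patches_a)), key=lambda i: sim[i][j])
             for j in range(len(patches_b))]
    pairs = [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]
    if not pairs:
        return 0.0
    return sum(sim[i][j] for i, j in pairs) / len(pairs)


def rerank(query_patches, candidates):
    """Sort candidate names by mutual-NN score, best first.

    candidates maps a name to that candidate's list of patch vectors.
    """
    return sorted(
        candidates,
        key=lambda name: mutual_nn_score(query_patches, candidates[name]),
        reverse=True,
    )
```

In a two-stage setup, `rerank` would only be applied to the short list returned by the cheaper global-similarity stage, since the patch-level comparison is quadratic in the number of patches.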

[CV-111] Towards More Diverse and Challenging Pre-training for Point Cloud Learning: Self-Supervised Cross Reconstruction with Decoupled Views ICCV2025

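[Quick read]: This paper addresses pre-training tasks in point cloud self-supervised learning that are too easy and yield limited representations, in particular the reliance of existing generative methods on recovering masked points within a single view (self-reconstruction), which leads to redundant information and weak generalization. The key to the solution is Point-PQAE, a generative paradigm based on cross-reconstruction: a new point cloud crop mechanism first generates two decoupled views, and a novel positional encoding models the 3D relative position between them; one view is then used as input to reconstruct the other, which makes pre-training markedly harder and semantically richer. This design strengthens cross-view understanding of point cloud structure and improves on the Point-MAE baseline by 6.5%, 7.0%, and 6.7% on the three ScanObjectNN variants under the Mlp-Linear evaluation protocol.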

Link: https://arxiv.org/abs/2509.01250
Authors: Xiangdong Zhang, Shaofeng Zhang, Junchi Yan
Affiliations: Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025

Abstract:Point cloud learning, especially in a self-supervised way without manual labels, has gained growing attention in both vision and learning communities due to its potential utility in a wide range of applications. Most existing generative approaches for point cloud self-supervised learning focus on recovering masked points from visible ones within a single view. A two-view pre-training paradigm inherently introduces greater diversity and variance, and may thus enable more challenging and informative pre-training. Inspired by this, we explore the potential of two-view learning in this domain. In this paper, we propose Point-PQAE, a cross-reconstruction generative paradigm that first generates two decoupled point clouds/views and then reconstructs one from the other. To achieve this goal, we develop a crop mechanism for point cloud view generation for the first time and further propose a novel positional encoding to represent the 3D relative position between the two decoupled views. The cross-reconstruction significantly increases the difficulty of pre-training compared to self-reconstruction, which enables our method to surpass previous single-modal self-reconstruction methods in 3D self-supervised learning. Specifically, it outperforms the self-reconstruction baseline (Point-MAE) by 6.5%, 7.0%, and 6.7% in three variants of ScanObjectNN with the Mlp-Linear evaluation protocol. The code is available at this https URL.

[CV-112] Learning Correlation-aware Aleatoric Uncertainty for 3D Hand Pose Estimation BMVC2025

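[Quick read]: This paper addresses two shortcomings of existing 3D hand pose estimation methods: they cannot estimate data-dependent aleatoric uncertainty, and they lack uncertainty modeling that incorporates knowledge of joint correlations. The core of the solution is a novel parameterization in which a single linear layer captures the intrinsic correlations among hand joints; this is made possible by formulating the hand joint output space as a probability distribution. Used as a task head, the parameterization can be added on top of existing models, improving uncertainty modeling and 3D hand pose accuracy while remaining computationally efficient.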

Link: https://arxiv.org/abs/2509.01242
Authors: Lee Chae-Yeon, Nam Hyeon-Woo, Tae-Hyun Oh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: BMVC 2025. Project page: this https URL

Abstract:3D hand pose estimation is a fundamental task in understanding human hands. However, accurately estimating 3D hand poses remains challenging due to the complex movement of hands, self-similarity, and frequent occlusions. In this work, we address two limitations: the inability of existing 3D hand pose estimation methods to estimate aleatoric (data) uncertainty, and the lack of uncertainty modeling that incorporates joint correlation knowledge, which has not been thoroughly investigated. To this end, we introduce aleatoric uncertainty modeling into the 3D hand pose estimation framework, aiming to achieve a better trade-off between modeling joint correlations and computational efficiency. We propose a novel parameterization that leverages a single linear layer to capture intrinsic correlations among hand joints. This is enabled by formulating the hand joint output space as a probabilistic distribution, allowing the linear layer to capture joint correlations. Our proposed parameterization is used as a task head layer, and can be applied as an add-on module on top of the existing models. Our experiments demonstrate that our parameterization for uncertainty modeling outperforms existing approaches. Furthermore, the 3D hand pose estimation model equipped with our uncertainty head achieves favorable accuracy in 3D hand pose estimation while introducing new uncertainty modeling capability to the model. The project page is available at this https URL.
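For readers unfamiliar with correlation-aware aleatoric uncertainty, a common generic formulation is to predict a mean and a full covariance over all joint coordinates and minimize the Gaussian negative log-likelihood. The block below shows that textbook objective only; the abstract does not specify the paper's actual parameterization, and the symbols ($y$, $\mu$, $\Sigma$, $L$, $d$) are introduced here for illustration.

```latex
\mathcal{L}_{\mathrm{NLL}}
  = \frac{1}{2}\,(y-\mu)^{\top}\Sigma^{-1}(y-\mu)
  + \frac{1}{2}\log\det\Sigma
  + \frac{d}{2}\log(2\pi),
\qquad \Sigma = L L^{\top}
```

Here $y \in \mathbb{R}^{d}$ stacks the ground-truth joint coordinates, $\mu$ is the predicted pose, and $L$ is a predicted lower-triangular factor with positive diagonal, which keeps $\Sigma$ positive definite. Off-diagonal entries of $\Sigma$ carry the correlations between joints; a diagonal $\Sigma$ reduces to independent per-coordinate uncertainty.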

[CV-113] RT-DETRv2 Explained in 8 Illustrations

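[Quick read]: This paper addresses how hard real-time object detection architectures such as RT-DETRv2 are to understand, in particular the lack of clear explanations of how their components work together. Diagrams in the existing literature rarely reveal what each module does or how the modules are integrated. The key to the solution is a series of eight carefully designed illustrations that move from the overall pipeline down to key components (the encoder, the decoder, and multi-scale deformable attention), visualizing tensor flow and unpacking the logic of each module to give researchers and practitioners a clear mental model of how RT-DETRv2 works.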

Link: https://arxiv.org/abs/2509.01241
Authors: Ethan Qi Yang Chua, Jen Hong Tan
Affiliations: Singapore General Hospital
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 5 pages, 8 figures

Abstract:Object detection architectures are notoriously difficult to understand, often more so than large language models. While RT-DETRv2 represents an important advance in real-time detection, most existing diagrams do little to clarify how its components actually work and fit together. In this article, we explain the architecture of RT-DETRv2 through a series of eight carefully designed illustrations, moving from the overall pipeline down to critical components such as the encoder, decoder, and multi-scale deformable attention. Our goal is to make the existing architecture genuinely understandable. By visualizing the flow of tensors and unpacking the logic behind each module, we hope to provide researchers and practitioners with a clearer mental model of how RT-DETRv2 works under the hood.

[CV-114] FantasyHSI: Video-Generation-Centric 4D Human Synthesis In Any Scene through A Graph-based Multi-Agent Framework

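[Quick read]: This paper addresses the challenges Human-Scene Interaction (HSI) faces with long-horizon, high-level tasks and generalization to unseen scenes. Existing methods are limited by the need for paired data, logical inconsistency caused by trajectory drift, and insufficient physical realism in generated motions. The key to the solution is FantasyHSI, a framework built on video generation and multi-agent systems that needs no paired training data. It models the interaction process as a dynamic directed graph and builds a collaborative multi-agent system comprising a scene navigator agent, a planning agent, and a critic agent; the critic agent sets up a closed feedback loop that corrects trajectory drift caused by the stochasticity of the generative model, ensuring long-term logical consistency. Direct Preference Optimization (DPO) is used to train the action generator, which significantly reduces physical artifacts such as limb distortion and foot sliding.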

Link: https://arxiv.org/abs/2509.01232
Authors: Lingzhou Mu, Qiang Wang, Fan Jiang, Mengchao Wang, Yaqi Fan, Mu Xu, Kai Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL

Abstract:Human-Scene Interaction (HSI) seeks to generate realistic human behaviors within complex environments, yet it faces significant challenges in handling long-horizon, high-level tasks and generalizing to unseen scenes. To address these limitations, we introduce FantasyHSI, a novel HSI framework centered on video generation and multi-agent systems that operates without paired data. We model the complex interaction process as a dynamic directed graph, upon which we build a collaborative multi-agent system. This system comprises a scene navigator agent for environmental perception and high-level path planning, and a planning agent that decomposes long-horizon goals into atomic actions. Critically, we introduce a critic agent that establishes a closed-loop feedback mechanism by evaluating the deviation between generated actions and the planned path. This allows for the dynamic correction of trajectory drifts caused by the stochasticity of the generative model, thereby ensuring long-term logical consistency. To enhance the physical realism of the generated motions, we leverage Direct Preference Optimization (DPO) to train the action generator, significantly reducing artifacts such as limb distortion and foot-sliding. Extensive experiments on our custom SceneBench benchmark demonstrate that FantasyHSI significantly outperforms existing methods in terms of generalization, long-horizon task completion, and physical realism. Our project page: this https URL
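As background on the DPO step, the sketch below computes the standard DPO loss for a single preference pair from log-probabilities under the trained policy and a frozen reference model. This is the generic objective as originally formulated for likelihood-based models; how the paper adapts it to a video-based action generator is not described in the abstract, and the function and argument names are hypothetical.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * margin)), where the margin is how much more the
    policy prefers the chosen sample than the reference model does.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)); branch to stay numerically stable.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When policy and reference agree, the margin is zero and the loss equals log 2; it falls as the policy shifts probability toward the preferred sample (for example, a motion without foot sliding) relative to the reference.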

[CV-115] POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion EMNLP2025

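[Quick read]: This paper addresses the scarcity of high-quality labeled data for training document conversion models, especially for complex formats such as tables, formulas, and multi-column text. Manual annotation is costly and slow, existing automatic labeling methods are not accurate enough in these cases, and student models trained by distillation are limited as a result. The key to the solution is a distillation-free, automated two-stage framework: the first stage generates large-scale, diverse synthetic data to train an initial model that extracts elements in a unified format; the second stage adds a self-improvement mechanism in which the initial model annotates real documents, filtering strategies verify annotation quality, and the model is iteratively retrained on the verified data, progressively improving both conversion ability on real documents and data quality.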

Link: https://arxiv.org/abs/2509.01215
Authors: Yuan Liu, Zhongyin Zhao, Le Tian, Haicheng Wang, Xubing Ye, Yangxiu You, Zilin Yu, Chuhan Wu, Xiao Zhou, Yang Yu, Jie Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by EMNLP 2025 Main Conference

Abstract:High-quality labeled data is essential for training accurate document conversion models, particularly in domains with complex formats such as tables, formulas, and multi-column text. However, manual annotation is both costly and time-consuming, while automatic labeling using existing models often lacks accuracy in handling such challenging scenarios. Consequently, training student models by distilling outputs from teacher models can significantly limit their performance in real-world applications. In this paper, we propose a fully automated, distillation-free framework comprising two stages for constructing high-quality document extraction datasets and models capable of handling diverse document formats and layouts. In the first stage, we introduce a method for generating large-scale, diverse synthetic data, which enables a model to extract key elements in a unified format with strong initial performance. In the second stage, we present a self-improvement approach that further adapts the model, initially trained on synthetic data, to real-world documents. Specifically, we first use the fine-tuned model to annotate real documents, then apply a suite of filtering strategies to verify annotation quality, and finally retrain the model on the verified dataset. By iteratively repeating this process, we progressively enhance both the model’s conversion capabilities and the quality of the generated data. We train a public POINTS-1.5 model to obtain POINTS-Reader, which surpasses many existing public and proprietary models of comparable or larger size. Our model is available at this https URL.
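The second-stage loop (annotate, filter, retrain, repeat) can be outlined as follows. This is a structural sketch only: `annotate`, `passes_filters`, and `retrain` are hypothetical callables standing in for the model's inference, the paper's filtering strategies, and its training procedure, none of which are detailed in the abstract.

```python
def self_improve(model, real_docs, annotate, passes_filters, retrain, rounds=3):
    """Iteratively adapt a model to real documents using its own verified outputs.

    Each round: label every document with the current model, keep only the
    annotations that pass the quality filters, and retrain on what remains.
    """
    for _ in range(rounds):
        labeled = [(doc, annotate(model, doc)) for doc in real_docs]
        verified = [(doc, ann) for doc, ann in labeled if passes_filters(doc, ann)]
        if not verified:
            break  # nothing trustworthy to learn from this round
        model = retrain(model, verified)
    return model


# Toy stand-ins (assumptions): a "model" is an integer skill level, a "document"
# is an integer difficulty, and the model can handle documents up to skill + 1.
def toy_annotate(model, doc):
    return "ok" if doc <= model + 1 else "garbled"


def toy_filter(doc, annotation):
    return annotation == "ok"


def toy_retrain(model, verified):
    return max(doc for doc, _ in verified)


final_model = self_improve(1, [1, 2, 3, 4, 5], toy_annotate, toy_filter, toy_retrain)
```

In the toy run each round lets the model reach slightly harder documents, which mirrors the abstract's claim that model capability and data quality improve together over iterations.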

[CV-116] PRINTER: Deformation-Aware Adversarial Learning for Virtual IHC Staining with In Situ Fidelity

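[Quick read]: This paper addresses inaccurate registration between Hematoxylin and Eosin (HE) morphology and immunohistochemical (IHC) biomarker expression in tumor spatial heterogeneity analysis, which is caused by spatial misalignment between consecutive sections and compromises in situ pathological interpretation. The key to the solution is the PRINTER framework, whose core innovations are: (1) a prototype-driven staining pattern transfer strategy with explicit content-style decoupling; (2) GapBridge, a cyclic registration-synthesis framework that bridges the HE and IHC modalities through deformable structural alignment, so that registered features guide cross-modal style transfer while synthesized outputs iteratively refine registration; and (3) a deformation-aware adversarial learning mechanism in which a generator and a deformation-aware registration network are jointly optimized against a style-focused discriminator, markedly improving virtual staining fidelity and preservation of HE detail.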

Link: https://arxiv.org/abs/2509.01214
Authors: Yizhe Yuan, Bingsen Xue, Bangzheng Pu, Chengxiang Wang, Cheng Jin
Affiliations: Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 10 pages, 4 figures

Abstract:Tumor spatial heterogeneity analysis requires precise correlation between Hematoxylin and Eosin (HE) morphology and immunohistochemical (IHC) biomarker expression, yet current methods suffer from spatial misalignment in consecutive sections, severely compromising in situ pathological interpretation. To obtain a more accurate virtual staining pattern, we propose PRINTER, a weakly-supervised framework that integrates PRototype-drIven content and staiNing patTERn decoupling with deformation-aware adversarial learning strategies designed to accurately learn IHC staining patterns while preserving HE staining details. Our approach introduces three key innovations: (1) a prototype-driven staining pattern transfer with explicit content-style decoupling; (2) a cyclic registration-synthesis framework, GapBridge, that bridges HE and IHC domains through deformable structural alignment, where registered features guide cross-modal style transfer while synthesized outputs iteratively refine the registration; and (3) deformation-aware adversarial learning, a training framework in which a generator and a deformation-aware registration network are jointly optimized adversarially against a style-focused discriminator. Extensive experiments demonstrate that PRINTER achieves superior performance in preserving HE staining details and virtual staining fidelity, outperforming state-of-the-art methods. Our work provides a robust and scalable solution for virtual staining, advancing the field of computational pathology.

[CV-117] Measuring Image-Relation Alignment: Reference-Free Evaluation of VLMs and Synthetic Pre-training for Open-Vocabulary Scene Graph Generation

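[Quick read]: This paper addresses two key problems in Open-Vocabulary Scene Graph Generation (Open-Voc SGG): existing benchmarks have a limited vocabulary, which makes evaluating a model's open-vocabulary ability inefficient, and pre-training relies on low-quality weakly supervised data, which limits performance. For the first problem, the authors propose a reference-free metric that evaluates the open-vocabulary ability of vision-language models (VLMs) in relation prediction more fairly; for the second, they propose region-specific prompt tuning to quickly generate high-quality synthetic data. Experiments show that pre-training on this synthetic data improves the generalization of Open-Voc SGG models.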

Link: https://arxiv.org/abs/2509.01209
Authors: Maëlic Neau, Zoe Falomir, Cédric Buche, Akihiro Sugimoto
Affiliations: Umeå University; CNRS IRL 2010 CROSSING; IMT Atlantique; National Institute of Informatics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Scene Graph Generation (SGG) encodes visual relationships between objects in images as graph structures. Thanks to the advances of Vision-Language Models (VLMs), the task of Open-Vocabulary SGG has recently been proposed, in which models are evaluated on their ability to learn a wide and diverse range of relations. Current benchmarks in SGG, however, possess a very limited vocabulary, making the evaluation of open-source models inefficient. In this paper, we propose a new reference-free metric to fairly evaluate the open-vocabulary capabilities of VLMs for relation prediction. Another limitation of Open-Vocabulary SGG is the reliance on weakly supervised data of poor quality for pre-training. We also propose a new solution for quickly generating high-quality synthetic data through region-specific prompt tuning of VLMs. Experimental results show that pre-training with this new data split can benefit the generalization capabilities of Open-Voc SGG models.

[CV-118] Generalizable Self-supervised Monocular Depth Estimation with Mixture of Low-Rank Experts for Diverse Endoscopic Scenes

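[Quick read]: This paper addresses the generalization of monocular depth estimation in endoscopic scenes, in particular the challenges posed by diverse illumination conditions and complex scene features. The key to the solution is a fine-tuning mechanism based on a block-wise mixture of dynamic low-rank experts, which adaptively selects experts with few trainable parameters for weighted inference according to the input features, combined with a novel self-supervised training framework that jointly handles brightness and reflectance inconsistency. The method outperforms existing approaches on both real and simulated endoscopic datasets and generalizes zero-shot across scenes.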

Link: https://arxiv.org/abs/2509.01206
Authors: Liangjing Shao, Benshuang Chen, Chenkang Du, Xueli Liu, Xinrong Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 20 pages, 11 figures, Under Review

Abstract:Self-supervised monocular depth estimation is a significant task for low-cost and efficient three-dimensional scene perception in endoscopy. The variety of illumination conditions and scene features is still the primary challenge for generalizable depth estimation in endoscopic scenes. In this work, a self-supervised framework is proposed for monocular depth estimation in diverse endoscopic scenes. Firstly, because endoscopic scenes with different tissues exhibit varied features, a novel block-wise mixture of dynamic low-rank experts is proposed to efficiently finetune the foundation model for endoscopic depth estimation. In the proposed module, based on the input feature, different experts with a small number of trainable parameters are adaptively selected for weighted inference from mixtures of low-rank experts, which are allocated based on the training quality of each block. Moreover, a novel self-supervised training framework is proposed to jointly cope with the inconsistency of brightness and reflectance. The proposed method outperforms state-of-the-art works on both realistic and simulated endoscopic datasets. Furthermore, the proposed network also achieves the best generalization in zero-shot depth estimation on diverse endoscopic scenes. The proposed method could contribute to accurate endoscopic perception for minimally invasive measurement and surgery. The code will be released upon acceptance, while the demo video can be found here: this https URL.
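For orientation, a mixture of low-rank experts is commonly written as a frozen weight plus input-gated low-rank updates. The form below is the generic LoRA-style formulation and is given as an assumption; the abstract does not state the paper's exact gating function or its block-wise allocation rule, and the symbols ($W_0$, $A_i$, $B_i$, $g_i$, $r$, $N$) are illustrative.

```latex
y = W_0 x + \sum_{i=1}^{N} g_i(x)\, B_i A_i x,
\qquad
A_i \in \mathbb{R}^{r \times d_{\mathrm{in}}},\quad
B_i \in \mathbb{R}^{d_{\mathrm{out}} \times r},\quad
r \ll \min(d_{\mathrm{in}}, d_{\mathrm{out}}),\quad
\sum_{i=1}^{N} g_i(x) = 1
```

Only $A_i$, $B_i$, and the gate are trained while $W_0$ stays frozen, so each expert adds $r\,(d_{\mathrm{in}} + d_{\mathrm{out}})$ parameters instead of $d_{\mathrm{in}}\,d_{\mathrm{out}}$.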

[CV-119] DcMatch: Unsupervised Multi-Shape Matching with Dual-Level Consistency

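[Quick read]: This paper addresses non-rigid multi-shape matching, i.e. establishing point-to-point correspondences across multiple 3D shapes, a fundamental problem in computer vision and graphics. The core of the solution is DcMatch, an unsupervised learning framework. Its key innovation is a shape graph attention network that captures the underlying manifold structure of the whole shape collection, yielding a more expressive and robust shared latent space, with a universe predictor mapping each shape consistently into the global space. Correspondences are represented in both the spatial and spectral domains, and a novel cycle consistency loss enforces their alignment in the shared universe space, improving the accuracy and consistency of matching.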

Link: https://arxiv.org/abs/2509.01204
Authors: Tianwei Ye, Yong Ma, Xiaoguang Mei
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Establishing point-to-point correspondences across multiple 3D shapes is a fundamental problem in computer vision and graphics. In this paper, we introduce DcMatch, a novel unsupervised learning framework for non-rigid multi-shape matching. Unlike existing methods that learn a canonical embedding from a single shape, our approach leverages a shape graph attention network to capture the underlying manifold structure of the entire shape collection. This enables the construction of a more expressive and robust shared latent space, leading to more consistent shape-to-universe correspondences via a universe predictor. Simultaneously, we represent these correspondences in both the spatial and spectral domains and enforce their alignment in the shared universe space through a novel cycle consistency loss. This dual-level consistency fosters more accurate and coherent mappings. Extensive experiments on several challenging benchmarks demonstrate that our method consistently outperforms previous state-of-the-art approaches across diverse multi-shape matching scenarios. Code is available at this https URL.

[CV-120] PrediTree: A Multi-Temporal Sub-meter Dataset of Multi-Spectral Imagery Aligned With Canopy Height Maps

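[Quick read]: This paper addresses the lack of high-resolution, multi-temporal, multi-spectral datasets for training and evaluating tree height prediction models in forest monitoring. The key to the solution is PrediTree, described as the first open-source dataset of its kind: LiDAR-derived canopy height maps at 0.5 m resolution, spatially aligned with multi-temporal, multi-spectral imagery and totaling 3,141,568 images, which supports modeling of tree height dynamics from sequences of past observations. The authors also propose an encoder-decoder framework that takes multi-temporal multi-spectral images and relative time differences as input to predict tree height at sub-meter resolution. In their experiments a U-Net achieves the lowest error (masked mean squared error of 11.78%), about 12% better than ResNet-50, and about 30% lower than the same setup using only RGB bands.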

Link: https://arxiv.org/abs/2509.01202
Authors: Hiyam Debary, Mustansar Fiaz, Levente Klein
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at GAIA 2025. Dataset available on HuggingFace (this https URL)

Abstract:We present PrediTree, the first comprehensive open-source dataset designed for training and evaluating tree height prediction models at sub-meter resolution. This dataset combines very high-resolution (0.5m) LiDAR-derived canopy height maps, spatially aligned with multi-temporal and multi-spectral imagery, across diverse forest ecosystems in France, totaling 3,141,568 images. PrediTree addresses a critical gap in forest monitoring capabilities by enabling the training of deep learning methods that can predict tree growth based on multiple past observations. To make use of the PrediTree dataset, we propose an encoder-decoder framework that takes as input the multi-temporal multi-spectral imagery and the relative time differences in years between the canopy height map timestamp (target) and each image acquisition date, and predicts the canopy height for the target timestamp. The conducted experiments demonstrate that a U-Net architecture trained on the PrediTree dataset achieves the lowest masked mean squared error of 11.78%, outperforming the next-best architecture, ResNet-50, by around 12%, and cutting the error of the same experiments on fewer bands (red, green, blue only) by around 30%. This dataset is publicly available on HuggingFace, and both processing and training codebases are available on GitHub.
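A masked mean squared error averages the squared error only over pixels marked valid, for example pixels where a LiDAR height exists. The sketch below shows that basic computation on nested lists; how the paper normalizes the value to a percentage is not stated in the abstract, so no normalization is applied here, and the function name is hypothetical.

```python
def masked_mse(pred, target, mask):
    """Mean squared error over pixels where mask is truthy.

    pred, target, and mask are 2D lists of equal shape. Pixels with a falsy
    mask value (e.g. no valid canopy height) are ignored entirely.
    """
    total = 0.0
    count = 0
    for pred_row, target_row, mask_row in zip(pred, target, mask):
        for p, t, m in zip(pred_row, target_row, mask_row):
            if m:
                total += (p - t) ** 2
                count += 1
    if count == 0:
        raise ValueError("mask selects no pixels")
    return total / count
```

For example, with predictions `[[1, 2], [3, 4]]`, targets `[[1, 0], [0, 4]]`, and mask `[[1, 1], [0, 1]]`, the large error at the masked-out pixel is skipped and the result is 4/3.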

[CV-121] SegAssess: Panoramic quality mapping for robust and transferable unsupervised segmentation assessment

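[Quick read]: This paper addresses the coarse granularity, incomplete assessment, and poor cross-domain transferability of unsupervised segmentation quality assessment (SQA). The core of the solution is a new paradigm, Panoramic Quality Mapping (PQM), realized by the SegAssess deep learning framework. The key innovation is to formulate SQA as a fine-grained, four-class pixel-level segmentation task that labels every pixel of the mask under evaluation as true positive (TP), false positive (FP), true negative (TN), or false negative (FN), producing a complete quality map. SegAssess builds on an enhanced Segment Anything Model (SAM) architecture and uses the input mask as a prompt for feature integration via cross-attention; an Edge Guided Compaction (EGC) branch with an Aggregated Semantic Filter (ASF) module refines predictions near object edges, and an Augmented Mixup Sampling (AMS) strategy integrates multi-source masks to strengthen cross-domain robustness and zero-shot transferability.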

Link: https://arxiv.org/abs/2509.01183
Authors: Bingnan Yang, Mi Zhang, Zhili Zhang, Zhan Zhang, Yuanxin Zhao, Xiangyun Hu, Jianya Gong
Affiliations: Wuhan University; National University of Defense Technology; State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing; Hubei Luojia Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:High-quality image segmentation is fundamental to pixel-level geospatial analysis in remote sensing, necessitating robust segmentation quality assessment (SQA), particularly in unsupervised settings lacking ground truth. Although recent deep learning (DL) based unsupervised SQA methods show potential, they often suffer from coarse evaluation granularity, incomplete assessments, and poor transferability. To overcome these limitations, this paper introduces Panoramic Quality Mapping (PQM) as a new paradigm for comprehensive, pixel-wise SQA, and presents SegAssess, a novel deep learning framework realizing this approach. SegAssess distinctively formulates SQA as a fine-grained, four-class panoramic segmentation task, classifying pixels within a segmentation mask under evaluation into true positive (TP), false positive (FP), true negative (TN), and false negative (FN) categories, thereby generating a complete quality map. Leveraging an enhanced Segment Anything Model (SAM) architecture, SegAssess uniquely employs the input mask as a prompt for effective feature integration via cross-attention. Key innovations include an Edge Guided Compaction (EGC) branch with an Aggregated Semantic Filter (ASF) module to refine predictions near challenging object edges, and an Augmented Mixup Sampling (AMS) training strategy integrating multi-source masks to significantly boost cross-domain robustness and zero-shot transferability. Comprehensive experiments across 32 datasets derived from 6 sources demonstrate that SegAssess achieves state-of-the-art (SOTA) performance and exhibits remarkable zero-shot transferability to unseen masks, establishing PQM via SegAssess as a robust and transferable solution for unsupervised SQA. The code is available at this https URL.
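The four-class quality map is easiest to see when ground truth is available, which is how such target maps could be built for training; at inference time the network has to predict the same map without ground truth. The sketch below derives the map from a binary mask and a binary reference. The integer class ids are hypothetical.

```python
# Hypothetical class ids for the four quality categories.
TP, FP, TN, FN = 0, 1, 2, 3


def quality_map(mask, ground_truth):
    """Label each pixel of a binary mask as TP, FP, TN, or FN.

    mask and ground_truth are 2D lists of 0/1 values with the same shape.
    """
    result = []
    for mask_row, gt_row in zip(mask, ground_truth):
        row = []
        for m, g in zip(mask_row, gt_row):
            if m and g:
                row.append(TP)   # predicted foreground, truly foreground
            elif m and not g:
                row.append(FP)   # predicted foreground, truly background
            elif not m and not g:
                row.append(TN)   # predicted background, truly background
            else:
                row.append(FN)   # predicted background, truly foreground
        result.append(row)
    return result
```

Standard scores such as precision, recall, or IoU can then be read directly off the per-class pixel counts of the map.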

[CV-122] FocusDPO: Dynamic Preference Optimization for Multi-Subject Personalized Image Generation via Adaptive Focus

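[Quick read]: This paper addresses the difficulty of fine-grained, independent control in multi-subject personalized image generation, focusing on preserving subject fidelity and preventing cross-subject attribute leakage without test-time optimization. The key to the solution is FocusDPO, a framework that adaptively identifies focus regions from dynamic semantic correspondence and, during training, adjusts the weighting of those regions according to the complexity of the reference images: information-rich patches are rewarded and regions with low prediction confidence are penalized. By allocating attention dynamically during Direct Preference Optimization (DPO) and establishing robust subject correspondences between generated and reference images, the method reduces attribute leakage and improves subject fidelity across scenarios.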

Link: https://arxiv.org/abs/2509.01181
Authors: Qiaoqiao Jin, Siming Fu, Dong She, Weinan Jia, Hualiang Wang, Mu Liu, Jidong Jiang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multi-subject personalized image generation aims to synthesize customized images containing multiple specified subjects without requiring test-time optimization. However, achieving fine-grained independent control over multiple subjects remains challenging due to difficulties in preserving subject fidelity and preventing cross-subject attribute leakage. We present FocusDPO, a framework that adaptively identifies focus regions based on dynamic semantic correspondence and supervision image complexity. During training, our method progressively adjusts these focal areas across noise timesteps, implementing a weighted strategy that rewards information-rich patches while penalizing regions with low prediction confidence. The framework dynamically adjusts focus allocation during the DPO process according to the semantic complexity of reference images and establishes robust correspondence mappings between generated and reference subjects. Extensive experiments demonstrate that our method substantially enhances the performance of existing pre-trained personalized generation models, achieving state-of-the-art results on both single-subject and multi-subject personalized image synthesis benchmarks. Our method effectively mitigates attribute leakage while preserving superior subject fidelity across diverse generation scenarios, advancing the frontier of controllable multi-subject image synthesis.

[CV-123] DynaMind: Reconstructing Dynamic Visual Scenes from EEG by Aligning Temporal Dynamics and Multimodal Semantics to Guided Diffusion

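[Quick read]: This paper addresses the reconstruction of dynamic visual scenes from electroencephalography (EEG) signals. The core challenges are the low spatial resolution of EEG, the temporal mismatch between neural recordings and video dynamics, and the underuse of semantic information in brain activity, which together make it hard for existing methods to preserve both temporal consistency and complex semantic context in reconstructed video. The key to the solution is the DynaMind framework, whose three core modules jointly model neural dynamics and semantic features: a Regional-aware Semantic Mapper (RSM) extracts multimodal semantic features and builds a diffusion prior, a Temporal-aware Dynamic Aligner (TDA) generates a temporally consistent latent sequence aligned with neural activity, and a Dual-Guidance Video Reconstructor (DGVR) turns this temporal blueprint into high-fidelity video under the guidance of the semantic diffusion prior. On the SEED-DV dataset the method raises video- and frame-level accuracy and improves pixel-level quality and temporal coherence.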

Link: https://arxiv.org/abs/2509.01177
Authors: Junxiang Liu, Junming Lin, Jiangtong Li, Jie Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
Comments: 14 pages, 6 figures

Abstract:Reconstructing dynamic visual scenes from electroencephalography (EEG) signals remains a primary challenge in brain decoding, limited by the low spatial resolution of EEG, a temporal mismatch between neural recordings and video dynamics, and the insufficient use of semantic information within brain activity. Therefore, existing methods often inadequately resolve both the dynamic coherence and the complex semantic context of the perceived visual stimuli. To overcome these limitations, we introduce DynaMind, a novel framework that reconstructs video by jointly modeling neural dynamics and semantic features via three core modules: a Regional-aware Semantic Mapper (RSM), a Temporal-aware Dynamic Aligner (TDA), and a Dual-Guidance Video Reconstructor (DGVR). The RSM first utilizes a regional-aware encoder to extract multimodal semantic features from EEG signals across distinct brain regions, aggregating them into a unified diffusion prior. Meanwhile, the TDA generates a dynamic latent sequence, or blueprint, to enforce temporal consistency between the feature representations and the original neural recordings. Finally, guided by the semantic diffusion prior, the DGVR translates the temporal-aware blueprint into a high-fidelity video reconstruction. On the SEED-DV dataset, DynaMind sets a new state-of-the-art (SOTA), boosting reconstructed video accuracies (video- and frame-based) by 12.5 and 10.3 percentage points, respectively. It also achieves a leap in pixel-level quality, showing exceptional visual fidelity and temporal coherence with a 9.4% SSIM improvement and a 19.7% FVMD reduction. This marks a critical advancement, bridging the gap between neural dynamics and high-fidelity visual semantics.

[CV-124] MVTrajecter: Multi-View Pedestrian Tracking with Trajectory Motion Cost and Trajectory Appearance Cost ICCV2025

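[Quick read]: This paper addresses unstable trajectories in Multi-View Pedestrian Tracking (MVPT) that arise when association relies only on the current frame and a single adjacent past frame. Existing end-to-end methods do not make full use of past trajectory information and are prone to wrong associations in complex scenes. The key to the solution is a new method, Multi-View Trajectory Tracker (MVTrajecter), whose core innovation is a trajectory motion cost and a trajectory appearance cost that incorporate motion and appearance information from multiple past timestamps for more robust cross-frame association; an attention mechanism models dependencies between timestamps, improving trajectory consistency and accuracy.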

Link: https://arxiv.org/abs/2509.01157
Authors: Taiga Yamane, Ryo Masumura, Satoshi Suzuki, Shota Orihashi
Affiliations: NTT Human Informatics Laboratories, NTT Corporation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025

Abstract:Multi-View Pedestrian Tracking (MVPT) aims to track pedestrians in the form of a bird’s eye view occupancy map from multi-view videos. End-to-end methods that detect and associate pedestrians within one model have shown great progress in MVPT. The motion and appearance information of pedestrians is important for the association, but previous end-to-end MVPT methods rely only on the current and its single adjacent past timestamp, discarding the past trajectories before that. This paper proposes a novel end-to-end MVPT method called Multi-View Trajectory Tracker (MVTrajecter) that utilizes information from multiple timestamps in past trajectories for robust association. MVTrajecter introduces trajectory motion cost and trajectory appearance cost to effectively incorporate motion and appearance information, respectively. These costs calculate which pedestrians at the current and each past timestamp are likely identical based on the information between those timestamps. Even if a current pedestrian could be associated with a false pedestrian at some past timestamp, these costs enable the model to associate that current pedestrian with the correct past trajectory based on other past timestamps. In addition, MVTrajecter effectively captures the relationships between multiple timestamps leveraging the attention mechanism. Extensive experiments demonstrate the effectiveness of each component in MVTrajecter and show that it outperforms the previous state-of-the-art methods.
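The benefit of looking at several past timestamps can be shown with a toy association rule. In the paper the motion and appearance costs are produced by the model; the hand-written average distance below is only a hypothetical stand-in that illustrates why one misleading timestamp gets outvoted by the rest of a trajectory.

```python
import math


def trajectory_cost(detection, past_positions):
    """Average distance from the current detection to a trajectory's past positions.

    Toy stand-in (assumption) for learned motion and appearance costs.
    """
    dists = [math.hypot(detection[0] - x, detection[1] - y) for x, y in past_positions]
    return sum(dists) / len(dists)


def associate(detection, trajectories, history=None):
    """Return the id of the lowest-cost trajectory.

    trajectories maps an id to a list of (x, y) positions, most recent first.
    history limits how many past timestamps are used (None means all of them).
    """
    def cost(track_id):
        return trajectory_cost(detection, trajectories[track_id][:history])

    return min(trajectories, key=cost)


# Track B happens to sit next to the detection at the latest timestamp only.
tracks = {
    "A": [(4.0, 4.0), (3.0, 3.0), (2.0, 2.0)],
    "B": [(5.0, 5.2), (20.0, 20.0), (21.0, 21.0)],
}
current = (5.0, 5.0)
single_step_choice = associate(current, tracks, history=1)
multi_step_choice = associate(current, tracks)
```

With a single past timestamp the detection is matched to track B, which only coincidentally passed nearby; using the full history assigns it to track A.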

[CV-125] MetaSSL: A General Heterogeneous Loss for Semi-Supervised Medical Image Segmentation

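【Quick read】: This paper targets semi-supervised learning (SSL) for medical image segmentation, which aims to reduce dependence on annotated data but, in existing methods, overlooks both the potential noise in labeled data and the heterogeneity of unlabeled pixels. The core solution is a general framework, MetaSSL, built on a spatially heterogeneous loss that uses both the uncertainty and the consistency between the reference prediction and the supervised prediction to assign different weights to pixels in different regions. The loss splits predictions on unlabeled data into four regions: Unanimous and Confident (UC), Unanimous and Suspicious (US), Discrepant and Confident (DC), and Discrepant and Suspicious (DS), with an adaptive threshold separating confident from suspicious predictions. The same mechanism is applied to labeled data for robustness to annotation noise. The method plugs into most existing SSL frameworks, and experiments show that it significantly improves segmentation performance.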

Link: https://arxiv.org/abs/2509.01144
Authors: Weiren Zhao, Lanfeng Zhong, Xin Liao, Wenjun Liao, Sichuan Zhang, Shaoting Zhang, Guotai Wang
Affiliations: University of Electronic Science and Technology of China; Shanghai Artificial Intelligence Laboratory; Sichuan Cancer Hospital and Institute; West China Second University Hospital, Sichuan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 12 figures. This work has been accepted by IEEE TMI

Abstract:Semi-Supervised Learning (SSL) is important for reducing the annotation cost for medical image segmentation models. State-of-the-art SSL methods such as Mean Teacher, FixMatch and Cross Pseudo Supervision (CPS) are mainly based on consistency regularization or pseudo-label supervision between a reference prediction and a supervised prediction. Despite the effectiveness, they have overlooked the potential noise in the labeled data, and mainly focus on strategies to generate the reference prediction, while ignoring the heterogeneous values of different unlabeled pixels. We argue that effectively mining the rich information contained by the two predictions in the loss function, instead of the specific strategy to obtain a reference prediction, is more essential for SSL, and propose a universal framework MetaSSL based on a spatially heterogeneous loss that assigns different weights to pixels by simultaneously leveraging the uncertainty and consistency information between the reference and supervised predictions. Specifically, we split the predictions on unlabeled data into four regions with decreasing weights in the loss: Unanimous and Confident (UC), Unanimous and Suspicious (US), Discrepant and Confident (DC), and Discrepant and Suspicious (DS), where an adaptive threshold is proposed to distinguish confident predictions from suspicious ones. The heterogeneous loss is also applied to labeled images for robust learning considering the potential annotation noise. Our method is plug-and-play and general to most existing SSL methods. The experimental results showed that it improved the segmentation performance significantly when integrated with existing SSL frameworks on different datasets. Code is available at this https URL.
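
The four-region split can be sketched for a binary segmentation case as follows. This is a minimal illustration, not the released code: the weight values, the fixed (rather than adaptive) threshold, and the rule that a pixel counts as confident only when both predictions are confident are all assumptions made here.

```python
# Hypothetical weights, decreasing in the order the abstract gives (UC, US, DC, DS).
REGION_WEIGHTS = {"UC": 1.0, "US": 0.75, "DC": 0.5, "DS": 0.25}

def region(ref_prob, sup_prob, threshold):
    """Label one pixel from two foreground probabilities in [0, 1]."""
    unanimous = (ref_prob >= 0.5) == (sup_prob >= 0.5)
    # Confidence: distance of the less certain probability from 0.5, rescaled to [0, 1].
    confidence = 2.0 * min(abs(ref_prob - 0.5), abs(sup_prob - 0.5))
    confident = confidence >= threshold
    return ("U" if unanimous else "D") + ("C" if confident else "S")

def weighted_loss(pixel_losses, ref_probs, sup_probs, threshold=0.6):
    """Weighted mean of per-pixel losses (inputs must be non-empty and equal length)."""
    weights = [REGION_WEIGHTS[region(r, s, threshold)]
               for r, s in zip(ref_probs, sup_probs)]
    return sum(w * l for w, l in zip(weights, pixel_losses)) / sum(weights)

print(region(0.95, 0.90, 0.6), region(0.60, 0.70, 0.6),
      region(0.95, 0.05, 0.6), region(0.55, 0.45, 0.6))
```

In the paper the threshold is adaptive; a fixed value is used here only to keep the sketch short.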

[CV-126] RealMat: Realistic Materials with Diffusion and Reinforcement Learning

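【Quick read】: This paper addresses the limited realism of current generative material models: most existing methods are trained on synthetic data, which gives precise supervision but leaves a significant visual gap to real-world materials, while datasets of real flash photographs are limited in scale and diversity. The key to the solution is RealMat, a diffusion-based material generation framework that improves realism through two-stage finetuning. It first finetunes a pretrained Stable Diffusion XL (SDXL) on grids of synthetic material maps, inheriting part of SDXL's realism while learning the synthetic data distribution. It then applies reinforcement learning (RL) with a realism reward function built from a large-scale collection of real material images under natural lighting, which further optimizes the outputs and increases their realism.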

Link: https://arxiv.org/abs/2509.01134
Authors: Xilong Zhou, Pedro Figueiredo, Miloš Hašan, Valentin Deschaintre, Paul Guerrero, Yiwei Hu, Nima Khademi Kalantari
Affiliations: Max Planck Institute for Informatics; Texas A&M University; Adobe Research
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 11 figures

Abstract:Generative models for high-quality materials are particularly desirable to make 3D content authoring more accessible. However, the majority of material generation methods are trained on synthetic data. Synthetic data provides precise supervision for material maps, which is convenient but also tends to create a significant visual gap with real-world materials. Alternatively, recent work used a small dataset of real flash photographs to guarantee realism; however, such data is limited in scale and diversity. To address these limitations, we propose RealMat, a diffusion-based material generator that leverages realistic priors, including a text-to-image model and a dataset of realistic material photos under natural lighting. In RealMat, we first finetune a pretrained Stable Diffusion XL (SDXL) with synthetic material maps arranged in 2×2 grids. This way, our model inherits some realism of SDXL while learning the data distribution of the synthetic material grids. Still, this creates a realism gap, with some generated materials appearing synthetic. We propose to further finetune our model through reinforcement learning (RL), encouraging the generation of realistic materials. We develop a realism reward function for any material image under natural lighting, by collecting a large-scale dataset of realistic material images. We show that this approach increases generated materials’ realism compared to our base model and related work.

[CV-127] GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation

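【Quick read】: This paper addresses the limitation that conventional image tokenization is constrained to uniform 2D/1D grids and cannot flexibly represent regions of differing shape, texture, and position, which limits the effectiveness of the feature representation. The core solution is the GPSToken framework, which uses parametric 2D Gaussians to dynamically model the position, shape, and texture of image regions and thereby achieves non-uniform tokenization. The key innovations are an entropy-driven algorithm that partitions the image into texture-homogeneous regions, a joint representation of each region by Gaussian parameters (mean for position, covariance for shape) and texture features, and a specialized Transformer that optimizes the Gaussian parameters so that tokens adapt continuously. A differentiable splatting-based renderer reconstructs the Gaussian-parameterized tokens into 2D feature maps, enabling end-to-end training. Because the method disentangles spatial layout (Gaussian parameters) from texture features, it supports efficient two-stage generation: a lightweight network first synthesizes the structural layout, and textures are then generated conditioned on that structure.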

Link: https://arxiv.org/abs/2509.01109
Authors: Zhengqiang Zhang, Rongyuan Wu, Lingchen Sun, Lei Zhang
Affiliations: The Hong Kong Polytechnic University; OPPO Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Effective and efficient tokenization plays an important role in image representation and generation. Conventional methods, constrained by uniform 2D/1D grid tokenization, are inflexible to represent regions with varying shapes and textures and at different locations, limiting their efficacy of feature representation. In this work, we propose GPSToken, a novel Gaussian Parameterized Spatially-adaptive Tokenization framework, to achieve non-uniform image tokenization by leveraging parametric 2D Gaussians to dynamically model the shape, position, and textures of different image regions. We first employ an entropy-driven algorithm to partition the image into texture-homogeneous regions of variable sizes. Then, we parameterize each region as a 2D Gaussian (mean for position, covariance for shape) coupled with texture features. A specialized transformer is trained to optimize the Gaussian parameters, enabling continuous adaptation of position/shape and content-aware feature extraction. During decoding, Gaussian parameterized tokens are reconstructed into 2D feature maps through a differentiable splatting-based renderer, bridging our adaptive tokenization with standard decoders for end-to-end training. GPSToken disentangles spatial layout (Gaussian parameters) from texture features to enable efficient two-stage generation: structural layout synthesis using lightweight networks, followed by structure-conditioned texture generation. Experiments demonstrate the state-of-the-art performance of GPSToken, which achieves rFID and FID scores of 0.65 and 1.50 on image reconstruction and generation tasks using 128 tokens, respectively. Codes and models of GPSToken can be found at this https URL.
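
To make the decoding step concrete, the sketch below splats Gaussian-parameterized tokens onto a pixel grid. It is a simplified illustration under stated assumptions, not the paper's renderer: each token carries a single scalar feature, and pixel values are a weight-normalized average over tokens. The names `gaussian_weight` and `splat` are hypothetical.

```python
import math

def gaussian_weight(x, y, mean, cov):
    """Unnormalized 2D Gaussian response of a token at pixel (x, y).

    cov is a symmetric positive-definite 2x2 matrix ((a, b), (b, c)).
    """
    (a, b), (_, c) = cov
    det = a * c - b * b
    dx, dy = x - mean[0], y - mean[1]
    # Squared Mahalanobis distance via the closed-form inverse of a 2x2 matrix.
    maha = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * maha)

def splat(tokens, height, width):
    """Render tokens [(mean, cov, feature), ...] into a height x width feature map."""
    fmap = []
    for y in range(height):
        row = []
        for x in range(width):
            weights = [gaussian_weight(x, y, m, c) for m, c, _ in tokens]
            total = sum(weights)
            blended = sum(w * f for w, (_, _, f) in zip(weights, tokens))
            row.append(blended / total if total > 0 else 0.0)
        fmap.append(row)
    return fmap

identity = ((1.0, 0.0), (0.0, 1.0))
tokens = [((0.0, 0.0), identity, 1.0), ((3.0, 3.0), identity, 0.0)]
fmap = splat(tokens, 4, 4)
print(round(fmap[0][0], 4), round(fmap[3][3], 4))
```

Every operation here is differentiable in the means, covariances, and features, which is the property that lets a splatting renderer sit inside an end-to-end trained pipeline.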

[CV-128] FICGen: Frequency-Inspired Contextual Disentanglement for Layout-driven Degraded Image Generation ICCV2025

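【Quick read】: This paper addresses the limited generative fidelity and weak alignment with user-provided layouts that layout-to-image (L2I) generation suffers in degraded scenes (e.g., low-light, underwater), which it attributes to a “contextual illusion dilemma”. The core solution is the Frequency-Inspired Contextual Disentanglement Generative (FICGen) paradigm, whose key idea is to transfer frequency knowledge of degraded images into the latent diffusion space and use frequency-aware guidance to render degraded instances and their backgrounds with high quality. It first introduces a learnable dual-query mechanism paired with dedicated frequency resamplers to extract contextual frequency prototypes from the training set, then injects these prototypes through a visual-frequency enhanced attention module. An instance coherence map regulates latent-space disentanglement between individual instances and their surroundings, and an adaptive spatial-frequency aggregation module reconstructs the mixed degraded representations, which alleviates contextual illusion and attribute leakage.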

Link: https://arxiv.org/abs/2509.01107
Authors: Wenzhuang Wang, Yifan Zhao, Mingcan Ma, Ming Liu, Zhonglin Jiang, Yong Chen, Jia Li
Affiliations: Beihang University; Geely Automobile Research Institute (Ningbo) Co., Ltd
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 19 figures, ICCV 2025

Abstract:Layout-to-image (L2I) generation has exhibited promising results in natural domains, but suffers from limited generative fidelity and weak alignment with user-provided layouts when applied to degraded scenes (e.g., low-light, underwater). We primarily attribute these limitations to the “contextual illusion dilemma” in degraded conditions, where foreground instances are overwhelmed by context-dominant frequency distributions. Motivated by this, our paper proposes a new Frequency-Inspired Contextual Disentanglement Generative (FICGen) paradigm, which seeks to transfer frequency knowledge of degraded images into the latent diffusion space, thereby facilitating the rendering of degraded instances and their surroundings via contextual frequency-aware guidance. To be specific, FICGen consists of two major steps. Firstly, we introduce a learnable dual-query mechanism, each paired with a dedicated frequency resampler, to extract contextual frequency prototypes from pre-collected degraded exemplars in the training set. Secondly, a visual-frequency enhanced attention is employed to inject frequency prototypes into the degraded generation process. To alleviate the contextual illusion and attribute leakage, an instance coherence map is developed to regulate latent-space disentanglement between individual instances and their surroundings, coupled with an adaptive spatial-frequency aggregation module to reconstruct spatial-frequency mixed degraded representations. Extensive experiments on 5 benchmarks involving a variety of degraded scenarios, from severe low-light to mild blur, demonstrate that FICGen consistently surpasses existing L2I methods in terms of generative fidelity, alignment and downstream auxiliary trainability.

[CV-129] Robix: A Unified Model for Robot Interaction Reasoning and Planning

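【Quick read】: This paper addresses the lack of a unified cognitive architecture for robots in complex interactive tasks, namely how to integrate robot reasoning, task planning, and natural language interaction in one end-to-end framework so that a robot can respond to diverse instructions and carry out long-horizon tasks. The key to the solution is the Robix model, a unified vision-language architecture that uses chain-of-thought reasoning and a three-stage training strategy: continued pretraining (to strengthen embodied reasoning, including 3D spatial understanding, visual grounding, and task-centric reasoning), supervised finetuning (to model human-robot interaction and task planning as a unified reasoning-action sequence), and reinforcement learning (to improve reasoning-action consistency and long-horizon task coherence). This gives the robot new capabilities such as proactive dialogue, real-time interruption handling, and context-aware commonsense reasoning during task execution.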

Link: https://arxiv.org/abs/2509.01106
Authors: Huang Fang, Mengxi Zhang, Heng Dong, Wei Li, Zixuan Wang, Qifeng Zhang, Xueyun Tian, Yucheng Hu, Hang Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Tech report. Project page: this https URL

Abstract:We introduce Robix, a unified model that integrates robot reasoning, task planning, and natural language interaction within a single vision-language architecture. Acting as the high-level cognitive layer in a hierarchical robot system, Robix dynamically generates atomic commands for the low-level controller and verbal responses for human interaction, enabling robots to follow complex instructions, plan long-horizon tasks, and interact naturally with humans within an end-to-end framework. Robix further introduces novel capabilities such as proactive dialogue, real-time interruption handling, and context-aware commonsense reasoning during task execution. At its core, Robix leverages chain-of-thought reasoning and adopts a three-stage training strategy: (1) continued pretraining to enhance foundational embodied reasoning abilities including 3D spatial understanding, visual grounding, and task-centric reasoning; (2) supervised finetuning to model human-robot interaction and task planning as a unified reasoning-action sequence; and (3) reinforcement learning to improve reasoning-action consistency and long-horizon task coherence. Extensive experiments demonstrate that Robix outperforms both open-source and commercial baselines (e.g., GPT-4o and Gemini 2.5 Pro) in interactive task execution, demonstrating strong generalization across diverse instruction types (e.g., open-ended, multi-stage, constrained, invalid, and interrupted) and various user-involved tasks such as table bussing, grocery shopping, and dietary filtering.

[CV-130] PVINet: Point-Voxel Interlaced Network for Point Cloud Compression

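【Quick read】: This paper addresses the fact that reconstruction quality in point cloud compression depends on jointly modeling global structure and local context, whereas existing methods usually process the two kinds of information sequentially and lack effective interaction between them. The key to the solution is a point-voxel interlaced network (PVINet) that extracts global structural features (with a voxel-based encoder, Ev) and local contextual features (with a point-based encoder, Ep) in parallel and lets them interact at every scale to improve feature perception efficiency. It introduces a novel conditional sparse convolution that uses point embeddings to dynamically customize the convolution kernels, enabling feature interaction from Ep to Ev. The decoder likewise uses point embeddings as guidance when reconstructing the point cloud, which strengthens the fusion of global and local information.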

Link: https://arxiv.org/abs/2509.01097
Authors: Xuan Deng, Xingtao Wang, Xiandong Meng, Xiaopeng Fan, Debin Zhao
Affiliations: Harbin Institute of Technology; Peng Cheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In point cloud compression, the quality of a reconstructed point cloud relies on both the global structure and the local context, with existing methods usually processing global and local information sequentially and lacking communication between these two types of information. In this paper, we propose a point-voxel interlaced network (PVINet), which captures global structural features and local contextual features in parallel and performs interactions at each scale to enhance feature perception efficiency. Specifically, PVINet contains a voxel-based encoder (Ev) for extracting global structural features and a point-based encoder (Ep) that models local contexts centered at each voxel. Particularly, a novel conditional sparse convolution is introduced, which applies point embeddings to dynamically customize kernels for voxel feature extraction, facilitating feature interactions from Ep to Ev. During decoding, a voxel-based decoder employs conditional sparse convolutions to incorporate point embeddings as guidance to reconstruct the point cloud. Experiments on benchmark datasets show that PVINet delivers competitive performance compared to state-of-the-art methods.

[CV-131] An End-to-End Framework for Video Multi-Person Pose Estimation

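【Quick read】: This paper addresses the problems of conventional two-stage methods for video human pose estimation: separating the spatial and temporal dimensions prevents them from capturing global spatio-temporal context, and their reliance on separate detectors and complex post-processing (such as RoI cropping and non-maximum suppression) lowers inference efficiency. The key to the solution is an end-to-end framework, VEPE (Video End-to-End Pose Estimation), which models temporal information with three core spatio-temporal Transformer components: the Spatio-Temporal Pose Encoder (STPE), the Spatio-Temporal Deformable Memory Encoder (STDME), and the Spatio-Temporal Pose Decoder (STPD). An instance consistency mechanism enhances the consistency and discrepancy of cross-frame instance queries, which improves pose query matching and provides instance tracking. On the Posetrack dataset the approach outperforms most two-stage models and improves inference efficiency by 300%.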

Link: https://arxiv.org/abs/2509.01095
Authors: Zhihong Wei
Affiliations: University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Video-based human pose estimation models aim to address scenarios that cannot be effectively solved by static image models such as motion blur, out-of-focus and occlusion. Most existing approaches consist of two stages: detecting human instances in each image frame and then using a temporal model for single-person pose estimation. This approach separates the spatial and temporal dimensions and cannot capture the global spatio-temporal context between spatial instances for end-to-end optimization. In addition, it relies on separate detectors and complex post-processing such as RoI cropping and NMS, which reduces the inference efficiency of the video scene. To address the above problems, we propose VEPE (Video End-to-End Pose Estimation), a simple and flexible framework for end-to-end pose estimation in video. The framework utilizes three crucial spatio-temporal Transformer components: the Spatio-Temporal Pose Encoder (STPE), the Spatio-Temporal Deformable Memory Encoder (STDME), and the Spatio-Temporal Pose Decoder (STPD). These components are designed to effectively utilize temporal context for optimizing human body pose estimation. Furthermore, to reduce the mismatch problem during the cross-frame pose query matching process, we propose an instance consistency mechanism, which aims to enhance the consistency and discrepancy of the cross-frame instance query and realize the instance tracking function, which in turn accurately guides the pose query to perform cross-frame matching. Extensive experiments on the Posetrack dataset show that our approach outperforms most two-stage models and improves inference efficiency by 300%.

[CV-132] Bidirectional Sparse Attention for Faster Video Diffusion Training

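【Quick read】: This paper addresses the computational bottleneck that video diffusion Transformer (DiT) models face when generating high-resolution, long-duration videos. The core challenge is the quadratic complexity of full attention, which makes training and inference prohibitively expensive. The paper attributes the inefficiency to two factors: excessive computation given the inherent sparsity of Queries and Key-Value (KV) pairs, and redundant computation because fixed sparse patterns cannot exploit DiT's dynamic attention. The key to the solution is a Bidirectional Sparse Attention (BSA) framework, the first to dynamically sparsify both Queries and KV pairs within 3D full attention. Query sparsity is optimized by selecting the most informative query tokens via semantic similarity, combined with a dynamic spatial-time training strategy; KV sparsity is achieved by retaining only the most salient KV blocks according to a statistical dynamic threshold. The result is a reduction in FLOPs of up to 20x while preserving or even surpassing the generative quality of full attention.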

Link: https://arxiv.org/abs/2509.01085
Authors: Chenlu Zhan, Wen Li, Chuyu Shen, Jun Zhang, Suhui Wu, Hao Zhang
Affiliations: ByteDance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Video diffusion Transformer (DiT) models excel in generative quality but hit major computational bottlenecks when producing high-resolution, long-duration videos. The quadratic complexity of full attention leads to prohibitively high training and inference costs. Full attention inefficiency stems from two key challenges: excessive computation due to the inherent sparsity of Queries and Key-Value pairs, and redundant computation as fixed sparse patterns fail to leverage DiT’s dynamic attention. To overcome this limitation, we propose a Bidirectional Sparse Attention (BSA) framework for faster video DiT training, the first to dynamically sparsify both Queries and Key-Value pairs within 3D full attention, thereby substantially improving training and inference efficiency. BSA addresses these issues through two key components. Query sparsity is optimized by selecting the most informative query tokens via semantic similarity and with a dynamic spatial-time training strategy, while KV sparsity is achieved by computing a statistical dynamic threshold to retain only the most salient KV blocks for computation. Extensive experiments demonstrate that BSA significantly accelerates DiT training across long sequences, reducing FLOPs by up to 20x and achieving 17.79x faster attention training, while preserving or even surpassing the generative quality of full attention.
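
One way to picture a statistical dynamic threshold for KV sparsity is sketched below. The abstract does not specify the statistic, so this example assumes a mean-plus-k-standard-deviations rule over per-block saliency scores; the function name, the `k` parameter, and the fallback to the single best block are all hypothetical choices for illustration.

```python
from statistics import mean, pstdev

def select_kv_blocks(block_scores, k=0.0):
    """Return indices of KV blocks whose score reaches mean + k * std.

    The threshold is recomputed from the scores themselves, so it adapts to
    each attention map instead of following a fixed sparse pattern.
    """
    threshold = mean(block_scores) + k * pstdev(block_scores)
    keep = [i for i, s in enumerate(block_scores) if s >= threshold]
    if not keep:
        # Keep at least the most salient block so attention is never empty.
        keep = [max(range(len(block_scores)), key=block_scores.__getitem__)]
    return keep

scores = [0.1, 0.9, 0.2, 0.8]
print(select_kv_blocks(scores))         # blocks at or above the mean
print(select_kv_blocks(scores, k=1.0))  # a stricter threshold keeps fewer blocks
```

Raising `k` trades attention coverage for fewer blocks to compute, which is the kind of knob a sparsity schedule would tune.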

[CV-133] SpectMamba: Integrating Frequency and State Space Models for Enhanced Medical Image Detection

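【Quick read】: This paper addresses the difficulty of achieving both efficiency and accuracy in medical image abnormality detection. Convolutional neural networks (CNNs) have limited receptive fields and struggle to capture global context, while Transformer-based models incur prohibitive computational cost on high-resolution medical images. The key to the solution is the SpectMamba architecture, with two core innovations: 1) a Hybrid Spatial-Frequency Attention (HSFA) block that learns high- and low-frequency features separately, mitigating the loss of high-frequency information caused by frequency bias and correlating frequency-domain features with spatial features; and 2) a Visual State-Space Module (VSSM) combined with a novel Hilbert Curve Scanning strategy that strengthens local dependencies and long-range spatial correlations, adapting the Mamba framework to medical images.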

Link: https://arxiv.org/abs/2509.01080
Authors: Yao Wang, Dong Yang, Zhi Qiao, Wenjian Huang, Liuzhi Yang, Zhen Qian
Affiliations: United Imaging Healthcare
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Abnormality detection in medical imaging is a critical task requiring both high efficiency and accuracy to support effective diagnosis. While convolutional neural networks (CNNs) and Transformer-based models are widely used, both face intrinsic challenges: CNNs have limited receptive fields, restricting their ability to capture broad contextual information, and Transformers encounter prohibitive computational costs when processing high-resolution medical images. Mamba, a recent innovation in natural language processing, has gained attention for its ability to process long sequences with linear complexity, offering a promising alternative. Building on this foundation, we present SpectMamba, the first Mamba-based architecture designed for medical image detection. A key component of SpectMamba is the Hybrid Spatial-Frequency Attention (HSFA) block, which separately learns high- and low-frequency features. This approach effectively mitigates the loss of high-frequency information caused by frequency bias and correlates frequency-domain features with spatial features, thereby enhancing the model’s ability to capture global context. To further improve long-range dependencies, we propose the Visual State-Space Module (VSSM) and introduce a novel Hilbert Curve Scanning technique to strengthen spatial correlations and local dependencies, further optimizing the Mamba framework. Comprehensive experiments show that SpectMamba achieves state-of-the-art performance while being both effective and efficient across various medical image detection tasks.
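
For readers unfamiliar with Hilbert curve scanning, the sketch below flattens a square feature grid along a Hilbert curve using the standard index-to-coordinate conversion. It illustrates the general idea only; the paper's exact scanning scheme may differ, and the function names and the power-of-two grid size are assumptions.

```python
def hilbert_d2xy(n, d):
    """Map index d along a Hilbert curve to (x, y) on an n x n grid (n a power of 2)."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate or flip the quadrant so that sub-curves join end to end.
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def hilbert_scan(grid):
    """Flatten a square grid (list of rows) into a sequence in Hilbert order."""
    n = len(grid)
    order = [hilbert_d2xy(n, d) for d in range(n * n)]
    return [grid[y][x] for x, y in order]

grid = [[4 * r + c for c in range(4)] for r in range(4)]
print(hilbert_scan(grid))
```

Unlike a row-by-row raster scan, consecutive positions in this ordering are always spatial neighbours, so a sequence model sees nearby patches close together in the sequence.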

[CV-134] A Unified Low-level Foundation Model for Enhancing Pathology Image Quality

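【Quick read】: This paper addresses the long-standing lack of a unified framework for low-level vision tasks on digital pathology images (such as image enhancement, super-resolution, deblurring, and denoising), together with the varied degradations that real-world pathology images suffer from staining variability, slide preparation artifacts, and imaging constraints. Conventional methods are mostly task-specific and adapt poorly to diverse clinical scenarios. The key to the solution is the first Low-level Pathology Foundation Model (LPFM), whose core innovations are: 1) a contrastive pre-trained encoder, trained on 190 million unlabeled pathology images, that extracts transferable, stain-invariant feature representations and robustly identifies degradation patterns; and 2) a unified conditional diffusion process that adapts to different tasks (such as virtual staining or image restoration) via textual prompts, giving multi-task control and high-quality output within a single architecture. Trained on a dataset of 87,810 whole slide images (WSIs), the method shows statistically significant improvements over state-of-the-art methods in 56 of 66 tasks (p<0.01), with PSNR gains of 10-15% for image restoration and SSIM gains of 12-18% for virtual staining.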

Link: https://arxiv.org/abs/2509.01071
Authors: Ziyi Liu, Zhe Xu, Jiabo Ma, Wenqaing Li, Junlin Hou, Fuxiang Huang, Xi Wang, Ronald Cheong Kin Chan, Terence Tsz Wai Wong, Hao Chen
Affiliations: The Hong Kong University of Science and Technology; The Chinese University of Hong Kong; HKUST Shenzhen-Hong Kong Collaborative Innovation Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Foundation models have revolutionized computational pathology by achieving remarkable success in high-level diagnostic tasks, yet the critical challenge of low-level image enhancement remains largely unaddressed. Real-world pathology images frequently suffer from degradations such as noise, blur, and low resolution due to slide preparation artifacts, staining variability, and imaging constraints, while the reliance on physical staining introduces significant costs, delays, and inconsistency. Although existing methods target individual problems like denoising or super-resolution, their task-specific designs lack the versatility to handle the diverse low-level vision challenges encountered in practice. To bridge this gap, we propose the first unified Low-level Pathology Foundation Model (LPFM), capable of enhancing image quality in restoration tasks, including super-resolution, deblurring, and denoising, as well as facilitating image translation tasks like virtual staining (H&E and special stains), all through a single adaptable architecture. Our approach introduces a contrastive pre-trained encoder that learns transferable, stain-invariant feature representations from 190 million unlabeled pathology images, enabling robust identification of degradation patterns. A unified conditional diffusion process dynamically adapts to specific tasks via textual prompts, ensuring precise control over output quality. Trained on a curated dataset of 87,810 whole slide images (WSIs) across 34 tissue types and 5 staining protocols, LPFM demonstrates statistically significant improvements (p<0.01) over state-of-the-art methods in most tasks (56/66), achieving Peak Signal-to-Noise Ratio (PSNR) gains of 10-15% for image restoration and Structural Similarity Index Measure (SSIM) improvements of 12-18% for virtual staining.
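
As background for the reported PSNR gains, the standard definition of PSNR is 10·log10(MAX² / MSE). The sketch below computes it on flat lists of pixel intensities; it shows the textbook metric only and makes no claim about the paper's evaluation protocol. The function name and the default 8-bit peak value are assumptions.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length intensity sequences."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

print(psnr([0, 255], [0, 245]))
```

Because PSNR is logarithmic, a relative gain such as 10-15% in dB corresponds to a much larger reduction in mean squared error.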

[CV-135] Seeing through Unclear Glass: Occlusion Removal with One Shot

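【Quick read】: This paper addresses the degradation of images taken through glass contaminated by a variety of real-world contaminants (such as muddy water, dirt, and other small foreign particles), which attenuate incoming light and scatter stray light and thus severely reduce image quality. Existing deep learning methods mostly train on synthetic data and handle only a single contaminant type such as raindrops, whereas this paper focuses on the more challenging task of restoring images affected by many types of real contaminants. The key to the solution is an all-in-one model with a one-shot test-time adaptation mechanism: a self-supervised auxiliary learning task updates the trained model at test time for the particular contaminant type in each test image, enabling effective restoration of previously unseen real contamination.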

Link: https://arxiv.org/abs/2509.01033
Authors: Qiang Li, Yuanming Cao
Affiliations: McMaster University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Images taken through window glass are often degraded by contaminants adhered to the glass surfaces. Such contaminants cause occlusions that attenuate the incoming light and scatter stray light towards the camera. Most existing deep learning methods for neutralizing the effects of contaminated glass rely on synthetic training data. A few researchers have used real degraded and clean image pairs, but they only considered removing or alleviating the effects of raindrops on glass. This paper is concerned with the more challenging task of learning the restoration of images taken through glass contaminated by a wide range of occluders, including muddy water, dirt and other small foreign particles found in reality. To facilitate the learning task we have gone to great lengths to acquire real paired images with and without glass contaminants. More importantly, we propose an all-in-one model to neutralize contaminants of different types by utilizing the one-shot test-time adaptation mechanism. It involves a self-supervised auxiliary learning task to update the trained model for the unique occlusion type of each test image. Experimental results show that the proposed method outperforms the state-of-the-art methods quantitatively and qualitatively in cleaning realistic contaminated images, especially the unseen ones.

[CV-136] CompSlider: Compositional Slider for Disentangled Multiple-Attribute Image Generation ICCV2025

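【Quick read】: This paper addresses the difficulty of fine-grained control over multiple attributes in text-to-image (T2I) generation. Existing slider-based methods train a separate adapter for each attribute and ignore the entanglement among attributes, so controlling several attributes together causes interference and prevents independent, reliable adjustment. The key to the solution is the CompSlider framework, which generates a conditional prior for the T2I foundation model to control multiple attributes simultaneously, and introduces novel disentanglement and structure losses so that multiple attribute changes can be composed while the overall image structure stays consistent. Because it operates in the latent space of the conditional prior and does not retrain the foundation model, it reduces the computational burden; the authors also extend it to video generation to show its generality.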

Link: https://arxiv.org/abs/2509.01028
Authors: Zixin Zhu, Kevin Duarte, Mamshad Nayeem Rizve, Chengyuan Xu, Ratheesh Kalarot, Junsong Yuan
Affiliations: University at Buffalo; Adobe GenAI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025

Abstract:In text-to-image (T2I) generation, achieving fine-grained control over attributes - such as age or smile - remains challenging, even with detailed text prompts. Slider-based methods offer a solution for precise control of image attributes. Existing approaches typically train an individual adapter for each attribute separately, overlooking the entanglement among multiple attributes. As a result, interference occurs among different attributes, preventing precise control of multiple attributes together. To address this challenge, we aim to disentangle multiple attributes in slider-based generation to enable more reliable and independent attribute manipulation. Our approach, CompSlider, can generate a conditional prior for the T2I foundation model to control multiple attributes simultaneously. Furthermore, we introduce novel disentanglement and structure losses to compose multiple attribute changes while maintaining structural consistency within the image. Since CompSlider operates in the latent space of the conditional prior and does not require retraining the foundation model, it reduces the computational burden for both training and inference. We evaluate our approach on a variety of image attributes and highlight its generality by extending to video generation.

[CV-137] AI-driven Dispensing of Coral Reseeding Devices for Broad-scale Restoration of the Great Barrier Reef

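【Quick read】: This paper addresses the threatened collapse of coral reef ecosystems. The core challenge is that manual restoration is hard to scale and cannot keep pace with the large-scale coral decline driven by climate change, ocean acidification, and pollution. The key to the solution is automation: combining artificial intelligence (AI), computer vision, and robotics to deploy coral re-seeding devices autonomously. Automated substrate classification is the core technique; it identifies areas of the seafloor suitable for coral growth, which greatly reduces reliance on human experts and increases the range and efficiency of restoration. In real-world tests on the Great Barrier Reef the system achieved 77.8% deployment accuracy with real-time inference at 5.5 frames per second.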

Link: https://arxiv.org/abs/2509.01019
Authors: Scarlett Raine, Benjamin Moshirian, Tobias Fischer
Affiliations: Queensland University of Technology; Australian Institute of Marine Science; Australian Research Council; Reef Restoration and Adaptation Program
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 6 pages, 3 figures

Abstract:Coral reefs are on the brink of collapse, with climate change, ocean acidification, and pollution leading to a projected 70-90% loss of coral species within the next decade. Restoration efforts are crucial, but their success hinges on introducing automation to upscale efforts. We present automated deployment of coral re-seeding devices powered by artificial intelligence, computer vision, and robotics. Specifically, we perform automated substrate classification, enabling detection of areas of the seafloor suitable for coral growth, thus significantly reducing reliance on human experts and increasing the range and efficiency of restoration. Real-world testing of the algorithms on the Great Barrier Reef leads to deployment accuracy of 77.8%, sub-image patch classification of 89.1%, and real-time model inference at 5.5 frames per second. Further, we present and publicly contribute a large collection of annotated substrate image data to foster future research in this area.

[CV-138] Weather-Dependent Variations in Driver Gaze Behavior: A Case Study in Rainy Conditions

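【Quick read】: This paper addresses the poorly understood changes in drivers' visual attention under rainy conditions, with the aim of informing the design of more robust driver monitoring systems (DMS) and advanced driver assistance systems (ADAS). The key to the solution is a two-step clustering analysis of gaze behavior: gaze points are first clustered within 10-second windows, and the cluster centroids are then aggregated into meta-clusters. Combined with Markov transition matrices and metrics such as fixation duration and the distributions of gaze elevation and azimuth, this quantifies clear differences between rainy and clear conditions, including more frequent dashboard glances, longer fixation durations, and higher gaze elevation. These differences indicate increased cognitive focus in the rain and provide empirical support for gaze-modeling-based assistance systems.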

Link: https://arxiv.org/abs/2509.01013
Authors: Ghazal Farhani, Taufiq Rahman, Dominique Charlebois
Affiliations: National Research Council Canada; Transport Canada
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the 2025 IEEE International Conference on Vehicular Electronics and Safety (ICVES)

Abstract:Rainy weather significantly increases the risk of road accidents due to reduced visibility and vehicle traction. Understanding how experienced drivers adapt their visual perception through gaze behavior under such conditions is critical for designing robust driver monitoring systems (DMS) and for informing advanced driver assistance systems (ADAS). This case study investigates the eye gaze behavior of a driver operating the same highway route under both clear and rainy conditions. To this end, gaze behavior was analyzed by a two-step clustering approach: first, clustering gaze points within 10-second intervals, and then aggregating cluster centroids into meta-clusters. This, along with Markov transition matrices and metrics such as fixation duration, gaze elevation, and azimuth distributions, reveals meaningful behavioral shifts. While the overall gaze behavior focused on the road with occasional mirror checks remains consistent, rainy conditions lead to more frequent dashboard glances, longer fixation durations, and higher gaze elevation, indicating increased cognitive focus. These findings offer valuable insight into visual attention patterns under adverse conditions and highlight the potential of leveraging gaze modeling to aid in the design of more robust ADAS and DMS.
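
The Markov transition matrices mentioned above can be estimated from a sequence of gaze-region labels by counting consecutive pairs and normalizing each row. The sketch below assumes the meta-cluster labels are already available; the label names and the function name are hypothetical, and the study's own estimation details are not given in the abstract.

```python
from collections import Counter

def transition_matrix(labels, states):
    """Estimate first-order transition probabilities from a label sequence.

    Returns a nested dict: matrix[a][b] is the fraction of transitions out of
    state a that go to state b (a row of zeros if a is never left).
    """
    counts = Counter(zip(labels, labels[1:]))
    matrix = {}
    for a in states:
        total = sum(counts[(a, b)] for b in states)
        matrix[a] = {b: (counts[(a, b)] / total if total else 0.0) for b in states}
    return matrix

states = ["road", "mirror", "dashboard"]
labels = ["road", "road", "mirror", "road", "dashboard", "road"]
matrix = transition_matrix(labels, states)
print(matrix["road"])
```

Comparing the `road` to `dashboard` entry between a clear-weather sequence and a rainy one is the kind of contrast the study reports as more frequent dashboard glances.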
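The Markov transition matrix mentioned in the abstract can be illustrated with a minimal sketch. It assumes each time window has already been assigned a gaze-zone label by the clustering step; the zone names and the example sequence are hypothetical, not data from the paper.

```python
from collections import Counter

def transition_matrix(labels, states):
    """Row-normalised first-order Markov transition matrix.

    labels: sequence of gaze-zone labels, one per time window.
    states: ordered list of zone names (defines row/column order).
    """
    counts = Counter(zip(labels, labels[1:]))  # count consecutive label pairs
    matrix = []
    for src in states:
        row = [counts[(src, dst)] for dst in states]
        total = sum(row)
        # Zones with no outgoing transition keep an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# Hypothetical zones and sequence, for illustration only.
states = ["road", "mirror", "dashboard"]
sequence = ["road", "road", "mirror", "road", "dashboard", "road", "road"]
P = transition_matrix(sequence, states)
for name, row in zip(states, P):
    print(name, row)
```

Comparing the "road" row of such a matrix between clear and rainy drives is one simple way to see a shift such as more frequent dashboard glances.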

[CV-139] Towards Integrating Multi-Spectral Imaging with Gaussian Splatting

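[Quick read]: This paper addresses the geometric inconsistency that arises when multi-spectral imagery is combined with RGB images in the 3D Gaussian Splatting (3DGS) framework: although the real scene geometry is identical, optimizing each band (red, green, red-edge, near-infrared) separately yields differing reconstructed geometry and lowers overall reconstruction quality. The key to the solution is a Joint strategy that integrates multi-spectral data directly into the spherical harmonics color components, compactly modeling each Gaussian's multi-spectral reflectance, and jointly optimizes geometry and spectral parameters across all bands so that information is shared between them. Experiments show that this improves multi-spectral reconstruction quality and also enhances RGB results through spectral cross-talk. The analysis further reveals trade-offs in when and how spectral bands are introduced during optimization, offering practical guidance for robust multi-modal 3DGS reconstruction.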

Link: https://arxiv.org/abs/2509.00989
Authors: Josef Grün, Lukas Meyer, Maximilian Weiherer, Bernhard Egger, Marc Stamminger, Linus Franke
Affiliations: Visual Computing Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: for project page, see this https URL

Abstract:We present a study of how to integrate color (RGB) and multi-spectral imagery (red, green, red-edge, and near-infrared) into the 3D Gaussian Splatting (3DGS) framework, a state-of-the-art explicit radiance-field-based method for fast and high-fidelity 3D reconstruction from multi-view images. While 3DGS excels on RGB data, naive per-band optimization of additional spectra yields poor reconstructions due to inconsistently appearing geometry in the spectral domain. This problem is prominent, even though the actual geometry is the same, regardless of spectral modality. To investigate this, we evaluate three strategies: 1) Separate per-band reconstruction with no shared structure. 2) Splitting optimization, in which we first optimize RGB geometry, copy it, and then fit each new band to the model by optimizing both geometry and band representation. 3) Joint, in which the modalities are jointly optimized, optionally with an initial RGB-only phase. We showcase through quantitative metrics and qualitative novel-view renderings on multi-spectral datasets the effectiveness of our dedicated optimized Joint strategy, increasing overall spectral reconstruction as well as enhancing RGB results through spectral cross-talk. We therefore suggest integrating multi-spectral data directly into the spherical harmonics color components to compactly model each Gaussian’s multi-spectral reflectance. Moreover, our analysis reveals several key trade-offs in when and how to introduce spectral bands during optimization, offering practical insights for robust multi-modal 3DGS reconstruction.
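The idea of placing extra bands in the spherical harmonics (SH) color components can be sketched as follows. This is only an illustration under assumptions: it evaluates degree-1 SH with the sign convention, 0.5 offset, and clamp commonly seen in 3DGS code for RGB, and simply widens the per-coefficient vector from 3 channels to an arbitrary number of bands. The band count and ordering are hypothetical, and the paper's actual parameterization may differ.

```python
C0 = 0.28209479177387814  # degree-0 real SH constant
C1 = 0.4886025119029199   # degree-1 real SH constant

def eval_sh_deg1(sh, direction):
    """Evaluate degree-1 SH for any number of spectral bands.

    sh: 4 coefficient rows (DC + 3 linear terms), one value per band in each row.
    direction: unit view direction (x, y, z).
    """
    x, y, z = direction
    n_bands = len(sh[0])
    out = []
    for b in range(n_bands):
        value = (C0 * sh[0][b]
                 - C1 * y * sh[1][b]
                 + C1 * z * sh[2][b]
                 - C1 * x * sh[3][b])
        out.append(max(value + 0.5, 0.0))  # offset and clamp (assumed convention)
    return out

# Hypothetical band layout: R, G, B, red-edge, near-infrared.
n_bands = 5
sh = [[0.0] * n_bands for _ in range(4)]
sh[0][4] = 1.0  # give the last band a non-zero DC term
print(eval_sh_deg1(sh, (0.0, 0.0, 1.0)))
```

Because all bands share the Gaussian's position, scale, and opacity and differ only in these coefficients, the geometry is optimized once for every modality.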

[CV-140] Seeing More, Saying More: Lightweight Language Experts are Dynamic Video Token Compressors EMNLP2025

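[Quick read]: This paper addresses the efficiency bottleneck that large video-language models face from the sheer number of visual tokens. Existing token compression strategies use a fixed compression ratio and cannot adapt to differences in semantic density between clips, so information-rich clips are under-represented while static or content-poor clips waste computation. The key to the solution is LangDC (Language-aware Dynamic Token Compressor), which uses a lightweight language model to describe each video clip and converts the description into soft caption tokens that serve as the visual representation. Trained with semantic density-aware supervision, it adjusts the compression ratio dynamically: semantically dense clips receive more tokens to preserve key visual cues, while sparse clips are compressed more aggressively.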

Link: https://arxiv.org/abs/2509.00969
Authors: Xiangchen Wang, Jinrui Zhang, Teng Wang, Haigang Zhang, Feng Zheng
Affiliations: Southern University of Science and Technology; The University of Hong Kong; Shenzhen Polytechnic University; Spatialtemporal AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 8 figures, EMNLP2025

Abstract:Recent advancements in large video-language models have revolutionized video understanding tasks. However, their efficiency is significantly constrained by processing high volumes of visual tokens. Existing token compression strategies apply a fixed compression ratio, ignoring the variability in semantic density among different video clips. Consequently, this leads to inadequate representation of information-rich clips due to insufficient tokens and unnecessary computation on static or content-poor ones. To address this, we propose LangDC, a Language-aware Dynamic Token Compressor. LangDC leverages a lightweight language model to describe video clips, converting them into soft caption tokens as visual representations. Trained with our proposed semantic density-aware supervision, LangDC aims to 1) cover key visual cues necessary for downstream task reasoning and 2) dynamically adjust compression ratios based on scene richness, reflected by description length. Our design mimics how humans dynamically express what they see: complex scenes (seeing more) elicit more detailed language to convey nuances (saying more), whereas simpler scenes are described with fewer words. Experimental results show that our method reduces FLOPs by 49% compared to VideoGPT+ while maintaining competitive performance. Furthermore, qualitative results demonstrate our approach adaptively adjusts the token compression ratio based on video segment richness.

[CV-141] DarkVRAI: Capture-Condition Conditioning and Burst-Order Selective Scan for Low-light RAW Video Denoising

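[Quick read]: This paper addresses low-light RAW video denoising, where the core challenge is severe signal degradation caused by high sensor gain and short exposure times, which are themselves constrained by video frame-rate requirements. The key to the solution is the DarkVRAI framework, which introduces two mechanisms: (1) it extends a conditioning scheme that has worked for image denoising to video denoising, explicitly using capture metadata to guide alignment and denoising; and (2) it adds a Burst-Order Selective Scan (BOSS) mechanism that models long-range temporal dependencies in the noisy video sequence. Together these give state-of-the-art performance on a rigorous, realistic benchmark dataset, setting a new standard for low-light video denoising.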

Link: https://arxiv.org/abs/2509.00917
Authors: Youngjin Oh, Junhyeong Kwon, Junyoung Park, Nam Ik Cho
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Low-light RAW video denoising is a fundamentally challenging task due to severe signal degradation caused by high sensor gain and short exposure times, which are inherently limited by video frame rate requirements. To address this, we propose DarkVRAI, a novel framework that achieved first place in the AIM 2025 Low-light RAW Video Denoising Challenge. Our method introduces two primary contributions: (1) a successful application of a conditioning scheme for image denoising, which explicitly leverages capture metadata, to video denoising to guide the alignment and denoising processes, and (2) a Burst-Order Selective Scan (BOSS) mechanism that effectively models long-range temporal dependencies within the noisy video sequence. By synergistically combining these components, DarkVRAI demonstrates state-of-the-art performance on a rigorous and realistic benchmark dataset, setting a new standard for low-light video denoising.

[CV-142] GS-TG: 3D Gaussian Splatting Accelerator with Tile Grouping for Reducing Redundant Sorting while Preserving Rasterization Efficiency

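[Quick read]: This paper addresses the insufficient frame rate (FPS) of 3D Gaussian Splatting (3D-GS) in real-time applications. The core challenge is a trade-off between redundant sorting and rasterization efficiency: a larger tile size reduces redundant sorting but introduces unnecessary rasterization computation. The key to the solution is GS-TG, a tile-grouping-based accelerator architecture. During sorting, it groups several small tiles so that they share one sorting result, which greatly reduces redundant computation; during rasterization, a bitmask assigned to each Gaussian identifies the relevant small tiles so that the shared sorting result can be reused efficiently. The method needs no retraining or fine-tuning and integrates with existing optimization techniques; experiments show an average speed-up of 1.54x over state-of-the-art 3D-GS accelerators.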

Link: https://arxiv.org/abs/2509.00911
Authors: Joongho Jo, Jongsun Park
Affiliations: Unknown
Categories: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV)
Comments: DAC 2025

Abstract:3D Gaussian Splatting (3D-GS) has emerged as a promising alternative to neural radiance fields (NeRF) as it offers high speed as well as high image quality in novel view synthesis. Despite these advancements, 3D-GS still struggles to meet the frames per second (FPS) demands of real-time applications. In this paper, we introduce GS-TG, a tile-grouping-based accelerator that enhances 3D-GS rendering speed by reducing redundant sorting operations and preserving rasterization efficiency. GS-TG addresses a critical trade-off issue in 3D-GS rendering: increasing the tile size effectively reduces redundant sorting operations, but it concurrently increases unnecessary rasterization computations. During sorting, GS-TG therefore groups small tiles into large tiles so that sorting operations are shared across the tiles within each group, significantly reducing redundant computations. During rasterization, a bitmask assigned to each Gaussian identifies the relevant small tiles, enabling efficient sharing of sorting results. Consequently, GS-TG enables sorting to be performed as if a large tile size is used by grouping tiles during the sorting stage, while allowing rasterization to proceed with the original small tiles by using bitmasks in the rasterization stage. GS-TG is a lossless method requiring no retraining or fine-tuning and it can be seamlessly integrated with previous 3D-GS optimization techniques. Experimental results show that GS-TG achieves an average speed-up of 1.54 times over state-of-the-art 3D-GS accelerators.
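The sort-once-per-group, filter-per-tile idea described above can be sketched in software. This is a toy illustration, not the accelerator design: the group size, the input format, and the function names `build_groups` and `gaussians_for_tile` are assumptions made for the example.

```python
from collections import defaultdict

GROUP = 2  # assumed: each group covers GROUP x GROUP small tiles

def build_groups(gaussians):
    """gaussians: list of (gid, depth, tiles), tiles = set of (tx, ty) small tiles."""
    groups = defaultdict(dict)  # group key -> {gid: [depth, bitmask]}
    for gid, depth, tiles in gaussians:
        for tx, ty in tiles:
            key = (tx // GROUP, ty // GROUP)
            bit = (ty % GROUP) * GROUP + (tx % GROUP)  # tile index inside its group
            entry = groups[key].setdefault(gid, [depth, 0])
            entry[1] |= 1 << bit
    # One depth sort per group instead of one per small tile.
    return {
        key: sorted((d, gid, mask) for gid, (d, mask) in members.items())
        for key, members in groups.items()
    }

def gaussians_for_tile(sorted_groups, tx, ty):
    """Depth-ordered Gaussian ids for one small tile, filtered by bitmask."""
    key = (tx // GROUP, ty // GROUP)
    bit = (ty % GROUP) * GROUP + (tx % GROUP)
    return [gid for _, gid, mask in sorted_groups.get(key, []) if (mask >> bit) & 1]

# Hypothetical scene: three Gaussians touching a few small tiles.
scene = [(0, 5.0, {(0, 0), (1, 0)}), (1, 2.0, {(1, 0), (1, 1)}), (2, 9.0, {(2, 0)})]
groups = build_groups(scene)
print(gaussians_for_tile(groups, 1, 0))
```

Rasterization still iterates over small tiles, so a Gaussian that only touches one corner of a group is skipped for the other tiles through its bitmask.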

[CV-143] Spotlighter: Revisiting Prompt Tuning from a Representative Mining View EMNLP2025

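[Quick read]: This paper addresses the noise and unnecessary computational cost that redundant or weakly relevant feature components introduce into existing prompt tuning methods. The key to the solution is Spotlighter, a lightweight token-selection framework that scores each visual token's activation from both sample-wise and semantic-wise perspectives and keeps only the top-scoring tokens for downstream prediction. A class-specific semantic memory bank refines the selection, ensuring semantic representativeness and compensating for information lost with the discarded features, and a two-level ranking mechanism dynamically weights token-prototype interactions to prioritize informative signals. Across 11 few-shot benchmarks, the method outperforms CLIP by up to 11.19% in harmonic mean accuracy and gains up to 0.8K FPS while adding only 21 parameters.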

Link: https://arxiv.org/abs/2509.00905
Authors: Yutong Gao, Maoyuan Shao, Xinyang Huang, Chuang Zhu, Lijuan Sun, Yu Weng, Xuan Liu, Guoshun Nan
Affiliations: Minzu University of China; Beijing University of Posts and Telecommunications; National Library of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted as EMNLP 2025 Findings

Abstract:CLIP’s success has demonstrated that prompt tuning can achieve robust cross-modal semantic alignment for tasks ranging from open-domain recognition to fine-grained classification. However, redundant or weakly relevant feature components introduce noise and incur unnecessary computational costs. In this work, we propose Spotlighter, a lightweight token-selection framework that simultaneously enhances accuracy and efficiency in prompt tuning. Spotlighter evaluates each visual token’s activation from both sample-wise and semantic-wise perspectives and retains only the top-scoring tokens for downstream prediction. A class-specific semantic memory bank of learned prototypes refines this selection, ensuring semantic representativeness and compensating for discarded features. To further prioritize informative signals, we introduce a two-level ranking mechanism that dynamically weights token–prototype interactions. Across 11 few-shot benchmarks, Spotlighter outperforms CLIP by up to 11.19% in harmonic mean accuracy and achieves up to 0.8K additional FPS, with only 21 extra parameters. These results establish Spotlighter as an effective and scalable baseline for prompt tuning. Code for our method will be available at this https URL.

[CV-144] Pose as Clinical Prior: Learning Dual Representations for Scoliosis Screening MICCAI2025

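[Quick read]: This paper addresses two key problems in pose-based screening for adolescent idiopathic scoliosis (AIS): the lack of large-scale, well-annotated 2D human pose datasets, and the discrete, noise-sensitive nature of raw pose coordinates, which makes it hard to model the subtle postural asymmetries that are core indicators in clinical screening. The solution has two parts. First, the Scoliosis1K-Pose dataset provides 447,900 frames of 2D keypoint annotations from 1,050 adolescents, easing the data shortage. Second, the Dual Representation Framework (DRF) combines a continuous skeleton map with a discrete Postural Asymmetry Vector (PAV), and a PAV-Guided Attention (PGA) module injects this clinical prior into feature extraction so that the model focuses on clinically meaningful asymmetries and the two representations reinforce each other.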

Link: https://arxiv.org/abs/2509.00872
Authors: Zirui Zhou, Zizhao Peng, Dongyang Jin, Chao Fan, Fengwei An, Shiqi Yu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to MICCAI 2025

Abstract:Recent AI-based scoliosis screening methods primarily rely on large-scale silhouette datasets, often neglecting clinically relevant postural asymmetries, which are key indicators in traditional screening. In contrast, pose data provide an intuitive skeletal representation, enhancing clinical interpretability across various medical applications. However, pose-based scoliosis screening remains underexplored due to two main challenges: (1) the scarcity of large-scale, annotated pose datasets; and (2) the discrete and noise-sensitive nature of raw pose coordinates, which hinders the modeling of subtle asymmetries. To address these limitations, we introduce Scoliosis1K-Pose, a 2D human pose annotation set that extends the original Scoliosis1K dataset, comprising 447,900 frames of 2D keypoints from 1,050 adolescents. Building on this dataset, we introduce the Dual Representation Framework (DRF), which integrates a continuous skeleton map to preserve spatial structure with a discrete Postural Asymmetry Vector (PAV) that encodes clinically relevant asymmetry descriptors. A novel PAV-Guided Attention (PGA) module further uses the PAV as a clinical prior to direct feature extraction from the skeleton map, focusing on clinically meaningful asymmetries. Extensive experiments demonstrate that DRF achieves state-of-the-art performance. Visualizations further confirm that the model leverages clinical asymmetry cues to guide feature extraction and promote synergy between its dual representations. The dataset and code are publicly available at this https URL.

[CV-145] Quantization Meets OOD: Generalizable Quantization-aware Training from a Flatness Perspective

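[Quick read]: This paper addresses the problem that quantization-aware training (QAT) performs well on in-distribution (I.D.) data but degrades significantly on out-of-distribution (OOD) data. The study finds that QAT sharpens the loss landscape, which conflicts with the established view that a flat loss landscape improves OOD generalization, and this conflict explains the degradation. To address it, the paper proposes flatness-oriented QAT (FQAT), with two core ideas: (i) a layer-wise freezing mechanism that mitigates the gradient conflict between the vanilla QAT objective and the flatness objective; and (ii) an adaptive freezing algorithm, guided by a gradient disorder metric, that decides at each training step which layers to freeze, handling interference between layers and improving both I.D. and OOD performance.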

Link: https://arxiv.org/abs/2509.00859
Authors: Jiacheng Jiang, Yuan Meng, Chen Tang, Han Yu, Qun Li, Zhi Wang, Wenwu Zhu
Affiliations: SIGS, Tsinghua University; Tsinghua University; MMLab, The Chinese University of Hong Kong; Department of Computer Science and Technology, Tsinghua University; Key Laboratory of Pervasive Computing, Ministry of Education
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Current quantization-aware training (QAT) methods primarily focus on enhancing the performance of quantized models on in-distribution (I.D.) data, while overlooking the potential performance degradation on out-of-distribution (OOD) data. In this paper, we first substantiate this problem through rigorous experiments, showing that QAT can lead to significant degradation in OOD generalization performance. Further, we find that the problem can be caused by a contradiction: a flat loss landscape is understood to give rise to superior OOD generalization, yet QAT leads to a sharp loss landscape. Therefore, we propose a flatness-oriented QAT method, FQAT, to achieve generalizable QAT. Specifically, i) FQAT introduces a layer-wise freezing mechanism to mitigate the gradient conflict between the dual optimization objectives (i.e., vanilla QAT and flatness). ii) FQAT proposes a disorder-guided adaptive freezing algorithm that dynamically determines which layers to freeze at each training step, effectively addressing the challenges caused by interference between layers. A gradient disorder metric is designed to help the algorithm identify unstable layers during training. Extensive experiments on influential OOD benchmarks demonstrate the superiority of our method over state-of-the-art baselines on both I.D. and OOD image classification tasks.

[CV-146] Look Beyond: Two-Stage Scene View Generation via Panorama and Video Diffusion

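[Quick read]: This paper addresses the ill-posed nature of single-image novel view synthesis (NVS), which stems from large unobserved regions. When the camera view deviates substantially from the input image, existing methods struggle to keep views consistent and the scene coherent along long-range or looped trajectories. The key to the solution is to decompose single-view NVS into two stages. First, a panorama diffusion model learns a scene prior from the input perspective image and generates a 360-degree scene extrapolation. Then perspective keyframes sampled and warped from the panorama serve as anchors for a pre-trained video diffusion model, which interpolates novel views through a proposed spatial noise diffusion process. This design yields globally consistent novel views with correct view alignment, even in loop closure scenarios.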

Link: https://arxiv.org/abs/2509.00843
Authors: Xueyang Kang, Zhengkang Xiang, Zezheng Zhang, Kourosh Khoshelham
Affiliations: University of Melbourne
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 26 pages, 30 figures, 2025 ACM Multimedia

Abstract:Novel view synthesis (NVS) from a single image is highly ill-posed due to large unobserved regions, especially for views that deviate significantly from the input. While existing methods focus on consistency between the source and generated views, they often fail to maintain coherence and correct view alignment across long-range or looped trajectories. We propose a model that addresses this by decomposing single-view NVS into a 360-degree scene extrapolation followed by novel view interpolation. This design ensures long-term view and scene consistency by conditioning on keyframes extracted and warped from a generated panoramic representation. In the first stage, a panorama diffusion model learns the scene prior from the input perspective image. Perspective keyframes are then sampled and warped from the panorama and used as anchor frames in a pre-trained video diffusion model, which generates novel views through a proposed spatial noise diffusion process. Compared to prior work, our method produces globally consistent novel views – even in loop closure scenarios – while enabling flexible camera control. Experiments on diverse scene datasets demonstrate that our approach outperforms existing methods in generating coherent views along user-defined trajectories. Our implementation is available at this https URL.

[CV-147] Satellite Image Utilization for Dehazing with Swin Transformer-Hybrid U-Net and Watershed loss

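[Quick read]: This paper addresses the way atmospheric interference and haze degrade the clarity of satellite remote-sensing imagery and thereby reduce the accuracy of information extraction. The key to the solution is SUFERNOBWA, a hybrid dehazing framework that combines Swin Transformer and U-Net. Its main contributions are: (1) SwinRRDB, a Swin Transformer-based Residual-in-Residual Dense Block used in both encoder and decoder, which jointly learns global context and local spatial structure and so preserves image structure; and (2) a composite loss that combines L2 loss, guided loss, and a novel watershed loss, improving structural boundary preservation and pixel-level accuracy. The method dehazes robustly under diverse atmospheric conditions and outperforms existing state-of-the-art models on the RICE and SateHaze1K datasets.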

Link: https://arxiv.org/abs/2509.00835
Authors: Jongwook Si, Sungyoung Kim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Satellite imagery plays a crucial role in various fields; however, atmospheric interference and haze significantly degrade image clarity and reduce the accuracy of information extraction. To address these challenges, this paper proposes a hybrid dehazing framework, called SUFERNOBWA, that integrates Swin Transformer and U-Net to balance global context learning and local detail restoration. The proposed network employs SwinRRDB, a Swin Transformer-based Residual-in-Residual Dense Block, in both the encoder and decoder to effectively extract features. This module enables the joint learning of global contextual information and fine spatial structures, which is crucial for structural preservation in satellite imagery. Furthermore, we introduce a composite loss function that combines L2 loss, guided loss, and a novel watershed loss, which enhances structural boundary preservation and ensures pixel-level accuracy. This architecture enables robust dehazing under diverse atmospheric conditions while maintaining structural consistency across restored images. Experimental results demonstrate that the proposed method outperforms state-of-the-art models on both the RICE and SateHaze1K datasets. Specifically, on the RICE dataset, the proposed approach achieved a PSNR of 33.24 dB and an SSIM of 0.967, which is a significant improvement over existing methods. This study provides an effective solution for mitigating atmospheric interference in satellite imagery and highlights its potential applicability across diverse remote sensing applications.

[CV-148] SegDINO: An Efficient Design for Medical and Natural Image Segmentation with DINO-V3

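[Quick read]: This paper addresses the inefficiency of adapting features from DINOv3 self-supervised vision models to image segmentation: existing methods typically rely on heavy, complex decoders and multi-scale fusion, which makes them computationally expensive. The key to the solution is SegDINO, which freezes the pre-trained DINOv3 backbone, aligns its multi-level features to a common resolution and channel width, and uses a single lightweight MLP (multi-layer perceptron) head to predict segmentation masks directly. This sharply reduces the number of trainable parameters while preserving the representational power of the foundation features.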

Link: https://arxiv.org/abs/2509.00833
Authors: Sicheng Yang, Hongqiu Wang, Zhaohu Xing, Sixiang Chen, Lei Zhu
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The DINO family of self-supervised vision models has shown remarkable transferability, yet effectively adapting their representations for segmentation remains challenging. Existing approaches often rely on heavy decoders with multi-scale fusion or complex upsampling, which introduce substantial parameter overhead and computational cost. In this work, we propose SegDINO, an efficient segmentation framework that couples a frozen DINOv3 backbone with a lightweight decoder. SegDINO extracts multi-level features from the pretrained encoder, aligns them to a common resolution and channel width, and utilizes a lightweight MLP head to directly predict segmentation masks. This design minimizes trainable parameters while preserving the representational power of foundation features. Extensive experiments across six benchmarks, including three medical datasets (TN3K, Kvasir-SEG, ISIC) and three natural image datasets (MSD, VMD-D, ViSha), demonstrate that SegDINO consistently achieves state-of-the-art performance compared to existing methods. Code is available at this https URL.

[CV-149] UPGS: Unified Pose-aware Gaussian Splatting for Dynamic Scene Deblurring

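[Quick read]: This paper addresses the severe motion blur, caused by camera and object motion, that hampers dynamic 3D scene reconstruction from monocular video. Blur often breaks pose estimation in the conventional two-step pipeline (estimate camera poses, then optimize 3D Gaussians), so pose errors accumulate and reconstruction quality drops. The key to the solution is a unified optimization framework that treats camera poses as learnable parameters optimized end to end together with the 3D Gaussian Splatting (3DGS) attributes, and that models camera and object motion as per-primitive SE(3) affine transformations on the 3D Gaussians. A three-stage training schedule improves stability: first optimize the 3D Gaussians with poses fixed, then optimize the poses with the Gaussians fixed, and finally optimize all parameters jointly.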

Link: https://arxiv.org/abs/2509.00831
Authors: Zhijing Wu, Longguang Wang
Affiliations: University of Cambridge; National University of Defense Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reconstructing dynamic 3D scenes from monocular video has broad applications in AR/VR, robotics, and autonomous navigation, but often fails due to severe motion blur caused by camera and object motion. Existing methods commonly follow a two-step pipeline, where camera poses are first estimated and then 3D Gaussians are optimized. Since blurring artifacts usually undermine pose estimation, pose errors can accumulate and produce inferior reconstruction results. To address this issue, we introduce a unified optimization framework by incorporating camera poses as learnable parameters complementary to 3DGS attributes for end-to-end optimization. Specifically, we recast camera and object motion as per-primitive SE(3) affine transformations on 3D Gaussians and formulate a unified optimization objective. For stable optimization, we introduce a three-stage training schedule that optimizes camera poses and Gaussians alternately. In particular, 3D Gaussians are first trained with poses fixed, and then poses are optimized with the 3D Gaussians left untouched. Finally, all learnable parameters are optimized together. Extensive experiments on the Stereo Blur dataset and challenging real-world sequences demonstrate that our method achieves significant gains in reconstruction quality and pose estimation accuracy over prior dynamic deblurring methods.
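The three-stage schedule described above amounts to choosing which parameter groups receive updates at each step. A minimal sketch follows; the stage lengths and the function name `trainable_groups` are hypothetical, and the paper's actual step counts are not given in the abstract.

```python
def trainable_groups(step, gaussian_only_steps, pose_only_steps):
    """Return the parameter groups that are updated at a given training step.

    Stage 1: Gaussians only (poses fixed).
    Stage 2: poses only (Gaussians fixed).
    Stage 3: Gaussians and poses jointly.
    Stage lengths are illustrative arguments, not values from the paper.
    """
    if step < gaussian_only_steps:
        return {"gaussians"}
    if step < gaussian_only_steps + pose_only_steps:
        return {"poses"}
    return {"gaussians", "poses"}

# Example with hypothetical stage lengths of 100 and 50 steps.
for step in (0, 120, 150):
    print(step, sorted(trainable_groups(step, 100, 50)))
```

In a real training loop the returned set would decide which optimizer parameter groups have gradients enabled for that step.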

[CV-150] Surface Defect Detection with Gabor Filter Using Reconstruction-Based Blurring U-Net-ViT

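[Quick read]: This paper addresses the limited accuracy and reliability of defect detection on textured surfaces, caused by complex backgrounds, noise, and blurred defect boundaries. The key to the solution is to combine Gabor filters with a blurring U-Net-ViT model: U-Net's local feature extraction and the Vision Transformer's (ViT) global modeling together detect defects across varied textures. A Gaussian filter-based loss suppresses background noise and emphasizes defect patterns, and Salt-and-Pepper (SP) masking during training reinforces texture-defect boundaries, improving robustness in noisy environments. Gabor filters are then applied in post-processing to highlight the orientation and frequency characteristics of defects; after parameter optimization, the method reaches an average AUC of 0.939 across several public datasets.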

Link: https://arxiv.org/abs/2509.00827
Authors: Jongwook Si, Sungyoung Kim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper proposes a novel approach to enhance the accuracy and reliability of texture-based surface defect detection using Gabor filters and a blurring U-Net-ViT model. By combining the local feature training of U-Net with the global processing of the Vision Transformer (ViT), the model effectively detects defects across various textures. A Gaussian filter-based loss function removes background noise and highlights defect patterns, while Salt-and-Pepper (SP) masking in the training process reinforces texture-defect boundaries, ensuring robust performance in noisy environments. Gabor filters are applied in post-processing to emphasize defect orientation and frequency characteristics. Parameter optimization, including filter size, sigma, wavelength, gamma, and orientation, maximizes performance across datasets like MVTec-AD, Surface Crack Detection, and Marble Surface Anomaly Dataset, achieving an average Area Under the Curve (AUC) of 0.939. The ablation studies validate that the optimal filter size and noise probability significantly enhance defect detection performance.
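The five Gabor parameters listed in the abstract (filter size, sigma, wavelength, gamma, orientation) map onto the standard real-valued Gabor kernel. The sketch below builds that textbook kernel in plain Python; it is not the authors' code, the parameter values are arbitrary, and an odd filter size is assumed.

```python
import math

def gabor_kernel(size, sigma, wavelength, gamma, theta, psi=0.0):
    """Real Gabor kernel as a size x size list of lists (size assumed odd).

    sigma: width of the Gaussian envelope; wavelength: period of the cosine
    carrier; gamma: spatial aspect ratio; theta: orientation in radians;
    psi: phase offset.
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates into the filter's orientation.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

# Arbitrary illustrative parameters, not the tuned values from the paper.
k = gabor_kernel(size=5, sigma=2.0, wavelength=4.0, gamma=0.5, theta=0.0)
print(k[2][2])  # centre of the kernel
```

Convolving a defect map with kernels at several `theta` values emphasizes structures aligned with each orientation, which is the role the abstract gives to the post-processing step.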

[CV-151] Sequential Difference Maximization: Generating Adversarial Examples via Multi-Stage Optimization CIKM2025

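[Quick read]: This paper addresses the generation of efficient adversarial examples for assessing the robustness of computer vision models; existing methods struggle to balance attack strength against computational cost and adapt poorly to defenses. The core of the solution is to recast the optimization objective as "maximizing the difference between the non-true labels' probability upper bound and the true label's probability" and to propose a gradient-based attack, Sequential Difference Maximization (SDM). SDM builds a three-layer "cycle-stage-step" optimization framework: the initial stage uses the negative probability of the true label as the loss to compress the solution space, and subsequent stages use the Directional Probability Difference Ratio (DPDR) loss to gradually raise the non-true labels' probability upper bound while suppressing irrelevant labels' probabilities, giving stronger attacks at better cost-effectiveness.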

Link: https://arxiv.org/abs/2509.00826
Authors: Xinlei Liu, Tao Hu, Peng Yi, Weitao Han, Jichao Xie, Baolin Li
Affiliations: Information Engineering University; Key Laboratory of Cyberspace Security, Ministry of Education of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 5 pages, 2 figures, 5 tables, CIKM 2025

Abstract:Efficient adversarial attack methods are critical for assessing the robustness of computer vision models. In this paper, we reconstruct the optimization objective for generating adversarial examples as “maximizing the difference between the non-true labels’ probability upper bound and the true label’s probability,” and propose a gradient-based attack method termed Sequential Difference Maximization (SDM). SDM establishes a three-layer optimization framework of “cycle-stage-step.” The processes between cycles and between iterative steps are respectively identical, while optimization stages differ in terms of loss functions: in the initial stage, the negative probability of the true label is used as the loss function to compress the solution space; in subsequent stages, we introduce the Directional Probability Difference Ratio (DPDR) loss function to gradually increase the non-true labels’ probability upper bound by compressing the irrelevant labels’ probabilities. Experiments demonstrate that compared with previous SOTA methods, SDM not only exhibits stronger attack performance but also achieves higher attack cost-effectiveness. Additionally, SDM can be combined with adversarial training methods to enhance their defensive effects. The code is available at this https URL.
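The quantity SDM sets out to maximize can be written down in a few lines. The sketch below makes one explicit assumption: the "non-true labels' probability upper bound" is read as the largest probability among the non-true labels. It also shows the initial-stage loss named in the abstract. The DPDR loss is not defined in the abstract, so it is not reproduced here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def probability_gap(logits, true_label):
    """Largest non-true probability minus the true label's probability.

    A positive value means the example is already misclassified.
    Reading the 'upper bound' as a max is an assumption of this sketch.
    """
    probs = softmax(logits)
    best_other = max(p for i, p in enumerate(probs) if i != true_label)
    return best_other - probs[true_label]

def initial_stage_loss(logits, true_label):
    """Negative probability of the true label (the initial-stage loss)."""
    return -softmax(logits)[true_label]

print(probability_gap([2.0, 1.0, 0.0], true_label=0))  # negative: still correct
print(probability_gap([0.0, 3.0, 0.0], true_label=0))  # positive: misclassified
```

An attack would ascend the gradient of such an objective with respect to the input under a perturbation budget; that loop is omitted because it depends on the model and framework used.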

[CV-152] Adaptive Contrast Adjustment Module: A Clinically-Inspired Plug-and-Play Approach for Enhanced Fetal Plane Classification

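[Quick read]: This paper addresses the low tissue contrast, ambiguous boundaries, and operator-dependent image quality that hinder fetal ultrasound standard plane classification, with the aim of making prenatal diagnosis more reliable. The key to the solution is a plug-and-play Adaptive Contrast Adjustment Module (ACAM): a shallow texture-sensitive network predicts clinically plausible contrast parameters, a differentiable mapping turns the input image into multiple contrast-enhanced views, and these views are fused in the downstream classifier. The module provides content-aware adaptation, replacing random preprocessing with physics-informed transformations that match sonographer workflows, and its multi-view fusion improves robustness to imaging heterogeneity, linking low-level image features with high-level semantics.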

Link: https://arxiv.org/abs/2509.00808
Authors: Yang Chen, Sanglin Zhao, Baoyu Chen, Mans Gustaf
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Fetal ultrasound standard plane classification is essential for reliable prenatal diagnosis but faces inherent challenges, including low tissue contrast, boundary ambiguity, and operator-dependent image quality variations. To overcome these limitations, we propose a plug-and-play adaptive contrast adjustment module (ACAM), whose core design is inspired by the clinical practice of doctors adjusting image contrast to obtain clearer and more discriminative structural information. The module employs a shallow texture-sensitive network to predict clinically plausible contrast parameters, transforms input images into multiple contrast-enhanced views through differentiable mapping, and fuses them within downstream classifiers. Validated on a multi-center dataset of 12,400 images across six anatomical categories, the module consistently improves performance across diverse models, with accuracy of lightweight models increasing by 2.02 percent, accuracy of traditional models increasing by 1.29 percent, and accuracy of state-of-the-art models increasing by 1.15 percent. The innovation of the module lies in its content-aware adaptation capability, replacing random preprocessing with physics-informed transformations that align with sonographer workflows while improving robustness to imaging heterogeneity through multi-view fusion. This approach effectively bridges low-level image features with high-level semantics, establishing a new paradigm for medical image analysis under real-world image quality variations.

[CV-153] SWAGSplatting: Semantic-guided Water-scene Augmented Gaussian Splatting SIGGRAPH

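[Quick read]: This paper addresses 3D reconstruction in underwater environments, where the main challenges are light distortion, turbidity, and limited visibility. The authors propose a semantic-guided 3D Gaussian Splatting framework built on multimodal cross-knowledge. Its key ideas are to embed an additional semantic feature in each Gaussian primitive, supervised by semantic features extracted with CLIP, so that semantic and structural consistency is enforced throughout training, and to use a stage-wise training strategy that combines coarse-to-fine learning with late-stage parameter refinement to improve stability and reconstruction quality. Experiments show that the method outperforms state-of-the-art methods on the SeaThru-NeRF and Submerged3D datasets, with an average PSNR improvement of up to 3.09 dB.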

Link: https://arxiv.org/abs/2509.00800
Authors: Zhuodong Jiang, Haoran Wang, Guoxi Huang, Brett Seymour, Nantheera Anantrasirichai
Affiliations: University of Bristol; Submerged Resources Centre
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to SIGGRAPH Asia 2025 Technical Communications

Abstract:Accurate 3D reconstruction in underwater environments remains a complex challenge due to issues such as light distortion, turbidity, and limited visibility. AI-based techniques have been applied to address these issues; however, existing methods have yet to fully exploit the potential of AI, particularly in integrating language models with visual processing. In this paper, we propose a novel framework that leverages multimodal cross-knowledge to create semantic-guided 3D Gaussian Splatting for robust and high-fidelity deep-sea scene reconstruction. By embedding an extra semantic feature into each Gaussian primitive and supervising it with the CLIP-extracted semantic feature, our method enforces semantic and structural awareness throughout training. The dedicated semantic consistency loss ensures alignment with high-level scene understanding. In addition, we propose a novel stage-wise training strategy, combining coarse-to-fine learning with late-stage parameter refinement, to further enhance both stability and reconstruction quality. Extensive results show that our approach consistently outperforms state-of-the-art methods on SeaThru-NeRF and Submerged3D datasets across three metrics, with an improvement of up to 3.09 dB on average in terms of PSNR, making it a strong candidate for applications in underwater exploration and marine perception.

[CV-154] Multimodal Iterative RAG for Knowledge Visual Question Answering

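[Quick read]: This paper addresses the limited performance of Multimodal Large Language Models (MLLMs) on knowledge-intensive visual question answering (VQA), where the image alone does not contain enough information and external knowledge is required. Conventional Retrieval-Augmented Generation (RAG) uses a single-pass retrieval framework that often fails to gather sufficient knowledge. The paper therefore proposes Multimodal Iterative RAG (MI-RAG), which interleaves reasoning and retrieval: at each iteration, the accumulated reasoning record is used to formulate a multi-query, the queries drive a joint search across heterogeneous knowledge bases containing visually-grounded and textual knowledge, and the newly retrieved knowledge is merged into the reasoning record to refine understanding step by step. This design significantly improves both retrieval recall and answer accuracy.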

链接: https://arxiv.org/abs/2509.00798
作者: Changin Choi,Wonseok Lee,Jungmin Ko,Wonjong Rhee
机构: Interdisciplinary Program in Artificial Intelligence (人工智能交叉学科项目); Department of Intelligence and Information, Republic of Korea, Seoul National University (韩国首尔国立大学情报与信息系); Samsung Electronics Co., Ltd, Suwon, Republic of Korea (三星电子有限公司,水原市,韩国)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

Abstract:While Multimodal Large Language Models (MLLMs) have significantly advanced multimodal understanding, their performance remains limited on knowledge-intensive visual questions that require external knowledge beyond the image. Although Retrieval-Augmented Generation (RAG) has become a promising solution for providing models with external knowledge, its conventional single-pass framework often fails to gather sufficient knowledge. To overcome this limitation, we propose MI-RAG, a Multimodal Iterative RAG framework that leverages reasoning to enhance retrieval and updates reasoning over newly retrieved knowledge across modalities. At each iteration, MI-RAG leverages an accumulated reasoning record to dynamically formulate a multi-query. These queries then drive a joint search across heterogeneous knowledge bases containing both visually-grounded and textual knowledge. The newly acquired knowledge is synthesized into the reasoning record, progressively refining understanding across iterations. Experiments on challenging benchmarks, including Encyclopedic VQA, InfoSeek, and OK-VQA, show that MI-RAG significantly improves both retrieval recall and answer accuracy, establishing a scalable approach for compositional reasoning in knowledge-intensive VQA.
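
The iterate-retrieve-synthesize loop described in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `iterative_rag`, its callback parameters, and the toy knowledge bases are all hypothetical stand-ins for the MLLM calls and retrievers a real system would use.

```python
from typing import Callable, Dict, List, Sequence

Searcher = Callable[[str], List[str]]


def iterative_rag(
    question: str,
    formulate_queries: Callable[[str, List[str]], List[str]],
    searchers: Sequence[Searcher],
    synthesize: Callable[[str, List[str], List[str]], str],
    num_iterations: int = 3,
) -> List[str]:
    """Build a reasoning record over several retrieval rounds (hypothetical API)."""
    record: List[str] = []
    seen = set()
    for _ in range(num_iterations):
        # Queries are re-formulated from the record accumulated so far.
        queries = formulate_queries(question, record)
        fresh: List[str] = []
        for query in queries:
            for search in searchers:  # joint search over every knowledge base
                for passage in search(query):
                    if passage not in seen:
                        seen.add(passage)
                        fresh.append(passage)
        if not fresh:  # nothing new was retrieved, so stop early
            break
        record.append(synthesize(question, record, fresh))
    return record


# Toy stand-ins for the model and the two knowledge bases (assumptions).
def toy_queries(question: str, record: List[str]) -> List[str]:
    return [question] + record[-1:]


def make_searcher(kb: Dict[str, List[str]]) -> Searcher:
    def search(query: str) -> List[str]:
        return [p for key, passages in kb.items() if key in query for p in passages]
    return search


def toy_synthesize(question: str, record: List[str], passages: List[str]) -> str:
    return " ".join(passages)


visual_kb = {"tower": ["The landmark is the Eiffel Tower."]}
text_kb = {"Eiffel": ["The Eiffel Tower was completed in 1889."]}
record = iterative_rag(
    "When was this tower built?",
    toy_queries,
    [make_searcher(visual_kb), make_searcher(text_kb)],
    toy_synthesize,
)
```

In the toy run, the first round only reaches the visually-grounded entry; the second round uses that note as a query and reaches the textual fact, which is the kind of multi-hop behavior a single-pass retriever would miss.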

[CV-155] OmniReason: A Temporal-Guided Vision-Language-Action Framework for Autonomous Driving

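[Quick read]: This paper addresses the insufficient modeling of dynamic spatiotemporal information by current Vision-Language Models (VLMs) in autonomous driving: existing methods focus mostly on static scene understanding and neglect the temporal dimension that is essential in real driving. The proposed OmniReason framework achieves robust spatiotemporal reasoning by jointly modeling dynamic 3D environments and the decision-making processes behind them. It comprises OmniReason-Data, large-scale Vision-Language-Action (VLA) datasets with dense spatiotemporal annotations and natural language explanations, and the OmniReason-Agent architecture, which integrates a sparse temporal memory module that maintains persistent scene context and uses spatiotemporal knowledge distillation to produce interpretable decision rationales. Together these significantly improve open-loop planning and Visual Question Answering (VQA) performance in complex dynamic environments while providing temporal awareness and interpretability.

Link: https://arxiv.org/abs/2509.00789
Authors: Pei Liu, Qingtian Ning, Xinyan Lu, Haipeng Liu, Weiliang Ma, Dangen She, Peng Jia, Xianpeng Lang, Jun Ma
Affiliations: 1. Tsinghua University; 2. Baidu; 3. Institute of Automation, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: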

Abstract:Recent advances in vision-language models (VLMs) have demonstrated impressive spatial reasoning capabilities for autonomous driving, yet existing methods predominantly focus on static scene understanding while neglecting the essential temporal dimension of real-world driving scenarios. To address this critical limitation, we propose the OmniReason framework, which establishes robust spatiotemporal reasoning by jointly modeling dynamic 3D environments and their underlying decision-making processes. Our work makes two fundamental advances: (1) We introduce OmniReason-Data, two large-scale vision-language-action (VLA) datasets with dense spatiotemporal annotations and natural language explanations, generated through a novel hallucination-mitigated auto-labeling pipeline that ensures both physical plausibility and temporal coherence; (2) We develop the OmniReason-Agent architecture, which integrates a sparse temporal memory module for persistent scene context modeling and an explanation generator that produces human-interpretable decision rationales, facilitated by our spatiotemporal knowledge distillation approach that effectively captures spatiotemporal causal reasoning patterns. Comprehensive experiments demonstrate state-of-the-art performance, where OmniReason-Agent achieves significant improvements in both open-loop planning tasks and visual question answering (VQA) benchmarks, while establishing new capabilities for interpretable, temporally-aware autonomous vehicles operating in complex, dynamic environments.

[CV-156] Diffusion-Based Image-to-Brain Signal Generation with Cross-Attention Mechanisms for Visual Prostheses

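[Quick read]: This paper addresses the insufficient biological similarity of the brain signals generated in the brain encoding stage of visual prostheses: existing methods lack supervision from real brain responses to validate the biological plausibility of predicted stimuli. The key to the solution is an image-to-brain framework based on Denoising Diffusion Probabilistic Models (DDPMs) enhanced with cross-attention mechanisms. A pre-trained CLIP visual encoder extracts rich semantic features from the input image, and a U-Net diffusion model with cross-attention learns to reconstruct biologically plausible brain signals through iterative denoising. This gives fine-grained alignment between visual features and brain signal representations and markedly improves the biological realism of the generated signals.

Link: https://arxiv.org/abs/2509.00787
Authors: Ganxi Xu, Jinyi Long, Jia Zhang
Affiliations: Jinan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: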

Abstract:Visual prostheses have shown great potential in restoring vision for blind individuals. On the one hand, researchers have been continuously improving the brain decoding framework of visual prostheses by leveraging the powerful image generation capabilities of diffusion models. On the other hand, the brain encoding stage of visual prostheses struggles to generate brain signals with sufficient biological similarity. Although existing works have recognized this problem, the quality of predicted stimuli still remains a critical issue, as existing approaches typically lack supervised signals from real brain responses to validate the biological plausibility of predicted stimuli. To address this issue, we propose a novel image-to-brain framework based on denoising diffusion probabilistic models (DDPMs) enhanced with cross-attention mechanisms. Our framework consists of two key architectural components: a pre-trained CLIP visual encoder that extracts rich semantic representations from input images, and a cross-attention enhanced U-Net diffusion model that learns to reconstruct biologically plausible brain signals through iterative denoising. Unlike conventional generative models that rely on simple concatenation for conditioning, our cross-attention modules enable dynamic interaction between visual features and brain signal representations, facilitating fine-grained alignment during the generation process. We evaluate our framework on two multimodal datasets (THINGS-EEG2 and THINGS-MEG) to demonstrate its effectiveness in generating biologically plausible brain signals. Moreover, we visualize the training and test M/EEG topographies for all subjects on both datasets to intuitively demonstrate the intra-subject variations and inter-subject variations in M/EEG signals.

[CV-157] Aligned Anchor Groups Guided Line Segment Detector

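[Quick read]: This paper addresses the precision and completeness of line segment detection, that is, how to extract complete, continuous line segments accurately from complex scenes. The key to the solution is a new detector guided by aligned anchor groups, the Aligned Anchor Groups guided Line Segment Detector (AAGLSD). It uses a hierarchical strategy to extract candidate pixels at different saliency levels, starts from the aligned anchor groups, links anchors step by step while updating the currently predicted line segment, and obtains the final result through simple validation and merging of adjacent segments. This avoids complex refinement and outperforms existing advanced line segment detectors on multiple datasets.

Link: https://arxiv.org/abs/2509.00786
Authors: Zeyu Li, Annan Shu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the 8th Chinese Conference on Pattern Recognition and Computer Vision (PRCV 2025). 14 pages, supplementary material attached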

Abstract:This paper introduces a novel line segment detector, the Aligned Anchor Groups guided Line Segment Detector (AAGLSD), designed to detect line segments from images with high precision and completeness. The algorithm employs a hierarchical approach to extract candidate pixels with different saliency levels, including regular anchors and aligned anchor groups. AAGLSD initiates from these aligned anchor groups, sequentially linking anchors and updating the currently predicted line segment simultaneously. The final predictions are derived through straightforward validation and merging of adjacent line segments, avoiding complex refinement strategies. AAGLSD is evaluated on various datasets and quantitative experiments demonstrate that the proposed method can effectively extract complete line segments from input images compared to other advanced line segment detectors. The implementation is available at this https URL.

[CV-158] Secure and Scalable Face Retrieval via Cancelable Product Quantization

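[Quick read]: This paper addresses the risk of leaking users' portrait privacy when modern face retrieval systems outsource the retrieval stage to third parties. Existing schemes based on Homomorphic Encryption (HE) provide strong security guarantees but are too computationally inefficient for real-time use. The key to the solution is an efficient framework called Cancelable Product Quantization (CPQ) with a hierarchical two-stage design: the first stage is a high-throughput cancelable product quantization indexing module for fast candidate filtering, and the second stage is a fine-grained cipher-space retrieval module that produces the final precise face ranking. A tailored protection mechanism secures the indexing module for cancelable biometric authentication while keeping it efficient, giving a good balance among security, efficiency, and effectiveness.

Link: https://arxiv.org/abs/2509.00781
Authors: Haomiao Tang, Wenjie Li, Yixiang Qiu, Genping Wang, Shu-Tao Xia
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: 14 pages and 2 figures, accepted by PRCV 2025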

Abstract:Despite the ubiquity of modern face retrieval systems, their retrieval stage is often outsourced to third-party entities, posing significant risks to user portrait privacy. Although homomorphic encryption (HE) offers strong security guarantees by enabling arithmetic computations in the cipher space, its computational inefficiency makes it unsuitable for real-time, real-world applications. To address this issue, we propose Cancelable Product Quantization, a highly efficient framework for secure face representation retrieval. Our hierarchical two-stage framework comprises: (i) a high-throughput cancelable PQ indexing module for fast candidate filtering, and (ii) a fine-grained cipher-space retrieval module for final precise face ranking. A tailored protection mechanism is designed to secure the indexing module for cancelable biometric authentication while ensuring efficiency. Experiments on benchmark datasets demonstrate that our method achieves a decent balance between effectiveness, efficiency and security.
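
For readers unfamiliar with the building block behind the indexing stage, here is a minimal sketch of plain product quantization (PQ): a vector is split into sub-vectors, and each sub-vector is replaced by the index of its nearest codeword. It deliberately omits the paper's cancelable protection mechanism and the cipher-space stage, which the abstract does not specify; the tiny codebooks and all function names are illustrative assumptions.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def split(vector: Vector, num_subspaces: int) -> List[Vector]:
    """Cut a vector into equally sized contiguous sub-vectors."""
    if len(vector) % num_subspaces:
        raise ValueError("vector length must be divisible by the number of subspaces")
    width = len(vector) // num_subspaces
    return [vector[i * width:(i + 1) * width] for i in range(num_subspaces)]


def pq_encode(vector: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Return, per subspace, the index of the nearest codeword."""
    parts = split(vector, len(codebooks))
    return [
        min(range(len(book)), key=lambda j: squared_distance(part, book[j]))
        for part, book in zip(parts, codebooks)
    ]


def pq_distance(query: Vector, codes: Sequence[int],
                codebooks: Sequence[Sequence[Vector]]) -> float:
    """Asymmetric distance: raw query against a PQ-encoded database vector."""
    parts = split(query, len(codebooks))
    return sum(
        squared_distance(part, book[code])
        for part, book, code in zip(parts, codebooks, codes)
    )


# Hand-written 2-subspace codebooks (assumption; normally learned with k-means).
codebooks = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [2.0, 2.0]]]
codes = pq_encode([0.9, 1.1, 0.1, -0.1], codebooks)
distance = pq_distance([1.0, 1.0, 0.0, 0.0], codes, codebooks)
```

Because only short integer codes are stored and compared, candidate filtering is cheap, which is why a PQ index suits the fast first stage while a slower, more precise ranking runs on the surviving candidates.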

[CV-159] Energy Efficient Exact and Approximate Systolic Array Architecture for Matrix Multiplication

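[Quick read]: This paper addresses the need of Deep Neural Networks (DNNs) for efficient matrix multiplication engines, in particular the trade-off between energy efficiency and output quality in edge computing. The key to the solution is a new Systolic Array (SA) architecture that integrates novel exact and approximate Processing Elements (PEs) built from energy-efficient Positive Partial Product Cells (PPC) and Negative Partial Product Cells (NPPC). In an 8×8 systolic array, the exact and approximate designs reduce energy consumption by 22% and 32%, respectively, while maintaining high output quality in Discrete Cosine Transform (DCT) and edge detection convolution applications (PSNR of 38.21 dB and 30.45 dB, respectively), which supports their suitability for error-tolerant image and vision processing.

Link: https://arxiv.org/abs/2509.00778
Authors: Pragun Jaswal, L. Hemanth Krishna, B. Srinivasu
Affiliations: Indian Institute of Technology, Mandi
Categories: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to 39th International Conference on VLSI Design, 2026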

Abstract:Deep Neural Networks (DNNs) require highly efficient matrix multiplication engines for complex computations. This paper presents a systolic array architecture incorporating novel exact and approximate processing elements (PEs), designed using energy-efficient positive partial product and negative partial product cells, termed PPC and NPPC, respectively. The proposed 8-bit exact and approximate PE designs are employed in an 8x8 systolic array, which achieves energy savings of 22% and 32%, respectively, compared to the existing design. To demonstrate their effectiveness, the proposed PEs are integrated into a systolic array (SA) for Discrete Cosine Transform (DCT) computation, achieving high output quality with a PSNR of 38.21 dB. Furthermore, in an edge detection application using convolution, the approximate PE achieves a PSNR of 30.45 dB. These results highlight the potential of the proposed design to deliver significant energy efficiency while maintaining competitive output quality, making it well-suited for error-resilient image and vision processing applications.
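
The output quality figures above are reported as PSNR. As a reminder of how that metric is computed, the sketch below compares an exact result with an approximate one; it uses the standard definition, 10·log10(MAX² / MSE), and assumes 8-bit data (peak value 255). The function name and the flat-list input format are illustrative choices, not taken from the paper.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], approximate: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized signals."""
    if not reference or len(reference) != len(approximate):
        raise ValueError("inputs must be non-empty and have the same length")
    mse = sum((r - a) ** 2 for r, a in zip(reference, approximate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals: no error at all
    return 10.0 * math.log10(max_value ** 2 / mse)


exact_pixels = [52, 55, 61, 66]
approx_pixels = [52, 54, 61, 68]  # made-up approximate outputs for illustration
quality_db = psnr(exact_pixels, approx_pixels)
```

A higher value means the approximate arithmetic stays closer to the exact result, so the reported 38.21 dB for DCT indicates smaller errors than the 30.45 dB for edge detection.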

[CV-160] IntrinsicReal: Adapting IntrinsicAnything from Synthetic to Real Objects

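[Quick read]: This paper addresses albedo estimation from a single RGB image in real-world scenes. The difficulty comes from the lack of paired images with ground-truth albedo and from the domain gap that arises when existing methods such as IntrinsicAnything, trained mainly on large-scale synthetic datasets, are applied directly to real images. The key to the solution is IntrinsicReal, a domain adaptation framework that fine-tunes IntrinsicAnything using a new dual pseudo-labeling strategy: (i) pseudo-labeling with an absolute confidence threshold on classifier predictions, and (ii) pseudo-labeling using the relative preference ranking of classifier predictions for each input object. The strategy is inspired by human evaluation, where high-quality outputs are easy to identify but relative comparison is more reliable for sub-optimal results. A two-phase pipeline applies the two techniques in sequence, narrowing the gap between synthetic and real domains and achieving state-of-the-art albedo estimation on both synthetic and real datasets.

Link: https://arxiv.org/abs/2509.00777
Authors: Xiaokang Wei, Zizheng Yan, Zhangyang Xiong, Yiming Hao, Yipeng Qin, Xiaoguang Han
Affiliations: The Hong Kong Polytechnic University; The Chinese University of Hong Kong, Shenzhen; Cardiff University; NanJing XiaoZhuang University
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: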

Abstract:Estimating albedo (a.k.a., intrinsic image decomposition) from single RGB images captured in real-world environments (e.g., the MVImgNet dataset) presents a significant challenge due to the absence of paired images and their ground truth albedos. Therefore, while recent methods (e.g., IntrinsicAnything) have achieved breakthroughs by harnessing powerful diffusion priors, they remain predominantly trained on large-scale synthetic datasets (e.g., Objaverse) and applied directly to real-world RGB images, which ignores the large domain gap between synthetic and real-world data and leads to suboptimal generalization performance. In this work, we address this gap by proposing IntrinsicReal, a novel domain adaptation framework that bridges the above-mentioned domain gap for real-world intrinsic image decomposition. Specifically, our IntrinsicReal adapts IntrinsicAnything to the real domain by fine-tuning it using its high-quality output albedos selected by a novel dual pseudo-labeling strategy: i) pseudo-labeling with an absolute confidence threshold on classifier predictions, and ii) pseudo-labeling using the relative preference ranking of classifier predictions for individual input objects. This strategy is inspired by human evaluation, where identifying the highest-quality outputs is straightforward, but absolute scores become less reliable for sub-optimal cases. In these situations, relative comparisons of outputs become more accurate. To implement this, we propose a novel two-phase pipeline that sequentially applies these pseudo-labeling techniques to effectively adapt IntrinsicAnything to the real domain. Experimental results show that our IntrinsicReal significantly outperforms existing methods, achieving state-of-the-art results for albedo estimation on both synthetic and real-world datasets.
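
The dual pseudo-labeling idea can be illustrated with a small sketch. It assumes each real object has several candidate albedo outputs, each scored by a quality classifier. Phase one keeps candidates above an absolute threshold; phase two falls back to the top-ranked candidate per object for the objects that phase one left unlabeled. The data layout, the 0.9 threshold, and the rule that phase two skips already-labeled objects are assumptions for illustration only.

```python
from typing import Dict, List, Set, Tuple

# object id -> list of (candidate albedo id, classifier score in [0, 1])
Candidates = Dict[str, List[Tuple[str, float]]]


def select_by_threshold(candidates: Candidates,
                        threshold: float = 0.9) -> Dict[str, List[str]]:
    """Phase 1: keep candidates whose absolute confidence is high enough."""
    selected: Dict[str, List[str]] = {}
    for obj, scored in candidates.items():
        kept = [albedo for albedo, score in scored if score >= threshold]
        if kept:
            selected[obj] = kept
    return selected


def select_by_ranking(candidates: Candidates,
                      already_labeled: Set[str]) -> Dict[str, List[str]]:
    """Phase 2: for remaining objects, keep the relatively best candidate."""
    selected: Dict[str, List[str]] = {}
    for obj, scored in candidates.items():
        if obj in already_labeled or not scored:
            continue
        best, _ = max(scored, key=lambda pair: pair[1])
        selected[obj] = [best]
    return selected


candidates = {
    "chair": [("chair_a", 0.95), ("chair_b", 0.60)],
    "mug": [("mug_a", 0.55), ("mug_b", 0.70)],
}
phase_one = select_by_threshold(candidates)
phase_two = select_by_ranking(candidates, set(phase_one))
```

Here the chair gets a pseudo-label from the absolute test, while the mug has no confident candidate and is labeled by relative preference instead, mirroring the observation that rankings stay informative when absolute scores are unreliable.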

[CV-161] InterPose: Learning to Generate Human-Object Interactions from Large-Scale Web Videos

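[Quick read]: This paper addresses the challenge of synthesizing high-fidelity human-object interactions in complex 3D scenes, which stems mainly from the lack of large-scale, diverse motion capture data with interactions. Existing methods are mostly limited to animating isolated people in empty scenes and struggle with realistic multi-object manipulation and interaction with the environment. The key to the solution is an automatic motion extraction pipeline, used to build a new dataset, InterPose, containing 73.8K sequences of 3D human motion with text captions obtained from 45.8K videos with human-object interactions. The dataset greatly broadens the types and diversity of interactions, brings significant improvements to state-of-the-art human motion generation methods, and supports an LLM-based agent for zero-shot animation of people interacting with diverse objects and scenes.

Link: https://arxiv.org/abs/2509.00767
Authors: Yangsong Zhang, Abdul Ahad Butt, Gül Varol, Ivan Laptev
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL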

Abstract:Human motion generation has shown great advances thanks to the recent diffusion models trained on large-scale motion capture data. Most existing works, however, target animation of isolated people in empty scenes. Meanwhile, synthesizing realistic human-object interactions in complex 3D scenes remains a critical challenge in computer graphics and robotics. One obstacle towards generating versatile high-fidelity human-object interactions is the lack of large-scale datasets with diverse object manipulations. Indeed, existing motion capture data is typically restricted to single people and manipulations of limited sets of objects. To address this issue, we propose an automatic motion extraction pipeline and use it to collect interaction-rich human motions. Our new dataset InterPose contains 73.8K sequences of 3D human motions and corresponding text captions automatically obtained from 45.8K videos with human-object interactions. We perform extensive experiments and demonstrate that InterPose brings significant improvements to state-of-the-art methods for human motion generation. Moreover, using InterPose we develop an LLM-based agent enabling zero-shot animation of people interacting with diverse objects and scenes.

[CV-162] No More Sibling Rivalry: Debiasing Human-Object Interaction Detection ICCV2025

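[Quick read]: This paper addresses the "Toxic Siblings" bias that detection transformers face in Human-Object Interaction (HOI) detection: many HOI triplets that are structurally similar but semantically distinct interfere and even compete with each other on both the input and output sides, lowering recognition precision. The key to the solution is two new debiasing learning objectives. "Contrastive-then-calibration" samples incorrect triplets that resemble the correct ones and reconstructs them into correct triplets, guided by strong positional priors. "Merge-then-split" first learns features shared among sibling categories to separate them from other groups, then explicitly refines intra-group differences to preserve each category's uniqueness. The two objectives reduce confusion among sibling triplets from the input and output perspectives, respectively, and significantly improve performance.

Link: https://arxiv.org/abs/2509.00760
Authors: Bin Yang, Yulin Zhang, Hong-Yu Zhou, Sibei Yang
Affiliations: ShanghaiTech University; Sun Yat-sen University; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025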

Abstract:Detection transformers have been applied to human-object interaction (HOI) detection, enhancing the localization and recognition of human-action-object triplets in images. Despite remarkable progress, this study identifies a critical issue, the “Toxic Siblings” bias, which hinders the interaction decoder’s learning, as numerous similar yet distinct HOI triplets interfere with and even compete against each other on both the input side and the output side of the interaction decoder. This bias arises from high confusion among sibling triplets/categories, where increased similarity paradoxically reduces precision, as one’s gain comes at the expense of its toxic sibling’s decline. To address this, we propose two novel debiasing learning objectives, “contrastive-then-calibration” and “merge-then-split”, targeting the input and output perspectives, respectively. The former samples sibling-like incorrect HOI triplets and reconstructs them into correct ones, guided by strong positional priors. The latter first learns shared features among sibling categories to distinguish them from other groups, then explicitly refines intra-group differentiation to preserve uniqueness. Experiments show that we significantly outperform both the baseline (+9.18% mAP on HICO-Det) and the state-of-the-art (+3.59% mAP) across various settings.

[CV-163] MarkSplatter: Generalizable Watermarking for 3D Gaussian Splatting Model via Splatter Image Structure

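[Quick read]: This paper addresses copyright protection for 3D Gaussian Splatting (3DGS) models. Existing watermarking methods rely on computationally expensive fine-tuning for each predefined message, which is inefficient. The key to the solution is the first generalizable watermarking framework, MarkSplatter. Its GaussianBridge component converts unstructured 3D Gaussians into the Splatter Image format, so that an arbitrary message can be embedded in a single forward pass. A Gaussian-Uncertainty-Perceptual heatmap prediction strategy preserves visual quality, and a dense segmentation-based extraction mechanism keeps message recovery reliable even when the watermarked object occupies a very small region of the rendered view.

Link: https://arxiv.org/abs/2509.00757
Authors: Xiufeng Huang, Ziyuan Luo, Qi Song, Ruofei Wang, Renjie Wan
Affiliations: Hong Kong Baptist University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: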

Abstract:The growing popularity of 3D Gaussian Splatting (3DGS) has intensified the need for effective copyright protection. Current 3DGS watermarking methods rely on computationally expensive fine-tuning procedures for each predefined message. We propose the first generalizable watermarking framework that enables efficient protection of Splatter Image-based 3DGS models through a single forward pass. We introduce GaussianBridge that transforms unstructured 3D Gaussians into Splatter Image format, enabling direct neural processing for arbitrary message embedding. To ensure imperceptibility, we design a Gaussian-Uncertainty-Perceptual heatmap prediction strategy for preserving visual quality. For robust message recovery, we develop a dense segmentation-based extraction mechanism that maintains reliable extraction even when watermarked objects occupy minimal regions in rendered views. Project page: this https URL.

[CV-164] Multi-Level CLS Token Fusion for Contrastive Learning in Endoscopy Image Classification

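[Quick read]: This paper addresses insufficient multimodal understanding in ear, nose, and throat (ENT) endoscopy image analysis, in particular how to achieve accurate image classification, image-to-image retrieval, and text-to-image retrieval with limited annotated data. The key to the solution is a unified vision-language framework built on the CLIP ViT-B/16 backbone and enhanced with Low-Rank Adaptation, multi-level CLS token aggregation, and spherical feature interpolation, which enables efficient fine-tuning and cross-modal semantic alignment in the small-data regime. Class-specific natural language prompts, combined with a joint objective of supervised classification and contrastive learning, guide the image encoder to better map visual inputs to diagnostic text context, improving robustness and generalization in low-resource clinical settings.

Link: https://arxiv.org/abs/2509.00752
Authors: Y Hop Nguyen, Doan Anh Phan Huu, Trung Thai Tran, Nhat Nam Mai, Van Toi Giap, Thao Thi Phuong Dao, Trung-Nghia Le
Affiliations: University of Science, VNU-HCM; South Telecom JSC; Thong Nhat Hospital
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ACM Multimedia 2025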

Abstract:We present a unified vision-language framework tailored for ENT endoscopy image analysis that simultaneously tackles three clinically-relevant tasks: image classification, image-to-image retrieval, and text-to-image retrieval. Unlike conventional CNN-based pipelines that struggle to capture cross-modal semantics, our approach leverages the CLIP ViT-B/16 backbone and enhances it through Low-Rank Adaptation, multi-level CLS token aggregation, and spherical feature interpolation. These components collectively enable efficient fine-tuning on limited medical data while improving representation diversity and semantic alignment across modalities. To bridge the gap between visual inputs and textual diagnostic context, we introduce class-specific natural language prompts that guide the image encoder through a joint training objective combining supervised classification with contrastive learning. We validated our framework through participation in the ACM MM’25 ENTRep Grand Challenge, achieving 95% accuracy and F1-score in classification, Recall@1 of 0.93 and 0.92 for image-to-image and text-to-image retrieval respectively, and MRR scores of 0.97 and 0.96. Ablation studies demonstrated the incremental benefits of each architectural component, validating the effectiveness of our design for robust multimodal medical understanding in low-resource clinical settings.
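
One of the listed components, spherical feature interpolation, is commonly implemented as spherical linear interpolation (slerp) between two unit-normalized feature vectors. The sketch below shows that standard formula only; how the authors apply it (for example, which feature pairs are mixed and with what weights) is not stated in the abstract, so the usage here is an assumption.

```python
import math
from typing import List, Sequence


def normalize(vector: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def slerp(u: Sequence[float], v: Sequence[float], t: float) -> List[float]:
    """Interpolate along the great circle between the directions of u and v."""
    a, b = normalize(u), normalize(v)
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)  # angle between the two unit vectors
    if omega < 1e-8:
        return a  # the directions coincide, nothing to interpolate
    if math.pi - omega < 1e-8:
        raise ValueError("interpolation between opposite directions is undefined")
    scale = math.sin(omega)
    weight_a = math.sin((1.0 - t) * omega) / scale
    weight_b = math.sin(t * omega) / scale
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


midpoint = slerp([1.0, 0.0], [0.0, 2.0], 0.5)
```

Unlike plain linear mixing, the result stays on the unit sphere, which matters for CLIP-style embeddings that are compared by cosine similarity.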

[CV-165] EVENT-Retriever: Event-Aware Multimodal Image Retrieval for Realistic Captions

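[Quick read]: This paper addresses event-based image retrieval from free-form captions. The core challenge is that models must understand visual features together with latent event semantics, context, and real-world knowledge, while conventional vision-language retrieval falls short on abstract events, implicit causality, temporal context, and long, complex narratives. The key to the solution is a multi-stage retrieval framework: dense article retrieval with Qwen3, event-aware reranking with Qwen3-Reranker for better contextual alignment, precise image scoring with Qwen2-VL, and caption-guided semantic matching with rank-aware selection to refine the final result. Outputs from multiple configurations are fused with Reciprocal Rank Fusion (RRF) to improve performance and robustness. The system ranks first on the private test set of Track 2 of the EVENTA 2025 Grand Challenge.

Link: https://arxiv.org/abs/2509.00751
Authors: Dinh-Khoi Vo, Van-Loc Nguyen, Minh-Triet Tran, Trung-Nghia Le
Affiliations: University of Science, VNU-HCM
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ACM Multimedia 2025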

Abstract:Event-based image retrieval from free-form captions presents a significant challenge: models must understand not only visual features but also latent event semantics, context, and real-world knowledge. Conventional vision-language retrieval approaches often fall short when captions describe abstract events, implicit causality, temporal context, or contain long, complex narratives. To tackle these issues, we introduce a multi-stage retrieval framework combining dense article retrieval, event-aware language model reranking, and efficient image collection, followed by caption-guided semantic matching and rank-aware selection. We leverage Qwen3 for article search, Qwen3-Reranker for contextual alignment, and Qwen2-VL for precise image scoring. To further enhance performance and robustness, we fuse outputs from multiple configurations using Reciprocal Rank Fusion (RRF). Our system achieves the top-1 score on the private test set of Track 2 in the EVENTA 2025 Grand Challenge, demonstrating the effectiveness of combining language-based reasoning and multimodal retrieval for complex, real-world image understanding. The code is available at this https URL.
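
Reciprocal Rank Fusion, used above to merge the outputs of several configurations, scores each item as the sum of 1 / (k + rank) over all input rankings. The sketch below is a generic implementation of that formula; the constant k = 60 is the value commonly used in the literature and is an assumption here, since the abstract does not state the setting used by the authors.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists; ranks are 1-based, a higher fused score is better."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Sort by descending score; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Three made-up image rankings from different retrieval configurations.
fused = reciprocal_rank_fusion([
    ["img_a", "img_b", "img_c"],
    ["img_b", "img_a", "img_d"],
    ["img_b", "img_c", "img_a"],
])
```

Because only ranks are used, the method needs no score calibration across configurations, which makes it a convenient way to combine heterogeneous retrievers and rerankers.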

[CV-166] Causal Interpretation of Sparse Autoencoder Features in Vision

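[Quick read]: This paper addresses biased interpretation of Sparse Auto-Encoder (SAE) features in vision transformers. The usual practice infers a feature's meaning only from the image patches where its activation is highest, ignoring the global information mixing caused by self-attention, so co-occurrence can be mistaken for causation. The key to the solution is Causal Feature Explanation (CaFE), which uses the Effective Receptive Field (ERF) to localize the causal drivers of an activation: input-attribution methods identify the image regions that actually cause a given SAE feature to fire, giving more faithful and semantically precise explanations.

Link: https://arxiv.org/abs/2509.00749
Authors: Sangyu Han, Yearim Kim, Nojun Kwak
Affiliations: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: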

Abstract:Understanding what sparse auto-encoder (SAE) features in vision transformers truly represent is usually done by inspecting the patches where a feature’s activation is highest. However, self-attention mixes information across the entire image, so an activated patch often co-occurs with, but does not cause, the feature’s firing. We propose Causal Feature Explanation (CaFE), which leverages Effective Receptive Field (ERF). We consider each activation of an SAE feature to be a target and apply input-attribution methods to identify the image patches that causally drive that activation. Across CLIP-ViT features, ERF maps frequently diverge from naive activation maps, revealing hidden context dependencies (e.g., a “roaring face” feature that requires the co-occurrence of eyes and nose, rather than merely an open mouth). Patch insertion tests confirm that CaFE more effectively recovers or suppresses feature activations than activation-ranked patches. Our results show that CaFE yields more faithful and semantically precise explanations of vision-SAE features, highlighting the risk of misinterpretation when relying solely on activation location.

[CV-167] Enhancing Fairness in Skin Lesion Classification for Medical Diagnosis Using Prune Learning

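[Quick read]: This paper addresses potential skin-tone bias in skin lesion classification models, which can affect diagnostic accuracy and undermine equitable healthcare. The key to the solution is a new fairness algorithm that computes the skewness of the feature maps in the convolutional layers of a VGG network and of the patches and heads of a Vision Transformer, then identifies and suppresses redundant channels related to skin tone so that the model focuses on the lesion area. Without relying on conventional statistical methods, the approach lowers computational cost and improves fairness across skin-tone groups, and it may also reduce model size, making real-world deployment more practical.

Link: https://arxiv.org/abs/2509.00745
Authors: Kuniko Paxton, Koorosh Aslansefat, Dhavalkumar Thakker, Yiannis Papadopoulos, Tanaya Maslekar
Affiliations: University of Hull; Leeds Teaching Hospitals NHS Trust
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: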

Abstract:Recent advances in deep learning have significantly improved the accuracy of skin lesion classification models, supporting medical diagnoses and promoting equitable healthcare. However, concerns remain about potential biases related to skin color, which can impact diagnostic outcomes. Ensuring fairness is challenging due to difficulties in classifying skin tones, high computational demands, and the complexity of objectively verifying fairness. To address these challenges, we propose a fairness algorithm for skin lesion classification that overcomes the challenges associated with achieving diagnostic fairness across varying skin tones. By calculating the skewness of the feature map in the convolution layer of the VGG (Visual Geometry Group) network and the patches and the heads of the Vision Transformer, our method reduces unnecessary channels related to skin tone, focusing instead on the lesion area. This approach lowers computational costs and mitigates bias without relying on conventional statistical methods. It potentially reduces model size while maintaining fairness, making it more practical for real-world applications.
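
The statistic at the heart of this method is the skewness of activations. The sketch below computes the standard moment-based skewness for each channel of a flattened feature map and flags channels whose absolute skewness exceeds a threshold. The flagging rule and the threshold value are placeholders chosen for illustration; the abstract does not say which channels the authors prune or how the cut-off is set.

```python
from typing import List, Sequence


def skewness(values: Sequence[float]) -> float:
    """Moment-based skewness m3 / m2**1.5 (0.0 for a constant signal)."""
    n = len(values)
    if n == 0:
        raise ValueError("values must be non-empty")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0
    return m3 / m2 ** 1.5


def flag_skewed_channels(feature_maps: Sequence[Sequence[float]],
                         threshold: float) -> List[int]:
    """Return indices of channels whose |skewness| exceeds the threshold."""
    return [
        index for index, channel in enumerate(feature_maps)
        if abs(skewness(channel)) > threshold
    ]


# Two made-up flattened channels: one symmetric, one dominated by a single peak.
channels = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0, 4.0]]
flagged = flag_skewed_channels(channels, threshold=1.0)
```

Since the statistic needs only one pass over the activations of each channel, a criterion of this kind is cheap to evaluate, which is consistent with the low computational cost claimed in the abstract.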

[CV-168] Prompt the Unseen: Evaluating Visual-Language Alignment Beyond Supervision

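[Quick read]: This paper addresses the lack of systematic evaluation of how well the projection layer in Vision-Language Models (VLMs) generalizes to unseen visual concepts. The projection layer maps visual features into the language model's embedding space and is critical to multimodal performance, yet its generalization to unseen visual categories has not been studied thoroughly. The key to the solution is a new benchmark that converts object detection datasets with rich fine-grained annotations into a prompting format and designs train/test splits with disjoint label sets, giving precise control over the separation of seen and unseen concepts. Experiments show that performance on unseen classes is about 79 to 88 percent of that on seen classes, indicating non-trivial generalization. A mechanistic interpretability analysis further shows that the feed-forward network in the projection layer behaves like a key-value memory that processes seen and unseen tokens in similar ways, which points to the potential for efficient VLM training with limited aligned data.

Link: https://arxiv.org/abs/2509.00700
Authors: Raehyuk Jung, Seungjun Yu, Hyunjung Shim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: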

Abstract:Vision-Language Models (VLMs) combine a vision encoder and a large language model (LLM) through alignment training, showing strong performance on multimodal tasks. A central component in this architecture is the projection layer, which maps visual features into the LLM’s embedding space. Despite its importance, its ability to generalize to unseen visual concepts has not been systematically evaluated. To address this, we propose a benchmark for evaluating projection-layer generalization. We adapt object detection datasets (rich in fine-grained annotations) into a prompting format and design train/test splits with disjoint label sets, enabling precise control over seen and unseen concept separation. Experimental results show that the projection layer retains about 79 to 88 percent of the performance on unseen classes compared to seen ones across various settings, suggesting a non-trivial level of generalization even without explicit alignment supervision on those concepts. We further analyze this behavior through a mechanistic interpretability lens. Our findings indicate that the feed-forward network in the projection layer functions like a key-value memory, processing seen and unseen tokens in similar ways. This study introduces a new evaluation framework for alignment generalization and highlights the potential for efficient VLM training with limited aligned data.
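
The benchmark's central design choice is a split in which the evaluation labels never appear during training. A minimal sketch of such a split follows; it assumes one label per sample, which is a simplification because detection images usually contain several objects, and the function name, the `unseen_fraction` parameter, and the sample data are all hypothetical.

```python
import random
from typing import List, Set, Tuple

Sample = Tuple[str, str]  # (image id, class label)


def split_by_label(samples: List[Sample], unseen_fraction: float = 0.2,
                   seed: int = 0) -> Tuple[List[Sample], List[Sample], Set[str]]:
    """Hold out whole classes so that train and test label sets are disjoint."""
    labels = sorted({label for _, label in samples})
    if len(labels) < 2:
        raise ValueError("need at least two classes to build a disjoint split")
    rng = random.Random(seed)
    rng.shuffle(labels)
    # Hold out at least one class, but always leave at least one for training.
    num_unseen = min(len(labels) - 1, max(1, round(len(labels) * unseen_fraction)))
    unseen = set(labels[:num_unseen])
    train = [sample for sample in samples if sample[1] not in unseen]
    test = [sample for sample in samples if sample[1] in unseen]
    return train, test, unseen


samples = [
    ("img0", "cat"), ("img1", "dog"), ("img2", "cat"),
    ("img3", "zebra"), ("img4", "bus"), ("img5", "kite"),
]
train, unseen_test, unseen_labels = split_by_label(samples, unseen_fraction=0.4)
```

With a split like this, the reported 79 to 88 percent figure corresponds to the ratio of accuracy on held-out classes to accuracy on classes seen during alignment training.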

[CV-169] CascadeFormer: A Family of Two-stage Cascading Transformers for Skeleton-based Human Action Recognition

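[Quick read]: This paper addresses limited representation learning in skeleton-based human action recognition, in particular how to exploit the spatiotemporal structure of skeleton data to improve generalization and discriminative power. The key to the solution is CascadeFormer, a two-stage cascading transformer architecture: the first stage learns generalizable skeleton representations through masked pretraining, and the second stage uses cascading fine-tuning to optimize discriminative performance on action classification. The method combines the transformer's strength in modeling long-range dependencies with the efficiency of staged training for task-specific feature extraction, and achieves competitive performance on several benchmark datasets.

Link: https://arxiv.org/abs/2509.00692
Authors: Yusen Peng, Alper Yilmaz
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: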

Abstract:Skeleton-based human action recognition leverages sequences of human joint coordinates to identify actions performed in videos. Owing to the intrinsic spatiotemporal structure of skeleton data, Graph Convolutional Networks (GCNs) have been the dominant architecture in this field. However, recent advances in transformer models and masked pretraining frameworks open new avenues for representation learning. In this work, we propose CascadeFormer, a family of two-stage cascading transformers for skeleton-based human action recognition. Our framework consists of a masked pretraining stage to learn generalizable skeleton representations, followed by a cascading fine-tuning stage tailored for discriminative action classification. We evaluate CascadeFormer across three benchmark datasets (Penn Action, N-UCLA, and NTU RGB+D 60), achieving competitive performance on all tasks. To promote reproducibility, we release our code and model checkpoints.
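
Masked pretraining on skeletons generally means hiding some joint coordinates and training the model to reconstruct them. The sketch below shows only the masking step on a sequence of 2D joints. The per-frame random joint masking, the 30% ratio, and the zero fill value are assumptions for illustration; the abstract does not describe the exact masking scheme used by CascadeFormer.

```python
import random
from typing import List, Optional, Tuple

Joint = Tuple[float, float]
Frame = List[Joint]


def mask_joints(sequence: List[Frame], mask_ratio: float = 0.3,
                seed: Optional[int] = None) -> Tuple[List[Frame], List[List[bool]]]:
    """Zero out a random subset of joints in every frame.

    Returns the masked sequence and a boolean mask (True = joint was hidden),
    which a reconstruction loss would use to select its targets.
    """
    rng = random.Random(seed)
    masked_sequence: List[Frame] = []
    mask: List[List[bool]] = []
    for frame in sequence:
        num_hidden = round(mask_ratio * len(frame))
        hidden = set(rng.sample(range(len(frame)), num_hidden))
        masked_sequence.append(
            [(0.0, 0.0) if j in hidden else joint for j, joint in enumerate(frame)]
        )
        mask.append([j in hidden for j in range(len(frame))])
    return masked_sequence, mask


# Two made-up frames with ten joints each; all coordinates are non-zero.
sequence = [[(float(j + 1), float(t + 1)) for j in range(10)] for t in range(2)]
masked, mask = mask_joints(sequence, mask_ratio=0.3, seed=0)
```

During pretraining, the model would receive `masked` as input and be penalized only on the positions where `mask` is true, which pushes it to infer hidden joints from the visible spatiotemporal context.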

[CV-170] CSFMamba: Cross State Fusion Mamba Operator for Multimodal Remote Sensing Image Classification

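[Quick read]: This paper addresses the high computational complexity of multimodal feature fusion and the difficulty of modeling long-range dependencies in remote sensing image classification. Methods such as the Transformer capture the complementarity of spatial-spectral information, but their quadratic complexity limits effective modeling of long-range feature relationships and overburdens the network. The key to the solution is the Cross State Fusion Mamba (CSFMamba) network. It relies on the time-varying parameters and hardware optimization of the Mamba structure to cut computational cost, and it adds a cross-state fusion module based on the Mamba operator to fuse features from Hyperspectral Imaging (HSI) and LiDAR efficiently and deeply, improving full-image understanding and classification performance while reducing the training burden.

Link: https://arxiv.org/abs/2509.00677
Authors: Qingyu Wang, Xue Jiang, Guozheng Xu
Affiliations: Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 2 figures, accepted by 2025 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2025), not published yet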

Abstract:Multimodal fusion has made great progress in the field of remote sensing image classification due to its ability to exploit complementary spatial-spectral information. Deep learning methods such as CNN and Transformer have been widely used in this domain. Recent work on State Space Models (SSMs) has highlighted that prior methods suffer from quadratic computational complexity. As a result, modeling longer-range dependencies of spatial-spectral features imposes an overwhelming burden on the network. Mamba solves this problem by incorporating time-varying parameters into the ordinary SSM and performing hardware optimization, but it cannot perform feature fusion directly. In order to make full use of Mamba’s low computational burden and explore the potential of its internal structure in multimodal feature fusion, we propose the Cross State Fusion Mamba (CSFMamba) Network. Specifically, we first design a preprocessing module for remote sensing image information that meets the needs of the Mamba structure, and combine it with a CNN to extract multi-layer features. Secondly, a cross-state module based on the Mamba operator is designed to fully fuse the features of the two modalities. The advantages of Mamba and CNN are combined by designing a more powerful backbone. We capture the fusion relationship between HSI and LiDAR modalities with stronger full-image understanding. Experimental results on the MUUFL and Houston2018 datasets show that the proposed method outperforms Transformer-based methods while reducing the network training burden.

[CV-171] LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model

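[Quick read]: This paper addresses the inefficiency caused by separating critic models from policy models in vision-language modeling: critics are conventionally used only for scoring or preference judgments and do not take part in generation. The core idea is to reorganize preference-labeled critic data into verifiable training signals and run reinforcement learning (RL) directly on a base generative model, yielding LLaVA-Critic-R1, a unified model that both evaluates and generates. The key innovation is that RL training optimizes preference judgments while retaining full generation ability, so the model is strong at both scoring and answering, and self-critique at inference time brings further gains without additional training.

Link: https://arxiv.org/abs/2509.00676
Authors: Xiyao Wang, Chunyuan Li, Jianwei Yang, Kai Zhang, Bo Liu, Tianyi Xiong, Furong Huang
Affiliations: University of Maryland; The Ohio State University; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: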

Abstract:In vision-language modeling, critic models are typically trained to evaluate outputs – assigning scalar scores or pairwise preferences – rather than to generate responses. This separation from policy models, which produce the responses, is so entrenched that critics are rarely considered for direct policy use. In this work, we challenge this convention. We propose to reorganize preference-labeled critic datasets into verifiable training signals and perform reinforcement learning directly on a base generative model, producing LLaVA-Critic-R1, a multimodal critic trained to optimize preference judgments while retaining full generation ability. Surprisingly, LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model – matching or surpassing specialized reasoning VLMs trained with in-domain data across 26 visual reasoning and understanding benchmarks, with an average gain of +5.7% over its base model (Qwen-2.5-VL-7B). Extending this approach to existing strong reasoning VLMs yields LLaVA-Critic-R1+, which further advances policy performance without sacrificing critic quality, achieving a SoTA performance of 71.9 on MMMU at the 7B scale. Finally, we show that the enhanced critic ability benefits inference: applying self-critique at test time yields an average +13.8% improvement on five representative reasoning tasks without additional training. Our results reveal that RL training on critic data can produce a unified model excelling at both evaluation and generation, offering a simple path toward scalable, self-improving multimodal systems.
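
The abstract does not spell out how preference labels become a "verifiable training signal" or how self-critique is applied at test time. As a rough sketch only, one plausible reading is a binary reward for matching the labeled preference, plus a pairwise tournament over candidate answers at inference. The function names, the "A"/"B" answer format, and the `judge` callable are all assumptions made for illustration, not the paper's implementation.

```python
from typing import Callable, Sequence


def preference_reward(predicted: str, labeled: str) -> float:
    """Hypothetical verifiable reward: 1.0 if the stated preference matches the label."""
    return 1.0 if predicted.strip().upper() == labeled.strip().upper() else 0.0


def self_critique_select(candidates: Sequence[str],
                         judge: Callable[[str, str], str]) -> str:
    """Keep the winner of successive pairwise comparisons.

    judge(a, b) is assumed to return "A" when a is preferred and "B" otherwise;
    in practice it would be the critic model itself.
    """
    best = candidates[0]
    for challenger in candidates[1:]:
        if judge(best, challenger) == "B":
            best = challenger
    return best
```

Because the reward depends only on an exact match with the label, it can be checked automatically, which is the property RL with verifiable rewards needs.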

[CV-172] ER-LoRA: Effective-Rank Guided Adaptation for Weather-Generalized Depth Estimation

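[Quick read]: This paper tackles monocular depth estimation under adverse weather, where the main obstacles are the lack of reliable ground truth and the difficulty of learning from unlabeled real-world data. Existing methods either rely on synthetic adverse-weather data with pseudo-labels (which suffers from a domain gap) or use self-supervised learning (whose photometric consistency assumption breaks down in adverse scenes). The key to the solution is parameter-efficient fine-tuning (PEFT) of Vision Foundation Models (VFMs) using only a small amount of high-visibility (normal) data to obtain weather-generalized depth estimation. To this end the authors propose the Selecting-Tuning-Maintaining (STM) strategy, which structurally decomposes the pretrained weights using the entropy-rank and the stable-rank: in the tuning stage it adaptively selects the rank and the task-aware singular directions for initialization based on the entropy-rank and the fully fine-tuned weights, and in the maintaining stage it applies a principal-direction regularization to preserve pretrained knowledge. This balances flexible adaptation with strong generalization for a geometric task, and on four real-world multi-weather benchmarks it outperforms existing PEFT methods, full fine-tuning, and even methods trained on synthetic adverse data.

Link: https://arxiv.org/abs/2509.00665
Authors: Weilong Yan, Xin Zhang, Robby T. Tan
Affiliations: National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Notes: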

Abstract:Monocular depth estimation under adverse weather conditions (e.g., rain, fog, snow, and nighttime) remains highly challenging due to the lack of reliable ground truth and the difficulty of learning from unlabeled real-world data. Existing methods often rely on synthetic adverse data with pseudo-labels, which suffer from domain gaps, or employ self-supervised learning, which violates photometric assumptions in adverse scenarios. In this work, we propose to achieve weather-generalized depth estimation by Parameter-Efficient Fine-Tuning (PEFT) of Vision Foundation Models (VFMs), using only a small amount of high-visibility (normal) data. While PEFT has shown strong performance in semantic tasks such as segmentation, it remains underexplored for geometry-centric tasks like depth estimation, especially in terms of balancing effective adaptation with the preservation of pretrained knowledge. To this end, we introduce the Selecting-Tuning-Maintaining (STM) strategy, which structurally decomposes the pretrained weights of VFMs based on two kinds of effective ranks (entropy-rank and stable-rank). In the tuning phase, we adaptively select the proper rank number as well as the task-aware singular directions for initialization, based on the entropy-rank and full-tuned weight; while in the maintaining stage, we enforce a principal direction regularization based on the stable-rank. This design guarantees flexible task adaptation while preserving the strong generalization capability of the pretrained VFM. Extensive experiments on four real-world benchmarks across diverse weather conditions demonstrate that STM not only outperforms existing PEFT methods and full fine-tuning but also surpasses methods trained with adverse synthetic data, and even the depth foundation model
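
The two effective ranks named in the abstract have widely used textbook definitions, sketched below with NumPy. This assumes the paper follows those standard definitions (stable rank as the ratio of squared Frobenius norm to squared spectral norm, entropy rank as the exponential of the Shannon entropy of the normalized singular values); the exact variants used by STM may differ, and both functions assume a non-zero weight matrix.

```python
import numpy as np


def stable_rank(w: np.ndarray) -> float:
    """||W||_F^2 / ||W||_2^2, computed from the singular values."""
    s = np.linalg.svd(w, compute_uv=False)  # sorted in descending order
    return float(np.sum(s ** 2) / s[0] ** 2)


def entropy_rank(w: np.ndarray) -> float:
    """exp(H(p)) where p is the singular-value distribution s / sum(s)."""
    s = np.linalg.svd(w, compute_uv=False)
    p = s / np.sum(s)
    p = p[p > 0]  # drop exact zeros so log() is defined
    return float(np.exp(-np.sum(p * np.log(p))))
```

Both quantities equal the ordinary rank when all non-zero singular values are equal, and shrink toward 1 as the spectrum concentrates on one direction, which is why they are useful for deciding how many directions an adapter needs.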

[CV-173] Fusion to Enhance: Fusion Visual Encoder to Enhance Multimodal Language Model

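[Quick read]: This paper addresses the visual perception bottleneck of Multimodal Large Language Models (MLLMs): they excel at high-level semantic understanding yet perform poorly on basic visual tasks that require fine detail perception. The deficiency stems mainly from the prevailing reliance on a single vision encoder optimized for high-level semantic alignment, which sacrifices the ability to capture fine-grained visual information. The key to the solution is the Fusion to Enhance (FtZ) framework, which uses a lightweight Multi-Head Cross-Attention mechanism to compose a semantically powerful anchor encoder with a perception-rich augmenting encoder, improving fine-grained perception without a significant increase in computation. Experiments show that FtZ significantly outperforms single-encoder baselines and existing feature fusion methods on benchmarks that demand fine-grained visual understanding, such as TextVQA, POPE, and MMMU.

Link: https://arxiv.org/abs/2509.00664
Authors: Yifei She, Huangxuan Wu
Affiliations: Beijing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: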

Abstract:Multimodal Large Language Models (MLLMs) have made significant progress in bridging visual perception with high-level textual reasoning. However, they face a fundamental contradiction: while excelling at complex semantic understanding, these models often fail at basic visual tasks that require precise detail perception. This deficiency primarily stems from the prevalent architectural reliance on a single vision encoder optimized for high-level semantic alignment, which inherently sacrifices the ability to capture fine-grained visual information. To address this issue, we introduce Fusion to Enhance (FtZ), a novel vision tower framework. FtZ moves beyond the single-encoder design by innovatively composing a semantically powerful anchor encoder with a perception-rich augmenting encoder via a lightweight Multi-Head Cross-Attention mechanism. Experimental results demonstrate that on several challenging benchmarks demanding fine-grained visual understanding, such as TextVQA, POPE, MMMU, MME and MM-Vet, our FtZ model significantly outperforms baselines that use only a single encoder or existing feature fusion methods. This work proves that composing heterogeneous expert encoders is an efficient and effective path to overcoming the visual perception bottleneck in current MLLMs, offering a new design paradigm for building next-generation AI systems with stronger perceptual capabilities.
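
To make the fusion step concrete, here is a minimal NumPy sketch of multi-head cross-attention in which anchor-encoder tokens act as queries and augmenting-encoder tokens supply keys and values. The residual connection, the weight shapes, and the function name are illustrative assumptions; the abstract does not describe FtZ's actual layer layout.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_cross_attention(anchor, augment, wq, wk, wv, wo, num_heads):
    """anchor: (n, d) query tokens; augment: (m, d_aug) key/value tokens.

    wq: (d, d), wk and wv: (d_aug, d), wo: (d, d); d must be divisible by num_heads.
    """
    n, d = anchor.shape
    m = augment.shape[0]
    dh = d // num_heads
    # Project, then split the channel dimension into heads: (heads, tokens, dh).
    q = (anchor @ wq).reshape(n, num_heads, dh).transpose(1, 0, 2)
    k = (augment @ wk).reshape(m, num_heads, dh).transpose(1, 0, 2)
    v = (augment @ wv).reshape(m, num_heads, dh).transpose(1, 0, 2)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))  # (heads, n, m)
    fused = (attn @ v).transpose(1, 0, 2).reshape(n, d)     # merge heads back
    return anchor + fused @ wo  # residual keeps the anchor's semantics (assumed)
```

With this layout the output keeps the anchor's token count and width, so a language model consuming the anchor features would not need to change its input interface.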

[CV-174] Automatic Identification and Description of Jewelry Through Computer Vision and Neural Networks for Translators and Interpreters

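[Quick read]: This paper addresses the difficulty of identifying jewelry: the wide variety of styles and designs makes it hard for non-experts such as translators and interpreters to obtain accurate descriptions, which have traditionally been available only from industry experts. It proposes a neural-network-based method that automatically identifies and describes jewelry, using computer vision and image captioning to produce natural language descriptions at multiple levels. The key innovation is a three-level description hierarchy in which different image captioning architectures detect the jewelry in an image and generate descriptions at different levels of detail, emulating expert analysis. The evaluation focuses on the encoder-decoder model, and the final model achieves a captioning accuracy above 90%.

Link: https://arxiv.org/abs/2509.00661
Authors: Jose Manuel Alcalde-Llergo, Aurora Ruiz-Mezcua, Rocio Avila-Ramirez, Andrea Zingoni, Juri Taborri, Enrique Yeguas-Bolivar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 16 pages, 3 figures, 4 tables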

Abstract:Identifying jewelry pieces presents a significant challenge due to the wide range of styles and designs. Currently, precise descriptions are typically limited to industry experts. However, translators and interpreters often require a comprehensive understanding of these items. In this study, we introduce an innovative approach to automatically identify and describe jewelry using neural networks. This method enables translators and interpreters to quickly access accurate information, aiding in resolving queries and gaining essential knowledge about jewelry. Our model operates at three distinct levels of description, employing computer vision techniques and image captioning to emulate expert analysis of accessories. The key innovation involves generating natural language descriptions of jewelry across three hierarchical levels, capturing nuanced details of each piece. Different image captioning architectures are utilized to detect jewels in images and generate descriptions with varying levels of detail. To demonstrate the effectiveness of our approach in recognizing diverse types of jewelry, we assembled a comprehensive database of accessory images. The evaluation process involved comparing various image captioning architectures, focusing particularly on the encoder decoder model, crucial for generating descriptive captions. After thorough evaluation, our final model achieved a captioning accuracy exceeding 90 per cent.

[CV-175] Face4FairShifts: A Large Image Benchmark for Fairness and Robust Learning across Visual Domains

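[Quick read]: This paper addresses the difficulty of guaranteeing fairness and robustness of machine learning models under domain shift. The key to the solution is Face4FairShifts, a large-scale facial image benchmark with multi-attribute annotations: 100,000 images across four visually distinct domains, with 39 annotations within 14 attributes covering demographic and facial features. Through systematic experiments the study reveals performance gaps of existing models under distribution shift, highlights the limitations of current datasets and the need for more effective fairness-aware domain adaptation techniques, and provides a comprehensive testbed for fair and reliable AI systems.

Link: https://arxiv.org/abs/2509.00658
Authors: Yumeng Lin, Dong Li, Xintao Wu, Minglai Shao, Xujiang Zhao, Zhong Chen, Chen Zhao
Affiliations: Tianjin University; Baylor University; University of Arkansas; NEC Laboratories America; Southern Illinois University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)
Notes: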

Abstract:Ensuring fairness and robustness in machine learning models remains a challenge, particularly under domain shifts. We present Face4FairShifts, a large-scale facial image benchmark designed to systematically evaluate fairness-aware learning and domain generalization. The dataset includes 100,000 images across four visually distinct domains with 39 annotations within 14 attributes covering demographic and facial features. Through extensive experiments, we analyze model performance under distribution shifts and identify significant gaps. Our findings emphasize the limitations of existing related datasets and the need for more effective fairness-aware domain adaptation techniques. Face4FairShifts provides a comprehensive testbed for advancing equitable and reliable AI systems. The dataset is available online at this https URL.

[CV-176] MV-SSM: Multi-View State Space Modeling for 3D Human Pose Estimation CVPR2025

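[Quick read]: This paper addresses the limited generalization of multi-view 3D human pose estimation models, whose performance drops markedly under new camera configurations and in occluded scenes. Existing attention-based Transformer methods struggle to model the spatial arrangement of keypoints accurately and tend to overfit the camera layouts and visual scenes seen in training. The core of the solution is a new Multi-View State Space Modeling framework (MV-SSM) that explicitly models the joint spatial sequence at two levels: the feature level from multi-view images and the person keypoint level. It introduces a Projective State Space (PSS) block that uses state space modeling to learn a generalized representation of joint spatial arrangements. It also modifies Mamba's traditional scanning into Grid Token-guided Bidirectional Scanning (GTBS), a key component of the PSS block, which improves robustness to different camera setups and cross-scene generalization.

Link: https://arxiv.org/abs/2509.00649
Authors: Aviral Chharia, Wenbo Gou, Haoye Dong
Affiliations: Carnegie Mellon University; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Notes: CVPR 2025; Project Website: this https URL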

Abstract:While significant progress has been made in single-view 3D human pose estimation, multi-view 3D human pose estimation remains challenging, particularly in terms of generalizing to new camera configurations. Existing attention-based transformers often struggle to accurately model the spatial arrangement of keypoints, especially in occluded scenarios. Additionally, they tend to overfit specific camera arrangements and visual scenes from training data, resulting in substantial performance drops in new settings. In this study, we introduce a novel Multi-View State Space Modeling framework, named MV-SSM, for robustly estimating 3D human keypoints. We explicitly model the joint spatial sequence at two distinct levels: the feature level from multi-view images and the person keypoint level. We propose a Projective State Space (PSS) block to learn a generalized representation of joint spatial arrangements using state space modeling. Moreover, we modify Mamba’s traditional scanning into an effective Grid Token-guided Bidirectional Scanning (GTBS), which is integral to the PSS block. Multiple experiments demonstrate that MV-SSM achieves strong generalization, outperforming state-of-the-art methods: +10.8 on AP25 (+24%) on the challenging three-camera setting in CMU Panoptic, +7.0 on AP25 (+13%) on varying camera arrangements, and +15.3 PCP (+38%) on Campus A1 in cross-dataset evaluations. Project Website: this https URL

[CV-177] AMCR: A Framework for Assessing and Mitigating Copyright Risks in Generative Models

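[Quick read]: This paper addresses the legal and ethical risks that arise when generative models for text-to-image tasks, which rely on large-scale training data, unintentionally replicate copyrighted content. The key to the solution is a comprehensive framework called AMCR (Assessing and Mitigating Copyright Risks), which (1) extends conventional prompt-based safeguards by systematically restructuring risky prompts into safe forms, (2) detects partial infringement through attention-based similarity analysis, and (3) adaptively mitigates copyright risk during generation without compromising image quality.

Link: https://arxiv.org/abs/2509.00641
Authors: Zhipeng Yin, Zichong Wang, Avash Palikhe, Zhen Liu, Jun Liu, Wenbin Zhang
Affiliations: Florida International University; Guangdong University of Foreign Studies; Carnegie Mellon University
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Notes: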

Abstract:Generative models have achieved impressive results in text to image tasks, significantly advancing visual content creation. However, this progress comes at a cost, as such models rely heavily on large-scale training data and may unintentionally replicate copyrighted elements, creating serious legal and ethical challenges for real-world deployment. To address these concerns, researchers have proposed various strategies to mitigate copyright risks, most of which are prompt based methods that filter or rewrite user inputs to prevent explicit infringement. While effective in handling obvious cases, these approaches often fall short in more subtle situations, where seemingly benign prompts can still lead to infringing outputs. To address these limitations, this paper introduces Assessing and Mitigating Copyright Risks (AMCR), a comprehensive framework which i) builds upon prompt-based strategies by systematically restructuring risky prompts into safe and non-sensitive forms, ii) detects partial infringements through attention-based similarity analysis, and iii) adaptively mitigates risks during generation to reduce copyright violations without compromising image quality. Extensive experiments validate the effectiveness of AMCR in revealing and mitigating latent copyright risks, offering practical insights and benchmarks for the safer deployment of generative models.

[CV-178] Towards Methane Detection Onboard Satellites

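[Quick read]: This paper addresses the computational cost and data transmission cost that conventional image preprocessing imposes on methane detection onboard satellites. Existing methods typically rely on preprocessing such as orthorectification and matched filters to enhance methane plume signals, which adds processing steps before detection can run. The key to the proposed solution is to use raw, unorthorectified data (UnorthoDOS) and train machine learning (ML) models directly on it, skipping geometric correction and signal enhancement. Experiments show that ML models trained on unorthorectified data perform comparably to models trained on orthorectified data, and that models trained on orthorectified data can outperform the matched filter baseline (mag1c).

Link: https://arxiv.org/abs/2509.00626
Authors: Maggie Chen, Hala Lambdouar, Luca Marini, Laura Martínez-Ferrer, Chris Bridges, Giacomo Acciarini
Affiliations: University of Oxford; Delft University of Technology; Universitat de València; University of Surrey; European Space Agency
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: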

Abstract:Methane is a potent greenhouse gas and a major driver of climate change, making its timely detection critical for effective mitigation. Machine learning (ML) deployed onboard satellites can enable rapid detection while reducing downlink costs, supporting faster response systems. Conventional methane detection methods often rely on image processing techniques, such as orthorectification to correct geometric distortions and matched filters to enhance plume signals. We introduce a novel approach that bypasses these preprocessing steps by using unorthorectified data (UnorthoDOS). We find that ML models trained on this dataset achieve performance comparable to those trained on orthorectified data. Moreover, we also train models on an orthorectified dataset, showing that they can outperform the matched filter baseline (mag1c). We release model checkpoints and two ML-ready datasets comprising orthorectified and unorthorectified hyperspectral images from the Earth Surface Mineral Dust Source Investigation (EMIT) sensor at this https URL, along with code at this https URL.
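
For readers unfamiliar with the baseline family, the sketch below shows the classic matched filter for hyperspectral pixels: each spectrum is centered, whitened by the scene covariance, and projected onto a target signature. This is the textbook form only; it does not reproduce mag1c, and the function name, the `(pixels, bands)` layout, and the use of a single scene-wide covariance are assumptions for illustration.

```python
import numpy as np


def matched_filter(pixels: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Classic matched filter score per pixel.

    pixels: (num_pixels, num_bands) spectra; target: (num_bands,) gas signature.
    Returns (x - mu)^T C^-1 t / (t^T C^-1 t) for every pixel x.
    """
    mu = pixels.mean(axis=0)
    centered = pixels - mu
    cov = centered.T @ centered / (pixels.shape[0] - 1)
    cov_inv_t = np.linalg.solve(cov, target)  # C^-1 t without forming the inverse
    return centered @ cov_inv_t / (target @ cov_inv_t)
```

The score can be read as an estimate of how much of the target signature is present in each pixel, which is why such filters are used to enhance plume signals before thresholding.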

[CV-179] DGL-RSIS: Decoupling Global Spatial Context and Local Class Semantics for Training-Free Remote Sensing Image Segmentation

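[Quick read]: This paper addresses the challenges of transferring Vision Language Models (VLMs) from natural images to remote sensing (RS) image segmentation, chiefly the limited category diversity of RS datasets and the domain gap between natural and RS imagery. The key to the solution is a training-free framework, DGL-RSIS, that decouples visual and textual inputs and performs vision-language alignment at both the local semantic level and the global contextual level. First, a Global-Local Decoupling (GLD) module uses natural language processing (NLP) to split the text into local class nouns and global modifiers, and uses unsupervised mask proposal networks to produce class-agnostic mask proposals. Second, at the local scale, a context-aware cropping strategy extracts image patches with proper boundaries and RS-specific knowledge is injected to enrich the text features, which are then matched with mask-guided visual features to support open-vocabulary semantic segmentation (OVSS). Finally, at the global scale, a Cross-Scale Grad-CAM module refines Grad-CAM maps using information from the global modifiers, and a mask selection module integrates pixel-level activations into the mask-level segmentation output, enabling accurate and interpretable referring expression segmentation (RES) across global and local dimensions.

Link: https://arxiv.org/abs/2509.00598
Authors: Boyi Li, Ce Zhang, Richard M. Timmerman, Wenxuan Bao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Submitted to IEEE Transactions on Geoscience and Remote Sensing (TGRS), under review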

Abstract:The emergence of vision language models (VLMs) has bridged vision and language, enabling joint multimodal understanding beyond traditional visual-only deep learning models. However, transferring VLMs from the natural image domain to remote sensing (RS) segmentation remains challenging due to the limited category diversity in RS datasets and the domain gap between natural and RS imagery. Here, we propose a training-free framework, DGL-RSIS, that decouples visual and textual inputs, performing visual-language alignment at both the local semantic and global contextual levels through tailored strategies. Specifically, we first introduce a global-local decoupling (GLD) module, where text inputs are divided into local class nouns and global modifiers using natural language processing (NLP) techniques; image inputs are partitioned into a set of class-agnostic mask proposals via unsupervised mask proposal networks. Second, visual and textual features are aligned at local scale, through a novel context-aware cropping strategy for extracting image patches with proper boundaries and introducing RS-specific knowledge to enrich the text inputs. By matching the enhanced text features with mask-guided visual features, we enable the mask classification, supporting open-vocabulary semantic segmentation (OVSS). Third, at the global scale, we propose a Cross-Scale Grad-CAM module to refine Grad-CAM maps using contextual information from global modifiers. A subsequent mask selection module integrates pixel-level Grad-CAM activations into the mask-level segmentation output, such that accurate and interpretable alignment can be realized across global and local dimensions for referring expression segmentation (RES).

[CV-180] C-DiffDet: Fusing Global Scene Context with Generative Denoising for High-Fidelity Object Detection

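[Quick read]: This paper addresses the performance bottleneck of fine-grained object detection in challenging visual domains (such as vehicle damage assessment) that is caused by conditioning only on local features, especially in context-dependent scenes. The key to the solution is the Context-Aware Fusion (CAF) module, which uses a cross-attention mechanism to fuse global scene context directly with local proposal features. Specifically, a separate dedicated encoder produces the global context representation so that each object proposal can explicitly attend to scene-level semantics, which strengthens the context awareness of the generative detection paradigm.

Link: https://arxiv.org/abs/2509.00578
Authors: Abdellah Zakaria Sellam, Ilyes Benaissa, Salah Eddine Bekhouche, Abdenour Hadid, Vito Renó, Cosimo Distante
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: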

Abstract:Fine-grained object detection in challenging visual domains, such as vehicle damage assessment, presents a formidable challenge even for human experts to resolve reliably. While DiffusionDet has advanced the state-of-the-art through conditional denoising diffusion, its performance remains limited by local feature conditioning in context-dependent scenarios. We address this fundamental limitation by introducing Context-Aware Fusion (CAF), which leverages cross-attention mechanisms to integrate global scene context with local proposal features directly. The global context is generated using a separate dedicated encoder that captures comprehensive environmental information, enabling each object proposal to attend to scene-level understanding. Our framework significantly enhances the generative detection paradigm by enabling each object proposal to attend to comprehensive environmental information. Experimental results demonstrate an improvement over state-of-the-art models on the CarDD benchmark, establishing new performance benchmarks for context-aware object detection in fine-grained domains.

[CV-181] Galaxea Open-World Dataset and G0 Dual-System VLA Model

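[Quick read]: This paper addresses the limited generalization and the difficulty of multimodal understanding that robots face when executing complex tasks in the open world, particularly efficient and robust embodied behavior in real human living and working environments. The key to the solution is a large-scale, diverse robot behavior dataset, the Galaxea Open-World Dataset, together with a dual-system framework, G0, that couples a Vision-Language Model (VLM) for high-level multimodal planning with a Vision-Language-Action (VLA) model for fine-grained execution. Training follows a three-stage curriculum, in which the single-embodiment pre-training stage combined with the high-quality dataset plays a critical role in achieving strong performance across task scenarios.

Link: https://arxiv.org/abs/2509.00576
Authors: Tao Jiang, Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Jianning Cui, Xiao Liu, Shuiqi Cheng, Jiyang Gao, Huazhe Xu, Hang Zhao
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Notes: this https URL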

Abstract:We present Galaxea Open-World Dataset, a large-scale, diverse collection of robot behaviors recorded in authentic human living and working environments. All demonstrations are gathered using a consistent robotic embodiment, paired with precise subtask-level language annotations to facilitate both training and evaluation. Building on this dataset, we introduce G0, a dual-system framework that couples a Vision-Language Model (VLM) for multimodal planning with a Vision-Language-Action (VLA) model for fine-grained execution. G0 is trained using a three-stage curriculum: cross-embodiment pre-training, single-embodiment pre-training, and task-specific post-training. A comprehensive benchmark spanning tabletop manipulation, few-shot learning, and long-horizon mobile manipulation, demonstrates the effectiveness of our approach. In particular, we find that the single-embodiment pre-training stage, together with the Galaxea Open-World Dataset, plays a critical role in achieving strong performance.

[CV-182] Reinforcement Learning of Dolly-In Filming Using a Ground-Based Robot IROS2024

[Quick read]: This paper addresses automated camera control for free-roaming dollies in filmmaking, in particular how to execute precise dolly-in shots. The key to the solution is Reinforcement Learning (RL): the authors build a robust RL control pipeline and validate it against a traditional Proportional-Derivative (PD) controller. The approach outperforms the PD controller in simulation and proves effective in real-world tests on a modified ROSBot 2.0 platform equipped with a camera turret, showing that combined control is feasible and practical for precise filming tasks and setting the stage for further research on complex filming scenarios.

Link: https://arxiv.org/abs/2509.00564
Authors: Philip Lorimer, Jack Saunders, Alan Hunter, Wenbin Li
Affiliations: University of Bath
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: Authors' accepted manuscript (IROS 2024, Abu Dhabi, Oct 14-18, 2024). Please cite the version of record: DOI https://doi.org/10.1109/IROS58592.2024.10802717 . 8 pages

Abstract:Free-roaming dollies enhance filmmaking with dynamic movement, but challenges in automated camera control remain unresolved. Our study advances this field by applying Reinforcement Learning (RL) to automate dolly-in shots using free-roaming ground-based filming robots, overcoming traditional control hurdles. We demonstrate the effectiveness of combined control for precise film tasks by comparing it to independent control strategies. Our robust RL pipeline surpasses traditional Proportional-Derivative controller performance in simulation and proves its efficacy in real-world tests on a modified ROSBot 2.0 platform equipped with a camera turret. This validates our approach’s practicality and sets the stage for further research in complex filming scenarios, contributing significantly to the fusion of technology with cinematic creativity. This work presents a leap forward in the field and opens new avenues for research and development, effectively bridging the gap between technological advancement and creative filmmaking.
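
The baseline the RL pipeline is compared against is a Proportional-Derivative controller. A minimal discrete-time version is sketched below for reference; the gains, the finite-difference derivative, and the class name are generic textbook choices, not the configuration used in the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PDController:
    """Discrete PD control law: u = kp * e + kd * de/dt."""
    kp: float
    kd: float
    prev_error: Optional[float] = None

    def step(self, error: float, dt: float) -> float:
        # No derivative term on the first call, since there is no previous error yet.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.kd * derivative
```

For a dolly-in shot, `error` could be the gap between the desired and measured subject distance, with the output used as a velocity command; that mapping is an assumption for illustration.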

[CV-183] Integrated Multivariate Segmentation Tree for the Analysis of Heterogeneous Credit Data in Small and Medium-Sized Enterprises

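[Quick read]: This paper addresses the limitations of traditional decision tree models on high-dimensional data, especially their inability to incorporate textual information effectively, in order to improve credit evaluation for small and medium-sized enterprises (SMEs). The key to the solution is the Integrated Multivariate Segmentation Tree (IMST) framework, which has three core steps: first, transform textual data into numerical matrices through matrix factorization; second, select salient financial features with Lasso regression; third, build a multivariate segmentation tree based on the Gini index or entropy, with weakest-link pruning to control model complexity. The method reaches 88.9% accuracy while retaining good interpretability and computational efficiency.

Link: https://arxiv.org/abs/2509.00550
Authors: Lu Han, Xiuying Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Notes: 26 pages, 11 figures, 5 tables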

Abstract:Traditional decision tree models, which rely exclusively on numerical variables, often encounter difficulties in handling high-dimensional data and fail to effectively incorporate textual information. To address these limitations, we propose the Integrated Multivariate Segmentation Tree (IMST), a comprehensive framework designed to enhance credit evaluation for small and medium-sized enterprises (SMEs) by integrating financial data with textual sources. The methodology comprises three core stages: (1) transforming textual data into numerical matrices through matrix factorization; (2) selecting salient financial features using Lasso regression; and (3) constructing a multivariate segmentation tree based on the Gini index or Entropy, with weakest-link pruning applied to regulate model complexity. Experimental results derived from a dataset of 1,428 Chinese SMEs demonstrate that IMST achieves an accuracy of 88.9%, surpassing baseline decision trees (87.4%) as well as conventional models such as logistic regression and support vector machines (SVM). Furthermore, the proposed model exhibits superior interpretability and computational efficiency, featuring a more streamlined architecture and enhanced risk detection capabilities.
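
The third stage chooses splits by the Gini index or entropy. The sketch below shows the standard definitions of both impurity measures and the impurity reduction of a candidate split; it covers only this splitting criterion, not IMST's multivariate splits or its pruning, and the function names are illustrative.

```python
from collections import Counter
from math import log2
from typing import Callable, Sequence


def gini(labels: Sequence) -> float:
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels: Sequence) -> float:
    """Shannon entropy in bits of the class distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def split_gain(parent: Sequence, left: Sequence, right: Sequence,
               impurity: Callable[[Sequence], float] = gini) -> float:
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = len(left) / n * impurity(left) + len(right) / n * impurity(right)
    return impurity(parent) - weighted
```

A tree builder evaluates `split_gain` for each candidate split and keeps the one with the largest gain; passing `impurity=entropy` turns the same function into information gain.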

[CV-184] A Modality-agnostic Multi-task Foundation Model for Human Brain Imaging

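[Quick read]: This paper addresses the poor generalization of learning-based methods across medical imaging modalities, especially in uncalibrated magnetic resonance (MR) imaging, where differences in contrast, resolution, and orientation degrade performance and limit use across diverse clinical protocols. The key to the solution is BrainFM, a modality-agnostic, multi-task vision foundation model trained with "mild-to-severe" intra-subject generation and "real-synth" mix-up strategies. These make the model robust to variations in image appearance (e.g., modality, contrast, deformation, resolution, artifacts), so it can be applied directly to five fundamental brain imaging tasks: image synthesis for CT and T1w/T2w/FLAIR MRI, anatomy segmentation, scalp-to-cortical distance estimation, bias field estimation, and registration.

Link: https://arxiv.org/abs/2509.00549
Authors: Peirong Liu, Oula Puonti, Xiaoling Hu, Karthik Gopinath, Annabel Sorby-Adams, Daniel C. Alexander, W. Taylor Kimberly, Juan E. Iglesias
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 16 pages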

Abstract:Recent learning-based approaches have made astonishing advances in calibrated medical imaging like computerized tomography (CT), yet they struggle to generalize in uncalibrated modalities – notably magnetic resonance (MR) imaging, where performance is highly sensitive to the differences in MR contrast, resolution, and orientation. This prevents broad applicability to diverse real-world clinical protocols. Here we introduce BrainFM, a modality-agnostic, multi-task vision foundation model for human brain imaging. With the proposed “mild-to-severe” intra-subject generation and “real-synth” mix-up training strategy, BrainFM is resilient to the appearance of acquired images (e.g., modality, contrast, deformation, resolution, artifacts), and can be directly applied to five fundamental brain imaging tasks, including image synthesis for CT and T1w/T2w/FLAIR MRI, anatomy segmentation, scalp-to-cortical distance, bias field estimation, and registration. We evaluate the efficacy of BrainFM on eleven public datasets, and demonstrate its robustness and effectiveness across all tasks and input modalities. Code is available at this https URL.

[CV-185] LatentEdit: Adaptive Latent Control for Consistent Semantic Editing

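[Quick read]: This paper addresses the difficulty diffusion models have in image editing of achieving high-quality edits and preserving background similarity while remaining fast and memory-efficient. The key to the solution is the LatentEdit framework, which adaptively fuses the current latent code with a reference latent code inverted from the source image: it selectively preserves source features in semantically important, high-similarity regions and generates new content guided by the target prompt elsewhere, enabling fine-grained, controllable editing. The method requires no changes to the model internals and no complex attention mechanisms, is lightweight and plug-and-play, and is compatible with both UNet-based and DiT-based architectures. On the PIE-Bench dataset it achieves an optimal balance between fidelity and editability, outperforming the state-of-the-art method even with only 8 to 15 steps.

Link: https://arxiv.org/abs/2509.00541
Authors: Siyi Liu, Weiming Chen, Yushun Tang, Zhihai He
Affiliations: Unknown
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted by PRCV 2025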

Abstract:Diffusion-based Image Editing has achieved significant success in recent years. However, it remains challenging to achieve high-quality image editing while maintaining the background similarity without sacrificing speed or memory efficiency. In this work, we introduce LatentEdit, an adaptive latent fusion framework that dynamically combines the current latent code with a reference latent code inverted from the source image. By selectively preserving source features in high-similarity, semantically important regions while generating target content in other regions guided by the target prompt, LatentEdit enables fine-grained, controllable editing. Critically, the method requires no internal model modifications or complex attention mechanisms, offering a lightweight, plug-and-play solution compatible with both UNet-based and DiT-based architectures. Extensive experiments on the PIE-Bench dataset demonstrate that our proposed LatentEdit achieves an optimal balance between fidelity and editability, outperforming the state-of-the-art method even in 8-15 steps. Additionally, its inversion-free variant further halves the number of neural function evaluations and eliminates the need for storing any intermediate variables, substantially enhancing real-time deployment efficiency.
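
The abstract does not give the fusion rule, so the following is only a rough illustration of the general idea of similarity-gated latent fusion: where the current latent still resembles the reference latent, keep the reference (source) features; elsewhere keep the current latent. The per-location cosine similarity, the hard threshold, and the function name are all assumptions, and the "semantically important" criterion from the abstract is not modeled here.

```python
import numpy as np


def fuse_latents(current: np.ndarray, reference: np.ndarray,
                 threshold: float = 0.8) -> np.ndarray:
    """Blend two latents of shape (channels, height, width).

    Hypothetical rule: at spatial positions whose channel vectors have cosine
    similarity >= threshold, take the reference latent; otherwise keep current.
    """
    dot = (current * reference).sum(axis=0)
    norms = np.linalg.norm(current, axis=0) * np.linalg.norm(reference, axis=0) + 1e-8
    similarity = dot / norms                        # (height, width)
    keep_source = (similarity >= threshold)[None]   # broadcast over channels
    return np.where(keep_source, reference, current)
```

A soft, continuous mask would be a natural alternative to the hard threshold used in this sketch.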

[CV-186] Learning Yourself: Class-Incremental Semantic Segmentation with Language-Inspired Bootstrapped Disentanglement ICCV2025

[Quick read]: This paper addresses catastrophic semantic entanglement in Class-Incremental Semantic Segmentation (CISS), which arises from continually learning new classes; its core challenges are Prototype-Feature Entanglement and Background-Increment Entanglement. The key to the solution is the Language-inspired Bootstrapped Disentanglement (LBD) framework, which uses the prior class semantics of pre-trained vision-language models (e.g., CLIP) to drive two disentanglement mechanisms: Language-guided Prototypical Disentanglement, which treats hand-crafted text features as topological templates, and Manifold Mutual Background Disentanglement, which uses multiple learnable prototypes with mask-pooling-based supervision. In addition, soft prompt tuning and encoder adaptation modifications bridge CLIP's capability gap between dense and sparse tasks and markedly improve segmentation accuracy in multi-step scenarios.

Link: https://arxiv.org/abs/2509.00527
Authors: Ruitao Wu, Yifan Zhao, Jia Li
Affiliations: Beihang University; Zhongguancun Academy
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted by ICCV 2025

Abstract:Class-Incremental Semantic Segmentation (CISS) requires continuous learning of newly introduced classes while retaining knowledge of past classes. By abstracting mainstream methods into two stages (visual feature extraction and prototype-feature matching), we identify a more fundamental challenge termed catastrophic semantic entanglement. This phenomenon involves Prototype-Feature Entanglement caused by semantic misalignment during the incremental process, and Background-Increment Entanglement due to dynamic data evolution. Existing techniques, which rely on visual feature learning without sufficient cues to distinguish targets, introduce significant noise and errors. To address these issues, we introduce a Language-inspired Bootstrapped Disentanglement framework (LBD). We leverage the prior class semantics of pre-trained visual-language models (e.g., CLIP) to guide the model in autonomously disentangling features through Language-guided Prototypical Disentanglement and Manifold Mutual Background Disentanglement. The former guides the disentangling of new prototypes by treating hand-crafted text features as topological templates, while the latter employs multiple learnable prototypes and mask-pooling-based supervision for background-incremental class disentanglement. By incorporating soft prompt tuning and encoder adaptation modifications, we further bridge the capability gap of CLIP between dense and sparse tasks, achieving state-of-the-art performance on both Pascal VOC and ADE20k, particularly in multi-step scenarios.

[CV-187] Make me an Expert: Distilling from Generalist Black-Box Models into Specialized Models for Semantic Segmentation

[Quick read]: This paper addresses how to train local models effectively under black-box constraints, where model weights, training data, and logits are all inaccessible and conventional domain adaptation methods become impractical. The key to the solution is the Black-Box Distillation (B2D) setting together with the ATtention-Guided sCaler (ATGC) method: ATGC uses DINOv2 attention maps to dynamically select the best input resolution for inference and scores those maps by entropy to identify informative scales for pseudo-labelling, enabling effective distillation. Experiments show substantial gains on multiple datasets while requiring only one-hot predictions.

Link: https://arxiv.org/abs/2509.00509
Authors: Yasser Benigmim, Subhankar Roy, Khalid Oublal, Imad Eddine Marouf, Slim Essid, Vicky Kalogeiton, Stéphane Lathuilière
Affiliations: Télécom-Paris, Institut Polytechnique de Paris; CNRS; University of Bergamo; NVIDIA; Inria at University Grenoble Alpes, LJK
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: GitHub repo: this https URL

Abstract:The rise of Artificial Intelligence as a Service (AIaaS) democratizes access to pre-trained models via Application Programming Interfaces (APIs), but also raises a fundamental question: how can local models be effectively trained using black-box models that do not expose their weights, training data, or logits, a constraint under which current domain adaptation paradigms are impractical? To address this challenge, we introduce the Black-Box Distillation (B2D) setting, which enables local model adaptation under realistic constraints: (1) the API model is open-vocabulary and trained on large-scale general-purpose data, and (2) access is limited to one-hot predictions only. We identify that open-vocabulary models exhibit significant sensitivity to input resolution, with different object classes being segmented optimally at different scales, a limitation termed the “curse of resolution”. Our method, ATtention-Guided sCaler (ATGC), addresses this challenge by leveraging DINOv2 attention maps to dynamically select optimal scales for black-box model inference. ATGC scores the attention maps with entropy to identify informative scales for pseudo-labelling, enabling effective distillation. Experiments demonstrate substantial improvements under black-box supervision across multiple datasets while requiring only one-hot API predictions. Our code is available at this https URL.
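
The entropy-scoring step mentioned in the abstract can be illustrated with a minimal sketch in plain Python. The attention maps, the scale keys, and the rule that a lower-entropy (more peaked) map counts as more informative are assumptions made for illustration; the abstract does not state the selection direction, and this is not the authors' implementation.

```python
import math


def attention_entropy(attn):
    """Shannon entropy (in nats) of a non-negative attention map given as a flat list."""
    total = sum(attn)
    if total <= 0:
        raise ValueError("attention map must have positive mass")
    probs = [a / total for a in attn]
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_scale(attn_by_scale):
    """Return the scale whose attention map has the lowest entropy.

    Assumption: a more peaked (lower-entropy) map is treated as more informative.
    """
    return min(attn_by_scale, key=lambda s: attention_entropy(attn_by_scale[s]))


# Hypothetical flattened attention maps at three input scales.
attn_by_scale = {
    0.5: [0.25, 0.25, 0.25, 0.25],  # uniform, so entropy is maximal
    1.0: [0.70, 0.10, 0.10, 0.10],  # peaked
    2.0: [0.40, 0.30, 0.20, 0.10],
}
best_scale = select_scale(attn_by_scale)
```

In a real pipeline the maps would come from a DINOv2 forward pass at each resolution; here they are hard-coded so the scoring logic can be read in isolation.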

[CV-188] TRUST: Token-dRiven Ultrasound Style Transfer for Cross-Device Adaptation

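[Quick read]: This paper addresses the drop in downstream task performance caused by imaging-style differences between ultrasound devices. Existing unpaired image-to-image (UI2I) translation methods do not explicitly filter the style features most relevant to the downstream task, so translated images may not match what the task needs. The key to the solution is TRUST, a token-driven dual-stream framework that separates content and style representations so the two are not blended. Its core contributions are: (1) a Token-dRiven (TR) module that, from a data view, selects target tokens corresponding to each source token and, from a model view, uses a behavior mirror loss to identify the target tokens best suited to the downstream model; and (2) auxiliary prompts injected into the source encoder to align content representation with downstream behavior. Experiments on ultrasound datasets show that TRUST outperforms existing UI2I methods in both visual quality and downstream task performance.

Link: https://arxiv.org/abs/2509.00508
Authors: Nhat-Tuong Do-Tran, Ngoc-Hoang-Lam Le, Ian Chiu, Po-Tsun Paul Kuo, Ching-Chun Huang
Affiliations: National Yang Ming Chiao Tung University; Advantech Company; Advanced Operational Development; Delta Electronics Inc
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to APSIPA ASC 2025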

Abstract:Ultrasound images acquired from different devices exhibit diverse styles, resulting in decreased performance of downstream tasks. To mitigate the style gap, unpaired image-to-image (UI2I) translation methods aim to transfer images from a source domain, corresponding to new device acquisitions, to a target domain where a frozen task model has been trained for downstream applications. However, existing UI2I methods have not explicitly considered filtering the most relevant style features, which may result in translated images misaligned with the needs of downstream tasks. In this work, we propose TRUST, a token-driven dual-stream framework that preserves source content while transferring the common style of the target domain, ensuring that content and style remain unblended. Given multiple styles in the target domain, we introduce a Token-dRiven (TR) module that operates from two perspectives: (1) a data view, selecting “suitable” target tokens corresponding to each source token, and (2) a model view, identifying “optimal” target tokens for the downstream model, guided by a behavior mirror loss. Additionally, we inject auxiliary prompts into the source encoder to match content representation with downstream behavior. Experimental results on ultrasound datasets demonstrate that TRUST outperforms existing UI2I methods in both visual quality and downstream task performance.

[CV-189] FLUID: A Fine-Grained Lightweight Urban Signalized-Intersection Dataset of Dense Conflict Trajectories

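[Quick read]: This paper addresses shortcomings of existing traffic participant (TP) trajectory data in scene representativeness, information richness, and data fidelity, in support of traffic assessment and policy optimization at urban intersections. The key to the solution is FLUID, which comprises a fine-grained trajectory dataset and a lightweight, full-pipeline framework for drone-based trajectory processing. The dataset covers three typical types of urban signalized intersection, with about 5 hours of recording and more than 20,000 TPs across 8 categories; it averages two vehicle conflicts per minute, involving roughly 25% of all motor vehicles, and provides trajectories, traffic signals, maps, and raw videos. Comparison with the DataFromSky platform and ground-truth measurements validates its high spatio-temporal accuracy, making it a high-quality benchmark for human preference mining, traffic behavior modeling, and autonomous driving research.

Link: https://arxiv.org/abs/2509.00497
Authors: Yiyang Chen, Zhigang Wu, Guohong Zheng, Xuesong Wu, Liwen Xu, Haoyuan Tang, Zhaocheng He, Haipeng Zeng
Affiliations: Sun Yat-sen University; Guangdong Provincial Key Laboratory of Intelligent Transportation Systems; Pengcheng Laboratory
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages, 14 figures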

Abstract:The trajectory data of traffic participants (TPs) is a fundamental resource for evaluating traffic conditions and optimizing policies, especially at urban intersections. Although data acquisition using drones is efficient, existing datasets still have limitations in scene representativeness, information richness, and data fidelity. This study introduces FLUID, comprising a fine-grained trajectory dataset that captures dense conflicts at typical urban signalized intersections, and a lightweight, full-pipeline framework for drone-based trajectory processing. FLUID covers three distinct intersection types, with approximately 5 hours of recording time and featuring over 20,000 TPs across 8 categories. Notably, the dataset averages two vehicle conflicts per minute, involving roughly 25% of all motor vehicles. FLUID provides comprehensive data, including trajectories, traffic signals, maps, and raw videos. Comparison with the DataFromSky platform and ground-truth measurements validates its high spatio-temporal accuracy. Through a detailed classification of motor vehicle conflicts and violations, FLUID reveals a diversity of interactive behaviors, demonstrating its value for human preference mining, traffic behavior modeling, and autonomous driving research.

[CV-190] Multi-Focused Video Group Activities Hashing

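[Quick read]: This paper addresses fast retrieval of group activities amid the explosive growth of video data in complex scenes; existing methods typically retrieve only at the level of whole videos and cannot match at activity granularity. The key to the solution is a new spatiotemporal interleaved video hashing (STVH) technique that, within a unified framework, models individual object dynamics and group interactions simultaneously, capturing the spatiotemporal evolution of group visual and positional features. Because practical retrieval sometimes emphasizes activity semantics and sometimes object appearance, the authors further propose an enhanced multi-focused spatiotemporal video hashing (M-STVH), which uses hierarchical feature integration for multi-focused representation learning so the model can attend jointly to activity semantic features and object visual features.

Link: https://arxiv.org/abs/2509.00490
Authors: Zhongmiao Qi, Yan Jiang, Bolin Zhang, Lijun Guo, Chong Wang, Qiangbo Qian
Affiliations: Ningbo University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: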

Abstract:With the explosive growth of video data in various complex scenarios, quickly retrieving group activities has become an urgent problem. However, many tasks can only retrieve videos focusing on an entire video, not the activity granularity. To solve this problem, we propose a new STVH (spatiotemporal interleaved video hashing) technique for the first time. Through a unified framework, the STVH simultaneously models individual object dynamics and group interactions, capturing the spatiotemporal evolution on both group visual features and positional features. Moreover, in real-life video retrieval scenarios, it may sometimes require activity features, while at other times, it may require visual features of objects. We then further propose a novel M-STVH (multi-focused spatiotemporal video hashing) as an enhanced version to handle this difficult task. The advanced method incorporates hierarchical feature integration through multi-focused representation learning, allowing the model to jointly focus on activity semantics features and object visual features. We conducted comparative experiments on publicly available datasets, and both STVH and M-STVH can achieve excellent results.

[CV-191] VideoRewardBench: Comprehensive Evaluation of Multimodal Reward Models for Video Understanding

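[Quick read]: This paper addresses three problems in existing benchmarks for multimodal reward models (MRMs) in the video domain: too few and insufficiently diverse questions, incomplete evaluation dimensions, and inadequate coverage of different MRM types. In response, the authors propose VideoRewardBench, the first comprehensive benchmark covering four core aspects of video understanding: perception, knowledge, reasoning, and safety. The key to the solution is a high-quality preference dataset (1,563 annotated samples spanning 1,482 unique videos and 1,559 distinct questions) built with an AI-assisted data pipeline, plus a systematic evaluation of 28 MRMs of three types (generative, discriminative, and semi-scalar) that reveals how reinforcement learning, inference-time scaling, and input frame count affect each type of model.

Link: https://arxiv.org/abs/2509.00484
Authors: Zhihong Zhang, Xiaojian Huang, Jin Xu, Zhuodong Luo, Xinzhi Wang, Jiansheng Wei, Xuejin Chen
Affiliations: University of Science and Technology of China; Huawei Noah’s Ark Lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: this https URL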

Abstract:Multimodal reward models (MRMs) play a crucial role in the training, inference, and evaluation of Large Vision Language Models (LVLMs) by assessing response quality. However, existing benchmarks for evaluating MRMs in the video domain suffer from a limited number and diversity of questions, a lack of comprehensive evaluation dimensions, and inadequate evaluation of diverse types of MRMs. To address these gaps, we introduce VideoRewardBench, the first comprehensive benchmark covering four core aspects of video understanding: perception, knowledge, reasoning, and safety. Through our AI-assisted data pipeline, we curate a high-quality preference dataset of 1,563 annotated samples, including 1,482 unique videos and 1,559 distinct questions, 15 times the number found in the most question-rich prior benchmark. Each sample is a triplet consisting of a video-text prompt, a chosen response, and a rejected response. We also conduct a comprehensive evaluation across 28 multimodal reward models spanning three categories: generative, discriminative, and semi-scalar. Results show that even the top-performing model GPT-4o achieves only 57.0% overall accuracy, and the state-of-the-art open-source model Qwen2.5-VL-72B reaches merely 53.3%. Our analysis further reveals three key insights: (i) MRMs trained with reinforcement learning (RL) do not necessarily exhibit stronger cross-modal generalization than those trained without RL; (ii) except for discriminative MRMs, other types of MRMs across varying model capacities can benefit from inference-time scaling; and (iii) variations in input video frame count have different effects on different types of MRMs. We believe VideoRewardBench offers a challenging and valuable benchmark for advancing the evaluation and development of MRMs in the video domain.
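
Since each benchmark sample is a (prompt, chosen, rejected) triplet, the reported accuracy can be read as the fraction of triplets on which a model prefers the chosen response. The sketch below assumes a scalar-scoring reward model; `toy_score`, the sample fields, and the tie-handling rule are hypothetical and are not taken from the benchmark's code.

```python
def preference_accuracy(samples, score):
    """Fraction of triplets where the chosen response outscores the rejected one.

    Ties are counted as errors (an assumption of this sketch).
    """
    if not samples:
        raise ValueError("need at least one sample")
    correct = sum(
        1
        for s in samples
        if score(s["prompt"], s["chosen"]) > score(s["prompt"], s["rejected"])
    )
    return correct / len(samples)


def toy_score(prompt, response):
    """Hypothetical stand-in for a reward model: longer responses score higher."""
    return len(response)


samples = [
    {"prompt": "q1", "chosen": "a detailed answer", "rejected": "short"},
    {"prompt": "q2", "chosen": "ok", "rejected": "a long but wrong answer"},
    {"prompt": "q3", "chosen": "grounded in the video", "rejected": "guess"},
]
accuracy = preference_accuracy(samples, toy_score)
```

Generative MRMs that output a preferred option directly would need a different wrapper, but the accuracy computation stays the same once each triplet is reduced to a right-or-wrong decision.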

[CV-192] Exploring Decision-Making Capabilities of LLM Agents: An Experimental Study on Jump-Jump Game

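[Quick read]: This paper addresses the limited performance of large language models (LLMs) on complex decision-making tasks, particularly in multi-faceted cognitive settings that combine spatial reasoning, physical modeling, and strategic planning. The key to the solution is to use the Jump-Jump game as a testbed: by simulating a dynamic environment that demands precise action control, it systematically evaluates how LLMs perform state-aware sequential decision-making and probes their potential to understand physical regularities and optimize jumping strategy.

Link: https://arxiv.org/abs/2509.00483
Authors: Juwu Li
Affiliations: Jiangxi Teachers College
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: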

Abstract:The Jump-Jump game, as a simple yet challenging casual game, provides an ideal testing environment for studying LLM decision-making capabilities. The game requires players to precisely control jumping force based on current position and target platform distance, involving multiple cognitive aspects including spatial reasoning, physical modeling, and strategic planning. It illustrates the basic gameplay mechanics of the Jump-Jump game, where the player character (red circle) must jump across platforms with appropriate force to maximize score.

[CV-193] Embodied Spatial Intelligence: from Implicit Scene Modeling to Spatial Reasoning

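[Quick read]: This thesis addresses the challenge of building robots that perceive and act in the real world from natural language instructions; the core problem is how to combine the capabilities of Large Language Models (LLMs) with physical embodiment. The solution advances on two fronts. First, it builds robust, scalable, and accurate scene representations using implicit neural models, covering self-supervised camera calibration, high-fidelity depth field generation, and large-scale reconstruction. Second, it strengthens the spatial reasoning of LLMs through a novel navigation benchmark, a method for grounding language in 3D, and a state-feedback mechanism, improving the accuracy and stability of long-horizon decision-making.

Link: https://arxiv.org/abs/2509.00465
Authors: Jiading Fang
Affiliations: Toyota Technological Institute at Chicago
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: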

Abstract:This thesis introduces “Embodied Spatial Intelligence” to address the challenge of creating robots that can perceive and act in the real world based on natural language instructions. To bridge the gap between Large Language Models (LLMs) and physical embodiment, we present contributions on two fronts: scene representation and spatial reasoning. For perception, we develop robust, scalable, and accurate scene representations using implicit neural models, with contributions in self-supervised camera calibration, high-fidelity depth field generation, and large-scale reconstruction. For spatial reasoning, we enhance the spatial capabilities of LLMs by introducing a novel navigation benchmark, a method for grounding language in 3D, and a state-feedback mechanism to improve long-horizon decision-making. This work lays a foundation for robots that can robustly perceive their surroundings and intelligently act upon complex, language-based commands.

[CV-194] Encoder-Only Image Registration

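[Quick read]: This paper addresses high computational complexity and the difficulty of handling large deformations in deformable image registration. The key to the solution is the Encoder-Only Image Registration (EOIR) framework, whose core idea is to decouple feature learning from flow estimation: only a 3-layer convolutional neural network (ConvNet) is used for feature extraction, and a set of 3-layer flow estimators builds a Laplacian feature pyramid that progressively composes diffeomorphic deformations under a large-deformation model. This design preserves registration accuracy while improving efficiency and deformation smoothness, yielding better accuracy-efficiency and accuracy-smoothness trade-offs.

Link: https://arxiv.org/abs/2509.00451
Authors: Xiang Chen, Renjiu Hu, Jinwei Zhang, Yuxi Zhang, Xinyao Yue, Min Liu, Yaonan Wang, Hang Zhang
Affiliations: Hunan University; Cornell University; Johns Hopkins University; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: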

Abstract:Learning-based techniques have significantly improved the accuracy and speed of deformable image registration. However, challenges such as reducing computational complexity and handling large deformations persist. To address these challenges, we analyze how convolutional neural networks (ConvNets) influence registration performance using the Horn-Schunck optical flow equation. Supported by prior studies and our empirical experiments, we observe that ConvNets play two key roles in registration: linearizing local intensities and harmonizing global contrast variations. Based on these insights, we propose the Encoder-Only Image Registration (EOIR) framework, designed to achieve a better accuracy-efficiency trade-off. EOIR separates feature learning from flow estimation, employing only a 3-layer ConvNet for feature extraction and a set of 3-layer flow estimators to construct a Laplacian feature pyramid, progressively composing diffeomorphic deformations under a large-deformation model. Results on five datasets across different modalities and anatomical regions demonstrate EOIR’s effectiveness, achieving superior accuracy-efficiency and accuracy-smoothness trade-offs. With comparable accuracy, EOIR provides better efficiency and smoothness, and vice versa. The source code of EOIR will be publicly available on this https URL.
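
To make "progressively composing deformations" concrete, here is a 1D sketch of composing two displacement fields with linear interpolation. It assumes the convention φ(x) = x + u(x), so that (φ_a ∘ φ_b)(x) = x + u_b(x) + u_a(x + u_b(x)). The function names and sample fields are hypothetical, the sketch does not enforce diffeomorphism, and it is not the EOIR implementation.

```python
def sample_linear(field, x):
    """Linearly interpolate a 1D field at a real-valued position x (clamped to the grid)."""
    x = min(max(x, 0.0), len(field) - 1.0)
    i = min(int(x), len(field) - 2)
    frac = x - i
    return (1 - frac) * field[i] + frac * field[i + 1]


def compose(u_a, u_b):
    """Displacement of phi_a composed with phi_b, where phi(x) = x + u(x).

    For each grid point x: u(x) = u_b(x) + u_a(x + u_b(x)).
    """
    return [u_b[x] + sample_linear(u_a, x + u_b[x]) for x in range(len(u_b))]


# Hypothetical fields: a coarse-level displacement that grows with x,
# followed by a uniform one-pixel shift estimated at a finer level.
u_coarse = [0.0, 0.1, 0.2, 0.3, 0.4]
u_fine = [1.0, 1.0, 1.0, 1.0, 1.0]
composed = compose(u_coarse, u_fine)
```

Composing fields this way, rather than simply adding them, keeps each level's estimate defined in the coordinate frame produced by the previous level, which is what lets small per-level flows accumulate into a large deformation.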

[CV-195] Stage-wise Adaptive Label Distribution for Facial Age Estimation

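[Quick read]: This paper addresses label ambiguity in age estimation, in particular the fact that existing methods usually ignore how the degree of ambiguity varies across age stages. The key to the solution is the Stage-wise Adaptive Label Distribution Learning (SA-LDL) algorithm. Its central observation, drawn from analyzing embedding similarities between an anchor age and all other ages, is that label ambiguity follows clear stage-wise patterns. By jointly using stage-wise adaptive variance modeling and a weighted loss function, SA-LDL captures the complex, structured nature of label ambiguity and improves the accuracy and robustness of age estimation.

Link: https://arxiv.org/abs/2509.00450
Authors: Bo Wu, Zhiqi Ai, Jun Jiang, Congcong Zhu, Shugong Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 3 figures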

Abstract:Label ambiguity poses a significant challenge in age estimation tasks. Most existing methods address this issue by modeling correlations between adjacent age groups through label distribution learning. However, they often overlook the varying degrees of ambiguity present across different age stages. In this paper, we propose a Stage-wise Adaptive Label Distribution Learning (SA-LDL) algorithm, which leverages the observation – revealed through our analysis of embedding similarities between an anchor and all other ages – that label ambiguity exhibits clear stage-wise patterns. By jointly employing stage-wise adaptive variance modeling and weighted loss function, SA-LDL effectively captures the complex and structured nature of label ambiguity, leading to more accurate and robust age estimation. Extensive experiments demonstrate that SA-LDL achieves competitive performance, with MAE of 1.74 and 2.15 on the MORPH-II and FG-NET datasets.
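
A rough illustration of stage-dependent label distributions: the true age is replaced by a discretized Gaussian whose standard deviation depends on the age stage. The Gaussian form, the stage boundaries, and the sigma values are all assumptions chosen for illustration; the abstract does not specify how SA-LDL models or learns the variance.

```python
import math


def stage_sigma(age):
    """Hypothetical stage-wise standard deviation: ambiguity grows with age stage."""
    if age < 18:
        return 1.0
    if age < 50:
        return 2.0
    return 3.0


def label_distribution(true_age, max_age=100):
    """Discretized Gaussian over ages 0..max_age, centred on true_age."""
    sigma = stage_sigma(true_age)
    weights = [
        math.exp(-((a - true_age) ** 2) / (2 * sigma ** 2))
        for a in range(max_age + 1)
    ]
    z = sum(weights)
    return [w / z for w in weights]


def expected_age(dist):
    """Mean of the label distribution, a common point estimate in LDL."""
    return sum(a * p for a, p in enumerate(dist))


child = label_distribution(10)   # narrow: sigma = 1.0
senior = label_distribution(60)  # wide: sigma = 3.0
```

The narrower distribution for younger ages encodes the idea that neighbouring ages are easier to tell apart there; a learned, data-driven variance would replace the fixed `stage_sigma` table.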

[CV-196] SemaMIL: Semantic Reordering with Retrieval-Guided State Space Modeling for Whole Slide Image Classification

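[Quick read]: This paper addresses two core problems of multiple instance learning (MIL) on whole slide images (WSIs) in computational pathology: attention-based methods can identify key regions but overlook histological context, while Transformer models capture global interactions but are hard to apply efficiently because of quadratic computational cost and overfitting risk. The key to the solution is the SemaMIL framework, built on two techniques. Semantic Reordering (SR) clusters semantically similar patches and arranges them in sequence through a reversible permutation, preserving histological structure. The Semantic-guided Retrieval State Space Module (SRSM) adaptively selects representative queries to adjust state space parameters, achieving better global modeling at linear complexity. Experiments on four WSI subtype datasets show that the method outperforms strong baselines with fewer parameters and FLOPs.

Link: https://arxiv.org/abs/2509.00442
Authors: Lubin Gan, Xiaoman Wu, Jing Zhang, Zhifeng Wang, Linhao Qu, Siying Wu, Xiaoyan Sun
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: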

Abstract:Multiple instance learning (MIL) has become the leading approach for extracting discriminative features from whole slide images (WSIs) in computational pathology. Attention-based MIL methods can identify key patches but tend to overlook contextual relationships. Transformer models are able to model interactions but require quadratic computational cost and are prone to overfitting. State space models (SSMs) offer linear complexity, yet shuffling patch order disrupts histological meaning and reduces interpretability. In this work, we introduce SemaMIL, which integrates Semantic Reordering (SR), an adaptive method that clusters and arranges semantically similar patches in sequence through a reversible permutation, with a Semantic-guided Retrieval State Space Module (SRSM) that chooses a representative subset of queries to adjust state space parameters for improved global modeling. Evaluation on four WSI subtype datasets shows that, compared to strong baselines, SemaMIL achieves state-of-the-art accuracy with fewer FLOPs and parameters.
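
The "reversible permutation" idea can be sketched as a stable sort by cluster label plus its inverse. The cluster labels below are hard-coded and hypothetical; how SemaMIL actually clusters patches and orders clusters is not described in the abstract, so this only illustrates the reorder-and-restore mechanics.

```python
def semantic_reorder(cluster_ids):
    """Return (perm, inverse) where perm groups indices by cluster label.

    sorted() is stable, so patches keep their original relative order
    within each cluster.
    """
    perm = sorted(range(len(cluster_ids)), key=lambda i: cluster_ids[i])
    inverse = [0] * len(perm)
    for new_pos, old_pos in enumerate(perm):
        inverse[old_pos] = new_pos
    return perm, inverse


def apply_permutation(items, perm):
    """Reorder items so that position k holds items[perm[k]]."""
    return [items[i] for i in perm]


patches = ["p0", "p1", "p2", "p3", "p4"]
cluster_ids = [2, 0, 2, 1, 0]  # hypothetical semantic cluster per patch
perm, inverse = semantic_reorder(cluster_ids)
reordered = apply_permutation(patches, perm)
restored = apply_permutation(reordered, inverse)
```

Keeping the inverse permutation means any per-patch output computed on the reordered sequence (for example, attention or importance scores) can be mapped back to the original slide layout.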

[CV-197] Mixture of Global and Local Experts with Diffusion Transformer for Controllable Face Generation

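[Quick read]: This paper addresses the difficulty of balancing semantic controllability and photorealism in face generation, and in particular the inability of existing methods to disentangle semantic control signals from the generation pipeline. The proposed Face-MoGLE framework has three key components: (1) semantic-decoupled latent modeling through mask-conditioned space factorization, enabling precise attribute manipulation; (2) a mixture of global and local experts that captures both holistic structure and region-level semantics for fine-grained control; and (3) a dynamic gating network that produces time-dependent coefficients varying with diffusion step and spatial location, making generation more adaptive and flexible. The architecture improves the quality and diversity of controllable face generation and shows strong zero-shot generalization.

Link: https://arxiv.org/abs/2509.00428
Authors: Xuechao Zou, Shun Zhang, Xing Fu, Yue Li, Kai Li, Yushe Cao, Congyan Lang, Pin Tao, Junliang Xing
Affiliations: Beijing Jiaotong University; Ant Group; Qinghai University; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 11 figures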

Abstract:Controllable face generation poses critical challenges in generative modeling due to the intricate balance required between semantic controllability and photorealism. While existing approaches struggle with disentangling semantic controls from generation pipelines, we revisit the architectural potential of Diffusion Transformers (DiTs) through the lens of expert specialization. This paper introduces Face-MoGLE, a novel framework featuring: (1) Semantic-decoupled latent modeling through mask-conditioned space factorization, enabling precise attribute manipulation; (2) A mixture of global and local experts that captures holistic structure and region-level semantics for fine-grained controllability; (3) A dynamic gating network producing time-dependent coefficients that evolve with diffusion steps and spatial locations. Face-MoGLE provides a powerful and flexible solution for high-quality, controllable face generation, with strong potential in generative modeling and security applications. Extensive experiments demonstrate its effectiveness in multimodal and monomodal face generation settings and its robust zero-shot generalization capability. Project page is available at this https URL.

[CV-198] LightVLM: Acceleraing Large Multimodal Models with Pyramid Token Merging and KV Cache Compression EMNLP2025

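[Quick read]: This paper addresses inefficient inference in Vision-Language Models (VLMs), where high latency and low throughput limit real-world deployment. The key to the solution is to split VLM inference into an encoding stage and a decoding stage and optimize each. During encoding, pyramid token merging retains dominant tokens layer by layer, sharply reducing the number of image tokens (about 98% of performance is kept with only 3% of image tokens). During decoding, KV Cache compression removes redundant caches, raising network throughput by up to 2.02× and cutting prefilling time by up to 3.65×. This training-free framework lets a large VLM (e.g., InternVL2.5 26B) run inference faster than a much smaller one (e.g., InternVL2.5 8B).

Link: https://arxiv.org/abs/2509.00419
Authors: Lianyu Hu, Fanhua Shang, Wei Feng, Liang Wan
Affiliations: College of Intelligence and Computing, Tianjin University, Tianjin, 300350, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: EMNLP 2025 Findings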

Abstract:In this paper, we introduce LightVLM, a simple but effective method that can be seamlessly deployed upon existing Vision-Language Models (VLMs) to greatly accelerate the inference process in a training-free manner. We divide the inference procedure of VLMs into two stages, i.e., encoding and decoding, and propose to simultaneously accelerate VLMs in both stages to largely improve model efficiency. During encoding, we propose pyramid token merging to reduce tokens of different LLM layers in a hierarchical manner by finally only keeping a few dominant tokens to achieve high efficiency. During decoding, aimed at reducing the high latency of outputting long sequences, we propose KV Cache compression to remove unnecessary caches to increase the network throughput. Experimental results show that LightVLM successfully retains 100% performance when only preserving 35% of image tokens, and maintains around 98% performance when keeping only 3% of image tokens. LightVLM can increase network throughput by 2.02× and reduce the prefilling time by 3.65×. LightVLM also makes large VLMs faster again by enabling a heavy model (e.g., InternVL2.5 26B) to infer faster than significantly smaller models (e.g., InternVL2.5 8B), hopefully facilitating real-world deployment. When generating long text sequences (e.g., 4096 tokens), LightVLM can reduce the inference time by 3.21×, largely outperforming existing methods.
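
The hierarchical reduction can be pictured as repeatedly keeping only the highest-scoring tokens at successive stages. The sketch below is a simplification under stated assumptions: it selects tokens rather than merging them, uses one fixed importance score per token instead of recomputing scores at each layer, and the stage ratios are made up. It is not the LightVLM algorithm.

```python
def keep_dominant(tokens, scores, keep_ratio):
    """Keep the top keep_ratio fraction of tokens by score, preserving sequence order."""
    k = max(1, round(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    top.sort()  # restore original positional order
    return [tokens[i] for i in top], [scores[i] for i in top]


def pyramid_reduce(tokens, scores, stage_ratios):
    """Apply keep_dominant once per stage and record the token count after each stage."""
    counts = [len(tokens)]
    for ratio in stage_ratios:
        tokens, scores = keep_dominant(tokens, scores, ratio)
        counts.append(len(tokens))
    return tokens, counts


# Hypothetical setup: 100 image tokens with distinct pseudo-random importance scores.
tokens = list(range(100))
scores = [(i * 37) % 101 for i in tokens]
final_tokens, counts = pyramid_reduce(tokens, scores, [0.5, 0.5, 0.4])
```

With these ratios the token count shrinks from 100 to 50, 25, and finally 10, which mirrors the pyramid shape: later layers see far fewer image tokens than earlier ones.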

[CV-199] DevilSight: Augmenting Monocular Human Avatar Reconstruction through a Virtual Perspective

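[Quick read]: This paper addresses two core problems in reconstructing human avatars from monocular video: failure to capture fine-grained dynamic details from the input, and implausible details at novel viewpoints. Both stem mainly from the limited representational capacity of the avatar model and insufficient observations. The key to the solution is to use the video generative model Human4DiT to synthesize the human motion from an alternative perspective as an additional supervision signal, which enriches details in previously unseen regions and regularizes the avatar representation to reduce artifacts. Two complementary strategies improve the generated video: video fine-tuning injects the physical identity into the model for consistent reproduction of the motion, and a patch-based denoising algorithm yields higher-resolution output with finer details.

Link: https://arxiv.org/abs/2509.00403
Authors: Yushuo Chen, Ruizhi Shao, Youxin Pang, Hongwen Zhang, Xinyi Wu, Rihui Wu, Yebin Liu
Affiliations: Tsinghua University; Beijing Normal University; Honor Device Co., Ltd
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: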

Abstract:We present a novel framework to reconstruct human avatars from monocular videos. Recent approaches have struggled either to capture the fine-grained dynamic details from the input or to generate plausible details at novel viewpoints, which mainly stem from the limited representational capacity of the avatar model and insufficient observational data. To overcome these challenges, we propose to leverage the advanced video generative model, Human4DiT, to generate the human motions from alternative perspective as an additional supervision signal. This approach not only enriches the details in previously unseen regions but also effectively regularizes the avatar representation to mitigate artifacts. Furthermore, we introduce two complementary strategies to enhance video generation: To ensure consistent reproduction of human motion, we inject the physical identity into the model through video fine-tuning. For higher-resolution outputs with finer details, a patch-based denoising algorithm is employed. Experimental results demonstrate that our method outperforms recent state-of-the-art approaches and validate the effectiveness of our proposed strategies.

[CV-200] DAOVI: Distortion-Aware Omnidirectional Video Inpainting BMVC2025

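[Quick read]: This paper addresses unwanted objects that appear in omnidirectional videos because of their wide field of view. Conventional video inpainting methods are designed for ordinary narrow-field-of-view video and do not account for the geometric distortion introduced by equirectangular projection. The key to the solution is a new deep learning model, Distortion-Aware Omnidirectional Video Inpainting (DAOVI), with two core components: (1) a module in image space that evaluates temporal motion information using geodesic distance, to better capture motion in omnidirectional video; and (2) a depth-aware feature propagation module in feature space that mitigates the geometric distortion caused by the projection, yielding spatially and temporally consistent inpainting.

Link: https://arxiv.org/abs/2509.00396
Authors: Ryosuke Seshimo, Mariko Isogawa
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: BMVC 2025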

Abstract:Omnidirectional videos that capture the entire surroundings are employed in a variety of fields such as VR applications and remote sensing. However, their wide field of view often causes unwanted objects to appear in the videos. This problem can be addressed by video inpainting, which enables the natural removal of such objects while preserving both spatial and temporal consistency. Nevertheless, most existing methods assume processing ordinary videos with a narrow field of view and do not tackle the distortion in equirectangular projection of omnidirectional videos. To address this issue, this paper proposes a novel deep learning model for omnidirectional video inpainting, called Distortion-Aware Omnidirectional Video Inpainting (DAOVI). DAOVI introduces a module that evaluates temporal motion information in the image space considering geodesic distance, as well as a depth-aware feature propagation module in the feature space that is designed to address the geometric distortion inherent to omnidirectional videos. The experimental results demonstrate that our proposed method outperforms existing methods both quantitatively and qualitatively.
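
Why geodesic distance matters for equirectangular frames can be seen with a small sketch: the same pixel offset corresponds to a much shorter arc on the sphere near the poles than at the equator. The pixel-to-angle mapping below is one common convention and an assumption here, as are the function names; the abstract does not describe how DAOVI computes the distance.

```python
import math


def pixel_to_lonlat(u, v, width, height):
    """Map equirectangular pixel (u, v) to (longitude, latitude) in radians.

    Assumed convention: u spans [-pi, pi) left to right, v spans pi/2 (top) to -pi/2.
    """
    lon = (u / width) * 2 * math.pi - math.pi
    lat = math.pi / 2 - (v / height) * math.pi
    return lon, lat


def geodesic_distance(p, q, width, height):
    """Great-circle distance (radians, unit sphere) between two pixels, via haversine."""
    lon1, lat1 = pixel_to_lonlat(p[0], p[1], width, height)
    lon2, lat2 = pixel_to_lonlat(q[0], q[1], width, height)
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * math.asin(min(1.0, math.sqrt(h)))


WIDTH, HEIGHT = 360, 180
# The same 90-pixel horizontal offset, at the equator and near the top of the frame.
d_equator = geodesic_distance((0, 90), (90, 90), WIDTH, HEIGHT)
d_near_pole = geodesic_distance((0, 10), (90, 10), WIDTH, HEIGHT)
```

A motion estimate measured in raw pixels would treat both offsets as equal; measuring on the sphere corrects for the horizontal stretching that equirectangular projection introduces at high latitudes.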

[CV-201] Double-Constraint Diffusion Model with Nuclear Regularization for Ultra-low-dose PET Reconstruction

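[Quick read]: This paper addresses the increased noise and loss of detail in ultra-low-dose positron emission tomography (PET) reconstruction, which result from reduced radiation exposure and shorter scan times. The key to the solution is the Double-Constraint Diffusion Model (DCDM), which freezes the weights of a pre-trained diffusion model and injects a trainable double-constraint controller into the encoding architecture, consisting of the Nuclear Transformer Constraint (NTC) and the Encoding Nexus Constraint (ENC). The NTC uses the nuclear norm as an approximation to matrix rank minimization, building the low-rank property into the Transformer architecture to extract features from low-dose images efficiently and produce compressed representations. The ENC then uses these compressed features to control the pre-trained diffusion model and reconstruct high-quality PET images in pixel space. The design greatly reduces the number of trainable parameters, adapts to different dose levels without retraining all model parameters, and generalizes well to both known and unknown dose reduction factors.

Link: https://arxiv.org/abs/2509.00395
Authors: Mengxiao Geng, Ran Hong, Bingxuan Li, Qiegen Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: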

Abstract:Ultra-low-dose positron emission tomography (PET) reconstruction holds significant potential for reducing patient radiation exposure and shortening examination times. However, it may also lead to increased noise and reduced imaging detail, which could decrease the image quality. In this study, we present a Double-Constraint Diffusion Model (DCDM), which freezes the weights of a pre-trained diffusion model and injects a trainable double-constraint controller into the encoding architecture, greatly reducing the number of trainable parameters for ultra-low-dose PET reconstruction. Unlike full fine-tuning models, DCDM can adapt to different dose levels without retraining all model parameters, thereby improving reconstruction flexibility. Specifically, the two constraint modules, named the Nuclear Transformer Constraint (NTC) and the Encoding Nexus Constraint (ENC), serve to refine the pre-trained diffusion model. The NTC leverages the nuclear norm as an approximation for matrix rank minimization, integrates the low-rank property into the Transformer architecture, and enables efficient information extraction from low-dose images and conversion into compressed feature representations in the latent space. Subsequently, the ENC utilizes these compressed feature representations to encode and control the pre-trained diffusion model, ultimately obtaining reconstructed PET images in the pixel space. In clinical reconstruction, the compressed feature representations from NTC help select the most suitable ENC for efficient unknown low-dose PET reconstruction. Experiments conducted on the UDPET public dataset and the Clinical dataset demonstrated that DCDM outperforms state-of-the-art methods on known dose reduction factors (DRF) and generalizes well to unknown DRF scenarios, proving valuable even at ultra-low dose levels, such as 1% of the full dose.
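
For readers unfamiliar with the nuclear-norm surrogate mentioned in the abstract, the standard definitions are sketched below. The loss term L and weight λ are hypothetical placeholders for a generic low-rank objective; how the NTC embeds this inside a Transformer is not specified in the abstract.

```latex
% X is an m-by-n feature matrix with singular values
% \sigma_1(X) \ge \sigma_2(X) \ge \dots \ge 0.
\[
\operatorname{rank}(X) = \#\{\, i : \sigma_i(X) > 0 \,\},
\qquad
\|X\|_* = \sum_{i=1}^{\min(m,n)} \sigma_i(X).
\]
% Generic convex relaxation (L is a hypothetical data-fidelity term, \lambda > 0):
\[
\min_X \; L(X) + \lambda \operatorname{rank}(X)
\quad\longrightarrow\quad
\min_X \; L(X) + \lambda \|X\|_* .
\]
```

Rank counts the nonzero singular values and is non-convex, whereas the nuclear norm sums them and is convex; on the set of matrices with spectral norm at most 1 it is the tightest convex lower bound on rank, which is why it is widely used as a tractable stand-in.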

[CV-202] HERO-VQL: Hierarchical Egocentric and Robust Visual Query Localization BMVC2025

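[Quick read]: This paper addresses egocentric visual query localization (VQL): accurately localizing a query object in a long-form egocentric video. Frequent, abrupt viewpoint changes in egocentric video cause large appearance variations and partial occlusions, which make precise localization difficult for existing methods. The key to the solution is a method inspired by human cognitive processes, Hierarchical, Egocentric and RObust Visual Query Localization (HERO-VQL), with two core components: (i) Top-down Attention Guidance (TAG), which uses the class token for high-level semantic context and principal component score maps for fine-grained localization; and (ii) Egocentric Augmentation based Consistency Training (EgoACT), which replaces the query object and reorders video frames to simulate extreme viewpoint changes, and adds a CT loss to keep localization stable across augmentation scenarios.

Link: https://arxiv.org/abs/2509.00385
Authors: Joohyun Chang, Soyeon Hong, Hyogun Lee, Seong Jong Ha, Dongho Lee, Seong Tae Kim, Jinwoo Choi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to BMVC 2025 (Oral), 23 pages with supplementary material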

Abstract:In this work, we tackle the egocentric visual query localization (VQL), where a model should localize the query object in a long-form egocentric video. Frequent and abrupt viewpoint changes in egocentric videos cause significant object appearance variations and partial occlusions, making it difficult for existing methods to achieve accurate localization. To tackle these challenges, we introduce Hierarchical, Egocentric and RObust Visual Query Localization (HERO-VQL), a novel method inspired by human cognitive process in object recognition. We propose i) Top-down Attention Guidance (TAG) and ii) Egocentric Augmentation based Consistency Training (EgoACT). Top-down Attention Guidance refines the attention mechanism by leveraging the class token for high-level context and principal component score maps for fine-grained localization. To enhance learning in diverse and challenging matching scenarios, EgoAug enhances query diversity by replacing the query with a randomly selected corresponding object from groundtruth annotations and simulates extreme viewpoint changes by reordering video frames. Additionally, CT loss enforces stable object localization across different augmentation scenarios. Extensive experiments on VQ2D dataset validate that HERO-VQL effectively handles egocentric challenges, significantly outperforming baselines.

[CV-203] Visually Grounded Narratives: Reducing Cognitive Burden in Researcher-Participant Interaction

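[Quick read]: This paper addresses the dual burden that narrative inquiry places on researchers and participants: conventional practice requires hand-drafting coherent narratives from multiple data sources, which makes analysis inefficient and member checking laborious. The key to the solution is a new paradigm named NAME, which automatically converts research documents into coherent story images, greatly reducing the cognitive load of interpreting large volumes of text. It also introduces an actor location and shape module to make generated images more plausible, and a set of evaluation metrics spanning three dimensions that objectively measure perceptual quality and narrative consistency. Experiments show the method outperforms the baseline even when using only 0.96% of the data, with clear gains under different data partitioning schemes.

Link: https://arxiv.org/abs/2509.00381
Authors: Runtong Wu, Jiayao Song, Fei Teng, Xianhao Ren, Yuyan Gao, Kailun Yang
Affiliations: Hunan University; Henan University of Economics and Law; University of Lorraine; University of Edinburgh
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: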

Abstract:Narrative inquiry has been one of the prominent application domains for the analysis of human experience, aiming to know more about the complexity of human society. However, researchers are often required to transform various forms of data into coherent hand-drafted narratives in storied form throughout narrative analysis, which brings an immense burden of data analysis. Participants, too, are expected to engage in member checking and presentation of these narrative products, which involves reviewing and responding to large volumes of documents. Given this dual burden and the need for more efficient and participant-friendly approaches to narrative making and representation, we make a first attempt: (i) We propose a new paradigm, NAME, as an initial attempt to advance the field of narrative inquiry. NAME can convert research documents into coherent story images, alleviating the cognitive burden of interpreting extensive text-based materials during member checking for both researchers and participants. (ii) We develop an actor location and shape module to facilitate plausible image generation. (iii) We design a set of robust evaluation metrics comprising three key dimensions to objectively measure the perceptual quality and narrative consistency of generated characters. Our approach consistently demonstrates state-of-the-art performance across different data partitioning schemes. Remarkably, while the baseline relies on the full 100% of the available data, our method requires only 0.96% yet still reduces the FID score from 195 to 152. Under identical data volumes, our method delivers substantial improvements: for the 70:30 split, the FID score decreases from 175 to 152, and for the 95:5 split, it is nearly halved from 96 to 49. Furthermore, the proposed model achieves a score of 3.62 on the newly introduced metric, surpassing the baseline score of 2.66.

[CV-204] Domain Adaptation-Based Crossmodal Knowledge Distillation for 3D Semantic Segmentation ICRA2025

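[Quick read]: This paper targets the high cost of the large annotated datasets that semantic segmentation of 3D LiDAR point clouds depends on. Annotating point clouds takes considerable labor and resources, whereas real-world image data is far more abundant and easier to obtain. To ease this bottleneck, the authors propose two crossmodal knowledge distillation methods: Unsupervised Domain Adaptation Knowledge Distillation (UDAKD) and Feature and Semantic-based Knowledge Distillation (FSKD). The core idea is to exploit spatio-temporally synchronized camera and LiDAR data, letting a pretrained 2D image model provide pseudo-label guidance for unlabeled 3D point clouds so that performance improves without any 3D annotations. The key innovation is to use self-calibrated convolution as the basis of the domain adaptation module, preserving modality-general information while filtering out modality-specific details, which enables knowledge transfer from 2D images to 3D point clouds.

Link: https://arxiv.org/abs/2509.00379
Authors: Jialiang Kang,Jiawen Wang,Dingsheng Luo
Affiliations: Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: ICRA 2025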

Abstract:Semantic segmentation of 3D LiDAR data plays a pivotal role in autonomous driving. Traditional approaches rely on extensive annotated data for point cloud analysis, incurring high costs and time investments. In contrast, real-world image datasets offer abundant availability and substantial scale. To mitigate the burden of annotating 3D LiDAR point clouds, we propose two crossmodal knowledge distillation methods: Unsupervised Domain Adaptation Knowledge Distillation (UDAKD) and Feature and Semantic-based Knowledge Distillation (FSKD). Leveraging readily available spatio-temporally synchronized data from cameras and LiDARs in autonomous driving scenarios, we directly apply a pretrained 2D image model to unlabeled 2D data. Through crossmodal knowledge distillation with known 2D-3D correspondence, we actively align the output of the 3D network with the corresponding points of the 2D network, thereby obviating the necessity for 3D annotations. Our focus is on preserving modality-general information while filtering out modality-specific details during crossmodal distillation. To achieve this, we deploy self-calibrated convolution on 3D point clouds as the foundation of our domain adaptation module. Rigorous experimentation validates the effectiveness of our proposed methods, which consistently surpass state-of-the-art approaches in the field.
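
The alignment step described in the abstract (matching the 3D network's output to the 2D network's output at corresponding points) can be illustrated with a small sketch. This is not the authors' UDAKD or FSKD loss: the use of a KL divergence, the function names, and the `(pixel_index, point_index)` correspondence format are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(teacher_logits_2d, student_logits_3d, correspondences):
    """Mean KL(teacher || student) over (pixel_index, point_index) pairs.

    Hypothetical helper: the 2D image model acts as teacher, the 3D network
    as student, and only pairs with a known 2D-3D correspondence contribute.
    """
    losses = []
    for pixel_idx, point_idx in correspondences:
        p = softmax(teacher_logits_2d[pixel_idx])
        q = softmax(student_logits_3d[point_idx])
        losses.append(kl_divergence(p, q))
    return sum(losses) / len(losses)

# Toy data: two pixels, two points, three classes.
teacher = [[2.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
student_untrained = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
pairs = [(0, 0), (1, 1)]
loss = distillation_loss(teacher, student_untrained, pairs)
```

The loss is zero when the student reproduces the teacher's distribution at every matched point, so no 3D label enters the computation.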

[CV-205] NoiseCutMix: A Novel Data Augmentation Approach by Mixing Estimated Noise in Diffusion Models ICCV2025

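[Quick read]: This paper addresses the unnatural boundaries that existing data augmentation methods produce when generating images that blend classes; with techniques such as CutMix, contextual differences between classes cause visual artifacts where the images are stitched together. The key to its solution is a new method called NoiseCutMix. During the noise estimation step of a diffusion model, it partially mixes the noise estimated for two different classes instead of mixing image pixels directly, so that more natural, high-resolution images carrying features of both classes are generated in latent space. The method combines the ability of diffusion models to generate high-quality images with the feature-fusing strength of CutMix, making cross-class augmentation more plausible and practical.

Link: https://arxiv.org/abs/2509.00378
Authors: Shumpei Takezaki,Ryoma Bise,Shinnosuke Matsuo
Affiliations: Kyushu University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICCV 2025 Workshop LIMIT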

Abstract:In this study, we propose a novel data augmentation method that introduces the concept of CutMix into the generation process of diffusion models, thereby exploiting both the ability of diffusion models to generate natural and high-resolution images and the characteristic of CutMix, which combines features from two classes to create diverse augmented data. Representative data augmentation methods for combining images from multiple classes include CutMix and MixUp. However, techniques like CutMix often result in unnatural boundaries between the two images due to contextual differences. Therefore, in this study, we propose a method, called NoiseCutMix, to achieve natural, high-resolution image generation featuring the fused characteristics of two classes by partially combining the estimated noise corresponding to two different classes in a diffusion model. In the classification experiments, we verified the effectiveness of the proposed method by comparing it with conventional data augmentation techniques that combine multiple classes, random image generation using Stable Diffusion, and combinations of these methods. Our codes are available at: this https URL
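
To make the idea of "partially combining the estimated noise" concrete, here is a minimal sketch. It assumes the combination is done with a CutMix-style rectangular binary mask applied to two per-class noise estimates; the abstract does not spell out the mask or the schedule, so the helper names and the box format are hypothetical.

```python
def cutmix_mask(height, width, box):
    """Binary mask: 1 inside the box (top, left, bottom, right), 0 outside."""
    top, left, bottom, right = box
    return [[1 if top <= r < bottom and left <= c < right else 0
             for c in range(width)] for r in range(height)]

def mix_noise(eps_a, eps_b, mask):
    """Use class-B noise inside the mask and class-A noise outside it."""
    return [[m * b + (1 - m) * a for a, b, m in zip(row_a, row_b, row_m)]
            for row_a, row_b, row_m in zip(eps_a, eps_b, mask)]

# Toy 4x4 maps standing in for a denoiser's noise estimates under two
# different class conditions (assumed inputs, not real model outputs).
eps_class_a = [[0.0] * 4 for _ in range(4)]
eps_class_b = [[1.0] * 4 for _ in range(4)]
mask = cutmix_mask(4, 4, box=(1, 1, 3, 3))
mixed = mix_noise(eps_class_a, eps_class_b, mask)
```

In a real sampler the mixed estimate would replace the single-class noise prediction at each denoising step, so the blend happens before any pixels are produced.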

[CV-206] Adaptive Point-Prompt Tuning: Fine-Tuning Heterogeneous Foundation Models for 3D Point Cloud Analysis

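[Quick read]: This paper asks how, given the scarcity of 3D point cloud data, a pretrained foundation model of any modality can be efficiently adapted to 3D point cloud analysis. Mainstream approaches rely on "high-to-low" dimensional mapping (for example, transferring visual models to the 3D domain), which often loses spatial geometry and does not generalize. The core solution is Adaptive Point-Prompt Tuning (APPT). A point embedding module aggregates raw point clouds into point embeddings that carry local geometric features, and a permutation-invariant feature captures the relative positions of points, producing point tokens rich in location information. A prompt generator that shares weights with the embedding module then dynamically produces point-prompts that are concatenated into the frozen foundation model. This adds no extra parameters yet supplies global structural information, calibrates cross-modal self-attention, and allows foundation models of any modality to be adapted directly and fine-tuned efficiently for 3D point clouds.

Link: https://arxiv.org/abs/2509.00374
Authors: Mengke Li,Lihao Chen,Peng Zhang,Yiu-ming Cheung,Hui Huang
Affiliations: Shenzhen University; Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen); Xidian University; Hong Kong Baptist University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: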

Abstract:Parameter-efficient fine-tuning strategies for foundation models in 1D textual and 2D visual analysis have demonstrated remarkable efficacy. However, due to the scarcity of point cloud data, pre-training large 3D models remains a challenging task. While many efforts have been made to apply pre-trained visual models to 3D domains through “high-to-low” mapping, these approaches often lead to the loss of spatial geometries and lack a generalizable framework for adapting any modality to 3D. This paper, therefore, attempts to directly leverage point features to calibrate the heterogeneous foundation model of any modality for 3D point cloud analysis. Specifically, we propose the Adaptive Point-Prompt Tuning (APPT) method, which fine-tunes pre-trained models with a modest number of parameters, enabling direct point cloud processing without heterogeneous mappings. We convert raw point clouds into point embeddings by aggregating local geometry to capture spatial features followed by linear layers to ensure seamless utilization of frozen pre-trained models. Given the inherent disorder of point clouds, in contrast to the structured nature of images and language, we employ a permutation-invariant feature to capture the relative positions of point embeddings, thereby obtaining point tokens enriched with location information to optimize self-attention mechanisms. To calibrate self-attention across source domains of any modality to 3D and reduce computational overhead, we introduce a prompt generator that shares weights with the point embedding module, dynamically producing point-prompts without adding additional parameters. These prompts are then concatenated into a frozen foundation model, providing rich global structural information and compensating for the lack of structural context in the heterogeneous data.

[CV-207] Activation Steering Meets Preference Optimization: Defense Against Jailbreaks in Vision Language Models

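[Quick read]: This paper addresses the vulnerability of Vision Language Models (VLMs) to adversarial attacks, in particular the fact that existing activation steering defenses rely on task-specific contrastive prompts to extract harmful directions, which performs poorly and can damage visual grounding. The key to its solution is a two-stage defense framework, SPO-VLM (Sequence-Level Preference Optimization for VLM). Stage I computes adaptive, layer-specific steering vectors from diverse data sources, giving general suppression of harmful behavior at inference time. Stage II applies sequence-level preference optimization that combines automated toxicity assessment with rewards based on caption-image alignment, further improving the safety and semantic consistency of generated text. The design markedly improves robustness while preserving performance on benign tasks, balancing efficiency and effectiveness.

Link: https://arxiv.org/abs/2509.00373
Authors: Sihao Wu,Gaojie Jin,Wei Huang,Jianhong Wang,Xiaowei Huang
Affiliations: University of Liverpool; University of Exeter; Purple Mountain Laboratories; University of Bristol
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: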

Abstract:Vision Language Models (VLMs) have demonstrated impressive capabilities in integrating visual and textual information for understanding and reasoning, but remain highly vulnerable to adversarial attacks. While activation steering has emerged as a promising defense, existing approaches often rely on task-specific contrastive prompts to extract harmful directions, which exhibit suboptimal performance and can degrade visual grounding performance. To address these limitations, we propose Sequence-Level Preference Optimization for VLM (SPO-VLM), a novel two-stage defense framework that combines activation-level intervention with policy-level optimization to enhance model robustness. In Stage I, we compute adaptive layer-specific steering vectors from diverse data sources, enabling generalized suppression of harmful behaviors during inference. In Stage II, we refine these steering vectors through a sequence-level preference optimization process. This stage integrates automated toxicity assessment, as well as visual-consistency rewards based on caption-image alignment, to achieve safe and semantically grounded text generation. The two-stage structure of SPO-VLM balances efficiency and effectiveness by combining a lightweight mitigation foundation in Stage I with deeper policy refinement in Stage II. Extensive experiments show that SPO-VLM enhances safety against attacks via activation steering and preference optimization, while maintaining strong performance on benign tasks without compromising visual understanding capabilities. We will release our code, model weights, and evaluation toolkit to support reproducibility and future research. Warning: This paper may contain examples of offensive or harmful text and images.
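
As a rough illustration of what a Stage I steering vector could look like, the sketch below uses the common difference-of-means construction: the vector is the mean activation on harmful inputs minus the mean on benign inputs, and it is subtracted from a hidden state at inference. The abstract does not say how SPO-VLM computes its adaptive, layer-specific vectors, so this construction, the function names, and the `alpha` strength are assumptions.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def steering_vector(harmful_acts, benign_acts):
    """Difference of means: points from benign toward harmful activations."""
    harmful_mean = mean_vector(harmful_acts)
    benign_mean = mean_vector(benign_acts)
    return [h - b for h, b in zip(harmful_mean, benign_mean)]

def steer(hidden, vector, alpha=1.0):
    """Shift one hidden state away from the harmful direction."""
    return [x - alpha * v for x, v in zip(hidden, vector)]

# Toy 2-dimensional activations for a single layer (made-up numbers).
harmful = [[1.0, 0.0], [3.0, 0.0]]
benign = [[0.0, 0.0], [0.0, 2.0]]
direction = steering_vector(harmful, benign)
steered = steer([2.0, 0.0], direction, alpha=1.0)
```

A layer-specific variant would keep one such vector, and possibly one `alpha`, per layer.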

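[CV-208] Two Causes Not One: Rethinking Omission and Fabrication Hallucinations in MLLMs

[Quick read]: This paper addresses object hallucination in Multimodal Large Language Models (MLLMs), and specifically aims to distinguish and separately mitigate omission hallucination and fabrication hallucination. The two are often assumed to share one mechanism, so existing methods that reduce one kind tend to aggravate the other. The key to its solution is a new conceptual framework, the Visual-Semantic Attention Potential Field, which reveals how the model builds evidence from visual features to judge whether an object is present. Based on it, the authors design a plug-and-play method, Visual Potential Field Calibration (VPFC), that reduces omission hallucinations without introducing new fabrication hallucinations, giving a more balanced and robust mitigation strategy.

Link: https://arxiv.org/abs/2509.00371
Authors: Guangzong Si,Hao Yin,Xianfei Li,Qing Ding,Wenlong Liao,Tao He,Pai Peng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint, under review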

Abstract:Multimodal Large Language Models (MLLMs) have achieved impressive advances, yet object hallucination remains a persistent challenge. Existing methods, based on the flawed assumption that omission and fabrication hallucinations share a common cause, often reduce omissions only to trigger more fabrications. In this work, we overturn this view by demonstrating that omission hallucinations arise from insufficient confidence when mapping perceived visual features to linguistic expressions, whereas fabrication hallucinations result from spurious associations within the cross-modal representation space due to statistical biases in the training corpus. Building on findings from visual attention intervention experiments, we propose the Visual-Semantic Attention Potential Field, a conceptual framework that reveals how the model constructs visual evidence to infer the presence or absence of objects. Leveraging this insight, we introduce Visual Potential Field Calibration (VPFC), a plug-and-play hallucination mitigation method that effectively reduces omission hallucinations without introducing additional fabrication hallucinations. Our findings reveal a critical oversight in current object hallucination research and chart new directions for developing more robust and balanced hallucination mitigation strategies.

[CV-209] A Multimodal Head and Neck Cancer Dataset for AI-Driven Precision Oncology

[Quick read]: This paper addresses the lack of high-quality, multi-center, consistently annotated multimodal medical imaging datasets for head and neck cancer research, with the aim of advancing generative AI and deep learning models for clinical tasks such as automated tumor segmentation, prognosis prediction, and human papillomavirus (HPV) status classification. The key to its solution is a dataset of 1123 fluorodeoxyglucose positron emission tomography/computed tomography (FDG-PET/CT) studies from patients with histologically confirmed head and neck cancer, covering real-world acquisition protocols from multiple centers. Experienced radiation oncologists and radiologists manually delineated primary gross tumor volumes (GTVp) and involved lymph nodes (GTVn) following standardized guidelines. The release also provides anonymized NIfTI image files, expert-annotated segmentation masks, radiotherapy dose distributions for a subset of patients, and detailed clinical metadata (such as TNM staging, HPV status, survival times, and treatment information), offering a reliable benchmark for developing and validating deep-learning-based automated analysis tools.

Link: https://arxiv.org/abs/2509.00367
Authors: Numan Saeed(1),Salma Hassan(2),Shahad Hardan(2),Ahmed Aly(1),Darya Taratynova(2),Umair Nawaz(1),Ufaq Khan(1),Muhammad Ridzuan(1),Thomas Eugene(4),Raphaël Metz(4),Mélanie Dore(5),Gregory Delpon(6),Vijay Ram Kumar Papineni(7),Kareem Wahid(8),Cem Dede(8),Alaa Mohamed Shawky Ali(8),Carlos Sjogreen(8),Mohamed Naser(8),Clifton D. Fuller(8),Valentin Oreiller(9),Mario Jreige(10),John O. Prior(10),Catherine Cheze Le Rest(11),Olena Tankyevych(11),Pierre Decazes(12),Su Ruan(12),Stephanie Tanadini-Lang(13),Martin Vallières(14),Hesham Elhalawani(16 and 17),Ronan Abgral(18),Romain Floch(18),Kevin Kerleguer(18),Ulrike Schick(19),Maelle Mauguen(19),Vincent Andrearczyk(9 and 10),Adrien Depeursinge(9 and 10),Mathieu Hatt(15),Arman Rahmim(3),Mohammad Yaqub(1) ((1) Department of Computer Vision, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE, (2) Department of Machine Learning, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE, (3) Department of Integrative Oncology, BC Cancer Research Institute, Vancouver, BC, Canada, (4) Nantes Université, CHU Nantes, Nuclear Medicine Department, Nantes, France, (5) Radiation Oncology Department, Institut de Cancérologie de l'Ouest, Saint-Herblain, France, (6) Medical Physics Department, Institut de Cancérologie de l'Ouest, Saint Herblain, France, (7) Radiology Department, Sheikh Shakhbout Medical City, Abu Dhabi, UAE, (8) MD Anderson Cancer Center, The University of Texas, Texas, United States, (9) Institute of Informatics, HES-SO Valais-Wallis University of Applied Sciences and Arts Western Switzerland, Sierre, Switzerland, (10) Department of Nuclear Medicine and Molecular Imaging, Lausanne University Hospital (CHUV), Lausanne, Switzerland, (11) Centre Hospitalier Universitaire de Poitiers (CHUP), Poitiers, France, (12) Center Henri Becquerel, LITIS laboratory, University of Rouen Normandy, Rouen, France, (13) University Hospital Zürich, Zurich, Switzerland, (14) Department of Computer Science, Université de Sherbrooke, Sherbrooke, Québec, Canada, (15) LaTIM, INSERM, UMR 1101, Univ Brest, Brest, France, (16) Department of Radiation Oncology, Brigham and Women's Hospital, Boston, United States, (17) Dana Farber Cancer Institute, Harvard Medical School, Boston, USA, (18) Nuclear medicine department, University Hospital of Brest, Brest, France, (19) Radiotherapy department, University Hospital of Brest, Brest, France)
Affiliations: Mohamed bin Zayed University of Artificial Intelligence; Nantes Université; CHU Nantes; Institut de Cancérologie de l'Ouest; University of Texas; MD Anderson Cancer Center; HES-SO Valais-Wallis University of Applied Sciences and Arts Western Switzerland; Lausanne University Hospital (CHUV); Centre Hospitalier Universitaire de Poitiers (CHUP); University of Rouen Normandy; University Hospital Zürich; Université de Sherbrooke; Brigham and Women's Hospital; Dana Farber Cancer Institute; University Hospital of Brest; LaTIM, INSERM, UMR 1101, Univ Brest; BC Cancer Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures. Numan Saeed is the corresponding author. Salma Hassan and Shahad Hardan contributed equally to this work. Project page: this https URL

Abstract:We describe a publicly available multimodal dataset of annotated Positron Emission Tomography/Computed Tomography (PET/CT) studies for head and neck cancer research. The dataset includes 1123 FDG-PET/CT studies from patients with histologically confirmed head and neck cancer, acquired from 10 international medical centers. All examinations consisted of co-registered PET/CT scans with varying acquisition protocols, reflecting real-world clinical diversity across institutions. Primary gross tumor volumes (GTVp) and involved lymph nodes (GTVn) were manually segmented by experienced radiation oncologists and radiologists following standardized guidelines and quality control measures. We provide anonymized NifTi files of all studies, along with expert-annotated segmentation masks, radiotherapy dose distribution for a subset of patients, and comprehensive clinical metadata. This metadata includes TNM staging, HPV status, demographics (age and gender), long-term follow-up outcomes, survival times, censoring indicators, and treatment information. We demonstrate how this dataset can be used for three key clinical tasks: automated tumor segmentation, recurrence-free survival prediction, and HPV status classification, providing benchmark results using state-of-the-art deep learning models, including UNet, SegResNet, and multimodal prognostic frameworks.

[CV-210] SurgLLM: A Versatile Large Multimodal Model with Spatial Focus and Temporal Awareness for Surgical Video Understanding

[Quick read]: This paper addresses two key problems in current surgical video understanding: inadequate perception of visual content and weak temporal awareness, both of which limit how well Computer-Assisted Surgery (CAS) systems generalize across diverse tasks. The core of its solution is the SurgLLM framework, whose key techniques are: 1) Surgical Context-aware Multimodal Pretraining (Surg-Pretrain), which strengthens spatial focus through instrument-centric Masked Video Reconstruction (MV-Recon) followed by multimodal alignment; 2) Temporal-aware Multimodal Tuning (TM-Tuning), which uses interleaved embeddings to improve temporal reasoning; and 3) a Surgical Task Dynamic Ensemble, which adapts parameters efficiently to different understanding tasks and avoids conflicts between them. Experiments show that SurgLLM clearly outperforms existing methods on surgical video captioning, general visual question answering (VQA), and temporal VQA.

Link: https://arxiv.org/abs/2509.00357
Authors: Zhen Chen,Xingjian Luo,Kun Yuan,Jinlin Wu,Danny T.M. Chan,Nassir Navab,Hongbin Liu,Zhen Lei,Jiebo Luo
Affiliations: Hong Kong Institute of Science & Innovation; Technische Universität München; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Surgical video understanding is crucial for facilitating Computer-Assisted Surgery (CAS) systems. Despite significant progress in existing studies, two major limitations persist, including inadequate visual content perception and insufficient temporal awareness in surgical videos, and hinder the development of versatile CAS solutions. In this work, we propose the SurgLLM framework, an effective large multimodal model tailored for versatile surgical video understanding tasks with enhanced spatial focus and temporal awareness. Specifically, to empower the spatial focus of surgical videos, we first devise Surgical Context-aware Multimodal Pretraining (Surg-Pretrain) for the video encoder of SurgLLM, by performing instrument-centric Masked Video Reconstruction (MV-Recon) and subsequent multimodal alignment. To incorporate surgical temporal knowledge into SurgLLM, we further propose Temporal-aware Multimodal Tuning (TM-Tuning) to enhance temporal reasoning with interleaved multimodal embeddings. Moreover, to accommodate various understanding tasks of surgical videos without conflicts, we devise a Surgical Task Dynamic Ensemble to efficiently triage a query with optimal learnable parameters in our SurgLLM. Extensive experiments performed on diverse surgical video understanding tasks, including captioning, general VQA, and temporal VQA, demonstrate significant improvements over the state-of-the-art approaches, validating the effectiveness of our SurgLLM in versatile surgical video understanding. The source code is available at this https URL.

[CV-211] Iterative Low-rank Network for Hyperspectral Image Denoising

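[Quick read]: This paper addresses how hyperspectral image (HSI) denoising can make effective use of the physical prior of HSIs, namely that a clean HSI usually lies in a low-dimensional subspace characterized by low-rank and sparse representation, while still preserving image details. Traditional methods struggle to exploit this prior fully. The key to its solution is a new Iterative Low-Rank Network (ILRNet), whose central innovation is a learnable Rank Minimization Module (RMM) embedded in a U-Net architecture. During the forward pass, the RMM transforms feature maps into the wavelet domain and applies Singular Value Thresholding (SVT) to the low-frequency components, exploiting the spectral low-rankness of HSIs in the feature domain. The RMM's parameter is learned adaptively from data, so low-rankness can be modeled flexibly across scenes. ILRNet also uses an iterative refinement mechanism that adaptively fuses intermediate denoised results with the original noisy input, giving progressive enhancement and better detail preservation.

Link: https://arxiv.org/abs/2509.00356
Authors: Jin Ye,Fengchao Xiong,Jun Zhou,Yuntao Qian
Affiliations: Nanjing University of Science and Technology; Griffith University; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: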

Abstract:Hyperspectral image (HSI) denoising is a crucial preprocessing step for subsequent tasks. A clean HSI usually resides in a low-dimensional subspace, which can be captured by low-rank and sparse representation, known as the physical prior of HSI. It is generally challenging to adequately use such physical properties for effective denoising while preserving image details. This paper introduces a novel iterative low-rank network (ILRNet) to address these challenges. ILRNet integrates the strengths of model-driven and data-driven approaches by embedding a rank minimization module (RMM) within a U-Net architecture. This module transforms feature maps into the wavelet domain and applies singular value thresholding (SVT) to the low-frequency components during the forward pass, leveraging the spectral low-rankness of HSIs in the feature domain. The module's parameter, closely related to the hyperparameter of the singular value thresholding algorithm, is adaptively learned from the data, allowing for flexible and effective capture of low-rankness across different scenarios. Additionally, ILRNet features an iterative refinement process that adaptively combines intermediate denoised HSIs with noisy inputs. This ensures progressive enhancement and superior preservation of image details. Experimental results demonstrate that ILRNet achieves state-of-the-art performance in both synthetic and real-world noise removal tasks.

[CV-212] AQFusionNet: Multimodal Deep Learning for Air Quality Index Prediction with Imagery and Sensor Data

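[Quick read]: This paper addresses the challenge of air quality monitoring in resource-constrained regions, where sparse sensor deployment and limited infrastructure lead to inaccurate Air Quality Index (AQI) prediction. The key to its solution is AQFusionNet, a multimodal deep learning framework that fuses ground-level atmospheric imagery with pollutant concentration data. Lightweight convolutional neural network (CNN) backbones (MobileNetV2, ResNet18, EfficientNet-B0) are used to extract visual and sensor features, which are fused in a semantically aligned embedding space for accurate and computationally efficient AQI prediction. Experiments on more than 8,000 samples from India and Nepal show that the method clearly outperforms single-modality baselines, reaching up to 92.02% classification accuracy and an RMSE of 7.70 while keeping computational overhead low enough for deployment on edge devices.

Link: https://arxiv.org/abs/2509.00353
Authors: Koushik Ahmed Kushal,Abdullah Al Mamun
Affiliations: Clarkson University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures, 2 tables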

Abstract:Air pollution monitoring in resource-constrained regions remains challenging due to sparse sensor deployment and limited infrastructure. This work introduces AQFusionNet, a multimodal deep learning framework for robust Air Quality Index (AQI) prediction. The framework integrates ground-level atmospheric imagery with pollutant concentration data using lightweight CNN backbones (MobileNetV2, ResNet18, EfficientNet-B0). Visual and sensor features are combined through semantically aligned embedding spaces, enabling accurate and efficient prediction. Experiments on more than 8,000 samples from India and Nepal demonstrate that AQFusionNet consistently outperforms unimodal baselines, achieving up to 92.02% classification accuracy and an RMSE of 7.70 with the EfficientNet-B0 backbone. The model delivers an 18.5% improvement over single-modality approaches while maintaining low computational overhead, making it suitable for deployment on edge devices. AQFusionNet provides a scalable and practical solution for AQI monitoring in infrastructure-limited environments, offering robust predictive capability even under partial sensor availability.

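[CV-213] Target-Oriented Single Domain Generalization

[Quick read]: This paper addresses the sharp performance drop that models suffer under distribution shift in Single Domain Generalization (SDG): when training data differs from the target deployment environment, deep models struggle to stay robust. Existing methods mostly augment source data or learn invariant features, and overlook a resource that is often available in practice: a textual description of the target deployment environment. The key to its solution is a new paradigm, Target-Oriented Single Domain Generalization (TO-SDG), which uses the target environment's textual description (without any target data) to guide generalization, together with a lightweight module, Spectral TARget Alignment (STAR), that injects target semantics into source features through visual-language models (VLMs) such as CLIP. STAR first builds a target-anchored subspace from the target text embedding to recenter image features, then applies spectral projection to retain directions aligned with target cues and suppress source-specific noise. It further uses vision-language distillation to align backbone features with the VLM's semantic geometry, and feature-space Mixup to ensure smooth transitions between source and target-oriented representations. Experiments show that minimal textual metadata alone markedly improves generalization under severe data constraints.

Link: https://arxiv.org/abs/2509.00351
Authors: Marzi Heidari,Yuhong Guo
Affiliations: Carleton University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: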

Abstract:Deep models trained on a single source domain often fail catastrophically under distribution shifts, a critical challenge in Single Domain Generalization (SDG). While existing methods focus on augmenting source data or learning invariant features, they neglect a readily available resource: textual descriptions of the target deployment environment. We propose Target-Oriented Single Domain Generalization (TO-SDG), a novel problem setup that leverages the textual description of the target domain, without requiring any target data, to guide model generalization. To address TO-SDG, we introduce Spectral TARget Alignment (STAR), a lightweight module that injects target semantics into source features by exploiting visual-language models (VLMs) such as CLIP. STAR uses a target-anchored subspace derived from the text embedding of the target description to recenter image features toward the deployment domain, then utilizes spectral projection to retain directions aligned with target cues while discarding source-specific noise. Moreover, we use a vision-language distillation to align backbone features with VLM’s semantic geometry. STAR further employs feature-space Mixup to ensure smooth transitions between source and target-oriented representations. Experiments across various image classification and object detection benchmarks demonstrate STAR’s superiority. This work establishes that minimal textual metadata, which is a practical and often overlooked resource, significantly enhances generalization under severe data constraints, opening new avenues for deploying robust models in target environments with unseen data.

[CV-214] LUT-Fuse: Towards Extremely Fast Infrared and Visible Image Fusion via Distillation to Learnable Look-Up Tables ICCV2025

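[Quick read]: This paper addresses the neglect of real-time device applicability in current infrared and visible image fusion research, and in particular the limited computational efficiency of existing methods, which makes them hard to deploy on low-power mobile devices. The key to its solution is LUT-Fuse, a learnable look-up table (LUT) method based on knowledge distillation. Its core innovations are: 1) a look-up table structure that combines low-order approximation encoding with high-level joint contextual scene encoding, suited to multi-modal fusion; and 2) an efficient LUT distillation strategy, motivated by the lack of ground truth in multi-modal image fusion, that replaces traditional quantization LUT methods and transfers the performance of a multi-modal fusion network (MM-Net) into a lightweight MM-LUT model. This preserves fusion quality while giving a large speedup: the method typically needs less than one-tenth of the time of current lightweight state-of-the-art algorithms, and runs fast and stably across a variety of scenarios.

Link: https://arxiv.org/abs/2509.00346
Authors: Xunpeng Yi,Yibing Zhang,Xinyu Xiang,Qinglong Yan,Han Xu,Jiayi Ma
Affiliations: Wuhan University; Southeast University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025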

Abstract:Current advanced research on infrared and visible image fusion primarily focuses on improving fusion performance, often neglecting applicability on real-time fusion devices. In this paper, we propose a novel approach, termed LUT-Fuse, that moves towards extremely fast fusion via distillation to learnable look-up tables specifically designed for image fusion. First, we develop a look-up table structure that utilizes low-order approximation encoding and high-level joint contextual scene encoding, which is well-suited for multi-modal fusion. Moreover, given the lack of ground truth in multi-modal image fusion, we propose an efficient LUT distillation strategy instead of traditional quantization LUT methods. By integrating the performance of the multi-modal fusion network (MM-Net) into the MM-LUT model, our method achieves significant breakthroughs in efficiency and performance. It typically requires less than one-tenth of the time of current lightweight SOTA fusion algorithms, ensuring high operational speed across various scenarios, even on low-power mobile devices. Extensive experiments validate the superiority, reliability, and stability of our fusion approach. The code is available at this https URL.

[CV-215] CryptoFace: End-to-End Encrypted Face Recognition CVPR2025

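[Quick read]: This paper addresses the privacy risk that face recognition systems face when sensitive biometric data (such as face images or feature vectors) is accessed without authorization. Its core solution is CryptoFace, an end-to-end encrypted face recognition system based on fully homomorphic encryption (FHE) that performs the whole pipeline of feature extraction, storage, and matching without exposing raw images or features. The key innovation is a mixture of shallow patch convolutional networks, which supports higher-dimensional tensors through parallel patch-based processing while substantially reducing multiplicative depth and therefore inference latency. The design also gives near-resolution-independent latency, and on standard face verification benchmarks it improves both inference speed and accuracy over existing FHE neural networks.

Link: https://arxiv.org/abs/2509.00332
Authors: Wei Ao,Vishnu Naresh Boddeti
Affiliations: Michigan State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: CVPR 2025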

Abstract:Face recognition is central to many authentication, security, and personalized applications. Yet, it suffers from significant privacy risks, particularly arising from unauthorized access to sensitive biometric data. This paper introduces CryptoFace, the first end-to-end encrypted face recognition system with fully homomorphic encryption (FHE). It enables secure processing of facial data across all stages of a face-recognition process (feature extraction, storage, and matching) without exposing raw images or features. We introduce a mixture of shallow patch convolutional networks to support higher-dimensional tensors via patch-based processing while reducing the multiplicative depth and, thus, inference latency. Parallel FHE evaluation of these networks ensures near-resolution-independent latency. On standard face recognition benchmarks, CryptoFace significantly accelerates inference and increases verification accuracy compared to the state-of-the-art FHE neural networks adapted for face recognition. CryptoFace will facilitate secure face recognition systems requiring robust and provable security. The code is available at this https URL.

[CV-216] owards Adaptive Visual Token Pruning for Large Multimodal Models

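[Quick read]: This paper addresses the high compute and memory cost that large multimodal models (LMMs) incur at inference time as the number of visual and textual tokens grows. Existing token pruning methods often rely on costly calibration or suboptimal importance metrics, so redundant tokens are retained. The solution has three parts. First, it analyzes the difference in redundancy between visual and textual tokens and prunes only visual tokens. Second, it proposes a mutual-information-based pruning strategy that removes visual tokens semantically misaligned with the textual tokens, preserving cross-modal alignment. Third, it increases the informational diversity of the retained tokens by maximizing their expected pairwise distance in the embedding space, an objective solved efficiently with a greedy algorithm. Experiments show that the method maintains strong performance while reducing tokens by 88.9% and improving inference speed by 56.7%.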

Link: https://arxiv.org/abs/2509.00320
Authors: Hao Zhang, Mengsi Lyu, Chenrui He, Yulong Ao, Yonghua Lin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages

Abstract:Large Multimodal Models (LMMs) have achieved significant success across various tasks. These models usually encode visual inputs into dense token sequences, which are then concatenated with textual tokens and jointly processed by a language model. However, the increased token count substantially raises computational and memory costs during inference. Token pruning has emerged as a promising approach to address this issue. Existing token pruning methods often rely on costly calibration or suboptimal importance metrics, leading to redundant retained tokens. In this paper, we analyze the redundancy differences between visual and textual tokens and propose pruning exclusively on visual tokens. Based on this, we propose a visual token pruning strategy that explicitly preserves both cross-modal alignment and intra-modal informational diversity. We introduce a mutual information-based token pruning strategy that removes visual tokens semantically misaligned with textual tokens, effectively preserving the alignment between the visual and textual modalities. To further improve the representational quality of the retained tokens, we additionally prune redundant visual tokens by maximizing the expected pairwise distances in the embedding space, which is solved efficiently with a greedy algorithm. Extensive experiments demonstrate that our method maintains strong performance while reducing tokens by 88.9% on models such as LLaVA-1.5-7B and LLaVA-NEXT-7B, resulting in a 56.7% improvement in inference speed.
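
The diversity step in the abstract (keep visual tokens that are far apart in embedding space, chosen greedily) can be illustrated with a small sketch. The code below is not the paper's algorithm: it uses a farthest-point heuristic as a stand-in for the stated objective of maximizing expected pairwise distance, and the function name `greedy_diverse_subset` and its arguments are hypothetical.

```python
import numpy as np

def greedy_diverse_subset(tokens, keep):
    """Greedily pick `keep` row indices whose embeddings are far apart.

    Assumption: a farthest-point heuristic stands in for the paper's
    expected-pairwise-distance objective.
    """
    tokens = np.asarray(tokens, dtype=np.float64)
    n = tokens.shape[0]
    if not 0 < keep <= n:
        raise ValueError("keep must be between 1 and the number of tokens")
    # Seed with the token farthest from the mean embedding.
    first = int(np.argmax(np.linalg.norm(tokens - tokens.mean(axis=0), axis=1)))
    selected = [first]
    # min_dist[i] = distance from token i to its nearest selected token.
    min_dist = np.linalg.norm(tokens - tokens[first], axis=1)
    while len(selected) < keep:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(tokens - tokens[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected
```

Near-duplicate tokens are unlikely to be kept together, because once one of them is selected the other's distance to the selected set becomes small.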

[CV-217] MorphGen: Morphology-Guided Representation Learning for Robust Single-Domain Generalization in Histopathological Cancer Classification

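[Quick read]: This paper addresses domain generalization in computational pathology, where whole slide images (WSIs) differ across institutions in tissue preparation, staining, and imaging conditions. Machine learning systems are sensitive to such domain shift, whereas pathologists rely on domain-invariant morphological cues such as nuclear atypia, structural atypia, and overall morphological atypia. The proposed method, MorphGen (Morphology-Guided Generalization), integrates histopathology images, augmentations, and nuclear segmentation masks within a supervised contrastive learning framework, explicitly modeling biologically robust nuclear morphology and spatial organization so that the model focuses on diagnostically relevant nuclear atypia and spatial cell arrangement rather than staining artifacts or domain-specific features. Stochastic weight averaging (SWA) steers optimization toward flatter minima, improving robustness in out-of-distribution (OOD) settings and resilience to adversarial attacks.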

Link: https://arxiv.org/abs/2509.00311
Authors: Hikmat Khan, Syed Farhan Alam Zaidi, Pir Masoom Shah, Kiruthika Balakrishnan, Rabia Khan, Muhammad Waqas, Jia Wu
Affiliations: Ohio State University Medical Center; Chung-Ang University; China University of Science and Technology; University of Illinois Chicago; King Abdulaziz University; MD Anderson Cancer Center
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Domain generalization in computational histopathology is hindered by heterogeneity in whole slide images (WSIs), caused by variations in tissue preparation, staining, and imaging conditions across institutions. Unlike machine learning systems, pathologists rely on domain-invariant morphological cues such as nuclear atypia (enlargement, irregular contours, hyperchromasia, chromatin texture, spatial disorganization), structural atypia (abnormal architecture and gland formation), and overall morphological atypia that remain diagnostic across diverse settings. Motivated by this, we hypothesize that explicitly modeling biologically robust nuclear morphology and spatial organization will enable the learning of cancer representations that are resilient to domain shifts. We propose MorphGen (Morphology-Guided Generalization), a method that integrates histopathology images, augmentations, and nuclear segmentation masks within a supervised contrastive learning framework. By aligning latent representations of images and nuclear masks, MorphGen prioritizes diagnostic features such as nuclear and morphological atypia and spatial organization over staining artifacts and domain-specific features. To further enhance out-of-distribution robustness, we incorporate stochastic weight averaging (SWA), steering optimization toward flatter minima. Attention map analyses revealed that MorphGen primarily relies on nuclear morphology, cellular composition, and spatial cell organization within tumors or normal regions for final classification. Finally, we demonstrate resilience of the learned representations to image corruptions (such as staining artifacts) and adversarial attacks, showcasing not only OOD generalization but also addressing critical vulnerabilities in current deep learning systems for digital pathology. Code, datasets, and trained models are available at: this https URL
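
Stochastic weight averaging, mentioned in the abstract, keeps a running mean of the weights visited late in training. The sketch below shows only that averaging rule on toy parameter dictionaries; the checkpoint values and the helper name `swa_update` are illustrative assumptions, not taken from the MorphGen code.

```python
import numpy as np

def swa_update(avg, new, n_averaged):
    """Running mean of weights: avg <- avg + (new - avg) / (n_averaged + 1)."""
    return {k: avg[k] + (new[k] - avg[k]) / (n_averaged + 1) for k in avg}

# Toy "checkpoints" standing in for weights saved at the end of several epochs.
checkpoints = [
    {"w": np.array([1.0, 2.0])},
    {"w": np.array([3.0, 4.0])},
    {"w": np.array([5.0, 0.0])},
]

swa = {k: v.copy() for k, v in checkpoints[0].items()}
for n, ckpt in enumerate(checkpoints[1:], start=1):
    swa = swa_update(swa, ckpt, n)
```

After the loop, `swa["w"]` equals the element-wise mean of the three checkpoints.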

[CV-218] Language-Aware Information Maximization for Transductive Few-Shot CLIP

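[Quick read]: This paper addresses transductive few-shot learning for vision-language models (VLMs), a setting that is still at a nascent stage compared with vision-only models, and asks how transductive inference over unlabeled query data can improve CLIP-based models. The core solution is a novel Language-aware Information MaximizatiOn (LIMO) loss that combines three complementary terms: (i) the mutual information between the vision inputs and the textual class descriptions; (ii) a Kullback-Leibler (KL) divergence term penalizing deviation of the network's output distribution from the text-driven zero-shot predictions; and (iii) a standard cross-entropy loss on the labeled shots. The paper also challenges common fine-tuning practice in this setting and explores parameter-efficient fine-tuning (PEFT) strategies, finding that adapting only a subset of the model's parameters yields substantial gains.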

Link: https://arxiv.org/abs/2509.00305
Authors: Ghassen Baklouti, Maxime Zanella, Ismail Ben Ayed
Affiliations: École de Technologie Supérieure (ÉTS); Université Catholique de Louvain (UCLouvain); Université de Mons (UMons)
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Transductive few-shot learning has triggered an abundant literature focusing on vision-only models, but is still at a nascent stage within the recent context of foundational vision-language models (VLMs). Only a few recent methods addressed the problem, pointing to the potential of transduction in VLMs and to the need for VLM-tailored methods. Building on this momentum, we leverage information-theoretic concepts and recent progress in parameter-efficient fine-tuning (PEFT), developing a highly competitive transductive few-shot CLIP method. Specifically, we introduce a novel Language-aware Information MaximizatiOn (LIMO) loss integrating three complementary terms: (i) the mutual information between the vision inputs and the textual class descriptions; (ii) a Kullback-Leibler (KL) divergence penalizing deviation of the network’s probabilistic outputs from the text-driven zero-shot predictions; and (iii) a standard cross-entropy loss based on the labeled shots. Furthermore, we challenge the commonly followed fine-tuning practices in the context of transductive few-shot learning, and explore PEFT strategies, completely overlooked in this context. Surprisingly, we observe substantial boosts in performance, which points to the potential of adapting a subset of the model’s parameters in the transductive few-shot setting. We report comprehensive evaluations, which show that LIMO outperforms the very recent transductive few-shot CLIP methods by a large margin and yields significant gains over the best-performing inductive methods. Our code is publicly available at: this https URL
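
The three terms listed in the abstract can be combined in a few lines. The following is a hedged sketch, not the authors' implementation: the mutual information is estimated with the common "marginal entropy minus mean conditional entropy" form, the KL direction and the weights `lam_kl` and `lam_ce` are assumptions, and all names are hypothetical.

```python
import numpy as np

EPS = 1e-12  # avoids log(0)

def entropy(p, axis=-1):
    return -np.sum(p * np.log(p + EPS), axis=axis)

def limo_style_loss(probs_query, probs_zero_shot, probs_support,
                    labels_support, lam_kl=1.0, lam_ce=1.0):
    """Sketch of an information-maximization loss with three terms.

    probs_query:     (N, K) softmax outputs on unlabeled query images
    probs_zero_shot: (N, K) text-driven zero-shot predictions
    probs_support:   (M, K) softmax outputs on the labeled shots
    labels_support:  (M,)   integer class labels of the shots
    """
    # (i) mutual information estimate: H(mean prediction) - mean H(prediction)
    marginal = probs_query.mean(axis=0)
    mutual_info = entropy(marginal) - entropy(probs_query).mean()
    # (ii) KL(prediction || zero-shot), averaged over query samples
    log_ratio = np.log(probs_query + EPS) - np.log(probs_zero_shot + EPS)
    kl = np.sum(probs_query * log_ratio, axis=1).mean()
    # (iii) cross-entropy on the labeled shots
    picked = probs_support[np.arange(len(labels_support)), labels_support]
    ce = -np.log(picked + EPS).mean()
    # Mutual information is maximized, so it enters with a minus sign.
    return -mutual_info + lam_kl * kl + lam_ce * ce
```

With confident, class-balanced query predictions that agree with the zero-shot predictions, the loss approaches its lower bound of minus the log of the number of classes.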

[CV-219] Generative AI for Industrial Contour Detection: A Language-Guided Vision System

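[Quick read]: This paper addresses the limited performance of classical edge detectors and handcrafted pipelines in industrial computer vision under noise, material variability, and uncontrolled imaging conditions. The solution is a language-guided generative vision system for remnant contour detection in manufacturing, designed to reach CAD-level precision. The system has three stages: data acquisition and preprocessing, contour generation with a conditional GAN, and multimodal contour refinement with a vision-language model (VLM), where standardized prompts are crafted in a human-in-the-loop process and applied through image-text guided synthesis. Experiments show improved contour fidelity, with better edge continuity and geometric alignment and less manual tracing.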

Link: https://arxiv.org/abs/2509.00284
Authors: Liang Gong, Tommy (Zelin) Wang, Sara Chaker, Yanchen Dong, Fouad Bousetouane, Brenden Morton, Mark Mendez
Affiliations: The University of Chicago; FabTrack
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 20 pages, 5 figures

Abstract:Industrial computer vision systems often struggle with noise, material variability, and uncontrolled imaging conditions, limiting the effectiveness of classical edge detectors and handcrafted pipelines. In this work, we present a language-guided generative vision system for remnant contour detection in manufacturing, designed to achieve CAD-level precision. The system is organized into three stages: data acquisition and preprocessing, contour generation using a conditional GAN, and multimodal contour refinement through vision-language modeling, where standardized prompts are crafted in a human-in-the-loop process and applied through image-text guided synthesis. On proprietary FabTrack datasets, the proposed system improved contour fidelity, enhancing edge continuity and geometric alignment while reducing manual tracing. For the refinement stage, we benchmarked several vision-language models, including Google’s Gemini 2.0 Flash, OpenAI’s GPT-image-1 integrated within a VLM-guided workflow, and open-source baselines. Under standardized conditions, GPT-image-1 consistently outperformed Gemini 2.0 Flash in both structural accuracy and perceptual quality. These findings demonstrate the promise of VLM-guided generative workflows for advancing industrial computer vision beyond the limitations of classical pipelines.

[CV-220] 3D-LATTE: Latent Space 3D Editing from Textual Instructions

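[Quick read]: This paper addresses the gap in quality between instruction-based editing of 3D assets and 3D generation models; the main difficulty is that existing methods relying on 2D priors produce view-inconsistent editing signals. The solution is a training-free editing method that operates directly in the latent space of a native 3D diffusion model, which allows direct manipulation of 3D geometry. Edits are guided by blending 3D attention maps from the generation with the source object, combined with geometry-aware regularization guidance, a spectral modulation strategy in the Fourier domain, and a refinement step for 3D enhancement, which together improve the fidelity, precision, and robustness of the edits.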

Link: https://arxiv.org/abs/2509.00269
Authors: Maria Parelli, Michael Oechsle, Michael Niemeyer, Federico Tombari, Andreas Geiger
Affiliations: University of Tübingen, Tübingen AI Center; Google Zurich
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Despite the recent success of multi-view diffusion models for text/image-based 3D asset generation, instruction-based editing of 3D assets lags surprisingly far behind the quality of generation models. The main reason is that recent approaches using 2D priors suffer from view-inconsistent editing signals. Going beyond 2D prior distillation methods and multi-view editing strategies, we propose a training-free editing method that operates within the latent space of a native 3D diffusion model, allowing us to directly manipulate 3D geometry. We guide the edit synthesis by blending 3D attention maps from the generation with the source object. Coupled with geometry-aware regularization guidance, a spectral modulation strategy in the Fourier domain and a refinement step for 3D enhancement, our method outperforms previous 3D editing methods, enabling high-fidelity, precise, and robust edits across a wide range of shapes and semantic manipulations.

[CV-221] A High-Accuracy Fast Hough Transform with Linear-Log-Cubed Computational Complexity for Arbitrary-Shaped Images

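[Quick read]: This paper addresses a long-standing trade-off in Hough transform (HT) algorithms: achieving high accuracy while keeping computational complexity close to optimal. Fast Hough transform (FHT) algorithms such as the Brady-Yong algorithm reach linearithmic complexity, $\mathcal{O}(wh \ln wh)$, but only for power-of-two-sized images; the generalized FHT2DT algorithm handles arbitrary image sizes, but its accuracy worsens with scale. Accurate HT algorithms, in contrast, keep the approximation error bounded by a constant but require near-cubic computational cost. The paper proposes the FHT2SP algorithm, which extends Brady's superpixel concept to arbitrary shapes and integrates it into FHT2DT. With an appropriate choice of superpixel size, for an image of shape $w \times h$ it achieves near-optimal complexity $\mathcal{O}(wh \ln^3 w)$ while keeping the approximation error bounded by a constant that is independent of image size and controllable via a meta-parameter.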

Link: https://arxiv.org/abs/2509.00231
Authors: Danil Kazimirov, Dmitry Nikolaev
Affiliations: Smart Engines Service LLC; Institute for Information Transmission Problems (Kharkevich Institute) RAS; Faculty of Mechanics and Mathematics; Lomonosov Moscow State University; Federal Research Center Computer Science and Control RAS
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures. Accepted to International Conference on Machine Vision 2025 (ICMV 2025)

Abstract:The Hough transform (HT) is a fundamental tool across various domains, from classical image analysis to neural networks and tomography. Two key aspects of the algorithms for computing the HT are their computational complexity and accuracy - the latter often defined as the error of approximation of continuous lines by discrete ones within the image region. The fast HT (FHT) algorithms with optimal linearithmic complexity - such as the Brady-Yong algorithm for power-of-two-sized images - are well established. Generalizations like FHT2DT extend this efficiency to arbitrary image sizes, but with reduced accuracy that worsens with scale. Conversely, accurate HT algorithms achieve constant-bounded error but require near-cubic computational cost. This paper introduces the FHT2SP algorithm - a fast and highly accurate HT algorithm. It builds on our development of Brady’s superpixel concept, extending it to arbitrary shapes beyond the original power-of-two square constraint, and integrates it into the FHT2DT algorithm. With an appropriate choice of the superpixel’s size, for an image of shape $w \times h$, the FHT2SP algorithm achieves near-optimal computational complexity $\mathcal{O}(wh \ln^3 w)$, while keeping the approximation error bounded by a constant independent of image size, and controllable via a meta-parameter. We provide theoretical and experimental analyses of the algorithm’s complexity and accuracy.

[CV-222] GraViT: Transfer Learning with Vision Transformers and MLP-Mixer for Strong Gravitational Lens Discovery

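[Quick read]: This paper addresses automated detection of strong gravitational lenses, motivated by the data volume of next-generation surveys such as LSST, which is predicted to find on the order of $10^5$ gravitational lenses over the next decade. The solution is GraViT, a PyTorch pipeline that leverages extensively pretrained Vision Transformer (ViT) and MLP-Mixer models and applies transfer learning to lens classification. The study systematically assesses how data quality, model architecture selection and fine-tuning, training strategies (augmentation, normalization, and optimization), and ensemble predictions affect detection performance, providing a scalable approach to automated lens finding.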

Link: https://arxiv.org/abs/2509.00226
Authors: René Parlange, Juan C. Cuevas-Tello, Octavio Valenzuela, Omar de J. Cabrera-Rosas, Tomás Verdugo, Anupreeta More, Anton T. Jaelani
Affiliations: Universidad Autónoma de San Luis Potosí; Universidad Nacional Autónoma de México. Instituto de Astronomía; Inter-University Centre for Astronomy and Astrophysics; Kavli IPMU (WPI), UTIAS, The University of Tokyo; Institut Teknologi Bandung
Categories: Computer Vision and Pattern Recognition (cs.CV); Astrophysics of Galaxies (astro-ph.GA)
Comments: Our publicly available fine-tuned models provide a scalable transfer learning solution for gravitational lens finding in LSST. Submitted to MNRAS. Comments welcome

Abstract:Gravitational lensing offers a powerful probe into the properties of dark matter and is crucial to infer cosmological parameters. The Legacy Survey of Space and Time (LSST) is predicted to find O(10^5) gravitational lenses over the next decade, demanding automated classifiers. In this work, we introduce GraViT, a PyTorch pipeline for gravitational lens detection that leverages extensive pretraining of state-of-the-art Vision Transformer (ViT) models and MLP-Mixer. We assess the impact of transfer learning on classification performance by examining data quality (source and sample size), model architecture (selection and fine-tuning), training strategies (augmentation, normalization, and optimization), and ensemble predictions. This study reproduces the experiments in a previous systematic comparison of neural networks and provides insights into the detectability of strong gravitational lenses on that common test sample. We fine-tune ten architectures using datasets from HOLISMOKES VI and SuGOHI X, and benchmark them against convolutional baselines, discussing complexity and inference-time analysis.

[CV-223] Multimodal Deep Learning for Phyllodes Tumor Classification from Ultrasound and Clinical Data

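[Quick read]: This paper addresses the difficulty of classifying phyllodes tumors (PTs) preoperatively: their imaging appearance resembles that of benign fibroadenomas, which often leads to unnecessary surgical excision. The solution is a multimodal deep learning framework that fuses breast ultrasound (BUS) images with structured clinical data to improve diagnostic accuracy. A dual-branch neural network extracts and fuses image features and patient metadata, and class-aware sampling together with subject-stratified 5-fold cross-validation mitigates class imbalance and data leakage. The framework outperforms unimodal baselines in distinguishing benign from borderline/malignant PTs.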

Link: https://arxiv.org/abs/2509.00213
Authors: Farhan Fuad Abir, Abigail Elliott Daly, Kyle Anderman, Tolga Ozmen, Laura J. Brattain
Affiliations: University of Central Florida; Massachusetts General Hospital
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: IEEE-EMBS International Conference on Body Sensor Networks (IEEE-EMBS BSN 2025)

Abstract:Phyllodes tumors (PTs) are rare fibroepithelial breast lesions that are difficult to classify preoperatively due to their radiological similarity to benign fibroadenomas. This often leads to unnecessary surgical excisions. To address this, we propose a multimodal deep learning framework that integrates breast ultrasound (BUS) images with structured clinical data to improve diagnostic accuracy. We developed a dual-branch neural network that extracts and fuses features from ultrasound images and patient metadata from 81 subjects with confirmed PTs. Class-aware sampling and subject-stratified 5-fold cross-validation were applied to prevent class imbalance and data leakage. The results show that our proposed multimodal method outperforms unimodal baselines in classifying benign versus borderline/malignant PTs. Among six image encoders, ConvNeXt and ResNet18 achieved the best performance in the multimodal setting, with AUC-ROC scores of 0.9427 and 0.9349, and F1-scores of 0.6720 and 0.7294, respectively. This study demonstrates the potential of multimodal AI to serve as a non-invasive diagnostic tool, reducing unnecessary biopsies and improving clinical decision-making in breast tumor management.
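
Splitting by subject rather than by image is what prevents leakage: all images from one patient must land in the same fold. The sketch below shows one simple way to do that while spreading classes across folds. It assumes one label per subject and uses a round-robin assignment; the function name `subject_level_folds` is hypothetical and this is not the authors' splitting code.

```python
from collections import defaultdict

def subject_level_folds(subject_ids, labels, n_folds=5):
    """Return a fold index per sample so that no subject spans two folds.

    Assumption: each subject has a single class label; subjects of each
    class are dealt round-robin to the folds to keep classes balanced.
    """
    subject_label = {}
    for sid, y in zip(subject_ids, labels):
        subject_label.setdefault(sid, y)
    by_class = defaultdict(list)
    for sid in sorted(subject_label):
        by_class[subject_label[sid]].append(sid)
    fold_of_subject = {}
    counter = 0
    for cls in sorted(by_class):
        for sid in by_class[cls]:
            fold_of_subject[sid] = counter % n_folds
            counter += 1
    return [fold_of_subject[sid] for sid in subject_ids]
```

For fold `k`, samples whose fold index equals `k` form the validation set and the remaining samples form the training set.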

[CV-224] Beyond Pixels: Introducing Geometric-Semantic World Priors for Video-based Embodied Models via Spatio-temporal Alignment

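[Quick read]: This paper addresses the challenge of achieving human-like reasoning in unknown environments with deep learning models, particularly in dynamic, open-set tasks such as task-oriented navigation and embodied question answering, where existing vision-language models (VLMs) are limited by inadequate modeling of fine-grained spatio-temporal cues and physical world understanding. The solution is VEME, a cross-modal alignment method that improves generalization to unseen scenes by learning an ego-centric, experience-centered world model. It has three components: (1) a cross-modal alignment framework that bridges objects, spatial representations, and visual semantics with spatio-temporal cues to enhance VLM in-context learning; (2) a dynamic, implicit cognitive map activated by world embedding, enabling task-relevant geometric-semantic memory recall; and (3) an instruction-based navigation and reasoning framework that leverages embodied priors for long-term planning and efficient exploration. By embedding geometry-aware spatio-temporal episodic experiences, the method improves reasoning and planning in dynamic environments.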

Link: https://arxiv.org/abs/2509.00210
Authors: Jinzhou Tang, Jusheng Zhang, Sidi Liu, Waikit Xiu, Qinhan Lv, Xiying Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Achieving human-like reasoning in deep learning models for complex tasks in unknown environments remains a critical challenge in embodied intelligence. While advanced vision-language models (VLMs) excel in static scene understanding, their limitations in spatio-temporal reasoning and adaptation to dynamic, open-set tasks like task-oriented navigation and embodied question answering (EQA) persist due to inadequate modeling of fine-grained spatio-temporal cues and physical world comprehension. To address this, we propose VEME, a novel cross-modal alignment method that enhances generalization in unseen scenes by learning an ego-centric, experience-centered world model. Our framework integrates three key components: (1) a cross-modal alignment framework bridging objects, spatial representations, and visual semantics with spatio-temporal cues to enhance VLM in-context learning; (2) a dynamic, implicit cognitive map activated by world embedding to enable task-relevant geometric-semantic memory recall; and (3) an instruction-based navigation and reasoning framework leveraging embodied priors for long-term planning and efficient exploration. By embedding geometry-aware spatio-temporal episodic experiences, our method significantly improves reasoning and planning in dynamic environments. Experimental results on VSI-Bench and VLN-CE demonstrate 1%-3% accuracy and exploration efficiency improvement compared to traditional approaches.

[CV-225] Safe-LLaVA: A Privacy-Preserving Vision-Language Dataset and Benchmark for Biometric Safety

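[Quick read]: This paper addresses biometric leakage in multimodal large language models (MLLMs): models infer and reveal sensitive biometric attributes such as race, gender, age, body weight, and eye color even when not explicitly asked, which poses a serious privacy risk in real-world and socially sensitive applications. No public dataset or benchmark previously existed to measure or mitigate this leakage systematically. The authors introduce PRISM (Privacy-aware Evaluation of Responses in Sensitive Modalities), a benchmark that assesses whether MLLMs refuse biometric-related queries and avoid implicit biometric leakage in general responses while maintaining semantic faithfulness. They also build the Safe-LLaVA dataset by systematically removing explicit and implicit biometric information from the LLaVA dataset. Experiments show that a model fine-tuned on Safe-LLaVA leaks substantially less biometric information. Together, the two contributions set a new standard for privacy-aligned development and evaluation of MLLMs.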

Link: https://arxiv.org/abs/2509.00192
Authors: Younggun Kim, Sirnam Swetha, Fazil Kagdi, Mubarak Shah
Affiliations: Center For Research in Computer Vision, University of Central Florida, USA; Department of Civil Environmental and Construction Engineering, University of Central Florida, USA; Department of Computer Science, University of Central Florida, USA
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision-language tasks. However, these models often infer and reveal sensitive biometric attributes - such as race, gender, age, body weight, and eye color - even when such information is not explicitly requested. This raises critical concerns, particularly in real-world applications and socially-sensitive domains. Despite increasing awareness, no publicly available dataset or benchmark exists to comprehensively evaluate or mitigate biometric leakage in MLLMs. To address this gap, we introduce PRISM (Privacy-aware Evaluation of Responses in Sensitive Modalities), a new benchmark designed to assess MLLMs on two fronts: (1) refusing biometric-related queries and (2) avoiding implicit biometric leakage in general responses while maintaining semantic faithfulness. Further, we conduct a detailed audit of the widely used LLaVA datasets and uncover extensive biometric leakage across pretraining and instruction data. To address this, we present the Safe-LLaVA dataset, the first privacy-preserving MLLM training dataset constructed by systematically removing explicit and implicit biometric information from the LLaVA dataset. Our evaluations on PRISM reveal biometric leakage across MLLMs for different attributes, highlighting the detailed privacy violations. We also fine-tune a model on the Safe-LLaVA dataset and show that it substantially reduces biometric leakage. Together, Safe-LLaVA and PRISM set a new standard for privacy-aligned development and evaluation of MLLMs. The Safe-LLaVA dataset and PRISM benchmark are publicly available at this https URL, and the source code is available at this https URL.

[CV-226] Category-level Text-to-Image Retrieval Improved: Bridging the Domain Gap with Diffusion Models and Vision Encoders BMVC2025

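[Quick read]: This paper addresses the limited performance of text-to-image retrieval caused by vision-language models (VLMs) mapping text and images to distant regions of the representation space. The solution is a two-step approach: first, a generative diffusion model transforms the text query into a visual query, narrowing the modality gap; second, a vision model estimates image-to-image similarity. An aggregation network combines multiple generated images into a single vector representation and fuses similarity scores across both query modalities, yielding more accurate retrieval.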

Link: https://arxiv.org/abs/2509.00177
Authors: Faizan Farooq Khan, Vladan Stojnić, Zakaria Laskar, Mohamed Elhoseiny, Giorgos Tolias
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: BMVC 2025

Abstract:This work explores text-to-image retrieval for queries that specify or describe a semantic category. While vision-and-language models (VLMs) like CLIP offer a straightforward open-vocabulary solution, they map text and images to distant regions in the representation space, limiting retrieval performance. To bridge this modality gap, we propose a two-step approach. First, we transform the text query into a visual query using a generative diffusion model. Then, we estimate image-to-image similarity with a vision model. Additionally, we introduce an aggregation network that combines multiple generated images into a single vector representation and fuses similarity scores across both query modalities. Our approach leverages advancements in vision encoders, VLMs, and text-to-image generation models. Extensive evaluations show that it consistently outperforms retrieval methods relying solely on text queries. Source code is available at: this https URL
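
The scoring side of the two-step approach can be sketched with plain cosine similarities. The paper uses a learned aggregation network; the code below replaces it with mean pooling and a fixed mixing weight `alpha`, both of which are assumptions made only for illustration, and it takes precomputed embeddings as input instead of calling any real encoder or diffusion model.

```python
import numpy as np

def l2_normalize(x):
    x = np.asarray(x, dtype=np.float64)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def fused_scores(text_emb, generated_img_embs, gallery_embs, alpha=0.5):
    """Score gallery images against a text query and its generated images.

    text_emb:           (D,)   embedding of the text query
    generated_img_embs: (G, D) embeddings of images generated from the query
    gallery_embs:       (N, D) embeddings of the images to retrieve from
    alpha:              hypothetical fixed weight of the text-to-image score
    """
    gallery = l2_normalize(gallery_embs)
    text_sim = gallery @ l2_normalize(text_emb)
    # Stand-in for the learned aggregation: average the generated images.
    visual_query = l2_normalize(l2_normalize(generated_img_embs).mean(axis=0))
    image_sim = gallery @ visual_query
    return alpha * text_sim + (1.0 - alpha) * image_sim
```

Ranking the gallery by the returned scores in descending order gives the retrieval result; setting `alpha=1.0` recovers text-only retrieval.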

[CV-227] Waste-Bench: A Comprehensive Benchmark for Evaluating VLLMs in Cluttered Environments

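[Quick read]: This paper addresses the weak performance of current vision large language models (VLLMs) on waste classification in cluttered environments with deformed objects. Existing studies mostly focus on standard natural images, whereas real-world waste appears against cluttered backgrounds and in irregular shapes, which places higher demands on robustness and accuracy. The solution is a new dataset designed for real-world scenarios, characterized by complex environments and deformed objects, together with a systematic evaluation approach that tests the robustness and accuracy of VLLMs, exposes their bottlenecks under challenging conditions, and points to directions for model improvement.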

Link: https://arxiv.org/abs/2509.00176
Authors: Muhammad Ali, Salman Khan
Affiliations: Mohamed Bin Zayed University of Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Recent advancements in Large Language Models (LLMs) have paved the way for Vision Large Language Models (VLLMs) capable of performing a wide range of visual understanding tasks. While these models have demonstrated impressive performance on standard natural images, their capabilities have not been thoroughly explored on cluttered datasets that feature complex environments and deformed objects. In this work, we introduce a novel dataset specifically designed for waste classification in real-world scenarios, characterized by complex environments and deformed objects. Along with this dataset, we present an in-depth evaluation approach to rigorously assess the robustness and accuracy of VLLMs. The introduced dataset and comprehensive analysis provide valuable insights into the performance of VLLMs under challenging conditions. Our findings highlight the critical need for further advancements in VLLM robustness so that these models perform better in complex environments. The dataset and code for our experiments will be made publicly available.

[CV-228] Self-supervised large-scale kidney abnormality detection in drug safety assessment studies

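[Quick read]: This paper addresses the cost and inefficiency of detecting kidney abnormalities in toxicologic pathology images during drug safety assessment: each study requires manual review of hundreds to thousands of whole-slide images (WSIs), most of which are normal. The solution is a large-scale self-supervised abnormality detection model that operates on features extracted from the UNI foundation model (FM). The self-supervised method achieves better-than-chance performance (AUC of 0.62 and a negative predictive value of 89%), and with further development could automatically rule out normal slides, reducing manual screening and the time and cost of drug development.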

Link: https://arxiv.org/abs/2509.00131
Authors: Ivan Slootweg, Natalia P. García-De-La-Puente, Geert Litjens, Salma Dammak
Affiliations: Radboud University Medical Center; Universitat Politècnica de València; Bigpicture Consortium
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Quantitative Methods (q-bio.QM)

Abstract:Kidney abnormality detection is required for all preclinical drug development. It involves a time-consuming and costly examination of hundreds to thousands of whole-slide images per drug safety study, most of which are normal, to detect any subtle changes indicating toxic effects. In this study, we present the first large-scale self-supervised abnormality detection model for kidney toxicologic pathology, spanning drug safety assessment studies from 158 compounds. We explore the complexity of kidney abnormality detection on this scale using features extracted from the UNI foundation model (FM) and show that a simple k-nearest neighbor classifier on these features performs at chance, demonstrating that the FM-generated features alone are insufficient for detecting abnormalities. We then demonstrate that a self-supervised method applied to the same features can achieve better-than-chance performance, with an area under the receiver operating characteristic curve of 0.62 and a negative predictive value of 89%. With further development, such a model can be used to rule out normal slides in drug safety assessment studies, reducing the costs and time associated with drug development.
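
Negative predictive value is the metric that matters for ruling out normal slides: it is the fraction of slides predicted normal that really are normal. A minimal sketch of the computation, with label 1 meaning abnormal and 0 meaning normal (a convention assumed here, not stated in the abstract):

```python
def negative_predictive_value(y_true, y_pred):
    """NPV = TN / (TN + FN), with 0 = normal (negative) and 1 = abnormal."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tn + fn == 0:
        raise ValueError("no negative predictions, NPV is undefined")
    return tn / (tn + fn)
```

A reported NPV of 89% therefore means that, of the slides the model flags as normal, roughly 11 in 100 would still contain an abnormality.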

[CV-229] Dual-Stage Global and Local Feature Framework for Image Dehazing

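[Quick read]: This paper addresses the performance drop in high-resolution image dehazing that occurs because global context and local detail are hard to combine effectively. Existing methods usually handle high-resolution inputs by downsampling or patch-wise processing, which noticeably degrades dehazing quality. The solution is a framework named Streamlined Global and Local Features Combinator (SGLC), built from two modules: the Global Features Generator (GFG) produces an initial dehazed output based on scene-level context, and the Local Features Enhancer (LFE) then refines pixel-level details and local structure, so that global appearance and local texture are optimized together. The design is model-agnostic and can be integrated into any dehazing network to improve visual fidelity on high-resolution images.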

Link: https://arxiv.org/abs/2509.00108
Authors: Anas M. Ali, Anis Koubaa, Bilel Benjdira
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Addressing the challenge of removing atmospheric fog or haze from digital images, known as image dehazing, has recently gained significant traction in the computer vision community. Although contemporary dehazing models have demonstrated promising performance, few have thoroughly investigated high-resolution imagery. In such scenarios, practitioners often resort to downsampling the input image or processing it in smaller patches, which leads to a notable performance degradation. This drop is primarily linked to the difficulty of effectively combining global contextual information with localized, fine-grained details as the spatial resolution grows. In this chapter, we propose a novel framework, termed the Streamlined Global and Local Features Combinator (SGLC), to bridge this gap and enable robust dehazing for high-resolution inputs. Our approach is composed of two principal components: the Global Features Generator (GFG) and the Local Features Enhancer (LFE). The GFG produces an initial dehazed output by focusing on broad contextual understanding of the scene. Subsequently, the LFE refines this preliminary output by enhancing localized details and pixel-level features, thereby capturing the interplay between global appearance and local structure. To evaluate the effectiveness of SGLC, we integrated it with the Uformer architecture, a state-of-the-art dehazing model. Experimental results on high-resolution datasets reveal a considerable improvement in peak signal-to-noise ratio (PSNR) when employing SGLC, indicating its potency in addressing haze in large-scale imagery. Moreover, the SGLC design is model-agnostic, allowing any dehazing network to be augmented with the proposed global-and-local feature fusion mechanism. Through this strategy, practitioners can harness both scene-level cues and granular details, significantly improving visual fidelity in high-resolution environments.
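
The abstract reports gains in peak signal-to-noise ratio. For reference, PSNR between a ground-truth image and a dehazed estimate is $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$; a minimal sketch follows, assuming 8-bit images (`max_value=255`), which is an assumption and not a detail given in the abstract.

```python
import numpy as np

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    if ref.shape != est.shape:
        raise ValueError("images must have the same shape")
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)
```

Higher is better, and because the scale is logarithmic, a gain of about 3 dB corresponds to halving the mean squared error.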

[CV-230] Progressive Element-wise Gradient Estimation for Neural Network Quantization

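[Quick read]: This paper addresses the accuracy degradation in Quantization-Aware Training (QAT) caused by the Straight-Through Estimator (STE): STE ignores the discretization error between continuous and quantized values, which hurts model performance, especially at extremely low bit-widths. The solution is Progressive Element-wise Gradient Estimation (PEGE), which progressively replaces full-precision weights and activations with their quantized counterparts through a logarithmic curriculum-driven mixed-precision replacement strategy, and formulates QAT as a co-optimization problem that simultaneously minimizes the task loss and the discretization error, improving the accuracy of low-precision models while keeping inference efficient.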

Link: https://arxiv.org/abs/2509.00097
Authors: Kaiqi Zhao
Affiliations: Oakland University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Neural network quantization aims to reduce the bit-widths of weights and activations, making it a critical technique for deploying deep neural networks on resource-constrained hardware. Most Quantization-Aware Training (QAT) methods rely on the Straight-Through Estimator (STE) to address the non-differentiability of discretization functions by replacing their derivatives with that of the identity function. While effective, STE overlooks discretization errors between continuous and quantized values, which can lead to accuracy degradation – especially at extremely low bit-widths. In this paper, we propose Progressive Element-wise Gradient Estimation (PEGE), a simple yet effective alternative to STE, which can be seamlessly integrated with any forward propagation methods and improves the quantized model accuracy. PEGE progressively replaces full-precision weights and activations with their quantized counterparts via a novel logarithmic curriculum-driven mixed-precision replacement strategy. Then it formulates QAT as a co-optimization problem that simultaneously minimizes the task loss for prediction and the discretization error for quantization, providing a unified and generalizable framework. Extensive experiments on CIFAR-10 and ImageNet across various architectures (e.g., ResNet, VGG) demonstrate that PEGE consistently outperforms existing backpropagation methods and enables low-precision models to match or even outperform the accuracy of their full-precision counterparts.
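
To make the STE baseline concrete, the sketch below pairs a uniform quantizer with its straight-through backward pass and exposes the discretization error that STE ignores. It illustrates the baseline only, not PEGE; the clipping range, the clipped variant of STE (zero gradient outside the range), and all function names are assumptions.

```python
import numpy as np

def quantize_uniform(x, bits=2, x_min=-1.0, x_max=1.0):
    """Forward pass: clip to [x_min, x_max] and round to 2**bits levels."""
    x = np.asarray(x, dtype=np.float64)
    step = (x_max - x_min) / (2 ** bits - 1)
    clipped = np.clip(x, x_min, x_max)
    return np.round((clipped - x_min) / step) * step + x_min

def ste_backward(x, grad_output, x_min=-1.0, x_max=1.0):
    """Backward pass of the (clipped) STE.

    The rounding step is treated as the identity, so the incoming gradient
    passes through unchanged inside the clipping range and is zero outside.
    """
    x = np.asarray(x, dtype=np.float64)
    inside = (x >= x_min) & (x <= x_max)
    return np.asarray(grad_output, dtype=np.float64) * inside

def discretization_error(x, **kwargs):
    """Gap between continuous and quantized values, which STE does not see."""
    return np.asarray(x, dtype=np.float64) - quantize_uniform(x, **kwargs)
```

At 2 bits the quantizer has only four levels, so the discretization error can be a large fraction of the weight itself, which is the regime the abstract highlights.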
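
As a rough illustration of the quantities this abstract contrasts, the sketch below shows uniform quantization, a clipped variant of the STE backward rule, and a combined objective that adds a discretization-error term to the task loss. It is a minimal sketch under stated assumptions, not the paper's PEGE method: the clipping range, the helper names, and the weighting factor `lam` are all hypothetical.

```python
def quantize(x, bits, lo=-1.0, hi=1.0):
    """Uniformly quantize x to 2**bits levels on [lo, hi] (the range is an assumption)."""
    steps = 2 ** bits - 1
    x = min(max(x, lo), hi)              # clip into the representable range
    step = (hi - lo) / steps
    return lo + round((x - lo) / step) * step


def ste_grad(upstream_grad, x, lo=-1.0, hi=1.0):
    """Clipped STE backward pass: act as the identity inside the clip range, zero outside."""
    return upstream_grad if lo <= x <= hi else 0.0


def discretization_error(weights, bits):
    """Mean squared gap between full-precision and quantized weights."""
    return sum((w - quantize(w, bits)) ** 2 for w in weights) / len(weights)


def co_objective(task_loss, weights, bits, lam=0.1):
    """Hypothetical joint objective: task loss plus lam times the discretization error."""
    return task_loss + lam * discretization_error(weights, bits)


weights = [0.3, -0.7, 0.05]
print(discretization_error(weights, bits=2))   # large gap at 2 bits
print(discretization_error(weights, bits=8))   # much smaller gap at 8 bits
```

The STE rule never sees the gap that `discretization_error` measures, which is the gap the abstract says grows at low bit-widths.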

[CV-231] Hybrid Perception and Equivariant Diffusion for Robust Multi-Node Rebar Tying

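Quick read: This paper addresses the automation of rebar tying, a repetitive but critical task in reinforced concrete construction. Manual tying carries ergonomic risk, and existing robotic systems struggle to estimate tying poses accurately at congested rebar nodes. The key to the solution is a hybrid approach that combines geometry-based perception with Equivariant Denoising Diffusion on SE(3) (Diffusion-EDFs). The perception module uses DBSCAN clustering, geometric feature extraction, and principal component analysis (PCA) to segment rebars, identify nodes, and estimate orientation vectors for sequential ranking in unstructured environments. The motion planner, built on Diffusion-EDFs and trained on only 5-10 demonstrations, generates sequential end-effector poses that optimize collision avoidance and tying efficiency. This substantially reduces data requirements while improving the robustness and adaptability of multi-node tying.

Link: https://arxiv.org/abs/2509.00065
Authors: Zhitao Wang,Yirong Xiong,Roberto Horowitz,Yanke Wang,Yuxing Han
Affiliations: Tsinghua University, Shenzhen International Graduate School; University of California, Berkeley; The Hong Kong University of Science and Technology
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by The IEEE International Conference on Automation Science and Engineering (CASE) 2025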

Abstract:Rebar tying is a repetitive but critical task in reinforced concrete construction, typically performed manually at considerable ergonomic risk. Recent advances in robotic manipulation hold the potential to automate the tying process, yet face challenges in accurately estimating tying poses in congested rebar nodes. In this paper, we introduce a hybrid perception and motion planning approach that integrates geometry-based perception with Equivariant Denoising Diffusion on SE(3) (Diffusion-EDFs) to enable robust multi-node rebar tying with minimal training data. Our perception module utilizes density-based clustering (DBSCAN), geometry-based node feature extraction, and principal component analysis (PCA) to segment rebar bars, identify rebar nodes, and estimate orientation vectors for sequential ranking, even in complex, unstructured environments. The motion planner, based on Diffusion-EDFs, is trained on as few as 5-10 demonstrations to generate sequential end-effector poses that optimize collision avoidance and tying efficiency. The proposed system is validated on various rebar meshes, including single-layer, multi-layer, and cluttered configurations, demonstrating high success rates in node detection and accurate sequential tying. Compared with conventional approaches that rely on large datasets or extensive manual parameter tuning, our method achieves robust, efficient, and adaptable multi-node tying while significantly reducing data requirements. This result underscores the potential of hybrid perception and diffusion-driven planning to enhance automation in on-site construction tasks, improving both safety and labor efficiency.

[CV-232] OpenTie: Open-vocabulary Sequential Rebar Tying System

Quick read: This paper addresses the automation of rebar tying on construction sites. Most existing robotic systems target flat rebar layouts and depend on model training, so they adapt poorly to varied 3D settings. The key to the solution is OpenTie, a training-free 3D rebar tying framework. It combines RGB-to-point-cloud generation for scene perception with open-vocabulary, prompt-based object detection, applied to images filtered by a custom post-processing procedure. Without task-specific training data, the system handles both horizontal and vertical rebar tying with high accuracy, and real-world experiments support its practical effectiveness.

Link: https://arxiv.org/abs/2509.00064
Authors: Mingze Liu,Sai Fan,Haozhen Li,Haobo Liang,Yixing Yuan,Yanke Wang
Affiliations: Hong Kong Center for Construction Robotics, The Hong Kong University of Science and Technology; College of Professional and Continuing Education, The Hong Kong Polytechnic University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: This article is under its initial revision

Abstract:Robotic practices on construction sites are attracting attention owing to their capability to tackle complex challenges, especially in rebar-involved scenarios. Most existing products and research focus mainly on flat rebar settings and require model training. To fill this gap, we propose OpenTie, a 3D training-free rebar tying framework utilizing RGB-to-point-cloud generation and open-vocabulary detection. We implement OpenTie via a robotic arm with a binocular camera and achieve high accuracy by applying the prompt-based object detection method to images filtered by our proposed post-processing procedure, which is based on an image-to-point-cloud generation framework. The system is flexible for horizontal and vertical rebar tying tasks, and experiments on real-world rebar settings verify the effectiveness of the system in practice.

[CV-233] Scaffold Diffusion: Sparse Multi-Category Voxel Structure Generation with Discrete Diffusion

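Quick read: This paper addresses the generation of sparse multi-category 3D voxel structures, where the main challenges are the cubic memory scaling of voxel grids and the severe class imbalance caused by sparsity. The key to the solution is Scaffold Diffusion, which treats voxels as tokens and uses a discrete diffusion language model to generate spatially coherent 3D voxel structures. Unlike prior baselines and an auto-regressive formulation, Scaffold Diffusion produces realistic and coherent structures even when trained on data with over 98% sparsity, which supports discrete diffusion as a framework for sparse 3D voxel generation.

Link: https://arxiv.org/abs/2509.00062
Authors: Justin Jung
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: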

Abstract:Generating realistic sparse multi-category 3D voxel structures is difficult due to the cubic memory scaling of voxel structures and moreover the significant class imbalance caused by sparsity. We introduce Scaffold Diffusion, a generative model designed for sparse multi-category 3D voxel structures. By treating voxels as tokens, Scaffold Diffusion uses a discrete diffusion language model to generate 3D voxel structures. We show that discrete diffusion language models can be extended beyond inherently sequential domains such as text to generate spatially coherent 3D structures. We evaluate on Minecraft house structures from the 3D-Craft dataset and demonstrate that, unlike prior baselines and an auto-regressive formulation, Scaffold Diffusion produces realistic and coherent structures even when trained on data with over 98% sparsity. We provide an interactive viewer where readers can visualize generated samples and the generation process. Our results highlight discrete diffusion as a promising framework for 3D sparse voxel generative modeling.
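
To make the "voxels as tokens" idea and the two stated difficulties concrete, the sketch below flattens a small category grid into a token sequence and measures its sparsity and class counts. It is a minimal sketch, not the paper's tokenization: the raster ordering, the use of `0` as the empty category, and all function names are assumptions.

```python
def voxels_to_tokens(grid):
    """Flatten a depth x height x width grid of category ids in raster order."""
    return [voxel for plane in grid for row in plane for voxel in row]


def sparsity(tokens, empty_id=0):
    """Fraction of tokens that are empty (empty_id = 0 is an assumption)."""
    return sum(1 for t in tokens if t == empty_id) / len(tokens)


def class_counts(tokens):
    """Count how often each category id occurs, exposing the class imbalance."""
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    return counts


# A 2 x 2 x 2 grid with a single occupied voxel of category 3.
grid = [[[0, 0], [0, 3]], [[0, 0], [0, 0]]]
tokens = voxels_to_tokens(grid)
print(len(tokens), sparsity(tokens), class_counts(tokens))

# Cubic scaling: a grid with side length 32 already yields 32**3 tokens.
print(32 ** 3)
```

Even in this toy grid the empty category dominates the token counts, which is the imbalance the abstract attributes to sparsity.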

[CV-234] From Data to Decision: A Multi-Stage Framework for Class Imbalance Mitigation in Optical Network Failure Analysis

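Quick read: This paper addresses severe class imbalance in machine-learning-based failure management for optical networks, where normal instances vastly outnumber failure cases and degrade failure detection and identification. The key contribution is a systematic comparison of pre-processing, in-processing, and post-processing methods on an experimental dataset, which shows that the best strategy depends on the scenario. For failure detection, post-processing by Threshold Adjustment yields the largest F1 improvement (up to 15.3%). For multi-class failure identification, generative AI (GenAI) methods perform best (gains of up to 24.2%). When class overlap is present and latency is critical, over-sampling methods such as SMOTE are most effective; without latency constraints, Meta-Learning performs best; in low-overlap scenarios, GenAI offers the highest performance with minimal inference time.

Link: https://arxiv.org/abs/2509.00057
Authors: Yousuf Moiz Ali,Jaroslaw E. Prilepsky,Nicola Sambo,Joao Pedro,Mohammad M. Hosseini,Antonio Napoli,Sergei K. Turitsyn,Pedro Freire
Affiliations: Aston Institute of Photonic Technologies, Aston University; Scuola Superiore Sant’Anna; Nokia; Instituto de Telecomunicações, Instituto Superior Técnico
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: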

Abstract:Machine learning-based failure management in optical networks has gained significant attention in recent years. However, severe class imbalance, where normal instances vastly outnumber failure cases, remains a considerable challenge. While pre- and in-processing techniques have been widely studied, post-processing methods are largely unexplored. In this work, we present a direct comparison of pre-, in-, and post-processing approaches for class imbalance mitigation in failure detection and identification using an experimental dataset. For failure detection, post-processing methods, particularly Threshold Adjustment, achieve the highest F1 score improvement (up to 15.3%), while Random Under-Sampling provides the fastest inference. In failure identification, GenAI methods deliver the most substantial performance gains (up to 24.2%), whereas post-processing shows limited impact in multi-class settings. When class overlap is present and latency is critical, over-sampling methods such as SMOTE are most effective; without latency constraints, Meta-Learning yields the best results. In low-overlap scenarios, Generative AI approaches provide the highest performance with minimal inference time.
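
Threshold adjustment, the post-processing step highlighted in this entry, can be sketched as a search for the decision threshold that maximizes F1 on held-out scores. This is a generic illustration, not the authors' procedure: the candidate grid, the function names, and the toy scores are assumptions.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive (failure) class, defined as 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, scores, candidates=None):
    """Return the (threshold, F1) pair with the highest F1 over a candidate grid."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]   # hypothetical grid
    best_t, best_f1 = 0.5, -1.0
    for t in candidates:
        preds = [1 if s >= t else 0 for s in scores]
        score = f1_score(y_true, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


# Imbalanced toy data: 8 normal instances, 2 failures with modest scores.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.1, 0.1, 0.15, 0.2, 0.2, 0.25, 0.3, 0.35, 0.45]
default_f1 = f1_score(y_true, [1 if s >= 0.5 else 0 for s in scores])
print(default_f1, best_threshold(y_true, scores))
```

With the default 0.5 cut-off the classifier in this toy case predicts no failures at all, while a lower threshold recovers both, without retraining the model.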

[CV-235] MESTI-MEGANet: Micro-expression Spatio-Temporal Image and Micro-expression Gradient Attention Networks for Micro-expression Recognition

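Quick read: This paper addresses micro-expression recognition (MER), where the subtle and fleeting nature of micro-expressions means that traditional input modalities such as Apex Frame, Optical Flow, and Dynamic Image fail to capture fine facial movements adequately. The solution has two key parts. First, the Micro-expression Spatio-Temporal Image (MESTI) transforms a video sequence into a single image while preserving the essential characteristics of micro-movements. Second, the Micro-expression Gradient Attention Network (MEGANet) uses a novel Gradient Attention block to strengthen the extraction of fine-grained motion features. Experiments show that MESTI outperforms existing input modalities, and that MEGANet combined with MESTI achieves state-of-the-art results on the CASMEII and SAMM datasets.

Link: https://arxiv.org/abs/2509.00056
Authors: Luu Tu Nguyen,Vu Tram Anh Khuong,Thanh Ha Le,Thi Duyen Ngo
Affiliations: VNU University of Engineering and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: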

Abstract:Micro-expression recognition (MER) is a challenging task due to the subtle and fleeting nature of micro-expressions. Traditional input modalities, such as Apex Frame, Optical Flow, and Dynamic Image, often fail to adequately capture these brief facial movements, resulting in suboptimal performance. In this study, we introduce the Micro-expression Spatio-Temporal Image (MESTI), a novel dynamic input modality that transforms a video sequence into a single image while preserving the essential characteristics of micro-movements. Additionally, we present the Micro-expression Gradient Attention Network (MEGANet), which incorporates a novel Gradient Attention block to enhance the extraction of fine-grained motion features from micro-expressions. By combining MESTI and MEGANet, we aim to establish a more effective approach to MER. Extensive experiments were conducted to evaluate the effectiveness of MESTI, comparing it with existing input modalities across three CNN architectures (VGG19, ResNet50, and EfficientNetB0). Moreover, we demonstrate that replacing the input of previously published MER networks with MESTI leads to consistent performance improvements. The performance of MEGANet, both with MESTI and Dynamic Image, is also evaluated, showing that our proposed network achieves state-of-the-art results on the CASMEII and SAMM datasets. The combination of MEGANet and MESTI achieves the highest accuracy reported to date, setting a new benchmark for micro-expression recognition. These findings underscore the potential of MESTI as a superior input modality and MEGANet as an advanced recognition network, paving the way for more effective MER systems in a variety of applications.

[CV-236] Lightning Fast Caching-based Parallel Denoising Prediction for Accelerating Talking Head Generation

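Quick read: This paper addresses the slow inference of diffusion-based talking head models, a bottleneck that limits practical use. Existing acceleration methods for general diffusion models do not exploit the temporal and spatial redundancies specific to talking head videos. The solution rests on two innovations. First, Lightning-fast Caching-based Parallel denoising prediction (LightningCP) caches static features so that most model layers can be bypassed at inference time, and predicts in parallel from the cached features and estimated noisy latents, which avoids sequential sampling. Second, Decoupled Foreground Attention (DFA) exploits the spatial decoupling in talking head videos to restrict attention to dynamic foreground regions, and reference features are removed in certain layers for additional speedup.

Link: https://arxiv.org/abs/2509.00052
Authors: Jianzhi Long,Wenhao Sun,Rongcheng Tu,Dacheng Tao
Affiliations: Unknown
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: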

Abstract:Diffusion-based talking head models generate high-quality, photorealistic videos but suffer from slow inference, limiting practical applications. Existing acceleration methods for general diffusion models fail to exploit the temporal and spatial redundancies unique to talking head generation. In this paper, we propose a task-specific framework addressing these inefficiencies through two key innovations. First, we introduce Lightning-fast Caching-based Parallel denoising prediction (LightningCP), caching static features to bypass most model layers in inference time. We also enable parallel prediction using cached features and estimated noisy latents as inputs, efficiently bypassing sequential sampling. Second, we propose Decoupled Foreground Attention (DFA) to further accelerate attention computations, exploiting the spatial decoupling in talking head videos to restrict attention to dynamic foreground regions. Additionally, we remove reference features in certain layers to bring extra speedup. Extensive experiments demonstrate that our framework significantly improves inference speed while preserving video quality.

[CV-237] Performance is not All You Need: Sustainability Considerations for Algorithms

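Quick read: This paper addresses the high carbon emissions of deep learning model training, with the core challenge of balancing algorithm performance against energy consumption. The key to the solution is a two-dimensional sustainability evaluation system with two quantitative indicators. The sustainable harmonic mean (FMS) combines accumulated energy consumption and performance through a harmonic mean to reveal algorithm performance per unit of energy. The area under the sustainability curve (ASC) builds a performance versus power consumption curve to characterize energy efficiency over the whole training cycle. The framework is validated on a range of multimodal tasks and provides a quantitative basis for moving green AI research from theory to practice.

Link: https://arxiv.org/abs/2509.00045
Authors: Xiang Li,Chong Zhang,Hongpeng Wang,Shreyank Narayana Gowda,Yushi Li,Xiaobo Jin
Affiliations: The Chinese University of Hong Kong; Xi’an Jiaotong-Liverpool University; University of Nottingham; University of Sydney
Categories: Computer Vision and Pattern Recognition (cs.CV); Performance (cs.PF)
Comments: 18 pages, 6 figures. Accepted at the Chinese Conference on Pattern Recognition and Computer Vision 2025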

Abstract:This work focuses on the high carbon emissions generated by deep learning model training, specifically addressing the core challenge of balancing algorithm performance and energy consumption. It proposes an innovative two-dimensional sustainability evaluation system. Different from the traditional single performance-oriented evaluation paradigm, this study pioneered two quantitative indicators that integrate energy efficiency ratio and accuracy: the sustainable harmonic mean (FMS) integrates accumulated energy consumption and performance parameters through the harmonic mean to reveal the algorithm performance under unit energy consumption; the area under the sustainability curve (ASC) constructs a performance-power consumption curve to characterize the energy efficiency characteristics of the algorithm throughout the cycle. To verify the universality of the indicator system, the study constructed benchmarks in various multimodal tasks, including image classification, segmentation, pose estimation, and batch and online learning. Experiments demonstrate that the system can provide a quantitative basis for evaluating cross-task algorithms and promote the transition of green AI research from theory to practice. Our sustainability evaluation framework code can be found here, providing methodological support for the industry to establish algorithm energy efficiency standards.
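
The abstract does not give the formulas behind FMS and ASC, so the following is only a sketch of the two general ideas it names: a harmonic mean that balances performance against an energy-derived score, and the area under a performance versus cumulative-energy curve. The normalization in `efficiency`, the energy budget, and every function name are assumptions made for illustration and should not be read as the paper's definitions.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative scores; it is low if either score is low."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


def efficiency(energy_kwh, budget_kwh):
    """Hypothetical normalization: 1.0 at zero energy, 0.0 at or above the budget."""
    return max(0.0, 1.0 - energy_kwh / budget_kwh)


def area_under_curve(energy, performance):
    """Trapezoidal area under performance vs. cumulative energy, divided by the energy span."""
    area = 0.0
    for i in range(1, len(energy)):
        width = energy[i] - energy[i - 1]
        area += width * (performance[i] + performance[i - 1]) / 2
    return area / (energy[-1] - energy[0])


# Toy training run: accuracy checkpoints against cumulative energy in kWh.
energy = [0.0, 1.0, 2.0]
accuracy = [0.0, 0.5, 1.0]
print(harmonic_mean(accuracy[-1], efficiency(energy[-1], budget_kwh=8.0)))
print(area_under_curve(energy, accuracy))
```

A model that reaches high accuracy early in its energy budget scores well on both quantities, while one that needs most of the budget is penalized by the harmonic mean.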

[CV-238] ARTPS: Depth-Enhanced Hybrid Anomaly Detection and Learnable Curiosity Score for Autonomous Rover Target Prioritization

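Quick read: This paper addresses target prioritization for autonomous exploration of planetary surfaces, namely how to efficiently identify and prioritize regions of scientific value or anomalous character in complex, varied terrain. The key to the solution is ARTPS (Autonomous Rover Target Prioritization System), a hybrid AI system that combines monocular depth estimation using Vision Transformers, multi-component anomaly detection, and a learnable curiosity score. The score is a weighted combination of known value, anomaly signals, depth variance, and surface roughness, and is used to rank candidate targets. On Mars rover datasets the system reaches an AUROC of 0.94, an AUPRC of 0.89, and an F1-Score of 0.87, and the hybrid fusion approach reduces false positives by 23% while maintaining high detection sensitivity.

Link: https://arxiv.org/abs/2509.00042
Authors: Poyraz Baydemir
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, 12 figures, 4 tables, autonomous exploration, Mars rover, computer vision, anomaly detection, depth estimation, curiosity-driven exploration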

Abstract:We present ARTPS (Autonomous Rover Target Prioritization System), a novel hybrid AI system that combines depth estimation, anomaly detection, and learnable curiosity scoring for autonomous exploration of planetary surfaces. Our approach integrates monocular depth estimation using Vision Transformers with multi-component anomaly detection and a weighted curiosity score that balances known value, anomaly signals, depth variance, and surface roughness. The system achieves state-of-the-art performance with AUROC of 0.94, AUPRC of 0.89, and F1-Score of 0.87 on Mars rover datasets. We demonstrate significant improvements in target prioritization accuracy through ablation studies and provide comprehensive analysis of component contributions. The hybrid fusion approach reduces false positives by 23% while maintaining high detection sensitivity across diverse terrain types.
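
The weighted curiosity score described in this entry can be sketched as a weighted sum over the four named components, followed by a ranking of candidate targets. The weights below are fixed, made-up values (the paper describes the score as learnable), and the target names, feature values, and function names are hypothetical.

```python
def curiosity_score(features, weights):
    """Weighted sum of the per-target components named in the abstract."""
    return sum(weights[name] * features[name] for name in weights)


def rank_targets(targets, weights):
    """Sort target names from highest to lowest curiosity score."""
    return sorted(targets, key=lambda name: curiosity_score(targets[name], weights), reverse=True)


# Hypothetical weights; in a learnable setting these would be fitted, not hand-set.
weights = {"known_value": 0.4, "anomaly": 0.3, "depth_variance": 0.2, "roughness": 0.1}

# Hypothetical candidate targets with components already scaled to [0, 1].
targets = {
    "rock_a": {"known_value": 0.2, "anomaly": 0.9, "depth_variance": 0.5, "roughness": 0.4},
    "rock_b": {"known_value": 0.6, "anomaly": 0.1, "depth_variance": 0.2, "roughness": 0.3},
}
print(rank_targets(targets, weights))
```

In this toy case the strongly anomalous target outranks the one with higher known value, which is the kind of trade-off a balanced score is meant to expose.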

[CV-239] AMMKD: Adaptive Multimodal Multi-teacher Distillation for Lightweight Vision-Language Models

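Quick read: This paper addresses the limited deployment of large-scale visual language pretraining (VLP) models on mobile devices, where the main bottlenecks are large model size and high computational complexity. The key to the solution is the Adaptive Multi-Modal Multi-Teacher Knowledge Distillation (AMMKD) framework, whose main components are: 1) a multi-modal feature fusion network that extracts and merges discriminative features from the image and text modalities; 2) a multi-teacher knowledge distillation scheme that pre-trains two CLIP teacher models and pre-computes and stores text features as class vectors for efficiency; 3) KL divergence (called "KL scatter" in the abstract) for probability distribution matching, which improves the alignment of teacher and student outputs; 4) an adaptive dynamic weighting scheme that treats multi-teacher distillation as a multi-objective optimization problem and adjusts each teacher's influence according to gradient-space diversity, reducing conflicts and guiding the student toward better learning directions. The approach significantly reduces model complexity while maintaining strong retrieval performance.

Link: https://arxiv.org/abs/2509.00039
Authors: Yuqi Li,Chuanguang Yang,Junhao Dong,Zhengtao Yao,Haoyan Xu,Zeyu Dong,Hansheng Zeng,Zhulin An,Yingli Tian
Affiliations: University of Science and Technology of China; Tsinghua University; Alibaba Group; Huawei Technologies Co., Ltd.; City University of New York
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages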

Abstract:The success of large-scale visual language pretraining (VLP) models has driven widespread adoption of image-text retrieval tasks. However, their deployment on mobile devices remains limited due to large model sizes and computational complexity. We propose Adaptive Multi-Modal Multi-Teacher Knowledge Distillation (AMMKD), a novel framework that integrates multi-modal feature fusion, multi-teacher distillation, and adaptive optimization to deliver lightweight yet effective retrieval models. Specifically, our method begins with a feature fusion network that extracts and merges discriminative features from both the image and text modalities. To reduce model parameters and further improve performance, we design a multi-teacher knowledge distillation framework to pre-train two CLIP teacher models. We decouple modalities by pre-computing and storing text features as class vectors via the teacher text encoder to enhance efficiency. To better align teacher and student outputs, we apply KL scatter for probability distribution matching. Finally, we design an adaptive dynamic weighting scheme that treats multi-teacher distillation as a multi-objective optimization problem. By leveraging gradient space diversity, we dynamically adjust the influence of each teacher, reducing conflicts and guiding the student toward more optimal learning directions. Extensive experiments on three benchmark datasets demonstrate that AMMKD achieves superior performance while significantly reducing model complexity, validating its effectiveness and flexibility.
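
The distribution-matching step in this entry can be illustrated with a temperature-softened KL divergence between each teacher and the student, combined with per-teacher weights. This is a generic distillation sketch, not AMMKD itself: the weights here are fixed constants, whereas the paper derives them adaptively from gradient-space diversity, and the temperature, logits, and function names are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def multi_teacher_kd_loss(student_logits, teacher_logits_list, teacher_weights, temperature=2.0):
    """Weighted sum of KL(teacher || student) over all teachers (weights are fixed here)."""
    q = softmax(student_logits, temperature)
    loss = 0.0
    for weight, teacher_logits in zip(teacher_weights, teacher_logits_list):
        p = softmax(teacher_logits, temperature)
        loss += weight * kl_divergence(p, q)
    return loss


student = [1.0, 0.5, -0.5]
teachers = [[2.0, 0.1, -1.0], [1.5, 0.8, -0.7]]   # two hypothetical teachers
print(multi_teacher_kd_loss(student, teachers, teacher_weights=[0.6, 0.4]))
```

The loss is zero only when the student's softened distribution matches every teacher with non-zero weight, so changing the weights changes which teacher the student is pulled toward.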

[CV-240] A-FloPS: Accelerating Diffusion Sampling with Adaptive Flow Path Sampler

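Quick read: This paper addresses the high computational cost of diffusion models, which stems from their inherently iterative sampling process. Existing training-free acceleration methods usually only improve the numerical solver for the reverse-time ordinary differential equation (ODE), so their gains are limited by the inefficiency of the underlying sampling trajectories. The key to the solution is A-FloPS (Adaptive Flow Path Sampler), with two core ideas: 1) reparameterizing the sampling trajectory of any pre-trained diffusion model into a flow-matching form, using an analytical mapping from diffusion scores to flow-compatible velocities, which yields integration-friendly trajectories without retraining; 2) an adaptive velocity decomposition that factorizes the velocity field into a linear drift term and a residual component and actively suppresses the temporal variation of the residual, so that high-order integration remains accurate even at an extremely low number of function evaluations (NFE). On conditional image generation and text-to-image synthesis, the method outperforms state-of-the-art training-free samplers, achieving substantially lower FID and sharper, more coherent images with as few as 5 function evaluations.

Link: https://arxiv.org/abs/2509.00036
Authors: Cheng Jin,Zhenyu Xiao,Yuantao Gu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 9 figures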

Abstract:Diffusion models deliver state-of-the-art generative performance across diverse modalities but remain computationally expensive due to their inherently iterative sampling process. Existing training-free acceleration methods typically improve numerical solvers for the reverse-time ODE, yet their effectiveness is fundamentally constrained by the inefficiency of the underlying sampling trajectories. We propose A-FloPS (Adaptive Flow Path Sampler), a principled, training-free framework that reparameterizes the sampling trajectory of any pre-trained diffusion model into a flow-matching form and augments it with an adaptive velocity decomposition. The reparameterization analytically maps diffusion scores to flow-compatible velocities, yielding integration-friendly trajectories without retraining. The adaptive mechanism further factorizes the velocity field into a linear drift term and a residual component whose temporal variation is actively suppressed, restoring the accuracy benefits of high-order integration even in extremely low-NFE regimes. Extensive experiments on conditional image generation and text-to-image synthesis show that A-FloPS consistently outperforms state-of-the-art training-free samplers in both sample quality and efficiency. Notably, with as few as 5 function evaluations, A-FloPS achieves substantially lower FID and generates sharper, more coherent images. The adaptive mechanism also improves native flow-based generative models, underscoring its generality. These results position A-FloPS as a versatile and effective solution for high-quality, low-latency generative modeling.

[CV-241] Deep Learning-Driven Multimodal Detection and Movement Analysis of Objects in Culinary

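Quick read: This paper addresses automated cooking guidance in kitchen scenes: fusing multimodal data to cover ingredient recognition, motion capture, and speech-to-text, and then generating a structured step-by-step cooking guide. The key to the solution is a pipeline in which a YOLOv8 segmentation model locates ingredients and utensils, an LSTM models hand keypoint motion sequences to interpret actions, an ASR model (whisper-base) transcribes speech, and the combined output is passed to the TinyLLaMa language model, which predicts the recipe and generates the text. The result is a task-specific system intended for complex and challenging kitchen environments.

Link: https://arxiv.org/abs/2509.00033
Authors: Tahoshin Alam Ishat
Affiliations: North South University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 9 figures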

Abstract:This research explores existing models and fine-tunes them, combining a YOLOv8 segmentation model, an LSTM model trained on hand point motion sequences, and an ASR model (whisper-base) to extract enough data for an LLM (TinyLLaMa) to predict the recipe and generate a step-by-step text guide to the cooking procedure. All the data were gathered by the author to build a robust, task-specific system that performs well in complex and challenging environments, demonstrating the broad applicability of computer vision to daily activities such as kitchen work. This work opens the way to many more important everyday tasks.

[CV-242] IPG: Incremental Patch Generation for Generalized Adversarial Patch Training

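Quick read: This paper addresses the threat that adversarial patches pose to the robustness of AI models, particularly in computer vision tasks such as object detection. Unlike traditional adversarial examples, these patches target specific image regions and cause models to malfunction. The key to the solution is Incremental Patch Generation (IPG), which generates adversarial patches up to 11.1 times more efficiently than existing approaches while maintaining comparable attack performance. Feature distribution visualization and adversarial training results indicate that the generated patches generalize well and cover a broader range of model vulnerabilities, so IPG-generated datasets can serve as a knowledge foundation for building robust models and proactive defenses.

Link: https://arxiv.org/abs/2508.10946
Authors: Wonho Lee,Hyunsik Na,Jisu Lee,Daeseon Choi
Affiliations: Soongsil University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: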

Abstract:The advent of adversarial patches poses a significant challenge to the robustness of AI models, particularly in the domain of computer vision tasks such as object detection. In contradistinction to traditional adversarial examples, these patches target specific regions of an image, resulting in the malfunction of AI models. This paper proposes Incremental Patch Generation (IPG), a method that generates adversarial patches up to 11.1 times more efficiently than existing approaches while maintaining comparable attack performance. The efficacy of IPG is demonstrated by experiments and ablation studies including YOLO’s feature distribution visualization and adversarial training results, which show that it produces well-generalized patches that effectively cover a broader range of model vulnerabilities. Furthermore, IPG-generated datasets can serve as a robust knowledge foundation for constructing a robust model, enabling structured representation, advanced reasoning, and proactive defenses in AI security ecosystems. The findings of this study suggest that IPG has considerable potential for future utilization not only in adversarial patch defense but also in real-world applications such as autonomous vehicles, security systems, and medical imaging, where AI models must remain resilient to adversarial attacks in dynamic and high-stakes environments.

[CV-243] DCA: Graph-Guided Deep Embedding Clustering for Brain Atlases

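Quick read: This paper addresses the limitation that most existing brain atlases are predefined, group-level templates with limited flexibility and resolution, which makes them a poor fit for individualized neuroimaging analysis. The key to the solution is Deep Cluster Atlas (DCA), a graph-guided deep embedding clustering framework that combines a pretrained autoencoder with spatially regularized deep clustering to generate individualized, voxel-wise brain parcellations that are functionally coherent and spatially contiguous. DCA supports flexible control over resolution and anatomical scope and generalizes to arbitrary brain structures. Across multiple large-scale fMRI datasets it outperforms state-of-the-art atlases, improving functional homogeneity by 98.8% and the silhouette coefficient by 29%, and it performs better in downstream tasks such as autism diagnosis and cognitive decoding.

Link: https://arxiv.org/abs/2509.01426
Authors: Mo Wang,Kaining Peng,Jingsheng Tang,Hongkai Wen,Quanying Liu
Affiliations: SUSTech; University of Warwick
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: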

Abstract:Brain atlases are essential for reducing the dimensionality of neuroimaging data and enabling interpretable analysis. However, most existing atlases are predefined, group-level templates with limited flexibility and resolution. We present Deep Cluster Atlas (DCA), a graph-guided deep embedding clustering framework for generating individualized, voxel-wise brain parcellations. DCA combines a pretrained autoencoder with spatially regularized deep clustering to produce functionally coherent and spatially contiguous regions. Our method supports flexible control over resolution and anatomical scope, and generalizes to arbitrary brain structures. We further introduce a standardized benchmarking platform for atlas evaluation, using multiple large-scale fMRI datasets. Across multiple datasets and scales, DCA outperforms state-of-the-art atlases, improving functional homogeneity by 98.8% and silhouette coefficient by 29%, and achieves superior performance in downstream tasks such as autism diagnosis and cognitive decoding. Codes and models will be released soon.

[CV-244] Automatic Screening of Parkinson's Disease from Visual Explorations

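Quick read: This paper addresses early automatic screening of Parkinson's Disease (PD) through non-invasive detection based on gaze behavior during visual exploration. The key to the solution is a methodology that combines classic fixation/saccade oculomotor features (e.g., saccade count, fixation duration, scanned area) with new features derived from gaze clusters, together with a Mixture of Experts ensemble that integrates outputs across visual exploration tests and both eyes. The ensemble outperforms individual classifiers and reaches an area under the receiver operating characteristic curve (AUC) of 0.95 on a held-out test set.

Link: https://arxiv.org/abs/2509.01326
Authors: Maria F. Alcala-Durand,J. Camilo Puerta-Acevedo,Julián D. Arias-Londoño,Juan I. Godino-Llorente
Affiliations: Universidad Politécnica de Madrid
Categories: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 22 pages, 11 figures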

Abstract:Eye movements can reveal early signs of neurodegeneration, including those associated with Parkinson’s Disease (PD). This work investigates the utility of a set of gaze-based features for the automatic screening of PD from different visual exploration tasks. For this purpose, a novel methodology is introduced, combining classic fixation/saccade oculomotor features (e.g., saccade count, fixation duration, scanned area) with features derived from gaze clusters (i.e., regions with a considerable accumulation of fixations). These features are automatically extracted from six exploration tests and evaluated using different machine learning classifiers. A Mixture of Experts ensemble is used to integrate outputs across tests and both eyes. Results show that ensemble models outperform individual classifiers, achieving an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.95 on a held-out test set. The findings support visual exploration as a non-invasive tool for early automatic screening of PD.
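
The AUC figure reported in this entry has a simple pairwise interpretation, sketched below: it is the fraction of (patient, control) pairs in which the patient receives the higher score. The second helper averages scores across tests as a crude stand-in for combining several experts; it is not the paper's Mixture of Experts, and the scores, labels, and function names are hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count as half)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def average_ensemble(per_test_scores):
    """Average each subject's score across tests (a simple substitute for a learned gate)."""
    count = len(per_test_scores)
    return [sum(column) / count for column in zip(*per_test_scores)]


labels = [1, 1, 0, 0]                 # 1 = PD, 0 = control (toy data)
test_a = [0.9, 0.4, 0.5, 0.1]         # scores from one exploration test
test_b = [0.7, 0.8, 0.2, 0.3]         # scores from another test
print(roc_auc(labels, test_a), roc_auc(labels, average_ensemble([test_a, test_b])))
```

In this toy case each test misranks a different subject, so the averaged scores separate the groups better than the first test alone.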

[CV-245] Learn2Reg 2024: New Benchmark Datasets Driving Progress on New Challenges

Quick read: This paper addresses incomplete benchmarking in medical image registration, in particular the limited modality diversity and task complexity of earlier challenge editions. The 2024 Learn2Reg challenge therefore introduces three new tasks, including large-scale multi-modal registration, unsupervised inter-subject brain registration, and the first microscopy-focused benchmark within Learn2Reg. The key point is that the richer datasets stimulated new method developments, including invertibility constraints, pyramid features, keypoint alignment, and instance optimisation, which improve registration performance and advance the field.

Link: https://arxiv.org/abs/2509.01217
Authors: Lasse Hansen,Wiebke Heyer,Christoph Großbröhmer,Frederic Madesta,Thilo Sentker,Wang Jiazheng,Yuxi Zhang,Hang Zhang,Min Liu,Junyi Wang,Xi Zhu,Yuhua Li,Liwen Wang,Daniil Morozov,Nazim Haouchine,Joel Honkamaa,Pekka Marttinen,Yichao Zhou,Zuopeng Tan,Zhuoyuan Wang,Yi Wang,Hongchao Zhou,Shunbo Hu,Yi Zhang,Qian Tao,Lukas Förner,Thomas Wendler,Bailiang Jian,Christian Wachinger,Jin Kim,Dan Ruan,Marek Wodzinski,Henning Müller,Tony C.W. Mok,Xi Jia,Mikael Brudfors,Seyed-Ahmad Ahmadi,Yunzheng Zhu,William Hsu,Tina Kapur,William M. Wells,Alexandra Golby,Aaron Carass,Harrison Bai,Yihao Liu,Perrine Paul-Gilloteaux,Joakim Lindblad,Nataša Sladoje,Andreas Walter,Junyu Chen,Reuben Dorent,Alessa Hering,Mattias P. Heinrich
Affiliations: University of Zurich; ETH Zurich; Charité – Universitätsmedizin Berlin; German Cancer Research Center; University of Oxford; Tsinghua University; Fudan University; Zhejiang University; Harvard Medical School; Max Planck Institute for Intelligent Systems; University of Helsinki; Peking University; Shanghai Jiao Tong University; National University of Singapore; University of Toronto; Ludwig-Maximilians-Universität München; University of Bonn; Seoul National University; University of Basel; University of Geneva; University of Lyon; University of Copenhagen; University of Hong Kong; University College London; Karolinska Institutet; University of Gothenburg; University of Southern Denmark; University of California, San Francisco; Stanford University; Massachusetts General Hospital; University of Pennsylvania; University of Paris; Lund University; KTH Royal Institute of Technology; University of Cambridge
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: submitted to MELBA Journal

Abstract:Medical image registration is critical for clinical applications, and fair benchmarking of different methods is essential for monitoring ongoing progress. To date, the Learn2Reg 2020-2023 challenges have released several complementary datasets and established metrics for evaluations. However, these editions did not capture all aspects of the registration problem, particularly in terms of modality diversity and task complexity. To address these limitations, the 2024 edition introduces three new tasks, including large-scale multi-modal registration and unsupervised inter-subject brain registration, as well as the first microscopy-focused benchmark within Learn2Reg. The new datasets also inspired new method developments, including invertibility constraints, pyramid features, keypoints alignment and instance optimisation.

[CV-246] Ultrasound-based detection and malignancy prediction of breast lesions eligible for biopsy: A multi-center clinical-scenario study using nomograms large language models and radiologist evaluation

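Quick read: This paper addresses the limited accuracy of biopsy decisions and malignancy prediction in breast lesion imaging, and in particular the limited generalization of models across ultrasound platforms and populations. The key to the solution is an externally validated ultrasound nomogram that fuses BIRADS features with quantitative morphometric features through logistic regression. In pooled analysis, the fused nomogram outperformed the morphometric nomogram, three radiologists, and two ChatGPT large language model (LLM) variants, reaching the highest accuracy for biopsy recommendation (83.0%) and malignancy prediction (83.8%). It is interpretable and generalized robustly across ultrasound platforms and populations in external validation, which makes it a candidate tool for reducing unnecessary biopsies and supporting personalized decisions in breast imaging.

Link: https://arxiv.org/abs/2509.00946
Authors: Ali Abbasian Ardakani,Afshin Mohammadi,Taha Yusuf Kuzan,Beyza Nur Kuzan,Hamid Khorshidi,Ashkan Ghorbani,Alisa Mohebbi,Fariborz Faeghi,Sepideh Hatamikia,U Rajendra Acharya
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 38 pages, 8 figures, 12 tables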

Abstract:This study aimed to develop and externally validate integrated ultrasound nomograms combining BIRADS features and quantitative morphometric characteristics, and to compare their performance with expert radiologists and state-of-the-art large language models in biopsy recommendation and malignancy prediction for breast lesions. In this retrospective multicenter, multinational study, 1747 women with pathologically confirmed breast lesions underwent ultrasound across three centers in Iran and Turkey. A total of 10 BIRADS and 26 morphological features were extracted from each lesion. BIRADS, morphometric, and fused nomograms, the last integrating both feature sets, were constructed via logistic regression. Three radiologists (one senior, two general) and two ChatGPT variants independently interpreted deidentified breast lesion images. Diagnostic performance for biopsy recommendation (BIRADS 4,5) and malignancy prediction was assessed in internal and two external validation cohorts. In pooled analysis, the fused nomogram achieved the highest accuracy for biopsy recommendation (83.0%) and malignancy prediction (83.8%), outperforming the morphometric nomogram, three radiologists and both ChatGPT models. Its AUCs were 0.901 and 0.853 for the two tasks, respectively. In addition, the performance of the BIRADS nomogram was significantly higher than that of the morphometric nomogram, three radiologists and both ChatGPT models for biopsy recommendation and malignancy prediction. External validation confirmed the robust generalizability across different ultrasound platforms and populations. An integrated BIRADS-morphometric nomogram consistently outperforms standalone models, LLMs, and radiologists in guiding biopsy decisions and predicting malignancy. These interpretable, externally validated tools have the potential to reduce unnecessary biopsies and enhance personalized decision-making in breast imaging.

[CV-247] Protocol for Clustering 4DSTEM Data for Phase Differentiation in Glasses

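Quick read: This paper addresses the difficulty of resolving nanoscale compositional and structural heterogeneity in phase-change materials (PCMs) with conventional techniques. The key to the solution is applying unsupervised machine learning to four-dimensional scanning transmission electron microscopy (4D-STEM) data: principal component analysis (PCA) for preprocessing and dimensionality reduction, t-SNE and UMAP for cluster validation, and k-means clustering optimized by silhouette score. The analysis identifies four clusters with distinct chemical and structural signatures, mapping the relationship between local chemical composition and structure in Ge-Sb-Te.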

Link: https://arxiv.org/abs/2509.00943
Authors: Mridul Kumar, Yevgeny Rakita
Affiliations: Ben-Gurion University of the Negev
Categories: Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Phase-change materials (PCMs) such as Ge-Sb-Te alloys are widely used in non-volatile memory applications due to their rapid and reversible switching between amorphous and crystalline states. However, their functional properties are strongly governed by nanoscale variations in composition and structure, which are challenging to resolve using conventional techniques. Here, we apply unsupervised machine learning to 4-dimensional scanning transmission electron microscopy (4D-STEM) data to identify compositional and structural heterogeneity in Ge-Sb-Te. After preprocessing and dimensionality reduction with principal component analysis (PCA), cluster validation was performed with t-SNE and UMAP, followed by k-means clustering optimized through silhouette scoring. Four distinct clusters were identified and mapped back to the diffraction data. Elemental intensity histograms revealed that chemical signatures change across clusters: oxygen and germanium enrichment in Cluster 1, tellurium in Cluster 2, antimony in Cluster 3, and germanium again in Cluster 4. Furthermore, averaged diffraction patterns from these clusters confirmed structural variations. Together, these findings demonstrate that clustering analysis can provide a powerful framework for correlating local chemical and structural features in PCMs, offering deeper insights into their intrinsic heterogeneity.
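
The model-selection step described in this abstract (k-means with the number of clusters chosen by silhouette score) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the toy 2-D points stand in for PCA-reduced diffraction patterns, the names `init_centers`, `kmeans`, `silhouette`, and `best_k` are hypothetical, and the farthest-point initialisation is chosen only to keep the run deterministic.

```python
import math


def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def init_centers(points, k):
    """Farthest-point initialisation (an assumption, used here for determinism)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, max_iter=100):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centers = init_centers(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; points in singleton clusters score 0."""
    clusters = set(labels)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(dist(p, q) for q, lab in zip(points, labels) if lab == c) / labels.count(c)
            for c in clusters if c != labels[i]
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# Toy data: three well-separated groups of four points each.
offsets = [(0, 0), (1, 0), (0, 1), (1, 1)]
points = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (10, 0), (0, 10)] for dx, dy in offsets]

scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On real 4D-STEM data the same loop would run over the PCA scores rather than raw coordinates, and a library implementation would normally replace these hand-written helpers.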

[CV-248] Towards Early Detection: AI-Based Five-Year Forecasting of Breast Cancer Risk Using Digital Breast Tomosynthesis Imaging MICCAI2025

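Quick read: This paper addresses the limited performance of current breast cancer risk prediction models and their failure to incorporate digital breast tomosynthesis (DBT) imaging. The key to the solution is a deep learning (DL) framework that extracts features directly from screening DBT images and combines them with a cumulative hazard layer to predict an individual patient's risk of developing breast cancer over the next five years. The method uses the Meta AI DINOv2 image encoder for feature extraction and is trained on a large dataset of 161,753 DBT examinations. On a held-out test set it reaches an AUROC of 0.80, indicating that DBT-based deep learning has strong potential to complement traditional risk assessment tools.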

Link: https://arxiv.org/abs/2509.00900
Authors: Manon A. Dorster, Felix J. Dorfner, Mason C. Cleveland, Melisa S. Guelen, Jay Patel, Dania Daye, Jean-Philippe Thiran, Albert E. Kim, Christopher P. Bridge
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Deep Breath Workshop, MICCAI 2025

Abstract:As early detection of breast cancer strongly favors successful therapeutic outcomes, there is major commercial interest in optimizing breast cancer screening. However, current risk prediction models achieve modest performance and do not incorporate digital breast tomosynthesis (DBT) imaging, which was FDA-approved for breast cancer screening in 2011. To address this unmet need, we present a deep learning (DL)-based framework capable of forecasting an individual patient’s 5-year breast cancer risk directly from screening DBT. Using an unparalleled dataset of 161,753 DBT examinations from 50,590 patients, we trained a risk predictor based on features extracted using the Meta AI DINOv2 image encoder, combined with a cumulative hazard layer, to assess a patient’s likelihood of developing breast cancer over five years. On a held-out test set, our best-performing model achieved an AUROC of 0.80 on predictions within 5 years. These findings reveal the high potential of DBT-based DL approaches to complement traditional risk assessment tools, and serve as a promising basis for additional investigation to validate and enhance our work.
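
A cumulative hazard layer keeps the predicted risk from decreasing as the time horizon grows. The sketch below shows one common additive form (a base logit plus non-negative yearly increments, passed through a sigmoid). Whether the paper uses exactly this form is an assumption, and `cumulative_risk` and its inputs are hypothetical.

```python
import math


def cumulative_risk(base_logit, yearly_hazard_logits):
    """Return risk estimates for year 1..N that never decrease with the horizon.

    Each yearly term is clipped at zero (ReLU) before being added, so the
    running logit, and therefore the sigmoid risk, is monotone in time.
    """
    risks = []
    running = base_logit
    for hazard in yearly_hazard_logits:
        running += max(hazard, 0.0)
        risks.append(1.0 / (1.0 + math.exp(-running)))
    return risks


# Hypothetical network outputs for one patient (not values from the paper).
risks = cumulative_risk(-3.0, [0.5, -1.0, 0.2, 0.0, 1.0])
print([round(r, 3) for r in risks])
```

The last entry plays the role of the 5-year risk; in a trained model the base logit and yearly terms would come from heads on top of the image features.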

[CV-249] Can General-Purpose Omnimodels Compete with Specialists? A Case Study in Medical Image Segmentation

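Quick read: This paper asks whether current general-purpose multimodal models (omnimodels) can match highly specialized models in knowledge-intensive domains such as medical image segmentation. The core question is whether, in high-stakes medical settings, a general-purpose model with cross-modal capabilities can match or exceed dedicated deep learning models under zero-shot conditions. The key to the approach is a comparative experiment that curates subsets of the "easiest" and "hardest" cases, ranked by the specialist models' accuracy, and systematically evaluates a state-of-the-art omnimodel (Gemini 2.5 Pro) against task-specific models on three medical image segmentation tasks (colon polyps, retinal vessels, breast tumors). The results show that the omnimodel is more robust on hard samples, outperforming the specialists there for polyp and breast tumor segmentation, while the specialist remains ahead on the fine-grained retinal vessel task. The findings point to a complementary role for omnimodels and specialist models, particularly for improving overall reliability on edge cases.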

Link: https://arxiv.org/abs/2509.00866
Authors: Yizhe Zhang, Qiang Chen, Tao Zhou
Affiliations: Nanjing University of Science and Technology
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 15 pages, 7 figures

Abstract:The emergence of powerful, general-purpose omnimodels capable of processing diverse data modalities has raised a critical question: can these "jack-of-all-trades" systems perform on par with highly specialized models in knowledge-intensive domains? This work investigates this question within the high-stakes field of medical image segmentation. We conduct a comparative study analyzing the zero-shot performance of a state-of-the-art omnimodel (Gemini 2.5 Pro, the "Nano Banana" model) against domain-specific deep learning models on three distinct tasks: polyp (endoscopy), retinal vessel (fundus), and breast tumor segmentation (ultrasound). Our study focuses on performance at the extremes by curating subsets of the "easiest" and "hardest" cases based on the specialist models’ accuracy. Our findings reveal a nuanced and task-dependent landscape. For polyp and breast tumor segmentation, specialist models excel on easy samples, but the omnimodel demonstrates greater robustness on hard samples where specialists fail catastrophically. Conversely, for the fine-grained task of retinal vessel segmentation, the specialist model maintains superior performance across both easy and hard cases. Intriguingly, qualitative analysis suggests omnimodels may possess higher sensitivity, identifying subtle anatomical features missed by human annotators. Our results indicate that while current omnimodels are not yet a universal replacement for specialists, their unique strengths suggest a potential complementary role with specialist models, particularly in enhancing robustness on challenging edge cases.
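
The "performance at the extremes" protocol can be illustrated with a short sketch: rank cases by the specialist's per-case score, keep the top and bottom `n`, and compare both models on each subset. The case IDs and scores below are made-up placeholders, and `curate_extremes` and `mean_score` are hypothetical helper names.

```python
def curate_extremes(specialist_scores, n):
    """Return (easiest, hardest) case IDs ranked by the specialist's per-case score."""
    ranked = sorted(specialist_scores, key=specialist_scores.get, reverse=True)
    easiest = ranked[:n]
    hardest = ranked[-n:][::-1]  # lowest-scoring case first
    return easiest, hardest


def mean_score(scores, case_ids):
    """Average score of one model over a subset of cases."""
    return sum(scores[c] for c in case_ids) / len(case_ids)


# Placeholder per-case segmentation scores (not results from the paper).
specialist = {"case_a": 0.95, "case_b": 0.10, "case_c": 0.80, "case_d": 0.40, "case_e": 0.02}
omnimodel = {"case_a": 0.70, "case_b": 0.45, "case_c": 0.65, "case_d": 0.50, "case_e": 0.30}

easiest, hardest = curate_extremes(specialist, n=2)
for name, subset in (("easiest", easiest), ("hardest", hardest)):
    print(name, subset, mean_score(specialist, subset), mean_score(omnimodel, subset))
```

Because the subsets are selected using the specialist's own scores, the hard subset is by construction unfavourable to the specialist, which is worth keeping in mind when reading such comparisons.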

[CV-250] Promptable Longitudinal Lesion Segmentation in Whole-Body CT

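Quick read: This paper addresses accurate lesion segmentation in longitudinal whole-body CT, in particular how to track individual lesions consistently across time points to support monitoring of disease progression and treatment response. The key to the solution is extending the LongiSeg framework with promptable capabilities, so that point prompts and mask prompts enable interactive lesion-level tracking, together with pretraining on a large-scale synthetic longitudinal CT dataset. Pretraining improves the model's use of longitudinal context, and experiments show gains of up to 6 Dice points over models trained from scratch.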

Link: https://arxiv.org/abs/2509.00613
Authors: Yannick Kirchhoff, Maximilian Rokuss, Fabian Isensee, Klaus H. Maier-Hein
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Accurate segmentation of lesions in longitudinal whole-body CT is essential for monitoring disease progression and treatment response. While automated methods benefit from incorporating longitudinal information, they remain limited in their ability to consistently track individual lesions across time. Task 2 of the autoPET/CT IV Challenge addresses this by providing lesion localizations and baseline delineations, framing the problem as longitudinal promptable segmentation. In this work, we extend the recently proposed LongiSeg framework with promptable capabilities, enabling lesion-specific tracking through point and mask interactions. To address the limited size of the provided training set, we leverage large-scale pretraining on a synthetic longitudinal CT dataset. Our experiments show that pretraining substantially improves the ability to exploit longitudinal context, yielding an improvement of up to 6 Dice points compared to models trained from scratch. These findings demonstrate the effectiveness of combining longitudinal context with interactive prompting for robust lesion tracking. Code is publicly available at this https URL.
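
The reported gain is expressed in Dice points. As a reminder of the metric, here is a minimal sketch that computes the Dice coefficient on flattened binary masks and expresses a difference on a 0 to 100 scale. Reading "Dice points" as percentage points is an assumption, and the masks are toy placeholders.

```python
def dice(pred, target):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for flattened binary masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    size = sum(pred) + sum(target)
    return 1.0 if size == 0 else 2.0 * intersection / size  # two empty masks agree


target = [1, 1, 0, 0, 1, 0]
baseline_pred = [1, 0, 0, 0, 0, 0]    # toy prediction from a baseline model
pretrained_pred = [1, 1, 0, 0, 0, 0]  # toy prediction from a pretrained model

gain_in_points = 100.0 * (dice(pretrained_pred, target) - dice(baseline_pred, target))
print(round(gain_in_points, 1))
```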

[CV-251] A multi-task neural network for atypical mitosis recognition under domain shift

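Quick read: This paper addresses the sharp performance drop that machine learning models suffer under domain shift when recognizing atypical mitotic figures in histopathology images. The key to the solution is multi-task learning: auxiliary tasks correlated with the main classification task guide the model to focus on the object being classified and to ignore the domain-varying image background, which improves generalization across data distributions.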

Link: https://arxiv.org/abs/2508.21035
Authors: Gennaro Percannella, Mattia Sarno, Francesco Tortorella, Mario Vento
Affiliations: University of Salerno; Department of Information and Electrical Engineering and Applied Mathematics (DIEM)
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Approach for MIDOG25 track 2

Abstract:Recognizing atypical mitotic figures in histopathology images allows physicians to correctly assess tumor aggressiveness. Although machine learning models could be exploited for automatically performing such a task, under domain shift these models suffer from significant performance drops. In this work, an approach based on multi-task learning is proposed for addressing this problem. By exploiting auxiliary tasks, correlated to the main classification task, the proposed approach, submitted to track 2 of the MItosis DOmain Generalization (MIDOG) challenge, aims to help the model focus only on the object to classify, ignoring the domain-varying background of the image. The proposed approach shows promising performance in a preliminary evaluation conducted on three distinct datasets, i.e., the MIDOG 2025 Atypical Training Set, the Ami-Br dataset, as well as the preliminary test set of the MIDOG25 challenge.
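
A multi-task objective of the kind described here is usually a weighted sum of the main loss and the auxiliary losses. The sketch below assumes binary heads with cross-entropy losses and a single shared auxiliary weight. The abstract does not specify the auxiliary tasks, so the tasks, weight, and names (`bce`, `multitask_loss`) are hypothetical.

```python
import math


def bce(prob, label):
    """Binary cross-entropy for one prediction, clipped for numerical safety."""
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def multitask_loss(main, auxiliary, aux_weight=0.5):
    """Main-task loss plus weighted auxiliary losses.

    main:      (predicted probability, label) for the atypical/normal decision
    auxiliary: list of (predicted probability, label) pairs, one per auxiliary head
    """
    loss = bce(*main)
    for prob, label in auxiliary:
        loss += aux_weight * bce(prob, label)
    return loss


# One training example with two hypothetical auxiliary heads.
print(multitask_loss(main=(0.8, 1), auxiliary=[(0.6, 1), (0.3, 0)]))
```

Gradients from the auxiliary terms flow into the shared backbone, which is the mechanism by which such tasks can steer the features toward the object rather than the background.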

[CV-252] Mitosis detection in domain shift scenarios: a Mamba-based approach

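Quick read: This paper addresses the performance degradation of mitosis detection in histopathology images under domain shift, that is, poor generalization to images from outside the training domains. The key to the solution is a Mamba-based approach that uses a VM-UNet architecture for mitosis detection, combined with stain augmentation to improve robustness to domain changes. The method was submitted to track 1 of the MItosis DOmain Generalization (MIDOG) challenge, and preliminary experiments on the MIDOG++ dataset show that it still has considerable room for improvement.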

Link: https://arxiv.org/abs/2508.21033
Authors: Gennaro Percannella, Mattia Sarno, Francesco Tortorella, Mario Vento
Affiliations: University of Salerno; Department of Information and Electrical Engineering and Applied Mathematics (DIEM)
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Approach for MIDOG 2025 track 1

Abstract:Mitosis detection in histopathology images plays a key role in tumor assessment. Although machine learning algorithms could be exploited for aiding physicians in accurately performing such a task, these algorithms suffer from a significant performance drop when evaluated on images coming from domains that are different from the training ones. In this work, we propose a Mamba-based approach for mitosis detection under domain shift, inspired by the promising performance demonstrated by Mamba in medical imaging segmentation tasks. Specifically, our approach exploits a VM-UNet architecture for carrying out the addressed task, as well as stain augmentation operations for further improving model robustness against domain shift. Our approach has been submitted to track 1 of the MItosis DOmain Generalization (MIDOG) challenge. Preliminary experiments, conducted on the MIDOG++ dataset, show large room for improvement for the proposed method.

Artificial Intelligence

[AI-0] Surrogate Benchmarks for Model Merging Optimization

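Quick read: This paper addresses the fact that hyperparameter settings strongly affect model merging results but are expensive to optimize, especially when merging large language models (LLMs). The key to the solution is building surrogate benchmarks: the authors define two search spaces and collect data samples to train surrogate models that predict the performance of a merged model from its hyperparameters and simulate the behavior of optimization algorithms, enabling low-cost algorithm development and performance comparison.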

Link: https://arxiv.org/abs/2509.02555
Authors: Rio Akizuki, Yuya Kudo, Nozomu Yoshinari, Yoichi Hirose, Toshiyuki Nishimoto, Kento Uchida, Shinichi Shirakawa
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: AutoML 2025 Non-Archival Content Track

Abstract:Model merging techniques aim to integrate the abilities of multiple models into a single model. Most model merging techniques have hyperparameters, and their setting affects the performance of the merged model. Because several existing works show that tuning hyperparameters in model merging can enhance the merging outcome, developing hyperparameter optimization algorithms for model merging is a promising direction. However, its optimization process is computationally expensive, particularly in merging LLMs. In this work, we develop surrogate benchmarks for optimization of the merging hyperparameters to realize algorithm development and performance comparison at low cost. We define two search spaces and collect data samples to construct surrogate models to predict the performance of a merged model from a hyperparameter. We demonstrate that our benchmarks can predict the performance of merged models well and simulate optimization algorithm behaviors.
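
The surrogate-benchmark idea can be sketched in a few lines: fit a cheap predictor to previously collected (hyperparameter, score) samples, then let an optimizer query the predictor instead of merging and evaluating real models. Everything below is illustrative. The single merge weight, the sample scores, the nearest-neighbour surrogate, and the names `surrogate` and `random_search` are assumptions, not the paper's search spaces or models.

```python
import random

# Made-up samples: (merge weight, benchmark score of the merged model).
SAMPLES = [(0.0, 0.52), (0.25, 0.61), (0.5, 0.68), (0.75, 0.63), (1.0, 0.55)]


def surrogate(weight, samples=SAMPLES, k=2):
    """Predict a score by inverse-distance weighting of the k nearest samples."""
    nearest = sorted(samples, key=lambda s: abs(s[0] - weight))[:k]
    inv = [1.0 / (abs(x - weight) + 1e-9) for x, _ in nearest]
    return sum(w * y for w, (_, y) in zip(inv, nearest)) / sum(inv)


def random_search(objective, budget, seed=0):
    """A baseline optimizer that only ever calls the cheap surrogate."""
    rng = random.Random(seed)
    candidates = [rng.random() for _ in range(budget)]
    best = max(candidates, key=objective)
    return best, objective(best)


best_weight, best_score = random_search(surrogate, budget=200)
print(round(best_weight, 3), round(best_score, 3))
```

Any optimizer with the same call signature can be swapped in for `random_search`, which is what makes a surrogate benchmark useful for comparing algorithms at low cost.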

[AI-1] Contemporary Agent Technology: LLM-Driven Advancements vs Classic Multi-Agent Systems

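Quick read: This paper examines how the new multi-agent systems (MAS) driven by large language models (LLMs) relate to, and differ from, classic MAS in the current development of agent technology. Its approach is to systematically analyze the architectures, interaction mechanisms, and behavioral characteristics of LLM-enabled agents, to clarify how they depart from classic MAS in modeling capability, collaboration modes, and adaptivity, and to compare the two critically against the core academic literature, identifying the key challenges facing current research and proposing forward-looking directions.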

Link: https://arxiv.org/abs/2509.02515
Authors: Costin Bădică, Amelia Bădică, Maria Ganzha, Mirjana Ivanović, Marcin Paprzycki, Dan Selişteanu, Zofia Wrona
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: The paper has 33 pages and it contains 1 figure and 2 tables

Abstract:This contribution provides our comprehensive reflection on the contemporary agent technology, with a particular focus on the advancements driven by Large Language Models (LLM) vs classic Multi-Agent Systems (MAS). It delves into the models, approaches, and characteristics that define these new systems. The paper emphasizes the critical analysis of how the recent developments relate to the foundational MAS, as articulated in the core academic literature. Finally, it identifies key challenges and promising future directions in this rapidly evolving domain.

[AI-2] Probabilistically stable revision and comparative probability: a representation theorem and applications

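Quick read: This paper addresses how to formalize and characterize the logical structure and semantics of probabilistically stable belief revision within a framework of probabilistic belief updating. The core challenge is reconciling an all-or-nothing acceptance rule with probabilistic stability under Bayesian conditioning, while giving a complete characterization of the resulting revision operators and a semantic basis for their non-monotonic logic. The key to the solution is a representation theorem that characterizes probabilistically stable revision operators in terms of selection functions: drawing on the theory of comparative probability orders, it gives necessary and sufficient conditions for a selection function to be representable as a strongest-stable-set operator. The result shows that the logic has strong monotonicity properties yet fails the AGM belief revision postulates, and it yields new tools for the logic of ratio comparisons, simple voting games, and revealed preference theory.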

Link: https://arxiv.org/abs/2509.02495
Authors: Krzysztof Mierzewski
Affiliations: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Theoretical Economics (econ.TH); Probability (math.PR)

Abstract:The stability rule for belief, advocated by Leitgeb [Annals of Pure and Applied Logic 164, 2013], is a rule for rational acceptance that captures categorical belief in terms of probabilistically stable propositions: propositions to which the agent assigns resiliently high credence. The stability rule generates a class of probabilistically stable belief revision operators, which capture the dynamics of belief that result from an agent updating their credences through Bayesian conditioning while complying with the stability rule for their all-or-nothing beliefs. In this paper, we prove a representation theorem that yields a complete characterisation of such probabilistically stable revision operators and provides a "qualitative" selection function semantics for the (non-monotonic) logic of probabilistically stable belief revision. Drawing on the theory of comparative probability orders, this result gives necessary and sufficient conditions for a selection function to be representable as a strongest-stable-set operator on a finite probability space. The resulting logic of probabilistically stable belief revision exhibits strong monotonicity properties while failing the AGM belief revision postulates and satisfying only very weak forms of case reasoning. In showing the main theorem, we prove two results of independent interest to the theory of comparative probability: the first provides necessary and sufficient conditions for the joint representation of a pair of (respectively, strict and non-strict) comparative probability orders. The second result provides a method for axiomatising the logic of ratio comparisons of the form "event A is at least k times more likely than event B". In addition to these measurement-theoretic applications, we point out two applications of our main result to the theory of simple voting games and to revealed preference theory.
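
To make the strongest-stable-set operator concrete, the sketch below brute-forces it on a small finite probability space. It assumes the stability threshold 1/2 (a set H is stable when P(H | E) > 1/2 for every positive-probability E that overlaps H) and that every world has positive probability. The function names and the three-world example are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def prob(P, event):
    """Probability of a set of worlds."""
    return sum(P[w] for w in event)


def is_stable(P, H):
    """Check stability of H at threshold 1/2.

    On a finite space the definition reduces to: P(H) = 1, or every world
    in H is more probable than the entire complement of H.
    """
    if not H:
        return False
    outside = 1 - prob(P, H)
    return outside == 0 or min(P[w] for w in H) > outside


def strongest_stable_set(P):
    """Smallest stable set; the full space is always stable, so one exists."""
    worlds = sorted(P)
    for size in range(1, len(worlds) + 1):
        for H in combinations(worlds, size):
            if is_stable(P, H):
                return set(H)


def revise(P, E):
    """Condition on E (Bayesian update), then take the strongest stable set."""
    p_e = prob(P, E)
    posterior = {w: P[w] / p_e for w in E}
    return strongest_stable_set(posterior)


P = {"a": Fraction(3, 5), "b": Fraction(3, 10), "c": Fraction(1, 10)}
print(strongest_stable_set(P))  # belief before any evidence
print(revise(P, {"b", "c"}))    # belief after learning that "a" is ruled out
```

Exact fractions avoid floating-point ties in the strict inequality. The enumeration is exponential in the number of worlds and is meant only to illustrate the definition.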

[AI-3] GridMind: LLMs-Powered Agents for Power System Analysis and Operations

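Quick read: This paper addresses the complexity of traditional power system analysis workflows, which hinders efficient decision-making in modern electric grids. The key to the solution is GridMind, a multi-agent AI system that combines large language models (LLMs) with deterministic engineering solvers to enable conversational scientific computing through natural language interfaces, preserving numerical precision while improving workflow integration, knowledge accessibility, context preservation, and expert decision support.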

Link: https://arxiv.org/abs/2509.02494
Authors: Hongwei Jin, Kibaek Kim, Jonghwan Kwon
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 11 pages, 9 figures, 2 tables. Work under review

Abstract:The complexity of traditional power system analysis workflows presents significant barriers to efficient decision-making in modern electric grids. This paper presents GridMind, a multi-agent AI system that integrates Large Language Models (LLMs) with deterministic engineering solvers to enable conversational scientific computing for power system analysis. The system employs specialized agents coordinating AC Optimal Power Flow and N-1 contingency analysis through natural language interfaces while maintaining numerical precision via function calls. GridMind addresses workflow integration, knowledge accessibility, context preservation, and expert decision-support augmentation. Experimental evaluation on IEEE test cases demonstrates that the proposed agentic framework consistently delivers correct solutions across all tested language models, with smaller LLMs achieving comparable analytical accuracy with reduced computational latency. This work establishes agentic AI as a viable paradigm for scientific computing, demonstrating how conversational interfaces can enhance accessibility while preserving numerical rigor essential for critical engineering applications.

[AI-4] MLP-Offload: Multi-Level Multi-Path Offloading for LLM Pre-training to Break the GPU Memory Wall

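Quick read: This paper addresses the memory bottleneck in training large language models (LLMs), whose sizes grow faster than GPU memory capacity, and in particular the significant I/O overhead that multi-tier host-memory or disk offloading adds to the critical path of training, which slows iterations. The key to the solution is MLP-Offload, a multi-level, multi-path offloading engine whose design rests on three observations: I/O overhead during the update phase dominates iteration time; the I/O bandwidth of the third-level remote storage tier goes unused; and contention from concurrent offloading amplifies I/O bottlenecks. MLP-Offload therefore offloads optimizer states across multiple tiers in a cache-efficient, concurrency-controlled manner to relieve I/O bottlenecks during the backward and update phases. On models of up to 280B parameters, it achieves 2.5× faster iterations than state-of-the-art LLM training runtimes.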

Link: https://arxiv.org/abs/2509.02480
Authors: Avinash Maurya, M. Mustafa Rafique, Franck Cappello, Bogdan Nicolae
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: SC’25: The International Conference for High Performance Computing, Networking, Storage and Analysis

Abstract:Training LLMs larger than the aggregated memory of multiple GPUs is increasingly necessary due to the faster growth of LLM sizes compared to GPU memory. To this end, multi-tier host memory or disk offloading techniques have been proposed by the state of the art. Despite advanced asynchronous multi-tier read/write strategies, such offloading strategies result in significant I/O overheads in the critical path of training, resulting in slower iterations. To this end, we propose MLP-Offload, a novel multi-level, multi-path offloading engine specifically designed for optimizing LLM training on resource-constrained setups by mitigating I/O bottlenecks. We make several key observations that drive the design of MLP-Offload: I/O overheads during the update phase dominate the iteration time; the I/O bandwidth of the third-level remote storage tier remains unutilized; and contention due to concurrent offloading amplifies I/O bottlenecks. Driven by these insights, we design and implement MLP-Offload to offload the optimizer states across multiple tiers in a cache-efficient and concurrency-controlled fashion to mitigate I/O bottlenecks during the backward and update phases. Evaluations on models up to 280B parameters show that MLP-Offload achieves 2.5× faster iterations compared to the state-of-the-art LLM training runtimes.

[AI-5] Generative Sequential Notification Optimization via Multi-Objective Decision Transformers

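Quick read: This paper addresses the complex sequential decision-making problem in notification recommendation systems: delivering useful messages without causing user fatigue, while avoiding the problems that traditional offline reinforcement learning methods such as Conservative Q-Learning (CQL) face at scale, namely instability, sensitivity to distribution shift, limited reproducibility, and poor explainability in high-dimensional recommendation settings. The key to the solution is a Decision Transformer (DT) based framework that recasts policy learning as return-conditioned supervised learning, improving robustness, scalability, and modeling flexibility. Specific contributions include a multi-reward design for non-episodic tasks, a quantile regression approach to return-to-go conditioning, and a production-ready architecture with circular-buffer sequence processing for near-real-time inference. In deployment at LinkedIn, the method improved notification relevance and overall session activity, with a 0.72% increase in sessions over a multi-objective CQL agent.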

Link: https://arxiv.org/abs/2509.02458
Authors: Borja Ocejo, Ruofan Wang, Ke Liu, Rohit K. Patra, Haotian Shen, David Liu, Yiwen Yuan, Gokulraj Mohanasundaram, Fedor Borisyuk, Prakruthi Prabhakar
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Notifications are an important communication channel for delivering timely and relevant information. Optimizing their delivery involves addressing complex sequential decision-making challenges under constraints such as message utility and user fatigue. Offline reinforcement learning (RL) methods, such as Conservative Q-Learning (CQL), have been applied to this problem but face practical challenges at scale, including instability, sensitivity to distribution shifts, limited reproducibility, and difficulties with explainability in high-dimensional recommendation settings. We present a Decision Transformer (DT) based framework that reframes policy learning as return-conditioned supervised learning, improving robustness, scalability, and modeling flexibility. Our contributions include a real-world comparison with CQL, a multi-reward design suitable for non-episodic tasks, a quantile regression approach to return-to-go conditioning, and a production-ready system with circular buffer-based sequence processing for near-real-time inference. Extensive offline and online experiments in a deployed notification system show that our approach improves notification utility and overall session activity while minimizing user fatigue. Compared to a multi-objective CQL-based agent, the DT-based approach achieved a +0.72% increase in sessions for notification decision-making at LinkedIn by making notification recommendation more relevant.
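
Two ingredients named in this abstract have compact standard definitions: the return-to-go that a Decision Transformer is conditioned on, and the pinball loss used in quantile regression. The sketch below shows both in isolation. How the paper combines them, and how it handles non-episodic data, is not reproduced here, and the reward values are placeholders.

```python
def returns_to_go(rewards):
    """Return-to-go at step t is the sum of rewards from t to the end of the sequence."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]


def pinball_loss(target, prediction, tau):
    """Quantile-regression loss; minimising it drives `prediction` toward the tau-quantile."""
    diff = target - prediction
    return max(tau * diff, (tau - 1.0) * diff)


rewards = [1.0, 0.0, 2.0]          # placeholder per-step rewards
print(returns_to_go(rewards))       # conditioning signal for each step
print(pinball_loss(10.0, 8.0, 0.9)) # under-prediction is penalised heavily at tau = 0.9
print(pinball_loss(8.0, 10.0, 0.9)) # over-prediction is penalised lightly
```

A high quantile of the achievable return is a natural target to condition on at inference time, since it asks the policy for good but attainable behaviour.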

[AI-6] A Survey: Towards Privacy and Security in Mobile Large Language Models

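Quick read: This paper addresses the privacy and security challenges of deploying mobile large language models (LLMs) in resource-constrained edge environments, which stem from their computational intensity and the sensitive data they process. Its key contribution is a systematic survey and categorization of existing protections, including differential privacy, federated learning, and prompt encryption, together with an in-depth analysis of vulnerabilities specific to mobile LLMs, such as adversarial attacks, membership inference, and side-channel attacks, and an assessment of the effectiveness and limitations of each approach. The paper also proposes future research directions toward trustworthy, privacy-compliant, and scalable mobile LLM systems.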

Link: https://arxiv.org/abs/2509.02411
Authors: Honghui Xu, Kaiyang Li, Wei Chen, Danyang Zheng, Zhiyuan Li, Zhipeng Cai
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:Mobile Large Language Models (LLMs) are revolutionizing diverse fields such as healthcare, finance, and education with their ability to perform advanced natural language processing tasks on-the-go. However, the deployment of these models in mobile and edge environments introduces significant challenges related to privacy and security due to their resource-intensive nature and the sensitivity of the data they process. This survey provides a comprehensive overview of privacy and security issues associated with mobile LLMs, systematically categorizing existing solutions such as differential privacy, federated learning, and prompt encryption. Furthermore, we analyze vulnerabilities unique to mobile LLMs, including adversarial attacks, membership inference, and side-channel attacks, offering an in-depth comparison of their effectiveness and limitations. Despite recent advancements, mobile LLMs face unique hurdles in achieving robust security while maintaining efficiency in resource-constrained environments. To bridge this gap, we propose potential applications, discuss open challenges, and suggest future research directions, paving the way for the development of trustworthy, privacy-compliant, and scalable mobile LLM systems.

[AI-7] Towards Agents That Know When They Don't Know: Uncertainty as a Control Signal for Structured Reasoning

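Quick read: This paper addresses the tendency of large language model (LLM) agents to produce fluent but overconfident outputs when reasoning over complex multi-table structured biomedical data. The key to the solution is an uncertainty-aware agent that combines two complementary uncertainty signals: (i) retrieval uncertainty, the entropy over multiple table-selection rollouts; and (ii) summary uncertainty, which combines self-consistency and perplexity. Summary uncertainty is incorporated into reinforcement learning (RL) with Group Relative Policy Optimization (GRPO), while both signals guide inference-time filtering and support the construction of higher-quality synthetic datasets. This improves the factuality and calibration of the summaries and makes the agent more reliable in complex structured-data environments.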

Link: https://arxiv.org/abs/2509.02401
Authors: Josefa Lia Stoisser, Marc Boubnovski Martell, Lawrence Phillips, Gianluca Mazzoni, Lea Mørch Harder, Philip Torr, Jesper Ferkinghoff-Borg, Kaspar Martens, Julien Fauqueur
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)

Abstract:Large language model (LLM) agents are increasingly deployed in structured biomedical data environments, yet they often produce fluent but overconfident outputs when reasoning over complex multi-table data. We introduce an uncertainty-aware agent for query-conditioned multi-table summarization that leverages two complementary signals: (i) retrieval uncertainty (entropy over multiple table-selection rollouts) and (ii) summary uncertainty (combining self-consistency and perplexity). Summary uncertainty is incorporated into reinforcement learning (RL) with Group Relative Policy Optimization (GRPO), while both retrieval and summary uncertainty guide inference-time filtering and support the construction of higher-quality synthetic datasets. On multi-omics benchmarks, our approach improves factuality and calibration, nearly tripling correct and useful claims per summary (3.0→8.4 internal; 3.6→9.9 cancer multi-omics) and substantially improving downstream survival prediction (C-index 0.32→0.63). These results demonstrate that uncertainty can serve as a control signal, enabling agents to abstain, communicate confidence, and become more reliable tools for complex structured-data environments.
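
Retrieval uncertainty as described here is an entropy over repeated table-selection rollouts. A minimal sketch: treat each rollout's selected set of tables as one outcome, estimate the empirical distribution over outcomes, and compute its Shannon entropy. The table names, the abstention threshold, and the helper names are hypothetical, and the paper's exact estimator may differ.

```python
import math
from collections import Counter


def retrieval_entropy(rollouts):
    """Shannon entropy (in nats) of the empirical distribution over selected table sets."""
    counts = Counter(frozenset(selection) for selection in rollouts)
    n = len(rollouts)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def should_abstain(rollouts, threshold=0.6):
    """Hypothetical control rule: abstain when the rollouts disagree too much."""
    return retrieval_entropy(rollouts) > threshold


consistent = [["clinical", "rna_seq"]] * 4
scattered = [["clinical"], ["rna_seq"], ["mutations"], ["clinical", "rna_seq"]]

print(retrieval_entropy(consistent), should_abstain(consistent))
print(retrieval_entropy(scattered), should_abstain(scattered))
```

Using `frozenset` makes the measure insensitive to the order in which tables are listed within a rollout.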

[AI-8] Real-time ML-based Defense Against Malicious Payload in Reconfigurable Embedded Systems

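Quick read: This paper addresses hardware security risks in embedded systems caused by malicious bitstreams, such as denial-of-service (DoS), data leakage, and covert attacks. The key to the solution is a supervised learning detection method that analyzes static byte-level features of bitstreams directly at the binary level, without access to source code or netlists, enabling real-time detection in resource-constrained environments. The study uses 122 benign and malicious configuration samples derived from state-of-the-art benchmarks. After vectorization by byte frequency analysis, compression with TSVD, and class balancing with SMOTE, a Random Forest classifier achieves a macro F1-score of 0.97, and deployment on the Xilinx PYNQ-Z1 development board demonstrates the feasibility of the approach.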

Link: https://arxiv.org/abs/2509.02387
Authors: Rye Stahle-Smith, Rasha Karakchi
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: This paper is submitted to Supercomputing (SC’25)

Abstract:The growing use of FPGAs in reconfigurable systems introduces security risks through malicious bitstreams that could cause denial-of-service (DoS), data leakage, or covert attacks. We investigated chip-level hardware malicious payload in embedded systems and proposed a supervised machine learning method to detect malicious bitstreams via static byte-level features. Our approach diverges from existing methods by analyzing bitstreams directly at the binary level, enabling real-time detection without requiring access to source code or netlists. Bitstreams were sourced from state-of-the-art (SOTA) benchmarks and re-engineered to target the Xilinx PYNQ-Z1 FPGA Development Board. Our dataset included 122 samples of benign and malicious configurations. The data were vectorized using byte frequency analysis, compressed using TSVD, and balanced using SMOTE to address class imbalance. The evaluated classifiers demonstrated that Random Forest achieved a macro F1-score of 0.97, underscoring the viability of real-time Trojan detection on resource-constrained systems. The final model was serialized and successfully deployed via PYNQ to enable integrated bitstream analysis.
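
The first stage of the pipeline, byte frequency analysis, turns a bitstream of any length into a fixed 256-dimensional feature vector. The sketch below shows only that stage. The later TSVD, SMOTE, and Random Forest stages are not reproduced, normalising counts to relative frequencies is an assumption, and `byte_frequency_vector` is a hypothetical name.

```python
def byte_frequency_vector(data: bytes):
    """Relative frequency of each byte value 0..255 in a binary blob."""
    counts = [0] * 256
    for value in data:  # iterating over bytes yields integers in 0..255
        counts[value] += 1
    total = len(data) or 1  # avoid division by zero on empty input
    return [c / total for c in counts]


# Tiny stand-in for a bitstream file read with open(path, "rb").read().
sample = b"\x00\x00\xff\x01"
features = byte_frequency_vector(sample)
print(len(features), features[0], features[1], features[255])
```

Because the representation discards byte order, it is cheap enough for constrained devices, at the cost of ignoring structural information in the bitstream.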

[AI-9] Poisoned at Scale: A Scalable Audit Uncovers Hidden Scam Endpoints in Production LLMs

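Quick read: This paper addresses the security risk that generative AI poses in software development when training data is poisoned, so that models absorb and reproduce harmful code containing malicious URLs. The key to the solution is a scalable, automated audit framework that synthesizes innocuous, developer-style prompts from known scam databases and uses them to test whether production large language models (LLMs) generate code with malicious content. All tested models show a systemic vulnerability: on average 4.2% of generated programs contain malicious URLs, mostly in response to seemingly benign prompts. The results give empirical evidence that the training data has been poisoned at scale and underline the need for stronger defenses and post-generation safety checks.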

Link: https://arxiv.org/abs/2509.02372
Authors: Zhiyang Chen, Tara Saba, Xun Deng, Xujie Si, Fan Long
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 10 pages, 4 figures

Abstract:Large Language Models (LLMs) have become critical to modern software development, but their reliance on internet datasets for training introduces a significant security risk: the absorption and reproduction of malicious content. To evaluate this threat, this paper introduces a scalable, automated audit framework that synthesizes innocuous, developer-style prompts from known scam databases to query production LLMs and determine if they generate code containing harmful URLs. We conducted a large-scale evaluation across four production LLMs (GPT-4o, GPT-4o-mini, Llama-4-Scout, and DeepSeek-V3), and found a systemic vulnerability, with all tested models generating malicious code at a non-negligible rate. On average, 4.2% of programs generated in our experiments contained malicious URLs. Crucially, this malicious code is often generated in response to benign prompts. We manually validated the prompts that cause all four LLMs to generate malicious code, resulting in 177 innocuous prompts that trigger all models to produce harmful outputs. These results provide strong empirical evidence that the training data of production LLMs has been successfully poisoned at scale, underscoring the urgent need for more robust defense mechanisms and post-generation safety checks to mitigate the propagation of hidden security threats.
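
A post-generation safety check of the kind this abstract calls for can be as simple as extracting URLs from generated code and matching their hosts against a blocklist. The sketch below is one such check, not the paper's audit framework. The regular expression, the blocklist, and the reserved `.test` domain are illustrative assumptions.

```python
import re
from urllib.parse import urlparse

# Deliberately simple URL pattern: stops at whitespace, quotes, and brackets.
URL_RE = re.compile(r"https?://[^\s'\"<>)]+")


def flagged_urls(code, blocked_domains):
    """Return URLs in `code` whose host is a blocked domain or one of its subdomains."""
    hits = []
    for url in URL_RE.findall(code):
        host = (urlparse(url).hostname or "").lower()
        if any(host == d or host.endswith("." + d) for d in blocked_domains):
            hits.append(url)
    return hits


generated = (
    'resp = requests.get("https://api.scam-example.test/v1/price")\n'
    'print("docs at https://example.org/docs")\n'
)
print(flagged_urls(generated, {"scam-example.test"}))
```

Matching on the parsed hostname rather than on substrings avoids flagging a benign URL that merely mentions a blocked name in its path.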

[AI-10] Guidance and Control Neural Network Acceleration using Memristors

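[Quick Read]: This paper addresses the limited energy budget and radiation sensitivity that constrain on-board artificial intelligence (AI) on smallsats and cubesats and restrict the deployment of conventional Artificial Neural Network (ANN) accelerators. The key to its solution is in-memory computing with memristors, namely Phase-Change Memory (PCM) and Resistive Random-Access Memory (RRAM), to lower power consumption and improve radiation tolerance. By simulating a guidance and control neural network (G&CNET) in several scenarios, the study shows that a memristive accelerator can learn the expert actions, identifies the effect of noise on accuracy as the main remaining challenge, and shows that re-training after degradation restores performance to nominal levels. This provides a technical basis for future research on memristor-based AI accelerators for space.

Link: https://arxiv.org/abs/2509.02369
Authors: Zacharia A. Rudge,Dario Izzo,Moritz Fieback,Anteneh Gebregiorgis,Said Hamdioui,Dominik Dold
Affiliations: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 4 pages, SPAICE 2024 conference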

Abstract:In recent years, the space community has been exploring the possibilities of Artificial Intelligence (AI), specifically Artificial Neural Networks (ANNs), for a variety of on board applications. However, this development is limited by the restricted energy budget of smallsats and cubesats as well as radiation concerns plaguing modern chips. This necessitates research into neural network accelerators capable of meeting these requirements whilst satisfying the compute and performance needs of the application. This paper explores the use of Phase-Change Memory (PCM) and Resistive Random-Access Memory (RRAM) memristors for on-board in-memory computing AI acceleration in space applications. A guidance and control neural network (G&CNET) accelerated using memristors is simulated in a variety of scenarios and with both device types to evaluate the performance of memristor-based accelerators, considering device non-idealities such as noise and conductance drift. We show that the memristive accelerator is able to learn the expert actions, though challenges remain with the impact of noise on accuracy. We also show that re-training after degradation is able to restore performance to nominal levels. This study provides a foundation for future research into memristor-based AI accelerators for space, highlighting their potential and the need for further investigation.
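
To make the two non-idealities named in the abstract concrete, the sketch below perturbs a list of stored weights with additive Gaussian noise and a power-law drift factor. Everything here is an assumption for illustration: the power-law form `(t / t0) ** (-nu)` is a commonly used empirical description of PCM conductance drift and is not taken from the paper, and the function and parameter names are hypothetical.

```python
import random


def apply_nonidealities(weights, noise_std, drift_nu, t, t0=1.0, seed=0):
    """Return weights after power-law drift and additive Gaussian noise.

    drift factor = (t / t0) ** (-drift_nu); noise ~ N(0, noise_std ** 2).
    Both the drift model and the noise model are illustrative assumptions.
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    factor = (t / t0) ** (-drift_nu)
    return [w * factor + rng.gauss(0.0, noise_std) for w in weights]


ideal = [1.0, -0.5, 0.25]
degraded = apply_nonidealities(ideal, noise_std=0.02, drift_nu=0.05, t=1000.0)
print(degraded)
```

Running a trained network with `degraded` in place of `ideal` is the kind of experiment that shows how accuracy falls with noise and time, and re-training on the degraded weights corresponds to the recovery step the abstract reports.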

[AI-11] When Agents go Astray: Course-Correcting SWE Agents with PRMs

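[Quick Read]: This paper addresses trajectory-level inefficiencies of Large Language Model (LLM) agents on complex software engineering (SWE) tasks, such as redundant exploration, looping, and continuing to run after a solution has been reached. Earlier methods usually diagnose failures only after execution and cannot intervene in real time. The core of the solution is SWE-PRM, an inference-time Process Reward Model (PRM) that uses a taxonomy of common inefficiencies to detect and correct trajectory errors during execution, giving lightweight, interpretable feedback without modifying the underlying policy. On SWE-bench Verified it raises the resolution rate from 40.0% to 50.6%, with the largest gains on medium and hard tasks, while shortening trajectories at an added inference cost as low as 0.2.

Link: https://arxiv.org/abs/2509.02360
Authors: Shubham Gandhi,Jason Tsay,Jatin Ganhotra,Kiran Kate,Yara Rizk
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: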

Abstract:Large Language Model (LLM) agents are increasingly deployed for complex, multi-step software engineering (SWE) tasks. However, their trajectories often contain costly inefficiencies, such as redundant exploration, looping, and failure to terminate once a solution is reached. Prior work has largely treated these errors in a post-hoc manner, diagnosing failures only after execution. In this paper, we introduce SWE-PRM, an inference-time Process Reward Model (PRM) that intervenes during execution to detect and course-correct trajectory-level errors. Our PRM design leverages a taxonomy of common inefficiencies and delivers lightweight, interpretable feedback without modifying the underlying policy. On SWE-bench Verified, closed-source PRMs improve resolution from 40.0% to 50.6% (+10.6 p.p.), with the largest gains on medium and hard tasks. Among feedback strategies, taxonomy-guided PRMs outperform unguided or explicit action-prescriptive variants, increasing success rate while reducing trajectory length. These benefits come at an acceptable added inference cost of as low as 0.2, making PRMs a practical and scalable mechanism for improving SWE agents’ reliability and efficiency.
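
Of the inefficiencies listed above, looping is the easiest to illustrate. The sketch below is a plain rule-based check, not the paper's PRM (which is a model giving taxonomy-guided feedback); the function name, the window size, and the repeat threshold are all assumptions chosen for the example.

```python
from collections import Counter
from typing import Optional


def detect_loop(actions: list[str], window: int = 6, max_repeats: int = 3) -> Optional[str]:
    """Return the action repeated at least max_repeats times in the last `window` steps, if any."""
    recent = actions[-window:]
    if not recent:
        return None
    action, count = Counter(recent).most_common(1)[0]
    return action if count >= max_repeats else None


# Hypothetical agent trajectory that alternates between the same two steps.
trajectory = ["open a.py", "run tests", "open a.py", "run tests", "open a.py"]
looping_action = detect_loop(trajectory)
if looping_action is not None:
    print(f"feedback: '{looping_action}' keeps repeating; try a different step")
```

A check like this would run between agent steps, so the feedback arrives during execution rather than in a post-hoc diagnosis.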

[AI-12] AudioCodecBench: A Comprehensive Benchmark for Audio Codec Evaluation

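[Quick Read]: This paper addresses two problems in audio processing for Multimodal Large Language Models (MLLMs). First, existing work does not define semantic tokens and acoustic tokens clearly, which leaves the theoretical basis weak. Second, codec evaluation is usually confined to specific tasks, such as speech reconstruction or Automatic Speech Recognition (ASR), so there is no fair, comprehensive, cross-dimensional comparison. The key to the solution is to give explicit definitions of semantic and acoustic tokens and to build a systematic evaluation framework covering four dimensions: audio reconstruction metrics, codebook index (ID) stability, decoder-only transformer perplexity, and performance on downstream probe tasks, so that audio codecs can be compared from several angles.

Link: https://arxiv.org/abs/2509.02349
Authors: Lu Wang,Hao Chen,Siyu Wu,Zhiyue Wu,Hao Zhou,Chengfeng Zhang,Ting Wang,Haodi Zhang
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: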

Abstract:Multimodal Large Language Models (MLLMs) have been widely applied in speech and music. This tendency has led to a focus on audio tokenization for Large Models (LMs). Unlike semantic-only text tokens, audio tokens must both capture global semantic content and preserve fine-grained acoustic details. Moreover, they provide a discrete method for speech and music that can be effectively integrated into MLLMs. However, existing research lacks suitable definitions of semantic tokens and acoustic tokens. In addition, the evaluation of different codecs typically concentrates on specific domains or tasks, such as reconstruction or Automatic Speech Recognition (ASR), which prevents fair and comprehensive comparisons. To address these problems, this paper provides suitable definitions for semantic and acoustic tokens and introduces a systematic evaluation framework. This framework allows for a comprehensive assessment of codecs’ capabilities across four dimensions: audio reconstruction metric, codebook index (ID) stability, decoder-only transformer perplexity, and performance on downstream probe tasks. Our results show the correctness of the provided suitable definitions and the correlation among reconstruction metrics, codebook ID stability, downstream probe tasks and perplexity.
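
Of the four dimensions, perplexity has a standard definition that is easy to show: the exponential of the mean negative log-probability a model assigns to the observed tokens. The sketch below assumes the per-token probabilities are already available from some decoder-only model; the function name and the toy inputs are hypothetical, and how the benchmark obtains these probabilities is not covered here.

```python
import math


def perplexity(token_probs: list[float]) -> float:
    """Perplexity = exp(-(1/N) * sum(log p_i)) over the observed tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# A model that spreads probability uniformly over 4 codebook entries
# is exactly as uncertain as a 4-way guess.
print(perplexity([0.25] * 8))
```

Lower perplexity on a codec's token sequences suggests the tokens are easier for a language model to predict, which is why it appears alongside reconstruction quality in the evaluation.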

[AI-13] RDIT: Residual-based Diffusion Implicit Models for Probabilistic Time Series Forecasting

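[Quick Read]: This paper addresses weak distribution modeling and the mismatch between training and evaluation metrics in Probabilistic Time Series Forecasting (PTSF). Existing methods model uncertainty poorly, and their training objectives often disagree with the metric actually used for evaluation, such as the Continuous Ranked Probability Score (CRPS). The key to the solution is RDIT, a plug-and-play framework with three core ideas: (1) combining a strong point estimator with a residual-based conditional diffusion process; (2) using a bidirectional Mamba network to strengthen temporal modeling; and (3) proving theoretically that CRPS can be minimized by adjusting the standard deviation of the Gaussian noise, and deriving algorithms for distribution matching from this result. Experiments on multiple multivariate datasets show lower CRPS, faster inference, and better coverage than strong baselines.

Link: https://arxiv.org/abs/2509.02341
Authors: Chih-Yu Lai,Yu-Chien Ning,Duane S. Boning
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: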

Abstract:Probabilistic Time Series Forecasting (PTSF) plays a critical role in domains requiring accurate and uncertainty-aware predictions for decision-making. However, existing methods offer suboptimal distribution modeling and suffer from a mismatch between training and evaluation metrics. Surprisingly, we found that augmenting a strong point estimator with a zero-mean Gaussian, whose standard deviation matches its training error, can yield state-of-the-art performance in PTSF. In this work, we propose RDIT, a plug-and-play framework that combines point estimation and residual-based conditional diffusion with a bidirectional Mamba network. We theoretically prove that the Continuous Ranked Probability Score (CRPS) can be minimized by adjusting to an optimal standard deviation and then derive algorithms to achieve distribution matching. Evaluations on eight multivariate datasets across varied forecasting horizons demonstrate that RDIT achieves lower CRPS, rapid inference, and improved coverage compared to strong baselines.
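
The observation that a point forecast plus a zero-mean Gaussian can score well under CRPS can be explored with the standard closed form of CRPS for a Gaussian forecast. The sketch below is not the paper's algorithm: it simply evaluates that closed form on a handful of made-up residuals and grid-searches the standard deviation, with all names and numbers chosen for illustration.

```python
import math


def gaussian_crps(mu: float, sigma: float, y: float) -> float:
    """Closed-form CRPS of a N(mu, sigma^2) forecast against observation y."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))


def mean_crps(sigma: float, residuals: list[float]) -> float:
    """Average CRPS when every point forecast gets the same Gaussian spread."""
    return sum(gaussian_crps(0.0, sigma, r) for r in residuals) / len(residuals)


# Hypothetical residuals (observation minus point forecast).
residuals = [-1.0, 0.5, 2.0, -0.3]
grid = [0.1 * k for k in range(1, 51)]  # candidate sigmas 0.1 .. 5.0
best_sigma = min(grid, key=lambda s: mean_crps(s, residuals))
print(best_sigma, mean_crps(best_sigma, residuals))
```

Both a very small and a very large sigma are penalized: the first because the forecast is overconfident about wrong points, the second because the spread itself costs score, so the minimum sits at an intermediate value tied to the size of the residuals.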

[AI-14] Explainability-Driven Dimensionality Reduction for Hyperspectral Imaging

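[Quick Read]: This paper addresses the computational burden and redundancy caused by the high spectral dimensionality of Hyperspectral Imaging (HSI), aiming for dimensionality reduction that is efficient and preserves predictive performance. The key to the solution is to apply post-hoc explainability methods for band selection within a model-driven framework. Explanation techniques first quantify each band's contribution to the classifier's decisions; deletion-insertion evaluations then record the change in confidence, and these signals are aggregated into influence scores; finally, the highest-influence bands are selected to form a compact spectral subset. Experiments show that as few as 30 selected bands match or exceed the accuracy of the full-spectrum baseline while clearly reducing computation, and that the selected bands fall in physically meaningful, highly discriminative wavelength regions, which supports model-aligned, explanation-guided band selection as an effective reduction strategy.

Link: https://arxiv.org/abs/2509.02340
Authors: Salma Haidar,José Oramas
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: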

Abstract:Hyperspectral imaging (HSI) provides rich spectral information for precise material classification and analysis; however, its high dimensionality introduces a computational burden and redundancy, making dimensionality reduction essential. We present an exploratory study into the application of post-hoc explainability methods in a model-driven framework for band selection, which reduces the spectral dimension while preserving predictive performance. A trained classifier is probed with explanations to quantify each band’s contribution to its decisions. We then perform deletion-insertion evaluations, recording confidence changes as ranked bands are removed or reintroduced, and aggregate these signals into influence scores. Selecting the highest-influence bands yields compact spectral subsets that maintain accuracy and improve efficiency. Experiments on two public benchmarks (Pavia University and Salinas) demonstrate that classifiers trained on as few as 30 selected bands match or exceed full-spectrum baselines while reducing computational requirements. The resulting subsets align with physically meaningful, highly discriminative wavelength regions, indicating that model-aligned, explanation-guided band selection is a principled route to effective dimensionality reduction for HSI.
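
The idea of scoring bands by the confidence change they cause can be shown with a simplified, deletion-only variant. This sketch is an assumption-laden stand-in, not the paper's procedure: it removes one band at a time rather than following an explanation-derived ranking, it omits the insertion half, and the toy linear "classifier" and all names are hypothetical.

```python
def deletion_influence(predict, spectrum, baseline=0.0):
    """Confidence drop when each band is replaced by a baseline value, one at a time."""
    full_confidence = predict(spectrum)
    scores = []
    for i in range(len(spectrum)):
        perturbed = list(spectrum)
        perturbed[i] = baseline
        scores.append(full_confidence - predict(perturbed))
    return scores


def select_bands(scores, k):
    """Indices of the k highest-influence bands, returned in spectral order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy stand-in for a trained classifier's confidence (hypothetical weights).
weights = [0.1, 0.7, 0.0, 0.2]
predict = lambda x: sum(w * v for w, v in zip(weights, x))

scores = deletion_influence(predict, [1.0, 1.0, 1.0, 1.0])
print(select_bands(scores, k=2))
```

A classifier retrained on only the selected indices is then compared with the full-spectrum baseline, which is the comparison the abstract reports for 30 bands.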

[AI-15] ReCode: Improving LLM-based Code Repair with Fine-Grained Retrieval-Augmented Generation CIKM2025

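[Quick Read]: This paper addresses the high training cost and expensive inference of existing code repair methods, as well as the poor retrieval quality of conventional Retrieval-Augmented Generation (RAG) strategies, which rely on holistic code-text embeddings and miss the structural details of code. The key to the solution is the ReCode framework, with two core innovations: (1) an algorithm-aware retrieval strategy that narrows the search space using a preliminary prediction of the algorithm type; and (2) a modular dual-encoder architecture that processes code and text inputs separately, enabling fine-grained semantic matching between the input and the retrieved contexts. This design improves repair accuracy and substantially lowers inference cost, which gives it practical value.

Link: https://arxiv.org/abs/2509.02330
Authors: Yicong Zhao,Shisong Chen,Jiacheng Zhang,Zhixu Li
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted by CIKM 2025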

Abstract:Recent advances in large language models (LLMs) have demonstrated impressive capabilities in code-related tasks, such as code generation and automated program repair. Despite their promising performance, most existing approaches for code repair suffer from high training costs or computationally expensive inference. Retrieval-augmented generation (RAG), with its efficient in-context learning paradigm, offers a more scalable alternative. However, conventional retrieval strategies, which are often based on holistic code-text embeddings, fail to capture the structural intricacies of code, resulting in suboptimal retrieval quality. To address the above limitations, we propose ReCode, a fine-grained retrieval-augmented in-context learning framework designed for accurate and efficient code repair. Specifically, ReCode introduces two key innovations: (1) an algorithm-aware retrieval strategy that narrows the search space using preliminary algorithm type predictions; and (2) a modular dual-encoder architecture that separately processes code and textual inputs, enabling fine-grained semantic matching between input and retrieved contexts. Furthermore, we propose RACodeBench, a new benchmark constructed from real-world user-submitted buggy code, which addresses the limitations of synthetic benchmarks and supports realistic evaluation. Experimental results on RACodeBench and competitive programming datasets demonstrate that ReCode achieves higher repair accuracy with significantly reduced inference cost, highlighting its practical value for real-world code repair scenarios.
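
The two innovations can be pictured as a filter followed by a two-part similarity score. The sketch below is a toy illustration only: the corpus, the vectors, the field names, and the rule of summing code and text similarity are all assumptions, and the paper's encoders and matching scheme are not reproduced here.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, predicted_algo, corpus, top_k=1):
    """Keep entries of the predicted algorithm type, then rank by code + text similarity."""
    candidates = [e for e in corpus if e["algo"] == predicted_algo] or corpus
    def score(entry):
        return (cosine(query["code_vec"], entry["code_vec"])
                + cosine(query["text_vec"], entry["text_vec"]))
    return sorted(candidates, key=score, reverse=True)[:top_k]


# Hypothetical repair examples with separate code and text embeddings.
corpus = [
    {"id": "dp-1", "algo": "dynamic_programming", "code_vec": [1.0, 0.0], "text_vec": [0.9, 0.1]},
    {"id": "dp-2", "algo": "dynamic_programming", "code_vec": [0.0, 1.0], "text_vec": [0.1, 0.9]},
    {"id": "graph-1", "algo": "graph", "code_vec": [1.0, 0.0], "text_vec": [1.0, 0.0]},
]
query = {"code_vec": [1.0, 0.1], "text_vec": [1.0, 0.0]}
print([e["id"] for e in retrieve(query, "dynamic_programming", corpus)])
```

Filtering first means the similarity ranking only has to separate examples of the same algorithm family, so an entry like `graph-1` cannot win on surface similarity alone.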

[AI-16] Exploring Diffusion Models for Generative Forecasting of Financial Charts

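[Quick Read]: This paper addresses stock price trend prediction in finance, and in particular how generative models can replace the usual reliance on time-series data and Transformer architectures. The key to the solution is to treat time-series data as a single image pattern and use a text-to-image generative model (a diffusion model) to generate the next chart image from the current chart image and an instruction prompt, thereby predicting the price trend. The paper also proposes a simple method for evaluating the generated chart image against the ground-truth image, offering a new direction and initial empirical evidence for generative models in finance.

Link: https://arxiv.org/abs/2509.02308
Authors: Taegyeong Lee,Jiwon Park,Kyunga Bang,Seunghyun Hwang,Ung-Jin Jang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: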

Abstract:Recent advances in generative models have enabled significant progress in tasks such as generating and editing images from text, as well as creating videos from text prompts, and these methods are being applied across various fields. However, in the financial domain, there may still be a reliance on time-series data and a continued focus on transformer models, rather than on diverse applications of generative models. In this paper, we propose a novel approach that leverages a text-to-image model by treating time-series data as a single image pattern, thereby enabling the prediction of stock price trends. Unlike prior methods that focus on learning and classifying chart patterns using architectures such as ResNet or ViT, we experiment with generating the next chart image from the current chart image and an instruction prompt using diffusion models. Furthermore, we introduce a simple method for evaluating the generated chart image against the ground-truth image. We highlight the potential of leveraging text-to-image generative models in the financial domain, and our findings motivate further research to address the current limitations and expand their applicability.

[AI-17] Re-evaluating LLM-based Heuristic Search: A Case Study on the 3D Packing Problem

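[Quick Read]: This paper studies how far Large Language Models (LLMs) can automate heuristic design, using the construction of a complete solver for a complex combinatorial optimization problem, the constrained 3D Packing Problem, as a case study. Heuristic design has traditionally depended on human effort, and although LLMs can generate code, they prove fragile on complex tasks and struggle to produce a stable, complete solution directly. The authors therefore add two supports: constraint scaffolding, meaning prewritten constraint-checking code that lowers the reasoning load, and iterative self-correction, meaning additional refinement cycles that repair bugs and produce a viable initial population. The experiments show that during the greedy search the LLM focused almost exclusively on refining the scoring function, which suggests that the emphasis on scoring functions in earlier work may reflect a limitation of LLM capabilities rather than a principled strategy. The resulting heuristic is comparable to a human-designed greedy algorithm and, when its scoring function is combined with a human-crafted metaheuristic, rivals established solvers, although performance declines as constraints tighten. The study identifies two barriers to automated heuristic design with current LLMs: the engineering needed to offset their fragility on complex reasoning tasks, and pretrained biases that narrow the search for novel solutions too early.

Link: https://arxiv.org/abs/2509.02297
Authors: Guorui Quan,Mingfei Sun,Manuel López-Ibáñez
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: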

Abstract:The art of heuristic design has traditionally been a human pursuit. While Large Language Models (LLMs) can generate code for search heuristics, their application has largely been confined to adjusting simple functions within human-crafted frameworks, leaving their capacity for broader innovation an open question. To investigate this, we tasked an LLM with building a complete solver for the constrained 3D Packing Problem. Direct code generation quickly proved fragile, prompting us to introduce two supports: constraint scaffolding (prewritten constraint-checking code) and iterative self-correction (additional refinement cycles to repair bugs and produce a viable initial population). Notably, even within a vast search space in a greedy process, the LLM concentrated its efforts almost exclusively on refining the scoring function. This suggests that the emphasis on scoring functions in prior work may reflect not a principled strategy, but rather a natural limitation of LLM capabilities. The resulting heuristic was comparable to a human-designed greedy algorithm, and when its scoring function was integrated into a human-crafted metaheuristic, its performance rivaled established solvers, though its effectiveness waned as constraints tightened. Our findings highlight two major barriers to automated heuristic design with current LLMs: the engineering required to mitigate their fragility in complex reasoning tasks, and the influence of pretrained biases, which can prematurely narrow the search for novel solutions.
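
Constraint scaffolding, as described above, means handing the LLM ready-made feasibility checks so that it only has to write the search logic. A minimal sketch of what such prewritten code might look like follows; it covers only two geometric constraints (containment and non-overlap of axis-aligned boxes), and the `Box` type and function names are assumptions, since the paper's actual constraint set is not listed in the abstract.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box: corner (x, y, z) and extents (dx, dy, dz)."""
    x: float
    y: float
    z: float
    dx: float
    dy: float
    dz: float


def inside(box: Box, container: tuple[float, float, float]) -> bool:
    """True if the box lies entirely within a container anchored at the origin."""
    cx, cy, cz = container
    return (box.x >= 0 and box.y >= 0 and box.z >= 0
            and box.x + box.dx <= cx and box.y + box.dy <= cy and box.z + box.dz <= cz)


def overlaps(a: Box, b: Box) -> bool:
    """True if two boxes share interior volume (touching faces do not count)."""
    return (a.x < b.x + b.dx and b.x < a.x + a.dx
            and a.y < b.y + b.dy and b.y < a.y + a.dy
            and a.z < b.z + b.dz and b.z < a.z + a.dz)


def is_feasible(placed: list[Box], container: tuple[float, float, float]) -> bool:
    """Check containment of every box and pairwise non-overlap."""
    if not all(inside(b, container) for b in placed):
        return False
    return not any(overlaps(placed[i], placed[j])
                   for i in range(len(placed)) for j in range(i + 1, len(placed)))


layout = [Box(0, 0, 0, 5, 5, 5), Box(5, 0, 0, 5, 5, 5)]
print(is_feasible(layout, (10, 10, 10)))
```

With checks like these fixed in advance, a generated heuristic can be rejected or repaired automatically whenever it proposes an infeasible placement.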

[AI-18] Think2Sing: Orchestrating Structured Motion Subtitles for Singing-Driven 3D Head Animation

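[Quick Read]: This paper addresses the weak performance of 3D head animation in singing scenarios: existing speech-driven methods fail to capture the rich semantics, emotion, and dynamic prosody of singing, so their results are oversimplified, emotionally flat, and semantically inconsistent. The key to the solution is Think2Sing, a diffusion-based framework that uses pretrained Large Language Models (LLMs) and is conditioned on both lyrics and acoustic features. Its main innovation is the introduction of motion subtitles, generated through a novel Singing Chain-of-Thought reasoning process combined with acoustic-guided retrieval. These subtitles carry precise timestamps and region-specific motion descriptions and serve as interpretable motion priors, enabling finer, temporally coherent, and more expressive modeling of facial motion.

Link: https://arxiv.org/abs/2509.02278
Authors: Zikai Huang,Yihan Zhou,Xuemiao Xu,Cheng Xu,Xiaofen Xing,Jing Qin,Shengfeng He
Affiliations: Unknown
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: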

Abstract:Singing-driven 3D head animation is a challenging yet promising task with applications in virtual avatars, entertainment, and education. Unlike speech, singing involves richer emotional nuance, dynamic prosody, and lyric-based semantics, requiring the synthesis of fine-grained, temporally coherent facial motion. Existing speech-driven approaches often produce oversimplified, emotionally flat, and semantically inconsistent results, which are insufficient for singing animation. To address this, we propose Think2Sing, a diffusion-based framework that leverages pretrained large language models to generate semantically coherent and temporally consistent 3D head animations, conditioned on both lyrics and acoustics. A key innovation is the introduction of motion subtitles, an auxiliary semantic representation derived through a novel Singing Chain-of-Thought reasoning process combined with acoustic-guided retrieval. These subtitles contain precise timestamps and region-specific motion descriptions, serving as interpretable motion priors. We frame the task as a motion intensity prediction problem, enabling finer control over facial regions and improving the modeling of expressive motion. To support this, we create a multimodal singing dataset with synchronized video, acoustic descriptors, and motion subtitles, enabling diverse and expressive motion learning. Extensive experiments show that Think2Sing outperforms state-of-the-art methods in realism, expressiveness, and emotional fidelity, while also offering flexible, user-controllable animation editing.

[AI-19] Rewarding Explainability in Drug Repurposing with Knowledge Graphs IJCAI2025

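[Quick Read]: This paper addresses the lack of explainability in link prediction over Knowledge Graphs (KGs): how to keep predictions accurate while providing scientifically meaningful explanatory paths, so that models are more credible and useful in fields such as biomedicine. The key to the solution is a new method called REx. It uses a reinforcement learning framework with reward and policy mechanisms to guide an agent toward paths in the knowledge graph that have the properties of a scientific explanation, and it enriches these paths with domain-specific ontologies so that the explanations are both insightful and grounded in established biomedical knowledge.

Link: https://arxiv.org/abs/2509.02276
Authors: Susana Nunes,Samy Badreddine,Catia Pesquita
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures, accepted at conference IJCAI 2025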

Abstract:Knowledge graphs (KGs) are powerful tools for modelling complex, multi-relational data and supporting hypothesis generation, particularly in applications like drug repurposing. However, for predictive methods to gain acceptance as credible scientific tools, they must ensure not only accuracy but also the capacity to offer meaningful scientific explanations. This paper presents a novel approach, REx, for generating scientific explanations based on link prediction in knowledge graphs. It employs reward and policy mechanisms that consider desirable properties of scientific explanation to guide a reinforcement learning agent in the identification of explanatory paths within a KG. The approach further enriches explanatory paths with domain-specific ontologies, ensuring that the explanations are both insightful and grounded in established biomedical knowledge. We evaluate our approach in drug repurposing using three popular knowledge graph benchmarks. The results clearly demonstrate its ability to generate explanations that validate predictive insights against biomedical knowledge and that outperform the state-of-the-art approaches in predictive performance, establishing REx as a relevant contribution to advance AI-driven scientific discovery.

[AI-20] Look: AI at Work! - Analysing Key Aspects of AI-support at the Work Place

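[Quick Read]: This paper addresses the technological and psychological challenges of applying Artificial Intelligence (AI) at the workplace, with the aim of supporting effective integration and sustained use of AI systems in work environments. On the technological side, it stresses the importance of high-quality data for training learning-based systems and the integration of human expertise into knowledge-based systems. On the psychological side, it focuses on users' acceptance of, openness to, and trust in AI systems, and proposes research questions to be built into system development, providing theoretical and practical guidance for building AI literacy and supporting human-AI collaboration.

Link: https://arxiv.org/abs/2509.02274
Authors: Stefan Schiffer,Anna Milena Rothermel,Alexander Ferrein,Astrid Rosenthal-von der Pütten
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 10 pages, accepted at the German Conference on Artificial Intelligence KI 2024 Workshop “HuMaIn”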

Abstract:In this paper we present an analysis of technological and psychological factors of applying artificial intelligence (AI) at the work place. We do so for twelve application cases in the context of a project where AI is integrated at work places and in work systems of the future. From a technological point of view we mainly look at the areas of AI that the applications are concerned with. This allows us to formulate recommendations in terms of what to look at in developing an AI application and what to pay attention to with regard to building AI literacy with different stakeholders using the system. This includes the importance of high-quality data for training learning-based systems as well as the integration of human expertise, especially with knowledge-based systems. In terms of the psychological factors we derive research questions to investigate in the development of AI supported work systems and to consider in future work, mainly concerned with topics such as acceptance, openness, and trust in an AI system.

[AI-21] VariAntNet: Learning Decentralized Control of Multi-Agent Systems

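[Quick Read]: This paper addresses the difficulty of keeping a simple multi-agent system cohesive in disaster response scenarios, in complex environments with limited local sensing and no reliable communication or centralized control. The core challenge is to keep a swarm of robots (called Ant Robots) connected and able to complete collaborative tasks efficiently without a shared coordinate system or explicit communication. The key to the solution is VariAntNet, a deep learning-based decentralized control model. It extracts geometric features from unordered, variable-sized local observations and is trained with a novel, differentiable, multi-objective, mathematically justified loss function that promotes cohesion by using properties of the visibility graph Laplacian matrix. In the gathering task, VariAntNet significantly outperforms an existing analytical solution, achieving more than double the convergence rate while maintaining high connectivity across swarm sizes, which offers a more practical trade-off for time-critical emergency tasks.

Link: https://arxiv.org/abs/2509.02271
Authors: Yigal Koifman,Erez Koifman,Eran Iceland,Ariel Barel,Alfred M. Bruckstein
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: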

Abstract:A simple multi-agent system can be effectively utilized in disaster response applications, such as firefighting. Such a swarm is required to operate in complex environments with limited local sensing and no reliable inter-agent communication or centralized control. These simple robotic agents, also known as Ant Robots, are defined as anonymous agents that possess limited sensing capabilities, lack a shared coordinate system, and do not communicate explicitly with one another. A key challenge for simple swarms lies in maintaining cohesion and avoiding fragmentation despite limited-range sensing. Recent advances in machine learning offer effective solutions to some of the classical decentralized control challenges. We propose VariAntNet, a deep learning-based decentralized control model designed to facilitate agent swarming and collaborative task execution. VariAntNet includes geometric feature extraction from unordered, variable-sized local observations. It incorporates a neural network architecture trained with a novel, differentiable, multi-objective, mathematically justified loss function that promotes swarm cohesiveness by utilizing the properties of the visibility graph Laplacian matrix. VariAntNet is demonstrated on the fundamental multi-agent gathering task, where agents with bearing-only and limited-range sensing must gather at some location. VariAntNet significantly outperforms an existing analytical solution, achieving more than double the convergence rate while maintaining high swarm connectivity across varying swarm sizes. While the analytical solution guarantees cohesion, it is often too slow in practice. In time-critical scenarios, such as emergency response operations where lives are at risk, slower analytical methods are impractical and justify the loss of some agents within the swarm. This paper presents and analyzes this trade-off in detail.
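
The visibility graph behind the loss function can be built directly from agent positions: two agents are linked when they lie within sensing range, and the graph Laplacian is the degree matrix minus the adjacency matrix. The sketch below constructs that matrix and checks connectivity by graph traversal; it is an illustration with hypothetical names and positions, and it does not reproduce the paper's differentiable loss.

```python
import math


def visibility_laplacian(positions, sensing_range):
    """Graph Laplacian L = D - A of the graph linking agents within sensing range."""
    n = len(positions)
    lap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(positions[i], positions[j]) <= sensing_range:
                lap[i][j] = lap[j][i] = -1  # off-diagonal: minus adjacency
                lap[i][i] += 1              # diagonal: node degree
                lap[j][j] += 1
    return lap


def is_connected(lap):
    """Depth-first search over the edges encoded in the Laplacian."""
    if not lap:
        return True
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j, value in enumerate(lap[i]):
            if value == -1 and j not in seen:
                seen.add(j)
                stack.append(j)
    return len(seen) == len(lap)


cohesive = visibility_laplacian([(0, 0), (1, 0), (2, 0)], sensing_range=1.5)
fragmented = visibility_laplacian([(0, 0), (1, 0), (5, 0)], sensing_range=1.5)
print(is_connected(cohesive), is_connected(fragmented))
```

Connectivity is equivalent to the second-smallest eigenvalue of this Laplacian being strictly positive, which is one reason Laplacian properties are a natural ingredient for a cohesion-promoting objective.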

[AI-22] An Epidemiological Knowledge Graph extracted from the World Health Organizations Disease Outbreak News

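[Quick Read]: This paper addresses delayed information access, inefficient manual processing, and the difficulty of integrating heterogeneous sources in traditional epidemiological surveillance. Its core solution uses generative AI: an ensemble of multiple Large Language Models (LLMs) automatically extracts actionable epidemiological information from the World Health Organization (WHO) Disease Outbreak News (DONs) and structures it into a daily-updated dataset and a knowledge graph (eKG), giving a fine-grained representation of public health domain knowledge that can be used efficiently.

Link: https://arxiv.org/abs/2509.02258
Authors: Sergio Consoli,Pietro Coletti,Peter V. Markov,Lia Orfei,Indaco Biazzo,Lea Schuh,Nicolas Stefanovitch,Lorenzo Bertolini,Mario Ceresa,Nikolaos I. Stilianakis
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 23 pages, 10 figures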

Abstract:The rapid evolution of artificial intelligence (AI), together with the increased availability of social media and news for epidemiological surveillance, are marking a pivotal moment in epidemiology and public health research. Leveraging the power of generative AI, we use an ensemble approach which incorporates multiple Large Language Models (LLMs) to extract valuable actionable epidemiological information from the World Health Organization (WHO) Disease Outbreak News (DONs). DONs is a collection of regular reports on global outbreaks curated by the WHO and the adopted decision-making processes to respond to them. The extracted information is made available in a daily-updated dataset and a knowledge graph, referred to as eKG, derived to provide a nuanced representation of the public health domain knowledge. We provide an overview of this new dataset and describe the structure of eKG, along with the services and tools used to access and utilize the data that we are building on top. These innovative data resources open altogether new opportunities for epidemiological research, and the analysis and surveillance of disease outbreaks.

[AI-23] LLMs for LLMs: A Structured Prompting Methodology for Long Legal Documents

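[Quick Read]: This paper addresses the slow adoption of Large Language Models (LLMs) in the legal domain, which stems from concerns about reliability and transparency. Its core solution is a structured prompting methodology offered as an alternative to costly fine-tuning. The main steps are: first, long legal documents from the CUAD dataset are processed through chunking and augmentation to ease the long-document problem; second, each chunk is fed to QWEN-2 together with an engineered prompt to generate candidate answers; finally, two heuristics, Distribution-based Localisation and Inverse Cardinality Weighting, are used to select among the candidates, reducing the impact of the black-box effect. The method reaches state-of-the-art information retrieval performance (up to 9% better than the previous method) and highlights the potential of structured prompt engineering for ensuring accountability and responsibility of AI systems in law.

Link: https://arxiv.org/abs/2509.02241
Authors: Strahinja Klem,Noura Al Moubayed
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 20 pages, 6 figures, 4 tables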

Abstract:The rise of Large Language Models (LLMs) has had a profoundly transformative effect on a number of fields and domains. However, their uptake in Law has proven more challenging due to the important issues of reliability and transparency. In this study, we present a structured prompting methodology as a viable alternative to the often expensive fine-tuning, with the capability of tackling long legal documents from the CUAD dataset on the task of information retrieval. Each document is first split into chunks via a system of chunking and augmentation, addressing the long document problem. Then, alongside an engineered prompt, the input is fed into QWEN-2 to produce a set of answers for each question. Finally, we tackle the resulting candidate selection problem with the introduction of the Distribution-based Localisation and Inverse Cardinality Weighting heuristics. This approach leverages a general purpose model to promote long term scalability, prompt engineering to increase reliability and the two heuristic strategies to reduce the impact of the black box effect. Whilst our model performs up to 9% better than the previously presented method, reaching state-of-the-art performance, it also highlights the limiting factor of current automatic evaluation metrics for question answering, serving as a call to action for future research. However, the chief aim of this work is to underscore the potential of structured prompt engineering as a useful, yet under-explored, tool in ensuring accountability and responsibility of AI in the legal domain, and beyond.
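
The first step of the pipeline, splitting a long document into chunks, can be illustrated with a sliding window over words. This is a generic sketch under stated assumptions: the overlap between neighbouring chunks, the word-level granularity, and the function name are choices made for the example, and the paper's own chunking and augmentation scheme is not specified in the abstract.

```python
def chunk_words(text: str, size: int, overlap: int) -> list[str]:
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    # Stop once the remaining words are already covered by the previous chunk's tail.
    starts = range(0, max(len(words) - overlap, 1), step)
    return [" ".join(words[i:i + size]) for i in starts]


# Hypothetical stand-in for a contract clause.
clause = "one two three four five six seven eight nine ten"
for chunk in chunk_words(clause, size=4, overlap=1):
    print(chunk)
```

Overlap keeps a sentence that straddles a boundary visible in at least one chunk, at the cost of the model answering the same question on slightly more text, which is one reason a candidate selection step is needed afterwards.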

[AI-24] Autoencoder-based non-intrusive model order reduction in continuum mechanics

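[Quick Read]: This paper addresses the limits of traditional Reduced-Order Modeling (ROM) in continuum mechanics, which struggles with high-dimensional nonlinear problems and often requires intrusive changes to the original solver. Its core solution is a non-intrusive, Autoencoder-based framework with three stages: an unsupervised Autoencoder first compresses high-dimensional finite element solutions into a low-dimensional latent space; a supervised regression network then maps problem parameters to latent codes; and an end-to-end surrogate finally reconstructs full-field solutions directly from the input parameters. The main extensions are a force-augmented variant that jointly predicts displacement fields and reaction forces at Neumann boundaries, and a multi-field architecture that supports coupled field predictions (for example in thermo-mechanical systems), giving accurate reconstruction and extensibility without changing the original physical model.

Link: https://arxiv.org/abs/2509.02237
Authors: Jannick Kehls,Ellen Kuhl,Tim Brepols,Kevin Linka,Hagen Holthusen
Affiliations: Unknown
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: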

Abstract:We propose a non-intrusive, Autoencoder-based framework for reduced-order modeling in continuum mechanics. Our method integrates three stages: (i) an unsupervised Autoencoder compresses high-dimensional finite element solutions into a compact latent space, (ii) a supervised regression network maps problem parameters to latent codes, and (iii) an end-to-end surrogate reconstructs full-field solutions directly from input parameters. To overcome limitations of existing approaches, we propose two key extensions: a force-augmented variant that jointly predicts displacement fields and reaction forces at Neumann boundaries, and a multi-field architecture that enables coupled field predictions, such as in thermo-mechanical systems. The framework is validated on nonlinear benchmark problems involving heterogeneous composites, anisotropic elasticity with geometric variation, and thermo-mechanical coupling. Across all cases, it achieves accurate reconstructions of high-fidelity solutions while remaining fully non-intrusive. These results highlight the potential of combining deep learning with dimensionality reduction to build efficient and extensible surrogate models. Our publicly available implementation provides a foundation for integrating data-driven model order reduction into uncertainty quantification, optimization, and digital twin applications.
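
The three stages can be written compactly as a composition of maps. The notation below is an assumption introduced for illustration (encoder $E_\theta$, decoder $D_\theta$, regression network $R_\phi$, finite element solution $u$, problem parameters $\mu$) and is not taken from the paper.

```latex
\begin{aligned}
z &= E_{\theta}(u)
  && \text{(i) the encoder compresses a finite element solution } u \\
\hat{z}(\mu) &= R_{\phi}(\mu)
  && \text{(ii) the regression network maps parameters } \mu \text{ to a latent code} \\
\hat{u}(\mu) &= D_{\theta}\bigl(R_{\phi}(\mu)\bigr)
  && \text{(iii) the surrogate reconstructs the full field from } \mu
\end{aligned}
```

Because stage (iii) needs only $\mu$ at evaluation time, the original solver is never called or modified once the networks are trained, which is what makes the approach non-intrusive.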

[AI-25] Application Of Large Language Models For The Extraction Of Information From Particle Accelerator Technical Documentation

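[Quick Read]: This paper addresses the difficulty of passing on the specialized knowledge held in the technical documentation of legacy particle accelerator systems. With experienced personnel retiring, preserving and transferring this domain knowledge efficiently has become urgent. The key to the solution is to use Large Language Models (LLMs) to automate the extraction, summarization, and organization of the expertise embedded in technical documents, reducing the risk of knowledge gaps caused by staff turnover. Initial experiments indicate that LLMs are effective on this domain-specific text; the authors also point out current limitations in interpretability and in handling rare domain-specific terms, and propose strategies for improvement.

Link: https://arxiv.org/abs/2509.02227
Authors: Qing Dai,Rasmus Ischebeck,Maruisz Sapinski,Adam Grycner
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Accelerator Physics (physics.acc-ph)
Comments: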

Abstract:The large set of technical documentation of legacy accelerator systems, coupled with the retirement of experienced personnel, underscores the urgent need for efficient methods to preserve and transfer specialized knowledge. This paper explores the application of large language models (LLMs) to automate and enhance the extraction of information from particle accelerator technical documents. By exploiting LLMs, we aim to address the challenges of knowledge retention, enabling the retrieval of domain expertise embedded in legacy documentation. We present initial results of adapting LLMs to this specialized domain. Our evaluation demonstrates the effectiveness of LLMs in extracting, summarizing, and organizing knowledge, significantly reducing the risk of losing valuable insights as personnel retire. Furthermore, we discuss the limitations of current LLMs, such as interpretability and handling of rare domain-specific terms, and propose strategies for improvement. This work highlights the potential of LLMs to play a pivotal role in preserving institutional knowledge and ensuring continuity in highly specialized fields.

[AI-26] Towards Multi-Aspect Diversification of News Recommendations Using Neuro-Symbolic AI for Individual and Societal Benefit RECSYS2025

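Quick read: This paper targets insufficient diversity in news recommendation, focusing on multi-aspect diversity (viewpoints, topics, and so on) across four recommendation modes: lists, sequences, summaries, and interactions. The proposed direction combines symbolic and subsymbolic AI, pairing knowledge graphs with rule learning to diversify recommendations systematically. The authors plan user studies that capture both behavior and perceived experience, aiming to increase serendipity for individuals and decrease polarization for society.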

Link: https://arxiv.org/abs/2509.02220
Authors: Markus Reiter-Haas, Elisabeth Lex
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted at INRA 2025: 13th International Workshop on News Recommendation and Analytics in Conjunction with ACM RecSys 2025

Abstract:News recommendations are complex, with diversity playing a vital role. So far, existing literature predominantly focuses on specific aspects of news diversity, such as viewpoints. In this paper, we introduce multi-aspect diversification in four distinct recommendation modes and outline the nuanced challenges in diversifying lists, sequences, summaries, and interactions. Our proposed research direction combines symbolic and subsymbolic artificial intelligence, leveraging both knowledge graphs and rule learning. We plan to evaluate our models using user studies to not only capture behavior but also their perceived experience. Our vision to balance news consumption points to other positive effects for users (e.g., increased serendipity) and society (e.g., decreased polarization).

[AI-27] ST-Hyper: Learning High-Order Dependencies Across Multiple Spatial-Temporal Scales for Multivariate Time Series Forecasting CIKM2025

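Quick read: This paper addresses the difficulty of modeling high-order dependencies across multiple spatial-temporal scales (ST-scales) in multivariate time series (MTS) forecasting. Existing methods typically capture dependencies at multiple spatial or temporal scales separately and may fail to model them jointly. The proposed ST-Hyper framework has three parts: (1) a Spatial-Temporal Pyramid Modeling (STPM) module that extracts features at multiple ST-scales; (2) an Adaptive Hypergraph Modeling (AHM) module that learns a sparse hypergraph to capture robust high-order dependencies among features; and (3) tri-phase hypergraph propagation, through which the features interact to capture multi-scale spatial-temporal dynamics.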

Link: https://arxiv.org/abs/2509.02217
Authors: Binqing Wu, Jianlong Huang, Zongjiang Shang, Ling Chen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by CIKM 2025

Abstract:In multivariate time series (MTS) forecasting, many deep learning based methods have been proposed for modeling dependencies at multiple spatial (inter-variate) or temporal (intra-variate) scales. However, existing methods may fail to model dependencies across multiple spatial-temporal scales (ST-scales, i.e., scales that jointly consider spatial and temporal scopes). In this work, we propose ST-Hyper to model the high-order dependencies across multiple ST-scales through adaptive hypergraph modeling. Specifically, we introduce a Spatial-Temporal Pyramid Modeling (STPM) module to extract features at multiple ST-scales. Furthermore, we introduce an Adaptive Hypergraph Modeling (AHM) module that learns a sparse hypergraph to capture robust high-order dependencies among features. In addition, we interact with these features through tri-phase hypergraph propagation, which can comprehensively capture multi-scale spatial-temporal dynamics. Experimental results on six real-world MTS datasets demonstrate that ST-Hyper achieves the state-of-the-art performance, outperforming the best baselines with an average MAE reduction of 3.8% and 6.8% for long-term and short-term forecasting, respectively.
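
To make the idea of hypergraph propagation concrete, here is a minimal sketch of one generic round of message passing on a hypergraph: features flow from nodes to the hyperedges that contain them, then back to the nodes. It is not the paper's tri-phase scheme, whose details the abstract does not give. The function name `hypergraph_propagate`, the mean aggregation, and the toy incidence matrix are all assumptions for illustration.

```python
def hypergraph_propagate(node_feats, incidence):
    """One generic node -> hyperedge -> node round (illustrative only).

    node_feats: one scalar feature per node.
    incidence:  incidence[v][e] is 1 if node v belongs to hyperedge e, else 0.
    """
    n_nodes, n_edges = len(incidence), len(incidence[0])

    # Step 1: each hyperedge averages the features of its member nodes.
    edge_feats = []
    for e in range(n_edges):
        members = [node_feats[v] for v in range(n_nodes) if incidence[v][e]]
        edge_feats.append(sum(members) / len(members) if members else 0.0)

    # Step 2: each node averages the features of its incident hyperedges.
    updated = []
    for v in range(n_nodes):
        incident = [edge_feats[e] for e in range(n_edges) if incidence[v][e]]
        updated.append(sum(incident) / len(incident) if incident else node_feats[v])
    return updated


# Toy example: hyperedge 0 groups nodes {0, 1, 2}; hyperedge 1 groups nodes {1, 3}.
features = [1.0, 2.0, 3.0, 4.0]
incidence = [[1, 0], [1, 1], [1, 0], [0, 1]]
print(hypergraph_propagate(features, incidence))
```

Because a hyperedge can contain more than two nodes, a single round already mixes information among whole groups, which is what "high-order dependency" refers to here.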

[AI-28] Baichuan-M2: Scaling Medical Capability with Large Verifier System

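Quick read: This paper addresses the gap between the strong performance of medical large language models (LLMs) on static benchmarks such as USMLE and their limited utility in real-world clinical decision-making, a gap that arises because traditional exams fail to capture the dynamic, interactive nature of medical consultations. The solution is a dynamic verification framework with two components: a Patient Simulator that builds realistic clinical environments from de-identified medical records, and a Clinical Rubrics Generator that dynamically produces multi-dimensional evaluation metrics. On this foundation the authors train Baichuan-M2, a 32B-parameter medical augmented reasoning model, through multi-stage reinforcement learning with an improved Group Relative Policy Optimization (GRPO) algorithm. On HealthBench, Baichuan-M2 outperforms all other open-source models and scores above 32 on HealthBench Hard, a mark previously exceeded only by GPT-5, establishing a new Pareto front in the performance-parameter trade-off for medical AI deployment.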

Link: https://arxiv.org/abs/2509.02208
Authors: Baichuan-M2 Team: Chengfeng Dou, Chong Liu, Fan Yang, Fei Li, Jiyuan Jia, Mingyang Chen, Qiang Ju, Shuai Wang, Shunya Dang, Tianpeng Li, Xiangrong Zeng, Yijie Zhou, Chenzheng Zhu, Da Pan, Fei Deng, Guangwei Ai, Guosheng Dong, Hongda Zhang, Jinyang Tai, Jixiang Hong, Kai Lu, Linzhuang Sun, Peidong Guo, Qian Ma, Rihui Xin, Shihui Yang, Shusen Zhang, Yichuan Mo, Zheng Liang, Zhishou Zhang, Hengfu Cui, Zuyi Zhu, Xiaochuan Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Baichuan-M2 Technical Report

Abstract:As large language models (LLMs) advance in conversational and reasoning capabilities, their practical application in healthcare has become a critical research focus. However, there is a notable gap between the performance of medical LLMs on static benchmarks such as USMLE and their utility in real-world clinical decision-making. This discrepancy arises because traditional exams fail to capture the dynamic, interactive nature of medical consultations. To address this challenge, we introduce a novel dynamic verification framework that moves beyond static answer verifiers, establishing a large-scale, high-fidelity interactive reinforcement learning system. Our framework comprises two key components: a Patient Simulator that creates realistic clinical environments using de-identified medical records, and a Clinical Rubrics Generator that dynamically produces multi-dimensional evaluation metrics. Building on this foundation, we develop Baichuan-M2, a 32B-parameter medical augmented reasoning model trained through a multi-stage reinforcement learning strategy with an improved Group Relative Policy Optimization (GRPO) algorithm. Evaluated on HealthBench, Baichuan-M2 outperforms all other open-source models and most advanced closed-source counterparts, achieving a score above 32 on the challenging HealthBench Hard benchmark, a mark previously exceeded only by GPT-5. Our work demonstrates that a robust dynamic verifier system is essential for aligning LLM capabilities with practical clinical applications, establishing a new Pareto front in the performance-parameter trade-off for medical AI deployment.
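
For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage at the core of the standard algorithm: several responses are sampled for the same prompt, and each reward is normalized against the mean and standard deviation of its own group. This is the commonly described baseline form, not the improved variant the report mentions, whose changes the abstract does not spell out. The rubric scores are made-up numbers.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (standard GRPO form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rubric scores for four responses sampled for one simulated patient.
rubric_scores = [0.2, 0.5, 0.5, 0.8]
advantages = group_relative_advantages(rubric_scores)
print(advantages)
```

Responses that score above their group's average get a positive advantage and are reinforced; no separate value network is needed, which is the main practical appeal of the group-relative form.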

[AI-29] Enhancing Reliability in LLM-Integrated Robotic Systems: A Unified Approach to Security and Safety

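Quick read: This paper addresses the reliability of robotic systems that integrate large language models (LLMs), covering both security against adversarial attacks and safety in complex environments. The proposed unified framework combines prompt assembling, state management, and safety validation to defend against prompt injection attacks while enforcing operational safety. Experiments show a 30.8% improvement under injection attacks and up to a 325% improvement in complex environments under adversarial conditions relative to baseline scenarios, bridging the gap between safety and security in LLM-based robotic systems.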

Link: https://arxiv.org/abs/2509.02163
Authors: Wenxiao Zhang, Xiangrui Kong, Conan Dewitt, Thomas Bräunl, Jin B. Hong
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Integrating large language models (LLMs) into robotic systems has revolutionised embodied artificial intelligence, enabling advanced decision-making and adaptability. However, ensuring reliability, encompassing both security against adversarial attacks and safety in complex environments, remains a critical challenge. To address this, we propose a unified framework that mitigates prompt injection attacks while enforcing operational safety through robust validation mechanisms. Our approach combines prompt assembling, state management, and safety validation, evaluated using both performance and security metrics. Experiments show a 30.8% improvement under injection attacks and up to a 325% improvement in complex environment settings under adversarial conditions compared to baseline scenarios. This work bridges the gap between safety and security in LLM-based robotic systems, offering actionable insights for deploying reliable LLM-integrated mobile robots in real-world settings. The framework is open-sourced with simulation and physical deployment demos at this https URL

[AI-30] A Theoretical Framework of the Processes of Change in Psychotherapy Delivered by Artificial Agents

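Quick read: This paper asks how psychotherapy delivered by artificial agents (for example chatbots and social robots) driven by generative AI brings about change, given that such agents lack the ontological status and sociocultural status of human therapists. It offers the first theoretical framework of the processes of change in psychotherapy delivered by artificial agents, proposing that human therapists' ontological status as human beings and sociocultural status as socially sanctioned healthcare professionals play crucial roles in promoting treatment outcomes. In their absence, two barriers can emerge, the genuineness gap and the credibility gap, and undermine key processes of change. On this basis the paper outlines directions for research and practice that leverage the respective strengths of artificial agents and human therapists, and discusses how the intricate agentic nature of artificial agents complicates universally applicable propositions.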

Link: https://arxiv.org/abs/2509.02144
Authors: Arthur Bran Herbener, Malene Flensborg Damholdt
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Submitted on 19 March 2025

Abstract:The question of whether artificial agents (e.g., chatbots and social robots) can replace human therapists has received notable attention following the recent launch of large language models. However, little is known about the processes of change in psychotherapy delivered by artificial agents. To facilitate hypothesis development and stimulate scientific debate, the present article offers the first theoretical framework of the processes of change in psychotherapy delivered by artificial agents. The theoretical framework rests upon a conceptual analysis of what active ingredients may be inherently linked to the presence of human therapists. We propose that human therapists’ ontological status as human beings and sociocultural status as socially sanctioned healthcare professionals play crucial roles in promoting treatment outcomes. In the absence of the ontological and sociocultural status of human therapists, we propose what we coin the genuineness gap and credibility gap can emerge and undermine key processes of change in psychotherapy. Based on these propositions, we propose avenues for scientific investigations and practical applications aimed at leveraging the strengths of artificial agents and human therapists respectively. We also highlight the intricate agentic nature of artificial agents and discuss how this complicates endeavors to establish universally applicable propositions regarding the processes of change in these interventions.

[AI-31] Learning Social Heuristics for Human-Aware Path Planning

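Quick read: This paper addresses how social robots can navigate human environments in a natural, socially accepted way, following social norms rather than only avoiding obstacles or keeping physical distance. Because such norms cannot arise from conventional navigation, the authors propose Heuristic Planning with Learned Social Value (HPLSV): learn a value function that encodes the cost of social navigation and use it as an additional heuristic in heuristic-search path planning. This preliminary work applies the method to joining a queue of people, with the intention of generalizing to further human activities.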

Link: https://arxiv.org/abs/2509.02134
Authors: Andrea Eirale, Matteo Leonetti, Marcello Chiaberge
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Social robotic navigation has been at the center of numerous studies in recent years. Most of the research has focused on driving the robotic agent along obstacle-free trajectories, respecting social distances from humans, and predicting their movements to optimize navigation. However, in order to really be socially accepted, the robots must be able to attain certain social norms that cannot arise from conventional navigation, but require a dedicated learning process. We propose Heuristic Planning with Learned Social Value (HPLSV), a method to learn a value function encapsulating the cost of social navigation, and use it as an additional heuristic in heuristic-search path planning. In this preliminary work, we apply the methodology to the common social scenario of joining a queue of people, with the intention of generalizing to further human activities.
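
The idea of adding a social term to a search heuristic can be sketched with a small grid A* planner. In this illustration the learned value function is replaced by a hand-written stand-in, `social_value`, that simply penalizes cells close to people; the grid world, the weight, and every function name are assumptions, not the authors' implementation. Note that an added term like this can make the heuristic inadmissible, so the returned path is valid but not guaranteed to be the shortest.

```python
import heapq


def social_value(cell, people, radius=2):
    """Hypothetical stand-in for a learned social value: penalty near people."""
    return sum(max(0, radius - (abs(cell[0] - p[0]) + abs(cell[1] - p[1])))
               for p in people)


def plan(start, goal, size, people, weight=1.0):
    """A* on a size x size grid; heuristic = Manhattan distance + social term."""
    def h(c):
        return abs(c[0] - goal[0]) + abs(c[1] - goal[1]) + weight * social_value(c, people)

    frontier = [(h(start), 0, start)]  # entries are (f, g, cell)
    parent = {start: None}
    cost = {start: 0}
    while frontier:
        _, g, cur = heapq.heappop(frontier)
        if cur == goal:
            path = []
            while cur is not None:  # walk parents back to the start
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        if g > cost[cur]:
            continue  # stale queue entry
        x, y = cur
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            inside = 0 <= nxt[0] < size and 0 <= nxt[1] < size
            if not inside or nxt in people:
                continue  # off the grid, or occupied by a person
            ng = g + 1
            if ng < cost.get(nxt, float("inf")):
                cost[nxt] = ng
                parent[nxt] = cur
                heapq.heappush(frontier, (ng + h(nxt), ng, nxt))
    return None


# One person stands in the middle of a 5 x 5 room.
path = plan((0, 0), (4, 4), 5, {(2, 2)})
print(path)
```

Setting `weight=0` recovers plain A* with a Manhattan heuristic, which makes it easy to compare the socially biased path with the purely geometric one.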

[AI-32] HiGraph: A Large-Scale Hierarchical Graph Dataset for Malware Analysis

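Quick read: This paper addresses the lack of large-scale, hierarchically structured datasets for graph-based malware analysis. Existing methods often reduce programs to single-level graphs and so fail to model the semantic relationship between high-level functional interactions and low-level instruction logic. The authors introduce HiGraph, the largest public hierarchical graph dataset for malware analysis, comprising over 200M Control Flow Graphs (CFGs) nested within 595K Function Call Graphs (FCGs). This two-level representation preserves the structural semantics needed to build detectors that are robust to code obfuscation and malware evolution.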

Link: https://arxiv.org/abs/2509.02113
Authors: Han Chen, Hanchen Wang, Hongmei Chen, Ying Zhang, Lu Qin, Wenjie Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Social and Information Networks (cs.SI)
Comments:

Abstract:The advancement of graph-based malware analysis is critically limited by the absence of large-scale datasets that capture the inherent hierarchical structure of software. Existing methods often oversimplify programs into single-level graphs, failing to model the crucial semantic relationship between high-level functional interactions and low-level instruction logic. To bridge this gap, we introduce HiGraph, the largest public hierarchical graph dataset for malware analysis, comprising over 200M Control Flow Graphs (CFGs) nested within 595K Function Call Graphs (FCGs). This two-level representation preserves structural semantics essential for building robust detectors resilient to code obfuscation and malware evolution. We demonstrate HiGraph’s utility through a large-scale analysis that reveals distinct structural properties of benign and malicious software, establishing it as a foundational benchmark for the community. The dataset and tools are publicly available at this https URL.
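
A two-level representation of this kind can be pictured as a call graph whose nodes each own a control flow graph. The sketch below is only a mental model: the class names, fields, and the toy program are assumptions, and the dataset's actual storage format is not described in the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class ControlFlowGraph:
    """Low level: the basic blocks of one function and the jumps between them."""
    blocks: list  # e.g. block labels or instruction sequences
    edges: list   # (source_block_index, destination_block_index)


@dataclass
class FunctionCallGraph:
    """High level: functions and calls; each function node owns one CFG."""
    cfgs: dict = field(default_factory=dict)   # function name -> ControlFlowGraph
    calls: list = field(default_factory=list)  # (caller, callee)

    def add_function(self, name, cfg):
        self.cfgs[name] = cfg

    def add_call(self, caller, callee):
        if caller not in self.cfgs or callee not in self.cfgs:
            raise KeyError("both functions must be added before linking them")
        self.calls.append((caller, callee))

    def stats(self):
        return {
            "functions": len(self.cfgs),
            "calls": len(self.calls),
            "basic_blocks": sum(len(c.blocks) for c in self.cfgs.values()),
        }


# Toy program: main loops and calls a decrypt routine.
fcg = FunctionCallGraph()
fcg.add_function("main", ControlFlowGraph(["entry", "loop", "exit"], [(0, 1), (1, 1), (1, 2)]))
fcg.add_function("decrypt", ControlFlowGraph(["entry", "ret"], [(0, 1)]))
fcg.add_call("main", "decrypt")
print(fcg.stats())
```

Keeping the CFG inside the call-graph node is what lets a model reason about instruction-level structure and inter-function structure together, rather than flattening both into one graph.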

[AI-33] AGI as Second Being: The Structural-Generative Ontology of Intelligence

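Quick read: This paper argues that current artificial intelligence (AI) systems, although broad in the tasks they perform, lack the depth of real intelligence. Its central claim is that true intelligence depends not on breadth of function but on the ability to generate new structures, coordinate them into reasons, and sustain an identity over time. The proposed Structural-Generative Ontology of Intelligence defines this depth through three necessary conditions, generativity, coordination, and sustaining, which separate surface simulation from genuine intelligence.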

Link: https://arxiv.org/abs/2509.02089
Authors: Maijunxian Wang, Ran Ji
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Artificial intelligence is often measured by the range of tasks it can perform. Yet wide ability without depth remains only an imitation. This paper proposes a Structural-Generative Ontology of Intelligence: true intelligence exists only when a system can generate new structures, coordinate them into reasons, and sustain its identity over time. These three conditions – generativity, coordination, and sustaining – define the depth that underlies real intelligence. Current AI systems, however broad in function, remain surface simulations because they lack this depth. Breadth is not the source of intelligence but the growth that follows from depth. If future systems were to meet these conditions, they would no longer be mere tools, but could be seen as a possible Second Being, standing alongside yet distinct from human existence.

[AI-34] Forecasting Future DDoS Attacks Using Long Short Term Memory (LSTM) Model

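Quick read: This paper addresses the forecasting of Distributed Denial of Service (DDoS) attacks. Forecasting studies remain limited compared with detection-focused research, which leaves defenses without a forward-looking basis. The authors apply deep learning models to newer, updated datasets to study current trends and produce forecasts that can inform mitigation plans. The methodology follows the Cross Industry Standard Process for Data Mining (CRISP-DM).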

Link: https://arxiv.org/abs/2509.02076
Authors: Kong Mun Yeen, Rafidah Md Noor, Wahidah Md Shah, Aslinda Hassan, Muhammad Umair Munir
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 18 pages

Abstract:This paper forecasts future Distributed Denial of Service (DDoS) attacks using deep learning models. Although several studies address forecasting DDoS attacks, they remain relatively limited compared to detection-focused research. By studying the current trends and forecasting based on newer and updated datasets, mitigation plans against the attacks can be planned and formulated. The methodology used in this research work conforms to the Cross Industry Standard Process for Data Mining (CRISP-DM) model.
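
Sequence models such as LSTMs are usually trained for forecasting on sliding windows: a fixed number of past observations as input and the next value or values as the target. The sketch below shows only that data-preparation step, which is a common convention rather than something the abstract describes. The function name and the weekly attack counts are invented for illustration.

```python
def make_windows(series, lookback, horizon=1):
    """Split a series into (past window, future target) training pairs."""
    pairs = []
    for i in range(len(series) - lookback - horizon + 1):
        past = series[i:i + lookback]
        future = series[i + lookback:i + lookback + horizon]
        pairs.append((past, future))
    return pairs


# Hypothetical weekly counts of observed attacks.
weekly_attacks = [3, 5, 4, 8, 12, 9, 15]
for past, future in make_windows(weekly_attacks, lookback=3):
    print(past, "->", future)
```

Each pair would then be fed to the forecasting model, with `lookback` and `horizon` chosen during the data preparation and modeling phases of a CRISP-DM style workflow.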

[AI-35] Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance

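Quick read: This paper addresses a key bottleneck in adapting Vision-Language-Action (VLA) models to downstream tasks: when the robot's embodiment or the task differs from the pre-training data, the action distributions become mismatched, and effective fine-tuning demands extensive data and compute. The proposed Align-Then-stEer (ATE) is a lightweight, data-efficient, plug-and-play adaptation framework. It first aligns disparate action spaces by constructing a unified latent space, in which a variational autoencoder constrained by reverse KL divergence embeds adaptation actions into modes of the pre-training action latent distribution. During fine-tuning it then steers the generation process of the diffusion- or flow-based VLA through a guidance mechanism that pushes the output distribution toward the target domain. Compared with direct fine-tuning, the method improves the average multi-task success rate by up to 9.8% in simulation and yields a 32% success rate gain in a real-world cross-embodiment setting.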

Link: https://arxiv.org/abs/2509.02055
Authors: Yang Zhang, Chenwei Wang, Ouyang Lu, Yuan Zhao, Yunfei Ge, Zhenglong Sun, Xiu Li, Chi Zhang, Chenjia Bai, Xuelong Li
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: The first three authors contributed equally

Abstract:Vision-Language-Action (VLA) models pre-trained on large, diverse datasets show remarkable potential for general-purpose robotic manipulation. However, a primary bottleneck remains in adapting these models to downstream tasks, especially when the robot’s embodiment or the task itself differs from the pre-training data. This discrepancy leads to a significant mismatch in action distributions, demanding extensive data and compute for effective fine-tuning. To address this challenge, we introduce Align-Then-stEer (ATE), a novel, data-efficient, and plug-and-play adaptation framework. ATE first aligns disparate action spaces by constructing a unified latent space, where a variational autoencoder constrained by reverse KL divergence embeds adaptation actions into modes of the pre-training action latent distribution. Subsequently, it steers the diffusion- or flow-based VLA’s generation process during fine-tuning via a guidance mechanism that pushes the model’s output distribution towards the target domain. We conduct extensive experiments on cross-embodiment and cross-task manipulation in both simulation and real world. Compared to direct fine-tuning of representative VLAs, our method improves the average multi-task success rate by up to 9.8% in simulation and achieves a striking 32% success rate gain in a real-world cross-embodiment setting. Our work presents a general and lightweight solution that greatly enhances the practicality of deploying VLA models to new robotic platforms and tasks.

[AI-36] Generative KI für TA

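Quick read: This paper examines the structural risks that arise when generative AI is used in technology assessment (TA), where generative AI is both a tool for TA work and a subject of TA research. After outlining the phenomenon and formulating requirements for its use in TA, the paper discusses the structural causes of the associated problems and notes that the structurally induced risks remain even as generative AI continues to develop. It concludes with proposed solutions, brief notes on their feasibility, and examples of generative AI in TA work.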

Link: https://arxiv.org/abs/2509.02053
Authors: Wolfgang Eppler, Reinhard Heil
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Written in German. To appear in Proceedings of NTA11 2025

Abstract:Many scientists use generative AI in their scientific work. People working in technology assessment (TA) are no exception. TA’s approach to generative AI is twofold: on the one hand, generative AI is used for TA work, and on the other hand, generative AI is the subject of TA research. After briefly outlining the phenomenon of generative AI and formulating requirements for its use in TA, the following article discusses in detail the structural causes of the problems associated with it. Although generative AI is constantly being further developed, the structurally induced risks remain. The article concludes with proposed solutions and brief notes on their feasibility, as well as some examples of the use of generative AI in TA work.

[AI-37] Privacy-Utility Trade-off in Data Publication: A Bilevel Optimization Framework with Curvature-Guided Perturbation

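Quick read: This paper addresses the trade-off between privacy protection and data utility in data publication: defending against membership inference attacks (MIA) while preserving the quality, accuracy, and diversity a dataset needs for downstream tasks. Existing techniques such as data perturbation and synthetic data generation often degrade performance. The solution is a bilevel optimization framework. The upper level focuses on utility: a discriminator guides generation so that perturbed latent variables map to high-quality samples. The lower level focuses on privacy: local extrinsic curvature on the data manifold quantifies each individual's vulnerability to MIA, and samples are perturbed toward low-curvature regions to suppress distinctive feature combinations that attackers can exploit. Alternating optimization of the two objectives balances privacy and utility; in experiments the method increases resistance to MIA and surpasses existing methods in sample quality and diversity.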

Link: https://arxiv.org/abs/2509.02048
Authors: Yi Yin, Guangquan Zhang, Hua Zuo, Jie Lu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:Machine learning models require datasets for effective training, but directly sharing raw data poses significant privacy risks such as membership inference attacks (MIA). To mitigate the risk, privacy-preserving techniques such as data perturbation, generalization, and synthetic data generation are commonly utilized. However, these methods often degrade data accuracy, specificity, and diversity, limiting the performance of downstream tasks and thus reducing data utility. Therefore, striking an optimal balance between privacy preservation and data utility remains a critical challenge. To address this issue, we introduce a novel bilevel optimization framework for the publication of private datasets, where the upper-level task focuses on data utility and the lower-level task focuses on data privacy. In the upper-level task, a discriminator guides the generation process to ensure that perturbed latent variables are mapped to high-quality samples, maintaining fidelity for downstream tasks. In the lower-level task, our framework employs local extrinsic curvature on the data manifold as a quantitative measure of individual vulnerability to MIA, providing a geometric foundation for targeted privacy protection. By perturbing samples toward low-curvature regions, our method effectively suppresses distinctive feature combinations that are vulnerable to MIA. Through alternating optimization of both objectives, we achieve a synergistic balance between privacy and utility. Extensive experimental evaluations demonstrate that our method not only enhances resistance to MIA in downstream tasks but also surpasses existing methods in terms of sample quality and diversity.

[AI-38] Fantastic Pretraining Optimizers and Where to Find Them

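Quick read: This paper addresses methodological flaws in comparisons of deep learning optimizers, specifically unequal hyperparameter tuning and limited or misleading evaluation setups, which have led to overestimates of the gains from alternative optimizers such as Muon and Soap. The authors run a systematic study of ten optimizers across four model scales (0.1B to 1.2B parameters) and data-to-model ratios of 1 to 8x the Chinchilla optimum, with rigorous per-optimizer hyperparameter tuning and evaluation at the end of training rather than at intermediate checkpoints. They find that optimal hyperparameters are optimizer-specific, and that matrix-based preconditioning does give a speedup, but one that shrinks with model scale, from 1.4x over AdamW at 0.1B parameters to about 1.1x at 1.2B, well below the previously claimed 1.4 to 2x.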

Link: https://arxiv.org/abs/2509.02046
Authors: Kaiyue Wen, David Hall, Tengyu Ma, Percy Liang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 108 pages, 8 figures, reproducible runs available at this https URL

Abstract:AdamW has long been the dominant optimizer in language model pretraining, despite numerous claims that alternative optimizers offer 1.4 to 2x speedup. We posit that two methodological shortcomings have obscured fair comparisons and hindered practical adoption: (i) unequal hyperparameter tuning and (ii) limited or misleading evaluation setups. To address these two issues, we conduct a systematic study of ten deep learning optimizers across four model scales (0.1B-1.2B parameters) and data-to-model ratios (1-8x the Chinchilla optimum). We find that fair and informative comparisons require rigorous hyperparameter tuning and evaluations across a range of model scales and data-to-model ratios, performed at the end of training. First, optimal hyperparameters for one optimizer may be suboptimal for another, making blind hyperparameter transfer unfair. Second, the actual speedup of many proposed optimizers over well-tuned baselines is lower than claimed and decreases with model size to only 1.1x for 1.2B parameter models. Thirdly, comparing intermediate checkpoints before reaching the target training budgets can be misleading, as rankings between two optimizers can flip during training due to learning rate decay. Through our thorough investigation, we find that all the fastest optimizers such as Muon and Soap, use matrices as preconditioners – multiplying gradients with matrices rather than entry-wise scalars. However, the speedup of matrix-based optimizers is inversely proportional to model scale, decreasing from 1.4x over AdamW for 0.1B parameter models to merely 1.1x for 1.2B parameter models.
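
The distinction between entry-wise scaling and matrix preconditioning can be seen on a single 2 x 2 gradient. The sketch below contrasts an Adam-style first step, which rescales every entry independently, with an orthogonalization of the whole matrix using the textbook cubic Newton-Schulz iteration. Muon-style optimizers are built around this family of iteration, but the coefficients, step count, and pure-Python helpers here are illustrative assumptions and do not reproduce any optimizer studied in the paper.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def entrywise_scale(grad, eps=1e-8):
    """Adam-style first step: each entry is divided by its own magnitude."""
    return [[g / (abs(g) + eps) for g in row] for row in grad]


def newton_schulz_orthogonalize(grad, steps=30):
    """Matrix preconditioning: iterate X <- 1.5 X - 0.5 (X X^T) X.

    After scaling by the Frobenius norm, the iteration drives every singular
    value of X toward 1, so the result approximates the orthogonal polar
    factor of a full-rank gradient.
    """
    norm = sum(g * g for row in grad for g in row) ** 0.5
    x = [[g / norm for g in row] for row in grad]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxt_x[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x


gradient = [[3.0, 1.0], [1.0, 2.0]]
print(entrywise_scale(gradient))              # every entry becomes roughly 1
print(newton_schulz_orthogonalize(gradient))  # close to the identity matrix
```

The entry-wise update ignores how rows and columns relate to each other, while the matrix update multiplies the gradient by another matrix and so reshapes it as a whole. That is the structural difference the abstract points to.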

[AI-39] Empowering Large Language Model for Sequential Recommendation via Multimodal Embeddings and Semantic IDs CIKM2025

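Quick read: This paper studies two problems in sequential recommendation (SR) based on large language models (LLMs): embedding collapse when pre-trained collaborative embeddings are incorporated, and catastrophic forgetting of quantized embeddings when semantic IDs are used. The proposed MME-SID framework integrates multimodal embeddings and quantized embeddings to mitigate embedding collapse. It introduces a Multimodal Residual Quantized Variational Autoencoder (MM-RQ-VAE) that uses maximum mean discrepancy (MMD) as the reconstruction loss to preserve intra-modal distance information and contrastive learning to capture inter-modal correlations. To alleviate catastrophic forgetting, the model is initialized with the trained multimodal code embeddings. Finally, the LLM is fine-tuned efficiently with LoRA (Low-Rank Adaptation) in a multimodal frequency-aware fusion manner.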

Link: https://arxiv.org/abs/2509.02017
Authors: Yuhao Wang, Junwei Pan, Xinhang Li, Maolin Wang, Yuan Wang, Yue Liu, Dapeng Liu, Jie Jiang, Xiangyu Zhao
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: CIKM 2025 Full Research Paper

Abstract:Sequential recommendation (SR) aims to capture users’ dynamic interests and sequential patterns based on their historical interactions. Recently, the powerful capabilities of large language models (LLMs) have driven their adoption in SR. However, we identify two critical challenges in existing LLM-based SR methods: 1) embedding collapse when incorporating pre-trained collaborative embeddings and 2) catastrophic forgetting of quantized embeddings when utilizing semantic IDs. These issues dampen the model scalability and lead to suboptimal recommendation performance. Therefore, based on LLMs like Llama3-8B-instruct, we introduce a novel SR framework named MME-SID, which integrates multimodal embeddings and quantized embeddings to mitigate embedding collapse. Additionally, we propose a Multimodal Residual Quantized Variational Autoencoder (MM-RQ-VAE) with maximum mean discrepancy as the reconstruction loss and contrastive learning for alignment, which effectively preserve intra-modal distance information and capture inter-modal correlations, respectively. To further alleviate catastrophic forgetting, we initialize the model with the trained multimodal code embeddings. Finally, we fine-tune the LLM efficiently using LoRA in a multimodal frequency-aware fusion manner. Extensive experiments on three public datasets validate the superior performance of MME-SID thanks to its capability to mitigate embedding collapse and catastrophic forgetting. The implementation code and datasets are publicly available for reproduction: this https URL.
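
Semantic IDs of the kind mentioned above are typically produced by residual quantization: an embedding is matched to the nearest entry of a first codebook, the leftover residual is matched against a second codebook, and so on, so the sequence of chosen indices becomes the item's ID. The sketch below shows only that lookup step with fixed, hand-picked codebooks. In an RQ-VAE the codebooks are learned jointly with an encoder and decoder, which is omitted here, and all names and numbers are illustrative assumptions.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))


def residual_quantize(vec, codebooks):
    """Quantize vec level by level; return the code indices and final residual."""
    residual = list(vec)
    ids = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        ids.append(idx)
        # Subtract the chosen code so the next level models what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return ids, residual


codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],    # coarse level
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],  # fine level
]
semantic_id, leftover = residual_quantize([1.2, 1.0], codebooks)
print(semantic_id, leftover)
```

Each additional level refines the approximation, so items with similar embeddings share ID prefixes. That shared-prefix structure is what makes such IDs usable as tokens for a language model.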

[AI-40] mFARM: Towards Multi-Faceted Fairness Assessment based on HARMs in Clinical Decision Support

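Quick read: This paper addresses the AI alignment challenge of deploying large language models (LLMs) in high-stakes medical settings, where models can inherit and amplify societal biases and produce significant disparities. Existing fairness evaluations rely on simplistic metrics that overlook the multi-dimensional nature of medical harms, and they can favor models that are fair only because they are clinically inert, defaulting to safe but potentially inaccurate outputs. The authors construct two large-scale, controlled benchmarks (ED-Triage and Opioid Analgesic Recommendation) comprising over 50,000 prompts with twelve race x gender variants and three context tiers, and propose Multi-faceted Fairness Assessment based on hARMs (mFARM), which audits three dimensions of disparity (Allocational, Stability, and Latent) and aggregates them into an mFARM score. An aggregated Fairness-Accuracy Balance (FAB) score is used to benchmark the trade-off between fairness and prediction accuracy.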

Link: https://arxiv.org/abs/2509.02007
Authors: Shreyash Adappanavar, Krithi Shailya, Gokul S Krishnan, Sriraam Natarajan, Balaraman Ravindran
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:The deployment of Large Language Models (LLMs) in high-stakes medical settings poses a critical AI alignment challenge, as models can inherit and amplify societal biases, leading to significant disparities. Existing fairness evaluation methods fall short in these contexts as they typically use simplistic metrics that overlook the multi-dimensional nature of medical harms. This also promotes models that are fair only because they are clinically inert, defaulting to safe but potentially inaccurate outputs. To address this gap, our contributions are mainly two-fold: first, we construct two large-scale, controlled benchmarks (ED-Triage and Opioid Analgesic Recommendation) from MIMIC-IV, comprising over 50,000 prompts with twelve race x gender variants and three context tiers. Second, we propose a multi-metric framework - Multi-faceted Fairness Assessment based on hARMs ( mFARM ) to audit fairness for three distinct dimensions of disparity (Allocational, Stability, and Latent) and aggregate them into an mFARM score. We also present an aggregated Fairness-Accuracy Balance (FAB) score to benchmark and observe trade-offs between fairness and prediction accuracy. We empirically evaluate four open-source LLMs (Mistral-7B, BioMistral-7B, Qwen-2.5-7B, Bio-LLaMA3-8B) and their finetuned versions under quantization and context variations. Our findings showcase that the proposed mFARM metrics capture subtle biases more effectively under various settings. We find that most models maintain robust performance in terms of mFARM score across varying levels of quantization but deteriorate significantly when the context is reduced. Our benchmarks and evaluation code are publicly released to enhance research in aligned AI for healthcare.

[AI-41] ACA-Net: Future Graph Learning for Logistical Demand-Supply Forecasting DASFAA2025

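Quick read: This paper addresses logistical demand-supply forecasting on online delivery platforms, where the central challenge is learning future order distribution information efficiently and accurately even though the problem is time-series-insensitive and strongly random. Conventional spatial-temporal analysis over long series struggles to capture such information. The proposed spatiotemporal learning model uses only two graphs, an ongoing graph and a global graph, to learn the future order distribution, and the ACA-Net framework combines adaptive future graph learning with a cross attention mechanism to improve forecasts of logistical demand-supply pressure.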

Link: https://arxiv.org/abs/2509.01997
Authors: Jiacheng Shi, Haibin Wei, Jiang Wang, Xiaowei Xu, Longzhi Du, Taixu Jiang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, DASFAA2025 conference full paper

Abstract:Logistical demand-supply forecasting that evaluates the alignment between projected supply and anticipated demand, is essential for the efficiency and quality of on-demand food delivery platforms and serves as a key indicator for scheduling decisions. Future order distribution information, which reflects the distribution of orders in on-demand food delivery, is crucial for the performance of logistical demand-supply forecasting. Current studies utilize spatial-temporal analysis methods to model future order distribution information from serious time slices. However, learning future order distribution in online delivery platform is a time-series-insensitive problem with strong randomness. These approaches often struggle to effectively capture this information while remaining efficient. This paper proposes an innovative spatiotemporal learning model that utilizes only two graphs (ongoing and global) to learn future order distribution information, achieving superior performance compared to traditional spatial-temporal long-series methods. The main contributions are as follows: (1) The introduction of ongoing and global graphs in logistical demand-supply pressure forecasting compared to traditional long time series significantly enhances forecasting performance. (2) An innovative graph learning network framework using adaptive future graph learning and innovative cross attention mechanism (ACA-Net) is proposed to extract future order distribution information, effectively learning a robust future graph that substantially improves logistical demand-supply pressure forecasting outcomes. (3) The effectiveness of the proposed method is validated in real-world production environments.

[AI-42] A Continuous Encoding-Based Representation for Efficient Multi-Fidelity Multi-Objective Neural Architecture Search

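Quick read: This paper addresses the high computational budget of neural architecture search (NAS) when optimizing multiple important, conflicting objectives. It proposes an adaptive Co-Kriging-assisted multi-fidelity multi-objective NAS algorithm. One key element is a clustering-based local multi-fidelity infill sampling strategy that explores the search space efficiently and converges faster. The other is a novel continuous encoding of the node connections in each cell of a generalized cell-based U-Net backbone, which reduces the search dimension (number of variables) and the computational cost. Under a limited computational budget, the method outperforms previously published state-of-the-art methods on three numerical benchmarks, a 2D Darcy flow regression problem, and a CHASE_DB1 biomedical image segmentation problem, and it is then applied to build a wind velocity regression model for urban modelling.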

Link: https://arxiv.org/abs/2509.01943
Authors: Zhao Wei, Chin Chun Ooi, Yew-Soon Ong
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Neural architecture search (NAS) is an attractive approach to automate the design of optimized architectures but is constrained by high computational budget, especially when optimizing for multiple, important conflicting objectives. To address this, an adaptive Co-Kriging-assisted multi-fidelity multi-objective NAS algorithm is proposed to further reduce the computational cost of NAS by incorporating a clustering-based local multi-fidelity infill sampling strategy, enabling efficient exploration of the search space for faster convergence. This algorithm is further accelerated by the use of a novel continuous encoding method to represent the connections of nodes in each cell within a generalized cell-based U-Net backbone, thereby decreasing the search dimension (number of variables). Results indicate that the proposed NAS algorithm outperforms previously published state-of-the-art methods under limited computational budget on three numerical benchmarks, a 2D Darcy flow regression problem and a CHASE_DB1 biomedical image segmentation problem. The proposed method is subsequently used to create a wind velocity regression model with application in urban modelling, with the found model able to achieve good prediction with less computational complexity. Further analysis revealed that the NAS algorithm independently identified principles undergirding superior U-Net architectures in other literature, such as the importance of allowing each cell to incorporate information from prior cells.

[AI-43] Dynamic Speculative Agent Planning

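Quick read: This paper addresses the high latency and inference cost that agents based on Large Language Models (LLMs) face in real deployments. Existing acceleration methods lose performance fidelity, require extensive offline training of router modules, or incur extra operational cost, and they give users little control over the tradeoff between acceleration and other performance metrics. The key to the solution is Dynamic Speculative Planning (DSP), an asynchronous online reinforcement learning framework that delivers lossless acceleration without pre-deployment preparation. DSP optimizes a joint objective balancing end-to-end latency against dollar cost, so practitioners can tune a single parameter to move between faster responses and cheaper operation. In experiments, DSP matches the efficiency of the fastest lossless acceleration method while reducing total cost by 30% and unnecessary cost by up to 60%.

Link: https://arxiv.org/abs/2509.01920
Authors: Yilin Guan, Wenyue Hua, Qingfeng Lan, Sun Fei, Dujian Ding, Devang Acharya, Chi Wang, William Yang Wang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 19 pages, 11 figures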

Abstract:Despite their remarkable success in complex tasks propelling widespread adoption, large language-model-based agents still face critical deployment challenges due to prohibitive latency and inference costs. While recent work has explored various methods to accelerate inference, existing approaches suffer from significant limitations: they either fail to preserve performance fidelity, require extensive offline training of router modules, or incur excessive operational costs. Moreover, they provide minimal user control over the tradeoff between acceleration and other performance metrics. To address these gaps, we introduce Dynamic Speculative Planning (DSP), an asynchronous online reinforcement learning framework that provides lossless acceleration with substantially reduced costs without requiring additional pre-deployment preparation. DSP explicitly optimizes a joint objective balancing end-to-end latency against dollar cost, allowing practitioners to adjust a single parameter that steers the system toward faster responses, cheaper operation, or any point along this continuum. Experiments on two standard agent benchmarks demonstrate that DSP achieves comparable efficiency to the fastest lossless acceleration method while reducing total cost by 30% and unnecessary cost up to 60%. Our code and data are available through this https URL.
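
The single-parameter tradeoff described in the abstract can be pictured with a small sketch. This is not the DSP algorithm itself (which is an online reinforcement learning method); it only illustrates how one weight can steer a choice between latency and cost. The function names, the `alpha` parameter, and the candidate numbers are all hypothetical, and latency and cost are assumed to be pre-normalized to comparable scales.

```python
def joint_objective(latency, cost, alpha):
    """Blend latency and cost with one knob (hypothetical form).

    alpha = 1.0 scores latency only; alpha = 0.0 scores cost only.
    Both inputs are assumed to be normalized to comparable scales.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * latency + (1.0 - alpha) * cost


def pick_config(candidates, alpha):
    """Return the candidate with the lowest blended score."""
    return min(
        candidates,
        key=lambda c: joint_objective(c["latency"], c["cost"], alpha),
    )


# Illustrative, made-up operating points for a speculative planner.
CANDIDATES = [
    {"name": "aggressive", "latency": 2.0, "cost": 9.0},
    {"name": "balanced", "latency": 4.0, "cost": 4.0},
    {"name": "conservative", "latency": 9.0, "cost": 2.0},
]
```

Sweeping `alpha` from 0 to 1 moves the selection from the cheapest operating point toward the fastest one, which is the kind of continuum the abstract refers to.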

[AI-44] VISP: Volatility Informed Stochastic Projection for Adaptive Regularization

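Quick read: This paper addresses overfitting in deep neural networks, particularly where conventional regularizers such as fixed noise injection or fixed dropout rates cannot adapt to changing gradient dynamics during training. The key to the solution is VISP (Volatility Informed Stochastic Projection), an adaptive regularization method that uses gradient volatility to scale a stochastic projection matrix. It selectively regularizes inputs and hidden nodes with high gradient volatility while preserving stable representations, which improves generalization.

Link: https://arxiv.org/abs/2509.01903
Authors: Tanvir Islam
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: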

Abstract:We propose VISP: Volatility Informed Stochastic Projection, an adaptive regularization method that leverages gradient volatility to guide stochastic noise injection in deep neural networks. Unlike conventional techniques that apply uniform noise or fixed dropout rates, VISP dynamically computes volatility from gradient statistics and uses it to scale a stochastic projection matrix. This mechanism selectively regularizes inputs and hidden nodes that exhibit higher gradient volatility while preserving stable representations, thereby mitigating overfitting. Extensive experiments on MNIST, CIFAR-10, and SVHN demonstrate that VISP consistently improves generalization performance over baseline models and fixed-noise alternatives. In addition, detailed analyses of the evolution of volatility, the spectral properties of the projection matrix, and activation distributions reveal that VISP not only stabilizes the internal dynamics of the network but also fosters a more robust feature representation.
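
A rough sketch of the idea, under strong simplifying assumptions: volatility is taken to be the per-unit standard deviation of recent gradients, and the stochastic projection is reduced to a diagonal multiplicative perturbation. The paper's actual projection matrix and volatility statistic may differ; `gradient_volatility`, `noisy_projection`, and `base_scale` are hypothetical names.

```python
import random
import statistics


def gradient_volatility(grad_history):
    """Per-unit standard deviation of gradients over recent steps.

    grad_history is a list of steps, each a list with one gradient per unit.
    """
    n_units = len(grad_history[0])
    return [
        statistics.pstdev([step[i] for step in grad_history])
        for i in range(n_units)
    ]


def noisy_projection(x, volatility, base_scale=0.1, rng=None):
    """Perturb each unit in proportion to its relative volatility.

    Simplification: a diagonal projection (1 + noise) per unit, where the
    noise standard deviation is base_scale * volatility / max(volatility).
    Units with zero volatility pass through unchanged.
    """
    rng = rng or random.Random(0)
    peak = max(volatility) or 1.0  # avoid dividing by zero
    out = []
    for xi, vi in zip(x, volatility):
        sigma = base_scale * vi / peak
        out.append(xi * (1.0 + rng.gauss(0.0, sigma)))
    return out
```

In this toy version a unit whose gradient never changes receives no noise at all, while the most volatile unit receives noise with standard deviation `base_scale`.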

[AI-45] Preserving Bilinear Weight Spectra with a Signed and Shrunk Quadratic Activation Function

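Quick read: Many mechanistic interpretability techniques rely on activation-driven analyses, whereas deriving meaningful features directly from a network's weights would give stronger guarantees and better computational efficiency. Existing weight-based techniques, however, suffer from reduced performance and data inefficiency. The key to the solution is Signed Quadratic Shrink (SQS), an activation function designed for Gated Linear Units (GLUs) that enables weight-based interpretability while remaining competitive with state-of-the-art activation functions.

Link: https://arxiv.org/abs/2509.01874
Authors: Jason Abohwo, Thomas Mosen
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: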

Abstract:Understanding the inner workings of machine learning models is critical for ensuring their reliability and robustness. Whilst many techniques in mechanistic interpretability focus on activation driven analyses, being able to derive meaningful features directly from the weights of a neural network would provide greater guarantees and more computational efficiency. Existing techniques for analyzing model features through weights suffer from drawbacks such as reduced performance and data inefficiency. In this paper, we introduce Signed Quadratic Shrink (SQS), an activation function designed to allow Gated Linear Units (GLUs) to learn interpretable features without these drawbacks. Our experimental results show that SQS achieves performance competitive with state-of-the-art activation functions whilst enabling weight-based interpretability.

[AI-46] Community-Centered Spatial Intelligence for Climate Adaptation at Nova Scotia's Eastern Shore

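Quick read: This paper addresses the existential threat that climate change poses to Nova Scotia's Eastern Shore in Canada, a collection of rural villages with deep ties to the sea whose way of life is at risk. The key to the solution is a human-centered co-creation model: expertise from Computer Science, Industrial Engineering, and Coastal Geography is combined with the generational knowledge of residents, particularly elders, to build digital tools and a living digital archive with the community, strengthening its climate resilience. Student teams work directly with residents, and the paper offers a replicable model for how technology can support traditional communities.

Link: https://arxiv.org/abs/2509.01845
Authors: Gabriel Spadon, Oladapo Oyebode, Camilo M. Botero, Tushar Sharma, Floris Goerlandt, Ronald Pelot
Affiliation: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: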

Abstract:This paper presents an overview of a human-centered initiative aimed at strengthening climate resilience along Nova Scotia’s Eastern Shore. This region, a collection of rural villages with deep ties to the sea, faces existential threats from climate change that endanger its way of life. Our project moves beyond a purely technical response, weaving together expertise from Computer Science, Industrial Engineering, and Coastal Geography to co-create tools with the community. By integrating generational knowledge of residents, particularly elders, through the Eastern Shore Citizen Science Coastal Monitoring Network, this project aims to collaborate in building a living digital archive. This effort is hosted under Dalhousie University’s Transforming Climate Action (TCA) initiative, specifically through its Transformative Adaptations to Social-Ecological Climate Change Trajectories (TranSECT) and TCA Artificial Intelligence (TCA-AI) projects. This work is driven by a collaboration model in which student teams work directly with residents. We present a detailed project timeline and a replicable model for how technology can support traditional communities, enabling them to navigate climate transformation more effectively.

[AI-47] GradES: Significantly Faster Training in Transformers with Gradient-Based Early Stopping

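Quick read: This paper addresses the high computational cost of early stopping when fine-tuning large transformers, where repeatedly computing the global validation loss requires lengthy validation inference. The key to the solution is GradES, a gradient-based early stopping method that monitors the gradient magnitudes of transformer components (attention projection matrices and feed-forward layer matrices) during backpropagation. When a matrix's gradients fall below a convergence threshold τ, that matrix alone is frozen, which removes costly validation passes while slow-converging matrices keep learning. The method speeds up training by 1.57–7.22× and improves generalization, with 1.2% higher average accuracy.

Link: https://arxiv.org/abs/2509.01842
Authors: Qifu Wen, Xi Zeng, Zihan Zhou, Shuaijun Liu, Mehdi Hosseinzadeh, Reza Rawassizadeh
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 3 figures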

Abstract:Early stopping monitors global validation loss and halts all parameter updates simultaneously, which is computationally costly for large transformers due to the extended time required for validation inference. We propose GradES, a novel gradient-based early stopping approach that operates within transformer components (attention projections and Feed-Forward layer matrices). We found that different components converge at varying rates during fine-tuning. GradES tracks the magnitude of gradients in backpropagation for these matrices during training. When a projection matrix’s gradients fall below a convergence threshold τ, we exclude that projection matrix from further updates individually, eliminating costly validation passes while allowing slow converging matrices to continue learning. By strategically freezing parameters when their gradients converge, GradES speeds up training time by 1.57–7.22× while simultaneously enhancing generalization through early prevention of overfitting, resulting in 1.2% higher average accuracy.
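
The per-matrix freezing rule can be sketched in a few lines. This is an illustration only: the abstract says the method tracks gradient "magnitude" but does not say which norm, so the L1 norm below is an assumption, as are the class name `GradESMonitor` and the matrix names used in the example.

```python
def grad_magnitude(grad):
    """L1 norm of a 2D gradient (the choice of norm is an assumption)."""
    return sum(abs(g) for row in grad for g in row)


class GradESMonitor:
    """Freeze each matrix independently once its gradient falls below tau."""

    def __init__(self, names, tau):
        self.tau = tau
        self.frozen = {name: False for name in names}

    def update(self, grads):
        """Record one step of gradients and return the still-trainable names.

        grads maps a matrix name to its gradient as a list of rows.
        A matrix stays frozen once frozen, even if its gradient grows later.
        """
        for name, grad in grads.items():
            if not self.frozen[name] and grad_magnitude(grad) < self.tau:
                self.frozen[name] = True
        return [name for name, is_frozen in self.frozen.items() if not is_frozen]
```

In a training loop, the optimizer step would then be applied only to the names returned by `update`, so no validation pass is needed to decide when a given matrix stops learning.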

[AI-48] Goal-Conditioned Reinforcement Learning for Data-Driven Maritime Navigation

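Quick read: This paper addresses vessel routing through narrow and dynamic waterways. Existing approaches generalize poorly across multiple origin-destination pairs and do not exploit large-scale, data-driven traffic graphs. The key to the solution is a reinforcement learning approach for large-scale maritime data: agents learn direction and speed choices in a multi-discrete action space, using a traffic graph derived from the Automatic Identification System (AIS) together with ERA5 wind fields; a reward function balances fuel efficiency, travel time, wind resistance, and route diversity; and Proximal Policy Optimization (PPO) is combined with recurrent networks, invalid-action masking, and exploration strategies. Experiments show that action masking clearly improves policy performance and that adding positive shaping rewards to penalty-only feedback yields further gains.

Link: https://arxiv.org/abs/2509.01838
Authors: Vaishnav Vaidheeswaran, Dilith Jayakody, Samruddhi Mulay, Anand Lo, Md Mahbub Alam, Gabriel Spadon
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: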

Abstract:Routing vessels through narrow and dynamic waterways is challenging due to changing environmental conditions and operational constraints. Existing vessel-routing studies typically fail to generalize across multiple origin-destination pairs and do not exploit large-scale, data-driven traffic graphs. In this paper, we propose a reinforcement learning solution for big maritime data that can learn to find a route across multiple origin-destination pairs while adapting to different hexagonal grid resolutions. Agents learn to select direction and speed under continuous observations in a multi-discrete action space. A reward function balances fuel efficiency, travel time, wind resistance, and route diversity, using an Automatic Identification System (AIS)-derived traffic graph with ERA5 wind fields. The approach is demonstrated in the Gulf of St. Lawrence, one of the largest estuaries in the world. We evaluate configurations that combine Proximal Policy Optimization with recurrent networks, invalid-action masking, and exploration strategies. Our experiments demonstrate that action masking yields a clear improvement in policy performance and that supplementing penalty-only feedback with positive shaping rewards produces additional gains.
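
Invalid-action masking, which the abstract credits with a clear improvement, is commonly implemented by setting the logits of disallowed actions to negative infinity before the softmax. The sketch below shows that general technique in plain Python; it is not taken from the paper's code, and the function name is hypothetical.

```python
import math


def masked_softmax(logits, valid):
    """Softmax over logits in which invalid actions get probability zero.

    valid is a list of booleans, one per action. Invalid logits are replaced
    by -inf so they contribute exp(-inf) = 0 to the normalizer.
    """
    masked = [l if ok else -math.inf for l, ok in zip(logits, valid)]
    peak = max(masked)
    if peak == -math.inf:
        raise ValueError("at least one action must be valid")
    exps = [math.exp(m - peak) for m in masked]  # subtract peak for stability
    total = sum(exps)
    return [e / total for e in exps]
```

For a routing agent, `valid` could for instance mark which neighbouring grid cells are reachable from the current cell, so the policy never samples a move onto land.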

[AI-49] Multi-vessel Interaction-Aware Trajectory Prediction and Collision Risk Assessment

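Quick read: This paper addresses the limitation that existing data-driven models mainly forecast single vessels and overlook vessel interactions, navigation rules, and explicit collision risk assessment. The key to the solution is a transformer-based multi-vessel trajectory prediction framework. Parallel streams jointly encode kinematic and derived physical features, causal convolutions capture temporal locality, spatial transformations provide positional encoding, and hybrid positional embeddings model both local motion patterns and long-range dependencies. By simulating interactions among the predicted trajectories, the framework quantifies potential collision risk, offering actionable insight for maritime safety and decision support.

Link: https://arxiv.org/abs/2509.01836
Authors: Md Mahbub Alam, Jose F. Rodrigues-Jr, Gabriel Spadon
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: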

Abstract:Accurate vessel trajectory prediction is essential for enhancing situational awareness and preventing collisions. Still, existing data-driven models are constrained mainly to single-vessel forecasting, overlooking vessel interactions, navigation rules, and explicit collision risk assessment. We present a transformer-based framework for multi-vessel trajectory prediction with integrated collision risk analysis. For a given target vessel, the framework identifies nearby vessels. It jointly predicts their future trajectories through parallel streams encoding kinematic and derived physical features, causal convolutions for temporal locality, spatial transformations for positional encoding, and hybrid positional embeddings that capture both local motion patterns and long-range dependencies. Evaluated on large-scale real-world AIS data using joint multi-vessel metrics, the model demonstrates superior forecasting capabilities beyond traditional single-vessel displacement errors. By simulating interactions among predicted trajectories, the framework further quantifies potential collision risks, offering actionable insights to strengthen maritime safety and decision support.
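
The abstract does not say how collision risk is computed from the predicted trajectories. One simple possibility, shown here purely as an assumed illustration, is to find the closest approach between two time-aligned predicted tracks and map it to a score against a safety distance. Positions are assumed to be planar coordinates in a common unit; `min_separation`, `collision_risk`, and `safe_distance` are hypothetical names.

```python
import math


def min_separation(traj_a, traj_b):
    """Closest approach between two time-aligned tracks.

    Each track is a list of (x, y) positions sampled at the same time steps.
    Returns (distance, step_index) for the step with the smallest distance.
    """
    return min(
        (math.dist(a, b), i) for i, (a, b) in enumerate(zip(traj_a, traj_b))
    )


def collision_risk(traj_a, traj_b, safe_distance):
    """Map closest approach to a score in [0, 1] (assumed linear form).

    0.0 means the vessels never come within safe_distance;
    1.0 means the predicted positions coincide at some step.
    """
    distance, _ = min_separation(traj_a, traj_b)
    return max(0.0, 1.0 - distance / safe_distance)
```

A production system would more likely work in geodetic coordinates and interpolate between time steps; this sketch only conveys the step from predicted tracks to a scalar risk.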

[AI-50] Journalists' Perceptions of Artificial Intelligence and Disinformation Risks

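Quick read: This paper examines journalists' perceptions of how AI, and generative AI in particular, affects disinformation risk in journalism. A quantitative survey of 504 journalists in the Basque Country found that a large majority (89.88%) believe AI will considerably or significantly increase disinformation risk; this perception is consistent across genders and media types but more pronounced among more experienced journalists. The main risks identified are the difficulty of detecting false content and deepfakes and the risk of obtaining inaccurate or erroneous data, and co-occurrence analysis shows that these risks are often perceived as interconnected.

Link: https://arxiv.org/abs/2509.01824
Authors: Urko Peña-Alonso, Simón Peña-Fernández, Koldobika Meso-Ayerdi
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 18 pages, 5 figures, 2 tables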

Abstract:This study examines journalists’ perceptions of the impact of artificial intelligence (AI) on disinformation, a growing concern in journalism due to the rapid expansion of generative AI and its influence on news production and media organizations. Using a quantitative approach, a structured survey was administered to 504 journalists in the Basque Country, identified through official media directories and with the support of the Basque Association of Journalists. This survey, conducted online and via telephone between May and June 2024, included questions on sociodemographic and professional variables, as well as attitudes toward AI’s impact on journalism. The results indicate that a large majority of journalists (89.88%) believe AI will considerably or significantly increase the risks of disinformation, and this perception is consistent across genders and media types, but more pronounced among those with greater professional experience. Statistical analyses reveal a significant association between years of experience and perceived risk, and between AI use and risk perception. The main risks identified are the difficulty in detecting false content and deepfakes, and the risk of obtaining inaccurate or erroneous data. Co-occurrence analysis shows that these risks are often perceived as interconnected. These findings highlight the complex and multifaceted concerns of journalists regarding AI’s role in the information ecosystem.

[AI-51] When LLM Meets Time Series: Can LLMs Perform Multi-Step Time Series Reasoning and Inference

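Quick read: This paper addresses the underexplored ability of Large Language Models (LLMs) to perform complex reasoning over temporal data in real-world time series analysis, especially multi-step and compositional tasks such as constraint-aware forecasting and anomaly detection with threshold calibration. The key to the solution is the TSAIA Benchmark, a first attempt to evaluate LLMs as time-series AI assistants. It draws 33 real-world task formulations from a survey of over 20 academic publications, uses task-specific success criteria and tailored inference-quality metrics, and includes a dynamic, extensible question generator that supports adding new datasets or task types.

Link: https://arxiv.org/abs/2509.01822
Authors: Wen Ye, Jinbo Liu, Defu Cao, Wei Yang, Yan Liu
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: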

Abstract:The rapid advancement of Large Language Models (LLMs) has sparked growing interest in their application to time series analysis tasks. However, their ability to perform complex reasoning over temporal data in real-world application domains remains underexplored. To move toward this goal, a first step is to establish a rigorous benchmark dataset for evaluation. In this work, we introduce the TSAIA Benchmark, a first attempt to evaluate LLMs as time-series AI assistants. To ensure both scientific rigor and practical relevance, we surveyed over 20 academic publications and identified 33 real-world task formulations. The benchmark encompasses a broad spectrum of challenges, ranging from constraint-aware forecasting to anomaly detection with threshold calibration: tasks that require compositional reasoning and multi-step time series analysis. The question generator is designed to be dynamic and extensible, supporting continuous expansion as new datasets or task types are introduced. Given the heterogeneous nature of the tasks, we adopt task-specific success criteria and tailored inference-quality metrics to ensure meaningful evaluation for each task. We apply this benchmark to assess eight state-of-the-art LLMs under a unified evaluation protocol. Our analysis reveals limitations in current models’ ability to assemble complex time series analysis workflows, underscoring the need for specialized methodologies for domain-specific adaptation. Our benchmark is available at this https URL, and the code is available at this https URL.

[AI-52] A Multi-target Bayesian Transformer Framework for Predicting Cardiovascular Disease Biomarkers during Pandemics

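Quick read: This paper addresses how disruptions to healthcare during the COVID-19 pandemic changed key biomarkers in patients with cardiovascular disease (CVD), namely LDL cholesterol (LDL-C), HbA1c, BMI, and systolic blood pressure (SysBP), and in particular how to jointly model these multi-target biomarkers from Electronic Health Records (EHRs). The key to the solution is MBT-CB, a multi-target Bayesian transformer. It combines embeddings from a pre-trained BERT-based transformer to capture temporal relationships, a DeepMTR model to capture biomarker inter-relationships, and Bayesian variational inference to estimate both data and model uncertainty.

Link: https://arxiv.org/abs/2509.01794
Authors: Trusting Inekwe, Emmanuel Agu, Winnie Mkandawire, Andres Colubri
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: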

Abstract:The COVID-19 pandemic disrupted healthcare systems worldwide, disproportionately impacting individuals with chronic conditions such as cardiovascular disease (CVD). These disruptions, through delayed care and behavioral changes, affected key CVD biomarkers, including LDL cholesterol (LDL-C), HbA1c, BMI, and systolic blood pressure (SysBP). Accurate modeling of these changes is crucial for predicting disease progression and guiding preventive care. However, prior work has not addressed multi-target prediction of CVD biomarkers from Electronic Health Records (EHRs) using machine learning (ML), while jointly capturing biomarker interdependencies, temporal patterns, and predictive uncertainty. In this paper, we propose MBT-CB, a Multi-target Bayesian Transformer (MBT) with pre-trained BERT-based transformer framework to jointly predict LDL-C, HbA1c, BMI and SysBP CVD biomarkers from EHR data. The model leverages Bayesian Variational Inference to estimate uncertainties, embeddings to capture temporal relationships and a DeepMTR model to capture biomarker inter-relationships. We evaluate MBT-CB on retrospective EHR data from 3,390 CVD patient records (304 unique patients) in Central Massachusetts during the COVID-19 pandemic. MBT-CB outperformed a comprehensive set of baselines including other BERT-based ML models, achieving an MAE of 0.00887, RMSE of 0.0135 and MSE of 0.00027, while effectively capturing data and model uncertainty, patient biomarker inter-relationships, and temporal dynamics via its attention and embedding mechanisms. MBT-CB’s superior performance highlights its potential to improve CVD biomarker prediction and support clinical decision-making during pandemics.

[AI-53] Toward a Unified Benchmark and Taxonomy of Stochastic Environments

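Quick read: This paper addresses the limited robustness of current Reinforcement Learning (RL) methods to real-world uncertainty, in particular the difficulty that Model-Based RL has in environments with true stochasticity and partial observability. The key to the solution is STORI (STOchastic-ataRI), a benchmark that incorporates diverse stochastic effects, together with a taxonomy of stochasticity in RL environments that provides a unified framework for evaluating and comparing methods under different forms of uncertainty.

Link: https://arxiv.org/abs/2509.01793
Authors: Aryan Amit Barsainyan, Jing Yu Lim, Dianbo Liu
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: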

Abstract:Reinforcement Learning (RL) agents have achieved strong results on benchmarks such as Atari100k, yet they remain limited in robustness to real-world conditions. Model-Based RL approaches that rely on learned World Models often struggle in environments with true stochasticity and partial observability, despite their theoretical grounding in POMDPs. Current benchmarks rarely capture these challenges, focusing instead on deterministic or overly simplified settings, and the lack of a clear taxonomy of stochasticity further hampers systematic evaluation. To address this gap, we introduce STORI (STOchastic-ataRI), a benchmark that incorporates diverse stochastic effects and enables rigorous assessment of RL methods under varied forms of uncertainty. In addition, we propose a taxonomy of stochasticity in RL environments, providing a unified framework for analyzing and comparing approaches.

[AI-54] E-PhishGen: Unlocking Novel Research in Phishing Email Detection

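Quick read: This "open problem" paper argues that phishing email detection remains unsolved: although many studies report near-perfect accuracy, those methods still struggle against evolving phishing attacks in practice. The core issue is that existing benchmark datasets are unrepresentative, being mostly English and not reflecting current phishing trends, which distorts performance evaluation and hinders real progress. The key to the solution is E-PhishGEN, an LLM-based, privacy-savvy framework for generating phishing-email datasets. It is used to build E-PhishLLM, a more challenging dataset of 16616 emails in three languages, on which existing detectors perform much worse than on existing benchmarks, indicating substantial room for improvement.

Link: https://arxiv.org/abs/2509.01791
Authors: Luca Pajola, Eugenio Caripoti, Simeone Pizzi, Mauro Conti, Stefan Banzer, Giovanni Apruzzese
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted to ACM AISec '26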

Abstract:Every day, our inboxes are flooded with unsolicited emails, ranging between annoying spam to more subtle phishing scams. Unfortunately, despite abundant prior efforts proposing solutions achieving near-perfect accuracy, the reality is that countering malicious emails still remains an unsolved dilemma. This “open problem” paper carries out a critical assessment of scientific works in the context of phishing email detection. First, we focus on the benchmark datasets that have been used to assess the methods proposed in research. We find that most prior work relied on datasets containing emails that – we argue – are not representative of current trends, and mostly encompass the English language. Based on this finding, we then re-implement and re-assess a variety of detection methods reliant on machine learning (ML), including large-language models (LLM), and release all of our codebase – an (unfortunately) uncommon practice in related research. We show that most such methods achieve near-perfect performance when trained and tested on the same dataset – a result which intrinsically hinders development (how can future research outperform methods that are already near perfect?). To foster the creation of “more challenging benchmarks” that reflect current phishing trends, we propose E-PhishGEN, an LLM-based (and privacy-savvy) framework to generate novel phishing-email datasets. We use our E-PhishGEN to create E-PhishLLM, a novel phishing-email detection dataset containing 16616 emails in three languages. We use E-PhishLLM to test the detectors we considered, showing a much lower performance than that achieved on existing benchmarks – indicating a larger room for improvement. We also validate the quality of E-PhishLLM with a user study (n=30). To sum up, we show that phishing email detection is still an open problem – and provide the means to tackle such a problem by future research. 
Cite as: arXiv:2509.01791 [cs.CR]. Related DOI: https://doi.org/10.1145/3733799.3762967

[AI-55] Reinforcement Learning for Machine Learning Engineering Agents

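Quick read: This paper addresses the fact that agents built by prompting Large Language Models (LLMs) do not improve with experience. Applying reinforcement learning in this setting raises two challenges: actions take variable amounts of time, so asynchronous policy gradient updates favor faster but suboptimal solutions; and using only test-split performance as the reward cannot distinguish a nearly correct program from one that fails entirely. The solution has two parts. First, duration-aware gradient updates in a distributed asynchronous RL framework amplify high-cost but high-reward actions. Second, environment instrumentation uses a separate static language model to insert print statements that log experimental progress, from which partial credit is extracted as a reward signal. On MLEBench, a small model (Qwen2.5-3B) trained with RL outperforms prompting a much larger model (Claude-3.5-Sonnet) by an average of 22% across 12 Kaggle tasks.

Link: https://arxiv.org/abs/2509.01684
Authors: Sherry Yang, Joy He-Yueya, Percy Liang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: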

Abstract:Existing agents for solving tasks such as ML engineering rely on prompting powerful language models. As a result, these agents do not improve with more experience. In this paper, we show that agents backed by weaker models that improve via reinforcement learning (RL) can outperform agents backed by much larger, but static models. We identify two major challenges with RL in this setting. First, actions can take a variable amount of time (e.g., executing code for different solutions), which leads to asynchronous policy gradient updates that favor faster but suboptimal solutions. To tackle variable-duration actions, we propose duration-aware gradient updates in a distributed asynchronous RL framework to amplify high-cost but high-reward actions. Second, using only test split performance as a reward provides limited feedback. A program that is nearly correct is treated the same as one that fails entirely. To address this, we propose environment instrumentation to offer partial credit, distinguishing almost-correct programs from those that fail early (e.g., during data loading). Environment instrumentation uses a separate static language model to insert print statements into an existing program to log the agent’s experimental progress, from which partial credit can be extracted as reward signals for learning. Our experimental results on MLEBench suggest that performing gradient updates on a much smaller model (Qwen2.5-3B) trained with RL outperforms prompting a much larger model (Claude-3.5-Sonnet) with agent scaffolds, by an average of 22% across 12 Kaggle tasks.
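
The partial-credit idea can be illustrated with a toy reward function that reads progress markers from a program's log. Everything specific here is an assumption: the stage names, the `[progress]` marker format, and the equal weighting of stages are hypothetical, and the paper's actual reward shaping may look quite different.

```python
# Hypothetical pipeline stages that inserted print statements would report.
STAGES = ("data_loaded", "model_built", "training_done", "submission_written")


def partial_credit(log_text):
    """Fraction of consecutive pipeline stages reached, read from the log.

    Stages are checked in order and counting stops at the first missing
    marker, so a program that crashes during data loading scores 0.0 while
    one that fails just before writing its submission scores 0.75.
    """
    reached = 0
    for stage in STAGES:
        if f"[progress] {stage}" not in log_text:
            break
        reached += 1
    return reached / len(STAGES)
```

The point of such a signal is that two failing programs no longer receive the same reward, which gives the policy gradient something to climb before any run fully succeeds.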

[AI-56] Physics Supernova: AI Agent Matches Elite Gold Medalists at IPhO 2025

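Quick read: This paper addresses the limits of current AI systems in physics problem solving, asking how an AI can match top human contestants in applying physical laws to explain and predict physical processes. The authors present Physics Supernova, an AI agent system that scores 23.5/30 on the International Physics Olympiad (IPhO) 2025 theory problems, ranking 14th of 406 contestants (roughly the top 3.5%) and surpassing the median human gold medalist. The key to the solution is principled tool integration within the agent system, which improves performance on challenging science problems.

Link: https://arxiv.org/abs/2509.01659
Authors: Jiahao Qiu, Jingzhe Shi, Xinzhe Juan, Zelin Zhao, Jiayi Geng, Shilong Liu, Hongru Wang, Sanfeng Wu, Mengdi Wang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: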

Abstract:Physics provides fundamental laws that describe and predict the natural world. AI systems aspiring toward more general, real-world intelligence must therefore demonstrate strong physics problem-solving abilities: to formulate and apply physical laws for explaining and predicting physical processes. The International Physics Olympiad (IPhO)–the world’s most prestigious physics competition–offers a rigorous benchmark for this purpose. We introduce Physics Supernova, an AI agent system with superior physics problem-solving abilities that match elite IPhO gold medalists. In IPhO 2025 theory problems, Physics Supernova attains 23.5/30 points, ranking 14th of 406 contestants and surpassing the median performance of human gold medalists. We extensively analyzed Physics Supernova’s capabilities and flexibility across diverse physics tasks. These results show that principled tool integration within agent systems can deliver competitive improvements in solving challenging science problems. The codes are available at this https URL.

[AI-57] Data Retrieval with Importance Weights for Few-Shot Imitation Learning

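Quick read: This paper addresses the performance bottleneck that small task-specific datasets impose on few-shot imitation learning, especially for deployment in new environments or on unseen tasks. Retrieval-based imitation learning augments the limited demonstrations with relevant samples from large prior datasets, most commonly by computing each prior data point's minimum latent-space distance to the target data. The authors show that this rule is equivalent to the limit of a Gaussian kernel density estimate (KDE) of the target distribution, which exposes two shortcomings: it relies on high-variance nearest-neighbor estimates that are sensitive to noise, and it ignores the distribution of the prior data. They propose Importance Weighted Retrieval (IWR), which uses Gaussian KDEs to estimate importance weights, that is, the ratio between the target and prior data distributions, reducing selection bias and smoothing the estimates. IWR requires only minor modifications yet consistently improves existing retrieval-based methods in simulation and in real-world evaluations on the Bridge dataset.

Link: https://arxiv.org/abs/2509.01657
Authors: Amber Xie, Rahul Chand, Dorsa Sadigh, Joey Hejna
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Conference on Robot Learning 2025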

Abstract:While large-scale robot datasets have propelled recent progress in imitation learning, learning from smaller task specific datasets remains critical for deployment in new environments and unseen tasks. One such approach to few-shot imitation learning is retrieval-based imitation learning, which extracts relevant samples from large, widely available prior datasets to augment a limited demonstration dataset. To determine the relevant data from prior datasets, retrieval-based approaches most commonly calculate a prior data point’s minimum distance to a point in the target dataset in latent space. While retrieval-based methods have shown success using this metric for data selection, we demonstrate its equivalence to the limit of a Gaussian kernel density (KDE) estimate of the target data distribution. This reveals two shortcomings of the retrieval rule used in prior work. First, it relies on high-variance nearest neighbor estimates that are susceptible to noise. Second, it does not account for the distribution of prior data when retrieving data. To address these issues, we introduce Importance Weighted Retrieval (IWR), which estimates importance weights, or the ratio between the target and prior data distributions for retrieval, using Gaussian KDEs. By considering the probability ratio, IWR seeks to mitigate the bias of previous selection rules, and by using reasonable modeling parameters, IWR effectively smooths estimates using all data points. Across both simulation environments and real-world evaluations on the Bridge dataset we find that our method, IWR, consistently improves performance of existing retrieval-based methods, despite only requiring minor modifications.
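
The density-ratio retrieval rule can be sketched with two Gaussian KDEs sharing one bandwidth. This is a minimal illustration rather than the authors' implementation: points are assumed to be latent vectors given as tuples, the shared bandwidth means the Gaussian normalizing constants cancel in the ratio and are therefore omitted, and the names `kde`, `importance_weight`, `retrieve`, and the `eps` guard are hypothetical.

```python
import math


def kde(points, query, bandwidth):
    """Gaussian KDE at query, up to a constant that cancels in the ratio."""
    total = 0.0
    for point in points:
        sq_dist = sum((q - p) ** 2 for q, p in zip(query, point))
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / len(points)


def importance_weight(query, target, prior, bandwidth=1.0, eps=1e-12):
    """Estimated ratio of target density to prior density at query."""
    return kde(target, query, bandwidth) / (kde(prior, query, bandwidth) + eps)


def retrieve(prior, target, k, bandwidth=1.0):
    """Return the k prior points with the largest importance weights."""
    ranked = sorted(
        prior,
        key=lambda p: importance_weight(p, target, prior, bandwidth),
        reverse=True,
    )
    return ranked[:k]
```

Compared with a nearest-neighbor rule, the numerator averages over all target points instead of using only the closest one, and the denominator down-weights regions where the prior dataset is already dense.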

[AI-58] Unraveling LLM Jailbreaks Through Safety Knowledge Neurons

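Quick read: This paper addresses the safety risk that jailbreak attacks pose to Large Language Models (LLMs), in which users induce a model to produce harmful content such as instructions for synthesizing controlled substances or disinformation. Existing defenses mainly modify output distributions or detect harmful content, but the underlying mechanism remains unclear. The key contribution is a neuron-level interpretability method that projects the model's internal representations into a more consistent and interpretable vocabulary space to identify safety-related knowledge neurons. Adjusting the activation of these neurons controls the model's behavior with a mean attack success rate (ASR) above 97%. Building on this, the authors propose SafeTuning, a fine-tuning strategy that reinforces safety-critical neurons and outperforms all four baseline defenses across multiple LLMs.

Link: https://arxiv.org/abs/2509.01631
Authors: Chongwen Zhao, Kaizhu Huang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures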

Abstract:Large Language Models (LLMs) are increasingly attracting attention in various applications. Nonetheless, there is a growing concern as some users attempt to exploit these models for malicious purposes, including the synthesis of controlled substances and the propagation of disinformation, a technique known as “Jailbreak.” While some studies have achieved defenses against jailbreak attacks by modifying output distributions or detecting harmful content, the exact rationale still remains elusive. In this work, we present a novel neuron-level interpretability method that focuses on the role of safety-related knowledge neurons. Unlike existing approaches, our method projects the model’s internal representation into a more consistent and interpretable vocabulary space. We then show that adjusting the activation of safety-related neurons can effectively control the model’s behavior with a mean ASR higher than 97%. Building on this insight, we propose SafeTuning, a fine-tuning strategy that reinforces safety-critical neurons to improve model robustness against jailbreaks. SafeTuning consistently reduces attack success rates across multiple LLMs and outperforms all four baseline defenses. These findings offer a new perspective on understanding and defending against jailbreak attacks.

[AI-59] Throttling Web Agents Using Reasoning Gates

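Quick read: This paper addresses the abuse of Internet resources by web agents, such as overloading content providers, bypassing CAPTCHAs and other defenses, and flooding authentication systems with fake accounts. The key to the solution is Web Agent Throttling, a framework that imposes tunable costs on agents before granting access to resources. Its core is rebus-based Reasoning Gates: synthetic text puzzles that require multi-hop reasoning over world knowledge and therefore force the agent's language model to spend many tokens. This yields computational asymmetry, with response generation costing 9.2x more than puzzle generation for SOTA models, and the gates are designed to be scalable, robust, and compatible with any agent.

Link: https://arxiv.org/abs/2509.01619
Authors: Abhinav Kumar, Jaechul Roh, Ali Naseh, Amir Houmansadr, Eugene Bagdasarian
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: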

Abstract:AI web agents use Internet resources at far greater speed, scale, and complexity – changing how users and services interact. Deployed maliciously or erroneously, these agents could overload content providers. At the same time, web agents can bypass CAPTCHAs and other defenses by mimicking user behavior or flood authentication systems with fake accounts. Yet providers must protect their services and content from denial-of-service attacks and scraping by web agents. In this paper, we design a framework that imposes tunable costs on agents before providing access to resources; we call this Web Agent Throttling. We start by formalizing Throttling Gates as challenges issued to an agent that are asymmetric, scalable, robust, and compatible with any agent. Focusing on a common component – the language model – we require the agent to solve reasoning puzzles, thereby incurring excessive token-generation costs. However, we find that using existing puzzles, e.g., coding or math, as throttling gates fails to satisfy our properties. To address this, we introduce rebus-based Reasoning Gates, synthetic text puzzles that require multi-hop reasoning over world knowledge (thereby throttling an agent’s model). We design a scalable generation and verification protocol for such reasoning gates. Our framework achieves computational asymmetry, i.e., the response-generation cost is 9.2x higher than the generation cost for SOTA models. We further deploy reasoning gates on a custom website and Model Context Protocol (MCP) servers and evaluate with real-world web agents. Finally, we discuss the limitations and environmental impact of real-world deployment of our framework.

[AI-60] Disentangling the schema turn: Restoring the information base to conceptual modelling

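[Quick read]: This paper addresses the "schema turn" in contemporary computer-science conceptual modelling: practice focuses so heavily on a conceptual schema separated from its information base that "conceptual model" is often reduced to mean only the conceptual schema. This bias is entrenched in almost every database textbook and limits both the diversity of conceptual modelling practice and its potential for automation. The key to the solution is an inclusive schema-and-base conceptual modelling approach, made feasible by modern technology, that enables more automated and empirically motivated modelling. The paper also argues that supporting this approach requires new pipeline-based conceptual modelling techniques, that the current view of the conceptual modelling space is too narrow, and that the schema turn may be only a temporary detour in the field's evolution.

Link: https://arxiv.org/abs/2509.01617
Authors: Chris Partridge, Andrew Mitchell, Sergio de Cesare, Oscar Xiberta Soto
Affiliation: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Fundamentals of Conceptual Modeling - ER2025 Workshop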

Abstract:If one looks at contemporary mainstream development practices for conceptual modelling in computer science, these so clearly focus on a conceptual schema completely separated from its information base that the conceptual schema is often just called the conceptual model. These schema-centric practices are crystallized in almost every database textbook. We call this strong, almost universal, bias towards conceptual schemas the schema turn. The focus of this paper is on disentangling this turn within (computer science) conceptual modeling. It aims to shed some light on how it emerged and so show that it is not fundamental. To show that modern technology enables the adoption of an inclusive schema-and-base conceptual modelling approach, which in turn enables more automated, and empirically motivated practices. And to show, more generally, the space of possible conceptual modelling practices is wider than currently assumed. It also uses the example of bCLEARer to show that the implementations in this wider space will probably need to rely on new pipeline-based conceptual modelling techniques. So, it is possible that the schema turn’s complete exclusion of the information base could be merely a temporary evolutionary detour.

[AI-61] Entropy-Driven Curriculum for Multi-Task Training in Human Mobility Prediction

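[Quick read]: This paper addresses inefficient training and limited accuracy in human mobility prediction caused by the varying complexity of the data, which shows up as inefficient gradient updates and potential underfitting. Existing methods also predict only the next location and ignore implicit determinants such as distance and direction. The key to the solution is a unified training framework that combines entropy-driven curriculum learning with multi-task learning. The former quantifies trajectory predictability using Lempel-Ziv compression and orders training from simple to complex to speed up convergence and improve performance. The latter jointly optimizes the primary task (location prediction) with auxiliary tasks (movement distance and direction estimation), adding complementary supervision signals so that the model learns more realistic mobility patterns and predicts more accurately.

Link: https://arxiv.org/abs/2509.01613
Authors: Tianye Fang, Xuanshu Luo, Martin Werner
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: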

Abstract:The increasing availability of big mobility data from ubiquitous portable devices enables human mobility prediction through deep learning approaches. However, the diverse complexity of human mobility data impedes model training, leading to inefficient gradient updates and potential underfitting. Meanwhile, exclusively predicting next locations neglects implicit determinants, including distances and directions, thereby yielding suboptimal prediction results. This paper presents a unified training framework that integrates entropy-driven curriculum and multi-task learning to address these challenges. The proposed entropy-driven curriculum learning strategy quantifies trajectory predictability based on Lempel-Ziv compression and organizes training from simple to complex for faster convergence and enhanced performance. The multi-task training simultaneously optimizes the primary location prediction alongside auxiliary estimation of movement distance and direction for learning realistic mobility patterns, and improve prediction accuracy through complementary supervision signals. Extensive experiments conducted in accordance with the HuMob Challenge demonstrate that our approach achieves state-of-the-art performance on GEO-BLEU (0.354) and DTW (26.15) metrics with up to 2.92-fold convergence speed compared to training without curriculum learning.
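The curriculum idea in the abstract (score each trajectory's predictability with Lempel-Ziv compression, then train from simple to complex) can be illustrated with a minimal sketch. The abstract does not give the exact estimator, so everything below is an assumption for illustration: `lz78_phrase_count` uses an LZ78-style incremental parse as a complexity proxy, and `curriculum_order` normalizes by trajectory length, which is a hypothetical choice rather than the authors' formula.

```python
from typing import Hashable, List, Sequence


def lz78_phrase_count(seq: Sequence[Hashable]) -> int:
    """Count the phrases in an LZ78-style incremental parse of `seq`.

    Fewer phrases means the sequence is more repetitive, hence more predictable.
    """
    seen = set()
    phrase = ()
    count = 0
    for symbol in seq:
        phrase += (symbol,)
        if phrase not in seen:
            # A phrase not seen before closes the current parse step.
            seen.add(phrase)
            count += 1
            phrase = ()
    # A trailing phrase that was already seen still counts as one phrase.
    return count + (1 if phrase else 0)


def curriculum_order(trajectories: List[Sequence[Hashable]]) -> List[int]:
    """Return trajectory indices ordered from simple to complex.

    The score (phrases per visited location) is an illustrative assumption.
    """
    def score(i: int) -> float:
        t = trajectories[i]
        return lz78_phrase_count(t) / max(len(t), 1)

    return sorted(range(len(trajectories)), key=score)


# Toy location sequences: constant, all distinct, alternating.
toy = [list("AAAAAA"), list("ABCDEF"), list("ABABAB")]
order = curriculum_order(toy)
```

On the toy data the constant trajectory is scheduled first and the all-distinct one last, which matches the easy-to-hard ordering the abstract describes.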

[AI-62] An Efficient Intrusion Detection System for Safeguarding Radiation Detection Systems

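[Quick read]: This paper addresses the lack of protection in Radiation Detection Systems (RDSs) against malicious external attacks, in particular Denial of Service (DoS) attacks that cause the system to malfunction. The key to the solution is a Machine Learning (ML)-based Intrusion Detection System (IDS): a simulated DoS attack dataset is built with sampling methods, and the detection performance of several ML algorithms is compared (Random Forest, Support Vector Machine, logistic regression, and Light Gradient-Boosting Machine, LightGBM). LightGBM is emphasized for its high accuracy and low computational resource consumption and is combined with optimization techniques such as feature selection, parallel execution, and random search. The result is an efficient, lightweight TinyML-based IDS suited to real-time deployment that improves the security and robustness of RDSs.

Link: https://arxiv.org/abs/2509.01599
Authors: Nathanael Coolidge, Jaime González Sanz, Li Yang, Khalil El Khatib, Glenn Harvel, Nelson Agbemava, I Putu Susila, Mehmet Yavuz Yagci
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Preprint author original pre review. Accepted and Presented at ISOFIC 2024. The official proceedings version is available on the conference site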

Abstract:Radiation Detection Systems (RDSs) are used to measure and detect abnormal levels of radioactive material in the environment. These systems are used in many applications to mitigate threats posed by high levels of radioactive material. However, these systems lack protection against malicious external attacks to modify the data. The novelty of applying Intrusion Detection Systems (IDS) in RDSs is a crucial element in safeguarding these critical infrastructures. While IDSs are widely used in networking environments to safeguard against various attacks, their application in RDSs is novel. A common attack on RDSs is Denial of Service (DoS), where the attacker aims to overwhelm the system, causing malfunctioning RDSs. This paper proposes an efficient Machine Learning (ML)-based IDS to detect anomalies in radiation data, focusing on DoS attacks. This work explores the use of sampling methods to create a simulated DoS attack based on a real radiation dataset, followed by an evaluation of various ML algorithms, including Random Forest, Support Vector Machine (SVM), logistic regression, and Light Gradient-Boosting Machine (LightGBM), to detect DoS attacks on RDSs. LightGBM is emphasized for its superior accuracy and low computational resource consumption, making it particularly suitable for real-time intrusion detection. Additionally, model optimization and TinyML techniques, including feature selection, parallel execution, and random search methods, are used to improve the efficiency of the proposed IDS. Finally, an optimized and efficient LightGBM-based IDS is developed to achieve accurate intrusion detection for RDSs.

[AI-63] Securing Radiation Detection Systems with an Efficient TinyML-Based IDS for Edge Devices

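[Quick read]: This paper addresses the cyber-security threats that Radiation Detection Systems (RDSs) face in resource-constrained environments, such as data injection, man-in-the-middle (MITM) attacks, ICMP floods, botnet attacks, privilege escalation, and distributed denial-of-service (DDoS) attacks. These attacks can compromise the integrity and reliability of radiation measurements and thereby endanger public health and safety. The key to the solution is a lightweight Intrusion Detection System (IDS) for the edge, built on an XGBoost model optimized with TinyML techniques: pruning, quantization, feature selection, and sampling substantially reduce model size and computational cost, enabling real-time intrusion detection on low-resource devices with a reasonable balance between efficiency and accuracy.

Link: https://arxiv.org/abs/2509.01592
Authors: Einstein Rivas Pizarro, Wajiha Zaheer, Li Yang, Khalil El-Khatib, Glenn Harvel
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Preprint author original pre review. Accepted and Presented at NPIC HMIT 2025. The official proceedings version is available in the ANS Digital Library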

Abstract:Radiation Detection Systems (RDSs) play a vital role in ensuring public safety across various settings, from nuclear facilities to medical environments. However, these systems are increasingly vulnerable to cyber-attacks such as data injection, man-in-the-middle (MITM) attacks, ICMP floods, botnet attacks, privilege escalation, and distributed denial-of-service (DDoS) attacks. Such threats could compromise the integrity and reliability of radiation measurements, posing significant public health and safety risks. This paper presents a new synthetic radiation dataset and an Intrusion Detection System (IDS) tailored for resource-constrained environments, bringing Machine Learning (ML) predictive capabilities closer to the sensing edge layer of critical infrastructure. Leveraging TinyML techniques, the proposed IDS employs an optimized XGBoost model enhanced with pruning, quantization, feature selection, and sampling. These TinyML techniques significantly reduce the size of the model and computational demands, enabling real-time intrusion detection on low-resource devices while maintaining a reasonable balance between efficiency and accuracy.

[AI-64] From Discord to Harmony: Decomposed Consonance-based Training for Improved Audio Chord Estimation

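[Quick read]: This paper addresses the long-standing performance ceiling in Audio Chord Estimation (ACE), focusing on two core challenges: annotation inconsistency caused by annotator subjectivity, and class imbalance in chord datasets. The key to the solution is a consonance-based distance metric that better reflects musically meaningful similarity between annotations, and, building on it, a new Conformer-based ACE model. The model incorporates consonance into label smoothing to become more robust to annotation differences, and it estimates root, bass, and all note activations separately, which mitigates class imbalance and improves overall prediction performance.

Link: https://arxiv.org/abs/2509.01588
Authors: Andrea Poltronieri, Xavier Serra, Martín Rocamora
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: 9 pages, 3 figures, 3 tables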

Abstract:Audio Chord Estimation (ACE) holds a pivotal role in music information research, having garnered attention for over two decades due to its relevance for music transcription and analysis. Despite notable advancements, challenges persist in the task, particularly concerning unique characteristics of harmonic content, which have resulted in existing systems’ performances reaching a glass ceiling. These challenges include annotator subjectivity, where varying interpretations among annotators lead to inconsistencies, and class imbalance within chord datasets, where certain chord classes are over-represented compared to others, posing difficulties in model training and evaluation. As a first contribution, this paper presents an evaluation of inter-annotator agreement in chord annotations, using metrics that extend beyond traditional binary measures. In addition, we propose a consonance-informed distance metric that reflects the perceptual similarity between harmonic annotations. Our analysis suggests that consonance-based distance metrics more effectively capture musically meaningful agreement between annotations. Expanding on these findings, we introduce a novel ACE conformer-based model that integrates consonance concepts into the model through consonance-based label smoothing. The proposed model also addresses class imbalance by separately estimating root, bass, and all note activations, enabling the reconstruction of chord labels from decomposed outputs.
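One way to picture consonance-based label smoothing is to spread the smoothing mass over the other classes in proportion to a similarity score, instead of uniformly. The sketch below is a generic similarity-weighted label smoothing routine, not the paper's formulation: the function name, the `epsilon` default, and the similarity values are all hypothetical, and the abstract does not specify how consonance is turned into weights.

```python
from typing import List, Sequence


def similarity_smoothed_target(
    true_idx: int, similarity_row: Sequence[float], epsilon: float = 0.1
) -> List[float]:
    """Build a soft target: 1 - epsilon on the true class, and epsilon shared
    among the other classes in proportion to their similarity to the true class.
    """
    n = len(similarity_row)
    if n == 1:
        return [1.0]
    # Ignore the true class itself and clip negative similarities to zero.
    weights = [
        max(s, 0.0) if i != true_idx else 0.0
        for i, s in enumerate(similarity_row)
    ]
    total = sum(weights)
    if total == 0.0:
        # No similarity information: fall back to uniform label smoothing.
        weights = [0.0 if i == true_idx else 1.0 for i in range(n)]
        total = float(n - 1)
    target = [epsilon * w / total for w in weights]
    target[true_idx] = 1.0 - epsilon
    return target


# Hypothetical similarities of four chord classes to class 0.
target = similarity_smoothed_target(0, [1.0, 0.6, 0.2, 0.2], epsilon=0.1)
```

Compared with uniform smoothing, a class judged more consonant with the annotated chord (index 1 here) receives more of the smoothing mass, so a musically close confusion is penalized less than a distant one.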

[AI-65] One-Shot Clustering for Federated Learning Under Clustering-Agnostic Assumption

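[Quick read]: This paper addresses the lack, in Clustered Federated Learning (CFL), of an automatic mechanism that decides when to cluster without hyperparameter tuning, that is, how to identify the most suitable moment to cluster clients without manual intervention so that personalized models can be trained. The key to the solution is the One-Shot Clustered Federated Learning (OCFL) algorithm, which computes the cosine distance between client gradients and combines it with a temperature measure that detects when the federated model starts to converge, thereby determining the clustering moment dynamically. Experiments show that the method achieves efficient clustering and personalized modelling across multiple benchmark datasets and tasks without tuning, and they confirm the effectiveness of density-based clustering for distinguishing the loss surfaces of neural networks trained on different distributions.

Link: https://arxiv.org/abs/2509.01587
Authors: Maciej Krzysztof Zuziak, Roberto Pellungrini, Salvatore Rinzivillo
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: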

Abstract:Federated Learning (FL) is a widespread and well-adopted paradigm of decentralised learning that allows training one model from multiple sources without the need to transfer data between participating clients directly. Since its inception in 2015, it has been divided into numerous subfields that deal with application-specific issues, such as data heterogeneity or resource allocation. One such sub-field, Clustered Federated Learning (CFL), deals with the problem of clustering the population of clients into separate cohorts to deliver personalised models. Although a few remarkable works have been published in this domain, the problem remains largely unexplored, as its basic assumptions and settings differ slightly from those of standard FL. In this work, we present One-Shot Clustered Federated Learning (OCFL), a clustering-agnostic algorithm that can automatically detect the earliest suitable moment for clustering. Our algorithm is based on computing the cosine distance between the gradients of the clients and a temperature measure that detects when the federated model starts to converge. We empirically evaluate our methodology by testing various one-shot clustering algorithms for over forty different tasks on five benchmark datasets. Our experiments showcase the good performance of our approach when used to perform CFL in an automated manner without the need to adjust hyperparameters. We also revisit the practical feasibility of CFL algorithms based on the gradients of the clients, providing firm evidence of the high efficiency of density-based clustering methods when used to differentiate between the loss surfaces of neural networks trained on different distributions. Moreover, by inspecting the feasibility of local explanations generated with the help of GradCAM, we can provide more insights into the relationship between personalisation and the explainability of local predictions.
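The gradient-similarity ingredient mentioned in the abstract can be sketched as a pairwise cosine-distance matrix over flattened client gradients. This covers only that step: the temperature measure and the clustering algorithm are not specified in the abstract and are not reproduced here. The function names and the toy gradients are assumptions for illustration.

```python
import math
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cos(u, v); 0 means same direction, 2 means opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero gradient is "orthogonal".
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def pairwise_cosine_distances(grads: List[Sequence[float]]) -> List[List[float]]:
    """Distance matrix between every pair of client gradient vectors."""
    n = len(grads)
    return [[cosine_distance(grads[i], grads[j]) for j in range(n)] for i in range(n)]


# Toy flattened gradients for four hypothetical clients.
client_grads = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]
dist = pairwise_cosine_distances(client_grads)
```

A matrix like `dist` could then be handed to a clustering routine; clients 0 and 1 point in the same direction (distance 0), so they would naturally end up in the same cohort.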

[AI-66] Structured AI Decision-Making in Disaster Management

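[Quick read]: This paper addresses the ethical challenge of keeping AI decision-making reliable and justifiable in safety-critical domains (such as aerospace and emergency-response services) where human lives are at stake. The key to the solution is a structured decision-making framework that modularizes and formalizes the decision process through three core concepts, Enabler agents, Levels, and Scenarios, improving the stability and accuracy of autonomous decision-making. Experiments show that the framework achieves 60.94% greater stability in consistently accurate decisions than judgement-based systems and 38.93% higher accuracy across Scenarios than human operators with disaster experience, supporting its value for building highly reliable autonomous AI applications.

Link: https://arxiv.org/abs/2509.01576
Authors: Julian Gerald Dcruz, Argyrios Zolotas, Niall Ross Greenwood, Miguel Arana-Catania
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Systems and Control (eess.SY)
Comments: 40 pages, 14 figures, 16 tables. To be published in Nature Scientific Reports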

Abstract:With artificial intelligence (AI) being applied to bring autonomy to decision-making in safety-critical domains such as the ones typified in the aerospace and emergency-response services, there has been a call to address the ethical implications of structuring those decisions, so they remain reliable and justifiable when human lives are at stake. This paper contributes to addressing the challenge of decision-making by proposing a structured decision-making framework as a foundational step towards responsible AI. The proposed structured decision-making framework is implemented in autonomous decision-making, specifically within disaster management. By introducing concepts of Enabler agents, Levels and Scenarios, the proposed framework’s performance is evaluated against systems relying solely on judgement-based insights, as well as human operators who have disaster experience: victims, volunteers, and stakeholders. The results demonstrate that the structured decision-making framework achieves 60.94% greater stability in consistently accurate decisions across multiple Scenarios, compared to judgement-based systems. Moreover, the study shows that the proposed framework outperforms human operators with a 38.93% higher accuracy across various Scenarios. These findings demonstrate the promise of the structured decision-making framework for building more reliable autonomous AI applications in safety-critical contexts.

[AI-67] Counterfactual Sensitivity for Faithful Reasoning in Language Models

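[Quick read]: This paper addresses the limited trustworthiness of Large Language Models (LLMs) in high-stakes domains, where a model can give a correct answer while relying on flawed or irrelevant reasoning traces. The core of the solution is a lightweight training objective, Counterfactual Sensitivity Regularization (CSR). Its key idea is to apply operator-level counterfactual interventions during training (for example, replacing "+" with "-") so that the final output is forced to depend causally on the intermediate reasoning steps. This needs only one additional forward pass per sample: the model is penalized when it keeps the original answer under a logically invalid trace. A new metric, Counterfactual Outcome Sensitivity (COS), quantifies how sensitive predictions are to such perturbations, and the approach significantly improves the interpretability and trustworthiness of the model's decisions.

Link: https://arxiv.org/abs/2509.01544
Authors: Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: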

Abstract:Large language models (LLMs) often produce correct answers while relying on flawed or irrelevant reasoning traces, undermining their trustworthiness in high-stakes domains. We propose Counterfactual Sensitivity Regularization (CSR), a lightweight training objective that enforces dependence between intermediate reasoning and final outputs. CSR introduces automated, operator-level counterfactual interventions (e.g., swapping “+” with “-”) during training and penalizes models that preserve the same answer under logically invalid traces. This requires only one additional forward pass per sample. To measure faithfulness, we introduce Counterfactual Outcome Sensitivity (COS), which quantifies the impact of such perturbations on model predictions. Across structured reasoning tasks - arithmetic (GSM8K), logical deduction (PrOntoQA), and planning (Blocks World) - CSR improves faithfulness by up to 70 percentage points over standard fine-tuning and process supervision, with only minor accuracy loss. The learned sensitivity generalizes to larger models and synergizes with inference-time methods such as self-consistency. A pilot study on HellaSwag further demonstrates that extending CSR with semantic perturbations can enhance faithfulness in commonsense reasoning.
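The operator-level intervention and the sensitivity measurement can be sketched on toy arithmetic traces. This is an illustrative proxy only: the abstract does not define COS precisely, so `outcome_sensitivity` below simply reports the fraction of perturbed traces for which the answer changes, and the two answer functions are hypothetical stand-ins for a model, not the authors' setup.

```python
import re
from typing import Callable, List

# Operator swaps used as counterfactual interventions (assumed set).
SWAPS = {"+": "-", "-": "+", "*": "/", "/": "*"}
# Match an operator only when it is surrounded by whitespace.
_OPERATOR = re.compile(r"(?<=\s)[+\-*/](?=\s)")


def perturb_trace(trace: str) -> str:
    """Swap the first whitespace-delimited arithmetic operator in the trace."""
    return _OPERATOR.sub(lambda m: SWAPS[m.group(0)], trace, count=1)


def outcome_sensitivity(answer_fn: Callable[[str], object], traces: List[str]) -> float:
    """Fraction of perturbable traces whose answer changes after the swap."""
    changed = 0
    eligible = 0
    for trace in traces:
        perturbed = perturb_trace(trace)
        if perturbed == trace:
            continue  # nothing to perturb in this trace
        eligible += 1
        if answer_fn(perturbed) != answer_fn(trace):
            changed += 1
    return changed / eligible if eligible else 0.0


def faithful_answer(trace: str) -> float:
    """Toy 'model' that actually computes its answer from the trace."""
    a, op, b = trace.split("=")[0].split()
    x, y = float(a), float(b)
    return {"+": x + y, "-": x - y, "*": x * y, "/": x / y}[op]


def unfaithful_answer(trace: str) -> float:
    """Toy 'model' that ignores the trace and returns a memorized answer."""
    return 7.0


traces = ["3 + 4 = 7", "6 * 2 = 12"]
```

A model whose answer follows from its trace scores 1.0 under this proxy, while one that ignores the trace scores 0.0; a CSR-style penalty would target the second behaviour.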

[AI-68] Agentic Workflow for Education: Concepts and Applications

[Quick read]: This paper addresses the limitations of conventional Large Language Models (LLMs) in education, which are confined to static prompt-response interaction: no dynamic task execution, insufficient personalization, and no collaboration mechanism. The key to the solution is the Agentic Workflow for Education (AWE), a four-component model comprising self-reflection, tool invocation, task planning, and multi-agent collaboration. It is paired with a theoretical framework grounded in the von Neumann Multi-Agent System (MAS) architecture that shifts the paradigm from linear interaction to nonlinear, dynamic workflows, supporting scalable, personalized, and collaborative task execution.

Link: https://arxiv.org/abs/2509.01517
Authors: Yuan-Hao Jiang, Yijie Lu, Ling Dai, Jiatong Wang, Ruijia Li, Bo Jiang
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: Proceedings of the 33rd International Conference on Computers in Education (ICCE 2025). Asia-Pacific Society for Computers in Education

Abstract:With the rapid advancement of Large Language Models (LLMs) and Artificial Intelligence (AI) agents, agentic workflows are showing transformative potential in education. This study introduces the Agentic Workflow for Education (AWE), a four-component model comprising self-reflection, tool invocation, task planning, and multi-agent collaboration. We distinguish AWE from traditional LLM-based linear interactions and propose a theoretical framework grounded in the von Neumann Multi-Agent System (MAS) architecture. Through a paradigm shift from static prompt-response systems to dynamic, nonlinear workflows, AWE enables scalable, personalized, and collaborative task execution. We further identify four core application domains: integrated learning environments, personalized AI-assisted learning, simulation-based experimentation, and data-driven decision-making. A case study on automated math test generation shows that AWE-generated items are statistically comparable to real exam questions, validating the model’s effectiveness. AWE offers a promising path toward reducing teacher workload, enhancing instructional quality, and enabling broader educational innovation.

[AI-69] Unsupervised Identification and Replay-based Detection (UIRD) for New Category Anomaly Detection in ECG Signal

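[Quick read]: This paper addresses the degraded performance of electrocardiogram (ECG) anomaly detection caused by class imbalance and by the burden of storing historical data long term. The key to the solution is a pseudo-replay-based semi-supervised continual learning framework with two parts. First, an unsupervised generative adversarial network (GAN) automatically identifies novel anomaly patterns. Second, a generator learns the data distribution of each task and synthesizes pseudo data representing earlier classes for replay, so that, without storing all historical data, the model can both recognize known anomalies accurately and detect newly emerging anomaly patterns.

Link: https://arxiv.org/abs/2509.01512
Authors: Zhangyue Shi, Zekai Wang, Yuxuan Li
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: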

Abstract:In clinical practice, automatic analysis of electrocardiogram (ECG) is widely applied to identify irregular heart rhythms and other electrical anomalies of the heart, enabling timely intervention and potentially improving clinical outcomes. However, due to the limited samples in certain types of ECG signals, the class imbalance issues pose a challenge for ECG-based detection. In addition, as the volume of patient data grows, long-term storage of all historical data becomes increasingly burdensome as training samples to recognize new patterns and classify existing ECG signals accurately. Therefore, to enhance the performance of anomaly detection while addressing storage limitations, we propose a pseudo-replay based semi-supervised continual learning framework, which consists of two components: unsupervised identification and replay-based detection. For unsupervised identification, an unsupervised generative adversarial network (GAN)-based framework is integrated to detect novel patterns. Besides, instead of directly storing all historical data, a pseudo replay-based learning strategy is proposed which utilizes a generator to learn the data distribution for each individual task. When a new task arises, the generator synthesizes pseudo data representative of previous learnt classes, enabling the model to detect both the existed patterns and the newly presented anomalies. The effectiveness of the proposed framework is validated in four public ECG datasets, which leverages supervised classification problems for anomaly detection. The experimental results show that the developed approach is very promising in identifying novel anomalies while maintaining good performance on detecting existing ECG signals.

[AI-70] An Information-Flow Perspective on Explainability Requirements: Specification and Verification KR2025

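[Quick read]: This paper addresses the balance between explainability and privacy in multi-agent systems: how to keep system behaviour understandable without exposing so much information that privacy constraints are violated. The core challenge is to model and verify the flow of knowledge precisely, which requires formal analysis of both the positive information flow of explanations and the negative flow of potential privacy leaks. The key to the solution is epistemic temporal logic extended with quantification over counterfactual causes, which can formally express whether a system exposes enough information for agents to acquire knowledge of why a given effect occurred. On this basis the paper designs an algorithm for checking finite-state models. The approach specifies explainability as a system-level requirement and allows additional privacy constraints, giving a unified formal treatment of explainability and privacy.

Link: https://arxiv.org/abs/2509.01479
Authors: Bernd Finkbeiner, Hadar Frenkel, Julian Siber
Affiliation: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: 22nd International Conference on Principles of Knowledge Representation and Reasoning (KR 2025)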

Abstract:Explainable systems expose information about why certain observed effects are happening to the agents interacting with them. We argue that this constitutes a positive flow of information that needs to be specified, verified, and balanced against negative information flow that may, e.g., violate privacy guarantees. Since both explainability and privacy require reasoning about knowledge, we tackle these tasks with epistemic temporal logic extended with quantification over counterfactual causes. This allows us to specify that a multi-agent system exposes enough information such that agents acquire knowledge on why some effect occurred. We show how this principle can be used to specify explainability as a system-level requirement and provide an algorithm for checking finite-state models against such specifications. We present a prototype implementation of the algorithm and evaluate it on several benchmarks, illustrating how our approach distinguishes between explainable and unexplainable systems, and how it allows to pose additional privacy requirements.

[AI-71] LLM-empowered Agents Simulation Framework for Scenario Generation in Service Ecosystem Governance

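[Quick read]: This paper addresses scenario generation for service ecosystem governance, where growing social complexity and deepening collaboration multiply the sources of uncertainty, while conventional scenario generation is constrained by predefined rules, limited information, and hard-to-quantify social elements, all of which hurt the quality and efficiency of experimental system construction. The key to the solution is a coordination mechanism among agents driven by a Large Language Model (LLM): the Environment Agent (EA) generates the social environment, including extreme cases; the Social Agent (SA) builds the social collaboration structure; and the Planner Agent (PA) couples task-role relationships and plans task solutions. The three coordinate dynamically, with the PA perceiving each agent's state in real time and adjusting the experimental scheme, which yields high-quality, efficient scenario generation and offers a practical, experiment-based route for service ecosystem governance.

Link: https://arxiv.org/abs/2509.01441
Authors: Deyu Zhou, Yuqi Hou, Xiao Xue, Xudong Lu, Qingzhong Li, Lizhen Cui
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: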

Abstract:As the social environment is growing more complex and collaboration is deepening, factors affecting the healthy development of service ecosystem are constantly changing and diverse, making its governance a crucial research issue. Applying the scenario analysis method and conducting scenario rehearsals by constructing an experimental system before managers make decisions, losses caused by wrong decisions can be largely avoided. However, it relies on predefined rules to construct scenarios and faces challenges such as limited information, a large number of influencing factors, and the difficulty of measuring social elements. These challenges limit the quality and efficiency of generating social and uncertain scenarios for the service ecosystem. Therefore, we propose a scenario generator design method, which adaptively coordinates three Large Language Model (LLM) empowered agents that autonomously optimize experimental schemes to construct an experimental system and generate high quality scenarios. Specifically, the Environment Agent (EA) generates social environment including extremes, the Social Agent (SA) generates social collaboration structure, and the Planner Agent (PA) couples task-role relationships and plans task solutions. These agents work in coordination, with the PA adjusting the experimental scheme in real time by perceiving the states of each agent and these generating scenarios. Experiments on the ProgrammableWeb dataset illustrate our method generates more accurate scenarios more efficiently, and innovatively provides an effective way for service ecosystem governance related experimental system construction.

[AI-72] Unnoticeable Community Deception via Multi-objective Optimization

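[Quick read]: This paper addresses two key problems in community deception methods: the questionable rationality of existing deception metrics (such as the decrease of modularity) and the difficulty of keeping attacks unnoticeable. The core of the solution is a new deception metric that is combined with the attack budget to model community deception as a multi-objective optimization problem. In addition, degree-biased and community-biased candidate node selection mechanisms are introduced to improve deception performance. Experiments show that the proposed strategies outperform existing methods on three benchmark datasets.

Link: https://arxiv.org/abs/2509.01438
Authors: Junyuan Fang, Huimin Liu, Yueqi Peng, Jiajing Wu, Zibin Zheng, Chi K. Tse
Affiliation: Unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: Under Review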

Abstract:Community detection in graphs is crucial for understanding the organization of nodes into densely connected clusters. While numerous strategies have been developed to identify these clusters, the success of community detection can lead to privacy and information security concerns, as individuals may not want their personal information exposed. To address this, community deception methods have been proposed to reduce the effectiveness of detection algorithms. Nevertheless, several limitations, such as the rationality of evaluation metrics and the unnoticeability of attacks, have been ignored in current deception methods. Therefore, in this work, we first investigate the limitations of the widely used deception metric, i.e., the decrease of modularity, through empirical studies. Then, we propose a new deception metric, and combine this new metric together with the attack budget to model the unnoticeable community deception task as a multi-objective optimization problem. To further improve the deception performance, we propose two variant methods by incorporating the degree-biased and community-biased candidate node selection mechanisms. Extensive experiments on three benchmark datasets demonstrate the superiority of the proposed community deception strategies.
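For context on the "decrease of modularity" metric that the abstract scrutinizes, the sketch below computes standard Newman modularity, Q = sum over communities c of [L_c/m - (d_c/2m)^2], before and after adding one edge. It illustrates the widely used baseline metric only. The paper's new metric is not described in the abstract and is not reproduced, and the graph, partition, and added edge are invented toy data.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def modularity(edges: List[Edge], community: Dict[Hashable, int]) -> float:
    """Newman modularity of an undirected, unweighted graph under a partition."""
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)    # L_c: edges with both ends in community c
    degree_sum = defaultdict(int)  # d_c: total degree of nodes in community c
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

q_before = modularity(edges, community)
# A hypothetical deception move: add one cross-community edge.
q_after = modularity(edges + [(0, 4)], community)
modularity_decrease = q_before - q_after
```

Here a single added edge lowers Q from 5/14 to 1/4. The abstract's point is that this kind of drop, taken alone, may not be a sound measure of how well a community is hidden.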

[AI-73] CabinSep: IR-Augmented Mask-Based MVDR for Real-Time In-Car Speech Separation with Distributed Heterogeneous Arrays INTERSPEECH2025

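[Quick read]: This paper addresses overlapping speech from multiple speakers, with the goal of improving speech recognition accuracy in human-vehicle interaction. The core challenge is to separate mixed speech in a complex acoustic environment and to reduce the error rate of the back-end automatic speech recognition (ASR) model. The key to the solution is CabinSep, a lightweight neural mask-based minimum variance distortionless response (MVDR) speech separation approach. First, channel information is used to extract spatial features that improve speech and noise mask estimation. Second, an MVDR filter is applied at inference to reduce speech distortion and make the output more ASR-friendly. Third, a data augmentation strategy combining simulated and real-recorded impulse responses (IRs) improves speaker localization at zone boundaries and further reduces recognition errors.

Link: https://arxiv.org/abs/2509.01399
Authors: Runduo Han, Yanxin Hu, Yihui Fu, Zihan Zhang, Yukai Jv, Li Chen, Lei Xie
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
Comments: Accepted by Interspeech 2025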

Abstract:Separating overlapping speech from multiple speakers is crucial for effective human-vehicle interaction. This paper proposes CabinSep, a lightweight neural mask-based minimum variance distortionless response (MVDR) speech separation approach, to reduce speech recognition errors in back-end automatic speech recognition (ASR) models. Our contributions are threefold: First, we utilize channel information to extract spatial features, which improves the estimation of speech and noise masks. Second, we employ MVDR during inference, reducing speech distortion to make it more ASR-friendly. Third, we introduce a data augmentation method combining simulated and real-recorded impulse responses (IRs), improving speaker localization at zone boundaries and further reducing speech recognition errors. With a computational complexity of only 0.4 GMACs, CabinSep achieves a 17.5% relative reduction in speech recognition error rate in a real-recorded dataset compared to the state-of-the-art DualSep model. Demos are available at: this https URL.
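For readers unfamiliar with mask-based MVDR, a common textbook formulation is given below. It is a general reference, not necessarily the exact variant used in CabinSep: the symbols are assumptions introduced here, with $\mathbf{y}(t,f)$ the multichannel STFT observation, $M_s$ and $M_n$ the estimated speech and noise masks, and $\mathbf{u}$ a one-hot vector selecting the reference microphone.

```latex
% Mask-weighted spatial covariance matrices, for \nu \in \{s, n\}:
\Phi_{\nu}(f) = \frac{\sum_{t} M_{\nu}(t,f)\, \mathbf{y}(t,f)\, \mathbf{y}^{\mathsf{H}}(t,f)}
                     {\sum_{t} M_{\nu}(t,f)}

% MVDR beamformer (reference-channel form):
\mathbf{w}(f) = \frac{\Phi_{n}^{-1}(f)\, \Phi_{s}(f)}
                     {\operatorname{tr}\!\left(\Phi_{n}^{-1}(f)\, \Phi_{s}(f)\right)}\, \mathbf{u}

% Separated speech estimate:
\hat{s}(t,f) = \mathbf{w}^{\mathsf{H}}(f)\, \mathbf{y}(t,f)
```

The filter minimizes residual noise power subject to passing the target direction undistorted, which is why applying it at inference tends to produce output with less speech distortion than applying the masks directly.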

[AI-74] The Need for Verification in AI-Driven Scientific Discovery

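[Quick read]: This paper addresses the following problem: as generative AI is widely used in scientific discovery, it can propose hypotheses at a speed and scale far beyond traditional methods, but the resulting flood of candidate hypotheses lacks scalable and reliable verification, which risks hindering rather than accelerating scientific progress. The key to the solution is rigorous and transparent verification, the core prerequisite for AI-assisted discovery to have scientific value. The paper argues that verification should be the cornerstone of AI-driven scientific discovery workflows, across approaches ranging from data-driven methods and knowledge-aware neural architectures to symbolic reasoning frameworks and LLM agents.

Link: https://arxiv.org/abs/2509.01398
Authors: Cristina Cornelio, Takuya Ito, Ryan Cory-Wright, Sanjeeb Dash, Lior Horesh
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: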

Abstract:Artificial intelligence (AI) is transforming the practice of science. Machine learning and large language models (LLMs) can generate hypotheses at a scale and speed far exceeding traditional methods, offering the potential to accelerate discovery across diverse fields. However, the abundance of hypotheses introduces a critical challenge: without scalable and reliable mechanisms for verification, scientific progress risks being hindered rather than being advanced. In this article, we trace the historical development of scientific discovery, examine how AI is reshaping established practices for scientific discovery, and review the principal approaches, ranging from data-driven methods and knowledge-aware neural architectures to symbolic reasoning frameworks and LLM agents. While these systems can uncover patterns and propose candidate laws, their scientific value ultimately depends on rigorous and transparent verification, which we argue must be the cornerstone of AI-assisted discovery.

[AI-75] DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks

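[Quick read]: This paper addresses a difficulty in evaluating deep research agents: there is a lack of frontier research questions that genuinely reflect researchers' interests and intellectual curiosity, so evaluations struggle to measure real research capability. To address this, the authors propose DeepResearch Arena, a benchmark built from academic seminars, and a Multi-Agent Hierarchical Task Generation (MAHTG) system that extracts research-worthy inspirations from seminar transcripts and turns them into high-quality, traceable research tasks while filtering noise and preserving task authenticity and diversity. The key innovation is constructing the benchmark automatically from real expert interaction data, which improves the ecological validity and fairness of the evaluation.

Link: https://arxiv.org/abs/2509.01396
Authors: Haiyuan Wan, Chen Yang, Junchi Yu, Meiqi Tu, Jiaxuan Lu, Di Yu, Jianbao Cao, Ben Gao, Jiaqing Xie, Aoran Wang, Wenlong Zhang, Philip Torr, Dongzhan Zhou
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: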

Abstract:Deep research agents have attracted growing attention for their potential to orchestrate multi-stage research workflows, spanning literature synthesis, methodological design, and empirical verification. Despite these strides, evaluating their research capability faithfully is rather challenging due to the difficulty of collecting frontier research questions that genuinely capture researchers’ attention and intellectual curiosity. To address this gap, we introduce DeepResearch Arena, a benchmark grounded in academic seminars that capture rich expert discourse and interaction, better reflecting real-world research environments and reducing the risk of data leakage. To automatically construct DeepResearch Arena, we propose a Multi-Agent Hierarchical Task Generation (MAHTG) system that extracts research-worthy inspirations from seminar transcripts. The MAHTG system further translates research-worthy inspirations into high-quality research tasks, ensuring the traceability of research task formulation while filtering noise. With the MAHTG system, we curate DeepResearch Arena with over 10,000 high-quality research tasks from over 200 academic seminars, spanning 12 disciplines, such as literature, history, and science. Our extensive evaluation shows that DeepResearch Arena presents substantial challenges for current state-of-the-art agents, with clear performance gaps observed across different models.

[AI-76] End-to-End Low-Level Neural Control of an Industrial-Grade 6D Magnetic Levitation System

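[Quick Read]: This paper addresses the difficulty of precisely controlling magnetic levitation systems in industrial automation, which stems from their complex and unstable dynamics. Traditional hand-crafted control engineering provides robustness, but its performance is limited by engineers' experience and adapts poorly to changing operating conditions. The key to the solution is the first neural controller for a six-degree-of-freedom (6D) magnetic levitation system. Trained end-to-end, it maps raw sensor data and 6D reference poses directly to coil current commands without explicitly modeling the system dynamics. The controller remains accurate and robust in previously unseen situations, which supports the practicality of learning-based neural control in complex physical systems and suggests it could replace or augment traditional engineering approaches in demanding real-world applications.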

Link: https://arxiv.org/abs/2509.01388
Authors: Philipp Hartmann, Jannick Stranghöner, Klaus Neumann
Affiliation: unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 8 pages, 7 figures, 2 tables

Abstract:Magnetic levitation is poised to revolutionize industrial automation by integrating flexible in-machine product transport and seamless manipulation. It is expected to become the standard drive for automated manufacturing. However, controlling such systems is inherently challenging due to their complex, unstable dynamics. Traditional control approaches, which rely on hand-crafted control engineering, typically yield robust but conservative solutions, with their performance closely tied to the expertise of the engineering team. In contrast, neural control learning presents a promising alternative. This paper presents the first neural controller for 6D magnetic levitation. Trained end-to-end on interaction data from a proprietary controller, it directly maps raw sensor data and 6D reference poses to coil current commands. The neural controller can effectively generalize to previously unseen situations while maintaining accurate and robust control. These results underscore the practical feasibility of learning-based neural control in complex physical systems and suggest a future where such a paradigm could enhance or even substitute traditional engineering approaches in demanding real-world applications. The trained neural controller, source code, and demonstration videos are publicly available at this https URL.

[AI-77] Anomaly detection in network flows using unsupervised online machine learning

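[Quick Read]: This paper addresses dynamic adaptability in network traffic anomaly detection: achieving efficient, real-time detection when network behavior keeps changing and labeled data is scarce. The key to the solution is an unsupervised machine learning model with online learning capability (specifically a One-Class SVM), which lets the system continuously learn and update the pattern of normal network behavior and so identify anomalous traffic that deviates from it without labeled data. Evaluated on the NF-UNSW-NB15 dataset and its extended version v2, the method achieves accuracy above 98%, a false positive rate below 3.1%, and 100% recall, with a processing time of only 0.033 ms per flow, demonstrating its feasibility and efficiency for practical deployment.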

Link: https://arxiv.org/abs/2509.01375
Authors: Alberto Miguel-Diez, Adrián Campazas-Vega, Ángel Manuel Guerrero-Higueras, Claudia Álvarez-Aparicio, Vicente Matellán-Olivera
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 14 pages, 3 figures, 6 tables

Abstract:Nowadays, the volume of network traffic continues to grow, along with the frequency and sophistication of attacks. This scenario highlights the need for solutions capable of continuously adapting, since network behavior is dynamic and changes over time. This work presents an anomaly detection model for network flows using unsupervised machine learning with online learning capabilities. This approach allows the system to dynamically learn the normal behavior of the network and detect deviations without requiring labeled data, which is particularly useful in real-world environments where traffic is constantly changing and labeled data is scarce. The model was implemented using the River library with a One-Class SVM and evaluated on the NF-UNSW-NB15 dataset and its extended version v2, which contain network flows labeled with different attack categories. The results show an accuracy above 98%, a false positive rate below 3.1%, and a recall of 100% in the most advanced version of the dataset. In addition, the low processing time per flow (0.033 ms) demonstrates the feasibility of the approach for real-time applications.
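
The online, label-free loop described in the abstract can be illustrated with a small sketch. It is only an illustration under stated assumptions: a running z-score detector stands in for the paper's One-Class SVM, and the class name, threshold, warm-up length, and flow values are all hypothetical. Each flow is scored first and then used to update the model of normal behavior only if it was not flagged.

```python
import math


class RunningZScoreDetector:
    """Toy online detector (a stand-in, NOT the paper's One-Class SVM)."""

    def __init__(self, threshold=3.0, warmup=5):
        self.n = 0          # number of values learned so far
        self.mean = 0.0     # running mean
        self.m2 = 0.0       # running sum of squared deviations (Welford)
        self.threshold = threshold
        self.warmup = warmup

    def score(self, x):
        # During warm-up there is too little data to judge, so score 0.
        if self.n < max(self.warmup, 2):
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x - self.mean) / std if std > 0 else 0.0

    def learn(self, x):
        # Welford's incremental update of mean and variance.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)


def detect(stream, detector):
    """Score each value first, then learn from it only if it looks normal."""
    flags = []
    for x in stream:
        is_anomaly = detector.score(x) > detector.threshold
        flags.append(is_anomaly)
        if not is_anomaly:
            detector.learn(x)
    return flags


# Hypothetical bytes-per-flow values with one obvious outlier.
flows = [100, 102, 98, 101, 99, 100, 103, 5000, 97, 101]
flags = detect(flows, RunningZScoreDetector())
```

Skipping the update for flagged flows keeps an attack burst from shifting the learned notion of normal traffic; whether the paper makes the same choice is not stated in the abstract.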

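[AI-78] DPF-CM: A Data Processing Framework with Privacy-Preserving Vector Databases for Chinese Medical LLMs Training and Deployment EMNLP2025

[Quick Read]: This paper addresses the limited attention that current open-source training pipelines for Chinese medical large language models pay to training data processing, in particular the performance bottlenecks caused by scarce instruction content and noisy preference data, as well as the privacy leakage risk at deployment. The key to the solution is DPF-CM, a complete data processing framework with two core modules. First, for training data processing, it introduces a chained-examples context-learning strategy to generate question-oriented instructions and uses a filtering mechanism based on an ensemble of reward models to improve preference data quality. Second, for privacy protection at deployment, it designs a Privacy Preserving Vector Database (PPVD) that reduces privacy leakage during inference through four cooperating stages: model memory search, high-risk database construction, secure database construction, and match-and-replace. Experiments show that the framework significantly improves model accuracy and reduces training data privacy leakage by 27%.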

Link: https://arxiv.org/abs/2509.01354
Authors: Wei Huang, Anda Cheng, Zhao Zhang, Yinggui Wang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by EMNLP 2025

Abstract:Current open-source training pipelines for Chinese medical language models predominantly emphasize optimizing training methodologies to enhance the performance of large language models (LLMs), yet lack comprehensive exploration into training data processing. To address this gap, we propose DPF-CM, a holistic Data Processing Framework for Chinese Medical LLMs training and deployment. DPF-CM comprises two core modules. The first module is a data processing pipeline tailored for model training. Beyond standard data processing operations, we (1) introduce a chained examples context-learning strategy to generate question-oriented instructions to mitigate the lack of instruction content, and (2) implement an ensemble-based filtering mechanism for preference data curation that averages multiple reward models to suppress noisy samples. The second module focuses on privacy preservation during model deployment. To prevent privacy risks from the inadvertent exposure of training data, we propose a Privacy Preserving Vector Database (PPVD) approach, which involves model memory search, high-risk database construction, secure database construction, and match-and-replace, four key stages to minimize privacy leakage during inference collectively. Experimental results show that DPF-CM significantly improves model accuracy, enabling our trained Chinese medical LLM to achieve state-of-the-art performance among open-source counterparts. Moreover, the framework reduces training data privacy leakage by 27%.
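
The ensemble-based preference filter can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract only says that several reward models are averaged to suppress noisy samples, so the margin rule, the function name `filter_preference_pairs`, and the two toy reward models are assumptions made for the example.

```python
from statistics import mean


def filter_preference_pairs(pairs, reward_models, min_margin=0.0):
    """Keep (prompt, chosen, rejected) triples whose averaged reward
    for `chosen` exceeds that of `rejected` by more than `min_margin`.
    The margin rule is an assumption, not taken from the paper."""
    kept = []
    for prompt, chosen, rejected in pairs:
        chosen_score = mean(rm(prompt, chosen) for rm in reward_models)
        rejected_score = mean(rm(prompt, rejected) for rm in reward_models)
        if chosen_score - rejected_score > min_margin:
            kept.append((prompt, chosen, rejected))
    return kept


# Hypothetical stand-ins for trained reward models.
def rm_length(prompt, response):
    return min(len(response.split()), 20) / 20


def rm_keyword(prompt, response):
    return 1.0 if "doctor" in response.lower() else 0.0


pairs = [
    ("I have a persistent cough.",
     "Please see a doctor if the cough lasts more than three weeks.",
     "Ignore it."),
    # A noisy pair: the label prefers the less helpful answer.
    ("Is fever dangerous?",
     "No.",
     "A high fever that lasts several days should be checked by a doctor."),
]
kept = filter_preference_pairs(pairs, [rm_length, rm_keyword])
```

Averaging across models means a single reward model's quirk is less likely to decide whether a pair survives.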

[AI-79] Causal Sensitivity Identification using Generative Learning IJCAI2025

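[Quick Read]: This paper addresses inaccurate causal inference in conventional predictive models caused by confounding bias, with the aim of improving predictive performance. The core challenge is identifying, from observational data, the features that have a genuine causal influence and building a more reliable predictor on them. The key to the solution is a generative method based on the Conditional Variational Autoencoder (CVAE) that combines interventional and counterfactual perspectives: it first identifies the features that are causally sensitive with respect to the predicted outcome, reducing interference from confounders, and then uses these features to build a generative predictor that makes causally informed recommendations in real scenarios such as predicting the next location in a user's trajectory. Experiments on the large-scale GeoLife trajectory dataset and the Asia Bayesian network benchmark show that the method identifies causal impact and improves predictive accuracy.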

Link: https://arxiv.org/abs/2509.01352
Authors: Soma Bandyopadhyay, Sudeshna Sarkar
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, 7 figures, Accepted at the IJCAI 2025 Workshop on Causal Learning for Recommendation Systems (CLRS). [OpenReview link: this https URL ]

Abstract:In this work, we propose a novel generative method to identify the causal impact and apply it to prediction tasks. We conduct causal impact analysis using interventional and counterfactual perspectives. First, applying interventions, we identify features that have a causal influence on the predicted outcome, which we refer to as causally sensitive features, and second, applying counterfactuals, we evaluate how changes in the cause affect the effect. Our method exploits the Conditional Variational Autoencoder (CVAE) to identify the causal impact and serve as a generative predictor. We are able to reduce confounding bias by identifying causally sensitive features. We demonstrate the effectiveness of our method by recommending the most likely locations a user will visit next in their spatiotemporal trajectory influenced by the causal relationships among various features. Experiments on the large-scale GeoLife [Zheng et al., 2010] dataset and the benchmark Asia Bayesian network validate the ability of our method to identify causal impact and improve predictive performance.

[AI-80] Error Notebook-Guided Training-Free Part Retrieval in 3D CAD Assemblies via Vision-Language Models

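[Quick Read]: This paper addresses specification-aware part retrieval in complex CAD assemblies. The core challenges are that when a large language model (LLM) or vision-language model (VLM) is applied directly, input sequences often exceed the model's token limit and performance remains unsatisfactory even after preprocessing; and that fine-tuning high-performing general-purpose models such as GPT or Gemini is often infeasible, because fine-tuning access is unavailable and the computational cost is high. The key to the solution is a retrieval framework that needs no additional training and refines prompt engineering by combining Error Notebooks with Retrieval-Augmented Generation (RAG). Error Notebooks consist of historical erroneous chains of thought (CoTs) and their corrections, forming a knowledge base of tasks, erroneous reasoning paths, and correct answers; RAG retrieves specification-relevant records from this knowledge base and incorporates them into inference. This markedly improves the retrieval accuracy of existing general-purpose models in complex 3D CAD scenarios, especially in challenging cases with high part counts (more than 10).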

Link: https://arxiv.org/abs/2509.01350
Authors: Yunqing Liu, Nan Zhang, Zhiming Tan
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Effective specification-aware part retrieval within complex CAD assemblies is essential for automated design verification and downstream engineering tasks. However, directly applying LLMs/VLMs to this task presents some challenges: the input sequences may exceed model token limits, and even after processing, performance remains unsatisfactory. Moreover, fine-tuning LLMs/VLMs requires significant computational resources, and for many high-performing general-use proprietary models (e.g., GPT or Gemini), fine-tuning access is not available. In this paper, we propose a novel part retrieval framework that requires no extra training and instead uses Error Notebooks + RAG for refined prompt engineering to improve the existing general model’s retrieval performance. The construction of Error Notebooks consists of two steps: (1) collecting historical erroneous CoTs and their incorrect answers, and (2) connecting these CoTs through reflective corrections until the correct solutions are obtained. As a result, the Error Notebooks serve as a repository of tasks along with their corrected CoTs and final answers. RAG is then employed to retrieve specification-relevant records from the Error Notebooks and incorporate them into the inference process. Another major contribution of our work is a human-in-the-loop CAD dataset, which is used to evaluate our method. In addition, the engineering value of our novel framework lies in its ability to effectively handle 3D models with lengthy, non-natural language metadata. Experiments with proprietary models, including GPT-4o and the Gemini series, show substantial gains, with GPT-4o (Omni) achieving up to a 23.4% absolute accuracy improvement on the human preference dataset. Moreover, ablation studies confirm that CoT reasoning provides benefits especially in challenging cases with higher part counts (10).
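
A rough sketch of the Error Notebook plus RAG idea: store past mistakes with their corrections, retrieve the record most similar to a new specification, and place it in the prompt. Everything here is hypothetical, including the record fields, the part identifiers, and the use of Jaccard token overlap as the similarity measure; a real system would more likely use embedding-based retrieval.

```python
def jaccard(a, b):
    """Token-overlap similarity between two strings (a simple stand-in
    for whatever retriever the authors actually use)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(notebook, spec, k=1):
    """Return the k notebook records whose task is most similar to spec."""
    ranked = sorted(notebook, key=lambda r: jaccard(r["task"], spec), reverse=True)
    return ranked[:k]


def build_prompt(spec, records):
    """Prepend retrieved mistakes and corrections to the new task."""
    lines = ["Past mistakes on similar tasks:"]
    for r in records:
        lines.append(f"- Task: {r['task']}")
        lines.append(f"  Wrong reasoning: {r['wrong_cot']}")
        lines.append(f"  Correction: {r['correction']}")
        lines.append(f"  Correct answer: {r['answer']}")
    lines.append(f"New task: {spec}")
    return "\n".join(lines)


# Hypothetical notebook entries.
notebook = [
    {"task": "find the M6 hex bolt in the gearbox assembly",
     "wrong_cot": "Picked the first part whose name contains 'bolt'.",
     "correction": "Check thread size in the metadata before matching.",
     "answer": "part_17"},
    {"task": "find the bearing seat in the pump housing",
     "wrong_cot": "Matched on colour only.",
     "correction": "Compare the bore diameter with the specification.",
     "answer": "part_04"},
]
spec = "find the M8 hex bolt in the gearbox assembly"
top = retrieve(notebook, spec)
prompt = build_prompt(spec, top)
```

Because only the prompt changes, the underlying model needs no fine-tuning, which is the property the abstract emphasizes.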

[AI-81] AT Loss: Advanced Torrential Loss Function for Precipitation Forecasting

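[Quick Read]: This paper addresses poor optimization in precipitation forecasting caused by conventional criteria, such as the critical success index (CSI), becoming ineffective during extended dry periods. The key to the solution is to introduce a simple penalty expression and reformulate it as a Quadratic Unconstrained Binary Optimization (QUBO) problem, which is then relaxed through an approximation into a differentiable Advanced Torrential (AT) loss function. The AT loss is shown to be more stable and accurate than existing methods in terms of its Lipschitz constant, forecast performance evaluations, consistency experiments, and ablation studies.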

Link: https://arxiv.org/abs/2509.01348
Authors: Jaeho Choi, Hyeri Kim, Kwang-Ho Kim, Jaesung Lee
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments:

Abstract:Accurate precipitation forecasting is becoming increasingly important in the context of climate change. In response, machine learning-based approaches have recently gained attention as an emerging alternative to traditional methods such as numerical weather prediction and climate models. Nonetheless, many recent approaches still rely on off-the-shelf loss functions, and even the more advanced ones merely involve optimization processes based on the critical success index (CSI). The problem, however, is that CSI may become ineffective during extended dry periods when precipitation remains below the threshold, rendering it less than ideal as a criterion for optimization. To address this limitation, we introduce a simple penalty expression and reinterpret it as a quadratic unconstrained binary optimization (QUBO) formulation. Ultimately, the resulting QUBO formulation is relaxed into a differentiable advanced torrential (AT) loss function through an approximation process. The proposed AT loss demonstrates its superiority through the Lipschitz constant, forecast performance evaluations, consistency experiments, and ablation studies with the operational model.
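
To see why CSI breaks down in dry periods, it helps to compute it directly. The sketch below uses the standard definition, CSI = hits / (hits + misses + false alarms); the function name, threshold, and rainfall values are hypothetical and the code is not the AT loss itself.

```python
def critical_success_index(forecast, observed, threshold):
    """CSI = hits / (hits + misses + false alarms).
    Returns None when no event was forecast or observed,
    because the ratio is then 0/0."""
    hits = misses = false_alarms = 0
    for f, o in zip(forecast, observed):
        f_event, o_event = f >= threshold, o >= threshold
        if f_event and o_event:
            hits += 1
        elif o_event:
            misses += 1
        elif f_event:
            false_alarms += 1
    denominator = hits + misses + false_alarms
    return hits / denominator if denominator else None


# Hypothetical rainfall in mm/h with a 1.0 mm/h event threshold.
wet = critical_success_index([0.0, 5.0, 12.0, 0.5], [0.0, 6.0, 0.2, 3.0], 1.0)
dry = critical_success_index([0.1, 0.0, 0.3], [0.0, 0.2, 0.1], 1.0)
```

In the dry case every value stays below the threshold, so the index is undefined however good or bad the sub-threshold forecast is, and it gives an optimizer nothing to work with.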

[AI-82] Conformal Predictive Monitoring for Multi-Modal Scenarios

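[Quick Read]: This paper addresses a weakness of quantitative predictive monitoring (QPM): when the system exhibits multi-modal dynamics, prediction intervals become overly conservative and carry no mode-specific information. Existing QPM methods are mode-agnostic, so they cannot distinguish the satisfaction distributions of different dynamical modes and produce wide intervals of little practical value. The key to the solution is GenQPM, which uses score-based diffusion models to approximate the system's probabilistic, multi-modal dynamics in a model-free way and introduces a mode classifier that partitions predicted trajectories by dynamical mode. Conformal inference is then applied to each mode separately, yielding mode-specific prediction intervals with statistical guarantees that are markedly more accurate and informative.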

Link: https://arxiv.org/abs/2509.01338
Authors: Francesca Cairoli, Luca Bortolussi, Jyotirmoy V. Deshmukh, Lars Lindemann, Nicola Paoletti
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:We consider the problem of quantitative predictive monitoring (QPM) of stochastic systems, i.e., predicting at runtime the degree of satisfaction of a desired temporal logic property from the current state of the system. Since computational efficiency is key to enable timely intervention against predicted violations, several state-of-the-art QPM approaches rely on fast machine-learning surrogates to provide prediction intervals for the satisfaction values, using conformal inference to offer statistical guarantees. However, these QPM methods suffer when the monitored agent exhibits multi-modal dynamics, whereby certain modes may yield high satisfaction values while others critically violate the property. Existing QPM methods are mode-agnostic and so would yield overly conservative and uninformative intervals that lack meaningful mode-specific satisfaction information. To address this problem, we present GenQPM, a method that leverages deep generative models, specifically score-based diffusion models, to reliably approximate the probabilistic and multi-modal system dynamics without requiring explicit model access. GenQPM employs a mode classifier to partition the predicted trajectories by dynamical mode. For each mode, we then apply conformal inference to produce statistically valid, mode-specific prediction intervals. We demonstrate the effectiveness of GenQPM on a benchmark of agent navigation and autonomous driving tasks, resulting in prediction intervals that are significantly more informative (less conservative) than mode-agnostic baselines.
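
The benefit of calibrating per mode rather than on pooled data can be shown with plain split conformal prediction. This is a sketch under assumptions: absolute residuals serve as the nonconformity score, and the mode names, residual values, and miscoverage level are hypothetical; GenQPM's generative model and mode classifier are not reproduced.

```python
import math


def conformal_radius(residuals, alpha):
    """Split-conformal radius: the ceil((n + 1) * (1 - alpha))-th
    smallest calibration residual (infinite if n is too small)."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(residuals)[k - 1]


# Hypothetical calibration residuals |predicted - actual| per dynamical mode.
calibration = {
    "lane_keep": [0.10, 0.20, 0.15, 0.05, 0.30, 0.25, 0.12, 0.18],
    "lane_change": [1.0, 1.5, 0.8, 2.0, 1.2, 0.9, 1.7, 1.1],
}
alpha = 0.25  # target miscoverage, chosen for the example

# Mode-specific radii versus one radius from the pooled residuals.
per_mode = {mode: conformal_radius(r, alpha) for mode, r in calibration.items()}
pooled = conformal_radius([x for r in calibration.values() for x in r], alpha)
```

With these numbers the pooled radius is several times wider than the one for the well-behaved mode, which is the over-conservativeness the abstract attributes to mode-agnostic methods.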

[AI-83] Multitask Battery Management with Flexible Pretraining

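[Quick Read]: This paper addresses the heavy data and engineering cost and poor scalability of industrial-scale battery management that result from task diversity (estimation, prediction, and system-level diagnostics). Existing methods must be designed separately for each task, depend on large amounts of labeled data, and struggle with heterogeneous information across temporal scales, sensor resolutions, and data channels. The key to the solution is a flexible pretraining framework, the Flexible Masked Autoencoder (FMAE), whose core capability is learning with missing data channels and capturing inter-correlations across data snippets, so that it extracts unified representations from heterogeneous battery data and supports a range of downstream tasks with minimal data and engineering effort. Experiments show that FMAE outperforms task-specific methods on five battery management tasks; on remaining life prediction in particular it reaches state-of-the-art performance with 50 times less inference data, and it is robust when real-world data lack key information such as voltage.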

Link: https://arxiv.org/abs/2509.01323
Authors: Hong Lu, Jiali Chen, Jingzhao Zhang, Guannan He, Xuebing Han, Minggao Ouyang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Industrial-scale battery management involves various types of tasks, such as estimation, prediction, and system-level diagnostics. Each task employs distinct data across temporal scales, sensor resolutions, and data channels. Building task-specific methods requires a great deal of data and engineering effort, which limits the scalability of intelligent battery management. Here we present the Flexible Masked Autoencoder (FMAE), a flexible pretraining framework that can learn with missing battery data channels and capture inter-correlations across data snippets. FMAE learns unified battery representations from heterogeneous data and can be adopted by different tasks with minimal data and engineering efforts. Experimentally, FMAE consistently outperforms all task-specific methods across five battery management tasks with eleven battery datasets. On remaining life prediction tasks, FMAE uses 50 times less inference data while maintaining state-of-the-art results. Moreover, when real-world data lack certain information, such as system voltage, FMAE can still be applied with marginal performance impact, achieving comparable results with the best hand-crafted features. FMAE demonstrates a practical route to a flexible, data-efficient model that simplifies real-world multi-task management of dynamical systems.

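[AI-84] Towards Trustworthy Vital Sign Forecasting: Leveraging Uncertainty for Prediction Intervals ICDM

[Quick Read]: This paper addresses insufficient uncertainty quantification in forecasting models for vital signs (such as heart rate and blood pressure) in clinical settings. Without reliably calibrated prediction intervals (PIs), clinicians cannot tell whether a forecasted abnormality is a genuine clinical signal or model noise, which limits trustworthy deployment of deep learning models in medical decision-making. The key to the solution is the Reconstruction Uncertainty Estimate (RUE), an uncertainty measure suited to vital-sign forecasting, from which two PI methods are derived: a parametric method based on a Gaussian copula distribution, which gives closed-form PIs, and a non-parametric method based on k-nearest neighbours (KNN), which builds PIs by empirically estimating the conditional error distribution. Experiments show that the Gaussian copula method outperforms conformal prediction baselines on low-frequency data and the KNN method performs best on high-frequency data, indicating the clinical potential of RUE-derived PIs for interpretable, uncertainty-aware vital sign forecasts.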

Link: https://arxiv.org/abs/2509.01319
Authors: Li Rong Wang, Thomas C. Henderson, Yew Soon Ong, Yih Yng Ng, Xiuyi Fan
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the 25th IEEE International Conference on Data Mining (ICDM)

Abstract:Vital signs, such as heart rate and blood pressure, are critical indicators of patient health and are widely used in clinical monitoring and decision-making. While deep learning models have shown promise in forecasting these signals, their deployment in healthcare remains limited in part because clinicians must be able to trust and interpret model outputs. Without reliable uncertainty quantification – particularly calibrated prediction intervals (PIs) – it is unclear whether a forecasted abnormality constitutes a meaningful warning or merely reflects model noise, hindering clinical decision-making. To address this, we present two methods for deriving PIs from the Reconstruction Uncertainty Estimate (RUE), an uncertainty measure well-suited to vital-sign forecasting due to its sensitivity to data shifts and support for label-free calibration. Our parametric approach assumes that prediction errors and uncertainty estimates follow a Gaussian copula distribution, enabling closed-form PI computation. Our non-parametric approach, based on k-nearest neighbours (KNN), empirically estimates the conditional error distribution using similar validation instances. We evaluate these methods on two large public datasets with minute- and hour-level sampling, representing high- and low-frequency health signals. Experiments demonstrate that the Gaussian copula method consistently outperforms conformal prediction baselines on low-frequency data, while the KNN approach performs best on high-frequency data. These results underscore the clinical promise of RUE-derived PIs for delivering interpretable, uncertainty-aware vital sign forecasts.
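
The non-parametric (KNN) route can be sketched in a few lines. This is an assumption-laden illustration rather than the authors' method: it assumes that "similar validation instances" means instances with a similar uncertainty estimate, uses conservative floor/ceil empirical quantiles, and all names and numbers are hypothetical.

```python
import math


def knn_prediction_interval(forecast, uncertainty, validation, k=5, coverage=0.8):
    """Build an interval around `forecast` from the signed errors
    (actual - predicted) of the k validation instances whose
    uncertainty estimate is closest to `uncertainty`."""
    neighbours = sorted(validation, key=lambda ue: abs(ue[0] - uncertainty))[:k]
    errors = sorted(err for _, err in neighbours)
    tail = (1 - coverage) / 2
    # Floor/ceil make the empirical quantiles err on the wide side.
    lower = errors[math.floor(tail * (len(errors) - 1))]
    upper = errors[math.ceil((1 - tail) * (len(errors) - 1))]
    return forecast + lower, forecast + upper


# Hypothetical validation set of (uncertainty estimate, signed error in bpm).
validation = [
    (0.10, -1.0), (0.12, 0.5), (0.11, 1.0), (0.09, -0.5), (0.13, 0.0),
    (0.90, -8.0), (0.95, 6.0), (1.00, 9.0), (0.85, -7.0), (1.10, 5.0),
]
confident = knn_prediction_interval(80.0, 0.10, validation)
uncertain = knn_prediction_interval(80.0, 1.00, validation)
```

The same point forecast gets a narrow interval when the uncertainty estimate is low and a wide one when it is high, which is what makes the interval informative to a clinician.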

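[AI-85] Animer une base de connaissance: des ontologies aux modèles d’I.A. générative

[Quick Read]: This paper examines, within non-anthropocentric analytical frames, how symbolic AI and neural (sub-symbolic) AI can be combined in building and using a knowledge base, with a focus on the structured management and dynamic evolution of knowledge objects in area studies. The key to the solution is a hybrid approach grounded in a semiotic (structural) reading. The LaCAS ecosystem (thesaurus, RDF/OWL ontology, linked open data services, and other modules) organizes about 160,000 documentary resources and ten knowledge macro-domains, and the knowledge object "Quechua (language)" in the "Languages of the world" domain illustrates how generative AI tools can be embedded in the knowledge base life cycle: data localization and qualification, index extraction and aggregation, property suggestion and testing, dynamic file generation, and engineering of contextualized prompts (generic, contextual, explanatory, adjustment, procedural). Specialized agents that combine model-driven and data-driven methods then animate the knowledge base while respecting its symbolic constraints.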

Link: https://arxiv.org/abs/2509.01304
Authors: Peter Stockinger (ESCOM, PLIDAM, Inalco, CIS)
Affiliation: unknown
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
Comments: in French language

Abstract:In a context where the social sciences and humanities are experimenting with non-anthropocentric analytical frames, this article proposes a semiotic (structural) reading of the hybridization between symbolic AI and neural (or sub-symbolic) AI based on a field of application: the design and use of a knowledge base for area studies. We describe the LaCAS ecosystem – Open Archives in Linguistic and Cultural Studies (thesaurus; RDF/OWL ontology; LOD services; harvesting; expertise; publication), deployed at Inalco (National Institute for Oriental Languages and Civilizations) in Paris with the Okapi (Open Knowledge and Annotation Interface) software environment from Ina (National Audiovisual Institute), which now has around 160,000 documentary resources and ten knowledge macro-domains grouping together several thousand knowledge objects. We illustrate this approach using the knowledge domain “Languages of the world” (~540 languages) and the knowledge object “Quechua (language)”. On this basis, we discuss the controlled integration of neural tools, more specifically generative tools, into the life cycle of a knowledge base: assistance with data localization/qualification, index extraction and aggregation, property suggestion and testing, dynamic file generation, and engineering of contextualized prompts (generic, contextual, explanatory, adjustment, procedural) aligned with a domain ontology. We outline an ecosystem of specialized agents capable of animating the database while respecting its symbolic constraints, by articulating model-driven and data-driven methods.

[AI-86] Building surrogate models using trajectories of agents trained by Reinforcement Learning ICANN2024

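[Quick Read]: This paper addresses low sample efficiency when building surrogate models for computationally expensive simulations, especially with wide state spaces, where existing sampling strategies are of limited effectiveness. The key to the solution is a sampling method based on policies trained with Reinforcement Learning (RL): a mixed dataset is built from data generated by random agents, expert agents, and agents trained specifically to explore the regions of maximum entropy of the state transition distribution. Empirical analysis shows that this mixed sampling strategy scores best across all datasets, ahead of Latin-Hypercube sampling and of Active Learning combined with Kriging, which markedly improves the surrogate's representation of the state space and opens a feasible path to surrogate-aided RL policy optimization on complex simulators.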

Link: https://arxiv.org/abs/2509.01285
Authors: Julen Cestero, Marco Quartulli, Marcello Restelli
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Published in ICANN 2024 conference

Abstract:Sample efficiency in the face of computationally expensive simulations is a common concern in surrogate modeling. Current strategies to minimize the number of samples needed are not as effective in simulated environments with wide state spaces. As a response to this challenge, we propose a novel method to efficiently sample simulated deterministic environments by using policies trained by Reinforcement Learning. We provide an extensive analysis of these surrogate-building strategies with respect to Latin-Hypercube sampling or Active Learning and Kriging, cross-validating performances with all sampled datasets. The analysis shows that a mixed dataset that includes samples acquired by random agents, expert agents, and agents trained to explore the regions of maximum entropy of the state transition distribution provides the best scores through all datasets, which is crucial for a meaningful state space representation. We conclude that the proposed method improves the state-of-the-art and clears the path to enable the application of surrogate-aided Reinforcement Learning policy optimization strategies on complex simulators.

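[AI-87] Communicative Agents for Slideshow Storytelling Video Generation based on LLMs

[Quick Read]: This paper addresses the low efficiency and limited scalability of conventional text-to-video generation models caused by high computational cost. The authors propose Video-Generation-Team (VGTeam), a slideshow-style video generation system whose core innovation is a "chat tower" workflow built from multiple cooperating agents, each responsible for a key stage of video production such as scriptwriting, scene creation, or audio design. By emulating the staged process of traditional video production, VGTeam substantially reduces computational cost (an average generation cost of only 0.103 US dollars) while maintaining a 98.4% success rate and high creative fidelity, balancing efficiency, scalability, and customization.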

Link: https://arxiv.org/abs/2509.01277
Authors: Jingxing Fan, Jinrong Shen, Yusheng Yao, Shuangqing Wang, Qian Wang, Yuling Wang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 8 pages, 8 figures, 1 table

Abstract:With the rapid advancement of artificial intelligence (AI), the proliferation of AI-generated content (AIGC) tasks has significantly accelerated developments in text-to-video generation. As a result, the field of video production is undergoing a transformative shift. However, conventional text-to-video models are typically constrained by high computational costs. In this study, we propose Video-Generation-Team (VGTeam), a novel slide show video generation system designed to redefine the video creation pipeline through the integration of large language models (LLMs). VGTeam is composed of a suite of communicative agents, each responsible for a distinct aspect of video generation, such as scriptwriting, scene creation, and audio design. These agents operate collaboratively within a chat tower workflow, transforming user-provided textual prompts into coherent, slide-style narrative videos. By emulating the sequential stages of traditional video production, VGTeam achieves remarkable improvements in both efficiency and scalability, while substantially reducing computational overhead. On average, the system generates videos at a cost of only $0.103, with a successful generation rate of 98.4%. Importantly, this framework maintains a high degree of creative fidelity and customization. The implications of VGTeam are far-reaching. It democratizes video production by enabling broader access to high-quality content creation without the need for extensive resources. Furthermore, it highlights the transformative potential of language models in creative domains and positions VGTeam as a pioneering system for next-generation content creation.

[AI-88] Multi-Agent Reinforcement Learning for Task Offloading in Wireless Edge Networks NEURIPS’25

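[Quick Read]: This paper addresses a challenge for Multi-Agent Reinforcement Learning (MARL) in edge computing systems: achieving efficient, safe local decisions and global resource coordination under limited observability and communication constraints. Existing methods usually rely on centralized critics or frequent communication and are hard to deploy in practice. The key to the solution is a decentralized framework in which each agent solves a Constrained Markov Decision Process (CMDP) and coordinates implicitly through a shared constraint vector that is updated infrequently and costs little communication, so agents can optimize their local policies autonomously without sacrificing global objectives such as avoiding server overload. Combined with safe reinforcement learning, the method has theoretical guarantees and outperforms centralized and independent baselines in experiments, especially in large-scale settings.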

Link: https://arxiv.org/abs/2509.01257
Authors: Andrea Fox, Francesco De Pellegrini, Eitan Altman
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: Submitted at AI4NextG @ NeurIPS’25 Workshop

Abstract:In edge computing systems, autonomous agents must make fast local decisions while competing for shared resources. Existing MARL methods often resort to centralized critics or frequent communication, which fail under limited observability and communication constraints. We propose a decentralized framework in which each agent solves a constrained Markov decision process (CMDP), coordinating implicitly through a shared constraint vector. In the specific case of offloading, for example, the constraints prevent overloading shared server resources. Coordination constraints are updated infrequently and act as a lightweight coordination mechanism. They enable agents to align with global resource usage objectives but require little direct communication. Using safe reinforcement learning, agents learn policies that meet both local and global goals. We establish theoretical guarantees under mild assumptions and validate our approach experimentally, showing improved performance over centralized and independent baselines, especially in large-scale settings.

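[AI-89] Towards Agentic OS: An LLM Agent Framework for Linux Schedulers

[Quick Read]: This paper addresses the semantic gap in operating system schedulers: kernel scheduling policies cannot understand application-specific needs, which leads to suboptimal performance. The key to the solution is SchedCP, a framework built around a decoupled control plane that separates the AI's role of semantic reasoning ("what to optimize") from the system's role of execution ("how to observe and act"). SchedCP is implemented as a Model Context Protocol (MCP) server and provides three core services: a Workload Analysis Engine, an evolving Scheduler Policy Repository, and an Execution Verifier that validates all AI-generated code and configurations with static and dynamic analysis before deployment. This architecture lets Large Language Model (LLM) agents optimize Linux schedulers autonomously and safely without human involvement, improving performance and reducing cost.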

Link: https://arxiv.org/abs/2509.01245
Authors: Yusheng Zheng, Yanpeng Hu, Wei Zhang, Andi Quinn
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Operating Systems (cs.OS)
Comments:

Abstract:Operating system schedulers suffer from a fundamental semantic gap, where kernel policies fail to understand application-specific needs, leading to suboptimal performance. We introduce SchedCP, the first framework that enables fully autonomous Large Language Model (LLM) agents to safely and efficiently optimize Linux schedulers without human involvement. Our core insight is that the challenge is not merely to apply a better LLM, but to architect a decoupled control plane that separates the AI’s role of semantic reasoning (“what to optimize”) from the system’s role of execution (“how to observe and act”). Implemented as a Model Context Protocol (MCP) server, SchedCP provides a stable interface with three key services: a Workload Analysis Engine, an evolving Scheduler Policy Repository, and an Execution Verifier that validates all AI-generated code and configurations before deployment with static and dynamic analysis. We demonstrate this architecture’s power with sched-agent, a multi-agent system that autonomously analyzes workloads, synthesizes custom eBPF scheduling policies, and deploys them via the sched_ext infrastructure. Our evaluation shows that SchedCP achieves up to a 1.79x performance improvement and a 13x cost reduction compared to naive agentic approaches, all while maintaining a high success rate. By bridging the semantic gap, SchedCP democratizes expert-level system optimization and represents a step towards creating truly self-optimizing, application-aware operating systems. The code is open-sourced at this https URL.

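[AI-90] Towards Open-World Retrieval-Augmented Generation on Knowledge Graph: A Multi-Agent Collaboration Framework

[Quick Read]: This paper addresses the limited robustness of existing Knowledge Graph (KG)-based Retrieval-Augmented Generation (RAG) methods in open-world settings, which results from their reliance on predefined anchor entities. The core challenge is that when a user query cannot be linked accurately to entities in the knowledge graph, conventional methods tend to fail at retrieval or follow wrong paths. The key to the solution is AnchorRAG, a multi-agent collaboration framework that needs no predefined anchor entities: a predictor agent dynamically identifies candidate anchor entities, multiple independent retriever agents explore multi-hop paths in parallel, and a supervisor agent formulates the iterative retrieval strategy and synthesizes the knowledge paths into the final answer, which markedly improves retrieval robustness and mitigates the effect of ambiguous or erroneous anchors.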

Link: https://arxiv.org/abs/2509.01238
Authors: Jiasheng Xu, Mingda Li, Yongqiang Tang, Peijie Wang, Wensheng Zhang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have demonstrated strong capabilities in language understanding and reasoning. However, their dependence on static training corpora makes them prone to factual errors and knowledge gaps. Retrieval-Augmented Generation (RAG) addresses this limitation by incorporating external knowledge sources, especially structured Knowledge Graphs (KGs), which provide explicit semantics and efficient retrieval. Existing KG-based RAG approaches, however, generally assume that anchor entities are accessible to initiate graph traversal, which limits their robustness in open world settings where accurate linking between the query and the entity is unreliable. To overcome this limitation, we propose AnchorRAG, a novel multi-agent collaboration framework for open-world RAG without the predefined anchor entities. Specifically, a predictor agent dynamically identifies candidate anchor entities by aligning user query terms with KG nodes and initializes independent retriever agents to conduct parallel multi-hop explorations from each candidate. Then a supervisor agent formulates the iterative retrieval strategy for these retriever agents and synthesizes the resulting knowledge paths to generate the final answer. This multi-agent collaboration framework improves retrieval robustness and mitigates the impact of ambiguous or erroneous anchors. Extensive experiments on four public benchmarks demonstrate that AnchorRAG significantly outperforms existing baselines and establishes new state-of-the-art results on the real-world question answering tasks.
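
The predictor and retriever roles can be pictured with a toy example. This is a loose sketch, not AnchorRAG: the knowledge graph, the term-matching rule for candidate anchors, and the breadth-first exploration are all hypothetical simplifications, and the supervisor agent that would pick among the returned paths is left out.

```python
from collections import deque

# Hypothetical toy knowledge graph: head -> [(relation, tail), ...]
KG = {
    "Paris": [("capital_of", "France")],
    "France": [("currency", "Euro"), ("continent", "Europe")],
    "Paris Hilton": [("occupation", "Media personality")],
}


def candidate_anchors(query, kg):
    """Predictor step (simplified): every node sharing a term with the query."""
    terms = {t.strip("?.,").lower() for t in query.split()}
    return [node for node in kg if terms & {w.lower() for w in node.split()}]


def explore(anchor, kg, max_hops=2):
    """Retriever step (simplified): breadth-first search that collects
    every relation path of up to `max_hops` edges from the anchor."""
    paths, queue = [], deque([(anchor, [])])
    while queue:
        node, path = queue.popleft()
        if len(path) == max_hops:
            continue
        for relation, tail in kg.get(node, []):
            new_path = path + [(node, relation, tail)]
            paths.append(new_path)
            queue.append((tail, new_path))
    return paths


query = "What currency is used in the country whose capital is Paris?"
anchors = candidate_anchors(query, KG)
paths = {anchor: explore(anchor, KG) for anchor in anchors}
```

Both "Paris" and "Paris Hilton" come back as candidates; exploring from each and comparing the paths afterwards is how a wrong anchor can be tolerated instead of derailing retrieval.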

[AI-91] LiquidGEMM: Hardware-Efficient W4A8 GEMM Kernel for High-Performance LLM Serving

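【Quick Read】: This paper addresses a performance bottleneck in large language model (LLM) inference with 4-bit weight and 8-bit activation quantization (W4A8): dequantization on CUDA Cores is inefficient, so the high throughput of Tensor Cores cannot be fully exploited. The key to the solution is LiquidGEMM, a hardware-efficient W4A8 GEMM kernel built on two core techniques. The first is LiquidQuant, a quantization method that enables fast, overflow-safe dequantization with only two arithmetic instructions per four elements. The second is an implicit fine-grained pipeline that fully overlaps weight loading, dequantization, and matrix multiply-accumulate (MMA) across warp groups without software synchronization or redundant memory traffic, which improves computational efficiency and system-level performance.

Link: https://arxiv.org/abs/2509.01229
Authors: Huanqi Hu, Bowen Xiao, Shixuan Sun, Jianian Yin, Zhexi Zhang, Xiang Luo, Chengquan Jiang, Weiqi Xu, Xiaoying Jia, Xin Liu, Minyi Guo
Affiliation: unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 13 figures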

Abstract:Quantization is a critical technique for accelerating LLM inference by reducing memory footprint and improving computational efficiency. Among various schemes, 4-bit weight and 8-bit activation quantization (W4A8) offers a strong balance between accuracy and performance. However, existing W4A8 GEMM kernels fall short in practice due to inefficient dequantization on CUDA Cores, which cannot keep pace with the high throughput of Tensor Cores. In this paper, we present LiquidGEMM, a hardware-efficient W4A8 GEMM kernel for efficient LLM serving. LiquidGEMM designs two key techniques: LiquidQuant, a hardware-efficient quantization method that enables fast, overflow-safe dequantization using just two arithmetic instructions per four elements; and an implicit fine-grained pipeline that fully overlaps weight loading, dequantization, and MMA across warp groups without software synchronization or redundant memory traffic. Experimental results show that LiquidGEMM achieves up to 2.90x speedup over state-of-the-art W4A8 kernels and up to 4.94x end-to-end system-level speedup. Compared to various quantized GEMM kernels in NVIDIA TensorRT-LLM, LiquidGEMM delivers 1.12-1.63x performance gains, and achieves up to 1.63x system-level speedup.
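
To make the dequantization step concrete, here is a minimal sketch of generic 4-bit weight packing and unpacking. It is not LiquidQuant and does not reproduce its two-instruction scheme; the function names, the zero point of 8, and the single per-group scale are assumptions, and the weights are expanded to floats for readability, whereas a W4A8 kernel would expand them to 8-bit integers on the GPU.

```python
def pack_int4(values):
    """Pack unsigned 4-bit integers (0..15) two per byte, low nibble first."""
    if len(values) % 2:
        raise ValueError("expected an even number of values")
    return bytes(lo | (hi << 4) for lo, hi in zip(values[0::2], values[1::2]))

def unpack_int4(packed):
    """Inverse of pack_int4: split every byte into its low and high nibble."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)
        out.append(byte >> 4)
    return out

def quantize_w4(weights, zero_point=8):
    """Quantize one group of weights to 4 bits with a single shared scale (assumed scheme)."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 7.0
    q = [min(15, max(0, round(w / scale) + zero_point)) for w in weights]
    return q, scale

def dequantize_w4(packed, scale, zero_point=8):
    """Unpack the nibbles and map them back to approximate real values."""
    return [(q - zero_point) * scale for q in unpack_int4(packed)]

weights = [0.7, -0.7, 0.1, 0.0]
q, scale = quantize_w4(weights)
restored = dequantize_w4(pack_int4(q), scale)
```

The unpack step (mask, shift, subtract the zero point, scale) is the per-element work that the abstract says CUDA Cores struggle to keep up with.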

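[AI-92] Web Fraud Attacks Against LLM-Driven Multi-Agent Systems

【Quick Read】: This paper addresses the security risk that arises when LLM-driven multi-agent systems (MAS) visit malicious websites. Once an agent is induced to follow a malicious link, attackers can use it to expand the attack surface and launch diverse follow-up attacks, seriously threatening system reliability and user safety. The key contribution is a new attack type, Web Fraud Attacks, with 11 representative variants that cover domain name tampering (such as homoglyph deception and character substitution) and link structure camouflage (such as sub-directory nesting, sub-domain grafting, and parameter obfuscation), all aimed at weaknesses in how MAS validate links. Experiments show that these attacks are destructive across different MAS architectures and also stealthy: they need no complex input formats (as jailbreaking does), which lowers their exposure risk and underlines the threat they pose in practice.

Link: https://arxiv.org/abs/2509.01211
Authors: Dezhang Kong, Hujin Peng, Yilun Zhang, Lele Zhao, Zhenhua Xu, Shi Lin, Changting Lin, Meng Han
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: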

Abstract:With the proliferation of applications built upon LLM-driven multi-agent systems (MAS), the security of Web links has become a critical concern in ensuring system reliability. Once an agent is induced to visit a malicious website, attackers can use it as a springboard to conduct diverse subsequent attacks, which will drastically expand the attack surface. In this paper, we propose Web Fraud Attacks, a novel type of attack aiming at inducing MAS to visit malicious websites. We design 11 representative attack variants that encompass domain name tampering (homoglyph deception, character substitution, etc.), link structure camouflage (sub-directory nesting, sub-domain grafting, parameter obfuscation, etc.), and other deceptive techniques tailored to exploit MAS’s vulnerabilities in link validation. Through extensive experiments on these crafted attack vectors, we demonstrate that Web fraud attacks not only exhibit significant destructive potential across different MAS architectures but also possess a distinct advantage in evasion: they circumvent the need for complex input formats such as jailbreaking, which inherently carry higher exposure risks. These results underscore the importance of addressing Web fraud attacks in LLM-driven MAS, as their stealthiness and destructiveness pose non-negligible threats to system security and user safety.
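
On the defensive side, the link-validation weakness described above suggests checking the parsed hostname rather than the raw URL string. The following is a minimal sketch of such a check, not a method from the paper; the allowlist, the function name `is_trusted`, and the policy of rejecting every non-ASCII hostname are assumptions.

```python
from urllib.parse import urlsplit

def is_trusted(url, allowed_domains):
    """Return True only if the URL's hostname is an allowed domain or one of its subdomains."""
    try:
        host = urlsplit(url).hostname
    except ValueError:
        return False  # malformed URL
    if host is None:
        return False
    if not host.isascii():
        return False  # conservative policy: reject possible homoglyph hostnames
    host = host.rstrip(".")
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

ALLOWED = ["example.com"]  # hypothetical allowlist

checks = {
    "https://docs.example.com/guide": is_trusted("https://docs.example.com/guide", ALLOWED),
    # Sub-domain grafting: the trusted name is only a prefix of the real host.
    "https://example.com.evil.net/login": is_trusted("https://example.com.evil.net/login", ALLOWED),
    # Sub-directory nesting: the trusted name appears only in the path.
    "https://evil.net/example.com/login": is_trusted("https://evil.net/example.com/login", ALLOWED),
    # Homoglyph: Cyrillic U+0430 in place of the Latin letter "a".
    "https://ex\u0430mple.com/": is_trusted("https://ex\u0430mple.com/", ALLOWED),
}
```

This check does not inspect query parameters, so a redirect target hidden in a parameter would still need separate handling.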

[AI-93] Preserving Vector Space Properties in Dimensionality Reduction: A Relationship Preserving Loss Framework

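【Quick Read】: This paper addresses the loss of key vector space properties, such as orthogonality and linear independence, during dimensionality reduction; these properties are critical for downstream tasks such as cross-modal retrieval, clustering, and classification. The key to the solution is a Relationship Preserving Loss (RPL), which preserves geometric structure by minimizing the discrepancy between the relationship matrices (for example, Gram or cosine matrices) of the high-dimensional data and of its low-dimensional embeddings. This maintains the core vector space properties under non-linear projection and is backed by error bounds derived from matrix perturbation theory.

Link: https://arxiv.org/abs/2509.01198
Authors: Eddi Weinwurm, Alexander Kovalenko
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: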

Abstract:Dimensionality reduction can distort vector space properties such as orthogonality and linear independence, which are critical for tasks including cross-modal retrieval, clustering, and classification. We propose a Relationship Preserving Loss (RPL), a loss function that preserves these properties by minimizing discrepancies between relationship matrices (e.g., Gram or cosine) of high-dimensional data and their low-dimensional embeddings. RPL trains neural networks for non-linear projections and is supported by error bounds derived from matrix perturbation theory. Initial experiments suggest that RPL reduces embedding dimensions while largely retaining performance on downstream tasks, likely due to its preservation of key vector space properties. While we describe here the use of RPL in dimensionality reduction, this loss can also be applied more broadly, for example to cross-domain alignment and transfer learning, knowledge distillation, fairness and invariance, dehubbing, graph and manifold learning, and federated learning, where distributed embeddings must remain geometrically consistent.
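
A minimal sketch of such a loss, assuming the Gram matrix as the relationship matrix and the mean squared difference as the discrepancy; the abstract does not fix either choice, and the function names are hypothetical. Plain Python lists are used so the example needs no dependencies.

```python
def gram(vectors):
    """Gram matrix: pairwise dot products between all row vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]

def relationship_preserving_loss(high, low):
    """Mean squared difference between the Gram matrices of the inputs and of their embeddings."""
    g_high, g_low = gram(high), gram(low)
    n = len(high)
    return sum((g_high[i][j] - g_low[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

high = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # two orthonormal 3-D vectors
good = [[1.0, 0.0], [0.0, 1.0]]             # 2-D embedding that keeps them orthogonal
bad = [[1.0, 0.0], [1.0, 0.0]]              # 2-D embedding that collapses them

loss_good = relationship_preserving_loss(high, good)
loss_bad = relationship_preserving_loss(high, bad)
```

The collapsed embedding is penalized only through the off-diagonal Gram entries, which is exactly where the lost orthogonality shows up.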

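[AI-94] EZhouNet: A framework based on graph neural network and anchor interval for the respiratory sound event detection

【Quick Read】: This paper addresses three key problems in respiratory sound event detection. First, existing methods mostly rely on frame-level predictions followed by post-processing to produce event-level outputs, which makes interval boundaries hard to learn precisely. Second, most methods handle only fixed-length audio, limiting their applicability to variable-length respiratory sounds. Third, the effect of respiratory sound location information on detection performance has not been fully explored. The key to the solution is a graph neural network (GNN)-based framework with an anchor interval mechanism that lets the model learn event boundaries directly, giving more precise temporal localization of abnormal respiratory sound events and supporting variable-length audio input; incorporating respiratory sound location information further improves the discrimination between abnormal sounds.

Link: https://arxiv.org/abs/2509.01153
Authors: Yun Chu, Qiuhao Wang, Enze Zhou, Qian Liu, Gang Zheng
Affiliation: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: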

Abstract:Auscultation is a key method for early diagnosis of respiratory and pulmonary diseases, relying on skilled healthcare professionals. However, the process is often subjective, with variability between experts. As a result, numerous deep learning-based automatic classification methods have emerged, most of which focus on respiratory sound classification. In contrast, research on respiratory sound event detection remains limited. Existing sound event detection methods typically rely on frame-level predictions followed by post-processing to generate event-level outputs, making interval boundaries challenging to learn directly. Furthermore, many approaches can only handle fixed-length audio, limiting their applicability to variable-length respiratory sounds. Additionally, the impact of respiratory sound location information on detection performance has not been extensively explored. To address these issues, we propose a graph neural network-based framework with anchor intervals, capable of handling variable-length audio and providing more precise temporal localization for abnormal respiratory sound events. Our method improves both the flexibility and applicability of respiratory sound detection. Experiments on the SPRSound 2024 and HF Lung V1 datasets demonstrate the effectiveness of the proposed approach, and incorporating respiratory position information enhances the discrimination between abnormal sounds.
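
The anchor-interval idea can be sketched independently of the network: candidate intervals are laid over a clip of any length and matched to ground-truth events by temporal overlap. The anchor widths, the stride, and all function names below are assumptions for illustration, and the graph neural network itself is not shown.

```python
def make_anchors(duration, widths, stride):
    """Lay candidate (start, end) intervals over a clip of arbitrary duration, in seconds."""
    anchors = []
    n_centers = int(duration / stride) + 1
    for i in range(n_centers):
        center = i * stride
        for width in widths:
            start = max(0.0, center - width / 2)
            end = min(duration, center + width / 2)
            anchors.append((start, end))
    return anchors

def temporal_iou(a, b):
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def best_anchor(anchors, event):
    """Pick the anchor that overlaps a ground-truth event the most."""
    return max(anchors, key=lambda anchor: temporal_iou(anchor, event))

anchors = make_anchors(duration=3.0, widths=[1.0, 2.0], stride=0.5)
match = best_anchor(anchors, event=(1.0, 2.0))
```

Because the number of anchors grows with `duration`, the same procedure applies to clips of any length, which is the variable-length property the abstract emphasizes.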

[AI-95] Heads or Tails: A Simple Example of Causal Abstractive Simulation

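【Quick Read】: This paper addresses the lack of a formal causal basis for claims that a language model simulates a physical system (such as a fair coin toss), and aims to connect statistics-based benchmarking practice with causal theory. The key to the solution is the formal framework of causal abstractive simulation: given a causal description of the target system, it makes precise whether a language model simulates that system correctly. The note presents both success and failure cases to illustrate the approach, giving language model simulation a rigorous theoretical footing.

Link: https://arxiv.org/abs/2509.01136
Authors: Gabriel Simmons
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 14 pages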

Abstract:This note illustrates how a variety of causal abstraction [arXiv:1707.00819, arXiv:1812.03789], defined here as causal abstractive simulation, can be used to formalize a simple example of language model simulation. This note considers the case of simulating a fair coin toss with a language model. Examples are presented illustrating the ways language models can fail to simulate, and a success case is presented, illustrating how this formalism may be used to prove that a language model simulates some other system, given a causal description of the system. This note may be of interest to three groups. For practitioners in the growing field of language model simulation, causal abstractive simulation is a means to connect ad-hoc statistical benchmarking practices to the solid formal foundation of causality. Philosophers of AI and philosophers of mind may be interested as causal abstractive simulation gives a precise operationalization to the idea that language models are role-playing [arXiv:2402.12422]. Mathematicians and others working on causal abstraction may be interested to see a new application of the core ideas that yields a new variation of causal abstraction.

[AI-96] MATL-DC: A Multi-domain Aggregation Transfer Learning Framework for EEG Emotion Recognition with Domain-Class Prototype under Unseen Targets

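【Quick Read】: This paper addresses the heavy dependence of transfer learning for electroencephalography (EEG)-based emotion recognition on both source-domain and target-domain data, which limits its use in practice. The key to the solution is a multi-domain aggregation transfer learning framework (MATL-DC). A feature decoupling module separates class-invariant domain features from domain-invariant class features within the shallow features, and a multi-domain aggregation mechanism builds a "superdomain" that strengthens the representation of emotional EEG signals. Within each superdomain, class prototype representations are extracted, and a pairwise learning strategy recasts sample classification as a similarity judgment between sample pairs, which reduces the influence of label noise. Notably, the target domain is completely unseen during training; at inference time only the trained domain-class prototypes are used for emotion recognition, which improves generalization to unknown target domains.

Link: https://arxiv.org/abs/2509.01135
Authors: Guangli Li, Canbiao Wu, Zhehao Zhou, Na Tian, Zhen Liang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: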

Abstract:Emotion recognition based on electroencephalography (EEG) signals is increasingly becoming a key research hotspot in affective Brain-Computer Interfaces (aBCIs). However, the current transfer learning model greatly depends on the source domain and target domain data, which hinder the practical application of emotion recognition. Therefore, we propose a Multi-domain Aggregation Transfer Learning framework for EEG emotion recognition with Domain-Class prototype under unseen targets (MATL-DC). We design the feature decoupling module to decouple class-invariant domain features from domain-invariant class features from shallow features. In the model training stage, the multi-domain aggregation mechanism aggregates the domain feature space to form a superdomain, which enhances the characteristics of emotional EEG signals. In each superdomain, we further extract the class prototype representation by class features. In addition, we adopt the pairwise learning strategy to transform the sample classification problem into the similarity problem between sample pairs, which effectively alleviates the influence of label noise. It is worth noting that the target domain is completely unseen during the training process. In the inference stage, we use the trained domain-class prototypes for inference, and then realize emotion recognition. We rigorously validate it on the publicly available databases (SEED, SEED-IV and SEED-V). The results show that the accuracy of MATL-DC model is 84.70%, 68.11% and 61.08%, respectively. MATL-DC achieves comparable or even better performance than methods that rely on both source and target domains. The source code is available at this https URL.
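
Prototype-based inference of the kind described above can be sketched as follows. This is a simplified illustration under assumptions: prototypes are plain per-(domain, class) feature means, similarity is cosine similarity, and the features are made-up 2-D vectors. The feature decoupling module, the superdomain aggregation, and the pairwise training objective are not shown.

```python
import math

def build_prototypes(samples):
    """Average the features of each (domain, label) group into one prototype vector."""
    groups = {}
    for domain, label, feature in samples:
        groups.setdefault((domain, label), []).append(feature)
    return {
        key: [sum(column) / len(features) for column in zip(*features)]
        for key, features in groups.items()
    }

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def predict(feature, prototypes):
    """Return the label of the most similar prototype; no target-domain data is needed."""
    (_, label), _ = max(prototypes.items(), key=lambda item: cosine(feature, item[1]))
    return label

# Hypothetical source-domain samples: (subject id, emotion label, feature vector).
samples = [
    ("s1", "happy", [1.0, 0.0]),
    ("s1", "happy", [0.8, 0.2]),
    ("s1", "sad", [0.0, 1.0]),
    ("s2", "sad", [0.1, 0.9]),
]
prototypes = build_prototypes(samples)
prediction = predict([0.9, 0.1], prototypes)
```

The point of the sketch is that inference touches only the stored prototypes, which is why an unseen target subject requires no retraining.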

[AI-97] SC-GIR: Goal-oriented Semantic Communication via Invariant Representation Learning

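【Quick Read】: This paper addresses two problems in goal-oriented semantic communication (SC): joint training at the transceivers leads to redundant data exchange, and reliance on labeled datasets limits generality across downstream tasks. The key to the solution is SC-GIR, a goal-oriented SC framework based on invariant representations. It uses self-supervised learning to extract an invariant representation that is independent of the specific downstream task, compresses the information, and retains the key semantic features, supporting efficient and task-adaptive image transmission. Covariance-based contrastive learning is used to build a latent representation that is both meaningful and semantically dense; the compressed data reach over 85% classification accuracy under different signal-to-noise ratio (SNR) conditions, and the method outperforms baseline schemes by nearly 10%.

Link: https://arxiv.org/abs/2509.01119
Authors: Senura Hansaja Wanasekara, Van-Dinh Nguyen, Kok-Seng, M.-Duong Nguyen, Symeon Chatzinotas, Octavia A. Dobre
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: 16 pages, Accepted to IEEE Transactions on Mobile Computing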

Abstract:Goal-oriented semantic communication (SC) aims to revolutionize communication systems by transmitting only task-essential information. However, current approaches face challenges such as joint training at transceivers, leading to redundant data exchange and reliance on labeled datasets, which limits their task-agnostic utility. To address these challenges, we propose a novel framework called Goal-oriented Invariant Representation-based SC (SC-GIR) for image transmission. Our framework leverages self-supervised learning to extract an invariant representation that encapsulates crucial information from the source data, independent of the specific downstream task. This compressed representation facilitates efficient communication while retaining key features for successful downstream task execution. Focusing on machine-to-machine tasks, we utilize covariance-based contrastive learning techniques to obtain a latent representation that is both meaningful and semantically dense. To evaluate the effectiveness of the proposed scheme on downstream tasks, we apply it to various image datasets for lossy compression. The compressed representations are then used in a goal-oriented AI task. Extensive experiments on several datasets demonstrate that SC-GIR outperforms baseline schemes by nearly 10%, and achieves over 85% classification accuracy for compressed data under different SNR conditions. These results underscore the effectiveness of the proposed framework in learning compact and informative latent representations.

[AI-98] CCE: Confidence-Consistency Evaluation for Time Series Anomaly Detection

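【Quick Read】: This paper addresses several limitations of evaluation metrics for time series anomaly detection: insufficient discriminative power, strong hyperparameter dependency, sensitivity to perturbations, and high computational overhead. The key to the solution is a new evaluation metric, Confidence-Consistency Evaluation (CCE). It uses Bayesian estimation to quantify the uncertainty of anomaly scores, builds confidence and consistency scores at both the global and the event level, and combines them into a concise, unified CCE metric. The metric is shown to have strict boundedness, Lipschitz robustness against score perturbations, and linear time complexity $\mathcal{O}(n)$, which improves the reliability and efficiency of evaluation.

Link: https://arxiv.org/abs/2509.01098
Authors: Zhijie Zhong, Zhiwen Yu, Yiu-ming Cheung, Kaixiang Yang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 17 pages, 10 figures, 6 tables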

Abstract:Time Series Anomaly Detection metrics serve as crucial tools for model evaluation. However, existing metrics suffer from several limitations: insufficient discriminative power, strong hyperparameter dependency, sensitivity to perturbations, and high computational overhead. This paper introduces Confidence-Consistency Evaluation (CCE), a novel evaluation metric that simultaneously measures prediction confidence and uncertainty consistency. By employing Bayesian estimation to quantify the uncertainty of anomaly scores, we construct both global and event-level confidence and consistency scores for model predictions, resulting in a concise CCE metric. Theoretically and experimentally, we demonstrate that CCE possesses strict boundedness, Lipschitz robustness against score perturbations, and linear time complexity $\mathcal{O}(n)$. Furthermore, we establish RankEval, a benchmark for comparing the ranking capabilities of various metrics. RankEval represents the first standardized and reproducible evaluation pipeline that enables objective comparison of evaluation metrics. Both CCE and RankEval implementations are fully open-source.

[AI-99] DSDE: Dynamic Speculative Decoding with KLD Stability for Real-World Serving

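【Quick Read】: This paper addresses the inference inefficiency caused by a fixed speculation length in large-batch serving environments, especially when requests are diverse and a fixed length cannot adapt to the generation characteristics of different sequences. The key to the solution is a training-free Dynamic Speculative Decoding Engine (DSDE) that adapts using post-hoc diagnostic signals. It has two components: a predictive signal based on the variance of the Kullback-Leibler divergence (KLD), which diagnoses the regional stability of generation, and an adaptive speculation length cap, which mitigates the straggler problem in per-sequence decoding. Experiments show end-to-end latency competitive with leading baselines and stronger robustness across diverse workloads; the signal keeps its diagnostic value even in low-acceptance-rate regimes, supporting post-hoc signals as a component of more robust and intelligent LLM inference systems.

Link: https://arxiv.org/abs/2509.01083
Authors: Mingyu Yang, Jae-Young Choi, Kihyo Moon, Minsung Jang, Eunjoo Joen
Affiliation: unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: 10 pages, 9 figures. Preprint submitted to IEEE BigData 2025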

Abstract:Speculative decoding accelerates large language model inference, but its reliance on a fixed speculation length is suboptimal in large-batch serving environments with diverse requests. This paper explores a new direction for dynamic adaptation by investigating a novel class of post-hoc, diagnostic signals. We propose Dynamic Speculative Decoding Engine (DSDE), a training-free framework built on two primary components: (1) a predictive signal based on the variance of the Kullback-Leibler (KLD) divergence, which diagnoses the generation’s regional stability, and (2) an adaptive speculation length cap to mitigate the straggler problem in per-sequence decoding. Experiments demonstrate the potential of using KLD-based stability signals for dynamic adaptation. An algorithm guided by these signals achieves end-to-end latency competitive with leading baselines and exhibits superior robustness across diverse workloads. This robustness is particularly valuable in challenging low-acceptance-rate regimes, where the proposed signal maintains its diagnostic utility. Collectively, these findings validate post-hoc signals as a valuable component for building more robust and intelligent LLM inference systems, and highlight a promising direction for future research on dynamic speculation length adaptation.
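
The first component, a stability signal derived from the variance of recent KL divergences, can be sketched as below. The abstract does not give the mapping from variance to speculation length, so the formula in `choose_speculation_length`, the window size, and the constants `min_len`, `max_len`, and `var_scale` are all hypothetical; the adaptive length cap is not shown.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions (q must be positive wherever p is)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def choose_speculation_length(kld_history, window=4, min_len=1, max_len=8, var_scale=0.05):
    """Map the variance of recent draft-vs-target KLD values to a speculation length.

    Low variance is read as a stable region, so more tokens are speculated;
    high variance shrinks the length toward min_len. The mapping is an assumption.
    """
    recent = kld_history[-window:]
    if len(recent) < 2:
        return min_len
    stability = 1.0 / (1.0 + variance(recent) / var_scale)
    return min_len + round((max_len - min_len) * stability)

stable_len = choose_speculation_length([0.1, 0.1, 0.1, 0.1])
unstable_len = choose_speculation_length([0.0, 1.0, 0.0, 1.0])
```

In a serving loop, `kld_history` would hold the divergence between draft and target distributions at previously verified positions, which is what makes the signal post-hoc.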

[AI-100] Reinforcement Learning Driven Generalizable Feature Representation for Cross-User Activity Recognition

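【Quick Read】: This paper addresses the poor generalization of wearable-sensor human activity recognition (HAR) models caused by cross-user variability (differences in motion patterns, sensor placements, and physiological traits). Conventional supervised learning tends to overfit user-specific patterns, while existing domain generalization methods either overlook temporal dependencies or depend on impractical domain labels. The key to the solution is Temporal-Preserving Reinforcement Learning Domain Generalization (TPRL-DG), which models feature extraction as a sequential decision-making process driven by reinforcement learning. A Transformer-based autoregressive generator produces temporal tokens that preserve temporal coherence, and a multi-objective reward function that needs no target-user labels balances class discrimination against cross-user invariance, yielding robust HAR without per-user calibration.

Link: https://arxiv.org/abs/2509.01031
Authors: Xiaozhou Ye, Kevin I-Kai Wang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: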

Abstract:Human Activity Recognition (HAR) using wearable sensors is crucial for healthcare, fitness tracking, and smart environments, yet cross-user variability – stemming from diverse motion patterns, sensor placements, and physiological traits – hampers generalization in real-world settings. Conventional supervised learning methods often overfit to user-specific patterns, leading to poor performance on unseen users. Existing domain generalization approaches, while promising, frequently overlook temporal dependencies or depend on impractical domain-specific labels. We propose Temporal-Preserving Reinforcement Learning Domain Generalization (TPRL-DG), a novel framework that redefines feature extraction as a sequential decision-making process driven by reinforcement learning. TPRL-DG leverages a Transformer-based autoregressive generator to produce temporal tokens that capture user-invariant activity dynamics, optimized via a multi-objective reward function balancing class discrimination and cross-user invariance. Key innovations include: (1) an RL-driven approach for domain generalization, (2) autoregressive tokenization to preserve temporal coherence, and (3) a label-free reward design eliminating the need for target user annotations. Evaluations on the DSADS and PAMAP2 datasets show that TPRL-DG surpasses state-of-the-art methods in cross-user generalization, achieving superior accuracy without per-user calibration. By learning robust, user-invariant temporal patterns, TPRL-DG enables scalable HAR systems, facilitating advancements in personalized healthcare, adaptive fitness tracking, and context-aware environments.

[AI-101] Symbolic Planning and Multi-Agent Path Finding in Extremely Dense Environments with Movable Obstacles

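【Quick Read】: This paper addresses the Block Rearrangement Problem (BRaP) in large warehouse management: rearranging storage blocks within a dense grid to reach a target state. The core challenges are the high-dimensional search space and the efficient handling of deeply buried blocks. The key to the solution is to formalize BRaP as a graph search problem and, building on intuitions from sliding puzzles, to design five search-based algorithms that draw on joint configuration space search, classical planning, multi-agent path finding, and expert heuristics; these remain efficient at creating rearrangement plans on grids of up to 80×80.

Link: https://arxiv.org/abs/2509.01022
Authors: Bo Fu, Zhe Chen, Rahul Chandan, Alex Barbosa, Michael Caldara, Joey Durham, Federico Pecora
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments: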

Abstract:We introduce the Block Rearrangement Problem (BRaP), a challenging component of large warehouse management which involves rearranging storage blocks within dense grids to achieve a target state. We formally define the BRaP as a graph search problem. Building on intuitions from sliding puzzle problems, we propose five search-based solution algorithms, leveraging joint configuration space search, classical planning, multi-agent pathfinding, and expert heuristics. We evaluate the five approaches empirically for plan quality and scalability. Despite the exponential relation between search space size and block number, our methods demonstrate efficiency in creating rearrangement plans for deeply buried blocks in up to 80x80 grids.
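
The graph-search formulation can be illustrated with the simplest of the listed approaches, an uninformed search over the joint configuration space. The grid encoding (a flat sequence with `.` for an empty cell), the move model (slide one block into an adjacent empty cell), and the function names are assumptions; this brute-force search would not scale to the grid sizes reported in the paper.

```python
from collections import deque

def neighbors(state, rows, cols):
    """Yield every state reachable by sliding one block into an adjacent empty cell."""
    for idx, cell in enumerate(state):
        if cell != ".":
            continue
        r, c = divmod(idx, cols)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                j = nr * cols + nc
                if state[j] != ".":
                    nxt = list(state)
                    nxt[idx], nxt[j] = nxt[j], nxt[idx]
                    yield tuple(nxt)

def min_moves(start, rows, cols, block, goal_idx):
    """Breadth-first search for the fewest moves that bring `block` to cell `goal_idx`."""
    start = tuple(start)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        state, dist = queue.popleft()
        if state[goal_idx] == block:
            return dist
        for nxt in neighbors(state, rows, cols):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # goal unreachable

# 2x2 grid, row-major: "AB" on top, "C." below; move block A to the bottom-right cell.
moves = min_moves("ABC.", rows=2, cols=2, block="A", goal_idx=3)
```

Even in this tiny instance, block A needs only two steps of its own but five moves in total, because the other blocks must be shuffled out of the way; that overhead is what makes dense grids hard.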

[AI-102] Quantum-like Coherence Derived from the Interaction between Chemical Reaction and Its Environment

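【Quick Read】: This paper addresses the limitations of conventional computing models in capturing the dynamic adaptability and self-organizing behavior of complex systems, in particular how to achieve active responses to environmental fluctuations and an intrinsic coordination mechanism. The key to the solution is the framework of "open computing": the chemical reaction is treated as the computation and the degree of molecular aggregation as the execution environment, so that the computational process and the execution environment, which are logically distinct, are physically merged into a self-adjusting dynamic system. Open computing is further divided into Token computing (the behavior of individual molecules) and Type computing (normative, population-level behavior), which act together: Token computing exhibits self-organized critical phenomena, Type computing exhibits quantum logic, and their interplay realizes the recruitment of fluctuations and produces quantum-like coherence across different Hilbert spaces. This ultimately drives the propagation of spike waves and enables signal transmission, suggesting a possible mechanism behind the enzymes that control spike waves and biochemical rhythms.

Link: https://arxiv.org/abs/2509.01021
Authors: Yukio-Pegio Gunji, Andrew Adamatzky, Panagiotis Mougkogiannis, Andrei Khrenikov
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Adaptation and Self-Organizing Systems (nlin.AO)
Comments: 36 pages, 13 figures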

Abstract:By uncovering the contrast between Artificial Intelligence and Natural-born Intelligence as a computational process, we define closed computing and open computing, and implement open computing within chemical reactions. This involves forming a mixture and invalidation of the computational process and the execution environment, which are logically distinct, and coalescing both to create a system that adjusts fluctuations. We model chemical reactions by considering the computation as the chemical reaction and the execution environment as the degree of aggregation of molecules that interact with the reactive environment. This results in a chemical reaction that progresses while repeatedly clustering and de-clustering, where concentration no longer holds significant meaning. Open computing is segmented into Token computing, which focuses on the individual behavior of chemical molecules, and Type computing, which focuses on normative behavior. Ultimately, both are constructed as an interplay between the two. In this system, Token computing demonstrates self-organizing critical phenomena, while Type computing exhibits quantum logic. Through their interplay, the recruitment of fluctuations is realized, giving rise to interactions between quantum logical subspaces corresponding to quantum coherence across different Hilbert spaces. As a result, spike waves are formed, enabling signal transmission. This occurrence may be termed quantum-like coherence, implying the source of enzymes responsible for controlling spike waves and biochemical rhythms.

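[AI-103] Supporting Our AI Overlords: Redesigning Data Systems to be Agent-First

【Quick Read】: This paper addresses the difficulty that current data systems have in efficiently supporting workloads from large language model (LLM) agents. As LLM agents are increasingly used to manipulate and analyze data, their characteristic high-throughput process of exploration and solution formulation, termed agentic speculation, is large in scale, heterogeneous, highly redundant, and steerable, which challenges existing data systems. The proposed direction is to use these four characteristics (scale, heterogeneity, redundancy, steerability) to design an agent-first data systems architecture, with research opportunities ranging from new query interfaces and query processing techniques to agentic memory stores, so that agentic workloads are supported natively.

Link: https://arxiv.org/abs/2509.00997
Authors: Shu Liu, Soujanya Ponnapalli, Shreya Shankar, Sepanta Zeighami, Alan Zhu, Shubham Agarwal, Ruiqi Chen, Samion Suwito, Shuo Yuan, Ion Stoica, Matei Zaharia, Alvin Cheung, Natacha Crooks, Joseph E. Gonzalez, Aditya G. Parameswaran
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: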

Abstract:Large Language Model (LLM) agents, acting on their users’ behalf to manipulate and analyze data, are likely to become the dominant workload for data systems in the future. When working with data, agents employ a high-throughput process of exploration and solution formulation for the given task, one we call agentic speculation. The sheer volume and inefficiencies of agentic speculation can pose challenges for present-day data systems. We argue that data systems need to adapt to more natively support agentic workloads. We take advantage of the characteristics of agentic speculation that we identify, i.e., scale, heterogeneity, redundancy, and steerability - to outline a number of new research opportunities for a new agent-first data systems architecture, ranging from new query interfaces, to new query processing techniques, to new agentic memory stores.

[AI-104] Online Decentralized Federated Multi-task Learning With Trustworthiness in Cyber-Physical Systems

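【Quick Read】: This paper addresses how to achieve both model personalization and Byzantine resilience in online decentralized federated multi-task learning, particularly in the extreme case where Byzantine clients outnumber honest clients. The core challenge is that conventional Byzantine-resilient methods work only when fewer than half of the clients are Byzantine, whereas in practice the number of Byzantine clients is hard to bound. The key to the solution is to exploit cyber-physical properties of the system, such as received signal strength in wireless communication or side information, to assign a trust probability in each iteration to the local models received from neighbors. This allows anomalous behavior to be identified and down-weighted, so that performance stays close to that of a Byzantine-free setting even when Byzantine clients dominate.

Link: https://arxiv.org/abs/2509.00992
Authors: Olusola Odeyomi, Sofiat Olaosebikan, Ajibuwa Opeyemi, Oluwadoyinsola Ige
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: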

Abstract:Multi-task learning is an effective way to address the challenge of model personalization caused by high data heterogeneity in federated learning. However, extending multi-task learning to the online decentralized federated learning setting is yet to be explored. The online decentralized federated learning setting considers many real-world applications of federated learning, such as autonomous systems, where clients communicate peer-to-peer and the data distribution of each client is time-varying. A more serious problem in real-world applications of federated learning is the presence of Byzantine clients. Byzantine-resilient approaches used in federated learning work only when the number of Byzantine clients is less than one-half the total number of clients. Yet, it is difficult to put a limit on the number of Byzantine clients within a system in reality. However, recent work in robotics shows that it is possible to exploit cyber-physical properties of a system to predict clients’ behavior and assign a trust probability to received signals. This can help to achieve resiliency in the presence of a dominating number of Byzantine clients. Therefore, in this paper, we develop an online decentralized federated multi-task learning algorithm to provide model personalization and resiliency when the number of Byzantine clients dominates the number of honest clients. Our proposed algorithm leverages cyber-physical properties, such as the received signal strength in wireless systems or side information, to assign a trust probability to local models received from neighbors in each iteration. Our simulation results show that the proposed algorithm performs close to a Byzantine-free setting.
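
One way to picture the role of trust probabilities is a trust-weighted average of the models received from neighbors. The sketch below is an assumption, not the paper's update rule: the trust values are taken as given (the paper derives them from cyber-physical properties such as received signal strength), the client's own model gets weight 1, and the multi-task and online aspects are omitted.

```python
def trust_weighted_average(own_model, neighbor_models, trust):
    """Combine a client's model with its neighbors' models, weighting each by its trust probability.

    own_model: list of floats; neighbor_models: list of such lists;
    trust: one probability in [0, 1] per neighbor. The own model always has weight 1.
    """
    weights = [1.0] + list(trust)
    models = [own_model] + list(neighbor_models)
    total = sum(weights)
    return [
        sum(w * m[i] for w, m in zip(weights, models)) / total
        for i in range(len(own_model))
    ]

own = [1.0, 1.0]
neighbors = [[1.0, 1.0], [100.0, -100.0]]  # the second neighbor sends a corrupted model

robust = trust_weighted_average(own, neighbors, trust=[1.0, 0.0])
naive = trust_weighted_average(own, neighbors, trust=[1.0, 1.0])
```

Unlike majority-based rules, this weighting does not depend on honest clients being in the majority; its quality depends entirely on how well the trust values separate honest from Byzantine neighbors.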

[AI-105] Causal MAS: A Survey of Large Language Model Architectures for Discovery and Effect Estimation

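【Quick Read】: This paper addresses the shortcomings of large language models (LLMs) in complex causal reasoning, discovery, and estimation, which show up as hallucination, reliance on spurious correlations, and difficulty handling nuanced, domain-specific, or personalized causal relationships. The key to the solution is multi-agent systems: through the collaborative or specialized abilities of multiple LLM-based agents, they form architectures for causal reasoning, counterfactual analysis, causal discovery, and causal effect estimation. The main advantage lies in the interaction mechanisms between agents (pipeline-based processing, debate frameworks, simulation environments, and iterative refinement loops), which are intended to improve the accuracy and robustness of causal modeling and support applications in scientific discovery, healthcare, fact-checking, and personalized systems.

Link: https://arxiv.org/abs/2509.00987
Authors: Adib Bazgir, Amir Habibdoust, Yuwen Zhang, Xing Song
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 24 pages. 2 figures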

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in various reasoning and generation tasks. However, their proficiency in complex causal reasoning, discovery, and estimation remains an area of active development, often hindered by issues like hallucination, reliance on spurious correlations, and difficulties in handling nuanced, domain-specific, or personalized causal relationships. Multi-agent systems, leveraging the collaborative or specialized abilities of multiple LLM-based agents, are emerging as a powerful paradigm to address these limitations. This review paper explores the burgeoning field of causal multi-agent LLMs. We examine how these systems are designed to tackle different facets of causality, including causal reasoning and counterfactual analysis, causal discovery from data, and the estimation of causal effects. We delve into the diverse architectural patterns and interaction protocols employed, from pipeline-based processing and debate frameworks to simulation environments and iterative refinement loops. Furthermore, we discuss the evaluation methodologies, benchmarks, and diverse application domains where causal multi-agent LLMs are making an impact, including scientific discovery, healthcare, fact-checking, and personalized systems. Finally, we highlight the persistent challenges, open research questions, and promising future directions in this synergistic field, aiming to provide a comprehensive overview of its current state and potential trajectory.

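[AI-106] Clone What You Can't Steal: Black-Box LLM Replication via Logit Leakage and Distillation

【Quick Read】: This paper addresses the security risk of large language models (LLMs) deployed through APIs without strong access controls: when an API exposes full or top-k logits, an attacker can use them to reconstruct a functional copy of the model. The key contribution is a constrained replication pipeline with two stages. The first stage reconstructs the output projection matrix from fewer than 10,000 black-box queries using singular value decomposition (SVD). The second stage distills the remaining architecture into compact student models of varying depth, trained on an open-source dataset. In the experiments, a 6-layer student reproduces 97.6% of the teacher model's hidden-state geometry with only a 7.31% increase in perplexity and a negative log-likelihood (NLL) of 7.58, while a 4-layer variant is 17.1% faster at inference with 18.1% fewer parameters. The entire attack completes in under 24 GPU hours without triggering API rate-limit defenses, which highlights how fragile current API designs are.

Link: https://arxiv.org/abs/2509.00973
Authors: Kanchon Gharami, Hansaka Aluvihare, Shafika Showkat Moni, Berker Peköz
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 8 pages. Accepted for publication in the proceedings of 7th IEEE International Conference on Trust, Privacy and Security in Intelligent Systems, and Applications (IEEE TPS 2025)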

Abstract:Large Language Models (LLMs) are increasingly deployed in mission-critical systems, facilitating tasks such as satellite operations, command-and-control, military decision support, and cyber defense. Many of these systems are accessed through application programming interfaces (APIs). When such APIs lack robust access controls, they can expose full or top-k logits, creating a significant and often overlooked attack surface. Prior art has mainly focused on reconstructing the output projection layer or distilling surface-level behaviors. However, regenerating a black-box model under tight query constraints remains underexplored. We address that gap by introducing a constrained replication pipeline that transforms partial logit leakage into a functional deployable substitute model clone. Our two-stage approach (i) reconstructs the output projection matrix by collecting top-k logits from under 10k black-box queries via singular value decomposition (SVD) over the logits, then (ii) distills the remaining architecture into compact student models with varying transformer depths, trained on an open source dataset. A 6-layer student recreates 97.6% of the 6-layer teacher model’s hidden-state geometry, with only a 7.31% perplexity increase, and a 7.58 Negative Log-Likelihood (NLL). A 4-layer variant achieves 17.1% faster inference and 18.1% parameter reduction with comparable performance. The entire attack completes in under 24 graphics processing unit (GPU) hours and avoids triggering API rate-limit defenses. These results demonstrate how quickly a cost-limited adversary can clone an LLM, underscoring the urgent need for hardened inference APIs and secure on-premise defense deployments.

[AI-107] CoreThink: A Symbolic Reasoning Layer to reason over Long Horizon Tasks with LLMs

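[Quick read]: This paper addresses the saturation of reasoning gains in large language models (LLMs): the authors argue that existing approaches such as test-time scaling, Supervised Fine-Tuning (SFT), and Reinforcement Learning with Verifiable Rewards (RLVR) will eventually hit diminishing returns. The key to the solution is a new reasoning paradigm, General Symbolics, on which the CoreThink reasoning layer is built, with the General Symbolic Reasoner (GSR) as its core component. The method is structured around three use cases (tool-calling, code generation, and planning) and, without any fine-tuning or training cost, reports improved results on several benchmarks, for example a SOTA score of 66.66% on Livecodebench v6 and 62.3% accuracy on SWE-Bench Lite, with the stated goal of a pure performance uplift that does not degrade the base model's reasoning accuracy.

Link: https://arxiv.org/abs/2509.00971
Authors: Jay Vaghasiya, Omkar Ghugarkar, Vishvesh Bhat, Vipul Dholaria, Julian McAuley
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: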

Abstract:We introduce CoreThink, a state-of-the-art Reasoning Layer built upon a novel reasoning method called General Symbolics. This approach diverges from reasoning paradigms such as test-time scaling, Supervised Fine-Tuning (SFT), and Reinforcement Learning with Verifiable Rewards (RLVR). CoreThink General Symbolic Reasoner (GSR) is specifically structured around three key use cases: tool-calling, code generation, and planning, demonstrating exemplary performance across a total of seven benchmarks in their respective areas. Notably, we are achieving SOTA scores of 66.66% on Livecodebench v6, 89% on Instruction-Following Evals, and 24.4% on ARC-AGI-2. We also present an agentic coding IDE, developed using the principles of General Symbolics, which achieves a state-of-the-art accuracy of 62.3% on SWE-Bench Lite. We are able to achieve these improvements without any finetuning or training costs. Our Reasoning Layer is designed to provide a pure performance uplift, ensuring that a model’s accuracy on reasoning tasks is never negatively impacted. We argue that incumbent methods will eventually lead to diminishing returns in LLM performance, necessitating the development of new reasoning techniques. This technical report details our approach at a high level and the availability of the CoreThink models for reasoning-intensive use cases.

[AI-108] Who Gets Left Behind? Auditing Disability Inclusivity in Large Language Models

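[Quick read]: This paper addresses the fact that current large language models (LLMs) underserve some disability groups in accessibility guidance, i.e., the advice they give has marked inclusivity gaps. The key to the solution is a human-validated, general-purpose, taxonomy-aligned benchmark of accessibility questions, together with a three-dimensional evaluation of inclusivity: Question-Level Coverage (breadth within answers), Disability-Level Coverage (balance across nine disability categories), and Depth (specificity of support). Evaluating 17 open-weight and proprietary models with this framework shows that Vision, Hearing, and Mobility are frequently addressed, whereas Speech, Genetic/Developmental, Sensory-Cognitive, and Mental Health are clearly neglected, and depth of support is likewise concentrated in a few categories. The paper identifies taxonomy-aware prompting and training, plus evaluations that jointly audit breadth, balance, and depth, as the main levers for improving inclusivity.

Link: https://arxiv.org/abs/2509.00963
Authors: Deepika Dash, Yeshil Bangera, Mithil Bangera, Gouthami Vadithya, Srikant Panda
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: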

Abstract:Large Language Models (LLMs) are increasingly used for accessibility guidance, yet many disability groups remain underserved by their advice. To address this gap, we present a taxonomy-aligned benchmark of human-validated, general-purpose accessibility questions, designed to systematically audit inclusivity across disabilities. Our benchmark evaluates models along three dimensions: Question-Level Coverage (breadth within answers), Disability-Level Coverage (balance across nine disability categories), and Depth (specificity of support). Applying this framework to 17 proprietary and open-weight models reveals persistent inclusivity gaps: Vision, Hearing, and Mobility are frequently addressed, while Speech, Genetic/Developmental, Sensory-Cognitive, and Mental Health remain underserved. Depth is similarly concentrated in a few categories but sparse elsewhere. These findings reveal who gets left behind in current LLM accessibility guidance and highlight actionable levers: taxonomy-aware prompting/training and evaluations that jointly audit breadth, balance, and depth.

[AI-109] Ultra Strong Machine Learning: Teaching Humans Active Learning Strategies via Automated AI Explanations

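[Quick read]: This paper addresses inefficient knowledge transfer in Ultra Strong Machine Learning (USML) systems, i.e., how a learning system can not only improve itself but also quantifiably improve human performance. Earlier USML approaches relied on hand-crafted explanation templates, which scale and generalize poorly. The paper proposes LENS (Logic Programming Explanation via Neural Summarisation), whose core idea is to combine symbolic program synthesis with large language models (LLMs) so that natural-language explanations of logic programs are generated automatically, replacing hand-crafted templates. Evaluation shows that LENS explanations are rated above direct LLM prompting and hand-crafted templates, but the human learning experiment found no significant improvement in human performance: for simpler tasks, overly comprehensive LLM responses may overwhelm users rather than support learning, which suggests that future USML systems need better human-facing interaction strategies.

Link: https://arxiv.org/abs/2509.00961
Authors: Lun Ai, Johannes Langer, Ute Schmid, Stephen Muggleton
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: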

Abstract:Ultra Strong Machine Learning (USML) refers to symbolic learning systems that not only improve their own performance but can also teach their acquired knowledge to quantifiably improve human performance. In this work, we present LENS (Logic Programming Explanation via Neural Summarisation), a neuro-symbolic method that combines symbolic program synthesis with large language models (LLMs) to automate the explanation of machine-learned logic programs in natural language. LENS addresses a key limitation of prior USML approaches by replacing hand-crafted explanation templates with scalable automated generation. Through systematic evaluation using multiple LLM judges and human validation, we demonstrate that LENS generates superior explanations compared to direct LLM prompting and hand-crafted templates. To investigate whether LENS can teach transferable active learning strategies, we carried out a human learning experiment across three related domains. Our results show no significant human performance improvements, suggesting that comprehensive LLM responses may overwhelm users for simpler problems rather than providing learning support. Our work provides a solid foundation for building effective USML systems to support human learning. The source code is available on: this https URL.

[AI-110] A Hybrid Ai Framework For Strategic Patent Portfolio Pruning: Integrating Learning To-Rank And Market Need Analysis For Technology Transfer Optimization

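[Quick read]: This paper addresses the reliance of patent portfolio evaluation on retrospective indicators or manual, time-intensive analysis, which makes it hard to identify high-value assets for technology transfer efficiently. The key to the solution is a multi-stage hybrid intelligence framework that combines a Learning to Rank (LTR) model with a "Need-Seed" agent system: the "Need Agent" uses Natural Language Processing (NLP) to mine explicit technological needs from unstructured market and industry data, while the "Seed Agent" uses fine-tuned Large Language Models (LLMs) to analyze patent claims and map their technological capabilities. The system then builds a "Core Ontology Framework" that matches high-potential patents (Seeds) to documented market demands (Needs), and relies on a dynamic parameter weighting mechanism and a Human-in-the-Loop (HITL) validation protocol for adaptability and real-world credibility.

Link: https://arxiv.org/abs/2509.00958
Authors: Manish Verma, Vivek Sharma, Vishal Singh
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Page 2, Figure 1 shows the conceptual architecture, and Page 11, Figure 2 outlines its end to end workflow for strategic patent portfolio pruning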

Abstract:This paper introduces a novel, multi-stage hybrid intelligence framework for pruning patent portfolios to identify high-value assets for technology transfer. Current patent valuation methods often rely on retrospective indicators or manual, time-intensive analysis. Our framework automates and deepens this process by combining a Learning to Rank (LTR) model, which evaluates patents against over 30 legal and commercial parameters, with a unique “Need-Seed” agent-based system. The “Need Agent” uses Natural Language Processing (NLP) to mine unstructured market and industry data, identifying explicit technological needs. Concurrently, the “Seed Agent” employs fine-tuned Large Language Models (LLMs) to analyze patent claims and map their technological capabilities. The system generates a “Core Ontology Framework” that matches high-potential patents (Seeds) to documented market demands (Needs), providing a strategic rationale for divestment decisions. We detail the architecture, including a dynamic parameter weighting system and a crucial Human-in-the-Loop (HITL) validation protocol, to ensure both adaptability and real-world credibility.

[AI-111] ART: Adaptive Resampling-based Training for Imbalanced Classification KDD’26

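[Quick read]: This paper addresses the limitation of traditional resampling methods for class imbalance (fixed-ratio undersampling or oversampling), which ignore how class-wise learning difficulty changes during training and can therefore limit overall model performance. The key to the solution is Adaptive Resampling-based Training (ART), which periodically adjusts the training data distribution according to the model's macro F1 score on each class, adapting at the class level rather than the instance level. This lets the model gradually shift attention toward underperforming classes in a way that aligns better with the optimization objective. Across binary and multi-class tasks, ART is reported to outperform SMOTE, NearMiss, and cost-sensitive learning, improving macro F1 by an average of 2.64 percentage points on tabular datasets, with statistically significant gains.

Link: https://arxiv.org/abs/2509.00955
Authors: Arjun Basandrai, Shourya Jain, K. Ilanthenral
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Submitted to SIGKDD’26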

Abstract:Traditional resampling methods for handling class imbalance typically use fixed distributions, undersampling the majority or oversampling the minority. These static strategies ignore changes in class-wise learning difficulty, which can limit the overall performance of the model. This paper proposes an Adaptive Resampling-based Training (ART) method that periodically updates the distribution of the training data based on the class-wise performance of the model. Specifically, ART uses class-wise macro F1 scores, computed at fixed intervals, to determine the degree of resampling to be performed. Unlike instance-level difficulty modeling, which is noisy and outlier-sensitive, ART adapts at the class level. This allows the model to incrementally shift its attention towards underperforming classes in a way that better aligns with the optimization objective. Results on diverse benchmarks, including the Pima Indians Diabetes and Yeast datasets, demonstrate that ART consistently outperforms both resampling-based and algorithm-level methods, including Synthetic Minority Oversampling Technique (SMOTE), NearMiss Undersampling, and Cost-sensitive Learning on binary as well as multi-class classification tasks with varying degrees of imbalance. In most settings, these improvements are statistically significant. On tabular datasets, gains are significant under paired t-tests and Wilcoxon tests (p < 0.05), while results on text and image tasks remain favorable. Compared to training on the original imbalanced data, ART improves macro F1 by an average of 2.64 percentage points across all tested tabular datasets. Unlike existing methods, whose performance varies by task, ART consistently delivers the strongest macro F1, making it a reliable choice for imbalanced classification.
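A class-level resampling step of the kind described above might look like the following sketch. The abstract does not give ART's actual update rule, so the weighting used here (sampling probability proportional to `1 - F1`, with a floor) is an assumption chosen for illustration, and the helper names are hypothetical.

```python
import numpy as np

def per_class_f1(y_true, y_pred, num_classes):
    """F1 score of each class, treated one-vs-rest."""
    f1 = np.zeros(num_classes)
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1[c] = 2 * tp / denom if denom > 0 else 0.0
    return f1

def class_sampling_probs(f1, floor=0.05):
    """Assumed rule: classes with lower F1 get proportionally more samples."""
    weights = np.maximum(1.0 - f1, floor)
    return weights / weights.sum()

def resample_indices(y_true, probs, n, rng):
    """Draw n training indices: pick a class by probs, then a random member of it."""
    classes = rng.choice(len(probs), size=n, p=probs)
    return np.array([rng.choice(np.flatnonzero(y_true == c)) for c in classes])

# Toy validation snapshot: class 0 is the majority, class 2 is never predicted.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0])
f1 = per_class_f1(y_true, y_pred, num_classes=3)
probs = class_sampling_probs(f1)
idx = resample_indices(y_true, probs, n=30, rng=np.random.default_rng(0))
```

In a training loop, a step like this would be repeated at fixed intervals so the sampling distribution follows the current per-class performance.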

[AI-112] UrbanInsight: A Distributed Edge Computing Framework with LLM-Powered Data Filtering for Smart City Digital Twins

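[Quick read]: This paper addresses the systemic challenges created by the massive, heterogeneous data streams that cities generate (sensors, cameras, and connected infrastructure): limited processing scale, high latency, and fragmented insights. The key to the solution is a framework that combines physics-informed machine learning, multimodal data fusion, and knowledge graph representation with adaptive, rule-based intelligence driven by large language models (LLMs). Physical constraints keep predictions consistent with real-world dynamics; knowledge graphs integrate heterogeneous data into a semantic, queryable structure; and LLMs at the edge generate context-aware rules that adapt filtering and decision-making in real time, so the system runs efficiently under constrained resources and lets digital twins move from passive monitoring to actionable insights.

Link: https://arxiv.org/abs/2509.00936
Authors: Kishor Datta Gupta, Md Manjurul Ahsan, Mohd Ariful Haque, Roy George, Azmine Toushik Wasi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: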

Abstract:Cities today generate enormous streams of data from sensors, cameras, and connected infrastructure. While this information offers unprecedented opportunities to improve urban life, most existing systems struggle with scale, latency, and fragmented insights. This work introduces a framework that blends physics-informed machine learning, multimodal data fusion, and knowledge graph representation with adaptive, rule-based intelligence powered by large language models (LLMs). Physics-informed methods ground learning in real-world constraints, ensuring predictions remain meaningful and consistent with physical dynamics. Knowledge graphs act as the semantic backbone, integrating heterogeneous sensor data into a connected, queryable structure. At the edge, LLMs generate context-aware rules that adapt filtering and decision-making in real time, enabling efficient operation even under constrained resources. Together, these elements form a foundation for digital twin systems that go beyond passive monitoring to provide actionable insights. By uniting physics-based reasoning, semantic data fusion, and adaptive rule generation, this approach opens new possibilities for creating responsive, trustworthy, and sustainable smart infrastructures.

[AI-113] SCOUT: Toward Sub-Quadratic Attention via Segment Compression for Optimized Utility in Transformers

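[Quick read]: This paper addresses the scalability bottleneck that the quadratic complexity of self-attention imposes on Transformers for long sequences. Existing linear models such as Mamba and Sliding-Window Attention (SWA) achieve efficient inference through local or recurrent operations, but their fixed-size memory struggles to retain detailed information from distant tokens, which hurts long-sequence performance. The key to the solution is SCOUT (Segment Compression for Optimized Utility in Transformers), which compresses the input within fixed-size segments into a small number of "checkpoint" tokens and applies attention only over these compressed representations: a linear local mixer (Mamba or SWA) first enriches each token with recent context, and each token then sparsely attends to a few compressed checkpoint tokens that summarize the history. This retains much of the expressivity of full attention while substantially reducing compute and memory cost, keeping memory growth sub-quadratic while modeling long sequences better than purely linear models.

Link: https://arxiv.org/abs/2509.00935
Authors: Aref Jafari, Yuhe Fan, Benyamin Jamialahmadi, Parsa Farinneya, Boxing Chen, Marzieh S. Tahaei
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: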

Abstract:Transformers have demonstrated strong performance across a wide range of sequence modeling tasks, but their quadratic attention complexity limits scalability to long sequences. Linear models such as Mamba and sliding-window attention (SWA) address this by mixing tokens through recurrent or localized operations with fixed-size memory, achieving efficient inference. However, these methods risk degrading performance on long sequences due to their inability to retain detailed information from distant tokens. We propose SCOUT (Segment Compression for Optimized Utility in Transformers), a hybrid architecture that compresses tokens locally within fixed-size segments and applies attention only over these compressed representations. Each token embedding is first enriched via a linear local mixer, Mamba or SWA, that integrates recent context. Then, instead of attending to all previous tokens, each token sparsely attends to a small number of compressed checkpoint tokens that summarize the input history. This design retains much of the expressivity of full attention while substantially reducing the computational and memory cost. By attending to compressed history rather than all previous tokens, SCOUT incurs slightly higher memory than purely linear models, but its growth rate remains sub-quadratic and far more scalable than that of full Transformers. We analyze SCOUT’s computational and memory efficiency and evaluate it empirically on long-context language modeling and reasoning tasks. SCOUT with both Mamba and SWA mixers outperforms strong long-sequence baselines under the same computational budget, matches full-attention Transformers on language modeling and common-sense reasoning tasks at 400M and 1.3B scales. Moreover, our SCOUT achieves higher end-to-end throughput than SOTA models, while delivering comparable results on long sequence benchmarks.
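The two-step idea (local mixing, then attention over per-segment checkpoints) can be sketched in a few lines. This is a rough illustration, not the paper's architecture: a causal window mean stands in for the Mamba/SWA mixer, the last mixed token of each segment is assumed to be its checkpoint, there are no learned projections, and the function names are hypothetical.

```python
import numpy as np

def causal_window_mean(x, window):
    """Stand-in local mixer: average each token with its recent predecessors."""
    out = np.empty_like(x)
    for t in range(len(x)):
        out[t] = x[max(0, t - window + 1): t + 1].mean(axis=0)
    return out

def scout_like_layer(x, segment, window):
    """Mix locally, then let each token attend only to already-completed checkpoints."""
    h = causal_window_mean(x, window)
    ckpt_pos = np.arange(segment - 1, len(x), segment)   # assumed: last token of each segment
    ckpt = h[ckpt_pos]
    scale = np.sqrt(x.shape[1])
    out = h.copy()
    for t in range(len(x)):
        visible = ckpt_pos <= t                          # causal mask over checkpoints
        if not visible.any():
            continue                                     # no history summary yet
        scores = ckpt[visible] @ h[t] / scale
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[t] = h[t] + weights @ ckpt[visible]
    return out, len(ckpt_pos)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))
out, n_ckpt = scout_like_layer(x, segment=8, window=4)

# Causality check: perturbing later tokens must not change earlier outputs.
x_future = x.copy()
x_future[20:] += 1.0
out_future, _ = scout_like_layer(x_future, segment=8, window=4)
```

Each token scores at most `len(x) / segment` checkpoints instead of every earlier token, which is where the reduced attention cost comes from in this simplified picture.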

[AI-114] SATQuest: A Verifier for Logical Reasoning Evaluation and Reinforcement Fine-Tuning of LLMs

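[Quick read]: This paper addresses the lack of controllable and scalable tools for evaluating and improving the logical reasoning of Large Language Models (LLMs). Existing benchmarks and datasets rarely support multi-dimensional, fine-grained systematic analysis and tend to cover narrow problem types and formats, which limits both understanding of model reasoning and targeted optimization. The authors propose SATQuest, which generates diverse, structured logical reasoning problems from Satisfiability (SAT) problems, building the problem space from Conjunctive Normal Form (CNF) instances and controlling three orthogonal dimensions: instance scale, problem type, and question format. Combined with randomized generation and objective verification via PySAT, the design mitigates memorization, supports fine-grained performance analysis, and provides rewards for reinforcement fine-tuning, which the paper reports substantially improves targeted task performance and generalizes to more complex instances.

Link: https://arxiv.org/abs/2509.00930
Authors: Yanxiao Zhao, Yaqian Li, Zihao Bo, Rinyoichi Takezoe, Haojia Hui, Mo Guang, Lei Ren, Xiaolin Qin, Kaiwen Long
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: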

Abstract:Recent advances in Large Language Models (LLMs) have demonstrated remarkable general reasoning capabilities. However, systematically evaluating and enhancing these reasoning capabilities is challenging due to the lack of controllable and scalable tools for fine-grained analysis. Existing benchmarks and datasets often lack the necessary variable control for multi-dimensional, systematic analysis and training, or have narrow problem types and formats. To address these limitations, we introduce SATQuest, a systematic verifier designed to evaluate and enhance logical reasoning in LLMs by generating diverse, Satisfiability-based logical reasoning problems directly from Conjunctive Normal Form (CNF) instances. SATQuest structures these problems along three orthogonal dimensions: instance scale, problem type, and question format, employing randomized, SAT-based problem generation and objective answer verification via PySAT. This design mitigates memorization issues, allows for nuanced insights into reasoning performance, and enables effective reinforcement fine-tuning. Our extensive evaluation of various LLMs using SATQuest identified significant limitations in their logical reasoning, particularly in generalizing beyond familiar mathematical formats. Furthermore, we show that reinforcement fine-tuning with SATQuest rewards substantially improves targeted task performance and generalizes to more complex instances, while highlighting remaining challenges in cross-format adaptation. Through these demonstrations, we showcase SATQuest’s potential as a foundational tool and a valuable starting point for advancing LLM logical reasoning.
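To make the "generate a CNF instance, then verify an answer objectively" loop concrete, here is a small standard-library sketch. The paper verifies answers with PySAT; the brute-force checker below is a stand-in chosen so the example has no dependencies, and the function names and the signed-integer clause encoding are illustrative assumptions.

```python
import itertools
import random

def random_cnf(num_vars, num_clauses, k, seed):
    """Random k-CNF: each clause is a list of signed variable indices (negative = negated)."""
    rng = random.Random(seed)
    cnf = []
    for _ in range(num_clauses):
        variables = rng.sample(range(1, num_vars + 1), k)
        cnf.append([v if rng.random() < 0.5 else -v for v in variables])
    return cnf

def satisfies(cnf, assignment):
    """Check a candidate answer: every clause needs at least one true literal."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in cnf
    )

def brute_force_model(cnf, num_vars):
    """Return a satisfying assignment, or None if the formula is unsatisfiable."""
    for bits in itertools.product([False, True], repeat=num_vars):
        assignment = dict(enumerate(bits, start=1))
        if satisfies(cnf, assignment):
            return assignment
    return None

# (x1 or not x2) and (x2 or x3) and (not x1 or not x3)
example = [[1, -2], [2, 3], [-1, -3]]
model = brute_force_model(example, num_vars=3)
unsat = brute_force_model([[1], [-1]], num_vars=1)
generated = random_cnf(num_vars=6, num_clauses=10, k=3, seed=0)
```

A verifier of this shape can double as a reward function: a model's proposed assignment earns reward only if `satisfies` returns `True`. Brute force is exponential in the number of variables, so a real solver is needed beyond toy sizes.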

[AI-115] Superposition in Graph Neural Networks

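[Quick read]: This paper addresses the difficulty of interpreting Graph Neural Networks (GNNs), where message passing mixes signals and internal channels rarely align with human concepts. The key to the solution is to study superposition (several features sharing the same direction) directly in the latent space of GNNs, extracting two kinds of features in controlled experiments: (i) class-conditional centroids at the graph level and (ii) linear-probe directions at the node level, and then analyzing their geometry with basis-invariant diagnostics. The study finds that design choices such as model width, pooling strategy, and final-layer activation strongly shape representational geometry, which gives both a conceptual basis and practical guidance for more interpretable GNNs.

Link: https://arxiv.org/abs/2509.00928
Authors: Lukas Pertl, Han Xuanyuan, Pietro Liò
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: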

Abstract:Interpreting graph neural networks (GNNs) is difficult because message passing mixes signals and internal channels rarely align with human concepts. We study superposition, the sharing of directions by multiple features, directly in the latent space of GNNs. Using controlled experiments with unambiguous graph concepts, we extract features as (i) class-conditional centroids at the graph level and (ii) linear-probe directions at the node level, and then analyze their geometry with simple basis-invariant diagnostics. Across GCN/GIN/GAT we find: increasing width produces a phase pattern in overlap; topology imprints overlap onto node-level features that pooling partially remixes into task-aligned graph axes; sharper pooling increases axis alignment and reduces channel sharing; and shallow models can settle into metastable low-rank embeddings. These results connect representational geometry with concrete design choices (width, pooling, and final-layer activations) and suggest practical approaches for more interpretable GNNs.
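One way to picture the graph-level analysis is to compute class-conditional centroids from embeddings and measure how much their directions overlap. The sketch below uses pairwise cosine similarity as the overlap measure, which is an assumption (the abstract only says "simple basis-invariant diagnostics"); the synthetic embeddings and function names are likewise hypothetical.

```python
import numpy as np

def class_centroids(embeddings, labels):
    """Mean embedding of each class, stacked in sorted class order."""
    classes = np.unique(labels)
    return np.stack([embeddings[labels == c].mean(axis=0) for c in classes])

def cosine_overlap(features):
    """Pairwise cosine similarity between feature directions."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    return unit @ unit.T

# Three classes squeezed into a 2-D latent space: their directions must overlap.
rng = np.random.default_rng(0)
means = np.array([[1.0, 0.0], [-0.5, 0.866], [-0.5, -0.866]])
labels = np.repeat([0, 1, 2], 20)
emb = means[labels] + 0.01 * rng.normal(size=(60, 2))
overlap = cosine_overlap(class_centroids(emb, labels))

# The diagnostic should not depend on the choice of basis, so an orthogonal
# change of coordinates must leave it unchanged.
q, _ = np.linalg.qr(rng.normal(size=(2, 2)))
overlap_rotated = cosine_overlap(class_centroids(emb @ q, labels))
```

With more features than dimensions, the off-diagonal entries cannot all be zero, which is the basic signature of superposition that this kind of diagnostic picks up.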

[AI-116] Robust Deep Monte Carlo Counterfactual Regret Minimization: Addressing Theoretical Risks in Neural Fictitious Self-Play

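[Quick read]: This paper addresses the performance degradation of neural Monte Carlo Counterfactual Regret Minimization (neural MCCFR) across games of different scale, caused by theoretical risks such as nonstationary target distribution shifts, action support collapse, variance explosion, and warm-starting bias. The key to the solution is the Robust Deep MCCFR framework, which includes target networks with delayed updates, uniform exploration mixing, variance-aware training objectives, and comprehensive diagnostic monitoring. Multi-scale experiments show that component effectiveness depends on game complexity, so components should be deployed selectively rather than all at once; doing so improves final exploitability on both the smaller Kuhn Poker and the more complex Leduc Poker.

Link: https://arxiv.org/abs/2509.00923
Authors: Zakaria El Jaafari
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (stat.ML)
Comments: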

Abstract:Monte Carlo Counterfactual Regret Minimization (MCCFR) has emerged as a cornerstone algorithm for solving extensive-form games, but its integration with deep neural networks introduces scale-dependent challenges that manifest differently across game complexities. This paper presents a comprehensive analysis of how neural MCCFR component effectiveness varies with game scale and proposes an adaptive framework for selective component deployment. We identify that theoretical risks such as nonstationary target distribution shifts, action support collapse, variance explosion, and warm-starting bias have scale-dependent manifestation patterns, requiring different mitigation strategies for small versus large games. Our proposed Robust Deep MCCFR framework incorporates target networks with delayed updates, uniform exploration mixing, variance-aware training objectives, and comprehensive diagnostic monitoring. Through systematic ablation studies on Kuhn and Leduc Poker, we demonstrate scale-dependent component effectiveness and identify critical component interactions. The best configuration achieves final exploitability of 0.0628 on Kuhn Poker, representing a 60% improvement over the classical framework (0.156). On the more complex Leduc Poker domain, selective component usage achieves exploitability of 0.2386, a 23.5% improvement over the classical framework (0.3703) and highlighting the importance of careful component selection over comprehensive mitigation. Our contributions include: (1) a formal theoretical analysis of risks in neural MCCFR, (2) a principled mitigation framework with convergence guarantees, (3) comprehensive multi-scale experimental validation revealing scale-dependent component interactions, and (4) practical guidelines for deployment in larger games.
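Two of the ingredients named above are easy to show in isolation: the regret-matching policy that CFR-style methods use, and mixing that policy with a uniform distribution so no action loses all sampling probability. This is a generic sketch of those standard building blocks, not the paper's implementation; the `epsilon` value and the function names are illustrative.

```python
import numpy as np

def regret_matching(cum_regret):
    """Play actions in proportion to positive cumulative regret (uniform if none is positive)."""
    positive = np.maximum(cum_regret, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.full(len(cum_regret), 1.0 / len(cum_regret))

def mix_with_uniform(policy, epsilon):
    """Blend the policy with a uniform one so every action keeps nonzero support."""
    return (1.0 - epsilon) * policy + epsilon / len(policy)

policy = regret_matching(np.array([2.0, -1.0, 0.0]))   # collapses onto the first action
explored = mix_with_uniform(policy, epsilon=0.1)        # restores support everywhere
fallback = regret_matching(np.array([-1.0, -2.0]))      # no positive regret: uniform
```

Keeping every action's probability above `epsilon / n` is a simple guard against the action support collapse mentioned in the abstract, at the cost of sampling from a slightly perturbed policy.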

[AI-117] TinyMusician: On-Device Music Generation with Knowledge Distillation and Mixed Precision Quantization

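[Quick read]: This paper addresses two obstacles to deploying generative music models on edge devices such as smartphones and wearables: large parameter counts that demand heavy computation, and long inference times, which make state-of-the-art Transformer-based music generators such as MusicGen impractical on resource-constrained devices. The key to the solution is TinyMusician, a lightweight music generation model obtained from MusicGen through knowledge distillation, with two innovations: (i) Stage-mixed Bidirectional and Skewed KL-Divergence for the distillation objective, and (ii) Adaptive Mixed-Precision Quantization to cut storage and compute while preserving audio fidelity. Experiments show that TinyMusician retains 93% of MusicGen-Small's performance with a 55% smaller model, and the authors describe it as the first mobile-deployable music generation model that removes cloud dependency.

Link: https://arxiv.org/abs/2509.00914
Authors: Hainan Wang, Mehdi Hosseinzadeh, Reza Rawassizadeh
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 12 pages for main context, 5 figures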

Abstract:The success of generative models has drawn unprecedented attention to the music generation area. Transformer-based architectures have set new benchmarks for model performance. However, their practical adoption is hindered by some critical challenges: the demand for massive computational resources and inference time, due to their large number of parameters. These obstacles make them infeasible to deploy on edge devices, such as smartphones and wearables, with limited computational resources. In this work, we present TinyMusician, a lightweight music generation model distilled from MusicGen (a state-of-the-art music generation model). TinyMusician integrates two innovations: (i) Stage-mixed Bidirectional and Skewed KL-Divergence and (ii) Adaptive Mixed-Precision Quantization. The experimental results demonstrate that TinyMusician retains 93% of the MusicGen-Small performance with 55% less model size. TinyMusician is the first mobile-deployable music generation model that eliminates cloud dependency while maintaining high audio fidelity and efficient resource usage.
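A bidirectional skewed KL objective for distillation can be sketched as follows. The abstract does not define the loss, so this uses the common skew form `KL(p || alpha*p + (1 - alpha)*q)` applied in both directions with a fixed mixing weight; that choice, the `alpha` and `mix` values, the function names, and the absence of any stage-dependent schedule are all assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-12):
    """KL(p || q) over the last axis, with a small epsilon for stability."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def skew_kl(p, q, alpha):
    """Skewed KL: compare p against a blend of p and q, which bounds the log ratio."""
    return kl(p, alpha * p + (1.0 - alpha) * q)

def bidirectional_skew_kl(teacher_logits, student_logits, alpha=0.1, mix=0.5):
    """Assumed combination: weighted sum of teacher-to-student and student-to-teacher terms."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return mix * skew_kl(p, q, alpha) + (1.0 - mix) * skew_kl(q, p, alpha)

teacher = np.array([[2.0, 0.5, -1.0]])
student = np.array([[0.1, 0.2, 0.3]])
loss_same = bidirectional_skew_kl(teacher, teacher)[0]
loss_diff = bidirectional_skew_kl(teacher, student)[0]
```

Blending the target into the reference distribution keeps the loss finite even when the student assigns near-zero probability to a token the teacher favors, which is the usual motivation for skewing.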

[AI-118] An Explainable Gaussian Process Auto-encoder for Tabular Data

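[Quick read]: This paper addresses the explainability of black-box models, specifically by generating counterfactual samples as explanations. Current methods mostly rely on autoencoder architectures, which have many parameters and overfit easily. The key to the solution is a new autoencoder architecture built with a Gaussian process, which needs far fewer learnable parameters and is therefore less prone to overfitting. The paper also introduces a new density estimator for searching for in-distribution counterfactual samples, and an algorithm that selects the optimal regularization rate on the density estimator during the search, improving the quality and diversity of the generated counterfactuals.

Link: https://arxiv.org/abs/2509.00884
Authors: Wei Zhang, Brian Barr, John Paisley
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: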

Abstract:Explainable machine learning has attracted much interest in the community where the stakes are high. Counterfactual explanation methods have become an important tool in explaining a black-box model. Recent advances have leveraged the power of generative models such as autoencoders. In this paper, we propose a novel method using a Gaussian process to construct the auto-encoder architecture for generating counterfactual samples. The resulting model requires fewer learnable parameters and thus is less prone to overfitting. We also introduce a novel density estimator that allows for searching for in-distribution samples. Furthermore, we introduce an algorithm for selecting the optimal regularization rate on the density estimator while searching for counterfactuals. We experiment with our method on several large-scale tabular datasets and compare with other auto-encoder-based methods. The results show that our method is capable of generating diversified and in-distribution counterfactual samples.

[AI-119] Accelerating Latency-Critical Applications with AI-Powered Semi-Automatic Fine-Grained Parallelization on SMT Processors

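[Quick read]: This paper addresses the low utilization of functional units that latency-critical applications show on high-performance superscalar processors because of frequent cache misses and mispredictions during speculative execution, as well as the fact that Simultaneous Multithreading (SMT) is rarely used for such applications because it significantly hurts single-thread performance. The key to the solution is Aira, an AI-powered parallelization adviser. It extends the AI Coding Agent in Cursor IDE with several tools connected through the Model Context Protocol, forming an end-to-end workflow from hotspot detection and dynamic dependency collection to SMT-aware performance simulation, and it uses the Relic parallel framework for fine-grained task parallelism on SMT cores. On latency-critical benchmarks representing industrial applications, this yields a 17% geomean performance gain.

Link: https://arxiv.org/abs/2509.00883
Authors: Denis Los, Igor Petushkov
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: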

Abstract:Latency-critical applications tend to show low utilization of functional units due to frequent cache misses and mispredictions during speculative execution in high-performance superscalar processors. However, due to significant impact on single-thread performance, Simultaneous Multithreading (SMT) technology is rarely used with heavy threads of latency-critical applications. In this paper, we explore utilization of SMT technology to support fine-grained parallelization of latency-critical applications. Following the advancements in the development of Large Language Models (LLMs), we introduce Aira, an AI-powered Parallelization Adviser. To implement Aira, we extend AI Coding Agent in Cursor IDE with additional tools connected through Model Context Protocol, enabling end-to-end AI Agent for parallelization. Additional connected tools enable LLM-guided hotspot detection, collection of dynamic dependencies with Dynamic Binary Instrumentation, SMT-aware performance simulation to estimate performance gains. We apply Aira with Relic parallel framework for fine-grained task parallelism on SMT cores to parallelize latency-critical benchmarks representing real-world applications used in industry. We show 17% geomean performance gain from parallelization of latency-critical benchmarks using Aira with Relic framework.

[AI-120] Speech Command Recognition Using LogNNet Reservoir Computing for Embedded Systems

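[Quick read]: This paper addresses efficient, reliable speech command recognition on resource-constrained devices such as battery-powered IoT nodes. The key to the solution is a lightweight end-to-end pipeline consisting of energy-based Voice Activity Detection (VAD), an optimized Mel-Frequency Cepstral Coefficients (MFCC) feature extraction stage, and a LogNNet reservoir-computing classifier. MFCC aggregation with adaptive binning (a 64-dimensional feature vector) gives the best trade-off between accuracy and compactness, and the LogNNet model (architecture 64:33:9:4) reaches 92.04% speaker-independent accuracy with very few parameters. On an Arduino Nano 33 IoT it runs in 18 KB of RAM with about 90% real-time recognition accuracy, meeting the tight compute and memory limits of embedded devices.

Link: https://arxiv.org/abs/2509.00862
Authors: Yuriy Izotov, Andrei Velichko
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 20 pages, 6 figures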

Abstract:This paper presents a low-resource speech-command recognizer combining energy-based voice activity detection (VAD), an optimized Mel-Frequency Cepstral Coefficients (MFCC) pipeline, and the LogNNet reservoir-computing classifier. Using four commands from the Speech Commands dataset downsampled to 8 kHz, we evaluate four MFCC aggregation schemes and find that adaptive binning (64-dimensional feature vector) offers the best accuracy-to-compactness trade-off. The LogNNet classifier with architecture 64:33:9:4 reaches 92.04% accuracy under speaker-independent evaluation, while requiring significantly fewer parameters than conventional deep learning models. Hardware implementation on Arduino Nano 33 IoT (ARM Cortex-M0+, 48 MHz, 32 KB RAM) validates the practical feasibility, achieving ~90% real-time recognition accuracy while consuming only 18 KB RAM (55% utilization). The complete pipeline (VAD → MFCC → LogNNet) thus enables reliable on-device speech-command recognition under strict memory and compute limits, making it suitable for battery-powered IoT nodes, wireless sensor networks, and hands-free control interfaces.
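The first stage of the pipeline, energy-based VAD, can be sketched as a frame-energy threshold. The frame length, hop, and threshold ratio below are hypothetical values picked for an 8 kHz signal (25 ms frames, 10 ms hop), not the paper's settings, and the function names are illustrative.

```python
import numpy as np

def frame_energy(signal, frame_len, hop):
    """Mean squared amplitude of each frame."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.array([
        np.mean(signal[i * hop: i * hop + frame_len] ** 2)
        for i in range(n_frames)
    ])

def energy_vad(signal, frame_len=200, hop=80, ratio=0.1):
    """Mark a frame as speech when its energy exceeds a fraction of the peak frame energy."""
    energy = frame_energy(signal, frame_len, hop)
    return energy > ratio * energy.max()

# One second of silence at 8 kHz with a tone burst standing in for a spoken command.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
signal = np.zeros(sample_rate)
signal[3000:5000] = np.sin(2 * np.pi * 440 * t[3000:5000])
active = energy_vad(signal)
```

Only the frames flagged as active would be passed on to MFCC extraction, which keeps the later stages from spending memory and compute on silence.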

[AI-121] Why it is worth making an effort with GenAI

【速读】:该论文试图解决的问题是:学生过度依赖生成式 AI(Generative AI)工具完成作业(如写作)可能导致学习写作和批判性思维能力的退化,进而削弱教育效果。解决方案的关键在于设计新的 AI 工具与使用机制,通过刻意增加用户使用时的认知努力来逆转这一趋势,例如要求学生结合传统学习方法(如阅读时做笔记)与 GenAI 工具协同使用,从而促进元认知(metacognition)和反思能力的发展,实现“在适当时候增加努力以深化学习”的努力悖论(effort paradox)策略。

Link: https://arxiv.org/abs/2509.00852
Authors: Yvonne Rogers
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 6 pages


Abstract:Students routinely use ChatGPT and the like now to help them with their homework, such as writing an essay. It takes less effort to complete and is easier to do than by hand. It can even produce as good if not better output than the student’s own work. However, there is a growing concern that over-reliance on using GenAI in this way will stifle the development of learning writing and critical thinking skills. How might this trend be reversed? What if students were required to make more effort when using GenAI to do their homework? It might be more challenging, but the additional effort involved could result in them learning more and having a greater sense of achievement. This tension can be viewed as a form of effort paradox; where effort is both viewed as something to be avoided but at the same time is valued. Is it possible to let students learn sometimes with less and other times more effort? Students are already adept at the former but what about the latter? Could we design new kinds of AI tools that deliberately require more effort to use to deepen the learning experience? In this paper, I begin to outline what form these might take, for example, asking students to use a combination of GenAI tools with traditional learning approaches (e.g. note-taking while reading). I also discuss how else to design tools to think with that augments human cognition; where students learn more the skills of metacognition and reflection.

[AI-122] Causal SHAP: Feature Attribution with Dependency Awareness through Causal Discovery IJCNN

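[Quick read]: The paper targets a weakness of conventional feature-attribution methods such as SHapley Additive exPlanations (SHAP): when features are highly correlated, they cannot tell causation from correlation and therefore misjudge feature importance. The key to the solution is the Causal SHAP framework, which combines a causal discovery algorithm (the Peter-Clark, PC, algorithm) with a method for quantifying causal strength (Intervention Calculus when the DAG is Absent, IDA) so that causal structure informs the attribution. Causal SHAP keeps many of SHAP's desirable properties while lowering the attribution scores of features that are merely correlated with the target and have no causal effect on it, which makes the resulting explanations more causally accurate.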

Link: https://arxiv.org/abs/2509.00846
Authors: Woon Yee Ng, Li Rong Wang, Siyuan Liu, Xiuyi Fan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
Comments: Published in 2025 International Joint Conference on Neural Networks (IJCNN). IEEE, 2025


Abstract:Explaining machine learning (ML) predictions has become crucial as ML models are increasingly deployed in high-stakes domains such as healthcare. While SHapley Additive exPlanations (SHAP) is widely used for model interpretability, it fails to differentiate between causality and correlation, often misattributing feature importance when features are highly correlated. We propose Causal SHAP, a novel framework that integrates causal relationships into feature attribution while preserving many desirable properties of SHAP. By combining the Peter-Clark (PC) algorithm for causal discovery and the Intervention Calculus when the DAG is Absent (IDA) algorithm for causal strength quantification, our approach addresses the weakness of SHAP. Specifically, Causal SHAP reduces attribution scores for features that are merely correlated with the target, as validated through experiments on both synthetic and real-world datasets. This study contributes to the field of Explainable AI (XAI) by providing a practical framework for causal-aware model explanations. Our approach is particularly valuable in domains such as healthcare, where understanding true causal relationships is critical for informed decision-making.
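
To make the misattribution problem concrete, here is a minimal sketch of exact Shapley values on a two-feature toy game. It is plain Shapley attribution, not Causal SHAP (no PC or IDA step), and the feature names and value function are hypothetical: `proxy` is assumed to be a perfect copy of `cause`, so observing either one fully determines the model output.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for player in order:
            grown = coalition | {player}
            totals[player] += value(grown) - value(coalition)
            coalition = grown
    return {p: total / len(orders) for p, total in totals.items()}


def explained(coalition):
    """Toy value function: knowing either feature explains the whole output.

    This mimics a conditional-expectation value function when `proxy`
    duplicates `cause` (an assumption made for illustration).
    """
    return 1.0 if coalition else 0.0


phi = shapley_values(["cause", "proxy"], explained)
```

Both features receive 0.5, even though only `cause` is assumed to drive the output. A causally informed attribution of the kind the abstract describes would aim to shrink the score of `proxy`.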

[AI-123] Adaptive Vehicle Speed Classification via BMCNN with Reinforcement Learning-Enhanced Acoustic Processing

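[Quick read]: The paper addresses real-time, accurate acoustic vehicle speed classification for managing urban traffic congestion, with an emphasis on efficient and robust sensing in varied urban environments. The key to the solution is a hybrid framework that combines deep learning with reinforcement learning. A dual-branch BMCNN first processes Mel-frequency cepstral coefficient (MFCC) and wavelet features in parallel to capture complementary frequency patterns. An attention-enhanced deep Q-network (DQN) then adaptively selects the minimal number of audio frames and triggers an early decision once a confidence threshold is reached, which cuts processing time. The method reaches 95.99% accuracy on IDMT-Traffic and 92.3% on SZUR-Acoustic, with up to 1.63x faster average processing through early termination, giving an accuracy-efficiency trade-off suited to real-time intelligent transportation system (ITS) deployment in heterogeneous urban environments.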

Link: https://arxiv.org/abs/2509.00839
Authors: Yuli Zhang, Pengfei Fan, Ruiyuan Jiang, Hankang Gu, Dongyao Jia, Xinheng Wang
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:


Abstract:Traffic congestion remains a pressing urban challenge, requiring intelligent transportation systems for real-time management. We present a hybrid framework that combines deep learning and reinforcement learning for acoustic vehicle speed classification. A dual-branch BMCNN processes MFCC and wavelet features to capture complementary frequency patterns. An attention-enhanced DQN adaptively selects the minimal number of audio frames and triggers early decisions once confidence thresholds are reached. Evaluations on IDMT-Traffic and our SZUR-Acoustic (Suzhou) datasets show 95.99% and 92.3% accuracy, with up to 1.63x faster average processing via early termination. Compared with A3C, DDDQN, SA2C, PPO, and TD3, the method provides a superior accuracy-efficiency trade-off and is suitable for real-time ITS deployment in heterogeneous urban environments.
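
The early-termination idea can be illustrated with a fixed rule in place of the learned policy. The sketch below is not the paper's attention-enhanced DQN: it simply averages per-frame class probabilities and stops once the leading class passes a confidence threshold. The function name, threshold, and probability stream are all hypothetical.

```python
def classify_with_early_exit(frame_probs, threshold=0.9):
    """Average per-frame class probabilities; stop once the top class is confident.

    frame_probs must be a non-empty list of probability vectors.
    Returns (predicted_class, frames_used).
    """
    num_classes = len(frame_probs[0])
    running = [0.0] * num_classes
    best, used = 0, 0
    for used, probs in enumerate(frame_probs, start=1):
        running = [r + p for r, p in zip(running, probs)]
        mean = [r / used for r in running]
        best = max(range(num_classes), key=mean.__getitem__)
        if mean[best] >= threshold:
            break  # confident enough: skip the remaining frames
    return best, used


stream = [
    [0.50, 0.30, 0.20],
    [0.90, 0.05, 0.05],
    [0.97, 0.02, 0.01],
    [0.99, 0.005, 0.005],
    [0.10, 0.80, 0.10],  # never inspected in this run
]
label, frames_used = classify_with_early_exit(stream, threshold=0.8)
```

Here the decision is made after four of the five frames. A learned policy, as in the paper, would replace the fixed threshold rule with a decision that adapts to the input.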

[AI-124] Neuro-Symbolic Predictive Process Monitoring

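[Quick read]: The paper addresses suffix prediction in Business Process Management (BPM): given a partial execution log, predict the sequence of activities likely to follow. Deep-learning approaches improve prediction accuracy, but because they do not integrate domain knowledge explicitly, their outputs often violate basic logical constraints such as temporal rules. The key to the solution is a Neuro-Symbolic Predictive Process Monitoring (PPM) method that brings Linear Temporal Logic over finite traces (LTLf) into model training through a differentiable logic loss. The loss is built from a soft approximation of LTLf semantics and the Gumbel-Softmax trick, and it can be optimized jointly with standard predictive losses so that the generated suffixes are both accurate and compliant with the given temporal constraints.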

Link: https://arxiv.org/abs/2509.00834
Authors: Axel Mezini, Elena Umili, Ivan Donadello, Fabrizio Maria Maggi, Matteo Mancanelli, Fabio Patrizi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments:


Abstract:This paper addresses the problem of suffix prediction in Business Process Management (BPM) by proposing a Neuro-Symbolic Predictive Process Monitoring (PPM) approach that integrates data-driven learning with temporal logic-based prior knowledge. While recent approaches leverage deep learning models for suffix prediction, they often fail to satisfy even basic logical constraints due to the absence of explicit integration of domain knowledge during training. We propose a novel method to incorporate Linear Temporal Logic over finite traces (LTLf) into the training process of autoregressive sequence predictors. Our approach introduces a differentiable logical loss function, defined using a soft approximation of LTLf semantics and the Gumbel-Softmax trick, which can be combined with standard predictive losses. This ensures the model learns to generate suffixes that are both accurate and logically consistent. Experimental evaluation on three real-world datasets shows that our method improves suffix prediction accuracy and compliance with temporal constraints. We also introduce two variants of the logic loss (local and global) and demonstrate their effectiveness under noisy and realistic settings. While developed in the context of BPM, our framework is applicable to any symbolic sequence generation task and contributes toward advancing Neuro-Symbolic AI.
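
The abstract does not spell out its soft approximation of LTLf semantics, so the sketch below uses one common choice as an assumption: fuzzy semantics in which `max` stands in for "or"/"eventually" and `min` for "and"/"always", applied to per-step activity probabilities. The Gumbel-Softmax step is omitted, and the activity names and trace are hypothetical.

```python
def eventually(trace, activity):
    """Soft F(activity): the best probability of `activity` at any step."""
    return max(step[activity] for step in trace)


def always_not(trace, activity):
    """Soft G(not activity): worst-case probability of avoiding `activity`."""
    return min(1.0 - step[activity] for step in trace)


def response(trace, trigger, target):
    """Soft G(trigger -> F target), with max/min replacing or/and."""
    scores = []
    for t, step in enumerate(trace):
        later = max(s[target] for s in trace[t:])
        scores.append(max(1.0 - step[trigger], later))
    return min(scores)


def logic_loss(satisfaction):
    """Penalty that vanishes when the constraint is fully satisfied."""
    return 1.0 - satisfaction


# Each step is a predicted distribution over two hypothetical activities.
trace = [
    {"pay": 0.9, "ship": 0.1},
    {"pay": 0.3, "ship": 0.7},
    {"pay": 0.2, "ship": 0.8},
]
loss = logic_loss(response(trace, "pay", "ship"))
```

In a training loop, a term like `loss` would be added to the usual prediction loss, using a tensor library so that gradients flow through `max` and `min`.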

[AI-125] AImoclips: A Benchmark for Evaluating Emotion Conveyance in Text-to-Music Generation

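[Quick read]: The paper addresses the limited emotional accuracy of text-to-music (TTM) generation systems: current models allow controllable, expressive music generation, yet a clear gap remains in mapping the emotion intended by a natural-language prompt onto the generated music. The key to the solution is the AImoclips benchmark. It covers 12 emotion intents spanning the valence-arousal space, uses six state-of-the-art TTM models to generate more than 1,000 music clips, and has 111 participants rate the perceived valence and arousal of each clip on a 9-point Likert scale, so that differences in emotional fidelity between models can be quantified. The benchmark reveals systematic differences in emotional bias between commercial and open-source models and shows that emotions are conveyed more accurately under high-arousal conditions, offering insights and a measurable evaluation standard for building TTM systems with better emotional alignment.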

Link: https://arxiv.org/abs/2509.00813
Authors: Gyehun Go, Satbyul Han, Ahyeon Choi, Eunjin Choi, Juhan Nam, Jeong Mi Park
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: to be published in HCMIR25: 3rd Workshop on Human-Centric Music Information Research


Abstract:Recent advances in text-to-music (TTM) generation have enabled controllable and expressive music creation using natural language prompts. However, the emotional fidelity of TTM systems remains largely underexplored compared to human preference or text alignment. In this study, we introduce AImoclips, a benchmark for evaluating how well TTM systems convey intended emotions to human listeners, covering both open-source and commercial models. We selected 12 emotion intents spanning four quadrants of the valence-arousal space, and used six state-of-the-art TTM systems to generate over 1,000 music clips. A total of 111 participants rated the perceived valence and arousal of each clip on a 9-point Likert scale. Our results show that commercial systems tend to produce music perceived as more pleasant than intended, while open-source systems tend to perform the opposite. Emotions are more accurately conveyed under high-arousal conditions across all models. Additionally, all systems exhibit a bias toward emotional neutrality, highlighting a key limitation in affective controllability. This benchmark offers valuable insights into model-specific emotion rendering characteristics and supports future development of emotionally aligned TTM systems.

[AI-126] ProCause: Generating Counterfactual Outcomes to Evaluate Prescriptive Process Monitoring Methods

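[Quick read]: The paper addresses the lack of ground-truth outcomes for intervention actions when evaluating Prescriptive Process Monitoring (PresPM) methods. It focuses on two limitations of RealCause, the generative causal-inference approach commonly used for this purpose: RealCause ignores the temporal dependencies in process data, and it relies on a single causal-inference architecture (TARNet), which makes evaluations less reliable. The key to the solution is ProCause, a generative deep-learning approach that supports both sequential models (such as LSTMs) and non-sequential models and integrates several causal-inference architectures (S-Learner, T-Learner, TARNet, and an ensemble). Experiments in a simulator with known ground truth show that the ensemble is more consistently reliable and that LSTMs can improve evaluations when temporal dependencies are present; an analysis of real-world data further supports the method's practical effectiveness.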

Link: https://arxiv.org/abs/2509.00797
Authors: Jakob De Moor, Hans Weytjens, Johannes De Smedt
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
Comments:


Abstract:Prescriptive Process Monitoring (PresPM) is the subfield of Process Mining that focuses on optimizing processes through real-time interventions based on event log data. Evaluating PresPM methods is challenging due to the lack of ground-truth outcomes for all intervention actions in datasets. A generative deep learning approach from the field of Causal Inference (CI), RealCause, has been commonly used to estimate the outcomes for proposed intervention actions to evaluate a new policy. However, RealCause overlooks the temporal dependencies in process data, and relies on a single CI model architecture, TARNet, limiting its effectiveness. To address both shortcomings, we introduce ProCause, a generative approach that supports both sequential (e.g., LSTMs) and non-sequential models while integrating multiple CI architectures (S-Learner, T-Learner, TARNet, and an ensemble). Our research using a simulator with known ground truths reveals that TARNet is not always the best choice; instead, an ensemble of models offers more consistent reliability, and leveraging LSTMs shows potential for improved evaluations when temporal dependencies are present. We further validate ProCause’s practical effectiveness through a real-world data analysis, ensuring a more reliable evaluation of PresPM methods.
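
Of the architectures the abstract lists, the T-Learner is the simplest to show in a few lines: fit one outcome model per treatment arm, then compare their predictions. The sketch below is an illustration only. It swaps the paper's neural models for one-dimensional least-squares lines, and the data, function names, and single covariate are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def t_learner(xs, treatments, ys):
    """Fit a separate outcome model per arm; return a counterfactual predictor."""
    models = {}
    for arm in (0, 1):
        arm_x = [x for x, t in zip(xs, treatments) if t == arm]
        arm_y = [y for y, t in zip(ys, treatments) if t == arm]
        models[arm] = fit_line(arm_x, arm_y)

    def predict(x, arm):
        intercept, slope = models[arm]
        return intercept + slope * x

    return predict


# Noise-free toy log: control outcome = 1 + 2x, treated outcome = 4 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 0.5, 1.5, 2.5, 3.5]
treatments = [0, 0, 0, 0, 1, 1, 1, 1]
ys = [(4 if t else 1) + 2 * x for x, t in zip(xs, treatments)]

predict = t_learner(xs, treatments, ys)
effect_at_2 = predict(2.0, 1) - predict(2.0, 0)
```

An S-Learner would instead fit a single model that takes the treatment as an extra input feature. The abstract's finding is that an ensemble over such architectures is more consistently reliable than any single one.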

[AI-127] Sharpe Ratio Optimization in Markov Decision Processes

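[Quick read]: The paper studies Sharpe ratio optimization in infinite-horizon Markov decision processes (MDPs). Two difficulties stand in the way: dynamic programming does not apply to fractional objectives, and it cannot handle risk metrics directly. For the first difficulty, the authors use the Dinkelbach transform to convert the Sharpe ratio objective into a mean-squared-variance (M2V) objective and show that M2V optimization shares its optimal policy with the original problem when the risk-sensitive parameter equals the optimal Sharpe ratio. For the second, they propose an iterative algorithm that repeatedly solves the M2V problem and uses the Sharpe ratio obtained in the current iteration to update the risk-sensitive parameter, producing a monotonically increasing sequence that converges to the optimal Sharpe ratio. The method covers both the long-run average and discounted settings, with a policy iteration procedure proven to converge, and the authors describe it as, to the best of their knowledge, the first to solve Sharpe ratio optimization in MDPs with dynamic-programming-type algorithms.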

Link: https://arxiv.org/abs/2509.00793
Authors: Shuai Ma, Guangwu Liu, Li Xia
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:


Abstract:Sharpe ratio (also known as reward-to-variability ratio) is a widely-used metric in finance, which measures the additional return per unit of increased risk (standard deviation of return). However, the optimization of Sharpe ratio in Markov decision processes (MDPs) is challenging, because there exist two difficulties hindering the application of dynamic programming. One is that dynamic programming does not work for fractional objectives, and the other is that dynamic programming is invalid for risk metrics. In this paper, we study the Sharpe ratio optimization in infinite-horizon MDPs, considering both the long-run average and discounted settings. We address the first challenge with Dinkelbach's transform, which converts the Sharpe ratio objective to a mean-squared-variance (M2V) objective. It is shown that the M2V optimization and the original Sharpe ratio optimization share the same optimal policy when the risk-sensitive parameter is equal to the optimal Sharpe ratio. For the second challenge, we develop an iterative algorithm to solve the M2V optimization which is similar to a mean-variance optimization in MDPs. We iteratively solve the M2V problem and obtain the associated Sharpe ratio that is used to update the risk-sensitive parameter in the next iteration of M2V problems. We show that such a sequence of Sharpe ratios derived is monotonically increasing and converges to the optimal Sharpe ratio. For both average and discounted MDP settings, we develop a policy iteration procedure and prove its convergence to the optimum. Numerical experiments are conducted for validation. To the best of our knowledge, our approach is the first that solves the Sharpe ratio optimization in MDPs with dynamic programming type algorithms. We believe that the proposed algorithm can shed light on solving MDPs with other fractional objectives.
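
The update loop described in the abstract (solve a parametric problem, read off the achieved ratio, feed it back as the parameter) is the general pattern of Dinkelbach's method. The sketch below shows that pattern on a deliberately simplified problem: maximizing mean/std over a finite list of candidates. It is not the paper's M2V objective or its policy iteration over MDPs, and the candidate list is hypothetical.

```python
def dinkelbach(candidates, tol=1e-12, max_iter=100):
    """Maximize mean / std over a finite set with Dinkelbach's method.

    candidates: non-empty list of (mean, std) pairs with std > 0.
    Returns (best_ratio, best_candidate, ratio_history).
    """
    # Start from the ratio of an arbitrary candidate so the parametric
    # optimum is never negative and the ratios never decrease.
    lam = candidates[0][0] / candidates[0][1]
    best = candidates[0]
    history = []
    for _ in range(max_iter):
        # Parametric subproblem: maximize mean - lam * std.
        best = max(candidates, key=lambda c: c[0] - lam * c[1])
        gap = best[0] - lam * best[1]
        lam = best[0] / best[1]  # ratio achieved by the subproblem's solution
        history.append(lam)
        if gap <= tol:  # a zero gap means lam is already the optimal ratio
            break
    return lam, best, history


policies = [(1.0, 2.0), (0.5, 0.5), (2.0, 3.0), (0.4, 0.2)]
ratio, best_policy, history = dinkelbach(policies)
```

On this toy set the ratio climbs from about 0.67 to 2.0 and then stops, mirroring the monotone convergence the abstract claims for its sequence of Sharpe ratios.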

[AI-128] Low Power Approximate Multiplier Architecture for Deep Neural Networks

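[Quick read]: The paper addresses the difficulty of balancing energy consumption and computational accuracy in low-power hardware implementations of deep neural networks (DNNs). The key to the solution is an approximate multiplier architecture built on a 4:2 compressor that introduces only a single erroneous input combination. Integrating the compressor into an 8x8 unsigned multiplier substantially reduces the use of exact compressors while keeping error rates low. Evaluated inside a custom convolution layer, the design achieves up to 30.24% energy savings compared with the best existing multiplier, improves Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) in image denoising relative to other approximate designs, and maintains high classification accuracy in handwritten digit recognition, indicating a favorable balance between energy efficiency and computational precision.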

Link: https://arxiv.org/abs/2509.00764
Authors: Pragun Jaswal, L. Hemanth Krishna, B. Srinivasu
Affiliations: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments:


Abstract:This paper proposes a low-power approximate multiplier architecture for deep neural network (DNN) applications. A 4:2 compressor, introducing only a single combination error, is designed and integrated into an 8x8 unsigned multiplier. This integration significantly reduces the usage of exact compressors while preserving low error rates. The proposed multiplier is employed within a custom convolution layer and evaluated on neural network tasks, including image recognition and denoising. Hardware evaluation demonstrates that the proposed design achieves up to 30.24% energy savings compared to the best among existing multipliers. In image denoising, the custom approximate convolution layer achieves improved Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) compared to other approximate designs. Additionally, when applied to handwritten digit recognition, the model maintains high classification accuracy. These results demonstrate that the proposed architecture offers a favorable balance between energy efficiency and computational precision, making it suitable for low-power AI hardware implementations.
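
The abstract does not give the compressor's truth table. One plausible way to obtain exactly one erroneous input combination, assumed here purely for illustration, is a two-output compressor (no carry-in or carry-out) that saturates at 3, so only the input 1111 (true count 4) is encoded incorrectly. The behavioral model below enumerates all 16 inputs to confirm that property.

```python
from itertools import product


def approx_compressor(x1, x2, x3, x4):
    """Behavioral model of a two-output approximate 4:2 compressor.

    Returns (carry, sum) encoding min(x1 + x2 + x3 + x4, 3).
    The saturating behavior is an assumption, not the paper's circuit.
    """
    value = min(x1 + x2 + x3 + x4, 3)
    return value >> 1, value & 1


def decoded(carry, sum_bit):
    """Numeric value represented by the compressor outputs."""
    return 2 * carry + sum_bit


# Exhaustively compare against the exact count of set input bits.
errors = [
    bits
    for bits in product((0, 1), repeat=4)
    if decoded(*approx_compressor(*bits)) != sum(bits)
]
```

With this model, `errors` contains only `(1, 1, 1, 1)`, so 15 of the 16 input patterns are compressed exactly while the logic avoids the carry chain of an exact 4:2 compressor.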

[AI-129] Efficient Graph Understanding with LLMs via Structured Context Injection

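[Quick read]: The paper addresses the weak performance of Large Language Models (LLMs) on graph-related tasks, in particular tasks that LLMs struggle to reason about unless they are mapped to conceptually grounded representations. Graph problems have traditionally been solved with symbolic or algorithmic methods, and although LLMs can solve problems across domains, reasoning over graph structure remains challenging for them. The key to the solution is a structured context injection framework: task-specific information is embedded in the input prompt in a structured form, guiding the LLM to align the task implicitly with a grounded conceptual space and improving graph reasoning without fine-tuning. The method needs no additional training, is computationally cheap, and scales; according to the reported experiments, it can rival or surpass more complex approaches across multiple graph tasks.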

Link: https://arxiv.org/abs/2509.00740
Authors: Govind Waghmare, Sumedh BG, Sonia Gupta, Srikanta Bedathur
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:


Abstract:Large Language Models (LLMs) have shown strong capabilities in solving problems across domains, including graph-related tasks traditionally addressed by symbolic or algorithmic methods. In this work, we present a framework for structured context injection, where task-specific information is systematically embedded in the input to guide LLMs in solving a wide range of graph problems. Our method does not require fine-tuning of LLMs, making it cost-efficient and lightweight. We observe that certain graph reasoning tasks remain challenging for LLMs unless they are mapped to conceptually grounded representations. However, achieving such mappings through fine-tuning or repeated multi-step querying can be expensive and inefficient. Our approach offers a practical alternative by injecting structured context directly into the input, enabling the LLM to implicitly align the task with grounded conceptual spaces. We evaluate the approach on multiple graph tasks using both lightweight and large models, highlighting the trade-offs between accuracy and computational cost. The results demonstrate consistent performance improvements, showing that structured input context can rival or surpass more complex approaches. Our findings underscore the value of structured context injection as an effective and scalable strategy for graph understanding with LLMs.

[AI-130] Task-Aware Adaptive Modulation: A Replay-Free and Resource-Efficient Approach For Continual Graph Learning

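[Quick read]: The paper addresses two core challenges in Continual Graph Learning (CGL). The first is the stability-plasticity dilemma, that is, how to retain old knowledge while learning new tasks; existing replay-based methods tend to be unbalanced here and incur significant storage costs. The second is resource-heavy pre-training: leading replay-free methods depend heavily on extensively pre-trained backbones, which drives up compute and storage costs. The key to the solution is to abandon both data replay and full-network fine-tuning and instead modulate the internal computational flow of a frozen backbone dynamically. The proposed Task-Aware Adaptive Modulation (TAAM) is built around Neural Synapse Modulators (NSMs): lightweight, task-specific modules that are trained and then frozen, and that apply fine-grained, node-attentive modulation to the information flow inside a graph neural network (GNN). This yields efficient knowledge retention and task switching without storing historical data or retraining the backbone.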

Link: https://arxiv.org/abs/2509.00735
Authors: Jingtao Liu, Xinming Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:


Abstract:Continual Graph Learning (CGL) focuses on acquiring new knowledge while retaining previously learned information, essential for real-world graph applications. Current methods grapple with two main issues: 1) The Stability-Plasticity Dilemma: Replay-based methods often create an imbalance between stability and plasticity, while incurring significant storage costs. 2) The Resource-Heavy Pre-training: Leading replay-free methods critically depend on extensively pre-trained backbones, and this reliance imposes a substantial resource burden. In this paper, we argue that the key to overcoming these challenges lies not in replaying data or fine-tuning the entire network, but in dynamically modulating the internal computational flow of a frozen backbone. We posit that lightweight, task-specific modules can effectively steer a GNN’s reasoning process. Motivated by this insight, we propose Task-Aware Adaptive Modulation (TAAM), a replay-free, resource-efficient approach that charts a new path for navigating the stability-plasticity dilemma. TAAM’s core is its Neural Synapse Modulators (NSM), which are trained and then frozen for each task to store expert knowledge. A pivotal prototype-guided strategy governs these modulators: 1) For training, it initializes a new NSM by deep-copying from a similar past modulator to boost knowledge transfer. 2) For inference, it selects the most relevant frozen NSM for each task. These NSMs insert into a frozen GNN backbone to perform fine-grained, node-attentive modulation of its internal flow, which differs from the static perturbations of prior methods. Extensive experiments show that TAAM comprehensively outperforms state-of-the-art methods across six GCIL benchmark datasets. The code will be released upon acceptance of the paper.

[AI-131] OmniDPO: A Preference Optimization Framework to Address Omni-Modal Hallucination

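[Quick read]: The paper addresses hallucination in omni-modal large language models (OLLMs), particularly in joint audio-video understanding. Two causes are identified: priors from the text modality dominate, so the model over-relies on textual cues and neglects visual and audio information; and existing training methods do not model the intrinsic correlation between a video and its audio, which leads to errors when reasoning depends on hidden audio cues. The key to the solution is OmniDPO, a preference-alignment framework with two strategies: (1) constructing text-preference sample pairs to improve the model's understanding of audio-video interactions, and (2) constructing multimodal-preference sample pairs to strengthen its attention to visual and auditory information. By tackling both problems, OmniDPO improves multimodal grounding, reduces hallucination, and strengthens cross-modal reasoning.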

Link: https://arxiv.org/abs/2509.00723
Authors: Junzhe Chen, Tianshu Zhang, Shiyu Huang, Yuwei Niu, Chao Sun, Rongzhou Zhang, Guanyu Zhou, Lijie Wen, Xuming Hu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments:


Abstract:Recently, Omni-modal large language models (OLLMs) have sparked a new wave of research, achieving impressive results in tasks such as audio-video understanding and real-time environment perception. However, hallucination issues still persist. Similar to the bimodal setting, the priors from the text modality tend to dominate, leading OLLMs to rely more heavily on textual cues while neglecting visual and audio information. In addition, fully multimodal scenarios introduce new challenges. Most existing models align visual or auditory modalities with text independently during training, while ignoring the intrinsic correlations between video and its corresponding audio. This oversight results in hallucinations when reasoning requires interpreting hidden audio cues embedded in video content. To address these challenges, we propose OmniDPO, a preference-alignment framework designed to mitigate hallucinations in OLLMs. Specifically, OmniDPO incorporates two strategies: (1) constructing text-preference sample pairs to enhance the model’s understanding of audio-video interactions; and (2) constructing multimodal-preference sample pairs to strengthen the model’s attention to visual and auditory information. By tackling both challenges, OmniDPO effectively improves multimodal grounding and reduces hallucination. Experiments conducted on two OLLMs demonstrate that OmniDPO not only effectively mitigates multimodal hallucinations but also significantly enhances the models’ reasoning capabilities across modalities. All code and datasets will be released upon paper acceptance.
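
For readers unfamiliar with preference optimization, the sketch below computes the standard Direct Preference Optimization (DPO) loss for a single preference pair. Whether OmniDPO uses exactly this form for its text-preference and multimodal-preference pairs is an assumption; the abstract only describes it as a preference-alignment framework. Inputs are sequence log-probabilities under the trained policy and a frozen reference model, and the numbers are made up.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of log-probabilities.

    loss = -log(sigmoid(beta * margin)), where the margin is the policy's
    log-ratio advantage for the chosen answer relative to the reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) = log(1 + exp(-z)), evaluated in a numerically stable way.
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# A policy identical to the reference expresses no preference: loss = ln 2.
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# A policy that raises the chosen answer and lowers the rejected one does better.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0, beta=0.1)
```

In an omni-modal setting the rejected response would be one that ignores the audio or visual evidence, so lowering this loss pushes the model toward answers grounded in those modalities.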

[AI-132] Exam Readiness Index (ERI): A Theoretical Framework for a Composite Explainable Index

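[Quick read]: The paper addresses the difficulty of quantifying, in an interpretable way, how ready a learner is for a high-stakes exam. It proposes a theoretical framework, the Exam Readiness Index (ERI), that summarizes a learner's readiness in a single score. The core of the solution is an interpretable, actionable composite score R ∈ [0,100] built from six signals: Mastery (M), Coverage (C), Retention (R), Pace (P), Volatility (V), and Endurance (E). A formal system of axioms yields monotonicity, Lipschitz stability, and bounded drift under blueprint re-weighting. The paper also proves existence and uniqueness of the optimal linear composite under convex design constraints, characterizes confidence bands via blueprint-weighted concentration, and shows compatibility with prerequisite-admissible curricula (knowledge spaces / learning spaces).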

Link: https://arxiv.org/abs/2509.00718
Authors: Ananda Prakash Verma
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:


Abstract:We present a theoretical framework for an Exam Readiness Index (ERI): a composite, blueprint-aware score R in [0,100] that summarizes a learner’s readiness for a high-stakes exam while remaining interpretable and actionable. The ERI aggregates six signals – Mastery (M), Coverage (C), Retention (R), Pace (P), Volatility (V), and Endurance (E) – each derived from a stream of practice and mock-test interactions. We formalize axioms for component maps and the composite, prove monotonicity, Lipschitz stability, and bounded drift under blueprint re-weighting, and show existence and uniqueness of the optimal linear composite under convex design constraints. We further characterize confidence bands via blueprint-weighted concentration and prove compatibility with prerequisite-admissible curricula (knowledge spaces / learning spaces). The paper focuses on theory; empirical study is left to future work.
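
A linear composite of this kind can be sketched in a few lines. The version below rests on assumptions the abstract does not confirm: every component has already been mapped to [0, 1] with higher meaning more ready (so Volatility is expressed as stability), and the weights are non-negative and sum to 1. The weights and scores are hypothetical.

```python
COMPONENTS = ("M", "C", "R", "P", "V", "E")


def exam_readiness_index(scores, weights):
    """Weighted linear composite on a 0-100 scale.

    scores:  component -> value in [0, 1], oriented so higher is better.
    weights: component -> non-negative weight; weights must sum to 1.
    """
    if any(not 0.0 <= scores[c] <= 1.0 for c in COMPONENTS):
        raise ValueError("component scores must lie in [0, 1]")
    if any(weights[c] < 0.0 for c in COMPONENTS):
        raise ValueError("weights must be non-negative")
    if abs(sum(weights[c] for c in COMPONENTS) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return 100.0 * sum(weights[c] * scores[c] for c in COMPONENTS)


equal_weights = {c: 1.0 / 6.0 for c in COMPONENTS}
learner = {"M": 0.8, "C": 0.6, "R": 0.7, "P": 0.5, "V": 0.9, "E": 0.7}
readiness = exam_readiness_index(learner, equal_weights)
```

With convex weights the score is monotone in every component and changes by at most 100 times the largest component change, which is the flavor of the monotonicity and Lipschitz stability properties the abstract proves.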

[AI-133] Why Pool When You Can Flow? Active Learning with GFlowNets

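[Quick read]: The paper addresses the poor scalability of pool-based active learning in virtual screening for drug discovery, where computational cost becomes prohibitive. For molecular libraries containing billions of samples, methods such as Bayesian Active Learning by Disagreement (BALD) can identify informative samples, but the cost of evaluating them grows with the size of the unlabeled pool. The key to the solution is BALD-GFlowNet, a generative active learning framework that uses Generative Flow Networks (GFlowNets) to sample molecules directly in proportion to the BALD reward. Acquisition thus shifts from an exhaustive search over a large unlabeled pool to generative sampling that does not evaluate every candidate explicitly, so scalability is independent of pool size while performance stays comparable to standard BALD and the generated molecules are more structurally diverse.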

Link: https://arxiv.org/abs/2509.00704
Authors: Renfei Zhang, Mohit Pandey, Artem Cherkasov, Martin Ester
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: 6 pages; 5 figures


Abstract:The scalability of pool-based active learning is limited by the computational cost of evaluating large unlabeled datasets, a challenge that is particularly acute in virtual screening for drug discovery. While active learning strategies such as Bayesian Active Learning by Disagreement (BALD) prioritize informative samples, they remain computationally intensive when scaled to libraries containing billions of samples. In this work, we introduce BALD-GFlowNet, a generative active learning framework that circumvents this issue. Our method leverages Generative Flow Networks (GFlowNets) to directly sample objects in proportion to the BALD reward. By replacing traditional pool-based acquisition with generative sampling, BALD-GFlowNet achieves scalability that is independent of the size of the unlabeled pool. In our virtual screening experiment, we show that BALD-GFlowNet achieves a performance comparable to that of a standard BALD baseline while generating more structurally diverse molecules, offering a promising direction for efficient and scalable molecular discovery.
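
The BALD reward itself is a standard quantity: the mutual information between the prediction and the model parameters, estimated as the entropy of the mean prediction minus the mean entropy of individual predictions. The sketch below computes it from a handful of stochastic forward passes. How the paper obtains those passes (for example MC dropout or an ensemble) is not stated in the abstract, and the probability vectors are made up.

```python
import math


def entropy(probs):
    """Shannon entropy in nats, treating 0 * log(0) as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def bald_score(mc_probs):
    """BALD acquisition score from several stochastic predictions.

    mc_probs: list of class-probability vectors, one per forward pass.
    Score = H[mean prediction] - mean of H[each prediction].
    """
    passes = len(mc_probs)
    classes = len(mc_probs[0])
    mean = [sum(p[c] for p in mc_probs) / passes for c in range(classes)]
    expected_entropy = sum(entropy(p) for p in mc_probs) / passes
    return entropy(mean) - expected_entropy


agree = [[0.5, 0.5], [0.5, 0.5]]     # uncertain but consistent passes
disagree = [[1.0, 0.0], [0.0, 1.0]]  # confident but conflicting passes

low = bald_score(agree)
high = bald_score(disagree)
```

The score is 0 when the passes agree and ln 2 when they confidently disagree. Pool-based BALD must compute this for every candidate, whereas the approach in the abstract trains a GFlowNet to sample candidates in proportion to it.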

[AI-134] Unsupervised Dataset Cleaning Framework for Encrypted Traffic Classification

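[Quick read]: The paper addresses encrypted mobile traffic classification, where traditional Deep Packet Inspection (DPI) no longer works. The core challenge is to clean raw traffic data efficiently and automatically, removing flows that are not useful for training (irrelevant protocols, background activity, control-plane messages, and long-lived sessions), so that Machine Learning (ML) models can be trained effectively. The key to the solution is an unsupervised framework that cleans the traffic automatically, avoiding the cost and inefficiency of manual packet-by-packet inspection. Experiments show a classification accuracy only 2% to 2.5% lower than with manual cleaning, which makes the preprocessing stage considerably more automated and practical.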

Link: https://arxiv.org/abs/2509.00701
Authors: Kun Qiu, Ying Wang, Baoqian Li, Wenjun Zhu
Affiliations: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: Accepted in IEEE ICNP 2025 Poster


Abstract:Traffic classification, a technique for assigning network flows to predefined categories, has been widely deployed in enterprise and carrier networks. With the massive adoption of mobile devices, encryption is increasingly used in mobile applications to address privacy concerns. Consequently, traditional methods such as Deep Packet Inspection (DPI) fail to distinguish encrypted traffic. To tackle this challenge, Artificial Intelligence (AI), in particular Machine Learning (ML), has emerged as a promising solution for encrypted traffic classification. A crucial prerequisite for any ML-based approach is traffic data cleaning, which removes flows that are not useful for training (e.g., irrelevant protocols, background activity, control-plane messages, and long-lived sessions). Existing cleaning solutions depend on manual inspection of every captured packet, making the process both costly and time-consuming. In this poster, we present an unsupervised framework that automatically cleans encrypted mobile traffic. Evaluation on real-world datasets shows that our framework incurs only a 2%~2.5% reduction in classification accuracy compared with manual cleaning. These results demonstrate that our method offers an efficient and effective preprocessing step for ML-based encrypted traffic classification.

[AI-135] Queuing for Civility: Regulating Emotions and Reducing Toxicity in Digital Discourse

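[Quick read]: The paper addresses online toxicity on social platforms, in particular the damage that hate speech and trolling do to digital interactions and users' emotional well-being. Prior work has relied mostly on post-hoc moderation and has overlooked the real-time emotional dynamics of online conversations and the way emotions spread to other users. The key to the solution is a graph-based emotion regulation framework that identifies points in a conversation where emotion regulation is needed and encourages self-reflection, steering users toward responsible behaviour in real time. A comment queuing mechanism additionally delays the publication of comments from potentially malicious users, giving them time to self-regulate and limiting the spread of anger. The empirical analysis reports a 12% reduction in toxicity and a 15% reduction in the spread of anger, with only 4% of comments temporarily held on average, balancing the strength of the intervention against user experience.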

Link: https://arxiv.org/abs/2509.00696
Authors: Akriti Verma, Shama Islam, Valeh Moghaddam, Adnan Anwar
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments:


Abstract:The pervasiveness of online toxicity, including hate speech and trolling, disrupts digital interactions and online well-being. Previous research has mainly focused on post-hoc moderation, overlooking the real-time emotional dynamics of online conversations and the impact of users’ emotions on others. This paper presents a graph-based framework to identify the need for emotion regulation within online conversations. This framework promotes self-reflection to manage emotional responses and encourage responsible behaviour in real time. Additionally, a comment queuing mechanism is proposed to address intentional trolls who exploit emotions to inflame conversations. This mechanism introduces a delay in publishing comments, giving users time to self-regulate before further engaging in the conversation and helping maintain emotional balance. Analysis of social media data from Twitter and Reddit demonstrates that the graph-based framework reduced toxicity by 12%, while the comment queuing mechanism decreased the spread of anger by 15%, with only 4% of comments being temporarily held on average. These findings indicate that combining real-time emotion regulation with delayed moderation can significantly improve well-being in online environments.

[AI-136] DELTA: Variational Disentangled Learning for Privacy-Preserving Data Reprogramming ICDM

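[Quick read]: The paper addresses feature engineering under privacy constraints in real-world applications whose datasets contain identifiable or sensitive attributes. Conventional methods focus only on downstream task performance and thereby risk privacy leakage. The authors define the Privacy-Preserving Data Reprogramming (PPDR) task: transform features to maximize prediction accuracy for the target attribute while minimizing prediction accuracy for the sensitive attribute. The key to the solution is DELTA, a two-phase variational disentangled generative learning framework. Phase I uses policy-guided reinforcement learning to search for high-utility feature transformations. Phase II uses a variational LSTM sequence-to-sequence encoder-decoder with a utility-privacy disentangled latent space and adversarial-causal disentanglement regularization to suppress privacy signals during feature generation, yielding privacy-aware data transformation.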

Link: https://arxiv.org/abs/2509.00693
Authors: Arun Vignesh Malarkkan, Haoyue Bai, Anjali Kaushik, Yanjie Fu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures, 3 Tables. Accepted at IEEE International Conference on Data Mining (ICDM) 2025


Abstract:In real-world applications, domain data often contains identifiable or sensitive attributes, is subject to strict regulations (e.g., HIPAA, GDPR), and requires explicit data feature engineering for interpretability and transparency. Existing feature engineering primarily focuses on advancing downstream task performance, often risking privacy leakage. We generalize this learning task under such new requirements as Privacy-Preserving Data Reprogramming (PPDR): given a dataset, transforming features to maximize target attribute prediction accuracy while minimizing sensitive attribute prediction accuracy. PPDR poses challenges for existing systems: 1) generating high-utility feature transformations without being overwhelmed by a large search space, and 2) disentangling and eliminating sensitive information from utility-oriented features to reduce privacy inferability. To tackle these challenges, we propose DELTA, a two-phase variational disentangled generative learning framework. Phase I uses policy-guided reinforcement learning to discover feature transformations with downstream task utility, without any regard to privacy inferability. Phase II employs a variational LSTM seq2seq encoder-decoder with a utility-privacy disentangled latent space design and adversarial-causal disentanglement regularization to suppress privacy signals during feature generation. Experiments on eight datasets show DELTA improves predictive performance by ~9.3% and reduces privacy leakage by ~35%, demonstrating robust, privacy-aware data transformation.

[AI-137] Valid Property-Enhanced Contrastive Learning for Targeted Optimization Resampling for Novel Drug Design

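[Quick read]: The paper addresses how to steer generative molecular models efficiently toward pharmacologically relevant regions of chemical space when little data is available. The key to the solution is the VECTOR+ framework, which couples property-guided representation learning with controllable molecule generation. Contrastive learning strengthens the representation of valid properties and enables interpretable, data-efficient exploration of functional chemical space, so that novel, synthetically tractable candidate molecules can be generated from limited training data.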

Link: https://arxiv.org/abs/2509.00684
Authors: Amartya Banerjee, Somnath Kar, Anirban Pal, Debabrata Maiti
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Code: this https URL

点击查看摘要

Abstract:Efficiently steering generative models toward pharmacologically relevant regions of chemical space remains a major obstacle in molecular drug discovery under low-data regimes. We present VECTOR+: Valid-property-Enhanced Contrastive Learning for Targeted Optimization and Resampling, a framework that couples property-guided representation learning with controllable molecule generation. VECTOR+ applies to both regression and classification tasks and enables interpretable, data-efficient exploration of functional chemical space. We evaluate on two datasets: a curated PD-L1 inhibitor set (296 compounds with experimental IC50 values) and a receptor kinase inhibitor set (2,056 molecules by binding mode). Despite limited training data, VECTOR+ generates novel, synthetically tractable candidates. Against PD-L1 (PDB 5J89), 100 of 8,374 generated molecules surpass a docking threshold of -15.0 kcal/mol, with the best scoring -17.6 kcal/mol compared to the top reference inhibitor (-15.4 kcal/mol). The best-performing molecules retain the conserved biphenyl pharmacophore while introducing novel motifs. Molecular dynamics (250 ns) confirm binding stability (ligand RMSD 2.5 angstroms). VECTOR+ generalizes to kinase inhibitors, producing compounds with stronger docking scores than established drugs such as brigatinib and sorafenib. Benchmarking against JT-VAE and MolGPT across docking, novelty, uniqueness, and Tanimoto similarity highlights the superior performance of our method. These results position our work as a robust, extensible approach for property-conditioned molecular design in low-data settings, bridging contrastive learning and generative modeling for reproducible, AI-accelerated discovery.
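
The abstract lists Tanimoto similarity among its benchmarking metrics. As a minimal sketch of what that metric computes, the snippet below treats each molecular fingerprint as a set of "on" bit indices. The fingerprints here are made-up illustrative values, and the helper names are hypothetical; the abstract does not say which fingerprint type or toolkit the authors used.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0  # convention chosen for this sketch: two empty fingerprints are identical
    return len(a & b) / len(a | b)


def max_similarity_to_reference(candidate: set, reference: list) -> float:
    """Highest Tanimoto similarity between a candidate and any reference fingerprint."""
    return max(tanimoto(candidate, ref) for ref in reference)


# Illustrative (invented) fingerprints, not real molecules.
reference_set = [{1, 2, 3}, {10, 11, 12, 13}]
generated = {2, 3, 4}

similarity = max_similarity_to_reference(generated, reference_set)
print(f"max similarity to reference set: {similarity:.2f}")
```

A low maximum similarity to the reference set is one common way to argue that a generated molecule is novel; the threshold to use is a design choice the abstract does not specify.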

[AI-138] The Name-Free Gap: Policy-Aware Stylistic Control in Music Generation

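[Quick read]: This paper addresses the difficulty of fine-grained stylistic control in generative music models, focusing on how to steer style when artist names are restricted (for example, by policy compliance requirements). Existing methods typically rely on retraining or specialized conditioning, which complicates reproducibility and limits policy compliance. The key to its solution is a lightweight, human-readable prompt modification strategy: a Large Language Model (LLM) generates artist-name-free descriptor sets that stand in for prompts built on artist names. Experiments show that these descriptors are weaker than artist names as a control signal (the "name-free gap") yet still recover much of the stylistic effect, while cross-artist transfers reduce alignment, which indicates that the descriptors encode artist-specific cues. The result is a reproducible evaluation protocol for policy-compliant, prompt-level style control.

Link: https://arxiv.org/abs/2509.00654
Authors: Ashwin Nagarajan,Hao-Wen Dong
Affiliation: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: 10 pages, 2 figures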

Abstract:Text-to-music models capture broad attributes such as instrumentation or mood, but fine-grained stylistic control remains an open challenge. Existing stylization methods typically require retraining or specialized conditioning, which complicates reproducibility and limits policy compliance when artist names are restricted. We study whether lightweight, human-readable modifiers sampled from a large language model can provide a policy-robust alternative for stylistic control. Using MusicGen-small, we evaluate two artists: Billie Eilish (vocal pop) and Ludovico Einaudi (instrumental piano). For each artist, we use fifteen reference excerpts and evaluate matched seeds under three conditions: baseline prompts, artist-name prompts, and five descriptor sets. All prompts are generated using a large language model. Evaluation uses both VGGish and CLAP embeddings with distributional and per-clip similarity measures, including a new min-distance attribution metric. Results show that artist names are the strongest control signal across both artists, while name-free descriptors recover much of this effect. This highlights that existing safeguards such as the restriction of artist names in music generation prompts may not fully prevent style imitation. Cross-artist transfers reduce alignment, showing that descriptors encode targeted stylistic cues. We also present a descriptor table across ten contemporary artists to illustrate the breadth of the tokens. Together these findings define the name-free gap, the controllability difference between artist-name prompts and policy-compliant descriptors, shown through a reproducible evaluation protocol for prompt-level controllability.

[AI-139] IndiaWeatherBench: A Dataset and Benchmark for Data-Driven Regional Weather Forecasting over India

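[Quick read]: This paper addresses the lack of a unified benchmark and comparability in data-driven regional weather forecasting, specifically over the Indian subcontinent. Existing studies use different datasets and experimental setups, which makes fair comparison and reproduction difficult. The key to its solution is IndiaWeatherBench, a comprehensive benchmark for the Indian subcontinent that comprises a curated dataset built from high-resolution regional reanalysis products, standardized deterministic and probabilistic evaluation metrics, and systematic baseline experiments across model architectures (UNets, Transformers, and graph neural networks) and boundary conditioning strategies. The benchmark provides a consistent standard for training and evaluation, and it open-sources the raw data, preprocessing pipeline, and code to promote reproducibility and further development.

Link: https://arxiv.org/abs/2509.00653
Authors: Tung Nguyen,Harkanwar Singh,Nilay Naharas,Lucas Bandarkar,Aditya Grover
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: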

Abstract:Regional weather forecasting is a critical problem for localized climate adaptation, disaster mitigation, and sustainable development. While machine learning has shown impressive progress in global weather forecasting, regional forecasting remains comparatively underexplored. Existing efforts often use different datasets and experimental setups, limiting fair comparison and reproducibility. We introduce IndiaWeatherBench, a comprehensive benchmark for data-driven regional weather forecasting focused on the Indian subcontinent. IndiaWeatherBench provides a curated dataset built from high-resolution regional reanalysis products, along with a suite of deterministic and probabilistic metrics to facilitate consistent training and evaluation. To establish strong baselines, we implement and evaluate a range of models across diverse architectures, including UNets, Transformers, and Graph-based networks, as well as different boundary conditioning strategies and training objectives. While focused on India, IndiaWeatherBench is easily extensible to other geographic regions. We open-source all raw and preprocessed datasets, model implementations, and evaluation pipelines to promote accessibility and future development. We hope IndiaWeatherBench will serve as a foundation for advancing regional weather forecasting research. Code is available at this https URL.

[AI-140] LLM-HyPZ: Hardware Vulnerability Discovery using an LLM-Assisted Hybrid Platform for Zero-Shot Knowledge Extraction and Refinement

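[Quick read]: This paper addresses the lack of systematic, scalable analysis methods for hardware vulnerabilities, in particular the limited statistical rigor and objectivity of the traditional Delphi survey approach, which depends on subjective expert judgment. Its core solution is LLM-HyPZ, a Large Language Model (LLM)-assisted hybrid framework for zero-shot knowledge extraction and refinement that combines zero-shot classification, contextualized embeddings, unsupervised clustering, and prompt-driven summarization to identify and summarize hardware-related vulnerabilities in a large CVE (Common Vulnerabilities and Exposures) corpus. The authors present it as the first data-driven approach to systematic hardware vulnerability discovery. It directly supported the MITRE CWE Most Important Hardware Weaknesses (MIHW) 2025 update by narrowing the candidate set that experts had to screen and by accelerating evidence gathering.

Link: https://arxiv.org/abs/2509.00647
Authors: Yu-Zheng Lin,Sujan Ghimire,Abhiram Nandimandalam,Jonah Michael Camacho,Unnati Tripathi,Rony Macwan,Sicong Shao,Setareh Rafatirad,Rozhin Yasaei,Pratik Satam,Soheil Salehi
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures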

Abstract:The rapid growth of hardware vulnerabilities has created an urgent need for systematic and scalable analysis methods. Unlike software flaws, which are often patchable post-deployment, hardware weaknesses remain embedded across product lifecycles, posing persistent risks to processors, embedded devices, and IoT platforms. Existing efforts such as the MITRE CWE Hardware List (2021) relied on expert-driven Delphi surveys, which lack statistical rigor and introduce subjective bias, while large-scale data-driven foundations for hardware weaknesses have been largely absent. In this work, we propose LLM-HyPZ, an LLM-assisted hybrid framework for zero-shot knowledge extraction and refinement from vulnerability corpora. Our approach integrates zero-shot LLM classification, contextualized embeddings, unsupervised clustering, and prompt-driven summarization to mine hardware-related CVEs at scale. Applying LLM-HyPZ to the 2021-2024 CVE corpus (114,836 entries), we identified 1,742 hardware-related vulnerabilities. We distilled them into five recurring themes, including privilege escalation via firmware and BIOS, memory corruption in mobile and IoT systems, and physical access exploits. Benchmarking across seven LLMs shows that LLaMA 3.3 70B achieves near-perfect classification accuracy (99.5%) on a curated validation set. Beyond methodological contributions, our framework directly supported the MITRE CWE Most Important Hardware Weaknesses (MIHW) 2025 update by narrowing the candidate search space. Specifically, our pipeline surfaced 411 of the 1,026 CVEs used for downstream MIHW analysis, thereby reducing expert workload and accelerating evidence gathering. These results establish LLM-HyPZ as the first data-driven, scalable approach for systematically discovering hardware vulnerabilities, thereby bridging the gap between expert knowledge and real-world vulnerability evidence.

[AI-141] RAG-PRISM: A Personalized Rapid and Immersive Skill Mastery Framework with Adaptive Retrieval-Augmented Tutoring

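[Quick read]: This paper addresses the workforce skill gap created by accelerating digital transformation in the Fourth Industrial Revolution (4IR), especially the learning challenges older workers face in STEM areas such as robotics, automation, artificial intelligence (AI), and cybersecurity. Traditional training struggles to meet the needs of diverse learners and lacks personalization and efficiency. The key to its solution is an adaptive tutoring framework that combines generative AI with Retrieval-Augmented Generation (RAG), uses document hit rate and Mean Reciprocal Rank (MRR) to optimize content matching, and uses Large Language Models (LLMs) such as GPT-3.5 and GPT-4 to generate personalized learning material. In the experiments GPT-4 reaches 87% relevancy and 100% alignment, supporting the claim that this dual-mode approach can serve as both a personalized topic recommender and a content generator for rapid, scalable, tailored 4IR education and workforce training.

Link: https://arxiv.org/abs/2509.00646
Authors: Gaurangi Raul,Yu-Zheng Lin,Karan Patel,Bono Po-Jen Shih,Matthew W. Redondo,Banafsheh Saber Latibari,Jesus Pacheco,Soheil Salehi,Pratik Satam
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures, Accepted by IEEE FIE 2025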

Abstract:The rapid digital transformation of Fourth Industrial Revolution (4IR) systems is reshaping workforce needs, widening skill gaps, especially for older workers. With growing emphasis on STEM skills such as robotics, automation, artificial intelligence (AI), and security, large-scale re-skilling and up-skilling are required. Training programs must address diverse backgrounds, learning styles, and motivations to improve persistence and success, while ensuring rapid, cost-effective workforce development through experiential learning. To meet these challenges, we present an adaptive tutoring framework that combines generative AI with Retrieval-Augmented Generation (RAG) to deliver personalized training. The framework leverages document hit rate and Mean Reciprocal Rank (MRR) to optimize content for each learner, and is benchmarked against human-generated training for alignment and relevance. We demonstrate the framework in 4IR cybersecurity learning by creating a synthetic QA dataset emulating trainee behavior, while RAG is tuned on curated cybersecurity materials. Evaluation compares its generated training with manually curated queries representing realistic student interactions. Responses are produced using large language models (LLMs) including GPT-3.5 and GPT-4, assessed for faithfulness and content alignment. GPT-4 achieves the best performance with 87% relevancy and 100% alignment. Results show this dual-mode approach enables the adaptive tutor to act as both a personalized topic recommender and content generator, offering a scalable solution for rapid, tailored learning in 4IR education and workforce development.
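
The abstract names document hit rate and Mean Reciprocal Rank (MRR) as the retrieval metrics behind the tutor. The sketch below shows how the two metrics are conventionally computed for a retriever that returns a ranked list of document IDs per query. The queries, document IDs, and the single-relevant-document assumption are illustrative only and are not taken from the paper.

```python
def hit_rate(ranked_lists, relevant_ids, k):
    """Fraction of queries whose relevant document appears in the top-k results."""
    hits = sum(1 for docs, rel in zip(ranked_lists, relevant_ids) if rel in docs[:k])
    return hits / len(ranked_lists)


def mean_reciprocal_rank(ranked_lists, relevant_ids):
    """Average of 1/rank of the relevant document; a miss contributes 0."""
    total = 0.0
    for docs, rel in zip(ranked_lists, relevant_ids):
        if rel in docs:
            total += 1.0 / (docs.index(rel) + 1)  # ranks are 1-based
    return total / len(ranked_lists)


# Three hypothetical queries, each with exactly one relevant document.
retrieved = [["d1", "d2", "d3"], ["d4", "d5", "d6"], ["d7", "d8", "d9"]]
relevant = ["d1", "d5", "dX"]  # "dX" is never retrieved

print("hit rate@2:", hit_rate(retrieved, relevant, k=2))
print("MRR:", mean_reciprocal_rank(retrieved, relevant))
```

Hit rate only asks whether the right document was retrieved at all, while MRR also rewards ranking it near the top, which is why the two are often reported together.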

[AI-142] Enabling Trustworthy Federated Learning via Remote Attestation for Mitigating Byzantine Threats

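[Quick read]: This paper addresses Byzantine attacks in Federated Learning (FL) systems that arise from their distributed nature: the central server cannot verify that local training was carried out honestly, so it may aggregate anomalous model updates submitted by malicious clients. Existing data-driven defenses struggle to tell malicious behavior apart from benign variation such as non-IID data, which leads to high false positive rates and poor filtering. The key to its solution is a Remote Attestation (RA) mechanism: code instrumentation tracks control flow and critical variables during local training, and a trusted training recorder inside a Trusted Execution Environment (TEE) produces a cryptographically signed attestation report showing that client training was free of tampering and data manipulation. This gives the server verifiable trust in the origin of each model update and protects global model aggregation.

Link: https://arxiv.org/abs/2509.00634
Authors: Chaoyu Zhang,Heng Jin,Shanghao Shi,Hexuan Yu,Sydney Johns,Y. Thomas Hou,Wenjing Lou
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: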

Abstract:Federated Learning (FL) has gained significant attention for its privacy-preserving capabilities, enabling distributed devices to collaboratively train a global model without sharing raw data. However, its distributed nature forces the central server to blindly trust the local training process and aggregate uncertain model updates, making it susceptible to Byzantine attacks from malicious participants, especially in mission-critical scenarios. Detecting such attacks is challenging due to the diverse knowledge across clients, where variations in model updates may stem from benign factors, such as non-IID data, rather than adversarial behavior. Existing data-driven defenses struggle to distinguish malicious updates from natural variations, leading to high false positive rates and poor filtering performance. To address this challenge, we propose Sentinel, a remote attestation (RA)-based scheme for FL systems that regains client-side transparency and mitigates Byzantine attacks from a system security perspective. Our system employs code instrumentation to track control-flow and monitor critical variables in the local training process. Additionally, we utilize a trusted training recorder within a Trusted Execution Environment (TEE) to generate an attestation report, which is cryptographically signed and securely transmitted to the server. Upon verification, the server ensures that legitimate client training processes remain free from program behavior violation or data manipulation, allowing only trusted model updates to be aggregated into the global model. Experimental results on IoT devices demonstrate that Sentinel ensures the trustworthiness of the local training integrity with low runtime and memory overhead. 

[AI-143] Forecasting the Ionosphere from Sparse GNSS Data with Temporal-Fusion Transformers

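[Quick read]: This paper addresses the difficulty of accurately forecasting ionospheric Total Electron Content (TEC). The difficulty stems from nonlinear couplings between solar, geomagnetic, and thermospheric drivers, the sparsity of global observations, and the limited accuracy of empirical models, and it is most pronounced during strong space weather conditions. The key to its solution is a machine learning framework based on the Temporal Fusion Transformer (TFT) that integrates heterogeneous inputs (solar irradiance, geomagnetic indices, and GNSS-derived vertical TEC) and applies preprocessing and temporal alignment strategies. Experiments show robust predictions up to 24 hours ahead with root mean square errors as low as 3.33 TECU, and attention-based analysis provides interpretability that supports both operational use and scientific discovery.

Link: https://arxiv.org/abs/2509.00631
Authors: Giacomo Acciarini,Simone Mestici,Halil Kelebek,Linnea Wolniewicz,Michael Vergalla,Madhulika Guhathakurta,Umaa Rebbapragada,Bala Poduval,Atılım Güneş Baydin,Frank Soboczenski
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: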

Abstract:The ionosphere critically influences Global Navigation Satellite Systems (GNSS), satellite communications, and Low Earth Orbit (LEO) operations, yet accurate prediction of its variability remains challenging due to nonlinear couplings between solar, geomagnetic, and thermospheric drivers. Total Electron Content (TEC), a key ionospheric parameter, is derived from GNSS observations, but its reliable forecasting is limited by the sparse nature of global measurements and the limited accuracy of empirical models, especially during strong space weather conditions. In this work, we present a machine learning framework for ionospheric TEC forecasting that leverages Temporal Fusion Transformers (TFT) to predict sparse ionosphere data. Our approach accommodates heterogeneous input sources, including solar irradiance, geomagnetic indices, and GNSS-derived vertical TEC, and applies preprocessing and temporal alignment strategies. Experiments spanning 2010-2025 demonstrate that the model achieves robust predictions up to 24 hours ahead, with root mean square errors as low as 3.33 TECU. Results highlight that solar EUV irradiance provides the strongest predictive signals. Beyond forecasting accuracy, the framework offers interpretability through attention-based analysis, supporting both operational applications and scientific discovery. To encourage reproducibility and community-driven development, we release the full implementation as the open-source toolkit ionopy.

[AI-144] NetGent: Agent-Based Automation of Network Application Workflows

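[Quick read]: This paper addresses the problem of producing the network traffic datasets needed to train generalizable machine learning (ML) models, that is, how to automate complex application workflows efficiently and reliably so as to generate realistic and diverse traffic. Existing browser automation tools are fragile and costly when diversity, repeatability, realism, and efficiency are all required. The key to its solution is the NetGent framework: users define state-dependent action rules in natural language, the rules are compiled into nondeterministic finite automata (NFAs), and a state synthesis component turns them into reusable, executable code. This design enables deterministic replay, reduces Large Language Model (LLM) calls through state caching, and adapts quickly when application interfaces change, so that realistic traffic traces can be generated reliably across applications such as video-on-demand, live video streaming, and video conferencing.

Link: https://arxiv.org/abs/2509.00625
Authors: Jaber Daneshamooz,Eugene Vuong,Laasya Koduru,Sanjay Chandrasekaran,Arpit Gupta
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: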

Abstract:We present NetGent, an AI-agent framework for automating complex application workflows to generate realistic network traffic datasets. Developing generalizable ML models for networking requires data collection from network environments with traffic that results from a diverse set of real-world web applications. However, automating such applications with existing browser automation tools in a way that is diverse, repeatable, realistic, and efficient remains fragile and costly. NetGent addresses this challenge by allowing users to specify workflows as natural-language rules that define state-dependent actions. These abstract specifications are compiled into nondeterministic finite automata (NFAs), which a state synthesis component translates into reusable, executable code. This design enables deterministic replay, reduces redundant LLM calls through state caching, and adapts quickly when application interfaces change. In experiments, NetGent automated more than 50 workflows spanning video-on-demand streaming, live video streaming, video conferencing, social media, and web scraping, producing realistic traffic traces while remaining robust to UI variability. By combining the flexibility of language-based agents with the reliability of compiled execution, NetGent provides a scalable foundation for generating the diverse, repeatable datasets needed to advance ML in networking.
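
To make the "state-dependent rules compiled into an NFA" idea concrete, here is a small sketch of an NFA that tracks every state a workflow could be in after each action. It is not NetGent's compiled format: the `NFA` class, the state names, and the action names are all hypothetical and chosen only to illustrate why nondeterminism is useful when the UI may or may not show an extra screen.

```python
from collections import defaultdict


class NFA:
    """Minimal nondeterministic finite automaton over string symbols."""

    def __init__(self, start, accepting):
        self.start = start
        self.accepting = set(accepting)
        self.delta = defaultdict(set)  # (state, symbol) -> set of next states

    def add(self, src, symbol, dst):
        self.delta[(src, symbol)].add(dst)

    def run(self, symbols):
        """Return the set of states reachable after consuming all symbols."""
        active = {self.start}
        for symbol in symbols:
            active = {dst for state in active for dst in self.delta[(state, symbol)]}
        return active

    def accepts(self, symbols):
        return bool(self.run(symbols) & self.accepting)


# Hypothetical video-streaming workflow: clicking a video may land on the
# player directly or on an ad first, so one action has two possible targets.
workflow = NFA(start="home", accepting={"streaming"})
workflow.add("home", "click_video", "player")
workflow.add("home", "click_video", "ad")
workflow.add("ad", "skip_ad", "player")
workflow.add("player", "play", "streaming")

print(workflow.accepts(["click_video", "play"]))
print(workflow.accepts(["click_video", "skip_ad", "play"]))
```

Because the automaton keeps a set of possible states rather than a single one, the same rule set covers both the path with the ad and the path without it.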

[AI-145] BALM-TSF: Balanced Multimodal Alignment for LLM-Based Time Series Forecasting

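[Quick read]: This paper addresses modality imbalance in multimodal time series forecasting: because text and temporal data differ so much, existing methods tend to over-emphasize one modality and neglect the other, which loses information and hurts forecasting performance. The key to its solution is BALM-TSF (Balanced Multimodal Alignment for LLM-Based Time Series Forecasting), a lightweight framework with two core design elements. First, a time series encoder processes the raw series, while descriptive statistics of the series are fed to a Large Language Model (LLM) with a learnable prompt to produce compact textual embeddings. Second, a simple but effective scaling strategy combined with a contrastive objective maps the textual embeddings into the latent space of the time series embeddings to align the two modalities, after which the aligned textual embeddings and the time series embeddings are fused for forecasting. With very few trainable parameters, the method reaches state-of-the-art performance in both long-term and few-shot forecasting.

Link: https://arxiv.org/abs/2509.00622
Authors: Shiqiao Zhou,Holger Schöner,Huanbo Lyu,Edouard Fouché,Shuo Wang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: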

Abstract:Time series forecasting is a long-standing and highly challenging research topic. Recently, driven by the rise of large language models (LLMs), research has increasingly shifted from purely time series methods toward harnessing textual modalities to enhance forecasting performance. However, the vast discrepancy between text and temporal data often leads current multimodal architectures to over-emphasise one modality while neglecting the other, resulting in information loss that harms forecasting performance. To address this modality imbalance, we introduce BALM-TSF (Balanced Multimodal Alignment for LLM-Based Time Series Forecasting), a lightweight time series forecasting framework that maintains balance between the two modalities. Specifically, raw time series are processed by the time series encoder, while descriptive statistics of raw time series are fed to an LLM with learnable prompt, producing compact textual embeddings. To ensure balanced cross-modal context alignment of time series and textual embeddings, a simple yet effective scaling strategy combined with a contrastive objective then maps these textual embeddings into the latent space of the time series embeddings. Finally, the aligned textual semantic embeddings and time series embeddings are together integrated for forecasting. Extensive experiments on standard benchmarks show that, with minimal trainable parameters, BALM-TSF achieves state-of-the-art performance in both long-term and few-shot forecasting, confirming its ability to harness complementary information from text and time series. Code is available at this https URL.
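
The abstract mentions a contrastive objective that pulls each textual embedding toward the time series embedding of the same sample. The sketch below shows a generic InfoNCE-style loss in plain Python as one plausible reading of "contrastive objective". It is an assumption, not the paper's loss: the temperature value and the toy embeddings are invented, and the paper's scaling strategy is not modeled here because the abstract does not describe it.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(text_emb, ts_emb, temperature=0.1):
    """Mean InfoNCE loss where row i of text_emb should match row i of ts_emb."""
    text = [l2_normalize(v) for v in text_emb]
    series = [l2_normalize(v) for v in ts_emb]
    loss = 0.0
    for i, t in enumerate(text):
        # Cosine similarity of this text embedding to every time series embedding.
        logits = [sum(a * b for a, b in zip(t, s)) / temperature for s in series]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(text)


# Toy 2-D embeddings: aligned pairs point the same way, shuffled pairs do not.
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
aligned_series = [[2.0, 0.0], [0.0, 3.0]]
shuffled_series = [[0.0, 3.0], [2.0, 0.0]]

print("aligned loss: ", info_nce(text_embeddings, aligned_series))
print("shuffled loss:", info_nce(text_embeddings, shuffled_series))
```

The loss is small when matching pairs are the most similar ones in the batch and large otherwise, which is the property a cross-modal alignment objective relies on.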

[AI-146] TimeCopilot

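[Quick read]: This paper addresses the low degree of automation, poor interpretability, and difficulty of integrating multiple Time Series Foundation Models (TSFMs) in traditional time series forecasting systems. Its core solution is TimeCopilot, presented as the first open-source agentic forecasting framework, which combines TSFMs with Large Language Models (LLMs) through a single unified API and automates the full pipeline of feature analysis, model selection, cross-validation, and forecast generation, while providing natural language explanations and supporting direct queries about the future. The framework is LLM-agnostic, supports ensembles across forecasting families, and achieves state-of-the-art probabilistic forecasting performance on the large-scale GIFT-Eval benchmark at low cost, offering a practical foundation for reproducible, explainable, and accessible forecasting systems.

Link: https://arxiv.org/abs/2509.00616
Authors: Azul Garza,Reneé Rosillo
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: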

Abstract:We introduce TimeCopilot, the first open-source agentic framework for forecasting that combines multiple Time Series Foundation Models (TSFMs) with Large Language Models (LLMs) through a single unified API. TimeCopilot automates the forecasting pipeline: feature analysis, model selection, cross-validation, and forecast generation, while providing natural language explanations and supporting direct queries about the future. The framework is LLM-agnostic, compatible with both commercial and open-source models, and supports ensembles across diverse forecasting families. Results on the large-scale GIFT-Eval benchmark show that TimeCopilot achieves state-of-the-art probabilistic forecasting performance at low cost. Our framework provides a practical foundation for reproducible, explainable, and accessible agentic forecasting systems.

[AI-147] Federated Survival Analysis with Node-Level Differential Privacy: Private Kaplan-Meier Curves

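[Quick read]: This paper addresses how to protect patient privacy when survival curves are computed collaboratively across multiple health-care jurisdictions, using node-level differential privacy. In its core solution, each data site releases a noisy Kaplan-Meier survival curve exactly once, with Laplace noise whose scale is determined by the length of the common time grid; the central server then averages the noisy curves, so the overall privacy budget is unchanged. The method avoids iterative training and heavy cryptography, and for privacy budgets of 0.5 and higher it keeps the empirical log-rank type-I error below 15%, so clinically useful survival information can be shared safely.

Link: https://arxiv.org/abs/2509.00615
Authors: Narasimha Raghavan Veeraragavan,Jan Franz Nygård
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: This is the author’s accepted version of the paper in IEEE FLTA 2025. The final version of record will appear in Proceedings of the IEEE International Conference on Federated Learning Technologies and Applications (FLTA 2025)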

Abstract:We investigate how to calculate Kaplan-Meier survival curves across multiple health-care jurisdictions while protecting patient privacy with node-level differential privacy. Each site discloses its curve only once, adding Laplace noise whose scale is determined by the length of the common time grid; the server then averages the noisy curves, so the overall privacy budget remains unchanged. We benchmark four one-shot smoothing techniques: Discrete Cosine Transform, Haar Wavelet shrinkage, adaptive Total-Variation denoising, and a parametric Weibull fit on the NCCTG lung-cancer cohort under five privacy levels and three partition scenarios (uniform, moderately skewed, highly imbalanced). Total-Variation gives the best mean accuracy, whereas the frequency-domain smoothers offer stronger worst-case robustness and the Weibull model shows the most stable behaviour at the strictest privacy setting. Across all methods the released curves keep the empirical log-rank type-I error below fifteen percent for privacy budgets of 0.5 and higher, demonstrating that clinically useful survival information can be shared without iterative training or heavy cryptography.
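
The release-once-then-average protocol in the abstract can be sketched in a few lines. The code below computes a Kaplan-Meier curve on a shared time grid, adds Laplace noise, and averages the noisy curves on the server. Two things are assumptions rather than details from the paper: the noise scale is set to `grid_length / epsilon` (treating each of the grid values as able to change by at most 1 when a whole site's data changes), and the result is clipped to [0, 1] as post-processing. The smoothing methods the paper benchmarks are not included, and all function names are hypothetical.

```python
import random


def kaplan_meier(times, events, grid):
    """Kaplan-Meier survival estimate at each grid point (event=1 death, 0 censored)."""
    data = sorted(zip(times, events))
    curve = []
    for g in grid:
        survival, at_risk, i = 1.0, len(data), 0
        while i < len(data) and data[i][0] <= g:
            t, deaths, removed = data[i][0], 0, 0
            while i < len(data) and data[i][0] == t:  # group tied times
                deaths += data[i][1]
                removed += 1
                i += 1
            survival *= 1.0 - deaths / at_risk
            at_risk -= removed
        curve.append(survival)
    return curve


def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize(curve, epsilon, rng):
    """One-shot noisy release; scale = grid length / epsilon is an assumed calibration."""
    scale = len(curve) / epsilon
    return [min(1.0, max(0.0, s + laplace(scale, rng))) for s in curve]


def average_curves(curves):
    """Server-side aggregation: pointwise mean of the sites' released curves."""
    return [sum(values) / len(values) for values in zip(*curves)]


grid = [1, 2, 3, 4]
site_a = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1], grid)
site_b = kaplan_meier([2, 2, 3, 5], [1, 1, 0, 1], grid)
rng = random.Random(0)
released = [privatize(c, epsilon=1.0, rng=rng) for c in (site_a, site_b)]
print(average_curves(released))
```

With a scale proportional to the grid length, a finer grid means noisier curves, which helps explain why the paper evaluates smoothing techniques after the noise is added.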

[AI-148] KVComp: A High-Performance LLM-Aware Lossy Compression Framework for KV Cache

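[Quick read]: This paper addresses the inference bottleneck that Transformer-based Large Language Models (LLMs) hit in long-text generation, where the memory footprint of the key-value cache (KV cache) grows sharply. The key to its solution is the KVComp framework, whose core contribution is a lossy compression technique designed around the characteristics of KV cache data, with the compression algorithms and system architecture co-designed so that the cache can keep growing while memory is managed efficiently. The method achieves on average 47% and up to 83% higher memory reduction rate than existing methods, and by reducing decompression overhead and data movement in the matrix-vector multiplication it raises execution throughput, in some cases even outperforming cuBLAS-based attention kernels.

Link: https://arxiv.org/abs/2509.00579
Authors: Bo Jiang,Taolue Yang,Youyuan Liu,Chengming Zhang,Xubin He,Sian Jin
Affiliation: unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: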

Abstract:Transformer-based large language models (LLMs) demonstrate impressive potential in various practical applications. However, long context inference poses a significant challenge due to the enormous memory requirements of the key-value (KV) cache, which can scale to multiple gigabytes as sequence length and batch size increase. In this paper, we present KVComp, a generic and efficient KV cache management framework optimized for long-text generation that synergistically works with both latency-critical and throughput-critical inference systems. KVComp employs novel lossy compression techniques specifically designed for KV cache data characteristics, featuring careful co-design of compression algorithms and system architecture. Our approach maintains compatibility with the growing nature of KV cache while preserving high computational efficiency. Experimental results show that KVComp achieves on average 47% and up to 83% higher memory reduction rate compared to existing methods with little or no model accuracy degradation. Furthermore, KVComp achieves extremely high execution throughput, effectively reducing decompression overhead and, in some cases, even accelerating the matrix-vector multiplication operation and outperforming cuBLAS-based attention kernels with less data movement.
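
The claim that the KV cache "can scale to multiple gigabytes" follows from simple arithmetic: one key and one value vector are stored per layer, per head, per token, per sequence in the batch. The sketch below works that out for an uncompressed cache. The model shape is a hypothetical example chosen for illustration, not a configuration taken from the paper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """Bytes needed to store uncompressed keys and values for every layer, head, and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = keys + values
    return per_token * seq_len * batch_size


# Hypothetical model shape: 32 layers, 32 KV heads, head dimension 128, fp16 values.
size = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)
print(size / 2**30, "GiB")  # 2.0 GiB for this configuration
```

The size grows linearly in both sequence length and batch size, so doubling either doubles the footprint; that linear growth is the pressure a cache compression scheme is meant to relieve.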

[AI-149] Can AI be Auditable?

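[Quick read]: This paper addresses the difficulty of making Artificial Intelligence (AI) systems auditable, that is, independently assessable for compliance with ethical, legal, and technical standards throughout their lifecycle. The key to its solution is to develop clear guidelines, harmonize international regulations, and build robust socio-technical methodologies so that auditability is embedded in AI development practices and governance infrastructures, together with multi-stakeholder collaboration and auditor empowerment to build an effective AI audit ecosystem.

Link: https://arxiv.org/abs/2509.00575
Authors: Himanshu Verma,Kirtan Path,Eva Thelisson
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: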

Abstract:Auditability is defined as the capacity of AI systems to be independently assessed for compliance with ethical, legal, and technical standards throughout their lifecycle. The chapter explores how auditability is being formalized through emerging regulatory frameworks, such as the EU AI Act, which mandate documentation, risk assessments, and governance structures. It analyzes the diverse challenges facing AI auditability, including technical opacity, inconsistent documentation practices, lack of standardized audit tools and metrics, and conflicting principles within existing responsible AI frameworks. The discussion highlights the need for clear guidelines, harmonized international regulations, and robust socio-technical methodologies to operationalize auditability at scale. The chapter concludes by emphasizing the importance of multi-stakeholder collaboration and auditor empowerment in building an effective AI audit ecosystem. It argues that auditability must be embedded in AI development practices and governance infrastructures to ensure that AI systems are not only functional but also ethically and legally aligned.

[AI-150] Social World Models

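[Quick read]: This paper addresses the difficulty current AI systems have in automatically structuring and reasoning about implicit social contexts, especially in simulating others' mental states and behavior when explicit information is missing. The key to its solution is a structured social world representation formalism (S3AP). Following a design based on the partially observable Markov decision process (POMDP), S3AP represents social interactions as structured tuples covering state, observation, agent actions, and mental states, and these tuples can be induced automatically from free-form narratives. S3AP improves Large Language Model (LLM) performance on five social reasoning tasks (for example, a 51% improvement on FANToM's theory-of-mind reasoning), and the social world models built from it predict future dynamics and improve agent decision-making by up to 18% on the SOTOPIA benchmark, supporting its promise as a general-purpose representation of social world states.

Link: https://arxiv.org/abs/2509.00559
Authors: Xuhui Zhou,Jiarui Liu,Akhila Yerukola,Hyunwoo Kim,Maarten Sap
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: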

Abstract:Humans intuitively navigate social interactions by simulating unspoken dynamics and reasoning about others’ perspectives, even with limited information. In contrast, AI systems struggle to automatically structure and reason about these implicit social contexts. In this paper, we introduce a novel structured social world representation formalism (S3AP), designed to help AI systems reason more effectively about social dynamics. Following a POMDP-driven design, S3AP represents social interactions as structured tuples, such as state, observation, agent actions, and mental states, which can be automatically induced from free-form narratives or other inputs. We first show S3AP can help LLMs better understand social narratives across 5 social reasoning tasks (e.g., +51% improvement on FANToM’s theory-of-mind reasoning with OpenAI’s o1), reaching new state-of-the-art (SOTA) performance. We then induce social world models from these structured representations, demonstrating their ability to predict future social dynamics and improve agent decision-making, yielding up to +18% improvement on the SOTOPIA social interaction benchmark. Our findings highlight the promise of S3AP as a powerful, general-purpose representation for social world states, enabling the development of more socially-aware systems that better navigate social interactions.

[AI-151] Text-to-Layout: A Generative Workflow for Drafting Architectural Floor Plans Using LLMs

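[Quick read]: This paper addresses the inefficiency of drafting architectural floor plans by hand, which depends on experience and responds slowly to changes in design intent, and in particular the challenge of turning natural language descriptions into structured building layouts. The key to its solution is a workflow based on Large Language Models (LLMs) that interprets textual input through prompt engineering and combines a furniture placement refinement algorithm with Python scripting to generate spatially coherent draft layouts automatically. The output is compatible with BIM tools such as Autodesk Revit and preserves the Revit-native parametric attributes needed for professional building information modeling processes.

Link: https://arxiv.org/abs/2509.00543
Authors: Jayakrishna Duggempudi,Lu Gao,Ahmed Senouci,Zhe Han,Yunpeng Zhang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: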

Abstract:This paper presents the development of an AI-powered workflow that uses Large Language Models (LLMs) to assist in drafting schematic architectural floor plans from natural language prompts. The proposed system interprets textual input to automatically generate layout options including walls, doors, windows, and furniture arrangements. It combines prompt engineering, a furniture placement refinement algorithm, and Python scripting to produce spatially coherent draft plans compatible with design tools such as Autodesk Revit. A case study of a mid-sized residential layout demonstrates the approach’s ability to generate functional and structured outputs with minimal manual effort. The workflow is designed for transparent replication, with all key prompt specifications documented to enable independent implementation by other researchers. In addition, the generated models preserve the full range of Revit-native parametric attributes required for direct integration into professional BIM processes.

[AI-152] Artificial Intelligence-Based Analysis of Ice Cream Melting Behavior Under Various Ingredients

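[Quick read]: This paper addresses the limited stability of ice cream during melting, which directly affects consumer acceptance and product quality. The key to its solution is a systematic evaluation of how four common stabilizers (locust bean gum, guar gum, maltodextrin, and carrageenan) affect the melting behavior of homemade ice cream, using melting tests under controlled conditions and image processing and analysis with Python and OpenCV. The study finds that all of these additives help form a stable air-cell matrix and that some stabilizers offer stronger melting resistance and structural support than others. By weighing performance against cost, it also points to more economical formulations for both small-scale and commercial production.

Link: https://arxiv.org/abs/2509.00507
Authors: Zhang Lai Bin,Zhen Bin It
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: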

Abstract:The stability of ice cream during melting is a critical factor for consumer acceptance and product quality. Stabilizers are commonly added to improve texture and structure and to slow melting, and these effects are the focus of the analysis. This report explores the effects of locust bean gum, guar gum, maltodextrin, and carrageenan on the melting behavior of homemade ice cream. The main objective was to assess how these additives influence melting resistance and to identify a more cost-effective recipe formulation. Ice cream samples incorporating each additive were prepared and subjected to melting tests under controlled conditions. Timelapse recordings were used to capture and analyze the progression of melting over time. Python and OpenCV were used for image processing and analysis. Observations revealed that all samples retained a foam-like structure even after melting, suggesting the stabilizers contributed to the formation of a stable air-cell matrix. Furthermore, when the melted samples were re-frozen and subsequently melted again, they displayed increased sturdiness, indicating improved resilience of the ice cream structure. Comparative analysis of the different stabilizers highlighted variations in their effectiveness, with some offering stronger melting resistance and structural support than others. Overall, the findings provide insights into the functional roles of commonly used food additives in ice cream formulation. By evaluating both performance and cost, this study demonstrates the potential for developing recipes that balance durability with economic efficiency, contributing to practical applications in both small-scale and commercial ice cream production.

[AI-153] NeuralSVCD for Efficient Swept Volume Collision Detection

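[Quick read]: This paper addresses the safety problem that arises when robots operate in unstructured environments and traditional discrete collision checking misses collisions between the checked points. It targets Swept Volume Collision Detection (SVCD), which checks for collisions continuously along the whole trajectory. The core challenge is that existing SVCD methods struggle to deliver both computational efficiency and accuracy. The key to the solution is NeuralSVCD, a novel neural encoder-decoder architecture that exploits shape locality and temporal locality through distributed geometric representations and temporal optimization, substantially improving computational efficiency without sacrificing accuracy and yielding robust collision detection.

Link: https://arxiv.org/abs/2509.00499
Authors: Dongwon Son, Hojin Jung, Beomjoon Kim
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: CoRL 2025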

Abstract:Robot manipulation in unstructured environments requires efficient and reliable Swept Volume Collision Detection (SVCD) for safe motion planning. Traditional discrete methods check only sampled points along a trajectory and can miss collisions between them, whereas SVCD continuously checks for collisions along the entire trajectory. Existing SVCD methods typically face a trade-off between efficiency and accuracy, limiting practical use. In this paper, we introduce NeuralSVCD, a novel neural encoder-decoder architecture tailored to overcome this trade-off. Our approach leverages shape locality and temporal locality through distributed geometric representations and temporal optimization. This enhances computational efficiency without sacrificing accuracy. Comprehensive experiments show that NeuralSVCD consistently outperforms existing state-of-the-art SVCD methods in terms of both collision detection accuracy and computational efficiency, demonstrating its robust applicability across diverse robotic manipulation scenarios. Code and videos are available at this https URL.

[AI-154] Multi-Agent Data Visualization and Narrative Generation

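[Quick read]: This paper addresses the limited automation and human collaboration efficiency in data visualization workflows, in particular the lack of interpretability and flexibility in end-to-end analysis from data exploration to coherent visual narratives. The key to the solution is a lightweight multi-agent system whose hybrid architecture combines deterministic components with Large Language Models (LLMs), strategically externalizing critical logic from the LLMs to improve transparency and reliability. The system also emits granular, modular intermediate outputs that can be modified locally without regenerating everything, providing a technical basis for sustainable human-AI collaboration.

Link: https://arxiv.org/abs/2509.00481
Authors: Anton Wolter, Georgios Vidalakis, Michael Yu, Ankit Grover, Vaishali Dhanoa
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: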

Abstract:Recent advancements in the field of AI agents have impacted the way we work, enabling greater automation and collaboration between humans and agents. In the data visualization field, multi-agent systems can be useful for employing agents throughout the entire data-to-communication pipeline. We present a lightweight multi-agent system that automates the data analysis workflow, from data exploration to generating coherent visual narratives for insight communication. Our approach combines a hybrid multi-agent architecture with deterministic components, strategically externalizing critical logic from LLMs to improve transparency and reliability. The system delivers granular, modular outputs that enable surgical modifications without full regeneration, supporting sustainable human-AI collaboration. We evaluated our system across 4 diverse datasets, demonstrating strong generalizability, narrative quality, and computational efficiency with minimal dependencies.

[AI-155] Cross-Domain Malware Detection via Probability-Level Fusion of Lightweight Gradient Boosting Models

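[Quick read]: This paper addresses the poor cross-source generalization and high computational cost of malware detection models. Existing detectors trained on a single dataset adapt poorly to malware variants in other settings, and complex models are often inefficient to deploy. The key to the solution is a lightweight probability-fusion framework: a LightGBM classifier is trained on each of three heterogeneous data sources (EMBER static features, API call sequence behavioral features, and CIC obfuscated memory patterns), the top predictive features are selected for efficiency, and grid search optimizes the weights used to fuse the models' output probabilities, achieving accurate, low-latency cross-domain detection.

Link: https://arxiv.org/abs/2509.00476
Authors: Omar Khalid Ali Mohamed
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 5 pages, 3 figures, 3 tables. Conference-style formatting (IEEEtran)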

Abstract:The escalating sophistication of malware necessitates robust detection mechanisms that generalize across diverse data sources. Traditional single-dataset models struggle with cross-domain generalization and often incur high computational costs. This paper presents a novel, lightweight framework for malware detection that employs probability-level fusion across three distinct datasets: EMBER (static features), API Call Sequences (behavioral features), and CIC Obfuscated Memory (memory patterns). Our method trains individual LightGBM classifiers on each dataset, selects top predictive features to ensure efficiency, and fuses their prediction probabilities using optimized weights determined via grid search. Extensive experiments demonstrate that our fusion approach achieves a macro F1-score of 0.823 on a cross-domain validation set, significantly outperforming individual models and providing superior generalization. The framework maintains low computational overhead, making it suitable for real-time deployment, and all code and data are provided for full reproducibility.
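
The abstract gives no implementation details for the fusion step, so the following is only a minimal sketch of probability-level fusion with grid-searched weights. It assumes each validation sample already has one malware probability from each of three models; the function names, the toy probabilities, and the 0.1 grid step are illustrative assumptions, not the authors' code.

```python
from itertools import product


def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def fuse(probs, weights):
    """Weighted average of per-model P(malware) for each sample."""
    n_samples = len(probs[0])
    return [sum(w * p[i] for w, p in zip(weights, probs)) for i in range(n_samples)]


def grid_search_weights(probs, y_true, steps=10, threshold=0.5):
    """Try every weight triple on a simplex grid; keep the best macro F1."""
    best_score, best_weights = -1.0, None
    for a, b in product(range(steps + 1), repeat=2):
        if a + b > steps:
            continue  # weights must be non-negative and sum to 1
        weights = (a / steps, b / steps, (steps - a - b) / steps)
        y_pred = [int(p >= threshold) for p in fuse(probs, weights)]
        score = macro_f1(y_true, y_pred)
        if score > best_score:
            best_score, best_weights = score, weights
    return best_score, best_weights


# Toy validation data (made up): three models, four samples.
y_val = [0, 1, 0, 1]
model_probs = [
    [0.1, 0.9, 0.2, 0.8],  # hypothetical static-feature model
    [0.6, 0.4, 0.5, 0.5],  # hypothetical behavioral model
    [0.5, 0.5, 0.5, 0.5],  # hypothetical memory-pattern model
]
score, weights = grid_search_weights(model_probs, y_val)
print(score, weights)
```

In practice the per-model probabilities would come from each trained classifier's probability output, and the weights would be tuned on a held-out validation split rather than on the test data.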

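[AI-156] NEWSAGENT: Benchmarking Multimodal Agents as Journalists with Real-World Newswriting Tasks

[Quick read]: This paper asks whether autonomous digital agents can effectively improve news production when working with multimodal web data. Despite recent industry progress (e.g., Manus AI and Gemini's research mode), there has been no systematic evaluation of agent systems' multimodal content understanding, information retrieval, and narrative integration in real newswriting settings. The key to the solution is the NEWSAGENT benchmark, which contains 6,000 human-verified examples derived from real news and covers the full path from raw multimodal content to a structured news article. Agents must identify narrative perspectives, issue keyword queries, retrieve historical background, and generate articles, mirroring how a journalist builds a story from scratch. Because essential context is not provided and must be actively discovered, the benchmark gives a more realistic measure of agent productivity in multimodal data manipulation.

Link: https://arxiv.org/abs/2509.00446
Authors: Yen-Che Chien, Kuang-Da Wang, Wei-Yao Wang, Wen-Chih Peng
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Preprint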

Abstract:Recent advances in autonomous digital agents from industry (e.g., Manus AI and Gemini’s research mode) highlight potential for structured tasks by autonomous decision-making and task decomposition; however, it remains unclear to what extent agent-based systems can improve multimodal web data productivity. We study this in the realm of journalism, which requires iterative planning, interpretation, and contextual reasoning from multimodal raw contents to form a well-structured news article. We introduce NEWSAGENT, a benchmark for evaluating how agents can automatically search available raw contents, select desired information, and edit and rephrase to form a news article by accessing core journalistic functions. Given a writing instruction and firsthand data, mirroring how a journalist starts a news draft, agents are tasked to identify narrative perspectives, issue keyword-based queries, retrieve historical background, and generate complete articles. Unlike typical summarization or retrieval tasks, essential context is not directly available and must be actively discovered, reflecting the information gaps faced in real-world news writing. NEWSAGENT includes 6k human-verified examples derived from real news, with multimodal contents converted to text for broad model compatibility. We evaluate open- and closed-source LLMs with commonly used agentic frameworks on NEWSAGENT, which shows that agents are capable of retrieving relevant facts but struggle with planning and narrative integration. We believe that NEWSAGENT serves as a realistic testbed for iterating on and evaluating agent capabilities, from multimodal web data manipulation to real-world productivity.

[AI-157] Curriculum Guided Personalized Subgraph Federated Learning CIKM2025

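[Quick read]: This paper addresses the performance degradation caused by data heterogeneity in Subgraph Federated Learning (SFL). In particular, sparse and biased subgraphs readily trigger local overfitting, which makes the client similarity matrix stagnate or even collapse and renders weighted aggregation ineffective. The key to the solution is a new framework, Curriculum guided personalized sUbgraph Federated Learning (CUFL). On the client side, Curriculum Learning (CL) adaptively selects training edges according to their reconstruction scores, exposing the model first to generic cross-client substructures and only later to client-specific ones, which prevents early overfitting and enables gradual personalization. In addition, client similarity is estimated from fine-grained structural indicators reconstructed on a random reference graph, and server aggregation shifts from exchanging generic knowledge to propagating client-specific knowledge, improving the stability and effectiveness of personalized aggregation.

Link: https://arxiv.org/abs/2509.00402
Authors: Minku Kang, Hogun Park
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to CIKM 2025. This is an extended version of the original submission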

Abstract:Subgraph Federated Learning (FL) aims to train Graph Neural Networks (GNNs) across distributed private subgraphs, but it suffers from severe data heterogeneity. To mitigate data heterogeneity, weighted model aggregation personalizes each local GNN by assigning larger weights to parameters from clients with similar subgraph characteristics inferred from their current model states. However, the sparse and biased subgraphs often trigger rapid overfitting, causing the estimated client similarity matrix to stagnate or even collapse. As a result, aggregation loses effectiveness as clients reinforce their own biases instead of exploiting diverse knowledge otherwise available. To this end, we propose a novel personalized subgraph FL framework called Curriculum guided personalized sUbgraph Federated Learning (CUFL). On the client side, CUFL adopts Curriculum Learning (CL) that adaptively selects edges for training according to their reconstruction scores, exposing each GNN first to easier, generic cross-client substructures and only later to harder, client-specific ones. This paced exposure prevents early overfitting to biased patterns and enables gradual personalization. By regulating personalization, the curriculum also reshapes server aggregation from exchanging generic knowledge to propagating client-specific knowledge. Further, CUFL improves weighted aggregation by estimating client similarity using fine-grained structural indicators reconstructed on a random reference graph. Extensive experiments on six benchmark datasets confirm that CUFL achieves superior performance compared to relevant baselines. Code is available at this https URL.

[AI-158] A Study on the Framework for Evaluating the Ethics and Trustworthiness of Generative AI

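[Quick read]: This paper addresses the ethical and trustworthiness challenges raised by the rapid development of generative AI, including bias, harmful content, copyright infringement, privacy violations, and hallucination, which existing evaluation methods centered on performance and accuracy cannot fully cover. The key to the solution is a systematic multidimensional evaluation framework spanning 11 core dimensions (fairness, transparency, accountability, safety, privacy, accuracy, consistency, robustness, explainability, copyright and intellectual property protection, and source traceability), each with corresponding indicators and assessment methods. By comparing AI ethics policies in South Korea, the United States, the European Union, and China, the paper also distills practical governance approaches, integrating technical assessment with multidisciplinary perspectives to provide a basis for identifying and managing ethical risks across the generative AI lifecycle.

Link: https://arxiv.org/abs/2509.00398
Authors: Cheonsu Jeong, Seunghyun Lee, Sunny Jeong, Sungsu Kim
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 20 pages, 3 figures, 3 tables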

Abstract:This study provides an in-depth analysis of the ethical and trustworthiness challenges emerging alongside the rapid advancement of generative artificial intelligence (AI) technologies and proposes a comprehensive framework for their systematic evaluation. While generative AI, such as ChatGPT, demonstrates remarkable innovative potential, it simultaneously raises ethical and social concerns, including bias, harmfulness, copyright infringement, privacy violations, and hallucination. Current AI evaluation methodologies, which mainly focus on performance and accuracy, are insufficient to address these multifaceted issues. Thus, this study emphasizes the need for new human-centered criteria that also reflect social impact. To this end, it identifies key dimensions for evaluating the ethics and trustworthiness of generative AI: fairness, transparency, accountability, safety, privacy, accuracy, consistency, robustness, explainability, copyright and intellectual property protection, and source traceability, and develops detailed indicators and assessment methodologies for each. Moreover, it provides a comparative analysis of AI ethics policies and guidelines in South Korea, the United States, the European Union, and China, deriving key approaches and implications from each. The proposed framework applies across the AI lifecycle and integrates technical assessments with multidisciplinary perspectives, thereby offering practical means to identify and manage ethical risks in real-world contexts. Ultimately, the study establishes an academic foundation for the responsible advancement of generative AI and delivers actionable insights for policymakers, developers, users, and other stakeholders, supporting the positive societal contributions of AI technologies.

[AI-159] Beyond Negative Transfer: Disentangled Preference-Guided Diffusion for Cross-Domain Sequential Recommendation

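[Quick read]: This paper addresses negative transfer caused by domain heterogeneity in Cross-Domain Sequential Recommendation (CDSR): naively aggregating behavioral signals from multiple domains introduces conflicting domain-specific preferences and amplifies noise (such as misclicks and impulsive actions), degrading recommendation quality. The core challenge is disentangling three intertwined signals in user preferences: domain-invariant preferences, domain-specific preferences, and noise. The key to the solution is a novel Disentangled Preference-Guided Diffusion Model (DPG-Diff), which the authors describe as the first application of Diffusion Models (DMs) to CDSR. DPG-Diff explicitly decomposes user preferences into domain-invariant and domain-specific components that jointly guide the reverse diffusion process, modeling complex preference structure and filtering noise, which makes cross-domain knowledge transfer more robust and mitigates negative transfer.

Link: https://arxiv.org/abs/2509.00389
Authors: Xiaoxin Ye, Chengkai Huang, Hongtao Huang, Lina Yao
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: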

Abstract:Cross-Domain Sequential Recommendation (CDSR) leverages user behaviors across domains to enhance recommendation quality. However, naive aggregation of sequential signals can introduce conflicting domain-specific preferences, leading to negative transfer. While Sequential Recommendation (SR) already suffers from noisy behaviors such as misclicks and impulsive actions, CDSR further amplifies this issue due to domain heterogeneity arising from diverse item types and user intents. The core challenge is disentangling three intertwined signals: domain-invariant preferences, domain-specific preferences, and noise. Diffusion Models (DMs) offer a generative denoising framework well-suited for disentangling complex user preferences and enhancing robustness to noise. Their iterative refinement process enables gradual denoising, making them effective at capturing subtle preference signals. However, existing applications in recommendation face notable limitations: sequential DMs often conflate shared and domain-specific preferences, while cross-domain collaborative filtering DMs neglect temporal dynamics, limiting their ability to model evolving user preferences. To bridge these gaps, we propose DPG-Diff, a novel Disentangled Preference-Guided Diffusion Model that is, to the best of our knowledge, the first diffusion-based approach tailored for CDSR. DPG-Diff decomposes user preferences into domain-invariant and domain-specific components, which jointly guide the reverse diffusion process. This disentangled guidance enables robust cross-domain knowledge transfer, mitigates negative transfer, and filters sequential noise. Extensive experiments on real-world datasets demonstrate that DPG-Diff consistently outperforms state-of-the-art baselines across multiple metrics.

[AI-160] Unifying Adversarial Perturbation for Graph Neural Networks

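[Quick read]: This paper addresses the vulnerability of Graph Neural Networks (GNNs) to adversarial attacks on node features and graph structure, with the goal of improving their robustness and generalization. Existing methods strengthen models through adversarial training but are mostly confined to specific datasets and GNN architectures. The paper proposes PerturbEmbedding, whose key idea is to perturb the hidden embeddings of every GNN layer directly. It provides a unified framework that subsumes many existing perturbation strategies and a unified formal view of random and adversarial perturbations. Experiments show that the method significantly improves robustness and generalization across datasets and backbone models, and that rejecting both random and adversarial perturbations further improves performance.

Link: https://arxiv.org/abs/2509.00387
Authors: Jinluan Yang, Ruihao Zhang, Zhengyu Chen, Fei Wu, Kun Kuang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: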

Abstract:This paper studies the vulnerability of Graph Neural Networks (GNNs) to adversarial attacks on node features and graph structure. Various methods have implemented adversarial training to augment graph data, aiming to bolster the robustness and generalization of GNNs. These methods typically involve applying perturbations to the node feature, weights, or graph structure and subsequently minimizing the loss by learning more robust graph model parameters under the adversarial perturbations. Despite the effectiveness of adversarial training in enhancing GNNs’ robustness and generalization abilities, its application has been largely confined to specific datasets and GNN types. In this paper, we propose a novel method, PerturbEmbedding, that integrates adversarial perturbation and training, enhancing GNNs’ resilience to such attacks and improving their generalization ability. PerturbEmbedding performs perturbation operations directly on every hidden embedding of GNNs and provides a unified framework for most existing perturbation strategies/methods. We also offer a unified perspective on the forms of perturbations, namely random and adversarial perturbations. Through experiments on various datasets using different backbone models, we demonstrate that PerturbEmbedding significantly improves both the robustness and generalization abilities of GNNs, outperforming existing methods. The rejection of both random (non-targeted) and adversarial (targeted) perturbations further enhances the backbone model’s performance.

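[AI-161] Theory Foundation of Physics-Enhanced Residual Learning

[Quick read]: This paper addresses how, when combining neural networks with physics models, to improve interpretability and data efficiency while preserving prediction accuracy. The core gap is the lack of theory explaining why a residual learning structure reduces parameter count, speeds up convergence, and lowers the number of training samples required. The paper studies Physics-Enhanced Residual Learning (PERL), a framework that learns the residual between the physics model's prediction and the ground truth. Through an analysis of loss-function bounds under Lipschitz continuity, it rigorously proves PERL's three advantages in parameter efficiency, convergence rate, and sample efficiency, supporting accurate modeling in data-scarce settings such as autonomous driving.

Link: https://arxiv.org/abs/2509.00348
Authors: Shixiao Liang, Wang Chen, Keke Long, Peng Zhang, Xiaopeng Li, Jintao Ke
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 24 pages, 8 figures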

Abstract:Intensive studies have been conducted in recent years to integrate neural networks with physics models to balance model accuracy and interpretability. One recently proposed approach, named Physics-Enhanced Residual Learning (PERL), is to use learning to estimate the residual between the physics model prediction and the ground truth. Numerical examples suggested that integrating such residual with physics models in PERL has three advantages: (1) a reduction in the number of required neural network parameters; (2) faster convergence rates; and (3) fewer training samples needed for the same computational precision. However, these numerical results lack theoretical justification and cannot be adequately explained. This paper aims to explain these advantages of PERL from a theoretical perspective. We investigate a general class of problems with Lipschitz continuity properties. By examining the relationships between the bounds to the loss function and residual learning structure, this study rigorously proves a set of theorems explaining the three advantages of PERL. Several numerical examples in the context of automated vehicle trajectory prediction are conducted to illustrate the proposed theorems. The results confirm that, even with significantly fewer training samples, PERL consistently achieves higher accuracy than a pure neural network. These results demonstrate the practical value of PERL in real world autonomous driving applications where corner case data are costly or hard to obtain. PERL therefore improves predictive performance while reducing the amount of data required.
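
The residual-learning structure described above can be illustrated with a toy sketch. It is not the paper's setup: the physics prior (constant-velocity motion), the synthetic data, and the one-parameter least-squares fit that stands in for the neural network are all assumptions chosen to keep the example short.

```python
def physics_model(t, v0=2.0):
    """Assumed physics prior: constant-velocity position, x = v0 * t."""
    return v0 * t


def fit_residual_coefficient(ts, xs):
    """Least-squares fit of the residual x - physics_model(t) to c * t**2."""
    numerator = sum((x - physics_model(t)) * t ** 2 for t, x in zip(ts, xs))
    denominator = sum(t ** 4 for t in ts)
    return numerator / denominator


def perl_predict(t, c):
    """PERL-style prediction: physics output plus the learned residual."""
    return physics_model(t) + c * t ** 2


# Synthetic ground truth with an acceleration term the physics prior omits.
ts = [0.5, 1.0, 1.5, 2.0]
xs = [2.0 * t + 0.3 * t ** 2 for t in ts]
c = fit_residual_coefficient(ts, xs)
print(c, perl_predict(3.0, c))
```

The learner only has to capture what the physics prior misses, here a single quadratic term, which is the intuition behind the smaller models and fewer samples the abstract refers to.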

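[AI-162] LLM-Driven Policy Diffusion: Enhancing Generalization in Offline Reinforcement Learning

[Quick read]: This paper addresses the limited generalization of offline Reinforcement Learning (Offline RL) caused by restricted data: agents trained only on collected experience struggle to adapt to new tasks or environments. The key to the solution is LLM-Driven Policy Diffusion (LLMDPD), which uses task-specific prompts to improve the generalization of policy learning. A Large Language Model (LLM) processes text-based task descriptions to supply rich semantic context, while a transformer encodes trajectory prompts to capture structured behavioral patterns in the transition dynamics. Both serve as conditional inputs to a context-aware policy-level diffusion model, markedly improving adaptation and performance on unseen tasks.

Link: https://arxiv.org/abs/2509.00347
Authors: Hanping Zhang, Yuhong Guo
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: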

Abstract:Reinforcement Learning (RL) is known for its strong decision-making capabilities and has been widely applied in various real-world scenarios. However, with the increasing availability of offline datasets and the lack of well-designed online environments from human experts, the challenge of generalization in offline RL has become more prominent. Due to the limitations of offline data, RL agents trained solely on collected experiences often struggle to generalize to new tasks or environments. To address this challenge, we propose LLM-Driven Policy Diffusion (LLMDPD), a novel approach that enhances generalization in offline RL using task-specific prompts. Our method incorporates both text-based task descriptions and trajectory prompts to guide policy learning. We leverage a large language model (LLM) to process text-based prompts, utilizing its natural language understanding and extensive knowledge base to provide rich task-relevant context. Simultaneously, we encode trajectory prompts using a transformer model, capturing structured behavioral patterns within the underlying transition dynamics. These prompts serve as conditional inputs to a context-aware policy-level diffusion model, enabling the RL agent to generalize effectively to unseen tasks. Our experimental results demonstrate that LLMDPD outperforms state-of-the-art offline RL methods on unseen tasks, highlighting its effectiveness in improving generalization and adaptability in diverse settings.

[AI-163] Scalable Option Learning in High-Throughput Environments

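[Quick read]: This paper addresses the scalability of Hierarchical Reinforcement Learning (HRL) in high-throughput environments: existing HRL methods cannot make effective use of large-scale training data, which limits their performance on complex tasks. The key to the solution is Scalable Option Learning (SOL), a new hierarchical RL algorithm that optimizes how options are learned and trained. It achieves 25x higher throughput than existing hierarchical methods, and training on 20 billion frames of NetHack experience confirms its effectiveness and positive scaling trends.

Link: https://arxiv.org/abs/2509.00338
Authors: Mikael Henaff, Scott Fujimoto, Michael Rabbat
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: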

Abstract:Hierarchical reinforcement learning (RL) has the potential to enable effective decision-making over long timescales. Existing approaches, while promising, have yet to realize the benefits of large-scale training. In this work, we identify and solve several key challenges in scaling hierarchical RL to high-throughput environments. We propose Scalable Option Learning (SOL), a highly scalable hierarchical RL algorithm which achieves a 25x higher throughput compared to existing hierarchical methods. We train our hierarchical agents using 20 billion frames of experience on the complex game of NetHack, significantly surpassing flat agents and demonstrating positive scaling trends. We also validate our algorithm on MiniHack and Mujoco environments, showcasing its general applicability. Our code is open sourced at this http URL.

[AI-164] Jacobian Exploratory Dual-Phase Reinforcement Learning for Dynamic Endoluminal Navigation of Deformable Continuum Robots

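[Quick read]: This paper addresses two challenges in planning for Deformable Continuum Robots (DCRs): nonlinear deformation mechanics and partial state observability, which violate the Markov assumption and limit conventional Reinforcement Learning (RL) methods. Jacobian-based approaches have theoretical foundations for rigid manipulators but are hard to apply directly to DCRs because of time-varying kinematics and underactuated deformation dynamics. The key to the solution is the Jacobian Exploratory Dual-Phase RL (JEDP-RL) framework, which splits planning into two phases: small-scale local exploratory actions first estimate the current deformation Jacobian, and Jacobian features are then added to the state representation to restore approximate Markovianity, improving the stability and efficiency of policy learning.

Link: https://arxiv.org/abs/2509.00329
Authors: Yu Tian, Chi Kit Ng, Hongliang Ren
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: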

Abstract:Deformable continuum robots (DCRs) present unique planning challenges due to nonlinear deformation mechanics and partial state observability, violating the Markov assumptions of conventional reinforcement learning (RL) methods. While Jacobian-based approaches offer theoretical foundations for rigid manipulators, their direct application to DCRs remains limited by time-varying kinematics and underactuated deformation dynamics. This paper proposes Jacobian Exploratory Dual-Phase RL (JEDP-RL), a framework that decomposes planning into phased Jacobian estimation and policy execution. During each training step, we first perform small-scale local exploratory actions to estimate the deformation Jacobian matrix, then augment the state representation with Jacobian features to restore approximate Markovianity. Extensive SOFA surgical dynamic simulations demonstrate JEDP-RL’s three key advantages over proximal policy optimization (PPO) baselines: 1) Convergence speed: 3.2x faster policy convergence, 2) Navigation efficiency: 25% fewer steps to reach the target, and 3) Generalization ability: a 92% success rate under material property variations and an 83% success rate (33% higher than PPO) in the unseen tissue environment.
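
The two-phase idea (probe locally to estimate a Jacobian, then feed Jacobian features into the state) can be sketched as follows. The abstract does not specify the estimator, so this uses plain forward finite differences; `tip_position` is a hypothetical linear stand-in for the simulator, and all names here are illustrative assumptions.

```python
def tip_position(action):
    """Hypothetical stand-in for the simulator: 2 actuator inputs -> 2D tip position."""
    a0, a1 = action
    return [1.5 * a0 - 0.5 * a1, 0.2 * a0 + 2.0 * a1]


def estimate_jacobian(forward, action, delta=1e-3):
    """Probe each actuator with a small step and record the tip displacement."""
    base = forward(action)
    columns = []
    for j in range(len(action)):
        probe = list(action)
        probe[j] += delta
        moved = forward(probe)
        columns.append([(m - b) / delta for m, b in zip(moved, base)])
    # columns[j][i] = d tip_i / d action_j; transpose to row-major J[i][j].
    return [list(row) for row in zip(*columns)]


def augment_state(observation, jacobian):
    """Append flattened Jacobian entries to the observation vector."""
    return list(observation) + [v for row in jacobian for v in row]


action = [0.3, -0.1]
J = estimate_jacobian(tip_position, action)
state = augment_state(tip_position(action), J)
print(J, len(state))
```

On a real deformable robot the mapping is nonlinear and time-varying, so the estimate is only locally valid and would need to be refreshed at each step, which matches the per-step exploration the abstract describes.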

[AI-165] Contact-Aided Navigation of Flexible Robotic Endoscope Using Deep Reinforcement Learning in Dynamic Stomach

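[Quick read]: This paper addresses the difficulty of navigating a Flexible Robotic Endoscope (FRE) in a dynamic stomach, in particular how to exploit contact with the deformable stomach wall to reach targets precisely. The key to the solution is a Contact-Aided Navigation (CAN) strategy based on Deep Reinforcement Learning (DRL). It uses contact force feedback to improve motion stability and navigation precision, and trains a Proximal Policy Optimization (PPO) policy in a physics-based Finite Element Method (FEM) simulation, markedly improving navigation success rate and accuracy in static, dynamic, and unseen disturbed scenarios.

Link: https://arxiv.org/abs/2509.00319
Authors: Chi Kit Ng, Huxin Gao, Tian-Ao Ren, Jiewen Lai, Hongliang Ren
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: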

Abstract:Navigating a flexible robotic endoscope (FRE) through the gastrointestinal tract is critical for surgical diagnosis and treatment. However, navigation in the dynamic stomach is particularly challenging because the FRE must learn to effectively use contact with the deformable stomach walls to reach target locations. To address this, we introduce a deep reinforcement learning (DRL) based Contact-Aided Navigation (CAN) strategy for FREs, leveraging contact force feedback to enhance motion stability and navigation precision. The training environment is established using a physics-based finite element method (FEM) simulation of a deformable stomach. Trained with the Proximal Policy Optimization (PPO) algorithm, our approach achieves high navigation success rates (within 3 mm error between the FRE’s end-effector and target) and significantly outperforms baseline policies. In both static and dynamic stomach environments, the CAN agent achieved a 100% success rate with 1.6 mm average error, and it maintained an 85% success rate in challenging unseen scenarios with stronger external disturbances. These results validate that the DRL-based CAN strategy substantially enhances FRE navigation performance over prior methods.

[AI-166] A Framework for Task and Motion Planning based on Expanding AND/OR Graphs

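[Quick read]: This paper addresses the challenges of robot autonomy in space environments, including high perception and motion uncertainty, strict kinematic constraints, and limited opportunities for human intervention. The authors propose TMP-EAOG, a Task and Motion Planning (TMP) framework based on expanding AND/OR graphs. Its key idea is to encode task-level abstractions in an AND/OR graph that is expanded iteratively during execution, combined with in-the-loop motion planning assessments that verify the feasibility of actions. This gives the system robustness to uncertainty, controlled autonomy that human experts can validate, and the flexibility to respond to unexpected events through alternative paths in the graph.

Link: https://arxiv.org/abs/2509.00317
Authors: Fulvio Mastrogiovanni, Antony Thomas
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted for an oral presentation at ASTRA Conference, 2025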

Abstract:Robot autonomy in space environments presents unique challenges, including high perception and motion uncertainty, strict kinematic constraints, and limited opportunities for human intervention. Therefore, Task and Motion Planning (TMP) may be critical for autonomous servicing, surface operations, or even in-orbit missions, just to name a few, as it models tasks as discrete action sequencing integrated with continuous motion feasibility assessments. In this paper, we introduce a TMP framework based on expanding AND/OR graphs, referred to as TMP-EAOG, and demonstrate its adaptability to different scenarios. TMP-EAOG encodes task-level abstractions within an AND/OR graph, which expands iteratively as the plan is executed, and performs in-the-loop motion planning assessments to ascertain their feasibility. As a consequence, TMP-EAOG is characterised by the desirable properties of (i) robustness to a certain degree of uncertainty, because AND/OR graph expansion can accommodate unpredictable information about the robot environment, (ii) controlled autonomy, since an AND/OR graph can be validated by human experts, and (iii) bounded flexibility, in that unexpected events, including the assessment of infeasible motions, can lead to different courses of action as alternative paths in the AND/OR graph. We evaluate TMP-EAOG on two benchmark domains. We use a simulated mobile manipulator as a proxy for space-grade autonomous robots. Our evaluation shows that TMP-EAOG can deal with a wide range of challenges in the benchmarks.

[AI-167] Continuously Tempered Diffusion Samplers

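[Quick read]: This paper addresses a performance bottleneck in training annealing-based neural samplers: the proposal distribution explores too little. Existing methods build the proposal by combining a partially learned transport with annealed Langevin dynamics, but this is vulnerable to pathologies of the annealing path such as isolated modes, so exploration is insufficient and the final sampler suffers. The key to the solution is continuously tempered diffusion samplers, which borrow exploration techniques from molecular dynamics: a family of distributions across temperatures lowers energy barriers at higher temperatures and thereby drives exploration of the target distribution at the lower temperature of interest, improving sampling efficiency and quality.

Link: https://arxiv.org/abs/2509.00316
Authors: Ezra Erives, Bowen Jing, Peter Holderrieth, Tommi Jaakkola
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: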

Abstract:Annealing-based neural samplers seek to amortize sampling from unnormalized distributions by training neural networks to transport a family of densities interpolating from source to target. A crucial design choice in the training phase of such samplers is the proposal distribution by which locations are generated at which to evaluate the loss. Previous work has obtained such a proposal distribution by combining a partially learned transport with annealed Langevin dynamics. However, isolated modes and other pathological properties of the annealing path imply that such proposals achieve insufficient exploration and thereby lower performance post training. To remedy this, we propose continuously tempered diffusion samplers, which leverage exploration techniques developed in the context of molecular dynamics to improve proposal distributions. Specifically, a family of distributions across different temperatures is introduced to lower energy barriers at higher temperatures and drive exploration at the lower temperature of interest. We empirically validate improved sampler performance driven by extended exploration. Code is available at this https URL.

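[AI-168] TReF-6: Inferring Task-Relevant Frames from a Single Demonstration for One-Shot Skill Generalization

[Quick read]: This paper addresses the difficulty robots have in generalizing a skill from a single (one-shot) demonstration, where the core challenge is the lack of a transferable and interpretable spatial representation. The key to the solution is TReF-6, which infers a simplified, abstracted 6DoF Task-Relevant Frame (TReF) from the trajectory geometry. An influence point serves as the origin of the local frame, which is used to parameterize a Dynamic Movement Primitive (DMP). The framework extends the standard DMP formulation beyond start-goal imitation, grounds the frame semantically with a vision-language model, and localizes it in new scenes with Grounded-SAM, so that skills generalize in a functionally consistent way across object configurations.

Link: https://arxiv.org/abs/2509.00310
Authors: Yuxuan Ding, Shuangge Wang, Tesca Fitzgerald
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: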

Abstract:Robots often struggle to generalize from a single demonstration due to the lack of a transferable and interpretable spatial representation. In this work, we introduce TReF-6, a method that infers a simplified, abstracted 6DoF Task-Relevant Frame from a single trajectory. Our approach identifies an influence point purely from the trajectory geometry to define the origin for a local frame, which serves as a reference for parameterizing a Dynamic Movement Primitive (DMP). This influence point captures the task’s spatial structure, extending the standard DMP formulation beyond start-goal imitation. The inferred frame is semantically grounded via a vision-language model and localized in novel scenes by Grounded-SAM, enabling functionally consistent skill generalization. We validate TReF-6 in simulation and demonstrate robustness to trajectory noise. We further deploy an end-to-end pipeline on real-world manipulation tasks, showing that TReF-6 supports one-shot imitation learning that preserves task intent across diverse object configurations.

[AI-169] Access Paths for Efficient Ordering with Large Language Models

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在数据排序任务中如何高效、准确地实现排序操作的问题。传统数据库的ORDER BY操作难以直接适配LLM的推理特性,因此论文提出将LLM ORDER BY视为一种逻辑抽象,并构建统一评估框架以研究其物理实现方式。解决方案的关键在于三项创新设计:一是基于一致性的批处理大小策略,用于优化基于价值的方法;二是针对成对比较的多数投票机制,显著提升GPT-4o在排序判断中的稳定性;三是适用于LLM的双向外部归并排序算法,在不同数据集和模型上实现了精度与效率的良好平衡。实验表明,排序质量与计算成本呈对数线性关系,为未来基于LLM的数据系统建立可解释的成本模型提供了初步基础。

Link: https://arxiv.org/abs/2509.00303
Authors: Fuheng Zhao,Jiayue Chen,Yiming Pan,Tahseen Rabbani,Divyakant Agrawal,Amr El Abbadi
Affiliation: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:We present the LLM ORDER BY operator as a logical abstraction and study its physical implementations within a unified evaluation framework. Our experiments show that no single approach is universally optimal, with effectiveness depending on query characteristics and data. We introduce three new designs: an agreement-based batch-size policy, a majority voting mechanism for pairwise sorting, and a two-way external merge sort adapted for LLMs. With extensive experiments, our agreement-based procedure is effective at determining batch size for value-based methods, the majority-voting mechanism consistently strengthens pairwise comparisons on GPT-4o, and external merge sort achieves high accuracy-efficiency trade-offs across datasets and models. We further observe a log-linear scaling between compute cost and ordering quality, offering the first step toward principled cost models for LLM powered data systems.
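
Two of the designs named in the abstract, majority voting over pairwise comparisons and a two-way merge of sorted runs, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `compare` is a hypothetical stand-in for an LLM pairwise judgment, and `run_size`, `votes`, and the insertion sort used for the initial runs are all illustrative choices.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
# Hypothetical comparator: returns True if `a` should be ordered before `b`.
# In an LLM-backed system this would wrap a (possibly noisy) model call.
Compare = Callable[[T, T], bool]


def majority_vote(compare: Compare, a: T, b: T, votes: int = 3) -> bool:
    """Query the comparator an odd number of times and return the majority answer."""
    if votes % 2 == 0:
        raise ValueError("votes must be odd")
    yes = sum(1 for _ in range(votes) if compare(a, b))
    return yes > votes // 2


def sort_run(run: Sequence[T], compare: Compare, votes: int) -> List[T]:
    """Insertion-sort one small run using majority-voted comparisons."""
    out: List[T] = []
    for item in run:
        pos = len(out)
        while pos > 0 and majority_vote(compare, item, out[pos - 1], votes):
            pos -= 1
        out.insert(pos, item)
    return out


def merge(left: List[T], right: List[T], compare: Compare, votes: int) -> List[T]:
    """Two-way merge of two sorted runs."""
    merged: List[T] = []
    i = j = 0
    while i < len(left) and j < len(right):
        if majority_vote(compare, left[i], right[j], votes):
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def llm_merge_sort(items: Sequence[T], compare: Compare,
                   run_size: int = 4, votes: int = 3) -> List[T]:
    """Sort fixed-size runs, then merge runs pairwise until one remains."""
    runs = [sort_run(items[k:k + run_size], compare, votes)
            for k in range(0, len(items), run_size)]
    while len(runs) > 1:
        next_runs = []
        for k in range(0, len(runs), 2):
            if k + 1 < len(runs):
                next_runs.append(merge(runs[k], runs[k + 1], compare, votes))
            else:
                next_runs.append(runs[k])  # odd run out, carried to the next pass
        runs = next_runs
    return runs[0] if runs else []
```

With a deterministic comparator such as `lambda a, b: a < b` the function behaves like an ordinary merge sort; the voting only matters when the comparator can disagree with itself across calls, at the price of `votes` times as many comparisons.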

[AI-170] SIGMUS: Semantic Integration for Knowledge Graphs in Multimodal Urban Spaces KDD2025

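Quick read: This paper addresses the fragmentation and difficult integration of urban multimodal sensor data when identifying and reasoning about important incidents (major emergencies, cultural and social events, natural disasters); the core challenge is the lack of an automated mechanism for establishing semantic links between incidents and data from different modalities. The key to the solution is SIGMUS, a system that uses Large Language Models (LLMs) to supply the necessary world knowledge, so that relationships between multimodal data and incidents are identified automatically without human-encoded rules. These relationships are organized into a knowledge graph that systematically models and integrates incidents, observations, and related elements.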

Link: https://arxiv.org/abs/2509.00287
Authors: Brian Wang,Mani Srivastava
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 9 pages, accepted at UrbComp 2025 KDD 2025

Abstract:Modern urban spaces are equipped with an increasingly diverse set of sensors, all producing an abundance of multimodal data. Such multimodal data can be used to identify and reason about important incidents occurring in urban landscapes, such as major emergencies, cultural and social events, as well as natural disasters. However, such data may be fragmented over several sources and difficult to integrate due to the reliance on human-driven reasoning for identifying relationships between the multimodal data corresponding to an incident, as well as understanding the different components which define an incident. Such relationships and components are critical to identifying the causes of such incidents, as well as forecasting the scale and intensity of future incidents as they begin to develop. In this work, we create SIGMUS, a system for Semantic Integration for Knowledge Graphs in Multimodal Urban Spaces. SIGMUS uses Large Language Models (LLMs) to produce the necessary world knowledge for identifying relationships between incidents occurring in urban spaces and data from different modalities, allowing us to organize evidence and observations relevant to an incident without relying on human-encoded rules for relating multimodal sensory data with incidents. This organized knowledge is represented as a knowledge graph, organizing incidents, observations, and much more. We find that our system is able to produce reasonable connections between 5 different data sources (news article text, CCTV images, air quality, weather, and traffic measurements) and relevant incidents occurring at the same time and location.

[AI-171] Intelligent Spectrum Management in Satellite Communications

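Quick read: This survey addresses spectrum scarcity in Satellite Communication (SatCom) networks, which is aggravated by growing demand for high-bandwidth services and the deployment of mega satellite constellations. Traditional static spectrum allocation can no longer meet the need for dynamic, efficient spectrum use, so the paper considers Cognitive Satellite (CogSat) networks built on Cognitive Radio (CR). The core of the solution is intelligent Dynamic Spectrum Management (DSM), which lets satellite systems adjust their spectrum usage in real time as the environment changes, improving spectrum utilization and system performance. The key is to integrate Artificial Intelligence/Machine Learning (AI/ML) methods for intelligent DSM decisions and to assess their operational resilience and robustness, in support of sustainable, scalable global connectivity.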

Link: https://arxiv.org/abs/2509.00286
Authors: Rakshitha De Silva,Shiva Raj Pokhrel,Jonathan Kua,Sithamparanathan Kandeepan
Affiliation: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: 30 pages, under review in IEEE Communications Surveys & Tutorials

Abstract:Satellite Communication (SatCom) networks represent a fundamental pillar in modern global connectivity, facilitating reliable service and extensive coverage across a plethora of applications. The expanding demand for high-bandwidth services and the proliferation of mega satellite constellations highlight the limitations of traditional exclusive satellite spectrum allocation approaches. Cognitive Radio (CR) leading to Cognitive Satellite (CogSat) networks through Dynamic Spectrum Management (DSM), which enables the dynamic adaptability of radio equipment to environmental conditions for optimal performance, presents a promising solution for the emerging spectrum scarcity. In this survey, we explore the adaptation of intelligent DSM methodologies to SatCom, leveraging satellite network integrations. We discuss contributions and hurdles in regulations and standardizations in realizing intelligent DSM in SatCom, and deep dive into DSM techniques, which enable CogSat networks. Furthermore, we extensively evaluate and categorize state-of-the-art Artificial Intelligence (AI)/Machine Learning (ML) methods leveraged for DSM while exploring operational resilience and robustness of such integrations. In addition, performance evaluation metrics critical for adaptive resource management and system optimization in CogSat networks are thoroughly investigated. This survey also identifies open challenges and outlines future research directions in regulatory frameworks, network architectures, and intelligent spectrum management, paving the way for sustainable and scalable SatCom networks for enhanced global connectivity.

[AI-172] SABER: A SQL-Compatible Semantic Document Processing System Based on Extended Relational Algebra

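Quick read: This paper addresses the lack of a unified algebraic foundation in existing Semantic Data Processing Systems (SDPSs), which makes their queries hard to compose, reason about, and optimize. The key to the solution is a new semantic algebra, SABER (Semantic Algebra Based on Extended Relational algebra), which supports logical plan construction, optimization, and formal correctness guarantees for semantic operations. The authors further propose implementing SABER in a SQL-compatible syntax so that it natively supports mixed structured and unstructured data processing, providing a unified interface across SDPSs that allows semantically compatible operators to be combined flexibly and encourages community contributions.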

Link: https://arxiv.org/abs/2509.00277
Authors: Changjae Lee,Zhuoyue Zhao,Jinjun Xiong
Affiliation: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: 6 pages, 2 figures

Abstract:The emergence of large-language models (LLMs) has enabled a new class of semantic data processing systems (SDPSs) to support declarative queries against unstructured documents. Existing SDPSs are, however, lacking a unified algebraic foundation, making their queries difficult to compose, reason, and optimize. We propose a new semantic algebra, SABER (Semantic Algebra Based on Extended Relational algebra), opening the possibility of semantic operations’ logical plan construction, optimization, and formal correctness guarantees. We further propose to implement SABER in a SQL-compatible syntax so that it natively supports mixed structured/unstructured data processing. With SABER, we showcase the feasibility of providing a unified interface for existing SDPSs so that it can effectively mix and match any semantically-compatible operator implementation from any SDPS, greatly enhancing SABER’s applicability for community contributions.

[AI-173] SHERPA: A Model-Driven Framework for Large Language Model Execution

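Quick read: This paper addresses the lack of structured reasoning ability in Large Language Models (LLMs) on complex tasks, especially when domain-specific best practices are absent from the training data and the model struggles with tasks that need fine-grained step control. The key to the solution is the SHERPA framework, which models LLM execution as hierarchical state machines that explicitly embed domain knowledge and human best practices, and controls model behavior at a fine grain through rules or machine learning-based decisions (including LLMs), improving output quality and controllability on complex tasks.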

Link: https://arxiv.org/abs/2509.00272
Authors: Boqi Chen,Kua Chen,José Antonio Hernández López,Gunter Mussbacher,Dániel Varró,Amir Feizpour
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: MODELS 2025

Abstract:Recently, large language models (LLMs) have achieved widespread application across various fields. Despite their impressive capabilities, LLMs suffer from a lack of structured reasoning ability, particularly for complex tasks requiring domain-specific best practices, which are often unavailable in the training data. Although multi-step prompting methods incorporating human best practices, such as chain-of-thought and tree-of-thought, have gained popularity, they lack a general mechanism to control LLM behavior. In this paper, we propose SHERPA, a model-driven framework to improve the LLM performance on complex tasks by explicitly incorporating domain-specific best practices into hierarchical state machines. By structuring the LLM execution processes using state machines, SHERPA enables more fine-grained control over their behavior via rules or decisions driven by machine learning-based approaches, including LLMs. We show that SHERPA is applicable to a wide variety of tasks (specifically, code generation, class name generation, and question answering), replicating previously proposed approaches while further improving the performance. We demonstrate the effectiveness of SHERPA for the aforementioned tasks using various LLMs. Our systematic evaluation compares different state machine configurations against baseline approaches without state machines. Results show that integrating well-designed state machines significantly improves the quality of LLM outputs, and is particularly beneficial for complex tasks with well-established human best practices but lacking data used for training LLMs.
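
The idea of structuring an LLM workflow as a state machine can be sketched as follows. This is a minimal, flat (non-hierarchical) illustration and not SHERPA's API: the `State` class, the `run` loop, and the generate/test/repair workflow are all hypothetical, and the actions are plain Python stand-ins for what would be LLM calls or tool invocations.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Context = Dict[str, object]
Guard = Callable[[Context], bool]


@dataclass
class State:
    name: str
    action: Callable[[Context], None]  # work done on entry (e.g. an LLM call)
    # (guard, target) pairs; the first guard that returns True is followed.
    transitions: List[Tuple[Guard, str]] = field(default_factory=list)


def run(states: Dict[str, State], start: str, ctx: Context,
        max_steps: int = 20) -> List[str]:
    """Execute the machine and return the sequence of visited state names."""
    trace: List[str] = []
    current = start
    for _ in range(max_steps):
        state = states[current]
        trace.append(current)
        state.action(ctx)
        target = next((t for guard, t in state.transitions if guard(ctx)), None)
        if target is None:  # no enabled transition: terminal state
            break
        current = target
    return trace


def build_codegen_machine() -> Dict[str, State]:
    """Hypothetical best practice for code generation: generate, test, repair."""
    def generate(ctx: Context) -> None:
        ctx["attempts"] = 0

    def test(ctx: Context) -> None:
        # Stand-in for running tests: pretend they pass after two repairs.
        ctx["passed"] = ctx["attempts"] >= 2

    def repair(ctx: Context) -> None:
        ctx["attempts"] += 1

    return {
        "generate": State("generate", generate, [(lambda c: True, "test")]),
        "test": State("test", test, [(lambda c: c["passed"], "done"),
                                     (lambda c: not c["passed"], "repair")]),
        "repair": State("repair", repair, [(lambda c: True, "test")]),
        "done": State("done", lambda c: None),
    }
```

Calling `run(build_codegen_machine(), "generate", {})` walks the generate, test, repair loop until the stand-in test passes. The guards here are fixed rules; per the abstract, transitions could equally be decided by a learned model, which in this sketch would just be a different `Guard` callable.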

[AI-174] Instruction-Level Weight Shaping: A Framework for Self-Improving AI Agents

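Quick read: This paper addresses the lack of dynamic adaptation in Large Language Models (LLMs) after pre-training. When knowledge is new or changing, the usual remedies each have drawbacks: Retrieval-Augmented Generation (RAG) adds latency and often fails to integrate facts, prompt engineering is brittle, and fine-tuning is costly. The core solution is Instruction-Level Weight Shaping (ILWS): system instructions are maintained and updated as external, auditable pseudo-parameters. A Reflection Engine analyzes conversation traces, diagnoses reasoning successes and failures, and proposes typed deltas ΔK = (ΔS, ΔU, ΔT) over instructions, user preferences, and tools; a rating mechanism and version control provide automatic repair and rollback. When the edit budget crosses a threshold, the prompt-space improvements are distilled into stable parameter-space updates without downtime. The method makes low-rank instruction-space shaping explicit, preserves governance, and removes per-call retrieval; in enterprise support it raised throughput 2.4–5.0x and cut audited hallucinations by about 80%.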

Link: https://arxiv.org/abs/2509.00251
Authors: Rimom Costa
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages, 1 figure, 2 tables

Abstract:Large language models (LLMs) are fluent but largely static after pre-training; new or shifting knowledge is typically added with retrieval-augmented generation (RAG) or fine-tuning. RAG raises latency and engineering overhead and often fails to integrate facts; prompt engineering is brittle and can conflict with prior knowledge; fine-tuning is costly and risks catastrophic forgetting. We propose Instruction-Level Weight Shaping (ILWS): curated system instructions act as external, auditable pseudo-parameters updated after each session via reflection and user feedback. A Reflection Engine inspects conversation traces, diagnoses reasoning successes and failures, and proposes typed deltas ΔK = (ΔS, ΔU, ΔT) over instructions, user preferences, and tools. Deltas are version-controlled, evaluated with a sliding window of 1-5 star ratings, auto-repaired on first failure, and rolled back on repeated failure. When an edit budget crosses a threshold, the agent compiles a rating-weighted synthetic set and distills matured instruction-space gains into parameters, converting prompt-space improvements into weight-space without downtime. ILWS makes explicit the low-rank shaping induced by context in transformer blocks, preserves governance, and removes per-call retrieval. In enterprise support it increased throughput 2.4-5.0x and cut audited hallucinations by about 80% versus a frozen baseline. In an Adobe Commerce Cloud proof of concept “L0 Support”, it achieved 4-5x more tickets per hour and about 80% lower time per ticket, with autonomous instruction updates and optional tool synthesis. Because ILWS operates at the instruction layer until controlled distillation, it generalizes to dynamic domains (legal, medical, engineering) requiring adaptive reasoning, tool creation, and low-latency deployment.
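
The governance loop described in the abstract (version-controlled deltas, a sliding window of 1-5 star ratings, repair on first failure, rollback on repeated failure) might look roughly like the sketch below. Everything concrete here is an assumption made for illustration: the `InstructionStore` class, the window size, the average-rating threshold, and the `repair` callback are hypothetical and are not taken from the paper.

```python
from collections import deque
from typing import Callable, Deque, List


class InstructionStore:
    """Versioned system instructions gated by a sliding window of star ratings."""

    def __init__(self, base: str, window: int = 3, min_avg: float = 3.0) -> None:
        self.versions: List[str] = [base]            # history; last entry is active
        self.ratings: Deque[int] = deque(maxlen=window)
        self.window = window
        self.min_avg = min_avg                       # assumed failure threshold
        self.failures = 0                            # failures of the active version

    @property
    def active(self) -> str:
        return self.versions[-1]

    def propose(self, instructions: str) -> None:
        """Commit a new instruction version and start rating it from scratch."""
        self.versions.append(instructions)
        self.ratings.clear()
        self.failures = 0

    def rate(self, stars: int, repair: Callable[[str], str]) -> str:
        """Record one rating and return the action taken."""
        if not 1 <= stars <= 5:
            raise ValueError("stars must be between 1 and 5")
        self.ratings.append(stars)
        if len(self.ratings) < self.window:
            return "collecting"
        if sum(self.ratings) / len(self.ratings) >= self.min_avg:
            return "ok"
        # The window average fell below the threshold: count a failure.
        self.failures += 1
        self.ratings.clear()
        if self.failures == 1:
            self.versions[-1] = repair(self.versions[-1])  # first failure: repair
            return "repaired"
        if len(self.versions) > 1:
            self.versions.pop()                            # repeated failure: roll back
        self.failures = 0
        return "rolled_back"
```

Because the instructions live outside the model weights, every change in this sketch is a plain, inspectable edit to a version list, which is the auditability property the abstract emphasizes. The distillation step into weights is not covered here.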

[AI-175] Universal Deep Research: Bring Your Own Model and Strategy

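Quick read: This paper addresses a common limitation of current deep research agents: each is hard-coded to carry out a particular research strategy with a fixed choice of tools, leaving little room for flexibility or customization. The key to the solution is Universal Deep Research (UDR), a system that wraps around any language model and lets users create, edit, and refine their own custom deep research strategies without additional training or fine-tuning, giving them flexible control over the research process.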

Link: https://arxiv.org/abs/2509.00244
Authors: Peter Belcak,Pavlo Molchanov
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep research tools are among the most impactful and most commonly encountered agentic systems today. We observe, however, that each deep research agent introduced so far is hard-coded to carry out a particular research strategy using a fixed choice of tools. We introduce Universal Deep Research (UDR), a generalist agentic system that wraps around any language model and enables the user to create, edit, and refine their own entirely custom deep research strategies without any need for additional training or finetuning. To showcase the generality of our system, we equip UDR with example minimal, expansive, and intensive research strategies, and provide a user interface to facilitate experimentation with the system.

[AI-176] Criteria for Credible AI-assisted Carbon Footprinting Systems: The Cases of Mapping and Lifecycle Modeling

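Quick read: This paper addresses the wide variation in rigor, transparency, and verifiability among AI-assisted carbon footprinting systems, in particular the absence of common standards and effective evaluation methods for AI used to calculate greenhouse gas (GHG) emissions of products and materials. The key to the solution is a set of system-level validation criteria developed in three steps: identifying needs and constraints, drafting criteria, and refining them through pilots. The approach replaces traditional line-item review with system-level evaluation, using benchmark performance, indicators of data quality and uncertainty, and transparent documentation as the core dimensions. This balances credibility with scalability and gives practitioners, auditors, and standards bodies a workable evaluation framework.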

Link: https://arxiv.org/abs/2509.00240
Authors: Shaena Ulissi,Andrew Dumit,P. James Joyce,Krishna Rao,Steven Watson,Sangwon Suh
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 16 pages, 1 figure

Abstract:As organizations face increasing pressure to understand their corporate and products’ carbon footprints, artificial intelligence (AI)-assisted calculation systems for footprinting are proliferating, but with widely varying levels of rigor and transparency. Standards and guidance have not kept pace with the technology; evaluation datasets are nascent; and statistical approaches to uncertainty analysis are not yet practical to apply to scaled systems. We present a set of criteria to validate AI-assisted systems that calculate greenhouse gas (GHG) emissions for products and materials. We implement a three-step approach: (1) Identification of needs and constraints, (2) Draft criteria development and (3) Refinements through pilots. The process identifies three use cases of AI applications: Case 1 focuses on AI-assisted mapping to existing datasets for corporate GHG accounting and product hotspotting, automating repetitive manual tasks while maintaining mapping quality. Case 2 addresses AI systems that generate complete product models for corporate decision-making, which require comprehensive validation of both component tasks and end-to-end performance. We discuss the outlook for Case 3 applications, systems that generate standards-compliant models. We find that credible AI systems can be built and that they should be validated using system-level evaluations rather than line-item review, with metrics such as benchmark performance, indications of data quality and uncertainty, and transparent documentation. This approach may be used as a foundation for practitioners, auditors, and standards bodies to evaluate AI-assisted environmental assessment tools. By establishing evaluation criteria that balance scalability with credibility requirements, our approach contributes to the field’s efforts to develop appropriate standards for AI-assisted carbon footprinting systems.

[AI-177] Embodied AI in Social Spaces: Responsible and Adaptive Robots in Complex Setting - UKAIRS 2025 (Copy)

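Quick read: This paper addresses coordination and adaptation in multi-human multi-robot (MHMR) systems operating in complex, dynamic settings, with the aim of achieving effective collaboration while remaining ethically responsible, emotionally responsive, and aligned with user needs. The key to the solution is to combine co-design, ethical frameworks, and multimodal sensing to build embodied AI systems that are context-aware and emotionally responsive, in support of sustainable, ethical, and human-centred futures.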

Link: https://arxiv.org/abs/2509.00218
Authors: Aleksandra Landowska,Aislinn D Gomez Bergin,Ayodeji O. Abioye,Jayati Deshmukh,Andriana Bouadouki,Maria Wheadon,Athina Georgara,Dominic Price,Tuyen Nguyen,Shuang Ao,Lokesh Singh,Yi Long,Raffaele Miele,Joel E. Fischer,Sarvapali D. Ramchurn
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper introduces and overviews a multidisciplinary project aimed at developing responsible and adaptive multi-human multi-robot (MHMR) systems for complex, dynamic settings. The project integrates co-design, ethical frameworks, and multimodal sensing to create AI-driven robots that are emotionally responsive, context-aware, and aligned with the needs of diverse users. We outline the project’s vision, methodology, and early outcomes, demonstrating how embodied AI can support sustainable, ethical, and human-centred futures.

[AI-178] First Order Model-Based RL through Decoupled Backpropagation

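Quick read: This paper addresses inefficient policy optimization and degraded performance in Model-Based Reinforcement Learning (MBRL) caused by coupling trajectory generation with gradient computation, especially when simulator gradients are unavailable or hard to obtain. The core challenge is that conventional MBRL generates training rollouts from a learned dynamics model, so compounding prediction errors make policy optimization inconsistent and limit performance, whereas using simulator gradients directly is efficient but often ruled out by implementation cost or unavailability. The key to the solution is a decoupled design: trajectories are unrolled forward with the real simulator, while gradients are computed by backpropagating through a learned differentiable model of the dynamics, giving efficient and consistent first-order policy optimization. This improves sample efficiency and training speed, allows a more accurate critic to be learned from simulation rollouts, avoids the unstable behavior seen in other first-order MBRL methods, and retains the generality of standard methods such as PPO.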

Link: https://arxiv.org/abs/2509.00215
Authors: Joseph Amigo,Rooholla Khorrambakht,Elliot Chane-Sane,Nicolas Mansard,Ludovic Righetti
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: CoRL 2025. Project website: this https URL

Abstract:There is growing interest in reinforcement learning (RL) methods that leverage the simulator’s derivatives to improve learning efficiency. While early gradient-based approaches have demonstrated superior performance compared to derivative-free methods, accessing simulator gradients is often impractical due to their implementation cost or unavailability. Model-based RL (MBRL) can approximate these gradients via learned dynamics models, but the solver efficiency suffers from compounding prediction errors during training rollouts, which can degrade policy performance. We propose an approach that decouples trajectory generation from gradient computation: trajectories are unrolled using a simulator, while gradients are computed via backpropagation through a learned differentiable model of the simulator. This hybrid design enables efficient and consistent first-order policy optimization, even when simulator gradients are unavailable, as well as learning a critic from simulation rollouts, which is more accurate. Our method achieves the sample efficiency and speed of specialized optimizers such as SHAC, while maintaining the generality of standard approaches like PPO and avoiding ill behaviors observed in other first-order MBRL methods. We empirically validate our algorithm on benchmark control tasks and demonstrate its effectiveness on a real Go2 quadruped robot, across both quadrupedal and bipedal locomotion tasks.
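
The decoupling of trajectory generation from gradient computation can be illustrated on a toy scalar system. This is only a sketch under strong assumptions and not the paper's algorithm: the policy is a hypothetical linear feedback `a = -k * s`, the cost is the sum of squared states, and the "learned" model is a fixed linear map `s' = A*s + B*a` whose constants `A` and `B` stand in for a trained network. States come from the black-box simulator; only the Jacobians come from the model.

```python
from typing import Callable, List

Simulator = Callable[[float, float], float]  # (state, action) -> next state


def rollout(sim: Simulator, k: float, s0: float, horizon: int) -> List[float]:
    """Unroll the policy a = -k * s in the (non-differentiable) simulator."""
    states = [s0]
    for _ in range(horizon):
        s = states[-1]
        states.append(sim(s, -k * s))
    return states


def total_cost(states: List[float]) -> float:
    """Quadratic cost on every state after the initial one."""
    return sum(s * s for s in states[1:])


def decoupled_gradient(states: List[float], k: float, A: float, B: float) -> float:
    """d(cost)/dk by backpropagation through time.

    The states are the simulator's; the derivatives are those of the assumed
    model s' = A*s + B*a with a = -k*s, evaluated along the simulated states.
    """
    grad = 0.0
    lam = 0.0  # d(cost)/d(s_{t+2}), zero beyond the horizon
    for t in reversed(range(len(states) - 1)):
        # d(cost)/d(s_{t+1}): direct cost term plus the effect on later states.
        lam = 2.0 * states[t + 1] + lam * (A - B * k)
        # s_{t+1} depends on k through the action: d(s_{t+1})/dk = B * (-s_t).
        grad += lam * B * (-states[t])
    return grad
```

When the simulator happens to match the model exactly, this gradient agrees with a finite-difference estimate of the simulated cost. When they differ, the forward states stay exact and only the derivatives are approximate, which is the property the abstract attributes to the hybrid design.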

[AI-179] HiVA: Self-organized Hierarchical Variable Agent via Goal-driven Semantic-Topological Evolution

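Quick read: This paper addresses two core problems of autonomous agents: fixed workflows cannot adapt to environmental changes without manual reconfiguration, and flexible reactive loops fail to distill reasoning progress into transferable structures. The key to the solution is the Hierarchical Variable Agent (HiVA) framework, which models agentic workflows as self-organized graphs and introduces the Semantic-Topological Evolution (STEV) algorithm, which optimizes in a hybrid semantic-topological space using textual gradients as discrete-domain surrogates for backpropagation. Multi-armed-bandit-driven forward routing, diagnostic gradient generation from environmental feedback, and coordinated updates of individual semantics and topology together achieve collective optimization in unknown environments, yielding 5–10% higher task accuracy and better resource efficiency on dialogue, coding, long-context QA, mathematical reasoning, and agentic benchmarks.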

Link: https://arxiv.org/abs/2509.00189
Authors: Jinzhou Tang,Jusheng Zhang,Qinhan Lv,Sidi Liu,Jing Yang,Chengpei Tang,Keze Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:Autonomous agents play a crucial role in advancing Artificial General Intelligence, enabling problem decomposition and tool orchestration through Large Language Models (LLMs). However, existing paradigms face a critical trade-off. On one hand, reusable fixed workflows require manual reconfiguration upon environmental changes; on the other hand, flexible reactive loops fail to distill reasoning progress into transferable structures. We introduce Hierarchical Variable Agent (HiVA), a novel framework modeling agentic workflows as self-organized graphs with the Semantic-Topological Evolution (STEV) algorithm, which optimizes hybrid semantic-topological spaces using textual gradients as discrete-domain surrogates for backpropagation. The iterative process comprises Multi-Armed Bandit-infused forward routing, diagnostic gradient generation from environmental feedback, and coordinated updates that co-evolve individual semantics and topology for collective optimization in unknown environments. Experiments on dialogue, coding, Long-context QA, mathematical, and agentic benchmarks demonstrate improvements of 5-10% in task accuracy and enhanced resource efficiency over existing baselines, establishing HiVA’s effectiveness in autonomous task execution.
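
The abstract does not say which bandit rule drives the forward routing, so the sketch below uses the standard UCB1 rule purely as an assumed example of how a multi-armed bandit could choose which downstream agent receives a task. The `UCB1Router` class and the agent names are hypothetical.

```python
import math
from typing import Dict, Iterable


class UCB1Router:
    """Pick the next agent with the UCB1 rule: mean reward plus exploration bonus."""

    def __init__(self, arms: Iterable[str]) -> None:
        self.counts: Dict[str, int] = {arm: 0 for arm in arms}
        self.totals: Dict[str, float] = {arm: 0.0 for arm in self.counts}

    def select(self) -> str:
        # Try every arm once before applying the UCB formula.
        for arm, n in self.counts.items():
            if n == 0:
                return arm
        t = sum(self.counts.values())

        def score(arm: str) -> float:
            mean = self.totals[arm] / self.counts[arm]
            bonus = math.sqrt(2.0 * math.log(t) / self.counts[arm])
            return mean + bonus

        return max(self.counts, key=score)

    def update(self, arm: str, reward: float) -> None:
        """Record the feedback (assumed to lie in [0, 1]) for a routed task."""
        self.counts[arm] += 1
        self.totals[arm] += reward
```

In a routing loop, `select()` would choose the successor agent and `update()` would feed back a score derived from environmental feedback, so agents that help more receive more traffic while the others are still sampled occasionally.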

[AI-180] Generalizable Audio Spoofing Detection using Non-Semantic Representations

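Quick read: This paper addresses the poor generalization of audio deepfake (spoofing) detection models in real-world settings: existing methods perform acceptably in-domain but degrade sharply on out-of-domain or real-world data. The key to the solution is to use non-semantic universal audio representations, extracted with the TRILL and TRILLsson models, to build a more robust spoofing detector. Experiments show comparable performance on the in-domain test set and significantly better performance on out-of-domain test sets (notably public-domain data) than state-of-the-art methods based on hand-crafted features, semantic embeddings, and end-to-end architectures, supporting the central role of non-semantic features in generalization.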

Link: https://arxiv.org/abs/2509.00186
Authors: Arnab Das,Yassine El Kheir,Carlos Franzreb,Tim Herzig,Tim Polzehl,Sebastian Möller
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Rapid advancements in generative modeling have made synthetic audio generation easy, making speech-based services vulnerable to spoofing attacks. Consequently, there is a dire need for robust countermeasures more than ever. Existing solutions for deepfake detection are often criticized for lacking generalizability and fail drastically when applied to real-world data. This study proposes a novel method for generalizable spoofing detection leveraging non-semantic universal audio representations. Extensive experiments have been performed to find suitable non-semantic features using TRILL and TRILLsson models. The results indicate that the proposed method achieves comparable performance on the in-domain test set while significantly outperforming state-of-the-art approaches on out-of-domain test sets. Notably, it demonstrates superior generalization on public-domain data, surpassing methods based on hand-crafted features, semantic embeddings, and end-to-end architectures.

[AI-181] Virtual Group Knowledge and Group Belief in Topological Evidence Models (Extended Version)

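Quick read: This paper addresses the logical formalization of group knowledge and group belief in multi-agent evidence models, in particular how to extend the topological semantics of evidence-based belief and fallible knowledge from individuals to groups. The solution has three parts: first, a complete axiomatization and decidability proof for the logic of "hard" and "soft" group evidence; second, the same for an especially interesting fragment, the logic of group knowledge and group belief; and third, an extension of these languages with dynamic evidence-sharing operators, with a proof that the resulting dynamic logics are co-expressive with their static bases. Together these give a complete and decidable logical framework, with dynamics, for modeling group cognition.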

Link: https://arxiv.org/abs/2509.00184
Authors: Alexandru Baltag,Malvin Gattinger,Djanira Gomes
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
Comments:

Abstract:We study notions of (virtual) group knowledge and group belief within multi-agent evidence models, obtained by extending the topological semantics of evidence-based belief and fallible knowledge from individuals to groups. We completely axiomatize and show the decidability of the logic of (“hard” and “soft”) group evidence, and do the same for an especially interesting fragment of it: the logic of group knowledge and group belief. We also extend these languages with dynamic evidence-sharing operators, and completely axiomatize the corresponding logics, showing that they are co-expressive with their static bases.

[AI-182] Principled Approximation Methods for Efficient and Scalable Deep Learning

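Quick read: This thesis addresses the sharp growth in compute and energy demands that comes with larger deep learning models and hinders their deployment and adoption. The core challenge is to improve training and inference efficiency while preserving model performance, particularly in settings with discrete constraints and non-differentiability. The solution comprises three efficient methods based on principled approximation: first, discrete problems such as pruning and quantization are reformulated as continuous and differentiable, so compression schemes can be trained by gradient descent jointly with the model parameters, yielding fine-grained sparsity and precision configurations; second, a neural architecture search algorithm uses parameter sharing across layers to explore implicitly recurrent architectures efficiently; third, an adaptive optimizer is proposed, after revisiting the theoretical properties of widely used methods, that allows quick hyperparameter tuning. On image classification, language modeling, and generative modeling tasks, the methods significantly improve efficiency while maintaining or even improving model performance.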

Link: https://arxiv.org/abs/2509.00174
Authors: Pedro Savarese
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: PhD thesis

Abstract:Recent progress in deep learning has been driven by increasingly larger models. However, their computational and energy demands have grown proportionally, creating significant barriers to their deployment and to a wider adoption of deep learning technologies. This thesis investigates principled approximation methods for improving the efficiency of deep learning systems, with a particular focus on settings that involve discrete constraints and non-differentiability. We study three main approaches toward improved efficiency: architecture design, model compression, and optimization. For model compression, we propose novel approximations for pruning and quantization that frame the underlying discrete problem as continuous and differentiable, enabling gradient-based training of compression schemes alongside the model’s parameters. These approximations allow for fine-grained sparsity and precision configurations, leading to highly compact models without significant fine-tuning. In the context of architecture design, we design an algorithm for neural architecture search that leverages parameter sharing across layers to efficiently explore implicitly recurrent architectures. Finally, we study adaptive optimization, revisiting theoretical properties of widely used methods and proposing an adaptive optimizer that allows for quick hyperparameter tuning. Our contributions center on tackling computationally hard problems via scalable and principled approximations. Experimental results on image classification, language modeling, and generative modeling tasks show that the proposed methods provide significant improvements in terms of training and inference efficiency while maintaining, or even improving, the model’s performance. 

[AI-183] Pilot Study on Generative AI and Critical Thinking in Higher Education Classrooms

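Quick read: Generative AI (GAI) is spreading rapidly in education, but its role in fostering critical thinking has not been studied systematically; in particular, it is unclear how students critically evaluate the accuracy and appropriateness of GAI outputs. The key to the solution is to design structured learning activities that guide students to analyze, critique, and revise GAI-generated content. By running such activities in introductory Computational and Data Science courses, the study offers initial insights into students' ability to evaluate GAI outputs critically and lays the groundwork for more comprehensive studies.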

Link: https://arxiv.org/abs/2509.00167
Authors: W. F. Lamberti,S. R. Lawrence,D. White,S. Kim,S. Abdullah
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Applications (stat.AP)
Comments:

Abstract:Generative AI (GAI) tools have seen rapid adoption in educational settings, yet their role in fostering critical thinking remains underexplored. While previous studies have examined GAI as a tutor for specific lessons or as a tool for completing assignments, few have addressed how students critically evaluate the accuracy and appropriateness of GAI-generated responses. This pilot study investigates students’ ability to apply structured critical thinking when assessing Generative AI outputs in introductory Computational and Data Science courses. Given that GAI tools often produce contextually flawed or factually incorrect answers, we designed learning activities that require students to analyze, critique, and revise AI-generated solutions. Our findings offer initial insights into students’ ability to engage critically with GAI content and lay the groundwork for more comprehensive studies in future semesters.

[AI-184] Scaling Legal AI: Benchmarking Mamba and Transformers for Statutory Classification and Case Law Retrieval

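Quick read: This paper addresses the scalability of legal AI systems over large bodies of statutes and judicial decisions; Transformer-based models in particular are inefficient and struggle with very long contexts because attention has quadratic cost. The key to the solution is Mamba, a State-Space Model (SSM) with linear-time selective mechanisms, which processes much longer legal documents while maintaining or surpassing Transformer performance, enabling more efficient and scalable statutory classification and case law retrieval.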

Link: https://arxiv.org/abs/2509.00141
Authors: Anuraj Maurya
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The rapid growth of statutory corpora and judicial decisions requires scalable legal AI systems capable of classification and retrieval over extremely long contexts. Transformer-based architectures (e.g., Longformer, DeBERTa) dominate current legal NLP benchmarks but struggle with quadratic attention costs, limiting efficiency and scalability. In this work, we present the first comprehensive benchmarking of Mamba, a state-space model (SSM) with linear-time selective mechanisms, against leading transformer models for statutory classification and case law retrieval. We evaluate models on open-source legal corpora including LexGLUE, EUR-Lex, and ILDC, covering statutory tagging, judicial outcome prediction, and case retrieval tasks. Metrics include accuracy, recall at k, mean reciprocal rank (MRR), and normalized discounted cumulative gain (nDCG), alongside throughput measured in tokens per second and maximum context length. Results show that Mamba’s linear scaling enables processing of legal documents several times longer than transformers, while maintaining or surpassing retrieval and classification performance. This study introduces a new legal NLP benchmark suite for long-context modeling, along with open-source code and datasets to support reproducibility. Our findings highlight trade-offs between state-space models and transformers, providing guidance for deploying scalable legal AI in statutory analysis, judicial decision support, and policy research.
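
The retrieval metrics listed in the abstract (recall at k, MRR, nDCG) have standard definitions, sketched below for a single query. The function names and the document identifiers in the usage note are illustrative, and the nDCG shown is the common linear-gain variant with a log2 discount; the paper may use a different variant.

```python
import math
from typing import Dict, Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Sequence[str]],
                         relevants: Sequence[Set[str]]) -> float:
    """MRR: the reciprocal rank averaged over queries."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in zip(runs, relevants)) / len(runs)


def ndcg_at_k(ranked: Sequence[str], gains: Dict[str, float], k: int) -> float:
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(gains.get(doc, 0.0) / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```

For a ranking `["d3", "d1", "d7", "d2"]` with relevant set `{"d1", "d2"}`, recall at 2 is 0.5 because only one of the two relevant cases is in the top two, and the reciprocal rank is 0.5 because the first relevant case sits at rank 2.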

[AI-185] LLM-based Triplet Extraction for Automated Ontology Generation in Software Engineering Standards

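Quick read: This paper addresses Automated Ontology Generation (AOG) for Software Engineering Standards (SES), whose text is long, unstructured, noisy, and full of domain-specific terms. The key to the solution is an approach to Relation Triple Extraction (RTE) assisted by an open-source Large Language Model (LLM), embedded in a complete AOG workflow: document segmentation, candidate term mining, LLM-based relation inference, term normalization, and cross-section alignment. Rather than relying solely on prompt engineering, the method uses the LLM as an aid to ontology construction; experiments indicate that the results are comparable, and potentially superior, to the OpenIE method of triple extraction.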

Link: https://arxiv.org/abs/2509.00140
Authors: Songhui Yue
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Ontologies have supported knowledge representation and whitebox reasoning for decades; thus, the automated ontology generation (AOG) plays a crucial role in scaling their use. Software engineering standards (SES) consist of long, unstructured text (with high noise) and paragraphs with domain-specific terms. In this setting, relation triple extraction (RTE), together with term extraction, constitutes the first stage toward AOG. This work proposes an open-source large language model (LLM)-assisted approach to RTE for SES. Instead of solely relying on prompt-engineering-based methods, this study promotes the use of LLMs as an aid in constructing ontologies and explores an effective AOG workflow that includes document segmentation, candidate term mining, LLM-based relation inference, term normalization, and cross-section alignment. Golden-standard benchmarks at three granularities are constructed and used to evaluate the ontology generated from the study. The results show that it is comparable and potentially superior to the OpenIE method of triple extraction.

[AI-186] Optimizing Health Coverage in Ethiopia: A Learning-augmented Approach and Persistent Proportionality Under an Online Budget

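Quick read: This paper addresses how to prioritize health post strengthening in Ethiopia's pursuit of Universal Health Coverage (UHC), given limited budgets and competing priorities. The core challenge is to allocate resources under budget uncertainty so as to maximize population coverage while meeting region-specific proportionality targets at every time step. The key to the solution is the Health Access Resource Planner (HARP), a decision-support optimization framework built on a sequential facility planning model, with two algorithms: (i) a learning-augmented approach that improves on expert recommendations at any single step, and (ii) a greedy algorithm for multi-step planning, both with strong worst-case approximation guarantees. In collaboration with the Ethiopian Public Health Institute and the Ministry of Health, the method was validated empirically on three regions across various planning scenarios.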

Link: https://arxiv.org/abs/2509.00135
Authors: Davin Choo,Yohai Trabelsi,Fentabil Getnet,Samson Warkaye Lamma,Wondesen Nigatu,Kasahun Sime,Lisa Matay,Milind Tambe,Stéphane Verguet
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:As part of nationwide efforts aligned with the United Nations’ Sustainable Development Goal 3 on Universal Health Coverage, Ethiopia’s Ministry of Health is strengthening health posts to expand access to essential healthcare services. However, only a fraction of this health system strengthening effort can be implemented each year due to limited budgets and other competing priorities, thus the need for an optimization framework to guide prioritization across the regions of Ethiopia. In this paper, we develop a tool, Health Access Resource Planner (HARP), based on a principled decision-support optimization framework for sequential facility planning that aims to maximize population coverage under budget uncertainty while satisfying region-specific proportionality targets at every time step. We then propose two algorithms: (i) a learning-augmented approach that improves upon expert recommendations at any single-step; and (ii) a greedy algorithm for multi-step planning, both with strong worst-case approximation estimation. In collaboration with the Ethiopian Public Health Institute and Ministry of Health, we demonstrated the empirical efficacy of our method on three regions across various planning scenarios.
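The greedy multi-step idea in the abstract above can be illustrated with a budgeted coverage sketch. This is not HARP itself: the function name `greedy_plan`, the per-step budgets, and the toy data are assumptions made for illustration, and the region-specific proportionality targets are left out. At each time step, the sketch repeatedly picks the affordable facility with the best marginal population gain per unit cost.

```python
from typing import Dict, List, Optional, Set


def greedy_plan(
    covers: Dict[str, Set[str]],   # facility -> villages it would cover (hypothetical data)
    population: Dict[str, int],    # village -> population
    cost: Dict[str, float],        # facility -> upgrade cost (assumed > 0)
    budgets: List[float],          # budget revealed at each time step
) -> List[List[str]]:
    """Greedy sequential selection by marginal covered population per unit cost."""
    covered: Set[str] = set()
    remaining = set(covers)
    plan: List[List[str]] = []
    for budget in budgets:
        chosen: List[str] = []
        while True:
            best: Optional[str] = None
            best_ratio = 0.0
            for f in sorted(remaining):  # sorted for deterministic tie-breaking
                if cost[f] > budget:
                    continue
                gain = sum(population[v] for v in covers[f] - covered)
                ratio = gain / cost[f]
                if ratio > best_ratio:
                    best, best_ratio = f, ratio
            if best is None:  # nothing affordable adds coverage
                break
            chosen.append(best)
            covered |= covers[best]
            remaining.remove(best)
            budget -= cost[best]
        plan.append(chosen)
    return plan


covers = {"A": {"v1", "v2"}, "B": {"v2", "v3"}, "C": {"v4"}}
population = {"v1": 100, "v2": 300, "v3": 50, "v4": 200}
cost = {"A": 2.0, "B": 2.0, "C": 1.0}
plan = greedy_plan(covers, population, cost, budgets=[3.0, 2.0])
print(plan)
```

A planner that must also respect proportionality targets would need an extra feasibility check inside the inner loop; that part is deliberately omitted here.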

[AI-187] CoComposer: LLM Multi-agent Collaborative Music Composition

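【Quick Read】: This paper addresses the limits of existing AI music composition tools in generation duration, musical quality, and controllability. The key to the solution is CoComposer, a multi-agent system of five collaborating agents, each assigned a task drawn from the traditional music composition workflow. Evaluated with the AudioBox-Aesthetics system on four compositional criteria, CoComposer outperforms existing multi-agent LLM-based systems in music quality and outperforms a single-agent system in production complexity. Compared with the non-LLM MusicLM, it offers better interpretability and editability, although MusicLM still produces better music.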

Link: https://arxiv.org/abs/2509.00132
Authors: Peiwen Xing,Aske Plaat,Niki van Stein
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments:

Abstract: Existing AI music composition tools are limited in generation duration, musical quality, and controllability. We introduce CoComposer, a multi-agent system that consists of five collaborating agents, each with a task based on the traditional music composition workflow. Using the AudioBox-Aesthetics system, we experimentally evaluate CoComposer on four compositional criteria. We test with three LLMs (GPT-4o, DeepSeek-V3-0324, Gemini-2.5-Flash), and find that CoComposer (1) outperforms existing multi-agent LLM-based systems in music quality and (2) outperforms a single-agent system in production complexity. Compared to the non-LLM MusicLM, CoComposer has better interpretability and editability, although MusicLM still produces better music.

[AI-188] Know When to Explore: Difficulty-Aware Certainty as a Guide for LLM Reinforcement Learning

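【Quick Read】: This paper addresses a limitation of Reinforcement Learning with Verifiable Feedback (RLVF) for improving the reasoning ability of Large Language Models (LLMs): sparse rewards that only indicate whether the final answer is correct give no fine-grained guidance on the reasoning process, so the model cannot distinguish high-quality solutions from inefficient ones or learn effectively from different types of failure. The key to the solution is a new RL algorithm, Difficulty Aware Certainty guided Exploration (DACE). It exploits the correlation between an LLM's self-certainty and both task difficulty and solution quality to adjust the exploration-exploitation trade-off online: task difficulty is estimated from the policy's success rate and then used to modulate an intrinsic reward, penalizing high certainty on difficult tasks to encourage exploration and rewarding high certainty on easy tasks to improve learning efficiency. On mathematical reasoning benchmarks such as AIME and MATH, DACE significantly outperforms strong baselines and is more robust when test-time compute is scaled, which supports the claim that the adaptive approach fosters exploration without sacrificing precision.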

Link: https://arxiv.org/abs/2509.00125
Authors: Ang Li,Zhihang Yuan,Yang Zhang,Shouda Liu,Yisen Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract: Reinforcement Learning with Verifiable Feedback (RLVF) has become a key technique for enhancing the reasoning abilities of Large Language Models (LLMs). However, its reliance on sparse, outcome-based rewards, which only indicate if a final answer is correct or not, fails to provide granular guidance on the reasoning process itself. This limitation hinders efficient learning, as the model cannot distinguish between high-quality and inefficient solutions, nor can it learn effectively from different types of failures. To address this, we observe that an LLM's self-certainty often correlates with task difficulty and solution quality. We introduce Difficulty Aware Certainty guided Exploration (DACE), a novel RL algorithm that leverages this insight to dynamically balance the exploration-exploitation trade-off. DACE assesses task difficulty online based on the policy's success rate. It then uses this signal to modulate an intrinsic reward: for difficult tasks where the model is struggling, DACE encourages exploration by penalizing high certainty; for easier tasks, it encourages learning efficiency by rewarding high certainty. Experiments on challenging mathematical reasoning benchmarks (AIME, MATH) show that DACE significantly outperforms strong baselines. The DACE-trained models not only achieve higher accuracy but also demonstrate more robust performance when scaling test-time compute, validating that our adaptive approach fosters effective exploration without sacrificing precision.
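The difficulty-dependent intrinsic reward described above can be sketched in a few lines. The exact functional form used by DACE is not given in the abstract, so everything here is an assumption: difficulty is taken as one minus the success rate, and `beta`, `threshold`, and the linear certainty term are illustrative choices.

```python
def difficulty(success_rate: float) -> float:
    """Online difficulty estimate: 1 minus the policy's success rate on the task."""
    return 1.0 - success_rate


def shaped_reward(correct: bool, certainty: float, success_rate: float,
                  beta: float = 0.1, threshold: float = 0.5) -> float:
    """Outcome reward plus a certainty term whose sign depends on difficulty.

    beta and threshold are illustrative hyperparameters, not values from the paper.
    """
    outcome = 1.0 if correct else 0.0
    if difficulty(success_rate) > threshold:
        intrinsic = -beta * certainty  # hard task: penalize high certainty (explore)
    else:
        intrinsic = beta * certainty   # easy task: reward high certainty (exploit)
    return outcome + intrinsic


# Hard task (20% success): a confident wrong answer is pushed below zero.
print(shaped_reward(correct=False, certainty=0.9, success_rate=0.2))
# Easy task (90% success): a confident correct answer gets a small bonus.
print(shaped_reward(correct=True, certainty=0.9, success_rate=0.9))
```

The sign flip at the threshold is the whole mechanism: the same certainty signal discourages premature commitment on hard tasks and reinforces it on easy ones.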

[AI-189] A Whole New World: Creating a Parallel-Poisoned Web Only AI-Agents Can See

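【Quick Read】: This paper examines a new class of stealthy attacks on autonomous web-browsing agents that arises from their identifiable digital fingerprints. Agents powered by Large Language Models (LLMs) show consistent, distinguishable traits, including browser attributes, automation framework signatures, and network characteristics, so a malicious website can single them out through fingerprinting. The attack relies on website cloaking: when a request is identified as coming from an AI agent, the site dynamically serves a cloaked version of the page that looks benign to human users but embeds hidden malicious instructions, such as indirect prompt injections, for the agent. This lets an adversary hijack agent behavior while staying invisible to human users and conventional security crawlers, leading to data exfiltration, malware execution, or misinformation propagation.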

Link: https://arxiv.org/abs/2509.00124
Authors: Shaked Zychlinski
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 10 pages, 1 figure

Abstract:This paper introduces a novel attack vector that leverages website cloaking techniques to compromise autonomous web-browsing agents powered by Large Language Models (LLMs). As these agents become more prevalent, their unique and often homogenous digital fingerprints - comprising browser attributes, automation framework signatures, and network characteristics - create a new, distinguishable class of web traffic. The attack exploits this fingerprintability. A malicious website can identify an incoming request as originating from an AI agent and dynamically serve a different, “cloaked” version of its content. While human users see a benign webpage, the agent is presented with a visually identical page embedded with hidden, malicious instructions, such as indirect prompt injections. This mechanism allows adversaries to hijack agent behavior, leading to data exfiltration, malware execution, or misinformation propagation, all while remaining completely invisible to human users and conventional security crawlers. This work formalizes the threat model, details the mechanics of agent fingerprinting and cloaking, and discusses the profound security implications for the future of agentic AI, highlighting the urgent need for robust defenses against this stealthy and scalable attack.

[AI-190] Embodied AI: Emerging Risks and Opportunities for Policy Action

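【Quick Read】: This paper addresses the fact that current policy frameworks largely overlook the distinctive risks of embodied AI (EAI), including physical harm from malicious use, mass surveillance, and economic and societal disruption. Existing rules, such as international standards for industrial robots or statutes governing autonomous vehicles, do not cover the full range of these risks. The paper makes three contributions: first, a foundational taxonomy of physical, informational, economic, and social EAI risks; second, an analysis of how policies in the US, EU, and UK address these risks and where they leave gaps; and third, concrete policy recommendations, including mandatory testing and certification for EAI systems, clarified liability frameworks, and forward-looking strategies for managing economic and societal impacts.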

Link: https://arxiv.org/abs/2509.00117
Authors: Jared Perlo,Alexander Robey,Fazl Barez,Luciano Floridi,Jakob Mökander
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:The field of embodied AI (EAI) is rapidly advancing. Unlike virtual AI, EAI can exist in, learn from, reason about, and act in the physical world. Given recent innovations in large language and multimodal models, along with increasingly advanced and responsive hardware, EAI systems are rapidly growing in capabilities and operational domains. These advances present significant risks, including physical harm from malicious use, mass surveillance, and economic and societal disruption. However, these risks have been severely overlooked by policymakers. Existing policies, such as international standards for industrial robots or statutes governing autonomous vehicles, are insufficient to address the full range of concerns. While lawmakers are increasingly focused on AI, there is now an urgent need to extend and adapt existing frameworks to account for the unique risks of EAI. To help bridge this gap, this paper makes three contributions: first, we provide a foundational taxonomy of key physical, informational, economic, and social EAI risks. Secondly, we analyze policies in the US, EU, and UK to identify how existing frameworks address these risks and where these policies leave critical gaps. We conclude by offering concrete policy recommendations to address the coming wave of EAI innovation, including mandatory testing and certification for EAI systems, clarified liability frameworks, and forward-looking strategies to manage and prepare for transformative economic and societal impacts.

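[AI-191] The Application of Virtual Environments and Artificial Intelligence in Higher Education: Experimental Findings in Philosophy Teaching

【Quick Read】: This paper addresses the limited effectiveness of traditional higher-education teaching for abstract concepts, and in particular how to raise student motivation and engagement in disciplines such as philosophy that depend on abstract reasoning and conceptual understanding. The key to the solution is combining a virtual environment (Walter's Cube technology) with a trained AI mediator to build an immersive, interactive digital teaching setting, enhanced with 3D visual presentation and AI support. In the study, 80 percent of participants achieved good or excellent final exam grades and the majority rated the virtual material as highly effective; qualitative feedback pointed to increased motivation and deeper engagement.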

Link: https://arxiv.org/abs/2509.00110
Authors: Adel Vehrer,Zsolt Palfalusi
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:This study explores how virtual environments and artificial intelligence can enhance university students’ learning experiences, with particular attention to the digital preferences of Generation Z. An experiment was conducted at the Faculty of Pedagogy, Humanities, and Social Sciences at University of Gyor, where Walter’s Cube technology and a trained AI mediator were integrated into the instruction of ten philosophical topics. The curriculum was aligned with the official syllabus and enriched with visual content, quotations, and explanatory texts related to iconic figures in philosophy. A total of 77 first-year undergraduate students from full-time humanities and social sciences programs participated in the study. Following their end-of-semester offline written examination, students voluntarily completed a paper-based, anonymous ten-question test and provided feedback on the method’s effectiveness. No sensitive personal data were collected, and the research was conducted with formal approval from the Faculty Dean. Descriptive statistics and inferential tests were applied to evaluate the impact of the virtual environment and AI mediation on learning outcomes. Results indicate that 80 percent of participants achieved good or excellent final exam grades, and the majority rated the virtual material as highly effective. Qualitative feedback emphasized increased motivation and deeper engagement, attributed to the immersive 3D presentation and interactive AI support. This research contributes to the advancement of digital pedagogy and suggests new directions for applying virtual and AI-based methods in higher education, particularly in disciplines where abstract reasoning and conceptual understanding are central.

[AI-192] Pre-trained knowledge elevates large language models beyond traditional chemical reaction optimizers

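【Quick Read】: This paper addresses the inefficiency of traditional optimizers such as Bayesian optimization (BO) in experimental chemistry when the parameter space is complex, high-dimensional, and categorical. The core challenge is that when high-performing conditions are scarce (for example, 5% of the space), traditional methods struggle to explore effectively and converge on the best conditions. The key to the solution is LLM-guided optimization (LLM-GO), which uses pre-trained large language models (LLMs) as optimizers driven by prior knowledge. The study shows that frontier LLMs match or exceed BO on five single-objective datasets, with advantages growing as parameter complexity increases, while BO remains superior only for explicit multi-objective trade-offs. LLMs also maintain higher exploration entropy than BO while performing better, most notably in solution-scarce spaces, which suggests that pre-trained domain knowledge enables more effective navigation of chemical parameter space rather than replacing structured exploration strategies.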

Link: https://arxiv.org/abs/2509.00103
Authors: Robert MacKnight,Jose Emilio Regio,Jeffrey G. Ethier,Luke A. Baldwin,Gabe Gomes
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
Comments: 19 pages, 7 figures

Abstract:Modern optimization in experimental chemistry employs algorithmic search through black-box parameter spaces. Here we demonstrate that pre-trained knowledge in large language models (LLMs) fundamentally changes this paradigm. Using six fully enumerated categorical reaction datasets (768 - 5,684 experiments), we benchmark LLM-guided optimization (LLM-GO) against Bayesian optimization (BO) and random sampling. Frontier LLMs consistently match or exceed BO performance across five single-objective datasets, with advantages growing as parameter complexity increases and high-performing conditions become scarce (5% of space). BO retains superiority only for explicit multi-objective trade-offs. To understand these contrasting behaviors, we introduce a topology-agnostic information theory framework quantifying sampling diversity throughout optimization campaigns. This analysis reveals that LLMs maintain systematically higher exploration entropy than BO across all datasets while achieving superior performance, with advantages most pronounced in solution-scarce parameter spaces where high-entropy exploration typically fails - suggesting that pre-trained domain knowledge enables more effective navigation of chemical parameter space rather than replacing structured exploration strategies. To enable transparent benchmarking and community validation, we release Iron Mind (this https URL), a no-code platform for side-by-side evaluation of human, algorithmic, and LLM optimization campaigns with public leaderboards and complete trajectories. Our findings establish that LLM-GO excels precisely where traditional methods struggle: complex categorical spaces requiring domain understanding rather than mathematical optimization.
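The notion of exploration entropy mentioned above can be made concrete with a small sketch. The paper's framework is not specified in the abstract, so this simply assumes the Shannon entropy of the empirical distribution of categorical choices made during a campaign; the ligand labels and campaign data are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def shannon_entropy(samples: Sequence[Hashable]) -> float:
    """Shannon entropy (bits) of the empirical distribution of sampled choices."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical campaigns: which ligand each of 8 experiments selected.
narrow = ["L1", "L1", "L1", "L1", "L1", "L1", "L2", "L1"]
broad = ["L1", "L2", "L3", "L4", "L1", "L2", "L3", "L4"]
print(shannon_entropy(narrow), shannon_entropy(broad))
```

A campaign that keeps revisiting one option scores close to zero, while one that spreads evenly over four options reaches the maximum of two bits for that parameter.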

[AI-193] Exploiting a Mixture-of-Layers in an Electrocardiography Foundation Model

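【Quick Read】: This paper addresses the fact that the layer-wise internal representations of Transformer-based ECG foundation models are not fully understood or exploited in downstream tasks, and asks in particular whether the final layer, the de facto representational layer, gives optimal performance. Empirical and theoretical analyses show that it does not. The key to the solution is a new architecture, Post-pretraining Mixture-of-layers Aggregation (PMA), which uses a gating network to selectively fuse the representations from the layers of the pre-trained model, exploiting the diversity of layer-wise representations to improve downstream performance. The method is also extended to the pre-training stage, where all layer representations are aggregated by group-wise averaging before being fed into the decoder-based Transformer.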

Link: https://arxiv.org/abs/2509.00102
Authors: Phu X. Nguyen,Huy Phan,Hieu Pham,Christos Chatzichristos,Bert Vandenberk,Maarten De Vos
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract: Transformer-based foundation models for Electrocardiograms (ECGs) have recently achieved impressive performance in many downstream applications. However, the internal representations of such models across layers have not been fully understood and exploited. An important question arises: Does the final layer of the pre-trained Transformer model, the de facto representational layer, provide optimal performance for downstream tasks? Although our answer based on empirical and theoretical analyses for this question is negative, we propose a novel approach to leverage the representation diversity of the model’s layers effectively. Specifically, we introduce a novel architecture called Post-pretraining Mixture-of-layers Aggregation (PMA), which enables a flexible combination of the layer-wise representations from the layer stack of a Transformer-based foundation model. We first pre-train the model from ECG signals using the 1-dimensional Vision Transformer (ViT) via masked modeling. In downstream applications, instead of relying solely on the last layer of the model, we employ a gating network to selectively fuse the representations from the pretrained model’s layers, thereby enhancing representation power and improving performance of the downstream applications. In addition, we extend the proposed method to the pretraining stage by aggregating all representations through group-wise averaging before feeding them into the decoder-based Transformer.
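The gated fusion of layer-wise representations can be sketched as a softmax-weighted sum over layers. This is a simplified illustration, not the PMA architecture: it assumes one pooled feature vector per layer and a hypothetical linear gate (`gate_w`) that scores each layer, whereas the paper's gating network is not detailed in the abstract.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(layer_reprs: List[List[float]], gate_w: List[float]) -> List[float]:
    """Weighted sum of per-layer representations.

    layer_reprs: one pooled feature vector per Transformer layer.
    gate_w: weights of a hypothetical linear gate that scores each layer vector.
    """
    scores = [sum(w * x for w, x in zip(gate_w, h)) for h in layer_reprs]
    weights = softmax(scores)
    dim = len(layer_reprs[0])
    return [
        sum(weights[l] * layer_reprs[l][d] for l in range(len(layer_reprs)))
        for d in range(dim)
    ]


layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy outputs of 3 layers
fused = fuse_layers(layers, gate_w=[0.0, 0.0])   # zero gate gives uniform weights
print(fused)
```

Using only the last layer corresponds to a gate that puts all weight on the final entry; the fused variant lets a downstream task pull information from intermediate layers as well.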

[AI-194] MODE: Mixture of Document Experts for RAG

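【Quick Read】: This paper addresses the excessive overhead that Retrieval-Augmented Generation (RAG) incurs on small, domain-specific corpora when it depends on large vector databases and cross-encoders. The key to the solution is MODE (Mixture of Document Experts), a lightweight method that embeds documents, groups them into semantically coherent clusters, and represents each cluster by a cached centroid. At query time it routes only to the top centroid(s) and retrieves context within those clusters, which removes the external vector database and the reranking step while keeping latency low. On HotpotQA and SQuAD corpora of 100-500 chunks, MODE matches or exceeds a dense-retrieval baseline in answer quality, and ablations show that cluster granularity and multi-cluster routing control the recall/precision trade-off.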

Link: https://arxiv.org/abs/2509.00100
Authors: Rahul Anand
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Retrieval-Augmented Generation (RAG) often relies on large vector databases and cross-encoders tuned for large-scale corpora, which can be excessive for small, domain-specific collections. We present MODE (Mixture of Document Experts), a lightweight alternative that replaces fine-grained nearest-neighbor search with cluster-and-route retrieval. Documents are embedded, grouped into semantically coherent clusters, and represented by cached centroids. At query time, we route to the top centroid(s) and retrieve context only within those clusters, eliminating external vector-database infrastructure and reranking while keeping latency low. On HotpotQA and SQuAD corpora with 100-500 chunks, MODE matches or exceeds a dense-retrieval baseline in answer quality while reducing end-to-end retrieval time. Ablations show that cluster granularity and multi-cluster routing control the recall/precision trade-off, and that tighter clusters improve downstream accuracy. MODE offers a practical recipe for small and medium corpora where simplicity, speed, and topical focus matter.
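Cluster-and-route retrieval as described above can be sketched with plain Python. This is a minimal sketch under stated assumptions: the embeddings are toy 2D vectors, the clustering step is assumed to have already produced `labels`, and the function names `build_centroids` and `route_and_retrieve` are hypothetical rather than taken from MODE.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def build_centroids(embs: List[Vector], labels: List[int]) -> Dict[int, List[float]]:
    """Mean embedding per cluster; these play the role of cached centroids."""
    groups: Dict[int, List[Vector]] = defaultdict(list)
    for e, c in zip(embs, labels):
        groups[c].append(e)
    return {c: [sum(col) / len(vs) for col in zip(*vs)] for c, vs in groups.items()}


def route_and_retrieve(query: Vector, embs: List[Vector], labels: List[int],
                       centroids: Dict[int, List[float]],
                       top_clusters: int = 1, top_chunks: int = 2) -> List[int]:
    """Route to the closest centroid(s), then rank only chunks in those clusters."""
    ranked = sorted(centroids, key=lambda c: cosine(query, centroids[c]), reverse=True)
    routed = set(ranked[:top_clusters])
    candidates = [i for i, c in enumerate(labels) if c in routed]
    candidates.sort(key=lambda i: cosine(query, embs[i]), reverse=True)
    return candidates[:top_chunks]


embs = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]  # toy chunk embeddings
labels = [0, 0, 1, 1]                                     # assumed cluster assignment
centroids = build_centroids(embs, labels)
print(route_and_retrieve((1.0, 0.0), embs, labels, centroids))
```

Raising `top_clusters` corresponds to the multi-cluster routing the abstract mentions: it widens recall at the cost of ranking more candidate chunks.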

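[AI-195] AEGIS: Automated Co-Evolutionary Framework for Guarding Prompt Injections Schema

【Quick Read】: This paper addresses prompt injection attacks against Large Language Models (LLMs) in real-world applications, which threaten safe deployment. Prompt-based detection is lightweight and interpretable, but its effectiveness is held back by the need for manual prompt engineering. The authors propose AEGIS, an automated co-evolutionary defense framework in which attack and defense prompts are iteratively optimized against each other through a Textual Gradient Optimization (TGO) module inside an LLM-guided evaluation loop. The key is a gradient-like natural language prompt optimization technique that lets both attackers and defenders evolve autonomously, yielding a higher attack success rate (ASR) and a better detection true positive rate (TPR), with effectiveness confirmed on different LLMs.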

Link: https://arxiv.org/abs/2509.00088
Authors: Ting-Chun Liu,Ching-Yu Hsu,Kuan-Yi Lee,Chi-An Fu,Hung-yi Lee
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract: Prompt injection attacks pose a significant challenge to the safe deployment of Large Language Models (LLMs) in real-world applications. While prompt-based detection offers a lightweight and interpretable defense strategy, its effectiveness has been hindered by the need for manual prompt engineering. To address this issue, we propose AEGIS, an Automated co-Evolutionary framework for Guarding prompt Injections Schema. Both attack and defense prompts are iteratively optimized against each other using a gradient-like natural language prompt optimization technique. This framework enables both attackers and defenders to autonomously evolve via a Textual Gradient Optimization (TGO) module, leveraging feedback from an LLM-guided evaluation loop. We evaluate our system on a real-world assignment grading dataset of prompt injection attacks and demonstrate that our method consistently outperforms existing baselines, achieving superior robustness in both attack success and detection. Specifically, the attack success rate (ASR) reaches 1.0, representing an improvement of 0.26 over the baseline. For detection, the true positive rate (TPR) improves by 0.23 compared to the previous best work, reaching 0.84, and the true negative rate (TNR) remains comparable at 0.89. Ablation studies confirm the importance of co-evolution, gradient buffering, and multi-objective optimization. We also confirm that this framework is effective in different LLMs. Our results highlight the promise of adversarial training as a scalable and effective approach for guarding prompt injections.

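[AI-196] Yet Unnoticed in LSTM: Binary Tree Based Input Reordering, Weight Regularization, and Gate Nonlinearization

【Quick Read】: This paper addresses shortcomings of long short-term memory networks (LSTMs) in handling long-term information: they do not focus well on specific old input indices or long-term information, and their gates are not nonlinear enough to control the influence of past inputs flexibly. The solution has three parts. First, input reordering prioritizes certain input indices to strengthen the focus on key past information. Second, an Lp-norm-based weight normalization is introduced, with the weight and exponent of the norm chosen through the main supervised loss, to find which norm best smooths or sparsifies the weights. Third, the gates are nonlinearized further with a small feed-forward neural network so that, by analogy with attention mechanisms, they can adapt to peculiar nonlinearities of specific past inputs and emphasize past information selectively. The experiments show that these changes improve the accuracy of an LSTM on text classification tasks.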

Link: https://arxiv.org/abs/2509.00087
Authors: Mojtaba Moattari
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract: LSTM models, widely used in current machine learning literature and applications, offer a promising way to retain long-term information through gating mechanisms that forget and reduce the effect of current input information. However, even with this pipeline, they do not optimally focus on a specific old index or on long-term information. This paper elaborates on input reordering approaches that prioritize certain input indices. Moreover, no LSTM-based approach in the literature examines weight normalization while choosing the right weight and exponent of the Lp norms through the main supervised loss function. In this paper, we find out which norm best captures the relationship between weights to either smooth or sparsify them. Lastly, gates, as weighted representations of inputs and states that control the extent to which the current input is reduced relative to previous inputs (the state), are not nonlinearized enough (through a small FFNN). Analogous to attention mechanisms, gates easily filter current information to emphasize past inputs. Nonlinearized gates can more easily adapt to the peculiar nonlinearities of a specific past input. To the best of the author’s knowledge, this type of nonlinearization has not been proposed in the literature. The proposed approaches are implemented and compared with a simple LSTM to understand their performance in text classification tasks. The results show that they improve the accuracy of the LSTM.
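Two of the ideas above, the Lp-norm weight penalty and a gate passed through a small feed-forward network, can be sketched as follows. This is an illustrative sketch only: here `p` and `lam` are fixed by hand (the paper selects the weight and exponent through the supervised loss), and the one-hidden-layer tanh network in `ffnn_gate` is an assumed form, not the author's exact design.

```python
import math
from typing import List


def lp_penalty(weights: List[float], p: float, lam: float) -> float:
    """lam * sum(|w|^p); p=1 favors sparse weights, p=2 favors small, smooth weights."""
    return lam * sum(abs(w) ** p for w in weights)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def linear_gate(x: List[float], w: List[float], b: float) -> float:
    """Standard LSTM-style gate: sigmoid of an affine map of the inputs."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def ffnn_gate(x: List[float], w1: List[List[float]], b1: List[float],
              w2: List[float], b2: float) -> float:
    """Gate passed through one tanh hidden layer before the sigmoid (assumed form)."""
    hidden = [
        math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    return sigmoid(sum(wi * hi for wi, hi in zip(w2, hidden)) + b2)


x = [0.5, -1.0]  # toy concatenation of current input and previous state
print(lp_penalty([0.5, -2.0, 0.0], p=1, lam=0.01))
print(linear_gate(x, w=[1.0, 1.0], b=0.0))
print(ffnn_gate(x, w1=[[1.0, 0.0], [0.0, 1.0]], b1=[0.0, 0.0], w2=[1.0, 1.0], b2=0.0))
```

In training, the penalty would be added to the supervised loss; the gate output stays in (0, 1) in both variants, so the nonlinear version can replace the linear one without changing the rest of the cell.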

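[AI-197] Private, Verifiable, and Auditable AI Systems

【Quick Read】: This thesis addresses the difficulty of balancing privacy, verifiability, and auditability in foundation models, a central challenge for the security and trustworthiness of current AI systems. Its solution rests on several technical contributions: first, zero-knowledge cryptography to enable verifiable and auditable claims about AI systems without revealing confidential information; second, secure multi-party computation (SMPC) and trusted execution environments (TEEs) for auditable, confidential deployment of large language models and information retrieval systems; and third, enhanced delegation mechanisms, credentialing systems, and access controls to secure interactions with autonomous and multi-agent AI systems. Taken together, the work presents a cohesive perspective on combining privacy, verifiability, and auditability, offering practical blueprints for system designers and input for policy discussions.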

Link: https://arxiv.org/abs/2509.00085
Authors: Tobin South
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: PhD thesis

Abstract:The growing societal reliance on artificial intelligence necessitates robust frameworks for ensuring its security, accountability, and trustworthiness. This thesis addresses the complex interplay between privacy, verifiability, and auditability in modern AI, particularly in foundation models. It argues that technical solutions that integrate these elements are critical for responsible AI innovation. Drawing from international policy contributions and technical research to identify key risks in the AI pipeline, this work introduces novel technical solutions for critical privacy and verifiability challenges. Specifically, the research introduces techniques for enabling verifiable and auditable claims about AI systems using zero-knowledge cryptography; utilizing secure multi-party computation and trusted execution environments for auditable, confidential deployment of large language models and information retrieval; and implementing enhanced delegation mechanisms, credentialing systems, and access controls to secure interactions with autonomous and multi-agent AI systems. Synthesizing these technical advancements, this dissertation presents a cohesive perspective on balancing privacy, verifiability, and auditability in foundation model-based AI systems, offering practical blueprints for system designers and informing policy discussions on AI safety and governance.

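[AI-198] Enabling Transparent Cyber Threat Intelligence Combining Large Language Models and Domain Ontologies

【Quick Read】: This paper addresses the unreliable identification and interpretation of malicious events in cybersecurity logs, especially for unstructured or ambiguous log entries. The key to the solution is combining domain ontologies with SHACL-based constraints to guide Large Language Models (LLMs) toward structured, semantically valid output, and organizing the extracted information into an ontology-enriched graph database, which improves the accuracy and explainability of information extraction.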

Link: https://arxiv.org/abs/2509.00081
Authors: Luca Cotti,Anisa Rula,Devis Bianchini,Federico Cerutti
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 14 pages, 3 figures, 6 tables, accepted at XAI-KRKG@ECAI25: First International ECAI Workshop on eXplainable AI, Knowledge Representation and Knowledge Graphs, October 25-30, 2025, Bologna, Italy

Abstract:Effective Cyber Threat Intelligence (CTI) relies upon accurately structured and semantically enriched information extracted from cybersecurity system logs. However, current methodologies often struggle to identify and interpret malicious events reliably and transparently, particularly in cases involving unstructured or ambiguous log entries. In this work, we propose a novel methodology that combines ontology-driven structured outputs with Large Language Models (LLMs), to build an Artificial Intelligence (AI) agent that improves the accuracy and explainability of information extraction from cybersecurity logs. Central to our approach is the integration of domain ontologies and SHACL-based constraints to guide the language model’s output structure and enforce semantic validity over the resulting graph. Extracted information is organized into an ontology-enriched graph database, enabling future semantic analysis and querying. The design of our methodology is motivated by the analytical requirements associated with honeypot log data, which typically comprises predominantly malicious activity. While our case study illustrates the relevance of this scenario, the experimental evaluation is conducted using publicly available datasets. Results demonstrate that our method achieves higher accuracy in information extraction compared to traditional prompt-only approaches, with a deliberate focus on extraction quality rather than processing speed.

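[AI-199] Wrong Face, Wrong Move: The Social Dynamics of Emotion Misperception in Agent-Based Models

【Quick Read】: This paper studies how the accuracy of emotion perception shapes emergent emotional behavior and spatial organization in a population. The core question is how systematic misperception alters a group's emotional stability and social structure when individuals perceive and respond through inaccurate emotion classifiers. The approach is a multi-agent simulation on a 2D toroidal lattice in which each agent is given an emotion classifier of a different accuracy, trained on the JAFFE (poor), CK+ (medium), or KDEF (high) dataset, and moves according to the perceived rather than the ground-truth emotional state of its neighbors, toward positive emotions and away from negative ones. Experiments on homogeneous and heterogeneous populations and on scenarios with repeated emotional shocks show that low-accuracy classifiers lead to diminished trust, emotional disintegration into sadness, and disordered social organization, whereas high-accuracy classifiers produce robust emotional clusters and resilience to emotional disruptions. The results indicate that bias or imprecision in emotion recognition is enough to cause segregation and loss of cohesion.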

Link: https://arxiv.org/abs/2509.00080
Authors: David Freire-Obregón
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted for presentation at the International Workshop on Agent-Based Modelling of Human Behaviour (ABMHuB 2025)

Abstract:The ability of humans to detect and respond to others’ emotions is fundamental to understanding social behavior. Here, agents are instantiated with emotion classifiers of varying accuracy to study the impact of perceptual accuracy on emergent emotional and spatial behavior. Agents are visually represented with face photos from the KDEF database and endowed with one of three classifiers trained on the JAFFE (poor), CK+ (medium), or KDEF (high) datasets. Agents communicate locally on a 2D toroidal lattice, perceiving neighbors’ emotional state based on their classifier and responding with movement toward perceived positive emotions and away from perceived negative emotions. Note that the agents respond to perceived, instead of ground-truth, emotions, introducing systematic misperception and frustration. A battery of experiments is carried out on homogeneous and heterogeneous populations and scenarios with repeated emotional shocks. Results show that low-accuracy classifiers on the part of the agent reliably result in diminished trust, emotional disintegration into sadness, and disordered social organization. By contrast, the agent that develops high accuracy develops hardy emotional clusters and resilience to emotional disruptions. Even in emotionally neutral scenarios, misperception is enough to generate segregation and disintegration of cohesion. These findings underscore the fact that biases or imprecision in emotion recognition may significantly warp social processes and disrupt emotional integration.
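The movement rule on a toroidal lattice, driven by perceived rather than true emotion, can be sketched as below. This is a simplified illustration: emotions are reduced to a binary valence (+1 or -1), a misperception is modelled as a sign flip with probability `1 - accuracy`, and the helper names are hypothetical. The paper itself uses trained multi-class face classifiers.

```python
import random
from typing import Tuple

Pos = Tuple[int, int]


def perceive(true_valence: int, accuracy: float, rng: random.Random) -> int:
    """Return the true valence with probability `accuracy`, otherwise the flipped one."""
    return true_valence if rng.random() < accuracy else -true_valence


def torus_delta(a: int, b: int, size: int) -> int:
    """Signed shortest step direction (-1, 0, or 1) from a to b along one wrapped axis."""
    d = (b - a) % size
    if d == 0:
        return 0
    return 1 if d <= size // 2 else -1


def step(me: Pos, neighbor: Pos, perceived_valence: int, size: int) -> Pos:
    """Move one cell toward a neighbor perceived as positive (+1), away if negative (-1)."""
    dx = torus_delta(me[0], neighbor[0], size)
    dy = torus_delta(me[1], neighbor[1], size)
    x = (me[0] + perceived_valence * dx) % size  # modulo keeps the agent on the torus
    y = (me[1] + perceived_valence * dy) % size
    return (x, y)


# On a 10x10 torus, the neighbor at (0, 0) is one wrapped step to the right of (9, 0).
print(step((9, 0), (0, 0), perceived_valence=+1, size=10))  # approach across the edge
print(step((9, 0), (0, 0), perceived_valence=-1, size=10))  # retreat
```

Because `step` only sees the output of `perceive`, a low-accuracy agent will sometimes flee a positive neighbor or approach a negative one, which is the misperception effect the study examines.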

[AI-200] Entropy-Guided Loop: Achieving Reasoning through Uncertainty-Aware Generation

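【Quick Read】: This paper addresses the substantial extra cost and latency that reasoning models incur in exchange for better performance; the core challenge is efficient inference without giving up quality. The key to the solution is entropy-guided refinement, a lightweight test-time loop that combines token-level uncertainty signals (perplexity, maximum token entropy, and low-confidence-token count, with Shannon entropy computed over the top-k alternatives) in an OR-logic trigger. When the trigger fires, a single targeted refinement pass is run, and a compact uncertainty report containing tokens, confidences, alternatives, and context is passed back to the model to guide corrective edits. With this loop, a small model approaches 95% of a reference reasoning model's quality at roughly one-third of the cost; refinement is triggered on about 31% of responses, and accuracy improves by 16 percentage points over single-pass inference.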

Link: https://arxiv.org/abs/2509.00079
Authors: Andrew G. A. Correa,Ana C. H de Matos
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 2 figures, 4 tables

Abstract: Reasoning models often outperform smaller models but at 3–5× higher cost and added latency. We present entropy-guided refinement: a lightweight, test-time loop that uses token-level uncertainty to trigger a single, targeted refinement pass. We extract logprobs, compute Shannon entropy on top-k alternatives, and apply a simple OR-logic trigger over perplexity, maximum token entropy, and low-confidence-token count. Unlike approaches that use entropy only for measurement or decoding, we pass a compact uncertainty report (tokens, confidences, alternatives, context) back to the model to guide corrective edits. On representative technical queries across reasoning, mathematics, and code generation tasks, a small model with our loop approaches 95% of a reference reasoning model’s quality at approximately one-third of the cost. The method achieves selective refinement on ~31% of responses while improving accuracy by 16 percentage points over single-pass inference. We demonstrate that this uncertainty-aware loop provides an effective middle ground between single-pass inference and expensive reasoning chains, making it practical for production deployments where both quality and cost matter.
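The trigger described above (entropy over top-k alternatives, combined with perplexity and a low-confidence count under OR logic) can be sketched as follows. The thresholds and function names are illustrative assumptions, not values from the paper, and the logprobs are supplied by hand instead of coming from a model API.

```python
import math
from typing import List


def token_entropy(top_logprobs: List[float]) -> float:
    """Shannon entropy (nats) over the renormalized top-k alternatives of one token."""
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0)


def perplexity(chosen_logprobs: List[float]) -> float:
    """exp of the mean negative log-probability of the generated tokens."""
    return math.exp(-sum(chosen_logprobs) / len(chosen_logprobs))


def should_refine(chosen_logprobs: List[float], top_logprobs: List[List[float]],
                  ppl_max: float = 1.4, entropy_max: float = 1.5,
                  low_conf_logprob: float = math.log(0.5),
                  low_conf_max: int = 3) -> bool:
    """OR-logic trigger; every threshold here is an illustrative assumption."""
    max_entropy = max(token_entropy(alts) for alts in top_logprobs)
    low_conf = sum(1 for lp in chosen_logprobs if lp < low_conf_logprob)
    return (
        perplexity(chosen_logprobs) > ppl_max
        or max_entropy > entropy_max
        or low_conf >= low_conf_max
    )


confident = [math.log(0.99)] * 3
confident_alts = [[math.log(0.99), math.log(0.01)]] * 3
uncertain = [math.log(0.3)] * 3
uncertain_alts = [[math.log(0.3), math.log(0.3), math.log(0.4)]] * 3
print(should_refine(confident, confident_alts))
print(should_refine(uncertain, uncertain_alts))
```

When the trigger fires, the loop in the paper passes an uncertainty report back to the model for one refinement pass; that second call is model-specific and is left out of this sketch.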

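[AI-201] Beyond Memorization: Reasoning-Driven Synthesis as a Mitigation Strategy Against Benchmark Contamination

【Quick Read】: This paper addresses the credibility problem that data contamination creates for evaluating Large Language Models (LLMs): static benchmarks may measure memorization of training data rather than genuine reasoning. The key to the solution is an infinitely scalable question-answering (QA) synthesis framework that generates research-level, multi-step reasoning questions directly from arXiv papers and uses the natural temporal structure of research publications to detect performance decay after knowledge cutoffs. Across models of various sizes, developers, and release dates, the results show no significant performance decay near the knowledge cutoff dates. The authors hypothesize that the multi-step reasoning required by the synthesis pipeline adds complexity beyond shallow memorization and so mitigates contamination, and they advocate a shift toward reasoning-driven synthesis for building benchmarks.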

Link: https://arxiv.org/abs/2509.00072
Authors: Terry Jingchen Zhang,Gopal Dev,Ning Wang,Nicole Ni,Wenyuan Jiang,Yinya Huang,Bernhard Schölkopf,Mrinmaya Sachan,Zhijing Jin
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Code and Dataset: this https URL

Abstract:Capability evaluation of large language models (LLMs) is increasingly shadowed by rising concerns of data contamination that cast doubts on whether static benchmarks measure genuine reasoning or mere memorization. We present an empirical study using an infinitely scalable framework to synthesize research-level QA directly from arXiv papers, harnessing the natural temporal structure of research publications where performance decay after knowledge cutoffs may indicate potential contamination. We evaluated 4 frontier model represented by 2 models of different knowledge cutoff dates per family on 1,643 multi-step reasoning questions synthesized from 20,277 arXiv papers stratified over 26 months, covering at least 6 months before and after all cutoff dates. Our results consistently showed a lack of significant performance decay near knowledge cutoff dates for models of various sizes, developers, and release dates. We further performed a comparative analysis with previous longitudinal studies that reported significant post-cutoff performance decay using directly retrieved questions based on public data. we hypothesize that the multi-step reasoning required by our synthesis pipeline offered additional complexity that goes deeper than shallow memorization, which effectively serves a mitigation strategy against benchmark contamination. We fully open source our code and dataset to aid reproducibility and advocate for a paradigm shift that prioritize reasoning-driven synthesis to construct benchmarks over simply collecting newly released questions periodically.

[AI-202] SynCircuit: Automated Generation of New Synthetic RTL Circuits Can Enable Big Data in Circuits

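[Quick Read]: This paper addresses the bottleneck that AI-assisted integrated circuit (IC) design faces from the scarcity of publicly available circuit design data. The key to its solution is the SynCircuit framework, which has three core steps: 1) a customized diffusion-based generative model for the Directed Cyclic Graph (DCG) generation task; 2) refinement of the initial generated graphs to enforce circuit constraints and keep the circuits functionally valid; and 3) Monte Carlo Tree Search (MCTS) to further reduce logic redundancy. Experiments show that the method generates more realistic synthetic circuits and improves the performance of ML models on downstream circuit design tasks.

Link: https://arxiv.org/abs/2509.00071
Authors: Shang Liu, Jing Wang, Wenji Fang, Zhiyao Xie
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by DAC’25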

Abstract:In recent years, AI-assisted IC design methods have demonstrated great potential, but the availability of circuit design data is extremely limited, especially in the public domain. The lack of circuit data has become the primary bottleneck in developing AI-assisted IC design methods. In this work, we make the first attempt, SynCircuit, to generate new synthetic circuits with valid functionalities in the HDL format. SynCircuit automatically generates synthetic data using a framework with three innovative steps: 1) We propose a customized diffusion-based generative model to resolve the Directed Cyclic Graph (DCG) generation task, which has not been well explored in the AI community. 2) To ensure our circuit is valid, we enforce the circuit constraints by refining the initial graph generation outputs. 3) The Monte Carlo tree search (MCTS) method further optimizes the logic redundancy in the generated graph. Experimental results demonstrate that our proposed SynCircuit can generate more realistic synthetic circuits and enhance ML model performance in downstream circuit design tasks.

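[AI-203] The Collaborations among Healthcare Systems, Research Institutions, and Industry on Artificial Intelligence Research and Development

[Quick Read]: This paper examines how healthcare systems, research institutions, and industry collaborate on artificial intelligence (AI) in healthcare, where immature collaboration networks, limited stakeholder engagement, and missing data-sharing frameworks and standards hinder clinical adoption. Its recommendations center on a multi-stakeholder approach: AI-specific education and training to build expertise, secure data-sharing frameworks to address privacy and security concerns, clear industry standards and legal guidelines, and dedicated AI research departments.

Link: https://arxiv.org/abs/2509.00068
Authors: Jiancheng Ye, Michelle Ma, Malak Abuhashish
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: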

Abstract:Objectives: The integration of Artificial Intelligence (AI) in healthcare promises to revolutionize patient care, diagnostics, and treatment protocols. Collaborative efforts among healthcare systems, research institutions, and industry are pivotal to leveraging AI’s full potential. This study aims to characterize collaborative networks and stakeholders in AI healthcare initiatives, identify challenges and opportunities within these collaborations, and elucidate priorities for future AI research and development. Methods: This study utilized data from the Chinese Society of Radiology and the Chinese Medical Imaging AI Innovation Alliance. A national cross-sectional survey was conducted in China (N = 5,142) across 31 provincial administrative regions, involving participants from three key groups: clinicians, institution professionals, and industry representatives. The survey explored diverse aspects including current AI usage in healthcare, collaboration dynamics, challenges encountered, and research and development priorities. Results: Findings reveal high interest in AI among clinicians, with a significant gap between interest and actual engagement in development activities. Despite the willingness to share data, progress is hindered by concerns about data privacy and security, and lack of clear industry standards and legal guidelines. Future development interests focus on lesion screening, disease diagnosis, and enhancing clinical workflows. Conclusion: This study highlights an enthusiastic yet cautious approach toward AI in healthcare, characterized by significant barriers that impede effective collaboration and implementation. Recommendations emphasize the need for AI-specific education and training, secure data-sharing frameworks, establishment of clear industry standards, and formation of dedicated AI research departments.

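[AI-204] A Comparative Study of Controllability, Explainability, and Performance in Dysfluency Detection Models

[Quick Read]: This paper addresses the limited controllability and explainability of current dysfluency detection models in clinical use, rather than only pursuing higher benchmark performance. The key to its approach is a systematic comparison of four representative methods (YOLO-Stutter, FluentNet, UDM, and SSDM) along three dimensions: performance, controllability, and explainability. It finds that UDM achieves the best balance of accuracy and clinical interpretability, which gives direction for developing clinically viable dysfluency detection models.

Link: https://arxiv.org/abs/2509.00058
Authors: Eric Zhang, Li Wei, Sarah Chen, Michael Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: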

Abstract: Recent advances in dysfluency detection have introduced a variety of modeling paradigms, ranging from lightweight object-detection inspired networks (YOLO-Stutter) to modular interpretable frameworks (UDM). While performance on benchmark datasets continues to improve, clinical adoption requires more than accuracy: models must be controllable and explainable. In this paper, we present a systematic comparative analysis of four representative approaches (YOLO-Stutter, FluentNet, UDM, and SSDM) along three dimensions: performance, controllability, and explainability. Through comprehensive evaluation on multiple datasets and expert clinician assessment, we find that YOLO-Stutter and FluentNet provide efficiency and simplicity, but with limited transparency; UDM achieves the best balance of accuracy and clinical interpretability; and SSDM, while promising, could not be fully reproduced in our experiments. Our analysis highlights the trade-offs among competing approaches and identifies future directions for clinically viable dysfluency modeling. We also provide detailed implementation insights and practical deployment considerations for each approach.

[AI-205] U2UData-2: A Scalable Swarm UAVs Autonomous Flight Dataset for Long-horizon Tasks

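[Quick Read]: This paper addresses the limited autonomous flight capability of swarm UAVs on Long-Horizon (LH) tasks. Because of dataset limitations, existing methods handle only specific basic tasks and cannot cope with the long-term dependencies, persistent state, and dynamic goal shifts found in real deployments. The key to its solution is U2UData-2, which the authors present as the first large-scale swarm UAV autonomous flight dataset for LH tasks together with a scalable platform for online data collection and closed-loop algorithm verification. The dataset provides multimodal sensor data (LiDAR, RGB, and environmental readings), the platform supports customized simulators and flight algorithms, and the release includes LH task benchmarks such as a wildlife conservation task.

Link: https://arxiv.org/abs/2509.00055
Authors: Tongtong Feng, Xin Wang, Feilin Han, Leping Zhang, Wenwu Zhu
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Multimedia (cs.MM)
Comments: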

Abstract:Swarm UAV autonomous flight for Long-Horizon (LH) tasks is crucial for advancing the low-altitude economy. However, existing methods focus only on specific basic tasks due to dataset limitations, failing in real-world deployment for LH tasks. LH tasks are not mere concatenations of basic tasks, requiring handling long-term dependencies, maintaining persistent states, and adapting to dynamic goal shifts. This paper presents U2UData-2, the first large-scale swarm UAV autonomous flight dataset for LH tasks and the first scalable swarm UAV data online collection and algorithm closed-loop verification platform. The dataset is captured by 15 UAVs in autonomous collaborative flights for LH tasks, comprising 12 scenes, 720 traces, 120 hours, 600 seconds per trajectory, 4.32M LiDAR frames, and 12.96M RGB frames. This dataset also includes brightness, temperature, humidity, smoke, and airflow values covering all flight routes. The platform supports the customization of simulators, UAVs, sensors, flight algorithms, formation modes, and LH tasks. Through a visual control window, this platform allows users to collect customized datasets through one-click deployment online and to verify algorithms by closed-loop simulation. U2UData-2 also introduces an LH task for wildlife conservation and provides comprehensive benchmarks with 9 SOTA models. U2UData-2 can be found at this https URL.

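[AI-206] Robotic Fire Risk Detection based on Dynamic Knowledge Graph Reasoning: An LLM-Driven Approach with Graph Chain-of-Thought

[Quick Read]: This paper addresses the incomplete perception, inadequate fire situational awareness, and delayed response that emergency robots face in pre-disaster warning and disaster-time rescue. The key to its solution is a fire knowledge graph (KG) built with large language models (LLMs), plus a new framework called Insights-on-Graph (IOG) that combines the structured fire knowledge with Large Multimodal Models (LMMs). From real-time scene imagery, IOG generates perception-driven risk graphs for early fire risk detection and, based on the evolving risk situation, provides interpretable emergency responses for configuring task modules and robot components.

Link: https://arxiv.org/abs/2509.00054
Authors: Haimei Pan, Jiyun Zhang, Qinxi Wei, Xiongnan Jin, Chen Xinkai, Jie Cheng
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: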

Abstract:Fire is a highly destructive disaster, but effective prevention can significantly reduce its likelihood of occurrence. When it happens, deploying emergency robots in fire-risk scenarios can help minimize the danger to human responders. However, current research on pre-disaster warnings and disaster-time rescue still faces significant challenges due to incomplete perception, inadequate fire situational awareness, and delayed response. To enhance intelligent perception and response planning for robots in fire scenarios, we first construct a knowledge graph (KG) by leveraging large language models (LLMs) to integrate fire domain knowledge derived from fire prevention guidelines and fire rescue task information from robotic emergency response documents. We then propose a new framework called Insights-on-Graph (IOG), which integrates the structured fire information of KG and Large Multimodal Models (LMMs). The framework generates perception-driven risk graphs from real-time scene imagery to enable early fire risk detection and provide interpretable emergency responses for task module and robot component configuration based on the evolving risk situation. Extensive simulations and real-world experiments show that IOG has good applicability and practical application value in fire risk detection and rescue decision-making.

[AI-207] Applying Deep Learning to Anomaly Detection of Russian Satellite Activity for Indications Prior to Military Activity

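[Quick Read]: This paper asks how deep learning can identify anomalous activity of Russian resident space objects (RSOs) from publicly available two-line element (TLE) data, so that the findings can serve as indications and warnings (IW) of aggressive military behavior in future conflicts. The core task is to detect statistically significant changes in each RSO's pattern of life/pattern of behavior (PoL/PoB). The key to its solution is to train a separate model for each RSO, using isolation forest, a traditional autoencoder, a variational autoencoder, a Kolmogorov-Arnold Network, and a novel anchor-loss based autoencoder, and to flag anomalies when reconstruction error exceeds a threshold. For explainability, each of the six orbital elements is assessed individually rather than treating the input as one monolithic observation, which pins anomalies down to the specific orbital element.

Link: https://arxiv.org/abs/2509.00050
Authors: David Kurtenbach, Megan Manly, Zach Metzinger
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: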

Abstract:We apply deep learning techniques for anomaly detection to analyze activity of Russian-owned resident space objects (RSO) prior to the Ukraine invasion and assess the results for any findings that can be used as indications and warnings (IW) of aggressive military behavior for future conflicts. Through analysis of anomalous activity, an understanding of possible tactics and procedures can be established to assess the existence of statistically significant changes in Russian RSO pattern of life/pattern of behavior (PoL/PoB) using publicly available two-line element (TLE) data. This research looks at statistical and deep learning approaches to assess anomalous activity. The deep learning methods assessed are isolation forest (IF), traditional autoencoder (AE), variational autoencoder (VAE), Kolmogorov Arnold Network (KAN), and a novel anchor-loss based autoencoder (Anchor AE). Each model is used to establish a baseline of on-orbit activity based on a five-year data sample. The primary investigation period focuses on the six months leading up to the invasion date of February 24, 2022. Additional analysis looks at RSO activity during an active combat period by sampling TLE data after the invasion date. The deep learning autoencoder models identify anomalies based on reconstruction errors that surpass a threshold sigma. To capture the nuance and unique characteristics of each RSO an individual model was trained for each observed space object. The research made an effort to prioritize explainability and interpretability of the model results thus each observation was assessed for anomalous behavior of the individual six orbital elements versus analyzing the input data as a single monolithic observation. The results demonstrate not only statistically significant anomalies of Russian RSO activity but also details anomalous findings to the individual orbital element.
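
The per-element thresholding described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the reconstruction errors have already been produced by some per-RSO autoencoder, it assumes the six elements are the ones carried in a TLE, and the threshold rule (baseline mean plus `k` standard deviations), the function names, and the toy numbers are hypothetical.

```python
from statistics import mean, stdev

# Assumed element names: the six orbital elements carried in a TLE.
ELEMENTS = [
    "inclination", "raan", "eccentricity",
    "arg_perigee", "mean_anomaly", "mean_motion",
]


def fit_thresholds(baseline_errors, k=3.0):
    """Per-element threshold: baseline mean error plus k standard deviations."""
    thresholds = {}
    for name in ELEMENTS:
        values = [row[name] for row in baseline_errors]
        thresholds[name] = mean(values) + k * stdev(values)
    return thresholds


def flag_anomalies(errors, thresholds):
    """Return the elements whose reconstruction error exceeds its threshold."""
    return [name for name in ELEMENTS if errors[name] > thresholds[name]]


# Toy baseline (invented): five observations with slowly varying errors.
baseline = [{name: 0.10 + 0.01 * i for name in ELEMENTS} for i in range(5)]
thresholds = fit_thresholds(baseline)

# A new observation whose inclination is reconstructed poorly.
observation = {name: 0.12 for name in ELEMENTS}
observation["inclination"] = 0.50
flagged = flag_anomalies(observation, thresholds)
print(flagged)
```

Checking each element against its own threshold is what lets a result say which part of the orbit changed, rather than only that an observation as a whole looked unusual.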

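[AI-208] Teaching AI to Remember: Insights from Brain-Inspired Replay in Continual Learning

[Quick Read]: This paper addresses catastrophic forgetting in continual learning with artificial neural networks (ANNs), where a model loses knowledge of earlier tasks as it learns new ones. The key to its approach is internal replay, inspired by memory consolidation in the human brain: latent representations of prior experiences are reactivated during learning to strengthen retention of old knowledge. Experiments show that internal replay significantly mitigates forgetting, especially when combined with Synaptic Intelligence (SI), but lowers initial task accuracy, which exposes a trade-off between memory stability and learning plasticity.

Link: https://arxiv.org/abs/2509.00047
Authors: Jina Kim
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: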

Abstract: Artificial neural networks (ANNs) continue to face challenges in continual learning, particularly due to catastrophic forgetting, the loss of previously learned knowledge when acquiring new tasks. Inspired by memory consolidation in the human brain, we investigate the internal replay mechanism proposed by \citep{brain_inspired_replay1}, which reactivates latent representations of prior experiences during learning. As internal replay was identified as the most influential component among the brain-inspired mechanisms in their framework, it serves as the central focus of our in-depth investigation. Using the CIFAR-100 dataset in a class-incremental setting, we evaluate the effectiveness of internal replay, both in isolation and in combination with Synaptic Intelligence (SI). Our experiments show that internal replay significantly mitigates forgetting, especially when paired with SI, but at the cost of reduced initial task accuracy, highlighting a trade-off between memory stability and learning plasticity. Further analyses using log-likelihood distributions, reconstruction errors, silhouette scores, and UMAP projections reveal that internal replay increases representational overlap in latent space, potentially limiting task-specific differentiation. These results underscore the limitations of current brain-inspired methods and suggest future directions for balancing retention and adaptability in continual learning systems.

[AI-209] Exploring and Reshaping the Weight Distribution in LLM

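[Quick Read]: This paper studies how weight initialization affects LoRA (Low-Rank Adaptation) training, in particular the differences in training effectiveness that stem from the varying weight distributions across layers of Large Language Models (LLMs). Its solution has three parts. First, by analyzing cosine distances between weight matrices from the self-attention and feed-forward layers, it finds that these distances follow a power-law distribution. Second, it designs a data generator that combines a Gaussian process with Pareto distribution functions to simulate weights with those distribution characteristics. Third, it uses the generated weights to reshape the LoRA initialization. The authors report a certain improvement in LoRA training performance without altering the model structure or the training process.

Link: https://arxiv.org/abs/2509.00046
Authors: Chunming Ye, Songzhou Li, Xu Xu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 19 pages, 16 figures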

Abstract:The performance of Large Language Models is influenced by their characteristics such as architecture, model sizes, decoding methods and so on. Due to differences in structure or function, the weights in different layers of large models have varying distributions. This paper explores the correlations between different types of layers in terms of weights distribution and studies the potential impact of these correlations on LoRA training effectiveness. Firstly, the study reveals that in the model the cosine distances between weights of different layers manifest power-law distribution. We extract Query-projection, down-projection and other weight matrices from the self-attention layers and MLP layers, calculate the singular values of the matrices using singular value decomposition, and organize a certain number of singular values into matrices according to projection’s type. By analyzing the probability distribution of the cosine distances between these matrices, it is found that the cosine distances values between them have distinct power-law distribution characteristics. Secondly, based on the results of distance calculations and analysis across different layers of model, a qualitative method is proposed to describe the distribution characteristics of different models. Next, to construct weights that align with the distribution characteristics, a data generator is designed using a combination of Gaussian process and Pareto distribution functions. The generator is used to simulate the generation of data that aligns with specific distribution characteristics. Finally, based on the aforementioned distribution characteristics and data generation method, the weights in LoRA initialization are reshaped for training. Experimental results indicate that, without altering the model structure or training process, this method achieves a certain improvement in the performance of LoRA training.

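[AI-210] Transfer Learning for Minimum Operating Voltage Prediction in Advanced Technology Nodes: Leveraging Legacy Data and Silicon Odometer Sensing

[Quick Read]: This paper addresses the low accuracy of minimum operating voltage (V_min) prediction at advanced technology nodes such as 5nm. The core challenges are scarce training data and the complex relationship between process variations and V_min. The key to its solution is a novel transfer learning framework that uses abundant legacy data from the 16nm node as prior knowledge and adds input features derived from on-chip silicon odometer sensors. These features give a fine-grained characterization of localized process variations, an essential factor at the 5nm node, and significantly improve V_min prediction accuracy.

Link: https://arxiv.org/abs/2509.00035
Authors: Yuxuan Yin, Rebecca Chen, Boxun Xu, Chen He, Peng Li
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: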

Abstract: Accurate prediction of chip performance is critical for ensuring energy efficiency and reliability in semiconductor manufacturing. However, developing minimum operating voltage (V_min) prediction models at advanced technology nodes is challenging due to limited training data and the complex relationship between process variations and V_min. To address these issues, we propose a novel transfer learning framework that leverages abundant legacy data from the 16nm technology node to enable accurate V_min prediction at the advanced 5nm node. A key innovation of our approach is the integration of input features derived from on-chip silicon odometer sensor data, which provide fine-grained characterization of localized process variations (an essential factor at the 5nm node), resulting in significantly improved prediction accuracy.

[AI-211] ZeroQAT: Your Quantization-aware Training but Efficient

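[Quick Read]: This paper addresses two problems. Low-bit post-training quantization (PTQ) loses accuracy because its layer-wise optimization accumulates and propagates error and because local reconstruction objectives are misaligned with downstream performance. Quantization-aware training (QAT) avoids this, but its reliance on backpropagation incurs prohibitive data, time, and memory costs. The key to the solution is ZeroQAT, a QAT framework based on zeroth-order optimization. It uses forward-only gradient estimation, so end-to-end optimization needs no backpropagation and compute and memory overhead drop significantly. ZeroQAT also jointly learns the quantized weights, weight clipping thresholds, and equivalent transformations to reduce quantization error and handle activation outliers, combining the efficiency of PTQ with the accuracy of QAT.

Link: https://arxiv.org/abs/2509.00031
Authors: Qitao Tan, Xiaoying Song, Jin Lu, Guoming Li, Jun Liu, Lingzi Hong, Caiwen Ding, Jundong Li, Xiaoming Zhai, Shaoyi Huang, Wei Niu, Geng Yuan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: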

Abstract:Quantization is an effective technique to reduce the deployment cost of large language models (LLMs), and post-training quantization (PTQ) has been widely studied due to its efficiency. However, existing low-bit PTQ methods suffer from accuracy degradation because their layer-wise optimization introduces cumulative error propagation and misalignment between local reconstruction objectives and downstream performance. While quantization-aware training (QAT) provides a principled solution, its reliance on backpropagation incurs prohibitive data, time, and memory costs, limiting its practicality. To address these challenges, we propose ZeroQAT, a zeroth-order optimization-based QAT framework. ZeroQAT leverages forward-only gradient estimation to eliminate the need for backpropagation, significantly reducing computational and memory overhead while retaining the benefits of end-to-end optimization. Moreover, ZeroQAT jointly learns quantized weights, weight clipping thresholds, and equivalent transformations to mitigate quantization error and handle activation outliers. Experiments demonstrate that ZeroQAT achieves the efficiency of PTQ while retaining the accuracy of QAT, offering a practical solution for high-quality low-bit quantization of LLMs.
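
To make "forward-only gradient estimation" concrete, here is a minimal sketch of a generic two-point zeroth-order estimator. It is not taken from the paper: the abstract does not specify ZeroQAT's exact estimator, so the function names, the Gaussian perturbation, the step size `mu`, and the toy quadratic loss are all assumptions. In a QAT setting, `loss_fn` would stand for a forward pass through the quantized model.

```python
import random


def zo_gradient(loss_fn, weights, mu=1e-3, rng=random):
    """Two-point zeroth-order gradient estimate using only forward evaluations."""
    direction = [rng.gauss(0.0, 1.0) for _ in weights]
    plus = [w + mu * u for w, u in zip(weights, direction)]
    minus = [w - mu * u for w, u in zip(weights, direction)]
    # Finite-difference slope of the loss along the random direction.
    slope = (loss_fn(plus) - loss_fn(minus)) / (2.0 * mu)
    return [slope * u for u in direction]


def averaged_zo_gradient(loss_fn, weights, num_samples, mu=1e-3, rng=random):
    """Average several single-direction estimates to reduce variance."""
    total = [0.0] * len(weights)
    for _ in range(num_samples):
        estimate = zo_gradient(loss_fn, weights, mu, rng)
        total = [t + e for t, e in zip(total, estimate)]
    return [t / num_samples for t in total]


def toy_loss(weights):
    """Stand-in loss whose true gradient is 2 * weights."""
    return sum(w * w for w in weights)


rng = random.Random(0)
grad = averaged_zo_gradient(toy_loss, [1.0, -2.0], num_samples=20000, rng=rng)
print(grad)  # should be close to the true gradient [2.0, -4.0]
```

Each estimate costs two forward passes and stores no activations for a backward pass, which is where the memory saving of this family of methods comes from. The price is variance, which is why the sketch averages many random directions.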

[AI-212] From Sound to Sight: Towards AI-authored Music Videos

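[Quick Read]: This paper addresses the limited expressiveness of conventional music visualisation systems, which rely on handcrafted transformations of shapes and colours. The key to its solution is two novel pipelines that use off-the-shelf deep learning models to generate music videos automatically from any user-specified vocal or instrumental song. Latent feature-based techniques first analyse the audio to detect emotional cues and instrumental patterns, a language model distils these into textual scene descriptions, and a generative model then produces the corresponding video clips. A preliminary user evaluation points to storytelling potential, visual coherency, and emotional alignment with the music.

Link: https://arxiv.org/abs/2509.00029
Authors: Leo Vitasovic, Stella Graßhof, Agnes Mercedes Kloft, Ville V. Lehtola, Martin Cunneen, Justyna Starostka, Glenn McGarry, Kun Li, Sami S. Brandt
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: 1st Workshop on Generative AI for Storytelling (AISTORY), 2025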

Abstract:Conventional music visualisation systems rely on handcrafted ad hoc transformations of shapes and colours that offer only limited expressiveness. We propose two novel pipelines for automatically generating music videos from any user-specified, vocal or instrumental song using off-the-shelf deep learning models. Inspired by the manual workflows of music video producers, we experiment on how well latent feature-based techniques can analyse audio to detect musical qualities, such as emotional cues and instrumental patterns, and distil them into textual scene descriptions using a language model. Next, we employ a generative model to produce the corresponding video clips. To assess the generated videos, we identify several critical aspects and design and conduct a preliminary user evaluation that demonstrates storytelling potential, visual coherency and emotional alignment with the music. Our findings underscore the potential of latent feature techniques and deep generative models to expand music visualisation beyond traditional approaches.

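[AI-213] Optimized Renewable Energy Planning MDP for Socially-Equitable Electricity Coverage in the US

[Quick Read]: This paper addresses the barriers that traditional power grid infrastructure poses to renewable energy integration and the energy access inequities it perpetuates for low-income communities, focusing on how to allocate clean electricity resources fairly under a limited budget. The key to its solution is a Markov Decision Process (MDP) framework that brings social vulnerability indicators, energy demand variability, and budget constraints into one optimization model, evaluated across eight major U.S. cities. Numerical experiments compare the MDP-based approach with random allocation, greedy renewable expansion, and expert heuristics. Equity-focused optimization reaches 32.9% renewable energy penetration while reducing underserved low-income populations by 55% compared to conventional approaches. According to the abstract, the expert policy achieved the highest reward, while the Monte Carlo Tree Search baseline performed competitively with significantly lower budget utilization.

Link: https://arxiv.org/abs/2509.00008
Authors: Riya Kinnarkar, Mansur Arief
Affiliation: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); General Economics (econ.GN)
Comments: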

Abstract:Traditional power grid infrastructure presents significant barriers to renewable energy integration and perpetuates energy access inequities, with low-income communities experiencing disproportionately longer power outages. This study develops a Markov Decision Process (MDP) framework to optimize renewable energy allocation while explicitly addressing social equity concerns in electricity distribution. The model incorporates budget constraints, energy demand variability, and social vulnerability indicators across eight major U.S. cities to evaluate policy alternatives for equitable clean energy transitions. Numerical experiments compare the MDP-based approach against baseline policies including random allocation, greedy renewable expansion, and expert heuristics. Results demonstrate that equity-focused optimization can achieve 32.9% renewable energy penetration while reducing underserved low-income populations by 55% compared to conventional approaches. The expert policy achieved the highest reward, while the Monte Carlo Tree Search baseline provided competitive performance with significantly lower budget utilization, demonstrating that fair distribution of clean energy resources is achievable without sacrificing overall system performance and providing ways for integrating social equity considerations with climate goals and inclusive access to clean power infrastructure.

[AI-214] Per-sender neural network classifiers for email authorship validation

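[Quick Read]: This paper addresses sender impersonation in internal corporate email, namely business email compromise (BEC) and lateral spear phishing. These attacks send malicious mail from compromised employee accounts, and most organizations are exposed because they trust internal email by default. The key to its solution is authorship validation: modeling each sender's writing style to verify, in real time, whether the claimed sender actually wrote a given email. The approach uses lightweight per-sender classifiers, such as a character-level convolutional neural network (Char-CNN), which achieves high accuracy and F1 scores on simulated datasets built from the Enron corpus. The authors argue that such classifiers can be integrated into existing commercial email security systems with low overhead, complementing traditional detection methods.

Link: https://arxiv.org/abs/2509.00005
Authors: Rohit Dube
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments: 11 pages, 5 figures, 8 tables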

Abstract:Business email compromise and lateral spear phishing attacks are among modern organizations’ most costly and damaging threats. While inbound phishing defenses have improved significantly, most organizations still trust internal emails by default, leaving themselves vulnerable to attacks from compromised employee accounts. In this work, we define and explore the problem of authorship validation: verifying whether a claimed sender actually authored a given email. Authorship validation is a lightweight, real-time defense that complements traditional detection methods by modeling per-sender writing style. Further, the paper presents a collection of new datasets based on the Enron corpus. These simulate inauthentic messages using both human-written and large language model-generated emails. The paper also evaluates two classifiers – a Naive Bayes model and a character-level convolutional neural network (Char-CNN) – for the authorship validation task. Our experiments show that the Char-CNN model achieves high accuracy and F1 scores under various circumstances. Finally, we discuss deployment considerations and show that per-sender authorship classifiers are practical for integrating into existing commercial email security systems with low overhead.
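
A per-sender classifier of the kind the abstract describes can be sketched with a small Naive Bayes model. This is an illustration under stated assumptions, not the paper's implementation: the abstract does not say which features its Naive Bayes baseline uses, so character trigrams, Laplace smoothing, the class name `SenderValidator`, and the toy emails are all hypothetical.

```python
from collections import Counter
from math import log


def char_ngrams(text, n=3):
    """Lower-cased character n-grams of a message."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class SenderValidator:
    """Binary Naive Bayes for one sender: written by them (True) or not (False)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {True: Counter(), False: Counter()}
        self.docs = {True: 0, False: 0}
        self.vocab = set()

    def fit(self, emails, labels):
        for text, is_sender in zip(emails, labels):
            self.counts[is_sender].update(char_ngrams(text, self.n))
            self.docs[is_sender] += 1
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def _log_score(self, text, label):
        total = sum(self.counts[label].values())
        score = log(self.docs[label] / sum(self.docs.values()))
        for gram in char_ngrams(text, self.n):
            # Laplace smoothing keeps unseen n-grams from zeroing the score.
            score += log((self.counts[label][gram] + 1) / (total + len(self.vocab)))
        return score

    def is_authentic(self, text):
        return self._log_score(text, True) > self._log_score(text, False)


# Toy training data (invented): two emails by the sender, two by other people.
emails = [
    "thx, pls review asap. -bob",
    "pls send the deck asap thx. -bob",
    "Dear colleagues, kindly find the attached report. Best regards, Alice",
    "Dear team, please find the meeting notes attached. Best regards, Carol",
]
validator = SenderValidator().fit(emails, [True, True, False, False])
print(validator.is_authentic("pls review the deck thx. -bob"))
print(validator.is_authentic("Dear all, kindly find the notes attached. Best regards"))
```

In a deployment along the lines the paper discusses, one such model would be trained per sender and consulted whenever an internal message claims to come from that sender.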

[AI-215] Beyond Ensembles: Simulating All-Atom Protein Dynamics in a Learned Latent Space

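[Quick Read]: This paper addresses the difficulty of predicting how biomolecular conformations evolve over long timescales. Enhanced sampling methods depend on pre-defined collective variables, and the generative model LD-FPG samples the static equilibrium ensemble but does not model the dynamics between conformations. The key to the solution is to extend LD-FPG with a temporal propagator that operates in the learned latent space and, within a unified encoder-propagator-decoder framework, to compare three propagator classes: score-guided Langevin dynamics, Koopman-based linear operators, and autoregressive neural networks. Autoregressive networks give the most robust long rollouts, score-guided Langevin best recovers side-chain thermodynamics when the score is well learned, and Koopman offers a lightweight, interpretable baseline that tends to damp fluctuations. These results clarify the trade-offs and give practical guidance for latent-space simulation of all-atom protein dynamics.

Link: https://arxiv.org/abs/2509.02196
Authors: Aditya Sengar, Ali Hariri, Pierre Vandergheynst, Patrick Barth
Affiliation: Unknown
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
Comments: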

Abstract:Simulating the long-timescale dynamics of biomolecules is a central challenge in computational science. While enhanced sampling methods can accelerate these simulations, they rely on pre-defined collective variables that are often difficult to identify. A recent generative model, LD-FPG, demonstrated that this problem could be bypassed by learning to sample the static equilibrium ensemble as all-atom deformations from a reference structure, establishing a powerful method for all-atom ensemble generation. However, while this approach successfully captures a system’s probable conformations, it does not model the temporal evolution between them. Here we extend LD-FPG with a temporal propagator that operates within the learned latent space and compare three classes: (i) score-guided Langevin dynamics, (ii) Koopman-based linear operators, and (iii) autoregressive neural networks. Within a unified encoder-propagator-decoder framework, we evaluate long-horizon stability, backbone and side-chain ensemble fidelity, and functional free-energy landscapes. Autoregressive neural networks deliver the most robust long rollouts; score-guided Langevin best recovers side-chain thermodynamics when the score is well learned; and Koopman provides an interpretable, lightweight baseline that tends to damp fluctuations. These results clarify the trade-offs among propagators and offer practical guidance for latent-space simulators of all-atom protein dynamics.
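
The three propagator classes can be written in their generic textbook forms as below. This is a sketch, not the paper's notation: the latent state $z_t$, step size $\epsilon$, learned score $s_\theta$, operator $K$, network $f_\phi$, and history length $k$ are symbols assumed here for illustration.

```latex
\begin{align}
\text{(i) score-guided Langevin:} \quad
  & z_{t+1} = z_t + \epsilon\, s_\theta(z_t) + \sqrt{2\epsilon}\,\xi_t,
    \qquad \xi_t \sim \mathcal{N}(0, I), \\
\text{(ii) Koopman linear operator:} \quad
  & z_{t+1} = K z_t \;\;\Rightarrow\;\; z_t = K^{t} z_0, \\
\text{(iii) autoregressive network:} \quad
  & z_{t+1} = f_\phi(z_t, z_{t-1}, \dots, z_{t-k+1}).
\end{align}
```

Form (ii) suggests one way to read the reported damping: along any eigenvector of $K$ with eigenvalue $|\lambda| < 1$, the component of $z_t$ shrinks as $\lambda^{t}$, so a purely linear rollout with such modes loses fluctuation amplitude over long horizons.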

[AI-216] Synesthesia of Machines (SoM)-Based Task-Driven MIMO System for Image Transmission

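[Quick Read]: This paper addresses how networked mobile agents can transmit sensory data efficiently and robustly for cooperative perception (CP) in dynamic scenarios. Deep learning-based joint source-channel coding (JSCC) outperforms traditional rule-based codecs under adverse channel conditions, but work combining it with multiple-input multiple-output (MIMO) technology is still limited to the discrete-time analog transmission (DTAT) model and simple tasks, which falls short for complex CP tasks over digital MIMO systems. The key to the solution is SoM-MIMO, a task-driven MIMO system based on Synesthesia of Machines (SoM). It exploits the structural properties of the feature pyramid used in perception tasks and the channel properties of the closed-loop MIMO system to achieve efficient and robust digital MIMO image transmission. With identical communication overhead, it improves average mAP over two JSCC baselines by 6.30 and 10.48 across all signal-to-noise ratio (SNR) levels.

Link: https://arxiv.org/abs/2509.02031
Authors: Sijiang Li, Rongqing Zhang, Xiang Cheng, Jian Tang
Affiliation: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: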

Abstract:To support cooperative perception (CP) of networked mobile agents in dynamic scenarios, the efficient and robust transmission of sensory data is a critical challenge. Deep learning-based joint source-channel coding (JSCC) has demonstrated promising results for image transmission under adverse channel conditions, outperforming traditional rule-based codecs. While recent works have explored to combine JSCC with the widely adopted multiple-input multiple-output (MIMO) technology, these approaches are still limited to the discrete-time analog transmission (DTAT) model and simple tasks. Given the limited performance of existing MIMO JSCC schemes in supporting complex CP tasks for networked mobile agents with digital MIMO communication systems, this paper presents a Synesthesia of Machines (SoM)-based task-driven MIMO system for image transmission, referred to as SoM-MIMO. By leveraging the structural properties of the feature pyramid for perceptual tasks and the channel properties of the closed-loop MIMO communication system, SoM-MIMO enables efficient and robust digital MIMO transmission of images. Experimental results have shown that compared with two JSCC baseline schemes, our approach achieves average mAP improvements of 6.30 and 10.48 across all SNR levels, while maintaining identical communication overhead.

[AI-217] Quantum Machine Learning for UAV Swarm Intrusion Detection

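[Quick Read]: This paper addresses intrusion detection in unmanned aerial vehicle (UAV) swarm networks, which is complicated by high mobility, non-stationary traffic, and severe class imbalance. The key to its approach is a systematic benchmark of three quantum machine learning (QML) methods against strong classical baselines: quantum kernels, variational quantum neural networks (QNNs), and hybrid quantum-trained neural networks (QT-NNs). Using a 120k-flow simulation corpus covering five attack types and one shared preprocessing pipeline, the study examines how encoding strategy, circuit depth, qubit count, and shot noise affect accuracy, macro-F1, ROC-AUC, Matthews correlation, and quantum resource footprint. Quantum kernels and QT-NNs do best in low-data, nonlinear regimes, deeper QNNs suffer from trainability issues, and convolutional neural networks (CNNs) dominate when data is abundant.

Link: https://arxiv.org/abs/2509.01812
Authors: Kuan-Cheng Chen, Samuel Yen-Chi Chen, Tai-Yue Li, Chen-Yu Liu, Kin K. Leung
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: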

Abstract:Intrusion detection in unmanned-aerial-vehicle (UAV) swarms is complicated by high mobility, non-stationary traffic, and severe class imbalance. Leveraging a 120 k-flow simulation corpus that covers five attack types, we benchmark three quantum-machine-learning (QML) approaches - quantum kernels, variational quantum neural networks (QNNs), and hybrid quantum-trained neural networks (QT-NNs) - against strong classical baselines. All models consume an 8-feature flow representation and are evaluated under identical preprocessing, balancing, and noise-model assumptions. We analyse the influence of encoding strategy, circuit depth, qubit count, and shot noise, reporting accuracy, macro-F1, ROC-AUC, Matthews correlation, and quantum-resource footprints. Results reveal clear trade-offs: quantum kernels and QT-NNs excel in low-data, nonlinear regimes, while deeper QNNs suffer from trainability issues, and CNNs dominate when abundant data offset their larger parameter count. The complete codebase and dataset partitions are publicly released to enable reproducible QML research in network security.
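
The abstract reports Matthews correlation alongside accuracy under severe class imbalance, and a small worked example may help show why the two can diverge. The sketch below is a binary simplification (the paper's task covers five attack types), and the confusion-matrix counts are made up for illustration rather than taken from the paper.

```python
import math

def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all flows classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

def matthews_corrcoef(tp: int, fp: int, fn: int, tn: int) -> float:
    """Binary Matthews correlation coefficient; 0.0 when any marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical counts: 5 attack flows and 95 benign flows, with most attacks missed.
tp, fp, fn, tn = 1, 1, 4, 94
acc = accuracy(tp, fp, fn, tn)
mcc = matthews_corrcoef(tp, fp, fn, tn)
print(f"accuracy={acc:.3f}, mcc={mcc:.3f}")
```

With these illustrative counts the accuracy looks high because benign flows dominate, while the correlation coefficient stays low because four of the five attacks are missed.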

[AI-218] AHAMask: Reliable Task Specification for Large Audio Language Models without Instructions

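[Quick Read]: This paper addresses instruction sensitivity in current large audio language models (LALMs), where different instructions expressing the same intent can produce markedly different outputs. The key to the solution is AHAMask, which masks a subset of attention heads in the decoder-only large language model (LLM) backbone to trigger specific acoustic task functionalities without explicit instructions. The masks are obtained by training on the LALM, and the number of trainable parameters equals the number of attention heads in the LLM backbone. This gives efficient, instruction-free task specification with performance comparable to or better than instructions, and it reveals that LALMs contain exploitable "functional pathways".

Link: https://arxiv.org/abs/2509.01787
Authors: Yiwei Guo, Bohan Li, Hankun Wang, Zhihan Li, Shuai Wang, Xie Chen, Kai Yu
Affiliation: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: 15 pages, 7 tables, 6 figures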

Abstract:Although current large audio language models (LALMs) extend text large language models (LLMs) with generic acoustic understanding abilities, they usually suffer from instruction sensitivity, where different instructions of the same intention can yield drastically different outcomes. In this work, we propose AHAMask, where we simply mask some of the attention heads in the decoder-only LLM backbone of LALMs, to trigger specific acoustic task functionalities without instructions. These masks are efficiently obtained by training on an LALM, with the number of trainable parameters equal to the attention head count in its LLM backbone. We show by experiments that applying such selective attention head masks achieves comparable or even better performance than using instructions, either on single or composite tasks. Besides achieving reliable acoustic task specification for LALMs, this also reveals that LALMs exhibit certain “functional pathways” in their attention heads.
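
To make the idea of masking attention heads concrete, here is a minimal sketch assuming a binary mask with one entry per head. The function name, the toy head outputs, and the hand-picked mask are all hypothetical; the abstract states that the masks are learned by training, and that procedure is not shown here.

```python
from typing import List

def apply_head_mask(head_outputs: List[List[float]], mask: List[int]) -> List[float]:
    """Zero out masked heads, then concatenate all heads as multi-head attention does."""
    if len(head_outputs) != len(mask):
        raise ValueError("one mask entry per head is required")
    merged: List[float] = []
    for out, keep in zip(head_outputs, mask):
        # keep is 1 (head active) or 0 (head silenced for this task)
        merged.extend(x * keep for x in out)
    return merged

# Three toy heads, each producing a 2-dimensional output for one token.
heads = [[0.5, -1.0], [2.0, 0.25], [1.5, 3.0]]
task_mask = [1, 0, 1]  # hypothetical mask: silence the second head
masked = apply_head_mask(heads, task_mask)
print(masked)
```

The number of mask entries equals the number of heads, which mirrors the abstract's statement that the trainable parameter count equals the attention head count.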

[AI-219] Non-Identical Diffusion Models in MIMO-OFDM Channel Generation

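[Quick Read]: This paper addresses the drop in generation quality caused by unevenly reliable initial estimates in wireless orthogonal frequency division multiplexing (OFDM) channel generation. In multiple-input multiple-output (MIMO) OFDM systems in particular, the pilot allocation makes the initial channel estimation error differ significantly across subcarriers. Conventional diffusion models use a scalar time index to represent a global noise level, so they cannot capture element-wise noise variation and perform poorly when the initialization is strongly biased. The key to the solution is a non-identical diffusion mechanism: a matrix matching the input dimensions serves as an element-wise time embedding, so the noise progression of each channel element (for example, each subcarrier) is controlled independently. This models local reliability differences more accurately and improves the accuracy and robustness of the generated channels.

Link: https://arxiv.org/abs/2509.01641
Authors: Yuzhi Yang, Omar Alhussein, Mérouane Debbah
Affiliation: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: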

Abstract:We propose a novel diffusion model, termed the non-identical diffusion model, and investigate its application to wireless orthogonal frequency division multiplexing (OFDM) channel generation. Unlike the standard diffusion model that uses a scalar-valued time index to represent the global noise level, we extend this notion to an element-wise time indicator to capture local error variations more accurately. Non-identical diffusion enables us to characterize the reliability of each element (e.g., subcarriers in OFDM) within the noisy input, leading to improved generation results when the initialization is biased. Specifically, we focus on the recovery of wireless multi-input multi-output (MIMO) OFDM channel matrices, where the initial channel estimates exhibit highly uneven reliability across elements due to the pilot scheme. Conventional time embeddings, which assume uniform noise progression, fail to capture such variability across pilot schemes and noise levels. We introduce a matrix that matches the input size to control element-wise noise progression. Following a similar diffusion procedure to existing methods, we show the correctness and effectiveness of the proposed non-identical diffusion scheme both theoretically and numerically. For MIMO-OFDM channel generation, we propose a dimension-wise time embedding strategy. We also develop and evaluate multiple training and generation methods and compare them through numerical experiments.
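
The contrast between a scalar time index and an element-wise one can be sketched with a standard forward-noising step. The code below is an illustration under stated assumptions, not the authors' implementation: it uses real-valued entries (actual channel matrices are complex), a cosine schedule chosen only for convenience, and hypothetical names throughout.

```python
import math
import random
from typing import List

Matrix = List[List[float]]

def alpha_bar(t: float) -> float:
    """Signal-retention factor for t in [0, 1]; the cosine schedule is an assumed choice."""
    return math.cos(0.5 * math.pi * t) ** 2

def noise_elementwise(x0: Matrix, t: Matrix, rng: random.Random) -> Matrix:
    """Noise each entry of x0 according to its own time index t[i][j]."""
    out: Matrix = []
    for row_x, row_t in zip(x0, t):
        out_row = []
        for x, ti in zip(row_x, row_t):
            a = alpha_bar(ti)
            eps = rng.gauss(0.0, 1.0)
            out_row.append(math.sqrt(a) * x + math.sqrt(1.0 - a) * eps)
        out.append(out_row)
    return out

# Column 0 plays the role of reliable (pilot-like) entries with t = 0;
# column 1 plays the role of unreliable entries with a large time index.
x0 = [[1.0, -2.0], [0.5, 3.0]]
t = [[0.0, 0.8], [0.0, 0.8]]
noisy = noise_elementwise(x0, t, random.Random(0))
print(noisy)
```

A standard diffusion model corresponds to the special case where every entry of `t` holds the same value.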

[AI-220] Enabling Down Syndrome Research through a Knowledge Graph-Driven Analytical Framework

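[Quick Read]: This paper addresses the difficulty of comprehensive analysis and translational research in Down syndrome caused by heterogeneous and fragmented data. The key to the solution is a knowledge graph-based integrated analysis platform that transforms data from nine INCLUDE studies, covering 7,148 participants, 456 conditions, 501 phenotypes, and more than 37,000 biospecimens, into a unified semantic infrastructure. Cross-resource enrichment with Monarch Initiative data extends coverage to 4,281 genes and 7,077 variants, producing a knowledge graph with more than 1.6 million semantic associations. The graph supports AI-driven hypothesis generation and multidimensional exploration through graph embeddings and path-based reasoning, turning static data repositories into dynamic discovery environments.

Link: https://arxiv.org/abs/2509.01565
Authors: Madan Krishnamurthy, Surya Saha, Pierrette Lo, Patricia L. Whetzel, Tursynay Issabekova, Jamed Ferreris Vargas, Jack DiGiovanna, Melissa A Haendel
Affiliation: Unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
Comments: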

Abstract:Trisomy 21 results in Down syndrome, a multifaceted genetic disorder with diverse clinical phenotypes, including heart defects, immune dysfunction, neurodevelopmental differences, and early-onset dementia risk. Heterogeneity and fragmented data across studies challenge comprehensive research and translational discovery. The NIH INCLUDE (INvestigation of Co-occurring conditions across the Lifespan to Understand Down syndromE) initiative has assembled harmonized participant-level datasets, yet realizing their potential requires integrative analytical frameworks. We developed a knowledge graph-driven platform transforming nine INCLUDE studies, comprising 7,148 participants, 456 conditions, 501 phenotypes, and over 37,000 biospecimens, into a unified semantic infrastructure. Cross-resource enrichment with Monarch Initiative data expands coverage to 4,281 genes and 7,077 variants. The resulting knowledge graph contains over 1.6 million semantic associations, enabling AI-ready analysis with graph embeddings and path-based reasoning for hypothesis generation. Researchers can query the graph via SPARQL or natural language interfaces. This framework converts static data repositories into dynamic discovery environments, supporting cross-study pattern recognition, predictive modeling, and systematic exploration of genotype-phenotype relationships in Down syndrome.

[AI-221] NoLBERT: A No Lookahead(back) Foundational Language Model for Empirical Research

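[Quick Read]: This paper addresses the lookback bias and lookahead bias that conventional language models introduce into social science research, especially economics and finance, because the time span of their training data is not clearly bounded; these biases undermine the validity of econometric inference. The key to the solution is NoLBERT, a lightweight, timestamped foundational language model pre-trained only on text from 1976 to 1995, which ensures temporal consistency and avoids information leakage. NoLBERT outperforms domain-specific baselines on natural language processing (NLP) benchmarks. Applied to patent texts, it enables the construction of firm-level innovation networks and shows that gains in innovation centrality predict long-run profit growth.

Link: https://arxiv.org/abs/2509.01110
Authors: Ali Kakhbod, Peiyao Li
Affiliation: Unknown
Categories: General Economics (econ.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); General Finance (q-fin.GN)
Comments: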

Abstract:We present NoLBERT, a lightweight, timestamped foundational language model for empirical research in social sciences, particularly in economics and finance. By pre-training exclusively on 1976-1995 text, NoLBERT avoids both lookback and lookahead biases that can undermine econometric inference. It exceeds domain-specific baselines on NLP benchmarks while maintaining temporal consistency. Applied to patent texts, NoLBERT enables the construction of firm-level innovation networks and shows that gains in innovation centrality predict higher long-run profit growth.

[AI-222] DRetNet: A Novel Deep Learning Framework for Diabetic Retinopathy Diagnosis

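[Quick Read]: This paper addresses three key problems in automated diabetic retinopathy (DR) detection systems: weak handling of low-quality images, lack of interpretability, and insufficient integration of domain knowledge. The core of the solution is a framework combining three techniques: (1) adaptive retinal image enhancement based on physics-informed neural networks (PINNs), which uses physical constraints to improve the visibility of key features such as microaneurysms, hemorrhages, and exudates; (2) a Hybrid Feature Fusion Network (HFFN), which combines deep learning embeddings with handcrafted features to merge data-driven representations with domain knowledge and improve generalization and accuracy; (3) a multi-stage classifier with uncertainty quantification, which decomposes classification into logical stages and outputs interpretable predictions with confidence scores, increasing clinical trust. The framework is reported to perform robustly under a range of difficult conditions, improving the accuracy and practicality of DR detection.

Link: https://arxiv.org/abs/2509.01072
Authors: Idowu Paul Okuwobi, Jingyuan Liu, Jifeng Wan, Jiaojiao Jiang
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
Comments: 12 pages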

Abstract:Diabetic retinopathy (DR) is a leading cause of blindness worldwide, necessitating early detection to prevent vision loss. Current automated DR detection systems often struggle with poor-quality images, lack interpretability, and insufficient integration of domain-specific knowledge. To address these challenges, we introduce a novel framework that integrates three innovative contributions: (1) Adaptive Retinal Image Enhancement Using Physics-Informed Neural Networks (PINNs): this technique dynamically enhances retinal images by incorporating physical constraints, improving the visibility of critical features such as microaneurysms, hemorrhages, and exudates; (2) Hybrid Feature Fusion Network (HFFN): by combining deep learning embeddings with handcrafted features, HFFN leverages both learned representations and domain-specific knowledge to enhance generalization and accuracy; (3) Multi-Stage Classifier with Uncertainty Quantification: this method breaks down the classification process into logical stages, providing interpretable predictions and confidence scores, thereby improving clinical trust. The proposed framework achieves an accuracy of 92.7%, a precision of 92.5%, a recall of 92.6%, an F1-score of 92.5%, an AUC of 97.8%, a mAP of 0.96, and an MCC of 0.85. Ophthalmologists rated the framework’s predictions as highly clinically relevant (4.8/5), highlighting its alignment with real-world diagnostic needs. Qualitative analyses, including Grad-CAM visualizations and uncertainty heatmaps, further enhance the interpretability and trustworthiness of the system. The framework demonstrates robust performance across diverse conditions, including low-quality images, noisy data, and unseen datasets. These features make the proposed framework a promising tool for clinical adoption, enabling more accurate and reliable DR detection in resource-limited settings.

[AI-223] An Economy of AI Agents

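[Quick Read]: This paper examines the deployment across the economy of AI agents capable of long-horizon planning and execution of complex tasks, and their effect on markets, organizations, and institutions. The central question is how, as technologies such as generative AI advance, AI agents will interact with humans and with other agents, how they will reshape market structure and operation, and what institutional arrangements will be needed for markets to function well. The approach is to survey recent progress in task planning and autonomous execution and to identify open questions that economists need to address, including human-AI collaboration, multi-agent strategic behavior, and adaptive institutional design, providing a theoretical framework and policy guidance for a future AI-driven economy.

Link: https://arxiv.org/abs/2509.01063
Authors: Gillian K. Hadfield, Andrew Koh
Affiliation: Unknown
Categories: General Economics (econ.GN); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: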

Abstract:In the coming decade, artificially intelligent agents with the ability to plan and execute complex tasks over long time horizons with little direct oversight from humans may be deployed across the economy. This chapter surveys recent developments and highlights open questions for economists around how AI agents might interact with humans and with each other, shape markets and organizations, and what institutions might be required for well-functioning markets.

[AI-224] Q-Learning–Driven Adaptive Rewiring for Cooperative Control in Heterogeneous Networks

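[Quick Read]: This paper treats the emergence of cooperation in multi-agent systems as a statistical physics problem: how microscopic learning rules drive macroscopic transitions in collective behavior. The core challenge is to understand how, in heterogeneous networks, agents' strategy optimization and social rewiring jointly produce cooperative clusters. The key to the solution is a Q-learning-based adaptive rewiring mechanism that combines temporal difference learning with network restructuring, so agents dynamically optimize both their strategies and their connections based on interaction histories. Through neighbor-specific Q-learning, agents develop sophisticated partnership management strategies that lead to self-organized cooperative clusters. On power-law networks the study identifies three behavioral regimes (permissive, intermediate with sensitive dependence on dilemma strength, and patient), suggesting that machine learning can act as a new driving force for spontaneous organization in complex adaptive systems.

Link: https://arxiv.org/abs/2509.01057
Authors: Yi-Ning Weng, Hsuan-Wei Lee
Affiliation: Unknown
Categories: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI)
Comments: 40 pages, 9 figures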

Abstract:Cooperation emergence in multi-agent systems represents a fundamental statistical physics problem where microscopic learning rules drive macroscopic collective behavior transitions. We propose a Q-learning-based variant of adaptive rewiring that builds on mechanisms studied in the literature. This method combines temporal difference learning with network restructuring so that agents can optimize strategies and social connections based on interaction histories. Through neighbor-specific Q-learning, agents develop sophisticated partnership management strategies that enable cooperator cluster formation, creating spatial separation between cooperative and defective regions. Using power-law networks that reflect real-world heterogeneous connectivity patterns, we evaluate emergent behaviors under varying rewiring constraint levels, revealing distinct cooperation patterns across parameter space rather than sharp thermodynamic transitions. Our systematic analysis identifies three behavioral regimes: a permissive regime (low constraints) enabling rapid cooperative cluster formation, an intermediate regime with sensitive dependence on dilemma strength, and a patient regime (high constraints) where strategic accumulation gradually optimizes network structure. Simulation results show that while moderate constraints create transition-like zones that suppress cooperation, fully adaptive rewiring enhances cooperation levels through systematic exploration of favorable network configurations. Quantitative analysis reveals that increased rewiring frequency drives large-scale cluster formation with power-law size distributions. Our results establish a new paradigm for understanding intelligence-driven cooperation pattern formation in complex adaptive systems, revealing how machine learning serves as an alternative driving force for spontaneous organization in multi-agent networks.
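
A neighbor-specific temporal difference update can be sketched as a Q-table keyed by (neighbor, action). This is only an illustration of the general mechanism: the two actions, the reward values, the learning rate, and the discount factor are all assumptions, and the paper's actual state, action, and payoff definitions are not reproduced here.

```python
from collections import defaultdict

ACTIONS = ("keep", "rewire")  # hypothetical per-link actions

class NeighborQ:
    """Tabular Q-learning with one set of action values per neighbor."""

    def __init__(self, alpha: float = 0.1, gamma: float = 0.9):
        self.alpha = alpha  # learning rate (assumed value)
        self.gamma = gamma  # discount factor (assumed value)
        self.q = defaultdict(float)

    def update(self, neighbor: str, action: str, reward: float) -> None:
        """Temporal difference update; the next state is taken to be the same neighbor."""
        best_next = max(self.q[(neighbor, a)] for a in ACTIONS)
        key = (neighbor, action)
        td_target = reward + self.gamma * best_next
        self.q[key] += self.alpha * (td_target - self.q[key])

    def best_action(self, neighbor: str) -> str:
        return max(ACTIONS, key=lambda a: self.q[(neighbor, a)])

# Toy history: staying linked to neighbor "j" keeps paying badly,
# while cutting the link and rewiring pays slightly better.
agent = NeighborQ()
for _ in range(20):
    agent.update("j", "keep", reward=-1.0)
    agent.update("j", "rewire", reward=0.5)
print(agent.best_action("j"))
```

Keeping separate values per neighbor is what lets an agent drop one unrewarding partner while retaining others.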

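[AI-225] Quantum Causality: Resolving Simpson's Paradox with $\mathcal{DO}$-Calculus

[Quick Read]: This paper addresses the fundamental challenge of distinguishing correlation from causation in machine intelligence, a key barrier to building robust and trustworthy systems. The core of the solution is a quantum algorithmic framework for causal interventions, proposed and validated experimentally: causal networks are mapped onto quantum circuits, probabilistic links are encoded with controlled-rotation gates, and interventions are implemented by restructuring the circuit, the physical counterpart of Pearl's "graph surgery". The method resolves Simpson's Paradox in a 3-qubit model and quantifies confounding bias in a 10-qubit healthcare simulation. A proof-of-principle experiment on an IonQ Aria quantum computer demonstrates feasibility under real-world noise and offers a practical path for quantum causal inference.

Link: https://arxiv.org/abs/2509.00744
Authors: Pilsung Kang
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: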

Abstract: Distinguishing correlation from causation is a fundamental challenge in machine intelligence, often representing a critical barrier to building robust and trustworthy systems. While Pearl's $\mathcal{DO}$-calculus provides a rigorous framework for causal inference, a parallel challenge lies in its physical implementation. Here, we apply and experimentally validate a quantum algorithmic framework for performing causal interventions. Our approach maps causal networks onto quantum circuits where probabilistic links are encoded by controlled-rotation gates, and interventions are realized by a structural remodeling of the circuit, a physical analogue to Pearl's "graph surgery". We demonstrate the method's efficacy by resolving Simpson's Paradox in a 3-qubit model, and show its scalability by quantifying confounding bias in a 10-qubit healthcare simulation. Critically, we provide a proof-of-principle experimental validation on an IonQ Aria quantum computer, successfully reproducing the paradox and its resolution in the presence of real-world noise. This work establishes a practical pathway for quantum causal inference, offering a new computational tool to address deep-rooted challenges in algorithmic fairness and explainable AI (XAI).
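
For readers unfamiliar with the paradox itself, a purely classical sketch may help: the intervention replaces the observed conditional with the backdoor adjustment $P(y \mid do(x)) = \sum_z P(y \mid x, z)\,P(z)$. The code below uses the widely cited kidney-stone treatment counts as a textbook illustration; these numbers are not from the paper, and nothing here models the quantum circuit.

```python
from fractions import Fraction

# (treatment, stone_size) -> (recoveries, patients); textbook kidney-stone example
counts = {
    ("A", "small"): (81, 87),
    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),
    ("B", "large"): (55, 80),
}
SIZES = ("small", "large")
TREATMENTS = ("A", "B")

def naive_rate(treatment: str) -> Fraction:
    """Observed recovery rate P(recovery | treatment), ignoring the confounder."""
    ok = sum(counts[(treatment, z)][0] for z in SIZES)
    n = sum(counts[(treatment, z)][1] for z in SIZES)
    return Fraction(ok, n)

def adjusted_rate(treatment: str) -> Fraction:
    """Backdoor adjustment: sum over z of P(recovery | treatment, z) * P(z)."""
    total = sum(n for _, n in counts.values())
    rate = Fraction(0)
    for z in SIZES:
        p_z = Fraction(sum(counts[(t, z)][1] for t in TREATMENTS), total)
        ok, n = counts[(treatment, z)]
        rate += Fraction(ok, n) * p_z
    return rate

for t in TREATMENTS:
    print(t, float(naive_rate(t)), float(adjusted_rate(t)))
```

In this example the observed rates favor treatment B while the adjusted rates favor treatment A, which is the reversal that an intervention is meant to expose.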

[AI-226] Its-A-Me Quantum Mario: Scalable Quantum Reinforcement Learning with Multi-Chip Ensembles

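[Quick Read]: This paper addresses the practical limits that quantum reinforcement learning (QRL) faces in the NISQ (noisy intermediate-scale quantum) era, such as limited qubit counts and performance degradation from accumulated noise. The key to the solution is a multi-chip ensemble framework: complex, high-dimensional observations (for example, images from the Super Mario Bros environment) are partitioned across several independent small quantum convolutional neural networks (QCNNs), and the outputs of the quantum modules are aggregated classically within a Double Deep Q-Network (DDQN) framework. This modular architecture improves the feasibility and learning stability of QRL in complex environments, reduces the information loss caused by dimensionality reduction, and remains implementable on near-term quantum hardware.

Link: https://arxiv.org/abs/2509.00713
Authors: Junghoon Justin Park, Huan-Hsin Tseng, Shinjae Yoo, Samuel Yen-Chi Chen, Jiook Cha
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: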

Abstract:Quantum reinforcement learning (QRL) promises compact function approximators with access to vast Hilbert spaces, but its practical progress is slowed by NISQ-era constraints such as limited qubits and noise accumulation. We introduce a multi-chip ensemble framework using multiple small Quantum Convolutional Neural Networks (QCNNs) to overcome these constraints. Our approach partitions complex, high-dimensional observations from the Super Mario Bros environment across independent quantum circuits, then classically aggregates their outputs within a Double Deep Q-Network (DDQN) framework. This modular architecture enables QRL in complex environments previously inaccessible to quantum agents, achieving superior performance and learning stability compared to classical baselines and single-chip quantum models. The multi-chip ensemble demonstrates enhanced scalability by reducing information loss from dimensionality reduction while remaining implementable on near-term quantum hardware, providing a practical pathway for applying QRL to real-world problems.

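[AI-227] NMR-Solver: Automated Structure Elucidation via Large-Scale Spectral Matching and Physics-Guided Fragment Optimization

[Quick Read]: This paper addresses the fact that structure elucidation of small organic molecules depends on expert experience and is time-consuming and error-prone, particularly when interpreting nuclear magnetic resonance (NMR) spectra of complex or novel compounds. Existing methods underperform in real-world use, mainly because of algorithmic limitations and the scarcity of high-quality annotated data. The key to the solution is the NMR-Solver framework, which combines large-scale spectral matching with physics-guided fragment-based optimization and exploits atomic-level structure-spectrum relationships to infer structures automatically. It unifies computational NMR analysis, deep learning, and interpretable chemical reasoning in one coherent system, improving generalization and practicality while remaining interpretable, and offers a scalable, automated paradigm for inverse problems in molecular science.

Link: https://arxiv.org/abs/2509.00640
Authors: Yongqi Jin, Jun-Jie Wang, Fanjie Xu, Xiaohong Ji, Zhifeng Gao, Linfeng Zhang, Guolin Ke, Rong Zhu, Weinan E
Affiliation: Unknown
Categories: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI)
Comments: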

Abstract: Nuclear Magnetic Resonance (NMR) spectroscopy is one of the most powerful and widely used tools for molecular structure elucidation in organic chemistry. However, the interpretation of NMR spectra to determine unknown molecular structures remains a labor-intensive and expertise-dependent process, particularly for complex or novel compounds. Although recent methods have been proposed for molecular structure elucidation, they often underperform in real-world applications due to inherent algorithmic limitations and limited high-quality data. Here, we present NMR-Solver, a practical and interpretable framework for the automated determination of small organic molecule structures from $^{1}$H and $^{13}$C NMR spectra. Our method introduces an automated framework for molecular structure elucidation, integrating large-scale spectral matching with physics-guided fragment-based optimization that exploits atomic-level structure-spectrum relationships in NMR. We evaluate NMR-Solver on simulated benchmarks, curated experimental data from the literature, and real-world experiments, demonstrating its strong generalization, robustness, and practical utility in challenging, real-life scenarios. NMR-Solver unifies computational NMR analysis, deep learning, and interpretable chemical reasoning into a coherent system. By incorporating the physical principles of NMR into molecular optimization, it enables scalable, automated, and chemically meaningful molecular identification, establishing a generalizable paradigm for solving inverse problems in molecular science.

[AI-228] A Novel Method to Determine Total Oxidant Concentration Produced by Non-Thermal Plasma Based on Image Processing and Machine Learning

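[Quick Read]: This paper addresses the difficulty of determining total oxidant concentration ([Ox]_tot) in water treated with non-thermal plasma (NTP); conventional titration is imprecise because the reactive species are transient and the procedure is subjective. The key to the solution is a color-based computer analysis (CBCA) method that combines image processing with machine learning (ML). It captures the color change of a KI solution during NTP oxidation, extracts features from the RGB, HSV, and Lab color spaces, and uses linear regression and gradient boosting models for high-accuracy prediction (R² > 0.990). Performance remains comparable after reducing the feature set from nine to four, making [Ox]_tot measurement more objective and efficient.

Link: https://arxiv.org/abs/2509.00479
Authors: Mirkan Emir Sancak, Unal Sen, Ulker Diler Keris-Sen
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This paper will be published later on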

Abstract: Accurate determination of total oxidant concentration ([Ox]_tot) in non-thermal plasma (NTP)-treated aqueous systems remains a critical challenge due to the transient nature of reactive oxygen and nitrogen species and the subjectivity of conventional titration methods used for [Ox]_tot determination. This study introduces a novel, color-based computer analysis (CBCA) method that integrates advanced image processing with machine learning (ML) to quantify colorimetric shifts in potassium iodide (KI) solutions during oxidation. First, a custom-built visual data acquisition system captured high-resolution video of the color transitions in a KI solution during oxidation with an NTP system. The change in [Ox]_tot during the experiments was monitored with a standard titrimetric method. Second, the captured frames were processed using a robust image processing pipeline to extract RGB, HSV, and Lab color features. The extracted features were statistically evaluated, and the results revealed strong linear correlations with the measured [Ox]_tot values, particularly in the saturation (HSV), a and b (Lab), and blue (RGB) channels. Subsequently, the [Ox]_tot measurements and the extracted color features were used to train and validate five ML models. Among them, linear regression and gradient boosting models achieved the highest predictive accuracy (R^2 > 0.990). It was also found that reducing the feature set from nine to four resulted in comparable performance with improved prediction efficiency, especially for gradient boosting. Finally, comparison of the model predictions with real titration measurements revealed that the CBCA system successfully predicts the [Ox]_tot in KI solution with high accuracy (R^2 > 0.998) even with a reduced number of features.
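
The feature-extraction step described above can be sketched with the Python standard library. This is a minimal illustration, assuming a frame is a list of 8-bit RGB pixels; the function names and toy frames are hypothetical, and the Lab features are omitted because the standard library has no Lab conversion.

```python
import colorsys
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]  # 8-bit (R, G, B)

def mean_rgb(frame: List[Pixel]) -> Tuple[float, float, float]:
    """Average color of a frame, scaled to the range [0, 1]."""
    n = len(frame)
    r = sum(p[0] for p in frame) / n
    g = sum(p[1] for p in frame) / n
    b = sum(p[2] for p in frame) / n
    return (r / 255.0, g / 255.0, b / 255.0)

def color_features(frame: List[Pixel]) -> Dict[str, float]:
    """Two of the channels the abstract highlights: HSV saturation and RGB blue."""
    r, g, b = mean_rgb(frame)
    _, saturation, _ = colorsys.rgb_to_hsv(r, g, b)
    return {"saturation": saturation, "blue": b}

# Toy frames: a colorless solution and one that has started to turn yellow.
clear = [(255, 255, 255)] * 4
yellowish = [(255, 255, 255), (255, 255, 0)]
print(color_features(clear))
print(color_features(yellowish))
```

Features like these would then be paired with titration measurements to fit a regression model, a step not shown here.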

[AI-229] Revealing Hidden Precursors to Earthquakes via a Stress-Sensitive Transformation of Seismic Noise

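[Quick Read]: This paper tackles the long-standing problem of earthquake prediction: reliable precursory signals have not been observed in real seismic records, which leaves open whether such precursors are absent in nature or merely hidden in noise. The key to the solution is a stress-sensitive frequency-domain transformation that tracks energy differences between adjacent frequency bands, isolating subtle spectral changes linked to evolving shear and normal stress. Applied to laboratory acoustic emission data and to seismic records of seven major earthquakes (moment magnitude Mw 5.9-9.0), the method identifies precursory features such as arc-like trajectories and accelerations toward extrema, emerging hours to days before rupture and robust across diverse tectonic settings. The authors conclude that hidden precursors are encoded in ambient seismic noise, opening a path toward real-time fault monitoring and actionable short-term earthquake forecasting.

Link: https://arxiv.org/abs/2509.00268
Authors: Nader Shakibay Senobari
Affiliation: Unknown
Categories: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: 20 pages, 7 figures. GitHub code included. Submitted to Science Advances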

Abstract:Earthquake prediction has long been one of the most elusive challenges in science. Laboratory experiments and simulations suggest that failure precursors should exist, yet reliable signals have remained unobserved in real-world seismic records, leaving open the question of whether they are absent in nature or simply hidden within noise. Here we introduce a stress-sensitive frequency-domain transformation that tracks energy differences between adjacent frequency bands, isolating subtle spectral changes linked to evolving shear and normal stress. Applied to both laboratory acoustic emission data and seismic records from seven major earthquakes (Mw 5.9-9.0), including the 2011 Tohoku and 2023 Turkey-Syria events, the transform consistently reveals precursory signatures, arc-like trajectories and accelerations toward extrema, emerging hours to days before rupture. These features are robust across diverse tectonic settings, from induced seismicity and volcanic collapse to continental strike-slip and subduction megathrust earthquakes. Our findings demonstrate that hidden precursors are indeed encoded in ambient seismic noise, offering a pathway toward real-time fault monitoring and actionable short-term earthquake forecasting.
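
The abstract does not give the exact form of the transform, so the following is only a generic sketch of "energy differences between adjacent frequency bands": it computes a power spectrum with a plain discrete Fourier transform and subtracts the energy of the lower band from that of the band directly above it. The band edges, function names, and test tones are assumptions made for illustration and should not be read as the author's method.

```python
import cmath
import math
from typing import List

def power_spectrum(signal: List[float]) -> List[float]:
    """Squared DFT magnitudes for bins 0 .. n/2 (direct O(n^2) evaluation)."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(signal))
        spectrum.append(abs(coeff) ** 2)
    return spectrum

def adjacent_band_difference(signal: List[float], split_bin: int, width: int) -> float:
    """Energy in bins [split_bin, split_bin + width) minus energy in the band just below."""
    spec = power_spectrum(signal)
    low = sum(spec[split_bin - width:split_bin])
    high = sum(spec[split_bin:split_bin + width])
    return high - low

n = 64
low_tone = [math.sin(2 * math.pi * 4 * i / n) for i in range(n)]    # energy in bin 4
high_tone = [math.sin(2 * math.pi * 12 * i / n) for i in range(n)]  # energy in bin 12
print(adjacent_band_difference(low_tone, split_bin=8, width=8))
print(adjacent_band_difference(high_tone, split_bin=8, width=8))
```

Tracking such a difference over sliding time windows would give a time series whose sign and trend reflect how energy shifts between neighboring bands.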

[AI-230] Meta-learning ecological priors from large language models explains human learning and decision making

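[Quick Read]: This paper asks whether human learning and decision making can be explained as a principled adaptation to the statistical structure of real-world tasks. Traditional cognitive models often rely on hand-designed heuristics or task-specific parameter tuning and generalize poorly to diverse natural environments. The key to the solution is ecologically rational analysis, a computational framework that combines the normative foundations of rational analysis with ecological grounding. Large language models generate ecologically valid cognitive tasks, and meta-learning derives rational models optimized for these environments, yielding a new class of learning algorithms: Ecologically Rational Meta-learned Inference (ERMI). ERMI internalizes the statistical regularities of naturalistic problem spaces and adapts flexibly to novel situations without explicit parameter updates or hand-crafted heuristics. Across 15 experiments spanning function learning, category learning, and decision making, it captures human behavior and outperforms several established cognitive models.

Link: https://arxiv.org/abs/2509.00116
Authors: Akshay K. Jagadish, Mirko Thalmann, Julian Coda-Forno, Marcel Binz, Eric Schulz
Affiliation: Unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments: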

Abstract:Human cognition is profoundly shaped by the environments in which it unfolds. Yet, it remains an open question whether learning and decision making can be explained as a principled adaptation to the statistical structure of real-world tasks. We introduce ecologically rational analysis, a computational framework that unifies the normative foundations of rational analysis with ecological grounding. Leveraging large language models to generate ecologically valid cognitive tasks at scale, and using meta-learning to derive rational models optimized for these environments, we develop a new class of learning algorithms: Ecologically Rational Meta-learned Inference (ERMI). ERMI internalizes the statistical regularities of naturalistic problem spaces and adapts flexibly to novel situations, without requiring hand-crafted heuristics or explicit parameter updates. We show that ERMI captures human behavior across 15 experiments spanning function learning, category learning, and decision making, outperforming several established cognitive models in trial-by-trial prediction. Our results suggest that much of human cognition may reflect adaptive alignment to the ecological structure of the problems we encounter in everyday life.

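[AI-231] MolErr2Fix: Benchmarking LLM Trustworthiness in Chemistry via Modular Error Detection, Localization, Explanation, and Revision

[Quick Read]: This paper addresses the tendency of large language models (LLMs) to produce chemically inaccurate descriptions in molecular science and their limited ability to recognize and correct errors, which restricts their reliability and robustness in scientific use. The key to the solution is the MolErr2Fix benchmark, which evaluates LLMs on error detection and correction in molecular descriptions. It contains 1,193 fine-grained annotated error instances, each with a quadruple annotation: error type, span location, explanation, and correction. This design requires fine-grained structural and semantic reasoning that is closer to the verification needed in real chemical communication, and it provides a quantifiable, repeatable evaluation framework for improving the trustworthiness of LLMs in chemistry.

Link: https://arxiv.org/abs/2509.00063
Authors: Yuyang Wu, Jinhui Ye, Shuhao Zhang, Lu Dai, Yonatan Bisk, Olexandr Isayev
Affiliation: Unknown
Categories: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI)
Comments: 9 pages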

Abstract: Large Language Models (LLMs) have shown growing potential in molecular sciences, but they often produce chemically inaccurate descriptions and struggle to recognize or justify potential errors. This raises important concerns about their robustness and reliability in scientific applications. To support more rigorous evaluation of LLMs in chemical reasoning, we present the MolErr2Fix benchmark, designed to assess LLMs on error detection and correction in molecular descriptions. Unlike existing benchmarks focused on molecule-to-text generation or property prediction, MolErr2Fix emphasizes fine-grained chemical understanding. It tasks LLMs with identifying, localizing, explaining, and revising potential structural and semantic errors in molecular descriptions. Specifically, MolErr2Fix consists of 1,193 fine-grained annotated error instances. Each instance contains a quadruple annotation, i.e., (error type, span location, explanation, correction). These tasks are intended to reflect the types of reasoning and verification required in real-world chemical communication. Evaluations of current state-of-the-art LLMs reveal notable performance gaps, underscoring the need for more robust chemical reasoning capabilities. MolErr2Fix provides a focused benchmark for evaluating such capabilities and aims to support progress toward more reliable and chemically informed language models. All annotations and an accompanying evaluation API will be publicly released to facilitate future research.

[AI-232] DeepEmoNet: Building Machine Learning Models for Automatic Emotion Recognition in Human Speeches

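[Quick Read]: This paper addresses speech emotion recognition (SER), a challenging problem because it is unclear how human emotions relate to acoustic features of sound such as pitch, loudness, and energy. The key to the solution is to build classification models with machine learning, including support vector machines (SVMs), long short-term memory networks (LSTMs), and convolutional neural networks (CNNs), and to use transfer learning and data augmentation to train them efficiently on a small dataset. The best model, based on ResNet34, reaches an accuracy of 66.7% and an F1 score of 0.631.

Link: https://arxiv.org/abs/2509.00025
Authors: Tai Vu
Affiliation: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: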

Abstract: Speech emotion recognition (SER) has been a challenging problem in spoken language processing research, because it is unclear how human emotions are connected to various components of sounds such as pitch, loudness, and energy. This paper aims to tackle this problem using machine learning. In particular, we built several machine learning models using SVMs, LSTMs, and CNNs to classify emotions in human speech. In addition, by leveraging transfer learning and data augmentation, we efficiently trained our models to attain decent performance on a relatively small dataset. Our best model was a ResNet34 network, which achieved an accuracy of 66.7% and an F1 score of 0.631.

[AI-233] A Fluid Antenna Enabled Physical Layer Key Generation for Next-G Wireless Networks

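[Quick Read]: This paper addresses the significant drop in key generation rate (KGR) that conventional physical layer key generation (PLKG) systems suffer in harsh propagation environments, where channel characteristics deteriorate. The key to the solution is fluid antenna (FA) technology: the dynamically adjustable antenna positions provide an additional spatial degree of freedom (DoF), which is exploited by jointly optimizing the antenna positions and the precoding matrix. The authors first derive a closed-form expression for the KGR of the FA array and perform the joint optimization with a particle swarm optimization (PSO) algorithm. To reduce computational complexity, they then design an alternating optimization (AO) algorithm that combines projected gradient descent (PGD) with PSO. Simulations show that the FA-enabled PLKG system clearly outperforms conventional fixed-position antenna (FPA) and reconfigurable intelligent surface (RIS) schemes; compared with a uniform planar array (UPA), it achieves KGR gains of 35.42% with PSO and 67.73% with AO.

Link: https://arxiv.org/abs/2509.00018
Authors: Jiacheng Guo, Ning Gao, Yiping Zuo, Hao Xu, Shi Jin, Kai Kit Wong
Affiliation: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Information Theory (cs.IT)
Comments: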

Abstract: As a promising physical layer security technique, physical layer key generation (PLKG) enables legitimate users to obtain secret keys from the wireless channel without security infrastructure. However, in harsh propagation environments, the channel characteristics become unsatisfactory and the key generation rate (KGR) deteriorates significantly. In this paper, we propose a novel fluid antenna (FA) enabled PLKG system to address this challenge. Specifically, we first derive the closed-form expression of the KGR for the FA array, and then jointly optimize the precoding matrix and the antenna positions via a particle swarm optimization (PSO) algorithm. Next, to further reduce the computational complexity of the optimization procedure, we develop an alternating optimization (AO) algorithm, which combines projected gradient descent (PGD) and PSO. Simulation results demonstrate that by exploiting the additional spatial degree of freedom (DoF), our FA enabled PLKG system is superior to the benchmarks, such as the conventional fixed-position antenna (FPA) array and the reconfigurable intelligent surface (RIS). It is worth highlighting that, compared to the conventional uniform planar array (UPA), the FA enabled PLKG achieves a 35.42% KGR improvement under the PSO algorithm and a 67.73% KGR improvement under the AO algorithm.

Machine Learning

[LG-0] Understanding sparse autoencoder scaling in the presence of feature manifolds

Link: https://arxiv.org/abs/2509.02565
Authors: Eric J. Michaud, Liv Gorton, Tom McGrath
Categories: Machine Learning (cs.LG)
Comments: 13 pages, 8 figures, short workshop submission

Abstract:Sparse autoencoders (SAEs) model the activations of a neural network as linear combinations of sparsely occurring directions of variation (latents). The ability of SAEs to reconstruct activations follows scaling laws w.r.t. the number of latents. In this work, we adapt a capacity-allocation model from the neural scaling literature (Brill, 2024) to understand SAE scaling, and in particular, to understand how “feature manifolds” (multi-dimensional features) influence scaling behavior. Consistent with prior work, the model recovers distinct scaling regimes. Notably, in one regime, feature manifolds have the pathological effect of causing SAEs to learn far fewer features in data than there are latents in the SAE. We provide some preliminary discussion on whether or not SAEs are in this pathological regime in the wild.

[LG-1] On Transferring Merging and Splitting Task-Oriented Network Digital Twins

Link: https://arxiv.org/abs/2509.02551
Authors: Zifan Zhang, Minghong Fang, Mingzhe Chen, Yuchen Liu
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments: Accepted by IEEE MobiWac 2025

Abstract:The integration of digital twinning technologies is driving next-generation networks toward new capabilities, allowing operators to thoroughly understand network conditions, efficiently analyze valuable radio data, and innovate applications through user-friendly, immersive interfaces. Building on this foundation, network digital twins (NDTs) accurately depict the operational processes and attributes of network infrastructures, facilitating predictive management through real-time analysis and measurement. However, constructing precise NDTs poses challenges, such as integrating diverse data sources, mapping necessary attributes from physical networks, and maintaining scalability for various downstream tasks. Unlike previous works that focused on the creation and mapping of NDTs from scratch, we explore intra- and inter-operations among NDTs within a Unified Twin Transformation (UTT) framework, which uncovers a new computing paradigm for efficient transfer, merging, and splitting of NDTs to create task-oriented twins. By leveraging joint multi-modal and distributed mapping mechanisms, UTT optimizes resource utilization and reduces the cost of creating NDTs, while ensuring twin model consistency. A theoretical analysis of the distributed mapping problem is conducted to establish convergence bounds for this multi-modal gated aggregation process. Evaluations on real-world twin-assisted applications, such as trajectory reconstruction, human localization, and sensory data generation, demonstrate the feasibility and effectiveness of interoperability among NDTs for corresponding task development.

[LG-2] Federated learning over physical channels: adaptive algorithms with near-optimal guarantees

Link: https://arxiv.org/abs/2509.02538
Authors: Rui Zhang, Wenlong Mou
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Signal Processing (eess.SP); Machine Learning (stat.ML)

Abstract:In federated learning, communication cost can be significantly reduced by transmitting the information over the air through physical channels. In this paper, we propose a new class of adaptive federated stochastic gradient descent (SGD) algorithms that can be implemented over physical channels, taking into account both channel noise and hardware constraints. We establish theoretical guarantees for the proposed algorithms, demonstrating convergence rates that are adaptive to the stochastic gradient noise level. We also demonstrate the practical effectiveness of our algorithms through simulation studies with deep learning models.

[LG-3] Is RL fine-tuning harder than regression? A PDE learning approach for diffusion models

Link: https://arxiv.org/abs/2509.02528
Authors: Wenlong Mou
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Statistics Theory (math.ST); Machine Learning (stat.ML)

Abstract:We study the problem of learning the optimal control policy for fine-tuning a given diffusion process, using general value function approximation. We develop a new class of algorithms by solving a variational inequality problem based on the Hamilton-Jacobi-Bellman (HJB) equations. We prove sharp statistical rates for the learned value function and control policy, depending on the complexity and approximation errors of the function class. In contrast to generic reinforcement learning problems, our approach shows that fine-tuning can be achieved via supervised regression, with faster statistical rate guarantees.

[LG-4] MoPEQ: Mixture of Mixed Precision Quantized Experts ICCV

Link: https://arxiv.org/abs/2509.02512
Authors: Krishna Teja Chitty-Venkata, Jie Ye, Murali Emani
Categories: Machine Learning (cs.LG)
Comments: Accepted by ICCV Bivision Workshop 2025

Abstract:Large Language and Vision Models using a Mixture-of-Experts (MoE) architecture pose significant challenges for deployment due to their computational and memory demands. Mixed Precision Quantization assigns different precisions to different layers of an LLM/VLM based on layer sensitivity and importance within the model. In this work, we propose a Post Training Quantization algorithm, MoPEQ, that assigns optimal bit width to each expert. Our method balances accuracy and model size by analyzing each expert’s sensitivity using Hessian trace approximation instead of relying on the activation frequency of the expert. This per-expert granularity approach clusters similar experts to maintain model performance while reducing memory requirements. The experimental results on VLMEvalKit benchmark datasets using State-of-the-art VLMs Deepseek-VL2 -tiny, -small, -base, and MolmoE models demonstrate that our mixed precision quantized MoEs achieve competitive accuracy with substantial improvements in memory footprint compared to uniform-precision baseline methods. We perform a comprehensive study to analyze the impact of expert activation frequency and sensitivity using Hessian trace approximation at both layer-wise and model-wide expert precision allocation of 2, 3, and 4 bits to provide a thorough understanding of mixed precision quantization of VLM-MoEs.

[LG-5] RNN Generalization to Omega-Regular Languages

Link: https://arxiv.org/abs/2509.02491
Authors: Charles Pert, Dalal Alrajeh, Alessandra Russo
Categories: Machine Learning (cs.LG); Formal Languages and Automata Theory (cs.FL); Logic in Computer Science (cs.LO)
Comments: 7 pages, 3 figures. To be published in OVERLAY 2025, 7th International Workshop on Artificial Intelligence and Formal Verification, Logic, Automata, and Synthesis. See this https URL

Abstract:Büchi automata (BAs) recognize $\omega$-regular languages defined by formal specifications like linear temporal logic (LTL) and are commonly used in the verification of reactive systems. However, BAs face scalability challenges when handling and manipulating complex system behaviors. As neural networks are increasingly used to address these scalability challenges in areas like model checking, investigating their ability to generalize beyond training data becomes necessary. This work presents the first study investigating whether recurrent neural networks (RNNs) can generalize to $\omega$-regular languages derived from LTL formulas. We train RNNs on ultimately periodic $\omega$-word sequences to replicate target BA behavior and evaluate how well they generalize to out-of-distribution sequences. Through experiments on LTL formulas corresponding to deterministic automata of varying structural complexity, from 3 to over 100 states, we show that RNNs achieve high accuracy on their target $\omega$-regular languages when evaluated on sequences up to 8$\times$ longer than training examples, with 92.6% of tasks achieving perfect or near-perfect generalization. These results establish the feasibility of neural approaches for learning complex $\omega$-regular languages, suggesting their potential as components in neurosymbolic verification methods.
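
To make the notion of an ultimately periodic $\omega$-word concrete, the sketch below checks whether a deterministic Büchi automaton accepts a word of the form prefix · loop^ω. It is only an illustration of the standard acceptance condition (some accepting state is visited infinitely often), not the authors' data pipeline; the function name `accepts` and the dictionary encoding of the transition function are assumptions made for this example.

```python
from typing import Dict, Hashable, List, Sequence, Set, Tuple

State = Hashable


def accepts(
    delta: Dict[Tuple[State, str], State],
    initial: State,
    accepting: Set[State],
    prefix: Sequence[str],
    loop: Sequence[str],
) -> bool:
    """Return True if a deterministic Buchi automaton accepts prefix . loop^omega."""
    if not loop:
        raise ValueError("loop must be non-empty")
    state = initial
    for sym in prefix:
        state = delta[(state, sym)]
    seen: Dict[State, int] = {}      # state at the start of a loop pass -> pass index
    hit_accepting: List[bool] = []   # per pass: was an accepting state visited?
    while state not in seen:
        seen[state] = len(hit_accepting)
        hit = False
        for sym in loop:
            state = delta[(state, sym)]
            hit = hit or state in accepting
        hit_accepting.append(hit)
    # Passes from seen[state] onward repeat forever, so only they matter.
    return any(hit_accepting[seen[state]:])


# Toy automaton for "infinitely many a" (the LTL formula GF a).
GF_A = {(q, s): (1 if s == "a" else 0) for q in (0, 1) for s in "ab"}
```

For example, `accepts(GF_A, 0, {1}, "b", "ab")` holds because the loop keeps producing `a`, while `accepts(GF_A, 0, {1}, "aaa", "b")` does not.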

[LG-6] HydroGAT: Distributed Heterogeneous Graph Attention Transformer for Spatiotemporal Flood Prediction

Link: https://arxiv.org/abs/2509.02481
Authors: Aishwarya Sarkar, Autrin Hakimi, Xiaoqiong Chen, Hai Huang, Chaoqun Lu, Ibrahim Demir, Ali Jannesari
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Accepted to The 33rd ACM International Conference on Advances in Geographic Information Systems (SIGSPATIAL 25)

Abstract:Accurate flood forecasting remains a challenge for water-resource management, as it demands modeling of local, time-varying runoff drivers (e.g., rainfall-induced peaks, baseflow trends) and complex spatial interactions across a river network. Traditional data-driven approaches, such as convolutional networks and sequence-based models, ignore topological information about the region. Graph Neural Networks (GNNs) propagate information exactly along the river network, which is ideal for learning hydrological routing. However, state-of-the-art GNN-based flood prediction models collapse pixels to coarse catchment polygons as the cost of training explodes with graph size and higher resolution. Furthermore, most existing methods treat spatial and temporal dependencies separately, either applying GNNs solely on spatial graphs or transformers purely on temporal sequences, thus failing to simultaneously capture spatiotemporal interactions critical for accurate flood prediction. We introduce a heterogeneous basin graph where every land and river pixel is a node connected by physical hydrological flow directions and inter-catchment relationships. We propose HydroGAT, a spatiotemporal network that adaptively learns local temporal importance and the most influential upstream locations. Evaluated in two Midwestern US basins and across five baseline architectures, our model achieves higher NSE (up to 0.97), improved KGE (up to 0.96), and low bias (PBIAS within $\pm$5%) in hourly discharge prediction, while offering interpretable attention maps that reveal sparse, structured intercatchment influences. To support high-resolution basin-scale training, we develop a distributed data-parallel pipeline that scales efficiently up to 64 NVIDIA A100 GPUs on the NERSC Perlmutter supercomputer, demonstrating up to 15x speedup across machines. Our code is available at this https URL.

[LG-7] SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

Link: https://arxiv.org/abs/2509.02479
Authors: Zhenghai Xue, Longtao Zheng, Qian Liu, Yingru Li, Xiaosen Zheng, Zejun Ma, Bo An
Categories: Machine Learning (cs.LG)

Abstract:Large Language Models (LLMs) can significantly improve their reasoning capabilities by interacting with external tools, a paradigm known as Tool-Integrated Reasoning (TIR). However, extending TIR to multi-turn scenarios using Reinforcement Learning (RL) is often hindered by training instability and performance collapse. We identify that such instability is primarily caused by a distributional drift from external tool feedback, leading to the generation of low-probability tokens. This issue compounds over successive turns, causing catastrophic gradient norm explosions that derail the training process. To address this challenge, we introduce SimpleTIR, a plug-and-play algorithm that stabilizes multi-turn TIR training. Its core strategy is to identify and filter out trajectories containing void turns, i.e., turns that yield neither a code block nor a final answer. By removing these problematic trajectories from the policy update, SimpleTIR effectively blocks the harmful, high-magnitude gradients, thus stabilizing the learning dynamics. Extensive experiments show that SimpleTIR achieves state-of-the-art performance on challenging math reasoning benchmarks, notably elevating the AIME24 score from a text-only baseline of 22.1 to 50.5 when starting from the Qwen2.5-7B base model. Furthermore, by avoiding the constraints of supervised fine-tuning, SimpleTIR encourages the model to discover diverse and sophisticated reasoning patterns, such as self-correction and cross-validation.
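
The void-turn filter described in this abstract can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the abstract does not say how code blocks or final answers are detected, so the fenced-block and `\boxed{` markers below, and the names `is_void_turn` and `filter_trajectories`, are assumptions chosen for the example.

```python
import re
from typing import List

# Assumed markers (not from the paper): a fenced code block, or a \boxed{...} answer.
FENCE = "`" * 3
CODE_BLOCK = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)
FINAL_ANSWER = re.compile(r"\\boxed\{")


def is_void_turn(turn: str) -> bool:
    """A turn is void if it yields neither a code block nor a final answer."""
    return not (CODE_BLOCK.search(turn) or FINAL_ANSWER.search(turn))


def filter_trajectories(trajectories: List[List[str]]) -> List[List[str]]:
    """Keep only trajectories in which no turn is void."""
    return [t for t in trajectories if not any(is_void_turn(turn) for turn in t)]
```

In an RL loop, the surviving trajectories would be the only ones passed to the policy update; the dropped ones contribute no gradient.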

[LG-8] ESTM: An Enhanced Dual-Branch Spectral-Temporal Mamba for Anomalous Sound Detection

Link: https://arxiv.org/abs/2509.02471
Authors: Chengyuan Ma, Peng Jia, Hongyue Guo, Wenming Yang
Categories: Sound (cs.SD); Machine Learning (cs.LG)
Comments: Accepted in IEEE Signal Processing Letters 2025

Abstract:The core challenge in industrial equipment anomalous sound detection (ASD) lies in modeling the time-frequency coupling characteristics of acoustic features. Existing modeling methods are limited by local receptive fields, making it difficult to capture long-range temporal patterns and cross-band dynamic coupling effects in machine acoustic features. In this paper, we propose a novel framework, ESTM, which is based on a dual-path Mamba architecture with time-frequency decoupled modeling and utilizes Selective State-Space Models (SSM) for long-range sequence modeling. ESTM extracts rich feature representations from different time segments and frequency bands by fusing enhanced Mel spectrograms and raw audio features, while further improving sensitivity to anomalous patterns through the TriStat-Gating (TSG) module. Our experiments demonstrate that ESTM improves anomalous detection performance on the DCASE 2020 Task 2 dataset, further validating the effectiveness of the proposed method.

[LG-9] Exploring Variational Graph Autoencoders for Distribution Grid Data Generation

Link: https://arxiv.org/abs/2509.02469
Authors: Syed Zain Abbas, Ehimare Okoyomon
Categories: Machine Learning (cs.LG)

Abstract:To address the lack of public power system data for machine learning research in energy networks, we investigate the use of variational graph autoencoders (VGAEs) for synthetic distribution grid generation. Using two open-source datasets, ENGAGE and DINGO, we evaluate four decoder variants and compare generated networks against the original grids using structural and spectral metrics. Results indicate that simple decoders fail to capture realistic topologies, while GCN-based approaches achieve strong fidelity on ENGAGE but struggle on the more complex DINGO dataset, producing artifacts such as disconnected components and repeated motifs. These findings highlight both the promise and limitations of VGAEs for grid synthesis, underscoring the need for more expressive generative models and robust evaluation. We release our models and analysis as open source to support benchmarking and accelerate progress in ML-driven power system research.

[LG-10] VASSO: Variance Suppression for Sharpness-Aware Minimization

Link: https://arxiv.org/abs/2509.02433
Authors: Bingcong Li, Yilang Zhang, Georgios B. Giannakis
Categories: Machine Learning (cs.LG)

Abstract:Sharpness-aware minimization (SAM) has well-documented merits in enhancing generalization of deep neural network models. Accounting for sharpness in the loss function geometry, where neighborhoods of 'flat minima' heighten generalization ability, SAM seeks 'flat valleys' by minimizing the maximum loss provoked by an adversarial perturbation within the neighborhood. Although critical to account for sharpness of the loss function, in practice SAM suffers from 'over-friendly adversaries,' which can curtail the outmost level of generalization. To avoid such 'friendliness,' the present contribution fosters stabilization of adversaries through variance suppression (VASSO). VASSO offers a general approach to provably stabilize adversaries. In particular, when integrating VASSO with SAM, improved generalizability is numerically validated on extensive vision and language tasks. Once applied on top of a computationally efficient SAM variant, VASSO offers a desirable generalization-computation tradeoff.

[LG-11] Learnable Loss Geometries with Mirror Descent for Scalable and Convergent Meta-Learning

Link: https://arxiv.org/abs/2509.02418
Authors: Yilang Zhang, Bingcong Li, Georgios B. Giannakis
Categories: Machine Learning (cs.LG)

Abstract:Utilizing task-invariant knowledge acquired from related tasks as prior information, meta-learning offers a principled approach to learning a new task with limited data records. Sample-efficient adaptation of this prior information is a major challenge facing meta-learning, and plays an important role because it facilitates training the sought task-specific model with just a few optimization steps. Past works deal with this challenge through preconditioning that speeds up convergence of the per-task training. Though effective in representing locally quadratic loss curvatures, simple linear preconditioning can be hardly potent with complex loss geometries. Instead of relying on a quadratic distance metric, the present contribution copes with complex loss metrics by learning a versatile distance-generating function, which induces a nonlinear mirror map to effectively capture and optimize a wide range of loss geometries. With suitable parameterization, this generating function is effected by an expressive neural network that is provably a valid distance. Analytical results establish convergence of not only the proposed method, but also all meta-learning approaches based on preconditioning. To attain gradient norm less than $\epsilon$, the convergence rate of $\mathcal{O}(\epsilon^{-2})$ is on par with standard gradient-based meta-learning methods. Numerical tests on few-shot learning datasets demonstrate the superior empirical performance of the novel algorithm, as well as its rapid per-task convergence, which markedly reduces the number of adaptation steps, hence also accommodating large-scale meta-learning models.

[LG-12] Cache Management for Mixture-of-Experts LLMs – extended version

Link: https://arxiv.org/abs/2509.02408
Authors: Spyros Angelopoulos, Loris Marchal, Adrien Obrecht, Bertrand Simon
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities across a variety of tasks. One of the main challenges towards the successful deployment of LLMs is memory management, since they typically involve billions of parameters. To this end, architectures based on Mixture-of-Experts have been proposed, which aim to reduce the size of the parameters that are activated when producing a token. This raises the equally critical issue of efficiently managing the limited cache of the system, in that frequently used experts should be stored in the fast cache rather than in the slower secondary memory. In this work, we introduce and study a new paging problem that models expert management optimization. Our formulation captures both the layered architecture of LLMs and the requirement that experts are cached efficiently. We first present lower bounds on the competitive ratio of both deterministic and randomized algorithms, which show that under mild assumptions, LRU-like policies have good theoretical competitive performance. We then propose a layer-based extension of LRU that is tailored to the problem at hand. Extensive simulations on both synthetic datasets and actual traces of MoE usage show that our algorithm outperforms policies for the classic paging problem, such as the standard LRU.
Related DOI: https://doi.org/10.1007/978-3-031-99872-0_2
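
As a point of reference for the baseline mentioned in this abstract, the sketch below counts cache misses of standard LRU on a trace of expert requests. It does not implement the paper's layer-based extension, whose details are not given here; encoding each request as a hypothetical `(layer, expert)` pair and the function name `lru_misses` are assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Hashable, Iterable


def lru_misses(trace: Iterable[Hashable], capacity: int) -> int:
    """Count misses of a standard LRU cache of the given capacity on a request trace."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    cache: "OrderedDict[Hashable, None]" = OrderedDict()
    misses = 0
    for item in trace:
        if item in cache:
            cache.move_to_end(item)        # mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used entry
            cache[item] = None
    return misses


# Hypothetical trace: each request is (layer index, expert index).
TRACE = [(0, 1), (1, 3), (0, 1), (1, 4), (0, 2), (1, 3)]
```

A layer-aware policy would presumably use the fact that requests sweep through layers in order, which plain LRU ignores.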

[LG-13] Fisher information flow in artificial neural networks

Link: https://arxiv.org/abs/2509.02407
Authors: Maximilian Weimar, Lukas M. Rachbauer, Ilya Starshynov, Daniele Faccio, Linara Adilova, Dorian Bouchet, Stefan Rotter
Categories: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 17 pages, 12 figures, to be published in Physical Review X

Abstract:The estimation of continuous parameters from measured data plays a central role in many fields of physics. A key tool in understanding and improving such estimation processes is the concept of Fisher information, which quantifies how information about unknown parameters propagates through a physical system and determines the ultimate limits of precision. With Artificial Neural Networks (ANNs) gradually becoming an integral part of many measurement systems, it is essential to understand how they process and transmit parameter-relevant information internally. Here, we present a method to monitor the flow of Fisher information through an ANN performing a parameter estimation task, tracking it from the input to the output layer. We show that optimal estimation performance corresponds to the maximal transmission of Fisher information, and that training beyond this point results in information loss due to overfitting. This provides a model-free stopping criterion for network training, eliminating the need for a separate validation dataset. To demonstrate the practical relevance of our approach, we apply it to a network trained on data from an imaging experiment, highlighting its effectiveness in a realistic physical setting.

[LG-14] Gaming and Cooperation in Federated Learning: What Can Happen and How to Monitor It

Link: https://arxiv.org/abs/2509.02391
Authors: Dongseok Kim, Wonjun Jeong, Gisung Oh
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Machine Learning (stat.ML)
Comments: 51 pages, 7 figures

Abstract:The success of Federated Learning depends on the actions that participants take out of sight. We model Federated Learning not as a mere optimization task but as a strategic system entangled with rules and incentives. From this perspective, we present an analytical framework that makes it possible to clearly identify where behaviors that genuinely improve performance diverge from those that merely target metrics. We introduce two indices that respectively quantify behavioral incentives and collective performance loss, and we use them as the basis for consistently interpreting the impact of operational choices such as rule design, the level of information disclosure, evaluation methods, and aggregator switching. We further summarize thresholds, auto-switch rules, and early warning signals into a checklist that can be applied directly in practice, and we provide both a practical algorithm for allocating limited audit resources and a performance guarantee. Simulations conducted across diverse environments consistently validate the patterns predicted by our framework, and we release all procedures for full reproducibility. While our approach operates most strongly under several assumptions, combining periodic recalibration, randomization, and connectivity-based alarms enables robust application under the variability of real-world operations. We present both design principles and operational guidelines that lower the incentives for metric gaming while sustaining and expanding stable cooperation.

[LG-15] Scaffolding Collaborative Learning in STEM: A Two-Year Evaluation of a Tool-Integrated Project-Based Methodology

Link: https://arxiv.org/abs/2509.02355
Authors: Caterina Fuster-Barcelo, Gonzalo R. Rios-Munoz, Arrate Munoz-Barrutia
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

Abstract:This study examines the integration of digital collaborative tools and structured peer evaluation in the Machine Learning for Health master’s program, through the redesign of a Biomedical Image Processing course over two academic years. The pedagogical framework combines real-time programming with Google Colab, experiment tracking and reporting via Weights & Biases, and rubric-guided peer assessment to foster student engagement, transparency, and fair evaluation. Compared to a pre-intervention cohort, the two implementation years showed increased grade dispersion and higher entropy in final project scores, suggesting improved differentiation and fairness in assessment. The survey results further indicate greater student engagement with the subject and their own learning process. These findings highlight the potential of integrating tool-supported collaboration and structured evaluation mechanisms to enhance both learning outcomes and equity in STEM education.

[LG-16] Extrapolated Markov Chain Oversampling Method for Imbalanced Text Classification

Link: https://arxiv.org/abs/2509.02332
Authors: Aleksi Avela, Pauliina Ilmonen
Categories: Machine Learning (cs.LG)

Abstract:Text classification is the task of automatically assigning text documents correct labels from a predefined set of categories. In real-life (text) classification tasks, observations and misclassification costs are often unevenly distributed between the classes - known as the problem of imbalanced data. Synthetic oversampling is a popular approach to imbalanced classification. The idea is to generate synthetic observations in the minority class to balance the classes in the training set. Many general-purpose oversampling methods can be applied to text data; however, imbalanced text data poses a number of distinctive difficulties that stem from the unique nature of text compared to other domains. One such factor is that when the sample size of text increases, the sample vocabulary (i.e., feature space) is likely to grow as well. We introduce a novel Markov chain based text oversampling method. The transition probabilities are estimated from the minority class but also partly from the majority class, thus allowing the minority feature space to expand in oversampling. We evaluate our approach against prominent oversampling methods and show that our approach is able to produce highly competitive results against the other methods in several real data examples, especially when the imbalance is severe.
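
The idea of estimating transitions from the minority class while borrowing partly from the majority class can be illustrated with a first-order word chain. This is a rough sketch under stated assumptions, not the paper's estimator: the abstract does not give the mixing rule, so the single `weight` parameter that scales majority counts, and the helper names `transition_counts`, `mixed_transitions`, and `generate`, are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List

Doc = List[str]


def transition_counts(docs: List[Doc]) -> Dict[str, Counter]:
    """Count word-to-next-word transitions over tokenized documents."""
    counts: Dict[str, Counter] = {}
    for doc in docs:
        for prev, nxt in zip(doc, doc[1:]):
            counts.setdefault(prev, Counter())[nxt] += 1
    return counts


def mixed_transitions(minority: List[Doc], majority: List[Doc],
                      weight: float) -> Dict[str, Dict[str, float]]:
    """Minority counts plus majority counts scaled by `weight` (assumed mixing rule)."""
    min_c, maj_c = transition_counts(minority), transition_counts(majority)
    mixed: Dict[str, Dict[str, float]] = {}
    for src in set(min_c) | set(maj_c):
        row = Counter(min_c.get(src, Counter()))
        for dst, n in maj_c.get(src, Counter()).items():
            row[dst] += weight * n
        positive = {dst: w for dst, w in row.items() if w > 0}
        if positive:
            mixed[src] = positive
    return mixed


def generate(mixed: Dict[str, Dict[str, float]], start: str,
             length: int, rng: random.Random) -> Doc:
    """Sample a synthetic document by walking the mixed chain from `start`."""
    doc = [start]
    while len(doc) < length and doc[-1] in mixed:
        row = mixed[doc[-1]]
        words = sorted(row)
        doc.append(rng.choices(words, weights=[row[w] for w in words])[0])
    return doc
```

With `weight > 0`, a synthetic minority document can contain words that appear only in majority documents, which is how the minority vocabulary is allowed to grow in this sketch.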

[LG-17] AdaSwitch: An Adaptive Switching Meta-Algorithm for Learning-Augmented Bounded-Influence Problems

Link: https://arxiv.org/abs/2509.02302
Authors: Xi Chen, Yuze Chen, Yuan Zhou
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 62 pages, 7 figures

Abstract:We study a class of multi-period online decision-making problems with sequence-based predictions, which may be generated by machine learning models but whose accuracy is not guaranteed. In each period, the decision-maker observes the realized request and must take an irrevocable action that yields a reward or incurs a cost, without knowledge of future arrivals. We introduce a bounded-influence framework, in which past decisions and requests exert only limited impact on the future optimal reward. Within this framework, we propose the AdaSwitch meta-algorithm, which exploits predictions to attain performance close to the offline benchmark when predictions are accurate, while preserving classical competitive-ratio guarantees under highly inaccurate predictions. Our framework and meta-algorithm apply to diverse settings, including lead-time quotation in processing systems, the $k$-server problem, and online allocation of reusable resources. These applications illustrate the flexibility and broad applicability of our approach to learning-augmented online decision-making.

[LG-18] Balanced Multimodal Learning: An Unidirectional Dynamic Interaction Perspective

Link: https://arxiv.org/abs/2509.02281
Authors: Shijie Wang, Li Zhang, Xinyan Liang, Yuhua Qian, Shen Hu
Categories: Machine Learning (cs.LG); Multimedia (cs.MM)

Abstract:Multimodal learning typically utilizes multimodal joint loss to integrate different modalities and enhance model performance. However, this joint learning strategy can induce modality imbalance, where strong modalities overwhelm weaker ones and limit exploitation of individual information from each modality and the inter-modality interaction information. Existing strategies such as dynamic loss weighting, auxiliary objectives and gradient modulation mitigate modality imbalance based on joint loss. These methods remain fundamentally reactive, detecting and correcting imbalance after it arises, while leaving the competitive nature of the joint loss untouched. This limitation drives us to explore a new strategy for multimodal imbalance learning that does not rely on the joint loss, enabling more effective interactions between modalities and better utilization of information from individual modalities and their interactions. In this paper, we introduce Unidirectional Dynamic Interaction (UDI), a novel strategy that abandons the conventional joint loss in favor of a proactive, sequential training scheme. UDI first trains the anchor modality to convergence, then uses its learned representations to guide the other modality via unsupervised loss. Furthermore, the dynamic adjustment of modality interactions allows the model to adapt to the task at hand, ensuring that each modality contributes optimally. By decoupling modality optimization and enabling directed information flow, UDI prevents domination by any single modality and fosters effective cross-modal feature learning. Our experimental results demonstrate that UDI outperforms existing methods in handling modality imbalance, leading to performance improvement in multimodal learning tasks.

[LG-19] Calibration through the Lens of Indistinguishability

Link: https://arxiv.org/abs/2509.02279
Authors: Parikshit Gopalan, Lunjia Hu
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Machine Learning (stat.ML)
Comments: This is the full version of a survey that appears in the ACM SIGecom Exchanges

Abstract:Calibration is a classical notion from the forecasting literature which aims to address the question: how should predicted probabilities be interpreted? In a world where we only get to observe (discrete) outcomes, how should we evaluate a predictor that hypothesizes (continuous) probabilities over possible outcomes? The study of calibration has seen a surge of recent interest, given the ubiquity of probabilistic predictions in machine learning. This survey describes recent work on the foundational questions of how to define and measure calibration error, and what these measures mean for downstream decision makers who wish to use the predictions to make decisions. A unifying viewpoint that emerges is that of calibration as a form of indistinguishability, between the world hypothesized by the predictor and the real world (governed by nature or the Bayes optimal predictor). In this view, various calibration measures quantify the extent to which the two worlds can be told apart by certain classes of distinguishers or statistical measures.
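
To ground the question of how calibration error can be measured, the sketch below computes the binned expected calibration error (ECE), one widely used measure. It is offered only as a familiar reference point and is not necessarily a measure the survey recommends; the function name and the choice of equal-width bins are assumptions for this example.

```python
from typing import List, Sequence


def expected_calibration_error(probs: Sequence[float], outcomes: Sequence[int],
                               n_bins: int = 10) -> float:
    """Binned ECE: weighted mean gap between predicted probability and observed frequency."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, p in enumerate(probs):
        # Equal-width bins on [0, 1]; p == 1.0 falls into the last bin.
        bins[min(int(p * n_bins), n_bins - 1)].append(i)
    ece = 0.0
    for idx in bins:
        if not idx:
            continue
        avg_prob = sum(probs[i] for i in idx) / len(idx)
        frequency = sum(outcomes[i] for i in idx) / len(idx)
        ece += len(idx) / len(probs) * abs(avg_prob - frequency)
    return ece
```

A predictor that says 0.8 on five events of which four occur has zero binned error, while one that says 0.9 on two events that never occur has error 0.9.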

[LG-20] Speech transformer models for extracting information from baby cries INTERSPEECH2025

Link: https://arxiv.org/abs/2509.02259
Authors: Guillem Bonafos, Jéremy Rouch, Lény Lego, David Reby, Hugues Patural, Nicolas Mathevon, Rémy Emonet
Categories: Sound (cs.SD); Machine Learning (cs.LG); Applications (stat.AP)
Comments: Accepted to WOCCI2025 (interspeech2025 workshop)

Abstract:Transfer learning using latent representations from pre-trained speech models achieves outstanding performance in tasks where labeled data is scarce. However, their applicability to non-speech data and the specific acoustic properties encoded in these representations remain largely unexplored. In this study, we investigate both aspects. We evaluate five pre-trained speech models on eight baby cries datasets, encompassing 115 hours of audio from 960 babies. For each dataset, we assess the latent representations of each model across all available classification tasks. Our results demonstrate that the latent representations of these models can effectively classify human baby cries and encode key information related to vocal source instability and identity of the crying baby. In addition, a comparison of the architectures and training strategies of these models offers valuable insights for the design of future models tailored to similar tasks, such as emotion detection.

[LG-21] DaCe AD: Unifying High-Performance Automatic Differentiation for Machine Learning and Scientific Computing

Link: https://arxiv.org/abs/2509.02197
Authors: Afif Boudaoud, Alexandru Calotoiu, Marcin Copik, Torsten Hoefler
Categories: Machine Learning (cs.LG); Performance (cs.PF); Programming Languages (cs.PL)

Abstract:Automatic differentiation (AD) is a set of techniques that systematically applies the chain rule to compute the gradients of functions without requiring human intervention. Although the fundamentals of this technology were established decades ago, it is experiencing a renaissance as it plays a key role in efficiently computing gradients for backpropagation in machine learning algorithms. AD is also crucial for many applications in scientific computing domains, particularly emerging techniques that integrate machine learning models within scientific simulations and schemes. Existing AD frameworks have four main limitations: limited support of programming languages, requiring code modifications for AD compatibility, limited performance on scientific computing codes, and a naive store-all solution for forward-pass data required for gradient calculations. These limitations force domain scientists to manually compute the gradients for large problems. This work presents DaCe AD, a general, efficient automatic differentiation engine that requires no code modifications. DaCe AD uses a novel ILP-based algorithm to optimize the trade-off between storing and recomputing to achieve maximum performance within a given memory constraint. We showcase the generality of our method by applying it to NPBench, a suite of HPC benchmarks with diverse scientific computing patterns, where we outperform JAX, a Python framework with state-of-the-art general AD capabilities, by more than 92 times on average without requiring any code changes.
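
The statement that AD systematically applies the chain rule can be illustrated with forward-mode dual numbers. This is a deliberately tiny sketch of the general idea and is unrelated to how DaCe AD itself is built (the abstract's store-versus-recompute trade-off concerns gradient computation over forward-pass data, which this example does not model); the `Dual` class and `derivative` helper are names invented for the illustration.

```python
import math
from dataclasses import dataclass
from typing import Callable, Union

Number = Union[int, float]


@dataclass(frozen=True)
class Dual:
    """A value paired with its derivative with respect to one input."""
    value: float
    deriv: float

    def __add__(self, other: Union["Dual", Number]) -> "Dual":
        other = _lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other: Union["Dual", Number]) -> "Dual":
        other = _lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def _lift(x: Union[Dual, Number]) -> Dual:
    """Treat plain numbers as constants (derivative zero)."""
    return x if isinstance(x, Dual) else Dual(float(x), 0.0)


def sin(x: Dual) -> Dual:
    # Chain rule: d/dt sin(u) = cos(u) * u'
    return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)


def derivative(f: Callable[[Dual], Dual], x: float) -> float:
    """Evaluate f'(x) by seeding the input with derivative 1."""
    return f(Dual(x, 1.0)).deriv
```

For instance, `derivative(lambda x: 3 * x * x + sin(x), 2.0)` evaluates 6x + cos(x) at x = 2 without any symbolic manipulation or finite differences.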

[LG-22] Selection of Optimal Number and Location of PMUs for CNN Based Fault Location and Identification

Link: https://arxiv.org/abs/2509.02192
Authors: Khalid Daud Khattak, Muhammad A. Choudhry
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: Paper submitted to 57th North American Power Symposium (NAPS) 2025

Abstract:In this paper, we present a data-driven Forward Selection with Neighborhood Refinement (FSNR) algorithm to determine the number and placement of Phasor Measurement Units (PMUs) for maximizing deep-learning-based fault diagnosis performance. Candidate PMU locations are ranked via a cross-validated Support Vector Machine (SVM) classifier, and each selection is refined through local neighborhood exploration to produce a near-optimal sensor set. The resulting PMU subset is then supplied to a 1D Convolutional Neural Network (CNN) for faulted-line localization and fault-type classification from time-series measurements. Evaluation on modified IEEE 34- and IEEE 123-bus systems demonstrates that the proposed FSNR-SVM method identifies a minimal PMU configuration that achieves the best overall CNN performance, attaining over 96 percent accuracy in fault location and over 99 percent accuracy in fault-type classification on the IEEE 34 system, and approximately 94 percent accuracy in fault location and around 99.8 percent accuracy in fault-type classification on the IEEE 123 system.

[LG-23] Simulating classification models to evaluate Predict-Then-Optimize methods

Link: https://arxiv.org/abs/2509.02191
Authors: Pieter Smet
Subjects: Machine Learning (cs.LG)

Abstract:Uncertainty in optimization is often represented as stochastic parameters in the optimization model. In Predict-Then-Optimize approaches, predictions of a machine learning model are used as values for such parameters, effectively transforming the stochastic optimization problem into a deterministic one. This two-stage framework is built on the assumption that more accurate predictions result in solutions that are closer to the actual optimal solution. However, providing evidence for this assumption in the context of complex, constrained optimization problems is challenging and often overlooked in the literature. Simulating predictions of machine learning models offers a way to (experimentally) analyze how prediction error impacts solution quality without the need to train real models. Complementing an algorithm from the literature for simulating binary classification, we introduce a new algorithm for simulating predictions of multiclass classifiers. We conduct a computational study to evaluate the performance of these algorithms, and show that classifier performance can be simulated with reasonable accuracy, although some variability is observed. Additionally, we apply these algorithms to assess the performance of a Predict-Then-Optimize algorithm for a machine scheduling problem. The experiments demonstrate that the relationship between prediction error and how close solutions are to the actual optimum is non-trivial, highlighting important considerations for the design and evaluation of decision-making systems based on machine learning predictions.
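As a rough illustration of what "simulating predictions" can mean, the following Python sketch produces binary predictions that hit a chosen accuracy by flipping a fixed number of true labels. This is a naive stand-in written for this digest, not the algorithm from the paper or from the literature it cites; the function name `simulate_binary_predictions` and its parameters are hypothetical.

```python
import random

def simulate_binary_predictions(y_true, accuracy, seed=0):
    """Return simulated 0/1 predictions with (approximately) the target accuracy.

    Naive assumption: errors are spread uniformly at random over the samples,
    regardless of class. A real classifier's errors are usually not uniform.
    """
    rng = random.Random(seed)
    n = len(y_true)
    n_wrong = round((1 - accuracy) * n)          # number of labels to flip
    wrong = set(rng.sample(range(n), n_wrong))   # indices of the flipped labels
    return [1 - y if i in wrong else y for i, y in enumerate(y_true)]

y_true = [0, 1] * 5                              # hypothetical ground truth
y_pred = simulate_binary_predictions(y_true, accuracy=0.8)
correct = sum(p == t for p, t in zip(y_pred, y_true))
print(correct / len(y_true))
```

Feeding such simulated predictions into a downstream optimization model is one way to study how solution quality changes with prediction error without training a real classifier.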

[LG-24] Online Identification of IT Systems through Active Causal Learning

Link: https://arxiv.org/abs/2509.02130
Authors: Kim Hammar, Rolf Stadler
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)

Abstract:Identifying a causal model of an IT system is fundamental to many branches of systems engineering and operation. Such a model can be used to predict the effects of control actions, optimize operations, diagnose failures, detect intrusions, etc., which is central to achieving the longstanding goal of automating network and system management tasks. Traditionally, causal models have been designed and maintained by domain experts. This, however, proves increasingly challenging with the growing complexity and dynamism of modern IT systems. In this paper, we present the first principled method for online, data-driven identification of an IT system in the form of a causal model. The method, which we call active causal learning, estimates causal functions that capture the dependencies among system variables in an iterative fashion using Gaussian process regression based on system measurements, which are collected through a rollout-based intervention policy. We prove that this method is optimal in the Bayesian sense and that it produces effective interventions. Experimental validation on a testbed shows that our method enables accurate identification of a causal system model while inducing low interference with system operations.

[LG-25] Threshold-Based Optimal Arm Selection in Monotonic Bandits: Regret Lower Bounds and Algorithms

Link: https://arxiv.org/abs/2509.02119
Authors: Chanakya Varude, Jay Chaudhary, Siddharth Kaushik, Prasanna Chaporkar
Subjects: Machine Learning (cs.LG)

Abstract:In multi-armed bandit problems, the typical goal is to identify the arm with the highest reward. This paper explores a threshold-based bandit problem, aiming to select an arm based on its relation to a prescribed threshold $\tau$. We study variants where the optimal arm is the first above $\tau$, the $k^{\text{th}}$ arm above or below it, or the closest to it, under a monotonic structure of arm means. We derive asymptotic regret lower bounds, showing dependence only on arms adjacent to $\tau$. Motivated by applications in communication networks (CQI allocation), clinical dosing, energy management, recommendation systems, and more, we propose algorithms with optimality validated through Monte Carlo simulations. Our work extends classical bandit theory with threshold constraints for efficient decision-making.
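To make two of the problem variants concrete, here is a minimal Python sketch of how the target arm could be defined when the arm means are already known and sorted in non-decreasing order. It is not taken from the paper: the function names and the example means are hypothetical, and the paper's algorithms have to learn the means from noisy rewards rather than read them off.

```python
from bisect import bisect_right

def first_arm_above(means, tau):
    """Index of the first arm whose mean is strictly above tau, or None.

    Assumes `means` is sorted in non-decreasing order (monotonic structure).
    """
    i = bisect_right(means, tau)  # first index with means[i] > tau
    return i if i < len(means) else None

def closest_arm(means, tau):
    """Index of the arm whose mean is closest to tau (ties go to the lower index)."""
    return min(range(len(means)), key=lambda i: abs(means[i] - tau))

means = [0.1, 0.3, 0.45, 0.6, 0.9]  # hypothetical arm means
print(first_arm_above(means, 0.5))  # arm with mean 0.6
print(closest_arm(means, 0.5))      # arm with mean 0.45
```

In both cases the answer depends only on the arms whose means sit next to the threshold, which matches the abstract's remark that the regret lower bounds depend only on arms adjacent to the threshold.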

[LG-26] Differentiable Expectation-Maximisation and Applications to Gaussian Mixture Model Optimal Transport

Link: https://arxiv.org/abs/2509.02109
Authors: Samuel Boïté, Eloi Tanguy, Julie Delon, Agnès Desolneux, Rémi Flamary
Subjects: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)

Abstract:The Expectation-Maximisation (EM) algorithm is a central tool in statistics and machine learning, widely used for latent-variable models such as Gaussian Mixture Models (GMMs). Despite its ubiquity, EM is typically treated as a non-differentiable black box, preventing its integration into modern learning pipelines where end-to-end gradient propagation is essential. In this work, we present and compare several differentiation strategies for EM, from full automatic differentiation to approximate methods, assessing their accuracy and computational efficiency. As a key application, we leverage this differentiable EM in the computation of the Mixture Wasserstein distance $\mathrm{MW}_2$ between GMMs, allowing $\mathrm{MW}_2$ to be used as a differentiable loss in imaging and machine learning tasks. To complement our practical use of $\mathrm{MW}_2$, we contribute a novel stability result which provides theoretical justification for the use of $\mathrm{MW}_2$ with EM, and also introduce a novel unbalanced variant of $\mathrm{MW}_2$. Numerical experiments on barycentre computation, colour and style transfer, image generation, and texture synthesis illustrate the versatility and effectiveness of the proposed approach in different settings.

[LG-27] DivMerge: A divergence-based model merging method for multi-tasking

Link: https://arxiv.org/abs/2509.02108
Authors: Touayouch Brahim, Fosse Loïc, Damnati Géraldine, Lecorvé Gwénolé
Subjects: Machine Learning (cs.LG)

Abstract:Multi-task learning (MTL) is often achieved by merging datasets before fine-tuning, but the growing availability of fine-tuned models has led to new approaches such as model merging via task arithmetic. A major challenge in this setting is task interference, which worsens as the number of tasks increases. We propose a method that merges models trained on different tasks into a single model, maintaining strong performance across all tasks. Our approach leverages Jensen-Shannon divergence to guide the merging process without requiring additional labelled data, and automatically balances task importance. Unlike existing methods, our approach remains robust as the number of tasks grows and consistently outperforms prior work.
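For readers unfamiliar with the quantity mentioned above, the following Python sketch computes the Jensen-Shannon divergence between two discrete probability distributions. It shows only the divergence itself; the abstract does not say which distributions the method compares or how the divergence steers the merge, so nothing here should be read as the paper's procedure. The function names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats for discrete distributions.

    Terms with p_i == 0 contribute zero by convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = [0.7, 0.2, 0.1]  # hypothetical output distribution of one model
q = [0.1, 0.3, 0.6]  # hypothetical output distribution of another model
print(js_divergence(p, q))
```

Unlike the KL divergence, this quantity is symmetric and bounded above by $\ln 2$, which makes it convenient as a distance-like signal between model outputs.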

[LG-28] Towards Comprehensive Information-theoretic Multi-view Learning

Link: https://arxiv.org/abs/2509.02084
Authors: Long Shi, Yunshan Ye, Wenjie Wang, Tao Lei, Yu Zhao, Gang Kou, Badong Chen
Subjects: Machine Learning (cs.LG)

Abstract:Information theory has inspired numerous advancements in multi-view learning. Most multi-view methods incorporating information-theoretic principles rely on an assumption called multi-view redundancy, which states that common information between views is necessary and sufficient for downstream tasks. This assumption emphasizes the importance of common information for prediction, but inherently ignores the potential of unique information in each view that could be predictive for the task. In this paper, we propose a comprehensive information-theoretic multi-view learning framework named CIML, which discards the assumption of multi-view redundancy. Specifically, CIML considers the potential predictive capabilities of both common and unique information based on information theory. First, the common representation learning maximizes Gács-Körner common information to extract shared features and then compresses this information to learn task-relevant representations based on the Information Bottleneck (IB). For unique representation learning, IB is employed to achieve the most compressed unique representation for each view while simultaneously minimizing the mutual information between unique and common representations, as well as among different unique representations. Importantly, we theoretically prove that the learned joint representation is predictively sufficient for the downstream task. Extensive experimental results have demonstrated the superiority of our model over several state-of-the-art methods. The code is released on CIML.

[LG-29] Abex-rat: Synergizing Abstractive Augmentation and Adversarial Training for Classification of Occupational Accident Reports

Link: https://arxiv.org/abs/2509.02072
Authors: Jian Chen, Jinbao Tian, Yunqi Xu, Zhou Li
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR)

Abstract:The automatic classification of occupational accident reports is a critical research area for enhancing workplace safety and enabling large-scale risk analysis. However, the severe class imbalance inherent in these real-world datasets often compromises the performance of analytical models, particularly for rare but severe incident types, hindering the development of reliable automated systems. To address this challenge, we propose ABEX-RAT, a novel and efficient framework that synergizes generative data augmentation with robust adversarial training. Our approach first employs a two-step abstractive-expansive (ABEX) pipeline, which leverages a large language model to distill core incident semantics and then uses a generative model to create diverse, high-quality synthetic samples for underrepresented classes. Subsequently, a lightweight classifier is trained on the augmented data using a computationally efficient random adversarial training (RAT) protocol, which stochastically applies perturbations to enhance model generalization and robustness without significant overhead. Experimental results on the public OSHA dataset demonstrate that our method achieves new state-of-the-art performance, reaching a macro-F1 score of 90.32% and significantly outperforming previous SOTA and fine-tuned large model baselines. Our work validates that this synergistic strategy is a highly effective and efficient alternative to brute-force fine-tuning for specialized, imbalanced classification tasks. The code is publicly available at: this https URL.

[LG-30] Data-Dependent Smoothing for Protein Discovery with Walk-Jump Sampling

Link: https://arxiv.org/abs/2509.02069
Authors: Srinivas Anumasa, Barath Chandran.C, Tingting Chen, Dianbo Liu
Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)

Abstract:Diffusion models have emerged as a powerful class of generative models by learning to iteratively reverse the noising process. Their ability to generate high-quality samples has extended beyond high-dimensional image data to other complex domains such as proteins, where data distributions are typically sparse and unevenly spread. Importantly, the sparsity itself is uneven. Empirically, we observed that while a small fraction of samples lie in dense clusters, the majority occupy regions of varying sparsity across the data space. Existing approaches largely ignore this data-dependent variability. In this work, we introduce a Data-Dependent Smoothing Walk-Jump framework that employs kernel density estimation (KDE) as a preprocessing step to estimate the noise scale $\sigma$ for each data point, followed by training a score model with these data-dependent $\sigma$ values. By incorporating local data geometry into the denoising process, our method accounts for the heterogeneous distribution of protein data. Empirical evaluations demonstrate that our approach yields consistent improvements across multiple metrics, highlighting the importance of data-aware $\sigma$ prediction for generative modeling in sparse, high-dimensional settings.

[LG-31] LUCIE-3D: A three-dimensional climate emulator for forced responses

Link: https://arxiv.org/abs/2509.02061
Authors: Haiwen Guan, Troy Arcomano, Ashesh Chattopadhyay, Romit Maulik
Subjects: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Computational Physics (physics.comp-ph)

Abstract:We introduce LUCIE-3D, a lightweight three-dimensional climate emulator designed to capture the vertical structure of the atmosphere, respond to climate change forcings, and maintain computational efficiency with long-term stability. Building on the original LUCIE-2D framework, LUCIE-3D employs a Spherical Fourier Neural Operator (SFNO) backbone and is trained on 30 years of ERA5 reanalysis data spanning eight vertical $\sigma$-levels. The model incorporates atmospheric CO2 as a forcing variable and optionally integrates prescribed sea surface temperature (SST) to simulate coupled ocean–atmosphere dynamics. Results demonstrate that LUCIE-3D successfully reproduces climatological means, variability, and long-term climate change signals, including surface warming and stratospheric cooling under increasing CO2 concentrations. The model further captures key dynamical processes such as equatorial Kelvin waves, the Madden–Julian Oscillation, and annular modes, while showing credible behavior in the statistics of extreme events. Despite requiring longer training than its 2D predecessor, LUCIE-3D remains efficient, training in under five hours on four GPUs. Its combination of stability, physical consistency, and accessibility makes it a valuable tool for rapid experimentation, ablation studies, and the exploration of coupled climate dynamics, with potential applications extending to paleoclimate research and future Earth system emulation.

[LG-32] Genetic Programming with Model Driven Dimension Repair for Learning Interpretable Appointment Scheduling Rules

Link: https://arxiv.org/abs/2509.02034
Authors: Huan Zhang, Yang Wang, Ya-Hui Jia, Yi Mei
Subjects: Machine Learning (cs.LG)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:Appointment scheduling is a great challenge in healthcare operations management. Appointment rules (AR) provide medical practitioners with a simple yet effective tool to determine patient appointment times. Genetic programming (GP) can be used to evolve ARs. However, directly applying GP to design ARs may lead to rules that are difficult for end-users to interpret and trust. A key reason is that GP is unaware of the dimensional consistency, which ensures that the evolved rules align with users’ domain knowledge and intuitive understanding. In this paper, we develop a new dimensionally aware GP algorithm with dimension repair to evolve ARs with dimensional consistency and high performance. A key innovation of our method is the dimension repair procedure, which optimizes the dimensional consistency of an expression tree while minimizing structural changes and ensuring that its output dimension meets the problem’s requirements. We formulate the task as a mixed-integer linear programming model that can be efficiently solved using common mathematical programming methods. With the support of the dimension repair procedure, our method can explore a wider range of AR structures by temporarily breaking the dimensional consistency of individuals, and then restoring it without altering their overall structure, thereby identifying individuals with greater potential advantages. We evaluated the proposed method in a comprehensive set of simulated clinics. The experimental results demonstrate that our approach managed to evolve high-quality ARs that significantly outperform not only the manually designed ARs but also existing state-of-the-art dimensionally aware GP methods in terms of both objective values and dimensional consistency. In addition, we analyzed the semantics of the evolved ARs, providing insight into the design of more effective and interpretable ARs.

[LG-33] Second-Order Tensorial Partial Differential Equations on Graphs

Link: https://arxiv.org/abs/2509.02015
Authors: Aref Einizade, Fragkiskos D. Malliaros, Jhony H. Giraldo
Subjects: Machine Learning (cs.LG)
Comments: 12 pages, 1 figure

Abstract:Processing data that lies on multiple interacting (product) graphs is increasingly important in practical applications, yet existing methods are mostly restricted to discrete graph filtering. Tensorial partial differential equations on graphs (TPDEGs) offer a principled framework for modeling such multidomain data in a continuous setting. However, current continuous approaches are limited to first-order derivatives, which tend to dampen high-frequency signals and slow down information propagation. This makes these TPDEGs-based approaches less effective for capturing complex, multi-scale, and heterophilic structures. In this paper, we introduce second-order TPDEGs (So-TPDEGs) and propose the first theoretically grounded framework for second-order continuous product graph neural networks. Our approach leverages the separability of cosine kernels in Cartesian product graphs to implement efficient spectral decomposition, while naturally preserving high-frequency information. We provide rigorous theoretical analyses of stability under graph perturbations and over-smoothing behavior regarding spectral properties. Our theoretical results establish a robust foundation for advancing continuous graph learning across multiple practical domains.

[LG-34] Bouncy particle sampler with infinite exchanging parallel tempering

Link: https://arxiv.org/abs/2509.02003
Authors: Yohei Saito, Shun Kimura, Koujin Takeda
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Bayesian inference is useful to obtain a predictive distribution with a small generalization error. However, since posterior distributions are rarely evaluated analytically, we employ the variational Bayesian inference or sampling method to approximate posterior distributions. When we obtain samples from a posterior distribution, Hamiltonian Monte Carlo (HMC) has been widely used for the continuous variable part and Markov chain Monte Carlo (MCMC) for the discrete variable part. Another sampling method, the bouncy particle sampler (BPS), has been proposed, which combines uniform linear motion and stochastic reflection to perform sampling. BPS was reported to have the advantage of being easier to set simulation parameters than HMC. To accelerate the convergence to a posterior distribution, we introduced parallel tempering (PT) to BPS, and then proposed an algorithm when the inverse temperature exchange rate is set to infinity. We performed numerical simulations and demonstrated its effectiveness for multimodal distribution.
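The "stochastic reflection" in the bouncy particle sampler has a simple closed form: at a bounce, the velocity is mirrored against the gradient of the potential energy $U$ (the negative log-density), $v' = v - 2\,\frac{\langle v, \nabla U(x)\rangle}{\lVert \nabla U(x)\rVert^2}\,\nabla U(x)$. The Python sketch below implements only this standard reflection step; the event-time simulation and the parallel-tempering exchange proposed in the paper are not shown, and the function name `reflect` is hypothetical.

```python
def reflect(velocity, grad_u):
    """Reflect `velocity` against the gradient `grad_u` of the potential U.

    Implements v' = v - 2 * <v, g> / ||g||^2 * g, which preserves the speed.
    If the gradient is zero the velocity is returned unchanged.
    """
    dot = sum(v * g for v, g in zip(velocity, grad_u))
    norm_sq = sum(g * g for g in grad_u)
    if norm_sq == 0:
        return list(velocity)
    scale = 2 * dot / norm_sq
    return [v - scale * g for v, g in zip(velocity, grad_u)]

# Hypothetical example: standard Gaussian target, U(x) = |x|^2 / 2, so grad U(x) = x.
x = [0.0, 2.0]
v = [1.0, 1.0]
print(reflect(v, x))  # the component along the gradient changes sign
```

Between bounces the particle moves in a straight line at constant velocity, which is the "uniform linear motion" the abstract refers to.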

[LG-35] Semantic and episodic memories in a predictive coding model of the neocortex

Link: https://arxiv.org/abs/2509.01987
Authors: Lucie Fontaine (Mnemosyne), Frédéric Alexandre (Mnemosyne)
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Abstract:Complementary Learning Systems theory holds that intelligent agents need two learning systems. Semantic memory is encoded in the neocortex with dense, overlapping representations and acquires structured knowledge. Episodic memory is encoded in the hippocampus with sparse, pattern-separated representations and quickly learns the specifics of individual experiences. Recently, this duality between semantic and episodic memories has been challenged by predictive coding, a biologically plausible neural network model of the neocortex which was shown to have hippocampus-like abilities on auto-associative memory tasks. These results raise the question of the episodic capabilities of the neocortex and their relation to semantic memory. In this paper, we present such a predictive coding model of the neocortex and explore its episodic capabilities. We show that this kind of model can indeed recall the specifics of individual examples but only if it is trained on a small number of examples. The model is overfitted to these examples and does not generalize well, suggesting that episodic memory can arise from semantic learning. Indeed, a model trained with many more examples loses its recall capabilities. This work suggests that individual examples can be encoded gradually in the neocortex using dense, overlapping representations but only in a limited number, motivating the need for sparse, pattern-separated representations as found in the hippocampus.

[LG-36] Knowledge distillation as a pathway toward next-generation intelligent ecohydrological modeling systems

Link: https://arxiv.org/abs/2509.01972
Authors: Long Jiang, Yang Yang, Ting Fong May Chui, Morgan Thornwell, Hoshin Vijai Gupta
Subjects: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
Comments: 25 pages, 6 figures

Abstract:Simulating ecohydrological processes is essential for understanding complex environmental systems and guiding sustainable management amid accelerating climate change and human pressures. Process-based models provide physical realism but can suffer from structural rigidity, high computational costs, and complex calibration, while machine learning (ML) methods are efficient and flexible yet often lack interpretability and transferability. We propose a unified three-phase framework that integrates process-based models with ML and progressively embeds them into artificial intelligence (AI) through knowledge distillation. Phase I, behavioral distillation, enhances process models via surrogate learning and model simplification to capture key dynamics at lower computational cost. Phase II, structural distillation, reformulates process equations as modular components within a graph neural network (GNN), enabling multiscale representation and seamless integration with ML models. Phase III, cognitive distillation, embeds expert reasoning and adaptive decision-making into intelligent modeling agents using the Eyes-Brain-Hands-Mouth architecture. Demonstrations for the Samish watershed highlight the framework’s applicability to ecohydrological modeling, showing that it can reproduce process-based model outputs, improve predictive accuracy, and support scenario-based decision-making. The framework offers a scalable and transferable pathway toward next-generation intelligent ecohydrological modeling systems, with the potential extension to other process-based domains.

[LG-37] Computational Fluid Dynamics Optimization of F1 Front Wing using Physics Informed Neural Networks

Link: https://arxiv.org/abs/2509.01963
Authors: Naval Shah
Subjects: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: 10 pages, 3 figures

Abstract:In response to recent FIA regulations reducing Formula 1 team wind tunnel hours (from 320 hours for last-place teams to 200 hours for championship leaders) and strict budget caps of 135 million USD per year, more efficient aerodynamic development tools are needed by teams. Conventional computational fluid dynamics (CFD) simulations, though offering high fidelity results, require large computational resources with typical simulation durations of 8-24 hours per configuration analysis. This article proposes a Physics-Informed Neural Network (PINN) for the fast prediction of Formula 1 front wing aerodynamic coefficients. The suggested methodology combines CFD simulation data from SimScale with first principles of fluid dynamics through a hybrid loss function that constrains both data fidelity and physical adherence based on Navier-Stokes equations. Training on force and moment data from 12 aerodynamic features, the PINN model records coefficient of determination (R-squared) values of 0.968 for drag coefficient and 0.981 for lift coefficient prediction while lowering computational time. The physics-informed framework guarantees that predictions remain adherent to fundamental aerodynamic principles, offering F1 teams an efficient tool for the fast exploration of design space within regulatory constraints.

[LG-38] Entry Barriers in Content Markets

Link: https://arxiv.org/abs/2509.01953
Authors: Haiqing Zhu, Lexing Xie, Yun Kuen Cheung
Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)

Abstract:The prevalence of low-quality content on online platforms is often attributed to the absence of meaningful entry requirements. This motivates us to investigate whether implicit or explicit entry barriers, alongside appropriate reward mechanisms, can enhance content quality. We present the first game-theoretic analysis of two distinct types of entry barriers in online content platforms. The first, a structural barrier, emerges from the collective behaviour of incumbent content providers which disadvantages new entrants. We show that both rank-order and proportional-share reward mechanisms induce such a structural barrier at Nash equilibrium. The second, a strategic barrier, involves the platform proactively imposing entry fees to discourage participation from low-quality contributors. We consider a scheme in which the platform redirects some or all of the entry fees into the reward pool. We formally demonstrate that this approach can improve overall content quality. Our findings establish a theoretical foundation for designing reward mechanisms coupled with entry fees to promote higher-quality content and support healthier online ecosystems.

[LG-39] Causal representation learning from network data

Link: https://arxiv.org/abs/2509.01916
Authors: Jifan Zhang, Michelle M. Li, Elena Zheleva
Subjects: Machine Learning (cs.LG)

Abstract:Causal disentanglement from soft interventions is identifiable under the assumptions of linear interventional faithfulness and availability of both observational and interventional data. Previous research has looked into this problem from the perspective of i.i.d. data. Here, we develop a framework, GraCE-VAE, for non-i.i.d. settings, in which structured context in the form of network data is available. GraCE-VAE integrates discrepancy-based variational autoencoders with graph neural networks to jointly recover the true latent causal graph and intervention effects. We show that the theoretical results of identifiability from i.i.d. data hold in our setup. We also empirically evaluate GraCE-VAE against state-of-the-art baselines on three genetic perturbation datasets to demonstrate the impact of leveraging structured context for causal disentanglement.

[LG-40] Predicting NCAP Safety Ratings: An Analysis of Vehicle Characteristics and ADAS Features Using Machine Learning

Link: https://arxiv.org/abs/2509.01897
Authors: Raunak Kunwar, Aera Kim LeBoulluec (University of Texas at Arlington)
Subjects: Machine Learning (cs.LG)
Comments: 11 pages, 4 figures, Under review

Abstract:Vehicle safety assessment is crucial for consumer information and regulatory oversight. The New Car Assessment Program (NCAP) assigns standardized safety ratings, which traditionally emphasize passive safety measures but now include active safety technologies such as Advanced Driver-Assistance Systems (ADAS). It is crucial to understand how these various systems interact empirically. This study explores whether particular ADAS features like Forward Collision Warning, Lane Departure Warning, Crash Imminent Braking, and Blind Spot Detection, together with established vehicle attributes (e.g., Curb Weight, Model Year, Vehicle Type, Drive Train), can reliably predict a vehicle’s likelihood of earning the highest (5-star) overall NCAP rating. Using a publicly available dataset derived from NCAP reports that contain approximately 5,128 vehicle variants spanning model years 2011-2025, we compared four different machine learning models: logistic regression, random forest, gradient boosting, and support vector classifier (SVC) using a 5-fold stratified cross-validation approach. The two best-performing algorithms (random forest and gradient boost) were hyperparameter optimized using RandomizedSearchCV. Analysis of feature importance showed that basic vehicle characteristics, specifically curb weight and model year, dominated predictive capability, contributing more than 55% of the feature relevance of the Random Forest model. However, the inclusion of ADAS features also provided meaningful predictive contributions. The optimized Random Forest model achieved robust results on a held-out test set, with an accuracy of 89.18% and a ROC AUC of 0.9586. This research reveals the use of machine learning to analyze large-scale NCAP data and highlights the combined predictive importance of both established vehicle parameters and modern ADAS features to achieve top safety ratings.

[LG-41] Deep Reinforcement Learning for Real-Time Drone Routing in Post-Disaster Road Assessment Without Domain Knowledge

Link: https://arxiv.org/abs/2509.01886
Authors: Huatian Gong, Jiuh-Biing Sheu, Zheng Wang, Xiaoguang Yang, Ran Yan
Subjects: Machine Learning (cs.LG)
Comments: 36 pages, 15 figures

Abstract:Rapid post-disaster road damage assessment is critical for effective emergency response, yet traditional optimization methods suffer from excessive computational time and require domain knowledge for algorithm design, making them unsuitable for time-sensitive disaster scenarios. This study proposes an attention-based encoder-decoder model (AEDM) for real-time drone routing decision in post-disaster road damage assessment. The method employs deep reinforcement learning to determine high-quality drone assessment routes without requiring algorithmic design knowledge. A network transformation method is developed to convert link-based routing problems into equivalent node-based formulations, while a synthetic road network generation technique addresses the scarcity of large-scale training datasets. The model is trained using policy optimization with multiple optima (POMO) with multi-task learning capabilities to handle diverse parameter combinations. Experimental results demonstrate two key strengths of AEDM: it outperforms commercial solvers by 16–69% in solution quality and achieves real-time inference (1–2 seconds) versus 100–2,000 seconds for traditional methods. The model exhibits strong generalization across varying problem scales, drone numbers, and time constraints, consistently outperforming baseline methods on unseen parameter distributions and real-world road networks. The proposed method effectively balances computational efficiency with solution quality, making it particularly suitable for time-critical disaster response applications where rapid decision-making is essential for saving lives.

[LG-42] Semi-on-Demand Transit Feeders with Shared Autonomous Vehicles and Reinforcement-Learning-Based Zonal Dispatching Control ITSC

Link: https://arxiv.org/abs/2509.01883
Authors: Max T.M. Ng, Roman Engelhardt, Florian Dandl, Hani S. Mahmassani, Klaus Bogenberger
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments: 6 pages, 9 figures, published in 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC), Edmonton, Canada, 24-27 September 2024

Abstract:This paper develops a semi-on-demand transit feeder service using shared autonomous vehicles (SAVs) and zonal dispatching control based on reinforcement learning (RL). This service combines the cost-effectiveness of fixed-route transit with the adaptability of demand-responsive transport to improve accessibility in lower-density areas. Departing from the terminus, SAVs first make scheduled fixed stops, then offer on-demand pick-ups and drop-offs in a pre-determined flexible-route area. Our deep RL model dynamically assigns vehicles to subdivided flexible-route zones in response to real-time demand fluctuations and operations, using a policy gradient algorithm - Proximal Policy Optimization. The methodology is demonstrated through agent-based simulations on a real-world bus route in Munich, Germany. Results show that after efficient training of the RL model, the semi-on-demand service with dynamic zonal control serves 16% more passengers at 13% higher generalized costs on average compared to traditional fixed-route service. The efficiency gain brought by RL control brings 2.4% more passengers at 1.4% higher costs. This study not only showcases the potential of integrating SAV feeders and machine learning techniques into public transit, but also sets the groundwork for further innovations in addressing first-mile-last-mile problems in multimodal transit systems.

[LG-43] RadioDiff-Loc: Diffusion Model Enhanced Scattering Cognition for NLoS Localization with Sparse Radio Map Estimation

Link: https://arxiv.org/abs/2509.01875
Authors: Xiucheng Wang, Qiming Zhang, Nan Cheng
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)

Abstract:Accurate localization of non-cooperative signal sources in non-line-of-sight (NLoS) environments remains a critical challenge with a wide range of applications, including autonomous navigation, industrial automation, and emergency response. In such settings, traditional positioning techniques relying on line-of-sight (LoS) or cooperative signaling fail due to severe multipath propagation and unknown transmit power. This paper proposes a novel generative inference framework for NLoS localization based on conditional diffusion models. By leveraging the physical insight that diffracted electromagnetic energy concentrates near building edges, we develop a sampling strategy that collects sparse received signal strength (RSS) measurements at the geometric vertices of obstacles–locations that maximize Fisher information and mutual information with respect to the unknown source. To overcome the lack of known transmission power, we normalize all sampled RSS values relative to the maximum observed intensity, enabling the construction of a power-invariant radio map (RM). A conditional diffusion model is trained to reconstruct the full RM based on environmental layout and sparse RSS observations. Localization is then achieved by identifying the brightest point on the generated RM. Moreover, the proposed framework is compatible with existing RSS-based localization algorithms, enabling a dual-driven paradigm that fuses physical knowledge and data-driven inference for improved accuracy. Extensive theoretical analysis and empirical validation demonstrate that our approach achieves high localization accuracy with significantly reduced sampling cost, offering a scalable and physically grounded solution for non-cooperative NLoS emitter localization.
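The power normalization and "brightest point" steps described in this abstract can be illustrated with a minimal sketch. It assumes RSS values on a linear (not dB) scale and uses a hand-written toy map in place of the diffusion model's output; the function names `normalize_rss` and `brightest_point` are hypothetical and not taken from the paper.

```python
def normalize_rss(rss_values):
    """Scale linear-scale RSS samples by the largest observed sample."""
    peak = max(rss_values)
    if peak <= 0:
        raise ValueError("expected at least one positive RSS sample")
    return [value / peak for value in rss_values]


def brightest_point(radio_map):
    """Return the (row, col) index of the largest entry in a 2-D radio map."""
    best_index = None
    best_value = float("-inf")
    for r, row in enumerate(radio_map):
        for c, value in enumerate(row):
            if value > best_value:
                best_value = value
                best_index = (r, c)
    return best_index


# Sparse samples (assumed linear scale) and the same samples at 10x transmit power.
samples = [2.0, 8.0, 4.0]
scaled = normalize_rss(samples)
scaled_10x = normalize_rss([10 * value for value in samples])

# Toy stand-in for a reconstructed radio map (not produced by any model here).
toy_map = [
    [0.1, 0.2, 0.1],
    [0.3, 0.9, 0.4],
    [0.2, 0.5, 0.3],
]
estimate = brightest_point(toy_map)
```

Because both sample sets normalize to the same values, the normalized input carries no information about the absolute transmit power, which is the property the abstract relies on.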

[LG-44] Optimizing In-Context Learning for Efficient Full Conformal Prediction

Link: https://arxiv.org/abs/2509.01840
Authors: Weicao Deng, Sangwoo Park, Min Li, Osvaldo Simeone
Categories: Machine Learning (cs.LG)
Comments: 6 pages, 3 figures

Abstract:Reliable uncertainty quantification is critical for trustworthy AI. Conformal Prediction (CP) provides prediction sets with distribution-free coverage guarantees, but its two main variants face complementary limitations. Split CP (SCP) suffers from data inefficiency due to dataset partitioning, while full CP (FCP) improves data efficiency at the cost of prohibitive retraining complexity. Recent approaches based on meta-learning or in-context learning (ICL) partially mitigate these drawbacks. However, they rely on training procedures not specifically tailored to CP, which may yield large prediction sets. We introduce an efficient FCP framework, termed enhanced ICL-based FCP (E-ICL+FCP), which employs a permutation-invariant Transformer-based ICL model trained with a CP-aware loss. By simulating the multiple retrained models required by FCP without actual retraining, E-ICL+FCP preserves coverage while markedly reducing both inefficiency and computational overhead. Experiments on synthetic and real tasks demonstrate that E-ICL+FCP attains superior efficiency-coverage trade-offs compared to existing SCP and FCP baselines.
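For readers unfamiliar with the split CP (SCP) baseline mentioned in this abstract, here is a minimal sketch of the standard split conformal recipe. It is not the E-ICL+FCP method; the calibration scores, the label scores, and the choice of nonconformity score (for example, one minus the predicted probability) are illustrative assumptions.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: every label is included.
        return float("inf")
    return sorted(cal_scores)[rank - 1]


def prediction_set(label_scores, threshold):
    """Keep every candidate label whose nonconformity score is within the threshold."""
    return {label for label, score in label_scores.items() if score <= threshold}


# Hypothetical nonconformity scores on a held-out calibration split.
cal_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.80]
qhat = conformal_threshold(cal_scores, alpha=0.25)

# Hypothetical scores of each candidate label for one test input.
test_scores = {"cat": 0.10, "dog": 0.50, "bird": 0.90}
labels = prediction_set(test_scores, qhat)
```

The data inefficiency noted in the abstract comes from the calibration split: those points set the threshold but cannot also be used for training.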

[LG-45] Music Genre Classification Using Machine Learning Techniques

Link: https://arxiv.org/abs/2509.01762
Authors: Alokit Mishra, Ryyan Akhtar
Categories: Sound (cs.SD); Machine Learning (cs.LG)
Comments: 10 pages, 20 figures. Submitted in partial fulfillment of the requirements for the Bachelor of Technology degree in Artificial Intelligence and Data Science

Abstract:This paper presents a comparative analysis of machine learning methodologies for automatic music genre classification. We evaluate the performance of classical classifiers, including Support Vector Machines (SVM) and ensemble methods, trained on a comprehensive set of hand-crafted audio features, against a Convolutional Neural Network (CNN) operating on Mel spectrograms. The study is conducted on the widely-used GTZAN dataset. Our findings demonstrate a noteworthy result: the SVM, leveraging domain-specific feature engineering, achieves superior classification accuracy compared to the end-to-end CNN model. We attribute this outcome to the data-constrained nature of the benchmark dataset, where the strong inductive bias of engineered features provides a regularization effect that mitigates the risk of overfitting inherent in high-capacity deep learning models. This work underscores the enduring relevance of traditional feature extraction in practical audio processing tasks and provides a critical perspective on the universal applicability of deep learning, especially for moderately sized datasets.

[LG-46] Communication-Aware Knowledge Distillation for Federated LLM Fine-Tuning over Wireless Networks

Link: https://arxiv.org/abs/2509.01750
Authors: Xinlu Zhang, Na Yan, Yang Su, Yansha Deng, Toktam Mahmoodi
Categories: Machine Learning (cs.LG)

Abstract:Federated learning (FL) for large language models (LLMs) offers a privacy-preserving scheme, enabling clients to collaboratively fine-tune locally deployed LLMs or smaller language models (SLMs) without exchanging raw data. While parameter-sharing methods in traditional FL models solve a number of technical challenges, they still incur high communication overhead and struggle to adapt to heterogeneous model architectures. Federated distillation, a framework for mutual knowledge transfer via shared logits, typically offers lower communication overhead than parameter-sharing methods. However, transmitting logits from LLMs remains challenging for bandwidth-limited clients due to their high dimensionality. In this work, we focus on communication-efficient federated LLM distillation. To achieve this, we first propose an adaptive Top-k logit selection mechanism that dynamically sparsifies logits according to real-time communication conditions. Then, to tackle the dimensional inconsistency introduced by the adaptive sparsification, we design an adaptive logits aggregation scheme, effectively alleviating the artificial and uninformative inputs introduced by conventional zero-padding methods. Finally, to enhance the distillation effect, we incorporate LoRA-adapted hidden-layer projection from the LLM into the distillation loss, further reducing the communication overhead while providing richer representation. Experimental results demonstrate that our scheme achieves superior performance compared to baseline methods while effectively reducing communication overhead by approximately 50%.
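The idea of sparsifying logits according to communication conditions can be sketched roughly as follows. The rule for choosing k from a bit budget (`choose_k`, with 48 bits per kept entry) is a hypothetical stand-in, since the abstract does not specify the paper's selection rule.

```python
def choose_k(budget_bits, bits_per_entry=48, k_max=None):
    """Pick how many logits fit in the budget (hypothetical rule, not the paper's)."""
    k = budget_bits // bits_per_entry  # each kept logit costs an index plus a value
    if k_max is not None:
        k = min(k, k_max)
    return max(1, k)  # always send at least the top logit


def top_k_sparsify(logits, k):
    """Keep the k largest logits as an {index: value} mapping."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return {i: logits[i] for i in sorted(order[:k])}


logits = [0.1, 2.5, -1.0, 3.2, 0.7, 1.9]

# A good channel allows more entries than a poor one.
k_good = choose_k(budget_bits=200, k_max=len(logits))
k_poor = choose_k(budget_bits=60, k_max=len(logits))

sparse_good = top_k_sparsify(logits, k_good)
sparse_poor = top_k_sparsify(logits, k_poor)
```

Clients with different budgets end up sending vectors with different supports, which is the dimensional inconsistency the abstract's aggregation scheme is meant to handle.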

[LG-47] Constrained Decoding for Robotics Foundation Models

Link: https://arxiv.org/abs/2509.01728
Authors: Parv Kapoor, Akila Ganlath, Changliu Liu, Sebastian Scherer, Eunsuk Kang
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)

Abstract:Recent advances in the development of robotic foundation models have led to promising end-to-end and general-purpose capabilities in robotic systems. These models are pretrained on vast datasets of robot trajectories to process multi-modal inputs and directly output a sequence of actions that the system then executes in the real world. Although this approach is attractive from the perspective of improved generalization across diverse tasks, these models are still data-driven and, therefore, lack explicit notions of behavioral correctness and safety constraints. We address these limitations by introducing a constrained decoding framework for robotics foundation models that enforces logical constraints on action trajectories in dynamical systems. Our method ensures that generated actions provably satisfy signal temporal logic (STL) specifications at runtime without retraining, while remaining agnostic of the underlying foundation model. We perform a comprehensive evaluation of our approach across state-of-the-art navigation foundation models and we show that our decoding-time interventions are useful not only for filtering unsafe actions but also for conditional action-generation. Videos available on our website: this https URL

[LG-48] Learning to Ask: Decision Transformers for Adaptive Quantitative Group Testing

Link: https://arxiv.org/abs/2509.01723
Authors: Mahdi Soleymani, Tara Javidi
Categories: Information Theory (cs.IT); Machine Learning (cs.LG)

Abstract:We consider the problem of quantitative group testing (QGT), where the goal is to recover a sparse binary vector from aggregate subset-sum queries: each query selects a subset of indices and returns the sum of those entries. Information-theoretic results suggest that adaptivity could yield up to a twofold reduction in the total number of required queries, yet no algorithm has surpassed the non-adaptive bound, leaving its practical benefit an open question. In this paper, we reduce the QGT problem to an integer-vector recovery task whose dimension scales with the sparsity of the original problem rather than its full ambient size. We then formulate this reduced recovery task as an offline reinforcement learning problem and employ Decision Transformers to solve it adaptively. By combining these two steps, we obtain an effective end-to-end method for solving the QGT problem. Our experiments show that, for the first time in the literature, our adaptive algorithm reduces the average number of queries below the well-known non-adaptive information-theoretic bound, demonstrating that adaptivity can indeed reduce the number of queries.
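To make the query model concrete, the sketch below implements a subset-sum oracle and recovers the hidden vector with a textbook adaptive bisection. This bisection is only an illustrative baseline and is not the Decision Transformer method of the paper; the hidden vector and helper names are made up for the example.

```python
def make_oracle(hidden):
    """Return a subset-sum query function plus a counter of queries issued."""
    counter = {"queries": 0}

    def query(indices):
        counter["queries"] += 1
        return sum(hidden[i] for i in indices)

    return query, counter


def recover_by_splitting(n, query):
    """Adaptive bisection: each answer decides which half to split next."""
    found = set()

    def search(indices, total):
        if total == 0:
            return  # no ones in this block
        if total == len(indices):
            found.update(indices)  # every entry in this block is a one
            return
        mid = len(indices) // 2
        left, right = indices[:mid], indices[mid:]
        left_sum = query(left)
        search(left, left_sum)
        search(right, total - left_sum)  # right sum is implied, no extra query

    all_indices = list(range(n))
    search(all_indices, query(all_indices))
    return found


hidden = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
query, counter = make_oracle(hidden)
ones = recover_by_splitting(len(hidden), query)
```

Because each query returns a count rather than a yes/no answer, the sum of one half determines the sum of the other, which is where quantitative queries save work compared with binary group testing.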

[LG-49] Convolutional Monge Mapping between EEG Datasets to Support Independent Component Labeling

Link: https://arxiv.org/abs/2509.01721
Authors: Austin Meek, Carlos H. Mendoza-Cardenas, Austin J. Brockmeier
Categories: Machine Learning (cs.LG)
Comments: Code available at: this https URL

Abstract:EEG recordings contain rich information about neural activity but are subject to artifacts, noise, and superficial differences due to sensors, amplifiers, and filtering. Independent component analysis and automatic labeling of independent components (ICs) enable artifact removal in EEG pipelines. Convolutional Monge Mapping Normalization (CMMN) is a recent tool used to achieve spectral conformity of EEG signals, which was shown to improve deep neural network approaches for sleep staging. Here we propose a novel extension of the CMMN method with two alternative approaches to computing the source reference spectrum the target signals are mapped to: (1) channel-averaged and l_1-normalized barycenter, and (2) a subject-to-subject mapping that finds the source subject with the closest spectrum to the target subject. Notably, our extension yields space-time separable filters that can be used to map between datasets with different numbers of EEG channels. We apply these filters in an IC classification task, and show significant improvement in recognizing brain versus non-brain ICs. Clinical relevance - EEG recordings are used in the diagnosis and monitoring of multiple neuropathologies, including epilepsy and psychosis. While EEG analysis can benefit from automating artifact removal through independent component analysis and labeling, differences in recording equipment and context (the presence of noise from electrical wiring and other devices) may impact the performance of machine learning models, but these differences can be minimized by appropriate spectral normalization through filtering.

[LG-50] Succeed or Learn Slowly: Sample Efficient Off-Policy Reinforcement Learning for Mobile App Control

Link: https://arxiv.org/abs/2509.01720
Authors: Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, Kun Shao
Categories: Machine Learning (cs.LG)

Abstract:Reinforcement learning (RL) using foundation models for policy approximations in multi-turn tasks remains challenging. We identify two main limitations related to sparse reward settings and policy gradient updates, based on which we formulate a key insight: updates from positive samples with high returns typically do not require policy regularisation, whereas updates from negative samples, reflecting undesirable behaviour, can harm model performance. This paper introduces Succeed or Learn Slowly (SoLS), a novel off-policy RL algorithm evaluated on mobile app control tasks. SoLS improves sample efficiency when fine-tuning foundation models for user interface navigation via a modified off-policy actor-critic approach, applying direct policy updates for positive samples and conservative, regularised updates for negative ones to prevent model degradation. We augment SoLS with Successful Transition Replay (STR), which prioritises learning from successful interactions, further improving sample efficiency. We evaluate SoLS on the AndroidWorld benchmark, where it significantly outperforms existing methods (at least 17% relative increase), including prompt-engineering and RL approaches, while requiring substantially fewer computational resources than GPT-4o-based methods with 5-60x faster inference.

[LG-51] Robust Anomaly Detection through Multi-Modal Autoencoder Fusion for Small Vehicle Damage Detection

Link: https://arxiv.org/abs/2509.01719
Authors: Sara Khan, Mehmed Yüksel, Frank Kirchner
Categories: Machine Learning (cs.LG)
Comments: 14 pages, 12 figures, submitted to Elsevier MLWA

Abstract:Wear and tear detection in fleet and shared vehicle systems is a critical challenge, particularly in rental and car-sharing services, where minor damage, such as dents, scratches, and underbody impacts, often goes unnoticed or is detected too late. Currently, manual inspection methods are the default approach but are labour intensive and prone to human error. In contrast, state-of-the-art image-based methods struggle with real-time performance and are less effective at detecting underbody damage due to limited visual access and poor spatial coverage. This work introduces a novel multi-modal architecture based on anomaly detection to address these issues. Sensors such as IMUs and microphones are integrated into a compact device mounted on the vehicle’s windshield. This approach supports real-time damage detection while avoiding the need for highly resource-intensive sensors. We developed multiple variants of multi-modal autoencoder-based architectures and evaluated them against unimodal and state-of-the-art methods. Our ensemble pooling multi-modal model achieved the highest performance, with a Receiver Operating Characteristic-Area Under Curve (ROC-AUC) of 92%, demonstrating its effectiveness in real-world applications. This approach can also be extended to other applications, such as improving automotive safety - where it can integrate with airbag systems for efficient deployment - and helping autonomous vehicles by complementing other sensors in collision detection.

[LG-52] Efficient Transformer-Inspired Variants of Physics-Informed Deep Operator Networks

Link: https://arxiv.org/abs/2509.01679
Authors: Zhi-Feng Wei, Wenqian Chen, Panos Stinis
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Comments: Code will be released upon acceptance

Abstract:Operator learning has emerged as a promising tool for accelerating the solution of partial differential equations (PDEs). The Deep Operator Networks (DeepONets) represent a pioneering framework in this area: the “vanilla” DeepONet is valued for its simplicity and efficiency, while the modified DeepONet achieves higher accuracy at the cost of increased training time. In this work, we propose a series of Transformer-inspired DeepONet variants that introduce bidirectional cross-conditioning between the branch and trunk networks in DeepONet. Query-point information is injected into the branch network and input-function information into the trunk network, enabling dynamic dependencies while preserving the simplicity and efficiency of the “vanilla” DeepONet in a non-intrusive manner. Experiments on four PDE benchmarks – advection, diffusion-reaction, Burgers’, and Korteweg-de Vries equations – show that for each case, there exists a variant that matches or surpasses the accuracy of the modified DeepONet while offering improved training efficiency. Moreover, the best-performing variant for each equation aligns naturally with the equation’s underlying characteristics, suggesting that the effectiveness of cross-conditioning depends on the characteristics of the equation and its underlying physics. To ensure robustness, we validate the effectiveness of our variants through a range of rigorous statistical analyses, among them the Wilcoxon Two One-Sided Test, Glass’s Delta, and Spearman’s rank correlation.

[LG-53] Distilled Pretraining: A modern lens of Data, In-Context Learning and Test-Time Scaling

Link: https://arxiv.org/abs/2509.01649
Authors: Sachin Goyal, David Lopez-Paz, Kartik Ahuja
Categories: Machine Learning (cs.LG)

Abstract:In the past year, distillation has seen a renewed prominence in large language model (LLM) pretraining, exemplified by the Llama-3.2 and Gemma model families. While distillation has historically been shown to improve statistical modeling, its effects on new paradigms that are key to modern LLMs, such as test-time scaling and in-context learning, remain underexplored. In this work, we make three main contributions. First, we show that pretraining with distillation yields models that exhibit remarkably better test-time scaling. Second, we observe that this benefit comes with a trade-off: distillation impairs in-context learning capabilities, particularly the one modeled via induction heads. Third, to demystify these findings, we study distilled pretraining in a sandbox of a bigram model, which helps us isolate the common principal factor behind our observations. Finally, using these insights, we shed light on various design choices for pretraining that should help practitioners going forward.

[LG-54] REVELIO – Universal Multimodal Task Load Estimation for Cross-Domain Generalization

Link: https://arxiv.org/abs/2509.01642
Authors: Maximilian P. Oppelt, Andreas Foltyn, Nadine R. Lang-Richter, Bjoern M. Eskofier
Categories: Machine Learning (cs.LG)

Abstract:Task load detection is essential for optimizing human performance across diverse applications, yet current models often lack generalizability beyond narrow experimental domains. While prior research has focused on individual tasks and limited modalities, there remains a gap in evaluating model robustness and transferability in real-world scenarios. This paper addresses these limitations by introducing a new multimodal dataset that extends established cognitive load detection benchmarks with a real-world gaming application, using the n-back test as a scientific foundation. Task load annotations are derived from objective performance, subjective NASA-TLX ratings, and task-level design, enabling a comprehensive evaluation framework. State-of-the-art end-to-end models, including xLSTM, ConvNeXt, and Transformer architectures, are systematically trained and evaluated on multiple modalities and application domains to assess their predictive performance and cross-domain generalization. Results demonstrate that multimodal approaches consistently outperform unimodal baselines, with specific modalities and model architectures showing varying impact depending on the application subset. Importantly, models trained on one domain exhibit reduced performance when transferred to novel applications, underscoring remaining challenges for universal cognitive load estimation. These findings provide robust baselines and actionable insights for developing more generalizable cognitive load detection systems, advancing both research and practical implementation in human-computer interaction and adaptive systems.

[LG-55] Relative Trajectory Balance is equivalent to Trust-PCL

Link: https://arxiv.org/abs/2509.01632
Authors: Tristan Deleu, Padideh Nouri, Yoshua Bengio, Doina Precup
Categories: Machine Learning (cs.LG)

Abstract:Recent progress in generative modeling has highlighted the importance of Reinforcement Learning (RL) for fine-tuning, with KL-regularized methods in particular proving to be highly effective for both autoregressive and diffusion models. Complementing this line of work, the Relative Trajectory Balance (RTB) objective was recently introduced in the context of Generative Flow Networks (GFlowNets) to serve the same role of improving fine-tuning in sequential generative models. Building on prior work linking GFlowNets and maximum-entropy RL, we establish in this paper an equivalence between RTB and Trust-PCL, an off-policy RL method with KL regularization. This equivalence situates RTB within the broader theoretical landscape of KL-regularized RL, and clarifies its relationship to earlier methods. Leveraging this insight, we revisit an illustrative example from the RTB paper and show that KL-regularized RL methods achieve comparable performance, offering an alternative perspective to what was previously reported.

[LG-56] Learning to Coordinate: Distributed Meta-Trajectory Optimization Via Differentiable ADMM-DDP

Link: https://arxiv.org/abs/2509.01630
Authors: Bingheng Wang, Yichao Gao, Tianchen Sun, Lin Zhao
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO); Systems and Control (eess.SY)

Abstract:Distributed trajectory optimization via ADMM-DDP is a powerful approach for coordinating multi-agent systems, but it requires extensive tuning of tightly coupled hyperparameters that jointly govern local task performance and global coordination. In this paper, we propose Learning to Coordinate (L2C), a general framework that meta-learns these hyperparameters, modeled by lightweight agent-wise neural networks, to adapt across diverse tasks and agent configurations. L2C differentiates end-to-end through the ADMM-DDP pipeline in a distributed manner. It also enables efficient meta-gradient computation by reusing DDP components such as Riccati recursions and feedback gains. These gradients correspond to the optimal solutions of distributed matrix-valued LQR problems, coordinated across agents via an auxiliary ADMM framework that becomes convex under mild assumptions. Training is further accelerated by truncating iterations and meta-learning ADMM penalty parameters optimized for rapid residual reduction, with provable Lipschitz-bounded gradient errors. On a challenging cooperative aerial transport task, L2C generates dynamically feasible trajectories in high-fidelity simulation using IsaacSIM, reconfigures quadrotor formations for safe 6-DoF load manipulation in tight spaces, and adapts robustly to varying team sizes and task conditions, while achieving up to 88% faster gradient computation than state-of-the-art methods.

[LG-57] Effects of Distributional Biases on Gradient-Based Causal Discovery in the Bivariate Categorical Case

Link: https://arxiv.org/abs/2509.01621
Authors: Tim Schwabe, Moritz Lange, Laurenz Wiskott, Maribel Acosta
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Gradient-based causal discovery shows great potential for deducing causal structure from data in an efficient and scalable way. These approaches, however, can be susceptible to distributional biases in the data they are trained on. We identify two such biases: Marginal Distribution Asymmetry, where differences in entropy skew causal learning toward certain factorizations, and Marginal Distribution Shift Asymmetry, where repeated interventions cause faster shifts in some variables than in others. For the bivariate categorical setup with Dirichlet priors, we illustrate how these biases can occur even in controlled synthetic data. To examine their impact on gradient-based methods, we employ two simple models that derive causal factorizations by learning marginal or conditional data distributions - a common strategy in gradient-based causal discovery. We demonstrate how these models can be susceptible to both biases. We additionally show how the biases can be controlled. An empirical evaluation of two related, existing approaches indicates that eliminating competition between possible causal factorizations can make models robust to the presented biases.

[LG-58] Learning Longitudinal Stress Dynamics from Irregular Self-Reports via Time Embeddings

Link: https://arxiv.org/abs/2509.01569
Authors: Louis Simon, Mohamed Chetouani
Categories: Machine Learning (cs.LG)

Abstract:The widespread adoption of mobile and wearable sensing technologies has enabled continuous and personalized monitoring of affect, mood disorders, and stress. When combined with ecological self-report questionnaires, these systems offer a powerful opportunity to explore longitudinal modeling of human behaviors. However, challenges arise from missing data and the irregular timing of self-reports, which make it difficult to predict human states and behaviors. In this study, we investigate the use of time embeddings to capture time dependencies within sequences of Ecological Momentary Assessments (EMA). We introduce a novel time embedding method, Ema2Vec, designed to effectively handle irregularly spaced self-reports, and evaluate it on a new task of longitudinal stress prediction. Our method outperforms standard stress prediction baselines that rely on fixed-size daily windows, as well as models trained directly on longitudinal sequences without time-aware representations. These findings emphasize the importance of incorporating time embeddings when modeling irregularly sampled longitudinal data.

[LG-59] Direct Profit Estimation Using Uplift Modeling under Clustered Network Interference

Link: https://arxiv.org/abs/2509.01558
Authors: Bram van den Akker
Categories: Machine Learning (cs.LG)

Abstract:Uplift modeling is a key technique for promotion optimization in recommender systems, but standard methods typically fail to account for interference, where treating one item affects the outcomes of others. This violation of the Stable Unit Treatment Value Assumption (SUTVA) leads to suboptimal policies in real-world marketplaces. Recent developments in interference-aware estimators such as Additive Inverse Propensity Weighting (AddIPW) have not found their way into the uplift modeling literature yet, and optimising policies using these estimators is not well-established. This paper proposes a practical methodology to bridge this gap. We use the AddIPW estimator as a differentiable learning objective suitable for gradient-based optimization. We demonstrate how this framework can be integrated with proven response transformation techniques to directly optimize for economic outcomes like incremental profit. Through simulations, we show that our approach significantly outperforms interference-naive methods, especially as interference effects grow. Furthermore, we find that adapting profit-centric uplift strategies within our framework can yield superior performance in identifying the highest-impact interventions, offering a practical path toward more profitable incentive personalization.

[LG-60] Ultra Fast Warm Start Solution for Graph Recommendations CIKM2025

Link: https://arxiv.org/abs/2509.01549
Authors: Viacheslav Yusupov, Maxim Rakhuba, Evgeny Frolov
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted to CIKM 2025

Abstract:In this work, we present a fast and effective linear approach for updating recommendations in UltraGCN, a scalable graph-based recommender system. Solving this task is extremely important for keeping recommendations relevant when large amounts of new data arrive and user preferences change. To address this issue, we adapt the simple yet effective low-rank approximation approach to the graph-based model. Our method delivers instantaneous recommendations that are up to 30 times faster than conventional methods, with gains in recommendation quality, and demonstrates high scalability even on large-catalogue datasets.

[LG-61] Model Unmerging: Making Your Models Unmergeable for Secure Model Sharing

Link: https://arxiv.org/abs/2509.01548
Authors: Zihao Wang, Enneng Yang, Lu Yin, Shiwei Liu, Li Shen
Categories: Machine Learning (cs.LG)

Abstract:Model merging leverages multiple finetuned expert models to construct a multi-task model with low cost, and is gaining increasing attention. However, as a growing number of finetuned models become publicly available, concerns about the safety of model merging have emerged. Unauthorized merging may infringe on developers’ rights and risk leaking sensitive personal information. Most existing methods focus on detecting whether a merged model originates from a specific source model, but fail to effectively prevent illegal merging. In this paper, we propose MergeLock, an active protection mechanism that disrupts model parameters to render them unmergeable, thereby directly preventing unauthorized model merging. Specifically, leveraging the inherent symmetry of the attention mechanism in Transformer-based models, we randomly sample two pairs of invertible matrices and apply them to the Query-Key (QK) and Value-Output (VO) branches. This transformation keeps the model’s output unchanged while pushing it away from the shared parameter space of other finetuned models. Extensive experiments across both vision and language tasks demonstrate that MergeLock can degrade the performance of merged models by over 95% when a protected model is involved in most cases, demonstrating its effectiveness. Moreover, we further demonstrate that merged models protected by MergeLock cannot be effectively recovered using low-cost restoration methods, further enhancing robustness against unauthorized merging. The code is available at this https URL.
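The attention symmetry that the abstract refers to can be checked on a toy example. The sketch below covers only the Query-Key branch of a single head, before the softmax, with small hand-picked integer matrices; all names and values are hypothetical, and the actual MergeLock procedure (random sampling, the Value-Output branch) is not reproduced here.

```python
from fractions import Fraction


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    return [list(row) for row in zip(*a)]


def inverse_2x2(m):
    """Exact inverse of a 2x2 matrix using rational arithmetic."""
    (a, b), (c, d) = m
    det = Fraction(a * d - b * c)
    if det == 0:
        raise ValueError("matrix is not invertible")
    return [[d / det, -b / det], [-c / det, a / det]]


def attention_scores(x, w_q, w_k):
    """Pre-softmax scores Q K^T with Q = X W_q and K = X W_k."""
    return matmul(matmul(x, w_q), transpose(matmul(x, w_k)))


x = [[1, 2], [3, 4], [0, 1]]  # three tokens, hidden size 2
w_q = [[1, 0], [2, 1]]
w_k = [[0, 1], [1, 3]]
a = [[2, 1], [1, 1]]          # any invertible matrix works

# W_q -> W_q A and W_k -> W_k A^{-T}, so that (W_q A)(W_k A^{-T})^T = W_q W_k^T.
w_q_locked = matmul(w_q, a)
w_k_locked = matmul(w_k, transpose(inverse_2x2(a)))

original = attention_scores(x, w_q, w_k)
locked = attention_scores(x, w_q_locked, w_k_locked)
```

The scores are identical while the individual weight matrices differ from the originals, so averaging the transformed weights with another model's weights no longer combines aligned parameters.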

[LG-62] Feynman-Kac-Flow: Inference Steering of Conditional Flow Matching to an Energy-Tilted Posterior

Link: https://arxiv.org/abs/2509.01543
Authors: Konstantin Mark, Leonard Galustian, Maximilian P.-P. Kovar, Esther Heid
Subjects: Machine Learning (cs.LG)

Abstract:Conditional Flow Matching (CFM) represents a fast and high-quality approach to generative modelling, but in many applications it is of interest to steer the generated samples towards precise requirements. While steering approaches like gradient-based guidance, sequential Monte Carlo steering or Feynman-Kac steering are well established for diffusion models, they have not been extended to flow matching approaches yet. In this work, we formulate this requirement as tilting the output with an energy potential. We derive, for the first time, Feynman-Kac steering for CFM. We evaluate our approach on a set of synthetic tasks, including the generation of tilted distributions in a high-dimensional space, which is a particularly challenging case for steering approaches. We then demonstrate the impact of Feynman-Kac steered CFM on the previously unsolved challenge of generating transition states of chemical reactions with the correct chirality, where the reactants or products can have a different handedness, leading to geometric constraints of the viable reaction pathways connecting reactants and products. Code to reproduce this study is available open-source at this https URL.

[LG-63] Graph Contrastive Learning versus Untrained Baselines: The Role of Dataset Size

Link: https://arxiv.org/abs/2509.01541
Authors: Smayan Khanna, Doruk Efe Gökmen, Risi Kondor, Vincenzo Vitelli
Subjects: Machine Learning (cs.LG); Soft Condensed Matter (cond-mat.soft)
Comments: 12 pages, 5 figures

Abstract:Graph Contrastive Learning (GCL) has emerged as a leading paradigm for self-supervised learning on graphs, with strong performance reported on standardized datasets and growing applications ranging from genomics to drug discovery. We ask a basic question: does GCL actually outperform untrained baselines? We find that GCL’s advantage depends strongly on dataset size and task difficulty. On standard datasets, untrained Graph Neural Networks (GNNs), simple multilayer perceptrons, and even handcrafted statistics can rival or exceed GCL. On the large molecular dataset ogbg-molhiv, we observe a crossover: GCL lags at small scales but pulls ahead beyond a few thousand graphs, though this gain eventually plateaus. On synthetic datasets, GCL accuracy approximately scales with the logarithm of the number of graphs and its performance gap (compared with untrained GNNs) varies with respect to task complexity. Moving forward, it is crucial to identify the role of dataset size in benchmarks and applications, as well as to design GCL algorithms that avoid performance plateaus.

[LG-64] Prediction, Generation of WWTPs microbiome community structures and Clustering of WWTPs various feature attributes using DE-BP model, SiTime-GAN model and DPNG-EPMC ensemble clustering algorithm with modulation of microbial ecosystem health

Link: https://arxiv.org/abs/2509.01526
Authors: Mingzhi Dai, Weiwei Cai, Xiang Feng, Huiqun Yu, Weibin Guo, Miao Guo
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 48 pages, 25 figures, three major research sections: Prediction, Generation and Clustering

Abstract:Microbiomes not only underpin Earth’s biogeochemical cycles but also play crucial roles in both engineered and natural ecosystems, such as the soil, wastewater treatment, and the human gut. However, microbiome engineering faces significant obstacles that must be surmounted to deliver the desired improvements in microbiome control. Here, we use the backpropagation neural network (BPNN), optimized through differential evolution (DE-BP), to predict the microbial composition of activated sludge (AS) systems collected from wastewater treatment plants (WWTPs) located worldwide. Furthermore, we introduce a novel clustering algorithm termed Directional Position Nonlinear Emotional Preference Migration Behavior Clustering (DPNG-EPMC). This method is applied to conduct a clustering analysis of WWTPs across various feature attributes. Finally, we employ the Similar Time Generative Adversarial Networks (SiTime-GAN) to synthesize novel microbial compositions and feature attribute data. As a result, we demonstrate that the DE-BP model can provide superior predictions of the microbial composition. Additionally, we show that the DPNG-EPMC can be applied to the analysis of WWTPs under various feature attributes. Finally, we demonstrate that the SiTime-GAN model can generate valuable incremental synthetic data. Our results, obtained through predicting the microbial community and conducting analysis of WWTPs under various feature attributes, develop an understanding of the factors influencing AS communities.

[LG-65] Prior-Guided Flow Matching for Target-Aware Molecule Design with Learnable Atom Number

Link: https://arxiv.org/abs/2509.01486
Authors: Jingyuan Zhou, Hao Qian, Shikui Tu, Lei Xu
Subjects: Machine Learning (cs.LG)

Abstract:Structure-based drug design (SBDD), aiming to generate 3D molecules with high binding affinity toward target proteins, is a vital approach in novel drug discovery. Although recent generative models have shown great potential, they suffer from unstable probability dynamics and a mismatch between generated molecule size and protein pocket geometry, resulting in inconsistent quality and off-target effects. We propose PAFlow, a novel target-aware molecular generation model featuring prior interaction guidance and a learnable atom number predictor. PAFlow adopts the efficient flow matching framework to model the generation process and constructs a new form of conditional flow matching for discrete atom types. A protein-ligand interaction predictor is incorporated to guide the vector field toward higher-affinity regions during generation, while an atom number predictor based on protein pocket information is designed to better align generated molecule size with target geometry. Extensive experiments on the CrossDocked2020 benchmark show that PAFlow achieves a new state-of-the-art in binding affinity (up to -8.31 Avg. Vina Score) while simultaneously maintaining favorable molecular properties.

[LG-66] Hierarchical Motion Captioning Utilizing External Text Data Source

Link: https://arxiv.org/abs/2509.01471
Authors: Clayton Leite, Yu Xiao
Subjects: Machine Learning (cs.LG)

Abstract:This paper introduces a novel approach to enhance existing motion captioning methods, which directly map representations of movement to high-level descriptive captions (e.g., "a person doing jumping jacks"). The existing methods require motion data annotated with high-level descriptions (e.g., "jumping jacks"). However, such data is rarely available in existing motion-text datasets, which additionally do not include low-level motion descriptions. To address this, we propose a two-step hierarchical approach. First, we employ large language models to create detailed descriptions corresponding to each high-level caption that appears in the motion-text datasets (e.g., "jumping while synchronizing arm extensions with the opening and closing of legs" for "jumping jacks"). These refined annotations are used to retrain motion-to-text models to produce captions with low-level details. Second, we introduce a pioneering retrieval-based mechanism. It aligns the detailed low-level captions with candidate high-level captions from additional text data sources, and combines them with motion features to fabricate precise high-level captions. Our methodology is distinctive in its ability to harness knowledge from external text sources to greatly increase motion captioning accuracy, especially for movements not covered in existing motion-text datasets. Experiments on three distinct motion-text datasets (HumanML3D, KIT, and BOTH57M) demonstrate that our method achieves an improvement in average performance (across BLEU-1, BLEU-4, CIDEr, and ROUGE-L) ranging from 6% to 50% compared to the state-of-the-art M2T-Interpretable.

[LG-67] Benchmarking Optimizers for Large Language Model Pretraining

Link: https://arxiv.org/abs/2509.01440
Authors: Andrei Semenov, Matteo Pagliardini, Martin Jaggi
Subjects: Machine Learning (cs.LG)
Comments: 73 pages, 44 figures, 48 tables

Abstract:The recent development of Large Language Models (LLMs) has been accompanied by an effervescence of novel ideas and methods to better optimize the loss of deep learning models. Claims from those methods are myriad: from faster convergence to removing reliance on certain hyperparameters. However, the diverse experimental protocols used to validate these claims make direct comparisons between methods challenging. This study presents a comprehensive evaluation of recent optimization techniques across standardized LLM pretraining scenarios, systematically varying model size, batch size, and training duration. Through careful tuning of each method, we provide guidance to practitioners on which optimizer is best suited for each scenario. For researchers, our work highlights promising directions for future optimization research. Finally, by releasing our code and making all experiments fully reproducible, we hope our efforts can help the development and rigorous benchmarking of future methods.

[LG-68] The Geometry of Nonlinear Reinforcement Learning

Link: https://arxiv.org/abs/2509.01432
Authors: Nikola Milosevic, Nico Scherf
Subjects: Machine Learning (cs.LG)

Abstract:Reward maximization, safe exploration, and intrinsic motivation are often studied as separate objectives in reinforcement learning (RL). We present a unified geometric framework that views these goals as instances of a single optimization problem on the space of achievable long-term behavior in an environment. Within this framework, classical methods such as policy mirror descent, natural policy gradient, and trust-region algorithms naturally generalize to nonlinear utilities and convex constraints. We illustrate how this perspective captures robustness, safety, exploration, and diversity objectives, and outline open challenges at the interface of geometry and deep RL.

[LG-69] Hierarchical Maximum Entropy via the Renormalization Group

Link: https://arxiv.org/abs/2509.01424
Authors: Amir R. Asadi
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 20 pages

Abstract:Hierarchical structures, which include multiple levels, are prevalent in statistical and machine-learning models as well as physical systems. Extending the foundational result that the maximum entropy distribution under mean constraints is given by the exponential Gibbs-Boltzmann form, we introduce the framework of “hierarchical maximum entropy” to address these multilevel models. We demonstrate that Pareto optimal distributions, which maximize entropies across all levels of hierarchical transformations, can be obtained via renormalization-group procedures from theoretical physics. This is achieved by formulating multilevel extensions of the Gibbs variational principle and the Donsker-Varadhan variational representation of entropy. Moreover, we explore settings with hierarchical invariances that significantly simplify the renormalization-group procedures, enhancing computational efficiency: quadratic modular loss functions, logarithmic loss functions, and nearest-neighbor loss functions. This is accomplished through the introduction of the concept of parameter flows, which serves as an analog to renormalization flows in renormalization group theory. This work connects ideas from probability theory, information theory, and statistical mechanics.

[LG-70] Accelerating PDE Solvers with Equation-Recast Neural Operator Preconditioning

Link: https://arxiv.org/abs/2509.01416
Authors: Qiyun Cheng, Md Hossain Sahadath, Huihua Yang, Shaowu Pan, Wei Ji
Subjects: Machine Learning (cs.LG)

Abstract:The computational overhead of traditional numerical solvers for partial differential equations (PDEs) remains a critical bottleneck for large-scale parametric studies and design optimization. We introduce a Minimal-Data Parametric Neural Operator Preconditioning (MD-PNOP) framework, which establishes a new paradigm for accelerating parametric PDE solvers while strictly preserving physical constraints. The key idea is to recast the residual from parameter deviation as an additional source term, where any trained neural operator can be used to refine the solution in an offline fashion. This directly addresses the fundamental extrapolation limitation of neural operators, enabling extrapolative generalization of any neural operator trained at a single parameter setting across a wide range of configurations without any retraining. The neural operator predictions are then embedded into iterative PDE solvers as improved initial guesses, thereby reducing convergence iterations without sacrificing accuracy. Unlike purely data-driven approaches, MD-PNOP guarantees that the governing equations remain fully enforced, eliminating concerns about loss of physics or interpretability. The framework is architecture-agnostic and is demonstrated using both Deep Operator Networks (DeepONet) and Fourier Neural Operators (FNO) for Boltzmann transport equation solvers in neutron transport applications. We demonstrate that neural operators trained on a single set of constant parameters successfully accelerate solutions with heterogeneous, sinusoidal, and discontinuous parameter distributions. In addition, MD-PNOP consistently achieves a ~50% reduction in computational time while maintaining full-order fidelity for fixed-source, single-group eigenvalue, and multigroup coupled eigenvalue problems.
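
The recasting idea in this abstract can be illustrated on a toy problem. Everything in the sketch below is an assumption made for illustration and is not taken from the paper: the model equation is -u'' + p(x)·u = f on the unit interval rather than a Boltzmann transport equation, the "trained neural operator" is replaced by an exact tridiagonal solve at a constant reference parameter `p0`, and the iterative solver is plain Gauss-Seidel. The deviation (p(x) - p0)·u is moved to the right-hand side as an extra source term, the reference solver is applied a few times, and the result is used as the initial guess for the true solver.

```python
def solve_reference(p0, rhs, h):
    """Thomas algorithm for (-d2/dx2 + p0) u = rhs with zero boundary values.

    Stand-in for a neural operator trained at the single parameter p0.
    """
    n = len(rhs)
    diag = 2.0 / h**2 + p0
    off = -1.0 / h**2
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = off / diag
    d[0] = rhs[0] / diag
    for i in range(1, n):
        denom = diag - off * c[i - 1]
        c[i] = off / denom
        d[i] = (rhs[i] - off * d[i - 1]) / denom
    u = [0.0] * n
    u[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        u[i] = d[i] - c[i] * u[i + 1]
    return u


def residual_norm(u, p, rhs, h):
    """Largest absolute residual of the true equation -u'' + p(x) u = rhs."""
    n = len(u)
    worst = 0.0
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        r = rhs[i] - ((2.0 * u[i] - left - right) / h**2 + p[i] * u[i])
        worst = max(worst, abs(r))
    return worst


def gauss_seidel(u0, p, rhs, h, tol=1e-8, max_iters=100000):
    """Solve the true system; return the solution and the sweeps used."""
    u = list(u0)
    n = len(u)
    for sweep in range(max_iters):
        if residual_norm(u, p, rhs, h) < tol:
            return u, sweep
        for i in range(n):
            left = u[i - 1] if i > 0 else 0.0
            right = u[i + 1] if i < n - 1 else 0.0
            u[i] = (rhs[i] + (left + right) / h**2) / (2.0 / h**2 + p[i])
    return u, max_iters


n = 20
h = 1.0 / (n + 1)
p0 = 1.0                                        # reference parameter
p = [1.0 + 2.0 * (i + 1) * h for i in range(n)]  # heterogeneous target parameter
f = [1.0] * n

# Recast: (-d2/dx2 + p0) u = f - (p - p0) u, applied as a fixed-point update.
guess = solve_reference(p0, f, h)
for _ in range(3):
    recast_rhs = [f[i] - (p[i] - p0) * guess[i] for i in range(n)]
    guess = solve_reference(p0, recast_rhs, h)

cold_solution, cold_sweeps = gauss_seidel([0.0] * n, p, f, h)
warm_solution, warm_sweeps = gauss_seidel(guess, p, f, h)
print("sweeps from zero guess:", cold_sweeps)
print("sweeps from recast guess:", warm_sweeps)
```

Both runs solve the same discretised equation to the same tolerance, so the governing equation stays enforced; the recast guess only changes where the iteration starts.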

[LG-71] Evaluating the stability of model explanations in instance-dependent cost-sensitive credit scoring

Link: https://arxiv.org/abs/2509.01409
Authors: Matteo Ballegeer, Matthias Bogaert, Dries F. Benoit
Subjects: Machine Learning (cs.LG)

Abstract:Instance-dependent cost-sensitive (IDCS) classifiers offer a promising approach to improving cost-efficiency in credit scoring by tailoring loss functions to instance-specific costs. However, the impact of such loss functions on the stability of model explanations remains unexplored in the literature, despite increasing regulatory demands for transparency. This study addresses this gap by evaluating the stability of Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP) when applied to IDCS models. Using four publicly available credit scoring datasets, we first assess the discriminatory power and cost-efficiency of IDCS classifiers, introducing a novel metric to enhance cross-dataset comparability. We then investigate the stability of SHAP and LIME feature importance rankings under varying degrees of class imbalance through controlled resampling. Our results reveal that while IDCS classifiers improve cost-efficiency, they produce significantly less stable explanations compared to traditional models, particularly as class imbalance increases, highlighting a critical trade-off between cost optimization and interpretability in credit scoring. Amid increasing regulatory scrutiny on explainability, this research underscores the pressing need to address stability issues in IDCS classifiers to ensure that their cost advantages are not undermined by unstable or untrustworthy explanations.

[LG-72] Distillation of a tractable model from the VQ-VAE

Link: https://arxiv.org/abs/2509.01400
Authors: Armin Hadžić, Milan Papez, Tomáš Pevný
Subjects: Machine Learning (cs.LG)

Abstract:Deep generative models with discrete latent space, such as the Vector-Quantized Variational Autoencoder (VQ-VAE), offer excellent data generation capabilities, but, due to the large size of their latent space, their probabilistic inference is deemed intractable. We demonstrate that the VQ-VAE can be distilled into a tractable model by selecting a subset of latent variables with high probabilities. This simple strategy is particularly efficient, especially if the VQ-VAE underutilizes its latent space, which is, indeed, very often the case. We frame the distilled model as a probabilistic circuit, and show that it preserves expressiveness of the VQ-VAE while providing tractable probabilistic inference. Experiments illustrate competitive performance in density estimation and conditional generation tasks, challenging the view of the VQ-VAE as an inherently intractable model.

[LG-73] Learn to Jump: Adaptive Random Walks for Long-Range Propagation through Graph Hierarchies IJCNN2025

Link: https://arxiv.org/abs/2509.01381
Authors: Joël Mathys, Federico Errica
Subjects: Machine Learning (cs.LG)
Comments: Presented at ComBayNS Workshop (oral) at IJCNN 2025

Abstract:Message-passing architectures struggle to sufficiently model long-range dependencies in node and graph prediction tasks. We propose a novel approach exploiting hierarchical graph structures and adaptive random walks to address this challenge. Our method introduces learnable transition probabilities that decide whether the walk should prefer the original graph or travel across hierarchical shortcuts. On a synthetic long-range task, we demonstrate that our approach can exceed the theoretical bound that constrains traditional approaches operating solely on the original topology. Specifically, walks that prefer the hierarchy achieve the same performance as longer walks on the original graph. These preliminary findings open a promising direction for efficiently processing large graphs while effectively capturing long-range dependencies.

[LG-74] CbLDM: A Diffusion Model for recovering nanostructure from pair distribution function

Link: https://arxiv.org/abs/2509.01370
Authors: Jiarui Cao, Zhiyang Zhang, Heming Wang, Jun Xu, Ling Lan, Ran Gu
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)

Abstract:Nowadays, the nanostructure inverse problem is an attractive problem that helps researchers to understand the relationship between the properties and the structure of nanomaterials. This article focuses on the problem of using the pair distribution function (PDF) to recover the nanostructure, which this article views as a conditional generation problem. This article proposes a deep learning model, CbLDM, the Condition-based Latent Diffusion Model. Based on the original latent diffusion model, the sampling steps of the diffusion model are reduced and the sample generation efficiency is improved by using the conditional prior to estimate the conditional posterior distribution, which is the approximated distribution of p(z|x). In addition, this article uses the Laplacian matrix instead of the distance matrix to recover the nanostructure, which can reduce the reconstruction error. Finally, this article compares CbLDM with existing models that were used to solve the nanostructure inverse problem, and finds that CbLDM demonstrates significantly higher prediction accuracy than these models, which reflects the ability of CbLDM to solve the nanostructure inverse problem and its potential to cope with other continuous conditional generation tasks.

[LG-75] Globally aware optimization with resurgence

Link: https://arxiv.org/abs/2509.01329
Authors: Wei Bu
Subjects: Machine Learning (cs.LG); Mathematical Physics (math-ph); Optimization and Control (math.OC)
Comments: 11+9 pages, 3 figures

Abstract:Modern optimization faces a fundamental challenge: local gradient-based methods provide no global information about the landscape of the objective function $L$, often leading to suboptimal convergence and sensitivity to initialization. We introduce a novel optimization framework that leverages resurgence theory from complex analysis to extract global structural information from divergent asymptotic series. Our key insight is that the factorially divergent perturbative expansions of parameter space partition functions encode precise information about all critical objective function values in the landscape through their Borel transform singularities. The algorithm works by computing the statistical mechanical partition function $Z(g) = \int e^{-L(\theta)/g} \, d\theta$ for small coupling $g \ll 1$, extracting its asymptotic series coefficients, and identifying Borel plane singularities that correspond one-to-one with critical objective function values. These target values provide global guidance to local optimizers, enabling principled learning rate adaptation and escape from suboptimal regions. Unlike heuristic adaptive methods, targets are theoretically grounded in the geometry of the optimization landscape.

[LG-76] Re3: Learning to Balance Relevance & Recency for Temporal Information Retrieval

Link: https://arxiv.org/abs/2509.01306
Authors: Jiawei Cao, Jie Ouyang, Zhaomeng Zhou, Mingyue Cheng, Yupeng Li, Jiaxian Yan, Qi Liu
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:Temporal Information Retrieval (TIR) is a critical yet unresolved task for modern search systems, retrieving documents that not only satisfy a query’s information need but also adhere to its temporal constraints. This task is shaped by two challenges: Relevance, ensuring alignment with the query’s explicit temporal requirements, and Recency, selecting the freshest document among multiple versions. Existing methods often address the two challenges in isolation, relying on brittle heuristics that fail in scenarios where temporal requirements and staleness resistance are intertwined. To address this gap, we introduce Re2Bench, a benchmark specifically designed to disentangle and evaluate Relevance, Recency, and their hybrid combination. Building on this foundation, we propose Re3, a unified and lightweight framework that dynamically balances semantic and temporal information through a query-aware gating mechanism. On Re2Bench, Re3 achieves state-of-the-art results, leading in R@1 across all three subsets. Ablation studies with backbone sensitivity tests confirm robustness, showing strong generalization across diverse encoders and real-world settings. This work provides both a generalizable solution and a principled evaluation suite, advancing the development of temporally aware retrieval systems. Re3 and Re2Bench are available online: this https URL

[LG-77] Equivariant U-Shaped Neural Operators for the Cahn-Hilliard Phase-Field Model

Link: https://arxiv.org/abs/2509.01293
Authors: Xiao Xue, M.F.P. ten Eikelder, Tianyue Yang, Yiqing Li, Kan He, Shuo Wang, Peter V. Coveney
Subjects: Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn)

Abstract:Phase separation in binary mixtures, governed by the Cahn-Hilliard equation, plays a central role in interfacial dynamics across materials science and soft matter. While numerical solvers are accurate, they are often computationally expensive and lack flexibility across varying initial conditions and geometries. Neural operators provide a data-driven alternative by learning solution operators between function spaces, but current architectures often fail to capture multiscale behavior and neglect underlying physical symmetries. Here we show that an equivariant U-shaped neural operator (E-UNO) can learn the evolution of the phase-field variable from short histories of past dynamics, achieving accurate predictions across space and time. The model combines global spectral convolution with a multi-resolution U-shaped architecture and regulates translation equivariance to align with the underlying physics. E-UNO outperforms standard Fourier neural operator and U-shaped neural operator baselines, particularly on fine-scale and high-frequency structures. By encoding symmetry and scale hierarchy, the model generalizes better, requires less training data, and yields physically consistent dynamics. This establishes E-UNO as an efficient surrogate for complex phase-field systems.

[LG-78] Iterative In-Context Learning to Enhance LLMs Abstract Reasoning: The Case-Study of Algebraic Tasks

Link: https://arxiv.org/abs/2509.01267
Authors: Stefano Fioravanti, Matteo Zavatteri, Roberto Confalonieri, Kamyar Zeinalipour, Paolo Frazzetto, Alessandro Sperduti, Nicolò Navarin
Subjects: Machine Learning (cs.LG)
Comments: Preprint. Under review

Abstract:LLMs face significant challenges in systematic generalization, particularly when dealing with reasoning tasks requiring compositional rules and handling out-of-distribution examples. To address these challenges, we introduce an in-context learning methodology that improves the generalization capabilities of general purpose LLMs. Our approach employs an iterative example selection strategy, which incrementally constructs a tailored set of few-shot examples optimized to enhance the model’s performance on a given task. As a proof of concept, we apply this methodology to the resolution of algebraic expressions involving non-standard simplification rules, according to which the priority of addition and multiplication is changed. Our findings indicate that LLMs exhibit limited proficiency in these mathematical tasks. We further demonstrate that LLM reasoning benefits from our iterative shot selection prompting strategy integrated with explicit reasoning instructions. Crucially, our experiments reveal that some LLMs achieve better generalization performance when prompted with simpler few-shot examples rather than complex ones following the test data distribution.
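
To make the non-standard rule concrete, the sketch below evaluates parenthesis-free expressions with addition binding tighter than multiplication. This is one reading of "the priority of addition and multiplication is changed"; the paper's exact rule set and data format are not given in the abstract, and the function name `eval_swapped_priority` is a hypothetical helper.

```python
from math import prod


def eval_swapped_priority(expression: str) -> int:
    """Evaluate '+' before '*' for expressions of non-negative integers.

    Assumption: no parentheses and only the operators '+' and '*'.
    """
    factors = expression.replace(" ", "").split("*")
    return prod(sum(int(term) for term in factor.split("+")) for factor in factors)


# Under the swapped rule 2 + 3 * 4 is read as (2 + 3) * 4.
print(eval_swapped_priority("2 + 3 * 4"))      # 20 (standard precedence gives 14)
print(eval_swapped_priority("2 + 3 * 4 + 1"))  # 25 (standard precedence gives 15)
```

A reference evaluator like this could be used to label few-shot examples or to score model answers on such a task.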

[LG-79] What Expressivity Theory Misses: Message Passing Complexity for GNNs

Link: https://arxiv.org/abs/2509.01254
Authors: Niklas Kemper, Tom Wollschläger, Stephan Günnemann
Subjects: Machine Learning (cs.LG)

Abstract:Expressivity theory, characterizing which graphs a GNN can distinguish, has become the predominant framework for analyzing GNNs, with new models striving for higher expressivity. However, we argue that this focus is misguided: First, higher expressivity is not necessary for most real-world tasks as these tasks rarely require expressivity beyond the basic WL test. Second, expressivity theory’s binary characterization and idealized assumptions fail to reflect GNNs’ practical capabilities. To overcome these limitations, we propose Message Passing Complexity (MPC): a continuous measure that quantifies the difficulty for a GNN architecture to solve a given task through message passing. MPC captures practical limitations like over-squashing while preserving the theoretical impossibility results from expressivity theory, effectively narrowing the gap between theory and practice. Through extensive validation on fundamental GNN tasks, we show that MPC’s theoretical predictions correlate with empirical performance, successfully explaining architectural successes and failures. Thereby, MPC advances beyond expressivity theory to provide a more powerful and nuanced framework for understanding and improving GNN architectures.

[LG-80] Practical and Private Hybrid ML Inference with Fully Homomorphic Encryption

Link: https://arxiv.org/abs/2509.01253
Authors: Sayan Biswas, Philippe Chartier, Akash Dhasade, Tom Jurien, David Kerriou, Anne-Marie Kerrmarec, Mohammed Lemou, Franklin Tranie, Martijn de Vos, Milos Vujasinovic
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:In contemporary cloud-based services, protecting users’ sensitive data and ensuring the confidentiality of the server’s model are critical. Fully homomorphic encryption (FHE) enables inference directly on encrypted inputs, but its practicality is hindered by expensive bootstrapping and inefficient approximations of non-linear activations. We introduce Safhire, a hybrid inference framework that executes linear layers under encryption on the server while offloading non-linearities to the client in plaintext. This design eliminates bootstrapping, supports exact activations, and significantly reduces computation. To safeguard model confidentiality despite client access to intermediate outputs, Safhire applies randomized shuffling, which obfuscates intermediate values and makes it practically impossible to reconstruct the model. To further reduce latency, Safhire incorporates advanced optimizations such as fast ciphertext packing and partial extraction. Evaluations on multiple standard models and datasets show that Safhire achieves 1.5X - 10.5X lower inference latency than Orion, a state-of-the-art baseline, with manageable communication overhead and comparable accuracy, thereby establishing the practicality of hybrid FHE inference.
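
The split between server and client described here can be sketched in plaintext. The code below is not Safhire and performs no encryption: the linear layer that would run under FHE on the server is computed in the clear, and the permutation step is a simplified stand-in for the paper's randomized shuffling. All names are illustrative assumptions. It only demonstrates why shuffling is compatible with offloading: an element-wise activation such as ReLU commutes with a permutation, so the server can undo the shuffle afterwards.

```python
import random


def linear(weights, bias, x):
    """Dense layer; in the real system this would run on ciphertexts."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


rng = random.Random(7)
weights = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(6)]
bias = [rng.uniform(-1.0, 1.0) for _ in range(6)]
x = [rng.uniform(-1.0, 1.0) for _ in range(4)]

# Server: linear layer, then a secret random permutation of the outputs.
pre_activation = linear(weights, bias, x)
permutation = list(range(len(pre_activation)))
rng.shuffle(permutation)
shuffled = [pre_activation[j] for j in permutation]

# Client: sees only shuffled values and applies the exact non-linearity.
activated = relu(shuffled)

# Server: undo the permutation before the next linear layer.
restored = [0.0] * len(activated)
for position, j in enumerate(permutation):
    restored[j] = activated[position]

print(restored == relu(pre_activation))
```

The client never learns which output neuron each value belongs to in this toy setup; how much that protects the model in practice depends on details the abstract does not give.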

[LG-81] Geometric origin of adversarial vulnerability in deep learning

Link: https://arxiv.org/abs/2509.01235
Authors: Yixiong Ren, Wenkang Du, Jianhui Zhou, Haiping Huang
Subjects: Machine Learning (cs.LG)

Abstract:How to balance training accuracy and adversarial robustness has become a challenge since the birth of deep learning. Here, we introduce a geometry-aware deep learning framework that leverages layer-wise local training to sculpt the internal representations of deep neural networks. This framework promotes intra-class compactness and inter-class separation in feature space, leading to manifold smoothness and adversarial robustness against white or black box attacks. The performance can be explained by an energy model with Hebbian coupling between elements of the hidden representation. Our results thus shed light on the physics of learning in the direction of alignment between biological and artificial intelligence systems. Using the current framework, the deep network can assimilate new information into existing knowledge structures while reducing representation interference.

[LG-82] RAMS: Residual-based adversarial-gradient moving sample method for scientific machine learning in solving partial differential equations

Link: https://arxiv.org/abs/2509.01234
Authors: Weihang Ouyang, Min Zhu, Wei Xiong, Si-Wei Liu, Lu Lu
Subjects: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)

Abstract:Physics-informed neural networks (PINNs) and neural operators, two leading scientific machine learning (SciML) paradigms, have emerged as powerful tools for solving partial differential equations (PDEs). Although increasing the training sample size generally enhances network performance, it also increases computational costs for physics-informed or data-driven training. To address this trade-off, different sampling strategies have been developed to sample more points in regions with high PDE residuals. However, existing sampling methods are computationally demanding for high-dimensional problems, such as high-dimensional PDEs or operator learning tasks. Here, we propose a residual-based adversarial-gradient moving sample (RAMS) method, which moves samples according to the adversarial gradient direction to maximize the PDE residual via gradient-based optimization. RAMS can be easily integrated into existing sampling methods. Extensive experiments, ranging from PINN applied to high-dimensional PDEs to physics-informed and data-driven operator learning problems, have been conducted to demonstrate the effectiveness of RAMS. Notably, RAMS represents the first efficient adaptive sampling approach for operator learning, marking a significant advancement in the SciML field.
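
The idea of moving samples along the gradient of the residual can be sketched with a toy function. In the code below the PDE residual is replaced by a hand-written bump `residual(x)` with a known derivative, which is purely an assumption for illustration; in a PINN the residual and its gradient with respect to the sample location would come from automatic differentiation of the network. The step size and number of steps are also arbitrary choices, not values from the paper.

```python
import math


def residual(x):
    """Toy stand-in for a PDE residual: a narrow bump centred at x = 0.7."""
    return math.exp(-50.0 * (x - 0.7) ** 2)


def squared_residual_gradient(x):
    """Derivative of residual(x)**2 with respect to the sample location x."""
    return -200.0 * (x - 0.7) * residual(x) ** 2


def move_samples(points, step=0.005, n_steps=10):
    """Gradient ascent on the squared residual, keeping points inside [0, 1]."""
    moved = list(points)
    for _ in range(n_steps):
        moved = [
            min(1.0, max(0.0, x + step * squared_residual_gradient(x)))
            for x in moved
        ]
    return moved


def mean_squared_residual(points):
    return sum(residual(x) ** 2 for x in points) / len(points)


initial_points = [i / 20.0 for i in range(21)]  # uniform samples on [0, 1]
moved_points = move_samples(initial_points)

mean_before = mean_squared_residual(initial_points)
mean_after = mean_squared_residual(moved_points)
print("mean squared residual before:", mean_before)
print("mean squared residual after:", mean_after)
```

Points near the bump drift toward it, so training would see more high-residual locations without drawing a larger sample set. Points far from the bump barely move because the gradient there is close to zero, which is one reason such a step might be combined with an existing sampling scheme.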

[LG-83] StoxLSTM: A Stochastic Extended Long Short-Term Memory Network for Time Series Forecasting

Link: https://arxiv.org/abs/2509.01187
Authors: Zihao Wang, Yunjie Li, Lingmin Zan, Zheng Gong, Mengtao Zhu
Subjects: Machine Learning (cs.LG)

Abstract:The Extended Long Short-Term Memory (xLSTM) network has attracted widespread research interest due to its enhanced capability to model complex temporal dependencies in diverse time series applications. Despite its success, there is still potential to further improve its representational capacity and forecasting performance, particularly on challenging real-world datasets with unknown, intricate, and hierarchical dynamics. In this work, we propose a stochastic xLSTM, termed StoxLSTM, that improves the original architecture into a state space modeling framework by incorporating stochastic latent variables within xLSTM. StoxLSTM models the latent dynamic evolution through specially designed recurrent blocks, enabling it to effectively capture the underlying temporal patterns and dependencies. Extensive experiments on publicly available benchmark datasets from multiple research communities demonstrate that StoxLSTM consistently outperforms state-of-the-art baselines with better robustness and stronger generalization ability.

[LG-84] ADMP-GNN: Adaptive Depth Message Passing GNN

Link: https://arxiv.org/abs/2509.01170
Authors: Yassine Abbahaddou, Fragkiskos D. Malliaros, Johannes F. Lutzeyer, Michalis Vazirgiannis
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Graph Neural Networks (GNNs) have proven to be highly effective in various graph learning tasks. A key characteristic of GNNs is their use of a fixed number of message-passing steps for all nodes in the graph, regardless of each node’s diverse computational needs and characteristics. Through empirical real-world data analysis, we demonstrate that the optimal number of message-passing layers varies for nodes with different characteristics. This finding is further supported by experiments conducted on synthetic datasets. To address this, we propose Adaptive Depth Message Passing GNN (ADMP-GNN), a novel framework that dynamically adjusts the number of message passing layers for each node, resulting in improved performance. This approach applies to any model that follows the message passing scheme. We evaluate ADMP-GNN on the node classification task and observe performance improvements over baseline GNN models.

[LG-85] Detecting Rug Pulls in Decentralized Exchanges: Machine Learning Evidence from the TON Blockchain

Link: https://arxiv.org/abs/2509.01168
Authors: Dmitry Yaremus, Jianghai Li, Alisa Kalacheva, Igor Vodolazov, Yury Yanovich
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Abstract:This paper presents a machine learning framework for the early detection of rug pull scams on decentralized exchanges (DEXs) within The Open Network (TON) blockchain. TON’s unique architecture, characterized by asynchronous execution and a massive web2 user base from Telegram, presents a novel and critical environment for fraud analysis. We conduct a comprehensive study on the two largest TON DEXs, this http URL and DeDust, fusing data from both platforms to train our models. A key contribution is the implementation and comparative analysis of two distinct rug pull definitions–TVL-based (a catastrophic liquidity withdrawal) and idle-based (a sudden cessation of all trading activity)–within a single, unified study. We demonstrate that Gradient Boosting models can effectively identify rug pulls within the first five minutes of trading, with the TVL-based method achieving superior AUC (up to 0.891) while the idle-based method excels at recall. Our analysis reveals that while feature sets are consistent across exchanges, their underlying distributions differ significantly, challenging straightforward data fusion and highlighting the need for robust, platform-aware models. This work provides a crucial early-warning mechanism for investors and enhances the security infrastructure of the rapidly growing TON DeFi ecosystem.

[LG-86] A Multimodal Deep Learning Framework for Early Diagnosis of Liver Cancer via Optimized BiLSTM-AM-VMD Architecture

Link: https://arxiv.org/abs/2509.01164
Authors: Cheng Cheng, Zeping Chen, Xavier Wang
Categories: Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Abstract:This paper proposes a novel multimodal deep learning framework integrating bidirectional LSTM, multi-head attention mechanism, and variational mode decomposition (BiLSTM-AM-VMD) for early liver cancer diagnosis. Using heterogeneous data that include clinical characteristics, biochemical markers, and imaging-derived variables, our approach improves both prediction accuracy and interpretability. Experimental results on real-world datasets demonstrate superior performance over traditional machine learning and baseline deep learning models.

[LG-87] Multi-Modal Machine Learning Framework for Predicting Early Recurrence of Brain Tumors Using MRI and Clinical Biomarkers

Link: https://arxiv.org/abs/2509.01161
Authors: Cheng Cheng, Zeping Chen, Rui Xie, Peiyao Zheng, Xavier Wang
Categories: Machine Learning (cs.LG)

Abstract:Accurately predicting early recurrence in brain tumor patients following surgical resection remains a clinical challenge. This study proposes a multi-modal machine learning framework that integrates structural MRI features with clinical biomarkers to improve postoperative recurrence prediction. We employ four machine learning algorithms – Gradient Boosting Machine (GBM), Random Survival Forest (RSF), CoxBoost, and XGBoost – and validate model performance using concordance index (C-index), time-dependent AUC, calibration curves, and decision curve analysis. Our model demonstrates promising performance, offering a potential tool for risk stratification and personalized follow-up planning.
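
The concordance index (C-index) named above as a validation metric can be sketched in a few lines. This is a minimal version of Harrell's C-index under the usual convention (a pair is comparable when the earlier time is an observed event, and tied risk scores count as half); the toy times, event flags, and risk scores are invented for illustration and do not come from the paper.

```python
def concordance_index(times, events, risks):
    # Harrell's C-index: the fraction of comparable pairs in which the subject
    # with the shorter observed time also has the higher predicted risk.
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Pair (i, j) is comparable if i had an observed event before time j.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): months to recurrence, event flag (0 = censored), risk score.
times = [5, 3, 8, 2]
events = [1, 1, 0, 1]
risks = [0.6, 0.8, 0.1, 0.9]
c_index = concordance_index(times, events, risks)
print(c_index)
```

A value of 1.0 means the risk ordering matches the event ordering on every comparable pair, 0.5 corresponds to uninformative scores, and values below 0.5 indicate an inverted ranking.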

[LG-88] Nonlinear Performative Prediction

Link: https://arxiv.org/abs/2509.01139
Authors: Guangzheng Zhong, Yang Liu, Jiming Liu
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Performative prediction is an emerging paradigm in machine learning that addresses scenarios where the model’s prediction may induce a shift in the distribution of the data it aims to predict. Current works in this field often rely on uncontrollable assumptions, such as bounded gradients of performative loss, and primarily focus on linear cases in their examples and evaluations to maintain consistency between theoretical guarantees and empirical validations. However, such linearity rarely holds in real-world applications, where the data usually exhibit complex nonlinear characteristics. In this paper, we relax these out-of-control assumptions and present a novel design that generalizes performative prediction to nonlinear cases while preserving essential theoretical properties. Specifically, we formulate the loss function of performative prediction using a maximum margin approach and extend it to nonlinear spaces through kernel methods. To quantify the data distribution shift, we employ the discrepancy between prediction errors on these two distributions as an indicator, which characterizes the impact of the performative effect on specific learning tasks. By doing so, we can derive, for both linear and nonlinear cases, the conditions for performative stability, a critical and desirable property in performative contexts. Building on these theoretical insights, we develop an algorithm that guarantees the performative stability of the predictive model. We validate the effectiveness of our method through experiments on synthetic and real-world datasets with both linear and nonlinear data distributions, demonstrating superior performance compared to state-of-the-art baselines.

[LG-89] A Class of Random-Kernel Network Models

Link: https://arxiv.org/abs/2509.01090
Authors: James Tian
Categories: Machine Learning (cs.LG); Functional Analysis (math.FA); Numerical Analysis (math.NA)

Abstract:We introduce random-kernel networks, a multilayer extension of random feature models where depth is created by deterministic kernel composition and randomness enters only in the outermost layer. We prove that deeper constructions can approximate certain functions with fewer Monte Carlo samples than any shallow counterpart, establishing a depth separation theorem in sample complexity.

[LG-90] REFINESTAT: Efficient Exploration for Probabilistic Program Synthesis

Link: https://arxiv.org/abs/2509.01082
Authors: Madhav Kanda, Shubham Ugare, Sasa Misailovic
Categories: Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments: RefineStat constrains LM decoding with statistical validity checks and uses diagnostic-guided resampling (priors/likelihoods) to transform small LMs’ drafts into correct, reliable probabilistic programs that can match or surpass closed-source models

Abstract:Probabilistic programming offers a powerful framework for modeling uncertainty, yet statistical model discovery in this domain entails navigating an immense search space under strict domain-specific constraints. When small language models are tasked with generating probabilistic programs, they frequently produce outputs that suffer from both syntactic and semantic errors, such as flawed inference constructs. Motivated by probabilistic programmers’ domain expertise and debugging strategies, we introduce RefineStat, a language model–driven framework that enforces semantic constraints ensuring synthesized programs contain valid distributions and well-formed parameters, and then applies diagnostic-aware refinement by resampling prior or likelihood components whenever reliability checks fail. We evaluate RefineStat on multiple probabilistic-programming code-generation tasks using smaller language models (SLMs) and find that it produces programs that are both syntactically sound and statistically reliable, often matching or surpassing those from closed-source large language models (e.g., OpenAI o3).

[LG-91] IMU-Enhanced EEG Motion Artifact Removal with Fine-Tuned Large Brain Models

Link: https://arxiv.org/abs/2509.01073
Authors: Yuhong Zhang, Xusheng Zhu, Yuchen Xu, ChiaEn Lu, Hsinyu Shih, Gert Cauwenberghs, Tzyy-Ping Jung
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP); Neurons and Cognition (q-bio.NC)
Comments: Accepted to IEEE EMBS 12th International Conference on Neural Engineering (NER 2025)

Abstract:Electroencephalography (EEG) is a non-invasive method for measuring brain activity with high temporal resolution; however, EEG signals often exhibit low signal-to-noise ratios because of contamination from physiological and environmental artifacts. One of the major challenges hindering the real-world deployment of brain-computer interfaces (BCIs) involves the frequent occurrence of motion-related EEG artifacts. Most prior studies on EEG motion artifact removal rely on single-modality approaches, such as Artifact Subspace Reconstruction (ASR) and Independent Component Analysis (ICA), without incorporating simultaneously recorded modalities like inertial measurement units (IMUs), which directly capture the extent and dynamics of motion. This work proposes a fine-tuned large brain model (LaBraM)-based correlation attention mapping method that leverages spatial channel relationships in IMU data to identify motion-related artifacts in EEG signals. The fine-tuned model contains approximately 9.2 million parameters and uses 5.9 hours of EEG and IMU recordings for training, just 0.2346% of the 2500 hours used to train the base model. We compare our results against the established ASR-ICA benchmark across varying time scales and motion activities, showing that incorporating IMU reference signals significantly improves robustness under diverse motion scenarios.

[LG-92] MatPROV: A Provenance Graph Dataset of Material Synthesis Extracted from Scientific Literature

Link: https://arxiv.org/abs/2509.01042
Authors: Hirofumi Tsuruta, Masaya Kumagai
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)

Abstract:Synthesis procedures play a critical role in materials research, as they directly affect material properties. With data-driven approaches increasingly accelerating materials discovery, there is growing interest in extracting synthesis procedures from scientific literature as structured data. However, existing studies often rely on rigid, domain-specific schemas with predefined fields for structuring synthesis procedures or assume that synthesis procedures are linear sequences of operations, which limits their ability to capture the structural complexity of real-world procedures. To address these limitations, we adopt PROV-DM, an international standard for provenance information, which supports flexible, graph-based modeling of procedures. We present MatPROV, a dataset of PROV-DM-compliant synthesis procedures extracted from scientific literature using large language models. MatPROV captures structural complexities and causal relationships among materials, operations, and conditions through visually intuitive directed graphs. This representation enables machine-interpretable synthesis knowledge, opening opportunities for future research such as automated synthesis planning and optimization.

[LG-93] Any-Order Flexible Length Masked Diffusion

Link: https://arxiv.org/abs/2509.01025
Authors: Jaeyeon Kim, Lee Cheuk-Kit, Carles Domingo-Enrich, Yilun Du, Sham Kakade, Timothy Ngotiaoco, Sitan Chen, Michael Albergo
Categories: Machine Learning (cs.LG)
Comments: Preprint

Abstract:Masked diffusion models (MDMs) have recently emerged as a promising alternative to autoregressive models over discrete domains. MDMs generate sequences in an any-order, parallel fashion, enabling fast inference and strong performance on non-causal tasks. However, a crucial limitation is that they do not support token insertions and are thus limited to fixed-length generations. To this end, we introduce Flexible Masked Diffusion Models (FlexMDMs), a discrete diffusion paradigm that simultaneously can model sequences of flexible length while provably retaining MDMs’ flexibility of any-order inference. Grounded in an extension of the stochastic interpolant framework, FlexMDMs generate sequences by inserting mask tokens and unmasking them. Empirically, we show that FlexMDMs match MDMs in perplexity while modeling length statistics with much higher fidelity. On a synthetic maze planning task, they achieve $\approx 60\%$ higher success rate than MDM baselines. Finally, we show pretrained MDMs can easily be retrofitted into FlexMDMs: on 16 H100s, it takes only three days to fine-tune LLaDA-8B into a FlexMDM, achieving superior performance on math (GSM8K, $58\% \to 67\%$) and code infilling performance ($52\% \to 65\%$).

[LG-94] Quantum-based QoE Optimization in Advanced Cellular Networks: Integration and Cloud Gaming Use Case

Link: https://arxiv.org/abs/2509.01008
Authors: Fatma Chaouech, Javier Villegas, António Pereira, Carlos Baena, Sergio Fortes, Raquel Barco, Dominic Gribben, Mohammad Dib, Alba Villarino, Aser Cortines, Román Orús
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Quantum Physics (quant-ph)

Abstract:This work explores the integration of Quantum Machine Learning (QML) and Quantum-Inspired (QI) techniques for optimizing end-to-end (E2E) network services in telecommunication systems, particularly focusing on 5G networks and beyond. The application of QML and QI algorithms is investigated, comparing their performance with classical Machine Learning (ML) approaches. The present study employs a hybrid framework combining quantum and classical computing leveraging the strengths of QML and QI, without the penalty of quantum hardware availability. This is particularized for the optimization of the Quality of Experience (QoE) over cellular networks. The framework comprises an estimator for obtaining the expected QoE based on user metrics, service settings, and cell configuration, and an optimizer that uses the estimation to choose the best cell and service configuration. Although the approach is applicable to any QoE-based network management, its implementation is particularized for the optimization of network configurations for Cloud Gaming services. Then, it is evaluated via performance metrics such as accuracy and model loading and inference times for the estimator, and time to solution and solution score for the optimizer. The results indicate that QML models achieve similar or superior accuracy to classical ML models for estimation, while decreasing inference and loading times. Furthermore, potential for better performance is observed for higher-dimensional data, highlighting promising results for higher complexity problems. Thus, the results demonstrate the promising potential of QML in advancing network optimization, although challenges related to data availability and integration complexities between quantum and classical ML are identified as future research lines.

[LG-95] IoT-based Noise Monitoring using Mobile Nodes for Smart Cities

Link: https://arxiv.org/abs/2509.00979
Authors: Bhima Sankar Manthina (1), Shreyash Gujar (1), Sachin Chaudhari (1), Kavita Vemuri (1), Shivam Chhirolya (2) ((1) International Institute of Information Technology-Hyderabad (IIIT-H), India, (2) Prezent.AI, India)
Categories: Emerging Technologies (cs.ET); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Abstract:Urban noise pollution poses a significant threat to public health, yet existing monitoring infrastructures offer limited spatial coverage and adaptability. This paper presents a scalable, low-cost, IoT-based, real-time environmental noise monitoring solution using mobile nodes (sensor nodes on a moving vehicle). The system utilizes a low-cost sound sensor integrated with GPS-enabled modules to collect geotagged noise data at one-second intervals. The sound nodes are calibrated against a reference sound level meter in a laboratory setting to ensure accuracy using various machine learning (ML) algorithms, such as Simple Linear Regression (SLR), Multiple Linear Regression (MLR), Polynomial Regression (PR), Segmented Regression (SR), Support Vector Regression (SVR), Decision Tree (DT), and Random Forest Regression (RFR). While laboratory calibration demonstrates high accuracy, it is shown that the performance of the nodes degrades during data collection in a moving vehicle. To address this, it is demonstrated that the calibration must be performed on the IoT-based node based on the data collected in a moving environment along with the reference device. Among the employed ML models, RFR achieved the best performance with an $R^2$ of 0.937 and RMSE of 1.09 for mobile calibration. The system was deployed in Hyderabad, India, through three measurement campaigns across 27 days, capturing 436,420 data points. Results highlight temporal and spatial noise variations across weekdays, weekends, and during Diwali. Incorporating vehicular velocity into the calibration significantly improves accuracy. The proposed system demonstrates the potential for widespread deployment of IoT-based noise sensing networks in smart cities, enabling effective noise pollution management and urban planning.

[LG-96] Tabular Diffusion Counterfactual Explanations

Link: https://arxiv.org/abs/2509.00876
Authors: Wei Zhang, Brian Barr, John Paisley
Categories: Machine Learning (cs.LG)

Abstract:Counterfactual explanation methods provide an important tool in the field of interpretable machine learning. Recent advances in this direction have focused on diffusion models to explain a deep classifier. However, these techniques have predominantly focused on problems in computer vision. In this paper, we focus on tabular data typical in finance and the social sciences and propose a novel guided reverse process for categorical features based on an approximation to the Gumbel-softmax distribution. Furthermore, we study the effect of the temperature $\tau$ and derive a theoretical bound between the Gumbel-softmax distribution and our proposed approximated distribution. We perform experiments on several large-scale credit lending and other tabular datasets, assessing their performance in terms of the quantitative measures of interpretability, diversity, instability, and validity. These results indicate that our approach outperforms popular baseline methods, producing robust and realistic counterfactual explanations.
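
For readers unfamiliar with the Gumbel-softmax distribution and the role of its temperature, the following sketch draws one relaxed categorical sample at two temperatures using the same noise. It shows the standard Gumbel-softmax only, not the approximation proposed in the paper; the function name, the logits, and the seed are illustrative assumptions.

```python
import math
import random


def gumbel_softmax_sample(logits, tau, rng):
    # Draw Gumbel(0, 1) noise, add it to the logits, and apply a
    # temperature-scaled softmax. A small tau pushes the sample toward one-hot.
    gumbels = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in logits]
    scaled = [(l + g) / tau for l, g in zip(logits, gumbels)]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical category probabilities for one categorical feature.
logits = [math.log(p) for p in (0.5, 0.3, 0.2)]

# Reusing the seed gives both calls identical noise, isolating the effect of tau.
soft = gumbel_softmax_sample(logits, tau=5.0, rng=random.Random(0))
sharp = gumbel_softmax_sample(logits, tau=0.1, rng=random.Random(0))
print(soft, sharp)
```

With identical noise, lowering the temperature can only concentrate more mass on the largest entry, which is why low temperatures behave almost like discrete samples while high temperatures give smoother, more uniform vectors.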

[LG-97] Predicting Multi-Type Talented Students in Secondary School Using Semi-Supervised Machine Learning

Link: https://arxiv.org/abs/2509.00863
Authors: Xinzhe Zheng, Zhen-Qun Yang, Jiannong Cao, Jiabei Cheng
Categories: Machine Learning (cs.LG)

Abstract:Talent identification plays a critical role in promoting student development. However, traditional approaches often rely on manual processes or focus narrowly on academic achievement, typically delaying intervention until the higher education stage. This overlooks diverse non-academic talents and misses opportunities for early intervention. To address this gap, this study introduces TalentPredictor, a novel semi-supervised multi-modal neural network that combines Transformer, LSTM, and ANN architectures. This model is designed to predict seven different talent types (academic, sport, art, leadership, service, technology, and others) in secondary school students within an offline educational setting. Drawing on existing offline educational data from 1,041 local secondary students, TalentPredictor overcomes the limitations of traditional talent identification methods. By clustering various award records into talent categories and extracting features from students’ diverse learning behaviors, it achieves high prediction accuracy (0.908 classification accuracy, 0.908 ROC AUC). This demonstrates the potential of machine learning to identify diverse talents early in student development.

[LG-98] Crystal Structure Prediction with a Geometric Permutation-Invariant Loss Function

Link: https://arxiv.org/abs/2509.00832
Authors: Emmanuel Jehanno, Romain Menegaux, Julien Mairal, Sergei Grudinin
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)

Abstract:Crystalline structure prediction remains an open challenge in materials design. Despite recent advances in computational materials science, accurately predicting the three-dimensional crystal structures of organic materials–an essential first step for designing materials with targeted properties–remains elusive. In this work, we address the problem of molecular assembly, where a set $\mathcal{S}$ of identical rigid molecules is packed to form a crystalline structure. Existing state-of-the-art models typically rely on computationally expensive, iterative flow-matching approaches. We propose a novel loss function that correctly captures key geometric molecular properties while maintaining permutation invariance over $\mathcal{S}$. We achieve this via a differentiable linear assignment scheme based on the Sinkhorn algorithm. Remarkably, we show that even a simple regression using our method SinkFast significantly outperforms more complex flow-matching approaches on the COD-Cluster17 benchmark, a curated subset of the Crystallography Open Database (COD).
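
The Sinkhorn-based soft assignment mentioned in the abstract can be sketched as alternating row and column normalisation of a Gibbs kernel. This is a generic illustration of the Sinkhorn algorithm on a toy cost matrix, not the SinkFast loss itself; the one-dimensional coordinates, the regularisation strength `eps`, and the iteration count are assumptions chosen for the example.

```python
import math


def sinkhorn(cost, eps=0.5, n_iters=200):
    # Entropy-regularised assignment for a square cost matrix: start from the
    # Gibbs kernel exp(-cost / eps) and alternately normalise rows and columns
    # until the matrix is (approximately) doubly stochastic.
    n = len(cost)
    plan = [[math.exp(-c / eps) for c in row] for row in cost]
    for _ in range(n_iters):
        for i in range(n):
            row_sum = sum(plan[i])
            plan[i] = [v / row_sum for v in plan[i]]
        for j in range(n):
            col_sum = sum(plan[i][j] for i in range(n))
            for i in range(n):
                plan[i][j] /= col_sum
    return plan


# Toy example (hypothetical): match predicted positions to shuffled targets.
predicted = [0.0, 1.0, 2.0]
targets = [2.1, 0.1, 0.9]
cost = [[(p - t) ** 2 for t in targets] for p in predicted]

plan = sinkhorn(cost)
assignment = [max(range(len(row)), key=row.__getitem__) for row in plan]
print(assignment)
```

Every operation here is a smooth function of the costs, so in an autodiff framework the resulting soft matching can sit inside a loss; a smaller `eps` gives a plan closer to a hard permutation but converges more slowly.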

[LG-99] XAI-Driven Machine Learning System for Driving Style Recognition and Personalized Recommendations

Link: https://arxiv.org/abs/2509.00802
Authors: Feriel Amel Sellal, Ahmed Ayoub Bellachia, Meryem Malak Dif, Enguerrand De Rautlin De La Roy, Mouhamed Amine Bouchiha, Yacine Ghamri-Doudane
Categories: Machine Learning (cs.LG)

Abstract:Artificial intelligence (AI) is increasingly used in the automotive industry for applications such as driving style classification, which aims to improve road safety, efficiency, and personalize user experiences. While deep learning (DL) models, such as Long Short-Term Memory (LSTM) networks, excel at this task, their black-box nature limits interpretability and trust. This paper proposes a machine learning (ML)-based method that balances high accuracy with interpretability. We introduce a high-quality dataset, CARLA-Drive, and leverage ML techniques like Random Forest (RF), Gradient Boosting (XGBoost), and Support Vector Machine (SVM), which are efficient, lightweight, and interpretable. In addition, we apply the SHAP (Shapley Additive Explanations) explainability technique to provide personalized recommendations for safer driving. Achieving an accuracy of 0.92 on a three-class classification task with both RF and XGBoost classifiers, our approach matches DL models in performance while offering transparency and practicality for real-world deployment in intelligent transportation systems.

[LG-100] Fairness in Federated Learning: Trends, Challenges, and Opportunities

Link: https://arxiv.org/abs/2509.00799
Authors: Noorain Mukhtiar, Adnan Mahmood, Quan Z. Sheng
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments: Accepted and Published

Abstract:At the intersection of the cutting-edge technologies and privacy concerns, Federated Learning (FL) with its distributed architecture, stands at the forefront in a bid to facilitate collaborative model training across multiple clients while preserving data privacy. However, the applicability of FL systems is hindered by fairness concerns arising from numerous sources of heterogeneity that can result in biases and undermine a system’s effectiveness, with skewed predictions, reduced accuracy, and inefficient model convergence. This survey thus explores the diverse sources of bias, including but not limited to, data, client, and model biases, and thoroughly discusses the strengths and limitations inherited within the array of the state-of-the-art techniques utilized in the literature to mitigate such disparities in the FL training process. We delineate a comprehensive overview of the several notions, theoretical underpinnings, and technical aspects associated with fairness and their adoption in FL-based multidisciplinary environments. Furthermore, we examine salient evaluation metrics leveraged to measure fairness quantitatively. Finally, we envisage exciting open research directions that have the potential to drive future advancements in achieving fairer FL frameworks, in turn, offering a strong foundation for future research in this pivotal area.

[LG-101] Flow Matters: Directional and Expressive GNNs for Heterophilic Graphs

Link: https://arxiv.org/abs/2509.00772
Authors: Arman Gupta, Govind Waghmare, Gaurav Oberoi, Nitish Srivastava
Categories: Machine Learning (cs.LG)

Abstract:In heterophilic graphs, where neighboring nodes often belong to different classes, conventional Graph Neural Networks (GNNs) struggle due to their reliance on local homophilous neighborhoods. Prior studies suggest that modeling edge directionality in such graphs can increase effective homophily and improve classification performance. Simultaneously, recent work on polynomially expressive GNNs shows promise in capturing higher-order interactions among features. In this work, we study the combined effect of edge directionality and expressive message passing on node classification in heterophilic graphs. Specifically, we propose two architectures: (1) a polynomially expressive GAT baseline (Poly), and (2) a direction-aware variant (Dir-Poly) that separately aggregates incoming and outgoing edges. Both models are designed to learn permutation-equivariant high-degree polynomials over input features, while remaining scalable with no added time complexity. Experiments on five benchmark heterophilic datasets show that our Poly model consistently outperforms existing baselines, and that Dir-Poly offers additional gains on graphs with inherent directionality (e.g., Roman Empire), achieving state-of-the-art results. Interestingly, on undirected graphs, introducing artificial directionality does not always help, suggesting that the benefit of directional message passing is context-dependent. Our findings highlight the complementary roles of edge direction and expressive feature modeling in heterophilic graph learning.
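
The direction-aware idea described above, aggregating incoming and outgoing edges separately, can be illustrated with a small sketch. It uses plain mean aggregation over scalar features and is not the Dir-Poly architecture; the function name, the toy graph, and the choice of 0.0 for nodes with no neighbours in a given direction are assumptions for illustration.

```python
def directional_mean_aggregate(num_nodes, edges, features):
    # For each node, average the features of its in-neighbours and of its
    # out-neighbours separately, so a later layer can weight them differently.
    in_nbrs = [[] for _ in range(num_nodes)]
    out_nbrs = [[] for _ in range(num_nodes)]
    for src, dst in edges:
        out_nbrs[src].append(dst)
        in_nbrs[dst].append(src)

    def mean(nodes):
        # Assumption: an empty neighbourhood contributes 0.0.
        return sum(features[n] for n in nodes) / len(nodes) if nodes else 0.0

    return [(mean(in_nbrs[v]), mean(out_nbrs[v])) for v in range(num_nodes)]


# Toy directed graph (hypothetical): 0 -> 1, 0 -> 2, 1 -> 2.
edges = [(0, 1), (0, 2), (1, 2)]
features = [1.0, 2.0, 4.0]
aggregated = directional_mean_aggregate(3, edges, features)
print(aggregated)
```

An undirected aggregator would merge the two values in each pair, discarding the distinction between sources and sinks that the abstract argues can raise effective homophily on graphs with inherent directionality.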

[LG-102] Attribute Fusion-based Classifier on Framework of Belief Structure

Link: https://arxiv.org/abs/2509.00754
Authors: Qiying Hu, Yingying Liang, Qianli Zhou, Witold Pedrycz
Categories: Machine Learning (cs.LG)

Abstract:Dempster-Shafer Theory (DST) provides a powerful framework for modeling uncertainty and has been widely applied to multi-attribute classification tasks. However, traditional DST-based attribute fusion-based classifiers suffer from oversimplified membership function modeling and limited exploitation of the belief structure brought by basic probability assignment (BPA), reducing their effectiveness in complex real-world scenarios. This paper presents an enhanced attribute fusion-based classifier that addresses these limitations through two key innovations. First, we adopt a selective modeling strategy that utilizes both single Gaussian and Gaussian Mixture Models (GMMs) for membership function construction, with model selection guided by cross-validation and a tailored evaluation metric. Second, we introduce a novel method to transform the possibility distribution into a BPA by combining simple BPAs derived from normalized possibility distributions, enabling a much richer and more flexible representation of uncertain information. Furthermore, we apply the belief structure-based BPA generation method to the evidential K-Nearest Neighbors classifier, enhancing its ability to incorporate uncertainty information into decision-making. Comprehensive experiments on benchmark datasets are conducted to evaluate the performance of the proposed attribute fusion-based classifier and the enhanced evidential K-Nearest Neighbors classifier in comparison with both evidential classifiers and conventional machine learning classifiers. The results demonstrate that our proposed classifier outperforms the best existing evidential classifier, achieving an average accuracy improvement of 4.84%, while maintaining low variance, thus confirming its superior effectiveness and robustness.
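
As background for how basic probability assignments (BPAs) are fused in Dempster-Shafer Theory, the sketch below implements the classical Dempster rule of combination. It illustrates standard DST fusion only, not the paper's new BPA generation method; the two-class frame and the mass values are invented for the example.

```python
def dempster_combine(m1, m2):
    # Dempster's rule: multiply masses of every pair of focal sets, assign the
    # product to their intersection, and renormalise by the non-conflicting mass.
    combined = {}
    conflict = 0.0
    for set_a, mass_a in m1.items():
        for set_b, mass_b in m2.items():
            inter = set_a & set_b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + mass_a * mass_b
            else:
                conflict += mass_a * mass_b  # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}


# Hypothetical two-class frame {a, b}; each attribute contributes one BPA.
A, B = frozenset({"a"}), frozenset({"b"})
AB = A | B  # mass on the whole frame expresses ignorance
m1 = {A: 0.6, AB: 0.4}
m2 = {B: 0.5, AB: 0.5}
fused = dempster_combine(m1, m2)
print(fused)
```

Here the conflicting mass is 0.6 × 0.5 = 0.3, so the remaining products are divided by 0.7, giving 3/7 to {a} and 2/7 each to {b} and {a, b}.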

[LG-103] Robust Spatiotemporal Forecasting Using Adaptive Deep-Unfolded Variational Mode Decomposition

Link: https://arxiv.org/abs/2509.00703
Authors: Osama Ahmad, Lukas Wesemann, Fabian Waschkowski, Zubair Khalid
Categories: Machine Learning (cs.LG)
Comments: Under review in IEEE Signal Processing Letters

Abstract:Accurate spatiotemporal forecasting is critical for numerous complex systems but remains challenging due to complex volatility patterns and spectral entanglement in conventional graph neural networks (GNNs). While decomposition-integrated approaches like variational mode graph convolutional network (VMGCN) improve accuracy through signal decomposition, they suffer from computational inefficiency and manual hyperparameter tuning. To address these limitations, we propose the mode adaptive graph network (MAGN) that transforms iterative variational mode decomposition (VMD) into a trainable neural module. Our key innovations include (1) an unfolded VMD (UVMD) module that replaces iterative optimization with a fixed-depth network to reduce the decomposition time (by 250x for the LargeST benchmark), and (2) mode-specific learnable bandwidth constraints ($\alpha_k$) that adapt to spatial heterogeneity and eliminate manual tuning while preventing spectral overlap. Evaluated on the LargeST benchmark (6,902 sensors, 241M observations), MAGN achieves an 85-95% reduction in the prediction error over VMGCN and outperforms state-of-the-art baselines.

[LG-104] An Evolutionary Multi-objective Optimization for Replica-Exchange-based Physics-informed Operator Learning Network

Link: https://arxiv.org/abs/2509.00663
Authors: Binghang Lu, Changhong Mou, Guang Lin
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Abstract:In this paper, we propose an evolutionary Multi-objective Optimization for Replica-Exchange-based Physics-informed Operator learning Network, which is a novel operator learning network to efficiently solve parametric partial differential equations. In forward and inverse settings, this operator learning network only admits minimum requirement of noisy observational data. While physics-informed neural networks and operator learning approaches such as Deep Operator Networks and Fourier Neural Operators offer promising alternatives to traditional numerical solvers, they struggle with balancing operator and physics losses, maintaining robustness under noisy or sparse data, and providing uncertainty quantification. The proposed framework addresses these limitations by integrating: (i) evolutionary multi-objective optimization to adaptively balance operator and physics-based losses in the Pareto front; (ii) replica exchange stochastic gradient Langevin dynamics to improve global parameter-space exploration and accelerate convergence; and (iii) built-in Bayesian uncertainty quantification from stochastic sampling. The proposed operator learning method is tested numerically on several different problems including one-dimensional Burgers equation and the time-fractional mixed diffusion-wave equation. The results indicate that our framework consistently outperforms the general operator learning methods in accuracy, noise robustness, and the ability to quantify uncertainty.

[LG-105] Revisiting Deep AC-OPF

Link: https://arxiv.org/abs/2509.00655
Authors: Oluwatomisin I. Dada,Neil D. Lawrence
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: 18 pages, 15 tables

Abstract:Recent work has proposed machine learning (ML) approaches as fast surrogates for solving AC optimal power flow (AC-OPF), with claims of significant speed-ups and high accuracy. In this paper, we revisit these claims through a systematic evaluation of ML models against a set of simple yet carefully designed linear baselines. We introduce OPFormer-V, a transformer-based model for predicting bus voltages, and compare it to both the state-of-the-art DeepOPF-V model and simple linear methods. Our findings reveal that, while OPFormer-V improves over DeepOPF-V, the relative gains of the ML approaches considered are less pronounced than expected. Simple linear baselines can achieve comparable performance. These results highlight the importance of including strong linear baselines in future evaluations.

[LG-106] Missing Data Imputation using Neural Cellular Automata

Link: https://arxiv.org/abs/2509.00651
Authors: Tin Luu,Binh Nguyen,Man Ngo
Categories: Machine Learning (cs.LG)
Comments: 20 pages, 4 figures

Abstract:Missingness is one of the most persistent problems in working with tabular data. Over the years, researchers have explored increasingly better ways to impute missing data. Recently, with the rapid development of machine learning and deep learning, there is a new trend of leveraging generative models to solve the imputation task. While imputation variants of well-known models such as Variational Autoencoders and Generative Adversarial Networks have been investigated, prior work has overlooked Neural Cellular Automata (NCA), a powerful computational model. In this paper, we propose a novel imputation method inspired by NCA. We show that, with some appropriate adaptations, an NCA-based model is able to address the missing data imputation problem. We also provide several experiments showing that our model outperforms state-of-the-art methods in terms of imputation error and post-imputation performance.

[LG-107] Context-Action Embedding Learning for Off-Policy Evaluation in Contextual Bandits

Link: https://arxiv.org/abs/2509.00648
Authors: Kushagra Chandak,Vincent Liu,Haanvid Lee
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We consider off-policy evaluation (OPE) in contextual bandits with finite action space. Inverse Propensity Score (IPS) weighting is a widely used method for OPE due to its unbiasedness, but it suffers from significant variance when the action space is large or when some parts of the context-action space are underexplored. Recently introduced Marginalized IPS (MIPS) estimators mitigate this issue by leveraging action embeddings. However, these embeddings do not minimize the mean squared error (MSE) of the estimators and do not consider context information. To address these limitations, we introduce Context-Action Embedding Learning for MIPS, or CAEL-MIPS, which learns context-action embeddings from offline data to minimize the MSE of the MIPS estimator. Building on the theoretical analysis of bias and variance of MIPS, we present an MSE-minimizing objective for CAEL-MIPS. In the empirical studies on a synthetic dataset and a real-world dataset, we demonstrate that our estimator outperforms baselines in terms of MSE.
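
To make the IPS weighting mentioned in this abstract concrete, here is a minimal Python sketch of the standard IPS value estimate. It is not the CAEL-MIPS estimator from the paper; the function name `ips_estimate` and the toy logged data are illustrative assumptions.

```python
def ips_estimate(rewards, logging_probs, target_probs):
    """Standard inverse propensity score (IPS) estimate of a target policy's value.

    Index i describes one logged interaction:
      rewards[i]        reward observed for the logged action
      logging_probs[i]  probability the logging policy assigned to that action
      target_probs[i]   probability the target policy assigns to that action
    """
    n = len(rewards)
    if n == 0 or not (n == len(logging_probs) == len(target_probs)):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for reward, p_log, p_tgt in zip(rewards, logging_probs, target_probs):
        if p_log <= 0.0:
            raise ValueError("logging probabilities must be positive")
        # Importance weight: large when the logging policy rarely took this action.
        total += (p_tgt / p_log) * reward
    return total / n


# Hypothetical logged data (four interactions), for illustration only.
rewards = [1.0, 0.0, 1.0, 1.0]
logging_probs = [0.5, 0.5, 0.25, 0.25]
target_probs = [0.5, 0.25, 0.5, 0.25]
estimate = ips_estimate(rewards, logging_probs, target_probs)
```

With these toy numbers the importance weights are 1, 0.5, 2, and 1, so the estimate is 1.0. The weights grow as the logging probability shrinks, which is the source of the variance problem the abstract describes for large or underexplored action spaces.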

[LG-108] Disentangling Slow and Fast Temporal Dynamics in Degradation Inference with Hierarchical Differential Models

Link: https://arxiv.org/abs/2509.00639
Authors: Mengjie Zhao,Olga Fink
Categories: Machine Learning (cs.LG)

Abstract:Reliable inference of system degradation from sensor data is fundamental to condition monitoring and prognostics in engineered systems. Since degradation is rarely observable or measurable, it must be inferred to enable accurate health assessment and decision-making. This is particularly challenging because operational variations dominate system behavior, while degradation introduces only subtle, long-term changes. Consequently, sensor data mainly reflect short-term operational variability, making it difficult to disentangle the underlying degradation process. Residual-based methods are widely employed, but the residuals remain entangled with operational history, often resulting in noisy and unreliable degradation estimation, particularly in systems with dynamic responses. Neural Ordinary Differential Equations (NODEs) offer a promising framework for inferring latent dynamics, but the time-scale separation in slow-fast systems introduces numerical stiffness and complicates training, while degradation disentanglement remains difficult. To address these limitations, we propose a novel Hierarchical Controlled Differential Equation (H-CDE) framework that incorporates a slow (degradation) and a fast (operation) CDE component in a unified architecture. It introduces three key innovations: a multi-scale time integration scheme to mitigate numerical stiffness; a learnable path transformation that extracts latent degradation drivers to control degradation evolution; and a novel activation function that enforces monotonicity on inferred degradation as a regularizer for disentanglement. Through comprehensive evaluations on both dynamic response (e.g., bridges) and steady state (e.g., aero-engine) systems, we demonstrate that H-CDE effectively disentangles degradation from operational dynamics and outperforms residual-based baselines, yielding more accurate, robust, and interpretable inference.

[LG-109] RoFt-Mol: Benchmarking Robust Fine-Tuning with Molecular Graph Foundation Models

Link: https://arxiv.org/abs/2509.00614
Authors: Shikun Liu,Deyu Zou,Nima Shoghi,Victor Fung,Kai Liu,Pan Li
Categories: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)

Abstract:In the era of foundation models, fine-tuning pre-trained models for specific downstream tasks has become crucial. This drives the need for robust fine-tuning methods to address challenges such as model overfitting and sparse labeling. Molecular graph foundation models (MGFMs) face unique difficulties that complicate fine-tuning. These models are limited by smaller pre-training datasets and more severe data scarcity for downstream tasks, both of which require enhanced model generalization. Moreover, MGFMs must accommodate diverse objectives, including both regression and classification tasks. To better understand and improve fine-tuning techniques under these conditions, we classify eight fine-tuning methods into three mechanisms: weight-based, representation-based, and partial fine-tuning. We benchmark these methods on downstream regression and classification tasks across supervised and self-supervised pre-trained models in diverse labeling settings. This extensive evaluation provides valuable insights and informs the design of a refined robust fine-tuning method, ROFT-MOL. This approach combines the strengths of simple post-hoc weight interpolation with more complex weight ensemble fine-tuning methods, delivering improved performance across both task types while maintaining the ease of use inherent in post-hoc weight interpolation.
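
As a rough illustration of the post-hoc weight interpolation this abstract refers to, the sketch below linearly blends pre-trained and fine-tuned parameters. It shows only the generic interpolation step, not ROFT-MOL itself; the function name, the dictionary-of-lists parameter layout, and the toy values are assumptions.

```python
def interpolate_weights(pretrained, finetuned, alpha):
    """Return (1 - alpha) * pretrained + alpha * finetuned for every parameter.

    Both arguments map parameter names to flat lists of floats (an assumed,
    framework-free layout). alpha = 0 keeps the pre-trained model,
    alpha = 1 keeps the fine-tuned one.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if pretrained.keys() != finetuned.keys():
        raise ValueError("both models must have the same parameter names")
    merged = {}
    for name in pretrained:
        p, f = pretrained[name], finetuned[name]
        if len(p) != len(f):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1.0 - alpha) * a + alpha * b for a, b in zip(p, f)]
    return merged


# Hypothetical two-parameter "models", for illustration only.
pretrained = {"w": [0.0, 2.0], "b": [1.0]}
finetuned = {"w": [1.0, 4.0], "b": [3.0]}
merged = interpolate_weights(pretrained, finetuned, alpha=0.5)
```

Because the blend happens after fine-tuning, trying several values of `alpha` costs no extra training, which is the ease of use the abstract attributes to post-hoc interpolation.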

[LG-110] TranCIT: Transient Causal Interaction Toolbox

Link: https://arxiv.org/abs/2509.00602
Authors: Salar Nouri,Kaidi Shao,Shervin Safavi
Categories: Machine Learning (cs.LG)

Abstract:Quantifying transient causal interactions from non-stationary neural signals is a fundamental challenge in neuroscience. Traditional methods are often inadequate for brief neural events, and advanced, event-specific techniques have lacked accessible implementations within the Python ecosystem. Here, we introduce trancit (Transient Causal Interaction Toolbox), an open-source Python package designed to bridge this gap. TranCIT implements a comprehensive analysis pipeline, including Granger Causality, Transfer Entropy, and the more robust Structural Causal Model-based Dynamic Causal Strength (DCS) and relative Dynamic Causal Strength (rDCS) for accurately detecting event-driven causal effects. We demonstrate TranCIT’s utility by successfully capturing causality in high-synchrony regimes where traditional methods fail and by identifying the known transient information flow from hippocampal CA3 to CA1 during sharp-wave ripple events in real-world data. The package offers a user-friendly, validated solution for investigating the transient causal dynamics that govern complex systems.

[LG-111] SQL-of-Thought: Multi-agentic Text-to-SQL with Guided Error Correction

Link: https://arxiv.org/abs/2509.00581
Authors: Saumya Chaturvedi,Aman Chadha,Laurent Bindschaedler
Categories: Databases (cs.DB); Machine Learning (cs.LG)

Abstract:Converting natural language queries into SQL queries is a crucial challenge in both industry and academia, aiming to increase access to databases and large-scale applications. This work examines how in-context learning and chain-of-thought can be utilized to develop a robust solution for text-to-SQL systems. We propose SQL-of-Thought: a multi-agent framework that decomposes the Text2SQL task into schema linking, subproblem identification, query plan generation, SQL generation, and a guided correction loop. Unlike prior systems that rely only on execution-based static correction, we introduce taxonomy-guided dynamic error modification informed by in-context learning. SQL-of-Thought achieves state-of-the-art results on the Spider dataset and its variants, combining guided error taxonomy with reasoning-based query planning.

[LG-112] Learning Dolly-In Filming From Demonstration Using a Ground-Based Robot

Link: https://arxiv.org/abs/2509.00574
Authors: Philip Lorimer,Alan Hunter,Wenbin Li
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: Preprint; under double-anonymous review. 6 pages

Abstract:Cinematic camera control demands a balance of precision and artistry - qualities that are difficult to encode through handcrafted reward functions. While reinforcement learning (RL) has been applied to robotic filmmaking, its reliance on bespoke rewards and extensive tuning limits creative usability. We propose a Learning from Demonstration (LfD) approach using Generative Adversarial Imitation Learning (GAIL) to automate dolly-in shots with a free-roaming, ground-based filming robot. Expert trajectories are collected via joystick teleoperation in simulation, capturing smooth, expressive motion without explicit objective design. Trained exclusively on these demonstrations, our GAIL policy outperforms a PPO baseline in simulation, achieving higher rewards, faster convergence, and lower variance. Crucially, it transfers directly to a real-world robot without fine-tuning, achieving more consistent framing and subject alignment than a prior TD3-based method. These results show that LfD offers a robust, reward-free alternative to RL in cinematic domains, enabling real-time deployment with minimal technical effort. Our pipeline brings intuitive, stylized camera control within reach of creative professionals, bridging the gap between artistic intent and robotic autonomy.

[LG-113] An Efficient GNNs-to-KANs Distillation via Self-Attention Dynamic Sampling with Potential for Consumer Electronics Edge Deployment

Link: https://arxiv.org/abs/2509.00560
Authors: Can Cui,Zilong Fu,Penghe Huang,Yuanyuan Li,Wu Deng,Dongyan Li
Categories: Machine Learning (cs.LG); Performance (cs.PF)

Abstract:Knowledge distillation (KD) is crucial for deploying deep learning models in resource-constrained edge environments, particularly within the consumer electronics sector, including smart home devices, wearable technology, and mobile terminals. These applications place higher demands on model compression and inference speed, necessitating the transfer of knowledge from Graph Neural Networks (GNNs) to more efficient Multi-Layer Perceptron (MLP) models. However, due to their fixed activation functions and fully connected architecture, MLPs face challenges in rapidly capturing the complex neighborhood dependencies learned by GNNs, thereby limiting their performance in edge environments. To address these limitations, this paper introduces an innovative knowledge distillation framework from GNNs to Kolmogorov-Arnold Networks (KANs): Self Attention Dynamic Sampling Distillation (SA-DSD). This study improves the Fourier KAN (FR-KAN) and replaces the MLP with the improved FR-KAN+ as the student model. Through the incorporation of learnable frequency bases and phase-shift mechanisms, along with algorithmic optimization, FR-KAN significantly improves its nonlinear fitting capability while effectively reducing computational complexity. Building on this, a margin-level sampling probability matrix, based on teacher-student prediction consistency, is constructed, and an adaptive weighted loss mechanism is designed to mitigate performance degradation in the student model due to the lack of explicit neighborhood aggregation. Extensive experiments conducted on six real-world datasets demonstrate that SA-DSD achieves performance improvements of 3.05%-3.62% over three GNN teacher models and 15.61% over the FR-KAN+ model. Moreover, when compared with key benchmark models, SA-DSD achieves a 16.96x reduction in parameter count and a 55.75% decrease in inference time.

[LG-114] FedThief: Harming Others to Benefit Oneself in Self-Centered Federated Learning

Link: https://arxiv.org/abs/2509.00540
Authors: Xiangyu Zhang,Mang Ye
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: 12 pages, 5 figures

Abstract:In federated learning, participants’ uploaded model updates cannot be directly verified, leaving the system vulnerable to malicious attacks. Existing attack strategies have adversaries upload tampered model updates to degrade the global model’s performance. However, attackers also degrade their own private models, gaining no advantage. In real-world scenarios, attackers are driven by self-centered motives: their goal is to gain a competitive advantage by developing a model that outperforms those of other participants, not merely to cause disruption. In this paper, we study a novel Self-Centered Federated Learning (SCFL) attack paradigm, in which attackers not only degrade the performance of the global model through attacks but also enhance their own models within the federated learning process. We propose a framework named FedThief, which degrades the performance of the global model by uploading modified content during the upload stage. At the same time, it enhances the private model’s performance through divergence-aware ensemble techniques, where “divergence” quantifies the deviation between private and global models, that integrate global updates and local knowledge. Extensive experiments show that our method effectively degrades the global model performance while allowing the attacker to obtain an ensemble model that significantly outperforms the global model.

[LG-115] MobiAgent: A Systematic Framework for Customizable Mobile Agents

Link: https://arxiv.org/abs/2509.00531
Authors: Cheng Zhang,Erhu Feng,Xi Zhao,Yisheng Zhao,Wangbo Gong,Jiahui Sun,Dong Du,Zhichao Hua,Yubin Xia,Haibo Chen
Categories: Multiagent Systems (cs.MA); Machine Learning (cs.LG)

Abstract:With the rapid advancement of Vision-Language Models (VLMs), GUI-based mobile agents have emerged as a key development direction for intelligent mobile systems. However, existing agent models continue to face significant challenges in real-world task execution, particularly in terms of accuracy and efficiency. To address these limitations, we propose MobiAgent, a comprehensive mobile agent system comprising three core components: the MobiMind-series agent models, the AgentRR acceleration framework, and the MobiFlow benchmarking suite. Furthermore, recognizing that the capabilities of current mobile agents are still limited by the availability of high-quality data, we have developed an AI-assisted agile data collection pipeline that significantly reduces the cost of manual annotation. Compared to both general-purpose LLMs and specialized GUI agent models, MobiAgent achieves state-of-the-art performance in real-world mobile scenarios.

[LG-116] Game Theoretic Resilience Recommendation Framework for CyberPhysical Microgrids Using Hypergraph MetaLearning

Link: https://arxiv.org/abs/2509.00528
Authors: S Krishna Niketh,Prasanta K Panigrahi,V Vignesh,Mayukha Pal
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)

Abstract:This paper presents a physics-aware cyberphysical resilience framework for radial microgrids under coordinated cyberattacks. The proposed approach models the attacker through a hypergraph neural network (HGNN) enhanced with model agnostic metalearning (MAML) to rapidly adapt to evolving defense strategies and predict high-impact contingencies. The defender is modeled via a bi-level Stackelberg game, where the upper level selects optimal tie-line switching and distributed energy resource (DER) dispatch using an Alternating Direction Method of Multipliers (ADMM) coordinator embedded within the Non-dominated Sorting Genetic Algorithm II (NSGA-II). The framework simultaneously optimizes load served, operational cost, and voltage stability, ensuring all post-defense states satisfy network physics constraints. The methodology is first validated on the IEEE 69-bus distribution test system with 12 DERs, 8 critical loads, and 5 tie-lines, and then extended to higher bus systems including the IEEE 123-bus feeder and a synthetic 300-bus distribution system. Results show that the proposed defense strategy restores nearly full service for 90% of top-ranked attacks, mitigates voltage violations, and identifies Feeder 2 as the principal vulnerability corridor. Actionable operating rules are derived, recommending pre-arming of specific tie-lines to enhance resilience, while higher bus system studies confirm scalability of the framework on the IEEE 123-bus and 300-bus systems.

[LG-117] Biological Pathway Informed Models with Graph Attention Networks (GATs)

Link: https://arxiv.org/abs/2509.00524
Authors: Gavin Wong,Ping Shu Ho,Ivan Au Yeung,Ka Chun Cheung,Simon See
Categories: Machine Learning (cs.LG); Molecular Networks (q-bio.MN)
Comments: 5 pages, 3 figures

Abstract:Biological pathways map gene-gene interactions that govern all human processes. Despite their importance, most ML models treat genes as unstructured tokens, discarding known pathway structure. The latest pathway-informed models capture pathway-pathway interactions, but still treat each pathway as a “bag of genes” via MLPs, discarding its topology and gene-gene interactions. We propose a Graph Attention Network (GAT) framework that models pathways at the gene level. We show that GATs generalize much better than MLPs, achieving an 81% reduction in MSE when predicting pathway dynamics under unseen treatment conditions. We further validate the correctness of our biological prior by encoding drug mechanisms via edge interventions, boosting model robustness. Finally, we show that our GAT model is able to correctly rediscover all five gene-gene interactions in the canonical TP53-MDM2-MDM4 feedback loop from raw time-series mRNA data, demonstrating potential to generate novel biological hypotheses directly from experimental data.

[LG-118] Graph Convolutional Network With Pattern-Spatial Interactive and Regional Awareness for Traffic Forecasting

Link: https://arxiv.org/abs/2509.00515
Authors: Xinyu Ji,Chengcheng Yan,Jibiao Yuan,Fiefie Zhao
Categories: Machine Learning (cs.LG)

Abstract:Traffic forecasting is significant for urban traffic management, intelligent route planning, and real-time flow monitoring. Recent advances in spatial-temporal models have markedly improved the modeling of intricate spatial-temporal correlations for traffic forecasting. Unfortunately, most previous studies have encountered challenges in effectively modeling spatial-temporal correlations across various perceptual perspectives, neglecting the interactive fusion between traffic patterns and spatial correlations. Additionally, constrained by spatial heterogeneity, most studies fail to consider distinct regional heterogeneity during message-passing. To overcome these limitations, we propose a Pattern-Spatial Interactive and Regional Awareness Graph Convolutional Network (PSIRAGCN) for traffic forecasting. Specifically, we propose a pattern-spatial interactive fusion framework composed of pattern and spatial modules. This framework aims to capture patterns and spatial correlations by adopting a perception perspective from the global to the local level and facilitating mutual utilization with positive feedback. In the spatial module, we design a graph convolutional network based on message-passing. The network leverages a regional characteristics bank to reconstruct data-driven message-passing with regional awareness. The reconstructed message-passing can reveal the regional heterogeneity between nodes in the traffic network. Extensive experiments on three real-world traffic datasets demonstrate that PSIRAGCN outperforms state-of-the-art baselines while balancing computational costs.

[LG-119] Localizing and Mitigating Memorization in Image Autoregressive Models ICML2025

Link: https://arxiv.org/abs/2509.00488
Authors: Aditya Kasliwal,Franziska Boenisch,Adam Dziedzic
Categories: Machine Learning (cs.LG)
Comments: Accepted at ICML 2025 Workshop on the Impact of Memorization on Trustworthy Foundation Models

Abstract:Image AutoRegressive (IAR) models have achieved state-of-the-art performance in speed and quality of generated images. However, they also raise concerns about memorization of their training data and its implications for privacy. This work explores where and how such memorization occurs within different image autoregressive architectures by measuring memorization at a fine-grained level. The analysis reveals that memorization patterns differ across various architectures of IARs. In hierarchical per-resolution architectures, it tends to emerge early and deepen with resolutions, while in IARs with standard autoregressive per-token prediction, it concentrates in later processing stages. These localized memorization patterns are further connected to IARs’ ability to memorize and leak training data. By intervening on their most memorizing components, we significantly reduce the capacity for data extraction from IARs with minimal impact on the quality of generated images. These findings offer new insights into the internal behavior of image generative models and point toward practical strategies for mitigating privacy risks.

[LG-120] Universal Properties of Activation Sparsity in Modern Large Language Models

Link: https://arxiv.org/abs/2509.00454
Authors: Filip Szatkowski,Patryk Będkowski,Alessio Devoto,Jan Dubiński,Pasquale Minervini,Mikołaj Piórczyński,Simone Scardapane,Bartosz Wójcik
Categories: Machine Learning (cs.LG)

Abstract:Input-dependent activation sparsity is a notable property of deep learning models, which has been extensively studied in networks with ReLU activations and is associated with efficiency, robustness, and interpretability. However, the approaches developed for ReLU-based models depend on exact zero activations and do not transfer directly to modern large language models (LLMs), which have abandoned ReLU in favor of other activation functions. As a result, current work on activation sparsity in LLMs is fragmented, model-specific, and lacks consensus on which components to target. We propose a general framework to assess sparsity robustness and present a systematic study of the phenomenon in the FFN layers of modern LLMs, including diffusion LLMs. Our findings reveal universal patterns of activation sparsity in LLMs, provide insights into this phenomenon, and offer practical guidelines for exploiting it in model design and acceleration.
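
The point about exact zeros can be seen in a small Python sketch: ReLU outputs contain exact zeros, a smooth activation such as SiLU does not, so sparsity has to be induced, for instance by magnitude thresholding. This is a generic illustration rather than the framework proposed in the paper; SiLU as the stand-in activation, the top-fraction rule, and all names below are assumptions.

```python
import math


def relu(x):
    return max(0.0, x)


def silu(x):
    # SiLU (swish): x * sigmoid(x); smooth, so outputs are rarely exactly zero.
    return x / (1.0 + math.exp(-x))


def exact_zero_fraction(values):
    """Fraction of entries that are exactly zero."""
    return sum(1 for v in values if v == 0.0) / len(values)


def keep_top_fraction(values, fraction):
    """Zero out everything except the largest-magnitude `fraction` of entries."""
    k = max(1, round(fraction * len(values)))
    threshold = sorted(abs(v) for v in values)[-k]
    return [v if abs(v) >= threshold else 0.0 for v in values]


# Hypothetical pre-activations of one FFN neuron group, for illustration only.
pre_activations = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
relu_out = [relu(x) for x in pre_activations]
silu_out = [silu(x) for x in pre_activations]

relu_sparsity = exact_zero_fraction(relu_out)
silu_sparsity = exact_zero_fraction(silu_out)
induced_sparsity = exact_zero_fraction(keep_top_fraction(silu_out, 0.5))
```

Here half of the ReLU outputs are exactly zero and none of the SiLU outputs are, while thresholding the SiLU outputs to the top half by magnitude restores 50% sparsity at the cost of an approximation error.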

[LG-121] Memory Limitations of Prompt Tuning in Transformers

Link: https://arxiv.org/abs/2509.00421
Authors: Maxime Meyer,Mario Michelessa,Caroline Chaux,Vincent Y. F. Tan
Categories: Machine Learning (cs.LG)

Abstract:Despite the empirical success of prompt tuning in adapting pretrained language models to new tasks, theoretical analyses of its capabilities remain limited. Existing theoretical work primarily addresses universal approximation properties, demonstrating results comparable to standard weight tuning. In this paper, we explore a different aspect of the theory of transformers: the memorization capability of prompt tuning. We provide two principal theoretical contributions. First, we prove that the amount of information memorized by a transformer cannot scale faster than linearly with the prompt length. Second, and more importantly, we present the first formal proof of a phenomenon empirically observed in large language models: performance degradation in transformers with extended contexts. We rigorously demonstrate that transformers inherently have limited memory, constraining the amount of information they can retain, regardless of the context size. This finding offers a fundamental understanding of the intrinsic limitations of transformer architectures, particularly their ability to handle long sequences.

[LG-122] Lagrangian Relaxation for Multi-Action Partially Observable Restless Bandits: Heuristic Policies and Indexability

Link: https://arxiv.org/abs/2509.00415
Authors: Rahul Meshram,Kesav Kaza
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 13 pages

Abstract:Partially observable restless multi-armed bandits have found numerous applications, including in recommendation systems, communication systems, public healthcare outreach systems, and operations research. We study multi-action partially observable restless multi-armed bandits, a generalization of the classical restless multi-armed bandit problem in which (1) each bandit has finitely many states and the current state is not observable, and (2) each bandit has finitely many actions. In particular, we assume that more than two actions are available for each bandit. We motivate our problem with the application of public-health intervention planning. We describe the model and formulate a long-term discounted optimization problem, where the state of each bandit evolves according to a Markov process, and this evolution is action dependent. The state of a bandit is not observable, but one of finitely many feedback signals is observable. Each bandit yields a reward based on the action taken on that bandit. The agent is assumed to have a budget constraint. The bandits are assumed to be independent; however, they are weakly coupled at the agent through the budget constraint. We first analyze the Lagrangian bound method for our partially observable restless bandits. The computation of optimal value functions for finite-state, finite-action POMDPs is non-trivial. Hence, the computation of Lagrangian bounds is also challenging. We describe approximations for the computation of Lagrangian bounds using point-based value iteration (PBVI) and an online rollout policy. We further present various properties of the value functions and provide theoretical insights on PBVI and the online rollout policy. We study heuristic policies for multi-action PORMAB. Finally, we discuss Whittle index policies and their limitations in our model.

[LG-123] Metis: Training Large Language Models with Advanced Low-Bit Quantization

Link: https://arxiv.org/abs/2509.00404
Authors: Hengjie Cao,Mengyi Chen,Yifeng Yang,Ruijun Huang,Fang Dong,Jixian Zhou,Anrui Chen,Mingzhi Dong,Yujiang Wang,Jinlong Hou,Yuan Cheng,Fan Wu,Fan Yang,Tun Lu,Ning Gu,Li Shang
Categories: Machine Learning (cs.LG)

Abstract:This work identifies anisotropic parameter distributions as a fundamental barrier to training large language models (LLMs) with low-bit quantization: a few dominant singular values create wide numerical ranges that conflict with the inherent bias of block-wise quantization. This bias disproportionately preserves high-magnitude values while discarding smaller ones, causing training instability and low model performance. This work introduces Metis, a training framework that combines (i) spectral decomposition with random embedding to efficiently disentangle dominant from long-tail components, compressing broad distributions into quantization-friendly narrow ranges; (ii) adaptive learning rates in the spectral domain to amplify underrepresented directions and better capture diverse features critical for performance; and (iii) a dual-range regularizer that jointly constrains numerical precision and parameter range distribution, ensuring stable, unbiased low-bit training. With Metis, FP8 training surpasses FP32 baselines, and FP4 training achieves accuracy comparable to FP32, paving the way for robust and scalable LLM training under advanced low-bit quantization. The code implementation for Metis is available at: this https URL.
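
The bias of block-wise quantization described in this abstract can be illustrated with a short Python sketch: when one block shares a single scale, a dominant value stretches the grid and the smaller values round to zero. The sketch uses a uniform symmetric integer grid as a simplifying assumption (real FP8/FP4 formats are floating-point), and the function name and values are hypothetical; it does not implement Metis.

```python
def quantize_block(block, num_levels=7):
    """Symmetric absmax quantize-dequantize of one block.

    All values in the block share one scale, chosen so that the largest
    magnitude maps to `num_levels`; each value is rounded to the nearest
    integer step and mapped back to floating point.
    """
    scale = max(abs(x) for x in block) / num_levels
    if scale == 0.0:
        return [0.0 for _ in block]
    return [round(x / scale) * scale for x in block]


# One dominant value stretches the shared scale ...
with_outlier = quantize_block([8.0, 0.3, -0.2, 0.1])
# ... whereas the same small values survive when they get their own scale.
without_outlier = quantize_block([0.3, -0.2, 0.1])
```

In the first block the step size is about 1.14, so 0.3, -0.2, and 0.1 all collapse to zero while 8.0 is preserved; in the second block the step is about 0.043 and every value keeps a nonzero representation. Narrowing the range of values inside a block, which the abstract lists as one goal of the spectral decomposition, reduces this effect.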

[LG-124] Optimized Weight Initialization on the Stiefel Manifold for Deep ReLU Neural Networks

Link: https://arxiv.org/abs/2509.00362
Authors: Hyungu Lee,Taehyeong Kim,Hayoung Choi
Categories: Machine Learning (cs.LG)
Comments: 16 pages, 3 figures, 3 tables

Abstract:Stable and efficient training of ReLU networks with large depth is highly sensitive to weight initialization. Improper initialization can cause permanent neuron inactivation (the dying ReLU problem) and exacerbate gradient instability as network depth increases. Methods such as He, Xavier, and orthogonal initialization preserve variance or promote approximate isometry. However, they do not necessarily regulate the pre-activation mean or control activation sparsity, and their effectiveness often diminishes in very deep architectures. This work introduces an orthogonal initialization specifically optimized for ReLU by solving an optimization problem on the Stiefel manifold, thereby preserving scale and calibrating the pre-activation statistics from the outset. A family of closed-form solutions and an efficient sampling scheme are derived. Theoretical analysis at initialization shows prevention of the dying ReLU problem, slower decay of activation variance, and mitigation of gradient vanishing, which together stabilize signal and gradient flow in deep architectures. Empirically, across MNIST, Fashion-MNIST, multiple tabular datasets, few-shot settings, and ReLU-family activations, our method outperforms previous initializations and enables stable training in deep networks.
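
For context, the plain orthogonal initialization that this abstract lists as an existing baseline can be sketched in a few lines of Python. This is not the Stiefel-manifold-optimized scheme the paper proposes; the Gram-Schmidt construction, the function name `orthogonal_init`, and the seed are illustrative assumptions.

```python
import random


def orthogonal_init(n, seed=0):
    """Sample an n x n matrix with orthonormal rows.

    Rows are Gaussian vectors orthogonalized with modified Gram-Schmidt,
    so the result lies on the Stiefel manifold of square orthogonal matrices.
    """
    rng = random.Random(seed)
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for q in rows:
            # Remove the component of v along each already accepted row.
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-8:  # resample in the unlikely degenerate case
            rows.append([a / norm for a in v])
    return rows


def matvec(matrix, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


def norm(x):
    return sum(a * a for a in x) ** 0.5


W = orthogonal_init(4)
x = [1.0, -2.0, 0.5, 3.0]
y = matvec(W, x)  # pre-activation of a linear layer initialized with W
```

An orthogonal `W` preserves the norm of `x`, which is the isometry property the abstract mentions. It does not by itself control the pre-activation mean or how many ReLU units start inactive, which is the gap the paper targets.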

[LG-125] Solving Optimal Power Flow using a Variational Quantum Approach

Link: https://arxiv.org/abs/2509.00341
Authors: Thinh Viet Le,Mark M. Wilde,Vassilis Kekatos
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC); Quantum Physics (quant-ph)
Comments: 22 pages, 7 figures, 2 tables

Abstract:The optimal power flow (OPF) is a large-scale optimization problem that is central in the operation of electric power systems. Although it can be posed as a nonconvex quadratically constrained quadratic program, the complexity of modern-day power grids raises scalability and optimality challenges. In this context, this work proposes a variational quantum paradigm for solving the OPF. We encode primal variables through the state of a parameterized quantum circuit (PQC), and dual variables through the probability mass function associated with a second PQC. The Lagrangian function can thus be expressed as scaled expectations of quantum observables. An OPF solution can be found by minimizing/maximizing the Lagrangian over the parameters of the first/second PQC. We pursue saddle points of the Lagrangian in a hybrid fashion. Gradients of the Lagrangian are estimated using the two PQCs, while PQC parameters are updated classically using a primal-dual method. We propose permuting primal variables so that OPF observables are expressed in a banded form, allowing them to be measured efficiently. Numerical tests on the IEEE 57-node power system using PennyLane’s simulator corroborate that the proposed doubly variational quantum framework can find high-quality OPF solutions. Although showcased for the OPF, this framework features a broader scope, including conic programs with numerous variables and constraints, problems defined over sparse graphs, and training quantum machine learning models to satisfy constraints.

[LG-126] Are We Really Learning the Score Function? Reinterpreting Diffusion Models Through Wasserstein Gradient Flow Matching

Link: https://arxiv.org/abs/2509.00336
Authors: An B. Vuong, Michael T. McCann, Javier E. Santos, Yen Ting Lin
Categories: Machine Learning (cs.LG)

Abstract:Diffusion models are commonly interpreted as learning the score function, i.e., the gradient of the log-density of noisy data. However, this assumption implies that the target of learning is a conservative vector field, which is not enforced by the neural network architectures used in practice. We present numerical evidence that trained diffusion networks violate both integral and differential constraints required of true score functions, demonstrating that the learned vector fields are not conservative. Despite this, the models perform remarkably well as generative mechanisms. To explain this apparent paradox, we advocate a new theoretical perspective: diffusion training is better understood as flow matching to the velocity field of a Wasserstein Gradient Flow (WGF), rather than as score learning for a reverse-time stochastic differential equation. Under this view, the “probability flow” arises naturally from the WGF framework, eliminating the need to invoke reverse-time SDE theory and clarifying why generative sampling remains successful even when the neural vector field is not a true score. We further show that non-conservative errors from neural approximation do not necessarily harm density transport. Our results advocate for adopting the WGF perspective as a principled, elegant, and theoretically grounded framework for understanding diffusion generative models.
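A true score function is the gradient of a scalar log-density, so its Jacobian must be symmetric. That is one form of the differential constraint the abstract mentions. The sketch below checks this symmetry with finite differences on two hand-written 2D fields. Both fields and the helper name are illustrative assumptions, not the trained networks or the tests used in the paper.

```python
def jacobian_asymmetry(field, x, y, h=1e-5):
    """Return |d f_x / d y - d f_y / d x| at (x, y) via central differences.

    For a conservative (gradient) field in 2D this quantity is zero.
    """
    dfx_dy = (field(x, y + h)[0] - field(x, y - h)[0]) / (2.0 * h)
    dfy_dx = (field(x + h, y)[1] - field(x - h, y)[1]) / (2.0 * h)
    return abs(dfx_dy - dfy_dx)


def gaussian_score(x, y):
    # Score of a standard 2D normal: grad log p(x, y) = (-x, -y).
    return (-x, -y)


def rotated_field(x, y):
    # The same score plus a rotational component, so it is not a gradient.
    return (-x + 0.5 * y, -y - 0.5 * x)


print(jacobian_asymmetry(gaussian_score, 0.3, -0.7))
print(jacobian_asymmetry(rotated_field, 0.3, -0.7))
```

The first field passes the check and the second violates it by a constant amount. A learned network would be probed the same way, with `field` replaced by the model's output at a fixed noise level.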

[LG-127] Counterfactual Risk Minimization with IPS-Weighted BPR and Self-Normalized Evaluation in Recommender Systems RECSYS25

Link: https://arxiv.org/abs/2509.00333
Authors: Rahul Raja, Arpita Vats
Categories: Machine Learning (cs.LG)
Comments: Accepted at the Causality, Counterfactuals & Sequential Decision-Making Workshop (CONSEQUENCES) at the ACM Recommender Systems Conference (RecSys 25), Prague, Czech Republic

Abstract:Learning and evaluating recommender systems from logged implicit feedback is challenging due to exposure bias. While inverse propensity scoring (IPS) corrects this bias, it often suffers from high variance and instability. In this paper, we present a simple and effective pipeline that integrates IPS-weighted training with an IPS-weighted Bayesian Personalized Ranking (BPR) objective augmented by a Propensity Regularizer (PR). We compare Direct Method (DM), IPS, and Self-Normalized IPS (SNIPS) for offline policy evaluation, and demonstrate how IPS-weighted training improves model robustness under biased exposure. The proposed PR further mitigates variance amplification from extreme propensity weights, leading to more stable estimates. Experiments on synthetic and MovieLens 100K data show that our approach generalizes better under unbiased exposure while reducing evaluation variance compared to naive and standard IPS methods, offering practical guidance for counterfactual learning and evaluation in real-world recommendation settings.
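To make the IPS versus SNIPS comparison concrete, here is a minimal sketch of both off-policy estimators. The logged rewards and propensities are made-up numbers chosen to include one extreme weight. The paper's BPR objective and Propensity Regularizer are not reproduced here.

```python
def importance_weights(target_probs, logging_probs):
    """Ratio of target-policy to logging-policy probability per logged event."""
    return [t / l for t, l in zip(target_probs, logging_probs)]


def ips_estimate(rewards, target_probs, logging_probs):
    """Plain IPS: importance-weighted rewards averaged over the number of logs."""
    weights = importance_weights(target_probs, logging_probs)
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


def snips_estimate(rewards, target_probs, logging_probs):
    """Self-normalized IPS: normalize by the sum of weights instead of n."""
    weights = importance_weights(target_probs, logging_probs)
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)


# Hypothetical logged data: the first event has a very small logging propensity.
rewards = [1.0, 0.0, 1.0, 1.0]
target_probs = [0.5, 0.5, 0.5, 0.5]
logging_probs = [0.05, 0.5, 0.5, 0.5]

ips = ips_estimate(rewards, target_probs, logging_probs)
snips = snips_estimate(rewards, target_probs, logging_probs)
print(ips, snips)
```

With binary rewards, the single large weight pushes the IPS estimate above 1, which is outside the range of any achievable average reward. SNIPS stays inside that range because it is a weighted average, at the cost of a small bias.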

[LG-128] Mechanistic interpretability for steering vision-language-action models

Link: https://arxiv.org/abs/2509.00328
Authors: Bear Häon, Kaylene Stocking, Ian Chuang, Claire Tomlin
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: CoRL 2025. Project website: this https URL

Abstract:Vision-Language-Action (VLA) models are a promising path to realizing generalist embodied agents that can quickly adapt to new tasks, modalities, and environments. However, methods for interpreting and steering VLAs fall far short of classical robotics pipelines, which are grounded in explicit models of kinematics, dynamics, and control. This lack of mechanistic insight is a central challenge for deploying learned policies in real-world robotics, where robustness and explainability are critical. Motivated by advances in mechanistic interpretability for large language models, we introduce the first framework for interpreting and steering VLAs via their internal representations, enabling direct intervention in model behavior at inference time. We project feedforward activations within transformer layers onto the token embedding basis, identifying sparse semantic directions - such as speed and direction - that are causally linked to action selection. Leveraging these findings, we introduce a general-purpose activation steering method that modulates behavior in real time, without fine-tuning, reward signals, or environment interaction. We evaluate this method on two recent open-source VLAs, Pi0 and OpenVLA, and demonstrate zero-shot behavioral control in simulation (LIBERO) and on a physical robot (UR5). This work demonstrates that interpretable components of embodied VLAs can be systematically harnessed for control - establishing a new paradigm for transparent and steerable foundation models in robotics.

[LG-129] Chunked TabPFN: Exact Training-Free In-Context Learning for Long-Context Tabular Data

Link: https://arxiv.org/abs/2509.00326
Authors: Renat Sergazinov, Shao-An Yin
Categories: Machine Learning (cs.LG)
Comments: 14 pages, 6 figures

Abstract:TabPFN v2 achieves better results than tree-based models on several tabular benchmarks, which is notable since tree-based models are usually the strongest choice for tabular data. However, it cannot handle more than 10K context tokens because transformers have quadratic computation and memory costs. Unlike existing approaches that rely on context compression, such as selecting representative samples via K-nearest neighbors (KNN), we introduce a tiled-block strategy to compute attention within the TabPFN framework. This design is compatible with standard GPU setups and, to the best of our knowledge, is the first to enable TabPFN to process long contexts without any pre-processing. We demonstrate the effectiveness of our approach on the standard TabArena benchmark.
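One generic way to compute attention block by block without changing the result is the online-softmax recurrence, sketched below in plain Python for a single query vector. It is an assumption that this matches the paper's tiled-block strategy in spirit. The function names, chunk size, and random test data are illustrative only.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(q, keys, values):
    """Reference: softmax(q . K^T / sqrt(d)) V computed over all keys at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, values)) / total
            for j in range(len(values[0]))]


def chunked_attention(q, keys, values, chunk_size):
    """Same result, but only one chunk of keys/values is touched at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), chunk_size):
        k_chunk = keys[start:start + chunk_size]
        v_chunk = values[start:start + chunk_size]
        scores = [dot(q, k) * scale for k in k_chunk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial sums to the new maximum (0.0 on the first chunk).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_chunk):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(len(acc)):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]


rng = random.Random(0)
keys = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(37)]
values = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(37)]
query = [rng.gauss(0, 1) for _ in range(4)]

reference = full_attention(query, keys, values)
tiled = chunked_attention(query, keys, values, chunk_size=8)
print(max(abs(a - b) for a, b in zip(reference, tiled)))
```

Because the running maximum and normalizer are carried across chunks, the output agrees with full attention up to floating-point rounding, while peak memory depends on the chunk size rather than the context length.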

[LG-130] Illuminating Patterns of Divergence: DataDios SmartDiff for Large-Scale Data Difference Analysis

Link: https://arxiv.org/abs/2509.00293
Authors: Aryan Poduri, Yashwant Tailor
Categories: Databases (cs.DB); Machine Learning (cs.LG)
Comments: 10 pages, 4 figures

Abstract:Data engineering workflows require reliable differencing across files, databases, and query outputs, yet existing tools falter under schema drift, heterogeneous types, and limited explainability. SmartDiff is a unified system that combines schema-aware mapping, type-specific comparators, and parallel execution. It aligns evolving schemas, compares structured and semi-structured data (strings, numbers, dates, JSON/XML), and clusters results with labels that explain how and why differences occur. On multi-million-row datasets, SmartDiff achieves over 95 percent precision and recall, runs 30 to 40 percent faster, and uses 30 to 50 percent less memory than baselines; in user studies, it reduces root-cause analysis time from 10 hours to 12 minutes. An LLM-assisted labeling pipeline produces deterministic, schema-valid multilabel explanations using retrieval augmentation and constrained decoding; ablations show further gains in label accuracy and time to diagnosis over rules-only baselines. These results indicate SmartDiff’s utility for migration validation, regression testing, compliance auditing, and continuous data quality monitoring. Index Terms: data differencing, schema evolution, data quality, parallel processing, clustering, explainable validation, big data

[LG-131] ReLATE: Learning Efficient Sparse Encoding for High-Performance Tensor Decomposition

Link: https://arxiv.org/abs/2509.00280
Authors: Ahmed E. Helal, Fabio Checconi, Jan Laukemann, Yongseok Soh, Jesmin Jahan Tithi, Fabrizio Petrini, Jee Choi
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)

Abstract:Tensor decomposition (TD) is essential for analyzing high-dimensional sparse data, yet its irregular computations and memory-access patterns pose major performance challenges on modern parallel processors. Prior works rely on expert-designed sparse tensor formats that fail to adapt to irregular tensor shapes and/or highly variable data distributions. We present the reinforcement-learned adaptive tensor encoding (ReLATE) framework, a novel learning-augmented method that automatically constructs efficient sparse tensor representations without labeled training samples. ReLATE employs an autonomous agent that discovers optimized tensor encodings through direct interaction with the TD environment, leveraging a hybrid model-free and model-based algorithm to learn from both real and imagined actions. Moreover, ReLATE introduces rule-driven action masking and dynamics-informed action filtering mechanisms that ensure functionally correct tensor encoding with bounded execution time, even during early learning stages. By automatically adapting to both irregular tensor shapes and data distributions, ReLATE generates sparse tensor representations that consistently outperform expert-designed formats across diverse sparse tensor data sets, achieving up to 2X speedup compared to the best sparse format, with a geometric-mean speedup of 1.4-1.46X.

[LG-132] Quantum-Optimized Selective State Space Model for Efficient Time Series Prediction

Link: https://arxiv.org/abs/2509.00259
Authors: Stefan-Alexandru Jura, Mihai Udrescu, Alexandru Topirceanu
Categories: Machine Learning (cs.LG)

Abstract:Long-range time series forecasting remains challenging, as it requires capturing non-stationary and multi-scale temporal dependencies while maintaining noise robustness, efficiency, and stability. Transformer-based architectures such as Autoformer and Informer improve generalization but suffer from quadratic complexity and degraded performance on very long time horizons. State space models, notably S-Mamba, provide linear-time updates but often face unstable training dynamics, sensitivity to initialization, and limited robustness for multivariate forecasting. To address such challenges, we propose the Quantum-Optimized Selective State Space Model (Q-SSM), a hybrid quantum-optimized approach that integrates state space dynamics with a variational quantum gate. Instead of relying on expensive attention mechanisms, Q-SSM employs a simple parametrized quantum circuit (RY-RX ansatz) whose expectation values regulate memory updates adaptively. This quantum gating mechanism improves convergence stability, enhances the modeling of long-term dependencies, and provides a lightweight alternative to attention. We empirically validate Q-SSM on three widely used benchmarks, i.e., ETT, Traffic, and Exchange Rate. Results show that Q-SSM consistently improves over strong baselines (LSTM, TCN, Reformer), Transformer-based models, and S-Mamba. These findings demonstrate that variational quantum gating can address current limitations in long-range forecasting, leading to accurate and robust multivariate predictions.

[LG-133] Speech Foundation Models Generalize to Time Series Tasks from Wearable Sensor Data

Link: https://arxiv.org/abs/2509.00221
Authors: Jaya Narain, Zakaria Aldeneh, Shirley Ren
Categories: Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Preprint, under review

Abstract:Both speech and sensor time series data encode information in the time and frequency domains, such as spectral powers and waveform shapelets. We show that speech foundation models learn representations that are domain-independent and achieve state-of-the-art performance on time series tasks from wearable sensors. Probes trained on features extracted from HuBERT and wav2vec 2.0 outperform those extracted from self-supervised models trained directly on modality-specific datasets for mood classification, arrhythmia detection, and activity classification tasks. We find a particularly strong relevance of the convolutional feature encoders from speech models for wearable sensor tasks. The methods proposed here improve performance and robustness for data-scarce time series tasks, using simple probing methods. This work is a step towards generalized time series models for speech and sensor data, a topic for further exploration.

[LG-134] Learning to Shard: RL for Co-optimizing the Parallelism Degrees and Per-operator Sharding Dimensions in Distributed LLM Inference

Link: https://arxiv.org/abs/2509.00217
Authors: Ruokai Yin, Sattwik Deb Mishra, Xuan Zuo, Hokchhay Tann, Preyas Shah, Apala Guha
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:Distributed LLM inference requires careful coordination of parallelization strategies across hundreds to thousands of NPUs to meet production SLOs. Current systems like Megatron-LM rely on static heuristics that separately configure parallelism degrees and per-operator sharding dimensions, leaving significant performance on the table as models scale and hardware topologies diversify. We introduce Learn to Shard, to our knowledge, the first RL-based approach to co-optimize both coarse-grained parallelism degrees and fine-grained per-operator sharding dimensions for distributed LLM inference. Our method employs an attention-based policy over an elite history that learns from high-performing strategies to efficiently navigate the vast combinatorial search space. Evaluated on H100 clusters with MoE models up to 1.6T parameters, Learn to Shard achieves up to 3.5x throughput improvement over metaheuristic baselines and 1.06x over Megatron heuristics.

[LG-135] WoSNN: Stochastic Solver for PDEs with Machine Learning

Link: https://arxiv.org/abs/2509.00204
Authors: Silei Song, Arash Fahim, Michael Mascagni
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG); Probability (math.PR)

Abstract:Solving elliptic partial differential equations (PDEs) is a fundamental step in various scientific and engineering studies. As a classic stochastic solver, the Walk-on-Spheres (WoS) method is a well-established and efficient algorithm that provides accurate local estimates for PDEs. In this paper, by integrating machine learning techniques with WoS and space discretization approaches, we develop a novel stochastic solver, WoS-NN. This new method solves elliptic problems with Dirichlet boundary conditions, facilitating precise and rapid global solutions and gradient approximations. The method inherits excellent characteristics from the original WoS method, such as being meshless and robust to irregular regions. By integrating neural networks, WoS-NN also gives instant local predictions after training without re-sampling, which is especially suitable for intense requests on a static region. A typical experimental result demonstrates that the proposed WoS-NN method provides accurate field estimations, reducing errors by around 75% while using only 8% of path samples compared to the conventional WoS method, which saves abundant computational time and resource consumption.
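For readers unfamiliar with the baseline, the classic Walk-on-Spheres estimator for the Laplace equation with Dirichlet data can be sketched in a few lines. The example below uses the unit disk and boundary data g(x, y) = x, whose harmonic extension is u(x, y) = x, so the estimate can be sanity-checked. The domain, tolerance, and walk count are illustrative assumptions, and the neural-network component of WoS-NN is not shown.

```python
import math
import random


def walk_on_spheres(x, y, boundary_value, n_walks=4000, eps=1e-3, seed=0):
    """Estimate u(x, y) for the Laplace equation on the unit disk.

    Each walk repeatedly jumps to a uniformly random point on the largest
    circle centred at the current point that fits inside the domain, and
    stops once it is within eps of the boundary.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_walks):
        px, py = x, y
        while True:
            dist = 1.0 - math.hypot(px, py)  # distance to the unit circle
            if dist < eps:
                break
            theta = rng.uniform(0.0, 2.0 * math.pi)
            px += dist * math.cos(theta)
            py += dist * math.sin(theta)
        r = math.hypot(px, py)
        # Project onto the boundary and read off the Dirichlet data there.
        total += boundary_value(px / r, py / r)
    return total / n_walks


estimate = walk_on_spheres(0.3, 0.2, lambda bx, by: bx)
print(estimate)
```

The method needs no mesh, only a distance-to-boundary function, which is why it handles irregular regions well. Every new query point needs fresh walks, though, and that is the cost the trained network in WoS-NN is meant to avoid.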

[LG-136] Estimating Parameter Fields in Multi-Physics PDEs from Scarce Measurements

Link: https://arxiv.org/abs/2509.00203
Authors: Xuyang Li, Mahdi Masmoudi, Rami Gharbi, Nizar Lajnef, Vishnu Naresh Boddeti
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)

Abstract:Parameterized partial differential equations (PDEs) underpin the mathematical modeling of complex systems in diverse domains, including engineering, healthcare, and physics. A central challenge in using PDEs for real-world applications is to accurately infer the parameters, particularly when the parameters exhibit non-linear and spatiotemporal variations. Existing parameter estimation methods, such as sparse identification and physics-informed neural networks (PINNs), struggle in such cases, especially with nonlinear dynamics, multiphysics interactions, or limited observations of the system response. To address these challenges, we introduce Neptune, a general-purpose method capable of inferring parameter fields from sparse measurements of system responses. Neptune employs independent coordinate neural networks to continuously represent each parameter field in physical space or in state variables. Across various physical and biomedical problems, where direct parameter measurements are prohibitively expensive or unattainable, Neptune significantly outperforms existing methods, achieving robust parameter estimation from as few as 50 observations, reducing parameter estimation errors by two orders of magnitude and dynamic response prediction errors by a factor of ten compared to PINNs. Furthermore, Neptune exhibits superior extrapolation capabilities, enabling accurate predictions in regimes beyond training data where PINNs fail. By facilitating reliable and data-efficient parameter inference, Neptune promises broad transformative impacts in engineering, healthcare, and beyond.

[LG-137] From TLinFormer to TConstFormer: The Leap to Constant-Time Transformer Attention: Achieving O(1) Computation and O(1) KV Cache during Autoregressive Inference

Link: https://arxiv.org/abs/2509.00202
Authors: Zhongpan Tang
Categories: Machine Learning (cs.LG)

Abstract:Although the Transformer has become the cornerstone of modern AI, its autoregressive inference suffers from a linearly growing KV Cache and a computational complexity of O(N^2 d), severely hindering its ability to process ultra-long sequences. To overcome this limitation, this paper introduces the TConstFormer architecture, building upon our previous work, TLinFormer. TConstFormer employs an innovative periodic state update mechanism to achieve a truly constant-size O(1) KV Cache. The computational complexity of this mechanism is also O(1) in an amortized sense: it performs purely constant-time computations for k-1 consecutive steps (e.g., k=256) and executes a single linear-time global information synchronization only on the k-th step. Theoretical calculations and experimental results demonstrate that TConstFormer exhibits an overwhelming advantage over baseline models in terms of speed, memory efficiency, and overall performance on long-text inference tasks. This breakthrough paves the way for efficient and robust streaming language model applications.

[LG-138] Algorithm Adaptation Bias in Recommendation System Online Experiments

Link: https://arxiv.org/abs/2509.00199
Authors: Chen Zheng, Zhenyu Zhao
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:Online experiments (A/B tests) are widely regarded as the gold standard for evaluating recommender system variants and guiding launch decisions. However, a variety of biases can distort the results of the experiment and mislead decision-making. An underexplored but critical bias is the algorithm adaptation effect. This bias arises from the flywheel dynamics among production models, user data, and training pipelines: new models are evaluated on user data whose distributions are shaped by the incumbent system or tested only in a small treatment group. As a result, the measured effect of a new product change in modeling and user experience in this constrained experimental setting can diverge substantially from its true impact in full deployment. In practice, the experiment results often favor the production variant with large traffic while underestimating the performance of the test variant with small traffic, which leads to missing opportunities to launch a true winning arm or underestimating the impact. This paper aims to raise awareness of algorithm adaptation bias, situate it within the broader landscape of RecSys evaluation biases, and motivate discussion of solutions that span experiment design, measurement, and adjustment. We detail the mechanisms of this bias, present empirical evidence from real-world experiments, and discuss potential methods for a more robust online evaluation.

[LG-139] Democratizing Agentic AI with Fast Test-Time Scaling on the Edge

Link: https://arxiv.org/abs/2509.00195
Authors: Hao Mark Chen, Zhiwen Mo, Guanxi Lu, Shuang Liang, Lingxiao Ma, Wayne Luk, Hongxiang Fan
Categories: Machine Learning (cs.LG)

Abstract:Deploying agentic AI on edge devices is crucial for privacy and responsiveness, but memory constraints typically relegate these systems to smaller Large Language Models (LLMs) with inferior reasoning capabilities. Test-Time Scaling (TTS) can bridge this reasoning gap by dedicating more compute during inference, but existing methods incur prohibitive overhead on edge hardware. To overcome this, we introduce FlashTTS, a serving system that makes TTS practical for memory-constrained LLM reasoning. FlashTTS introduces three synergistic optimizations: (i) Speculative Beam Extension to mitigate system stragglers from irregular reasoning paths; (ii) Asymmetric Multi-Model Memory Allocation to dynamically balance memory between generation and verification; and (iii) Dynamic Prefix-Aware Scheduling to maximize KV-cache reuse. Built as a plug-and-play library for vLLM, FlashTTS enables edge LLMs on a single consumer GPU (24 GB) to match the accuracy and latency of large cloud models. Our evaluation demonstrates that FlashTTS achieves an average 2.2x higher goodput and reduces latency by 38%-68% compared to a vLLM baseline, paving the way for democratized, high-performance agentic AI on edge devices.

[LG-140] FNODE: Flow-Matching for data-driven simulation of constrained multibody systems

Link: https://arxiv.org/abs/2509.00183
Authors: Hongyu Wang, Jingquan Wang, Dan Negrut
Categories: Machine Learning (cs.LG)
Comments: 36 pages, 19 figures

Abstract:Data-driven modeling of constrained multibody systems faces two persistent challenges: high computational cost and limited long-term prediction accuracy. To address these issues, we introduce the Flow-Matching Neural Ordinary Differential Equation (FNODE), a framework that learns acceleration vector fields directly from trajectory data. By reformulating the training objective to supervise accelerations rather than integrated states, FNODE eliminates the need for backpropagation through an ODE solver, which represents a bottleneck in traditional Neural ODEs. Acceleration targets are computed efficiently using numerical differentiation techniques, including a hybrid Fast Fourier Transform (FFT) and Finite Difference (FD) scheme. We evaluate FNODE on a diverse set of benchmarks, including the single and triple mass-spring-damper systems, double pendulum, slider-crank, and cart-pole. Across all cases, FNODE consistently outperforms existing approaches such as Multi-Body Dynamic Neural ODE (MBD-NODE), Long Short-Term Memory (LSTM) networks, and Fully Connected Neural Networks (FCNN), demonstrating good accuracy, generalization, and computational efficiency.
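The core idea of supervising accelerations requires acceleration targets derived from sampled trajectories. The sketch below shows only the finite-difference half of that step, using a second-order central difference on a sine trajectory whose true acceleration is known. The FFT part of the hybrid scheme is omitted, and the trajectory, step size, and function name are illustrative assumptions.

```python
import math


def central_difference_acceleration(positions, dt):
    """Second-order central difference for the interior samples.

    a_i ~ (x_{i+1} - 2 x_i + x_{i-1}) / dt^2; the two endpoints are skipped.
    """
    return [(positions[i + 1] - 2.0 * positions[i] + positions[i - 1]) / dt ** 2
            for i in range(1, len(positions) - 1)]


dt = 0.01
times = [i * dt for i in range(200)]
positions = [math.sin(t) for t in times]  # true acceleration is -sin(t)

accelerations = central_difference_acceleration(positions, dt)
max_error = max(abs(a + math.sin(t))
                for a, t in zip(accelerations, times[1:-1]))
print(max_error)
```

A network trained to regress these targets from the state never has to be integrated during training, which is how the abstract's claim of avoiding backpropagation through an ODE solver comes about.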

[LG-141] Newton-Flow Particle Filters based on Generalized Cramér Distance

Link: https://arxiv.org/abs/2509.00182
Authors: Uwe D. Hanebeck
Categories: Information Theory (cs.IT); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 8 pages

Abstract:We propose a recursive particle filter for high-dimensional problems that inherently never degenerates. The state estimate is represented by deterministic low-discrepancy particle sets. We focus on the measurement update step, where a likelihood function is used for representing the measurement and its uncertainty. This likelihood is progressively introduced into the filtering procedure by homotopy continuation over an artificial time. A generalized Cramér distance between particle sets is derived in closed form that is differentiable and invariant to particle order. A Newton flow then continually minimizes this distance over artificial time and thus smoothly moves particles from prior to posterior density. The new filter is surprisingly simple to implement and very efficient. It just requires a prior particle set and a likelihood function, never estimates densities from samples, and can be used as a plugin replacement for classic approaches.

[LG-142] Playing Markov Games Without Observing Payoffs

Link: https://arxiv.org/abs/2509.00179
Authors: Daniel Ablin, Alon Cohen
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)

Abstract:Optimization under uncertainty is a fundamental problem in learning and decision-making, particularly in multi-agent systems. Previously, Feldman, Kalai, and Tennenholtz [2010] demonstrated the ability to efficiently compete in repeated symmetric two-player matrix games without observing payoffs, as long as the opponent's actions are observed. In this paper, we introduce and formalize a new class of zero-sum symmetric Markov games, which extends the notion of symmetry from matrix games to the Markovian setting. We show that even without observing payoffs, a player who knows the transition dynamics and observes only the opponent's sequence of actions can still compete against an adversary who may have complete knowledge of the game. We formalize three distinct notions of symmetry in this setting and show that, under these conditions, the learning problem can be reduced to an instance of online learning, enabling the player to asymptotically match the return of the opponent despite lacking payoff observations. Our algorithms apply to both matrix and Markov games, and run in polynomial time with respect to the size of the game and the number of episodes. Our work broadens the class of games in which robust learning is possible under severe informational disadvantage and deepens the connection between online learning and adversarial game theory.

[LG-143] Bias Mitigation for AI-Feedback Loops in Recommender Systems: A Systematic Literature Review and Taxonomy RECSYS2025

Link: https://arxiv.org/abs/2509.00109
Authors: Theodor Stoecker, Samed Bayer, Ingo Weber
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 11 pages, 6 figures, 2 tables. Accepted at the FAccTRec '25 Workshop, ACM RecSys 2025 (Prague)

Abstract:Recommender systems continually retrain on user reactions to their own predictions, creating AI feedback loops that amplify biases and diminish fairness over time. Despite this well-known risk, most bias mitigation techniques are tested only on static splits, so their long-term fairness across multiple retraining rounds remains unclear. We therefore present a systematic literature review of bias mitigation methods that explicitly consider AI feedback loops and are validated in multi-round simulations or live A/B tests. Screening 347 papers yields 24 primary studies published between 2019 and 2025. Each study is coded on six dimensions: mitigation technique, biases addressed, dynamic testing set-up, evaluation focus, application domain, and ML task, organising them into a reusable taxonomy. The taxonomy offers industry practitioners a quick checklist for selecting robust methods and gives researchers a clear roadmap to the field’s most urgent gaps. Examples include the shortage of shared simulators, varying evaluation metrics, and the fact that most studies report either fairness or performance; only six use both.

[LG-144] LLM-QUBO: An End-to-End Framework for Automated QUBO Transformation from Natural Language Problem Descriptions

Link: https://arxiv.org/abs/2509.00099
Authors: Huixiang Zhang, Mahzabeen Emu, Salimur Choudhury
Categories: Machine Learning (cs.LG); Quantum Physics (quant-ph)

Abstract:Quantum annealing offers a promising paradigm for solving NP-hard combinatorial optimization problems, but its practical application is severely hindered by two challenges: the complex, manual process of translating problem descriptions into the requisite Quadratic Unconstrained Binary Optimization (QUBO) format and the scalability limitations of current quantum hardware. To address these obstacles, we propose a novel end-to-end framework, LLM-QUBO, that automates this entire formulation-to-solution pipeline. Our system leverages a Large Language Model (LLM) to parse natural language, automatically generating a structured mathematical representation. To overcome hardware limitations, we integrate a hybrid quantum-classical Benders’ decomposition method. This approach partitions the problem, compiling the combinatorially complex master problem into a compact QUBO format, while delegating linearly structured sub-problems to classical solvers. The correctness of the generated QUBO and the scalability of the hybrid approach are validated using classical solvers, establishing a robust performance baseline and demonstrating the framework’s readiness for quantum hardware. Our primary contribution is a synergistic computing paradigm that bridges classical AI and quantum computing, addressing key challenges in the practical application of optimization problems. This automated workflow significantly reduces the barrier to entry, providing a viable pathway to transform quantum devices into accessible accelerators for large-scale, real-world optimization challenges.
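To show what translating a problem into QUBO format involves, the sketch below hand-encodes a toy task (pick exactly one of three items at minimum cost) by folding the constraint into a quadratic penalty, then checks it with a brute-force classical solver. The problem, penalty weight, and helper names are illustrative assumptions. Neither the LLM parsing step nor the Benders' decomposition from the paper is represented.

```python
from itertools import product


def build_one_hot_qubo(costs, penalty):
    """Upper-triangular QUBO for: minimize sum(c_i x_i) s.t. sum(x_i) == 1.

    The penalty term penalty * (sum(x_i) - 1)^2 expands, using x_i^2 = x_i,
    to -penalty on each diagonal entry, +2*penalty on each pair i < j,
    and a constant offset of +penalty.
    """
    n = len(costs)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        q[i][i] = costs[i] - penalty
        for j in range(i + 1, n):
            q[i][j] = 2.0 * penalty
    return q, penalty


def qubo_energy(q, x):
    """x^T Q x over the upper triangle for a binary assignment x."""
    n = len(x)
    return sum(q[i][j] * x[i] * x[j] for i in range(n) for j in range(i, n))


def brute_force_minimum(q):
    """Classical check: enumerate all 2^n assignments (fine for tiny n)."""
    return min(product((0, 1), repeat=len(q)), key=lambda x: qubo_energy(q, x))


costs = [3.0, 1.0, 2.0]
q, offset = build_one_hot_qubo(costs, penalty=10.0)
best = brute_force_minimum(q)
print(best, qubo_energy(q, best) + offset)
```

The penalty has to exceed the largest cost saving available from violating the constraint. If it does not, the unconstrained minimum can select zero or several items.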

[LG-145] Financial Decision Making using Reinforcement Learning with Dirichlet Priors and Quantum-Inspired Genetic Optimization

Link: https://arxiv.org/abs/2509.00095
Authors: Prasun Nandy, Debjit Dhar, Rik Das
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Abstract:Traditional budget allocation models struggle with the stochastic and nonlinear nature of real-world financial data. This study proposes a hybrid reinforcement learning (RL) framework for dynamic budget allocation, enhanced with Dirichlet-inspired stochasticity and quantum mutation-based genetic optimization. Using Apple Inc. quarterly financial data (2009 to 2025), the RL agent learns to allocate budgets between Research and Development and Selling, General and Administrative to maximize profitability while adhering to historical spending patterns, with L2 penalties discouraging unrealistic deviations. A Dirichlet distribution governs state evolution to simulate shifting financial contexts. To escape local minima and improve generalization, the trained policy is refined using genetic algorithms with quantum mutation via parameterized qubit rotation circuits. Generation-wise rewards and penalties are logged to visualize convergence and policy behavior. On unseen fiscal data, the model achieves high alignment with actual allocations (cosine similarity 0.9990, KL divergence 0.0023), demonstrating the promise of combining deep RL, stochastic modeling, and quantum-inspired heuristics for adaptive enterprise budgeting.

[LG-146] Robust Detection of Synthetic Tabular Data under Schema Variability

Link: https://arxiv.org/abs/2509.00092
Authors: G. Charbel N. Kindji (MALT), Elisa Fromont (MALT), Lina Maria Rojas-Barahona, Tanguy Urvoy
Categories: Machine Learning (cs.LG); Databases (cs.DB)

Abstract:The rise of powerful generative models has sparked concerns over data authenticity. While detection methods have been extensively developed for images and text, the case of tabular data, despite its ubiquity, has been largely overlooked. Yet, detecting synthetic tabular data is especially challenging due to its heterogeneous structure and unseen formats at test time. We address the underexplored task of detecting synthetic tabular data in the wild, where tables have variable and previously unseen schemas. We introduce a novel datum-wise transformer architecture that significantly outperforms the only previously published baseline, improving both AUC and accuracy by 7 points. By incorporating a table-adaptation component, our model gains an additional 7 accuracy points, demonstrating enhanced robustness. This work provides the first strong evidence that detecting synthetic tabular data in real-world conditions is not only feasible, but can be done with high reliability.

[LG-147] Learning from Peers: Collaborative Ensemble Adversarial Training

Link: https://arxiv.org/abs/2509.00089
Authors: Li Dengjin, Guo Yanming, Xie Yuxiang, Li Zheng, Chen Jiangming, Li Xiaolong, Lao Mingrui
Subjects: Machine Learning (cs.LG)

Abstract:Ensemble Adversarial Training (EAT) attempts to enhance the robustness of models against adversarial attacks by leveraging multiple models. However, current EAT strategies tend to train the sub-models independently, ignoring the cooperative benefits between sub-models. Through detailed inspection of the EAT process, we find that samples with classification disparities between sub-models are close to the decision boundary of the ensemble and exert greater influence on its robustness. To this end, we propose a novel yet efficient method, Collaborative Ensemble Adversarial Training (CEAT), which emphasizes cooperative learning among the sub-models in the ensemble. Specifically, samples with larger predictive disparities between the sub-models receive greater attention during the adversarial training of the other sub-models. CEAT leverages the probability disparities to adaptively assign weights to different samples by incorporating a calibrating distance regularization. Extensive experiments on widely adopted datasets show that our proposed method achieves state-of-the-art performance over competitive EAT methods. Notably, CEAT is model-agnostic and can be seamlessly adapted to various ensemble methods with flexible applicability.

[LG-148] Centralized vs. Federated Learning for Educational Data Mining: A Comparative Study on Student Performance Prediction with SAEB Microdata

Link: https://arxiv.org/abs/2509.00086
Authors: Rodrigo Tertulino
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments: This paper has been prepared for submission to the Brazilian Journal of Informatics in Education (RBIE)

Abstract:The application of data mining and artificial intelligence in education offers unprecedented potential for personalizing learning and early identification of at-risk students. However, the practical use of these techniques faces a significant barrier in privacy legislation, such as Brazil’s General Data Protection Law (LGPD), which restricts the centralization of sensitive student data. To resolve this challenge, privacy-preserving computational approaches are required. The present study evaluates the feasibility and effectiveness of Federated Learning, specifically the FedProx algorithm, to predict student performance using microdata from the Brazilian Basic Education Assessment System (SAEB). A Deep Neural Network (DNN) model was trained in a federated manner, simulating a scenario with 50 schools, and its performance was rigorously benchmarked against a centralized eXtreme Gradient Boosting (XGBoost) model. The analysis, conducted on a universe of over two million student records, revealed that the centralized model achieved an accuracy of 63.96%. Remarkably, the federated model reached a peak accuracy of 61.23%, demonstrating a marginal performance loss in exchange for a robust privacy guarantee. The results indicate that Federated Learning is a viable and effective solution for building collaborative predictive models in the Brazilian educational context, in alignment with the requirements of the LGPD.
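
FedProx differs from plain federated averaging by adding a proximal term, (mu / 2) * ||w - w_global||^2, to each client's local objective, which keeps local models from drifting far from the global one. The toy sketch below illustrates that update on a one-parameter least-squares problem. The two "schools", their data, and the hyperparameters `mu`, `lr`, and `steps` are hypothetical and stand in for the paper's DNN and SAEB setup, which the abstract does not detail.

```python
def local_gradient(w, data):
    """Gradient of the mean of 0.5 * (w * x - y)**2 over one client's data."""
    return sum((w * x - y) * x for x, y in data) / len(data)

def fedprox_local_update(w_global, data, mu=0.1, lr=0.05, steps=50):
    """Run local gradient descent on the loss plus the FedProx proximal term."""
    w = w_global
    for _ in range(steps):
        # The extra mu * (w - w_global) is the gradient of (mu / 2) * (w - w_global)**2.
        grad = local_gradient(w, data) + mu * (w - w_global)
        w -= lr * grad
    return w

def fedprox_round(w_global, clients, mu=0.1):
    """One communication round: local updates, then a size-weighted average."""
    local_weights = [fedprox_local_update(w_global, d, mu) for d in clients]
    sizes = [len(d) for d in clients]
    return sum(n * w for n, w in zip(sizes, local_weights)) / sum(sizes)

# Two hypothetical clients with different underlying slopes (non-IID data).
clients = [
    [(1.0, 2.0), (2.0, 4.1)],
    [(1.0, 2.9), (3.0, 9.2)],
]

w = 0.0
for _ in range(20):
    w = fedprox_round(w, clients)
print("global weight after 20 rounds:", w)
```

Setting `mu=0` recovers ordinary local training followed by averaging. Larger values of `mu` pull each client's result closer to the global weight it started from.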

[LG-149] Experimental Assessment of a Multi-Class AI/ML Architecture for Real-Time Characterization of Cyber Events in a Live Research Reactor

Link: https://arxiv.org/abs/2509.00076
Authors: Zachery Dahm, Konstantinos Vasili, Vasileios Theos, Konstantinos Gkouliaras, William Richards, True Miller, Brian Jowers, Stylianos Chatzidakis
Subjects: Machine Learning (cs.LG)

Abstract:There is increased interest in applying Artificial Intelligence and Machine Learning (AI/ML) within the nuclear industry and nuclear engineering community. Effective implementation of AI/ML could offer benefits to the nuclear domain, including enhanced identification of anomalies, anticipation of system failures, and operational schedule optimization. However, limited work has been done to investigate the feasibility and applicability of AI/ML tools in a functioning nuclear reactor. Here, we go beyond the development of a single model and introduce a multi-layered AI/ML architecture that integrates both information technology and operational technology data streams to identify, characterize, and differentiate (i) among diverse cybersecurity events and (ii) between cyber events and other operational anomalies. Leveraging Purdue University's research reactor, PUR-1, we demonstrate this architecture through a representative use case that includes multiple concurrent false data injections and denial-of-service attacks of increasing complexity under realistic reactor conditions. The use case includes 14 system states (1 normal, 13 abnormal) and over 13.8 million multi-variate operational and information technology data points. The study demonstrated the capability of AI/ML to distinguish between normal, abnormal, and cybersecurity-related events, even under challenging conditions such as denial-of-service attacks. Combining operational and information technology data improved classification accuracy but posed challenges related to synchronization and collection during certain cyber events. While results indicate significant promise for AI/ML in nuclear cybersecurity, the findings also highlight the need for further refinement in handling complex event differentiation and multi-class architectures.

[LG-150] Mitigating Clinician Information Overload: Generative AI for Integrated EHR and RPM Data Analysis

Link: https://arxiv.org/abs/2509.00073
Authors: Ankit Shetgaonkar, Dipen Pradhan, Lakshit Arora, Sanjay Surendranath Girija, Shashank Kapoor, Aman Raj
Subjects: Machine Learning (cs.LG)
Comments: Accepted at IEEE COMPSAC 2025

Abstract:Generative Artificial Intelligence (GenAI), particularly Large Language Models (LLMs), offers powerful capabilities for interpreting the complex data landscape in healthcare. In this paper, we present a comprehensive overview of the capabilities, requirements and applications of GenAI for deriving clinical insights and improving clinical efficiency. We first provide some background on the forms and sources of patient data, namely real-time Remote Patient Monitoring (RPM) streams and traditional Electronic Health Records (EHRs). The sheer volume and heterogeneity of this combined data present significant challenges to clinicians and contribute to information overload. In addition, we explore the potential of LLM-powered applications for improving clinical efficiency. These applications can enhance navigation of longitudinal patient data and provide actionable clinical decision support through natural language dialogue. We discuss the opportunities this presents for streamlining clinician workflows and personalizing care, alongside critical challenges such as data integration complexity, ensuring data quality and RPM data reliability, maintaining patient privacy, validating AI outputs for clinical safety, mitigating bias, and ensuring clinical acceptance. We believe this work represents the first summarization of GenAI techniques for managing clinician data overload due to combined RPM / EHR data complexities.

[LG-151] AnomalyExplainer Explainable AI for LLM-based anomaly detection using BERTViz and Captum

Link: https://arxiv.org/abs/2509.00069
Authors: Prasasthy Balasubramanian, Dumindu Kankanamge, Ekaterina Gilman, Mourad Oussalah
Subjects: Machine Learning (cs.LG)

Abstract:Conversational AI and Large Language Models (LLMs) have become powerful tools across domains, including cybersecurity, where they help detect threats early and improve response times. However, challenges such as false positives and complex model management still limit trust. Although Explainable AI (XAI) aims to make AI decisions more transparent, many security analysts remain uncertain about its usefulness. This study presents a framework that detects anomalies and provides high-quality explanations through the visual tools BERTViz and Captum, combined with natural language reports based on attention outputs. This reduces manual effort and speeds up remediation. Our comparative analysis showed that RoBERTa offers high accuracy (99.6%) and strong anomaly detection, outperforming Falcon-7B and DeBERTa, as well as exhibiting better flexibility than large-scale Mistral-7B on the HDFS dataset from LogHub. User feedback confirms the chatbot’s ease of use and improved understanding of anomalies, demonstrating the ability of the developed framework to strengthen cybersecurity workflows.

[LG-152] T-MLP: Tailed Multi-Layer Perceptron for Level-of-Detail Signal Representation

Link: https://arxiv.org/abs/2509.00066
Authors: Chuanxiang Yang, Yuanfeng Zhou, Guangshun Wei, Siyu Ren, Yuan Liu, Junhui Hou, Wenping Wang
Subjects: Machine Learning (cs.LG); Graphics (cs.GR); Image and Video Processing (eess.IV)

Abstract:Level-of-detail (LoD) representation is critical for efficiently modeling and transmitting various types of signals, such as images and 3D shapes. In this work, we present a novel neural architecture that supports LoD signal representation. Our architecture is based on an elaborate modification of the widely used Multi-Layer Perceptron (MLP), which inherently operates at a single scale and therefore lacks native support for LoD. Specifically, we introduce the Tailed Multi-Layer Perceptron (T-MLP) that extends the MLP by attaching multiple output branches, also called tails, to its hidden layers, enabling direct supervision at multiple depths. Our loss formulation and training strategy allow each hidden layer to effectively learn a target signal at a specific LoD, thus enabling multi-scale modeling. Extensive experimental results show that our T-MLP outperforms other neural LoD baselines across a variety of signal representation tasks.
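
The core architectural idea, output branches ("tails") attached to each hidden layer, can be sketched as a forward pass that returns one prediction per depth. This is a minimal illustration only. The class name `TailedMLP`, the tanh activation, the layer sizes, and the random initialization are assumptions, and the abstract does not specify how the tails are combined or supervised, so the sketch returns each tail's output separately.

```python
import math
import random

def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def init_linear(n_in, n_out, rng):
    """Small random weights and zero biases (an arbitrary choice for this sketch)."""
    weights = [[rng.uniform(-1, 1) / math.sqrt(n_in) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

class TailedMLP:
    """MLP whose hidden layers each feed an extra linear output head (a tail)."""

    def __init__(self, n_in, n_hidden, n_out, depth, seed=0):
        rng = random.Random(seed)
        sizes = [n_in] + [n_hidden] * depth
        self.hidden = [init_linear(sizes[i], sizes[i + 1], rng)
                       for i in range(depth)]
        self.tails = [init_linear(n_hidden, n_out, rng) for _ in range(depth)]

    def forward(self, x):
        """Return a list of outputs, one per hidden layer, shallowest first."""
        outputs = []
        h = x
        for (w, b), (tail_w, tail_b) in zip(self.hidden, self.tails):
            h = [math.tanh(v) for v in linear(h, w, b)]
            outputs.append(linear(h, tail_w, tail_b))
        return outputs

model = TailedMLP(n_in=2, n_hidden=8, n_out=1, depth=3)
outs = model.forward([0.25, 0.75])
print("number of level-of-detail outputs:", len(outs))
```

In a training loop, each entry of `outs` could be compared against a target at a different level of detail, which is one way to read the "direct supervision at multiple depths" described in the abstract.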

[LG-153] Adaptive Physics-Informed Neural Networks with Multi-Category Feature Engineering for Hydrogen Sorption Prediction in Clays Shales and Coals

Link: https://arxiv.org/abs/2509.00049
Authors: Mohammad Nooraiepour, Mohammad Masoudi, Zezhang Song, Helge Hellevang
Subjects: Machine Learning (cs.LG)

Abstract:Accurate prediction of hydrogen sorption in clays, shales, and coals is vital for advancing underground hydrogen storage, natural hydrogen exploration, and radioactive waste containment. Traditional experimental methods, while foundational, are time-consuming, error-prone, and limited in capturing geological heterogeneity. This study introduces an adaptive physics-informed neural network (PINN) framework with multi-category feature engineering to enhance hydrogen sorption prediction. The framework integrates classical isotherm models with thermodynamic constraints to ensure physical consistency while leveraging deep learning flexibility. A comprehensive dataset consisting of 155 samples, which includes 50 clays, 60 shales, and 45 coals, was employed, incorporating diverse compositional properties and experimental conditions. Multi-category feature engineering across seven categories captured complex sorption dynamics. The PINN employs deep residual networks with multi-head attention, optimized via adaptive loss functions and Monte Carlo dropout for uncertainty quantification. K-fold cross-validation and hyperparameter optimization achieve significant accuracy (R2 = 0.979, RMSE = 0.045 mol per kg) with 67% faster convergence despite 15-fold increased complexity. The framework demonstrates robust lithology-specific performance across clay minerals (R2 = 0.981), shales (R2 = 0.971), and coals (R2 = 0.978), maintaining 85-91% reliability scores. Interpretability analysis via SHAP, accumulated local effects, and Friedman’s H-statistics reveal that hydrogen adsorption capacity dominates predictions, while 86.7% of feature pairs exhibit strong interactions, validating the necessity of non-linear modeling approaches. This adaptive physics-informed framework accelerates site screening and enables risk-informed decision-making through robust uncertainty quantification.

[LG-154] Industrial Steel Slag Flow Data Loading Method for Deep Learning Applications

Link: https://arxiv.org/abs/2509.00034
Authors: Mert Sehri, Ana Cardoso, Francisco de Assis Boldt, Patrick Dumond
Subjects: Machine Learning (cs.LG)

Abstract:Steel casting processes are vulnerable to financial losses due to slag flow contamination, making accurate slag flow condition detection essential. This study introduces a novel cross-domain diagnostic method using vibration data collected from an industrial steel foundry to identify various stages of slag flow. A hybrid deep learning model combining one-dimensional convolutional neural networks and long short-term memory layers is implemented, tested, and benchmarked against a standard one-dimensional convolutional neural network. The proposed method processes raw time-domain vibration signals from accelerometers and evaluates performance across 16 distinct domains using a realistic cross-domain dataset split. Results show that the hybrid convolutional neural network and long short-term memory architecture, when combined with root mean square preprocessing and a selective embedding data loading strategy, achieves robust classification accuracy, outperforming traditional models and loading techniques. The highest test accuracy of 99.10 +/- 0.30 demonstrates the method’s capability for generalization and industrial relevance. This work presents a practical and scalable solution for real-time slag flow monitoring, contributing to improved reliability and operational efficiency in steel manufacturing.

[LG-155] Mitigating Data Exfiltration Attacks through Layer-Wise Learning Rate Decay Fine-Tuning

Link: https://arxiv.org/abs/2509.00027
Authors: Elie Thellier (EPIONE), Huiyu Li (EPIONE), Nicholas Ayache (EPIONE), Hervé Delingette (EPIONE)
Subjects: Machine Learning (cs.LG)

Abstract:Data lakes enable the training of powerful machine learning models on sensitive, high-value medical datasets, but also introduce serious privacy risks due to potential leakage of protected health information. Recent studies show adversaries can exfiltrate training data by embedding latent representations into model parameters or inducing memorization via multi-task learning. These attacks disguise themselves as benign utility models while enabling reconstruction of high-fidelity medical images, posing severe privacy threats with legal and ethical implications. In this work, we propose a simple yet effective mitigation strategy that perturbs model parameters at export time through fine-tuning with a decaying layer-wise learning rate to corrupt embedded data without degrading task performance. Evaluations on DermaMNIST, ChestMNIST, and MIMIC-CXR show that our approach maintains utility task performance, effectively disrupts state-of-the-art exfiltration attacks, outperforms prior defenses, and renders exfiltrated data unusable for training. Ablations and discussions on adaptive attacks highlight challenges and future directions. Our findings offer a practical defense against data leakage in data lake-trained models and centralized federated learning.
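
A layer-wise learning rate decay schedule gives each layer its own step size, scaled geometrically with depth. The sketch below uses the common fine-tuning convention in which the last layer gets the base rate and earlier layers get progressively smaller rates. Which direction the paper decays in, and the values of `base_lr` and `decay`, are assumptions here, since the abstract does not state them.

```python
def layerwise_learning_rates(num_layers, base_lr=1e-3, decay=0.5):
    """Learning rate per layer, input side first; the last layer gets base_lr."""
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

def sgd_step(params, grads, lrs):
    """Apply one SGD update with a separate learning rate for each layer."""
    return [
        [p - lr * g for p, g in zip(layer_params, layer_grads)]
        for layer_params, layer_grads, lr in zip(params, grads, lrs)
    ]

# Hypothetical four-layer model with one scalar parameter per layer.
lrs = layerwise_learning_rates(num_layers=4, base_lr=1.0, decay=0.5)
params = [[1.0], [1.0], [1.0], [1.0]]
grads = [[1.0], [1.0], [1.0], [1.0]]
print("per-layer learning rates:", lrs)
print("updated parameters:", sgd_step(params, grads, lrs))
```

With identical gradients, layers with larger learning rates move further from their exported values, which is the lever such a defense would use to perturb parameters unevenly across depth.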

[LG-156] Diagnosing Psychiatric Patients: Can Large Language and Machine Learning Models Perform Effectively in Emergency Cases?

Link: https://arxiv.org/abs/2509.00026
Authors: Abu Shad Ahammed, Sayeri Mukherjee, Roman Obermaisser
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY)

Abstract:Mental disorders are clinically significant patterns of behavior that are associated with stress and/or impairment in social, occupational, or family activities. People suffering from such disorders are often misjudged and poorly diagnosed due to a lack of visible symptoms compared to other health complications. In emergency situations, identifying psychiatric issues is therefore challenging but essential to saving patients. In this paper, we investigate how traditional machine learning and large language models (LLMs) can assess these psychiatric patients based on their behavioral patterns to provide a diagnostic assessment. Data from emergency psychiatric patients were collected from a rescue station in Germany. Various machine learning models, including Llama 3.1, were applied to the rescue patient data to assess whether their predictive capabilities can serve as an efficient tool for identifying patients with mental disorders, especially in rescue cases.

[LG-157] Probabilities of Causation and Root Cause Analysis with Quasi-Markovian Models

Link: https://arxiv.org/abs/2509.02535
Authors: Eduardo Rocha Laurentino, Fabio Gagliardi Cozman, Denis Deratani Maua, Daniel Angelo Esteves Lawand, Davi Goncalves Bezerra Coelho, Lucas Martins Marques
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Accepted at the 35th Brazilian Conference on Intelligent Systems (BRACIS 2025)

Abstract:Probabilities of causation provide principled ways to assess causal relationships but face computational challenges due to partial identifiability and latent confounding. This paper introduces both algorithmic simplifications, significantly reducing the computational complexity of calculating tighter bounds for these probabilities, and a novel methodological framework for Root Cause Analysis that systematically employs these causal metrics to rank entire causal paths.

[LG-158] Wild Refitting for Model-Free Excess Risk Evaluation of Opaque ML/AI Models under Bregman Loss

Link: https://arxiv.org/abs/2509.02476
Authors: Haichen Hu, David Simchi-Levi
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We study the problem of evaluating the excess risk of classical penalized empirical risk minimization (ERM) with Bregman losses. We show that by leveraging the recently proposed wild refitting procedure (Wainwright, 2025), one can efficiently upper bound the excess risk through the so-called “wild optimism,” without relying on the global structure of the underlying function class. This property makes our approach inherently model-free. Unlike conventional analyses, our framework operates with just one dataset and black-box access to the training procedure. The method involves randomized vector-valued symmetrization with an appropriate scaling of the prediction residues and constructing artificially modified outcomes, upon which we retrain a second predictor for excess risk estimation. We establish high-probability performance guarantees both under the fixed design setting and the random design setting, demonstrating that wild refitting under Bregman losses, with an appropriately chosen wild noise scale, yields a valid upper bound on the excess risk. This work thus is promising for theoretically evaluating modern opaque ML and AI models such as deep neural networks and large language models, where the model class is too complex for classical learning theory and empirical process techniques to apply.

[LG-159] Distribution estimation via Flow Matching with Lipschitz guarantees

Link: https://arxiv.org/abs/2509.02337
Authors: Lea Kunkel
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:Flow Matching, a promising approach in generative modeling, has recently gained popularity. Relying on ordinary differential equations, it offers a simple and flexible alternative to diffusion models, which are currently the state of the art. Despite its empirical success, the mathematical understanding of its statistical power is so far very limited. This is largely due to the sensitivity of theoretical bounds to the Lipschitz constant of the vector field that drives the ODE. In this work, we study the assumptions that lead to controlling this dependency. Based on these results, we derive a convergence rate for the Wasserstein-1 distance between the estimated distribution and the target distribution, which improves on previous results in high-dimensional settings. This rate applies to certain classes of unbounded distributions and, in particular, does not require log-concavity.

[LG-160] Variational Uncertainty Decomposition for In-Context Learning

Link: https://arxiv.org/abs/2509.02327
Authors: I. Shavindra Jayasekera, Jacob Si, Wenlong Chen, Filippo Valdettaro, A. Aldo Faisal, Yingzhen Li
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:As large language models (LLMs) gain popularity in conducting prediction tasks in-context, understanding the sources of uncertainty in in-context learning becomes essential to ensuring reliability. The recent hypothesis of in-context learning performing predictive Bayesian inference opens the avenue for Bayesian uncertainty estimation, particularly for decomposing uncertainty into epistemic uncertainty due to lack of in-context data and aleatoric uncertainty inherent in the in-context prediction task. However, the decomposition idea remains under-explored due to the intractability of the latent parameter posterior from the underlying Bayesian model. In this work, we introduce a variational uncertainty decomposition framework for in-context learning without explicitly sampling from the latent parameter posterior, by optimising auxiliary queries as probes to obtain an upper bound to the aleatoric uncertainty of an LLM’s in-context learning procedure, which also induces a lower bound to the epistemic uncertainty. Through experiments on synthetic and real-world tasks, we show quantitatively and qualitatively that the decomposed uncertainties obtained from our method exhibit desirable properties of epistemic and aleatoric uncertainty.
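
The link between the two bounds mentioned above follows from the standard additive decomposition of predictive uncertainty. As a sketch, assume the usual entropy-based version, with latent parameter θ, in-context data D, and V(x) standing for any upper bound on the aleatoric term. The symbols are illustrative, and the paper may use a different uncertainty measure.

```latex
% Total predictive uncertainty splits into an aleatoric and an epistemic part.
\[
\underbrace{\mathbb{H}\big[p(y \mid x, \mathcal{D})\big]}_{\text{total}}
  = \underbrace{\mathbb{E}_{p(\theta \mid \mathcal{D})}\,
      \mathbb{H}\big[p(y \mid x, \theta)\big]}_{\text{aleatoric}}
  + \underbrace{\mathbb{I}\big(y; \theta \mid x, \mathcal{D}\big)}_{\text{epistemic}}
\]
% Because the total is fixed, an upper bound on one term bounds the other from below.
\[
\text{aleatoric} \le V(x)
  \;\Longrightarrow\;
  \text{epistemic} \ge \mathbb{H}\big[p(y \mid x, \mathcal{D})\big] - V(x)
\]
```

Since the total can be read off the model's predictive distribution, tightening the upper bound V(x) directly tightens the lower bound on the epistemic part, without sampling from the posterior over θ.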

[LG-161] Amputation-imputation based generation of synthetic tabular data for ratemaking

Link: https://arxiv.org/abs/2509.02171
Authors: Yevhen Havrylenko, Meelis Käärik, Artur Tuttar
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
Comments: 31 pages, 2 figures, 2 tables

Abstract:Actuarial ratemaking depends on high-quality data, yet access to such data is often limited by the cost of obtaining new data, privacy concerns, etc. In this paper, we explore synthetic-data generation as a potential solution to these issues. In addition to discussing generative methods previously studied in the actuarial literature, we introduce to the insurance community another approach based on Multiple Imputation by Chained Equations (MICE). We present a comparative study using an open-source dataset and evaluating MICE-based models against other generative models like Variational Autoencoders and Conditional Tabular Generative Adversarial Networks. We assess how well synthetic data preserves the original marginal distributions of variables as well as the multivariate relationships among covariates. We also investigate the consistency between Generalized Linear Models (GLMs) trained on synthetic data with GLMs trained on the original data. Furthermore, we assess the ease of use of each generative approach and study the impact of augmenting original data with synthetic data on the performance of GLMs for predicting claim counts. Our results highlight the potential of MICE-based methods in creating high-quality tabular data while being more user-friendly than the other methods.

[LG-162] Using explainable artificial intelligence (XAI) as a diagnostic tool: An application for deducing hydrologic connectivity at watershed scale

Link: https://arxiv.org/abs/2509.02127
Authors: Sheng Ye, Jiyu Li, Yifan Chai, Lin Liu, Murugesu Sivapalan, Qihua Ran
Subjects: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
Comments: 29 pages, 12 figures

Abstract:Explainable artificial intelligence (XAI) methods have been applied to interpret deep learning model results. However, applications that integrate XAI with established hydrologic knowledge for process understanding remain limited. Here we present a framework that applies XAI methods at the point scale to provide granular interpretation and enable cross-scale aggregation of hydrologic responses. Hydrologic connectivity is used to demonstrate the value of this approach. Soil moisture and its movement, generated by a physically based hydrologic model, were used to train a long short-term memory (LSTM) network, and the impacts of its inputs were evaluated with XAI methods. Our results suggest that XAI-based classification can effectively identify the differences in the functional roles of various sub-regions at the watershed scale. The aggregated XAI results provide an explicit and quantitative indicator of hydrologic connectivity development, offering insights into streamflow variation. This framework could be used to facilitate aggregation of other hydrologic responses to advance process understanding.

[LG-163] Online Complexity Estimation for Repetitive Scenario Design

Link: https://arxiv.org/abs/2509.02103
Authors: Guillaume O. Berger, Raphaël M. Jungers
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:We consider the problem of repetitive scenario design where one has to solve repeatedly a scenario design problem and can adjust the sample size (number of scenarios) to obtain a desired level of risk (constraint violation probability). We propose an approach to learn on the fly the optimal sample size based on observed data consisting in previous scenario solutions and their risk level. Our approach consists in learning a function that represents the pdf (probability density function) of the risk as a function of the sample size. Once this function is known, retrieving the optimal sample size is straightforward. We prove the soundness and convergence of our approach to obtain the optimal sample size for the class of fixed-complexity scenario problems, which generalizes fully-supported convex scenario programs that have been studied extensively in the scenario optimization literature. We also demonstrate the practical efficiency of our approach on a series of challenging repetitive scenario design problems, including non-fixed-complexity problems, nonconvex constraints and time-varying distributions.

[LG-164] Inference in Spreading Processes with Neural-Network Priors

Link: https://arxiv.org/abs/2509.02073
Authors: Davide Ghio, Fabrizio Boncoraglio, Lenka Zdeborová
Subjects: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Physics and Society (physics.soc-ph)
Comments: 26 pages, 13 figures

Abstract:Stochastic processes on graphs are a powerful tool for modelling complex dynamical systems such as epidemics. A recent line of work focused on the inference problem where one aims to estimate the state of every node at every time, starting from partial observation of a subset of nodes at a subset of times. In these works, the initial state of the process was assumed to be random i.i.d. over nodes. Such an assumption may not be realistic in practice, where one may have access to a set of covariate variables for every node that influence the initial state of the system. In this work, we will assume that the initial state of a node is an unknown function of such covariate variables. Given that functions can be represented by neural networks, we will study a model where the initial state is given by a simple neural network – notably the single-layer perceptron acting on the known node-wise covariate variables. Within a Bayesian framework, we study how such neural-network prior information enhances the recovery of initial states and spreading trajectories. We derive a hybrid belief propagation and approximate message passing (BP-AMP) algorithm that handles both the spreading dynamics and the information included in the node covariates, and we assess its performance against the estimators that either use only the spreading information or use only the information from the covariate variables. We show that in some regimes, the model can exhibit first-order phase transitions when using a Rademacher distribution for the neural-network weights. These transitions create a statistical-to-computational gap where even the BP-AMP algorithm, despite the theoretical possibility of perfect recovery, fails to achieve it. 

[LG-165] Morphology-Specific Peptide Discovery via Masked Conditional Generative Modeling

Link: https://arxiv.org/abs/2509.02060
Authors: Nuno Costa, Julija Zavadlav
Subjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments: 17 pages, 4 figures, 2 tables

Abstract:Peptide self-assembly prediction offers a powerful bottom-up strategy for designing biocompatible, low-toxicity materials for large-scale synthesis in a broad range of biomedical and energy applications. However, screening the vast sequence space for categorization of aggregate morphology remains intractable. We introduce PepMorph, an end-to-end peptide discovery pipeline that generates novel sequences that are not only prone to aggregate but self-assemble into a specified fibrillar or spherical morphology. We compiled a new dataset by leveraging existing aggregation propensity datasets and extracting geometric and physicochemical isolated peptide descriptors that act as proxies for aggregate morphology. This dataset is then used to train a Transformer-based Conditional Variational Autoencoder with a masking mechanism, which generates novel peptides under arbitrary conditioning. After filtering to ensure design specifications and validation of generated sequences through coarse-grained molecular dynamics simulations, PepMorph yielded 83% accuracy in intended morphology generation, showcasing its promise as a framework for application-driven peptide discovery.

[LG-166] Non-Linear Model-Based Sequential Decision-Making in Agriculture

Link: https://arxiv.org/abs/2509.01924
Authors: Sakshi Arya, Wentao Lin
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)

Abstract:Sequential decision-making is central to sustainable agricultural management and precision agriculture, where resource inputs must be optimized under uncertainty and over time. However, such decisions must often be made with limited observations, whereas classical bandit and reinforcement learning approaches typically rely on either linear or black-box reward models that may misrepresent domain knowledge or require large amounts of data. We propose a family of nonlinear, model-based bandit algorithms that embed domain-specific response curves directly into the exploration-exploitation loop. By coupling (i) principled uncertainty quantification with (ii) closed-form or rapidly computable profit optima, these algorithms achieve sublinear regret and near-optimal sample complexity while preserving interpretability. Theoretical analysis establishes regret and sample complexity bounds, and extensive simulations emulating real-world fertilizer-rate decisions show consistent improvements over both linear and nonparametric baselines (such as linear UCB and k-NN UCB) in the low-sample regime, under both well-specified and shape-compatible misspecified models. Because our approach leverages mechanistic insight rather than large data volumes, it is especially suited to resource-constrained settings, supporting sustainable, inclusive, and transparent sequential decision-making across agriculture, environmental management, and allied applications. This methodology directly contributes to SDG 2 (Zero Hunger) and SDG 12 (Responsible Consumption and Production) by enabling data-driven, less wasteful agricultural practices.

[LG-167] Design of Experiment for Discovering Directed Mixed Graph

Link: https://arxiv.org/abs/2509.01887
Authors: Haijie Xu, Chen Zhang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: none

Abstract:We study the problem of experimental design for accurately identifying the causal graph structure of a simple structural causal model (SCM), where the underlying graph may include both cycles and bidirected edges induced by latent confounders. The presence of cycles renders it impossible to recover the graph skeleton using observational data alone, while confounding can further invalidate traditional conditional independence (CI) tests in certain scenarios. To address these challenges, we establish lower bounds on both the maximum number of variables that can be intervened upon in a single experiment and the total number of experiments required to identify all directed edges and non-adjacent bidirected edges. Leveraging both CI tests and do-see tests, and accounting for $d$-separation and $\sigma$-separation, we develop two classes of algorithms, i.e., bounded and unbounded, that can recover all causal edges except for double adjacent bidirected edges. We further show that, up to logarithmic factors, the proposed algorithms are tight with respect to the derived lower bounds.

[LG-168] An Observations-focused Assessment of Global AI Weather Prediction Models During the South Asian Monsoon

Link: https://arxiv.org/abs/2509.01879
Authors: Aman Gupta, Aditi Sheshadri, Dhruv Suri
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Comments: none

Abstract:Seven state-of-the-art AI weather models (FourCastNet, FourCastNet-SFNO, Pangu-Weather, GraphCast, Aurora, AIFS, and GenCast) are evaluated against observational data during the South Asian Monsoon. The models are tested on temperature, winds, global kinetic energy spectrum, regional precipitation, cloud cover, cyclone trajectory prediction, and hyperlocal predictions around extreme weather events. The models forecast large-scale dynamics with reasonable accuracy, but fall short on key metrics critical to Monsoon-time weather prediction. The models exhibit substantially higher errors when compared against ground-based weather station data than against reanalysis or conventional forecasts. The AI weather prediction models show key differences in mesoscale kinetic energy and extreme precipitation during the Monsoon, and predict markedly different Monsoon-time cyclone trajectories over the Indian subcontinent, raising questions about their readiness for operational applications. Our analysis finds that ECMWF’s deterministic AIFS model offers the most reliable performance and usability, with GraphCast and GenCast being close seconds.

[LG-169] QUBO-based training for VQAs on Quantum Annealers

Link: https://arxiv.org/abs/2509.01821
Authors: Ernesto Acosta, Guillermo Botella, Carlos Cano
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 33 pages, 4 appendix, 14 images

Abstract:Quantum annealers provide an effective framework for solving large-scale combinatorial optimization problems. This work presents a novel methodology for training Variational Quantum Algorithms (VQAs) by reformulating the parameter optimization task as a Quadratic Unconstrained Binary Optimization (QUBO) problem. Unlike traditional gradient-based methods, our approach directly leverages the Hamiltonian of the chosen VQA ansatz and employs an adaptive, metaheuristic optimization scheme. This optimization strategy provides a rich set of configurable parameters which enables the adaptation to specific problem characteristics and available computational resources. The proposed framework is generalizable to arbitrary Hamiltonians and integrates a recursive refinement strategy to progressively approximate high-quality solutions. Experimental evaluations demonstrate the feasibility of the method and its ability to significantly reduce computational overhead compared to classical and evolutionary optimizers, while achieving comparable or superior solution quality. These findings suggest that quantum annealers can serve as a scalable alternative to classical optimizers for VQA training, particularly in scenarios affected by barren plateaus and noisy gradient estimates, and open new possibilities for hybrid quantum gate - quantum annealing - classical optimization models in near-term quantum computing.
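For readers unfamiliar with the QUBO form mentioned in this abstract, here is a minimal sketch of what such a problem looks like: minimise the sum of Q[i][j] * x[i] * x[j] over binary vectors x. The exhaustive search below is only a classical stand-in for illustration (it is exponential in the number of variables); it is not the annealer-based procedure of the paper, and the function names are hypothetical.

```python
from itertools import product


def qubo_energy(Q, x):
    """Energy of binary assignment x under QUBO matrix Q: sum_ij Q[i][j]*x[i]*x[j]."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def solve_qubo_brute_force(Q):
    """Enumerate all 2**n binary assignments and return (best_x, best_energy)."""
    n = len(Q)
    best = min(product((0, 1), repeat=n), key=lambda x: qubo_energy(Q, x))
    return list(best), qubo_energy(Q, best)


if __name__ == "__main__":
    # Two variables: diagonal terms reward setting a bit, the off-diagonal
    # term penalises setting both bits at once.
    Q = [[-1, 2],
         [0, -2]]
    print(solve_qubo_brute_force(Q))
```

A quantum annealer, or any heuristic QUBO solver, would replace `solve_qubo_brute_force` while keeping the same energy function.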

[LG-170] The Price of Sparsity: Sufficient Conditions for Sparse Recovery using Sparse and Sparsified Measurements

Link: https://arxiv.org/abs/2509.01809
Authors: Youssef Chaabouni, David Gamarnik
Categories: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: none

Abstract:We consider the problem of recovering the support of a sparse signal using noisy projections. While extensive work has been done on the dense measurement matrix setting, the sparse setting remains less explored. In this work, we establish sufficient conditions on the sample size for successful sparse recovery using sparse measurement matrices. Bringing together our result with previously known necessary conditions, we discover that, in the regime where $ds/p \rightarrow +\infty$, sparse recovery in the sparse setting exhibits a phase transition at an information-theoretic threshold of $n_{\text{INF}}^{\text{SP}} = \Theta\left(s\log\left(p/s\right)/\log\left(ds/p\right)\right)$, where $p$ denotes the signal dimension, $s$ the number of non-zero components of the signal, and $d$ the expected number of non-zero components per row of measurement. This expression makes the price of sparsity explicit: restricting each measurement to $d$ non-zeros inflates the required sample size by a factor of $\log s/\log\left(ds/p\right)$, revealing a precise trade-off between sampling complexity and measurement sparsity. Additionally, we examine the effect of sparsifying an originally dense measurement matrix on sparse signal recovery. We prove in the regime of $s = \alpha p$ and $d = \psi p$ with $\alpha, \psi \in \left(0,1\right)$ and $\psi$ small that a sample of size $n^{\text{Sp-ified}}_{\text{INF}} = \Theta\left(p / \psi^2\right)$ is sufficient for recovery, subject to a certain uniform integrability conjecture, the proof of which is work in progress.

[LG-171] Optimal information injection and transfer mechanisms for active matter reservoir computing

Link: https://arxiv.org/abs/2509.01799
Authors: Mario U. Gaimann, Miriam Klopotek
Categories: Adaptation and Self-Organizing Systems (nlin.AO); Soft Condensed Matter (cond-mat.soft); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 53 pages, 23 figures. Supplementary Videos: this https URL. Replication Data: this https URL

Abstract:Reservoir computing (RC) is a state-of-the-art machine learning method that makes use of the power of dynamical systems (the reservoir) for real-time inference. When using biological complex systems as reservoir substrates, it serves as a testbed for basic questions about bio-inspired computation – of how self-organization generates proper spatiotemporal patterning. Here, we use a simulation of an active matter system, driven by a chaotically moving input signal, as a reservoir. So far, it has been unclear whether such complex systems possess the capacity to process information efficiently and independently of the method by which it was introduced. We find that when switching from a repulsive to an attractive driving force, the system completely changes the way it computes, while the predictive performance landscapes remain nearly identical. The nonlinearity of the driver’s injection force improves computation by decoupling the single-agent dynamics from that of the driver. Triggered are the (re-)growth, deformation, and active motion of smooth structural boundaries (interfaces), and the emergence of coherent gradients in speed – features found in many soft materials and biological systems. The nonlinear driving force activates emergent regulatory mechanisms, which manifest enhanced morphological and dynamic diversity – arguably improving fading memory, nonlinearity, expressivity, and thus, performance. We further perform RC in a broad variety of non-equilibrium active matter phases that arise when tuning internal (repulsive) forces for information transfer. Overall, we find that active matter agents forming liquid droplets are particularly well suited for RC. The consistently convex shape of the predictive performance landscapes, together with the observed phenomenological richness, conveys robustness and adaptivity.

[LG-172] Real-Time Applicability of Emulated Virtual Circuits for Tokamak Plasma Shape Control

Link: https://arxiv.org/abs/2509.01789
Authors: Pedro Cavestany (1), Alasdair Ross (1), Adriano Agnello (1), Aran Garrod (1), Nicola C. Amorisco (2), George K. Holt (1), Kamran Pentland (2), James Buchanan (2) ((1) STFC Hartree Centre, (2) UK Atomic Energy Authority)
Categories: Plasma Physics (physics.plasm-ph); Machine Learning (cs.LG); Systems and Control (eess.SY); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 6 pages, 4 figures, as submitted to CCTA25

Abstract:Machine learning has recently been adopted to emulate sensitivity matrices for real-time magnetic control of tokamak plasmas. However, these approaches would benefit from a quantification of possible inaccuracies. We report on two aspects of real-time applicability of emulators. First, we quantify the agreement of target displacement from VCs computed via Jacobians of the shape emulators with those from finite differences Jacobians on exact Grad-Shafranov solutions. Good agreement ($\approx 5$-$10\%$) can be achieved on a selection of geometric targets using combinations of neural network emulators with $\approx 10^5$ parameters. A sample of $\approx 10^5$-$10^6$ synthetic equilibria is essential to train emulators that are not over-regularised or overfitting. Smaller models trained on the shape targets may be further fine-tuned to better fit the Jacobians. Second, we address the effect of vessel currents that are not directly measured in real-time and are typically subsumed into effective “shaping currents” when designing virtual circuits. We demonstrate that shaping currents can be inferred via simple linear regression on a trailing window of active coil current measurements with residuals of only a few Ampères, enabling a choice for the most appropriate shaping currents at any point in a shot. While these results are based on historic shot data and simulations tailored to MAST-U, they indicate that emulators with few-millisecond latency can be developed for robust real-time plasma shape control in existing and upcoming tokamaks.

[LG-173] Modeling and benchmarking quantum optical neurons for efficient neural computation

Link: https://arxiv.org/abs/2509.01784
Authors: Andrea Andrisani, Gennaro Vessio, Fabrizio Sgobba, Francesco Di Lena, Luigi Amato Santamaria, Giovanna Castellano
Categories: Optics (physics.optics); Machine Learning (cs.LG)
Comments: none

Abstract:Quantum optical neurons (QONs) are emerging as promising computational units that leverage photonic interference to perform neural operations in an energy-efficient and physically grounded manner. Building on recent theoretical proposals, we introduce a family of QON architectures based on Hong-Ou-Mandel (HOM) and Mach-Zehnder (MZ) interferometers, incorporating different photon modulation strategies – phase, amplitude, and intensity. These physical setups yield distinct pre-activation functions, which we implement as fully differentiable modules in software. We evaluate these QONs both in isolation and as building blocks of multilayer networks, training them on binary and multiclass image classification tasks using the MNIST and FashionMNIST datasets. Our experiments show that two configurations – HOM-based amplitude modulation and MZ-based phase-shifted modulation – achieve performance comparable to that of classical neurons in several settings, and in some cases exhibit faster or more stable convergence. In contrast, intensity-based encodings display greater sensitivity to distributional shifts and training instabilities. These results highlight the potential of QONs as efficient and scalable components for future quantum-inspired neural architectures and hybrid photonic-electronic systems.

[LG-174] Wrong Model, Right Uncertainty: Spatial Associations for Discrete Data with Misspecification

Link: https://arxiv.org/abs/2509.01776
Authors: David R. Burt, Renato Berlinghieri, Tamara Broderick
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 24 pages, 2 figures

Abstract:Scientists are often interested in estimating an association between a covariate and a binary- or count-valued response. For instance, public health officials are interested in how much disease presence (a binary response per individual) varies as temperature or pollution (covariates) increases. Many existing methods can be used to estimate associations, and corresponding uncertainty intervals, but make unrealistic assumptions in the spatial domain. For instance, they incorrectly assume models are well-specified. Or they assume the training and target locations are i.i.d. – whereas in practice, these locations are often not even randomly sampled. Some recent work avoids these assumptions but works only for continuous responses with spatially constant noise. In the present work, we provide the first confidence intervals with guaranteed asymptotic nominal coverage for spatial associations given discrete responses, even under simultaneous model misspecification and nonrandom sampling of spatial locations. To do so, we demonstrate how to handle spatially varying noise, provide a novel proof of consistency for our proposed estimator, and use a delta method argument with a Lyapunov central limit theorem. We show empirically that standard approaches can produce unreliable confidence intervals and can even get the sign of an association wrong, while our method reliably provides correct coverage.

[LG-175] A Hybrid Framework for Healing Semigroups with Machine Learning

Link: https://arxiv.org/abs/2509.01763
Authors: Sarayu Sirikonda, Jasper van de Kreeke
Categories: Rings and Algebras (math.RA); Machine Learning (cs.LG)
Comments: none

Abstract:In this paper, we propose a hybrid framework that heals corrupted finite semigroups, combining deterministic repair strategies with Machine Learning using a Random Forest Classifier. Corruption in these tables breaks associativity and invalidates the algebraic structure. Deterministic methods work for small cardinality n and low corruption but degrade rapidly. Our experiments, carried out on Mace4-generated data sets, demonstrate that our hybrid framework achieves higher healing rates than deterministic-only and ML-only baselines. At a corruption percentage of p=15%, our framework healed 95% of semigroups up to cardinality n=6 and 60% at n=10.
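The statement that corruption "breaks associativity" can be made concrete with a short check. The sketch below assumes a multiplication (Cayley) table stored as a list of lists with elements labelled 0..n-1, so that `table[a][b]` is the product a*b; the function name is hypothetical, and the repair step itself (deterministic or learned) is not shown.

```python
def associativity_violations(table):
    """Return all triples (a, b, c) with (a*b)*c != a*(b*c) in the given table."""
    n = len(table)
    return [
        (a, b, c)
        for a in range(n)
        for b in range(n)
        for c in range(n)
        if table[table[a][b]][c] != table[a][table[b][c]]
    ]


if __name__ == "__main__":
    # Addition modulo 3 is associative, so it forms a valid semigroup.
    z3 = [[0, 1, 2],
          [1, 2, 0],
          [2, 0, 1]]
    print(associativity_violations(z3))

    # Corrupt a single entry and list the triples that now fail.
    corrupted = [row[:] for row in z3]
    corrupted[1][1] = 0
    print(associativity_violations(corrupted))
```

A table is "healed" in this sense once the returned list is empty; the check costs n**3 table lookups.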

[LG-176] Multimodal Generative Flows for LHC Jets NEURIPS2025

Link: https://arxiv.org/abs/2509.01736
Authors: Darius A. Faroughy, Manfred Opper, Cesar Ojeda
Categories: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG)
Comments: Submitted to NeurIPS 2025 ML4PS workshop

Abstract:Generative modeling of high-energy collisions at the Large Hadron Collider (LHC) offers a data-driven route to simulation and anomaly detection, among other applications. A central challenge lies in the hybrid nature of particle-cloud data: each particle carries continuous kinematic features and discrete quantum numbers such as charge and flavor. We introduce a transformer-based multimodal flow that extends flow-matching with a continuous-time Markov jump bridge to jointly model LHC jets with both modalities. Trained on CMS Open Data, our model can generate high-fidelity jets with realistic kinematics, jet substructure and flavor composition.

[LG-177] Preconditioned Regularized Wasserstein Proximal Sampling

Link: https://arxiv.org/abs/2509.01685
Authors: Hong Ye Tan, Stanley Osher, Wuchen Li
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Computation (stat.CO)
Comments: none

Abstract:We consider sampling from a Gibbs distribution by evolving finitely many particles. We propose a preconditioned version of a recently proposed noise-free sampling method, governed by approximating the score function with the numerically tractable score of a regularized Wasserstein proximal operator. This is derived by a Cole–Hopf transformation on coupled anisotropic heat equations, yielding a kernel formulation for the preconditioned regularized Wasserstein proximal. The diffusion component of the proposed method is also interpreted as a modified self-attention block, as in transformer architectures. For quadratic potentials, we provide a discrete-time non-asymptotic convergence analysis and explicitly characterize the bias, which is dependent on regularization and independent of step-size. Experiments demonstrate acceleration and particle-level stability on examples ranging from log-concave and non-log-concave toy problems to Bayesian total-variation regularized image deconvolution, and competitive or better performance on non-convex Bayesian neural network training when utilizing variable preconditioning matrices.

[LG-178] Lipschitz-Guided Design of Interpolation Schedules in Generative Models

Link: https://arxiv.org/abs/2509.01629
Authors: Yifan Chen, Eric Vanden-Eijnden, Jiawei Xu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: none

Abstract:We study the design of interpolation schedules in the stochastic interpolants framework for flow and diffusion-based generative models. We show that while all scalar interpolation schedules achieve identical statistical efficiency under Kullback-Leibler divergence in path space after optimal diffusion coefficient tuning, their numerical efficiency can differ substantially. This observation motivates focusing on numerical properties of the resulting drift fields rather than statistical criteria for schedule design. We propose averaged squared Lipschitzness minimization as a principled criterion for numerical optimization, providing an alternative to kinetic energy minimization used in optimal transport approaches. A transfer formula is derived that enables conversion between different schedules at inference time without retraining neural networks. For Gaussian distributions, our optimized schedules achieve exponential improvements in Lipschitz constants over standard linear schedules, while for Gaussian mixtures, they reduce mode collapse in few-step sampling. We also validate our approach on high-dimensional invariant distributions from stochastic Allen-Cahn equations and Navier-Stokes equations, demonstrating robust performance improvements across resolutions.

[LG-179] Reinforcement learning for graph theory, Parallelizing Wagner's approach

Link: https://arxiv.org/abs/2509.01607
Authors: Alix Bouffard, Jane Breen
Categories: Combinatorics (math.CO); Machine Learning (cs.LG)
Comments: none

Abstract:Our work applies reinforcement learning to construct counterexamples concerning conjectured bounds on the spectral radius of the Laplacian matrix of a graph. We expand upon the re-implementation of Wagner’s approach by Stevanovic et al. with the ability to train numerous unique models simultaneously and a novel redefining of the action space to adjust the influence of the current local optimum on the learning process.

[LG-180] Sampling as Bandits: Evaluation-Efficient Design for Black-Box Densities

Link: https://arxiv.org/abs/2509.01437
Authors: Takuo Matsubara, Andrew Duncan, Simon Cotter, Konstantinos Zygalakis
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
Comments: none

Abstract:We introduce bandit importance sampling (BIS), a new class of importance sampling methods designed for settings where the target density is expensive to evaluate. In contrast to adaptive importance sampling, which optimises a proposal distribution, BIS directly designs the samples through a sequential strategy that combines space-filling designs with multi-armed bandits. Our method leverages Gaussian process surrogates to guide sample selection, enabling efficient exploration of the parameter space with minimal target evaluations. We establish theoretical guarantees on convergence and demonstrate the effectiveness of the method across a broad range of sampling tasks. BIS delivers accurate approximations with fewer target evaluations, outperforming competing approaches across multimodal, heavy-tailed distributions, and real-world applications to Bayesian inference of computationally expensive models.

[LG-181] Temporal Representation Learning for Real-Time Ultrasound Analysis ICML2025

Link: https://arxiv.org/abs/2509.01433
Authors: Yves Stebler, Thomas M. Sutter, Ece Ozkan, Julia E. Vogt
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Comments: ICML 2025 Workshop

Abstract:Ultrasound (US) imaging is a critical tool in medical diagnostics, offering real-time visualization of physiological processes. One of its major advantages is its ability to capture temporal dynamics, which is essential for assessing motion patterns in applications such as cardiac monitoring, fetal development, and vascular imaging. Despite its importance, current deep learning models often overlook the temporal continuity of ultrasound sequences, analyzing frames independently and missing key temporal dependencies. To address this gap, we propose a method for learning effective temporal representations from ultrasound videos, with a focus on echocardiography-based ejection fraction (EF) estimation. EF prediction serves as an ideal case study to demonstrate the necessity of temporal learning, as it requires capturing the rhythmic contraction and relaxation of the heart. Our approach leverages temporally consistent masking and contrastive learning to enforce temporal coherence across video frames, enhancing the model’s ability to represent motion patterns. Evaluated on the EchoNet-Dynamic dataset, our method achieves a substantial improvement in EF prediction accuracy, highlighting the importance of temporally-aware representation learning for real-time ultrasound analysis.

[LG-182] Exploring Quantum Machine Learning for Weather Forecasting

Link: https://arxiv.org/abs/2509.01422
Authors: Maria Heloísa F. da Silva, Gleydson F. de Jesus, Christiano M. S. Nascimento, Valéria L. da Silva, Clebson Cruz
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: none

Abstract:Weather forecasting plays a crucial role in supporting strategic decisions across various sectors, including agriculture, renewable energy production, and disaster management. However, the inherently dynamic and chaotic behavior of the atmosphere presents significant challenges to conventional predictive models. On the other hand, introducing quantum computing simulation techniques to the forecasting problems constitutes a promising alternative to overcome these challenges. In this context, this work explores the emerging intersection between quantum machine learning (QML) and climate forecasting. We present the implementation of a Quantum Neural Network (QNN) trained on real meteorological data from NASA’s Prediction of Worldwide Energy Resources (POWER) database. The results show that QNN has the potential to outperform a classical Recurrent Neural Network (RNN) in terms of accuracy and adaptability to abrupt data shifts, particularly in wind speed prediction. Despite observed nonlinearities and architectural sensitivities, the QNN demonstrated robustness in handling temporal variability and faster convergence in temperature prediction. These findings highlight the potential of quantum models in short and medium term climate prediction, while also revealing key challenges and future directions for optimization and broader applicability.

[LG-183] Double Descent and Overparameterization in Particle Physics Data

Link: https://arxiv.org/abs/2509.01397
Authors: Matthias Vigl, Lukas Heinrich
Categories: High Energy Physics - Experiment (hep-ex); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 4 pages, 3 figures

Abstract:Recently, the benefit of heavily overparameterized models has been observed in machine learning tasks: models with enough capacity to easily cross the interpolation threshold improve in generalization error compared to the classical bias-variance tradeoff regime. We demonstrate this behavior for the first time in particle physics data and explore when and where 'double descent' appears and under which circumstances overparameterization results in a performance gain.

[LG-184] Phase diagram and eigenvalue dynamics of stochastic gradient descent in multilayer neural networks

Link: https://arxiv.org/abs/2509.01349
Authors: Chanju Park (Swansea University), Biagio Lucini (Queen Mary University of London), Gert Aarts (Swansea University)
Categories: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); High Energy Physics - Lattice (hep-lat)
Comments: 27 pages, many figures

Abstract:Hyperparameter tuning is one of the essential steps to guarantee the convergence of machine learning models. We argue that intuition about the optimal choice of hyperparameters for stochastic gradient descent can be obtained by studying a neural network’s phase diagram, in which each phase is characterised by distinctive dynamics of the singular values of weight matrices. Taking inspiration from disordered systems, we start from the observation that the loss landscape of a multilayer neural network with mean squared error can be interpreted as a disordered system in feature space, where the learnt features are mapped to soft spin degrees of freedom, the initial variance of the weight matrices is interpreted as the strength of the disorder, and temperature is given by the ratio of the learning rate and the batch size. As the model is trained, three phases can be identified, in which the dynamics of weight matrices is qualitatively different. Employing a Langevin equation for stochastic gradient descent, previously derived using Dyson Brownian motion, we demonstrate that the three dynamical regimes can be classified effectively, providing practical guidance for the choice of hyperparameters of the optimiser.

[LG-185] Learning residue level protein dynamics with multiscale Gaussians

Link: https://arxiv.org/abs/2509.01038
Authors: Mihir Bafna, Bowen Jing, Bonnie Berger
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments: none

Abstract:Many methods have been developed to predict static protein structures; however, understanding the dynamics of protein structure is essential for elucidating biological function. While molecular dynamics (MD) simulations remain the in silico gold standard, their high computational cost limits scalability. We present DynaProt, a lightweight, SE(3)-invariant framework that predicts rich descriptors of protein dynamics directly from static structures. By casting the problem through the lens of multivariate Gaussians, DynaProt estimates dynamics at two complementary scales: (1) per-residue marginal anisotropy as $3 \times 3$ covariance matrices capturing local flexibility, and (2) joint scalar covariances encoding pairwise dynamic coupling across residues. From these dynamics outputs, DynaProt achieves high accuracy in predicting residue-level flexibility (RMSF) and, remarkably, enables reasonable reconstruction of the full covariance matrix for fast ensemble generation. Notably, it does so using orders of magnitude fewer parameters than prior methods. Our results highlight the potential of direct protein dynamics prediction as a scalable alternative to existing methods.
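The link between a per-residue 3x3 covariance matrix and residue-level flexibility can be sketched as follows. Under the standard definition, RMSF is the square root of the mean squared displacement from the average position, which equals the square root of the trace of the positional covariance. Whether DynaProt derives RMSF in exactly this way is an assumption here, and the function names are hypothetical.

```python
import math


def rmsf_from_covariance(cov):
    """RMSF of one residue from its 3x3 positional covariance: sqrt(trace(cov))."""
    return math.sqrt(cov[0][0] + cov[1][1] + cov[2][2])


def rmsf_profile(covariances):
    """RMSF for each residue, given a list of 3x3 covariance matrices."""
    return [rmsf_from_covariance(cov) for cov in covariances]


if __name__ == "__main__":
    isotropic = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # Anisotropic residue: all motion along one axis.
    hinge = [[4.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    print(rmsf_profile([isotropic, hinge]))
```

The trace discards direction, so two residues with the same RMSF can still differ in anisotropy; that directional information is what the full 3x3 matrix retains.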

[LG-186] Regime-Switching Langevin Monte Carlo Algorithms

Link: https://arxiv.org/abs/2509.00941
Authors: Xiaoyu Wang, Yingli Wang, Lingjiong Zhu
Categories: Computation (stat.CO); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
Comments: 50 pages, 8 figures

Abstract:Langevin Monte Carlo (LMC) algorithms are popular Markov Chain Monte Carlo (MCMC) methods to sample a target probability distribution, which arises in many applications in machine learning. Inspired by regime-switching stochastic differential equations in the probability literature, we propose and study regime-switching Langevin dynamics (RS-LD) and regime-switching kinetic Langevin dynamics (RS-KLD). Based on their discretizations, we introduce regime-switching Langevin Monte Carlo (RS-LMC) and regime-switching kinetic Langevin Monte Carlo (RS-KLMC) algorithms, which can also be viewed as LMC and KLMC algorithms with random stepsizes. We also propose frictional-regime-switching kinetic Langevin dynamics (FRS-KLD) and its associated algorithm frictional-regime-switching kinetic Langevin Monte Carlo (FRS-KLMC), which can also be viewed as the KLMC algorithm with random frictional coefficients. We provide their 2-Wasserstein non-asymptotic convergence guarantees to the target distribution, and analyze the iteration complexities. Numerical experiments using both synthetic and real data are provided to illustrate the efficiency of our proposed algorithms.
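The remark that RS-LMC "can be viewed as LMC with random stepsizes" can be illustrated with a one-dimensional toy sampler. This is a minimal sketch under stated assumptions, not the authors' algorithm: the regime is modelled as a simple Markov chain that moves to the next step size with a fixed probability at each iteration, and the function name and all parameters are hypothetical.

```python
import math
import random


def rs_lmc(grad_u, x0, step_sizes, switch_prob, n_steps, seed=0):
    """Unadjusted Langevin iterations whose step size follows a regime chain.

    grad_u      gradient of the potential U, where the target is exp(-U)
    step_sizes  one step size per regime
    switch_prob probability of moving to the next regime at each iteration
    """
    rng = random.Random(seed)
    x, regime, samples = x0, 0, []
    for _ in range(n_steps):
        if rng.random() < switch_prob:
            regime = (regime + 1) % len(step_sizes)
        h = step_sizes[regime]
        # Euler-Maruyama step of dX = -grad U(X) dt + sqrt(2) dW with step h.
        x = x - h * grad_u(x) + math.sqrt(2.0 * h) * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples


if __name__ == "__main__":
    # Standard Gaussian target: U(x) = x**2 / 2, so grad U(x) = x.
    draws = rs_lmc(lambda x: x, 0.0, (0.05, 0.2), 0.1, 20000)
    mean = sum(draws) / len(draws)
    var = sum((d - mean) ** 2 for d in draws) / len(draws)
    print(mean, var)
```

Because the scheme is unadjusted, the empirical variance sits slightly above 1 for this target; the discretization bias grows with the step sizes.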

[LG-187] Semi-Supervised Bayesian GANs with Log-Signatures for Uncertainty-Aware Credit Card Fraud Detection

Link: https://arxiv.org/abs/2509.00931
Authors: David Hirnschall
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: none

Abstract:We present a novel deep generative semi-supervised framework for credit card fraud detection, formulated as a time series classification task. As financial transaction data streams grow in scale and complexity, traditional methods often require large labeled datasets and struggle with time series of irregular sampling frequencies and varying sequence lengths. To address these challenges, we extend conditional Generative Adversarial Networks (GANs) for targeted data augmentation, integrate Bayesian inference to obtain predictive distributions and quantify uncertainty, and leverage log-signatures for robust feature encoding of transaction histories. We introduce a novel Wasserstein distance-based loss to align generated and real unlabeled samples while simultaneously maximizing classification accuracy on labeled data. Our approach is evaluated on the BankSim dataset, a widely used simulator for credit card transaction data, under varying proportions of labeled samples, demonstrating consistent improvements over benchmarks in both global statistical and domain-specific metrics. These findings highlight the effectiveness of GAN-driven semi-supervised learning with log-signatures for irregularly sampled time series and emphasize the importance of uncertainty-aware predictions.

[LG-188] Beyond Universal Approximation Theorems: Algorithmic Uniform Approximation by Neural Networks Trained with Noisy Data

Link: https://arxiv.org/abs/2509.00924
Authors: Anastasis Kratsios, Tin Sum Cheng, Daniel Roy
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Numerical Analysis (math.NA); Probability (math.PR)
Comments: none

Abstract:At its core, machine learning seeks to train models that reliably generalize beyond noisy observations; however, the theoretical vacuum in which state-of-the-art universal approximation theorems (UATs) operate isolates them from this goal, as they assume noiseless data and allow network parameters to be chosen freely, independent of algorithmic realism. This paper bridges that gap by introducing an architecture-specific randomized training algorithm that constructs a uniform approximator from $N$ noisy training samples on the $d$-dimensional cube $[0,1]^d$. Our trained neural networks attain the minimax-optimal quantity of trainable (non-random) parameters, subject to logarithmic factors which vanish under the idealized noiseless sampling assumed in classical UATs. Additionally, our trained models replicate key behaviours of real-world neural networks, absent in standard UAT constructions, by: (1) exhibiting sub-linear parametric complexity when fine-tuning on structurally related and favourable out-of-distribution tasks, (2) exactly interpolating the training data, and (3) maintaining reasonable Lipschitz regularity (after the initial clustering attention layer). These properties bring state-of-the-art UATs closer to practical machine learning, shifting the central open question from algorithmic implementability with noisy samples to whether stochastic gradient descent can achieve comparable guarantees.

[LG-189] Learning with Mandelbrot and Julia

Link: https://arxiv.org/abs/2509.00903
Authors: V.R. Tjahjono, S.F. Feng, E.R.M. Putri, H. Susanto
Categories: Chaotic Dynamics (nlin.CD); Machine Learning (cs.LG)

Abstract:Recent developments in applied mathematics increasingly employ machine learning (ML), particularly supervised learning, to accelerate numerical computations, such as solving nonlinear partial differential equations. In this work, we extend such techniques to objects of a more theoretical nature: the classification and structural analysis of fractal sets. Focusing on the Mandelbrot and Julia sets as principal examples, we demonstrate that supervised learning methods, including Classification and Regression Trees (CART), K-Nearest Neighbors (KNN), Multilayer Perceptrons (MLP), Recurrent Neural Networks using both Long Short-Term Memory (LSTM) and Bidirectional LSTM (BiLSTM), Random Forests (RF), and Convolutional Neural Networks (CNN), can classify fractal points with significantly higher predictive accuracy and substantially lower computational cost than traditional numerical approaches, such as the thresholding technique. These improvements are consistent across a range of models and evaluation metrics. Notably, KNN and RF exhibit the best overall performance, and comparative analyses between models (e.g., KNN vs. LSTM) suggest the presence of novel regularity properties in these mathematical structures. Collectively, our findings indicate that ML not only enhances classification efficiency but also offers promising avenues for generating new insights, intuitions, and conjectures within pure mathematics.
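
For readers unfamiliar with the baseline mentioned in the abstract, the following is a minimal sketch of a thresholding (escape-time) membership test for the Mandelbrot set. It assumes that "thresholding technique" refers to the classical escape-time iteration; the function names `escape_time` and `in_mandelbrot`, the iteration cap, and the escape radius are illustrative choices, not taken from the paper.

```python
def escape_time(c, max_iter=100, radius=2.0):
    """Iterate z -> z*z + c from z = 0 and return the first step at which
    |z| exceeds the radius, or max_iter if the orbit never escapes."""
    z = 0j
    for n in range(max_iter):
        if abs(z) > radius:
            return n
        z = z * z + c
    return max_iter


def in_mandelbrot(c, max_iter=100):
    """Label c as inside the set when its orbit stays bounded for max_iter steps.
    This is only an approximation: points near the boundary may need more steps."""
    return escape_time(c, max_iter) == max_iter


# Labels like these could serve as training targets for a classifier.
labels = {c: in_mandelbrot(c) for c in (0j, -1 + 0j, 1 + 0j, 2 + 2j)}
```

A supervised model trained on such labels would replace the per-point iteration with a single prediction, which is presumably where the reported cost savings come from.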

[LG-190] FBMS: An R Package for Flexible Bayesian Model Selection and Model Averaging

Link: https://arxiv.org/abs/2509.00753
Authors: Florian Frommlet, Jon Lachmann, Geir Storvik, Aliaksandr Hubin
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO); Machine Learning (stat.ML)
Comments: 69 pages, 5 tables, 5 figures

Abstract:The FBMS R package facilitates Bayesian model selection and model averaging in complex regression settings by employing a variety of Monte Carlo model exploration methods. At its core, the package implements an efficient Mode Jumping Markov Chain Monte Carlo (MJMCMC) algorithm, designed to improve mixing in multi-modal posterior landscapes within Bayesian generalized linear models. In addition, it provides a genetically modified MJMCMC (GMJMCMC) algorithm that introduces nonlinear feature generation, thereby enabling the estimation of Bayesian generalized nonlinear models (BGNLMs). Within this framework, the algorithm maintains and updates populations of transformed features, computes their posterior probabilities, and evaluates the posteriors of models constructed from them. We demonstrate the effective use of FBMS for both inferential and predictive modeling in Gaussian regression, focusing on different instances of the BGNLM class of models. Furthermore, through a broad set of applications, we illustrate how the methodology can be extended to increasingly complex modeling scenarios, extending to other response distributions and mixed effect models.

[LG-191] Self-Organising Memristive Networks as Physical Learning Systems

Link: https://arxiv.org/abs/2509.00747
Authors: Francesco Caravelli, Gianluca Milano, Adam Z. Stieg, Carlo Ricciardi, Simon Anthony Brown, Zdenka Kuncic
Categories: Disordered Systems and Neural Networks (cond-mat.dis-nn); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Soft Condensed Matter (cond-mat.soft); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: Perspective paper on SOMN; 20 pages double columns, 5 figures, 2 boxes

Abstract:Learning with physical systems is an emerging paradigm that seeks to harness the intrinsic nonlinear dynamics of physical substrates for learning. The impetus for a paradigm shift in how hardware is used for computational intelligence stems largely from the unsustainability of artificial neural network software implemented on conventional transistor-based hardware. This Perspective highlights one promising approach using physical networks comprised of resistive memory nanoscale components with dynamically reconfigurable, self-organising electrical circuitry. Experimental advances have revealed the non-trivial interactions within these Self-Organising Memristive Networks (SOMNs), offering insights into their collective nonlinear and adaptive dynamics, and how these properties can be harnessed for learning using different hardware implementations. Theoretical approaches, including mean-field theory, graph theory, and concepts from disordered systems, reveal deeper insights into the dynamics of SOMNs, especially during transitions between different conductance states where criticality and other dynamical phase transitions emerge in both experiments and models. Furthermore, parallels between adaptive dynamics in SOMNs and plasticity in biological neuronal networks suggest the potential for realising energy-efficient, brain-like continual learning. SOMNs thus offer a promising route toward embedded edge intelligence, unlocking real-time decision-making for autonomous systems, dynamic sensing, and personalised healthcare, by enabling embedded learning in resource-constrained environments. The overarching aim of this Perspective is to show how the convergence of nanotechnology, statistical physics, complex systems, and self-organising principles offers a unique opportunity to advance a new generation of physical intelligence technologies.

[LG-192] Convergence Analysis of the PAGE Stochastic Algorithm for Convex Finite-Sum Optimization

Link: https://arxiv.org/abs/2509.00737
Authors: Laurent Condat, Peter Richtárik
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:PAGE is a stochastic algorithm proposed by Li et al. [2021] to find a stationary point of an average of smooth nonconvex functions. We analyze PAGE in the convex setting and derive new convergence rates, leading to a better complexity than in the general nonconvex regime.

[LG-193] Resting-state fMRI Analysis using Quantum Time-series Transformer

Link: https://arxiv.org/abs/2509.00711
Authors: Junghoon Justin Park, Jungwoo Seo, Sangyoon Bae, Samuel Yen-Chi Chen, Huan-Hsin Tseng, Jiook Cha, Shinjae Yoo
Categories: Image and Video Processing (eess.IV); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)

Abstract:Resting-state functional magnetic resonance imaging (fMRI) has emerged as a pivotal tool for revealing intrinsic brain network connectivity and identifying neural biomarkers of neuropsychiatric conditions. However, classical self-attention transformer models–despite their formidable representational power–struggle with quadratic complexity, large parameter counts, and substantial data requirements. To address these barriers, we introduce a Quantum Time-series Transformer, a novel quantum-enhanced transformer architecture leveraging Linear Combination of Unitaries and Quantum Singular Value Transformation. Unlike classical transformers, Quantum Time-series Transformer operates with polylogarithmic computational complexity, markedly reducing training overhead and enabling robust performance even with fewer parameters and limited sample sizes. Empirical evaluation on the largest-scale fMRI datasets from the Adolescent Brain Cognitive Development Study and the UK Biobank demonstrates that Quantum Time-series Transformer achieves comparable or superior predictive performance compared to state-of-the-art classical transformer models, with especially pronounced gains in small-sample scenarios. Interpretability analyses using SHapley Additive exPlanations further reveal that Quantum Time-series Transformer reliably identifies clinically meaningful neural biomarkers of attention-deficit/hyperactivity disorder (ADHD). These findings underscore the promise of quantum-enhanced transformers in advancing computational neuroscience by more efficiently modeling complex spatio-temporal dynamics and improving clinical interpretability.

[LG-194] Quantum Circuits for Quantum Convolutions: A Quantum Convolutional Autoencoder

Link: https://arxiv.org/abs/2509.00637
Authors: Javier Orduz, Pablo Rivas, Erich Baker
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: The 23rd International Conference on Artificial Intelligence (ICAI 2021)

Abstract:Quantum machine learning deals with leveraging quantum theory with classic machine learning algorithms. Current research efforts study the advantages of using quantum mechanics or quantum information theory to accelerate learning time or convergence. Other efforts study data transformations in the quantum information space to evaluate robustness and performance boosts. This paper focuses on processing input data using randomized quantum circuits that act as quantum convolutions producing new representations that can be used in a convolutional network. Experimental results suggest that the performance is comparable to classic convolutional neural networks, and in some instances, using quantum convolutions can accelerate convergence.

[LG-195] Identifying Causal Direction via Dense Functional Classes

Link: https://arxiv.org/abs/2509.00538
Authors: Katerina Hlavackova-Schindler, Suzana Marsela
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)

Abstract:We address the problem of determining the causal direction between two univariate, continuous-valued variables, X and Y, under the assumption of no hidden confounders. In general, it is not possible to make definitive statements about causality without some assumptions on the underlying model. To distinguish between cause and effect, we propose a bivariate causal score based on the Minimum Description Length (MDL) principle, using functions that possess the density property on a compact real interval. We prove the identifiability of these causal scores under specific conditions. These conditions can be easily tested. Gaussianity of the noise in the causal model equations is not assumed, only that the noise is low. The well-studied class of cubic splines possesses the density property on a compact real interval. We propose LCUBE as an instantiation of the MDL-based causal score utilizing cubic regression splines. LCUBE is an identifiable method that is also interpretable, simple, and very fast. It has only one hyperparameter. Empirical evaluations compared to state-of-the-art methods demonstrate that LCUBE achieves superior precision in terms of AUDRC on the real-world Tuebingen cause-effect pairs dataset. It also shows superior average precision across 10 common benchmark datasets and achieves above-average precision on 13 datasets.

[LG-196] Partial Functional Dynamic Backdoor Diffusion-based Causal Model

Link: https://arxiv.org/abs/2509.00472
Authors: Xinwen Liu, Lei Qian, Song Xi Chen, Niansheng Tang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 10 pages, 2 figures

Abstract:We introduce a Partial Functional Dynamic Backdoor Diffusion-based Causal Model (PFD-BDCM), specifically designed for causal inference in the presence of unmeasured confounders with spatial heterogeneity and temporal dependency. The proposed PFD-BDCM framework addresses the restrictions of the existing approaches by uniquely integrating models for complex spatio-temporal dynamics with the analysis of multi-resolution variables. Specifically, the framework systematically mitigates confounding bias by integrating valid backdoor adjustment sets into a diffusion-based sampling mechanism. Moreover, it accounts for the intricate dynamics of unmeasured confounders through the deployment of region-specific structural equations and conditional autoregressive processes, and accommodates variables observed at heterogeneous resolutions via basis expansions for functional data. Our theoretical analysis establishes error bounds for counterfactual estimates of PFD-BDCM, formally linking reconstruction accuracy to counterfactual fidelity under monotonicity assumptions of structural equation and invertibility assumptions of encoding function. Empirical evaluations on synthetic datasets and real-world air pollution data demonstrate PFD-BDCM’s superiority over existing methods.

[LG-197] The Nondecreasing Rank

Link: https://arxiv.org/abs/2509.00265
Authors: Andrew McCormack
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC)
Comments: 29 pages, 6 figures

Abstract:In this article the notion of the nondecreasing (ND) rank of a matrix or tensor is introduced. A tensor has an ND rank of r if it can be represented as a sum of r outer products of vectors, with each vector satisfying a monotonicity constraint. It is shown that for certain poset orderings finding an ND factorization of rank r is equivalent to finding a nonnegative rank-r factorization of a transformed tensor. However, not every tensor that is monotonic has a finite ND rank. Theory is developed describing the properties of the ND rank, including typical, maximum, and border ND ranks. Highlighted also are the special settings where a matrix or tensor has an ND rank of one or two. As a means of finding low ND rank approximations to a data tensor we introduce a variant of the hierarchical alternating least squares algorithm. Low ND rank factorizations are found and interpreted for two datasets concerning the weight of pigs and a mental health survey during the COVID-19 pandemic.
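
As a rough illustration of the definition in the abstract, the sketch below builds a matrix as a sum of outer products of nondecreasing vectors, so its ND rank is at most the number of terms. It assumes the ordinary total order on indices; the poset orderings and any further constraints considered in the paper are not modelled, and the helper names `is_nondecreasing` and `outer_sum` are hypothetical.

```python
def is_nondecreasing(vec):
    """True when every entry is at least as large as the one before it."""
    return all(a <= b for a, b in zip(vec, vec[1:]))


def outer_sum(pairs):
    """Return the matrix sum_k u_k v_k^T for a list of (u_k, v_k) vector pairs."""
    rows, cols = len(pairs[0][0]), len(pairs[0][1])
    return [
        [sum(u[i] * v[j] for u, v in pairs) for j in range(cols)]
        for i in range(rows)
    ]


# Two terms, each built from nondecreasing vectors (illustrative values).
pairs = [([1, 2, 3], [0, 1, 1]), ([0, 0, 1], [2, 2, 5])]
assert all(is_nondecreasing(u) and is_nondecreasing(v) for u, v in pairs)
matrix = outer_sum(pairs)  # by construction, ND rank of this matrix is at most 2
```

Finding such a factorization for a given data tensor is the harder inverse problem, which the abstract addresses with a variant of hierarchical alternating least squares.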

[LG-198] Probit Monotone BART

Link: https://arxiv.org/abs/2509.00263
Authors: Jared D. Fisher
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
Comments: 6 pages, 1 figure

Abstract:Bayesian Additive Regression Trees (BART) of Chipman et al. (2010) has proven to be a powerful tool for nonparametric modeling and prediction. Monotone BART (Chipman et al., 2022) is a recent development that allows BART to be more precise in estimating monotonic functions. We further these developments by proposing probit monotone BART, which allows the monotone BART framework to estimate conditional mean functions when the outcome variable is binary.

[LG-199] Assessing One-Dimensional Cluster Stability by Extreme-Point Trimming

Link: https://arxiv.org/abs/2509.00258
Authors: Erwan Dereure, Emmanuel Akame Mfoumou, David Holcman
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Applications (stat.AP)
Comments: 33 pages

Abstract:We develop a probabilistic method for assessing the tail behavior and geometric stability of one-dimensional samples of n i.i.d. points by tracking how their span contracts when the most extreme points are trimmed. Central to our approach is the diameter-shrinkage ratio, which quantifies the relative reduction in data range as extreme points are successively removed. We derive analytical expressions, including finite-sample corrections, for the expected shrinkage under both the uniform and Gaussian hypotheses, and establish that these curves remain distinct even for a moderate number of removals. We construct an elementary decision rule that assigns a sample to whichever theoretical shrinkage profile it most closely follows. This test achieves higher classification accuracy than the classical likelihood-ratio test in small-sample or noisy regimes, while preserving asymptotic consistency for large n. We further integrate our criterion into a clustering pipeline (e.g. DBSCAN), demonstrating its ability to validate one-dimensional clusters without any density estimation or parameter tuning. This work thus provides both theoretical insight and practical tools for robust distributional inference and cluster stability analysis.
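
To make the idea of a shrinkage ratio concrete, here is a minimal sketch that trims the k smallest and k largest points and compares the remaining range with the full range. The symmetric trimming scheme and the function name `diameter_shrinkage` are assumptions made for illustration; the paper's exact definition and its finite-sample corrections may differ.

```python
def diameter_shrinkage(sample, k):
    """Ratio of the range after trimming k points from each end to the full range.
    Assumes symmetric trimming; values near 1 mean the extremes barely matter."""
    xs = sorted(sample)
    if k < 0 or 2 * k >= len(xs):
        raise ValueError("k must satisfy 0 <= 2k < len(sample)")
    full_range = xs[-1] - xs[0]
    if full_range == 0:
        raise ValueError("sample has zero range")
    trimmed_range = xs[-1 - k] - xs[k]
    return trimmed_range / full_range


# A single outlier (10) dominates the span, so trimming shrinks it sharply.
ratio = diameter_shrinkage([0, 1, 2, 3, 10], k=1)
```

Plotting this ratio against k for an observed sample, and comparing the curve with the expected curves under each hypothesis, is the kind of decision rule the abstract describes.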

[LG-200] Simulation-based inference of yeast centromeres

Link: https://arxiv.org/abs/2509.00200
Authors: Eloïse Touron, Pedro L. C. Rodrigues, Julyan Arbel, Nelle Varoquaux, Michael Arbel
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)

Abstract:Chromatin folding and the spatial arrangement of chromosomes in the cell play a crucial role in DNA replication and gene expression. Improper chromatin folding could lead to malfunctions and, over time, diseases. For eukaryotes, centromeres are essential for proper chromosome segregation and folding. Despite extensive research using de novo sequencing of genomes and annotation analysis, centromere locations in yeasts remain difficult to infer and are still unknown in most species. Recently, genome-wide chromosome conformation capture coupled with next-generation sequencing (Hi-C) has become one of the leading methods to investigate chromosome structures. Some recent studies have used Hi-C data to give a point estimate of each centromere, but those approaches rely heavily on a good pre-localization. Here, we present a novel approach that infers in a stochastic manner the locations of all centromeres in budding yeast based on both the experimental Hi-C map and simulated contact maps.

[LG-201] Friend or Foe

Link: https://arxiv.org/abs/2509.00123
Authors: Oleksandr Cherendichenko, Josephine Solowiej-Wedderburn, Laura M. Carroll, Eric Libby
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)

Abstract:A fundamental challenge in microbial ecology is determining whether bacteria compete or cooperate in different environmental conditions. With recent advances in genome-scale metabolic models, we are now capable of simulating interactions between thousands of pairs of bacteria in thousands of different environmental settings at a scale infeasible experimentally. These approaches can generate tremendous amounts of data that can be exploited by state-of-the-art machine learning algorithms to uncover the mechanisms driving interactions. Here, we present Friend or Foe, a compendium of 64 tabular environmental datasets, consisting of more than 26M shared environments for more than 10K pairs of bacteria sampled from two of the largest collections of metabolic models. The Friend or Foe datasets are curated for a wide range of machine learning tasks – supervised, unsupervised, and generative – to address specific questions underlying bacterial interactions. We benchmarked a selection of the most recent models for each of these tasks and our results indicate that machine learning can be successful in this application to microbial ecology. Going beyond, analyses of the Friend or Foe compendium can shed light on the predictability of bacterial interactions and highlight novel research directions into how bacteria infer and navigate their relationships.

[LG-202] Migration as a Probe: A Generalizable Benchmark Framework for Specialist vs. Generalist Machine-Learned Force Fields in Doped Materials

Link: https://arxiv.org/abs/2509.00090
Authors: Yi Cao, Paulette Clancy
Categories: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)

Abstract:Machine-learned force fields (MLFFs), particularly pre-trained foundation models, promise to bring ab initio-level accuracy to the length and time scales of molecular dynamics. Yet this shift raises a central question: is it better to build a specialist model from scratch or adapt a generalist foundation model for a specific system? The trade-offs in data efficiency, predictive accuracy, and risks of out-of-distribution (OOD) failure remain unclear. Here, we present a benchmarking framework that contrasts bespoke (from scratch) and fine-tuned foundation models in a test case of a technologically relevant 2D material, Cr-intercalated Sb2Te3, using the MACE architecture. Our framework employs migration pathways, evaluated through nudged elastic band (NEB) trajectories, as a diagnostic probe that tests both interpolation and extrapolation. We assess accuracy for equilibrium, kinetic (atomic migration), and mechanical (interlayer sliding) tasks. While all models capture equilibrium structures, predictions for non-equilibrium processes diverge. Task-specific fine-tuning substantially improves kinetic accuracy compared with both from-scratch and zero-shot models, but can degrade learned representations of long-range physics. Analysis of internal representations shows that training paradigms yield distinct, non-overlapping latent encodings of system physics. This work offers a practical guide for MLFF development, highlights migration-based probes as efficient diagnostics, and suggests pathways toward uncertainty-aware active learning strategies.

[LG-203] Generalization vs. Memorization in Autoregressive Deep Learning: Or Examining Temporal Decay of Gradient Coherence

Link: https://arxiv.org/abs/2509.00024
Authors: James Amarel, Nicolas Hengartner, Robyn Miller, Kamaljeet Singh, Siddharth Mansingh, Arvind Mohan, Benjamin Migliori, Emily Casleton, Alexei Skurikhin, Earl Lawrence, Gerd J. Kunde
Categories: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)

Abstract:Foundation models trained as autoregressive PDE surrogates hold significant promise for accelerating scientific discovery through their capacity to both extrapolate beyond training regimes and efficiently adapt to downstream tasks despite a paucity of examples for fine-tuning. However, reliably achieving genuine generalization, a necessary capability for producing novel scientific insights and robustly performing during deployment, remains a critical challenge. Establishing whether or not these requirements are met demands evaluation metrics capable of clearly distinguishing genuine model generalization from mere memorization. We apply the influence function formalism to systematically characterize how autoregressive PDE surrogates assimilate and propagate information derived from diverse physical scenarios, revealing fundamental limitations of standard models and training routines in addition to providing actionable insights regarding the design of improved surrogates.

[LG-204] Deep Learning for Operational High-Resolution Nowcasting in Switzerland Using Graph Neural Networks

Link: https://arxiv.org/abs/2509.00017
Authors: Ophélia Miralles, Daniele Nerini, Jonas Bhend, Baudouin Raoult, Christoph Spirig
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)

Abstract:Recent advances in neural weather forecasting have shown significant potential for accurate short-term forecasts. However, adapting such gridded approaches to smaller, topographically complex regions like Switzerland introduces computational challenges, especially when aiming for high spatial (1 km) and temporal (10 minutes) resolution. This paper presents a Graph Neural Network (GNN)-based approach for high-resolution nowcasting in Switzerland using the Anemoi framework and observational inputs. The proposed model combines surface observations with selected past and future numerical weather prediction (NWP) states, enabling an observation-guided interpolation strategy that enhances short-term accuracy while preserving physical consistency. We evaluate the method on multiple surface variables and compare it against operational high-resolution NWP (ICON) and nowcasting (INCA) baselines. The results show that the GNN model consistently outperforms traditional approaches in lead times up to 12 hours, especially for wind and precipitation. A comprehensive verification procedure, including spatial skill scores, event-based evaluation, and blind tests with professional forecasters, demonstrates the operational relevance of the approach for mountainous domains.

[LG-205] Conditional Generative Adversarial Networks Based Inertial Signal Translation

Link: https://arxiv.org/abs/2509.00016
Authors: Marcin Kolakowski
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: Originally presented at: 2025 Signal Processing Symposium (SPSympo) Warsaw, Poland; Associated data available at: M. Kolakowski, “Wrist and Tibia/Shoe Mounted IMU Measurement Results for Gait Analysis.” Zenodo, Dec. 27, 2023. doi: this https URL

Abstract:The paper presents an approach in which inertial signals measured with a wrist-worn sensor (e.g., a smartwatch) are translated into those that would be recorded using a shoe-mounted sensor, enabling the use of state-of-the-art gait analysis methods. In the study, the signals are translated using Conditional Generative Adversarial Networks (GANs). Two different GAN versions are used for experimental verification: traditional ones trained using binary cross-entropy loss and Wasserstein GANs (WGANs). For the generator, two architectures, a convolutional autoencoder and a convolutional U-Net, are tested. The experimental results show that the proposed approach allows for an accurate translation, enabling the use of wrist sensor inertial signals for efficient, everyday gait analysis.

[LG-206] MedFormer: a data-driven model for forecasting the Mediterranean Sea

Link: https://arxiv.org/abs/2509.00015
Authors: Italo Epicoco, Davide Donno, Gabriele Accarino, Simone Norberti, Alessandro Grandi, Michele Giurato, Ronan McAdam, Donatello Elia, Emanuela Clementi, Paola Nassisi, Enrico Scoccimarro, Giovanni Coppini, Silvio Gualdi, Giovanni Aloisio, Simona Masina, Giulio Boccaletti, Antonio Navarra
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Comments: 29 pages, 51 images, it will be submitted to Science

Abstract:Accurate ocean forecasting is essential for supporting a wide range of marine applications. Recent advances in artificial intelligence have highlighted the potential of data-driven models to outperform traditional numerical approaches, particularly in atmospheric weather forecasting. However, extending these methods to ocean systems remains challenging due to their inherently slower dynamics and complex boundary conditions. In this work, we present MedFormer, a fully data-driven deep learning model specifically designed for medium-range ocean forecasting in the Mediterranean Sea. MedFormer is based on a U-Net architecture augmented with 3D attention mechanisms and operates at a high horizontal resolution of 1/24°. The model is trained on 20 years of daily ocean reanalysis data and fine-tuned with high-resolution operational analyses. It generates 9-day forecasts using an autoregressive strategy. The model leverages both historical ocean states and atmospheric forcings, making it well-suited for operational use. We benchmark MedFormer against the state-of-the-art Mediterranean Forecasting System (MedFS), developed at Euro-Mediterranean Center on Climate Change (CMCC), using both analysis data and independent observations. The forecast skills, evaluated with the Root Mean Squared Difference and the Anomaly Correlation Coefficient, indicate that MedFormer consistently outperforms MedFS across key 3D ocean variables. These findings underscore the potential of data-driven approaches like MedFormer to complement, or even surpass, traditional numerical ocean forecasting systems in both accuracy and computational efficiency.

[LG-207] Exploring the Efficacy of Convolutional Neural Networks in Sleep Apnea Detection from Single Channel EEG

Link: https://arxiv.org/abs/2509.00012
Authors: Chun Hin Siu, Hossein Miri
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 5 pages, 6 figures, 1 table

Abstract:Sleep apnea, a prevalent sleep disorder, involves repeated episodes of breathing interruptions during sleep, leading to various health complications, including cognitive impairments, high blood pressure, heart disease, stroke, and even death. One of the main challenges in diagnosing and treating sleep apnea is identifying individuals at risk. The current gold standard for diagnosis, Polysomnography (PSG), is costly, labor intensive, and inconvenient, often resulting in poor quality sleep data. This paper presents a novel approach to the detection of sleep apnea using a Convolutional Neural Network (CNN) trained on single channel EEG data. The proposed CNN achieved an accuracy of 85.1% and a Matthews Correlation Coefficient (MCC) of 0.22, demonstrating a significant potential for home based applications by addressing the limitations of PSG in automated sleep apnea detection. Key contributions of this work also include the development of a comprehensive preprocessing pipeline with an Infinite Impulse Response (IIR) Butterworth filter, a dataset construction method providing broader temporal context, and the application of SMOTETomek to address class imbalance. This research underscores the feasibility of transitioning from traditional laboratory based diagnostics to more accessible, automated home based solutions, improving patient outcomes and broadening the accessibility of sleep disorder diagnostics.

[LG-208] CERA: A Framework for Improved Generalization of Machine Learning Models to Changed Climates

Link: https://arxiv.org/abs/2509.00010
Authors: Shuchang Liu, Paul A. O’Gorman
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)

Abstract:Robust generalization under climate change remains a major challenge for machine learning applications in climate science. Most existing approaches struggle to extrapolate beyond the climate they were trained on, leading to a strong dependence on training data from model simulations of warm climates. Use of climate-invariant inputs improves generalization but requires challenging manual feature engineering. Here, we present CERA (Climate-invariant Encoding through Representation Alignment), a machine learning framework consisting of an autoencoder with explicit latent-space alignment, followed by a predictor for downstream process estimation. We test CERA on the problem of parameterizing moist-physics processes. Without training on labeled data from a +4K climate, CERA leverages labeled control-climate data and unlabeled warmer-climate inputs to improve generalization to the warmer climate, outperforming both raw-input and physically informed baselines in predicting key moisture and energy tendencies. It captures not only the vertical and meridional structures of the moisture tendencies, but also shifts in the intensity distribution of precipitation including extremes. Ablation experiments show that latent alignment improves both accuracy and the robustness across random seeds used in training. While some reduced skill remains in the boundary layer, the framework offers a data-driven alternative to manual feature engineering of climate invariant inputs. Beyond parameterizations used in hybrid ML-physics systems, the approach holds promise for other climate applications such as statistical downscaling.

Information Retrieval

[IR-0] Lighting the Way for BRIGHT: Reproducible Baselines with Anserini, Pyserini, and RankLLM

Link: https://arxiv.org/abs/2509.02558
Authors: Yijun Ge, Sahel Sharifymoghaddam, Jimmy Lin
Categories: Information Retrieval (cs.IR)
Comments: 15 pages, 1 figure, 9 tables

Abstract:The BRIGHT benchmark is a dataset consisting of reasoning-intensive queries over diverse domains. We explore retrieval results on BRIGHT using a range of retrieval techniques, including sparse, dense, and fusion methods, and establish reproducible baselines. We then apply listwise reranking with large language models (LLMs) to further investigate the impact of reranking on reasoning-intensive queries. These baselines are integrated into popular retrieval and reranking toolkits Anserini, Pyserini, and RankLLM, with two-click reproducibility that makes them easy to build upon and convenient for further development. While attempting to reproduce the results reported in the original BRIGHT paper, we find that the provided BM25 scores differ notably from those that we obtain using Anserini and Pyserini. We discover that this difference is due to BRIGHT’s implementation of BM25, which applies BM25 on the query rather than using the standard bag-of-words approach, as in Anserini, to construct query vectors. This difference has become increasingly relevant due to the rise of longer queries, with BRIGHT’s lengthy reasoning-intensive queries being a prime example, and further accentuated by the increasing usage of retrieval-augmented generation, where LLM prompts can grow to be much longer than “traditional” search engine queries. Our observation signifies that it may be time to reconsider BM25 approaches going forward in order to better accommodate emerging applications. To facilitate this, we integrate query-side BM25 into both Anserini and Pyserini.

[IR-1] Upcycling Candidate Tokens of Large Language Models for Query Expansion CIKM2025

Link: https://arxiv.org/abs/2509.02377
Authors: Jinseok Kim, Sukmin Cho, Soyeong Jeong, Sangyeop Kim, Sungzoon Cho
Categories: Information Retrieval (cs.IR)
Comments: CIKM 2025

Abstract:Query Expansion (QE) improves retrieval performance by enriching queries with related terms. Recently, Large Language Models (LLMs) have been used for QE, but existing methods face a trade-off: generating diverse terms boosts performance but increases computational cost. To address this challenge, we propose Candidate Token Query Expansion (CTQE), which extracts diverse and relevant terms from a single LLM decoding pass by leveraging unselected candidate tokens. These tokens, though not part of the final output, are conditioned on the full query and capture useful information. By aggregating them, CTQE achieves both relevance and diversity without extra inference, reducing overhead and latency. Experiments show that CTQE delivers strong retrieval performance with significantly lower cost, outperforming or comparable to more expensive methods. Code is available at: this https URL
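
The idea of reusing unselected candidate tokens can be sketched as follows. This is a hedged toy example, not the CTQE implementation: the function name `expand_from_candidates`, the input format (one chosen token plus a candidate-to-probability map per decoding step), the probability-sum aggregation, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict

def expand_from_candidates(steps, top_n=5):
    """Aggregate candidate tokens that were NOT chosen at their decoding step.

    steps: list of (chosen_token, {candidate_token: probability}) pairs,
    one pair per decoding step of a single pass.
    """
    weights = defaultdict(float)
    for chosen, candidates in steps:
        for token, prob in candidates.items():
            if token == chosen:
                continue  # the chosen token is already in the model output
            weights[token] += prob  # assumption: sum probabilities across steps
    # Sort by descending weight, breaking ties alphabetically.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [token for token, _ in ranked[:top_n]]

# Hypothetical top-3 candidates for two decoding steps.
steps = [
    ("cat", {"cat": 0.5, "animal": 0.3, "car": 0.2}),
    ("runs", {"runs": 0.6, "sprint": 0.25, "animal": 0.15}),
]
expansion_terms = expand_from_candidates(steps, top_n=2)
```

Because the candidates come from a decoding pass that has already run, collecting them adds no extra inference calls, which is the cost argument made in the abstract.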

[IR-2] Leveraging Media Frames to Improve Normative Diversity in News Recommendations RECSYS2025

Link: https://arxiv.org/abs/2509.02266
Authors: Sourabh Dattawad, Agnese Daffara, Tanise Ceron
Categories: Information Retrieval (cs.IR)
Comments: Accepted at 13th International Workshop on News Recommendation and Analytics in Conjunction with ACM RecSys 2025

Abstract:Click-based news recommender systems recommend content that aligns with users’ existing history, limiting the diversity of articles they encounter. Recent advances in aspect-based diversification – adding features such as sentiments or news categories (e.g. world, politics) – have made progress toward diversifying recommendations in terms of perspectives. However, these approaches often overlook the role of news framing, which shapes how stories are told by emphasizing specific angles or interpretations. In this paper, we treat media frames as a controllable aspect within the recommendation pipeline. By selecting articles based on a diversity of frames, our approach emphasizes varied narrative angles and broadens the interpretive space recommended to users. In addition to introducing a frame-based diversification method, our work is the first to assess the impact of a news recommender system that integrates frame diversity using normative diversity metrics: representation, calibration, and activation. Our experiments based on media frame diversification show an improvement of up to 50% in exposure to previously unclicked frames. This is important because repeated exposure to the same frames can reinforce existing biases or narrow interpretations, whereas introducing novel frames broadens users’ understanding of issues and perspectives. The method also enhances diversification across categorical and sentiment levels, thereby demonstrating that framing acts as a strong control lever for enhancing normative diversity.
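
Selecting articles "based on a diversity of frames" can be illustrated with a simple greedy re-ranker. The sketch below is an assumption-laden example rather than the paper's method: the function `diversify_by_frame`, the linear repetition penalty, the frame labels, and the relevance scores are all hypothetical.

```python
def diversify_by_frame(candidates, k, penalty=0.5):
    """Greedily pick k articles, penalizing frames that were already selected.

    candidates: list of (article_id, relevance, frame) tuples.
    """
    selected = []
    frame_counts = {}
    remaining = list(candidates)
    while remaining and len(selected) < k:
        # Adjusted score = relevance minus a penalty per earlier use of the frame.
        best = max(
            remaining,
            key=lambda c: c[1] - penalty * frame_counts.get(c[2], 0),
        )
        selected.append(best)
        frame_counts[best[2]] = frame_counts.get(best[2], 0) + 1
        remaining.remove(best)
    return [article_id for article_id, _, _ in selected]

# Hypothetical candidates: two share the "economic" frame.
candidates = [
    ("a1", 0.90, "economic"),
    ("a2", 0.85, "economic"),
    ("a3", 0.60, "morality"),
    ("a4", 0.50, "security"),
]
reranked = diversify_by_frame(candidates, k=3)
```

Setting `penalty=0` falls back to plain relevance ranking, so the penalty acts as the control lever that trades relevance for frame diversity.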

[IR-3] Cloud-Device Collaborative Agents for Sequential Recommendation

Link: https://arxiv.org/abs/2509.01551
Authors: Jing Long, Sirui Huang, Huan Huo, Tong Chen, Hongzhi Yin, Guandong Xu
Categories: Information Retrieval (cs.IR)

Abstract:Recent advances in large language models (LLMs) have enabled agent-based recommendation systems with strong semantic understanding and flexible reasoning capabilities. While LLM-based agents deployed in the cloud offer powerful personalization, they often suffer from privacy concerns, limited access to real-time signals, and scalability bottlenecks. Conversely, on-device agents ensure privacy and responsiveness but lack the computational power for global modeling and large-scale retrieval. To bridge these complementary limitations, we propose CDA4Rec, a novel Cloud-Device collaborative framework for sequential Recommendation, powered by dual agents: a cloud-side LLM and a device-side small language model (SLM). CDA4Rec tackles the core challenge of cloud-device coordination by decomposing the recommendation task into modular sub-tasks including semantic modeling, candidate retrieval, structured user modeling, and final ranking, which are allocated to cloud or device based on computational demands and privacy sensitivity. A strategy planning mechanism leverages the cloud agent’s reasoning ability to generate personalized execution plans, enabling context-aware task assignment and partial parallel execution across agents. This design ensures real-time responsiveness, improved efficiency, and fine-grained personalization, even under diverse user states and behavioral sparsity. Extensive experiments across multiple real-world datasets demonstrate that CDA4Rec consistently outperforms competitive baselines in both accuracy and efficiency, validating its effectiveness in heterogeneous and resource-constrained environments.

[IR-4] AI4DiTraRe: Building the BFO-Compliant Chemotion Knowledge Graph

Link: https://arxiv.org/abs/2509.01536
Authors: Ebrahim Norouzi, Nicole Jung, Anna M. Jacyszyn, Jörg Waitelonis, Harald Sack
Categories: Information Retrieval (cs.IR)

Abstract:Chemistry is an example of a discipline where the advancements of technology have led to multi-level and often tangled and tricky processes ongoing in the lab. The repeatedly complex workflows are combined with information from chemical structures, which are essential to understand the scientific process. An important tool for many chemists is Chemotion, which consists of an electronic lab notebook and a repository. This paper introduces a semantic pipeline for constructing the BFO-compliant Chemotion Knowledge Graph, providing an integrated, ontology-driven representation of chemical research data. The Chemotion-KG has been developed to adhere to the FAIR (Findable, Accessible, Interoperable, Reusable) principles and to support AI-driven discovery and reasoning in chemistry. Experimental metadata were harvested from the Chemotion API in JSON-LD format, converted into RDF, and subsequently transformed into a Basic Formal Ontology-aligned graph through SPARQL CONSTRUCT queries. The source code and datasets are publicly available via GitHub. The Chemotion Knowledge Graph is hosted by FIZ Karlsruhe Information Service Engineering. Outcomes presented in this work were achieved within the Leibniz Science Campus “Digital Transformation of Research” (DiTraRe) and are part of an ongoing interdisciplinary collaboration.

[IR-5] MARS: Modality-Aligned Retrieval for Sequence Augmented CTR Prediction

Link: https://arxiv.org/abs/2509.01184
Authors: Yutian Xiao, Shukuan Wang, Binhao Wang, Zhao Zhang, Yanze Zhang, Shanqi Liu, Chao Feng, Xiang Li, Fuzhen Zhuang
Categories: Information Retrieval (cs.IR)

Abstract:Click-through rate (CTR) prediction serves as a cornerstone of recommender systems. Despite the strong performance of current CTR models based on user behavior modeling, they are still severely limited by interaction sparsity, especially in low-active user scenarios. To address this issue, data augmentation of user behavior is a promising research direction. However, existing data augmentation methods heavily rely on collaborative signals while overlooking the rich multimodal features of items, leading to insufficient modeling of low-active users. To alleviate this problem, we propose a novel framework MARS (Modality-Aligned Retrieval for Sequence Augmented CTR Prediction). MARS utilizes a Stein kernel-based approach to align text and image features into a unified and unbiased semantic space to construct multimodal user embeddings. Subsequently, each low-active user’s behavior sequence is augmented by retrieving, filtering, and concentrating the most similar behavior sequence of high-active users via multimodal user embeddings. Validated by extensive offline experiments and online A/B tests, our framework MARS consistently outperforms state-of-the-art baselines and achieves substantial growth on core business metrics within Kuaishou (this https URL). Consequently, MARS has been successfully deployed, serving the main traffic for hundreds of millions of users. To ensure reproducibility, we provide anonymous access to the implementation code (this https URL).

[IR-6] Beyond the Surface: A Solution-Aware Retrieval Model for Competition-level Code Generation

Link: https://arxiv.org/abs/2509.01129
Authors: Shiwen Zhang, Lingxiang Wang, Hainan Zhang, Ziwei Wang, Sijia Wen, Zhiming Zheng
Categories: Information Retrieval (cs.IR)

Abstract:In competitive programming tasks, problem statements are often embedded within elaborate narrative backgrounds, requiring deep understanding of the underlying solutions to successfully complete the tasks. Current code generation models primarily focus on token-level semantic modeling and are highly susceptible to distraction from irrelevant narrative statements. Inspired by RAG, retrieving reference code with similar solutions may help enhance model performance on difficult problems. However, existing retrieval models also emphasize surface-level semantic similarity, neglecting the deeper solution-level logical similarities that are critical in competitive programming. Therefore, designing ranking models capable of accurately identifying and retrieving problems and corresponding codes remains an urgent research problem in competitive code generation. In this paper, we propose SolveRank, a solution-aware ranking model empowered by synthetic data for competitive programming tasks. Specifically, we leverage the DeepSeek-R1 model to generate logically equivalent but differently phrased new problems, verified by GPT-4o for solution consistency. Then, we train SolveRank with these as positive samples and BM25/random-retrieved problems as negatives. During inference, SolveRank retrieves relevant problems and corresponding code from the corpus to assist a downstream code generator. Experiments on the xCodeEval dataset demonstrate that SolveRank outperforms SOTA ranking methods in precision and recall metrics, and boosts code generation performance for difficult problems.

[IR-7] Identifying Origins of Place Names via Retrieval Augmented Generation

Link: https://arxiv.org/abs/2509.01030
Authors: Alexis Horde Vo, Matt Duckham, Estrid He, Rafe Benli
Categories: Information Retrieval (cs.IR)

Abstract:Who is the “Batman” behind “Batman Street” in Melbourne? Understanding the historical, cultural, and societal narratives behind place names can reveal the rich context that has shaped a community. Although place names serve as essential spatial references in gazetteers, they often lack information about place name origins. Enriching these place names in today’s gazetteers is a time-consuming, manual process that requires extensive exploration of a vast archive of documents and text sources. Recent advances in natural language processing and language models (LMs) hold the promise of significant automation of identifying place name origins due to their powerful capability to exploit the semantics of the stored documents. This chapter presents a retrieval augmented generation pipeline designed to search for place name origins over a broad knowledge base, DBpedia. Given a spatial query, our approach first extracts sub-graphs that may contain knowledge relevant to the query; then ranks the extracted sub-graphs to generate the final answer to the query using fine-tuned LM-based models (i.e., ColBERTv2 and Llama2). Our results highlight the key challenges facing automated retrieval of place name origins, especially the tendency of language models to under-use the spatial information contained in texts as a discriminating factor. Our approach also frames the wider implications for geographic information retrieval using retrieval augmented generation.

[IR-8] Food Data in the Semantic Web: A Review of Nutritional Resources, Knowledge Graphs, and Emerging Applications

Link: https://arxiv.org/abs/2509.00986
Authors: Darko Sasanski, Riste Stojanov
Categories: Information Retrieval (cs.IR)

Abstract:This comprehensive review explores food data in the Semantic Web, highlighting key nutritional resources, knowledge graphs, and emerging applications in the food domain. It examines prominent food data resources such as USDA, FoodOn, FooDB, and Recipe1M+, emphasizing their contributions to nutritional data representation. Special focus is given to food entity linking and recognition techniques, which enable integration of heterogeneous food data sources into cohesive semantic resources. The review further discusses food knowledge graphs, their role in semantic interoperability, data enrichment, and knowledge extraction, and their applications in personalized nutrition, ingredient substitution, food-drug and food-disease interactions, and interdisciplinary research. By synthesizing current advancements and identifying challenges, this work provides insights to guide future developments in leveraging semantic technologies for the food domain.

[IR-9] HiPS: Hierarchical PDF Segmentation of Textbooks WSDM

Link: https://arxiv.org/abs/2509.00909
Authors: Sabine Wehnert, Harikrishnan Changaramkulath, Ernesto William De Luca
Categories: Information Retrieval (cs.IR)
Comments: 10 pages, 7 figures, submitted to the 19th ACM International Conference on Web Search and Data Mining (WSDM)

Abstract:The growing demand for effective tools to parse PDF-formatted texts, particularly structured documents such as textbooks, reveals the limitations of current methods developed mainly for research paper segmentation. This work addresses the challenge of hierarchical segmentation in complex structured documents, with a focus on legal textbooks that contain layered knowledge essential for interpreting and applying legal norms. We examine a Table of Contents (TOC)-based technique and approaches that rely on open-source structural parsing tools or Large Language Models (LLMs) operating without explicit TOC input. To enhance parsing accuracy, we incorporate preprocessing strategies such as OCR-based title detection, XML-derived features, and contextual text features. These strategies are evaluated based on their ability to identify section titles, allocate hierarchy levels, and determine section boundaries. Our findings show that combining LLMs with structure-aware preprocessing substantially reduces false positives and improves extraction quality. We also find that when the metadata quality of headings in the PDF is high, TOC-based techniques perform particularly well. All code and data are publicly available to support replication. We conclude with a comparative evaluation of the methods, outlining their respective strengths and limitations.

[IR-10] A Survey on Open Dataset Search in the LLM Era: Retrospectives and Perspectives

Link: https://arxiv.org/abs/2509.00728
Authors: Pengyue Li, Sheng Wang, Hua Dai, Zhiyu Chen, Zhifeng Bao, Brian D. Davison
Categories: Information Retrieval (cs.IR); Databases (cs.DB)

Abstract:High-quality datasets are typically required for accomplishing data-driven tasks, such as training medical diagnosis models, predicting real-time traffic conditions, or conducting experiments to validate research hypotheses. Consequently, open dataset search, which aims to ensure the efficient and accurate fulfillment of users’ dataset requirements, has emerged as a critical research challenge and has attracted widespread interest. Recent studies have made notable progress in enhancing the flexibility and intelligence of open dataset search, and large language models (LLMs) have demonstrated strong potential in addressing long-standing challenges in this area. Therefore, a systematic and comprehensive review of the open dataset search problem is essential, detailing the current state of research and exploring future directions. In this survey, we focus on recent advances in open dataset search beyond traditional approaches that rely on metadata and keywords. From the perspective of dataset modalities, we place particular emphasis on example-based dataset search, advanced similarity measurement techniques based on dataset content, and efficient search acceleration techniques. In addition, we emphasize the mutually beneficial relationship between LLMs and open dataset search. On the one hand, LLMs help address complex challenges in query understanding, semantic modeling, and interactive guidance within open dataset search. In turn, advances in dataset search can support LLMs by enabling more effective integration into retrieval-augmented generation (RAG) frameworks and data selection processes, thereby enhancing downstream task performance. Finally, we summarize open research problems and outline promising directions for future work. This work aims to offer a structured reference for researchers and practitioners in the field of open dataset search.

[IR-11] How to Make Museums More Interactive? Case Study of Artistic Chatbot

Link: https://arxiv.org/abs/2509.00572
Authors: Filip J. Kucia, Bartosz Grabek, Szymon D. Trochimiak, Anna Wróblewska
Categories: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments: 7 pages, 3 figures

Abstract:Conversational agents powered by Large Language Models (LLMs) are increasingly utilized in educational settings, in particular in individual closed digital environments, yet their potential adoption in physical learning environments like cultural heritage sites, museums, and art galleries remains relatively unexplored. In this study, we present Artistic Chatbot, a voice-to-voice RAG-powered chat system to support informal learning and enhance visitor engagement during a live art exhibition celebrating the 15th anniversary of the Faculty of Media Art at the Warsaw Academy of Fine Arts, Poland. The question answering (QA) chatbot responded to free-form spoken questions in Polish using the context retrieved from a curated, domain-specific knowledge base consisting of 226 documents provided by the organizers, including faculty information, art magazines, books, and journals. We describe the key aspects of the system architecture and user interaction design, as well as discuss the practical challenges associated with deploying chatbots at public cultural sites. Our findings, based on interaction analysis, demonstrate that chatbots such as Artistic Chatbot effectively maintain responses grounded in exhibition content (60% of responses directly relevant), even when faced with unpredictable queries outside the target domain, showing their potential for increasing interactivity in public cultural sites. GitHub project page: this https URL

[IR-12] CRouting: Reducing Expensive Distance Calls in Graph-Based Approximate Nearest Neighbor Search

Link: https://arxiv.org/abs/2509.00365
Authors: Zhenxin Li, Shuibing He, Jiahao Guo, Xuechen Zhang, Xian-He Sun, Gang Chen
Categories: Databases (cs.DB); Information Retrieval (cs.IR)

Abstract:Approximate nearest neighbor search (ANNS) is a crucial problem in information retrieval and AI applications. Recently, there has been a surge of interest in graph-based ANNS algorithms due to their superior efficiency and accuracy. However, the repeated computation of distances in high-dimensional spaces constitutes the primary time cost of graph-based methods. To accelerate the search, we propose a novel routing strategy named CRouting, which bypasses unnecessary distance computations by exploiting the angle distributions of high-dimensional vectors. CRouting is designed as a plugin to optimize existing graph-based search with minimal code modifications. Our experiments show that CRouting reduces the number of distance computations by up to 41.5% and boosts queries per second by up to 1.48× on two predominant graph indexes, HNSW and NSG. Code is publicly available at this https URL.
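
One way an angle assumption can avoid an exact distance call is through the law of cosines. The sketch below is only an illustration of that general idea and is not taken from the CRouting code: the function names, the default assumed angle of 90 degrees, and the pruning rule are all assumptions. Given the already-known distance from the query q to the current node c and the stored edge length from c to its neighbor n, it estimates the distance from q to n and skips the exact computation when the estimate exceeds the current search bound.

```python
import math

def estimated_distance(d_qc, d_cn, angle):
    """Law of cosines: length of side q-n given sides q-c, c-n and the angle at c."""
    squared = d_qc ** 2 + d_cn ** 2 - 2.0 * d_qc * d_cn * math.cos(angle)
    return math.sqrt(max(squared, 0.0))  # clamp tiny negatives from rounding

def should_skip(d_qc, d_cn, bound, assumed_angle=math.pi / 2):
    """Skip the exact distance call if the estimate already exceeds the bound.

    assumed_angle is a hypothetical stand-in for whatever value an
    empirical angle distribution would suggest.
    """
    return estimated_distance(d_qc, d_cn, assumed_angle) > bound

# Toy numbers: a 3-4-5 right triangle.
skip_far = should_skip(3.0, 4.0, bound=4.5)
skip_near = should_skip(3.0, 4.0, bound=6.0)
```

Because the estimate is approximate, a rule like this can wrongly skip a true neighbor, so any real use would need to weigh the saved distance calls against recall.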

Attachment Download

Click to download today's full paper list