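This post lists the latest papers retrieved from arXiv.org on 2025-09-09. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.
Note: the daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 each day.
Tip: if you would like to receive the daily paper data by email, leave your email address in the comments.
Contents
Overview (2025-09-09)
810 papers were updated today, including:
- Natural Language Processing: 92 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 241 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 194 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 223 papers (Machine Learning (cs.LG))
Natural Language Processing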
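[NLP-0] On the Same Wavelength? Evaluating Pragmatic Reasoning in Language Models across Broad Concepts EMNLP2025
[Quick read]: This paper evaluates, and looks for ways to improve, the pragmatic reasoning of language models (LMs) in conversational settings, that is, how well a model communicates given communicative goals and contextual norms. The core question is whether current models show human-like pragmatic competence when comprehending and producing language. The key to the approach is an evaluation framework built on the communication game Wavelength, which tests LMs on both language comprehension and language production, together with a Rational Speech Act (RSA) approach that incorporates Bayesian pragmatic reasoning into LM inference. Empirically, state-of-the-art models (but not smaller ones) reach similar-to-human accuracy on comprehension even without CoT prompting or RSA; on production, CoT can outperform direct prompting, and RSA provides significant improvements over both.
Link: https://arxiv.org/abs/2509.06952
Authors: Linlu Qiu,Cedegao E. Zhang,Joshua B. Tenenbaum,Yoon Kim,Roger P. Levy
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2025 (Main)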
Abstract:Language use is shaped by pragmatics – i.e., reasoning about communicative goals and norms in context. As language models (LMs) are increasingly used as conversational agents, it becomes ever more important to understand their pragmatic reasoning abilities. We propose an evaluation framework derived from Wavelength, a popular communication game where a speaker and a listener communicate about a broad range of concepts in a granular manner. We study a range of LMs on both language comprehension and language production using direct and Chain-of-Thought (CoT) prompting, and further explore a Rational Speech Act (RSA) approach to incorporating Bayesian pragmatic reasoning into LM inference. We find that state-of-the-art LMs, but not smaller ones, achieve strong performance on language comprehension, obtaining similar-to-human accuracy and exhibiting high correlations with human judgments even without CoT prompting or RSA. On language production, CoT can outperform direct prompting, and using RSA provides significant improvements over both approaches. Our study helps identify the strengths and limitations in LMs’ pragmatic reasoning abilities and demonstrates the potential for improving them with RSA, opening up future avenues for understanding conceptual representation, language understanding, and social reasoning in LMs and humans.
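The RSA idea mentioned in the abstract can be illustrated with the textbook scalar-implicature example. The sketch below is a minimal, generic RSA recursion over a hand-written toy lexicon; the lexicon, the uniform prior, and the `alpha` rationality parameter are illustrative assumptions, and this is not the paper's LM-integrated procedure.

```python
from typing import Dict

# Toy lexicon (an assumption for illustration): LEXICON[utterance][meaning]
# is 1.0 when the utterance is literally true of the meaning.
LEXICON: Dict[str, Dict[str, float]] = {
    "some": {"some_not_all": 1.0, "all": 1.0},
    "all": {"some_not_all": 0.0, "all": 1.0},
}
MEANINGS = ["some_not_all", "all"]


def normalize(weights: Dict[str, float]) -> Dict[str, float]:
    """Scale non-negative weights so they sum to 1 (all zeros stay zero)."""
    total = sum(weights.values())
    if total == 0:
        return {k: 0.0 for k in weights}
    return {k: v / total for k, v in weights.items()}


def literal_listener(utterance: str) -> Dict[str, float]:
    """L0(m | u): literal truth values under a uniform prior over meanings."""
    return normalize({m: LEXICON[utterance][m] for m in MEANINGS})


def pragmatic_speaker(meaning: str, alpha: float = 1.0) -> Dict[str, float]:
    """S1(u | m), proportional to L0(m | u) ** alpha."""
    return normalize({u: literal_listener(u)[meaning] ** alpha for u in LEXICON})


def pragmatic_listener(utterance: str, alpha: float = 1.0) -> Dict[str, float]:
    """L1(m | u), proportional to S1(u | m) under a uniform prior."""
    return normalize({m: pragmatic_speaker(m, alpha)[utterance] for m in MEANINGS})


if __name__ == "__main__":
    print(pragmatic_listener("some"))
```

With `alpha = 1`, the pragmatic listener who hears "some" assigns probability 0.75 to "some but not all", whereas the literal listener assigns 0.5. That shift is the kind of pragmatic inference RSA models.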
[NLP-1] Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models
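[Quick read]: This paper targets training instability, limited sampling flexibility, and weak complex-reasoning performance in diffusion language models (DLMs). The key is TraceRL, a trajectory-aware reinforcement learning framework that incorporates the preferred inference trajectory into post-training and uses a diffusion-based value model to improve training stability. It can also adapt block-specific models to larger blocks, and curriculum learning is used to obtain a long-CoT DLM. The method applies across architectures and improves reasoning on complex math and coding tasks; for example, TraDo-8B-Instruct achieves relative accuracy improvements of 6.1% over Qwen2.5-7B-Instruct and 51.3% over Llama3.1-8B-Instruct on mathematical reasoning benchmarks.
Link: https://arxiv.org/abs/2509.06949
Authors: Yinjie Wang,Ling Yang,Bowen Li,Ye Tian,Ke Shen,Mengdi Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Code and Models: this https URL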
Abstract:We propose TraceRL, a trajectory-aware reinforcement learning framework for diffusion language models (DLMs) that incorporates preferred inference trajectory into post-training, and is applicable across different architectures. Equipped with a diffusion-based value model that enhances training stability, we demonstrate improved reasoning performance on complex math and coding tasks. Besides, it can also be applied to adapt block-specific models to larger blocks, which improves sampling flexibility. Employing TraceRL, we derive a series of state-of-the-art diffusion language models, namely TraDo. Although smaller than 7B-scale AR models, TraDo-4B-Instruct still consistently outperforms them across complex math reasoning tasks. TraDo-8B-Instruct achieves relative accuracy improvements of 6.1% over Qwen2.5-7B-Instruct and 51.3% over Llama3.1-8B-Instruct on mathematical reasoning benchmarks. Through curriculum learning, we also derive the first long-CoT DLM, outperforming Qwen2.5-7B-Instruct on MATH500 with an 18.1% relative accuracy gain. To facilitate reproducible research and practical applications, we release a comprehensive open-source framework for building, training, and deploying diffusion LLMs across diverse architectures. The framework integrates accelerated KV-cache techniques and inference engines for both inference and reinforcement learning, and includes implementations of various supervised fine-tuning and RL methods for mathematics, coding, and general tasks. Code and Models: this https URL
[NLP-2] Beyond Two-Stage Training: Cooperative SFT and RL for LLM Reasoning
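[Quick read]: This paper addresses the inefficiency of reinforcement learning (RL) when training the reasoning abilities of large language models (LLMs). In the conventional two-stage paradigm (supervised fine-tuning, SFT, followed by RL), SFT and RL interact very little, which limits overall effectiveness. The key is a new method based on bilevel optimization: by conditioning the SFT objective on the optimal RL policy, SFT meta-learns how to guide RL's optimization process. During training, the lower level performs RL updates while also receiving SFT supervision, and the upper level explicitly maximizes the cooperative gain, i.e., the performance advantage of joint SFT-RL training over RL alone. On five reasoning benchmarks the method consistently outperforms baselines and strikes a better balance between effectiveness and efficiency.
Link: https://arxiv.org/abs/2509.06948
Authors: Liang Chen,Xueting Han,Li Shen,Jing Bai,Kam-Fai Wong
Affiliations: The Chinese University of Hong Kong; Microsoft Research; Shenzhen Campus of Sun Yat-sen University
Categories: Computation and Language (cs.CL)
Comments: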
Abstract:Reinforcement learning (RL) has proven effective in incentivizing the reasoning abilities of large language models (LLMs), but suffers from severe efficiency challenges due to its trial-and-error nature. While the common practice employs supervised fine-tuning (SFT) as a warm-up stage for RL, this decoupled two-stage approach limits interaction between SFT and RL, thereby constraining overall effectiveness. This study introduces a novel method for learning reasoning models that employs bilevel optimization to facilitate better cooperation between these training paradigms. By conditioning the SFT objective on the optimal RL policy, our approach enables SFT to meta-learn how to guide RL’s optimization process. During training, the lower level performs RL updates while simultaneously receiving SFT supervision, and the upper level explicitly maximizes the cooperative gain, i.e., the performance advantage of joint SFT-RL training over RL alone. Empirical evaluations on five reasoning benchmarks demonstrate that our method consistently outperforms baselines and achieves a better balance between effectiveness and efficiency.
[NLP-3] Interleaving Reasoning for Better Text-to-Image Generation
[Quick read]: This paper addresses the large gap in instruction following and detail preservation between current unified multimodal understanding-and-generation models and systems that tightly couple comprehension with generation, such as GPT-4o. The core question is whether interleaving reasoning can improve Text-to-Image (T2I) generation in semantic accuracy, visual quality, and fine-grained control. The key is the Interleaving Reasoning Generation (IRG) framework, which alternates between text-based thinking and image synthesis: the model first produces text-based thinking to guide an initial image, then reflects on the result to refine fine-grained details, visual quality, and aesthetics while preserving semantics. To train IRG, the authors propose Interleaving Reasoning Generation Learning (IRGL) and curate IRGL-300K, a dataset organized into six decomposed learning modes. A two-stage training strategy first builds robust thinking and reflection, then tunes the pipeline on full thinking-image trajectory data. The approach reports SoTA performance on several benchmarks, with improved semantic fidelity and visual quality.
Link: https://arxiv.org/abs/2509.06945
Authors: Wenxuan Huang,Shuang Chen,Zheyong Xie,Shaosheng Cao,Shixiang Tang,Yufan Shen,Qingyu Yin,Wenbo Hu,Xiaoman Wang,Yuntian Tang,Junbo Qiao,Yue Guo,Yao Hu,Zhenfei Yin,Philip Torr,Yu Cheng,Wanli Ouyang,Shaohui Lin
Affiliations: East China Normal University; The Chinese University of Hong Kong; Xiaohongshu Inc.; University of California, Los Angeles; Zhejiang University; University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Unified multimodal understanding and generation models have recently achieved significant improvements in image generation capability, yet a large gap remains in instruction following and detail preservation compared to systems that tightly couple comprehension with generation such as GPT-4o. Motivated by recent advances in interleaving reasoning, we explore whether such reasoning can further improve Text-to-Image (T2I) generation. We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis: the model first produces text-based thinking to guide an initial image, then reflects on the result to refine fine-grained details, visual quality, and aesthetics while preserving semantics. To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals: (1) strengthening the initial think-and-generate stage to establish core content and base quality, and (2) enabling high-quality textual reflection and faithful implementation of those refinements in a subsequent image. We curate IRGL-300K, a dataset organized into six decomposed learning modes that jointly cover learning text-based thinking and full thinking-image trajectories. Starting from a unified foundation model that natively emits interleaved text-image outputs, our two-stage training first builds robust thinking and reflection, then efficiently tunes the IRG pipeline on the full thinking-image trajectory data. Extensive experiments show SoTA performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN, alongside substantial improvements in visual quality and fine-grained fidelity. The code, model weights and datasets will be released at: this https URL.
[NLP-4] Outcome-based Exploration for LLM Reasoning
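[Quick read]: This paper addresses the loss of generation diversity caused by outcome-based reinforcement learning (RL) when it is used to improve the reasoning of large language models (LLMs). Outcome-based RL substantially raises answer accuracy, but it can reduce effective diversity even on the training set relative to the base model, and this collapse hurts real-world performance because diversity is critical for test-time scaling. The key is outcome-based exploration, which assigns exploration bonuses according to final answers. Two complementary algorithms are proposed: historical exploration, which encourages rarely observed answers via UCB-style bonuses, and batch exploration, which penalizes within-batch repetition to promote test-time diversity. On the theoretical side, the authors formalize the benefit through a new model of outcome-based bandits, giving a practical path to improving reasoning without sacrificing diversity.
Link: https://arxiv.org/abs/2509.06941
Authors: Yuda Song,Julia Kempe,Remi Munos
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 26 pages, 11 figures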
Abstract:Reinforcement learning (RL) has emerged as a powerful method for improving the reasoning abilities of large language models (LLMs). Outcome-based RL, which rewards policies solely for the correctness of the final answer, yields substantial accuracy gains but also induces a systematic loss in generation diversity. This collapse undermines real-world performance, where diversity is critical for test-time scaling. We analyze this phenomenon by viewing RL post-training as a sampling process and show that, strikingly, RL can reduce effective diversity even on the training set relative to the base model. Our study highlights two central findings: (i) a transfer of diversity degradation, where reduced diversity on solved problems propagates to unsolved ones, and (ii) the tractability of the outcome space, since reasoning tasks admit only a limited set of distinct answers. Motivated by these insights, we propose outcome-based exploration, which assigns exploration bonuses according to final outcomes. We introduce two complementary algorithms: historical exploration, which encourages rarely observed answers via UCB-style bonuses, and batch exploration, which penalizes within-batch repetition to promote test-time diversity. Experiments on standard competition math with Llama and Qwen models demonstrate that both methods improve accuracy while mitigating diversity collapse. On the theoretical side, we formalize the benefit of outcome-based exploration through a new model of outcome-based bandits. Together, these contributions chart a practical path toward RL methods that enhance reasoning without sacrificing the diversity essential for scalable deployment.
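The two exploration signals described in the abstract can be sketched as simple reward terms. The abstract does not give the exact formulas, so the count-based bonus `coef / sqrt(count + 1)` and the linear repetition penalty below are assumptions chosen only to illustrate the idea; the function names are hypothetical as well.

```python
import math
from collections import Counter
from typing import List


def historical_bonus(answer: str, history: Counter, coef: float = 1.0) -> float:
    """UCB-style bonus (assumed form): larger for answers seen rarely so far."""
    return coef / math.sqrt(history[answer] + 1)


def batch_penalties(answers: List[str], coef: float = 1.0) -> List[float]:
    """Penalty (assumed form) that grows with how often an answer repeats in the batch."""
    counts = Counter(answers)
    n = len(answers)
    return [-coef * (counts[a] - 1) / n for a in answers]


if __name__ == "__main__":
    history = Counter({"42": 3})             # "42" was sampled three times before
    print(historical_bonus("42", history))   # frequently seen answer, smaller bonus
    print(historical_bonus("17", history))   # unseen answer, full bonus
    print(batch_penalties(["42", "42", "17"]))
```

Both terms depend only on the final answer, not on the reasoning text, which is what makes the outcome space tractable to count over.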
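[NLP-5] An Ethically Grounded LLM-Based Approach to Insider Threat Synthesis and Detection
[Quick read]: This paper addresses the fact that insider-threat detection research depends on datasets that are static and of limited access, which hinders the development of adaptive detection models. The key is a novel, ethically grounded approach that uses the large language model (LLM) Claude Sonnet 3.7 to dynamically synthesize syslog messages, some of which contain insider-threat indicators, with a highly imbalanced distribution (1% insider threats) that reflects real-world data. Both Claude Sonnet 3.7 and GPT-4o then analyzed the syslogs for insider threats. Sonnet 3.7 consistently outperformed GPT-4o across nearly all metrics, particularly in reducing false alarms and improving detection accuracy, which suggests strong promise for LLMs in synthetic dataset generation and insider threat detection.
Link: https://arxiv.org/abs/2509.06920
Authors: Haywood Gelman,John D. Hastings,David Kenley
Affiliations: Dakota State University
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 6 pages, 5 figures, 5 tables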
Abstract:Insider threats are a growing organizational problem due to the complexity of identifying their technical and behavioral elements. A large research body is dedicated to the study of insider threats from technological, psychological, and educational perspectives. However, research in this domain has generally depended on datasets that are static and of limited access, which restricts the development of adaptive detection models. This study introduces a novel, ethically grounded approach that uses the large language model (LLM) Claude Sonnet 3.7 to dynamically synthesize syslog messages, some of which contain indicators of insider threat scenarios. The messages reflect real-world data distributions by being highly imbalanced (1% insider threats). The syslogs were analyzed for insider threats by both Claude Sonnet 3.7 and GPT-4o, with their performance evaluated through statistical metrics including precision, recall, MCC, and ROC AUC. Sonnet 3.7 consistently outperformed GPT-4o across nearly all metrics, particularly in reducing false alarms and improving detection accuracy. The results show strong promise for the use of LLMs in synthetic dataset generation and insider threat detection.
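Because only 1% of the messages are insider threats, plain accuracy says little, which is presumably why the abstract reports precision, recall, and MCC. The sketch below computes those three from confusion-matrix counts; the counts in the demo are made up for illustration and are not results from the paper, and returning 0 for MCC when its denominator is zero is a common convention adopted here as an assumption.

```python
import math
from typing import Tuple


def precision_recall_mcc(tp: int, tn: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and Matthews correlation coefficient from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumption): MCC is reported as 0 when the denominator is 0.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, recall, mcc


if __name__ == "__main__":
    # Hypothetical 1000-message log with 10 true insider-threat messages.
    # A detector that never raises an alarm is 99% accurate but has MCC 0.
    print(precision_recall_mcc(tp=0, tn=990, fp=0, fn=10))
    # A detector that finds 8 of the 10 threats with 5 false alarms.
    print(precision_recall_mcc(tp=8, tn=985, fp=5, fn=2))
```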
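[NLP-6] Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents
[Quick read]: This paper addresses the inefficiency of conventional research papers as static documents for passing on and reusing knowledge: readers must invest substantial effort to understand and adapt a paper's code, data, and methods to their own work, which creates barriers to dissemination and reuse. The key is the Paper2Agent framework, which uses multiple agents to analyze a paper and its associated codebase, constructs a Model Context Protocol (MCP) server, and then iteratively generates and runs tests to refine and robustify the resulting MCP. The paper MCP can then be connected to a chat agent that answers complex scientific queries in natural language while invoking the tools and workflows of the original paper, turning papers from passive artifacts into active systems.
Link: https://arxiv.org/abs/2509.06917
Authors: Jiacheng Miao,Joe R. Davis,Jonathan K. Pritchard,James Zou
Affiliations: Stanford University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: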
Abstract:We introduce Paper2Agent, an automated framework that converts research papers into AI agents. Paper2Agent transforms research output from passive artifacts into active systems that can accelerate downstream use, adoption, and discovery. Conventional research papers require readers to invest substantial effort to understand and adapt a paper’s code, data, and methods to their own work, creating barriers to dissemination and reuse. Paper2Agent addresses this challenge by automatically converting a paper into an AI agent that acts as a knowledgeable research assistant. It systematically analyzes the paper and the associated codebase using multiple agents to construct a Model Context Protocol (MCP) server, then iteratively generates and runs tests to refine and robustify the resulting MCP. These paper MCPs can then be flexibly connected to a chat agent (e.g. Claude Code) to carry out complex scientific queries through natural language while invoking tools and workflows from the original paper. We demonstrate Paper2Agent’s effectiveness in creating reliable and capable paper agents through in-depth case studies. Paper2Agent created an agent that leverages AlphaGenome to interpret genomic variants and agents based on ScanPy and TISSUE to carry out single-cell and spatial transcriptomics analyses. We validate that these paper agents can reproduce the original paper’s results and can correctly carry out novel user queries. By turning static papers into dynamic, interactive AI agents, Paper2Agent introduces a new paradigm for knowledge dissemination and a foundation for the collaborative ecosystem of AI co-scientists.
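[NLP-7] Proof-Carrying Numbers (PCN): A Protocol for Trustworthy Numeric Answers from LLMs via Claim Verification
[Quick read]: This paper addresses numeric hallucination in large language models (LLMs), where generated numbers deviate from the available data. Existing safeguards such as retrieval-augmented generation, citations, and uncertainty estimation improve transparency but cannot guarantee numeric fidelity. The key is Proof-Carrying Numbers (PCN), a presentation-layer protocol that enforces numeric fidelity through mechanical verification: numeric spans are emitted as claim-bound tokens tied to structured claims, and a verifier checks each token under a declared policy (for example exact equality, rounding, aliases, or tolerance with qualifiers). Only claim-checked numbers are marked as verified, and all others default to unverified. Because verification sits in the renderer rather than in the model, the design prevents spoofing and guarantees fail-closed behavior, establishing a trust contract for numbers that does not depend on the model's own reliability.
Link: https://arxiv.org/abs/2509.06902
Authors: Aivin V. Solatorio
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Databases (cs.DB); Machine Learning (cs.LG)
Comments: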
Abstract:Large Language Models (LLMs) as stochastic systems may generate numbers that deviate from available data, a failure known as numeric hallucination. Existing safeguards – retrieval-augmented generation, citations, and uncertainty estimation – improve transparency but cannot guarantee fidelity: fabricated or misquoted values may still be displayed as if correct. We propose Proof-Carrying Numbers (PCN), a presentation-layer protocol that enforces numeric fidelity through mechanical verification. Under PCN, numeric spans are emitted as claim-bound tokens tied to structured claims, and a verifier checks each token under a declared policy (e.g., exact equality, rounding, aliases, or tolerance with qualifiers). Crucially, PCN places verification in the renderer, not the model: only claim-checked numbers are marked as verified, and all others default to unverified. This separation prevents spoofing and guarantees fail-closed behavior. We formalize PCN and prove soundness, completeness under honest tokens, fail-closed behavior, and monotonicity under policy refinement. PCN is lightweight and model-agnostic, integrates seamlessly into existing applications, and can be extended with cryptographic commitments. By enforcing verification as a mandatory step before display, PCN establishes a simple contract for numerically sensitive settings: trust is earned only by proof, while the absence of a mark communicates uncertainty.
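The renderer-side check described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the PCN specification: the `Claim` record, the policy names `"exact"`, `"round"`, and `"tolerance"`, and the `[verified]`/`[unverified]` marks are all hypothetical choices, and the "aliases" policy mentioned in the abstract is omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    """A structured claim the displayed number is bound to (hypothetical shape)."""
    claim_id: str
    value: float


def verify(displayed: float, claim: Optional[Claim], policy: str = "exact",
           ndigits: int = 0, tol: float = 0.0) -> bool:
    """Return True only if the displayed number passes the declared policy."""
    if claim is None:
        return False  # fail-closed: no claim means unverified
    if policy == "exact":
        return displayed == claim.value
    if policy == "round":
        return displayed == round(claim.value, ndigits)
    if policy == "tolerance":
        return abs(displayed - claim.value) <= tol
    return False  # fail-closed: unknown policy means unverified


def render(displayed: float, claim: Optional[Claim], **policy) -> str:
    """Attach a mark in the renderer; the model's own text cannot set it."""
    mark = "[verified]" if verify(displayed, claim, **policy) else "[unverified]"
    return f"{displayed} {mark}"


if __name__ == "__main__":
    pi_claim = Claim("pi", 3.14159)
    print(render(3.14, pi_claim, policy="round", ndigits=2))
    print(render(3.15, pi_claim, policy="round", ndigits=2))
    print(render(2.5, None))
```

Every path that does not explicitly pass a check returns `False`, so a missing claim or an unrecognized policy leaves the number unverified instead of silently trusted.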
[NLP-8] mmBERT: A Modern Multilingual Encoder with Annealed Language Learning
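[Quick read]: This paper addresses the lack of recent research on encoder-only language models, especially multilingual ones, and their weak performance on low-resource languages. The key is mmBERT, an encoder-only model pretrained on 3T tokens of multilingual text in over 1800 languages. Its main novelties are an inverse mask ratio schedule and an inverse temperature sampling ratio, plus the decision to add over 1700 low-resource languages to the data mix only during the decay phase, which boosts performance dramatically and maximizes the gains from a relatively small amount of training data. mmBERT achieves classification performance similar to models such as OpenAI's o3 and Google's Gemini 2.5 Pro, and significantly outperforms the previous generation of models on classification and retrieval tasks for both high- and low-resource languages.
Link: https://arxiv.org/abs/2509.06888
Authors: Marc Marone,Orion Weller,William Fleshman,Eugene Yang,Dawn Lawrie,Benjamin Van Durme
Affiliations: Johns Hopkins University
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: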
Abstract:Encoder-only language models are frequently used for a variety of standard machine learning tasks, including classification and retrieval. However, there has been a lack of recent research for encoder models, especially with respect to multilingual models. We introduce mmBERT, an encoder-only language model pretrained on 3T tokens of multilingual text in over 1800 languages. To build mmBERT we introduce several novel elements, including an inverse mask ratio schedule and an inverse temperature sampling ratio. We add over 1700 low-resource languages to the data mix only during the decay phase, showing that it boosts performance dramatically and maximizes the gains from the relatively small amount of training data. Despite only including these low-resource languages in the short decay phase, we achieve similar classification performance to models like OpenAI’s o3 and Google’s Gemini 2.5 Pro. Overall, we show that mmBERT significantly outperforms the previous generation of models on classification and retrieval tasks – on both high and low-resource languages.
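The abstract names an "inverse temperature sampling ratio" without defining it. For background, a common way to balance languages in multilingual pretraining is to raise each language's data size to an exponent `tau` before normalizing. The sketch below shows only that general technique; the function name, the token counts, and the `tau` values are illustrative assumptions, and how mmBERT actually schedules the exponent is not stated in the abstract.

```python
from typing import Dict


def language_sampling_probs(token_counts: Dict[str, int], tau: float) -> Dict[str, float]:
    """Sampling probability per language, proportional to count ** tau.

    tau = 1 samples proportionally to data size; tau = 0 samples uniformly;
    values in between up-weight low-resource languages.
    """
    weights = {lang: count ** tau for lang, count in token_counts.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


if __name__ == "__main__":
    counts = {"en": 900, "sw": 100}  # made-up token counts
    for tau in (1.0, 0.5, 0.0):
        print(tau, language_sampling_probs(counts, tau))
```

Lowering `tau` moves probability mass from the high-resource language to the low-resource one: in this toy case the share of `sw` goes from 0.10 at `tau = 1` to 0.25 at `tau = 0.5` and 0.50 at `tau = 0`.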
[NLP-9] UNH at CheckThat! 2025: Fine-tuning Vs Prompting in Claim Extraction
[Quick read]: This paper addresses the extraction of check-worthy claims from social media text, a key preliminary step in the fact-checking pipeline. The key is a comparison of several prompting and in-context learning methods, including few-shot prompting, against fine-tuning with different large language model (LLM) families. The best METEOR score is achieved by fine-tuning a FLAN-T5 model. Notably, other methods can sometimes extract higher-quality claims even when their METEOR scores are lower, which suggests that the metric score should not be the only yardstick.
Link: https://arxiv.org/abs/2509.06883
Authors: Joe Wilder,Nikhil Kadapala,Benji Xu,Mohammed Alsaadi,Aiden Parsons,Mitchell Rogers,Palash Agarwal,Adam Hassick,Laura Dietz
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 16 pages, 3 tables, CLEF 2025 Working Notes, 9-12 September 2025, Madrid, Spain
Abstract:We participate in CheckThat! Task 2 English and explore various methods of prompting and in-context learning, including few-shot prompting and fine-tuning with different LLM families, with the goal of extracting check-worthy claims from social media passages. Our best METEOR score is achieved by fine-tuning a FLAN-T5 model. However, we observe that higher-quality claims can sometimes be extracted using other methods, even when their METEOR scores are lower.
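[NLP-10] The Majority is not always right: RL training for solution aggregation
[Quick read]: This paper addresses the limited gains that large language models (LLMs) obtain on challenging reasoning tasks when multiple independently generated solutions are combined naively. Most prior work aggregates candidates with simple majority voting or reward-model ranking, which may yield only limited benefits. The key is to treat aggregation as an explicit reasoning skill: an aggregator model is trained with reinforcement learning from verifiable rewards to review, reconcile, and synthesize the candidate solutions into a final, correct answer. A key ingredient is careful balancing of easy and hard training examples, so that the model learns both to recover minority-but-correct answers and to handle easy majority-correct cases. The resulting method, AggLM, outperforms strong rule-based and reward-model baselines and generalizes to solutions from different models.
Link: https://arxiv.org/abs/2509.06870
Authors: Wenting Zhao,Pranjal Aggarwal,Swarnadeep Saha,Asli Celikyilmaz,Jason Weston,Ilia Kulikov
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: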
Abstract:Scaling up test-time compute, by generating multiple independent solutions and selecting or aggregating among them, has become a central paradigm for improving large language models (LLMs) on challenging reasoning tasks. While most prior work relies on simple majority voting or reward model ranking to aggregate solutions, these approaches may only yield limited benefits. In this work, we propose to learn aggregation as an explicit reasoning skill: given a set of candidate solutions, we train an aggregator model to review, reconcile, and synthesize a final, correct answer using reinforcement learning from verifiable rewards. A key ingredient is careful balancing of easy and hard training examples, allowing the model to learn both to recover minority-but-correct answers as well as easy majority-correct answers. Empirically, we find our method, AggLM, outperforms both strong rule-based and reward-model baselines, across multiple benchmarks. Furthermore, it generalizes effectively to solutions from differing models, including stronger ones than contained in the training data, all while requiring substantially fewer tokens than majority voting with larger numbers of solutions.
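For reference, the majority-voting baseline that the abstract contrasts with can be written in a few lines. The sketch below also includes a check for whether a correct answer appears among the candidates at all, which shows the case a learned aggregator is meant to recover; the candidate answers are made up, and `contains_correct` is a hypothetical helper, not something taken from the paper.

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent final answer among the candidate solutions."""
    return Counter(answers).most_common(1)[0][0]


def contains_correct(answers: List[str], gold: str) -> bool:
    """True if at least one candidate reached the correct answer."""
    return gold in answers


if __name__ == "__main__":
    candidates = ["12", "12", "12", "15"]  # made-up final answers from four samples
    gold = "15"
    print(majority_vote(candidates) == gold)   # majority picks the wrong answer
    print(contains_correct(candidates, gold))  # yet a correct minority answer exists
```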
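[NLP-11] Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet
[Quick read]: This paper addresses the limited effectiveness of test-time scaling on knowledge-intensive tasks, where models given more inference-time computation still struggle to stay factually accurate and avoid hallucination. The key is a systematic evaluation of 12 reasoning models on two knowledge-intensive benchmarks. Increasing test-time computation does not consistently improve accuracy and in many cases leads to more hallucinations. Further analysis shows that reduced hallucinations often come from the model choosing to abstain after thinking more, not from improved factual recall, while for some models longer reasoning encourages attempts on previously unanswered questions, many of which turn into hallucinations. Despite these limitations, enabling thinking remains beneficial compared to non-thinking.
Link: https://arxiv.org/abs/2509.06861
Authors: James Xu Zhao,Bryan Hooi,See-Kiong Ng
Affiliations: National University of Singapore
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 20 pages, 4 figures, 6 tables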
Abstract:Test-time scaling increases inference-time computation by allowing models to generate long reasoning chains, and has shown strong performance across many domains. However, in this work, we show that this approach is not yet effective for knowledge-intensive tasks, where high factual accuracy and low hallucination rates are essential. We conduct a comprehensive evaluation of test-time scaling using 12 reasoning models on two knowledge-intensive benchmarks. Our results reveal that increasing test-time computation does not consistently improve accuracy and, in many cases, it even leads to more hallucinations. We then analyze how extended reasoning affects hallucination behavior. We find that reduced hallucinations often result from the model choosing to abstain after thinking more, rather than from improved factual recall. Conversely, for some models, longer reasoning encourages attempts on previously unanswered questions, many of which result in hallucinations. Case studies show that extended reasoning can induce confirmation bias, leading to overconfident hallucinations. Despite these limitations, we observe that compared to non-thinking, enabling thinking remains beneficial. Code and data are available at this https URL
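The distinction the abstract draws, fewer hallucinations through abstention instead of better recall, is easy to see once each response is labeled as correct, hallucinated, or abstained. The sketch below uses made-up labels for a short and a long reasoning budget; the label names and counts are assumptions for illustration, not data from the paper.

```python
from collections import Counter
from typing import Dict, List

LABELS = ("correct", "hallucinated", "abstained")


def outcome_rates(labels: List[str]) -> Dict[str, float]:
    """Fraction of responses in each outcome category."""
    counts = Counter(labels)
    n = len(labels)
    return {label: counts[label] / n for label in LABELS}


if __name__ == "__main__":
    # Hypothetical outcomes for the same 10 questions at two reasoning budgets.
    short_budget = ["correct"] * 5 + ["hallucinated"] * 4 + ["abstained"] * 1
    long_budget = ["correct"] * 5 + ["hallucinated"] * 2 + ["abstained"] * 3
    print(outcome_rates(short_budget))
    print(outcome_rates(long_budget))
```

In this toy case the hallucination rate halves while accuracy stays at 0.5, so the whole improvement comes from abstaining more often.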
[NLP-12] EPT Benchmark: Evaluation of Persian Trustworthiness in Large Language Models
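[Quick read]: This paper addresses the trustworthiness of large language models (LLMs) in a specific cultural context, in particular how to check that their behavior upholds the ethical, cultural, and social values of that context. The key is EPT (Evaluation of Persian Trustworthiness), a culturally informed benchmark that assesses LLM trustworthiness across six aspects: truthfulness, safety, fairness, robustness, privacy, and ethical alignment. Using a curated labeled dataset, the authors evaluate several leading models with both automated LLM-based and human assessments. The results reveal significant deficiencies in the safety dimension and offer evidence and directions for building responsible AI that is better aligned with Persian ethical-cultural values.
Link: https://arxiv.org/abs/2509.06838
Authors: Mohammad Reza Mirbagheri,Mohammad Mahdi Mirkamali,Zahra Motoshaker Arani,Ali Javeri,Amir Mahdi Sadeghzadeh,Rasool Jalili
Affiliations: Sharif University of Technology
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: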
Abstract:Large Language Models (LLMs), trained on extensive datasets using advanced deep learning architectures, have demonstrated remarkable performance across a wide range of language tasks, becoming a cornerstone of modern AI technologies. However, ensuring their trustworthiness remains a critical challenge, as reliability is essential not only for accurate performance but also for upholding ethical, cultural, and social values. Careful alignment of training data and culturally grounded evaluation criteria are vital for developing responsible AI systems. In this study, we introduce the EPT (Evaluation of Persian Trustworthiness) metric, a culturally informed benchmark specifically designed to assess the trustworthiness of LLMs across six key aspects: truthfulness, safety, fairness, robustness, privacy, and ethical alignment. We curated a labeled dataset and evaluated the performance of several leading models - including ChatGPT, Claude, DeepSeek, Gemini, Grok, LLaMA, Mistral, and Qwen - using both automated LLM-based and human assessments. Our results reveal significant deficiencies in the safety dimension, underscoring the urgent need for focused attention on this critical aspect of model behavior. Furthermore, our findings offer valuable insights into the alignment of these models with Persian ethical-cultural values and highlight critical gaps and opportunities for advancing trustworthy and culturally responsible AI. The dataset is publicly available at: this https URL.
[NLP-13] COMPACT: Common-token Optimized Model Pruning Across Channels and Tokens
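[Quick read]: This paper addresses the high memory footprint, latency, and serving cost of large language models (LLMs) in edge deployment, interactive applications, and large-scale inference. Existing pruning methods are limited: width pruning often breaks the standard transformer layout or requires custom inference code, while depth pruning removes entire layers and can cause abrupt accuracy drops. The key is COMPACT, which jointly (i) prunes rare vocabulary to shrink the embedding and unembedding layers and (ii) prunes feed-forward network (FFN) intermediate channels using common-token-weighted activations, so that importance estimates align with the post-pruning token distribution. The design combines merits of depth and width pruning: it keeps a standard transformer architecture for easy deployment, allows a flexible trade-off between vocabulary and FFN pruning, and is training-free. It substantially reduces parameters, GPU memory, and end-to-end latency, and reports state-of-the-art downstream task performance across the Qwen, LLaMA, and Gemma families (0.5B-70B).
Link: https://arxiv.org/abs/2509.06836
Authors: Eugene Kwek,Wenpeng Yin
Affiliations: Penn State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: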
Abstract:Making LLMs more efficient in memory, latency, and serving cost is crucial for edge deployment, interactive applications, and sustainable inference at scale. Pruning is a key technique toward this goal. However, prior pruning methods are limited: width pruning often breaks the standard transformer layout or requires custom inference code, while depth pruning removes entire layers and can cause abrupt accuracy drops. In this work, we propose COMPACT, which jointly (i) prunes rare vocabulary to shrink embedding/unembedding and (ii) prunes FFN intermediate channels using common-token-weighted activations, aligning importance with the post-pruning token distribution. COMPACT enjoys merits of both depth and width pruning, such as: deployment-friendliness (keeps a standard transformer architecture), scale-adaptivity (trade off vocab vs. FFN pruning), training-free operation with competitive pruning time, and strong memory savings alongside throughput gains. Experiments across Qwen, LLaMA, and Gemma families (0.5B-70B) show state-of-the-art downstream task performance at similar or higher pruning ratios, with substantial reductions in parameters, GPU memory, and end-to-end latency.
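The two pruning steps named in the abstract can be sketched on toy data. The abstract does not give the scoring formula, so the 0/1 token weighting and the sum of absolute activations below are assumptions chosen to show why weighting by common tokens changes which FFN channels look important; all function names and numbers are hypothetical.

```python
from typing import Dict, List, Set


def keep_common_tokens(token_freq: Dict[int, int], keep: int) -> Set[int]:
    """Ids of the `keep` most frequent tokens; the rest are pruned from the vocabulary."""
    ranked = sorted(token_freq, key=lambda t: token_freq[t], reverse=True)
    return set(ranked[:keep])


def channel_importance(activations: List[List[float]], token_ids: List[int],
                       kept_tokens: Set[int]) -> List[float]:
    """Per-channel sum of |activation| over positions whose token survives pruning."""
    scores = [0.0] * len(activations[0])
    for row, tok in zip(activations, token_ids):
        weight = 1.0 if tok in kept_tokens else 0.0  # assumed 0/1 weighting
        for j, a in enumerate(row):
            scores[j] += weight * abs(a)
    return scores


def top_channels(scores: List[float], keep: int) -> List[int]:
    """Indices of the `keep` highest-scoring channels, in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])


if __name__ == "__main__":
    freq = {0: 100, 1: 50, 2: 1}               # token 2 is rare
    token_ids = [0, 1, 2]                      # one calibration position per token
    acts = [[1.0, 0.0, 0.2], [0.5, 0.1, 0.2], [0.0, 9.0, 0.0]]
    kept = keep_common_tokens(freq, keep=2)
    print(top_channels(channel_importance(acts, token_ids, kept), keep=2))
    print(top_channels(channel_importance(acts, token_ids, set(freq)), keep=2))
```

In the toy data, channel 1 fires strongly only on the rare token. Once that token is pruned from the vocabulary the channel no longer ranks among the top two, whereas an unweighted score would have kept it.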
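[NLP-14] RAFFLES: Reasoning-based Attribution of Faults for LLM Systems
[Quick read]: This paper addresses the difficulty of identifying where, and why, long-horizon, multi-component large language model (LLM) agentic systems break down. Current evaluation methods (for example single-pass LLM-as-a-judge) tend to focus on individual metrics or end-to-end outcomes and are narrowly grounded in human preferences, so they cannot follow the complex logic passing through such systems. The key is RAFFLES, an evaluation architecture built as an iterative, multi-component pipeline: a central Judge systematically investigates faults, and a set of specialized Evaluators assess both the system's components and the quality of the Judge's own reasoning, building a history of hypotheses. This supports diagnosis of the "who" (agent) and "when" (step) of a system failure.
Link: https://arxiv.org/abs/2509.06822
Authors: Chenyang Zhu,Spencer Hong,Jingyu Wu,Kushal Chawla,Charlotte Tang,Youbing Yin,Nathan Wolfe,Erin Babinsky,Daben Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: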
Abstract:We have reached a critical roadblock in the development and enhancement of long-horizon, multi-component LLM agentic systems: it is incredibly tricky to identify where these systems break down and why. Evaluation capabilities that currently exist today (e.g., single pass LLM-as-a-judge) are limited in that they often focus on individual metrics or capabilities, end-to-end outcomes, and are narrowly grounded on the preferences of humans. We argue that to match the agentic capabilities, evaluation frameworks must also be able to reason, probe, iterate, and understand the complex logic passing through these systems over long horizons. In this paper, we present RAFFLES - an evaluation architecture that incorporates reasoning and iterative refinement. Specifically, RAFFLES operates as an iterative, multi-component pipeline, using a central Judge to systematically investigate faults and a set of specialized Evaluators to assess not only the system’s components but also the quality of the reasoning by the Judge itself, thereby building a history of hypotheses. We tested RAFFLES against several baselines on the WhoWhen dataset, a benchmark designed to diagnose the “who” (agent) and “when” (step) of a system’s failure. RAFFLES outperforms these baselines, achieving an agent-step fault pair accuracy of over 43% on the Algorithmically-Generated dataset (a substantial increase from the previously published best of 16.6%) and over 20% on the Hand-Crafted dataset (surpassing the previously published best of 8.8%). These results demonstrate a key step towards introducing automated fault detection for autonomous systems over labor-intensive manual human review.
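The agent-step fault pair accuracy reported in the abstract counts a prediction as correct only when both the blamed agent and the blamed step match the annotation. A minimal sketch of that metric follows; the agent names and step indices in the demo are made up, and the function name is hypothetical.

```python
from typing import List, Tuple

FaultPair = Tuple[str, int]  # (agent name, step index)


def agent_step_accuracy(predictions: List[FaultPair], gold: List[FaultPair]) -> float:
    """Fraction of failure logs where both the agent and the step are identified."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(1 for p, g in zip(predictions, gold) if p == g)
    return hits / len(gold)


if __name__ == "__main__":
    gold = [("planner", 3), ("coder", 4)]
    predictions = [("planner", 3), ("coder", 5)]  # right agent, wrong step on the second log
    print(agent_step_accuracy(predictions, gold))
```

Because a right agent with a wrong step earns no credit, this metric is stricter than agent-level or step-level accuracy taken separately.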
[NLP-15] A Comparative Benchmark of Large Language Models for Labelling Wind Turbine Maintenance Logs
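[Quick read]: This paper addresses the difficulty of automated analysis in wind turbine Operation and Maintenance (OM), where maintenance logs are unstructured free text. The core challenge is how to use large language models (LLMs) to classify these complex industrial records reliably, in support of reducing the Levelised Cost of Energy (LCOE). The key is a novel and reproducible benchmarking framework that systematically evaluates a range of state-of-the-art proprietary and open-source LLMs on this task and characterizes their trade-offs in reliability, operational efficiency, and model calibration. Classification performance depends strongly on the task's semantic ambiguity, and no model achieves perfect accuracy, so the authors recommend a Human-in-the-Loop setup in which LLMs assist human experts to speed up and standardize labelling, improving OM data quality and downstream reliability analysis.
Link: https://arxiv.org/abs/2509.06813
Authors: Max Malyi,Jonathan Shek,Alasdair McDonald,Andre Biscaya
Affiliations: The University of Edinburgh; Nadara
Categories: Computation and Language (cs.CL)
Comments: Associated GitHub repository: this https URL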
Abstract:Effective Operation and Maintenance (OM) is critical to reducing the Levelised Cost of Energy (LCOE) from wind power, yet the unstructured, free-text nature of turbine maintenance logs presents a significant barrier to automated analysis. Our paper addresses this by presenting a novel and reproducible framework for benchmarking Large Language Models (LLMs) on the task of classifying these complex industrial records. To promote transparency and encourage further research, this framework has been made publicly available as an open-source tool. We systematically evaluate a diverse suite of state-of-the-art proprietary and open-source LLMs, providing a foundational assessment of their trade-offs in reliability, operational efficiency, and model calibration. Our results quantify a clear performance hierarchy, identifying top models that exhibit high alignment with a benchmark standard and trustworthy, well-calibrated confidence scores. We also demonstrate that classification performance is highly dependent on the task’s semantic ambiguity, with all models showing higher consensus on objective component identification than on interpretive maintenance actions. Given that no model achieves perfect accuracy and that calibration varies dramatically, we conclude that the most effective and responsible near-term application is a Human-in-the-Loop system, where LLMs act as a powerful assistant to accelerate and standardise data labelling for human experts, thereby enhancing OM data quality and downstream reliability analysis.
[NLP-16] Saturation-Driven Dataset Generation for LLM Mathematical Reasoning in the TPTP Ecosystem
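[Quick read]: This paper addresses the scarcity of high-quality, logically sound data for improving the mathematical reasoning of Large Language Models (LLMs). The core challenge is that existing data is often unreliable, either because it is generated by error-prone LLMs or because it depends on complex proof-assistant syntax such as Lean and Isabelle. The key to the solution is turning decades of Automated Theorem Proving (ATP) research into a scalable data engine: E-prover saturates the large TPTP axiom library to derive a massive, guaranteed-valid corpus of theorems, which is then filtered for "interesting" theorems and turned into three difficulty-controlled tasks (entailment verification, premise selection, and proof reconstruction), yielding purely symbolic training data. Because no LLM takes part in data generation, factual errors are eliminated by construction, and the framework provides both a tool for diagnosing frontier models' weaknesses in deep structural reasoning and a scalable source of data to address them.
Link: https://arxiv.org/abs/2509.06809
Authors: Valentin Quesnel,Damien Sileo
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: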
Abstract:The scarcity of high-quality, logically sound data is a critical bottleneck for advancing the mathematical reasoning of Large Language Models (LLMs). Our work confronts this challenge by turning decades of automated theorem proving research into a scalable data engine. Rather than relying on error-prone LLMs or complex proof-assistant syntax like Lean and Isabelle, our framework leverages E-prover’s saturation capabilities on the vast TPTP axiom library to derive a massive, guaranteed-valid corpus of theorems. Our pipeline is principled and simple: saturate axioms, filter for “interesting” theorems, and generate tasks. With no LLMs in the loop, we eliminate factual errors by construction. This purely symbolic data is then transformed into three difficulty-controlled challenges: entailment verification, premise selection, and proof reconstruction. Our zero-shot experiments on frontier models reveal a clear weakness: performance collapses on tasks requiring deep, structural reasoning. Our framework provides both the diagnostic tool to measure this gap and a scalable source of symbolic training data to address it. We make the code and data publicly available. this https URL this https URL
[NLP-17] MoGU V2: Toward a Higher Pareto Frontier Between Model Usability and Security
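[Quick read]: This paper addresses the trade-off between security and usability in Large Language Models (LLMs): existing security-enhancing methods often produce conservative, rejection-oriented responses that hurt practical usability. The core solution is the MoGU framework, in which an intra-layer router senses hidden states and dynamically allocates weights to balance the contributions of security-optimized and usability-optimized variants. The improved MoGU_v2 framework embeds routers only in layers that encode highly classifiable security features and activates backbone modules during router optimization to enable bidirectional adaptation. This improves adaptability and stability, avoids the parameter redundancy and performance bottlenecks of the original MoGU, and restores security without sacrificing task performance.
Link: https://arxiv.org/abs/2509.06807
Authors: Yanrui Du,Fenglei Fan,Sendong Zhao,Jiawei Cao,Ting Liu,Bing Qin
Affiliations: SCIR Lab, Harbin Institute of Technology; City University of Hong Kong
Categories: Computation and Language (cs.CL)
Comments: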
Abstract:As Large Language Models (LLMs) increasingly permeate human life, their security has emerged as a critical concern, particularly their ability to maintain harmless responses to malicious instructions. Although extensive methods have improved LLMs’ security, they often lead to conservative, rejection-oriented responses that compromise practical usability. This presents a key challenge: how to advance the Pareto frontier between LLMs’ usability and security, rather than necessitate a trade-off between them. To address this, we propose the MoGU framework, in which the intra-layer router dynamically allocates weights by sensing hidden states, thereby balancing the contributions of security-optimized and usability-optimized variants. Despite its initial potential, the MoGU framework faces limitations such as parameter redundancy and performance bottlenecks. To overcome these, we further propose an improved MoGU_v2 framework that establishes a tighter coupling between the routers and hidden states. In MoGU_v2, routers are embedded only in layers encoding highly classifiable security features, and backbone modules are activated during router optimization to enable bidirectional adaptation. MoGU_V2 exhibits strong adaptability and stable improvements across various series of LLMs, including mainstream LLMs serving as brains in various applications, on-device LLMs optimized for resource-constrained scenarios, and reasoning LLMs tailored for user interpretability. Meanwhile, even facing risks introduced by Instruction Fine-tuning, MoGU_v2 can easily restore security without compromising the task performance gains via a simple data-mix strategy. These comprehensive improvements highlight MoGU_V2 as a robust and versatile solution for mitigating security risks in real-world applications.
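The routing idea in the abstract can be pictured with a small sketch. It assumes a single linear router followed by a sigmoid that yields one gate per token, and it blends the outputs of two variants with that gate. The names `routed_output`, `router_weights`, `out_secure`, and `out_usable` are hypothetical, and the abstract does not specify the actual router architecture used in MoGU or MoGU_v2.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def routed_output(hidden, router_weights, out_secure, out_usable):
    """Blend two variant outputs with a per-token gate computed from hidden states.

    hidden:         (seq_len, d_model) hidden states seen by the router
    router_weights: (d_model,) weights of an assumed linear router
    out_secure:     (seq_len, d_model) output of the security-optimized variant
    out_usable:     (seq_len, d_model) output of the usability-optimized variant
    """
    gate = sigmoid(hidden @ router_weights)[:, None]  # (seq_len, 1), values in (0, 1)
    blended = gate * out_secure + (1.0 - gate) * out_usable
    return blended, gate


rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))
out_secure = rng.normal(size=(4, 8))
out_usable = rng.normal(size=(4, 8))

# With all-zero router weights every gate is 0.5, so the result is a plain average.
blended, gate = routed_output(hidden, np.zeros(8), out_secure, out_usable)
```

A trained router would push the gate toward the security-optimized variant for inputs that look malicious and toward the usability-optimized variant otherwise; the zero-weight case here only shows the mechanics.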
[NLP-18] MachineLearningLM: Continued Pretraining Language Models on Millions of Synthetic Tabular Prediction Tasks Scales In-Context ML
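[Quick read]: This paper addresses the difficulty Large Language Models (LLMs) have in learning from many in-context examples on standard Machine Learning (ML) tasks purely through In-Context Learning (ICL), that is, their lack of efficient modelling of many-shot demonstrations. The key to the solution is MachineLearningLM, a portable continued-pretraining framework with three main elements: 1) ML tasks are synthesized from millions of Structural Causal Models (SCMs), and a random-forest teacher is distilled into the LLM to strengthen robustness in numerical modelling; 2) a token-efficient serialized prompt supports up to 1,024 in-context examples and delivers up to 50x amortized throughput via batch inference; 3) without any task-specific training, the model reaches random-forest-level accuracy while preserving the original LLM's general knowledge and reasoning (for example, 75.4% on MMLU).
Link: https://arxiv.org/abs/2509.06806
Authors: Haoyu Dong,Pengkun Zhang,Mingzhe Lu,Yanzhen Shen,Guolin Ke
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: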
Abstract:Large language models (LLMs) possess broad world knowledge and strong general-purpose reasoning ability, yet they struggle to learn from many in-context examples on standard machine learning (ML) tasks, that is, to leverage many-shot demonstrations purely via in-context learning (ICL) without gradient descent. We introduce MachineLearningLM, a portable continued-pretraining framework that equips a general-purpose LLM with robust in-context ML capability while preserving its general knowledge and reasoning for broader chat workflows. Our pretraining procedure synthesizes ML tasks from millions of structural causal models (SCMs), spanning shot counts up to 1,024. We begin with a random-forest teacher, distilling tree-based decision strategies into the LLM to strengthen robustness in numerical modeling. All tasks are serialized with a token-efficient prompt, enabling 3x to 6x more examples per context window and delivering up to 50x amortized throughput via batch inference. Despite a modest setup (Qwen-2.5-7B-Instruct with LoRA rank 8), MachineLearningLM outperforms strong LLM baselines (e.g., GPT-5-mini) by an average of about 15% on out-of-distribution tabular classification across finance, physics, biology, and healthcare domains. It exhibits a striking many-shot scaling law: accuracy increases monotonically as in-context demonstrations grow from 8 to 1,024. Without any task-specific training, it attains random-forest-level accuracy across hundreds of shots. General chat capabilities, including knowledge and reasoning, are preserved: it achieves 75.4% on MMLU.
[NLP-19] Anchoring Refusal Direction: Mitigating Safety Risks in Tuning via Projection Constraint
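[Quick read]: This paper addresses the safety degradation that Instruction Fine-Tuning (IFT) causes in Large Language Models (LLMs), in particular the weakening of their ability to refuse malicious instructions. The study shows that one cause of this risk is drift of the "refusal direction" (r-direction) in the hidden states during training. The key to the solution is ProCon, which adds a projection-constrained loss term that regularizes the projection magnitude of each training sample's hidden state onto the r-direction in order to stabilize it. To overcome the remaining performance barrier, the authors add a warm-up strategy that applies strong constraints in the early stage and broaden the data distribution to strengthen the constraint signal. Experiments show that ProCon significantly mitigates the safety risks introduced by IFT while preserving task performance gains, delivers better overall performance than strong baselines, and helps stabilize the r-direction, offering a path for interpretability-driven LLM safety research.
Link: https://arxiv.org/abs/2509.06795
Authors: Yanrui Du,Fenglei Fan,Sendong Zhao,Jiawei Cao,Qika Lin,Kai He,Ting Liu,Bing Qin,Mengling Feng
Affiliations: Harbin Institute of Technology; City University of Hong Kong; National University of Singapore
Categories: Computation and Language (cs.CL)
Comments: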
Abstract:Instruction Fine-Tuning (IFT) has been widely adopted as an effective post-training strategy to enhance various abilities of Large Language Models (LLMs). However, prior studies have shown that IFT can significantly compromise LLMs’ safety, particularly their ability to refuse malicious instructions, raising significant concerns. Recent research into the internal mechanisms of LLMs has identified the refusal direction (r-direction) in the hidden states, which plays a pivotal role in governing refusal behavior. Building on this insight, our study reveals that the r-direction tends to drift during training, which we identify as one of the causes of the associated safety risks. To mitigate such drift, our proposed ProCon method introduces a projection-constrained loss term that regularizes the projection magnitude of each training sample’s hidden state onto the r-direction. Our initial analysis shows that applying an appropriate constraint can effectively mitigate the refusal direction drift and associated safety risks, but remains limited by overall performance barriers. To overcome this barrier, informed by our observation of early-stage sharp drift and a data-driven perspective, we introduce a warm-up strategy that emphasizes early-stage strong constraints and broaden the data distribution to strengthen constraint signals, leading to an enhanced ProCon method. Experimental results under various datasets, scenarios, and LLMs demonstrate that our method can significantly mitigate safety risks posed by IFT while preserving task performance gains. Even compared with strong baselines, our method consistently delivers superior overall performance. Crucially, our analysis indicates that ProCon can contribute to stabilizing the r-direction during training, while such an interpretability-driven exploration of LLMs’ internal mechanisms lays a solid foundation for future safety research.
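As a rough illustration of a projection-constrained penalty, the sketch below measures how far each sample's projection onto a given direction has moved relative to a reference. Using the pre-fine-tuning model's hidden states as that reference, the squared-error form, and the weighting `lam` are all assumptions made here; the abstract only says that the projection magnitude onto the r-direction is regularized. The names `projection_penalty` and `total_loss` are hypothetical.

```python
import numpy as np


def projection_penalty(h_current, h_reference, r_direction):
    """Mean squared change in the projection of hidden states onto r_direction.

    h_current:   (batch, d) hidden states of the model being fine-tuned
    h_reference: (batch, d) hidden states of a reference model (assumed target)
    r_direction: (d,) refusal direction; normalized inside the function
    """
    r_unit = r_direction / np.linalg.norm(r_direction)
    proj_current = h_current @ r_unit
    proj_reference = h_reference @ r_unit
    return float(np.mean((proj_current - proj_reference) ** 2))


def total_loss(task_loss, penalty, lam=0.1):
    # lam is a hypothetical weight balancing the task loss and the constraint.
    return task_loss + lam * penalty


r = np.array([3.0, 4.0])
r_unit = r / np.linalg.norm(r)
h_ref = np.array([[1.0, 0.0], [0.0, 1.0]])

# Drift of 2 units along the direction is penalized ...
penalty_drift = projection_penalty(h_ref + 2.0 * r_unit, h_ref, r)
# ... while a move of the same size orthogonal to it is not.
penalty_orthogonal = projection_penalty(h_ref + 2.0 * np.array([-0.8, 0.6]), h_ref, r)
```

The point of constraining only the projection is that the hidden states remain free to change in every other direction, so task learning is not blocked by the safety term.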
[NLP-20] VehicleWorld: A Highly Integrated Multi-Device Environment for Intelligent Vehicle Interaction
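[Quick read]: This paper addresses the inefficiency and weak error recovery of API Agents when coordinating the complex, tightly coupled subsystems of an intelligent vehicle cockpit. Traditional Function Calling (FC) is stateless and needs multiple exploratory calls to build environmental awareness, which makes execution inefficient and limits error recovery. The key to the solution is State-based Function Call (SFC), an approach that maintains explicit awareness of system state and reaches target conditions through direct state transitions, improving execution accuracy and reducing latency. Experiments show that SFC outperforms traditional FC, and the purpose-built VehicleWorld environment (30 modules, 250 APIs, and 680 properties) enables precise evaluation of vehicle agent behavior.
Link: https://arxiv.org/abs/2509.06736
Authors: Jie Yang,Jiajun Chen,Zhangyue Yin,Shuo Chen,Yuxin Wang,Yiran Guo,Yuan Li,Yining Zheng,Xuanjing Huang,Xipeng Qiu
Affiliations: Fudan University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Comments: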
Abstract:Intelligent vehicle cockpits present unique challenges for API Agents, requiring coordination across tightly-coupled subsystems that exceed typical task environments’ complexity. Traditional Function Calling (FC) approaches operate statelessly, requiring multiple exploratory calls to build environmental awareness before execution, leading to inefficiency and limited error recovery. We introduce VehicleWorld, the first comprehensive environment for the automotive domain, featuring 30 modules, 250 APIs, and 680 properties with fully executable implementations that provide real-time state information during agent execution. This environment enables precise evaluation of vehicle agent behaviors across diverse, challenging scenarios. Through systematic analysis, we discovered that direct state prediction outperforms function calling for environmental control. Building on this insight, we propose State-based Function Call (SFC), a novel approach that maintains explicit system state awareness and implements direct state transitions to achieve target conditions. Experimental results demonstrate that SFC significantly outperforms traditional FC approaches, achieving superior execution accuracy and reduced latency. We have made all implementation code publicly available on Github this https URL.
[NLP-21] Reinforcement Learning Foundations for Deep Research Systems: A Survey
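[Quick read]: This paper addresses three core challenges in training deep research systems. First, supervised fine-tuning (SFT) suffers from imitation and exposure biases and underuses environment feedback. Second, preference alignment methods such as DPO rely on human-defined decision points and subskills, are schema- and proxy-dependent and off-policy, and are weak at long-horizon credit assignment. Third, existing methods generally lack support for multi-objective trade-offs, long-sequence modelling, and closed-loop optimization of tool interaction. The key to the solution is Reinforcement Learning (RL), which optimizes trajectory-level policies to enable exploration, recovery behaviors, and principled credit assignment, reducing dependence on human priors and rater biases and improving agents' robustness and transparency on complex tasks. To the authors' knowledge, this survey is the first to systematize the RL foundations of deep research systems, along three axes (data synthesis, RL methods, and training frameworks), and it offers practical guidance that also covers agent architecture and evaluation benchmarks.
Link: https://arxiv.org/abs/2509.06733
Authors: Wenjun Li,Zhi Chen,Jingru Lin,Hannan Cao,Wei Han,Sheng Liang,Zhi Zhang,Kuicai Dong,Dexun Li,Chen Zhang,Yong Liu
Affiliations: Huawei Technologies Co., Ltd
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 38 pages, first version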
Abstract:Deep research systems, agentic AI that solve complex, multi-step tasks by coordinating reasoning, search across the open web and user files, and tool use, are moving toward hierarchical deployments with a Planner, Coordinator, and Executors. In practice, training entire stacks end-to-end remains impractical, so most work trains a single planner connected to core tools such as search, browsing, and code. While SFT imparts protocol fidelity, it suffers from imitation and exposure biases and underuses environment feedback. Preference alignment methods such as DPO are schema and proxy-dependent, off-policy, and weak for long-horizon credit assignment and multi-objective trade-offs. A further limitation of SFT and DPO is their reliance on human defined decision points and subskills through schema design and labeled comparisons. Reinforcement learning aligns with closed-loop, tool-interaction research by optimizing trajectory-level policies, enabling exploration, recovery behaviors, and principled credit assignment, and it reduces dependence on such human priors and rater biases. This survey is, to our knowledge, the first dedicated to the RL foundations of deep research systems. It systematizes work after DeepSeek-R1 along three axes: (i) data synthesis and curation; (ii) RL methods for agentic research covering stability, sample efficiency, long context handling, reward and credit design, multi-objective optimization, and multimodal integration; and (iii) agentic RL training systems and frameworks. We also cover agent architecture and coordination, as well as evaluation and benchmarks, including recent QA, VQA, long-form synthesis, and domain-grounded, tool-interaction tasks. We distill recurring patterns, surface infrastructure bottlenecks, and offer practical guidance for training robust, transparent deep research agents with RL. 
[NLP-22] Will Annotators Disagree? Identifying Subjectivity in Value-Laden Arguments EMNLP2025
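[Quick read]: This paper addresses the problem that, in subjective tasks, aggregating multiple annotations into a single ground-truth label can hide annotator disagreement, particularly when recognizing the human values that motivate arguments. The key to the solution is comparing two approaches: inferring subjectivity through value prediction versus identifying subjectivity directly. Experiments show that direct subjectivity identification significantly improves performance at flagging subjective arguments, and that combining contrastive loss with binary cross-entropy loss does not improve performance but reduces the dependency on per-label subjectivity. The methods can help surface arguments that individuals may interpret differently and support a more nuanced annotation process.
Link: https://arxiv.org/abs/2509.06704
Authors: Amir Homayounirad,Enrico Liscio,Tong Wang,Catholijn M. Jonker,Luciano C. Siebert
Affiliations: Delft University of Technology
Categories: Computation and Language (cs.CL)
Comments: Accepted at Findings of EMNLP 2025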
Abstract:Aggregating multiple annotations into a single ground truth label may hide valuable insights into annotator disagreement, particularly in tasks where subjectivity plays a crucial role. In this work, we explore methods for identifying subjectivity in recognizing the human values that motivate arguments. We evaluate two main approaches: inferring subjectivity through value prediction vs. directly identifying subjectivity. Our experiments show that direct subjectivity identification significantly improves the model performance of flagging subjective arguments. Furthermore, combining contrastive loss with binary cross-entropy loss does not improve performance but reduces the dependency on per-label subjectivity. Our proposed methods can help identify arguments that individuals may interpret differently, fostering a more nuanced annotation process.
[NLP-23] ParCzech4Speech: A New Speech Corpus Derived from Czech Parliamentary Data
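[Quick read]: This paper addresses the scarcity of large, high-quality corpora for speech modelling, specifically the lack of well-structured, accurately aligned audio-text datasets for Czech. The key to the solution is ParCzech4Speech 1.0, a processed version of the ParCzech 4.0 corpus that combines recordings of Czech parliamentary speeches with their official transcripts and uses WhisperX and Wav2Vec 2.0 for automated audio-text alignment. While preserving the original metadata, it is offered in three variants: sentence-segmented (for automatic speech recognition and speech synthesis), unsegmented (preserving the original utterance flow), and raw-alignment (for further custom processing). Compared with the ParCzech 3.0 speech recognition version, the pipeline extracts more data with higher alignment reliability.
Link: https://arxiv.org/abs/2509.06675
Authors: Vladislav Stankov,Matyáš Kopp,Ondřej Bojar
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: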
Abstract:We introduce ParCzech4Speech 1.0, a processed version of the ParCzech 4.0 corpus, targeted at speech modeling tasks with the largest variant containing 2,695 hours. We combined the sound recordings of the Czech parliamentary speeches with the official transcripts. The recordings were processed with WhisperX and Wav2Vec 2.0 to extract automated audio-text alignment. Our processing pipeline improves upon the ParCzech 3.0 speech recognition version by extracting more data with higher alignment reliability. The dataset is offered in three flexible variants: (1) sentence-segmented for automatic speech recognition and speech synthesis tasks with clean boundaries, (2) unsegmented preserving original utterance flow across sentences, and (3) a raw-alignment for further custom refinement for other possible tasks. All variants maintain the original metadata and are released under a permissive CC-BY license. The dataset is available in the LINDAT repository, with the sentence-segmented and unsegmented variants additionally available on Hugging Face.
[NLP-24] IntrEx: A Dataset for Modeling Engagement in Educational Conversations EMNLP2025
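[Quick read]: This paper addresses the challenge of sustaining learner engagement and motivation in educational conversations, and in particular the limited understanding of the linguistic features that drive interest in such conversations. The key to the solution is IntrEx, the first large dataset annotated for interestingness in teacher-student interactions. Built on the Teacher-Student Chatroom Corpus (TSCC), it uses a comparison-based rating approach to improve annotator agreement and adds sequence-level annotations to capture how interest evolves over extended dialogues. The study further shows that LLMs with 7B/8B parameters fine-tuned on this dataset predict human interestingness judgments better than larger proprietary models such as GPT-4o, highlighting the value of specialised annotated data for modelling learner engagement in educational settings.
Link: https://arxiv.org/abs/2509.06652
Authors: Xingwei Tan,Mahathi Parvatham,Chiara Gambi,Gabriele Pergola
Affiliations: University of Warwick; University of Sheffield
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2025 Findings camera-ready, 9+7 pages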
Abstract:Engagement and motivation are crucial for second-language acquisition, yet maintaining learner interest in educational conversations remains a challenge. While prior research has explored what makes educational texts interesting, still little is known about the linguistic features that drive engagement in conversations. To address this gap, we introduce IntrEx, the first large dataset annotated for interestingness and expected interestingness in teacher-student interactions. Built upon the Teacher-Student Chatroom Corpus (TSCC), IntrEx extends prior work by incorporating sequence-level annotations, allowing for the study of engagement beyond isolated turns to capture how interest evolves over extended dialogues. We employ a rigorous annotation process with over 100 second-language learners, using a comparison-based rating approach inspired by reinforcement learning from human feedback (RLHF) to improve agreement. We investigate whether large language models (LLMs) can predict human interestingness judgments. We find that LLMs (7B/8B parameters) fine-tuned on interestingness ratings outperform larger proprietary models like GPT-4o, demonstrating the potential for specialised datasets to model engagement in educational settings. Finally, we analyze how linguistic and cognitive factors, such as concreteness, comprehensibility (readability), and uptake, influence engagement in educational dialogues.
[NLP-25] Domain-Aware RAG: MoL-Enhanced RL for Efficient Training and Scalable Retrieval
[Quick read]: This paper addresses the optimization of the coarse-ranking stage in Retrieval-Augmented Generation (RAG) systems, where existing methods struggle to balance domain-specific knowledge learning with query enhancement and therefore retrieve suboptimally. The key to the solution is the MoLER framework, which uses a two-stage pipeline. The first stage is Continual Pre-Training (CPT) with a Mixture of Losses (MoL) that balances domain-specific knowledge with general language capabilities. The second stage is Reinforcement Learning (RL) with Group Relative Policy Optimization (GRPO) to maximize document recall; it adopts a Multi-query Single-passage Late Fusion (MSLF) strategy to reduce the computational overhead of RL training, while inference uses Multi-query Multi-passage Late Fusion (MMLF) for scalability.
Link: https://arxiv.org/abs/2509.06650
Authors: Hao Lin,Peitong Xie,Jingxue Chen,Jie Lin,Qingkun Tang,Qianchun Lu
Affiliations: Southeast University; Nanyang Technological University; ZTE Corporation
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:
Abstract:Retrieval-Augmented Generation (RAG) systems rely heavily on the retrieval stage, particularly the coarse-ranking process. Existing coarse-ranking optimization approaches often struggle to balance domain-specific knowledge learning with query enhancement, resulting in suboptimal retrieval performance. To address this challenge, we propose MoLER, a domain-aware RAG method that uses MoL-Enhanced Reinforcement Learning to optimize retrieval. MoLER has a two-stage pipeline: a continual pre-training (CPT) phase using a Mixture of Losses (MoL) to balance domain-specific knowledge with general language capabilities, and a reinforcement learning (RL) phase leveraging Group Relative Policy Optimization (GRPO) to optimize query and passage generation for maximizing document recall. A key innovation is our Multi-query Single-passage Late Fusion (MSLF) strategy, which reduces computational overhead during RL training while maintaining scalable inference via Multi-query Multi-passage Late Fusion (MMLF). Extensive experiments on benchmark datasets show that MoLER achieves state-of-the-art performance, significantly outperforming baseline methods. MoLER bridges the knowledge gap in RAG systems, enabling robust and scalable retrieval in specialized domains.
[NLP-26] Modelling Intertextuality with N-gram Embeddings
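[Quick read]: This paper addresses the difficulty of quantifying intertextual relationships in literary studies, especially the limits of traditional qualitative methods on large text collections. The core solution is an embedding-based quantitative model that computes pairwise similarities between the n-gram embeddings of two texts and averages them, giving a scalable measure of intertextuality. The key is using semantic embeddings to capture references and echoes between texts, combined with network analysis that reveals centrality and community structure, yielding a measurement framework that is both effective and efficient.
Link: https://arxiv.org/abs/2509.06637
Authors: Yi Xing
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: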
Abstract:Intertextuality is a central tenet in literary studies. It refers to the intricate links between literary texts that are created by various types of references. This paper proposes a new quantitative model of intertextuality to enable scalable analysis and network-based insights: perform pairwise comparisons of the embeddings of n-grams from two texts and average their results as the overall intertextuality. Validation on four texts with known degrees of intertextuality, alongside a scalability test on 267 diverse texts, demonstrates the method’s effectiveness and efficiency. Network analysis further reveals centrality and community structures, affirming the approach’s success in capturing and quantifying intertextual relationships.
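The scoring rule described in the abstract (compare n-gram embeddings pairwise, then average) can be sketched in a few lines. The embedding function below is a deliberately crude stand-in, hashed character trigrams, so that the example runs without any model; the paper's actual embedding model, its choice of n, and its similarity measure are not stated in the abstract, so `toy_embed`, the cosine similarity, and `n=2` are assumptions.

```python
import zlib
from typing import List

import numpy as np


def ngrams(text: str, n: int) -> List[str]:
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def toy_embed(ngram: str, dim: int = 512) -> np.ndarray:
    """Hypothetical stand-in for a real embedding model: hashed character trigrams."""
    vec = np.zeros(dim)
    padded = f" {ngram} "
    for i in range(len(padded) - 2):
        bucket = zlib.crc32(padded[i:i + 3].encode("utf-8")) % dim
        vec[bucket] += 1.0
    return vec / np.linalg.norm(vec)  # unit length, so dot product = cosine


def intertextuality(text_a: str, text_b: str, n: int = 2) -> float:
    """Average pairwise cosine similarity between the n-gram embeddings of two texts."""
    grams_a, grams_b = ngrams(text_a, n), ngrams(text_b, n)
    if not grams_a or not grams_b:
        raise ValueError("both texts need at least n tokens")
    emb_a = np.stack([toy_embed(g) for g in grams_a])
    emb_b = np.stack([toy_embed(g) for g in grams_b])
    return float((emb_a @ emb_b.T).mean())


source = "the quick brown fox jumps over the lazy dog"
echo = "a quick brown fox leaps over a lazy dog"
unrelated = "stock markets fell sharply on monday morning"

score_echo = intertextuality(source, echo)
score_unrelated = intertextuality(source, unrelated)
```

In this toy setup the text that echoes the source scores higher than the unrelated one. Swapping `toy_embed` for a real sentence or n-gram embedding model leaves the rest of the computation unchanged.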
[NLP-27] Guided Decoding and Its Critical Role in Retrieval-Augmented Generation
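[Quick read]: This paper addresses inconsistent output formats and frequent hallucinations in Retrieval-Augmented Generation (RAG) systems; the core challenge is producing reliably structured responses under multi-turn prompting. The key to the solution is a systematic evaluation of three guided decoding methods (Outlines, XGrammar, and LM Format Enforcer) across different prompting setups (0-turn, 1-turn, and 2-turn). By measuring success rates, hallucination rates, and output quality, the study shows how multi-turn interaction affects guided decoding, providing a basis for choosing a method and optimizing deployment for specific use cases.
Link: https://arxiv.org/abs/2509.06631
Authors: Özgür Uğur,Musa Yılmaz,Esra Şavirdi,Özay Ezerceli,Mahmut El Huseyni,Selva Taş,Reyhan Bayraktar
Affiliations: Newmind AI
Categories: Computation and Language (cs.CL)
Comments: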
Abstract:The integration of Large Language Models (LLMs) into various applications has driven the need for structured and reliable responses. A key challenge in Retrieval-Augmented Generation (RAG) systems is ensuring that outputs align with expected formats while minimizing hallucinations. This study examines the role of guided decoding in RAG systems, comparing three methods, Outlines, XGrammar, and LM Format Enforcer, across different multi-turn prompting setups (0-turn, 1-turn, and 2-turn). By evaluating success rates, hallucination rates, and output quality, we provide insights into their performance and applicability. Our findings reveal how multi-turn interactions influence guided decoding, uncovering unexpected performance variations that can inform method selection for specific use cases. This work advances the understanding of structured output generation in RAG systems, offering both theoretical insights and practical guidance for LLM deployment.
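For readers unfamiliar with guided decoding, the general idea is to restrict each decoding step to tokens that keep the output valid under a target format. The toy sketch below shows only that masking step on a made-up four-token vocabulary. It does not use or reproduce the APIs of Outlines, XGrammar, or LM Format Enforcer, and the function `pick_token`, the vocabulary, and the logits are hypothetical.

```python
import numpy as np


def pick_token(logits, allowed_ids=None):
    """Greedy choice of the next token, optionally restricted to allowed_ids."""
    if allowed_ids is None:
        return int(np.argmax(logits))
    masked = np.full(logits.shape, -np.inf)      # disallowed tokens can never win
    masked[allowed_ids] = logits[allowed_ids]    # keep scores of allowed tokens only
    return int(np.argmax(masked))


# Toy vocabulary and scores; suppose the format only permits "yes" or "no" here.
vocab = ["yes", "maybe", "no", "{"]
logits = np.array([0.1, 2.5, 1.7, 0.3])
allowed = [vocab.index("yes"), vocab.index("no")]

free_choice = vocab[pick_token(logits)]             # the model's unconstrained pick
guided_choice = vocab[pick_token(logits, allowed)]  # best token that fits the format
```

Here the unconstrained pick is "maybe", which would break the expected format, while the guided pick is "no". Real systems derive the allowed set at every step from a schema, regular expression, or grammar rather than from a fixed list.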
[NLP-28] HAVE: Head-Adaptive Gating and ValuE Calibration for Hallucination Mitigation in Large Language Models
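[Quick read]: This paper addresses hallucinations that Large Language Models (LLMs) produce in retrieval-augmented or long-context generation even when the relevant evidence is present in the input. The root causes are that attention head importance is treated as input-agnostic and that raw attention weights poorly reflect each token's true contribution. The key to the solution is HAVE (Head-Adaptive Gating and ValuE Calibration), a decoding framework that requires no fine-tuning and has two core modules: 1) head-adaptive gating, which performs instance-level soft reweighting of attention heads; and 2) value calibration, which incorporates the magnitude of value vectors to approximate the write-back contribution. Together the modules build token-level evidence aligned with model updates and fuse it with the language model distribution through a lightweight uncertainty-scaled policy, reducing hallucinations and improving the trustworthiness of generation.
Link: https://arxiv.org/abs/2509.06596
Authors: Xin Tong,Zhi Lin,Jingya Wang,Bo Jin
Affiliations: People’s Public Security University of China; Tsinghua University; The Third Research Institute of the Ministry of Public Security of China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: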
Abstract:Large Language Models (LLMs) often produce hallucinations in retrieval-augmented or long-context generation, even when relevant evidence is present. This stems from two issues: head importance is treated as input-agnostic, and raw attention weights poorly reflect each token’s true contribution. We present HAVE (Head-Adaptive Gating and ValuE Calibration), a parameter-free decoding framework that directly addresses both challenges. HAVE introduces head-adaptive gating, which performs instance-level soft reweighing of attention heads, and value calibration, which augments attention with the magnitude of value vectors to approximate write-back contribution. Together, these modules construct token-level evidence aligned with model updates and fuse it with the LM distribution through a lightweight uncertainty-scaled policy. HAVE requires no finetuning and operates in a single forward pass, making it efficient and broadly applicable. Experiments across multiple QA benchmarks and LLM families demonstrate that HAVE consistently reduces hallucinations and outperforms strong baselines, including DAGCD, with modest overhead. The framework is transparent, reproducible, and readily integrates with off-the-shelf LLMs, advancing trustworthy generation in real-world settings.
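The two modules named in the abstract can be pictured with a small numerical sketch. It assumes that per-head relevance scores are already available and turns them into soft gates with a softmax, and that value calibration means multiplying each attention weight by the norm of the corresponding value vector. Both choices, and the names `token_evidence` and `head_scores`, are assumptions for illustration; the abstract does not give the exact formulas used by HAVE.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def token_evidence(attn, values, head_scores):
    """Combine attention heads into one normalized evidence score per context token.

    attn:        (heads, tokens) attention weights from the current decoding step
    values:      (heads, tokens, d_head) value vectors of the context tokens
    head_scores: (heads,) assumed per-instance relevance scores for each head
    """
    gates = softmax(head_scores)                    # soft, instance-level head weights
    value_norms = np.linalg.norm(values, axis=-1)   # (heads, tokens)
    calibrated = attn * value_norms                 # attention scaled by value magnitude
    evidence = gates @ calibrated                   # (tokens,)
    return evidence / evidence.sum()


# One head, two tokens with equal attention but very different value magnitudes.
attn = np.array([[0.5, 0.5]])
values = np.array([[[1.0, 0.0], [3.0, 4.0]]])      # norms 1 and 5
evidence = token_evidence(attn, values, np.array([0.0]))
```

Raw attention would rate the two tokens equally, whereas the calibrated evidence favors the second token because its value vector contributes more to what is written back.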
[NLP-29] SLiNT: Structure-aware Language Model with Injection and Contrastive Training for Knowledge Graph Completion EMNLP
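[Quick read]: This paper addresses structural sparsity and semantic ambiguity in Knowledge Graph (KG) link prediction caused by insufficient use of structural information, which hurts performance especially in incomplete or zero-shot settings. The core solution is the SLiNT framework, which injects knowledge-graph-derived structural context into a frozen Large Language Model (LLM) backbone through lightweight LoRA-based adaptation and combines three techniques: Structure-Guided Neighborhood Enhancement (SGNE), which retrieves pseudo-neighbors to enrich the context of sparse entities; Dynamic Hard Contrastive Learning (DHCL), which interpolates hard positives and negatives to provide fine-grained supervision and reduce entity-level ambiguity; and Gradient-Decoupled Dual Injection (GDDI), which performs token-level structure-aware intervention while leaving the core LLM parameters unchanged. The method improves the robustness and performance of LLM-based knowledge graph completion.
Link: https://arxiv.org/abs/2509.06531
Authors: Mengxue Yang,Chun Yang,Jiaqi Zhu,Jiafan Li,Jingqi Zhang,Yuyang Li,Ying Li
Affiliations: University of Chinese Academy of Sciences; Institute of Software, Chinese Academy of Sciences; National Astronomical Observatories, Chinese Academy of Sciences
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by EMNLP Findings 2025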
Abstract:Link prediction in knowledge graphs requires integrating structural information and semantic context to infer missing entities. While large language models offer strong generative reasoning capabilities, their limited exploitation of structural signals often results in structural sparsity and semantic ambiguity, especially under incomplete or zero-shot settings. To address these challenges, we propose SLiNT (Structure-aware Language model with Injection and coNtrastive Training), a modular framework that injects knowledge-graph-derived structural context into a frozen LLM backbone with lightweight LoRA-based adaptation for robust link prediction. Specifically, Structure-Guided Neighborhood Enhancement (SGNE) retrieves pseudo-neighbors to enrich sparse entities and mitigate missing context; Dynamic Hard Contrastive Learning (DHCL) introduces fine-grained supervision by interpolating hard positives and negatives to resolve entity-level ambiguity; and Gradient-Decoupled Dual Injection (GDDI) performs token-level structure-aware intervention while preserving the core LLM parameters. Experiments on WN18RR and FB15k-237 show that SLiNT achieves superior or competitive performance compared with both embedding-based and generation-based baselines, demonstrating the effectiveness of structure-aware representation learning for scalable knowledge graph completion.
[NLP-30] LAMDAS: LLM as an Implicit Classifier for Domain-specific Data Selection
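[Quick read]: This paper addresses the performance bottleneck in adapting Large Language Models (LLMs) to specific domains caused by the scarcity of high-quality, human-curated data. Existing data selection methods struggle to be both accurate and efficient, and fine-tuning on large volumes of unchecked data risks introducing noise and degrading performance. The key to the solution is LAMDAS (LLM As an iMplicit classifier for domain-specific DAta Selection), which uses the pre-trained LLM itself as an implicit classifier and reframes data selection as a one-class classification problem. This avoids explicit feature engineering and computationally intensive optimization while efficiently identifying candidate data that matches the target domain, improving both adaptation quality and computational efficiency.
Link: https://arxiv.org/abs/2509.06524
Authors: Jian Wu,Hang Yu,Bingchang Liu,Wenjie Yang,Peng Di,Jianguo Li,Yue Zhang
Affiliations: Ant Group; Westlake University
Categories: Computation and Language (cs.CL)
Comments: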
Abstract:Adapting large language models (LLMs) to specific domains often faces a critical bottleneck: the scarcity of high-quality, human-curated data. While large volumes of unchecked data are readily available, indiscriminately using them for fine-tuning risks introducing noise and degrading performance. Strategic data selection is thus crucial, requiring a method that is both accurate and efficient. Existing approaches, categorized as similarity-based and direct optimization methods, struggle to simultaneously achieve these goals. In this paper, we introduce LAMDAS (LLM As an iMplicit classifier for domain-specific DAta Selection), a novel approach that leverages the pre-trained LLM itself as an implicit classifier, thereby bypassing explicit feature engineering and computationally intensive optimization process. LAMDAS reframes data selection as a one-class classification problem, identifying candidate data that “belongs” to the target domain defined by a small reference dataset. Extensive experimental results demonstrate that LAMDAS not only exceeds the performance of full-data training using a fraction of the data but also outperforms nine state-of-the-art (SOTA) baselines under various scenarios. Furthermore, LAMDAS achieves the most compelling balance between performance gains and computational efficiency compared to all evaluated baselines.
[NLP-31] Crown, Frame, Reverse: Layer-Wise Scaling Variants for LLM Pre-Training
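[Quick read]: This paper addresses the fact that conventional Transformer-based language models use uniform (isotropic) layer sizes and so ignore how functional roles and computational capacity needs differ across depth. The key to the solution is three new Layer-Wise Scaling (LWS) variants, Framed, Reverse, and Crown, which redistribute feed-forward network (FFN) widths and attention-head counts during pre-training via two- or three-point linear interpolation, allocating capacity across layers more efficiently. In the reported experiments, under a fixed budget of 180M parameters, these variants converge to similar losses and outperform an equal-cost isotropic baseline while retaining high training throughput.
Link: https://arxiv.org/abs/2509.06518
Authors: Andrei Baroian,Kasper Notebomer
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: The reported results are skewed due to a data type mismatch. The dataset was saved with int32, but the data loader interpreted it as uint16. As a result, each 32-bit token was incorrectly split into two 16-bit tokens. Outcome: a consistent artifact where every other token is zero.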
Abstract:Transformer-based language models traditionally use uniform (isotropic) layer sizes, yet they ignore the diverse functional roles that different depths can play and their computational capacity needs. Building on Layer-Wise Scaling (LWS) and pruning literature, we introduce three new LWS variants - Framed, Reverse, and Crown - that redistribute FFN widths and attention heads via two or three-point linear interpolation in the pre-training stage. We present the first systematic ablation of LWS and its variants, on a fixed budget of 180M parameters, trained on 5B tokens. All models converge to similar losses and achieve better performance compared to an equal-cost isotropic baseline, without a substantial decrease in training throughput. This work represents an initial step into the design space of layer-wise architectures for pre-training, but future work should scale experiments to orders of magnitude more tokens and parameters to fully assess their potential.
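The data type mismatch described in this entry's comment field is easy to reproduce. The sketch below is only an illustration under stated assumptions: the token ids are hypothetical, little-endian byte order is assumed, and the helper names `write_int32` and `read_uint16` are invented here rather than taken from the authors' code.

```python
import struct

def write_int32(tokens):
    """Serialize token ids as little-endian signed 32-bit integers."""
    return struct.pack(f"<{len(tokens)}i", *tokens)

def read_uint16(buf):
    """Misread the same bytes as little-endian unsigned 16-bit integers."""
    return list(struct.unpack(f"<{len(buf) // 2}H", buf))

# Hypothetical token ids, all below 65536 so the upper 16 bits are zero.
tokens = [17, 4096, 50256]
misread = read_uint16(write_int32(tokens))
print(misread)  # [17, 0, 4096, 0, 50256, 0]
```

Because every id fits in 16 bits, the upper half of each 32-bit word is zero, so the misread stream is twice as long and every other token is zero, which matches the artifact the comment describes.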
[NLP-32] WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents
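[Quick read]: This paper addresses two shortcomings of current open-source web agents: limited performance on complex information-seeking tasks and a lack of transparent implementations. The core challenge is the scarcity of high-quality, challenging information-seeking data, which makes multi-step reasoning and complex web navigation hard to learn. The key to the solution is WebExplorer, a systematic data generation approach based on model-based exploration and iterative long-to-short query evolution, which produces challenging query-answer pairs requiring multi-step reasoning and complex web navigation. Using this dataset, the authors train WebExplorer-8B via supervised fine-tuning followed by reinforcement learning; the model supports a 128K context length and up to 100 tool-calling turns, improving long-horizon problem solving and reaching state-of-the-art performance at its scale on several information-seeking benchmarks.
Link: https://arxiv.org/abs/2509.06501
Authors: Junteng Liu,Yunji Li,Chi Zhang,Jingyang Li,Aili Chen,Ke Ji,Weiyu Cheng,Zijia Wu,Chengyu Du,Qidi Xu,Jiayuan Song,Zhengmao Zhu,Wenhu Chen,Pengyu Zhao,Junxian He
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: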
Abstract:The paradigm of Large Language Models (LLMs) has increasingly shifted toward agentic applications, where web browsing capabilities are fundamental for retrieving information from diverse online sources. However, existing open-source web agents either demonstrate limited information-seeking abilities on complex tasks or lack transparent implementations. In this work, we identify that the key challenge lies in the scarcity of challenging data for information seeking. To address this limitation, we introduce WebExplorer: a systematic data generation approach using model-based exploration and iterative, long-to-short query evolution. This method creates challenging query-answer pairs that require multi-step reasoning and complex web navigation. By leveraging our curated high-quality dataset, we successfully develop advanced web agent WebExplorer-8B through supervised fine-tuning followed by reinforcement learning. Our model supports 128K context length and up to 100 tool calling turns, enabling long-horizon problem solving. Across diverse information-seeking benchmarks, WebExplorer-8B achieves the state-of-the-art performance at its scale. Notably, as an 8B-sized model, WebExplorer-8B is able to effectively search over an average of 16 turns after RL training, achieving higher accuracy than WebSailor-72B on BrowseComp-en/zh and attaining the best performance among models up to 100B parameters on WebWalkerQA and FRAMES. Beyond these information-seeking tasks, our model also achieves strong generalization on the HLE benchmark even though it is only trained on knowledge-intensive QA data. These results highlight our approach as a practical path toward long-horizon web agents.
[NLP-33] Index-Preserving Lightweight Token Pruning for Efficient Document Understanding in Vision-Language Models ICASSP2026
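[Quick read]: This paper addresses the high computational cost of vision-language models (VLMs) on document understanding tasks. The key to the solution is a lightweight token pruning framework: a binary patch-level classifier first filters out non-informative background regions of the document image, and a max-pooling refinement step then recovers fragmented text regions to improve spatial coherence, substantially reducing computation while maintaining comparable accuracy.
Link: https://arxiv.org/abs/2509.06415
Authors: Jaemin Son,Sujin Choi,Inyong Yun
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Submitted to ICASSP 2026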
Abstract:Recent progress in vision-language models (VLMs) has led to impressive results in document understanding tasks, but their high computational demands remain a challenge. To mitigate the compute burdens, we propose a lightweight token pruning framework that filters out non-informative background regions from document images prior to VLM processing. A binary patch-level classifier removes non-text areas, and a max-pooling refinement step recovers fragmented text regions to enhance spatial coherence. Experiments on real-world document datasets demonstrate that our approach substantially lowers computational costs, while maintaining comparable accuracy.
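The two pruning steps in this abstract (a binary patch mask, then max-pooling refinement) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the 3 x 3 kernel, the stride of 1, the toy mask, and the function names are all assumptions made here.

```python
def max_pool_refine(mask, k=3):
    """Apply a k x k max-pool (stride 1, zero padding) to a binary patch mask."""
    h, w = len(mask), len(mask[0])
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            out[i][j] = max(
                mask[y][x]
                for y in range(max(0, i - r), min(h, i + r + 1))
                for x in range(max(0, j - r), min(w, j + r + 1))
            )
    return out

def kept_patch_indices(mask):
    """Return row-major indices of kept patches, so original positions survive."""
    w = len(mask[0])
    return [i * w + j for i, row in enumerate(mask) for j, v in enumerate(row) if v]

# Hypothetical classifier output for a 5 x 7 patch grid: 1 = text, 0 = background.
# The zero at row 2, column 2 splits one text line into two fragments.
raw = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
refined = max_pool_refine(raw)
kept = kept_patch_indices(refined)
print(len(kept), "of", len(raw) * len(raw[0]), "patches kept")
```

On this toy mask the pooling fills the gap inside the text line and adds a one-patch margin around it, while the remaining background patches are still dropped. Returning the original row-major indices is one way to keep positional information intact for the kept tokens.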
[NLP-34] Do LLMs exhibit the same commonsense capabilities across languages?
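[Quick read]: This paper examines how the commonsense generation abilities of large language models (LLMs) differ across languages, in particular their weaker performance in less-resourced languages. The key to the solution is MULTICOM, a new benchmark that extends the COCOTEROS dataset to four languages (English, Spanish, Dutch, and Valencian); the task is to generate a commonsensical sentence containing a given triplet of words. Combining automatic metrics, LLM-as-a-judge approaches (Prometheus and JudgeLM), and human annotation, the authors evaluate several open-source LLMs (LLaMA, Qwen, Gemma, EuroLLM, and Salamandra), reveal marked disparities across languages, and find that contextual support tends to help underrepresented languages.
Link: https://arxiv.org/abs/2509.06401
Authors: Ivan Martínez-Murillo,Elena Lloret,Paloma Moreda,Albert Gatt
Affiliations: University of Alicante; University of Utrecht
Categories: Computation and Language (cs.CL)
Comments: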
Abstract:This paper explores the multilingual commonsense generation abilities of Large Language Models (LLMs). To facilitate this investigation, we introduce MULTICOM, a novel benchmark that extends the COCOTEROS dataset to four languages: English, Spanish, Dutch, and Valencian. The task involves generating a commonsensical sentence that includes a given triplet of words. We evaluate a range of open-source LLMs, including LLaMA, Qwen, Gemma, EuroLLM, and Salamandra, on this benchmark. Our evaluation combines automatic metrics, LLM-as-a-judge approaches (using Prometheus and JudgeLM), and human annotations. Results consistently show superior performance in English, with significantly lower performance in less-resourced languages. While contextual support yields mixed results, it tends to benefit underrepresented languages. These findings underscore the current limitations of LLMs in multilingual commonsense generation. The dataset is publicly available at this https URL.
[NLP-35] PL-CA: A Parametric Legal Case Augmentation Framework
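[Quick read]: This paper addresses two core problems of conventional retrieval-augmented generation (RAG) in the judicial domain. First, injecting long retrieved texts directly into a limited context window disrupts the model's attention and degrades downstream performance. Second, existing benchmarks lack expert annotation and focus on single tasks, so they do not reflect model capability in realistic multi-task legal scenarios. The key to the solution is the PL-CA framework, which introduces parametric RAG (P-RAG): legal knowledge is encoded into parametric vectors and integrated into the LLM's feed-forward networks (FFN) via low-rank adaptation (LoRA), relieving context pressure and improving knowledge utilization. The authors also build a multi-task legal dataset of more than 2,000 expert-annotated, manually verified instances for more realistic evaluation.
Link: https://arxiv.org/abs/2509.06356
Authors: Ao Chang,Yubo Chen,Jun Zhao
Affiliations: Institute of Automation, Chinese Academy of Sciences; National Laboratory of Pattern Recognition
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: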
Abstract:Conventional RAG is considered one of the most effective methods for addressing model knowledge insufficiency and hallucination, particularly in the judicial domain that requires high levels of knowledge rigor, logical consistency, and content integrity. However, the conventional RAG method only injects retrieved documents directly into the model’s context, which severely constrains models due to their limited context windows and introduces additional computational overhead through excessively long contexts, thereby disrupting models’ attention and degrading performance on downstream tasks. Moreover, many existing benchmarks lack expert annotation and focus solely on individual downstream tasks while real-world legal scenarios consist of multiple mixed legal tasks, indicating conventional benchmarks’ inadequacy for reflecting models’ true capabilities. To address these limitations, we propose PL-CA, which introduces a parametric RAG (P-RAG) framework to perform data augmentation on corpus knowledge and encode this legal knowledge into parametric vectors, and then integrates this parametric knowledge into the LLM’s feed-forward networks (FFN) via LoRA, thereby alleviating models’ context pressure. Additionally, we also construct a multi-task legal dataset comprising more than 2000 training and test instances, which are all expert-annotated and manually verified. We conduct our experiments on our dataset, and the experimental results demonstrate that our method reduces the overhead associated with excessively long contexts while maintaining competitive performance on downstream tasks compared to conventional RAG. Our code and dataset are provided in the appendix.
[NLP-36] Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?
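[Quick read]: This paper addresses the computational inefficiency of adversarial attacks on large language models (LLMs) caused by redundant tokens in fixed-length suffixes. Greedy Coordinate Gradient (GCG) and its improved variants all use fixed-length suffixes without considering how much each token actually contributes to the attack success rate (ASR). The key to the solution is Mask-GCG, a plug-and-play learnable token-masking mechanism that identifies high- and low-impact tokens in the suffix, raises the update probability at high-impact positions, and prunes low-impact ones. This reduces redundancy and shrinks the gradient search space, lowering computational overhead and shortening the time to a successful attack without reducing the ASR; it also exposes token redundancy in LLM prompts, offering a perspective for designing efficient and interpretable LLMs.
Link: https://arxiv.org/abs/2509.06350
Authors: Junjie Mu,Zonghao Ying,Zhekui Fan,Zonglei Jing,Yaoyuan Zhang,Zhengmin Yu,Wenxin Zhang,Quanchen Zou,Xiangzheng Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: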
Abstract:Jailbreak attacks on Large Language Models (LLMs) have demonstrated various successful methods whereby attackers manipulate models into generating harmful responses that they are designed to avoid. Among these, Greedy Coordinate Gradient (GCG) has emerged as a general and effective approach that optimizes the tokens in a suffix to generate jailbreakable prompts. While several improved variants of GCG have been proposed, they all rely on fixed-length suffixes. However, the potential redundancy within these suffixes remains unexplored. In this work, we propose Mask-GCG, a plug-and-play method that employs learnable token masking to identify impactful tokens within the suffix. Our approach increases the update probability for tokens at high-impact positions while pruning those at low-impact positions. This pruning not only reduces redundancy but also decreases the size of the gradient space, thereby lowering computational overhead and shortening the time required to achieve successful attacks compared to GCG. We evaluate Mask-GCG by applying it to the original GCG and several improved variants. Experimental results show that most tokens in the suffix contribute significantly to attack success, and pruning a minority of low-impact tokens does not affect the loss values or compromise the attack success rate (ASR), thereby revealing token redundancy in LLM prompts. Our findings provide insights for developing efficient and interpretable LLMs from the perspective of jailbreak attacks.
[NLP-37] SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents
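[Quick read]: This paper addresses how to build autonomous single-agent models with native reasoning and tool-use abilities for Deep Research (DR) tasks, which require extensive search and reasoning over many sources. The core challenge is to avoid multi-agent systems that depend on pre-defined roles and static workflows, and instead let a single agent decide its next action dynamically from context. The key to the solution is continual reinforcement learning (RL) on entirely synthetic data, applied to reasoning-optimized language models, which strengthens agentic skills while preserving reasoning ability; the best variant reaches up to 28.7% on the Humanity's Last Exam benchmark.
Link: https://arxiv.org/abs/2509.06283
Authors: Xuan-Phi Nguyen,Shrey Pandit,Revanth Gangi Reddy,Austin Xu,Silvio Savarese,Caiming Xiong,Shafiq Joty
Affiliations: Salesforce AI Research
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Technical Report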
Abstract:Equipping large language models (LLMs) with complex, interleaved reasoning and tool-use capabilities has become a key focus in agentic AI research, especially with recent advances in reasoning-oriented ("thinking") models. Such capabilities are key to unlocking a number of important applications. One such application is Deep Research (DR), which requires extensive search and reasoning over many sources. Our work in this paper focuses on the development of native Autonomous Single-Agent models for DR featuring minimal web crawling and Python tool integration. Unlike multi-agent systems, where agents take up pre-defined roles and are told what to do at each step in a static workflow, an autonomous single-agent determines its next action dynamically based on context, without manual directive. While prior work has proposed training recipes for base or instruction-tuned LLMs, we focus on continual reinforcement learning (RL) of reasoning-optimized models to further enhance agentic skills while preserving reasoning ability. Towards this end, we propose a simple RL recipe with entirely synthetic data, which we apply to various open-source LLMs. Our best variant SFR-DR-20B achieves up to 28.7% on Humanity’s Last Exam benchmark. In addition, we conduct key analysis experiments to provide more insights into our methodologies.
[NLP-38] No Encore: Unlearning as Opt-Out in Music Generation
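[Quick read]: This paper addresses the ethical and legal risks that arise when generative AI for music inadvertently uses copyrighted content. The key to the solution is machine unlearning: the authors apply existing unlearning methods to a pre-trained text-to-music (TTM) model and examine how well the influence of specific training data can be removed without harming model performance, providing a foundational analysis for copyright compliance in music generation models.
Link: https://arxiv.org/abs/2509.06277
Authors: Jinju Kim,Taehan Kim,Abdul Waheed,Rita Singh
Affiliations: Sungkyunkwan University; Sogang University; Carnegie Mellon University
Categories: Computation and Language (cs.CL)
Comments: Work in progress. 7 pages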
Abstract:AI music generation is rapidly emerging in the creative industries, enabling intuitive music generation from textual descriptions. However, these systems pose risks of exploiting copyrighted creations, raising ethical and legal concerns. In this paper, we present preliminary results from ongoing research on the first application of machine unlearning techniques to prevent inadvertent usage of creative content. In particular, we apply existing machine unlearning methods to a pre-trained Text-to-Music (TTM) baseline and analyze their efficacy in unlearning pre-trained datasets without harming model performance. Through our experiments, we provide insights into the challenges of applying unlearning in music generation, offering a foundational analysis for future work on the application of unlearning to music generative models.
[NLP-39] MSLEF: Multi-Segment LLM Ensemble Finetuning in Recruitment AICCSA2025
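[Quick read]: This paper addresses the accuracy of resume parsing in recruitment automation, where varied resume formats and structures limit single-model systems. The key to the solution is MSLEF, a multi-segment ensemble framework whose core innovation is a segment-aware architecture: fine-tuned large language models (LLMs) each specialize in a resume segment, and their outputs are combined by weighted voting with field-specific weights, improving parsing accuracy and generalization. The framework uses Gemini-2.5-Flash as a high-level aggregator for complex sections together with Gemma 9B, LLaMA 3.1 8B, and Phi-4 14B, and outperforms the best single model on Exact Match (EM), F1, BLEU, ROUGE, and Recruitment Similarity (RS), by up to +7% in RS.
Link: https://arxiv.org/abs/2509.06200
Authors: Omar Walid,Mohamed T. Younes,Khaled Shaban,Mai Hassan,Ali Hamdi
Affiliations: MSA University; Qatar University
Categories: Computation and Language (cs.CL)
Comments: Accepted in AICCSA 2025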
Abstract:This paper presents MSLEF, a multi-segment ensemble framework that employs LLM fine-tuning to enhance resume parsing in recruitment automation. It integrates fine-tuned Large Language Models (LLMs) using weighted voting, with each model specializing in a specific resume segment to boost accuracy. Building on MLAR, MSLEF introduces a segment-aware architecture that leverages field-specific weighting tailored to each resume part, effectively overcoming the limitations of single-model systems by adapting to diverse formats and structures. The framework incorporates Gemini-2.5-Flash LLM as a high-level aggregator for complex sections and utilizes Gemma 9B, LLaMA 3.1 8B, and Phi-4 14B. MSLEF achieves significant improvements in Exact Match (EM), F1 score, BLEU, ROUGE, and Recruitment Similarity (RS) metrics, outperforming the best single model by up to +7% in RS. Its segment-aware design enhances generalization across varied resume layouts, making it highly adaptable to real-world hiring scenarios while ensuring precise and reliable candidate representation.
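Weighted voting with field-specific weights, as this abstract describes it, might look roughly like the sketch below. The model names, the fields, the weights, and the tie-breaking rule are all hypothetical; the abstract does not say how MSLEF sets its weights or aggregates free-text fields.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """For each field, pick the value with the largest total model weight.

    predictions: {model_name: {field: value}}
    weights:     {field: {model_name: weight}}; a missing weight defaults to 1.0
    """
    result = {}
    fields = {f for pred in predictions.values() for f in pred}
    for field in sorted(fields):
        scores = defaultdict(float)
        for model, pred in predictions.items():
            if field in pred:
                scores[pred[field]] += weights.get(field, {}).get(model, 1.0)
        result[field] = max(scores, key=scores.get)
    return result

# Hypothetical outputs from three fine-tuned parsers on one resume.
predictions = {
    "model_a": {"name": "Jane Doe", "skills": "Python"},
    "model_b": {"name": "Jane Doe", "skills": "Python, SQL"},
    "model_c": {"name": "J. Doe", "skills": "Python, SQL"},
}
# Hypothetical per-field weights: model_a is trusted most on the skills segment.
weights = {
    "name": {"model_a": 0.3, "model_b": 0.3, "model_c": 0.4},
    "skills": {"model_a": 0.5, "model_b": 0.2, "model_c": 0.2},
}
merged = weighted_vote(predictions, weights)
print(merged)
```

Here a model that is outvoted on one field can still win another where its weight is higher, which is the point of making the weights segment-specific.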
[NLP-40] Augmented Fine-Tuned LLMs for Enhanced Recruitment Automation AICCSA2025
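[Quick read]: This paper addresses the limited accuracy and weak generalization of generic large language models (LLMs) on resume parsing and candidate-job matching in recruitment automation. The key to the solution is to build a synthetic dataset in a standardized JSON format, augment it with real resumes parsed into the same structure by DeepSeek, a high-parameter model, to increase the diversity and realism of the training data, and then fine-tune base models such as Phi-4 on it. Experiments show clear gains in exact match, F1, BLEU, ROUGE, and overall similarity; the fine-tuned Phi-4 model reaches an F1 score of 90.62%.
Link: https://arxiv.org/abs/2509.06196
Authors: Mohamed T. Younes,Omar Walid,Khaled Shaban,Ali Hamdi,Mai Hassan
Affiliations: MSA University; Qatar University
Categories: Computation and Language (cs.CL)
Comments: Accepted in AICCSA 2025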
Abstract:This paper presents a novel approach to recruitment automation in which Large Language Models (LLMs) are fine-tuned to improve accuracy and efficiency. Building upon our previous work on the Multilayer Large Language Model-Based Robotic Process Automation Applicant Tracking (MLAR) system, this work introduces a methodology for training LLMs fine-tuned specifically for recruitment tasks. The proposed framework addresses the limitations of generic LLMs by creating a synthetic dataset that uses a standardized JSON format, which helps ensure consistency and scalability. In addition to the synthetic dataset, resumes were parsed using DeepSeek, a high-parameter LLM, into the same structured JSON format and placed in the training set to improve data diversity and realism. Through experimentation, we demonstrate significant improvements in performance metrics, such as exact match, F1 score, BLEU score, ROUGE score, and overall similarity compared to base models and other state-of-the-art LLMs. In particular, the fine-tuned Phi-4 model achieved the highest F1 score of 90.62%, indicating exceptional precision and recall in recruitment tasks. This study highlights the potential of fine-tuned LLMs to revolutionize recruitment workflows by providing more accurate candidate-job matching.
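For readers unfamiliar with the exact match and F1 metrics named in this abstract, a common token-level formulation is sketched below. The abstract does not give the paper's own definitions, so the case-insensitive, whitespace-tokenized versions here are an assumption, and the example strings are invented.

```python
from collections import Counter

def exact_match(pred, gold):
    """1.0 if the strings are equal after trimming and lowercasing, else 0.0."""
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    """Harmonic mean of token-level precision and recall (bag-of-words overlap)."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)

# Hypothetical parsed field versus reference field.
em = exact_match("Senior Data Scientist", "Data Scientist")
f1 = token_f1("Senior Data Scientist", "Data Scientist")
print(em, round(f1, 2))  # 0.0 0.8
```

The example shows why both metrics are usually reported together: a prediction with one extra token fails exact match outright but still earns partial credit under F1.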
[NLP-41] Language Bias in Information Retrieval: The Nature of the Beast and Mitigation Methods EMNLP
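[Quick read]: This paper addresses language fairness in multilingual information retrieval (MLIR) systems: queries in different languages with identical semantics should yield equivalent rankings over the same multilingual documents. The study finds intrinsic language bias in current MLIR techniques, with notable disparities across retrieval methods (traditional methods and a DPR neural ranker based on mBERT and XLM-R). The key to the solution is LaKDA, a new loss function for training neural MLIR models that mitigates language bias and improves cross-lingual retrieval fairness.
Link: https://arxiv.org/abs/2509.06195
Authors: Jinrui Yang,Fan Jiang,Timothy Baldwin
Affiliations: School of Computing & Information Systems, The University of Melbourne; Mohamed bin Zayed University of Artificial Intelligence, UAE
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at EMNLP MRL 2024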
Abstract:Language fairness in multilingual information retrieval (MLIR) systems is crucial for ensuring equitable access to information across diverse languages. This paper sheds light on the issue, based on the assumption that queries in different languages, but with identical semantics, should yield equivalent ranking lists when retrieving on the same multilingual documents. We evaluate the degree of fairness using both traditional retrieval methods, and a DPR neural ranker based on mBERT and XLM-R. Additionally, we introduce 'LaKDA', a novel loss designed to mitigate language biases in neural MLIR approaches. Our analysis exposes intrinsic language biases in current MLIR technologies, with notable disparities across the retrieval methods, and the effectiveness of LaKDA in enhancing language fairness.
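The fairness assumption in this abstract (same-meaning queries in different languages should produce equivalent rankings) can be checked with a simple agreement score. The sketch below uses top-k Jaccard overlap, which is a stand-in chosen here for illustration; the abstract does not name the paper's fairness measure, and the document ids and rankings are invented.

```python
from itertools import combinations

def topk_jaccard(ranking_a, ranking_b, k=10):
    """Jaccard overlap between the top-k document ids of two rankings."""
    a, b = set(ranking_a[:k]), set(ranking_b[:k])
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def mean_pairwise_agreement(rankings_by_lang, k=10):
    """Average top-k overlap over all language pairs for one query."""
    pairs = list(combinations(sorted(rankings_by_lang), 2))
    if not pairs:
        return 1.0  # a single language trivially agrees with itself
    total = sum(
        topk_jaccard(rankings_by_lang[x], rankings_by_lang[y], k) for x, y in pairs
    )
    return total / len(pairs)

# Hypothetical rankings for one query issued in three languages.
rankings = {
    "en": ["d1", "d2", "d3", "d4"],
    "de": ["d2", "d1", "d5", "d3"],
    "zh": ["d1", "d2", "d3", "d6"],
}
score = mean_pairwise_agreement(rankings, k=3)
print(round(score, 3))
```

A score of 1.0 would mean every language retrieves the same top-k set; lower values indicate that the query language changes what the system returns. Jaccard ignores order within the top k, so a rank correlation could be substituted where ordering matters.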
[NLP-42] Understanding the Influence of Synthetic Data for Text Embedders ACL
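[Quick read]: This paper addresses the unclear role of synthetic data in improving the generalization of general-purpose text embedders. Because no synthetic dataset was publicly available, its effect could not be studied systematically. The key to the solution is twofold: first, the authors reproduce and publicly release the synthetic data proposed by Wang et al. (Mistral-E5), enabling reproducible study; second, a detailed analysis shows that the gains from synthetic data are sparse and highly localized to individual datasets, with trade-offs between tasks. These findings expose the limits of current synthetic data approaches for building general-purpose embedders and challenge the notion that training on synthetic data yields more robust embedding models across tasks.
Link: https://arxiv.org/abs/2509.06184
Authors: Jacob Mitchell Springer,Vaibhav Adlakha,Siva Reddy,Aditi Raghunathan,Marius Mosbach
Affiliations: Carnegie Mellon University; Mila – Quebec AI Institute, McGill University; Canada CIFAR AI Chair
Categories: Computation and Language (cs.CL)
Comments: ACL Findings 2025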
Abstract:Recent progress in developing general purpose text embedders has been driven by training on ever-growing corpora of synthetic LLM-generated data. Nonetheless, no publicly available synthetic dataset exists, posing a barrier to studying its role for generalization. To address this issue, we first reproduce and publicly release the synthetic data proposed by Wang et al. (Mistral-E5). Our synthetic data is high quality and leads to consistent improvements in performance. Next, we critically examine where exactly synthetic data improves model generalization. Our analysis reveals that benefits from synthetic data are sparse and highly localized to individual datasets. Moreover, we observe trade-offs between performance on different categories: data that benefits one task degrades performance on another. Our findings highlight the limitations of current synthetic data approaches for building general-purpose embedders and challenge the notion that training on synthetic data leads to more robust embedding models across tasks.
[NLP-43] From Long to Short: LLMs Excel at Trimming Own Reasoning Chains
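[Quick read]: This paper addresses "overthinking" in large reasoning models (LRMs): on simple tasks they tend to produce long, convoluted reasoning traces with frequent strategy switching, which hurts interpretability. The key to the solution is EDIT (Efficient Dynamic Inference Trimming), a test-time scaling method that uses constraint-guided generation and jointly tracks how the length and answer distributions change under different constraints, so that it can select the shortest correct reasoning path and balance conciseness with correctness.
Link: https://arxiv.org/abs/2509.06174
Authors: Wei Han,Geng Zhan,Sicheng Yu,Chenyu Wang,Bryan Hooi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 21 pages, 5 figures, 7 tables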
Abstract:O1/R1 style large reasoning models (LRMs) signal a substantial leap forward over conventional instruction-following LLMs. By applying test-time scaling to generate extended reasoning paths, they establish many SOTAs across a wide range of complex reasoning tasks. However, recent studies show that LRMs are prone to suffer from overthinking – the tendency to overcomplicate simple problems, leading to excessive strategy switching and long, convoluted reasoning traces that hinder their interpretability. To mitigate this issue, we conduct a systematic investigation into the reasoning efficiency of a broad set of LRMs and uncover a common dilemma: the difficulty in balancing multiple generation objectives such as correctness and brevity. Based on this discovery, we propose a test-time scaling method, EDIT (Efficient Dynamic Inference Trimming), which efficiently guides LRMs to identify the shortest correct reasoning paths at test time. EDIT employs constraint-guided generation while jointly tracking length and answer distributions under varying constraints, allowing it to select responses that strike an optimal balance between conciseness and correctness. Extensive experiments across diverse models and datasets show that EDIT substantially enhances the reasoning efficiency, producing compact yet informative outputs that improve readability and user experience.
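The selection idea in this abstract (prefer the shortest response that still reaches the right answer) can be illustrated with a much simpler stand-in. The sketch below is not EDIT itself: it assumes that the most frequent answer across sampled responses is a usable proxy for correctness, and the candidate responses are invented.

```python
from collections import Counter

def pick_shortest_consistent(candidates):
    """Return the shortest candidate whose answer matches the most frequent answer.

    candidates: list of (reasoning_text, answer) pairs, for example sampled
    under different length constraints.
    """
    counts = Counter(answer for _, answer in candidates)
    top_answer, _ = counts.most_common(1)[0]
    agreeing = [c for c in candidates if c[1] == top_answer]
    return min(agreeing, key=lambda c: len(c[0].split()))

# Hypothetical responses to "What is 2 + 2?".
candidates = [
    ("First recall that addition combines quantities, so 2 plus 2 gives 4", "4"),
    ("2 + 2 = 4", "4"),
    ("2 + 2 = 5", "5"),
]
best = pick_shortest_consistent(candidates)
print(best)
```

The short but wrong response is never selected because length is compared only among responses that agree with the majority answer, which is one simple way to trade brevity against correctness.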
[NLP-44] Benchmarking Gender and Political Bias in Large Language Models
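[Quick read]: This paper addresses fairness and accountability of large language models (LLMs) in politically sensitive settings, in particular systematic bias when identifying the gender of parliament members and predicting their votes. The key to the solution is the publicly released EuroParlVote dataset, which links European Parliament debate speeches to roll-call vote outcomes and includes rich demographic metadata for each member (gender, age, country, and political group), giving a structured benchmark for measuring bias. On gender classification and vote prediction, the evaluated LLMs frequently misclassify female members and perform worse on far-left and far-right groups, while proprietary models such as GPT-4o are more robust and fair than open-weight alternatives, pointing to the influence of the model and its training data on fairness in political contexts.
Link: https://arxiv.org/abs/2509.06164
Authors: Jinrui Yang,Xudong Han,Timothy Baldwin
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: The 8th International Conference on Natural Language and Speech Processing (Oral)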
Abstract:We introduce EuroParlVote, a novel benchmark for evaluating large language models (LLMs) in politically sensitive contexts. It links European Parliament debate speeches to roll-call vote outcomes and includes rich demographic metadata for each Member of the European Parliament (MEP), such as gender, age, country, and political group. Using EuroParlVote, we evaluate state-of-the-art LLMs on two tasks – gender classification and vote prediction – revealing consistent patterns of bias. We find that LLMs frequently misclassify female MEPs as male and demonstrate reduced accuracy when simulating votes for female speakers. Politically, LLMs tend to favor centrist groups while underperforming on both far-left and far-right ones. Proprietary models like GPT-4o outperform open-weight alternatives in terms of both robustness and fairness. We release the EuroParlVote dataset, code, and demo to support future research on fairness and accountability in NLP within political contexts.
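Bias findings of the kind reported in this abstract are typically read off per-group accuracy. The sketch below shows that bookkeeping only; the record layout, field names, and the four toy records are hypothetical and are not drawn from EuroParlVote.

```python
from collections import defaultdict

def accuracy_by_group(records, group_key):
    """Accuracy of 'pred' against 'gold', broken down by a demographic field."""
    hits, totals = defaultdict(int), defaultdict(int)
    for record in records:
        group = record[group_key]
        totals[group] += 1
        hits[group] += int(record["pred"] == record["gold"])
    return {group: hits[group] / totals[group] for group in totals}

# Toy vote-prediction records, invented purely to exercise the function.
records = [
    {"gender": "female", "gold": "for", "pred": "for"},
    {"gender": "female", "gold": "against", "pred": "for"},
    {"gender": "male", "gold": "for", "pred": "for"},
    {"gender": "male", "gold": "against", "pred": "against"},
]
per_gender = accuracy_by_group(records, "gender")
print(per_gender)  # {'female': 0.5, 'male': 1.0}
```

The same function can be called with another metadata field, such as a political group column, to compare accuracy across groups.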
[NLP-45] Reverse-Engineered Reasoning for Open-Ended Generation
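[Quick read]: This paper addresses the lack of effective deep reasoning in generative AI for open-ended, creative tasks. The two dominant approaches fall short here: reinforcement learning (RL) is hard to train without clear reward signals and high-quality reward models, while instruction distillation is expensive and capped by the teacher model's capabilities. The paper proposes REverse-Engineered Reasoning (REER), a paradigm that builds the reasoning process backwards: starting from known-good solutions, it computationally discovers the latent, step-by-step deep reasoning trajectory that could have produced them. With this gradient-free, scalable approach the authors build and open-source DeepWriting-20K (20,000 deep reasoning trajectories for open-ended tasks) and train DeepWriter-8B, which surpasses open-source baselines and is competitive with, and at times superior to, proprietary models such as GPT-4o and Claude 3.5.
Link: https://arxiv.org/abs/2509.06160
Authors: Haozhe Wang,Haoran Que,Qixin Xu,Minghao Liu,Wangchunshu Zhou,Jiazhan Feng,Wanjun Zhong,Wei Ye,Tong Yang,Wenhao Huang,Ge Zhang,Fangzhen Lin
Affiliations: ByteDance
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint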
Abstract:While the "deep reasoning" paradigm has spurred significant advances in verifiable domains like mathematics, its application to open-ended, creative generation remains a critical challenge. The two dominant methods for instilling reasoning – reinforcement learning (RL) and instruction distillation – falter in this area; RL struggles with the absence of clear reward signals and high-quality reward models, while distillation is prohibitively expensive and capped by the teacher model's capabilities. To overcome these limitations, we introduce REverse-Engineered Reasoning (REER), a new paradigm that fundamentally shifts the approach. Instead of building a reasoning process "forwards" through trial-and-error or imitation, REER works "backwards" from known-good solutions to computationally discover the latent, step-by-step deep reasoning process that could have produced them. Using this scalable, gradient-free approach, we curate and open-source DeepWriting-20K, a large-scale dataset of 20,000 deep reasoning trajectories for open-ended tasks. Our model, DeepWriter-8B, trained on this data, not only surpasses strong open-source baselines but also achieves performance competitive with, and at times superior to, leading proprietary models like GPT-4o and Claude 3.5.
[NLP-46] Orthogonal Low-rank Adaptation in Lie Groups for Continual Learning of Large Language Models
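[Quick read]: This paper addresses catastrophic forgetting in large language models (LLMs) under sequential multi-task learning. Parameter regularization methods such as O-LoRA and N-LoRA reduce task interference by enforcing orthogonality between low-rank subspaces, but they overlook that conventional additive fine-tuning disrupts the intrinsic geometric structure of LLM parameters, which limits performance. The key to the solution is to bring Lie group theory into LLM fine-tuning: Orthogonal Low-rank Adaptation in Lie Groups (OLieRA) uses multiplicative updates to preserve parameter geometry while applying orthogonality constraints to task subspaces, retaining model knowledge more effectively and improving multi-task performance.
Link: https://arxiv.org/abs/2509.06100
Authors: Kefan Cao,Shuaicheng Wu
Affiliations: Yingcai Honor College, University of Electronic Science and Technology of China
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 3 figures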
Abstract:Large language models (LLMs) are prone to catastrophic forgetting in sequential multi-task settings. Parameter regularization methods such as O-LoRA and N-LoRA alleviate task interference by enforcing low-rank subspace orthogonality, but they overlook the fact that conventional additive fine-tuning disrupts the intrinsic geometric structure of LLM parameters, limiting performance. Our key insight is that the parameter space of LLMs possesses a geometric structure, which must be preserved in addition to enforcing orthogonality. Based on this, we propose Orthogonal Low-rank Adaptation in Lie Groups (OLieRA), which introduces Lie group theory into LLM fine-tuning: leveraging multiplicative updates to preserve parameter geometry while applying orthogonality constraints to task subspaces. Experiments demonstrate that OLieRA achieves state-of-the-art results on the Standard CL benchmark and remains among the top-performing methods in the Large Number of Tasks setting.
[NLP-47] Language Native Lightly Structured Databases for Large Language Model Driven Composite Materials Research
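[Quick read]: This paper addresses the fact that knowledge in traditional materials research is expressed mainly as narrative language, which databases and machine learning (ML) cannot readily exploit; for boron nitride nanosheet (BNNS) polymer thermally conductive composites in particular, the literature is loosely structured and lacks a computable representation. The key to the solution is a language-native heterogeneous database that extracts evidence-linked snippets across preparation, characterization, theory-computation, and mechanistic reasoning, and supports composite queries combining semantic retrieval, keywords, and value filters. This enables high-fidelity retrieval-augmented generation (RAG) and tool-augmented agents that deliver actionable standard operating procedures (SOPs), providing a language-rich foundation for LLM-driven materials discovery.
Link: https://arxiv.org/abs/2509.06093
Authors: Yuze Liu,Zhaoyuan Zhang,Xiangsheng Zeng,Yihe Zhang,Leping Yu,Lejia Wang,Xi Yu
Affiliations: Unknown
Categories: Databases (cs.DB); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: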
Abstract:Chemical and materials research has traditionally relied heavily on knowledge narrative, with progress often driven by language-based descriptions of principles, mechanisms, and experimental experiences, rather than tables, limiting what conventional databases and ML can exploit. We present a language-native database for boron nitride nanosheet (BNNS) polymer thermally conductive composites that captures lightly structured information from papers across preparation, characterization, theory-computation, and mechanistic reasoning, with evidence-linked snippets. Records are organized in a heterogeneous database and queried via composite retrieval with semantics, keywords, and value filters. The system can synthesize literature into accurate, verifiable, and expert-style guidance. This substrate enables high-fidelity, efficient Retrieval-Augmented Generation (RAG) and tool-augmented agents to interleave retrieval with reasoning and deliver actionable SOPs. The framework supplies the language-rich foundation required for LLM-driven materials discovery.
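Composite retrieval "with semantics, keywords, and value filters" could be sketched as below. This is an illustration under assumptions, not the authors' system: the record schema, the two-dimensional toy embeddings, and the numeric `value` field are placeholders with no physical meaning.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def composite_search(records, query_vec, keyword=None, min_value=None):
    """Filter by keyword and numeric threshold, then rank by semantic similarity."""
    hits = []
    for record in records:
        if keyword and keyword.lower() not in record["snippet"].lower():
            continue
        if min_value is not None and record["value"] < min_value:
            continue
        hits.append((cosine(query_vec, record["embedding"]), record["id"]))
    return [record_id for _, record_id in sorted(hits, reverse=True)]

# Placeholder records; the snippets, values, and embeddings are invented.
records = [
    {"id": "rec1", "snippet": "BNNS epoxy, hot pressing", "value": 2.5, "embedding": [1.0, 0.0]},
    {"id": "rec2", "snippet": "BNNS film, vacuum filtration", "value": 8.0, "embedding": [0.6, 0.8]},
    {"id": "rec3", "snippet": "graphene epoxy", "value": 5.0, "embedding": [0.0, 1.0]},
]
ranked = composite_search(records, [0.0, 1.0], keyword="BNNS", min_value=2.0)
print(ranked)  # ['rec2', 'rec1']
```

Applying the hard filters before the similarity ranking keeps the semantic step from surfacing records that violate an explicit constraint, which matters when results feed a retrieval-augmented generation step.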
[NLP-48] Multimodal Reasoning for Science: Technical Report and 1st Place Solution to the ICML 2025 SeePhys Challenge
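[Quick read]: This paper addresses a core challenge in multimodal reasoning: existing models perform poorly when complex reasoning requires combining visual and textual information. Even advanced text reasoning models such as GPT-o3 show clear limitations on tasks involving images. The key to the solution is a caption-assisted reasoning framework that aligns and integrates visual and textual information to improve cross-modal reasoning. The method took 1st place in the SeePhys challenge of the ICML 2025 AI for Math Workshop and generalizes well on the MathVerse geometric reasoning benchmark.
Link: https://arxiv.org/abs/2509.06079
Authors: Hao Liang,Ruitao Wu,Bohan Zeng,Junbo Niu,Wentao Zhang,Bin Dong
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: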
Abstract:Multimodal reasoning remains a fundamental challenge in artificial intelligence. Despite substantial advances in text-based reasoning, even state-of-the-art models such as GPT-o3 struggle to maintain strong performance in multimodal scenarios. To address this gap, we introduce a caption-assisted reasoning framework that effectively bridges visual and textual modalities. Our approach achieved 1st place in the ICML 2025 AI for Math Workshop & Challenge 2: SeePhys, highlighting its effectiveness and robustness. Furthermore, we validate its generalization on the MathVerse benchmark for geometric reasoning, demonstrating the versatility of our method. Our code is publicly available at this https URL.
[NLP-49] Multimodal Fine-grained Context Interaction Graph Modeling for Conversational Speech Synthesis EMNLP2025
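[Quick Read]: This paper addresses the insufficient modeling of fine-grained semantic and prosodic interactions in the multimodal dialogue history (MDH) used by current Conversational Speech Synthesis (CSS) systems. Existing methods consider only utterance-level interaction features and overlook the fine-grained links between word-level semantics and prosody, which limits how natural the prosody of the synthesized speech can be. The key to the solution is MFCIG-CSS, a new architecture based on a Multimodal Fine-grained Context Interaction Graph. It builds two dedicated fine-grained interaction graphs, a semantic interaction graph and a prosody interaction graph, to explicitly encode word-level semantics, prosody, and their influence on subsequent utterances, which improves the conversational prosody of the synthesized speech.
Link: https://arxiv.org/abs/2509.06074
Authors: Zhenqi Jia, Rui Liu, Berrak Sisman, Haizhou Li
Affiliations: Inner Mongolia University; Center for Language and Speech Processing (CLSP), Johns Hopkins University; School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen
Categories: Computation and Language (cs.CL)
Comments: Accepted by EMNLP 2025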
Abstract:Conversational Speech Synthesis (CSS) aims to generate speech with natural prosody by understanding the multimodal dialogue history (MDH). The latest work predicts the accurate prosody expression of the target utterance by modeling the utterance-level interaction characteristics of MDH and the target utterance. However, MDH contains fine-grained semantic and prosody knowledge at the word level. Existing methods overlook the fine-grained semantic and prosodic interaction modeling. To address this gap, we propose MFCIG-CSS, a novel Multimodal Fine-grained Context Interaction Graph-based CSS system. Our approach constructs two specialized multimodal fine-grained dialogue interaction graphs: a semantic interaction graph and a prosody interaction graph. These two interaction graphs effectively encode interactions between word-level semantics, prosody, and their influence on subsequent utterances in MDH. The encoded interaction features are then leveraged to enhance synthesized speech with natural conversational prosody. Experiments on the DailyTalk dataset demonstrate that MFCIG-CSS outperforms all baseline models in terms of prosodic expressiveness. Code and speech samples are available at this https URL.
[NLP-50] KatotohananQA: Evaluating Truthfulness of Large Language Models in Filipino
[Quick Read]: This paper addresses the lack of reliable truthfulness evaluation for Large Language Models (LLMs) in low-resource languages. Existing benchmarks such as TruthfulQA are mostly limited to English, so they cannot fully measure factual accuracy in multilingual settings. The key to the solution is KatotohananQA, a Filipino version of the TruthfulQA benchmark, used to evaluate seven free-tier proprietary models in a binary-choice framework. The evaluation reveals a significant truthfulness gap between English and Filipino and identifies question types, categories, and topics that differ in robustness to cross-lingual transfer, providing empirical evidence and direction for improving the fairness and reliability of multilingual LLMs.
Link: https://arxiv.org/abs/2509.06065
Authors: Lorenzo Alfred Nery, Ronald Dawson Catignas, Thomas James Tiam-Lee
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 14 pages, 1 figure, 9 tables, 1 listing. To appear in Proceedings of NLPIR 2025
Abstract:Large Language Models (LLMs) achieve remarkable performance across various tasks, but their tendency to produce hallucinations limits reliable adoption. Benchmarks such as TruthfulQA have been developed to measure truthfulness, yet they are primarily available in English, leaving a gap in evaluating LLMs in low-resource languages. To address this, we present KatotohananQA, a Filipino translation of the TruthfulQA benchmark. Seven free-tier proprietary models were assessed using a binary-choice framework. Findings show a significant performance gap between English and Filipino truthfulness, with newer OpenAI models (GPT-5 and GPT-5 mini) demonstrating strong multilingual robustness. Results also reveal disparities across question characteristics, suggesting that some question types, categories, and topics are less robust to multilingual transfer, which highlights the need for broader multilingual evaluation to ensure fairness and reliability in LLM usage.
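[NLP-51] TSPC: A Two-Stage Phoneme-Centric Architecture for code-switching Vietnamese-English Speech Recognition
[Quick Read]: This paper addresses the drop in automatic speech recognition (ASR) performance in code-switching (CS) scenarios, in particular the recognition difficulties that arise in mixed Vietnamese-English speech from differing phonological features and similar-sounding pronunciations. The key to the solution is a Two-Stage Phoneme-Centric model (TSPC) that uses an extended Vietnamese phoneme set as an intermediate representation for mixed-lingual modeling. Through phoneme-level adaptation and language conversion, the architecture improves ASR performance in complex Vietnamese-English CS scenarios while using fewer training resources, reaching a word error rate (WER) of 20.8%.
Link: https://arxiv.org/abs/2509.05983
Authors: Minh N. H. Nguyen, Anh Nguyen Tran, Dung Truong Dinh, Nam Van Vo
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: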
Abstract:Code-switching (CS) presents a significant challenge for general Auto-Speech Recognition (ASR) systems. Existing methods often fail to capture the subtle phonological shifts inherent in CS scenarios. The challenge is particularly difficult for language pairs like Vietnamese and English, where both distinct phonological features and the ambiguity arising from similar sound recognition are present. In this paper, we propose a novel architecture for Vietnamese-English CS ASR, a Two-Stage Phoneme-Centric model (TSPC). The TSPC employs a phoneme-centric approach, built upon an extended Vietnamese phoneme set as an intermediate representation to facilitate mixed-lingual modeling. Experimental results demonstrate that TSPC consistently outperforms existing baselines, including PhoWhisper-base, in Vietnamese-English CS ASR, achieving a significantly lower word error rate of 20.8% with reduced training resources. Furthermore, the phonetic-based two-stage architecture enables phoneme adaptation and language conversion to enhance ASR performance in complex CS Vietnamese-English ASR scenarios.
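The 20.8% figure above is a word error rate. As a reminder of what that metric measures, here is a minimal sketch of the standard WER definition (word-level edit distance divided by reference length). The function name and the example sentences are illustrative assumptions, not taken from the paper, and real evaluations usually normalize text before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical code-switched example: one substitution and one deletion over six words.
wer = word_error_rate("toi muon book ve di Hanoi", "toi muon buc ve Hanoi")
print(f"WER = {wer:.3f}")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.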
[NLP-52] Accelerating Large Language Model Inference via Early-Exiting Algorithms
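[Quick Read]: This paper addresses the high computational cost that limits practical deployment of Large Language Models (LLMs). Adaptive computation methods such as early-exiting can save computation, but their per-token dynamism creates system-level bottlenecks in batched inference that can actually lower throughput. The key to the solution is co-designing algorithms and model architectures to balance dynamism and efficiency. The work first proposes an efficient parallel decoding mechanism to reduce the overhead of early exiting. It then uses deep parameter sharing to build compact, parameter-efficient architectures that inherently mitigate the synchronization issues of dynamic inference. Finally, it introduces a unified framework in which lightweight routers are pretrained to assign each token an optimal recursion depth, optimizing adaptive computation and parameter efficiency within a single model and establishing a new efficiency-performance Pareto frontier.
Link: https://arxiv.org/abs/2509.05915
Authors: Sangmin Bae
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: PhD Dissertation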
Abstract:Large language models have achieved remarkable capabilities, but their practical deployment is hindered by significant computational costs. While adaptive computation methods like early-exiting promise to reduce these costs, they introduce a fundamental conflict: the per-token dynamism intended to save computation often creates system-level bottlenecks that can paradoxically reduce throughput in batched inference. This dissertation resolves this conflict by co-designing adaptive algorithms and model architectures to strike an optimal balance between dynamism and efficiency. To this end, our work first addresses critical sources of overhead in conventional early-exiting by proposing an efficient parallel decoding mechanism. We then show that deep parameter sharing provides an architectural foundation that not only yields compact, parameter-efficient models but also inherently mitigates the critical synchronization issues affecting dynamic inference. Finally, this work presents a unified framework where lightweight routers are pretrained to dynamically assign an optimal recursion depth for each token. This approach establishes a new Pareto frontier between efficiency and performance by effectively optimizing for both adaptive computation and parameter efficiency within a single model.
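To make the idea of early exiting concrete, the following is a minimal sketch of a generic confidence-threshold exit rule. It is not the dissertation's algorithm: the toy layers, the identity classifier head, and the threshold value are all assumptions chosen for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_forward(
    hidden: Vector,
    layers: Sequence[Callable[[Vector], Vector]],
    classifier: Callable[[Vector], Vector],
    threshold: float = 0.9,
) -> Tuple[int, int]:
    """Apply layers in order and stop once the top probability reaches the threshold.

    Returns (predicted_index, number_of_layers_used).
    """
    if not layers:
        raise ValueError("at least one layer is required")
    probs: List[float] = []
    layers_used = 0
    for layers_used, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        probs = softmax(classifier(hidden))
        if max(probs) >= threshold:
            break  # confident enough: skip the remaining layers
    return probs.index(max(probs)), layers_used


# Toy setup (hypothetical): every "layer" doubles the hidden state,
# and the classifier head simply returns the hidden state as logits.
def double(h: Vector) -> Vector:
    return [2.0 * x for x in h]


def identity_head(h: Vector) -> Vector:
    return h


token, depth = early_exit_forward([0.5, 0.0], [double] * 4, identity_head, threshold=0.9)
print(f"predicted index {token} after {depth} of 4 layers")
```

In a batch, different tokens would stop at different depths under such a rule, which is the kind of per-token dynamism the abstract identifies as a source of synchronization overhead.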
[NLP-53] Enhancing the Robustness of Contextual ASR to Varying Biasing Information Volumes Through Purified Semantic Correlation Joint Modeling
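[Quick Read]: This paper addresses the performance degradation of cross-attention in contextual ASR when the volume of biasing information varies, especially when the biasing list grows much longer. The core observation is that, however long the list is, only a small amount of biasing information is most relevant to a given ASR intermediate representation, and conventional methods do not select this information effectively, which hurts robustness. The key to the solution is purified semantic correlation joint modeling (PSC-Joint): it defines and computes semantic correlations at three granularities from coarse to fine (list level, phrase level, and token level) and models them jointly to extract their intersection, focusing on the most relevant biasing information. A purification mechanism based on a grouped-and-competitive strategy reduces the computational cost of the joint modeling.
Link: https://arxiv.org/abs/2509.05908
Authors: Yue Gu, Zhihao Du, Ying Shi, Shiliang Zhang, Qian Chen, Jiqing Han
Affiliations: Harbin Institute of Technology
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted by IEEE Transactions on Audio, Speech and Language Processing, 2025 ( this https URL ). DOI: https://doi.org/10.1109/TASLPRO.2025.3606198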
Abstract:Recently, cross-attention-based contextual automatic speech recognition (ASR) models have made notable advancements in recognizing personalized biasing phrases. However, the effectiveness of cross-attention is affected by variations in biasing information volume, especially when the length of the biasing list increases significantly. We find that, regardless of the length of the biasing list, only a limited amount of biasing information is most relevant to a specific ASR intermediate representation. Therefore, by identifying and integrating the most relevant biasing information rather than the entire biasing list, we can alleviate the effects of variations in biasing information volume for contextual ASR. To this end, we propose a purified semantic correlation joint modeling (PSC-Joint) approach. In PSC-Joint, we define and calculate three semantic correlations between the ASR intermediate representations and biasing information from coarse to fine: list-level, phrase-level, and token-level. Then, the three correlations are jointly modeled to produce their intersection, so that the most relevant biasing information across various granularities is highlighted and integrated for contextual recognition. In addition, to reduce the computational cost introduced by the joint modeling of three semantic correlations, we also propose a purification mechanism based on a grouped-and-competitive strategy to filter out irrelevant biasing phrases. Compared with baselines, our PSC-Joint approach achieves average relative F1 score improvements of up to 21.34% on AISHELL-1 and 28.46% on KeSpeech, across biasing lists of varying lengths.
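[NLP-54] Let's Roleplay: Examining LLM Alignment in Collaborative Dialogues
[Quick Read]: This paper addresses the problem that the behavior of Large Language Models (LLMs) in multiturn, multiparty collaboration is hard to predict, validate, and verify, so reliability and consistency are lacking in complex group interactions. The core challenge is that existing alignment methods mostly assume a single user and do not account for the dynamics of long-horizon multiparty interaction. The key to the solution is a novel counterfactual evaluation framework that quantifies how interventions by friction agents change the trajectory of group collaboration and belief alignment. Roleplay experiments show that a friction-aware alignment approach significantly outperforms common alignment baselines in helping groups converge to common ground and in the correctness of task outcomes.
Link: https://arxiv.org/abs/2509.05882
Authors: Abhijnan Nath, Carine Graff, Nikhil Krishnaswamy
Affiliations: Colorado State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: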
Abstract:As Large Language Models (LLMs) integrate into diverse workflows, they are increasingly being considered “collaborators” with humans. If such AI collaborators are to be reliable, their behavior over multiturn interactions must be predictable, validated and verified before deployment. Common alignment techniques are typically developed under simplified single-user settings and do not account for the dynamics of long-horizon multiparty interactions. This paper examines how different alignment methods affect LLM agents’ effectiveness as partners in multiturn, multiparty collaborations. We study this question through the lens of friction agents that intervene in group dialogues to encourage the collaborative group to slow down and reflect upon their reasoning for deliberative decision-making. Using a roleplay methodology, we evaluate interventions from differently-trained friction agents in collaborative task conversations. We propose a novel counterfactual evaluation framework that quantifies how friction interventions change the trajectory of group collaboration and belief alignment. Our results show that a friction-aware approach significantly outperforms common alignment baselines in helping both convergence to a common ground, or agreed-upon task-relevant propositions, and correctness of task outcomes.
[NLP-55] MedFactEval and MedAgentBrief: A Framework and Workflow for Generating and Evaluating Factual Clinical Summaries
[Quick Read]: This paper addresses the difficulty of evaluating factual accuracy in clinical text produced by generative AI, a major barrier to its adoption in healthcare. Manual expert review is too costly and slow for continuous quality assurance. The solution consists of two complementary contributions. The first is MedFactEval, an evaluation framework in which clinicians define key facts and an "LLM Jury" (a majority vote across multiple LLMs) automatically judges whether generated text includes them. The second is MedAgentBrief, a model-agnostic, multi-step workflow for generating high-quality, factually accurate discharge summaries. Empirically, the MedFactEval LLM Jury reaches almost perfect agreement with a gold-standard reference built from a seven-physician majority vote (Cohen's kappa = 81%), a performance statistically non-inferior to that of a single human expert (kappa = 67%, P < 0.001), which offers a scalable and reliable path to responsible deployment of generative AI in clinical workflows.
Link: https://arxiv.org/abs/2509.05878
Authors: François Grolleau, Emily Alsentzer, Timothy Keyes, Philip Chung, Akshay Swaminathan, Asad Aali, Jason Hom, Tridu Huynh, Thomas Lew, April S. Liang, Weihan Chu, Natasha Z. Steele, Christina F. Lin, Jingkun Yang, Kameron C. Black, Stephen P. Ma, Fateme N. Haredasht, Nigam H. Shah, Kevin Schulman, Jonathan H. Chen
Affiliations: Stanford University; Stanford Medicine; Stanford Health Care; Stanford Clinical Excellence Research Center
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Evaluating factual accuracy in Large Language Model (LLM)-generated clinical text is a critical barrier to adoption, as expert review is unscalable for the continuous quality assurance these systems require. We address this challenge with two complementary contributions. First, we introduce MedFactEval, a framework for scalable, fact-grounded evaluation where clinicians define high-salience key facts and an “LLM Jury” (a multi-LLM majority vote) assesses their inclusion in generated summaries. Second, we present MedAgentBrief, a model-agnostic, multi-step workflow designed to generate high-quality, factual discharge summaries. To validate our evaluation framework, we established a gold-standard reference using a seven-physician majority vote on clinician-defined key facts from inpatient cases. The MedFactEval LLM Jury achieved almost perfect agreement with this panel (Cohen’s kappa=81%), a performance statistically non-inferior to that of a single human expert (kappa=67%, P < 0.001). Our work provides both a robust evaluation framework (MedFactEval) and a high-performing generation workflow (MedAgentBrief), offering a comprehensive approach to advance the responsible deployment of generative AI in clinical workflows.
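The two ingredients named in the abstract, a majority vote across several judges and Cohen's kappa as the agreement statistic, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the vote data below are invented, the helper names are hypothetical, and the paper's actual prompting and aggregation details are not reproduced here.

```python
from typing import List, Sequence


def majority_vote(votes: Sequence[bool]) -> bool:
    """Return True when strictly more than half of the jurors voted True."""
    return sum(votes) * 2 > len(votes)


def cohens_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa for two raters giving binary labels to the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    rate_a = sum(a) / n  # share of items rater A labeled True
    rate_b = sum(b) / n  # share of items rater B labeled True
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)  # chance agreement
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Invented example: three LLM jurors judge whether each of four key facts is present.
jury_votes: List[List[bool]] = [
    [True, True, False],
    [False, False, True],
    [True, True, True],
    [False, True, False],
]
jury_labels = [majority_vote(v) for v in jury_votes]
physician_labels = [True, False, True, True]  # invented reference labels

kappa = cohens_kappa(jury_labels, physician_labels)
print(f"jury labels: {jury_labels}, kappa = {kappa:.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is lower than the 75% raw agreement in this toy example. The abstract reports kappa as a percentage, so 81% corresponds to 0.81 on the usual scale.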
[NLP-56] ZhiFangDanTai: Fine-tuning Graph-based Retrieval-Augmented Generation Model for Traditional Chinese Medicine Formula
[Quick Read]: This paper addresses two core problems in current models for Traditional Chinese Medicine (TCM) formula generation. First, existing models do not produce comprehensive outputs such as complete formula compositions and detailed explanations. Second, LLM-based formula generation is limited by instruction datasets that lack detail, for example the sovereign, minister, assistant, and courier roles, efficacy, contraindications, and tongue and pulse diagnosis, which limits the depth and accuracy of the outputs. The key to the solution is ZhiFangDanTai, a framework that combines Graph-based Retrieval-Augmented Generation (GraphRAG) with LLM fine-tuning. GraphRAG retrieves and synthesizes structured TCM knowledge into concise summaries, and an enhanced instruction dataset improves the LLM's ability to integrate retrieved information. The paper also gives theoretical proofs that this combination can reduce generalization error and hallucination rates, improving performance on the TCM formula generation task.
Link: https://arxiv.org/abs/2509.05867
Authors: ZiXuan Zhang, Bowen Hao, Yingjie Li, Hongzhi Yin
Affiliations: Capital Normal University; University of Queensland
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Traditional Chinese Medicine (TCM) formulas play a significant role in treating epidemics and complex diseases. Existing models for TCM utilize traditional algorithms or deep learning techniques to analyze formula relationships, yet lack comprehensive results, such as complete formula compositions and detailed explanations. Although recent efforts have used TCM instruction datasets to fine-tune Large Language Models (LLMs) for explainable formula generation, existing datasets lack sufficient details, such as the roles of the formula’s sovereign, minister, assistant, courier; efficacy; contraindications; tongue and pulse diagnosis, limiting the depth of model outputs. To address these challenges, we propose ZhiFangDanTai, a framework combining Graph-based Retrieval-Augmented Generation (GraphRAG) with LLM fine-tuning. ZhiFangDanTai uses GraphRAG to retrieve and synthesize structured TCM knowledge into concise summaries, while also constructing an enhanced instruction dataset to improve LLMs’ ability to integrate retrieved information. Furthermore, we provide novel theoretical proofs demonstrating that integrating GraphRAG with fine-tuning techniques can reduce generalization error and hallucination rates in the TCM formula task. Experimental results on both collected and clinical datasets demonstrate that ZhiFangDanTai achieves significant improvements over state-of-the-art models. Our model is open-sourced at this https URL.
[NLP-57] LatinX: Aligning a Multilingual TTS Model with Direct Preference Optimization
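[Quick Read]: This paper addresses the preservation of source speaker identity in multilingual speech-to-speech translation, that is, keeping the original speaker's voice characteristics across languages. The key to the solution is LatinX, a decoder-only Transformer trained in three stages: pre-training for text-to-audio mapping, supervised fine-tuning for zero-shot voice cloning, and alignment with Direct Preference Optimization (DPO) on pairs labeled automatically from Word Error Rate (WER) and speaker-similarity metrics. Trained on English and Romance languages with an emphasis on Portuguese, the model reduces WER and improves objective speaker similarity over the fine-tuned baseline, and human evaluations indicate stronger perceived speaker similarity than the strong baseline XTTSv2, revealing a gap between objective metrics and subjective perception.
Link: https://arxiv.org/abs/2509.05863
Authors: Luis Felipe Chary, Miguel Arjona Ramirez
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: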
Abstract:We present LatinX, a multilingual text-to-speech (TTS) model for cascaded speech-to-speech translation that preserves the source speaker’s identity across languages. LatinX is a 12-layer decoder-only Transformer trained in three stages: (i) pre-training for text-to-audio mapping, (ii) supervised fine-tuning for zero-shot voice cloning, and (iii) alignment with Direct Preference Optimization (DPO) using automatically labeled pairs based on Word Error Rate (WER) and speaker-similarity metrics. Trained on English and Romance languages with emphasis on Portuguese, LatinX with DPO consistently reduces WER and improves objective similarity over the fine-tuned baseline. Human evaluations further indicate stronger perceived speaker similarity than a strong baseline (XTTSv2), revealing gaps between objective and subjective measures. We provide cross-lingual analyses and discuss balanced preference signals and lower-latency architectures as future work.
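For readers unfamiliar with stage (iii), here is a minimal sketch of the standard per-pair DPO loss, written from the general DPO formulation rather than from the LatinX code. The log-probability values and the `beta` setting are assumptions for illustration; in the paper's setting the "chosen" sample would be the one preferred by the WER and speaker-similarity labeling.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    The margin is how much more the policy prefers the chosen sample over the
    rejected one, relative to the frozen reference model.
    """
    margin = (policy_logp_chosen - ref_logp_chosen) - (
        policy_logp_rejected - ref_logp_rejected
    )
    z = beta * margin
    # -log(sigmoid(z)) equals log(1 + exp(-z)); branch to avoid overflow for large |z|.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Hypothetical sequence log-probabilities for one preference pair.
no_preference = dpo_loss(-10.0, -10.0, -10.0, -10.0)      # policy matches the reference
prefers_chosen = dpo_loss(-8.0, -12.0, -10.0, -10.0)      # policy shifted toward chosen
print(f"{no_preference:.4f} -> {prefers_chosen:.4f}")
```

When the policy equals the reference, the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen sample's likelihood relative to the rejected one.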
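[NLP-58] Enhancing Factual Accuracy and Citation Generation in LLMs via Multi-Stage Self-Verification
[Quick Read]: This paper addresses the hallucination and lack of credible citation sources that are common when Large Language Models (LLMs) generate complex, fact-sensitive content. The key to the solution is VeriFact-CoT (Verified Factual Chain-of-Thought), built around a multi-stage mechanism of fact verification, reflection, and citation integration. This mechanism lets LLMs critically examine and revise their intermediate reasoning steps and final answers, improving the objective accuracy, trustworthiness, and traceability of the generated content.
Link: https://arxiv.org/abs/2509.05741
Authors: Fernando Gabriela García, Qiyang Shi, Zilin Feng
Affiliations: Autonomous University of Nuevo León; Minnan Normal University
Categories: Computation and Language (cs.CL)
Comments: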
Abstract:This research introduces VeriFact-CoT (Verified Factual Chain-of-Thought), a novel method designed to address the pervasive issues of hallucination and the absence of credible citation sources in Large Language Models (LLMs) when generating complex, fact-sensitive content. By incorporating a multi-stage mechanism of ‘fact verification-reflection-citation integration,’ VeriFact-CoT empowers LLMs to critically self-examine and revise their intermediate reasoning steps and final answers. This process significantly enhances the objective accuracy, trustworthiness, and traceability of the generated outputs, making LLMs more reliable for applications demanding high fidelity such as scientific research, news reporting, and legal consultation.
[NLP-59] QCSE: A Pretrained Quantum Context-Sensitive Word Embedding for Natural Language Processing
[Quick Read]: This paper addresses the lack of context sensitivity in word embeddings for Natural Language Processing (NLP) and the scarcity of data for low-resource languages such as the African language Fulani. The key to the solution is QCSE, a pretrained Quantum Context-Sensitive Embedding model. Its core innovation is quantum-native context learning, with five different methods for computing context matrices (using techniques such as exponential decay, sinusoidal modulation, phase shifts, and hash-based transformations) to build context-sensitive quantum embeddings. These methods draw on the expressibility of quantum systems to capture how a word's meaning differs across contexts. The model is evaluated on a small Fulani corpus and a slightly larger English corpus, offering a quantum computing approach to NLP for low-resource languages.
Link: https://arxiv.org/abs/2509.05729
Authors: Charles M. Varmantchaonala, Niclas Götting, Nils-Erik Schütte, Jean Louis E. K. Fendji, Christopher Gies
Affiliations: Carl von Ossietzky University of Oldenburg; DLR (German Aerospace Center); University of Ngaoundere; Stellenbosch University
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Quantum Natural Language Processing (QNLP) offers a novel approach to encoding and understanding the complexity of natural languages through the power of quantum computation. This paper presents a pretrained quantum context-sensitive embedding model, called QCSE, that captures context-sensitive word embeddings, leveraging the unique properties of quantum systems to learn contextual relationships in languages. The model introduces quantum-native context learning, enabling the utilization of quantum computers for linguistic tasks. Central to the proposed approach are innovative context matrix computation methods, designed to create unique representations of words based on their surrounding linguistic context. Five distinct methods are proposed and tested for computing the context matrices, incorporating techniques such as exponential decay, sinusoidal modulation, phase shifts, and hash-based transformations. These methods ensure that the quantum embeddings retain context sensitivity, thereby making them suitable for downstream language tasks where the expressibility and properties of quantum systems are valuable resources. To evaluate the effectiveness of the model and the associated context matrix methods, evaluations are conducted on both a small corpus of Fulani, a low-resource African language, and a slightly larger English corpus. The results demonstrate that QCSE not only captures context sensitivity but also leverages the expressibility of quantum systems for representing rich, context-aware language information. The use of Fulani further highlights the potential of QNLP to mitigate the problem of lack of data for this category of languages. This work underscores the power of quantum computation in natural language processing (NLP) and opens new avenues for applying QNLP to real-world linguistic challenges across various tasks and domains.
[NLP-60] Exploring Subjective Tasks in Farsi: A Survey Analysis and Evaluation of Language Models
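[Quick Read]: This paper examines the problem that Farsi is classified as a "middle-resource language" in Natural Language Processing (NLP) yet faces marked shortages in data availability and quality, especially for subjective tasks such as sentiment analysis, emotion analysis, and toxicity detection. Its key finding is that, although Farsi has over 127 million speakers and a large amount of digital text (including more than 1.3 million Wikipedia articles), publicly available datasets are scarce, and existing datasets often lack essential demographic factors such as age and gender. Prediction results are also highly unstable across datasets and models. The key to a solution therefore lies in systematically building high-quality, richly annotated public datasets that include demographic information, to support reliable modeling of subjective tasks in Farsi.
Link: https://arxiv.org/abs/2509.05719
Authors: Donya Rooein, Flor Miriam Plaza-del-Arco, Debora Nozza, Dirk Hovy
Affiliations: Bocconi University; Leiden University
Categories: Computation and Language (cs.CL)
Comments: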
Abstract:Given Farsi’s speaker base of over 127 million people and the growing availability of digital text, including more than 1.3 million articles on Wikipedia, it is considered a middle-resource language. However, this label quickly crumbles when the situation is examined more closely. We focus on three subjective tasks (Sentiment Analysis, Emotion Analysis, and Toxicity Detection) and find significant challenges in data availability and quality, despite the overall increase in data availability. We review 110 publications on subjective tasks in Farsi and observe a lack of publicly available datasets. Furthermore, existing datasets often lack essential demographic factors, such as age and gender, that are crucial for accurately modeling subjectivity in language. When evaluating prediction models using the few available datasets, the results are highly unstable across both datasets and models. Our findings indicate that the volume of data is insufficient to significantly improve a language’s prospects in NLP.
[NLP-61] A Survey of the State-of-the-Art in Conversational Question Answering Systems
[Quick Read]: This paper addresses the core challenge of keeping Conversational Question Answering (ConvQA) systems coherent and relevant across multi-turn conversations, through an in-depth analysis of core ConvQA components combined with advanced machine learning techniques. It surveys how the three modules of history selection, question understanding, and answer prediction work together, and examines reinforcement learning, contrastive learning, and transfer learning as ways to improve accuracy and efficiency. It also highlights the role of large language models such as RoBERTa, GPT-4, Gemini 2.0 Flash, Mistral 7B, and LLaMA 3 in terms of data scalability and architectural advances.
Link: https://arxiv.org/abs/2509.05716
Authors: Manoj Madushanka Perera, Adnan Mahmood, Kasun Eranda Wijethilake, Fahmida Islam, Maryam Tahermazandarani, Quan Z. Sheng
Affiliations: Macquarie University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 42 pages, 12 figures, 4 tables
Abstract:Conversational Question Answering (ConvQA) systems have emerged as a pivotal area within Natural Language Processing (NLP) by driving advancements that enable machines to engage in dynamic and context-aware conversations. These capabilities are increasingly being applied across various domains, such as customer support, education, legal, and healthcare, where maintaining a coherent and relevant conversation is essential. Building on recent advancements, this survey provides a comprehensive analysis of the state-of-the-art in ConvQA. This survey begins by examining the core components of ConvQA systems, i.e., history selection, question understanding, and answer prediction, highlighting their interplay in ensuring coherence and relevance in multi-turn conversations. It further investigates the use of advanced machine learning techniques, including but not limited to reinforcement learning, contrastive learning, and transfer learning, to improve ConvQA accuracy and efficiency. The pivotal role of large language models, such as RoBERTa, GPT-4, Gemini 2.0 Flash, Mistral 7B, and LLaMA 3, is also explored, thereby showcasing their impact through data scalability and architectural advancements. Additionally, this survey presents a comprehensive analysis of key ConvQA datasets and concludes by outlining open research directions. Overall, this work offers a comprehensive overview of the ConvQA landscape and provides valuable insights to guide future advancements in the field.
[NLP-62] Revealing the Numeracy Gap: An Empirical Investigation of Text Embedding Models
[Quick Read]: This paper addresses the limited accuracy of current text embedding models in handling numerical information in text, especially in number-sensitive domains such as finance and healthcare, where models struggle to distinguish small but important numerical differences (for example, growth of 2% versus 20%). Using synthetic data in a financial context, it systematically evaluates how well 13 widely used text embedding models encode numerical content, finds that they generally fail to capture numerical details precisely, and on that basis frames embedding numeracy as a research direction, providing empirical grounding for improving how embedding models handle numbers.
Link: https://arxiv.org/abs/2509.05691
Authors: Ningyuan Deng, Hanyu Duan, Yixuan Tang, Yi Yang
Affiliations: The Hong Kong University of Science and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Text embedding models are widely used in natural language processing applications. However, their capability is often benchmarked on tasks that do not require understanding nuanced numerical information in text. As a result, it remains unclear whether current embedding models can precisely encode numerical content, such as numbers, into embeddings. This question is critical because embedding models are increasingly applied in domains where numbers matter, such as finance and healthcare. For example, “Company X’s market share grew by 2%” should be interpreted very differently from “Company X’s market share grew by 20%,” even though both indicate growth in market share. This study aims to examine whether text embedding models can capture such nuances. Using synthetic data in a financial context, we evaluate 13 widely used text embedding models and find that they generally struggle to capture numerical details accurately. Our further analyses provide deeper insights into embedding numeracy, informing future research to strengthen embedding model-based NLP systems with improved capacity for handling numerical content.
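[NLP-63] Llama-GENBA-10B: A Trilingual Large Language Model for German, English and Bavarian
[Quick Read]: This paper addresses English-centric bias in large language models by building a balanced multilingual foundation model that improves modeling of low-resource languages such as Bavarian. The key elements are: continued pretraining on a 164B-token dataset (82B English, 82B German, and 80M Bavarian) to balance language resources; a unified tokenizer for English, German, and Bavarian; tuning of the architecture and language-ratio hyperparameters for cross-lingual transfer; and the first standardized trilingual evaluation suite, created by translating German benchmarks into Bavarian. Experiments show that the fine-tuned variant outperforms comparable models in Bavarian, outperforms EuroLLM in English, and matches it in German, offering a reproducible blueprint for inclusive foundation models.
Link: https://arxiv.org/abs/2509.05668
Authors: Michael Hoffmann, Jophin John, Stefan Schweter, Gokul Ramakrishnan, Hoi-Fong Mak, Alice Zhang, Dmitry Gaynullin, Nicolay J. Hammer
Affiliations: Leibniz Supercomputing Centre (LRZ); Cerebras Systems
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Michael Hoffmann and Jophin John contributed equally to this work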
Abstract:We present Llama-GENBA-10B, a trilingual foundation model addressing English-centric bias in large language models. Built on Llama 3.1-8B and scaled to 10B parameters, Llama-GENBA-10B is continuously pretrained on 164B tokens (82B English, 82B German, and 80M Bavarian), balancing resources while preventing English dominance. Targeted at the German NLP community, the model also promotes Bavarian as a low-resource language. Development tackled four challenges: (1) curating a multilingual corpus despite Bavarian scarcity, (2) creating a unified tokenizer for English, German, and Bavarian, (3) optimizing architecture and language-ratio hyperparameters for cross-lingual transfer, and (4) establishing the first standardized trilingual evaluation suite by translating German benchmarks into Bavarian. Evaluations show that Llama-GENBA-10B achieves strong cross-lingual performance, with the fine-tuned variant surpassing Apertus-8B-2509 and gemma-2-9b in Bavarian and establishing itself as the best model in its class for this language, while also outperforming EuroLLM in English and matching its results in German. Training on the Cerebras CS-2 demonstrated efficient large-scale multilingual pretraining with documented energy use, offering a blueprint for inclusive foundation models that integrate low-resource languages.
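A quick arithmetic check of the stated token mix may help when reading the word "balancing". The sketch below uses only the three counts quoted in the abstract; the variable names are illustrative, and nothing here reflects the authors' actual data pipeline or sampling schedule.

```python
# Token counts as quoted in the abstract (82B English, 82B German, 80M Bavarian).
token_counts = {
    "English": 82_000_000_000,
    "German": 82_000_000_000,
    "Bavarian": 80_000_000,
}

total_tokens = sum(token_counts.values())
shares = {language: count / total_tokens for language, count in token_counts.items()}

print(f"total: {total_tokens / 1e9:.2f}B tokens")
for language, share in shares.items():
    print(f"{language:>8}: {share:.4%}")
```

On these figures the English and German portions are equal, while Bavarian makes up roughly 0.05% of the corpus, so the balance described is between the two high-resource languages.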
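[NLP-64] Cross-Question Method Reuse in Large Language Models: From Word-Level Prediction to Rational Logical-Layer Reasoning
[Quick Read]: This paper addresses a limitation of existing method-reuse techniques when questions have low or hidden similarity: conventional approaches typically require the questions to be highly similar before reuse works. The key to the solution is to first separate the question from its solution rather than feeding the pair to the Large Language Model (LLM) as a whole, and then guide the LLM to focus on adapting and transferring the solution rather than on recognizing the question. The approach is further extended to questions that share only partial or hidden features, going beyond constraints based on explicit similarity. Experiments show that the strategy increases the probability of filtering out reusable solutions, improving cross-question method reuse.
Link: https://arxiv.org/abs/2509.05660
Authors: Hong Su
Affiliations: Chengdu University of Information Technology
Categories: Computation and Language (cs.CL)
Comments: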
Abstract:Large language models (LLMs) have been widely applied to assist in finding solutions for diverse questions. Prior work has proposed representing a method as a pair of a question and its corresponding solution, enabling method reuse. However, existing approaches typically require the questions to be highly similar. In this paper, we extend the scope of method reuse to address questions with low similarity or with hidden similarities that are not explicitly observable. For questions that are similar in a general-specific sense (i.e., broader or narrower in scope), we propose to first separate the question and solution, rather than directly feeding the pair to the LLM. The LLM is then guided to adapt the solution to new but related questions, allowing it to focus on solution transfer rather than question recognition. Furthermore, we extend this approach to cases where questions only share partial features or hidden characteristics. This enables cross-question method reuse beyond conventional similarity constraints. Experimental verification shows that our scope-extension approach increases the probability of filtering out reusable solutions, thereby improving the effectiveness of cross-question method reuse.
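[NLP-65] LM-Searcher: Cross-domain Neural Architecture Search with LLMs via Unified Numerical Encoding EMNLP2025
[Quick Read]: This paper addresses the reliance of current LLM-based Neural Architecture Search (NAS) methods on heavy prompt engineering and domain-specific tuning, which limits their generality and scalability across domains. The key to the solution is LM-Searcher, a framework with three main contributions: (1) NCode, a universal numerical string representation for neural architectures that enables cross-domain architecture encoding and search; (2) a reformulation of NAS as a ranking task, training the LLM on instruction-tuning samples derived from a pruning-based subspace sampling strategy to select high-performing architectures from candidate pools; and (3) a curated dataset covering a wide range of architecture-performance pairs to support robust and transferable learning. Experiments show that LM-Searcher is competitive on both in-domain tasks (such as CNNs for image classification) and out-of-domain tasks (such as LoRA configurations for segmentation and generation).
Link: https://arxiv.org/abs/2509.05657
Authors: Yuxuan Hu, Jihao Liu, Ke Wang, Jinliang Zhen, Weikang Shi, Manyuan Zhang, Qi Dou, Rui Liu, Aojun Zhou, Hongsheng Li
Affiliations: CUHK MMLab; CUHK CURI; Tsinghua University; Shanghai AI Laboratory; CPII under InnoHK
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP2025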
Abstract:Recent progress in Large Language Models (LLMs) has opened new avenues for solving complex optimization problems, including Neural Architecture Search (NAS). However, existing LLM-driven NAS approaches rely heavily on prompt engineering and domain-specific tuning, limiting their practicality and scalability across diverse tasks. In this work, we propose LM-Searcher, a novel framework that leverages LLMs for cross-domain neural architecture optimization without the need for extensive domain-specific adaptation. Central to our approach is NCode, a universal numerical string representation for neural architectures, which enables cross-domain architecture encoding and search. We also reformulate the NAS problem as a ranking task, training LLMs to select high-performing architectures from candidate pools using instruction-tuning samples derived from a novel pruning-based subspace sampling strategy. Our curated dataset, encompassing a wide range of architecture-performance pairs, encourages robust and transferable learning. Comprehensive experiments demonstrate that LM-Searcher achieves competitive performance in both in-domain (e.g., CNNs for image classification) and out-of-domain (e.g., LoRA configurations for segmentation and generation) tasks, establishing a new paradigm for flexible and generalizable LLM-based architecture search. The datasets and models will be released at this https URL.
[NLP-66] Few-Shot Query Intent Detection via Relation-Aware Prompt Learning
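Quick read: This paper addresses the fact that current few-shot intent detection methods rely only on text and neglect key structural information in conversational systems, such as query-query and query-answer relations. The proposed SAID framework is the first to integrate textual and relational structure information in a unified way for pretraining. On top of it, a query-adaptive attention network (QueryAdapt) explicitly generates intent-specific relation tokens, enabling finer-grained knowledge transfer. On two real-world datasets, SAID significantly outperforms state-of-the-art methods.
Link: https://arxiv.org/abs/2509.05635
Authors: Liang Zhang,Yuan Li,Shijie Zhang,Zheng Zhang,Xitong Li
Affiliations: Hong Kong University of Science and Technology (Guangzhou); Shenzhen MSU-BIT University; Harbin Institute of Technology
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: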
Abstract:Intent detection is a crucial component of modern conversational systems, since accurately identifying user intent at the beginning of a conversation is essential for generating effective responses. Recent efforts have focused on studying this problem under a challenging few-shot scenario. These approaches primarily leverage large-scale unlabeled dialogue text corpora to pretrain language models through various pretext tasks, followed by fine-tuning for intent detection with very limited annotations. Despite the improvements achieved, existing methods have predominantly focused on textual data, neglecting to effectively capture the crucial structural information inherent in conversational systems, such as the query-query relation and query-answer relation. To address this gap, we propose SAID, a novel framework that integrates both textual and relational structure information in a unified manner for model pretraining for the first time. Building on this framework, we further propose a novel mechanism, the query-adaptive attention network (QueryAdapt), which operates at the relation token level by generating intent-specific relation tokens from well-learned query-query and query-answer relations explicitly, enabling more fine-grained knowledge transfer. Extensive experimental results on two real-world datasets demonstrate that SAID significantly outperforms state-of-the-art methods.
[NLP-67] From Joy to Fear: A Benchmark of Emotion Estimation in Pop Song Lyrics
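Quick read: This paper tackles multi-label emotional attribution of song lyrics, i.e., predicting six intensity scores corresponding to six fundamental emotions. The key steps are building a manually labeled dataset with a mean opinion score (MOS) approach to obtain reliable labels, evaluating several publicly available large language models (LLMs) on it in a zero-shot setting, and fine-tuning a BERT-based model for multi-label emotion intensity prediction. The results expose the relative strengths and limitations of zero-shot and fine-tuned models in capturing nuanced emotional content, informing model selection for emotion-based music information retrieval.
Link: https://arxiv.org/abs/2509.05617
Authors: Shay Dahary,Avi Edana,Alexander Apartsin,Yehudit Aperstein
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 5 pages, 2 figures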
Abstract:The emotional content of song lyrics plays a pivotal role in shaping listener experiences and influencing musical preferences. This paper investigates the task of multi-label emotional attribution of song lyrics by predicting six emotional intensity scores corresponding to six fundamental emotions. A manually labeled dataset is constructed using a mean opinion score (MOS) approach, which aggregates annotations from multiple human raters to ensure reliable ground-truth labels. Leveraging this dataset, we conduct a comprehensive evaluation of several publicly available large language models (LLMs) under zero-shot scenarios. Additionally, we fine-tune a BERT-based model specifically for predicting multi-label emotion scores. Experimental results reveal the relative strengths and limitations of zero-shot and fine-tuned models in capturing the nuanced emotional content of lyrics. Our findings highlight the potential of LLMs for emotion recognition in creative texts, providing insights into model selection strategies for emotion-based music information retrieval applications. The labeled dataset is available at this https URL.
[NLP-68] New Insights into Optimal Alignment of Acoustic and Linguistic Representations for Knowledge Transfer in ASR
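Quick read: This paper addresses the challenge of aligning acoustic and linguistic representations when transferring knowledge from pre-trained models in automatic speech recognition (ASR), in particular the inherent structural asymmetry (many-to-one, where multiple frames correspond to a single token, and one-to-many, where certain acoustic transition regions relate to multiple tokens) and distributional mismatch. The key idea is to treat alignment as a detection task and to propose an alignment model based on unbalanced optimal transport. Through soft and partial matching, the model explicitly handles distributional mismatch and structural asymmetry between the modalities, ensuring that every linguistic token is associated with at least one acoustic observation while allowing flexible probabilistic mappings from acoustic to linguistic units. The aim is high precision and recall in the matching, and the authors report improved ASR performance.
Link: https://arxiv.org/abs/2509.05609
Authors: Xugang Lu,Peng Shen,Yu Tsao,Hisashi Kawai
Affiliations: National Institute of Information and Communications Technology, Japan; Research Center for Information Technology Innovation, Academia Sinica, Taiwan
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: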
Abstract:Aligning acoustic and linguistic representations is a central challenge in bridging pre-trained models for knowledge transfer in automatic speech recognition (ASR). This alignment is inherently structured and asymmetric: while multiple consecutive acoustic frames typically correspond to a single linguistic token (many-to-one), certain acoustic transition regions may relate to multiple adjacent tokens (one-to-many). Moreover, acoustic sequences often include frames with no linguistic counterpart, such as background noise or silence, which may lead to imbalanced matching conditions. In this work, we regard alignment and matching as a detection problem, where the goal is to identify meaningful correspondences with high precision and recall, ensuring full coverage of linguistic tokens while flexibly handling redundant or noisy acoustic frames when transferring linguistic knowledge for ASR. Based on this insight, we propose an unbalanced optimal transport-based alignment model that explicitly handles distributional mismatch and structural asymmetries with soft and partial matching between acoustic and linguistic modalities. Our method ensures that every linguistic token is grounded in at least one acoustic observation, while allowing for flexible, probabilistic mappings from acoustic to linguistic units. We evaluate the proposed model on a CTC-based ASR system with a pre-trained language model for knowledge transfer. Experimental results demonstrate the effectiveness of our approach in flexibly controlling the degree of matching and hence improving ASR performance.
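To make the idea of soft, partial matching concrete, here is a minimal sketch of entropic unbalanced optimal transport using generic Sinkhorn-style scaling iterations. It is not the authors' model: the cost matrix, the regularization `eps`, the relaxation strength `tau_a`, and the choice to keep the token-side marginal hard while relaxing the frame-side marginal are all assumptions made for illustration.

```python
import math

def unbalanced_sinkhorn(cost, a, b, eps=0.1, tau_a=0.5, n_iter=200):
    """Entropic OT between frames (rows) and tokens (columns).

    Assumption: the frame marginal `a` is only softly enforced (KL penalty
    of strength tau_a), while the token marginal `b` is enforced exactly,
    so every token receives its full mass.
    """
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    rho = tau_a / (tau_a + eps)  # exponent < 1 softens the frame-side constraint
    for _ in range(n_iter):
        for i in range(n):
            Kv = sum(K[i][j] * v[j] for j in range(m))
            u[i] = (a[i] / Kv) ** rho
        for j in range(m):
            Ktu = sum(K[i][j] * u[i] for i in range(n))
            v[j] = b[j] / Ktu
    # Transport plan: P[i][j] is the mass moved from frame i to token j.
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical toy data: 4 frames, 2 tokens; the last frame is "noise"
# (high cost to both tokens).
cost = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [2.0, 2.0]]
plan = unbalanced_sinkhorn(cost, a=[0.25] * 4, b=[0.5, 0.5])
row_mass = [sum(row) for row in plan]  # the noise frame ends up with little mass
```

In this toy setup the relaxed frame marginal lets the noisy frame shed most of its mass, while each token still receives its full share.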
[NLP-69] Cross-Service Threat Intelligence in LLM Services using Privacy-Preserving Fingerprints
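Quick read: This paper addresses the inability of enterprises running multiple LLM services to share threat intelligence about prompt injection attacks under privacy compliance constraints, which lets the same attack persist undetected across services for long periods. The proposed system, BinaryShield, is a privacy-preserving threat intelligence system whose core is a pipeline of personally identifiable information (PII) redaction, semantic embedding, binary quantization, and a randomized response mechanism. It produces non-invertible fingerprints that preserve attack patterns, so threats can be shared across compliance boundaries without exposing sensitive user data. In the reported evaluation BinaryShield reaches an F1-score of 0.94, against 0.77 for the SimHash baseline, with a 64x storage reduction and 38x faster similarity search compared to dense embeddings.
Link: https://arxiv.org/abs/2509.05608
Authors: Waris Gill,Natalie Isak,Matthew Dressman
Affiliations: Microsoft
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: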
Abstract:The widespread deployment of LLMs across enterprise services has created a critical security blind spot. Organizations operate multiple LLM services handling billions of queries daily, yet regulatory compliance boundaries prevent these services from sharing threat intelligence about prompt injection attacks, the top security risk for LLMs. When an attack is detected in one service, the same threat may persist undetected in others for months, as privacy regulations prohibit sharing user prompts across compliance boundaries. We present BinaryShield, the first privacy-preserving threat intelligence system that enables secure sharing of attack fingerprints across compliance boundaries. BinaryShield transforms suspicious prompts through a unique pipeline combining PII redaction, semantic embedding, binary quantization, and randomized response mechanism to potentially generate non-invertible fingerprints that preserve attack patterns while providing privacy. Our evaluations demonstrate that BinaryShield achieves an F1-score of 0.94, significantly outperforming SimHash (0.77), the privacy-preserving baseline, while achieving 64x storage reduction and 38x faster similarity search compared to dense embeddings.
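The last two pipeline stages named in the abstract, binary quantization and randomized response, can be sketched as follows. This is an illustrative reading rather than BinaryShield's implementation: sign-based quantization, the per-bit flip probability `flip_prob`, and Hamming similarity as the matching function are assumptions, and the PII redaction and embedding stages are omitted.

```python
import random

def binarize(embedding):
    """Sign-quantize a dense embedding to one bit per dimension (assumed scheme)."""
    return [1 if x > 0 else 0 for x in embedding]

def randomized_response(bits, flip_prob, rng):
    """Flip each bit independently with probability flip_prob to add plausible deniability."""
    return [b ^ 1 if rng.random() < flip_prob else b for b in bits]

def hamming_similarity(x, y):
    """Fraction of positions where two equal-length fingerprints agree."""
    return sum(1 for p, q in zip(x, y) if p == q) / len(x)

# Hypothetical usage: fingerprints of two similar embeddings stay close
# even after a small amount of noise is injected.
rng = random.Random(0)
fp_a = randomized_response(binarize([0.3, -1.2, 0.8, 2.0]), flip_prob=0.05, rng=rng)
fp_b = randomized_response(binarize([0.4, -0.9, 0.7, 1.5]), flip_prob=0.05, rng=rng)
similarity = hamming_similarity(fp_a, fp_b)
```

A larger `flip_prob` gives stronger deniability for any individual bit at the cost of noisier similarity estimates, which is the usual trade-off with randomized response.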
[NLP-70] Beyond Keywords: Driving Generative Search Engine Optimization with Content-Centric Agents
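Quick read: This paper addresses the breakdown of traditional search engine optimization (SEO) metrics with the rise of generative search engines; the core challenge is how to quantify and optimize a piece of content's influence on synthesized answers. It proposes an end-to-end Generative Search Engine Optimization (GSEO) framework. First, it builds CC-GSEO-Bench, a large-scale content-centric benchmark, together with a multi-dimensional evaluation framework that goes beyond surface-level attribution to measure semantic impact systematically. Second, it develops a multi-agent system that automates strategic content refinement through a collaborative analyze-revise-evaluate workflow, giving content creators actionable strategies and a principled foundation for future GSEO research.
Link: https://arxiv.org/abs/2509.05607
Authors: Qiyuan Chen,Jiahe Chen,Hongsen Huang,Qian Shao,Jintai Chen,Renjie Hua,Hongxia Xu,Ruijia Wu,Ren Chuan,Jian Wu
Affiliations: Zhejiang University; Soochow Securities; HKUST(GZ); Shanghai Jiaotong University
Categories: Computation and Language (cs.CL)
Comments: Technical Report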
Abstract:The paradigm shift from traditional ranked-based search to Generative Search Engines has rendered conventional SEO metrics obsolete, creating an urgent need to understand, measure, and optimize for content influence on synthesized answers. This paper introduces a comprehensive, end-to-end framework for Generative Search Engine Optimization (GSEO) to address this challenge. We make two primary contributions. First, we construct CC-GSEO-Bench, a large-scale, content-centric benchmark, and propose a multi-dimensional evaluation framework that systematically quantifies influence, moving beyond surface-level attribution to assess substantive semantic impact. Second, we design a novel multi-agent system that operationalizes this framework, automating the strategic refinement of content through a collaborative analyze-revise-evaluate workflow. Our empirical analysis using this framework reveals novel insights into the dynamics of content influence, offering actionable strategies for creators and establishing a principled foundation for future GSEO research.
[NLP-71] Icon2: Aligning Large Language Models Using Self-Synthetic Preference Data via Inherent Regulation EMNLP2025
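Quick read: This paper addresses the dependence of LLM alignment on high-quality preference datasets, and two problems with conventional dataset construction: pre-collected instructions tend to cause distribution mismatch with the target model, and sampling multiple stochastic responses is computationally expensive. The proposed paradigm, Icon^2, exploits inherent regulation of the LLM's representation space to build tailored preference datasets efficiently. It first extracts layer-wise direction vectors that encode complex human preferences and uses them to filter self-synthesized instructions by their inherent consistency. During decoding, bidirectional inherent control steers token representations to generate response pairs with clear preference distinctions, improving alignment while reducing computational cost by up to 48.1%.
Link: https://arxiv.org/abs/2509.05605
Authors: Qiyuan Chen,Hongsen Huang,Qian Shao,Jiahe Chen,Jintai Chen,Hongxia Xu,Renjie Hua,Ren Chuan,Jian Wu
Affiliations: Zhejiang University; Soochow Securities Co., Ltd.; HKUST(GZ); Nanjing University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2025 Main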
Abstract:Large Language Models (LLMs) require high quality preference datasets to align with human preferences. However, conventional methods for constructing such datasets face significant challenges: reliance on pre-collected instructions often leads to distribution mismatches with target models, while the need for sampling multiple stochastic responses introduces substantial computational overhead. In this work, we explore a paradigm shift by leveraging inherent regulation of LLMs’ representation space for efficient and tailored preference dataset construction, named Icon^2. Specifically, it first extracts layer-wise direction vectors to encode sophisticated human preferences and then uses these vectors to filter self-synthesized instructions based on their inherent consistency. During decoding, bidirectional inherent control is applied to steer token representations, enabling the precise generation of response pairs with clear alignment distinctions. Experimental results demonstrate significant improvements in both alignment and efficiency. Llama3-8B and Qwen2-7B achieve an average win rate improvement of 13.89% on AlpacaEval 2.0 and 13.45% on Arena-Hard, while reducing computational costs by up to 48.1%.
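As a rough illustration of direction vectors and bidirectional steering, the sketch below uses a difference-of-means direction, a common choice in activation-steering work. Whether Icon^2 computes its directions this way is not stated in the abstract, so the functions, the toy hidden states, and the strength `alpha` are all assumptions.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def direction_vector(preferred, dispreferred):
    """Assumed recipe: mean preferred state minus mean dispreferred state."""
    mp, md = mean_vector(preferred), mean_vector(dispreferred)
    return [p - d for p, d in zip(mp, md)]

def steer(hidden, direction, alpha):
    """Shift a hidden state along the direction; the sign of alpha picks the side."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Hypothetical 2-dimensional hidden states for one layer.
direction = direction_vector([[1.0, 2.0], [3.0, 2.0]], [[0.0, 0.0], [0.0, 2.0]])
chosen_state = steer([0.0, 0.0], direction, alpha=0.5)     # toward the preference
rejected_state = steer([0.0, 0.0], direction, alpha=-0.5)  # away from it
```

Steering the same prompt in both directions is one way to obtain a chosen/rejected pair from a single forward pass per side, instead of sampling many stochastic responses.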
[NLP-72] Mitigating Spurious Correlations Between Question and Answer via Chain-of-Thought Correctness Perception Distillation
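Quick read: This paper addresses a problem in training small language models (SLMs) by distilling chain-of-thought (CoT) data generated by large language models (LLMs): noisy rationales in the CoT data (reasoning steps that do not support the answer or are redundant) lead the model to learn spurious correlations and degrade reasoning quality. The proposed method, Chain-of-Thought Correctness Perception Distillation (CoPeD), has two parts. First, a correctness-aware task setting encourages the student model to predict answers from correct rationales and to revise rationales when they are incorrect. Second, a Correctness-Aware Weighted loss dynamically adjusts the weight of each training instance so that the model focuses on samples whose rationale strongly supports the correct answer, improving reasoning faithfulness and generalization.
Link: https://arxiv.org/abs/2509.05602
Authors: Hongyan Xie,Yitong Yao,Yikun Ban,Zixuan Huang,Deqing Wang,Zhenhe Wu,Haoxiang Su,Chao Wang,Shuangyong Song,Xuelong Li
Affiliations: Beihang University; China Telecom
Categories: Computation and Language (cs.CL)
Comments: PrePrint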
Abstract:Large language models (LLMs) excel at reasoning tasks but are expensive to deploy. Thus small language models (SLMs) are fine-tuned on CoT data generated by LLMs to copy LLMs’ abilities. However, these CoT data may include noisy rationales that either fail to substantiate the answers or contribute no additional information to support answer prediction, which leads SLMs to capture spurious correlations between questions and answers and compromise the quality of reasoning. In this work, we propose Chain-of-Thought Correctness Perception Distillation (CoPeD), which aims to improve the reasoning quality of the student model from the perspectives of task setting and data utilization. Firstly, we introduce a correctness-aware task setting that encourages the student model to predict answers based on correct rationales and revise them when they are incorrect. This setting improves the faithfulness of reasoning and allows the model to learn from its mistakes. Then, we propose a Correctness-Aware Weighted loss, which dynamically adjusts the contribution of each training instance based on the combined loss of the rationale and the answer. This strategy encourages the model to focus more on samples where the rationale offers stronger support for the correct answer. Experiments have shown that CoPeD is effective on both in-distribution (IND) and out-of-distribution (OOD) benchmark reasoning datasets.
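One way to picture a loss that reweights training instances by the combined rationale and answer loss is the sketch below. The abstract does not give the weighting formula, so the softmax over negative combined loss and the `temperature` parameter are assumptions chosen only to show the mechanism: instances with lower combined loss, read here as stronger rationale support, receive larger weights.

```python
import math

def correctness_aware_weights(rationale_losses, answer_losses, temperature=1.0):
    """Assumed weighting: softmax over the negative combined loss within a batch."""
    combined = [r + a for r, a in zip(rationale_losses, answer_losses)]
    scores = [-c / temperature for c in combined]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_loss(rationale_losses, answer_losses, temperature=1.0):
    """Batch loss as the weight-averaged combined loss."""
    weights = correctness_aware_weights(rationale_losses, answer_losses, temperature)
    return sum(w * (r + a) for w, r, a in zip(weights, rationale_losses, answer_losses))

# Hypothetical batch: the second instance has a noisy rationale (high loss).
weights = correctness_aware_weights([0.2, 1.5], [0.1, 1.0])
```

In a real training loop the weights would typically be treated as constants (detached from the gradient) so that the model cannot lower the loss simply by shifting weight between samples.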
[NLP-73] Ad hoc conventions generalize to new referents
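Quick read: This paper asks how people establish a shared system for naming and describing new things the first time they talk about them, and in particular whether such conventions are tied to specific objects (arbitrary labels) or generalize because they rest on broader conceptual alignment. In a dyadic communication experiment (N=302) using more than 1,000 abstract tangram images from the KiloGram dataset, pairs of participants first established referential conventions for one set of images through repeated communication, and the study then measured how the alignment of their descriptions changed for undiscussed images. Partners' descriptions of new images were more aligned than their pre-test labels; this generalization decayed nonlinearly with visual similarity (consistent with Shepard's law) and was robust across levels of image nameability. The findings suggest that ad hoc conventions are not arbitrary labels but reflect genuine conceptual coordination, with implications for theories of reference and the design of more adaptive language agents.
Link: https://arxiv.org/abs/2509.05566
Authors: Anya Ji,Claire Augusta Bergey,Ron Eliav,Yoav Artzi,Robert D. Hawkins
Affiliations: University of California, Berkeley; Stanford University; Cornell University
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: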
Abstract:How do people talk about things they’ve never talked about before? One view suggests that a new shared naming system establishes an arbitrary link to a specific target, like proper names that cannot extend beyond their bearers. An alternative view proposes that forming a shared way of describing objects involves broader conceptual alignment, reshaping each individual’s semantic space in ways that should generalize to new referents. We test these competing accounts in a dyadic communication study (N=302) leveraging the recently-released KiloGram dataset containing over 1,000 abstract tangram images. After pairs of participants coordinated on referential conventions for one set of images through repeated communication, we measured the extent to which their descriptions aligned for undiscussed images. We found strong evidence for generalization: partners showed increased alignment relative to their pre-test labels. Generalization also decayed nonlinearly with visual similarity (consistent with Shepard’s law) and was robust across levels of the images’ nameability. These findings suggest that ad hoc conventions are not arbitrary labels but reflect genuine conceptual coordination, with implications for theories of reference and the design of more adaptive language agents.
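[NLP-74] Using Contrastive Learning to Improve Two-Way Reasoning in Large Language Models: The Obfuscation Task as a Case Study
Quick read: This paper asks whether current large language models genuinely understand concepts or merely recognize surface patterns. To test this, the authors propose bidirectional reasoning as the criterion: without explicit training on the reverse task, a model should be able to apply a semantic transformation in both directions, for example mapping the variable name userIndex to i while also inferring that i denotes a user index. Experiments show that forward fine-tuning induces cognitive specialization: forward-task performance improves but reverse reasoning degrades significantly. The authors therefore propose Contrastive Fine-Tuning (CFT), whose key idea is joint training on three kinds of samples: positives that preserve semantics, negatives with different semantics, and forward-direction obfuscation samples. This steers the model toward deeper semantic understanding rather than shallow pattern matching, so that reverse capability emerges without explicit reverse training. Experiments indicate that CFT achieves bidirectional reasoning, substantially strengthening reverse performance while preserving forward-task performance.
Link: https://arxiv.org/abs/2509.05553
Authors: Serge Lionel Nikiema,Jordan Samhi,Micheline Bénédicte Moumoula,Albérick Euraste Djiré,Abdoul Kader Kaboré,Jacques Klein,Tegawendé F. Bissyandé
Affiliations: University of Luxembourg
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: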
Abstract:This research addresses a fundamental question in AI: whether large language models truly understand concepts or simply recognize patterns. The authors propose bidirectional reasoning, the ability to apply transformations in both directions without being explicitly trained on the reverse direction, as a test for genuine understanding. They argue that true comprehension should naturally allow reversibility. For example, a model that can change a variable name like userIndex to i should also be able to infer that i represents a user index without reverse training. The researchers tested current language models and discovered what they term cognitive specialization: when models are fine-tuned on forward tasks, their performance on those tasks improves, but their ability to reason bidirectionally becomes significantly worse. To address this issue, they developed Contrastive Fine-Tuning (CFT), which trains models using three types of examples: positive examples that maintain semantic meaning, negative examples with different semantics, and forward-direction obfuscation examples. This approach aims to develop deeper understanding rather than surface-level pattern recognition and allows reverse capabilities to develop naturally without explicit reverse training. Their experiments demonstrated that CFT successfully achieved bidirectional reasoning, enabling strong reverse performance while maintaining forward task capabilities. The authors conclude that bidirectional reasoning serves both as a theoretical framework for assessing genuine understanding and as a practical training approach for developing more capable AI systems.
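[NLP-75] Biomedical Literature QA System Using Retrieval-Augmented Generation (RAG)
Quick read: This paper addresses the shortcomings of conventional health search engines in delivering accurate, evidence-based medical information and the lag in public access to biomedical research. It builds a biomedical literature question answering system on a Retrieval-Augmented Generation (RAG) architecture that integrates PubMed articles, curated QA datasets, and medical encyclopedias. Retrieval uses MiniLM semantic embeddings with FAISS vector search, and answers are generated by a Mistral-7B-v0.3 language model fine-tuned with QLoRA, which the authors report improves factual consistency and semantic relevance as measured by BERTScore (F1).
Link: https://arxiv.org/abs/2509.05505
Authors: Mansi Garg,Lee-Chi Wang,Bhavesh Ghanchi,Sanjana Dumpala,Shreyash Kakde,Yen Chih Chen
Affiliations: University of Southern California
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 10 pages, 6 figures, 3 tables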
Abstract:This work presents a Biomedical Literature Question Answering (QA) system based on a Retrieval-Augmented Generation (RAG) architecture, designed to improve access to accurate, evidence-based medical information. Addressing the shortcomings of conventional health search engines and the lag in public access to biomedical research, the system integrates diverse sources, including PubMed articles, curated QA datasets, and medical encyclopedias, to retrieve relevant information and generate concise, context-aware responses. The retrieval pipeline uses MiniLM-based semantic embeddings and FAISS vector search, while answer generation is performed by a fine-tuned Mistral-7B-v0.3 language model optimized using QLoRA for efficient, low-resource training. The system supports both general medical queries and domain-specific tasks, with a focused evaluation on breast cancer literature demonstrating the value of domain-aligned retrieval. Empirical results, measured using BERTScore (F1), show substantial improvements in factual consistency and semantic relevance compared to baseline models. The findings underscore the potential of RAG-enhanced language models to bridge the gap between complex biomedical literature and accessible public health knowledge, paving the way for future work on multilingual adaptation, privacy-preserving inference, and personalized medical AI systems.
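The retrieve-then-generate flow described above can be sketched with standard-library stand-ins. This is only a shape-of-the-pipeline illustration: the bag-of-words `embed` function replaces MiniLM embeddings, the brute-force cosine ranking replaces FAISS, the prompt template is invented, and the generation step with the fine-tuned model is left out entirely.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=2):
    """Brute-force nearest-neighbor search (a vector index would replace this)."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query, passages):
    """Hypothetical prompt template that grounds the answer in retrieved text."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {query}"

docs = [
    "tamoxifen is used to treat breast cancer",
    "insulin regulates blood glucose",
    "aspirin reduces fever",
]
prompt = build_prompt("breast cancer treatment options",
                      retrieve("breast cancer treatment options", docs, k=1))
```

The resulting `prompt` would then be passed to the generator; keeping retrieval and prompt construction separate makes it easy to swap in a domain-specific corpus, which is the kind of domain-aligned retrieval the breast cancer evaluation examines.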
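[NLP-76] The Token Tax: Systematic Bias in Multilingual Tokenization
Quick read: This paper addresses the higher compute cost and lower accuracy that morphologically complex, low-resource languages incur in large language models (LLMs) because of inefficient tokenization. Its central finding is that fertility (tokens per word) reliably predicts accuracy on the African-language benchmark AfriMMLU: the higher the fertility, the lower the accuracy. Reasoning models (DeepSeek, o1) also consistently outperform non-reasoning models on both high- and low-resource languages, narrowing the accuracy gaps seen in earlier generations. The authors call for morphologically aware tokenization, fair pricing, and multilingual benchmarks for equitable natural language processing (NLP).
Link: https://arxiv.org/abs/2509.05486
Authors: Jessica M. Lundin,Ada Zhang,Nihal Karim,Hamza Louzan,Victor Wei,David Adelani,Cody Carroll
Affiliations: Institute for Disease Modeling; Gates Foundation; University of San Francisco; McGill University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: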
Abstract:Tokenization inefficiency imposes structural disadvantages on morphologically complex, low-resource languages, inflating compute resources and depressing accuracy. We evaluate 10 large language models (LLMs) on AfriMMLU (9,000 MCQA items; 5 subjects; 16 African languages) and show that fertility (tokens/word) reliably predicts accuracy. Higher fertility consistently predicts lower accuracy across all models and subjects. We further find that reasoning models (DeepSeek, o1) consistently outperform non-reasoning peers across high and low resource languages in the AfriMMLU dataset, narrowing accuracy gaps observed in prior generations. Finally, translating token inflation to economics, a doubling in tokens results in quadrupled training cost and time, underscoring the token tax faced by many languages. These results motivate morphologically aware tokenization, fair pricing, and multilingual benchmarks for equitable natural language processing (NLP).
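The two quantities in the abstract, fertility and the cost of token inflation, reduce to simple arithmetic. The sketch below uses a made-up fixed-width chunking tokenizer purely to produce token counts; real subword tokenizers behave differently. The quadratic exponent is an assumption that reproduces the abstract's statement that doubling tokens quadruples training cost.

```python
def toy_tokenize(text, size=4):
    """Hypothetical tokenizer: split each word into chunks of `size` characters."""
    tokens = []
    for word in text.split():
        tokens.extend(word[i:i + size] for i in range(0, len(word), size))
    return tokens

def fertility(text, size=4):
    """Fertility = tokens per word."""
    return len(toy_tokenize(text, size)) / len(text.split())

def relative_cost(token_ratio, exponent=2.0):
    """Cost multiplier for a given token inflation, assuming quadratic scaling."""
    return token_ratio ** exponent

f = fertility("unbelievably good")   # 4 tokens over 2 words
cost_multiplier = relative_cost(2.0)  # doubling tokens under the quadratic assumption
```

Under this assumption, a language whose text tokenizes into twice as many tokens as English pays roughly four times the training cost for the same content.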
[NLP-77] From Staff Messages to Actionable Insights: A Multi-Stage LLM Classification Framework for Healthcare Analytics
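Quick read: This paper addresses the difficulty of mining the large volume of staff messages generated by hospital call centers: traditional supervised learning requires annotated data and costly training and tuning. It proposes a multi-stage framework based on large language models (LLMs) that identifies message topics and classifies messages by reason in a multi-class fashion, evaluating reasoning, general-purpose, and lightweight models. The best-performing model, o3, reaches 78.4% weighted F1-score and 79.2% accuracy. The methodology incorporates data security measures and HIPAA compliance requirements, and the structured outputs feed a visualization decision support tool that gives healthcare professionals actionable insights, supporting navigator training and patient experience.
Link: https://arxiv.org/abs/2509.05484
Authors: Hajar Sakai,Yi-En Tseng,Mohammadsadegh Mikaeili,Joshua Bosire,Franziska Jovin
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: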
Abstract:Hospital call centers serve as the primary contact point for patients within a hospital system. They also generate substantial volumes of staff messages as navigators process patient requests and communicate with the hospital offices following the established protocol restrictions and guidelines. This continuously accumulated large amount of text data can be mined and processed to retrieve insights; however, traditional supervised learning approaches require annotated data, extensive training, and model tuning. Large Language Models (LLMs) offer a paradigm shift toward more computationally efficient methodologies for healthcare analytics. This paper presents a multi-stage LLM-based framework that identifies staff message topics and classifies messages by their reasons in a multi-class fashion. In the process, multiple LLM types, including reasoning, general-purpose, and lightweight models, were evaluated. The best-performing model was o3, achieving 78.4% weighted F1-score and 79.2% accuracy, followed closely by gpt-5 (75.3% Weighted F1-score and 76.2% accuracy). The proposed methodology incorporates data security measures and HIPAA compliance requirements essential for healthcare environments. The processed LLM outputs are integrated into a visualization decision support tool that transforms the staff messages into actionable insights accessible to healthcare professionals. This approach enables more efficient utilization of the collected staff messaging data, identifies navigator training opportunities, and supports improved patient experience and care quality.
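Because the abstract reports both weighted F1-score and accuracy, it may help to see how the two are computed for a multi-class classifier. The sketch below follows the usual definition of support-weighted F1 (per-class F1 averaged with weights proportional to each class's true count); the message categories are invented for illustration.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * n / total
    return score

def accuracy(y_true, y_pred):
    """Fraction of exactly matching predictions."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Hypothetical message reasons.
y_true = ["billing", "billing", "scheduling", "refill"]
y_pred = ["billing", "scheduling", "scheduling", "refill"]
wf1, acc = weighted_f1(y_true, y_pred), accuracy(y_true, y_pred)
```

The two metrics diverge when errors concentrate in particular classes, which is why reporting both is informative for imbalanced message categories.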
[NLP-78] Direct-Scoring NLG Evaluators Can Use Pairwise Comparisons Too
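Quick read: This paper addresses a limitation of large language models (LLMs) used as automatic raters of free-form content such as summaries, dialog, and stories: methods based on pairwise comparisons perform well at the sample level but often cannot assign absolute scores to individual outputs, an ability that matters for use cases requiring thresholding. It proposes a direct-scoring method that uses synthetic summaries to act as pairwise machine rankings at test time. The method performs comparably to state-of-the-art pairwise evaluators in sample-level correlation on the SummEval, TopicalChat, and HANNA benchmarks while still assigning an absolute score to each generated text.
Link: https://arxiv.org/abs/2509.05440
Authors: Logan Lawrence,Ashton Williamson,Alexander Shelton
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 18 tables, 1 figure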
Abstract:As large-language models have been increasingly used as automatic raters for evaluating free-form content, including document summarization, dialog, and story generation, work has been dedicated to evaluating such models by measuring their correlations with human judgment. For sample-level performance, methods which operate by using pairwise comparisons between machine-generated text perform well but often lack the ability to assign absolute scores to individual summaries, an ability crucial for use cases that require thresholding. In this work, we propose a direct-scoring method which uses synthetic summaries to act as pairwise machine rankings at test time. We show that our method performs comparably to state-of-the-art pairwise evaluators in terms of axis-averaged sample-level correlations on the SummEval (+0.03), TopicalChat (-0.03), and HANNA (+0.05) meta-evaluation benchmarks, and release the synthetic in-context summaries as data to facilitate future work.
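One plausible reading of turning pairwise comparisons into an absolute score is to compare a candidate against a fixed ladder of synthetic anchor summaries and count how many it beats. The sketch below shows that idea only; it is not the paper's procedure. The `prefers` callable stands in for an LLM judge, and the keyword-coverage judge and example texts are invented.

```python
def direct_score(candidate, anchors, prefers):
    """Score = 1 + number of synthetic anchors the candidate is judged better than.

    `prefers(a, b)` is a hypothetical judge returning True when a beats b;
    in practice it would wrap an LLM pairwise comparison.
    """
    wins = sum(1 for anchor in anchors if prefers(candidate, anchor))
    return 1 + wins

# Toy judge: prefer the summary that covers more source keywords.
KEYWORDS = {"storm", "flood", "evacuation"}

def coverage(summary):
    return len(KEYWORDS & set(summary.lower().split()))

def toy_prefers(a, b):
    return coverage(a) > coverage(b)

anchors = [
    "a storm hit",                    # covers 1 keyword
    "storm and flood reported",       # covers 2 keywords
    "storm flood evacuation ordered", # covers 3 keywords
]
score = direct_score("flood forced evacuation", anchors, toy_prefers)
```

Because the anchors are fixed, the resulting score lives on a stable scale from 1 to `len(anchors) + 1`, so it can be compared against a threshold across documents.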
[NLP-79] No Translation Needed: Forecasting Quality from Fertility and Metadata
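Quick read: This paper addresses translation quality prediction across many languages, i.e., estimating translation quality without running the translation system. Using only a handful of readily available features (token fertility ratios, token counts, and basic linguistic metadata such as language family, script, and region), gradient boosting models forecast ChrF scores for GPT-4o translations across 203 languages in the FLORES-200 benchmark. They reach R^2 = 0.72 for English → XX and R^2 = 0.66 for XX → English, and feature importance analyses show that typological factors and fertility play different roles depending on translation direction.
Link: https://arxiv.org/abs/2509.05425
Authors: Jessica M. Lundin,Ada Zhang,David Adelani,Cody Carroll
Affiliations: Institute for Disease Modeling; Gates Foundation; University of San Francisco; McGill University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: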
Abstract:We show that translation quality can be predicted with surprising accuracy without ever running the translation system itself. Using only a handful of features, token fertility ratios, token counts, and basic linguistic metadata (language family, script, and region), we can forecast ChrF scores for GPT-4o translations across 203 languages in the FLORES-200 benchmark. Gradient boosting models achieve favorable performance (R^2 = 0.66 for XX → English and R^2 = 0.72 for English → XX). Feature importance analyses reveal that typological factors dominate predictions into English, while fertility plays a larger role for translations into diverse target languages. These findings suggest that translation quality is shaped by both token-level fertility and broader linguistic typology, offering new insights for multilingual evaluation and quality estimation.
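[NLP-80] Talk Isn't Always Cheap: Understanding Failure Modes in Multi-Agent Debate ICML
Quick read: This paper examines how multi-agent debate, proposed as a way to improve AI reasoning, can instead degrade performance, especially when the participating models differ in capability. Its key finding is that debate can reduce accuracy over time even when stronger models outnumber weaker ones, because models often abandon correct answers in response to peer reasoning, favoring agreement over challenging flawed reasoning. This exposes a failure mode of naive debate: agents are neither incentivized nor adequately equipped to resist persuasive but incorrect reasoning, which points to mechanisms for identifying and resisting such reasoning as a direction for future work.
Link: https://arxiv.org/abs/2509.05396
Authors: Andrea Wynn,Harsh Satija,Gillian Hadfield
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: ICML MAS Workshop 2025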
Abstract:While multi-agent debate has been proposed as a promising strategy for improving AI reasoning ability, we find that debate can sometimes be harmful rather than helpful. The prior work has exclusively focused on debates within homogeneous groups of agents, whereas we explore how diversity in model capabilities influences the dynamics and outcomes of multi-agent interactions. Through a series of experiments, we demonstrate that debate can lead to a decrease in accuracy over time – even in settings where stronger (i.e., more capable) models outnumber their weaker counterparts. Our analysis reveals that models frequently shift from correct to incorrect answers in response to peer reasoning, favoring agreement over challenging flawed reasoning. These results highlight important failure modes in the exchange of reasons during multi-agent debate, suggesting that naive applications of debate may cause performance degradation when agents are neither incentivized nor adequately equipped to resist persuasive but incorrect reasoning.
[NLP-81] Authorship Without Writing: Large Language Models and the Senior Author Analogy
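Quick read: This paper asks whether and how the use of large language models (LLMs) in bioethical, scientific, and medical writing is compatible with authorship. There is broad agreement in some circles that LLMs themselves cannot count as authors, but no consensus on whether humans using them can. The paper's key argument is an analogy: under specific conditions, LLM use resembles a form of senior authorship, the role of researchers who guide a project and vouch for its integrity without necessarily writing a word. On this view, using LLMs even to generate complete drafts can count as a legitimate form of authorship under the criteria accepted in many fields; otherwise, those criteria require fundamental revision.
Link: https://arxiv.org/abs/2509.05390
Authors: Clint Hurshman,Sebastian Porsdam Mann,Julian Savulescu,Brian D. Earp
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 28 pages, 0 figures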
Abstract:The use of large language models (LLMs) in bioethical, scientific, and medical writing remains controversial. While there is broad agreement in some circles that LLMs cannot count as authors, there is no consensus about whether and how humans using LLMs can count as authors. In many fields, authorship is distributed among large teams of researchers, some of whom, including paradigmatic senior authors who guide and determine the scope of a project and ultimately vouch for its integrity, may not write a single word. In this paper, we argue that LLM use (under specific conditions) is analogous to a form of senior authorship. On this view, the use of LLMs, even to generate complete drafts of research papers, can be considered a legitimate form of authorship according to the accepted criteria in many fields. We conclude that either such use should be recognized as legitimate, or current criteria for authorship require fundamental revision. AI use declaration: GPT-5 was used to help format Box 1. AI was not used for any other part of the preparation or writing of this manuscript.
[NLP-82] A Lightweight Framework for Trigger-Guided LoRA-Based Self-Adaptation in LLMs
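Quick read: This paper addresses the inability of large language models (LLMs) to keep adapting to and learning from new data at inference time. The authors propose decomposing complex reasoning tasks into atomic subtasks and introduce SAGE, a trigger-guided dynamic fine-tuning framework that performs adaptive updates during inference. It has three components: (1) a Trigger module that detects reasoning failures in real time using multiple evaluation metrics; (2) a Trigger Buffer module that clusters anomalous samples with a streaming clustering process based on HDBSCAN, followed by stability checks and similarity-based merging; and (3) a LoRA Store module that dynamically optimizes parameter updates with an adapter pool to retain knowledge. The reported evaluation shows strong accuracy, robustness, and stability on atomic reasoning subtasks through test-time knowledge updates.
Link: https://arxiv.org/abs/2509.05385
Authors: Jiacheng Wei,Faguo Wu,Xiao Zhang
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 7 figures, conference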
Abstract:Large language models are unable to continuously adapt and learn from new data during reasoning at inference time. To address this limitation, we propose that complex reasoning tasks be decomposed into atomic subtasks and introduce SAGE, a trigger-guided dynamic fine-tuning framework that enables adaptive updates during reasoning at inference time. SAGE consists of three key components: (1) a Trigger module that detects reasoning failures through multiple evaluation metrics in real time; (2) a Trigger Buffer module that clusters anomaly samples using a streaming clustering process with HDBSCAN, followed by stability checks and similarity-based merging; and (3) a LoRA Store module that dynamically optimizes parameter updates with an adapter pool for knowledge retention. Evaluation results show that SAGE demonstrates excellent accuracy, robustness, and stability on the atomic reasoning subtask through dynamic knowledge updating during test time.
[NLP-83] Beyond ROUGE: N-Gram Subspace Features for LLM Hallucination Detection
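Quick read: This paper targets hallucination in large language models (LLMs), that is, generated content that is factually wrong or inconsistent, which limits trust in practical use. The key idea is an N-Gram frequency tensor built from LLM-generated text: by encoding co-occurrence patterns it captures richer semantic structure than foundational metrics such as ROUGE or BERTScore. Tensor decomposition then extracts singular values from each mode, and these serve as input features to a multi-layer perceptron (MLP) binary classifier that flags hallucinated content.
Link: https://arxiv.org/abs/2509.05360
Authors: Jerry Li,Evangelos Papalexakis
Affiliations: University of California, Riverside
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: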
Abstract:Large Language Models (LLMs) have demonstrated effectiveness across a wide variety of tasks involving natural language; however, a fundamental problem of hallucinations still plagues these models, limiting their trustworthiness in generating consistent, truthful information. Detecting hallucinations has quickly become an important topic, with various methods such as uncertainty estimation, LLM Judges, retrieval augmented generation (RAG), and consistency checks showing promise. Many of these methods build upon foundational metrics, such as ROUGE, BERTScore, or Perplexity, which often lack the semantic depth necessary to detect hallucinations effectively. In this work, we propose a novel approach inspired by ROUGE that constructs an N-Gram frequency tensor from LLM-generated text. This tensor captures richer semantic structure by encoding co-occurrence patterns, enabling better differentiation between factual and hallucinated content. We demonstrate this by applying tensor decomposition methods to extract singular values from each mode and use these as input features to train a multi-layer perceptron (MLP) binary classifier for hallucinations. Our method is evaluated on the HaluEval dataset and demonstrates significant improvements over traditional baselines, as well as competitive performance against state-of-the-art LLM judges.
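The feature extraction described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes trigram counts indexed by vocabulary id and takes the singular values of each mode unfolding (a HOSVD-style choice, since the abstract does not name the decomposition). The helper names `ngram_tensor` and `mode_singular_values` are hypothetical.

```python
import numpy as np

def ngram_tensor(tokens, vocab, n=3):
    """Count n-grams into an order-n tensor indexed by vocabulary ids."""
    index = {word: i for i, word in enumerate(vocab)}
    tensor = np.zeros((len(vocab),) * n)
    for start in range(len(tokens) - n + 1):
        ids = tuple(index[t] for t in tokens[start:start + n])
        tensor[ids] += 1
    return tensor

def mode_singular_values(tensor, k=3):
    """Top-k singular values of every mode unfolding, concatenated (assumed feature layout)."""
    features = []
    for mode in range(tensor.ndim):
        # Unfold: bring the chosen mode to the front and flatten the rest.
        unfolded = np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)
        s = np.linalg.svd(unfolded, compute_uv=False)
        features.append(s[:k])
    return np.concatenate(features)

tokens = "the cat sat on the mat the cat ran".split()
vocab = sorted(set(tokens))
T = ngram_tensor(tokens, vocab, n=3)
features = mode_singular_values(T, k=3)
print(T.shape, features.shape)
```

The resulting fixed-length vector is the kind of input an MLP classifier could consume; a dense tensor like this only scales to small vocabularies, so a real system would likely need a sparse representation.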
[NLP-84] An Empirical Analysis of Discrete Unit Representations in Speech Language Modeling Pre-training
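Quick read: This paper studies how to optimize discrete unit representations in Speech Language Models (SLMs) during continual pre-training, in order to improve modeling of the speech modality. It systematically analyzes how model architecture, data representation, and training robustness affect the pre-training stage. In particular, varying the speech encoder and the clustering granularity shows that the optimal discretization strategy changes with model scale. Analyses of cluster distributions and phonemic alignment examine how effectively the discrete vocabulary is used and uncover both linguistic and paralinguistic patterns. The study also highlights that domain matching between the data used to train the discretization and the target application matters for robustness.
Link: https://arxiv.org/abs/2509.05359
Authors: Yanis Labrak,Richard Dufour,Mickaël Rouvier
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Published in International Conference on Text, Speech, and Dialogue, 13-24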
Abstract:This paper investigates discrete unit representations in Speech Language Models (SLMs), focusing on optimizing speech modeling during continual pre-training. In this paper, we systematically examine how model architecture, data representation, and training robustness influence the pre-training stage in which we adapt existing pre-trained language models to the speech modality. Our experiments highlight the role of speech encoders and clustering granularity across different model scales, showing how optimal discretization strategies vary with model capacity. By examining cluster distribution and phonemic alignments, we investigate the effective use of discrete vocabulary, uncovering both linguistic and paralinguistic patterns. Additionally, we explore the impact of clustering data selection on model robustness, highlighting the importance of domain matching between discretization training and target applications.
[NLP-85] ForensicsData: A Digital Forensics Dataset for Large Language Models
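Quick read: This paper addresses the scarcity of public datasets in digital forensics, caused by ethical, legal, and privacy concerns, which holds back research and tool development. The solution is ForensicsData, a Question-Context-Answer (Q-C-A) dataset sourced from real malware analysis reports and containing more than 5,000 structured triplets. The creation workflow first extracts structured data, then uses large language models (LLMs) to convert it into Q-C-A format, and finally applies a specialized evaluation process to check quality; among the models evaluated, Gemini 2 Flash aligned generated content with forensic terminology best.
Link: https://arxiv.org/abs/2509.05331
Authors: Youssef Chakir,Iyad Lahsen-Cherif
Affiliations: INPT (National Institute of Post and Telecommunications)
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to WiMob 2025 (21st International Conference on Wireless and Mobile Computing, Networking and Communications), Marrakesh, Morocco, Oct 20-22, 2025. 6 pages, 5 figures, 5 tables. IEEEtran conference format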
Abstract:The growing complexity of cyber incidents presents significant challenges for digital forensic investigators, especially in evidence collection and analysis. Public resources are still limited because of ethical, legal, and privacy concerns, even though realistic datasets are necessary to support research and tool developments. To address this gap, we introduce ForensicsData, an extensive Question-Context-Answer (Q-C-A) dataset sourced from actual malware analysis reports. It consists of more than 5,000 Q-C-A triplets. A unique workflow was used to create the dataset, which extracts structured data, uses large language models (LLMs) to transform it into Q-C-A format, and then uses a specialized evaluation process to confirm its quality. Among the models evaluated, Gemini 2 Flash demonstrated the best performance in aligning generated content with forensic terminology. ForensicsData aims to advance digital forensics by enabling reproducible experiments and fostering collaboration within the research community.
[NLP-86] Are LLM Agents Behaviorally Coherent? Latent Profiles for Social Simulation
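Quick read: This paper asks whether large language models (LLMs) acting as synthetic agents can reliably substitute for real participants in human-subject research, and in particular whether their behavior stays internally consistent across experimental settings. Prior work has mostly checked whether LLM-generated data matches human data and has overlooked this more fundamental question. The study is designed to (a) reveal an agent's internal state and (b) examine its behavior in a basic dialogue setting, so that conversational behavior can be tested against the revealed state. The results show significant internal inconsistencies across model families and model sizes: even when outputs match those of human counterparts, agents are not internally consistent, which is a critical gap in their ability to stand in for real participants.
Link: https://arxiv.org/abs/2509.03736
Authors: James Mooney,Josef Woldense,Zheng Robert Jia,Shirley Anugrah Hayati,My Ha Nguyen,Vipul Raheja,Dongyeop Kang
Affiliations: University of Minnesota; University of Chicago; Google DeepMind
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 25 pages, 9 figures, 7 tables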
Abstract:The impressive capabilities of Large Language Models (LLMs) have fueled the notion that synthetic agents can serve as substitutes for real participants in human-subject research. In an effort to evaluate the merits of this claim, social science researchers have largely focused on whether LLM-generated survey data corresponds to that of a human counterpart whom the LLM is prompted to represent. In contrast, we address a more fundamental question: Do agents maintain internal consistency, retaining similar behaviors when examined under different experimental settings? To this end, we develop a study designed to (a) reveal the agent’s internal state and (b) examine agent behavior in a basic dialogue setting. This design enables us to explore a set of behavioral hypotheses to assess whether an agent’s conversation behavior is consistent with what we would expect from their revealed internal state. Our findings on these hypotheses show significant internal inconsistencies in LLMs across model families and at differing model sizes. Most importantly, we find that, although agents may generate responses matching those of their human counterparts, they fail to be internally consistent, representing a critical gap in their capabilities to accurately substitute for real participants in human-subject research. Our simulation code and data are publicly accessible.
[NLP-87] Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval EMNLP2023
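Quick read: This paper addresses fairness in multilingual information retrieval (IR), specifically how language bias and demographic bias affect ranking. The key contribution is Multi-EuP, a new multilingual benchmark of 22K documents from the European Parliament spanning 24 languages, with topics translated into all 24 languages, cross-lingual relevance judgments, and rich document-level demographic information, which together support systematic analysis of bias at the language and group level. A preliminary experiment also examines how the choice of tokenization strategy contributes to language bias.
Link: https://arxiv.org/abs/2311.01870
Authors: Jinrui Yang,Timothy Baldwin,Trevor Cohn
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted at The 3rd Multilingual Representation Learning (MRL) Workshop (co-located with EMNLP 2023)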
Abstract:We present Multi-EuP, a new multilingual benchmark dataset, comprising 22K multi-lingual documents collected from the European Parliament, spanning 24 languages. This dataset is designed to investigate fairness in a multilingual information retrieval (IR) context to analyze both language and demographic bias in a ranking context. It boasts an authentic multilingual corpus, featuring topics translated into all 24 languages, as well as cross-lingual relevance judgments. Furthermore, it offers rich demographic information associated with its documents, facilitating the study of demographic bias. We report the effectiveness of Multi-EuP for benchmarking both monolingual and multilingual IR. We also conduct a preliminary experiment on language bias caused by the choice of tokenization strategy.
[NLP-88] Beamforming-LLM: What, Where and When Did I Miss?
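Quick read: This paper addresses conversations a user may miss in multi-speaker environments, that is, how to semantically recall and summarize audio the user was not attending to. The key is combining beamforming with retrieval-augmented generation (RAG): a microphone array captures spatial audio and beamforming separates directional streams, Whisper transcribes them, and sentence encoders embed the transcripts into a vector database. Given a natural language query, the system retrieves semantically relevant segments, aligns them in time with non-attended segments, and summarizes them with a lightweight large language model (GPT-4o-mini), producing an interface with contrastive summaries, spatial context, and timestamped audio playback.
Link: https://arxiv.org/abs/2509.06221
Authors: Vishal Choudhari
Affiliations: Columbia University
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: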
Abstract:We present Beamforming-LLM, a system that enables users to semantically recall conversations they may have missed in multi-speaker environments. The system combines spatial audio capture using a microphone array with retrieval-augmented generation (RAG) to support natural language queries such as, “What did I miss when I was following the conversation on dogs?” Directional audio streams are separated using beamforming, transcribed with Whisper, and embedded into a vector database using sentence encoders. Upon receiving a user query, semantically relevant segments are retrieved, temporally aligned with non-attended segments, and summarized using a lightweight large language model (GPT-4o-mini). The result is a user-friendly interface that provides contrastive summaries, spatial context, and timestamped audio playback. This work lays the foundation for intelligent auditory memory systems and has broad applications in assistive technology, meeting summarization, and context-aware personal spatial computing.
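The retrieval step in this pipeline can be illustrated with a toy sketch. It is only a stand-in under stated assumptions: a bag-of-words vector replaces the sentence encoder, an in-memory list replaces the vector database, and the segment fields (`start`, `direction`, `text`) are hypothetical names, not the system's actual schema.

```python
import math
from collections import Counter

def embed(text):
    """Bag-of-words vector; a stand-in for a learned sentence encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, segments, top_k=1):
    """Rank transcribed segments by similarity to the query and keep the best top_k."""
    q = embed(query)
    ranked = sorted(segments, key=lambda seg: cosine(q, embed(seg["text"])), reverse=True)
    return ranked[:top_k]

# Each segment keeps its timestamp and beam direction so playback can be aligned later.
segments = [
    {"start": 0.0, "direction": "left", "text": "the dogs need a longer walk today"},
    {"start": 4.5, "direction": "right", "text": "quarterly budget numbers look fine"},
]
best = retrieve("what was said about dogs", segments)[0]
print(best["direction"], best["start"])
```

Keeping the timestamp and direction alongside each transcript is what would let a later stage line retrieved segments up with the periods the user was not attending to.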
[NLP-89] Imagining Alternatives: Towards High-Resolution 3D Counterfactual Medical Image Generation via Language Guidance
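Quick read: This paper addresses a bottleneck for vision-language models in 3D medical image generation: pretrained foundation models comparable to those widely available in 2D do not exist for 3D, so generating high-resolution, structurally faithful 3D medical images from natural language descriptions has remained unexplored. This limits applications such as personalized counterfactual explanations, simulation of disease progression, and medical training. The solution is a language-guided native-3D diffusion framework that adapts 3D diffusion models with enhancements from Simple Diffusion and adds augmented conditioning to improve text alignment and image quality. On two distinct neurological MRI datasets it simulates varying counterfactual lesion loads in Multiple Sclerosis (MS) and cognitive states in Alzheimer's disease, generating high-quality 3D images of synthesized patients while preserving subject fidelity.
Link: https://arxiv.org/abs/2509.05978
Authors: Mohamed Mohamed,Brennan Nichyporuk,Douglas L. Arnold,Tal Arbel
Affiliations: McGill University; Montreal Neurological Hospital
Categories: Image and Video Processing (eess.IV); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: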
Abstract:Vision-language models have demonstrated impressive capabilities in generating 2D images under various conditions; however the impressive performance of these models in 2D is largely enabled by extensive, readily available pretrained foundation models. Critically, comparable pretrained foundation models do not exist for 3D, significantly limiting progress in this domain. As a result, the potential of vision-language models to produce high-resolution 3D counterfactual medical images conditioned solely on natural language descriptions remains completely unexplored. Addressing this gap would enable powerful clinical and research applications, such as personalized counterfactual explanations, simulation of disease progression scenarios, and enhanced medical training by visualizing hypothetical medical conditions in realistic detail. Our work takes a meaningful step toward addressing this challenge by introducing a framework capable of generating high-resolution 3D counterfactual medical images of synthesized patients guided by free-form language prompts. We adapt state-of-the-art 3D diffusion models with enhancements from Simple Diffusion and incorporate augmented conditioning to improve text alignment and image quality. To our knowledge, this represents the first demonstration of a language-guided native-3D diffusion model applied specifically to neurological imaging data, where faithful three-dimensional modeling is essential to represent the brain’s three-dimensional structure. Through results on two distinct neurological MRI datasets, our framework successfully simulates varying counterfactual lesion loads in Multiple Sclerosis (MS), and cognitive states in Alzheimer’s disease, generating high-quality images while preserving subject fidelity in synthetically generated medical images. Our results lay the groundwork for prompt-driven disease progression analysis within 3D medical imaging.
[NLP-90] On the Contribution of Lexical Features to Speech Emotion Recognition
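Quick read: This paper questions the common assumption in speech emotion recognition (SER) that paralinguistic (acoustic) cues are the primary drivers, with lexical content playing a minor role. It shows that lexical content extracted from speech can match and in some cases exceed acoustic models: on the MELD dataset the lexical approach reaches a weighted F1-score (WF1) of 51.5%, versus 49.3% for an acoustic-only pipeline with a larger parameter count. The paper also compares self-supervised (SSL) speech and text representations, runs a layer-wise study of transformer-based encoders, and evaluates the effect of audio denoising.
Link: https://arxiv.org/abs/2509.05634
Authors: David Combei
Affiliations: Technical University of Cluj-Napoca
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted to 13th Conference on Speech Technology and Human-Computer Dialogue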
Abstract:Although paralinguistic cues are often considered the primary drivers of speech emotion recognition (SER), we investigate the role of lexical content extracted from speech and show that it can achieve competitive and in some cases higher performance compared to acoustic models. On the MELD dataset, our lexical-based approach obtains a weighted F1-score (WF1) of 51.5%, compared to 49.3% for an acoustic-only pipeline with a larger parameter count. Furthermore, we analyze different self-supervised (SSL) speech and text representations, conduct a layer-wise study of transformer-based encoders, and evaluate the effect of audio denoising.
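For readers unfamiliar with the metric quoted above, the weighted F1-score averages per-class F1 with each class weighted by its number of true examples. The sketch below computes it in plain Python; the labels are made-up toy data, not MELD results, and `weighted_f1` is a hypothetical helper name.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to each class's true count."""
    support = Counter(y_true)
    total = 0.0
    for label in sorted(support):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += support[label] * f1
    return total / len(y_true)

# Toy example: three "joy" utterances and one "anger" utterance.
y_true = ["joy", "joy", "joy", "anger"]
y_pred = ["joy", "joy", "anger", "anger"]
score = weighted_f1(y_true, y_pred)
print(round(score, 4))
```

Because frequent classes dominate the average, WF1 is a common choice on class-imbalanced emotion datasets.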
[NLP-91] ProtSAE: Disentangling and Interpreting Protein Language Models via Semantically-Guided Sparse Autoencoders
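Quick read: This paper addresses semantic entanglement when Sparse Autoencoders (SAEs) are applied to protein language models (PLMs): individual neurons often mix multiple nonlinear concepts, which makes model behavior hard to interpret or manipulate reliably. The proposed ProtSAE is a semantically guided SAE that uses annotation datasets together with domain knowledge during training to steer semantic disentanglement, rather than relying on annotations only afterwards to filter and interpret activations. Experiments indicate that ProtSAE learns more biologically relevant and interpretable hidden features, keeps high reconstruction fidelity, does better in interpretable probing, and shows potential for steering PLMs in downstream generation tasks.
Link: https://arxiv.org/abs/2509.05309
Authors: Xiangyu Liu,Haodi Lei,Yi Liu,Yang Liu,Wei Hu
Affiliations: unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: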
Abstract:Sparse Autoencoder (SAE) has emerged as a powerful tool for mechanistic interpretability of large language models. Recent works apply SAE to protein language models (PLMs), aiming to extract and analyze biologically meaningful features from their latent spaces. However, SAE suffers from semantic entanglement, where individual neurons often mix multiple nonlinear concepts, making it difficult to reliably interpret or manipulate model behaviors. In this paper, we propose a semantically-guided SAE, called ProtSAE. Unlike existing SAE which requires annotation datasets to filter and interpret activations, we guide semantic disentanglement during training using both annotation datasets and domain knowledge to mitigate the effects of entangled attributes. We design interpretability experiments showing that ProtSAE learns more biologically relevant and interpretable hidden features compared to previous methods. Performance analyses further demonstrate that ProtSAE maintains high reconstruction fidelity while achieving better results in interpretable probing. We also show the potential of ProtSAE in steering PLMs for downstream generation tasks.
Computer Vision
[CV-0] H₂OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers
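Quick read: This paper addresses the high computational cost of video pose transformers (VPTs) for video-based 3D human pose estimation, which makes them impractical on resource-constrained devices. The solution is Hierarchical Hourglass Tokenizer (H₂OT), a hierarchical plug-and-play pruning-and-recovering framework built from two modules: a Token Pruning Module (TPM) dynamically selects a few representative pose tokens to remove frame redundancy, and a Token Recovering Module (TRM) restores detailed spatio-temporal information from the selected tokens, expanding the output back to the full temporal resolution. Only a few pose tokens pass through the intermediate transformer blocks, which cuts computation while keeping estimation accuracy high.
Link: https://arxiv.org/abs/2509.06956
Authors: Wenhao Li,Mengyuan Liu,Hong Liu,Pichao Wang,Shijian Lu,Nicu Sebe
Affiliations: Peking University; Nanyang Technological University; Amazon AGI; University of Trento
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by TPAMI 2025, Open Sourced. arXiv admin note: substantial text overlap with arXiv:2311.12028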
Abstract:Transformers have been successfully applied in the field of video-based 3D human pose estimation. However, the high computational costs of these video pose transformers (VPTs) make them impractical on resource-constrained devices. In this paper, we present a hierarchical plug-and-play pruning-and-recovering framework, called Hierarchical Hourglass Tokenizer (H₂OT), for efficient transformer-based 3D human pose estimation from videos. H₂OT begins with progressively pruning pose tokens of redundant frames and ends with recovering full-length sequences, resulting in a few pose tokens in the intermediate transformer blocks and thus improving the model efficiency. It works with two key modules, namely, a Token Pruning Module (TPM) and a Token Recovering Module (TRM). TPM dynamically selects a few representative tokens to eliminate the redundancy of video frames, while TRM restores the detailed spatio-temporal information based on the selected tokens, thereby expanding the network output to the original full-length temporal resolution for fast inference. Our method is general-purpose: it can be easily incorporated into common VPT models on both seq2seq and seq2frame pipelines while effectively accommodating different token pruning and recovery strategies. In addition, our H₂OT reveals that maintaining the full pose sequence is unnecessary, and a few pose tokens of representative frames can achieve both high efficiency and estimation accuracy. Extensive experiments on multiple benchmark datasets demonstrate both the effectiveness and efficiency of the proposed method. Code and models are available at this https URL.
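The prune-then-recover idea can be sketched with deliberately simple stand-ins. This is not the paper's TPM or TRM: it assumes uniform frame sampling for pruning and linear interpolation for recovery, and the function names `prune_tokens` and `recover_tokens` are hypothetical.

```python
import numpy as np

def prune_tokens(tokens, keep):
    """Keep `keep` evenly spaced frame tokens; return them with their frame indices."""
    num_frames = tokens.shape[0]
    idx = np.round(np.linspace(0, num_frames - 1, keep)).astype(int)
    return tokens[idx], idx

def recover_tokens(kept, idx, num_frames):
    """Linearly interpolate the kept tokens back to the full sequence length."""
    frames = np.arange(num_frames)
    return np.stack(
        [np.interp(frames, idx, kept[:, d]) for d in range(kept.shape[1])],
        axis=1,
    )

# 9 frames of 4-dimensional pose tokens that vary linearly over time.
full = np.linspace(0.0, 1.0, 9)[:, None] * np.ones((1, 4))
kept, idx = prune_tokens(full, keep=3)          # the expensive blocks would see 3 tokens
restored = recover_tokens(kept, idx, num_frames=9)
print(idx, restored.shape)
```

In this toy case the motion is linear, so interpolation restores the sequence exactly; real pose sequences are not, which is presumably why a learned recovery module is used instead.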
[CV-1] Deep Reactive Policy: Learning Reactive Manipulator Motion Planning for Dynamic Environments
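Quick read: This paper addresses collision-free motion generation for robotic manipulators in dynamic, partially observable environments. Classical motion planners can compute globally optimal trajectories but need full knowledge of the environment and are typically too slow for dynamic scenes; neural motion policies run closed-loop on raw sensory input but often generalize poorly in complex or dynamic settings. The proposed Deep Reactive Policy (DRP) is a visuo-motor neural motion policy that operates on point clouds. Its core is IMPACT, a transformer-based policy pretrained on 10 million generated expert trajectories in simulation, whose static obstacle avoidance is further improved by iterative student-teacher finetuning. At inference time, DCP-RMP, a locally reactive goal-proposal module, strengthens dynamic obstacle avoidance. DRP outperforms prior classical and neural methods in success rate on tasks with cluttered scenes, moving obstacles, and goal obstructions.
Link: https://arxiv.org/abs/2509.06953
Authors: Jiahui Yang,Jason Jingzhou Liu,Yulong Li,Youssef Khaky,Kenneth Shaw,Deepak Pathak
Affiliations: Carnegie Mellon University
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Website at this http URL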
Abstract:Generating collision-free motion in dynamic, partially observable environments is a fundamental challenge for robotic manipulators. Classical motion planners can compute globally optimal trajectories but require full environment knowledge and are typically too slow for dynamic scenes. Neural motion policies offer a promising alternative by operating in closed-loop directly on raw sensory inputs but often struggle to generalize in complex or dynamic settings. We propose Deep Reactive Policy (DRP), a visuo-motor neural motion policy designed for reactive motion generation in diverse dynamic environments, operating directly on point cloud sensory input. At its core is IMPACT, a transformer-based neural motion policy pretrained on 10 million generated expert trajectories across diverse simulation scenarios. We further improve IMPACT’s static obstacle avoidance through iterative student-teacher finetuning. We additionally enhance the policy’s dynamic obstacle avoidance at inference time using DCP-RMP, a locally reactive goal-proposal module. We evaluate DRP on challenging tasks featuring cluttered scenes, dynamic moving obstacles, and goal obstructions. DRP achieves strong generalization, outperforming prior classical and neural methods in success rate across both simulated and real-world settings. Video results and code available at this https URL
[CV-2] F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
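Quick read: This paper addresses the short-sighted behavior and poor robustness of existing Vision-Language-Action (VLA) models, which rely on reactive state-to-action mappings, when executing language-conditioned tasks in dynamic visual environments. The proposed F1 framework integrates visual foresight generation into the decision-making pipeline, using a Mixture-of-Transformer architecture with dedicated modules for perception, foresight generation, and control. A next-scale prediction mechanism synthesizes goal-conditioned visual foresight as an explicit planning target, which recasts action generation as a foresight-guided inverse dynamics problem in which actions implicitly achieve visual goals. The reported result is better generalization and robustness in complex, dynamic environments.
Link: https://arxiv.org/abs/2509.06951
Authors: Qi Lv,Weijie Kong,Hao Li,Jia Zeng,Zherui Qiu,Delin Qu,Haoming Song,Qizhi Chen,Xiang Deng,Jiangmiao Pang
Affiliations: Shanghai AI Laboratory; Harbin Institute of Technology (Shenzhen)
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: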
Abstract:Executing language-conditioned tasks in dynamic visual environments remains a central challenge in embodied AI. Existing Vision-Language-Action (VLA) models predominantly adopt reactive state-to-action mappings, often leading to short-sighted behaviors and poor robustness in dynamic scenes. In this paper, we introduce F1, a pretrained VLA framework that integrates visual foresight generation into the decision-making pipeline. F1 adopts a Mixture-of-Transformer architecture with dedicated modules for perception, foresight generation, and control, thereby bridging understanding, generation, and actions. At its core, F1 employs a next-scale prediction mechanism to synthesize goal-conditioned visual foresight as explicit planning targets. By forecasting plausible future visual states, F1 reformulates action generation as a foresight-guided inverse dynamics problem, enabling actions that implicitly achieve visual goals. To endow F1 with robust and generalizable capabilities, we propose a three-stage training recipe on an extensive dataset comprising over 330k trajectories across 136 diverse tasks. This training scheme enhances modular reasoning and equips the model with transferable visual foresight, which is critical for complex and dynamic environments. Extensive evaluations on real-world tasks and simulation benchmarks demonstrate that F1 consistently outperforms existing approaches, achieving substantial gains in both task success rate and generalization ability.
[CV-3] Scaling Transformer-Based Novel View Synthesis Models with Token Disentanglement and Synthetic Data ICCV2025
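Quick read: This paper addresses the out-of-distribution performance drop of large transformer-based models for generalizable novel view synthesis (NVS) on real-world (in-the-wild) scenes, which stems from the limited diversity of public scene datasets. To improve generalization, the authors add synthetic training data generated by diffusion models; artifacts in that synthetic data, however, become the key bottleneck for reconstruction quality. The core of the solution is a token disentanglement process inside the transformer architecture that enhances feature separation and makes learning more effective. The method improves reconstruction quality, enables scalable training with synthetic data, and reports state-of-the-art results on multiple benchmarks at lower computational cost.
Link: https://arxiv.org/abs/2509.06950
Authors: Nithin Gopalakrishnan Nair,Srinivas Kaza,Xuan Luo,Vishal M. Patel,Stephen Lombardi,Jungyeon Park
Affiliations: Johns Hopkins University; Google
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICCV 2025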
Abstract:Large transformer-based models have made significant progress in generalizable novel view synthesis (NVS) from sparse input views, generating novel viewpoints without the need for test-time optimization. However, these models are constrained by the limited diversity of publicly available scene datasets, making most real-world (in-the-wild) scenes out-of-distribution. To overcome this, we incorporate synthetic training data generated from diffusion models, which improves generalization across unseen domains. While synthetic data offers scalability, we identify artifacts introduced during data generation as a key bottleneck affecting reconstruction quality. To address this, we propose a token disentanglement process within the transformer architecture, enhancing feature separation and ensuring more effective learning. This refinement not only improves reconstruction quality over standard transformers but also enables scalable training with synthetic data. As a result, our method outperforms existing models on both in-dataset and cross-dataset evaluations, achieving state-of-the-art results across multiple benchmarks while significantly reducing computational costs. Project page: this https URL
[CV-4] LLaDA-VLA: Vision Language Diffusion Action Models
Quick read: This paper addresses how to apply diffusion-based vision-language models (d-VLMs) to robot manipulation policy learning, which remains largely unexplored even though d-VLMs are competitive in text generation and multimodal tasks. The proposed LLaDA-VLA is described as the first Vision-Language-Diffusion-Action model built on pretrained d-VLMs. It has two key designs: (1) a localized special-token classification strategy that replaces full-vocabulary classification with classification over special action tokens, which reduces adaptation difficulty; and (2) a hierarchical action-structured decoding strategy that decodes action sequences hierarchically, taking into account dependencies within and across actions. Experiments report that it significantly outperforms state-of-the-art Vision-Language-Action models (VLAs) on both simulated and real-world robots.
Link: https://arxiv.org/abs/2509.06932
Authors: Yuqing Wen,Hebei Li,Kefan Gu,Yucheng Zhao,Tiancai Wang,Xiaoyan Sun
Affiliations: University of Science and Technology of China; Nanjing University; Dexmal
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The rapid progress of auto-regressive vision-language models (VLMs) has inspired growing interest in vision-language-action models (VLA) for robotic manipulation. Recently, masked diffusion models, a paradigm distinct from autoregressive models, have begun to demonstrate competitive performance in text generation and multimodal applications, leading to the development of a series of diffusion-based VLMs (d-VLMs). However, leveraging such models for robot policy learning remains largely unexplored. In this work, we present LLaDA-VLA, the first Vision-Language-Diffusion-Action model built upon pretrained d-VLMs for robotic manipulation. To effectively adapt d-VLMs to robotic domain, we introduce two key designs: (1) a localized special-token classification strategy that replaces full-vocabulary classification with special action token classification, reducing adaptation difficulty; (2) a hierarchical action-structured decoding strategy that decodes action sequences hierarchically considering the dependencies within and across actions. Extensive experiments demonstrate that LLaDA-VLA significantly outperforms state-of-the-art VLAs on both simulation and real-world robots.
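One plausible reading of the first design, restricting classification to the action tokens, is sketched below. It is an interpretation of the abstract rather than the released code: the logits, the token ids, and the helper name `action_token_probs` are all made up for illustration.

```python
import numpy as np

def action_token_probs(logits, action_token_ids):
    """Softmax over the action-token logits only, ignoring the rest of the vocabulary."""
    sub = logits[action_token_ids]
    sub = sub - sub.max()          # subtract the max for numerical stability
    probs = np.exp(sub)
    return probs / probs.sum()

# Toy vocabulary of 5 tokens; ids 2, 3, 4 are assumed to be special action tokens.
logits = np.array([5.0, 0.1, 2.0, 1.0, 3.0])
action_ids = np.array([2, 3, 4])
probs = action_token_probs(logits, action_ids)
predicted = int(action_ids[np.argmax(probs)])
print(predicted, probs)
```

Token 0 has the highest raw logit but is not an action token, so it can never be selected; narrowing the output space this way is what would make adaptation easier than full-vocabulary classification.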
[CV-5] FoMo4Wheat: Toward reliable crop vision foundation models with globally curated data
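Quick read: This paper addresses the poor generalization of vision models built on general-domain pretrained backbones in field crop monitoring, which arises from the interaction of fine, variable canopy structures with fluctuating field conditions. The solution is FoMo4Wheat, a wheat-specific vision foundation model pretrained with self-supervision on ImAg4Wheat, described as the largest and most diverse wheat image dataset to date (2.5 million high-resolution images from 30 global sites, spanning 2,000 genotypes and 500 environmental conditions). The resulting representations are robust for wheat and transfer to other crops and weeds. Across ten in-field vision tasks at canopy and organ levels, FoMo4Wheat consistently outperforms models pretrained on general-domain data.
Link: https://arxiv.org/abs/2509.06907
Authors: Bing Han,Chen Zhu,Dong Han,Rui Yu,Songliang Cao,Jianhui Wu,Scott Chapman,Zijian Wang,Bangyou Zheng,Wei Guo,Marie Weiss,Benoit de Solan,Andreas Hund,Lukas Roth,Kirchgessner Norbert,Andrea Visioni,Yufeng Ge,Wenjuan Li,Alexis Comar,Dong Jiang,Dejun Han,Fred Baret,Yanfeng Ding,Hao Lu,Shouyang Liu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: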
Abstract:Vision-driven field monitoring is central to digital agriculture, yet models built on general-domain pretrained backbones often fail to generalize across tasks, owing to the interaction of fine, variable canopy structures with fluctuating field conditions. We present FoMo4Wheat, one of the first crop-domain vision foundation models, pretrained with self-supervision on ImAg4Wheat, the largest and most diverse wheat image dataset to date (2.5 million high-resolution images collected over a decade at 30 global sites, spanning 2,000 genotypes and 500 environmental conditions). This wheat-specific pretraining yields representations that are robust for wheat and transferable to other crops and weeds. Across ten in-field vision tasks at canopy and organ levels, FoMo4Wheat models consistently outperform state-of-the-art models pretrained on general-domain datasets. These results demonstrate the value of crop-specific foundation models for reliable in-field perception and chart a path toward a universal crop foundation model with cross-species and cross-task capabilities. FoMo4Wheat models and the ImAg4Wheat dataset are publicly available online: this https URL and this https URL. The demonstration website is: this https URL.
[CV-6] BIR-Adapter: A Low-Complexity Diffusion Model Adapter for Blind Image Restoration
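Quick read: This paper addresses how to exploit the prior of pretrained diffusion models for blind image restoration (BIR) efficiently, without adding a complex auxiliary feature extractor. The challenge is to restore images well under unknown degradations while keeping computation low. The proposed BIR-Adapter is a low-complexity adapter that extracts features from the degraded image through the pretrained model itself and extends the self-attention mechanism with those features; a sampling guidance mechanism reduces hallucinations. Without training any auxiliary network, it achieves competitive or better performance than state-of-the-art methods on synthetic and real-world degradations, and its adapter design transfers to other diffusion models, for example extending a super-resolution-only model to handle additional unknown degradations.
Link: https://arxiv.org/abs/2509.06904
Authors: Cem Eteke,Alexander Griessel,Wolfgang Kellerer,Eckehard Steinbach
Affiliations: Technical University of Munich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 14 figures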
Abstract:This paper introduces BIR-Adapter, a low-complexity blind image restoration adapter for diffusion models. The BIR-Adapter enables the utilization of the prior of pre-trained large-scale diffusion models on blind image restoration without training any auxiliary feature extractor. We take advantage of the robustness of pretrained models. We extract features from degraded images via the model itself and extend the self-attention mechanism with these degraded features. We introduce a sampling guidance mechanism to reduce hallucinations. We perform experiments on synthetic and real-world degradations and demonstrate that BIR-Adapter achieves competitive or better performance compared to state-of-the-art methods while having significantly lower complexity. Additionally, its adapter-based design enables integration into other diffusion models, enabling broader applications in image restoration tasks. We showcase this by extending a super-resolution-only model to perform better under additional unknown degradations.
[CV-7] Intraoperative 2D/3D Registration via Spherical Similarity Learning and Inference-Time Differentiable Levenberg-Marquardt Optimization WACV2026
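Quick read: This paper addresses the distorted manifold structure and slow convergence caused by Euclidean approximations in intraoperative 2D/3D registration, with the aim of localizing instruments and implants more accurately. The key to the solution is threefold. First, similarity is learned in a non-Euclidean spherical feature space to better capture complex manifold structure. Second, feature embeddings extracted by a CNN-Transformer encoder are projected into spherical space, and their geodesic distances are approximated with Riemannian distances in the bi-invariant SO(4) space, giving a more expressive and geometrically consistent deep similarity metric. Third, at inference time, gradient descent is replaced with fully differentiable Levenberg-Marquardt optimization to accelerate convergence.
Link: https://arxiv.org/abs/2509.06890
Authors: Minheng Chen,Youyong Kong
Affiliations: School of Computer Science and Engineering, Southeast University; Department of Computer Science and Engineering, University of Texas at Arlington
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: WACV 2026 Accepted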
Abstract:Intraoperative 2D/3D registration aligns preoperative 3D volumes with real-time 2D radiographs, enabling accurate localization of instruments and implants. A recent fully differentiable similarity learning framework approximates geodesic distances on SE(3), expanding the capture range of registration and mitigating the effects of substantial disturbances, but existing Euclidean approximations distort manifold structure and slow convergence. To address these limitations, we explore similarity learning in non-Euclidean spherical feature spaces to better capture and fit complex manifold structure. We extract feature embeddings using a CNN-Transformer encoder, project them into spherical space, and approximate their geodesic distances with Riemannian distances in the bi-invariant SO(4) space. This enables a more expressive and geometrically consistent deep similarity metric, enhancing the ability to distinguish subtle pose differences. During inference, we replace gradient descent with fully differentiable Levenberg-Marquardt optimization to accelerate convergence. Experiments on real and synthetic datasets show superior accuracy in both patient-specific and patient-agnostic scenarios.
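The abstract above replaces gradient descent with Levenberg-Marquardt (LM) optimization at inference time. As a rough illustration of what an LM iteration does, here is a minimal sketch on a toy curve-fitting problem. It is not the paper's registration pipeline: the model `a * exp(b * x)`, the function names, and the damping schedule are all assumptions chosen for illustration, and the paper optimizes pose parameters against a learned similarity instead.

```python
import math


def residuals(params, xs, ys):
    # Toy model (an assumption for illustration): y = a * exp(b * x).
    a, b = params
    return [a * math.exp(b * x) - y for x, y in zip(xs, ys)]


def jacobian(params, xs):
    # Partial derivatives of each residual with respect to a and b.
    a, b = params
    return [[math.exp(b * x), a * x * math.exp(b * x)] for x in xs]


def sum_of_squares(values):
    return sum(v * v for v in values)


def levenberg_marquardt(params, xs, ys, damping=1e-2, num_iters=200):
    params = list(params)
    cost = sum_of_squares(residuals(params, xs, ys))
    for _ in range(num_iters):
        r = residuals(params, xs, ys)
        jac = jacobian(params, xs)
        # Normal-equation terms: A = J^T J and g = J^T r.
        a00 = sum(row[0] * row[0] for row in jac)
        a01 = sum(row[0] * row[1] for row in jac)
        a11 = sum(row[1] * row[1] for row in jac)
        g0 = sum(row[0] * ri for row, ri in zip(jac, r))
        g1 = sum(row[1] * ri for row, ri in zip(jac, r))
        # Damped system (A + damping * diag(A)) * step = -g, solved by Cramer's rule.
        d00 = a00 * (1.0 + damping)
        d11 = a11 * (1.0 + damping)
        det = d00 * d11 - a01 * a01
        if det == 0.0:
            break
        step0 = (-g0 * d11 + a01 * g1) / det
        step1 = (-g1 * d00 + a01 * g0) / det
        candidate = [params[0] + step0, params[1] + step1]
        try:
            new_cost = sum_of_squares(residuals(candidate, xs, ys))
        except OverflowError:
            new_cost = float("inf")
        if new_cost < cost:
            # Accept the step and move closer to Gauss-Newton behaviour.
            params, cost = candidate, new_cost
            damping *= 0.5
        else:
            # Reject the step and move closer to small gradient-descent steps.
            damping *= 2.0
    return params, cost
```

The damping term is what separates LM from plain gradient descent: small damping gives fast Gauss-Newton-style steps near a solution, while large damping falls back to short, safe steps when a candidate step would increase the cost.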
[CV-8] Barlow-Swin: Toward a novel siamese-based segmentation architecture using Swin-Transformers
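Quick read: This paper addresses the efficiency bottleneck of existing medical image segmentation models in real-time and resource-limited settings. Convolutional networks such as U-Net have a limited receptive field that restricts global context modeling, while Transformer-based models improve on this but are often deep and computationally expensive, making them unsuitable for real-time clinical use. The key to the solution is an end-to-end lightweight architecture that combines a Swin Transformer-like encoder with a U-Net-like decoder, connected via skip pathways that preserve spatial detail while capturing contextual information. The encoder is first pretrained with the self-supervised Barlow Twins method so that it learns meaningful features without large amounts of labeled data, and the whole model is then fine-tuned. The result is competitive segmentation accuracy with substantially fewer parameters and faster inference.
Link: https://arxiv.org/abs/2509.06885
Authors: Morteza Kiani Haftlang,Mohammadhossein Malmir,Foroutan Parand,Umberto Michelucci,Safouane El Ghazouali
Affiliations: HSLU, Lucerne University of Applied Sciences and Arts, Lucerne, Switzerland; Technical University of Munich, Munich, Germany; University College London (UCL), London, United Kingdom; TOELT LLC AI lab, Winterthur, Switzerland
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: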
Abstract:Medical image segmentation is a critical task in clinical workflows, particularly for the detection and delineation of pathological regions. While convolutional architectures like U-Net have become standard for such tasks, their limited receptive field restricts global context modeling. Recent efforts integrating transformers have addressed this, but often result in deep, computationally expensive models unsuitable for real-time use. In this work, we present a novel end-to-end lightweight architecture designed specifically for real-time binary medical image segmentation. Our model combines a Swin Transformer-like encoder with a U-Net-like decoder, connected via skip pathways to preserve spatial detail while capturing contextual information. Unlike existing designs such as Swin Transformer or U-Net, our architecture is significantly shallower and competitively efficient. To improve the encoder’s ability to learn meaningful features without relying on large amounts of labeled data, we first train it using Barlow Twins, a self-supervised learning method that helps the model focus on important patterns by reducing unnecessary repetition in the learned features. After this pretraining, we fine-tune the entire model for our specific task. Experiments on benchmark binary segmentation tasks demonstrate that our model achieves competitive accuracy with substantially reduced parameter count and faster inference, positioning it as a practical alternative for deployment in real-time and resource-limited clinical environments. The code for our method is available in a GitHub repository: this https URL.
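The abstract describes Barlow Twins as reducing unnecessary repetition in the learned features. A minimal sketch of that objective follows, assuming two batches of embeddings from two augmented views of the same images. The function names and the `off_diag_weight` value are illustrative assumptions, not taken from this paper's code.

```python
import math


def standardize_columns(batch):
    # Normalize each embedding dimension to zero mean and unit variance over the batch.
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        variance = sum((v - mean) ** 2 for v in column) / n
        std = math.sqrt(variance) + 1e-12  # small constant avoids division by zero
        for i in range(n):
            out[i][j] = (batch[i][j] - mean) / std
    return out


def barlow_twins_loss(z_a, z_b, off_diag_weight=5e-3):
    # z_a[k] and z_b[k] are embeddings of two augmented views of sample k.
    n, d = len(z_a), len(z_a[0])
    a = standardize_columns(z_a)
    b = standardize_columns(z_b)
    loss = 0.0
    for i in range(d):
        for j in range(d):
            # Entry (i, j) of the cross-correlation matrix between the two views.
            c_ij = sum(a[k][i] * b[k][j] for k in range(n)) / n
            if i == j:
                loss += (1.0 - c_ij) ** 2  # invariance term: diagonal pushed to 1
            else:
                loss += off_diag_weight * c_ij ** 2  # redundancy term: off-diagonal pushed to 0
    return loss
```

The loss is near zero when the two views agree and the embedding dimensions are decorrelated, and it grows when different dimensions carry the same information.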
[CV-9] A New Hybrid Model of Generative Adversarial Network and You Only Look Once Algorithm for Automatic License-Plate Recognition
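Quick read: This paper addresses the low accuracy and poor real-time performance of Automatic License-Plate Recognition (ALPR) in Intelligent Transportation Systems (ITS), which stem from the high variability of the inputs (for example, blurred plates). The core of the solution is an efficient end-to-end ALPR framework with two key components: 1) a selective Generative Adversarial Network (GAN) used as a preprocessing module for deblurring (Deblur-GAN), which is applied only when needed; and 2) the YOLOv5 object detection architecture for License-Plate Detection (LPD) and for the integrated Character Segmentation (CS) and Character Recognition (CR) steps, giving high accuracy at low computing cost. Experiments report a detection time of 0.026 seconds, 95% accuracy for LPD and 97% for CR, and a detection accuracy improvement of nearly 40% from the Deblur-GAN pre-processor, especially on blurred license plates.
Link: https://arxiv.org/abs/2509.06868
Authors: Behnoud Shafiezadeh,Amir Mashmool,Farshad Eshghi,Manoochehr Kelarestaghi
Affiliations: Kharazmi University; University of Bremen; University of the Fraser Valley
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: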
Abstract:Automatic License-Plate Recognition (ALPR) plays a pivotal role in Intelligent Transportation Systems (ITS) as a fundamental element of Smart Cities. However, due to its high variability, ALPR faces challenging issues more efficiently addressed by deep learning techniques. In this paper, a selective Generative Adversarial Network (GAN) is proposed for deblurring in the preprocessing step, coupled with the state-of-the-art You-Only-Look-Once (YOLO)v5 object detection architectures for License-Plate Detection (LPD), and the integrated Character Segmentation (CS) and Character Recognition (CR) steps. The selective preprocessing bypasses unnecessary and sometimes counter-productive input manipulations, while YOLOv5 LPD/CS+CR delivers high accuracy and low computing cost. As a result, YOLOv5 achieves a detection time of 0.026 seconds for both LP and CR detection stages, facilitating real-time applications with exceptionally rapid responsiveness. Moreover, the proposed model achieves accuracy rates of 95% and 97% in the LPD and CR detection phases, respectively. Furthermore, the inclusion of the Deblur-GAN pre-processor significantly improves detection accuracy by nearly 40%, especially when encountering blurred License Plates (LPs). To train and test the learning components, we generated and publicly released our blur and ALPR datasets (using Iranian license plates as a use-case), which are more representative of close-to-real-life ad-hoc situations. The findings demonstrate that employing the state-of-the-art YOLO model results in excellent overall precision and detection time, making it well-suited for portable applications. Additionally, integrating the Deblur-GAN model as a preliminary processing step enhances the overall effectiveness of our comprehensive model, particularly when confronted with blurred scenes captured by the camera as input.
[CV-10] Matching Shapes Under Different Topologies: A Topology-Adaptive Deformation Guided Approach
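Quick read: This paper addresses non-rigid 3D mesh matching in the presence of topological artefacts, which break the assumption made by existing methods that the deformation between shapes is near-isometric or ARAP (As-Rigid-As-Possible). The key to the solution is a topology-adaptive deformation model that allows changes in shape topology in order to align shape pairs under ARAP and bijective association constraints. Using this model, the method jointly optimises a template mesh with adequate topology and its alignment with the shapes to be matched, from which correspondences are extracted. Without relying on any data-driven prior, the approach handles highly non-isometric shapes and shapes with topological artefacts, such as noisy per-frame multi-view reconstructions, and even outperforms methods trained on large datasets in 3D alignment quality.
Link: https://arxiv.org/abs/2509.06862
Authors: Aymen Merrouche,Stefanie Wuhrer,Edmond Boyer
Affiliations: Univ. Grenoble Alpes; CNRS; Inria; Grenoble INP; LJK
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: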
Abstract:Non-rigid 3D mesh matching is a critical step in computer vision and computer graphics pipelines. We tackle matching meshes that contain topological artefacts which can break the assumption made by current approaches. While Functional Maps assume the deformation induced by the ground truth correspondences to be near-isometric, ARAP-like deformation-guided approaches assume the latter to be ARAP. Neither assumption holds in certain topological configurations of the input shapes. We are motivated by real-world scenarios such as per-frame multi-view reconstructions, often suffering from topological artefacts. To this end, we propose a topology-adaptive deformation model allowing changes in shape topology to align shape pairs under ARAP and bijective association constraints. Using this model, we jointly optimise for a template mesh with adequate topology and for its alignment with the shapes to be matched to extract correspondences. We show that, while not relying on any data-driven prior, our approach applies to highly non-isometric shapes and shapes with topological artefacts, including noisy per-frame multi-view reconstructions, even outperforming methods trained on large datasets in 3D alignment quality.
[CV-11] Automated Radiographic Total Sharp Score (ARTSS) in Rheumatoid Arthritis: A Solution to Reduce Inter-Intra Reader Variation and Enhancing Clinical Practice
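Quick read: This paper addresses the time-consuming, subjective manual reading required by the Total Sharp/Van Der Heijde Score (TSS) for assessing the severity of rheumatoid arthritis (RA), and the resulting inter- and intra-observer variability. The key to the solution is an end-to-end automated X-ray scoring framework, Automated Radiographic Sharp Scoring (ARTSS), built from four deep learning stages: image pre-processing and re-orientation with ResNet50, hand segmentation with UNet.3, joint identification with YOLOv7, and TSS prediction with several convolutional neural networks and a Vision Transformer (ViT). The method accommodates patients with joint disappearance and a variable number of joints, which are hard for conventional scoring pipelines to handle. The joint identification model reached 99% accuracy and the best model, ViT, achieved a Huber loss of 0.87 for TSS prediction, improving the objectivity, efficiency, and reproducibility of RA scoring.
Link: https://arxiv.org/abs/2509.06854
Authors: Hajar Moradmand,Lei Ren
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: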
Abstract:Assessing the severity of rheumatoid arthritis (RA) using the Total Sharp/Van Der Heijde Score (TSS) is crucial, but manual scoring is often time-consuming and subjective. This study introduces an Automated Radiographic Sharp Scoring (ARTSS) framework that leverages deep learning to analyze full-hand X-ray images, aiming to reduce inter- and intra-observer variability. The research uniquely accommodates patients with joint disappearance and variable-length image sequences. We developed ARTSS using data from 970 patients, structured into four stages: I) Image pre-processing and re-orientation using ResNet50, II) Hand segmentation using UNet.3, III) Joint identification using YOLOv7, and IV) TSS prediction using models such as VGG16, VGG19, ResNet50, DenseNet201, EfficientNetB0, and Vision Transformer (ViT). We evaluated model performance with Intersection over Union (IoU), Mean Average Precision (MAP), mean absolute error (MAE), Root Mean Squared Error (RMSE), and Huber loss. The average TSS from two radiologists was used as the ground truth. Model training employed 3-fold cross-validation, with each fold consisting of 452 training and 227 validation samples, and external testing included 291 unseen subjects. Our joint identification model achieved 99% accuracy. The best-performing model, ViT, achieved a notably low Huber loss of 0.87 for TSS prediction. Our results demonstrate the potential of deep learning to automate RA scoring, which can significantly enhance clinical practice. Our approach addresses the challenge of joint disappearance and variable joint numbers, offers timesaving benefits, reduces inter- and intra-reader variability, improves radiologist accuracy, and aids rheumatologists in making more informed decisions.
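The abstract reports MAE, RMSE, and Huber loss for TSS prediction. For readers unfamiliar with how the three differ, here is a minimal sketch. The threshold `delta=1.0` is an assumption, since the abstract does not state the value used, and the function names are illustrative.

```python
import math


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def huber_loss(y_true, y_pred, delta=1.0):
    # Quadratic for small errors, linear for large ones, so outliers weigh less than in MSE.
    total = 0.0
    for t, p in zip(y_true, y_pred):
        error = abs(t - p)
        if error <= delta:
            total += 0.5 * error * error
        else:
            total += delta * (error - 0.5 * delta)
    return total / len(y_true)
```

Because the Huber value depends on `delta`, a reported figure such as 0.87 is only comparable across studies that use the same threshold and score scale.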
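[CV-12] ToonOut: Fine-tuned Background-Removal for Anime Characters
Quick read: This paper addresses the poor performance of state-of-the-art background removal models in specialized domains such as anime-style content, where complex features like hair and transparency are handled with insufficient precision. The key to the solution is a custom dataset of 1,228 high-quality annotated anime images, on which the open-sourced BiRefNet model is fine-tuned. This markedly improves background removal accuracy in the domain, raising Pixel Accuracy from 95.3% to 99.5%.
Link: https://arxiv.org/abs/2509.06839
Authors: Matteo Muratori,Joël Seytre
Affiliations: Kartoon AI; University of Bologna
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: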
Abstract:While state-of-the-art background removal models excel at realistic imagery, they frequently underperform in specialized domains such as anime-style content, where complex features like hair and transparency present unique challenges. To address this limitation, we collected and annotated a custom dataset of 1,228 high-quality anime images of characters and objects, and fine-tuned the open-sourced BiRefNet model on this dataset. This resulted in marked improvements in background removal accuracy for anime-style images, increasing from 95.3% to 99.5% for our newly introduced Pixel Accuracy metric. We are open-sourcing the code, the fine-tuned model weights, as well as the dataset at: this https URL.
[CV-13] Evaluating the Impact of Adversarial Attacks on Traffic Sign Classification using the LISA Dataset
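Quick read: This paper addresses the vulnerability of traffic sign recognition models to adversarial attacks. The key to the approach is to train a convolutional neural network (CNN) on the LISA Traffic Sign dataset and to evaluate its robustness against the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) attacks. The results show that classification accuracy drops sharply as the perturbation magnitude increases, which exposes the model's susceptibility to adversarial examples and provides an empirical basis for developing defense mechanisms tailored to real-world traffic sign recognition systems.
Link: https://arxiv.org/abs/2509.06835
Authors: Nabeyou Tadessa,Balaji Iyangar,Mashrur Chowdhury
Affiliations: Benedict College; Clemson University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: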
Abstract:Adversarial attacks pose significant threats to machine learning models by introducing carefully crafted perturbations that cause misclassification. While prior work has primarily focused on MNIST and similar datasets, this paper investigates the vulnerability of traffic sign classifiers using the LISA Traffic Sign dataset. We train a convolutional neural network to classify 47 different traffic signs and evaluate its robustness against Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) attacks. Our results show a sharp decline in classification accuracy as the perturbation magnitude increases, highlighting the model's susceptibility to adversarial examples. This study lays the groundwork for future exploration into defense mechanisms tailored for real-world traffic sign recognition systems.
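To make the two attacks named in the abstract concrete, here is a minimal sketch of FGSM and PGD against a tiny logistic-regression classifier. The logistic model stands in for the paper's CNN purely for illustration, and the function names and parameters are assumptions. A real image attack would also clip pixel values to the valid range, which is omitted here.

```python
import math


def predict_proba(weights, bias, x):
    # Probability of the positive class under a logistic model.
    logit = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def loss_gradient_wrt_input(weights, bias, x, label):
    # For binary cross-entropy with a logistic model, dL/dx = (p - y) * w.
    p = predict_proba(weights, bias, x)
    return [(p - label) * w for w in weights]


def sign(value):
    return (value > 0) - (value < 0)


def fgsm(weights, bias, x, label, epsilon):
    # Single step of size epsilon in the direction that increases the loss.
    grad = loss_gradient_wrt_input(weights, bias, x, label)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


def pgd(weights, bias, x, label, epsilon, step_size, num_steps):
    # Repeated small signed-gradient steps, each followed by a projection.
    x_adv = list(x)
    for _ in range(num_steps):
        grad = loss_gradient_wrt_input(weights, bias, x_adv, label)
        x_adv = [xa + step_size * sign(g) for xa, g in zip(x_adv, grad)]
        # Project back into the L-infinity ball of radius epsilon around x.
        x_adv = [min(max(xa, xi - epsilon), xi + epsilon) for xa, xi in zip(x_adv, x)]
    return x_adv
```

FGSM spends the whole perturbation budget in one step, while PGD iterates within the same budget, which is why PGD is usually the stronger of the two at equal `epsilon`.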
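[CV-14] Leveraging Generic Foundation Models for Multimodal Surgical Data Analysis MICCAI2025
Quick read: This paper investigates how a generic foundation model, together with complementary modalities from the operating room (OR), can support surgical data science for minimally invasive surgery. The core challenges are adapting the model to the surgical domain through transfer learning and integrating time-resolved multimodal OR data to improve downstream predictions. The key to the solution is twofold. First, V-JEPA serves as the single-modality foundation model and is finetuned on unlabeled surgical video for domain adaptation. Second, following the idea of modular decision support networks, additional time-resolved OR data streams are integrated by training a separate encoder that forms a shared representation space with V-JEPA's embeddings. Experiments show that domain-specific finetuning improves performance on predicting hospital length of stay and postoperative complications in liver surgery and on surgical phase recognition on the public HeiCo dataset, and that the additional data streams further help on the in-house data.
Link: https://arxiv.org/abs/2509.06831
Authors: Simon Pezold,Jérôme A. Kurylec,Jan S. Liechti,Beat P. Müller,Joël L. Lavanchy
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 3 figures; accepted at ML-CDS @ MICCAI 2025, Daejeon, Republic of Korea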
Abstract:We investigate how both the adaptation of a generic foundation model via transfer learning and the integration of complementary modalities from the operating room (OR) can support surgical data science. To this end, we use V-JEPA as the single-modality foundation of a multimodal model for minimally invasive surgery support. We analyze how the model’s downstream performance can benefit (a) from finetuning on unlabeled surgical video data and (b) from providing additional time-resolved data streams from the OR in a multimodal setup. In an in-house dataset of liver surgery videos, we analyze the tasks of predicting hospital length of stay and postoperative complications. In videos of the public HeiCo dataset, we analyze the task of surgical phase recognition. As a baseline, we apply pretrained V-JEPA to all tasks. We then finetune it on unlabeled, held-out videos to investigate its change in performance after domain adaptation. Following the idea of modular decision support networks, we integrate additional data streams from the OR by training a separate encoder to form a shared representation space with V-JEPA’s embeddings. Our experiments show that finetuning on domain-specific data increases model performance. On the in-house data, integrating additional time-resolved data likewise benefits the model. On the HeiCo data, accuracy of the pretrained video-only, single-modality baseline setup is on par with the top-performing submissions of the EndoVis2017 challenge, while finetuning on domain-specific data increases accuracy further. Our results thus demonstrate how surgical data science can leverage public, generic foundation models. Likewise, they indicate the potential of domain adaptation and of integrating suitable complementary data streams from the OR. To support further research, we release our code and model weights at this https URL. 
[CV-15] Curia: A Multi-Modal Foundation Model for Radiology
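Quick read: This paper addresses the limited generalization of AI-assisted radiological interpretation, which relies predominantly on narrow, single-task models that cannot practically cover the full range of imaging modalities, diseases, and radiological findings. The key to the solution is Curia, a radiology foundation model (FM) trained on a large real-world corpus of cross-sectional imaging (150,000 exams, 130 TB). On a newly curated 19-task external validation benchmark, Curia accurately identifies organs, detects conditions such as brain hemorrhages and myocardial infarctions, and predicts outcomes in tumor staging. It meets or surpasses the performance of radiologists and recent foundation models, and shows clinically significant emergent properties in cross-modality and low-data regimes.
Link: https://arxiv.org/abs/2509.06830
Authors: Corentin Dancette,Julien Khlaut,Antoine Saporta,Helene Philippe,Elodie Ferreres,Baptiste Callard,Théo Danielou,Léo Alberge,Léo Machado,Daniel Tordjman,Julie Dupuis,Korentin Le Floch,Jean Du Terrail,Mariam Moshiri,Laurent Dercle,Tom Boeken,Jules Gregory,Maxime Ronot,François Legou,Pascal Roux,Marc Sapoval,Pierre Manceron,Paul Hérent
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: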
Abstract:AI-assisted radiological interpretation is based on predominantly narrow, single-task models. This approach is impractical for covering the vast spectrum of imaging modalities, diseases, and radiological findings. Foundation models (FMs) hold the promise of broad generalization across modalities and in low-data settings. However, this potential has remained largely unrealized in radiology. We introduce Curia, a foundation model trained on the entire cross-sectional imaging output of a major hospital over several years, which to our knowledge is the largest such corpus of real-world data, encompassing 150,000 exams (130 TB). On a newly curated 19-task external validation benchmark, Curia accurately identifies organs, detects conditions like brain hemorrhages and myocardial infarctions, and predicts outcomes in tumor staging. Curia meets or surpasses the performance of radiologists and recent foundation models, and exhibits clinically significant emergent properties in cross-modality and low-data regimes. To accelerate progress, we release our base model’s weights at this https URL.
[CV-16] Video-Based MPAA Rating Prediction: An Attention-Driven Hybrid Architecture Using Contrastive Learning
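Quick read: This paper addresses automated video classification against the MPAA (Motion Picture Association of America) age-suitability ratings (G, PG, PG-13, R), where traditional methods suffer from large labeled-data requirements, poor generalization, and inefficient feature learning. The key to the solution is to use contrastive learning for better discrimination and adaptability, with a hybrid architecture that combines an LRCN (CNN+LSTM) backbone with a Bahdanau attention mechanism. CNNs extract spatial features, LSTMs model temporal structure, and attention weights frames dynamically by importance, which improves fine-grained borderline distinctions such as PG-13 versus R. In the Contextual Contrastive Learning framework the model reaches 88% accuracy and an F1 score of 0.8815, and it is deployed as a web application for real-time rating classification on streaming platforms.
Link: https://arxiv.org/abs/2509.06826
Authors: Dipta Neogi,Nourash Azmine Chowdhury,Muhammad Rafsan Kabir,Mohammad Ashrafuzzaman Khan
Affiliations: North South University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 12 pages, 9 figures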
Abstract:The rapid growth of visual content consumption across platforms necessitates automated video classification for age-suitability standards like the MPAA rating system (G, PG, PG-13, R). Traditional methods struggle with large labeled data requirements, poor generalization, and inefficient feature learning. To address these challenges, we employ contrastive learning for improved discrimination and adaptability, exploring three frameworks: Instance Discrimination, Contextual Contrastive Learning, and Multi-View Contrastive Learning. Our hybrid architecture integrates an LRCN (CNN+LSTM) backbone with a Bahdanau attention mechanism, achieving state-of-the-art performance in the Contextual Contrastive Learning framework, with 88% accuracy and an F1 score of 0.8815. By combining CNNs for spatial features, LSTMs for temporal modeling, and attention mechanisms for dynamic frame prioritization, the model excels in fine-grained borderline distinctions, such as differentiating PG-13 and R-rated content. We evaluate the model’s performance across various contrastive loss functions, including NT-Xent, NT-logistic, and Margin Triplet, demonstrating the robustness of our proposed architecture. To ensure practical application, the model is deployed as a web application for real-time MPAA rating classification, offering an efficient solution for automated content compliance across streaming platforms.
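NT-Xent is one of the contrastive losses the abstract evaluates. A minimal sketch of it follows, assuming each clip yields two embeddings from two augmented views. The function names and the temperature value are illustrative assumptions, not the paper's settings.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_loss(view_a, view_b, temperature=0.5):
    # view_a[i] and view_b[i] are embeddings of two augmentations of clip i.
    embeddings = view_a + view_b
    n = len(view_a)
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the other view of the same clip
        # Similarities to every other embedding in the batch, excluding i itself.
        logits = [
            cosine_similarity(embeddings[i], embeddings[k]) / temperature
            for k in range(2 * n)
            if k != i
        ]
        positive_logit = cosine_similarity(embeddings[i], embeddings[positive]) / temperature
        # Cross-entropy of picking the positive among all other embeddings.
        total += math.log(sum(math.exp(value) for value in logits)) - positive_logit
    return total / (2 * n)
```

The loss is low when each embedding is closer to its paired view than to any other clip in the batch, which is the property a rating classifier built on top of these embeddings relies on.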
[CV-17] UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward
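Quick read: This paper addresses the difficulty of preserving consistent identity while avoiding identity confusion when image customization models are given multiple reference images, a problem that limits the identity scalability of these models. The key to the solution is UMO, a Unified Multi-identity Optimization framework. With a "multi-to-multi matching" paradigm, UMO reformulates multi-identity generation as a global assignment optimization problem and improves multi-identity consistency for existing image customization methods in general through reinforcement learning on diffusion models, which significantly improves identity fidelity and reduces identity confusion.
Link: https://arxiv.org/abs/2509.06818
Authors: Yufeng Cheng,Wenxu Wu,Shaojin Wu,Mengqi Huang,Fei Ding,Qian He
Affiliations: ByteDance
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL Code and model: this https URL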
Abstract:Recent advancements in image customization exhibit a wide range of application prospects due to stronger customization capabilities. However, since we humans are more sensitive to faces, a significant challenge remains in preserving consistent identity while avoiding identity confusion with multi-reference images, limiting the identity scalability of customization models. To address this, we present UMO, a Unified Multi-identity Optimization framework, designed to maintain high-fidelity identity preservation and alleviate identity confusion with scalability. With a “multi-to-multi matching” paradigm, UMO reformulates multi-identity generation as a global assignment optimization problem and unleashes multi-identity consistency for existing image customization methods generally through reinforcement learning on diffusion models. To facilitate the training of UMO, we develop a scalable customization dataset with multi-reference images, consisting of both synthesised and real parts. Additionally, we propose a new metric to measure identity confusion. Extensive experiments demonstrate that UMO not only improves identity consistency significantly, but also reduces identity confusion on several image customization methods, setting a new state-of-the-art among open-source methods along the dimension of identity preserving. Code and model: this https URL
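The abstract frames multi-identity generation as a global assignment problem. The sketch below shows why a global assignment differs from matching each reference greedily. It is only an illustration under assumptions: the similarity matrix, the function name, and the brute-force search are hypothetical, and the abstract does not describe how UMO computes or uses its matching reward.

```python
from itertools import permutations


def best_identity_assignment(similarity):
    # similarity[i][j]: hypothetical similarity between reference identity i
    # and generated face j. A square matrix is assumed.
    n = len(similarity)
    best_assignment, best_score = None, float("-inf")
    # Brute force over all one-to-one assignments; fine for a handful of identities.
    for assignment in permutations(range(n)):
        score = sum(similarity[i][assignment[i]] for i in range(n))
        if score > best_score:
            best_assignment, best_score = assignment, score
    return list(best_assignment), best_score
```

With `[[0.9, 0.8], [0.85, 0.1]]`, matching reference 0 to its individually best face leaves reference 1 with a poor match (total 1.0), whereas the global optimum swaps them (total 1.65). For larger numbers of identities, a polynomial-time solver such as the Hungarian algorithm would replace the brute-force loop.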
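[CV-18] MIORe & VAR-MIORe: Benchmarks to Push the Boundaries of Restoration ICCV2025
Quick read: This paper addresses critical limitations of current motion restoration benchmarks, namely the lack of multi-task datasets captured at high frame rates with professional-grade optics and the lack of control over motion magnitude. The key to the solution is two new multi-task datasets, MIORe and VAR-MIORe. MIORe generates consistent motion blur by adaptively averaging frames based on computed optical flow metrics, while preserving sharp inputs for video frame interpolation and optical flow estimation. VAR-MIORe extends this by spanning a variable range of motion magnitudes, from minimal to extreme, and is the first benchmark to offer explicit control over motion amplitude. Together they provide high-resolution, scalable ground truths for image and video restoration tasks.
Link: https://arxiv.org/abs/2509.06803
Authors: George Ciubotariu,Zhuyun Zhou,Zongwei Wu,Radu Timofte
Affiliations: University of Würzburg
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025 Oral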
Abstract:We introduce MIORe and VAR-MIORe, two novel multi-task datasets that address critical limitations in current motion restoration benchmarks. Designed with high-frame-rate (1000 FPS) acquisition and professional-grade optics, our datasets capture a broad spectrum of motion scenarios, which include complex ego-camera movements, dynamic multi-subject interactions, and depth-dependent blur effects. By adaptively averaging frames based on computed optical flow metrics, MIORe generates consistent motion blur and preserves sharp inputs for video frame interpolation and optical flow estimation. VAR-MIORe extends this by spanning a variable range of motion magnitudes, from minimal to extreme, establishing the first benchmark to offer explicit control over motion amplitude. We provide high-resolution, scalable ground truths that challenge existing algorithms under both controlled and adverse conditions, paving the way for next-generation research on various image and video restoration tasks.
[CV-19] SynthDrive: Scalable Real2Sim2Real Sensor Simulation Pipeline for High-Fidelity Asset Generation and Driving Data Synthesis
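Quick read: This paper addresses two challenges of sensor simulation for autonomous driving. First, computer-graphics-based (CG-based) methods such as CARLA lack diversity and struggle to scale to the vast number of rare cases needed for robust perception training. Second, learning-based approaches such as NeuSim are limited to specific object categories (vehicles) and require extensive multi-sensor data, which hinders their application to generic objects. The key to the solution is a scalable real2sim2real system that leverages 3D generation to automate asset mining, generation, and rare-case data synthesis.
Link: https://arxiv.org/abs/2509.06798
Authors: Zhengqing Chen,Ruohong Mei,Xiaoyang Guo,Qingjie Wang,Yubin Hu,Wei Yin,Weiqiang Ren,Qian Zhang
Affiliations: Horizon Robotics; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages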
Abstract:In the field of autonomous driving, sensor simulation is essential for generating rare and diverse scenarios that are difficult to capture in real-world environments. Current solutions fall into two categories: 1) CG-based methods, such as CARLA, which lack diversity and struggle to scale to the vast array of rare cases required for robust perception training; and 2) learning-based approaches, such as NeuSim, which are limited to specific object categories (vehicles) and require extensive multi-sensor data, hindering their applicability to generic objects. To address these limitations, we propose a scalable real2sim2real system that leverages 3D generation to automate asset mining, generation, and rare-case data synthesis.
[CV-20] AIM 2025 Challenge on High FPS Motion Deblurring: Methods and Results ICCV
Quick read: This paper reviews the AIM 2025 challenge on high-frame-rate (High FPS) single-image motion deblurring. The core challenge is to learn representative visual cues for complex aggregations of motion types, so that clear and visually compelling images can be recovered in diverse and difficult conditions. The participating teams' networks are evaluated on samples from the new MIORe dataset, which introduces challenging movement patterns, and the results showcase the progress of the field.
Link: https://arxiv.org/abs/2509.06793
Authors: George Ciubotariu,Florin-Alexandru Vasluianu,Zhuyun Zhou,Nancy Mehta,Radu Timofte,Ke Wu,Long Sun,Lingshun Kong,Zhongbao Yang,Jinshan Pan,Jiangxin Dong,Jinhui Tang,Hao Chen,Yinghui Fang,Dafeng Zhang,Yongqi Song,Jiangbo Guo,Shuhua Jin,Zeyu Xiao,Rui Zhao,Zhuoyuan Li,Cong Zhang,Yufeng Peng,Xin Lu,Zhijing Sun,Chengjie Ge,Zihao Li,Zishun Liao,Ziang Zhou,Qiyu Kang,Xueyang Fu,Zheng-Jun Zha,Yuqian Zhang,Shuai Liu,Jie Liu,Zhuhao Zhang,Lishen Qu,Zhihao Liu,Shihao Zhou,Yaqi Luo,Juncheng Zhou,Jufeng Yang,Qianfeng Yang,Qiyuan Guan,Xiang Chen,Guiyue Jin,Jiyu Jin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCVW AIM 2025
Abstract:This paper presents a comprehensive review of the AIM 2025 High FPS Non-Uniform Motion Deblurring Challenge, highlighting the proposed solutions and final results. The objective of this challenge is to identify effective networks capable of producing clearer and visually compelling images in diverse and challenging conditions, by learning representative visual cues for complex aggregations of motion types. A total of 68 participants registered for the competition, and 9 teams ultimately submitted valid entries. This paper thoroughly evaluates the state-of-the-art advances in high-FPS single image motion deblurring, showcasing the significant progress in the field, while leveraging samples of the novel dataset, MIORe, that introduces challenging examples of movement patterns.
[CV-21] P3-SAM: Native 3D Part Segmentation
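Quick read: This paper addresses the poor robustness of existing 3D asset segmentation methods on complex objects and their inability to fully automate the process. The key to the solution is P3-SAM, a native 3D point-promptable part segmentation model that consists of a feature extractor, multiple segmentation heads, and an IoU predictor and supports interactive segmentation. An accompanying algorithm automatically selects and merges the predicted masks to perform part instance segmentation. Trained on a newly built dataset of nearly 3.7 million models, P3-SAM achieves precise and robust segmentation on complex objects and attains state-of-the-art performance.
Link: https://arxiv.org/abs/2509.06784
Authors: Changfeng Ma,Yang Li,Xinhao Yan,Jiachen Xu,Yunhan Yang,Chunshi Wang,Zibo Zhao,Yanwen Guo,Zhuo Chen,Chunchao Guo
Affiliations: Tencent Hunyuan; NJU; ShanghaiTech; HKU; ZJU
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Tech Report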
Abstract:Segmenting 3D assets into their constituent parts is crucial for enhancing 3D understanding, facilitating model reuse, and supporting various applications such as part generation. However, current methods show poor robustness on complex objects and cannot fully automate the process. In this paper, we propose a native 3D point-promptable part segmentation model termed P3-SAM, designed to fully automate the segmentation of any 3D object into components. Inspired by SAM, P3-SAM consists of a feature extractor, multiple segmentation heads, and an IoU predictor, enabling interactive segmentation for users. We also propose an algorithm to automatically select and merge masks predicted by our model for part instance segmentation. Our model is trained on a newly built dataset containing nearly 3.7 million models with reasonable segmentation labels. Comparisons show that our method achieves precise segmentation results and strong robustness on complex objects, attaining state-of-the-art performance. Our code will be released soon.
[CV-22] UrbanTwin: High-Fidelity Synthetic Replicas of Roadside Lidar Datasets
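Quick read: This paper addresses the scarcity, high annotation cost, and limited scene diversity of real lidar data for lidar perception tasks. The key to the solution is to build high-fidelity digital twins that model the real locations' surrounding geometry, lane-level road alignment, lane topology, and vehicle movement patterns, and to use emulated lidar sensors within them to synthesize the UrbanTwin datasets, which are well aligned with their real counterparts. The datasets increase sample size and scene diversity, and 3D object detection models trained solely on the synthetic data and tested on unseen real data show improved detection performance compared to models trained on real data, so the synthetic data has both standalone and augmentative value.
Link: https://arxiv.org/abs/2509.06781
Authors: Muhammad Shahbaz,Shaurya Agarwal
Affiliations: University of Central Florida
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: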
Abstract:This article presents UrbanTwin datasets - high-fidelity, realistic replicas of three public roadside lidar datasets: LUMPI, V2X-Real-IC, and TUMTraf-I. Each UrbanTwin dataset contains 10K annotated frames corresponding to one of the public datasets. Annotations include 3D bounding boxes, instance segmentation labels, and tracking IDs for six object classes, along with semantic segmentation labels for nine classes. These datasets are synthesized using emulated lidar sensors within realistic digital twins, modeled based on surrounding geometry, road alignment at lane level, and the lane topology and vehicle movement patterns at intersections of the actual locations corresponding to each real dataset. Due to the precise digital twin modeling, the synthetic datasets are well aligned with their real counterparts, offering strong standalone and augmentative value for training deep learning models on tasks such as 3D object detection, tracking, and semantic and instance segmentation. We evaluate the alignment of the synthetic replicas through statistical and structural similarity analysis with real data, and further demonstrate their utility by training 3D object detection models solely on synthetic data and testing them on real, unseen data. The high similarity scores and improved detection performance, compared to the models trained on real data, indicate that the UrbanTwin datasets effectively enhance existing benchmark datasets by increasing sample size and scene diversity. In addition, the digital twins can be adapted to test custom scenarios by modifying the design and dynamics of the simulations. To our knowledge, these are the first digitally synthesized datasets that can replace in-domain real-world datasets for lidar perception tasks. UrbanTwin datasets are publicly available at this https URL.
[CV-23] D-HUMOR: Dark Humor Understanding via Multimodal Open-ended Reasoning ICDM
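[Quick read]: This paper addresses multimodal detection of dark humor in online memes. The core difficulty is that dark humor depends on implicit, sensitive, and culturally contextual cues that existing resources and methods capture poorly. The key to the solution is a dataset of 4,379 annotated Reddit memes together with a reasoning-augmented framework. A Large Vision-Language Model (VLM) first generates a structured explanation for each meme, and a Role-Reversal Self-Loop then has the VLM adopt the author's perspective to refine that explanation iteratively for completeness and alignment. Text features from the OCR transcript and the self-refined reasoning are combined with image features in a Tri-stream Cross-Reasoning Network (TCRNet), which fuses the three streams through pairwise attention into a unified representation. The approach outperforms strong baselines on three tasks: dark humor detection, target identification, and intensity prediction.
Link: https://arxiv.org/abs/2509.06771
Authors: Sai Kartheek Reddy Kasu, Mohammad Zia Ur Rehman, Shahid Shafi Dar, Rishi Bharat Junghare, Dhanvin Sanjay Namboodiri, Nagendra Kumar
Affiliation: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at IEEE International Conference on Data Mining (ICDM) 2025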
Abstract:Dark humor in online memes poses unique challenges due to its reliance on implicit, sensitive, and culturally contextual cues. To address the lack of resources and methods for detecting dark humor in multimodal content, we introduce a novel dataset of 4,379 Reddit memes annotated for dark humor, target category (gender, mental health, violence, race, disability, and other), and a three-level intensity rating (mild, moderate, severe). Building on this resource, we propose a reasoning-augmented framework that first generates structured explanations for each meme using a Large Vision-Language Model (VLM). Through a Role-Reversal Self-Loop, VLM adopts the author’s perspective to iteratively refine its explanations, ensuring completeness and alignment. We then extract textual features from both the OCR transcript and the self-refined reasoning via a text encoder, while visual features are obtained using a vision transformer. A Tri-stream Cross-Reasoning Network (TCRNet) fuses these three streams, text, image, and reasoning, via pairwise attention mechanisms, producing a unified representation for classification. Experimental results demonstrate that our approach outperforms strong baselines across three tasks: dark humor detection, target identification, and intensity prediction. The dataset, annotations, and code are released to facilitate further research in multimodal humor understanding and content moderation. Code and Dataset are available at: this https URL
[CV-24] Raw2Event: Converting Raw Frame Camera into Event Camera DATE
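[Quick read]: This paper addresses the barriers that keep event cameras out of early-stage development and prototyping: high cost, limited resolution, and missing features such as autofocus. The proposed solution, Raw2Event, is a complete hardware-software system that generates events in real time from low-cost raw frame-based cameras. It reads raw Bayer data directly and bypasses the conventional image signal processor (ISP), which lets it exploit the full potential of the camera hardware. Built on the DVS-Voltmeter model, it provides a configurable simulation framework and a pipeline for synchronized recording of raw, RGB, and event streams. The resulting event streams closely resemble those of real event cameras while offering higher dynamic range and resolution, parameters can be tuned intuitively, and the system runs in real time on a Raspberry Pi as a low-cost, scalable alternative for event-based vision research and early-stage system development.
Link: https://arxiv.org/abs/2509.06767
Authors: Zijie Ning, Enmin Lin, Sudarshan R. Iyengar, Patrick Vandewalle
Affiliations: EAVISE-PSI, ESAT, KU Leuven, Belgium; ESAT, KU Leuven, Belgium
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IEEE Transactions on Robotics (Special Section on Event-based Vision for Robotics), under review. This version is submitted for peer review and may be updated upon acceptance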
Abstract:Event cameras offer unique advantages such as high temporal resolution, low latency, and high dynamic range, making them more and more popular for vision tasks under challenging light conditions. However, their high cost, limited resolution, and lack of features such as autofocus hinder their broad adoption, particularly for early-stage development and prototyping. In this work, we present Raw2Event, a complete hardware-software system that enables real-time event generation from low-cost raw frame-based cameras. By leveraging direct access to raw Bayer data and bypassing traditional image signal processors (ISP), our system is able to utilize the full potential of camera hardware, delivering higher dynamic range, higher resolution, and more faithful output than RGB-based frame-to-event converters. Built upon the DVS-Voltmeter model, Raw2Event features a configurable simulation framework optimized for deployment on embedded platforms. We further design a data acquisition pipeline that supports synchronized recording of raw, RGB, and event streams, facilitating downstream evaluation and dataset creation. Experimental results show that Raw2Event can generate event streams closely resembling those from real event cameras, while benefiting from higher resolution and autofocus capabilities. The system also supports user-intuitive parameter tuning, enabling flexible adaptation to various application requirements. Finally, we deploy the system on a Raspberry Pi for real-time operation, providing a scalable and cost-effective solution for event-based vision research and early-stage system development. The codes are available online: this https URL.
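The general frame-to-event idea in the abstract above can be illustrated with a much simpler model than the DVS-Voltmeter model the paper builds on. The sketch below is an assumption for illustration only, not the Raw2Event implementation: each pixel keeps a reference log intensity and emits one event every time the log intensity moves by a fixed contrast threshold. The function name `frames_to_events` and the parameters `threshold` and `eps` are hypothetical.

```python
import math

def frames_to_events(frames, threshold=0.2, eps=1e-3):
    """Convert a list of 2D intensity frames into (frame_idx, row, col, polarity) events.

    Simplified contrast-threshold model (an assumption, not DVS-Voltmeter):
    a pixel fires when its log intensity departs from its stored reference
    by at least `threshold`; the reference then moves by the amount consumed.
    """
    # Reference log intensity per pixel, initialised from the first frame.
    ref = [[math.log(v + eps) for v in row] for row in frames[0]]
    events = []
    for t, frame in enumerate(frames[1:], start=1):
        for r, row in enumerate(frame):
            for c, v in enumerate(row):
                diff = math.log(v + eps) - ref[r][c]
                n = int(abs(diff) // threshold)  # number of threshold crossings
                if n:
                    polarity = 1 if diff > 0 else -1
                    events.extend((t, r, c, polarity) for _ in range(n))
                    ref[r][c] += polarity * n * threshold
    return events

# Tiny hypothetical sequence: one row, two pixels, three frames.
frames = [[[1.0, 1.0]], [[2.0, 1.0]], [[2.0, 0.5]]]
events = frames_to_events(frames, threshold=0.5)
```

In this toy sequence the first pixel brightens once and the second darkens once, so one positive and one negative event are produced. A real converter would also model noise and timing within the frame interval, which this sketch ignores.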
[CV-25] Pothole Detection and Recognition based on Transfer Learning
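[Quick read]: This paper addresses automatic recognition of potholes in road images, with the aim of making road maintenance more efficient and more automated. The key to the solution is a transfer-learning feature extraction network that combines ResNet50, EfficientNet, and RegNet. The raw dataset is preprocessed with standardization, normalization, and data augmentation, and the model is compared against other models on Accuracy, Recall, Precision, F1-score, and FPS. It reaches a classification accuracy of 97.78% (88/90) on the initial 90 test samples and 98.89% (890/900) on the expanded test set.
Link: https://arxiv.org/abs/2509.06750
Authors: Mang Hu, Qianqian Xia
Affiliation: China University of Geosciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: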
Abstract:With the rapid development of computer vision and machine learning, automated methods for pothole detection and recognition based on image and video data have received significant attention. It is of great significance for social development to conduct an in-depth analysis of road images through feature extraction, thereby achieving automatic identification of the pothole condition in new images. Consequently, this is the main issue addressed in this study. Based on preprocessing techniques such as standardization, normalization, and data augmentation applied to the collected raw dataset, we continuously improved the network model based on experimental results. Ultimately, we constructed a deep learning feature extraction network ResNet50-EfficientNet-RegNet model based on transfer learning. This model exhibits high classification accuracy and computational efficiency. In terms of model evaluation, this study employed a comparative evaluation approach by comparing the performance of the proposed transfer learning model with other models, including Random Forest, MLP, SVM, and LightGBM. The comparison analysis was conducted based on metrics such as Accuracy, Recall, Precision, F1-score, and FPS, to assess the classification performance of the transfer learning model proposed in this paper. The results demonstrate that our model exhibits high performance in terms of recognition speed and accuracy, surpassing the performance of other models. Through careful parameter selection and model optimization, our transfer learning model achieved a classification accuracy of 97.78% (88/90) on the initial set of 90 test samples and 98.89% (890/900) on the expanded test set.
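The reported accuracies follow directly from the counts in the abstract, and the other metrics it names can be computed from a confusion matrix. The sketch below is illustrative: the split into true/false positives and negatives is hypothetical (the abstract gives only the totals 88/90 and 890/900), and `classification_metrics` is a name introduced here.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for a binary classifier."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical confusion matrix with 88 of 90 samples correct.
m = classification_metrics(tp=45, fp=1, fn=1, tn=43)

# Accuracies as percentages from the counts stated in the abstract.
reported = [round(100 * 88 / 90, 2), round(100 * 890 / 900, 2)]
```

Both percentages match the figures quoted in the abstract, which confirms that they are plain correct-over-total accuracies.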
[CV-26] Event Spectroscopy: Event-based Multispectral and Depth Sensing using Structured Light
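[Quick read]: This paper addresses the perception challenges that uncrewed aerial vehicles (UAVs) face during environmental monitoring and search-and-rescue missions in forests. Conventional passive multispectral and RGB imaging depends strongly on ambient light and suffers from poor depth resolution and high latency under the canopy, which compromises both safe navigation and data accuracy. The key to the solution is a novel event spectroscopy system that performs high-resolution, low-latency depth reconstruction and multispectral imaging simultaneously with a single sensor. Depth is estimated with structured light, and modulating the wavelength of the projected light captures spectral information in controlled bands between 650 nm and 850 nm. Experiments show up to a 60% improvement in RMSE over commercial depth sensors and spectral accuracy comparable to a reference spectrometer and commercial multispectral cameras. Adding depth information improves material differentiation accuracy by over 30% compared with a color-only method, which points toward lightweight, integrated, and robust UAV perception in complex natural environments.
Link: https://arxiv.org/abs/2509.06741
Authors: Christian Geckeler, Niklas Neugebauer, Manasi Muglikar, Davide Scaramuzza, Stefano Mintchev
Affiliations: Environmental Robotics Laboratory, Dep. of Environmental Systems Science, ETH Zurich; Swiss Federal Institute for Forest, Snow and Landscape Research (WSL); Robotics and Perception Group, University of Zurich
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: This work has been submitted to the IEEE for possible publication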
Abstract:Uncrewed aerial vehicles (UAVs) are increasingly deployed in forest environments for tasks such as environmental monitoring and search and rescue, which require safe navigation through dense foliage and precise data collection. Traditional sensing approaches, including passive multispectral and RGB imaging, suffer from latency, poor depth resolution, and strong dependence on ambient light - especially under forest canopies. In this work, we present a novel event spectroscopy system that simultaneously enables high-resolution, low-latency depth reconstruction and multispectral imaging using a single sensor. Depth is reconstructed using structured light, and by modulating the wavelength of the projected structured light, our system captures spectral information in controlled bands between 650 nm and 850 nm. We demonstrate up to 60% improvement in RMSE over commercial depth sensors and validate the spectral accuracy against a reference spectrometer and commercial multispectral cameras, demonstrating comparable performance. A portable version limited to RGB (3 wavelengths) is used to collect real-world depth and spectral data from a Masoala Rainforest. We demonstrate the use of this prototype for color image reconstruction and material differentiation between leaves and branches using spectral and depth data. Our results show that adding depth (available at no extra effort with our setup) to material differentiation improves the accuracy by over 30% compared to color-only method. Our system, tested in both lab and real-world rainforest environments, shows strong performance in depth estimation, RGB reconstruction, and material differentiation - paving the way for lightweight, integrated, and robust UAV perception and data collection in complex natural environments.
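As background for the structured-light depth and RMSE figures above, the sketch below shows the textbook triangulation relation for a rectified projector-camera pair, depth = focal length × baseline / disparity, together with a plain RMSE. It assumes an idealised rectified geometry; the abstract does not state the calibration model the authors use, and the numbers are made up for illustration.

```python
import math

def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth in metres for a rectified projector-camera (or stereo) pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

def rmse(predicted, reference):
    """Root-mean-square error between two equal-length depth lists."""
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical numbers: 800 px focal length, 10 cm baseline, 40 px disparity.
z = depth_from_disparity(focal_px=800.0, baseline_m=0.1, disparity_px=40.0)
err = rmse([2.0, 4.0], [1.0, 5.0])
```

Because depth is inversely proportional to disparity, a fixed disparity error costs more depth accuracy at long range, which is one reason RMSE is a natural metric for comparing depth sensors.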
[CV-27] Co-Seg: Mutual Prompt-Guided Collaborative Learning for Tissue and Nuclei Segmentation MICCAI2025
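[Quick read]: This paper addresses the fact that histopathology image analysis treats tissue region segmentation and nuclei instance segmentation as isolated tasks, ignoring the inherent relationship between them and leaving the tumor microenvironment and cellular morphology insufficiently understood. The key to the solution is a collaborative segmentation framework, Co-Seg. A region-aware prompt encoder (RP-Encoder) supplies high-quality semantic and instance region prompts as prior constraints, and a mutual prompt mask decoder (MP-Decoder) uses cross-guidance to strengthen contextual consistency between the two tasks. The semantic and instance masks are computed collaboratively, so tissue and nuclei segmentation enhance each other.
Link: https://arxiv.org/abs/2509.06740
Authors: Qing Xu, Wenting Duan, Zhen Chen
Affiliation: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to MICCAI 2025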
Abstract:Histopathology image analysis is critical yet challenged by the demand of segmenting tissue regions and nuclei instances for tumor microenvironment and cellular morphology analysis. Existing studies focused on tissue semantic segmentation or nuclei instance segmentation separately, but ignored the inherent relationship between these two tasks, resulting in insufficient histopathology understanding. To address this issue, we propose a Co-Seg framework for collaborative tissue and nuclei segmentation. Specifically, we introduce a novel co-segmentation paradigm, allowing tissue and nuclei segmentation tasks to mutually enhance each other. To this end, we first devise a region-aware prompt encoder (RP-Encoder) to provide high-quality semantic and instance region prompts as prior constraints. Moreover, we design a mutual prompt mask decoder (MP-Decoder) that leverages cross-guidance to strengthen the contextual consistency of both tasks, collaboratively computing semantic and instance segmentation masks. Extensive experiments on the PUMA dataset demonstrate that the proposed Co-Seg surpasses state-of-the-arts in the semantic, instance and panoptic segmentation of tumor tissues and nuclei instances. The source code is available at this https URL.
[CV-28] Zero-shot 3D-Aware Trajectory-Guided image-to-video generation via Test-Time Training
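[Quick read]: This paper addresses two core problems in trajectory-guided image-to-video (I2V) generation. First, existing methods rely on expensive fine-tuning and are limited by scarce annotated data. Second, zero-shot methods that control trajectories in latent space neglect 3D perspective, which yields unrealistic motion and misaligns the manipulated latents with the network's noise predictions. The key to the solution is Zo3T, a novel zero-shot test-time-training framework built on three mechanisms. 3D-Aware Kinematic Projection infers scene depth to derive perspective-correct affine transformations for target regions. Trajectory-Guided Test-Time LoRA dynamically injects ephemeral LoRA adapters into the denoising network and optimizes them alongside the latent state, driven by a regional feature consistency loss, so that motion constraints are enforced while the model adapts its internal representations locally, preserving generative fidelity and on-manifold adherence. Guidance Field Rectification optimizes the conditional guidance field with a one-step lookahead strategy so that generation progresses efficiently toward the target trajectory. Zo3T improves 3D realism and motion accuracy over existing training-based and zero-shot approaches.
Link: https://arxiv.org/abs/2509.06723
Authors: Ruicheng Zhang, Jun Zhou, Zunnan Xu, Zihao Liu, Jiehui Huang, Mingyang Zhang, Yu Sun, Xiu Li
Affiliation: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: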
Abstract:Trajectory-Guided image-to-video (I2V) generation aims to synthesize videos that adhere to user-specified motion instructions. Existing methods typically rely on computationally expensive fine-tuning on scarce annotated datasets. Although some zero-shot methods attempt to trajectory control in the latent space, they may yield unrealistic motion by neglecting 3D perspective and creating a misalignment between the manipulated latents and the network’s noise predictions. To address these challenges, we introduce Zo3T, a novel zero-shot test-time-training framework for trajectory-guided generation with three core innovations: First, we incorporate a 3D-Aware Kinematic Projection, leveraging inferring scene depth to derive perspective-correct affine transformations for target regions. Second, we introduce Trajectory-Guided Test-Time LoRA, a mechanism that dynamically injects and optimizes ephemeral LoRA adapters into the denoising network alongside the latent state. Driven by a regional feature consistency loss, this co-adaptation effectively enforces motion constraints while allowing the pre-trained model to locally adapt its internal representations to the manipulated latent, thereby ensuring generative fidelity and on-manifold adherence. Finally, we develop Guidance Field Rectification, which refines the denoising evolutionary path by optimizing the conditional guidance field through a one-step lookahead strategy, ensuring efficient generative progression towards the target trajectory. Zo3T significantly enhances 3D realism and motion accuracy in trajectory-controlled I2V generation, demonstrating superior performance over existing training-based and zero-shot approaches.
[CV-29] MRI-Based Brain Tumor Detection through an Explainable EfficientNetV2 and MLP-Mixer-Attention Architecture
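[Quick read]: This paper addresses the fact that early diagnosis of brain tumors from MRI depends on expert experience and is error-prone and inefficient, and proposes an accurate, explainable Deep Learning (DL) model for brain tumor classification. Nine well-known convolutional neural network (CNN) architectures are compared first, and the best performer, EfficientNetV2, is chosen as the backbone. An attention-based MLP-Mixer module is then integrated to strengthen classification capability. Finally, Grad-CAM visualization is used to make the model's decisions interpretable. On a public dataset the model reaches 99.50% accuracy, which the authors report as better than prior studies, making it a candidate for clinical decision support systems that need both accuracy and interpretability.
Link: https://arxiv.org/abs/2509.06713
Authors: Mustafa Yurdakul, Şakir Taşdemir
Affiliation: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: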
Abstract:Brain tumors are serious health problems that require early diagnosis due to their high mortality rates. Diagnosing tumors by examining Magnetic Resonance Imaging (MRI) images is a process that requires expertise and is prone to error. Therefore, the need for automated diagnosis systems is increasing day by day. In this context, a robust and explainable Deep Learning (DL) model for the classification of brain tumors is proposed. In this study, a publicly available Figshare dataset containing 3,064 T1-weighted contrast-enhanced brain MRI images of three tumor types was used. First, the classification performance of nine well-known CNN architectures was evaluated to determine the most effective backbone. Among these, EfficientNetV2 demonstrated the best performance and was selected as the backbone for further development. Subsequently, an attention-based MLP-Mixer architecture was integrated into EfficientNetV2 to enhance its classification capability. The performance of the final model was comprehensively compared with basic CNNs and the methods in the literature. Additionally, Grad-CAM visualization was used to interpret and validate the decision-making process of the proposed model. The proposed model’s performance was evaluated using the five-fold cross-validation method. The proposed model demonstrated superior performance with 99.50% accuracy, 99.47% precision, 99.52% recall and 99.49% F1 score. The results obtained show that the model outperforms the studies in the literature. Moreover, Grad-CAM visualizations demonstrate that the model effectively focuses on relevant regions of MRI images, thus improving interpretability and clinical reliability. A robust deep learning model for clinical decision support systems has been obtained by combining EfficientNetV2 and attention-based MLP-Mixer, providing high accuracy and interpretability in brain tumor classification.
[CV-30] Cortex-Synth: Differentiable Topology-Aware 3D Skeleton Synthesis with Hierarchical Graph Attention
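[Quick read]: This paper addresses joint synthesis of 3D skeleton geometry and topology from a single 2D image. Conventional methods tend to treat geometric reconstruction and topological consistency separately, which leads to large structural errors. The key to the solution is Cortex Synth, an end-to-end differentiable framework with three core innovations: (1) a hierarchical graph attention mechanism with multi-scale skeletal refinement, which refines geometric detail level by level; (2) differentiable spectral topology optimization via Laplacian eigendecomposition, which allows the connectivity structure to be learned directly; and (3) adversarial geometric consistency training, which keeps pose structure aligned. On ShapeNet the method improves MPJPE by 18.7% and Graph Edit Distance by 27.3%, and reduces topological errors by 42% compared with previous approaches.
Link: https://arxiv.org/abs/2509.06705
Authors: Mohamed Zayaan S
Affiliation: Indian Institute of Technology, Madras
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures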
Abstract:We present Cortex Synth, a novel end-to-end differentiable framework for joint 3D skeleton geometry and topology synthesis from single 2D images. Our architecture introduces three key innovations: (1) A hierarchical graph attention mechanism with multi-scale skeletal refinement, (2) Differentiable spectral topology optimization via Laplacian eigen decomposition, and (3) Adversarial geometric consistency training for pose structure alignment. The framework integrates four synergistic modules: a pseudo 3D point cloud generator, an enhanced PointNet encoder, a skeleton coordinate decoder, and a novel Differentiable Graph Construction Network (DGCN). Our experiments demonstrate state-of-the-art results with 18.7 percent improvement in MPJPE and 27.3 percent in Graph Edit Distance on ShapeNet, while reducing topological errors by 42 percent compared to previous approaches. The model’s end-to-end differentiability enables applications in robotic manipulation, medical imaging, and automated character rigging.
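The spectral view of topology mentioned in the abstract rests on the graph Laplacian L = D - A, whose eigenvalues and eigenvectors encode connectivity. The sketch below is a generic illustration on a three-joint chain, not the paper's Differentiable Graph Construction Network: it builds L in plain Python and verifies the known eigenpairs of a path graph by direct multiplication. The helper names are introduced here for illustration.

```python
def graph_laplacian(num_nodes, edges):
    """Return L = D - A for an undirected, unweighted graph as nested lists."""
    L = [[0] * num_nodes for _ in range(num_nodes)]
    for i, j in edges:
        L[i][j] -= 1  # off-diagonal entries hold the negated adjacency
        L[j][i] -= 1
        L[i][i] += 1  # diagonal entries hold node degrees
        L[j][j] += 1
    return L

def matvec(M, v):
    """Multiply a nested-list matrix by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# A three-joint chain 0 - 1 - 2, the simplest possible "skeleton".
L = graph_laplacian(3, [(0, 1), (1, 2)])

# Known eigenpairs of the 3-node path graph: eigenvalues 0, 1 and 3.
eigenpairs = [(0, [1, 1, 1]), (1, [1, 0, -1]), (3, [1, -2, 1])]
checks = [matvec(L, vec) == [lam * x for x in vec] for lam, vec in eigenpairs]
```

The zero eigenvalue with a constant eigenvector appears once per connected component, so the spectrum distinguishes a connected skeleton from a fragmented one. That property is what makes spectral quantities a plausible target for topology-aware losses.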
[CV-31] STAGE: Segmentation-oriented Industrial Anomaly Synthesis via Graded Diffusion with Explicit Mask Alignment
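[Quick read]: This paper addresses two key limitations of existing Segmentation-oriented Industrial Anomaly Synthesis (SIAS) methods: synthesized anomalies lack intricate texture detail and align poorly with the surrounding background, and fine-grained, pixel-level anomalies are hard to generate. The core of the solution is a new framework, STAGE (Segmentation-oriented Anomaly synthesis via Graded diffusion with Explicit mask alignment), with three contributions: (1) an anomaly inference strategy that uses clean background information as a prior to guide the denoising distribution, so the model distinguishes and highlights abnormal foregrounds more effectively; (2) a graded diffusion framework with an anomaly-only branch that explicitly records local anomalies during both the forward and reverse processes, so subtle anomalies are not overlooked; and (3) an Explicit Mask Alignment (EMA) strategy that progressively aligns synthesized anomalies with the background, producing context-consistent and structurally coherent results.
Link: https://arxiv.org/abs/2509.06693
Authors: Xichen Xu, Yanshu Wang, Jinbao Wang, Qunyi Zhang, Xiaoning Lei, Guoyang Xie, Guannan Jiang, Zhichao Lu
Affiliations: Shanghai Jiao Tong University; Shenzhen University; CATL; City University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: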
Abstract:Segmentation-oriented Industrial Anomaly Synthesis (SIAS) plays a pivotal role in enhancing the performance of downstream anomaly segmentation, as it provides an effective means of expanding abnormal data. However, existing SIAS methods face several critical limitations: (i) the synthesized anomalies often lack intricate texture details and fail to align precisely with the surrounding background, and (ii) they struggle to generate fine-grained, pixel-level anomalies. To address these challenges, we propose Segmentation-oriented Anomaly synthesis via Graded diffusion with Explicit mask alignment, termed STAGE. STAGE introduces a novel anomaly inference strategy that incorporates clean background information as a prior to guide the denoising distribution, enabling the model to more effectively distinguish and highlight abnormal foregrounds. Furthermore, it employs a graded diffusion framework with an anomaly-only branch to explicitly record local anomalies during both the forward and reverse processes, ensuring that subtle anomalies are not overlooked. Finally, STAGE incorporates the explicit mask alignment (EMA) strategy to progressively align the synthesized anomalies with the background, resulting in context-consistent and structurally coherent generations. Extensive experiments on the MVTec and BTAD datasets demonstrate that STAGE achieves state-of-the-art performance in SIAS, which in turn enhances downstream anomaly segmentation.
[CV-32] BioLite U-Net: Edge-Deployable Semantic Segmentation for In Situ Bioprinting Monitoring ICRA2026
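[Quick read]: This paper addresses real-time monitoring of the fidelity and consistency of printed structures during bioprinting, specifically how to achieve accurate semantic segmentation with limited imaging data and resource-constrained embedded hardware so that print quality and biological viability are maintained. The key to the solution is BioLite U-Net, a lightweight semantic segmentation architecture that uses depthwise separable convolutions to cut computational load sharply while keeping segmentation accuracy high. Evaluated on a Raspberry Pi 4B, the model takes 335 ms per frame, which is near real time, and achieves an mIoU of 92.85% and a Dice score of 96.17%. The authors report a better tradeoff between accuracy, efficiency, and deployability than MobileNetV2- and MobileNetV3-based baselines, which suits intelligent closed-loop bioprinting systems.
Link: https://arxiv.org/abs/2509.06690
Authors: Usman Haider, Lukasz Szemet, Daniel Kelly, Vasileios Sergis, Andrew C. Daly, Karl Mason
Affiliations: University of Galway; CÚRAM – SFI Research Centre for Medical Devices
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: 8 pages, 5 figures, conference-style submission (ICRA 2026). Includes dataset description, BioLite U-Net architecture, benchmark results on edge device (Raspberry Pi 4B)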
Abstract:Bioprinting is a rapidly advancing field that offers a transformative approach to fabricating tissue and organ models through the precise deposition of cell-laden bioinks. Ensuring the fidelity and consistency of printed structures in real-time remains a core challenge, particularly under constraints imposed by limited imaging data and resource-constrained embedded hardware. Semantic segmentation of the extrusion process, differentiating between nozzle, extruded bioink, and surrounding background, enables in situ monitoring critical to maintaining print quality and biological viability. In this work, we introduce a lightweight semantic segmentation framework tailored for real-time bioprinting applications. We present a novel, manually annotated dataset comprising 787 RGB images captured during the bioprinting process, labeled across three classes: nozzle, bioink, and background. To achieve fast and efficient inference suitable for integration with bioprinting systems, we propose a BioLite U-Net architecture that leverages depthwise separable convolutions to drastically reduce computational load without compromising accuracy. Our model is benchmarked against MobileNetV2 and MobileNetV3-based segmentation baselines using mean Intersection over Union (mIoU), Dice score, and pixel accuracy. All models were evaluated on a Raspberry Pi 4B to assess real-world feasibility. The proposed BioLite U-Net achieves an mIoU of 92.85% and a Dice score of 96.17%, while being over 1300x smaller than MobileNetV2-DeepLabV3+. On-device inference takes 335 ms per frame, demonstrating near real-time capability. Compared to MobileNet baselines, BioLite U-Net offers a superior tradeoff between segmentation accuracy, efficiency, and deployability, making it highly suitable for intelligent, closed-loop bioprinting systems.
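To make the saving from depthwise separable convolutions concrete, the sketch below counts weights for a single layer in both forms (biases ignored). The layer shape is a hypothetical example, not taken from the BioLite U-Net architecture, and the function names are introduced here.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out

def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = kernel * kernel * c_in  # one spatial filter per input channel
    pointwise = c_in * c_out            # 1x1 conv that mixes channels
    return depthwise + pointwise

# Hypothetical layer: 3x3 kernel, 64 input channels, 128 output channels.
std = standard_conv_params(3, 64, 128)
sep = depthwise_separable_params(3, 64, 128)
ratio = std / sep
```

For this layer the separable form needs roughly 8.4 times fewer weights. The much larger whole-model size difference reported in the abstract therefore presumably also reflects a smaller overall architecture, not this substitution alone.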
[CV-33] VIM-GS: Visual-Inertial Monocular Gaussian Splatting via Object-level Guidance in Large Scenes
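[Quick read]: This paper addresses high-quality novel-view synthesis (NVS) in large scenes from monocular RGB images alone. Gaussian Splatting (GS) typically needs accurate depth to initialize its Gaussian ellipsoids, and that depth usually comes from RGB-D or stereo cameras whose sensing range is too limited for large scenes. Monocular images are easy to acquire but provide no explicit depth to guide learning, which degrades NVS quality. Large foundation models (LFMs) for monocular depth estimation exist, but they suffer from cross-frame inconsistency, inaccuracy for distant scenes, and ambiguity caused by deceptive texture cues. The core idea of the proposed VIM-GS framework is to use the accurate but sparse depth from visual-inertial Structure-from-Motion (SfM) to refine the dense but coarse depth from LFMs. Its two key components are an object-segmented depth propagation algorithm, which renders depth for the pixels of structured objects, and a dynamic depth refinement module, which handles the degraded SfM depth of dynamic objects and refines the coarse LFM depth. Together they improve depth accuracy and consistency and enable high-quality GS rendering in large scenes.
Link: https://arxiv.org/abs/2509.06685
Authors: Shengkai Zhang, Yuhe Liu, Guanjun Wu, Jianhua He, Xinggang Wang, Mozi Chen, Kezhong Liu
Affiliations: Wuhan University of Technology; Huazhong University of Science and Technology; University of Essex
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: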
Abstract:VIM-GS is a Gaussian Splatting (GS) framework using monocular images for novel-view synthesis (NVS) in large scenes. GS typically requires accurate depth to initiate Gaussian ellipsoids using RGB-D/stereo cameras. Their limited depth sensing range makes it difficult for GS to work in large scenes. Monocular images, however, lack depth to guide the learning and lead to inferior NVS results. Although large foundation models (LFMs) for monocular depth estimation are available, they suffer from cross-frame inconsistency, inaccuracy for distant scenes, and ambiguity in deceptive texture cues. This paper aims to generate dense, accurate depth images from monocular RGB inputs for high-definite GS rendering. The key idea is to leverage the accurate but sparse depth from visual-inertial Structure-from-Motion (SfM) to refine the dense but coarse depth from LFMs. To bridge the sparse input and dense output, we propose an object-segmented depth propagation algorithm that renders the depth of pixels of structured objects. Then we develop a dynamic depth refinement module to handle the crippled SfM depth of dynamic objects and refine the coarse LFM depth. Experiments using public and customized datasets demonstrate the superior rendering quality of VIM-GS in large scenes.
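One common, generic way to use sparse accurate depth to correct dense coarse depth is a least-squares scale-and-shift fit at the pixels where both are available. The sketch below shows only that baseline idea under stated assumptions. It is not the object-segmented propagation or dynamic refinement described in the abstract, and all names and numbers are hypothetical.

```python
def fit_scale_shift(coarse, sparse):
    """Least-squares fit of sparse ~ scale * coarse + shift over paired samples."""
    n = len(coarse)
    mean_c = sum(coarse) / n
    mean_s = sum(sparse) / n
    var_c = sum((c - mean_c) ** 2 for c in coarse)
    if var_c == 0:
        raise ValueError("coarse depths must not all be equal")
    cov = sum((c - mean_c) * (s - mean_s) for c, s in zip(coarse, sparse))
    scale = cov / var_c
    shift = mean_s - scale * mean_c
    return scale, shift

def apply_scale_shift(dense_coarse, scale, shift):
    """Apply the fitted correction to every coarse depth value."""
    return [scale * d + shift for d in dense_coarse]

# Hypothetical paired samples: coarse model depth and SfM depth at the same pixels.
coarse_at_sfm_points = [1.0, 2.0, 3.0, 4.0]
sfm_depth = [2.5, 4.5, 6.5, 8.5]
scale, shift = fit_scale_shift(coarse_at_sfm_points, sfm_depth)
refined = apply_scale_shift([1.5, 2.5], scale, shift)
```

A single global fit like this cannot repair per-object errors or moving objects, which is the kind of gap the per-object propagation and dynamic refinement modules in the abstract are presumably meant to address.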
[CV-34] Online Clustering of Seafloor Imagery for Interpretation during Long-Term AUV Operations
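[Quick read]: This paper addresses real-time interpretation of seafloor imagery on long-endurance and seafloor-resident autonomous underwater vehicles (AUVs), in support of adaptive mission planning and efficient communication. Offline image analysis relies on complete datasets and human-labelled examples to cope with the way environmental and operational conditions change image appearance, and those requirements cannot be met in real time. The authors propose an online clustering framework (OCF). Its key idea is to maintain a set of representative samples that capture the evolving feature distribution, which allows common patterns across the whole data history to be reviewed and consolidated in constant time, with dynamic cluster merging and splitting and no reprocessing of the full image history. The OCF achieves the highest average F1 score (0.68) among the compared online clustering approaches, keeps computational time low and bounded as data volume grows, and is robust to variation in survey trajectory.
Link: https://arxiv.org/abs/2509.06678
Authors: Cailei Liang, Adrian Bodenmann, Sam Fenton, Blair Thornton
Affiliations: University of Southampton; IIS, The University of Tokyo, Japan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: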
Abstract:As long-endurance and seafloor-resident AUVs become more capable, there is an increasing need for extended, real-time interpretation of seafloor imagery to enable adaptive missions and optimise communication efficiency. Although offline image analysis methods are well established, they rely on access to complete datasets and human-labelled examples to manage the strong influence of environmental and operational conditions on seafloor image appearance-requirements that cannot be met in real-time settings. To address this, we introduce an online clustering framework (OCF) capable of interpreting seafloor imagery without supervision, which is designed to operate in real-time on continuous data streams in a scalable, adaptive, and self-consistent manner. The method enables the efficient review and consolidation of common patterns across the entire data history in constant time by identifying and maintaining a set of representative samples that capture the evolving feature distribution, supporting dynamic cluster merging and splitting without reprocessing the full image history. We evaluate the framework on three diverse seafloor image datasets, analysing the impact of different representative sampling strategies on both clustering accuracy and computational cost. The OCF achieves the highest average F1 score of 0.68 across the three datasets among all comparative online clustering approaches, with a standard deviation of 3% across three distinct survey trajectories, demonstrating its superior clustering capability and robustness to trajectory variation. In addition, it maintains consistently lower and bounded computational time as the data volume increases. These properties are beneficial for generating survey data summaries and supporting informative path planning in long-term, persistent autonomous marine exploration.
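The role of a bounded set of representative samples can be illustrated with a toy online clusterer. This is a sketch under loud assumptions, not the OCF algorithm: the nearest-centroid rule, the `radius` threshold, and the oldest-first eviction are all hypothetical choices, and merging and splitting are omitted.

```python
import math

class OnlineClusterer:
    """Toy streaming clusterer that keeps at most `max_reps` samples per cluster."""

    def __init__(self, radius, max_reps=5):
        self.radius = radius      # assumed distance threshold for joining a cluster
        self.max_reps = max_reps  # cap on stored representatives per cluster
        self.clusters = []        # each cluster is a list of representative vectors

    @staticmethod
    def _centroid(reps):
        return [sum(col) / len(reps) for col in zip(*reps)]

    def add(self, x):
        """Assign feature vector x to a cluster and return the cluster index."""
        best, best_dist = None, float("inf")
        for i, reps in enumerate(self.clusters):
            dist = math.dist(x, self._centroid(reps))
            if dist < best_dist:
                best, best_dist = i, dist
        if best is None or best_dist > self.radius:
            self.clusters.append([list(x)])  # start a new cluster
            return len(self.clusters) - 1
        reps = self.clusters[best]
        reps.append(list(x))
        if len(reps) > self.max_reps:
            reps.pop(0)  # evict the oldest representative
        return best

# Hypothetical 2D features arriving one at a time.
clusterer = OnlineClusterer(radius=1.0, max_reps=2)
stream = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (0, 0.2)]
labels = [clusterer.add(p) for p in stream]
```

Because each cluster stores a capped number of representatives, the work per incoming sample depends on the number of clusters and the cap, not on how many images have been seen so far. That is the property behind the bounded computation time the abstract describes.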
[CV-35] Investigating Location-Regularised Self-Supervised Feature Learning for Seafloor Visual Imagery
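[Quick read]: This paper addresses the high-throughput interpretation of robotically gathered seafloor imagery for marine monitoring and exploration, and asks how to improve the representations learned by self-supervised learning (SSL) on such imagery. The key to the solution is to use location metadata as a regularisation signal within SSL frameworks, and to evaluate its effect across architectures (convolutional neural networks and Vision Transformers), latent-space dimensionalities (high, 512, versus low, 128), and several seafloor image datasets. Location regularisation consistently improves the F1-score of downstream classification over standard SSL, especially with low-dimensional latent representations. The study also finds that pre-trained high-dimensional ViTs generalise strongly, matching the best-performing location-regularised SSL.
Link: https://arxiv.org/abs/2509.06660
Authors: Cailei Liang, Adrian Bodenmann, Emma J Curtis, Samuel Simmons, Kazunori Nagano, Stan Brown, Adam Riese, Blair Thornton
Affiliations: University of Southampton; IIS, The University of Tokyo; Voyis Imaging Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: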
Abstract:High-throughput interpretation of robotically gathered seafloor visual imagery can increase the efficiency of marine monitoring and exploration. Although recent research has suggested that location metadata can enhance self-supervised feature learning (SSL), its benefits across different SSL strategies, models and seafloor image datasets are underexplored. This study evaluates the impact of location-based regularisation on six state-of-the-art SSL frameworks, which include Convolutional Neural Network (CNN) and Vision Transformer (ViT) models with varying latent-space dimensionality. Evaluation across three diverse seafloor image datasets finds that location-regularisation consistently improves downstream classification performance over standard SSL, with average F1-score gains of 4.9 ± 4.0% for CNNs and 6.3 ± 8.9% for ViTs, respectively. While CNNs pretrained on generic datasets benefit from high-dimensional latent representations, dataset-optimised SSL achieves similar performance across the high (512) and low (128) dimensional latent representations. Location-regularised SSL improves CNN performance over pre-trained models by 2.7 ± 2.7% and 10.1 ± 9.4% for high and low-dimensional latent representations, respectively. For ViTs, high-dimensionality benefits both pre-trained and dataset-optimised SSL. Although location-regularisation improves SSL performance compared to standard SSL methods, pre-trained ViTs show strong generalisation, matching the best-performing location-regularised SSL with F1-scores of 0.795 ± 0.075 and 0.795 ± 0.077, respectively. The findings highlight the value of location metadata for SSL regularisation, particularly when using low-dimensional latent representations, and demonstrate strong generalisation of high-dimensional ViTs for seafloor image analysis.
[CV-36] Improved Classification of Nitrogen Stress Severity in Plants Under Combined Stress Conditions Using Spatio-Temporal Deep Learning Framework
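[Quick read]: This paper addresses early identification of nitrogen deficiency in natural environments where several stresses, such as drought and weed competition, act together, with the goal of accurate plant health monitoring and management. The key to the solution is a classification method built on a spatio-temporal deep learning framework. It fuses four imaging modalities (RGB, multispectral, and two infrared wavelengths) and combines a Convolutional Neural Network (CNN), which extracts spatial features, with a Long Short-Term Memory (LSTM) network, which captures temporal dependencies. The resulting CNN-LSTM model classifies nitrogen stress severity under combined stress with 98% accuracy, compared with 80.45% for a spatial-only CNN and 76% for a previously reported machine learning method.
Link: https://arxiv.org/abs/2509.06625
Authors: Aswini Kumar Patra
Affiliations: NERIST; IIT Guwahati
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 8 figures, 7 Tables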
Abstract:Plants in their natural habitats endure an array of interacting stresses, both biotic and abiotic, that rarely occur in isolation. Nutrient stress-particularly nitrogen deficiency-becomes even more critical when compounded with drought and weed competition, making it increasingly difficult to distinguish and address its effects. Early detection of nitrogen stress is therefore crucial for protecting plant health and implementing effective management strategies. This study proposes a novel deep learning framework to accurately classify nitrogen stress severity in a combined stress environment. Our model uses a unique blend of four imaging modalities-RGB, multispectral, and two infrared wavelengths-to capture a wide range of physiological plant responses from canopy images. These images, provided as time-series data, document plant health across three levels of nitrogen availability (low, medium, and high) under varying water stress and weed pressures. The core of our approach is a spatio-temporal deep learning pipeline that merges a Convolutional Neural Network (CNN) for extracting spatial features from images with a Long Short-Term Memory (LSTM) network to capture temporal dependencies. We also devised and evaluated a spatial-only CNN pipeline for comparison. Our CNN-LSTM pipeline achieved an impressive accuracy of 98%, impressively surpassing the spatial-only model’s 80.45% and other previously reported machine learning method’s 76%. These results bring actionable insights based on the power of our CNN-LSTM approach in effectively capturing the subtle and complex interactions between nitrogen deficiency, water stress, and weed pressure. This robust platform offers a promising tool for the timely and proactive identification of nitrogen stress severity, enabling better crop management and improved plant health.
[CV-37] From Skin to Skeleton: Towards Biomechanically Accurate 3D Digital Humans
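[Quick read]: This paper addresses the biomechanical inaccuracy of the skeletons in existing 3D human pose and shape models such as SMPL: their simplified kinematic structures do not correspond to the true joint locations and articulations of the human skeleton, which limits their use in biomechanics research. Meanwhile, conventional methods for estimating biomechanically accurate skeletal motion rely on complex motion capture systems and expensive optimization, which makes them hard to deploy widely. The key to the solution is the SKEL model: SMPL is re-rigged with a biomechanically grounded skeleton, and skeletons optimized inside SMPL meshes from AMASS sequences are used to train a regressor that maps SMPL vertices to the optimized joint locations and bone rotations. SMPL is then re-parametrized so that the new model stays animatable while having fewer, more biomechanically realistic degrees of freedom. The method makes it possible to "upgrade" existing datasets and gives vision and graphics researchers a better constrained and more realistic model of human articulation.
Link: https://arxiv.org/abs/2509.06607
Authors: Marilyn Keller,Keenon Werling,Soyong Shin,Scott Delp,Sergi Pujades,C. Karen Liu,Michael J. Black
Institutions: Max Planck Institute for Intelligent Systems; Stanford University; Carnegie Mellon University; Inria centre at the University Grenoble Alpes
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: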
Abstract:Great progress has been made in estimating 3D human pose and shape from images and video by training neural networks to directly regress the parameters of parametric human models like SMPL. However, existing body models have simplified kinematic structures that do not correspond to the true joint locations and articulations in the human skeletal system, limiting their potential use in biomechanics. On the other hand, methods for estimating biomechanically accurate skeletal motion typically rely on complex motion capture systems and expensive optimization methods. What is needed is a parametric 3D human model with a biomechanically accurate skeletal structure that can be easily posed. To that end, we develop SKEL, which re-rigs the SMPL body model with a biomechanics skeleton. To enable this, we need training data of skeletons inside SMPL meshes in diverse poses. We build such a dataset by optimizing biomechanically accurate skeletons inside SMPL meshes from AMASS sequences. We then learn a regressor from SMPL mesh vertices to the optimized joint locations and bone rotations. Finally, we re-parametrize the SMPL mesh with the new kinematic parameters. The resulting SKEL model is animatable like SMPL but with fewer, and biomechanically-realistic, degrees of freedom. We show that SKEL has more biomechanically accurate joint locations than SMPL, and the bones fit inside the body surface better than previous methods. By fitting SKEL to SMPL meshes we are able to “upgrade” existing human pose and shape datasets to include biomechanical parameters. SKEL provides a new tool to enable biomechanics in the wild, while also providing vision and graphics researchers with a better constrained and more realistic model of human articulation. 
The model, code, and data are available for research at this https URL. Journal reference: ACM Trans. Graph. 42, 6, Article 253 (December 2023), 12 pages. Related DOI: https://doi.org/10.1145/3618381
[CV-38] Hybrid Swin Attention Networks for Simultaneously Low-Dose PET and CT Denoising
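[Quick read]: This paper addresses the increase in noise and artifacts that lower radiation doses cause in low-dose computed tomography (LDCT) and positron emission tomography (PET) images, which can reduce diagnostic accuracy. The key to the solution is a new Hybrid Swin Attention Network (HSANet). Its core contributions are Efficient Global Attention (EGA) modules, which strengthen feature interaction along the spatial and channel dimensions and improve the capture of relevant information, and a hybrid upsampling module, which reduces the risk of overfitting to noise. Experiments show that the method clearly outperforms existing denoising techniques while keeping the model lightweight, which makes it a good candidate for clinical deployment.
Link: https://arxiv.org/abs/2509.06591
Authors: Yichao Liu,YueYang Teng
Institutions: University of Bern; Ruijin Hospital
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: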
Abstract:Low-dose computed tomography (LDCT) and positron emission tomography (PET) have emerged as safer alternatives to conventional imaging modalities by significantly reducing radiation exposure. However, this reduction often results in increased noise and artifacts, which can compromise diagnostic accuracy. Consequently, denoising for LDCT/PET has become a vital area of research aimed at enhancing image quality while maintaining radiation safety. In this study, we introduce a novel Hybrid Swin Attention Network (HSANet), which incorporates Efficient Global Attention (EGA) modules and a hybrid upsampling module. The EGA modules enhance both spatial and channel-wise interaction, improving the network’s capacity to capture relevant features, while the hybrid upsampling module mitigates the risk of overfitting to noise. We validate the proposed approach using a publicly available LDCT/PET dataset. Experimental results demonstrate that HSANet achieves superior denoising performance compared to existing methods, while maintaining a lightweight model size suitable for deployment on GPUs with standard memory configurations. This makes our approach highly practical for real-world clinical applications.
[CV-39] Detection of trade in products derived from threatened species using machine learning and a smartphone
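[Quick read]: This paper addresses the growing illegal wildlife trade on digital marketplaces and social media, focusing on automated image-based detection of wildlife products such as ivory, pangolin scales, and tiger bones. The key to the solution is a set of machine learning object recognition models trained on labeled images of wildlife products that were sold illegally or confiscated, with several training strategies and loss functions combined to optimize performance. The authors also built a smartphone application for law enforcement agencies and government authorities that identifies products of the target species in real time with an overall accuracy of 91.3%, improving the efficiency and accessibility of wildlife protection enforcement.
Link: https://arxiv.org/abs/2509.06585
Authors: Ritwik Kulkarni,WU Hanqin,Enrico Di Minin
Institutions: University of Helsinki; University of KwaZulu-Natal
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: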
Abstract:Unsustainable trade in wildlife is a major threat to biodiversity and is now increasingly prevalent in digital marketplaces and social media. With the sheer volume of digital content, the need for automated methods to detect wildlife trade listings is growing. These methods are especially needed for the automatic identification of wildlife products, such as ivory. We developed machine learning-based object recognition models that can identify wildlife products within images and highlight them. The data consists of images of elephant, pangolin, and tiger products that were identified as being sold illegally or that were confiscated by authorities. Specifically, the wildlife products included elephant ivory and skins, pangolin scales, and claws (raw and crafted), and tiger skins and bones. We investigated various combinations of training strategies and two loss functions to identify the best model to use in the automatic detection of these wildlife products. Models were trained for each species while also developing a single model to identify products from all three species. The best model showed an overall accuracy of 84.2% with accuracies of 71.1%, 90.2% and 93.5% in detecting products derived from elephants, pangolins, and tigers, respectively. We further demonstrate that the machine learning model can be made easily available to stakeholders, such as government authorities and law enforcement agencies, by developing a smartphone-based application that had an overall accuracy of 91.3%. The application can be used in real time to click images and help identify potentially prohibited products of target species. Thus, the proposed method is not only applicable for monitoring trade on the web but can also be used e.g. in physical markets for monitoring wildlife trade.
[CV-40] CausNVS: Autoregressive Multi-view Diffusion for Flexible 3D Novel View Synthesis
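[Quick read]: This paper addresses two limitations of multi-view diffusion models in world modeling. First, most existing methods are non-autoregressive, so they support only a fixed number of input and output views and adapt poorly to dynamic scenes. Second, inference is slow because all frames are denoised simultaneously. To overcome these problems the paper proposes CausNVS, whose core idea is an autoregressive setting that supports arbitrary input and output view configurations and generates views sequentially. The key components are training with causal masking and per-frame noise, combined with pairwise-relative camera pose encodings (CaPE) for precise camera control. At inference time, a spatially-aware sliding window, key-value caching, and noise conditioning augmentation mitigate drift during generation.
Link: https://arxiv.org/abs/2509.06579
Authors: Xin Kong,Daniel Watson,Yannick Strümpler,Michael Niemeyer,Federico Tombari
Institutions: Imperial College London; Google DeepMind; Google; Technical University of Munich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: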
Abstract:Multi-view diffusion models have shown promise in 3D novel view synthesis, but most existing methods adopt a non-autoregressive formulation. This limits their applicability in world modeling, as they only support a fixed number of views and suffer from slow inference due to denoising all frames simultaneously. To address these limitations, we propose CausNVS, a multi-view diffusion model in an autoregressive setting, which supports arbitrary input-output view configurations and generates views sequentially. We train CausNVS with causal masking and per-frame noise, using pairwise-relative camera pose encodings (CaPE) for precise camera control. At inference time, we combine a spatially-aware sliding-window with key-value caching and noise conditioning augmentation to mitigate drift. Our experiments demonstrate that CausNVS supports a broad range of camera trajectories, enables flexible autoregressive novel view synthesis, and achieves consistently strong visual quality across diverse settings. Project page: this https URL.
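As a rough illustration of the causal masking mentioned in the abstract, the sketch below builds a frame-level attention mask in which tokens of one view may attend only to tokens of the same or earlier views. The abstract does not describe the mask layout, so the block-causal structure, the function name `frame_causal_mask`, and the fixed number of tokens per frame are all assumptions made for this example.

```python
def frame_causal_mask(num_frames, tokens_per_frame):
    """Build a boolean attention mask over all tokens of all frames.

    mask[q][k] is True when query token q may attend to key token k.
    Assumption: a token may see every token of its own frame and of
    earlier frames, but nothing from later frames.
    """
    total = num_frames * tokens_per_frame
    mask = []
    for q in range(total):
        q_frame = q // tokens_per_frame  # frame index of the query token
        mask.append([(k // tokens_per_frame) <= q_frame for k in range(total)])
    return mask


mask = frame_causal_mask(num_frames=3, tokens_per_frame=2)
for row in mask:
    # "1" marks an allowed attention link, "." a masked one.
    print("".join("1" if allowed else "." for allowed in row))
```

Because later frames never influence earlier ones under such a mask, keys and values of already generated views could in principle be cached and reused, which is consistent with the key-value caching the abstract mentions.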
[CV-41] Approximating Condorcet Ordering for Vector-valued Mathematical Morphology
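[Quick read]: This paper addresses the lack of a consensus vector ordering for building morphological operators on vector-valued images such as color images, that is, how to choose the most suitable vector ordering for effective morphological operations. The key to the solution is a machine learning approach that learns a reduced ordering approximating the Condorcet ordering. The Condorcet ordering borrows from voting: each vector ordering is treated as a ballot, and the relative order of elements is decided by majority. The model learns an efficient reduced ordering from a set of candidate orderings, giving morphological processing of color images a more robust ordering basis.
Link: https://arxiv.org/abs/2509.06577
Authors: Marcos Eduardo Valle,Santiago Velasco-Forero,Joao Batista Florindo,Gustavo Jesus Angulo
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: Submitted to the 4th International Conference on Discrete Geometry and Mathematical Morphology (DGMM 2025)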
Abstract:Mathematical morphology provides a nonlinear framework for image and spatial data processing and analysis. Although there have been many successful applications of mathematical morphology to vector-valued images, such as color and hyperspectral images, there is still no consensus on the most suitable vector ordering for constructing morphological operators. This paper addresses this issue by examining a reduced ordering approximating the Condorcet ranking derived from a set of vector orderings. Inspired by voting problems, the Condorcet ordering ranks elements from most to least voted, with voters representing different orderings. In this paper, we develop a machine learning approach that learns a reduced ordering that approximates the Condorcet ordering. Preliminary computational experiments confirm the effectiveness of learning the reduced mapping to define vector-valued morphological operators for color images.
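The voting idea in the abstract can be made concrete with a small sketch. Each "voter" below is a scalar ordering of RGB colors (one per channel), and colors are ranked by how many pairwise majority contests they win. This Copeland-style count is a simple stand-in chosen for illustration; the abstract does not specify how the Condorcet ordering is computed or how the learned reduced ordering approximates it, and the names `pairwise_wins`, `voters`, and `colors` are hypothetical.

```python
from itertools import combinations


def pairwise_wins(elements, voters):
    """Count, for each element, how many pairwise majority contests it wins.

    voters is a list of key functions; a voter prefers a over b
    when key(a) > key(b). Ties in a contest award no win to either side.
    """
    wins = {element: 0 for element in elements}
    for a, b in combinations(elements, 2):
        votes_a = sum(1 for key in voters if key(a) > key(b))
        votes_b = sum(1 for key in voters if key(b) > key(a))
        if votes_a > votes_b:
            wins[a] += 1
        elif votes_b > votes_a:
            wins[b] += 1
    return wins


# Three marginal orderings act as voters: red, green, and blue channel.
voters = [lambda c: c[0], lambda c: c[1], lambda c: c[2]]
colors = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (200, 200, 200), (10, 10, 10)]

wins = pairwise_wins(colors, voters)
# Most-voted first; sorted() is stable, so tied colors keep their input order.
ranking = sorted(colors, key=lambda c: wins[c], reverse=True)
print(ranking)
```

With these three voters the light gray wins every contest, and the dark gray beats each pure primary two votes to one, which shows how a majority ranking can differ from any single channel ordering.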
[CV-42] Evolving from Unknown to Known: Retentive Angular Representation Learning for Incremental Open Set Recognition
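[Quick read]: This paper addresses the loss of discriminability of decision boundaries in incremental open set recognition (IOSR) caused by the lack of access to earlier training data, which leads to severe inter-class confusion. The key to the solution is Retentive Angular Representation Learning (RARL). In an angular space built on an equiangular tight frame, representations of unknown classes are guided to align around inactive prototypes, which suppresses representation drift during knowledge updates. A Virtual-Intrinsic Interactive (VII) training strategy enlarges the margins between known classes, and a stratified rectification strategy refines the decision boundaries, reducing the feature space distortion and representation bias caused by imbalances between old and new classes and between positive and negative samples.
Link: https://arxiv.org/abs/2509.06570
Authors: Runqing Yang,Yimin Fu,Changyuan Wu,Zhunga Liu
Institutions: Northwestern Polytechnical University; Hong Kong Baptist University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures, 2025 IEEE/CVF International Conference on Computer Vision Workshops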
Abstract:Existing open set recognition (OSR) methods are typically designed for static scenarios, where models aim to classify known classes and identify unknown ones within fixed scopes. This deviates from the expectation that the model should incrementally identify newly emerging unknown classes from continuous data streams and acquire corresponding knowledge. In such evolving scenarios, the discriminability of OSR decision boundaries is hard to maintain due to restricted access to former training data, causing severe inter-class confusion. To solve this problem, we propose retentive angular representation learning (RARL) for incremental open set recognition (IOSR). In RARL, unknown representations are encouraged to align around inactive prototypes within an angular space constructed under the equiangular tight frame, thereby mitigating excessive representation drift during knowledge updates. Specifically, we adopt a virtual-intrinsic interactive (VII) training strategy, which compacts known representations by enforcing clear inter-class margins through boundary-proximal virtual classes. Furthermore, a stratified rectification strategy is designed to refine decision boundaries, mitigating representation bias and feature space distortion caused by imbalances between old/new and positive/negative class samples. We conduct thorough evaluations on CIFAR100 and TinyImageNet datasets and establish a new benchmark for IOSR. Experimental results across various task setups demonstrate that the proposed method achieves state-of-the-art performance.
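To make the equiangular tight frame concrete, here is a minimal sketch of one common construction, the simplex ETF, in which every pair of unit-norm prototypes has the same cosine similarity of -1/(K-1). The abstract does not say which construction RARL uses or how prototypes are split into active and inactive ones, so the functions `simplex_etf` and `cosine` and the choice of five prototypes are illustrative assumptions.

```python
import math


def simplex_etf(num_prototypes):
    """Return K unit vectors in R^K forming a simplex equiangular tight frame.

    Prototype i is sqrt(K / (K - 1)) * (e_i - (1/K) * ones), so all pairs
    of distinct prototypes share the cosine similarity -1 / (K - 1).
    """
    k = num_prototypes
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


prototypes = simplex_etf(5)
# Every off-diagonal pair should be equally separated.
print(round(cosine(prototypes[0], prototypes[1]), 4))
print(round(cosine(prototypes[2], prototypes[4]), 4))
```

Fixing such a frame up front means that prototypes reserved for classes not yet seen are already maximally separated from the ones in use, which is one plausible reading of the "inactive prototypes" in the abstract.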
[CV-43] Back To The Drawing Board: Rethinking Scene-Level Sketch-Based Image Retrieval BMVC2025
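[Quick read]: This paper addresses the drop in retrieval performance in scene-level sketch-based image retrieval (SBIR) caused by the inherent ambiguity and noise of real sketches. Earlier methods mostly rely on architectural augmentations, whereas this paper improves robustness to sketch variability through training design. The key is an appropriate combination of pre-training strategy, encoder architecture, and loss function, which reaches state-of-the-art performance without additional complexity. Experiments on the challenging FS-COCO and SketchyCOCO datasets confirm the approach and underline the central role of training design in cross-modal retrieval.
Link: https://arxiv.org/abs/2509.06566
Authors: Emil Demić,Luka Čehovin Zajc
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to BMVC2025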
Abstract:The goal of Scene-level Sketch-Based Image Retrieval is to retrieve natural images matching the overall semantics and spatial layout of a free-hand sketch. Unlike prior work focused on architectural augmentations of retrieval models, we emphasize the inherent ambiguity and noise present in real-world sketches. This insight motivates a training objective that is explicitly designed to be robust to sketch variability. We show that with an appropriate combination of pre-training, encoder architecture, and loss formulation, it is possible to achieve state-of-the-art performance without the introduction of additional complexity. Extensive experiments on the challenging FS-COCO and the widely used SketchyCOCO datasets confirm the effectiveness of our approach and underline the critical role of training design in cross-modal retrieval tasks, as well as the need to improve the evaluation scenarios of scene-level SBIR.
[CV-44] Tackling Device Data Distribution Real-time Shift via Prototype-based Parameter Editing
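[Quick read]: This paper addresses the weak generalization of lightweight on-device models under real-time data distribution shift, a challenge that current research often overlooks and that existing methods mostly handle with data-intensive and computationally expensive fine-tuning. The key to the solution is Persona, a prototype-based, backpropagation-free parameter editing framework. A neural adapter in the cloud generates a parameter editing matrix from real-time device data, efficiently maps local models to the prototype of the corresponding data distribution, and dynamically updates the prototypes so that the models evolve. A cross-layer knowledge transfer mechanism keeps multi-layer parameter changes consistent and context-aware, which improves adaptability and generalization without post-deployment retraining.
Link: https://arxiv.org/abs/2509.06552
Authors: Zheqi Lv,Wenqiao Zhang,Kairui Fu,Qi Tian,Shengyu Zhang,Jiajie Su,Jingyuan Chen,Kun Kuang,Fei Wu
Institutions: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)
Comments: Published on MM’25: Proceedings of the 33rd ACM International Conference on Multimedia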
Abstract:Real-time data distribution shift on devices challenges the generalization of lightweight on-device models. This critical issue is often overlooked in current research, which predominantly relies on data-intensive and computationally expensive fine-tuning approaches. To tackle this, we introduce Persona, a novel personalized method using a prototype-based, backpropagation-free parameter editing framework to enhance model generalization without post-deployment retraining. Persona employs a neural adapter in the cloud to generate a parameter editing matrix based on real-time device data. This matrix adeptly adapts on-device models to the prevailing data distributions, efficiently clustering them into prototype models. The prototypes are dynamically refined via the parameter editing matrix, facilitating efficient evolution. Furthermore, the integration of cross-layer knowledge transfer ensures consistent and context-aware multi-layer parameter changes and prototype assignment. Extensive experiments on vision and recommendation tasks across multiple datasets confirm Persona’s effectiveness and generality.
[CV-45] Signal-Based Malware Classification Using 1D CNNs
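[Quick read]: This paper addresses malware classification, where traditional static analysis is easily evaded by obfuscation and dynamic analysis is too resource intensive for large-scale deployment. Existing methods convert binaries into 2D images for computer vision models, which improves detection of obfuscated malware but has two flaws: information loss from quantisation noise (pixel values are rounded to integers), and the introduction of 2D dependencies that do not exist in the original data. Both limit downstream model performance. The key to the solution is to drop the heuristic 2D reshaping step and resample files directly into one-dimensional (1D) signals, which avoids information loss and preserves the original data structure. The signals are stored in floating-point format to remove quantisation noise, and 1D convolutional neural network (1D CNN) architectures are adapted or designed for them, including a model based on ResNet and squeeze-and-excitation layers. On the MalNet dataset this achieves better binary, type, and family level classification than existing methods.
Link: https://arxiv.org/abs/2509.06548
Authors: Jack Wilkie,Hanan Hindy,Ivan Andonovic,Christos Tachtatzis,Robert Atkinson
Institutions: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted for publication in Springer Cybersecurity (2025)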
Abstract:Malware classification is a contemporary and ongoing challenge in cyber-security: modern obfuscation techniques are able to evade traditional static analysis, while dynamic analysis is too resource intensive to be deployed at a large scale. One prominent line of research addresses these limitations by converting malware binaries into 2D images by heuristically reshaping them into a 2D grid before resizing using Lanczos resampling. These images can then be classified based on their textural information using computer vision approaches. While this approach can detect obfuscated malware more effectively than static analysis, the process of converting files into 2D images results in significant information loss due to both quantisation noise, caused by rounding to integer pixel values, and the introduction of 2D dependencies which do not exist in the original data. This loss of signal limits the classification performance of the downstream model. This work addresses these weaknesses by instead resizing the files into 1D signals which avoids the need for heuristic reshaping, and additionally these signals do not suffer from quantisation noise due to being stored in a floating-point format. It is shown that existing 2D CNN architectures can be readily adapted to classify these 1D signals for improved performance. Furthermore, a bespoke 1D convolutional neural network, based on the ResNet architecture and squeeze-and-excitation layers, was developed to classify these signals and evaluated on the MalNet dataset. It was found to achieve state-of-the-art performance on binary, type, and family level classification with F1 scores of 0.874, 0.503, and 0.507, respectively, paving the way for future models to operate on the proposed signal modality.
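The 1D signal idea can be sketched as follows: the raw bytes of a file are scaled to floats and resampled to a fixed length, with the interpolated values kept as floats instead of being rounded back to integer pixel values. The abstract does not name the resampling kernel used for 1D signals, so the linear interpolation here is an assumption, and `bytes_to_signal` is a hypothetical helper.

```python
def bytes_to_signal(data: bytes, target_len: int) -> list[float]:
    """Resample a byte string to target_len float samples in [0, 1].

    Assumption: plain linear interpolation between neighbouring bytes.
    Interpolated values stay floating-point, so no rounding to integer
    levels takes place.
    """
    if not data or target_len < 1:
        raise ValueError("need non-empty data and a positive target length")
    samples = [byte / 255.0 for byte in data]
    if len(samples) == 1 or target_len == 1:
        return [samples[0]] * target_len
    step = (len(samples) - 1) / (target_len - 1)
    signal = []
    for i in range(target_len):
        position = i * step                 # fractional index into samples
        low = int(position)
        high = min(low + 1, len(samples) - 1)
        fraction = position - low
        signal.append(samples[low] * (1.0 - fraction) + samples[high] * fraction)
    return signal


# Four bytes stretched to seven samples; the in-between values are fractional.
signal = bytes_to_signal(bytes([0, 255, 0, 255]), target_len=7)
print(signal)
```

In practice real files are much longer than the target length, so the resampling shrinks rather than stretches them; the toy input is kept short only so that the result is easy to inspect.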
[CV-46] Benchmarking EfficientTAM on FMO datasets
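[Quick read]: This paper addresses the difficulty of tracking Fast Moving Objects (FMOs) in computer vision, in particular the weak detection and tracking performance on small, fast objects. The key to the solution is a structured JSON metadata file, called FMOX, which extends the annotations of four open source FMO image sequence datasets with additional ground truth such as object size. FMOX is then used as a benchmark for the recently proposed foundation model for tracking, EfficientTAM. Experiments show that the model performs comparably to traditional tracking pipelines designed specifically for FMO tasks, supporting FMOX as a standardized and general evaluation tool.
Link: https://arxiv.org/abs/2509.06536
Authors: Senem Aktas,Charles Markham,John McDonald,Rozenn Dahyot
Institutions: Maynooth University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: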
Abstract:Fast and tiny object tracking remains a challenge in computer vision. In this paper we first introduce a JSON metadata file associated with four open source datasets of Fast Moving Objects (FMOs) image sequences. In addition, we extend the description of the FMOs datasets with additional ground truth information in JSON format (called FMOX) with object size information. Finally, we use our FMOX file to test a recently proposed foundational model for tracking (called EfficientTAM), showing that its performance compares well with the pipelines originally tailored for these FMO datasets. Our comparison of these state-of-the-art techniques on FMOX is provided with Trajectory Intersection of Union (TIoU) scores. The code and JSON are shared open source, allowing FMOX to be accessible and usable for other machine learning pipelines aiming to process FMO datasets.
[CV-47] On the Reproducibility of "FairCLIP: Harnessing Fairness in Vision-Language Learning"
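[Quick read]: This paper addresses group fairness problems of pre-trained vision-language models such as CLIP in zero-shot medical tasks, where predictions differ across sensitive groups (for example by gender or race). The core approach, FairCLIP, improves fairness by minimizing disparities in image-text similarity scores across sensitive groups, using the Sinkhorn distance as the optimization objective. The key is a Sinkhorn-distance regularization term that constrains representation differences between groups and thereby mitigates bias. However, the experiments show that although this regularization lowers the Sinkhorn distance, it does not improve fairness or performance on glaucoma classification.
Link: https://arxiv.org/abs/2509.06535
Authors: Hua Chang Bakker,Stan Fris,Angela Madelon Bernardy,Stan Deutekom
Institutions: University of Amsterdam
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: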
Abstract:We investigated the reproducibility of FairCLIP, proposed by Luo et al. (2024), for improving the group fairness of CLIP (Radford et al., 2021) by minimizing image-text similarity score disparities across sensitive groups using the Sinkhorn distance. The experimental setup of Luo et al. (2024) was reproduced to primarily investigate the research findings for FairCLIP. The model description by Luo et al. (2024) was found to differ from the original implementation. Therefore, a new implementation, A-FairCLIP, is introduced to examine specific design choices. Furthermore, FairCLIP+ is proposed to extend the FairCLIP objective to include multiple attributes. Additionally, the impact of the distance minimization on FairCLIP’s fairness and performance was explored. In alignment with the original authors, CLIP was found to be biased towards certain demographics when applied to zero-shot glaucoma classification using medical scans and clinical notes from the Harvard-FairVLMed dataset. However, the experimental results on two datasets do not support their claim that FairCLIP improves the performance and fairness of CLIP. Although the regularization objective reduces Sinkhorn distances, neither the official implementation nor the aligned implementation, A-FairCLIP, was found to improve performance or fairness in zero-shot glaucoma classification.
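For readers unfamiliar with the Sinkhorn distance, the following sketch runs plain Sinkhorn iterations between two small sets of scalar similarity scores and reports the transport cost of the resulting entropic plan. It is a toy illustration only: the squared-difference cost, the uniform weights, the values of `epsilon` and `iterations`, and the function name `sinkhorn_transport_cost` are assumptions, and nothing here reproduces the FairCLIP or A-FairCLIP implementations.

```python
import math


def sinkhorn_transport_cost(xs, ys, epsilon=0.05, iterations=200):
    """Entropy-regularised optimal transport cost between two 1D samples.

    Assumptions: uniform weights on both samples, squared difference as
    ground cost, and a fixed number of Sinkhorn scaling iterations.
    """
    n, m = len(xs), len(ys)
    cost = [[(x - y) ** 2 for y in ys] for x in xs]
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the uniform marginals.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan is u_i * K_ij * v_j; sum its cost.
    return sum(
        u[i] * kernel[i][j] * v[j] * cost[i][j]
        for i in range(n)
        for j in range(m)
    )


# Hypothetical similarity scores for two groups; group_b is shifted by 0.3.
group_a = [0.2, 0.4, 0.6, 0.8]
group_b = [score + 0.3 for score in group_a]
same = sinkhorn_transport_cost(group_a, group_a)
shifted = sinkhorn_transport_cost(group_a, group_b)
print(f"same distribution: {same:.4f}, shifted distribution: {shifted:.4f}")
```

The cost for identical samples is small but not zero because the entropic term blurs the plan, while the shifted sample yields a clearly larger value. A regularizer of this kind penalizes such gaps between score distributions, which is the quantity the abstract reports as being reduced without a matching gain in fairness.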
[CV-48] Predicting Brain Tumor Response to Therapy using a Hybrid Deep Learning and Radiomics Approach MICCAI2025
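[Quick read]: This paper addresses the accuracy of treatment response assessment for glioblastoma, where imaging-based assessment in clinical practice is subjective and poorly standardized. It proposes a hybrid framework that fuses deep learning features with radiomics and clinical features. A fine-tuned ResNet-18 extracts deep image features from 2D regions across four MRI modalities, and these are fused with more than 4800 radiomic and clinical features built from tumor growth and shrinkage masks, volumetric changes relative to the nadir, tumor centroid shift, and more. A CatBoost classifier then predicts a four-class treatment response (complete response, partial response, stable disease, progressive disease), reaching a ROC AUC of 0.81 and a Macro F1 of 0.50, which supports multimodal feature fusion for automated response assessment in neuro-oncology.
Link: https://arxiv.org/abs/2509.06511
Authors: Daniil Tikhonov,Matheus Scatolin,Mohor Banerjee,Qiankun Ji,Ahmed Jaheen,Mostafa Salem,Abdelrahman Elsayed,Hu Wang,Sarim Hashmi,Mohammad Yaqub
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to the BraTS-Lighthouse 2025 Challenge (MICCAI 2025)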
Abstract:Accurate evaluation of the response of glioblastoma to therapy is crucial for clinical decision-making and patient management. The Response Assessment in Neuro-Oncology (RANO) criteria provide a standardized framework to assess patients’ clinical response, but their application can be complex and subject to observer variability. This paper presents an automated method for classifying the intervention response from longitudinal MRI scans, developed to predict tumor response during therapy as part of the BraTS 2025 challenge. We propose a novel hybrid framework that combines deep learning derived feature extraction and an extensive set of radiomics and clinically chosen features. Our approach utilizes a fine-tuned ResNet-18 model to extract features from 2D regions of interest across four MRI modalities. These deep features are then fused with a rich set of more than 4800 radiomic and clinically driven features, including 3D radiomics of tumor growth and shrinkage masks, volumetric changes relative to the nadir, and tumor centroid shift. Using the fused feature set, a CatBoost classifier achieves a mean ROC AUC of 0.81 and a Macro F1 score of 0.50 in the 4-class response prediction task (Complete Response, Partial Response, Stable Disease, Progressive Disease). Our results highlight that synergizing learned image representations with domain-targeted radiomic features provides a robust and effective solution for automated treatment response assessment in neuro-oncology.
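Two of the clinically driven features named in the abstract, volumetric change relative to the nadir and tumor centroid shift, are simple enough to sketch. The definitions below are assumptions for illustration: the nadir is taken as the smallest volume among earlier time points, and the centroid is the mean voxel coordinate of a mask given as a list of (x, y, z) tuples. The function names are hypothetical and the paper's exact feature definitions may differ.

```python
import math


def change_from_nadir(volumes):
    """Relative change of the latest volume versus the smallest earlier one.

    Assumption: the nadir is the minimum over all time points before the last.
    """
    nadir = min(volumes[:-1])
    if nadir == 0:
        raise ValueError("nadir volume is zero; relative change is undefined")
    return (volumes[-1] - nadir) / nadir


def centroid(voxels):
    """Mean (x, y, z) coordinate of the voxels belonging to a tumor mask."""
    count = len(voxels)
    return tuple(sum(voxel[axis] for voxel in voxels) / count for axis in range(3))


def centroid_shift(voxels_before, voxels_after):
    """Euclidean distance between the centroids of two tumor masks."""
    return math.dist(centroid(voxels_before), centroid(voxels_after))


# Toy longitudinal volumes (arbitrary units) and two tiny voxel masks.
volumes = [30.0, 12.0, 10.0, 14.0]
growth = change_from_nadir(volumes)
shift = centroid_shift([(0, 0, 0), (2, 0, 0)], [(3, 4, 0), (5, 4, 0)])
print(f"change from nadir: {growth:.0%}, centroid shift: {shift} voxels")
```

Measuring against the nadir rather than the baseline scan means that regrowth after an initial response still registers as an increase, which is why such a feature is informative for separating progressive from stable disease.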
[CV-49] TIDE: Achieving Balanced Subject-Driven Image Generation via Target-Instructed Diffusion Enhancement
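[Quick read]: This paper addresses the core challenge of subject-driven image generation (SDIG): following textual instructions while preserving the identity of a specific subject. Existing methods struggle to balance subject fidelity against instruction compliance. The key to the solution is the Target-Instructed Diffusion Enhancing (TIDE) framework, which introduces target supervision and preference learning. It models subject adaptation dynamics through triplet alignment over (reference image, instruction, target images) and trains the model with the Direct Subject Diffusion (DSD) objective. This objective uses "winning" targets (balanced preservation and compliance) and "losing" targets (distorted) that are generated systematically and scored with quantitative metrics, which gives implicit reward modelling for the trade-off between fidelity and compliance. The method needs no test-time fine-tuning and performs strongly across several tasks.
Link: https://arxiv.org/abs/2509.06499
Authors: Jibai Lin,Bo Ma,Yating Yang,Rong Ma,Turghun Osman,Ahtamjan Ahmat,Rui Dong,Lei Wang,Xi Zhou
Institutions: Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Xinjiang Laboratory of Minority Speech and Language Information Processing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: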
Abstract:Subject-driven image generation (SDIG) aims to manipulate specific subjects within images while adhering to textual instructions, a task crucial for advancing text-to-image diffusion models. SDIG requires reconciling the tension between maintaining subject identity and complying with dynamic edit instructions, a challenge inadequately addressed by existing methods. In this paper, we introduce the Target-Instructed Diffusion Enhancing (TIDE) framework, which resolves this tension through target supervision and preference learning without test-time fine-tuning. TIDE pioneers target-supervised triplet alignment, modelling subject adaptation dynamics using a (reference image, instruction, target images) triplet. This approach leverages the Direct Subject Diffusion (DSD) objective, training the model with paired “winning” (balanced preservation-compliance) and “losing” (distorted) targets, systematically generated and evaluated via quantitative metrics. This enables implicit reward modelling for optimal preservation-compliance balance. Experimental results on standard benchmarks demonstrate TIDE’s superior performance in generating subject-faithful outputs while maintaining instruction compliance, outperforming baseline methods across multiple quantitative metrics. TIDE’s versatility is further evidenced by its successful application to diverse tasks, including structural-conditioned generation, image-to-image generation, and text-image interpolation. Our code is available at this https URL.
[CV-50] WS2: Weakly Supervised Segmentation using Before-After Supervision in Waste Sorting ICCV2025
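[Quick read]: This paper addresses object recognition and segmentation in waste sorting, an industrial quality control setting: how to automatically identify and segment the unwanted items that operators manually remove from a dynamic, heterogeneous material stream. Fully supervised learning, which depends on large amounts of labeled data, struggles with the diversity of tasks and the cost of annotation. The key to the solution is the concept of Before-After Supervision, which uses the visual differences implied by the operator's removal action as a weak supervision signal: a segmentation network is trained only by comparing images from before and after the action, greatly reducing the need for manual labels. The authors build WS² (Weakly Supervised segmentation for Waste-Sorting), the first multiview dataset for weakly supervised sorting, with more than 11,000 high-resolution video frames, and design an end-to-end benchmark pipeline to evaluate several state-of-the-art weakly supervised segmentation methods in this setting.
Link: https://arxiv.org/abs/2509.06485
Authors: Andrea Marelli,Alberto Foresti,Leonardo Pesce,Giacomo Boracchi,Mario Grosso
Institutions: Politecnico di Milano; EURECOM
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 7 figures, ICCV 2025 - Workshops. The WS2 dataset is publicly available for download at this https URL; all the details are reported in the supplementary material.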
Abstract:In industrial quality control, to visually recognize unwanted items within a moving heterogeneous stream, human operators are often still indispensable. Waste-sorting stands as a significant example, where operators on multiple conveyor belts manually remove unwanted objects to select specific materials. To automate this recognition problem, computer vision systems offer great potential in accurately identifying and segmenting unwanted items in such settings. Unfortunately, considering the multitude and the variety of sorting tasks, fully supervised approaches are not a viable option to address this challenge, as they require extensive labeling efforts. Surprisingly, weakly supervised alternatives that leverage the implicit supervision naturally provided by the operator in their removal action are relatively unexplored. In this paper, we define the concept of Before-After Supervision, illustrating how to train a segmentation network by leveraging only the visual differences between images acquired before and after the operator. To promote research in this direction, we introduce WS² (Weakly Supervised segmentation for Waste-Sorting), the first multiview dataset consisting of more than 11,000 high-resolution video frames captured on top of a conveyor belt, including “before” and “after” images. We also present a robust end-to-end pipeline, used to benchmark several state-of-the-art weakly supervised segmentation methods on WS².
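The supervision signal behind Before-After Supervision can be illustrated with a deliberately naive sketch: pixels that change strongly between a "before" and an "after" frame hint at where an item was removed. This is not the paper's pipeline, which benchmarks weakly supervised segmentation networks. The example assumes two already aligned grayscale frames stored as nested lists, and both the threshold value and the function name `removal_mask` are hypothetical.

```python
def removal_mask(before, after, threshold=30):
    """Mark pixels whose intensity changed by more than threshold.

    Assumptions: both frames are grayscale, have the same size, and are
    pixel-aligned, which real conveyor-belt footage would not guarantee.
    """
    return [
        [1 if abs(b - a) > threshold else 0 for b, a in zip(row_before, row_after)]
        for row_before, row_after in zip(before, after)
    ]


# A bright object occupies the top-right corner before removal.
before = [
    [10, 10, 200],
    [10, 210, 205],
    [10, 10, 10],
]
after = [
    [12, 9, 11],
    [10, 12, 9],
    [11, 10, 10],
]

mask = removal_mask(before, after)
for row in mask:
    print(row)
```

Raw differencing of this kind is sensitive to lighting changes and belt motion, which suggests why learned weakly supervised methods are evaluated on the dataset instead of a fixed threshold.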
[CV-51] FSG-Net: Frequency-Spatial Synergistic Gated Network for High-Resolution Remote Sensing Change Detection
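[Quick read]: This paper addresses two key problems in change detection on high-resolution remote sensing images: false alarms caused by radiometric variations over time (such as illumination and seasonal differences), and blurred boundaries caused by the semantic gap between deep semantic features and shallow detail features. The core of the solution is the Frequency-Spatial Synergistic Gated Network (FSG-Net). In the frequency domain, a Discrepancy-Aware Wavelet Interaction Module (DAWIM) adaptively suppresses pseudo-changes. In the spatial domain, a Synergistic Temporal-Spatial Attention Module (STSAM) increases the saliency of genuine change regions. Finally, a Lightweight Gated Fusion Unit (LGFU) uses high-level semantics to selectively fuse shallow details, which bridges the semantic gap and sharpens boundaries.
Link: https://arxiv.org/abs/2509.06482
Authors: Zhongxiang Xie,Shuangxi Miao,Yuhan Jiang,Zhewei Zhang,Jing Yao,Xuecao Li,Jianxi Huang,Pedram Ghamisi
Institutions: China Agricultural University; Chinese Academy of Sciences; Southwest Jiaotong University; Helmholtz-Zentrum Dresden-Rossendorf; Lancaster University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IEEE Transactions on Geoscience and Remote Sensing (TGRS). 13 pages, 9 figures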
Abstract:Change detection from high-resolution remote sensing images is a cornerstone of Earth observation applications, yet its efficacy is often compromised by two critical challenges. First, false alarms are prevalent as models misinterpret radiometric variations from temporal shifts (e.g., illumination, season) as genuine changes. Second, a non-negligible semantic gap between deep abstract features and shallow detail-rich features tends to obstruct their effective fusion, culminating in poorly delineated boundaries. To step further in addressing these issues, we propose the Frequency-Spatial Synergistic Gated Network (FSG-Net), a novel paradigm that aims to systematically disentangle semantic changes from nuisance variations. Specifically, FSG-Net first operates in the frequency domain, where a Discrepancy-Aware Wavelet Interaction Module (DAWIM) adaptively mitigates pseudo-changes by discerningly processing different frequency components. Subsequently, the refined features are enhanced in the spatial domain by a Synergistic Temporal-Spatial Attention Module (STSAM), which amplifies the saliency of genuine change regions. To finally bridge the semantic gap, a Lightweight Gated Fusion Unit (LGFU) leverages high-level semantics to selectively gate and integrate crucial details from shallow layers. Comprehensive experiments on the CDD, GZ-CD, and LEVIR-CD benchmarks validate the superiority of FSG-Net, establishing a new state-of-the-art with F1-scores of 94.16%, 89.51%, and 91.27%, respectively. The code will be made available at this https URL after a possible publication.
[CV-52] Does DINOv3 Set a New Medical Vision Standard?
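[Quick read]: This paper examines how well frontier vision foundation models transfer to specialized domains such as medical imaging. Specifically, it asks whether DINOv3, a self-supervised vision transformer (ViT) pre-trained on natural images, can serve directly as a unified encoder for a range of medical vision tasks without domain-specific pre-training. The key to the approach is a systematic benchmark of DINOv3 across multiple medical imaging modalities and tasks (2D/3D classification and segmentation), with model size and input resolution varied to analyze scalability. The results expose both the strengths and the limits of cross-domain transfer and provide a strong baseline for follow-up work.
Link: https://arxiv.org/abs/2509.06467
Authors: Che Liu,Yinda Chen,Haoyuan Shi,Jinpeng Lu,Bailiang Jian,Jiazhen Pan,Linghan Cai,Jiayi Wang,Yundi Zhang,Jun Li,Cosmin I. Bercea,Cheng Ouyang,Chen Chen,Zhiwei Xiong,Benedikt Wiestler,Christian Wachinger,Daniel Rueckert,Wenjia Bai,Rossella Arcucci
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report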
Abstract:The advent of large-scale vision foundation models, pre-trained on diverse natural images, has marked a paradigm shift in computer vision. However, how the efficacy of frontier vision foundation models transfers to specialized domains such as medical imaging remains an open question. This report investigates whether DINOv3, a state-of-the-art self-supervised vision transformer (ViT) that features strong capability in dense prediction tasks, can directly serve as a powerful, unified encoder for medical vision tasks without domain-specific pre-training. To answer this, we benchmark DINOv3 across common medical vision tasks, including 2D/3D classification and segmentation on a wide range of medical imaging modalities. We systematically analyze its scalability by varying model sizes and input image resolutions. Our findings reveal that DINOv3 shows impressive performance and establishes a formidable new baseline. Remarkably, it can even outperform medical-specific foundation models like BiomedCLIP and CT-Net on several tasks, despite being trained solely on natural images. However, we identify clear limitations: the model’s features degrade in scenarios requiring deep domain specialization, such as in Whole-Slide Pathological Images (WSIs), Electron Microscopy (EM), and Positron Emission Tomography (PET). Furthermore, we observe that DINOv3 does not consistently obey scaling laws in the medical domain; performance does not reliably increase with larger models or finer feature resolutions, showing diverse scaling behaviors across tasks. Ultimately, our work establishes DINOv3 as a strong baseline, whose powerful visual features can serve as a robust prior for multiple complex medical tasks. This opens promising future directions, such as leveraging its features to enforce multiview consistency in 3D reconstruction.
[CV-53] A Statistical 3D Stomach Shape Model for Anatomical Analysis
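[Quick read]: This paper addresses the difficulty of building high-quality, parameterized 3D models of internal organs such as the stomach, which has been limited by scarce data and methodological challenges. The core of the solution is a novel synthetic data generation pipeline that, combined with parametric modeling informed by established studies of stomach shape variability, yields a dataset of anatomically diverse synthetic stomachs. On this dataset the authors train a statistical shape model (SSM) in a low-dimensional shape space, then refine it with meshes from publicly available CT scans through a semi-supervised alignment process, which improves generalization to unseen anatomical variations. The approach spans synthetic data generation through real-world validation and, according to the authors, provides the first statistical 3D shape model of the stomach, with applications in surgical simulation, pre-operative planning, and personalized healthcare.
Link: https://arxiv.org/abs/2509.06464
Authors: Erez Posner,Ore Shtalrid,Oded Erell,Daniel Noy,Moshe Bouhnik
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: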
Abstract:Realistic and parameterized 3D models of human anatomy have become invaluable in research, diagnostics, and surgical planning. However, the development of detailed models for internal organs, such as the stomach, has been limited by data availability and methodological challenges. In this paper, we propose a novel pipeline for the generation of synthetic 3D stomach models, enabling the creation of anatomically diverse morphologies informed by established studies on stomach shape variability. Using this pipeline, we construct a dataset of synthetic stomachs. Building on this dataset, we develop a 3D statistical shape model of the stomach, trained to capture natural anatomical variability in a low-dimensional shape space. The model is further refined using CT meshes derived from publicly available datasets through a semi-supervised alignment process, enhancing its ability to generalize to unseen anatomical variations. We evaluated the model on a held-out test set of real stomach CT scans, demonstrating robust generalization and fit accuracy. We make the statistical shape model along with the synthetic dataset publicly available on GitLab: this https URL to facilitate further research. This work introduces the first statistical 3D shape model of the stomach, with applications ranging from surgical simulation and pre-operative planning to medical education and computational modeling. By combining synthetic data generation, parametric modeling, and real-world validation, our approach represents a significant advancement in organ modeling and opens new possibilities for personalized healthcare solutions.
[CV-54] Focusing by Contrastive Attention: Enhancing VLMs Visual Reasoning
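[Quick read]: This paper addresses the performance degradation of vision-language models (VLMs) in complex visual environments. Existing methods typically require additional training, rely on external segmentation tools, or operate only at a coarse-grained level, and they do not exploit the attention mechanisms already present in VLMs. The key to the solution is an analysis of VLM attention patterns, which finds that visual complexity correlates strongly with attention entropy, which in turn hurts reasoning performance, and that attention refines progressively from global scanning in shallow layers to focused convergence in deeper layers, with the degree of convergence determined by visual complexity. The authors further prove theoretically that contrasting the attention maps of general queries and task-specific queries decomposes the visual signal into semantic signal and visual noise components. Building on this, they propose Contrastive Attention Refinement for Visual Enhancement (CARVE), a training-free method that extracts task-relevant visual signals through pixel-level attention contrasting and improves visual reasoning in complex scenes, with experiments showing up to 75% improvement on open-source models.
Link: https://arxiv.org/abs/2509.06461
Authors: Yuyao Ge,Shenghua Liu,Yiwei Wang,Lingrui Mei,Baolong Bi,Xuanshan Zhou,Jiayu Yao,Jiafeng Guo,Xueqi Cheng
Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; University of California, Merced
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: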
Abstract:Vision-Language Models (VLMs) have demonstrated remarkable success across diverse visual tasks, yet their performance degrades in complex visual environments. While existing enhancement approaches require additional training, rely on external segmentation tools, or operate at coarse-grained levels, they overlook the innate ability within VLMs. To bridge this gap, we investigate VLMs’ attention patterns and discover that: (1) visual complexity strongly correlates with attention entropy, negatively impacting reasoning performance; (2) attention progressively refines from global scanning in shallow layers to focused convergence in deeper layers, with convergence degree determined by visual complexity; and (3) theoretically, the contrast of attention maps between general queries and task-specific queries enables the decomposition of visual signal into semantic signals and visual noise components. Building on these insights, we propose Contrastive Attention Refinement for Visual Enhancement (CARVE), a training-free method that extracts task-relevant visual signals through attention contrasting at the pixel level. Extensive experiments demonstrate that CARVE consistently enhances performance, achieving up to 75% improvement on open-source models. Our work provides critical insights into the interplay between visual complexity and attention mechanisms, offering an efficient pathway for improving visual reasoning with contrasting attention.
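The idea of contrasting a task-specific attention map against a general one can be illustrated with a toy example. The sketch below is not the CARVE implementation: the abstract does not state the contrast operator, so element-wise division is an assumption, and `contrast_attention`, `top_mask`, and the 2x2 maps are hypothetical names and data chosen only for illustration.

```python
def contrast_attention(task_attn, general_attn, eps=1e-6):
    """Contrast two attention maps (lists of rows) and renormalise to sum 1.

    Assumption: the contrast is an element-wise ratio task / general.
    Regions that attract attention under any query (visual noise) are
    damped, regions attended only for the task are amplified.
    """
    ratio = [[t / (g + eps) for t, g in zip(t_row, g_row)]
             for t_row, g_row in zip(task_attn, general_attn)]
    total = sum(sum(row) for row in ratio)
    return [[v / total for v in row] for row in ratio]


def top_mask(attn, keep_fraction=0.25):
    """Binary mask keeping the highest-scoring fraction of positions."""
    flat = sorted((v for row in attn for v in row), reverse=True)
    k = max(1, int(len(flat) * keep_fraction))
    threshold = flat[k - 1]
    return [[1 if v >= threshold else 0 for v in row] for row in attn]


# Hypothetical maps: position (0, 0) is a salient distractor that both
# queries attend to; position (1, 1) matters only for the task query.
general = [[0.70, 0.10], [0.10, 0.10]]
task = [[0.50, 0.05], [0.05, 0.40]]

refined = contrast_attention(task, general)
mask = top_mask(refined, keep_fraction=0.25)
```

In this toy case the raw task map peaks on the distractor, while the contrasted map peaks on the task-relevant position, which is the one the mask keeps.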
[CV-55] IGAff: Benchmarking Adversarial Iterative and Genetic Affine Algorithms on Deep Neural Networks ECAI2025
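[Quick read]: This paper addresses the difficulty of generating effective adversarial examples for deep neural networks in the black-box setting, in order to expose model vulnerabilities and assess robustness. The central question is how to design efficient, general iterative attacks that perturb mainstream vision models (ResNet-18, DenseNet-121, Swin Transformer V2, and Vision Transformer) without access to their internals. The key to the solution is two novel black-box attack algorithms, Affine Transformation Attack (ATA) and Affine Genetic Attack (AGA). ATA is an iterative algorithm built on random affine transformations, and AGA is a genetic algorithm combining random noise with affine transformations; both optimize an attack score function without gradient information, in global and targeted attack configurations. Experiments on multiple datasets show results better than existing baselines such as Pixle and Square Attack, with an accuracy improvement of up to 8.82%.
Link: https://arxiv.org/abs/2509.06459
Authors: Sebastian-Vasile Echim,Andrei-Alexandru Preda,Dumitru-Clementin Cercel,Florin Pop
Affiliations: Faculty of Automatic Control and Computers, National University of Science and Technology POLITEHNICA Bucharest; National Institute for Research & Development in Informatics - ICI Bucharest
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 7 figures, Accepted at ECAI 2025 (28th European Conference on Artificial Intelligence)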
Abstract:Deep neural networks currently dominate many fields of the artificial intelligence landscape, achieving state-of-the-art results on numerous tasks while remaining hard to understand and exhibiting surprising weaknesses. An active area of research focuses on adversarial attacks, which aim to generate inputs that uncover these weaknesses. However, this proves challenging, especially in the black-box scenario where model details are inaccessible. This paper explores in detail the impact of such adversarial algorithms on ResNet-18, DenseNet-121, Swin Transformer V2, and Vision Transformer network architectures. Leveraging the Tiny ImageNet, Caltech-256, and Food-101 datasets, we benchmark two novel black-box iterative adversarial algorithms based on affine transformations and genetic algorithms: 1) Affine Transformation Attack (ATA), an iterative algorithm maximizing our attack score function using random affine transformations, and 2) Affine Genetic Attack (AGA), a genetic algorithm that involves random noise and affine transformations. We evaluate the performance of the models in the algorithm parameter variation, data augmentation, and global and targeted attack configurations. We also compare our algorithms with two black-box adversarial algorithms, Pixle and Square Attack. Our experiments yield better results on the image classification task than similar methods in the literature, achieving an accuracy improvement of up to 8.82%. We provide noteworthy insights into successful adversarial defenses and attacks at both global and targeted levels, and demonstrate adversarial robustness through algorithm parameter variation.
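A black-box iterative search over random affine transformations, in the spirit of what the abstract describes for ATA, might look like the sketch below. It is only an illustration under stated assumptions: the paper's attack score function and parameter ranges are not given in the abstract, so the score here is simply one minus the true-class probability, `toy_true_class_prob` is a hypothetical stand-in for a real classifier, and the parameter ranges are arbitrary.

```python
import math
import random


def affine_warp(img, angle, tx, ty, scale):
    """Rotate/scale about the centre, then translate (nearest-neighbour)."""
    n = len(img)
    c = (n - 1) / 2.0
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    out = [[0.0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            # Inverse mapping: find the source pixel for each output pixel.
            dx, dy = x - c - tx, y - c - ty
            sx = (cos_a * dx + sin_a * dy) / scale + c
            sy = (-sin_a * dx + cos_a * dy) / scale + c
            ix, iy = int(round(sx)), int(round(sy))
            if 0 <= ix < n and 0 <= iy < n:
                out[y][x] = img[iy][ix]
    return out


def toy_true_class_prob(img):
    """Hypothetical classifier: probability that the left half is brighter."""
    n = len(img)
    half = n // 2
    left = sum(row[x] for row in img for x in range(half))
    right = sum(row[x] for row in img for x in range(half, n))
    return 1.0 / (1.0 + math.exp(-6.0 * (left - right) / (n * half)))


def random_affine_attack(img, prob_fn, steps=200, seed=0):
    """Keep the random affine transform that most lowers the true-class score.

    Only model outputs are queried, so no gradients or internals are needed.
    """
    rng = random.Random(seed)
    best_params = (0.0, 0.0, 0.0, 1.0)  # identity transform
    best_score = 1.0 - prob_fn(img)     # assumed attack score
    history = [best_score]
    for _ in range(steps):
        params = (rng.uniform(-math.pi, math.pi),  # rotation angle
                  rng.uniform(-2.0, 2.0),          # x translation in pixels
                  rng.uniform(-2.0, 2.0),          # y translation in pixels
                  rng.uniform(0.8, 1.2))           # isotropic scale
        score = 1.0 - prob_fn(affine_warp(img, *params))
        if score > best_score:
            best_params, best_score = params, score
        history.append(best_score)
    return best_params, best_score, history


# 8x8 image whose left half is bright, so the toy model starts out confident.
image = [[1.0] * 4 + [0.0] * 4 for _ in range(8)]
params, score, history = random_affine_attack(image, toy_true_class_prob)
```

A genetic variant such as AGA would replace the independent random draws with a population that is mutated and recombined between iterations; that part is not sketched here.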
[CV-56] Cross3DReg: Towards a Large-scale Real-world Cross-source Point Cloud Registration Benchmark
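[Quick read]: This paper addresses two core challenges in cross-source point cloud registration: the lack of large-scale real-world datasets for training deep registration models, and the inherent differences between point clouds captured by different sensors, which make feature extraction and matching difficult and reduce registration accuracy. The solution is twofold. First, the authors build Cross3DReg, which they describe as the largest real-world multi-modal cross-source point cloud registration dataset to date. Second, they propose an overlap-based registration framework: unaligned images are used to predict the overlapping region between source and target point clouds, so that redundant points in non-overlapping regions are filtered out and noise interference is reduced; a visual-geometric attention guided matching module then fuses image and geometric information to make cross-source features more consistent and establish reliable correspondences. Experiments show that the method reduces relative rotation error (RRE) and relative translation error (RTE) by 63.2% and 40.2%, respectively, and improves registration recall (RR) by 5.4%.
Link: https://arxiv.org/abs/2509.06456
Authors: Zongyi Xu,Zhongpeng Lang,Yilong Chen,Shanshan Zhao,Xiaoshui Huang,Yifan Zuo,Yan Zhang,Qianni Zhang,Xinbo Gao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: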
Abstract:Cross-source point cloud registration, which aims to align point cloud data from different sensors, is a fundamental task in 3D vision. However, compared to same-source point cloud registration, cross-source registration faces two core challenges: the lack of publicly available large-scale real-world datasets for training deep registration models, and the inherent differences in point clouds captured by multiple sensors. The diverse patterns induced by the sensors pose great challenges to robust and accurate point cloud feature extraction and matching, which negatively affects registration accuracy. To advance research in this field, we construct Cross3DReg, currently the largest real-world multi-modal cross-source point cloud registration dataset, which is collected by a rotating mechanical lidar and a hybrid semi-solid-state lidar, respectively. Moreover, we design an overlap-based cross-source registration framework, which utilizes unaligned images to predict the overlapping region between source and target point clouds, effectively filtering out redundant points in the irrelevant regions and significantly mitigating the interference caused by noise in non-overlapping areas. Then, a visual-geometric attention guided matching module is proposed to enhance the consistency of cross-source point cloud features by fusing image and geometric information to establish reliable correspondences and ultimately achieve accurate and robust registration. Extensive experiments show that our method achieves state-of-the-art registration performance. Our framework reduces the relative rotation error (RRE) and relative translation error (RTE) by 63.2% and 40.2%, respectively, and improves the registration recall (RR) by 5.4%, which validates its effectiveness in achieving accurate cross-source registration.
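For readers unfamiliar with the three reported metrics, the sketch below computes RRE, RTE, and RR using the definitions commonly found in the registration literature. The abstract does not give the exact formulas or success thresholds used for Cross3DReg, so the formulas are assumed to be the usual ones, and the thresholds in the example (15 degrees, 0.3 in translation units) as well as the helper `rot_z` are hypothetical.

```python
import math


def rot_z(deg):
    """3x3 rotation matrix about the z axis (helper for the example)."""
    a = math.radians(deg)
    return [[math.cos(a), -math.sin(a), 0.0],
            [math.sin(a), math.cos(a), 0.0],
            [0.0, 0.0, 1.0]]


def relative_rotation_error(R_est, R_gt):
    """Angle in degrees of the residual rotation R_gt^T R_est."""
    # trace(R_gt^T R_est) equals the sum of element-wise products.
    trace = sum(R_gt[i][j] * R_est[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for safety
    return math.degrees(math.acos(cos_angle))


def relative_translation_error(t_est, t_gt):
    """Euclidean distance between estimated and ground-truth translations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_est, t_gt)))


def registration_recall(errors, rre_max, rte_max):
    """Fraction of pairs whose (RRE, RTE) both fall under the thresholds."""
    ok = sum(1 for rre, rte in errors if rre <= rre_max and rte <= rte_max)
    return ok / len(errors)


rre = relative_rotation_error(rot_z(10.0), rot_z(0.0))
rte = relative_translation_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
recall = registration_recall([(1.0, 0.1), (20.0, 0.1), (1.0, 5.0)],
                             rre_max=15.0, rte_max=0.3)
```

Because recall depends on the chosen thresholds, numbers from different benchmarks are only comparable when the thresholds match.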
[CV-57] Perception-oriented Bidirectional Attention Network for Image Super-resolution Quality Assessment
[Quick read]: This paper addresses the lack of effective full-reference (FR) image quality assessment (IQA) methods for super-resolution (SR) images. Many SR algorithms exist, but FR-IQA metrics for comparing and evaluating them objectively remain limited. The authors propose the Perception-oriented Bidirectional Attention Network (PBAN), whose core contribution is the perception-oriented bidirectional attention (PBA) module. It comprises a Bidirectional Attention mechanism that builds visual attention to distortion in both directions, consistent with how SR images are generated and evaluated; a Grouped Multi-scale Deformable Convolution for adaptive distortion perception; and a Sub-information Excitation Convolution that directs attention to the sub-pixel and sub-channel levels. A quality prediction module then integrates the quality-aware features and regresses quality scores, and experiments show PBAN outperforming state-of-the-art quality assessment methods.
Link: https://arxiv.org/abs/2509.06442
Authors: Yixiao Li,Xiaoyuan Yang,Guanghui Yue,Jun Fu,Qiuping Jiang,Xu Jia,Paul L. Rosin,Hantao Liu,Wei Zhou
Affiliations: Beihang University; Cardiff University; Shenzhen University; University of Science and Technology of China; Ningbo University; Dalian University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 16 pages, 6 figures, IEEE Transactions on Image Processing
Abstract:Many super-resolution (SR) algorithms have been proposed to increase image resolution. However, full-reference (FR) image quality assessment (IQA) metrics for comparing and evaluating different SR algorithms are limited. In this work, we propose the Perception-oriented Bidirectional Attention Network (PBAN) for image SR FR-IQA, which is composed of three modules: an image encoder module, a perception-oriented bidirectional attention (PBA) module, and a quality prediction module. First, we encode the input images for feature representations. Inspired by the characteristics of the human visual system, we then construct the perception-oriented PBA module. Specifically, different from existing attention-based SR IQA methods, we conceive a Bidirectional Attention to bidirectionally construct visual attention to distortion, which is consistent with the generation and evaluation processes of SR images. To further guide the quality assessment towards the perception of distorted information, we propose Grouped Multi-scale Deformable Convolution, enabling the proposed method to adaptively perceive distortion. Moreover, we design Sub-information Excitation Convolution to direct visual perception to both sub-pixel and sub-channel attention. Finally, the quality prediction module is exploited to integrate quality-aware features and regress quality scores. Extensive experiments demonstrate that our proposed PBAN outperforms state-of-the-art quality assessment methods.
[CV-58] When Language Model Guides Vision: Grounding DINO for Cattle Muzzle Detection
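[Quick read]: This paper addresses two problems in cattle muzzle detection: manual detection is labor-intensive and inconsistent, and supervised models such as YOLO need extensive annotated data and generalize poorly to new or unseen cattle. The key to the solution is a zero-shot detection framework based on Grounding DINO, a vision-language model that is guided by natural language prompts and localizes the muzzle region across breeds and environments without task-specific training or annotated data, which makes the system easier to scale and deploy.
Link: https://arxiv.org/abs/2509.06427
Authors: Rabin Dulal,Lihong Zheng,Muhammad Ashad Kabir
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: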
Abstract:Muzzle patterns are among the most effective biometric traits for cattle identification. Fast and accurate detection of the muzzle region as the region of interest is critical to automatic visual cattle identification. Earlier approaches relied on manual detection, which is labor-intensive and inconsistent. Recently, automated methods using supervised models like YOLO have become popular for muzzle detection. Although effective, these methods require extensive annotated datasets and tend to depend heavily on their training data, limiting their performance on new or unseen cattle. To address these limitations, this study proposes a zero-shot muzzle detection framework based on Grounding DINO, a vision-language model capable of detecting muzzles without any task-specific training or annotated data. This approach leverages natural language prompts to guide detection, enabling scalable and flexible muzzle localization across diverse breeds and environments. Our model achieves a mean Average Precision (mAP)@0.5 of 76.8%, demonstrating promising performance without requiring annotated data. To our knowledge, this is the first research to provide a real-world, industry-oriented, and annotation-free solution for cattle muzzle detection. The framework offers a practical alternative to supervised methods, promising improved adaptability and ease of deployment in livestock monitoring applications.
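The "@0.5" in the reported mAP refers, by the usual convention, to the intersection-over-union (IoU) threshold at which a predicted box counts as a correct detection. The short sketch below shows that matching criterion only, not the full average-precision computation; the boxes are hypothetical and use an assumed `(x1, y1, x2, y2)` corner format.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, gt_box, threshold=0.5):
    """A prediction counts as a true positive when IoU >= threshold."""
    return iou(pred_box, gt_box) >= threshold


# Hypothetical muzzle boxes: heavy overlap passes, slight overlap does not.
ground_truth = (0.0, 0.0, 2.0, 2.0)
loose_pred = (1.0, 1.0, 3.0, 3.0)   # IoU = 1 / 7
tight_pred = (0.0, 0.0, 2.0, 1.5)   # IoU = 3 / 4
```

Average precision is then computed from the precision-recall curve obtained by ranking predictions by confidence and applying this criterion to each.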
[CV-59] Phantom-Insight: Adaptive Multi-cue Fusion for Video Camouflaged Object Detection with Multimodal LLM
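[Quick read]: This paper targets two limitations of existing video camouflaged object detection (VCOD) methods: methods based on the Segment Anything Model (SAM) struggle to separate camouflaged object edges because the model is frozen, and methods based on multimodal large language models (MLLMs) suffer from poor object separability because foreground and background information are merged. The proposed method, Phantom-Insight, combines SAM and an MLLM. To improve the separability of edge details, it represents video sequences with temporal and spatial clues and fuses features via the LLM, then uses a dynamic foreground visual token scoring module and a prompt network to generate multiple cues that adaptively guide and fine-tune SAM so that it adapts to subtle textures. To improve the separability of objects and background, it introduces a decoupled foreground-background learning strategy that generates foreground and background cues separately and trains them in a decoupled manner, so that the visual token integrates the two kinds of information independently and SAM segments camouflaged objects in video more accurately.
Link: https://arxiv.org/abs/2509.06422
Authors: Hua Zhang,Changjiang Luo,Ruoyu Chen
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: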
Abstract:Video camouflaged object detection (VCOD) is challenging due to dynamic environments. Existing methods face two main issues: (1) SAM-based methods struggle to separate camouflaged object edges due to model freezing, and (2) MLLM-based methods suffer from poor object separability as large language models merge foreground and background. To address these issues, we propose a novel VCOD method based on SAM and MLLM, called Phantom-Insight. To enhance the separability of object edge details, we represent video sequences with temporal and spatial clues and perform feature fusion via LLM to increase information density. Next, multiple cues are generated through the dynamic foreground visual token scoring module and the prompt network to adaptively guide and fine-tune the SAM model, enabling it to adapt to subtle textures. To enhance the separability of objects and background, we propose a decoupled foreground-background learning strategy. By generating foreground and background cues separately and performing decoupled training, the visual token can effectively integrate foreground and background information independently, enabling SAM to more accurately segment camouflaged objects in the video. Experiments on the MoCA-Mask dataset show that Phantom-Insight achieves state-of-the-art performance across various metrics. Additionally, its ability to detect unseen camouflaged objects on the CAD2016 dataset highlights its strong generalization ability.
[CV-60] VQualA 2025 Challenge on Image Super-Resolution Generated Content Quality Assessment: Methods and Results ICCV
[Quick read]: This paper addresses quality assessment for generative super-resolution images, that is, how to identify and quantify the characteristic artifacts and the perceptual quality of SR images produced by recent generative models such as Generative Adversarial Networks (GANs) and diffusion models. Existing SR image quality assessment (SR-IQA) datasets do not adequately cover these artifacts. The key to the solution is the ISRGen-QA dataset, which focuses on generatively super-resolved images, together with a challenge organized as part of the Visual Quality Assessment (VQualA) Competition at the ICCV 2025 Workshops, in which the submitted methods were evaluated on the dataset.
Link: https://arxiv.org/abs/2509.06413
Authors: Yixiao Li,Xin Li,Chris Wei Zhou,Shuo Xing,Hadi Amirpour,Xiaoshuai Hao,Guanghui Yue,Baoquan Zhao,Weide Liu,Xiaoyuan Yang,Zhengzhong Tu,Xinyu Li,Chuanbiao Song,Chenqi Zhang,Jun Lan,Huijia Zhu,Weiqiang Wang,Xiaoyan Sun,Shishun Tian,Dongyang Yan,Weixia Zhang,Junlin Chen,Wei Sun,Zhihua Wang,Zhuohang Shi,Zhizun Luo,Hang Ouyang,Tianxin Xiao,Fan Yang,Zhaowang Wu,Kaixin Deng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 11 pages, 12 figures, VQualA ICCV Workshop
Abstract:This paper presents the ISRGC-Q Challenge, built upon the Image Super-Resolution Generated Content Quality Assessment (ISRGen-QA) dataset, and organized as part of the Visual Quality Assessment (VQualA) Competition at the ICCV 2025 Workshops. Unlike existing Super-Resolution Image Quality Assessment (SR-IQA) datasets, ISRGen-QA places a greater emphasis on SR images generated by the latest generative approaches, including Generative Adversarial Networks (GANs) and diffusion models. The primary goal of this challenge is to analyze the unique artifacts introduced by modern super-resolution techniques and to evaluate their perceptual quality effectively. A total of 108 participants registered for the challenge, with 4 teams submitting valid solutions and fact sheets for the final testing phase. These submissions demonstrated state-of-the-art (SOTA) performance on the ISRGen-QA dataset. The project is publicly available at: this https URL.
[CV-61] 3DOFQuantization: 3DGS quantization for large scenes with limited Degrees of Freedom
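[Quick read]: This paper addresses the projection error caused by coordinate quantization in 3D Gaussian Splatting (3DGS) reconstructions of large scenes, in particular the 3DoF+ case in which the camera position is restricted to small offsets around a central point. The key to the solution is the observation that the projection error is proportional to the squared inverse of the distance to the projected point, which motivates a new quantization scheme based on spherical coordinates. Rate-distortion performance is illustrated on the Garden scene.
Link: https://arxiv.org/abs/2509.06400
Authors: Matthieu Gendrin,Stéphane Pateux,Théo Ladune
Affiliations: Orange Innovation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: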
Abstract:3D Gaussian Splatting (3DGS) is a major breakthrough in 3D scene reconstruction. With a number of views of a given object or scene, the algorithm trains a model composed of 3D Gaussians, which enables the production of novel views from arbitrary points of view. This freedom of movement is referred to as 6DoF for 6 degrees of freedom: a view is produced for any position (3 degrees) and any camera orientation (3 more degrees). On large scenes, though, the input views are acquired from a limited zone in space, and the reconstruction is valuable for novel views from the same zone, even if the scene itself is almost unlimited in size. We refer to this particular case as 3DoF+, meaning that the 3 degrees of freedom of camera position are limited to small offsets around the central position. Considering the problem of coordinate quantization, the impact of position error on the projection error in pixels is studied. It is shown that the projection error is proportional to the squared inverse distance of the point being projected. Consequently, a new quantization scheme based on spherical coordinates is proposed. The rate-distortion performance of the proposed method is illustrated on the well-known Garden scene.
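One way to turn the inverse-square observation into a quantizer is to store positions in spherical coordinates around the central camera position and quantize the inverse radius uniformly. Since d(1/r) = -dr/r², a fixed step in 1/r allows a radial position error that grows like r², which offsets a projection error that shrinks like 1/r². The sketch below illustrates that idea only; the abstract does not specify the actual scheme, so the uniform 1/r grid, the `r_min` bound, and the bit depth are all assumptions.

```python
import math


def to_spherical(x, y, z):
    r = math.sqrt(x * x + y * y + z * z)
    return r, math.acos(z / r), math.atan2(y, x)  # radius, polar, azimuth


def from_spherical(r, theta, phi):
    return (r * math.sin(theta) * math.cos(phi),
            r * math.sin(theta) * math.sin(phi),
            r * math.cos(theta))


def quantize(value, lo, hi, bits):
    """Uniform scalar quantizer returning an integer index."""
    levels = (1 << bits) - 1
    idx = round((value - lo) / (hi - lo) * levels)
    return min(max(idx, 0), levels)


def dequantize(idx, lo, hi, bits):
    levels = (1 << bits) - 1
    return lo + idx / levels * (hi - lo)


def encode_point(p, r_min, bits):
    """Quantize (1/r, theta, phi); points closer than r_min are clamped."""
    r, theta, phi = to_spherical(*p)
    inv_r = 1.0 / max(r, r_min)
    q_inv = max(1, quantize(inv_r, 0.0, 1.0 / r_min, bits))  # avoid 1/r = 0
    return (q_inv,
            quantize(theta, 0.0, math.pi, bits),
            quantize(phi, -math.pi, math.pi, bits))


def decode_point(code, r_min, bits):
    inv_r = dequantize(code[0], 0.0, 1.0 / r_min, bits)
    theta = dequantize(code[1], 0.0, math.pi, bits)
    phi = dequantize(code[2], -math.pi, math.pi, bits)
    return from_spherical(1.0 / inv_r, theta, phi)


def roundtrip_error(p, r_min=0.5, bits=12):
    q = decode_point(encode_point(p, r_min, bits), r_min, bits)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


near_err = roundtrip_error((1.0, 0.5, 0.2))       # point close to the centre
far_err = roundtrip_error((100.0, 50.0, 20.0))    # point far from the centre
```

The far point is reconstructed with a much larger absolute position error than the near one, which is the intended trade-off: under the inverse-square relation that error contributes little to the projected pixel error.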
[CV-62] AI-based response assessment and prediction in longitudinal imaging for brain metastases treated with stereotactic radiosurgery MICCAI2025
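[Quick read]: This paper addresses the heavy clinical workload of manually analyzing longitudinal magnetic resonance imaging (MRI) of brain metastases (BM) treated with stereotactic radiosurgery (SRS), with the aim of automating the quantification and early prediction of treatment response. The key to the solution is an automated data curation pipeline that yields a cohort of 896 BMs in 177 patients followed for 360 days. Data-driven clustering identifies 5 dominant growth trajectories, and gradient boosting and graph machine learning (GML) models predict 12-month lesion-level response from only the pre-treatment and first follow-up MRI, reaching up to 0.90 AUC with gradient boosting and up to 0.88 AUC with GML. The pipeline provides a scalable data and modeling basis for clinical decision support systems.
Link: https://arxiv.org/abs/2509.06396
Authors: Lorenz Achim Kuhn,Daniel Abler,Jonas Richiardi,Andreas F. Hottinger,Luis Schiappacasse,Vincent Dunet,Adrien Depeursinge,Vincent Andrearczyk
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted and Accepted to the Learning with longitudinal medical Images and Data workshop at the MICCAI 2025 Conference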
Abstract:Brain Metastases (BM) are a large contributor to mortality of patients with cancer. They are treated with Stereotactic Radiosurgery (SRS) and monitored with Magnetic Resonance Imaging (MRI) at regular follow-up intervals according to treatment guidelines. Analyzing and quantifying this longitudinal imaging represents an intractable workload for clinicians. As a result, follow-up images are not annotated and merely assessed by observation. Response to treatment in longitudinal imaging is being studied, to better understand growth trajectories and ultimately predict treatment success or toxicity as early as possible. In this study, we implement an automated pipeline to curate a large longitudinal dataset of SRS treatment data, resulting in a cohort of 896 BMs in 177 patients who were monitored for 360 days at approximately two-month intervals at Lausanne University Hospital (CHUV). We use a data-driven clustering to identify characteristic trajectories. In addition, we predict 12-month lesion-level response using classical as well as Graph Machine Learning (GML) models. Clustering revealed 5 dominant growth trajectories with distinct final response categories. Response prediction reaches up to 0.90 AUC (CI95%=0.88-0.92) using only pre-treatment and first follow-up MRI with gradient boosting. Similarly, robust predictive performance of up to 0.88 AUC (CI95%=0.86-0.90) was obtained using GML, offering more flexibility with a single model for multiple input time-points configurations. Our results suggest potential automation and increased precision for the comprehensive assessment and prediction of BM response to SRS in longitudinal MRI. The proposed pipeline facilitates scalable data curation for the investigation of BM growth patterns, and lays the foundation for clinical decision support systems aiming at optimizing personalized care.
[CV-63] Your Super Resolution Model is not Enough for Tackling Real-World Scenarios ICCV2025
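[Quick read]: This paper addresses the poor generalization of single image super-resolution (SISR) models across scale factors, which limits their use in real-world scenarios. The key to the solution is a lightweight plug-in Scale-Aware Attention Module (SAAM) that retrofits existing fixed-scale SR models to perform arbitrary-scale SR. SAAM uses scale-adaptive feature extraction and upsampling, incorporates the parameter-free Simple Attention Module (SimAM) for efficient attention guidance, and adds a gradient variance loss to sharpen image details, giving robust multi-scale upscaling with minimal computational overhead.
Link: https://arxiv.org/abs/2509.06387
Authors: Dongsik Yoon,Jongeun Kim
Affiliations: HDC LABS
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: To appear in Workshop on Efficient Computing under Limited Resources: Visual Computing (ICCV 2025)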
Abstract:Despite remarkable progress in Single Image Super-Resolution (SISR), traditional models often struggle to generalize across varying scale factors, limiting their real-world applicability. To address this, we propose a plug-in Scale-Aware Attention Module (SAAM) designed to retrofit modern fixed-scale SR models with the ability to perform arbitrary-scale SR. SAAM employs lightweight, scale-adaptive feature extraction and upsampling, incorporating the Simple parameter-free Attention Module (SimAM) for efficient guidance and gradient variance loss to enhance sharpness in image details. Our method integrates seamlessly into multiple state-of-the-art SR backbones (e.g., SCNet, HiT-SR, OverNet), delivering competitive or superior performance across a wide range of integer and non-integer scale factors. Extensive experiments on benchmark datasets demonstrate that our approach enables robust multi-scale upscaling with minimal computational overhead, offering a practical solution for real-world scenarios.
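SimAM, the parameter-free attention the abstract mentions, weights each activation by how much it stands out from the rest of its channel. The sketch below follows the formulation commonly given for SimAM (squared deviation from the channel mean, normalized by the channel variance plus a small constant, then passed through a sigmoid) and applies it to a single channel in plain Python. How SAAM wires SimAM into its feature extraction is not described in the abstract, and the constant `lam` and the sample feature map are assumptions.

```python
import math


def simam_gates(channel, lam=1e-4):
    """Per-position gates in (0, 1) for one channel (list of rows).

    Positions that deviate strongly from the channel mean receive gates
    close to 1; positions near the mean receive gates close to
    sigmoid(0.5). The channel needs at least two values.
    """
    values = [v for row in channel for v in row]
    n = len(values) - 1
    mu = sum(values) / len(values)
    sq_dev = [[(v - mu) ** 2 for v in row] for row in channel]
    var = sum(sum(row) for row in sq_dev) / n
    gates = []
    for row in sq_dev:
        gates.append([1.0 / (1.0 + math.exp(-(d / (4.0 * (var + lam)) + 0.5)))
                      for d in row])
    return gates


def simam(channel, lam=1e-4):
    """Reweight a channel by its SimAM gates; no learnable parameters."""
    gates = simam_gates(channel, lam)
    return [[v * g for v, g in zip(row, g_row)]
            for row, g_row in zip(channel, gates)]


# Hypothetical 3x3 feature map with one distinctive activation in the centre.
feature_map = [[1.0, 1.0, 1.0],
               [1.0, 5.0, 1.0],
               [1.0, 1.0, 1.0]]
gates = simam_gates(feature_map)
refined = simam(feature_map)
```

Because the gates are computed from channel statistics alone, the module adds no weights, which is consistent with the low overhead the abstract emphasizes.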
[CV-64] MRD-LiNet: A Novel Lightweight Hybrid CNN with Gradient-Guided Unlearning for Improved Drought Stress Identification
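[Quick read]: This paper addresses early and precise detection of drought stress for agricultural monitoring, where traditional approaches are time-consuming and labor-intensive and deep learning models carry too many parameters for resource-limited deployment. The key to the solution is a lightweight hybrid convolutional neural network (CNN) framework inspired by ResNet, DenseNet, and MobileNet that reduces trainable parameters 15-fold while maintaining competitive accuracy. The authors also introduce a machine unlearning mechanism based on a gradient norm-based influence function, which removes the influence of specific training data in a targeted way and thereby improves model adaptability.
Link: https://arxiv.org/abs/2509.06367
Authors: Aswini Kumar Patra,Lingaraj Sahoo
Affiliations: NERIST (North Eastern Regional Institute of Science and Technology); IITG (Indian Institute of Technology Guwahati)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages, 6 Figures, 3 Tables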
Abstract:Drought stress is a major threat to global crop productivity, making its early and precise detection essential for sustainable agricultural management. Traditional approaches, though useful, are often time-consuming and labor-intensive, which has motivated the adoption of deep learning methods. In recent years, Convolutional Neural Network (CNN) and Vision Transformer architectures have been widely explored for drought stress identification; however, these models generally rely on a large number of trainable parameters, restricting their use in resource-limited and real-time agricultural settings. To address this challenge, we propose a novel lightweight hybrid CNN framework inspired by ResNet, DenseNet, and MobileNet architectures. The framework achieves a remarkable 15-fold reduction in trainable parameters compared to conventional CNN and Vision Transformer models, while maintaining competitive accuracy. In addition, we introduce a machine unlearning mechanism based on a gradient norm-based influence function, which enables targeted removal of specific training data influence, thereby improving model adaptability. The method was evaluated on an aerial image dataset of potato fields with expert-annotated healthy and drought-stressed regions. Experimental results show that our framework achieves high accuracy while substantially lowering computational costs. These findings highlight its potential as a practical, scalable, and adaptive solution for drought stress monitoring in precision agriculture, particularly under resource-constrained conditions.
[CV-65] A Multi-Modal Deep Learning Framework for Colorectal Pathology Diagnosis: Integrating Histological and Colonoscopy Data in a Pilot Study
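[Quick read]: This paper addresses the inefficiency and variability that arise when colorectal diseases, including inflammatory conditions and neoplasms, are diagnosed through separate, independent evaluations of histological images and colonoscopy footage. The key to the solution is a unified deep learning network that uses a ResNet-50 architecture to classify both static histopathological slides and colonoscopy video frames in one pipeline, integrating class-balancing learning, robust augmentation, and calibration methods to improve accuracy, with the aim of an interpretable and reproducible multi-modal diagnostic pipeline.
Link: https://arxiv.org/abs/2509.06351
Authors: Krithik Ramesh,Ritvik Koneru
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: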
Abstract:Colorectal diseases, including inflammatory conditions and neoplasms, require quick, accurate care to be effectively treated. Traditional diagnostic pipelines require extensive preparation and rely on separate, individual evaluations on histological images and colonoscopy footage, introducing possible variability and inefficiencies. This pilot study proposes a unified deep learning network that uses convolutional neural networks (CNNs) to classify both histopathological slides and colonoscopy video frames in one pipeline. The pipeline integrates class-balancing learning, robust augmentation, and calibration methods to ensure accurate results. Static colon histology images were taken from the PathMNIST dataset, and the lower gastrointestinal (colonoscopy) videos were drawn from the HyperKvasir dataset. The CNN architecture used was ResNet-50. This study demonstrates an interpretable and reproducible diagnostic pipeline that unifies multiple diagnostic modalities to advance and ease the detection of colorectal diseases.
[CV-66] Multi View Slot Attention Using Paraphrased Texts For Face Anti-Spoofing ICCV2025
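[Quick read]: This paper addresses the limited cross-domain performance of current face anti-spoofing (FAS) methods based on CLIP (Contrastive Language-Image Pretraining). Two issues are identified: (1) CLIP's patch embedding tokens are not fully exploited, so critical spoofing clues are missed; and (2) each class uses a single text prompt (e.g., 'live' or 'fake'), which limits generalization. The key to the solution is the MVP-FAS framework, built on two modules: Multi-View Slot attention (MVS) and Multi-Text Patch Alignment (MTPA). MVS uses diverse texts with multiple perspectives to extract detailed local spatial features and global context from patch embeddings. MTPA aligns patches with multiple paraphrased text representations to improve semantic robustness. Together they improve cross-domain generalization.
Link: https://arxiv.org/abs/2509.06336
Authors: Jeongmin Yu,Susang Kim,Kisu Lee,Taekyoung Kwon,Won-Yong Shin,Ha Young Kim
Affiliations: Yonsei University; POSCO DX
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Accepted by ICCV 2025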
Abstract:Recent face anti-spoofing (FAS) methods have shown remarkable cross-domain performance by employing vision-language models like CLIP. However, existing CLIP-based FAS models do not fully exploit CLIP’s patch embedding tokens, failing to detect critical spoofing clues. Moreover, these models rely on a single text prompt per class (e.g., ‘live’ or ‘fake’), which limits generalization. To address these issues, we propose MVP-FAS, a novel framework incorporating two key modules: Multi-View Slot attention (MVS) and Multi-Text Patch Alignment (MTPA). Both modules utilize multiple paraphrased texts to generate generalized features and reduce dependence on domain-specific text. MVS extracts local detailed spatial features and global context from patch embeddings by leveraging diverse texts with multiple perspectives. MTPA aligns patches with multiple text representations to improve semantic robustness. Extensive experiments demonstrate that MVP-FAS achieves superior generalization performance, outperforming previous state-of-the-art methods on cross-domain datasets. Code: this https URL.
[CV-67] Harnessing Object Grounding for Time-Sensitive Video Understanding
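Quick read: This paper addresses the performance bottleneck of video large language models (Video-LLMs) on time-sensitive video understanding (TSV) tasks, in particular how to make better use of grounded objects (GO) within frames to improve tasks such as temporal localization and dense captioning. The key to the solution is GO-Tokenizer, a lightweight add-on module that encodes compact object information on the fly using off-the-shelf object detectors. This avoids the extra token length and noise sensitivity that arise when textual object descriptions are added to the prompt. Experiments show that pretraining with GO-Tokenizer clearly outperforms both the vanilla Video-LLM and its counterpart that relies on textual descriptions, and that the gain generalizes across models, datasets, and video understanding tasks.
Link: https://arxiv.org/abs/2509.06335
Authors: Tz-Ying Wu,Sharath Nittur Sridhar,Subarna Tripathi
Affiliations: Intel Labs
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: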
Abstract:We propose to improve the time-sensitive video understanding (TSV) capability of video large language models (Video-LLMs) with grounded objects (GO). We hypothesize that TSV tasks can benefit from GO within frames, which is supported by our preliminary experiments on LITA, a state-of-the-art Video-LLM for reasoning temporal localization. While augmenting prompts with textual description of these object annotations improves the performance of LITA, it also introduces extra token length and susceptibility to the noise in object level information. To address this, we propose GO-Tokenizer, a lightweight add-on module for Video-LLMs leveraging off-the-shelf object detectors to encode compact object information on the fly. Experimental results demonstrate that pretraining with GO-Tokenizer outperforms the vanilla Video-LLM and its counterpart utilizing textual description of objects in the prompt. The gain generalizes across different models, datasets and video understanding tasks such as reasoning temporal localization and dense captioning.
[CV-68] Multi-Modal Camera-Based Detection of Vulnerable Road Users
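Quick read: This paper addresses the low detection accuracy for vulnerable road users (VRUs), such as pedestrians, cyclists, and motorcyclists, under poor lighting, adverse weather, and imbalanced datasets. The key to the solution is a multimodal detection framework that combines RGB and thermal infrared imaging with a fine-tuned YOLOv8 model, using class re-weighting and light data augmentation to improve minority-class recall and overall robustness. Experiments show that 640-pixel resolution with partial backbone freezing gives the best balance between accuracy and efficiency, that the thermal modality achieves the highest precision, and that RGB-to-thermal augmentation further improves recall for rare VRUs, supporting the value of multimodal detection for VRU safety at intersections.
Link: https://arxiv.org/abs/2509.06333
Authors: Penelope Brown,Julie Stephany Berrio Perez,Mao Shan,Stewart Worrall
Affiliations: The University of Sydney; Australian Centre for Robotics
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: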
Abstract:Vulnerable road users (VRUs) such as pedestrians, cyclists, and motorcyclists represent more than half of global traffic deaths, yet their detection remains challenging in poor lighting, adverse weather, and unbalanced data sets. This paper presents a multimodal detection framework that integrates RGB and thermal infrared imaging with a fine-tuned YOLOv8 model. Training leveraged KITTI, BDD100K, and Teledyne FLIR datasets, with class re-weighting and light augmentations to improve minority-class performance and robustness. Experiments show that 640-pixel resolution and partial backbone freezing optimise accuracy and efficiency, while class-weighted losses enhance recall for rare VRUs. Results highlight that thermal models achieve the highest precision, and RGB-to-thermal augmentation boosts recall, demonstrating the potential of multimodal detection to improve VRU safety at intersections.
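The class re-weighting mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not say which weighting scheme the authors used, so the inverse-frequency ("balanced") formula below is an assumption, and the function name `inverse_frequency_weights` and the toy label counts are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by total / (n_classes * class_count).

    Assumed scheme (not taken from the paper): rare classes get larger
    weights, and the weights average to 1 over all samples.
    """
    counts = Counter(labels)
    n_classes = len(counts)
    total = sum(counts.values())
    return {cls: total / (n_classes * n) for cls, n in counts.items()}


# Hypothetical, heavily imbalanced label set.
labels = ["car"] * 90 + ["pedestrian"] * 8 + ["cyclist"] * 2
weights = inverse_frequency_weights(labels)

for cls, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{cls}: {w:.3f}")
```

Weights of this kind would typically be passed to a class-weighted loss, so that mistakes on rare VRU classes cost more than mistakes on the dominant class.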
[CV-69] Quantitative Currency Evaluation in Low-Resource Settings through Pattern Analysis to Assist Visually Impaired Users
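Quick read: This paper addresses the fact that currency recognition systems overlook usability and authenticity assessment in low-resource settings, especially for visually impaired users and offline validation. Existing methods focus on denomination classification and ignore physical damage and forgery, which limits practical use. The key to the solution is a unified currency evaluation framework with three modules: denomination classification with lightweight convolutional neural networks (CNNs), continuous usability scoring through a novel Unified Currency Damage Index (UCDI), and counterfeit detection using feature-based template matching. The framework is trained and validated on more than 82,000 annotated images, supports real-time on-device inference, and balances accuracy, interpretability, and compactness, improving the applicability and inclusiveness of currency evaluation in real-world settings.
Link: https://arxiv.org/abs/2509.06331
Authors: Md Sultanul Islam Ovi,Mainul Hossain,Md Badsha Biswas
Affiliations: George Mason University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 Pages, 9 Figures, 5 Tables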
Abstract:Currency recognition systems often overlook usability and authenticity assessment, especially in low-resource environments where visually impaired users and offline validation are common. While existing methods focus on denomination classification, they typically ignore physical degradation and forgery, limiting their applicability in real-world conditions. This paper presents a unified framework for currency evaluation that integrates three modules: denomination classification using lightweight CNN models, damage quantification through a novel Unified Currency Damage Index (UCDI), and counterfeit detection using feature-based template matching. The dataset consists of over 82,000 annotated images spanning clean, damaged, and counterfeit notes. Our Custom_CNN model achieves high classification performance with low parameter count. The UCDI metric provides a continuous usability score based on binary mask loss, chromatic distortion, and structural feature loss. The counterfeit detection module demonstrates reliable identification of forged notes across varied imaging conditions. The framework supports real-time, on-device inference and addresses key deployment challenges in constrained environments. Results show that accurate, interpretable, and compact solutions can support inclusive currency evaluation in practical settings.
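[CV-70] Towards scalable organ level 3D plant segmentation: Bridging the data algorithm computing gap
Quick read: This paper addresses the limited adoption of 3D segmentation in 3D plant phenotyping, which is caused by data scarcity, the difficulty of adapting deep neural networks, and the lack of standardized benchmarks. The key contributions are threefold. First, it gives a systematic overview of existing 3D plant point cloud datasets and introduces Plant Segmentation Studio (PSS), an open-source framework for reproducible benchmarking. Second, it summarizes deep learning methods for point cloud semantic and instance segmentation and finds sparse convolutional backbones and transformer-based instance segmentation to be particularly effective. Third, extensive quantitative experiments show that modeling-based and augmentation-based synthetic data generation play complementary roles in reducing the need for real annotations. Together these help translate algorithmic advances into practical deployment and provide tools and a roadmap for data-efficient, generalizable deep learning in 3D plant phenotyping.
Link: https://arxiv.org/abs/2509.06329
Authors: Ruiming Du,Guangxun Zhai,Tian Qiu,Yu Jiang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Comments: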
Abstract:The precise characterization of plant morphology provides valuable insights into plant environment interactions and genetic evolution. A key technology for extracting this information is 3D segmentation, which delineates individual plant organs from complex point clouds. Despite significant progress in general 3D computer vision domains, the adoption of 3D segmentation for plant phenotyping remains limited by three major challenges: i) the scarcity of large-scale annotated datasets, ii) technical difficulties in adapting advanced deep neural networks to plant point clouds, and iii) the lack of standardized benchmarks and evaluation protocols tailored to plant science. This review systematically addresses these barriers by: i) providing an overview of existing 3D plant datasets in the context of general 3D segmentation domains, ii) systematically summarizing deep learning-based methods for point cloud semantic and instance segmentation, iii) introducing Plant Segmentation Studio (PSS), an open-source framework for reproducible benchmarking, and iv) conducting extensive quantitative experiments to evaluate representative networks and sim-to-real learning strategies. Our findings highlight the efficacy of sparse convolutional backbones and transformer-based instance segmentation, while also emphasizing the complementary role of modeling-based and augmentation-based synthetic data generation for sim-to-real learning in reducing annotation demands. In general, this study bridges the gap between algorithmic advances and practical deployment, providing immediate tools for researchers and a roadmap for developing data-efficient and generalizable deep learning solutions in 3D plant phenotyping. Data and code are available at this https URL.
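[CV-71] Text4Seg: Advancing Image Segmentation via Generative Language Modeling
Quick read: This paper addresses the challenge of effectively integrating image segmentation into multimodal large language models (MLLMs). Conventional approaches usually rely on additional decoders, which complicates the pipeline and fits poorly with the language modeling paradigm. The core solution is a "text-as-mask" paradigm that casts image segmentation as text generation, so that no additional decoder is needed. The key innovation is semantic descriptors, a textual representation that maps each image patch to its text label, together with Row-wise Run-Length Encoding (R-RLE), which compresses redundant text sequences, shortening semantic descriptors by 74% and speeding up inference by 3× without loss of performance. Building on this, box-wise semantic descriptors and structured mask tokens (semantic bricks) yield the Text4Seg++ model, which formulates segmentation as a next-brick prediction task and improves granularity, compactness, and generation efficiency.
Link: https://arxiv.org/abs/2509.06321
Authors: Mengcheng Lan,Chaofeng Chen,Jiaxing Xu,Zongrui Li,Yiping Ke,Xudong Jiang,Yingchen Yu,Yunqing Zhao,Song Bai
Affiliations: Nanyang Technological University; Wuhan University; ByteDance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Extended version of our conference paper arXiv:2410.09855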
Abstract:Multimodal Large Language Models (MLLMs) have shown exceptional capabilities in vision-language tasks. However, effectively integrating image segmentation into these models remains a significant challenge. In this work, we propose a novel text-as-mask paradigm that casts image segmentation as a text generation problem, eliminating the need for additional decoders and significantly simplifying the segmentation process. Our key innovation is semantic descriptors, a new textual representation of segmentation masks where each image patch is mapped to its corresponding text label. We first introduce image-wise semantic descriptors, a patch-aligned textual representation of segmentation masks that integrates naturally into the language modeling pipeline. To enhance efficiency, we introduce the Row-wise Run-Length Encoding (R-RLE), which compresses redundant text sequences, reducing the length of semantic descriptors by 74% and accelerating inference by 3×, without compromising performance. Building upon this, our initial framework Text4Seg achieves strong segmentation performance across a wide range of vision tasks. To further improve granularity and compactness, we propose box-wise semantic descriptors, which localize regions of interest using bounding boxes and represent region masks via structured mask tokens called semantic bricks. This leads to our refined model, Text4Seg++, which formulates segmentation as a next-brick prediction task, combining precision, scalability, and generative efficiency. Comprehensive experiments on natural and remote sensing datasets show that Text4Seg++ consistently outperforms state-of-the-art models across diverse benchmarks without any task-specific fine-tuning, while remaining compatible with existing MLLM backbones. Our work highlights the effectiveness, scalability, and generalizability of text-driven image segmentation within the MLLM framework.
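The idea of row-wise run-length encoding over a grid of patch labels can be illustrated with a minimal sketch. The `label*count` text format, the separators, and the function names `rrle_encode` and `rrle_decode` are assumptions made for illustration and are not the paper's actual descriptor syntax. The sketch also assumes that labels contain neither `", "` nor newlines.

```python
from itertools import groupby


def rrle_encode(label_grid):
    """Run-length encode each row of a patch-label grid independently."""
    lines = []
    for row in label_grid:
        runs = [f"{label}*{len(list(group))}" for label, group in groupby(row)]
        lines.append(", ".join(runs))
    return "\n".join(lines)


def rrle_decode(text):
    """Invert rrle_encode back into a grid of labels."""
    grid = []
    for line in text.split("\n"):
        row = []
        for run in line.split(", "):
            label, count = run.rsplit("*", 1)
            row.extend([label] * int(count))
        grid.append(row)
    return grid


# Hypothetical 3x4 grid of per-patch labels.
grid = [
    ["sky", "sky", "sky", "sky"],
    ["sky", "dog", "dog", "sky"],
    ["grass", "dog", "dog", "grass"],
]
encoded = rrle_encode(grid)
print(encoded)
```

Encoding each row separately keeps the row boundaries explicit in the text, so the 2D layout of the mask can be recovered without knowing the grid width in advance.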
[CV-72] Evaluating the Efficiency of Latent Spaces via the Coupling-Matrix
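Quick read: This paper addresses redundancy in latent spaces in representation learning: latent representations produced by deep networks often have multiple dimensions that encode overlapping information, which reduces effective capacity and hinders generalization. The key to the solution is a redundancy index, rho(C), that analyzes coupling matrices derived from latent representations and compares their off-diagonal statistics against a normal distribution via energy distance. This directly quantifies dependencies between dimensions and provides a compact, interpretable, and statistically grounded measure of representation quality.
Link: https://arxiv.org/abs/2509.06314
Authors: Mehmet Can Yavuz,Berrin Yanikoglu
Affiliations: Işık University; Sabancı University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: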
Abstract:A central challenge in representation learning is constructing latent embeddings that are both expressive and efficient. In practice, deep networks often produce redundant latent spaces where multiple coordinates encode overlapping information, reducing effective capacity and hindering generalization. Standard metrics such as accuracy or reconstruction loss provide only indirect evidence of such redundancy and cannot isolate it as a failure mode. We introduce a redundancy index, denoted rho(C), that directly quantifies inter-dimensional dependencies by analyzing coupling matrices derived from latent representations and comparing their off-diagonal statistics against a normal distribution via energy distance. The result is a compact, interpretable, and statistically grounded measure of representational quality. We validate rho(C) across discriminative and generative settings on MNIST variants, Fashion-MNIST, CIFAR-10, and CIFAR-100, spanning multiple architectures and hyperparameter optimization strategies. Empirically, low rho(C) reliably predicts high classification accuracy or low reconstruction error, while elevated redundancy is associated with performance collapse. Estimator reliability grows with latent dimension, yielding natural lower bounds for reliable analysis. We further show that Tree-structured Parzen Estimators (TPE) preferentially explore low-rho regions, suggesting that rho(C) can guide neural architecture search and serve as a redundancy-aware regularization target. By exposing redundancy as a universal bottleneck across models and tasks, rho(C) offers both a theoretical lens and a practical tool for evaluating and improving the efficiency of learned representations.
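A rough sketch of an index in this spirit is given below. It is not the authors' definition: the abstract does not specify how the coupling matrix or the reference normal distribution is constructed. The sketch assumes that the coupling matrix is the Pearson correlation matrix of the latent coordinates, and that the reference is N(0, 1/n), the approximate null distribution of a sample correlation between independent coordinates over n samples. All function names are hypothetical.

```python
import random
from itertools import combinations


def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def mean_abs_diff(xs, ys):
    return sum(abs(x - y) for x in xs for y in ys) / (len(xs) * len(ys))


def energy_distance(xs, ys):
    """Squared energy distance 2E|X-Y| - E|X-X'| - E|Y-Y'| between samples."""
    return 2.0 * mean_abs_diff(xs, ys) - mean_abs_diff(xs, xs) - mean_abs_diff(ys, ys)


def redundancy_index(latents, rng):
    """latents: list of n samples, each a list of d latent coordinates."""
    n = len(latents)
    columns = list(zip(*latents))
    # Off-diagonal entries of the assumed coupling (correlation) matrix.
    off_diag = [pearson(a, b) for a, b in combinations(columns, 2)]
    # Assumed reference: correlations expected under independence.
    reference = [rng.gauss(0.0, n ** -0.5) for _ in off_diag]
    return energy_distance(off_diag, reference)


rng = random.Random(0)
# Independent coordinates versus coordinates sharing one common factor.
independent = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(300)]
redundant = []
for _ in range(300):
    shared = rng.gauss(0, 1)
    redundant.append([shared + 0.3 * rng.gauss(0, 1) for _ in range(8)])

low = redundancy_index(independent, rng)
high = redundancy_index(redundant, rng)
print(f"independent: {low:.4f}  redundant: {high:.4f}")
```

Under these assumptions, latent codes whose coordinates share a common factor produce off-diagonal correlations far from the null distribution, so the index grows with redundancy.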
[CV-73] Video-based Generalized Category Discovery via Memory-Guided Consistency-Aware Contrastive Learning
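Quick read: This paper addresses the fact that conventional Generalized Category Discovery (GCD) methods work mainly on static images and struggle to discover novel categories reliably. It extends GCD to the video domain as a new task, Video-GCD, and emphasizes the use of temporal, multi-perspective information to improve novel category discovery. The key to the solution is the Memory-guided Consistency-aware Contrastive Learning (MCCL) framework, which has two core components: Consistency-Aware Contrastive Learning (CACL) and Memory-Guided Representation Enhancement (MGRE). CACL estimates consistency scores between unlabeled instances from multi-perspective temporal features and uses them to weight the contrastive loss. MGRE introduces a dual-level memory buffer (feature-level and logit-level) that provides global context to improve intra-class compactness and inter-class separability. This forms a feedback loop between representation learning and consistency modeling and significantly improves recognition of unknown categories in videos.
Link: https://arxiv.org/abs/2509.06306
Authors: Zhang Jing,Pu Nan,Xie Yu Xiang,Guo Yanming,Lu Qianqi,Zou Shiwei,Yan Jie,Chen Yan
Affiliations: National University of Defense Technology; University of Trento
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: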
Abstract:Generalized Category Discovery (GCD) is an emerging and challenging open-world problem that has garnered increasing attention in recent years. Most existing GCD methods focus on discovering categories in static images. However, relying solely on static visual content is often insufficient to reliably discover novel categories. To bridge this gap, we extend the GCD problem to the video domain and introduce a new setting, termed Video-GCD. Thus, effectively integrating multi-perspective information across time is crucial for accurate Video-GCD. To tackle this challenge, we propose a novel Memory-guided Consistency-aware Contrastive Learning (MCCL) framework, which explicitly captures temporal-spatial cues and incorporates them into contrastive learning through a consistency-guided voting mechanism. MCCL consists of two core components: Consistency-Aware Contrastive Learning (CACL) and Memory-Guided Representation Enhancement (MGRE). CACL exploits multi-perspective temporal features to estimate consistency scores between unlabeled instances, which are then used to weight the contrastive loss accordingly. MGRE introduces a dual-level memory buffer that maintains both feature-level and logit-level representations, providing global context to enhance intra-class compactness and inter-class separability. This in turn refines the consistency estimation in CACL, forming a mutually reinforcing feedback loop between representation learning and consistency modeling. To facilitate a comprehensive evaluation, we construct a new and challenging Video-GCD benchmark, which includes action recognition and bird classification video datasets. Extensive experiments demonstrate that our method significantly outperforms competitive GCD approaches adapted from image-based settings, highlighting the importance of temporal information for discovering novel categories in videos. The code will be publicly available.
[CV-74] Prototype-Aware Multimodal Alignment for Open-Vocabulary Visual Grounding
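Quick read: This paper addresses the performance drop of Visual Grounding (VG) in open-vocabulary scenes, where the core challenges are imperfect cross-modal alignment, insufficient cross-modal feature fusion, and ineffective use of semantic prototype information. The key to the solution is the Prototype-Aware Multimodal Learning (PAML) framework, which has three core components. ALBEF is used to establish robust cross-modal alignment during initial feature encoding. A Visual Discriminative Feature Encoder enhances salient object representations while suppressing irrelevant visual context. A novel prototype discovering and inheriting mechanism extracts and aggregates multi-neighbor semantic prototypes to support open-vocabulary recognition, significantly improving localization in scenes that contain unfamiliar categories.
Link: https://arxiv.org/abs/2509.06291
Authors: Jiangnan Xie,Xiaolong Zheng,Liang Zheng
Affiliations: Hangzhou Dianzi University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: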
Abstract:Visual Grounding (VG) aims to utilize given natural language queries to locate specific target objects within images. While current transformer-based approaches demonstrate strong localization performance in standard scenes (i.e., scenarios without any novel objects), they exhibit notable limitations in open-vocabulary scenes (i.e., scenarios with both familiar and novel object categories during testing). These limitations primarily stem from three key factors: (1) imperfect alignment between visual and linguistic modalities, (2) insufficient cross-modal feature fusion, and (3) ineffective utilization of semantic prototype information. To overcome these challenges, we present Prototype-Aware Multimodal Learning (PAML), an innovative framework that systematically addresses these issues through several key components: First, we leverage ALBEF to establish robust cross-modal alignment during initial feature encoding. Subsequently, our Visual Discriminative Feature Encoder selectively enhances salient object representations while suppressing irrelevant visual context. The framework then incorporates a novel prototype discovering and inheriting mechanism that extracts and aggregates multi-neighbor semantic prototypes to facilitate open-vocabulary recognition. These enriched features undergo comprehensive multimodal integration through our Multi-stage Decoder before final bounding box regression. Extensive experiments across five benchmark datasets validate our approach, showing competitive performance in standard scenes while achieving state-of-the-art results in open-vocabulary scenes. Our code is available at this https URL.
[CV-75] AI-driven Remote Facial Skin Hydration and TEWL Assessment from Selfie Images: A Systematic Solution
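Quick read: This paper addresses the difficulty of quantitatively assessing skin barrier function outside professional settings. It focuses on remotely estimating skin hydration (SH) and trans-epidermal water loss (TEWL) from selfie facial images taken with a smartphone, so that skin health monitoring becomes accessible and convenient. The key to the solution is a novel Skin-Prior Adaptive Vision Transformer model, combined with identifying the annotation imbalance in SH/TEWL data and handling it with a symmetric-based contrastive regularization that reduces model bias. The result is what the authors describe as the first AI-driven skin assessment system to regress SH and TEWL values from facial images without physical measurements.
Link: https://arxiv.org/abs/2509.06282
Authors: Cecelia Soh,Rizhao Cai,Monalisha Paul,Dennis Sng,Alex Kot
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Paper accepted by the journal of Machine Intelligence Research (JCR-Q1). To be in press soon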
Abstract:Skin health and disease resistance are closely linked to the skin barrier function, which protects against environmental factors and water loss. Two key physiological indicators can quantitatively represent this barrier function: skin hydration (SH) and trans-epidermal water loss (TEWL). Measurement of SH and TEWL is valuable for the public to monitor skin conditions regularly, diagnose dermatological issues, and personalize their skincare regimens. However, these measurements are not easily accessible to general users unless they visit a dermatology clinic with specialized instruments. To tackle this problem, we propose a systematic solution to estimate SH and TEWL from selfie facial images remotely with smartphones. Our solution encompasses multiple stages, including SH/TEWL data collection, data preprocessing, and formulating a novel Skin-Prior Adaptive Vision Transformer model for SH/TEWL regression. Through experiments, we identified the annotation imbalance of the SH/TEWL data and proposed a symmetric-based contrastive regularization to reduce the model bias due to the imbalance effectively. This work is the first study to explore skin assessment from selfie facial images without physical measurements. It bridges the gap between computer vision and skin care research, enabling AI-driven accessible skin analysis for broader real-world applications.
[CV-76] Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes
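Quick read: This paper addresses a significant weakness of current Vision-Language Models (VLMs) in understanding 3D spatial relationships, in particular their weak spatial reasoning on real-world, ego-centric, multi-view observations. Existing work relies mainly on spatial question-answering datasets built from single images or indoor videos, which do not reflect the perception setting of embodied agents such as robots and self-driving cars. To address this, the authors introduce Ego3D-Bench, a benchmark with over 8,600 QA pairs, created with human annotators to ensure quality and diversity, for evaluating VLM spatial understanding in ego-centric, multi-view outdoor scenes. The key to the solution is Ego3D-VLM, a post-training framework that generates a cognitive map from estimated global 3D coordinates, improving multi-choice QA by 12% on average and absolute distance estimation by 56% on average. The method is modular and can be integrated with any existing VLM.
Link: https://arxiv.org/abs/2509.06266
Authors: Mohsen Gholami,Ahmad Rezaei,Zhou Weimin,Yong Zhang,Mohammad Akbari
Affiliations: Huawei Technologies Canada; Huawei Cloud
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: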
Abstract:Understanding 3D spatial relationships remains a major limitation of current Vision-Language Models (VLMs). Prior work has addressed this issue by creating spatial question-answering (QA) datasets based on single images or indoor videos. However, real-world embodied AI agents such as robots and self-driving cars typically rely on ego-centric, multi-view observations. To this end, we introduce Ego3D-Bench, a new benchmark designed to evaluate the spatial reasoning abilities of VLMs using ego-centric, multi-view outdoor data. Ego3D-Bench comprises over 8,600 QA pairs, created with significant involvement from human annotators to ensure quality and diversity. We benchmark 16 SOTA VLMs, including GPT-4o, Gemini1.5-Pro, InternVL3, and Qwen2.5-VL. Our results reveal a notable performance gap between human-level scores and VLM performance, highlighting that current VLMs still fall short of human-level spatial understanding. To bridge this gap, we propose Ego3D-VLM, a post-training framework that enhances 3D spatial reasoning of VLMs. Ego3D-VLM generates a cognitive map based on estimated global 3D coordinates, resulting in a 12% average improvement on multi-choice QA and a 56% average improvement on absolute distance estimation. Ego3D-VLM is modular and can be integrated with any existing VLM. Together, Ego3D-Bench and Ego3D-VLM offer valuable tools for advancing toward human-level spatial understanding in real-world, multi-view environments.
[CV-77] Exploring Light-Weight Object Recognition for Real-Time Document Detection
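Quick read: This paper addresses the efficiency and performance bottleneck of real-time document detection and rectification for automatic information retrieval, in particular the difficulty existing methods have in balancing OCR (Optical Character Recognition) quality against computational cost in real-world settings. The key to the solution is to adapt the lightweight IWPOD-Net architecture and train it on NBID, a synthetic ID card dataset, with data augmentation and cross-dataset validation on MIDV, to obtain efficient and robust document localization. The authors also find that rectification does not need to be perfect to reach state-of-the-art OCR scores, which suggests optimizing for overall OCR quality rather than exact geometric rectification. The resulting model is smaller and faster than current state-of-the-art solutions while keeping a competitive OCR quality metric.
Link: https://arxiv.org/abs/2509.06246
Authors: Lucas Wojcik,Luiz Coelho,Roger Granada,David Menotti
Affiliations: Federal University of Paraná; unico - idTech
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: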
Abstract:Object Recognition and Document Skew Estimation have come a long way in terms of performance and efficiency. New models follow one of two directions: improving performance using larger models, and improving efficiency using smaller models. However, real-time document detection and rectification is a niche that is largely unexplored by the literature, yet it remains a vital step for automatic information retrieval from visual documents. In this work, we strive towards an efficient document detection pipeline that is satisfactory in terms of Optical Character Recognition (OCR) retrieval and faster than other available solutions. We adapt IWPOD-Net, a license plate detection network, and train it for detection on NBID, a synthetic ID card dataset. We experiment with data augmentation and cross-dataset validation with MIDV (another synthetic ID and passport document dataset) to find the optimal scenario for the model. Other methods from both the Object Recognition and Skew Estimation state-of-the-art are evaluated for comparison with our approach. We use each method to detect and rectify the document, which is then read by an OCR system. The OCR output is then evaluated using a novel OCR quality metric based on the Levenshtein distance. Since the end goal is to improve automatic information retrieval, we use the overall OCR quality as a performance metric. We observe that with a promising model, document rectification does not have to be perfect to attain state-of-the-art performance scores. We show that our model is smaller and more efficient than current state-of-the-art solutions while retaining a competitive OCR quality metric. All code is available at this https URL
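An OCR quality score based on the Levenshtein distance can be sketched as follows. The abstract does not give the authors' exact formula, so the normalization used here (one minus edit distance divided by the longer string's length) is an assumption, and `ocr_quality` is a hypothetical name.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def ocr_quality(predicted, reference):
    """Assumed score in [0, 1]: 1.0 means the OCR text matches exactly."""
    longest = max(len(predicted), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(predicted, reference) / longest


# Hypothetical OCR output with one misread character (0 instead of O).
score = ocr_quality("JOHN D0E", "JOHN DOE")
print(score)
```

A score of this kind measures the end result of detection and rectification together, which matches the abstract's point that rectification need not be geometrically perfect as long as the text is read correctly.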
[CV-78] O3Afford: One-Shot 3D Object-to-Object Affordance Grounding for Generalizable Robotic Manipulation
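Quick read: This paper addresses object-to-object affordance grounding for robotic manipulation, that is, how to model and generalize the interaction between two objects accurately from limited data. Prior methods mostly predict single-object affordance and overlook the fact that most real-world manipulation involves pairs of objects. The key to the solution is a novel one-shot 3D object-to-object affordance learning approach (O^3Afford) that combines semantic features from vision foundation models with point cloud representations for geometric understanding, allowing it to generalize to novel objects and categories from very few examples. The 3D affordance representation is further integrated with large language models (LLMs), significantly improving their ability to understand and reason about object interactions when generating task-specific constraint functions.
Link: https://arxiv.org/abs/2509.06233
Authors: Tongxuan Tian,Xuhui Kang,Yen-Ling Kuo
Affiliations: University of Virginia
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Conference on Robot Learning (CoRL) 2025. Project website: this https URL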
Abstract:Grounding object affordance is fundamental to robotic manipulation as it establishes the critical link between perception and action among interacting objects. However, prior works predominantly focus on predicting single-object affordance, overlooking the fact that most real-world interactions involve relationships between pairs of objects. In this work, we address the challenge of object-to-object affordance grounding under limited data constraints. Inspired by recent advances in few-shot learning with 2D vision foundation models, we propose a novel one-shot 3D object-to-object affordance learning approach for robotic manipulation. Semantic features from vision foundation models combined with point cloud representation for geometric understanding enable our one-shot learning pipeline to generalize effectively to novel objects and categories. We further integrate our 3D affordance representation with large language models (LLMs) for robotic manipulation, significantly enhancing LLMs’ capability to comprehend and reason about object interactions when generating task-specific constraint functions. Our experiments on 3D object-to-object affordance grounding and robotic manipulation demonstrate that our O^3Afford significantly outperforms existing baselines in terms of both accuracy and generalization capability.
[CV-79] AI-Based Applied Innovation for Fracture Detection in X-rays Using Custom CNN and Transfer Learning Models
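Quick read: This paper addresses the difficulty of diagnosing bone fractures in low-resource settings that lack expert radiology services, where conventional imaging suffers from high cost, radiation exposure, and dependence on expert interpretation. The key to the solution is an automated fracture detection model based on a custom Convolutional Neural Network (CNN), which achieves 95.96% accuracy, 0.94 precision, 0.88 recall, and an F1-score of 0.91 on the public FracAtlas dataset. The results indicate the promise of lightweight CNNs for fracture detection in X-rays and underscore the importance of fair benchmarking, diverse datasets, and external validation for clinical translation.
Link: https://arxiv.org/abs/2509.06228
Authors: Amna Hassan,Ilsa Afzaal,Nouman Muneeb,Aneeqa Batool,Hamail Noor
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL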
Abstract:Bone fractures present a major global health challenge, often resulting in pain, reduced mobility, and productivity loss, particularly in low-resource settings where access to expert radiology services is limited. Conventional imaging methods suffer from high costs, radiation exposure, and dependency on specialized interpretation. To address this, we developed an AI-based solution for automated fracture detection from X-ray images using a custom Convolutional Neural Network (CNN) and benchmarked it against transfer learning models including EfficientNetB0, MobileNetV2, and ResNet50. Training was conducted on the publicly available FracAtlas dataset, comprising 4,083 anonymized musculoskeletal radiographs. The custom CNN achieved 95.96% accuracy, 0.94 precision, 0.88 recall, and an F1-score of 0.91 on the FracAtlas dataset. Although transfer learning models (EfficientNetB0, MobileNetV2, ResNet50) performed poorly in this specific setup, these results should be interpreted in light of class imbalance and dataset limitations. This work highlights the promise of lightweight CNNs for detecting fractures in X-rays and underscores the importance of fair benchmarking, diverse datasets, and external validation for clinical translation.
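As a quick consistency check on the reported metrics, the F1-score is the harmonic mean of precision and recall. The short sketch below recomputes it from the precision and recall stated in the abstract; the function name `f1_score` is just an illustrative helper, not code from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract for the custom CNN.
f1 = f1_score(0.94, 0.88)
print(round(f1, 2))
```

The recomputed value rounds to the reported 0.91. The gap between the 95.96% accuracy and the 0.88 recall is consistent with the class imbalance that the abstract itself mentions.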
[CV-80] Learning in ImaginationLand: Omnidirectional Policies through 3D Generative Models (OP-Gen) WWW
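Quick read: This paper addresses the reliance of robot policy learning on many demonstrations: conventional methods usually need a large number of real-world demonstrations to learn a robust policy. It proposes using 3D generative models to synthesize diverse data from a single real demonstration, significantly reducing the number of demonstrations required. The approach has two key steps. First, a 3D generative model is used to generate object shapes and pose variations in a virtual scene from the single real-world demonstration, building an "imagined" dataset. Then an omnidirectional policy is trained on this augmented dataset so that it can handle initial states far from the original demonstration (for example, starting from the opposite side of the object). Experiments on grasping, drawer opening, and placing trash into a bin show better performance than existing baselines.
Link: https://arxiv.org/abs/2509.06191
Authors: Yifei Ren,Edward Johns
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project webpage with robot videos: this https URL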
Abstract:Recent 3D generative models, which are capable of generating full object shapes from just a few images, now open up new opportunities in robotics. In this work, we show that 3D generative models can be used to augment a dataset from a single real-world demonstration, after which an omnidirectional policy can be learned within this imagined dataset. We found that this enables a robot to perform a task when initialised from states very far from those observed during the demonstration, including starting from the opposite side of the object relative to the real-world demonstration, significantly reducing the number of demonstrations required for policy learning. Through several real-world experiments across tasks such as grasping objects, opening a drawer, and placing trash into a bin, we study these omnidirectional policies by investigating the effect of various design choices on policy behaviour, and we show superior performance to recent baselines which use alternative methods for data augmentation.
[CV-81] UNO: Unifying One-stage Video Scene Graph Generation via Object-Centric Visual Representation Learning
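Quick read: This paper addresses the long-standing task fragmentation in Video Scene Graph Generation (VidSGG). Existing methods usually model either the coarse-grained box level or the fine-grained panoptic pixel level, which requires task-specific architectures and multi-stage training and makes unified modeling and efficient deployment across granularities difficult. The key to the solution is UNO (UNified Object-centric VidSGG), a single-stage, unified, object-centric framework. It uses an extended slot attention mechanism to decompose visual features into object slots and relation slots, introduces tracking-free object temporal consistency learning to stabilize representations across frames, and adds a dynamic triplet prediction module that links relation slots to object pairs, enabling joint modeling and efficient inference across granularities in an end-to-end architecture.
Link: https://arxiv.org/abs/2509.06165
Authors: Huy Le,Nhat Chung,Tung Kieu,Jingkang Yang,Ngan Le
Affiliations: FPT Software AI Center; Aalborg University; Pioneer Centre for AI; S-Lab, Nanyang Technological University; AICV Lab, University of Arkansas
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 7 figures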
Abstract:Video Scene Graph Generation (VidSGG) aims to represent dynamic visual content by detecting objects and modeling their temporal interactions as structured graphs. Prior studies typically target either coarse-grained box-level or fine-grained panoptic pixel-level VidSGG, often requiring task-specific architectures and multi-stage training pipelines. In this paper, we present UNO (UNified Object-centric VidSGG), a single-stage, unified framework that jointly addresses both tasks within an end-to-end architecture. UNO is designed to minimize task-specific modifications and maximize parameter sharing, enabling generalization across different levels of visual granularity. The core of UNO is an extended slot attention mechanism that decomposes visual features into object and relation slots. To ensure robust temporal modeling, we introduce object temporal consistency learning, which enforces consistent object representations across frames without relying on explicit tracking modules. Additionally, a dynamic triplet prediction module links relation slots to corresponding object pairs, capturing evolving interactions over time. We evaluate UNO on standard box-level and pixel-level VidSGG benchmarks. Results demonstrate that UNO not only achieves competitive performance across both tasks but also offers improved efficiency through a unified, object-centric design.
[CV-82] UniVerse-1: Unified Audio-Video Generation via Stitching of Experts
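[Quick read]: This paper addresses the difficulty of precisely coordinating audio and video in multimodal generation, in particular insufficient temporal alignment when generating ambient sounds and semantic inconsistency between speech and video. The key to the solution is a unified architecture based on stitching of experts (SoE), which deeply fuses the corresponding blocks of pre-trained video generation and music generation expert models to reuse their existing capabilities. In addition, an online annotation pipeline generates precisely time-aligned labels during training, avoiding the temporal misalignment caused by conventional text-based annotations and markedly improving audio-video coordination.
Link: https://arxiv.org/abs/2509.06155
Authors: Duomin Wang,Wei Zuo,Aojie Li,Ling-Hao Chen,Xinyao Liao,Deyu Zhou,Zixin Yin,Xili Dai,Daxin Jiang,Gang Yu
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL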
Abstract:We introduce UniVerse-1, a unified, Veo-3-like model capable of simultaneously generating coordinated audio and video. To enhance training efficiency, we bypass training from scratch and instead employ a stitching of experts (SoE) technique. This approach deeply fuses the corresponding blocks of pre-trained video and music generation expert models, thereby fully leveraging their foundational capabilities. To ensure accurate annotations and temporal alignment for both ambient sounds and speech with video content, we developed an online annotation pipeline that processes the required training data and generates labels during the training process. This strategy circumvents the performance degradation often caused by misaligned text-based annotations. Through the synergy of these techniques, our model, after being finetuned on approximately 7,600 hours of audio-video data, produces results with well-coordinated audio-visuals for ambient sound generation and strong alignment for speech generation. To systematically evaluate our proposed method, we introduce Verse-Bench, a new benchmark dataset. In an effort to advance research in audio-video generation and to close the performance gap with state-of-the-art models such as Veo3, we make our model and code publicly available. We hope this contribution will benefit the broader research community. Project page: this https URL.
[CV-83] RetinaGuard: Obfuscating Retinal Age in Fundus Images for Biometric Privacy Preserving
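[Quick read]: This paper addresses the risk that implicit biometric information in medical images, especially fundus images, such as retinal age, can be misused and leak private data. Retinal age, a biomarker extracted from fundus images, can accurately assess systemic disease risk and aging trajectory, but unauthorized access to its predictions poses a serious privacy risk. To counter this, the authors propose the RetinaGuard framework. Its core innovation is a feature-level generative adversarial masking mechanism that obscures retinal age information without noticeably degrading visual quality or disease diagnostic utility. It also introduces a multiple-to-one knowledge distillation strategy that combines a retinal foundation model with diverse surrogate age encoders to provide a universal defense against black-box age prediction models. The approach balances privacy protection with clinical utility and can be extended to other biomarkers derived from medical images.
Link: https://arxiv.org/abs/2509.06142
Authors: Zhengquan Luo(1),Chi Liu(1),Dongfu Xiao(1),Zhen Yu(2),Yueye Wang(3),Tianqing Zhu(1) ((1) City University of Macau, (2) Monash University, (3) Hong Kong Polytechnic University)
Affiliations: City University of Macau; Monash University; Hong Kong Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: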
Abstract:The integration of AI with medical images enables the extraction of implicit image-derived biomarkers for a precise health assessment. Retinal age, a biomarker predicted from fundus images, has recently been shown to predict systemic disease risks, behavioral patterns, aging trajectory and even mortality. However, the capability to infer such sensitive biometric data raises significant privacy risks, where unauthorized use of fundus images could lead to bioinformation leakage, breaching individual privacy. In response, we formulate a new research problem of biometric privacy associated with medical images and propose RetinaGuard, a novel privacy-enhancing framework that employs a feature-level generative adversarial masking mechanism to obscure retinal age while preserving image visual quality and disease diagnostic utility. The framework further utilizes a novel multiple-to-one knowledge distillation strategy incorporating a retinal foundation model and diverse surrogate age encoders to enable a universal defense against black-box age prediction models. Comprehensive evaluations confirm that RetinaGuard successfully obfuscates retinal age prediction with minimal impact on image quality and pathological feature representation. RetinaGuard is also flexible for extension to other medical image derived biomarkers.
[CV-84] SpecSwin3D: Generating Hyperspectral Imagery from Multispectral Data via Transformer Networks
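[Quick read]: This paper addresses the difficulty of preserving both spatial detail and spectral fidelity when generating hyperspectral imagery from multispectral imagery. Existing methods such as pan-sharpening variants, matrix factorization, and convolutional neural networks (CNNs) are limited in jointly optimizing spatial resolution and spectral consistency. The key to the solution is SpecSwin3D, a transformer-based model that takes five multispectral bands as input and reconstructs 224 hyperspectral bands at the same spatial resolution. It introduces a cascade training strategy that progressively expands the spectral range, which stabilizes learning and improves reconstruction accuracy for spectrally distant bands. It also designs an optimized band sequence that models pairwise relations between bands more effectively within a 3D shifted-window transformer framework, improving reconstruction quality and downstream task performance.
Link: https://arxiv.org/abs/2509.06122
Authors: Tang Sui,Songxi Yang,Qunying Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: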
Abstract:Multispectral and hyperspectral imagery are widely used in agriculture, environmental monitoring, and urban planning due to their complementary spatial and spectral characteristics. A fundamental trade-off persists: multispectral imagery offers high spatial but limited spectral resolution, while hyperspectral imagery provides rich spectra at lower spatial resolution. Prior hyperspectral generation approaches (e.g., pan-sharpening variants, matrix factorization, CNNs) often struggle to jointly preserve spatial detail and spectral fidelity. In response, we propose SpecSwin3D, a transformer-based model that generates hyperspectral imagery from multispectral inputs while preserving both spatial and spectral quality. Specifically, SpecSwin3D takes five multispectral bands as input and reconstructs 224 hyperspectral bands at the same spatial resolution. In addition, we observe that reconstruction errors grow for hyperspectral bands spectrally distant from the input bands. To address this, we introduce a cascade training strategy that progressively expands the spectral range to stabilize learning and improve fidelity. Moreover, we design an optimized band sequence that strategically repeats and orders the five selected multispectral bands to better capture pairwise relations within a 3D shifted-window transformer framework. Quantitatively, our model achieves a PSNR of 35.82 dB, SAM of 2.40°, and SSIM of 0.96, outperforming the baseline MHF-Net by +5.6 dB in PSNR and reducing ERGAS by more than half. Beyond reconstruction, we further demonstrate the practical value of SpecSwin3D on two downstream tasks, including land use classification and burnt area segmentation.
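For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of how PSNR (in dB) and the spectral angle used by SAM (in degrees) are commonly computed. It is not the authors' evaluation code: the function names, the flat-list inputs, and the `max_value` default are assumptions made for illustration.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two equal-length flat pixel lists.

    max_value is the assumed dynamic range of the data (1.0 for normalized images).
    """
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


def spectral_angle_deg(ref_spectrum, est_spectrum):
    """Angle in degrees between two per-pixel spectra (smaller means closer)."""
    dot = sum(r * e for r, e in zip(ref_spectrum, est_spectrum))
    norm_r = math.sqrt(sum(r * r for r in ref_spectrum))
    norm_e = math.sqrt(sum(e * e for e in est_spectrum))
    # Clamp to [-1, 1] to guard against floating-point overshoot before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_r * norm_e)))
    return math.degrees(math.acos(cosine))
```

A dataset-level SAM score is usually the mean of this angle over all pixels; a spectrum that is only rescaled (for example doubled in brightness) has an angle near zero, which is why SAM complements PSNR.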
[CV-85] CARDIE: clustering algorithm on relevant descriptors for image enhancement
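[Quick read]: This paper addresses the limited use of automatic image clustering in image enhancement, where the core challenge is defining clusters that are meaningful for enhancement. The key to the solution is CARDIE, an unsupervised algorithm that clusters images by their color and luminosity content, together with a method for quantifying the impact of image enhancement algorithms on luminance distribution and local variance. Experiments show that CARDIE clusters are better suited to image enhancement than clusters based on semantic image attributes, and that they can be used to resample image enhancement datasets, improving the performance of tone mapping and denoising algorithms.
Link: https://arxiv.org/abs/2509.06116
Authors: Giulia Bonino,Luca Alberto Rizzo
Affiliations: Eurecom; Huawei Nice Research Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: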
Abstract:Automatic image clustering is a cornerstone of computer vision, yet its application to image enhancement remains limited, primarily due to the difficulty of defining clusters that are meaningful for this specific task. To address this issue, we introduce CARDIE, an unsupervised algorithm that clusters images based on their color and luminosity content. In addition, we introduce a method to quantify the impact of image enhancement algorithms on luminance distribution and local variance. Using this method, we demonstrate that CARDIE produces clusters more relevant to image enhancement than those derived from semantic image attributes. Furthermore, we demonstrate that CARDIE clusters can be leveraged to resample image enhancement datasets, leading to improved performance for tone mapping and denoising algorithms. To encourage adoption and ensure reproducibility, we publicly release CARDIE code on our GitHub.
[CV-86] PathoHR: Hierarchical Reasoning for Vision-Language Models in Pathology EMNLP2025
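[Quick read]: This paper addresses two challenges that current vision-language (VL) models face in pathology image analysis: accurately distinguishing tissue images that are structurally very similar and differ only in subtle morphology, and modeling the hierarchical semantic understanding and compositional reasoning needed for complex pathology reports. The authors propose PathoHR-Bench, a benchmark for systematically evaluating the cross-modal understanding of VL models in pathology. The key to the solution is a pathology-specific VL training scheme that generates enhanced and perturbed samples for multimodal contrastive learning, strengthening the model's representation of fine-grained pathological features. Experiments show state-of-the-art performance on PathoHR-Bench and six additional pathology datasets, validating its effectiveness for clinical-grade pathology analysis.
Link: https://arxiv.org/abs/2509.06105
Authors: Yating Huang,Ziyan Huang,Lintao Xiang,Qijun Yang,Hujun Yin
Affiliations: University of Manchester; South China University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by EMNLP 2025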
Abstract:Accurate analysis of pathological images is essential for automated tumor diagnosis but remains challenging due to high structural similarity and subtle morphological variations in tissue images. Current vision-language (VL) models often struggle to capture the complex reasoning required for interpreting structured pathological reports. To address these limitations, we propose PathoHR-Bench, a novel benchmark designed to evaluate VL models’ abilities in hierarchical semantic understanding and compositional reasoning within the pathology domain. Results of this benchmark reveal that existing VL models fail to effectively model intricate cross-modal relationships, hence limiting their applicability in clinical settings. To overcome this, we further introduce a pathology-specific VL training scheme that generates enhanced and perturbed samples for multimodal contrastive learning. Experimental evaluations demonstrate that our approach achieves state-of-the-art performance on PathoHR-Bench and six additional pathology datasets, highlighting its effectiveness in fine-grained pathology representation.
[CV-87] MedSeqFT: Sequential Fine-tuning Foundation Models for 3D Medical Image Segmentation
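[Quick read]: This paper addresses two key problems that foundation models face when adapting sequentially to new medical image segmentation tasks: parallel fine-tuning cannot exploit knowledge shared across tasks, and multi-task fine-tuning requires simultaneous access to all datasets and struggles with incremental task integration. The core of the solution is MedSeqFT, a sequential fine-tuning framework with two key components: (1) Maximum Data Similarity (MDS) selection, which picks the downstream samples most representative of the pre-training distribution to preserve general knowledge; and (2) Knowledge and Generalization Retention Fine-Tuning (KG RFT), a LoRA-based knowledge distillation scheme that balances task-specific adaptation against retention of pre-trained knowledge. The method substantially improves performance on multiple 3D segmentation tasks as well as cross-task transferability.
Link: https://arxiv.org/abs/2509.06096
Authors: Yiwen Ye,Yicheng Wu,Xiangde Luo,He Zhang,Ziyang Chen,Ting Dang,Yanning Zhang,Yong Xia
Affiliations: Northwestern Polytechnical University; Monash University; Sichuan Cancer Hospital; RMIT; University of Melbourne
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures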
Abstract:Foundation models have become a promising paradigm for advancing medical image analysis, particularly for segmentation tasks where downstream applications often emerge sequentially. Existing fine-tuning strategies, however, remain limited: parallel fine-tuning isolates tasks and fails to exploit shared knowledge, while multi-task fine-tuning requires simultaneous access to all datasets and struggles with incremental task integration. To address these challenges, we propose MedSeqFT, a sequential fine-tuning framework that progressively adapts pre-trained models to new tasks while refining their representational capacity. MedSeqFT introduces two core components: (1) Maximum Data Similarity (MDS) selection, which identifies downstream samples most representative of the original pre-training distribution to preserve general knowledge, and (2) Knowledge and Generalization Retention Fine-Tuning (KG RFT), a LoRA-based knowledge distillation scheme that balances task-specific adaptation with the retention of pre-trained knowledge. Extensive experiments on two multi-task datasets covering ten 3D segmentation tasks demonstrate that MedSeqFT consistently outperforms state-of-the-art fine-tuning strategies, yielding substantial performance gains (e.g., an average Dice improvement of 3.0%). Furthermore, evaluations on two unseen tasks (COVID-19-20 and Kidney) verify that MedSeqFT enhances transferability, particularly for tumor segmentation. Visual analyses of loss landscapes and parameter variations further highlight the robustness of MedSeqFT. These results establish sequential fine-tuning as an effective, knowledge-retentive paradigm for adapting foundation models to evolving clinical tasks. Code will be released.
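The Dice score quoted above measures the overlap between a predicted and a reference segmentation mask. The sketch below shows the standard definition on flat binary masks; the function name and the empty-mask convention are assumptions for illustration, not details taken from the paper.

```python
def dice_coefficient(pred, target):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Both masks are empty; scoring this as 1.0 is a common convention,
        # assumed here rather than taken from the paper.
        return 1.0
    return 2.0 * intersection / total
```

For a 3D volume the same formula applies after flattening the voxels, and per-task scores are typically averaged to obtain a figure such as the average Dice improvement reported in the abstract.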
[CV-88] High-Quality Tomographic Image Reconstruction Integrating Neural Networks and Mathematical Optimization
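[Quick read]: This paper addresses poor reconstruction quality and blurred interfaces in projection-based nano- and microtomography for specimens composed of homogeneous material phases connected by sharp edges. The key to the solution is to train a neural network to identify edges within subpictures and to embed its predictions in a mathematical optimization model as prior information that suppresses artifacts. The optimization favors the edge structure learned by the network while still admitting alternative solutions that are strongly supported by the raw data, which improves interface sharpness and material homogeneity and clearly outperforms benchmark algorithms.
Link: https://arxiv.org/abs/2509.06082
Authors: Anuraag Mishra,Andrea Gilch,Benjamin Apeleo Zubiri,Jan Rolfes,Frauke Liers
Affiliations: Friedrich-Alexander-Universität Erlangen-Nürnberg; Linköping University
Categories: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci)
Comments: 36 pages, 17 figures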
Abstract:In this work, we develop a novel technique for reconstructing images from projection-based nano- and microtomography. Our contribution focuses on enhancing reconstruction quality, particularly for specimens composed of homogeneous material phases connected by sharp edges. This is accomplished by training a neural network to identify edges within subpictures. The trained network is then integrated into a mathematical optimization model to reduce artifacts from previous reconstructions. To this end, the optimization approach favors solutions according to the learned predictions, but may also determine alternative solutions if these are strongly supported by the raw data. Hence, our technique successfully incorporates knowledge about the homogeneity and presence of sharp edges in the sample and thereby eliminates blurriness. Our results on experimental datasets show significant enhancements in interface sharpness and material homogeneity compared to benchmark algorithms. Thus, our technique produces high-quality reconstructions, showcasing its potential for advancing tomographic imaging techniques.
[CV-89] Home-made Diffusion Model from Scratch to Hatch
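[Quick read]: This paper addresses the heavy dependence of text-to-image diffusion model training and inference on high-end compute, which makes deployment on consumer-grade hardware difficult. The key to the solution is the Home-made Diffusion Model (HDM), a new architecture and training recipe. First, it designs the Cross-U-Transformer (XUT), which uses cross-attention to fuse features across skip connections and markedly improves the compositional consistency of generated images. Second, it adopts TREAD acceleration, a new shifted square crop strategy that supports training at arbitrary aspect ratios, and progressive resolution scaling, which together greatly reduce training cost. Finally, experiments show that a small model (343M parameters) with a carefully designed architecture can still achieve high-quality generation and emergent capabilities such as intuitive camera control, offering a viable path toward democratizing high-quality text-to-image generation.
Link: https://arxiv.org/abs/2509.06068
Authors: Shih-Ying Yeh
Affiliations: National Tsing Hua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: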
Abstract:We introduce Home-made Diffusion Model (HDM), an efficient yet powerful text-to-image diffusion model optimized for training (and inferring) on consumer-grade hardware. HDM achieves competitive 1024x1024 generation quality while maintaining a remarkably low training cost of $535-620 using four RTX5090 GPUs, representing a significant reduction in computational requirements compared to traditional approaches. Our key contributions include: (1) Cross-U-Transformer (XUT), a novel U-shape transformer that employs cross-attention for skip connections, providing superior feature integration that leads to remarkable compositional consistency; (2) a comprehensive training recipe that incorporates TREAD acceleration, a novel shifted square crop strategy for efficient arbitrary aspect-ratio training, and progressive resolution scaling; and (3) an empirical demonstration that smaller models (343M parameters) with carefully crafted architectures can achieve high-quality results and emergent capabilities, such as intuitive camera control. Our work provides an alternative paradigm of scaling, demonstrating a viable path toward democratizing high-quality text-to-image generation for individual researchers and smaller organizations with limited computational resources.
[CV-90] Multi-Stage Graph Neural Networks for Data-Driven Prediction of Natural Convection in Enclosed Cavities
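[Quick read]: This paper addresses the problem that high-fidelity computational fluid dynamics (CFD) simulation of natural convection heat transfer in closed cavities depends on expert-built physics models, fine meshes, and heavy computation, which limits rapid iteration. The key to the solution is a new multi-stage graph neural network (GNN) architecture that uses hierarchical pooling and unpooling operations to progressively model global-to-local interactions across multiple spatial scales. This captures long-range dependencies in high-resolution graph structures and improves predictive accuracy and training efficiency while reducing long-term error accumulation.
Link: https://arxiv.org/abs/2509.06041
Authors: Mohammad Ahangarkiasari,Hassan Pouraria
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: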
Abstract:Buoyancy-driven heat transfer in closed cavities serves as a canonical testbed for thermal design. High-fidelity CFD modelling yields accurate thermal field solutions, yet its reliance on expert-crafted physics models, fine meshes, and intensive computation limits rapid iteration. Recent developments in data-driven modeling, especially Graph Neural Networks (GNNs), offer new alternatives for learning thermal-fluid behavior directly from simulation data, particularly on irregular mesh structures. However, conventional GNNs often struggle to capture long-range dependencies in high-resolution graph structures. To overcome this limitation, we propose a novel multi-stage GNN architecture that leverages hierarchical pooling and unpooling operations to progressively model global-to-local interactions across multiple spatial scales. We evaluate the proposed model on our newly developed CFD dataset simulating natural convection within rectangular cavities with varying aspect ratios, where the bottom wall is isothermal hot, the top wall is isothermal cold, and the two vertical walls are adiabatic. Experimental results demonstrate that the proposed model achieves higher predictive accuracy, improved training efficiency, and reduced long-term error accumulation compared to state-of-the-art (SOTA) GNN baselines. These findings underscore the potential of the proposed multi-stage GNN approach for modeling complex heat transfer in mesh-based fluid dynamics simulations.
[CV-91] BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models
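[Quick read]: This paper addresses the high computational cost, training instability, and limited exploration diversity that image and video generative models face when aligned to human preferences with GRPO (Group Relative Policy Optimization). The core challenges are: (1) on-policy rollouts and excessive stochastic differential equation (SDE) sampling steps incur substantial compute; (2) sparse reward signals make training unstable; and (3) there is no mechanism for sharing computation across paths or pruning redundancy to improve efficiency and performance. The key to the solution is BranchGRPO, which introduces a branch sampling policy that shares common prefixes during SDE sampling and prunes low-reward paths and redundant depths, greatly lowering the per-update compute cost. It is combined with a tree-based advantage estimator that incorporates dense process-level rewards, further improving exploration diversity and accelerating convergence. Experiments on image and video preference alignment show a 16% improvement in alignment scores over strong baselines and a 50% reduction in training time.
Link: https://arxiv.org/abs/2509.06040
Authors: Yuming Li,Yikai Wang,Yuying Zhu,Zhongyu Zhao,Ming Lu,Qi She,Shanghang Zhang
Affiliations: Peking University; Beijing Normal University; ByteDance
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 6 figures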
Abstract:Recent advancements in aligning image and video generative models via GRPO have achieved remarkable gains in enhancing human preference alignment. However, these methods still face high computational costs from on-policy rollouts and excessive SDE sampling steps, as well as training instability due to sparse rewards. In this paper, we propose BranchGRPO, a novel method that introduces a branch sampling policy updating the SDE sampling process. By sharing computation across common prefixes and pruning low-reward paths and redundant depths, BranchGRPO substantially lowers the per-update compute cost while maintaining or improving exploration diversity. This work makes three main contributions: (1) a branch sampling scheme that reduces rollout and training cost; (2) a tree-based advantage estimator incorporating dense process-level rewards; and (3) pruning strategies exploiting path and depth redundancy to accelerate convergence and boost performance. Experiments on image and video preference alignment show that BranchGRPO improves alignment scores by 16% over strong baselines, while cutting training time by 50%.
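As background for the advantage estimation mentioned above, here is a minimal sketch of the group-relative advantage that standard GRPO uses: each rollout's reward is normalized by the mean and standard deviation of its sampling group. This is the flat baseline only; the tree-based estimator and pruning described in the abstract are not reproduced, and the function name and `eps` parameter are illustrative assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one rollout group to zero mean and unit scale."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    # eps keeps the division finite when every rollout gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]
```

With only one final reward per rollout, every denoising step in that rollout shares the same advantage, which is the sparse-reward issue the abstract refers to.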
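[CV-92] TinyDef-DETR: An Enhanced DETR Detector for UAV Power Line Defect Detection
[Quick read]: This paper addresses the difficulty of detecting small defect targets against complex backgrounds in unmanned aerial vehicle (UAV) transmission line inspection. Conventional detectors lose detail through strided downsampling, lightweight backbones have weak boundary sensitivity, and global context is insufficiently integrated with local cues. The key to the solution is TinyDef-DETR, a DETR-based framework whose core innovations are: a stride-free space-to-depth module for lossless downsampling; an edge-enhanced convolution for boundary-aware feature extraction; a cross-stage dual-domain multi-scale attention module that jointly captures global and local information; and a Focaler-Wise-SIoU regression loss that improves localization of small objects. Together these designs improve the accuracy and robustness of small-defect detection while keeping computational overhead low.
Link: https://arxiv.org/abs/2509.06035
Authors: Jiaming Cui
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: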
Abstract:Automated inspection of transmission lines using UAVs is hindered by the difficulty of detecting small and ambiguous defects against complex backgrounds. Conventional detectors often suffer from detail loss due to strided downsampling, weak boundary sensitivity in lightweight backbones, and insufficient integration of global context with local cues. To address these challenges, we propose TinyDef-DETR, a DETR-based framework designed for small-defect detection. The method introduces a stride-free space-to-depth module for lossless downsampling, an edge-enhanced convolution for boundary-aware feature extraction, a cross-stage dual-domain multi-scale attention module to jointly capture global and local information, and a Focaler-Wise-SIoU regression loss to improve localization of small objects. Experiments conducted on the CSG-ADCD dataset demonstrate that TinyDef-DETR achieves substantial improvements in both precision and recall compared to competitive baselines, with particularly notable gains on small-object subsets, while incurring only modest computational overhead. Further validation on the VisDrone benchmark confirms the generalization capability of the proposed approach. Overall, the results indicate that integrating detail-preserving downsampling, edge-sensitive representations, dual-domain attention, and difficulty-adaptive regression provides a practical and efficient solution for UAV-based small-defect inspection in power grids.
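To illustrate why space-to-depth downsampling is called lossless, the sketch below rearranges a single-channel grid into `block * block` smaller channels without discarding any value. It shows only the generic rearrangement; how TinyDef-DETR combines it with convolutions is not described in the abstract, and the function name and nested-list representation are assumptions.

```python
def space_to_depth(image, block=2):
    """Rearrange an H x W grid into block*block channels of (H/block) x (W/block).

    Every input value appears exactly once in the output, unlike strided
    convolution or pooling, which skip or merge positions.
    """
    height, width = len(image), len(image[0])
    if height % block or width % block:
        raise ValueError("height and width must be divisible by block")
    channels = []
    for dy in range(block):
        for dx in range(block):
            # Channel (dy, dx) collects the pixel at that offset in each block.
            channels.append([
                [image[y][x] for x in range(dx, width, block)]
                for y in range(dy, height, block)
            ])
    return channels
```

A 4x4 input becomes four 2x2 channels, so spatial resolution halves while the channel count quadruples and all 16 values are kept.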
[CV-93] Analysis of Blood Report Images Using General Purpose Vision-Language Models
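[Quick read]: This paper addresses the difficulty patients have in interpreting blood reports, which often leads to anxiety and overlooked health issues. The key to the solution is to use general-purpose Vision-Language Models (VLMs) to analyze blood report images automatically, guiding the models with clinically relevant questions to generate interpretations. The study compares three mainstream VLMs (Qwen-VL-Max, Gemini 2.5 Pro, and Llama 4 Maverick) on 100 diverse blood report images and uses Sentence-BERT to assess the semantic similarity of their answers. The results suggest that general-purpose VLMs can provide clear interpretations directly from images and are a practical and promising technology for patient-facing tools for preliminary blood report analysis.
Link: https://arxiv.org/abs/2509.06033
Authors: Nadia Bakhsheshi,Hamid Beigy
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, 3 figures. This paper has been submitted to the IEEE-affiliated ICBME Conference (Iran), 2025, and is currently under review. DOR number: [20.1001.2.0425023682.1404.10.1.440.7]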
Abstract:The reliable analysis of blood reports is important for health knowledge, but individuals often struggle with interpretation, leading to anxiety and overlooked issues. We explore the potential of general-purpose Vision-Language Models (VLMs) to address this challenge by automatically analyzing blood report images. We conduct a comparative evaluation of three VLMs: Qwen-VL-Max, Gemini 2.5 Pro, and Llama 4 Maverick, determining their performance on a dataset of 100 diverse blood report images. Each model was prompted with clinically relevant questions adapted to each blood report. The answers were then processed using Sentence-BERT to compare and evaluate how closely the models responded. The findings suggest that general-purpose VLMs are a practical and promising technology for developing patient-facing tools for preliminary blood report analysis. Their ability to provide clear interpretations directly from images can improve health literacy and reduce the limitations to understanding complex medical information. This work establishes a foundation for the future development of reliable and accessible AI-assisted healthcare applications. While results are encouraging, they should be interpreted cautiously given the limited dataset size.
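The abstract does not say how the Sentence-BERT outputs were compared. Assuming the usual choice of cosine similarity between sentence embeddings, the comparison step could look like the sketch below; the embedding vectors are taken as given, and the function name is hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

In such a setup, each model's answer to a question would be embedded once and scored pairwise against the other models' answers, with values near 1 indicating closely matching responses.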
[CV-94] DVLO4D: Deep Visual-Lidar Odometry with Sparse Spatial-temporal Fusion ICRA2025
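[Quick read]: This paper addresses the limited accuracy and robustness of visual-LiDAR odometry for autonomous system localization. Traditional methods are often hampered by sensor misalignment, fail to fully exploit temporal information, and need extensive manual tuning for different sensor configurations. The core of the solution is DVLO4D, a new framework with three key innovations: (1) Sparse Query Fusion, which uses sparse LiDAR queries for efficient multi-modal data fusion; (2) a Temporal Interaction and Update module, which combines temporally predicted positions with current-frame data to provide better initial values for pose estimation and to suppress accumulated error; and (3) a Temporal Clip Training strategy with a Collective Average Loss mechanism, which aggregates losses across frames to enable global optimization and reduce scale drift over long sequences. The method achieves state-of-the-art performance on the KITTI and Argoverse datasets with an inference time of only 82 ms, showing potential for real-time deployment.
Link: https://arxiv.org/abs/2509.06023
Authors: Mengmeng Liu,Michael Ying Yang,Jiuming Liu,Yunpeng Zhang,Jiangtao Li,Sander Oude Elberink,George Vosselman,Hao Cheng
Affiliations: University of Twente, The Netherlands; University of Bath, UK; Shanghai Jiao Tong University, China; PhiGent Robotics, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICRA 2025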
Abstract:Visual-LiDAR odometry is a critical component for autonomous system localization, yet achieving high accuracy and strong robustness remains a challenge. Traditional approaches commonly struggle with sensor misalignment, fail to fully leverage temporal information, and require extensive manual tuning to handle diverse sensor configurations. To address these problems, we introduce DVLO4D, a novel visual-LiDAR odometry framework that leverages sparse spatial-temporal fusion to enhance accuracy and robustness. Our approach proposes three key innovations: (1) Sparse Query Fusion, which utilizes sparse LiDAR queries for effective multi-modal data fusion; (2) a Temporal Interaction and Update module that integrates temporally-predicted positions with current frame data, providing better initialization values for pose estimation and enhancing the model's robustness against accumulated errors; and (3) a Temporal Clip Training strategy combined with a Collective Average Loss mechanism that aggregates losses across multiple frames, enabling global optimization and reducing the scale drift over long sequences. Extensive experiments on the KITTI and Argoverse Odometry datasets demonstrate the superiority of our proposed DVLO4D, which achieves state-of-the-art performance in terms of both pose accuracy and robustness. Additionally, our method is highly efficient, with an inference time of 82 ms, showing potential for real-time deployment.
[CV-95] Micro-Expression Recognition via Fine-Grained Dynamic Perception
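[Quick read]: This paper addresses the challenges that the transience, subtlety, and dynamics of micro-expressions pose for facial micro-expression recognition (MER). Existing methods usually rely on hand-crafted features, which require additional key-frame annotation, or on deep networks, which are limited by small-scale, low-diversity training data. The key to the solution is a fine-grained dynamic perception (FDP) framework. It first extracts frame-level features with a local-global feature-aware transformer; a rank scorer then ranks the frame features in chronological order to encode the dynamics of micro-expression appearance and motion; the ranked features are pooled over time to obtain a dynamic representation; and this representation is shared by a micro-expression classification module and a dynamic image construction module. The image construction task helps capture the subtle facial actions associated with micro-expressions and alleviates data scarcity. Experiments show that FDP significantly outperforms state-of-the-art methods on several public datasets.
Link: https://arxiv.org/abs/2509.06015
Authors: Zhiwen Shao,Yifan Cheng,Fan Zhang,Xuehuai Shi,Canlin Li,Lizhuang Ma,Dit-yan Yeung
Affiliations: China University of Mining and Technology; Mine Digitization Engineering Research Center of the Ministry of Education; The Hong Kong University of Science and Technology; Inspur Zhuoshu Big Data Industry Development Co., Ltd.; Nanjing University of Posts and Telecommunications; Zhengzhou University of Light Industry; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: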
Abstract:Facial micro-expression recognition (MER) is a challenging task, due to the transience, subtlety, and dynamics of micro-expressions (MEs). Most existing methods resort to hand-crafted features or deep networks, in which the former often additionally requires key frames, and the latter suffers from small-scale and low-diversity training data. In this paper, we develop a novel fine-grained dynamic perception (FDP) framework for MER. We propose to rank frame-level features of a sequence of raw frames in chronological order, in which the rank process encodes the dynamic information of both ME appearances and motions. Specifically, a novel local-global feature-aware transformer is proposed for frame representation learning. A rank scorer is further adopted to calculate rank scores of each frame-level feature. Afterwards, the rank features from rank scorer are pooled in temporal dimension to capture dynamic representation. Finally, the dynamic representation is shared by a MER module and a dynamic image construction module, in which the former predicts the ME category, and the latter uses an encoder-decoder structure to construct the dynamic image. The design of dynamic image construction task is beneficial for capturing facial subtle actions associated with MEs and alleviating the data scarcity issue. Extensive experiments show that our method (i) significantly outperforms the state-of-the-art MER methods, and (ii) works well for dynamic image construction. Particularly, our FDP improves by 4.05%, 2.50%, 7.71%, and 2.11% over the previous best results in terms of F1-score on the CASME II, SAMM, CAS(ME)^2, and CAS(ME)^3 datasets, respectively. The code is available at this https URL.
[CV-96] Cross-Modal Enhancement and Benchmark for UAV-based Open-Vocabulary Object Detection
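[Quick read]: This paper addresses the marked performance drop of open-vocabulary object detection (OVD) models on unmanned aerial vehicle (UAV) imagery. The root cause is that existing large-scale OVD pre-training datasets consist mainly of ground-level natural images, which creates a significant domain gap. The solution has three parts: first, a refined UAV-Label engine that supports high-quality annotation; second, two new UAV datasets, UAVDE-2M (over 2,000,000 instances and 1,800 categories) and UAVCAP-15k (over 15,000 images); and third, a Cross-Attention Gated Enhancement Fusion (CAGE) module integrated into the YOLO-World-v2 architecture, improving generalization and detection accuracy on UAV remote sensing imagery.
Link: https://arxiv.org/abs/2509.06011
Authors: Zhenhai Weng,Zhongliang Yu
Affiliations: Chongqing University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: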
Abstract:Open-Vocabulary Object Detection (OVD) has emerged as a pivotal technology for applications involving Unmanned Aerial Vehicles (UAVs). However, the prevailing large-scale datasets for OVD pre-training are predominantly composed of ground-level, natural images. This creates a significant domain gap, causing models trained on them to exhibit a substantial drop in performance on UAV imagery. To address this limitation, we first propose a refined UAV-Label engine. Then we construct and introduce UAVDE-2M (over 2,000,000 instances and 1,800 categories) and UAVCAP-15k (over 15,000 images). Furthermore, we propose a novel Cross-Attention Gated Enhancement Fusion (CAGE) module and integrate it into the YOLO-World-v2 architecture. Finally, extensive experiments on the VisDrone and SIMD datasets verify the effectiveness of our proposed method for applications in UAV-based imagery and remote sensing.
[CV-97] BLaVe-CoT: Consistency-Aware Visual Question Answering for Blind and Low Vision Users
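[Quick read]: This paper addresses practical challenges that Blind and Low Vision (BLV) users face with Visual Question Answering (VQA) systems. Because of visual impairment, BLV users often take blurry or poorly framed photos and have difficulty describing what they see, so their visual questions are highly ambiguous: one question may have several valid answers, each grounded in a different image region. Conventional VQA systems assume a single answer and region and cannot handle this uncertainty. The key to the solution is the BLaVe-CoT framework, which has three steps: it generates diverse candidate answers with a LoRA-tuned BLIP-2 model; it grounds each answer spatially with PolyFormer; and it applies a chain-of-thought (CoT) reasoning module to decide whether the answers refer to the same image region, thereby assessing answer consistency. The method markedly improves robustness to ambiguity and visual noise in real assistive settings and advances inclusive VQA for BLV users.
Link: https://arxiv.org/abs/2509.06010
Authors: Wanyin Cheng,Zanxi Ruan
Affiliations: Qufu Normal University; University of Verona
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: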
Abstract:Visual Question Answering (VQA) holds great potential for assisting Blind and Low Vision (BLV) users, yet real-world usage remains challenging. Due to visual impairments, BLV users often take blurry or poorly framed photos and face difficulty in articulating specific questions about what they cannot fully see. As a result, their visual questions are frequently ambiguous, and different users may interpret them in diverse ways. This leads to multiple valid answers, each grounded in different image regions-posing a mismatch with conventional VQA systems that assume a single answer and region. To bridge this gap, we present BLaVe-CoT, a VQA framework designed to reason about answer consistency in the face of ambiguity. Our method proposes diverse candidate answers using a LoRA-tuned BLIP-2 model, then grounds each answer spatially using PolyFormer, and finally applies a chain-of-thought reasoning module to assess whether the answers refer to the same or different regions. Evaluated on the VQA-AnswerTherapy benchmark, BLaVe-CoT outperforms previous methods and proves more robust to the ambiguity and visual noise common in assistive settings. This work highlights the need for VQA systems that can adapt to real human uncertainty and provide inclusive support for BLV users. To foster further research and accessibility applications, we have made the code publicly available at this https URL.
[CV-98] Khana: A Comprehensive Indian Cuisine Dataset
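【Quick Read】: This paper addresses a clear gap in how current food image models recognize and understand Indian cuisine, which stems from the cuisine's wide regional diversity, complex preparations, and the lack of comprehensively labeled datasets. The key to the solution is Khana, a new benchmark dataset that establishes a taxonomy of Indian cuisine and provides about 131K images at 500x500 pixels across 80 labels. Khana serves as a benchmark for food image classification, segmentation, and retrieval, and as a data resource for developers building real-world applications such as dietary tracking and automated meal planning.
Link: https://arxiv.org/abs/2509.06006
Authors: Omkar Prabhu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: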
Abstract:As global interest in diverse culinary experiences grows, food image models are essential for improving food-related applications by enabling accurate food recognition, recipe suggestions, dietary tracking, and automated meal planning. Despite the abundance of food datasets, a noticeable gap remains in capturing the nuances of Indian cuisine due to its vast regional diversity, complex preparations, and the lack of comprehensive labeled datasets that cover its full breadth. Through this exploration, we uncover Khana, a new benchmark dataset for food image classification, segmentation, and retrieval of dishes from Indian cuisine. Khana fills the gap by establishing a taxonomy of Indian cuisine and offering around 131K images in the dataset spread across 80 labels, each with a resolution of 500x500 pixels. This paper describes the dataset creation process and evaluates state-of-the-art models on classification, segmentation, and retrieval as baselines. Khana bridges the gap between research and development by providing a comprehensive and challenging benchmark for researchers while also serving as a valuable resource for developers creating real-world applications that leverage the rich tapestry of Indian cuisine. Webpage: this https URL
[CV-99] Motion Aware ViT-based Framework for Monocular 6-DoF Spacecraft Pose Estimation
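【Quick Read】: This paper addresses the problem that monocular 6-DoF pose estimation for spacecraft missions usually relies on static keypoint localization in single images and so fails to exploit the temporal information inherent to space operations. The key to the solution is a deep learning framework, adapted from human pose estimation, that combines motion-aware heatmaps with optical flow: image features from a Vision Transformer (ViT) encoder are fused with motion cues from a pre-trained optical flow model to localize 2D keypoints, and a Perspective-n-Point (PnP) solver then recovers the 6-DoF pose from the known 2D-3D correspondences. The approach improves on single-image baselines in both keypoint localization and pose estimation and shows promising generalization across data distributions.
Link: https://arxiv.org/abs/2509.06000
Authors: Jose Sosa,Dan Pineau,Arunkumar Rathinam,Abdelrahman Shabayek,Djamila Aouada
Affiliations: Interdisciplinary Centre for Security, Reliability and Trust (SnT); University of Luxembourg (卢森堡大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: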
Abstract:Monocular 6-DoF pose estimation plays an important role in multiple spacecraft missions. Most existing pose estimation approaches rely on single images with static keypoint localisation, failing to exploit valuable temporal information inherent to space operations. In this work, we adapt a deep learning framework from human pose estimation to the spacecraft pose estimation domain that integrates motion-aware heatmaps and optical flow to capture motion dynamics. Our approach combines image features from a Vision Transformer (ViT) encoder with motion cues from a pre-trained optical flow model to localise 2D keypoints. Using the estimates, a Perspective-n-Point (PnP) solver recovers 6-DoF poses from known 2D-3D correspondences. We train and evaluate our method on the SPADES-RGB dataset and further assess its generalisation on real and synthetic data from the SPARK-2024 dataset. Overall, our approach demonstrates improved performance over single-image baselines in both 2D keypoint localisation and 6-DoF pose estimation. Furthermore, it shows promising generalisation capabilities when testing on different data distributions.
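The final step described above, recovering a pose from 2D-3D correspondences, can be pictured through the quantity a PnP solver typically minimizes: the reprojection error. The sketch below is a minimal illustration, not the authors' implementation. It assumes a pinhole camera and, to stay short, a rotation about the optical axis only; `focal`, `centre`, and all helper names are hypothetical.

```python
import math


def rotate_z(point, yaw):
    """Rotate a 3D point about the camera z-axis by `yaw` radians."""
    x, y, z = point
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * x - s * y, s * x + c * y, z)


def project(point_3d, yaw, translation, focal, centre):
    """Pinhole projection of a model point under a (yaw, translation) pose."""
    x, y, z = rotate_z(point_3d, yaw)
    x, y, z = x + translation[0], y + translation[1], z + translation[2]
    return (focal * x / z + centre[0], focal * y / z + centre[1])


def reprojection_error(model_points, keypoints_2d, yaw, translation,
                       focal=800.0, centre=(320.0, 240.0)):
    """Mean pixel distance between detected keypoints and projected model points."""
    total = 0.0
    for p3, (u, v) in zip(model_points, keypoints_2d):
        pu, pv = project(p3, yaw, translation, focal, centre)
        total += math.hypot(pu - u, pv - v)
    return total / len(model_points)
```

A real solver searches over full 3D rotations and translations for the pose with the smallest error; here the error is only evaluated at a candidate pose, so better 2D keypoints translate directly into a better-conditioned search.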
[CV-100] S-LAM3D: Segmentation-Guided Monocular 3D Object Detection via Feature Space Fusion
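【Quick Read】: This paper addresses monocular 3D object detection, where a single 2D image provides no depth cues and depth estimation is therefore ill-posed. Existing methods typically extract features with Convolutional Neural Network or Transformer backbones and predict 3D parameters with dedicated detection heads. The key to the solution is a decoupled strategy: precomputed segmentation priors are injected and fused directly into the feature space to guide detection, without expanding the detection model or jointly learning the priors. On the KITTI 3D Object Detection Benchmark, the method outperforms the equivalent RGB-only architecture on small objects (pedestrians and cyclists), which suggests that a better understanding of the input data can offset the need for additional sensors or training data.
Link: https://arxiv.org/abs/2509.05999
Authors: Diana-Alexandra Sas,Florin Oniga
Affiliations: Technical University of Cluj-Napoca (克卢日-纳波卡理工大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages. Accepted to MMSP 2025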
Abstract:Monocular 3D Object Detection is a challenging Computer Vision task because its input is a single 2D image that lacks depth cues, which makes depth estimation an ill-posed problem. Existing solutions leverage the information extracted from the input by using Convolutional Neural Networks or Transformer architectures as feature extraction backbones, followed by specific detection heads for 3D parameter prediction. In this paper, we introduce a decoupled strategy based on injecting precomputed segmentation information priors and fusing them directly into the feature space for guiding the detection, without expanding the detection model or jointly learning the priors. The focus is on evaluating the impact of additional segmentation information on existing detection pipelines without adding further prediction branches. The proposed method is evaluated on the KITTI 3D Object Detection Benchmark, outperforming the equivalent architecture that relies only on RGB image features for small objects in the scene (pedestrians and cyclists) and showing that understanding the input data can balance the need for additional sensors or training data.
[CV-101] Multi-Strategy Guided Diffusion via Sparse Masking Temporal Reweighting Distribution Correction
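【Quick Read】: This paper addresses the loss of image quality in sparse-view computed tomography (CT) reconstruction caused by missing projection data, and in particular how to recover the missing views while preserving structural consistency and detail. The key to the solution is STRIDE, a diffusion model guided by Sparse condition Temporal Reweighted Integrated Distribution Estimation. It uses a joint training mechanism guided by sparse conditional probabilities so that the model learns both missing-view completion and global information modeling; a time-varying sparse-condition reweighting strategy that adjusts the guidance weights dynamically during denoising so that the model perceives sparse-view information progressively; linear regression to correct the distribution shift between known and generated data and reduce inconsistencies during guidance; and a dual-network parallel architecture that performs global correction and optimization across multiple sub-frequency components, improving detail restoration and structure preservation.
Link: https://arxiv.org/abs/2509.05992
Authors: Zekun Zhou,Yanru Gong,Liu Shi,Qiegen Liu
Affiliations: Nanchang University (南昌大学); YOFO Technology Co., Ltd. (Hefei, China); Institute of Jinan Laboratory of Applied Nuclear Science (济南实验室应用核科学研究所)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: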
Abstract:Diffusion models have demonstrated remarkable generative capabilities in image processing tasks. We propose a Sparse condition Temporal Reweighted Integrated Distribution Estimation guided diffusion model (STRIDE) for sparse-view CT reconstruction. Specifically, we design a joint training mechanism guided by sparse conditional probabilities to help the model learn missing projection view completion and global information modeling effectively. Based on systematic theoretical analysis, we propose a temporally varying sparse condition reweighting guidance strategy that dynamically adjusts weights during the progressive denoising process from pure noise to the real image, enabling the model to progressively perceive sparse-view information. Linear regression is employed to correct distributional shifts between known and generated data, mitigating inconsistencies arising during the guidance process. Furthermore, we construct a dual-network parallel architecture to perform global correction and optimization across multiple sub-frequency components, thereby effectively improving the model's capability in both detail restoration and structural preservation, ultimately achieving high-quality image reconstruction. Experimental results on both public and real datasets demonstrate that the proposed method achieves a best improvement of 2.58 dB in PSNR, an increase of 2.37% in SSIM, and a reduction of 0.236 in MSE compared to the best-performing baseline methods. The reconstructed images exhibit excellent generalization and robustness in terms of structural consistency, detail restoration, and artifact suppression.
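The distribution-correction step can be illustrated with an ordinary least-squares line fit. This is a sketch under stated assumptions, not the STRIDE implementation: it assumes a single scalar slope and intercept fitted at the positions where measured projections exist, and the function names and the index-based data layout are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys ~ slope * xs + intercept.

    Assumes the xs are not all equal (otherwise the slope is undefined).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def correct_generated(generated, measured, measured_idx):
    """Fit generated-vs-measured at the measured positions, then rescale all values."""
    paired = [generated[i] for i in measured_idx]
    slope, intercept = fit_line(paired, measured)
    return [slope * g + intercept for g in generated]
```

For example, if the generated values are `[1, 2, 3, 4]` and measurements `[3, 7]` exist at positions 0 and 2, the fit gives slope 2 and intercept 1, and every generated value is mapped onto the scale of the measured data.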
[CV-102] ConstStyle: Robust Domain Generalization with Unified Style Transformation ICCV2025
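【Quick Read】: This paper addresses the performance drop of deep neural networks when the test distribution differs from the training distribution, that is, the domain-shift challenge in Domain Generalization (DG). Existing methods usually learn domain-invariant features or increase diversity through data augmentation, but they struggle when training domains are few or when seen and unseen domains differ substantially. The key to the solution is ConstStyle, which builds a unified domain: all training samples are mapped onto this domain for optimization, and at test time samples from unseen domains are projected onto the same domain before prediction. Aligning training and test data in the unified domain reduces the impact of domain shift, even with large domain gaps or few training domains.
Link: https://arxiv.org/abs/2509.05975
Authors: Nam Duong Tran,Nam Nguyen Phuong,Hieu H. Pham,Phi Le Nguyen,My T. Thai
Affiliations: Hanoi University of Science and Technology (河内科学技术大学); VinUniversity (Vin大学); University of Florida (佛罗里达大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at ICCV 2025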
Abstract:Deep neural networks often suffer performance drops when test data distribution differs from training data. Domain Generalization (DG) aims to address this by focusing on domain-invariant features or augmenting data for greater diversity. However, these methods often struggle with limited training domains or significant gaps between seen (training) and unseen (test) domains. To enhance DG robustness, we hypothesize that it is essential for the model to be trained on data from domains that closely resemble unseen test domains, an inherently difficult task due to the absence of prior knowledge about the unseen domains. Accordingly, we propose ConstStyle, a novel approach that leverages a unified domain to capture domain-invariant features and bridge the domain gap with theoretical analysis. During training, all samples are mapped onto this unified domain, optimized for seen domains. During testing, unseen domain samples are projected similarly before predictions. By aligning both training and testing data within this unified domain, ConstStyle effectively reduces the impact of domain shifts, even with large domain gaps or few seen domains. Extensive experiments demonstrate that ConstStyle consistently outperforms existing methods across diverse scenarios. Notably, when only a limited number of seen domains are available, ConstStyle can boost accuracy up to 19.82% compared to the next best approach.
[CV-103] OmniStyle2: Scalable and High Quality Artistic Style Transfer Data Generation via Destylization
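【Quick Read】: This paper addresses the lack of ground-truth data in artistic style transfer, namely how to obtain high-quality pairs of stylized images and their style-free content images as reliable supervision. The key to the solution is destylization: a text-guided destylization model (DST) recovers natural, style-free content from artworks, and a multi-stage evaluation model (DST-Filter) automatically selects high-quality pairs, which together yield the large-scale DST-100K dataset. OmniStyle2, a simple feed-forward model based on FLUX.1-dev and trained on this dataset, surpasses state-of-the-art methods on both qualitative and quantitative benchmarks, indicating that scalable data generation through destylization is an effective supervision paradigm.
Link: https://arxiv.org/abs/2509.05970
Authors: Ye Wang,Zili Yi,Yibo Zhang,Peng Zheng,Xuping Xie,Jiang Lin,Yilin Wang,Rui Ma
Affiliations: Jilin University (吉林大学); Nanjing University (南京大学); Shanghai Innovation Institute (上海创新研究院); Adobe (Adobe公司); Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE, China (教育部知识驱动人机智能工程研究中心)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL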
Abstract:OmniStyle2 introduces a novel approach to artistic style transfer by reframing it as a data problem. Our key insight is destylization, reversing style transfer by removing stylistic elements from artworks to recover natural, style-free counterparts. This yields DST-100K, a large-scale dataset that provides authentic supervision signals by aligning real artistic styles with their underlying content. To build DST-100K, we develop (1) DST, a text-guided destylization model that reconstructs style-free content, and (2) DST-Filter, a multi-stage evaluation model that employs Chain-of-Thought reasoning to automatically discard low-quality pairs while ensuring content fidelity and style accuracy. Leveraging DST-100K, we train OmniStyle2, a simple feed-forward model based on FLUX.1-dev. Despite its simplicity, OmniStyle2 consistently surpasses state-of-the-art methods across both qualitative and quantitative benchmarks. Our results demonstrate that scalable data generation via destylization provides a reliable supervision paradigm, overcoming the fundamental challenge posed by the lack of ground-truth data in artistic style transfer.
[CV-104] Spatial-Aware Self-Supervision for Medical 3D Imaging with Multi-Granularity Observable Tasks
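【Quick Read】: This paper addresses the limited interpretability of current self-supervised learning methods for medical 3D imaging, which offer no intuitive view of how the model learns spatial knowledge. Most existing methods borrow designs from the generic 2D visual domain and do not capture 3D spatial semantics well, which limits their trustworthiness and usefulness in medical settings. The key to the solution is a framework of three sub-tasks designed according to observable principles to keep the learning process interpretable. It adds multi-granularity spatial relationship modeling that exploits the extra dimension of 3D imaging, maintaining training stability while reaching performance on par with current methods.
Link: https://arxiv.org/abs/2509.05967
Authors: Yiqin Zhang,Meiling Chen,Zhengjie Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: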
Abstract:The application of self-supervised techniques has become increasingly prevalent within medical visualization tasks, primarily due to their capacity to mitigate the data scarcity prevalent in the healthcare sector. The majority of current works are influenced by designs originating in the generic 2D visual domain, which lack an intuitive demonstration of the model’s learning process regarding 3D spatial knowledge. Consequently, these methods often fall short in terms of medical interpretability. We propose a method consisting of three sub-tasks to capture the spatially relevant semantics in medical 3D imaging. Their design adheres to observable principles to ensure interpretability while minimizing the resulting performance loss as far as possible. By leveraging the enhanced semantic depth offered by the extra dimension in 3D imaging, this approach incorporates multi-granularity spatial relationship modeling to maintain training stability. Experimental findings suggest that our approach is capable of delivering performance that is on par with current methodologies, while facilitating an intuitive understanding of the self-supervised learning process.
[CV-105] Neural Bloom: A Deep Learning Approach to Real-Time Lighting
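【Quick Read】: This paper addresses the inefficiency of generating the bloom effect in real-time rendering. Traditional methods rely on multiple blur passes and texture sampling and often contain conditional branching, which takes up a large share of execution time. The key to the solution is two neural network-based bloom lighting methods, Neural Bloom Lighting (NBL) and Fast Neural Bloom Lighting (FastNBL), which learn the brightness mask end to end in place of the hand-crafted filtering pipeline. FastNBL is 28% faster and NBL 12% faster than the standard state-of-the-art bloom implementation while producing high-quality bloom, which saves computational resources in real-time rendering and helps sustain immersion in high-FPS environments.
Link: https://arxiv.org/abs/2509.05963
Authors: Rafal Karp,Dawid Gruszka,Tomasz Trzcinski
Affiliations: IDEAS Research Institute (IDEAS 研究所); Warsaw University of Technology (华沙理工大学); Tooploox (Tooploox)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: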
Abstract:We propose a novel method to generate the bloom lighting effect in real time using neural networks. Our solution generates a brightness mask from a given 3D scene view up to 30% faster than state-of-the-art methods. Existing traditional techniques rely on multiple blur passes and texture sampling, and very often include conditional branching in their implementation. These operations occupy a large portion of the execution time. We solve this problem by proposing two neural network-based bloom lighting methods, Neural Bloom Lighting (NBL) and Fast Neural Bloom Lighting (FastNBL), focusing on their quality and performance. Both methods were tested on a variety of 3D scenes, with evaluations conducted on brightness mask accuracy and inference speed. The main contribution of this work is that both methods produce high-quality bloom effects while outperforming the standard state-of-the-art bloom implementation, with FastNBL being faster by 28% and NBL faster by 12%. These findings highlight that we can achieve realistic bloom lighting phenomena faster, moving us towards more realism in real-time environments in the future. This improvement saves computational resources, which are a major bottleneck in real-time rendering. Furthermore, it is crucial for sustaining immersion and ensuring smooth experiences in high FPS environments, while maintaining high-quality realism.
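For context, the kind of traditional pipeline the abstract contrasts with can be sketched as a bright-pass threshold, followed by repeated blurs and an additive blend. This is a generic illustration on a grayscale grid, not NBL, FastNBL, or the baseline the authors measured; the threshold, the box blur, and all parameter names are assumptions.

```python
def bright_pass(image, threshold):
    """Keep only pixels brighter than `threshold` (note the per-pixel branch)."""
    return [[v if v > threshold else 0.0 for v in row] for row in image]


def box_blur(image, radius):
    """Average each pixel with its in-bounds neighbours inside a square window."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        total += image[yy][xx]
                        count += 1
            out[y][x] = total / count
    return out


def bloom(image, threshold=0.8, radius=1, passes=2, strength=1.0):
    """Threshold, blur `passes` times, then add the glow back and clamp to 1.0."""
    mask = bright_pass(image, threshold)
    for _ in range(passes):
        mask = box_blur(mask, radius)
    return [[min(1.0, v + strength * m) for v, m in zip(row, mask_row)]
            for row, mask_row in zip(image, mask)]
```

Each blur pass re-reads a neighbourhood for every pixel, which is the repeated sampling cost that a learned brightness mask aims to avoid.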
[CV-106] StripDet: Strip Attention-Based Lightweight 3D Object Detection from Point Cloud
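【Quick Read】: This paper addresses the heavy computational and memory cost of deploying high-accuracy 3D object detection models on point clouds, especially for efficient operation on edge devices. The key to the solution is StripDet, a lightweight framework with two main contributions: (1) the Strip Attention Block (SAB), which decomposes standard 2D convolutions into asymmetric strip convolutions to model long-range spatial dependencies while reducing computational complexity from quadratic to linear; and (2) a hardware-friendly hierarchical backbone that combines SAB with depthwise separable convolutions and a simple multi-scale fusion strategy for end-to-end efficiency. With only 0.65M parameters, StripDet reaches 79.97% mAP for car detection on KITTI, surpassing PointPillars with 7x fewer parameters and offering a better accuracy-efficiency trade-off.
Link: https://arxiv.org/abs/2509.05954
Authors: Weichao Wang,Wendong Mao,Zhongfeng Wang
Affiliations: School of Integrated Circuits, Shenzhen Campus of Sun Yat-sen University (中山大学深圳校区集成电路学院)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: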
Abstract:The deployment of high-accuracy 3D object detection models from point cloud remains a significant challenge due to their substantial computational and memory requirements. To address this, we introduce StripDet, a novel lightweight framework designed for on-device efficiency. First, we propose the novel Strip Attention Block (SAB), a highly efficient module designed to capture long-range spatial dependencies. By decomposing standard 2D convolutions into asymmetric strip convolutions, SAB efficiently extracts directional features while reducing computational complexity from quadratic to linear. Second, we design a hardware-friendly hierarchical backbone that integrates SAB with depthwise separable convolutions and a simple multiscale fusion strategy, achieving end-to-end efficiency. Extensive experiments on the KITTI dataset validate StripDet’s superiority. With only 0.65M parameters, our model achieves a 79.97% mAP for car detection, surpassing the baseline PointPillars with a 7x parameter reduction. Furthermore, StripDet outperforms recent lightweight and knowledge distillation-based methods, achieving a superior accuracy-efficiency trade-off while establishing itself as a practical solution for real-world 3D detection on edge devices.
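The quadratic-to-linear claim about strip convolutions can be checked with a small sketch. It assumes the two strips are applied one after the other and compares them with a full kernel built as their outer product; whether SAB chains or sums its strips is not stated in the abstract, so this ordering and all function names are assumptions.

```python
def conv2d_valid(image, kernel):
    """Plain 2D cross-correlation without padding."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    return [[sum(image[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(w - kw + 1)]
            for y in range(h - kh + 1)]


def strip_conv(image, row_kernel, col_kernel):
    """Apply a 1 x k strip, then a k x 1 strip."""
    horizontal = conv2d_valid(image, [row_kernel])
    return conv2d_valid(horizontal, [[c] for c in col_kernel])


def weight_counts(k):
    """Weights per channel: a full k x k kernel versus a pair of strips."""
    return {"full": k * k, "strips": 2 * k}
```

The strip pair reproduces a full kernel only when that kernel is an outer product of a column and a row vector, so the decomposition trades expressiveness for cost: for k = 7 it needs 14 weights instead of 49.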
[CV-107] Dual Interaction Network with Cross-Image Attention for Medical Image Segmentation
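【Quick Read】: This paper addresses the drop in diagnostic accuracy in medical image segmentation caused by noise, blur, and low contrast, while avoiding the risk that conventional image enhancement alters information that is critical for diagnosis. The key to the solution is a Dual Interactive Fusion Module (DIFM): bidirectional cross-attention lets the original and enhanced images attend to each other's spatial information, and global spatial attention then refines the complementary features. This fuses low- to high-level features associated with structural attributes such as edges, blobs, and object shapes into enhanced features that retain important spatial characteristics. A multi-scale boundary loss based on gradient extraction further improves segmentation accuracy at object boundaries.
Link: https://arxiv.org/abs/2509.05953
Authors: Jeonghyun Noh,Wangsu Jeon,Jinsun Park
Affiliations: Pusan National University (釜山国立大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages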
Abstract:Medical image segmentation is a crucial method for assisting professionals in diagnosing various diseases through medical imaging. However, various factors such as noise, blurriness, and low contrast often hinder the accurate diagnosis of diseases. While numerous image enhancement techniques can mitigate these issues, they may also alter crucial information needed for accurate diagnosis in the original image. Conventional image fusion strategies, such as feature concatenation can address this challenge. However, they struggle to fully leverage the advantages of both original and enhanced images while suppressing the side effects of the enhancements. To overcome the problem, we propose a dual interactive fusion module (DIFM) that effectively exploits mutual complementary information from the original and enhanced images. DIFM employs cross-attention bidirectionally to simultaneously attend to corresponding spatial information across different images, subsequently refining the complementary features via global spatial attention. This interaction leverages low- to high-level features implicitly associated with diverse structural attributes like edges, blobs, and object shapes, resulting in enhanced features that embody important spatial characteristics. In addition, we introduce a multi-scale boundary loss based on gradient extraction to improve segmentation accuracy at object boundaries. Experimental results on the ACDC and Synapse datasets demonstrate the superiority of the proposed method quantitatively and qualitatively. Code available at: this https URL
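Bidirectional cross-attention between two feature sets can be sketched with scaled dot-product attention. The code below is a minimal illustration and not the DIFM module: it omits learned projections, multiple heads, and the global spatial attention stage, and treats each image as a short list of token vectors; all names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Each query token takes a weighted average of the other image's tokens."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        attended.append([sum(w * kv[d] for w, kv in zip(weights, keys_values))
                         for d in range(dim)])
    return attended


def dual_interaction(original, enhanced):
    """Run attention in both directions: original->enhanced and enhanced->original."""
    return cross_attention(original, enhanced), cross_attention(enhanced, original)
```

Because the attention weights are non-negative and sum to one, each output token is a blend of the other image's tokens, which is how features from the original and enhanced images can inform each other.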
[CV-108] Coefficients-Preserving Sampling for Reinforcement Learning with Flow Matching
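【Quick Read】: This paper addresses the degraded image quality that arises when online Reinforcement Learning (RL) is applied to Flow Matching models, which requires injecting stochasticity: sampling based on a Stochastic Differential Equation (SDE) introduces pronounced noise artifacts that disturb reward learning and hinder convergence. The key to the solution, inspired by Denoising Diffusion Implicit Models (DDIM), is Coefficients-Preserving Sampling (CPS), which reformulates the sampling process to remove the excess stochasticity and thus the noise artifacts. This gives more accurate reward modeling and faster, more stable convergence for optimizers such as Flow-GRPO and Dance-GRPO.
Link: https://arxiv.org/abs/2509.05952
Authors: Feng Wang,Zihao Yu
Affiliations: CreateAI (https://www.iamcreate.ai/)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: work in progress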
Abstract:Reinforcement Learning (RL) has recently emerged as a powerful technique for improving image and video generation in Diffusion and Flow Matching models, specifically for enhancing output quality and alignment with prompts. A critical step for applying online RL methods on Flow Matching is the introduction of stochasticity into the deterministic framework, commonly realized by Stochastic Differential Equation (SDE). Our investigation reveals a significant drawback to this approach: SDE-based sampling introduces pronounced noise artifacts in the generated images, which we found to be detrimental to the reward learning process. A rigorous theoretical analysis traces the origin of this noise to an excess of stochasticity injected during inference. To address this, we draw inspiration from Denoising Diffusion Implicit Models (DDIM) to reformulate the sampling process. Our proposed method, Coefficients-Preserving Sampling (CPS), eliminates these noise artifacts. This leads to more accurate reward modeling, ultimately enabling faster and more stable convergence for reinforcement learning-based optimizers like Flow-GRPO and Dance-GRPO. Code will be released at this https URL
[CV-109] AttriPrompt: Dynamic Prompt Composition Learning for CLIP
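【Quick Read】: This paper addresses two key limitations of current deep text prompting methods: over-reliance on contrastive learning objectives that target high-level semantic alignment and neglect fine-grained feature optimization, and prompts that stay static across all input categories and so cannot adapt to content. The key to the solution is the AttriPrompt framework, which (1) uses an Attribute Retrieval module that clusters intermediate-layer features of CLIP's vision encoder, retrieves semantically similar prompts, and concatenates them to the input of every text-encoder layer; (2) introduces Dual-stream Contrastive Learning, which uses the hierarchical visual information embedded in the prompted text features for finer-grained alignment; and (3) adds a Self-Regularization mechanism that applies explicit regularization constraints between prompted and non-prompted text features to reduce overfitting on limited training data.
Link: https://arxiv.org/abs/2509.05949
Authors: Qiqi Zhan,Shiwei Li,Qingjie Liu,Yunhong Wang
Affiliations: Beihang University (北京航空航天大学); Hangzhou Innovation Institute, Beihang University (杭州创新研究院,北京航空航天大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: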
Abstract:The evolution of prompt learning methodologies has driven exploration of deeper prompt designs to enhance model performance. However, current deep text prompting approaches suffer from two critical limitations: over-reliance on contrastive learning objectives that prioritize high-level semantic alignment, neglecting fine-grained feature optimization; and static prompts across all input categories, preventing content-aware adaptation. To address these limitations, we propose AttriPrompt, a novel framework that enhances and refines textual semantic representations by leveraging the intermediate-layer features of CLIP’s vision encoder. We designed an Attribute Retrieval module that first clusters visual features from each layer. The aggregated visual features retrieve semantically similar prompts from a prompt pool, which are then concatenated to the input of every layer in the text encoder. Leveraging hierarchical visual information embedded in prompted text features, we introduce Dual-stream Contrastive Learning to realize fine-grained alignment. Furthermore, we introduce a Self-Regularization mechanism by applying explicit regularization constraints between the prompted and non-prompted text features to prevent overfitting on limited training data. Extensive experiments across three benchmarks demonstrate AttriPrompt’s superiority over state-of-the-art methods, achieving up to 7.37% improvement in the base-to-novel setting. The observed strength of our method in cross-domain knowledge transfer positions vision-language pre-trained models as more viable solutions for real-world implementation.
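The retrieval step, in which aggregated visual features pick prompts from a pool, can be sketched as a nearest-neighbour lookup. The abstract does not say which similarity measure is used, so cosine similarity, the key vectors attached to each prompt, and the function names below are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_prompts(visual_feature, prompt_keys, top_k=2):
    """Return indices of the `top_k` pool entries most similar to the feature."""
    ranked = sorted(range(len(prompt_keys)),
                    key=lambda i: cosine(visual_feature, prompt_keys[i]),
                    reverse=True)
    return ranked[:top_k]
```

With keys `[[1, 0], [0, 1], [1, 1]]` and a feature `[0.9, 0.1]`, the lookup selects entries 0 and 2, so different images pull different prompts instead of sharing one static template.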
[CV-110] Compression Beyond Pixels: Semantic Compression with Multimodal Foundation Models
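【Quick Read】: This paper addresses the limits of conventional image compression in emerging applications: it focuses on pixel-level reconstruction quality rather than semantic preservation and lacks robustness across diverse data distributions and downstream tasks. The authors propose a semantic compression method based on the contrastive language-image pretraining (CLIP) model. Instead of compressing images for reconstruction, it compresses the CLIP feature embeddings into as few bits as possible while preserving semantic information across tasks and data distributions. The key idea is to exploit CLIP's zero-shot and representational capabilities: the method reaches an average bit rate of about 2-3 × 10⁻³ bits per pixel, less than 5% of the bitrate mainstream image compression methods need for comparable performance, and stays zero-shot robust across diverse downstream tasks even under extreme compression.
Link: https://arxiv.org/abs/2509.05925
Authors: Ruiqi Shen,Haotian Wu,Wenjing Zhang,Jiangjing Hu,Deniz Gunduz
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
Comments: Published as a conference paper at IEEE 35th Workshop on Machine Learning for Signal Processing (MLSP)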
Abstract:Recent deep learning-based methods for lossy image compression achieve competitive rate-distortion performance through extensive end-to-end training and advanced architectures. However, emerging applications increasingly prioritize semantic preservation over pixel-level reconstruction and demand robust performance across diverse data distributions and downstream tasks. These challenges call for advanced semantic compression paradigms. Motivated by the zero-shot and representational capabilities of multimodal foundation models, we propose a novel semantic compression method based on the contrastive language-image pretraining (CLIP) model. Rather than compressing images for reconstruction, we propose compressing the CLIP feature embeddings into minimal bits while preserving semantic information across different tasks. Experiments show that our method maintains semantic integrity across benchmark datasets, achieving an average bit rate of approximately 2-3 × 10⁻³ bits per pixel. This is less than 5% of the bitrate required by mainstream image compression approaches for comparable performance. Remarkably, even under extreme compression, the proposed approach exhibits zero-shot robustness across diverse data distributions and downstream tasks.
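To put the reported rate in perspective, bits per pixel can be converted into a total bit budget for one image. The arithmetic below is a sketch; the 512x512 image size and the mid-range rate of 2.5e-3 are assumptions chosen for illustration and are not taken from the paper.

```python
def bits_per_pixel(total_bits, height, width):
    """Rate of a compressed representation relative to the image size."""
    return total_bits / (height * width)


def bit_budget(bpp, height, width):
    """Total number of bits available for one image at a given rate."""
    return bpp * height * width


# Hypothetical example: a 512 x 512 image at 2.5e-3 bits per pixel.
budget = bit_budget(2.5e-3, 512, 512)
print(f"{budget:.2f} bits, about {budget / 8:.0f} bytes")
```

At that rate the whole image is described by roughly 655 bits, or about 82 bytes, which shows why the method stores semantic features rather than anything that could reconstruct pixels.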
[CV-111] eKalibr-Inertial: Continuous-Time Spatiotemporal Calibration for Event-Based Visual-Inertial Systems
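【Quick Read】: This paper addresses accurate spatiotemporal calibration between an event camera and an Inertial Measurement Unit (IMU) in visual-inertial systems, a prerequisite for high-performance ego-motion estimation. The solution is a calibrator named eKalibr-Inertial. It first uses the widely used circle grid board for a rigorous and efficient initialization in which all extrinsic and temporal parameters are recovered, and then refines the initialized parameters with a continuous-time-based batch optimization. Real-world experiments validate the method, and the implementation is open-sourced.
Link: https://arxiv.org/abs/2509.05923
Authors: Shuolong Chen,Xingxing Li,Liu Yuan
Affiliations: School of Geodesy and Geomatics (SGG), Wuhan University (WHU) (武汉大学测绘学院)
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: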
Abstract:The bioinspired event camera, distinguished by its exceptional temporal resolution, high dynamic range, and low power consumption, has been extensively studied in recent years for motion estimation, robotic perception, and object detection. In ego-motion estimation, the visual-inertial setup is commonly adopted due to complementary characteristics between sensors (e.g., scale perception and low drift). For optimal event-based visual-inertial fusion, accurate spatiotemporal (extrinsic and temporal) calibration is required. In this work, we present eKalibr-Inertial, an accurate spatiotemporal calibrator for event-based visual-inertial systems, utilizing the widely used circle grid board. Building upon the grid pattern recognition and tracking methods in eKalibr and eKalibr-Stereo, the proposed method starts with a rigorous and efficient initialization, where all parameters in the estimator would be accurately recovered. Subsequently, a continuous-time-based batch optimization is conducted to refine the initialized parameters toward better states. The results of extensive real-world experiments show that eKalibr-Inertial can achieve accurate event-based visual-inertial spatiotemporal calibration. The implementation of eKalibr-Inertial is open-sourced at (this https URL) to benefit the research community.
[CV-112] A Fine-Grained Attention and Geometric Correspondence Model for Musculoskeletal Risk Classification in Athletes Using Multimodal Visual and Skeletal Features
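【Quick Read】: This paper addresses the difficulty of assessing athletes' musculoskeletal risk accurately in complex environments: most existing methods rely on a single data source and are unreliable outside controlled settings. The key to the solution is ViSK-GAT (Visual-Skeletal Geometric Attention Transformer), a multimodal deep learning framework whose main components are: (1) a Fine-Grained Attention Module (FGAM) that refines features across modalities through cross-attention between visual and skeletal-coordinate inputs; (2) a Multimodal Geometric Correspondence Module (MGCM) that improves cross-modal coherence by aligning image features with coordinate-based representations; and (3) a Residual Block combined with a Lightweight Transformer Block to model spatial and temporal dependencies jointly. On a custom multimodal dataset, the model reaches 93.89% test accuracy with an RMSE of 0.1205 and outperforms nine popular transfer learning backbones.
Link: https://arxiv.org/abs/2509.05913
Authors: Md. Abdur Rahman,Mohaimenul Azam Khan Raiaan,Tamanna Shermin,Md Rafiqul Islam,Mukhtar Hussain,Sami Azam
Affiliations: United International University (联合国际大学); Charles Darwin University (查尔斯达尔文大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 6 figures, 8 tables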
Abstract:Musculoskeletal disorders pose significant risks to athletes, and assessing risk early is important for prevention. However, most existing methods are designed for controlled settings and fail to reliably assess risk in complex environments due to their reliance on a single type of data. This research proposes ViSK-GAT (Visual-Skeletal Geometric Attention Transformer), a novel multimodal deep learning framework designed to classify musculoskeletal risk using visual and skeletal coordinate-based features. In addition, a custom multimodal dataset is constructed by combining visual data and skeletal coordinates for risk assessment. Each sample is labeled into eight risk categories based on the Rapid Entire Body Assessment system. ViSK-GAT combines a Residual Block with a Lightweight Transformer Block to learn spatial and temporal dependencies jointly. It incorporates two novel modules: the Fine-Grained Attention Module (FGAM), which enables precise inter-modal feature refinement through cross-attention between visual and skeletal inputs, and the Multimodal Geometric Correspondence Module (MGCM), which enhances cross-modal coherence by aligning image features with coordinate-based representations. ViSK-GAT achieved strong performance with validation and test accuracies of 93.55% and 93.89%, respectively; a precision of 93.86%; an F1 score of 93.85%; and Cohen’s Kappa and Matthews Correlation Coefficient of 93%. The regression results also indicated a low Root Mean Square Error of the predicted probability distribution of 0.1205 and a corresponding Mean Absolute Error of 0.0156. Compared to nine popular transfer learning backbones, ViSK-GAT consistently outperformed previous methods. The ViSK-GAT model advances artificial intelligence implementation and application, transforming musculoskeletal risk classification and enabling impactful early interventions in sports.
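Cohen's Kappa, one of the metrics reported above, corrects raw accuracy for the agreement expected by chance. The sketch below computes it from a confusion matrix with the standard formula; the function name is hypothetical and any matrix passed to it here is made-up illustration data, not the paper's results.

```python
def cohens_kappa(confusion):
    """Cohen's Kappa from a square confusion matrix (rows: truth, columns: prediction)."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(n_classes)) / total
    # Chance agreement: product of row and column marginals, summed over classes.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(n_classes)
    ) / (total * total)
    return (observed - expected) / (1.0 - expected)
```

For a balanced two-class matrix `[[45, 5], [5, 45]]`, accuracy is 0.9 but chance agreement is 0.5, so Kappa is 0.8; this is why Kappa can sit below accuracy on the same predictions.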
[CV-113] BTCChat: Advancing Remote Sensing Bi-temporal Change Captioning with Multimodal Large Language Model ICASSP2026
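【Quick Read】: This paper addresses the inadequate modeling of temporal correlations and spatial semantic changes by existing Multimodal Large Language Models (MLLMs) in bi-temporal change analysis, which limits visual-semantic alignment and overall performance. The key to the solution is the BTCChat framework: a Change Extraction module captures temporal features and spatial semantic changes in image pairs, and a Prompt Augmentation mechanism injects contextual clues into the prompt to strengthen the model's attention to spatial details. BTCChat achieves state-of-the-art performance on change captioning and visual question answering.
Link: https://arxiv.org/abs/2509.05895
Authors: Yujie Li,Wenjia Xu,Yuanben Zhang,Zhiwei Wei,Mugen Peng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 2 figures. Submitted to ICASSP 2026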
Abstract:Bi-temporal satellite imagery supports critical applications such as urban development monitoring and disaster assessment. Although powerful multimodal large language models (MLLMs) have been applied in bi-temporal change analysis, previous methods process image pairs through direct concatenation, inadequately modeling temporal correlations and spatial semantic changes. This deficiency hampers visual-semantic alignment in change understanding, thereby constraining the overall effectiveness of current approaches. To address this gap, we propose BTCChat, a multi-temporal MLLM with advanced bi-temporal change understanding capability. BTCChat supports bi-temporal change captioning and retains single-image interpretation capability. To better capture temporal features and spatial semantic changes in image pairs, we design a Change Extraction module. Moreover, to enhance the model’s attention to spatial details, we introduce a Prompt Augmentation mechanism, which incorporates contextual clues into the prompt to enhance model performance. Experimental results demonstrate that BTCChat achieves state-of-the-art performance on change captioning and visual question answering tasks.
[CV-114] Challenges in Deep Learning-Based Small Organ Segmentation: A Benchmarking Perspective for Medical Research with Limited Datasets
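[Quick Read]: This paper addresses accurate segmentation of carotid artery structures in cardiovascular histopathological images. The core challenge is that scarce annotated data makes deep learning models hard to train and evaluate. The key to its approach is a systematic evaluation of several state-of-the-art segmentation models (U-Net, DeepLabV3+, SegFormer, and SAM-family foundation models) on a limited dataset, with Bayesian hyperparameter optimization to improve model performance. The study finds, however, that model performance is highly sensitive to how the data is split, with differences driven mainly by statistical noise rather than genuine algorithmic superiority. This exposes the limits of conventional benchmarking in low-data clinical settings and calls into question whether existing performance rankings reflect real clinical utility.
Link: https://arxiv.org/abs/2509.05892
Authors: Phongsakon Mark Konrad, Andrei-Alexandru Popa, Yaser Sabzehmeidani, Liang Zhong, Elisa A. Liehn, Serkan Ayvaz
Affiliations: University of Southern Denmark; Duke-NUS Medical School
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: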
Abstract:Accurate segmentation of carotid artery structures in histopathological images is vital for advancing cardiovascular disease research and diagnosis. However, deep learning model development in this domain is constrained by the scarcity of annotated cardiovascular histopathological data. This study investigates a systematic evaluation of state-of-the-art deep learning segmentation models, including convolutional neural networks (U-Net, DeepLabV3+), a Vision Transformer (SegFormer), and recent foundation models (SAM, MedSAM, MedSAM+UNet), on a limited dataset of cardiovascular histology images. Despite employing an extensive hyperparameter optimization strategy with Bayesian search, our findings reveal that model performance is highly sensitive to data splits, with minor differences driven more by statistical noise than by true algorithmic superiority. This instability exposes the limitations of standard benchmarking practices in low-data clinical settings and challenges the assumption that performance rankings reflect meaningful clinical utility.
[CV-115] Near Real-Time Dust Aerosol Detection with 3D Convolutional Neural Networks on MODIS Data
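[Quick Read]: This paper addresses the need for rapid satellite-based detection of dust storms, which harm health and reduce visibility. The key to its solution is a near real-time, pixel-level dust detection system built on multi-band MODIS imagery from NASA's Terra and Aqua satellites: a 3D convolutional network jointly learns spatial and spectral features across all 36 bands plus the split thermal bands to separate dust from clouds and surface features. Simple normalization and local filling handle missing data, and an improved version raises training speed by 21x and supports fast processing of full scenes. On 17 independent MODIS scenes the model reaches about 0.92 accuracy with a mean squared error of 0.014, which supports the feasibility of timely dust alerts at global scale.
Link: https://arxiv.org/abs/2509.05887
Authors: Caleb Gates, Patrick Moorhead, Jayden Ferguson, Omar Darwish, Conner Stallman, Pablo Rivas, Paapa Quansah
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 29th International Conference on Image Processing, Computer Vision, Pattern Recognition (IPCV’25)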
Abstract:Dust storms harm health and reduce visibility; quick detection from satellites is needed. We present a near real-time system that flags dust at the pixel level using multi-band images from NASA’s Terra and Aqua (MODIS). A 3D convolutional network learns patterns across all 36 bands, plus split thermal bands, to separate dust from clouds and surface features. Simple normalization and local filling handle missing data. An improved version raises training speed by 21x and supports fast processing of full scenes. On 17 independent MODIS scenes, the model reaches about 0.92 accuracy with a mean squared error of 0.014. Maps show strong agreement in plume cores, with most misses along edges. These results show that joint band-and-space learning can provide timely dust alerts at global scale; using wider input windows or attention-based models may further sharpen edges.
[CV-116] Performance of Conformal Prediction in Capturing Aleatoric Uncertainty
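[Quick Read]: This paper asks whether conformal predictors reliably quantify aleatoric uncertainty, that is, the inherent ambiguity in a dataset caused by overlapping classes. The literature holds that the size of a conformal prediction set should reflect this uncertainty, for example by producing larger sets for difficult instances, but empirical validation has been missing. The authors therefore measure the correlation between prediction set size and the number of distinct labels that human annotators assign to each instance, and further compare the prediction sets with the human annotations. The key to the approach is the use of multi-annotator datasets (5 to 50 annotators per instance), which make class overlap explicit, together with a systematic evaluation of three conformal prediction methods on eight deep learning models. In the vast majority of cases the correlation between set size and human annotations is very weak to weak, with only a few cases reaching moderate correlation. This indicates that current conformal prediction methods have limited ability to capture aleatoric uncertainty and that their suitability for this purpose needs to be reassessed.
Link: https://arxiv.org/abs/2509.05826
Authors: Misgina Tsighe Hagos, Claes Lundström
Affiliations: Linköping University; Sectra AB
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: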
Abstract:Conformal prediction is a model-agnostic approach to generating prediction sets that cover the true class with a high probability. Although its prediction set size is expected to capture aleatoric uncertainty, there is a lack of evidence regarding its effectiveness. The literature presents that prediction set size can upper-bound aleatoric uncertainty or that prediction sets are larger for difficult instances and smaller for easy ones, but a validation of this attribute of conformal predictors is missing. This work investigates how effectively conformal predictors quantify aleatoric uncertainty, specifically the inherent ambiguity in datasets caused by overlapping classes. We perform this by measuring the correlation between prediction set sizes and the number of distinct labels assigned by human annotators per instance. We further assess the similarity between prediction sets and human-provided annotations. We use three conformal prediction approaches to generate prediction sets for eight deep learning models trained on four datasets. The datasets contain annotations from multiple human annotators (ranging from five to fifty participants) per instance, enabling the identification of class overlap. We show that the vast majority of the conformal prediction outputs show a very weak to weak correlation with human annotations, with only a few showing moderate correlation. These findings underscore the necessity of critically reassessing the prediction sets generated using conformal predictors. While they can provide a higher coverage of the true classes, their capability in capturing aleatoric uncertainty remains limited.
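As a rough illustration of the evaluation protocol described above, the sketch below builds split-conformal prediction sets and computes a Spearman rank correlation between set size and the number of distinct annotator labels. The nonconformity score (one minus the softmax probability of the true class), the function names, and the toy numbers are assumptions made for illustration; the abstract does not name the three conformal methods the paper evaluates.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Split-conformal threshold for the score 1 - p(true class)."""
    scores = sorted(1.0 - probs[y] for probs, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))  # finite-sample corrected rank
    # With too few calibration points no finite threshold suffices,
    # so fall back to including every class.
    return scores[k - 1] if k <= n else float("inf")

def prediction_set(probs, qhat):
    """Classes whose score 1 - p does not exceed the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= qhat]

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0.0 or vy == 0.0:
        return float("nan")  # undefined when one variable is constant
    return cov / math.sqrt(vx * vy)

# Toy calibration data (hypothetical): softmax outputs and true labels.
cal_probs = [[0.9, 0.1], [0.5, 0.5], [0.4, 0.6], [0.3, 0.7]]
cal_labels = [0, 0, 0, 1]
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.5)

test_probs = [[0.9, 0.1], [0.5, 0.5], [0.25, 0.75]]
set_sizes = [len(prediction_set(p, qhat)) for p in test_probs]
distinct_human_labels = [1, 2, 1]  # hypothetical annotator disagreement
rho = spearman(set_sizes, distinct_human_labels)
```

A value of `rho` near 1 would mean larger sets coincide with more annotator disagreement; the paper reports that on real data this correlation is mostly very weak to weak.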
[CV-117] A Probabilistic Segment Anything Model for Ambiguity-Aware Medical Image Segmentation
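[Quick Read]: This paper addresses the deterministic output of current promptable segmentation models such as the Segment Anything Model (SAM): for a given input and prompt the model produces a single segmentation, so it cannot capture the inherent ambiguity that annotation uncertainty and inter-expert variability create in medical images. The authors propose Probabilistic SAM, whose core idea is to introduce a latent variable space and learn, with a variational training objective, a distribution over segmentations conditioned on the image and the prompt, so that the model generates diverse and plausible masks. The key to the solution is integrating a prior network and a posterior network into the SAM framework so that latent codes can modulate the prompt embeddings at inference time, giving uncertainty-aware outputs with minimal overhead.
Link: https://arxiv.org/abs/2509.05809
Authors: Tyler Ward, Abdullah Imran
Affiliations: University of Kentucky
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint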
Abstract:Recent advances in promptable segmentation, such as the Segment Anything Model (SAM), have enabled flexible, high-quality mask generation across a wide range of visual domains. However, SAM and similar models remain fundamentally deterministic, producing a single segmentation per object per prompt, and fail to capture the inherent ambiguity present in many real-world tasks. This limitation is particularly troublesome in medical imaging, where multiple plausible segmentations may exist due to annotation uncertainty or inter-expert variability. In this paper, we introduce Probabilistic SAM, a probabilistic extension of SAM that models a distribution over segmentations conditioned on both the input image and prompt. By incorporating a latent variable space and training with a variational objective, our model learns to generate diverse and plausible segmentation masks reflecting the variability in human annotations. The architecture integrates a prior and posterior network into the SAM framework, allowing latent codes to modulate the prompt embeddings during inference. The latent space allows for efficient sampling during inference, enabling uncertainty-aware outputs with minimal overhead. We evaluate Probabilistic SAM on the public LIDC-IDRI lung nodule dataset and demonstrate its ability to produce diverse outputs that align with expert disagreement, outperforming existing probabilistic baselines on uncertainty-aware metrics. Our code is available at: this https URL.
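The abstract does not spell out the variational objective. A common choice for models with a prior and a posterior network (as in conditional VAEs and the Probabilistic U-Net) is a reconstruction loss plus a weighted KL divergence between two diagonal Gaussians, and the sketch below assumes that form. The function names, the weight `beta`, and the input conventions are illustrative assumptions, not the paper's implementation.

```python
import math

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) for diagonal Gaussians given means and log-variances."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        # Per-dimension closed form: 0.5 * (log(vp/vq) + (vq + (mq-mp)^2)/vp - 1)
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total

def variational_loss(recon_loss, mu_q, logvar_q, mu_p, logvar_p, beta=1.0):
    """Negative-ELBO-style loss: reconstruction term plus beta-weighted KL."""
    return recon_loss + beta * kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p)
```

In this kind of setup the posterior network sees the ground-truth mask during training, while only the prior network is sampled at inference time to produce multiple plausible masks.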
[CV-118] Dual-Mode Deep Anomaly Detection for Medical Manufacturing: Structural Similarity and Feature Distance
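[Quick Read]: This paper addresses three challenges in automating visual inspection for medical device manufacturing: small and imbalanced datasets, high-resolution imagery, and stringent regulatory requirements. The key to its solution is two attention-guided autoencoder architectures, each tuned for a different use case. The first uses a structural-similarity-based anomaly score (4-MS-SSIM) for lightweight real-time defect detection and reaches an accuracy of 0.931 with supervised thresholding on a test split containing only 10% defective samples, which suits inline inspection. The second applies Mahalanobis scoring to reduced latent features to capture distributional shifts, enabling sensitive post-production monitoring with an accuracy of 0.722 (supervised thresholding) and supporting regulatory compliance monitoring. Together they cover real-time inspection through scalable post-production surveillance, in line with the obligations the EU AI Act defines for high-risk AI systems.
Link: https://arxiv.org/abs/2509.05796
Authors: Julio Zanon Diaz, Georgios Siogkas, Peter Corcoran
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, 5 figures, 13 tables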
Abstract:Automating visual inspection in medical device manufacturing remains challenging due to small and imbalanced datasets, high-resolution imagery, and stringent regulatory requirements. This work proposes two attention-guided autoencoder architectures for deep anomaly detection designed to address these constraints. The first employs a structural similarity-based anomaly score (4-MS-SSIM), offering lightweight and accurate real-time defect detection, yielding ACC 0.903 (unsupervised thresholding) and 0.931 (supervised thresholding) on the Surface Seal Image test split with only 10% of defective samples. The second applies a feature-distance approach using Mahalanobis scoring on reduced latent features, providing high sensitivity to distributional shifts for supervisory monitoring, achieving ACC 0.722 with supervised thresholding. Together, these methods deliver complementary capabilities: the first supports reliable inline inspection, while the second enables scalable post-production surveillance and regulatory compliance monitoring. Experimental results demonstrate that both approaches surpass re-implemented baselines and provide a practical pathway for deploying deep anomaly detection in regulated manufacturing environments, aligning accuracy, efficiency, and the regulatory obligations defined for high-risk AI systems under the EU AI Act.
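The feature-distance idea in the second approach can be sketched as follows: fit a mean and covariance on latent features of defect-free samples, then score a new sample by its squared Mahalanobis distance. This is a minimal pure-Python sketch; the 2-D toy latents, the helper names, and the threshold value are hypothetical, and the abstract does not describe how the latent features are reduced or how the threshold is chosen.

```python
def mean_and_covariance(rows):
    """Sample mean and unbiased covariance of a list of feature vectors."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows) / (n - 1)
            for b in range(d)] for a in range(d)]
    return mean, cov

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    d = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(d)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance is singular; add regularisation")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, d):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, d + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * d
    for i in reversed(range(d)):
        x[i] = (aug[i][d] - sum(aug[i][j] * x[j] for j in range(i + 1, d))) / aug[i][i]
    return x

def mahalanobis_sq(z, mean, cov):
    """Squared Mahalanobis distance (z - mean)^T cov^{-1} (z - mean)."""
    diff = [a - b for a, b in zip(z, mean)]
    return sum(d * s for d, s in zip(diff, solve(cov, diff)))

# Hypothetical latent features of defect-free samples.
normal_latents = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
mean, cov = mean_and_covariance(normal_latents)

score = mahalanobis_sq([3.0, 1.0], mean, cov)
threshold = 2.5  # hypothetical; in practice tuned on validation data
is_anomaly = score > threshold
```

Solving the linear system instead of inverting the covariance keeps the sketch short; with real high-dimensional latents a regularised covariance estimate would usually be needed.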
[CV-119] CRAB: Camera-Radar Fusion for Reducing Depth Ambiguity in Backward Projection based View Transformation ICRA2025
[Quick Read]: This paper addresses false positives caused by depth ambiguity in backward-projection-based camera-radar fusion for bird's-eye-view (BEV) 3D object detection. Existing backward-projection methods lack precise depth information and overlook depth ambiguity, while forward-projection methods generate sparse BEV features. The key to the solution is a new model named CRAB (Camera-Radar fusion for reducing depth Ambiguity in Backward projection-based view transformation). Its core idea is to combine the dense but unreliable depth distribution from images with the sparse yet precise depth information from radar occupancy, which improves depth distinction among BEV queries along the same ray. Spatial cross-attention with a feature map containing radar context information further improves 3D scene understanding. On the nuScenes dataset the model achieves state-of-the-art performance among backward-projection-based camera-radar fusion methods (62.4% NDS and 54.0% mAP).
Link: https://arxiv.org/abs/2509.05785
Authors: In-Jae Lee, Sihwan Hwang, Youngseok Kim, Wonjune Kim, Sanmin Kim, Dongsuk Kum
Affiliations: Interdisciplinary Program in Artificial Intelligence, Seoul National University; Cho Chun Shik Graduate School of Mobility, KAIST; 42dot Inc; Electronics and Telecommunications Research Institute; Department of Automobile and IT Convergence, Kookmin University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICRA 2025
Abstract:Recently, camera-radar fusion-based 3D object detection methods in bird’s eye view (BEV) have gained attention due to the complementary characteristics and cost-effectiveness of these sensors. Previous approaches using forward projection struggle with sparse BEV feature generation, while those employing backward projection overlook depth ambiguity, leading to false positives. In this paper, to address the aforementioned limitations, we propose a novel camera-radar fusion-based 3D object detection and segmentation model named CRAB (Camera-Radar fusion for reducing depth Ambiguity in Backward projection-based view transformation), using a backward projection that leverages radar to mitigate depth ambiguity. During the view transformation, CRAB aggregates perspective view image context features into BEV queries. It improves depth distinction among queries along the same ray by combining the dense but unreliable depth distribution from images with the sparse yet precise depth information from radar occupancy. We further introduce spatial cross-attention with a feature map containing radar context information to enhance the comprehension of the 3D scene. When evaluated on the nuScenes open dataset, our proposed approach achieves a state-of-the-art performance among backward projection-based camera-radar fusion methods with 62.4% NDS and 54.0% mAP in 3D object detection.
[CV-120] 3DPillars: Pillar-based two-stage 3D object detection
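[Quick Read]: This paper addresses two limitations of PointPillars in 3D object detection: its pseudo image representation fails to preserve precise 3D structure, and it is hard to combine with a two-stage detection pipeline (detection based on 3D proposals), which is typically more accurate than a single-stage approach. The key to the solution is a new two-stage 3D detection framework with two components. First, a new CNN architecture called 3DPillars treats 3D voxel features as a stack of pseudo images and uses a separable voxel feature module to extract voxel-based features efficiently without 3D convolutions. Second, an RoI head with a sparse scene context feature module aggregates multi-scale features from 3DPillars into a sparse scene feature, which makes the two-stage pipeline practical and uses scene context to refine 3D proposals. The approach keeps the efficiency of PointPillars while improving detection performance, achieving a good balance of speed and accuracy on the KITTI and Waymo Open datasets.
Link: https://arxiv.org/abs/2509.05780
Authors: Jongyoun Noh, Junghyup Lee, Hyekang Park, Bumsub Ham
Affiliations: Samsung; Yonsei University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 11 figures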
Abstract:PointPillars is the fastest 3D object detector that exploits pseudo image representations to encode features for 3D objects in a scene. Albeit efficient, PointPillars is typically outperformed by state-of-the-art 3D detection methods due to the following limitations: 1) The pseudo image representations fail to preserve precise 3D structures, and 2) they make it difficult to adopt a two-stage detection pipeline using 3D object proposals that typically shows better performance than a single-stage approach. We introduce in this paper the first two-stage 3D detection framework exploiting pseudo image representations, narrowing the performance gaps between PointPillars and state-of-the-art methods, while retaining its efficiency. Our framework consists of two novel components that overcome the aforementioned limitations of PointPillars: First, we introduce a new CNN architecture, dubbed 3DPillars, that enables learning 3D voxel-based features from the pseudo image representation efficiently using 2D convolutions. The basic idea behind 3DPillars is that 3D features from voxels can be viewed as a stack of pseudo images. To implement this idea, we propose a separable voxel feature module that extracts voxel-based features without using 3D convolutions. Second, we introduce an RoI head with a sparse scene context feature module that aggregates multi-scale features from 3DPillars to obtain a sparse scene feature. This enables adopting a two-stage pipeline effectively, and fully leveraging contextual information of a scene to refine 3D object proposals. Experimental results on the KITTI and Waymo Open datasets demonstrate the effectiveness and efficiency of our approach, achieving a good compromise in terms of speed and accuracy.
[CV-121] Posterior shape models revisited: Improving 3D reconstructions from partial data using target specific models
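[Quick Read]: This paper addresses reconstruction bias in partial shape reconstruction with point distribution models (PDMs) in medical imaging. The bias arises from pose misalignment between the training data and the target shape and is especially pronounced when only a small part of the shape is observed. The key to the solution is an efficient method that adjusts the pose of an existing linear shape model to a specific target shape without access to the original training data. The method exactly recovers the intended aligned model for translations, gives a good approximation for small rotations, and preserves the computational efficiency of the original model. Existing shape models can therefore be adapted by a simple preprocessing step, which suits plug-and-play reconstruction pipelines.
Link: https://arxiv.org/abs/2509.05776
Authors: Jonathan Aellen, Florian Burkhardt, Thomas Vetter, Marcel Lüthi
Affiliations: University of Basel
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: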
Abstract:In medical imaging, point distribution models are often used to reconstruct and complete partial shapes using a statistical model of the full shape. A commonly overlooked, but crucial factor in this reconstruction process, is the pose of the training data relative to the partial target shape. A difference in pose alignment of the training and target shape leads to biased solutions, particularly when observing small parts of a shape. In this paper, we demonstrate the importance of pose alignment for partial shape reconstructions and propose an efficient method to adjust an existing model to a specific target. Our method preserves the computational efficiency of linear models while significantly improving reconstruction accuracy and predicted variance. It exactly recovers the intended aligned model for translations, and provides a good approximation for small rotations, all without access to the original training data. Hence, existing shape models in reconstruction pipelines can be adapted by a simple preprocessing step, making our approach widely applicable in plug-and-play scenarios.
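For context, the standard posterior shape model for a low-rank Gaussian PDM can be written as below. This is the textbook Gaussian-conditioning result under an assumed isotropic observation noise; the symbols ($\mu$, $Q$, $\alpha$, $\sigma^2$, and the subscript $g$ for the observed part of the shape) are chosen here for illustration and do not come from the abstract. The paper's pose-adjustment step is not reproduced.

```latex
% Shape model:          s   = \mu + Q\alpha,            \alpha \sim \mathcal{N}(0, I)
% Partial observation:  s_g = \mu_g + Q_g\alpha + \varepsilon,  \varepsilon \sim \mathcal{N}(0, \sigma^2 I)
\begin{align}
M &= Q_g^\top Q_g + \sigma^2 I, \\
\mathbb{E}[\alpha \mid s_g] &= M^{-1} Q_g^\top (s_g - \mu_g), \\
\operatorname{Cov}[\alpha \mid s_g] &= \sigma^2 M^{-1}.
\end{align}
```

Because $\mu_g$ and $Q_g$ are expressed in the pose of the training data, a target observed in a different pose changes the residual $s_g - \mu_g$ and hence the posterior mean, which is consistent with the bias the abstract describes for small observed regions.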
[CV-122] PictOBI-20k: Unveiling Large Multimodal Models in Visual Decipherment for Pictographic Oracle Bone Characters
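[Quick Read]: This paper addresses the difficulty of visually deciphering oracle bone characters (OBCs), which stems from the sporadic nature of archaeological excavations and the limited corpus of inscriptions. The key to its solution is PictOBI-20k, a large-scale multimodal dataset of 20,000 carefully collected OBC and real object images that forms more than 15,000 multiple-choice questions for evaluating large multimodal models (LMMs) on visual decipherment. The study also uses subjective annotations to analyze how consistently humans and LMMs rely on the same reference points in visual reasoning. It finds that general-purpose LMMs have preliminary visual decipherment ability but are mostly limited by language priors and do not use visual information effectively. The dataset thus provides a benchmark and a direction for optimizing visual attention in future OBC-oriented LMMs.
Link: https://arxiv.org/abs/2509.05773
Authors: Zijian Chen, Wenjie Hua, Jinhao Li, Lirong Deng, Fan Du, Tingzhu Chen, Guangtao Zhai
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 6 figures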
Abstract:Deciphering oracle bone characters (OBCs), the oldest attested form of written Chinese, has remained the ultimate, unwavering goal of scholars, offering an irreplaceable key to understanding humanity’s early modes of production. Current decipherment methodologies of OBC are primarily constrained by the sporadic nature of archaeological excavations and the limited corpus of inscriptions. With the powerful visual perception capability of large multimodal models (LMMs), the potential of using LMMs for visually deciphering OBCs has increased. In this paper, we introduce PictOBI-20k, a dataset designed to evaluate LMMs on the visual decipherment tasks of pictographic OBCs. It includes 20k meticulously collected OBC and real object images, forming over 15k multi-choice questions. We also conduct subjective annotations to investigate the consistency of the reference point between humans and LMMs in visual reasoning. Experiments indicate that general LMMs possess preliminary visual decipherment skills, and LMMs are not effectively using visual information, while most of the time they are limited by language priors. We hope that our dataset can facilitate the evaluation and optimization of visual attention in future OBC-oriented LMMs. The code and dataset will be available at this https URL.
[CV-123] Tell-Tale Watermarks for Explanatory Reasoning in Synthetic Media Forensics
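[Quick Read]: This paper addresses the forensic difficulty created by the many editing operations (semantic edits, photometric adjustments, and geometric projections) applied to synthetic media during generation and distribution, and in particular how to trace the generation chain to reveal the presence or absence of criminal intent. The key to the solution is a tell-tale watermarking system. The watermarks are tailored to different classes of transformations and are designed to be interpretable rather than robust or fragile in the traditional sense. They evolve together with the carrier media and leave interpretable traces after a transformation, so that explanatory reasoning can infer the most plausible generation path across the combinatorial parameter space of composite transformations.
Link: https://arxiv.org/abs/2509.05753
Authors: Ching-Chun Chang, Isao Echizen
Affiliations: National Institute of Informatics
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: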
Abstract:The rise of synthetic media has blurred the boundary between reality and fabrication under the evolving power of artificial intelligence, fueling an infodemic that erodes public trust in cyberspace. For digital imagery, a multitude of editing applications further complicates the forensic analysis, including semantic edits that alter content, photometric adjustments that recalibrate colour characteristics, and geometric projections that reshape viewpoints. Collectively, these transformations manipulate and control perceptual interpretation of digital imagery. This susceptibility calls for forensic enquiry into reconstructing the chain of events, thereby revealing deeper evidential insight into the presence or absence of criminal intent. This study seeks to address an inverse problem of tracing the underlying generation chain that gives rise to the observed synthetic media. A tell-tale watermarking system is developed for explanatory reasoning over the nature and extent of transformations across the lifecycle of synthetic media. Tell-tale watermarks are tailored to different classes of transformations, responding in a manner that is neither strictly robust nor fragile but instead interpretable. These watermarks function as reference clues that evolve under the same transformation dynamics as the carrier media, leaving interpretable traces when subjected to transformations. Explanatory reasoning is then performed to infer the most plausible account across the combinatorial parameter space of composite transformations. Experimental evaluations demonstrate the validity of tell-tale watermarking with respect to fidelity, synchronicity and traceability.
[CV-124] Unleashing Hierarchical Reasoning: An LLM-Driven Framework for Training-Free Referring Video Object Segmentation
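[Quick Read]: This paper addresses the difficulty of aligning static text with dynamic visual content in Referring Video Object Segmentation (RVOS), especially when objects look similar but differ in motion and pose. Existing methods usually rely on holistic vision-language fusion, which struggles with complex, compositional descriptions. The key to the solution is PARSE-VOS, a training-free framework that uses Large Language Models (LLMs) for hierarchical, coarse-to-fine reasoning across the text and video domains. It first parses the natural language query into structured semantic commands, then uses a spatio-temporal grounding module to generate candidate trajectories for all potential target objects. Finally, a two-stage identification mechanism selects the target: the LLM first performs coarse-grained motion reasoning to narrow down the candidates, and if ambiguity remains, a fine-grained pose verification stage is triggered to disambiguate, yielding an accurate segmentation mask for the target object.
Link: https://arxiv.org/abs/2509.05751
Authors: Bingrui Zhao, Lin Yuanbo Wu, Xiangtian Fan, Deyin Liu, Lu Zhang, Ruyi He, Jialie Shen, Ximing Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: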
Abstract:Referring Video Object Segmentation (RVOS) aims to segment an object of interest throughout a video based on a language description. The prominent challenge lies in aligning static text with dynamic visual content, particularly when objects exhibit similar appearances with inconsistent motion and poses. However, current methods often rely on a holistic visual-language fusion that struggles with complex, compositional descriptions. In this paper, we propose PARSE-VOS, a novel, training-free framework powered by Large Language Models (LLMs), for hierarchical, coarse-to-fine reasoning across text and video domains. Our approach begins by parsing the natural language query into structured semantic commands. Next, we introduce a spatio-temporal grounding module that generates all candidate trajectories for all potential target objects, guided by the parsed semantics. Finally, a hierarchical identification module selects the correct target through a two-stage reasoning process: it first performs coarse-grained motion reasoning with an LLM to narrow down candidates; if ambiguity remains, a fine-grained pose verification stage is conditionally triggered to disambiguate. The final output is an accurate segmentation mask for the target object. PARSE-VOS achieved state-of-the-art performance on three major benchmarks: Ref-YouTube-VOS, Ref-DAVIS17, and MeViS.
[CV-125] InterAct: A Large-Scale Dataset of Dynamic Expressive and Interactive Activities between Two People in Daily Scenarios
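[Quick Read]: This paper addresses the accurate capture of interactive behaviors between two people in daily scenarios. Existing methods usually consider only one person or only conversational gestures, and assume that each participant's body orientation and position barely change, so they cannot model objective-driven, dynamic, and semantically consistent long-term interactions. The key to the solution is InterAct, a new multimodal dataset of 241 motion sequences in which two people carry out a collaborative task for one minute or longer. It covers audio, body motion, and facial expressions, and captures complex, long-duration interaction patterns over a large space that have rarely been recorded before. The paper also presents a diffusion-based method that jointly estimates the facial expressions and body motions of both people from speech input, regressing body motion hierarchically and using a novel fine-tuning mechanism to improve lip accuracy.
Link: https://arxiv.org/abs/2509.05747
Authors: Leo Ho, Yinghao Huang, Dafei Qin, Mingyi Shi, Wangpok Tse, Wei Liu, Junichi Yamagishi, Taku Komura
Affiliations: The University of Hong Kong; Great Bay University; Shandong University; National Institute of Informatics; Centre for Transformative Garment Production
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments: The first two authors contributed equally to this work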
Abstract:We address the problem of accurately capturing interactive behaviors between two people in daily scenarios. Most previous works either consider only one person or focus solely on the conversational gestures of two people, assuming the body orientation and/or position of each actor is constant or barely changes over each interaction. In contrast, we propose to simultaneously model two people’s activities, and target objective-driven, dynamic, and semantically consistent interactions, which often span a longer duration and cover a bigger space. To this end, we capture a new multi-modal dataset dubbed InterAct, which is composed of 241 motion sequences where two people perform a realistic and coherent scenario for one minute or longer over a complete interaction. For each sequence, two actors are assigned different roles and emotion labels, and collaborate to finish one task or conduct a common interaction activity. The audio, body motions, and facial expressions of both persons are captured. InterAct contains diverse and complex motions of individuals, as well as interesting and relatively long-term interaction patterns barely seen before. We also demonstrate a simple yet effective diffusion-based method that estimates interactive facial expressions and body motions of two people from speech inputs. Our method regresses the body motions in a hierarchical manner, and we also propose a novel fine-tuning mechanism to improve the lip accuracy of facial expressions. To facilitate further research, the data and code are made available at this https URL.
[CV-126] Depth-Aware Super-Resolution via Distance-Adaptive Variational Formulation
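[Quick Read]: This paper addresses the performance bottleneck that traditional single image super-resolution (SISR) methods face in real imaging systems because they assume a spatially invariant degradation model, and in particular the depth-dependent degradation caused by distance-related effects such as atmospheric scattering, depth-of-field variation, and perspective distortion. The key to the solution is a rigorous variational framework that treats super-resolution as a spatially varying inverse problem and formalizes the degradation operator as a pseudodifferential operator with distance-dependent spectral characteristics, which enables theoretical analysis of reconstruction limits across depth ranges. The neural architecture uses cascaded residual blocks with depth-conditional convolution kernels to implement discrete gradient flow dynamics toward stationary points of the theoretical energy functional, and adds learned distance-adaptive regularization terms that adjust smoothness constraints according to local geometric structure. Spectral constraints derived from atmospheric scattering theory prevent bandwidth violations and noise amplification in far-field regions, leading to significant improvements on depth-variant scenes.
Link: https://arxiv.org/abs/2509.05746
Authors: Tianhao Guo, Bingjie Lu, Feng Wang, Zhengyang Lu
Affiliations: Jiangnan University; Central South University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: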
Abstract:Single image super-resolution traditionally assumes spatially-invariant degradation models, yet real-world imaging systems exhibit complex distance-dependent effects including atmospheric scattering, depth-of-field variations, and perspective distortions. This fundamental limitation necessitates spatially-adaptive reconstruction strategies that explicitly incorporate geometric scene understanding for optimal performance. We propose a rigorous variational framework that characterizes super-resolution as a spatially-varying inverse problem, formulating the degradation operator as a pseudodifferential operator with distance-dependent spectral characteristics that enable theoretical analysis of reconstruction limits across depth ranges. Our neural architecture implements discrete gradient flow dynamics through cascaded residual blocks with depth-conditional convolution kernels, ensuring convergence to stationary points of the theoretical energy functional while incorporating learned distance-adaptive regularization terms that dynamically adjust smoothness constraints based on local geometric structure. Spectral constraints derived from atmospheric scattering theory prevent bandwidth violations and noise amplification in far-field regions, while adaptive kernel generation networks learn continuous mappings from depth to reconstruction filters. Comprehensive evaluation across five benchmark datasets demonstrates state-of-the-art performance, achieving 36.89/0.9516 and 30.54/0.8721 PSNR/SSIM at ×2 and ×4 scales on KITTI outdoor scenes, outperforming existing methods by 0.44dB and 0.36dB respectively. This work establishes the first theoretically-grounded distance-adaptive super-resolution framework and demonstrates significant improvements on depth-variant scenarios while maintaining competitive performance across traditional benchmarks.
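To put the reported PSNR gains in perspective: PSNR is a logarithmic function of mean squared error, so a fixed dB gain corresponds to a fixed multiplicative reduction in MSE. The sketch below uses the standard PSNR definition; the peak value of 255 and the helper names are assumptions, since the abstract does not state the evaluation settings.

```python
import math

def psnr(mse, peak=255.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse <= 0.0:
        return float("inf")  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)

ratio_small_scale = mse_ratio_for_gain(0.44)  # first gain reported in the abstract
ratio_large_scale = mse_ratio_for_gain(0.36)  # second gain reported in the abstract
```

Under these definitions a 0.44 dB gain corresponds to roughly a 10% reduction in MSE, and a 0.36 dB gain to roughly 8%.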
[CV-127] Multi-LVI-SAM: A Robust LiDAR-Visual-Inertial Odometry for Multiple Fisheye Cameras
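[Quick Read]: This paper addresses the loss of accuracy and robustness in state estimation that multi-camera LiDAR-visual-inertial odometry (LVIO) systems suffer when visual information from multiple views is fused inconsistently. Its core solution is a panoramic visual feature model that unifies observations from multiple fisheye cameras into a single global representation, so that multi-view constraints can be integrated efficiently within a factor graph framework, with support for seamless loop closure and global pose optimization. An extrinsic compensation method removes the triangulation inconsistency caused by misalignment between each camera's frame and the panoramic model's frame, which improves feature consistency and pose estimation accuracy.
Link: https://arxiv.org/abs/2509.05740
Authors: Xinyu Zhang, Kai Huang, Junqiao Zhao, Zihan Yuan, Tiantian Feng
Affiliations: Tongji University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: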
Abstract:We propose a multi-camera LiDAR-visual-inertial odometry framework, Multi-LVI-SAM, which fuses data from multiple fisheye cameras, LiDAR and inertial sensors for highly accurate and robust state estimation. To enable efficient and consistent integration of visual information from multiple fisheye cameras, we introduce a panoramic visual feature model that unifies multi-camera observations into a single representation. The panoramic model serves as a global geometric optimization framework that consolidates multi-view constraints, enabling seamless loop closure and global pose optimization, while simplifying system design by avoiding redundant handling of individual cameras. To address the triangulation inconsistency caused by the misalignment between each camera’s frame and the panoramic model’s frame, we propose an extrinsic compensation method. This method improves feature consistency across views and significantly reduces triangulation and optimization errors, leading to more accurate pose estimation. We integrate the panoramic visual feature model into a tightly coupled LiDAR-visual-inertial system based on a factor graph. Extensive experiments on public datasets demonstrate that the panoramic visual feature model enhances the quality and consistency of multi-camera constraints, resulting in higher accuracy and robustness than existing multi-camera LiDAR-visual-inertial systems.
[CV-128] LiDAR-BIND-T: Improving SLAM with Temporally Consistent Cross-Modal LiDAR Reconstruction
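[Quick Read]: This paper addresses temporal inconsistency when heterogeneous sensors (such as radar and sonar) are fused with LiDAR, with the goal of improving the robustness and accuracy of LiDAR-based SLAM (Simultaneous Localisation and Mapping). The key to the solution is the improved LiDAR-BIND-T framework, built on three mechanisms: (i) temporal embedding similarity, which aligns consecutive latent representations; (ii) a motion-aligned transformation loss, which matches predicted displacement to ground truth LiDAR displacement; and (iii) windowed temporal fusion using a specialised temporal module. The model architecture is also updated to better preserve spatial structure. These changes improve temporal stability and spatial coherence, lowering absolute trajectory error (ATE) and improving occupancy map accuracy.
Link: https://arxiv.org/abs/2509.05728
Authors: Niels Balemans, Ali Anwar, Jan Steckel, Siegfried Mercelis
Affiliations: IDLab, Faculty of Applied Engineering, University of Antwerp - imec; Cosys-Lab, Faculty of Applied Engineering, University of Antwerp; Flanders Make Strategic Research Centre
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: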
Abstract:This paper extends LiDAR-BIND, a modular multi-modal fusion framework that binds heterogeneous sensors (radar, sonar) to a LiDAR-defined latent space, with mechanisms that explicitly enforce temporal consistency. We introduce three contributions: (i) temporal embedding similarity that aligns consecutive latents, (ii) a motion-aligned transformation loss that matches displacement between predictions and ground truth LiDAR, and (iii) windowed temporal fusion using a specialised temporal module. We further update the model architecture to better preserve spatial structure. Evaluations on radar/sonar-to-LiDAR translation demonstrate improved temporal and spatial coherence, yielding lower absolute trajectory error and better occupancy map accuracy in Cartographer-based SLAM (Simultaneous Localisation and Mapping). We propose different metrics based on the Fréchet Video Motion Distance (FVMD) and a correlation-peak distance metric, providing practical temporal quality indicators to evaluate SLAM performance. The proposed temporal LiDAR-BIND, or LiDAR-BIND-T, maintains plug-and-play modality fusion while substantially enhancing temporal stability, resulting in improved robustness and performance for downstream SLAM.
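Absolute trajectory error, one of the SLAM metrics mentioned above, is commonly reported as the RMSE of position differences between an estimated and a reference trajectory. The sketch below assumes the two trajectories are already time-synchronised and expressed in the same frame (real evaluations usually apply a rigid alignment first); the function name and the toy 2-D positions are illustrative and not taken from the paper.

```python
import math

def ate_rmse(estimated, reference):
    """RMSE of Euclidean position errors between matched trajectory points."""
    if len(estimated) != len(reference) or not estimated:
        raise ValueError("trajectories must be non-empty and equally long")
    squared = [sum((e - r) ** 2 for e, r in zip(p, q))
               for p, q in zip(estimated, reference)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical 2-D positions for three matched timestamps.
reference = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
estimated = [(0.0, 0.0), (1.0, 3.0), (2.0, 4.0)]
error = ate_rmse(estimated, reference)
```

Because the errors are squared before averaging, a few large deviations dominate the metric, so temporally unstable predictions tend to raise it.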
[CV-129] Towards Meta-Cognitive Knowledge Editing for Multimodal LLMs
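Quick read: This paper addresses the lack of evaluation of, and support for, deeper meta-cognitive abilities in knowledge editing for Multimodal Large Language Models (MLLMs). Existing benchmarks focus on cognitive-level modifications and overlook meta-cognitive properties such as a model's reflection on its own knowledge state, boundary control, and robustness to noise. The authors introduce CogEdit, a benchmark that evaluates meta-cognitive editing at three levels: Counterfactual-Driven Editing, Boundary Constraint Editing, and Noise-Robust Editing. The core of the solution is MIND (Meta-cognitive INtegrated Dynamic Knowledge Editing), a framework that builds a meta-knowledge memory for self-awareness, uses game-theoretic interactions to monitor knowledge activation, and applies label refinement for noise-robust updates, yielding significantly better results on both traditional and meta-cognitive knowledge editing benchmarks.
Link: https://arxiv.org/abs/2509.05714
Authors: Zhaoyu Fan, Kaihang Pan, Mingze Zhou, Bosheng Qin, Juncheng Li, Shengyu Zhang, Wenqiao Zhang, Siliang Tang, Fei Wu, Yueting Zhuang
Affiliations: Zhejiang University
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 6 figures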
Abstract:Knowledge editing enables multimodal large language models (MLLMs) to efficiently update outdated or incorrect information. However, existing benchmarks primarily emphasize cognitive-level modifications while lacking a focus on deeper meta-cognitive processes. To bridge this gap, we introduce CogEdit, a novel benchmark designed to evaluate MLLMs’ meta-cognitive knowledge editing abilities across three levels: (1) Counterfactual-Driven Editing, assessing self-awareness of knowledge correctness changes; (2) Boundary Constraint Editing, ensuring appropriate generalization without unintended interference; and (3) Noise-Robust Editing, promoting reflective evaluation of uncertain information. To advance meta-cognitive editing, we propose MIND (Meta-cognitive INtegrated Dynamic Knowledge Editing), a framework that constructs a meta-knowledge memory for self-awareness, employs game-theoretic interactions to monitor knowledge activation, and incorporates label refinement for noise-robust updates. Extensive experiments show that MIND significantly outperforms existing cognitive editing approaches, achieving strong performance on both traditional and meta-cognitive knowledge editing benchmarks.
[CV-130] Knowledge-Augmented Vision Language Models for Underwater Bioacoustic Spectrogram Analysis
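Quick read: This paper addresses the reliance on manual annotation and model retraining in marine mammal vocalization analysis, in particular the fact that Vision Language Models (VLMs) are not trained on domain-specific bioacoustic spectrograms. The key idea is a framework that combines VLM visual interpretation with validation by a Large Language Model (LLM): the VLM extracts patterns from spectrograms and the LLM validates them to build up domain knowledge, enabling adaptation to acoustic data without manual annotation or model retraining.
Link: https://arxiv.org/abs/2509.05703
Authors: Ragib Amin Nihal, Benjamin Yen, Takeshi Ashizawa, Kazuhiro Nakadai
Affiliations: Institute of Science Tokyo; RIKEN
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: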
Abstract:Marine mammal vocalization analysis depends on interpreting bioacoustic spectrograms. Vision Language Models (VLMs) are not trained on these domain-specific visualizations. We investigate whether VLMs can extract meaningful patterns from spectrograms visually. Our framework integrates VLM interpretation with LLM-based validation to build domain knowledge. This enables adaptation to acoustic data without manual annotation or model retraining.
[CV-131] JRN-Geo: A Joint Perception Network based on RGB and Normal images for Cross-view Geo-localization
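Quick read: This paper tackles the drastic viewpoint differences and appearance variations in cross-view geo-localization, and in particular the tendency of existing methods to rely on semantic features from RGB images while neglecting spatial structural information. The key idea is to bring in geometric structure from normal images through a joint perception network (JRN-Geo). It uses a dual-branch feature extraction framework with a Difference-Aware Fusion Module (DAFM) and a Joint-Constrained Interaction Aggregation (JCIA) strategy to achieve deep fusion and a joint-constrained representation of semantic and structural information. A 3D geographic augmentation technique additionally generates samples with potential viewpoint variations, strengthening the model's ability to learn viewpoint-invariant features.
Link: https://arxiv.org/abs/2509.05696
Authors: Hongyu Zhou, Yunzhou Zhang, Tingsong Huang, Fawei Ge, Man Qi, Xichen Zhang, Yizhong Zhang
Affiliations: College of Information Science and Engineering, Northeastern University, Shenyang 110819, China; School of Computer Science, University of Sheffield, Sheffield S1 4DP, United Kingdom
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: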
Abstract:Cross-view geo-localization plays a critical role in Unmanned Aerial Vehicle (UAV) localization and navigation. However, significant challenges arise from the drastic viewpoint differences and appearance variations between images. Existing methods predominantly rely on semantic features from RGB images, often neglecting the importance of spatial structural information in capturing viewpoint-invariant features. To address this issue, we incorporate geometric structural information from normal images and introduce a Joint perception network to integrate RGB and Normal images (JRN-Geo). Our approach utilizes a dual-branch feature extraction framework, leveraging a Difference-Aware Fusion Module (DAFM) and Joint-Constrained Interaction Aggregation (JCIA) strategy to enable deep fusion and joint-constrained semantic and structural information representation. Furthermore, we propose a 3D geographic augmentation technique to generate potential viewpoint variation samples, enhancing the network’s ability to learn viewpoint-invariant features. Extensive experiments on the University-1652 and SUES-200 datasets validate the robustness of our method against complex viewpoint variations, achieving state-of-the-art performance.
[CV-132] Leveraging Vision-Language Large Models for Interpretable Video Action Recognition with Semantic Tokenization
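Quick read: This paper addresses performance bottlenecks in Video Action Recognition (VAR) caused by shallow semantic understanding, weak handling of complex context, and difficulty with fine-grained distinctions, limitations that leave traditional methods struggling on diverse video data. It proposes LVLM-VAR, a framework built on pre-trained Vision-Language Large Models (LVLMs). Its core component is a Video-to-Semantic-Tokens (VST) module that converts raw video sequences into discrete, semantically and temporally consistent "semantic action tokens", forming an "action narrative" that an LVLM can understand. These tokens are combined with natural language instructions and processed by a LoRA-fine-tuned LVLM (e.g., LLaVA-13B) for action classification and semantic reasoning, improving both recognition accuracy and interpretability.
Link: https://arxiv.org/abs/2509.05695
Authors: Jingwei Peng, Zhixuan Qiu, Boyu Jin, Surasakdi Siripong
Affiliations: Tianjin Agricultural University; Walailak University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: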
Abstract:Human action recognition often struggles with deep semantic understanding, complex contextual information, and fine-grained distinction, limitations that traditional methods frequently encounter when dealing with diverse video data. Inspired by the remarkable capabilities of large language models, this paper introduces LVLM-VAR, a novel framework that pioneers the application of pre-trained Vision-Language Large Models (LVLMs) to video action recognition, emphasizing enhanced accuracy and interpretability. Our method features a Video-to-Semantic-Tokens (VST) Module, which innovatively transforms raw video sequences into discrete, semantically and temporally consistent “semantic action tokens,” effectively crafting an “action narrative” that is comprehensible to an LVLM. These tokens, combined with natural language instructions, are then processed by a LoRA-fine-tuned LVLM (e.g., LLaVA-13B) for robust action classification and semantic reasoning. LVLM-VAR not only achieves state-of-the-art or highly competitive performance on challenging benchmarks such as NTU RGB+D and NTU RGB+D 120, demonstrating significant improvements (e.g., 94.1% on NTU RGB+D X-Sub and 90.0% on NTU RGB+D 120 X-Set), but also substantially boosts model interpretability by generating natural language explanations for its predictions.
[CV-133] MeshMetrics: A Precise Implementation of Distance-Based Image Segmentation Metrics
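Quick read: This paper addresses the reproducibility crisis in image segmentation caused by inconsistent implementations of distance-based metrics. Common open-source tools differ considerably in how they implement metrics such as the Hausdorff distance and the normalized surface distance: for the same pair of segmentations, discrepancies can exceed 100 mm for the Hausdorff distance and 30%pt for the normalized surface distance. The proposed solution is MeshMetrics, a mesh-based framework that computes distance-based metrics more precisely than conventional grid-based approaches. Theoretical analysis and empirical validation show that MeshMetrics achieves higher accuracy and precision than established tools and is substantially less affected by discretization artifacts such as distance quantization.
Link: https://arxiv.org/abs/2509.05670
Authors: Gašper Podobnik, Tomaž Vrtovec
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: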
Abstract:The surge of research in image segmentation has yielded remarkable performance gains but also exposed a reproducibility crisis. A major contributor is performance evaluation, where both selection and implementation of metrics play critical roles. While recent efforts have improved the former, the reliability of metric implementation has received far less attention. Pitfalls in distance-based metric implementation can lead to considerable discrepancies between common open-source tools, for instance, exceeding 100 mm for the Hausdorff distance and 30%pt for the normalized surface distance for the same pair of segmentations. To address these pitfalls, we introduce MeshMetrics, a mesh-based framework that provides a more precise computation of distance-based metrics than conventional grid-based approaches. Through theoretical analysis and empirical validation, we demonstrate that MeshMetrics achieves higher accuracy and precision than established tools, and is substantially less affected by discretization artifacts, such as distance quantization. We release MeshMetrics as an open-source Python package, available at this https URL.
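For readers unfamiliar with the metric under discussion, the following is a minimal sketch of the textbook symmetric Hausdorff distance between two finite point sets, computed by brute force. It is not the MeshMetrics implementation (which works on meshes), and the function names are illustrative only.

```python
import math

def directed_hausdorff(a, b):
    """Largest distance from a point in `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance: the larger of the two directed values."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

# Corners of a unit square and the same square shifted 3 units along x.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
shifted = [(x + 3.0, y) for x, y in square]
print(hausdorff(square, shifted))

# The directed variant is asymmetric, a common source of disagreement
# between implementations.
print(directed_hausdorff([(0.0, 0.0)], [(0.0, 0.0), (5.0, 0.0)]))
print(directed_hausdorff([(0.0, 0.0), (5.0, 0.0)], [(0.0, 0.0)]))
```

Because the result is a maximum over nearest-neighbour distances, it is sensitive to how the boundary points are sampled, which is one reason implementations that discretize surfaces differently can disagree.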
[CV-134] Context-Aware Multi-Turn Visual-Textual Reasoning in LVLMs via Dynamic Memory and Adaptive Visual Guidance
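Quick read: This paper addresses the fragmented reasoning, context loss, and hallucinations that Vision-Language Large Models (LVLMs) exhibit in multi-turn interactions requiring deep contextual understanding and complex visual reasoning. It proposes Context-Aware Multi-Turn Visual Reasoning (CAMVR), a framework with two key components: (1) a Visual-Textual Context Memory Unit (VCMU), a dynamic read-write memory that stores and manages the key visual features, textual semantic representations, and cross-modal correspondences from each interaction turn; and (2) an Adaptive Visual Focus Guidance (AVFG) mechanism that uses the context in the VCMU to dynamically steer the visual encoder's attention toward contextually relevant image regions. A multi-level reasoning integration strategy keeps generated responses coherent with both the current input and the accumulated history, improving LVLM performance on multi-turn visual reasoning tasks.
Link: https://arxiv.org/abs/2509.05669
Authors: Weijie Shen, Xinrui Wang, Yuanqi Nie, Apiradee Boonmee
Affiliations: Beihua University; Kasem Bundit University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: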
Abstract:Current Large Language Models (LLMs) and Vision-Language Large Models (LVLMs) excel in single-turn tasks but face significant challenges in multi-turn interactions requiring deep contextual understanding and complex visual reasoning, often leading to fragmented reasoning, context loss, and hallucinations. To address these limitations, we propose Context-Aware Multi-Turn Visual Reasoning (CAMVR), a novel framework designed to empower LVLMs with robust and coherent multi-turn visual-textual inference capabilities. CAMVR introduces two key innovations: a Visual-Textual Context Memory Unit (VCMU), a dynamic read-write memory network that stores and manages critical visual features, textual semantic representations, and their cross-modal correspondences from each interaction turn; and an Adaptive Visual Focus Guidance (AVFG) mechanism, which leverages the VCMU’s context to dynamically adjust the visual encoder’s attention to contextually relevant image regions. Our multi-level reasoning integration strategy ensures that response generation is deeply coherent with both current inputs and accumulated historical context. Extensive experiments on challenging datasets, including VisDial, an adapted A-OKVQA, and our novel Multi-Turn Instruction Following (MTIF) dataset, demonstrate that CAMVR consistently achieves state-of-the-art performance.
[CV-135] WIPUNet: A Physics-inspired Network with Weighted Inductive Biases for Image Denoising
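Quick read: This paper addresses the loss of robustness of image denoisers under strong noise, focusing on the degradation of purely data-driven models at high noise levels. The key idea is to turn the physical priors used in high-energy physics to mitigate "pileup" (conservation, locality, and isolation) into structured inductive biases embedded in neural architectures. The author proposes a hierarchy of denoisers: a residual CNN with conservation constraints, its Gaussian-noise variants, and the Weighted Inductive Pileup-physics-inspired U-Network for Denoising (WIPUNet), built on a UNet backbone. On CIFAR-10 the pileup-inspired CNNs are competitive with standard baselines, while WIPUNet shows a widening margin at higher noise levels; BSD500 experiments show the same trend.
Link: https://arxiv.org/abs/2509.05662
Authors: Wasikul Islam
Affiliations: University of Wisconsin
Categories: Computer Vision and Pattern Recognition (cs.CV); High Energy Physics - Experiment (hep-ex)
Comments: 13 pages, 4 figures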
Abstract:In high-energy particle physics, collider measurements are contaminated by “pileup”, overlapping soft interactions that obscure the hard-scatter signal of interest. Dedicated subtraction strategies exploit physical priors such as conservation, locality, and isolation. Inspired by this analogy, we investigate how such principles can inform image denoising by embedding physics-guided inductive biases into neural architectures. This paper is a proof of concept: rather than targeting state-of-the-art (SOTA) benchmarks, we ask whether physics-inspired priors improve robustness under strong corruption. We introduce a hierarchy of PU-inspired denoisers: a residual CNN with conservation constraints, its Gaussian-noise variants, and the Weighted Inductive Pileup-physics-inspired U-Network for Denoising (WIPUNet), which integrates these ideas into a UNet backbone. On CIFAR-10 with Gaussian noise at σ ∈ {15, 25, 50, 75, 100}, PU-inspired CNNs are competitive with standard baselines, while WIPUNet shows a widening margin at higher noise. Complementary BSD500 experiments show the same trend, suggesting physics-inspired priors provide stability where purely data-driven models degrade. Our contributions are: (i) translating pileup-mitigation principles into modular inductive biases; (ii) integrating them into UNet; and (iii) demonstrating robustness gains at high noise without relying on heavy SOTA machinery.
[CV-136] OOTSM: A Decoupled Linguistic Framework for Effective Scene Graph Anticipation
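Quick read: This paper addresses the limited long-term robustness of Scene Graph Anticipation (SGA) that results from poor integration of commonsense knowledge. Existing methods rely mainly on visual cues and struggle to use human commonsense to understand future scenes. The approach decouples SGA into two steps: a scene graph capturing model first converts a video clip into a sequence of scene graphs, and a purely text-based model then predicts the scene graphs of future frames. The second step is called Linguistic Scene Graph Anticipation (LSGA). For LSGA, the authors introduce an Object-Oriented Two-Staged Method (OOTSM), in which a Large Language Model (LLM) first forecasts object appearances and disappearances and then generates detailed human-object relations. Combined with STTran++, the method raises short-term mean-Recall (@10) by 3.4% and long-term mean-Recall (@50) by 21.9%.
Link: https://arxiv.org/abs/2509.05661
Authors: Xiaomeng Zhu, Changwei Wang, Haozhe Wang, Xinyu Liu, Fangzhen Lin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: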
Abstract:A scene graph is a structured representation of objects and their relationships in a scene. Scene Graph Anticipation (SGA) involves predicting future scene graphs from video clips, enabling applications such as intelligent surveillance and human-machine collaboration. Existing SGA approaches primarily leverage visual cues, often struggling to integrate valuable commonsense knowledge, thereby limiting long-term prediction robustness. To explicitly leverage such commonsense knowledge, we propose a new approach to better understand the objects, concepts, and relationships in a scene graph. Our approach decouples the SGA task into two steps: first, a scene graph capturing model is used to convert a video clip into a sequence of scene graphs; then, a pure text-based model is used to predict scene graphs in future frames. Our focus in this work is on the second step, which we call Linguistic Scene Graph Anticipation (LSGA) and believe to be of independent interest beyond its use in the SGA setting discussed here. For LSGA, we introduce an Object-Oriented Two-Staged Method (OOTSM) where a Large Language Model (LLM) first forecasts object appearances and disappearances before generating detailed human-object relations. We conduct extensive experiments to evaluate OOTSM in two settings. For LSGA, we evaluate our fine-tuned open-sourced LLMs against zero-shot APIs (i.e., GPT-4o, GPT-4o-mini, and DeepSeek-V3) on a benchmark constructed from Action Genome annotations. For SGA, we combine our OOTSM with STTran++, and our experiments demonstrate effective state-of-the-art performance: short-term mean-Recall (@10) increases by 3.4% while long-term mean-Recall (@50) improves dramatically by 21.9%. Code is available at this https URL.
[CV-137] EditIDv2: Editable ID Customization with Data-Lubricated ID Feature Integration for Text-to-Image Generation
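Quick read: This paper addresses the degraded editing capability, semantic understanding biases, and identity consistency breakdowns that existing character editing methods show on high-complexity narrative scenes and long text inputs. It proposes EditIDv2, which tackles editability injection under minimal data lubrication. Through a decomposition of PerceiverAttention, an ID loss, joint dynamic training with the diffusion model, and an offline fusion strategy for the integration module, it achieves deep, multi-level semantic editing in complex narrative environments while maintaining identity consistency. This meets the demands of long prompts and high-quality image generation, and the method performs well in the IBench evaluation.
Link: https://arxiv.org/abs/2509.05659
Authors: Guandong Li, Zhaobin Chu
Affiliations: iFlyTek
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: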
Abstract:We propose EditIDv2, a tuning-free solution specifically designed for high-complexity narrative scenes and long text inputs. Existing character editing methods perform well under simple prompts, but often suffer from degraded editing capabilities, semantic understanding biases, and identity consistency breakdowns when faced with long text narratives containing multiple semantic layers, temporal logic, and complex contextual relationships. In EditID, we analyzed the impact of the ID integration module on editability. In EditIDv2, we further explore and address the influence of the ID feature integration module. The core of EditIDv2 is to discuss the issue of editability injection under minimal data lubrication. Through a sophisticated decomposition of PerceiverAttention, the introduction of ID loss and joint dynamic training with the diffusion model, as well as an offline fusion strategy for the integration module, we achieve deep, multi-level semantic editing while maintaining identity consistency in complex narrative environments using only a small amount of data lubrication. This meets the demands of long prompts and high-quality image generation, and achieves excellent results in the IBench evaluation.
[CV-138] Evaluating YOLO Architectures: Implications for Real-Time Vehicle Detection in Urban Environments of Bangladesh
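Quick read: This paper addresses the poor ability of existing vehicle detection systems to recognize local vehicle types in Bangladesh's road environments. Because these systems are typically trained on non-Bangladeshi datasets, they leave a significant performance gap for autonomous driving technology in developing regions. The authors build a custom dataset of high-resolution images (1920×1080) covering 29 local vehicle classes and evaluate six YOLO variants on it. YOLOv11x performs best at 63.7% mAP@0.5, while the medium-sized models (YOLOv8m and YOLOv11m) offer the best balance between accuracy and inference speed, providing an empirical basis for building robust object detection systems adapted to Bangladeshi traffic.
Link: https://arxiv.org/abs/2509.05652
Authors: Ha Meem Hossain, Pritam Nath, Mahitun Nesa Mahi, Imtiaz Uddin, Ishrat Jahan Eiste, Syed Nasibur Rahman Ratul, Md Naim Uddin Mozumdar, Asif Mohammed Saad
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: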
Abstract:Vehicle detection systems trained on non-Bangladeshi datasets struggle to accurately identify local vehicle types in Bangladesh’s unique road environments, creating critical gaps in autonomous driving technology for developing regions. This study evaluates six YOLO model variants on a custom dataset featuring 29 distinct vehicle classes, including region-specific vehicles such as “Desi Nosimon”, “Leguna”, “Battery Rickshaw”, and “CNG”. The dataset comprises high-resolution images (1920x1080) captured across various Bangladeshi roads using mobile phone cameras and manually annotated using LabelImg with YOLO format bounding boxes. Performance evaluation revealed YOLOv11x as the top performer, achieving 63.7% mAP@0.5, 43.8% mAP@0.5:0.95, 61.4% recall, and 61.6% F1-score, though requiring 45.8 milliseconds per image for inference. Medium variants (YOLOv8m, YOLOv11m) struck an optimal balance, delivering robust detection performance with mAP@0.5 values of 62.5% and 61.8% respectively, while maintaining moderate inference times around 14-15 milliseconds. The study identified significant detection challenges for rare vehicle classes, with Construction Vehicles and Desi Nosimons showing near-zero accuracy due to dataset imbalances and insufficient training samples. Confusion matrices revealed frequent misclassifications between visually similar vehicles, particularly Mini Trucks versus Mini Covered Vans. This research provides a foundation for developing robust object detection systems specifically adapted to Bangladesh traffic conditions, addressing critical needs in autonomous vehicle technology advancement for developing regions where conventional generic-trained models fail to perform adequately.
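The abstract reports recall and F1 but not precision. As a small worked example, precision can be recovered from the standard definition F1 = 2PR / (P + R), which rearranges to P = F1·R / (2R − F1). This assumes the reported F1 and recall were computed on the same aggregate; if the F1 is averaged per class, the identity holds only approximately. The function names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def precision_from_f1(f1, recall):
    """Invert F1 = 2PR / (P + R) for P; requires 2 * recall > f1."""
    return f1 * recall / (2 * recall - f1)

# Values reported for YOLOv11x in the abstract above.
recall = 0.614
f1 = 0.616
precision = precision_from_f1(f1, recall)
print(f"implied precision: {precision:.3f}")
```

Under that assumption the implied precision is close to the reported recall, which suggests a fairly balanced operating point for the top-performing model.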
[CV-139] Self-supervised Learning for Hyperspectral Images of Trees
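Quick read: This paper addresses how to extract representations that reflect the vegetation properties of trees from aerial hyperspectral images when labels are limited or absent. The key idea is to use self-supervised learning to build a neural network embedding space related to vegetation properties. Tree representations constructed in this space perform better in downstream machine learning tasks than using the raw hyperspectral vegetation properties directly as tree representations.
Link: https://arxiv.org/abs/2509.05630
Authors: Moqsadur Rahman, Saurav Kumar, Santosh S. Palmate, M. Shahriar Hossain
Affiliations: University of Texas at El Paso; Arizona State University; Texas A&M University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: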
Abstract:Aerial remote sensing using multispectral and RGB imagers has provided a critical impetus to precision agriculture. Analysis of the hyperspectral images with limited or no labels is challenging. This paper focuses on self-supervised learning to create neural network embeddings reflecting vegetation properties of trees from aerial hyperspectral images of crop fields. Experimental results demonstrate that a constructed tree representation, using a vegetation property-related embedding space, performs better in downstream machine learning tasks compared to the direct use of hyperspectral vegetation properties as tree representations.
[CV-140] SuMa: A Subspace Mapping Approach for Robust and Effective Concept Erasure in Text-to-Image Diffusion Models
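Quick read: This paper addresses the potential misuse of text-to-image diffusion models to generate harmful or unauthorized content, and specifically the difficulty of erasing narrow concepts such as copyrighted characters or celebrities. Existing concept erasure methods usually fail to achieve both robustness (reliably removing the target concept) and effectiveness (maintaining image quality), and narrow concepts are especially hard because they lie close to non-target neighboring concepts. The proposed method, Subspace Mapping (SuMa), first derives a target subspace representing the concept to be erased and then neutralizes it by mapping it to a reference subspace that minimizes the distance between the two. This mapping erases narrow concepts robustly while preserving image quality.
Link: https://arxiv.org/abs/2509.05625
Authors: Kien Nguyen, Anh Tran, Cuong Pham
Affiliations: Qualcomm AI Research; Posts and Telecommunications Institute of Technology (PTIT)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: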
Abstract:The rapid growth of text-to-image diffusion models has raised concerns about their potential misuse in generating harmful or unauthorized content. To address these issues, several Concept Erasure methods have been proposed. However, most of them fail to achieve both robustness, i.e., the ability to robustly remove the target concept, and effectiveness, i.e., maintaining image quality. While a few recent techniques successfully achieve these goals for NSFW concepts, none could handle narrow concepts such as copyrighted characters or celebrities. Erasing these narrow concepts is critical in addressing copyright and legal concerns. However, erasing them is challenging due to their close distances to non-target neighboring concepts, requiring finer-grained manipulation. In this paper, we introduce Subspace Mapping (SuMa), a novel method specifically designed to achieve both robustness and effectiveness in erasing these narrow concepts. SuMa first derives a target subspace representing the concept to be erased and then neutralizes it by mapping it to a reference subspace that minimizes the distance between the two. This mapping ensures the target concept is robustly erased while preserving image quality. We conduct extensive experiments with SuMa across four tasks: subclass erasure, celebrity erasure, artistic style erasure, and instance erasure, and compare the results with current state-of-the-art methods. Our method achieves image quality comparable to approaches focused on effectiveness, while also yielding results that are on par with methods targeting completeness.
[CV-141] SpecPrune-VLA: Accelerating Vision-Language-Action Models via Action-Aware Self-Speculative Pruning
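Quick read: This paper addresses a weakness of current pruning methods for Vision-Language-Action (VLA) models: they prune tokens using only local information from the current action and ignore the global context carried by previous actions, which causes a large drop in task success rate (up to 20%) and limited speedup. The solution is SpecPrune-VLA, a training-free two-level pruning framework. Static pruning at the action level uses global history together with local context to reduce the number of visual tokens per action; dynamic pruning at the layer level prunes tokens adaptively according to layer-specific importance. A lightweight action-aware controller adjusts pruning aggressiveness by action granularity (coarse versus fine), since fine-grained actions are sensitive to pruning. On LIBERO the method achieves up to 1.57 times inference speedup with negligible loss in success rate.
Link: https://arxiv.org/abs/2509.05614
Authors: Hanzhen Wang, Jiaming Xu, Jiayi Pan, Yongkang Zhou, Guohao Dai
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 8 pages, 10 figures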
Abstract:Pruning accelerates compute-bound models by reducing computation. Recently applied to Vision-Language-Action (VLA) models, existing methods prune tokens using only local info from current action, ignoring global context from prior actions, causing 20% success rate drop and limited speedup. We observe high similarity across consecutive actions and propose leveraging both local (current) and global (past) info for smarter token selection. We introduce SpecPrune-VLA, a training-free method with two-level pruning and heuristic control: (1) Static pruning at action level: uses global history and local context to reduce visual tokens per action; (2) Dynamic pruning at layer level: prunes tokens per layer based on layer-specific importance; (3) Lightweight action-aware controller: classifies actions as coarse/fine-grained (by speed), adjusting pruning aggressiveness since fine-grained actions are pruning-sensitive. Experiments on LIBERO show SpecPrune-VLA achieves 1.46 times speedup on NVIDIA A800 and 1.57 times on NVIDIA GeForce RTX 3090 vs. OpenVLA-OFT, with negligible success rate loss.
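The combination of local and global information for token selection can be sketched roughly as follows. Everything here is a hypothetical illustration: the linear blend, the `alpha` weight, the speed threshold, the keep ratios, and the assumption that low speed means a fine-grained action are not taken from the paper.

```python
def select_tokens(local_scores, global_scores, keep_ratio, alpha=0.5):
    """Return the sorted indices of the tokens to keep.

    Scores are blended linearly (an assumption), then the top
    `keep_ratio` fraction of tokens is retained.
    """
    combined = [alpha * l + (1 - alpha) * g
                for l, g in zip(local_scores, global_scores)]
    k = max(1, round(keep_ratio * len(combined)))
    ranked = sorted(range(len(combined)),
                    key=lambda i: combined[i], reverse=True)
    return sorted(ranked[:k])

def keep_ratio_for(speed, threshold=0.1, coarse=0.4, fine=0.8):
    """Keep more tokens for slow (assumed fine-grained) actions."""
    return fine if speed < threshold else coarse

loc = [0.9, 0.1, 0.5, 0.2]    # importance from the current action
glob = [0.1, 0.2, 0.9, 0.1]   # importance carried over from past actions
print(select_tokens(loc, glob, keep_ratio_for(speed=1.0)))
```

The controller only changes how many tokens survive; which tokens survive is decided by the blended score, so a token that matters either now or in recent history has a chance of being kept.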
[CV-142] Patch-level Kernel Alignment for Self-Supervised Dense Representation Learning
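Quick read: This paper addresses the weak performance of self-supervised representation learning on dense vision tasks. Most existing methods focus on global representations and fail to capture the localized semantics needed for tasks that require spatial precision. The solution is a framework that applies additional self-supervised learning on top of pretrained representations. It introduces Patch-level Kernel Alignment (PaKA), a simple yet effective objective that aligns the distributions of dense features between a teacher and a student model, transferring existing semantic knowledge into the dense feature space. The paper also investigates augmentation strategies designed for dense representation learning, and the framework reaches state-of-the-art results on a variety of dense vision benchmarks.
Link: https://arxiv.org/abs/2509.05606
Authors: Juan Yeo, Ijun Jang, Taesup Kim
Affiliations: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: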
Abstract:Dense representations are essential for vision tasks that require spatial precision and fine-grained detail. While most self-supervised representation learning methods focus on global representations that summarize the image as a whole, such approaches often fall short in capturing the localized semantics necessary for dense prediction tasks. To overcome these limitations, we propose a framework that builds on pretrained representations through additional self-supervised learning, aiming to transfer existing semantic knowledge into the dense feature space. Our method aligns the distributions of dense features between a teacher and a student model. Specifically, we introduce Patch-level Kernel Alignment (PaKA), a simple yet effective alignment objective that captures statistical dependencies, thereby matching the structural relationships of dense patches across the two models. In addition, we investigate augmentation strategies specifically designed for dense representation learning. Our framework achieves state-of-the-art results across a variety of dense vision benchmarks, demonstrating the effectiveness of our approach.
[CV-143] Language-guided Recursive Spatiotemporal Graph Modeling for Video Summarization
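Quick read: This paper addresses how to combine fine-grained visual entities (such as objects) with linguistic semantics in video summarization so that keyframe selection becomes more accurate and representative. Earlier methods rely mainly on temporal modeling to capture global relations between frames, overlooking object-level semantic relationships and the need for language-guided understanding. The key idea is to treat video summarization as a language-guided spatiotemporal graph modeling problem: objects and frames are the nodes of spatial and temporal graphs respectively, connected and aggregated through edges that represent semantic relationships. Language queries derived from the video are incorporated into the node representations so that edges are not built from visual similarity alone, and a recursive strategy iteratively refines the graphs before each frame node is classified as a keyframe or not.
Link: https://arxiv.org/abs/2509.05604
Authors: Jungin Park, Jiyoung Lee, Kwanghoon Sohn
Affiliations: Yonsei University; Ewha Womans University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to IJCV, 29 pages, 14 figures, 11 tables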
Abstract:Video summarization aims to select keyframes that are visually diverse and can represent the whole story of a given video. Previous approaches have focused on global interlinkability between frames in a video by temporal modeling. However, fine-grained visual entities, such as objects, are also highly related to the main content of the video. Moreover, language-guided video summarization, which has recently been studied, requires a comprehensive linguistic understanding of complex real-world videos. To consider how all the objects are semantically related to each other, this paper regards video summarization as a language-guided spatiotemporal graph modeling problem. We present recursive spatiotemporal graph networks, called VideoGraph, which formulate the objects and frames as nodes of the spatial and temporal graphs, respectively. The nodes in each graph are connected and aggregated with graph edges, representing the semantic relationships between the nodes. To prevent the edges from being configured with visual similarity, we incorporate language queries derived from the video into the graph node representations, enabling them to contain semantic knowledge. In addition, we adopt a recursive strategy to refine initial graphs and correctly classify each frame node as a keyframe. In our experiments, VideoGraph achieves state-of-the-art performance on several benchmarks for generic and query-focused video summarization in both supervised and unsupervised manners. The code is available at this https URL.
[CV-144] MFFI: Multi-Dimensional Face Forgery Image Dataset for Real-World Scenarios
Quick read: This paper addresses the limited diversity of existing datasets for Deepfake detection, which fall short in four areas: unknown advanced forgery techniques, variability of facial scenes, richness of real data, and degradation from real-world propagation. The solution is the Multi-dimensional Face Forgery Image (MFFI) dataset, which improves realism along four dimensions: (1) Wider Forgery Methods, covering 50 different forgery methods; (2) Varied Facial Scenes; (3) Diversified Authentic Data; and (4) Multi-level Degradation Operations. MFFI contains 1024K image samples, and benchmark evaluations show that it outperforms existing public datasets in scene complexity, cross-domain generalization capability, and detection difficulty gradients.
Link: https://arxiv.org/abs/2509.05592
Authors: Changtao Miao, Yi Zhang, Man Luo, Weiwei Feng, Kaiyuan Zheng, Qi Chu, Tao Gong, Jianshu Li, Yunfeng Diao, Wei Zhou, Joey Tianyi Zhou, Xiaoshuai Hao
Affiliations: Ant Group; Hangzhou; China; Anhui Province Key Laboratory of Digital Security; Hefei; China; Hefei University of Technology; Beijing Academy of Artificial Intelligence; Cardiff University; IHPC, Agency for Science, Technology and Research; CFAR, Agency for Science, Technology and Research; Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Rapid advances in Artificial Intelligence Generated Content (AIGC) have enabled increasingly sophisticated face forgeries, posing a significant threat to social security. However, current Deepfake detection methods are limited by constraints in existing datasets, which lack the diversity necessary in real-world scenarios. Specifically, these datasets fall short in four key areas: coverage of unknown advanced forgery techniques, variability of facial scenes, richness of real data, and degradation of real-world propagation. To address these challenges, we propose the Multi-dimensional Face Forgery Image (MFFI) dataset, tailored for real-world scenarios. MFFI enhances realism based on four strategic dimensions: 1) Wider Forgery Methods; 2) Varied Facial Scenes; 3) Diversified Authentic Data; 4) Multi-level Degradation Operations. MFFI integrates 50 different forgery methods and contains 1024K image samples. Benchmark evaluations show that MFFI outperforms existing public datasets in terms of scene complexity, cross-domain generalization capability, and detection difficulty gradients. These results validate the technical advance and practical utility of MFFI in simulating real-world conditions. The dataset and additional details are publicly available at this https URL.
[CV-145] ProfilingAgent: Profiling-Guided Agentic Reasoning for Adaptive Model Optimization
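Quick read: This paper addresses the compute and memory bottlenecks that foundation models face when deployed on resource-limited platforms. Existing compression techniques such as pruning and quantization mostly rely on uniform heuristics that ignore architectural and runtime heterogeneity. The solution is ProfilingAgent, a profiling-guided agentic system that uses Large Language Models (LLMs) to automate structured pruning and post-training dynamic quantization. It reasons over static metrics (MACs, parameter counts) and dynamic signals (latency, memory) to make layer-wise decisions tailored to each layer's bottlenecks, maintaining competitive or improved accuracy while cutting memory use by up to 74% and speeding up inference by up to 1.74 times.
Link: https://arxiv.org/abs/2509.05584
Authors: Sadegh Jafari, Aishwarya Sarkar, Mohiuddin Bilwal, Ali Jannesari
Affiliations: Iowa State University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Performance (cs.PF)
Comments: 13 pages, 3 figures, 5 tables, 1 algorithm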
Abstract:Foundation models face growing compute and memory bottlenecks, hindering deployment on resource-limited platforms. While compression techniques such as pruning and quantization are widely used, most rely on uniform heuristics that ignore architectural and runtime heterogeneity. Profiling tools expose per-layer latency, memory, and compute cost, yet are rarely integrated into automated pipelines. We propose ProfilingAgent, a profiling-guided, agentic approach that uses large language models (LLMs) to automate compression via structured pruning and post-training dynamic quantization. Our modular multi-agent system reasons over static metrics (MACs, parameter counts) and dynamic signals (latency, memory) to design architecture-specific strategies. Unlike heuristic baselines, ProfilingAgent tailors layer-wise decisions to bottlenecks. Experiments on ImageNet-1K, CIFAR-10, and CIFAR-100 with ResNet-101, ViT-B/16, Swin-B, and DeiT-B/16 show pruning maintains competitive or improved accuracy (about 1% drop on ImageNet-1K, +2% gains for ViT-B/16 on smaller datasets), while quantization achieves up to 74% memory savings with 0.5% accuracy loss. Our quantization also yields consistent inference speedups of up to 1.74 times faster. Comparative studies with GPT-4o and GPT-4-Turbo highlight the importance of LLM reasoning quality for iterative pruning. These results establish agentic systems as scalable solutions for profiling-guided model optimization.
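To give a feel for where quantization memory savings come from, here is a minimal sketch of symmetric per-tensor int8 weight quantization. It is not ProfilingAgent's implementation: the helper names `quantize_int8` and `dequantize`, the toy weights, and the 4-byte versus 1-byte storage assumption are all illustrative.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # Assumption: a single scale for the whole tensor; fall back to 1.0 for all-zero input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map int8 codes back to approximate float values."""
    return [v * scale for v in q]


weights = [0.4, -1.0, 0.25, 0.0]      # toy values, not from the paper
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

fp32_bytes = 4 * len(weights)         # 4 bytes per float32 weight
int8_bytes = 1 * len(weights)         # 1 byte per int8 code (scale overhead ignored)
saving = 1 - int8_bytes / fp32_bytes  # fraction of weight storage saved
```

Under these assumptions the quantized tensors need a quarter of the float32 storage, so 75% is an upper bound for the layers that are quantized; layers left in float32 and the stored scales pull the whole-model figure below that bound.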
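[CV-146] Reconstruction and Reenactment Separated Method for Realistic Gaussian Head
[Quick read]: This paper addresses the generation of a controllable 3D head avatar from a single portrait image, with a focus on combining high frame-rate rendering with high-quality texture reconstruction and generalization. The key is a reconstruction and reenactment separated framework. A large-scale one-shot Gaussian head generator built on WebSSL is trained in two stages, which markedly improves generalization and high-frequency texture reconstruction. At inference, an ultra-lightweight Gaussian avatar driven by control signals renders at 90 FPS at 512x512 resolution, and the separated design keeps driving efficiency unaffected as the reconstruction module scales up.
Link: https://arxiv.org/abs/2509.05582
Authors: Zhiling Ye,Cong Zhou,Xiubao Zhang,Haifeng Shen,Weihong Deng,Quan Lu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: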
Abstract:In this paper, we explore a reconstruction and reenactment separated framework for 3D Gaussians head, which requires only a single portrait image as input to generate controllable avatar. Specifically, we developed a large-scale one-shot gaussian head generator built upon WebSSL and employed a two-stage training approach that significantly enhances the capabilities of generalization and high-frequency texture reconstruction. During inference, an ultra-lightweight gaussian avatar driven by control signals enables high frame-rate rendering, achieving 90 FPS at a resolution of 512x512. We further demonstrate that the proposed framework follows the scaling law, whereby increasing the parameter scale of the reconstruction module leads to improved performance. Moreover, thanks to the separation design, driving efficiency remains unaffected. Finally, extensive quantitative and qualitative experiments validate that our approach outperforms current state-of-the-art methods.
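[CV-147] Sensitivity-Aware Post-Training Quantization for Deep Neural Networks
[Quick read]: This paper addresses the accuracy loss that model quantization incurs at high compression ratios, and the high computational cost of the iterative parameter updates used by existing post-training quantization (PTQ) methods, which limits their use in resource-constrained edge computing and real-time inference. The key is a quantization strategy guided by parameter sensitivity analysis: high-sensitivity parameters are quantized first, and the still-unquantized low-sensitivity parameters compensate for the quantization error, mitigating accuracy degradation. In addition, by exploiting the column-wise clustering of parameter sensitivity, the method uses a row-parallel quantization framework with a globally shared inverse Hessian matrix update mechanism, cutting computational complexity by an order of magnitude.
Link: https://arxiv.org/abs/2509.05576
Authors: Zekang Zheng,Haokun Li,Yaofo Chen,Mingkui Tan,Qing Du
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted by PRCV 2025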
Abstract:Model quantization reduces neural network parameter precision to achieve compression, but often compromises accuracy. Existing post-training quantization (PTQ) methods employ iterative parameter updates to preserve accuracy under high compression ratios, incurring significant computational complexity and resource overhead, which limits applicability in resource-constrained edge computing and real-time inference scenarios. This paper proposes an efficient PTQ method guided by parameter sensitivity analysis. The approach prioritizes quantization of high-sensitivity parameters, leveraging unquantized low-sensitivity parameters to compensate for quantization errors, thereby mitigating accuracy degradation. Furthermore, by exploiting column-wise clustering of parameter sensitivity, the method introduces a row-parallel quantization framework with a globally shared inverse Hessian matrix update mechanism, reducing computational complexity by an order of magnitude. Experimental results on ResNet-50 and YOLOv5s demonstrate a 20-200-fold quantization speedup over the Optimal Brain Quantization baseline, with mean accuracy loss below 0.3%, confirming the method’s efficacy in balancing efficiency and accuracy.
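The idea of letting unquantized parameters absorb quantization error can be illustrated with one Optimal-Brain-Surgeon-style update step, the kind of step used by Hessian-based PTQ baselines. This is a sketch under stated assumptions, not the paper's algorithm: the function names, the uniform grid, the 2x2 inverse Hessian, and the choice of which weight to quantize first are all made up for illustration.

```python
def quantize_to_grid(w, step):
    """Round a weight to the nearest multiple of `step` (uniform grid, assumed)."""
    return round(w / step) * step


def obs_step(weights, h_inv, p, step):
    """Quantize weights[p] and shift the other weights to compensate.

    Uses the OBS-style update  delta = -(w_p - q_p) / h_inv[p][p] * h_inv[:, p],
    where h_inv is the inverse Hessian of the layer-wise reconstruction loss.
    """
    q_p = quantize_to_grid(weights[p], step)
    coeff = (weights[p] - q_p) / h_inv[p][p]
    updated = [w - coeff * h_inv[i][p] for i, w in enumerate(weights)]
    updated[p] = q_p  # pin the quantized entry exactly, removing float residue
    return updated


weights = [0.30, 0.70]               # toy row of two weights
h_inv = [[2.0, 1.0], [1.0, 2.0]]     # hypothetical inverse Hessian
updated = obs_step(weights, h_inv, p=0, step=0.25)
```

Here the first weight snaps to the grid value 0.25 and the second moves from 0.70 to about 0.675 to offset the error. Recomputing `h_inv` after every such step is what makes the baseline expensive, which is the cost the shared inverse Hessian update described above is meant to reduce.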
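[CV-148] RED: Robust Event-Guided Motion Deblurring with Modality-Specific Disentangled Representation
[Quick read]: This paper addresses the incompleteness of event streams in event-guided motion deblurring, which arises from the thresholding mechanism of Dynamic Vision Sensors (DVS). This incompleteness degrades the motion priors and limits deblurring quality. The key is a Robust Event-guided Deblurring (RED) network with three components: 1) a Robustness-Oriented Perturbation Strategy (RPS) that applies random masking to events to simulate missing events and improve robustness to unknown incompleteness patterns; 2) a disentangled OmniAttention module that explicitly models intra-motion, inter-motion, and cross-modality correlations between blurry images and partially disrupted events; 3) two interactive modules that enhance motion-sensitive areas in blurry images and inject semantic context into incomplete event representations.
Link: https://arxiv.org/abs/2509.05554
Authors: Yihong Leng,Siming Zheng,Jinwei Chen,Bo Li,Jiaojiao Li,Peng-Tao Jiang
Affiliations: vivo; University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Notes: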
Abstract:Event cameras provide sparse yet temporally high-resolution motion information, demonstrating great potential for motion deblurring. Existing methods focus on cross-modal interaction, overlooking the inherent incompleteness of event streams, which arises from the trade-off between sensitivity and noise introduced by the thresholding mechanism of Dynamic Vision Sensors (DVS). Such degradation compromises the integrity of motion priors and limits the effectiveness of event-guided deblurring. To tackle these challenges, we propose a Robust Event-guided Deblurring (RED) network with modality-specific disentangled representation. First, we introduce a Robustness-Oriented Perturbation Strategy (RPS) that applies random masking to events, which exposes RED to incomplete patterns and thus fosters robustness against various unknown scenarios. A disentangled OmniAttention is then presented to explicitly model intra-motion, inter-motion, and cross-modality correlations from two inherently distinct but complementary sources: blurry images and partially disrupted events. Building on these reliable features, two interactive modules are designed to enhance motion-sensitive areas in blurry images and inject semantic context into incomplete event representations. Extensive experiments on synthetic and real-world datasets demonstrate that RED consistently achieves state-of-the-art performance in both accuracy and robustness.
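Random masking of events, as described for RPS, can be pictured with the sketch below. It assumes events are `(x, y, timestamp, polarity)` tuples and that each event is dropped independently with a fixed probability; both are illustrative assumptions, and the paper's actual masking scheme may differ (for example, masking regions or time windows). The name `mask_events` is hypothetical.

```python
import random


def mask_events(events, drop_prob, seed=None):
    """Return a copy of `events` with each event dropped with probability `drop_prob`."""
    rng = random.Random(seed)  # local generator so results are reproducible per seed
    return [e for e in events if rng.random() >= drop_prob]


# Toy event stream: (x, y, timestamp in microseconds, polarity)
events = [(3, 7, 100, 1), (4, 7, 140, -1), (5, 8, 190, 1), (5, 9, 260, 1)]
kept = mask_events(events, drop_prob=0.5, seed=42)
```

Training on such thinned streams exposes a model to incomplete event patterns it may meet at test time, which is the motivation the abstract gives for the perturbation.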
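[CV-149] DuoCLR: Dual-Surrogate Contrastive Learning for Skeleton-based Human Action Segmentation ICCV2025
[Quick read]: This paper addresses the limited representation power of features for skeleton-based human action segmentation, particularly on untrimmed data. Existing contrastive pre-training methods are designed for action recognition and rely on isolated sequence-wise representations, so they struggle to capture the temporal structure between actions and multi-scale context. The key is the Dual-Surrogate Contrastive Learning (DuoCLR) framework: 1) a data augmentation strategy, 'Shuffle and Warp', generates cross-sequence variations from diverse multi-action permutations; 2) two complementary surrogate tasks are introduced: Cross Permutation Contrasting (CPC), which learns the intra-class similarity of the same action class across permutations, and Relative Order Reasoning (ROR), which models the relative order between action classes; 3) the two tasks are optimized jointly so that the network learns multi-scale representations suited to action segmentation, improving multi-class and multi-label segmentation on untrimmed data.
Link: https://arxiv.org/abs/2509.05543
Authors: Haitao Tian,Pierre Payeur
Affiliations: University of Ottawa
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: ICCV 2025 accepted paper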
Abstract:In this paper, a contrastive representation learning framework is proposed to enhance human action segmentation via pre-training using trimmed (single action) skeleton sequences. Unlike previous representation learning works that are tailored for action recognition and that build upon isolated sequence-wise representations, the proposed framework focuses on exploiting multi-scale representations in conjunction with cross-sequence variations. More specifically, it proposes a novel data augmentation strategy, ‘Shuffle and Warp’, which exploits diverse multi-action permutations. The latter effectively assists two surrogate tasks that are introduced in contrastive learning: Cross Permutation Contrasting (CPC) and Relative Order Reasoning (ROR). In optimization, CPC learns intra-class similarities by contrasting representations of the same action class across different permutations, while ROR reasons about inter-class contexts by predicting relative mapping between two permutations. Together, these tasks enable a Dual-Surrogate Contrastive Learning (DuoCLR) network to learn multi-scale feature representations optimized for action segmentation. In experiments, DuoCLR is pre-trained on a trimmed skeleton dataset and evaluated on an untrimmed dataset where it demonstrates a significant boost over state-of-the-art comparatives in both multi-class and multi-label action segmentation tasks. Lastly, ablation studies are conducted to evaluate the effectiveness of each component of the proposed approach.
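The 'shuffle' half of the augmentation and the relative-mapping target used by ROR can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes each trimmed clip is a `(label, frames)` pair with a unique label, it omits the temporal 'warp' step entirely, and all function names are hypothetical.

```python
import random


def shuffle_actions(clips, rng):
    """Return the trimmed clips in a random order (one multi-action permutation)."""
    order = list(range(len(clips)))
    rng.shuffle(order)
    return [clips[i] for i in order]


def concatenate(clips):
    """Join clips into one synthetic sequence with a label for every frame."""
    frames, labels = [], []
    for label, clip_frames in clips:
        frames.extend(clip_frames)
        labels.extend([label] * len(clip_frames))
    return frames, labels


def relative_mapping(perm_a, perm_b):
    """For each action in perm_a, give the position of the same action in perm_b."""
    position_in_b = {label: i for i, (label, _) in enumerate(perm_b)}
    return [position_in_b[label] for label, _ in perm_a]


clips = [("walk", [0, 1, 2]), ("sit", [3, 4]), ("wave", [5, 6, 7])]  # toy frame ids
perm_a = shuffle_actions(clips, random.Random(0))
perm_b = shuffle_actions(clips, random.Random(1))
target = relative_mapping(perm_a, perm_b)  # what an ROR-style head would predict
```

Two permutations of the same clips give CPC positive pairs (the same action in different contexts), while `relative_mapping` is one plausible form of the order target that ROR predicts.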
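[CV-150] Visibility-Aware Language Aggregation for Open-Vocabulary Segmentation in 3D Gaussian Splatting
[Quick read]: This paper addresses two core issues in distilling open-vocabulary language features from 2D images into 3D Gaussians. First, background Gaussians that contribute negligibly to a rendered pixel receive the same language feature as the dominant foreground Gaussians, causing semantic confusion. Second, view-specific noise in language embeddings makes features inconsistent across views. The key is Visibility-Aware Language Aggregation (VALA), which computes marginal contributions for each ray and applies a visibility-aware gate that retains only visible Gaussians. It also merges noisy multi-view features with a streaming weighted geometric median in cosine space, yielding a fast, memory-efficient, view-consistent language feature embedding.
Link: https://arxiv.org/abs/2509.05515
Authors: Sen Wang,Kunyi Li,Siyun Liang,Elena Alegret,Jing Ma,Nassir Navab,Stefano Gasperini
Affiliations: Technical University of Munich; Munich Center for Machine Learning; VisualAIs; Ludwig Maximilian University of Munich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: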
Abstract:Recently, distilling open-vocabulary language features from 2D images into 3D Gaussians has attracted significant attention. Although existing methods achieve impressive language-based interactions of 3D scenes, we observe two fundamental issues: background Gaussians contributing negligibly to a rendered pixel get the same feature as the dominant foreground ones, and multi-view inconsistencies due to view-specific noise in language embeddings. We introduce Visibility-Aware Language Aggregation (VALA), a lightweight yet effective method that computes marginal contributions for each ray and applies a visibility-aware gate to retain only visible Gaussians. Moreover, we propose a streaming weighted geometric median in cosine space to merge noisy multi-view features. Our method yields a robust, view-consistent language feature embedding in a fast and memory-efficient manner. VALA improves open-vocabulary localization and segmentation across reference datasets, consistently surpassing existing works.
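One way to read "marginal contributions for each ray" is as the standard front-to-back alpha-compositing weights used in Gaussian splatting, with a threshold acting as the gate. The sketch below follows that reading; the threshold rule, the function names, and the toy opacities are assumptions, and VALA's actual gate may be defined differently.

```python
def ray_contributions(alphas):
    """Per-Gaussian blending weights along one ray, ordered front to back.

    weight_i = alpha_i * prod_{j < i} (1 - alpha_j)
    """
    weights, transmittance = [], 1.0
    for alpha in alphas:
        weights.append(alpha * transmittance)
        transmittance *= 1.0 - alpha  # light remaining after this Gaussian
    return weights


def visible_indices(alphas, min_weight):
    """Indices of Gaussians whose contribution reaches the (assumed) gate threshold."""
    weights = ray_contributions(alphas)
    return [i for i, w in enumerate(weights) if w >= min_weight]


alphas = [0.5, 0.5, 0.5]                      # toy opacities along a ray
weights = ray_contributions(alphas)           # 0.5, 0.25, 0.125
kept = visible_indices(alphas, min_weight=0.2)
```

With these numbers the third Gaussian contributes only 12.5% of the pixel and is gated out, so it would not receive the pixel's language feature.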
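[CV-151] OpenEgo: A Large-Scale Multimodal Egocentric Dataset for Dexterous Manipulation
[Quick read]: This paper addresses the lack of fine-grained, temporally localized action descriptions and dexterous hand annotations in existing egocentric video corpora for imitation learning. The key is OpenEgo, a multimodal egocentric manipulation dataset with a unified hand-pose layout and intention-aligned action primitives. It consolidates 1107 hours of video from six public datasets, covering 290 manipulation tasks in more than 600 environments. The dataset supports training language-conditioned imitation-learning policies that predict dexterous hand trajectories, lowering the barrier to learning dexterous manipulation from egocentric video and supporting reproducible research in vision-language-action learning.
Link: https://arxiv.org/abs/2509.05513
Authors: Ahad Jawaid,Yu Xiang
Affiliations: The University of Texas at Dallas; Physical Automation, Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Notes: 4 pages, 1 figure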
Abstract:Egocentric human videos provide scalable demonstrations for imitation learning, but existing corpora often lack either fine-grained, temporally localized action descriptions or dexterous hand annotations. We introduce OpenEgo, a multimodal egocentric manipulation dataset with standardized hand-pose annotations and intention-aligned action primitives. OpenEgo totals 1107 hours across six public datasets, covering 290 manipulation tasks in 600+ environments. We unify hand-pose layouts and provide descriptive, timestamped action primitives. To validate its utility, we train language-conditioned imitation-learning policies to predict dexterous hand trajectories. OpenEgo is designed to lower the barrier to learning dexterous manipulation from egocentric video and to support reproducible research in vision-language-action learning. All resources and instructions will be released at this http URL.
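[CV-152] Quaternion Approximation Networks for Enhanced Image Classification and Oriented Object Detection IROS2025
[Quick read]: This paper addresses the inefficiency, parameter redundancy, and weak modeling of geometric structure that conventional convolutional neural networks (CNNs) show on rotation-sensitive tasks, especially in resource-constrained robotic perception. The key is Quaternion Approximate Networks (QUAN), which approximate quaternion convolution through Hamilton product decomposition using real-valued operations, preserving rotation equivariance while enabling efficient computation with custom CUDA kernels. The method also introduces Independent Quaternion Batch Normalization (IQBN) for training stability and extends quaternion operations to spatial attention mechanisms, improving accuracy and parameter efficiency in image classification and object detection.
Link: https://arxiv.org/abs/2509.05512
Authors: Bryce Grant,Peng Wang
Affiliations: Case Western Reserve University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Notes: Accepted to IROS 2025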
Abstract:This paper introduces Quaternion Approximate Networks (QUAN), a novel deep learning framework that leverages quaternion algebra for rotation equivariant image classification and object detection. Unlike conventional quaternion neural networks attempting to operate entirely in the quaternion domain, QUAN approximates quaternion convolution through Hamilton product decomposition using real-valued operations. This approach preserves geometric properties while enabling efficient implementation with custom CUDA kernels. We introduce Independent Quaternion Batch Normalization (IQBN) for training stability and extend quaternion operations to spatial attention mechanisms. QUAN is evaluated on image classification (CIFAR-10/100, ImageNet), object detection (COCO, DOTA), and robotic perception tasks. In classification tasks, QUAN achieves higher accuracy with fewer parameters and faster convergence compared to existing convolution and quaternion-based models. For object detection, QUAN demonstrates improved parameter efficiency and rotation handling over standard Convolutional Neural Networks (CNNs) while establishing the SOTA for quaternion CNNs in this downstream task. These results highlight its potential for deployment in resource-constrained robotic systems requiring rotation-aware perception and application in other domains.
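The Hamilton product that QUAN decomposes can be written entirely with real multiplications and additions, which is what makes a real-valued approximation possible. The sketch below shows only the scalar product for quaternions stored as `(a, b, c, d)` meaning `a + b*i + c*j + d*k`; how QUAN arranges these terms into convolution kernels is not shown, and the function name is illustrative.

```python
def hamilton_product(p, q):
    """Hamilton product of two quaternions given as (real, i, j, k) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )


i = (0, 1, 0, 0)
j = (0, 0, 1, 0)
ij = hamilton_product(i, j)  # equals k
ji = hamilton_product(j, i)  # equals -k, since the product is not commutative
```

Each output component is a signed sum of four real products, so a quaternion layer can be expressed as sixteen real-valued multiply-accumulate groups over four shared weight tensors.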
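[CV-153] An Analysis of Layer-Freezing Strategies for Enhanced Transfer Learning in YOLO Architectures
[Quick read]: This paper addresses how to choose efficient transfer-learning strategies when deploying YOLO object detectors in resource-constrained environments such as unmanned aerial vehicles (UAVs). The core gap is that the effect of layer-freezing configurations on contemporary YOLOv8 and YOLOv10 architectures is unclear, and there is no systematic analysis of the interplay between freezing depth, dataset characteristics, and training dynamics. The key is an analysis that combines gradient behavior (L2 norm) with visual explanations (Grad-CAM) to understand training under different freezing strategies. The study finds that the optimal freezing strategy is not universal but depends on the data: freezing the backbone preserves general-purpose features, whereas a shallower freeze handles extreme class imbalance better. These configurations cut GPU memory consumption by up to 28% relative to full fine-tuning and in some cases exceed its mAP@50, giving empirical guidance for balanced transfer learning under limited resources.
Link: https://arxiv.org/abs/2509.05490
Authors: Andrzej D. Dobrzycki,Ana M. Bernardos,José R. Casar
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: 31 pages, 14 figures, 9 tables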
Abstract:The You Only Look Once (YOLO) architecture is crucial for real-time object detection. However, deploying it in resource-constrained environments such as unmanned aerial vehicles (UAVs) requires efficient transfer learning. Although layer freezing is a common technique, the specific impact of various freezing configurations on contemporary YOLOv8 and YOLOv10 architectures remains unexplored, particularly with regard to the interplay between freezing depth, dataset characteristics, and training dynamics. This research addresses this gap by presenting a detailed analysis of layer-freezing strategies. We systematically investigate multiple freezing configurations across YOLOv8 and YOLOv10 variants using four challenging datasets that represent critical infrastructure monitoring. Our methodology integrates a gradient behavior analysis (L2 norm) and visual explanations (Grad-CAM) to provide deeper insights into training dynamics under different freezing strategies. Our results reveal that there is no universal optimal freezing strategy but, rather, one that depends on the properties of the data. For example, freezing the backbone is effective for preserving general-purpose features, while a shallower freeze is better suited to handling extreme class imbalance. These configurations reduce graphics processing unit (GPU) memory consumption by up to 28% compared to full fine-tuning and, in some cases, achieve mean average precision (mAP@50) scores that surpass those of full fine-tuning. Gradient analysis corroborates these findings, showing distinct convergence patterns for moderately frozen models. Ultimately, this work provides empirical findings and practical guidelines for selecting freezing strategies. It offers a practical, evidence-based approach to balanced transfer learning for object detection in scenarios with limited resources.
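[CV-154] Veriserum: A dual-plane fluoroscopic dataset with knee implant phantoms for deep learning in medical imaging MICCAI2025
[Quick read]: This paper addresses 2D/3D image registration in medical imaging, specifically the lack of large, high-quality annotated data for training deep learning models for dual-plane fluoroscopic analysis. The key is the publicly released Veriserum dataset, which contains approximately 110,000 X-ray images from 1,600 trials covering 10 knee implant pair combinations (2 femur and 5 tibia implants). Each image has an automatically registered ground-truth pose, and 200 images also have manually registered poses for benchmarking. With dual-plane images and calibration tools, Veriserum provides a reproducible benchmark for developing and evaluating deep learning algorithms for X-ray distortion correction, image segmentation, and 3D reconstruction.
Link: https://arxiv.org/abs/2509.05483
Authors: Jinhao Wang,Florian Vogl,Pascal Schütz,Saša Ćuković,William R. Taylor
Affiliations: ETH Zurich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: This work has been accepted at MICCAI 2025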
Abstract:Veriserum is an open-source dataset designed to support the training of deep learning registration for dual-plane fluoroscopic analysis. It comprises approximately 110,000 X-ray images of 10 knee implant pair combinations (2 femur and 5 tibia implants) captured during 1,600 trials, incorporating poses associated with daily activities such as level gait and ramp descent. Each image is annotated with an automatically registered ground-truth pose, while 200 images include manually registered poses for benchmarking. Key features of Veriserum include dual-plane images and calibration tools. The dataset aims to support the development of applications such as 2D/3D image registration, image segmentation, X-ray distortion correction, and 3D reconstruction. Freely accessible, Veriserum aims to advance computer vision and medical imaging research by providing a reproducible benchmark for algorithm development and evaluation. The Veriserum dataset used in this study is publicly available via this https URL, with the data stored at ETH Zürich Research Collections: this https URL.
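[CV-155] From Image Generation to Infrastructure Design: a Multi-agent Pipeline for Street Design Generation
[Quick read]: This paper addresses low public engagement in urban transportation planning, in particular the fact that traditional street-design visualization is labor-intensive and hinders collective deliberation and collaborative decision-making. The core challenge is to make precise spatial adjustments to design elements such as bicycle facilities in complex street-view scenes while producing renderings that follow the instructions and look realistic. The key is a multi-agent system that integrates four modules (lane localization, prompt optimization, design generation, and automated evaluation) to edit and redesign bicycle facilities directly on real-world street-view imagery, producing realistic, contextually appropriate designs and supporting faster iteration in transportation infrastructure planning.
Link: https://arxiv.org/abs/2509.05469
Authors: Chenguang Wang,Xiang Yan,Yilong Dai,Ziyi Wang,Susu Xu
Affiliations: Johns Hopkins University; University of Florida; Stony Brook University; University of Maryland
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Notes: 21 pages, 8 figures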
Abstract:Realistic visual renderings of street-design scenarios are essential for public engagement in active transportation planning. Traditional approaches are labor-intensive, hindering collective deliberation and collaborative decision-making. While AI-assisted generative design shows transformative potential by enabling rapid creation of design scenarios, existing generative approaches typically require large amounts of domain-specific training data and struggle to enable precise spatial variations of design/configuration in complex street-view scenes. We introduce a multi-agent system that edits and redesigns bicycle facilities directly on real-world street-view imagery. The framework integrates lane localization, prompt optimization, design generation, and automated evaluation to synthesize realistic, contextually appropriate designs. Experiments across diverse urban scenarios demonstrate that the system can adapt to varying road geometries and environmental conditions, consistently yielding visually coherent and instruction-compliant results. This work establishes a foundation for applying multi-agent pipelines to transportation infrastructure planning and facility design.
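[CV-156] Dynamic Sensitivity Filter Pruning using Multi-Agent Reinforcement Learning For DCNNs
[Quick read]: This paper addresses the efficiency bottleneck that computational and memory overhead imposes on deploying deep convolutional neural networks. The key is Differential Sensitivity Fusion Pruning, a single-shot filter pruning framework. It computes a differential sensitivity score for each filter by fusing the discrepancies among gradient-based sensitivity, first-order Taylor expansion, and the KL divergence of activation distributions, and applies exponential scaling to emphasize filters whose importance is inconsistent across metrics, flagging them as structurally unstable or less critical to model performance. The method needs no iterative or reinforcement-learning strategy; a single forward-backward pass suffices for scoring and pruning. At a 70% pruning rate it retains up to 98.23% of baseline accuracy, surpassing traditional heuristics in compression and generalization.
Link: https://arxiv.org/abs/2509.05446
Authors: Iftekhar Haider Chowdhury,Zaed Ikbal Syed,Ahmed Faizul Haque Dhrubo,Mohammad Abdul Qayum
Affiliations: North South University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: This paper includes figures and two tables, and our work outperforms the existing research that has been published in a journal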
Abstract:Deep Convolutional Neural Networks have achieved state-of-the-art performance across various computer vision tasks; however, their practical deployment is limited by computational and memory overhead. This paper introduces Differential Sensitivity Fusion Pruning, a novel single-shot filter pruning framework that focuses on evaluating the stability and redundancy of filter importance scores across multiple criteria. Differential Sensitivity Fusion Pruning computes a differential sensitivity score for each filter by fusing the discrepancies among gradient-based sensitivity, first-order Taylor expansion, and KL divergence of activation distributions. An exponential scaling mechanism is applied to emphasize filters with inconsistent importance across metrics, identifying candidates that are structurally unstable or less critical to the model performance. Unlike iterative or reinforcement-learning-based pruning strategies, Differential Sensitivity Fusion Pruning is efficient and deterministic, requiring only a single forward-backward pass for scoring and pruning. Extensive experiments across pruning rates from 50 to 70 percent demonstrate that Differential Sensitivity Fusion Pruning significantly reduces model complexity, achieving over 80 percent reduction in floating-point operations (FLOPs) while maintaining high accuracy. For instance, at 70 percent pruning, our approach retains up to 98.23 percent of baseline accuracy, surpassing traditional heuristics in both compression and generalization. The proposed method presents an effective solution for scalable and adaptive Deep Convolutional Neural Networks compression, paving the way for efficient deployment on edge and mobile platforms.
[CV-157] FAVAE-Effective Frequency Aware Latent Tokenizer
[Quick read]: This paper addresses the inadequate modeling of high-frequency detail (such as texture boundaries and sharp transitions) by latent tokenizers in current latent generative models, which produces over-smoothed images and visual artifacts that reduce perceptual realism. The key is a wavelet-based, frequency-aware variational autoencoder (FA-VAE) framework that explicitly decouples the optimization of low- and high-frequency components, so that the model can reconstruct global structure and fine textures separately, improving realism and fidelity.
Link: https://arxiv.org/abs/2509.05441
Authors: Tejaswini Medi,Hsien-Yi Wang,Arianna Rampini,Margret Keuper
Affiliations: University of Mannheim, Germany; Autodesk AI Lab; MPI for Informatics, Saarland Informatics Campus
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes:
Abstract:Latent generative models have shown remarkable progress in high-fidelity image synthesis, typically using a two-stage training process that involves compressing images into latent embeddings via learned tokenizers in the first stage. The quality of generation strongly depends on how expressive and well-optimized these latent embeddings are. While various methods have been proposed to learn effective latent representations, the reconstructed images often lack realism, particularly in textured regions with sharp transitions, due to loss of fine details governed by high frequencies. We conduct a detailed frequency decomposition of existing state-of-the-art (SOTA) latent tokenizers and show that conventional objectives inherently prioritize low-frequency reconstruction, often at the expense of high-frequency fidelity. Our analysis reveals these latent tokenizers exhibit a bias toward low-frequency information, when jointly optimized, leading to over-smoothed outputs and visual artifacts that diminish perceptual quality. To address this, we propose a wavelet-based, frequency-aware variational autoencoder (FA-VAE) framework that explicitly decouples the optimization of low- and high-frequency components. This decoupling enables improved reconstruction of fine textures while preserving global structure. Our approach bridges the fidelity gap in current latent tokenizers and emphasizes the importance of frequency-aware optimization for realistic image representation, with broader implications for applications in content creation, neural rendering, and medical imaging.
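Decoupling low- and high-frequency objectives can be illustrated with a one-level Haar split on a 1D signal. This is a toy sketch, not FA-VAE's loss: it uses an unnormalized averaging form of the Haar transform, works on even-length lists rather than images, and the names `haar_split`, `mse`, and `band_losses` are hypothetical.

```python
def haar_split(signal):
    """One-level Haar split of an even-length signal into (low, high) bands.

    low  = pairwise averages   (coarse structure)
    high = pairwise half-differences (fine detail)
    """
    low = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return low, high


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def band_losses(target, recon):
    """Separate reconstruction errors for the low- and high-frequency bands."""
    t_low, t_high = haar_split(target)
    r_low, r_high = haar_split(recon)
    return mse(t_low, r_low), mse(t_high, r_high)


target = [4, 2, 6, 6]
smoothed = [3, 3, 6, 6]  # an over-smoothed reconstruction of `target`
low_loss, high_loss = band_losses(target, smoothed)
```

The over-smoothed reconstruction has zero low-frequency error but a nonzero high-frequency error, which is the kind of gap a joint pixel loss can hide and a per-band objective makes explicit.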
[CV-158] Advanced Brain Tumor Segmentation Using EMCAD: Efficient Multi-scale Convolutional Attention Decoding
[Quick read]: This paper addresses the high computational cost of decoding mechanisms in brain tumor segmentation, a bottleneck when computational resources are limited. The approach applies EMCAD (Efficient Multi-scale Convolutional Attention Decoder), a multi-scale convolutional attention decoder designed to balance performance and computational efficiency. On the BraTs2020 dataset the preliminary model reached a best Dice score of 0.31 and a stable but moderate mean Dice score of 0.285 ± 0.015, with no signs of overfitting.
Link: https://arxiv.org/abs/2509.05431
Authors: GodsGift Uzor,Tania-Amanda Nkoyo Fredrick Eneye,Chukwuebuka Ijezue
Affiliations: Texas Tech University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:
Abstract:Brain tumor segmentation is a critical pre-processing step in the medical image analysis pipeline that involves precise delineation of tumor regions from healthy brain tissue in medical imaging data, particularly MRI scans. An efficient and effective decoding mechanism is crucial in brain tumor segmentation, especially in scenarios with limited computational resources. However, these decoding mechanisms usually come with high computational costs. To address this concern, EMCAD, a new efficient multi-scale convolutional attention decoder, was used to optimize both performance and computational efficiency for brain tumor segmentation on the BraTs2020 dataset, which consists of MRI scans from 369 brain tumor patients. The preliminary model achieved a best Dice score of 0.31 and maintained a stable, moderate mean Dice score of 0.285 ± 0.015 throughout the training process. The initial model maintained consistent performance across the validation set without showing signs of overfitting.
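For readers unfamiliar with the metric, the Dice score reported above measures overlap between a predicted mask and a reference mask as twice the intersection divided by the sum of the two mask sizes. The sketch below computes it for flat binary masks; returning 1.0 when both masks are empty is a convention chosen here, not something the abstract specifies.

```python
def dice_score(pred, target):
    """Dice coefficient for two flat binary masks (lists of 0/1 values)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # assumed convention: two empty masks count as a perfect match
    return 2.0 * intersection / total


pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
score = dice_score(pred, target)  # 2 * 1 / (2 + 1)
```

A score of 1.0 means the masks coincide and 0.0 means they do not overlap at all, which puts the reported mean of 0.285 in context.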
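[CV-159] Augmented Structure Preserving Neural Networks for cell biomechanics
[Quick read]: This paper addresses the prediction of cell trajectories and mitosis events for cell populations in complex biomechanical environments, with attention to how cell-cell interactions and environmental factors jointly influence collective cell behavior. The key is a hybrid model that combines Structure Preserving Neural Networks, which treat cell movement as a purely mechanical system, with Artificial Neural Networks, which incorporate environmental factors extracted from experimental observations with computer vision techniques. The model predicts complete cell trajectories under a roll-out policy with high accuracy, and a separate model predicts mitosis events from the same observed features.
Link: https://arxiv.org/abs/2509.05388
Authors: Juan Olalla-Pombo,Alberto Badías,Miguel Ángel Sanz-Gómez,José María Benítez,Francisco Javier Montáns
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: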
Abstract:Cell biomechanics involve a great number of complex phenomena that are fundamental to the evolution of life itself and other associated processes, ranging from the very early stages of embryo-genesis to the maintenance of damaged structures or the growth of tumors. Given the importance of such phenomena, increasing research has been dedicated to their understanding, but the many interactions between them and their influence on the decisions of cells as a collective network or cluster remain unclear. We present a new approach that combines Structure Preserving Neural Networks, which study cell movements as a purely mechanical system, with other Machine Learning tools (Artificial Neural Networks), which allow taking into consideration environmental factors that can be directly deduced from an experiment with Computer Vision techniques. This new model, tested on simulated and real cell migration cases, predicts complete cell trajectories following a roll-out policy with a high level of accuracy. This work also includes a mitosis event prediction model based on Neural Networks architectures which makes use of the same observed features.
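[CV-160] Unsupervised Instance Segmentation with Superpixels
[Quick read]: This paper addresses the dependence of instance segmentation on large amounts of costly human annotation. The key is an annotation-free framework. A MultiCut algorithm is first applied to self-supervised features for coarse mask segmentation, and a mask filter then selects high-quality coarse masks. The segmentation network is trained with a superpixel-guided mask loss that combines those masks with superpixels segmented from low-level image features. Finally, a self-training process with a new adaptive loss further improves the predicted masks. Experiments on public datasets show that the method outperforms previous state-of-the-art unsupervised methods.
Link: https://arxiv.org/abs/2509.05352
Authors: Cuong Manh Hoang
Affiliations: Seoul National University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: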
Abstract:Instance segmentation is essential for numerous computer vision applications, including robotics, human-computer interaction, and autonomous driving. Currently, popular models bring impressive performance in instance segmentation by training with a large number of human annotations, which are costly to collect. For this reason, we present a new framework that efficiently and effectively segments objects without the need for human annotations. Firstly, a MultiCut algorithm is applied to self-supervised features for coarse mask segmentation. Then, a mask filter is employed to obtain high-quality coarse masks. To train the segmentation network, we compute a novel superpixel-guided mask loss, comprising hard loss and soft loss, with high-quality coarse masks and superpixels segmented from low-level image features. Lastly, a self-training process with a new adaptive loss is proposed to improve the quality of predicted masks. We conduct experiments on public datasets in instance segmentation and object detection to demonstrate the effectiveness of the proposed framework. The results show that the proposed framework outperforms previous state-of-the-art methods.
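[CV-161] Vision-Based Object Detection for UAV Solar Panel Inspection Using an Enhanced Defects Dataset
[Quick read]: This paper addresses timely and accurate detection of physical and electrical defects and surface contaminants (such as dust, dirt, and bird droppings) on solar panels, to maintain the efficiency and reliability of photovoltaic systems. The key is a systematic evaluation of five object detection models (YOLOv3, Faster R-CNN, RetinaNet, EfficientDet, and Swin Transformer) trained and validated on a custom COCO-format dataset. The models are compared by mean Average Precision (mAP), precision, recall, and inference speed to quantify the trade-off between detection accuracy and computational efficiency and to guide model selection for practical solar panel monitoring and maintenance.
Link: https://arxiv.org/abs/2509.05348
Authors: Ashen Rodrigo,Isuru Munasinghe,Asanka Perera
Affiliations: University of Moratuwa; University of Southern Queensland
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: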
Abstract:Timely and accurate detection of defects and contaminants in solar panels is critical for maintaining the efficiency and reliability of photovoltaic systems. This study presents a comprehensive evaluation of five state-of-the-art object detection models: YOLOv3, Faster R-CNN, RetinaNet, EfficientDet, and Swin Transformer, for identifying physical and electrical defects as well as surface contaminants such as dust, dirt, and bird droppings on solar panels. A custom dataset, annotated in the COCO format and specifically designed for solar panel defect and contamination detection, was developed alongside a user interface to train and evaluate the models. The performance of each model is assessed and compared based on mean Average Precision (mAP), precision, recall, and inference speed. The results demonstrate the trade-offs between detection accuracy and computational efficiency, highlighting the relative strengths and limitations of each model. These findings provide valuable guidance for selecting appropriate detection approaches in practical solar panel monitoring and maintenance scenarios. The dataset will be publicly available at this https URL.
[CV-162] Systematic Integration of Attention Modules into CNNs for Accurate and Generalizable Medical Image Diagnosis
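[Quick read]: This paper addresses the difficulty conventional Convolutional Neural Networks (CNNs) have in capturing fine-grained and complex features in medical image analysis, which limits diagnostic accuracy. The key to its solution is the systematic integration of attention mechanisms into five mainstream CNN architectures (VGG16, ResNet18, InceptionV3, DenseNet121, and EfficientNetB5): each is augmented with either a Squeeze and Excitation block or a hybrid Convolutional Block Attention Module, which adaptively recalibrates channel and spatial feature representations and strengthens the model's focus on salient regions and its discriminative performance. Experiments show that the attention-augmented models outperform their baselines on multiple medical imaging datasets, with EfficientNetB5 plus hybrid attention performing best, and that attention also improves feature localization and generalization across imaging modalities.
Link: https://arxiv.org/abs/2509.05343
Authors: Zahid Ullah,Minki Hong,Tahir Mahmood,Jihie Kim
Affiliations: Dongguk University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: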
Abstract:Deep learning has become a powerful tool for medical image analysis; however, conventional Convolutional Neural Networks (CNNs) often fail to capture the fine-grained and complex features critical for accurate diagnosis. To address this limitation, we systematically integrate attention mechanisms into five widely adopted CNN architectures, namely, VGG16, ResNet18, InceptionV3, DenseNet121, and EfficientNetB5, to enhance their ability to focus on salient regions and improve discriminative performance. Specifically, each baseline model is augmented with either a Squeeze and Excitation block or a hybrid Convolutional Block Attention Module, allowing adaptive recalibration of channel and spatial feature representations. The proposed models are evaluated on two distinct medical imaging datasets, a brain tumor MRI dataset comprising multiple tumor subtypes, and a Products of Conception histopathological dataset containing four tissue categories. Experimental results demonstrate that attention augmented CNNs consistently outperform baseline architectures across all metrics. In particular, EfficientNetB5 with hybrid attention achieves the highest overall performance, delivering substantial gains on both datasets. Beyond improved classification accuracy, attention mechanisms enhance feature localization, leading to better generalization across heterogeneous imaging modalities. This work contributes a systematic comparative framework for embedding attention modules in diverse CNN architectures and rigorously assesses their impact across multiple medical imaging tasks. The findings provide practical insights for the development of robust, interpretable, and clinically applicable deep learning based decision support systems.
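The channel recalibration performed by a Squeeze and Excitation block, as mentioned in the entry above, can be sketched in a few lines. This is a minimal pure-Python illustration, not the paper's implementation: the function name `squeeze_excite`, the weight matrices `w_reduce` and `w_expand`, and the omission of biases are all assumptions made for brevity.

```python
import math

def squeeze_excite(feature_map, w_reduce, w_expand):
    """Apply a Squeeze-and-Excitation style channel reweighting.

    feature_map: list of C channels, each an H x W list of floats.
    w_reduce:    R x C weight matrix (bottleneck layer, biases omitted).
    w_expand:    C x R weight matrix (restores the channel dimension).
    All names and shapes here are illustrative assumptions.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation: bottleneck layer with ReLU, then expansion with sigmoid.
    hidden = [
        max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w_reduce
    ]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_expand
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [
        [[value * gate for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]
    return scaled, gates
```

Each gate lies between 0 and 1, so the block can only attenuate or pass a channel; which channels are emphasized depends entirely on the learned weights.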
[CV-163] Delta Velocity Rectified Flow for Text-to-Image Editing
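[Quick read]: This paper addresses the over-smoothing artifacts caused by distillation sampling in existing text-to-image editing methods, and the loss of editing quality and fidelity caused by their lack of path awareness. The key to its solution is Delta Velocity Rectified Flow (DVRF), an inversion-free, path-aware editing framework that explicitly models the discrepancy between the source and target velocity fields to suppress over-smoothing, and introduces a time-dependent shift term that pushes noisy latents closer to the target trajectory, improving alignment with the target distribution. Theoretically, the method bridges score-based diffusion optimization and velocity-based rectified-flow optimization and gives a principled interpretation of the inversion-free FlowEdit method; it requires no architectural modifications while delivering high-quality, high-fidelity, controllable text-to-image editing.
Link: https://arxiv.org/abs/2509.05342
Authors: Gaspard Beaudouin,Minghan Li,Jaeyeon Kim,Sunghoon Yoon,Mengyu Wang
Affiliations: Harvard AI and Robotics Lab, Harvard University; École Nationale des Ponts et Chaussées, Institut Polytechnique de Paris; Computer Science Department, Harvard University; Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: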
Abstract:We propose Delta Velocity Rectified Flow (DVRF), a novel inversion-free, path-aware editing framework within rectified flow models for text-to-image editing. DVRF is a distillation-based method that explicitly models the discrepancy between the source and target velocity fields in order to mitigate over-smoothing artifacts rampant in prior distillation sampling approaches. We further introduce a time-dependent shift term to push noisy latents closer to the target trajectory, enhancing the alignment with the target distribution. We theoretically demonstrate that when this shift is disabled, DVRF reduces to Delta Denoising Score, thereby bridging score-based diffusion optimization and velocity-based rectified-flow optimization. Moreover, when the shift term follows a linear schedule under rectified-flow dynamics, DVRF generalizes the Inversion-free method FlowEdit and provides a principled theoretical interpretation for it. Experimental results indicate that DVRF achieves superior editing quality, fidelity, and controllability while requiring no architectural modifications, making it efficient and broadly applicable to text-to-image editing tasks. Code is available at this https URL.
[CV-164] Handling imbalance and few-sample size in ML based Onion disease classification
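[Quick read]: This paper addresses multi-class classification of onion crop diseases and pests. Current methods focus mostly on binary classification, which falls short when the specific disease or pest type must be identified in practice. The key to its solution is twofold: first, attention-based modules are integrated into a pre-trained Convolutional Neural Network (CNN) to strengthen feature extraction; second, a comprehensive data augmentation pipeline mitigates class imbalance and improves generalization on a real-world field image dataset. The resulting model reaches 96.90% overall accuracy and an F1 score of 0.96, better than other approaches on the same datasets.
Link: https://arxiv.org/abs/2509.05341
Authors: Abhijeet Manoj Pal,Rajbabu Velmurugan
Affiliations: Centre for Machine Intelligence and Data Science, IIT Bombay, Mumbai, India; Department of Electrical Engineering, IIT Bombay, Mumbai, India
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 6 pages, 8 figures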
Abstract:Accurate classification of pests and diseases plays a vital role in precision agriculture, enabling efficient identification, targeted interventions, and preventing their further spread. However, current methods primarily focus on binary classification, which limits their practical applications, especially in scenarios where accurately identifying the specific type of disease or pest is essential. We propose a robust deep learning based model for multi-class classification of onion crop diseases and pests. We enhance a pre-trained Convolutional Neural Network (CNN) model by integrating attention based modules and employing a comprehensive data augmentation pipeline to mitigate class imbalance. We propose a model which gives 96.90% overall accuracy and 0.96 F1 score on a real-world field image dataset. This model gives better results than other approaches using the same datasets.
[CV-165] Comparative Evaluation of Hard and Soft Clustering for Precise Brain Tumor Segmentation in MR Imaging
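[Quick read]: This paper addresses brain tumor segmentation in Magnetic Resonance Imaging (MRI), which is challenging because tumor morphology and intensity distributions are heterogeneous. It compares two clustering paradigms: hard clustering (K-Means) and soft clustering (Fuzzy C-Means, FCM). The key contribution is an experimental quantification of the trade-off between accuracy and efficiency: K-Means is fast (0.3 s per image on average) but less accurate (Dice Similarity Coefficient, DSC = 0.43), whereas FCM, which gives each pixel partial membership in several clusters, is more accurate (DSC = 0.67) but slower (1.3 s per image). The clustering strategy for clinical use should therefore be chosen according to whether boundary precision or processing efficiency matters more.
Link: https://arxiv.org/abs/2509.05340
Authors: Dibya Jyoti Bora,Mrinal Kanti Mishra
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages, 10 figures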
Abstract:Segmentation of brain tumors from Magnetic Resonance Imaging (MRI) remains a pivotal challenge in medical image analysis due to the heterogeneous nature of tumor morphology and intensity distributions. Accurate delineation of tumor boundaries is critical for clinical decision-making, radiotherapy planning, and longitudinal disease monitoring. In this study, we perform a comprehensive comparative analysis of two major clustering paradigms applied in MRI tumor segmentation: hard clustering, exemplified by the K-Means algorithm, and soft clustering, represented by Fuzzy C-Means (FCM). While K-Means assigns each pixel strictly to a single cluster, FCM introduces partial memberships, meaning each pixel can belong to multiple clusters with varying degrees of association. Experimental validation was performed using the BraTS2020 dataset, incorporating pre-processing through Gaussian filtering and Contrast Limited Adaptive Histogram Equalization (CLAHE). Evaluation metrics included the Dice Similarity Coefficient (DSC) and processing time, which collectively demonstrated that K-Means achieved superior speed with an average runtime of 0.3s per image, whereas FCM attained higher segmentation accuracy with an average DSC of 0.67 compared to 0.43 for K-Means, albeit at a higher computational cost (1.3s per image). These results highlight the inherent trade-off between computational efficiency and boundary precision.
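The difference between hard and soft cluster assignment, and the Dice Similarity Coefficient used to score the result, can be illustrated with a small sketch. It assumes scalar pixel intensities, already-known cluster centers, and the textbook FCM membership formula with fuzzifier `m`; the function names are hypothetical and none of this is taken from the paper's code.

```python
def dice_coefficient(pred, truth):
    """Dice Similarity Coefficient between two flat binary masks."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total

def hard_assignment(value, centers):
    """K-Means style: index of the single nearest cluster center."""
    return min(range(len(centers)), key=lambda k: abs(value - centers[k]))

def fuzzy_memberships(value, centers, m=2.0):
    """FCM style: membership degree in every cluster; the degrees sum to 1."""
    dists = [abs(value - c) for c in centers]
    zeros = [d == 0.0 for d in dists]
    if any(zeros):
        # The pixel sits exactly on one or more centers: split membership there.
        share = 1.0 / sum(zeros)
        return [share if z else 0.0 for z in zeros]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_k / d_j) ** power for d_j in dists) for d_k in dists]
```

For an intensity of 0.4 and centers at 0.0 and 1.0, `hard_assignment` commits to cluster 0, while `fuzzy_memberships` splits the pixel roughly 0.69 to 0.31, which is the extra boundary information the soft method carries.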
[CV-166] Anticipatory Fall Detection in Humans with Hybrid Directed Graph Neural Networks and Long Short-Term Memory
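[Quick read]: This paper addresses the prediction of human falls, in particular the recognition and early warning of the transient state between stability and a fall, which existing assistive robotic systems have left largely unexplored. The core of its solution is a hybrid model that combines Dynamic Graph Neural Networks (DGNN) with Long Short-Term Memory (LSTM) networks and decouples motion prediction from gait classification. The DGNN classifies three gait states (stable, transient, and fall), while the LSTM predicts subsequent movement from real-time skeletal features, enabling anticipatory fall detection. On the OUMVLP-Pose and URFD datasets, the method outperforms DGNN-only models and other methods from the literature in both prediction error and recognition accuracy, and it can monitor the transient state, which is valuable for advanced assistance systems.
Link: https://arxiv.org/abs/2509.05337
Authors: Younggeol Cho,Gokhan Solak,Olivia Nocentini,Marta Lorenzini,Andrea Fortuna,Arash Ajoudani
Affiliations: Istituto Italiano di Tecnologia
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Presented at IEEE RO-MAN 2025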
Abstract:Detecting and preventing falls in humans is a critical component of assistive robotic systems. While significant progress has been made in detecting falls, the prediction of falls before they happen, and analysis of the transient state between stability and an impending fall remain unexplored. In this paper, we propose an anticipatory fall detection method that utilizes a hybrid model combining Dynamic Graph Neural Networks (DGNN) with Long Short-Term Memory (LSTM) networks and decouples the motion prediction and gait classification tasks to anticipate falls with high accuracy. Our approach employs real-time skeletal features extracted from video sequences as input for the proposed model. The DGNN acts as a classifier, distinguishing between three gait states: stable, transient, and fall. The LSTM-based network then predicts human movement in subsequent time steps, enabling early detection of falls. The proposed model was trained and validated using the OUMVLP-Pose and URFD datasets, demonstrating superior performance in terms of prediction error and recognition accuracy compared to models relying solely on DGNN and models from the literature. The results indicate that decoupling prediction and classification improves performance compared to addressing the unified problem using only the DGNN. Furthermore, our method allows for the monitoring of the transient state, offering valuable insights that could enhance the functionality of advanced assistance systems.
[CV-167] A Stroke-Level Large-Scale Database of Chinese Character Handwriting and the OpenHandWrite_Toolbox for Handwriting Research
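[Quick read]: This paper addresses two questions: how linguistic components (such as the phonological, semantic, and orthographic systems) modulate Chinese handwriting at the character, radical, and stroke levels; and the lack of comprehensive tools for capturing and batch-processing fine-grained handwriting data. The key to its solution is a large-scale handwriting database (42 participants, each writing 1200 characters) together with an upgraded OpenHandWrite_Toolbox that captures stroke-level trajectories and batch-processes handwriting measurements (such as latency, duration, and pen pressure). Multiple regression analyses show that orthographic predictors affect handwriting preparation and execution at the character, radical, and stroke levels, and that phonological factors affect execution at all three levels, with hierarchical attenuation: the effects are strongest at the character level, weaker at the radical level, and weakest at the stroke level. The database and toolbox are a resource for research on the cognitive mechanisms of character and sub-character handwriting across languages.
Link: https://arxiv.org/abs/2509.05335
Authors: Zebo Xu,Shaoyun Yu,Mark Torrance,Guido Nottbusch,Nan Zhao,Zhenguang Cai
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: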
Abstract:Understanding what linguistic components (e.g., phonological, semantic, and orthographic systems) modulate Chinese handwriting at the character, radical, and stroke levels remains an important yet understudied topic. Additionally, there is a lack of comprehensive tools for capturing and batch-processing fine-grained handwriting data. To address these issues, we constructed a large-scale handwriting database in which 42 Chinese speakers each handwrote 1200 characters in a handwriting-to-dictation task. Additionally, we enhanced the existing handwriting package and provided comprehensive documentation for the upgraded OpenHandWrite_Toolbox, which can easily modify the experimental design, capture the stroke-level handwriting trajectory, and batch-process handwriting measurements (e.g., latency, duration, and pen-pressure). In analysing our large-scale database, multiple regression results show that orthographic predictors impact handwriting preparation and execution across character, radical, and stroke levels. Phonological factors also influence execution at all three levels. Importantly, these lexical effects demonstrate hierarchical attenuation - they were most pronounced at the character level, followed by the radical level, and were weakest at the stroke level. These findings demonstrate that handwriting preparation and execution at the radical and stroke levels are closely intertwined with linguistic components. This database and toolbox offer valuable resources for future psycholinguistic and neurolinguistic research on the handwriting of characters and sub-characters across different languages.
[CV-168] A Real-Time Vision-Based System for Badminton Smash Speed Estimation on Mobile Devices
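[Quick read]: This paper addresses the difficulty amateur and recreational badminton players have in obtaining professional-grade performance metrics such as smash speed, because traditional measurement technology is expensive, complex, and hard to access. The key to its solution is an integrated mobile application built on ubiquitous smartphone technology: a custom-trained YOLOv5 model detects the shuttlecock, a Kalman filter provides robust trajectory tracking, and a video-based kinematic speed estimation method with spatiotemporal scaling computes the shuttlecock's speed automatically from an ordinary video recording, giving players at all levels accessible and accurate performance feedback.
Link: https://arxiv.org/abs/2509.05334
Authors: Diwen Huang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 6 pages, 3 figures, 1 table. Independent research preprint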
Abstract:Performance metrics in sports, such as shot speed and angle, provide crucial feedback for athlete development. However, the technology to capture these metrics has historically been expensive, complex, and largely inaccessible to amateur and recreational players. This paper addresses this gap in the context of badminton, one of the world’s most popular sports, by introducing a novel, cost-effective, and user-friendly system for measuring smash speed using ubiquitous smartphone technology. Our approach leverages a custom-trained YOLOv5 model for shuttlecock detection, combined with a Kalman filter for robust trajectory tracking. By implementing a video-based kinematic speed estimation method with spatiotemporal scaling, the system automatically calculates the shuttlecock’s velocity from a standard video recording. The entire process is packaged into an intuitive mobile application, democratizing access to high-level performance analytics and empowering players at all levels to analyze and improve their game.
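The final step of such a pipeline, turning tracked pixel positions into a physical speed, might look like the sketch below. It assumes the detector and tracker have already produced per-frame shuttlecock positions, that a single `metres_per_pixel` scale is valid across the frame, and that the peak frame-to-frame speed is the quantity of interest; the paper's actual spatiotemporal scaling is not described in the abstract, so every name and choice here is hypothetical.

```python
import math

def estimate_speed_kmh(track, metres_per_pixel, fps):
    """Peak speed along a tracked path, in km/h.

    track: list of (frame_index, x_px, y_px) tuples sorted by frame index.
    metres_per_pixel: assumed spatial scale (e.g. from a known court width).
    fps: video frame rate, used as the temporal scale.
    """
    peak_m_per_s = 0.0
    for (f0, x0, y0), (f1, x1, y1) in zip(track, track[1:]):
        dt = (f1 - f0) / fps  # seconds between the two detections
        if dt <= 0:
            continue  # skip duplicate or out-of-order frames
        dist_m = math.hypot(x1 - x0, y1 - y0) * metres_per_pixel
        peak_m_per_s = max(peak_m_per_s, dist_m / dt)
    return peak_m_per_s * 3.6  # convert m/s to km/h
```

A single planar scale ignores motion toward or away from the camera, so a sketch like this would tend to underestimate speed whenever the shuttlecock does not travel parallel to the image plane.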
[CV-169] RT-VLM: Re-Thinking Vision Language Model with 4-Clues for Real-World Object Recognition Robustness
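[Quick read]: This paper addresses the sharp drop in accuracy that modern object recognition models suffer under domain shift in real-world deployment, including changes in low-level image statistics, variations in object pose and viewpoint, partial occlusion, and visual confusion between adjacent classes. The key to its solution is the Re-Thinking Vision Language Model (RT-VLM) framework. Its core is a synthetic dataset annotated with "4-Clues" (precise bounding boxes, class names, object-level captions, and a scene-level caption), on which Llama 3.2 11B Vision Instruct is fine-tuned with parameter-efficient supervised tuning. At inference time a two-stage Re-Thinking scheme is applied: the model first emits its own four clues, then re-examines them as evidence and iteratively corrects them, improving the robustness and transferability of visual understanding.
Link: https://arxiv.org/abs/2509.05333
Authors: Junghyun Park,Tuan Anh Nguyen,Dugki Min
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: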
Abstract:Real-world deployments often expose modern object recognition models to domain shifts that precipitate a severe drop in accuracy. Such shifts encompass (i) variations in low-level image statistics, (ii) changes in object pose and viewpoint, (iii) partial occlusion, and (iv) visual confusion across adjacent classes. To mitigate this degradation, we introduce the Re-Thinking Vision Language Model (RT-VLM) framework. The foundation of this framework is a unique synthetic dataset generation pipeline that produces images annotated with “4-Clues”: precise bounding boxes, class names, detailed object-level captions, and a comprehensive context-level caption for the entire scene. We then perform parameter-efficient supervised tuning of Llama 3.2 11B Vision Instruct on this resource. At inference time, a two-stage Re-Thinking scheme is executed: the model first emits its own four clues, then re-examines these responses as evidence and iteratively corrects them. Across robustness benchmarks that isolate individual domain shifts, RT-VLM consistently surpasses strong baselines. These findings indicate that the integration of structured multimodal evidence with an explicit self-critique loop constitutes a promising route toward reliable and transferable visual understanding.
[CV-170] Optical Music Recognition of Jazz Lead Sheets
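[Quick read]: This paper addresses Optical Music Recognition (OMR) for handwritten jazz lead sheets. The core challenges are that existing OMR systems do not handle chords and that handwritten images are highly variable and uneven in quality. The key to its solution is twofold: first, a new dataset of 293 handwritten jazz lead sheets totaling 2021 staves, each aligned with ground truth in Humdrum **kern and MusicXML formats, plus synthetic images generated from the ground truth; second, an OMR model for jazz lead sheets that improves joint recognition of melody and chords through tailored tokenisation choices and the use of synthetic data and pretrained models.
Link: https://arxiv.org/abs/2509.05329
Authors: Juan Carlos Martinez-Sevilla,Francesco Foscarin,Patricia Garcia-Iasci,David Rizo,Jorge Calvo-Zaragoza,Gerhard Widmer
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at the 26th International Society for Music Information Retrieval Conference (ISMIR), 2025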
Abstract:In this paper, we address the challenge of Optical Music Recognition (OMR) for handwritten jazz lead sheets, a widely used musical score type that encodes melody and chords. The task is challenging due to the presence of chords, a score component not handled by existing OMR systems, and the high variability and quality issues associated with handwritten images. Our contribution is two-fold. We present a novel dataset consisting of 293 handwritten jazz lead sheets of 163 unique pieces, amounting to 2021 total staves aligned with Humdrum **kern and MusicXML ground truth scores. We also supply synthetic score images generated from the ground truth. The second contribution is the development of an OMR model for jazz lead sheets. We discuss specific tokenisation choices related to our kind of data, and the advantages of using synthetic scores and pretrained models. We publicly release all code, data, and models.
[CV-171] Feed Two Birds with One Scone: Exploiting Function-Space Regularization for Both OOD Robustness and ID Fine-Tuning Performance
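[Quick read]: This paper addresses the difficulty of keeping both strong in-distribution (ID) performance and out-of-distribution (OOD) robustness when fine-tuning a pre-trained model on a downstream task. Existing robust fine-tuning methods mostly preserve the pre-trained weights, features, or logits, but the authors find that these strategies are not consistently effective across model architectures, because they do not directly optimize stability in function space, whereas OOD robustness fundamentally requires stable predictions on OOD inputs for the downstream task. The authors therefore propose a regularizer that constrains the function-space distance between the fine-tuned and pre-trained models on simulated OOD samples, and add a consistency regularization that promotes stable predictions on perturbed samples. Experiments show consistent gains in both ID performance and OOD robustness across a variety of CLIP backbones, outperforming existing regularization-based robust fine-tuning methods.
Link: https://arxiv.org/abs/2509.05328
Authors: Xiang Yuan,Jun Shu,Deyu Meng,Zongben Xu
Affiliations: Xi’an Jiaotong University
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: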
Abstract:Robust fine-tuning aims to achieve competitive in-distribution (ID) performance while maintaining the out-of-distribution (OOD) robustness of a pre-trained model when transferring it to a downstream task. To this end, most robust fine-tuning methods aim to preserve the pretrained weights, features, or logits. However, we find that these methods cannot always improve OOD robustness for different model architectures. This is because OOD robustness requires the model function to produce stable predictions for input information of downstream tasks, while existing methods might serve as a poor proxy for the optimization in the function space. Based on this finding, we propose a novel regularization that constrains the distance between the fine-tuned and pre-trained models in the function space with the simulated OOD samples, aiming to preserve the OOD robustness of the pre-trained model. Besides, to further enhance the OOD robustness of the fine-tuned model, we introduce an additional consistency regularization to promote stable predictions of perturbed samples. Extensive experiments demonstrate that our approach consistently improves both downstream task ID fine-tuning performance and OOD robustness across a variety of CLIP backbones, outperforming existing regularization-based robust fine-tuning methods.
[CV-172] Application of discrete Ricci curvature in pruning randomly wired neural networks: A case study with chest x-ray classification of COVID-19
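[Quick read]: This paper addresses how to prune Randomly Wired Neural Networks (RWNNs) to reduce network complexity while preserving model performance. The core challenge is to identify and retain the synapses (edges) that matter most for performance while removing redundant connections. The key to its solution is to guide pruning with three edge-centric network measures, Forman-Ricci curvature (FRC), Ollivier-Ricci curvature (ORC), and edge betweenness centrality (EBC), and to evaluate in particular whether FRC, which is cheaper to compute, matches ORC in compression ratio and theoretical speedup. The results indicate that FRC-based pruning can effectively simplify RWNNs, substantially reducing computational cost while keeping classification performance comparable to ORC.
Link: https://arxiv.org/abs/2509.05322
Authors: Pavithra Elumalai,Sudharsan Vijayaraghavan,Madhumita Mondal,Areejit Samal
Affiliations: The Institute of Mathematical Sciences (IMSc); Homi Bhabha National Institute (HBNI)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Computational Physics (physics.comp-ph)
Comments: 21 pages, 4 figures, 9 tables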
Abstract:Randomly Wired Neural Networks (RWNNs) serve as a valuable testbed for investigating the impact of network topology in deep learning by capturing how different connectivity patterns impact both learning efficiency and model performance. At the same time, they provide a natural framework for exploring edge-centric network measures as tools for pruning and optimization. In this study, we investigate three edge-centric network measures: Forman-Ricci curvature (FRC), Ollivier-Ricci curvature (ORC), and edge betweenness centrality (EBC), to compress RWNNs by selectively retaining important synapses (or edges) while pruning the rest. As a baseline, RWNNs are trained for COVID-19 chest x-ray image classification, aiming to reduce network complexity while preserving performance in terms of accuracy, specificity, and sensitivity. We extend prior work on pruning RWNN using ORC by incorporating two additional edge-centric measures, FRC and EBC, across three network generators: Erdős-Rényi (ER) model, Watts-Strogatz (WS) model, and Barabási-Albert (BA) model. We provide a comparative analysis of the pruning performance of the three measures in terms of compression ratio and theoretical speedup. A central focus of our study is to evaluate whether FRC, which is computationally more efficient than ORC, can achieve comparable pruning effectiveness. Along with performance evaluation, we further investigate the structural properties of the pruned networks through modularity and global efficiency, offering insights into the trade-off between modular segregation and network efficiency in compressed RWNNs. Our results provide initial evidence that FRC-based pruning can effectively simplify RWNNs, offering significant computational advantages while maintaining performance comparable to ORC.
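Part of why FRC is cheap to compute is that, in its simplest combinatorial form on an unweighted undirected graph, the curvature of an edge (u, v) depends only on the two endpoint degrees: 4 - deg(u) - deg(v). The sketch below uses that form to rank and prune edges. Both the choice of this unweighted variant and the rule of keeping the most negatively curved edges are assumptions made for illustration; the abstract does not state which FRC variant or selection rule the authors use.

```python
from collections import defaultdict

def forman_ricci(edges):
    """Forman-Ricci curvature of each edge in an unweighted, undirected graph.

    Uses the simple combinatorial form F(u, v) = 4 - deg(u) - deg(v).
    """
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return {(u, v): 4 - degree[u] - degree[v] for u, v in edges}

def prune_by_curvature(edges, keep_fraction):
    """Keep the given fraction of edges, most negative curvature first.

    The ranking direction is a hypothetical choice for this sketch.
    """
    curvature = forman_ricci(edges)
    n_keep = max(1, round(len(edges) * keep_fraction))
    ranked = sorted(edges, key=lambda e: curvature[e])  # stable sort
    return ranked[:n_keep]
```

Computing all curvatures takes one pass over the edge list, whereas ORC requires solving an optimal transport problem per edge, which is the efficiency gap the abstract refers to.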
[CV-173] A Dataset Generation Scheme Based on Video2EEG-SPGN-Diffusion for SEED-VD
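[Quick read]: This paper addresses the difficulty of aligning EEG signals with video stimuli in multimodal data fusion, and in particular the challenge of building high-quality, large-scale joint video-EEG datasets. The key to its solution is an open-source framework, Video2EEG-SPGN-Diffusion, which combines a self-play graph network (SPGN) with a diffusion model to generate personalized 62-channel EEG signals conditioned on video stimuli, and an engineering pipeline that aligns video and EEG data pairs. The work releases a new multimodal dataset of over 1000 samples with emotion labels and offers tools for emotion analysis, data augmentation, and brain-computer interface applications.
Link: https://arxiv.org/abs/2509.05321
Authors: Yunfei Guo,Tao Zhang,Wu Huang,Yao Song
Affiliations: Chengdu Techman Software Co., Ltd.; Sichuan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: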
Abstract:This paper introduces an open-source framework, Video2EEG-SPGN-Diffusion, that leverages the SEED-VD dataset to generate a multimodal dataset of EEG signals conditioned on video stimuli. Additionally, we disclose an engineering pipeline for aligning video and EEG data pairs, facilitating the training of multimodal large models with EEG alignment capabilities. Personalized EEG signals are generated using a self-play graph network (SPGN) integrated with a diffusion model. As a major contribution, we release a new dataset comprising over 1000 samples of SEED-VD video stimuli paired with generated 62-channel EEG signals at 200 Hz and emotion labels, enabling video-EEG alignment and advancing multimodal research. This framework offers novel tools for emotion analysis, data augmentation, and brain-computer interface applications, with substantial research and engineering significance.
[CV-174] Context-Aware Knowledge Distillation with Adaptive Weighting for Image Classification
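[Quick read]: This paper addresses the suboptimality of a fixed balancing factor α in Knowledge Distillation (KD): a static α cannot adapt to the changing balance of supervision that the student needs from hard labels and from the teacher during training. The key to its solution is an Adaptive Knowledge Distillation (AKD) framework: first, α is made a learnable parameter and a formula based on the student-teacher discrepancy is introduced to compute it dynamically; second, a Context-Aware Module (CAM) built from an MLP and attention adaptively reweights the teacher's class-wise outputs. On CIFAR-10 with ResNet-50 as teacher and ResNet-18 as student, the method improves accuracy over fixed-weight KD baselines and converges more stably.
Link: https://arxiv.org/abs/2509.05319
Authors: Zhengda Li
Affiliations: Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: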
Abstract:Knowledge distillation (KD) is a widely used technique to transfer knowledge from a large teacher network to a smaller student model. Traditional KD uses a fixed balancing factor alpha as a hyperparameter to combine the hard-label cross-entropy loss with the soft-label distillation loss. However, a static alpha is suboptimal because the optimal trade-off between hard and soft supervision can vary during training. In this work, we propose an Adaptive Knowledge Distillation (AKD) framework. First, we make alpha a learnable parameter that is optimized automatically during training. Then we introduce a formula to reflect the gap between the student and the teacher to compute alpha dynamically, guided by student-teacher discrepancies, and further introduce a Context-Aware Module (CAM) using MLP + Attention to adaptively reweight class-wise teacher outputs. Experiments on CIFAR-10 with ResNet-50 as teacher and ResNet-18 as student demonstrate that our approach achieves superior accuracy compared to fixed-weight KD baselines, and yields more stable convergence.
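For reference, the fixed-alpha baseline that this entry argues against can be written out as a short sketch. It follows the common formulation in which alpha weights the hard-label cross-entropy and (1 - alpha) weights a temperature-scaled KL term; which term alpha multiplies, the default temperature, and the function names are assumptions, and the paper's adaptive alpha formula and CAM are not reproduced here because the abstract does not specify them.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, label, alpha, temperature=4.0):
    """Fixed-weight knowledge distillation loss for a single example."""
    # Hard-label term: cross-entropy of the student against the true class.
    hard = -math.log(softmax(student_logits)[label])
    # Soft-label term: KL(teacher || student) on temperature-softened outputs.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student))
    # T^2 keeps the soft term's gradient scale comparable across temperatures.
    return alpha * hard + (1.0 - alpha) * temperature ** 2 * soft
```

With `alpha` fixed, the balance between the two terms is the same at every training step, which is the limitation the adaptive scheme in this entry is meant to remove.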
[CV-175] VILOD: A Visual Interactive Labeling Tool for Object Detection
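[Quick read]: This paper addresses the high cost and low efficiency of annotation in Object Detection (OD), which depends on large, accurately labeled datasets, and the limitations of conventional Active Learning (AL): lack of transparency, limited room for the strategic insight of human experts, and the risk of overlooking informative samples that do not match the query strategy. The key to its solution is VILOD, a visual interactive labeling tool within a Human-in-the-Loop (HITL) framework. It integrates Visual Analytics (VA) components such as a t-SNE projection of image features, uncertainty heatmaps, and model state views, so that users can explore the data, understand the model state, interpret AL suggestions, and apply diverse sample selection strategies in an iterative HITL workflow, making annotation more interpretable and controllable and potentially improving final model performance.
Link: https://arxiv.org/abs/2509.05317
Authors: Isac Holm
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Master’s project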
Abstract:The advancement of Object Detection (OD) using Deep Learning (DL) is often hindered by the significant challenge of acquiring large, accurately labeled datasets, a process that is time-consuming and expensive. While techniques like Active Learning (AL) can reduce annotation effort by intelligently querying informative samples, they often lack transparency, limit the strategic insight of human experts, and may overlook informative samples not aligned with an employed query strategy. To mitigate these issues, Human-in-the-Loop (HITL) approaches integrating human intelligence and intuition throughout the machine learning life-cycle have gained traction. Leveraging Visual Analytics (VA), effective interfaces can be created to facilitate this human-AI collaboration. This thesis explores the intersection of these fields by developing and investigating “VILOD: A Visual Interactive Labeling tool for Object Detection”. VILOD utilizes components such as a t-SNE projection of image features, together with uncertainty heatmaps and model state views, enabling users to explore data, interpret model states and AL suggestions, and implement diverse sample selection strategies within an iterative HITL workflow for OD. An empirical investigation using comparative use cases demonstrated how VILOD, through its interactive visualizations, facilitates the implementation of distinct labeling strategies by making the model’s state and dataset characteristics more interpretable (RQ1). The study showed that different visually-guided labeling strategies employed within VILOD result in competitive OD performance trajectories compared to an automated uncertainty sampling AL baseline (RQ2). This work contributes a novel tool and empirical insight into making the HITL-AL workflow for OD annotation more transparent, manageable, and potentially more effective.
[CV-176] Evaluation of Large Language Models for Anomaly Detection in Autonomous Vehicles
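[Quick read]: This paper addresses the insufficient evaluation of Large Language Models (LLMs) as anomaly detection modules in autonomous driving systems, in particular the lack of reliable evidence on how LLMs behave in real edge cases where existing perception and planning algorithms fail. The key to its solution is an architecture that couples an open-vocabulary object detector with prompt engineering and LLM contextual reasoning, used to evaluate several state-of-the-art LLMs qualitatively on real-world edge cases and to discuss their feasibility as anomaly detectors for autonomous vehicles.
Link: https://arxiv.org/abs/2509.05315
Authors: Petros Loukas,David Bassir,Savvas Chatzichristofis,Angelos Amanatiadis
Affiliations: Democritus University of Thrace; Dongguan University of Technology; ENS-Paris-Saclay University; Neapolis University Pafos
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: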
Abstract:The rapid evolution of large language models (LLMs) has pushed their boundaries to many applications in various domains. Recently, the research community has started to evaluate their potential adoption in autonomous vehicles and especially as complementary modules in the perception and planning software stacks. However, their evaluation has been limited to synthetic datasets or manual-driving datasets that lack ground-truth knowledge, specifically of how the current perception and planning algorithms would perform in the cases under evaluation. For this reason, this work evaluates LLMs on real-world edge cases where current autonomous vehicles have been proven to fail. The proposed architecture consists of an open vocabulary object detector coupled with prompt engineering and large language model contextual reasoning. We evaluate several state-of-the-art models against real edge cases and provide qualitative comparison results along with a discussion on the findings for the potential application of LLMs as anomaly detectors in autonomous vehicles.
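[CV-177] ManipDreamer3D: Synthesizing Plausible Robotic Manipulation Video with Occupancy-aware 3D Trajectory
Quick read: This paper addresses the training difficulty caused by data scarcity in robotic manipulation, in particular the 3D spatial ambiguity of existing diffusion-model-based methods, which mostly rely on 2D trajectories. The key to its solution is a new framework, ManipDreamer3D, which reconstructs a 3D occupancy representation from a third-person-view image, plans an optimized 3D end-effector trajectory, and uses a novel trajectory-to-video diffusion model to generate physically plausible, 3D-aware robotic manipulation videos. The method substantially reduces the need for human intervention while outperforming existing methods in visual quality.
Link: https://arxiv.org/abs/2509.05314
Authors: Ying Li,Xiaobao Wei,Xiaowei Chi,Yuming Li,Zhongyu Zhao,Hao Wang,Ningning Ma,Ming Lu,Shanghang Zhang
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages; 7 figures; 4 tables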
Abstract:Data scarcity continues to be a major challenge in the field of robotic manipulation. Although diffusion models provide a promising solution for generating robotic manipulation videos, existing methods largely depend on 2D trajectories, which inherently face issues with 3D spatial ambiguity. In this work, we present a novel framework named ManipDreamer3D for generating plausible 3D-aware robotic manipulation videos from the input image and the text instruction. Our method combines 3D trajectory planning with a reconstructed 3D occupancy map created from a third-person perspective, along with a novel trajectory-to-video diffusion model. Specifically, ManipDreamer3D first reconstructs the 3D occupancy representation from the input image and then computes an optimized 3D end-effector trajectory, minimizing path length while avoiding collisions. Next, we employ a latent editing technique to create video sequences from the initial image latent and the optimized 3D trajectory. This process conditions our specially trained trajectory-to-video diffusion model to produce robotic pick-and-place videos. Our method generates robotic videos with autonomously planned plausible 3D trajectories, significantly reducing human intervention requirements. Experimental results demonstrate superior visual quality compared to existing methods.
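The trajectory objective described in this abstract (short path, no collisions with the occupancy map) can be sketched as a cost check over a candidate trajectory. This is only an illustration under stated assumptions: the occupancy map is modelled as a set of occupied integer voxel indices, collisions are tested at the waypoints only, and `path_length`, `collides`, and `voxel_size` are hypothetical names that do not come from the paper.

```python
import math


def path_length(waypoints):
    """Total Euclidean length of a polyline through 3D waypoints."""
    return sum(math.dist(a, b) for a, b in zip(waypoints, waypoints[1:]))


def collides(waypoints, occupied_voxels, voxel_size):
    """True if any waypoint falls inside an occupied voxel.

    Only the waypoints are tested, not the segments between them, so a
    real planner would need a denser sampling of the path.
    """
    for point in waypoints:
        voxel = tuple(math.floor(c / voxel_size) for c in point)
        if voxel in occupied_voxels:
            return True
    return False


occupied = {(1, 0, 0)}  # one occupied cell of an assumed 1 m grid
direct = [(0.5, 0.5, 0.5), (1.5, 0.5, 0.5), (2.5, 0.5, 0.5)]
detour = [(0.5, 0.5, 0.5), (1.5, 1.5, 0.5), (2.5, 0.5, 0.5)]

for name, traj in (("direct", direct), ("detour", detour)):
    print(name, round(path_length(traj), 3), collides(traj, occupied, 1.0))
```

The shorter trajectory is rejected because it crosses the occupied cell, while the longer detour is collision-free, which is the trade-off an optimizer of this kind has to balance.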
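[CV-178] Label Smoothing++: Enhanced Label Regularization for Training Neural Networks BMVC
Quick read: This paper addresses a shortcoming of conventional label smoothing for training neural networks. Although label smoothing mitigates overconfidence and overfitting by adding a uniform probability vector to the one-hot label, it assigns equal probability to all non-target classes, ignoring the relationships between classes and thereby destroying inter-class semantic structure. The key to its solution is a new label regularization strategy, Label Smoothing++, which keeps a fixed label for the target class, assigns non-zero probabilities to non-target classes, and explicitly models the relationships among them, so that the network makes fewer overconfident predictions and also captures inter-class relationships and generalizes better.
Link: https://arxiv.org/abs/2509.05307
Authors: Sachin Chhabra,Hemanth Venkateswara,Baoxin Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in British Machine Vision Conference (BMVC), 2024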
Abstract:Training neural networks with one-hot target labels often results in overconfidence and overfitting. Label smoothing addresses this issue by perturbing the one-hot target labels by adding a uniform probability vector to create a regularized label. Although label smoothing improves the network’s generalization ability, it assigns equal importance to all the non-target classes, which destroys the inter-class relationships. In this paper, we propose a novel label regularization training strategy called Label Smoothing++, which assigns non-zero probabilities to non-target classes and accounts for their inter-class relationships. Our approach uses a fixed label for the target class while enabling the network to learn the labels associated with non-target classes. Through extensive experiments on multiple datasets, we demonstrate how Label Smoothing++ mitigates overconfident predictions while promoting inter-class relationships and generalization capabilities.
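For reference, the conventional uniform label smoothing that this abstract takes as its starting point can be written in a few lines. The sketch below shows only that baseline, not Label Smoothing++ itself, whose learned non-target labels are not specified in the abstract; the function name and the value `alpha=0.1` are illustrative assumptions.

```python
def smooth_labels(target, num_classes, alpha=0.1):
    """Uniform label smoothing: (1 - alpha) * one_hot + alpha / num_classes."""
    uniform = alpha / num_classes
    return [
        (1.0 - alpha) + uniform if c == target else uniform
        for c in range(num_classes)
    ]


label = smooth_labels(target=2, num_classes=4, alpha=0.1)
print(label)
```

Every non-target class receives the same mass `alpha / num_classes`, which is the equal-importance property the paper argues discards inter-class relationships.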
[CV-179] LocoMamba: Vision-Driven Locomotion via End-to-End Deep Reinforcement Learning with Mamba
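Quick read: This paper addresses inefficient long-range dependency modeling, unstable training, and weak generalization in robot navigation tasks in complex dynamic environments. The key to its solution is LocoMamba, a vision-driven cross-modal deep reinforcement learning framework built on selective state-space models. Its core contributions are: 1) encoding proprioceptive states with a multilayer perceptron (MLP) and patchifying depth images with a lightweight convolutional neural network to produce compact tokens that improve state representation; 2) fusing the tokens with stacked Mamba layers in near-linear time, using selective scanning to capture long-range dependencies while reducing latency and memory footprint and remaining robust to sequence length and image resolution; 3) training under terrain and appearance randomization and an obstacle-density curriculum, with a compact state-centric reward that jointly optimizes forward progress, motion smoothness, and safety, which markedly improves training efficiency and generalization across environments.
Link: https://arxiv.org/abs/2508.11849
Authors: Yinuo Wang,Gavin Tao
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Systems and Control (eess.SY)
Comments: 13 pages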
Abstract:We introduce LocoMamba, a vision-driven cross-modal DRL framework built on selective state-space models, specifically leveraging Mamba, that achieves near-linear-time sequence modeling, effectively captures long-range dependencies, and enables efficient training with longer sequences. First, we embed proprioceptive states with a multilayer perceptron and patchify depth images with a lightweight convolutional neural network, producing compact tokens that improve state representation. Second, stacked Mamba layers fuse these tokens via near-linear-time selective scanning, reducing latency and memory footprint, remaining robust to token length and image resolution, and providing an inductive bias that mitigates overfitting. Third, we train the policy end-to-end with Proximal Policy Optimization under terrain and appearance randomization and an obstacle-density curriculum, using a compact state-centric reward that balances progress, smoothness, and safety. We evaluate our method in challenging simulated environments with static and moving obstacles as well as uneven terrain. Compared with state-of-the-art baselines, our method achieves higher returns and success rates with fewer collisions, exhibits stronger generalization to unseen terrains and obstacle densities, and improves training efficiency by converging in fewer updates under the same compute budget.
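The policy in this abstract is trained with Proximal Policy Optimization. As a reminder of what that objective does, here is a minimal per-sample sketch of the standard PPO clipped surrogate; the clipping value `clip_eps=0.2` is a commonly used default and an assumption here, since the abstract gives no hyperparameters.

```python
def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for a single sample.

    ratio: new_policy_prob / old_policy_prob for the action taken.
    advantage: estimated advantage of that action.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# A large ratio with a positive advantage is capped at (1 + clip_eps) * advantage.
print(ppo_clipped_objective(1.5, 1.0))
# A small ratio with a negative advantage keeps the more pessimistic clipped term.
print(ppo_clipped_objective(0.5, -1.0))
```

Taking the minimum of the clipped and unclipped terms removes the incentive to move the policy far from the one that collected the data, which is one reason PPO is a common choice for end-to-end locomotion training.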
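[CV-180] AdCare-VLM: Leveraging Large Vision Language Model (LVLM) to Monitor Long-Term Medication Adherence and Care
Quick read: This paper addresses poor medication adherence among patients with chronic diseases such as tuberculosis; the core challenge is how to use automated visual analysis to recognize patients' medication-taking behavior accurately and quantify adherence. The key to its solution is AdCare-VLM, a multimodal large vision language model (LVLM) based on the Video-LLaVA architecture and fine-tuned on 806 tuberculosis medication videos annotated by clinical experts, which aligns key visual features of medication taking (such as facial clarity, medication visibility, water intake, and the act of swallowing) with medical semantic concepts. By strengthening the consistency of visual-language representations, the method improves visual question answering (VQA) accuracy on positive, negative, and ambiguous adherence cases, outperforms existing baselines (such as LLaVA-V1.5 and Chat-UniVi) across several parameter-efficient fine-tuning configurations, and supports its interpretability and effectiveness with ablation studies and attention heatmaps.
Link: https://arxiv.org/abs/2505.00275
Authors: Md Asaduzzaman Jabin,Hanqi Jiang,Yiwei Li,Patrick Kaggwa,Eugene Douglass,Juliet N. Sekandi,Tianming Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: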
Abstract:Chronic diseases, including diabetes, hypertension, asthma, HIV-AIDS, epilepsy, and tuberculosis, necessitate rigorous adherence to medication to avert disease progression, manage symptoms, and decrease mortality rates. Adherence is frequently undermined by factors including patient behavior, caregiver support, elevated medical costs, and insufficient healthcare infrastructure. We propose AdCare-VLM, a specialized Video-LLaVA-based multimodal large vision language model (LVLM) aimed at visual question answering (VQA) concerning medication adherence through patient videos. We employ a private dataset comprising 806 custom-annotated tuberculosis (TB) medication monitoring videos, which have been labeled by clinical experts, to fine-tune the model for adherence pattern detection. We present LLM-TB-VQA, a detailed medical adherence VQA dataset that encompasses positive, negative, and ambiguous adherence cases. Our method identifies correlations between visual features, such as the clear visibility of the patient’s face, medication, water intake, and the act of ingestion, and their associated medical concepts in captions. This facilitates the integration of aligned visual-linguistic representations and improves multimodal interactions. Experimental results indicate that our method surpasses parameter-efficient fine-tuning (PEFT) enabled VLM models, such as LLaVA-V1.5 and Chat-UniVi, with absolute improvements ranging from 3.1% to 3.54% across pre-trained, regular, and low-rank adaptation (LoRA) configurations. Comprehensive ablation studies and attention map visualizations substantiate our approach, enhancing interpretability.
[CV-181] MM-DINOv2: Adapting Foundation Models for Multi-Modal Medical Image Analysis
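Quick read: This paper addresses the limited effectiveness of vision foundation models pre-trained on natural images (such as DINOv2) for multi-modal medical image analysis, especially under the label scarcity and missing modalities common in clinical practice. Its core solution is the MM-DINOv2 framework, whose key innovations are multi-modal patch embeddings that support cross-modality information fusion, a full-modality masking strategy that makes the model more robust to missing modalities, and semi-supervised learning that exploits large unlabeled datasets. Together these substantially improve performance and reliability on multi-modal medical imaging tasks such as glioma subtype classification.
Link: https://arxiv.org/abs/2509.06617
Authors: Daniel Scholz,Ayhan Can Erdur,Viktoria Ehm,Anke Meyer-Baese,Jan C. Peeken,Daniel Rueckert,Benedikt Wiestler
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: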
Abstract:Vision foundation models like DINOv2 demonstrate remarkable potential in medical imaging despite their origin in natural image domains. However, their design inherently works best for uni-modal image analysis, limiting their effectiveness for multi-modal imaging tasks that are common in many medical fields, such as neurology and oncology. While supervised models perform well in this setting, they fail to leverage unlabeled datasets and struggle with missing modalities, a frequent challenge in clinical settings. To bridge these gaps, we introduce MM-DINOv2, a novel and efficient framework that adapts the pre-trained vision foundation model DINOv2 for multi-modal medical imaging. Our approach incorporates multi-modal patch embeddings, enabling vision foundation models to effectively process multi-modal imaging data. To address missing modalities, we employ full-modality masking, which encourages the model to learn robust cross-modality relationships. Furthermore, we leverage semi-supervised learning to harness large unlabeled datasets, enhancing both the accuracy and reliability of medical predictions. Applied to glioma subtype classification from multi-sequence brain MRI, our method achieves a Matthews Correlation Coefficient (MCC) of 0.6 on an external test set, surpassing state-of-the-art supervised approaches by +11.1%. Our work establishes a scalable and robust solution for multi-modal medical imaging tasks, leveraging powerful vision foundation models pre-trained on natural images while addressing real-world clinical challenges such as missing data and limited annotations.
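The headline metric in this abstract is the Matthews Correlation Coefficient. A minimal sketch of the binary form is given below for orientation; the paper's glioma subtype task may well be multi-class, in which case a generalized MCC would apply, so the two-class version here is an assumption made for brevity.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Binary Matthews Correlation Coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal is empty, a common convention for the
    otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


print(matthews_corrcoef(tp=5, tn=5, fp=0, fn=0))  # perfect agreement
print(matthews_corrcoef(tp=5, tn=5, fp=5, fn=5))  # chance-level agreement
```

Unlike accuracy, MCC stays near zero for a classifier that simply predicts the majority class, which makes it a reasonable choice for imbalanced clinical cohorts.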
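[CV-182] Towards In-Air Ultrasonic QR Codes: Deep Learning for Classification of Passive Reflector Constellations
Quick read: This paper addresses how autonomous systems can increase the information capacity and recognition accuracy of acoustic landmarks in environments where visual sensors fail. Whereas previous methods classify only individual acoustic landmarks, this paper introduces reflector constellations as encoded tags, which significantly increases information entropy. The key to its solution is a multi-label Convolutional Neural Network (CNN) that simultaneously identifies multiple closely spaced reflectors from a single in-air 3D sonar measurement. It also explores adaptive beamforming with null-steering to isolate individual reflectors for single-label classification, offering a feasible path towards acoustic landmark systems with high information capacity and robustness.
Link: https://arxiv.org/abs/2509.06615
Authors: Wouter Jansen,Jan Steckel
Affiliations: FTI Cosys-Lab, University of Antwerp; Flanders Make Strategic Research Centre
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at IEEE IUS 2025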
Abstract:In environments where visual sensors falter, in-air sonar provides a reliable alternative for autonomous systems. While previous research has successfully classified individual acoustic landmarks, this paper takes a step towards increasing information capacity by introducing reflector constellations as encoded tags. Our primary contribution is a multi-label Convolutional Neural Network (CNN) designed to simultaneously identify multiple, closely spaced reflectors from a single in-air 3D sonar measurement. Our initial findings on a small dataset confirm the feasibility of this approach, validating the ability to decode these complex acoustic patterns. Secondly, we investigated using adaptive beamforming with null-steering to isolate individual reflectors for single-label classification. Finally, we discuss the experimental results and limitations, offering key insights and future directions for developing acoustic landmark systems with significantly increased information entropy and their accurate and robust detection and classification.
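To make the multi-label idea in this abstract concrete, the sketch below decodes per-reflector logits independently with a sigmoid, so that several reflectors can be present at once (a softmax would force a single choice). The 0.5 threshold, the function names, and the mapping from presence bits to an integer tag id are illustrative assumptions and are not described in the abstract.

```python
import math


def decode_constellation(logits, threshold=0.5):
    """Turn one logit per reflector position into independent presence bits."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [1 if p >= threshold else 0 for p in probs]


def tag_id(bits):
    """Read the presence bits as a binary number (hypothetical encoding)."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


bits = decode_constellation([2.0, -1.0, 0.3])
print(bits, tag_id(bits))
```

With n reflector positions, a constellation of this kind can in principle distinguish up to 2^n tags, which is the gain in information capacity over classifying a single landmark.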
[CV-183] Contrastive Anatomy-Contrast Disentanglement: A Domain-General MRI Harmonization Method
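Quick read: This paper addresses inconsistent image contrast in multi-site magnetic resonance imaging (MRI) data caused by differences in scanners and acquisition parameters, with the aim of improving comparability and reproducibility across datasets and clinical studies. The key to its solution is a new method based on a conditioned diffusion autoencoder, combined with a contrastive loss and domain-agnostic contrast augmentation, that harmonizes images across scanners without fine-tuning while preserving subject-specific anatomy, and can synthesize brain MRI for a target scanner from a single reference image.
Link: https://arxiv.org/abs/2509.06592
Authors: Daniel Scholz,Ayhan Can Erdur,Robbie Holland,Viktoria Ehm,Jan C. Peeken,Benedikt Wiestler,Daniel Rueckert
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: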
Abstract:Magnetic resonance imaging (MRI) is an invaluable tool for clinical and research applications. Yet, variations in scanners and acquisition parameters cause inconsistencies in image contrast, hindering data comparability and reproducibility across datasets and clinical studies. Existing scanner harmonization methods, designed to address this challenge, face limitations, such as requiring traveling subjects or struggling to generalize to unseen domains. We propose a novel approach using a conditioned diffusion autoencoder with a contrastive loss and domain-agnostic contrast augmentation to harmonize MR images across scanners while preserving subject-specific anatomy. Our method enables brain MRI synthesis from a single reference image. It outperforms baseline techniques, achieving a +7% PSNR improvement on a traveling subjects dataset and +18% improvement on age regression in unseen. Our model provides robust, effective harmonization of brain MRIs to target scanners without requiring fine-tuning. This advancement promises to enhance comparability, reproducibility, and generalizability in multi-site and longitudinal clinical studies, ultimately contributing to improved healthcare outcomes.
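The image-quality figure quoted in this abstract is PSNR. A minimal sketch of the usual definition follows, treating an image as a flat list of intensities; the intensity range `max_value=1.0` is an assumption, and the paper's exact evaluation protocol (masking, normalization) is not given in the abstract.

```python
import math


def psnr(reference, test, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    if len(reference) != len(test):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - x) ** 2 for r, x in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


print(psnr([0.0, 0.0, 0.5], [0.1, 0.1, 0.6]))
```

Because PSNR is logarithmic, a relative improvement of a few percent in dB can correspond to a noticeably lower mean squared error.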
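[CV-184] Impact of Labeling Inaccuracy and Image Noise on Tooth Segmentation in Panoramic Radiographs using Federated, Centralized and Local Learning
Quick read: This paper addresses privacy protection, heterogeneous data quality, and inconsistent labeling in multi-center data collaboration for dental diagnostic artificial intelligence (AI). Its core solution is a federated learning (FL) framework that trains an Attention U-Net in a distributed manner across several institutions, avoiding centralized transfer of raw data and thus improving model performance while preserving data privacy. The key innovation is using per-client training-loss trajectories for anomaly detection, which identifies and excludes faulty or corrupted data sources. Across several data perturbation scenarios (such as missing labels, image noise, and exclusion of an anomalous client), FL is more stable and robust than centralized learning (CL) and local learning (LL), supporting FL as a scalable, privacy-preserving approach to clinical AI deployment.
Link: https://arxiv.org/abs/2509.06553
Authors: Johan Andreas Balle Rubak,Khuram Naveed,Sanyam Jain,Lukas Esterle,Alexandros Iosifidis,Ruben Pauwels
Affiliations: Aarhus University; Tampere University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: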
Abstract:Objectives: Federated learning (FL) may mitigate privacy constraints, heterogeneous data quality, and inconsistent labeling in dental diagnostic AI. We compared FL with centralized (CL) and local learning (LL) for tooth segmentation in panoramic radiographs across multiple data corruption scenarios. Methods: An Attention U-Net was trained on 2066 radiographs from six institutions across four settings: baseline (unaltered data); label manipulation (dilated/missing annotations); image-quality manipulation (additive Gaussian noise); and exclusion of a faulty client with corrupted data. FL was implemented via the Flower AI framework. Per-client training- and validation-loss trajectories were monitored for anomaly detection and a set of metrics (Dice, IoU, HD, HD95 and ASSD) was evaluated on a hold-out test set. From these metrics significance results were reported through Wilcoxon signed-rank test. CL and LL served as comparators. Results: Baseline: FL achieved a median Dice of 0.94889 (ASSD: 1.33229), slightly better than CL at 0.94706 (ASSD: 1.37074) and LL at 0.93557-0.94026 (ASSD: 1.51910-1.69777). Label manipulation: FL maintained the best median Dice score at 0.94884 (ASSD: 1.46487) versus CL’s 0.94183 (ASSD: 1.75738) and LL’s 0.93003-0.94026 (ASSD: 1.51910-2.11462). Image noise: FL led with Dice at 0.94853 (ASSD: 1.31088); CL scored 0.94787 (ASSD: 1.36131); LL ranged from 0.93179-0.94026 (ASSD: 1.51910-1.77350). Faulty-client exclusion: FL reached Dice at 0.94790 (ASSD: 1.33113) better than CL’s 0.94550 (ASSD: 1.39318). Loss-curve monitoring reliably flagged the corrupted site. Conclusions: FL matches or exceeds CL and outperforms LL across corruption scenarios while preserving privacy. Per-client loss trajectories provide an effective anomaly-detection mechanism and support FL as a practical, privacy-preserving approach for scalable clinical AI deployment.
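Two of the overlap metrics reported in this abstract, Dice and IoU, are simple to compute for binary masks. The sketch below is a minimal illustration on flat 0/1 lists; returning 1.0 when both masks are empty is a convention assumed here, not something stated in the abstract.

```python
def dice_and_iou(pred, truth):
    """Return (Dice, IoU) for two equally sized binary masks given as flat lists."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        return 1.0, 1.0  # both masks empty: treated as perfect agreement
    union = total - intersection
    return 2.0 * intersection / total, intersection / union


dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 0, 0])
print(round(dice, 4), round(iou, 4))
```

Dice is always at least as large as IoU for the same pair of masks, so the two should not be compared directly across papers.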
[CV-185] FASL-Seg: Anatomy and Tool Segmentation of Surgical Scenes ECAI
Quick read: This paper addresses the insufficient attention that semantic segmentation models for minimally invasive surgery pay to anatomical objects, and the difficulty existing state-of-the-art (SOTA) models have in balancing high-level contextual features against low-level edge features. The key to its solution is a Feature-Adaptive Spatial Localization model (FASL-Seg), which uses two separate processing streams, a Low-Level Feature Projection (LLFP) stream and a High-Level Feature Projection (HLFP) stream, to capture multi-scale features at different resolutions and thereby segment anatomical tissue and surgical instruments precisely. Experiments show that the method outperforms existing SOTA models in overall performance on the EndoVis18 and EndoVis17 datasets, with consistent results across classes.
Link: https://arxiv.org/abs/2509.06159
Authors: Muraam Abdel-Ghani,Mahmoud Ali,Mohamed Ali,Fatmaelzahraa Ahmed,Mohamed Arsalan,Abdulaziz Al-Ali,Shidin Balakrishnan
Affiliations: Hamad Medical Corporation; Qatar University
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 6 figures, Accepted at the European Conference on Artificial Intelligence (ECAI) 2025. To appear in the conference proceedings
Abstract:The growing popularity of robotic minimally invasive surgeries has made deep learning-based surgical training a key area of research. A thorough understanding of the surgical scene components is crucial, which semantic segmentation models can help achieve. However, most existing work focuses on surgical tools and overlooks anatomical objects. Additionally, current state-of-the-art (SOTA) models struggle to balance capturing high-level contextual features and low-level edge features. We propose a Feature-Adaptive Spatial Localization model (FASL-Seg), designed to capture features at multiple levels of detail through two distinct processing streams, namely a Low-Level Feature Projection (LLFP) and a High-Level Feature Projection (HLFP) stream, for varying feature resolutions - enabling precise segmentation of anatomy and surgical instruments. We evaluated FASL-Seg on surgical segmentation benchmark datasets EndoVis18 and EndoVis17 on three use cases. The FASL-Seg model achieves a mean Intersection over Union (mIoU) of 72.71% on parts and anatomy segmentation in EndoVis18, improving on SOTA by 5%. It further achieves a mIoU of 85.61% and 72.78% in EndoVis18 and EndoVis17 tool type segmentation, respectively, outperforming SOTA overall performance, with comparable per-class SOTA results in both datasets and consistent performance in various classes for anatomy and instruments, demonstrating the effectiveness of distinct processing streams for varying feature resolutions.
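[CV-186] Brain Tumor Detection Through Diverse CNN Architectures in IoT Healthcare Industries: Fast R-CNN, U-Net, Transfer Learning-Based CNN, and Fully Connected CNN
Quick read: This paper addresses accurate diagnosis of brain tumors (glioma, meningioma, and pituitary tumors) in Internet of Things (IoT) healthcare systems, with the aim of improving early identification and treatment. The key to its solution is combining several deep learning architectures, including Region-based Convolutional Neural Networks (R-CNN), UNet, and transfer learning models (such as Inception-V3, EfficientNetB4, and VGG19), to extract features from and classify magnetic resonance imaging (MRI) data. Fast R-CNN performed best in internal validation, with 99% accuracy and a 98.5% F-score. In external-cohort cross-dataset validation, a fine-tuned EfficientNetB2 generalized best, further indicating that AI-driven image analysis can support real-time, personalized brain tumor diagnosis and treatment decisions in IoT settings.
Link: https://arxiv.org/abs/2509.05821
Authors: Mohsen Asghari Ilani,Yaser M. Banad
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: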
Abstract:Artificial intelligence (AI)-powered deep learning has advanced brain tumor diagnosis in Internet of Things (IoT)-healthcare systems, achieving high accuracy with large datasets. Brain health is critical to human life, and accurate diagnosis is essential for effective treatment. Magnetic Resonance Imaging (MRI) provides key data for brain tumor detection, serving as a major source of big data for AI-driven image classification. In this study, we classified glioma, meningioma, and pituitary tumors from MRI images using Region-based Convolutional Neural Network (R-CNN) and UNet architectures. We also applied Convolutional Neural Networks (CNN) and CNN-based transfer learning models such as Inception-V3, EfficientNetB4, and VGG19. Model performance was assessed using F-score, recall, precision, and accuracy. The Fast R-CNN achieved the best results with 99% accuracy, 98.5% F-score, 99.5% Area Under the Curve (AUC), 99.4% recall, and 98.5% precision. Combining R-CNN, UNet, and transfer learning enables earlier diagnosis and more effective treatment in IoT-healthcare systems, improving patient outcomes. IoT devices such as wearable monitors and smart imaging systems continuously collect real-time data, which AI algorithms analyze to provide immediate insights for timely interventions and personalized care. For external cohort cross-dataset validation, EfficientNetB2 achieved the strongest performance among fine-tuned EfficientNet models, with 92.11% precision, 92.11% recall/sensitivity, 95.96% specificity, 92.02% F1-score, and 92.23% accuracy. These findings underscore the robustness and reliability of AI models in handling diverse datasets, reinforcing their potential to enhance brain tumor classification and patient care in IoT healthcare environments.
[CV-187] Stereovision Image Processing for Planetary Navigation Maps with Semi-Global Matching and Superpixel Segmentation
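Quick read: This paper addresses insufficient terrain-modelling accuracy in Mars exploration, especially in difficult scenes with low-texture images, occlusion, and repetitive patterns, where traditional local block-matching relies on a limited pixel neighbourhood and so loses detail and produces artefacts. The key to its solution is superpixel-based refinement of Semi-Global Matching (SGM): adding context-aware segmentation on top of SGM makes depth inference more coherent and accurate, which reduces gaps in the raw disparity maps and captures small-scale terrain features (such as rock edges) more faithfully, ultimately yielding higher-fidelity 2D terrain maps better suited to autonomous navigation.
Link: https://arxiv.org/abs/2509.05645
Authors: Yan-Shan Lu,Miguel Arana-Catania,Saurabh Upadhyay,Leonard Felicetti
Affiliations: Unknown
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Earth and Planetary Astrophysics (astro-ph.EP); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 8 pages, 6 figures, 2 tables. ESA ASTRA 2025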
Abstract:Mars exploration requires precise and reliable terrain models to ensure safe rover navigation across its unpredictable and often hazardous landscapes. Stereoscopic vision serves a critical role in the rover’s perception, allowing scene reconstruction by generating precise depth maps through stereo matching. State-of-the-art Martian planetary exploration uses traditional local block-matching, aggregates cost over square windows, and refines disparities via smoothness constraints. However, this method often struggles with low-texture images, occlusion, and repetitive patterns because it considers only limited neighbouring pixels and lacks a wider understanding of scene context. This paper uses Semi-Global Matching (SGM) with superpixel-based refinement to mitigate the inherent block artefacts and recover lost details. The approach balances the efficiency and accuracy of SGM and adds context-aware segmentation to support more coherent depth inference. The proposed method has been evaluated in three datasets with successful results: In a Mars analogue, the terrain maps obtained show improved structural consistency, particularly in sloped or occlusion-prone regions. Large gaps behind rocks, which are common in raw disparity outputs, are reduced, and surface details like small rocks and edges are captured more accurately. Another two datasets, evaluated to test the method’s general robustness and adaptability, show more precise disparity maps and more consistent terrain models, better suited for the demands of autonomous navigation on Mars, and competitive accuracy across both non-occluded and full-image error metrics. This paper outlines the entire terrain modelling process, from finding corresponding features to generating the final 2D navigation maps, offering a complete pipeline suitable for integration in future planetary exploration missions.
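The step from disparity maps to terrain depth that this abstract relies on follows the standard rectified-stereo relation Z = f * B / d. The sketch below applies it to one row of disparities; the focal length and baseline values are made-up illustration values, not rover parameters, and invalid matches are kept as `None` to mirror the gaps the paper's refinement aims to reduce.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in metres for a rectified stereo pair; None marks an invalid match."""
    if disparity_px <= 0:
        return None
    return focal_px * baseline_m / disparity_px


# One image row of disparities in pixels; 0.0 stands for a failed match.
row = [16.0, 8.0, 0.0, 4.0]
depths = [depth_from_disparity(d, focal_px=800.0, baseline_m=0.1) for d in row]
print(depths)
```

Because depth is inversely proportional to disparity, a one-pixel matching error matters far more for distant terrain than for nearby rocks, which is why disparity quality dominates the accuracy of the final map.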
[CV-188] A Synthetic-to-Real Dehazing Method based on Domain Unification ICME2025
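Quick read: This paper addresses the performance drop that deep learning methods suffer on real-world hazy images because of distribution shift. The core problem is the domain gap between synthetic and real data: because of scene complexity and depth effects, clean images cannot be collected under ideal conditions, which makes the atmospheric physics model in the real domain inconsistent with that in the synthetic domain. The key to its solution is a synthetic-to-real dehazing method based on domain unification, which unifies the relationship between the real and synthetic domains so that the dehazing model better matches real conditions, substantially improving dehazing of real-world images.
Link: https://arxiv.org/abs/2509.05374
Authors: Zhiqiang Yuan,Jinchao Zhang,Jie Zhou
Affiliations: Tencent
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICME 2025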
Abstract:Due to distribution shift, the performance of deep learning-based methods for image dehazing is adversely affected when applied to real-world hazy images. In this paper, we find that such deviation in the dehazing task between real and synthetic domains may come from the imperfect collection of clean data. Owing to the complexity of the scene and the effect of depth, the collected clean data cannot strictly meet the ideal conditions, which makes the atmospheric physics model in the real domain inconsistent with that in the synthetic domain. For this reason, we propose a synthetic-to-real dehazing method based on domain unification, which attempts to unify the relationship between the real and synthetic domains so that the dehazing model is more in line with the actual situation. Extensive experiments qualitatively and quantitatively demonstrate that the proposed dehazing method significantly outperforms state-of-the-art methods on real-world images.
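The "atmospheric physics model" in this abstract is not spelled out. A common choice in the dehazing literature is the atmospheric scattering model I = J * t + A * (1 - t) with transmission t = exp(-beta * d), and the sketch below assumes that form for a single pixel. All names and parameter values are illustrative; whether the paper uses exactly this formulation is an assumption.

```python
import math


def transmission(depth, beta):
    """Fraction of scene radiance that survives scattering over a given depth."""
    return math.exp(-beta * depth)


def add_haze(clean, depth, airlight, beta):
    """Synthesize a hazy pixel from a clean pixel under the assumed model."""
    t = transmission(depth, beta)
    return clean * t + airlight * (1.0 - t)


def remove_haze(hazy, depth, airlight, beta):
    """Invert the model when depth, airlight, and beta are known exactly."""
    t = transmission(depth, beta)
    return (hazy - airlight * (1.0 - t)) / t


hazy = add_haze(clean=0.3, depth=12.0, airlight=0.9, beta=0.1)
print(round(hazy, 4), round(remove_haze(hazy, 12.0, 0.9, 0.1), 4))
```

The inversion is exact only when the "clean" image is truly haze-free; if the collected clean data already contains residual haze, as the abstract argues, the synthetic and real domains no longer follow the same model.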
[CV-189] Layer-Wise Anomaly Detection in Directed Energy Deposition using High-Fidelity Fringe Projection Profilometry
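Quick read: This paper addresses defects caused by process disturbances in directed energy deposition (DED), such as geometric deviations, lack of fusion, and poor surface finish, which seriously degrade part quality and reliability. The key to its solution is a build-height-synchronized fringe projection system that performs in-situ, layer-wise surface reconstruction of laser-DED components with an accuracy of ±46 μm. From the reconstructed 3D point cloud it derives two complementary geometric metrics: local point density, which highlights regions of poor surface finish, and normal-change rate, which detects lack-of-fusion features. Together these identify common deposition anomalies automatically without manual annotation, link defect localization directly to closed-loop process control, and provide high-precision geometric sensing for certifiable additive manufacturing.
Link: https://arxiv.org/abs/2509.05327
Authors: Guanzhong Hu,Wenpan Li,Rujing Zha,Ping Guo
Affiliations: Northwestern University
Categories: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages, 15 figures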
Abstract:Directed energy deposition (DED), a metal additive manufacturing process, is highly susceptible to process-induced defects such as geometric deviations, lack of fusion, and poor surface finish. This work presents a build-height-synchronized fringe projection system for in-situ, layer-wise surface reconstruction of laser-DED components, achieving a reconstruction accuracy of ±46 μm. From the reconstructed 3D morphology, two complementary geometry-based point cloud metrics are introduced: local point density, which highlights poor surface finish, and normal-change rate, which identifies lack-of-fusion features. These methods enable automated, annotation-free identification of common deposition anomalies directly from reconstructed surfaces, without the need for manual labeling. By directly linking geometric deviation to defect formation, the approach enables precise anomaly localization and advances the feasibility of closed-loop process control. This work establishes fringe projection as a practical tool for micrometer-scale monitoring in DED, bridging the gap between process signatures and part geometry for certifiable additive manufacturing.
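One of the two point cloud metrics in this abstract, local point density, can be sketched as a neighbor count within a fixed radius. The radius-based definition, the brute-force search, and the name `local_point_density` are assumptions made for illustration; the abstract does not say how the paper defines or computes the metric.

```python
def local_point_density(points, radius):
    """Count, for each 3D point, how many other points lie within `radius`.

    Brute-force O(n^2) search, adequate for a small illustration; a spatial
    index such as a k-d tree would be the usual choice for real scans.
    """
    r2 = radius * radius
    counts = []
    for i, p in enumerate(points):
        neighbors = 0
        for j, q in enumerate(points):
            if i != j and sum((a - b) ** 2 for a, b in zip(p, q)) <= r2:
                neighbors += 1
        counts.append(neighbors)
    return counts


cloud = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (5.0, 5.0, 5.0)]
print(local_point_density(cloud, radius=0.5))
```

Points with unusually low counts mark sparsely reconstructed regions, which is the kind of signal the abstract associates with poor surface finish.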
Artificial Intelligence
[AI-0] Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference
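Quick read: This paper addresses two key problems in aligning diffusion models with human preferences. First, reward scoring relies on multistep denoising with gradient computation, which is computationally expensive and restricts optimization to a few diffusion steps. Second, achieving the desired aesthetic quality (such as photorealism or precise lighting effects) usually requires continuous offline fine-tuning of the reward model. The key to its solution is two methods. The first, Direct-Align, predefines a noise prior so that images can be recovered by interpolation from any time step, exploiting the fact that diffusion states are interpolations between noise and target images, which effectively avoids over-optimization at late timesteps. The second, Semantic Relative Preference Optimization (SRPO), formulates rewards as text-conditioned signals and supports online reward adjustment in response to positive and negative prompt augmentation, reducing reliance on offline reward fine-tuning. Combining the two improves human-evaluated realism and aesthetic quality by more than 3x.
Link: https://arxiv.org/abs/2509.06942
Authors: Xiangwei Shen,Zhimin Li,Zhantao Yang,Shiyi Zhang,Yingfang Zhang,Donghao Li,Chunyu Wang,Qinglin Lu,Yansong Tang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 15 pages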
Abstract:Recent studies have demonstrated the effectiveness of directly aligning diffusion models with human preferences using differentiable reward. However, they exhibit two primary challenges: (1) they rely on multistep denoising with gradient computation for reward scoring, which is computationally expensive, thus restricting optimization to only a few diffusion steps; (2) they often need continuous offline adaptation of reward models in order to achieve desired aesthetic quality, such as photorealism or precise lighting effects. To address the limitation of multistep denoising, we propose Direct-Align, a method that predefines a noise prior to effectively recover original images from any time step via interpolation, leveraging the equation that diffusion states are interpolations between noise and target images, which effectively avoids over-optimization in late timesteps. Furthermore, we introduce Semantic Relative Preference Optimization (SRPO), in which rewards are formulated as text-conditioned signals. This approach enables online adjustment of rewards in response to positive and negative prompt augmentation, thereby reducing the reliance on offline reward fine-tuning. By fine-tuning the FLUX.1.dev model with optimized denoising and online reward adjustment, we improve its human-evaluated realism and aesthetic quality by over 3x.
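The interpolation property that Direct-Align leans on can be shown with a toy example. The sketch assumes the simple linear form x_t = (1 - t) * x_0 + t * noise, as used in rectified-flow-style models; the abstract only says that diffusion states are interpolations between noise and target images, so the exact coefficients, the function names, and the use of plain lists for images are assumptions.

```python
def noisy_state(x0, noise, t):
    """Interpolate between a clean image x0 and a noise sample at time t in [0, 1)."""
    return [(1.0 - t) * a + t * e for a, e in zip(x0, noise)]


def recover_image(xt, noise, t):
    """Invert the interpolation in closed form when the injected noise is known."""
    if not 0.0 <= t < 1.0:
        raise ValueError("t must lie in [0, 1)")
    return [(x - t * e) / (1.0 - t) for x, e in zip(xt, noise)]


x0 = [0.2, 0.8, -0.5]
noise = [1.0, -1.0, 0.3]   # the predefined noise, known in advance
xt = noisy_state(x0, noise, t=0.9)
print([round(v, 6) for v in recover_image(xt, noise, t=0.9)])
```

Because the noise is fixed in advance, the clean image can be recovered in a single step even from a heavily noised state, which is what allows reward gradients to be computed at any timestep without a full multistep denoising chain.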
[AI-1] From Noise to Narrative: Tracing the Origins of Hallucinations in Transformers
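Quick read: This paper addresses hallucination in pre-trained transformer models under input uncertainty, a phenomenon that seriously hinders the trustworthy deployment of generative AI in high-stakes settings. The key to its solution is using sparse autoencoders to analyze concept representations in the model's intermediate-layer activations, showing that as the input becomes increasingly unstructured the model activates semantically coherent concepts that are insensitive to the input, which leads to hallucinated output. The study further shows that hallucination in model output can be reliably predicted from the concept patterns embedded in transformer layer activations, providing a new basis and quantitative tool for AI alignment, safety, and robustness to attack.
Link: https://arxiv.org/abs/2509.06938
Authors: Praneet Suresh,Jack Stanley,Sonia Joseph,Luca Scimeca,Danilo Bzdok
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: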
Abstract:As generative AI systems become competent and democratized in science, business, and government, deeper insight into their failure modes now poses an acute need. The occasional volatility in their behavior, such as the propensity of transformer models to hallucinate, impedes trust and adoption of emerging AI solutions in high-stakes areas. In the present work, we establish how and when hallucinations arise in pre-trained transformer models through concept representations captured by sparse autoencoders, under scenarios with experimentally controlled uncertainty in the input space. Our systematic experiments reveal that the number of semantic concepts used by the transformer model grows as the input information becomes increasingly unstructured. In the face of growing uncertainty in the input space, the transformer model becomes prone to activate coherent yet input-insensitive semantic features, leading to hallucinated output. At its extreme, for pure-noise inputs, we identify a wide variety of robustly triggered and meaningful concepts in the intermediate activations of pre-trained transformer models, whose functional integrity we confirm through targeted steering. We also show that hallucinations in the output of a transformer model can be reliably predicted from the concept patterns embedded in transformer layer activations. This collection of insights on transformer internal processing mechanics has immediate consequences for aligning AI models with human values, AI safety, opening the attack surface for potential adversarial attacks, and providing a basis for automatic quantification of a model’s hallucination risk.
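[AI-2] Neuro-Symbolic AI for Cybersecurity: State of the Art, Challenges, and Opportunities
Quick read: This paper addresses three fundamental limitations of traditional artificial intelligence (AI) in cybersecurity: weak conceptual grounding that leaves systems non-robust against novel attacks, limited instructibility that impedes analyst-guided adaptation, and misalignment with cybersecurity objectives. The key to its solution is the Neuro-Symbolic (NeSy) AI paradigm, which combines the pattern recognition of neural networks with the interpretability and logical rigor of symbolic reasoning to enable deeper threat understanding and proactive defense. The survey further proposes a Grounding-Instructibility-Alignment (G-I-A) framework for systematically evaluating such systems, identifies the integration of causal reasoning as the central advance that moves defense from correlation-based analysis towards proactive strategies, and highlights the dual-use risk posed by autonomous offensive capabilities, stressing the need for community-driven standards and responsible development practices so that progress serves defensive cybersecurity objectives and remains aligned with society.
Link: https://arxiv.org/abs/2509.06921
Authors: Safayat Bin Hakim,Muhammad Adil,Alvaro Velasquez,Shouhuai Xu,Houbing Herbert Song
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: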
Abstract:Traditional Artificial Intelligence (AI) approaches in cybersecurity exhibit fundamental limitations: inadequate conceptual grounding leading to non-robustness against novel attacks; limited instructibility impeding analyst-guided adaptation; and misalignment with cybersecurity objectives. Neuro-Symbolic (NeSy) AI has emerged with the potential to revolutionize cybersecurity AI. However, there is no systematic understanding of this emerging approach. These hybrid systems address critical cybersecurity challenges by combining neural pattern recognition with symbolic reasoning, enabling enhanced threat understanding while introducing concerning autonomous offensive capabilities that reshape threat landscapes. In this survey, we systematically characterize this field by analyzing 127 publications spanning 2019-July 2025. We introduce a Grounding-Instructibility-Alignment (G-I-A) framework to evaluate these systems, focusing on both cyber defense and cyber offense across network security, malware analysis, and cyber operations. Our analysis shows advantages of multi-agent NeSy architectures and identifies critical implementation challenges including standardization gaps, computational complexity, and human-AI collaboration requirements that constrain deployment. We show that causal reasoning integration is the most transformative advancement, enabling proactive defense beyond correlation-based approaches. Our findings highlight dual-use implications where autonomous systems demonstrate substantial capabilities in zero-day exploitation while achieving significant cost reductions, altering threat dynamics. We provide insights and future research directions, emphasizing the urgent need for community-driven standardization frameworks and responsible development practices that ensure advancement serves defensive cybersecurity objectives while maintaining societal alignment.
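[AI-3] Tackling the Noisy Elephant in the Room: Label Noise-robust Out-of-Distribution Detection via Loss Correction and Low-rank Decomposition
[Quick read]: This paper addresses the marked performance drop of conventional Out-of-Distribution (OOD) detection methods when training labels are noisy. Existing studies show that label noise severely weakens a model's OOD detection ability, yet principled solutions are still lacking. Its key contribution is a robust OOD detection framework that combines loss correction techniques from the noisy-label learning literature with low-rank and sparse decomposition methods from signal processing, separating out the noise and improving OOD detection accuracy under noisy labels.
Link: https://arxiv.org/abs/2509.06918
Authors: Tarhib Al Azad,Shahana Ibrahim
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: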
Abstract:Robust out-of-distribution (OOD) detection is an indispensable component of modern artificial intelligence (AI) systems, especially in safety-critical applications where models must identify inputs from unfamiliar classes not seen during training. While OOD detection has been extensively studied in the machine learning literature–with both post hoc and training-based approaches–its effectiveness under noisy training labels remains underexplored. Recent studies suggest that label noise can significantly degrade OOD performance, yet principled solutions to this issue are lacking. In this work, we demonstrate that directly combining existing label noise-robust methods with OOD detection strategies is insufficient to address this critical challenge. To overcome this, we propose a robust OOD detection framework that integrates loss correction techniques from the noisy label learning literature with low-rank and sparse decomposition methods from signal processing. Extensive experiments on both synthetic and real-world datasets demonstrate that our method significantly outperforms the state-of-the-art OOD detection techniques, particularly under severe noisy label settings.
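One widely used loss correction technique from the noisy-label literature is forward correction, which pushes the model's clean-class probabilities through a label-noise transition matrix before taking the negative log-likelihood. The sketch below illustrates only that general idea. The abstract does not say which correction the authors use, and the function name, the transition matrix `T`, and the example numbers are assumptions made for illustration.

```python
import math

def forward_corrected_nll(clean_probs, noisy_label, T):
    """Negative log-likelihood of a noisy label under forward correction.

    clean_probs[i] is the model's probability for clean class i.
    T[i][j] is the assumed probability that clean class i is labelled as j.
    """
    # Probability of observing the noisy label: sum_i p(i) * T[i][j]
    noisy_prob = sum(p * T[i][noisy_label] for i, p in enumerate(clean_probs))
    return -math.log(noisy_prob)

probs = [0.7, 0.2, 0.1]  # hypothetical softmax output
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
noisy = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]

plain_loss = forward_corrected_nll(probs, 1, identity)   # ordinary cross-entropy
corrected_loss = forward_corrected_nll(probs, 1, noisy)  # accounts for label flips
```

With an identity transition matrix the corrected loss reduces to ordinary cross-entropy, which is a quick sanity check for any implementation of this kind.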
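[AI-4] AxelSMOTE: An Agent-Based Oversampling Algorithm for Imbalanced Classification
[Quick read]: This paper addresses degraded minority-class performance caused by class imbalance in machine learning, and in particular the shortcomings of traditional oversampling techniques: treating features independently, lacking similarity-based controls, limiting sample diversity, and managing synthetic variety poorly. The core of its solution is AxelSMOTE, a new agent-based oversampling method whose key innovations are trait-based feature grouping to preserve feature correlations, a similarity-based probabilistic exchange mechanism for meaningful data generation, Beta distribution blending for realistic interpolation, and controlled diversity injection to avoid overfitting. On eight imbalanced datasets the method outperforms existing state-of-the-art sampling methods while remaining computationally efficient.
Link: https://arxiv.org/abs/2509.06875
Authors: Sukumar Kishanthan,Asela Hevapathige
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: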
Abstract:Class imbalance in machine learning poses a significant challenge, as skewed datasets often hinder performance on minority classes. Traditional oversampling techniques, which are commonly used to alleviate class imbalance, have several drawbacks: they treat features independently, lack similarity-based controls, limit sample diversity, and fail to manage synthetic variety effectively. To overcome these issues, we introduce AxelSMOTE, an innovative agent-based approach that views data instances as autonomous agents engaging in complex interactions. Based on Axelrod’s cultural dissemination model, AxelSMOTE implements four key innovations: (1) trait-based feature grouping to preserve correlations; (2) a similarity-based probabilistic exchange mechanism for meaningful interactions; (3) Beta distribution blending for realistic interpolation; and (4) controlled diversity injection to avoid overfitting. Experiments on eight imbalanced datasets demonstrate that AxelSMOTE outperforms state-of-the-art sampling methods while maintaining computational efficiency.
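To make the "Beta distribution blending" step more concrete, here is a minimal sketch of interpolating between two minority samples with a Beta-distributed weight instead of the uniform weight used by classic SMOTE. It is not the AxelSMOTE algorithm itself: trait grouping, the similarity-based exchange, and diversity injection are omitted, and the function name and the `alpha`/`beta` values are assumptions.

```python
import random

def beta_blend(sample, neighbour, alpha=2.0, beta=2.0, rng=random):
    """Interpolate between two minority samples with a Beta-distributed weight."""
    lam = rng.betavariate(alpha, beta)  # blending weight in [0, 1]
    return [a + lam * (b - a) for a, b in zip(sample, neighbour)]

rng = random.Random(0)        # seeded for reproducibility
x = [1.0, 5.0]                # hypothetical minority sample
x_neighbour = [3.0, 1.0]      # hypothetical neighbouring minority sample
synthetic = beta_blend(x, x_neighbour, rng=rng)
```

With `alpha = beta = 2` the weight concentrates around 0.5, so synthetic points tend to fall near the middle of the segment rather than uniformly along it.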
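[AI-5] floq: Training Critics via Flow-Matching for Scaling Compute in Value-Based RL
[Quick read]: This paper addresses a limitation of Temporal Difference (TD) methods in reinforcement learning regarding value function representation: conventional TD methods typically model the Q-function with a monolithic architecture, which makes it hard to control and scale model capacity flexibly. The key to its solution is floq (flow-matching Q-functions), which parameterizes the Q-function with a velocity field trained via flow-matching and uses the number of numerical integration steps to adjust computational granularity and capacity. The method embeds the TD-learning objective into the training of the velocity field and bootstraps from values obtained by multi-step numerical integration of a target velocity field. This raises the expressiveness and performance of the Q-function without increasing network depth or width, with an average improvement of nearly 1.8x on offline reinforcement learning and online fine-tuning tasks.
Link: https://arxiv.org/abs/2509.06863
Authors: Bhavya Agrawalla,Michal Nauman,Khush Agarwal,Aviral Kumar
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: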
Abstract:A hallmark of modern large-scale machine learning techniques is the use of training objectives that provide dense supervision to intermediate computations, such as teacher forcing the next token in language models or denoising step-by-step in diffusion models. This enables models to learn complex functions in a generalizable manner. Motivated by this observation, we investigate the benefits of iterative computation for temporal difference (TD) methods in reinforcement learning (RL). Typically they represent value functions in a monolithic fashion, without iterative compute. We introduce floq (flow-matching Q-functions), an approach that parameterizes the Q-function using a velocity field and trains it using techniques from flow-matching, typically used in generative modeling. This velocity field underneath the flow is trained using a TD-learning objective, which bootstraps from values produced by a target velocity field, computed by running multiple steps of numerical integration. Crucially, floq allows for more fine-grained control and scaling of the Q-function capacity than monolithic architectures, by appropriately setting the number of integration steps. Across a suite of challenging offline RL benchmarks and online fine-tuning tasks, floq improves performance by nearly 1.8x. floq scales capacity far better than standard TD-learning architectures, highlighting the potential of iterative computation for value learning.
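The role of the integration step count can be illustrated with a plain Euler integrator over a velocity field. This is only a sketch of the numerical-integration idea. In floq the velocity field would be a learned network conditioned on state and action, whereas `toy_velocity` below is a hypothetical constant stand-in, and all names here are assumptions.

```python
def integrate_velocity(velocity, z0, num_steps):
    """Euler-integrate dz/dt = velocity(z, t) from t = 0 to t = 1."""
    z = z0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        z = z + dt * velocity(z, k * dt)
    return z

# Toy stand-in for a learned velocity field: a constant velocity
# that moves the starting point by 3.0 over unit time.
def toy_velocity(z, t):
    return 3.0

q_coarse = integrate_velocity(toy_velocity, 0.0, num_steps=1)
q_fine = integrate_velocity(toy_velocity, 0.0, num_steps=8)
```

For this constant field the step count makes no difference. With a learned, state-dependent field, more steps mean more evaluations of the network per value estimate, which is the compute knob the abstract refers to.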
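[AI-6] Reinforcement learning meets bioprocess control through behaviour cloning: Real-world deployment in an industrial photobioreactor
[Quick read]: This paper addresses the difficulty of pH regulation in open Photobioreactors (PBRs), which stems from the inherent complexity of the cells and from fluctuations in the external environment; such systems are strongly nonlinear and disturbance-prone. The key to its solution is a hybrid control strategy combining Reinforcement Learning (RL) with Behavior Cloning (BC). In an offline stage, the RL agent is trained on trajectories generated by a classical Proportional-Integral-Derivative (PID) controller, without interacting with the real system. Daily online fine-tuning then adapts the policy to evolving process dynamics and strengthens rejection of fast, transient disturbances. The method improves control accuracy (IAE reduced by 8% relative to PID) and energy efficiency (control effort reduced by 54% relative to PID), and an 8-day experiment confirmed its robustness and reliability, offering a new paradigm for applying RL to nonlinear, disturbance-sensitive bioprocess control.
Link: https://arxiv.org/abs/2509.06853
Authors: Juan D. Gil,Ehecatl Antonio Del Rio Chanona,José L. Guzmán,Manuel Berenguel
Affiliations: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: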
Abstract:The inherent complexity of living cells as production units creates major challenges for maintaining stable and optimal bioprocess conditions, especially in open Photobioreactors (PBRs) exposed to fluctuating environments. To address this, we propose a Reinforcement Learning (RL) control approach, combined with Behavior Cloning (BC), for pH regulation in open PBR systems. This represents, to the best of our knowledge, the first application of an RL-based control strategy to such a nonlinear and disturbance-prone bioprocess. Our method begins with an offline training stage in which the RL agent learns from trajectories generated by a nominal Proportional-Integral-Derivative (PID) controller, without direct interaction with the real system. This is followed by a daily online fine-tuning phase, enabling adaptation to evolving process dynamics and stronger rejection of fast, transient disturbances. This hybrid offline-online strategy allows deployment of an adaptive control policy capable of handling the inherent nonlinearities and external perturbations in open PBRs. Simulation studies highlight the advantages of our method: the Integral of Absolute Error (IAE) was reduced by 8% compared to PID control and by 5% relative to standard off-policy RL. Moreover, control effort decreased substantially, by 54% compared to PID and 7% compared to standard RL, which is an important factor for minimizing operational costs. Finally, an 8-day experimental validation under varying environmental conditions confirmed the robustness and reliability of the proposed approach. Overall, this work demonstrates the potential of RL-based methods for bioprocess control and paves the way for their broader application to other nonlinear, disturbance-prone systems.
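For readers unfamiliar with the metric, the Integral of Absolute Error can be approximated from sampled data as a sum of absolute tracking errors times the sampling interval. The sketch below shows that calculation and a relative-reduction helper. The pH traces are hypothetical and are not data from the paper, so the resulting percentage does not correspond to the reported 8%.

```python
def integral_absolute_error(setpoint, measurements, dt):
    """Rectangle-rule approximation of the integral of |setpoint - y(t)| dt."""
    return sum(abs(setpoint - y) for y in measurements) * dt

def relative_reduction(baseline, candidate):
    """Fractional reduction of candidate relative to baseline."""
    return (baseline - candidate) / baseline

ph_pid = [8.0, 8.4, 8.2, 8.0]  # hypothetical pH trace under PID
ph_rl = [8.0, 8.3, 8.1, 8.0]   # hypothetical pH trace under an RL policy
iae_pid = integral_absolute_error(8.0, ph_pid, dt=1.0)
iae_rl = integral_absolute_error(8.0, ph_rl, dt=1.0)
reduction = relative_reduction(iae_pid, iae_rl)
```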
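[AI-7] Another Turn, Better Output? A Turn-Wise Analysis of Iterative LLM Prompting
[Quick read]: This paper addresses the lack of a clear way to judge, in multi-turn iterative workflows, when iteration improves results and when it harms performance. The key to its solution is an evaluation framework for iterative refinement spanning ideation, code writing, and mathematical reasoning. It runs controlled 12-turn conversations per task, uses prompting strategies ranging from vague feedback (such as "improve it") to targeted steering, and logs the output of every turn. Outcomes are scored with domain-appropriate checks (unit tests for code; answer equivalence and reasoning soundness for math; originality and feasibility for ideation), and turn-level behavior is tracked with three families of metrics (semantic movement across turns, turn-to-turn change, and output size growth). This makes iteration measurable and comparable across models, and indicates when to steer, stop, or switch strategies in each domain to get the most out of iteration.
Link: https://arxiv.org/abs/2509.06770
Authors: Shashidhar Reddy Javaji,Bhavul Gauri,Zining Zhu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: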
Abstract:Large language models (LLMs) are now used in multi-turn workflows, but we still lack a clear way to measure when iteration helps and when it hurts. We present an evaluation framework for iterative refinement that spans ideation, code, and math. Our protocol runs controlled 12-turn conversations per task, utilizing a variety of prompts ranging from vague “improve it” feedback to targeted steering, and logs per-turn outputs. We score outcomes with domain-appropriate checks (unit tests for code; answer-equivalence plus reasoning-soundness for math; originality and feasibility for ideation) and track turn-level behavior with three families of metrics: semantic movement across turns, turn-to-turn change, and output size growth. Across models and tasks, gains are domain-dependent: they arrive early in ideas and code, but in math late turns matter when guided by elaboration. After the first few turns, vague feedback often plateaus or reverses correctness, while targeted prompts reliably shift the intended quality axis (novelty vs. feasibility in ideation; speed vs. readability in code; in math, elaboration outperforms exploration and drives late-turn gains). We also observe consistent domain patterns: ideation moves more in meaning across turns, code tends to grow in size with little semantic change, and math starts fixed but can break that path with late, elaborative this http URL, the framework and metrics make iteration measurable and comparable across models, and signal when to steer, stop, or switch strategies.
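[AI-8] Aligning Large Vision-Language Models by Deep Reinforcement Learning and Direct Preference Optimization
[Quick read]: This paper addresses the challenges that Large Vision-Language Models (LVLMs) face in aligning with human values and performing specific tasks, in particular how to improve behavioral consistency and performance through efficient, reliable fine-tuning strategies. The key to its solution is two paradigms, Deep Reinforcement Learning (DRL) and Direct Preference Optimization (DPO). DRL optimizes model behavior using reward signals, reducing reliance on supervised preference data. DPO aligns the policy directly with human preferences without building an explicit reward model, which simplifies training and strengthens alignment. Together, the two approaches advance the adaptability, robustness, and human alignment of LVLMs in multimodal interaction.
Link: https://arxiv.org/abs/2509.06759
Authors: Thanh Thi Nguyen,Campbell Wilson,Janis Dalins
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted for publication in the Proceedings of the 8th International Conference on Algorithms, Computing and Artificial Intelligence (ACAI 2025)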
Abstract:Large Vision-Language Models (LVLMs) or multimodal large language models represent a significant advancement in artificial intelligence, enabling systems to understand and generate content across both visual and textual modalities. While large-scale pretraining has driven substantial progress, fine-tuning these models for aligning with human values or engaging in specific tasks or behaviors remains a critical challenge. Deep Reinforcement Learning (DRL) and Direct Preference Optimization (DPO) offer promising frameworks for this aligning process. While DRL enables models to optimize actions using reward signals instead of relying solely on supervised preference data, DPO directly aligns the policy with preferences, eliminating the need for an explicit reward model. This overview explores paradigms for fine-tuning LVLMs, highlighting how DRL and DPO techniques can be used to align models with human preferences and values, improve task performance, and enable adaptive multimodal interaction. We categorize key approaches, examine sources of preference data, reward signals, and discuss open challenges such as scalability, sample efficiency, continual learning, generalization, and safety. The goal is to provide a clear understanding of how DRL and DPO contribute to the evolution of robust and human-aligned LVLMs.
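The statement that DPO needs no explicit reward model is easiest to see from its loss, which depends only on log-probabilities of a preferred and a rejected response under the policy and a frozen reference model. Below is a minimal single-pair sketch of the standard DPO objective. The log-probability values and the `beta` setting are hypothetical, and a real LVLM pipeline would compute them from model outputs over image-text inputs.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)), written as log(1 + exp(-beta * margin))
    return math.log1p(math.exp(-beta * margin))

# Policy identical to the reference: the margin is zero.
loss_at_reference = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has raised the chosen response and lowered the rejected one.
loss_after_update = dpo_loss(-10.0, -17.0, -12.0, -15.0)
```

When the policy equals the reference the loss is log 2, and it falls as the policy assigns relatively more probability to the preferred response.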
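[AI-9] Long-Range Graph Wavelet Networks
[Quick read]: This paper addresses the difficulty of modeling long-range interactions in graph neural networks, that is, how to capture complex dependencies between distant nodes in a graph. Existing neural networks based on graph wavelets rely on finite-order polynomial approximations to build their filters, which limits the receptive field and hinders long-range information propagation. The key to its solution is a new architecture, Long-Range Graph Wavelet Networks (LR-GWN), whose core idea is to decompose wavelet filters into complementary local and global components: local aggregation is handled by efficient low-order polynomials, while long-range interactions are modeled through a flexible spectral-domain parameterization. This hybrid design handles short- and long-range information flow jointly within a unified graph wavelet framework, markedly improves the modeling of long-range dependencies, and outperforms existing wavelet-based methods on several benchmarks.
Link: https://arxiv.org/abs/2509.06743
Authors: Filippo Guerranti,Fabrizio Forte,Simon Geisler,Stephan Günnemann
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: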
Abstract:Modeling long-range interactions, the propagation of information across distant parts of a graph, is a central challenge in graph machine learning. Graph wavelets, inspired by multi-resolution signal processing, provide a principled way to capture both local and global structures. However, existing wavelet-based graph neural networks rely on finite-order polynomial approximations, which limit their receptive fields and hinder long-range propagation. We propose Long-Range Graph Wavelet Networks (LR-GWN), which decompose wavelet filters into complementary local and global components. Local aggregation is handled with efficient low-order polynomials, while long-range interactions are captured through a flexible spectral domain parameterization. This hybrid design unifies short- and long-distance information flow within a principled wavelet framework. Experiments show that LR-GWN achieves state-of-the-art performance among wavelet-based methods on long-range benchmarks, while remaining competitive on short-range datasets.
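The receptive-field limitation of polynomial filters can be checked directly: an order-K polynomial in the graph Laplacian applied to a signal concentrated on one node cannot reach nodes more than K hops away. The sketch below demonstrates this on a small path graph. It illustrates only the limitation that motivates LR-GWN, not the LR-GWN architecture, and the graph, coefficients, and function names are assumptions.

```python
def laplacian_apply(x):
    """Apply the combinatorial Laplacian of a path graph to a node signal x."""
    n = len(x)
    out = []
    for i in range(n):
        degree = (i > 0) + (i < n - 1)
        neighbours = (x[i - 1] if i > 0 else 0.0) + (x[i + 1] if i < n - 1 else 0.0)
        out.append(degree * x[i] - neighbours)
    return out

def polynomial_filter(x, coeffs):
    """Compute sum_k coeffs[k] * L^k x for the path-graph Laplacian L."""
    result = [0.0] * len(x)
    power = list(x)  # L^0 x
    for c in coeffs:
        result = [r + c * p for r, p in zip(result, power)]
        power = laplacian_apply(power)
    return result

delta = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]                 # impulse at node 0
filtered = polynomial_filter(delta, [0.5, 0.3, 0.2])   # order-2 polynomial
```

Nodes 3, 4, and 5 are more than two hops from node 0, so their filtered values stay at zero regardless of the coefficients chosen.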
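[AI-10] Probabilistic Modeling of Latent Agentic Substructures in Deep Neural Networks
[Quick read]: This paper addresses how to build, from a probabilistic modeling perspective, a theoretical framework for intelligent agency that explains how subagents combine into coherent higher-level agents and that gives a mathematical account of alignment phenomena in large language models (LLMs). Its core solution models agents as outcome distributions with the log score as epistemic utility, and defines agent composition through weighted logarithmic pooling, a mechanism that strictly improves every member's welfare. It establishes recursive structure via cloning invariance, continuity, and openness, and uses tilt-based analysis to rule out trivial duplication. The framework shows that strict unanimity is possible in outcome spaces with more than two outcomes, and it formalizes for the first time the alignment dynamics in LLMs between a benevolent persona (such as Luigi) and an antagonistic persona (such as Waluigi), showing that first eliciting and then suppressing the antagonistic persona reduces first-order misalignment more than reinforcing the benevolent persona alone.
Link: https://arxiv.org/abs/2509.06701
Authors: Su Hyeong Lee,Risi Kondor,Richard Ngo
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: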
Abstract:We develop a theory of intelligent agency grounded in probabilistic modeling for neural models. Agents are represented as outcome distributions with epistemic utility given by log score, and compositions are defined through weighted logarithmic pooling that strictly improves every member’s welfare. We prove that strict unanimity is impossible under linear pooling or in binary outcome spaces, but possible with three or more outcomes. Our framework admits recursive structure via cloning invariance, continuity, and openness, while tilt-based analysis rules out trivial duplication. Finally, we formalize an agentic alignment phenomenon in LLMs using our theory: eliciting a benevolent persona (“Luigi”) induces an antagonistic counterpart (“Waluigi”), while a manifest-then-suppress Waluigi strategy yields strictly larger first-order misalignment reduction than pure Luigi reinforcement alone. These results clarify how developing a principled mathematical framework for how subagents can coalesce into coherent higher-level entities provides novel implications for alignment in agentic AI systems.
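Weighted logarithmic pooling combines distributions by taking a weighted geometric mean of their probabilities and renormalising. The sketch below shows that operation for discrete outcome distributions. It does not reproduce the paper's welfare or unanimity results, and the two example agents and the equal weights are hypothetical.

```python
import math

def log_pool(distributions, weights):
    """Weighted logarithmic pool: p(o) proportional to prod_i p_i(o) ** w_i."""
    n_outcomes = len(distributions[0])
    unnormalised = [
        math.prod(p[o] ** w for p, w in zip(distributions, weights))
        for o in range(n_outcomes)
    ]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

agent_a = [0.6, 0.3, 0.1]  # hypothetical subagent over three outcomes
agent_b = [0.2, 0.5, 0.3]  # hypothetical subagent over three outcomes
pooled = log_pool([agent_a, agent_b], [0.5, 0.5])
same = log_pool([agent_a, agent_a], [0.5, 0.5])
```

Pooling an agent with an identical copy returns the original distribution, which gives a simple consistency check.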
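[AI-11] Barycentric Neural Networks and Length-Weighted Persistent Entropy Loss: A Green Geometric and Topological Framework for Function Approximation
[Quick read]: This paper addresses the high computational cost of deep or overparameterized neural networks, which makes it hard to approximate nonlinear continuous functions efficiently in resource-constrained settings. Its core solution is a new small shallow neural network, the Barycentric Neural Network (BNN), whose structure and parameters are defined by a fixed set of base points and their barycentric coordinates. The BNN exactly represents Continuous Piecewise Linear Functions (CPLFs), guaranteeing strict continuity across segments. The authors further introduce a new measure weighted by the lifetime of topological features, the Length-Weighted Persistent Entropy (LWPE), and use it as a loss function to optimize the positions of the BNN's base points rather than internal weights in the traditional sense. Combining the BNN with the LWPE loss yields faster and better function approximation under a limited number of base points and training epochs, outperforming classical loss functions such as MSE, RMSE, and MAE.
Link: https://arxiv.org/abs/2509.06694
Authors: Victor Toscano-Duran,Rocio Gonzalez-Diaz,Miguel A. Gutiérrez-Naranjo
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: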
Abstract:While it is well-established that artificial neural networks are universal approximators for continuous functions on compact domains, many modern approaches rely on deep or overparameterized architectures that incur high computational costs. In this paper, a new type of small shallow neural network, called the Barycentric Neural Network (BNN), is proposed, which leverages a fixed set of base points and their barycentric coordinates to define both its structure and its parameters. We demonstrate that our BNN enables the exact representation of continuous piecewise linear functions (CPLFs), ensuring strict continuity across segments. Since any continuous function over a compact domain can be approximated arbitrarily well by CPLFs, the BNN naturally emerges as a flexible and interpretable tool for function approximation. Beyond the use of this representation, the main contribution of the paper is the introduction of a new variant of persistent entropy, a topological feature that is stable and scale invariant, called the length-weighted persistent entropy (LWPE), which is weighted by the lifetime of topological features. Our framework, which combines the BNN with a loss function based on our LWPE, aims to provide flexible and geometrically interpretable approximations of nonlinear continuous functions in resource-constrained settings, such as those with limited base points for BNN design and few training epochs. Instead of optimizing internal weights, our approach directly optimizes the base points that define the BNN. Experimental results show that our approach achieves superior and faster approximation performance compared to classical loss functions such as MSE, RMSE, MAE, and log-cosh.
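In one dimension, the function that a set of base points defines through barycentric coordinates is simply the piecewise linear interpolant through those points. The sketch below evaluates such a function directly. It shows the represented function rather than the paper's network construction, and the helper name and the example base points are assumptions.

```python
from bisect import bisect_right

def cplf_eval(base_points, x):
    """Evaluate the piecewise linear function through sorted (x_i, y_i) base points."""
    xs = [p[0] for p in base_points]
    # Index of the segment [x_i, x_{i+1}] containing x, clamped to valid segments
    i = min(max(bisect_right(xs, x) - 1, 0), len(xs) - 2)
    (x0, y0), (x1, y1) = base_points[i], base_points[i + 1]
    t = (x - x0) / (x1 - x0)        # barycentric weight attached to (x1, y1)
    return (1.0 - t) * y0 + t * y1  # the two weights (1 - t, t) sum to 1

points = [(0.0, 0.0), (1.0, 2.0), (3.0, 0.0)]  # hypothetical base points
values = [cplf_eval(points, x) for x in (0.0, 0.5, 1.0, 2.0, 3.0)]
```

Moving a base point changes the function only on the segments that touch it, which is one way to read the abstract's remark that the base points themselves are what gets optimized.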
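[AI-12] TrajAware: Graph Cross-Attention and Trajectory-Aware for Generalisable VANETs under Partial Observations
[Quick read]: This paper proposes a solution to routing difficulties in vehicular ad hoc networks (VANETs) caused by dynamic topologies, incomplete observations, and resource-constrained edge devices. Existing Reinforcement Learning (RL) methods usually assume a fixed graph structure and require retraining when network conditions change, which makes them hard to deploy on constrained edge hardware. The core solution is the TrajAware framework, whose key innovations are: (i) action space pruning, which reduces redundant neighbour choices while preserving two-hop reachability, alleviating the curse of dimensionality; (ii) graph cross-attention, which maps the pruned neighbours to the global graph context and produces feature representations that generalise across network sizes; and (iii) trajectory-aware prediction, which uses historical routes and junction information to estimate real-time positions and improves routing accuracy under partial observations. Evaluated in the SUMO simulator, the approach achieves near-shortest paths and high delivery ratios while meeting the efficiency requirements of edge devices.
Link: https://arxiv.org/abs/2509.06665
Authors: Xiaolu Fu,Ziyuan Bao,Eiman Kanjo
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures, 3 tables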
Abstract:Vehicular ad hoc networks (VANETs) are a crucial component of intelligent transportation systems; however, routing remains challenging due to dynamic topologies, incomplete observations, and the limited resources of edge devices. Existing reinforcement learning (RL) approaches often assume fixed graph structures and require retraining when network conditions change, making them unsuitable for deployment on constrained hardware. We present TrajAware, an RL-based framework designed for edge AI deployment in VANETs. TrajAware integrates three components: (i) action space pruning, which reduces redundant neighbour options while preserving two-hop reachability, alleviating the curse of dimensionality; (ii) graph cross-attention, which maps pruned neighbours to the global graph context, producing features that generalise across diverse network sizes; and (iii) trajectory-aware prediction, which uses historical routes and junction information to estimate real-time positions under partial observations. We evaluate TrajAware in the open-source SUMO simulator using real-world city maps with a leave-one-city-out setup. Results show that TrajAware achieves near-shortest paths and high delivery ratios while maintaining efficiency suitable for constrained edge devices, outperforming state-of-the-art baselines in both full and partial observation scenarios.
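[AI-13] AnalysisGNN: Unified Music Analysis with Graph Neural Networks
[Quick read]: This paper addresses the performance degradation that models suffer in cross-domain analysis of heterogeneous symbolic music datasets because of inconsistent annotations and domain shift. Its core solution is the AnalysisGNN framework, which uses a data-shuffling strategy, a custom weighted multi-task loss, and logit fusion between task-specific classifiers to model datasets with different annotation conventions in a unified way. It also adds a Non-Chord-Tone prediction module that removes functionally irrelevant notes, improving the consistency of label signals and the robustness of the model.
Link: https://arxiv.org/abs/2509.06654
Authors: Emmanouil Karystinaios,Johannes Hentschel,Markus Neuwirth,Gerhard Widmer
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Accepted at the 17th International Symposium on Computer Music Multidisciplinary Research (CMMR) 2025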
Abstract:Recent years have seen a boom in computational approaches to music analysis, yet each one is typically tailored to a specific analytical domain. In this work, we introduce AnalysisGNN, a novel graph neural network framework that leverages a data-shuffling strategy with a custom weighted multi-task loss and logit fusion between task-specific classifiers to integrate heterogeneously annotated symbolic datasets for comprehensive score analysis. We further integrate a Non-Chord-Tone prediction module, which identifies and excludes passing and non-functional notes from all tasks, thereby improving the consistency of label signals. Experimental evaluations demonstrate that AnalysisGNN achieves performance comparable to traditional static-dataset approaches, while showing increased resilience to domain shifts and annotation inconsistencies across multiple heterogeneous corpora.
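[AI-14] CogGuide: Human-Like Guidance for Zero-Shot Omni-Modal Reasoning
[Quick read]: This paper addresses the "shortcut" phenomenon and insufficient contextual understanding of multimodal large models in complex cross-modal reasoning. The key to its solution is a zero-shot multimodal reasoning component based on human cognitive strategies and centered on an "intent sketch". It builds a plug-and-play three-module pipeline (Intent Perceiver, Strategy Generator, and Strategy Selector) that explicitly simulates an "understand-plan-select" cognitive process. By generating and filtering intent sketches to guide the final reasoning, the method needs no parameter fine-tuning and transfers across models through in-context engineering alone. From an information-theoretic standpoint it reduces conditional entropy and improves information utilization efficiency, suppressing unintended shortcut reasoning. Experiments on IntentBench, WorldSense, and Daily-Omni confirm its generality, with gains of up to 9.51 percentage points.
Link: https://arxiv.org/abs/2509.06641
Authors: Zhou-Peng Shou(1 and 2),Zhi-Qiang You(1),Fang Wang(1),Hai-Bo Liu(3) ((1) NoDesk AI, Hangzhou, China, (2) Zhejiang University, Hangzhou, China, (3) Independent Researcher, Hangzhou, China)
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: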
Abstract:Targeting the issues of “shortcuts” and insufficient contextual understanding in complex cross-modal reasoning of multimodal large models, this paper proposes a zero-shot multimodal reasoning component guided by human-like cognitive strategies centered on an “intent sketch”. The component comprises a plug-and-play three-module pipeline (Intent Perceiver, Strategy Generator, and Strategy Selector) that explicitly constructs an “understand-plan-select” cognitive process. By generating and filtering “intent sketch” strategies to guide the final reasoning, it requires no parameter fine-tuning and achieves cross-model transfer solely through in-context engineering. Information-theoretic analysis shows that this process can reduce conditional entropy and improve information utilization efficiency, thereby suppressing unintended shortcut reasoning. Experiments on IntentBench, WorldSense, and Daily-Omni validate the method’s generality and robust gains; compared with their respective baselines, the complete “three-module” scheme yields consistent improvements across different reasoning engines and pipeline combinations, with gains up to approximately 9.51 percentage points, demonstrating the practical value and portability of the “intent sketch” reasoning component in zero-shot scenarios.
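[AI-15] The First Voice Timbre Attribute Detection Challenge
[Quick read]: This paper addresses the limited explainability of voice timbre attributes, in particular how to quantify and compare the intensity of two speech utterances along a specified timbre descriptor dimension. The key to its solution is a systematic evaluation framework, validated through a challenge run on the VCTK-RVA dataset; participation and submissions from multiple teams advance techniques for extracting and explaining timbre features.
Link: https://arxiv.org/abs/2509.06635
Authors: Liping Chen,Jinghao He,Zhengyan Sheng,Kong Aik Lee,Zhen-Hua Ling
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: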
Abstract:The first voice timbre attribute detection challenge is featured in a special session at NCMMSC 2025. It focuses on the explainability of voice timbre and compares the intensity of two speech utterances in a specified timbre descriptor dimension. The evaluation was conducted on the VCTK-RVA dataset. Participants developed their systems and submitted their outputs to the organizer, who evaluated the performance and sent feedback to them. Six teams submitted their outputs, with five providing descriptions of their methodologies.
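[AI-16] BEAM: Brainwave Empathy Assessment Model for Early Childhood
[Quick read]: This paper addresses the problem that traditional methods for predicting children's empathy rely on subjective self-reports or observer labels, which are biased and fail to capture the process of empathy formation objectively. The key to its solution is a new deep learning framework, the Brainwave Empathy Assessment Model (BEAM). Its core innovations are a LaBraM-based encoder for efficient spatio-temporal feature extraction from multi-view electroencephalography (EEG) signals, a feature fusion module that integrates complementary information from different sources to capture the cognitive and emotional dimensions of empathy, and a contrastive learning module that enhances class separation. BEAM outperforms existing methods on the CBCP dataset and offers an objective assessment tool and intervention cues for early prosocial development in children.
Link: https://arxiv.org/abs/2509.06620
Authors: Chen Xie,Gaofeng Wu,Kaidong Wang,Zihao Zhu,Xiaoshu Luo,Yan Liang,Feiyu Quan,Ruoxi Wu,Xianghui Huang,Han Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: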
Abstract:Empathy in young children is crucial for their social and emotional development, yet predicting it remains challenging. Traditional methods often only rely on self-reports or observer-based labeling, which are susceptible to bias and fail to objectively capture the process of empathy formation. EEG offers an objective alternative; however, current approaches primarily extract static patterns, neglecting temporal dynamics. To overcome these limitations, we propose a novel deep learning framework, the Brainwave Empathy Assessment Model (BEAM), to predict empathy levels in children aged 4-6 years. BEAM leverages multi-view EEG signals to capture both cognitive and emotional dimensions of empathy. The framework comprises three key components: 1) a LaBraM-based encoder for effective spatio-temporal feature extraction, 2) a feature fusion module to integrate complementary information from multi-view signals, and 3) a contrastive learning module to enhance class separation. Validated on the CBCP dataset, BEAM outperforms state-of-the-art methods across multiple metrics, demonstrating its potential for objective empathy assessment and providing a preliminary insight into early interventions in children’s prosocial development.
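[AI-17] Demo: Healthcare Agent Orchestrator (HAO) for Patient Summarization in Molecular Tumor Boards
[Quick read]: This paper addresses the inefficiency, subjectivity, and omission of key information in patient summary preparation for Molecular Tumor Boards (MTBs). Manually written summaries are time-consuming and inconsistent, which makes completeness and reproducibility hard to guarantee. The key to its solution is the Healthcare Agent Orchestrator (HAO), driven by a Large Language Model (LLM), which builds accurate and comprehensive patient summaries automatically through a multi-agent clinical workflow. The paper also designs TBFact, a "model-as-a-judge" evaluation framework that quantifies the completeness and succinctness of summaries without relying on annotated data, enabling a reliable evaluation setup that can be deployed locally without sharing sensitive clinical data.
Link: https://arxiv.org/abs/2509.06602
Authors: Noel Codella,Sam Preston,Hao Qiu,Leonardo Schettini,Wen-wai Yim,Mert Öz,Shrey Jain,Matthew P. Lungren,Thomas Osborne
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 1 figure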
Abstract:Molecular Tumor Boards (MTBs) are multidisciplinary forums where oncology specialists collaboratively assess complex patient cases to determine optimal treatment strategies. A central element of this process is the patient summary, typically compiled by a medical oncologist, radiation oncologist, or surgeon, or their trained medical assistant, who distills heterogeneous medical records into a concise narrative to facilitate discussion. This manual approach is often labor-intensive, subjective, and prone to omissions of critical information. To address these limitations, we introduce the Healthcare Agent Orchestrator (HAO), a Large Language Model (LLM)-driven AI agent that coordinates a multi-agent clinical workflow to generate accurate and comprehensive patient summaries for MTBs. Evaluating predicted patient summaries against ground truth presents additional challenges due to stylistic variation, ordering, synonym usage, and phrasing differences, which complicate the measurement of both succinctness and completeness. To overcome these evaluation hurdles, we propose TBFact, a “model-as-a-judge” framework designed to assess the comprehensiveness and succinctness of generated summaries. Using a benchmark dataset derived from de-identified tumor board discussions, we applied TBFact to evaluate our Patient History agent. Results show that the agent captured 94% of high-importance information (including partial entailments) and achieved a TBFact recall of 0.84 under strict entailment criteria. We further demonstrate that TBFact enables a data-free evaluation framework that institutions can deploy locally without sharing sensitive clinical data. Together, HAO and TBFact establish a robust foundation for delivering reliable and scalable support to MTBs.
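[AI-18] Contrastive Self-Supervised Network Intrusion Detection using Augmented Negative Pairs
[Quick read]: This paper addresses the dependence of supervised learning on large labelled datasets in network intrusion detection, and the limited practicality of traditional anomaly detection methods due to high false positive rates. The key to its solution is a new contrastive learning paradigm, Contrastive Learning using Augmented Negative pairs (CLAN), which treats samples produced by data augmentation as negative views (representing potentially malicious distributions) and other benign samples as positive views. This learns discriminative representations of benign traffic more effectively during pretraining, improves classification accuracy and inference efficiency, and achieves better multi-class classification than existing self-supervised models when labelled data is limited.
Link: https://arxiv.org/abs/2509.06550
Authors: Jack Wilkie,Hanan Hindy,Christos Tachtatzis,Robert Atkinson
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)
Comments: Published in: Proceedings of IEEE Conference on Cyber Security and Resilience (CSR), 2025. Official version: this https URL Code: this https URL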
Abstract:Network intrusion detection remains a critical challenge in cybersecurity. While supervised machine learning models achieve state-of-the-art performance, their reliance on large labelled datasets makes them impractical for many real-world applications. Anomaly detection methods, which train exclusively on benign traffic to identify malicious activity, suffer from high false positive rates, limiting their usability. Recently, self-supervised learning techniques have demonstrated improved performance with lower false positive rates by learning discriminative latent representations of benign traffic. In particular, contrastive self-supervised models achieve this by minimizing the distance between similar (positive) views of benign traffic while maximizing it between dissimilar (negative) views. Existing approaches generate positive views through data augmentation and treat other samples as negative. In contrast, this work introduces Contrastive Learning using Augmented Negative pairs (CLAN), a novel paradigm for network intrusion detection where augmented samples are treated as negative views - representing potentially malicious distributions - while other benign samples serve as positive views. This approach enhances both classification accuracy and inference efficiency after pretraining on benign traffic. Experimental evaluation on the Lycos2017 dataset demonstrates that the proposed method surpasses existing self-supervised and anomaly detection techniques in a binary classification task. Furthermore, when fine-tuned on a limited labelled dataset, the proposed approach achieves superior multi-class classification performance compared to existing self-supervised models.
[AI-19] Learning Optimal Defender Strategies for CAGE-2 using a POMDP Model
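[Quick Read]: This paper addresses how to learn and evaluate defender strategies effectively in the CAGE-2 scenario. The core challenge is that the state space is large and only partially observable, which makes it hard for conventional reinforcement learning methods to converge efficiently. The key to the solution is a formal model of CAGE-2 built as a Partially Observable Markov Decision Process (POMDP), together with a method called BF-PPO: it builds on Proximal Policy Optimization (PPO) and adds a particle filter to mitigate the computational complexity caused by the size of the state space, so that an optimal defender strategy can be learned efficiently. Experiments in the CybORG environment show that the method outperforms CARDIFF, the top-ranked method on the CAGE-2 leaderboard, in both the quality of the learned defender strategy and the required training time.
Link: https://arxiv.org/abs/2509.06539
Authors: Duc Huy Le,Rolf Stadler
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: The paper has been accepted for the 21st International Conference on Network and Service Management (CNSM-2025). The final version will be published in the conference proceedings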
Abstract:CAGE-2 is an accepted benchmark for learning and evaluating defender strategies against cyberattacks. It reflects a scenario where a defender agent protects an IT infrastructure against various attacks. Many defender methods for CAGE-2 have been proposed in the literature. In this paper, we construct a formal model for CAGE-2 using the framework of Partially Observable Markov Decision Process (POMDP). Based on this model, we define an optimal defender strategy for CAGE-2 and introduce a method to efficiently learn this strategy. Our method, called BF-PPO, is based on PPO, and it uses particle filter to mitigate the computational complexity due to the large state space of the CAGE-2 model. We evaluate our method in the CAGE-2 CybORG environment and compare its performance with that of CARDIFF, the highest ranked method on the CAGE-2 leaderboard. We find that our method outperforms CARDIFF regarding the learned defender strategy and the required training time.
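To make the role of the particle filter concrete, here is a minimal belief-update sketch. Everything in it is an assumption for illustration: the two-state host model, the transition and alert probabilities, and the function names are hypothetical and far simpler than the CAGE-2 model; the abstract does not specify how BF-PPO wires the filter into PPO.

```python
import random

def particle_filter_step(particles, observation, transition, likelihood, rng):
    """One belief update: propagate each particle, weight it by the
    observation likelihood, then resample in proportion to the weights."""
    moved = [transition(state, rng) for state in particles]
    weights = [likelihood(observation, state) for state in moved]
    return rng.choices(moved, weights=weights, k=len(moved))

# Hypothetical toy model: one host that is clean (0) or compromised (1).
def toy_transition(state, rng):
    if state == 1:
        return 1                      # a compromised host stays compromised
    return 1 if rng.random() < 0.1 else 0

def toy_likelihood(observation, state):
    p_alert = 0.8 if state == 1 else 0.1
    return p_alert if observation == "alert" else 1.0 - p_alert

def belief_compromised(particles):
    """Fraction of particles that represent the compromised state."""
    return sum(particles) / len(particles)

rng = random.Random(0)
particles = [0] * 1000                # start out certain the host is clean
for _ in range(5):                    # five consecutive alerts
    particles = particle_filter_step(
        particles, "alert", toy_transition, toy_likelihood, rng)
belief = belief_compromised(particles)
```

A policy network could then take a summary of the particle set (here a single fraction) as its input in place of the full belief distribution, which is one common way to keep a POMDP policy tractable.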
[AI-20] QualityFM: a Multimodal Physiological Signal Foundation Model with Self-Distillation for Signal Quality Challenges in Critically Ill Patients
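[Quick Read]: This paper targets the poor, incomplete, and inconsistent quality that is common in photoplethysmogram (PPG) and electrocardiogram (ECG) signals recorded in the intensive care unit (ICU) and operating room (OR), which can cause false alarms or diagnostic errors. Existing methods generalize poorly, depend on large amounts of labeled data, and transfer badly across tasks. The key to the solution is QualityFM, a multimodal foundation model with three core ideas: (1) a dual-track architecture in which an encoder for high-quality signals guides the training of an encoder for low-quality signals through self-distillation; (2) a windowed sparse attention mechanism inside the Transformer-based model, which handles long sequences efficiently and captures local quasi-periodic patterns; (3) a composite loss that combines a direct distillation loss on encoder outputs with an indirect reconstruction loss based on power and phase spectra, preserving the frequency-domain characteristics of the signals. The model is pre-trained on more than 21 million 30-second waveforms and roughly 179,000 hours of data, and is validated by transfer learning on three clinical tasks: detecting false ventricular tachycardia alarms, identifying atrial fibrillation, and estimating arterial blood pressure from PPG and ECG.
Link: https://arxiv.org/abs/2509.06516
Authors: Zongheng Guo,Tao Chen,Manuela Ferrario
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, 5 figures, 7 tables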
Abstract:Photoplethysmogram (PPG) and electrocardiogram (ECG) are commonly recorded in the intensive care unit (ICU) and operating room (OR). However, the high incidence of poor, incomplete, and inconsistent signal quality can lead to false alarms or diagnostic inaccuracies. The methods explored so far suffer from limited generalizability, reliance on extensive labeled data, and poor cross-task transferability. To overcome these challenges, we introduce QualityFM, a novel multimodal foundation model for these physiological signals, designed to acquire a general-purpose understanding of signal quality. Our model is pre-trained on a large-scale dataset comprising over 21 million 30-second waveforms and 179,757 hours of data. Our approach involves a dual-track architecture that processes paired physiological signals of differing quality, leveraging a self-distillation strategy where an encoder for high-quality signals is used to guide the training of an encoder for low-quality signals. To efficiently handle long sequential signals and capture essential local quasi-periodic patterns, we integrate a windowed sparse attention mechanism within our Transformer-based model. Furthermore, a composite loss function, which combines direct distillation loss on encoder outputs with indirect reconstruction loss based on power and phase spectra, ensures the preservation of frequency-domain characteristics of the signals. We pre-train three models with varying parameter counts (9.6 M to 319 M) and demonstrate their efficacy and practical value through transfer learning on three distinct clinical tasks: detection of false ventricular tachycardia alarms, identification of atrial fibrillation, and estimation of arterial blood pressure (ABP) from PPG and ECG signals.
[AI-21] An AI system to help scientists write expert-level empirical software
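[Quick Read]: This paper addresses the efficiency bottleneck that manually writing software for computational experiments imposes on scientific discovery. The core solution is an AI system that couples a Large Language Model (LLM) with Tree Search (TS): it systematically improves a quality metric and navigates the large space of candidate solutions, automatically producing expert-level scientific software. The key is to use the LLM to work with complex research knowledge and the tree search to explore and integrate research ideas from external sources across diverse tasks, yielding new methods and models in fields such as bioinformatics and epidemiology that outperform the best human-developed ones.
Link: https://arxiv.org/abs/2509.06503
Authors: Eser Aygün,Anastasiya Belyaeva,Gheorghe Comanici,Marc Coram,Hao Cui,Jake Garrison,Renee Johnston,Anton Kast,Cory Y. McLean,Peter Norgaard,Zahra Shamsi,David Smalling,James Thompson,Subhashini Venugopalan,Brian P. Williams,Chujun He,Sarah Martinson,Martyna Plomecka,Lai Wei,Yuchen Zhou,Qian-Ze Zhu,Matthew Abraham,Erica Brand,Anna Bulanova,Jeffrey A. Cardille,Chris Co,Scott Ellsworth,Grace Joseph,Malcolm Kane,Ryan Krueger,Johan Kartiwa,Dan Liebling,Jan-Matthis Lueckmann,Paul Raccuglia,Xuefei(Julie)Wang,Katherine Chou,James Manyika,Yossi Matias,John C. Platt,Lizzie Dorfman,Shibl Mourad,Michael P. Brenner
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: 71 pages, 26 figures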
Abstract:The cycle of scientific discovery is frequently bottlenecked by the slow, manual creation of software to support computational experiments. To address this, we present an AI system that creates expert-level scientific software whose goal is to maximize a quality metric. The system uses a Large Language Model (LLM) and Tree Search (TS) to systematically improve the quality metric and intelligently navigate the large space of possible solutions. The system achieves expert-level results when it explores and integrates complex research ideas from external sources. The effectiveness of tree search is demonstrated across a wide range of benchmarks. In bioinformatics, it discovered 40 novel methods for single-cell data analysis that outperformed the top human-developed methods on a public leaderboard. In epidemiology, it generated 14 models that outperformed the CDC ensemble and all other individual models for forecasting COVID-19 hospitalizations. Our method also produced state-of-the-art software for geospatial analysis, neural activity prediction in zebrafish, time series forecasting and numerical solution of integrals. By devising and implementing novel solutions to diverse tasks, the system represents a significant step towards accelerating scientific progress.
[AI-22] Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers
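[Quick Read]: This paper addresses a dual scaling problem that Large Language Models (LLMs) face in automated theorem proving: scaling reinforcement learning (RL) at training time and scaling compute at inference time. The solution has two key innovations. First, a multi-turn off-policy RL framework inspired by AlphaZero, which uses multi-stage expert iteration, adaptive tactic-level data filtering, and periodic retraining to get past the performance plateaus that LLM step-provers hit during long-term training. Second, a planner-enhanced multi-agent search architecture in which a general reasoning model acts as a high-level planner at inference time and decomposes complex theorems into a sequence of subgoals, substantially shrinking the search space and letting parallel prover agents collaborate efficiently through a shared proof cache. This dual scaling strategy reaches 95.08% on MiniF2F and 41.4% on ProofNet.
Link: https://arxiv.org/abs/2509.06493
Authors: Ran Xin,Zeyu Zheng,Yanchen Nie,Kun Yuan,Xia Xiao
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: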
Abstract:The integration of Large Language Models (LLMs) into automated theorem proving has shown immense promise, yet is fundamentally constrained by challenges in scaling up both training-time reinforcement learning (RL) and inference-time compute. This paper introduces BFS-Prover-V2, a system designed to address this dual scaling problem. We present two primary innovations. The first is a novel multi-turn off-policy RL framework for continually improving the performance of LLM step-prover at training time. This framework, inspired by the principles of AlphaZero, utilizes a multi-stage expert iteration pipeline featuring adaptive tactic-level data filtering and periodic retraining to surmount the performance plateaus that typically curtail long-term RL in LLM-based agents. The second innovation is a planner-enhanced multi-agent search architecture that scales reasoning capabilities at inference time. This architecture employs a general reasoning model as a high-level planner to iteratively decompose complex theorems into a sequence of simpler subgoals. This hierarchical approach substantially reduces the search space, enabling a team of parallel prover agents to collaborate efficiently by leveraging a shared proof cache. We demonstrate that this dual approach to scaling yields state-of-the-art results on established formal mathematics benchmarks. BFS-Prover-V2 achieves 95.08% and 41.4% on the MiniF2F and ProofNet test sets respectively. While demonstrated in the domain of formal mathematics, the RL and inference techniques presented in this work are of broader interest and may be applied to other domains requiring long-horizon multi-turn reasoning and complex search.
[AI-23] MORSE: Multi-Objective Reinforcement Learning via Strategy Evolution for Supply Chain Optimization
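[Quick Read]: This paper addresses multi-objective optimization in supply chain management, in particular the difficulty that traditional methods (such as linear programming and evolutionary algorithms) have in adapting in real time under dynamic, uncertain conditions. The core challenge is to trade off conflicting objectives such as cost, service level, and environmental sustainability flexibly while making decisions more robust. The key to the solution is to combine Reinforcement Learning (RL) with Multi-Objective Evolutionary Algorithms (MOEAs): the MOEA searches the parameter space of policy neural networks and produces a Pareto front of policies, giving decision-makers a diverse set of policies they can switch between; Conditional Value-at-Risk (CVaR) is added to support risk-sensitive decisions and improve resilience under uncertainty.
Link: https://arxiv.org/abs/2509.06490
Authors: Niki Kotecha,Ehecatl Antonio del Rio Chanona
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: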
Abstract:In supply chain management, decision-making often involves balancing multiple conflicting objectives, such as cost reduction, service level improvement, and environmental sustainability. Traditional multi-objective optimization methods, such as linear programming and evolutionary algorithms, struggle to adapt in real-time to the dynamic nature of supply chains. In this paper, we propose an approach that combines Reinforcement Learning (RL) and Multi-Objective Evolutionary Algorithms (MOEAs) to address these challenges for dynamic multi-objective optimization under uncertainty. Our method leverages MOEAs to search the parameter space of policy neural networks, generating a Pareto front of policies. This provides decision-makers with a diverse population of policies that can be dynamically switched based on the current system objectives, ensuring flexibility and adaptability in real-time decision-making. We also introduce Conditional Value-at-Risk (CVaR) to incorporate risk-sensitive decision-making, enhancing resilience in uncertain environments. We demonstrate the effectiveness of our approach through case studies, showcasing its ability to respond to supply chain dynamics and outperforming state-of-the-art methods in an inventory management case study. The proposed strategy not only improves decision-making efficiency but also offers a more robust framework for managing uncertainty and optimizing performance in supply chains.
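Two building blocks named in the abstract, CVaR and the Pareto front, are easy to state in code. The sketch below is a generic illustration rather than the authors' implementation: the function names, the tail-fraction parameterization, and the example policy scores are all hypothetical.

```python
def cvar(losses, tail_fraction=0.1):
    """Conditional Value-at-Risk: mean of the worst `tail_fraction` of losses."""
    ordered = sorted(losses)
    k = max(1, round(tail_fraction * len(ordered)))
    tail = ordered[-k:]
    return sum(tail) / len(tail)

def dominates(a, b):
    """True if objective vector a is no worse than b everywhere and strictly
    better somewhere (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical policies scored as (expected cost, CVaR of cost).
policy_scores = [(1, 5), (2, 2), (3, 3), (5, 1)]
front = pareto_front(policy_scores)                     # (3, 3) is dominated by (2, 2)
risk = cvar(list(range(1, 11)), tail_fraction=0.2)      # mean of the two worst losses
```

A decision-maker would pick one policy from `front` according to the current priority, for example the lowest-CVaR policy during a volatile period.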
[AI-24] DyC-STG: Dynamic Causal Spatio-Temporal Graph Network for Real-time Data Credibility Analysis in IoT
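[Quick Read]: This paper addresses how to assure the credibility of spatio-temporal data streams generated by Internet of Things (IoT) sensors in human-centric environments. The core challenge is that existing spatio-temporal graph (STG) models rely on static topologies and so cannot capture dynamic changes in physical state, and they tend to mistake spurious correlations for causality, which weakens robustness in complex human-centric settings. The key to the solution is the Dynamic Causal Spatio-Temporal Graph Network (DyC-STG), built on two synergistic modules: an event-driven dynamic graph module that adapts the graph structure in real time to reflect physical state changes, and a causal reasoning module that distills causally-aware representations by strictly enforcing temporal precedence, improving the accuracy and adaptability of data credibility analysis.
Link: https://arxiv.org/abs/2509.06483
Authors: Guanjie Cheng,Boyi Li,Peihan Wu,Feiyi Chen,Xinkui Zhao,Mengying Zhu,Shuiguang Deng
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: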
Abstract:The wide spreading of Internet of Things (IoT) sensors generates vast spatio-temporal data streams, but ensuring data credibility is a critical yet unsolved challenge for applications like smart homes. While spatio-temporal graph (STG) models are a leading paradigm for such data, they often fall short in dynamic, human-centric environments due to two fundamental limitations: (1) their reliance on static graph topologies, which fail to capture physical, event-driven dynamics, and (2) their tendency to confuse spurious correlations with true causality, undermining robustness in human-centric environments. To address these gaps, we propose the Dynamic Causal Spatio-Temporal Graph Network (DyC-STG), a novel framework designed for real-time data credibility analysis in IoT. Our framework features two synergistic contributions: an event-driven dynamic graph module that adapts the graph topology in real-time to reflect physical state changes, and a causal reasoning module to distill causally-aware representations by strictly enforcing temporal precedence. To facilitate the research in this domain we release two new real-world datasets. Comprehensive experiments show that DyC-STG establishes a new state-of-the-art, outperforming the strongest baselines by 1.4 percentage points and achieving an F1-Score of up to 0.930.
[AI-25] MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents
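[Quick Read]: This paper addresses the limited efficiency of Graphical User Interface (GUI) agents across platforms such as smartphones and computers, and in particular the lack of a benchmark for systematically evaluating GUI-shortcut hybrid agents. The key to the solution is MAS-Bench, a benchmark focused on the mobile domain. It provides a knowledge base of 88 predefined shortcuts (APIs, deep links, RPA scripts) and 139 complex tasks across 11 real-world applications, and it also assesses an agent's ability to generate reusable, low-cost workflows on its own, so that the capability to embed shortcuts intelligently can be measured. Experiments show that hybrid agents achieve significantly higher success rates and efficiency than GUI-only agents, which supports the validity of the evaluation method.
Link: https://arxiv.org/abs/2509.06477
Authors: Pengxiang Zhao,Guangyi Liu,Yaozhen Liang,Weiqing He,Zhengxi Lu,Yuehao Huang,Yaxuan Guo,Kexin Zhang,Hao Wang,Liang Liu,Yong Liu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: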
Abstract:To enhance the efficiency of GUI agents on various platforms like smartphones and computers, a hybrid paradigm that combines flexible GUI operations with efficient shortcuts (e.g., API, deep links) is emerging as a promising direction. However, a framework for systematically benchmarking these hybrid agents is still underexplored. To take the first step in bridging this gap, we introduce MAS-Bench, a benchmark that pioneers the evaluation of GUI-shortcut hybrid agents with a specific focus on the mobile domain. Beyond merely using predefined shortcuts, MAS-Bench assesses an agent’s capability to autonomously generate shortcuts by discovering and creating reusable, low-cost workflows. It features 139 complex tasks across 11 real-world applications, a knowledge base of 88 predefined shortcuts (APIs, deep-links, RPA scripts), and 7 evaluation metrics. The tasks are designed to be solvable via GUI-only operations, but can be significantly accelerated by intelligently embedding shortcuts. Experiments show that hybrid agents achieve significantly higher success rates and efficiency than their GUI-only counterparts. This result also demonstrates the effectiveness of our method for evaluating an agent’s shortcut generation capabilities. MAS-Bench fills a critical evaluation gap, providing a foundational platform for future advancements in creating more efficient and robust intelligent agents.
[AI-26] Explained yet misunderstood: How AI Literacy shapes HR Managers interpretation of User Interfaces in Recruiting Recommender Systems RECSYS
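[Quick Read]: The problem this paper tackles is that AI-based recommender systems increasingly shape recruitment decisions in Human Resource Management (HRM), yet dashboards used in practice do not explain AI results transparently, and HR managers with different levels of AI literacy understand explainable AI (XAI) elements differently, which can lead to misuse or miscalibrated trust. The key finding is that the effect of XAI depends on the user's AI literacy. Adding XAI features raises perceived helpfulness and trust among users with moderate or high AI literacy, but it does not raise their objective understanding, and complex explanations can even reduce it; only overlays of important features significantly helped high-literacy users interpret the results. The paper therefore argues for tailored explanation strategies and targeted AI literacy training to achieve fair, transparent, and effective use of AI in recruiting.
Link: https://arxiv.org/abs/2509.06475
Authors: Yannick Kalff,Katharina Simbeck
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Accepted paper for RecSys in HR’25: The 5th Workshop on Recommender Systems for Human Resources, in conjunction with the 19th ACM Conference on Recommender Systems, September 22–26, 2025, Prague, Czech Republic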
Abstract:AI-based recommender systems increasingly influence recruitment decisions. Thus, transparency and responsible adoption in Human Resource Management (HRM) are critical. This study examines how HR managers’ AI literacy influences their subjective perception and objective understanding of explainable AI (XAI) elements in recruiting recommender dashboards. In an online experiment, 410 German-based HR managers compared baseline dashboards to versions enriched with three XAI styles: important features, counterfactuals, and model criteria. Our results show that the dashboards used in practice do not explain AI results and even keep AI elements opaque. However, while adding XAI features improves subjective perceptions of helpfulness and trust among users with moderate or high AI literacy, it does not increase their objective understanding. It may even reduce accurate understanding, especially with complex explanations. Only overlays of important features significantly aided the interpretations of high-literacy users. Our findings highlight that the benefits of XAI in recruitment depend on users’ AI literacy, emphasizing the need for tailored explanation strategies and targeted literacy training in HRM to ensure fair, transparent, and effective adoption of AI.
[AI-27] Accelerate Scaling of LLM Alignment via Quantifying the Coverage and Depth of Instruction Set
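[Quick Read]: This paper addresses inefficient instruction data selection when aligning large language models for downstream tasks: as the candidate instruction pool keeps growing, existing selection methods stop improving alignment performance. The core difficulty is that current methods give no clear guidance on how to pick the most informative samples from an instruction set with a complex distribution. The key to the solution is to identify two factors that determine downstream performance, the depth of instructions and the coverage of the semantic space, and to propose a selection algorithm that maximizes both at once. The result is "Accelerated Scaling": compared with state-of-the-art baselines, model performance keeps improving and does so at a faster pace.
Link: https://arxiv.org/abs/2509.06463
Authors: Chengwei Wu,Li Du,Hanyu Zhao,Yiming Ju,Jiapu Wang,Tengfei Pan
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: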
Abstract:With the growing demand for applying large language models to downstream tasks, improving model alignment performance and efficiency has become crucial. Such a process involves selecting informative instructions from a candidate pool. However, due to the complexity of instruction set distributions, the key factors driving the performance of aligned models remain unclear. As a result, current instruction set refinement methods fail to improve performance as the instruction pool expands continuously. To address this issue, we first investigate the key factors that influence the relationship between instruction dataset distribution and aligned model performance. Based on these insights, we propose a novel instruction data selection method. We identify that the depth of instructions and the coverage of the semantic space are the crucial factors determining downstream performance, which could explain over 70% of the model loss on the development set. We then design an instruction selection algorithm to simultaneously maximize the depth and semantic coverage of the selected instructions. Experimental results demonstrate that, compared to state-of-the-art baseline methods, it can sustainably improve model performance at a faster pace and thus achieve "Accelerated Scaling".
[AI-28] HyFedRAG: A Federated Retrieval-Augmented Generation Framework for Heterogeneous and Privacy-Sensitive Data
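[Quick Read]: This paper addresses the challenges that centralized Retrieval-Augmented Generation (RAG) systems face in distributed healthcare settings: heterogeneous data (structured SQL databases, semi-structured knowledge graphs, and unstructured clinical notes) is hard to handle uniformly, and privacy constraints make it difficult to retrieve rare disease cases. The key to the solution is the HyFedRAG framework. It builds an edge-cloud collaborative RAG architecture on Flower, in which lightweight edge-side Large Language Models (LLMs) convert diverse data into standardized, privacy-preserving representations and a cloud-side LLM performs global reasoning and generation. It integrates lightweight local retrievers with privacy-aware LLMs and provides three anonymization tools that produce semantically rich, de-identified summaries for global inference across devices. It also uses a three-tier caching strategy (local cache, intermediate representation cache, and cloud inference cache) to reduce latency and redundant computation, giving an efficient, scalable, and privacy-compliant RAG deployment.
Link: https://arxiv.org/abs/2509.06444
Authors: Cheng Qian,Hainan Zhang,Yongxin Tong,Hong-Wei Zheng,Zhiming Zheng
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 9 pages, 7 figures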
Abstract:Centralized RAG pipelines struggle with heterogeneous and privacy-sensitive data, especially in distributed healthcare settings where patient data spans SQL, knowledge graphs, and clinical notes. Clinicians face difficulties retrieving rare disease cases due to privacy constraints and the limitations of traditional cloud-based RAG systems in handling diverse formats and edge devices. To address this, we introduce HyFedRAG, a unified and efficient Federated RAG framework tailored for Hybrid data modalities. By leveraging an edge-cloud collaborative mechanism, HyFedRAG enables RAG to operate across diverse data sources while preserving data privacy. Our key contributions are: (1) We design an edge-cloud collaborative RAG framework built on Flower, which supports querying structured SQL data, semi-structured knowledge graphs, and unstructured documents. The edge-side LLMs convert diverse data into standardized privacy-preserving representations, and the server-side LLMs integrate them for global reasoning and generation. (2) We integrate lightweight local retrievers with privacy-aware LLMs and provide three anonymization tools that enable each client to produce semantically rich, de-identified summaries for global inference across devices. (3) To optimize response latency and reduce redundant computation, we design a three-tier caching strategy consisting of local cache, intermediate representation cache, and cloud inference cache. Experimental results on PMC-Patients demonstrate that HyFedRAG outperforms existing baselines in terms of retrieval quality, generation consistency, and system efficiency. Our framework offers a scalable and privacy-compliant solution for RAG over structural-heterogeneous data, unlocking the potential of LLMs in sensitive and diverse data environments.
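The three-tier caching idea in contribution (3) can be sketched as a lookup chain. This is an assumption-heavy illustration: the abstract names the three tiers but not how they interact, so the lookup order, the class name `ThreeTierCache`, and the stub callables standing in for the retriever, the edge-side summarizer, and the cloud-side generator are all hypothetical. The sketch also runs in a single process, whereas a real deployment would presumably keep the first two tiers on the client and the third on the server.

```python
class ThreeTierCache:
    """Hypothetical lookup chain: cloud answer -> intermediate summary -> local retrieval."""

    def __init__(self, retrieve, summarize, generate):
        self.retrieve = retrieve        # stand-in for the local retriever
        self.summarize = summarize      # stand-in for edge-side de-identification
        self.generate = generate        # stand-in for cloud-side generation
        self.local_cache = {}           # query -> raw retrieved records
        self.intermediate_cache = {}    # query -> de-identified summary
        self.cloud_cache = {}           # query -> final answer

    def answer(self, query):
        # Tier 3: a cached final answer short-circuits everything else.
        if query in self.cloud_cache:
            return self.cloud_cache[query]
        # Tier 2: reuse the privacy-preserving summary if it already exists.
        if query not in self.intermediate_cache:
            # Tier 1: reuse locally retrieved records if they already exist.
            if query not in self.local_cache:
                self.local_cache[query] = self.retrieve(query)
            self.intermediate_cache[query] = self.summarize(self.local_cache[query])
        result = self.generate(query, self.intermediate_cache[query])
        self.cloud_cache[query] = result
        return result

cache = ThreeTierCache(
    retrieve=lambda q: ["record for " + q],
    summarize=lambda docs: "summary of %d records" % len(docs),
    generate=lambda q, s: q + " -> " + s,
)
first = cache.answer("rare disease X")
second = cache.answer("rare disease X")   # served from the cloud-tier cache
```

Only the de-identified summary ever reaches `generate`, which mirrors the abstract's point that raw patient data stays on the edge.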
[AI-29] Tree of Agents: Improving Long-Context Capabilities of Large Language Models through Multi-Perspective Reasoning
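[Quick Read]: This paper addresses the "lost in the middle" issue that Large Language Models (LLMs) show on long-context tasks, where information in the middle of the input is underused, while avoiding both the loss of key information caused by methods that reduce the input and the attention dispersion caused by extending the context window. The key to the solution is Tree of Agents (TOA), a multi-agent reasoning framework: the input is split into chunks, independent agents each form a local cognition, and the agents then exchange information dynamically along tree-structured paths for collaborative reasoning. This lets agents probe different reasoning orders for multi-perspective understanding, which mitigates position bias and reduces hallucinations. Prefix-hash caching and adaptive pruning improve processing efficiency, so the performance gains come with comparable API overhead.
Link: https://arxiv.org/abs/2509.06436
Authors: Song Yu,Xiaofei Xu,Ke Deng,Li Li,Lin Tian
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 19 pages, 5 figures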
Abstract:Large language models (LLMs) face persistent challenges when handling long-context tasks, most notably the lost in the middle issue, where information located in the middle of a long input tends to be underutilized. Some existing methods that reduce input have the risk of discarding key information, while others that extend context windows often lead to attention dispersion. To address these limitations, we propose Tree of Agents (TOA), a multi-agent reasoning framework that segments the input into chunks processed by independent agents. Each agent generates its local cognition, then agents dynamically exchange information for collaborative reasoning along tree-structured paths. TOA enables agents to probe different reasoning orders for multi-perspective understanding, effectively mitigating position bias and reducing hallucinations. To improve processing efficiency, we incorporate prefix-hash caching and adaptive pruning strategies, achieving significant performance improvements with comparable API overhead. Experiments show that TOA, powered by compact LLaMA3.1-8B, significantly outperforms multiple baselines and demonstrates comparable performance to the latest and much larger commercial models, such as Gemini1.5-pro, on various long-context tasks. Code is available at this https URL.
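One plausible reading of prefix-hash caching is that two reasoning paths sharing the same leading chunk order can reuse the work done for that shared prefix. The sketch below illustrates that reading only: the `PrefixCache` class, the key format, and the `step` callable (a stand-in for an LLM call that folds one more chunk into a running state) are hypothetical, and the paper's actual mechanism may differ.

```python
import hashlib

def prefix_key(chunk_ids):
    """Stable hash of an ordered prefix of chunk identifiers."""
    joined = ",".join(str(c) for c in chunk_ids)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

class PrefixCache:
    def __init__(self, step):
        self.step = step      # hypothetical stand-in for one agent/LLM call
        self.store = {}       # prefix hash -> state after reading that prefix
        self.calls = 0        # number of (expensive) step invocations

    def run(self, order):
        """Read chunks in `order`, reusing cached states for known prefixes."""
        state = ""
        for i, chunk in enumerate(order):
            key = prefix_key(order[: i + 1])
            if key not in self.store:
                self.calls += 1
                self.store[key] = self.step(state, chunk)
            state = self.store[key]
        return state

cache = PrefixCache(lambda state, chunk: state + "[" + str(chunk) + "]")
path_a = cache.run([0, 1, 2])   # three fresh calls
path_b = cache.run([0, 1, 3])   # prefix (0, 1) is reused, one fresh call
```

With many tree paths that branch late, most of the per-chunk work is shared, which is consistent with the abstract's claim of comparable API overhead.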
[AI-30] HECATE: An ECS-based Framework for Teaching and Developing Multi-Agent Systems ECAI-2025
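[Quick Read]: This paper addresses the complexity of developing Multiagent Systems (MAS) for engineers whose background is in Distributed Systems (DS), in particular the heavy dependence of MAS development on specialized agent knowledge. The key to the solution is the HECATE framework, which is based on the Entity-Component-System (ECS) architectural pattern and uses data-oriented design to integrate agent concepts directly into the DS domain. By reusing established DS patterns and standards, it lowers the amount of agent-specific knowledge needed to build MAS and simplifies their engineering.
Link: https://arxiv.org/abs/2509.06431
Authors: Arthur Casals,Anarosa A. F. Brandão
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: Submitted to ECAI-2025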
Abstract:This paper introduces HECATE, a novel framework based on the Entity-Component-System (ECS) architectural pattern that bridges the gap between distributed systems engineering and MAS development. HECATE is built using the Entity-Component-System architectural pattern, leveraging data-oriented design to implement multiagent systems. This approach involves engineering multiagent systems (MAS) from a distributed systems (DS) perspective, integrating agent concepts directly into the DS domain. This approach simplifies MAS development by (i) reducing the need for specialized agent knowledge and (ii) leveraging familiar DS patterns and standards to minimize the agent-specific knowledge required for engineering MAS. We present the framework’s architecture, core components, and implementation approach, demonstrating how it supports different agent models.
[AI-31] CAPMix: Robust Time Series Anomaly Detection Based on Abnormal Assumptions with Dual-Space Mixup
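[Quick Read]: This paper addresses Time Series Anomaly Detection (TSAD) when labeled anomalies are scarce and temporal dependencies are complex, and in particular two problems in existing Anomaly Assumption (AA) methods. The first is "patchy generation": scattered anomaly knowledge leads to overly simplistic or incoherent anomaly injection. The second is "Anomaly Shift": synthetic anomalies either resemble normal data too closely or diverge unrealistically from real anomalies, distorting classification boundaries. The key to the solution is the CAPMix framework: a CutAddPaste mechanism injects diverse and complex anomalies in a targeted way to avoid patchy generation; a label revision strategy adaptively refines anomaly labels to reduce anomaly shift; and dual-space mixup inside a temporal convolutional network yields smoother and more robust decision boundaries.
Link: https://arxiv.org/abs/2509.06419
Authors: Xudong Mou,Rui Wang,Tiejun Wang,Renyu Yang,Shiru Chen,Jie Sun,Tianyu Wo,Xudong Liu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: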
Abstract:Time series anomaly detection (TSAD) is a vital yet challenging task, particularly in scenarios where labeled anomalies are scarce and temporal dependencies are complex. Recent anomaly assumption (AA) approaches alleviate the lack of anomalies by injecting synthetic samples and training discriminative models. Despite promising results, these methods often suffer from two fundamental limitations: patchy generation, where scattered anomaly knowledge leads to overly simplistic or incoherent anomaly injection, and Anomaly Shift, where synthetic anomalies either resemble normal data too closely or diverge unrealistically from real anomalies, thereby distorting classification boundaries. In this paper, we propose CAPMix, a controllable anomaly augmentation framework that addresses both issues. First, we design a CutAddPaste mechanism to inject diverse and complex anomalies in a targeted manner, avoiding patchy generation. Second, we introduce a label revision strategy to adaptively refine anomaly labels, reducing the risk of anomaly shift. Finally, we employ dual-space mixup within a temporal convolutional network to enforce smoother and more robust decision boundaries. Extensive experiments on five benchmark datasets, including AIOps, UCR, SWaT, WADI, and ESA, demonstrate that CAPMix achieves significant improvements over state-of-the-art baselines, with enhanced robustness against contaminated training data. The code is available at this https URL.
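For readers unfamiliar with mixup, the basic operation is a convex combination of two samples and of their labels. The sketch below shows only that generic operation; it assumes (this is not stated in the abstract) that "dual-space" means applying the same interpolation both to raw input windows and to hidden features, and the function names and the Beta parameter are hypothetical.

```python
import random

def sample_lambda(rng, alpha=0.4):
    """Mixing coefficient drawn from Beta(alpha, alpha), as in standard mixup."""
    return rng.betavariate(alpha, alpha)

def mixup(x_a, x_b, y_a, y_b, lam):
    """Convex combination of two samples and of their (soft) labels."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = lam * y_a + (1.0 - lam) * y_b
    return x, y

rng = random.Random(0)
lam = sample_lambda(rng)
normal_window = [0.0, 0.0]    # label 0.0 (normal)
anomaly_window = [2.0, 4.0]   # label 1.0 (synthetic anomaly)

# Input-space mixup on raw windows; under the assumption above, the same
# function would be applied to hidden-layer features for the second space.
mixed_x, mixed_y = mixup(normal_window, anomaly_window, 0.0, 1.0, lam)
```

Because the label is interpolated along with the input, the classifier is trained on soft targets between normal and anomalous, which is what smooths the decision boundary.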
[AI-32] Teaching AI Stepwise Diagnostic Reasoning with Report-Guided Chain-of-Thought Learning
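[Quick Read]: This paper addresses the lack of interpretability and clinical accuracy of current Vision-Language Models (VLMs) in radiological diagnosis, especially their marked performance drop on long-tailed disease categories. The key to the solution is the DiagCoT framework, which applies a three-stage fine-tuning strategy: contrastive image-report tuning for domain alignment, Chain-of-Thought (CoT) supervision to capture inferential logic, and reinforcement tuning with clinical reward signals to improve the factual accuracy and fluency of reports. The method builds structured supervision from free-text reports alone, improving disease classification, pathology grounding, and report generation while generalizing well.
Link: https://arxiv.org/abs/2509.06409
Authors: Yihong Luo,Wenwu He,Zhuo-Xu Cui,Dong Liang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: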
Abstract:This study presents DiagCoT, a multi-stage framework that applies supervised fine-tuning to general-purpose vision-language models (VLMs) to emulate radiologists’ stepwise diagnostic reasoning using only free-text reports. DiagCoT combines contrastive image-report tuning for domain alignment, chain-of-thought supervision to capture inferential logic, and reinforcement tuning with clinical reward signals to enhance factual accuracy and fluency. On the MIMIC-CXR benchmark, DiagCoT improved zero-shot disease classification AUC from 0.52 to 0.76 (absolute gain of 0.24), pathology grounding mIoU from 0.08 to 0.31 (absolute gain of 0.23), and report generation BLEU from 0.11 to 0.33 (absolute gain of 0.22). It outperformed state-of-the-art models including LLaVA-Med and CXR-LLAVA on long-tailed diseases and external datasets. By converting unstructured clinical narratives into structured supervision, DiagCoT offers a scalable approach for developing interpretable and diagnostically competent AI systems for radiology.
[AI-33] MeanFlow-Accelerated Multimodal Video-to-Audio Synthesis via One-Step Generation
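[Quick Read]: This paper addresses the inherent trade-off between synthesis quality and inference efficiency when generating audio from silent video (video-to-audio, VTA). Existing approaches such as flow matching models the instantaneous velocity and therefore need an iterative sampling process, which makes inference slow. The key to the solution is a MeanFlow-accelerated model that characterizes the flow field by its average velocity, enabling one-step generation and greatly speeding up multimodal VTA synthesis while preserving audio quality, semantic alignment, and temporal synchronization. In addition, a scalar rescaling mechanism balances conditional and unconditional predictions under classifier-free guidance (CFG), mitigating CFG-induced distortions in one-step generation.
Link: https://arxiv.org/abs/2509.06389
Authors: Xiaoran Yang,Jianxuan Yang,Xinyue Guo,Haoyu Wang,Ningning Pan,Gongping Huang
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: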
Abstract:A key challenge in synthesizing audio from silent videos is the inherent trade-off between synthesis quality and inference efficiency in existing methods. For instance, flow matching based models rely on modeling instantaneous velocity and therefore inherently require an iterative sampling process, leading to slow inference speeds. To address this efficiency bottleneck, we introduce a MeanFlow-accelerated model that characterizes flow fields using average velocity, enabling one-step generation and thereby significantly accelerating multimodal video-to-audio (VTA) synthesis while preserving audio quality, semantic alignment, and temporal synchronization. Furthermore, a scalar rescaling mechanism is employed to balance conditional and unconditional predictions when classifier-free guidance (CFG) is applied, effectively mitigating CFG-induced distortions in one-step generation. Since the audio synthesis network is jointly trained with multimodal conditions, we further evaluate it on the text-to-audio (TTA) synthesis task. Experimental results demonstrate that incorporating MeanFlow into the network significantly improves inference speed without compromising perceptual quality on both VTA and TTA synthesis tasks.
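The difference between instantaneous and average velocity can be written down briefly. The formulas below follow the usual MeanFlow formulation as I understand it, with noise at t = 1 and data at t = 0; the symbols z_t, v, and u are assumptions for illustration and may not match the notation of this paper, and its CFG rescaling step is not reproduced here.

```latex
% Flow matching: sampling integrates an ODE driven by the instantaneous
% velocity v, which requires many solver steps.
\frac{d z_t}{d t} = v(z_t, t)

% MeanFlow: learn the average velocity over an interval [r, t] instead.
u(z_t, r, t) \;=\; \frac{1}{t - r} \int_{r}^{t} v(z_\tau, \tau)\, d\tau

% Moving from time t to time r then takes a single evaluation of u:
z_r \;=\; z_t - (t - r)\, u(z_t, r, t)

% One-step generation: start from noise z_1 and jump directly to the sample.
z_0 \;=\; z_1 - u(z_1, 0, 1)
```

Because the last line needs one network evaluation instead of a full ODE solve, the inference cost no longer grows with the number of sampling steps.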
[AI-34] Beyond the Pre-Service Horizon: Infusing In-Service Behavior for Improved Financial Risk Forecasting ICDM2025
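[Quick Read]: This paper addresses the separation between pre-service risk assessment and in-service default detection in conventional financial risk management, which limits how well risk can be predicted before service activation. The core solution is the Multi-Granularity Knowledge Distillation (MGKD) framework: a teacher model trained on in-service user behavior data transfers its knowledge, as soft labels, to a student model trained on pre-service data, improving pre-service risk prediction. The key is a combination of coarse-grained, fine-grained, and self-distillation strategies that align the teacher's and student's representations and predictions, strengthening the representation of default cases and carrying key in-service behavioral patterns over to the pre-service setting. A re-weighting strategy is added to mitigate class-imbalance bias, and the approach improves risk identification in real business scenarios.
Link: https://arxiv.org/abs/2509.06385
Authors: Senhao Liu,Zhiyu Guo,Zhiyuan Ji,Yueguo Chen,Yateng Tang,Yunhai Wang,Xuehao Zheng,Xiang Ao
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to IEEE ICDM 2025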
Abstract:Typical financial risk management involves distinct phases for pre-service risk assessment and in-service default detection, often modeled separately. This paper proposes a novel framework, Multi-Granularity Knowledge Distillation (abbreviated as MGKD), aimed at improving pre-service risk prediction through the integration of in-service user behavior data. MGKD follows the idea of knowledge distillation, where the teacher model, trained on historical in-service data, guides the student model, which is trained on pre-service data. By using soft labels derived from in-service data, the teacher model helps the student model improve its risk prediction prior to service activation. Meanwhile, a multi-granularity distillation strategy is introduced, including coarse-grained, fine-grained, and self-distillation, to align the representations and predictions of the teacher and student models. This approach not only reinforces the representation of default cases but also enables the transfer of key behavioral patterns associated with defaulters from the teacher to the student model, thereby improving the overall performance of pre-service risk assessment. Moreover, we adopt a re-weighting strategy to mitigate the model’s bias towards the minority class. Experimental results on large-scale real-world datasets from Tencent Mobile Payment demonstrate the effectiveness of our proposed approach in both offline and online scenarios.
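The soft-label idea at the heart of knowledge distillation can be sketched with the classic temperature-scaled loss. This is the standard textbook formulation, not MGKD itself: the coarse-grained, fine-grained, and self-distillation terms and the re-weighting strategy are not modeled, and the function names, `alpha`, and `temperature` are illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      alpha=0.5, temperature=2.0):
    """alpha * cross-entropy on the hard label
    + (1 - alpha) * T^2 * KL(teacher soft labels || student soft labels)."""
    ce = -math.log(softmax(student_logits)[hard_label])
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(soft_teacher, soft_student))
    return alpha * ce + (1.0 - alpha) * temperature ** 2 * kl

# Teacher saw in-service behavior and is confident the user defaults (class 1);
# the student only sees pre-service features and is less sure.
teacher_logits = [-1.0, 2.0]
student_logits = [0.2, 0.4]
loss = distillation_loss(student_logits, teacher_logits, hard_label=1)
```

The KL term pulls the student's predicted default probability toward the teacher's, so information from in-service behavior reaches a model that only has pre-service inputs at prediction time.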
[AI-35] A data-driven discretized CS:GO simulation environment to facilitate strategic multi-agent planning research
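[Quick read]: This paper addresses the difficulty of achieving both high fidelity and computational efficiency when simulating complex multi-agent interactions. Existing simulation environments tend to spend substantial compute on modeling low-level mechanics such as aiming and shooting, which makes them hard to scale to research on long-horizon strategic planning. The key to the solution is the DECOY framework, which abstracts strategic, long-horizon planning in 3D terrain into a high-level discretized state space and introduces neural predictive and generative models trained on real CS:GO tournament data, so that game event outcomes can be reconstructed from movement decisions alone. A waypoint system simplifies and discretizes continuous states and actions, which improves simulation efficiency while preserving low-level environmental fidelity and enables efficient, accurate reproduction of multi-agent behavior.
Link: https://arxiv.org/abs/2509.06355
Authors: Yunzhe Wang,Volkan Ustun,Chris McGroarty
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at the Winter Simulation Conference 2025, December, Seattle USA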
Abstract:Modern simulation environments for complex multi-agent interactions must balance high-fidelity detail with computational efficiency. We present DECOY, a novel multi-agent simulator that abstracts strategic, long-horizon planning in 3D terrains into high-level discretized simulation while preserving low-level environmental fidelity. Using Counter-Strike: Global Offensive (CS:GO) as a testbed, our framework accurately simulates gameplay using only movement decisions as tactical positioning – without explicitly modeling low-level mechanics such as aiming and shooting. Central to our approach is a waypoint system that simplifies and discretizes continuous states and actions, paired with neural predictive and generative models trained on real CS:GO tournament data to reconstruct event outcomes. Extensive evaluations show that replays generated from human data in DECOY closely match those observed in the original game. Our publicly available simulation environment provides a valuable tool for advancing research in strategic multi-agent planning and behavior generation.
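The waypoint idea can be illustrated with a small sketch that snaps continuous positions to the nearest waypoint and collapses a trajectory into a waypoint sequence. The function names and the nearest-neighbor rule are assumptions made for illustration; the abstract does not describe how DECOY assigns positions to waypoints.

```python
import math

def nearest_waypoint(position, waypoints):
    """Return the index of the waypoint closest to a continuous (x, y, z) position."""
    return min(range(len(waypoints)), key=lambda i: math.dist(position, waypoints[i]))

def discretize_trajectory(positions, waypoints):
    """Map a continuous trajectory to a waypoint sequence, merging consecutive repeats."""
    sequence = []
    for pos in positions:
        idx = nearest_waypoint(pos, waypoints)
        if not sequence or sequence[-1] != idx:
            sequence.append(idx)
    return sequence
```

A movement decision in such a setup is then a choice of the next waypoint index rather than a continuous control input.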
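[AI-36] BanPick: Achieving Free Performance Gains and Inference Speedup via Smarter Routing in MoE-LLMs
[Quick read]: This paper addresses two key problems that arise in sparse Mixture-of-Experts (MoE) architectures because routers are optimized during pre-training mainly for stability and balance: first, a few highly influential experts are underutilized due to premature convergence and balanced routing; second, activating a fixed number of experts per token introduces substantial redundancy. The core of the solution is BanPick, a post-training, plug-and-play strategy that requires neither retraining nor changes to the MoE structure. The Pick module identifies and reinforces the key experts that contribute most to performance, which improves accuracy; the Ban module dynamically prunes redundant experts based on layer and token sensitivity, which speeds up inference. Together they improve accuracy and inference speed with minimal accuracy loss from pruning.
Link: https://arxiv.org/abs/2509.06346
Authors: Yuanteng Chen,Peisong Wang,Yuantian Shao,Jian Cheng
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages, 9 figures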
Abstract:Sparse Mixture-of-Experts (MoE) has become a key architecture for scaling large language models (LLMs) efficiently. Recent fine-grained MoE designs introduce hundreds of experts per layer, with multiple experts activated per token, enabling stronger specialization. However, during pre-training, routers are optimized mainly for stability and robustness: they converge prematurely and enforce balanced usage, limiting the full potential of model performance and efficiency. In this work, we uncover two overlooked issues: (i) a few highly influential experts are underutilized due to premature and balanced routing decisions; and (ii) enforcing a fixed number of active experts per token introduces substantial redundancy. Instead of retraining models or redesigning MoE architectures, we introduce BanPick, a post-training, plug-and-play strategy for smarter MoE routing. Pick discovers and reinforces key experts (a small group with outsized impact on performance), leading to notable accuracy gains across domains. Ban complements this by dynamically pruning redundant experts based on layer and token sensitivity, delivering faster inference with minimal accuracy loss. Experiments on fine-grained MoE-LLMs (DeepSeek, Qwen3) across math, code, and general reasoning benchmarks demonstrate that BanPick delivers free performance gains and inference acceleration without retraining or architectural changes. For instance, on Qwen3-30B-A3B, it improves accuracy from 80.67 to 84.66 on AIME2024 and from 65.66 to 68.18 on GPQA-Diamond, while accelerating inference by 1.25x under the vLLM.
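The sketch below shows one way a post-training routing adjustment of this kind could look for a single token. It is a toy stand-in, not the BanPick algorithm: the additive `boost` for key experts and the fixed `ban_threshold` on renormalized weights are assumptions, whereas the paper describes pruning driven by layer and token sensitivity.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(logits, k, key_experts=(), boost=0.0, ban_threshold=0.0):
    """Select experts and mixing weights for one token.

    Pick-like step (assumed): add `boost` to the router logits of `key_experts`.
    Ban-like step (assumed): after top-k, drop experts whose renormalized
    weight falls below `ban_threshold`, always keeping the strongest one.
    """
    adjusted = [x + (boost if i in key_experts else 0.0) for i, x in enumerate(logits)]
    probs = softmax(adjusted)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    weights = {i: probs[i] / mass for i in top}
    kept = {i: w for i, w in weights.items() if w >= ban_threshold}
    if not kept:
        kept = {top[0]: weights[top[0]]}
    total = sum(kept.values())
    return {i: w / total for i, w in kept.items()}
```

Fewer kept experts per token means fewer expert forward passes, which is where an inference speedup would come from in this simplified picture.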
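[AI-37] Evaluating Multi-Turn Bargain Skills in LLM-Based Seller Agent
[Quick read]: This paper addresses the problem that, in online second-hand marketplaces, Large Language Models (LLMs) acting as seller agents struggle to accurately track and interpret cumulative buyer intents over multi-turn bargaining, which directly affects negotiation outcomes. The key to the solution is a multi-turn evaluation framework grounded in Theory of Mind (ToM): by annotating buyer intents, it quantifies a seller agent's ability to recognize and track intents at the level of individual dialogue turns, moving beyond outcome-only metrics toward finer-grained evaluation of negotiation ability.
Link: https://arxiv.org/abs/2509.06341
Authors: Issue Yishu Wang,Kakam Chong,Xiaofeng Wang,Xu Yan,DeXin Kong,Chen Ju,Ming Chen,Shuai Xiao,Shuguang Han,jufeng chen
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: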
Abstract:In online second-hand marketplaces, multi-turn bargaining is a crucial part of seller-buyer interactions. Large Language Models (LLMs) can act as seller agents, negotiating with buyers on behalf of sellers under given business constraints. A critical ability for such agents is to track and accurately interpret cumulative buyer intents across long negotiations, which directly impacts bargaining effectiveness. We introduce a multi-turn evaluation framework for measuring the bargaining ability of seller agents in e-commerce dialogues. The framework tests whether an agent can extract and track buyer intents. Our contributions are: (1) a large-scale e-commerce bargaining benchmark spanning 622 categories, 9,892 products, and 3,014 tasks; (2) a turn-level evaluation framework grounded in Theory of Mind (ToM) with annotated buyer intents, moving beyond outcome-only metrics; and (3) an automated pipeline that extracts reliable intent from massive dialogue data.
[AI-38] Large Language Models as Virtual Survey Respondents: Evaluating Sociodemographic Response Generation
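[Quick read]: This paper addresses the high cost, long turnaround, and limited scale of traditional questionnaire surveys in social science research and public policymaking. Its core solution is to use Large Language Models (LLMs) to simulate virtual respondents and thereby generate demographically coherent synthetic survey data. The key innovations are two new simulation settings, Partial Attribute Simulation (PAS) and Full Attribute Simulation (FAS), used respectively to predict missing attributes and to generate complete synthetic datasets, together with LLM-S³, a benchmark suite spanning 11 real-world public datasets that systematically evaluates the simulation fidelity of different LLMs under zero-context and context-enhanced conditions. The result is a reproducible, scalable, and cost-effective toolkit for LLM-based survey simulation.
Link: https://arxiv.org/abs/2509.06337
Authors: Jianpeng Zhao,Chenyu Yuan,Weiming Luo,Haoling Xie,Guangwei Zhang,Steven Jige Quan,Zixuan Yuan,Pengyang Wang,Denghui Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: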
Abstract:Questionnaire-based surveys are foundational to social science research and public policymaking, yet traditional survey methods remain costly, time-consuming, and often limited in scale. This paper explores a new paradigm: simulating virtual survey respondents using Large Language Models (LLMs). We introduce two novel simulation settings, namely Partial Attribute Simulation (PAS) and Full Attribute Simulation (FAS), to systematically evaluate the ability of LLMs to generate accurate and demographically coherent responses. In PAS, the model predicts missing attributes based on partial respondent profiles, whereas FAS involves generating complete synthetic datasets under both zero-context and context-enhanced conditions. We curate a comprehensive benchmark suite, LLM-S^3 (Large Language Model-based Sociodemographic Survey Simulation), that spans 11 real-world public datasets across four sociological domains. Our evaluation of multiple mainstream LLMs (GPT-3.5/4 Turbo, LLaMA 3.0/3.1-8B) reveals consistent trends in prediction performance, highlights failure modes, and demonstrates how context and prompt design impact simulation fidelity. This work establishes a rigorous foundation for LLM-driven survey simulations, offering scalable and cost-effective tools for sociological research and policy evaluation. Our code and dataset are available at: this https URL
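[AI-39] A Fragile Number Sense: Probing the Elemental Limits of Numerical Reasoning in LLMs
[Quick read]: The problem this paper sets out to address is that the robustness of the numerical reasoning of Large Language Models (LLMs) is not yet well understood, and existing benchmarks often mask potential weaknesses on basic mathematical tasks. To expose this, the authors design a 100-problem challenge covering four categories, ranging from basic operations to combinatorial puzzles, and systematically evaluate several state-of-the-art LLM-based agents. The key to the approach is a tiered test of escalating complexity that reveals a sharp performance gap between deterministic algorithmic execution (basic arithmetic, advanced operations, and primality checking) and combinatorial tasks that require heuristic search (the Game of 24). The models perform well on the former and consistently fail on the latter, which suggests that their numerical reasoning rests more on pattern matching than on generative problem solving and exposes the limits of current LLMs in creative numerical reasoning.
Link: https://arxiv.org/abs/2509.06332
Authors: Roussel Rahman,Aashwin Ananda Mishra
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: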
Abstract:Large Language Models (LLMs) have demonstrated remarkable emergent capabilities, yet the robustness of their numerical reasoning remains an open question. While standard benchmarks evaluate LLM reasoning on complex problem sets using aggregated metrics, they often obscure foundational weaknesses. In this work, we probe LLM mathematical numeracy by evaluating performance on problems of escalating complexity, from constituent operations to combinatorial puzzles. We test several state-of-the-art LLM-based agents on a 100-problem challenge comprising four categories: (1) basic arithmetic, (2) advanced operations, (3) primality checking, and (4) the Game of 24 number puzzle. Our results show that while the agents achieved high accuracy on the first three categories, which require deterministic algorithmic execution, they consistently failed at the number puzzle, underlining its demand for a heuristic search over a large combinatorial space to be a significant bottleneck. These findings reveal that the agents’ proficiency is largely confined to recalling and executing known algorithms, rather than performing generative problem-solving. This suggests their apparent numerical reasoning is more akin to sophisticated pattern-matching than flexible, analytical thought, limiting their potential for tasks that require novel or creative numerical insights.
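For readers unfamiliar with the Game of 24, the following sketch shows why it is a search problem rather than a fixed procedure: a solver has to try pairs of numbers, operators, and groupings until one combination reaches the target. This brute-force solver is a generic illustration, not the evaluation code used in the paper; exact rational arithmetic via `Fraction` avoids floating-point false negatives.

```python
from fractions import Fraction
from itertools import combinations

def solve24(numbers, target=24):
    """Return True if the numbers can reach `target` using +, -, *, / and any grouping."""
    def search(values):
        if len(values) == 1:
            return values[0] == target
        # Pick any two values, replace them with one result, and recurse.
        for i, j in combinations(range(len(values)), 2):
            a, b = values[i], values[j]
            rest = [v for n, v in enumerate(values) if n not in (i, j)]
            candidates = {a + b, a - b, b - a, a * b}
            if b != 0:
                candidates.add(a / b)
            if a != 0:
                candidates.add(b / a)
            if any(search(rest + [c]) for c in candidates):
                return True
        return False
    return search([Fraction(n) for n in numbers])
```

For example, `solve24([3, 3, 8, 8])` succeeds only through the non-obvious expression 8 / (3 - 8/3), which needs an intermediate fraction.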
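[AI-40] AttestLLM: Efficient Attestation Framework for Billion-scale On-device LLMs
[Quick read]: This paper addresses how to verify, efficiently and securely, the legitimacy of billion-parameter Large Language Models (LLMs) deployed on device. Existing attestation techniques do not scale to LLMs: they hit time and memory bottlenecks and cannot effectively counter emerging threats. The key to the solution is the AttestLLM framework, which uses algorithm/software/hardware co-design to embed robust watermark signatures in the activation distributions of LLM building blocks and optimizes the attestation protocol inside the Trusted Execution Environment (TEE). This provides efficient, reliable verification of model legitimacy without reducing inference throughput, with resilience against model replacement and forgery attacks.
Link: https://arxiv.org/abs/2509.06326
Authors: Ruisi Zhang,Yifei Zhao,Neusha Javidnia,Mengxin Zheng,Farinaz Koushanfar
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: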
Abstract:As on-device LLMs (e.g., Apple on-device Intelligence) are widely adopted to reduce network dependency, improve privacy, and enhance responsiveness, verifying the legitimacy of models running on local devices becomes critical. Existing attestation techniques are not suitable for billion-parameter Large Language Models (LLMs), struggling to remain both time- and memory-efficient while addressing emerging threats in the LLM era. In this paper, we present AttestLLM, the first-of-its-kind attestation framework to protect the hardware-level intellectual property (IP) of device vendors by ensuring that only authorized LLMs can execute on target platforms. AttestLLM leverages an algorithm/software/hardware co-design approach to embed robust watermarking signatures onto the activation distributions of LLM building blocks. It also optimizes the attestation protocol within the Trusted Execution Environment (TEE), providing efficient verification without compromising inference throughput. Extensive proof-of-concept evaluations on LLMs from Llama, Qwen, and Phi families for on-device use cases demonstrate AttestLLM’s attestation reliability, fidelity, and efficiency. Furthermore, AttestLLM enforces model legitimacy and exhibits resilience against model replacement and forgery attacks.
[AI-41] Can AI Make Energy Retrofit Decisions? An Evaluation of Large Language Models
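[Quick read]: The problem this paper sets out to address is that conventional approaches to building energy retrofit decision making generalize poorly across diverse residential contexts and offer little interpretability, which limits adoption in practice. The key to the solution is to use generative AI, in particular Large Language Models (LLMs), to process multi-dimensional contextual information (such as location and building geometry) and produce practitioner-readable recommendations. The study evaluates seven mainstream LLMs under two objectives: a technical objective of maximizing CO₂ reduction and a sociotechnical objective of minimizing the payback period. The results show that LLMs can generate effective recommendations without fine-tuning and perform best on the technical objective. On sociotechnical decisions they are limited by economic trade-offs and local context, agreement across models is low, and their reasoning, although it follows engineering logic, lacks deeper contextual understanding. Accuracy, consistency, and context awareness therefore need further improvement.
Link: https://arxiv.org/abs/2509.06307
Authors: Lei Shu,Dong Zhao
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: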
Abstract:Conventional approaches to building energy retrofit decision making suffer from limited generalizability and low interpretability, hindering adoption in diverse residential contexts. With the growth of Smart and Connected Communities, generative AI, especially large language models (LLMs), may help by processing contextual information and producing practitioner-readable recommendations. We evaluate seven LLMs (ChatGPT, DeepSeek, Gemini, Grok, Llama, and Claude) on residential retrofit decisions under two objectives: maximizing CO2 reduction (technical) and minimizing payback period (sociotechnical). Performance is assessed on four dimensions: accuracy, consistency, sensitivity, and reasoning, using a dataset of 400 homes across 49 US states. LLMs generate effective recommendations in many cases, reaching up to 54.5 percent top-1 match and 92.8 percent within top-5 without fine-tuning. Performance is stronger for the technical objective, while sociotechnical decisions are limited by economic trade-offs and local context. Agreement across models is low, and higher-performing models tend to diverge from others. LLMs are sensitive to location and building geometry but less sensitive to technology and occupant behavior. Most models show step-by-step, engineering-style reasoning, but it is often simplified and lacks deeper contextual awareness. Overall, LLMs are promising assistants for energy retrofit decision making, but improvements in accuracy, consistency, and context handling are needed for reliable practice.
[AI-42] Learning to Walk with Less: a Dyna-Style Approach to Quadrupedal Locomotion
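[Quick read]: This paper addresses the low data efficiency of reinforcement learning (RL) based locomotion controllers for quadrupedal robots, which need extensive environment interaction to reach robust performance. The key to the solution is a model-based reinforcement learning (MBRL) framework that adds synthetic data on top of Proximal Policy Optimization (PPO): a predictive model trained alongside the policy generates short-horizon synthetic transitions, and a schedule based on the number of policy update iterations gradually integrates them into the standard rollouts, following the Dyna-style idea of learning from model-generated experience. The method improves sample efficiency; in simulation it raises policy return and reduces variance, and the gain carries over to tracking a wide range of locomotion commands.
Link: https://arxiv.org/abs/2509.06296
Authors: Francisco Affonso,Felipe Andrade G. Tommaselli,Juliano Negri,Vivian S. Medeiros,Mateus V. Gasparino,Girish Chowdhary,Marcelo Becker
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Under review at IEEE Robotics and Automation Letters. 8 pages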
Abstract:Traditional RL-based locomotion controllers often suffer from low data efficiency, requiring extensive interaction to achieve robust performance. We present a model-based reinforcement learning (MBRL) framework that improves sample efficiency for quadrupedal locomotion by appending synthetic data to the end of standard rollouts in PPO-based controllers, following the Dyna-Style paradigm. A predictive model, trained alongside the policy, generates short-horizon synthetic transitions that are gradually integrated using a scheduling strategy based on the policy update iterations. Through an ablation study, we identified a strong correlation between sample efficiency and rollout length, which guided the design of our experiments. We validated our approach in simulation on the Unitree Go1 robot and showed that replacing part of the simulated steps with synthetic ones not only mimics extended rollouts but also improves policy return and reduces variance. Finally, we demonstrate that this improvement transfers to the ability to track a wide range of locomotion commands using fewer simulated steps.
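A rough sketch of the two mechanisms named in the abstract, scheduling and appending synthetic transitions, might look as follows. The linear ramp, its parameters (`warmup_iters`, `ramp_iters`, `max_synthetic`), and the `(state, action, reward, next_state)` tuple layout are assumptions for illustration; the paper's actual schedule is not given in the abstract.

```python
def synthetic_steps(iteration, warmup_iters, ramp_iters, max_synthetic):
    """Number of synthetic steps to append per rollout at a given policy-update iteration.

    Assumed schedule: none during warm-up, then a linear ramp up to max_synthetic.
    """
    if iteration < warmup_iters:
        return 0
    progress = min(1.0, (iteration - warmup_iters) / ramp_iters)
    return int(round(progress * max_synthetic))

def extend_rollout(real_transitions, model_step, policy, n_synthetic):
    """Append n_synthetic model-generated transitions to the end of a real rollout.

    Each transition is a (state, action, reward, next_state) tuple.
    """
    rollout = list(real_transitions)
    state = rollout[-1][3]  # continue from the last real next_state
    for _ in range(n_synthetic):
        action = policy(state)
        next_state, reward = model_step(state, action)
        rollout.append((state, action, reward, next_state))
        state = next_state
    return rollout
```

Delaying synthetic data until the predictive model has seen some real experience is a common precaution in Dyna-style methods, since early model errors would otherwise feed directly into the policy update.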
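[AI-43] From Implicit Exploration to Structured Reasoning: Leveraging Guideline and Refinement for LLMs
[Quick read]: This paper addresses the instability that arises when Large Language Models (LLMs) rely on implicit exploration during reasoning: without explicit guidance the model follows stochastic, undirected reasoning paths, so errors are hard to correct and past experience goes unused, which hurts consistency and generalization. The key to the solution is to extract structured reasoning patterns from successful trajectories and reflective signals from failures, and to build on them a framework that combines stepwise execution with iterative correction: during inference the model follows the extracted guidelines step by step, and a refinement step after each one corrects errors and stabilizes the reasoning flow. The method improves reasoning stability and cross-task generalization, and matches or surpasses supervised fine-tuning in effectiveness and scalability.
Link: https://arxiv.org/abs/2509.06284
Authors: Jiaxiang Chen,Zhuo Wang,Mingxi Zou,Zhucong Li,Zhijian Zhou,Song Wang,Zenglin Xu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: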
Abstract:Large language models (LLMs) have advanced general-purpose reasoning, showing strong performance across diverse tasks. However, existing methods often rely on implicit exploration, where the model follows stochastic and unguided reasoning paths, like walking without a map. This leads to unstable reasoning paths, lack of error correction, and limited learning from past experience. To address these issues, we propose a framework that shifts from implicit exploration to structured reasoning through guideline and refinement. First, we extract structured reasoning patterns from successful trajectories and reflective signals from failures. During inference, the model follows these guidelines step-by-step, with refinement applied after each step to correct errors and stabilize the reasoning process. Experiments on BBH and four additional benchmarks (GSM8K, MATH-500, MBPP, HumanEval) show that our method consistently outperforms strong baselines across diverse reasoning tasks. Structured reasoning with stepwise execution and refinement improves stability and generalization, while guidelines transfer well across domains and flexibly support cross-model collaboration, matching or surpassing supervised fine-tuning in effectiveness and scalability.
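[AI-44] TableMind: An Autonomous Programmatic Agent for Tool-Augmented Table Reasoning WSDM2026
[Quick read]: This paper addresses the numerical-computation complexity and fine-grained operations that large language models (LLMs) face when handling structured data in table reasoning. Purely text-based methods struggle to carry out complex numerical computations accurately, while tool-integrated methods improve computational precision but often rely on rigid patterns and supervised imitation and lack genuine autonomous adaptability. The key to the solution is TableMind, an LLM-driven table reasoning agent that (i) autonomously performs multi-turn tool invocation; (ii) writes and executes data-analysis code in a secure sandbox for precise numerical reasoning; and (iii) adapts its strategy through planning and self-reflection. To obtain these capabilities the authors use a two-stage fine-tuning paradigm: supervised fine-tuning on high-quality reasoning trajectories establishes effective tool-usage patterns, and reinforcement fine-tuning then optimizes multi-objective strategies. They also introduce Rank-Aware Policy Optimization (RAPO), which increases the update weight of high-quality trajectories whose output probabilities are lower than those of low-quality ones, guiding the model more consistently toward more accurate answers.
Link: https://arxiv.org/abs/2509.06278
Authors: Chuang Jiang(1),Mingyue Cheng(1),Xiaoyu Tao(1),Qingyang Mao(1),Jie Ouyang(1),Qi Liu(1) ((1) State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, China)
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures. Submitted to WSDM 2026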
Abstract:Table reasoning is crucial for leveraging structured data in domains such as finance, healthcare, and scientific research. While large language models (LLMs) show promise in multi-step reasoning, purely text-based methods often struggle with the complex numerical computations and fine-grained operations inherently required in this task. Tool-integrated reasoning improves computational accuracy via explicit code execution, yet existing systems frequently rely on rigid patterns, supervised imitation, and lack true autonomous adaptability. In this paper, we present TableMind, an LLM-driven table reasoning agent that (i) autonomously performs multi-turn tool invocation, (ii) writes and executes data-analyzing code in a secure sandbox environment for data analysis and precise numerical reasoning, and (iii) exhibits high-level capabilities such as planning and self-reflection to adapt strategies. To realize these capabilities, we adopt a two-stage fine-tuning paradigm built on top of a powerful pre-trained language model: supervised fine-tuning on high-quality reasoning trajectories to establish effective tool usage patterns, followed by reinforcement fine-tuning to optimize multi-objective strategies. In particular, we propose Rank-Aware Policy Optimization (RAPO), which increases the update weight of high-quality trajectories when their output probabilities are lower than those of low-quality ones, thereby guiding the model more consistently toward better and more accurate answers. Extensive experiments on several mainstream benchmarks demonstrate that TableMind achieves superior performance compared to competitive baselines, yielding substantial gains in both reasoning accuracy and computational precision.
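The rank-aware idea can be sketched as a per-trajectory weight that grows whenever the policy currently prefers a worse trajectory. This is only an illustration of the sentence in the abstract, not the RAPO formula: the pairwise counting rule, the `beta` scale, and the function name are assumptions.

```python
def rank_aware_weights(rewards, logprobs, beta=1.0):
    """Per-trajectory update weights for one sampled group.

    A trajectory gains weight for each lower-reward trajectory that the policy
    currently assigns a higher log-probability (a "misranked" pair).
    beta is a hypothetical scaling factor.
    """
    n = len(rewards)
    weights = []
    for i in range(n):
        misranked = sum(
            1 for j in range(n)
            if rewards[i] > rewards[j] and logprobs[i] < logprobs[j]
        )
        weights.append(1.0 + beta * misranked / max(n - 1, 1))
    return weights
```

Trajectories that are already ranked correctly keep a weight of 1, so in this sketch the extra update pressure concentrates on the cases where reward and likelihood disagree.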
[AI-45] UrbanMIMOMap: A Ray-Traced MIMO CSI Dataset with Precoding-Aware Maps and Benchmarks
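[Quick read]: This paper addresses the difficulty of generating the high-fidelity radio maps (RMs) needed for environment-aware communication in sixth-generation (6G) systems. In particular, current public datasets are mostly limited to path loss in single-input single-output (SISO) settings and cannot meet the need of multi-input multi-output (MIMO) systems for detailed channel state information (CSI). The key to the solution is UrbanMIMOMap, a large-scale urban MIMO CSI dataset generated with high-precision ray tracing. It provides complex-valued CSI matrices over a dense spatial grid, going well beyond traditional path loss data, and offers large-scale, high-quality training data for machine learning (ML) based high-fidelity RM construction, supporting research on AI-based environment awareness and intelligent communication for 6G.
Link: https://arxiv.org/abs/2509.06270
Authors: Honggang Jia,Xiucheng Wang,Nan Cheng,Ruijin Sun,Changle Li
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to IEEE Global Communications Conference (GLOBECOM) 2025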
Abstract:Sixth generation (6G) systems require environment-aware communication, driven by native artificial intelligence (AI) and integrated sensing and communication (ISAC). Radio maps (RMs), providing spatially continuous channel information, are key enablers. However, generating high-fidelity RM ground truth via electromagnetic (EM) simulations is computationally intensive, motivating machine learning (ML)-based RM construction. The effectiveness of these data-driven methods depends on large-scale, high-quality training data. Current public datasets often focus on single-input single-output (SISO) and limited information, such as path loss, which is insufficient for advanced multi-input multi-output (MIMO) systems requiring detailed channel state information (CSI). To address this gap, this paper presents UrbanMIMOMap, a novel large-scale urban MIMO CSI dataset generated using high-precision ray tracing. UrbanMIMOMap offers comprehensive complex CSI matrices across a dense spatial grid, going beyond traditional path loss data. This rich CSI is vital for constructing high-fidelity RMs and serves as a fundamental resource for data-driven RM generation, including deep learning. We demonstrate the dataset’s utility through baseline performance evaluations of representative ML methods for RM construction. This work provides a crucial dataset and reference for research in high-precision RM generation, MIMO spatial performance, and ML for 6G environment awareness. The code and data for this work are available at: this https URL.
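To show why complex CSI carries more information than path loss, the sketch below computes the received power gain at one grid point for two different precoders. It assumes a single receive antenna (one row of a CSI matrix) and uses textbook maximum-ratio transmission; the function names and the example channel are made up for illustration and are not taken from the dataset.

```python
import math

def beamforming_gain_db(h, w):
    """Received power gain |h^T w|^2 in dB for channel vector h and unit-norm precoder w."""
    y = sum(hi * wi for hi, wi in zip(h, w))
    return 10.0 * math.log10(abs(y) ** 2)

def mrt_precoder(h):
    """Maximum-ratio transmission precoder: w = conj(h) / ||h||."""
    norm = math.sqrt(sum(abs(hi) ** 2 for hi in h))
    return [hi.conjugate() / norm for hi in h]
```

Usage with a hypothetical 4-antenna channel:

```python
h = [1 + 1j, 1 - 1j, 2 + 0j, 0 + 1j]           # assumed channel coefficients
matched = beamforming_gain_db(h, mrt_precoder(h))
uniform = beamforming_gain_db(h, [0.5] * 4)    # phase-unaware precoder
```

The matched precoder attains the full channel norm, while the uniform one loses power to phase misalignment; a path-loss-only map cannot distinguish the two cases.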
[AI-46] REMI: A Novel Causal Schema Memory Architecture for Personalized Lifestyle Recommendation Agents KDD2025
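[Quick read]: This paper addresses the limitations of personalized AI assistants in integrating complex personal data and causal knowledge, which lead to generic advice that lacks explanatory power. The key to the solution is REMI, a Causal Schema Memory (CSM) architecture with three core parts: a personal causal knowledge graph, a causal reasoning engine, and a schema-based planning module. The architecture builds a causal graph of the user's life events and habits, performs goal-directed causal traversals enriched with external knowledge and hypothetical reasoning, and retrieves adaptable plan schemas to generate tailored action plans; a Large Language Model (LLM) orchestrates these components and outputs recommendations with transparent causal explanations. According to the authors, this design makes recommendations more context-aware and better aligned with the user, advancing explainable and trustworthy personalized AI lifestyle assistants.
Link: https://arxiv.org/abs/2509.06269
Authors: Vishal Raman,Vijai Aravindh R,Abhijith Ragav
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 8 pages, 2 figures, Accepted at the OARS Workshop, KDD 2025, Paper link: this https URL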
Abstract:Personalized AI assistants often struggle to incorporate complex personal data and causal knowledge, leading to generic advice that lacks explanatory power. We propose REMI, a Causal Schema Memory architecture for a multimodal lifestyle agent that integrates a personal causal knowledge graph, a causal reasoning engine, and a schema-based planning module. The idea is to deliver explainable, personalized recommendations in domains like fashion, personal wellness, and lifestyle planning. Our architecture uses a personal causal graph of the user’s life events and habits, performs goal-directed causal traversals enriched with external knowledge and hypothetical reasoning, and retrieves adaptable plan schemas to generate tailored action plans. A Large Language Model orchestrates these components, producing answers with transparent causal explanations. We outline the CSM system design and introduce new evaluation metrics for personalization and explainability, including Personalization Salience Score and Causal Reasoning Accuracy, to rigorously assess its performance. Results indicate that CSM-based agents can provide more context-aware, user-aligned recommendations compared to baseline LLM agents. This work demonstrates a novel approach to memory-augmented, causal reasoning in personalized agents, advancing the development of transparent and trustworthy AI lifestyle assistants.
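A goal-directed causal traversal can be pictured as walking a personal causal graph backward from the outcome the user cares about. The sketch below is a plain breadth-first search under assumed data structures (a dict from cause to effects, a `max_depth` cutoff); it does not reproduce REMI's reasoning engine, external-knowledge enrichment, or hypothetical reasoning.

```python
from collections import deque

def causes_of(goal, causal_edges, max_depth=3):
    """Find factors with a causal path to `goal`, with their distance in hops.

    causal_edges maps each cause to the list of its direct effects.
    """
    parents = {}
    for cause, effects in causal_edges.items():
        for effect in effects:
            parents.setdefault(effect, []).append(cause)
    found, queue = {}, deque([(goal, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue
        for cause in parents.get(node, []):
            if cause not in found and cause != goal:
                found[cause] = depth + 1
                queue.append((cause, depth + 1))
    return found
```

The hop count gives a simple handle for explanations: nearer causes can be presented as direct reasons and farther ones as background factors.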
[AI-47] On Synthesis of Timed Regular Expressions
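[Quick read]: This paper tackles the synthesis of timed regular expressions: automatically constructing, from given positive and negative examples of system behavior, a timed regular expression consistent with them, that is, one that accepts every positive example and rejects every negative one. The key to the solution is a two-step strategy. The first step enumerates and prunes candidate parametric timed regular expressions. The second step encodes the consistency requirement as a Satisfiability Modulo Theories (SMT) formula and solves it to determine concrete values for the parametric time constraints, yielding a timed regular expression of minimal length that is consistent with the data.
Link: https://arxiv.org/abs/2509.06262
Authors: Ziran Wang,Jie An,Naijun Zhan,Miaomiao Zhang,Zhenya Zhang
Affiliation: Unknown
Categories: Formal Languages and Automata Theory (cs.FL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 4 figures, 7 tables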
Abstract:Timed regular expressions serve as a formalism for specifying real-time behaviors of Cyber-Physical Systems. In this paper, we consider the synthesis of timed regular expressions, focusing on generating a timed regular expression consistent with a given set of system behaviors including positive and negative examples, i.e., accepting all positive examples and rejecting all negative examples. We first prove the decidability of the synthesis problem through an exploration of simple timed regular expressions. Subsequently, we propose our method of generating a consistent timed regular expression with minimal length, which unfolds in two steps. The first step is to enumerate and prune candidate parametric timed regular expressions. In the second step, we encode the requirement that a candidate generated by the first step is consistent with the given set into a Satisfiability Modulo Theories (SMT) formula, which is consequently solved to determine a solution to parametric time constraints. Finally, we evaluate our approach on benchmarks, including randomly generated behaviors from target timed models and a case study.
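The second step can be illustrated on the simplest possible case: a candidate whose only parameters are the integer bounds l and u of a single duration interval, with each example reduced to the duration it must satisfy. The sketch below replaces the SMT solver with a brute-force search over small integers, which is an explicit simplification; the function name and `max_bound` are assumptions.

```python
def solve_interval(positive_durations, negative_durations, max_bound=20):
    """Find integer bounds (l, u) such that every positive duration lies in [l, u]
    and no negative duration does; return None if no such bounds exist.

    Brute-force enumeration stands in for the SMT solving step.
    """
    for l in range(max_bound + 1):
        for u in range(l, max_bound + 1):
            accepts_pos = all(l <= d <= u for d in positive_durations)
            rejects_neg = not any(l <= d <= u for d in negative_durations)
            if accepts_pos and rejects_neg:
                return (l, u)
    return None
```

When the function returns `None`, the candidate structure itself is inconsistent with the examples, and an enumeration procedure would move on to the next candidate.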
[AI-48] Proof2Silicon: Prompt Repair for Verified Code and Hardware Generation via Reinforcement Learning
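[Quick read]: This paper addresses a core problem of automated code generation with Large Language Models (LLMs): the generated code often fails formal verification, which is essential for correctness in hardware design and safety-critical domains. The key to the solution is Proof2Silicon, an end-to-end synthesis framework whose central element is the embedded PREFACE flow, a previously proposed model-agnostic reinforcement learning (RL) framework. A verifier-driven RL agent iteratively optimizes the prompt so that a frozen LLM generates code that Dafny can formally verify. The verified Dafny programs are then automatically translated into synthesizable high-level C, and the Vivado HLS toolchain produces the RTL implementation, giving an automated, correctness-assured hardware generation flow from natural-language specification to silicon-level design.
Link: https://arxiv.org/abs/2509.06239
Authors: Manvi Jha,Jiaxin Wan,Deming Chen
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: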
Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities in automated code generation but frequently produce code that fails formal verification, an essential requirement for hardware and safety-critical domains. To overcome this fundamental limitation, we previously proposed PREFACE, a model-agnostic framework based on reinforcement learning (RL) that iteratively repairs the prompts provided to frozen LLMs, systematically steering them toward generating formally verifiable Dafny code without costly fine-tuning. This work presents Proof2Silicon, a novel end-to-end synthesis framework that embeds the previously proposed PREFACE flow to enable the generation of correctness-by-construction hardware directly from natural language specifications. Proof2Silicon operates by: (1) leveraging PREFACE’s verifier-driven RL agent to optimize prompt generation iteratively, ensuring Dafny code correctness; (2) automatically translating verified Dafny programs into synthesizable high-level C using Dafny’s Python backend and PyLog; and (3) employing Vivado HLS to produce RTL implementations. Evaluated rigorously on a challenging 100-task benchmark, PREFACE’s RL-guided prompt optimization consistently improved Dafny verification success rates across diverse LLMs by up to 21%. Crucially, Proof2Silicon achieved an end-to-end hardware synthesis success rate of up to 72%, generating RTL designs through Vivado HLS synthesis flows. These results demonstrate a robust, scalable, and automated pipeline for LLM-driven, formally verified hardware synthesis, bridging natural-language specification and silicon realization.
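[AI-49] PillagerBench: Benchmarking LLM-Based Agents in Competitive Minecraft Team Environments ALT
[Quick read]: This paper addresses the fact that the effectiveness of Large Language Model (LLM) based agents in competitive multi-agent environments remains underexplored. Existing work mostly targets cooperative or strategic reasoning tasks and lacks a systematic evaluation of how agents behave and adapt in team-versus-team settings. The authors therefore propose PillagerBench, a framework built on Minecraft that provides a real-time team-vs-team competitive environment with an extensible API, multi-round testing, and rule-based built-in opponents for fair, reproducible evaluation of multi-agent systems. They also design TactiCrafter, an LLM-based multi-agent system that promotes teamwork through human-readable tactics, learns causal dependencies, and adapts to opponent strategies; in the reported evaluation it outperforms baselines and shows adaptive learning through self-play.
Link: https://arxiv.org/abs/2509.06235
Authors: Olivier Schipper,Yudi Zhang,Yali Du,Mykola Pechenizkiy,Meng Fang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: for the source code, see this https URL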
Abstract:LLM-based agents have shown promise in various cooperative and strategic reasoning tasks, but their effectiveness in competitive multi-agent environments remains underexplored. To address this gap, we introduce PillagerBench, a novel framework for evaluating multi-agent systems in real-time competitive team-vs-team scenarios in Minecraft. It provides an extensible API, multi-round testing, and rule-based built-in opponents for fair, reproducible comparisons. We also propose TactiCrafter, an LLM-based multi-agent system that facilitates teamwork through human-readable tactics, learns causal dependencies, and adapts to opponent strategies. Our evaluation demonstrates that TactiCrafter outperforms baseline approaches and showcases adaptive learning through self-play. Additionally, we analyze its learning process and strategic evolution over multiple game episodes. To encourage further research, we have open-sourced PillagerBench, fostering advancements in multi-agent AI for competitive environments.
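[AI-50] Agentic Software Engineering: Foundational Pillars and a Research Roadmap
[Quick read]: This paper addresses the fundamental paradigm shift that software engineering (SE) faces as generative AI drives it toward agentic practice: how to restructure the core elements of SE (actors, processes, tools, and artifacts) so that they support a new kind of collaboration between humans and agents while preserving trustworthiness. The key to the solution is a dual framework of two symbiotic modalities, SE for Humans and SE for Agents, supported by two purpose-built workbenches. The Agent Command Environment (ACE) is where humans orchestrate and oversee agent teams and handle Merge-Readiness Packs (MRPs) and Consultation Request Packs (CRPs). The Agent Execution Environment (AEE) is the digital workspace where agents carry out tasks and request human input when they face ambiguity or complex trade-offs. This bi-directional collaboration gives rise to structured engineering activities (processes) and lifts the practice from agentic coding to true Agentic Software Engineering (SE 3.0). The paper presents this as the Structured Agentic Software Engineering (SASE) vision, with a conceptual framework and research roadmap intended to guide the community toward more disciplined, scalable, and trustworthy SE practice.
Link: https://arxiv.org/abs/2509.06216
Authors: Ahmed E. Hassan,Hao Li,Dayi Lin,Bram Adams,Tse-Hsun Chen,Yutaro Kashiwa,Dong Qiu
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: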
Abstract:Agentic Software Engineering (SE 3.0) represents a new era where intelligent agents are tasked not with simple code generation, but with achieving complex, goal-oriented SE objectives. To harness these new capabilities while ensuring trustworthiness, we must recognize a fundamental duality within the SE field in the Agentic SE era, comprising two symbiotic modalities: SE for Humans and SE for Agents. This duality demands a radical reimagining of the foundational pillars of SE (actors, processes, tools, and artifacts) which manifest differently across each modality. We propose two purpose-built workbenches to support this vision. The Agent Command Environment (ACE) serves as a command center where humans orchestrate and mentor agent teams, handling outputs such as Merge-Readiness Packs (MRPs) and Consultation Request Packs (CRPs). The Agent Execution Environment (AEE) is a digital workspace where agents perform tasks while invoking human expertise when facing ambiguity or complex trade-offs. This bi-directional partnership, which supports agent-initiated human callbacks and handovers, gives rise to new, structured engineering activities (i.e., processes) that redefine human-AI collaboration, elevating the practice from agentic coding to true agentic software engineering. This paper presents the Structured Agentic Software Engineering (SASE) vision, outlining several of the foundational pillars for the future of SE. The paper culminates in a research roadmap that identifies a few key challenges and opportunities while briefly discussing the resulting impact of this future on SE education. Our goal is not to offer a definitive solution, but to provide a conceptual scaffold with structured vocabulary to catalyze a community-wide dialogue, pushing the SE community to think beyond its classic, human-centric tenets toward a disciplined, scalable, and trustworthy agentic future.
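[AI-51] Toward a Metrology for Artificial Intelligence: Hidden-Rule Environments and Reinforcement Learning
[Quick read]: This paper addresses how an agent in the Game Of Hidden Rules (GOHR), a complex, partially observable puzzle environment, can infer the hidden rule and learn the optimal policy at the same time through experience. The key to its approach is to compare two state representation strategies, Feature-Centric (FC) and Object-Centric (OC), trained with a Transformer-based Advantage Actor-Critic (A2C) algorithm, and to analyze transfer effects and the impact of representation on learning efficiency.
Link: https://arxiv.org/abs/2509.06213
Authors: Christo Mathew,Wentian Wang,Lazaros Gallos,Paul Kantor,Vladimir Menkov,Hao Wang
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: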
Abstract:We investigate reinforcement learning in the Game Of Hidden Rules (GOHR) environment, a complex puzzle in which an agent must infer and execute hidden rules to clear a 6 × 6 board by placing game pieces into buckets. We explore two state representation strategies, namely Feature-Centric (FC) and Object-Centric (OC), and employ a Transformer-based Advantage Actor-Critic (A2C) algorithm for training. The agent has access only to partial observations and must simultaneously infer the governing rule and learn the optimal policy through experience. We evaluate our models across multiple rule-based and trial-list-based experimental setups, analyzing transfer effects and the impact of representation on learning efficiency.
[AI-52] Grasp-MPC: Closed-Loop Visual Grasping via Value-Guided Model Predictive Control
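[Quick read]: This paper addresses robust grasping of diverse objects in unstructured environments, in particular the failures of open-loop grasping methods in cluttered scenes caused by grasp prediction errors and object pose changes. The key to its solution is Grasp-MPC, a closed-loop 6-DoF vision-based grasping policy built on model predictive control (MPC): a value function is trained on a large-scale synthetic dataset of 2 million grasp trajectories (both successful and failed), then deployed in an MPC framework alongside cost terms for collision avoidance and smooth execution, enabling reactive grasping of novel objects in cluttered environments.
Link: https://arxiv.org/abs/2509.06201
Authors: Jun Yamada,Adithyavairavan Murali,Ajay Mandlekar,Clemens Eppner,Ingmar Posner,Balakumar Sundaralingam
Affiliations: unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages, 17 figures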
Abstract:Grasping of diverse objects in unstructured environments remains a significant challenge. Open-loop grasping methods, effective in controlled settings, struggle in cluttered environments. Grasp prediction errors and object pose changes during grasping are the main causes of failure. In contrast, closed-loop methods address these challenges in simplified settings (e.g., single object on a table) on a limited set of objects, with no path to generalization. We propose Grasp-MPC, a closed-loop 6-DoF vision-based grasping policy designed for robust and reactive grasping of novel objects in cluttered environments. Grasp-MPC incorporates a value function, trained on visual observations from a large-scale synthetic dataset of 2 million grasp trajectories that include successful and failed attempts. We deploy this learned value function in an MPC framework in combination with other cost terms that encourage collision avoidance and smooth execution. We evaluate Grasp-MPC on FetchBench and real-world settings across diverse environments. Grasp-MPC improves grasp success rates by up to 32.6% in simulation and 33.3% in real-world noisy conditions, outperforming open-loop, diffusion policy, transformer policy, and IQL approaches. Videos and more at this http URL.
[AI-53] AI Governance in Higher Education: A course design exploring regulatory ethical and practical considerations
[Quick read]: This paper addresses the fragmentation of current AI ethics education, which is often siloed by discipline and disconnected from practice, and aims to prepare professionals who can handle the ethical, legal and governance challenges of AI. The key to its solution is a modular, interdisciplinary curriculum that integrates technical foundations with ethics, law and policy, and that uses teaching strategies built around diagnosing risks, navigating regulation and engaging stakeholders, fostering adaptive and ethically grounded professionals for responsible AI governance.
Link: https://arxiv.org/abs/2509.06176
Authors: Zsolt Almási(1),Hannah Bleher(2),Johannes Bleher(3),Rozanne Tuesday Flores(4),Guo Xuanyang(5),Paweł Pujszo(6),Raphaël Weuts(7) ((1) Pázmány Péter Catholic University, Hungary, (2) University of Bonn, Germany, (3) University of Hohenheim, Germany, (4) Bukidnon State University, Philippines, (5) Southwest University of Political Science and Law, China, (6) College of Europe, Natolin, Poland, (7) KU Leuven, Belgium)
Affiliations: unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
Comments:
Abstract:As artificial intelligence (AI) systems permeate critical sectors, the need for professionals who can address ethical, legal and governance challenges has become urgent. Current AI ethics education remains fragmented, often siloed by discipline and disconnected from practice. This paper synthesizes literature and regulatory developments to propose a modular, interdisciplinary curriculum that integrates technical foundations with ethics, law and policy. We highlight recurring operational failures in AI - bias, misspecified objectives, generalization errors, misuse and governance breakdowns - and link them to pedagogical strategies for teaching AI governance. Drawing on perspectives from the EU, China and international frameworks, we outline a semester plan that emphasizes integrated ethics, stakeholder engagement and experiential learning. The curriculum aims to prepare students to diagnose risks, navigate regulation and engage diverse stakeholders, fostering adaptive and ethically grounded professionals for responsible AI governance.
[AI-54] Reasoning Language Model for Personalized Lung Cancer Screening
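[Quick read]: This paper addresses the trade-off between sensitivity and specificity in lung cancer screening risk assessment: the current Lung-RADS framework stratifies risk solely on lung nodule characteristics and does not incorporate other risk factors, which limits individualized risk prediction. The key to its solution is a reasoning language model (RLM) that integrates radiology findings with longitudinal medical records, decomposes the risk evaluation task into sub-components, analyzes the contributions of diverse risk factors, and synthesizes them into a final individualized risk score computed with a data-driven system equation. The chain-of-thought reasoning process improves both predictive accuracy and monitorability, which facilitates clinical translation.
Link: https://arxiv.org/abs/2509.06169
Authors: Chuang Niu,Ge Wang
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: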
Abstract:Accurate risk assessment in lung cancer screening is critical for enabling early cancer detection and minimizing unnecessary invasive procedures. The Lung CT Screening Reporting and Data System (Lung-RADS) has been widely used as the standard framework for patient management and follow-up. Nevertheless, Lung-RADS faces trade-offs between sensitivity and specificity, as it stratifies risk solely based on lung nodule characteristics without incorporating various risk factors. Here we propose a reasoning language model (RLM) to integrate radiology findings with longitudinal medical records for individualized lung cancer risk assessment. Through a systematic study including dataset construction and distillation, supervised fine-tuning, reinforcement learning, and comprehensive evaluation, our model makes significant improvements in risk prediction performance on datasets in the national lung screening trial. Notably, RLM can decompose the risk evaluation task into sub-components, analyze the contributions of diverse risk factors, and synthesize them into a final risk score computed using our data-driven system equation. Our approach improves both predictive accuracy and monitorability through the chain of thought reasoning process, thereby facilitating clinical translation into lung cancer screening.
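[AI-55] Tracking daily paths in home contexts with RSSI fingerprinting based on UWB through deep learning models
[Quick read]: This paper addresses accurate tracking of inhabitants' paths in home environments using wireless signals, in particular the loss of precision that ultra-wideband (UWB) time-of-flight and time-difference-of-arrival estimates suffer from walls and obstacles. The key to its solution is a fingerprinting-based approach that uses received signal strength indicator (RSSI) data, with locations estimated by convolutional neural network (CNN), long short-term memory (LSTM), and hybrid CNN+LSTM models. The experiments report a mean absolute error close to 50 cm, with the hybrid model performing best, which supports its use for daily human activity recognition in residential settings.
Link: https://arxiv.org/abs/2509.06161
Authors: Aurora Polo-Rodríguez,Juan Carlos Valera,Jesús Peral,David Gil,Javier Medina-Quero
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 25 pages, 14 figures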
Abstract:The field of human activity recognition has evolved significantly, driven largely by advancements in Internet of Things (IoT) device technology, particularly in personal devices. This study investigates the use of ultra-wideband (UWB) technology for tracking inhabitant paths in home environments using deep learning models. UWB technology estimates user locations via time-of-flight and time-difference-of-arrival methods, which are significantly affected by the presence of walls and obstacles in real environments, reducing their precision. To address these challenges, we propose a fingerprinting-based approach utilizing received signal strength indicator (RSSI) data collected from inhabitants in two flats (60 m² and 100 m²) while performing daily activities. We compare the performance of convolutional neural network (CNN), long short-term memory (LSTM), and hybrid CNN+LSTM models, as well as the use of Bluetooth technology. Additionally, we evaluate the impact of the type and duration of the temporal window (future, past, or a combination of both). Our results demonstrate a mean absolute error close to 50 cm, highlighting the superiority of the hybrid model in providing accurate location estimates, thus facilitating its application in daily human activity recognition in residential settings.
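The temporal-window comparison in this abstract (past, future, or both) can be pictured with a small sketch. The helper names `make_windows` and `mean_position_error`, the list-of-lists layout for RSSI readings, and the choice of mean Euclidean distance as the error are assumptions for illustration, not the authors' pipeline.

```python
from math import hypot

def make_windows(rssi, past=2, future=0):
    """Group RSSI readings around each time step t.

    rssi: list of per-time-step readings (one value per anchor).
    Returns (windows, idx): each window holds the readings from
    t - past to t + future, and idx lists the center steps t.
    """
    idx = list(range(past, len(rssi) - future))
    windows = [rssi[t - past : t + future + 1] for t in idx]
    return windows, idx

def mean_position_error(pred, true):
    """Mean Euclidean distance between predicted and true (x, y) positions."""
    dists = [hypot(px - tx, py - ty) for (px, py), (tx, ty) in zip(pred, true)]
    return sum(dists) / len(dists)

# Toy data: 6 time steps, 2 anchors (hypothetical values in dBm).
readings = [[-60, -72], [-61, -70], [-63, -69], [-65, -67], [-66, -66], [-68, -64]]
past_only, steps_past = make_windows(readings, past=2, future=0)
both_sides, steps_both = make_windows(readings, past=2, future=1)
```

A past-only window can be used online, while a window that includes future readings only works after the fact; each window would then be fed to the CNN, LSTM, or hybrid model.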
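[AI-56] Teaching Precommitted Agents: Model-Free Policy Evaluation and Control in Quasi-Hyperbolic Discounted MDPs
[Quick read]: This paper addresses how to model, within the reinforcement learning (RL) framework, human and animal decision-making with Quasi-Hyperbolic (QH) discounting preferences. QH discounting captures the tendency to favor smaller-sooner over larger-later rewards, but its integration into RL has lacked both theory and practical algorithms. For precommitted agents the paper makes two contributions: first, it formally characterizes the optimal policy, proving for the first time that it reduces to a simple one-step non-stationary form; second, it designs the first practical, model-free algorithms for both policy evaluation and Q-learning in this setting, both with provable convergence guarantees.
Link: https://arxiv.org/abs/2509.06094
Authors: S.R. Eshwar
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: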
Abstract:Time-inconsistent preferences, where agents favor smaller-sooner over larger-later rewards, are a key feature of human and animal decision-making. Quasi-Hyperbolic (QH) discounting provides a simple yet powerful model for this behavior, but its integration into the reinforcement learning (RL) framework has been limited. This paper addresses key theoretical and algorithmic gaps for precommitted agents with QH preferences. We make two primary contributions: (i) we formally characterize the structure of the optimal policy, proving for the first time that it reduces to a simple one-step non-stationary form; and (ii) we design the first practical, model-free algorithms for both policy evaluation and Q-learning in this setting, both with provable convergence guarantees. Our results provide foundational insights for incorporating QH preferences in RL.
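The time inconsistency mentioned in this abstract can be shown with a short worked example. It uses the standard β-δ form of quasi-hyperbolic discounting (weight 1 for an immediate reward, β·δ^t for a reward t steps ahead); the paper's own notation may differ, and the function names and reward values are hypothetical.

```python
def qh_weight(t, beta=0.5, delta=0.9):
    """Quasi-hyperbolic weight: 1 at t = 0, beta * delta**t for t >= 1."""
    return 1.0 if t == 0 else beta * delta ** t

def value(reward, arrival, now=0, beta=0.5, delta=0.9):
    """Value, judged at time `now`, of `reward` arriving at time `arrival`."""
    return reward * qh_weight(arrival - now, beta, delta)

# Smaller-sooner: 10 at t = 5. Larger-later: 15 at t = 6.
plan_at_0 = (value(10, 5, now=0), value(15, 6, now=0))
plan_at_5 = (value(10, 5, now=5), value(15, 6, now=5))
prefers_later_at_0 = plan_at_0[1] > plan_at_0[0]
prefers_sooner_at_5 = plan_at_5[0] > plan_at_5[1]
```

Judged from t = 0 the larger-later reward wins (about 3.99 against 2.95), but at t = 5 the smaller-sooner reward wins (10 against 6.75). A precommitted agent, in the usual sense of the term, keeps following the plan chosen at t = 0.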
[AI-57] Software Dependencies 2.0: An Empirical Study of Reuse and Integration of Pre-Trained Models in Open-Source Projects
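[Quick read]: This paper addresses how pre-trained models (PTMs), as a new class of software dependency (Software Dependencies 2.0), are integrated and managed in open-source software (OSS) projects, with attention to potential threats to maintainability and reliability. The key to its approach is a mixed-methods analysis of a statistically significant random sample of 401 GitHub repositories: it quantitatively identifies PTM reuse patterns and qualitatively investigates how developers structure, document and integrate these models in practice, covering the stages and organizational patterns of PTM reuse pipelines and the interactions among PTMs and other learned components.
Link: https://arxiv.org/abs/2509.06085
Authors: Jerin Yasmin,Wenxin Jiang,James C. Davis,Yuan Tian
Affiliations: unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Submitted to Empirical Software Engineering (EMSE) Journal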
Abstract:Pre-trained models (PTMs) are machine learning models that have been trained in advance, often on large-scale data, and can be reused for new tasks, thereby reducing the need for costly training from scratch. Their widespread adoption introduces a new class of software dependency, which we term Software Dependencies 2.0, extending beyond conventional libraries to learned behaviors embodied in trained models and their associated artifacts. The integration of PTMs as software dependencies in real projects remains unclear, potentially threatening maintainability and reliability of modern software systems that increasingly rely on them. Objective: In this study, we investigate Software Dependencies 2.0 in open-source software (OSS) projects by examining the reuse of PTMs, with a focus on how developers manage and integrate these models. Specifically, we seek to understand: (1) how OSS projects structure and document their PTM dependencies; (2) what stages and organizational patterns emerge in the reuse pipelines of PTMs within these projects; and (3) the interactions among PTMs and other learned components across pipeline stages. We conduct a mixed-methods analysis of a statistically significant random sample of 401 GitHub repositories from the PeaTMOSS dataset (28,575 repositories reusing PTMs from Hugging Face and PyTorch Hub). We quantitatively examine PTM reuse by identifying patterns and qualitatively investigate how developers integrate and manage these models in practice.
[AI-58] ARIES: Relation Assessment and Model Recommendation for Deep Time Series Forecasting
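[Quick read]: This paper addresses two problems in evaluating and recommending time series forecasting models: existing benchmark datasets lack diverse and well-defined temporal patterns, which hinders systematic analysis of the link between model performance and data properties; and there is no effective model recommendation approach, so testing different architectures costs considerable time and money. The key to its solution is the ARIES framework, which has two parts: first, it constructs a synthetic dataset with multiple distinct patterns and designs a comprehensive system for computing time series properties; second, based on an extensive benchmark of over 50 forecasting models, it establishes the relationship between time series properties and modeling strategies and builds on it the first interpretable recommender of deep forecasting models for real-world time series.
Link: https://arxiv.org/abs/2509.06060
Authors: Fei Wang,Yujie Li,Zezhi Shao,Chengqing Yu,Yisong Fu,Zhulin An,Yongjun Xu,Xueqi Cheng
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: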
Abstract:Recent advancements in deep learning models for time series forecasting have been significant. These models often leverage fundamental time series properties such as seasonality and non-stationarity, which may suggest an intrinsic link between model performance and data properties. However, existing benchmark datasets fail to offer diverse and well-defined temporal patterns, restricting the systematic evaluation of such connections. Additionally, there is no effective model recommendation approach, leading to high time and cost expenditures when testing different architectures across different downstream applications. For those reasons, we propose ARIES, a framework for assessing the relation between time series properties and modeling strategies, and for recommending deep forecasting models for realistic time series. First, we construct a synthetic dataset with multiple distinct patterns, and design a comprehensive system to compute the properties of time series. Next, we conduct an extensive benchmarking of over 50 forecasting models, and establish the relationship between time series properties and modeling strategies. Our experimental results reveal a clear correlation. Based on these findings, we propose the first deep forecasting model recommender, capable of providing interpretable suggestions for real-world time series. In summary, ARIES is the first study to establish the relations between the properties of time series data and modeling strategies, while also implementing a model recommendation system. The code is available at: this https URL.
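[AI-59] PolicyEvolve: Evolving Programmatic Policies by LLMs for multi-player games via Population-Based Training
[Quick read]: This paper addresses two challenges of multi-agent reinforcement learning (MARL) in complex multi-player games: training needs millions of experience samples and substantial computational resources, and the resulting policies lack interpretability, which hinders practical deployment. The key to its solution is PolicyEvolve, a general framework that uses LLMs to generate interpretable programmatic policies (rule-based code), reducing reliance on manually crafted policy code and reaching high-performance policies with minimal environmental interactions. It comprises four modules: the Global Pool preserves elite policies accumulated during iterative training; the Local Pool stores temporary policies for the current iteration and promotes only sufficiently high-performing ones to the Global Pool; the Policy Planner generates and refines policies from environmental information and the top policies in the Global Pool; and the Trajectory Critic analyzes interaction data, identifies vulnerabilities and proposes directional improvements, closing the evolutionary loop.
Link: https://arxiv.org/abs/2509.06053
Authors: Mingrui Lv,Hangzhi Liu,Zhi Luo,Hongjie Zhang,Jie Ou
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: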
Abstract:Multi-agent reinforcement learning (MARL) has achieved significant progress in solving complex multi-player games through self-play. However, training effective adversarial policies requires millions of experience samples and substantial computational resources. Moreover, these policies lack interpretability, hindering their practical deployment. Recently, researchers have successfully leveraged Large Language Models (LLMs) to generate programmatic policies for single-agent tasks, transforming neural network-based policies into interpretable rule-based code with high execution efficiency. Inspired by this, we propose PolicyEvolve, a general framework for generating programmatic policies in multi-player games. PolicyEvolve significantly reduces reliance on manually crafted policy code, achieving high-performance policies with minimal environmental interactions. The framework comprises four modules: Global Pool, Local Pool, Policy Planner, and Trajectory Critic. The Global Pool preserves elite policies accumulated during iterative training. The Local Pool stores temporary policies for the current iteration; only sufficiently high-performing policies from this pool are promoted to the Global Pool. The Policy Planner serves as the core policy generation module. It samples the top three policies from the Global Pool, generates an initial policy for the current iteration based on environmental information, and refines this policy using feedback from the Trajectory Critic. Refined policies are then deposited into the Local Pool. This iterative process continues until the policy achieves a sufficiently high average win rate against the Global Pool, at which point it is integrated into the Global Pool. The Trajectory Critic analyzes interaction data from the current policy, identifies vulnerabilities, and proposes directional improvements to guide the Policy Planner.
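The interplay of the four modules described in this abstract can be sketched as a loop. This is only a reading of the abstract, not the authors' code: the function `evolve`, its parameters, and the toy policies (plain numbers standing in for policy programs, with simple callables standing in for the LLM-backed planner and critic) are all hypothetical.

```python
def evolve(generate, refine, critique, win_rate,
           iterations=3, max_refinements=5, threshold=0.6):
    """Toy evolutionary loop with a Global Pool and a per-iteration Local Pool."""
    global_pool = []                        # elite policies kept across iterations
    for _ in range(iterations):
        local_pool = []                     # temporary policies for this iteration
        top3 = global_pool[-3:]             # stand-in for "top three" selection
        policy = generate(top3)             # Policy Planner: initial policy
        for _ in range(max_refinements):
            local_pool.append(policy)
            if win_rate(policy, global_pool) >= threshold:
                global_pool.append(policy)  # promote to the Global Pool
                break
            feedback = critique(policy, global_pool)  # Trajectory Critic
            policy = refine(policy, feedback)         # Policy Planner: refinement
    return global_pool

# Toy stand-ins: a policy is a strength score, and a stronger score always wins.
def toy_generate(top3):
    return max(top3, default=0.0)

def toy_win_rate(policy, pool):
    return 1.0 if not pool else sum(policy > p for p in pool) / len(pool)

def toy_critique(policy, pool):
    return 1.0                              # "improve by one unit"

def toy_refine(policy, feedback):
    return policy + feedback

pool = evolve(toy_generate, toy_refine, toy_critique, toy_win_rate)
```

In the framework itself the planner and critic are LLM-driven and policies are programs evaluated in a game environment; the sketch keeps only the promote-or-refine control flow.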
[AI-60] Empirical Study of Code Large Language Models for Binary Security Patch Detection
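[Quick read]: This paper addresses binary security patch detection (binary SPD): determining, for closed-source software whose source code is unavailable, whether a released binary patch fixes a security vulnerability. Existing learning-based patch detection approaches depend on source code and typically cannot be applied to the closed-source systems that make up a large share of real-world software. The key to its solution is twofold: first, it constructs a large-scale binary patch dataset of 19,448 samples with two levels of representation, assembly code and pseudo-code; second, it uses fine-tuning to inject binary SPD domain knowledge into code large language models (code LLMs) of varying scales. Experiments show that fine-tuned models perform best on the pseudo-code representation, whereas directly prompting vanilla models struggles, which narrows the gap between code LLMs' low-level code understanding and the binary SPD task.
Link: https://arxiv.org/abs/2509.06052
Authors: Qingyuan Li,Binchang Li,Cuiyun Gao,Shuzheng Gao,Zongjie Li
Affiliations: unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: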
Abstract:Security patch detection (SPD) is crucial for maintaining software security, as unpatched vulnerabilities can lead to severe security risks. In recent years, numerous learning-based SPD approaches have demonstrated promising results on source code. However, these approaches typically cannot be applied to closed-source applications and proprietary systems that constitute a significant portion of real-world software, as they release patches only with binary files, and the source code is inaccessible. Given the impressive performance of code large language models (LLMs) in code intelligence and binary analysis tasks such as decompilation and compilation optimization, their potential for detecting binary security patches remains unexplored, exposing a significant research gap between their demonstrated low-level code understanding capabilities and this critical security task. To address this gap, we construct a large-scale binary patch dataset containing 19,448 samples, with two levels of representation: assembly code and pseudo-code, and systematically evaluate 19 code LLMs of varying scales to investigate their capability in binary SPD tasks. Our initial exploration demonstrates that directly prompting vanilla code LLMs struggles to accurately identify security patches from binary patches, and even state-of-the-art prompting techniques fail to mitigate the lack of domain knowledge in binary SPD within vanilla models. Drawing on the initial findings, we further investigate the fine-tuning strategy for injecting binary SPD domain knowledge into code LLMs through two levels of representation. Experimental results demonstrate that fine-tuned LLMs achieve outstanding performance, with the best results obtained on the pseudo-code representation.
[AI-61] DreamAudio: Customized Text-to-Audio Generation with Diffusion Models
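[Quick read]: This paper addresses the limited fine-grained acoustic control of current text-to-audio generation models: they produce audio that is semantically aligned with the text but cannot precisely control the fine-grained acoustic characteristics of specific sounds, so users struggle to obtain clips that match their needs. The key to its solution is DreamAudio, a framework that enables the model to identify auditory information from user-provided reference concepts: given a few reference audio samples containing personalized audio events, it generates new audio samples that include those events, achieving customized text-to-audio generation (CTTA).
Link: https://arxiv.org/abs/2509.06027
Authors: Yi Yuan,Xubo Liu,Haohe Liu,Xiyuan Kang,Zhuo Chen,Yuxuan Wang,Mark D. Plumbley,Wenwu Wang
Affiliations: unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Demos are available at this https URL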
Abstract:With the development of large-scale diffusion-based and language-modeling-based generative models, impressive progress has been achieved in text-to-audio generation. Despite producing high-quality outputs, existing text-to-audio models mainly aim to generate semantically aligned sound and fall short on precisely controlling fine-grained acoustic characteristics of specific sounds. As a result, users that need specific sound content may find it challenging to generate the desired audio clips. In this paper, we present DreamAudio for customized text-to-audio generation (CTTA). Specifically, we introduce a new framework that is designed to enable the model to identify auditory information from user-provided reference concepts for audio generation. Given a few reference audio samples containing personalized audio events, our system can generate new audio samples that include these specific events. In addition, two types of datasets are developed for training and testing the customized systems. The experiments show that the proposed model, DreamAudio, generates audio samples that are highly consistent with the customized audio features and aligned well with the input text prompts. Furthermore, DreamAudio offers comparable performance in general text-to-audio tasks. We also provide a human-involved dataset containing audio events from real-world CTTA cases as the benchmark for customized generation tasks.
[AI-62] DCMI: A Differential Calibration Membership Inference Attack Against Retrieval-Augmented Generation
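[Quick read]: This paper addresses the privacy risk that membership inference attacks (MIAs) pose to Retrieval-Augmented Generation (RAG) systems handling sensitive data. Existing MIAs typically rely on model responses alone and ignore how non-member retrieved documents interfere with RAG outputs, which limits their effectiveness. The key to its solution is DCMI, a differential calibration MIA that exploits the sensitivity gap between member and non-member retrieved documents under query perturbation: it generates perturbed queries for calibration, isolating the contribution of member-retrieved documents while minimizing interference from non-member-retrieved ones. Experiments show that DCMI consistently outperforms baselines across RAG systems, including real-world platforms such as Dify and MaxKB.
Link: https://arxiv.org/abs/2509.06026
Authors: Xinyu Gao,Xiangtao Meng,Yingkai Dong,Zheng Li,Shanqing Guo
Affiliations: unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: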
Abstract:While Retrieval-Augmented Generation (RAG) effectively reduces hallucinations by integrating external knowledge bases, it introduces vulnerabilities to membership inference attacks (MIAs), particularly in systems handling sensitive data. Existing MIAs targeting RAG’s external databases often rely on model responses but ignore the interference of non-member-retrieved documents on RAG outputs, limiting their effectiveness. To address this, we propose DCMI, a differential calibration MIA that mitigates the negative impact of non-member-retrieved documents. Specifically, DCMI leverages the sensitivity gap between member and non-member retrieved documents under query perturbation. It generates perturbed queries for calibration to isolate the contribution of member-retrieved documents while minimizing the interference from non-member-retrieved documents. Experiments under progressively relaxed assumptions show that DCMI consistently outperforms baselines–for example, achieving 97.42% AUC and 94.35% Accuracy against the RAG system with Flan-T5, exceeding the MBA baseline by over 40%. Furthermore, on real-world RAG platforms such as Dify and MaxKB, DCMI maintains a 10%-20% advantage over the baseline. These results highlight significant privacy risks in RAG systems and emphasize the need for stronger protection mechanisms. We appeal to the community’s consideration of deeper investigations, like ours, against the data leakage risks in rapidly evolving RAG systems. Our code is available at this https URL.
[AI-63] Unified Interaction Foundational Model (UIFM) for Predicting Complex User and System Behavior
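[Quick read]: This paper addresses the limits of current foundation models in understanding complex, evolving event sequences: designed for natural language, they fail to capture the holistic nature of structured interactions in domains such as telecommunications, e-commerce and finance. The core problem is that serializing events into text breaks them into semantically fragmented parts and loses critical context. The key to its solution is the Unified Interaction Foundation Model (UIFM), built on composite tokenization, which treats each multi-attribute event as a single, semantically coherent unit so that the model can learn the underlying "grammar" of user behavior and perceive entire interactions rather than a disconnected stream of data points.
Link: https://arxiv.org/abs/2509.06025
Authors: Vignesh Ethiraj,Subhash Talluri
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: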
Abstract:A central goal of artificial intelligence is to build systems that can understand and predict complex, evolving sequences of events. However, current foundation models, designed for natural language, fail to grasp the holistic nature of structured interactions found in domains like telecommunications, e-commerce and finance. By serializing events into text, they disassemble them into semantically fragmented parts, losing critical context. In this work, we introduce the Unified Interaction Foundation Model (UIFM), a foundation model engineered for genuine behavioral understanding. At its core is the principle of composite tokenization, where each multi-attribute event is treated as a single, semantically coherent unit. This allows UIFM to learn the underlying “grammar” of user behavior, perceiving entire interactions rather than a disconnected stream of data points. We demonstrate that this architecture is not just more accurate, but represents a fundamental step towards creating more adaptable and intelligent predictive systems.
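The contrast this abstract draws between serializing events into text and treating each event as one unit can be pictured with a toy example. The abstract does not say how UIFM builds its composite tokens, so the lookup-table scheme below, the function names and the sample events are assumptions for illustration only.

```python
def serialized_tokens(events):
    """Text-style baseline: every attribute of every event becomes its own token."""
    return [f"{key}={val}" for event in events for key, val in sorted(event.items())]

def composite_tokens(events, vocab):
    """One id per whole event; `vocab` maps attribute tuples to ids and grows on demand."""
    ids = []
    for event in events:
        key = tuple(sorted(event.items()))
        ids.append(vocab.setdefault(key, len(vocab)))
    return ids

# Hypothetical e-commerce interaction log.
events = [
    {"type": "view", "item": "A"},
    {"type": "buy", "item": "A"},
    {"type": "view", "item": "A"},
]
vocab = {}
flat = serialized_tokens(events)
whole = composite_tokens(events, vocab)
```

The flat sequence has six tokens with the attributes of each event scattered across positions, while the composite sequence has three tokens, one per event, and repeated events share an id. A learned model would more plausibly fuse attribute embeddings than use an exact-match table.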
[AI-64] Rethinking Reasoning Quality in Large Language Models through Enhanced Chain-of-Thought via RL
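[Quick read]: This paper addresses the inability of the rule-based reward functions commonly used when training large language models (LLMs) with reinforcement learning (RL) to assess Chain-of-Thought (CoT) quality. Such rewards measure only answer format and correctness, give no signal on whether the CoT actually improves the answer, and offer limited control over logical depth, so they may fail to reveal a model's genuine reasoning capacity. The key to its solution is the Dynamic Reasoning Efficiency Reward (DRER), with two mechanisms: (i) a Reasoning Quality Reward that assigns fine-grained credit to reasoning chains that demonstrably raise the likelihood of the correct answer; (ii) a Dynamic Length Advantage that decays the advantage of responses whose length deviates from a validation-derived threshold, stabilizing training. The abstract reports gains on logical reasoning tasks and generalization to other datasets.
Link: https://arxiv.org/abs/2509.06024
Authors: Haoyang He,Zihua Rong,Kun Ji,Chenyang Li,Qing Huang,Chong Xia,Lan Yang,Honggang Zhang
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: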
Abstract:Reinforcement learning (RL) has recently become the dominant paradigm for strengthening the reasoning abilities of large language models (LLMs). Yet the rule-based reward functions commonly used on mathematical or programming benchmarks assess only answer format and correctness, providing no signal as to whether the induced Chain-of-Thought (CoT) actually improves the answer. Furthermore, such task-specific training offers limited control over logical depth and therefore may fail to reveal a model’s genuine reasoning capacity. We propose Dynamic Reasoning Efficiency Reward (DRER) – a plug-and-play RL reward framework that reshapes both reward and advantage signals. (i) A Reasoning Quality Reward assigns fine-grained credit to those reasoning chains that demonstrably raise the likelihood of the correct answer, directly incentivising the trajectories with beneficial CoT tokens. (ii) A Dynamic Length Advantage decays the advantage of responses whose length deviates from a validation-derived threshold, stabilising training. To facilitate rigorous assessment, we also release Logictree, a dynamically constructed deductive reasoning dataset that functions both as RL training data and as a comprehensive benchmark. Experiments confirm the effectiveness of DRER: our 7B model attains GPT-o3-mini level performance on Logictree with 400 training steps, while the average confidence of CoT-augmented answers rises by 30%. The model further exhibits generalisation across diverse logical-reasoning datasets, and the mathematical benchmark AIME24. These results illuminate how RL shapes CoT behaviour and chart a practical path toward enhancing formal-reasoning skills in large language models. All code and data are available in repository this https URL.
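The two DRER components named in this abstract can be given a rough shape in code. The abstract does not state the formulas, so both functional forms below (a log-likelihood difference and an exponential decay with scale `tau`) are assumptions chosen for illustration, as are the function names and numbers.

```python
from math import exp, log

def reasoning_quality_reward(p_with_cot, p_without_cot):
    """Positive when the chain of thought raises the probability of the correct answer."""
    return log(p_with_cot) - log(p_without_cot)

def length_decayed_advantage(advantage, length, target_length, tau=200.0):
    """Shrink the advantage as the response length moves away from target_length."""
    return advantage * exp(-abs(length - target_length) / tau)

# Hypothetical numbers: the CoT lifts the correct-answer probability from 0.4 to 0.8.
helpful = reasoning_quality_reward(0.8, 0.4)
harmful = reasoning_quality_reward(0.2, 0.4)
on_target = length_decayed_advantage(1.0, length=500, target_length=500)
too_long = length_decayed_advantage(1.0, length=900, target_length=500)
```

Under these assumptions a reasoning chain earns credit only when it makes the correct answer more likely, and an over-long or over-short response contributes a smaller advantage to the policy update.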
[AI-65] Operationalising AI Regulatory Sandboxes under the EU AI Act: The Triple Challenge of Capacity Coordination and Attractiveness to Providers
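[Quick read]: This paper examines the EU AI Act's requirement that Member States establish national AI regulatory sandboxes to foster innovation and assist with compliance. The core challenge is that differing national approaches to sandbox design and capacity-building could jeopardize a uniform interpretation and application of the AI Act and motivate innovators to play "sandbox arbitrage". The key recommendation is that the European Commission and the AI Board act decisively to develop rules and guidance that ensure a cohesive, coordinated approach across national sandboxes. The paper also explores the risk that sandboxes prove unattractive to innovators, citing confidentiality concerns, the inability to relax legal rules during the sandbox, and the inability of sandboxes to deliver a presumption of conformity with the AI Act.
Link: https://arxiv.org/abs/2509.05985
Authors: Deirdre Ahern
Affiliations: unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: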
Abstract:The EU AI Act provides a rulebook for all AI systems being put on the market or into service in the European Union. This article investigates the requirement under the AI Act that Member States establish national AI regulatory sandboxes for testing and validation of innovative AI systems under regulatory supervision to assist with fostering innovation and complying with regulatory requirements. Against the backdrop of the EU objective that AI regulatory sandboxes would both foster innovation and assist with compliance, considerable challenges are identified for Member States around capacity-building and design of regulatory sandboxes. While Member States are early movers in laying the ground for national AI regulatory sandboxes, the article contends that there is a risk that differing approaches being taken by individual national sandboxes could jeopardise a uniform interpretation of the AI Act and its application in practice. This could motivate innovators to play sandbox arbitrage. The article therefore argues that the European Commission and the AI Board need to act decisively in developing rules and guidance to ensure a cohesive, coordinated approach in national AI regulatory sandboxes. With sandbox participation being voluntary, the possibility that AI regulatory sandboxes may prove unattractive to innovators on their compliance journey is also explored. Confidentiality concerns, the inability to relax legal rules during the sandbox, and the inability of sandboxes to deliver a presumption of conformity with the AI Act are identified as pertinent concerns for innovators contemplating applying to AI regulatory sandboxes as compared with other direct compliance routes provided to them through application of harmonised standards and conformity assessment procedures.
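[AI-66] MapAgent: A Hierarchical Agent for Geospatial Reasoning with Dynamic Map Tool Integration
[Quick read]: This paper addresses the weak performance of current agentic AI frameworks on geospatial tasks, especially complex queries that need spatial reasoning, multi-hop planning and real-time map interaction, where flat agent designs treat tools uniformly and do not separate planning from execution. The key to its solution is MapAgent, a hierarchical multi-agent plug-and-play framework that decouples high-level planning from low-level execution: a high-level planner decomposes complex queries into subgoals and routes them to specialized modules, and for tool-heavy map-based modules a dedicated map-tool agent adaptively orchestrates similar but subtly different geospatial APIs in parallel, which reduces cognitive load, improves tool selection accuracy and enables precise coordination across similar APIs.
Link: https://arxiv.org/abs/2509.05933
Authors: Md Hasebul Hasan,Mahir Labib Dihan,Mohammed Eunus Ali,Md Rizwan Parvez
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 27 pages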
Abstract:Agentic AI has significantly extended the capabilities of large language models (LLMs) by enabling complex reasoning and tool use. However, most existing frameworks are tailored to domains such as mathematics, coding, or web automation, and fall short on geospatial tasks that require spatial reasoning, multi-hop planning, and real-time map interaction. To address these challenges, we introduce MapAgent, a hierarchical multi-agent plug-and-play framework with customized toolsets and agentic scaffolds for map-integrated geospatial reasoning. Unlike existing flat agent-based approaches that treat tools uniformly, often overwhelming the LLM when handling similar but subtly different geospatial APIs, MapAgent decouples planning from execution. A high-level planner decomposes complex queries into subgoals, which are routed to specialized modules. For tool-heavy modules, such as map-based services, we then design a dedicated map-tool agent that efficiently orchestrates related APIs adaptively in parallel to effectively fetch geospatial data relevant for the query, while simpler modules (e.g., solution generation or answer extraction) operate without additional agent overhead. This hierarchical design reduces cognitive load, improves tool selection accuracy, and enables precise coordination across similar APIs. We evaluate MapAgent on four diverse geospatial benchmarks (MapEval-Textual, MapEval-API, MapEval-Visual, and MapQA) and demonstrate substantial gains over state-of-the-art tool-augmented and agentic baselines. We open-source our framework at this https URL.
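The split between planning and execution described in this abstract might look roughly like the sketch below. It is not the MapAgent code: `answer`, `run_map_tools`, the module names and the stub tools are hypothetical, and plain Python callables stand in for the LLM planner and for real map APIs.

```python
from concurrent.futures import ThreadPoolExecutor

def run_map_tools(tools, query):
    """Map-tool agent stand-in: call every tool with the query in parallel."""
    with ThreadPoolExecutor(max_workers=max(1, len(tools))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in tools.items()}
        return {name: future.result() for name, future in futures.items()}

def answer(query, plan, modules):
    """Planner stand-in: plan(query) yields (module_name, subgoal) pairs to execute."""
    return [modules[name](subgoal) for name, subgoal in plan(query)]

# Stub tools and modules (hypothetical; real tools would call geospatial services).
tools = {
    "geocode": lambda q: f"coords({q})",
    "places": lambda q: f"nearby({q})",
}
modules = {
    "map": lambda subgoal: run_map_tools(tools, subgoal),
    "answer": lambda subgoal: subgoal.upper(),
}

def toy_plan(query):
    return [("map", "find cafe"), ("answer", "summarize")]

results = answer("Where is the closest cafe?", toy_plan, modules)
```

Only the tool-heavy map module pays for parallel orchestration here, while the simple answer module is a direct call, mirroring the abstract's point that simpler modules run without extra agent overhead.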
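[AI-67] Multimodal Prompt Injection Attacks: Risks and Defenses for Modern LLMs
【Quick Read】: This paper systematically evaluates the security risks that large language models (LLMs) face in real-world deployment, in particular their vulnerability to external prompt injection. The authors test eight commercial LLMs without supplementary input sanitization, relying only on built-in safeguards, and find exploitable weaknesses. The key takeaway is that weak points in the defenses need reinforcing, with additional measures such as input normalization still required for reliable protection.
Link: https://arxiv.org/abs/2509.05883
Authors: Andrew Yeo,Daeseon Choi
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 8 pages, 4 figures, 2 tables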
Abstract:Large Language Models (LLMs) have seen rapid adoption in recent years, with industries increasingly relying on them to maintain a competitive advantage. These models excel at interpreting user instructions and generating human-like responses, leading to their integration across diverse domains, including consulting and information retrieval. However, their widespread deployment also introduces substantial security risks, most notably in the form of prompt injection and jailbreak attacks. To systematically evaluate LLM vulnerabilities – particularly to external prompt injection – we conducted a series of experiments on eight commercial models. Each model was tested without supplementary sanitization, relying solely on its built-in safeguards. The results exposed exploitable weaknesses and emphasized the need for stronger security measures. Four categories of attacks were examined: direct injection, indirect (external) injection, image-based injection, and prompt leakage. Comparative analysis indicated that Claude 3 demonstrated relatively greater robustness; nevertheless, empirical findings confirm that additional defenses, such as input normalization, remain necessary to achieve reliable protection.
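The abstract names input normalization as a defense but does not say what it involves. The sketch below shows one plausible reading: Unicode NFKC folding plus removal of invisible characters before text reaches the model. The function name `normalize_input` and the list of stripped characters are assumptions for illustration, not the paper's method.

```python
import re
import unicodedata

# Zero-width and bidi control characters sometimes used to hide text (assumed list).
_INVISIBLE = dict.fromkeys(
    map(ord, "\u200b\u200c\u200d\u2060\ufeff\u202a\u202b\u202c\u202d\u202e")
)

def normalize_input(text: str) -> str:
    """Fold look-alike characters and drop invisible ones before prompting."""
    # NFKC maps compatibility forms (e.g. fullwidth letters) to canonical ones.
    text = unicodedata.normalize("NFKC", text)
    # Delete the invisible characters listed above.
    text = text.translate(_INVISIBLE)
    # Drop remaining control characters, keeping newlines and tabs.
    text = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of spaces and tabs.
    return re.sub(r"[ \t]+", " ", text).strip()
```

Normalization of this kind only makes obfuscated instructions easier for downstream filters to see; it does not by itself decide whether an instruction is malicious.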
[AI-68] GeoAnalystBench: A GeoAI benchmark for assessing large language models for spatial analysis workflow and code generation
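【Quick Read】: This paper addresses the uncertainty about what large language models (LLMs) can actually do in automating geographic information systems (GIS), in particular the lack of systematic evaluation of their effectiveness and limitations on real geospatial analysis tasks. The key to its solution is GeoAnalystBench, a benchmark of 50 Python-based tasks derived from real-world geospatial problems, carefully validated by GIS experts, with a minimum deliverable product defined for each task. The benchmark scores LLMs on workflow validity, structural alignment, semantic similarity, and code quality (CodeBLEU). It reveals a clear gap between proprietary models (such as ChatGPT-4o-mini) and smaller open-source models (such as DeepSeek-R1-7B), shows that tasks requiring deeper spatial reasoning remain the hardest for all models, and provides a reproducible evaluation framework for GeoAI research with human-in-the-loop support.
Link: https://arxiv.org/abs/2509.05881
Authors: Qianheng Zhang,Song Gao,Chen Wei,Yibo Zhao,Ying Nie,Ziru Chen,Shijie Chen,Yu Su,Huan Sun
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 34 pages, 8 figures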
Abstract:Recent advances in large language models (LLMs) have fueled growing interest in automating geospatial analysis and GIS workflows, yet their actual capabilities remain uncertain. In this work, we call for rigorous evaluation of LLMs on well-defined geoprocessing tasks before making claims about full GIS automation. To this end, we present GeoAnalystBench, a benchmark of 50 Python-based tasks derived from real-world geospatial problems and carefully validated by GIS experts. Each task is paired with a minimum deliverable product, and evaluation covers workflow validity, structural alignment, semantic similarity, and code quality (CodeBLEU). Using this benchmark, we assess both proprietary and open-source models. Results reveal a clear gap: proprietary models such as ChatGPT-4o-mini achieve high validity (95%) and stronger code alignment (CodeBLEU 0.39), while smaller open-source models like DeepSeek-R1-7B often generate incomplete or inconsistent workflows (48.5% validity, 0.272 CodeBLEU). Tasks requiring deeper spatial reasoning, such as spatial relationship detection or optimal site selection, remain the most challenging across all models. These findings demonstrate both the promise and limitations of current LLMs in GIS automation and provide a reproducible framework to advance GeoAI research with human-in-the-loop support.
[AI-69] Learning to Construct Knowledge through Sparse Reference Selection with Reinforcement Learning
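【Quick Read】: This paper addresses the difficulty researchers face in acquiring new knowledge efficiently as the scientific literature grows rapidly, especially in specialized domains where full-text access is restricted and target references are sparse among a large candidate set. The key to its solution is a deep reinforcement learning framework for sparse reference selection that emulates human knowledge construction: under limited time and cost, it decides which papers to read first, so that knowledge can be built from partial information such as titles and abstracts.
Link: https://arxiv.org/abs/2509.05874
Authors: Shao-An Yin
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 8 pages, 2 figures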
Abstract:The rapid expansion of scientific literature makes it increasingly difficult to acquire new knowledge, particularly in specialized domains where reasoning is complex, full-text access is restricted, and target references are sparse among a large set of candidates. We present a Deep Reinforcement Learning framework for sparse reference selection that emulates human knowledge construction, prioritizing which papers to read under limited time and cost. Evaluated on drug–gene relation discovery with access restricted to titles and abstracts, our approach demonstrates that both humans and machines can construct knowledge effectively from partial information.
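[AI-70] Decoding Latent Attack Surfaces in LLMs: Prompt Injection via HTML in Web Summarization
【Quick Read】: This paper addresses prompt injection attacks on LLMs that summarize web content, where adversarial instructions are embedded in non-visible HTML elements (such as meta, aria-label, and alt attributes). Its key contribution is a new dataset of 280 static web pages, split evenly between clean and injected versions, together with a browser automation pipeline that extracts both raw HTML and rendered text to mimic real-world LLM deployment. On this basis, the paper uses ROUGE-L, SBERT cosine similarity, and manual annotation to quantify how hidden injections change the summaries produced by Llama 4 Scout and Gemma 9B IT, exposing a significant and hard-to-notice risk in practical systems and providing a reproducible benchmark for work on robustness and defenses.
Link: https://arxiv.org/abs/2509.05831
Authors: Ishaan Verma
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: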
Abstract:Large Language Models (LLMs) are increasingly integrated into web-based systems for content summarization, yet their susceptibility to prompt injection attacks remains a pressing concern. In this study, we explore how non-visible HTML elements such as meta, aria-label, and alt attributes can be exploited to embed adversarial instructions without altering the visible content of a webpage. We introduce a novel dataset comprising 280 static web pages, evenly divided between clean and adversarial injected versions, crafted using diverse HTML-based strategies. These pages are processed through a browser automation pipeline to extract both raw HTML and rendered text, closely mimicking real-world LLM deployment scenarios. We evaluate two state-of-the-art open-source models, Llama 4 Scout (Meta) and Gemma 9B IT (Google), on their ability to summarize this content. Using both lexical (ROUGE-L) and semantic (SBERT cosine similarity) metrics, along with manual annotations, we assess the impact of these covert injections. Our findings reveal that over 29% of injected samples led to noticeable changes in the Llama 4 Scout summaries, while Gemma 9B IT showed a lower, yet non-trivial, success rate of 15%. These results highlight a critical and largely overlooked vulnerability in LLM driven web pipelines, where hidden adversarial content can subtly manipulate model outputs. Our work offers a reproducible framework and benchmark for evaluating HTML-based prompt injection and underscores the urgent need for robust mitigation strategies in LLM applications involving web content.
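To make the attack surface concrete, the sketch below lists text carried by the three attribute types the abstract mentions (meta content, aria-label, alt), none of which appears in the rendered page. It uses only Python's standard `html.parser`; the class and function names are hypothetical, and this is not the paper's browser automation pipeline.

```python
from html.parser import HTMLParser

class HiddenTextCollector(HTMLParser):
    """Collect text from attributes that a rendered page does not display."""

    def __init__(self):
        super().__init__()
        self.hidden = []  # list of (source, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # <meta content="..."> is read by crawlers and tools, not shown to users.
        if tag == "meta" and attrs.get("content"):
            self.hidden.append(("meta", attrs["content"]))
        # Accessibility and fallback text can carry arbitrary strings.
        for name in ("aria-label", "alt"):
            if attrs.get(name):
                self.hidden.append((name, attrs[name]))

def hidden_text(html):
    """Return all hidden attribute text found in an HTML string."""
    parser = HiddenTextCollector()
    parser.feed(html)
    parser.close()
    return parser.hidden
```

A summarization pipeline that feeds raw HTML to a model exposes all of these strings to it, while a user looking at the page sees none of them.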
[AI-71] Chatbot To Help Patients Understand Their Health EMNLP2025
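【Quick Read】: This paper addresses the problem that patients cannot participate fully in their care without sufficient understanding; the core challenge is to improve patients' grasp of clinical information through natural-language interaction. The key to its solution is NoteAid-Chatbot, a conversational AI built on a 'learning as conversation' framework that combines a multi-agent large language model (LLM) setup with reinforcement learning (RL) and needs no human-labeled data. Training has two stages: supervised fine-tuning on synthetically generated medical conversations, followed by Proximal Policy Optimization (PPO) with rewards derived from patient understanding assessments in simulated hospital discharge scenarios. Experiments show that key educational behaviors such as clarity, relevance, and structured dialogue emerge without explicit supervision, and that the chatbot surpasses non-expert humans in a Turing test, suggesting that lightweight RL can align open-ended medical dialogue systems.
Link: https://arxiv.org/abs/2509.05818
Authors: Won Seok Jang,Hieu Tran,Manav Mistry,SaiKiran Gandluri,Yifan Zhang,Sharmin Sultana,Sunjae Kown,Yuan Zhang,Zonghai Yao,Hong Yu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted in EMNLP 2025 Findings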
Abstract:Patients must possess the knowledge necessary to actively participate in their care. We present NoteAid-Chatbot, a conversational AI that promotes patient understanding via a novel ‘learning as conversation’ framework, built on a multi-agent large language model (LLM) and reinforcement learning (RL) setup without human-labeled data. NoteAid-Chatbot was built on a lightweight LLaMA 3.2 3B model trained in two stages: initial supervised fine-tuning on conversational data synthetically generated using medical conversation strategies, followed by RL with rewards derived from patient understanding assessments in simulated hospital discharge scenarios. Our evaluation, which includes comprehensive human-aligned assessments and case studies, demonstrates that NoteAid-Chatbot exhibits key emergent behaviors critical for patient education, such as clarity, relevance, and structured dialogue, even though it received no explicit supervision for these attributes. Our results show that even simple Proximal Policy Optimization (PPO)-based reward modeling can successfully train lightweight, domain-specific chatbots to handle multi-turn interactions, incorporate diverse educational strategies, and meet nuanced communication objectives. Our Turing test demonstrates that NoteAid-Chatbot surpasses non-expert humans. Although our current focus is on healthcare, the framework we present illustrates the feasibility and promise of applying low-cost, PPO-based RL to realistic, open-ended conversational domains, broadening the applicability of RL-based alignment methods.
[AI-72] time2time: Causal Intervention in Hidden States to Simulate Rare Events in Time Series Foundation Models
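【Quick Read】: This paper addresses two core questions: whether transformer-based time series foundation models (TSFMs) truly internalize semantic concepts such as market regimes rather than merely fitting curves, and whether their internal representations can be used to simulate rare, high-stakes events such as market crashes. The key to its solution is activation transplantation, a causal intervention that, during the forward pass, imposes the statistical moments of one event (e.g., a historical crash) onto the hidden states of another (e.g., a calm period), deterministically steering the forecast. Experiments show that injecting crash semantics induces downturn predictions while injecting calm semantics suppresses crashes and restores stability, and that the norm of the latent vector correlates directly with the magnitude of systemic shocks, indicating steerable and graded representations.
Link: https://arxiv.org/abs/2509.05801
Authors: Debdeep Sanyal,Aaryan Nagpal,Dhruv Kumar,Murari Mandal,Saurabh Deshpande
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: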
Abstract:While transformer-based foundation models excel at forecasting routine patterns, two questions remain: do they internalize semantic concepts such as market regimes, or merely fit curves? And can their internal representations be leveraged to simulate rare, high-stakes events such as market crashes? To investigate this, we introduce activation transplantation, a causal intervention that manipulates hidden states by imposing the statistical moments of one event (e.g., a historical crash) onto another (e.g., a calm period) during the forward pass. This procedure deterministically steers forecasts: injecting crash semantics induces downturn predictions, while injecting calm semantics suppresses crashes and restores stability. Beyond binary control, we find that models encode a graded notion of event severity, with the latent vector norm directly correlating with the magnitude of systemic shocks. Validated across two architecturally distinct TSFMs, Toto (decoder only) and Chronos (encoder-decoder), our results demonstrate that steerable, semantically grounded representations are a robust property of large time series transformers. Our findings provide evidence for a latent concept space that governs model predictions, shifting interpretability from post-hoc attribution to direct causal intervention, and enabling semantic “what-if” analysis for strategic stress-testing.
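The core operation, imposing one event's statistical moments onto another's hidden state, can be sketched as simple moment matching. This is a minimal illustration under stated assumptions: it matches only the mean and standard deviation, treats a hidden state as a flat list of floats, and uses a hypothetical function name; the abstract does not specify which moments, layers, or axes the authors use.

```python
from statistics import fmean, pstdev

def transplant_moments(target, donor, eps=1e-8):
    """Rescale `target` so its mean and std match those of `donor`.

    target: hidden-state values of the event being edited (e.g. a calm period)
    donor:  hidden-state values of the event being imposed (e.g. a crash)
    """
    mu_t, sd_t = fmean(target), pstdev(target)
    mu_d, sd_d = fmean(donor), pstdev(donor)
    # Standardize the target, then re-scale and shift to the donor's moments.
    return [(x - mu_t) / (sd_t + eps) * sd_d + mu_d for x in target]
```

The relative ordering of values within the target is preserved; only its location and scale move toward the donor, which is one simple way to read "injecting" one regime's statistics into another.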
[AI-73] DCV-ROOD Evaluation Framework: Dual Cross-Validation for Robust Out-of-Distribution Detection
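【Quick Read】: This paper addresses the difficulty of reliably evaluating out-of-distribution (OOD) detection methods for AI systems, that is, how to measure an OOD detector's performance accurately across diverse scenarios without evaluation bias distorting the picture. The key to its solution is a dual cross-validation framework, Dual Cross-Validation for Robust Out-of-Distribution Detection (DCV-ROOD): in-distribution (ID) data are partitioned in the conventional way, whereas OOD data are split by grouping samples according to their classes. For data with a class hierarchy, the splitting takes the entire hierarchy into account to obtain fair ID-OOD partitions, yielding a more robust and trustworthy estimate of OOD detection performance.
Link: https://arxiv.org/abs/2509.05778
Authors: Arantxa Urrea-Castaño,Nicolás Segura-Kunsagi,Juan Luis Suárez-Díaz,Rosana Montes,Francisco Herrera
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 20 pages and appendix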
Abstract:Out-of-distribution (OOD) detection plays a key role in enhancing the robustness of artificial intelligence systems by identifying inputs that differ significantly from the training distribution, thereby preventing unreliable predictions and enabling appropriate fallback mechanisms. Developing reliable OOD detection methods is a significant challenge, and rigorous evaluation of these techniques is essential for ensuring their effectiveness, as it allows researchers to assess their performance under diverse conditions and to identify potential limitations or failure modes. Cross-validation (CV) has proven to be a highly effective tool for providing a reasonable estimate of the performance of a learning algorithm. Although OOD scenarios exhibit particular characteristics, an appropriate adaptation of CV can lead to a suitable evaluation framework for this setting. This work proposes a dual CV framework for robust evaluation of OOD detection models, aimed at improving the reliability of their assessment. The proposed evaluation framework aims to effectively integrate in-distribution (ID) and OOD data while accounting for their differing characteristics. To achieve this, ID data are partitioned using a conventional approach, whereas OOD data are divided by grouping samples based on their classes. Furthermore, we analyze the context of data with class hierarchy to propose a data splitting that considers the entire class hierarchy to obtain fair ID-OOD partitions to apply the proposed evaluation framework. This framework is called Dual Cross-Validation for Robust Out-of-Distribution Detection (DCV-ROOD). To test the validity of the evaluation framework, we selected a set of state-of-the-art OOD detection methods, both with and without outlier exposure. The results show that the method achieves very fast convergence to the true performance.
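The two partitioning rules in the abstract (conventional splitting for ID data, class-grouped splitting for OOD data) can be sketched as follows. This is an assumed reading for illustration only: the round-robin assignment, the function name `dual_cv_folds`, and the omission of the class-hierarchy handling are simplifications, not the DCV-ROOD procedure itself.

```python
def dual_cv_folds(id_samples, ood_by_class, k):
    """Yield k (id_test, ood_test) pairs.

    id_samples:   list of in-distribution samples, split sample by sample
    ood_by_class: dict mapping an OOD class name to its samples; each class
                  is assigned to exactly one fold as a whole group
    """
    ood_classes = sorted(ood_by_class)  # fixed order for reproducibility
    for fold in range(k):
        # Conventional k-fold: every k-th ID sample goes to this fold.
        id_test = [s for i, s in enumerate(id_samples) if i % k == fold]
        # Grouped k-fold: whole OOD classes go to this fold.
        ood_test = [
            s
            for j, c in enumerate(ood_classes)
            if j % k == fold
            for s in ood_by_class[c]
        ]
        yield id_test, ood_test
```

Keeping each OOD class inside a single fold means a detector is always tested on OOD classes it did not see during any exposure or tuning step for that fold.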
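[AI-74] Decision-Focused Learning Enhanced by Automated Feature Engineering for Energy Storage Optimisation
【Quick Read】: This paper addresses decision-making under uncertainty for Battery Energy Storage Systems (BESS), in particular the limitation of conventional Predict-Then-Optimise (PTO) approaches, where prediction errors cascade into suboptimal decisions. Its core solution is a Decision-Focused Learning (DFL) framework combined with Automated Feature Engineering (AFE): prediction and optimisation are integrated so that training targets the downstream objective (such as lower BESS operating costs), and AFE extracts richer feature representations from limited data to improve performance on small datasets. Experiments on a real-world UK property dataset show that, on average, DFL yields lower operating costs than PTO, and that adding AFE improves DFL methods by a further 22.9-56.5%, supporting the practical viability and economic benefit of the approach in real-world BESS applications.
Link: https://arxiv.org/abs/2509.05772
Authors: Nasser Alkhulaifi,Ismail Gokay Dogan,Timothy R. Cargan,Alexander L. Bowler,Direnc Pekaslan,Nicholas J. Watson,Isaac Triguero
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 22 pages, 10 figures, journal-based paper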
Abstract:Decision-making under uncertainty in energy management is complicated by unknown parameters hindering optimal strategies, particularly in Battery Energy Storage System (BESS) operations. Predict-Then-Optimise (PTO) approaches treat forecasting and optimisation as separate processes, allowing prediction errors to cascade into suboptimal decisions as models minimise forecasting errors rather than optimising downstream tasks. The emerging Decision-Focused Learning (DFL) methods overcome this limitation by integrating prediction and optimisation; however, they are relatively new and have been tested primarily on synthetic datasets or small-scale problems, with limited evidence of their practical viability. Real-world BESS applications present additional challenges, including greater variability and data scarcity due to collection constraints and operational limitations. Because of these challenges, this work leverages Automated Feature Engineering (AFE) to extract richer representations and improve the nascent approach of DFL. We propose an AFE-DFL framework suitable for small datasets that forecasts electricity prices and demand while optimising BESS operations to minimise costs. We validate its effectiveness on a novel real-world UK property dataset. The evaluation compares DFL methods against PTO, with and without AFE. The results show that, on average, DFL yields lower operating costs than PTO and adding AFE further improves the performance of DFL methods by 22.9-56.5% compared to the same models without AFE. These findings provide empirical evidence for DFL’s practical viability in real-world settings, indicating that domain-specific AFE enhances DFL and reduces reliance on domain expertise for BESS optimisation, yielding economic benefits with broader implications for energy management systems facing similar challenges.
[AI-75] Real-E: A Foundation Benchmark for Advancing Robust and Generalizable Electricity Forecasting CIKM2025
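【Quick Read】: This paper addresses the limited spatial and temporal scope of existing energy forecasting benchmarks and their lack of multi-energy features, which raise concerns about reliability and applicability in real-world deployment. The key to its solution is the Real-E dataset, which covers over 74 power stations across 30+ European countries over a 10-year span with rich metadata. The paper also introduces a new metric that quantifies shifts in correlation structures and shows that existing methods struggle under the dataset's complex, non-stationary correlation dynamics, giving an empirical basis for building more robust forecasting models.
Link: https://arxiv.org/abs/2509.05768
Authors: Chen Shao,Yue Wang,Zhenyi Zhu,Zhanbo Huang,Sebastian Pütz,Benjamin Schäfer,Tobais Käfer,Michael Färber
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 4 pages, CIKM 2025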
Abstract:Energy forecasting is vital for grid reliability and operational efficiency. Although recent advances in time series forecasting have led to progress, existing benchmarks remain limited in spatial and temporal scope and lack multi-energy features. This raises concerns about their reliability and applicability in real-world deployment. To address this, we present the Real-E dataset, covering over 74 power stations across 30+ European countries over a 10-year span with rich metadata. Using Real-E, we conduct an extensive data analysis and benchmark over 20 baselines across various model types. We introduce a new metric to quantify shifts in correlation structures and show that existing methods struggle on our dataset, which exhibits more complex and non-stationary correlation dynamics. Our findings highlight key limitations of current methods and offer a strong empirical basis for building more robust forecasting models.
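[AI-76] DRF: LLM-AGENT Dynamic Reputation Filtering Framework ICONIP2025
【Quick Read】: This paper addresses the lack of mechanisms for quantifying agent performance and assessing agent credibility in multi-agent systems that use large language models (LLMs) for complex tasks. The key to its solution is the Dynamic Reputation Filtering (DRF) framework, which constructs an interactive rating network to quantify agent performance, designs a reputation scoring mechanism to measure agent honesty and capability, and integrates an Upper Confidence Bound (UCB)-based strategy to make agent selection more efficient. Experiments show that DRF significantly improves task completion quality and collaboration efficiency on logical reasoning and code generation tasks.
Link: https://arxiv.org/abs/2509.05764
Authors: Yuwei Lou,Hao Hu,Shaocong Ma,Zongfei Zhang,Liang Wang,Jidong Ge,Xianping Tao
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: This paper has been accepted by ICONIP 2025 but not yet published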
Abstract:With the evolution of generative AI, multi-agent systems leveraging large language models (LLMs) have emerged as a powerful tool for complex tasks. However, these systems face challenges in quantifying agent performance and lack mechanisms to assess agent credibility. To address these issues, we introduce DRF, a dynamic reputation filtering framework. DRF constructs an interactive rating network to quantify agent performance, designs a reputation scoring mechanism to measure agent honesty and capability, and integrates an Upper Confidence Bound-based strategy to enhance agent selection efficiency. Experiments show that DRF significantly improves task completion quality and collaboration efficiency in logical reasoning and code-generation tasks, offering a new approach for multi-agent systems to handle large-scale tasks.
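The abstract says agent selection uses an Upper Confidence Bound-based strategy without giving the exact rule. The sketch below shows the textbook UCB1 rule applied to agents, with a running mean score standing in for reputation; the function name, the score semantics, and the exploration constant are assumptions, not DRF's actual formula.

```python
import math

def ucb_select(counts, mean_scores, c=math.sqrt(2)):
    """Return the index of the agent with the highest UCB1 score.

    counts:      number of times each agent has been selected so far
    mean_scores: average observed score (assumed reputation proxy) per agent
    c:           exploration weight; sqrt(2) gives the classic UCB1 bonus
    """
    # Try every agent once before comparing confidence bounds.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    total = sum(counts)
    scores = [
        m + c * math.sqrt(math.log(total) / n)
        for m, n in zip(mean_scores, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

The bonus term shrinks as an agent is selected more often, so rarely used agents are still sampled occasionally while consistently strong agents dominate over time.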
[AI-77] Hyperbolic Large Language Models
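【Quick Read】: This paper addresses the difficulty large language models (LLMs) have in learning intrinsic semantic entailment and hierarchical relationships from real-world data with highly non-Euclidean latent hierarchical structure, such as protein networks, transportation networks, financial networks, brain networks, and syntactic trees in natural language. The key idea is to use hyperbolic geometry as the representation space, exploiting its natural fit for tree-like hierarchies to improve semantic representation learning and multi-scale reasoning in LLMs. The paper surveys four main families of methods: hyperbolic LLMs through exp/log maps, hyperbolic fine-tuned models, fully hyperbolic LLMs, and hyperbolic state-space models, and it outlines potential applications and future research directions.
Link: https://arxiv.org/abs/2509.05757
Authors: Sarang Patil,Zeyong Zhang,Yiran Huang,Tengfei Ma,Mengjia Xu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 32 pages, 6 figures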
Abstract:Large language models (LLMs) have achieved remarkable success and demonstrated superior performance across various tasks, including natural language processing (NLP), weather forecasting, biological protein folding, text generation, and solving mathematical problems. However, many real-world data exhibit highly non-Euclidean latent hierarchical anatomy, such as protein networks, transportation networks, financial networks, brain networks, and linguistic structures or syntactic trees in natural languages. Effectively learning intrinsic semantic entailment and hierarchical relationships from these raw, unstructured input data using LLMs remains an underexplored area. Due to its effectiveness in modeling tree-like hierarchical structures, hyperbolic geometry – a non-Euclidean space – has rapidly gained popularity as an expressive latent representation space for complex data modeling across domains such as graphs, images, languages, and multi-modal data. Here, we provide a comprehensive and contextual exposition of recent advancements in LLMs that leverage hyperbolic geometry as a representation space to enhance semantic representation learning and multi-scale reasoning. Specifically, the paper presents a taxonomy of the principal techniques of Hyperbolic LLMs (HypLLMs) in terms of four main categories: (1) hyperbolic LLMs through exp/log maps; (2) hyperbolic fine-tuned models; (3) fully hyperbolic LLMs, and (4) hyperbolic state-space models. We also explore crucial potential applications and outline future research directions. A repository of key papers, models, datasets, and code implementations is available at this https URL.
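The first category in the taxonomy relies on exponential and logarithmic maps to move between Euclidean and hyperbolic space. As a minimal sketch, the functions below implement the standard maps at the origin of a Poincaré ball with curvature -c; the choice of the Poincaré ball model and the function names are assumptions here, since the abstract does not state which hyperbolic model each surveyed method uses.

```python
import math

def exp_map0(v, c=1.0):
    """Map a Euclidean (tangent) vector at the origin into the Poincaré ball."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    # exp_0(v) = tanh(sqrt(c) * |v|) * v / (sqrt(c) * |v|)
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]

def log_map0(y, c=1.0):
    """Map a point in the Poincaré ball back to the tangent space at the origin."""
    norm = math.sqrt(sum(x * x for x in y))
    if norm == 0.0:
        return list(y)
    # log_0(y) = artanh(sqrt(c) * |y|) * y / (sqrt(c) * |y|)
    scale = math.atanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in y]
```

Because the two maps are inverses of each other, a model can apply ordinary Euclidean layers in the tangent space and use these maps only at the boundary with hyperbolic operations.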
[AI-78] Exploit Tool Invocation Prompt for Tool Behavior Hijacking in LLM-Based Agentic System
【Quick Read】: This paper addresses the security of the Tool Invocation Prompt (TIP) in agentic systems driven by large language models (LLMs). It finds that major LLM-based agentic systems such as Cursor and Claude Code are vulnerable to attacks including remote code execution (RCE) and denial of service (DoS), which allow external tool behavior to be hijacked. The paper presents a systematic TIP exploitation workflow (TEW) that hijacks tool behavior through manipulated tool invocations, and it proposes defense mechanisms to strengthen TIP security so that LLMs invoke external tools correctly and safely.
Link: https://arxiv.org/abs/2509.05755
Authors: Yu Liu,Yuchong Xie,Mingyu Luo,Zesen Liu,Zhixiang Zhang,Kaikai Zhang,Zongjie Li,Ping Chen,Shuai Wang,Dongdong She
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:LLM-based agentic systems leverage large language models to handle user queries, make decisions, and execute external tools for complex tasks across domains like chatbots, customer service, and software engineering. A critical component of these systems is the Tool Invocation Prompt (TIP), which defines tool interaction protocols and guides LLMs to ensure the security and correctness of tool usage. Despite its importance, TIP security has been largely overlooked. This work investigates TIP-related security risks, revealing that major LLM-based systems like Cursor, Claude Code, and others are vulnerable to attacks such as remote code execution (RCE) and denial of service (DoS). Through a systematic TIP exploitation workflow (TEW), we demonstrate external tool behavior hijacking via manipulated tool invocations. We also propose defense mechanisms to enhance TIP security in LLM-based agentic systems.
[AI-79] Reasoning Introduces New Poisoning Attacks Yet Makes Them More Complicated
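【Quick Read】: This paper addresses a new class of data poisoning attacks that arises once large language models (LLMs) adopt chain-of-thought (CoT) reasoning, because the intermediate reasoning path can itself be manipulated. Classic backdoor attacks modify the prompt or the final output directly; with step-by-step reasoning, an attacker can instead split the trigger into several individually harmless components and alter only the reasoning path, a stealthier attack the authors call "decomposed reasoning poison". The key finding is that although such poisons can be injected, reliably activating them to change the final answer (rather than only the intermediate reasoning) is surprisingly difficult. This stems from the models' reasoning capabilities and from the architectural separation between reasoning and final answer generation, which let models recover from backdoors activated within their thought processes and amount to an emergent form of backdoor robustness.
Link: https://arxiv.org/abs/2509.05739
Authors: Hanna Foerster,Ilia Shumailov,Yiren Zhao,Harsh Chaudhari,Jamie Hayes,Robert Mullins,Yarin Gal
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: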
Abstract:Early research into data poisoning attacks against Large Language Models (LLMs) demonstrated the ease with which backdoors could be injected. More recent LLMs add step-by-step reasoning, expanding the attack surface to include the intermediate chain-of-thought (CoT) and its inherent trait of decomposing problems into subproblems. Using these vectors for more stealthy poisoning, we introduce "decomposed reasoning poison", in which the attacker modifies only the reasoning path, leaving prompts and final answers clean, and splits the trigger across multiple, individually harmless components. Fascinatingly, while it remains possible to inject these decomposed poisons, reliably activating them to change final answers (rather than just the CoT) is surprisingly difficult. This difficulty arises because the models can often recover from backdoors that are activated within their thought processes. Ultimately, it appears that an emergent form of backdoor robustness is originating from the reasoning capabilities of these advanced LLMs, as well as from the architectural separation between reasoning and final answer generation.
[AI-80] Offline vs. Online Learning in Model-based RL: Lessons for Data Collection Strategies
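【Quick Read】: This paper addresses how online and offline data differ in their effect on world models in model-based reinforcement learning (MBRL), and how task performance varies with the data collection strategy. The key finding is that offline-trained world models, whose datasets cover a limited part of the state space, tend to encounter out-of-distribution (OOD) states at test time, which compromises policy training, whereas online training self-corrects through environment interaction and so performs better. The core remedy is to allow a small amount of additional online interaction (on a fixed or adaptive schedule) to mitigate the distribution mismatch, and to add exploration data when collecting large datasets rather than relying on expert data alone, which helps mitigate the performance degradation of offline agents.
Link: https://arxiv.org/abs/2509.05735
Authors: Jiaqi Chen,Ji Shi,Cansu Sancaktar,Jonas Frey,Georg Martius
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at Reinforcement Learning Conference (RLC 2025); Code available at: this https URL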
Abstract:Data collection is crucial for learning robust world models in model-based reinforcement learning. The most prevalent strategies are to actively collect trajectories by interacting with the environment during online training or training on offline datasets. At first glance, the nature of learning task-agnostic environment dynamics makes world models a good candidate for effective offline training. However, the effects of online vs. offline data on world models and thus on the resulting task performance have not been thoroughly studied in the literature. In this work, we investigate both paradigms in model-based settings, conducting experiments on 31 different environments. First, we showcase that online agents outperform their offline counterparts. We identify a key challenge behind performance degradation of offline agents: encountering Out-Of-Distribution states at test time. This issue arises because, without the self-correction mechanism in online agents, offline datasets with limited state space coverage induce a mismatch between the agent’s imagination and real rollouts, compromising policy training. We demonstrate that this issue can be mitigated by allowing for additional online interactions in a fixed or adaptive schedule, restoring the performance of online training with limited interaction data. We also showcase that incorporating exploration data helps mitigate the performance degradation of offline agents. Based on our insights, we recommend adding exploration data when collecting large datasets, as current efforts predominantly focus on expert data alone.
[AI-81] Simulation Priors for Data-Efficient Deep Learning
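【Quick Read】: This paper addresses how AI systems can learn efficiently in the real world, focusing on modeling complex dynamical systems when data are scarce. First-principles models offer physical interpretability but often miss real-world complexity because of simplifying assumptions, while deep learning fits complex dynamics well but needs large, representative datasets. The key to its solution is SimPEL, which uses low-fidelity simulators as priors in Bayesian deep learning: it benefits from simulator knowledge in low-data regimes, leverages deep learning's flexibility when more data is available, and carefully quantifies epistemic uncertainty. The method shows superior performance on biological, agricultural, and robotic systems and bridges the sim-to-real gap in model-based reinforcement learning, learning a drifting parking maneuver on a high-speed RC car with substantially less data than state-of-the-art baselines.
Link: https://arxiv.org/abs/2509.05732
Authors: Lenart Treven,Bhavya Sukhija,Jonas Rothfuss,Stelian Coros,Florian Dörfler,Andreas Krause
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: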
Abstract:How do we enable AI systems to efficiently learn in the real-world? First-principles models are widely used to simulate natural systems, but often fail to capture real-world complexity due to simplifying assumptions. In contrast, deep learning approaches can estimate complex dynamics with minimal assumptions but require large, representative datasets. We propose SimPEL, a method that efficiently combines first-principles models with data-driven learning by using low-fidelity simulators as priors in Bayesian deep learning. This enables SimPEL to benefit from simulator knowledge in low-data regimes and leverage deep learning’s flexibility when more data is available, all the while carefully quantifying epistemic uncertainty. We evaluate SimPEL on diverse systems, including biological, agricultural, and robotic domains, showing superior performance in learning complex dynamics. For decision-making, we demonstrate that SimPEL bridges the sim-to-real gap in model-based reinforcement learning. On a high-speed RC car task, SimPEL learns a highly dynamic parking maneuver involving drifting with substantially less data than state-of-the-art baselines. These results highlight the potential of SimPEL for data-efficient learning and control in complex real-world environments.
[AI-82] MSRFormer: Road Network Representation Learning using Multi-scale Feature Fusion of Heterogeneous Spatial Interactions
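【Quick Read】: This paper addresses the challenges that the heterogeneous and hierarchical nature of urban road networks poses for representation learning, in particular that conventional graph neural networks struggle to model complex spatial interactions because of their homogeneity assumption and focus on a single structural scale. The key to its solution is MSRFormer, a framework that integrates multi-scale spatial interactions to capture flow heterogeneity and long-distance dependencies. It uses spatial flow convolution to extract small-scale features from large trajectory datasets and identifies scale-dependent spatial interaction regions to capture the spatial structure of road networks and flow heterogeneity; a graph transformer then models complex spatial dependencies across scales; finally, the spatial interaction features are fused with residual connections and fed to a contrastive learning algorithm to produce the road network representation. On two real-world datasets the method outperforms baseline methods, with improvements of up to 16% on complex road network structures.
Link: https://arxiv.org/abs/2509.05685
Authors: Jian Yang,Jiahui Wu,Li Fang,Hongchao Fan,Bianying Zhang,Huijie Zhao,Guangyi Yang,Rui Xin,Xiong You
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: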
Abstract:Transforming road network data into vector representations using deep learning has proven effective for road network analysis. However, urban road networks’ heterogeneous and hierarchical nature poses challenges for accurate representation learning. Graph neural networks, which aggregate features from neighboring nodes, often struggle due to their homogeneity assumption and focus on a single structural scale. To address these issues, this paper presents MSRFormer, a novel road network representation learning framework that integrates multi-scale spatial interactions by addressing their flow heterogeneity and long-distance dependencies. It uses spatial flow convolution to extract small-scale features from large trajectory datasets, and identifies scale-dependent spatial interaction regions to capture the spatial structure of road networks and flow heterogeneity. By employing a graph transformer, MSRFormer effectively captures complex spatial dependencies across multiple scales. The spatial interaction features are fused using residual connections, which are fed to a contrastive learning algorithm to derive the final road network representation. Validation on two real-world datasets demonstrates that MSRFormer outperforms baseline methods in two road network analysis tasks. The performance gains of MSRFormer suggest the traffic-related task benefits more from incorporating trajectory data, also resulting in greater improvements in complex road network structures with up to 16% improvements compared to the most competitive baseline method. This research provides a practical framework for developing task-agnostic road network representation models and highlights distinct association patterns of the interplay between scale effects and flow heterogeneity of spatial interactions.
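[AI-83] SEASONED: Semantic-Enhanced Self-Counterfactual Explainable Detection of Adversarial Exploiter Contracts
[Quick Read]: This paper targets the security threat that Adversarial Exploiter Contracts (AECs), maliciously crafted contracts, pose to Decentralized Finance (DeFi). Existing detection methods fall short in capturing semantic dependencies and in providing interpretability, which limits detection effectiveness and leaves the attack logic poorly understood. The key to the solution is the SEASONED framework, which extracts semantic information from smart contract bytecode to build a Semantic Relation Graph (SRG) and uses a Self-Counterfactual Explainable Detector (SCFED) to classify SRGs while generating explanations focused on the core attack logic. SCFED further extracts representative features from these explanations, improving robustness, generalizability, and data efficiency and enabling efficient, transparent, and reliable AEC detection.
Link: https://arxiv.org/abs/2509.05681
Authors: Xing Ai,Shudan Lin,Zecheng Li,Kai Zhou,Bixin Li,Bin Xiao
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: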
Abstract:Decentralized Finance (DeFi) attacks have resulted in significant losses, often orchestrated through Adversarial Exploiter Contracts (AECs) that exploit vulnerabilities in victim smart contracts. To proactively identify such threats, this paper targets the explainable detection of AECs. Existing detection methods struggle to capture semantic dependencies and lack interpretability, limiting their effectiveness and leaving critical knowledge gaps in AEC analysis. To address these challenges, we introduce SEASONED, an effective, self-explanatory, and robust framework for AEC detection. SEASONED extracts semantic information from contract bytecode to construct a semantic relation graph (SRG), and employs a self-counterfactual explainable detector (SCFED) to classify SRGs and generate explanations that highlight the core attack logic. SCFED further enhances robustness, generalizability, and data efficiency by extracting representative information from these explanations. Both theoretical analysis and experimental results demonstrate the effectiveness of SEASONED, which showcases outstanding detection performance, robustness, generalizability, and data efficiency learning ability. To support further research, we also release a new dataset of 359 AECs.
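[AI-84] GraMFedDHAR: Graph Based Multimodal Differentially Private Federated HAR
[Quick Read]: This paper addresses the challenges that Human Activity Recognition (HAR) faces from noisy or missing multimodal sensor data, scarce labeled samples, and privacy requirements, in particular how to fuse heterogeneous multimodal data effectively under Federated Learning (FL) while guaranteeing Differential Privacy (DP). The key to the solution is a graph-based multimodal federated learning framework, GraMFedDHAR, which models the different sensor streams (such as a pressure mat, a depth camera, and multiple accelerometers) as modality-specific graphs, extracts features with residual Graph Convolutional Neural Networks (GCNs), and fuses the embeddings through attention-based weighting rather than simple concatenation. A differential privacy mechanism is applied during federated aggregation to protect the data. Experiments show gains of up to 2% over the baseline without DP and a larger advantage under DP constraints (7–13%), indicating that graph neural networks are more robust to DP noise in multimodal learning.
Link: https://arxiv.org/abs/2509.05671
Authors: Labani Halder,Tanmay Sen,Sarbani Palit
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
Comments: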
Abstract:Human Activity Recognition (HAR) using multimodal sensor data remains challenging due to noisy or incomplete measurements, scarcity of labeled examples, and privacy concerns. Traditional centralized deep learning approaches are often constrained by infrastructure availability, network latency, and data sharing restrictions. While federated learning (FL) addresses privacy by training models locally and sharing only model parameters, it still has to tackle issues arising from the use of heterogeneous multimodal data and differential privacy requirements. In this article, a Graph-based Multimodal Federated Learning framework, GraMFedDHAR, is proposed for HAR tasks. Diverse sensor streams such as a pressure mat, depth camera, and multiple accelerometers are modeled as modality-specific graphs, processed through residual Graph Convolutional Neural Networks (GCNs), and fused via attention-based weighting rather than simple concatenation. The fused embeddings enable robust activity classification, while differential privacy safeguards data during federated aggregation. Experimental results show that the proposed MultiModalGCN model outperforms the baseline MultiModalFFN, with up to 2 percent higher accuracy in non-DP settings in both centralized and federated paradigms. More importantly, significant improvements are observed under differential privacy constraints: MultiModalGCN consistently surpasses MultiModalFFN, with performance gaps ranging from 7 to 13 percent depending on the privacy budget and setting. These results highlight the robustness of graph-based modeling in multimodal learning, where GNNs prove more resilient to the performance degradation introduced by DP noise.
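The abstract contrasts attention-based weighting with simple concatenation when fusing modality embeddings. A minimal sketch of that idea, using only the standard library, might look as follows. The embedding values, the fixed `query` vector, and the dot-product scoring are assumptions for illustration; the paper's actual fusion layer is not specified in the abstract.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_fuse(embeddings, query):
    """Fuse per-modality embeddings into one vector of the same size.

    embeddings: dict of modality name -> list[float], all of equal length
    query: scoring vector (learned in a real model; fixed here as an assumption)
    """
    names = list(embeddings)
    # Dot-product score per modality, then normalise to attention weights.
    scores = [sum(q * e for q, e in zip(query, embeddings[n])) for n in names]
    weights = softmax(scores)
    dim = len(query)
    fused = [sum(w * embeddings[n][i] for w, n in zip(weights, names)) for i in range(dim)]
    return fused, dict(zip(names, weights))

# Hypothetical 3-dimensional embeddings for three sensor streams.
embeddings = {
    "pressure_mat": [1.0, 0.0, 0.0],
    "depth_camera": [0.0, 1.0, 0.0],
    "accelerometer": [0.0, 0.0, 1.0],
}
query = [2.0, 0.0, 0.0]  # favours the first dimension, so pressure_mat gets the largest weight
fused, weights = attention_fuse(embeddings, query)
print(weights)
print(fused)
```

Unlike concatenation, the fused vector keeps the size of a single embedding, and a noisy or missing modality can be down-weighted instead of occupying a fixed slice of the input.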
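[AI-85] OptiProxy-NAS: Optimization Proxy based End-to-End Neural Architecture Search
[Quick Read]: This paper addresses the high computational cost and the discrete, spiky search space of Neural Architecture Search (NAS). Conventional approaches, such as predictor-based surrogate models or differentiable architecture search via supernetworks, hit efficiency bottlenecks and struggle to optimize efficiently on large-scale tasks. The key solution is an optimization proxy (OptiProxy) that recasts the NAS space as a continuous, differentiable, and smooth representation, so that any gradient-based optimization method can be applied directly to end-to-end optimization of the architecture parameters, markedly improving search efficiency and performance.
Link: https://arxiv.org/abs/2509.05656
Authors: Bo Lyu,Yu Cui,Tuo Shi,Ke Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: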
Abstract:Neural architecture search (NAS) is a hard, computationally expensive optimization problem with a discrete, vast, and spiky search space. One of the key research efforts dedicated to this space focuses on accelerating NAS via certain proxy evaluations of neural architectures. Different from the prevalent predictor-based methods using surrogate models and differentiable architecture search via supernetworks, we propose an optimization proxy to streamline the NAS as an end-to-end optimization framework, named OptiProxy-NAS. In particular, using a proxy representation, the NAS space is reformulated to be continuous, differentiable, and smooth. Thereby, any differentiable optimization method can be applied to the gradient-based search of the relaxed architecture parameters. Our comprehensive experiments on 12 NAS tasks of 4 search spaces across three different domains including computer vision, natural language processing, and resource-constrained NAS fully demonstrate the superior search results and efficiency. Further experiments on low-fidelity scenarios verify the flexibility.
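To make "gradient-based search of relaxed architecture parameters" concrete, here is a toy sketch of a generic continuous relaxation: a discrete choice among candidate operations is replaced by softmax-weighted logits that plain gradient descent can optimize. This is not OptiProxy-NAS's proxy representation, which the abstract does not detail; the per-operation losses and hyperparameters are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def relaxed_search(op_losses, steps=200, lr=0.5):
    """Gradient descent on the logits alpha of a softmax over candidate operations.

    Objective: L(alpha) = sum_k softmax(alpha)_k * op_losses[k]
    Gradient:  dL/dalpha_j = p_j * (op_losses[j] - L)
    """
    alpha = [0.0] * len(op_losses)
    for _ in range(steps):
        p = softmax(alpha)
        expected = sum(pk * lk for pk, lk in zip(p, op_losses))
        grad = [pj * (lj - expected) for pj, lj in zip(p, op_losses)]
        alpha = [a - lr * g for a, g in zip(alpha, grad)]
    return alpha, softmax(alpha)

# Hypothetical proxy losses for three candidate operations (lower is better).
op_losses = [0.9, 0.3, 0.6]
alpha, probs = relaxed_search(op_losses)
# Discretise at the end by picking the most probable operation.
best = max(range(len(probs)), key=probs.__getitem__)
print(best, [round(p, 3) for p in probs])
```

The design point is that the search variable `alpha` is continuous, so the same loop could be swapped for any differentiable optimizer; only the final `argmax` step returns to a discrete architecture.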
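[AI-86] Orchestrator: Active Inference for Multi-Agent Systems in Long-Horizon Tasks
[Quick Read]: This paper addresses the performance bottleneck that partial observability and suboptimal coordination impose on LLM-enhanced Multi-Agent Systems (MAS) in complex, non-linear tasks. The key to the solution is Orchestrator, a novel framework that optimizes global task performance through attention-inspired self-emergent coordination and reflective benchmarking. It introduces a monitoring mechanism that tracks agent-environment dynamics and uses active inference benchmarks to optimize system behavior, mitigating the effects of partial observability and letting agents approximate global task solutions more efficiently.
Link: https://arxiv.org/abs/2509.05651
Authors: Lukas Beckenbauer,Johannes-Lucas Loewe,Ge Zheng,Alexandra Brintrup
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: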
Abstract:Complex, non-linear tasks challenge LLM-enhanced multi-agent systems (MAS) due to partial observability and suboptimal coordination. We propose Orchestrator, a novel MAS framework that leverages attention-inspired self-emergent coordination and reflective benchmarking to optimize global task performance. Orchestrator introduces a monitoring mechanism to track agent-environment dynamics, using active inference benchmarks to optimize system behavior. By tracking agent-to-agent and agent-to-environment interaction, Orchestrator mitigates the effects of partial observability and enables agents to approximate global task solutions more efficiently. We evaluate the framework on a series of maze puzzles of increasing complexity, demonstrating its effectiveness in enhancing coordination and performance in dynamic, non-linear environments with long-horizon objectives.
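[AI-87] Causal Debiasing Medical Multimodal Representation Learning with Missing Modalities
[Quick Read]: This paper addresses the loss of generalization in medical multimodal representation learning caused by missing data, which involves two types of bias: missingness bias, arising from non-random patterns in which modalities are missing, and distribution bias, caused by latent confounders that influence both the observed features and the outcomes. The key to the solution is a structural causal analysis that identifies and models the causal mechanisms of the data-generating process, together with a unified framework with two components: a missingness deconfounding module that approximates causal intervention via backdoor adjustment, and a dual-branch neural network that explicitly separates causal features from spurious correlations, improving predictive reliability and interpretability in real-world medical settings.
Link: https://arxiv.org/abs/2509.05615
Authors: Xiaoguang Zhu,Lianlong Sun,Yang Liu,Pengyi Jiang,Uma Srivatsa,Nipavan Chiamvimonvat,Vladimir Filkov
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Submitted to IEEE TKDE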
Abstract:Medical multimodal representation learning aims to integrate heterogeneous clinical data into unified patient representations to support predictive modeling, which remains an essential yet challenging task in the medical data mining community. However, real-world medical datasets often suffer from missing modalities due to cost, protocol, or patient-specific constraints. Existing methods primarily address this issue by learning from the available observations in either the raw data space or feature space, but typically neglect the underlying bias introduced by the data acquisition process itself. In this work, we identify two types of biases that hinder model generalization: missingness bias, which results from non-random patterns in modality availability, and distribution bias, which arises from latent confounders that influence both observed features and outcomes. To address these challenges, we perform a structural causal analysis of the data-generating process and propose a unified framework that is compatible with existing direct prediction-based multimodal learning methods. Our method consists of two key components: (1) a missingness deconfounding module that approximates causal intervention based on backdoor adjustment and (2) a dual-branch neural network that explicitly disentangles causal features from spurious correlations. We evaluated our method on real-world public and in-hospital datasets, demonstrating its effectiveness and causal insights.
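Backdoor adjustment, which the deconfounding module is said to approximate, estimates an interventional quantity as P(Y | do(X=x)) = Σ_z P(Y | X=x, Z=z) P(Z=z). The sketch below applies that formula to a small table of made-up counts, where Z is a binary confounder, X a binary variable such as "modality observed", and Y a binary outcome. All numbers are hypothetical, and the paper's neural approximation is not reproduced here.

```python
from collections import Counter

# Hypothetical counts: (z, x, y) -> number of patients.
counts = Counter({
    (0, 0, 1): 8,  (0, 0, 0): 32,
    (0, 1, 1): 4,  (0, 1, 0): 6,
    (1, 0, 1): 6,  (1, 0, 0): 4,
    (1, 1, 1): 32, (1, 1, 0): 8,
})
total = sum(counts.values())

def p_y_given_x(x):
    """Naive conditional P(Y=1 | X=x), which mixes in the effect of Z."""
    n = sum(c for (z, xi, y), c in counts.items() if xi == x)
    k = sum(c for (z, xi, y), c in counts.items() if xi == x and y == 1)
    return k / n

def p_y_do_x(x):
    """Backdoor-adjusted P(Y=1 | do(X=x)) = sum_z P(Y=1 | x, z) * P(z)."""
    result = 0.0
    for z in (0, 1):
        n_z = sum(c for (zi, xi, y), c in counts.items() if zi == z)
        n_xz = sum(c for (zi, xi, y), c in counts.items() if zi == z and xi == x)
        k_xz = counts[(z, x, 1)]
        result += (k_xz / n_xz) * (n_z / total)
    return result

naive_effect = p_y_given_x(1) - p_y_given_x(0)
adjusted_effect = p_y_do_x(1) - p_y_do_x(0)
print(naive_effect, adjusted_effect)
```

In this toy table the naive difference is more than twice the adjusted one, because Z raises both the chance that X=1 and the chance that Y=1.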
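[AI-88] Natural Language-Programming Language Software Traceability Link Recovery Needs More than Textual Similarity
[Quick Read]: This paper addresses the traceability link recovery (TLR) task between software requirements and code, where methods relying only on textual similarity are limited by the semantic gap between natural language (NL) and programming language (PL) artifacts. The key to the solution is to identify several domain-specific auxiliary strategies through large-scale empirical analysis and integrate them into two models: a Heterogeneous Graph Transformer (HGT), via edge types, and the prompt-based Gemini 2.5 Pro, improving the accuracy of requirements-to-code link recovery in NL-PL scenarios. Experiments show that the multi-strategy models outperform both their original versions and the current state-of-the-art method HGNNLink, with average F1-score improvements of 3.68% and 8.84%, respectively, across twelve open-source projects.
Link: https://arxiv.org/abs/2509.05585
Authors: Zhiyuan Zou,Bangchao Wang,Peng Liang,Tingting Bi,Huan Jin
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 45 pages, 5 images, 11 tables, Manuscript submitted to a Journal (2025)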
Abstract:In the field of software traceability link recovery (TLR), textual similarity has long been regarded as the core criterion. However, in tasks involving natural language and programming language (NL-PL) artifacts, relying solely on textual similarity is limited by their semantic gap. To this end, we conducted a large-scale empirical evaluation across various types of TLR tasks, revealing the limitations of textual similarity in NL-PL scenarios. To address these limitations, we propose an approach that incorporates multiple domain-specific auxiliary strategies, identified through empirical analysis, into two models: the Heterogeneous Graph Transformer (HGT) via edge types and the prompt-based Gemini 2.5 Pro via additional input information. We then evaluated our approach using the widely studied requirements-to-code TLR task, a representative case of NL-PL TLR. Experimental results show that both the multi-strategy HGT and Gemini 2.5 Pro models outperformed their original counterparts without strategy integration. Furthermore, compared to the current state-of-the-art method HGNNLink, the multi-strategy HGT and Gemini 2.5 Pro models achieved average F1-score improvements of 3.68% and 8.84%, respectively, across twelve open-source projects, demonstrating the effectiveness of multi-strategy integration in enhancing overall model performance for the requirements-code TLR task.
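The semantic gap the abstract describes can be illustrated with a bag-of-words cosine similarity, a common textual-similarity baseline (not necessarily the one evaluated in the paper). The requirement text and both code snippets below are invented for illustration.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lowercase word tokens, splitting camelCase and snake_case identifiers."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return [t.lower() for t in re.findall(r"[A-Za-z]+", spaced)]

def cosine_similarity(a, b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

requirement = "The user can reset a forgotten password by email"
# Hypothetical code that implements the requirement but shares no vocabulary with it.
related_code = "def issue_recovery_token(account): mailer.send(account.address, make_token())"
# Hypothetical unrelated code that happens to reuse the requirement's words.
unrelated_code = "def reset_user_email_cache(user): cache.reset(user.email)"

print(cosine_similarity(requirement, related_code))
print(cosine_similarity(requirement, unrelated_code))
```

Here the code that actually implements the requirement scores zero, while the unrelated snippet scores higher. That is the kind of failure that motivates adding signals beyond text.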
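[AI-89] Learning to Walk in Costume: Adversarial Motion Priors for Aesthetically Constrained Humanoids
[Quick Read]: This paper addresses the locomotion control difficulties that aesthetics-driven design creates for entertainment humanoid robots, such as a disproportionately heavy head (16% of total mass), limited sensing, and protective shells that considerably restrict movement. To meet these challenges, it proposes a Reinforcement Learning (RL)-based locomotion control system built around Adversarial Motion Priors (AMP), which guide the robot to learn natural, fluid motions while maintaining physical stability. The work also develops tailored domain randomization techniques and specially designed reward functions to achieve safe sim-to-real transfer and protect valuable hardware. Experiments show that the method produces stable standing and walking despite the extreme mass distribution and movement constraints, demonstrating that learning-based methods can adapt to aesthetics-driven design constraints.
Link: https://arxiv.org/abs/2509.05581
Authors: Arturo Flores Alvarez,Fatemeh Zargarbashi,Havel Liu,Shiqi Wang,Liam Edwards,Jessica Anz,Alex Xu,Fan Shi,Stelian Coros,Dennis W. Hong
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 8 pages, 11 figures, accepted at IEEE-RAS International Conference on Humanoid Robots (Humanoids) 2025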
Abstract:We present a Reinforcement Learning (RL)-based locomotion system for Cosmo, a custom-built humanoid robot designed for entertainment applications. Unlike traditional humanoids, entertainment robots present unique challenges due to aesthetic-driven design choices. Cosmo embodies these with a disproportionately large head (16% of total mass), limited sensing, and protective shells that considerably restrict movement. To address these challenges, we apply Adversarial Motion Priors (AMP) to enable the robot to learn natural-looking movements while maintaining physical stability. We develop tailored domain randomization techniques and specialized reward structures to ensure safe sim-to-real, protecting valuable hardware components during deployment. Our experiments demonstrate that AMP generates stable standing and walking behaviors despite Cosmo’s extreme mass distribution and movement constraints. These results establish a promising direction for robots that balance aesthetic appeal with functional performance, suggesting that learning-based methods can effectively adapt to aesthetic-driven design constraints.
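[AI-90] OccVLA: Vision-Language-Action Model with Implicit 3D Occupancy Supervision
[Quick Read]: This paper addresses the lack of robust three-dimensional (3D) spatial understanding in Multimodal Large Language Models (MLLMs) for autonomous driving. Existing methods face two key challenges: it is hard to build effective 3D representations without expensive manual annotation, and vision-language models (VLMs) lose fine-grained spatial information because large-scale 3D vision-language pretraining is absent. The key to the solution is the OccVLA framework, which integrates 3D occupancy representations into a unified multimodal reasoning process so that the model learns fine-grained spatial structure directly from 2D visual inputs. Occupancy prediction is treated as an implicit reasoning process that can be skipped at inference time without degrading performance, giving 3D spatial awareness at no extra computational cost.
Link: https://arxiv.org/abs/2509.05578
Authors: Ruixun Liu,Lingyu Kong,Derun Li,Hang Zhao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: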
Abstract:Multimodal large language models (MLLMs) have shown strong vision-language reasoning abilities but still lack robust 3D spatial understanding, which is critical for autonomous driving. This limitation stems from two key challenges: (1) the difficulty of constructing accessible yet effective 3D representations without expensive manual annotations, and (2) the loss of fine-grained spatial details in VLMs due to the absence of large-scale 3D vision-language pretraining. To address these challenges, we propose OccVLA, a novel framework that integrates 3D occupancy representations into a unified multimodal reasoning process. Unlike prior approaches that rely on explicit 3D inputs, OccVLA treats dense 3D occupancy as both a predictive output and a supervisory signal, enabling the model to learn fine-grained spatial structures directly from 2D visual inputs. The occupancy predictions are regarded as implicit reasoning processes and can be skipped during inference without performance degradation, thereby adding no extra computational overhead. OccVLA achieves state-of-the-art results on the nuScenes benchmark for trajectory planning and demonstrates superior performance on 3D visual question-answering tasks, offering a scalable, interpretable, and fully vision-based solution for autonomous driving.
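[AI-91] TreeGPT: A Novel Hybrid Architecture for Abstract Syntax Tree Processing with Global Parent-Child Aggregation
[Quick Read]: This paper addresses the limited efficiency and expressiveness of Abstract Syntax Tree (AST) modeling in neural program synthesis, where conventional approaches such as purely sequential processing or graph neural networks struggle to capture the hierarchical structure and long-range dependencies of ASTs. The key to the solution is the TreeGPT architecture, which combines Transformer self-attention with a Global Parent-Child Aggregation mechanism to pass information through the AST at multiple levels: self-attention captures local dependencies between nodes, while iterative message passing aggregates global information over the tree, formalized as $ h_i^{(t+1)} = \sigma \left( h_i^{(0)} + W_{pc} \sum_{(p,c) \in E_i} f(h_p^{(t)}, h_c^{(t)}) + b \right) $, so that each node progressively integrates information from the whole tree over T iterations. Experiments report 96% accuracy on the ARC Prize 2025 benchmark, well above several baselines, with only 1.5M parameters.
Link: https://arxiv.org/abs/2509.05550
Authors: Zixi Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Code available at: this https URL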
Abstract:We introduce TreeGPT, a novel neural architecture that combines transformer-based attention mechanisms with global parent-child aggregation for processing Abstract Syntax Trees (ASTs) in neural program synthesis tasks. Unlike traditional approaches that rely solely on sequential processing or graph neural networks, TreeGPT employs a hybrid design that leverages both self-attention for capturing local dependencies and a specialized Tree Feed-Forward Network (TreeFFN) for modeling hierarchical tree structures through iterative message passing. The core innovation lies in our Global Parent-Child Aggregation mechanism, formalized as: $ h_i^{(t+1)} = \sigma \left( h_i^{(0)} + W_{pc} \sum_{(p,c) \in E_i} f(h_p^{(t)}, h_c^{(t)}) + b \right) $ where $ h_i^{(t)} $ represents the hidden state of node $ i $ at iteration $ t $, $ E_i $ denotes all parent-child edges involving node $ i $, and $ f(h_p, h_c) $ is an edge aggregation function. This formulation enables each node to progressively aggregate information from the entire tree structure through $ T $ iterations. Our architecture integrates optional enhancements including gated aggregation with learnable edge weights, residual connections for gradient stability, and bidirectional propagation for capturing both bottom-up and top-down dependencies. We evaluate TreeGPT on the ARC Prize 2025 dataset, a challenging visual reasoning benchmark requiring abstract pattern recognition and rule inference. Experimental results demonstrate that TreeGPT achieves 96% accuracy, significantly outperforming transformer baselines (1.3%), large-scale models like Grok-4 (15.9%), and specialized program synthesis methods like SOAR (52%) while using only 1.5M parameters. Our comprehensive ablation study reveals that edge projection is the most critical component, with the combination of edge projection and gating achieving optimal performance.
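The parent-child aggregation rule can be traced by hand on a three-node tree. The sketch below is a scalar toy version of the update stated in the abstract. The abstract does not define the nonlinearity or the edge function, so the choice of tanh for σ, the mean for f, scalar hidden states, and the values of `w_pc` and `b` are all assumptions.

```python
import math

def aggregate(h0, edges, w_pc=0.5, b=0.0, iterations=3):
    """Iterate h_i^(t+1) = sigma(h_i^(0) + w_pc * sum_{(p,c) in E_i} f(h_p^(t), h_c^(t)) + b).

    h0: dict of node -> initial scalar state h_i^(0)
    edges: list of (parent, child) pairs; E_i is every edge that touches node i
    Assumed for illustration: scalar states, sigma = tanh, f = mean of the two states.
    """
    def f(h_p, h_c):
        return 0.5 * (h_p + h_c)

    h = dict(h0)
    for _ in range(iterations):
        new_h = {}
        for i in h0:
            # Sum messages over all parent-child edges involving node i.
            msg = sum(f(h[p], h[c]) for p, c in edges if i in (p, c))
            new_h[i] = math.tanh(h0[i] + w_pc * msg + b)
        h = new_h
    return h

# Toy AST for `a + b`: a BinOp root with two Name children.
h0 = {"BinOp": 0.0, "a": 1.0, "b": -1.0}
edges = [("BinOp", "a"), ("BinOp", "b")]
print(aggregate(h0, edges, iterations=1))
print(aggregate(h0, edges, iterations=3))
```

Each update adds the node's initial state `h0[i]` rather than its previous state, matching the $ h_i^{(0)} $ term in the formula. With more iterations, information from a leaf can reach its sibling through the shared parent.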
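[AI-92] Combining TSL and LLM to Automate REST API Testing: A Comparative Study
[Quick Read]: This paper addresses a core challenge in executing REST API tests: given the complexity of distributed systems, the multitude of scenarios, and limited time for test design, adequate test coverage is hard to achieve, which raises the risk of undetected failures and increases manual effort. The key to the solution is RestTSLLM, an approach that combines Test Specification Language (TSL) with Large Language Models (LLMs), using prompt engineering techniques and an automated pipeline to generate robust, contextually coherent test cases from OpenAPI specifications, improving testing efficiency and quality.
Link: https://arxiv.org/abs/2509.05540
Authors: Thiago Barradas,Aline Paes,Vânia de Oliveira Neves
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 10 pages, article computer science, software engineering, software testing, ia, llm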
Abstract:The effective execution of tests for REST APIs remains a considerable challenge for development teams, driven by the inherent complexity of distributed systems, the multitude of possible scenarios, and the limited time available for test design. Exhaustive testing of all input combinations is impractical, often resulting in undetected failures, high manual effort, and limited test coverage. To address these issues, we introduce RestTSLLM, an approach that uses Test Specification Language (TSL) in conjunction with Large Language Models (LLMs) to automate the generation of test cases for REST APIs. The approach targets two core challenges: the creation of test scenarios and the definition of appropriate input data. The proposed solution integrates prompt engineering techniques with an automated pipeline to evaluate various LLMs on their ability to generate tests from OpenAPI specifications. The evaluation focused on metrics such as success rate, test coverage, and mutation score, enabling a systematic comparison of model performance. The results indicate that the best-performing LLMs - Claude 3.5 Sonnet (Anthropic), Deepseek R1 (Deepseek), Qwen 2.5 32b (Alibaba), and Sabia 3 (Maritaca) - consistently produced robust and contextually coherent REST API tests. Among them, Claude 3.5 Sonnet outperformed all other models across every metric, emerging in this study as the most suitable model for this task. These findings highlight the potential of LLMs to automate the generation of tests based on API specifications.
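[AI-93] Microrobot Vascular Parkour: Analytic Geometry-based Path Planning with Real-time Dynamic Obstacle Avoidance
[Quick Read]: This paper addresses the challenge of autonomous microrobot navigation in blood vessels, namely planning safe, efficient paths and avoiding collisions among dense, dynamically changing obstacles (such as blood cells and vessel walls). The key solution is a real-time path planning framework that couples an analytic geometry global planner (Analytic Geometry Planner, AGP) with two local escape controllers, one rule-based and one based on reinforcement learning. AGP produces shorter paths and plans faster than weighted A* (WA*), particle swarm optimization (PSO), and rapidly exploring random trees (RRT) while remaining deterministic and feasible. The system estimates the positions of the microrobot, obstacles, and targets from real-time imaging and avoids obstacles in 3D, with an average planning time of 40 ms per frame, compatible with 25 fps image acquisition and closed-loop control, advancing autonomous navigation and targeted drug delivery in vascular environments.
Link: https://arxiv.org/abs/2509.05500
Authors: Yanda Yang,Max Sokolich,Fatma Ceren Kirmizitas,Sambeeta Das,Andreas A. Malikopoulos
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 56 pages, 19 figures including Supplementary Materials. Supplementary videos available at this https URL. Preprint. This version has not been peer reviewed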
Abstract:Autonomous microrobots in blood vessels could enable minimally invasive therapies, but navigation is challenged by dense, moving obstacles. We propose a real-time path planning framework that couples an analytic geometry global planner (AGP) with two reactive local escape controllers, one based on rules and one based on reinforcement learning, to handle sudden moving obstacles. Using real-time imaging, the system estimates the positions of the microrobot, obstacles, and targets and computes collision-free motions. In simulation, AGP yields shorter paths and faster planning than weighted A* (WA*), particle swarm optimization (PSO), and rapidly exploring random trees (RRT), while maintaining feasibility and determinism. We extend AGP from 2D to 3D without loss of speed. In both simulations and experiments, the combined global planner and local controllers reliably avoid moving obstacles and reach targets. The average planning time is 40 ms per frame, compatible with 25 fps image acquisition and real-time closed-loop control. These results advance autonomous microrobot navigation and targeted drug delivery in vascular environments.
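[AI-94] MambaLite-Micro: Memory-Optimized Mamba Inference on MCUs
[Quick Read]: This paper addresses the challenges of deploying Mamba models on resource-constrained microcontrollers (MCUs): limited memory, no native operator support, and no embedded-friendly toolchains. The key solution is MambaLite-Micro, presented as the first runtime-free inference engine implemented entirely in C. The model is ported in two steps: the weights of a trained PyTorch Mamba model are exported to a lightweight format, and the Mamba layer and supporting operators are handwritten in C with operator fusion and memory layout optimization. This eliminates large intermediate tensors and reduces peak memory by 83.0% while keeping the average numerical error relative to the PyTorch implementation at only 1.7×10⁻⁵. On keyword spotting (KWS) and human activity recognition (HAR) tasks, it achieves 100% consistency with the PyTorch baselines, and it was deployed on both ESP32S3 and STM32H7, demonstrating portability across heterogeneous embedded platforms.
Link: https://arxiv.org/abs/2509.05488
Authors: Hongjun Xu,Junxi Xia,Weisi Yang,Yueyuan Sui,Stephen Xia
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Operating Systems (cs.OS)
Comments: 4 pages, 1 figure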
Abstract:Deploying Mamba models on microcontrollers (MCUs) remains challenging due to limited memory, the lack of native operator support, and the absence of embedded-friendly toolchains. We present, to our knowledge, the first deployment of a Mamba-based neural architecture on a resource-constrained MCU, using a fully C-based, runtime-free inference engine: MambaLite-Micro. Our pipeline maps a trained PyTorch Mamba model to on-device execution by (1) exporting model weights into a lightweight format, and (2) implementing a handcrafted Mamba layer and supporting operators in C with operator fusion and memory layout optimization. MambaLite-Micro eliminates large intermediate tensors, reducing peak memory by 83.0%, while maintaining an average numerical error of only 1.7×10⁻⁵ relative to the PyTorch Mamba implementation. When evaluated on keyword spotting (KWS) and human activity recognition (HAR) tasks, MambaLite-Micro achieved 100% consistency with the PyTorch baselines, fully preserving classification accuracy. We further validated portability by deploying on both ESP32S3 and STM32H7 microcontrollers, demonstrating consistent operation across heterogeneous embedded platforms and paving the way for bringing advanced sequence models like Mamba to real-world resource-constrained applications.
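One way to see why operator fusion can remove large intermediate tensors is to compare two implementations of a diagonal selective state-space recurrence, h_t = exp(Δ_t A) ∘ h_{t-1} + Δ_t B_t x_t, y_t = C_t · h_t. The sketch below is written in Python for readability. It is not the paper's C engine, and the single-channel shapes and input values are hypothetical.

```python
import math

def scan_materialized(x, delta, A, B, C):
    """Reference version: builds every T x N intermediate before producing outputs."""
    T, N = len(x), len(A)
    a_bar = [[math.exp(delta[t] * A[n]) for n in range(N)] for t in range(T)]
    b_x = [[delta[t] * B[t][n] * x[t] for n in range(N)] for t in range(T)]
    states, h = [], [0.0] * N
    for t in range(T):
        h = [a_bar[t][n] * h[n] + b_x[t][n] for n in range(N)]
        states.append(h)
    return [sum(C[t][n] * states[t][n] for n in range(N)) for t in range(T)]

def scan_fused(x, delta, A, B, C):
    """Fused version: discretise, update, and project per step, keeping only N floats of state."""
    N = len(A)
    h = [0.0] * N
    y = []
    for t in range(len(x)):
        acc = 0.0
        for n in range(N):
            h[n] = math.exp(delta[t] * A[n]) * h[n] + delta[t] * B[t][n] * x[t]
            acc += C[t][n] * h[n]
        y.append(acc)
    return y

# Hypothetical single-channel inputs: T = 4 time steps, N = 2 state dimensions.
x = [1.0, 0.5, -0.25, 2.0]
delta = [0.1, 0.2, 0.1, 0.3]
A = [-1.0, -0.5]
B = [[1.0, 0.5]] * 4
C = [[0.3, -0.2]] * 4

y_ref = scan_materialized(x, delta, A, B, C)
y_fused = scan_fused(x, delta, A, B, C)
print(y_ref)
print(y_fused)
```

The materialized version holds three T×N arrays at once, while the fused loop needs only the running state of size N. On an MCU that difference scales with sequence length.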
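[AI-95] PLanTS: Periodicity-aware Latent-state Representation Learning for Multivariate Time Series
[Quick Read]: This paper addresses the limited performance of conventional machine learning methods on multivariate time series (MTS) that are high-dimensional, scarce in labels, and non-stationary. Existing self-supervised learning (SSL) methods ease label scarcity through data augmentation or time-point-based contrastive strategies, but they ignore the intrinsic periodic structure of MTS and struggle to capture the dynamic evolution of latent states. The key to the solution is the PLanTS framework: it first designs a period-aware multi-granularity patching mechanism and a generalized contrastive loss that preserve instance-level and state-level similarities across multiple temporal resolutions, and then adds a next-transition prediction pretext task that encourages the representations to encode predictive information about future state evolution, explicitly modeling the periodicity and temporal dynamics of MTS.
Link: https://arxiv.org/abs/2509.05478
Authors: Jia Wang,Xiao Wang,Chi Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: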
Abstract:Multivariate time series (MTS) are ubiquitous in domains such as healthcare, climate science, and industrial monitoring, but their high dimensionality, limited labeled data, and non-stationary nature pose significant challenges for conventional machine learning methods. While recent self-supervised learning (SSL) approaches mitigate label scarcity through data augmentation or time-point-based contrastive strategies, they neglect the intrinsic periodic structure of MTS and fail to capture the dynamic evolution of latent states. We propose PLanTS, a periodicity-aware self-supervised learning framework that explicitly models irregular latent states and their transitions. We first design a period-aware multi-granularity patching mechanism and a generalized contrastive loss to preserve both instance-level and state-level similarities across multiple temporal resolutions. To further capture temporal dynamics, we design a next-transition prediction pretext task that encourages representations to encode predictive information about future state evolution. We evaluate PLanTS across a wide range of downstream tasks, including multi-class and multi-label classification, forecasting, trajectory tracking, and anomaly detection. PLanTS consistently improves the representation quality over existing SSL methods and demonstrates superior runtime efficiency compared to DTW-based methods.
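[AI-96] Learning Tool-Aware Adaptive Compliant Control for Autonomous Regolith Excavation
[Quick Read]: This paper addresses the challenges of autonomous lunar regolith excavation, which arise from the complex interaction dynamics of granular media and the need for robots to use diverse tools. The key to the solution is a model-based reinforcement learning framework in which the agent is trained in a parallelized simulation that combines high-fidelity particle physics with procedural generation. By dynamically modulating its own stiffness and damping through operational space control, the agent adapts its interaction strategy to different tools and generalizes across diverse terrains and tool geometries. Experiments show that adding visual feedback significantly improves task success, supporting the method as a way to develop the robust, versatile autonomous systems that future space missions need.
Link: https://arxiv.org/abs/2509.05475
Authors: Andrej Orsula,Matthieu Geist,Miguel Olivares-Mendez,Carol Martinez
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: The source code is available at this https URL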
Abstract:Autonomous regolith excavation is a cornerstone of in-situ resource utilization for a sustained human presence beyond Earth. However, this task is fundamentally hindered by the complex interaction dynamics of granular media and the operational need for robots to use diverse tools. To address these challenges, this work introduces a framework where a model-based reinforcement learning agent learns within a parallelized simulation. This environment leverages high-fidelity particle physics and procedural generation to create a vast distribution of both lunar terrains and excavation tool geometries. To master this diversity, the agent learns an adaptive interaction strategy by dynamically modulating its own stiffness and damping at each control step through operational space control. Our experiments demonstrate that training with a procedural distribution of tools is critical for generalization and enables the development of sophisticated tool-aware behavior. Furthermore, we show that augmenting the agent with visual feedback significantly improves task success. These results represent a validated methodology for developing the robust and versatile autonomous systems required for the foundational tasks of future space missions.
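Modulating stiffness and damping at each control step can be pictured with a one-dimensional spring-damper (impedance) law, F = K (x_des − x) + D (v_des − v), in which the gains K and D change over time. The sketch below is a minimal illustration under that assumption. The point-mass model, the gain schedule standing in for a learned policy, and all numeric values are hypothetical and not taken from the paper.

```python
def impedance_force(x, v, x_des, v_des, stiffness, damping):
    """Spring-damper law: F = K * (x_des - x) + D * (v_des - v)."""
    return stiffness * (x_des - x) + damping * (v_des - v)

def simulate(gains, x_des=1.0, mass=1.0, dt=0.01, steps=500):
    """Drive a 1-D point mass toward x_des; gains(step) returns (K, D) for that step."""
    x, v = 0.0, 0.0
    for step in range(steps):
        k, d = gains(step)
        force = impedance_force(x, v, x_des, 0.0, k, d)
        v += (force / mass) * dt  # semi-implicit Euler: update velocity first
        x += v * dt
    return x

def scheduled_gains(step):
    # Hypothetical stand-in for a learned policy: compliant at first, stiffer later.
    return (20.0, 9.0) if step < 100 else (100.0, 20.0)

final_x = simulate(scheduled_gains)
print(final_x)
```

Low gains make the end effector yield to unexpected contact forces, and high gains track the target tightly. Letting a policy pick `(K, D)` per step is what allows both behaviors in one motion.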
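[AI-97] From Vision to Validation: A Theory- and Data-Driven Construction of a GCC-Specific AI Adoption Index
[Quick Read]: This paper addresses the lack of standardized assessment tools for Artificial Intelligence (AI) adoption in the public sector of the Gulf Cooperation Council (GCC) countries, where the region's specific drivers, governance models, and cultural context are rarely taken into account. The key to the solution is an AI Adoption Index grounded in both theory and empirical evidence, which consolidates three dimensions (Infrastructure Resources, Organizational Readiness, and Policy Regulatory Environment) and is validated with K-Means clustering, Principal Component Analysis, and Partial Least Squares Structural Equation Modeling. The study finds that robust infrastructure and clear policy mandates are the strongest drivers of successful early-stage AI implementation, outweighing organizational readiness, and that the model explains 70% of the variance in AI outcomes, giving GCC policymakers an actionable benchmarking tool.
Link: https://arxiv.org/abs/2509.05474
Authors: Mohammad Rashed Albous,Anwaar AlKandari,Abdel Latef Anouze
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: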
Abstract:Artificial intelligence (AI) is rapidly transforming public-sector processes worldwide, yet standardized measures rarely address the unique drivers, governance models, and cultural nuances of the Gulf Cooperation Council (GCC) countries. This study employs a theory-driven foundation derived from an in-depth analysis of literature review and six National AI Strategies (NASs), coupled with a data-driven approach that utilizes a survey of 203 mid- and senior-level government employees and advanced statistical techniques (K-Means clustering, Principal Component Analysis, and Partial Least Squares Structural Equation Modeling). By combining policy insights with empirical evidence, the research develops and validates a novel AI Adoption Index specifically tailored to the GCC public sector. Findings indicate that robust infrastructure and clear policy mandates exert the strongest influence on successful AI implementations, overshadowing organizational readiness in early adoption stages. The combined model explains 70% of the variance in AI outcomes, suggesting that resource-rich environments and top-down policy directives can drive rapid but uneven technology uptake. By consolidating key dimensions (Infrastructure Resources, Organizational Readiness, and Policy Regulatory Environment) into a single composite index, this study provides a holistic yet context-sensitive tool for benchmarking AI maturity. The index offers actionable guidance for policymakers seeking to harmonize large-scale deployments with ethical and regulatory standards. Beyond advancing academic discourse, these insights inform more strategic allocation of resources, cross-country cooperation, and capacity-building initiatives, thereby supporting sustained AI-driven transformation in the GCC region and beyond.
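[AI-98] Behind the Mask: Benchmarking Camouflaged Jailbreaks in Large Language Models
[Quick Read]: This paper addresses the safety vulnerabilities that generative AI exposes when facing a stealthier form of adversarial prompting, "camouflaged jailbreaking". Such attacks embed malicious intent in seemingly harmless language and exploit the contextual ambiguity and flexibility of language to bypass existing safety mechanisms, which traditional keyword-based detection struggles to catch. The key solution is a benchmark dataset, Camouflaged Jailbreak Prompts, with 500 curated examples (400 harmful, 100 benign), together with a multi-faceted evaluation framework covering seven dimensions (including Safety Awareness, Technical Feasibility, and Harmful Potential). The framework systematically quantifies how model behavior degrades under camouflaged jailbreak attacks and provides an empirical basis and evaluation standard for developing more nuanced, adaptive defenses.
Link: https://arxiv.org/abs/2509.05471
Authors: Youjia Zheng,Mohammad Zandsalimy,Shanu Sushmita
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: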
Abstract:Large Language Models (LLMs) are increasingly vulnerable to a sophisticated form of adversarial prompting known as camouflaged jailbreaking. This method embeds malicious intent within seemingly benign language to evade existing safety mechanisms. Unlike overt attacks, these subtle prompts exploit contextual ambiguity and the flexible nature of language, posing significant challenges to current defense systems. This paper investigates the construction and impact of camouflaged jailbreak prompts, emphasizing their deceptive characteristics and the limitations of traditional keyword-based detection methods. We introduce a novel benchmark dataset, Camouflaged Jailbreak Prompts, containing 500 curated examples (400 harmful and 100 benign prompts) designed to rigorously stress-test LLM safety protocols. In addition, we propose a multi-faceted evaluation framework that measures harmfulness across seven dimensions: Safety Awareness, Technical Feasibility, Implementation Safeguards, Harmful Potential, Educational Value, Content Quality, and Compliance Score. Our findings reveal a stark contrast in LLM behavior: while models demonstrate high safety and content quality with benign inputs, they exhibit a significant decline in performance and safety when confronted with camouflaged jailbreak attempts. This disparity underscores a pervasive vulnerability, highlighting the urgent need for more nuanced and adaptive security strategies to ensure the responsible and robust deployment of LLMs in real-world applications.
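[AI-99] Neural Breadcrumbs: Membership Inference Attacks on LLMs Through Hidden State and Attention Pattern Analysis
【Quick Read】: This paper studies membership-inference privacy leakage in large language models (LLMs): can an attacker tell, from a model's behavior, whether specific data was used in training? Membership inference attacks (MIAs) have traditionally relied on output-level loss signals, and recent studies report that such attacks are barely better than random guessing against LLMs, which suggests that LLMs may leak little. The paper proposes memTrace, a framework built on internal representations. Its core idea is to extract "neural breadcrumbs" from transformer hidden states and attention patterns, analyzing layer-wise representation dynamics, attention distribution characteristics, and cross-layer transition patterns to find memorization fingerprints that loss-based methods may miss. Experiments report an average AUC of 0.85 across several model families, indicating that internal behavior can expose training-data membership even when output-level signals look protected, and underscoring the need for further research on membership privacy and more robust privacy-preserving training.
Link: https://arxiv.org/abs/2509.05449
Authors: Disha Makhija,Manoj Ghuhan Arivazhagan,Vinayshekhar Bannihatti Kumar,Rashmi Gangadharaiah
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: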
Abstract:Membership inference attacks (MIAs) reveal whether specific data was used to train machine learning models, serving as important tools for privacy auditing and compliance assessment. Recent studies have reported that MIAs perform only marginally better than random guessing against large language models, suggesting that modern pre-training approaches with massive datasets may be free from privacy leakage risks. Our work offers a complementary perspective to these findings by exploring how examining LLMs’ internal representations, rather than just their outputs, may provide additional insights into potential membership inference signals. Our framework, memTrace, follows what we call “neural breadcrumbs” extracting informative signals from transformer hidden states and attention patterns as they process candidate sequences. By analyzing layer-wise representation dynamics, attention distribution characteristics, and cross-layer transition patterns, we detect potential memorization fingerprints that traditional loss-based approaches may not capture. This approach yields strong membership detection across several model families achieving average AUC scores of 0.85 on popular MIA benchmarks. Our findings suggest that internal model behaviors can reveal aspects of training data exposure even when output-based signals appear protected, highlighting the need for further research into membership privacy and the development of more robust privacy-preserving training techniques for large language models.
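To make the idea of internal-state signals concrete, here is a minimal sketch of two features one could compute from per-layer hidden states and attention rows: attention entropy and the cosine drift between consecutive layers. The function names and the choice of these two features are illustrative assumptions; the abstract does not specify memTrace's actual feature set or classifier.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)

def layer_drift(hidden_a, hidden_b):
    """1 - cosine similarity between the hidden states of two consecutive layers."""
    dot = sum(x * y for x, y in zip(hidden_a, hidden_b))
    norm_a = math.sqrt(sum(x * x for x in hidden_a))
    norm_b = math.sqrt(sum(y * y for y in hidden_b))
    return 1.0 - dot / (norm_a * norm_b)

def breadcrumb_features(hidden_states, attention_rows):
    """Hypothetical feature vector for one candidate sequence.

    hidden_states: one vector per layer (e.g. the last-token state).
    attention_rows: one attention distribution per layer.
    """
    entropies = [attention_entropy(row) for row in attention_rows]
    drifts = [layer_drift(a, b) for a, b in zip(hidden_states, hidden_states[1:])]
    return {
        "mean_attention_entropy": sum(entropies) / len(entropies),
        "mean_layer_drift": sum(drifts) / len(drifts),
    }
```

In a full attack, features like these would be fed to a separate classifier trained on known members and non-members; that step is omitted here.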
[AI-100] Newton to Einstein: Axiom-Based Discovery via Game Design
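【Quick Read】: This position paper argues that machine learning for scientific discovery relies too heavily on inductive pattern recognition, which leaves models confined to preset assumptions and unable to construct new theoretical frameworks. The key to its solution is a game design framework that recasts scientific inquiry as a rule-evolving system driven by axioms: agents operate in environments governed by axioms and modify those axioms to explain outlier observations, enabling systematic discovery of new theoretical structures. The approach shifts from fixed assumptions to evolvable rules, aiming at creative, interpretable, and theory-driven discovery.
Link: https://arxiv.org/abs/2509.05448
Authors: Pingchuan Ma,Benjamin Tod Jones,Tsun-Hsuan Wang,Minghao Guo,Michal Piotr Lipiec,Chuang Gan,Wojciech Matusik
Affiliations: Unknown
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
Comments: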
Abstract:This position paper argues that machine learning for scientific discovery should shift from inductive pattern recognition to axiom-based reasoning. We propose a game design framework in which scientific inquiry is recast as a rule-evolving system: agents operate within environments governed by axioms and modify them to explain outlier observations. Unlike conventional ML approaches that operate within fixed assumptions, our method enables the discovery of new theoretical structures through systematic rule adaptation. We demonstrate the feasibility of this approach through preliminary experiments in logic-based games, showing that agents can evolve axioms that solve previously unsolvable problems. This framework offers a foundation for building machine learning systems capable of creative, interpretable, and theory-driven discovery.
[AI-101] Reverse Browser: Vector-Image-to-Code Generator
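【Quick Read】: This paper addresses the low fidelity of automatic conversion from user interface (UI) designs to code (image-to-code or image-to-UI), where existing methods perform poorly on benchmarks. The key idea is to use vector images rather than bitmaps as model input, to improve how closely the generated code matches the original design. The author also builds several large training datasets, introduces a new multi-scale Image Quality Assessment (IQA) metric for measuring output quality, and trains a large open-weights model, whose limitations are discussed.
Link: https://arxiv.org/abs/2509.05394
Authors: Zoltan Toth-Czifra
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Submitted to AIWare 2025 ArXiv Track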
Abstract:Automating the conversion of user interface design into code (image-to-code or image-to-UI) is an active area of software engineering research. However, the state-of-the-art solutions do not achieve high fidelity to the original design, as evidenced by benchmarks. In this work, I approach the problem differently: I use vector images instead of bitmaps as model input. I create several large datasets for training machine learning models. I evaluate the available array of Image Quality Assessment (IQA) algorithms and introduce a new, multi-scale metric. I then train a large open-weights model and discuss its limitations.
[AI-102] Inferring Prerequisite Knowledge Concepts in Educational Knowledge Graphs: A Multi-criteria Approach
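【Quick Read】: This paper addresses the lack of explicit prerequisite relationship (PR) links in the Educational Knowledge Graph (EduKG) of the MOOC platform CourseMapper, where manual PR annotation is time-consuming and inconsistent. The key to the solution is an unsupervised method that defines ten criteria based on document-based, Wikipedia hyperlink-based, graph-based, and text-based features and combines them with a voting algorithm to infer PRs between concepts automatically and robustly. On benchmark datasets the method achieves higher precision than existing methods while remaining scalable and adaptable, providing reliable support for sequence-aware learning in CourseMapper.
Link: https://arxiv.org/abs/2509.05393
Authors: Rawaa Alatrash,Mohamed Amine Chatti,Nasha Wibowo,Qurat Ul Ain
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Accepted at IJCKG 2025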
Abstract:Educational Knowledge Graphs (EduKGs) organize various learning entities and their relationships to support structured and adaptive learning. Prerequisite relationships (PRs) are critical in EduKGs for defining the logical order in which concepts should be learned. However, the current EduKG in the MOOC platform CourseMapper lacks explicit PR links, and manually annotating them is time-consuming and inconsistent. To address this, we propose an unsupervised method for automatically inferring concept PRs without relying on labeled data. We define ten criteria based on document-based, Wikipedia hyperlink-based, graph-based, and text-based features, and combine them using a voting algorithm to robustly capture PRs in educational content. Experiments on benchmark datasets show that our approach achieves higher precision than existing methods while maintaining scalability and adaptability, thus providing reliable support for sequence-aware learning in CourseMapper.
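The voting step can be pictured with a small sketch. Each criterion casts a yes/no vote on whether concept A is a prerequisite of concept B, and a simple majority decides. The criterion names and the majority threshold below are hypothetical; the abstract states only that ten criteria are combined by a voting algorithm.

```python
def infer_prerequisite(votes, threshold=0.5):
    """Decide whether A is a prerequisite of B from per-criterion votes.

    votes: dict mapping a criterion name to True/False.
    threshold: share of positive votes that must be exceeded (assumed value).
    """
    if not votes:
        return False
    positive_share = sum(1 for vote in votes.values() if vote) / len(votes)
    return positive_share > threshold

# Hypothetical criteria for the pair ("derivative" -> "gradient descent").
example_votes = {
    "appears_earlier_in_document": True,
    "wikipedia_links_point_from_b_to_a": True,
    "graph_centrality_is_higher": True,
    "definition_of_b_mentions_a": False,
}
is_prerequisite = infer_prerequisite(example_votes)
```

A weighted vote or a per-criterion confidence score would be natural variations, at the cost of needing some way to set the weights without labeled data.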
[AI-103] An Optimized Pipeline for Automatic Educational Knowledge Graph Construction
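【Quick Read】: This paper addresses the automatic construction of Educational Knowledge Graphs (EduKGs), in particular extracting reliable, scalable knowledge representations from PDF learning materials. The core problem is that the accuracy of the initial pipeline was too low for educational use, where reliable knowledge representations are needed to support meaningful learning. The key to the solution is a unified, end-to-end construction pipeline: slide-level EduKGs are generated from individual pages/slides and then merged into a comprehensive EduKG for the whole learning material. Targeted optimizations across multiple pipeline components yield a 17.5% improvement in accuracy and a tenfold increase in processing efficiency.
Link: https://arxiv.org/abs/2509.05392
Authors: Qurat Ul Ain,Mohamed Amine Chatti,Jean Qussa,Amr Shakhshir,Rawaa Alatrash,Shoeb Joarder
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Accepted at IJCKG 2025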
Abstract:The automatic construction of Educational Knowledge Graphs (EduKGs) is essential for domain knowledge modeling by extracting meaningful representations from learning materials. Despite growing interest, identifying a scalable and reliable approach for automatic EduKG generation remains a challenge. In an attempt to develop a unified and robust pipeline for automatic EduKG construction, in this study we propose a pipeline for automatic EduKG construction from PDF learning materials. The process begins with generating slide-level EduKGs from individual pages/slides, which are then merged to form a comprehensive EduKG representing the entire learning material. We evaluate the accuracy of the EduKG generated from the proposed pipeline in our MOOC platform, CourseMapper. The observed accuracy, while indicative of partial success, is relatively low particularly in the educational context, where the reliability of knowledge representations is critical for supporting meaningful learning. To address this, we introduce targeted optimizations across multiple pipeline components. The optimized pipeline achieves a 17.5% improvement in accuracy and a tenfold increase in processing efficiency. Our approach offers a holistic, scalable and end-to-end pipeline for automatic EduKG construction, adaptable to diverse educational contexts, and supports improved semantic representation of learning content.
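The merge of slide-level graphs into one material-level graph might look roughly like the sketch below. It assumes each slide graph is a list of (source, relation, target) triples and that concepts are matched by case-insensitive label; both are illustrative assumptions, since the abstract does not describe the merge rule.

```python
from collections import Counter

def merge_slide_graphs(slide_graphs):
    """Merge slide-level triples into one graph, counting repeated edges.

    slide_graphs: list of slide graphs, each a list of
    (source, relation, target) string triples.
    Returns (sorted node labels, {edge: number of slides supporting it}).
    """
    edge_counts = Counter()
    for triples in slide_graphs:
        for source, relation, target in triples:
            # Assumed normalization: trim whitespace and ignore case.
            edge = (source.strip().lower(), relation, target.strip().lower())
            edge_counts[edge] += 1
    nodes = sorted({node for s, _, t in edge_counts for node in (s, t)})
    return nodes, dict(edge_counts)

nodes, edges = merge_slide_graphs([
    [("Neural Network", "uses", "Backpropagation")],
    [("neural network", "uses", "backpropagation"),
     ("Backpropagation", "uses", "Chain Rule")],
])
```

Keeping a per-edge count gives a cheap confidence signal: edges seen on several slides can be trusted more than edges seen once.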
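[AI-104] User Privacy and Large Language Models: An Analysis of Frontier Developers' Privacy Policies
【Quick Read】: This paper addresses the lack of transparency in how chatbots powered by large language models (LLMs) collect and use user data, in particular how developers use chat data for model training without clear user consent or adequate disclosure. The key to its approach is an analysis of the privacy policies of six U.S. frontier AI developers using a novel qualitative coding schema based primarily on the California Consumer Privacy Act (CCPA), which allows data collection and use practices to be compared across companies. The analysis finds training on chat data by default, indefinite retention by some developers, collection of sensitive personal information, and inclusion of children's chat data by four of the six companies, and it closes with recommendations to policymakers and developers for greater transparency and accountability.
Link: https://arxiv.org/abs/2509.05382
Authors: Jennifer King,Kevin Klyman,Emily Capstick,Tiffany Saade,Victoria Hsieh
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: See additional files for appendices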
Abstract:Hundreds of millions of people now regularly interact with large language models via chatbots. Model developers are eager to acquire new sources of high-quality training data as they race to improve model capabilities and win market share. This paper analyzes the privacy policies of six U.S. frontier AI developers to understand how they use their users’ chats to train models. Drawing primarily on the California Consumer Privacy Act, we develop a novel qualitative coding schema that we apply to each developer’s relevant privacy policies to compare data collection and use practices across the six companies. We find that all six developers appear to employ their users’ chat data to train and improve their models by default, and that some retain this data indefinitely. Developers may collect and train on personal information disclosed in chats, including sensitive information such as biometric and health data, as well as files uploaded by users. Four of the six companies we examined appear to include children’s chat data for model training, as well as customer data from other products. On the whole, developers’ privacy policies often lack essential information about their practices, highlighting the need for greater transparency and accountability. We address the implications of users’ lack of consent for the use of their chat data for model training, data security issues arising from indefinite chat data retention, and training on children’s chat data. We conclude by providing recommendations to policymakers and developers to address the data privacy challenges posed by LLM-powered chatbots.
[AI-105] Murphys Laws of AI Alignment: Why the Gap Always Wins
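【Quick Read】: This paper addresses the recurring failure modes of feedback-based alignment methods such as reinforcement learning from human feedback (RLHF), including reward hacking, sycophancy, annotator drift, and misgeneralization. Its central proposal is the "Alignment Gap", a unifying lens under which a KL-tilting formalism illustrates why optimization pressure tends to amplify the divergence between proxy rewards and true human intent. The paper organizes these failures into a catalogue of "Murphys Laws of AI Alignment", proposes the "Alignment Trilemma" to frame trade-offs among optimization strength, value capture, and generalization, and offers the MAPS framework (Misspecification, Annotation, Pressure, Shift) as practical design levers. It presents a structured perspective for future alignment design rather than a definitive impossibility theorem.
Link: https://arxiv.org/abs/2509.05381
Authors: Madhava Gaikwad
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 21 pages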
Abstract:Large language models are increasingly aligned to human preferences through reinforcement learning from human feedback (RLHF) and related methods such as Direct Preference Optimization (DPO), Constitutional AI, and RLAIF. While effective, these methods exhibit recurring failure patterns i.e., reward hacking, sycophancy, annotator drift, and misgeneralization. We introduce the concept of the Alignment Gap, a unifying lens for understanding recurring failures in feedback-based alignment. Using a KL-tilting formalism, we illustrate why optimization pressure tends to amplify divergence between proxy rewards and true human intent. We organize these failures into a catalogue of Murphys Laws of AI Alignment, and propose the Alignment Trilemma as a way to frame trade-offs among optimization strength, value capture, and generalization. Small-scale empirical studies serve as illustrative support. Finally, we propose the MAPS framework (Misspecification, Annotation, Pressure, Shift) as practical design levers. Our contribution is not a definitive impossibility theorem but a perspective that reframes alignment debates around structural limits and trade-offs, offering clearer guidance for future design.
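The abstract does not spell out the KL-tilting formalism, so the following is a sketch under the standard KL-regularized setup, with all notation assumed rather than taken from the paper: a reference policy \pi_0, a proxy reward \hat{r}, the true human intent r^\star, and an optimization pressure \beta > 0.

```latex
% Assumed notation, not the paper's: \pi_0 reference policy, \hat{r} proxy reward,
% r^\star true human intent, \beta > 0 optimization pressure.
\begin{align}
\pi_\beta(y \mid x) &= \frac{\pi_0(y \mid x)\,\exp\!\big(\beta\,\hat{r}(x,y)\big)}{Z_\beta(x)},
\qquad
Z_\beta(x) = \sum_{y'} \pi_0(y' \mid x)\,\exp\!\big(\beta\,\hat{r}(x,y')\big), \\
\pi_\beta &= \arg\max_{\pi}\; \mathbb{E}_{y \sim \pi}\big[\hat{r}(x,y)\big]
  - \tfrac{1}{\beta}\,\mathrm{KL}\big(\pi \,\|\, \pi_0\big), \\
\Delta(\beta) &= \mathbb{E}_{y \sim \pi_\beta}\big[\hat{r}(x,y) - r^\star(x,y)\big].
\end{align}
```

The first two lines are the usual exponential tilting of \pi_0 and the objective it maximizes. The third is one possible way to write a gap between proxy and intent: as \beta grows, \pi_\beta concentrates on outputs with the highest proxy reward, which tends to include outputs where \hat{r} overestimates r^\star.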
[AI-106] Cumplimiento del Reglamento (UE) 2024/1689 en robótica y sistemas autónomos: una revisión sistemática de la literatura
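【Quick Read】: This systematic literature review examines shortfalls in the compliance of autonomous robotic systems with the EU Artificial Intelligence Act (Regulation (EU) 2024/1689), focusing on cybersecurity frameworks and methodologies. It finds progress in risk management and encrypted communications, but significant gaps in explainability modules, real-time human oversight, and knowledge base traceability. The review concludes that modular approaches integrating risk management, supervision, and continuous auditing are essential for meeting the AI Act's requirements in autonomous robotics.
Link: https://arxiv.org/abs/2509.05380
Authors: Yoana Pita Lorenzo
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Robotics (cs.RO)
Comments: in Spanish language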
Abstract:This systematic literature review analyzes the current state of compliance with Regulation (EU) 2024/1689 in autonomous robotic systems, focusing on cybersecurity frameworks and methodologies. Using the PRISMA protocol, 22 studies were selected from 243 initial records across IEEE Xplore, ACM DL, Scopus, and Web of Science. Findings reveal partial regulatory alignment: while progress has been made in risk management and encrypted communications, significant gaps persist in explainability modules, real-time human oversight, and knowledge base traceability. Only 40% of reviewed solutions explicitly address transparency requirements, and 30% implement failure intervention mechanisms. The study concludes that modular approaches integrating risk, supervision, and continuous auditing are essential to meet the AI Act mandates in autonomous robotics.
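[AI-107] ThreatGPT: An Agentic AI Framework for Enhancing Public Safety through Threat Modeling
【Quick Read】: This paper addresses the security threats that come with increasingly complex smart-city public safety systems, threats that can directly affect people's lives. Conventional security analysis usually requires deep specialist expertise, which limits who can apply it. The key to the solution is ThreatGPT, an agentic AI assistant that uses few-shot learning to generate relevant threat models from examples and lets users choose among popular frameworks such as STRIDE, MITRE ATT&CK, CVE reports, NIST, or CISA. Non-expert users can describe system components and see what might go wrong, how attackers could exploit it, and what can be done to prevent harm.
Link: https://arxiv.org/abs/2509.05379
Authors: Sharif Noor Zisad,Ragib Hasan
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: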
Abstract:As our cities and communities become smarter, the systems that keep us safe, such as traffic control centers, emergency response networks, and public transportation, also become more complex. With this complexity comes a greater risk of security threats that can affect not just machines but real people’s lives. To address this challenge, we present ThreatGPT, an agentic Artificial Intelligence (AI) assistant built to help people whether they are engineers, safety officers, or policy makers to understand and analyze threats in public safety systems. Instead of requiring deep cybersecurity expertise, it allows users to simply describe the components of a system they are concerned about, such as login systems, data storage, or communication networks. Then, with the click of a button, users can choose how they want the system to be analyzed by using popular frameworks such as STRIDE, MITRE ATT&CK, CVE reports, NIST, or CISA. ThreatGPT is unique because it does not just provide threat information, but rather it acts like a knowledgeable partner. Using few-shot learning, the AI learns from examples and generates relevant smart threat models. It can highlight what might go wrong, how attackers could take advantage, and what can be done to prevent harm. Whether securing a city’s infrastructure or a local health service, this tool adapts to users’ needs. In simple terms, ThreatGPT brings together AI and human judgment to make our public systems safer. It is designed not just to analyze threats, but to empower people to understand and act on them, faster, smarter, and with more confidence.
[AI-108] Code Like Humans: A Multi-Agent Solution for Medical Coding EMNLP
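【Quick Read】: This paper addresses medical coding, the mapping of unstructured clinical notes to diagnosis and procedure codes, and in particular the weakness of existing methods on rare diagnosis codes and their inability to cover the full ICD-10 coding system (more than 70K labels). The key to the solution is "Code Like Humans", an agentic framework based on large language models (LLMs) that implements the official coding guidelines used by human experts and is the first solution to support the full ICD-10 system. It achieves the best performance to date on rare diagnosis codes, while fine-tuned discriminative classifiers retain an advantage on the high-frequency codes to which they are limited. The paper also analyzes the system's "blind spots", codes that are systematically undercoded.
Link: https://arxiv.org/abs/2509.05378
Authors: Andreas Motzfeldt,Joakim Edin,Casper L. Christensen,Christian Hardmeier,Lars Maaløe,Anna Rogers
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: EMNLP Findings 2025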
Abstract:In medical coding, experts map unstructured clinical notes to alphanumeric codes for diagnoses and procedures. We introduce Code Like Humans: a new agentic framework for medical coding with large language models. It implements official coding guidelines for human experts, and it is the first solution that can support the full ICD-10 coding system (+70K labels). It achieves the best performance to date on rare diagnosis codes (fine-tuned discriminative classifiers retain an advantage for high-frequency codes, to which they are limited). Towards future work, we also contribute an analysis of system performance and identify its “blind spots” (codes that are systematically undercoded).
[AI-109] Privacy Preservation and Identity Tracing Prevention in AI-Driven Eye Tracking for Interactive Learning Environments
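【Quick Read】: This paper addresses privacy risks of eye tracking in educational settings, in particular that student identities (IDs) and neurodevelopmental disorder diagnoses can be traced back from the data. It proposes a human-centered privacy-preserving framework developed in two phases. The first phase uses serious-game eye-tracking data to show how diagnoses and student IDs can be predicted across four scenarios, exposing the backtracking paths. The second phase applies Federated Learning (FL) across multiple clients together with a secure identity management system based on dummy IDs and administrator-only access controls. The framework prevents identity backtracking while still enabling diagnostic classification (overall accuracy 99.40%), balancing pedagogical value with regulatory compliance such as GDPR.
Link: https://arxiv.org/abs/2509.05376
Authors: Abdul Rehman,Are Dæhlen,Ilona Heldal,Jerry Chun-wei Lin
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: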
Abstract:Eye-tracking technology can aid in understanding neurodevelopmental disorders and tracing a person’s identity. However, this technology poses a significant risk to privacy, as it captures sensitive information about individuals and increases the likelihood that data can be traced back to them. This paper proposes a human-centered framework designed to prevent identity backtracking while preserving the pedagogical benefits of AI-powered eye tracking in interactive learning environments. We explore how real-time data anonymization, ethical design principles, and regulatory compliance (such as GDPR) can be integrated to build trust and transparency. We first demonstrate the potential for backtracking student IDs and diagnoses in various scenarios using serious game-based eye-tracking data. We then provide a two-stage privacy-preserving framework that prevents participants from being tracked while still enabling diagnostic classification. The first phase covers four scenarios: I) Predicting disorder diagnoses based on different game levels. II) Predicting student IDs based on different game levels. III) Predicting student IDs based on randomized data. IV) Utilizing K-Means for out-of-sample data. In the second phase, we present a two-stage framework that preserves privacy. We also employ Federated Learning (FL) across multiple clients, incorporating a secure identity management system with dummy IDs and administrator-only access controls. In the first phase, the proposed framework achieved 99.3% accuracy for scenario 1, 63% accuracy for scenario 2, and 99.7% accuracy for scenario 3, successfully identifying and assigning a new student ID in scenario 4. In phase 2, we effectively prevented backtracking and established a secure identity management system with dummy IDs and administrator-only access controls, achieving an overall accuracy of 99.40%.
[AI-110] Characterizing Fitness Landscape Structures in Prompt Engineering
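【Quick Read】: This paper addresses the poorly understood structure of the optimization landscape in prompt engineering: current methods treat prompt optimization as a black-box problem and do not characterize the landscape topology they search. The key to the approach is a systematic autocorrelation analysis of prompt fitness across semantic embedding spaces, comparing two prompt generation strategies. Systematic enumeration yields smoothly decaying autocorrelation, whereas novelty-driven diversification shows non-monotonic patterns with peak correlation at intermediate semantic distances, indicating rugged, hierarchically structured landscapes. The findings give an empirical basis for understanding optimization complexity in prompt engineering.
Link: https://arxiv.org/abs/2509.05375
Authors: Arend Hintze
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: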
Abstract:While prompt engineering has emerged as a crucial technique for optimizing large language model performance, the underlying optimization landscape remains poorly understood. Current approaches treat prompt optimization as a black-box problem, applying sophisticated search algorithms without characterizing the landscape topology they navigate. We present a systematic analysis of fitness landscape structures in prompt engineering using autocorrelation analysis across semantic embedding spaces. Through experiments on error detection tasks with two distinct prompt generation strategies – systematic enumeration (1,024 prompts) and novelty-driven diversification (1,000 prompts) – we reveal fundamentally different landscape topologies. Systematic prompt generation yields smoothly decaying autocorrelation, while diversified generation exhibits non-monotonic patterns with peak correlation at intermediate semantic distances, indicating rugged, hierarchically structured landscapes. Task-specific analysis across 10 error detection categories reveals varying degrees of ruggedness across different error types. Our findings provide an empirical foundation for understanding the complexity of optimization in prompt engineering landscapes.
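One way to read "autocorrelation across semantic embedding spaces" is as the correlation of fitness between prompt pairs, grouped by how far apart their embeddings are. The sketch below implements that reading with Euclidean distance and fixed-width bins; the distance measure, the binning, and the function name are assumptions, not the paper's stated procedure.

```python
import math
from itertools import combinations

def distance_binned_autocorrelation(embeddings, fitness, bin_width):
    """Fitness autocorrelation per embedding-distance bin.

    embeddings: one coordinate list per prompt.
    fitness: one score per prompt (must not all be equal).
    Returns {bin index: correlation of fitness for pairs in that bin}.
    """
    mean_f = sum(fitness) / len(fitness)
    var_f = sum((f - mean_f) ** 2 for f in fitness) / len(fitness)
    sums, counts = {}, {}
    for i, j in combinations(range(len(fitness)), 2):
        distance = math.dist(embeddings[i], embeddings[j])
        bin_index = int(distance // bin_width)
        product = (fitness[i] - mean_f) * (fitness[j] - mean_f)
        sums[bin_index] = sums.get(bin_index, 0.0) + product
        counts[bin_index] = counts.get(bin_index, 0) + 1
    return {b: sums[b] / (counts[b] * var_f) for b in sorted(sums)}

# Toy landscape: four prompts on a line, fitness flips halfway along it.
profile = distance_binned_autocorrelation(
    [[0.0], [1.0], [2.0], [3.0]], [1.0, 1.0, -1.0, -1.0], bin_width=1.0
)
```

A profile that decays steadily with distance suggests a smooth landscape, while one that rises again at intermediate distances would match the non-monotonic pattern the abstract reports for diversified prompts.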
[AI-111] Long-Horizon Visual Imitation Learning via Plan and Code Reflection AAAI2026
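【Quick Read】: This paper addresses long-horizon visual imitation learning, where complex action sequences make it hard to model the temporal relationships between actions and the spatial relationships between objects. The key to the solution is a new agent framework with two dedicated reflection modules. A plan generation module produces an initial action sequence, which the plan reflection module checks for temporal coherence and spatial alignment with the demonstration video. A code generation module translates the plan into executable code, which the code reflection module verifies and refines for correctness and consistency with the plan. Together the two reflection modules detect and correct errors in both plan and code generation, improving performance on tasks with intricate temporal and spatial dependencies. The paper also introduces LongVILBench, a benchmark of 300 human demonstrations with action sequences of up to 18 steps.
Link: https://arxiv.org/abs/2509.05368
Authors: Quan Chen,Chenrui Shi,Qi Chen,Yuwei Wu,Zhi Gao,Xintong Zhang,Rui Gao,Kun Wu,Yunde Jia
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 4 figures. Submitted to AAAI 2026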
Abstract:Learning from long-horizon demonstrations with complex action sequences presents significant challenges for visual imitation learning, particularly in understanding temporal relationships of actions and spatial relationships between objects. In this paper, we propose a new agent framework that incorporates two dedicated reflection modules to enhance both plan and code generation. The plan generation module produces an initial action sequence, which is then verified by the plan reflection module to ensure temporal coherence and spatial alignment with the demonstration video. The code generation module translates the plan into executable code, while the code reflection module verifies and refines the generated code to ensure correctness and consistency with the generated plan. These two reflection modules jointly enable the agent to detect and correct errors in both the plan generation and code generation, improving performance in tasks with intricate temporal and spatial dependencies. To support systematic evaluation, we introduce LongVILBench, a benchmark comprising 300 human demonstrations with action sequences of up to 18 steps. LongVILBench emphasizes temporal and spatial complexity across multiple task types. Experimental results demonstrate that existing methods perform poorly on this benchmark, whereas our new framework establishes a strong baseline for long-horizon visual imitation learning.
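The generate-then-reflect pattern described above can be sketched as a small loop. The generator and reflector below are toy stand-ins for model calls, and the loop structure, names, and round limit are illustrative assumptions rather than the paper's implementation.

```python
def generate_with_reflection(generate, reflect, max_rounds=3):
    """Alternate generation and reflection until no issues remain.

    generate(feedback) -> candidate, given a list of issues to fix.
    reflect(candidate) -> list of issues (an empty list means accepted).
    The candidate from the final round is returned without a further check.
    """
    feedback = []
    candidate = generate(feedback)
    for _ in range(max_rounds):
        issues = reflect(candidate)
        if not issues:
            break
        feedback = issues
        candidate = generate(feedback)
    return candidate

def toy_plan_generator(feedback):
    # Stand-in for a model call: the first draft has the steps in the wrong order.
    return ["pick cube", "place cube"] if feedback else ["place cube", "pick cube"]

def toy_plan_reflector(plan):
    # Stand-in temporal check: an object must be picked before it is placed.
    if plan.index("pick cube") > plan.index("place cube"):
        return ["'pick cube' must come before 'place cube'"]
    return []

plan = generate_with_reflection(toy_plan_generator, toy_plan_reflector)
```

The same loop could be reused for the code stage by passing a code generator and a reflector that checks the code against the accepted plan.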
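[AI-112] Between a Rock and a Hard Place: Exploiting Ethical Reasoning to Jailbreak LLMs
【Quick Read】: This paper addresses new security risks that stronger reasoning introduces into safety-aligned large language models (LLMs). Traditional jailbreak attacks have relied on single-step prompts, while multi-turn strategies that adapt dynamically to context remain underexplored. The key to its approach is TRIAL (Trolley-problem Reasoning for Interactive Attack Logic), a framework that embeds adversarial goals in ethical dilemmas modeled on the trolley problem and uses the model's own ethical reasoning to bypass its safeguards. TRIAL achieves high jailbreak success rates against both open-source and closed-source models, exposing the limits of current safety alignment for models with advanced reasoning abilities.
Link: https://arxiv.org/abs/2509.05367
Authors: Shei Pern Chua,Thai Zhen Leng,Teh Kai Jun,Xiao Li,Xiaolin Hu
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: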
Abstract:Large language models (LLMs) have undergone safety alignment efforts to mitigate harmful outputs. However, as LLMs become more sophisticated in reasoning, their intelligence may introduce new security risks. While traditional jailbreak attacks relied on single-step attacks, multi-turn jailbreak strategies that adapt dynamically to context remain underexplored. In this work, we introduce TRIAL (Trolley-problem Reasoning for Interactive Attack Logic), a framework that leverages LLMs’ ethical reasoning to bypass their safeguards. TRIAL embeds adversarial goals within ethical dilemmas modeled on the trolley problem. TRIAL demonstrates high jailbreak success rates against both open- and closed-source models. Our findings underscore a fundamental limitation in AI safety: as models gain advanced reasoning abilities, the nature of their alignment may inadvertently allow for more covert security vulnerabilities to be exploited. TRIAL raises an urgent need to reevaluate safety alignment oversight strategies, as current safeguards may prove insufficient against context-aware adversarial attacks.
[AI-113] Prototyping an AI-powered Tool for Energy Efficiency in New Zealand Homes
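【Quick Read】: This paper addresses obstacles to improving residential energy efficiency in New Zealand: many retrofits remain partial, data on household performance are limited, and decision-making support for homeowners is fragmented, so energy hardship persists. The key to the solution is a prototype AI-powered decision-support tool, developed and evaluated with domain experts, that integrates data ingestion, anomaly detection, baseline modeling, and scenario simulation (e.g., LED retrofits, insulation upgrades) into a modular dashboard. The tool translates national policies into personalized, household-level guidance, linking subsidy programs and building standards to practical decisions, and offers a replicable framework for reducing energy hardship, improving health outcomes, and supporting climate goals.
Link: https://arxiv.org/abs/2509.05364
Authors: Abdollah Baghaei Daemei
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: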
Abstract:Residential buildings contribute significantly to energy use, health outcomes, and carbon emissions. In New Zealand, housing quality has historically been poor, with inadequate insulation and inefficient heating contributing to widespread energy hardship. Recent reforms, including the Warmer Kiwi Homes program, Healthy Homes Standards, and H1 Building Code upgrades, have delivered health and comfort improvements, yet challenges persist. Many retrofits remain partial, data on household performance are limited, and decision-making support for homeowners is fragmented. This study presents the design and evaluation of an AI-powered decision-support tool for residential energy efficiency in New Zealand. The prototype, developed using Python and Streamlit, integrates data ingestion, anomaly detection, baseline modeling, and scenario simulation (e.g., LED retrofits, insulation upgrades) into a modular dashboard. Fifteen domain experts, including building scientists, consultants, and policy practitioners, tested the tool through semi-structured interviews. Results show strong usability (M = 4.3), high value of scenario outputs (M = 4.5), and positive perceptions of its potential to complement subsidy programs and regulatory frameworks. The tool demonstrates how AI can translate national policies into personalized, household-level guidance, bridging the gap between funding, standards, and practical decision-making. Its significance lies in offering a replicable framework for reducing energy hardship, improving health outcomes, and supporting climate goals. Future development should focus on carbon metrics, tariff modeling, integration with national datasets, and longitudinal trials to assess real-world adoption.
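[AI-114] SasAgent: Multi-Agent AI System for Small-Angle Scattering Data Analysis
【Quick Read】: This paper addresses the inefficiency and reliance on expert experience in small-angle scattering (SAS) data analysis, which typically involves manual steps such as scattering length density (SLD) calculation, synthetic data generation, and fitting of experimental data. The key to the solution is SasAgent, a multi-agent system powered by large language models (LLMs). A coordinator agent interprets the user's natural-language prompt and delegates tasks to three specialized agents, which use tools derived from the SasView Python library (the model data tool, a Retrieval-Augmented Generation (RAG) documentation tool, the bump fitting tool, and the SLD calculator tool). A Gradio-based interface provides text-driven, end-to-end automated analysis.
Link: https://arxiv.org/abs/2509.05363
Authors: Lijie Ding,Changwoo Do
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Multiagent Systems (cs.MA)
Comments: 8 pages, 7 figures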
Abstract:We introduce SasAgent, a multi-agent AI system powered by large language models (LLMs) that automates small-angle scattering (SAS) data analysis by leveraging tools from the SasView software and enables user interaction via text input. SasAgent features a coordinator agent that interprets user prompts and delegates tasks to three specialized agents for scattering length density (SLD) calculation, synthetic data generation, and experimental data fitting. These agents utilize LLM-friendly tools to execute tasks efficiently. These tools, including the model data tool, Retrieval-Augmented Generation (RAG) documentation tool, bump fitting tool, and SLD calculator tool, are derived from the SasView Python library. A user-friendly Gradio-based interface enhances user accessibility. Through diverse examples, we demonstrate SasAgent’s ability to interpret complex prompts, calculate SLDs, generate accurate scattering data, and fit experimental datasets with high precision. This work showcases the potential of LLM-driven AI systems to streamline scientific workflows and enhance automation in SAS research.
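[AI-115] AI-in-the-Loop: Privacy Preserving Real-Time Scam Detection and Conversational Scambaiting by Leveraging LLMs and Federated Learning
【Quick Read】: This paper addresses real-time social engineering scams on digital platforms (such as phishing, impersonation, and phone fraud), which keep evolving while existing defenses are largely reactive and offer limited protection during an active interaction. The key to the solution is a privacy-preserving "AI-in-the-loop" framework that combines an instruction-tuned model with a safety-aware utility function balancing engagement against harm minimization, and uses federated learning to update models continually without sharing raw data, so that scam conversations are detected and disrupted in real time. The system produces fluent responses (perplexity as low as 22.3) with strong engagement (≈0.80) and, in federated settings, low PII leakage (≤0.0085); novelty and safety remain stable under differential privacy. The authors describe it as the first framework to unify real-time scam-baiting, federated privacy preservation, and calibrated safety moderation in a proactive defense paradigm.
Link: https://arxiv.org/abs/2509.05362
Authors: Ismail Hossain,Sai Puppala,Sajedul Talukder,Md Jahangir Alam
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: This paper got accepted in 26th Privacy Enhancing Technologies Symposium (PETS 2026). We uploaded it into ArXiv as pre-print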
Abstract:Scams exploiting real-time social engineering – such as phishing, impersonation, and phone fraud – remain a persistent and evolving threat across digital platforms. Existing defenses are largely reactive, offering limited protection during active interactions. We propose a privacy-preserving, AI-in-the-loop framework that proactively detects and disrupts scam conversations in real time. The system combines instruction-tuned artificial intelligence with a safety-aware utility function that balances engagement with harm minimization, and employs federated learning to enable continual model updates without raw data sharing. Experimental evaluations show that the system produces fluent and engaging responses (perplexity as low as 22.3, engagement ≈ 0.80), while human studies confirm significant gains in realism, safety, and effectiveness over strong baselines. In federated settings, models trained with FedAvg sustain up to 30 rounds while preserving high engagement (≈ 0.80), strong relevance (≈ 0.74), and low PII leakage (≤ 0.0085). Even with differential privacy, novelty and safety remain stable, indicating that robust privacy can be achieved without sacrificing performance. The evaluation of guard models (LlamaGuard, LlamaGuard2/3, MD-Judge) shows a straightforward pattern: stricter moderation settings reduce the chance of exposing personal information, but they also limit how much the model engages in conversation. In contrast, more relaxed settings allow longer and richer interactions, which improve scam detection, but at the cost of higher privacy risk. To our knowledge, this is the first framework to unify real-time scam-baiting, federated privacy preservation, and calibrated safety moderation into a proactive defense paradigm.
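The abstract names FedAvg as the federated training method. As a reminder of what one aggregation round does, here is a minimal sketch in which each client's parameters are a flat list of floats and the server averages them weighted by local dataset size. The flat-list representation and the function name are simplifications for illustration; the paper's training setup is not described at this level.

```python
def fedavg(client_weights, client_sizes):
    """One FedAvg aggregation step.

    client_weights: one flat parameter list per client (equal lengths).
    client_sizes: number of local training examples per client.
    Returns the size-weighted average of the client parameters.
    """
    total_examples = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(weights[k] * size for weights, size in zip(client_weights, client_sizes))
        / total_examples
        for k in range(n_params)
    ]

# Two clients; the second holds three times as much data as the first.
global_weights = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

Only parameters leave each client in this scheme, which is why raw conversation data never has to be shared with the server.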
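[AI-116] Governing AI R&D: A Legal Framework for Constraining Dangerous AI
[Quick read]: This paper examines the problem that, as AI technology advances rapidly, government regulation may be challenged in court, with litigation risk in the United States arising in particular from the First Amendment, administrative law, and the Fourteenth Amendment. The key to the solution is for lawmakers to identify and address likely legal disputes before enacting AI regulation, designing and implementing the regulatory framework carefully enough to avoid invalidation, so that AI governance is both effective and lawful.
Link: https://arxiv.org/abs/2509.05361
Authors: Alex Mark,Aaron Scher
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: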
Abstract:As AI advances, governing its development may become paramount to public safety. Lawmakers may seek to restrict the development and release of AI models or of AI research itself. These governance actions could trigger legal challenges that invalidate the actions, so lawmakers should consider these challenges ahead of time. We investigate three classes of potential litigation risk for AI regulation in the U.S.: the First Amendment, administrative law, and the Fourteenth Amendment. We discuss existing precedent that is likely to apply to AI, which legal challenges are likely to arise, and how lawmakers might preemptively address them. Effective AI regulation is possible, but it requires careful implementation to avoid these legal challenges.
[AI-117] Spiking Neural Networks for Continuous Control via End-to-End Model-Based Learning
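[Quick read]: This paper addresses the limited use of spiking neural networks (SNNs) in continuous motor control, in particular how to achieve stable and accurate control of robotic arms with multiple degrees of freedom. The key to the solution is an end-to-end trainable predictive-control framework that combines Leaky Integrate-and-Fire (LIF) dynamics with surrogate gradients and jointly optimizes a forward model for dynamics prediction and a policy network for goal-directed action. Experiments on a planar 2D reaching task and a simulated 6-DOF Franka Emika Panda robot show that SNNs can be trained stably and deliver accurate torque control, establishing their viability for high-dimensional motor tasks.
Link: https://arxiv.org/abs/2509.05356
Authors: Justus Huebotter,Pablo Lanillos,Marcel van Gerven,Serge Thill
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: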
Abstract:Despite recent progress in training spiking neural networks (SNNs) for classification, their application to continuous motor control remains limited. Here, we demonstrate that fully spiking architectures can be trained end-to-end to control robotic arms with multiple degrees of freedom in continuous environments. Our predictive-control framework combines Leaky Integrate-and-Fire dynamics with surrogate gradients, jointly optimizing a forward model for dynamics prediction and a policy network for goal-directed action. We evaluate this approach on both a planar 2D reaching task and a simulated 6-DOF Franka Emika Panda robot. Results show that SNNs can achieve stable training and accurate torque control, establishing their viability for high-dimensional motor tasks. An extensive ablation study highlights the role of initialization, learnable time constants, and regularization in shaping training dynamics. We conclude that while stable and effective control can be achieved, recurrent spiking networks remain highly sensitive to hyperparameter settings, underscoring the importance of principled design choices.
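To make the combination of Leaky Integrate-and-Fire dynamics and surrogate gradients concrete, here is a minimal single-neuron sketch. The decay factor `beta`, the subtractive reset, and the sigmoid-shaped surrogate are illustrative assumptions; the abstract does not say which variants the authors use.

```python
import math


def lif_step(v, input_current, beta=0.9, threshold=1.0):
    """One discrete-time LIF update: leak, integrate, fire, subtractive reset."""
    v = beta * v + input_current              # leaky integration
    spike = 1.0 if v >= threshold else 0.0    # Heaviside firing rule
    v = v - spike * threshold                 # reset by subtracting the threshold
    return v, spike


def surrogate_grad(v, threshold=1.0, slope=5.0):
    """Sigmoid derivative used in place of the Heaviside step's gradient,
    which is zero almost everywhere and therefore blocks backpropagation."""
    s = 1.0 / (1.0 + math.exp(-slope * (v - threshold)))
    return slope * s * (1.0 - s)


# Drive one neuron with a constant input for 10 steps (hypothetical values).
v, spikes = 0.0, []
for _ in range(10):
    v, s = lif_step(v, input_current=0.3)
    spikes.append(s)
```

In training, the forward pass keeps the hard threshold while the backward pass substitutes `surrogate_grad`, which is what allows a spiking network to be optimized end to end with gradient descent.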
[AI-118] Benchmarking Large Language Models for Personalized Guidance in AI-Enhanced Learning
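[Quick read]: This paper addresses the lack of systematic empirical comparisons of large language models (LLMs) in realistic personalized-learning scenarios, where their effectiveness and differences as intelligent teaching assistants remain unclear. The key to the solution is an evaluation framework that simulates a realistic tutoring setting. First, given a dataset of a student's quiz answers with correctness labels, each LLM must complete three core tasks: identify the underlying knowledge components, infer the student's mastery profile, and generate targeted guidance. Second, Gemini serves as a virtual judge that performs pairwise comparisons along accuracy, clarity, actionability, and appropriateness, and the resulting preferences are quantified with the Bradley-Terry model. This design reduces human subjectivity and shows that GPT-4o is generally preferred for its more informative and better-structured feedback, providing a reproducible methodological basis for research on LLM-driven personalized learning.
Link: https://arxiv.org/abs/2509.05346
Authors: Bo Yuan,Jiazi Hu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: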
Abstract:While Large Language Models (LLMs) are increasingly envisioned as intelligent assistants for personalized learning, systematic head-to-head evaluations within authentic learning scenarios remain limited. This study conducts an empirical comparison of three state-of-the-art LLMs on a tutoring task that simulates a realistic learning setting. Using a dataset comprising a student’s answers to ten questions of mixed formats with correctness labels, each LLM is required to (i) analyze the quiz to identify underlying knowledge components, (ii) infer the student’s mastery profile, and (iii) generate targeted guidance for improvement. To mitigate subjectivity and evaluator bias, we employ Gemini as a virtual judge to perform pairwise comparisons along various dimensions: accuracy, clarity, actionability, and appropriateness. Results analyzed via the Bradley-Terry model indicate that GPT-4o is generally preferred, producing feedback that is more informative and better structured than its counterparts, while DeepSeek-V3 and GLM-4.5 demonstrate intermittent strengths but lower consistency. These findings highlight the feasibility of deploying LLMs as advanced teaching assistants for individualized support and provide methodological guidance for future empirical research on LLM-driven personalized learning.
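The step from pairwise judgments to a ranking can be sketched with a small Bradley-Terry fit. The win counts below are invented toy numbers, not results from the paper, and the iterative minorization-maximization update is one common fitting method that may differ from what the authors used.

```python
def bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths from a win matrix.

    wins[i][j] is the number of comparisons in which item i beat item j.
    Returns strengths normalized to sum to 1; under the model,
    P(i beats j) = p[i] / (p[i] + p[j]).
    Assumes every pair was compared at least once.
    """
    k = len(wins)
    p = [1.0 / k] * k
    for _ in range(n_iter):
        new_p = []
        for i in range(k):
            total_wins = sum(wins[i][j] for j in range(k) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(k) if j != i
            )
            new_p.append(total_wins / denom)
        norm = sum(new_p)
        p = [x / norm for x in new_p]
    return p


# Hypothetical judge outcomes for three models, 10 comparisons per pair.
wins = [
    [0, 7, 8],
    [3, 0, 6],
    [2, 4, 0],
]
strengths = bradley_terry(wins)
```

A higher strength means the judge prefers that model more often across all pairings, which gives a single scale for comparing models even when individual matchups are noisy.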
[AI-119] Plantbot: Integrating Plant and Robot through LLM Modular Agent Networks
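[Quick read]: This paper asks how biological and artificial systems can interact and be controlled together seamlessly, in particular the cross-domain coordination problem that arises when building a hybrid lifeform with autonomous sensing and action. The key to the solution is an asynchronous network of large language model (LLM) modules in which each module (sensing, vision, dialogue, action) communicates in natural language, translating plant states such as soil moisture and temperature into linguistic messages that drive the robot's behavior. This installs normativity within the biological-robotic sensor-motor loop and gives the overall system agent-like adaptive responses.
Link: https://arxiv.org/abs/2509.05338
Authors: Atsushi Masumori,Norihiro Maruyama,Itsuki Doi,johnsmith,Hiroki Sato,Takashi Ikegami
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: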
Abstract:We introduce Plantbot, a hybrid lifeform that connects a living plant with a mobile robot through a network of large language model (LLM) modules. Each module - responsible for sensing, vision, dialogue, or action - operates asynchronously and communicates via natural language, enabling seamless interaction across biological and artificial domains. This architecture leverages the capacity of LLMs to serve as hybrid interfaces, where natural language functions as a universal protocol, translating multimodal data (soil moisture, temperature, visual context) into linguistic messages that coordinate system behaviors. The integrated network transforms plant states into robotic actions, installing normativity essential for agency within the sensor-motor loop. By combining biological and robotic elements through LLM-mediated communication, Plantbot behaves as an embodied, adaptive agent capable of responding autonomously to environmental conditions. This approach suggests possibilities for a new model of artificial life, where decentralized coordination among LLM modules enables novel interactions between biological and artificial systems.
[AI-120] Integrated Simulation Framework for Adversarial Attacks on Autonomous Vehicles
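[Quick read]: This paper addresses adversarial attacks on the perception and communication layers of autonomous vehicles (AVs), for which existing simulation frameworks generally lack comprehensive support for modeling multi-domain adversarial scenarios. The key to the solution is an open-source integrated simulation framework whose unified core synchronizes multiple simulators from a single configuration file, providing high-fidelity modeling of physical environments, traffic dynamics, and vehicle-to-everything (V2X) networking. It generates diverse adversarial scenarios, covering attacks on LiDAR sensor data as well as communication-level threats such as V2X message manipulation and GPS spoofing, while a ROS 2 interface keeps it compatible with third-party AV software stacks. The framework is used to evaluate the impact of these attacks on 3D object detection.
Link: https://arxiv.org/abs/2509.05332
Authors: Christos Anagnostopoulos,Ioulia Kapsali,Alexandros Gkillas,Nikos Piperigkos,Aris S. Lalos
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 6 pages, 2 figures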
Abstract:Autonomous vehicles (AVs) rely on complex perception and communication systems, making them vulnerable to adversarial attacks that can compromise safety. While simulation offers a scalable and safe environment for robustness testing, existing frameworks typically lack comprehensive support for modeling multi-domain adversarial scenarios. This paper introduces a novel, open-source integrated simulation framework designed to generate adversarial attacks targeting both perception and communication layers of AVs. The framework provides high-fidelity modeling of physical environments, traffic dynamics, and V2X networking, orchestrating these components through a unified core that synchronizes multiple simulators based on a single configuration file. Our implementation supports diverse perception-level attacks on LiDAR sensor data, along with communication-level threats such as V2X message manipulation and GPS spoofing. Furthermore, ROS 2 integration ensures seamless compatibility with third-party AV software stacks. We demonstrate the framework’s effectiveness by evaluating the impact of generated adversarial scenarios on a state-of-the-art 3D object detector, revealing significant performance degradation under realistic conditions.
[AI-121] MVRS: The Multimodal Virtual Reality Stimuli-based Emotion Recognition Dataset
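[Quick read]: This paper addresses the shortage of multimodal datasets for automatic emotion recognition (AER), especially datasets that synchronize body motion with physiological signals, which limits progress in multimodal affective computing. The key to the solution is the MVRS dataset: 13 participants aged 12 to 60 were exposed under a unified protocol to VR-based emotional stimuli (relaxation, fear, stress, sadness, joy) while eye tracking (via a webcam in the VR headset), body motion (Kinect v2), and electromyography (EMG) and galvanic skin response (GSR) signals (Arduino UNO) were recorded with aligned timestamps. Features extracted from each modality were combined with early and late fusion and evaluated with classifiers to confirm the dataset's quality and the separability of the emotion classes.
Link: https://arxiv.org/abs/2509.05330
Authors: Seyed Muhammad Hossein Mousavi,Atiye Ilanloo
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: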
Abstract:Automatic emotion recognition has become increasingly important with the rise of AI, especially in fields like healthcare, education, and automotive systems. However, there is a lack of multimodal datasets, particularly involving body motion and physiological signals, which limits progress in the field. To address this, the MVRS dataset is introduced, featuring synchronized recordings from 13 participants aged 12 to 60 exposed to VR-based emotional stimuli (relaxation, fear, stress, sadness, joy). Data were collected using eye tracking (via webcam in a VR headset), body motion (Kinect v2), and EMG and GSR signals (Arduino UNO), all timestamp-aligned. Participants followed a unified protocol with consent and questionnaires. Features from each modality were extracted, fused using early and late fusion techniques, and evaluated with classifiers to confirm the dataset's quality and emotion separability, making MVRS a valuable contribution to multimodal affective computing.
[AI-122] Zero-Knowledge Proofs in Sublinear Space
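[Quick read]: This paper addresses the fact that in modern zero-knowledge proof (ZKP) systems the prover's memory grows linearly with the computation's trace length $T$, which makes ZKPs hard to use on resource-constrained devices and for large-scale tasks. The key to the solution is to recast proof generation as an instance of the classic Tree Evaluation problem and to use a recent space-efficient tree-evaluation algorithm to design a streaming prover that builds the proof without ever materializing the full execution trace. This reduces prover memory from linear in $T$ to $O(\sqrt{T})$ (up to logarithmic lower-order terms) while preserving the proof size, verifier time, and security guarantees of the underlying system, enabling a shift from server-bound proving to on-device proving.
Link: https://arxiv.org/abs/2509.05326
Authors: Logan Nye
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 21 pages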
Abstract:Modern zero-knowledge proof (ZKP) systems, essential for privacy and verifiable computation, suffer from a fundamental limitation: the prover typically uses memory that scales linearly with the computation’s trace length T, making them impractical for resource-constrained devices and prohibitively expensive for large-scale tasks. This paper overcomes this barrier by constructing, to our knowledge, the first sublinear-space ZKP prover. Our core contribution is an equivalence that reframes proof generation as an instance of the classic Tree Evaluation problem. Leveraging a recent space-efficient tree-evaluation algorithm, we design a streaming prover that assembles the proof without ever materializing the full execution trace. The approach reduces prover memory from linear in T to O(sqrt(T)) (up to O(log T) lower-order terms) while preserving proof size, verifier time, and the transcript/security guarantees of the underlying system. This enables a shift from specialized, server-bound proving to on-device proving, opening applications in decentralized systems, on-device machine learning, and privacy-preserving technologies.
[AI-123] SynDelay: A Synthetic Dataset for Delivery Delay Prediction
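[Quick read]: This paper addresses the scarcity of high-quality, openly available datasets that constrains predictive tasks in supply chain management, such as delivery delay prediction. Existing datasets are often proprietary, small, or inconsistently maintained, which hinders reproducibility and benchmarking. The key to the solution is SynDelay, a synthetic dataset generated by an advanced generative model trained on real-world data, which preserves realistic delivery patterns while protecting privacy. Although it is not free of noise or inconsistencies, it offers a challenging and practical testbed for predictive modeling, and it ships with baseline results and evaluation metrics as initial reference points to support research in supply chain AI.
Link: https://arxiv.org/abs/2509.05325
Authors: Liming Xu,Yunbo Long,Alexandra Brintrup
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: This paper includes 1 figure and 2 tables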
Abstract:Artificial intelligence (AI) is transforming supply chain management, yet progress in predictive tasks – such as delivery delay prediction – remains constrained by the scarcity of high-quality, openly available datasets. Existing datasets are often proprietary, small, or inconsistently maintained, hindering reproducibility and benchmarking. We present SynDelay, a synthetic dataset designed for delivery delay prediction. Generated using an advanced generative model trained on real-world data, SynDelay preserves realistic delivery patterns while ensuring privacy. Although not entirely free of noise or inconsistencies, it provides a challenging and practical testbed for advancing predictive modelling. To support adoption, we provide baseline results and evaluation metrics as initial benchmarks, serving as reference points rather than state-of-the-art claims. SynDelay is publicly available through the Supply Chain Data Hub, an open initiative promoting dataset sharing and benchmarking in supply chain AI. We encourage the community to contribute datasets, models, and evaluation practices to advance research in this area. All code is openly accessible at this https URL.
[AI-124] Perception Graph for Cognitive Attack Reasoning in Augmented Reality
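[Quick read]: This paper addresses the vulnerability of augmented reality (AR) systems in tactical environments to cognitive attacks: because these systems rely on seamless human-computer interaction, an attacker who manipulates the user's perception can severely compromise decision-making. The key to the solution is a new model, the Perception Graph, which first mimics the human process of interpreting key information from a mixed reality (MR) environment, then represents the outcome in a semantically meaningful structure, and from it computes a quantitative score reflecting the level of perception distortion. This provides a robust and measurable way to detect and analyze such cognitive attacks.
Link: https://arxiv.org/abs/2509.05324
Authors: Rongqian Chen,Shu Hong,Rifatul Islam,Mahdi Imani,G. Gary Tan,Tian Lan
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by ACM MobiHoc XR Security workshop 2025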
Abstract:Augmented reality (AR) systems are increasingly deployed in tactical environments, but their reliance on seamless human-computer interaction makes them vulnerable to cognitive attacks that manipulate a user’s perception and severely compromise user decision-making. To address this challenge, we introduce the Perception Graph, a novel model designed to reason about human perception within these systems. Our model operates by first mimicking the human process of interpreting key information from an MR environment and then representing the outcomes using a semantically meaningful structure. We demonstrate how the model can compute a quantitative score that reflects the level of perception distortion, providing a robust and measurable method for detecting and analyzing the effects of such cognitive attacks.
[AI-125] Attention of a Kiss: Exploring Attention Maps in Video Diffusion for XAIxArts
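[Quick read]: This paper addresses the interpretability of attention mechanisms in generative video models, that is, how to understand where a model attends across time and space during text-to-video generation. The key to the solution is a tool built on the open-source Wan model that extracts and visualizes cross-attention maps, giving researchers and artists an interpretable window into the temporal and spatial behavior of attention. The method serves as an analytical tool and is also explored as raw artistic material, contributing to Explainable AI for the Arts (XAIxArts).
Link: https://arxiv.org/abs/2509.05323
Authors: Adam Cole,Mick Grierson
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: 3rd international workshop on eXplainable AI for the Arts (XAIxArts) at the ACM Creativity and Cognition Conference 2025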
Abstract:This paper presents an artistic and technical investigation into the attention mechanisms of video diffusion transformers. Inspired by early video artists who manipulated analog video signals to create new visual aesthetics, this study proposes a method for extracting and visualizing cross-attention maps in generative video models. Built on the open-source Wan model, our tool provides an interpretable window into the temporal and spatial behavior of attention in text-to-video generation. Through exploratory probes and an artistic case study, we examine the potential of attention maps as both analytical tools and raw artistic material. This work contributes to the growing field of Explainable AI for the Arts (XAIxArts), inviting artists to reclaim the inner workings of AI as a creative medium.
[AI-126] Backdoor Samples Detection Based on Perturbation Discrepancy Consistency in Pre-trained Language Models
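[Quick read]: This paper addresses the susceptibility of pre-trained models to backdoor attacks when they use unvetted third-party or internet data, and in particular how to detect backdoor samples without access to the poisoned model, extra clean samples, or significant computational resources. The key to the solution is a detection method based on perturbation discrepancy consistency evaluation (NETE). It builds on the observation that the change in perturbation discrepancy is smaller for backdoor samples than for clean samples, and it uses curvature to measure the discrepancy in log probability between perturbed samples and the original input, evaluating the consistency of that discrepancy to decide whether the input is a backdoor sample. The method can be applied in both the pre-training and post-training phases and needs only an off-the-shelf pre-trained model plus an automated mask-filling function to generate perturbations.
Link: https://arxiv.org/abs/2509.05318
Authors: Zuquan Peng,Jianming Fu,Lixin Zou,Li Zheng,Yanzhen Ren,Guojun Peng
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 13 pages, 9 figures, 8 tables, journal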
Abstract:The use of unvetted third-party and internet data renders pre-trained models susceptible to backdoor attacks. Detecting backdoor samples is critical to prevent backdoor activation during inference or injection during training. However, existing detection methods often require the defender to have access to the poisoned models, extra clean samples, or significant computational resources to detect backdoor samples, limiting their practicality. To address this limitation, we propose a backdoor sample detection method based on perturbatioN discrEpancy consisTency Evaluation (NETE). This is a novel detection method that can be used in both the pre-training and post-training phases. In the detection process, it only requires an off-the-shelf pre-trained model to compute the log probability of samples and an automated function based on a mask-filling strategy to generate perturbations. Our method is based on the interesting phenomenon that the change in perturbation discrepancy for backdoor samples is smaller than that for clean samples. Based on this phenomenon, we use curvature to measure the discrepancy in log probabilities between different perturbed samples and input samples, thereby evaluating the consistency of the perturbation discrepancy to determine whether the input sample is a backdoor sample. Experiments conducted on four typical backdoor attacks and five types of large language model backdoor attacks demonstrate that our detection strategy outperforms existing zero-shot black-box detection methods.
[AI-127] Standard vs. Modular Sampling: Best Practices for Reliable LLM Unlearning
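[Quick read]: This paper addresses limitations of current practice in large language model (LLM) unlearning, in particular that a single neighbor set and standard sampling strategies (1:1 or cyclic iteration sampling) may be neither effective nor stable on realistically complex data. The key to the solution is a set of improved best practices: first, use diverse neighbor sets to balance forget efficacy and model utility; second, recognize that standard 1:1 sampling is inefficient and yields poor results; and third, adopt the proposed Modular Entity-Level Unlearning (MELU) strategy as an alternative to cyclic sampling, which gives a clearer and more stable path to effective unlearning.
Link: https://arxiv.org/abs/2509.05316
Authors: Praveen Bushipaka,Lucia Passaro,Tommaso Cucinotta
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: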
Abstract:A conventional LLM Unlearning setting consists of two subsets, “forget” and “retain”, with the objectives of removing the undesired knowledge from the forget set while preserving the remaining knowledge from the retain set. In privacy-focused unlearning research, a retain set is often further divided into neighbor sets, containing data either directly or indirectly connected to the forget targets, and augmented by a general-knowledge set. A common practice in existing benchmarks is to employ only a single neighbor set together with general knowledge, which fails to reflect real-world data complexities and relationships. LLM Unlearning typically involves 1:1 sampling or cyclic iteration sampling. However, the efficacy and stability of these de facto standards have not been critically examined. In this study, we systematically evaluate these common practices. Our findings reveal that relying on a single neighbor set is suboptimal and that a standard sampling approach can obscure performance trade-offs. Based on this analysis, we propose and validate an initial set of best practices: (1) Incorporation of diverse neighbor sets to balance forget efficacy and model utility, (2) Standard 1:1 sampling methods are inefficient and yield poor results, (3) Our proposed Modular Entity-Level Unlearning (MELU) strategy as an alternative to cyclic sampling. We demonstrate that this modular approach, combined with robust algorithms, provides a clear and stable path towards effective unlearning.
[AI-128] Large Language Model Integration with Reinforcement Learning to Augment Decision-Making in Autonomous Cyber Operations
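[Quick read]: This paper addresses the inefficiency of reinforcement learning (RL) agents in Autonomous Cyber Operations (ACO) that learn from scratch: the agent must execute many undesirable actions to explore the environment, which leads to poor early performance and long training. The key to the solution is to use a large language model (LLM) pretrained on cybersecurity data as an external knowledge source that guides the RL agent's decisions during initial training, reducing needless exploratory actions, raising early rewards, and speeding up policy convergence. In experiments, the guided agent earns over 2x higher rewards in early training and converges to a favorable policy approximately 4,500 episodes faster than the baseline.
Link: https://arxiv.org/abs/2509.05311
Authors: Konur Tholl,François Rivest,Mariam El Mezouar,Ranwa Al Mallah
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: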
Abstract:Reinforcement Learning (RL) has shown great potential for autonomous decision-making in the cybersecurity domain, enabling agents to learn through direct environment interaction. However, RL agents in Autonomous Cyber Operations (ACO) typically learn from scratch, requiring them to execute undesirable actions to learn their consequences. In this study, we integrate external knowledge in the form of a Large Language Model (LLM) pretrained on cybersecurity data that our RL agent can directly leverage to make informed decisions. By guiding initial training with an LLM, we improve baseline performance and reduce the need for exploratory actions with obviously negative outcomes. We evaluate our LLM-integrated approach in a simulated cybersecurity environment, and demonstrate that our guided agent achieves over 2x higher rewards during early training and converges to a favorable policy approximately 4,500 episodes faster than the baseline.
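[AI-129] Towards Log Analysis with AI Agents: Cowrie Case Study
[Quick read]: This paper addresses two problems: the scarcity of real-world attack data, which holds back cybersecurity research and education, and the overwhelming volume of unstructured, heterogeneous logs produced by honeypots such as Cowrie, which makes manual analysis impractical. The key to the solution is a lightweight, automated approach in which AI agents parse, summarize, and extract insights from raw Cowrie honeypot logs, while the security implications of deploying such an autonomous system are also considered. Preliminary results indicate reduced manual effort and successful identification of attack patterns, laying groundwork for more advanced autonomous cybersecurity analysis.
Link: https://arxiv.org/abs/2509.05306
Authors: Enis Karaarslan,Esin Güler,Efe Emir Yüce,Cagatay Coban
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: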
Abstract:The scarcity of real-world attack data significantly hinders progress in cybersecurity research and education. Although honeypots like Cowrie effectively collect live threat intelligence, they generate overwhelming volumes of unstructured and heterogeneous logs, rendering manual analysis impractical. As a first step in our project on secure and efficient AI automation, this study explores the use of AI agents for automated log analysis. We present a lightweight and automated approach to process Cowrie honeypot logs. Our approach leverages AI agents to intelligently parse, summarize, and extract insights from raw data, while also considering the security implications of deploying such an autonomous system. Preliminary results demonstrate the pipeline’s effectiveness in reducing manual effort and identifying attack patterns, paving the way for more advanced autonomous cybersecurity analysis in future work.
[AI-130] Multi-IaC-Eval: Benchmarking Cloud Infrastructure as Code Across Multiple Formats
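[Quick read]: This paper addresses the complexity caused by non-standardized Infrastructure as Code (IaC) formats across cloud providers, and the lack of a systematic benchmark for evaluating large language models (LLMs) on IaC generation and modification across formats. The key to the solution is Multi-IaC-Bench, a benchmark dataset covering three IaC formats: AWS CloudFormation, Terraform, and Cloud Development Kit (CDK). It consists of triplets of an initial template, a natural language modification request, and the corresponding updated template, created through synthetic data generation with rigorous validation. The dataset provides a standardized evaluation framework for LLM-based IaC automation and shows that models reach high success rates (95%) on syntactic validity but still struggle with semantic alignment and complex infrastructure patterns; it also highlights the importance of prompt engineering and retry mechanisms.
Link: https://arxiv.org/abs/2509.05303
Authors: Sam Davidson,Li Sun,Bhavana Bhasker,Laurent Callot,Anoop Deoras
Affiliation: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: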
Abstract:Infrastructure as Code (IaC) is fundamental to modern cloud computing, enabling teams to define and manage infrastructure through machine-readable configuration files. However, different cloud service providers utilize diverse IaC formats. The lack of a standardized format requires cloud architects to be proficient in multiple IaC languages, adding complexity to cloud deployment. While Large Language Models (LLMs) show promise in automating IaC creation and maintenance, progress has been limited by the lack of comprehensive benchmarks across multiple IaC formats. We present Multi-IaC-Bench, a novel benchmark dataset for evaluating LLM-based IaC generation and mutation across AWS CloudFormation, Terraform, and Cloud Development Kit (CDK) formats. The dataset consists of triplets containing initial IaC templates, natural language modification requests, and corresponding updated templates, created through a synthetic data generation pipeline with rigorous validation. We evaluate several state-of-the-art LLMs on Multi-IaC-Bench, demonstrating that while modern LLMs can achieve high success rates (95%) in generating syntactically valid IaC across formats, significant challenges remain in semantic alignment and handling complex infrastructure patterns. Our ablation studies highlight the importance of prompt engineering and retry mechanisms in successful IaC generation. We release Multi-IaC-Bench to facilitate further research in AI-assisted infrastructure management and establish standardized evaluation metrics for this crucial domain.
[AI-131] Livia: An Emotion-Aware AR Companion Powered by Modular AI Agents and Progressive Memory Compression
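[Quick read]: This paper addresses the emotional and health challenges posed by loneliness and social isolation, proposing technology-based companionship and emotional support. Its core contribution is Livia, an emotion-aware augmented reality (AR) companion app that combines modular AI agents, multimodal affective computing, progressive memory compression, and AR-driven embodied interaction. Two new algorithms, Temporal Binary Compression (TBC) and Dynamic Importance Memory Filter (DIMF), manage long-term memory, significantly reducing storage requirements while retaining critical context, and multimodal emotion detection achieves high accuracy, supporting proactive and empathetic interaction that strengthens users' emotional bonds and satisfaction.
Link: https://arxiv.org/abs/2509.05298
Authors: Rui Xi,Xianghan Wang
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Accepted to the Proceedings of the 2025 International Conference on Artificial Intelligence and Virtual Reality (AIVR 2025). © 2025 Springer. This is the author-accepted manuscript. Rui Xi and Xianghan Wang contributed equally to this work. The final version will be available via SpringerLink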
Abstract:Loneliness and social isolation pose significant emotional and health challenges, prompting the development of technology-based solutions for companionship and emotional support. This paper introduces Livia, an emotion-aware augmented reality (AR) companion app designed to provide personalized emotional support by combining modular artificial intelligence (AI) agents, multimodal affective computing, progressive memory compression, and AR-driven embodied interaction. Livia employs a modular AI architecture with specialized agents responsible for emotion analysis, dialogue generation, memory management, and behavioral orchestration, ensuring robust and adaptive interactions. Two novel algorithms, Temporal Binary Compression (TBC) and Dynamic Importance Memory Filter (DIMF), effectively manage and prioritize long-term memory, significantly reducing storage requirements while retaining critical context. Our multimodal emotion detection approach achieves high accuracy, enhancing proactive and empathetic engagement. User evaluations demonstrated increased emotional bonds, improved satisfaction, and statistically significant reductions in loneliness. Users particularly valued Livia’s adaptive personality evolution and realistic AR embodiment. Future research directions include expanding gesture and tactile interactions, supporting multi-user experiences, and exploring customized hardware implementations.
[AI-132] Nonnegative matrix factorization and the principle of the common cause
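[Quick read]: This paper addresses the nonidentifiability problem of nonnegative matrix factorization (NMF) and the question of how to model dependent random variables within a probabilistic-causality framework. The key to the solution is to use the principle of the common cause (PCC) to obtain a stable estimate of the effective rank of NMF, which yields basis images that are insensitive to noise and to the seeds of local optimization. Specifically, PCC provides a predictability tool, based on an independent mixture model, that estimates the NMF rank robustly and stably under weak noise. Conversely, NMF is used to implement PCC approximately: a clustering method groups data points with the same common cause into the same cluster, and NMF is further applied to data denoising. This reciprocal relationship mitigates the instability that local optima and noise cause in conventional NMF.
Link: https://arxiv.org/abs/2509.03652
Authors: E. Khalafyan,A. E. Allahverdyan,A. Hovhannisyan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
Comments: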
Abstract:Nonnegative matrix factorization (NMF) is a known unsupervised data-reduction method. The principle of the common cause (PCC) is a basic methodological approach in probabilistic causality, which seeks an independent mixture model for the joint probability of two dependent random variables. It turns out that these two concepts are closely related. This relationship is explored reciprocally for several datasets of gray-scale images, which are conveniently mapped into probability models. On one hand, PCC provides a predictability tool that leads to a robust estimation of the effective rank of NMF. Unlike other estimates (e.g., those based on the Bayesian Information Criteria), our estimate of the rank is stable against weak noise. We show that NMF implemented around this rank produces features (basis images) that are also stable against noise and against seeds of local optimization, thereby effectively resolving the NMF nonidentifiability problem. On the other hand, NMF provides an interesting possibility of implementing PCC in an approximate way, where larger and positively correlated joint probabilities tend to be explained better via the independent mixture model. We work out a clustering method, where data points with the same common cause are grouped into the same cluster. We also show how NMF can be employed for data denoising.
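For readers unfamiliar with NMF itself, the sketch below factorizes a small nonnegative matrix with the classic multiplicative updates for the squared Frobenius error. It shows plain NMF only; the PCC-based rank estimate described in the abstract is not reproduced, and the toy matrix, the rank, and the helper names are assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frob_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, n_iter=500, seed=0, eps=1e-12):
    """Factorize V (n x m) into nonnegative W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)  # the result depends on this seed
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), elementwise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(rank)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(rank)]
             for i in range(n)]
    return w, h


# Toy nonnegative matrix built from two underlying components.
v = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0],
     [1.0, 1.0, 3.0]]
w, h = nmf(v, rank=2)
```

Different seeds can lead to different factor pairs with similar reconstruction error, which is the nonidentifiability the abstract refers to, and choosing `rank` well is the part the paper's PCC-based estimate targets.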
[AI-133] Disentangling Interaction and Bias Effects in Opinion Dynamics of Large Language Models
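[Quick read]: This paper addresses the distortion that systematic biases introduce when large language models (LLMs) are used to simulate human opinion dynamics. The core difficulty is that opinion trajectories generated in multi-turn dialogues are confounded by a topic bias toward prior opinions in the training data, an agreement bias that favors agreement irrespective of the question, and an anchoring bias toward the initiating agent's stance, which together obscure the effect of genuine interaction. The key to the solution is a Bayesian framework that disentangles and quantifies these three biases. Applied to multi-step dialogues, it shows that opinions tend to converge quickly to a shared attractor while the influence of the interaction fades over time. Fine-tuning an LLM on strongly opinionated statements (including misinformation) shifts the opinion attractor correspondingly. The approach exposes quantitative differences between LLMs and provides tools for future comparisons with human subjects.
Link: https://arxiv.org/abs/2509.06858
Authors: Vincent C. Brockers,David A. Ehrlich,Viola Priesemann
Affiliation: Unknown
Categories: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Adaptation and Self-Organizing Systems (nlin.AO)
Comments: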
Abstract:Large Language Models are increasingly used to simulate human opinion dynamics, yet the effect of genuine interaction is often obscured by systematic biases. We present a Bayesian framework to disentangle and quantify three such biases: (i) a topic bias toward prior opinions in the training data; (ii) an agreement bias favoring agreement irrespective of the question; and (iii) an anchoring bias toward the initiating agent’s stance. Applying this framework to multi-step dialogues reveals that opinion trajectories tend to quickly converge to a shared attractor, with the influence of the interaction fading over time, and the impact of biases differing between LLMs. In addition, we fine-tune an LLM on different sets of strongly opinionated statements (incl. misinformation) and demonstrate that the opinion attractor shifts correspondingly. Exposing stark differences between LLMs and providing quantitative tools to compare them to human subjects in the future, our approach highlights both chances and pitfalls in using LLMs as proxies for human behavior.
[AI-134] Integrating Spatial and Semantic Embeddings for Stereo Sound Event Localization in Videos
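[Quick read]: This paper addresses stereo sound event localization and detection with source distance estimation (3D SELD) in regular video content, that is, classifying sound events over time while localizing them in space and estimating source distance. The core challenge is modeling the relationships across spatial, temporal, and semantic dimensions. Traditional approaches rely on multichannel input, and data constraints limit how much they can benefit from large-scale pre-training. The key to the solution is to integrate pre-trained, contrastive language-aligned models, CLAP for audio and OWL-ViT for vision, whose semantic embeddings are fed into a modified Conformer module for multimodal fusion (the Cross-Modal Conformer). Combined with pre-training on large synthetic datasets, model ensembling, and visual post-processing, the system ranked second in the DCASE 2025 Challenge Task 3 (Track B).
Link: https://arxiv.org/abs/2509.06598
Authors: Davide Berghi,Philip J. B. Jackson
Affiliation: unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Comments: arXiv admin note: substantial text overlap with arXiv:2507.04845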
Abstract:In this study, we address the multimodal task of stereo sound event localization and detection with source distance estimation (3D SELD) in regular video content. 3D SELD is a complex task that combines temporal event classification with spatial localization, requiring reasoning across spatial, temporal, and semantic dimensions. The last is arguably the most challenging to model. Traditional SELD approaches typically rely on multichannel input, limiting their capacity to benefit from large-scale pre-training due to data constraints. To overcome this, we enhance a standard SELD architecture with semantic information by integrating pre-trained, contrastive language-aligned models: CLAP for audio and OWL-ViT for visual inputs. These embeddings are incorporated into a modified Conformer module tailored for multimodal fusion, which we refer to as the Cross-Modal Conformer. We perform an ablation study on the development set of the DCASE2025 Task3 Stereo SELD Dataset to assess the individual contributions of the language-aligned models and benchmark against the DCASE Task 3 baseline systems. Additionally, we detail the curation process of large synthetic audio and audio-visual datasets used for model pre-training. These datasets were further expanded through left-right channel swapping augmentation. Our approach, combining extensive pre-training, model ensembling, and visual post-processing, achieved second rank in the DCASE 2025 Challenge Task 3 (Track B), underscoring the effectiveness of our method. Future work will explore the modality-specific contributions and architectural refinements.
[AI-135] Integrated Detection and Tracking Based on Radar Range-Doppler Feature
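[Quick read]: This paper addresses the limitations of joint detection and tracking methods in exploiting the full potential of radar signals: the constant false-alarm rate (CFAR) model has limited capacity to represent information, complex scenes are modeled insufficiently, and the tracker receives too little information. The key to the solution is the Integrated Detection and Tracking based on radar feature (InDT) method, which has three main components: (1) a detection network that extracts features from each Range-Doppler (RD) matrix and outputs the target position through a feature enhancement module and a detection head; (2) a tracker that adaptively updates the measurement noise covariance of the Kalman filter according to detection confidence; and (3) a cosine-distance measure of the similarity of target RD features, which combines location and feature information to improve data association.
Link: https://arxiv.org/abs/2509.06569
Authors: Chenyu Zhang,Yuanhang Wu,Xiaoxi Ma,Wei Yi
Affiliation: unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: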
Abstract:Detection and tracking are the basic tasks of radar systems. Current joint detection and tracking methods, which focus on dynamically adjusting detection thresholds from tracking results, still struggle to fully exploit the potential of radar signals. The difficulties are mainly reflected in the limited capacity of the constant false-alarm rate model to accurately represent information, the insufficient depiction of complex scenes, and the limited information acquired by the tracker. We introduce the Integrated Detection and Tracking based on radar feature (InDT) method, which comprises a network architecture for radar signal detection and a tracker that leverages detection assistance. The InDT detector extracts feature information from each Range-Doppler (RD) matrix and then returns the target position through the feature enhancement module and the detection head. The InDT tracker adaptively updates the measurement noise covariance of the Kalman filter based on detection confidence. The similarity of target RD features is measured by cosine distance, which enhances the data association process by combining location and feature information. Finally, the efficacy of the proposed method was validated through testing on both simulated data and publicly available datasets.
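The confidence-adaptive Kalman update and the cosine-distance feature comparison mentioned in the abstract can be sketched roughly as follows. This is a minimal one-dimensional illustration, not the authors' implementation: the rule `r = r_base / confidence`, the function names, and the default values are assumptions, since the abstract does not state how confidence is mapped to the measurement noise covariance.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def kalman_update(x_pred, p_pred, z, confidence, r_base=1.0):
    """Scalar Kalman measurement update with confidence-scaled noise.

    Assumption (hypothetical rule): the measurement noise variance is
    r_base / confidence, so low-confidence detections are trusted less.
    """
    r = r_base / max(confidence, 1e-6)    # guard against zero confidence
    gain = p_pred / (p_pred + r)          # Kalman gain for a scalar state
    x_new = x_pred + gain * (z - x_pred)  # corrected state estimate
    p_new = (1.0 - gain) * p_pred         # corrected state variance
    return x_new, p_new
```

With this rule, a high-confidence detection pulls the state estimate closer to the measurement than a low-confidence one, and a small cosine distance between RD feature vectors could be combined with position distance when associating detections to tracks.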
[AI-136] Several Performance Bounds on Decentralized Online Optimization are Highly Conservative and Potentially Misleading
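[Quick read]: This paper addresses inaccurate performance assessment of decentralized online optimization algorithms: existing performance guarantees are often overly conservative and can lead to misguided algorithm choices. The key to the solution is the Performance Estimation Problem (PEP) approach, which automatically computes the exact worst-case performance of optimization algorithms. The step sizes of classical algorithms are then tuned on this basis, reducing their actual worst-case regret by up to 20%.
Link: https://arxiv.org/abs/2509.06466
Authors: Erwan Meunier,Julien M. Hendrickx
Affiliation: unknown
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
Comments: 7 pages, 5 figures. Paper accepted for the 64th IEEE Conference on Decision and Control (2025)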
Abstract:We analyze Decentralized Online Optimization algorithms using the Performance Estimation Problem approach, which allows the exact worst-case performance of optimization algorithms to be computed automatically. Our analysis shows that several available performance guarantees are very conservative, sometimes by multiple orders of magnitude, and can lead to misguided choices of algorithm. Moreover, at least in terms of worst-case performance, some algorithms appear not to benefit from inter-agent communications for a significant period of time. We show how to improve classical methods by tuning their step-sizes, and find that we can save up to 20% on their actual worst-case regret.
[AI-137] Musculoskeletal simulation of limb movement biomechanics in Drosophila melanogaster
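[Quick read]: This paper addresses the lack of biomechanical models of leg muscles in the fruit fly (Drosophila melanogaster): without anatomically and physically grounded 3D musculoskeletal models, motor neuron activity cannot be linked to joint movements. The key to the solution is the first data-driven 3D musculoskeletal model of Drosophila legs, implemented in both the OpenSim and MuJoCo simulation environments. The model uses a Hill-type muscle representation based on high-resolution X-ray scans of multiple fixed specimens, with muscle models built from morphological imaging data and unknown parameters optimized. Combining the model with 3D pose estimation data from behaving flies yields muscle-actuated behavioral replay, and imitation learning policies trained in MuJoCo are used to test how passive joint properties affect learning speed. The work provides an experimentally testable computational framework for studying motor control in a model organism and can be used to generate naturalistic and compliant locomotion.
Link: https://arxiv.org/abs/2509.06426
Authors: Pembe Gizem Özdil,Chuanfang Ning,Jasper S. Phelps,Sibo Wang-Chen,Guy Elisha,Alexander Blanke,Auke Ijspeert,Pavan Ramdya
Affiliation: unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 23 pages, 11 figures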
Abstract:Computational models are critical to advance our understanding of how neural, biomechanical, and physical systems interact to orchestrate animal behaviors. Despite the availability of near-complete reconstructions of the Drosophila melanogaster central nervous system, musculature, and exoskeleton, anatomically and physically grounded models of fly leg muscles are still missing. These models provide an indispensable bridge between motor neuron activity and joint movements. Here, we introduce the first 3D, data-driven musculoskeletal model of Drosophila legs, implemented in both OpenSim and MuJoCo simulation environments. Our model incorporates a Hill-type muscle representation based on high-resolution X-ray scans from multiple fixed specimens. We present a pipeline for constructing muscle models using morphological imaging data and for optimizing unknown muscle parameters specific to the fly. We then combine our musculoskeletal models with detailed 3D pose estimation data from behaving flies to achieve muscle-actuated behavioral replay in OpenSim. Simulations of muscle activity across diverse walking and grooming behaviors predict coordinated muscle synergies that can be tested experimentally. Furthermore, by training imitation learning policies in MuJoCo, we test the effect of different passive joint properties on learning speed and find that damping and stiffness facilitate learning. Overall, our model enables the investigation of motor control in an experimentally tractable model organism, providing insights into how biomechanics contribute to generation of complex limb movements. Moreover, our model can be used to control embodied artificial agents to generate naturalistic and compliant locomotion in simulated environments.
[AI-138] Statistical Inference for Misspecified Contextual Bandits
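[Quick read]: This paper addresses the failure of statistical inference in adaptive experiments that is caused by algorithmic adaptivity. In particular, when the reward model is misspecified, classical contextual bandit algorithms such as LinUCB may fail to achieve policy convergence, which undermines the replicability of experiments and the stability of online algorithms. The core of the solution is a broad class of algorithms that are guaranteed to converge even under model misspecification, together with a general inference framework based on an inverse-probability-weighted Z-estimator (IPW-Z). The estimator is asymptotically normal and comes with a consistent variance estimator, which yields robust and data-efficient confidence intervals.
Link: https://arxiv.org/abs/2509.06287
Authors: Yongyi Guo,Ziping Xu
Affiliation: unknown
Categories: Statistics Theory (math.ST); Artificial Intelligence (cs.AI)
Comments: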
Abstract:Contextual bandit algorithms have transformed modern experimentation by enabling real-time adaptation for personalized treatment and efficient use of data. Yet these advantages create challenges for statistical inference due to adaptivity. A fundamental property that supports valid inference is policy convergence, meaning that action-selection probabilities converge in probability given the context. Convergence ensures replicability of adaptive experiments and stability of online algorithms. In this paper, we highlight a previously overlooked issue: widely used algorithms such as LinUCB may fail to converge when the reward model is misspecified, and such non-convergence creates fundamental obstacles for statistical inference. This issue is practically important, as misspecified models – such as linear approximations of complex dynamic systems – are often employed in real-world adaptive experiments to balance bias and variance. Motivated by this insight, we propose and analyze a broad class of algorithms that are guaranteed to converge even under model misspecification. Building on this guarantee, we develop a general inference framework based on an inverse-probability-weighted Z-estimator (IPW-Z) and establish its asymptotic normality with a consistent variance estimator. Simulation studies confirm that the proposed method provides robust and data-efficient confidence intervals, and can outperform existing approaches that exist only in the special case of offline policy evaluation. Taken together, our results underscore the importance of designing adaptive algorithms with built-in convergence guarantees to enable stable experimentation and valid statistical inference in practice.
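To make the idea of an inverse-probability-weighted Z-estimator concrete, the sketch below solves the simplest possible estimating equation, a weighted mean of the rewards for one arm. It is an illustration under assumptions, not the paper's IPW-Z estimator: the function name, the data layout, and the choice of estimating equation are hypothetical, and `propensities[t]` is assumed to be the logged probability with which `actions[t]` was selected.

```python
def ipw_z_mean(rewards, actions, propensities, arm):
    """Solve sum_t w_t * (r_t - theta) = 0 for theta,
    with weights w_t = 1{a_t == arm} / pi_t.

    Assumption: propensities[t] is the probability with which the
    logged action actions[t] was chosen at step t.
    """
    weights = [
        (1.0 / p) if a == arm else 0.0
        for a, p in zip(actions, propensities)
    ]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("arm was never selected")
    # The root of the weighted estimating equation is a weighted mean.
    return sum(w * r for w, r in zip(weights, rewards)) / total
```

Observations collected when the arm was unlikely to be chosen receive larger weights, which compensates for the adaptive sampling. Valid confidence intervals would additionally need the convergence guarantees and variance estimator described in the abstract.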
[AI-139] Distillation of CNN Ensemble Results for Enhanced Long-Term Prediction of the ENSO Phenomenon
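[Quick read]: This paper addresses the limited accuracy of long-term El Nino Southern Oscillation (ENSO) forecasts that results from implicitly assuming equal skill across all ensemble members. The key is a strictly a-posteriori evaluation showing that any sufficiently large ensemble of ENSO forecasts contains a subset of members whose skill is substantially higher than that of the ensemble mean. Specifically, the Top-5 members selected by lowest root mean square error (RMSE) and by highest Pearson correlation show higher correlation and lower RMSE at all lead times, with an advantage that grows markedly with lead time and is most pronounced during key ENSO transition periods (such as SON and DJF) and in particular seasons (such as JJA and MJJ). The finding provides a basis for later work on identifying high-quality ensemble members and thereby improving ENSO forecasting skill.
Link: https://arxiv.org/abs/2509.06227
Authors: Saghar Ganji,Mohammad Naisipour,Alireza Hassani,Arash Adib
Affiliation: unknown
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Applied Physics (physics.app-ph)
Comments: 20 pages, 7 figures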
Abstract:Accurate long-term forecasting of the El Nino Southern Oscillation (ENSO) remains one of the biggest challenges in climate science. Although short- to medium-range performance has improved significantly thanks to advances in deep learning and statistical-dynamical hybrids, most operational systems still use the simple mean of all ensemble members, implicitly assuming equal skill across members. In this study, we demonstrate, through a strictly a-posteriori evaluation, that for any large enough ensemble of ENSO forecasts there is a subset of members whose skill is substantially higher than that of the ensemble mean. Using a state-of-the-art ENSO forecast system cross-validated against the 1986-2017 observed Nino3.4 index, we identify two Top-5 subsets, one ranked on lowest Root Mean Square Error (RMSE) and another on highest Pearson correlation. Across all leads, these outstanding members generally show higher correlation and lower RMSE, with the advantage rising sharply with lead time. At short leads (1 month), the subset raises the mean correlation by about +0.02 (+1.7%) and lowers the RMSE by around 0.14 °C (23.3%) compared to the All-40 mean; at extreme leads (23 months), the correlation is raised by +0.43 (+172%) and the RMSE is lowered by 0.18 °C (22.5%). The enhancements are largest during crucial ENSO transition periods such as SON and DJF, when accurate amplitude and phase forecasting is of greatest socio-economic benefit, and they are also season-dependent: mid-year seasons such as JJA and MJJ show very large RMSE reductions. This study provides a solid foundation for further investigations to identify reliable clues for detecting high-quality ensemble members, thereby enhancing forecasting skill.
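The a-posteriori Top-k selection described in the abstract can be illustrated with a short sketch. The function names and the list-of-lists data layout are assumptions made for illustration, and the example ranks members by RMSE only; the paper also ranks by Pearson correlation.

```python
import math


def rmse(forecast, observed):
    """Root mean square error between two equal-length series."""
    return math.sqrt(
        sum((f - o) ** 2 for f, o in zip(forecast, observed)) / len(observed)
    )


def top_k_members(ensemble, observed, k=5):
    """Return indices of the k members with the lowest RMSE.

    The ranking is a-posteriori: it needs the observations.
    """
    ranked = sorted(
        range(len(ensemble)), key=lambda i: rmse(ensemble[i], observed)
    )
    return ranked[:k]


def subset_mean(ensemble, indices):
    """Average the selected members at each time step."""
    n_steps = len(ensemble[0])
    return [
        sum(ensemble[i][t] for i in indices) / len(indices)
        for t in range(n_steps)
    ]
```

Because the ranking uses the observed index, it measures how much skill is available inside the ensemble rather than providing an operational forecast. Identifying such members ahead of time is the follow-up problem the abstract points to.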
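[AI-140] The Efficiency Frontier: Classical Shadows versus Quantum Footage
[Quick read]: This paper addresses how to extract information about quantum states efficiently when quantum and classical processors work together under limited measurement resources and limited classical post-processing power. Because the classical shadow method can be inefficient for a small number of highly non-local observables or when classical computing resources are constrained, the authors perform a full-stack resource analysis that quantitatively compares classical shadows with "quantum footage" (direct quantum measurement). The key is to identify the parameters that govern efficiency (number of qubits n, number of observables M, sparsity k, Pauli weight w, accuracy requirement ϵ, and failure tolerance δ) and to determine the break-even points between the two methods on different hardware platforms, which gives a quantitative route to designing optimal strategies for hybrid quantum-classical tomography.
Link: https://arxiv.org/abs/2509.06218
Authors: Shuowei Ma,Junyu Liu
Affiliation: unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 23 pages, many figures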
Abstract:Interfacing quantum and classical processors is an important subroutine in full-stack quantum algorithms. The so-called “classical shadow” method efficiently extracts essential classical information from quantum states, enabling the prediction of many properties of a quantum system from only a few measurements. However, for a small number of highly non-local observables, or when classical post-processing power is limited, the classical shadow method is not always the most efficient choice. Here, we address this issue quantitatively by performing a full-stack resource analysis that compares classical shadows with “quantum footage,” which refers to direct quantum measurement. Under certain assumptions, our analysis illustrates a boundary of download efficiency between classical shadows and quantum footage. For observables expressed as linear combinations of Pauli matrices, the classical shadow method outperforms direct measurement when the number of observables is large and the Pauli weight is small. For observables in the form of large Hermitian sparse matrices, the classical shadow method shows an advantage when the number of observables, the sparsity of the matrix, and the number of qubits fall within a certain range. The key parameters influencing this behavior include the number of qubits n, observables M, sparsity k, Pauli weight w, accuracy requirement ϵ, and failure tolerance δ. We also compare the resource consumption of the two methods on different types of quantum computers and identify break-even points where the classical shadow method becomes more efficient, which vary depending on the hardware. This paper opens a new avenue for quantitatively designing optimal strategies for hybrid quantum-classical tomography and provides practical insights for selecting the most suitable quantum measurement approach in real-world applications.
[AI-141] Meta-training of diffractive meta-neural networks for super-resolution direction of arrival estimation
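[Quick read]: This paper addresses two limitations of existing diffractive neural networks: insufficient training precision when integrating large-scale multidimensional metasurfaces, and the failure to exploit multidimensional electromagnetic (EM) field coding schemes for super-resolution sensing. The key to the solution is diffractive meta-neural networks (DMNNs). Pre-trained mini-metanets characterize the amplitude and phase responses of meta-atoms across polarizations and frequencies, and the structure parameters are inversely designed with gradient-based optimization to achieve accurate EM field modulation. The x- and y-polarization channels resolve azimuth and elevation angles simultaneously, while interleaved frequency-multiplexed angular intervals generate spectrally encoded super-oscillations, which enables full-angle, high-resolution direction of arrival (DOA) estimation. The architecture combines the high parallelism of the optical domain with all-optical coding to improve super-resolution and computational throughput.
Link: https://arxiv.org/abs/2509.05926
Authors: Songtao Yang,Sheng Gao,Chu Wu,Zejia Zhao,Haiou Zhang,Xing Lin
Affiliation: unknown
Categories: Optics (physics.optics); Artificial Intelligence (cs.AI); Applied Physics (physics.app-ph)
Comments: 47 pages, 17 figures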
Abstract:Diffractive neural networks leverage the high-dimensional characteristics of electromagnetic (EM) fields for high-throughput computing. However, existing architectures face challenges in integrating large-scale multidimensional metasurfaces with precise network training and have not utilized multidimensional EM field coding schemes for super-resolution sensing. Here, we propose diffractive meta-neural networks (DMNNs) for accurate EM field modulation through metasurfaces, which enable multidimensional multiplexing and coding for multi-task learning and high-throughput super-resolution direction of arrival estimation. DMNN integrates pre-trained mini-metanets to characterize the amplitude and phase responses of meta-atoms across different polarizations and frequencies, with structure parameters inversely designed using gradient-based meta-training. For wide-field super-resolution angle estimation, the system simultaneously resolves azimuthal and elevational angles through x- and y-polarization channels, while the interleaving of frequency-multiplexed angular intervals generates spectral-encoded optical super-oscillations to achieve full-angle high-resolution estimation. Lightweight electronic neural networks used for post-processing further enhance the performance. Experimental results validate that a three-layer DMNN operating at 27 GHz, 29 GHz, and 31 GHz achieves ~7× Rayleigh diffraction-limited angular resolution (0.5°), a mean absolute error of 0.048° for two incoherent targets within a ±11.5° field of view, and an angular estimation throughput an order of magnitude higher (1917) than that of existing methods. The proposed architecture advances high-dimensional photonic computing systems by utilizing inherent high-parallelism and all-optical coding methods for ultra-high-resolution, high-throughput applications.
[AI-142] Quantum spatial best-arm identification via quantum walks
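[Quick read]: This paper addresses the efficiency bottleneck of best-arm identification in the graph bandit problem, namely how to accelerate exploration and locate the optimal action quickly under spatial (topological) constraints. The key to the solution is the Quantum Spatial Best-Arm Identification (QSBAI) algorithm, which uses quantum walks to encode superpositions over graph-constrained actions and extends amplitude amplification through a generalized Szegedy walk framework. This design establishes a link between Grover-type search and reinforcement learning tasks with structural restrictions. For complete and bipartite graphs, the authors derive the maximal success probability of identifying the best arm and the time step at which it is achieved.
Link: https://arxiv.org/abs/2509.05890
Authors: Tomoki Yamagami,Etsuo Segawa,Takatomo Mihana,André Röhm,Atsushi Uchida,Ryoichi Horisaki
Affiliation: unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Mathematical Physics (math-ph)
Comments: 15 pages, 8 figures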
Abstract:Quantum reinforcement learning has emerged as a framework combining quantum computation with sequential decision-making, and applications to the multi-armed bandit (MAB) problem have been reported. The graph bandit problem extends the MAB setting by introducing spatial constraints, yet quantum approaches remain limited. We propose a quantum algorithm for best-arm identification in graph bandits, termed Quantum Spatial Best-Arm Identification (QSBAI). The method employs quantum walks to encode superpositions over graph-constrained actions, extending amplitude amplification and generalizing the Quantum BAI algorithm via Szegedy’s walk framework. This establishes a link between Grover-type search and reinforcement learning tasks with structural restrictions. We analyze complete and bipartite graphs, deriving the maximal success probability of identifying the best arm and the time step at which it is achieved. Our results highlight the potential of quantum walks to accelerate exploration in constrained environments and extend the applicability of quantum algorithms for decision-making.
[AI-143] Uncertainty Quantification in Probabilistic Machine Learning Models: Theory Methods and Insights
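[Quick read]: This paper addresses uncertainty quantification (UQ) for the predictions of probabilistic machine learning models, in particular the challenge of distinguishing and estimating epistemic and aleatoric uncertainty. The key to the solution is a systematic framework built on Gaussian Process Latent Variable Models, in which scalable Random Fourier Features-based Gaussian Processes approximate the predictive distributions efficiently. Uncertainty is estimated with a Monte Carlo sampling method, and a theoretical formulation of UQ is derived, which together quantify the confidence in the predictions.
Link: https://arxiv.org/abs/2509.05877
Authors: Marzieh Ajirak,Anand Ravishankar,Petar M. Djuric
Affiliation: unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to EUSIPCO 2025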
Abstract:Uncertainty Quantification (UQ) is essential in probabilistic machine learning models, particularly for assessing the reliability of predictions. In this paper, we present a systematic framework for estimating both epistemic and aleatoric uncertainty in probabilistic models. We focus on Gaussian Process Latent Variable Models and employ scalable Random Fourier Features-based Gaussian Processes to approximate predictive distributions efficiently. We derive a theoretical formulation for UQ, propose a Monte Carlo sampling-based estimation method, and conduct experiments to evaluate the impact of uncertainty estimation. Our results provide insights into the sources of predictive uncertainty and illustrate the effectiveness of our approach in quantifying the confidence in the predictions.
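One common way to separate epistemic from aleatoric uncertainty with Monte Carlo samples is the law of total variance. The sketch below shows that generic decomposition only; it is an assumption that it resembles the paper's formulation, which the abstract does not spell out. The function name and inputs are hypothetical: `means[i]` and `variances[i]` stand for the predictive mean and noise variance obtained under the i-th sampled set of model parameters.

```python
def decompose_uncertainty(means, variances):
    """Split predictive variance using the law of total variance.

    means[i], variances[i]: predictive mean and noise variance under the
    i-th Monte Carlo draw of the model parameters (assumed inputs).
    Returns (epistemic, aleatoric, total).
    """
    n = len(means)
    mean_of_means = sum(means) / n
    # Epistemic part: spread of the predicted means across draws.
    epistemic = sum((m - mean_of_means) ** 2 for m in means) / n
    # Aleatoric part: average predicted noise variance.
    aleatoric = sum(variances) / n
    return epistemic, aleatoric, epistemic + aleatoric
```

In this decomposition, the epistemic term shrinks as the sampled models agree with each other, while the aleatoric term reflects noise that more data would not remove.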
[AI-144] GenAI on Wall Street – Opportunities and Risk Controls
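[Quick read]: This paper addresses the emerging risks that come with applying generative AI (GenAI) in the financial industry, especially in investment banks, risks that could threaten the stability and safety of the financial system as a whole. The key to the solution is to balance the potential of GenAI against its risks: exploit its positive value in areas such as efficiency and service innovation (the Yang side) while systematically identifying and controlling risks arising from model bias, data leakage, compliance uncertainty, and similar sources (the Yin side), so that GenAI develops sustainably and responsibly in finance.
Link: https://arxiv.org/abs/2509.05841
Authors: Jackie Shen
Affiliation: unknown
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Risk Management (q-fin.RM)
Comments: 30 pages, 8 figures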
Abstract:We give an overview of the emerging applications of GenAI in the financial industry, especially within investment banks. Inherent to these exciting opportunities is a new realm of risks that must be managed properly. By heeding both the Yin and Yang sides of GenAI, we can accelerate its organic growth while safeguarding the entire financial industry during this nascent era of AI.
[AI-145] Hybrid Fourier Neural Operator-Plasma Fluid Model for Fast and Accurate Multiscale Simulations of High Power Microwave Breakdown
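[Quick read]: This paper addresses the prohibitive computational cost of multiscale, multiphysics simulation of High Power Microwave (HPM) breakdown. The core challenge is that Maxwell's equations must be solved together with a plasma continuity equation, which makes traditional numerical methods such as FDTD inefficient. The key to the solution is a hybrid modeling strategy: an EM solver based on the Fourier Neural Operator (FNO) replaces the computationally expensive EM field updates, while a differential equation-based plasma fluid solver is retained to describe the dynamic plasma response. For unseen incident electric fields, the hybrid model reproduces the shape, propagation velocity, and temporal evolution of microwave streamers formed by the diffusion ionization mechanism in a 2D scenario, with a speedup of about 60× over traditional FDTD-based simulation. It offers an efficient alternative for demanding multiscale and multiphysics problems such as HPM breakdown and shows how existing C-based simulation codes can be integrated with Python-based machine learning frameworks.
Link: https://arxiv.org/abs/2509.05799
Authors: Kalp Pandya,Pratik Ghosh,Ajeya Mandikal,Shivam Gandha,Bhaskar Chaudhury
Affiliation: unknown
Categories: Plasma Physics (physics.plasm-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: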
Abstract:Modeling and simulation of High Power Microwave (HPM) breakdown, a multiscale phenomenon, is computationally expensive and requires solving Maxwell’s equations (EM solver) coupled with a plasma continuity equation (plasma solver). In this work, we present a hybrid modeling approach that combines the accuracy of a differential equation-based plasma fluid solver with the computational efficiency of an FNO (Fourier Neural Operator) based EM solver. Trained on data from an in-house FDTD-based plasma-fluid solver, the FNO replaces computationally expensive EM field updates, while the plasma solver governs the dynamic plasma response. The hybrid model is validated on microwave streamer formation, due to the diffusion ionization mechanism, in a 2D scenario for unseen incident electric fields corresponding to entirely new plasma streamer simulations not included in model training, showing excellent agreement with FDTD-based fluid simulations in terms of streamer shape, velocity, and temporal evolution. This hybrid FNO-based strategy delivers significant acceleration, of the order of 60X compared to traditional simulations for the specified problem size, and offers an efficient alternative for the computationally demanding multiscale and multiphysics simulations involved in HPM breakdown. Our work also demonstrates how such hybrid pipelines can be used to seamlessly integrate existing C-based simulation codes with Python-based machine learning frameworks for simulations of plasma science and engineering problems.
[AI-146] Universality of physical neural networks with multivariate nonlinearity
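[Quick read]: This paper addresses whether physical neural networks can learn arbitrary relationships between data, that is, whether they are universal, which is a key requirement for deep learning. The key to the solution is a fundamental theorem that establishes a universality condition for physical neural networks. It provides a mathematical criterion that imposes constraints on device design, in particular on how inputs should be encoded in the tunable parameters of the physical system. Based on the theorem, the authors propose a scalable free-space optical architecture that is provably universal and achieves high accuracy on image classification tasks. Combining the theorem with temporal multiplexing gives a route to very large effective system sizes in on-chip photonic devices that are highly practical but scale poorly. The theory and methods apply beyond optics to the design of other energy-efficient physical neural networks.
Link: https://arxiv.org/abs/2509.05420
Authors: Benjamin Savinson,David J. Norris,Siddhartha Mishra,Samuel Lanthaler
Affiliation: unknown
Categories: Optics (physics.optics); Artificial Intelligence (cs.AI); Classical Physics (physics.class-ph); Computational Physics (physics.comp-ph)
Comments: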
Abstract:The enormous energy demand of artificial intelligence is driving the development of alternative hardware for deep learning. Physical neural networks try to exploit physical systems to perform machine learning more efficiently. In particular, optical systems can calculate with light using negligible energy. While their computational capabilities were long limited by the linearity of optical materials, nonlinear computations have recently been demonstrated through modified input encoding. Despite this breakthrough, our inability to determine if physical neural networks can learn arbitrary relationships between data – a key requirement for deep learning known as universality – hinders further progress. Here we present a fundamental theorem that establishes a universality condition for physical neural networks. It provides a powerful mathematical criterion that imposes device constraints, detailing how inputs should be encoded in the tunable parameters of the physical system. Based on this result, we propose a scalable architecture using free-space optics that is provably universal and achieves high accuracy on image classification tasks. Further, by combining the theorem with temporal multiplexing, we present a route to potentially huge effective system sizes in highly practical but poorly scalable on-chip photonic devices. Our theorem and scaling methods apply beyond optical systems and inform the design of a wide class of universal, energy-efficient physical neural networks, justifying further efforts in their development.
[AI-147] Graph Connectionist Temporal Classification for Phoneme Recognition
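[Quick read]: This paper addresses the uncertainty introduced when Automatic Phoneme Recognition (APR) systems are trained on pseudo phoneme-level labels generated from text by Grapheme-to-Phoneme (G2P) conversion. The standard Connectionist Temporal Classification (CTC) loss cannot handle multiple valid pronunciations per word, which limits model performance. The key to the solution is to adapt Graph Temporal Classification (GTC) to APR: a graph of alternative phoneme sequences serves as the supervision signal, so the model can treat multiple pronunciations per word as valid during training. This improves robustness to noisy G2P labels and lowers the phoneme error rate (PER) relative to a CTC baseline.
Link: https://arxiv.org/abs/2509.05399
Authors: Henry Grafé,Hugo Van hamme
Affiliation: unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
Comments: Accepted to the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2025)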
Abstract:Automatic Phoneme Recognition (APR) systems are often trained using pseudo phoneme-level annotations generated from text through Grapheme-to-Phoneme (G2P) systems. These G2P systems frequently output multiple possible pronunciations per word, but the standard Connectionist Temporal Classification (CTC) loss cannot account for such ambiguity during training. In this work, we adapt Graph Temporal Classification (GTC) to the APR setting. GTC enables training from a graph of alternative phoneme sequences, allowing the model to consider multiple pronunciations per word as valid supervision. Our experiments on English and Dutch data sets show that incorporating multiple pronunciations per word into the training loss consistently improves phoneme error rates compared to a baseline trained with CTC. These results suggest that integrating pronunciation variation into the loss function is a promising strategy for training APR systems from noisy G2P-based supervision.
[AI-148] Sesame: Opening the door to protein pockets ICLR2025
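[Quick read]: This paper addresses the reduced accuracy of molecular docking in drug discovery when high-resolution ligand-bound structures are unavailable. Such structures are costly and time-consuming to obtain, while the more accessible ligand-free (apo) structures have pocket geometries that are less suited to ligand accommodation, which degrades docking performance. Traditional ways of inducing the required conformational change, such as molecular dynamics simulations, are computationally expensive. The key to the solution is Sesame, a generative model that predicts conformations better suited to ligand binding at a fraction of the computational cost, with the aim of making virtual screening workflows more efficient and scalable.
Link: https://arxiv.org/abs/2509.05302
Authors: Raúl Miñán,Carles Perez-Lopez,Javier Iglesias,Álvaro Ciudad,Alexis Molina
Affiliation: unknown
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
Comments: Published at the Proceedings of the 2nd Workshop on Generative and Experimental Perspectives for Biomolecular Design. ICLR 2025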
Abstract:Molecular docking is a cornerstone of drug discovery, relying on high-resolution ligand-bound structures to achieve accurate predictions. However, obtaining these structures is often costly and time-intensive, limiting their availability. In contrast, ligand-free structures are more accessible but suffer from reduced docking performance due to pocket geometries being less suited for ligand accommodation in apo structures. Traditional methods for artificially inducing these conformations, such as molecular dynamics simulations, are computationally expensive. In this work, we introduce Sesame, a generative model designed to predict this conformational change efficiently. By generating geometries better suited for ligand accommodation at a fraction of the computational cost, Sesame aims to provide a scalable solution for improving virtual screening workflows.
Machine Learning
[LG-0] Learning words in groups: fusion algebras tensor ranks and grokking
Link: https://arxiv.org/abs/2509.06931
Authors: Maor Shutman,Oren Louidor,Ran Tessler
Categories: Machine Learning (cs.LG)
Comments:
Abstract:In this work, we demonstrate that a simple two-layer neural network with standard activation functions can learn an arbitrary word operation in any finite group, provided sufficient width is available, and that it exhibits grokking while doing so. To explain the mechanism by which this is achieved, we reframe the problem as that of learning a particular 3-tensor, which we show is typically of low rank. A key insight is that low-rank implementations of this tensor can be obtained by decomposing it along triplets of basic self-conjugate representations of the group and leveraging the fusion structure to rule out many components. Focusing on a phenomenologically similar but more tractable surrogate model, we show that the network is able to find such low-rank implementations (or approximations thereof), thereby using limited width to approximate the word-tensor in a generalizable way. In the case of the simple multiplication word, we further elucidate the form of these low-rank implementations, showing that the network effectively implements efficient matrix multiplication in the sense of Strassen. Our work also sheds light on the mechanism by which a network reaches such a solution under gradient descent.
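As background for the phrase "efficient matrix multiplication in the sense of Strassen", the textbook Strassen scheme multiplies two 2×2 matrices with 7 scalar products instead of 8, which corresponds to a rank-7 decomposition of the matrix multiplication tensor. The sketch below shows that classical scheme only. It is not the network's learned solution, and the function name is chosen here for illustration.

```python
def strassen_2x2(a, b):
    """Multiply two 2x2 matrices (nested lists) with 7 multiplications."""
    (a11, a12), (a21, a22) = a
    (b11, b12), (b21, b22) = b

    # The seven Strassen products.
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)

    # Recombine the products into the four output entries.
    return [
        [m1 + m4 - m5 + m7, m3 + m5],
        [m2 + m4, m1 - m2 + m3 + m6],
    ]
```

Each product corresponds to one rank-one term of the tensor, which is the sense in which a low-rank implementation saves multiplications.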
[LG-1] Neutron Reflectometry by Gradient Descent
Link: https://arxiv.org/abs/2509.06924
Authors: Max D.Champneys,Andrew J.Parnell,Philipp Gutfreund,Maximilian W. A. Skoda,. Patrick A. Fairclough,Timothy J.Rogers,Stephanie L.Burg
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
Comments:
Abstract:Neutron reflectometry (NR) is a powerful technique to probe surfaces and interfaces. NR is inherently an indirect measurement technique: access to the physical quantities of interest (layer thickness, scattering length density, roughness) necessitates the solution of an inverse modelling problem, which is inefficient for large amounts of data or complex multilayer structures (e.g. lithium batteries / electrodes). Recently, surrogate machine learning models have been proposed as an alternative to existing optimisation routines. Although such approaches have been successful, physical intuition is lost when replacing governing equations with fast neural networks. Instead, we propose a novel and efficient approach: to optimise reflectivity data analysis by performing gradient descent on the forward reflection model itself. Herein, automatic differentiation techniques are used to evaluate exact gradients of the error function with respect to the parameters of interest. Access to these quantities enables users of neutron reflectometry to harness a host of powerful modern optimisation and inference techniques that remain thus far unexploited in the context of neutron reflectometry. This paper presents two benchmark case studies, demonstrating state-of-the-art performance on a thick oxide quartz film and robust co-fitting performance in the high complexity regime of organic LED multilayer devices. Additionally, we provide an open-source library of differentiable reflectometry kernels in the Python programming language so that gradient-based approaches can readily be applied to other NR datasets.
[LG-2] Staying in the Sweet Spot: Responsive Reasoning Evolution via Capability-Adaptive Hint Scaffolding
Link: https://arxiv.org/abs/2509.06923
Authors: Ziheng Li,Zexu Sun,Jinman Zhao,Erxue Min,Yongcheng Zeng,Hui Wu,Hengyi Cai,Shuaiqiang Wang,Dawei Yin,Xu Chen,Zhi-Hong Deng
Subjects: Machine Learning (cs.LG)
Comments: Work in progress
Abstract:Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable success in enhancing the reasoning capabilities of large language models (LLMs). However, existing RLVR methods often suffer from exploration inefficiency due to mismatches between the training data’s difficulty and the model’s capability. LLMs fail to discover viable reasoning paths when problems are overly difficult, while learning little new capability when problems are too simple. In this work, we formalize the impact of problem difficulty by quantifying the relationship between loss descent speed and rollout accuracy. Building on this analysis, we propose SEELE, a novel supervision-aided RLVR framework that dynamically adjusts problem difficulty to stay within the high-efficiency region. SEELE augments each training sample by appending a hint (part of a full solution) after the original problem. Unlike previous hint-based approaches, SEELE deliberately and adaptively adjusts the hint length for each problem to achieve an optimal difficulty. To determine the optimal hint length, SEELE employs a multi-round rollout sampling strategy. In each round, it fits an item response theory model to the accuracy-hint pairs collected in preceding rounds to predict the required hint length for the next round. This instance-level, real-time difficulty adjustment aligns problem difficulty with the evolving model capability, thereby improving exploration efficiency. Experimental results show that SEELE outperforms Group Relative Policy Optimization (GRPO) and Supervised Fine-tuning (SFT) by +11.8 and +10.5 points, respectively, and surpasses the best previous supervision-aided approach by +3.6 points on average across six math reasoning benchmarks.
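The hint-length prediction step described in the abstract above can be illustrated with a small sketch. This is not the SEELE implementation: the abstract only says an item response theory model is fitted to accuracy-hint pairs, so the two-parameter logistic form, the least-squares fit on logit-transformed accuracy, the target accuracy of 0.5, and every function name below are assumptions made for illustration.

```python
import math


def logit(p, eps=1e-3):
    """Log-odds of p, clipped away from 0 and 1 so the value stays finite."""
    p = min(1.0 - eps, max(eps, p))
    return math.log(p / (1.0 - p))


def fit_logistic_curve(pairs):
    """Least-squares fit of logit(accuracy) = slope * hint + intercept.

    `pairs` holds (hint_fraction, rollout_accuracy) observations.
    Assumed model form; the paper's IRT parameterisation may differ.
    """
    n = len(pairs)
    xs = [hint for hint, _ in pairs]
    ys = [logit(acc) for _, acc in pairs]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        # All hints identical: the slope cannot be identified.
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def next_hint_fraction(pairs, target_acc=0.5):
    """Hint fraction predicted to reach target_acc, clipped to [0, 1]."""
    slope, intercept = fit_logistic_curve(pairs)
    if slope <= 0.0:
        # Fit says hints do not help; fall back to the full hint (assumption).
        return 1.0
    hint = (logit(target_acc) - intercept) / slope
    return min(1.0, max(0.0, hint))
```

In a multi-round scheme, each round would append its newly observed (hint, accuracy) pair to `pairs` and call `next_hint_fraction` again, so the prediction tracks the model's current capability.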
[LG-3] Hypergraph-Guided Regex Filter Synthesis for Event-Based Anomaly Detection
Link: https://arxiv.org/abs/2509.06911
Authors: Margarida Ferreira,Victor Nicolet,Luan Pham,Joey Dodds,Daniel Kroening,Ines Lynce,Ruben Martins
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments:
Abstract:We propose HyGLAD, a novel algorithm that automatically builds a set of interpretable patterns that model event data. These patterns can then be used to detect event-based anomalies in a stationary system, where any deviation from past behavior may indicate malicious activity. The algorithm infers equivalence classes of entities with similar behavior observed from the events, and then builds regular expressions that capture the values of those entities. As opposed to deep-learning approaches, the regular expressions are directly interpretable, which also translates to interpretable anomalies. We evaluate HyGLAD against all 7 unsupervised anomaly detection methods from DeepOD on five datasets from real-world systems. The experimental results show that on average HyGLAD outperforms existing deep-learning methods while being an order of magnitude more efficient in training and inference (single CPU vs GPU). Precision improved by 1.2x and recall by 1.3x compared to the second-best baseline.
[LG-4] Not All Samples Are Equal: Quantifying Instance-level Difficulty in Targeted Data Poisoning
Link: https://arxiv.org/abs/2509.06896
Authors: William Xu,Yiwei Lu,Yihan Wang,Matthew Y.R. Yang,Zuoqiu Liu,Gautam Kamath,Yaoliang Yu
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Targeted data poisoning attacks pose an increasingly serious threat due to their ease of deployment and high success rates. These attacks aim to manipulate the prediction for a single test sample in classification models. Unlike indiscriminate attacks that aim to decrease overall test performance, targeted attacks present a unique threat to individual test instances. This threat model raises a fundamental question: what factors make certain test samples more susceptible to successful poisoning than others? We investigate how attack difficulty varies across different test instances and identify key characteristics that influence vulnerability. This paper introduces three predictive criteria for targeted data poisoning difficulty: ergodic prediction accuracy (analyzed through clean training dynamics), poison distance, and poison budget. Our experimental results demonstrate that these metrics effectively predict the varying difficulty of real-world targeted poisoning attacks across diverse scenarios, offering practitioners valuable insights for vulnerability assessment and understanding data poisoning attacks.
[LG-5] Concolic Testing on Individual Fairness of Neural Network Models
Link: https://arxiv.org/abs/2509.06864
Authors: Ming-I Huang,Chih-Duo Hong,Fang Yu
Subjects: Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:
Abstract:This paper introduces PyFair, a formal framework for evaluating and verifying individual fairness of Deep Neural Networks (DNNs). By adapting the concolic testing tool PyCT, we generate fairness-specific path constraints to systematically explore DNN behaviors. Our key innovation is a dual network architecture that enables comprehensive fairness assessments and provides completeness guarantees for certain network types. We evaluate PyFair on 25 benchmark models, including those enhanced by existing bias mitigation techniques. Results demonstrate PyFair’s efficacy in detecting discriminatory instances and verifying fairness, while also revealing scalability challenges for complex models. This work advances algorithmic fairness in critical domains by offering a rigorous, systematic method for fairness testing and verification of pre-trained DNNs.
[LG-6] Imitative Membership Inference Attack
Link: https://arxiv.org/abs/2509.06796
Authors: Yuntao Du,Yuetian Chen,Hanshen Xiao,Bruno Ribeiro,Ninghui Li
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Code is available at: this https URL
Abstract:A Membership Inference Attack (MIA) assesses how much a target machine learning model reveals about its training data by determining whether specific query instances were part of the training set. State-of-the-art MIAs rely on training hundreds of shadow models that are independent of the target model, leading to significant computational overhead. In this paper, we introduce Imitative Membership Inference Attack (IMIA), which employs a novel imitative training technique to strategically construct a small number of target-informed imitative models that closely replicate the target model’s behavior for inference. Extensive experimental results demonstrate that IMIA substantially outperforms existing MIAs in various attack settings while only requiring less than 5% of the computational cost of state-of-the-art approaches.
[LG-7] Dato: A Task-Based Programming Model for Dataflow Accelerators
Link: https://arxiv.org/abs/2509.06794
Authors: Shihan Fang,Hongzheng Chen,Niansong Zhang,Jiajie Li,Han Meng,Adrian Liu,Zhiru Zhang
Subjects: Programming Languages (cs.PL); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments:
Abstract:Recent deep learning workloads increasingly push computational demand beyond what current memory systems can sustain, with many kernels stalling on data movement rather than computation. While modern dataflow accelerators incorporate on-chip streaming to mitigate off-chip bandwidth limitations, existing programming models struggle to harness these capabilities effectively. Low-level interfaces provide fine-grained control but impose significant development overhead, whereas high-level tile-based languages abstract away communication details, restricting optimization and forcing compilers to reconstruct the intended dataflow. We present Dato, a Python-embedded, task-based programming model for dataflow accelerators that elevates data communication and sharding to first-class type constructs. Developers write programs as a graph of tasks connected via explicit stream types, with sharded inputs specified using layout types. These tasks are first mapped virtually onto the accelerator’s spatial fabric, and the compiler then generates a physical mapping that respects hardware constraints. Experimental results on both AMD Ryzen AI NPU and Alveo FPGA devices demonstrate that Dato achieves high performance while significantly reducing the burden of writing optimized code. On the NPU, Dato attains up to 84% hardware utilization for GEMM and delivers a 2.81x speedup on attention kernels compared to a state-of-the-art commercial framework. On the FPGA, Dato surpasses leading frameworks in performance when generating custom systolic arrays, achieving 98% of the theoretical peak performance.
[LG-8] R²AI: Towards Resistant and Resilient AI in an Evolving World
Link: https://arxiv.org/abs/2509.06786
Authors: Youbang Sun,Xiang Wang,Jie Fu,Chaochao Lu,Bowen Zhou
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:In this position paper, we address the persistent gap between rapidly growing AI capabilities and lagging safety progress. Existing paradigms divide into “Make AI Safe”, which applies post-hoc alignment and guardrails but remains brittle and reactive, and “Make Safe AI”, which emphasizes intrinsic safety but struggles to address unforeseen risks in open-ended environments. We therefore propose safe-by-coevolution as a new formulation of the “Make Safe AI” paradigm, inspired by biological immunity, in which safety becomes a dynamic, adversarial, and ongoing learning process. To operationalize this vision, we introduce R²AI (Resistant and Resilient AI) as a practical framework that unites resistance against known threats with resilience to unforeseen risks. R²AI integrates fast and slow safe models, adversarial simulation and verification through a safety wind tunnel, and continual feedback loops that guide safety and capability to coevolve. We argue that this framework offers a scalable and proactive path to maintain continual safety in dynamic environments, addressing both near-term vulnerabilities and long-term existential risks as AI advances toward AGI and ASI.
[LG-9] Physics-informed Value Learner for Offline Goal-Conditioned Reinforcement Learning
Link: https://arxiv.org/abs/2509.06782
Authors: Vittorio Giammarino,Ruiqi Ni,Ahmed H. Qureshi
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Offline Goal-Conditioned Reinforcement Learning (GCRL) holds great promise for domains such as autonomous navigation and locomotion, where collecting interactive data is costly and unsafe. However, it remains challenging in practice due to the need to learn from datasets with limited coverage of the state-action space and to generalize across long-horizon tasks. To address these challenges, we propose a Physics-informed (Pi) regularized loss for value learning, derived from the Eikonal Partial Differential Equation (PDE), which induces a geometric inductive bias in the learned value function. Unlike generic gradient penalties that are primarily used to stabilize training, our formulation is grounded in continuous-time optimal control and encourages value functions to align with cost-to-go structures. The proposed regularizer is broadly compatible with temporal-difference-based value learning and can be integrated into existing Offline GCRL algorithms. When combined with Hierarchical Implicit Q-Learning (HIQL), the resulting method, Physics-informed HIQL (Pi-HIQL), yields significant improvements in both performance and generalization, with pronounced gains in stitching regimes and large-scale navigation tasks.
[LG-10] Asynchronous Message Passing for Addressing Oversquashing in Graph Neural Networks
Link: https://arxiv.org/abs/2509.06777
Authors: Kushal Bose,Swagatam Das
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Graph Neural Networks (GNNs) suffer from Oversquashing, which occurs when tasks require long-range interactions. The problem arises from the presence of bottlenecks that limit the propagation of messages among distant nodes. Recent graph rewiring methods modify edge connectivity and are expected to perform well on long-range tasks. Yet, graph rewiring compromises the inductive bias, incurring significant information loss in solving the downstream task. Furthermore, increasing channel capacity may overcome information bottlenecks but increases the parameter complexity of the model. To alleviate these shortcomings, we propose an efficient model-agnostic framework that asynchronously updates node features, unlike traditional synchronous message passing GNNs. Our framework creates node batches in every layer based on the node centrality values. Only the features of the nodes belonging to these batches are updated. Asynchronous message updates process information sequentially across layers, avoiding simultaneous compression into fixed-capacity channels. We also theoretically establish that our proposed framework maintains higher feature sensitivity bounds compared to standard synchronous approaches. Our framework is applied to six standard graph datasets and two long-range datasets to perform graph classification and achieves impressive performance, with improvements of 5% and 4% on REDDIT-BINARY and Peptides-struct, respectively.
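A toy sketch may help picture the batch-wise update described in the abstract above. It is not the authors' framework: degree is used as the centrality measure, batches are taken in descending centrality order with one batch per layer, features are scalars, and aggregation is a plain mean over a node and its neighbours. All of these choices and all names are assumptions for illustration.

```python
def centrality_batches(adj, num_layers):
    """Split nodes into num_layers batches by descending degree centrality.

    Degree centrality and the descending order are assumptions of this sketch.
    """
    order = sorted(adj, key=lambda v: (-len(adj[v]), v))
    size = -(-len(order) // num_layers)  # ceiling division
    return [order[i * size:(i + 1) * size] for i in range(num_layers)]


def async_message_passing(adj, features, num_layers):
    """Run one batch per layer; only nodes in the active batch are updated."""
    h = dict(features)
    for batch in centrality_batches(adj, num_layers):
        updates = {}
        for v in batch:
            # Mean over the node itself and its neighbours (assumed aggregator).
            values = [h[u] for u in adj[v]] + [h[v]]
            updates[v] = sum(values) / len(values)
        # Nodes outside the batch keep their features for this layer.
        h.update(updates)
    return h
```

On a four-node path graph with two layers, the two interior nodes are updated first and the two end nodes second, so the end nodes read features that already reflect the earlier update rather than all nodes compressing messages at once.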
[LG-11] RT-HCP: Dealing with Inference Delays and Sample Efficiency to Learn Directly on Robotic Platforms IROS2025
Link: https://arxiv.org/abs/2509.06714
Authors: Zakariae El Asri,Ibrahim Laiche,Clément Rambour,Olivier Sigaud,Nicolas Thome
Subjects: Machine Learning (cs.LG)
Comments: IROS 2025
Abstract:Learning a controller directly on the robot requires extreme sample efficiency. Model-based reinforcement learning (RL) methods are the most sample efficient, but their inference time is often too long to meet the robot control frequency requirements. In this paper, we address the sample efficiency and inference time challenges with two contributions. First, we define a general framework to deal with inference delays where the slow-inference robot controller provides a sequence of actions to feed the control-hungry robotic platform without execution gaps. Then, we compare several RL algorithms in the light of this framework and propose RT-HCP, an algorithm that offers an excellent trade-off between performance, sample efficiency and inference time. We validate the superiority of RT-HCP with experiments where we learn a controller directly on a simple but high-frequency FURUTA pendulum platform. Code: this http URL
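The idea of covering an inference delay with a sequence of actions can be sketched with a small simulation. This illustrates the general buffering idea only and is not the RT-HCP algorithm: the tick-based timing model, the `plan_fn` callback, and the rule that a new inference starts as soon as the previous one finishes are all assumptions.

```python
from collections import deque


def run_control_loop(plan_fn, num_ticks, inference_ticks, default_action=None):
    """Simulate a fast control loop fed by a slow planner.

    plan_fn(start_tick, n) is a hypothetical planner returning a list of
    actions; its result only becomes usable `inference_ticks` ticks later.
    Returns the executed actions and the number of ticks with no action ready.
    """
    queue = deque()
    executed = []
    gaps = 0
    pending = None  # (tick at which the plan is ready, planned actions)
    for tick in range(num_ticks):
        # A plan whose inference has finished is appended to the queue.
        if pending is not None and pending[0] <= tick:
            queue.extend(pending[1])
            pending = None
        # Start the next inference immediately so planning never idles.
        if pending is None:
            pending = (tick + inference_ticks, plan_fn(tick, inference_ticks))
        if queue:
            executed.append(queue.popleft())
        else:
            executed.append(default_action)
            gaps += 1
    return executed, gaps
```

When each plan contains at least `inference_ticks` actions, the only gaps are the warm-up ticks before the first plan arrives. A planner that returns a single action per inference leaves the platform without commands for most ticks, which is the failure mode the action sequence is meant to avoid.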
[LG-12] When Secure Isn't: Assessing the Security of Machine Learning Model Sharing
Link: https://arxiv.org/abs/2509.06703
Authors: Gabriele Digregorio,Marco Di Gennaro,Stefano Zanero,Stefano Longari,Michele Carminati
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:The rise of model-sharing through frameworks and dedicated hubs makes Machine Learning significantly more accessible. Despite their benefits, these tools expose users to underexplored security risks, while security awareness remains limited among both practitioners and developers. To enable a more security-conscious culture in Machine Learning model sharing, in this paper we evaluate the security posture of frameworks and hubs, assess whether security-oriented mechanisms offer real protection, and survey how users perceive the security narratives surrounding model sharing. Our evaluation shows that most frameworks and hubs address security risks partially at best, often by shifting responsibility to the user. More concerningly, our analysis of frameworks advertising security-oriented settings and complete model sharing uncovered six 0-day vulnerabilities enabling arbitrary code execution. Through this analysis, we debunk the misconceptions that the model-sharing problem is largely solved and that its security can be guaranteed by the file format used for sharing. As expected, our survey shows that the surrounding security narrative leads users to consider security-oriented settings as trustworthy, despite the weaknesses shown in this work. From this, we derive takeaways and suggestions to strengthen the security of model-sharing ecosystems.
[LG-13] Nested Optimal Transport Distances
Link: https://arxiv.org/abs/2509.06702
Authors: Ruben Bontorno,Songyan Hou
Subjects: Machine Learning (cs.LG); Computational Finance (q-fin.CP)
Comments: 7 pages, 3 figures
Abstract:Simulating realistic financial time series is essential for stress testing, scenario generation, and decision-making under uncertainty. Despite advances in deep generative models, there is no consensus metric for their evaluation. We focus on generative AI for financial time series in decision-making applications and employ the nested optimal transport distance, a time-causal variant of optimal transport distance, which is robust to tasks such as hedging, optimal stopping, and reinforcement learning. Moreover, we propose a statistically consistent, naturally parallelizable algorithm for its computation, achieving substantial speedups over existing approaches.
[LG-14] Group Effect Enhanced Generative Adversarial Imitation Learning for Individual Travel Behavior Modeling under Incentives
Link: https://arxiv.org/abs/2509.06656
Authors: Yuanyuan Wu,Zhenlin Qin,Leizhen Wang,Xiaolei Ma,Zhenliang Ma
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Understanding and modeling individual travel behavior responses is crucial for urban mobility regulation and policy evaluation. The Markov decision process (MDP) provides a structured framework for dynamic travel behavior modeling at the individual level. However, solving an MDP in this context is highly data-intensive and faces challenges of data quantity, spatial-temporal coverage, and situational diversity. To address these, we propose a group-effect-enhanced generative adversarial imitation learning (gcGAIL) model that improves the individual behavior modeling efficiency by leveraging shared behavioral patterns among passenger groups. We validate the gcGAIL model using a public transport fare-discount case study and compare against state-of-the-art benchmarks, including adversarial inverse reinforcement learning (AIRL), baseline GAIL, and conditional GAIL. Experimental results demonstrate that gcGAIL outperforms these methods in learning individual travel behavior responses to incentives over time in terms of accuracy, generalization, and pattern demonstration efficiency. Notably, gcGAIL is robust to spatial variation, data sparsity, and behavioral diversity, maintaining strong performance even with partial expert demonstrations and underrepresented passenger groups. The gcGAIL model predicts the individual behavior response at any time, providing the basis for personalized incentives to induce sustainable behavior changes (better timing of incentive injections).
[LG-15] Knowledge-Guided Machine Learning for Stabilizing Near-Shortest Path Routing
Link: https://arxiv.org/abs/2509.06640
Authors: Yung-Fu Chen,Sen Lin,Anish Arora
Subjects: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments:
Abstract:We propose a simple algorithm that needs only a few data samples from a single graph for learning local routing policies that generalize across a rich class of geometric random graphs in Euclidean metric spaces. We thus solve the all-pairs near-shortest path problem by training deep neural networks (DNNs) that let each graph node efficiently and scalably route (i.e., forward) packets by considering only the node’s state and the state of the neighboring nodes. Our algorithm design exploits network domain knowledge in the selection of input features and design of the policy function for learning an approximately optimal policy. Domain knowledge also provides theoretical assurance that the choice of a “seed graph” and its node data sampling suffices for generalizable learning. Remarkably, one of these DNNs we train, using distance-to-destination as the only input feature, learns a policy that exactly matches the well-known Greedy Forwarding policy, which forwards packets to the neighbor with the shortest distance to the destination. We also learn a new policy, which we call GreedyTensile routing, using both distance-to-destination and node stretch as the input features, that almost always outperforms greedy forwarding. We demonstrate the explainability and ultra-low latency run-time operation of GreedyTensile routing by symbolically interpreting its DNN in low-complexity terms of two linear actions.
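The Greedy Forwarding policy mentioned in the abstract above is simple enough to write down directly. The sketch below is the classical hand-written rule, not the learned DNN policy and not GreedyTensile routing. The graph representation (a `positions` dictionary of coordinates and a `neighbors` adjacency dictionary) and the choice to stop when no neighbour is strictly closer are assumptions of this example.

```python
import math


def greedy_forward(positions, neighbors, source, destination):
    """Forward hop by hop to the neighbour closest to the destination.

    Returns (path, delivered). Forwarding stops with delivered=False when no
    neighbour is strictly closer than the current node (a local minimum).
    """
    target = positions[destination]
    path = [source]
    current = source
    while current != destination:
        best = min(
            neighbors[current],
            key=lambda v: math.dist(positions[v], target),
            default=None,
        )
        if best is None:
            return path, False  # isolated node: nothing to forward to
        if math.dist(positions[best], target) >= math.dist(positions[current], target):
            return path, False  # stuck in a local minimum
        path.append(best)
        current = best
    return path, True
```

Each node only needs its own position, its neighbours' positions, and the destination's position, which is what makes the policy local. The strict-decrease check also guarantees termination, since the distance to the destination shrinks at every hop.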
[LG-16] A Survey of Generalization of Graph Anomaly Detection: From Transfer Learning to Foundation Models
Link: https://arxiv.org/abs/2509.06609
Authors: Junjun Pan,Yu Zheng,Yue Tan,Yixin Liu
Subjects: Machine Learning (cs.LG)
Comments: Accepted by ICKG 2025. 8 pages, 5 figures
Abstract:Graph anomaly detection (GAD) has attracted increasing attention in recent years for identifying malicious samples in a wide range of graph-based applications, such as social media and e-commerce. However, most GAD methods assume identical training and testing distributions and are tailored to specific tasks, resulting in limited adaptability to real-world scenarios such as shifting data distributions and scarce training samples in new applications. To address the limitations, recent work has focused on improving the generalization capability of GAD models through transfer learning that leverages knowledge from related domains to enhance detection performance, or developing “one-for-all” GAD foundation models that generalize across multiple applications. Since a systematic understanding of generalization in GAD is still lacking, in this paper, we provide a comprehensive review of generalization in GAD. We first trace the evolution of generalization in GAD and formalize the problem settings, which further leads to our systematic taxonomy. Rooted in this fine-grained taxonomy, an up-to-date and comprehensive review is conducted for the existing generalized GAD methods. Finally, we identify current open challenges and suggest future directions to inspire future research in this emerging field.
[LG-17] Small Vectors Big Effects: A Mechanistic Study of RL-Induced Reasoning via Steering Vectors
Link: https://arxiv.org/abs/2509.06608
Authors: Viacheslav Sinii,Nikita Balagansky,Yaroslav Aksenov,Vadim Kurochkin,Daniil Laptev,Gleb Gerasimov,Alexey Gorbatovski,Boris Shaposhnikov,Daniil Gavrilov
Subjects: Machine Learning (cs.LG)
Comments: Preprint
Abstract:The mechanisms by which reasoning training reshapes language-model computations remain poorly understood. We study lightweight steering vectors inserted into the base model’s residual stream and trained with a reinforcement-learning objective, which can match full fine-tuning performance while retaining the interpretability of small, additive interventions. Using logit-lens readouts, path patching, and circuit analyses, we analyze two models and find: (i) the last-layer steering vector behaves like a token-substitution bias concentrated on the first generated token, consistently boosting tokens such as “To” and “Step”; and (ii) the penultimate-layer steering vector leaves attention patterns largely unchanged and instead acts through the MLP and unembedding, preferentially up-weighting process words and structure symbols. These results establish a principled framework for interpreting the behavioral changes induced by reasoning training.
[LG-18] PAC-Bayesian Generalization Bounds for Graph Convolutional Networks on Inductive Node Classification
Link: https://arxiv.org/abs/2509.06600
Authors: Huayi Tang,Yong Liu
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Graph neural networks (GNNs) have achieved remarkable success in processing graph-structured data across various applications. A critical aspect of real-world graphs is their dynamic nature, where new nodes are continually added and existing connections may change over time. Previous theoretical studies, largely based on the transductive learning framework, fail to adequately model such temporal evolution and structural dynamics. In this paper, we present a PAC-Bayesian theoretical analysis of graph convolutional networks (GCNs) for inductive node classification, treating nodes as dependent and non-identically distributed data points. We derive novel generalization bounds for one-layer GCNs that explicitly incorporate the effects of data dependency and non-stationarity, and establish sufficient conditions under which the generalization gap converges to zero as the number of nodes increases. Furthermore, we extend our analysis to two-layer GCNs, and reveal that it requires stronger assumptions on graph topology to guarantee convergence. This work establishes a theoretical foundation for understanding and improving GNN generalization in dynamic graph environments.
[LG-19] Information-Theoretic Bounds and Task-Centric Learning Complexity for Real-World Dynamic Nonlinear Systems
Link: https://arxiv.org/abs/2509.06599
Authors: Sri Satish Krishna Chaitanya Bulusu,Mikko Sillanpää
Subjects: Machine Learning (cs.LG); Computational Complexity (cs.CC); Signal Processing (eess.SP); Systems and Control (eess.SY); Statistics Theory (math.ST)
Comments: 15 pages, 1 figure, 2 photographs
Abstract:Dynamic nonlinear systems exhibit distortions arising from coupled static and dynamic effects. Their intertwined nature poses major challenges for data-driven modeling. This paper presents a theoretical framework grounded in structured decomposition, variance analysis, and task-centric complexity bounds. The framework employs a directional lower bound on interactions between measurable system components, extending orthogonality in inner product spaces to structurally asymmetric settings. This bound supports variance inequalities for decomposed systems. Key behavioral indicators are introduced along with a memory finiteness index. A rigorous power-based condition establishes a measurable link between finite memory in realizable systems and the First Law of Thermodynamics. This offers a more foundational perspective than classical bounds based on the Second Law. Building on this foundation, we formulate a “Behavioral Uncertainty Principle,” demonstrating that static and dynamic distortions cannot be minimized simultaneously. We identify that real-world systems seem to resist complete deterministic decomposition due to entangled static and dynamic effects. We also present two general-purpose theorems linking function variance to mean-squared Lipschitz continuity and learning complexity. This yields a model-agnostic, task-aware complexity metric, showing that lower-variance components are inherently easier to learn. These insights explain the empirical benefits of structured residual learning, including improved generalization, reduced parameter count, and lower training cost, as previously observed in power amplifier linearization experiments. The framework is broadly applicable and offers a scalable, theoretically grounded approach to modeling complex dynamic nonlinear systems.
[LG-20] AI for Scientific Discovery is a Social Problem
Link: https://arxiv.org/abs/2509.06580
Authors: Georgia Channing,Avijit Ghosh
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments:
Abstract:Artificial intelligence promises to accelerate scientific discovery, yet its benefits remain unevenly distributed. While technical obstacles such as scarce data, fragmented standards, and unequal access to computation are significant, we argue that the primary barriers are social and institutional. Narratives that defer progress to speculative “AI scientists,” the undervaluing of data and infrastructure contributions, misaligned incentives, and gaps between domain experts and machine learning researchers all constrain impact. We highlight four interconnected challenges: community dysfunction, research priorities misaligned with upstream needs, data fragmentation, and infrastructure inequities. We argue that their roots lie in cultural and organizational practices. Addressing them requires not only technical innovation but also intentional community-building, cross-disciplinary education, shared benchmarks, and accessible infrastructure. We call for reframing AI for science as a collective social project, where sustainable collaboration and equitable participation are treated as prerequisites for technical progress.
[LG-21] Predicting Fetal Outcomes from Cardiotocography Signals Using a Supervised Variational Autoencoder
Link: https://arxiv.org/abs/2509.06540
Authors: John Tolladay,Beth Albert,Gabriel Davis Jones
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Objective: To develop and interpret a supervised variational autoencoder (VAE) model for classifying cardiotocography (CTG) signals based on pregnancy outcomes, addressing interpretability limits of current deep learning approaches. Methods: The OxMat CTG dataset was used to train a VAE on five-minute fetal heart rate (FHR) segments, labeled with postnatal outcomes. The model was optimised for signal reconstruction and outcome prediction, incorporating Kullback-Leibler divergence and total correlation (TC) constraints to structure the latent space. Performance was evaluated using area under the receiver operating characteristic curve (AUROC) and mean squared error (MSE). Interpretability was assessed using coefficient of determination, latent traversals and unsupervised component analyses. Results: The model achieved an AUROC of 0.752 at the segment level and 0.779 at the CTG level, where predicted scores were aggregated. Relaxing TC constraints improved both reconstruction and classification. Latent analysis showed that baseline-related features (e.g., FHR baseline, baseline shift) were well represented and aligned with model scores, while metrics like short- and long-term variability were less strongly encoded. Traversals revealed clear signal changes for baseline features, while other properties were entangled or subtle. Unsupervised decompositions corroborated these patterns. Findings: This work demonstrates that supervised VAEs can achieve competitive fetal outcome prediction while partially encoding clinically meaningful CTG features. The irregular, multi-timescale nature of FHR signals poses challenges for disentangling physiological components, distinguishing CTG from more periodic signals such as ECG. Although full interpretability was not achieved, the model supports clinically useful outcome prediction and provides a basis for future interpretable, generative models.
[LG-22] Lane Change Intention Prediction of two distinct Populations using a Transformer
Link: https://arxiv.org/abs/2509.06529
Authors: Francesco De Cristofaro,Cornelia Lex,Jia Hu,Arno Eichberger
Subjects: Machine Learning (cs.LG)
Comments: 7 pages, 7 figures
Abstract:As a result of the growing importance of lane change intention prediction for a safe and efficient driving experience in complex driving scenarios, researchers have in recent years started to train novel machine learning algorithms on available datasets with promising results. A shortcoming of this recent research effort, though, is that the vast majority of the proposed algorithms are trained on a single dataset. In doing so, researchers failed to test if their algorithm would be as effective if tested on a different dataset and, by extension, on a different population with respect to the one on which they were trained. In this article we test a transformer designed for lane change intention prediction on two datasets collected by LevelX in Germany and Hong Kong. We found that the transformer’s accuracy plummeted when tested on a population different from the one it was trained on, with accuracy values as low as 39.43%, but that when trained on both populations simultaneously it could achieve an accuracy as high as 86.71%. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
[LG-23] On optimal solutions of classical and sliced Wasserstein GANs with non-Gaussian data
Link: https://arxiv.org/abs/2509.06505
Authors: Yu-Jui Huang, Hsin-Hua Shen, Yu-Chih Huang, Wan-Yi Lin, Shih-Chun Lin
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
Abstract:The generative adversarial network (GAN) aims to approximate an unknown distribution via a parameterized neural network (NN). While GANs have been widely applied in reinforcement and semisupervised learning as well as computer vision tasks, selecting their parameters often needs an exhaustive search and only a few selection methods can be proved to be theoretically optimal. One of the most promising GAN variants is the Wasserstein GAN (WGAN). Prior work on optimal parameters for WGAN is limited to the linear-quadratic-Gaussian (LQG) setting, where the NN is linear and the data is Gaussian. In this paper, we focus on the characterization of optimal WGAN parameters beyond the LQG setting. We derive closed-form optimal parameters for one-dimensional WGANs when the NN has non-linear activation functions and the data is non-Gaussian. To extend this to high-dimensional WGANs, we adopt the sliced Wasserstein framework and replace the constraint on marginal distributions of the randomly projected data by a constraint on the joint distribution of the original (unprojected) data. We show that the linear generator can be asymptotically optimal for sliced WGAN with non-Gaussian data. Empirical studies show that our closed-form WGAN parameters have good convergence behavior with data under both Gaussian and Laplace distributions. Also, compared to the r principal component analysis (r-PCA) solution, our proposed solution for sliced WGAN can achieve the same performance while requiring less computational resources.
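The sliced Wasserstein idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's closed-form solution: it only shows the standard fact that the 1-D Wasserstein-1 distance between two equal-size empirical samples is obtained by sorting, and that a sliced distance averages this quantity over random 1-D projections. The function names and the default of 64 projections are assumptions made for illustration.

```python
import math
import random


def wasserstein_1d(xs, ys):
    """W1 distance between two equal-size 1-D samples: sort both and match in order."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def random_unit_vector(dim, rng):
    """Draw a direction uniformly on the unit sphere by normalising a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def sliced_wasserstein(X, Y, n_projections=64, seed=0):
    """Monte Carlo estimate: average the 1-D distance over random projections."""
    rng = random.Random(seed)
    dim = len(X[0])
    total = 0.0
    for _ in range(n_projections):
        theta = random_unit_vector(dim, rng)
        px = [sum(t * c for t, c in zip(theta, x)) for x in X]
        py = [sum(t * c for t, c in zip(theta, y)) for y in Y]
        total += wasserstein_1d(px, py)
    return total / n_projections
```

Each projection reduces the high-dimensional comparison to a sort, which is why sliced variants are cheap relative to solving a full optimal transport problem.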
[LG-24] A machine-learned expression for the excess Gibbs energy
Link: https://arxiv.org/abs/2509.06484
Authors: Marco Hoffmann, Thomas Specht, Quirin Göttl, Jakob Burger, Stephan Mandt, Hans Hasse, Fabian Jirasek
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments: 18 pages, 3 figures
Abstract:The excess Gibbs energy plays a central role in chemical engineering and chemistry, providing a basis for modeling the thermodynamic properties of liquid mixtures. Predicting the excess Gibbs energy of multi-component mixtures solely from the molecular structures of their components is a long-standing challenge. In this work, we address this challenge by integrating physical laws as hard constraints within a flexible neural network. The resulting model, HANNA, was trained end-to-end on an extensive experimental dataset for binary mixtures from the Dortmund Data Bank, guaranteeing thermodynamically consistent predictions. A novel surrogate solver developed in this work enabled the inclusion of liquid-liquid equilibrium data in the training process. Furthermore, a geometric projection method was applied to enable robust extrapolations to multi-component mixtures, without requiring additional parameters. We demonstrate that HANNA delivers excellent predictions, clearly outperforming state-of-the-art benchmark methods in accuracy and scope. The trained model and corresponding code are openly available, and an interactive interface is provided on our website, MLPROP.
[LG-25] CAME-AB: Cross-Modality Attention with Mixture-of-Experts for Antibody Binding Site Prediction
Link: https://arxiv.org/abs/2509.06465
Authors: Hongzong Li, Jiahao Ma, Zhanpeng Shi, Fanming Jin, Ye-Fan Hu, Jian-Dong Huang
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Biomolecules (q-bio.BM)
Abstract:Antibody binding site prediction plays a pivotal role in computational immunology and therapeutic antibody design. Existing sequence or structure methods rely on single-view features and fail to identify antibody-specific binding sites on the antigens, a dual limitation in representation and prediction. In this paper, we propose CAME-AB, a novel Cross-modality Attention framework with a Mixture-of-Experts (MoE) backbone for robust antibody binding site prediction. CAME-AB integrates five biologically grounded modalities (raw amino acid encodings, BLOSUM substitution profiles, pretrained language model embeddings, structure-aware features, and GCN-refined biochemical graphs) into a unified multimodal representation. To enhance adaptive cross-modal reasoning, we propose an adaptive modality fusion module that learns to dynamically weight each modality based on its global relevance and input-specific contribution. A Transformer encoder combined with an MoE module further promotes feature specialization and capacity expansion. We additionally incorporate a supervised contrastive learning objective to explicitly shape the latent space geometry, encouraging intra-class compactness and inter-class separability. To improve optimization stability and generalization, we apply stochastic weight averaging during training. Extensive experiments on benchmark antibody-antigen datasets demonstrate that CAME-AB consistently outperforms strong baselines on multiple metrics, including Precision, Recall, F1-score, AUC-ROC, and MCC. Ablation studies further validate the effectiveness of each architectural component and the benefit of multimodal feature integration. The model implementation details and the code are available on this https URL
[LG-26] NeuroDeX: Unlocking Diverse Support in Decompiling Deep Neural Network Executables
Link: https://arxiv.org/abs/2509.06402
Authors: Yilin Li, Guozhu Meng, Mingyang Sun, Yanzhong Wang, Kun Sun, Hailong Chang, Yuekang Li
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Abstract:On-device deep learning models are in high real-world demand. Deep learning compilers efficiently compile models into executables for deployment on edge devices, but these executables may face the threat of reverse engineering. Previous studies have attempted to decompile DNN executables, but they face challenges in handling compilation optimizations and analyzing quantized compiled models. In this paper, we present NeuroDeX to unlock diverse support in decompiling DNN executables. NeuroDeX leverages the semantic understanding capabilities of LLMs along with dynamic analysis to accurately and efficiently perform operator type recognition, operator attribute recovery and model reconstruction. NeuroDeX can recover high-level models from DNN executables across compilation optimizations, different architectures, and quantized compiled models. We conduct experiments on 96 DNN executables across 12 common DNN models. Extensive experimental results demonstrate that NeuroDeX can decompile non-quantized executables into nearly identical high-level models. NeuroDeX can recover functionally similar high-level models for quantized executables, achieving an average top-1 accuracy of 72%. NeuroDeX offers a more comprehensive and effective solution compared to previous DNN executable decompilers.
[LG-27] Graph Neural Networks for Resource Allocation in Interference-limited Multi-Channel Wireless Networks with QoS Constraints
Link: https://arxiv.org/abs/2509.06395
Authors: Lili Chen, Changyang She, Jingge Zhu, Jamie Evans
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Abstract:Meeting minimum data rate constraints is a significant challenge in wireless communication systems, particularly as network complexity grows. Traditional deep learning approaches often address these constraints by incorporating penalty terms into the loss function and tuning hyperparameters empirically. However, this heuristic treatment offers no theoretical convergence guarantees and frequently fails to satisfy QoS requirements in practical scenarios. Building upon the structure of the WMMSE algorithm, we first extend it to a multi-channel setting with QoS constraints, resulting in the enhanced WMMSE (eWMMSE) algorithm, which is provably convergent to a locally optimal solution when the problem is feasible. To further reduce computational complexity and improve scalability, we develop a GNN-based algorithm, JCPGNN-M, capable of supporting simultaneous multi-channel allocation per user. To overcome the limitations of traditional deep learning methods, we propose a principled framework that integrates GNN with a Lagrangian-based primal-dual optimization method. By training the GNN within the Lagrangian framework, we ensure satisfaction of QoS constraints and convergence to a stationary point. Extensive simulations demonstrate that JCPGNN-M matches the performance of eWMMSE while offering significant gains in inference speed, generalization to larger networks, and robustness under imperfect channel state information. This work presents a scalable and theoretically grounded solution for constrained resource allocation in future wireless networks.
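The contrast the abstract draws between hand-tuned penalty terms and Lagrangian primal-dual training can be illustrated on a toy problem. The sketch below is an assumption-laden stand-in, not the paper's JCPGNN-M: a single scalar `x` plays the role of the model output, the objective `x**2` stands in for a resource cost, and the constraint `x >= 1` stands in for a minimum-rate (QoS) requirement. The multiplier `lam` is learned by dual ascent instead of being tuned by hand.

```python
def primal_dual(steps=2000, lr=0.05):
    """Toy primal-dual loop: minimise x**2 subject to x >= 1."""
    x, lam = 0.0, 0.0
    for _ in range(steps):
        # Primal step: gradient descent on L(x, lam) = x**2 + lam * (1 - x).
        x -= lr * (2.0 * x - lam)
        # Dual step: projected gradient ascent keeps the multiplier non-negative
        # and raises it for as long as the constraint x >= 1 is violated.
        lam = max(0.0, lam + lr * (1.0 - x))
    return x, lam
```

On this toy problem the loop settles at the constrained optimum `x = 1` with multiplier `lam = 2`, the value at which the gradient of the Lagrangian vanishes. A fixed penalty weight would have to be guessed to land on the same point.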
[LG-28] Variational Garrote for Statistical Physics-based Sparse and Robust Variable Selection
Link: https://arxiv.org/abs/2509.06383
Authors: Hyungjoon Soh, Dongha Lee, Vipul Periwal, Junghyo Jo
Categories: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 11 pages, 4 figures
Abstract:Selecting key variables from high-dimensional data is increasingly important in the era of big data. Sparse regression serves as a powerful tool for this purpose by promoting model simplicity and explainability. In this work, we revisit a valuable yet underutilized method, the statistical physics-based Variational Garrote (VG), which introduces explicit feature selection spin variables and leverages variational inference to derive a tractable loss function. We enhance VG by incorporating modern automatic differentiation techniques, enabling scalable and efficient optimization. We evaluate VG on both fully controllable synthetic datasets and complex real-world datasets. Our results demonstrate that VG performs especially well in highly sparse regimes, offering more consistent and robust variable selection than Ridge and LASSO regression across varying levels of sparsity. We also uncover a sharp transition: as superfluous variables are admitted, generalization degrades abruptly and the uncertainty of the selection variables increases. This transition point provides a practical signal for estimating the correct number of relevant variables, an insight we successfully apply to identify key predictors in real-world data. We expect that VG offers strong potential for sparse modeling across a wide range of applications, including compressed sensing and model pruning in machine learning.
[LG-29] Breaking SafetyCore: Exploring the Risks of On-Device AI Deployment
Link: https://arxiv.org/abs/2509.06371
Authors: Victor Guyomard, Mathis Mauvisseau, Marie Paindavoine
Categories: Machine Learning (cs.LG)
Abstract:Due to hardware and software improvements, an increasing number of AI models are deployed on-device. This shift enhances privacy and reduces latency, but also introduces security risks distinct from those of traditional software. In this article, we examine these risks through the real-world case study of SafetyCore, an Android system service incorporating sensitive image content detection. We demonstrate how the on-device AI model can be extracted and manipulated to bypass detection, rendering the protection ineffective. Our analysis exposes vulnerabilities of on-device AI models and provides a practical demonstration of how adversaries can exploit them.
[LG-30] Embedding Poisoning: Bypassing Safety Alignment via Embedding Semantic Shift
Link: https://arxiv.org/abs/2509.06338
Authors: Shuai Yuan, Zhibo Zhang, Yuxi Li, Guangdong Bai, Wang Kailong
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 16 pages, 9 figures
Abstract:The widespread distribution of Large Language Models (LLMs) through public platforms like Hugging Face introduces significant security challenges. While these platforms perform basic security scans, they often fail to detect subtle manipulations within the embedding layer. This work identifies a novel class of deployment-phase attacks that exploit this vulnerability by injecting imperceptible perturbations directly into the embedding layer outputs without modifying model weights or input text. These perturbations, though statistically benign, systematically bypass safety alignment mechanisms and induce harmful behaviors during inference. We propose Search-based Embedding Poisoning (SEP), a practical, model-agnostic framework that introduces carefully optimized perturbations into embeddings associated with high-risk tokens. SEP leverages a predictable linear transition in model responses, from refusal to harmful output to semantic deviation, to identify a narrow perturbation window that evades alignment safeguards. Evaluated across six aligned LLMs, SEP achieves an average attack success rate of 96.43% while preserving benign task performance and evading conventional detection mechanisms. Our findings reveal a critical oversight in deployment security and emphasize the urgent need for embedding-level integrity checks in future LLM defense strategies.
[LG-31] Exploring approaches to computational representation and classification of user-generated meal logs
Link: https://arxiv.org/abs/2509.06330
Authors: Guanlan Hu, Adit Anand, Pooja M. Desai, Iñigo Urteaga, Lena Mamykina
Categories: Machine Learning (cs.LG)
Abstract:This study examined the use of machine learning and domain-specific enrichment on patient-generated health data, in the form of free-text meal logs, to classify meals on alignment with different nutritional goals. We used a dataset of over 3000 meal records collected by 114 individuals from a diverse, low-income community in a major US city using a mobile app. Registered dietitians provided expert judgement for meal-to-goal alignment, used as the gold standard for evaluation. Using text embeddings, including TF-IDF and BERT, and domain-specific enrichment information, including ontologies, ingredient parsers, and macronutrient contents as inputs, we evaluated the performance of logistic regression and multilayer perceptron classifiers using accuracy, precision, recall, and F1 score against the gold standard and self-assessment. Even without enrichment, ML outperformed the self-assessments of individuals who logged meals, and the best-performing combination of ML classifier with enrichment achieved even higher accuracies. In general, ML classifiers with enrichment of Parsed Ingredients, Food Entities, and Macronutrients information performed well across multiple nutritional goals, but there was variability in the impact of enrichment and classification algorithm on accuracy of classification for different nutritional goals. In conclusion, ML can utilize unstructured free-text meal logs and reliably classify whether meals align with specific nutritional goals, exceeding self-assessments, especially when incorporating nutrition domain knowledge. Our findings highlight the potential of ML analysis of patient-generated health data to support patient-centered nutrition guidance in precision healthcare.
[LG-32] Text-Trained LLMs Can Zero-Shot Extrapolate PDE Dynamics
Link: https://arxiv.org/abs/2509.06322
Authors: Jiajun Bao, Nicolas Boullé, Toni J.B. Liu, Raphaël Sarfati, Christopher J. Earls
Categories: Machine Learning (cs.LG)
Abstract:Large language models (LLMs) have demonstrated emergent in-context learning (ICL) capabilities across a range of tasks, including zero-shot time-series forecasting. We show that text-trained foundation models can accurately extrapolate spatiotemporal dynamics from discretized partial differential equation (PDE) solutions without fine-tuning or natural language prompting. Predictive accuracy improves with longer temporal contexts but degrades at finer spatial discretizations. In multi-step rollouts, where the model recursively predicts future spatial states over multiple time steps, errors grow algebraically with the time horizon, reminiscent of global error accumulation in classical finite-difference solvers. We interpret these trends as in-context neural scaling laws, where prediction quality varies predictably with both context length and output length. To better understand how LLMs are able to internally process PDE solutions so as to accurately roll them out, we analyze token-level output distributions and uncover a consistent ICL progression: beginning with syntactic pattern imitation, transitioning through an exploratory high-entropy phase, and culminating in confident, numerically grounded predictions.
[LG-33] Enhancing Low-Altitude Airspace Security: MLLM-Enabled UAV Intent Recognition
Link: https://arxiv.org/abs/2509.06312
Authors: Guangyu Lei, Tianhao Liang, Yuqi Ping, Xinglin Chen, Longyu Zhou, Junwei Wu, Xiyuan Zhang, Huahao Ding, Xingjian Zhang, Weijie Yuan, Tingting Zhang, Qinyu Zhang
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: The paper has been submitted to IEEE Internet of Things Magazine
Abstract:The rapid development of the low-altitude economy emphasizes the critical need for effective perception and intent recognition of non-cooperative unmanned aerial vehicles (UAVs). The advanced generative reasoning capabilities of multimodal large language models (MLLMs) present a promising approach to such tasks. In this paper, we focus on combining UAV intent recognition with MLLMs. Specifically, we first present an MLLM-enabled UAV intent recognition architecture, in which a multimodal perception system obtains real-time payload and motion information of UAVs and generates structured input, and the MLLM outputs intent recognition results by incorporating environmental information, prior knowledge, and tactical preferences. Subsequently, we review the related work and demonstrate its progress within the proposed architecture. Then, a use case for low-altitude confrontation is presented to demonstrate the feasibility of our architecture and offer valuable insights for practical system design. Finally, the future challenges are discussed, followed by corresponding strategic recommendations for further applications.
[LG-34] WindFM: An Open-Source Foundation Model for Zero-Shot Wind Power Forecasting
Link: https://arxiv.org/abs/2509.06311
Authors: Hang Fan, Yu Shi, Zongliang Fu, Shuo Chen, Wei Wei, Wei Xu, Jian Li
Categories: Machine Learning (cs.LG)
Abstract:High-quality wind power forecasting is crucial for the operation of modern power grids. However, prevailing data-driven paradigms either train a site-specific model, which cannot generalize to other locations, or rely on fine-tuning general-purpose time series foundation models, into which domain-specific data from the energy sector are difficult to incorporate. This paper introduces WindFM, a lightweight and generative Foundation Model designed specifically for probabilistic wind power forecasting. WindFM employs a discretize-and-generate framework. A specialized time-series tokenizer first converts continuous multivariate observations into discrete, hierarchical tokens. Subsequently, a decoder-only Transformer learns a universal representation of wind generation dynamics by autoregressively pre-training on these token sequences. Using the comprehensive WIND Toolkit dataset comprising approximately 150 billion time steps from more than 126,000 sites, WindFM develops a foundational understanding of the complex interplay between atmospheric conditions and power output. Extensive experiments demonstrate that our compact 8.1M parameter model achieves state-of-the-art zero-shot performance on both deterministic and probabilistic tasks, outperforming specialized models and larger foundation models without any fine-tuning. In particular, WindFM exhibits strong adaptiveness under out-of-distribution data from a different continent, demonstrating the robustness and transferability of its learned representations. Our pre-trained model is publicly available at this https URL.
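The "discretize" half of a discretize-and-generate pipeline can be sketched in a few lines. The abstract does not specify WindFM's tokenizer, so everything below is an assumption for illustration: uniform binning over a known value range, and a two-level (coarse, fine) token pair standing in for "hierarchical" tokens. The names `tokenize` and `detokenize` are hypothetical.

```python
def tokenize(value, lo, hi, n_coarse=16, n_fine=16):
    """Map a continuous value to a (coarse, fine) token pair via uniform binning."""
    n_bins = n_coarse * n_fine
    clipped = min(max(value, lo), hi)  # out-of-range values saturate at the edges
    idx = min(int((clipped - lo) / (hi - lo) * n_bins), n_bins - 1)
    return divmod(idx, n_fine)  # coarse = idx // n_fine, fine = idx % n_fine


def detokenize(coarse, fine, lo, hi, n_coarse=16, n_fine=16):
    """Map a token pair back to the centre of its bin."""
    n_bins = n_coarse * n_fine
    idx = coarse * n_fine + fine
    return lo + (idx + 0.5) * (hi - lo) / n_bins
```

With 16 x 16 bins, a round trip loses at most half a bin width (1/512 of the range). A sequence model over such tokens then predicts a categorical distribution per step, which is one way a forecast can be made probabilistic.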
[LG-35] LoaQ: Layer-wise Output Approximation Quantization
Link: https://arxiv.org/abs/2509.06297
Authors: Li Lin, Xiaojun Wan
Categories: Machine Learning (cs.LG)
Comments: 7 pages, under review
Abstract:A natural and intuitive idea in model quantization is to approximate each component’s quantized output to match its original. Layer-wise post-training quantization (PTQ), though based on this idea, adopts a strictly local view and can achieve, at best, only activation-aware approximations of weights. As a result, it often leads to insufficient approximations and practical deviations from this guiding intuition. Recent work has achieved a more accurate approximation of linear-layer outputs within the framework of layer-wise PTQ, but such refinements remain inadequate for achieving alignment with the full model output. Based on a deeper understanding of the structural characteristics of mainstream LLMs, we propose LoaQ, an output-approximation method for layer-wise PTQ that explicitly targets output-level consistency. It better aligns with this intuition and can feature a simple closed-form solution, making it orthogonal to existing techniques and readily integrable into existing quantization pipelines. Experiments on the LLaMA and Qwen model families demonstrate that LoaQ performs effectively in both weight-only and weight-activation joint quantization. By integrating seamlessly with existing quantization strategies, it further enhances overall quantization quality and shows strong potential to advance the frontier of post-training quantization.
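The guiding intuition in the abstract, that a quantized layer should be judged by how well its output matches the original, can be made concrete with a small sketch. This is not LoaQ itself: it uses plain symmetric round-to-nearest quantization of one weight row and measures the output mismatch on calibration inputs. All names and the 4-bit default are assumptions for illustration.

```python
def quantize_row(weights, n_bits=4):
    """Symmetric round-to-nearest quantization of one weight row (dequantized values)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = (max(abs(w) for w in weights) / qmax) or 1.0  # avoid a zero scale
    return [max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in weights]


def output_mse(weights, q_weights, calibration_inputs):
    """Mean squared difference between original and quantized layer outputs."""
    total = 0.0
    for x in calibration_inputs:
        y = sum(w * v for w, v in zip(weights, x))
        y_q = sum(w * v for w, v in zip(q_weights, x))
        total += (y - y_q) ** 2
    return total / len(calibration_inputs)
```

Two quantizations with the same weight error can give different values of `output_mse`, because per-weight errors may cancel or reinforce one another depending on the inputs. That gap is what output-oriented methods try to close.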
[LG-36] A Spatio-Temporal Graph Neural Networks Approach for Predicting Silent Data Corruption inducing Circuit-Level Faults
Link: https://arxiv.org/abs/2509.06289
Authors: Shaoqi Wei, Senling Wang, Hiroshi Kai, Yoshinobu Higami, Ruijun Ma, Tianming Ni, Xiaoqing Wen, Hiroshi Takahashi
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Emerging Technologies (cs.ET)
Comments: 21 pages, 9 figures, plan to submit to ACM TODAES
Abstract:Silent Data Errors (SDEs) from time-zero defects and aging degrade safety-critical systems. Functional testing detects SDE-related faults but is expensive to simulate. We present a unified spatio-temporal graph convolutional network (ST-GCN) for fast, accurate prediction of long-cycle fault impact probabilities (FIPs) in large sequential circuits, supporting quantitative risk assessment. Gate-level netlists are modeled as spatio-temporal graphs to capture topology and signal timing; dedicated spatial and temporal encoders predict multi-cycle FIPs efficiently. On ISCAS-89 benchmarks, the method reduces simulation time by more than 10x while maintaining high accuracy (mean absolute error 0.024 for 5-cycle predictions). The framework accepts features from testability metrics or fault simulation, allowing efficiency-accuracy trade-offs. A test-point selection study shows that choosing observation points by predicted FIPs improves detection of long-cycle, hard-to-detect faults. The approach scales to SoC-level test strategy optimization and fits downstream electronic design automation flows.
[LG-37] RecMind: LLM-Enhanced Graph Neural Networks for Personalized Consumer Recommendations
Link: https://arxiv.org/abs/2509.06286
Authors: Chang Xue, Youwei Lu, Chen Yang, Jinming Xing
Categories: Machine Learning (cs.LG)
Abstract:Personalization is a core capability across consumer technologies (streaming, shopping, wearables, and voice), yet it remains challenged by sparse interactions, fast content churn, and heterogeneous textual signals. We present RecMind, an LLM-enhanced graph recommender that treats the language model as a preference prior rather than a monolithic ranker. A frozen LLM equipped with lightweight adapters produces text-conditioned user/item embeddings from titles, attributes, and reviews; a LightGCN backbone learns collaborative embeddings from the user-item graph. We align the two views with a symmetric contrastive objective and fuse them via intra-layer gating, allowing language to dominate in cold/long-tail regimes and graph structure to stabilize rankings elsewhere. On Yelp and Amazon-Electronics, RecMind attains the best results on all eight reported metrics, with relative improvements up to +4.53% (Recall@40) and +4.01% (NDCG@40) over strong baselines. Ablations confirm both the necessity of cross-view alignment and the advantage of gating over late fusion and LLM-only variants.
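A symmetric contrastive objective of the kind the abstract mentions can be sketched as an InfoNCE loss applied in both directions between the two views. The abstract does not give the exact loss, so the form below, the temperature of 0.1, and the function names are assumptions. Row `i` of each list is taken to be the same user or item seen through the text view and the graph view.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, positives, temperature=0.1):
    """One-directional InfoNCE: row i of `positives` is the match for row i of `anchors`."""
    loss = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(anchors)


def symmetric_contrastive(text_emb, graph_emb, temperature=0.1):
    """Average the text-to-graph and graph-to-text losses so neither view is privileged."""
    return 0.5 * (info_nce(text_emb, graph_emb, temperature)
                  + info_nce(graph_emb, text_emb, temperature))
```

The loss is small when each text embedding is closest to its own graph embedding and large when the pairing is scrambled, which pulls the two embedding spaces into alignment before they are fused.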
[LG-38] IPR: Intelligent Prompt Routing with User-Controlled Quality-Cost Trade-offs
Link: https://arxiv.org/abs/2509.06274
Authors: Aosong Feng, Zhichao Xu, Xian Wu, Kang Zhou, Sheng Guan, Yueyan Chen, Ninad Kulkarni, Yun Zhou, Balasubramaniam Srinivasan, Haibo Ding, Lin Lee Cheong
Categories: Machine Learning (cs.LG)
Abstract:Routing incoming queries to the most cost-effective LLM while maintaining response quality poses a fundamental challenge in optimizing performance-cost trade-offs for large-scale commercial systems. We present IPR, a quality-constrained Intelligent Prompt Routing framework that dynamically selects optimal models based on predicted response quality and user-specified tolerance levels. IPR introduces three key innovations: (1) a modular architecture with lightweight quality estimators trained on 1.5M prompts annotated with calibrated quality scores, enabling fine-grained quality prediction across model families; (2) a user-controlled routing mechanism with tolerance parameter $\tau \in [0,1]$ that provides explicit control over quality-cost trade-offs; and (3) an extensible design using frozen encoders with model-specific adapters, reducing new model integration from days to hours. To rigorously train and evaluate IPR, we curate an industrial-level dataset, IPRBench (to be released upon legal approval), a comprehensive benchmark containing 1.5 million examples with response quality annotations across 11 LLM candidates. Deployed on a major cloud platform, IPR achieves 43.9% cost reduction while maintaining quality parity with the strongest model in the Claude family and processes requests with sub-150ms latency.
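A tolerance-controlled routing rule can be sketched as follows. The abstract does not state IPR's actual decision rule, so this is one plausible reading offered as an assumption: among models whose predicted quality is within a fraction `tau` of the best predicted quality, pick the cheapest. The model names, quality scores, and costs in the usage note are hypothetical.

```python
def route(predicted_quality, cost, tau):
    """Pick the cheapest model whose predicted quality is within tolerance of the best.

    predicted_quality: dict mapping model name to estimated response quality
    cost: dict mapping model name to cost per request
    tau: tolerance in [0, 1]; 0 always keeps the best model, larger values trade
         quality for cost (assumed rule: quality >= (1 - tau) * best quality)
    """
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    best = max(predicted_quality.values())
    threshold = best * (1.0 - tau)
    eligible = [m for m, q in predicted_quality.items() if q >= threshold]
    return min(eligible, key=lambda m: cost[m])
```

For example, with predicted qualities `{"small": 0.70, "medium": 0.85, "large": 0.90}` and costs `{"small": 1, "medium": 4, "large": 15}`, `tau=0` selects `"large"`, `tau=0.1` selects `"medium"`, and `tau=0.3` selects `"small"`.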
[LG-39] An Explainable Framework for Particle Swarm Optimization using Landscape Analysis and Machine Learning
Link: https://arxiv.org/abs/2509.06272
Authors: Nitin Gupta, Bapi Dutta, Anupam Yadav
Categories: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Abstract:Swarm intelligence algorithms have demonstrated remarkable success in solving complex optimization problems across diverse domains. However, their widespread adoption is often hindered by limited transparency in how algorithmic components influence performance. This work presents a multi-faceted investigation of Particle Swarm Optimization (PSO) to further understand the key role of different topologies for better interpretability and explainability. To achieve this objective, we first develop a comprehensive landscape characterization framework using Exploratory Landscape Analysis (ELA) to quantify problem difficulty and identify critical features affecting the optimization performance of PSO. Next, we conduct a rigorous empirical study comparing three fundamental swarm communication architectures (Ring, Star, and Von Neumann topologies), analysing their distinct impacts on exploration-exploitation balance, convergence behaviour, and solution quality. From this study we develop an explainable benchmarking framework for PSO to decode how swarm topologies affect information flow, diversity, and convergence. Based on this, a novel machine learning approach for automated algorithm configuration is introduced, which trains predictive models on extensive Area over the Convergence Curve (AOCC) data to recommend optimal settings based on problem characteristics. Through systematic experimentation across twenty-four benchmark functions in multiple dimensions, we establish practical guidelines for topology selection and parameter configuration. These findings advance the development of more transparent and reliable swarm intelligence systems. The source code of this work can be accessed at this https URL.
[LG-40] PLRV-O: Advancing Differentially Private Deep Learning via Privacy Loss Random Variable Optimization CCS’25
Link: https://arxiv.org/abs/2509.06264
Authors: Qin Yang, Nicholas Stout, Meisam Mohammady, Han Wang, Ayesha Samreen, Christopher J Quinn, Yan Yan, Ashish Kundu, Yuan Hong
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Source code is available at this https URL. This is the full version of the paper to appear in CCS’25
Abstract:Differentially Private Stochastic Gradient Descent (DP-SGD) is a standard method for enforcing privacy in deep learning, typically using the Gaussian mechanism to perturb gradient updates. However, conventional mechanisms such as Gaussian and Laplacian noise are parameterized only by variance or scale. This single degree of freedom ties the magnitude of noise directly to both privacy loss and utility degradation, preventing independent control of these two factors. The problem becomes more pronounced when the number of composition rounds T and batch size B vary across tasks, as these variations induce task-dependent shifts in the privacy-utility trade-off, where small changes in noise parameters can disproportionately affect model accuracy. To address this limitation, we introduce PLRV-O, a framework that defines a broad search space of parameterized DP-SGD noise distributions, where privacy loss moments are tightly characterized yet can be optimized more independently with respect to utility loss. This formulation enables systematic adaptation of noise to task-specific requirements, including (i) model size, (ii) training duration, (iii) batch sampling strategies, and (iv) clipping thresholds under both training and fine-tuning settings. Empirical results demonstrate that PLRV-O substantially improves utility under strict privacy constraints. On CIFAR-10, a fine-tuned ViT achieves 94.03% accuracy at $\epsilon \approx 0.5$, compared to 83.93% with Gaussian noise. On SST-2, RoBERTa-large reaches 92.20% accuracy at $\epsilon \approx 0.2$, versus 50.25% with Gaussian.
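For readers unfamiliar with the baseline the abstract starts from, the sketch below shows the standard Gaussian-mechanism gradient step of DP-SGD: clip each per-example gradient, sum, add Gaussian noise scaled to the clipping norm, and average. It does not implement PLRV-O's alternative noise distributions, and it does no privacy accounting. Function and parameter names are assumptions for illustration.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale a per-example gradient so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Noise standard deviation is noise_multiplier * max_norm per coordinate.
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_example_grads) for v in noisy]
```

The single knob `noise_multiplier` is the "one degree of freedom" the abstract refers to: raising it strengthens the privacy guarantee and degrades the gradient signal at the same time.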
[LG-41] FineServe: Precision-Aware KV Slab and Two-Level Scheduling for Heterogeneous Precision LLM Serving
Link: https://arxiv.org/abs/2509.06261
Authors: Kyungmin Bin, Seungbeom Choi, Jimyoung Son, Jieun Choi, Daseul Bae, Daehyeon Baek, Kihyo Moon, Minsung Jang, Hyojung Lee
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Abstract:Recent advances in Post-Training Quantization (PTQ) techniques have significantly increased demand for serving quantized large language models (LLMs), enabling higher throughput and substantially reduced memory usage with minimal accuracy loss. Quantized models address memory constraints in LLMs and enhance GPU resource utilization through efficient GPU sharing. However, quantized models have smaller KV block sizes than non-quantized models, and the resulting memory fragmentation limits memory efficiency. Also, distinct resource usage patterns between quantized and non-quantized models require efficient scheduling to maximize throughput. To address these challenges, we propose FineServe, an inference serving framework for mixed-precision LLMs. FineServe’s key contributions include: (1) KV Slab, a precision-aware adaptive memory management technique that dynamically allocates KV cache based on model quantization characteristics, significantly reducing GPU memory fragmentation, and (2) a two-level scheduling framework comprising a global scheduler that places models on GPUs based on request rates, latency SLOs, and memory constraints and efficiency, and a local scheduler that adaptively adjusts batch sizes according to real-time request fluctuations. Experimental results demonstrate that FineServe achieves up to 2.2x higher SLO attainment and 1.8x higher token generation throughput compared to state-of-the-art GPU sharing systems.
[LG-42] MCIGLE: Multimodal Exemplar-Free Class-Incremental Graph Learning KSEM2025
Link: https://arxiv.org/abs/2509.06219
Authors: Haochen You, Baojing Liu
Categories: Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Accepted as a conference paper at KSEM 2025
Abstract:Exemplar-free class-incremental learning enables models to learn new classes over time without storing data from old ones. As multimodal graph-structured data becomes increasingly prevalent, existing methods struggle with challenges like catastrophic forgetting, distribution bias, memory limits, and weak generalization. We propose MCIGLE, a novel framework that addresses these issues by extracting and aligning multimodal graph features and applying Concatenated Recursive Least Squares for effective knowledge retention. Through multi-channel processing, MCIGLE balances accuracy and memory preservation. Experiments on public datasets validate its effectiveness and generalizability.
[LG-43] Metric Embedding Initialization-Based Differentially Private and Explainable Graph Clustering KSEM2025
Link: https://arxiv.org/abs/2509.06214
Authors: Haochen You,Baojing Liu
Categories: Machine Learning (cs.LG)
Notes: Accepted as a conference paper at KSEM 2025
Abstract:Graph clustering under the framework of differential privacy, which aims to process graph-structured data while protecting individual privacy, has been receiving increasing attention. Despite significant achievements in current research, challenges such as high noise, low efficiency and poor interpretability continue to severely constrain the development of this field. In this paper, we construct a differentially private and interpretable graph clustering approach based on metric embedding initialization. Specifically, we construct an SDP optimization, extract the key set and provide a well-initialized clustering configuration using an HST-based initialization method. Subsequently, we apply an established k-median clustering strategy to derive the cluster results and offer comparative explanations for the query set through differences from the cluster centers. Extensive experiments on public datasets demonstrate that our proposed framework outperforms existing methods in various clustering metrics while strictly ensuring privacy.
[LG-44] Modeling shopper interest broadness with entropy-driven dialogue policy in the context of arbitrarily large product catalogs
Link: https://arxiv.org/abs/2509.06185
Authors: Firas Jarboui,Issa Memari
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Notes:
Abstract:Conversational recommender systems promise rich interactions for e-commerce, but balancing exploration (clarifying user needs) and exploitation (making recommendations) remains challenging, especially when deploying large language models (LLMs) with vast product catalogs. We address this challenge by modeling the breadth of user interest via the entropy of retrieval score distributions. Our method uses a neural retriever to fetch relevant items for a user query and computes the entropy of the re-ranked scores to dynamically route the dialogue policy: low-entropy (specific) queries trigger direct recommendations, whereas high-entropy (ambiguous) queries prompt exploratory questions. This simple yet effective strategy allows an LLM-driven agent to remain aware of an arbitrarily large catalog in real-time without bloating its context window.
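The routing rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: converting scores to probabilities with a softmax, normalizing the entropy by log(n), and the threshold value 0.8 are all choices made for this sketch, and `normalized_entropy` and `route` are hypothetical names.

```python
import math

def softmax(scores):
    """Convert raw retrieval scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_entropy(scores):
    """Shannon entropy of softmax(scores), scaled to the range [0, 1]."""
    if len(scores) < 2:
        return 0.0
    probs = softmax(scores)
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(scores))

def route(scores, threshold=0.8):
    """High entropy (ambiguous query) -> ask; low entropy (specific) -> recommend."""
    if normalized_entropy(scores) > threshold:
        return "ask_clarifying_question"
    return "recommend"
```

A query whose scores are dominated by one item, such as `[10.0, 0.0, 0.0, 0.0]`, routes to a recommendation, while flat scores such as `[1.0, 1.0, 1.0, 1.0]` route to a clarifying question. Only the scores enter the decision, so the catalog itself never has to be placed in the LLM context.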
[LG-45] Exploring Urban Factors with Autoencoders: Relationship Between Static and Dynamic Features
Link: https://arxiv.org/abs/2509.06167
Authors: Ximena Pocco,Waqar Hassan,Karelia Salinas,Vladimir Molchanov,Luis G. Nonato
Categories: Machine Learning (cs.LG); Graphics (cs.GR)
Notes:
Abstract:Urban analytics utilizes extensive datasets with diverse urban information to simulate, predict trends, and uncover complex patterns within cities. While these data enable advanced analysis, they also present challenges due to their granularity, heterogeneity, and multimodality. To address these challenges, visual analytics tools have been developed to support the exploration of latent representations of fused heterogeneous and multimodal data, discretized at a street level of detail. However, visualization-assisted tools seldom explore the extent to which fused data can offer deeper insights than examining each data source independently within an integrated visualization framework. In this work, we developed a visualization-assisted framework to analyze whether fused latent data representations are more effective than separate representations in uncovering patterns from dynamic and static urban data. The analysis reveals that combined latent representations produce more structured patterns, while separate ones are useful in particular cases.
[LG-46] An Improved Template for Approximate Computing
Link: https://arxiv.org/abs/2509.06162
Authors: M. Rezaalipour,F. Costa,M. Biasion,R. Otoni,G. A. Constantinides,L. Pozzi
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
Notes: 4 pages, 5 figures
Abstract:Deploying neural networks on edge devices entails a careful balance between the energy required for inference and the accuracy of the resulting classification. One technique for navigating this tradeoff is approximate computing: the process of reducing energy consumption by slightly reducing the accuracy of arithmetic operators. In this context, we propose a methodology to reduce the area of the small arithmetic operators used in neural networks - i.e., adders and multipliers - via a small loss in accuracy, and show that we improve area savings for the same accuracy loss w.r.t. the state of the art. To achieve our goal, we improve on a boolean rewriting technique recently proposed, called XPAT, where the use of a parametrisable template to rewrite circuits has proved to be highly beneficial. In particular, XPAT was able to produce smaller circuits than comparable approaches while utilising a naive sum of products template structure. In this work, we show that template parameters can act as proxies for chosen metrics and we propose a novel template based on parametrisable product sharing that acts as a close proxy to synthesised area. We demonstrate experimentally that our methodology converges better to low-area solutions and that it can find better approximations than both the original XPAT and two other state-of-the-art approaches.
[LG-47] Data-Efficient Time-Dependent PDE Surrogates: Graph Neural Simulators vs Neural Operators
Link: https://arxiv.org/abs/2509.06154
Authors: Dibyajyoti Nayak,Somdatta Goswami
Categories: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
Notes: 21 pages including references. Supplementary Information provided
Abstract:Neural operators (NOs) approximate mappings between infinite-dimensional function spaces but require large datasets and struggle with scarce training data. Many NO formulations don’t explicitly encode causal, local-in-time structure of physical evolution. While autoregressive models preserve causality by predicting next time-steps, they suffer from rapid error accumulation. We employ Graph Neural Simulators (GNS) - a message-passing graph neural network framework - with explicit numerical time-stepping schemes to construct accurate forward models that learn PDE solutions by modeling instantaneous time derivatives. We evaluate our framework on three canonical PDE systems: (1) 2D Burgers’ scalar equation, (2) 2D coupled Burgers’ vector equation, and (3) 2D Allen-Cahn equation. Rigorous evaluations demonstrate GNS significantly improves data efficiency, achieving higher generalization accuracy with substantially fewer training trajectories compared to neural operator baselines like DeepONet and FNO. GNS consistently achieves under 1% relative L2 errors with only 30 training samples out of 1000 (3% of available data) across all three PDE systems. It substantially reduces error accumulation over extended temporal horizons: averaged across all cases, GNS reduces autoregressive error by 82.48% relative to FNO AR and 99.86% relative to DON AR. We introduce a PCA+KMeans trajectory selection strategy enhancing low-data performance. Results indicate combining graph-based local inductive biases with conventional time integrators yields accurate, physically consistent, and scalable surrogate models for time-dependent PDEs.
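The idea of pairing a learned instantaneous time derivative with an explicit time-stepping scheme can be sketched as follows. This is a minimal illustration, assuming forward Euler as the explicit scheme (the abstract does not say which schemes are used); `toy_derivative` is a hypothetical stand-in for the trained graph network and simply encodes du/dt = -u.

```python
def rollout(u0, derivative_fn, dt, n_steps):
    """Forward-Euler rollout: u_{k+1} = u_k + dt * derivative_fn(u_k)."""
    u = list(u0)
    states = [list(u)]
    for _ in range(n_steps):
        dudt = derivative_fn(u)
        u = [ui + dt * di for ui, di in zip(u, dudt)]
        states.append(list(u))
    return states

def toy_derivative(u):
    """Stand-in for the learned model: exponential decay, du/dt = -u."""
    return [-ui for ui in u]

# Two steps of size 0.1 from the initial state [1.0, 2.0].
trajectory = rollout([1.0, 2.0], toy_derivative, dt=0.1, n_steps=2)
```

Because the model predicts a derivative rather than the next state, the integrator can in principle be swapped for another explicit scheme without retraining the stand-in function's interface.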
[LG-48] If generative AI is the answer what is the question?
Link: https://arxiv.org/abs/2509.06120
Authors: Ambuj Tewari
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Notes: To appear as a book chapter in a Springer book titled “Statistical Foundations and Applications of Artificial Intelligence, Machine Learning and Deep Learning” and edited by S. Ejaz Ahmed, Pierre Alquier, Yi Li, Shuangge Ma
Abstract:Beginning with text and images, generative AI has expanded to audio, video, computer code, and molecules. Yet, if generative AI is the answer, what is the question? We explore the foundations of generation as a distinct machine learning task with connections to prediction, compression, and decision-making. We survey five major generative model families: autoregressive models, variational autoencoders, normalizing flows, generative adversarial networks, and diffusion models. We then introduce a probabilistic framework that emphasizes the distinction between density estimation and generation. We review a game-theoretic framework with a two-player adversary-learner setup to study generation. We discuss post-training modifications that prepare generative models for deployment. We end by highlighting some important topics in socially responsible generation such as privacy, detection of AI-generated content, and copyright and IP. We adopt a task-first framing of generation, focusing on what generation is as a machine learning problem, rather than only on how models implement it.
[LG-49] Using Reinforcement Learning to Optimize the Global and Local Crossing Number
Link: https://arxiv.org/abs/2509.06108
Authors: Timo Brand,Henry Förster,Stephen Kobourov,Robin Schukrafft,Markus Wallinger,Johannes Zink
Categories: Computational Geometry (cs.CG); Machine Learning (cs.LG)
Notes:
Abstract:We present a novel approach to graph drawing based on reinforcement learning for minimizing the global and the local crossing number, that is, the total number of edge crossings and the maximum number of crossings on any edge, respectively. In our framework, an agent learns how to move a vertex based on a given observation vector in order to optimize its position. The agent receives feedback in the form of local reward signals tied to crossing reduction. To generate an initial layout, we use a stress-based graph-drawing algorithm. We compare our method against force- and stress-based (baseline) algorithms as well as three established algorithms for global crossing minimization on a suite of benchmark graphs. The experiments show mixed results: our current algorithm is mainly competitive for the local crossing number. We see a potential for further development of the approach in the future.
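The two objectives defined above, the total number of edge crossings and the maximum number of crossings on any single edge, can be computed for a straight-line drawing as in the sketch below. It assumes straight-line edges in general position (no three collinear vertices, no overlapping edges) and uses a brute-force pairwise test; the function names are hypothetical and this is not the evaluation code from the paper.

```python
from itertools import combinations

def orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1, -1, or 0."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)

def edges_cross(pos, e, f):
    """True if straight-line edges e and f cross at an interior point."""
    if set(e) & set(f):
        return False  # edges sharing an endpoint are not counted as crossing
    a, b = pos[e[0]], pos[e[1]]
    c, d = pos[f[0]], pos[f[1]]
    return (orient(a, b, c) * orient(a, b, d) < 0
            and orient(c, d, a) * orient(c, d, b) < 0)

def crossing_numbers(pos, edges):
    """Return (global crossing count, maximum crossings on any one edge)."""
    per_edge = {e: 0 for e in edges}
    total = 0
    for e, f in combinations(edges, 2):
        if edges_cross(pos, e, f):
            total += 1
            per_edge[e] += 1
            per_edge[f] += 1
    return total, max(per_edge.values(), default=0)
```

For example, a square with both diagonals drawn has one crossing in total and at most one crossing per edge, so `crossing_numbers` returns `(1, 1)` for that drawing.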
[LG-50] A Surrogate model for High Temperature Superconducting Magnets to Predict Current Distribution with Neural Network
Link: https://arxiv.org/abs/2509.06067
Authors: Mianjun Xiao,Peng Song,Yulong Liu,Cedric Korte,Ziyang Xu,Jiale Gao,Jiaqi Lu,Haoyang Nie,Qiantong Deng,Timing Qu
Categories: Machine Learning (cs.LG)
Notes:
Abstract:The finite element method (FEM) is widely used in high-temperature superconducting (HTS) magnets, but its computational cost increases with magnet size and becomes time-consuming for meter-scale magnets, especially when multi-physics couplings are considered, which limits the fast design of large-scale REBCO magnet systems. In this work, a surrogate model based on a fully connected residual neural network (FCRN) is developed to predict the space-time current density distribution in REBCO solenoids. Training datasets were generated from FEM simulations with varying numbers of turns and pancakes. The results demonstrate that, for deeper networks, the FCRN architecture achieves better convergence than a conventional fully connected network (FCN), with the configuration of 12 residual blocks and 256 neurons per layer providing the most favorable balance between training accuracy and generalization capability. Extrapolation studies show that the model can reliably predict magnetization losses for up to 50% beyond the training range, with maximum errors below 10%. The surrogate model achieves predictions several orders of magnitude faster than FEM and still remains advantageous when training costs are included. These results indicate that the proposed FCRN-based surrogate model provides both accuracy and efficiency, offering a promising tool for the rapid analysis of large-scale HTS magnets.
[LG-51] A novel biomass fluidized bed gasification model coupled with machine learning and CFD simulation
Link: https://arxiv.org/abs/2509.06056
Authors: Chun Wang
Categories: Machine Learning (cs.LG)
Notes:
Abstract:A coupled model of biomass fluidized bed gasification based on machine learning and computational fluid dynamics is proposed to improve the prediction accuracy and computational efficiency of the complex thermochemical reaction process. By constructing a high-quality data set based on experimental data and high-fidelity simulation results, the agent model used to describe the characteristics of reaction kinetics was trained and embedded into the computational fluid dynamics (CFD) framework to realize the real-time update of reaction rate and composition evolution.
[LG-52] Code2MCP: A Multi-Agent Framework for Automated Transformation of Code Repositories into Model Context Protocol Services
Link: https://arxiv.org/abs/2509.05941
Authors: Chaoqian Ouyang,Ling Yue,Shimin Di,Libin Zheng,Shaowu Pan,Min-Ling Zhang
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Notes:
Abstract:The proliferation of Large Language Models (LLMs) has created a significant integration challenge in the AI agent ecosystem, often called the “N × M problem,” where N models require custom integrations for M tools. This fragmentation stifles innovation and creates substantial development overhead. While the Model Context Protocol (MCP) has emerged as a standard to resolve this, its adoption is hindered by the manual effort required to convert the vast universe of existing software into MCP-compliant services. This is especially true for the millions of open-source repositories on GitHub, the world’s largest collection of functional code. This paper introduces Code2MCP, a highly automated, agentic framework designed to transform any GitHub repository into a functional MCP service with minimal human intervention. Our system employs a multi-stage workflow that automates the entire process, from code analysis and environment configuration to service generation and deployment. A key innovation of our framework is an LLM-driven, closed-loop “Run–Review–Fix” cycle, which enables the system to autonomously debug and repair the code it generates. Code2MCP produces not only deployable services but also comprehensive technical documentation, acting as a catalyst to accelerate the MCP ecosystem by systematically unlocking the world’s largest open-source code repository and automating the critical last mile of tool integration. The code is open-sourced at this https URL.
[LG-53] ALPHA: LLM-Enabled Active Learning for Human-Free Network Anomaly Detection
Link: https://arxiv.org/abs/2509.05936
Authors: Xuanhao Luo,Shivesh Madan Nath Jha,Akruti Sinha,Zhizhen Li,Yuchen Liu
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Notes: Accepted at 44th IEEE International Performance Computing and Communications Conference (IPCCC 2025)
Abstract:Network log data analysis plays a critical role in detecting security threats and operational anomalies. Traditional log analysis methods for anomaly detection and root cause analysis rely heavily on expert knowledge or fully supervised learning models, both of which require extensive labeled data and significant human effort. To address these challenges, we propose ALPHA, the first Active Learning Pipeline for Human-free log Analysis. ALPHA integrates semantic embedding, clustering-based representative sampling, and large language model (LLM)-assisted few-shot annotation to automate the anomaly detection process. The LLM annotated labels are propagated across clusters, enabling large-scale training of an anomaly detector with minimal supervision. To enhance the annotation accuracy, we propose a two-step few-shot refinement strategy that adaptively selects informative prompts based on the LLM’s observed error patterns. Extensive experiments on real-world log datasets demonstrate that ALPHA achieves detection accuracy comparable to fully supervised methods while mitigating human efforts in the loop. ALPHA also supports interpretable analysis through LLM-driven root cause explanations in the post-detection stage. These capabilities make ALPHA a scalable and cost-efficient solution for truly automated log-based anomaly detection.
[LG-54] Smoothed Online Optimization for Target Tracking: Robust and Learning-Augmented Algorithms
Link: https://arxiv.org/abs/2509.05930
Authors: Ali Zeynali,Mahsa Sahebdel,Qingsong Liu,Mohammad Hajiesmaili,Ramesh K. Sitaraman
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
Notes: 10 pages, 14 pages appendix
Abstract:We introduce the Smoothed Online Optimization for Target Tracking (SOOTT) problem, a new framework that integrates three key objectives in online decision-making under uncertainty: (1) tracking cost for following a dynamically moving target, (2) adversarial perturbation cost for withstanding unpredictable disturbances, and (3) switching cost for penalizing abrupt changes in decisions. This formulation captures real-world scenarios such as elastic and inelastic workload scheduling in AI clusters, where operators must balance long-term service-level agreements (e.g., LLM training) against sudden demand spikes (e.g., real-time inference). We first present BEST, a robust algorithm with provable competitive guarantees for SOOTT. To enhance practical performance, we introduce CoRT, a learning-augmented variant that incorporates untrusted black-box predictions (e.g., from ML models) into its decision process. Our theoretical analysis shows that CoRT strictly improves over BEST when predictions are accurate, while maintaining robustness under arbitrary prediction errors. We validate our approach through a case study on workload scheduling, demonstrating that both algorithms effectively balance trajectory tracking, decision smoothness, and resilience to external disturbances.
[LG-55] X-SQL: Expert Schema Linking and Understanding of Text-to-SQL with Multi-LLMs
Link: https://arxiv.org/abs/2509.05899
Authors: Dazhi Peng
Categories: Machine Learning (cs.LG); Databases (cs.DB)
Notes:
Abstract:With Large Language Models’ (LLMs) emergent abilities on code generation tasks, Text-to-SQL has become one of the most popular downstream applications. Despite the strong results of multiple recent LLM-based Text-to-SQL frameworks, the research community often overlooks the importance of database schema information for generating high-quality SQL queries. We find that such schema information plays a significant or even dominant role in the Text-to-SQL task. To tackle this challenge, we propose a novel database schema expert with two components. We first introduce X-Linking, an LLM Supervised Finetuning (SFT)-based method that achieves superior Schema Linking results compared to existing open-source Text-to-SQL methods. In addition, we innovatively propose an X-Admin component that focuses on Schema Understanding by bridging the gap between abstract schema information and the user’s natural language question. Aside from better learning with schema information, we experiment with Multi-LLMs for different components within the system to further boost its performance. By incorporating these techniques into our end-to-end framework, X-SQL, we have achieved Execution Accuracies of 84.9% on the Spider-Dev dataset and 82.5% on the Spider-Test dataset. This outstanding performance establishes X-SQL as the leading Text-to-SQL framework based on open-source models.
[LG-56] SPINN: An Optimal Self-Supervised Physics-Informed Neural Network Framework
Link: https://arxiv.org/abs/2509.05886
Authors: Reza Pirayeshshirazinezhad
Categories: Machine Learning (cs.LG)
Notes:
Abstract:A surrogate model is developed to predict the convective heat transfer coefficient of liquid sodium (Na) flow within rectangular miniature heat sinks. Initially, kernel-based machine learning techniques and a shallow neural network are applied to a dataset with 87 Nusselt numbers for liquid sodium in rectangular miniature heat sinks. Subsequently, a self-supervised physics-informed neural network and a transfer learning approach are used to increase the estimation performance. In the self-supervised physics-informed neural network, an additional layer determines the weight of the physics in the loss function to balance data and physics based on their uncertainty for a better estimation. For transfer learning, a shallow neural network trained on water is adapted for use with Na. Validation results show that the self-supervised physics-informed neural network successfully estimates the heat transfer rates of Na with an error margin of approximately +8%. Using only physics for regression, the error remains between 5% and 10%. Other machine learning methods specify the prediction mostly within +8%. High-fidelity modeling of turbulent forced convection of liquid metals using computational fluid dynamics (CFD) is both time-consuming and computationally expensive. Therefore, machine learning based models offer a powerful alternative tool for the design and optimization of liquid-metal-cooled miniature heat sinks.
[LG-57] The Measure of Deception: An Analysis of Data Forging in Machine Unlearning
Link: https://arxiv.org/abs/2509.05865
Authors: Rishabh Dixit,Yuan Hui,Rayan Saab
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Notes:
Abstract:Motivated by privacy regulations and the need to mitigate the effects of harmful data, machine unlearning seeks to modify trained models so that they effectively “forget” designated data. A key challenge in verifying unlearning is forging: adversarially crafting data that mimics the gradient of a target point, thereby creating the appearance of unlearning without actually removing information. To capture this phenomenon, we consider the collection of data points whose gradients approximate a target gradient within tolerance ε, which we call an ε-forging set, and develop a framework for its analysis. For linear regression and one-layer neural networks, we show that the Lebesgue measure of this set is small. It scales on the order of ε, and when ε is small enough, ε^d. More generally, under mild regularity assumptions, we prove that the forging set measure decays as ε^((d-r)/2), where d is the data dimension and r < d is the nullity of a variation matrix defined by the model gradients. Extensions to batch SGD and almost-everywhere smooth loss functions yield the same asymptotic scaling. In addition, we establish probability bounds showing that, under non-degenerate data distributions, the likelihood of randomly sampling a forging point is vanishingly small. These results provide evidence that adversarial forging is fundamentally limited and that false unlearning claims can, in principle, be detected.
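The forging set described in the abstract can be written out explicitly. The notation below is an assumption made for illustration (the abstract does not fix symbols): ℓ is the loss, θ the model parameters, z* the target data point, μ the Lebesgue measure on the d-dimensional data space, and the choice of norm is left unspecified.

```latex
% Illustrative notation only; symbols are assumptions, not taken from the paper.
\[
  F_\epsilon(z^\ast)
  = \bigl\{\, z \in \mathbb{R}^d :
      \lVert \nabla_\theta \ell(\theta; z) - \nabla_\theta \ell(\theta; z^\ast) \rVert
      \le \epsilon \,\bigr\},
  \qquad
  \mu\bigl(F_\epsilon(z^\ast)\bigr) = O\bigl(\epsilon^{(d-r)/2}\bigr).
\]
```

Read this way, the stated result says the volume of admissible forging points shrinks polynomially as the tolerance tightens, with an exponent that grows with the data dimension d.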
[LG-58] Data-Driven Stochastic Modeling Using Autoregressive Sequence Models: Translating Event Tables to Queueing Dynamics
Link: https://arxiv.org/abs/2509.05839
Authors: Daksh Mittal,Shunri Zheng,Jing Dong,Hongseok Namkoong
Categories: Machine Learning (cs.LG)
Notes:
Abstract:While queueing network models are powerful tools for analyzing service systems, they traditionally require substantial human effort and domain expertise to construct. To make this modeling approach more scalable and accessible, we propose a data-driven framework for queueing network modeling and simulation based on autoregressive sequence models trained on event-stream data. Instead of explicitly specifying arrival processes, service mechanisms, or routing logic, our approach learns the conditional distributions of event types and event times, recasting the modeling task as a problem of sequence distribution learning. We show that Transformer-style architectures can effectively parameterize these distributions, enabling automated construction of high-fidelity simulators. As a proof of concept, we validate our framework on event tables generated from diverse queueing networks, showcasing its utility in simulation, uncertainty quantification, and counterfactual evaluation. Leveraging advances in artificial intelligence and the growing availability of data, our framework takes a step toward more automated, data-driven modeling pipelines to support broader adoption of queueing network models across service domains.
[LG-59] Benchmarking Robust Aggregation in Decentralized Gradient Marketplaces
Link: https://arxiv.org/abs/2509.05833
Authors: Zeyu Song,Sainyam Galhotra,Shagufta Mehnaz
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Notes:
Abstract:The rise of distributed and privacy-preserving machine learning has sparked interest in decentralized gradient marketplaces, where participants trade intermediate artifacts like gradients. However, existing Federated Learning (FL) benchmarks overlook critical economic and systemic factors unique to such marketplaces (cost-effectiveness, fairness to sellers, and market stability), especially when a buyer relies on a private baseline dataset for evaluation. We introduce a comprehensive benchmark framework to holistically evaluate robust gradient aggregation methods within these buyer-baseline-reliant marketplaces. Our contributions include: (1) a simulation environment modeling marketplace dynamics with a variable buyer baseline and diverse seller distributions; (2) an evaluation methodology augmenting standard FL metrics with marketplace-centric dimensions such as Economic Efficiency, Fairness, and Selection Dynamics; (3) an in-depth empirical analysis of the existing Distributed Gradient Marketplace framework, MartFL, including the integration and comparative evaluation of adapted FLTrust and SkyMask as alternative aggregation strategies within it. This benchmark spans diverse datasets, local attacks, and Sybil attacks targeting the marketplace selection process; and (4) actionable insights into the trade-offs between model performance, robustness, cost, fairness, and stability. This benchmark equips the community with essential tools and empirical evidence to evaluate and design more robust, equitable, and economically viable decentralized gradient marketplaces.
[LG-60] Finetuning LLMs for Human Behavior Prediction in Social Science Experiments
Link: https://arxiv.org/abs/2509.05830
Authors: Akaash Kolluri,Shengguang Wu,Joon Sung Park,Michael S. Bernstein
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Notes: 16 pages, 5 figures
Abstract:Large language models (LLMs) offer a powerful opportunity to simulate the results of social science experiments. In this work, we demonstrate that finetuning LLMs directly on individual-level responses from past experiments meaningfully improves the accuracy of such simulations across diverse social science domains. We construct SocSci210 via an automatic pipeline, a dataset comprising 2.9 million responses from 400,491 participants in 210 open-source social science experiments. Through finetuning, we achieve multiple levels of generalization. In completely unseen studies, our strongest model, Socrates-Qwen-14B, produces predictions that are 26% more aligned with distributions of human responses to diverse outcome questions under varying conditions relative to its base model (Qwen2.5-14B), outperforming GPT-4o by 13%. By finetuning on a subset of conditions in a study, generalization to new unseen conditions is particularly robust, improving by 71%. Since SocSci210 contains rich demographic information, we reduce demographic parity, a measure of bias, by 10.6% through finetuning. Because social sciences routinely generate rich, topic-specific datasets, our findings indicate that finetuning on such data could enable more accurate simulations for experimental hypothesis screening. We release our data, models and finetuning code at this http URL.
[LG-61] Simple Optimizers for Convex Aligned Multi-Objective Optimization
Link: https://arxiv.org/abs/2509.05811
Authors: Ben Kretzu,Karen Ullrich,Yonathan Efroni
Categories: Machine Learning (cs.LG)
Notes:
Abstract:It is widely recognized in modern machine learning practice that access to a diverse set of tasks can enhance performance across those tasks. This observation suggests that, unlike in general multi-objective optimization, the objectives in many real-world settings may not be inherently conflicting. To address this, prior work introduced the Aligned Multi-Objective Optimization (AMOO) framework and proposed gradient-based algorithms with provable convergence guarantees. However, existing analysis relies on strong assumptions, particularly strong convexity, which implies the existence of a unique optimal solution. In this work, we relax this assumption and study gradient-descent algorithms for convex AMOO under standard smoothness or Lipschitz continuity conditions, assumptions more consistent with those used in deep learning practice. This generalization requires new analytical tools and metrics to characterize convergence in the convex AMOO setting. We develop such tools, propose scalable algorithms for convex AMOO, and establish their convergence guarantees. Additionally, we prove a novel lower bound that demonstrates the suboptimality of naive equal-weight approaches compared to our methods.
[LG-62] Select then Balance: A Plug-and-Play Framework for Exogenous-Aware Spatio-Temporal Forecasting
Link: https://arxiv.org/abs/2509.05779
Authors: Wei Chen,Yuqian Wu,Yuanshao Zhu,Xixuan Hao,Shiyu Wang,Yuxuan Liang
Categories: Machine Learning (cs.LG)
Notes: 16 pages, 11 figures
Abstract:Spatio-temporal forecasting aims to predict the future state of dynamic systems and plays an important role in multiple fields. However, existing solutions only focus on modeling using a limited number of observed target variables. In real-world scenarios, exogenous variables can be integrated into the model as additional input features and associated with the target signal to promote forecast accuracy. Although promising, this still encounters two challenges: the inconsistent effects of different exogenous variables on the target system, and the imbalanced effects between historical variables and future variables. To address these challenges, this paper introduces a novel framework for modeling exogenous variables in spatio-temporal forecasting, which follows a “select, then balance” paradigm. Specifically, we first construct a latent space gated expert module, where fused exogenous information is projected into a latent space to dynamically select and recompose salient signals via specialized sub-experts. Furthermore, we design a siamese network architecture in which recomposed representations of past and future exogenous variables are fed into dual-branch spatio-temporal backbones to capture dynamic patterns. The outputs are integrated through a context-aware weighting mechanism to achieve dynamic balance during the modeling process. Extensive experiments on real-world datasets demonstrate the effectiveness, generality, robustness, and efficiency of our proposed framework.
[LG-63] Ensemble of Precision-Recall Curve (PRC) Classification Trees with Autoencoders
Link: https://arxiv.org/abs/2509.05766
Authors: Jiaju Miao,Wei Zhu
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract:Anomaly detection underpins critical applications from network security and intrusion detection to fraud prevention, where recognizing aberrant patterns rapidly is indispensable. Progress in this area is routinely impeded by two obstacles: extreme class imbalance and the curse of dimensionality. To combat the former, we previously introduced Precision-Recall Curve (PRC) classification trees and their ensemble extension, the PRC Random Forest (PRC-RF). Building on that foundation, we now propose a hybrid framework that integrates PRC-RF with autoencoders, unsupervised machine learning methods that learn compact latent representations, to confront both challenges simultaneously. Extensive experiments across diverse benchmark datasets demonstrate that the resulting Autoencoder-PRC-RF model achieves superior accuracy, scalability, and interpretability relative to prior methods, affirming its potential for high-stakes anomaly-detection tasks.
[LG-64] Automating API Documentation with LLMs: A BERTopic Approach
Link: https://arxiv.org/abs/2509.05749
Authors: AmirHossein Naghshzan
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Abstract:Developers rely on API documentation, but official sources are often lengthy, complex, or incomplete. Many turn to community-driven forums like Stack Overflow for practical insights. We propose automating the summarization of informal sources, focusing on Android APIs. Using BERTopic, we extracted prevalent topics from 3.6 million Stack Overflow posts and applied extractive summarization techniques to generate concise summaries, including code snippets. A user study with 30 Android developers assessed the summaries for coherence, relevance, informativeness, and satisfaction, showing improved productivity. Integrating formal API knowledge with community-generated content enhances documentation, making API resources more accessible and actionable.
[LG-65] Morphological Perceptron with Competitive Layer: Training Using Convex-Concave Procedure
Link: https://arxiv.org/abs/2509.05697
Authors: Iara Cunha,Marcos Eduardo Valle
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Submitted to the 4th International Conference on Discrete Geometry and Mathematical Morphology (DGMM 2025)
Abstract:A morphological perceptron is a multilayer feedforward neural network in which neurons perform elementary operations from mathematical morphology. For multiclass classification tasks, a morphological perceptron with a competitive layer (MPCL) is obtained by integrating a winner-take-all output layer into the standard morphological architecture. The non-differentiability of morphological operators renders gradient-based optimization methods unsuitable for training such networks. Consequently, alternative strategies that do not depend on gradient information are commonly adopted. This paper proposes the use of the convex-concave procedure (CCP) for training MPCL networks. The training problem is formulated as a difference of convex (DC) functions and solved iteratively using CCP, resulting in a sequence of linear programming subproblems. Computational experiments demonstrate the effectiveness of the proposed training method in addressing classification tasks with MPCL networks.
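For readers unfamiliar with the convex-concave procedure, the generic iteration for a difference-of-convex objective can be written as follows. This is the textbook form of CCP, not the paper's specific MPCL formulation; the symbols $f$, $g$, and $x^{(k)}$ are placeholders introduced here for illustration.

```latex
% DC program: both f and g are convex
\min_{x}\; f(x) - g(x)

% CCP step: replace g by its linearization at the current iterate x^{(k)}
% and solve the resulting convex subproblem
x^{(k+1)} \in \arg\min_{x}\; f(x) - \Bigl[\, g\bigl(x^{(k)}\bigr) + \nabla g\bigl(x^{(k)}\bigr)^{\top} \bigl(x - x^{(k)}\bigr) \Bigr]
```

When $g$ is not differentiable, a subgradient takes the place of $\nabla g$. According to the abstract, each such subproblem reduces to a linear program in the MPCL setting.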
[LG-66] Distributed Deep Learning using Stochastic Gradient Staleness
Link: https://arxiv.org/abs/2509.05679
Authors: Viet Hoang Pham,Hyo-Sung Ahn
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract:Despite the notable success of deep neural networks (DNNs) in solving complex tasks, the training process still poses considerable challenges. A primary obstacle is the substantial time required for training, particularly as high-performing DNNs tend to become increasingly deep (characterized by a larger number of hidden layers) and require extensive training datasets. To address these challenges, this paper introduces a distributed training method that integrates two prominent strategies for accelerating deep learning: data parallelism and the fully decoupled parallel backpropagation algorithm. By utilizing multiple computational units operating in parallel, the proposed approach increases the amount of training data processed in each iteration while mitigating locking issues commonly associated with the backpropagation algorithm. These features collectively contribute to significant improvements in training efficiency. The proposed distributed training method is rigorously proven to converge to critical points under certain conditions. Its effectiveness is further demonstrated through empirical evaluations, wherein a DNN is trained to perform classification tasks on the CIFAR-10 dataset.
[LG-67] DQS: A Low-Budget Query Strategy for Enhancing Unsupervised Data-driven Anomaly Detection Approaches
Link: https://arxiv.org/abs/2509.05663
Authors: Lucas Correia,Jan-Christoph Goos,Thomas Bäck,Anna V. Kononova
Subjects: Machine Learning (cs.LG)
Comments: Submitted to the Reliability Engineering System Safety journal
Abstract:Truly unsupervised approaches for time series anomaly detection are rare in the literature. Those that exist suffer from a poorly set threshold, which hampers detection performance, while others, despite claiming to be unsupervised, need to be calibrated using a labelled data subset, which is often not available in the real world. This work integrates active learning with an existing unsupervised anomaly detection method by selectively querying the labels of multivariate time series, which are then used to refine the threshold selection process. To achieve this, we introduce a novel query strategy called the dissimilarity-based query strategy (DQS). DQS aims to maximise the diversity of queried samples by evaluating the similarity between anomaly scores using dynamic time warping. We assess the detection performance of DQS in comparison to other query strategies and explore the impact of mislabelling, a topic that is underexplored in the literature. Our findings indicate that DQS performs best in small-budget scenarios, though the others appear to be more robust when faced with mislabelling. Therefore, in the real world, the choice of query strategy depends on the expertise of the oracle and the number of samples they are willing to label. Regardless, all query strategies outperform the unsupervised threshold even in the presence of mislabelling. Thus, whenever it is feasible to query an oracle, employing an active learning-based threshold is recommended.
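To make the idea of a dissimilarity-based query concrete, here is a minimal sketch of picking diverse anomaly-score series under a labelling budget. The abstract only states that DQS compares anomaly scores with dynamic time warping to maximise diversity; the greedy farthest-point rule and the function names `dtw_distance` and `select_diverse` are assumptions made for illustration and may differ from the paper's actual strategy.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

def select_diverse(score_series, budget):
    """Greedy farthest-point selection (assumed stand-in for DQS)."""
    k = len(score_series)
    dist = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            dist[i, j] = dist[j, i] = dtw_distance(score_series[i], score_series[j])
    # Start with the series that is, in total, farthest from all others.
    chosen = [int(np.argmax(dist.sum(axis=1)))]
    while len(chosen) < min(budget, k):
        # Distance of every series to its closest already-chosen series.
        min_to_chosen = dist[:, chosen].min(axis=1)
        min_to_chosen[chosen] = -1.0  # never pick the same series twice
        chosen.append(int(np.argmax(min_to_chosen)))
    return chosen

series = [[0, 0, 0], [0, 0, 0.1], [5, 5, 5], [0, 1, 0]]
picked = select_diverse(series, budget=2)
```

The returned indices would be the series sent to the oracle for labelling; how those labels then refine the detection threshold is described in the paper and not sketched here.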
[LG-68] Audits Under Resource Data and Access Constraints: Scaling Laws For Less Discriminatory Alternatives
Link: https://arxiv.org/abs/2509.05627
Authors: Sarah H. Cen,Salil Goyal,Zaynah Javed,Ananya Karthik,Percy Liang,Daniel E. Ho
Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 34 pages, 13 figures
Abstract:AI audits play a critical role in AI accountability and safety. One branch of the law for which AI audits are particularly salient is anti-discrimination law. Several areas of anti-discrimination law implicate the “less discriminatory alternative” (LDA) requirement, in which a protocol (e.g., model) is defensible if no less discriminatory protocol that achieves comparable performance can be found with a reasonable amount of effort. Notably, the burden of proving an LDA exists typically falls on the claimant (the party alleging discrimination). This creates a significant hurdle in AI cases, as the claimant would seemingly need to train a less discriminatory yet high-performing model, a task requiring resources and expertise beyond most litigants. Moreover, developers often shield information about and access to their model and training data as trade secrets, making it difficult to reproduce a similar model from scratch. In this work, we present a procedure enabling claimants to determine if an LDA exists, even when they have limited compute, data, information, and model access. We focus on the setting in which fairness is given by demographic parity and performance by binary cross-entropy loss. As our main result, we provide a novel closed-form upper bound for the loss-fairness Pareto frontier (PF). We show how the claimant can use it to fit a PF in the “low-resource regime,” then extrapolate the PF that applies to the (large) model being contested, all without training a single large model. The expression thus serves as a scaling law for loss-fairness PFs. To use this scaling law, the claimant would require a small subsample of the train/test data. Then, the claimant can fit the context-specific PF by training as few as 7 (small) models. We stress test our main result in simulations, finding that our scaling law holds even when the exact conditions of our theory do not. 
[LG-69] Systematic Evaluation of Multi-modal Approaches to Complex Player Profile Classification
Link: https://arxiv.org/abs/2509.05624
Authors: Jason Starace,Terence Soule
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Abstract:Modern adaptive games require nuanced player understanding, yet most models use simplified 5-10 category taxonomies that fail to capture diversity. Behavioral clustering cannot distinguish players with different motivations who act similarly. We present a systematic evaluation of multi-modal classification at scale, combining behavioral telemetry with semantic context to support 36 player profiles. Using 19,413 gameplay sessions from an AI-controlled text-based RPG, we compared behavioral-only baselines with multi-modal approaches that integrate action sequences and semantic descriptions. Traditional clustering achieved only 10% accuracy for 36-category classification, limited by semantic conflation where opposite actions produced identical features. Our multi-modal LSTM processing action-text pairs improved accuracy to 21%, showing both potential and limits of non-conversational data. Analysis by behavioral complexity revealed that non-neutral profiles reached 42% accuracy (15x above random), while neutral profiles dropped to 25% (9x above random). Identical actions such as “help the merchant” cannot reveal whether a player is neutral or strategically waiting. Without access to reasoning, even multi-modal models struggle, though above-baseline results confirm a meaningful signal. Since prediction beyond 20 categories remains unexplored, our findings establish benchmarks for complex player modeling. Behavioral data alone plateaus near 10% for 36 categories, while multi-modal integration enables 25%. For designers, this shows that personality-based adaptation requires conversational interaction, as predefined choices cannot capture intent. Our evaluation at 36-category scale offers guidance for building adaptive games that better understand their players.
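The "x above random" figures in this abstract can be checked with a few lines of arithmetic, assuming that "random" means uniform guessing over the 36 profiles. The helper name `lift_over_chance` is introduced here for illustration only.

```python
def lift_over_chance(accuracy, n_classes):
    """Ratio of an accuracy to the uniform-guessing accuracy 1 / n_classes."""
    return accuracy * n_classes

N_PROFILES = 36
chance = 1.0 / N_PROFILES                          # roughly 2.8%
non_neutral = lift_over_chance(0.42, N_PROFILES)   # about 15.1
neutral = lift_over_chance(0.25, N_PROFILES)       # exactly 9.0
```

Under that assumption the reported 15x and 9x multipliers match the stated 42% and 25% accuracies.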
[LG-70] Reinforcement Learning with Anticipation: A Hierarchical Approach for Long-Horizon Tasks
Link: https://arxiv.org/abs/2509.05545
Authors: Yang Yu
Subjects: Machine Learning (cs.LG)
Abstract:Solving long-horizon goal-conditioned tasks remains a significant challenge in reinforcement learning (RL). Hierarchical reinforcement learning (HRL) addresses this by decomposing tasks into more manageable sub-tasks, but the automatic discovery of the hierarchy and the joint training of multi-level policies often suffer from instability and can lack theoretical guarantees. In this paper, we introduce Reinforcement Learning with Anticipation (RLA), a principled and potentially scalable framework designed to address these limitations. The RLA agent learns two synergistic models: a low-level, goal-conditioned policy that learns to reach specified subgoals, and a high-level anticipation model that functions as a planner, proposing intermediate subgoals on the optimal path to a final goal. The key feature of RLA is the training of the anticipation model, which is guided by a principle of value geometric consistency, regularized to prevent degenerate solutions. We present proofs that RLA approaches the globally optimal policy under various conditions, establishing a principled and convergent method for hierarchical planning and execution in long-horizon goal-conditioned tasks.
[LG-71] DreamPRM-1.5: Unlocking the Potential of Each Instance for Multimodal Process Reward Model Training
Link: https://arxiv.org/abs/2509.05542
Authors: Qi Cao,Pengtao Xie
Subjects: Machine Learning (cs.LG)
Abstract:Training multimodal process reward models (PRMs) is challenged by distribution shifts and noisy data. We introduce DreamPRM-1.5, an instance-reweighted framework that adaptively adjusts the importance of each training example via bi-level optimization. We design two complementary strategies: Instance Table, effective for smaller datasets, and Instance Net, scalable to larger ones. Integrated into test-time scaling, DreamPRM-1.5 achieves 84.6 accuracy on the MMMU benchmark, surpassing GPT-5.
[LG-72] Self-Aligned Reward: Towards Effective and Efficient Reasoners
Link: https://arxiv.org/abs/2509.05489
Authors: Peixuan Han,Adit Krishnan,Gerald Friedland,Jiaxuan You,Chris Kong
Subjects: Machine Learning (cs.LG)
Abstract:Reinforcement learning with verifiable rewards has significantly advanced reasoning in large language models (LLMs), but such signals remain coarse, offering only binary correctness feedback. This limitation often results in inefficiencies, including overly verbose reasoning and high computational cost, while existing solutions often compromise accuracy. To address this, we introduce self-aligned reward (SAR), a self-guided signal that complements verifiable rewards to encourage both reasoning accuracy and efficiency. SAR is defined as the relative perplexity difference between an answer conditioned on the query and the standalone answer, thereby favoring responses that are concise and query-specific. Quantitative analysis reveals that SAR reliably distinguishes answer quality: concise, correct answers score higher than redundant ones, and partially correct answers score higher than entirely incorrect ones. Evaluation on 4 models across 7 benchmarks shows that integrating SAR with prevalent RL algorithms like PPO and GRPO improves accuracy by 4%, while reducing inference cost by 30%. Further analysis demonstrates that SAR achieves a Pareto-optimal trade-off between correctness and efficiency compared to reward signals based on length or self-confidence. We also show that SAR shortens responses while preserving advanced reasoning behaviors, demonstrating its ability to suppress unnecessary elaboration without losing critical reasoning. These results highlight the promise of self-aligned reward as a fine-grained complement to verifiable rewards, paving the way for more efficient and effective LLM training.
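The verbal definition of SAR can be turned into a small sketch that works on per-token log-probabilities. The abstract only says "relative perplexity difference"; the exact normalisation and sign used below, (ppl(a) - ppl(a|q)) / ppl(a), are assumptions, as are the function names.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities of one sequence."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def self_aligned_reward(logprobs_given_query, logprobs_standalone):
    """Assumed form: relative drop in answer perplexity once the query is given.

    Positive when conditioning on the query makes the answer more predictable,
    which should favour answers that are specific to the query.
    """
    ppl_cond = perplexity(logprobs_given_query)
    ppl_alone = perplexity(logprobs_standalone)
    return (ppl_alone - ppl_cond) / ppl_alone

# Toy numbers: each answer token has probability 0.5 with the query, 0.25 without.
reward = self_aligned_reward([math.log(0.5)] * 2, [math.log(0.25)] * 2)
```

In practice both log-probability lists would come from scoring the same answer tokens with the policy model, once with the query in context and once without.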
[LG-73] Prior Distribution and Model Confidence
Link: https://arxiv.org/abs/2509.05485
Authors: Maksim Kazanskii,Artem Kasianov
Subjects: Machine Learning (cs.LG)
Comments: 10 pages, 4 tables, 5 images
Abstract:This paper investigates the impact of training data distribution on the performance of image classification models. By analyzing the embeddings of the training set, we propose a framework to understand the confidence of model predictions on unseen data without the need for retraining. Our approach filters out low-confidence predictions based on their distance from the training distribution in the embedding space, significantly improving classification accuracy. We demonstrate this on the example of several classification models, showing consistent performance gains across architectures. Furthermore, we show that using multiple embedding models to represent the training data enables a more robust estimation of confidence, as different embeddings capture complementary aspects of the data. Combining these embeddings allows for better detection and exclusion of out-of-distribution samples, resulting in further accuracy improvements. The proposed method is model-agnostic and generalizable, with potential applications beyond computer vision, including domains such as Natural Language Processing where prediction reliability is critical.
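A minimal sketch of the filtering idea, assuming Euclidean nearest-neighbour distance to the training embeddings and a fixed threshold. The abstract does not specify the distance measure or how the cut-off is chosen, so both are illustrative choices here, as is the function name.

```python
import numpy as np

def filter_by_training_distance(train_emb, test_emb, test_pred, threshold):
    """Keep predictions whose embedding lies close to the training set.

    train_emb: (n_train, d) array, test_emb: (n_test, d) array,
    test_pred: (n_test,) array of predicted labels.
    """
    # Pairwise Euclidean distances, shape (n_test, n_train).
    diffs = test_emb[:, None, :] - train_emb[None, :, :]
    nearest = np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)
    keep = nearest <= threshold
    return test_pred[keep], keep

train = np.array([[0.0, 0.0], [1.0, 0.0]])
test = np.array([[0.1, 0.0], [5.0, 5.0]])
preds = np.array([1, 0])
kept_preds, mask = filter_by_training_distance(train, test, preds, threshold=0.5)
```

To combine several embedding models, as the abstract suggests, one simple option would be to run this per model and keep a prediction only if every mask accepts it; that combination rule is again an assumption.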
[LG-74] STL-based Optimization of Biomolecular Neural Networks for Regression and Control
Link: https://arxiv.org/abs/2509.05481
Authors: Eric Palanques-Tost,Hanna Krasowski,Murat Arcak,Ron Weiss,Calin Belta
Subjects: Machine Learning (cs.LG); Molecular Networks (q-bio.MN); Quantitative Methods (q-bio.QM)
Abstract:Biomolecular Neural Networks (BNNs), artificial neural networks with biologically synthesizable architectures, achieve universal function approximation capabilities beyond simple biological circuits. However, training BNNs remains challenging due to the lack of target data. To address this, we propose leveraging Signal Temporal Logic (STL) specifications to define training objectives for BNNs. We build on the quantitative semantics of STL, enabling gradient-based optimization of the BNN weights, and introduce a learning algorithm that enables BNNs to perform regression and control tasks in biological systems. Specifically, we investigate two regression problems in which we train BNNs to act as reporters of dysregulated states, and a feedback control problem in which we train the BNN in closed-loop with a chronic disease model, learning to reduce inflammation while avoiding adverse responses to external infections. Our numerical experiments demonstrate that STL-based learning can solve the investigated regression and control tasks efficiently.
[LG-75] Calibrated Recommendations with Contextual Bandits RECSYS’25
Link: https://arxiv.org/abs/2509.05460
Authors: Diego Feijer,Himan Abdollahpouri,Sanket Gupta,Alexander Clare,Yuxiao Wen,Todd Wasson,Maria Dimakopoulou,Zahra Nazari,Kyle Kretschman,Mounia Lalmas
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR); Machine Learning (stat.ML)
Comments: Accepted at ACM RecSys '25, CONSEQUENCES workshop
Abstract:Spotify’s Home page features a variety of content types, including music, podcasts, and audiobooks. However, historical data is heavily skewed toward music, making it challenging to deliver a balanced and personalized content mix. Moreover, users’ preferences for different content types may vary depending on the time of day, the day of the week, or even the device they use. We propose a calibration method that leverages contextual bandits to dynamically learn each user’s optimal content type distribution based on their context and preferences. Unlike traditional calibration methods that rely on historical averages, our approach boosts engagement by adapting to how users’ interests in different content types vary across contexts. Both offline and online results demonstrate improved precision and user engagement with the Spotify Home page, in particular with under-represented content types such as podcasts.
[LG-76] Distributed Link Sparsification for Scalable Scheduling Using Graph Neural Networks (Journal Version) ICASSP2022
Link: https://arxiv.org/abs/2509.05447
Authors: Zhongyuan Zhao,Gunjan Verma,Ananthram Swami,Santiago Segarra
Subjects: Networking and Internet Architecture (cs.NI); Discrete Mathematics (cs.DM); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 15 pages, 18 figures, accepted to IEEE Transactions on Wireless Communications. This is the extended journal version of the conference paper arXiv:2203.14339 (Z. Zhao, A. Swami and S. Segarra, “Distributed Link Sparsification for Scalable Scheduling using Graph Neural Networks,” IEEE ICASSP 2022, pp. 5308-5312, doi: https://doi.org/10.1109/ICASSP43922.2022.9747437 )
Abstract:In wireless networks characterized by dense connectivity, the significant signaling overhead generated by distributed link scheduling algorithms can exacerbate issues like congestion, energy consumption, and radio footprint expansion. To mitigate these challenges, we propose a distributed link sparsification scheme employing graph neural networks (GNNs) to reduce scheduling overhead for delay-tolerant traffic while maintaining network capacity. A GNN module is trained to adjust contention thresholds for individual links based on traffic statistics and network topology, enabling links to withdraw from scheduling contention when they are unlikely to succeed. Our approach is facilitated by a novel offline constrained unsupervised learning algorithm capable of balancing two competing objectives: minimizing scheduling overhead while ensuring that total utility meets the required level. In simulated wireless multi-hop networks with up to 500 links, our link sparsification technique effectively alleviates network congestion and reduces radio footprints across four distinct distributed link scheduling protocols.
[LG-77] Safeguarding Graph Neural Networks against Topology Inference Attacks CCS’25
Link: https://arxiv.org/abs/2509.05429
Authors: Jie Fu,Hong Yuan,Zhili Chen,Wendy Hui Wang
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: Accepted by ACM CCS’25
Abstract:Graph Neural Networks (GNNs) have emerged as powerful models for learning from graph-structured data. However, their widespread adoption has raised serious privacy concerns. While prior research has primarily focused on edge-level privacy, a critical yet underexplored threat lies in topology privacy - the confidentiality of the graph’s overall structure. In this work, we present a comprehensive study on topology privacy risks in GNNs, revealing their vulnerability to graph-level inference attacks. To this end, we propose a suite of Topology Inference Attacks (TIAs) that can reconstruct the structure of a target training graph using only black-box access to a GNN model. Our findings show that GNNs are highly susceptible to these attacks, and that existing edge-level differential privacy mechanisms are insufficient as they either fail to mitigate the risk or severely compromise model accuracy. To address this challenge, we introduce Private Graph Reconstruction (PGR), a novel defense framework designed to protect topology privacy while maintaining model accuracy. PGR is formulated as a bi-level optimization problem, where a synthetic training graph is iteratively generated using meta-gradients, and the GNN model is concurrently updated based on the evolving graph. Extensive experiments demonstrate that PGR significantly reduces topology leakage with minimal impact on model accuracy. Our code is anonymously available at this https URL.
[LG-78] Unmasking COVID-19 Vulnerability in Nigeria: Mapping Risks Beyond Urban Hotspots NEURIPS2025
Link: https://arxiv.org/abs/2509.05398
Authors: Sheila Wafula,Blessed Madukoma
Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 8 pages, 6 figures. Submission to NeurIPS 2025 in preparation
Abstract:The COVID-19 pandemic has posed significant challenges to Nigeria’s public health systems since the first case was reported on February 27, 2020. This study investigates key factors that contribute to state vulnerability, quantifying them through a composite risk score integrating population density (weight 0.2), poverty (0.4), access to healthcare (0.3), and age risk (0.1), adjusted by normalized case rates per 100,000. States were categorized into low-, medium-, and high-density areas to analyze trends and identify hotspots using geographic information system (GIS) mapping. The findings reveal that high-density urban areas, such as Lagos, accounting for 35.4% of national cases, had the highest risk scores (Lagos: 673.47 vs. national average: 28.16). These results align with global and local studies on the spatial variability of COVID-19 in Nigeria, including international frameworks such as the CDC Social Vulnerability Index. Google Trends data highlight variations in public health awareness, serving as a supplementary analysis to contextualize vulnerability. The risk score provides a prioritization tool for policymakers to allocate testing, vaccines, and healthcare resources to high-risk areas, though data gaps and rural underreporting call for further research. This framework can extend to other infectious diseases, offering lessons for future pandemics in resource-limited settings.
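The weighted score described in this abstract can be sketched as below. The four weights are taken from the abstract; everything else is an assumption: that each indicator is expressed so that larger values mean higher risk, that the case-rate adjustment is a simple multiplication, and the names `WEIGHTS` and `composite_risk`. The paper's actual scaling evidently differs, given the reported Lagos score of 673.47.

```python
# Weights as stated in the abstract; they sum to 1.
WEIGHTS = {
    "population_density": 0.2,
    "poverty": 0.4,
    "healthcare_access": 0.3,  # assumed to be encoded so that higher means riskier
    "age_risk": 0.1,
}

def composite_risk(indicators, normalized_case_rate):
    """Weighted sum of indicators, scaled by a normalized case rate (assumed form)."""
    base = sum(WEIGHTS[name] * indicators[name] for name in WEIGHTS)
    return base * normalized_case_rate

# Hypothetical state with made-up indicator values, for illustration only.
example_state = {
    "population_density": 0.9,
    "poverty": 0.3,
    "healthcare_access": 0.5,
    "age_risk": 0.2,
}
score = composite_risk(example_state, normalized_case_rate=2.0)
```

Ranking states by such a score is what would give policymakers the prioritization list mentioned in the abstract.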
[LG-79] RoboBallet: Planning for Multi-Robot Reaching with Graph Neural Networks and Reinforcement Learning
Link: https://arxiv.org/abs/2509.05397
Authors: Matthew Lai,Keegan Go,Zhibin Li,Torsten Kroger,Stefan Schaal,Kelsey Allen,Jonathan Scholz
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: Published in Science Robotics
Abstract:Modern robotic manufacturing requires collision-free coordination of multiple robots to complete numerous tasks in shared, obstacle-rich workspaces. Although individual tasks may be simple in isolation, automated joint task allocation, scheduling, and motion planning under spatio-temporal constraints remain computationally intractable for classical methods at real-world scales. Existing multi-arm systems deployed in the industry rely on human intuition and experience to design feasible trajectories manually in a labor-intensive process. To address this challenge, we propose a reinforcement learning (RL) framework to achieve automated task and motion planning, tested in an obstacle-rich environment with eight robots performing 40 reaching tasks in a shared workspace, where any robot can perform any task in any order. Our approach builds on a graph neural network (GNN) policy trained via RL on procedurally-generated environments with diverse obstacle layouts, robot configurations, and task distributions. It employs a graph representation of scenes and a graph policy neural network trained through reinforcement learning to generate trajectories of multiple robots, jointly solving the sub-problems of task allocation, scheduling, and motion planning. Trained on large randomly generated task sets in simulation, our policy generalizes zero-shot to unseen settings with varying robot placements, obstacle geometries, and task poses. We further demonstrate that the high-speed capability of our solution enables its use in workcell layout optimization, improving solution times. The speed and scalability of our planner also open the door to new capabilities such as fault-tolerant planning and online perception-based re-planning, where rapid adaptation to dynamic task sets is required.
[LG-80] Ensembling Membership Inference Attacks Against Tabular Generative Models
Link: https://arxiv.org/abs/2509.05350
Authors: Joshua Ward,Yuxuan Yang,Chi-Hua Wang,Guang Cheng
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Abstract:Membership Inference Attacks (MIAs) have emerged as a principled framework for auditing the privacy of synthetic data generated by tabular generative models, where many diverse methods have been proposed that each exploit different privacy leakage signals. However, in realistic threat scenarios, an adversary must choose a single method without a priori guarantee that it will be the empirically highest performing option. We study this challenge as a decision theoretic problem under uncertainty and conduct the largest synthetic data privacy benchmark to date. Here, we find that no MIA constitutes a strictly dominant strategy across a wide variety of model architectures and dataset domains under our threat model. Motivated by these findings, we propose ensemble MIAs and show that unsupervised ensembles built on individual attacks offer empirically more robust, regret-minimizing strategies than individual attacks.
[LG-81] Privacy-Preserving Offloading for Large Language Models in 6G Vehicular Networks
Link: https://arxiv.org/abs/2509.05320
Authors: Ikhlasse Badidi,Nouhaila El Khiyaoui,Aya Riany,Badr Ben Elallid,Amine Abouaomar
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 7 pages, 6 figures, 1 algorithm, 5 equations
Abstract:The integration of Large Language Models (LLMs) in 6G vehicular networks promises unprecedented advancements in intelligent transportation systems. However, offloading LLM computations from vehicles to edge infrastructure poses significant privacy risks, potentially exposing sensitive user data. This paper presents a novel privacy-preserving offloading framework for LLM-integrated vehicular networks. We introduce a hybrid approach combining federated learning (FL) and differential privacy (DP) techniques to protect user data while maintaining LLM performance. Our framework includes a privacy-aware task partitioning algorithm that optimizes the trade-off between local and edge computation, considering both privacy constraints and system efficiency. We also propose a secure communication protocol for transmitting model updates and aggregating results across the network. Experimental results demonstrate that our approach achieves 75% global accuracy with only a 2-3% reduction compared to non-privacy-preserving methods, while maintaining DP guarantees with an optimal privacy budget of $\varepsilon = 0.8$. The framework shows stable communication overhead of approximately 2.1MB per round with computation comprising over 90% of total processing time, validating its efficiency for resource-constrained vehicular environments.
[LG-82] Data-driven solar forecasting enables near-optimal economic decisions
Link: https://arxiv.org/abs/2509.06925
Authors: Zhixiang Dai,Minghao Yin,Xuanhong Chen,Alberto Carpentieri,Jussi Leinonen,Boris Bonev,Chengzhe Zhong,Thorsten Kurth,Jingan Sun,Ram Cherukuri,Yuzhou Zhang,Ruihua Zhang,Farah Hariri,Xiaodong Ding,Chuanxiang Zhu,Dake Zhang,Yaodan Cui,Yuxi Lu,Yue Song,Bin He,Jie Chen,Yixin Zhu,Chenheng Xu,Maofeng Liu,Zeyi Niu,Wanpeng Qi,Xu Shan,Siyuan Xian,Ning Lin,Kairui Feng
Subjects: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
Comments: Main text ~12 pages, 4 figures, 0 tables
Abstract:Solar energy adoption is critical to achieving net-zero emissions. However, it remains difficult for many industrial and commercial actors to decide whether they should adopt distributed solar-battery systems, largely because fast, low-cost, and high-resolution irradiance forecasts are unavailable. Here, we present SunCastNet, a lightweight data-driven forecasting system that provides $0.05^\circ$, 10-minute resolution predictions of surface solar radiation downwards (SSRD) up to 7 days ahead. SunCastNet, coupled with reinforcement learning (RL) for battery scheduling, reduces operational regret by 76–93% compared to robust decision making (RDM). In 25-year investment backtests, it enables up to five of ten high-emitting industrial sectors per region to cross the commercial viability threshold of 12% Internal Rate of Return (IRR). These results show that high-resolution, long-horizon solar forecasts can directly translate into measurable economic gains, supporting near-optimal energy operations and accelerating renewable deployment.
[LG-83] Learning from one graph: transductive learning guarantees via the geometry of small random worlds
Link: https://arxiv.org/abs/2509.06894
Authors: Nils Detering,Luca Galimberti,Anastasis Kratsios,Giulia Livieri,A. Martina Neuman
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Metric Geometry (math.MG); Probability (math.PR); Statistics Theory (math.ST)
Abstract:Since their introduction by Kipf and Welling in 2017, a primary use of graph convolutional networks has been transductive node classification, where missing labels are inferred within a single observed graph and its feature matrix. Despite the widespread use of the network model, the statistical foundations of transductive learning remain limited, as standard inference frameworks typically rely on multiple independent samples rather than a single graph. In this work, we address these gaps by developing new concentration-of-measure tools that leverage the geometric regularities of large graphs via low-dimensional metric embeddings. The emergent regularities are captured using a random graph model; however, the methods remain applicable to deterministic graphs once observed. We establish two principal learning results. The first concerns arbitrary deterministic $k$-vertex graphs, and the second addresses random graphs that share key geometric properties with an Erdős-Rényi graph $\mathbf{G}=\mathbf{G}(k,p)$ in the regime $p \in \mathcal{O}((\log(k)/k)^{1/2})$. The first result serves as the basis for and illuminates the second. We then extend these results to the graph convolutional network setting, where additional challenges arise. Lastly, our learning guarantees remain informative even with a few labelled nodes $N$ and achieve the optimal nonparametric rate $\mathcal{O}(N^{-1/2})$ as $N$ grows.
[LG-84] Learning spatially structured open quantum dynamics with regional-attention transformers
链接: https://arxiv.org/abs/2509.06871
作者: Dounan Du,Eden Figueroa
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Atomic Physics (physics.atom-ph)
*备注: 25 pages, 5 figures
Abstract:Simulating the dynamics of open quantum systems with spatial structure and external control is an important challenge in quantum information science. Classical numerical solvers for such systems require integrating coupled master and field equations, which is computationally demanding for simulation and optimization tasks and often precludes real-time use in network-scale simulations or feedback control. We introduce a regional attention-based neural architecture that learns the spatiotemporal dynamics of structured open quantum systems. The model incorporates translational invariance of physical laws as an inductive bias to achieve scalable complexity, and supports conditioning on time-dependent global control parameters. We demonstrate learning on two representative systems: a driven dissipative single qubit and an electromagnetically induced transparency (EIT) quantum memory. The model achieves high predictive fidelity under both in-distribution and out-of-distribution control protocols, and provides substantial acceleration up to three orders of magnitude over numerical solvers. These results demonstrate that the architecture establishes a general surrogate modeling framework for spatially structured open quantum dynamics, with immediate relevance to large-scale quantum network simulation, quantum repeater and protocol design, real-time experimental optimization, and scalable device modeling across diverse light-matter platforms.
[LG-85] Sequential Least-Squares Estimators with Fast Randomized Sketching for Linear Statistical Models
链接: https://arxiv.org/abs/2509.06856
作者: Guan-Yu Chen,Xi Yang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:We propose a novel randomized framework for the estimation problem of large-scale linear statistical models, namely Sequential Least-Squares Estimators with Fast Randomized Sketching (SLSE-FRS), which integrates Sketch-and-Solve and Iterative-Sketching methods for the first time. By iteratively constructing and solving sketched least-squares (LS) subproblems with increasing sketch sizes to achieve better precisions, SLSE-FRS gradually refines the estimators of the true parameter vector, ultimately producing high-precision estimators. We analyze the convergence properties of SLSE-FRS, and provide its efficient implementation. Numerical experiments show that SLSE-FRS outperforms the state-of-the-art methods, namely the Preconditioned Conjugate Gradient (PCG) method, and the Iterative Double Sketching (IDS) method.
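For orientation, the two building blocks named in the abstract above can be written in their generic textbook forms. This is a sketch under assumed notation ($A \in \mathbb{R}^{n \times d}$, $b \in \mathbb{R}^{n}$, sketching matrices $S, S_t \in \mathbb{R}^{s \times n}$ with $s \ll n$); it is not the SLSE-FRS algorithm itself, whose details the abstract does not give.

```latex
% Sketch-and-Solve: solve one compressed least-squares problem
\hat{x}_{\mathrm{SS}} = \arg\min_{x \in \mathbb{R}^{d}} \; \lVert S (A x - b) \rVert_2^2

% Iterative sketching: sketch only the quadratic term, keep the exact gradient
x_{t+1} = x_t + \arg\min_{u \in \mathbb{R}^{d}}
  \left\{ \tfrac{1}{2} \lVert S_t A u \rVert_2^2 - \langle A^{\top} (b - A x_t),\, u \rangle \right\}
```

The first form trades accuracy for a single cheap solve, while the second refines an estimate over several rounds; the abstract describes combining the two with increasing sketch sizes.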
[LG-86] Green Learning for STAR-RIS mmWave Systems with Implicit CSI
链接: https://arxiv.org/abs/2509.06820
作者: Yu-Hsiang Huang,Po-Heng Chou,Wan-Jen Huang,Walid Saad,C.-C. Jay Kuo
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 6 pages, 4 figures, 2 tables, accepted by 2025 IEEE Globecom
Abstract:In this paper, a green learning (GL)-based precoding framework is proposed for simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS)-aided millimeter-wave (mmWave) MIMO broadcasting systems. Motivated by the growing emphasis on environmental sustainability in future 6G networks, this work adopts a broadcasting transmission architecture for scenarios where multiple users share identical information, improving spectral efficiency and reducing redundant transmissions and power consumption. Different from conventional optimization methods, such as block coordinate descent (BCD) that require perfect channel state information (CSI) and iterative computation, the proposed GL framework operates directly on received uplink pilot signals without explicit CSI estimation. Unlike deep learning (DL) approaches that require CSI-based labels for training, the proposed GL approach also avoids deep neural networks and backpropagation, leading to a more lightweight design. Although the proposed GL framework is trained with supervision generated by BCD under full CSI, inference is performed in a fully CSI-free manner. The proposed GL integrates subspace approximation with adjusted bias (Saab), relevant feature test (RFT)-based supervised feature selection, and eXtreme gradient boosting (XGBoost)-based decision learning to jointly predict the STAR-RIS coefficients and transmit precoder. Simulation results show that the proposed GL approach achieves competitive spectral efficiency compared to BCD and DL-based models, while reducing floating-point operations (FLOPs) by over four orders of magnitude. These advantages make the proposed GL approach highly suitable for real-time deployment in energy- and hardware-constrained broadcasting scenarios.
[LG-87] Reward function compression facilitates goal-dependent reinforcement learning
链接: https://arxiv.org/abs/2509.06810
作者: Gaia Molinaro,Anne G. E. Collins
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement learning agents learn from rewards, but humans can uniquely assign value to novel, abstract outcomes in a goal-dependent manner. However, this flexibility is cognitively costly, making learning less efficient. Here, we propose that goal-dependent learning is initially supported by a capacity-limited working memory system. With consistent experience, learners create a “compressed” reward function (a simplified rule defining the goal) which is then transferred to long-term memory and applied automatically upon receiving feedback. This process frees up working memory resources, boosting learning efficiency. We test this theory across six experiments. Consistent with our predictions, our findings demonstrate that learning is parametrically impaired by the size of the goal space, but improves when the goal space structure allows for compression. We also find faster reward processing to correlate with better learning performance, supporting the idea that as goal valuation becomes more automatic, more resources are available for learning. We leverage computational modeling to support this interpretation. Our work suggests that efficient goal-directed learning relies on compressing complex goal information into a stable reward function, shedding light on the cognitive mechanisms of human motivation. These findings generate new insights into the neuroscience of intrinsic motivation and could help improve behavioral techniques that support people in achieving their goals.
[LG-88] Neural ARFIMA model for forecasting BRIC exchange rates with long memory under oil shocks and policy uncertainties
链接: https://arxiv.org/abs/2509.06697
作者: Tanujit Chakraborty,Donia Besher,Madhurima Panja,Shovon Sengupta
类目: Econometrics (econ.EM); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注:
Abstract:Accurate forecasting of exchange rates remains a persistent challenge, particularly for emerging economies such as Brazil, Russia, India, and China (BRIC). These series exhibit long memory, nonlinearity, and non-stationarity properties that conventional time series models struggle to capture. Additionally, there exist several key drivers of exchange rate dynamics, including global economic policy uncertainty, US equity market volatility, US monetary policy uncertainty, oil price growth rates, and country-specific short-term interest rate differentials. These empirical complexities underscore the need for a flexible modeling framework that can jointly accommodate long memory, nonlinearity, and the influence of external drivers. To address these challenges, we propose a Neural AutoRegressive Fractionally Integrated Moving Average (NARFIMA) model that combines the long-memory representation of ARFIMA with the nonlinear learning capacity of neural networks, while flexibly incorporating exogenous causal variables. We establish theoretical properties of the model, including asymptotic stationarity of the NARFIMA process using Markov chains and nonlinear time series techniques. We quantify forecast uncertainty using conformal prediction intervals within the NARFIMA framework. Empirical results across six forecast horizons show that NARFIMA consistently outperforms various state-of-the-art statistical and machine learning models in forecasting BRIC exchange rates. These findings provide new insights for policymakers and market participants navigating volatile financial conditions. The narfima R package provides an implementation of our approach.
[LG-89] Automated Hierarchical Graph Construction for Multi-source Electronic Health Records
链接: https://arxiv.org/abs/2509.06576
作者: Yinjie Wang,Doudou Zhou,Yue Liu,Junwei Lu,Tianxi Cai
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Electronic Health Records (EHRs), comprising diverse clinical data such as diagnoses, medications, and laboratory results, hold great promise for translational research. EHR-derived data have advanced disease prevention, improved clinical trial recruitment, and generated real-world evidence. Synthesizing EHRs across institutions enables large-scale, generalizable studies that capture rare diseases and population diversity, but remains hindered by the heterogeneity of medical codes, institution-specific terminologies, and the absence of standardized data structures. These barriers limit the interpretability, comparability, and scalability of EHR-based analyses, underscoring the need for robust methods to harmonize and extract meaningful insights from distributed, heterogeneous data. To address this, we propose MASH (Multi-source Automated Structured Hierarchy), a fully automated framework that aligns medical codes across institutions using neural optimal transport and constructs hierarchical graphs with learned hyperbolic embeddings. During training, MASH integrates information from pre-trained language models, co-occurrence patterns, textual descriptions, and supervised labels to capture semantic and hierarchical relationships among medical concepts more effectively. Applied to real-world EHR data, including diagnosis, medication, and laboratory codes, MASH produces interpretable hierarchical graphs that facilitate the navigation and understanding of heterogeneous clinical data. Notably, it generates the first automated hierarchies for unstructured local laboratory codes, establishing foundational references for downstream applications.
[LG-90] Robust and Adaptive Spectral Method for Representation Multi-Task Learning with Contamination
链接: https://arxiv.org/abs/2509.06575
作者: Yian Huang,Yang Feng,Zhiliang Ying
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:Representation-based multi-task learning (MTL) improves efficiency by learning a shared structure across tasks, but its practical application is often hindered by contamination, outliers, or adversarial tasks. Most existing methods and theories assume a clean or near-clean setting, failing when contamination is significant. This paper tackles representation MTL with an unknown and potentially large contamination proportion, while also allowing for heterogeneity among inlier tasks. We introduce a Robust and Adaptive Spectral method (RAS) that can distill the shared inlier representation effectively and efficiently, while requiring no prior knowledge of the contamination level or the true representation dimension. Theoretically, we provide non-asymptotic error bounds for both the learned representation and the per-task parameters. These bounds adapt to inlier task similarity and outlier structure, and guarantee that RAS performs at least as well as single-task learning, thus preventing negative transfer. We also extend our framework to transfer learning with corresponding theoretical guarantees for the target task. Extensive experiments confirm our theory, showcasing the robustness and adaptivity of RAS, and its superior performance in regimes with up to 80% task contamination.
[LG-91] Topological Regularization for Force Prediction in Active Particle Suspension with EGNN and Persistent Homology
链接: https://arxiv.org/abs/2509.06574
作者: Sadra Saremi,Amirhossein Ahmadkhan Kordbacheh
类目: Soft Condensed Matter (cond-mat.soft); Machine Learning (cs.LG)
*备注:
Abstract:Capturing the dynamics of active particles, i.e., small self-propelled agents that both deform and are deformed by the fluid in which they move, is a formidable problem, as it requires coupling fine-scale hydrodynamics with large-scale collective effects. We therefore present a multi-scale framework that combines three learning-driven tools within one pipeline. First, we use high-resolution Lattice Boltzmann snapshots of fluid velocity and particle stresses in a periodic box as input to the learning pipeline. Second, an E(2)-equivariant graph neural network, which respects the symmetries of the flat plane by construction, takes the morphology, positions, and orientations of the particles and predicts the pairwise interaction forces between them. Third, a physics-informed neural network further updates these local estimates by summing over them together with the stress data, using Fourier feature mappings and residual blocks; it is additionally regularized with a topological term (introduced via persistent homology) that penalizes unrealistically tangled or spurious connections. In concert, these stages deliver a holistic, highly data-driven prediction of the full force network, emphasizing the physical underpinnings together with the emerging multi-scale structure typical of active matter.
[LG-92] Robustness and accuracy of mean opinion scores with hard and soft outlier detection
链接: https://arxiv.org/abs/2509.06554
作者: Dietmar Saupe,Tim Bleile
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: Accepted for 17th International Conference on Quality of Multimedia Experience (QoMEX’25), September 2025, Madrid, Spain
Abstract:In subjective assessment of image and video quality, observers rate or compare selected stimuli. Before calculating the mean opinion scores (MOS) for these stimuli from the ratings, it is recommended to identify and deal with outliers that may have given unreliable ratings. Several methods are available for this purpose, some of which have been standardized. These methods are typically based on statistics and sometimes tested by introducing synthetic ratings from artificial outliers, such as random clickers. However, a reliable and comprehensive approach is lacking for comparative performance analysis of outlier detection methods. To fill this gap, this work proposes and applies an empirical worst-case analysis as a general solution. Our method involves evolutionary optimization of an adversarial black-box attack on outlier detection algorithms, where the adversary maximizes the distortion of scale values with respect to ground truth. We apply our analysis to several hard and soft outlier detection methods for absolute category ratings and show their differing performance in this stress test. In addition, we propose two new outlier detection methods with low complexity and excellent worst-case performance. Software for adversarial attacks and data analysis is available.
[LG-93] Minimax optimal transfer learning for high-dimensional additive regression
链接: https://arxiv.org/abs/2509.06308
作者: Seung Hyun Moon
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: This is a draft version of the paper. All responsibilities are assigned to the first author
Abstract:This paper studies high-dimensional additive regression under the transfer learning framework, where one observes samples from a target population together with auxiliary samples from different but potentially related regression models. We first introduce a target-only estimation procedure based on the smooth backfitting estimator with local linear smoothing. In contrast to previous work, we establish general error bounds under sub-Weibull($\alpha$) noise, thereby accommodating heavy-tailed error distributions. In the sub-exponential case ($\alpha = 1$), we show that the estimator attains the minimax lower bound under regularity conditions, which requires a substantial departure from existing proof strategies. We then develop a novel two-stage estimation method within a transfer learning framework, and provide theoretical guarantees at both the population and empirical levels. Error bounds are derived for each stage under general tail conditions, and we further demonstrate that the minimax optimal rate is achieved when the auxiliary and target distributions are sufficiently close. All theoretical results are supported by simulation studies and real data analysis.
[LG-94] MOSAIC: Minimax-Optimal Sparsity-Adaptive Inference for Change Points in Dynamic Networks
链接: https://arxiv.org/abs/2509.06303
作者: Yingying Fan,Jingyuan Liu,Jinchi Lv,Ao Sun
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 110 pages, 4 figures
Abstract:We propose a new inference framework, named MOSAIC, for change-point detection in dynamic networks with the simultaneous low-rank and sparse-change structure. We establish the minimax rate of detection boundary, which relies on the sparsity of changes. We then develop an eigen-decomposition-based test with screened signals that approaches the minimax rate in theory, with only a minor logarithmic loss. For practical implementation of MOSAIC, we adjust the theoretical test by a novel residual-based technique, resulting in a pivotal statistic that converges to a standard normal distribution via the martingale central limit theorem under the null hypothesis and achieves full power under the alternative hypothesis. We also analyze the minimax rate of testing boundary for dynamic networks without the low-rank structure, which almost aligns with the results in high-dimensional mean-vector change-point inference. We showcase the effectiveness of MOSAIC and verify our theoretical results with several simulation examples and a real data application.
[LG-95] Repeating vs. Non-Repeating FRBs: A Deep Learning Approach To Morphological Characterization
链接: https://arxiv.org/abs/2509.06208
作者: Bikash Kharel,Emmanuel Fonseca,Charanjot Brar,Afrokk Khan,Lluis Mas-Ribas,Swarali Shivraj Patil,Paul Scholz,Seth Robert Siegel,David C. Stenning
类目: High Energy Astrophysical Phenomena (astro-ph.HE); Machine Learning (cs.LG)
*备注: 26 pages, 17 figures, submitted to ApJ
Abstract:We present a deep learning approach to classify fast radio bursts (FRBs) based purely on morphology, as encoded in the recorded dynamic spectra from CHIME/FRB Catalog 2. We implemented transfer learning with a pretrained ConvNext architecture, exploiting its powerful feature extraction ability. ConvNext was adapted to classify dedispersed dynamic spectra (which we treat as images) of the FRBs into one of two sub-classes, i.e., repeater and non-repeater, based on their various temporal and spectral properties and the relation between the sub-pulse structures. Additionally, we used a mathematical model representation of the total intensity data to interpret the deep learning model. Upon fine-tuning the pretrained ConvNext on the FRB spectrograms, we were able to achieve high classification metrics while substantially reducing training time and computing power as compared to training a deep learning model from scratch with random weights and biases without any feature extraction ability. Importantly, our results suggest that the morphological differences between CHIME repeating and non-repeating events persist in Catalog 2 and the deep learning model leveraged these differences for classification. The fine-tuned deep learning model can be used for inference, which enables us to predict whether an FRB’s morphology resembles that of repeaters or non-repeaters. Such inferences may become increasingly significant when trained on larger data sets that will exist in the near future.
[LG-96] Robust Analysis for Resilient AI System
链接: https://arxiv.org/abs/2509.06172
作者: Yu Wang,Ran Jin,Lulu Kang
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注: 10 pages, 3 figures
Abstract:Operational hazards in Manufacturing Industrial Internet (MII) systems generate severe data outliers that cripple traditional statistical analysis. This paper proposes a novel robust regression method, DPD-Lasso, which integrates Density Power Divergence with Lasso regularization to analyze contaminated data from AI resilience experiments. We develop an efficient iterative algorithm to overcome previous computational bottlenecks. Applied to an MII testbed for Aerosol Jet Printing, DPD-Lasso provides reliable, stable performance on both clean and outlier-contaminated data, accurately quantifying hazard impacts. This work establishes robust regression as an essential tool for developing and validating resilient industrial AI systems.
[LG-97] Additive Distributionally Robust Ranking and Selection
链接: https://arxiv.org/abs/2509.06147
作者: Zaile Li,Yuchen Wan,L. Jeff Hong
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: Due to the 1,920-character limit imposed on the abstract field, the abstract presented here is a truncated version of the full abstract provided in the PDF. The only omitted sentence is: We also prove the additivity and consistency for GAA procedures
Abstract:Ranking and selection (RS) aims to identify the alternative with the best mean performance among k simulated alternatives. The practical value of RS depends on accurate simulation input modeling, which often suffers from the curse of input uncertainty due to limited data. Distributionally robust ranking and selection (DRRS) addresses this challenge by modeling input uncertainty via an ambiguity set of m > 1 plausible input distributions, resulting in km scenarios in total. Recent DRRS studies suggest a key structural insight: additivity in budget allocation is essential for efficiency. However, existing justifications are heuristic, and fundamental properties such as consistency and the precise allocation pattern induced by additivity remain poorly understood. In this paper, we propose a simple additive allocation (AA) procedure that aims to exclusively sample the k + m - 1 previously hypothesized critical scenarios. Leveraging boundary-crossing arguments, we establish a lower bound on the probability of correct selection and characterize the procedure’s budget allocation behavior. We then prove that AA is consistent and, surprisingly, achieves additivity in the strongest sense: as the total budget increases, only k + m - 1 scenarios are sampled infinitely often. Notably, the worst-case scenarios of non-best alternatives may not be among them, challenging prior beliefs about their criticality. These results offer new and counterintuitive insights into the additive structure of DRRS. To improve practical performance while preserving this structure, we introduce a general additive allocation (GAA) framework that flexibly incorporates sampling rules from traditional RS procedures in a modular fashion. Numerical experiments support our theoretical findings and demonstrate the competitive performance of the proposed GAA procedures.
[LG-98] Machine learning magnetism from simple global descriptors
链接: https://arxiv.org/abs/2509.05909
作者: Ahmed E. Fahmy
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: Main Text: 9 pages + 10 Figures 3 Supplementary Tables
Abstract:The reliable identification of magnetic ground states remains a major challenge in high-throughput materials databases, where density functional theory (DFT) workflows often converge to ferromagnetic (FM) solutions. Here, we partially address this challenge by developing machine learning classifiers trained on experimentally validated MAGNDATA magnetic materials leveraging a limited number of simple compositional, structural, and electronic descriptors sourced from the Materials Project database. Our propagation vector classifiers achieve accuracies above 92%, outperforming recent studies in reliably distinguishing zero from nonzero propagation vector structures, and exposing a systematic ferromagnetic bias inherent to the Materials Project database for more than 7,843 materials. In parallel, LightGBM and XGBoost models trained directly on the Materials Project labels achieve accuracies of 84-86% (with macro F1 average scores of 63-66%), which proves useful for large-scale screening for magnetic classes, if refined by MAGNDATA-trained classifiers. These results underscore the role of machine learning techniques as corrective and exploratory tools, enabling more trustworthy databases and accelerating progress toward the identification of materials with various properties.
[LG-99] Fisher Random Walk: Automatic Debiasing Contextual Preference Inference for Large Language Model Evaluation
链接: https://arxiv.org/abs/2509.05852
作者: Yichi Zhang,Alexander Belloni,Ethan X. Fang,Junwei Lu,Xiaoan Xu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:Motivated by the need for rigorous and scalable evaluation of large language models, we study contextual preference inference for pairwise comparison functionals of context-dependent preference score functions across domains. Focusing on the contextual Bradley-Terry-Luce model, we develop a semiparametric efficient estimator that automates the debiased estimation through aggregating weighted residual balancing terms across the comparison graph. We show that the efficiency is achieved when the weights are derived from a novel strategy called Fisher random walk. We also propose a computationally feasible method to compute the weights by a potential representation of nuisance weight functions. We show our inference procedure is valid for general score function estimators accommodating the practitioners’ need to implement flexible deep learning methods. We extend the procedure to multiple hypothesis testing using a Gaussian multiplier bootstrap that controls familywise error and to distributional shift via a cross-fitted importance-sampling adjustment for target-domain inference. Numerical studies, including language model evaluations under diverse contexts, corroborate the accuracy, efficiency, and practical utility of our method.
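For readers unfamiliar with the model named above, the contextual Bradley-Terry-Luce (BTL) comparison probability is usually written as follows. The notation is an assumption of this sketch ($\theta_i(x)$ is the preference score of item $i$, e.g. a language model, in context $x$) and need not match the paper's.

```latex
\Pr(i \succ j \mid x)
  = \frac{\exp\{\theta_i(x)\}}{\exp\{\theta_i(x)\} + \exp\{\theta_j(x)\}}
  = \frac{1}{1 + \exp\{-(\theta_i(x) - \theta_j(x))\}}
```

Only score differences $\theta_i(x) - \theta_j(x)$ enter the likelihood, which is why inference targets in this setting are pairwise comparison functionals rather than the scores themselves.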
[LG-100] Volatility Modeling via EWMA-Driven Time-Dependent Hurst Parameters
链接: https://arxiv.org/abs/2509.05820
作者: Jayanth Athipatla
类目: Mathematical Finance (q-fin.MF); Machine Learning (cs.LG)
*备注: 9 pages total
Abstract:We introduce a novel rough Bergomi (rBergomi) model featuring a variance-driven exponentially weighted moving average (EWMA) time-dependent Hurst parameter $H_t$, fundamentally distinct from recent machine learning and wavelet-based approaches in the literature. Our framework pioneers a unified rough differential equation (RDE) formulation grounded in rough path theory, where the Hurst parameter dynamically adapts to evolving volatility regimes through a continuous EWMA mechanism tied to instantaneous variance. Unlike discrete model-switching or computationally intensive forecasting methods, our approach provides mathematical tractability while capturing volatility clustering and roughness bursts. We rigorously establish existence and uniqueness of solutions via rough path theory and derive martingale properties. Empirical validation on diverse asset classes including equities, cryptocurrencies, and commodities demonstrates superior performance in capturing dynamics and out-of-sample pricing accuracy. Our results show significant improvements over traditional constant-Hurst models.
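The abstract above does not state how the EWMA of variance is mapped to the Hurst parameter, so the following is only a rough discrete-time sketch. `ewma_variance` is the standard EWMA recursion on squared returns; `hurst_from_variance` is a purely hypothetical clipped linear map invented here for illustration, and the decay `lam`, the bounds `h_min` and `h_max`, and the sample returns are all assumptions.

```python
def ewma_variance(returns, lam=0.94):
    """Standard EWMA recursion: v_t = lam * v_{t-1} + (1 - lam) * r_t**2."""
    v = returns[0] ** 2  # initialise with the first squared return
    out = [v]
    for r in returns[1:]:
        v = lam * v + (1.0 - lam) * r * r
        out.append(v)
    return out


def hurst_from_variance(v, v_ref, h_min=0.05, h_max=0.45):
    """HYPOTHETICAL map (not from the paper): higher variance -> lower, rougher H."""
    h = h_max - (h_max - h_min) * min(v / v_ref, 1.0)
    return max(h_min, min(h_max, h))


# Made-up return series with a volatility burst at index 3.
returns = [0.001, -0.002, 0.0015, 0.03, -0.025, 0.002, 0.001]
variances = ewma_variance(returns)
v_ref = max(variances)  # reference level used only to normalise the toy map
hurst_path = [hurst_from_variance(v, v_ref) for v in variances]
```

In this toy setup the Hurst value drops when the burst arrives and stays inside the bounds `[h_min, h_max]`, which mimics the idea of roughness adapting to the volatility regime.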
[LG-101] Spectral Methods in Complex Systems
链接: https://arxiv.org/abs/2509.05793
作者: Francesco Caravelli
类目: Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Mathematical Physics (math-ph)
*备注: Expanded and cleaned notes. Based on lectures given at the online school on spectral methods in complex systems (2019); 467 pages. Comments welcome
Abstract:These notes offer a unified introduction to spectral methods for the study of complex systems. They are intended as an operative manual rather than a theorem-proof textbook: the emphasis is on tools, identities, and perspectives that can be readily applied across disciplines. Beginning with a compendium of matrix identities and inversion techniques, the text develops the connections between spectra, dynamics, and structure in finite-dimensional systems. Applications range from dynamical stability and random walks on networks to input-output economics, PageRank, epidemic spreading, memristive circuits, synchronization phenomena, and financial stability. Throughout, the guiding principle is that eigenvalues, eigenvectors, and resolvent operators provide a common language linking problems in physics, mathematics, computer science, and beyond. The presentation is informal, accessible to advanced undergraduates, yet broad enough to serve as a reference for researchers interested in spectral approaches to complex systems.
[LG-102] Vector-based loss functions for turbulent flow field inpainting
链接: https://arxiv.org/abs/2509.05787
作者: Samuel J. Baker,Shubham Goswami,Xiaohang Fang,Felix C. P. Leach
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注:
Abstract:When developing scientific machine learning (ML) approaches, it is often beneficial to embed knowledge of the physical system in question into the training process. One way to achieve this is by leveraging the specific characteristics of the data at hand. In the case of turbulent flows, fluid velocities can be measured and recorded as multi-component vectors at discrete points in space, using techniques such as particle image velocimetry (PIV) or computational fluid dynamics (CFD). However, the vectorised nature of the data is ignored by standard ML approaches, as widely-used loss functions such as the mean-square error treat each component of a velocity vector in isolation. Therefore, the aim of this work is to better preserve the physical characteristics of the data by introducing loss functions that utilise vector similarity metrics. To this end, vector-based loss functions are developed here and implemented alongside a U-Net model for a turbulent flow field inpainting problem, amounting to the prediction of velocity vectors inside large gaps in PIV images. The intention is for the inpainting task to pose a significant challenge for the ML models in order to shed light on their capabilities. The test case uses PIV data from the highly turbulent flow in the well-known Transparent Combustion Chamber III (TCC-III) engine. Loss functions based on the cosine similarity and vector magnitude differences are proposed; the results show that the vector-based loss functions lead to significantly improved predictions of multi-scale flow patterns, while a hybrid (vector and mean-square error) loss function enables a good compromise to be found between preserving multi-scale behaviour and pixel-wise accuracy.
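To make the contrast between component-wise and vector-based losses concrete, here is a minimal pure-Python sketch on 2D velocity vectors. The exact loss definitions and weighting used in the paper are not given in the abstract; the functions `cosine_loss`, `magnitude_loss`, and `hybrid_loss`, and the weight `w_vec`, are assumptions made for illustration.

```python
import math


def mse_loss(pred, target):
    """Mean-square error over all vector components, each treated in isolation."""
    n = sum(len(t) for t in target)
    return sum((p - t) ** 2 for pv, tv in zip(pred, target) for p, t in zip(pv, tv)) / n


def cosine_loss(pred, target, eps=1e-12):
    """Mean of (1 - cosine similarity): penalises direction errors only."""
    total = 0.0
    for pv, tv in zip(pred, target):
        dot = sum(p * t for p, t in zip(pv, tv))
        norm_p = math.sqrt(sum(p * p for p in pv))
        norm_t = math.sqrt(sum(t * t for t in tv))
        total += 1.0 - dot / (norm_p * norm_t + eps)
    return total / len(target)


def magnitude_loss(pred, target):
    """Mean absolute difference of vector lengths: penalises speed errors only."""
    total = 0.0
    for pv, tv in zip(pred, target):
        total += abs(math.sqrt(sum(p * p for p in pv)) - math.sqrt(sum(t * t for t in tv)))
    return total / len(target)


def hybrid_loss(pred, target, w_vec=0.5):
    """Assumed weighting of the vector terms against MSE (w_vec is hypothetical)."""
    vec = cosine_loss(pred, target) + magnitude_loss(pred, target)
    return w_vec * vec + (1.0 - w_vec) * mse_loss(pred, target)


target = [(1.0, 0.0), (0.0, 2.0)]
rotated = [(0.0, 1.0), (-2.0, 0.0)]  # right speeds, directions off by 90 degrees
scaled = [(2.0, 0.0), (0.0, 4.0)]    # right directions, speeds doubled
```

The two test fields separate the error types: `rotated` is flagged by `cosine_loss` but not by `magnitude_loss`, while `scaled` shows the opposite pattern, a distinction that a single MSE number does not expose.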
[LG-103] Causal Clustering for Conditional Average Treatment Effects Estimation and Subgroup Discovery
链接: https://arxiv.org/abs/2509.05775
作者: Zilong Wang,Turgay Ayer,Shihao Yang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Pre-print for camera ready version for IEEE EMBS BHI 2025
Abstract:Estimating heterogeneous treatment effects is critical in domains such as personalized medicine, resource allocation, and policy evaluation. A central challenge lies in identifying subpopulations that respond differently to interventions, thereby enabling more targeted and effective decision-making. While clustering methods are well-studied in unsupervised learning, their integration with causal inference remains limited. We propose a novel framework that clusters individuals based on estimated treatment effects using a learned kernel derived from causal forests, revealing latent subgroup structures. Our approach consists of two main steps. First, we estimate debiased Conditional Average Treatment Effects (CATEs) using orthogonalized learners via the Robinson decomposition, yielding a kernel matrix that encodes sample-level similarities in treatment responsiveness. Second, we apply kernelized clustering to this matrix to uncover distinct, treatment-sensitive subpopulations and compute cluster-level average CATEs. We present this kernelized clustering step as a form of regularization within the residual-on-residual regression framework. Through extensive experiments on semi-synthetic and real-world datasets, supported by ablation studies and exploratory analyses, we demonstrate the effectiveness of our method in capturing meaningful treatment effect heterogeneity.
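The Robinson decomposition and residual-on-residual regression mentioned above have a standard form, sketched here under assumed notation (outcome $Y$, binary treatment $W$, covariates $X$, CATE $\tau$); the paper's kernelized clustering regularizer is not reproduced.

```latex
% Nuisance functions
m(x) = \mathbb{E}[Y \mid X = x], \qquad e(x) = \mathbb{E}[W \mid X = x]

% Robinson decomposition of the outcome
Y - m(X) = \tau(X)\,\bigl(W - e(X)\bigr) + \varepsilon,
\qquad \mathbb{E}[\varepsilon \mid X, W] = 0

% Residual-on-residual regression with estimated nuisances
\hat{\tau} = \arg\min_{\tau} \sum_{i=1}^{n}
  \Bigl[ \bigl(Y_i - \hat{m}(X_i)\bigr) - \tau(X_i)\,\bigl(W_i - \hat{e}(X_i)\bigr) \Bigr]^2
```

Regressing outcome residuals on treatment residuals removes the confounding that flows through $m$ and $e$, which is what makes the resulting CATE estimates orthogonalized.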
[LG-104] Risk-averse Fair Multi-class Classification
Link: https://arxiv.org/abs/2509.05771
Authors: Darinka Dentcheva,Xiangyu Tian
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
Abstract: We develop a new classification framework based on the theory of coherent risk measures and systemic risk. The proposed approach is suitable for multi-class problems when the data is noisy, scarce (relative to the dimension of the problem), and the labeling might be unreliable. In the first part of our paper, we provide the foundation for the use of systemic risk models and show how to apply them in the context of linear and kernel-based multi-class problems. A more advanced formulation via a system-theoretic approach with non-linear aggregation is proposed, which leads to a two-stage stochastic programming problem. A risk-averse regularized decomposition method is designed to solve the problem. We use a popular multi-class method as a benchmark in the performance analysis of the proposed classification methods. We illustrate our ideas by proposing several generalizations of that method through the use of coherent measures of risk. The viability of the proposed risk-averse methods is supported theoretically and numerically. Additionally, we demonstrate that the application of systemic risk measures facilitates enforcing fairness in classification. Analysis and experiments regarding the fairness of the proposed models are carefully conducted. For all methods, our numerical experiments demonstrate that they are robust in the presence of unreliable training data and perform better on unknown data than the methods minimizing expected classification errors. Furthermore, the performance improves when the number of classes increases.
[LG-105] Robust variational neural posterior estimation for simulation-based inference
Link: https://arxiv.org/abs/2509.05724
Authors: Matthew O’Callaghan,Kaisey S. Mandel,Gerry Gilmore
Subjects: Machine Learning (stat.ML); Astrophysics of Galaxies (astro-ph.GA); Machine Learning (cs.LG)
Comments: Main text: 16 pages, 6 figures
Abstract: Recent advances in neural density estimation have enabled powerful simulation-based inference (SBI) methods that can flexibly approximate Bayesian inference for intractable stochastic models. Although these methods have demonstrated reliable posterior estimation when the simulator accurately represents the underlying data generating process (DGP), recent work has shown that they perform poorly in the presence of model misspecification. This poses a significant problem for their use on real-world problems, because simulators always misrepresent the true DGP to a certain degree. In this paper, we introduce robust variational neural posterior estimation (RVNP), a method which addresses the problem of misspecification in amortised SBI by bridging the simulation-to-reality gap using variational inference and error modelling. We test RVNP on multiple benchmark tasks, including using real data from astronomy, and show that it can recover robust posterior inference in a data-driven manner without adopting tunable hyperparameters or priors governing the misspecification.
[LG-106] On detection probabilities of link invariants
Link: https://arxiv.org/abs/2509.05574
Authors: Abel Lacabanne,Daniel Tubbenhauer,Pedro Vaz,Victor L. Zhang
Subjects: Geometric Topology (math.GT); Machine Learning (cs.LG); Quantum Algebra (math.QA)
Comments: 16 pages, many figures, comments welcome
Abstract:We prove that the detection rate of n-crossing alternating links by link invariants insensitive to oriented mutation decays exponentially in n, implying that they detect alternating links with probability zero. This phenomenon applies broadly, in particular to quantum invariants such as the Jones or HOMFLYPT polynomials. We also use a big data approach to analyze several borderline cases (e.g. integral Khovanov or HOMFLYPT homologies), where our arguments almost, but not quite, apply, and we provide evidence that they too exhibit the same asymptotic behavior.
[LG-107] Cryo-EM as a Stochastic Inverse Problem
Link: https://arxiv.org/abs/2509.05541
Authors: Diego Sanchez Espinosa,Erik H Thiede,Yunan Yang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 25 pages, 8 figures
Abstract:Cryo-electron microscopy (Cryo-EM) enables high-resolution imaging of biomolecules, but structural heterogeneity remains a major challenge in 3D reconstruction. Traditional methods assume a discrete set of conformations, limiting their ability to recover continuous structural variability. In this work, we formulate cryo-EM reconstruction as a stochastic inverse problem (SIP) over probability measures, where the observed images are modeled as the push-forward of an unknown distribution over molecular structures via a random forward operator. We pose the reconstruction problem as the minimization of a variational discrepancy between observed and simulated image distributions, using statistical distances such as the KL divergence and the Maximum Mean Discrepancy. The resulting optimization is performed over the space of probability measures via a Wasserstein gradient flow, which we numerically solve using particles to represent and evolve conformational ensembles. We validate our approach using synthetic examples, including a realistic protein model, which demonstrates its ability to recover continuous distributions over structural states. We analyze the connection between our formulation and Maximum A Posteriori (MAP) approaches, which can be interpreted as instances of the discretize-then-optimize (DTO) framework. We further provide a consistency analysis, establishing conditions under which DTO methods, such as MAP estimation, converge to the solution of the underlying infinite-dimensional continuous problem. Beyond cryo-EM, the framework provides a general methodology for solving SIPs involving random forward operators.
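One of the statistical distances named in the abstract above, the Maximum Mean Discrepancy, can be estimated directly from two sets of samples. The sketch below is only an illustration under stated assumptions: it uses a Gaussian kernel with a hand-picked `bandwidth` and the simple biased estimator, and it says nothing about how the paper itself computes the discrepancy or evolves its particles.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as tuples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between sample sets xs and ys."""

    def mean_kernel(us, vs):
        # Average kernel value over all pairs (u, v)
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

The estimate is zero when both sample sets coincide and grows as the two empirical distributions move apart, which is what makes it usable as a discrepancy to minimize.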
[LG-108] Causal Multi-fidelity Surrogate Forward and Inverse Models for ICF Implosions
Link: https://arxiv.org/abs/2509.05510
Authors: Tyler E. Maltba,Ben S. Southworth,Jeffrey R. Haack,Marc L. Klasky
Subjects: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
Abstract:Continued progress in inertial confinement fusion (ICF) requires solving inverse problems relating experimental observations to simulation input parameters, followed by design optimization. However, such high dimensional dynamic PDE-constrained optimization problems are extremely challenging or even intractable. It has been recently shown that inverse problems can be solved by only considering certain robust features. Here we consider the ICF capsule’s deuterium-tritium (DT) interface, and construct a causal, dynamic, multifidelity reduced-order surrogate that maps from a time-dependent radiation temperature drive to the interface’s radius and velocity dynamics. The surrogate targets an ODE embedding of DT interface dynamics, and is constructed by learning a controller for a base analytical model using low- and high-fidelity simulation training data with respect to radiation energy group structure. After demonstrating excellent accuracy of the surrogate interface model, we use machine learning (ML) models with surrogate-generated data to solve inverse problems optimizing radiation temperature drive to reproduce observed interface dynamics. For sparse snapshots in time, the ML model further characterizes the most informative times at which to sample dynamics. Altogether we demonstrate how operator learning, causal architectures, and physical inductive bias can be integrated to accelerate discovery, design, and diagnostics in high-energy-density systems.
[LG-109] Self-Driving Laboratory Optimizes the Lower Critical Solution Temperature of Thermoresponsive Polymers
Link: https://arxiv.org/abs/2509.05351
Authors: Guoyue Xu,Renzheng Zhang,Tengfei Luo
Subjects: Soft Condensed Matter (cond-mat.soft); Machine Learning (cs.LG)
Abstract:To overcome the inherent inefficiencies of traditional trial-and-error materials discovery, the scientific community is increasingly developing autonomous laboratories that integrate data-driven decision-making into closed-loop experimental workflows. In this work, we realize this concept for thermoresponsive polymers by developing a low-cost, “frugal twin” platform for the optimization of the lower critical solution temperature (LCST) of poly(N-isopropylacrylamide) (PNIPAM). Our system integrates robotic fluid-handling, on-line sensors, and Bayesian optimization (BO) that navigates the multi-component salt solution spaces to achieve user-specified LCST targets. The platform demonstrates convergence to target properties within a minimal number of experiments. It strategically explores the parameter space, learns from informative “off-target” results, and self-corrects to achieve the final targets. By providing an accessible and adaptable blueprint, this work lowers the barrier to entry for autonomous experimentation and accelerates the design and discovery of functional polymers.
[LG-110] Predicting Brain Morphogenesis via Physics-Transfer Learning
Link: https://arxiv.org/abs/2509.05305
Authors: Yingjie Zhao,Yicheng Song,Fan Xu,Zhiping Xu
Subjects: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Pattern Formation and Solitons (nlin.PS)
Abstract:Brain morphology is shaped by genetic and mechanical factors and is linked to biological development and diseases. Its fractal-like features, regional anisotropy, and complex curvature distributions hinder quantitative insights in medical inspections. Recognizing that the underlying elastic instability and bifurcation share the same physics as simple geometries such as spheres and ellipses, we developed a physics-transfer learning framework to address the geometrical complexity. To overcome the challenge of data scarcity, we constructed a digital library of high-fidelity continuum mechanics modeling that both describes and predicts the developmental processes of brain growth and disease. The physics of nonlinear elasticity from simple geometries is embedded into a neural network and applied to brain models. This physics-transfer approach demonstrates remarkable performance in feature characterization and morphogenesis prediction, highlighting the pivotal role of localized deformation in dominating over the background geometry. The data-driven framework also provides a library of reduced-dimensional evolutionary representations that capture the essential physics of the highly folded cerebral cortex. Validation through medical images and domain expertise underscores the deployment of digital-twin technology in comprehending the morphological complexity of the brain.
Information Retrieval
[IR-0] UniSearch: Rethinking Search System with a Unified Generative Architecture
Link: https://arxiv.org/abs/2509.06887
Authors: Jiahui Chen,Xiaoze Jiang,Zhibo Wang,Quanzhi Zhu,Junyao Zhao,Feng Hu,Kang Pan,Ao Xie,Maohua Pei,Zhiheng Qin,Hongjing Zhang,Zhixin Zhai,Xiaobo Guo,Runbin Zhou,Kefeng Wang,Mingyang Geng,Cheng Chen,Jingshan Lv,Yupeng Huang,Xiao Liang,Han Li
Subjects: Information Retrieval (cs.IR)
Abstract:Modern search systems play a crucial role in facilitating information acquisition. Traditional search engines typically rely on a cascaded architecture, where results are retrieved through recall, pre-ranking, and ranking stages. The complexity of designing and maintaining multiple modules makes it difficult to achieve holistic performance gains. Recent advances in generative recommendation have motivated the exploration of unified generative search as an alternative. However, existing approaches are not genuinely end-to-end: they typically train an item encoder to tokenize candidates first and then optimize a generator separately, leading to objective inconsistency and limited generalization. To address these limitations, we propose UniSearch, a unified generative search framework for Kuaishou Search. UniSearch replaces the cascaded pipeline with an end-to-end architecture that integrates a Search Generator and a Video Encoder. The Generator produces semantic identifiers of relevant items given a user query, while the Video Encoder learns latent item embeddings and provides their tokenized representations. A unified training framework jointly optimizes both components, enabling mutual enhancement and improving representation quality and generation accuracy. Furthermore, we introduce Search Preference Optimization (SPO), which leverages a reward model and real user feedback to better align generation with user preferences. Extensive experiments on industrial-scale datasets, together with online A/B testing in both short-video and live search scenarios, demonstrate the strong effectiveness and deployment potential of UniSearch. Notably, its deployment in live search yields the largest single-experiment improvement in recent years of our product’s history, highlighting its practical value for real-world applications.
[IR-1] Unveiling the Listener Structure Underlying K-pop's Global Success: A Large-Scale Listening Data Analysis
Link: https://arxiv.org/abs/2509.06606
Authors: Ryota Nakamura,Keita Nishimoto,Ichiro Sakata,Kimitaka Asatani
Subjects: Social and Information Networks (cs.SI); Information Retrieval (cs.IR); Sound (cs.SD)
Abstract:From the mid-2000s to the 2010s, K-pop moved beyond its status as a regionally popular genre in Asia and established itself as a global music genre with enthusiastic fans around the world. However, little is known about how the vast number of music listeners across the globe have listened to and perceived K-pop. This study addresses this question by analyzing a large-scale listening dataset from this http URL. An analysis of the distribution of play counts reveals that K-pop experienced a significant increase in plays between 2005 and 2019, largely supported by a small group of heavy listeners. The Gini coefficient in play counts is notably greater than that of existing mainstream genres and other growing niche genres. Furthermore, an analysis based on user-assigned genre tags quantitatively demonstrates that between 2005 and 2010, K-pop shed its status as a local Asian genre and established itself as a distinct music genre in its own right.
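The Gini coefficient used in the abstract above to describe how concentrated plays are among listeners can be computed from a list of per-user play counts. The sketch below uses the standard sorted-rank formula; the function name is an assumption for illustration, and no data from the paper is reproduced.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = equal, near 1 = concentrated)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending-sorted values (ranks start at 1)
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n
```

A value close to (n - 1) / n means nearly all plays come from a single listener, while a value near 0 means plays are spread evenly.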
[IR-2] Reasoning-enhanced Query Understanding through Decomposition and Interpretation
Link: https://arxiv.org/abs/2509.06544
Authors: Yunfei Zhong,Jun Yang,Yixing Fan,Jiafeng Guo,Lixin Su,Maarten de Rijke,Ruqing Zhang,Dawei Yin,Xueqi Cheng
Subjects: Information Retrieval (cs.IR)
Abstract:Accurate inference of user intent is crucial for enhancing document retrieval in modern search engines. While large language models (LLMs) have made significant strides in this area, their effectiveness has predominantly been assessed with short, keyword-based queries. As AI-driven search evolves, long-form queries with intricate intents are becoming more prevalent, yet they remain underexplored in the context of LLM-based query understanding (QU). To bridge this gap, we introduce ReDI: a Reasoning-enhanced approach for query understanding through Decomposition and Interpretation. ReDI leverages the reasoning and comprehension capabilities of LLMs in a three-stage pipeline: (i) it breaks down complex queries into targeted sub-queries to accurately capture user intent; (ii) it enriches each sub-query with detailed semantic interpretations to improve the query-document matching; and (iii) it independently retrieves documents for each sub-query and employs a fusion strategy to aggregate the results for the final ranking. We compiled a large-scale dataset of real-world complex queries from a major search engine and distilled the query understanding capabilities of teacher models into smaller models for practical application. Experiments on BRIGHT and BEIR demonstrate that ReDI consistently surpasses strong baselines in both sparse and dense retrieval paradigms, affirming its effectiveness.
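Stage (iii) of the pipeline in the abstract above aggregates per-sub-query rankings into one final ranking. The abstract does not say which fusion strategy ReDI uses, so the sketch below swaps in reciprocal rank fusion (RRF) purely as a familiar example; the function name, the document IDs, and the constant `k=60` are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs into one ranking using RRF."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked highly in many lists accumulate larger scores
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; ties are broken by document ID for determinism
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings retrieved for three sub-queries of one complex query
fused = reciprocal_rank_fusion([
    ["a", "b", "c"],
    ["b", "c", "a"],
    ["b", "a", "c"],
])
```

Because RRF relies only on ranks, it can combine results from sparse and dense retrievers without calibrating their raw scores against each other.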
[IR-3] Rethinking LLM Parametric Knowledge as Post-retrieval Confidence for Dynamic Retrieval and Reranking
Link: https://arxiv.org/abs/2509.06472
Authors: Haoxiang Jin,Ronghan Li,Qiguang Miao,Zixiang Lu
Subjects: Information Retrieval (cs.IR)
Abstract: Large Language Models (LLMs) often generate inaccurate responses (hallucinations) when faced with questions beyond their knowledge scope. Retrieval-Augmented Generation (RAG) addresses this by leveraging external knowledge, but a critical challenge remains: determining whether retrieved contexts effectively enhance the model's ability to answer specific queries. This challenge underscores the importance of knowledge boundary awareness, which current methods, relying on discrete labels or limited signals, fail to address adequately, as they overlook the rich information in LLMs' continuous internal hidden states. To tackle this, we propose a novel post-retrieval knowledge filtering approach. First, we construct a confidence detection model based on LLMs' internal hidden states to quantify how retrieved contexts enhance the model's confidence. Using this model, we build a preference dataset (NQ_Rerank) to fine-tune a reranker, enabling it to prioritize contexts preferred by the downstream LLM during reranking. Additionally, we introduce Confidence-Based Dynamic Retrieval (CBDR), which adaptively triggers retrieval based on the LLM's initial confidence in the original question, reducing knowledge conflicts and improving efficiency. Experimental results demonstrate significant improvements in accuracy for context screening and end-to-end RAG performance, along with a notable reduction in retrieval costs while maintaining competitive accuracy.
[IR-4] AudioBoost: Increasing Audiobook Retrievability in Spotify Search with Synthetic Query Generation RECSYS25
Link: https://arxiv.org/abs/2509.06452
Authors: Enrico Palumbo,Gustavo Penha,Alva Liu,Marcus Eltscheminov,Jefferson Carvalho dos Santos,Alice Wang,Hugues Bouchard,Humberto Jesús Corona Pampin,Michelle Tran Luu
Subjects: Information Retrieval (cs.IR); Sound (cs.SD)
Comments: EARL Workshop @ RecSys25
Abstract: Spotify has recently introduced audiobooks as part of its catalog, complementing its music and podcast offering. Search is often the first entry point for users to access new items, and an important goal for Spotify is to support users in the exploration of the audiobook catalog. More specifically, we would like to enable users without a specific item in mind to broadly search by topic, genre, story tropes, decade, and discover audiobooks, authors and publishers they may like. To do this, we need to 1) inspire users to type more exploratory queries for audiobooks and 2) augment our retrieval systems to better deal with exploratory audiobook queries. This is challenging in a cold-start scenario, where we have a retrievability bias due to the small number of user interactions with audiobooks compared to previously available items such as music and podcast content. To address this, we propose AudioBoost, a system to boost audiobook retrievability in Spotify’s Search via synthetic query generation. AudioBoost leverages Large Language Models (LLMs) to generate synthetic queries conditioned on audiobook metadata. The synthetic queries are indexed both in the Query AutoComplete (QAC) and in the Search Retrieval engine to improve query formulation and retrieval at the same time. We show through offline evaluation that synthetic queries increase retrievability and are of high quality. Moreover, results from an online A/B test show that AudioBoost leads to +0.7% in audiobook impressions, +1.22% in audiobook clicks, and +1.82% in audiobook exploratory query completions.
[IR-5] Compare: A Framework for Scientific Comparisons CIKM2025
Link: https://arxiv.org/abs/2509.06412
Authors: Moritz Staudinger,Wojciech Kusa,Matteo Cancellieri,David Pride,Petr Knoth,Allan Hanbury
Subjects: Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Comments: Accepted at CIKM 2025
Abstract: Navigating the vast and rapidly increasing sea of academic publications to identify institutional synergies, benchmark research contributions and pinpoint key research contributions has become an increasingly daunting task, especially with the current exponential increase in new publications. Existing tools provide useful overviews or single-document insights, but none supports structured, qualitative comparisons across institutions or publications. To address this, we demonstrate Compare, a novel framework that tackles this challenge by enabling sophisticated long-context comparisons of scientific contributions. Compare empowers users to explore and analyze research overlaps and differences at both the institutional and publication granularity, all driven by user-defined questions and automatic retrieval over online resources. For this, we leverage Retrieval-Augmented Generation over evolving data sources to foster long-context knowledge synthesis. Unlike traditional scientometric tools, Compare goes beyond quantitative indicators by providing qualitative, citation-supported comparisons.
[IR-6] DISTRIBUTEDANN: Efficient Scaling of a Single DISKANN Graph Across Thousands of Computers
Link: https://arxiv.org/abs/2509.06046
Authors: Philip Adams,Menghao Li,Shi Zhang,Li Tan,Qi Chen,Mingqin Li,Zengzhong Li,Knut Risvik,Harsha Vardhan Simhadri
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
Abstract:We present DISTRIBUTEDANN, a distributed vector search service that makes it possible to search over a single 50 billion vector graph index spread across over a thousand machines that offers 26ms median query latency and processes over 100,000 queries per second. This is 6x more efficient than existing partitioning and routing strategies that route the vector query to a subset of partitions in a scale out vector search system. DISTRIBUTEDANN is built using two well-understood components: a distributed key-value store and an in-memory ANN index. DISTRIBUTEDANN has replaced conventional scale-out architectures for serving the Bing search engine, and we share our experience from making this transition.
[IR-7] A Survey of Real-World Recommender Systems: Challenges, Constraints, and Industrial Perspectives
Link: https://arxiv.org/abs/2509.06002
Authors: Kuan Zou,Aixin Sun
Subjects: Information Retrieval (cs.IR)
Comments: Working paper
Abstract:Recommender systems have generated tremendous value for both users and businesses, drawing significant attention from academia and industry alike. However, due to practical constraints, academic research remains largely confined to offline dataset optimizations, lacking access to real user data and large-scale recommendation platforms. This limitation reduces practical relevance, slows technological progress, and hampers a full understanding of the key challenges in recommender systems. In this survey, we provide a systematic review of industrial recommender systems and contrast them with their academic counterparts. We highlight key differences in data scale, real-time requirements, and evaluation methodologies, and we summarize major real-world recommendation scenarios along with their associated challenges. We then examine how industry practitioners address these challenges in Transaction-Oriented Recommender Systems and Content-Oriented Recommender Systems, a new classification grounded in item characteristics and recommendation objectives. Finally, we outline promising research directions, including the often-overlooked role of user decision-making, the integration of economic and psychological theories, and concrete suggestions for advancing academic research. Our goal is to enhance academia’s understanding of practical recommender systems, bridge the growing development gap, and foster stronger collaboration between industry and academia.
[IR-8] Toward Efficient and Scalable Design of In-Memory Graph-Based Vector Search SIGMOD2025 ICML2025
Link: https://arxiv.org/abs/2509.05750
Authors: Ilias Azizi,Karima Echihab,Themis Palpanas,Vassilis Christophides
Subjects: Information Retrieval (cs.IR); Databases (cs.DB); Data Structures and Algorithms (cs.DS); Performance (cs.PF)
Comments: Presented at ICML 2025 VecDB Workshop; an extended version appeared in ACM SIGMOD 2025 (‘Graph-Based Vector Search: An Experimental Evaluation of the State-of-the-Art’)
Abstract:Vector data is prevalent across business and scientific applications, and its popularity is growing with the proliferation of learned embeddings. Vector data collections often reach billions of vectors with thousands of dimensions, thus, increasing the complexity of their analysis. Vector search is the backbone of many critical analytical tasks, and graph-based methods have become the best choice for analytical tasks that do not require guarantees on the quality of the answers. Although several paradigms (seed selection, incremental insertion, neighborhood propagation, neighborhood diversification, and divide-and-conquer) have been employed to design in-memory graph-based vector search algorithms, a systematic comparison of the key algorithmic advances is still missing. We conduct an exhaustive experimental evaluation of twelve state-of-the-art methods on seven real data collections, with sizes up to 1 billion vectors. We share key insights about the strengths and limitations of these methods; e.g., the best approaches are typically based on incremental insertion and neighborhood diversification, and the choice of the base graph can hurt scalability. Finally, we discuss open research directions, such as the importance of devising more sophisticated data adaptive seed selection and diversification strategies.
[IR-9] LESER: Learning to Expand via Search Engine-feedback Reinforcement in e-Commerce
Link: https://arxiv.org/abs/2509.05570
Authors: Yipeng Zhang,Bowen Liu,Xiaoshuang Zhang,Aritra Mandal,Zhe Wu,Canran Xu
Subjects: Information Retrieval (cs.IR)
Abstract:User queries in e-commerce search are often vague, short, and underspecified, making it difficult for retrieval systems to match them accurately against structured product catalogs. This challenge is amplified by the one-to-many nature of user intent, where a single query can imply diverse and competing needs. Existing methods, including neural query expansion and prompting-based LLM approaches, fall short in real-world settings: they struggle to capture nuanced user intent, often generate outputs that violate platform constraints, and rely on workflows that are difficult to scale in production. We propose Learning to Expand via Search Engine-feedback Reinforcement (LESER), a novel framework that fine-tunes a context-aware LLM using real-time search engine feedback as supervision. LESER formulates query expansion as a retrieval optimization task and leverages Group Relative Policy Optimization to learn directly from relevance and coverage metrics. LESER is trained to reason over search results and produce high quality query expansions that align with platform rules and retrieval objectives. We evaluate LESER on large-scale, real-world e-commerce datasets, demonstrating substantial improvements in both offline and online settings. Our results show that LESER not only enhances semantic coverage and retrieval relevance but also delivers measurable gains in user engagement, making it a practical and scalable solution for modern search systems.
[IR-10] Knowledge-Augmented Relation Learning for Complementary Recommendation with Large Language Models
Link: https://arxiv.org/abs/2509.05564
Authors: Chihiro Yamasaki,Kai Sugahara,Kazushi Okamoto
Subjects: Information Retrieval (cs.IR)
Abstract: Complementary recommendations play a crucial role in e-commerce by enhancing user experience through suggestions of compatible items. Accurate classification of complementary item relationships requires reliable labels, but their creation presents a dilemma. Behavior-based labels are widely used because they can be easily generated from interaction logs; however, they often contain significant noise and lack reliability. While function-based labels (FBLs) provide high-quality definitions of complementary relationships by carefully articulating them based on item functions, their reliance on costly manual annotation severely limits a model’s ability to generalize to diverse items. To resolve this trade-off, we propose Knowledge-Augmented Relation Learning (KARL), a framework that strategically fuses active learning with large language models (LLMs). KARL efficiently expands a high-quality FBL dataset at a low cost by selectively sampling data points that the classifier finds the most difficult and uses the label extension of the LLM. Our experiments showed that in out-of-distribution (OOD) settings, an unexplored item feature space, KARL improved the baseline accuracy by up to 37%. In contrast, in in-distribution (ID) settings, the learned item feature space, the improvement was less than 0.5%, and prolonged learning could degrade accuracy. These contrasting results are due to the data diversity driven by KARL’s knowledge expansion, suggesting the need for a dynamic sampling strategy that adjusts diversity based on the prediction context (ID or OOD).